* CPU Hotplug rework @ 2012-03-19 14:44 Srivatsa S. Bhat 2012-03-19 14:48 ` Srivatsa S. Bhat 2012-03-19 23:42 ` Rusty Russell 0 siblings, 2 replies; 38+ messages in thread From: Srivatsa S. Bhat @ 2012-03-19 14:44 UTC (permalink / raw) To: Peter Zijlstra, Paul E. McKenney Cc: Arjan van de Ven, Steven Rostedt, rusty@rustcorp.com.au, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list Hi, There had been some discussion on CPU Hotplug redesign/rework some time ago, but it was buried under a thread with a different subject. (http://thread.gmane.org/gmane.linux.kernel/1246208/focus=1246404) So I am opening a new thread with an appropriate subject to discuss what needs to be done and how to go about it, as part of the rework. Peter Zijlstra and Paul McKenney had come up with TODO lists for the rework, and here are their extracts from the previous discussion: On Tue, Jan 31, 2012 at 02:01:56PM +0100, Peter Zijlstra wrote: > I paged out most details again, but it goes something like: > > - read and understand the current generic code > > - and all architecture code, at which point you'll probably boggle > at all the similarities that are all subtly different (there's > about 3 actually different ways in the arch code). > > - pick one, preferably one that keeps additional state and doesn't > fully rely on the online bits and pull it into generic code and > provide a small vector of arch specific functions. > > - convert all archs over. > > > Also related: > > - figure out why cpu_down needs kstopmachine, I'm not sure it does.. > we should be able to tear down a cpu using synchronize_sched() and a > single stop_one_cpu(). (someday when there's time I might actually > try to implement this). On 02/02/2012 06:03 AM, Paul E. 
McKenney wrote: > Currently, a number of the CPU_DYING notifiers assume that they are > running in stop-machine context, including those of RCU. > > However, this is not an inherent property of RCU -- DYNIX/ptx's > CPU-offline process did not stop the whole machine, after all, and RCU > (we called it rclock, but whatever) was happy with this arrangement. > In fact, if the outgoing CPU could be made to stop in that context > instead of returning to the scheduler and the idle loop, it would make > my life a bit easier. > > My question is why aren't the notifiers executed in the opposite > order going down and coming up, with the coming-up order matching the > boot order? Also, why can't the CPU's exit from this world be driven > out of the idle loop? That way, the CPU wouldn't mark itself offline > (thus in theory to be ignored by CPU), and then immediately dive into > the scheduler and who knows what all else, using RCU all the time. ;-) > > (RCU handles this by keeping a separate set of books for online CPUs. > It considers a CPU online at CPU_UP_PREPARE time, and doesn't consider > it offline until CPU_DEAD time. To handle the grace periods between, > force_quiescent_state() allows the grace period to run a few jiffies > before checking cpu_online_map, which allows a given CPU to safely use > RCU for at least one jiffy before marking itself online and for at least > one jiffy after marking itself offline.) On Fri, Feb 03, 2012 at 09:32:35AM -0800, Paul E. McKenney wrote: > Starting from the top, what does CPU hotplug need to do? > > 1. preempt_disable() or something similarly lightweight and > unconditional must block removal of any CPU that was > in cpu_online_map at the start of the "critical section". > (I will identify these as hotplug read-side critical sections.) > > I don't believe that there is any prohibition against a CPU > appearing suddenly, but some auditing would be required to > confirm this. But see below. > > 2. 
A subsystem not involved in the CPU-hotplug process must be able > to test if a CPU is online and be guaranteed that this test > remains valid (the CPU remains fully functional) for the duration > of the hotplug read-side critical section. > > 3. If a subsystem needs to operate on all currently online CPUs, > then it must participate in the CPU-hotplug process. My > belief is that if some code needs to test whether a CPU is > present, and needs an "offline" indication to persist, then > that code's subsystem must participate in CPU-hotplug operations. > > 4. There must be a way to register/unregister for CPU-hotplug events. > This is currently cpu_notifier(), register_cpu_notifier(), > and unregister_cpu_notifier(). > > n-1. CPU-hotplug operations should be reasonably fast. Tens of > milliseconds is OK, multiple seconds not so much. > > n. (Your additional constraints here.) > > How to do this? Here is one possible approach, probably full of holes: > > a. Maintain the cpu_online_map, as currently, but the meaning > of a set bit is that the CPU is fully functional. If there > is any service that the CPU no longer offers, its bit is > cleared. > > b. Continue to use preempt_enable()/preempt_disable() to mark > hotplug read-side critical sections. > > c. Instead of using __stop_machine(), use a per-CPU variable that > is checked in the idle loop. Possibly another TIF_ bit. > > d. The CPU notifiers are like today, except that CPU_DYING() is > invoked by the CPU after it sees that its per-CPU variable > telling it to go offline. As today, the CPU_DYING notifiers > are invoked with interrupts disabled, but other CPUs are still > running. Of course, the CPU_DYING notifiers need to be audited > and repaired. There are fewer than 20 of them, so not so bad. > RCU's is an easy fix: Just re-introduce locking and the global > RCU callback orphanage. My guesses for the others at the end. > > e. 
Getting rid of __stop_machine() means that the final step of the > CPU going offline will no longer be seen as atomic by other CPUs. > This will require more careful tracking of dependencies among > different subsystems. The required tracking can be reduced > by invoking notifiers in registration order for CPU-online > operations and invoking them in the reverse of registration > order for CPU-offline operations. > > For example, the scheduler uses RCU. If notifiers are invoked in > the same order for all CPU-hotplug operations, then on CPU-offline > operations, during the time between when RCU's notifier is invoked > and when the scheduler's notifier is invoked, the scheduler must > deal with a CPU on which RCU isn't working. (RCU currently > works around this by allowing a one-jiffy time period after > notification when it still pays attention to the CPU.) > > In contrast, if notifiers are invoked in reverse-registration > order for CPU-offline operations, then any time the scheduler > sees a CPU as online, RCU also is treating it as online. > > f. There will be some circular dependencies. For example, the > scheduler uses RCU, but in some configurations, RCU also uses > kthreads. These dependencies must be handled on a case-by-case > basis. For example, the scheduler could invoke an RCU API > to tell RCU when to shut down its per-CPU kthreads and when > to start them up. Or RCU could deal with its kthreads in the > CPU_DOWN_PREPARE and CPU_ONLINE notifiers. Either way, RCU > needs to correctly handle the interval when it cannot use > kthreads on a given CPU that it is still handling, for example, > by switching to running the RCU core code in softirq context. > > g. Most subsystems participating in CPU-hotplug operations will need > to keep their own copy of CPU online/offline state. For example, > RCU uses the ->qsmaskinit fields in the rcu_node structure for > this purpose. > > h. So CPU-offline handling looks something like the following: > > i. 
Acquire the hotplug mutex. > > ii. Invoke the CPU_DOWN_PREPARE notifiers. If there > are objections, invoke the CPU_DOWN_FAILED notifiers > and return an error. > > iii. Clear the CPU's bit in cpu_online_map. > > iv. Invoke synchronize_sched() to ensure that all future hotplug > read-side critical sections ignore the outgoing CPU. > > v. Set a per-CPU variable telling the CPU to take itself > offline. There would need to be something here to > help the CPU get to idle quickly, possibly requiring > another round of notifiers. CPU_DOWN? > > vi. When the dying CPU gets to the idle loop, it invokes the > CPU_DYING notifiers and updates its per-CPU variable to > indicate that it is ready to die. It then spins in a > tight loop (or does some other architecture-specified > operation to wait to be turned off). > > Note that there is no need for RCU to guess how long the > CPU might be executing RCU read-side critical sections. > > vii. When the task doing the offline operation sees the > updated per-CPU variable, it calls __cpu_die(). > > viii. The CPU_DEAD notifiers are invoked. > > ix. The check_for_tasks() function is invoked. > > x. Release the hotplug mutex. > > xi. Invoke the CPU_POST_DEAD notifiers. > > i. I do not believe that the CPU-online handling needs to change > much. > > > CPU_DYING notifiers as of 3.2: > > o vfp_hotplug(): I believe that this works as-is. > o s390_nohz_notify(): I believe that this works as-is. > o x86_pmu_notifier(): I believe that this works as-is. > o perf_ibs_cpu_notifier(): I don't know enough about > APIC to say. > o tboot_cpu_callback(): I believe that this works as-is, > but this one returns NOTIFY_BAD to a CPU_DYING notifier, > which is badness. But it looks like that case is a > "cannot happen" case. Still needs to be fixed. > o clockevents_notify(): This one acquires a global lock, > so it should be safe as-is. 
> o console_cpu_notify(): This one takes the same action > for CPU_ONLINE, CPU_DEAD, CPU_DOWN_FAILED, and > CPU_UP_CANCELLED that it does for CPU_DYING, so it > should be OK. > o rcu_cpu_notify(): This one needs adjustment as noted > above, but nothing major. > o migration_call(): I defer to Peter on this one. > It looks to me like it is written to handle other > CPUs, but... > o workqueue_cpu_callback(): Might need help, does a > non-atomic OR. > o kvm_cpu_hotplug(): Uses a global spinlock, so should > be OK as-is. Regards, Srivatsa S. Bhat ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CPU Hotplug rework 2012-03-19 14:44 CPU Hotplug rework Srivatsa S. Bhat @ 2012-03-19 14:48 ` Srivatsa S. Bhat 2012-03-20 11:28 ` Peter Zijlstra 2012-04-05 17:39 ` Paul E. McKenney 2012-03-19 23:42 ` Rusty Russell 1 sibling, 2 replies; 38+ messages in thread From: Srivatsa S. Bhat @ 2012-03-19 14:48 UTC (permalink / raw) To: Peter Zijlstra, Paul E. McKenney Cc: Arjan van de Ven, Steven Rostedt, rusty@rustcorp.com.au, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list On 03/19/2012 08:14 PM, Srivatsa S. Bhat wrote: > Hi, > > There had been some discussion on CPU Hotplug redesign/rework > some time ago, but it was buried under a thread with a different > subject. > (http://thread.gmane.org/gmane.linux.kernel/1246208/focus=1246404) > > So I am opening a new thread with an appropriate subject to discuss > what needs to be done and how to go about it, as part of the rework. > > Peter Zijlstra and Paul McKenney had come up with TODO lists for the > rework, and here are their extracts from the previous discussion: > > On Tue, Jan 31, 2012 at 02:01:56PM +0100, Peter Zijlstra wrote: >> I paged out most details again, but it goes something like: >> >> - read and understand the current generic code >> >> - and all architecture code, at which point you'll probably boggle >> at all the similarities that are all subtly different (there's >> about 3 actually different ways in the arch code). >> >> - pick one, preferably one that keeps additional state and doesn't >> fully rely on the online bits and pull it into generic code and >> provide a small vector of arch specific functions. >> >> - convert all archs over. >> >> >> Also related: >> >> - figure out why cpu_down needs kstopmachine, I'm not sure it does.. >> we should be able to tear down a cpu using synchronize_sched() and a >> single stop_one_cpu(). 
(someday when there's time I might actually >> try to implement this). > > > > On 02/02/2012 06:03 AM, Paul E. McKenney wrote: >> Currently, a number of the CPU_DYING notifiers assume that they are >> running in stop-machine context, including those of RCU. >> >> However, this is not an inherent property of RCU -- DYNIX/ptx's >> CPU-offline process did not stop the whole machine, after all, and RCU >> (we called it rclock, but whatever) was happy with this arrangement. >> In fact, if the outgoing CPU could be made to stop in that context >> instead of returning to the scheduler and the idle loop, it would make >> my life a bit easier. >> >> My question is why aren't the notifiers executed in the opposite >> order going down and coming up, with the coming-up order matching the >> boot order? Also, why can't the CPU's exit from this world be driven >> out of the idle loop? That way, the CPU wouldn't mark itself offline >> (thus in theory to be ignored by CPU), and then immediately dive into >> the scheduler and who knows what all else, using RCU all the time. ;-) >> >> (RCU handles this by keeping a separate set of books for online CPUs. >> It considers a CPU online at CPU_UP_PREPARE time, and doesn't consider >> it offline until CPU_DEAD time. To handle the grace periods between, >> force_quiescent_state() allows the grace period to run a few jiffies >> before checking cpu_online_map, which allows a given CPU to safely use >> RCU for at least one jiffy before marking itself online and for at least >> one jiffy after marking itself offline.) > > > > On Fri, Feb 03, 2012 at 09:32:35AM -0800, Paul E. McKenney wrote: > >> Starting from the top, what does CPU hotplug need to do? >> >> 1. preempt_disable() or something similarly lightweight and >> unconditional must block removal of any CPU that was >> in cpu_online_map at the start of the "critical section". >> (I will identify these as hotplug read-side critical sections.) 
>> >> I don't believe that there is any prohibition against a CPU >> appearing suddenly, but some auditing would be required to >> confirm this. But see below. >> >> 2. A subsystem not involved in the CPU-hotplug process must be able >> to test if a CPU is online and be guaranteed that this test >> remains valid (the CPU remains fully functional) for the duration >> of the hotplug read-side critical section. >> >> 3. If a subsystem needs to operate on all currently online CPUs, >> then it must participate in the CPU-hotplug process. My >> belief is that if some code needs to test whether a CPU is >> present, and needs an "offline" indication to persist, then >> that code's subsystem must participate in CPU-hotplug operations. >> >> 4. There must be a way to register/unregister for CPU-hotplug events. >> This is currently cpu_notifier(), register_cpu_notifier(), >> and unregister_cpu_notifier(). >> >> n-1. CPU-hotplug operations should be reasonably fast. Tens of >> milliseconds is OK, multiple seconds not so much. >> >> n. (Your additional constraints here.) >> >> How to do this? Here is one possible approach, probably full of holes: >> >> a. Maintain the cpu_online_map, as currently, but the meaning >> of a set bit is that the CPU is fully functional. If there >> is any service that the CPU no longer offers, its bit is >> cleared. >> >> b. Continue to use preempt_enable()/preempt_disable() to mark >> hotplug read-side critical sections. >> >> c. Instead of using __stop_machine(), use a per-CPU variable that >> is checked in the idle loop. Possibly another TIF_ bit. >> >> d. The CPU notifiers are like today, except that CPU_DYING() is >> invoked by the CPU after it sees that its per-CPU variable >> telling it to go offline. As today, the CPU_DYING notifiers >> are invoked with interrupts disabled, but other CPUs are still >> running. Of course, the CPU_DYING notifiers need to be audited >> and repaired. There are fewer than 20 of them, so not so bad. 
>> RCU's is an easy fix: Just re-introduce locking and the global >> RCU callback orphanage. My guesses for the others at the end. >> >> e. Getting rid of __stop_machine() means that the final step of the >> CPU going offline will no longer be seen as atomic by other CPUs. >> This will require more careful tracking of dependencies among >> different subsystems. The required tracking can be reduced >> by invoking notifiers in registration order for CPU-online >> operations and invoking them in the reverse of registration >> order for CPU-offline operations. >> >> For example, the scheduler uses RCU. If notifiers are invoked in >> the same order for all CPU-hotplug operations, then on CPU-offline >> operations, during the time between when RCU's notifier is invoked >> and when the scheduler's notifier is invoked, the scheduler must >> deal with a CPU on which RCU isn't working. (RCU currently >> works around this by allowing a one-jiffy time period after >> notification when it still pays attention to the CPU.) >> >> In contrast, if notifiers are invoked in reverse-registration >> order for CPU-offline operations, then any time the scheduler >> sees a CPU as online, RCU also is treating it as online. >> >> f. There will be some circular dependencies. For example, the >> scheduler uses RCU, but in some configurations, RCU also uses >> kthreads. These dependencies must be handled on a case-by-case >> basis. For example, the scheduler could invoke an RCU API >> to tell RCU when to shut down its per-CPU kthreads and when >> to start them up. Or RCU could deal with its kthreads in the >> CPU_DOWN_PREPARE and CPU_ONLINE notifiers. Either way, RCU >> needs to correctly handle the interval when it cannot use >> kthreads on a given CPU that it is still handling, for example, >> by switching to running the RCU core code in softirq context. >> >> g. Most subsystems participating in CPU-hotplug operations will need >> to keep their own copy of CPU online/offline state. 
For example, >> RCU uses the ->qsmaskinit fields in the rcu_node structure for >> this purpose. >> >> h. So CPU-offline handling looks something like the following: >> >> i. Acquire the hotplug mutex. >> >> ii. Invoke the CPU_DOWN_PREPARE notifiers. If there >> are objections, invoke the CPU_DOWN_FAILED notifiers >> and return an error. >> >> iii. Clear the CPU's bit in cpu_online_map. >> >> iv. Invoke synchronize_sched() to ensure that all future hotplug >> read-side critical sections ignore the outgoing CPU. >> >> v. Set a per-CPU variable telling the CPU to take itself >> offline. There would need to be something here to >> help the CPU get to idle quickly, possibly requiring >> another round of notifiers. CPU_DOWN? >> >> vi. When the dying CPU gets to the idle loop, it invokes the >> CPU_DYING notifiers and updates its per-CPU variable to >> indicate that it is ready to die. It then spins in a >> tight loop (or does some other architecture-specified >> operation to wait to be turned off). >> >> Note that there is no need for RCU to guess how long the >> CPU might be executing RCU read-side critical sections. >> >> vii. When the task doing the offline operation sees the >> updated per-CPU variable, it calls __cpu_die(). >> >> viii. The CPU_DEAD notifiers are invoked. >> >> ix. The check_for_tasks() function is invoked. >> >> x. Release the hotplug mutex. >> >> xi. Invoke the CPU_POST_DEAD notifiers. >> >> i. I do not believe that the CPU-online handling needs to change >> much. >> >> >> CPU_DYING notifiers as of 3.2: >> >> o vfp_hotplug(): I believe that this works as-is. >> o s390_nohz_notify(): I believe that this works as-is. >> o x86_pmu_notifier(): I believe that this works as-is. >> o perf_ibs_cpu_notifier(): I don't know enough about >> APIC to say. >> o tboot_cpu_callback(): I believe that this works as-is, >> but this one returns NOTIFY_BAD to a CPU_DYING notifier, >> which is badness. But it looks like that case is a >> "cannot happen" case. 
Still needs to be fixed. >> o clockevents_notify(): This one acquires a global lock, >> so it should be safe as-is. >> o console_cpu_notify(): This one takes the same action >> for CPU_ONLINE, CPU_DEAD, CPU_DOWN_FAILED, and >> CPU_UP_CANCELLED that it does for CPU_DYING, so it >> should be OK. >> o rcu_cpu_notify(): This one needs adjustment as noted >> above, but nothing major. >> o migration_call(): I defer to Peter on this one. >> It looks to me like it is written to handle other >> CPUs, but... >> o workqueue_cpu_callback(): Might need help, does a >> non-atomic OR. >> o kvm_cpu_hotplug(): Uses a global spinlock, so should >> be OK as-is. > > Additional things that I would like to add to the list: 1. Fix issues with CPU Hotplug callback registration. Currently there is no totally-race-free way to register callbacks and do setup for already online cpus. I had posted an incomplete patchset some time ago regarding this, which gives an idea of the direction I had in mind. http://thread.gmane.org/gmane.linux.kernel/1258880/focus=15826 2. There is a mismatch between the code and the documentation around the difference between [un/register]_hotcpu_notifier and [un/register]_cpu_notifier. And I remember seeing several places in the code that uses them inconsistently. Not terribly important, but good to fix it up while we are at it. 3. There was another thread where stuff related to CPU hotplug had been discussed. It had exposed some new challenges to CPU hotplug, if we were to support asynchronous smp booting. http://thread.gmane.org/gmane.linux.kernel/1246209/focus=48535 http://thread.gmane.org/gmane.linux.kernel/1246209/focus=48542 http://thread.gmane.org/gmane.linux.kernel/1246209/focus=1253241 http://thread.gmane.org/gmane.linux.kernel/1246209/focus=1253267 4. Because the current CPU offline code depends on stop_machine(), every online CPU must cooperate with the offline event. 
This means, whenever we do a preempt_disable(), it ensures not only that that particular CPU won't go offline, but also that *any* CPU cannot go offline. This is more like a side-effect of using stop_machine(). So when trying to move over to stop_one_cpu(), we have to carefully audit places where preempt_disable() has been used in that manner (ie., preempt_disable used as a light-weight and non-blocking form of get_online_cpus()). Because when we move to stop_one_cpu() to do CPU offline, a preempt disabled section will prevent only that particular CPU from going offline. I haven't audited preempt_disable() calls yet, but one such use was there in brlocks (include/linux/lglock.h) until quite recently. 5. Given the point above (#4), we might need a new way to disable CPU hotplug (at least CPU offline) of any CPU in a non-blocking manner, as a replacement for preempt disabled sections. Of course, if all the existing code just depends on the current CPU being online, then we are good, as it is. Else, we'll have to come up with something here.. (I was thinking on the lines of an rwlock being taken inside stop_one_cpu() before calling the CPU_DYING notifiers... Then the non-blocking code that needs to disable CPU offlining of any CPU, can grab this lock and prevent the offline event from proceeding). If there is anything I missed out, please feel free to add them here. And suggestions are of course, always welcome :-) Regards, Srivatsa S. Bhat ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CPU Hotplug rework 2012-03-19 14:48 ` Srivatsa S. Bhat @ 2012-03-20 11:28 ` Peter Zijlstra 2012-04-05 17:39 ` Paul E. McKenney 1 sibling, 0 replies; 38+ messages in thread From: Peter Zijlstra @ 2012-03-20 11:28 UTC (permalink / raw) To: Srivatsa S. Bhat Cc: Paul E. McKenney, Arjan van de Ven, Steven Rostedt, rusty@rustcorp.com.au, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list, linux-alpha, Mike Frysinger On Mon, 2012-03-19 at 20:18 +0530, Srivatsa S. Bhat wrote: > > If there is anything I missed out, please feel free to add them here. > And suggestions are of course, always welcome :-) > OK, so I haven't fully read your list, but looking through the code today I found that alpha and blackfin call CPU_STARTING after set_cpu_online(, true), whereas all the other archs call CPU_STARTING before. Aside from that probably wanting to get fixed (not sure it actually breaks anything atm), this is another reason most of that wants to be generic code. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CPU Hotplug rework 2012-03-19 14:48 ` Srivatsa S. Bhat 2012-03-20 11:28 ` Peter Zijlstra @ 2012-04-05 17:39 ` Paul E. McKenney 2012-04-05 17:55 ` Paul E. McKenney 2012-04-06 19:52 ` Srivatsa S. Bhat 1 sibling, 2 replies; 38+ messages in thread From: Paul E. McKenney @ 2012-04-05 17:39 UTC (permalink / raw) To: Srivatsa S. Bhat Cc: Peter Zijlstra, Arjan van de Ven, Steven Rostedt, rusty@rustcorp.com.au, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list On Mon, Mar 19, 2012 at 08:18:42PM +0530, Srivatsa S. Bhat wrote: > On 03/19/2012 08:14 PM, Srivatsa S. Bhat wrote: > > > Hi, > > > > There had been some discussion on CPU Hotplug redesign/rework > > some time ago, but it was buried under a thread with a different > > subject. > > (http://thread.gmane.org/gmane.linux.kernel/1246208/focus=1246404) > > > > So I am opening a new thread with an appropriate subject to discuss > > what needs to be done and how to go about it, as part of the rework. > > > > Peter Zijlstra and Paul McKenney had come up with TODO lists for the > > rework, and here are their extracts from the previous discussion: Finally getting around to looking at this in more detail... > Additional things that I would like to add to the list: > > 1. Fix issues with CPU Hotplug callback registration. Currently there > is no totally-race-free way to register callbacks and do setup > for already online cpus. > > I had posted an incomplete patchset some time ago regarding this, > which gives an idea of the direction I had in mind. > http://thread.gmane.org/gmane.linux.kernel/1258880/focus=15826 Another approach is to have the registration function return the CPU mask corresponding to the instant at which registration occurred, perhaps via an additional function argument that points to a cpumask_var_t that can be NULL if you don't care. 
Then you can do setup for the CPUs indicated in the mask. Or am I missing the race you had in mind? Or is the point to make sure that the notifiers execute in order? > 2. There is a mismatch between the code and the documentation around > the difference between [un/register]_hotcpu_notifier and > [un/register]_cpu_notifier. And I remember seeing several places in > the code that uses them inconsistently. Not terribly important, but > good to fix it up while we are at it. The following lead me to believe that they were the same: #define register_hotcpu_notifier(nb) register_cpu_notifier(nb) #define unregister_hotcpu_notifier(nb) unregister_cpu_notifier(nb) What am I missing here? > 3. There was another thread where stuff related to CPU hotplug had been > discussed. It had exposed some new challenges to CPU hotplug, if we > were to support asynchronous smp booting. > > http://thread.gmane.org/gmane.linux.kernel/1246209/focus=48535 > http://thread.gmane.org/gmane.linux.kernel/1246209/focus=48542 > http://thread.gmane.org/gmane.linux.kernel/1246209/focus=1253241 > http://thread.gmane.org/gmane.linux.kernel/1246209/focus=1253267 Good points! ;-) > 4. Because the current CPU offline code depends on stop_machine(), every > online CPU must cooperate with the offline event. This means, whenever > we do a preempt_disable(), it ensures not only that that particular > CPU won't go offline, but also that *any* CPU cannot go offline. This > is more like a side-effect of using stop_machine(). > > So when trying to move over to stop_one_cpu(), we have to carefully audit > places where preempt_disable() has been used in that manner (ie., > preempt_disable used as a light-weight and non-blocking form of > get_online_cpus()). Because when we move to stop_one_cpu() to do CPU offline, > a preempt disabled section will prevent only that particular CPU from > going offline. 
> > I haven't audited preempt_disable() calls yet, but one such use was there > in brlocks (include/linux/lglock.h) until quite recently. I was thinking in terms of the offline code path doing a synchronize_sched() to allow preempt_disable() to retain a reasonable approximation of its current semantics. This would require a pair of CPU masks, one for code using CPU-based primitives (e.g., sending IPIs) and another for code implementing those primitives. Or is there some situation that makes this approach fail? Thanx, Paul ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CPU Hotplug rework 2012-04-05 17:39 ` Paul E. McKenney @ 2012-04-05 17:55 ` Paul E. McKenney 2012-04-05 23:06 ` Paul E. McKenney 2012-04-06 19:52 ` Srivatsa S. Bhat 1 sibling, 1 reply; 38+ messages in thread From: Paul E. McKenney @ 2012-04-05 17:55 UTC (permalink / raw) To: Srivatsa S. Bhat Cc: Peter Zijlstra, Arjan van de Ven, Steven Rostedt, rusty@rustcorp.com.au, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list On Thu, Apr 05, 2012 at 10:39:18AM -0700, Paul E. McKenney wrote: > On Mon, Mar 19, 2012 at 08:18:42PM +0530, Srivatsa S. Bhat wrote: > > On 03/19/2012 08:14 PM, Srivatsa S. Bhat wrote: > > > > > Hi, > > > > > > There had been some discussion on CPU Hotplug redesign/rework > > > some time ago, but it was buried under a thread with a different > > > subject. > > > (http://thread.gmane.org/gmane.linux.kernel/1246208/focus=1246404) > > > > > > So I am opening a new thread with an appropriate subject to discuss > > > what needs to be done and how to go about it, as part of the rework. > > > > > > Peter Zijlstra and Paul McKenney had come up with TODO lists for the > > > rework, and here are their extracts from the previous discussion: > > Finally getting around to looking at this in more detail... > > > Additional things that I would like to add to the list: > > > > 1. Fix issues with CPU Hotplug callback registration. Currently there > > is no totally-race-free way to register callbacks and do setup > > for already online cpus. > > > > I had posted an incomplete patchset some time ago regarding this, > > which gives an idea of the direction I had in mind. 
> > http://thread.gmane.org/gmane.linux.kernel/1258880/focus=15826 > > Another approach is to have the registration function return the > CPU mask corresponding to the instant at which registration occurred, > perhaps via an additional function argument that points to a > cpumask_var_t that can be NULL if you don't care. Then you can > do setup for the CPUs indicated in the mask. > > Or am I missing the race you had in mind? Or is the point to make > sure that the notifiers execute in order? > > > 2. There is a mismatch between the code and the documentation around > > the difference between [un/register]_hotcpu_notifier and > > [un/register]_cpu_notifier. And I remember seeing several places in > > the code that uses them inconsistently. Not terribly important, but > > good to fix it up while we are at it. > > The following lead me to believe that they were the same: > > #define register_hotcpu_notifier(nb) register_cpu_notifier(nb) > #define unregister_hotcpu_notifier(nb) unregister_cpu_notifier(nb) > > What am I missing here? > > > 3. There was another thread where stuff related to CPU hotplug had been > > discussed. It had exposed some new challenges to CPU hotplug, if we > > were to support asynchronous smp booting. > > > > http://thread.gmane.org/gmane.linux.kernel/1246209/focus=48535 > > http://thread.gmane.org/gmane.linux.kernel/1246209/focus=48542 > > http://thread.gmane.org/gmane.linux.kernel/1246209/focus=1253241 > > http://thread.gmane.org/gmane.linux.kernel/1246209/focus=1253267 > > Good points! ;-) > > > 4. Because the current CPU offline code depends on stop_machine(), every > > online CPU must cooperate with the offline event. This means, whenever > > we do a preempt_disable(), it ensures not only that that particular > > CPU won't go offline, but also that *any* CPU cannot go offline. This > > is more like a side-effect of using stop_machine(). 
> > > > So when trying to move over to stop_one_cpu(), we have to carefully audit > > places where preempt_disable() has been used in that manner (ie., > > preempt_disable used as a light-weight and non-blocking form of > > get_online_cpus()). Because when we move to stop_one_cpu() to do CPU offline, > > a preempt disabled section will prevent only that particular CPU from > > going offline. > > > > I haven't audited preempt_disable() calls yet, but one such use was there > > in brlocks (include/linux/lglock.h) until quite recently. > > I was thinking in terms of the offline code path doing a synchronize_sched() > to allow preempt_disable() to retain a reasonable approximation of its > current semantics. This would require a pair of CPU masks, one for code > using CPU-based primitives (e.g., sending IPIs) and another for code > implementing those primitives. > > Or is there some situation that makes this approach fail? Hmmm... I suppose that -rt's use of migrate_disable() needs some other approach in any case, unless -rt's offlining waits for all pre-existing migrate_disable() sections to finish. Thanx, Paul ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CPU Hotplug rework 2012-04-05 17:55 ` Paul E. McKenney @ 2012-04-05 23:06 ` Paul E. McKenney 2012-04-06 20:15 ` Srivatsa S. Bhat 0 siblings, 1 reply; 38+ messages in thread From: Paul E. McKenney @ 2012-04-05 23:06 UTC (permalink / raw) To: Srivatsa S. Bhat Cc: Peter Zijlstra, Arjan van de Ven, Steven Rostedt, rusty@rustcorp.com.au, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list Hello, Here is my attempt at a summary of the discussion. Srivatsa, I left out the preempt_disable() pieces, but would be happy to add them in when you let me know what you are thinking to do for de-stop_machine()ing CPU hotplug. Thanx, Paul ------------------------------------------------------------------------ CPU-hotplug work breakout: 1. Read and understand the current generic code. Srivatsa Bhat has done this, as have Paul E. McKenney and Peter Zijlstra to a lesser extent. 2. Read and understand the architecture-specific code, looking for opportunities to consolidate additional function into core code. a. Carry out any indicated consolidation. b. Convert all architectures to make use of the consolidated implementation. Not started. Low priority from a big.LITTLE perspective. 3. Address the current kthread creation/teardown/migration performance issues. (More details below.) Highest priority from a big.LITTLE perspective. 4. Wean CPU-hotplug offlining from stop_machine(). (More details below.) Moderate priority from a big.LITTLE perspective. ADDRESSING KTHREAD CREATION/TEARDOWN/MIGRATION PERFORMANCE ISSUES 1. Evaluate approaches. Approaches currently under consideration include: a. Park the kthreads rather than tearing them down or migrating them. RCU currently takes this sort of approach. Note that RCU currently relies on both preempt_disable() and local_bh_disable() blocking the current CPU from going offline. b. 
Allow in-kernel kthreads to avoid the delay required to work around a bug in old versions of bash. (This bug is a failure to expect receiving a SIGCHILD signal corresponding to a child created by a fork() system call that has not yet returned.) This might be implemented using an additional CLONE_ flag. This should allow kthreads to be created and torn down much more quickly. c. Have some other TBD way to "freeze" a kthread. (As in "your clever idea here".) 2. Implement the chosen approach or approaches. (Different kernel subsystems might have different constraints, possibly requiring different kthread handling.) WEAN CPU-HOTPLUG OFFLINING FROM stop_machine() 1. CPU_DYING notifier fixes needed as of 3.2: o vfp_hotplug(): I believe that this works as-is. o s390_nohz_notify(): I believe that this works as-is. o x86_pmu_notifier(): I believe that this works as-is. o perf_ibs_cpu_notifier(): I don't know enough about APIC to say. o tboot_cpu_callback(): I believe that this works as-is, but this one returns NOTIFY_BAD to a CPU_DYING notifier, which is badness. But it looks like that case is a "cannot happen" case. Still needs to be fixed. o clockevents_notify(): This one acquires a global lock, so it should be safe as-is. o console_cpu_notify(): This one takes the same action for CPU_ONLINE, CPU_DEAD, CPU_DOWN_FAILED, and CPU_UP_CANCELLED that it does for CPU_DYING, so it should be OK. * rcu_cpu_notify(): This one needs adjustment as noted above, but nothing major. Patch has been posted, probably needs a bit of debugging. o migration_call(): I defer to Peter on this one. It looks to me like it is written to handle other CPUs, but... * workqueue_cpu_callback(): Might need help, does a non-atomic OR. o kvm_cpu_hotplug(): Uses a global spinlock, so should be OK as-is. 2. Evaluate designs for stop_machine()-free CPU hotplug. Implement the chosen design. An outline for a particular design is shown below, but the actual design might be quite different. 3. 
Fix issues with CPU Hotplug callback registration. Currently there is no totally-race-free way to register callbacks and do setup for already online cpus. Srivatsa had posted an incomplete patchset some time ago regarding this, which gives an idea of the direction he had in mind. http://thread.gmane.org/gmane.linux.kernel/1258880/focus=15826 4. There is a mismatch between the code and the documentation around the difference between [un/register]_hotcpu_notifier and [un/register]_cpu_notifier. And I remember seeing several places in the code that uses them inconsistently. Not terribly important, but good to fix it up while we are at it. 5. There was another thread where stuff related to CPU hotplug had been discussed. It had exposed some new challenges to CPU hotplug, if we were to support asynchronous smp booting. http://thread.gmane.org/gmane.linux.kernel/1246209/focus=48535 http://thread.gmane.org/gmane.linux.kernel/1246209/focus=48542 http://thread.gmane.org/gmane.linux.kernel/1246209/focus=1253241 http://thread.gmane.org/gmane.linux.kernel/1246209/focus=1253267 6. If preempt_disable() no longer blocks CPU offlining, then uses of preempt_disable() in the kernel need to be inspected to see which are relying on blocking offlining, and any identified will need adjustment. DRAFT REQUIREMENTS FOR stop_machine()-FREE CPU HOTPLUG 1. preempt_disable() or something similarly lightweight and unconditional must block removal of any CPU that was in cpu_online_map at the start of the "critical section". (I will identify these as hotplug read-side critical sections.) I don't believe that there is any prohibition against a CPU appearing suddenly, but some auditing would be required to confirm this. But see below. 2. A subsystem not involved in the CPU-hotplug process must be able to test if a CPU is online and be guaranteed that this test remains valid (the CPU remains fully functional) for the duration of the hotplug read-side critical section. 3. 
If a subsystem needs to operate on all currently online CPUs, then it must participate in the CPU-hotplug process. My belief is that if some code needs to test whether a CPU is present, and needs an "offline" indication to persist, then that code's subsystem must participate in CPU-hotplug operations. 4. There must be a way to register/unregister for CPU-hotplug events. This is currently cpu_notifier(), register_cpu_notifier(), and unregister_cpu_notifier(). n-1. CPU-hotplug operations should be reasonably fast. A few milliseconds is OK, multiple seconds not so much. n. (Your additional constraints here.) STRAWMAN DESIGN FOR stop_machine()-FREE CPU HOTPLUG a. Maintain the cpu_online_map, as currently, but the meaning of a set bit is that the CPU is fully functional. If there is any service that the CPU no longer offers, its bit is cleared. b. Continue to use preempt_enable()/preempt_disable() to mark hotplug read-side critical sections. c. Instead of using __stop_machine(), use a per-CPU variable that is checked in the idle loop. Possibly another TIF_ bit. d. The CPU notifiers are like today, except that CPU_DYING() is invoked by the CPU after it sees that its per-CPU variable telling it to go offline. As today, the CPU_DYING notifiers are invoked with interrupts disabled, but other CPUs are still running. Of course, the CPU_DYING notifiers need to be audited and repaired. There are fewer than 20 of them, so not so bad. RCU's is an easy fix: Just re-introduce locking and the global RCU callback orphanage. My guesses for the others at the end. e. Getting rid of __stop_machine() means that the final step of the CPU going offline will no longer be seen as atomic by other CPUs. This will require more careful tracking of dependencies among different subsystems. The required tracking can be reduced by invoking notifiers in registration order for CPU-online operations and invoking them in the reverse of registration order for CPU-offline operations. 
For example, the scheduler uses RCU. If notifiers are invoked in the same order for all CPU-hotplug operations, then on CPU-offline operations, during the time between when RCU's notifier is invoked and when the scheduler's notifier is invoked, the scheduler must deal with a CPU on which RCU isn't working. (RCU currently works around this by allowing a one-jiffy time period after notification when it still pays attention to the CPU.) In contrast, if notifiers are invoked in reverse-registration order for CPU-offline operations, then any time the scheduler sees a CPU as online, RCU also is treating it as online. f. There will be some circular dependencies. For example, the scheduler uses RCU, but in some configurations, RCU also uses kthreads. These dependencies must be handled on a case-by-case basis. For example, the scheduler could invoke an RCU API to tell RCU when to shut down its per-CPU kthreads and when to start them up. Or RCU could deal with its kthreads in the CPU_DOWN_PREPARE and CPU_ONLINE notifiers. Either way, RCU needs to correctly handle the interval when it cannot use kthreads on a given CPU that it is still handling, for example, by switching to running the RCU core code in softirq context. g. Most subsystems participating in CPU-hotplug operations will need to keep their own copy of CPU online/offline state. For example, RCU uses the ->qsmaskinit fields in the rcu_node structure for this purpose. h. So CPU-offline handling looks something like the following: i. Acquire the hotplug mutex. ii. Invoke the CPU_DOWN_PREPARE notifiers. If there are objections, invoke the CPU_DOWN_FAILED notifiers and return an error. iii. Clear the CPU's bit in cpu_online_map. iv. Invoke synchronize_sched() to ensure that all future hotplug read-side critical sections ignore the outgoing CPU. v. Set a per-CPU variable telling the CPU to take itself offline. 
There would need to be something here to help the CPU get to idle quickly, possibly requiring another round of notifiers. CPU_DOWN? vi. When the dying CPU gets to the idle loop, it invokes the CPU_DYING notifiers and updates its per-CPU variable to indicate that it is ready to die. It then spins in a tight loop (or does some other architecture-specified operation to wait to be turned off). Note that there is no need for RCU to guess how long the CPU might be executing RCU read-side critical sections. vii. When the task doing the offline operation sees the updated per-CPU variable, it calls __cpu_die(). viii. The CPU_DEAD notifiers are invoked. ix. The check_for_tasks() function is invoked. x. Release the hotplug mutex. xi. Invoke the CPU_POST_DEAD notifiers. i. I do not believe that the CPU-online handling needs to change much. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CPU Hotplug rework 2012-04-05 23:06 ` Paul E. McKenney @ 2012-04-06 20:15 ` Srivatsa S. Bhat 2012-04-09 16:46 ` Paul E. McKenney 0 siblings, 1 reply; 38+ messages in thread From: Srivatsa S. Bhat @ 2012-04-06 20:15 UTC (permalink / raw) To: paulmck Cc: Peter Zijlstra, Arjan van de Ven, Steven Rostedt, rusty@rustcorp.com.au, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list, nikunj On 04/06/2012 04:36 AM, Paul E. McKenney wrote: > Hello, > > Here is my attempt at a summary of the discussion. > Thanks for the summary, it is really helpful :-) > Srivatsa, I left out the preempt_disable() pieces, but would be happy > to add them in when you let me know what you are thinking to do for > de-stop_machine()ing CPU hotplug. > Ok.. > Thanx, Paul > > ------------------------------------------------------------------------ > > CPU-hotplug work breakout: > > 1. Read and understand the current generic code. > Srivatsa Bhat has done this, as have Paul E. McKenney and > Peter Zijlstra to a lesser extent. "lesser extent"?? Hell no! :-) ;-) > > 2. Read and understand the architecture-specific code, looking > for opportunities to consolidate additional function into > core code. > > a. Carry out any indicated consolidation. > > b. Convert all architectures to make use of the > consolidated implementation. > > Not started. Low priority from a big.LITTLE perspective. Recently this unexpectedly assumed high priority due to some scheduler changes and things got fixed up temporarily. And in that context, Peter Zijlstra gave some more technical pointers on what is wrong and needs to be done right. Link: https://lkml.org/lkml/2012/3/22/149 Nikunj (in CC) has offered to work with me on this consolidation. > > 3. Address the current kthread creation/teardown/migration > performance issues. (More details below.) > > Highest priority from a big.LITTLE perspective. 
> > 4. Wean CPU-hotplug offlining from stop_machine(). > (More details below.) > > Moderate priority from a big.LITTLE perspective. > > > ADDRESSING KTHREAD CREATION/TEARDOWN/MIGRATION PERFORMANCE ISSUES > > 1. Evaluate approaches. Approaches currently under > consideration include: > > a. Park the kthreads rather than tearing them down or > migrating them. RCU currently takes this sort of > approach. Note that RCU currently relies on both > preempt_disable() and local_bh_disable() blocking the > current CPU from going offline. > > b. Allow in-kernel kthreads to avoid the delay > required to work around a bug in old versions of > bash. (This bug is a failure to expect receiving > a SIGCHILD signal corresponding to a child > created by a fork() system call that has not yet > returned.) > > This might be implemented using an additional > CLONE_ flag. This should allow kthreads to > be created and torn down much more quickly. > > c. Have some other TBD way to "freeze" a kthread. > (As in "your clever idea here".) > > 2. Implement the chosen approach or approaches. (Different > kernel subsystems might have different constraints, possibly > requiring different kthread handling.) > > > WEAN CPU-HOTPLUG OFFLINING FROM stop_machine() > > > 1. CPU_DYING notifier fixes needed as of 3.2: > > o vfp_hotplug(): I believe that this works as-is. > o s390_nohz_notify(): I believe that this works as-is. > o x86_pmu_notifier(): I believe that this works as-is. > o perf_ibs_cpu_notifier(): I don't know enough about > APIC to say. > o tboot_cpu_callback(): I believe that this works as-is, > but this one returns NOTIFY_BAD to a CPU_DYING notifier, > which is badness. But it looks like that case is a > "cannot happen" case. Still needs to be fixed. > o clockevents_notify(): This one acquires a global lock, > so it should be safe as-is. 
> o console_cpu_notify(): This one takes the same action > for CPU_ONLINE, CPU_DEAD, CPU_DOWN_FAILED, and > CPU_UP_CANCELLED that it does for CPU_DYING, so it > should be OK. > * rcu_cpu_notify(): This one needs adjustment as noted > above, but nothing major. Patch has been posted, > probably needs a bit of debugging. > o migration_call(): I defer to Peter on this one. > It looks to me like it is written to handle other > CPUs, but... > * workqueue_cpu_callback(): Might need help, does a > non-atomic OR. > o kvm_cpu_hotplug(): Uses a global spinlock, so should > be OK as-is. > > 2. Evaluate designs for stop_machine()-free CPU hotplug. > Implement the chosen design. An outline for a particular > design is shown below, but the actual design might be > quite different. > > 3. Fix issues with CPU Hotplug callback registration. Currently > there is no totally-race-free way to register callbacks and do > setup for already online cpus. > > Srivatsa had posted an incomplete patchset some time ago > regarding this, which gives an idea of the direction he had > in mind. > http://thread.gmane.org/gmane.linux.kernel/1258880/focus=15826 > Gah, this has been "incomplete" for quite some time now.. I'll try to speed up things a bit :-) > 4. There is a mismatch between the code and the documentation around > the difference between [un/register]_hotcpu_notifier and > [un/register]_cpu_notifier. And I remember seeing several places > in the code that uses them inconsistently. Not terribly important, > but good to fix it up while we are at it. > > 5. There was another thread where stuff related to CPU hotplug had > been discussed. It had exposed some new challenges to CPU hotplug, > if we were to support asynchronous smp booting. 
> > http://thread.gmane.org/gmane.linux.kernel/1246209/focus=48535 > http://thread.gmane.org/gmane.linux.kernel/1246209/focus=48542 > http://thread.gmane.org/gmane.linux.kernel/1246209/focus=1253241 > http://thread.gmane.org/gmane.linux.kernel/1246209/focus=1253267 > > 6. If preempt_disable() no longer blocks CPU offlining, then > uses of preempt_disable() in the kernel need to be inspected > to see which are relying on blocking offlining, and any > identified will need adjustment. > > > DRAFT REQUIREMENTS FOR stop_machine()-FREE CPU HOTPLUG > > 1. preempt_disable() or something similarly lightweight and > unconditional must block removal of any CPU that was > in cpu_online_map at the start of the "critical section". > (I will identify these as hotplug read-side critical sections.) > > I don't believe that there is any prohibition against a CPU > appearing suddenly, but some auditing would be required to > confirm this. But see below. > > 2. A subsystem not involved in the CPU-hotplug process must be able > to test if a CPU is online and be guaranteed that this test > remains valid (the CPU remains fully functional) for the duration > of the hotplug read-side critical section. > > 3. If a subsystem needs to operate on all currently online CPUs, > then it must participate in the CPU-hotplug process. My > belief is that if some code needs to test whether a CPU is > present, and needs an "offline" indication to persist, then > that code's subsystem must participate in CPU-hotplug operations. > > 4. There must be a way to register/unregister for CPU-hotplug events. > This is currently cpu_notifier(), register_cpu_notifier(), > and unregister_cpu_notifier(). > > n-1. CPU-hotplug operations should be reasonably fast. A few > milliseconds is OK, multiple seconds not so much. > > n. (Your additional constraints here.) > > > STRAWMAN DESIGN FOR stop_machine()-FREE CPU HOTPLUG > > a. 
Maintain the cpu_online_map, as currently, but the meaning > of a set bit is that the CPU is fully functional. If there > is any service that the CPU no longer offers, its bit is > cleared. > > b. Continue to use preempt_enable()/preempt_disable() to mark > hotplug read-side critical sections. > > c. Instead of using __stop_machine(), use a per-CPU variable that > is checked in the idle loop. Possibly another TIF_ bit. > > d. The CPU notifiers are like today, except that CPU_DYING() is > invoked by the CPU after it sees that its per-CPU variable > telling it to go offline. As today, the CPU_DYING notifiers > are invoked with interrupts disabled, but other CPUs are still > running. Of course, the CPU_DYING notifiers need to be audited > and repaired. There are fewer than 20 of them, so not so bad. > RCU's is an easy fix: Just re-introduce locking and the global > RCU callback orphanage. My guesses for the others at the end. > > e. Getting rid of __stop_machine() means that the final step of the > CPU going offline will no longer be seen as atomic by other CPUs. > This will require more careful tracking of dependencies among > different subsystems. The required tracking can be reduced > by invoking notifiers in registration order for CPU-online > operations and invoking them in the reverse of registration > order for CPU-offline operations. > > For example, the scheduler uses RCU. If notifiers are invoked in > the same order for all CPU-hotplug operations, then on CPU-offline > operations, during the time between when RCU's notifier is invoked > and when the scheduler's notifier is invoked, the scheduler must > deal with a CPU on which RCU isn't working. (RCU currently > works around this by allowing a one-jiffy time period after > notification when it still pays attention to the CPU.) 
> > In contrast, if notifiers are invoked in reverse-registration > order for CPU-offline operations, then any time the scheduler > sees a CPU as online, RCU also is treating it as online. > > f. There will be some circular dependencies. For example, the > scheduler uses RCU, but in some configurations, RCU also uses > kthreads. These dependencies must be handled on a case-by-case > basis. For example, the scheduler could invoke an RCU API > to tell RCU when to shut down its per-CPU kthreads and when > to start them up. Or RCU could deal with its kthreads in the > CPU_DOWN_PREPARE and CPU_ONLINE notifiers. Either way, RCU > needs to correctly handle the interval when it cannot use > kthreads on a given CPU that it is still handling, for example, > by switching to running the RCU core code in softirq context. > > g. Most subsystems participating in CPU-hotplug operations will need > to keep their own copy of CPU online/offline state. For example, > RCU uses the ->qsmaskinit fields in the rcu_node structure for > this purpose. > > h. So CPU-offline handling looks something like the following: > > i. Acquire the hotplug mutex. > > ii. Invoke the CPU_DOWN_PREPARE notifiers. If there > are objections, invoke the CPU_DOWN_FAILED notifiers > and return an error. > > iii. Clear the CPU's bit in cpu_online_map. > > iv. Invoke synchronize_sched() to ensure that all future hotplug > read-side critical sections ignore the outgoing CPU. > > v. Set a per-CPU variable telling the CPU to take itself > offline. There would need to be something here to > help the CPU get to idle quickly, possibly requiring > another round of notifiers. CPU_DOWN? > > vi. When the dying CPU gets to the idle loop, it invokes the > CPU_DYING notifiers and updates its per-CPU variable to > indicate that it is ready to die. It then spins in a > tight loop (or does some other architecture-specified > operation to wait to be turned off). 
> > Note that there is no need for RCU to guess how long the > CPU might be executing RCU read-side critical sections. > > vii. When the task doing the offline operation sees the > updated per-CPU variable, it calls __cpu_die(). > > viii. The CPU_DEAD notifiers are invoked. > > ix. The check_for_tasks() function is invoked. > > x. Release the hotplug mutex. > > xi. Invoke the CPU_POST_DEAD notifiers. > > i. I do not believe that the CPU-online handling needs to change > much. Regards, Srivatsa S. Bhat ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CPU Hotplug rework 2012-04-06 20:15 ` Srivatsa S. Bhat @ 2012-04-09 16:46 ` Paul E. McKenney 2012-04-10 7:56 ` Nikunj A Dadhania 0 siblings, 1 reply; 38+ messages in thread From: Paul E. McKenney @ 2012-04-09 16:46 UTC (permalink / raw) To: Srivatsa S. Bhat Cc: Peter Zijlstra, Arjan van de Ven, Steven Rostedt, rusty@rustcorp.com.au, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list, nikunj On Sat, Apr 07, 2012 at 01:45:44AM +0530, Srivatsa S. Bhat wrote: > On 04/06/2012 04:36 AM, Paul E. McKenney wrote: > > > Hello, > > > > Here is my attempt at a summary of the discussion. > > > > > Thanks for the summary, it is really helpful :-) > > > Srivatsa, I left out the preempt_disable() pieces, but would be happy > > to add them in when you let me know what you are thinking to do for > > de-stop_machine()ing CPU hotplug. > > > > > Ok.. > > > > Thanx, Paul > > > > ------------------------------------------------------------------------ > > > > CPU-hotplug work breakout: > > > > 1. Read and understand the current generic code. > > Srivatsa Bhat has done this, as have Paul E. McKenney and > > Peter Zijlstra to a lesser extent. > > > "lesser extent"?? Hell no! :-) ;-) Certainly to a lesser extent on my part, but yes, I should not speak for Peter. > > 2. Read and understand the architecture-specific code, looking > > for opportunities to consolidate additional function into > > core code. > > > > a. Carry out any indicated consolidation. > > > > b. Convert all architectures to make use of the > > consolidated implementation. > > > > Not started. Low priority from a big.LITTLE perspective. > > > Recently this unexpectedly assumed high priority due to some scheduler > changes and things got fixed up temporarily. And in that context, > Peter Zijlstra gave some more technical pointers on what is wrong and needs > to be done right. 
Link: https://lkml.org/lkml/2012/3/22/149 > > Nikunj (in CC) has offered to work with me on this consolidation. Very cool! I have added the following: ------------------------------------------------------------------------ CONSOLIDATE ARCHITECTURE-SPECIFIC CPU-HOTPLUG CODE 1. Ensure that all CPU_STARTING notifiers complete before the incoming CPU is marked online (the blackfin architecture fails to do this). 2. Ensure that interrupts are disabled throughout the CPU_STARTING notifiers. Currently, blackfin, cris, m32r, mips, sh, sparc64, um, and x86 fail to do this properly. 3. Ensure that all architectures that use CONFIG_USE_GENERIC_SMP_HELPERS hold ipi_call_lock() over the entire CPU-online process. Currently, alpha, arm, m32r, mips, sh, and sparc32 seem to fail to do this properly. 4. Additional memory barriers are likely to be needed, for example, an smp_wmb() after setting cpu_active and an smp_rmb() in select_fallback_rq() before reading cpu_active. Srivatsa Bhat (srivatsa.bhat@linux.vnet.ibm.com) and Nikunj A Dadhania (nikunj@linux.vnet.ibm.com) are taking on this work. ------------------------------------------------------------------------ Please let me know if adjustments are needed. > > 3. Address the current kthread creation/teardown/migration > > performance issues. (More details below.) > > > > Highest priority from a big.LITTLE perspective. > > > > 4. Wean CPU-hotplug offlining from stop_machine(). > > (More details below.) > > > > Moderate priority from a big.LITTLE perspective. > > > > > > ADDRESSING KTHREAD CREATION/TEARDOWN/MIGRATION PERFORMANCE ISSUES > > > > 1. Evaluate approaches. Approaches currently under > > consideration include: > > > > a. Park the kthreads rather than tearing them down or > > migrating them. RCU currently takes this sort of > > approach. Note that RCU currently relies on both > > preempt_disable() and local_bh_disable() blocking the > > current CPU from going offline. > > > > b. 
Allow in-kernel kthreads to avoid the delay > > required to work around a bug in old versions of > > bash. (This bug is a failure to expect receiving > > a SIGCHILD signal corresponding to a child > > created by a fork() system call that has not yet > > returned.) > > > > This might be implemented using an additional > > CLONE_ flag. This should allow kthreads to > > be created and torn down much more quickly. > > > > c. Have some other TBD way to "freeze" a kthread. > > (As in "your clever idea here".) > > > > 2. Implement the chosen approach or approaches. (Different > > kernel subsystems might have different constraints, possibly > > requiring different kthread handling.) > > > > > > WEAN CPU-HOTPLUG OFFLINING FROM stop_machine() > > > > > > 1. CPU_DYING notifier fixes needed as of 3.2: > > > > o vfp_hotplug(): I believe that this works as-is. > > o s390_nohz_notify(): I believe that this works as-is. > > o x86_pmu_notifier(): I believe that this works as-is. > > o perf_ibs_cpu_notifier(): I don't know enough about > > APIC to say. > > o tboot_cpu_callback(): I believe that this works as-is, > > but this one returns NOTIFY_BAD to a CPU_DYING notifier, > > which is badness. But it looks like that case is a > > "cannot happen" case. Still needs to be fixed. > > o clockevents_notify(): This one acquires a global lock, > > so it should be safe as-is. > > o console_cpu_notify(): This one takes the same action > > for CPU_ONLINE, CPU_DEAD, CPU_DOWN_FAILED, and > > CPU_UP_CANCELLED that it does for CPU_DYING, so it > > should be OK. > > * rcu_cpu_notify(): This one needs adjustment as noted > > above, but nothing major. Patch has been posted, > > probably needs a bit of debugging. > > o migration_call(): I defer to Peter on this one. > > It looks to me like it is written to handle other > > CPUs, but... > > * workqueue_cpu_callback(): Might need help, does a > > non-atomic OR. > > o kvm_cpu_hotplug(): Uses a global spinlock, so should > > be OK as-is. > > > > 2. 
Evaluate designs for stop_machine()-free CPU hotplug. > > Implement the chosen design. An outline for a particular > > design is shown below, but the actual design might be > > quite different. > > > > 3. Fix issues with CPU Hotplug callback registration. Currently > > there is no totally-race-free way to register callbacks and do > > setup for already online cpus. > > > > Srivatsa had posted an incomplete patchset some time ago > > regarding this, which gives an idea of the direction he had > > in mind. > > http://thread.gmane.org/gmane.linux.kernel/1258880/focus=15826 > > Gah, this has been "incomplete" for quite some time now.. I'll try to speed up > things a bit :-) Sounds good to me! ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CPU Hotplug rework 2012-04-09 16:46 ` Paul E. McKenney @ 2012-04-10 7:56 ` Nikunj A Dadhania 0 siblings, 0 replies; 38+ messages in thread From: Nikunj A Dadhania @ 2012-04-10 7:56 UTC (permalink / raw) To: paulmck, Srivatsa S. Bhat Cc: Peter Zijlstra, Arjan van de Ven, Steven Rostedt, rusty@rustcorp.com.au, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list On Mon, 9 Apr 2012 09:46:28 -0700, "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote: > > > > 2. Read and understand the architecture-specific code, looking > > > for opportunities to consolidate additional function into > > > core code. > > > > > > a. Carry out any indicated consolidation. > > > > > > b. Convert all architectures to make use of the > > > consolidated implementation. > > > > > > Not started. Low priority from a big.LITTLE perspective. > > > > > > Recently this unexpectedly assumed high priority due to some scheduler > > changes and things got fixed up temporarily. And in that context, > > Peter Zijlstra gave some more technical pointers on what is wrong and needs > > to be done right. Link: https://lkml.org/lkml/2012/3/22/149 > > > > Nikunj (in CC) has offered to work with me on this consolidation. > > Very cool! I have added the following: > > ------------------------------------------------------------------------ > > CONSOLIDATE ARCHITECTURE-SPECIFIC CPU-HOTPLUG CODE > > 1. Ensure that all CPU_STARTING notifiers complete before the > incoming CPU is marked online (the blackfin architecture > fails to do this). > > 2. Ensure that interrupts are disabled throughout the CPU_STARTING > notifiers. Currently, blackfin, cris, m32r, mips, sh, sparc64, > um, and x86 fail to do this properly. > > 3. Ensure that all architectures that use CONFIG_USE_GENERIC_SMP_HELPERS > hold ipi_call_lock() over the entire CPU-online process. 
Currently, > alpha, arm, m32r, mips, sh, and sparc32 seem to fail to do this > properly. > > 4. Additional memory barriers are likely to be needed, for example, > an smp_wmb() after setting cpu_active and an smp_rmb() in > select_fallback_rq() before reading cpu_active. > > Srivatsa Bhat (srivatsa.bhat@linux.vnet.ibm.com) and Nikunj A Dadhania > (nikunj@linux.vnet.ibm.com) are taking on this work. > Sounds good :-) Regards Nikunj ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CPU Hotplug rework 2012-04-05 17:39 ` Paul E. McKenney 2012-04-05 17:55 ` Paul E. McKenney @ 2012-04-06 19:52 ` Srivatsa S. Bhat 2012-04-09 17:13 ` Paul E. McKenney 2012-04-11 0:09 ` Steven Rostedt 1 sibling, 2 replies; 38+ messages in thread From: Srivatsa S. Bhat @ 2012-04-06 19:52 UTC (permalink / raw) To: paulmck Cc: Peter Zijlstra, Arjan van de Ven, Steven Rostedt, rusty@rustcorp.com.au, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list On 04/05/2012 11:09 PM, Paul E. McKenney wrote: > On Mon, Mar 19, 2012 at 08:18:42PM +0530, Srivatsa S. Bhat wrote: >> On 03/19/2012 08:14 PM, Srivatsa S. Bhat wrote: >> >>> Hi, >>> >>> There had been some discussion on CPU Hotplug redesign/rework >>> some time ago, but it was buried under a thread with a different >>> subject. >>> (http://thread.gmane.org/gmane.linux.kernel/1246208/focus=1246404) >>> >>> So I am opening a new thread with an appropriate subject to discuss >>> what needs to be done and how to go about it, as part of the rework. >>> >>> Peter Zijlstra and Paul McKenney had come up with TODO lists for the >>> rework, and here are their extracts from the previous discussion: > > Finally getting around to looking at this in more detail... > >> Additional things that I would like to add to the list: >> >> 1. Fix issues with CPU Hotplug callback registration. Currently there >> is no totally-race-free way to register callbacks and do setup >> for already online cpus. >> >> I had posted an incomplete patchset some time ago regarding this, >> which gives an idea of the direction I had in mind. 
>> http://thread.gmane.org/gmane.linux.kernel/1258880/focus=15826
>
> Another approach is to have the registration function return the
> CPU mask corresponding to the instant at which registration occurred,
> perhaps via an additional function argument that points to a
> cpumask_var_t that can be NULL if you don't care. Then you can
> do setup for the CPUs indicated in the mask.

That would look something like this, right?

register_cpu_notifier(nb, mask);
do_setup(mask);

If yes, then here is our problem:

1. A CPU hotplug operation can happen in between these two function calls.
2. No matter how we try to wrap these up with get/put_online_cpus(), it
   will still lead to either a race or a deadlock, depending on how we
   do it.

Wrapping only do_setup() within get/put_online_cpus() wouldn't serve our
purpose, since the race with CPU hotplug would still exist, just like
before. So, let's consider what happens when we wrap both the functions
within get/put_online_cpus():

get_online_cpus();
register_cpu_notifier(nb, mask);
do_setup(mask);
put_online_cpus();

Unfortunately this leads to an ABBA deadlock (see below).

> Or am I missing the race you had in mind? Or is the point to make
> sure that the notifiers execute in order?

No, I wasn't referring to the order of execution of the notifiers here.
Well, the race I was concerned about was the one between a CPU hotplug
operation and a callback registration operation, which might lead to an
ABBA deadlock, with the two locks in question being the cpu_hotplug lock
and the cpu_add_remove_lock.

The heart of the problem is that something as simple as the following is
prone to an ABBA deadlock (note that I am not even talking about
do_setup() here):

get_online_cpus()
register_cpu_notifier() or any other variant
put_online_cpus()

In the above, the lock ordering is:
grab cpu_hotplug lock -> grab cpu_add_remove_lock.

But in a usual CPU hotplug operation, the lock ordering is the exact
reverse:
grab cpu_add_remove_lock -> grab cpu_hotplug lock.
Hence, we have an ABBA deadlock possibility.

I had posted a pictorial representation of the deadlock condition here:
https://lkml.org/lkml/2012/3/19/289

So, to get rid of this deadlock, the approach I proposed came out of the
following observation:

A code sequence such as:

register_cpu_notifier();

//Go on doing stuff that is irrelevant to CPU hotplug.

is totally safe from any kind of race or deadlock, the reason being that
CPU hotplug operations begin by taking the cpu_add_remove_lock, and
callback registration functions like the above also take only this
particular lock. IOW, the mutex that protects against a race between a
CPU hotplug operation and callback registration is the
cpu_add_remove_lock. The cpu_hotplug lock is actually irrelevant here!

    CPU0                                CPU 1
    register_cpu_notifier()             cpu_down()
    ----------------------------------------------------------------------
    acquire cpu_add_remove_lock
                                        Blocked on cpu_add_remove_lock
    do callback registration
    release cpu_add_remove_lock
                                        Only now we can proceed and acquire
                                        the cpu_add_remove_lock.

                                        acquire cpu_hotplug lock
                                        do cpu_down related work
                                        release cpu_hotplug lock

                                        release cpu_add_remove_lock

So my approach is: consider whatever we want to do while registering our
callback, including doing setup for already online cpus, and do *all of
it* within the section marked as "do callback registration". IOW, we
piggyback on something _known_ to be perfectly fine and hence avoid all
races and deadlocks.

>> 2. There is a mismatch between the code and the documentation around
>> the difference between [un/register]_hotcpu_notifier and
>> [un/register]_cpu_notifier. And I remember seeing several places in
>> the code that use them inconsistently. Not terribly important, but
>> good to fix it up while we are at it.
>
> The following led me to believe that they were the same:
>
> #define register_hotcpu_notifier(nb) register_cpu_notifier(nb)
> #define unregister_hotcpu_notifier(nb) unregister_cpu_notifier(nb)
>
> What am I missing here?
At first, even I thought that they were exactly the same. But then,
looking at the #ifdef magic in include/linux/cpu.h, I realized that there
is a subtle difference between the two:

The register_cpu_notifier() family of functions is defined whenever:
a. CONFIG_HOTPLUG_CPU is set
b. or, CONFIG_HOTPLUG_CPU is not set, but we are in core code,
   i.e., !defined(MODULE)

The hotcpu family of functions is different in the sense that these
functions are defined only when CONFIG_HOTPLUG_CPU is set. If
CONFIG_HOTPLUG_CPU is not set, they are defined as pretty much empty
functions, irrespective of whether we are calling them from core code or
from module code.

And the reasoning behind this difference is: strictly speaking,
CONFIG_HOTPLUG_CPU needs to be set only if we want to *offline* CPUs at
some point in time, for example during suspend/resume. Otherwise, we
don't need to set CONFIG_HOTPLUG_CPU.

But there is subsystem- and arch-specific code which needs to take action
even during CPU online operations (like booting), and hence its callback
registrations must work even when CONFIG_HOTPLUG_CPU is not set. Such
code must use the native register_cpu_notifier() family of functions.

Code that only cares about CPU hotplug when offlining is actually
possible can use the hotcpu_* family instead and get optimized out in the
!HOTPLUG_CPU case.

None of this seems to be documented in Documentation/cpu-hotplug.txt. In
fact, what that document explains (something about early init/late init)
is actually misleading, because it is outdated beyond recognition! :-(

Now coming to the inconsistent uses of these functions: if some core code
which needs to register its callback irrespective of CONFIG_HOTPLUG_CPU
uses the hotcpu* variants instead, it'll break in the !HOTPLUG_CPU case.
The other case (using the register_cpu* variants everywhere) is less
worrisome, in fact harmless, but such code could have been optimized out
in the !HOTPLUG_CPU case if the hotcpu* variants had been used.

>
>> 3.
There was another thread where stuff related to CPU hotplug had been >> discussed. It had exposed some new challenges to CPU hotplug, if we >> were to support asynchronous smp booting. >> >> http://thread.gmane.org/gmane.linux.kernel/1246209/focus=48535 >> http://thread.gmane.org/gmane.linux.kernel/1246209/focus=48542 >> http://thread.gmane.org/gmane.linux.kernel/1246209/focus=1253241 >> http://thread.gmane.org/gmane.linux.kernel/1246209/focus=1253267 > > Good points! ;-) Thank you :-) > >> 4. Because the current CPU offline code depends on stop_machine(), every >> online CPU must cooperate with the offline event. This means, whenever >> we do a preempt_disable(), it ensures not only that that particular >> CPU won't go offline, but also that *any* CPU cannot go offline. This >> is more like a side-effect of using stop_machine(). >> >> So when trying to move over to stop_one_cpu(), we have to carefully audit >> places where preempt_disable() has been used in that manner (ie., >> preempt_disable used as a light-weight and non-blocking form of >> get_online_cpus()). Because when we move to stop_one_cpu() to do CPU offline, >> a preempt disabled section will prevent only that particular CPU from >> going offline. >> >> I haven't audited preempt_disable() calls yet, but one such use was there >> in brlocks (include/linux/lglock.h) until quite recently. > > I was thinking in terms of the offline code path doing a synchronize_sched() > to allow preempt_disable() to retain a reasonable approximation of its > current semantics. This would require a pair of CPU masks, one for code > using CPU-based primitives (e.g., sending IPIs) and another for code > implementing those primitives. > > Or is there some situation that makes this approach fail? > Oh, right, using synchronize_sched() in the offline path would pretty much solve our problem.. I guess I overlooked that. Thanks for pointing it out! Regards, Srivatsa S. 
Bhat
* Re: CPU Hotplug rework 2012-04-06 19:52 ` Srivatsa S. Bhat @ 2012-04-09 17:13 ` Paul E. McKenney 2012-04-10 13:41 ` Srivatsa S. Bhat 2012-04-11 0:09 ` Steven Rostedt 1 sibling, 1 reply; 38+ messages in thread From: Paul E. McKenney @ 2012-04-09 17:13 UTC (permalink / raw) To: Srivatsa S. Bhat Cc: Peter Zijlstra, Arjan van de Ven, Steven Rostedt, rusty@rustcorp.com.au, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list On Sat, Apr 07, 2012 at 01:22:23AM +0530, Srivatsa S. Bhat wrote: > On 04/05/2012 11:09 PM, Paul E. McKenney wrote: > > > On Mon, Mar 19, 2012 at 08:18:42PM +0530, Srivatsa S. Bhat wrote: > >> On 03/19/2012 08:14 PM, Srivatsa S. Bhat wrote: > >> > >>> Hi, > >>> > >>> There had been some discussion on CPU Hotplug redesign/rework > >>> some time ago, but it was buried under a thread with a different > >>> subject. > >>> (http://thread.gmane.org/gmane.linux.kernel/1246208/focus=1246404) > >>> > >>> So I am opening a new thread with an appropriate subject to discuss > >>> what needs to be done and how to go about it, as part of the rework. > >>> > >>> Peter Zijlstra and Paul McKenney had come up with TODO lists for the > >>> rework, and here are their extracts from the previous discussion: > > > > Finally getting around to looking at this in more detail... > > > >> Additional things that I would like to add to the list: > >> > >> 1. Fix issues with CPU Hotplug callback registration. Currently there > >> is no totally-race-free way to register callbacks and do setup > >> for already online cpus. > >> > >> I had posted an incomplete patchset some time ago regarding this, > >> which gives an idea of the direction I had in mind. 
> >> http://thread.gmane.org/gmane.linux.kernel/1258880/focus=15826
> >
> > Another approach is to have the registration function return the
> > CPU mask corresponding to the instant at which registration occurred,
> > perhaps via an additional function argument that points to a
> > cpumask_var_t that can be NULL if you don't care. Then you can
> > do setup for the CPUs indicated in the mask.
>
> That would look something like this, right?
>
> register_cpu_notifier(nb, mask);
> do_setup(mask);
>
> If yes, then here is our problem:
> 1. A CPU hotplug operation can happen in between these 2 function calls.
> 2. No matter how we try to wrap these up with get/put_online_cpus(), it
>    will still lead to either a race, or a deadlock, depending on how we
>    do it.

Hmmm... The call to register_cpu_notifier() excludes CPU-hotplug
operations, so any CPU-hotplug operation that happens after the
register_cpu_notifier() will invoke the notifier. This does mean that
we can then see concurrent invocation of the notifier from do_setup()
and from a CPU-hotplug operation, but it should not be hard to create
a reasonable locking scheme to prevent this.

BTW, I am assuming that it is OK for the notifiers to run out of order
in this case. Perhaps you are trying to preserve ordering?

The reason that I believe that it should be OK to allow misordering of
notifiers is that the subsystem in question is not fully initialized yet.
It is therefore probably not yet actually servicing requests; for
example, something like RCU might be refusing to start grace periods.
If it is not yet servicing requests, it should not matter what order it
sees the CPUs come online. Once do_setup() returns, it will see
subsequent CPU-hotplug operations in order, so at that point it might be
able to start servicing requests.

Or am I missing something else?

> Wrapping only do_setup() within get/put_online_cpus() wouldn't serve our
> purpose, since the race with CPU Hotplug would still exist, just like
> before.
So, let's consider what happens when we wrap both the functions > within get/put_online_cpus(): > > get_online_cpus(); > register_cpu_notifier(nb, mask); > do_setup(mask); > put_online_cpus(); > > Unfortunately this leads to an ABBA deadlock (see below). > > > > Or am I missing the race you had in mind? Or is the point to make > > sure that the notifiers execute in order? > > No, I wasn't referring to the order of execution of the notifiers here. OK, then I am still missing something... > Well, the race I was concerned about was the one between a CPU hotplug > operation and a callback registration operation, which might lead to an > ABBA deadlock, with the 2 locks in question being cpu hotplug lock and > cpu_add_remove_lock. > > The heart of the problem is that, something as simple as the following > is prone to an ABBA deadlock (note that I am not even talking about > do_setup() here): > > get_online_cpus() > register_cpu_notifier() or any other variant > put_online_cpus() > > In the above, the lock ordering is: > grab cpu_hotplug lock -> grab cpu_add_remove_lock. But why do we need the get_online_cpus() in this case? What would it be preventing from happening and why do we care? > But in a usual CPU hotplug operation, the lock ordering is the exact reverse: > grab cpu_add_remove_lock -> grab cpu_hotplug lock. > > Hence, we have an ABBA deadlock possibility. > > I had posted a pictorial representation of the deadlock condition here: > https://lkml.org/lkml/2012/3/19/289 And Peter replied to this saying that he can remove the get_online_cpus() call, which gets rid of the ABBA deadlock condition, correct? > So, to get rid of this deadlock, the approach I proposed came out of the > following observation: > > A code such as: > > register_cpu_notifier(); > > //Go on doing stuff that are irrelevant to CPU hotplug. 
> is totally safe from any kind of race or deadlock, the reason being that
> CPU Hotplug operations begin by taking the cpu_add_remove_lock, and
> callback registration functions like the above also take only this
> particular lock. IOW, the mutex that protects against a race between a
> CPU hotplug operation and callback registration is the
> cpu_add_remove_lock. The cpu_hotplug lock is actually irrelevant here!

Yep, agreed, despite my confusion some weeks ago.

>     CPU0                                CPU 1
>     register_cpu_notifier()             cpu_down()
>     ----------------------------------------------------------------------
>     acquire cpu_add_remove_lock
>                                         Blocked on cpu_add_remove_lock
>     do callback registration
>     release cpu_add_remove_lock
>                                         Only now we can proceed and acquire
>                                         the cpu_add_remove_lock.
>
>                                         acquire cpu_hotplug lock
>                                         do cpu_down related work
>                                         release cpu_hotplug lock
>
>                                         release cpu_add_remove_lock
>
> So my approach is: consider whatever we want to do while registering our
> callback, including doing setup for already online cpus, and do *all of
> it* within the section marked as "do callback registration". IOW, we
> piggyback on something _known_ to be perfectly fine and hence avoid all
> races and deadlocks.

OK, interesting approach. But does this actually save anything in
any real situation, given that CPU hotplug notifiers are run during
initialization, before the subsystem being initialized is really doing
anything?

> >> 2. There is a mismatch between the code and the documentation around
> >> the difference between [un/register]_hotcpu_notifier and
> >> [un/register]_cpu_notifier. And I remember seeing several places in
> >> the code that use them inconsistently. Not terribly important, but
> >> good to fix it up while we are at it.
> >
> > The following led me to believe that they were the same:
> >
> > #define register_hotcpu_notifier(nb) register_cpu_notifier(nb)
> > #define unregister_hotcpu_notifier(nb) unregister_cpu_notifier(nb)
> >
> > What am I missing here?
> > At first, even I thought that they were exactly same. But then looking at the > #ifdef magic in include/linux/cpu.h, I realized that there is a subtle > difference between the two: > > The register_cpu_notifier() family of functions are defined whenever: > a. CONFIG_HOTPLUG_CPU is set > b. or, CONFIG_HOTPLUG_CPU is not set, but we are in core code > ie., !defined(MODULE) > > The hotcpu family of functions are different in the sense that they are > defined only when CONFIG_HOTPLUG_CPU is set. However, if CONFIG_HOTPLUG_CPU > is not set, they are defined as pretty much empty functions, irrespective of > whether we are calling them from core code or from module code. > > And the reasoning behind this difference is: strictly speaking, > CONFIG_HOTPLUG_CPU needs to be set only if we want to *offline* CPUs at > some point in time, like suspend/resume for example. Otherwise, we don't need > to set CONFIG_HOTPLUG_CPU. > > But there are subsystems/arch specific stuff which need to take action even > during CPU online operations (like booting), and hence their callback > registrations must work even when CONFIG_HOTPLUG_CPU is not set. So they must > use the native register_cpu_notifier() family of functions. > > If some code doesn't really care much when offlining of CPUs is not required, > then they can use the hotcpu_* family and optimize out for the !HOTPLUG_CPU > case. > > None of this seems to be documented in Documentation/cpu-hotplug.txt. In fact > what that document explains (something about early init/late init) is actually > misleading because it is out-dated beyond recognition! :-( Indeed, it appears that the last substantive change to this file was more than two years ago, and most of the work on the file happened more than five years ago. It certainly needs attention! I have added documentation updates to the list. 
> Now coming to the inconsistent uses of these functions, if some core code which > needs to register its callback irrespective of CONFIG_HOTPLUG_CPU, uses > hotcpu* variants instead, it'll break stuff in the !HOTPLUG_CPU case. > The other case (using register_cpu* variants in every place) is less worrisome, > in fact harmless, but could have been optimized out in the !HOTPLUG_CPU case, > if the hotcpu* variants had been used. OK, got it, thank you for the tutorial! > >> 3. There was another thread where stuff related to CPU hotplug had been > >> discussed. It had exposed some new challenges to CPU hotplug, if we > >> were to support asynchronous smp booting. > >> > >> http://thread.gmane.org/gmane.linux.kernel/1246209/focus=48535 > >> http://thread.gmane.org/gmane.linux.kernel/1246209/focus=48542 > >> http://thread.gmane.org/gmane.linux.kernel/1246209/focus=1253241 > >> http://thread.gmane.org/gmane.linux.kernel/1246209/focus=1253267 > > > > Good points! ;-) > > Thank you :-) > > >> 4. Because the current CPU offline code depends on stop_machine(), every > >> online CPU must cooperate with the offline event. This means, whenever > >> we do a preempt_disable(), it ensures not only that that particular > >> CPU won't go offline, but also that *any* CPU cannot go offline. This > >> is more like a side-effect of using stop_machine(). > >> > >> So when trying to move over to stop_one_cpu(), we have to carefully audit > >> places where preempt_disable() has been used in that manner (ie., > >> preempt_disable used as a light-weight and non-blocking form of > >> get_online_cpus()). Because when we move to stop_one_cpu() to do CPU offline, > >> a preempt disabled section will prevent only that particular CPU from > >> going offline. > >> > >> I haven't audited preempt_disable() calls yet, but one such use was there > >> in brlocks (include/linux/lglock.h) until quite recently. 
> > > > I was thinking in terms of the offline code path doing a synchronize_sched() > > to allow preempt_disable() to retain a reasonable approximation of its > > current semantics. This would require a pair of CPU masks, one for code > > using CPU-based primitives (e.g., sending IPIs) and another for code > > implementing those primitives. > > > > Or is there some situation that makes this approach fail? > > Oh, right, using synchronize_sched() in the offline path would pretty much > solve our problem.. I guess I overlooked that. Thanks for pointing it out! Here is hoping. ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CPU Hotplug rework 2012-04-09 17:13 ` Paul E. McKenney @ 2012-04-10 13:41 ` Srivatsa S. Bhat 2012-04-10 15:46 ` Paul E. McKenney 0 siblings, 1 reply; 38+ messages in thread From: Srivatsa S. Bhat @ 2012-04-10 13:41 UTC (permalink / raw) To: paulmck Cc: Peter Zijlstra, Arjan van de Ven, Steven Rostedt, rusty@rustcorp.com.au, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list, nikunj On 04/09/2012 10:43 PM, Paul E. McKenney wrote: > On Sat, Apr 07, 2012 at 01:22:23AM +0530, Srivatsa S. Bhat wrote: >>> >>>> Additional things that I would like to add to the list: >>>> >>>> 1. Fix issues with CPU Hotplug callback registration. Currently there >>>> is no totally-race-free way to register callbacks and do setup >>>> for already online cpus. >>>> >>>> I had posted an incomplete patchset some time ago regarding this, >>>> which gives an idea of the direction I had in mind. >>>> http://thread.gmane.org/gmane.linux.kernel/1258880/focus=15826 >>> >>> Another approach is to have the registration function return the >>> CPU mask corresponding to the instant at which registration occurred, >>> perhaps via an additional function argument that points to a >>> cpumask_var_t that can be NULL if you don't care. Then you can >>> do setup for the CPUs indicated in the mask. >> >> That would look something like this right? >> >> register_cpu_notifier(nb, mask); >> do_setup(mask); >> >> If yes, then here is our problem: >> 1. A CPU hotplug operation can happen, in-between these 2 function calls. >> 2. No matter how we try to wrap these up with get/put_online_cpus(), it >> will still lead to either a race, or a deadlock, depending on how we >> do it. > > Hmmm... The call to register_cpu_notifier() excludes CPU-hotplug > operations, so any CPU-hotplug operation that happens after the > register_cpu_notifier() will invoke the notifier. Right. 
> This does mean that > we can then see concurrent invocation of the notifier from do_setup() > and from a CPU-hotplug operation, Exactly! > but it should not be hard to create > a reasonably locking scheme to prevent this. > Ah, this is where our approach differs ;-) My approach is to *avoid* the concurrent invocation. Yours seems to be to *handle* the concurrent invocation. > BTW, I am assuming that it is OK for the notifiers to run out of order > in this case. Perhaps you are trying to preserve ordering? > No, I am not worried about the ordering. Ordering is an issue only when we allow concurrent invocations and try to handle them. If we prevent concurrent invocation of notifiers, (mis)ordering of notifiers for subsequent CPU hotplug operations won't even come into the picture, right? > The reason that I believe that it should be OK to allow misordering of > notifiers is that the subsystem in question is not fully initialized yet. > It is therefore probably not yet actually servicing requests, for example, > something like RCU would might be refusing to start grace periods. > If it is not yet servicing requests, it should not matter what order > it sees the CPUs come online. Once do_setup() returns, it will see > subsequent CPU-hotplug operations in order, so at that point it might > be able to start servicing requests. > > Or am I missing something else? > Looks like the two of us are trying to solve the same problem in two different ways, hence the difficulty in understanding each other ;-) The problem: During registration of notifier, CPU hotplug cannot happen - guaranteed. Beyond that, CPU hotplug can happen, which includes the do_setup() phase. Hence, concurrent invocations of notifiers might take place and mess up things. Your solution/approach seems to be: 1. Register cpu hotplug notifiers. 2. Run setup for already online cpus, *knowing* that the notifier invocations can happen due to 2 sources (CPU hotplug and do_setup). 
Add appropriate locking to handle this.

This raises a new issue: notifiers can get executed out of order, because
there are 2 sources that are running the notifiers. Ignore this issue
because it doesn't appear to be troublesome in any way.

My thoughts on this:

1. The extra locking needs to be added to every initialization code path
that does this "register + do_setup()" thing.

2. At least I am not able to think of any simple locking scheme. Consider
some of the problems it has to address:

* do_setup is running for CPU 3, which is already online.
* Now somebody happens to take CPU 3 offline. What happens to the
notifiers? CPU_UP_PREPARE/CPU_ONLINE notifiers from do_setup() and
CPU_DOWN_PREPARE/CPU_DEAD notifiers from the CPU hotplug operation get
interleaved? That would be bad, wouldn't it?

(CPUs getting offlined while do_setup() is running might look like a
totally theoretical or rather impractical scenario. But it's not.
Consider the fact that initialization of this kind happens in the
module_init() code of some modules. And module load/unload racing with
CPU hotplug is not that much of a theoretical case.)

My solution/approach is:

1. Register cpu hotplug notifiers. (We know CPU hotplug is disabled
throughout this.)

2. Run setup for already online cpus, but ensure that the "CPU hotplug
disabled" phase which started in the previous step extends throughout
this step too, without any gaps in between. This would *prevent*
concurrent invocation of notifiers. This would also *prevent* misordering
of notifiers, because when we re-enable CPU hotplug after our do_setup(),
all the registered notifiers run in the right order for future CPU
hotplug operations.

My thoughts on this:

1. No extra locking necessary. Also, no locking added to each call site/
initialization code. Since all changes are made to the generic
register_cpu_notifier() function, it simply boils down to all the
initialization code adapting to the new form of callback registration by
providing appropriate parameters.

2. This is immune to any sort of CPU hotplug (offline/online) at any
point in time, without assumptions such as "CPU offline won't happen
during early boot" or "nobody will do CPU hotplug when loading/unloading
modules".

3. Looks simple (at least in my opinion), since no extra locking logic is
necessary :-) We just piggyback on what is already in place. And also, no
need to handle some arcanely complex scenarios like what I mentioned
above (interleaving of online/offline notifiers), since that scenario
simply won't occur here.

>> Wrapping only do_setup() within get/put_online_cpus() wouldn't serve our
>> purpose, since the race with CPU Hotplug would still exist, just like
>> before. So, let's consider what happens when we wrap both the functions
>> within get/put_online_cpus():
>>
>> get_online_cpus();
>> register_cpu_notifier(nb, mask);
>> do_setup(mask);
>> put_online_cpus();
>>
>> Unfortunately this leads to an ABBA deadlock (see below).
>>
>>> Or am I missing the race you had in mind? Or is the point to make
>>> sure that the notifiers execute in order?
>>
>> No, I wasn't referring to the order of execution of the notifiers here.
>
> OK, then I am still missing something...
>
>> Well, the race I was concerned about was the one between a CPU hotplug
>> operation and a callback registration operation, which might lead to an
>> ABBA deadlock, with the 2 locks in question being the cpu_hotplug lock
>> and the cpu_add_remove_lock.
>>
>> The heart of the problem is that something as simple as the following
>> is prone to an ABBA deadlock (note that I am not even talking about
>> do_setup() here):
>>
>> get_online_cpus()
>> register_cpu_notifier() or any other variant
>> put_online_cpus()
>>
>> In the above, the lock ordering is:
>> grab cpu_hotplug lock -> grab cpu_add_remove_lock.
>
> But why do we need the get_online_cpus() in this case? What would it be
> preventing from happening and why do we care?

It was a case of too much short-circuiting, I admit.
I intended to show the concept of "CPU hotplug disabled" phase as being persistent throughout callback registration + do_setup(), using some known primitives for disabling CPU hotplug, like get/put_online_cpus(). That is: get_online_cpus() register_cpu_notifier() do_setup() put_online_cpus() Then I wanted to show that this can lead to an ABBA deadlock; and also point out that do_setup() has got nothing to do with the deadlock - the interplay between get/put_online_cpus and register_cpu_notifier() is the actual culprit. So I short-circuited all this into the code above ;-) > >> But in a usual CPU hotplug operation, the lock ordering is the exact reverse: >> grab cpu_add_remove_lock -> grab cpu_hotplug lock. >> >> Hence, we have an ABBA deadlock possibility. >> >> I had posted a pictorial representation of the deadlock condition here: >> https://lkml.org/lkml/2012/3/19/289 > > And Peter replied to this saying that he can remove the get_online_cpus() > call, which gets rid of the ABBA deadlock condition, correct? > Yes, but that is a special case - his code is an early-initcall (pre-smp). So there is no possibility of CPU hotplug at that point; only one CPU is up and running. (Even the async booting patch[1] doesn't alter this behaviour and hence we are safe.) This is not the case with initialization code that can occur much later in the boot sequence or which can be triggered after booting is complete, like module initialization code. >> So, to get rid of this deadlock, the approach I proposed came out of the >> following observation: >> >> A code such as: >> >> register_cpu_notifier(); >> >> //Go on doing stuff that are irrelevant to CPU hotplug. >> >> is totally safe from any kind of race or deadlock, the reason being that >> CPU Hotplug operations begin by taking the cpu_add_remove_lock, and >> callback registration functions like the above also take only this >> particular lock. 
IOW, the mutex that protects a race between CPU hotplug >> operation vs callback registration is the cpu_add_remove_lock. The >> cpu_hotplug lock is actually irrelevant here! > > Yep, agreed, despite my confusion some weeks ago. > >> >> CPU0 CPU 1 >> register_cpu_notifier() cpu_down() >> ---------------------------------------------------------------------- >> >> acquire cpu_add_remove_lock >> Blocked on cpu_add_remove_lock >> >> do callback registration >> >> release cpu_add_remove_lock >> >> Only now we can proceed and acquire >> the cpu_add_remove_lock. >> >> acquire cpu_hotplug lock >> do cpu_down related work >> release cpu_hotplug lock >> >> release cpu_add_remove_lock >> >> >> So my approach is: consider whatever we want to do while registering our >> callback, including doing setup for already online cpus; - and do *all of it* >> within the section marked as "do callback registration". IOW, we piggyback >> on something _known_ to be perfectly fine and hence avoid all races and >> deadlocks. > > OK, interesting approach. But does this actually save anything in > any real situation, given that CPU hotplug notifiers are run during > initialization, before the subsystem being initialized is really doing > anything? > Yes - we know that if we run multiple instances of a non-reentrant code simultaneously without locking, it will crash. Here the non-reentrant code is the individual CPU hotplug notifiers/callbacks. Why are they non-reentrant? Because it is a given that notifiers are always run one after the other in the CPU hotplug path. And hence none of the notifiers take care (locking) to see that they don't race with themselves; which was perfectly OK previously. What was the assumption earlier? During bootup, the onlining of CPUs won't run in parallel with the rest of the kernel initialization. IOW, when do_setup() is running (as part of initcalls), no CPU hotplug (online/offline) operation can occur. 
[Even the CPU onlining as part of booting won't happen in parallel with do_setup()] What changed that? The async boot patch[1]. It made CPU onlining during boot run in parallel with the rest of the kernel initialization. (This didn't go upstream though, because the kernel was not ready for such a change, as shown by these problems). What was the effect? Except for early initcalls, which are pre-SMP, all other initcalls can race with CPU Hotplug. That is, do_setup() can race with notifiers from ongoing CPU hotplug (CPU online operations, as part of booting) and hence the notifiers suddenly need extra locking or something to be reentrant. Why does my approach help? It ensures that do_setup() will never occur in parallel with CPU hotplug, at any time. Hence the individual notifiers need not watch their back - they can continue to be non-reentrant and still everything will work fine because we fix it at the callback registration level itself. Honestly, I wrote this patchset to fix issues opened up by the async booting patch[1]. That patch caused boot failures on powerpc [2] because of CPU Hotplug notifier races. And I believe the solution I proposed will fix it. Without the async booting patch, this was more or less a theoretical race. That patch made it not only real but also severe enough to cause boot failures. So, if the async booting design is not being pushed any further, then I guess we can simply ignore this theoretical race altogether and focus on more important issues (I am totally OK with that) ... and possibly revisit this race whenever it bites us again ;-) What do you think? [1]. http://thread.gmane.org/gmane.linux.kernel/1246209 [2]. http://thread.gmane.org/gmane.linux.kernel.next/20726/focus=20757 Regards, Srivatsa S. Bhat ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CPU Hotplug rework 2012-04-10 13:41 ` Srivatsa S. Bhat @ 2012-04-10 15:46 ` Paul E. McKenney 2012-04-10 17:26 ` Srivatsa S. Bhat 0 siblings, 1 reply; 38+ messages in thread From: Paul E. McKenney @ 2012-04-10 15:46 UTC (permalink / raw) To: Srivatsa S. Bhat Cc: Peter Zijlstra, Arjan van de Ven, Steven Rostedt, rusty@rustcorp.com.au, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list, nikunj On Tue, Apr 10, 2012 at 07:11:50PM +0530, Srivatsa S. Bhat wrote: > On 04/09/2012 10:43 PM, Paul E. McKenney wrote: > > > On Sat, Apr 07, 2012 at 01:22:23AM +0530, Srivatsa S. Bhat wrote: > >>> > >>>> Additional things that I would like to add to the list: > >>>> > >>>> 1. Fix issues with CPU Hotplug callback registration. Currently there > >>>> is no totally-race-free way to register callbacks and do setup > >>>> for already online cpus. > >>>> > >>>> I had posted an incomplete patchset some time ago regarding this, > >>>> which gives an idea of the direction I had in mind. > >>>> http://thread.gmane.org/gmane.linux.kernel/1258880/focus=15826 > >>> > >>> Another approach is to have the registration function return the > >>> CPU mask corresponding to the instant at which registration occurred, > >>> perhaps via an additional function argument that points to a > >>> cpumask_var_t that can be NULL if you don't care. Then you can > >>> do setup for the CPUs indicated in the mask. > >> > >> That would look something like this right? > >> > >> register_cpu_notifier(nb, mask); > >> do_setup(mask); > >> > >> If yes, then here is our problem: > >> 1. A CPU hotplug operation can happen, in-between these 2 function calls. > >> 2. No matter how we try to wrap these up with get/put_online_cpus(), it > >> will still lead to either a race, or a deadlock, depending on how we > >> do it. > > > > Hmmm... 
The call to register_cpu_notifier() excludes CPU-hotplug > > operations, so any CPU-hotplug operation that happens after the > > register_cpu_notifier() will invoke the notifier. > > Right. > > > This does mean that > > we can then see concurrent invocation of the notifier from do_setup() > > and from a CPU-hotplug operation, > > Exactly! > > > but it should not be hard to create > > a reasonable locking scheme to prevent this. > > Ah, this is where our approach differs ;-) > My approach is to *avoid* the concurrent invocation. Yours seems to be to > *handle* the concurrent invocation. I am lazy, I know. But I didn't see any reasonable way to prevent the concurrency. > > BTW, I am assuming that it is OK for the notifiers to run out of order > > in this case. Perhaps you are trying to preserve ordering? > > No, I am not worried about the ordering. Ordering is an issue only when we > allow concurrent invocations and try to handle them. If we prevent concurrent > invocation of notifiers, (mis)ordering of notifiers for subsequent CPU > hotplug operations won't even come into the picture, right? Right -- after all, we are probably initializing the pre-existing CPUs in some order other than their onlining order, right? ;-) > > The reason that I believe that it should be OK to allow misordering of > > notifiers is that the subsystem in question is not fully initialized yet. > > It is therefore probably not yet actually servicing requests, for example, > > something like RCU might be refusing to start grace periods. > > If it is not yet servicing requests, it should not matter what order > > it sees the CPUs come online. Once do_setup() returns, it will see > > subsequent CPU-hotplug operations in order, so at that point it might > > be able to start servicing requests. > > > > Or am I missing something else?
> > Looks like the two of us are trying to solve the same problem in two different > ways, hence the difficulty in understanding each other ;-) That usually does make things a bit more difficult. ;-) > The problem: > During registration of notifier, CPU hotplug cannot happen - guaranteed. > Beyond that, CPU hotplug can happen, which includes the do_setup() phase. > Hence, concurrent invocations of notifiers might take place and mess up things. > > Your solution/approach seems to be: > 1. Register cpu hotplug notifiers. > 2. Run setup for already online cpus, *knowing* that the notifier invocations > can happen due to 2 sources (CPU hotplug and do_setup). Add appropriate > locking to handle this. This raises a new issue: notifiers can get executed > out-of-order, because there are 2 sources that are running the notifiers. > Ignore this issue because it doesn't appear to be troublesome in any way. Yep! > My thoughts on this: > 1. The extra locking needs to be added to every initialization code that does > this "register + do_setup()" thing. > 2. At least I am not able to think of any simple locking scheme. Consider some > of the problems it has to address: > > * do_setup is running for CPU 3 which is already online. > * Now somebody happens to take CPU 3 offline. What happens to the notifiers? > CPU_UP_PREPARE/CPU_ONLINE notifiers from do_setup() and CPU_DOWN_PREPARE/ > CPU_DEAD notifiers from the CPU Hotplug operation get interleaved? > That would be bad, wouldn't it? > > (CPUs getting offlined while do_setup() is running might look like a totally > theoretical or rather impractical scenario. But it's not. Consider the fact > that the initialization of this kind happens in module_init() code of some > modules. And module load/unload racing with CPU hotplug is not that much of > a theoretical case). > > My solution/approach is: > 1. Register cpu hotplug notifiers. (we know CPU hotplug is disabled throughout > this) > 2.
Run setup for already online cpus, but ensure that the "CPU hotplug > disabled" phase which started in the previous step extends throughout this > step too, without any gaps in between. This would *prevent* concurrent > invocation of notifiers. This would also *prevent* misordering of notifiers, > because when we re-enable CPU hotplug after our do_setup(), all the > registered notifiers run in the right order for future CPU hotplug > operations. > > My thoughts on this: > 1. No extra locking necessary. Also, no locking added to each call site/ > initialization code. Since all changes are made to the generic > register_cpu_notifier() function, it simply boils down to all the > initialization code adapting to the new form of callback registration by > providing appropriate parameters. > 2. This is immune to any sort of CPU Hotplug (offline/online) at any point > in time, without assumptions such as 'CPU offline won't happen during early > boot' or 'nobody will do CPU Hotplug when loading/unloading modules'. > 3. Looks simple (at least in my opinion) since no extra locking logic is > necessary :-) We just piggyback on what is already in place. > And also, no need to handle some arcanely complex scenarios like what > I mentioned above (interleaving of online/offline notifiers), since that > scenario simply won't occur here. I was thinking in terms of the CPU_DOWN_PREPARE or CPU_DEAD notifiers checking the corresponding bit of the mask returned by the notifier registration function. If the bit is set, just clear it, otherwise execute the normal notifier function. With appropriate locking, of course. (Waves hands wildly.) Note that this does require that the notifier registration function store the cpumask while under the protection of cpu_maps_update_begin(). It also requires two wrapper functions for the underlying notifier function, one to be called from the actual notifier and the other from the initialization/setup code.
> >> Wrapping only do_setup() within get/put_online_cpus() wouldn't serve our > >> purpose, since the race with CPU Hotplug would still exist, just like > >> before. So, let's consider what happens when we wrap both the functions > >> within get/put_online_cpus(): > >> > >> get_online_cpus(); > >> register_cpu_notifier(nb, mask); > >> do_setup(mask); > >> put_online_cpus(); > >> > >> Unfortunately this leads to an ABBA deadlock (see below). > >> > >> > >>> Or am I missing the race you had in mind? Or is the point to make > >>> sure that the notifiers execute in order? > >> > >> No, I wasn't referring to the order of execution of the notifiers here. > > > > OK, then I am still missing something... > > > >> Well, the race I was concerned about was the one between a CPU hotplug > >> operation and a callback registration operation, which might lead to an > >> ABBA deadlock, with the 2 locks in question being cpu hotplug lock and > >> cpu_add_remove_lock. > >> > >> The heart of the problem is that, something as simple as the following > >> is prone to an ABBA deadlock (note that I am not even talking about > >> do_setup() here): > >> > >> get_online_cpus() > >> register_cpu_notifier() or any other variant > >> put_online_cpus() > >> > >> In the above, the lock ordering is: > >> grab cpu_hotplug lock -> grab cpu_add_remove_lock. > > > > But why do we need the get_online_cpus() in this case? What would it be > > preventing from happening and why do we care? > > It was a case of too much short-circuiting, I admit. I intended to show the > concept of "CPU hotplug disabled" phase as being persistent throughout callback > registration + do_setup(), using some known primitives for disabling CPU > hotplug, like get/put_online_cpus(). 
> > That is: get_online_cpus() > register_cpu_notifier() > do_setup() > put_online_cpus() > > Then I wanted to show that this can lead to an ABBA deadlock; and also point > out that do_setup() has got nothing to do with the deadlock - the interplay > between get/put_online_cpus and register_cpu_notifier() is the actual culprit. > So I short-circuited all this into the code above ;-) OK. ;-) > >> But in a usual CPU hotplug operation, the lock ordering is the exact reverse: > >> grab cpu_add_remove_lock -> grab cpu_hotplug lock. > >> > >> Hence, we have an ABBA deadlock possibility. > >> > >> I had posted a pictorial representation of the deadlock condition here: > >> https://lkml.org/lkml/2012/3/19/289 > > > > And Peter replied to this saying that he can remove the get_online_cpus() > > call, which gets rid of the ABBA deadlock condition, correct? > > Yes, but that is a special case - his code is an early-initcall (pre-smp). > So there is no possibility of CPU hotplug at that point; only one CPU is up > and running. (Even the async booting patch[1] doesn't alter this behaviour and > hence we are safe.) This is not the case with initialization code that can > occur much later in the boot sequence or which can be triggered after booting > is complete, like module initialization code. But as long as the cpumask is written under the protection of cpu_maps_update_begin(), use of a common lock for the cpumask should suffice. If more scalability is required, multiple locks can be used, each protecting a different section of the cpumask.

From the notifiers, execution would be as follows:

    Acquire the subsystem's cpumask lock.
    Check the bit -- if going offline and the bit is set:
        Clear the bit.
        Release the subsystem's cpumask lock.
        Return.
    Invoke the underlying notifier function.
    Release the subsystem's cpumask lock.
Notifiers cannot run concurrently with initialization of the cpumask because the notifier registration function initializes the cpumask under protection of cpu_maps_update_begin(), which excludes CPU-hotplug operations.

From setup/initialization, execution would be as follows:

    Acquire the subsystem's cpumask lock.
    Find the first set bit. If there is no bit set:
        Release the subsystem's cpumask lock.
        Return "done" indication.
    Clear the bit.
    Invoke the underlying notifier function.
    Release the subsystem's cpumask lock.

Seem reasonable, or am I missing something? > >> So, to get rid of this deadlock, the approach I proposed came out of the > >> following observation: > >> > >> A code such as: > >> > >> register_cpu_notifier(); > >> > >> //Go on doing stuff that are irrelevant to CPU hotplug. > >> > >> is totally safe from any kind of race or deadlock, the reason being that > >> CPU Hotplug operations begin by taking the cpu_add_remove_lock, and > >> callback registration functions like the above also take only this > >> particular lock. IOW, the mutex that protects a race between CPU hotplug > >> operation vs callback registration is the cpu_add_remove_lock. The > >> cpu_hotplug lock is actually irrelevant here! > > > > Yep, agreed, despite my confusion some weeks ago. > > > >> > >> CPU0 CPU 1 > >> register_cpu_notifier() cpu_down() > >> ---------------------------------------------------------------------- > >> > >> acquire cpu_add_remove_lock > >> Blocked on cpu_add_remove_lock > >> > >> do callback registration > >> > >> release cpu_add_remove_lock > >> > >> Only now we can proceed and acquire > >> the cpu_add_remove_lock.
> >> > >> acquire cpu_hotplug lock > >> do cpu_down related work > >> release cpu_hotplug lock > >> > >> release cpu_add_remove_lock > >> > >> > >> So my approach is: consider whatever we want to do while registering our > >> callback, including doing setup for already online cpus; - and do *all of it* > >> within the section marked as "do callback registration". IOW, we piggyback > >> on something _known_ to be perfectly fine and hence avoid all races and > >> deadlocks. > > > > OK, interesting approach. But does this actually save anything in > > any real situation, given that CPU hotplug notifiers are run during > > initialization, before the subsystem being initialized is really doing > > anything? > > Yes - we know that if we run multiple instances of a non-reentrant code > simultaneously without locking, it will crash. Here the non-reentrant code > is the individual CPU hotplug notifiers/callbacks. Right -- and registering CPU hotplug notifiers during early boot remains a very nice labor-saving approach, because CPU-hotplug operations cannot run that early. However, CPU hotplug notifiers registered by loadable modules do not have this choice, and must therefore correctly handle CPU-hotplug operations running concurrently with module initialization. > Why are they non-reentrant? > Because it is a given that notifiers are always run one after the other in > the CPU hotplug path. And hence none of the notifiers take care (locking) > to see that they don't race with themselves; which was perfectly OK previously. And that is still the case -unless- your code must register CPU-hotplug notifiers after early boot. Right? > What was the assumption earlier? > During bootup, the onlining of CPUs won't run in parallel with the rest of > the kernel initialization. IOW, when do_setup() is running (as part of > initcalls), no CPU hotplug (online/offline) operation can occur. 
[Even the > CPU onlining as part of booting won't happen in parallel with do_setup()] > > What changed that? > The async boot patch[1]. It made CPU onlining during boot run in parallel > with the rest of the kernel initialization. (This didn't go upstream though, > because the kernel was not ready for such a change, as shown by these > problems). But even so, CPU online happens after the scheduler is initialized. So things like RCU can still work as they used to, relying on the fact that they are initializing before CPU-hotplug operations can run. Besides, this vulnerability existed beforehand -- a driver really could bring a CPU down during late boot, which would confuse drivers initializing concurrently, right? I doubt that anyone does this, but there is nothing stopping them from doing it. ;-) > What was the effect? > Except for early initcalls, which are pre-SMP, all other initcalls can race with > CPU Hotplug. That is, do_setup() can race with notifiers from ongoing CPU > hotplug (CPU online operations, as part of booting) and hence the notifiers > suddenly need extra locking or something to be reentrant. > > Why does my approach help? At this point, I must confess that I have lost track of exactly what your approach is... > It ensures that do_setup() will never occur in parallel with CPU hotplug, > at any time. Hence the individual notifiers need not watch their back - > they can continue to be non-reentrant and still everything will work fine > because we fix it at the callback registration level itself. > > Honestly, I wrote this patchset to fix issues opened up by the async booting > patch[1]. That patch caused boot failures on powerpc [2] because of CPU > Hotplug notifier races. And I believe the solution I proposed will fix it. > > Without the async booting patch, this was more or less a theoretical race. > That patch made it not only real but also severe enough to cause boot > failures.
> > So, if the async booting design is not being pushed any further, then I > guess we can simply ignore this theoretical race altogether and focus on > more important issues (I am totally OK with that) ... and possibly revisit > this race whenever it bites us again ;-) > > What do you think? > > [1]. http://thread.gmane.org/gmane.linux.kernel/1246209 > [2]. http://thread.gmane.org/gmane.linux.kernel.next/20726/focus=20757 Neither of the above two URLs points to a patch, so I am still a bit lost. That said, it is only a matter of time before async boot reappears -- systems are still getting more CPUs, and boot-time constraints are becoming more severe. So something will eventually need to be done, and we might as well deal with it ahead of time. Thanx, Paul ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CPU Hotplug rework 2012-04-10 15:46 ` Paul E. McKenney @ 2012-04-10 17:26 ` Srivatsa S. Bhat 0 siblings, 0 replies; 38+ messages in thread From: Srivatsa S. Bhat @ 2012-04-10 17:26 UTC (permalink / raw) To: paulmck Cc: Peter Zijlstra, Arjan van de Ven, Steven Rostedt, rusty@rustcorp.com.au, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list, nikunj On 04/10/2012 09:16 PM, Paul E. McKenney wrote: > On Tue, Apr 10, 2012 at 07:11:50PM +0530, Srivatsa S. Bhat wrote: [This is a quick reply to give the links you requested. I'll reply to the other things after I read what you wrote more thoroughly.] >> Why does my approach help? > > At this point, I must confess that I have lost track of exactly what > your approach is... The same old "incomplete" patchset ;-) (Note that the patch 1/3 is complete. The "incomplete" tag is just because it is followed by changes only to powerpc (2/3) and sparc (3/3), while actually, many other places need to be changed. But the first patch in the series is definitely in full form.) https://lkml.org/lkml/2012/3/1/39 https://lkml.org/lkml/2012/3/1/40 https://lkml.org/lkml/2012/3/1/41 > >> It ensures that do_setup() will never occur in parallel with CPU hotplug, >> at any time. Hence the individual notifiers need not watch their back - >> they can continue to be non-reentrant and still everything will work fine >> because we fix it at the callback registration level itself. >> >> Honestly, I wrote this patchset to fix issues opened up by the async booting >> patch[1]. That patch caused boot failures in powerpc [2] because of CPU >> Hotplug notifier races. And I believe the solution I proposed will fix it. >> >> Without the async booting patch, this was more or less a theoretical race. >> That patch made it not only real but also severe enough to cause boot >> failures.
>> >> So, if the async booting design is not being pushed any further, then I >> guess we can simply ignore this theoretical race altogether and focus on >> more important issues (I am totally OK with that) ... and possibly revisit >> this race whenever it bites us again ;-) >> >> What do you think? >> >> [1]. http://thread.gmane.org/gmane.linux.kernel/1246209 >> [2]. http://thread.gmane.org/gmane.linux.kernel.next/20726/focus=20757 > > Neither of the above two URLs points to a patch, ?? Well, the first one points to the async booting patch and the second one points to a verbal root-cause analysis of the boot failure on powerpc, caused by that patch. Let me give equivalent links from lkml.org: [1]. https://lkml.org/lkml/2012/1/31/286 [2]. https://lkml.org/lkml/2012/2/13/383 Regards, Srivatsa S. Bhat ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CPU Hotplug rework 2012-04-06 19:52 ` Srivatsa S. Bhat 2012-04-09 17:13 ` Paul E. McKenney @ 2012-04-11 0:09 ` Steven Rostedt 2012-04-11 0:28 ` Paul E. McKenney 1 sibling, 1 reply; 38+ messages in thread From: Steven Rostedt @ 2012-04-11 0:09 UTC (permalink / raw) To: Srivatsa S. Bhat Cc: paulmck, Peter Zijlstra, Arjan van de Ven, rusty@rustcorp.com.au, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list On Sat, 2012-04-07 at 01:22 +0530, Srivatsa S. Bhat wrote: > Wrapping only do_setup() within get/put_online_cpus() wouldn't serve our > purpose, since the race with CPU Hotplug would still exist, just like > before. So, let's consider what happens when we wrap both the functions > within get/put_online_cpus(): > > get_online_cpus(); > register_cpu_notifier(nb, mask); > do_setup(mask); > put_online_cpus(); > > Unfortunately this leads to an ABBA deadlock (see below). > Just to throw out the stupid silly approach. What about creating a "__register_cpu_notifier()" that just does:

    int __ref __register_cpu_notifier(struct notifier_block *nb)
    {
            return raw_notifier_chain_register(&cpu_chain, nb);
    }

Also making cpu_maps_update_begin/done() global (and probably rename them). and then in the above code do:

    cpu_maps_update_begin();
    __register_cpu_notifier(nb);
    do_setup();
    cpu_maps_update_done();

Just saying, -- Steve ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CPU Hotplug rework 2012-04-11 0:09 ` Steven Rostedt @ 2012-04-11 0:28 ` Paul E. McKenney 2012-04-11 0:37 ` Steven Rostedt 0 siblings, 1 reply; 38+ messages in thread From: Paul E. McKenney @ 2012-04-11 0:28 UTC (permalink / raw) To: Steven Rostedt Cc: Srivatsa S. Bhat, Peter Zijlstra, Arjan van de Ven, rusty@rustcorp.com.au, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list On Tue, Apr 10, 2012 at 08:09:59PM -0400, Steven Rostedt wrote: > On Sat, 2012-04-07 at 01:22 +0530, Srivatsa S. Bhat wrote: > > > Wrapping only do_setup() within get/put_online_cpus() wouldn't serve our > > purpose, since the race with CPU Hotplug would still exist, just like > > before. So, let's consider what happens when we wrap both the functions > > within get/put_online_cpus(): > > > > get_online_cpus(); > > register_cpu_notifier(nb, mask); > > do_setup(mask); > > put_online_cpus(); > > > > Unfortunately this leads to an ABBA deadlock (see below). > > > > Just to throw out the stupid silly approach. > > What about creating a "__register_cpu_notifier()" that just does: > > int __ref __register_cpu_notifier(struct notifier_block *nb) > { > return raw_notifier_chain_register(&cpu_chain, nb); > } > > Also making cpu_maps_update_begin/done() global (and probably rename > them). > > and then in the above code do: > > cpu_maps_update_begin(); > __register_cpu_notifier(nb); > do_setup(); > cpu_maps_update_done(); > > > Just saying, That does have some attractive properties, now that you mention it. ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CPU Hotplug rework 2012-04-11 0:28 ` Paul E. McKenney @ 2012-04-11 0:37 ` Steven Rostedt 2012-04-11 1:00 ` Paul E. McKenney 0 siblings, 1 reply; 38+ messages in thread From: Steven Rostedt @ 2012-04-11 0:37 UTC (permalink / raw) To: paulmck Cc: Srivatsa S. Bhat, Peter Zijlstra, Arjan van de Ven, rusty@rustcorp.com.au, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list On Tue, 2012-04-10 at 17:28 -0700, Paul E. McKenney wrote: > > Just to throw out the stupid silly approach. > > > > What about creating a "__register_cpu_notifier()" that just does: > > > > int __ref __register_cpu_notifier(struct notifier_block *nb) > > { > > return raw_notifier_chain_register(&cpu_chain, nb); > > } > > > > Also making cpu_maps_update_begin/done() global (and probably rename > > them). I just noticed that the cpu_maps_update_begin/done() are already global. > > > > and then in the above code do: > > > > cpu_maps_update_begin(); > > __register_cpu_notifier(nb); > > do_setup(); > > cpu_maps_update_done(); > > > > > > Just saying, > > That does have some attractive properties, now that you mention it. ;-) Which property? Stupid or Silly ;-) -- Steve ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CPU Hotplug rework 2012-04-11 0:37 ` Steven Rostedt @ 2012-04-11 1:00 ` Paul E. McKenney 2012-04-11 6:02 ` Srivatsa S. Bhat 0 siblings, 1 reply; 38+ messages in thread From: Paul E. McKenney @ 2012-04-11 1:00 UTC (permalink / raw) To: Steven Rostedt Cc: Srivatsa S. Bhat, Peter Zijlstra, Arjan van de Ven, rusty@rustcorp.com.au, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list On Tue, Apr 10, 2012 at 08:37:18PM -0400, Steven Rostedt wrote: > On Tue, 2012-04-10 at 17:28 -0700, Paul E. McKenney wrote: > > > > Just to throw out the stupid silly approach. > > > > > > What about creating a "__register_cpu_notifier()" that just does: > > > > > > int __ref __register_cpu_notifier(struct notifier_block *nb) > > > { > > > return raw_notifier_chain_register(&cpu_chain, nb); > > > } > > > > > > Also making cpu_maps_update_begin/done() global (and probably rename > > > them). > > I just noticed that the cpu_maps_update_begin/done() are already global. > > > > > > > and then in the above code do: > > > > > > cpu_maps_update_begin(); > > > __register_cpu_notifier(nb); > > > do_setup(); > > > cpu_maps_update_done(); > > > > > > > > > Just saying, > > > > That does have some attractive properties, now that you mention it. ;-) > > Which property? Stupid or Silly ;-) As with any piece of software, no matter how small, both. ;-) Of course, __register_cpu_notifier() would need lockdep checking to make sure that it wasn't called without the benefit of cpu_maps_update_begin(). I might be missing something, but as long as that was in place, seems like it is a lot simpler and easier to use than the alternatives that Srivatsa and I were kicking around. Thanx, Paul ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CPU Hotplug rework 2012-04-11 1:00 ` Paul E. McKenney @ 2012-04-11 6:02 ` Srivatsa S. Bhat 2012-04-11 12:28 ` Paul E. McKenney 0 siblings, 1 reply; 38+ messages in thread From: Srivatsa S. Bhat @ 2012-04-11 6:02 UTC (permalink / raw) To: paulmck Cc: Steven Rostedt, Peter Zijlstra, Arjan van de Ven, rusty@rustcorp.com.au, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list, nikunj On 04/11/2012 06:30 AM, Paul E. McKenney wrote: > On Tue, Apr 10, 2012 at 08:37:18PM -0400, Steven Rostedt wrote: >> On Tue, 2012-04-10 at 17:28 -0700, Paul E. McKenney wrote: >> >>>> Just to throw out the stupid silly approach. >>>> >>>> What about creating a "__register_cpu_notifier()" that just does: >>>> >>>> int __ref __register_cpu_notifier(struct notifier_block *nb) >>>> { >>>> return raw_notifier_chain_register(&cpu_chain, nb); >>>> } >>>> >>>> Also making cpu_maps_update_begin/done() global (and probably rename >>>> them). >> >> I just noticed that the cpu_maps_update_begin/done() are already global. >> >>>> >>>> and then in the above code do: >>>> >>>> cpu_maps_update_begin(); >>>> __register_cpu_notifier(nb); >>>> do_setup(); >>>> cpu_maps_update_done(); >>>> >>>> Wow! Believe it or not, this is precisely the crux of the approach I was suggesting all along!! :-) Just that when put to code, it looked slightly different than this.. Sorry for not being clear. 
So here is what I proposed, in a simplified form: Modify the existing register_cpu_notifier() to this (by possibly giving it a different name): int __ref register_cpu_notifier(struct notifier_block *nb, int (*do_setup)(void)) { int ret; cpu_maps_update_begin(); ret = raw_notifier_chain_register(&cpu_chain, nb); do_setup(); cpu_maps_update_done(); return ret; } and then, in the caller, do: register_cpu_notifier(nb, do_setup); If the caller doesn't need any such extra setup, just do: register_cpu_notifier(nb, NULL); Of course, register_cpu_notifier() should handle NULL properly. (My patch [1] handles it, along with some other special cases.) That's it! Also, it is to be noted that cpu_maps_update_begin/done() are global, but not exported symbols - so modules can't use them. With the above approach, we need not make them exported symbols, since the caller need not care about these locks at all. >>>> Just saying, >>> >>> That does have some attractive properties, now that you mention it. ;-) >> >> Which property? Stupid or Silly ;-) > > As with any piece of software, no matter how small, both. ;-) > > Of course, __register_cpu_notifier() would need lockdep checking to make > sure that it wasn't called without the benefit of cpu_maps_update_begin(). Not with my approach ;-) Its all automatically handled :-) > I might be missing something, but as long as that was in place, seems > like it is a lot simpler and easier to use than the alternatives that > Srivatsa and I were kicking around. > Hehe :-) Thanks for simplifying things, Steve! [1]. https://lkml.org/lkml/2012/3/1/39 Regards, Srivatsa S. Bhat ^ permalink raw reply [flat|nested] 38+ messages in thread
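The shape of this pattern can be modelled in userspace (purely illustrative: a plain mutex stands in for cpu_maps_update_begin/done(), a singly-linked list for the raw notifier chain, and all fake_* names are invented here):

```c
#include <pthread.h>
#include <stddef.h>

/* Minimal stand-ins: a plain mutex for cpu_maps_update_begin/done()
 * and a singly-linked list for the raw notifier chain. */
struct notifier_block {
	int (*notifier_call)(void *info);
	struct notifier_block *next;
};

static pthread_mutex_t fake_maps_lock = PTHREAD_MUTEX_INITIALIZER;
static struct notifier_block *fake_chain;

static int setup_ran;	/* set by the sample setup callback below */

static int sample_setup(void)
{
	setup_ran = 1;
	return 0;
}

/*
 * Register a notifier and run an optional setup callback under the
 * same lock, so no hotplug event can slip in between registration
 * and setup.  A NULL do_setup means "no extra setup needed".
 */
static int fake_register_cpu_notifier(struct notifier_block *nb,
				      int (*do_setup)(void))
{
	int ret = 0;

	pthread_mutex_lock(&fake_maps_lock);	/* cpu_maps_update_begin() */
	nb->next = fake_chain;			/* chain registration */
	fake_chain = nb;
	if (do_setup)
		ret = do_setup();
	pthread_mutex_unlock(&fake_maps_lock);	/* cpu_maps_update_done() */
	return ret;
}
```

Callers that need no setup pass NULL; the lock never escapes the registration function, which is why cpu_maps_update_begin/done() would not need to become exported symbols under this scheme.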
* Re: CPU Hotplug rework 2012-04-11 6:02 ` Srivatsa S. Bhat @ 2012-04-11 12:28 ` Paul E. McKenney 0 siblings, 0 replies; 38+ messages in thread From: Paul E. McKenney @ 2012-04-11 12:28 UTC (permalink / raw) To: Srivatsa S. Bhat Cc: Steven Rostedt, Peter Zijlstra, Arjan van de Ven, rusty@rustcorp.com.au, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list, nikunj On Wed, Apr 11, 2012 at 11:32:57AM +0530, Srivatsa S. Bhat wrote: > On 04/11/2012 06:30 AM, Paul E. McKenney wrote: > > > On Tue, Apr 10, 2012 at 08:37:18PM -0400, Steven Rostedt wrote: > >> On Tue, 2012-04-10 at 17:28 -0700, Paul E. McKenney wrote: > >> > >>>> Just to throw out the stupid silly approach. > >>>> > >>>> What about creating a "__register_cpu_notifier()" that just does: > >>>> > >>>> int __ref __register_cpu_notifier(struct notifier_block *nb) > >>>> { > >>>> return raw_notifier_chain_register(&cpu_chain, nb); > >>>> } > >>>> > >>>> Also making cpu_maps_update_begin/done() global (and probably rename > >>>> them). > >> > >> I just noticed that the cpu_maps_update_begin/done() are already global. > >> > >>>> > >>>> and then in the above code do: > >>>> > >>>> cpu_maps_update_begin(); > >>>> __register_cpu_notifier(nb); > >>>> do_setup(); > >>>> cpu_maps_update_done(); > >>>> > >>>> > > > Wow! Believe it or not, this is precisely the crux of the approach I was > suggesting all along!! :-) Just that when put to code, it looked slightly > different than this.. Sorry for not being clear. 
> > So here is what I proposed, in a simplified form: > > Modify the existing register_cpu_notifier() to this (by possibly giving > it a different name): > > int __ref register_cpu_notifier(struct notifier_block *nb, > int (*do_setup)(void)) > { > int ret; > > cpu_maps_update_begin(); > ret = raw_notifier_chain_register(&cpu_chain, nb); > do_setup(); > cpu_maps_update_done(); > > return ret; > } > > and then, in the caller, do: > > register_cpu_notifier(nb, do_setup); > > If the caller doesn't need any such extra setup, just do: > > register_cpu_notifier(nb, NULL); > > > Of course, register_cpu_notifier() should handle NULL properly. > (My patch [1] handles it, along with some other special cases.) > > That's it! > > Also, it is to be noted that cpu_maps_update_begin/done() are global, but > not exported symbols - so modules can't use them. With the above approach, > we need not make them exported symbols, since the caller need not care about > these locks at all. > > >>>> Just saying, > >>> > >>> That does have some attractive properties, now that you mention it. ;-) > >> > >> Which property? Stupid or Silly ;-) > > > > As with any piece of software, no matter how small, both. ;-) > > > > Of course, __register_cpu_notifier() would need lockdep checking to make > > sure that it wasn't called without the benefit of cpu_maps_update_begin(). > > > Not with my approach ;-) Its all automatically handled :-) Good point, looks good! Thanx, Paul > > I might be missing something, but as long as that was in place, seems > > like it is a lot simpler and easier to use than the alternatives that > > Srivatsa and I were kicking around. > > > > > Hehe :-) Thanks for simplifying things, Steve! > > > [1]. https://lkml.org/lkml/2012/3/1/39 > > Regards, > Srivatsa S. Bhat ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CPU Hotplug rework 2012-03-19 14:44 CPU Hotplug rework Srivatsa S. Bhat 2012-03-19 14:48 ` Srivatsa S. Bhat @ 2012-03-19 23:42 ` Rusty Russell 2012-03-20 10:42 ` Peter Zijlstra 1 sibling, 1 reply; 38+ messages in thread From: Rusty Russell @ 2012-03-19 23:42 UTC (permalink / raw) To: Srivatsa S. Bhat, Peter Zijlstra, Paul E. McKenney Cc: Arjan van de Ven, Steven Rostedt, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list On Mon, 19 Mar 2012 20:14:25 +0530, "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com> wrote: > Hi, > > There had been some discussion on CPU Hotplug redesign/rework > some time ago, but it was buried under a thread with a different > subject. > (http://thread.gmane.org/gmane.linux.kernel/1246208/focus=1246404) > > So I am opening a new thread with an appropriate subject to discuss > what needs to be done and how to go about it, as part of the rework. > > Peter Zijlstra and Paul McKenney had come up with TODO lists for the > rework, and here are their extracts from the previous discussion: This is possible, but quite a lot of tricky auditing work. There's an underlying assumption that stop_machine is the slow part, since it feels so heavy. Unfortunately, this doesn't seem to be the case in my testing. The time for hotplug seems to be moving all the threads around. So how about: (1) Let's not shutdown per-cpu kthreads, just leave them there to run if the CPU comes back. (2) Do something more efficient with userspace threads than migrating them one at a time. Otherwise, we risk doing a great deal of work and gaining nothing (cleanups aside, of course). Thanks, Rusty. -- How could I marry someone with more hair than me? http://baldalex.org ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CPU Hotplug rework 2012-03-19 23:42 ` Rusty Russell @ 2012-03-20 10:42 ` Peter Zijlstra 2012-03-20 23:00 ` Rusty Russell 0 siblings, 1 reply; 38+ messages in thread From: Peter Zijlstra @ 2012-03-20 10:42 UTC (permalink / raw) To: Rusty Russell Cc: Srivatsa S. Bhat, Paul E. McKenney, Arjan van de Ven, Steven Rostedt, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list On Tue, 2012-03-20 at 10:12 +1030, Rusty Russell wrote: > On Mon, 19 Mar 2012 20:14:25 +0530, "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com> wrote: > > Hi, > > > > There had been some discussion on CPU Hotplug redesign/rework > > some time ago, but it was buried under a thread with a different > > subject. > > (http://thread.gmane.org/gmane.linux.kernel/1246208/focus=1246404) > > > > So I am opening a new thread with an appropriate subject to discuss > > what needs to be done and how to go about it, as part of the rework. > > > > Peter Zijlstra and Paul McKenney had come up with TODO lists for the > > rework, and here are their extracts from the previous discussion: > > This is possible, but quite a lot of tricky auditing work. There's an > underlying assumption that stop_machine is the slow part, since it feels > so heavy. Depends on the machine and the needs. For the regular desktop with a regular kernel, the stop_machine in hotplug isn't really a problem. For _BIG_ machines stop_machine is a problem, for -RT stop_machine is a problem. So if we're going to re-architect hotplug anyway, it would be very good to get rid of it, because I really don't see any hard reasons why we would need it. > Unfortunately, this doesn't seem to be the case in my testing. The time > for hotplug seems to be moving all the threads around. So how about: Agreed, the thread creation on online is the most expensive operation. 
> (1) Let's not shutdown per-cpu kthreads, just leave them there to run > if the CPU comes back. Wasn't as easy as it sounds, but should be doable. > (2) Do something more efficient with userspace threads than migrating > them one at a time. Sadly that can't really be done. We need to pick up every task (userspace, but also running kernel threads) and update their state. > Otherwise, we risk doing a great deal of work and gaining nothing > (cleanups aside, of course). I don't really think it's possible to spend too much time cleaning up hotplug at this point :-) ^ permalink raw reply [flat|nested] 38+ messages in thread
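For reference, the "leave them there" approach is essentially what later landed in mainline as kthread parking (kthread_park()/kthread_unpark() and the smpboot infrastructure). A rough userspace model of the park handshake follows; all fake_* names are invented for illustration, and the real kernel code uses completions and TASK_PARKED rather than condition variables:

```c
#include <pthread.h>
#include <sched.h>
#include <stdbool.h>

struct fake_kthread {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	bool should_park;	/* request from the "hotplug" side */
	bool parked;		/* acknowledgement from the thread */
	bool should_stop;
	long work_done;		/* stand-in for real per-cpu work  */
};

static void *fake_kthread_fn(void *arg)
{
	struct fake_kthread *kt = arg;

	pthread_mutex_lock(&kt->lock);
	while (!kt->should_stop) {
		if (kt->should_park) {
			/* CPU went down: acknowledge and sleep in place
			 * instead of exiting; unpark resumes us. */
			kt->parked = true;
			pthread_cond_broadcast(&kt->cond);
			while (kt->should_park && !kt->should_stop)
				pthread_cond_wait(&kt->cond, &kt->lock);
			kt->parked = false;
			continue;
		}
		kt->work_done++;
		pthread_cond_broadcast(&kt->cond);
		/* Drop the lock briefly so park/stop requests get in. */
		pthread_mutex_unlock(&kt->lock);
		sched_yield();
		pthread_mutex_lock(&kt->lock);
	}
	pthread_mutex_unlock(&kt->lock);
	return NULL;
}

static void fake_kthread_create(struct fake_kthread *kt, pthread_t *tid)
{
	pthread_mutex_init(&kt->lock, NULL);
	pthread_cond_init(&kt->cond, NULL);
	kt->should_park = kt->parked = kt->should_stop = false;
	kt->work_done = 0;
	pthread_create(tid, NULL, fake_kthread_fn, kt);
}

/* Block until the thread has done at least n units of work. */
static void fake_kthread_wait_for_work(struct fake_kthread *kt, long n)
{
	pthread_mutex_lock(&kt->lock);
	while (kt->work_done < n)
		pthread_cond_wait(&kt->cond, &kt->lock);
	pthread_mutex_unlock(&kt->lock);
}

/* Ask the thread to park and wait until it has done so. */
static void fake_kthread_park(struct fake_kthread *kt)
{
	pthread_mutex_lock(&kt->lock);
	kt->should_park = true;
	pthread_cond_broadcast(&kt->cond);
	while (!kt->parked)
		pthread_cond_wait(&kt->cond, &kt->lock);
	pthread_mutex_unlock(&kt->lock);
}

static void fake_kthread_unpark(struct fake_kthread *kt)
{
	pthread_mutex_lock(&kt->lock);
	kt->should_park = false;
	pthread_cond_broadcast(&kt->cond);
	pthread_mutex_unlock(&kt->lock);
}

static void fake_kthread_stop(struct fake_kthread *kt, pthread_t tid)
{
	pthread_mutex_lock(&kt->lock);
	kt->should_stop = true;
	pthread_cond_broadcast(&kt->cond);
	pthread_mutex_unlock(&kt->lock);
	pthread_join(tid, NULL);
}
```

The point of the handshake is that the same thread survives the offline/online cycle: no teardown, no re-creation cost on onlining.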
* Re: CPU Hotplug rework 2012-03-20 10:42 ` Peter Zijlstra @ 2012-03-20 23:00 ` Rusty Russell 2012-03-21 9:01 ` Peter Zijlstra 0 siblings, 1 reply; 38+ messages in thread From: Rusty Russell @ 2012-03-20 23:00 UTC (permalink / raw) To: Peter Zijlstra Cc: Srivatsa S. Bhat, Paul E. McKenney, Arjan van de Ven, Steven Rostedt, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list On Tue, 20 Mar 2012 11:42:31 +0100, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote: > On Tue, 2012-03-20 at 10:12 +1030, Rusty Russell wrote: > Depends on the machine and the needs. For the regular desktop with a > regular kernel, the stop_machine in hotplug isn't really a problem. For > _BIG_ machines stop_machine is a problem I tested on a PPC 64-way a few years ago, but let's get a really big machine and re-run the tests. Simplest is to benchmark module removal, which uses a very trivial stop_machine call. Old test code below, but it'll need to be updated (module insertion no longer uses stop_machine). > for -RT stop_machine is a problem. If this is really about -RT, let's say so. There's nothing *wrong* with that, but it feels more honest. > So if we're going to re-architect hotplug anyway, it would be very good > to get rid of it, because I really don't see any hard reasons why we > would need it. Absolutely. It was an easy way to introduce it, but it's not fundamental. > > Unfortunately, this doesn't seem to be the case in my testing. The time > > for hotplug seems to be moving all the threads around. So how about: > > Agreed, the thread creation on online is the most expensive operation. > > > (1) Let's not shutdown per-cpu kthreads, just leave them there to run > > if the CPU comes back. > > Wasn't as easy as it sounds, but should be doable. > > > (2) Do something more efficient with userspace threads than migrating > > them one at a time. 
> > Sadly that can't really be done. We need to pick up every task > (userspace, but also running kernel threads) and update their state. What if we had an "orphan" runqueue which everyone pulled from? Then we could grab the lock, move them all to the fake rq, then let stuff happen normally. Maybe that's crap, but at least we could move the migration out of the hotplug callback somehow. > > Otherwise, we risk doing a great deal of work and gaining nothing > > (cleanups aside, of course). > > I don't really think it's possible to spend too much time cleaning up hotplug at this point :-) There is that, yes :) Cheers, Rusty. -- How could I marry someone with more hair than me? http://baldalex.org

/* measure_latency.c */
#define _GNU_SOURCE
#include <sched.h>
#include <sys/time.h>
#include <time.h>
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <err.h>
#include <string.h>
#include "time_diff.h"

/* "Copyright 2007 <kathy.staples@au.ibm.com> IBM Corp" */

#define streq(a,b) (strcmp((a),(b)) == 0)

static uint64_t timeval_to_usecs(struct timeval convert)
{
	/* this function works out the number of microseconds */
	return (convert.tv_sec * (uint64_t)1000000 + convert.tv_usec);
}

static void *grab_file(const char *filename, unsigned long *size)
{
	unsigned int max = 16384;
	int ret, fd;
	void *buffer = malloc(max);

	if (!buffer)
		return NULL;

	if (streq(filename, "-"))
		fd = dup(STDIN_FILENO);
	else
		fd = open(filename, O_RDONLY, 0);
	if (fd < 0)
		return NULL;

	*size = 0;
	while ((ret = read(fd, buffer + *size, max - *size)) > 0) {
		*size += ret;
		if (*size == max)
			buffer = realloc(buffer, max *= 2);
	}
	if (ret < 0) {
		free(buffer);
		buffer = NULL;
	}
	close(fd);
	return buffer;
}

extern long init_module(void *, unsigned long, const char *);

/* If module is NULL, merely go through the motions. */
static void do_modprobe(int cpu, int pollfd, int secs, const char *module)
{
	struct sched_param sparam = { .sched_priority = 99 };
	cpu_set_t this_cpu;
	fd_set readfds;
	int error;
	struct timeval timeout = { .tv_sec = 5 };
	void *file;
	unsigned long len;

	if (module) {
		file = grab_file(module, &len);
		if (!file)
			err(1, "Loading file %s", module);
	}

	CPU_ZERO(&this_cpu);
	CPU_SET(cpu, &this_cpu);

	if (sched_setaffinity(getpid(), sizeof(cpu_set_t), &this_cpu) != 0)
		err(1, "Could not move modprobe to cpu %i", cpu);

	if (sched_setscheduler(getpid(), SCHED_FIFO, &sparam) != 0)
		err(1, "Could not set FIFO scheduler for modprobe");

	/* Wait for go signal. */
	FD_ZERO(&readfds);
	FD_SET(pollfd, &readfds);

	/* We can timeout. */
	if (select(pollfd + 1, &readfds, NULL, NULL, &timeout) != 1)
		exit(1);

	/* Sleep until halfway through. */
	usleep(secs * 500000);

	if (module) {
		error = init_module(file, len, "");
		if (error)
			err(1, "init_module '%s'", module);
	}
	printf("Modprobe done on cpu %i\n", cpu);
	exit(0);
}

static void measure_latency(int cpu, int secs, int writefd, int pollfd)
{
	struct timeval start_time, now, elapsed_time, previous_time, diff;
	uint64_t least, max_diff, no_of_diffs;
	cpu_set_t this_cpu;
	fd_set readfds;
	struct timeval timeout = { .tv_sec = 5 };
	char buf[1024];
	struct sched_param sparam = { .sched_priority = 50 };

	least = UINT64_MAX;
	max_diff = 0;
	no_of_diffs = 0;

	CPU_ZERO(&this_cpu);
	CPU_SET(cpu, &this_cpu);

	if (sched_setaffinity(getpid(), sizeof(cpu_set_t), &this_cpu) != 0)
		err(1, "Could not move to cpu %i", cpu);

	if (sched_setscheduler(getpid(), SCHED_FIFO, &sparam) != 0)
		err(1, "Could not set FIFO scheduler");

	/* Note that we're ready. */
	write(writefd, "", 1);

	/* Wait for go signal. */
	FD_ZERO(&readfds);
	FD_SET(pollfd, &readfds);

	/* We can timeout. */
	if (select(pollfd + 1, &readfds, NULL, NULL, &timeout) != 1)
		exit(1);

	gettimeofday(&start_time, NULL);
	previous_time = start_time;

	do {
		gettimeofday(&now, NULL);
		/* call conv_timeval func; apply to now and previous time; calc diff */
		time_diff(&previous_time, &now, &diff);

		if (timeval_to_usecs(diff) > max_diff)
			max_diff = timeval_to_usecs(diff);
		if (timeval_to_usecs(diff) < least)
			/* This should always return 0 */
			least = timeval_to_usecs(diff);

		/* Work out time to elapse since the starting time */
		time_diff(&start_time, &now, &elapsed_time);

		/* reset previous_time to now */
		previous_time = now;
		no_of_diffs++;
	} while (elapsed_time.tv_sec < secs);

	sprintf(buf, "CPU %u: %llu diffs, min/avg/max = %llu/%llu/%llu\n",
		cpu, no_of_diffs, least,
		timeval_to_usecs(elapsed_time) / no_of_diffs, max_diff);
	write(STDOUT_FILENO, buf, strlen(buf));
	exit(0);
}

int main(int argc, char *argv[])
{
	int i, secs, status, tochildren[2], fromchild[2], arg;
	const char *module;

	if (argc < 3) {
		printf("Usage: %s [--modprobe=<module>] <seconds> <cpunum>...\n",
		       argv[0]);
		exit(1);
	}

	arg = 1;
	if (strncmp(argv[arg], "--modprobe=", strlen("--modprobe=")) == 0) {
		module = argv[arg] + 11;
		arg++;
	} else
		module = NULL;

	if (pipe(tochildren) != 0 || pipe(fromchild) != 0)
		err(1, "Creating pipes");

	secs = atoi(argv[arg++]);

	switch (fork()) {
	case -1:
		err(1, "fork failed");
	case 0:
		do_modprobe(atoi(argv[arg]), tochildren[0], secs, module);
	}

	for (i = arg + 1; i < argc; i++) {
		char c;

		switch (fork()) {
		case -1:
			err(1, "fork failed");
		case 0:
			measure_latency(atoi(argv[i]), secs,
					fromchild[1], tochildren[0]);
		}
		if (read(fromchild[0], &c, 1) != 1)
			err(1, "Read from child failed");
	}

	/* Tell them to go. */
	write(tochildren[1], "", 1);

	/* Wait for the children. */
	status = 0;
	for (i = arg; i < argc; i++) {
		if (status == 0) {
			wait(&status);
			if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
				status = 1;
		} else
			wait(NULL);
	}
	return status;
}

^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CPU Hotplug rework 2012-03-20 23:00 ` Rusty Russell @ 2012-03-21 9:01 ` Peter Zijlstra 2012-03-22 4:25 ` Rusty Russell 0 siblings, 1 reply; 38+ messages in thread From: Peter Zijlstra @ 2012-03-21 9:01 UTC (permalink / raw) To: Rusty Russell Cc: Srivatsa S. Bhat, Paul E. McKenney, Arjan van de Ven, Steven Rostedt, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list On Wed, 2012-03-21 at 09:30 +1030, Rusty Russell wrote: > > > (2) Do something more efficient with userspace threads than migrating > > > them one at a time. > > > > Sadly that can't really be done. We need to pick up every task > > (userspace, but also running kernel threads) and update their state. > > What if we had an "orphan" runqueue which everyone pulled from? Then we > could grab the lock, move them all to the fake rq, then let stuff happen > normally. Well, we could simply let them sit where they are and fudge load-balance to consider it a source but not a destination until it's empty, but it might be somewhat tricky to make it fast enough to not introduce noticeable latencies. Also, you really don't want everyone to pull, that's a serialization/scalability problem. Also, since we really only move the currently runnable tasks it shouldn't be too many in the first place. Is it really that expensive? > Maybe that's crap, but at least we could move the migration out of the > hotplug callback somehow. Thing is, if it's really too much for some people, they can orchestrate it such that it's not. Just move everybody in a cpuset, clear the to be offlined cpu from the cpuset's mask -- this will migrate everybody away. Then hotplug will find an empty runqueue and it's fast, no? ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CPU Hotplug rework 2012-03-21 9:01 ` Peter Zijlstra @ 2012-03-22 4:25 ` Rusty Russell 2012-03-22 22:49 ` Paul E. McKenney 0 siblings, 1 reply; 38+ messages in thread From: Rusty Russell @ 2012-03-22 4:25 UTC (permalink / raw) To: Peter Zijlstra Cc: Srivatsa S. Bhat, Paul E. McKenney, Arjan van de Ven, Steven Rostedt, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list On Wed, 21 Mar 2012 10:01:59 +0100, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote: > On Wed, 2012-03-21 at 09:30 +1030, Rusty Russell wrote: > > > > (2) Do something more efficient with userspace threads than migrating > > > > them one at a time. > > > > > > Sadly that can't really be done. We need to pick up every task > > > (userspace, but also running kernel threads) and update their state. > > > > What if we had an "orphan" runqueue which everyone pulled from? Then we > > could grab the lock, move them all to the fake rq, then let stuff happen > > normally. > > Well, we could simply let them sit where they are and fudge load-balance > to consider it a source but not a destination until its empty, but it > might be somewhat tricky to make it fast enough to not introduce > noticable latencies. Also, you really don't want everyone to pull, > that's a serialization/scalability problem. > > Also, since we really only move the currently runnable tasks it > shouldn't be too many in the first place. Is it really that expensive? Good question, requires measurement to answer. > > Maybe that's crap, but at least we could move the migration out of the > > hotplug callback somehow. > > Thing is, if its really too much for some people, they can orchestrate > it such that its not. Just move everybody in a cpuset, clear the to be > offlined cpu from the cpuset's mask -- this will migrate everybody away. > Then hotplug will find an empty runqueue and its fast, no? 
I like this solution better. Thanks, Rusty. -- How could I marry someone with more hair than me? http://baldalex.org ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CPU Hotplug rework 2012-03-22 4:25 ` Rusty Russell @ 2012-03-22 22:49 ` Paul E. McKenney 2012-03-23 23:27 ` Rusty Russell 0 siblings, 1 reply; 38+ messages in thread From: Paul E. McKenney @ 2012-03-22 22:49 UTC (permalink / raw) To: Rusty Russell Cc: Peter Zijlstra, Srivatsa S. Bhat, Arjan van de Ven, Steven Rostedt, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list On Thu, Mar 22, 2012 at 02:55:04PM +1030, Rusty Russell wrote: > On Wed, 21 Mar 2012 10:01:59 +0100, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote: > > On Wed, 2012-03-21 at 09:30 +1030, Rusty Russell wrote: > > > > > (2) Do something more efficient with userspace threads than migrating > > > > > them one at a time. > > > > > > > > Sadly that can't really be done. We need to pick up every task > > > > (userspace, but also running kernel threads) and update their state. > > > > > > What if we had an "orphan" runqueue which everyone pulled from? Then we > > > could grab the lock, move them all to the fake rq, then let stuff happen > > > normally. > > > > Well, we could simply let them sit where they are and fudge load-balance > > to consider it a source but not a destination until its empty, but it > > might be somewhat tricky to make it fast enough to not introduce > > noticable latencies. Also, you really don't want everyone to pull, > > that's a serialization/scalability problem. > > > > Also, since we really only move the currently runnable tasks it > > shouldn't be too many in the first place. Is it really that expensive? > > Good question, requires measurement to answer. > > > > Maybe that's crap, but at least we could move the migration out of the > > > hotplug callback somehow. > > > > Thing is, if its really too much for some people, they can orchestrate > > it such that its not. 
Just move everybody in a cpuset, clear the to be > > offlined cpu from the cpuset's mask -- this will migrate everybody away. > > Then hotplug will find an empty runqueue and its fast, no? > > I like this solution better. As long as we have some way to handle kthreads that are algorithmically tied to a given CPU. There are coding conventions to handle this, for example, do everything with preemption disabled and just after each preempt_disable() verify that you are in fact running on the correct CPU, but it is easy to imagine improvements. Thanx, Paul ^ permalink raw reply [flat|nested] 38+ messages in thread
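The coding convention Paul describes (pin yourself with preempt_disable(), then verify you are where you think you are) might look like this in kernel-style code. A sketch only; my_percpu_work() and the retry policy are made up for illustration:

```c
/* Sketch of the "check the CPU right after preempt_disable()" idiom. */
static int my_percpu_work(int target_cpu)
{
	int ret = -EAGAIN;

	preempt_disable();
	if (smp_processor_id() == target_cpu) {
		/* We cannot migrate until preempt_enable(), so it is
		 * safe to touch this CPU's private state here. */
		/* ... operate on per-cpu data ... */
		ret = 0;
	}
	preempt_enable();
	return ret;	/* caller retries elsewhere on -EAGAIN */
}
```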
* Re: CPU Hotplug rework 2012-03-22 22:49 ` Paul E. McKenney @ 2012-03-23 23:27 ` Rusty Russell 2012-03-24 0:23 ` Paul E. McKenney 0 siblings, 1 reply; 38+ messages in thread From: Rusty Russell @ 2012-03-23 23:27 UTC (permalink / raw) To: paulmck Cc: Peter Zijlstra, Srivatsa S. Bhat, Arjan van de Ven, Steven Rostedt, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list On Thu, 22 Mar 2012 15:49:20 -0700, "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote: > On Thu, Mar 22, 2012 at 02:55:04PM +1030, Rusty Russell wrote: > > On Wed, 21 Mar 2012 10:01:59 +0100, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote: > > > Thing is, if its really too much for some people, they can orchestrate > > > it such that its not. Just move everybody in a cpuset, clear the to be > > > offlined cpu from the cpuset's mask -- this will migrate everybody away. > > > Then hotplug will find an empty runqueue and its fast, no? > > > > I like this solution better. > > As long as we have some way to handle kthreads that are algorithmically > tied to a given CPU. There are coding conventions to handle this, for > example, do everything with preemption disabled and just after each > preempt_disable() verify that you are in fact running on the correct > CPU, but it is easy to imagine improvements. I don't think we should move per-cpu kthreads at all. Let's stop trying to save a few bytes of memory, and just leave them frozen. They'll run again if/when the CPU returns. Cheers, Rusty. -- How could I marry someone with more hair than me? http://baldalex.org ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CPU Hotplug rework 2012-03-23 23:27 ` Rusty Russell @ 2012-03-24 0:23 ` Paul E. McKenney 2012-03-26 0:41 ` Rusty Russell 0 siblings, 1 reply; 38+ messages in thread From: Paul E. McKenney @ 2012-03-24 0:23 UTC (permalink / raw) To: Rusty Russell Cc: Peter Zijlstra, Srivatsa S. Bhat, Arjan van de Ven, Steven Rostedt, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list On Sat, Mar 24, 2012 at 09:57:32AM +1030, Rusty Russell wrote: > On Thu, 22 Mar 2012 15:49:20 -0700, "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote: > > On Thu, Mar 22, 2012 at 02:55:04PM +1030, Rusty Russell wrote: > > > On Wed, 21 Mar 2012 10:01:59 +0100, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote: > > > > Thing is, if its really too much for some people, they can orchestrate > > > > it such that its not. Just move everybody in a cpuset, clear the to be > > > > offlined cpu from the cpuset's mask -- this will migrate everybody away. > > > > Then hotplug will find an empty runqueue and its fast, no? > > > > > > I like this solution better. > > > > As long as we have some way to handle kthreads that are algorithmically > > tied to a given CPU. There are coding conventions to handle this, for > > example, do everything with preemption disabled and just after each > > preempt_disable() verify that you are in fact running on the correct > > CPU, but it is easy to imagine improvements. > > I don't think we should move per-cpu kthreads at all. Let's stop trying > to save a few bytes of memory, and just leave them frozen. They'll run > again if/when the CPU returns. OK, that would work for me. So, how do I go about freezing RCU's per-CPU kthreads? Thanx, Paul ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CPU Hotplug rework 2012-03-24 0:23 ` Paul E. McKenney @ 2012-03-26 0:41 ` Rusty Russell 2012-03-26 8:02 ` Peter Zijlstra 0 siblings, 1 reply; 38+ messages in thread From: Rusty Russell @ 2012-03-26 0:41 UTC (permalink / raw) To: paulmck Cc: Peter Zijlstra, Srivatsa S. Bhat, Arjan van de Ven, Steven Rostedt, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list On Fri, 23 Mar 2012 17:23:47 -0700, "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote: > On Sat, Mar 24, 2012 at 09:57:32AM +1030, Rusty Russell wrote: > > On Thu, 22 Mar 2012 15:49:20 -0700, "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote: > > > On Thu, Mar 22, 2012 at 02:55:04PM +1030, Rusty Russell wrote: > > > > On Wed, 21 Mar 2012 10:01:59 +0100, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote: > > > > > Thing is, if its really too much for some people, they can orchestrate > > > > > it such that its not. Just move everybody in a cpuset, clear the to be > > > > > offlined cpu from the cpuset's mask -- this will migrate everybody away. > > > > > Then hotplug will find an empty runqueue and its fast, no? > > > > > > > > I like this solution better. > > > > > > As long as we have some way to handle kthreads that are algorithmically > > > tied to a given CPU. There are coding conventions to handle this, for > > > example, do everything with preemption disabled and just after each > > > preempt_disable() verify that you are in fact running on the correct > > > CPU, but it is easy to imagine improvements. > > > > I don't think we should move per-cpu kthreads at all. Let's stop trying > > to save a few bytes of memory, and just leave them frozen. They'll run > > again if/when the CPU returns. > > OK, that would work for me. So, how do I go about freezing RCU's > per-CPU kthreads? Good question. 
Obviously, having callbacks hanging around until the CPU comes back is not viable, nor is blocking preempt during the callbacks. Calling get_online_cpus() is too heavy. I can think of three approaches:

1) Put the being-processed rcu calls into a per-cpu var, and pull them off that list with preempt disabled. This lets us clean up after the thread gets frozen as its CPU goes offline, but doesn't solve the case of going offline during a callback.

2) Sync with the thread somehow during a notifier callback. This is the same kind of logic as shutting the thread down, so it's not really attractive from a simplicity POV.

3) Create a per-cpu rwsem to stop a specific CPU from going down, and just grab that while we're processing rcu callbacks.

If this pattern of kthread is common, then #3 (or some equivalent lightweight way of stopping a specific CPU from going offline) is looking attractive. Cheers, Rusty. -- How could I marry someone with more hair than me? http://baldalex.org ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CPU Hotplug rework 2012-03-26 0:41 ` Rusty Russell @ 2012-03-26 8:02 ` Peter Zijlstra 2012-03-26 13:09 ` Steven Rostedt 0 siblings, 1 reply; 38+ messages in thread From: Peter Zijlstra @ 2012-03-26 8:02 UTC (permalink / raw) To: Rusty Russell Cc: paulmck, Srivatsa S. Bhat, Arjan van de Ven, Steven Rostedt, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list On Mon, 2012-03-26 at 11:11 +1030, Rusty Russell wrote: > Obviously, having callbacks hanging around until the CPU comes back is > not viable, nor is blocking preempt during the callbacks. Calling > get_online_cpus() is too heavy. -rt has patches (albeit somewhat ugly) to make get_online_cpus() a br-style rw-lock. That is, it makes the read side a per-cpu lock/counter so that it's much faster and only touches cpu-local state in the common case. It would be good to clean those up and make get_online_cpus() fast and usable. ^ permalink raw reply [flat|nested] 38+ messages in thread
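The shape of such a br-style lock can be modelled in userspace with C11 atomics. This is a deliberately simplified single-writer sketch, not the actual -rt patch: readers touch only their own "CPU" slot on the fast path, spinning stands in for sleeping, and all fake_* names are invented here.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define NR_FAKE_CPUS 4

/* One reader counter per "CPU"; writer_waiting blocks new readers
 * while the hotplug (write) side is draining them. */
static atomic_int reader_count[NR_FAKE_CPUS];
static atomic_bool writer_waiting;

/* Read side: the common case touches only this CPU's counter. */
static void fake_get_online_cpus(int cpu)
{
	for (;;) {
		atomic_fetch_add(&reader_count[cpu], 1);
		if (!atomic_load(&writer_waiting))
			return;		/* fast, cpu-local path */
		/* A writer sneaked in: back out and wait it out. */
		atomic_fetch_sub(&reader_count[cpu], 1);
		while (atomic_load(&writer_waiting))
			;		/* the real code sleeps here */
	}
}

static void fake_put_online_cpus(int cpu)
{
	atomic_fetch_sub(&reader_count[cpu], 1);
}

/* Write side (cpu_down()): forbid new readers, then wait for every
 * CPU's reader count to drain to zero. */
static void fake_cpu_hotplug_begin(void)
{
	atomic_store(&writer_waiting, true);
	for (int cpu = 0; cpu < NR_FAKE_CPUS; cpu++)
		while (atomic_load(&reader_count[cpu]) != 0)
			;		/* the real code sleeps here too */
}

static void fake_cpu_hotplug_done(void)
{
	atomic_store(&writer_waiting, false);
}
```

The read side costs one atomic on a cpu-local counter in the common case; only the rare write side ever sweeps all the counters, which is the property being asked for above.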
* Re: CPU Hotplug rework 2012-03-26 8:02 ` Peter Zijlstra @ 2012-03-26 13:09 ` Steven Rostedt 2012-03-26 13:38 ` Peter Zijlstra 0 siblings, 1 reply; 38+ messages in thread From: Steven Rostedt @ 2012-03-26 13:09 UTC (permalink / raw) To: Peter Zijlstra Cc: Rusty Russell, paulmck, Srivatsa S. Bhat, Arjan van de Ven, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list On Mon, 2012-03-26 at 10:02 +0200, Peter Zijlstra wrote: > On Mon, 2012-03-26 at 11:11 +1030, Rusty Russell wrote: > > Obviously, having callbacks hanging around until the CPU comes back is > > not viable, nor is blocking preempt during the callbacks. Calling > > get_online_cpus() is too heavy. > > -rt has patches (albeit somewhat ugly) to make get_online_cpus() a br > style rw-lock. Unfortunately, -rt still has broken cpu hotplug. -- Steve ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: CPU Hotplug rework
From: Peter Zijlstra @ 2012-03-26 13:38 UTC (permalink / raw)
To: Steven Rostedt
Cc: Rusty Russell, paulmck, Srivatsa S. Bhat, Arjan van de Ven, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list

On Mon, 2012-03-26 at 09:09 -0400, Steven Rostedt wrote:
> On Mon, 2012-03-26 at 10:02 +0200, Peter Zijlstra wrote:
> > On Mon, 2012-03-26 at 11:11 +1030, Rusty Russell wrote:
> > > Obviously, having callbacks hanging around until the CPU comes back is
> > > not viable, nor is blocking preempt during the callbacks. Calling
> > > get_online_cpus() is too heavy.
> >
> > -rt has patches (albeit somewhat ugly) to make get_online_cpus() a br
> > style rw-lock.
>
> Unfortunately, -rt still has broken cpu hotplug.

This is still the issue where we take the hotplug lock for write, but
then rely on other threads to complete -- say CPU_DOWN_PREPARE notifiers
doing flush_workqueue(), and the worker kthread needing to acquire the
hotplug lock for read, right?

Yes, that is bothersome, but it doesn't hinder the work required to get
get_online_cpus() usably fast, right?
* Re: CPU Hotplug rework
From: Steven Rostedt @ 2012-03-26 15:22 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Rusty Russell, paulmck, Srivatsa S. Bhat, Arjan van de Ven, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list

On Mon, 2012-03-26 at 15:38 +0200, Peter Zijlstra wrote:
> This is still the issue where we take the hotplug lock for write, but
> then rely on other threads to complete -- say CPU_DOWN_PREPARE notifiers
> doing flush_workqueue(), and the worker kthread needing to acquire the
> hotplug lock for read, right?

Right. I did get it working, but it required making the current "ugly"
code even uglier, and Thomas nacked it for that reason. He wanted a
cleanup of hotplug instead.

> Yes, that is bothersome, but doesn't hinder the work required to get
> get_online_cpus() usably fast, right?

Yep.

What we do in -rt currently (without my updates) is create a thread
pinned to the CPU going down. This thread counts the "readers" that are
on the CPU -- that is, every user that wanted to do a
get_online_cpus(), and as -rt allows preemption while doing so, there
may be more than one of these for a given CPU.

Once the count goes to zero, new tasks that enter this path will block
on the hotplug lock. Then the CPU task we created is done and the cpu
offline continues. But then we call the cpu offline notifiers. Some of
the notifiers wake up threads and tell them to exit using
kthread_stop(). But if the thread being waited for does something that
enters the above path, it will block on the hotplug lock too, and cause
a deadlock. Unfortunately, in -rt the sleeping spinlocks go into this
path, thus any thread that is going down and grabs a sleeping spinlock
will hit this deadlock.

There's also the case where the thread that is going offline, or even
the notifier itself, may grab a lock that happens to be held by another
thread that is on the offlining CPU but is blocked on the hotplug lock.
This also causes a deadlock.

The workaround I added was to do several things:

1) Instead of blocking on the hotplug lock, a task tries to migrate
itself instead. If it succeeds, then we don't need to worry about that
thread. But if the thread is pinned to the CPU, then we need to worry
about it. I first had it block only in this case, but that wasn't good
enough, so I let them just continue.

2) Had the CPU thread that was created run in multiple phases. In the
first phase it still waited for the cpu ref counter to go to zero, but
instead of having tasks block, tasks would instead try to migrate, and
if they could not, just continue.

3) After all the notifiers are finished, notify the created CPU thread
to sync tasks. Now that the notifiers are done, we can make any
remaining task block. That is, the old method is used, where the CPU
thread waits for the ref counter to go to zero, and new tasks block on
the hotplug lock. Because this happens after the notifiers, we do not
need to worry about the previous deadlocks.

The above changes mostly worked. Where they broke was in paths in
bringing up the CPU that may call code with sleeping spinlocks while
preemption is disabled. But this is actually unrelated to cpu hotplug
itself, and more related to the normal problems caused by the sleeping
spinlocks added by -rt. The above changes also required reverting the
-rt changes done to workqueue.

Now, the issues we have:

1) We need to get tasks off the CPU going down. For most tasks this is
not an issue, but for CPU-specific kernel threads it can be. Getting
tasks off of the CPU is required before the notifiers are called, to
keep them from creating work on the CPU, because after the notifiers
there should be no more work added to the CPU.

2) Some tasks are going to go down and exit. We can audit all the
notifier callbacks for CPU offlining and see if we can just make those
threads dormant instead of killing them. As Rusty said, it may not be
that important to save the memory of these tasks.

3) Some tasks do not go offline; instead they just get moved to another
CPU. This is the case for ksoftirqd, as it is killed after the CPU is
down (POST_DEAD) -- at least in -rt it is.

All that is needed now is, at the beginning of taking the CPU down, to
push off tasks from the CPU that may migrate, then call the notifiers,
and then block the rest and take the CPU down. This seems to work fine;
it was just the implementation I proposed that was a bit too ugly for
Thomas's taste.

-- Steve
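[Editor's note: the multi-phase teardown Steven describes can be expressed as a small state machine -- first a migrate-or-continue phase, then the notifiers, then a hard sync where latecomers block. This is a hypothetical single-threaded userspace sketch of that ordering, not the -rt patch itself; all names are invented.]

```c
#include <stdbool.h>

/* Phases of the offlining CPU: normal operation, the pre-notifier
 * drain (tasks migrate away if they can), and the post-notifier sync
 * (everyone still around must block). */
enum hp_phase { HP_RUNNING, HP_MIGRATE, HP_SYNC };

struct hp_cpu {
    enum hp_phase phase;
    int pinned_readers;   /* readers that could not migrate away */
};

/* A task entering the hotplug read path during teardown. Returns
 * true if it may keep running on this CPU; false means it either
 * migrated to another CPU or is now blocked on the hotplug lock. */
static bool hp_task_enter(struct hp_cpu *c, bool can_migrate)
{
    switch (c->phase) {
    case HP_RUNNING:
        return true;                  /* no offline in progress */
    case HP_MIGRATE:
        if (can_migrate)
            return false;             /* pushed to another CPU */
        c->pinned_readers++;          /* pinned: let it continue */
        return true;
    case HP_SYNC:
    default:
        return false;                 /* after notifiers: block */
    }
}

/* The offline may complete only after the notifiers ran (HP_SYNC)
 * and every pinned reader has left its critical section. */
static bool hp_can_finish(struct hp_cpu *c)
{
    return c->phase == HP_SYNC && c->pinned_readers == 0;
}
```

The key property being modeled: pinned tasks are never blocked before the notifiers run, so the flush_workqueue()/kthread_stop() deadlocks described above cannot occur; blocking only starts in the final sync phase.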
* Re: CPU Hotplug rework
From: Peter Zijlstra @ 2012-03-26 16:13 UTC (permalink / raw)
To: Steven Rostedt
Cc: Rusty Russell, paulmck, Srivatsa S. Bhat, Arjan van de Ven, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list

On Mon, 2012-03-26 at 11:22 -0400, Steven Rostedt wrote:
> The workaround I added was to do several things:
>
> 1) instead of blocking on the hotplug lock, try to migrate itself
> instead. If it succeeds, then we don't need to worry about this thread.
> But if the thread is pinned to the CPU, then we need to worry about it.
> I first had it block only in this case, but that wasn't good enough, so
> I let them just continue.
>
> 2) had the CPU thread that was created do a multi-phase. The first phase
> it still waited for the cpu ref counter to go to zero, but instead of
> having tasks block, tasks would instead try to migrate and if they could
> not then just continue.
>
> 3) after all the notifiers are finished, notify the created CPU thread
> to sync tasks. Now that the notifiers are done, we can make any
> remaining task block. That is, the old method is done, where the CPU
> thread waits for the ref counter to go to zero, and new tasks will block
> on the hotplug lock. Because this happens after the notifiers, we do not
> need to worry about the previous deadlocks.

So how about we add another variant of kthread_freezable_should_stop(),
maybe call it kthread_bound_should_stop(), that checks whether the cpu
it is bound to goes awol, and if so, parks it.

Then after CPU_DOWN_PREPARE, wait for all such threads (as registered
per kthread_bind()) to pass through kthread_bound_should_stop() and get
frozen.

This should restore PF_THREAD_BOUND to mean the task is actually bound
to this cpu, since if the cpu goes down, the task won't actually run at
all. Which means you can again use PF_THREAD_BOUND to bypass the whole
get_online_cpus()/pin_curr_cpu() muck.

Any subsystem that can still accrue state after this (e.g. softirq/rcu
and possibly kworker) needs to register a CPU_DYING or CPU_DEAD notifier
to either complete the state or take it away and give it to someone
else.

> Now what are the issues we have:
>
> 1) We need to get tasks off the CPU going down. For most tasks this is
> not an issue. But for CPU specific kernel threads, this can be an issue.
> To get tasks off of the CPU is required before the notifiers are called.
> This is to keep them from creating work on the CPU, because after the
> notifiers, there should be no more work added to the CPU.

This is hard for things like ksoftirqd, because for as long as
interrupts are enabled we can trigger softirqs. And since we need to
deal with that, we might as well deal with it for all and not bother.

See the CPU_DYING/DEAD notifier as described above that can deal with
this.

> 2) Some tasks are going to go down and exit. We can audit all the
> notifier callbacks for CPU offlining, and see if we can just make them
> dormant instead of killing them. As Rusty said, it may not be that
> important to save the memory of these tasks.

Right, this shouldn't be a difficult task, but it isn't required for -rt
afaict; it's just good practice.

> 3) Some tasks do not go offline, instead they just get moved to another
> CPU. This is the case of ksoftirqd. As it is killed after the CPU is
> down (POST_DEAD) (at least in -rt it is).

No, we should really stop allowing tasks that were kthread_bind() to run
anywhere else. Breaking the strict affinity and letting them run
someplace else to complete their work is what gets us into a whole heap
of trouble.

> All that is needed now is, at the beginning of taking the CPU down is to
> push off tasks from the CPU that may migrate. Then call the notifiers,
> and then block the rest and take the CPU down. This seems to work fine.
> It was just the implementation I proposed was a bit too ugly for
> Thomas's taste.

I really don't see the point in that.
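[Editor's note: the proposed kthread_bound_should_stop() loop can be sketched as follows. This is a hypothetical userspace model -- the flag-based "parking" stands in for what the real thing would do by calling schedule() until the CPU returns; the names are invented here, although a similar interface (kthread_park()/kthread_parkme()) did land in later kernels.]

```c
#include <stdbool.h>

enum cpu_state { CPU_UP, CPU_GOING_DOWN };

/* State a bound kthread consults in its main work loop. */
struct bound_kthread {
    enum cpu_state *cpu;   /* state of the CPU we are bound to */
    bool should_stop;      /* kthread_stop() was requested */
    bool parked;           /* thread is sitting out the offline */
};

/* Returns true only when the thread must break out of its work loop
 * and exit. If the bound CPU is going down, the thread parks instead
 * of stopping, so it never has to run on another CPU. */
static bool kthread_bound_should_stop(struct bound_kthread *k)
{
    if (k->should_stop)
        return true;       /* a real stop request: exit the loop */
    if (*k->cpu == CPU_GOING_DOWN) {
        k->parked = true;  /* real code: schedule() until CPU_UP */
        return false;
    }
    k->parked = false;     /* CPU is back: resume doing work */
    return false;
}
```

This is what restores the PF_THREAD_BOUND invariant Peter mentions: a parked thread never runs at all while its CPU is offline, so strict affinity is never broken.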
* Re: CPU Hotplug rework
From: Steven Rostedt @ 2012-03-26 17:05 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Rusty Russell, paulmck, Srivatsa S. Bhat, Arjan van de Ven, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list

On Mon, 2012-03-26 at 18:13 +0200, Peter Zijlstra wrote:
> So how about we add another variant of kthread_freezable_should_stop(),
> maybe call it kthread_bound_should_stop() that checks if the cpu its
> bound to goes awol, if so, park it.

Do you mean to have this function automate the "park"? When it is
called, if the cpu is going down, should it simply schedule off and not
return until the CPU comes back online?

Actually, why not just keep kthread_should_stop() and instead create a
kthread_park(), and call that instead of kthread_stop()? Then when the
task calls kthread_should_stop(), that can park the thread.

> Then after CPU_DOWN_PREPARE, wait for all such threads (as registered
> per kthread_bind()) to pass through kthread_bound_should_stop() and get
> frozen.

We could have the notifiers call kthread_park().

> This should restore PF_THREAD_BOUND to mean its actually bound to this
> cpu, since if the cpu goes down, the task won't actually run at all.
> Which means you can again use PF_THREAD_BOUND to by-pass the whole
> get_online_cpus()/pin_curr_cpu() muck.
>
> Any subsystem that can still accrue state after this (eg, softirq/rcu
> and possible kworker) need to register a CPU_DYING or CPU_DEAD notifier
> to either complete the state or take it away and give it to someone
> else.

I'm afraid that this part is easier said than done.

> > Now what are the issues we have:
> >
> > 1) We need to get tasks off the CPU going down. For most tasks this is
> > not an issue. But for CPU specific kernel threads, this can be an issue.
> > To get tasks off of the CPU is required before the notifiers are called.
> > This is to keep them from creating work on the CPU, because after the
> > notifiers, there should be no more work added to the CPU.
>
> This is hard for things like ksoftirqd, because for as long as interrupts
> are enabled we can trigger softirqs. And since we need to deal with
> that, we might as well deal with it for all and not bother.

Heh, at least for -rt we don't need to worry about that, as interrupts
are threads and are moved to other CPUs. Although I'm not sure that's
true of the timer softirq.

> See the CPU_DYING/DEAD notifier as described above that can deal with
> this.
>
> > 2) Some tasks are going to go down and exit. We can audit all the
> > notifier callbacks for CPU offlining, and see if we can just make them
> > dormant instead of killing them. As Rusty said, it may not be that
> > important to save the memory of these tasks.
>
> Right, this shouldn't be a difficult task, but isn't required for -rt
> afaict, its just good practise.

If we get a consensus to do this, then sure.

> > 3) Some tasks do not go offline, instead they just get moved to another
> > CPU. This is the case of ksoftirqd. As it is killed after the CPU is
> > down (POST_DEAD) (at least in -rt it is).
>
> No, we should really stop allowing tasks that were kthread_bind() to run
> anywhere else. Breaking the strict affinity and letting them run
> someplace else to complete their work is what gets us into a whole heap
> of trouble.

Agreed, but fixing this is not an easy problem.

> > All that is needed now is, at the beginning of taking the CPU down is to
> > push off tasks from the CPU that may migrate. Then call the notifiers,
> > and then block the rest and take the CPU down. This seems to work fine.
> > It was just the implementation I proposed was a bit too ugly for
> > Thomas's taste.
>
> I really don't see the point in that.

It was the easiest fix for the current state of the kernel.

-- Steve
* Re: CPU Hotplug rework
From: Peter Zijlstra @ 2012-03-26 17:59 UTC (permalink / raw)
To: Steven Rostedt
Cc: Rusty Russell, paulmck, Srivatsa S. Bhat, Arjan van de Ven, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list

On Mon, 2012-03-26 at 13:05 -0400, Steven Rostedt wrote:
> On Mon, 2012-03-26 at 18:13 +0200, Peter Zijlstra wrote:
> > So how about we add another variant of kthread_freezable_should_stop(),
> > maybe call it kthread_bound_should_stop() that checks if the cpu its
> > bound to goes awol, if so, park it.
>
> Do you mean to have this function automate the "park". When it is
> called, if the cpu is going down it should simply schedule off and not
> return until the CPU comes back on line?

Yep.

> Actually, why not just keep "kthread_should_stop()" and instead create a
> "kthread_park()", and call that instead of kthread_stop(). Then when the
> task calls kthread_should_stop(), that can park the thread then.

That would add an

	if ((current->flags & PF_THREAD_BOUND) && kthread_should_park(cpu))

conditional to every kthread_stop() invocation. So, following the
example of kthread_freezable_should_stop(), I opted for another
function.

Note that kernel/workqueue.c should be fixed to use kthread_stop() or
whatever variant we implement, as it currently uses a home-brewed
solution to stop threads.

> > Then after CPU_DOWN_PREPARE, wait for all such threads (as registered
> > per kthread_bind()) to pass through kthread_bound_should_stop() and get
> > frozen.
>
> We could have the notifiers call kthread_park().

You mean to avoid having to track them through kthread_bind()? The
advantage of tracking them is that it's harder to 'forget' about one.

> > This should restore PF_THREAD_BOUND to mean its actually bound to this
> > cpu, since if the cpu goes down, the task won't actually run at all.
> > Which means you can again use PF_THREAD_BOUND to by-pass the whole
> > get_online_cpus()/pin_curr_cpu() muck.
> >
> > Any subsystem that can still accrue state after this (eg, softirq/rcu
> > and possible kworker) need to register a CPU_DYING or CPU_DEAD notifier
> > to either complete the state or take it away and give it to someone
> > else.
>
> I'm afraid that this part sounds easier than done.

Got anything particularly difficult in mind?

Workqueues can give the gcwq to unbound threads -- it doesn't guarantee
the per-cpu-ness of work items anyway. Softirqs can be run from
CPU_DYING, since interrupts will never be enabled again at that point.
RCU would have to make sure the cpu doesn't complete a grace period,
and fix up from CPU_DEAD: have it complete any outstanding grace
periods, move it to extended idle and steal the callback list.

I'm not sure there's anything really hard there.

> > > Now what are the issues we have:
> > >
> > > 1) We need to get tasks off the CPU going down. For most tasks this is
> > > not an issue. But for CPU specific kernel threads, this can be an issue.
> > > To get tasks off of the CPU is required before the notifiers are called.
> > > This is to keep them from creating work on the CPU, because after the
> > > notifiers, there should be no more work added to the CPU.
> >
> > This is hard for things like ksoftirq, because for as long as interrupts
> > are enabled we can trigger softirqs. And since we need to deal with
> > that, we might as well deal with it for all and not bother.
>
> Heh, at least for -rt we don't need to worry about that. As interrupts
> are threads and are moved to other CPUS. Although I'm not sure that's
> true about the timer softirq.

It's a problem for -rt too, since as long as interrupts are enabled (and
we can schedule) interrupts can come in and wake their respective
threads; this can happen during the CPU_DOWN_PREPARE notifiers just
fine. For both -rt and mainline we can schedule right up until we call
stop-machine; mainline (!threadirq) will continue servicing interrupts
another few instructions until the stop_machine bits disable interrupts
on all cpus. The difference is really not that big.

> > > 3) Some tasks do not go offline, instead they just get moved to another
> > > CPU. This is the case of ksoftirqd. As it is killed after the CPU is
> > > down (POST_DEAD) (at least in -rt it is).
> >
> > No, we should really stop allowing tasks that were kthread_bind() to run
> > anywhere else. Breaking the strict affinity and letting them run
> > someplace else to complete their work is what gets us into a whole heap
> > of trouble.
>
> Agreed, but to fix this is not an easy problem.

I'm not sure it's that hard, just work. If we get the above stuff done,
we should be able to put BUG_ON(p->flags & PF_THREAD_BOUND) in
select_fallback_rq().

Also, I think you should opt for the solution that has the
cleanest/strongest semantics, so you can add more debug infrastructure
around it to enforce it.
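[Editor's note: the "take it away and give it to someone else" option Peter describes for a CPU_DEAD notifier -- e.g. RCU stealing the dead CPU's callback list -- amounts to a list splice onto a surviving CPU. This is an illustrative userspace sketch; the types and names are invented here and are not the kernel's rcu_data structures.]

```c
#include <stddef.h>

/* A pending callback on some per-cpu singly linked list. */
struct callback {
    struct callback *next;
};

struct percpu_state {
    struct callback *head;  /* pending-callback list */
    int nr_pending;
};

/* Called from a CPU_DEAD notifier on a surviving CPU: splice the dead
 * CPU's pending callbacks onto our own list and empty the dead CPU's
 * state, so no work is lost when the CPU goes away. */
static void adopt_callbacks(struct percpu_state *live,
                            struct percpu_state *dead)
{
    struct callback **tail = &live->head;

    while (*tail)               /* walk to the end of our own list */
        tail = &(*tail)->next;
    *tail = dead->head;         /* splice the orphaned callbacks */
    live->nr_pending += dead->nr_pending;
    dead->head = NULL;          /* dead CPU now owns nothing */
    dead->nr_pending = 0;
}
```

Splicing preserves ordering within each list, which matters for callback semantics: the adopted callbacks still run, just on a different CPU.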
* Re: CPU Hotplug rework
From: Rusty Russell @ 2012-03-27 1:32 UTC (permalink / raw)
To: Steven Rostedt, Peter Zijlstra
Cc: paulmck, Srivatsa S. Bhat, Arjan van de Ven, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list

On Mon, 26 Mar 2012 13:05:41 -0400, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Mon, 2012-03-26 at 18:13 +0200, Peter Zijlstra wrote:
> > So how about we add another variant of kthread_freezable_should_stop(),
> > maybe call it kthread_bound_should_stop() that checks if the cpu its
> > bound to goes awol, if so, park it.
>
> Do you mean to have this function automate the "park". When it is
> called, if the cpu is going down it should simply schedule off and not
> return until the CPU comes back on line?
>
> Actually, why not just keep "kthread_should_stop()" and instead create a
> "kthread_park()", and call that instead of kthread_stop(). Then when the
> task calls kthread_should_stop(), that can park the thread then.

Do we ever have to do this? Why not say we never move kthreads, and if
anyone really needs this, they have to do it themselves.

We *will* need a way for the kthreads to say "don't take this cpu away
now" in a very lightweight way, but I'm happy with the -rt solution that
Steven mentioned.

Basically, it's time to revisit cpu hotplug entirely. Let's not get too
caught up in what we've got now.

Cheers,
Rusty.

PS. And I'm sure you'll finish the work while I'm on honeymoon :)

--
How could I marry someone with more hair than me? http://baldalex.org
* Re: CPU Hotplug rework
From: Steven Rostedt @ 2012-03-27 3:05 UTC (permalink / raw)
To: Rusty Russell
Cc: Peter Zijlstra, paulmck, Srivatsa S. Bhat, Arjan van de Ven, Rafael J. Wysocki, Srivatsa Vaddagiri, akpm@linux-foundation.org, Paul Gortmaker, Milton Miller, mingo@elte.hu, Tejun Heo, KOSAKI Motohiro, linux-kernel, Linux PM mailing list

On Tue, 2012-03-27 at 12:02 +1030, Rusty Russell wrote:
> PS. And I'm sure you'll finish the work while I'm on honeymoon :)

I'll be getting ready for Linux Collaboration. But have fun on your
honeymoon, and congrats!

-- Steve