* [PATCH] cpuset: Make rebuild_sched_domains() usable from any context (take 2)
@ 2008-07-16 6:21 Max Krasnyansky
2008-07-16 7:11 ` Paul Menage
2008-07-18 11:28 ` Ingo Molnar
0 siblings, 2 replies; 4+ messages in thread
From: Max Krasnyansky @ 2008-07-16 6:21 UTC (permalink / raw)
To: mingo, pj
Cc: linux-kernel, Max Krasnyanskiy, a.p.zijlstra, menage,
vegard.nossum
From: Max Krasnyanskiy <maxk@qualcomm.com>
Basically as Paul J. pointed out rebuild_sched_domains() is the
only way to rebuild sched domains correctly based on the current
cpuset settings. What this means is that we need to be able to
call it from different contexts, like cpuhotplug for example.
Also if you look at the cpu_active_map patches, the scheduler now
calls rebuild_sched_domains() directly from places like
arch_init_sched_domains().
In order to support that properly we need to change cpuset locking
a bit, to avoid circular locking dependencies.
This patch makes rebuild_sched_domains() callable from any context.
The only requirement now is that the caller has to hold cpu_hotplug.lock
(ie get_online_cpus()). This allows cpu hotplug handlers and the
scheduler code to call rebuild_sched_domains() directly.
The rest of the cpuset code now offloads sched domain rebuilds to
a single-threaded workqueue, as suggested by both Paul J. and Paul M.
This passes light testing (building the kernel with -j 16, creating/removing
domains and bringing cpus off/online at the same time) on a dual-Core2
based machine.
It passes lockdep checks (for the first time), even with preemptible RCU
enabled.
So it looks like we're good to go :). Obviously needs more testing but
otherwise things are looking good.
Signed-off-by: Max Krasnyanskiy <maxk@qualcomm.com>
Cc: a.p.zijlstra@chello.nl
Cc: pj@sgi.com
Cc: mingo@elte.hu
Cc: menage@google.com
Cc: vegard.nossum@gmail.com
---
kernel/cpuset.c | 92 ++++++++++++++++++++++++++++++++++++------------------
1 files changed, 61 insertions(+), 31 deletions(-)
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 3c3ef02..4581e4f 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -59,6 +59,11 @@
#include <linux/cgroup.h>
/*
+ * Workqueue for cpuset related tasks.
+ */
+static struct workqueue_struct *cpuset_wq;
+
+/*
* Tracks how many cpusets are currently defined in system.
* When there is only one cpuset (the root cpuset) we can
* short circuit some hooks.
@@ -522,13 +527,12 @@ update_domain_attr(struct sched_domain_attr *dattr, struct cpuset *c)
* domains when operating in the severe memory shortage situations
* that could cause allocation failures below.
*
- * Call with cgroup_mutex held. May take callback_mutex during
- * call due to the kfifo_alloc() and kmalloc() calls. May nest
- * a call to the get_online_cpus()/put_online_cpus() pair.
- * Must not be called holding callback_mutex, because we must not
- * call get_online_cpus() while holding callback_mutex. Elsewhere
- * the kernel nests callback_mutex inside get_online_cpus() calls.
- * So the reverse nesting would risk an ABBA deadlock.
+ * Call under get_online_cpus().
+ * Must not be called with cgroup_lock or callback_mutex held.
+ * This means that internally cpuset code must not use this function
+ * and call __rebuild_sched_domains() instead. The only exception is
+ * cpu hotplug code where get_online_cpus() lock is taken at the top
+ * level and nesting is not an issue.
*
* The three key local variables below are:
* q - a kfifo queue of cpuset pointers, used to implement a
@@ -581,6 +585,10 @@ void rebuild_sched_domains(void)
doms = NULL;
dattr = NULL;
+ /* We have to iterate cgroup hierarchy, make sure nobody is messing
+ * with it. */
+ cgroup_lock();
+
/* Special case for the 99% of systems with one, full, sched domain */
if (is_sched_load_balance(&top_cpuset)) {
ndoms = 1;
@@ -598,10 +606,10 @@ void rebuild_sched_domains(void)
q = kfifo_alloc(number_of_cpusets * sizeof(cp), GFP_KERNEL, NULL);
if (IS_ERR(q))
- goto done;
+ goto unlock;
csa = kmalloc(number_of_cpusets * sizeof(cp), GFP_KERNEL);
if (!csa)
- goto done;
+ goto unlock;
csn = 0;
cp = &top_cpuset;
@@ -688,10 +696,16 @@ restart:
BUG_ON(nslot != ndoms);
rebuild:
+ /* Drop cgroup lock before calling the scheduler.
+ * This is not strictly necessary but simplifies lock nesting. */
+ cgroup_unlock();
+
/* Have scheduler rebuild sched domains */
- get_online_cpus();
partition_sched_domains(ndoms, doms, dattr);
- put_online_cpus();
+ goto done;
+
+unlock:
+ cgroup_unlock();
done:
if (q && !IS_ERR(q))
@@ -701,6 +715,30 @@ done:
/* Don't kfree(dattr) -- partition_sched_domains() does that. */
}
+/*
+ * Simply calls rebuild_sched_domains() with correct locking rules.
+ */
+static void rebuild_domains_callback(struct work_struct *work)
+{
+ get_online_cpus();
+ rebuild_sched_domains();
+ put_online_cpus();
+}
+static DECLARE_WORK(rebuild_domains_work, rebuild_domains_callback);
+
+/*
+ * Internal helper for rebuilding sched domains when something changes.
+ * rebuild_sched_domains() must be called under get_online_cpus() and
+ * it needs to take cgroup_lock(). But most of the cpuset code is already
+ * holding cgroup_lock() while calling __rebuild_sched_domains().
+ * In order to avoid incorrect lock nesting we delegate the actual domain
+ * rebuilding to the workqueue.
+ */
+static void __rebuild_sched_domains(void)
+{
+ queue_work(cpuset_wq, &rebuild_domains_work);
+}
+
static inline int started_after_time(struct task_struct *t1,
struct timespec *time,
struct task_struct *t2)
@@ -831,7 +869,7 @@ static int update_cpumask(struct cpuset *cs, char *buf)
heap_free(&heap);
if (is_load_balanced)
- rebuild_sched_domains();
+ __rebuild_sched_domains();
return 0;
}
@@ -1042,7 +1080,7 @@ static int update_relax_domain_level(struct cpuset *cs, s64 val)
if (val != cs->relax_domain_level) {
cs->relax_domain_level = val;
- rebuild_sched_domains();
+ __rebuild_sched_domains();
}
return 0;
@@ -1083,7 +1121,7 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
mutex_unlock(&callback_mutex);
if (cpus_nonempty && balance_flag_changed)
- rebuild_sched_domains();
+ __rebuild_sched_domains();
return 0;
}
@@ -1677,15 +1715,9 @@ static struct cgroup_subsys_state *cpuset_create(
}
/*
- * Locking note on the strange update_flag() call below:
- *
* If the cpuset being removed has its flag 'sched_load_balance'
* enabled, then simulate turning sched_load_balance off, which
- * will call rebuild_sched_domains(). The get_online_cpus()
- * call in rebuild_sched_domains() must not be made while holding
- * callback_mutex. Elsewhere the kernel nests callback_mutex inside
- * get_online_cpus() calls. So the reverse nesting would risk an
- * ABBA deadlock.
+ * will call rebuild_sched_domains().
*/
static void cpuset_destroy(struct cgroup_subsys *ss, struct cgroup *cont)
@@ -1894,7 +1926,7 @@ static void scan_for_empty_cpusets(const struct cpuset *root)
* in order to minimize text size.
*/
-static void common_cpu_mem_hotplug_unplug(int rebuild_sd)
+static void common_cpu_mem_hotplug_unplug(void)
{
cgroup_lock();
@@ -1902,13 +1934,6 @@ static void common_cpu_mem_hotplug_unplug(int rebuild_sd)
top_cpuset.mems_allowed = node_states[N_HIGH_MEMORY];
scan_for_empty_cpusets(&top_cpuset);
- /*
- * Scheduler destroys domains on hotplug events.
- * Rebuild them based on the current settings.
- */
- if (rebuild_sd)
- rebuild_sched_domains();
-
cgroup_unlock();
}
@@ -1934,12 +1959,14 @@ static int cpuset_handle_cpuhp(struct notifier_block *unused_nb,
case CPU_ONLINE_FROZEN:
case CPU_DEAD:
case CPU_DEAD_FROZEN:
- common_cpu_mem_hotplug_unplug(1);
break;
+
default:
return NOTIFY_DONE;
}
+ common_cpu_mem_hotplug_unplug();
+ rebuild_sched_domains();
return NOTIFY_OK;
}
@@ -1953,7 +1980,7 @@ static int cpuset_handle_cpuhp(struct notifier_block *unused_nb,
void cpuset_track_online_nodes(void)
{
- common_cpu_mem_hotplug_unplug(0);
+ common_cpu_mem_hotplug_unplug();
}
#endif
@@ -1969,6 +1996,9 @@ void __init cpuset_init_smp(void)
top_cpuset.mems_allowed = node_states[N_HIGH_MEMORY];
hotcpu_notifier(cpuset_handle_cpuhp, 0);
+
+ cpuset_wq = create_singlethread_workqueue("cpuset");
+ BUG_ON(!cpuset_wq);
}
/**
--
1.5.5.1
* Re: [PATCH] cpuset: Make rebuild_sched_domains() usable from any context (take 2)
2008-07-16 6:21 [PATCH] cpuset: Make rebuild_sched_domains() usable from any context (take 2) Max Krasnyansky
@ 2008-07-16 7:11 ` Paul Menage
2008-07-16 18:27 ` Max Krasnyansky
2008-07-18 11:28 ` Ingo Molnar
1 sibling, 1 reply; 4+ messages in thread
From: Paul Menage @ 2008-07-16 7:11 UTC (permalink / raw)
To: Max Krasnyansky; +Cc: mingo, pj, linux-kernel, a.p.zijlstra, vegard.nossum
On Tue, Jul 15, 2008 at 11:21 PM, Max Krasnyansky <maxk@qualcomm.com> wrote:
> From: Max Krasnyanskiy <maxk@qualcomm.com>
>
> Basically as Paul J. pointed out rebuild_sched_domains() is the
> only way to rebuild sched domains correctly based on the current
> cpuset settings. What this means is that we need to be able to
> call it from different contexts, like cpuhotplug for example.
> Also if you look at the cpu_active_map patches, the scheduler now
> calls rebuild_sched_domains() directly from places like
> arch_init_sched_domains().
>
> In order to support that properly we need to change cpuset locking
> a bit, to avoid circular locking dependencies.
> This patch makes rebuild_sched_domains() callable from any context.
> The only requirement now is that the caller has to hold cpu_hotplug.lock
> (ie get_online_cpus()).
Calling get_online_cpus() doesn't technically result in you holding
cpu_hotplug.lock - it ensures that either you're the active
cpu_hotplug writer or else no-one else holds cpu_hotplug.lock and
cpu_hotplug.refcount > 0. Can you specify this more precisely? Maybe
say "the caller has to hold cpu_hotplug for read or write"? A useful
patch might be to rename "struct {} cpu_hotplug" to "struct {}
cpu_hotplug_recursive_rw_mutex", since that's exactly what it is. Then
this patch could say "caller has to hold
cpu_hotplug_recursive_rw_mutex". Yes, it's a bit ugly, but at least it
exposes it properly.
> + /* We have to iterate cgroup hierarchy, make sure nobody is messing
> + * with it. */
> + cgroup_lock();
> +
> /* Special case for the 99% of systems with one, full, sched domain */
> if (is_sched_load_balance(&top_cpuset)) {
> ndoms = 1;
> @@ -598,10 +606,10 @@ void rebuild_sched_domains(void)
>
> q = kfifo_alloc(number_of_cpusets * sizeof(cp), GFP_KERNEL, NULL);
> if (IS_ERR(q))
> - goto done;
> + goto unlock;
> csa = kmalloc(number_of_cpusets * sizeof(cp), GFP_KERNEL);
> if (!csa)
> - goto done;
> + goto unlock;
> csn = 0;
>
> cp = &top_cpuset;
> @@ -688,10 +696,16 @@ restart:
> BUG_ON(nslot != ndoms);
>
> rebuild:
> + /* Drop cgroup lock before calling the scheduler.
> + * This is not strictly necesseary but simplifies lock nesting. */
> + cgroup_unlock();
> +
> /* Have scheduler rebuild sched domains */
> - get_online_cpus();
> partition_sched_domains(ndoms, doms, dattr);
> - put_online_cpus();
> + goto done;
> +
> +unlock:
> + cgroup_unlock();
This goto ordering's a bit ugly. rebuild_sched_domains() is ripe for a
refactoring, but I guess that can wait.
> +/*
> + * Internal helper for rebuilding sched domains when something changes.
> + * rebuild_sched_domains() must be called under get_online_cpus() and
> + * it needs to take cgroup_lock(). But most of the cpuset code is already
> + * holding cgroup_lock() while calling __rebuild_sched_domains().
> + * In order to avoid incorrect lock nesting we delegate the actual domain
> + * rebuilding to the workqueue.
> + */
> +static void __rebuild_sched_domains(void)
> +{
> + queue_work(cpuset_wq, &rebuild_domains_work);
> +}
> +
The __ prefix is normally used to indicate a lower-level or pre-locked
version of a function. In this case it's a higher-level function so I
don't think that prefix is appropriate. How about
async_rebuild_sched_domains() ?
>
> -static void common_cpu_mem_hotplug_unplug(int rebuild_sd)
> +static void common_cpu_mem_hotplug_unplug(void)
> {
> cgroup_lock();
>
> @@ -1902,13 +1934,6 @@ static void common_cpu_mem_hotplug_unplug(int rebuild_sd)
> top_cpuset.mems_allowed = node_states[N_HIGH_MEMORY];
> scan_for_empty_cpusets(&top_cpuset);
This is still unsafe because it accesses cpu_online_map without
get_online_cpus() (when called as the memory hotplug handler)?
Maybe scan_for_empty_cpusets() should take a parameter indicating
whether we're interested in cpus or mems?
> hotcpu_notifier(cpuset_handle_cpuhp, 0);
> +
> + cpuset_wq = create_singlethread_workqueue("cpuset");
> + BUG_ON(!cpuset_wq);
Seems a bit of a shame to waste a kthread on this. Is there no generic
single-threaded workqueue that we could piggyback on? Maybe create one
in workqueue.c that can be used by anyone who specifically needs a
singlethreaded workqueue for occasional work?
Paul
* Re: [PATCH] cpuset: Make rebuild_sched_domains() usable from any context (take 2)
2008-07-16 7:11 ` Paul Menage
@ 2008-07-16 18:27 ` Max Krasnyansky
0 siblings, 0 replies; 4+ messages in thread
From: Max Krasnyansky @ 2008-07-16 18:27 UTC (permalink / raw)
To: Paul Menage; +Cc: mingo, pj, linux-kernel, a.p.zijlstra, vegard.nossum
Paul Menage wrote:
> On Tue, Jul 15, 2008 at 11:21 PM, Max Krasnyansky <maxk@qualcomm.com> wrote:
>> From: Max Krasnyanskiy <maxk@qualcomm.com>
>>
>> Basically as Paul J. pointed out rebuild_sched_domains() is the
>> only way to rebuild sched domains correctly based on the current
>> cpuset settings. What this means is that we need to be able to
>> call it from different contexts, like cpuhotplug for example.
>> Also if you look at the cpu_active_map patches, the scheduler now
>> calls rebuild_sched_domains() directly from places like
>> arch_init_sched_domains().
>>
>> In order to support that properly we need to change cpuset locking
>> a bit, to avoid circular locking dependencies.
>> This patch makes rebuild_sched_domains() callable from any context.
>> The only requirement now is that the caller has to hold cpu_hotplug.lock
>> (ie get_online_cpus()).
>
> Calling get_online_cpus() doesn't technically result in you holding
> cpu_hotplug.lock - it ensures that either you're the active
> cpu_hotplug writer or else no-one else holds cpu_hotplug.lock and
> cpu_hotplug.refcount > 0. Can you specify this more precisely? Maybe
> say "the caller has to hold cpu_hotplug for read or write"?
Very good point.
How about we just say
"has to be called under get_online_cpus()"?
I think that reflects what the calling requirement is without diving into
too much detail as to how get_online_cpus() works internally.
> A useful
> patch might be to rename "struct {} cpu_hotplug" to "struct {}
> cpu_hotplug_recursive_rw_mutex", since that's exactly what it is.
Hmm, I guess the idea was that struct cpu_hotplug would hold more stuff
than just the lock. But maybe I'm wrong.
> Then this patch could say "caller has to hold
> cpu_hotplug_recursive_rw_mutex". Yes, it's a bit ugly, but at least it
> exposes it properly.
I think at this level we should just say
"call me inside get_online_cpus() ... put_online_cpus()"
and leave out the details. I mean, we should just stop using the "hold
cpu_hotplug.lock" notation because it's internal. All we really care
about is that we are called under get_online_cpus() to make sure that cpus are
not unplugged at the wrong time.
>> + /* We have to iterate cgroup hierarchy, make sure nobody is messing
>> + * with it. */
>> + cgroup_lock();
>> +
>> /* Special case for the 99% of systems with one, full, sched domain */
>> if (is_sched_load_balance(&top_cpuset)) {
>> ndoms = 1;
>> @@ -598,10 +606,10 @@ void rebuild_sched_domains(void)
>>
>> q = kfifo_alloc(number_of_cpusets * sizeof(cp), GFP_KERNEL, NULL);
>> if (IS_ERR(q))
>> - goto done;
>> + goto unlock;
>> csa = kmalloc(number_of_cpusets * sizeof(cp), GFP_KERNEL);
>> if (!csa)
>> - goto done;
>> + goto unlock;
>> csn = 0;
>>
>> cp = &top_cpuset;
>> @@ -688,10 +696,16 @@ restart:
>> BUG_ON(nslot != ndoms);
>>
>> rebuild:
>> + /* Drop cgroup lock before calling the scheduler.
>> + * This is not strictly necesseary but simplifies lock nesting. */
>> + cgroup_unlock();
>> +
>> /* Have scheduler rebuild sched domains */
>> - get_online_cpus();
>> partition_sched_domains(ndoms, doms, dattr);
>> - put_online_cpus();
>> + goto done;
>> +
>> +unlock:
>> + cgroup_unlock();
>
> This goto ordering's a bit ugly. rebuild_sched_domains() is ripe for a
> refactoring, but I guess that can wait.
Yeah. I tried a couple of options and settled on the current one for now.
If we did not have to drop cgroup_lock() before calling the scheduler some of
the gotos would go away, but I'd rather drop it to avoid more nesting in the
future.
Sounds like you're ok with the current version logic-wise. I'll take another
cleanup pass some time later.
>> +/*
>> + * Internal helper for rebuilding sched domains when something changes.
>> + * rebuild_sched_domains() must be called under get_online_cpus() and
>> + * it needs to take cgroup_lock(). But most of the cpuset code is already
>> + * holding cgroup_lock() while calling __rebuild_sched_domains().
>> + * In order to avoid incorrect lock nesting we delegate the actual domain
>> + * rebuilding to the workqueue.
>> + */
>> +static void __rebuild_sched_domains(void)
>> +{
>> + queue_work(cpuset_wq, &rebuild_domains_work);
>> +}
>> +
>
> The __ prefix is normally used to indicate a lower-level or pre-locked
> version of a function. In this case it's a higher-level function so I
> don't think that prefix is appropriate. How about
> async_rebuild_sched_domains() ?
Yep. Much better name.
>> -static void common_cpu_mem_hotplug_unplug(int rebuild_sd)
>> +static void common_cpu_mem_hotplug_unplug(void)
>> {
>> cgroup_lock();
>>
>> @@ -1902,13 +1934,6 @@ static void common_cpu_mem_hotplug_unplug(int rebuild_sd)
>> top_cpuset.mems_allowed = node_states[N_HIGH_MEMORY];
>> scan_for_empty_cpusets(&top_cpuset);
>
> This is still unsafe because it accesses cpu_online_map without
> get_online_cpus() (when called as the memory hotplug handler)?
>
> Maybe scan_for_empty_cpusets() should take a parameter indicating
> whether we're interested in cpus or mems?
Ah, I thought I had separated them out, but you're right, it still accesses
cpu_online_map from mem hotplug events. I think instead of the argument
we can just split them up completely. The common part is
scan_for_empty_cpusets(), the rest is not. I'll fix that.
>> hotcpu_notifier(cpuset_handle_cpuhp, 0);
>> +
>> + cpuset_wq = create_singlethread_workqueue("cpuset");
>> + BUG_ON(!cpuset_wq);
>
> Seems a bit of a shame to waste a kthread on this. Is there no generic
> single-threaded workqueue that we could piggyback on? Maybe create one
> in workqueue.c that can be used by anyone who specifically needs a
> singlethreaded workqueue for occasional work?
Agree
$ git grep create_singlethread_workqueue | wc
104 412 9037
Granted, it's unlikely that a single machine would have all that stuff enabled,
but 104 kthreads just for workqueues does seem like overkill.
I was going to take a pass at the workqueue users and replace
flush_scheduled_work() with flush_work()/cancel_work_sync(), so I might take a
crack at this too.
Max
* Re: [PATCH] cpuset: Make rebuild_sched_domains() usable from any context (take 2)
2008-07-16 6:21 [PATCH] cpuset: Make rebuild_sched_domains() usable from any context (take 2) Max Krasnyansky
2008-07-16 7:11 ` Paul Menage
@ 2008-07-18 11:28 ` Ingo Molnar
1 sibling, 0 replies; 4+ messages in thread
From: Ingo Molnar @ 2008-07-18 11:28 UTC (permalink / raw)
To: Max Krasnyansky; +Cc: pj, linux-kernel, a.p.zijlstra, menage, vegard.nossum
* Max Krasnyansky <maxk@qualcomm.com> wrote:
> From: Max Krasnyanskiy <maxk@qualcomm.com>
>
> Basically as Paul J. pointed out rebuild_sched_domains() is the only
> way to rebuild sched domains correctly based on the current cpuset
> settings. What this means is that we need to be able to call it from
> different contexts, like cpuhotplug for example. Also if you look at
> the cpu_active_map patches, the scheduler now calls
> rebuild_sched_domains() directly from places like
> arch_init_sched_domains().
please fix the whitespace problems and the misformatted multi-line
comment as well. (checkpatch will point out where they are)
Ingo