public inbox for linux-kernel@vger.kernel.org
* [RFC][PATCH] CPUSets: Move most calls to rebuild_sched_domains() to the workqueue
@ 2008-06-26  7:56 Paul Menage
  2008-06-26  9:34 ` Vegard Nossum
  2008-06-27  3:22 ` Gautham R Shenoy
  0 siblings, 2 replies; 14+ messages in thread
From: Paul Menage @ 2008-06-26  7:56 UTC (permalink / raw)
  To: Vegard Nossum, Paul Jackson, a.p.zijlstra, maxk; +Cc: linux-kernel

CPUsets: Move most calls to rebuild_sched_domains() to the workqueue

In the current cpusets code the lock nesting between cgroup_mutex and
cpuhotplug.lock when calling rebuild_sched_domains is inconsistent -
in the CPU hotplug path cpuhotplug.lock nests outside cgroup_mutex,
and in all other paths that call rebuild_sched_domains() it nests
inside.

This patch makes most calls to rebuild_sched_domains() asynchronous
via the workqueue, which removes the nesting of the two locks in that
case. In the case of an actual hotplug event, cpuhotplug.lock nests
outside cgroup_mutex as now.

Signed-off-by: Paul Menage <menage@google.com>

---

Note that all I've done with this patch is verify that it compiles
without warnings; I'm not sure how to trigger a hotplug event to test
the lock dependencies or verify that scheduler domain support is still
behaving correctly. Vegard, does this fix the problems that you were
seeing? Paul/Max, does this still seem sane with regard to scheduler domains?


 kernel/cpuset.c |   35 +++++++++++++++++++++++------------
 1 file changed, 23 insertions(+), 12 deletions(-)

Index: lockfix-2.6.26-rc5-mm3/kernel/cpuset.c
===================================================================
--- lockfix-2.6.26-rc5-mm3.orig/kernel/cpuset.c
+++ lockfix-2.6.26-rc5-mm3/kernel/cpuset.c
@@ -522,13 +522,9 @@ update_domain_attr(struct sched_domain_a
  * domains when operating in the severe memory shortage situations
  * that could cause allocation failures below.
  *
- * Call with cgroup_mutex held.  May take callback_mutex during
- * call due to the kfifo_alloc() and kmalloc() calls.  May nest
- * a call to the get_online_cpus()/put_online_cpus() pair.
- * Must not be called holding callback_mutex, because we must not
- * call get_online_cpus() while holding callback_mutex.  Elsewhere
- * the kernel nests callback_mutex inside get_online_cpus() calls.
- * So the reverse nesting would risk an ABBA deadlock.
+ * Call with cgroup_mutex held, and inside get_online_cpus().  May
+ * take callback_mutex during call due to the kfifo_alloc() and
+ * kmalloc() calls.
  *
  * The three key local variables below are:
  *    q  - a kfifo queue of cpuset pointers, used to implement a
@@ -689,9 +685,7 @@ restart:
 
 rebuild:
 	/* Have scheduler rebuild sched domains */
-	get_online_cpus();
 	partition_sched_domains(ndoms, doms, dattr);
-	put_online_cpus();
 
 done:
 	if (q && !IS_ERR(q))
@@ -701,6 +695,21 @@ done:
 	/* Don't kfree(dattr) -- partition_sched_domains() does that. */
 }
 
+/*
+ * Due to the need to nest cgroup_mutex inside cpuhotplug.lock, most
+ * of our invocations of rebuild_sched_domains() are done
+ * asynchronously via the workqueue
+ */
+static void delayed_rebuild_sched_domains(struct work_struct *work)
+{
+	get_online_cpus();
+	cgroup_lock();
+	rebuild_sched_domains();
+	cgroup_unlock();
+	put_online_cpus();
+}
+static DECLARE_WORK(rebuild_sched_domains_work, delayed_rebuild_sched_domains);
+
 static inline int started_after_time(struct task_struct *t1,
 				     struct timespec *time,
 				     struct task_struct *t2)
@@ -853,7 +862,7 @@ static int update_cpumask(struct cpuset 
 		return retval;
 
 	if (is_load_balanced)
-		rebuild_sched_domains();
+		schedule_work(&rebuild_sched_domains_work);
 	return 0;
 }
 
@@ -1080,7 +1089,7 @@ static int update_relax_domain_level(str
 
 	if (val != cs->relax_domain_level) {
 		cs->relax_domain_level = val;
-		rebuild_sched_domains();
+		schedule_work(&rebuild_sched_domains_work);
 	}
 
 	return 0;
@@ -1121,7 +1130,7 @@ static int update_flag(cpuset_flagbits_t
 	mutex_unlock(&callback_mutex);
 
 	if (cpus_nonempty && balance_flag_changed)
-		rebuild_sched_domains();
+		schedule_work(&rebuild_sched_domains_work);
 
 	return 0;
 }
@@ -1929,6 +1938,7 @@ static void scan_for_empty_cpusets(const
 
 static void common_cpu_mem_hotplug_unplug(void)
 {
+	get_online_cpus();
 	cgroup_lock();
 
 	top_cpuset.cpus_allowed = cpu_online_map;
@@ -1942,6 +1952,7 @@ static void common_cpu_mem_hotplug_unplu
 	rebuild_sched_domains();
 
 	cgroup_unlock();
+	put_online_cpus();
 }
 
 /*

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC][PATCH] CPUSets: Move most calls to rebuild_sched_domains() to the workqueue
  2008-06-26  7:56 [RFC][PATCH] CPUSets: Move most calls to rebuild_sched_domains() to the workqueue Paul Menage
@ 2008-06-26  9:34 ` Vegard Nossum
  2008-06-26  9:50   ` Paul Menage
  2008-06-27  3:22 ` Gautham R Shenoy
  1 sibling, 1 reply; 14+ messages in thread
From: Vegard Nossum @ 2008-06-26  9:34 UTC (permalink / raw)
  To: Paul Menage; +Cc: Paul Jackson, a.p.zijlstra, maxk, linux-kernel

On Thu, Jun 26, 2008 at 9:56 AM, Paul Menage <menage@google.com> wrote:
> CPUsets: Move most calls to rebuild_sched_domains() to the workqueue
>
> In the current cpusets code the lock nesting between cgroup_mutex and
> cpuhotplug.lock when calling rebuild_sched_domains is inconsistent -
> in the CPU hotplug path cpuhotplug.lock nests outside cgroup_mutex,
> and in all other paths that call rebuild_sched_domains() it nests
> inside.
>
> This patch makes most calls to rebuild_sched_domains() asynchronous
> via the workqueue, which removes the nesting of the two locks in that
> case. In the case of an actual hotplug event, cpuhotplug.lock nests
> outside cgroup_mutex as now.
>
> Signed-off-by: Paul Menage <menage@google.com>
>
> ---
>
> Note that all I've done with this patch is verify that it compiles
> without warnings; I'm not sure how to trigger a hotplug event to test
> the lock dependencies or verify that scheduler domain support is still
> behaving correctly. Vegard, does this fix the problems that you were
> seeing? Paul/Max, does this still seem sane with regard to scheduler
> domains?

Nope, sorry :-(

=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.26-rc8-dirty #39
-------------------------------------------------------
bash/3510 is trying to acquire lock:
 (events){--..}, at: [<c0145690>] cleanup_workqueue_thread+0x10/0x70

but task is already holding lock:
 (&cpu_hotplug.lock){--..}, at: [<c015d9da>] cpu_hotplug_begin+0x1a/0x50

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #2 (&cpu_hotplug.lock){--..}:
       [<c0158e65>] __lock_acquire+0xf45/0x1040
       [<c0158ff8>] lock_acquire+0x98/0xd0
       [<c057e6a1>] mutex_lock_nested+0xb1/0x300
       [<c015da3c>] get_online_cpus+0x2c/0x40
       [<c0162c98>] delayed_rebuild_sched_domains+0x8/0x30
       [<c014548b>] run_workqueue+0x15b/0x1f0
       [<c0145f09>] worker_thread+0x99/0xf0
       [<c0148772>] kthread+0x42/0x70
       [<c0105a63>] kernel_thread_helper+0x7/0x14
       [<ffffffff>] 0xffffffff

-> #1 (rebuild_sched_domains_work){--..}:
       [<c0158e65>] __lock_acquire+0xf45/0x1040
       [<c0158ff8>] lock_acquire+0x98/0xd0
       [<c0145486>] run_workqueue+0x156/0x1f0
       [<c0145f09>] worker_thread+0x99/0xf0
       [<c0148772>] kthread+0x42/0x70
       [<c0105a63>] kernel_thread_helper+0x7/0x14
       [<ffffffff>] 0xffffffff

-> #0 (events){--..}:
       [<c0158a15>] __lock_acquire+0xaf5/0x1040
       [<c0158ff8>] lock_acquire+0x98/0xd0
       [<c01456b6>] cleanup_workqueue_thread+0x36/0x70
       [<c055d91a>] workqueue_cpu_callback+0x7a/0x130
       [<c014d497>] notifier_call_chain+0x37/0x70
       [<c014d509>] __raw_notifier_call_chain+0x19/0x20
       [<c014d52a>] raw_notifier_call_chain+0x1a/0x20
       [<c055bb28>] _cpu_down+0x148/0x240
       [<c055bc4b>] cpu_down+0x2b/0x40
       [<c055ce69>] store_online+0x39/0x80
       [<c02fb91b>] sysdev_store+0x2b/0x40
       [<c01dd0a2>] sysfs_write_file+0xa2/0x100
       [<c019ecc6>] vfs_write+0x96/0x130
       [<c019f38d>] sys_write+0x3d/0x70
       [<c0104ceb>] sysenter_past_esp+0x78/0xd1
       [<ffffffff>] 0xffffffff

other info that might help us debug this:

3 locks held by bash/3510:
 #0:  (&buffer->mutex){--..}, at: [<c01dd02b>] sysfs_write_file+0x2b/0x100
 #1:  (cpu_add_remove_lock){--..}, at: [<c015d97f>]
cpu_maps_update_begin+0xf/0x20
 #2:  (&cpu_hotplug.lock){--..}, at: [<c015d9da>] cpu_hotplug_begin+0x1a/0x50

stack backtrace:
Pid: 3510, comm: bash Not tainted 2.6.26-rc8-dirty #39
 [<c0156517>] print_circular_bug_tail+0x77/0x90
 [<c0155b93>] ? print_circular_bug_entry+0x43/0x50
 [<c0158a15>] __lock_acquire+0xaf5/0x1040
 [<c010aeb5>] ? native_sched_clock+0xb5/0x110
 [<c0157895>] ? mark_held_locks+0x65/0x80
 [<c0158ff8>] lock_acquire+0x98/0xd0
 [<c0145690>] ? cleanup_workqueue_thread+0x10/0x70
 [<c01456b6>] cleanup_workqueue_thread+0x36/0x70
 [<c0145690>] ? cleanup_workqueue_thread+0x10/0x70
 [<c055d91a>] workqueue_cpu_callback+0x7a/0x130
 [<c0580613>] ? _spin_unlock_irqrestore+0x43/0x70
 [<c014d497>] notifier_call_chain+0x37/0x70
 [<c014d509>] __raw_notifier_call_chain+0x19/0x20
 [<c014d52a>] raw_notifier_call_chain+0x1a/0x20
 [<c055bb28>] _cpu_down+0x148/0x240
 [<c015d97f>] ? cpu_maps_update_begin+0xf/0x20
 [<c055bc4b>] cpu_down+0x2b/0x40
 [<c055ce69>] store_online+0x39/0x80
 [<c055ce30>] ? store_online+0x0/0x80
 [<c02fb91b>] sysdev_store+0x2b/0x40
 [<c01dd0a2>] sysfs_write_file+0xa2/0x100
 [<c019ecc6>] vfs_write+0x96/0x130
 [<c01dd000>] ? sysfs_write_file+0x0/0x100
 [<c019f38d>] sys_write+0x3d/0x70
 [<c0104ceb>] sysenter_past_esp+0x78/0xd1
 =======================


Vegard

-- 
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036


* Re: [RFC][PATCH] CPUSets: Move most calls to rebuild_sched_domains() to the workqueue
  2008-06-26  9:34 ` Vegard Nossum
@ 2008-06-26  9:50   ` Paul Menage
  2008-06-26 18:49     ` Max Krasnyansky
  0 siblings, 1 reply; 14+ messages in thread
From: Paul Menage @ 2008-06-26  9:50 UTC (permalink / raw)
  To: Vegard Nossum; +Cc: Paul Jackson, a.p.zijlstra, maxk, linux-kernel

On Thu, Jun 26, 2008 at 2:34 AM, Vegard Nossum <vegard.nossum@gmail.com> wrote:
> On Thu, Jun 26, 2008 at 9:56 AM, Paul Menage <menage@google.com> wrote:
>> CPUsets: Move most calls to rebuild_sched_domains() to the workqueue
>>
>> In the current cpusets code the lock nesting between cgroup_mutex and
>> cpuhotplug.lock when calling rebuild_sched_domains is inconsistent -
>> in the CPU hotplug path cpuhotplug.lock nests outside cgroup_mutex,
>> and in all other paths that call rebuild_sched_domains() it nests
>> inside.
>>
>> This patch makes most calls to rebuild_sched_domains() asynchronous
>> via the workqueue, which removes the nesting of the two locks in that
>> case. In the case of an actual hotplug event, cpuhotplug.lock nests
>> outside cgroup_mutex as now.
>>
>> Signed-off-by: Paul Menage <menage@google.com>
>>
>> ---
>>
>> Note that all I've done with this patch is verify that it compiles
>> without warnings; I'm not sure how to trigger a hotplug event to test
>> the lock dependencies or verify that scheduler domain support is still
>> behaving correctly. Vegard, does this fix the problems that you were
>> seeing? Paul/Max, does this still seem sane with regard to scheduler
>> domains?
>
> Nope, sorry :-(
>
> =======================================================
> [ INFO: possible circular locking dependency detected ]
> 2.6.26-rc8-dirty #39
> -------------------------------------------------------
> bash/3510 is trying to acquire lock:
>  (events){--..}, at: [<c0145690>] cleanup_workqueue_thread+0x10/0x70
>
> but task is already holding lock:
>  (&cpu_hotplug.lock){--..}, at: [<c015d9da>] cpu_hotplug_begin+0x1a/0x50
>
> which lock already depends on the new lock.
>

Does that mean that you can't ever call get_online_cpus() from a
workqueue thread?

Paul.


* Re: [RFC][PATCH] CPUSets: Move most calls to rebuild_sched_domains() to the workqueue
  2008-06-26  9:50   ` Paul Menage
@ 2008-06-26 18:49     ` Max Krasnyansky
  2008-06-26 19:19       ` Peter Zijlstra
  2008-06-26 20:34       ` Paul Menage
  0 siblings, 2 replies; 14+ messages in thread
From: Max Krasnyansky @ 2008-06-26 18:49 UTC (permalink / raw)
  To: Paul Menage; +Cc: Vegard Nossum, Paul Jackson, a.p.zijlstra, linux-kernel

Paul Menage wrote:
> On Thu, Jun 26, 2008 at 2:34 AM, Vegard Nossum <vegard.nossum@gmail.com> wrote:
>> On Thu, Jun 26, 2008 at 9:56 AM, Paul Menage <menage@google.com> wrote:
>>> CPUsets: Move most calls to rebuild_sched_domains() to the workqueue
>>>
>>> In the current cpusets code the lock nesting between cgroup_mutex and
>>> cpuhotplug.lock when calling rebuild_sched_domains is inconsistent -
>>> in the CPU hotplug path cpuhotplug.lock nests outside cgroup_mutex,
>>> and in all other paths that call rebuild_sched_domains() it nests
>>> inside.
>>>
>>> This patch makes most calls to rebuild_sched_domains() asynchronous
>>> via the workqueue, which removes the nesting of the two locks in that
>>> case. In the case of an actual hotplug event, cpuhotplug.lock nests
>>> outside cgroup_mutex as now.
>>>
>>> Signed-off-by: Paul Menage <menage@google.com>
>>>
>>> ---
>>>
>>> Note that all I've done with this patch is verify that it compiles
>>> without warnings; I'm not sure how to trigger a hotplug event to test
>>> the lock dependencies or verify that scheduler domain support is still
>>> behaving correctly. 
You can just do:
	echo 0 > /sys/devices/cpu/cpuN/online
	echo 1 > /sys/devices/cpu/cpuN/online

>>> Vegard, does this fix the problems that you were
>>> seeing? Paul/Max, does this still seem sane with regard to scheduler
>>> domains?
>> Nope, sorry :-(
>>
>> =======================================================
>> [ INFO: possible circular locking dependency detected ]
>> 2.6.26-rc8-dirty #39
>> -------------------------------------------------------
>> bash/3510 is trying to acquire lock:
>>  (events){--..}, at: [<c0145690>] cleanup_workqueue_thread+0x10/0x70
>>
>> but task is already holding lock:
>>  (&cpu_hotplug.lock){--..}, at: [<c015d9da>] cpu_hotplug_begin+0x1a/0x50
>>
>> which lock already depends on the new lock.
>>
> 
> Does that mean that you can't ever call get_online_cpus() from a
> workqueue thread?

In general it should be ok (no different from a user-space task
calling it). But there is still a circular dependency because we're
calling into the domain-partitioning code.
Below is a more detailed lockdep report with your patch applied on top
of -rc8.
Looks like this might be a good time to rethink the overall locking in
there.

> [ INFO: possible circular locking dependency detected ]
> 2.6.26-rc8 #3
> -------------------------------------------------------
> bash/2836 is trying to acquire lock:
>  (cgroup_mutex){--..}, at: [<ffffffff802653e2>] cgroup_lock+0x12/0x20
> 
> but task is already holding lock:
>  (&cpu_hotplug.lock){--..}, at: [<ffffffff8025e062>] cpu_hotplug_begin+0x22/0x60
> 
> which lock already depends on the new lock.
> 
> 
> the existing dependency chain (in reverse order) is:
> 
> -> #2 (&cpu_hotplug.lock){--..}:
>        [<ffffffff8025988f>] __lock_acquire+0x9cf/0xe50
>        [<ffffffff80259d6b>] lock_acquire+0x5b/0x80
>        [<ffffffff804d11b4>] mutex_lock_nested+0x94/0x250
>        [<ffffffff8025e024>] get_online_cpus+0x24/0x40
>        [<ffffffff8022fee1>] sched_getaffinity+0x11/0x80
>        [<ffffffff8026e5c9>] __synchronize_sched+0x19/0x90
>        [<ffffffff8022ed46>] detach_destroy_domains+0x46/0x50
>        [<ffffffff8022f6b9>] partition_sched_domains+0xf9/0x2b0
>        [<ffffffff80268eea>] rebuild_sched_domains+0x9a/0x3e0
>        [<ffffffff80269243>] delayed_rebuild_sched_domains+0x13/0x30
>        [<ffffffff80247d5e>] run_workqueue+0xde/0x220
>        [<ffffffff80247f00>] worker_thread+0x60/0xb0
>        [<ffffffff8024c069>] kthread+0x49/0x90
>        [<ffffffff8020c898>] child_rip+0xa/0x12
>        [<ffffffffffffffff>] 0xffffffffffffffff
> 
> -> #1 (sched_domains_mutex){--..}:
>        [<ffffffff8025988f>] __lock_acquire+0x9cf/0xe50
>        [<ffffffff80259d6b>] lock_acquire+0x5b/0x80
>        [<ffffffff804d11b4>] mutex_lock_nested+0x94/0x250
>        [<ffffffff8022f5e9>] partition_sched_domains+0x29/0x2b0
>        [<ffffffff80268eea>] rebuild_sched_domains+0x9a/0x3e0
>        [<ffffffff80269243>] delayed_rebuild_sched_domains+0x13/0x30
>        [<ffffffff80247d5e>] run_workqueue+0xde/0x220
>        [<ffffffff80247f00>] worker_thread+0x60/0xb0
>        [<ffffffff8024c069>] kthread+0x49/0x90
>        [<ffffffff8020c898>] child_rip+0xa/0x12
>        [<ffffffffffffffff>] 0xffffffffffffffff
> 
> -> #0 (cgroup_mutex){--..}:
>        [<ffffffff80259913>] __lock_acquire+0xa53/0xe50
>        [<ffffffff80259d6b>] lock_acquire+0x5b/0x80
>        [<ffffffff804d11b4>] mutex_lock_nested+0x94/0x250
>        [<ffffffff802653e2>] cgroup_lock+0x12/0x20
>        [<ffffffff80269bf1>] cpuset_handle_cpuhp+0x31/0x230
>        [<ffffffff804d677f>] notifier_call_chain+0x3f/0x80
>        [<ffffffff80250679>] __raw_notifier_call_chain+0x9/0x10
>        [<ffffffff804c1638>] _cpu_down+0xa8/0x290
>        [<ffffffff804c185b>] cpu_down+0x3b/0x60
>        [<ffffffff804c2b58>] store_online+0x48/0xa0
>        [<ffffffff803a45b4>] sysdev_store+0x24/0x30
>        [<ffffffff802eeaaa>] sysfs_write_file+0xca/0x140
>        [<ffffffff8029ca2b>] vfs_write+0xcb/0x170
>        [<ffffffff8029cbc0>] sys_write+0x50/0x90
>        [<ffffffff8020b92b>] system_call_after_swapgs+0x7b/0x80
>        [<ffffffffffffffff>] 0xffffffffffffffff
> 
> other info that might help us debug this:
> 
> 3 locks held by bash/2836:
>  #0:  (&buffer->mutex){--..}, at: [<ffffffff802eea23>] sysfs_write_file+0x43/0x140
>  #1:  (cpu_add_remove_lock){--..}, at: [<ffffffff804c1847>] cpu_down+0x27/0x60
>  #2:  (&cpu_hotplug.lock){--..}, at: [<ffffffff8025e062>] cpu_hotplug_begin+0x22/0x60
> 
> stack backtrace:
> Pid: 2836, comm: bash Not tainted 2.6.26-rc8 #3
> 
> Call Trace:
>  [<ffffffff80258c0c>] print_circular_bug_tail+0x8c/0x90
>  [<ffffffff802589c4>] ? print_circular_bug_entry+0x54/0x60
>  [<ffffffff80259913>] __lock_acquire+0xa53/0xe50
>  [<ffffffff8025867d>] ? mark_held_locks+0x4d/0x90
>  [<ffffffff804d1345>] ? mutex_lock_nested+0x225/0x250
>  [<ffffffff802653e2>] ? cgroup_lock+0x12/0x20
>  [<ffffffff80259d6b>] lock_acquire+0x5b/0x80
>  [<ffffffff802653e2>] ? cgroup_lock+0x12/0x20
>  [<ffffffff804d11b4>] mutex_lock_nested+0x94/0x250
>  [<ffffffff802653e2>] cgroup_lock+0x12/0x20
>  [<ffffffff80269bf1>] cpuset_handle_cpuhp+0x31/0x230
>  [<ffffffff8022edaa>] ? update_sched_domains+0x5a/0x70
>  [<ffffffff804d677f>] notifier_call_chain+0x3f/0x80
>  [<ffffffff80250679>] __raw_notifier_call_chain+0x9/0x10
>  [<ffffffff804c1638>] _cpu_down+0xa8/0x290
>  [<ffffffff804c185b>] cpu_down+0x3b/0x60
>  [<ffffffff804c2b58>] store_online+0x48/0xa0
>  [<ffffffff803a45b4>] sysdev_store+0x24/0x30
>  [<ffffffff802eeaaa>] sysfs_write_file+0xca/0x140
>  [<ffffffff8029ca2b>] vfs_write+0xcb/0x170
>  [<ffffffff8029cbc0>] sys_write+0x50/0x90
>  [<ffffffff8020b92b>] system_call_after_swapgs+0x7b/0x80






* Re: [RFC][PATCH] CPUSets: Move most calls to rebuild_sched_domains() to the workqueue
  2008-06-26 18:49     ` Max Krasnyansky
@ 2008-06-26 19:19       ` Peter Zijlstra
  2008-06-26 20:34       ` Paul Menage
  1 sibling, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2008-06-26 19:19 UTC (permalink / raw)
  To: Max Krasnyansky
  Cc: Paul Menage, Vegard Nossum, Paul Jackson, a.p.zijlstra,
	linux-kernel, Gautham R Shenoy

On Thu, 2008-06-26 at 11:49 -0700, Max Krasnyansky wrote:

> > Does that mean that you can't ever call get_online_cpus() from a
> > workqueue thread?
> 
> In general it should be ok (no different from a user-space task
> calling it). But there is still a circular dependency because we're
> calling into the domain-partitioning code.
> Below is a more detailed lockdep report with your patch applied on
> top of -rc8.
> Looks like this might be a good time to rethink the overall locking
> in there.

Gautham has been working on some of this recently. Perhaps he can share
his patch-stack again.





* Re: [RFC][PATCH] CPUSets: Move most calls to rebuild_sched_domains() to the workqueue
  2008-06-26 18:49     ` Max Krasnyansky
  2008-06-26 19:19       ` Peter Zijlstra
@ 2008-06-26 20:34       ` Paul Menage
  2008-06-26 21:17         ` Paul Menage
  1 sibling, 1 reply; 14+ messages in thread
From: Paul Menage @ 2008-06-26 20:34 UTC (permalink / raw)
  To: Max Krasnyansky
  Cc: Vegard Nossum, Paul Jackson, a.p.zijlstra, linux-kernel,
	Gautham shenoy

On Thu, Jun 26, 2008 at 11:49 AM, Max Krasnyansky <maxk@qualcomm.com> wrote:
>>
>> Does that mean that you can't ever call get_online_cpus() from a
>> workqueue thread?
>
> In general it should be ok (no different from a user-space task calling it).

No, I think it is a problem.

When a CPU goes down, the hotplug code holds cpu_hotplug.lock and
calls cleanup_workqueue_thread() which waits for any running work on
that thread to finish.

So if the workqueue thread running on that CPU calls get_online_cpus()
while the hotplug thread is waiting for it, it will deadlock. (It's
not clear to me how lockdep figured this out - I guess it's something
to do with the *_acquire() annotations that tell lockdep to treat the
workqueue structure as a pseudo-lock?)

I guess the fix for that would be to have a non-workqueue thread to
handle the async domain rebuilding, which isn't tied to a particular
cpu the way workqueue threads are.

> But there is still circular dependency because we're calling into
> domain partitioning code.

OK, so the problem is that since cpu_hotplug contains a hand-rolled
rwsem, lockdep is going to spot false deadlocks.

Is there any reason not to replace cpu_hotplug.lock and
cpu_hotplug.refcount with an rw_semaphore, and have the following:

void get_online_cpus(void)
{
        might_sleep();
        if (cpu_hotplug.active_writer == current)
                return;
        down_read(&cpu_hotplug.lock);
}

void put_online_cpus(void)
{
        if (cpu_hotplug.active_writer == current)
                return;
        up_read(&cpu_hotplug.lock);
}

static void cpu_hotplug_begin(void)
{
        down_write(&cpu_hotplug.lock);
        cpu_hotplug.active_writer = current;
}

static void cpu_hotplug_done(void)
{
        cpu_hotplug.active_writer = NULL;
        up_write(&cpu_hotplug.lock);
}

I think that combined with moving the async rebuild_sched_domains to a
separate thread should solve the problem, but I'm wondering why
cpu_hotplug.lock was implemented this way in the first place.

Paul


* Re: [RFC][PATCH] CPUSets: Move most calls to rebuild_sched_domains() to the workqueue
  2008-06-26 20:34       ` Paul Menage
@ 2008-06-26 21:17         ` Paul Menage
  2008-06-27  5:10           ` Max Krasnyansky
  0 siblings, 1 reply; 14+ messages in thread
From: Paul Menage @ 2008-06-26 21:17 UTC (permalink / raw)
  To: Max Krasnyansky
  Cc: Vegard Nossum, Paul Jackson, a.p.zijlstra, linux-kernel,
	Gautham shenoy

On Thu, Jun 26, 2008 at 1:34 PM, Paul Menage <menage@google.com> wrote:
>
> void get_online_cpus(void)
> {
>        might_sleep();
>        if (cpu_hotplug.active_writer == current)
>                return;
>        down_read(&cpu_hotplug.lock);
> }
>
> void put_online_cpus(void)
> {
>        if (cpu_hotplug.active_writer == current)
>                return;
>        up_read(&cpu_hotplug.lock);
> }
>
> static void cpu_hotplug_begin(void)
> {
>        down_write(&cpu_hotplug.lock);
>        cpu_hotplug.active_writer = current;
> }
>
> static void cpu_hotplug_done(void)
> {
>        cpu_hotplug.active_writer = NULL;
>        up_write(&cpu_hotplug.lock);
> }
>
> I think that combined with moving the async rebuild_sched_domains to a
> separate thread should solve the problem, but I'm wondering why
> cpu_hotplug.lock was implemented this way in the first place.

Oh, I guess that doesn't work because of recursive calls to
get_online_cpus(). Maybe we need a down_read_recursive() that skips
ahead of waiting writers if the lock is already held in read mode?

Paul


* Re: [RFC][PATCH] CPUSets: Move most calls to rebuild_sched_domains() to the workqueue
  2008-06-26  7:56 [RFC][PATCH] CPUSets: Move most calls to rebuild_sched_domains() to the workqueue Paul Menage
  2008-06-26  9:34 ` Vegard Nossum
@ 2008-06-27  3:22 ` Gautham R Shenoy
  2008-06-27  3:23   ` Gautham R Shenoy
  1 sibling, 1 reply; 14+ messages in thread
From: Gautham R Shenoy @ 2008-06-27  3:22 UTC (permalink / raw)
  To: Paul Menage; +Cc: Vegard Nossum, Paul Jackson, a.p.zijlstra, maxk, linux-kernel

On Thu, Jun 26, 2008 at 12:56:49AM -0700, Paul Menage wrote:
> CPUsets: Move most calls to rebuild_sched_domains() to the workqueue
>
> In the current cpusets code the lock nesting between cgroup_mutex and
> cpuhotplug.lock when calling rebuild_sched_domains is inconsistent -
> in the CPU hotplug path cpuhotplug.lock nests outside cgroup_mutex,
> and in all other paths that call rebuild_sched_domains() it nests
> inside.
>
> This patch makes most calls to rebuild_sched_domains() asynchronous
> via the workqueue, which removes the nesting of the two locks in that
> case. In the case of an actual hotplug event, cpuhotplug.lock nests
> outside cgroup_mutex as now.
>
> Signed-off-by: Paul Menage <menage@google.com>
>
> ---
>
> Note that all I've done with this patch is verify that it compiles
> without warnings; I'm not sure how to trigger a hotplug event to test
> the lock dependencies or verify that scheduler domain support is still
> behaving correctly. Vegard, does this fix the problems that you were
> seeing? Paul/Max, does this still seem sane with regard to scheduler domains?
>

Hi Paul, 

Using a multithreaded workqueue (kevent here) for this is not such a
great idea, since currently we cannot call get_online_cpus() from a
work item executed by a multithreaded workqueue.

Can one use a single-threaded workqueue here instead?

Or, better, I think we can ask Oleg to re-submit the patch he had to
make get_online_cpus() safe to be called from within the workqueue. It
does require a special post-CPU_DEAD notification, but it did work the
last time I checked.

>
> kernel/cpuset.c |   35 +++++++++++++++++++++++------------
> 1 file changed, 23 insertions(+), 12 deletions(-)
>
> Index: lockfix-2.6.26-rc5-mm3/kernel/cpuset.c
> ===================================================================
> --- lockfix-2.6.26-rc5-mm3.orig/kernel/cpuset.c
> +++ lockfix-2.6.26-rc5-mm3/kernel/cpuset.c
> @@ -522,13 +522,9 @@ update_domain_attr(struct sched_domain_a
>  * domains when operating in the severe memory shortage situations
>  * that could cause allocation failures below.
>  *
> - * Call with cgroup_mutex held.  May take callback_mutex during
> - * call due to the kfifo_alloc() and kmalloc() calls.  May nest
> - * a call to the get_online_cpus()/put_online_cpus() pair.
> - * Must not be called holding callback_mutex, because we must not
> - * call get_online_cpus() while holding callback_mutex.  Elsewhere
> - * the kernel nests callback_mutex inside get_online_cpus() calls.
> - * So the reverse nesting would risk an ABBA deadlock.
> + * Call with cgroup_mutex held, and inside get_online_cpus().  May
> + * take callback_mutex during call due to the kfifo_alloc() and
> + * kmalloc() calls.
>  *
>  * The three key local variables below are:
>  *    q  - a kfifo queue of cpuset pointers, used to implement a
> @@ -689,9 +685,7 @@ restart:
>
> rebuild:
> 	/* Have scheduler rebuild sched domains */
> -	get_online_cpus();
> 	partition_sched_domains(ndoms, doms, dattr);
> -	put_online_cpus();
>
> done:
> 	if (q && !IS_ERR(q))
> @@ -701,6 +695,21 @@ done:
> 	/* Don't kfree(dattr) -- partition_sched_domains() does that. */
> }
>
> +/*
> + * Due to the need to nest cgroup_mutex inside cpuhotplug.lock, most
> + * of our invocations of rebuild_sched_domains() are done
> + * asynchronously via the workqueue
> + */
> +static void delayed_rebuild_sched_domains(struct work_struct *work)
> +{
> +	get_online_cpus();
> +	cgroup_lock();
> +	rebuild_sched_domains();
> +	cgroup_unlock();
> +	put_online_cpus();
> +}
> +static DECLARE_WORK(rebuild_sched_domains_work, delayed_rebuild_sched_domains);
> +
> static inline int started_after_time(struct task_struct *t1,
> 				     struct timespec *time,
> 				     struct task_struct *t2)
> @@ -853,7 +862,7 @@ static int update_cpumask(struct cpuset
> 		return retval;
>
> 	if (is_load_balanced)
> -		rebuild_sched_domains();
> +		schedule_work(&rebuild_sched_domains_work);
> 	return 0;
> }
>
> @@ -1080,7 +1089,7 @@ static int update_relax_domain_level(str
>
> 	if (val != cs->relax_domain_level) {
> 		cs->relax_domain_level = val;
> -		rebuild_sched_domains();
> +		schedule_work(&rebuild_sched_domains_work);
> 	}
>
> 	return 0;
> @@ -1121,7 +1130,7 @@ static int update_flag(cpuset_flagbits_t
> 	mutex_unlock(&callback_mutex);
>
> 	if (cpus_nonempty && balance_flag_changed)
> -		rebuild_sched_domains();
> +		schedule_work(&rebuild_sched_domains_work);
>
> 	return 0;
> }
> @@ -1929,6 +1938,7 @@ static void scan_for_empty_cpusets(const
>
> static void common_cpu_mem_hotplug_unplug(void)
> {
> +	get_online_cpus();
> 	cgroup_lock();
>
> 	top_cpuset.cpus_allowed = cpu_online_map;
> @@ -1942,6 +1952,7 @@ static void common_cpu_mem_hotplug_unplu
> 	rebuild_sched_domains();
>
> 	cgroup_unlock();
> +	put_online_cpus();
> }
>
> /*

-- 
Thanks and Regards
gautham


* Re: [RFC][PATCH] CPUSets: Move most calls to rebuild_sched_domains() to the workqueue
  2008-06-27  3:22 ` Gautham R Shenoy
@ 2008-06-27  3:23   ` Gautham R Shenoy
  2008-06-27  4:53     ` Max Krasnyansky
  2008-06-27 16:42     ` Oleg Nesterov
  0 siblings, 2 replies; 14+ messages in thread
From: Gautham R Shenoy @ 2008-06-27  3:23 UTC (permalink / raw)
  To: Paul Menage
  Cc: Vegard Nossum, Paul Jackson, a.p.zijlstra, maxk, linux-kernel,
	Oleg Nesterov

On Fri, Jun 27, 2008 at 08:52:28AM +0530, Gautham R Shenoy wrote:
> On Thu, Jun 26, 2008 at 12:56:49AM -0700, Paul Menage wrote:
> > CPUsets: Move most calls to rebuild_sched_domains() to the workqueue
> >
> > In the current cpusets code the lock nesting between cgroup_mutex and
> > cpuhotplug.lock when calling rebuild_sched_domains is inconsistent -
> > in the CPU hotplug path cpuhotplug.lock nests outside cgroup_mutex,
> > and in all other paths that call rebuild_sched_domains() it nests
> > inside.
> >
> > This patch makes most calls to rebuild_sched_domains() asynchronous
> > via the workqueue, which removes the nesting of the two locks in that
> > case. In the case of an actual hotplug event, cpuhotplug.lock nests
> > outside cgroup_mutex as now.
> >
> > Signed-off-by: Paul Menage <menage@google.com>
> >
> > ---
> >
> > Note that all I've done with this patch is verify that it compiles
> > without warnings; I'm not sure how to trigger a hotplug event to test
> > the lock dependencies or verify that scheduler domain support is still
> > behaving correctly. Vegard, does this fix the problems that you were
> > seeing? Paul/Max, does this still seem sane with regard to scheduler domains?
> >
> 
> Hi Paul, 
> 

This time CC'ing Oleg!

> Using a multithreaded workqueue (kevent here) for this is not such a
> great idea, since currently we cannot call
> get_online_cpus() from a work item executed by a multithreaded workqueue.
> 
> Can one use a single-threaded workqueue here instead?
> 
> Or, better, I think we can ask Oleg to re-submit the patch he had to make
> get_online_cpus() safe to be called from within the workqueue. It does
> require a special post-CPU_DEAD notification, but it did work the
> last time I checked.
> 
> >
> > kernel/cpuset.c |   35 +++++++++++++++++++++++------------
> > 1 file changed, 23 insertions(+), 12 deletions(-)
> >
> > Index: lockfix-2.6.26-rc5-mm3/kernel/cpuset.c
> > ===================================================================
> > --- lockfix-2.6.26-rc5-mm3.orig/kernel/cpuset.c
> > +++ lockfix-2.6.26-rc5-mm3/kernel/cpuset.c
> > @@ -522,13 +522,9 @@ update_domain_attr(struct sched_domain_a
> >  * domains when operating in the severe memory shortage situations
> >  * that could cause allocation failures below.
> >  *
> > - * Call with cgroup_mutex held.  May take callback_mutex during
> > - * call due to the kfifo_alloc() and kmalloc() calls.  May nest
> > - * a call to the get_online_cpus()/put_online_cpus() pair.
> > - * Must not be called holding callback_mutex, because we must not
> > - * call get_online_cpus() while holding callback_mutex.  Elsewhere
> > - * the kernel nests callback_mutex inside get_online_cpus() calls.
> > - * So the reverse nesting would risk an ABBA deadlock.
> > + * Call with cgroup_mutex held, and inside get_online_cpus().  May
> > + * take callback_mutex during call due to the kfifo_alloc() and
> > + * kmalloc() calls.
> >  *
> >  * The three key local variables below are:
> >  *    q  - a kfifo queue of cpuset pointers, used to implement a
> > @@ -689,9 +685,7 @@ restart:
> >
> > rebuild:
> > 	/* Have scheduler rebuild sched domains */
> > -	get_online_cpus();
> > 	partition_sched_domains(ndoms, doms, dattr);
> > -	put_online_cpus();
> >
> > done:
> > 	if (q && !IS_ERR(q))
> > @@ -701,6 +695,21 @@ done:
> > 	/* Don't kfree(dattr) -- partition_sched_domains() does that. */
> > }
> >
> > +/*
> > + * Due to the need to nest cgroup_mutex inside cpuhotplug.lock, most
> > + * of our invocations of rebuild_sched_domains() are done
> > + * asynchronously via the workqueue
> > + */
> > +static void delayed_rebuild_sched_domains(struct work_struct *work)
> > +{
> > +	get_online_cpus();
> > +	cgroup_lock();
> > +	rebuild_sched_domains();
> > +	cgroup_unlock();
> > +	put_online_cpus();
> > +}
> > +static DECLARE_WORK(rebuild_sched_domains_work, delayed_rebuild_sched_domains);
> > +
> > static inline int started_after_time(struct task_struct *t1,
> > 				     struct timespec *time,
> > 				     struct task_struct *t2)
> > @@ -853,7 +862,7 @@ static int update_cpumask(struct cpuset
> > 		return retval;
> >
> > 	if (is_load_balanced)
> > -		rebuild_sched_domains();
> > +		schedule_work(&rebuild_sched_domains_work);
> > 	return 0;
> > }
> >
> > @@ -1080,7 +1089,7 @@ static int update_relax_domain_level(str
> >
> > 	if (val != cs->relax_domain_level) {
> > 		cs->relax_domain_level = val;
> > -		rebuild_sched_domains();
> > +		schedule_work(&rebuild_sched_domains_work);
> > 	}
> >
> > 	return 0;
> > @@ -1121,7 +1130,7 @@ static int update_flag(cpuset_flagbits_t
> > 	mutex_unlock(&callback_mutex);
> >
> > 	if (cpus_nonempty && balance_flag_changed)
> > -		rebuild_sched_domains();
> > +		schedule_work(&rebuild_sched_domains_work);
> >
> > 	return 0;
> > }
> > @@ -1929,6 +1938,7 @@ static void scan_for_empty_cpusets(const
> >
> > static void common_cpu_mem_hotplug_unplug(void)
> > {
> > +	get_online_cpus();
> > 	cgroup_lock();
> >
> > 	top_cpuset.cpus_allowed = cpu_online_map;
> > @@ -1942,6 +1952,7 @@ static void common_cpu_mem_hotplug_unplu
> > 	rebuild_sched_domains();
> >
> > 	cgroup_unlock();
> > +	put_online_cpus();
> > }
> >
> > /*
> > --
> >
> 
> -- 
> Thanks and Regards
> gautham

-- 
Thanks and Regards
gautham

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC][PATCH] CPUSets: Move most calls to rebuild_sched_domains() to the workqueue
  2008-06-27  3:23   ` Gautham R Shenoy
@ 2008-06-27  4:53     ` Max Krasnyansky
  2008-06-27 16:42     ` Oleg Nesterov
  1 sibling, 0 replies; 14+ messages in thread
From: Max Krasnyansky @ 2008-06-27  4:53 UTC (permalink / raw)
  To: ego
  Cc: Paul Menage, Vegard Nossum, Paul Jackson, a.p.zijlstra,
	linux-kernel, Oleg Nesterov

Gautham R Shenoy wrote:
> On Fri, Jun 27, 2008 at 08:52:28AM +0530, Gautham R Shenoy wrote:
>> On Thu, Jun 26, 2008 at 12:56:49AM -0700, Paul Menage wrote:
>>> CPUsets: Move most calls to rebuild_sched_domains() to the workqueue
>>>
>>> In the current cpusets code the lock nesting between cgroup_mutex and
>>> cpuhotplug.lock when calling rebuild_sched_domains is inconsistent -
>>> in the CPU hotplug path cpuhotplug.lock nests outside cgroup_mutex,
>>> and in all other paths that call rebuild_sched_domains() it nests
>>> inside.
>>>
>>> This patch makes most calls to rebuild_sched_domains() asynchronous
>>> via the workqueue, which removes the nesting of the two locks in that
>>> case. In the case of an actual hotplug event, cpuhotplug.lock nests
>>> outside cgroup_mutex as now.
>>>
>> Using a multithreaded workqueue (kevent here) for this is not such a
>> great idea, since currently we cannot call
>> get_online_cpus() from a work item executed by a multithreaded workqueue.
>>
>> Can one use a single-threaded workqueue here instead?
We could certainly do it in a single-threaded workqueue. It won't help
to avoid the circular locking dependencies, though.
Did you get a chance to read the entire thread on this topic? There were
some questions there that you might be able to answer.
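For readers following along: a "single threaded workqueue" serializes every
queued item through one worker thread. A minimal userspace analogue of the
concept (all names hypothetical; pthreads stand in for kernel threads, and
this is not the kernel's actual implementation):

```c
/*
 * Userspace sketch of a single-threaded workqueue: one worker thread
 * drains a FIFO list, so queued items execute one at a time, in order.
 */
#include <pthread.h>
#include <stddef.h>

struct work {
	void (*fn)(void *);
	void *arg;
	struct work *next;
};

static pthread_mutex_t wq_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  wq_cond = PTHREAD_COND_INITIALIZER;
static struct work *wq_head, *wq_tail;
static int wq_stop;

void queue_work_st(struct work *w)
{
	pthread_mutex_lock(&wq_lock);
	w->next = NULL;
	if (wq_tail)
		wq_tail->next = w;
	else
		wq_head = w;
	wq_tail = w;
	pthread_cond_signal(&wq_cond);
	pthread_mutex_unlock(&wq_lock);
}

void *wq_worker(void *unused)
{
	(void)unused;
	for (;;) {
		pthread_mutex_lock(&wq_lock);
		while (!wq_head && !wq_stop)
			pthread_cond_wait(&wq_cond, &wq_lock);
		struct work *w = wq_head;
		if (w) {
			wq_head = w->next;
			if (!wq_head)
				wq_tail = NULL;
		}
		pthread_mutex_unlock(&wq_lock);
		if (!w)
			return NULL;	/* stopped and fully drained */
		w->fn(w->arg);		/* runs without wq_lock held */
	}
}

/* Stop the worker after it has drained all pending items. */
void wq_shutdown(pthread_t worker)
{
	pthread_mutex_lock(&wq_lock);
	wq_stop = 1;
	pthread_cond_signal(&wq_cond);
	pthread_mutex_unlock(&wq_lock);
	pthread_join(worker, NULL);
}

/* tiny demo callback: bump an int counter */
static void demo_bump(void *arg)
{
	++*(int *)arg;
}
```

In 2.6-era kernel terms the equivalent would be create_singlethread_workqueue()
plus queue_work() in place of schedule_work().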

Max

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC][PATCH] CPUSets: Move most calls to rebuild_sched_domains() to the workqueue
  2008-06-26 21:17         ` Paul Menage
@ 2008-06-27  5:10           ` Max Krasnyansky
  2008-06-27  5:51             ` Paul Menage
  0 siblings, 1 reply; 14+ messages in thread
From: Max Krasnyansky @ 2008-06-27  5:10 UTC (permalink / raw)
  To: Paul Menage
  Cc: Vegard Nossum, Paul Jackson, a.p.zijlstra, linux-kernel,
	Gautham shenoy

Paul Menage wrote:
> On Thu, Jun 26, 2008 at 1:34 PM, Paul Menage <menage@google.com> wrote:
>> void get_online_cpus(void)
>> {
>>        might_sleep();
>>        if (cpu_hotplug.active_writer == current)
>>                return;
>>        down_read(&cpu_hotplug.lock);
>> }
>>
>> void put_online_cpus(void)
>> {
>>        if (cpu_hotplug.active_writer == current)
>>                return;
>>        up_read(&cpu_hotplug.lock);
>> }
>>
>> static void cpu_hotplug_begin(void)
>> {
>>        down_write(&cpu_hotplug.lock);
>>        cpu_hotplug.active_writer = current;
>> }
>>
>> static void cpu_hotplug_done(void)
>> {
>>        cpu_hotplug.active_writer = NULL;
>>        up_write(&cpu_hotplug.lock);
>> }
>>
>> I think that combined with moving the async rebuild_sched_domains to a
>> separate thread should solve the problem, but I'm wondering why
>> cpu_hotplug.lock was implemented this way in the first place.
> 
> Oh, I guess that doesn't work because of recursive calls to
> get_online_cpus(). Maybe we need a down_read_recursive() that skips
> ahead of waiting writers if the lock is already held in read mode?
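A down_read_recursive() along the lines Paul suggests could track a per-task
read depth, so a nested get_online_cpus() never blocks behind a waiting
writer. A userspace sketch of the idea (hypothetical names; a mutex/condvar
pair and thread-local storage stand in for the kernel's rwsem and task
bookkeeping, and __thread is a GCC extension):

```c
#include <pthread.h>

static pthread_mutex_t hp_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  hp_cond = PTHREAD_COND_INITIALIZER;
static int hp_readers;			/* threads currently in a read section */
static int hp_writer;			/* nonzero while "hotplug" is in progress */
static __thread int hp_read_depth;	/* this thread's nesting count */

void get_online_cpus(void)
{
	if (hp_read_depth++ > 0)
		return;			/* recursive: skip past waiting writers */
	pthread_mutex_lock(&hp_lock);
	while (hp_writer)
		pthread_cond_wait(&hp_cond, &hp_lock);
	hp_readers++;
	pthread_mutex_unlock(&hp_lock);
}

void put_online_cpus(void)
{
	if (--hp_read_depth > 0)
		return;			/* still nested; nothing to release yet */
	pthread_mutex_lock(&hp_lock);
	if (--hp_readers == 0)
		pthread_cond_broadcast(&hp_cond);
	pthread_mutex_unlock(&hp_lock);
}

void cpu_hotplug_begin(void)
{
	pthread_mutex_lock(&hp_lock);
	hp_writer = 1;			/* new first-time readers now wait... */
	while (hp_readers > 0)		/* ...but nested readers sail through */
		pthread_cond_wait(&hp_cond, &hp_lock);
	pthread_mutex_unlock(&hp_lock);
}

void cpu_hotplug_done(void)
{
	pthread_mutex_lock(&hp_lock);
	hp_writer = 0;
	pthread_cond_broadcast(&hp_cond);
	pthread_mutex_unlock(&hp_lock);
}
```

The key property is in the first two lines of get_online_cpus(): a reader
that already holds the lock never re-enters the slow path, so it cannot
deadlock behind a writer that is waiting for it to finish.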

Instead of changing the cpu_hotplug locking, should we maybe try to avoid
using cgroup_lock() in rebuild_sched_domains()?
There is a comment in cpuset.c that says
 * If a task is only holding callback_mutex, then it has read-only
 * access to cpusets.

I'm not sure if it's still valid. rebuild_sched_domains() only needs read-only
access; it does not really modify any cpuset structures.

Max



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC][PATCH] CPUSets: Move most calls to rebuild_sched_domains() to the workqueue
  2008-06-27  5:10           ` Max Krasnyansky
@ 2008-06-27  5:51             ` Paul Menage
  2008-06-27 17:31               ` Max Krasnyansky
  0 siblings, 1 reply; 14+ messages in thread
From: Paul Menage @ 2008-06-27  5:51 UTC (permalink / raw)
  To: Max Krasnyansky
  Cc: Vegard Nossum, Paul Jackson, a.p.zijlstra, linux-kernel,
	Gautham shenoy

On Thu, Jun 26, 2008 at 10:10 PM, Max Krasnyansky <maxk@qualcomm.com> wrote:
>
> Instead of changing cpu_hotplug locking should we maybe try to avoid using
> cgroup_lock in rebuild_sched_domains() ?

Yes, that would be good too.

> There is a comment in cpuset.c that says
>  * If a task is only holding callback_mutex, then it has read-only
>  * access to cpusets.
>
> I'm not sure if it's still valid. rebuild_sched_domains() only needs read only
> access, it does not really modify any cpuset structures.

The comment is still valid, if you interpret it narrowly enough.
Holding callback_mutex gives you read-only access to structures that
are under the control of cpusets. But rebuild_sched_domains() needs to
traverse the hierarchy of cpusets, and that hierarchy is controlled by
cgroups. Currently the only synchronization primitives exposed by
cgroups are:

- cgroup_lock()/cgroup_unlock() to prevent all cgroup modifications
(also used as the main synchronization primitive by some subsystems,
i.e. it's in danger of becoming the cgroups equivalent of the BKL).

- task_lock()/task_unlock() to prevent a specific task from changing cgroups

Possible options for richer locking support include:

- lock/unlock a hierarchy, to prevent creation/deletion of cgroups in
that hierarchy

- lock/unlock a cgroup to prevent deletion of that cgroup

- lock/unlock a cgroup to prevent task movement in/out of that cgroup

For the case of rebuild_sched_domains, we need the first of these
options. This lock would be taken in cgroup.c at the points where it
attaches and removes cgroups from a cgroup tree, and could be taken by
something like cpusets that needed to keep the hierarchy stable while
scanning it. I think it would be fine to make it a mutex rather than a
spinlock.

cpu_hotplug.lock has to nest outside this hierarchy lock due to it
being taken at the root of the hotplug/unplug path. So as long as we
can ensure that we can always nest the hierarchy lock inside any
get_online_cpus()/put_online_cpus() pairs, we should be OK.
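A rough userspace sketch of that first option, the per-hierarchy lock (all
names hypothetical): the cgroup.c-side attach path and a cpusets-style scan
both serialize on one mutex, keeping the tree stable during the walk.

```c
#include <pthread.h>
#include <stddef.h>

struct cgroup {
	struct cgroup *parent;
	struct cgroup *children;	/* first child */
	struct cgroup *sibling;		/* next sibling */
};

/* One lock per hierarchy; a single global hierarchy here for brevity. */
static pthread_mutex_t hierarchy_mutex = PTHREAD_MUTEX_INITIALIZER;
static struct cgroup hierarchy_root;

/* cgroup.c would take the lock at the points where cgroups are attached
 * to (and removed from) the tree... */
void cgroup_attach(struct cgroup *parent, struct cgroup *cg)
{
	pthread_mutex_lock(&hierarchy_mutex);
	cg->parent = parent;
	cg->children = NULL;
	cg->sibling = parent->children;
	parent->children = cg;
	pthread_mutex_unlock(&hierarchy_mutex);
}

static int count_subtree(struct cgroup *cg)
{
	int n = 1;
	struct cgroup *c;

	for (c = cg->children; c; c = c->sibling)
		n += count_subtree(c);
	return n;
}

/* ...and cpusets would take it to keep the hierarchy stable while
 * scanning, the way rebuild_sched_domains() walks the cpuset tree. */
int scan_hierarchy(void)
{
	int n;

	pthread_mutex_lock(&hierarchy_mutex);
	n = count_subtree(&hierarchy_root);
	pthread_mutex_unlock(&hierarchy_mutex);
	return n;
}
```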

Paul

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC][PATCH] CPUSets: Move most calls to rebuild_sched_domains() to the workqueue
  2008-06-27  3:23   ` Gautham R Shenoy
  2008-06-27  4:53     ` Max Krasnyansky
@ 2008-06-27 16:42     ` Oleg Nesterov
  1 sibling, 0 replies; 14+ messages in thread
From: Oleg Nesterov @ 2008-06-27 16:42 UTC (permalink / raw)
  To: Gautham R Shenoy
  Cc: Paul Menage, Vegard Nossum, Paul Jackson, a.p.zijlstra, maxk,
	linux-kernel

On 06/27, Gautham R Shenoy wrote:
>
> > Using a multithreaded workqueue (kevent here) for this is not such a
> > great idea, since currently we cannot call
> > get_online_cpus() from a work item executed by a multithreaded workqueue.
> > 
> > Can one use a single-threaded workqueue here instead?

Yes, this should help.

But I think you are right, we should just fix this annoying problem,

> > Or, better, I think we can ask Oleg to re-submit the patch he had to make
> > get_online_cpus() safe to be called from within the workqueue. It does
> > require a special post-CPU_DEAD notification, but it did work the
> > last time I checked.

OK, will do on Sunday.

Oleg.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC][PATCH] CPUSets: Move most calls to rebuild_sched_domains() to the workqueue
  2008-06-27  5:51             ` Paul Menage
@ 2008-06-27 17:31               ` Max Krasnyansky
  0 siblings, 0 replies; 14+ messages in thread
From: Max Krasnyansky @ 2008-06-27 17:31 UTC (permalink / raw)
  To: Paul Menage
  Cc: Vegard Nossum, Paul Jackson, a.p.zijlstra, linux-kernel,
	Gautham shenoy

Paul Menage wrote:
> On Thu, Jun 26, 2008 at 10:10 PM, Max Krasnyansky <maxk@qualcomm.com> wrote:
>> Instead of changing cpu_hotplug locking should we maybe try to avoid using
>> cgroup_lock in rebuild_sched_domains() ?
> 
> Yes, that would be good too.
> 
>> There is a comment in cpuset.c that says
>>  * If a task is only holding callback_mutex, then it has read-only
>>  * access to cpusets.
>>
>> I'm not sure if it's still valid. rebuild_sched_domains() only needs read only
>> access, it does not really modify any cpuset structures.
> 
> The comment is still valid, if you interpret it narrowly enough.
> Holding callback_mutex gives you read-only access to structures that
> are under the control of cpusets. But rebuild_sched_domains() needs to
> traverse the hierarchy of cpusets, and that hierarchy is controlled by
> cgroups. 
Yes, that's what I meant by "not sure if it's still valid". I looked at
the code and it did not look like callback_mutex protected the overall
hierarchy. Thanks for confirming that.

> Currently the only synchronization primitives exposed by
> cgroups are:
> 
> - cgroup_lock()/cgroup_unlock() to prevent all cgroup modifications
> (also used as the main synchronization primitive by some subsystems,
> i.e. it's in danger of becoming the cgroups equivalent of the BKL).
> 
> - task_lock()/task_unlock() to prevent a specific task from changing cgroups
> 
> Possible options for richer locking support include:
> 
> - lock/unlock a hierarchy, to prevent creation/deletion of cgroups in
> that hierarchy
Sounds good.

> - lock/unlock a cgroup to prevent deletion of that cgroup
Can that be just an atomic refcount?

> - lock/unlock a cgroup to prevent task movement in/out of that cgroup
Sounds good.

> For the case of rebuild_sched_domains, we need the first of these
> options. This lock would be taken in cgroup.c at the points where it
> attached and removed cgroups from a cgroup tree, and could be taken by
> something like cpusets that needed to keep the hierarchy stable while
> scanning it. I think it would be fine to make it a mutex rather than a
> spinlock.
Agree

> cpu_hotplug.lock has to nest outside this hierarchy lock due to it
> being taken at the root of the hotplug/unplug path. So as long as we
> can ensure that we can always nest the hierarchy lock inside any
> get_online_cpus()/put_online_cpus() pairs, we should be OK.
Yes. Although that basically means that we always have to take
cpu_hotplug.lock before the hierarchy lock.

I like the proposal in general. Specifically for
rebuild_sched_domains() I'm now thinking that maybe we can get away with
not involving cpusets at all. I think that's what Peter meant originally.
I'll send more thoughts on this separately.
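Max's atomic-refcount suggestion above, for pinning a cgroup against
deletion, might be sketched like this with C11 atomics; all names are
hypothetical and this is only an illustration of the idea:

```c
#include <stdatomic.h>
#include <stdbool.h>

struct cgroup_pin {
	atomic_int refcnt;	/* outstanding pins against deletion */
};

/* Pin the cgroup so it cannot be deleted while we use it. */
void cgroup_pin_get(struct cgroup_pin *cg)
{
	atomic_fetch_add(&cg->refcnt, 1);
}

void cgroup_pin_put(struct cgroup_pin *cg)
{
	atomic_fetch_sub(&cg->refcnt, 1);
}

/*
 * rmdir-style deletion: succeeds only if no pins are outstanding.
 * Swinging the count to -1 marks the cgroup dead; a fuller version
 * would also make cgroup_pin_get() fail once the count is negative.
 */
bool cgroup_try_delete(struct cgroup_pin *cg)
{
	int expected = 0;

	return atomic_compare_exchange_strong(&cg->refcnt, &expected, -1);
}
```

This only covers the delete case; preventing task movement in/out of a
cgroup (the third option above) would still need a real lock.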

Max


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2008-06-27 17:31 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-06-26  7:56 [RFC][PATCH] CPUSets: Move most calls to rebuild_sched_domains() to the workqueue Paul Menage
2008-06-26  9:34 ` Vegard Nossum
2008-06-26  9:50   ` Paul Menage
2008-06-26 18:49     ` Max Krasnyansky
2008-06-26 19:19       ` Peter Zijlstra
2008-06-26 20:34       ` Paul Menage
2008-06-26 21:17         ` Paul Menage
2008-06-27  5:10           ` Max Krasnyansky
2008-06-27  5:51             ` Paul Menage
2008-06-27 17:31               ` Max Krasnyansky
2008-06-27  3:22 ` Gautham R Shenoy
2008-06-27  3:23   ` Gautham R Shenoy
2008-06-27  4:53     ` Max Krasnyansky
2008-06-27 16:42     ` Oleg Nesterov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox