All of lore.kernel.org
 help / color / mirror / Atom feed
* possible recursive locking in numasched code
@ 2012-09-25  9:41 Srikar Dronamraju
  2012-09-25 13:52 ` Oleg Nesterov
  0 siblings, 1 reply; 2+ messages in thread
From: Srikar Dronamraju @ 2012-09-25  9:41 UTC (permalink / raw)
  To: Peter Zijlstra, Oleg Nesterov; +Cc: Ingo Molnar, LKML

Hi Peter,

While running tests with patches for sched-numa rewrite posted at 
http://permalink.gmane.org/gmane.linux.kernel/1335968, I 
came across a possible recursive locking scenaio. 

Please see approaches and a possible patch to avoid the same.
 
 ======================================================
 [ INFO: possible circular locking dependency detected ]
 3.6.0-rc1-numasched_v2_100912+ #2 Not tainted
 -------------------------------------------------------
 gzip/54680 is trying to acquire lock:
 (&p->pi_lock){-.-.-.}, at: [<ffffffff81078b08>] task_work_add+0x38/0xb0
 
 but task is already holding lock:
 (&rq->lock){-.-.-.}, at: [<ffffffff8108e203>] scheduler_tick+0x53/0x150
 
 which lock already depends on the new lock.
 
 
 the existing dependency chain (in reverse order) is:
 
 -> #1 (&rq->lock){-.-.-.}:
       [<ffffffff810be782>] lock_acquire+0xa2/0x130
       [<ffffffff81551866>] _raw_spin_lock+0x36/0x70
       [<ffffffff810903c7>] wake_up_new_task+0xa7/0x1c0
       [<ffffffff8105259b>] do_fork+0xcb/0x380
       [<ffffffff8101cef6>] kernel_thread+0x76/0x80
       [<ffffffff81535036>] rest_init+0x26/0x160
       [<ffffffff81ca6d93>] start_kernel+0x3b6/0x3c3
       [<ffffffff81ca6356>] x86_64_start_reservations+0x131/0x136
       [<ffffffff81ca645e>] x86_64_start_kernel+0x103/0x112
 
 -> #0 (&p->pi_lock){-.-.-.}:
       [<ffffffff810be478>] __lock_acquire+0x1428/0x1690
       [<ffffffff810be782>] lock_acquire+0xa2/0x130
       [<ffffffff81551a15>] _raw_spin_lock_irqsave+0x55/0xa0
       [<ffffffff81078b08>] task_work_add+0x38/0xb0
       [<ffffffff8109920f>] task_tick_numa+0xcf/0xe0
       [<ffffffff81099fb9>] task_tick_fair+0x129/0x160
       [<ffffffff8108e28e>] scheduler_tick+0xde/0x150
       [<ffffffff81064dfe>] update_process_times+0x6e/0x90
       [<ffffffff810b68d6>] tick_sched_timer+0x76/0xe0
       [<ffffffff81084ec8>] __run_hrtimer+0x78/0x1b0
       [<ffffffff810852a6>] hrtimer_interrupt+0x106/0x280
       [<ffffffff8155cc39>] smp_apic_timer_interrupt+0x69/0x99
       [<ffffffff8155bb2f>] apic_timer_interrupt+0x6f/0x80
 
 other info that might help us debug this:
 
 Possible unsafe locking scenario:
 
       CPU0                    CPU1
       ----                    ----
  lock(&rq->lock);
                               lock(&p->pi_lock);
                               lock(&rq->lock);
  lock(&p->pi_lock);
 
 *** DEADLOCK ***
 
 1 lock held by gzip/54680:
 #0:  (&rq->lock){-.-.-.}, at: [<ffffffff8108e203>] scheduler_tick+0x53/0x150
 
 stack backtrace:
 Pid: 54680, comm: gzip Not tainted 3.6.0-rc1-numasched_v2_100912+ #2
 Call Trace:
 <IRQ>  [<ffffffff810bbc6a>] print_circular_bug+0x21a/0x2f0
 [<ffffffff810be478>] __lock_acquire+0x1428/0x1690
 [<ffffffff81096ab8>] ? sched_clock_cpu+0xb8/0x110
 [<ffffffff810be782>] lock_acquire+0xa2/0x130
 [<ffffffff81078b08>] ? task_work_add+0x38/0xb0
 [<ffffffff81096b7f>] ? local_clock+0x6f/0x80
 [<ffffffff81551a15>] _raw_spin_lock_irqsave+0x55/0xa0
 [<ffffffff81078b08>] ? task_work_add+0x38/0xb0
 [<ffffffff81078b08>] task_work_add+0x38/0xb0
 [<ffffffff8109920f>] task_tick_numa+0xcf/0xe0
 [<ffffffff81099fb9>] task_tick_fair+0x129/0x160
 [<ffffffff8108e28e>] scheduler_tick+0xde/0x150
 [<ffffffff81064dfe>] update_process_times+0x6e/0x90
 [<ffffffff810b68d6>] tick_sched_timer+0x76/0xe0
 [<ffffffff81084ec8>] __run_hrtimer+0x78/0x1b0
 [<ffffffff810b6860>] ? tick_setup_sched_timer+0x100/0x100
 [<ffffffff810852a6>] hrtimer_interrupt+0x106/0x280
 [<ffffffff8155cc39>] smp_apic_timer_interrupt+0x69/0x99
 [<ffffffff8155bb2f>] apic_timer_interrupt+0x6f/0x80
 <EOI>  [<ffffffff81552595>] ? retint_swapgs+0x13/0x1b



So its clear that we are calling task_work_add from scheduler_tick.
scheduler_tick has already taken rq->lock  and task_work_add needs to take
the same rq->lock.

I figured out that this deadlock can be resolved in more than one way.
1. Add a new callback in sched_class. 
	May be the cleanest but not sure if its worth adding a callback just
for sched class.

2. Have a modified task_work_add that assumes rq->lock is already taken.
	We have to export such an interface and that seems risky if somebody
unknowingly uses such an interface instead of main interface.

3. Move the task_tick_numa to sleep/wakeup.
	Not sure about the overheads.

4. Move the call to task_tick_numa to outside the rq->lock.
	Not a clean approach as in we end up calling a sched/fair.c defined
function in sched/core.c explicitly.

The below patch tries to implement 4th approach. 

---
From: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Date: Tue, 25 Sep 2012 02:37:55 -0700
Subject: [PATCH] Dont call task_tick_numa while holding the rq->lock

task_tick_numa() calls task_work_add() which also needs rq->lock.
However task_tick_numa() gets called under rq->lock. This leads to 
recursive locking. Avoid recursive locking by calling task_tick_numa()
outside the rq->lock.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 87b4601..5d496be 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3236,6 +3236,9 @@ void scheduler_tick(void)
 	curr->sched_class->task_tick(rq, curr, 0);
 	raw_spin_unlock(&rq->lock);
 
+	if (sched_feat_numa(NUMA) && curr->sched_class == &fair_sched_class)
+		task_tick_numa(rq, curr);
+
 	perf_event_task_tick();
 
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 85c8fb7..5f594a5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5417,8 +5417,6 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 		entity_tick(cfs_rq, se, queued);
 	}
 
-	if (sched_feat_numa(NUMA))
-		task_tick_numa(rq, curr);
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index deff28e..e4ea589 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -913,6 +913,7 @@ extern struct rt_bandwidth def_rt_bandwidth;
 extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime);
 
 extern void update_idle_cpu_load(struct rq *this_rq);
+extern void task_tick_numa(struct rq *rq, struct task_struct *curr);
 
 #ifdef CONFIG_CGROUP_CPUACCT
 #include <linux/cgroup.h>


^ permalink raw reply related	[flat|nested] 2+ messages in thread

* Re: possible recursive locking in numasched code
  2012-09-25  9:41 possible recursive locking in numasched code Srikar Dronamraju
@ 2012-09-25 13:52 ` Oleg Nesterov
  0 siblings, 0 replies; 2+ messages in thread
From: Oleg Nesterov @ 2012-09-25 13:52 UTC (permalink / raw)
  To: Srikar Dronamraju; +Cc: Peter Zijlstra, Ingo Molnar, LKML

On 09/25, Srikar Dronamraju wrote:
>
>  ======================================================
>  [ INFO: possible circular locking dependency detected ]
>  3.6.0-rc1-numasched_v2_100912+ #2 Not tainted
>  -------------------------------------------------------
>
>  -> #0 (&p->pi_lock){-.-.-.}:
>        [<ffffffff810be478>] __lock_acquire+0x1428/0x1690
>        [<ffffffff810be782>] lock_acquire+0xa2/0x130
>        [<ffffffff81551a15>] _raw_spin_lock_irqsave+0x55/0xa0
>        [<ffffffff81078b08>] task_work_add+0x38/0xb0

Yes. This should be fixed by ac3d0da8 from tip:core/urgent,
attached below.

Oleg.

task_work: Make task_work_add() lockless

Change task_work's to use llist-like code to avoid pi_lock
in task_work_add(), this makes it useable under rq->lock.

task_work_cancel() and task_work_run() still use pi_lock
to synchronize with each other.

(This is in preparation for a deadlock fix.)

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20120826191209.GA4221@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/task_work.c |   95 ++++++++++++++++++++++++++-------------------------
 1 files changed, 48 insertions(+), 47 deletions(-)

diff --git a/kernel/task_work.c b/kernel/task_work.c
index d320d44..f13ec0b 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -3,25 +3,18 @@
 #include <linux/tracehook.h>
 
 int
-task_work_add(struct task_struct *task, struct callback_head *twork, bool notify)
+task_work_add(struct task_struct *task, struct callback_head *work, bool notify)
 {
-	struct callback_head *last, *first;
-	unsigned long flags;
-
+	struct callback_head *head;
 	/*
 	 * Not inserting the new work if the task has already passed
 	 * exit_task_work() is the responisbility of callers.
 	 */
-	raw_spin_lock_irqsave(&task->pi_lock, flags);
-	last = task->task_works;
-	first = last ? last->next : twork;
-	twork->next = first;
-	if (last)
-		last->next = twork;
-	task->task_works = twork;
-	raw_spin_unlock_irqrestore(&task->pi_lock, flags);
+	do {
+		head = ACCESS_ONCE(task->task_works);
+		work->next = head;
+	} while (cmpxchg(&task->task_works, head, work) != head);
 
-	/* test_and_set_bit() implies mb(), see tracehook_notify_resume(). */
 	if (notify)
 		set_notify_resume(task);
 	return 0;
@@ -30,52 +23,60 @@ task_work_add(struct task_struct *task, struct callback_head *twork, bool notify
 struct callback_head *
 task_work_cancel(struct task_struct *task, task_work_func_t func)
 {
+	struct callback_head **pprev = &task->task_works;
+	struct callback_head *work = NULL;
 	unsigned long flags;
-	struct callback_head *last, *res = NULL;
-
+	/*
+	 * If cmpxchg() fails we continue without updating pprev.
+	 * Either we raced with task_work_add() which added the
+	 * new entry before this work, we will find it again. Or
+	 * we raced with task_work_run(), *pprev == NULL.
+	 */
 	raw_spin_lock_irqsave(&task->pi_lock, flags);
-	last = task->task_works;
-	if (last) {
-		struct callback_head *q = last, *p = q->next;
-		while (1) {
-			if (p->func == func) {
-				q->next = p->next;
-				if (p == last)
-					task->task_works = q == p ? NULL : q;
-				res = p;
-				break;
-			}
-			if (p == last)
-				break;
-			q = p;
-			p = q->next;
-		}
+	while ((work = ACCESS_ONCE(*pprev))) {
+		read_barrier_depends();
+		if (work->func != func)
+			pprev = &work->next;
+		else if (cmpxchg(pprev, work, work->next) == work)
+			break;
 	}
 	raw_spin_unlock_irqrestore(&task->pi_lock, flags);
-	return res;
+
+	return work;
 }
 
 void task_work_run(void)
 {
 	struct task_struct *task = current;
-	struct callback_head *p, *q;
+	struct callback_head *work, *head, *next;
 
-	while (1) {
-		raw_spin_lock_irq(&task->pi_lock);
-		p = task->task_works;
-		task->task_works = NULL;
-		raw_spin_unlock_irq(&task->pi_lock);
+	for (;;) {
+		work = xchg(&task->task_works, NULL);
+		if (!work)
+			break;
+		/*
+		 * Synchronize with task_work_cancel(). It can't remove
+		 * the first entry == work, cmpxchg(task_works) should
+		 * fail, but it can play with *work and other entries.
+		 */
+		raw_spin_unlock_wait(&task->pi_lock);
+		smp_mb();
 
-		if (unlikely(!p))
-			return;
+		/* Reverse the list to run the works in fifo order */
+		head = NULL;
+		do {
+			next = work->next;
+			work->next = head;
+			head = work;
+			work = next;
+		} while (work);
 
-		q = p->next; /* head */
-		p->next = NULL; /* cut it */
-		while (q) {
-			p = q->next;
-			q->func(q);
-			q = p;
+		work = head;
+		do {
+			next = work->next;
+			work->func(work);
+			work = next;
 			cond_resched();
-		}
+		} while (work);
 	}
 }



^ permalink raw reply related	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2012-09-25 13:51 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-09-25  9:41 possible recursive locking in numasched code Srikar Dronamraju
2012-09-25 13:52 ` Oleg Nesterov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.