possible recursive locking in numasched code

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
To: Peter Zijlstra <peterz@infradead.org>, Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>, LKML <linux-kernel@vger.kernel.org>
Subject: possible recursive locking in numasched code
Date: Tue, 25 Sep 2012 15:11:57 +0530	[thread overview]
Message-ID: <20120925094157.GD18334@linux.vnet.ibm.com> (raw)

Hi Peter,

While running tests with patches for sched-numa rewrite posted at 
http://permalink.gmane.org/gmane.linux.kernel/1335968, I 
came across a possible recursive locking scenaio. 

Please see approaches and a possible patch to avoid the same.
 
 ======================================================
 [ INFO: possible circular locking dependency detected ]
 3.6.0-rc1-numasched_v2_100912+ #2 Not tainted
 -------------------------------------------------------
 gzip/54680 is trying to acquire lock:
 (&p->pi_lock){-.-.-.}, at: [<ffffffff81078b08>] task_work_add+0x38/0xb0
 
 but task is already holding lock:
 (&rq->lock){-.-.-.}, at: [<ffffffff8108e203>] scheduler_tick+0x53/0x150
 
 which lock already depends on the new lock.
 
 
 the existing dependency chain (in reverse order) is:
 
 -> #1 (&rq->lock){-.-.-.}:
       [<ffffffff810be782>] lock_acquire+0xa2/0x130
       [<ffffffff81551866>] _raw_spin_lock+0x36/0x70
       [<ffffffff810903c7>] wake_up_new_task+0xa7/0x1c0
       [<ffffffff8105259b>] do_fork+0xcb/0x380
       [<ffffffff8101cef6>] kernel_thread+0x76/0x80
       [<ffffffff81535036>] rest_init+0x26/0x160
       [<ffffffff81ca6d93>] start_kernel+0x3b6/0x3c3
       [<ffffffff81ca6356>] x86_64_start_reservations+0x131/0x136
       [<ffffffff81ca645e>] x86_64_start_kernel+0x103/0x112
 
 -> #0 (&p->pi_lock){-.-.-.}:
       [<ffffffff810be478>] __lock_acquire+0x1428/0x1690
       [<ffffffff810be782>] lock_acquire+0xa2/0x130
       [<ffffffff81551a15>] _raw_spin_lock_irqsave+0x55/0xa0
       [<ffffffff81078b08>] task_work_add+0x38/0xb0
       [<ffffffff8109920f>] task_tick_numa+0xcf/0xe0
       [<ffffffff81099fb9>] task_tick_fair+0x129/0x160
       [<ffffffff8108e28e>] scheduler_tick+0xde/0x150
       [<ffffffff81064dfe>] update_process_times+0x6e/0x90
       [<ffffffff810b68d6>] tick_sched_timer+0x76/0xe0
       [<ffffffff81084ec8>] __run_hrtimer+0x78/0x1b0
       [<ffffffff810852a6>] hrtimer_interrupt+0x106/0x280
       [<ffffffff8155cc39>] smp_apic_timer_interrupt+0x69/0x99
       [<ffffffff8155bb2f>] apic_timer_interrupt+0x6f/0x80
 
 other info that might help us debug this:
 
 Possible unsafe locking scenario:
 
       CPU0                    CPU1
       ----                    ----
  lock(&rq->lock);
                               lock(&p->pi_lock);
                               lock(&rq->lock);
  lock(&p->pi_lock);
 
 *** DEADLOCK ***
 
 1 lock held by gzip/54680:
 #0:  (&rq->lock){-.-.-.}, at: [<ffffffff8108e203>] scheduler_tick+0x53/0x150
 
 stack backtrace:
 Pid: 54680, comm: gzip Not tainted 3.6.0-rc1-numasched_v2_100912+ #2
 Call Trace:
 <IRQ>  [<ffffffff810bbc6a>] print_circular_bug+0x21a/0x2f0
 [<ffffffff810be478>] __lock_acquire+0x1428/0x1690
 [<ffffffff81096ab8>] ? sched_clock_cpu+0xb8/0x110
 [<ffffffff810be782>] lock_acquire+0xa2/0x130
 [<ffffffff81078b08>] ? task_work_add+0x38/0xb0
 [<ffffffff81096b7f>] ? local_clock+0x6f/0x80
 [<ffffffff81551a15>] _raw_spin_lock_irqsave+0x55/0xa0
 [<ffffffff81078b08>] ? task_work_add+0x38/0xb0
 [<ffffffff81078b08>] task_work_add+0x38/0xb0
 [<ffffffff8109920f>] task_tick_numa+0xcf/0xe0
 [<ffffffff81099fb9>] task_tick_fair+0x129/0x160
 [<ffffffff8108e28e>] scheduler_tick+0xde/0x150
 [<ffffffff81064dfe>] update_process_times+0x6e/0x90
 [<ffffffff810b68d6>] tick_sched_timer+0x76/0xe0
 [<ffffffff81084ec8>] __run_hrtimer+0x78/0x1b0
 [<ffffffff810b6860>] ? tick_setup_sched_timer+0x100/0x100
 [<ffffffff810852a6>] hrtimer_interrupt+0x106/0x280
 [<ffffffff8155cc39>] smp_apic_timer_interrupt+0x69/0x99
 [<ffffffff8155bb2f>] apic_timer_interrupt+0x6f/0x80
 <EOI>  [<ffffffff81552595>] ? retint_swapgs+0x13/0x1b



So its clear that we are calling task_work_add from scheduler_tick.
scheduler_tick has already taken rq->lock  and task_work_add needs to take
the same rq->lock.

I figured out that this deadlock can be resolved in more than one way.
1. Add a new callback in sched_class. 
	May be the cleanest but not sure if its worth adding a callback just
for sched class.

2. Have a modified task_work_add that assumes rq->lock is already taken.
	We have to export such an interface and that seems risky if somebody
unknowingly uses such an interface instead of main interface.

3. Move the task_tick_numa to sleep/wakeup.
	Not sure about the overheads.

4. Move the call to task_tick_numa to outside the rq->lock.
	Not a clean approach as in we end up calling a sched/fair.c defined
function in sched/core.c explicitly.

The below patch tries to implement 4th approach. 

---
From: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Date: Tue, 25 Sep 2012 02:37:55 -0700
Subject: [PATCH] Dont call task_tick_numa while holding the rq->lock

task_tick_numa() calls task_work_add() which also needs rq->lock.
However task_tick_numa() gets called under rq->lock. This leads to 
recursive locking. Avoid recursive locking by calling task_tick_numa()
outside the rq->lock.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 87b4601..5d496be 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3236,6 +3236,9 @@ void scheduler_tick(void)
 	curr->sched_class->task_tick(rq, curr, 0);
 	raw_spin_unlock(&rq->lock);
 
+	if (sched_feat_numa(NUMA) && curr->sched_class == &fair_sched_class)
+		task_tick_numa(rq, curr);
+
 	perf_event_task_tick();
 
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 85c8fb7..5f594a5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5417,8 +5417,6 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 		entity_tick(cfs_rq, se, queued);
 	}
 
-	if (sched_feat_numa(NUMA))
-		task_tick_numa(rq, curr);
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index deff28e..e4ea589 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -913,6 +913,7 @@ extern struct rt_bandwidth def_rt_bandwidth;
 extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime);
 
 extern void update_idle_cpu_load(struct rq *this_rq);
+extern void task_tick_numa(struct rq *rq, struct task_struct *curr);
 
 #ifdef CONFIG_CGROUP_CPUACCT
 #include <linux/cgroup.h>

next             reply	other threads:[~2012-09-25 10:17 UTC|newest]

Thread overview: 2+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-09-25  9:41 Srikar Dronamraju [this message]
2012-09-25 13:52 ` possible recursive locking in numasched code Oleg Nesterov

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:87b4601 dfblob:5d496be dfblob:85c8fb7 dfblob:5f594a5
dfblob:deff28e dfblob:e4ea589 )
 OR (
bs:"Dont call task_tick_numa while holding the rq->lock" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20120925094157.GD18334@linux.vnet.ibm.com \
    --to=srikar@linux.vnet.ibm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@elte.hu \
    --cc=oleg@redhat.com \
    --cc=peterz@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.