[BUG] sched_rt_periodic_timer vs cpu hotplug

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [BUG] sched_rt_periodic_timer vs cpu hotplug
@ 2009-11-11 10:18 Heiko Carstens
  2009-11-11 10:30 ` Peter Zijlstra
  0 siblings, 1 reply; 6+ messages in thread
From: Heiko Carstens @ 2009-11-11 10:18 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Gregory Haskins, Siddha, Suresh B
  Cc: linux-kernel, Martin Schwidefsky

Hi all,

we've seen a crash on s390 which seems to be related to sched_rt_period_timer vs.
cpu hotplug:

    <1>Unable to handle kernel pointer dereference at virtual kernel address 00000000ff5ec000
    <4>Oops: 0011 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    <4>Modules linked in: sunrpc qeth_l2 dm_multipath dm_mod chsc_sch qeth ccwgroup
    <4>CPU: 9 Not tainted 2.6.31-39.x.20090916-s390xdefault #1
    <4>Process swapper (pid: 0, task: 00000000ffc8ca40, ksp: 00000000ffc93d48)
    <4>Krnl PSW : 0404200180000000 000000000013952c (sched_rt_period_timer+0x188/0x3d8)
    <4>           R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:0 CC:2 PM:0 EA:3
    <4>Krnl GPRS: ffffffffffffffff ffffffffffffff80 00000000ff5ec000 0000000000000008
    <4>           0000000000000000 0000000000000040 0000000000000001 0000000000a7db58
    <4>           0000000087139db8 0000000087da6500 0000000000000000 0000000000000007
    <4>           00000000ff5ec008 0000000000598cc8 00000000001394d0 00000000ff7b7968
    <4>Krnl Code: 000000000013951c: a709ffff            lghi    %r0,-1
    <4>           0000000000139520: eb102000000d        sllg    %r1,%r0,0(%r2)
    <4>           0000000000139526: e320f0b80004        lg      %r2,184(%r15)
    <4>          >000000000013952c: e31320000080        ng      %r1,0(%r3,%r2)
    <4>           0000000000139532: 1211                ltr     %r1,%r1
    <4>           0000000000139534: a78400ff            brc     8,139732
    <4>           0000000000139538: a7290000            lghi    %r2,0
    <4>           000000000013953c: a711ffff            tmll    %r1,65535
    <4>Call Trace:
    <4>([<00000000001394d0>] sched_rt_period_timer+0x12c/0x3d8)
    <4> [<0000000000173db0>] __run_hrtimer+0xb0/0x110
    <4> [<00000000001740b2>] hrtimer_interrupt+0xf2/0x1e8
    <4> [<000000000010770c>] clock_comparator_work+0x68/0x70
    <4> [<000000000010dbc0>] do_extint+0x18c/0x190
    <4> [<0000000000117f9e>] ext_no_vtime+0x1e/0x22
    <4> [<000000000058ea04>] _spin_unlock_irq+0x48/0x80
    <4>([<000000000058ea00>] _spin_unlock_irq+0x44/0x80)
    <4> [<000000000043c190>] dasd_block_tasklet+0x1b8/0x2b0
    <4> [<0000000000155b0e>] tasklet_hi_action+0xfe/0x1f4
    <4> [<00000000001570d4>] __do_softirq+0x184/0x2e8
    <4> [<0000000000110b34>] do_softirq+0xe4/0xe8
    <4> [<0000000000156ac4>] irq_exit+0xc0/0xe0
    <4> [<000000000010db7a>] do_extint+0x146/0x190
    <4> [<0000000000117f9e>] ext_no_vtime+0x1e/0x22
    <4> [<0000000000115040>] vtime_stop_cpu+0xac/0x100
    <4>([<0000000000114fe6>] vtime_stop_cpu+0x52/0x100)
    <4> [<000000000010a324>] cpu_idle+0xfc/0x198
    <4> [<0000000000584a64>] start_secondary+0xb4/0xc0

sched_rt_period_timer tried to access a memory region which was unmapped from
the kernel 1:1 mapping. So we seem to have a use-after-free bug.

The C code snippet in question, which seems to cause the addressing exception is:

static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun)
{
	int i, idle = 1;
	const struct cpumask *span;

	if (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF)
		return 1;

	span = sched_rt_period_mask();
	for_each_cpu(i, span) {   <------ read access to root_domain of runqueue
		int enqueue = 0;
...

with

static inline const struct cpumask *sched_rt_period_mask(void)
{
	return cpu_rq(smp_processor_id())->rd->span;
}

The read access to the span cpumask within the root_domain caused the exception.

Now since DEBUG_PAGEALLOC is turned on we can easily see who freed the piece of
memory since it contains a backtrace:

0x13caca <cpu_attach_domain+482>
0x1418fe <partition_sched_domains+350>
0x141d90 <update_sched_domains+100>
0x5915a6 <notifier_call_chain+150>
0x17666c <raw_notifier_call_chain+44>
0x585b74 <_cpu_up+436>
0x585c3a <cpu_up+186>
0x58336a <store_online+146>
0x29cfa4 <sysfs_write_file+248>
0x228b60 <SyS_write>

cpu_attach_domain calls (inlined) rq_attach_root. That function replaces a
runqueue's root_domain while holding its lock (&rq->lock).

Now the code snippet above from do_sched_rt_period_timer does access a
runqueue's root_domain _without_ holding its lock.
That way a concurrent cpu_up operation can easily change a runqueue's
root_domain pointer while it is still in use. Which is what happened here.

Just grabbing and releasing the lock for each iteration is probably not the
real fix, since the span mask could change between iterations. Which might
lead to strange effects.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [BUG] sched_rt_periodic_timer vs cpu hotplug
  2009-11-11 10:18 [BUG] sched_rt_periodic_timer vs cpu hotplug Heiko Carstens
@ 2009-11-11 10:30 ` Peter Zijlstra
  2009-11-11 12:27   ` Heiko Carstens
  2009-11-16  9:05   ` Heiko Carstens
  0 siblings, 2 replies; 6+ messages in thread
From: Peter Zijlstra @ 2009-11-11 10:30 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Ingo Molnar, Gregory Haskins, Siddha, Suresh B, linux-kernel,
	Martin Schwidefsky

On Wed, 2009-11-11 at 11:18 +0100, Heiko Carstens wrote:
> Hi all,
> 
> we've seen a crash on s390 which seems to be related to sched_rt_period_timer vs.
> cpu hotplug:
> 
>     <1>Unable to handle kernel pointer dereference at virtual kernel address 00000000ff5ec000
>     <4>Oops: 0011 [#1] PREEMPT SMP DEBUG_PAGEALLOC
>     <4>Modules linked in: sunrpc qeth_l2 dm_multipath dm_mod chsc_sch qeth ccwgroup
>     <4>CPU: 9 Not tainted 2.6.31-39.x.20090916-s390xdefault #1
>     <4>Process swapper (pid: 0, task: 00000000ffc8ca40, ksp: 00000000ffc93d48)
>     <4>Krnl PSW : 0404200180000000 000000000013952c (sched_rt_period_timer+0x188/0x3d8)
>     <4>           R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:0 CC:2 PM:0 EA:3
>     <4>Krnl GPRS: ffffffffffffffff ffffffffffffff80 00000000ff5ec000 0000000000000008
>     <4>           0000000000000000 0000000000000040 0000000000000001 0000000000a7db58
>     <4>           0000000087139db8 0000000087da6500 0000000000000000 0000000000000007
>     <4>           00000000ff5ec008 0000000000598cc8 00000000001394d0 00000000ff7b7968
>     <4>Krnl Code: 000000000013951c: a709ffff            lghi    %r0,-1
>     <4>           0000000000139520: eb102000000d        sllg    %r1,%r0,0(%r2)
>     <4>           0000000000139526: e320f0b80004        lg      %r2,184(%r15)
>     <4>          >000000000013952c: e31320000080        ng      %r1,0(%r3,%r2)
>     <4>           0000000000139532: 1211                ltr     %r1,%r1
>     <4>           0000000000139534: a78400ff            brc     8,139732
>     <4>           0000000000139538: a7290000            lghi    %r2,0
>     <4>           000000000013953c: a711ffff            tmll    %r1,65535
>     <4>Call Trace:
>     <4>([<00000000001394d0>] sched_rt_period_timer+0x12c/0x3d8)
>     <4> [<0000000000173db0>] __run_hrtimer+0xb0/0x110
>     <4> [<00000000001740b2>] hrtimer_interrupt+0xf2/0x1e8
>     <4> [<000000000010770c>] clock_comparator_work+0x68/0x70
>     <4> [<000000000010dbc0>] do_extint+0x18c/0x190
>     <4> [<0000000000117f9e>] ext_no_vtime+0x1e/0x22
>     <4> [<000000000058ea04>] _spin_unlock_irq+0x48/0x80
>     <4>([<000000000058ea00>] _spin_unlock_irq+0x44/0x80)
>     <4> [<000000000043c190>] dasd_block_tasklet+0x1b8/0x2b0
>     <4> [<0000000000155b0e>] tasklet_hi_action+0xfe/0x1f4
>     <4> [<00000000001570d4>] __do_softirq+0x184/0x2e8
>     <4> [<0000000000110b34>] do_softirq+0xe4/0xe8
>     <4> [<0000000000156ac4>] irq_exit+0xc0/0xe0
>     <4> [<000000000010db7a>] do_extint+0x146/0x190
>     <4> [<0000000000117f9e>] ext_no_vtime+0x1e/0x22
>     <4> [<0000000000115040>] vtime_stop_cpu+0xac/0x100
>     <4>([<0000000000114fe6>] vtime_stop_cpu+0x52/0x100)
>     <4> [<000000000010a324>] cpu_idle+0xfc/0x198
>     <4> [<0000000000584a64>] start_secondary+0xb4/0xc0
> 
> sched_rt_period_timer tried to access a memory region which was unmapped from
> the kernel 1:1 mapping. So we seem to have a use-after-free bug.
> 
> The C code snippet in question, which seems to cause the addressing exception is:
> 
> static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun)
> {
> 	int i, idle = 1;
> 	const struct cpumask *span;
> 
> 	if (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF)
> 		return 1;
> 
> 	span = sched_rt_period_mask();
> 	for_each_cpu(i, span) {   <------ read access to root_domain of runqueue
> 		int enqueue = 0;
> ....
> 
> with
> 
> static inline const struct cpumask *sched_rt_period_mask(void)
> {
> 	return cpu_rq(smp_processor_id())->rd->span;
> }
> 
> The read access to the span cpumask within the root_domain caused the exception.
> 
> Now since DEBUG_PAGEALLOC is turned on we can easily see who freed the piece of
> memory since it contains a backtrace:
> 
> 0x13caca <cpu_attach_domain+482>
> 0x1418fe <partition_sched_domains+350>
> 0x141d90 <update_sched_domains+100>
> 0x5915a6 <notifier_call_chain+150>
> 0x17666c <raw_notifier_call_chain+44>
> 0x585b74 <_cpu_up+436>
> 0x585c3a <cpu_up+186>
> 0x58336a <store_online+146>
> 0x29cfa4 <sysfs_write_file+248>
> 0x228b60 <SyS_write>
> 
> cpu_attach_domain calls (inlined) rq_attach_root. That function replaces a
> runqueue's root_domain while holding its lock (&rq->lock).
> 
> Now the code snippet above from do_sched_rt_period_timer does access a
> runqueue's root_domain _without_ holding its lock.
> That way a concurrent cpu_up operation can easily change a runqueue's
> root_domain pointer while it is still in use. Which is what happened here.
> 
> Just grabbing and releasing the lock for each iteration is probably not the
> real fix, since the span mask could change between iterations. Which might
> lead to strange effects.

Does something like the below fix it? Normal sched_domain bits also do
sync_sched() for domain destruction as can be seen from
detach_destroy_domains()..

diff --git a/kernel/sched.c b/kernel/sched.c
index 91642c1..3b02339 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -7918,6 +7923,8 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
 
 static void free_rootdomain(struct root_domain *rd)
 {
+	synchronize_sched();
+
 	cpupri_cleanup(&rd->cpupri);
 
 	free_cpumask_var(rd->rto_mask);


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [BUG] sched_rt_periodic_timer vs cpu hotplug
  2009-11-11 10:30 ` Peter Zijlstra
@ 2009-11-11 12:27   ` Heiko Carstens
  2009-11-16  9:05   ` Heiko Carstens
  1 sibling, 0 replies; 6+ messages in thread
From: Heiko Carstens @ 2009-11-11 12:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Gregory Haskins, Siddha, Suresh B, linux-kernel,
	Martin Schwidefsky

On Wed, Nov 11, 2009 at 11:30:28AM +0100, Peter Zijlstra wrote:
> On Wed, 2009-11-11 at 11:18 +0100, Heiko Carstens wrote:
> > cpu_attach_domain calls (inlined) rq_attach_root. That function replaces a
> > runqueue's root_domain while holding its lock (&rq->lock).
> > 
> > Now the code snippet above from do_sched_rt_period_timer does access a
> > runqueue's root_domain _without_ holding its lock.
> > That way a concurrent cpu_up operation can easily change a runqueue's
> > root_domain pointer while it is still in use. Which is what happened here.
> > 
> > Just grabbing and releasing the lock for each iteration is probably not the
> > real fix, since the span mask could change between iterations. Which might
> > lead to strange effects.
> 
> Does something like the below fix it? Normal sched_domain bits also do
> sync_sched() for domain destruction as can be seen from
> detach_destroy_domains()..
> 
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 91642c1..3b02339 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -7918,6 +7923,8 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
>  
>  static void free_rootdomain(struct root_domain *rd)
>  {
> +	synchronize_sched();
> +

Looks good. It might take a few days for a final confirmation.

Thanks!

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [BUG] sched_rt_periodic_timer vs cpu hotplug
  2009-11-11 10:30 ` Peter Zijlstra
  2009-11-11 12:27   ` Heiko Carstens
@ 2009-11-16  9:05   ` Heiko Carstens
  2009-11-16  9:31     ` [PATCH] sched: " Peter Zijlstra
  1 sibling, 1 reply; 6+ messages in thread
From: Heiko Carstens @ 2009-11-16  9:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Gregory Haskins, Siddha, Suresh B, linux-kernel,
	Martin Schwidefsky

On Wed, Nov 11, 2009 at 11:30:28AM +0100, Peter Zijlstra wrote:
> On Wed, 2009-11-11 at 11:18 +0100, Heiko Carstens wrote:
> > we've seen a crash on s390 which seems to be related to sched_rt_period_timer vs.
> > cpu hotplug:
> > 
> >     <1>Unable to handle kernel pointer dereference at virtual kernel address 00000000ff5ec000
> >     <4>Oops: 0011 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[...]
> >     <4>Krnl PSW : 0404200180000000 000000000013952c (sched_rt_period_timer+0x188/0x3d8)
[...]
> >     <4>([<00000000001394d0>] sched_rt_period_timer+0x12c/0x3d8)
> >     <4> [<0000000000173db0>] __run_hrtimer+0xb0/0x110
> >     <4> [<00000000001740b2>] hrtimer_interrupt+0xf2/0x1e8
> >     <4> [<000000000010770c>] clock_comparator_work+0x68/0x70
> >     <4> [<000000000010dbc0>] do_extint+0x18c/0x190
> >     <4> [<0000000000117f9e>] ext_no_vtime+0x1e/0x22
> >     <4> [<000000000058ea04>] _spin_unlock_irq+0x48/0x80
> >     <4>([<000000000058ea00>] _spin_unlock_irq+0x44/0x80)
> >     <4> [<000000000043c190>] dasd_block_tasklet+0x1b8/0x2b0
> >     <4> [<0000000000155b0e>] tasklet_hi_action+0xfe/0x1f4
> >     <4> [<00000000001570d4>] __do_softirq+0x184/0x2e8
> >     <4> [<0000000000110b34>] do_softirq+0xe4/0xe8
> >     <4> [<0000000000156ac4>] irq_exit+0xc0/0xe0
> >     <4> [<000000000010db7a>] do_extint+0x146/0x190
> >     <4> [<0000000000117f9e>] ext_no_vtime+0x1e/0x22
> >     <4> [<0000000000115040>] vtime_stop_cpu+0xac/0x100
> >     <4>([<0000000000114fe6>] vtime_stop_cpu+0x52/0x100)
> >     <4> [<000000000010a324>] cpu_idle+0xfc/0x198
> >     <4> [<0000000000584a64>] start_secondary+0xb4/0xc0
>
> Does something like the below fix it? Normal sched_domain bits also do
> sync_sched() for domain destruction as can be seen from
> detach_destroy_domains()..
> 
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 91642c1..3b02339 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -7918,6 +7923,8 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
>  
>  static void free_rootdomain(struct root_domain *rd)
>  {
> +	synchronize_sched();
> +
>  	cpupri_cleanup(&rd->cpupri);

We haven't seen the bug above with your patch applied again. On the other
hand the race window is so small, that we were also unable to reproduce
the original bug.
Instead we are later running into a different use-after-free bug. But that
one is unrelated to this one.

Anyway, your patch should fix this bug. Would be good to have it in .32

Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH] sched: sched_rt_periodic_timer vs cpu hotplug
  2009-11-16  9:05   ` Heiko Carstens
@ 2009-11-16  9:31     ` Peter Zijlstra
  2009-11-16 11:18       ` [tip:sched/urgent] sched: Sched_rt_periodic_timer " tip-bot for Peter Zijlstra
  0 siblings, 1 reply; 6+ messages in thread
From: Peter Zijlstra @ 2009-11-16  9:31 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Ingo Molnar, Gregory Haskins, Siddha, Suresh B, linux-kernel,
	Martin Schwidefsky

Subject: sched: sched_rt_periodic_timer vs cpu hotplug
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Mon Nov 16 10:28:09 CET 2009

Heiko reported a case where a timer interrupt managed to reference a
root_domain structure that was already freed by a concurrent
hot-un-plug operation.

Solve this like the regular sched_domain stuff is also synchronized,
by adding a synchronize_sched() stmt to the free path, this ensures
that a root_domain stays present for any atomic section that could
have observed it.

Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> 
---
 kernel/sched.c |    2 ++
 1 file changed, 2 insertions(+)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -7921,6 +7921,8 @@ sd_parent_degenerate(struct sched_domain
 
 static void free_rootdomain(struct root_domain *rd)
 {
+	synchronize_sched();
+
 	cpupri_cleanup(&rd->cpupri);
 
 	free_cpumask_var(rd->rto_mask);



^ permalink raw reply	[flat|nested] 6+ messages in thread

* [tip:sched/urgent] sched: Sched_rt_periodic_timer vs cpu hotplug
  2009-11-16  9:31     ` [PATCH] sched: " Peter Zijlstra
@ 2009-11-16 11:18       ` tip-bot for Peter Zijlstra
  0 siblings, 0 replies; 6+ messages in thread
From: tip-bot for Peter Zijlstra @ 2009-11-16 11:18 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, schwidefsky,
	heiko.carstens, suresh.b.siddha, tglx, ghaskins, mingo

Commit-ID:  047106adcc85e3023da210143a6ab8a55df9e0fc
Gitweb:     http://git.kernel.org/tip/047106adcc85e3023da210143a6ab8a55df9e0fc
Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
AuthorDate: Mon, 16 Nov 2009 10:28:09 +0100
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Mon, 16 Nov 2009 10:46:27 +0100

sched: Sched_rt_periodic_timer vs cpu hotplug

Heiko reported a case where a timer interrupt managed to
reference a root_domain structure that was already freed by a
concurrent hot-un-plug operation.

Solve this like the regular sched_domain stuff is also
synchronized, by adding a synchronize_sched() stmt to the free
path, this ensures that a root_domain stays present for any
atomic section that could have observed it.

Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Gregory Haskins <ghaskins@novell.com>
Cc: Siddha Suresh B <suresh.b.siddha@intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
LKML-Reference: <1258363873.26714.83.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index cea2bea..3c91f11 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -7912,6 +7912,8 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
 
 static void free_rootdomain(struct root_domain *rd)
 {
+	synchronize_sched();
+
 	cpupri_cleanup(&rd->cpupri);
 
 	free_cpumask_var(rd->rto_mask);

^ permalink raw reply related	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2009-11-16 11:19 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-11-11 10:18 [BUG] sched_rt_periodic_timer vs cpu hotplug Heiko Carstens
2009-11-11 10:30 ` Peter Zijlstra
2009-11-11 12:27   ` Heiko Carstens
2009-11-16  9:05   ` Heiko Carstens
2009-11-16  9:31     ` [PATCH] sched: " Peter Zijlstra
2009-11-16 11:18       ` [tip:sched/urgent] sched: Sched_rt_periodic_timer " tip-bot for Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox