From: Darren Hart <dvhltc@us.ibm.com>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: Mike Galbraith <efault@gmx.de>,
linux-rt-users <linux-rt-users@vger.kernel.org>,
Peter Zijlstra <a.p.zijlstra@chello.nl>,
Steven Rostedt <rostedt@goodmis.org>,
gowrishankar <gowrishankar.m@linux.vnet.ibm.com>
Subject: Re: 2.6.33.[56]-rt23: howto create repeatable explosion in wakeup_next_waiter()
Date: Fri, 09 Jul 2010 13:05:20 -0700
Message-ID: <4C378100.6000104@us.ibm.com>
In-Reply-To: <alpine.LFD.2.00.1007071356480.2604@localhost.localdomain>
On 07/07/2010 04:57 AM, Thomas Gleixner wrote:
> Cc'ing Darren.
>
> On Wed, 7 Jul 2010, Mike Galbraith wrote:
>
>> Greetings,
>>
>> Stress testing, looking to trigger RCU stalls, I've managed to find a
>> way to repeatably create fireworks. (got RCU stall, see attached)
>> 4. run it.
>>
>> What happens here is we hit WARN_ON(pendowner->pi_blocked_on != waiter),
>> this does not make it to consoles (poking sysrq-foo doesn't either).
>> Next comes WARN_ON(!pendowner->pi_blocked_on), followed by the NULL
>> explosion, which does make it to consoles.
>>
>> With explosion avoidance, I also see pendowner->pi_blocked_on->task ==
>> NULL at times, but that, like !pendowner->pi_blocked_on, seems to be
>> fallout. The start of bad juju is always pi_blocked_on != waiter.
>>
>> [ 141.609268] BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
>> [ 141.609268] IP: [<ffffffff8106856d>] wakeup_next_waiter+0x12c/0x177
>> [ 141.609268] PGD 20e174067 PUD 20e253067 PMD 0
>> [ 141.609268] Oops: 0000 [#1] PREEMPT SMP
>> [ 141.609268] last sysfs file: /sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map
>> [ 141.609268] CPU 0
>> [ 141.609268] Pid: 8154, comm: pthread_cond_ma Tainted: G W 2.6.33.6-rt23 #12 MS-7502/MS-7502
>> [ 141.609268] RIP: 0010:[<ffffffff8106856d>] [<ffffffff8106856d>] wakeup_next_waiter+0x12c/0x177
>> [ 141.609268] RSP: 0018:ffff88020e3cdd78 EFLAGS: 00010097
>> [ 141.609268] RAX: 0000000000000000 RBX: ffff8801e8eba5c0 RCX: 0000000000000000
>> [ 141.609268] RDX: ffff880028200000 RSI: 0000000000000046 RDI: 0000000000000009
>> [ 141.609268] RBP: ffff88020e3cdda8 R08: 0000000000000002 R09: 0000000000000000
>> [ 141.609268] R10: 0000000000000005 R11: 0000000000000000 R12: ffffffff81659068
>> [ 141.609268] R13: ffff8801e8ebdb58 R14: 0000000000000000 R15: ffff8801e8ebac08
>> [ 141.609268] FS: 00007f664d539700(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
>> [ 141.609268] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 141.609268] CR2: 0000000000000058 CR3: 0000000214266000 CR4: 00000000000006f0
>> [ 141.609268] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [ 141.609268] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> [ 141.609268] Process pthread_cond_ma (pid: 8154, threadinfo ffff88020e3cc000, task ffff88020e2a4700)
>> [ 141.609268] Stack:
>> [ 141.609268] 0000000000000000 ffffffff81659068 0000000000000202 0000000000000000
>> [ 141.609268]<0> 0000000000000000 0000000080001fda ffff88020e3cddc8 ffffffff812fec48
>> [ 141.609268]<0> ffffffff81659068 0000000000606300 ffff88020e3cddd8 ffffffff812ff1b9
>> [ 141.609268] Call Trace:
>> [ 141.609268] [<ffffffff812fec48>] rt_spin_lock_slowunlock+0x43/0x61
>> [ 141.609268] [<ffffffff812ff1b9>] rt_spin_unlock+0x46/0x48
>> [ 141.609268] [<ffffffff81067d7f>] do_futex+0x83c/0x935
>> [ 141.609268] [<ffffffff810c26ce>] ? handle_mm_fault+0x6de/0x6f1
>> [ 141.609268] [<ffffffff81067e36>] ? do_futex+0x8f3/0x935
>> [ 141.609268] [<ffffffff81067fba>] sys_futex+0x142/0x154
>> [ 141.609268] [<ffffffff81020eb0>] ? do_page_fault+0x23e/0x28e
>> [ 141.609268] [<ffffffff81004aa7>] ? math_state_restore+0x3d/0x3f
>> [ 141.609268] [<ffffffff81004b08>] ? do_device_not_available+0xe/0x12
>> [ 141.609268] [<ffffffff81002c5b>] system_call_fastpath+0x16/0x1b
>> [ 141.609268] Code: c7 09 6d 41 81 e8 ac 34 fd ff 4c 39 ab 70 06 00 00 74 11 be 47 02 00 00 48 c7 c7 09 6d 41 81 e8 92 34 fd ff 48 8b 83 70 06 00 00<4c> 39 60 58 74 11 be 48 02 00 00 48 c7 c7 09 6d 41 81 e8 74 34
>> [ 141.609268] RIP [<ffffffff8106856d>] wakeup_next_waiter+0x12c/0x177
>> [ 141.609268] RSP<ffff88020e3cdd78>
>> [ 141.609268] CR2: 0000000000000058
>> [ 141.609268] ---[ end trace 58805b944e6f93ce ]---
>> [ 141.609268] note: pthread_cond_ma[8154] exited with preempt_count 2
>>
>> (5. eyeball locks.. -> zzzzt -> report -> eyeball..)
>>
>> -Mike
>>
OK, here is the final analysis. After a trace-cmd fix and some rtmutex
"paradigm" insight from Rostedt, the root of the problem came out:
Jul 9 14:26:39 elm9m94 kernel: ------------[ cut here ]------------
Jul 9 14:26:39 elm9m94 kernel: WARNING: at kernel/rtmutex.c:590 wakeup_next_waiter+0x244/0x370()
Jul 9 14:26:39 elm9m94 kernel: Hardware name: IBM eServer BladeCenter HS21 -[7995AC1]-
Jul 9 14:26:39 elm9m94 kernel: Modules linked in: scsi_wait_scan
Jul 9 14:26:39 elm9m94 kernel: Pid: 3870, comm: pthread_cond_ma Not tainted 2.6.33.5-rt23dvh01 #13
Jul 9 14:26:39 elm9m94 kernel: Call Trace:
Jul 9 14:26:39 elm9m94 kernel: [<ffffffff8104dc3b>] warn_slowpath_common+0x7b/0xc0
Jul 9 14:26:39 elm9m94 kernel: [<ffffffff8104dc94>] warn_slowpath_null+0x14/0x20
Jul 9 14:26:39 elm9m94 kernel: [<ffffffff81083f74>] wakeup_next_waiter+0x244/0x370
Jul 9 14:26:39 elm9m94 kernel: [<ffffffff81921875>] rt_mutex_slowunlock+0x35/0x80
Jul 9 14:26:39 elm9m94 kernel: [<ffffffff819218e9>] rt_mutex_unlock+0x29/0x40
Jul 9 14:26:39 elm9m94 kernel: [<ffffffff8108258c>] do_futex+0xa6c/0xae0
# addr2line -e vmlinux ffffffff8108258c
/test/dvhart/source/linux-2.6-tip.git/kernel/futex.c:833
In futex_unlock_pi(), unlocking the pi futex backing the pthread_mutex:
rt_mutex_unlock(&pi_state->pi_mutex);
and a trace of the pending owner, whose waiter (pi_blocked_on) changed, from
sched_show_task():
Jul 9 14:26:39 elm9m94 kernel: pthread_cond_ M ffffffff81922537 0 4176 3745 0x00000080
Jul 9 14:26:39 elm9m94 kernel: ffffffff821fc9a8 0000000000000000 ffff88042ded9c08 ffffffff819216c0
Jul 9 14:26:39 elm9m94 kernel: 0000000000000000 ffff88042dbbabc0 ffffffff821fc9c0 ffff88042decee40
Jul 9 14:26:39 elm9m94 kernel: ffff88042decee40 ffff88042decee40 ffff880400000000 ffff88042ded9b70
Jul 9 14:26:39 elm9m94 kernel: Call Trace:
Jul 9 14:26:39 elm9m94 kernel: [<ffffffff819216c0>] ? rt_spin_lock_slowlock+0x190/0x310
Jul 9 14:26:39 elm9m94 kernel: [<ffffffff8192216c>] ? rt_spin_lock+0x7c/0x90
Jul 9 14:26:39 elm9m94 kernel: [<ffffffff810810c8>] ? futex_wait_requeue_pi+0x178/0x360
# addr2line -e vmlinux ffffffff810810c8
/test/dvhart/source/linux-2.6-tip.git/kernel/futex.c:153
That's inside match_futex(), which is inlined here, so back up a few bytes
to the call site:
# addr2line -e vmlinux ffffffff810810c0
/test/dvhart/source/linux-2.6-tip.git/kernel/futex.c:2258
This is futex_wait_requeue_pi():
/* Queue the futex_q, drop the hb lock, wait for wakeup. */
futex_wait_queue_me(hb, &q, to);
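/* On -rt, hb->lock is a sleeping spinlock (an rtmutex). If it is contended
 * here, task_blocks_on_rt_mutex() sets current->pi_blocked_on, even though
 * the requeue may already have proxy-blocked this task on the pi futex's
 * rtmutex; that is the overwrite described below. */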
spin_lock(&hb->lock);
ret = handle_early_requeue_pi_wakeup(hb, &q, &key2, to);
spin_unlock(&hb->lock);
Snippets from the trace:
pthread_cond_ma-4176 [003] 395.357310: bprint: task_blocks_on_rt_mutex : 4176 blocks on waiter:
pthread_cond_ma-4176 [003] 395.357310: bprint: task_blocks_on_rt_mutex : waiter: 0xffff88042ded9b68
pthread_cond_ma-4176 [003] 395.357311: bprint: task_blocks_on_rt_mutex : waiter->lock: 0xffffffff821fc9a8
pthread_cond_ma-4176 [003] 395.357311: bprint: task_blocks_on_rt_mutex : was: 0xffff88042ded9cc8
^ should always be NULL
pthread_cond_ma-4176 [003] 395.357311: bprint: task_blocks_on_rt_mutex : oldwaiter->lock: 0xffff8804104d0450
pthread_cond_ma-3870 [000] 395.357732: bprint: wakeup_next_waiter : BUG: waiter changed
pthread_cond_ma-3870 [000] 395.357733: bprint: wakeup_next_waiter : pendowner pid: 4176
pthread_cond_ma-3870 [000] 395.357734: bprint: wakeup_next_waiter : pendowner comm:
pthread_cond_ma-3870 [000] 395.357734: bprint: wakeup_next_waiter : waiter->lock: 0xffff8804104d0450
pthread_cond_ma-3870 [000] 395.357734: bprint: wakeup_next_waiter : pi_blocked_on->lock: 0xffffffff821fc9a8
The waiter it changed to is the one we see in the task_blocks_on_rt_mutex
trace above. The task has woken from futex_wait_queue_me() after having been
requeued to the pi futex (we know this because it wouldn't have a
pi_blocked_on if it were still on the wait-queue futex). Also, the
futex_unlock_pi() call by 3870 is trying to unlock the lock that 4176 was
previously blocked on.
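For reference, the check that fires sits near the top of wakeup_next_waiter()
in -rt. Abridged and paraphrased here from the 2.6.33-rt patch from memory,
so the exact code may differ:

waiter = rt_mutex_top_waiter(lock);
pendowner = waiter->task;
/* ... wait_list removal and pending-owner bookkeeping elided ... */
raw_spin_lock(&pendowner->pi_lock);
WARN_ON(!pendowner->pi_blocked_on);
WARN_ON(pendowner->pi_blocked_on != waiter);	/* rtmutex.c:590 */

If the requeued task has since blocked on hb->lock, its pi_blocked_on points
at the hb->lock waiter instead, and the second WARN_ON trips.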
The core of the problem is that the proxy locking (rt_mutex_start_proxy_lock())
blocks a task on a lock the task knows nothing about. So when it wakes up
inside futex_wait_requeue_pi(), it immediately tries to take hb->lock to
find out why it woke up. This can block the task on two locks at once,
overwriting pi_blocked_on. Any attempt at preventing this involves taking a
lock, ultimately the hb->lock itself. The only solution I see is to make
the hb->locks raw locks (thanks to Steven Rostedt for the original idea and
for batting this around with me on IRC).
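To make the failure mode concrete, here is a minimal userspace model of the
single pi_blocked_on slot being clobbered. All names are hypothetical and
the structures are reduced to the bare invariant; this sketches the race's
effect, not the kernel code:

#include <stdio.h>

struct lock;

struct waiter {
	struct lock *lock;		/* which lock this waiter is queued on */
};

struct lock {
	struct waiter *top_waiter;	/* head of the lock's wait list */
};

struct task {
	struct waiter *pi_blocked_on;	/* single slot, as in task_struct */
};

/* Model of task_blocks_on_rt_mutex(): queue a waiter and record it. */
static void block_on(struct task *t, struct lock *l, struct waiter *w)
{
	if (t->pi_blocked_on)	/* trace above: "was: ... should always be NULL" */
		printf("BUG: overwriting pi_blocked_on %p\n",
		       (void *)t->pi_blocked_on);
	w->lock = l;
	l->top_waiter = w;
	t->pi_blocked_on = w;	/* old value, if any, is silently lost */
}

int main(void)
{
	struct lock pi_futex_lock = { 0 }, hb_lock = { 0 };
	struct waiter proxy_w = { 0 }, hb_w = { 0 };
	struct task waiter_task = { 0 };

	/* 1. Requeue: the waker proxy-blocks the sleeping task on the pi
	 *    futex's rtmutex; the task itself never sees this happen.   */
	block_on(&waiter_task, &pi_futex_lock, &proxy_w);

	/* 2. The task wakes in futex_wait_requeue_pi() and contends on
	 *    hb->lock, which on -rt is also an rtmutex: a second block. */
	block_on(&waiter_task, &hb_lock, &hb_w);

	/* 3. The pi futex owner unlocks and performs the
	 *    wakeup_next_waiter() sanity check on its top waiter:       */
	if (waiter_task.pi_blocked_on != pi_futex_lock.top_waiter)
		printf("WARN: pi_blocked_on %p != waiter %p\n",
		       (void *)waiter_task.pi_blocked_on,
		       (void *)pi_futex_lock.top_waiter);
	return 0;
}

Step 2 prints the same "should always be NULL" condition the
task_blocks_on_rt_mutex instrumentation caught in the trace, and step 3 is
the condition behind WARN_ON(pendowner->pi_blocked_on != waiter).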
Given the contention the hb->locks can be under, I'm concerned about the
latency impact this could have. Still, I don't see another way to allow for
the proxy locking that the rtmutex semantics require for requeue_pi to
work. I'll work up a patch and do some testing to see if I can get a feel
for the impact.
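For illustration, the direction would look roughly like this against 2.6.33
(a hypothetical sketch, not a tested patch; converting every locker of
hb->lock is the real work):

/* kernel/futex.c, hypothetical: on PREEMPT_RT a raw_spinlock_t keeps
 * spinning instead of becoming an rtmutex, so blocking on hb->lock can
 * no longer touch pi_blocked_on. */
struct futex_hash_bucket {
	raw_spinlock_t lock;		/* was: spinlock_t lock; */
	struct plist_head chain;
};

/* ...and all users change accordingly, e.g. in futex_wait_requeue_pi(): */
raw_spin_lock(&hb->lock);		/* was: spin_lock(&hb->lock); */
ret = handle_early_requeue_pi_wakeup(hb, &q, &key2, to);
raw_spin_unlock(&hb->lock);		/* was: spin_unlock(&hb->lock); */

On -rt, raw_spinlock_t is the one spinlock type that never goes through the
rtmutex code, which is why the conversion closes the window.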
Ugh.
I'm still hoping to find some clever way to avoid this, going to map out
this state machine again and see if there is a way around it...
--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team