From: Gilles Carry <Gilles.Carry@bull.net>
To: Chirag Jog <chirag@linux.vnet.ibm.com>
Cc: Gregory Haskins <ghaskins@novell.com>,
linux-rt-users <linux-rt-users@vger.kernel.org>,
LKML <linux-kernel@vger.kernel.org>,
Steven Rostedt <rostedt@goodmis.org>,
dvhltc@us.ibm.com, Dinakar Guniguntala <dino@in.ibm.com>
Subject: Re: [BUG][PPC64] BUG in 2.6.26.5-rt9 causing Hang
Date: Tue, 30 Sep 2008 08:47:50 +0200
Message-ID: <48E1CB96.80109@bull.net>
In-Reply-To: <20080930044320.GA4685@linux.vnet.ibm.com>
Chirag Jog wrote:
> Hi Gregory,
> * Gregory Haskins <ghaskins@novell.com> [2008-09-29 18:00:01]:
>
>
>>Gregory Haskins wrote:
>>
>>>Gregory Haskins wrote:
>>>
>>>
>>>>Hi Chirag
>>>>
>>>>Chirag Jog wrote:
>>>>
>>>>
>>>>
>>>>>Hi Gregory,
>>>>>We see the following BUG followed by a hang on the latest kernel 2.6.26.5-rt9 on a Power6 blade (PPC64)
>>>>>It is easily recreated by running the async_handler or sbrk_mutex (realtime tests from ltp) tests.
>>>>>
>>>>>
>>>>>
>>>>
>>>>Call me an LTP newbie, but where can I get the sbrk_mutex/async_handler
>>>>tests.
>>>>
>>>
>>>Ok, I figured out this part. I needed a newer version of the .rpm from
>>>a different repo. However, both async_handler and sbrk_mutex seem to
>>>segfault for me. Hmm
>>>
>>
>>Thanks to help from Darren, I got around this issue. Unfortunately, both
>>tests pass, so I cannot reproduce the issue, nor do I see the problem
>>via code inspection. I'll keep digging, but I am currently at a loss. I
>>may need to send you some diagnostic patches to find this, if that is OK
>>with you, Chirag?
>
> This particular bug is not reproducible on the x86 boxes I have access
> to, only on ppc64.
> Please send the diagnostic patches across.
> I'll try them out! :)
>
Hi,
I have access to Power6 and x86_64 boxes, and so far I could only
reproduce the bug on PPC64.
The bug appeared in 2.6.26.3-rt6, with sched-only-push-if-pushable.patch
and sched-only-push-once-per-queue.patch.
While sbrk_mutex reliably exposes the problem, it can also occur
randomly, sometimes during boot.
At first, I got either a system hang or the following oops (very similar
to Chirag's, and not necessarily in sbrk_mutex):
cpu 0x3: Vector: 700 (Program Check) at [c0000000ee30b600]
pc: c0000000001b9bac: .__list_add+0x70/0xa0
lr: c0000000001b9ba8: .__list_add+0x6c/0xa0
sp: c0000000ee30b880
msr: 8000000000021032
current = 0xc0000000ee2b1830
paca = 0xc0000000005c3980
pid = 51, comm = sirq-sched/3
kernel BUG at lib/list_debug.c:33!
enter ? for help
[c0000000ee30b900] c0000000001b8ec0 .plist_del+0x6c/0xcc
[c0000000ee30b9a0] c00000000004d500 .dequeue_pushable_task+0x24/0x3c
[c0000000ee30ba20] c00000000004ec18 .push_rt_task+0x1f0/0x2c0
[c0000000ee30bae0] c00000000004ed0c .push_rt_tasks+0x24/0x44
[c0000000ee30bb70] c00000000004ed58 .post_schedule_rt+0x2c/0x50
[c0000000ee30bc00] c0000000000527c4 .finish_task_switch+0x100/0x1a8
[c0000000ee30bcb0] c0000000002cd1e0 .__schedule+0x688/0x744
[c0000000ee30bd90] c0000000002cd4ec .schedule+0xf4/0x128
[c0000000ee30be20] c000000000061634 .ksoftirqd+0x124/0x37c
[c0000000ee30bf00] c000000000076cf0 .kthread+0x84/0xd4
[c0000000ee30bf90] c000000000029368 .kernel_thread+0x4c/0x68
3:mon>
So I suspected memory corruption, but adding padding fields around
the pointers and extra checks did not reveal any trashed data.
Playing with xmon, I finally found out that, when hanging, the system
was stuck in an infinite loop in plist_check_list.
So I modified lib/plist.c, assuming that no list in the system holds
more than 100,000,000 elements. ;-)
 static void plist_check_list(struct list_head *top)
 {
 	struct list_head *prev = top, *next = top->next;
+	unsigned long long i = 1;

 	plist_check_prev_next(top, prev, next);
 	while (next != top) {
+		BUG_ON(i++ > 100000000);
 		prev = next;
 		next = prev->next;
 		plist_check_prev_next(top, prev, next);
 	}
 }
and got this:
cpu 0x6: Vector: 700 (Program Check) at [c0000000eeda7530]
pc: c0000000001ba498: .plist_check_list+0x68/0xb4
lr: c0000000001ba4b4: .plist_check_list+0x84/0xb4
sp: c0000000eeda77b0
msr: 8000000000021032
current = 0xc0000000ee80dfa0
paca = 0xc0000000005d3f80
pid = 2602, comm = sbrk_mutex
kernel BUG at lib/plist.c:50!
enter ? for help
[c0000000eeda7850] c0000000001ba530 .plist_check_head+0x4c/0x64
[c0000000eeda78e0] c0000000001ba57c .plist_del+0x34/0xdc
[c0000000eeda7980] c00000000004d734 .dequeue_pushable_task+0x24/0x3c
[c0000000eeda7a00] c00000000004d7c4 .pick_next_task_rt+0x38/0x58
[c0000000eeda7a90] c0000000002cefb0 .__schedule+0x510/0x75c
[c0000000eeda7b70] c0000000002cf44c .schedule+0xf4/0x128
[c0000000eeda7c00] c0000000002cfe4c .do_nanosleep+0x7c/0xe4
[c0000000eeda7c90] c00000000007be68 .hrtimer_nanosleep+0x84/0x10c
[c0000000eeda7d90] c00000000007bf6c .sys_nanosleep+0x7c/0xa0
[c0000000eeda7e30] c0000000000086ac syscall_exit+0x0/0x40
--- Exception: c01 (System Call) at 00000080fdb85880
SP (4000843e660) is in userspace
which corresponds to the BUG_ON I added.
It seems that the pushable_tasks list is corrupted: it never loops
back to the first element (top). Is there a shortcut somewhere?
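To make explicit what the added BUG_ON catches, here is a minimal
user-space model (not the kernel code, just a toy I used to reason
about it; all names are mine): if one node's ->next is redirected so
the ring no longer cycles back to the head, a "walk until we reach top
again" loop never terminates unless it is bounded.

#include <stdio.h>

struct node { struct node *next, *prev; };

/* Build a healthy 3-element ring: top -> a -> b -> top. */
static void link3(struct node *top, struct node *a, struct node *b)
{
	top->next = a; a->prev = top;
	a->next = b;   b->prev = a;
	b->next = top; top->prev = b;
}

/* Same idea as plist_check_list with the BUG_ON: walk forward and
 * give up if we do not get back to top within the bound. */
static int check_list(struct node *top)
{
	struct node *next = top->next;
	unsigned long long i = 1;

	while (next != top) {
		if (i++ > 100000000ULL) {
			fprintf(stderr, "never looped back to top: corrupted\n");
			return -1;
		}
		next = next->next;
	}
	return 0;
}

int main(void)
{
	struct node top, a, b;

	link3(&top, &a, &b);
	printf("healthy:   %d\n", check_list(&top));

	b.next = &a;	/* corruption: the ring now bypasses top */
	printf("corrupted: %d\n", check_list(&top));
	return 0;
}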
Since the patches don't contain any arch-specific change, I'm looking
for arch-specific code triggered by the modifications they introduce.
Still searching...
Also, for me, enabling CONFIG_GROUP_SCHED hides the problem.
I'm going to harden plist_check_list and see what it finds.
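Roughly what I mean by "hardening" (again only a user-space sketch of
the idea, not a patch): besides bounding the walk, also traverse the
ring backward and compare the node counts, checking at every hop that
the opposite link points back where it should; a shortcut that bypasses
part of the ring should show up as a mismatch or a broken back-link.

#include <assert.h>
#include <stdio.h>

struct node { struct node *next, *prev; };

/* Count the nodes reachable from top in one direction, verifying at
 * each hop that the reverse link is consistent. */
static long walk(struct node *top, int forward)
{
	struct node *p = forward ? top->next : top->prev;
	long n = 0;

	while (p != top) {
		assert(++n <= 100000000L);	/* same bound as above */
		if (forward) {
			assert(p->next->prev == p);
			p = p->next;
		} else {
			assert(p->prev->next == p);
			p = p->prev;
		}
	}
	return n;
}

int main(void)
{
	struct node top, a, b;

	top.next = &a; a.prev = &top;
	a.next = &b;   b.prev = &a;
	b.next = &top; top.prev = &b;

	printf("forward=%ld backward=%ld\n", walk(&top, 1), walk(&top, 0));
	return 0;
}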
Gilles.
Thread overview (24+ messages):
2008-09-25 12:32 [BUG][PPC64] BUG in 2.6.26.5-rt9 causing Hang Chirag Jog
2008-09-29 18:13 ` Gregory Haskins
2008-09-29 21:18 ` Gregory Haskins
2008-09-29 21:34 ` Gregory Haskins
2008-09-29 22:00 ` Gregory Haskins
2008-09-30 4:43 ` Chirag Jog
2008-09-30 6:47 ` Gilles Carry [this message]
2008-10-01 14:22 ` [PATCH] sched: add a stacktrace on enqueue_pushable error Gregory Haskins
2008-10-02 9:42 ` Gilles Carry
2008-10-02 11:18 ` [BUG][PPC64] BUG in 2.6.26.5-rt9 causing Hang Gilles Carry
2008-10-03 12:42 ` [RT PATCH 0/2] fix for BUG_ON crash in 26.5-rt9 Gregory Haskins
2008-10-03 12:43 ` [PATCH 1/2] RT: Remove comment that is no longer true Gregory Haskins
2008-10-03 12:43 ` [PATCH 2/2] RT: remove "paranoid" limit in push_rt_task Gregory Haskins
2008-10-03 13:46 ` Gilles Carry
2008-10-03 15:45 ` Chirag Jog
2008-10-03 17:27 ` Gregory Haskins
2008-10-03 17:26 ` [RT PATCH v2 0/2] Series short description Gregory Haskins
2008-10-03 17:26 ` [RT PATCH v2 1/2] RT: Remove comment that is no longer true Gregory Haskins
2008-10-03 17:26 ` [RT PATCH v2 2/2] RT: remove "paranoid" limit in push_rt_task Gregory Haskins
2008-10-03 12:54 ` [RT PATCH 0/2] fix for BUG_ON crash in 26.5-rt9 Gregory Haskins
2008-10-06 15:14 ` [RT PATCH v3 0/2] Fix for "[BUG][PPC64] BUG in 2.6.26.5-rt9 causing Hang" Gregory Haskins
2008-10-06 15:14 ` [RT PATCH v3 1/2] RT: Remove comment that is no longer true Gregory Haskins
2008-10-06 15:14 ` [RT PATCH v3 2/2] RT: fix push_rt_task() to handle dequeue_pushable properly Gregory Haskins
2008-10-07 6:04 ` [RT PATCH v3 0/2] Fix for "[BUG][PPC64] BUG in 2.6.26.5-rt9 causing Hang" Gilles Carry