linux-rt-users.vger.kernel.org archive mirror
From: Gilles Carry <Gilles.Carry@bull.net>
To: Chirag Jog <chirag@linux.vnet.ibm.com>
Cc: Gregory Haskins <ghaskins@novell.com>,
	linux-rt-users <linux-rt-users@vger.kernel.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Steven Rostedt <rostedt@goodmis.org>,
	dvhltc@us.ibm.com, Dinakar Guniguntala <dino@in.ibm.com>
Subject: Re: [BUG][PPC64] BUG in 2.6.26.5-rt9 causing Hang
Date: Tue, 30 Sep 2008 08:47:50 +0200
Message-ID: <48E1CB96.80109@bull.net>
In-Reply-To: <20080930044320.GA4685@linux.vnet.ibm.com>

Chirag Jog wrote:
> Hi Gregory,
> * Gregory Haskins <ghaskins@novell.com> [2008-09-29 18:00:01]:
> 
> 
>>Gregory Haskins wrote:
>>
>>>Gregory Haskins wrote:
>>>  
>>>
>>>>Hi Chirag
>>>>
>>>>Chirag Jog wrote:
>>>>  
>>>>    
>>>>
>>>>>Hi Gregory,
>>>>>We see the following BUG followed by a hang on the latest kernel 2.6.26.5-rt9 on a Power6 blade (PPC64)
>>>>>It is easily recreated by running the async_handler or sbrk_mutex (realtime tests from ltp) tests.
>>>>>  
>>>>>    
>>>>>      
>>>>
>>>>Call me an LTP newbie, but where can I get the sbrk_mutex/async_handler
>>>>tests.
>>>>    
>>>
>>>Ok, I figured out this part.  I needed a newer version of the .rpm from
>>>a different repo.  However, both async_handler and sbrk_mutex seem to
>>>segfault for me.  Hmm
>>>  
>>
>>Thanks to help from Darren I got around this issue.  Unfortunately both
>>tests pass so I cannot reproduce this issue, nor do I see the problem
>>via code inspection.  Ill keep digging but I am currently at a loss.  I
>>may need to send you some diagnostic patches to find this, if that is ok
>>with you Chirag?
> 
> This particular bug is not producible on the x86 boxes, i have access
> to. Only on ppc64.
> Please send the diagnostic patches across. 
> I'll try them out! :)
> 

Hi,

I have access to Power6 and x86_64 boxes and so far I could only
reproduce the bug on PPC64.

The bug has been present since 2.6.26.3-rt6, that is, since
sched-only-push-if-pushable.patch and sched-only-push-once-per-queue.patch
went in.

While sbrk_mutex definitely exposes the problem, it can also occur
randomly, sometimes during boot.

At first, I had either system hangs or this (very similar to Chirag's,
and not necessarily in sbrk_mutex):

cpu 0x3: Vector: 700 (Program Check) at [c0000000ee30b600]
     pc: c0000000001b9bac: .__list_add+0x70/0xa0
     lr: c0000000001b9ba8: .__list_add+0x6c/0xa0
     sp: c0000000ee30b880
    msr: 8000000000021032
   current = 0xc0000000ee2b1830
   paca    = 0xc0000000005c3980
     pid   = 51, comm = sirq-sched/3
kernel BUG at lib/list_debug.c:33!
enter ? for help
[c0000000ee30b900] c0000000001b8ec0 .plist_del+0x6c/0xcc
[c0000000ee30b9a0] c00000000004d500 .dequeue_pushable_task+0x24/0x3c
[c0000000ee30ba20] c00000000004ec18 .push_rt_task+0x1f0/0x2c0
[c0000000ee30bae0] c00000000004ed0c .push_rt_tasks+0x24/0x44
[c0000000ee30bb70] c00000000004ed58 .post_schedule_rt+0x2c/0x50
[c0000000ee30bc00] c0000000000527c4 .finish_task_switch+0x100/0x1a8
[c0000000ee30bcb0] c0000000002cd1e0 .__schedule+0x688/0x744
[c0000000ee30bd90] c0000000002cd4ec .schedule+0xf4/0x128
[c0000000ee30be20] c000000000061634 .ksoftirqd+0x124/0x37c
[c0000000ee30bf00] c000000000076cf0 .kthread+0x84/0xd4
[c0000000ee30bf90] c000000000029368 .kernel_thread+0x4c/0x68
3:mon>

So I suspected memory corruption, but adding padding fields around
the pointers and extra checks did not reveal any data trashing.


Playing with xmon, I finally found that when the system hangs, it is
stuck in an infinite loop in plist_check_list.
So I modified lib/plist.c as follows, on the assumption that no list
in the system holds more than 100 000 000 elements. ;-)

  static void plist_check_list(struct list_head *top)
  {
         struct list_head *prev = top, *next = top->next;
+       unsigned long long i = 1;

         plist_check_prev_next(top, prev, next);
         while (next != top) {
+               BUG_ON(i++ >    100000000);
                 prev = next;
                 next = prev->next;
                 plist_check_prev_next(top, prev, next);


and got this:

cpu 0x6: Vector: 700 (Program Check) at [c0000000eeda7530]
     pc: c0000000001ba498: .plist_check_list+0x68/0xb4
     lr: c0000000001ba4b4: .plist_check_list+0x84/0xb4
     sp: c0000000eeda77b0
    msr: 8000000000021032
   current = 0xc0000000ee80dfa0
   paca    = 0xc0000000005d3f80
     pid   = 2602, comm = sbrk_mutex
kernel BUG at lib/plist.c:50!
enter ? for help
[c0000000eeda7850] c0000000001ba530 .plist_check_head+0x4c/0x64
[c0000000eeda78e0] c0000000001ba57c .plist_del+0x34/0xdc
[c0000000eeda7980] c00000000004d734 .dequeue_pushable_task+0x24/0x3c
[c0000000eeda7a00] c00000000004d7c4 .pick_next_task_rt+0x38/0x58
[c0000000eeda7a90] c0000000002cefb0 .__schedule+0x510/0x75c
[c0000000eeda7b70] c0000000002cf44c .schedule+0xf4/0x128
[c0000000eeda7c00] c0000000002cfe4c .do_nanosleep+0x7c/0xe4
[c0000000eeda7c90] c00000000007be68 .hrtimer_nanosleep+0x84/0x10c
[c0000000eeda7d90] c00000000007bf6c .sys_nanosleep+0x7c/0xa0
[c0000000eeda7e30] c0000000000086ac syscall_exit+0x0/0x40
--- Exception: c01 (System Call) at 00000080fdb85880
SP (4000843e660) is in userspace

which corresponds to the BUG_ON I added.
It seems that the pushable_tasks list is corrupted: the walk never loops
back to the first element (top). Is there a shortcut (a link that
bypasses top) somewhere?
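
To illustrate the failure mode described above, here is a minimal
user-space sketch, not kernel code. The struct is a stand-in for the
kernel's struct list_head, and list_returns_to_top() and make_ring() are
hypothetical names invented for this example. It shows why a bounded
walk trips once some node's next pointer links back to an inner node
instead of to top.

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimal user-space stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/*
 * Walk forward from top and report whether the walk comes back to top
 * within max_nodes steps. A healthy circular list always does; a list
 * whose tail links back to an inner node never does, so the bound is
 * what ends the walk.
 */
static bool list_returns_to_top(const struct list_head *top,
				unsigned long max_nodes)
{
	const struct list_head *pos = top->next;
	unsigned long n = 0;

	while (pos != top) {
		if (++n > max_nodes)
			return false;
		pos = pos->next;
	}
	return true;
}

/* Hypothetical helper: link nodes[0..count-1] into a ring headed by nodes[0]. */
static void make_ring(struct list_head *nodes, size_t count)
{
	for (size_t i = 0; i < count; i++) {
		nodes[i].next = &nodes[(i + 1) % count];
		nodes[i].prev = &nodes[(i + count - 1) % count];
	}
}
```

With a four-node ring built by make_ring(), the walk succeeds. After
setting nodes[3].next = &nodes[1], it cycles through nodes 1, 2 and 3
forever and only the bound stops it, which matches the symptom seen in
plist_check_list.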



Since the patches don't feature any arch-specific change, I'm looking
for arch-specific code triggered by the modifications brought by the
patches. Still searching...

Also, for me, enabling CONFIG_GROUP_SCHED hides the problem.

I'm going to harden plist_check_list and see what it does.
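
One possible direction for such hardening, given purely as a user-space
sketch and not as what was actually done: replace the arbitrary element
limit with Floyd's tortoise-and-hare walk, which detects a cycle that
excludes top without needing any bound. ring_is_closed() is a
hypothetical name, and the struct again only mimics the kernel's
struct list_head.

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimal user-space stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/*
 * Floyd's tortoise-and-hare walk over the ->next links.
 * Returns true if following ->next from top comes back to top, and
 * false if the walk is trapped in a cycle that does not contain top
 * or runs into a NULL link.
 */
static bool ring_is_closed(const struct list_head *top)
{
	const struct list_head *slow = top, *fast = top;

	for (;;) {
		/* The hare moves two links, testing for top after each. */
		fast = fast->next;
		if (!fast)
			return false;
		if (fast == top)
			return true;
		fast = fast->next;
		if (!fast)
			return false;
		if (fast == top)
			return true;

		/* The tortoise moves one link. */
		slow = slow->next;
		if (slow == fast)
			return false;	/* met inside a cycle that skips top */
	}
}
```

In a healthy ring the hare reaches top during its first lap, before the
tortoise can catch it. In a ring where some next pointer leads back to
an inner node, the two pointers meet after a number of steps
proportional to the list length, so the check terminates either way.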

Gilles.

Thread overview: 24+ messages
2008-09-25 12:32 [BUG][PPC64] BUG in 2.6.26.5-rt9 causing Hang Chirag Jog
2008-09-29 18:13 ` Gregory Haskins
2008-09-29 21:18 ` Gregory Haskins
2008-09-29 21:34   ` Gregory Haskins
2008-09-29 22:00     ` Gregory Haskins
2008-09-30  4:43       ` Chirag Jog
2008-09-30  6:47         ` Gilles Carry [this message]
2008-10-01 14:22         ` [PATCH] sched: add a stacktrace on enqueue_pushable error Gregory Haskins
2008-10-02  9:42           ` Gilles Carry
2008-10-02 11:18   ` [BUG][PPC64] BUG in 2.6.26.5-rt9 causing Hang Gilles Carry
2008-10-03 12:42 ` [RT PATCH 0/2] fix for BUG_ON crash in 26.5-rt9 Gregory Haskins
2008-10-03 12:43   ` [PATCH 1/2] RT: Remove comment that is no longer true Gregory Haskins
2008-10-03 12:43   ` [PATCH 2/2] RT: remove "paranoid" limit in push_rt_task Gregory Haskins
2008-10-03 13:46     ` Gilles Carry
2008-10-03 15:45       ` Chirag Jog
2008-10-03 17:27         ` Gregory Haskins
2008-10-03 17:26       ` [RT PATCH v2 0/2] Series short description Gregory Haskins
2008-10-03 17:26         ` [RT PATCH v2 1/2] RT: Remove comment that is no longer true Gregory Haskins
2008-10-03 17:26         ` [RT PATCH v2 2/2] RT: remove "paranoid" limit in push_rt_task Gregory Haskins
2008-10-03 12:54   ` [RT PATCH 0/2] fix for BUG_ON crash in 26.5-rt9 Gregory Haskins
2008-10-06 15:14 ` [RT PATCH v3 0/2] Fix for "[BUG][PPC64] BUG in 2.6.26.5-rt9 causing Hang" Gregory Haskins
2008-10-06 15:14   ` [RT PATCH v3 1/2] RT: Remove comment that is no longer true Gregory Haskins
2008-10-06 15:14   ` [RT PATCH v3 2/2] RT: fix push_rt_task() to handle dequeue_pushable properly Gregory Haskins
2008-10-07  6:04   ` [RT PATCH v3 0/2] Fix for "[BUG][PPC64] BUG in 2.6.26.5-rt9 causing Hang" Gilles Carry
