[BUG][PPC64] BUG in 2.6.26.5-rt9 causing Hang

* [BUG][PPC64] BUG in 2.6.26.5-rt9 causing Hang
@ 2008-09-25 12:32 Chirag Jog
  2008-09-29 18:13 ` Gregory Haskins
                   ` (3 more replies)
  0 siblings, 4 replies; 24+ messages in thread
From: Chirag Jog @ 2008-09-25 12:32 UTC (permalink / raw)
  To: ghaskins; +Cc: linux-rt-users, LKML, Steven Rostedt

Hi Gregory,
We see the following BUG, followed by a hang, on the latest kernel 2.6.26.5-rt9 on a Power6 blade (PPC64).
It is easily reproduced by running the async_handler or sbrk_mutex tests (realtime tests from LTP).

login: cpu 0x2: Vector: 700 (Program Check) at [c0000000e8e875d0]
    pc: c00000000005110c: .pick_next_pushable_task+0x54/0x9c
    lr: c000000000059f50: .push_rt_task+0x44/0x2b4
    sp: c0000000e8e87850
   msr: 8000000000021032
  current = 0xc0000000ea5bb2e0
  paca    = 0xc000000000608700
    pid   = 2811, comm = async_handler
kernel BUG at kernel/sched_rt.c:1041! <---------------------
enter ? for help
[link register   ] c000000000059f50 .push_rt_task+0x44/0x2b4
[c0000000e8e87850] c0000000e8e878f0 (unreliable)
[c0000000e8e87900] c00000000005a1dc .push_rt_tasks+0x1c/0x38
[c0000000e8e87980] c00000000005a21c .post_schedule_rt+0x24/0x44
[c0000000e8e87a10] c000000000057cbc .finish_task_switch+0xd0/0x180
[c0000000e8e87ab0] c0000000003b6e88 .__schedule+0x6e0/0x798
[c0000000e8e87b90] c0000000003b7148 .schedule+0xec/0x11c
[c0000000e8e87c10] c0000000003b7a40 .do_nanosleep+0x6c/0xcc
[c0000000e8e87c90] c000000000080738 .hrtimer_nanosleep+0x7c/0x100
[c0000000e8e87d90] c000000000080830 .sys_nanosleep+0x74/0x94
[c0000000e8e87e30] c0000000000086ac syscall_exit+0x0/0x40
--- Exception: c00 (System Call) at 0000008026449844
SP (400014185f0) is in userspace


This is generated by the BUG_ON lines in the pick_next_pushable_task
function introduced by sched-only-push-once-per-queue.patch.

The -rt kernel prior to this patch did not trigger such BUGs.

All this was tried with
CONFIG_GROUP_SCHED=n
CONFIG_RT_GROUP_SCHED=n


Setting the options
CONFIG_GROUP_SCHED=y
CONFIG_RT_GROUP_SCHED=y
seems to solve the problem.
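
For readers without the source at hand, the kind of sanity checks involved can be pictured with a minimal userspace sketch. It assumes the checks resemble those in mainline's pick_next_pushable_task() (the task belongs to this run queue, is not the running task, may run on more than one CPU, is queued, and is an RT task). The struct, field, and function names below are hypothetical, and the sketch does not establish which check sits at sched_rt.c:1041.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of the task at the head of a pushable list. */
struct push_candidate {
	int  cpu;              /* CPU whose run queue the task belongs to */
	bool is_current;       /* true if it is the task running on that CPU */
	int  nr_cpus_allowed;  /* size of its affinity mask */
	bool on_rq;            /* true if it is queued on a run queue */
	bool is_rt;            /* true if it has a realtime priority */
};

/*
 * Return 0 if the candidate may be pushed away from rq_cpu, otherwise the
 * 1-based index of the first violated invariant. The kernel would hit a
 * BUG_ON() at that point instead of returning a code.
 */
static int check_push_candidate(const struct push_candidate *p, int rq_cpu)
{
	assert(p != NULL);

	if (p->cpu != rq_cpu)
		return 1;	/* entry sits on the wrong run queue's list */
	if (p->is_current)
		return 2;	/* the running task is never pushable */
	if (p->nr_cpus_allowed <= 1)
		return 3;	/* pinned tasks cannot migrate */
	if (!p->on_rq)
		return 4;	/* stale entry: task is no longer queued */
	if (!p->is_rt)
		return 5;	/* only RT tasks belong on this list */
	return 0;
}
```

In this sketch, a task left on the pushable list after being dequeued from the run queue would fail the fourth check.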



-- 
-Thanks,Chirag

* Re: [BUG][PPC64] BUG in 2.6.26.5-rt9 causing Hang
  2008-09-25 12:32 [BUG][PPC64] BUG in 2.6.26.5-rt9 causing Hang Chirag Jog
@ 2008-09-29 18:13 ` Gregory Haskins
  2008-09-29 21:18 ` Gregory Haskins
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 24+ messages in thread
From: Gregory Haskins @ 2008-09-29 18:13 UTC (permalink / raw)
  To: ghaskins, linux-rt-users, LKML, Steven Rostedt

Hi Chirag

Chirag Jog wrote:
> Hi Gregory,
> We see the following BUG followed by a hang on the latest kernel 2.6.26.5-rt9 on a Power6 blade (PPC64)
> It is easily recreated by running the async_handler or sbrk_mutex (realtime tests from ltp) tests.
>   
FYI I am looking at this now.  I suspect a dequeue_pushable_task()
probably found its way inside a conditional for GROUP_SCHED and
inadvertently gets compiled away if you disable the feature.
Investigating now..
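
The class of mistake suspected here can be shown with a small userspace sketch. Everything in it is hypothetical (the TOY_GROUP_SCHED macro, struct toy_rq, and the function names are made up for illustration and are not taken from sched_rt.c); it only shows how bookkeeping placed inside a feature conditional silently disappears when the feature is configured out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a run queue: two counters that should move together. */
struct toy_rq {
	int nr_running;   /* tasks queued on the run queue */
	int nr_pushable;  /* entries on the pushable list */
};

static void toy_enqueue_task(struct toy_rq *rq)
{
	assert(rq != NULL);
	rq->nr_running++;
	rq->nr_pushable++;
}

static void toy_dequeue_task(struct toy_rq *rq)
{
	assert(rq != NULL);
	rq->nr_running--;
#ifdef TOY_GROUP_SCHED
	/* Only compiled when the (hypothetical) feature macro is defined. */
	rq->nr_pushable--;
#endif
}

/* Entries left on the pushable list after one enqueue/dequeue pair. */
static int toy_stale_entries(void)
{
	struct toy_rq rq = { 0, 0 };

	toy_enqueue_task(&rq);
	toy_dequeue_task(&rq);
	return rq.nr_pushable;
}
```

Built without -DTOY_GROUP_SCHED, one stale entry remains after the pair; built with it, the list is empty again.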

-Greg


* Re: [BUG][PPC64] BUG in 2.6.26.5-rt9 causing Hang
  2008-09-25 12:32 [BUG][PPC64] BUG in 2.6.26.5-rt9 causing Hang Chirag Jog
  2008-09-29 18:13 ` Gregory Haskins
@ 2008-09-29 21:18 ` Gregory Haskins
  2008-09-29 21:34   ` Gregory Haskins
  2008-10-02 11:18   ` [BUG][PPC64] BUG in 2.6.26.5-rt9 causing Hang Gilles Carry
  2008-10-03 12:42 ` [RT PATCH 0/2] fix for BUG_ON crash in 26.5-rt9 Gregory Haskins
  2008-10-06 15:14 ` [RT PATCH v3 0/2] Fix for "[BUG][PPC64] BUG in 2.6.26.5-rt9 causing Hang" Gregory Haskins
  3 siblings, 2 replies; 24+ messages in thread
From: Gregory Haskins @ 2008-09-29 21:18 UTC (permalink / raw)
  To: ghaskins, linux-rt-users, LKML, Steven Rostedt

Hi Chirag

Chirag Jog wrote:
> Hi Gregory,
> We see the following BUG followed by a hang on the latest kernel 2.6.26.5-rt9 on a Power6 blade (PPC64)
> It is easily recreated by running the async_handler or sbrk_mutex (realtime tests from ltp) tests.
>   

Call me an LTP newbie, but where can I get the sbrk_mutex/async_handler
tests?  I installed the LTP rpm and I don't seem to have tests by that
name.  I have a 26.5-rt9 kernel all configured and ready to go but no
way to reproduce this issue.  Please advise.

Regards,
-Greg



* Re: [BUG][PPC64] BUG in 2.6.26.5-rt9 causing Hang
  2008-09-29 21:18 ` Gregory Haskins
@ 2008-09-29 21:34   ` Gregory Haskins
  2008-09-29 22:00     ` Gregory Haskins
  2008-10-02 11:18   ` [BUG][PPC64] BUG in 2.6.26.5-rt9 causing Hang Gilles Carry
  1 sibling, 1 reply; 24+ messages in thread
From: Gregory Haskins @ 2008-09-29 21:34 UTC (permalink / raw)
  To: linux-rt-users, LKML, Steven Rostedt, chirag

Gregory Haskins wrote:
> Hi Chirag
>
> Chirag Jog wrote:
>   
>> Hi Gregory,
>> We see the following BUG followed by a hang on the latest kernel 2.6.26.5-rt9 on a Power6 blade (PPC64)
>> It is easily recreated by running the async_handler or sbrk_mutex (realtime tests from ltp) tests.
>>   
>>     
>
> Call me an LTP newbie, but where can I get the sbrk_mutex/async_handler
> tests.
Ok, I figured out this part.  I needed a newer version of the .rpm from
a different repo.  However, both async_handler and sbrk_mutex seem to
segfault for me.  Hmm

---

ghaskins@test:~> uname -a
Linux test 2.6.26.5-rt9-rt #1 SMP PREEMPT RT Mon Sep 29 14:26:45 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux

test:/home/ghaskins # /usr/lib64/ltp/testcases/realtime/func/async_handler/async_handler

-----------------------------------
Asynchronous Event Handling Latency
-----------------------------------

jvmsim disabled
Running 1000000 iterations
Segmentation fault

---

Any ideas?

-Greg


* Re: [BUG][PPC64] BUG in 2.6.26.5-rt9 causing Hang
  2008-09-29 21:34   ` Gregory Haskins
@ 2008-09-29 22:00     ` Gregory Haskins
  2008-09-30  4:43       ` Chirag Jog
  0 siblings, 1 reply; 24+ messages in thread
From: Gregory Haskins @ 2008-09-29 22:00 UTC (permalink / raw)
  To: linux-rt-users, LKML, Steven Rostedt, chirag; +Cc: dvhltc

Gregory Haskins wrote:
> Gregory Haskins wrote:
>   
>> Hi Chirag
>>
>> Chirag Jog wrote:
>>   
>>     
>>> Hi Gregory,
>>> We see the following BUG followed by a hang on the latest kernel 2.6.26.5-rt9 on a Power6 blade (PPC64)
>>> It is easily recreated by running the async_handler or sbrk_mutex (realtime tests from ltp) tests.
>>>   
>>>     
>>>       
>> Call me an LTP newbie, but where can I get the sbrk_mutex/async_handler
>> tests.
>>     
> Ok, I figured out this part.  I needed a newer version of the .rpm from
> a different repo.  However, both async_handler and sbrk_mutex seem to
> segfault for me.  Hmm
>   

Thanks to help from Darren I got around this issue.  Unfortunately both
tests pass, so I cannot reproduce this issue, nor do I see the problem
via code inspection.  I'll keep digging but I am currently at a loss.  I
may need to send you some diagnostic patches to find this, if that is ok
with you, Chirag?

-Greg


* Re: [BUG][PPC64] BUG in 2.6.26.5-rt9 causing Hang
  2008-09-29 22:00     ` Gregory Haskins
@ 2008-09-30  4:43       ` Chirag Jog
  2008-09-30  6:47         ` Gilles Carry
  2008-10-01 14:22         ` [PATCH] sched: add a stacktrace on enqueue_pushable error Gregory Haskins
  0 siblings, 2 replies; 24+ messages in thread
From: Chirag Jog @ 2008-09-30  4:43 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: linux-rt-users, LKML, Steven Rostedt, dvhltc, Dinakar Guniguntala

Hi Gregory,
* Gregory Haskins <ghaskins@novell.com> [2008-09-29 18:00:01]:

> Gregory Haskins wrote:
> > Gregory Haskins wrote:
> >   
> >> Hi Chirag
> >>
> >> Chirag Jog wrote:
> >>   
> >>     
> >>> Hi Gregory,
> >>> We see the following BUG followed by a hang on the latest kernel 2.6.26.5-rt9 on a Power6 blade (PPC64)
> >>> It is easily recreated by running the async_handler or sbrk_mutex (realtime tests from ltp) tests.
> >>>   
> >>>     
> >>>       
> >> Call me an LTP newbie, but where can I get the sbrk_mutex/async_handler
> >> tests.
> >>     
> > Ok, I figured out this part.  I needed a newer version of the .rpm from
> > a different repo.  However, both async_handler and sbrk_mutex seem to
> > segfault for me.  Hmm
> >   
> 
> Thanks to help from Darren I got around this issue.  Unfortunately both
> tests pass so I cannot reproduce this issue, nor do I see the problem
> via code inspection.  Ill keep digging but I am currently at a loss.  I
> may need to send you some diagnostic patches to find this, if that is ok
> with you Chirag?
This particular bug is not reproducible on the x86 boxes I have access
to, only on ppc64.
Please send the diagnostic patches across.
I'll try them out! :)

-- 
-Thanks,Chirag

* Re: [BUG][PPC64] BUG in 2.6.26.5-rt9 causing Hang
  2008-09-30  4:43       ` Chirag Jog
@ 2008-09-30  6:47         ` Gilles Carry
  2008-10-01 14:22         ` [PATCH] sched: add a stacktrace on enqueue_pushable error Gregory Haskins
  1 sibling, 0 replies; 24+ messages in thread
From: Gilles Carry @ 2008-09-30  6:47 UTC (permalink / raw)
  To: Chirag Jog
  Cc: Gregory Haskins, linux-rt-users, LKML, Steven Rostedt, dvhltc,
	Dinakar Guniguntala

Chirag Jog wrote:
> Hi Gregory,
> * Gregory Haskins <ghaskins@novell.com> [2008-09-29 18:00:01]:
> 
> 
>>Gregory Haskins wrote:
>>
>>>Gregory Haskins wrote:
>>>  
>>>
>>>>Hi Chirag
>>>>
>>>>Chirag Jog wrote:
>>>>  
>>>>    
>>>>
>>>>>Hi Gregory,
>>>>>We see the following BUG followed by a hang on the latest kernel 2.6.26.5-rt9 on a Power6 blade (PPC64)
>>>>>It is easily recreated by running the async_handler or sbrk_mutex (realtime tests from ltp) tests.
>>>>>  
>>>>>    
>>>>>      
>>>>
>>>>Call me an LTP newbie, but where can I get the sbrk_mutex/async_handler
>>>>tests.
>>>>    
>>>
>>>Ok, I figured out this part.  I needed a newer version of the .rpm from
>>>a different repo.  However, both async_handler and sbrk_mutex seem to
>>>segfault for me.  Hmm
>>>  
>>
>>Thanks to help from Darren I got around this issue.  Unfortunately both
>>tests pass so I cannot reproduce this issue, nor do I see the problem
>>via code inspection.  Ill keep digging but I am currently at a loss.  I
>>may need to send you some diagnostic patches to find this, if that is ok
>>with you Chirag?
> 
> This particular bug is not producible on the x86 boxes, i have access
> to. Only on ppc64.
> Please send the diagnostic patches across. 
> I'll try them out! :)
> 

Hi,

I have access to Power6 and x86_64 boxes and so far I could only
reproduce the bug on PPC64.

The bug first showed up in 2.6.26.3-rt6, with
sched-only-push-if-pushable.patch and sched-only-push-once-per-queue.patch.

While sbrk_mutex definitely triggers the problem, it can also occur
randomly, sometimes during the boot period.

At the beginning, I had system hangs or this (very similar to Chirag's and
not necessarily in sbrk_mutex):

cpu 0x3: Vector: 700 (Program Check) at [c0000000ee30b600]
     pc: c0000000001b9bac: .__list_add+0x70/0xa0
     lr: c0000000001b9ba8: .__list_add+0x6c/0xa0
     sp: c0000000ee30b880
    msr: 8000000000021032
   current = 0xc0000000ee2b1830
   paca    = 0xc0000000005c3980
     pid   = 51, comm = sirq-sched/3
kernel BUG at lib/list_debug.c:33!
enter ? for help
[c0000000ee30b900] c0000000001b8ec0 .plist_del+0x6c/0xcc
[c0000000ee30b9a0] c00000000004d500 .dequeue_pushable_task+0x24/0x3c
[c0000000ee30ba20] c00000000004ec18 .push_rt_task+0x1f0/0x2c0
[c0000000ee30bae0] c00000000004ed0c .push_rt_tasks+0x24/0x44
[c0000000ee30bb70] c00000000004ed58 .post_schedule_rt+0x2c/0x50
[c0000000ee30bc00] c0000000000527c4 .finish_task_switch+0x100/0x1a8
[c0000000ee30bcb0] c0000000002cd1e0 .__schedule+0x688/0x744
[c0000000ee30bd90] c0000000002cd4ec .schedule+0xf4/0x128
[c0000000ee30be20] c000000000061634 .ksoftirqd+0x124/0x37c
[c0000000ee30bf00] c000000000076cf0 .kthread+0x84/0xd4
[c0000000ee30bf90] c000000000029368 .kernel_thread+0x4c/0x68
3:mon>

So I suspected a memory corruption but adding padding fields around
the pointers and extra checks did not reveal any data trashing.


Playing with xmon, I finally found out that when hanging, the system
was stuck in an infinite loop in plist_check_list.
So I modified lib/plist.c, on the assumption that no list in the
system holds more than 100 000 000 elements. ;-)

  static void plist_check_list(struct list_head *top)
  {
         struct list_head *prev = top, *next = top->next;
+       unsigned long long i = 1;

         plist_check_prev_next(top, prev, next);
         while (next != top) {
+               BUG_ON(i++ >    100000000);
                 prev = next;
                 next = prev->next;
                 plist_check_prev_next(top, prev, next);


and got this:

cpu 0x6: Vector: 700 (Program Check) at [c0000000eeda7530]
     pc: c0000000001ba498: .plist_check_list+0x68/0xb4
     lr: c0000000001ba4b4: .plist_check_list+0x84/0xb4
     sp: c0000000eeda77b0
    msr: 8000000000021032
   current = 0xc0000000ee80dfa0
   paca    = 0xc0000000005d3f80
     pid   = 2602, comm = sbrk_mutex
kernel BUG at lib/plist.c:50!
enter ? for help
[c0000000eeda7850] c0000000001ba530 .plist_check_head+0x4c/0x64
[c0000000eeda78e0] c0000000001ba57c .plist_del+0x34/0xdc
[c0000000eeda7980] c00000000004d734 .dequeue_pushable_task+0x24/0x3c
[c0000000eeda7a00] c00000000004d7c4 .pick_next_task_rt+0x38/0x58
[c0000000eeda7a90] c0000000002cefb0 .__schedule+0x510/0x75c
[c0000000eeda7b70] c0000000002cf44c .schedule+0xf4/0x128
[c0000000eeda7c00] c0000000002cfe4c .do_nanosleep+0x7c/0xe4
[c0000000eeda7c90] c00000000007be68 .hrtimer_nanosleep+0x84/0x10c
[c0000000eeda7d90] c00000000007bf6c .sys_nanosleep+0x7c/0xa0
[c0000000eeda7e30] c0000000000086ac syscall_exit+0x0/0x40
--- Exception: c01 (System Call) at 00000080fdb85880
SP (4000843e660) is in userspace

which corresponds to the BUG_ON I added.
It seems that the pushable_tasks list is corrupted: it never loops
back to the first element (top). Is there a shortcut anywhere?
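
The bounded walk used above to catch the hang can be sketched in userspace as follows. This is a minimal illustration and not the kernel's plist code: struct toy_node and toy_walk() are hypothetical names, and the step limit is passed in by the caller instead of being hard-coded.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical minimal circular list node (not the kernel's struct list_head). */
struct toy_node {
	struct toy_node *next;
};

/*
 * Walk the list starting after top until the walk returns to top.
 * Returns the number of elements visited, or -1 if the walk has not
 * come back to top after limit steps, which indicates a list that
 * cycles without passing through its head.
 */
static long toy_walk(const struct toy_node *top, long limit)
{
	const struct toy_node *next;
	long n = 0;

	assert(top != NULL);
	next = top->next;

	while (next != top) {
		if (++n > limit)
			return -1;	/* give up instead of spinning forever */
		next = next->next;
	}
	return n;
}
```

With a healthy three-node ring (top, a, b) the walk reports two elements; if b's next pointer is redirected to a, the walk never reaches top again and the limit is what ends it.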



Since the patches don't feature any arch-specific change, I'm looking
for arch-specific code triggered by the modifications brought by the
patches. Still searching...

Also for me, enabling the CONFIG_GROUP_SCHED options hides the problem.

I'm going to harden plist_check_list and see what it does.

Gilles.

* [PATCH] sched: add a stacktrace on enqueue_pushable error
  2008-09-30  4:43       ` Chirag Jog
  2008-09-30  6:47         ` Gilles Carry
@ 2008-10-01 14:22         ` Gregory Haskins
  2008-10-02  9:42           ` Gilles Carry
  1 sibling, 1 reply; 24+ messages in thread
From: Gregory Haskins @ 2008-10-01 14:22 UTC (permalink / raw)
  To: Chirag Jog; +Cc: linux-rt-users, linux-kernel, rostedt, dvhltc, dino

Hi Chirag,
  Please apply this patch to your 26.5-rt9 tree for ppc64, enable
  CONFIG_PROVE_LOCKING (which enables CONFIG_STACKTRACE) and give it a whirl.
  If you get an oops, please post the console output.

Thanks!

-Greg

------------------
sched: add a stacktrace on enqueue_pushable error

NOT FOR INCLUSION!

This is to help debug an issue discovered by Chirag Jog in the thread

http://lkml.org/lkml/2008/9/25/189

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 include/linux/sched.h |    6 ++++++
 kernel/sched_rt.c     |   20 +++++++++++++++++++-
 2 files changed, 25 insertions(+), 1 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 67da014..53e8a6a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -87,6 +87,7 @@ struct sched_param {
 #include <linux/task_io_accounting.h>
 #include <linux/kobject.h>
 #include <linux/latencytop.h>
+#include <linux/stacktrace.h>
 
 #include <asm/processor.h>
 
@@ -1170,6 +1171,11 @@ struct task_struct {
 
 	struct list_head tasks;
 	struct plist_node pushable_tasks;
+	struct {
+		unsigned long data[15];
+		struct stack_trace trace;
+		pid_t pid;
+	} pushable_stack;
 
 	/*
 	 * ptrace_list/ptrace_children forms the list of my children
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 57a0c0d..0e6a88c 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -55,7 +55,25 @@ static void update_rt_migration(struct rq *rq)
 
 static void enqueue_pushable_task(struct rq *rq, struct task_struct *p)
 {
-	plist_del(&p->pushable_tasks, &rq->rt.pushable_tasks);
+	if (plist_node_empty(&p->pushable_tasks)) {
+		struct stack_trace *trace = &p->pushable_stack.trace;
+
+		trace->nr_entries = 0;
+		trace->max_entries = sizeof(p->pushable_stack.data)/sizeof(p->pushable_stack.data[0]);
+		trace->entries = &p->pushable_stack.data[0];
+		trace->skip = 0;
+
+		save_stack_trace(trace);
+
+		p->pushable_stack.pid = current->pid;
+	} else {
+		printk(KERN_CRIT "redundant enqueue by %d detected\n",
+		       p->pushable_stack.pid);
+		print_stack_trace(&p->pushable_stack.trace, 5);
+
+		BUG_ON(!plist_node_empty(&p->pushable_tasks));
+	}
+
 	plist_node_init(&p->pushable_tasks, p->prio);
 	plist_add(&p->pushable_tasks, &rq->rt.pushable_tasks);
 }
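
The idea behind the diagnostic patch, remembering who queued a task first and complaining on a second enqueue, can be tried out in userspace with a sketch like the one below. It is only an illustration under simplifying assumptions: struct toy_task and the toy_* functions are hypothetical, a plain caller id stands in for the saved stack trace, and the redundant enqueue returns an error instead of hitting BUG_ON().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical task with just enough state to detect a double enqueue. */
struct toy_task {
	bool queued;       /* is the task currently on the pushable list? */
	int  enqueued_by;  /* id of the caller that queued it */
};

/*
 * Returns 0 on a normal enqueue. On a redundant enqueue, reports who
 * queued the task first, leaves the task untouched, and returns -1.
 */
static int toy_enqueue(struct toy_task *p, int caller)
{
	assert(p != NULL);

	if (p->queued) {
		fprintf(stderr, "redundant enqueue, first queued by %d\n",
			p->enqueued_by);
		return -1;
	}
	p->queued = true;
	p->enqueued_by = caller;
	return 0;
}

/* Mark the task as no longer queued so it can be enqueued again. */
static void toy_dequeue(struct toy_task *p)
{
	assert(p != NULL);
	p->queued = false;
}
```

Recording the first enqueuer rather than the second is the useful part: the second caller is already visible at the point of failure, while the first one is otherwise lost.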


* Re: [PATCH] sched: add a stacktrace on enqueue_pushable error
  2008-10-01 14:22         ` [PATCH] sched: add a stacktrace on enqueue_pushable error Gregory Haskins
@ 2008-10-02  9:42           ` Gilles Carry
  0 siblings, 0 replies; 24+ messages in thread
From: Gilles Carry @ 2008-10-02  9:42 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Chirag Jog, linux-rt-users, linux-kernel, rostedt, dvhltc, dino,
	Jean-Pierre Dion, Sébastien Dugué, Tim Chavez

Hi,

I could reproduce the bug on an Intel architecture.
I found where the problem is.
I'll post the patch in a few hours.

Gilles.

-- 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Gilles.Carry
Linux Project team
mailto: gilles.carry@bull.net
Phone: +33 (0)4 76 29 74 27
Addr.: BULL S.A.  1 rue de Provence, B.P. 208 38432 Echirolles Cedex
http://www.bull.com
http://www.bullopensource.org/
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* Re: [BUG][PPC64] BUG in 2.6.26.5-rt9 causing Hang
  2008-09-29 21:18 ` Gregory Haskins
  2008-09-29 21:34   ` Gregory Haskins
@ 2008-10-02 11:18   ` Gilles Carry
  1 sibling, 0 replies; 24+ messages in thread
From: Gilles Carry @ 2008-10-02 11:18 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: linux-rt-users, LKML, Steven Rostedt

Hi,

I could reproduce the bug on Intel x86_64 with LTP's sbrk_mutex:

kernel BUG at kernel/sched_rt.c:1044!
invalid opcode: 0000 [1] PREEMPT SMP
CPU 5
Modules linked in: mptsas scsi_transport_sas
Pid: 27577, comm: sbrk_mutex Not tainted 2.6.26.5-rt9-00002-g3b27927 #23
RIP: 0010:[<ffffffff80227f95>]  [<ffffffff80227f95>] pick_next_pushable_task+0x61/0x77
RSP: 0018:ffff81007713fd28  EFLAGS: 00010046
RAX: 0000000000000005 RBX: ffff810083a4e280 RCX: ffff81013dcee458
RDX: ffff8100771f8000 RSI: ffff81013dcee2c0 RDI: ffff810083a4e280
RBP: ffff81007713fd28 R08: ffff81007713e000 R09: 0000000000000000
R10: 000000004bbbc9e0 R11: ffff81007dc3bde8 R12: ffff81023ff7c910
R13: ffff8101bf4ad0c0 R14: 0000000000000001 R15: ffff810083a4e280
FS:  000000004d3bf940(0063) GS:ffff81013f4458c0(0000) knlGS:00000000f7f216c0
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000389d495770 CR3: 000000007c11a000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process sbrk_mutex (pid: 27577, threadinfo ffff81007713e000, task ffff8100771d61c0)
Stack:  ffff81007713fd68 ffffffff8022b1ce ffff81007713fdc8 ffff810083a4e280
  ffff81023ff7c910 ffff8101bf4ad0c0 0000000000000001 0000000000000000
  ffff81007713fd88 ffffffff8022b3c3 ffff81007713fda8 ffff810083a4e280
Call Trace:
  [<ffffffff8022b1ce>] push_rt_task+0x26/0x207
  [<ffffffff8022b3c3>] push_rt_tasks+0x14/0x1c
  [<ffffffff8022b3e4>] post_schedule_rt+0x19/0x25
  [<ffffffff8022d7e9>] finish_task_switch+0x73/0x121
  [<ffffffff805bbe3d>] thread_return+0x4f/0xdc
  [<ffffffff805bc066>] schedule+0xd4/0xf0
  [<ffffffff805bc686>] do_nanosleep+0x5c/0x9c
  [<ffffffff80248350>] ? hrtimer_nanosleep+0x54/0xbd
  [<ffffffff80247c9d>] ? hrtimer_wakeup+0x0/0x21
  [<ffffffff805bc66b>] ? do_nanosleep+0x41/0x9c
  [<ffffffff8022e9f4>] ? schedule_tail+0x43/0x97
  [<ffffffff80248405>] ? sys_nanosleep+0x4c/0x62
  [<ffffffff8020b32a>] ? system_call_after_swapgs+0x8a/0x8f


Code: 42 18 74 04 0f 0b eb fe 48 39 b7 48 0e 00 00 75 04 0f 0b eb fe 83 b9 50 ff ff ff 01 7f 04 0f 0b eb fe 83 b9 e0 fe ff ff 00 75 04 <0f> 0b eb fe 83 b9 8c fe ff ff 63 7e 04 0f 0b eb fe c9 48 89 f0
RIP  [<ffffffff80227f95>] pick_next_pushable_task+0x61/0x77
  RSP <ffff81007713fd28>




The difference from powerpc64 is that you need to be patient:
it takes tens of minutes to BUG/hang on Intel, whereas on Power it's
almost immediate.


I just posted the patch on this list (Fix pushable_task list corruption).

Greg, please can you review this patch and comment?
Thanks.

Gilles.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [RT PATCH 0/2] fix for BUG_ON crash in 26.5-rt9
  2008-09-25 12:32 [BUG][PPC64] BUG in 2.6.26.5-rt9 causing Hang Chirag Jog
  2008-09-29 18:13 ` Gregory Haskins
  2008-09-29 21:18 ` Gregory Haskins
@ 2008-10-03 12:42 ` Gregory Haskins
  2008-10-03 12:43   ` [PATCH 1/2] RT: Remove comment that is no longer true Gregory Haskins
                     ` (2 more replies)
  2008-10-06 15:14 ` [RT PATCH v3 0/2] Fix for "[BUG][PPC64] BUG in 2.6.26.5-rt9 causing Hang" Gregory Haskins
  3 siblings, 3 replies; 24+ messages in thread
From: Gregory Haskins @ 2008-10-03 12:42 UTC (permalink / raw)
  To: Chirag Jog
  Cc: linux-rt-users, linux-kernel, rostedt, dvhltc, dino, Gilles.Carry

Hi Chirag,

 Please try the following patches applied to 26.5-rt9 and let me know what you
 find.

Hi Steve,
  If these look good to everyone, please consider them for inclusion in -rt10. 

Regards,
-Greg

---

Gregory Haskins (2):
      RT: remove "paranoid" limit in push_rt_task
      RT: Remove comment that is no longer true


 kernel/sched_rt.c |   15 +++------------
 1 files changed, 3 insertions(+), 12 deletions(-)


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1/2] RT: Remove comment that is no longer true
  2008-10-03 12:42 ` [RT PATCH 0/2] fix for BUG_ON crash in 26.5-rt9 Gregory Haskins
@ 2008-10-03 12:43   ` Gregory Haskins
  2008-10-03 12:43   ` [PATCH 2/2] RT: remove "paranoid" limit in push_rt_task Gregory Haskins
  2008-10-03 12:54   ` [RT PATCH 0/2] fix for BUG_ON crash in 26.5-rt9 Gregory Haskins
  2 siblings, 0 replies; 24+ messages in thread
From: Gregory Haskins @ 2008-10-03 12:43 UTC (permalink / raw)
  To: Chirag Jog
  Cc: linux-rt-users, linux-kernel, rostedt, dvhltc, dino, Gilles.Carry

We fixed the condition noted in the comment with the "pushable_tasks"
logic, but forgot to remove this comment.  Let's clean it up.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 kernel/sched_rt.c |   10 ----------
 1 files changed, 0 insertions(+), 10 deletions(-)

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 57a0c0d..59ead84 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -1125,16 +1125,6 @@ out:
 	return 1;
 }
 
-/*
- * TODO: Currently we just use the second highest prio task on
- *       the queue, and stop when it can't migrate (or there's
- *       no more RT tasks).  There may be a case where a lower
- *       priority RT task has a different affinity than the
- *       higher RT task. In this case the lower RT task could
- *       possibly be able to migrate where as the higher priority
- *       RT task could not.  We currently ignore this issue.
- *       Enhancements are welcome!
- */
 static void push_rt_tasks(struct rq *rq)
 {
 	/* push_rt_task will return true if it moved an RT */


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 2/2] RT: remove "paranoid" limit in push_rt_task
  2008-10-03 12:42 ` [RT PATCH 0/2] fix for BUG_ON crash in 26.5-rt9 Gregory Haskins
  2008-10-03 12:43   ` [PATCH 1/2] RT: Remove comment that is no longer true Gregory Haskins
@ 2008-10-03 12:43   ` Gregory Haskins
  2008-10-03 13:46     ` Gilles Carry
  2008-10-03 12:54   ` [RT PATCH 0/2] fix for BUG_ON crash in 26.5-rt9 Gregory Haskins
  2 siblings, 1 reply; 24+ messages in thread
From: Gregory Haskins @ 2008-10-03 12:43 UTC (permalink / raw)
  To: Chirag Jog
  Cc: linux-rt-users, linux-kernel, rostedt, dvhltc, dino, Gilles.Carry

A panic was discovered by Chirag Jog, and Gilles Carry traced it to
the fact that a task being pushed away may get migrated during a
double_lock_balance.  The result was that the pushable_tasks list
could become corrupted.

The root cause is that the "paranoid" retry limit could cause us to
bail out of a retry, but still try to remove the item from the (now
potentially incorrect) list.  There are numerous ways to correct the
condition, but the paranoid feature is no longer relevant with the new
pushable logic (since pushable naturally limits the loop anyway), so
let's just remove it.

Reported-by: Chirag Jog <chirag@linux.vnet.ibm.com>
Found-by: Gilles Carry <gilles.carry@bull.net>
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 kernel/sched_rt.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 59ead84..5a754fe 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -1056,7 +1056,6 @@ static int push_rt_task(struct rq *rq)
 {
 	struct task_struct *next_task;
 	struct rq *lowest_rq;
-	int paranoid = RT_MAX_TRIES;
 
 	if (!rq->rt.overloaded)
 		return 0;
@@ -1094,12 +1093,14 @@ static int push_rt_task(struct rq *rq)
 		 * If it has, then try again.
 		 */
 		task = pick_next_pushable_task(rq);
-		if (unlikely(task != next_task) && task && paranoid--) {
+		if (unlikely(task != next_task) && task) {
 			put_task_struct(next_task);
 			next_task = task;
 			goto retry;
 		}
 
+		BUG_ON(task_cpu(next_task) != rq->cpu);
+
 		/*
 		 * Once we have failed to push this task, we will not
 		 * try again, since the other cpus will pull from us


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/2] RT: remove "paranoid" limit in push_rt_task
  2008-10-03 12:43   ` [PATCH 2/2] RT: remove "paranoid" limit in push_rt_task Gregory Haskins
@ 2008-10-03 13:46     ` Gilles Carry
  2008-10-03 15:45       ` Chirag Jog
  2008-10-03 17:26       ` [RT PATCH v2 0/2] Series short description Gregory Haskins
  0 siblings, 2 replies; 24+ messages in thread
From: Gilles Carry @ 2008-10-03 13:46 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Chirag Jog, linux-rt-users, linux-kernel, rostedt, dvhltc, dino

Sorry Greg,

Neither PPC64 nor Intel64 make it with this patch.
At boot time, it stops at the BUG_ON you added:
0xc00000000004eca4 is in push_rt_task (kernel/sched_rt.c:1102)

I'll let you investigate further.
Have a good weekend in your garage ;)

Gilles.


PPC64:
cpu 0x2: Vector: 700 (Program Check) at [c0000000ee2877b0]
     pc: c00000000004eca4: .push_rt_task+0x1f4/0x2d0
     lr: c00000000004ec24: .push_rt_task+0x174/0x2d0
     sp: c0000000ee287a30
    msr: 8000000000021032
   current = 0xc0000000ee276fe0
   paca    = 0xc0000000005c3780
     pid   = 36, comm = sirq-block/2
kernel BUG at kernel/sched_rt.c:1102!
enter ? for help
[c0000000ee287a30] c00000000004ec78 .push_rt_task+0x1c8/0x2d0 (unreliable)
[c0000000ee287ae0] c00000000004eda4 .push_rt_tasks+0x24/0x44
[c0000000ee287b70] c00000000004edf0 .post_schedule_rt+0x2c/0x50
[c0000000ee287c00] c000000000052864 .finish_task_switch+0x100/0x1a8
[c0000000ee287cb0] c0000000002cdbd0 .__schedule+0x6a0/0x75c
[c0000000ee287d90] c0000000002cdedc .schedule+0xf4/0x128
[c0000000ee287e20] c000000000061700 .ksoftirqd+0x124/0x37c
[c0000000ee287f00] c000000000076dc0 .kthread+0x84/0xd4
[c0000000ee287f90] c000000000029368 .kernel_thread+0x4c/0x68
2:mon>

Intel64:
kernel BUG at kernel/sched_rt.c:1102!
invalid opcode: 0000 [1] PREEMPT SMP
CPU 4
Modules linked in: mptsas scsi_transport_sas
Pid: 61, comm: sirq-block/4 Not tainted 2.6.26.5-rt9-00002-g3b27927-dirty #26
RIP: 0010:[<ffffffff8022b307>]  [<ffffffff8022b307>] push_rt_task+0x15f/0x20b
RSP: 0018:ffff81007f4d5d70  EFLAGS: 00010097
RAX: 0000000000000000 RBX: ffff81007edf09d0 RCX: 000000000822b765
RDX: 000000000822b765 RSI: 0000000000000000 RDI: ffff81000103f280
RBP: ffff81007f4d5da0 R08: ffff81007f4d4000 R09: ffff81007edcbe20
R10: 00000000ffffffff R11: ffffffff8021fa2c R12: 0000000000000000
R13: ffff810001034280 R14: ffff81007edf09e0 R15: ffff81000103f280
FS:  00007f2f26e776f0(0000) GS:ffff81007fc0ccc0(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000000006b9fb0 CR3: 00000001bf4c9000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process sirq-block/4 (pid: 61, threadinfo ffff81007f4d4000, task ffff81007f4d0e10)
Stack:  000000007f4d5e00 ffff81000103f280 ffff81007edf09d0 ffff8101bf457540
  0000000000000001 0000000000000002 ffff81007f4d5dc0 ffffffff8022b3c7
  ffff81007f4d5de0 ffff81000103f280 ffff81007f4d5de0 ffffffff8022b3e8
Call Trace:
  [<ffffffff8022b3c7>] push_rt_tasks+0x14/0x1c
  [<ffffffff8022b3e8>] post_schedule_rt+0x19/0x25
  [<ffffffff8022d7ee>] finish_task_switch+0x73/0x121
  [<ffffffff805bbe3d>] thread_return+0x4f/0xdc
  [<ffffffff805bc066>] schedule+0xd4/0xf0
  [<ffffffff80237eeb>] ksoftirqd+0xb3/0x260
  [<ffffffff80237e38>] ? ksoftirqd+0x0/0x260
  [<ffffffff80245209>] ? kthread+0x47/0x76
  [<ffffffff8022e9f9>] ? schedule_tail+0x43/0x97
  [<ffffffff8020c3d8>] ? child_rip+0xa/0x12
  [<ffffffff802451c2>] ? kthread+0x0/0x76
  [<ffffffff8020c3ce>] ? child_rip+0x0/0x12


Code: 48 c7 c6 c0 1d 23 80 e8 83 b3 03 00 e9 ee fe ff ff 4c 89 e7 e8 b1 31 39
00 eb ba 48 8b 43 08 8b 40 18 41 3b 87 90 0e 00 00 74 04 <0f> 0b eb fe 48 89
de 4c 89 ff e8 5b fe ff ff f0 41 ff 0e 0f 94
RIP  [<ffffffff8022b307>] push_rt_task+0x15f/0x20b
  RSP <ffff81007f4d5d70>


Gregory Haskins wrote:
> A panic was discovered by Chirag Jog, and Gilles Carry traced it to
> the fact that a task being pushed away may get migrated during a
> double_lock_balance.  The result was that the pushable_tasks list
> could become corrupted.
> 
> The root cause is that the "paranoid" retry limit could cause us to
> bail out of a retry, but still try to remove the item from the (now
> potentially incorrect) list.  There are numerous ways to correct the
> condition, but the paranoid feature is no longer relevant with the new
> pushable logic (since pushable naturally limits the loop anyway), so
> let's just remove it.
> 
> Reported-by: Chirag Jog <chirag@linux.vnet.ibm.com>
> Found-by: Gilles Carry <gilles.carry@bull.net>
> Signed-off-by: Gregory Haskins <ghaskins@novell.com>
> ---
> 
>  kernel/sched_rt.c |    5 +++--
>  1 files changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
> index 59ead84..5a754fe 100644
> --- a/kernel/sched_rt.c
> +++ b/kernel/sched_rt.c
> @@ -1056,7 +1056,6 @@ static int push_rt_task(struct rq *rq)
>  {
>  	struct task_struct *next_task;
>  	struct rq *lowest_rq;
> -	int paranoid = RT_MAX_TRIES;
>  
>  	if (!rq->rt.overloaded)
>  		return 0;
> @@ -1094,12 +1093,14 @@ static int push_rt_task(struct rq *rq)
>  		 * If it has, then try again.
>  		 */
>  		task = pick_next_pushable_task(rq);
> -		if (unlikely(task != next_task) && task && paranoid--) {
> +		if (unlikely(task != next_task) && task) {
>  			put_task_struct(next_task);
>  			next_task = task;
>  			goto retry;
>  		}
>  
> +		BUG_ON(task_cpu(next_task) != rq->cpu);
> +
>  		/*
>  		 * Once we have failed to push this task, we will not
>  		 * try again, since the other cpus will pull from us

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/2] RT: remove "paranoid" limit in push_rt_task
  2008-10-03 13:46     ` Gilles Carry
@ 2008-10-03 15:45       ` Chirag Jog
  2008-10-03 17:27         ` Gregory Haskins
  2008-10-03 17:26       ` [RT PATCH v2 0/2] Series short description Gregory Haskins
  1 sibling, 1 reply; 24+ messages in thread
From: Chirag Jog @ 2008-10-03 15:45 UTC (permalink / raw)
  To: Gilles Carry
  Cc: Gregory Haskins, linux-rt-users, linux-kernel, rostedt, dvhltc,
	dino

* Gilles Carry <Gilles.Carry@bull.net> [2008-10-03 15:46:59]:

> Sorry Greg,
>
> Neither PPC64 nor Intel64 make it with this patch.
> At boot time, it stops at the BUG_ON you added:
> 0xc00000000004eca4 is in push_rt_task (kernel/sched_rt.c:1102)
I can also confirm the issue Gilles reported.

I do have a question, though:
When we enable group scheduling (CONFIG_RT_GROUP_SCHED),
all the problems disappear.
After skimming through the rt scheduler code, I don't think group
scheduling alters the behavior of the push/pull strategies in any way.

So I am wondering whether enabling group scheduling
actually solves the problem or just makes it harder
to reproduce?

> I'll let you investigate further.
> Have a good weekend in your garage ;)
Have a great weekend :)
>
-- 
-Thanks,Chirag

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/2] RT: remove "paranoid" limit in push_rt_task
  2008-10-03 15:45       ` Chirag Jog
@ 2008-10-03 17:27         ` Gregory Haskins
  0 siblings, 0 replies; 24+ messages in thread
From: Gregory Haskins @ 2008-10-03 17:27 UTC (permalink / raw)
  To: Gilles Carry, Gregory Haskins, linux-rt-users, linux-kernel,
	rostedt, dvhltc, dino

[-- Attachment #1: Type: text/plain, Size: 949 bytes --]

Chirag Jog wrote:
> * Gilles Carry <Gilles.Carry@bull.net> [2008-10-03 15:46:59]:
>
>   
>> Sorry Greg,
>>
>> Neither PPC64 nor Intel64 make it with this patch.
>> At boot time, it stops at the BUG_ON you added:
>> 0xc00000000004eca4 is in push_rt_task (kernel/sched_rt.c:1102)
>>     
> I can also confirm the issue Gilles reported.
>
> I do have a question, though:
> When we enable group scheduling (CONFIG_RT_GROUP_SCHED),
> all the problems disappear.
> After skimming through the rt scheduler code, I don't think group
> scheduling alters the behavior of the push/pull strategies in any way.
>
> So I am wondering whether enabling group scheduling
> actually solves the problem or just makes it harder
> to reproduce?
>   
Hi Chirag,


The issue that Gilles pointed me at is a race condition.  I do not
suspect GROUP_SCHED itself has anything to do with the problem other
than it changes the timing.  HTH!

-Greg




[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [RT PATCH v2 0/2] Series short description
  2008-10-03 13:46     ` Gilles Carry
  2008-10-03 15:45       ` Chirag Jog
@ 2008-10-03 17:26       ` Gregory Haskins
  2008-10-03 17:26         ` [RT PATCH v2 1/2] RT: Remove comment that is no longer true Gregory Haskins
  2008-10-03 17:26         ` [RT PATCH v2 2/2] RT: remove "paranoid" limit in push_rt_task Gregory Haskins
  1 sibling, 2 replies; 24+ messages in thread
From: Gregory Haskins @ 2008-10-03 17:26 UTC (permalink / raw)
  To: Chirag Jog
  Cc: linux-rt-users, linux-kernel, rostedt, dvhltc, dino, Gilles.Carry

Gilles Carry wrote:
> Sorry Greg,
>
> Neither PPC64 nor Intel64 make it with this patch.
> At boot time, it stops at the BUG_ON you added:
> 0xc00000000004eca4 is in push_rt_task (kernel/sched_rt.c:1102)
>


Indeed.  Your report has revealed the problem to me.

The issue is that there are three conditions embedded in that if (!lowest_rq)
code, but two are buried in the !retry case.  This was the mistake I was making.

We basically need to:

a) dequeue if the task hasn't moved
b) retry if the task *has* moved AND there are more tasks left
c) stop if the task *has* moved AND there are no more tasks

I was missing logic to handle (c).  "v2" should fix this so it is handled.
Please give it a try.  Thanks again, Gilles!

(Again, only build-tested)
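
For illustration, the three cases (a), (b) and (c) above can be sketched as a
small stand-alone C function.  This is not the kernel code: `struct task`,
`enum push_action`, `decide_after_failed_lock()` and its parameters are
hypothetical names made up for this sketch, and a plain int stands in for
task_cpu(next_task) and rq->cpu.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct task_struct; only the cpu matters here. */
struct task {
	int cpu;
};

enum push_action {
	PUSH_DEQUEUE,	/* (a) task has not moved: dequeue it and stop */
	PUSH_RETRY,	/* (b) task has moved, more tasks left: retry */
	PUSH_STOP,	/* (c) task has moved, no tasks left: stop */
};

/*
 * next_task: the task we were trying to push before the lock was dropped
 * task:      what the pushable list offers now (NULL if it is empty)
 * rq_cpu:    the cpu that owns the run-queue we are pushing from
 */
static enum push_action decide_after_failed_lock(const struct task *next_task,
						 const struct task *task,
						 int rq_cpu)
{
	/* (a) still on our run-queue and still the next pushable task */
	if (next_task->cpu == rq_cpu && task == next_task)
		return PUSH_DEQUEUE;

	/* (c) something changed and nothing is left to push */
	if (!task)
		return PUSH_STOP;

	/* (b) something changed and another candidate exists */
	return PUSH_RETRY;
}
```

Only the PUSH_DEQUEUE outcome touches the pushable list, which is the point
of keeping (a) separate from (c).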

Regards,
-Greg


---

Gregory Haskins (2):
      RT: remove "paranoid" limit in push_rt_task
      RT: Remove comment that is no longer true


 kernel/sched_rt.c |   44 ++++++++++++++++++++++----------------------
 1 files changed, 22 insertions(+), 22 deletions(-)


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [RT PATCH v2 1/2] RT: Remove comment that is no longer true
  2008-10-03 17:26       ` [RT PATCH v2 0/2] Series short description Gregory Haskins
@ 2008-10-03 17:26         ` Gregory Haskins
  2008-10-03 17:26         ` [RT PATCH v2 2/2] RT: remove "paranoid" limit in push_rt_task Gregory Haskins
  1 sibling, 0 replies; 24+ messages in thread
From: Gregory Haskins @ 2008-10-03 17:26 UTC (permalink / raw)
  To: Chirag Jog
  Cc: linux-rt-users, linux-kernel, rostedt, dvhltc, dino, Gilles.Carry

We fixed the condition noted in the comment with the "pushable_tasks"
logic, but forgot to remove this comment.  Let's clean it up.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 kernel/sched_rt.c |   10 ----------
 1 files changed, 0 insertions(+), 10 deletions(-)

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 57a0c0d..59ead84 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -1125,16 +1125,6 @@ out:
 	return 1;
 }
 
-/*
- * TODO: Currently we just use the second highest prio task on
- *       the queue, and stop when it can't migrate (or there's
- *       no more RT tasks).  There may be a case where a lower
- *       priority RT task has a different affinity than the
- *       higher RT task. In this case the lower RT task could
- *       possibly be able to migrate where as the higher priority
- *       RT task could not.  We currently ignore this issue.
- *       Enhancements are welcome!
- */
 static void push_rt_tasks(struct rq *rq)
 {
 	/* push_rt_task will return true if it moved an RT */

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RT PATCH v2 2/2] RT: remove "paranoid" limit in push_rt_task
  2008-10-03 17:26       ` [RT PATCH v2 0/2] Series short description Gregory Haskins
  2008-10-03 17:26         ` [RT PATCH v2 1/2] RT: Remove comment that is no longer true Gregory Haskins
@ 2008-10-03 17:26         ` Gregory Haskins
  1 sibling, 0 replies; 24+ messages in thread
From: Gregory Haskins @ 2008-10-03 17:26 UTC (permalink / raw)
  To: Chirag Jog
  Cc: linux-rt-users, linux-kernel, rostedt, dvhltc, dino, Gilles.Carry

A panic was discovered by Chirag Jog, and Gilles Carry traced it to
the fact that a task being pushed away may get migrated during a
double_lock_balance.  The result was that the pushable_tasks list
could become corrupted.

The root cause is that the "paranoid" retry limit could cause us to
bail out of a retry, but still try to remove the item from the (now
potentially incorrect) list.  There are numerous ways to correct the
condition, but the paranoid feature is no longer relevant with the new
pushable logic (since pushable naturally limits the loop anyway), so
let's just remove it.

Reported-by: Chirag Jog <chirag@linux.vnet.ibm.com>
Found-by: Gilles Carry <gilles.carry@bull.net>
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 kernel/sched_rt.c |   34 ++++++++++++++++++++++------------
 1 files changed, 22 insertions(+), 12 deletions(-)

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 59ead84..201bd97 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -1056,7 +1056,6 @@ static int push_rt_task(struct rq *rq)
 {
 	struct task_struct *next_task;
 	struct rq *lowest_rq;
-	int paranoid = RT_MAX_TRIES;
 
 	if (!rq->rt.overloaded)
 		return 0;
@@ -1090,23 +1089,34 @@ static int push_rt_task(struct rq *rq)
 		struct task_struct *task;
 		/*
 		 * find lock_lowest_rq releases rq->lock
-		 * so it is possible that next_task has changed.
-		 * If it has, then try again.
+		 * so it is possible that next_task has migrated.
+		 *
+		 * We need to make sure that the task is still on the same
+		 * run-queue and is also still the next task eligible for
+		 * pushing.
 		 */
 		task = pick_next_pushable_task(rq);
-		if (unlikely(task != next_task) && task && paranoid--) {
-			put_task_struct(next_task);
-			next_task = task;
-			goto retry;
+		if (task_cpu(next_task) == rq->cpu && task == next_task) {
+			/*
+			 * If we get here, the task hasnt moved it all, but
+			 * it has failed to push.  We will not try again,
+			 * since the other cpus will pull from us when they
+			 * are ready.
+			 */
+			dequeue_pushable_task(rq, next_task);
+			goto out;
 		}
+		
+		if (!task)
+			/* No more tasks, just exit */
+			goto out;
 
 		/*
-		 * Once we have failed to push this task, we will not
-		 * try again, since the other cpus will pull from us
-		 * when they are ready
+		 * Something has shifted, try again.
 		 */
-		dequeue_pushable_task(rq, next_task);
-		goto out;
+		put_task_struct(next_task);
+		next_task = task;
+		goto retry;
 	}
 
 	deactivate_task(rq, next_task, 0);


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [RT PATCH 0/2] fix for BUG_ON crash in 26.5-rt9
  2008-10-03 12:42 ` [RT PATCH 0/2] fix for BUG_ON crash in 26.5-rt9 Gregory Haskins
  2008-10-03 12:43   ` [PATCH 1/2] RT: Remove comment that is no longer true Gregory Haskins
  2008-10-03 12:43   ` [PATCH 2/2] RT: remove "paranoid" limit in push_rt_task Gregory Haskins
@ 2008-10-03 12:54   ` Gregory Haskins
  2 siblings, 0 replies; 24+ messages in thread
From: Gregory Haskins @ 2008-10-03 12:54 UTC (permalink / raw)
  To: Chirag Jog
  Cc: linux-rt-users, linux-kernel, rostedt, dvhltc, dino, Gilles.Carry

[-- Attachment #1: Type: text/plain, Size: 421 bytes --]

Gregory Haskins wrote:
> Hi Chirag,
>
>  Please try the following patches applied to 26.5-rt9 and let me know what you
>  find.
>
> Hi Steve,
>   If these look good to everyone, please consider them for inclusion in -rt10. 
>
>   

I meant to add: these are build-tested only (now back to vacation time
for me, which means slave labor in the garage!).

I will catch up with you guys on Monday.

-Greg



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [RT PATCH v3 0/2] Fix for "[BUG][PPC64] BUG in 2.6.26.5-rt9 causing Hang"
  2008-09-25 12:32 [BUG][PPC64] BUG in 2.6.26.5-rt9 causing Hang Chirag Jog
                   ` (2 preceding siblings ...)
  2008-10-03 12:42 ` [RT PATCH 0/2] fix for BUG_ON crash in 26.5-rt9 Gregory Haskins
@ 2008-10-06 15:14 ` Gregory Haskins
  2008-10-06 15:14   ` [RT PATCH v3 1/2] RT: Remove comment that is no longer true Gregory Haskins
                     ` (2 more replies)
  3 siblings, 3 replies; 24+ messages in thread
From: Gregory Haskins @ 2008-10-06 15:14 UTC (permalink / raw)
  To: Chirag Jog
  Cc: linux-rt-users, linux-kernel, rostedt, dvhltc, dino, Gilles.Carry

Hi Steve,
  Chirag reported (via IRC) that v2 fixed his issue.  V3 is identical to v2
  except I cleaned up the patch description and fixed a comment typo.  This
  applies to 26.5-rt9.  Please consider it an urgent fix for -rt10.

Regards,
-Greg

---

Gregory Haskins (2):
      RT: fix push_rt_task() to handle dequeue_pushable properly
      RT: Remove comment that is no longer true


 kernel/sched_rt.c |   44 ++++++++++++++++++++++----------------------
 1 files changed, 22 insertions(+), 22 deletions(-)

-- 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [RT PATCH v3 1/2] RT: Remove comment that is no longer true
  2008-10-06 15:14 ` [RT PATCH v3 0/2] Fix for "[BUG][PPC64] BUG in 2.6.26.5-rt9 causing Hang" Gregory Haskins
@ 2008-10-06 15:14   ` Gregory Haskins
  2008-10-06 15:14   ` [RT PATCH v3 2/2] RT: fix push_rt_task() to handle dequeue_pushable properly Gregory Haskins
  2008-10-07  6:04   ` [RT PATCH v3 0/2] Fix for "[BUG][PPC64] BUG in 2.6.26.5-rt9 causing Hang" Gilles Carry
  2 siblings, 0 replies; 24+ messages in thread
From: Gregory Haskins @ 2008-10-06 15:14 UTC (permalink / raw)
  To: Chirag Jog
  Cc: linux-rt-users, linux-kernel, rostedt, dvhltc, dino, Gilles.Carry

We fixed the condition noted in the comment with the "pushable_tasks"
logic, but forgot to remove this comment.  Let's clean it up.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 kernel/sched_rt.c |   10 ----------
 1 files changed, 0 insertions(+), 10 deletions(-)

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 57a0c0d..59ead84 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -1125,16 +1125,6 @@ out:
 	return 1;
 }
 
-/*
- * TODO: Currently we just use the second highest prio task on
- *       the queue, and stop when it can't migrate (or there's
- *       no more RT tasks).  There may be a case where a lower
- *       priority RT task has a different affinity than the
- *       higher RT task. In this case the lower RT task could
- *       possibly be able to migrate where as the higher priority
- *       RT task could not.  We currently ignore this issue.
- *       Enhancements are welcome!
- */
 static void push_rt_tasks(struct rq *rq)
 {
 	/* push_rt_task will return true if it moved an RT */


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RT PATCH v3 2/2] RT: fix push_rt_task() to handle dequeue_pushable properly
  2008-10-06 15:14 ` [RT PATCH v3 0/2] Fix for "[BUG][PPC64] BUG in 2.6.26.5-rt9 causing Hang" Gregory Haskins
  2008-10-06 15:14   ` [RT PATCH v3 1/2] RT: Remove comment that is no longer true Gregory Haskins
@ 2008-10-06 15:14   ` Gregory Haskins
  2008-10-07  6:04   ` [RT PATCH v3 0/2] Fix for "[BUG][PPC64] BUG in 2.6.26.5-rt9 causing Hang" Gilles Carry
  2 siblings, 0 replies; 24+ messages in thread
From: Gregory Haskins @ 2008-10-06 15:14 UTC (permalink / raw)
  To: Chirag Jog
  Cc: linux-rt-users, linux-kernel, rostedt, dvhltc, dino, Gilles.Carry

A panic was discovered by Chirag Jog where a BUG_ON sanity check
in the new "pushable_tasks" logic would trigger under certain
circumstances:

http://lkml.org/lkml/2008/9/25/189

Gilles Carry discovered that the root cause was the pushable_tasks
list getting corrupted in the push_rt_task logic.  This was the
result of a dropped rq lock in double_lock_balance allowing a task
in the process of being pushed to potentially migrate away, and
thus corrupt the pushable_tasks list.

I traced the problem back to the pushable_tasks patch that went in
recently.  There is a "retry" path in push_rt_task() that had a
compound conditional to decide whether to retry or exit.  I missed
the rationale for the implicit "if (!task) goto out;" portion of the
compound statement and thus did not handle it properly.  The new
pushable_tasks logic actually creates three distinct conditions:

1) an untouched and unpushable task should be dequeued
2) a migrated task where more pushable tasks remain should be retried
3) a migrated task where no more pushable tasks exist should exit

The original logic mushed (1) and (3) together, resulting in the
system dequeuing a migrated task (against an unlocked foreign
run-queue, no less).

To fix this, we get rid of the notion of "paranoid" and we support the
three unique conditions properly.  The paranoid feature is no longer
relevant with the new pushable logic (since pushable naturally limits
the loop) anyway, so let's just remove it.
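
A small user-space model may help show how (1) and (3) ended up on the same
path.  It is only a sketch and not the kernel code: `enum action`,
`old_decision()`, `new_decision()` and the reduction of the task pointers to
booleans are assumptions made for illustration.

```c
#include <stdbool.h>

enum action { RETRY, DEQUEUE_AND_EXIT, EXIT_ONLY };

/*
 * Shape of the original test:
 *   if (task != next_task && task && paranoid--) -> retry
 *   otherwise                                    -> dequeue next_task, exit
 *
 * changed:   pick_next_pushable_task() no longer returns next_task
 * have_task: pick_next_pushable_task() returned non-NULL
 * tries:     remaining "paranoid" budget
 */
static enum action old_decision(bool changed, bool have_task, int tries)
{
	if (changed && have_task && tries != 0)
		return RETRY;
	/* Also reached when next_task migrated and the list is now empty. */
	return DEQUEUE_AND_EXIT;
}

/*
 * Shape of the fixed test: three distinct outcomes, no retry budget.
 *
 * on_rq: next_task still belongs to this run-queue's cpu
 */
static enum action new_decision(bool on_rq, bool changed, bool have_task)
{
	if (on_rq && !changed)
		return DEQUEUE_AND_EXIT;	/* condition 1 */
	if (!have_task)
		return EXIT_ONLY;		/* condition 3 */
	return RETRY;				/* condition 2 */
}
```

For the "migrated, nothing left" input, old_decision(true, false, n) yields
DEQUEUE_AND_EXIT, while new_decision(false, true, false) yields EXIT_ONLY.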

Reported-by: Chirag Jog <chirag@linux.vnet.ibm.com>
Found-by: Gilles Carry <gilles.carry@bull.net>
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 kernel/sched_rt.c |   34 ++++++++++++++++++++++------------
 1 files changed, 22 insertions(+), 12 deletions(-)

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 59ead84..05a1d4a 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -1056,7 +1056,6 @@ static int push_rt_task(struct rq *rq)
 {
 	struct task_struct *next_task;
 	struct rq *lowest_rq;
-	int paranoid = RT_MAX_TRIES;
 
 	if (!rq->rt.overloaded)
 		return 0;
@@ -1090,23 +1089,34 @@ static int push_rt_task(struct rq *rq)
 		struct task_struct *task;
 		/*
 		 * find lock_lowest_rq releases rq->lock
-		 * so it is possible that next_task has changed.
-		 * If it has, then try again.
+		 * so it is possible that next_task has migrated.
+		 *
+		 * We need to make sure that the task is still on the same
+		 * run-queue and is also still the next task eligible for
+		 * pushing.
 		 */
 		task = pick_next_pushable_task(rq);
-		if (unlikely(task != next_task) && task && paranoid--) {
-			put_task_struct(next_task);
-			next_task = task;
-			goto retry;
+		if (task_cpu(next_task) == rq->cpu && task == next_task) {
+			/*
+			 * If we get here, the task hasnt moved at all, but
+			 * it has failed to push.  We will not try again,
+			 * since the other cpus will pull from us when they
+			 * are ready.
+			 */
+			dequeue_pushable_task(rq, next_task);
+			goto out;
 		}
+		
+		if (!task)
+			/* No more tasks, just exit */
+			goto out;
 
 		/*
-		 * Once we have failed to push this task, we will not
-		 * try again, since the other cpus will pull from us
-		 * when they are ready
+		 * Something has shifted, try again.
 		 */
-		dequeue_pushable_task(rq, next_task);
-		goto out;
+		put_task_struct(next_task);
+		next_task = task;
+		goto retry;
 	}
 
 	deactivate_task(rq, next_task, 0);


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [RT PATCH v3 0/2] Fix for "[BUG][PPC64] BUG in 2.6.26.5-rt9 causing Hang"
  2008-10-06 15:14 ` [RT PATCH v3 0/2] Fix for "[BUG][PPC64] BUG in 2.6.26.5-rt9 causing Hang" Gregory Haskins
  2008-10-06 15:14   ` [RT PATCH v3 1/2] RT: Remove comment that is no longer true Gregory Haskins
  2008-10-06 15:14   ` [RT PATCH v3 2/2] RT: fix push_rt_task() to handle dequeue_pushable properly Gregory Haskins
@ 2008-10-07  6:04   ` Gilles Carry
  2 siblings, 0 replies; 24+ messages in thread
From: Gilles Carry @ 2008-10-07  6:04 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Chirag Jog, linux-rt-users, linux-kernel, rostedt, dvhltc, dino


Gregory Haskins wrote:
> Hi Steve,
>   Chirag reported (via IRC) that v2 fixed his issue.  V3 is identical to v2
>   except I cleaned up the patch description and fixed a comment typo.  This
>   applies to 26.5-rt9.  Please consider it an urgent fix for -rt10.
> 

Hi,

After 24 hours of testing on PPC64, I confirm that this patch fixes the issue.

Gilles.

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2008-10-07  6:08 UTC | newest]

Thread overview: 24+ messages
2008-09-25 12:32 [BUG][PPC64] BUG in 2.6.26.5-rt9 causing Hang Chirag Jog
2008-09-29 18:13 ` Gregory Haskins
2008-09-29 21:18 ` Gregory Haskins
2008-09-29 21:34   ` Gregory Haskins
2008-09-29 22:00     ` Gregory Haskins
2008-09-30  4:43       ` Chirag Jog
2008-09-30  6:47         ` Gilles Carry
2008-10-01 14:22         ` [PATCH] sched: add a stacktrace on enqueue_pushable error Gregory Haskins
2008-10-02  9:42           ` Gilles Carry
2008-10-02 11:18   ` [BUG][PPC64] BUG in 2.6.26.5-rt9 causing Hang Gilles Carry
2008-10-03 12:42 ` [RT PATCH 0/2] fix for BUG_ON crash in 26.5-rt9 Gregory Haskins
2008-10-03 12:43   ` [PATCH 1/2] RT: Remove comment that is no longer true Gregory Haskins
2008-10-03 12:43   ` [PATCH 2/2] RT: remove "paranoid" limit in push_rt_task Gregory Haskins
2008-10-03 13:46     ` Gilles Carry
2008-10-03 15:45       ` Chirag Jog
2008-10-03 17:27         ` Gregory Haskins
2008-10-03 17:26       ` [RT PATCH v2 0/2] Series short description Gregory Haskins
2008-10-03 17:26         ` [RT PATCH v2 1/2] RT: Remove comment that is no longer true Gregory Haskins
2008-10-03 17:26         ` [RT PATCH v2 2/2] RT: remove "paranoid" limit in push_rt_task Gregory Haskins
2008-10-03 12:54   ` [RT PATCH 0/2] fix for BUG_ON crash in 26.5-rt9 Gregory Haskins
2008-10-06 15:14 ` [RT PATCH v3 0/2] Fix for "[BUG][PPC64] BUG in 2.6.26.5-rt9 causing Hang" Gregory Haskins
2008-10-06 15:14   ` [RT PATCH v3 1/2] RT: Remove comment that is no longer true Gregory Haskins
2008-10-06 15:14   ` [RT PATCH v3 2/2] RT: fix push_rt_task() to handle dequeue_pushable properly Gregory Haskins
2008-10-07  6:04   ` [RT PATCH v3 0/2] Fix for "[BUG][PPC64] BUG in 2.6.26.5-rt9 causing Hang" Gilles Carry
