[Xenomai] Kernel freezes in __ipipe_sync


* [Xenomai] Kernel freezes in __ipipe_sync_stage
@ 2014-06-20  9:11 Marco Tessore
  2014-06-20 11:52 ` Gilles Chanteperdrix
  0 siblings, 1 reply; 11+ messages in thread
From: Marco Tessore @ 2014-06-20  9:11 UTC (permalink / raw)
  To: xenomai

Good morning,
I am fairly new to kernel development, and I have recently started
working on applications and device drivers for the
Linux / I-pipe / Xenomai platform.
I have a problem with a kernel, set up by another programmer, that is
installed on devices we have in production.

Please allow me to describe the problem:
the kernel rarely, but often enough to be a problem, freezes at boot.
From what I have seen with a hardware debugger (specifically the
Lauterbach T32), the freeze seems to be caused by the ipipe code.

The kernel is version 2.6.31 for ARM architecture - specifically a 
Freescale i.MX257, ARM926EJ-S - with Xenomai 2.5.6 and a not very recent
ipipe patch whose version I do not know; I presume it is the one
included in the Xenomai archive.
The stall seems to occur in the function __ipipe_sync_stage, in
kernel/ipipe/core.c, and can occur at various times during the system boot.

As an example, here is the stack I observed during one of these stall
conditions:

__ipipe_mach_get_tsc
xntimer_tick_aperiodic
xnintr_clock_handler
__ipipe_sync_stage
ipipe_suspend_domain
__ipipe_walk_pipeline
__ipipe_restore_pipeline
xnarch_next_htick_shot
clockevents_program_event
tick_dev_program_event
tick_program_event
hrtimer_interrupt
mxc_timer_interrupt
handle_IRQ_event
handle_level_irq
asm_do_IRQ
__ipipe_sync_stage  <-- loop
ipipe_suspend_domain
__ipipe_walk_pipeline
__ipipe_restore_pipeline_head
xnpod_enable_timesource
xnpod_init
__native_skin_init
do_one_initcall
kernel_init

The problem seems to occur within the first call to the function
__ipipe_sync_stage (as indicated by the arrow):
the innermost "while" loop never exits, because its condition stays
true:
((submask = p->irqpend_lomask[level]) != 0).

It seems that after p->irqpend_lomask[level] is cleared, and while the
interrupt service routine (the timer interrupt, I think) is running,
that flag, or some other flag in the variable, gets set again, and this
seems to cause the lockup.
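
To illustrate the condition above, here is a minimal user-space sketch
in C of a pending-mask drain loop. It is a simplified model, not the
actual I-pipe code: it assumes a single word of pending bits, and the
names pending, reraise_budget, handler and drain are hypothetical. It
only shows that such a loop cannot terminate while the handler keeps
setting the bit it was invoked for.

```c
#include <assert.h>

/* Hypothetical model: one word of pending IRQ bits for a single stage. */
static unsigned long pending;
/* How many more times the handler will re-raise its own bit. */
static int reraise_budget;

/* Index of the lowest set bit; mask must be non-zero. */
static int lowest_bit(unsigned long mask)
{
    int n = 0;

    assert(mask != 0);
    while (!(mask & 1UL)) {
        mask >>= 1;
        n++;
    }
    return n;
}

/* Stand-in for an interrupt handler that marks its own IRQ pending again. */
static void handler(int irq)
{
    if (reraise_budget > 0) {
        reraise_budget--;
        pending |= 1UL << irq;
    }
}

/*
 * Drain the pending word the way a sync loop would: pick a bit, clear it,
 * run the handler, then look at the word again. Returns the number of
 * handler calls, or -1 if max_calls handlers ran and the word was still
 * not empty.
 */
static int drain(int max_calls)
{
    unsigned long submask;
    int calls = 0;

    while ((submask = pending) != 0) {
        int irq = lowest_bit(submask);

        if (calls == max_calls)
            return -1;          /* give up: this models the observed stall */
        pending &= ~(1UL << irq);
        handler(irq);
        calls++;
    }
    return calls;
}
```

With pending = 1UL << 5 and reraise_budget = 3, drain(100) runs the
handler four times and returns 4; with a budget larger than the cap it
returns -1, which corresponds to the stall.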

Although I have an idea of the general mechanisms that drive ipipe, I
am not able to grasp the implementation details.
In particular, I cannot tell when and where those flags are set; I
presume it is when a hardware interrupt occurs.

I was wondering if you could give me an idea of what could cause the
stall, or if you had any suggestions on how to get out of it and let
the kernel proceed in a clean state.

Thank you in advance for any suggestions you could give me,
kind regards
Marco Tessore


* Re: [Xenomai] Kernel freezes in __ipipe_sync_stage
  2014-06-20  9:11 [Xenomai] Kernel freezes in __ipipe_sync_stage Marco Tessore
@ 2014-06-20 11:52 ` Gilles Chanteperdrix
  2014-06-20 12:18   ` Marco Tessore
  2014-06-24 16:41   ` Marco Tessore
  0 siblings, 2 replies; 11+ messages in thread
From: Gilles Chanteperdrix @ 2014-06-20 11:52 UTC (permalink / raw)
  To: Marco Tessore, xenomai

On 06/20/2014 11:11 AM, Marco Tessore wrote:
> The kernel is version 2.6.31 for ARM architecture - specifically a 

Do you have the same problem with a recent I-pipe patch, like one for
the 3.8 or 3.10 kernel?

-- 
                                                                Gilles.



* Re: [Xenomai] Kernel freezes in __ipipe_sync_stage
  2014-06-20 11:52 ` Gilles Chanteperdrix
@ 2014-06-20 12:18   ` Marco Tessore
  2014-06-20 12:25     ` Gilles Chanteperdrix
  2014-06-24 16:41   ` Marco Tessore
  1 sibling, 1 reply; 11+ messages in thread
From: Marco Tessore @ 2014-06-20 12:18 UTC (permalink / raw)
  To: Gilles Chanteperdrix, xenomai

On 20/06/2014 13:52, Gilles Chanteperdrix wrote:
> On 06/20/2014 11:11 AM, Marco Tessore wrote:
>> The kernel is version 2.6.31 for ARM architecture - specifically a
> Do you have the same problem with a recent I-pipe patches, like one for
> 3.8 or 3.10 kernel?
>

One note:
Philippe had already written me an email about this, full of tips on
how to identify the cause, which is probably not attributable to ipipe;
unfortunately the email went into spam and I had not seen it.

For the moment I have enough material to work with;
if something comes up again in the tests I will do on 3.10, I will let
you know.

Thank you very much
Marco


* Re: [Xenomai] Kernel freezes in __ipipe_sync_stage
  2014-06-20 12:18   ` Marco Tessore
@ 2014-06-20 12:25     ` Gilles Chanteperdrix
  0 siblings, 0 replies; 11+ messages in thread
From: Gilles Chanteperdrix @ 2014-06-20 12:25 UTC (permalink / raw)
  To: Marco Tessore, xenomai

On 06/20/2014 02:18 PM, Marco Tessore wrote:
> Il 20/06/2014 13:52, Gilles Chanteperdrix ha scritto:
>> On 06/20/2014 11:11 AM, Marco Tessore wrote:
>>> The kernel is version 2.6.31 for ARM architecture - specifically a
>> Do you have the same problem with a recent I-pipe patches, like one for
>> 3.8 or 3.10 kernel?
>>
> 
> One note:
> Philippe had already written an email about, full of tips on how to 
> identify the cause, and that is probably not attributable to ipipe, 
> unfortunately the email went into spam and I had not seen.
> 
> For the moment I have enough material to work with,
> if the tests, that I will do on the 3.10, were to emerge again something 
> I will inform you.

Ok, there is a very old bug in the imx tsc emulation, which used a
physical address as a virtual address. See for instance:

http://www.armadeus.com/wiki/index.php?title=Xenomai#Xenomai_kernel_space_support

-- 
                                                                Gilles.



* Re: [Xenomai] Kernel freezes in __ipipe_sync_stage
  2014-06-20 11:52 ` Gilles Chanteperdrix
  2014-06-20 12:18   ` Marco Tessore
@ 2014-06-24 16:41   ` Marco Tessore
  2014-06-24 17:10     ` Philippe Gerum
  1 sibling, 1 reply; 11+ messages in thread
From: Marco Tessore @ 2014-06-24 16:41 UTC (permalink / raw)
  To: Gilles Chanteperdrix, xenomai

Hi,

On 20/06/2014 13:52, Gilles Chanteperdrix wrote:
> On 06/20/2014 11:11 AM, Marco Tessore wrote:
>> The kernel is version 2.6.31 for ARM architecture - specifically a
> Do you have the same problem with a recent I-pipe patches, like one for
> 3.8 or 3.10 kernel?
>

I managed to run some tests on a 3.10 kernel, but on another board with
an i.MX28 CPU; that kernel freezes too,
but I haven't debugged it with the JTAG debugger.

Instead, I have some more information on the original problem, which is
the one that worries me more:

In summary:
I have a board based on imx25, with kernel 2.6.31, Xenomai 2.5.6 and 
ipipe patch 1.16-02.

Rarely, but often enough to be a problem, the kernel freezes at boot.
Thanks to a JTAG debugger I'm able to observe the kernel in the 
following situation:
I'm in an infinite loop with the following stack trace:
__ipipe_set_irqpending
xnintr_host_tick (__ipipe_propagate_irq)
xnintr_clock_handler
__ipipe_sync_stage    <- (1)
ipipe_suspend_domain
__ipipe_walk_pipeline
__ipipe_restore_pipeline_head
xnarch_next_tick_shot
clockevents_program_event
tick_dev_program_event
hrtimer_interrupt
mxc_interrupt
handle_IRQ_event
handle_level_irq
asm_do_IRQ
__ipipe_sync_stage <- (2)
ipipe_suspend_domain
__ipipe_walk_pipeline
__ipipe_restore_pipeline_head
xnpod_enable_timesource
xnpod_init
__native_skin_init
...
...

Specifically, the first call to __ipipe_sync_stage, the one marked with
the number (2), is working on a stage that I cannot determine;
let's call it stage S1 for convenience (I think it is the Linux
secondary domain, but I'm not sure).
That call invokes the interrupt handler of the system timer.
Further up the stack there is a nested call to __ipipe_sync_stage,
marked (1), which works on another stage; let's call it S2.
This nested call in turn invokes a handler for the timer irq, which at
some point calls __ipipe_propagate_irq, which raises the flags for
stage S1 again,
so the first call to __ipipe_sync_stage (2) never gets out of its
while loops.

I should add that I do not see any hardware interrupt for the timer in
the function __ipipe_grab_IRQ.
I have no idea how the cycle is triggered, but once the kernel is
locked up, it is stuck in the purely software infinite loop described
above.

In the hope that you can help me understand what is going on,
I would like to try a patch like this:
- Store, for each nesting level of __ipipe_sync_stage, the irq number
currently being handled and the stage on whose behalf it runs.
- Patch the function __ipipe_set_irqpending so that it does not set the
flags for a pair (irq, stage) if that pair is already present at some
level in the current stack trace.
- That is, if the function __ipipe_sync_stage is executing the handler
for a stage, and has therefore cleared the flags in irqpend_himask and
irqpend_lomask, it does not expect the handler to raise the same flag
for the same stage again.
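
The idea in the list above can be modeled with a short user-space
sketch in C. This is only an illustration of the bookkeeping, not a
patch against the real __ipipe_set_irqpending: the names in_flight,
sync_enter, sync_leave, set_irqpending_guarded, stage_pending and
dropped are all hypothetical, and the model assumes two stages with one
word of pending bits each.

```c
#include <assert.h>

#define MAX_NEST 8
#define NR_STAGES 2

/* One (irq, stage) pair per nesting level of the sync loop. */
struct in_flight {
    int irq;
    int stage;
};

static struct in_flight stack[MAX_NEST];
static int depth;

/* One word of pending IRQ bits per stage (simplified). */
static unsigned long stage_pending[NR_STAGES];
/* Number of pending requests suppressed by the guard. */
static int dropped;

/* Record that a handler for (irq, stage) is about to run. */
static void sync_enter(int irq, int stage)
{
    assert(depth < MAX_NEST);
    stack[depth].irq = irq;
    stack[depth].stage = stage;
    depth++;
}

/* Forget the innermost (irq, stage) pair once its handler returned. */
static void sync_leave(void)
{
    assert(depth > 0);
    depth--;
}

/*
 * Mark irq pending for stage unless the same pair is already being
 * handled at some nesting level. Returns 1 if the bit was set, 0 if
 * the request was suppressed.
 */
static int set_irqpending_guarded(int irq, int stage)
{
    int i;

    for (i = 0; i < depth; i++) {
        if (stack[i].irq == irq && stack[i].stage == stage) {
            dropped++;
            return 0;
        }
    }
    stage_pending[stage] |= 1UL << irq;
    return 1;
}
```

In this model a suppressed request is simply lost (it is only counted
in dropped), so a tick propagated to S1 while S1 is already handling
the same irq would never be delivered.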

What do you think about this?

Thank you very much for any kind of advice you could give me

Sincerely
Marco Tessore


* Re: [Xenomai] Kernel freezes in __ipipe_sync_stage
  2014-06-24 16:41   ` Marco Tessore
@ 2014-06-24 17:10     ` Philippe Gerum
  2014-06-25  7:50       ` Marco Tessore
  0 siblings, 1 reply; 11+ messages in thread
From: Philippe Gerum @ 2014-06-24 17:10 UTC (permalink / raw)
  To: Marco Tessore, Gilles Chanteperdrix, xenomai

On 06/24/2014 06:41 PM, Marco Tessore wrote:
> Hi,
>
> Il 20/06/2014 13:52, Gilles Chanteperdrix ha scritto:
>> On 06/20/2014 11:11 AM, Marco Tessore wrote:
>>> The kernel is version 2.6.31 for ARM architecture - specifically a
>> Do you have the same problem with a recent I-pipe patches, like one for
>> 3.8 or 3.10 kernel?
>>
>
> I managed to do some tests on 3.10 kernel but on onother board with
> imx28 CPU, actually it happens that that kernel freezes too,
> but I haven't debugged it with the jtag debugger.
>
> I have, instead, some information on the original problem, that is the
> one that worried me more:
>
> In summary:
> I have a board based on imx25, with kernel 2.6.31, Xenomai 2.5.6 and
> ipipe patch 1.16-02.
>
> Rarely, but often enough to be a problem, the kernel freezes at boot.
> Thanks to a JTAG debugger I'm able to observe the kernel in the
> following situation:
> I'm in an infinite loop with the following stack trace:
> __ipipe_set_irqpending
> xnintr_host_tick (__ipipe_propagate_irq)
> xnintr_clock_handler
> __ipipe_sync_stage    <- (1)
> ipipe_suspend_domain
> __ipipe_walk_pipeline
> __ipipe_restore_pipeline_head
> xnarch_next_tick_shot
> clockevents_program_event
> tick_dev_program_event
> hrtimer_interrupt
> mxc_interrupt
> handle_IRQ_event
> handle_level_irq
> asm_do_IRQ
> __ipipe_sync_stage <- (2)
> ipipe_suspend_domain
> __ipipe_walk_pipeline
> __ipipe_restore_pipeline_head
> xnpod_enable_timesource
> xnpod_init
> __native_skin_init
> ...
> ...
>
> Specifically, it happens that the first call to __ipipe_sync_stage, the
> one marked with the number (2), is working on a stage that I can not
> determine,
> let's say for convenience stage S1, I think is the Linux secondary
> domain but I'm not sure,
> so the function invokes the interrupt handler of the system timer.
> Continuing in the stack trace, I have a nested call to
> __ipipe_sync_stage, indicated with (1),
> but this call works on another stage, for convenience domain S2,
> in turn this function invokes a handler for the timer irq, which at a
> certain point invokes the __ipipe_propagate_irq which raises the flags
> for the stage S1,
> thus making the first call to __ipipe_sync_stage (2) fails to get out of
> their while loops.
>
> I should add that I do not see hardware interrupt for the timer in
> function __ipipe_grab_IRQ.
> I have no idea how the cycle is triggered,but when the kernel is locked,
> the kernel is located in the software exclusively infinite loop
> described above.
>
>
> In the hope that you could help me understand what is going on,
> I would have liked groped a patch like this:
> - Store, for each level of nesting of __ipipe_sync_stage, the irq number
> currently running and on behalf of which stage.
> - Patch the function __ipipe_set_irqpending in such a way as not to set
> the flags for the pair (irq, stage) if the pair is already present at
> some level in the current stack trace, that is,
> - if the function __ipipe_sync_stage is executing the handler for a
> stage, and then he had reset the flags in irqpend_himask and
> irqpend_lomask, it does not expect the handler goes to raise again the
> same flag for the same stage.
>
> What do you think about this?
>
> Thank you very much for any kind of advice you could give me
>

You mentioned random lockups during boot. Does your board ever lock up
when passing xeno_hal.disable=1 on the kernel command line?

-- 
Philippe.



* Re: [Xenomai] Kernel freezes in __ipipe_sync_stage
  2014-06-24 17:10     ` Philippe Gerum
@ 2014-06-25  7:50       ` Marco Tessore
  2014-06-25  8:39         ` Philippe Gerum
  0 siblings, 1 reply; 11+ messages in thread
From: Marco Tessore @ 2014-06-25  7:50 UTC (permalink / raw)
  To: Philippe Gerum, Gilles Chanteperdrix, xenomai

On 24/06/2014 19:10, Philippe Gerum wrote:
> On 06/24/2014 06:41 PM, Marco Tessore wrote:
>> Hi,
>>
>> Il 20/06/2014 13:52, Gilles Chanteperdrix ha scritto:
>>> On 06/20/2014 11:11 AM, Marco Tessore wrote:
>>>> The kernel is version 2.6.31 for ARM architecture - specifically a
>>> Do you have the same problem with a recent I-pipe patches, like one for
>>> 3.8 or 3.10 kernel?
>>>
>>
>> I managed to do some tests on 3.10 kernel but on onother board with
>> imx28 CPU, actually it happens that that kernel freezes too,
>> but I haven't debugged it with the jtag debugger.
>>
>> I have, instead, some information on the original problem, that is the
>> one that worried me more:
>>
>> In summary:
>> I have a board based on imx25, with kernel 2.6.31, Xenomai 2.5.6 and
>> ipipe patch 1.16-02.
>>
>> Rarely, but often enough to be a problem, the kernel freezes at boot.
>> Thanks to a JTAG debugger I'm able to observe the kernel in the
>> following situation:
>> I'm in an infinite loop with the following stack trace:
>> __ipipe_set_irqpending
>> xnintr_host_tick (__ipipe_propagate_irq)
>> xnintr_clock_handler
>> __ipipe_sync_stage    <- (1)
>> ipipe_suspend_domain
>> __ipipe_walk_pipeline
>> __ipipe_restore_pipeline_head
>> xnarch_next_tick_shot
>> clockevents_program_event
>> tick_dev_program_event
>> hrtimer_interrupt
>> mxc_interrupt
>> handle_IRQ_event
>> handle_level_irq
>> asm_do_IRQ
>> __ipipe_sync_stage <- (2)
>> ipipe_suspend_domain
>> __ipipe_walk_pipeline
>> __ipipe_restore_pipeline_head
>> xnpod_enable_timesource
>> xnpod_init
>> __native_skin_init
>> ...
>> ...
>>
>> Specifically, it happens that the first call to __ipipe_sync_stage, the
>> one marked with the number (2), is working on a stage that I can not
>> determine,
>> let's say for convenience stage S1, I think is the Linux secondary
>> domain but I'm not sure,
>> so the function invokes the interrupt handler of the system timer.
>> Continuing in the stack trace, I have a nested call to
>> __ipipe_sync_stage, indicated with (1),
>> but this call works on another stage, for convenience domain S2,
>> in turn this function invokes a handler for the timer irq, which at a
>> certain point invokes the __ipipe_propagate_irq which raises the flags
>> for the stage S1,
>> thus making the first call to __ipipe_sync_stage (2) fails to get out of
>> their while loops.
>>
>> I should add that I do not see hardware interrupt for the timer in
>> function __ipipe_grab_IRQ.
>> I have no idea how the cycle is triggered,but when the kernel is locked,
>> the kernel is located in the software exclusively infinite loop
>> described above.
>>
>>
>> In the hope that you could help me understand what is going on,
>> I would have liked groped a patch like this:
>> - Store, for each level of nesting of __ipipe_sync_stage, the irq number
>> currently running and on behalf of which stage.
>> - Patch the function __ipipe_set_irqpending in such a way as not to set
>> the flags for the pair (irq, stage) if the pair is already present at
>> some level in the current stack trace, that is,
>> - if the function __ipipe_sync_stage is executing the handler for a
>> stage, and then he had reset the flags in irqpend_himask and
>> irqpend_lomask, it does not expect the handler goes to raise again the
>> same flag for the same stage.
>>
>> What do you think about this?
>>
>> Thank you very much for any kind of advice you could give me
>>
>
> You mentioned random lockups during boot. Does you board ever lock up 
> when passing xeno_hal.disable=1 on the kernel command line?
>
Yes, I mentioned random lockups, but the kernel always ends up in the
infinite loop described above.
Following your suggestion I tried passing the parameter
xeno_hal.disable=1, but the kernel said
"Unknown boot option `xeno_hal.disable=1': ignoring"

What is this option supposed to do anyway? If it disables the HAL,
doesn't it inhibit the Xenomai real-time services?

What about the patch described above that I would like to apply, i.e.
not allowing the interrupt handlers called in __ipipe_sync_stage to
raise a pair (stage, irq) that is already being handled in the current
stack?

Thank you
Marco Tessore




* Re: [Xenomai] Kernel freezes in __ipipe_sync_stage
  2014-06-25  7:50       ` Marco Tessore
@ 2014-06-25  8:39         ` Philippe Gerum
  2014-07-02 16:36           ` Marco Tessore
  2014-07-09 16:09           ` Marco Tessore
  0 siblings, 2 replies; 11+ messages in thread
From: Philippe Gerum @ 2014-06-25  8:39 UTC (permalink / raw)
  To: Marco Tessore, Gilles Chanteperdrix, xenomai

On 06/25/2014 09:50 AM, Marco Tessore wrote:
> Il 24/06/2014 19:10, Philippe Gerum ha scritto:
>> On 06/24/2014 06:41 PM, Marco Tessore wrote:
>>> Hi,
>>>
>>> Il 20/06/2014 13:52, Gilles Chanteperdrix ha scritto:
>>>> On 06/20/2014 11:11 AM, Marco Tessore wrote:
>>>>> The kernel is version 2.6.31 for ARM architecture - specifically a
>>>> Do you have the same problem with a recent I-pipe patches, like one for
>>>> 3.8 or 3.10 kernel?
>>>>
>>>
>>> I managed to do some tests on 3.10 kernel but on onother board with
>>> imx28 CPU, actually it happens that that kernel freezes too,
>>> but I haven't debugged it with the jtag debugger.
>>>
>>> I have, instead, some information on the original problem, that is the
>>> one that worried me more:
>>>
>>> In summary:
>>> I have a board based on imx25, with kernel 2.6.31, Xenomai 2.5.6 and
>>> ipipe patch 1.16-02.
>>>
>>> Rarely, but often enough to be a problem, the kernel freezes at boot.
>>> Thanks to a JTAG debugger I'm able to observe the kernel in the
>>> following situation:
>>> I'm in an infinite loop with the following stack trace:
>>> __ipipe_set_irqpending
>>> xnintr_host_tick (__ipipe_propagate_irq)
>>> xnintr_clock_handler
>>> __ipipe_sync_stage    <- (1)
>>> ipipe_suspend_domain
>>> __ipipe_walk_pipeline
>>> __ipipe_restore_pipeline_head
>>> xnarch_next_tick_shot
>>> clockevents_program_event
>>> tick_dev_program_event
>>> hrtimer_interrupt
>>> mxc_interrupt
>>> handle_IRQ_event
>>> handle_level_irq
>>> asm_do_IRQ
>>> __ipipe_sync_stage <- (2)
>>> ipipe_suspend_domain
>>> __ipipe_walk_pipeline
>>> __ipipe_restore_pipeline_head
>>> xnpod_enable_timesource
>>> xnpod_init
>>> __native_skin_init
>>> ...
>>> ...
>>>
>>> Specifically, it happens that the first call to __ipipe_sync_stage, the
>>> one marked with the number (2), is working on a stage that I can not
>>> determine,
>>> let's say for convenience stage S1, I think is the Linux secondary
>>> domain but I'm not sure,
>>> so the function invokes the interrupt handler of the system timer.
>>> Continuing in the stack trace, I have a nested call to
>>> __ipipe_sync_stage, indicated with (1),
>>> but this call works on another stage, for convenience domain S2,
>>> in turn this function invokes a handler for the timer irq, which at a
>>> certain point invokes the __ipipe_propagate_irq which raises the flags
>>> for the stage S1,
>>> thus making the first call to __ipipe_sync_stage (2) fails to get out of
>>> their while loops.
>>>
>>> I should add that I do not see hardware interrupt for the timer in
>>> function __ipipe_grab_IRQ.
>>> I have no idea how the cycle is triggered,but when the kernel is locked,
>>> the kernel is located in the software exclusively infinite loop
>>> described above.
>>>
>>>
>>> In the hope that you could help me understand what is going on,
>>> I would have liked groped a patch like this:
>>> - Store, for each level of nesting of __ipipe_sync_stage, the irq number
>>> currently running and on behalf of which stage.
>>> - Patch the function __ipipe_set_irqpending in such a way as not to set
>>> the flags for the pair (irq, stage) if the pair is already present at
>>> some level in the current stack trace, that is,
>>> - if the function __ipipe_sync_stage is executing the handler for a
>>> stage, and then he had reset the flags in irqpend_himask and
>>> irqpend_lomask, it does not expect the handler goes to raise again the
>>> same flag for the same stage.
>>>
>>> What do you think about this?
>>>
>>> Thank you very much for any kind of advice you could give me
>>>
>>
>> You mentioned random lockups during boot. Does you board ever lock up
>> when passing xeno_hal.disable=1 on the kernel command line?
>>
> Yes, I mentioned random lockups, but always the kernel enters in the
> infinite loop described above.
> Following your suggestion I tried to pass parameter xeno_hal.disable=1
> but kernel sayed
> "Unknown boot option `xeno_hal.disable=1': ignoring"
>

This is because you are running an outdated Xenomai 2.5.x release. A
workaround is to build all the Xenomai skins as modules (native, posix,
vxworks etc.), and to refrain from loading them during the boot
process.

> What is supposed to do this option anyway? If it would disable HAL, does
> not this inhibits xenomai realtime services?
>

This is exactly what we want. When the real-time services commence, 
control of the hardware timer is handed over to Xenomai, which enables 
pipelining of the clock source events to the co-kernel. We need to know
if this path is involved.

> What about the patch,described above, that I would apply? say, don't
> permit that the interrupt handlers called in __ipipe_sync_stage raise a
> couple (stage, irq) already handled in the current stack?
>

This won't work, this breaks an aspect of the pipeline core logic. This 
would be papering over the issue, not fixing it, opening a can of worms 
down the road. We are not chasing a bug in the core logic at this point, 
we are more likely chasing a bug in the SoC-specific code which binds 
the hw timer to the pipeline.

First step is to determine if the system experiences an IRQ storm of 
some sort from the timer chip, and why so. By focusing on the IRQ replay 
loop which basically resyncs the current interrupt state with the past 
events logged, you may be looking at rays from an ancient sun.
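
As a rough illustration of the first step, here is a small C sketch
that checks a list of captured IRQ timestamps for a storm. It is a
hypothetical helper, not part of Xenomai or the I-pipe: it assumes the
timestamps are in microseconds, sorted in ascending order, and were
obtained by some external means (a trace buffer or a logic analyzer,
for instance). The name irq_storm_detected and its parameters are
assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return 1 if any time window shorter than window_us contains more than
 * max_events timestamps, 0 otherwise. ts[] holds n timestamps in
 * microseconds, sorted in ascending order.
 */
static int irq_storm_detected(const uint64_t *ts, size_t n,
                              uint64_t window_us, size_t max_events)
{
    size_t start = 0;
    size_t end;

    if (window_us == 0)
        return 0;               /* an empty window cannot hold a storm */

    for (end = 0; end < n; end++) {
        /* The input must be sorted for the sliding window to work. */
        assert(end == 0 || ts[end] >= ts[end - 1]);

        /* Drop events that fell out of the window ending at ts[end]. */
        while (ts[end] - ts[start] >= window_us)
            start++;

        if (end - start + 1 > max_events)
            return 1;
    }
    return 0;
}
```

A regular 1 ms tick checked against a 2.5 ms window with a limit of
three events is not reported, while six events within a few
microseconds are.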

> Thank you
> Marco Tessore
>
>


-- 
Philippe.



* Re: [Xenomai] Kernel freezes in __ipipe_sync_stage
  2014-06-25  8:39         ` Philippe Gerum
@ 2014-07-02 16:36           ` Marco Tessore
  2014-07-02 17:41             ` Gilles Chanteperdrix
  2014-07-09 16:09           ` Marco Tessore
  1 sibling, 1 reply; 11+ messages in thread
From: Marco Tessore @ 2014-07-02 16:36 UTC (permalink / raw)
  To: Philippe Gerum, Gilles Chanteperdrix, xenomai

Good morning,

On 25/06/2014 10:39, Philippe Gerum wrote:
> On 06/25/2014 09:50 AM, Marco Tessore wrote:
>> Il 24/06/2014 19:10, Philippe Gerum ha scritto:
>>> On 06/24/2014 06:41 PM, Marco Tessore wrote:
>>>> Hi,
>>>>
>>>> Il 20/06/2014 13:52, Gilles Chanteperdrix ha scritto:
>>>>> On 06/20/2014 11:11 AM, Marco Tessore wrote:
>>>>>> The kernel is version 2.6.31 for ARM architecture - specifically a
>>>>> Do you have the same problem with a recent I-pipe patches, like 
>>>>> one for
>>>>> 3.8 or 3.10 kernel?
>>>>>
>>>>
>>>> I managed to do some tests on 3.10 kernel but on onother board with
>>>> imx28 CPU, actually it happens that that kernel freezes too,
>>>> but I haven't debugged it with the jtag debugger.
>>>>
>>>> I have, instead, some information on the original problem, that is the
>>>> one that worried me more:
>>>>
>>>> In summary:
>>>> I have a board based on imx25, with kernel 2.6.31, Xenomai 2.5.6 and
>>>> ipipe patch 1.16-02.
>>>>
>>>> Rarely, but often enough to be a problem, the kernel freezes at boot.
>>>> Thanks to a JTAG debugger I'm able to observe the kernel in the
>>>> following situation:
>>>> I'm in an infinite loop with the following stack trace:
>>>> __ipipe_set_irqpending
>>>> xnintr_host_tick (__ipipe_propagate_irq)
>>>> xnintr_clock_handler
>>>> __ipipe_sync_stage    <- (1)
>>>> ipipe_suspend_domain
>>>> __ipipe_walk_pipeline
>>>> __ipipe_restore_pipeline_head
>>>> xnarch_next_tick_shot
>>>> clockevents_program_event
>>>> tick_dev_program_event
>>>> hrtimer_interrupt
>>>> mxc_interrupt
>>>> handle_IRQ_event
>>>> handle_level_irq
>>>> asm_do_IRQ
>>>> __ipipe_sync_stage <- (2)
>>>> ipipe_suspend_domain
>>>> __ipipe_walk_pipeline
>>>> __ipipe_restore_pipeline_head
>>>> xnpod_enable_timesource
>>>> xnpod_init
>>>> __native_skin_init
>>>> ...
>>>> ...
>>>>
>>>> Specifically, it happens that the first call to __ipipe_sync_stage, 
>>>> the
>>>> one marked with the number (2), is working on a stage that I can not
>>>> determine,
>>>> let's say for convenience stage S1, I think is the Linux secondary
>>>> domain but I'm not sure,
>>>> so the function invokes the interrupt handler of the system timer.
>>>> Continuing in the stack trace, I have a nested call to
>>>> __ipipe_sync_stage, indicated with (1),
>>>> but this call works on another stage, for convenience domain S2,
>>>> in turn this function invokes a handler for the timer irq, which at a
>>>> certain point invokes the __ipipe_propagate_irq which raises the flags
>>>> for the stage S1,
>>>> thus making the first call to __ipipe_sync_stage (2) fails to get 
>>>> out of
>>>> their while loops.
>>>>
>>>> I should add that I do not see hardware interrupt for the timer in
>>>> function __ipipe_grab_IRQ.
>>>> I have no idea how the cycle is triggered,but when the kernel is 
>>>> locked,
>>>> the kernel is located in the software exclusively infinite loop
>>>> described above.
>>>>
>>>>
>>>> In the hope that you could help me understand what is going on,
>>>> I would have liked groped a patch like this:
>>>> - Store, for each level of nesting of __ipipe_sync_stage, the irq 
>>>> number
>>>> currently running and on behalf of which stage.
>>>> - Patch the function __ipipe_set_irqpending in such a way as not to 
>>>> set
>>>> the flags for the pair (irq, stage) if the pair is already present at
>>>> some level in the current stack trace, that is,
>>>> - if the function __ipipe_sync_stage is executing the handler for a
>>>> stage, and then he had reset the flags in irqpend_himask and
>>>> irqpend_lomask, it does not expect the handler goes to raise again the
>>>> same flag for the same stage.
>>>>
>>>> What do you think about this?
>>>>
>>>> Thank you very much for any kind of advice you could give me
>>>>
>>>
>>> You mentioned random lockups during boot. Does you board ever lock up
>>> when passing xeno_hal.disable=1 on the kernel command line?
>>>
>> Yes, I mentioned random lockups, but always the kernel enters in the
>> infinite loop described above.
>> Following your suggestion I tried to pass parameter xeno_hal.disable=1
>> but kernel sayed
>> "Unknown boot option `xeno_hal.disable=1': ignoring"
>>
>
> This is because you are running an outdated Xenomai 2.5.x release. A 
> work around is to build all the Xenomai skins as modules in the kernel 
> (native, posix, vxworks etc), refraining from modloading them during 
> the boot process.
>
>> What is supposed to do this option anyway? If it would disable HAL, does
>> not this inhibits xenomai realtime services?
>>
>
> This is exactly what we want. When the real-time services commence, 
> control of the hardware timer is handed over to Xenomai, which enables 
> pipelining of the clock source events to the co-kernel. We need to 
> know in this path is involved.
>
>> What about the patch,described above, that I would apply? say, don't
>> permit that the interrupt handlers called in __ipipe_sync_stage raise a
>> couple (stage, irq) already handled in the current stack?
>>
>
> This won't work, this breaks an aspect of the pipeline core logic. 
> This would be papering over the issue, not fixing it, opening a can of 
> worms down the road. We are not chasing a bug in the core logic at 
> this point, we are more likely chasing a bug in the SoC-specific code 
> which binds the hw timer to the pipeline.
>
> First step is to determine if the system experiences an IRQ storm of 
> some sort from the timer chip, and why so. By focusing on the IRQ 
> replay loop which basically resyncs the current interrupt state with 
> the past events logged, you may be looking at rays from an ancient sun.
>
>> Thank you
>> Marco Tessore
>>
>>
>
>
Still investigating the problem, I re-applied the ipipe patch on a
clean kernel and compared the result with the problematic one,
matching the same versions of kernel, ipipe patch and Xenomai.

I noticed a difference between the defective kernel and the one just
obtained:
the defective kernel has the following block of code at the end of the
file arch/arm/mach-mx25/devices.c

The block:
#ifdef CONFIG_IPIPE
static int post_cpu_init(void)
{
     ipipe_mach_allow_hwtimer_uaccess(MX25_AIPS1_BASE_ADDR_VIRT,
                                      MX25_AIPS2_BASE_ADDR_VIRT);
     return 0;
}

postcore_initcall(post_cpu_init);
#endif /* CONFIG_IPIPE */


My question is: what is the function ipipe_mach_allow_hwtimer_uaccess
supposed to do?


To follow up on the previous email:
- I analyzed the interrupts: the timer ones look fairly regular.
- Occasionally there are bursts of NAND memory interrupts; I think
that is normal.
- I am still experiencing occasional kernel lockups, but there is no
interrupt flood, either from the timer or from the NAND memory,
    for at least one second before the deadlock (observed with an
oscilloscope).


Now I'm trying a kernel where I commented out the code block above.
I hope no more lockups occur, but I'd like to know what the function
ipipe_mach_allow_hwtimer_uaccess does:
the block was added by the person who produced the kernel I'm
debugging,
it is not present in the ipipe patch shipped with the Xenomai
distribution,
and I do not know why it was inserted.

Thank you very much
kind regards
Marco Tessore



* Re: [Xenomai] Kernel freezes in __ipipe_sync_stage
  2014-07-02 16:36           ` Marco Tessore
@ 2014-07-02 17:41             ` Gilles Chanteperdrix
  0 siblings, 0 replies; 11+ messages in thread
From: Gilles Chanteperdrix @ 2014-07-02 17:41 UTC (permalink / raw)
  To: Marco Tessore, Philippe Gerum, xenomai

On 07/02/2014 06:36 PM, Marco Tessore wrote:
> ipipe_mach_allow_hwtimer_uaccess(MX25_AIPS1_BASE_ADDR_VIRT,MX25_AIPS2_BASE_ADDR_VIRT);
>      return 0;
> }
> 
> postcore_initcall(post_cpu_init);
> #endif /* CONFIG_IPIPE */
> 
> 
> the question that I kindly ask is: what should do the function 
> ipipe_mach_allow_hwtimer_uaccess?

As the name suggests, this function allows access to the hardware timer
registers from user-space. Xenomai user-space libraries require this,
unless you compile xenomai with --disable-arm-tsc, but this will
increase (greatly) the latency of services such as rt_timer_tsc or
clock_gettime(CLOCK_MONOTONIC) by requiring a system call.

And it probably has nothing to do with your problem.
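
To illustrate why direct register access avoids the system call, here is a minimal sketch, not the Xenomai implementation. It assumes a 32-bit free-running up-counter that user space can read with a plain load; struct tsc_state and tsc_read are hypothetical names, and in a real setup the counter pointer would come from mapping the timer registers rather than from an ordinary variable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state for extending a 32-bit free-running counter to 64 bits. */
struct tsc_state {
    volatile const uint32_t *counter; /* stands in for the mapped timer register */
    uint32_t last_low;                /* last raw counter value observed */
    uint64_t high;                    /* accumulated wraps, already shifted by 32 */
};

/*
 * Read the counter with a plain load (no system call) and extend it to
 * 64 bits. Assumption: this is called at least once per wrap period of
 * the 32-bit counter, otherwise a wrap goes unnoticed.
 */
static uint64_t tsc_read(struct tsc_state *s)
{
    uint32_t now;

    assert(s != NULL && s->counter != NULL);
    now = *s->counter;
    if (now < s->last_low)            /* the counter wrapped since the last read */
        s->high += (uint64_t)1 << 32;
    s->last_low = now;
    return s->high | now;
}
```

Without such a mapping, every timestamp read has to enter the kernel, which is the latency cost mentioned above.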

-- 
                                                                Gilles.


* Re: [Xenomai] Kernel freezes in __ipipe_sync_stage
  2014-06-25  8:39         ` Philippe Gerum
  2014-07-02 16:36           ` Marco Tessore
@ 2014-07-09 16:09           ` Marco Tessore
  1 sibling, 0 replies; 11+ messages in thread
From: Marco Tessore @ 2014-07-09 16:09 UTC (permalink / raw)
  To: Philippe Gerum, Gilles Chanteperdrix, xenomai

Good morning,
I am still investigating the deadlock that has kept me busy for 
quite some time.

The situation is as follows:
     the root domain is in its call to __ipipe_sync_stage, invoked 
indirectly by
         xnpod_enable_timesource { xnlock_put_irq_restore(lock, x = 0): 
the lock is ignored, and this generates calls
         to __ipipe_restore_pipeline_head, __ipipe_walk_pipeline and 
ipipe_suspend_domain
         }
     So here we are: in __ipipe_sync_stage for the Linux domain.
Inside it, the timer interrupt service routine is executing,
     which in my case is that of the Freescale i.MX25 timer:
         mxc_timer_interrupt in arch/arm/plat-mxc/time.c.

As a note: this file (time.c) has been corrected, since it previously 
did not take into account that the timer block on the i.MX25 is the same as 
the one on the mx3 and mx5.

Following the chain from __ipipe_sync_stage, there are calls to 
xnarch_next_htick_shot and xntimer_start_aperiodic;
finally __ipipe_set_irq_pending is invoked for the Xenomai domain.

Subsequently, the procedure __xnarch_next_htick_shot invokes 
ipipe_restore_pipeline_head.

Then we have this call:

void __ipipe_restore_pipeline_head(unsigned long x)
{
     struct ipipe_percpu_domain_data *p = ipipe_head_cpudom_ptr();

     local_irq_disable_hw();

     if (x) {
#ifdef CONFIG_DEBUG_KERNEL
         static int warned;
         if (!warned && test_and_set_bit(IPIPE_STALL_FLAG, &p->status)) {
             /*
              * Already stalled albeit ipipe_restore_pipeline_head()
              * should have detected it? Send a warning once.
              */
             warned = 1;
             printk(KERN_WARNING
                    "I-pipe: ipipe_restore_pipeline_head() optimization failed.\n");
             dump_stack();
         }
#else /* !CONFIG_DEBUG_KERNEL */
         set_bit(IPIPE_STALL_FLAG, &p->status);
#endif /* CONFIG_DEBUG_KERNEL */
     }
     else {
         __clear_bit(IPIPE_STALL_FLAG, &p->status);
         if (unlikely(p->irqpend_himask != 0)) {
             struct ipipe_domain *head_domain = __ipipe_pipeline_head();
             if (likely(head_domain == __ipipe_current_domain))
                 __ipipe_sync_pipeline(IPIPE_IRQMASK_ANY);
             else
                 __ipipe_walk_pipeline(&head_domain->p_link); /* <-- THIS CALL */
         }
         local_irq_enable_hw();
     }
}

(As we saw before, irqpend_himask for the Xenomai domain was set for the 
timer interrupt.)

Here __ipipe_walk_pipeline is called, and from it 
__ipipe_sync_stage for the Xenomai domain.

This leads to the calls xnintr_clock_handler, 
xntimer_tick_aperiodic, xntimer_next_local_shot, xnintr_host_tick and 
xnarch_relay_tick;
these call __ipipe_set_irq_pending for the timer interrupt on the Linux 
domain.

Since we are already, deeper in the call stack, in 
__ipipe_sync_stage for the Linux domain, at that level
__ipipe_sync_stage clears the timer flag in the interrupt log and
handles the timer interrupt; the chain described above in turn sets 
the flag in the interrupt log for the Xenomai domain,
whose handler sets the flag in the interrupt log for the Linux domain again.
On the next iteration this repeats, indefinitely, causing the kernel 
to stall.
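
The ping-pong described above can be reproduced in a toy model. This is a sketch under stated assumptions, not I-pipe code: every name in it (struct sim, sync_stage, host_tick_always_due, the budget cap) is hypothetical, each domain has a single pending bit for the timer IRQ, and the flag host_tick_always_due stands for the unverified hypothesis that the Xenomai handler relays a tick to Linux on every pass.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { HEAD = 0, ROOT = 1, NDOMAINS = 2 }; /* HEAD: Xenomai, ROOT: Linux */
#define TIMER_BIT 1u

struct sim {
    uint32_t pending[NDOMAINS]; /* per-domain interrupt log */
    unsigned handled[NDOMAINS]; /* handler invocations per domain */
    bool host_tick_always_due;  /* hypothesis: head relays a tick every time */
    unsigned budget;            /* cap on handler runs so the model halts */
};

static void sync_stage(struct sim *s, int dom);

static void post_timer_irq(struct sim *s, int dom)
{
    s->pending[dom] |= TIMER_BIT;
}

/* Linux tick handler: programs the next shot, which logs the timer IRQ
 * for the head domain and lets the head domain run at once. */
static void root_handler(struct sim *s)
{
    post_timer_irq(s, HEAD);
    sync_stage(s, HEAD);
}

/* Head handler: relays the tick to Linux only when the host tick is due. */
static void head_handler(struct sim *s)
{
    if (s->host_tick_always_due)
        post_timer_irq(s, ROOT);
}

/* Replay loop: clear the logged IRQ, run its handler, look again. */
static void sync_stage(struct sim *s, int dom)
{
    while (s->pending[dom] && s->budget > 0) {
        s->pending[dom] &= ~TIMER_BIT;
        s->budget--;
        s->handled[dom]++;
        if (dom == ROOT)
            root_handler(s);
        else
            head_handler(s);
    }
}

/* Returns true if both interrupt logs drain before the budget runs out. */
static bool replay_terminates(bool host_tick_always_due, unsigned budget)
{
    struct sim s = { {0, 0}, {0, 0}, host_tick_always_due, budget };

    assert(budget > 0);
    post_timer_irq(&s, ROOT);
    sync_stage(&s, ROOT);
    return s.pending[ROOT] == 0 && s.pending[HEAD] == 0;
}
```

In this model the replay loop for the Linux domain drains only when the head handler stops re-posting the tick; if it re-posts on every pass, the loop never leaves __ipipe_sync_stage, which matches the stall observed with the debugger.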

Can you help me understand this better? In particular, how is it 
possible that the Linux domain triggers the Xenomai domain, which in turn 
triggers the Linux domain?

As I said in previous mails, this is not a frequent bug: it happens 
randomly when I boot the machine,
but it still limits the use for which the device was developed.
I can capture the state with a hardware debugger when the deadlock happens,
but I cannot find out what happened before.
I am sure that there are no anomalies in the timer interrupt:
by driving a pin in the function __ipipe_grab_irq, I can see that the timer 
interrupt is quite regular.

Thank you in advance for any help.
Kind regards
Marco Tessore

With reference to your previous email:

On 25/06/2014 10:39, Philippe Gerum wrote:
> On 06/25/2014 09:50 AM, Marco Tessore wrote:
>> On 24/06/2014 19:10, Philippe Gerum wrote:
>>> On 06/24/2014 06:41 PM, Marco Tessore wrote:
>>>> Hi,
>>>>
>>>> On 20/06/2014 13:52, Gilles Chanteperdrix wrote:
>>>>> On 06/20/2014 11:11 AM, Marco Tessore wrote:
>>>>>> The kernel is version 2.6.31 for ARM architecture - specifically a
>>>>> Do you have the same problem with a recent I-pipe patches, like 
>>>>> one for
>>>>> 3.8 or 3.10 kernel?
>>>>>
>>>>
>>>> I managed to do some tests on a 3.10 kernel, but on another board with an
>>>> imx28 CPU; it turns out that that kernel freezes too,
>>>> but I haven't debugged it with the JTAG debugger.
>>>>

> This is because you are running an outdated Xenomai 2.5.x release. A 
> work around is to build all the Xenomai skins as modules in the kernel 
> (native, posix, vxworks etc), refraining from modloading them during 
> the boot process.

I tried this and the event did not occur.
Instead, after hundreds of reboots, the kernel froze 
in idle_task and the init process stalled. I do not know where; this may 
or may not be related to the problem described above.
>
> First step is to determine if the system experiences an IRQ storm of 
> some sort from the timer chip, and why so. By focusing on the IRQ 
> replay loop which basically resyncs the current interrupt state with 
> the past events logged, you may be looking at rays from an ancient sun.
>

This can be excluded: I have not seen any interrupt storm, and the timer 
interrupt is quite regular.
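
For anyone repeating this check from logged timestamps instead of an oscilloscope, a minimal sketch follows. It assumes IRQ entry times have been captured somewhere (for example at the point where the pin is toggled) as an ascending array in microseconds; max_irqs_in_window is a hypothetical helper, and the threshold that counts as a storm is left to the reader.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return the largest number of IRQ timestamps that fall within any
 * half-open interval of length `window`. `ts` must be sorted in
 * ascending order and `window` must be non-zero.
 */
static size_t max_irqs_in_window(const uint64_t *ts, size_t n, uint64_t window)
{
    size_t best = 0, lo = 0, hi;

    assert(window > 0);
    for (hi = 0; hi < n; hi++) {
        /* Drop timestamps that are a full window or more older than ts[hi]. */
        while (ts[hi] - ts[lo] >= window)
            lo++;
        if (hi - lo + 1 > best)
            best = hi - lo + 1;
    }
    return best;
}
```

With a regular 1000 us tick, a 1500 us window never holds more than two events; a markedly higher count in a short window would point to a burst.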


end of thread, other threads:[~2014-07-09 16:09 UTC | newest]

Thread overview: 11+ messages
2014-06-20  9:11 [Xenomai] Kernel freezes in __ipipe_sync_stage Marco Tessore
2014-06-20 11:52 ` Gilles Chanteperdrix
2014-06-20 12:18   ` Marco Tessore
2014-06-20 12:25     ` Gilles Chanteperdrix
2014-06-24 16:41   ` Marco Tessore
2014-06-24 17:10     ` Philippe Gerum
2014-06-25  7:50       ` Marco Tessore
2014-06-25  8:39         ` Philippe Gerum
2014-07-02 16:36           ` Marco Tessore
2014-07-02 17:41             ` Gilles Chanteperdrix
2014-07-09 16:09           ` Marco Tessore
