From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <53B434F4.6060203@axelsw.it>
Date: Wed, 02 Jul 2014 18:36:04 +0200
From: Marco Tessore <marco.tessore@axelsw.it>
MIME-Version: 1.0
References: <53A3FAB0.4050100@axelsw.it> <53A4207F.9040801@xenomai.org>
 <53A9AA38.3090005@axelsw.it> <53A9B0FB.1070809@xenomai.org>
 <53AA7F39.90706@axelsw.it> <53AA8AC3.7020100@xenomai.org>
In-Reply-To: <53AA8AC3.7020100@xenomai.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [Xenomai] Kernel freezes in __ipipe_sync_stage
List-Id: Discussions about the Xenomai project <xenomai.xenomai.org>
List-Unsubscribe: <http://www.xenomai.org/mailman/options/xenomai>,
 <mailto:xenomai-request@xenomai.org?subject=unsubscribe>
List-Archive: <http://www.xenomai.org/pipermail/xenomai/>
List-Post: <mailto:xenomai@xenomai.org>
List-Help: <mailto:xenomai-request@xenomai.org?subject=help>
List-Subscribe: <http://www.xenomai.org/mailman/listinfo/xenomai>,
 <mailto:xenomai-request@xenomai.org?subject=subscribe>
To: Philippe Gerum <rpm@xenomai.org>, Gilles Chanteperdrix <gilles.chanteperdrix@xenomai.org>, xenomai@xenomai.org

Good morning,

On 25/06/2014 10:39, Philippe Gerum wrote:
> On 06/25/2014 09:50 AM, Marco Tessore wrote:
>> On 24/06/2014 19:10, Philippe Gerum wrote:
>>> On 06/24/2014 06:41 PM, Marco Tessore wrote:
>>>> Hi,
>>>>
>>>> On 20/06/2014 13:52, Gilles Chanteperdrix wrote:
>>>>> On 06/20/2014 11:11 AM, Marco Tessore wrote:
>>>>>> The kernel is version 2.6.31 for ARM architecture - specifically a
>>>>> Do you have the same problem with a recent I-pipe patch, like one
>>>>> for the 3.8 or 3.10 kernel?
>>>>>
>>>>
>>>> I managed to run some tests on a 3.10 kernel, but on another board with
>>>> an imx28 CPU. That kernel freezes too,
>>>> but I haven't debugged it with the JTAG debugger.
>>>>
>>>> I have, instead, some information on the original problem, that is the
>>>> one that worried me more:
>>>>
>>>> In summary:
>>>> I have a board based on imx25, with kernel 2.6.31, Xenomai 2.5.6 and
>>>> ipipe patch 1.16-02.
>>>>
>>>> Rarely, but often enough to be a problem, the kernel freezes at boot.
>>>> Thanks to a JTAG debugger I'm able to observe the kernel in the
>>>> following situation:
>>>> I'm in an infinite loop with the following stack trace:
>>>> __ipipe_set_irqpending
>>>> xnintr_host_tick (__ipipe_propagate_irq)
>>>> xnintr_clock_handler
>>>> __ipipe_sync_stage    <- (1)
>>>> ipipe_suspend_domain
>>>> __ipipe_walk_pipeline
>>>> __ipipe_restore_pipeline_head
>>>> xnarch_next_tick_shot
>>>> clockevents_program_event
>>>> tick_dev_program_event
>>>> hrtimer_interrupt
>>>> mxc_interrupt
>>>> handle_IRQ_event
>>>> handle_level_irq
>>>> asm_do_IRQ
>>>> __ipipe_sync_stage <- (2)
>>>> ipipe_suspend_domain
>>>> __ipipe_walk_pipeline
>>>> __ipipe_restore_pipeline_head
>>>> xnpod_enable_timesource
>>>> xnpod_init
>>>> __native_skin_init
>>>> ...
>>>> ...
>>>>
>>>> Specifically, the first call to __ipipe_sync_stage, the one marked
>>>> (2), is working on a stage that I cannot determine. For convenience
>>>> let's call it stage S1; I think it is the Linux secondary domain, but
>>>> I'm not sure. That call invokes the interrupt handler of the system
>>>> timer.
>>>> Further up the stack trace there is a nested call to
>>>> __ipipe_sync_stage, marked (1), which works on another stage, call it
>>>> S2. It in turn invokes a handler for the timer IRQ, which at some
>>>> point calls __ipipe_propagate_irq, raising the pending flags for
>>>> stage S1 again.
>>>> As a result, the first call to __ipipe_sync_stage (2) never gets out
>>>> of its while loops.
>>>>
>>>> I should add that I do not see a hardware interrupt for the timer in
>>>> the function __ipipe_grab_IRQ.
>>>> I have no idea how the cycle is triggered, but once the kernel is
>>>> locked up it is stuck exclusively in the software infinite loop
>>>> described above.
>>>>
>>>>
>>>> In the hope that you can help me understand what is going on,
>>>> I would like to try a patch along these lines:
>>>> - Store, for each nesting level of __ipipe_sync_stage, the IRQ number
>>>> currently being handled and the stage on whose behalf it runs.
>>>> - Patch the function __ipipe_set_irqpending so that it does not set
>>>> the flags for a pair (irq, stage) if that pair is already present at
>>>> some level in the current stack trace. That is,
>>>> - if __ipipe_sync_stage is executing the handler for a
>>>> stage, and has therefore reset the flags in irqpend_himask and
>>>> irqpend_lomask, it does not expect the handler to raise the
>>>> same flag for the same stage again.
>>>>
>>>> What do you think about this?
>>>>
>>>> Thank you very much for any kind of advice you could give me
>>>>
>>>
>>> You mentioned random lockups during boot. Does your board ever lock up
>>> when passing xeno_hal.disable=1 on the kernel command line?
>>>
>> Yes, I mentioned random lockups, but the kernel always ends up in the
>> infinite loop described above.
>> Following your suggestion I tried passing the parameter
>> xeno_hal.disable=1, but the kernel said
>> "Unknown boot option `xeno_hal.disable=1': ignoring"
>>
>
> This is because you are running an outdated Xenomai 2.5.x release. A 
> work around is to build all the Xenomai skins as modules in the kernel 
> (native, posix, vxworks etc), refraining from modloading them during 
> the boot process.
>
>> What is this option supposed to do anyway? If it disables the HAL,
>> doesn't that inhibit the Xenomai real-time services?
>>
>
> This is exactly what we want. When the real-time services commence, 
> control of the hardware timer is handed over to Xenomai, which enables 
> pipelining of the clock source events to the co-kernel. We need to 
> know if this path is involved.
>
>> What about the patch described above that I would like to apply? That
>> is, not allowing the interrupt handlers called from __ipipe_sync_stage
>> to raise a (stage, irq) pair already being handled in the current stack?
>>
>
> This won't work, this breaks an aspect of the pipeline core logic. 
> This would be papering over the issue, not fixing it, opening a can of 
> worms down the road. We are not chasing a bug in the core logic at 
> this point, we are more likely chasing a bug in the SoC-specific code 
> which binds the hw timer to the pipeline.
>
> First step is to determine if the system experiences an IRQ storm of 
> some sort from the timer chip, and why so. By focusing on the IRQ 
> replay loop which basically resyncs the current interrupt state with 
> the past events logged, you may be looking at rays from an ancient sun.
>
>> Thank you
>> Marco Tessore
>>
>>
>
>
I am still investigating the problem. I re-applied the I-pipe patch to
a clean kernel and compared the result with the problematic one,
matching the versions of the kernel, the I-pipe patch and Xenomai.

I noticed one difference between the defective tree and the freshly patched one:
the defective kernel has the following block of code at the end of the file
/arch/arm/mach-mx25/devices.c

The block:
#ifdef CONFIG_IPIPE
static int post_cpu_init(void)
{
    ipipe_mach_allow_hwtimer_uaccess(MX25_AIPS1_BASE_ADDR_VIRT,
                                     MX25_AIPS2_BASE_ADDR_VIRT);
    return 0;
}

postcore_initcall(post_cpu_init);
#endif /* CONFIG_IPIPE */


My question is: what is the function ipipe_mach_allow_hwtimer_uaccess
supposed to do?


To follow up on the previous email:
- I analyzed the interrupts: the timer ones look fairly regular to me.
- Occasionally there are bursts of NAND memory interrupts, which I think
is normal.
- I am still experiencing occasional kernel lockups, but there is no
interrupt flood, neither from the timer nor from the NAND memory,
    for at least one second before the lockup (checked with an oscilloscope).


I am now testing a kernel with the code block above commented out.
I hope the lockups no longer occur, but I would still like to know what
ipipe_mach_allow_hwtimer_uaccess does.
The block was added by the person who produced the kernel I am
debugging;
it is not present in the I-pipe patch shipped with the Xenomai
distribution,
and I do not know why it was inserted.

Thank you very much
kind regards
Marco Tessore