[Xenomai-help] Kernel panic: not syncing

* [Xenomai-help] Kernel panic: not syncing
@ 2008-07-07 15:45 Petr Cervenka
  2008-07-07 15:59 ` Philippe Gerum
  0 siblings, 1 reply; 24+ messages in thread
From: Petr Cervenka @ 2008-07-07 15:45 UTC (permalink / raw)
  To: xenomai

Hello,
I'm not sure whether this is off topic.
We use Linux 2.6.24 and Xenomai 2.4.1. Occasionally (once every few days) we get a kernel panic, and I don't know if it's our fault or a problem with the kernel/xenomai/adeos/configuration/hw/...
If you have any questions, I'll try to answer them. Any help is welcome.
Petr Cervenka

I will try to reproduce the log (visible part of it). I can mail you snapshots for details, if needed:
---------------------------------------------------------------
<IRQ> profile_tick+0x5e/0xa0
tick_sched_timer+0x85/0x170
hrtimer_interrupt+0x12f/0x1e0
smp_apic_timer_interrupt+0x37/0x60
__ipipe_sync_stage+0x350/0x355
smp_apic_timer_interrupt+0x0/0x60
__xirq_end+0x0/0x85
smp_apic_timer_interrupt+0x0/0x60
__ipipe_handle_irq+0x91/0x250
default_idle+0x0/0x40
common_interrupt+0x61/0x7d
<EOI> default_idle+0x29/0x40
cpu_idle+0x8b/0x120
start_kernel+0x2ba/0x350
_sinittext+0x120/0x130

Code: 48 8b 11 48 89 d0 48 c1 e8 16 48 85 c0 75 1b 48 8b 51 08 48

RIP profile_pc+0x46/0x80
RSP <ffffffff80664da0>
CR2: 0000000040090fb8
---[ end trace ccd2184e479f15c8 ]---
Kernel panic - not syncing: Aiee, killing interrupt handler!

Another one is similar, with the following differences:
----------------------------------------------------------------------
<IRQ> scheduler_tick+0xf8/0x140
... /* same part as before */
smp_apic_timer_interrupt+0x0/0x60
ipipe_suspend_domain+0xb2/0xf0
__ipipe_walk_pipeline+0xee/0x150
__ipipe_handle_irq+0x81/0x250
common_interrupt+0x61/0x7d
<EOI>

Code: 0f 0b eb fe 66 66 66 2e 0f 1f 84 00 00 00 00 00 41 57 41 56

RIP run_posix_cpu_timers+0x810/0x280
...




* Re: [Xenomai-help] Kernel panic: not syncing
  2008-07-07 15:45 [Xenomai-help] Kernel panic: not syncing Petr Cervenka
@ 2008-07-07 15:59 ` Philippe Gerum
  2008-07-08  8:31   ` Petr Cervenka
  2008-07-08  8:38   ` Jan Kiszka
  0 siblings, 2 replies; 24+ messages in thread
From: Philippe Gerum @ 2008-07-07 15:59 UTC (permalink / raw)
  To: Petr Cervenka; +Cc: xenomai

Petr Cervenka wrote:
> Hello,
> I'm not sure if I'm not off topic.
> We use Linux 2.6.24 and Xenomai 2.4.1. Occasionally (once in few days) we get an kernel panic and I don't know If it's our fault or a problem of kernel/xenomai/adeos/configuration/hw/...
> If you have any questions, i'll try to answer them. Any help is welcome.

It is an I-pipe issue, probably. We have to somewhat forge the register frame
passed to the Linux tick handler, since we may delay that call. Some register
values the profiling code attempts to dereference to find the preempted code may
be wrong in our case.

Could you 1) send back a disassembly of the profile_tick routine in your kernel
image, and 2) apply the following patch to check whether it improves the
situation? TIA,

--- 2.6.24-x86-2.0-03/arch/x86/kernel/ipipe.c~	2008-02-11 10:48:24.000000000 +0100
+++ 2.6.24-x86-2.0-03/arch/x86/kernel/ipipe.c	2008-07-07 17:55:36.000000000 +0200
@@ -933,12 +933,7 @@
 		tick_regs->eip = regs.eip;
 		tick_regs->ebp = regs.ebp;
 #else /* !CONFIG_X86_32 */
-		tick_regs->ss = regs->ss;
-		tick_regs->rsp = regs->rsp;
-		tick_regs->eflags = regs->eflags;
-		tick_regs->cs = regs->cs;
-		tick_regs->rip = regs->rip;
-		tick_regs->rbp = regs->rbp;
+		*tick_regs = *regs;
 #endif /* !CONFIG_X86_32 */
 	}

-- 
Philippe.



* Re: [Xenomai-help] Kernel panic: not syncing
  2008-07-07 15:59 ` Philippe Gerum
@ 2008-07-08  8:31   ` Petr Cervenka
  2008-07-08  8:38   ` Jan Kiszka
  1 sibling, 0 replies; 24+ messages in thread
From: Petr Cervenka @ 2008-07-08  8:31 UTC (permalink / raw)
  To: rpm; +Cc: xenomai

>> Hello,
>> I'm not sure if I'm not off topic.
>> We use Linux 2.6.24 and Xenomai 2.4.1. Occasionally (once in few days) we get an kernel panic and I don't know If it's our fault or a problem of kernel/xenomai/adeos/configuration/hw/...
>> If you have any questions, i'll try to answer them. Any help is welcome.
>
>It is an I-pipe issue, probably. We have to somewhat forge the register frame
>passed to the Linux tick handler, since we may delay that call. Some register
>values the profiling code attempts to dereference to find the preempted code may
>be wrong in our case.
>
>Could you 1) send back a disassembly of the profile_tick routine in your kernel
>image, then apply the following patch to check whether it improves the situation
>as well? TIA,
>

ad 1) I would like to, but I don't know how to do it. If you have a simple guide or a link, I would be grateful.
ad 2) I will apply the patch, but it will take days (or a week) before I can tell whether it helped.

Petr




* Re: [Xenomai-help] Kernel panic: not syncing
  2008-07-07 15:59 ` Philippe Gerum
  2008-07-08  8:31   ` Petr Cervenka
@ 2008-07-08  8:38   ` Jan Kiszka
  2008-07-08  9:21     ` Gilles Chanteperdrix
  1 sibling, 1 reply; 24+ messages in thread
From: Jan Kiszka @ 2008-07-08  8:38 UTC (permalink / raw)
  To: rpm; +Cc: Petr Cervenka, xenomai

Philippe Gerum wrote:
> Petr Cervenka wrote:
>> Hello,
>> I'm not sure if I'm not off topic.
>> We use Linux 2.6.24 and Xenomai 2.4.1. Occasionally (once in few days) we get an kernel panic and I don't know If it's our fault or a problem of kernel/xenomai/adeos/configuration/hw/...
>> If you have any questions, i'll try to answer them. Any help is welcome.
> 
> It is an I-pipe issue, probably. We have to somewhat forge the register frame
> passed to the Linux tick handler, since we may delay that call. Some register
> values the profiling code attempts to dereference to find the preempted code may
> be wrong in our case.
> 
> Could you 1) send back a disassembly of the profile_tick routine in your kernel
> image, then apply the following patch to check whether it improves the situation
> as well? TIA,
> 
> --- 2.6.24-x86-2.0-03/arch/x86/kernel/ipipe.c~	2008-02-11 10:48:24.000000000 +0100
> +++ 2.6.24-x86-2.0-03/arch/x86/kernel/ipipe.c	2008-07-07 17:55:36.000000000 +0200
> @@ -933,12 +933,7 @@
>  		tick_regs->eip = regs.eip;
>  		tick_regs->ebp = regs.ebp;
>  #else /* !CONFIG_X86_32 */
> -		tick_regs->ss = regs->ss;
> -		tick_regs->rsp = regs->rsp;
> -		tick_regs->eflags = regs->eflags;
> -		tick_regs->cs = regs->cs;
> -		tick_regs->rip = regs->rip;
> -		tick_regs->rbp = regs->rbp;
> +		*tick_regs = *regs;
>  #endif /* !CONFIG_X86_32 */

I'm fairly sure that this won't make a difference. According to Petr's
first dump we crash in profile_pc, and there the kernel pokes around on
the stack of the interrupted context (Petr, you are running SMP,
right?). The question is if this stack may have vanished or may have
been swapped out after capturing the registers. Or the test
"!user_mode(regs) && in_lock_functions(pc)" returns an invalid result
(Petr, do you run Xenomai kernel tasks?).

I do not yet see the scenario behind it, but a workaround for a
vanishing stack could be to cache sp[0] and sp[1] (as accessed in
profile_pc) and let a faked regs->rsp point to that cache. Nevertheless,
understanding the actual reason should remain a goal at the same time
(to avoid papering over an even more serious issue).
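
For readers following along, here is a minimal Python model of the check described above. It is a sketch reconstructed from this description and from the profile_pc disassembly posted later in the thread, not the kernel source: the dict-based regs and memory, the Fault exception and the in_lock_functions callback are all assumptions made for illustration. The 22-bit shift mirrors the "shr $0x16" in the disassembly, and the sample stack pointer is the CR2 value from the first report.

```python
class Fault(Exception):
    """Stands in for a kernel paging fault in this toy model."""


def profile_pc(regs, memory, in_lock_functions):
    """Model of the logic visible in the posted disassembly (hypothetical API).

    regs: dict with "cs", "rip" and "rsp"; memory: dict mapping an address
    to a 64-bit word; in_lock_functions: callable taking pc, returning bool.
    """
    pc = regs["rip"]
    user_mode = (regs["cs"] & 3) != 0      # testb $0x3 on the saved cs
    if not user_mode and in_lock_functions(pc):
        sp = regs["rsp"]
        for addr in (sp, sp + 8):          # sp[0], then sp[1]
            if addr not in memory:
                raise Fault(hex(addr))     # unmapped stack slot
            word = memory[addr]
            if word >> 22:                 # any bit at position 22 or above set
                return word
    return pc


# Kernel-mode frame whose rsp points at unmapped memory: the read faults.
forged = {"cs": 0x10, "rip": 0x1000, "rsp": 0x40090FB8}
try:
    profile_pc(forged, {}, lambda pc: True)
except Fault as fault:
    print("fault at", fault)
```

The model only faults when the frame claims kernel mode, the pc is reported as being inside a lock function, and the saved rsp is not readable, which is the combination under discussion here.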

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux



* Re: [Xenomai-help] Kernel panic: not syncing
  2008-07-08  8:38   ` Jan Kiszka
@ 2008-07-08  9:21     ` Gilles Chanteperdrix
  2008-07-08  9:33       ` Jan Kiszka
  0 siblings, 1 reply; 24+ messages in thread
From: Gilles Chanteperdrix @ 2008-07-08  9:21 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Petr Cervenka, xenomai

Jan Kiszka wrote:
> Philippe Gerum wrote:
>> Petr Cervenka wrote:
>>> Hello,
>>> I'm not sure if I'm not off topic.
>>> We use Linux 2.6.24 and Xenomai 2.4.1. Occasionally (once in few days) we get an kernel panic and I don't know If it's our fault or a problem of kernel/xenomai/adeos/configuration/hw/...
>>> If you have any questions, i'll try to answer them. Any help is welcome.
>> It is an I-pipe issue, probably. We have to somewhat forge the register frame
>> passed to the Linux tick handler, since we may delay that call. Some register
>> values the profiling code attempts to dereference to find the preempted code may
>> be wrong in our case.
>>
>> Could you 1) send back a disassembly of the profile_tick routine in your kernel
>> image, then apply the following patch to check whether it improves the situation
>> as well? TIA,
>>
>> --- 2.6.24-x86-2.0-03/arch/x86/kernel/ipipe.c~	2008-02-11 10:48:24.000000000 +0100
>> +++ 2.6.24-x86-2.0-03/arch/x86/kernel/ipipe.c	2008-07-07 17:55:36.000000000 +0200
>> @@ -933,12 +933,7 @@
>>  		tick_regs->eip = regs.eip;
>>  		tick_regs->ebp = regs.ebp;
>>  #else /* !CONFIG_X86_32 */
>> -		tick_regs->ss = regs->ss;
>> -		tick_regs->rsp = regs->rsp;
>> -		tick_regs->eflags = regs->eflags;
>> -		tick_regs->cs = regs->cs;
>> -		tick_regs->rip = regs->rip;
>> -		tick_regs->rbp = regs->rbp;
>> +		*tick_regs = *regs;
>>  #endif /* !CONFIG_X86_32 */
> 
> I'm fairly sure that this won't make a difference. According to Petr's
> first dump we crash in profile_pc, and there the kernel pokes around on
> the stack of the interrupted context (Petr, you are running SMP,
> right?). The question is if this stack may have vanished or may have
> been swapped out after capturing the registers.

When Xenomai has forwarded the tick to Linux, the Linux tick handler is
executed upon resume to user-space, so if the stack were to vanish, it
would have to do so during the execution of another interrupt handler before
the tick handler. However, I believe that only do_exit can kill a task,
and I am not sure whether it can be called from an interrupt handler. As for
the stack being swapped out, it is kmalloced memory, so that is impossible.


-- 
                                                  Gilles.



* Re: [Xenomai-help] Kernel panic: not syncing
  2008-07-08  9:21     ` Gilles Chanteperdrix
@ 2008-07-08  9:33       ` Jan Kiszka
  2008-07-09 15:19         ` Petr Cervenka
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Kiszka @ 2008-07-08  9:33 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: Petr Cervenka, xenomai

Gilles Chanteperdrix wrote:
> Jan Kiszka wrote:
>> Philippe Gerum wrote:
>>> Petr Cervenka wrote:
>>>> Hello,
>>>> I'm not sure if I'm not off topic.
>>>> We use Linux 2.6.24 and Xenomai 2.4.1. Occasionally (once in few
>>>> days) we get an kernel panic and I don't know If it's our fault or a
>>>> problem of kernel/xenomai/adeos/configuration/hw/...
>>>> If you have any questions, i'll try to answer them. Any help is
>>>> welcome.
>>> It is an I-pipe issue, probably. We have to somewhat forge the
>>> register frame
>>> passed to the Linux tick handler, since we may delay that call. Some
>>> register
>>> values the profiling code attempts to dereference to find the
>>> preempted code may
>>> be wrong in our case.
>>>
>>> Could you 1) send back a disassembly of the profile_tick routine in
>>> your kernel
>>> image, then apply the following patch to check whether it improves
>>> the situation
>>> as well? TIA,
>>>
>>> --- 2.6.24-x86-2.0-03/arch/x86/kernel/ipipe.c~    2008-02-11
>>> 10:48:24.000000000 +0100
>>> +++ 2.6.24-x86-2.0-03/arch/x86/kernel/ipipe.c    2008-07-07
>>> 17:55:36.000000000 +0200
>>> @@ -933,12 +933,7 @@
>>>          tick_regs->eip = regs.eip;
>>>          tick_regs->ebp = regs.ebp;
>>>  #else /* !CONFIG_X86_32 */
>>> -        tick_regs->ss = regs->ss;
>>> -        tick_regs->rsp = regs->rsp;
>>> -        tick_regs->eflags = regs->eflags;
>>> -        tick_regs->cs = regs->cs;
>>> -        tick_regs->rip = regs->rip;
>>> -        tick_regs->rbp = regs->rbp;
>>> +        *tick_regs = *regs;
>>>  #endif /* !CONFIG_X86_32 */
>>
>> I'm fairly sure that this won't make a difference. According to Petr's
>> first dump we crash in profile_pc, and there the kernel pokes around on
>> the stack of the interrupted context (Petr, you are running SMP,
>> right?). The question is if this stack may have vanished or may have
>> been swapped out after capturing the registers.
> 
> When Xenomai has forwarded the tick to linux, Linux tick handler is
> executed upon resume to user-space, so, if the stack had to vanish, it
> would have to vanish upon execution of another interrupt handler before
> the tick handler. However, I believe that only do_exit can kill a task,
> and I am not sure if it can be called from an interrupt handler. As for
> the stack being swapped out, it is kmalloced memory, so, it is impossible.
> 

Yes, a vanishing stack is unlikely; more probable is an invalid state
right from the beginning. I guess we need a full oops to say more.

Petr, any chance to attach a serial cable to your box and catch those
oopses completely via a second box? The register state would be telling,
but also, as Philippe already requested, a disassembly of the involved
function - in case it remains profile_pc: objdump -dS
linux-.../arch/x86/kernel/time_64.o.

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux



* Re: [Xenomai-help] Kernel panic: not syncing
  2008-07-08  9:33       ` Jan Kiszka
@ 2008-07-09 15:19         ` Petr Cervenka
  2008-07-09 16:05           ` Philippe Gerum
  0 siblings, 1 reply; 24+ messages in thread
From: Petr Cervenka @ 2008-07-09 15:19 UTC (permalink / raw)
  To: jan.kiszka; +Cc: xenomai

Jan Kiszka wrote:
>Gilles Chanteperdrix wrote:
>> Jan Kiszka wrote:
>>> Philippe Gerum wrote:
>>>> Petr Cervenka wrote:
>>>>> Hello,
>>>>> I'm not sure if I'm not off topic.
>>>>> We use Linux 2.6.24 and Xenomai 2.4.1. Occasionally (once in few
>>>>> days) we get an kernel panic and I don't know If it's our fault or a
>>>>> problem of kernel/xenomai/adeos/configuration/hw/...
>>>>> If you have any questions, i'll try to answer them. Any help is
>>>>> welcome.
>>>> It is an I-pipe issue, probably. We have to somewhat forge the
>>>> register frame
>>>> passed to the Linux tick handler, since we may delay that call. Some
>>>> register
>>>> values the profiling code attempts to dereference to find the
>>>> preempted code may
>>>> be wrong in our case.
>>>>
>>>> Could you 1) send back a disassembly of the profile_tick routine in
>>>> your kernel
>>>> image, then apply the following patch to check whether it improves
>>>> the situation
>>>> as well? TIA,
>>>>
>>>> --- 2.6.24-x86-2.0-03/arch/x86/kernel/ipipe.c~    2008-02-11
>>>> 10:48:24.000000000 +0100
>>>> +++ 2.6.24-x86-2.0-03/arch/x86/kernel/ipipe.c    2008-07-07
>>>> 17:55:36.000000000 +0200
>>>> @@ -933,12 +933,7 @@
>>>>          tick_regs->eip = regs.eip;
>>>>          tick_regs->ebp = regs.ebp;
>>>>  #else /* !CONFIG_X86_32 */
>>>> -        tick_regs->ss = regs->ss;
>>>> -        tick_regs->rsp = regs->rsp;
>>>> -        tick_regs->eflags = regs->eflags;
>>>> -        tick_regs->cs = regs->cs;
>>>> -        tick_regs->rip = regs->rip;
>>>> -        tick_regs->rbp = regs->rbp;
>>>> +        *tick_regs = *regs;
>>>>  #endif /* !CONFIG_X86_32 */
>>>
>>> I'm fairly sure that this won't make a difference. According to Petr's
>>> first dump we crash in profile_pc, and there the kernel pokes around on
>>> the stack of the interrupted context (Petr, you are running SMP,
>>> right?). The question is if this stack may have vanished or may have
>>> been swapped out after capturing the registers.
>> 
>> When Xenomai has forwarded the tick to linux, Linux tick handler is
>> executed upon resume to user-space, so, if the stack had to vanish, it
>> would have to vanish upon execution of another interrupt handler before
>> the tick handler. However, I believe that only do_exit can kill a task,
>> and I am not sure if it can be called from an interrupt handler. As for
>> the stack being swapped out, it is kmalloced memory, so, it is impossible.
>> 
>
>Yes, vanishing stack is unlikely, more probable is an invalid state
>right from the beginning. I guess we need a full oops to say more.
>
>Petr, any chance to attach a serial cable to your box and catch those
>oopses completely via a second box? The register state would be telling,
>but also, as Philippe already requested, a disassembly of the involved
>function - in case it remain profile_pc: objdump -dS
>linux-.../arch/x86/kernel/time_64.o.
>

To your questions:
We have only user-space tasks, but we use an RTDM driver (with ioctl, no tasks).
The processor is an Athlon X2 with a 64-bit distribution (Kubuntu ?7.10?), x86_64 SMP PREEMPT,
kernel 2.6.24, Xenomai 2.4.1, adeos-ipipe-2.6.24-x86-2.0-03.
I could send you my kernel config file if you want.
I will try to learn how to capture oopses over a serial cable on a second box, but I don't know whether it will be possible to run such an experiment long enough to reproduce the error.

The following disassembly is from a different machine than the one that has the kernel panics. I use it for development and testing, but it should have the same HW and kernel configuration. I can't be totally sure that the disassembly is correct, but I hope so; I just can't recognise it. Could you explain what "profile_pc+0x46/0x80" means? I assume the first number is the current RIP address (relative to the routine start), so 0x256 in this case. But what does the second number mean?

0000000000000210 <profile_pc>:
 210:	48 83 ec 18          	sub    $0x18,%rsp
 214:	48 89 5c 24 08       	mov    %rbx,0x8(%rsp)
 219:	48 89 6c 24 10       	mov    %rbp,0x10(%rsp)
 21e:	48 89 fb             	mov    %rdi,%rbx
 221:	f6 87 88 00 00 00 03 	testb  $0x3,0x88(%rdi)
 228:	48 8b af 80 00 00 00 	mov    0x80(%rdi),%rbp
 22f:	74 12                	je     243 <profile_pc+0x33>
 231:	48 89 e8             	mov    %rbp,%rax
 234:	48 8b 5c 24 08       	mov    0x8(%rsp),%rbx
 239:	48 8b 6c 24 10       	mov    0x10(%rsp),%rbp
 23e:	48 83 c4 18          	add    $0x18,%rsp
 242:	c3                   	retq   
 243:	48 89 ef             	mov    %rbp,%rdi
 246:	e8 00 00 00 00       	callq  24b <profile_pc+0x3b>
 24b:	85 c0                	test   %eax,%eax
 24d:	74 e2                	je     231 <profile_pc+0x21>
 24f:	48 8b 8b 98 00 00 00 	mov    0x98(%rbx),%rcx
 256:	48 8b 11             	mov    (%rcx),%rdx
 259:	48 89 d0             	mov    %rdx,%rax
 25c:	48 c1 e8 16          	shr    $0x16,%rax
 260:	48 85 c0             	test   %rax,%rax
 263:	75 1b                	jne    280 <profile_pc+0x70>
 265:	48 8b 51 08          	mov    0x8(%rcx),%rdx
 269:	48 89 d0             	mov    %rdx,%rax
 26c:	48 c1 e8 16          	shr    $0x16,%rax
 270:	48 85 c0             	test   %rax,%rax
 273:	48 0f 45 ea          	cmovne %rdx,%rbp
 277:	eb b8                	jmp    231 <profile_pc+0x21>
 279:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)
 280:	48 89 d5             	mov    %rdx,%rbp
 283:	eb ac                	jmp    231 <profile_pc+0x21>
 285:	66 66 2e 0f 1f 84 00 	nopw   %cs:0x0(%rax,%rax,1)
 28c:	00 00 00 00  
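
One way to gain some confidence that a listing like this matches the crashing kernel is to compare the bytes on the oops "Code:" line with the listing bytes at the reported offset. The Python sketch below does that for the values quoted in this thread. It assumes the "Code:" line starts at the faulting instruction, and the function and variable names are made up for illustration.

```python
def flatten(listing, start):
    """Concatenate listing bytes from address `start` onward."""
    out = []
    addr = start
    while addr in listing:
        chunk = bytes.fromhex(listing[addr])
        out.extend(chunk)
        addr += len(chunk)
    return bytes(out)


def code_matches(oops_code, listing, start):
    """True if the oops bytes are a prefix of the listing bytes at `start`."""
    wanted = bytes.fromhex(oops_code)
    return flatten(listing, start)[:len(wanted)] == wanted


# "Code:" line from the first report.
OOPS_CODE = "48 8b 11 48 89 d0 48 c1 e8 16 48 85 c0 75 1b 48 8b 51 08 48"

# Bytes copied from the objdump listing, keyed by address.
LISTING = {
    0x256: "48 8b 11",
    0x259: "48 89 d0",
    0x25C: "48 c1 e8 16",
    0x260: "48 85 c0",
    0x263: "75 1b",
    0x265: "48 8b 51 08",
    0x269: "48 89 d0",
}

# profile_pc starts at 0x210 in the listing; the oops reports offset 0x46.
print(code_matches(OOPS_CODE, LISTING, 0x210 + 0x46))
```

For the bytes quoted in this thread the comparison succeeds, which is consistent with the development machine running the same profile_pc code as the crashing one.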




* Re: [Xenomai-help] Kernel panic: not syncing
  2008-07-09 15:19         ` Petr Cervenka
@ 2008-07-09 16:05           ` Philippe Gerum
  2008-07-10 13:45             ` Petr Cervenka
  2008-07-11 13:18             ` Petr Cervenka
  0 siblings, 2 replies; 24+ messages in thread
From: Philippe Gerum @ 2008-07-09 16:05 UTC (permalink / raw)
  To: Petr Cervenka; +Cc: jan.kiszka, xenomai

Petr Cervenka wrote:
> Jan Kiszka wrote:
>> Gilles Chanteperdrix wrote:
>>> Jan Kiszka wrote:
>>>> Philippe Gerum wrote:
>>>>> Petr Cervenka wrote:
>>>>>> Hello,
>>>>>> I'm not sure if I'm not off topic.
>>>>>> We use Linux 2.6.24 and Xenomai 2.4.1. Occasionally (once in few
>>>>>> days) we get an kernel panic and I don't know If it's our fault or a
>>>>>> problem of kernel/xenomai/adeos/configuration/hw/...
>>>>>> If you have any questions, i'll try to answer them. Any help is
>>>>>> welcome.
>>>>> It is an I-pipe issue, probably. We have to somewhat forge the
>>>>> register frame
>>>>> passed to the Linux tick handler, since we may delay that call. Some
>>>>> register
>>>>> values the profiling code attempts to dereference to find the
>>>>> preempted code may
>>>>> be wrong in our case.
>>>>>
>>>>> Could you 1) send back a disassembly of the profile_tick routine in
>>>>> your kernel
>>>>> image, then apply the following patch to check whether it improves
>>>>> the situation
>>>>> as well? TIA,
>>>>>
>>>>> --- 2.6.24-x86-2.0-03/arch/x86/kernel/ipipe.c~    2008-02-11
>>>>> 10:48:24.000000000 +0100
>>>>> +++ 2.6.24-x86-2.0-03/arch/x86/kernel/ipipe.c    2008-07-07
>>>>> 17:55:36.000000000 +0200
>>>>> @@ -933,12 +933,7 @@
>>>>>          tick_regs->eip = regs.eip;
>>>>>          tick_regs->ebp = regs.ebp;
>>>>>  #else /* !CONFIG_X86_32 */
>>>>> -        tick_regs->ss = regs->ss;
>>>>> -        tick_regs->rsp = regs->rsp;
>>>>> -        tick_regs->eflags = regs->eflags;
>>>>> -        tick_regs->cs = regs->cs;
>>>>> -        tick_regs->rip = regs->rip;
>>>>> -        tick_regs->rbp = regs->rbp;
>>>>> +        *tick_regs = *regs;
>>>>>  #endif /* !CONFIG_X86_32 */
>>>> I'm fairly sure that this won't make a difference. According to Petr's
>>>> first dump we crash in profile_pc, and there the kernel pokes around on
>>>> the stack of the interrupted context (Petr, you are running SMP,
>>>> right?). The question is if this stack may have vanished or may have
>>>> been swapped out after capturing the registers.
>>> When Xenomai has forwarded the tick to linux, Linux tick handler is
>>> executed upon resume to user-space, so, if the stack had to vanish, it
>>> would have to vanish upon execution of another interrupt handler before
>>> the tick handler. However, I believe that only do_exit can kill a task,
>>> and I am not sure if it can be called from an interrupt handler. As for
>>> the stack being swapped out, it is kmalloced memory, so, it is impossible.
>> Yes, vanishing stack is unlikely, more probable is an invalid state
>> right from the beginning. I guess we need a full oops to say more.
>>
>> Petr, any chance to attach a serial cable to your box and catch those
>> oopses completely via a second box? The register state would be telling,
>> but also, as Philippe already requested, a disassembly of the involved
>> function - in case it remain profile_pc: objdump -dS
>> linux-.../arch/x86/kernel/time_64.o.
>>
> 
> To your questions:
> we have only user space tasks, but we use rtdm driver (with ioctl, no tasks).
> The processor is Athlon X2, 64-bit distribution (Kubuntu ?7.10?), x86_64 SMP PREEMPT,
> Kernel 2.6.24, Xenomai 2.4.1, adeos-ipipe-2.6.24-x86-2.0-03
> I could send you my kernel config file if you want.
> I will try to learn the method of oopses catching via serial cable attached second box. But I don't know if it will be possible to setup such experiment for time long enough to reproduce the error.
> 
> The following disassembly is from different machine than the one which has the kernel panics. I use it for developing and testing but it should have the same HW and kernel configuration. I can't totally sure, that the disassembly is correct, but i hope so. I junst can't recognise it. Could you explain me what does "profile_pc+0x46/0x80"? I assume the first number is the current RIP address (relative to routine start), so 0x256 in this case. But what does mean the second number?

Size of the routine.
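
To make the notation concrete: a location such as profile_pc+0x46/0x80 reads as symbol+offset/size. The following Python sketch (the helper names are hypothetical) splits such a string and maps the offset onto an objdump listing, using the 0x210 start address of profile_pc from the disassembly quoted in this message.

```python
import re

LOCATION = re.compile(r"^(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)$")


def parse_location(text):
    """Split 'symbol+0xOFF/0xSIZE' into (symbol, offset, size)."""
    match = LOCATION.match(text.strip())
    if match is None:
        raise ValueError("not a symbol+offset/size string: %r" % text)
    symbol, offset, size = match.groups()
    return symbol, int(offset, 16), int(size, 16)


def listing_address(text, symbol_start):
    """Address of the location inside a listing where the symbol starts
    at `symbol_start`; rejects offsets that fall outside the routine."""
    _, offset, size = parse_location(text)
    if offset >= size:
        raise ValueError("offset 0x%x is outside a 0x%x-byte routine"
                         % (offset, size))
    return symbol_start + offset


print(parse_location("profile_pc+0x46/0x80"))
print(hex(listing_address("profile_pc+0x46/0x80", 0x210)))
```

By the same reading, the run_posix_cpu_timers+0x810/0x280 line in the second, hand-copied trace has an offset larger than the routine size, so that line may contain a transcription slip.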

> 
> 0000000000000210 <profile_pc>:
>  210:	48 83 ec 18          	sub    $0x18,%rsp
>  214:	48 89 5c 24 08       	mov    %rbx,0x8(%rsp)
>  219:	48 89 6c 24 10       	mov    %rbp,0x10(%rsp)
>  21e:	48 89 fb             	mov    %rdi,%rbx
>  221:	f6 87 88 00 00 00 03 	testb  $0x3,0x88(%rdi)
>  228:	48 8b af 80 00 00 00 	mov    0x80(%rdi),%rbp
>  22f:	74 12                	je     243 <profile_pc+0x33>
>  231:	48 89 e8             	mov    %rbp,%rax
>  234:	48 8b 5c 24 08       	mov    0x8(%rsp),%rbx
>  239:	48 8b 6c 24 10       	mov    0x10(%rsp),%rbp
>  23e:	48 83 c4 18          	add    $0x18,%rsp
>  242:	c3                   	retq   
>  243:	48 89 ef             	mov    %rbp,%rdi
>  246:	e8 00 00 00 00       	callq  24b <profile_pc+0x3b>
>  24b:	85 c0                	test   %eax,%eax
>  24d:	74 e2                	je     231 <profile_pc+0x21>
>  24f:	48 8b 8b 98 00 00 00 	mov    0x98(%rbx),%rcx
>  256:	48 8b 11             	mov    (%rcx),%rdx

We are dereferencing invalid stack memory, using the stack pointer value of the
preempted context.

What could help is to have the registers dump which should appear in the oops
message, and specifically the %rcx value.

>  259:	48 89 d0             	mov    %rdx,%rax
>  25c:	48 c1 e8 16          	shr    $0x16,%rax
>  260:	48 85 c0             	test   %rax,%rax
>  263:	75 1b                	jne    280 <profile_pc+0x70>
>  265:	48 8b 51 08          	mov    0x8(%rcx),%rdx
>  269:	48 89 d0             	mov    %rdx,%rax
>  26c:	48 c1 e8 16          	shr    $0x16,%rax
>  270:	48 85 c0             	test   %rax,%rax
>  273:	48 0f 45 ea          	cmovne %rdx,%rbp
>  277:	eb b8                	jmp    231 <profile_pc+0x21>
>  279:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)
>  280:	48 89 d5             	mov    %rdx,%rbp
>  283:	eb ac                	jmp    231 <profile_pc+0x21>
>  285:	66 66 2e 0f 1f 84 00 	nopw   %cs:0x0(%rax,%rax,1)
>  28c:	00 00 00 00  
> 
> 


-- 
Philippe.



* Re: [Xenomai-help] Kernel panic: not syncing
  2008-07-09 16:05           ` Philippe Gerum
@ 2008-07-10 13:45             ` Petr Cervenka
  2008-07-11 13:18             ` Petr Cervenka
  1 sibling, 0 replies; 24+ messages in thread
From: Petr Cervenka @ 2008-07-10 13:45 UTC (permalink / raw)
  To: rpm; +Cc: xenomai


Philippe Gerum wrote:
>Petr Cervenka wrote:
>> Jan Kiszka wrote:
>>> Gilles Chanteperdrix wrote:
>>>> Jan Kiszka wrote:
>>>>> Philippe Gerum wrote:
>>>>>> Petr Cervenka wrote:
>>>>>>> Hello,
>>>>>>> I'm not sure if I'm not off topic.
>>>>>>> We use Linux 2.6.24 and Xenomai 2.4.1. Occasionally (once in few
>>>>>>> days) we get an kernel panic and I don't know If it's our fault or a
>>>>>>> problem of kernel/xenomai/adeos/configuration/hw/...
>>>>>>> If you have any questions, i'll try to answer them. Any help is
>>>>>>> welcome.
>>>>>> It is an I-pipe issue, probably. We have to somewhat forge the
>>>>>> register frame
>>>>>> passed to the Linux tick handler, since we may delay that call. Some
>>>>>> register
>>>>>> values the profiling code attempts to dereference to find the
>>>>>> preempted code may
>>>>>> be wrong in our case.
>>>>>>
>>>>>> Could you 1) send back a disassembly of the profile_tick routine in
>>>>>> your kernel
>>>>>> image, then apply the following patch to check whether it improves
>>>>>> the situation
>>>>>> as well? TIA,
>>>>>>
>>>>>> --- 2.6.24-x86-2.0-03/arch/x86/kernel/ipipe.c~    2008-02-11
>>>>>> 10:48:24.000000000 +0100
>>>>>> +++ 2.6.24-x86-2.0-03/arch/x86/kernel/ipipe.c    2008-07-07
>>>>>> 17:55:36.000000000 +0200
>>>>>> @@ -933,12 +933,7 @@
>>>>>>          tick_regs->eip = regs.eip;
>>>>>>          tick_regs->ebp = regs.ebp;
>>>>>>  #else /* !CONFIG_X86_32 */
>>>>>> -        tick_regs->ss = regs->ss;
>>>>>> -        tick_regs->rsp = regs->rsp;
>>>>>> -        tick_regs->eflags = regs->eflags;
>>>>>> -        tick_regs->cs = regs->cs;
>>>>>> -        tick_regs->rip = regs->rip;
>>>>>> -        tick_regs->rbp = regs->rbp;
>>>>>> +        *tick_regs = *regs;
>>>>>>  #endif /* !CONFIG_X86_32 */
>>>>> I'm fairly sure that this won't make a difference. According to Petr's
>>>>> first dump we crash in profile_pc, and there the kernel pokes around on
>>>>> the stack of the interrupted context (Petr, you are running SMP,
>>>>> right?). The question is if this stack may have vanished or may have
>>>>> been swapped out after capturing the registers.
>>>> When Xenomai has forwarded the tick to linux, Linux tick handler is
>>>> executed upon resume to user-space, so, if the stack had to vanish, it
>>>> would have to vanish upon execution of another interrupt handler before
>>>> the tick handler. However, I believe that only do_exit can kill a task,
>>>> and I am not sure if it can be called from an interrupt handler. As for
>>>> the stack being swapped out, it is kmalloced memory, so, it is impossible.
>>> Yes, vanishing stack is unlikely, more probable is an invalid state
>>> right from the beginning. I guess we need a full oops to say more.
>>>
>>> Petr, any chance to attach a serial cable to your box and catch those
>>> oopses completely via a second box? The register state would be telling,
>>> but also, as Philippe already requested, a disassembly of the involved
>>> function - in case it remain profile_pc: objdump -dS
>>> linux-.../arch/x86/kernel/time_64.o.
>>>
>> 
>> To your questions:
>> we have only user space tasks, but we use rtdm driver (with ioctl, no tasks).
>> The processor is Athlon X2, 64-bit distribution (Kubuntu ?7.10?), x86_64 SMP PREEMPT,
>> Kernel 2.6.24, Xenomai 2.4.1, adeos-ipipe-2.6.24-x86-2.0-03
>> I could send you my kernel config file if you want.
>> I will try to learn the method of oopses catching via serial cable attached second box. But I don't know if it will be possible to setup such experiment for time long enough to reproduce the error.
>> 
>> The following disassembly is from different machine than the one which has the kernel panics. I use it for developing and testing but it should have the same HW and kernel configuration. I can't totally sure, that the disassembly is correct, but i hope so. I junst can't recognise it. Could you explain me what does "profile_pc+0x46/0x80"? I assume the first number is the current RIP address (relative to routine start), so 0x256 in this case. But what does mean the second number?
>
>Size of the routine.
>
>> 
>> 0000000000000210 <profile_pc>:
>>  210:	48 83 ec 18          	sub    $0x18,%rsp
>>  214:	48 89 5c 24 08       	mov    %rbx,0x8(%rsp)
>>  219:	48 89 6c 24 10       	mov    %rbp,0x10(%rsp)
>>  21e:	48 89 fb             	mov    %rdi,%rbx
>>  221:	f6 87 88 00 00 00 03 	testb  $0x3,0x88(%rdi)
>>  228:	48 8b af 80 00 00 00 	mov    0x80(%rdi),%rbp
>>  22f:	74 12                	je     243 <profile_pc+0x33>
>>  231:	48 89 e8             	mov    %rbp,%rax
>>  234:	48 8b 5c 24 08       	mov    0x8(%rsp),%rbx
>>  239:	48 8b 6c 24 10       	mov    0x10(%rsp),%rbp
>>  23e:	48 83 c4 18          	add    $0x18,%rsp
>>  242:	c3                   	retq   
>>  243:	48 89 ef             	mov    %rbp,%rdi
>>  246:	e8 00 00 00 00       	callq  24b <profile_pc+0x3b>
>>  24b:	85 c0                	test   %eax,%eax
>>  24d:	74 e2                	je     231 <profile_pc+0x21>
>>  24f:	48 8b 8b 98 00 00 00 	mov    0x98(%rbx),%rcx
>>  256:	48 8b 11             	mov    (%rcx),%rdx
>
>We are dereferencing invalid stack memory, using the stack pointer value of the
>preempted context.
>
>What could help is to have the registers dump which should appear in the oops
>message, and specifically the %rcx value.

I tried to set up the serial line console, but with only partial success.
I used this guide: http://www.av8n.com/computer/htm/kernel-lockup.htm (not the watchdog part).
Now I am able to log in through gtkterm to the first (logged) machine from the second (logging) machine. But when I run: echo "Hi there." > /dev/console, I see the text only in the current session (tty0) and not on the connected second machine (maybe gtkterm is not the right program for this).
When I write directly: echo "Hi there." > /dev/ttyS0, it freezes and I have to press Ctrl+C to continue (with the error: Interrupted system call).
In the dmesg output of the first machine there is the line: console [ttyS0] enabled. The ttyS0 console, the getty on ttyS0 and gtkterm all use the same speed, 115200.
What have I done wrong?
Petr
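
The "symbol+offset/size" notation explained in the quoted exchange above can be decoded mechanically. What follows is only a minimal sketch in C; the struct and function names are made up for illustration and are not part of any kernel or Xenomai tool. It splits a location such as "profile_pc+0x46/0x80" into its parts and maps the offset onto an objdump listing whose routine starts at a known address (0x210 for profile_pc in the listing above).

```c
#include <stdlib.h>
#include <string.h>

/* Decoded form of an oops location such as "profile_pc+0x46/0x80".
 * Hypothetical helper type, for illustration only. */
struct oops_loc {
    char symbol[64];        /* routine name */
    unsigned long offset;   /* bytes from the start of the routine */
    unsigned long size;     /* total size of the routine in bytes */
};

/* Parse "symbol+0xOFF/0xSIZE"; returns 0 on success, -1 on malformed input. */
static int parse_oops_loc(const char *text, struct oops_loc *out)
{
    const char *plus = strchr(text, '+');
    const char *slash = plus ? strchr(plus, '/') : NULL;
    size_t len;
    char *end;

    if (!plus || !slash)
        return -1;

    len = (size_t)(plus - text);
    if (len == 0 || len >= sizeof(out->symbol))
        return -1;
    memcpy(out->symbol, text, len);
    out->symbol[len] = '\0';

    /* strtoul() with base 16 accepts the optional "0x" prefix. */
    out->offset = strtoul(plus + 1, &end, 16);
    if (end == plus + 1 || end != slash)
        return -1;

    out->size = strtoul(slash + 1, &end, 16);
    if (end == slash + 1 || *end != '\0')
        return -1;

    return 0;
}

/* Address of the faulting instruction in an objdump listing whose
 * routine starts at routine_start. */
static unsigned long listing_address(unsigned long routine_start,
                                     const struct oops_loc *loc)
{
    return routine_start + loc->offset;
}
```

Calling parse_oops_loc("profile_pc+0x46/0x80", &loc) fills in the symbol "profile_pc", offset 0x46 and size 0x80; listing_address(0x210, &loc) then gives 0x256, which is the "mov (%rcx),%rdx" line of the listing.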




* Re: [Xenomai-help] Kernel panic: not syncing
  2008-07-09 16:05           ` Philippe Gerum
  2008-07-10 13:45             ` Petr Cervenka
@ 2008-07-11 13:18             ` Petr Cervenka
  2008-07-15 14:42               ` Petr Cervenka
  1 sibling, 1 reply; 24+ messages in thread
From: Petr Cervenka @ 2008-07-11 13:18 UTC (permalink / raw)
  To: rpm; +Cc: xenomai, jan.kiszka

I was able to capture the kernel panic through the serial line. This time, it took less time than I expected.
Petr

[ 2009.702873] Unable to handle kernel paging request at 0000000040090ff8 RIP: 
[ 2009.708187]  [<ffffffff802101a6>] profile_pc+0x46/0x80
[ 2009.716892] PGD 116fd067 PUD 116b5067 PMD 3d93d067 PTE 0
[ 2009.722998] Oops: 0000 [1] PREEMPT SMP 
[ 2009.727400] CPU 0 
[ 2009.729800] Modules linked in: nls_iso8859_1 nls_cp437 vfat fat usb_storage libusual rt_e1000 rt_r8169 rtpacket rtnet ppdev pci171x_rtdm(P) container ac video output sbs sbshc dock battery parport_pc lp parport psmouse serio_raw pcspkr k8temp i2c_nforce2 i2c_core button af_packet ipv6 evdev ext3 jbd mbcache sg sd_mod ide_cd cdrom sata_nv floppy forcedeth ata_generic libata scsi_mod amd74xx ehci_hcd ohci_hcd ide_core usbcore fan fuse
[ 2009.774477] Pid: 0, comm: swapper Tainted: P        2.6.24-adeos #1
[ 2009.781682] RIP: 0010:[<ffffffff802101a6>]  [<ffffffff802101a6>] profile_pc+0x46/0x80
[ 2009.790682] RSP: 0018:ffffffff80664da0  EFLAGS: 00010202
[ 2009.796869] RAX: 0000000000000001 RBX: ffff8100010087a0 RCX: 0000000040090ff8
[ 2009.804953] RDX: ffff8100809b4000 RSI: 0000000000000906 RDI: ffffffff8048c6c8
[ 2009.813179] RBP: ffffffff8048c6c8 R08: 0000000000000004 R09: 0000000000000010
[ 2009.821383] R10: 0000000000000005 R11: ffffffff80258ee0 R12: ffff81000100a5c0
[ 2009.829667] R13: 000001d1bcf596b9 R14: 0000000000000000 R15: 0000000000000001
[ 2009.837867] FS:  0000000040091950(0000) GS:ffffffff805d6000(0000) knlGS:0000000000000000
[ 2009.847177] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[ 2009.853762] CR2: 0000000040090ff8 CR3: 0000000014cfd000 CR4: 00000000000006e0
[ 2009.862084] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2009.870373] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 2009.878572] Process swapper (pid: 0, threadinfo ffffffff80602000, task ffffffff805a03a0)
[ 2009.887786] Stack:  000001d1bcf596b9 ffff8100010087a0 0000000000000001 ffffffff8023fd3e
[ 2009.897090]  000001d1bcf596b9 ffff81000100a6c0 ffff8100010087a0 ffffffff8025e8a5
[ 2009.905760]  ffff81000100a618 ffff81000100a6c0 ffff81000100a618 000001d1bcf58f16
[ 2009.914068] Call Trace:
[ 2009.917071]  <IRQ>  [<ffffffff8023fd3e>] profile_tick+0x5e/0xa0
[ 2009.923847]  [<ffffffff8025e8a5>] tick_sched_timer+0x85/0x170
[ 2009.930374]  [<ffffffff8025900f>] hrtimer_interrupt+0x12f/0x1e0
[ 2009.937150]  [<ffffffff80220857>] smp_apic_timer_interrupt+0x37/0x60
[ 2009.944454]  [<ffffffff802777e0>] __ipipe_sync_stage+0x350/0x355
[ 2009.951351]  [<ffffffff80220820>] smp_apic_timer_interrupt+0x0/0x60
[ 2009.958555]  [<ffffffff802777e5>] __xirq_end+0x0/0x85
[ 2009.964349]  [<ffffffff80220820>] smp_apic_timer_interrupt+0x0/0x60
[ 2009.971558]  [<ffffffff80226b01>] __ipipe_handle_irq+0x91/0x250
[ 2009.978457]  [<ffffffff8020af50>] default_idle+0x0/0x40
[ 2009.984437]  [<ffffffff8020c9f1>] common_interrupt+0x61/0x7d
[ 2009.990940]  <EOI>  [<ffffffff8020af79>] default_idle+0x29/0x40
[ 2009.997762]  [<ffffffff8020b01b>] cpu_idle+0x8b/0x120
[ 2010.003541]  [<ffffffff8060cbba>] start_kernel+0x2ba/0x350
[ 2010.009741]  [<ffffffff8060c120>] _sinittext+0x120/0x130
[ 2010.015821] 
[ 2010.017549] 
[ 2010.017556] Code: 48 8b 11 48 89 d0 48 c1 e8 16 48 85 c0 75 1b 48 8b 51 08 48 
[ 2010.028006] RIP  [<ffffffff802101a6>] profile_pc+0x46/0x80
[ 2010.034335]  RSP <ffffffff80664da0>
[ 2010.038334] CR2: 0000000040090ff8
[ 2010.042166] ---[ end trace 95174c527ade95f0 ]---
[ 2010.047468] Kernel panic - not syncing: Aiee, killing interrupt handler! 

>> 
>> 0000000000000210 <profile_pc>:
>>  210:	48 83 ec 18          	sub    $0x18,%rsp
>>  214:	48 89 5c 24 08       	mov    %rbx,0x8(%rsp)
>>  219:	48 89 6c 24 10       	mov    %rbp,0x10(%rsp)
>>  21e:	48 89 fb             	mov    %rdi,%rbx
>>  221:	f6 87 88 00 00 00 03 	testb  $0x3,0x88(%rdi)
>>  228:	48 8b af 80 00 00 00 	mov    0x80(%rdi),%rbp
>>  22f:	74 12                	je     243 <profile_pc+0x33>
>>  231:	48 89 e8             	mov    %rbp,%rax
>>  234:	48 8b 5c 24 08       	mov    0x8(%rsp),%rbx
>>  239:	48 8b 6c 24 10       	mov    0x10(%rsp),%rbp
>>  23e:	48 83 c4 18          	add    $0x18,%rsp
>>  242:	c3                   	retq   
>>  243:	48 89 ef             	mov    %rbp,%rdi
>>  246:	e8 00 00 00 00       	callq  24b <profile_pc+0x3b>
>>  24b:	85 c0                	test   %eax,%eax
>>  24d:	74 e2                	je     231 <profile_pc+0x21>
>>  24f:	48 8b 8b 98 00 00 00 	mov    0x98(%rbx),%rcx
>>  256:	48 8b 11             	mov    (%rcx),%rdx
>>  259:	48 89 d0             	mov    %rdx,%rax
>>  25c:	48 c1 e8 16          	shr    $0x16,%rax
>>  260:	48 85 c0             	test   %rax,%rax
>>  263:	75 1b                	jne    280 <profile_pc+0x70>
>>  265:	48 8b 51 08          	mov    0x8(%rcx),%rdx
>>  269:	48 89 d0             	mov    %rdx,%rax
>>  26c:	48 c1 e8 16          	shr    $0x16,%rax
>>  270:	48 85 c0             	test   %rax,%rax
>>  273:	48 0f 45 ea          	cmovne %rdx,%rbp
>>  277:	eb b8                	jmp    231 <profile_pc+0x21>
>>  279:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)
>>  280:	48 89 d5             	mov    %rdx,%rbp
>>  283:	eb ac                	jmp    231 <profile_pc+0x21>
>>  285:	66 66 2e 0f 1f 84 00 	nopw   %cs:0x0(%rax,%rax,1)
>>  28c:	00 00 00 00  
>> 
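
One thing that can be read directly from the register dump above: RCX and CR2 both hold 0000000040090ff8, and the "Code:" bytes start with 48 8b 11, which is the "mov (%rcx),%rdx" at 0x256 in the listing. So the fault is the load through RCX. The small C sketch below classifies such values, assuming the usual x86_64 layout with 48-bit canonical virtual addresses (lower half for user space, upper half for the kernel). The macro and function names are only illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 48-bit canonical addressing, as used by x86_64 kernels
 * with 4-level page tables. */
#define USER_HALF_END     UINT64_C(0x0000800000000000)
#define KERNEL_HALF_START UINT64_C(0xffff800000000000)

/* True if addr lies in the lower (user-space) canonical half. */
static bool is_user_address(uint64_t addr)
{
    return addr < USER_HALF_END;
}

/* True if addr lies in the upper (kernel) canonical half. */
static bool is_kernel_address(uint64_t addr)
{
    return addr >= KERNEL_HALF_START;
}
```

With the values from the dump, is_user_address(0x40090ff8) is true, while RSP (ffffffff80664da0) and RBX (ffff8100010087a0) are kernel addresses. In other words, kernel code running in interrupt context on top of the swapper task loaded through a user-space pointer that has no mapping there (the page-table walk in the oops ends with "PTE 0").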




* Re: [Xenomai-help] Kernel panic: not syncing
  2008-07-11 13:18             ` Petr Cervenka
@ 2008-07-15 14:42               ` Petr Cervenka
  2008-07-15 15:03                 ` Jan Kiszka
  0 siblings, 1 reply; 24+ messages in thread
From: Petr Cervenka @ 2008-07-15 14:42 UTC (permalink / raw)
  To: xenomai; +Cc: jan.kiszka

I also captured the second type of kernel panic. This one seems to happen during "advanced" configuration of our system, which means a lot of work in a low-priority (5) Xenomai task (WORK_TASK_2056) for a short time.
Another question: what does the "(P)" after the name of our RTDM module (pci171x_rtdm(P)) mean?

[ 7815.694296] ------------[ cut here ]------------
[ 7815.699111] kernel BUG at kernel/posix-cpu-timers.c:1295!
[ 7815.704715] invalid opcode: 0000 [1] PREEMPT SMP 
[ 7815.709672] CPU 0 
[ 7815.711777] Modules linked in: rt_e1000 rt_r8169 rtpacket rtnet ppdev pci171x_rtdm(P) container ac video output sbs sbshc dock battery parport_pc lp parport psmouse serio_raw pcspkr k8temp i2c_nforce2 button i2c_core af_packet ipv6 evdev ext3 jbd mbcache sg sd_mod ide_cd cdrom sata_nv floppy ata_generic libata ohci_hcd forcedeth ehci_hcd scsi_mod amd74xx ide_core usbcore fan fuse
[ 7815.747844] Pid: 6481, comm: WORK_TASK_2056 Tainted: P        2.6.24-adeos #1
[ 7815.755321] RIP: 0010:[<ffffffff80256e20>]  [<ffffffff80256e20>] run_posix_cpu_timers+0x810/0x820
[ 7815.764629] RSP: 0000:ffffffff80664d70  EFLAGS: 00010246
[ 7815.770122] RAX: ffff81000100a7c0 RBX: ffff81003e082780 RCX: ffffffff805a03a0
[ 7815.777573] RDX: 0000000000000000 RSI: ffff81003e082780 RDI: ffff81003e082780
[ 7815.785080] RBP: ffff8100010087a0 R08: 0000000000000004 R09: 0000000000000010
[ 7815.792566] R10: 0000000000000005 R11: ffffffff80258ee0 R12: ffff81000100a5c0
[ 7815.800001] R13: 00000719439890f1 R14: 0000000000000000 R15: ffffffff80664d90
[ 7815.807436] FS:  0000000040112950(0063) GS:ffffffff805d6000(0000) knlGS:0000000000000000
[ 7815.815909] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7815.821915] CR2: 00002b83d55aec80 CR3: 000000003dff8000 CR4: 00000000000006e0
[ 7815.829357] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 7815.836786] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 7815.844238] Process WORK_TASK_2056 (pid: 6481, threadinfo ffff810013f78000, task ffff81003e082780)
[ 7815.853584] Stack:  ffff810001013180 00000718f7bcb1dd ffffffff80664db0 ffffffff80238af8
[ 7815.862131]  ffffffff80664d90 ffffffff80664d90 00000719439890f1 ffff81000100a6c0
[ 7815.869974]  ffff8100010087a0 ffff81000100a5c0 00000719439890f1 0000000000000000
[ 7815.877667] Call Trace:
[ 7815.880441]  <IRQ>  [<ffffffff80238af8>] scheduler_tick+0xf8/0x140
[ 7815.886908]  [<ffffffff8025e89b>] tick_sched_timer+0x7b/0x170
[ 7815.892929]  [<ffffffff8025900f>] hrtimer_interrupt+0x12f/0x1e0
[ 7815.899137]  [<ffffffff80220857>] smp_apic_timer_interrupt+0x37/0x60
[ 7815.905752]  [<ffffffff8020c9f1>] common_interrupt+0x61/0x7d
[ 7815.911779]  [<ffffffff802777e0>] __ipipe_sync_stage+0x350/0x355
[ 7815.918085]  [<ffffffff80220820>] smp_apic_timer_interrupt+0x0/0x60
[ 7815.924655]  [<ffffffff802777e5>] __xirq_end+0x0/0x85
[ 7815.929964]  [<ffffffff80220820>] smp_apic_timer_interrupt+0x0/0x60
[ 7815.936587]  [<ffffffff80226b01>] __ipipe_handle_irq+0x91/0x250
[ 7815.942774]  [<ffffffff8020c9f1>] common_interrupt+0x61/0x7d
[ 7815.948673]  <EOI> 
[ 7815.950909] 
[ 7815.950909] Code: 0f 0b eb fe 66 66 66 2e 0f 1f 84 00 00 00 00 00 41 57 41 56 
[ 7815.960491] RIP  [<ffffffff80256e20>] run_posix_cpu_timers+0x810/0x820
[ 7815.967284]  RSP <ffffffff80664d70>
[ 7815.970982] ---[ end trace d192885d9858c4b2 ]---
[ 7815.975820] Kernel panic - not syncing: Aiee, killing interrupt handler! 




* Re: [Xenomai-help] Kernel panic: not syncing
  2008-07-15 14:42               ` Petr Cervenka
@ 2008-07-15 15:03                 ` Jan Kiszka
  2008-07-16  8:39                   ` Petr Cervenka
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Kiszka @ 2008-07-15 15:03 UTC (permalink / raw)
  To: Petr Cervenka; +Cc: xenomai

Petr Cervenka wrote:
> I also captured the second type of kernel panic. This one seems to happen during "advanced" configuration of our system, which means a lot of work in a low-priority (5) Xenomai task (WORK_TASK_2056) for a short time.
> Another question: what does the "(P)" after the name of our RTDM module (pci171x_rtdm(P)) mean?

That it either does not comply to the GPL or that the author forgot to
announce its compliance via MODULE_LICENSE().
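
To make that concrete, here is a user-space sketch in C of the kind of check involved. It is not the kernel's code: the function names are hypothetical, and the list of accepted strings is what I believe the kernel treats as GPL-compatible, so please take it as an assumption and check it against your kernel tree. A module that declares none of these strings, or has no MODULE_LICENSE() at all, is what gets the "(P)" marker and the "Tainted: P" flag.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Sketch only: license strings assumed to count as GPL-compatible. */
static bool sketch_license_is_gpl_compatible(const char *license)
{
    static const char *const accepted[] = {
        "GPL",
        "GPL v2",
        "GPL and additional rights",
        "Dual BSD/GPL",
        "Dual MIT/GPL",
        "Dual MPL/GPL",
    };
    size_t i;

    for (i = 0; i < sizeof(accepted) / sizeof(accepted[0]); i++) {
        if (strcmp(license, accepted[i]) == 0)
            return true;
    }
    return false;
}

/* True if a module with this license string would be marked "(P)".
 * NULL stands for a module that has no MODULE_LICENSE() at all. */
static bool sketch_module_taints(const char *license)
{
    return license == NULL || !sketch_license_is_gpl_compatible(license);
}
```

So if the driver is in fact meant to be GPL, adding MODULE_LICENSE("GPL") to its source should make the "(P)" go away.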

> 
> [ 7815.694296] ------------[ cut here ]------------
> [ 7815.699111] kernel BUG at kernel/posix-cpu-timers.c:1295!
> [ 7815.704715] invalid opcode: 0000 [1] PREEMPT SMP 
> [ 7815.709672] CPU 0 
> [ 7815.711777] Modules linked in: rt_e1000 rt_r8169 rtpacket rtnet ppdev pci171x_rtdm(P) container ac video output sbs sbshc dock battery parport_pc lp parport psmouse serio_raw pcspkr k8temp i2c_nforce2 button i2c_core af_packet ipv6 evdev ext3 jbd mbcache sg sd_mod ide_cd cdrom sata_nv floppy ata_generic libata ohci_hcd forcedeth ehci_hcd scsi_mod amd74xx ide_core usbcore fan fuse
> [ 7815.747844] Pid: 6481, comm: WORK_TASK_2056 Tainted: P        2.6.24-adeos #1
> [ 7815.755321] RIP: 0010:[<ffffffff80256e20>]  [<ffffffff80256e20>] run_posix_cpu_timers+0x810/0x820
> [ 7815.764629] RSP: 0000:ffffffff80664d70  EFLAGS: 00010246
> [ 7815.770122] RAX: ffff81000100a7c0 RBX: ffff81003e082780 RCX: ffffffff805a03a0
> [ 7815.777573] RDX: 0000000000000000 RSI: ffff81003e082780 RDI: ffff81003e082780
> [ 7815.785080] RBP: ffff8100010087a0 R08: 0000000000000004 R09: 0000000000000010
> [ 7815.792566] R10: 0000000000000005 R11: ffffffff80258ee0 R12: ffff81000100a5c0
> [ 7815.800001] R13: 00000719439890f1 R14: 0000000000000000 R15: ffffffff80664d90
> [ 7815.807436] FS:  0000000040112950(0063) GS:ffffffff805d6000(0000) knlGS:0000000000000000
> [ 7815.815909] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 7815.821915] CR2: 00002b83d55aec80 CR3: 000000003dff8000 CR4: 00000000000006e0
> [ 7815.829357] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 7815.836786] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 7815.844238] Process WORK_TASK_2056 (pid: 6481, threadinfo ffff810013f78000, task ffff81003e082780)
> [ 7815.853584] Stack:  ffff810001013180 00000718f7bcb1dd ffffffff80664db0 ffffffff80238af8
> [ 7815.862131]  ffffffff80664d90 ffffffff80664d90 00000719439890f1 ffff81000100a6c0
> [ 7815.869974]  ffff8100010087a0 ffff81000100a5c0 00000719439890f1 0000000000000000
> [ 7815.877667] Call Trace:
> [ 7815.880441]  <IRQ>  [<ffffffff80238af8>] scheduler_tick+0xf8/0x140
> [ 7815.886908]  [<ffffffff8025e89b>] tick_sched_timer+0x7b/0x170
> [ 7815.892929]  [<ffffffff8025900f>] hrtimer_interrupt+0x12f/0x1e0
> [ 7815.899137]  [<ffffffff80220857>] smp_apic_timer_interrupt+0x37/0x60
> [ 7815.905752]  [<ffffffff8020c9f1>] common_interrupt+0x61/0x7d
> [ 7815.911779]  [<ffffffff802777e0>] __ipipe_sync_stage+0x350/0x355
> [ 7815.918085]  [<ffffffff80220820>] smp_apic_timer_interrupt+0x0/0x60
> [ 7815.924655]  [<ffffffff802777e5>] __xirq_end+0x0/0x85
> [ 7815.929964]  [<ffffffff80220820>] smp_apic_timer_interrupt+0x0/0x60
> [ 7815.936587]  [<ffffffff80226b01>] __ipipe_handle_irq+0x91/0x250
> [ 7815.942774]  [<ffffffff8020c9f1>] common_interrupt+0x61/0x7d
> [ 7815.948673]  <EOI> 
> [ 7815.950909] 
> [ 7815.950909] Code: 0f 0b eb fe 66 66 66 2e 0f 1f 84 00 00 00 00 00 41 57 41 56 
> [ 7815.960491] RIP  [<ffffffff80256e20>] run_posix_cpu_timers+0x810/0x820
> [ 7815.967284]  RSP <ffffffff80664d70>
> [ 7815.970982] ---[ end trace d192885d9858c4b2 ]---
> [ 7815.975820] Kernel panic - not syncing: Aiee, killing interrupt handler! 

That's now a totally different spot, and it makes me wonder if you can
reproduce all these troubles with vanilla Xenomai and without your driver
being loaded...

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux



* Re: [Xenomai-help] Kernel panic: not syncing
  2008-07-15 15:03                 ` Jan Kiszka
@ 2008-07-16  8:39                   ` Petr Cervenka
  2008-07-17 10:21                     ` Jan Kiszka
  0 siblings, 1 reply; 24+ messages in thread
From: Petr Cervenka @ 2008-07-16  8:39 UTC (permalink / raw)
  To: jan.kiszka; +Cc: xenomai


Jan Kiszka wrote:
>Petr Cervenka wrote:
>> I also captured the second type of kernel panic. This one seems to happen during "advanced" configuration of our system, which means a lot of work in a low-priority (5) Xenomai task (WORK_TASK_2056) for a short time.
>> Another question: what does the "(P)" after the name of our RTDM module (pci171x_rtdm(P)) mean?
>
>That it either does not comply to the GPL or that the author forgot to
>announce its compliance via MODULE_LICENSE().
>
>> 
>> [ 7815.694296] ------------[ cut here ]------------
>> [ 7815.699111] kernel BUG at kernel/posix-cpu-timers.c:1295!
>> [ 7815.704715] invalid opcode: 0000 [1] PREEMPT SMP 
>> [ 7815.709672] CPU 0 
>> [ 7815.711777] Modules linked in: rt_e1000 rt_r8169 rtpacket rtnet ppdev pci171x_rtdm(P) container ac video output sbs sbshc dock battery parport_pc lp parport psmouse serio_raw pcspkr k8temp i2c_nforce2 button i2c_core af_packet ipv6 evdev ext3 jbd mbcache sg sd_mod ide_cd cdrom sata_nv floppy ata_generic libata ohci_hcd forcedeth ehci_hcd scsi_mod amd74xx ide_core usbcore fan fuse
>> [ 7815.747844] Pid: 6481, comm: WORK_TASK_2056 Tainted: P        2.6.24-adeos #1
>> [ 7815.755321] RIP: 0010:[<ffffffff80256e20>]  [<ffffffff80256e20>] run_posix_cpu_timers+0x810/0x820
>> [ 7815.764629] RSP: 0000:ffffffff80664d70  EFLAGS: 00010246
>> [ 7815.770122] RAX: ffff81000100a7c0 RBX: ffff81003e082780 RCX: ffffffff805a03a0
>> [ 7815.777573] RDX: 0000000000000000 RSI: ffff81003e082780 RDI: ffff81003e082780
>> [ 7815.785080] RBP: ffff8100010087a0 R08: 0000000000000004 R09: 0000000000000010
>> [ 7815.792566] R10: 0000000000000005 R11: ffffffff80258ee0 R12: ffff81000100a5c0
>> [ 7815.800001] R13: 00000719439890f1 R14: 0000000000000000 R15: ffffffff80664d90
>> [ 7815.807436] FS:  0000000040112950(0063) GS:ffffffff805d6000(0000) knlGS:0000000000000000
>> [ 7815.815909] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 7815.821915] CR2: 00002b83d55aec80 CR3: 000000003dff8000 CR4: 00000000000006e0
>> [ 7815.829357] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [ 7815.836786] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> [ 7815.844238] Process WORK_TASK_2056 (pid: 6481, threadinfo ffff810013f78000, task ffff81003e082780)
>> [ 7815.853584] Stack:  ffff810001013180 00000718f7bcb1dd ffffffff80664db0 ffffffff80238af8
>> [ 7815.862131]  ffffffff80664d90 ffffffff80664d90 00000719439890f1 ffff81000100a6c0
>> [ 7815.869974]  ffff8100010087a0 ffff81000100a5c0 00000719439890f1 0000000000000000
>> [ 7815.877667] Call Trace:
>> [ 7815.880441]  <IRQ>  [<ffffffff80238af8>] scheduler_tick+0xf8/0x140
>> [ 7815.886908]  [<ffffffff8025e89b>] tick_sched_timer+0x7b/0x170
>> [ 7815.892929]  [<ffffffff8025900f>] hrtimer_interrupt+0x12f/0x1e0
>> [ 7815.899137]  [<ffffffff80220857>] smp_apic_timer_interrupt+0x37/0x60
>> [ 7815.905752]  [<ffffffff8020c9f1>] common_interrupt+0x61/0x7d
>> [ 7815.911779]  [<ffffffff802777e0>] __ipipe_sync_stage+0x350/0x355
>> [ 7815.918085]  [<ffffffff80220820>] smp_apic_timer_interrupt+0x0/0x60
>> [ 7815.924655]  [<ffffffff802777e5>] __xirq_end+0x0/0x85
>> [ 7815.929964]  [<ffffffff80220820>] smp_apic_timer_interrupt+0x0/0x60
>> [ 7815.936587]  [<ffffffff80226b01>] __ipipe_handle_irq+0x91/0x250
>> [ 7815.942774]  [<ffffffff8020c9f1>] common_interrupt+0x61/0x7d
>> [ 7815.948673]  <EOI> 
>> [ 7815.950909] 
>> [ 7815.950909] Code: 0f 0b eb fe 66 66 66 2e 0f 1f 84 00 00 00 00 00 41 57 41 56
>> [ 7815.960491] RIP  [<ffffffff80256e20>] run_posix_cpu_timers+0x810/0x820
>> [ 7815.967284]  RSP <ffffffff80664d70>
>> [ 7815.970982] ---[ end trace d192885d9858c4b2 ]---
>> [ 7815.975820] Kernel panic - not syncing: Aiee, killing interrupt handler!
>
>That's now a totally different spot, and it makes me wonder if you can
>reproduce all these troubles with vanilla Xenomai and without your driver
>being loaded...
>

We measure data from our unit, which is connected either through RTnet or through a PCI card. It makes no difference which of the two we use; these kernel panics appear in both setups. So RTnet and our module are not involved.
But it does depend on the measuring frequency and on the amount of data measured in every cycle.

Petr




* Re: [Xenomai-help] Kernel panic: not syncing
  2008-07-16  8:39                   ` Petr Cervenka
@ 2008-07-17 10:21                     ` Jan Kiszka
  2008-07-21 10:58                       ` Petr Cervenka
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Kiszka @ 2008-07-17 10:21 UTC (permalink / raw)
  To: Petr Cervenka; +Cc: xenomai

Petr Cervenka wrote:
> Jan Kiszka wrote:
>> Petr Cervenka wrote:
>>> I also captured the second type of kernel panic. This one seems to happen during "advanced" configuration of our system, which means a lot of work in a low-priority (5) Xenomai task (WORK_TASK_2056) for a short time.
>>> Another question: what does the "(P)" after the name of our RTDM module (pci171x_rtdm(P)) mean?
>> That it either does not comply to the GPL or that the author forgot to
>> announce its compliance via MODULE_LICENSE().
>>
>>> [ 7815.694296] ------------[ cut here ]------------
>>> [ 7815.699111] kernel BUG at kernel/posix-cpu-timers.c:1295!
>>> [ 7815.704715] invalid opcode: 0000 [1] PREEMPT SMP 
>>> [ 7815.709672] CPU 0 
>>> [ 7815.711777] Modules linked in: rt_e1000 rt_r8169 rtpacket rtnet ppdev pci171x_rtdm(P) container ac video output sbs sbshc dock battery parport_pc lp parport psmouse serio_raw pcspkr k8temp i2c_nforce2 button i2c_core af_packet ipv6 evdev ext3 jbd mbcache sg sd_mod ide_cd cdrom sata_nv floppy ata_generic libata ohci_hcd forcedeth ehci_hcd scsi_mod amd74xx ide_core usbcore fan fuse
>>> [ 7815.747844] Pid: 6481, comm: WORK_TASK_2056 Tainted: P        2.6.24-adeos #1
>>> [ 7815.755321] RIP: 0010:[<ffffffff80256e20>]  [<ffffffff80256e20>] run_posix_cpu_timers+0x810/0x820
>>> [ 7815.764629] RSP: 0000:ffffffff80664d70  EFLAGS: 00010246
>>> [ 7815.770122] RAX: ffff81000100a7c0 RBX: ffff81003e082780 RCX: ffffffff805a03a0
>>> [ 7815.777573] RDX: 0000000000000000 RSI: ffff81003e082780 RDI: ffff81003e082780
>>> [ 7815.785080] RBP: ffff8100010087a0 R08: 0000000000000004 R09: 0000000000000010
>>> [ 7815.792566] R10: 0000000000000005 R11: ffffffff80258ee0 R12: ffff81000100a5c0
>>> [ 7815.800001] R13: 00000719439890f1 R14: 0000000000000000 R15: ffffffff80664d90
>>> [ 7815.807436] FS:  0000000040112950(0063) GS:ffffffff805d6000(0000) knlGS:0000000000000000
>>> [ 7815.815909] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [ 7815.821915] CR2: 00002b83d55aec80 CR3: 000000003dff8000 CR4: 00000000000006e0
>>> [ 7815.829357] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> [ 7815.836786] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>>> [ 7815.844238] Process WORK_TASK_2056 (pid: 6481, threadinfo ffff810013f78000, task ffff81003e082780)
>>> [ 7815.853584] Stack:  ffff810001013180 00000718f7bcb1dd ffffffff80664db0 ffffffff80238af8
>>> [ 7815.862131]  ffffffff80664d90 ffffffff80664d90 00000719439890f1 ffff81000100a6c0
>>> [ 7815.869974]  ffff8100010087a0 ffff81000100a5c0 00000719439890f1 0000000000000000
>>> [ 7815.877667] Call Trace:
>>> [ 7815.880441]  <IRQ>  [<ffffffff80238af8>] scheduler_tick+0xf8/0x140
>>> [ 7815.886908]  [<ffffffff8025e89b>] tick_sched_timer+0x7b/0x170
>>> [ 7815.892929]  [<ffffffff8025900f>] hrtimer_interrupt+0x12f/0x1e0
>>> [ 7815.899137]  [<ffffffff80220857>] smp_apic_timer_interrupt+0x37/0x60
>>> [ 7815.905752]  [<ffffffff8020c9f1>] common_interrupt+0x61/0x7d
>>> [ 7815.911779]  [<ffffffff802777e0>] __ipipe_sync_stage+0x350/0x355
>>> [ 7815.918085]  [<ffffffff80220820>] smp_apic_timer_interrupt+0x0/0x60
>>> [ 7815.924655]  [<ffffffff802777e5>] __xirq_end+0x0/0x85
>>> [ 7815.929964]  [<ffffffff80220820>] smp_apic_timer_interrupt+0x0/0x60
>>> [ 7815.936587]  [<ffffffff80226b01>] __ipipe_handle_irq+0x91/0x250
>>> [ 7815.942774]  [<ffffffff8020c9f1>] common_interrupt+0x61/0x7d
>>> [ 7815.948673]  <EOI> 
>>> [ 7815.950909] 
>>> [ 7815.950909] Code: 0f 0b eb fe 66 66 66 2e 0f 1f 84 00 00 00 00 00 41 57 41 56
>>> [ 7815.960491] RIP  [<ffffffff80256e20>] run_posix_cpu_timers+0x810/0x820
>>> [ 7815.967284]  RSP <ffffffff80664d70>
>>> [ 7815.970982] ---[ end trace d192885d9858c4b2 ]---
>>> [ 7815.975820] Kernel panic - not syncing: Aiee, killing interrupt handler!
>> That's now a totally different spot, and it makes me wonder if you can
>> reproduce all these troubles with vanilla Xenomai and without your driver
>> being loaded...
>>
> 
> We measure data from our unit, which is connected either through RTnet or through a PCI card. It makes no difference which of the two we use; these kernel panics appear in both setups. So RTnet and our module are not involved.
> But it does depend on the measuring frequency and on the amount of data measured in every cycle.

We are likely seeing some race that causes weird memory corruptions. Its
probability often increases when the code execution frequency rises.

However, reducing the test case is very important now to narrow the
search domain for this issue. E.g. try to fake peripheral access as far
as possible, unload the unused driver, and leave behind only a test
program that is executable on an arbitrary Xenomai installation
(maybe finally on one of my boxes...).

TiA,
Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux



* Re: [Xenomai-help] Kernel panic: not syncing
  2008-07-17 10:21                     ` Jan Kiszka
@ 2008-07-21 10:58                       ` Petr Cervenka
  2008-07-21 11:26                         ` Jan Kiszka
  0 siblings, 1 reply; 24+ messages in thread
From: Petr Cervenka @ 2008-07-21 10:58 UTC (permalink / raw)
  To: jan.kiszka; +Cc: xenomai

Jan Kiszka wrote:
>We are likely seeing some race that causes weird memory corruptions. Its
>probability often increases when the code execution frequency rises.
>
>However, reducing the test case is very important now to narrow the
>search domain for this issue. E.g. try to fake peripheral access as far
>as possible, unload the unused driver, and leave behind only a test
>program that is executable on an arbitrary Xenomai installation
>(maybe finally on one of my boxes...).
>
I'm not sure I will be able to reduce the software. It depends on the hardware, and it is controlled from another Windows computer running a GUI and a control application. Checking whether the error is still there usually takes a couple of days.
I ran a test over the last weekend (and nothing went wrong). But the /proc/xenomai/stat output is strange. It is probably some type cast error, because 18446744071739514846 = 0xFFFFFFFF8A939FDE and the appropriate value should perhaps be 0x000000008A939FDE = 2324930526.

CPU  PID    MSW        CSW        PF    STAT       %CPU  NAME
  0  0      0          18446744071739514846 0     00500088   69.8  ROOT/0
  1  0      0          18446744071675175740 0     00500080   23.2  ROOT/1
  0  5299   0          351459     0     00300182    0.0  LOGGER_TASK_1804289383
  0  5100   8          283613     0     00300186    0.0
  0  5317   0          40591      0     00300182    0.0
  0  5034   2          2330696    0     00300184    0.0  MAIN_TASK_2056
  0  5318   5          18446744071736105613 3     00300180   29.5  REG_TASK_2056
  0  5319   28         36         0     00300182    0.0  WORK_TASK_2056
  0  5321   38926      39159      0     00300380    0.0  CERECV_2056
  0  5323   1159385    2438330    0     00300181    0.0  CESEND_2056
  1  5710   0          18446744071675175740 0     00300184   76.8  HARDWARE_KERNEL
  0  0      0          18446744071964064315 0     00000000    0.7  IRQ520: [timer]
  1  0      0          232145209  0     00000000    0.0  IRQ520: [timer] 
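
The arithmetic behind those huge CSW values can be reproduced with a few lines of C. This is only a sketch of the suspected mechanism, under the assumption that a 32-bit signed counter which has gone past 2^31 is widened to a 64-bit unsigned value before being printed. The function names are made up; the actual printing code in the nucleus is not shown here.

```c
#include <stdint.h>

/* Widen a 32-bit signed counter the way an implicit conversion from
 * int to a 64-bit unsigned type does: sign-extend, then reinterpret. */
static uint64_t widen_signed(int32_t counter)
{
    return (uint64_t)(int64_t)counter;
}

/* Widen the same 32 bits as an unsigned value: zero-extended. */
static uint64_t widen_unsigned(uint32_t counter)
{
    return (uint64_t)counter;
}
```

The bit pattern 0x8A939FDE read as a signed 32-bit number is -1970036770. widen_signed(-1970036770) gives 18446744071739514846 (0xFFFFFFFF8A939FDE), the value in the ROOT/0 row, while widen_unsigned(0x8A939FDE) gives 2324930526, the value I would expect.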

My theory is that an occasional "longer" piece of work or a system call in the real-time task corrupts the rest of the system (under some special circumstances).

Petr




* Re: [Xenomai-help] Kernel panic: not syncing
  2008-07-21 10:58                       ` Petr Cervenka
@ 2008-07-21 11:26                         ` Jan Kiszka
  2008-07-31 16:14                           ` [Xenomai-help] Segmentation error by heavy dynamic RT_QUEUE usage Petr Cervenka
  2008-08-13 11:01                           ` [Xenomai-core] [PATCH] Fix stat overruns on 64-bit (was: [Xenomai-help] Kernel panic: not syncing) Jan Kiszka
  0 siblings, 2 replies; 24+ messages in thread
From: Jan Kiszka @ 2008-07-21 11:26 UTC (permalink / raw)
  To: Petr Cervenka; +Cc: xenomai

Petr Cervenka wrote:
> Jan Kiszka wrote:
>> We are likely seeing some race that causes weird memory corruptions. Its
>> probability often increases when the code execution frequency rises.
>>
>> However, reducing the test case is very important now to narrow the
>> search domain for this issue. E.g. try to fake peripheral access as far
>> as possible, unload the unused driver, and leave behind only a test
>> program that is executable on an arbitrary Xenomai installation
>> (maybe finally on one of my boxes...).
>>
> I'm not sure I will be able to reduce the software. It depends on the hardware, and it is controlled from another Windows computer running a GUI and a control application. Checking whether the error is still there usually takes a couple of days.
> I ran a test over the last weekend (and nothing went wrong). But the /proc/xenomai/stat output is strange. It is probably some type cast error, because 18446744071739514846 = 0xFFFFFFFF8A939FDE and the appropriate value should perhaps be 0x000000008A939FDE = 2324930526.
> 
> CPU  PID    MSW        CSW        PF    STAT       %CPU  NAME
>   0  0      0          18446744071739514846 0     00500088   69.8  ROOT/0
>   1  0      0          18446744071675175740 0     00500080   23.2  ROOT/1
>   0  5299   0          351459     0     00300182    0.0  LOGGER_TASK_1804289383
>   0  5100   8          283613     0     00300186    0.0
>   0  5317   0          40591      0     00300182    0.0
>   0  5034   2          2330696    0     00300184    0.0  MAIN_TASK_2056
>   0  5318   5          18446744071736105613 3     00300180   29.5  REG_TASK_2056
>   0  5319   28         36         0     00300182    0.0  WORK_TASK_2056
>   0  5321   38926      39159      0     00300380    0.0  CERECV_2056
>   0  5323   1159385    2438330    0     00300181    0.0  CESEND_2056
>   1  5710   0          18446744071675175740 0     00300184   76.8  HARDWARE_KERNEL
>   0  0      0          18446744071964064315 0     00000000    0.7  IRQ520: [timer]
>   1  0      0          232145209  0     00000000    0.0  IRQ520: [timer] 

OK, at least this bug is a bit easier to fix. Please try this patch
(which also takes the chance and extends the range of our stat counters
a bit):

Index: xenomai/include/nucleus/stat.h
===================================================================
--- xenomai/include/nucleus/stat.h	(Revision 4060)
+++ xenomai/include/nucleus/stat.h	(Arbeitskopie)
@@ -84,20 +84,20 @@ do { \
 
 
 typedef struct xnstat_counter {
-	int counter;
+	unsigned long counter;
 } xnstat_counter_t;
 
-static inline int xnstat_counter_inc(xnstat_counter_t *c)
+static inline unsigned long xnstat_counter_inc(xnstat_counter_t *c)
 {
 	return c->counter++;
 }
 
-static inline int xnstat_counter_get(xnstat_counter_t *c)
+static inline unsigned long xnstat_counter_get(xnstat_counter_t *c)
 {
 	return c->counter;
 }
 
-static inline void xnstat_counter_set(xnstat_counter_t *c, int value)
+static inline void xnstat_counter_set(xnstat_counter_t *c, unsigned long value)
 {
 	c->counter = value;
 }

> 
> My theory is that an occasional "longer" piece of work or a system call in the real-time task corrupts the rest of the system (under some special circumstances).

Yes, some nasty memory corruption is probably the reason. And that is
always hard to track down, especially if it happens very
unpredictably. Nevertheless, if the issue continues to bug you, you will
not get around reducing the test case and trying to increase the
probability of its occurrence.

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux



* [Xenomai-help] Segmentation error by heavy dynamic RT_QUEUE usage
  2008-07-21 11:26                         ` Jan Kiszka
@ 2008-07-31 16:14                           ` Petr Cervenka
  2008-08-12 14:37                             ` Philippe Gerum
  2008-08-13 11:01                           ` [Xenomai-core] [PATCH] Fix stat overruns on 64-bit (was: [Xenomai-help] Kernel panic: not syncing) Jan Kiszka
  1 sibling, 1 reply; 24+ messages in thread
From: Petr Cervenka @ 2008-07-31 16:14 UTC (permalink / raw)
  To: xenomai

[-- Attachment #1: Type: text/plain, Size: 1457 bytes --]

Hello,
I wanted to make a small example to reproduce the kernel panic (and I failed). But during my tests I found another possible error.
I made a small application (as a NetBeans C++ project) with two tasks:
1) a server task with its own RT_QUEUE, waiting for requests;
2) a client task which creates RT_QUEUEs for the responses and sends requests to the server task.
From time to time I get a segmentation fault.
It's always in the server task, after the server binds to the client's queue and allocates a message buffer in it.
It seems that when the server starts to work with this buffer, the client may already have closed the queue.
But this shouldn't be possible, because normally any attempt to close a queue bound by someone else fails with -EBUSY.
The error takes some time to reproduce and needs 2 CPUs (cores): one for the server and one for the client.
My configuration(s):
Athlon XP 2600GHz  X86_64
kernel 2.6.24 (and 2.6.25.11)
adeos 2.6.24 2.0-03 (and 2.0-07)
xenomai 2.4.1 and 2.4.4
I'm also sending examples of the execution script and a matching input.txt file;
both of them should be much longer (input.txt could be several MB)!
The attachment also contains a disassembly of my executable.
And finally, one of the segfault messages:
[ 2553.818731] QT_SERVER[5919]: segfault at 2aaaaac96800 rip 4022b5 rsp 4000fe00 error 6
There are other variants, but always when working with the allocated send buffer.
I know, I'm annoying, but I can't help myself.... ;-)
Petr


[-- Attachment #2: queuetest.tar.bz2 --]
[-- Type: application/octet-stream, Size: 10090 bytes --]

[-- Attachment #3: runme --]
[-- Type: application/octet-stream, Size: 164 bytes --]

#!/bin/sh

../dist/Debug/GNU-Linux-x86/queuetest < input2.txt
../dist/Debug/GNU-Linux-x86/queuetest < input2.txt
../dist/Debug/GNU-Linux-x86/queuetest < input2.txt

[-- Attachment #4: input.txt --]
[-- Type: text/plain, Size: 156 bytes --]

send 490000
recv
echo 490000
sleep 100
send 490000
recv
echo 490000
sleep 100
send 490000
recv
echo 490000
sleep 100
send 490000
recv
echo 490000
sleep 100

[-- Attachment #5: queuetest.asm.tar.bz2 --]
[-- Type: application/octet-stream, Size: 52235 bytes --]


* Re: [Xenomai-help] Segmentation error by heavy dynamic RT_QUEUE usage
  2008-07-31 16:14                           ` [Xenomai-help] Segmentation error by heavy dynamic RT_QUEUE usage Petr Cervenka
@ 2008-08-12 14:37                             ` Philippe Gerum
  0 siblings, 0 replies; 24+ messages in thread
From: Philippe Gerum @ 2008-08-12 14:37 UTC (permalink / raw)
  To: Petr Cervenka; +Cc: xenomai

Petr Cervenka wrote:
> Hello,
> I wanted to make a small example to reproduce the kernel panic (and I failed). But during my tests I found another possible error.
> I made a small application (as a NetBeans C++ project) with two tasks:
> 1) a server task with its own RT_QUEUE, waiting for requests;
> 2) a client task which creates RT_QUEUEs for the responses and sends requests to the server task.
> From time to time I get a segmentation fault.
> It's always in the server task, after the server binds to the client's queue and allocates a message buffer in it.
> It seems that when the server starts to work with this buffer, the client may already have closed the queue.
> But this shouldn't be possible, because normally any attempt to close a queue bound by someone else fails with -EBUSY.
> The error takes some time to reproduce and needs 2 CPUs (cores): one for the server and one for the client.
> My configuration(s):
> Athlon XP 2600GHz  X86_64
> kernel 2.6.24 (and 2.6.25.11)
> adeos 2.6.24 2.0-03 (and 2.0-07)
> xenomai 2.4.1 and 2.4.4
> I'm also sending examples of the execution script and a matching input.txt file;
> both of them should be much longer (input.txt could be several MB)!
> The attachment also contains a disassembly of my executable.
> And finally, one of the segfault messages:
> [ 2553.818731] QT_SERVER[5919]: segfault at 2aaaaac96800 rip 4022b5 rsp 4000fe00 error 6
> There are other variants, but always when working with the allocated send buffer.
> I know, I'm annoying, but I can't help myself.... ;-)

Yeah, but I can't help running useful test code people cared to write either, so
that's ok.

There was a silly bug in the userland wrapper: it unmapped the memory pool from
the application process before issuing the delete syscall, so the pool was gone
even when the syscall refused the deletion (-EBUSY). This issue affects RT_HEAP
objects in exactly the same way.

Fixed in both trees. Thanks for narrowing the issue.

Note: creating / binding to a _shared_ queue switches the caller to secondary
mode, because in both cases, we need to use regular kernel services to mmap()
the memory pool to the application process.

--- src/skins/native/queue.c	(revision 4086)
+++ src/skins/native/queue.c	(working copy)
@@ -114,21 +114,18 @@
 {
 	int err;

-	err = __real_munmap(q->mapbase, q->mapsize);
-
-	if (err)
-		return -EINVAL;
-
 	err = XENOMAI_SKINCALL1(__native_muxid, __native_queue_delete, q);
-
 	if (err)
 		return err;

+	if (__real_munmap(q->mapbase, q->mapsize))
+		err = -errno;
+
 	q->opaque = XN_NO_HANDLE;
 	q->mapbase = NULL;
 	q->mapsize = 0;

-	return 0;
+	return err;
 }

 void *rt_queue_alloc(RT_QUEUE *q, size_t size)

PS: careful with the subject line, heavy / light RT_QUEUE usage is irrelevant
wrt this bug, it is purely a matter of sequence (queue_create -> queue_bind ->
queue_delete) that triggers the rt_queue_delete() wrapper issue.
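
The ordering the patch establishes can be illustrated outside Xenomai. What follows is only a minimal sketch, assuming plain Linux mmap()/munmap(); `fake_queue`, `fake_queue_create`, `fake_queue_delete_syscall` and `fake_queue_delete` are hypothetical stand-ins for RT_QUEUE, the queue creation path, the `__native_queue_delete` skincall and the rt_queue_delete() wrapper, not the real implementation:

```c
#define _DEFAULT_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical stand-in for the user-space queue descriptor. */
struct fake_queue {
	void *mapbase;
	size_t mapsize;
	int bound_peers;	/* simulates other tasks still bound to the queue */
};

/* Hypothetical stand-in for the delete syscall: refuses while peers are bound. */
static int fake_queue_delete_syscall(struct fake_queue *q)
{
	return q->bound_peers > 0 ? -EBUSY : 0;
}

/* Map an anonymous region to play the role of the shared memory pool. */
static int fake_queue_create(struct fake_queue *q, size_t size)
{
	void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return -errno;

	q->mapbase = p;
	q->mapsize = size;
	q->bound_peers = 0;
	return 0;
}

/* Same ordering as the patched wrapper: ask first, unmap only on success. */
static int fake_queue_delete(struct fake_queue *q)
{
	int err = fake_queue_delete_syscall(q);

	if (err)
		return err;	/* deletion refused: the pool stays mapped */

	if (munmap(q->mapbase, q->mapsize))
		err = -errno;

	q->mapbase = NULL;
	q->mapsize = 0;
	return err;
}
```

With `bound_peers` set to 1, `fake_queue_delete()` returns -EBUSY and the pool remains readable and writable; with the pre-patch ordering (munmap first) the same access would fault, which matches the segfaults reported in the server task.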

-- 
Philippe.



* [Xenomai-core] [PATCH] Fix stat overruns on 64-bit (was: [Xenomai-help] Kernel panic: not syncing)
  2008-07-21 11:26                         ` Jan Kiszka
  2008-07-31 16:14                           ` [Xenomai-help] Segmentation error by heavy dynamic RT_QUEUE usage Petr Cervenka
@ 2008-08-13 11:01                           ` Jan Kiszka
  2008-08-13 15:29                             ` [Xenomai-core] [PATCH] Fix stat overruns on 64-bit Philippe Gerum
  1 sibling, 1 reply; 24+ messages in thread
From: Jan Kiszka @ 2008-08-13 11:01 UTC (permalink / raw)
  To: xenomai-core

[-- Attachment #1: Type: text/plain, Size: 2497 bytes --]

Jan Kiszka wrote:
> Petr Cervenka wrote:
>> I ran a test during last weekend (and nothing wrong happened). But the /proc/xenomai/stat output is strange. Probably some type cast error, because 18446744071739514846 = 0xFFFFFFFF8A939FDE and the appropriate value perhaps should be 0x000000008A939FDE = 2324930526.
>>
>> CPU  PID    MSW        CSW        PF    STAT       %CPU  NAME
>>   0  0      0          18446744071739514846 0     00500088   69.8  ROOT/0
>>   1  0      0          18446744071675175740 0     00500080   23.2  ROOT/1
>>   0  5299   0          351459     0     00300182    0.0  LOGGER_TASK_1804289383
>>   0  5100   8          283613     0     00300186    0.0
>>   0  5317   0          40591      0     00300182    0.0
>>   0  5034   2          2330696    0     00300184    0.0  MAIN_TASK_2056
>>   0  5318   5          18446744071736105613 3     00300180   29.5  REG_TASK_2056
>>   0  5319   28         36         0     00300182    0.0  WORK_TASK_2056
>>   0  5321   38926      39159      0     00300380    0.0  CERECV_2056
>>   0  5323   1159385    2438330    0     00300181    0.0  CESEND_2056
>>   1  5710   0          18446744071675175740 0     00300184   76.8  HARDWARE_KERNEL
>>   0  0      0          18446744071964064315 0     00000000    0.7  IRQ520: [timer]
>>   1  0      0          232145209  0     00000000    0.0  IRQ520: [timer] 
> 
> OK, at least this bug is a bit easier to fix. Please try this patch
> (which also takes the chance and extends the range of our stat counters
> a bit):
> 
> Index: xenomai/include/nucleus/stat.h
> ===================================================================
> --- xenomai/include/nucleus/stat.h	(Revision 4060)
> +++ xenomai/include/nucleus/stat.h	(Arbeitskopie)
> @@ -84,20 +84,20 @@ do { \
>  
>  
>  typedef struct xnstat_counter {
> -	int counter;
> +	unsigned long counter;
>  } xnstat_counter_t;
>  
> -static inline int xnstat_counter_inc(xnstat_counter_t *c)
> +static inline unsigned long xnstat_counter_inc(xnstat_counter_t *c)
>  {
>  	return c->counter++;
>  }
>  
> -static inline int xnstat_counter_get(xnstat_counter_t *c)
> +static inline unsigned long xnstat_counter_get(xnstat_counter_t *c)
>  {
>  	return c->counter;
>  }
>  
> -static inline void xnstat_counter_set(xnstat_counter_t *c, int value)
> +static inline void xnstat_counter_set(xnstat_counter_t *c, unsigned long value)
>  {
>  	c->counter = value;
>  }
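
The odd CSW values are consistent with a 32-bit signed counter that has passed INT_MAX and is then widened to a 64-bit unsigned type for printing. A minimal sketch of that arithmetic, assuming a 64-bit output type; the helpers `report_as_int` and `report_as_unsigned` are hypothetical and do not exist in Xenomai:

```c
#include <stdint.h>
#include <string.h>

/*
 * Interpret the 32 raw counter bits as a signed int, then widen to 64 bits.
 * Once bit 31 is set, the widening sign-extends and fills the upper half
 * with ones.
 */
static uint64_t report_as_int(uint32_t raw)
{
	int32_t as_int;

	memcpy(&as_int, &raw, sizeof(as_int));
	return (uint64_t)(int64_t)as_int;
}

/* Interpret the same bits as unsigned: the widening zero-extends. */
static uint64_t report_as_unsigned(uint32_t raw)
{
	return (uint64_t)raw;
}
```

Fed the raw bits 0x8A939FDE, the signed path yields 0xFFFFFFFF8A939FDE (18446744071739514846) and the unsigned path yields 2324930526, the two values quoted above.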

OK to apply those bits?

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]


* Re: [Xenomai-core] [PATCH] Fix stat overruns on 64-bit
  2008-08-13 11:01                           ` [Xenomai-core] [PATCH] Fix stat overruns on 64-bit (was: [Xenomai-help] Kernel panic: not syncing) Jan Kiszka
@ 2008-08-13 15:29                             ` Philippe Gerum
  0 siblings, 0 replies; 24+ messages in thread
From: Philippe Gerum @ 2008-08-13 15:29 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai-core

Jan Kiszka wrote:
> Jan Kiszka wrote:
>> Petr Cervenka wrote:
>>> I ran a test during last weekend (and nothing wrong happened). But the /proc/xenomai/stat output is strange. Probably some type cast error, because 18446744071739514846 = 0xFFFFFFFF8A939FDE and the appropriate value perhaps should be 0x000000008A939FDE = 2324930526.
>>>
>>> CPU  PID    MSW        CSW        PF    STAT       %CPU  NAME
>>>   0  0      0          18446744071739514846 0     00500088   69.8  ROOT/0
>>>   1  0      0          18446744071675175740 0     00500080   23.2  ROOT/1
>>>   0  5299   0          351459     0     00300182    0.0  LOGGER_TASK_1804289383
>>>   0  5100   8          283613     0     00300186    0.0
>>>   0  5317   0          40591      0     00300182    0.0
>>>   0  5034   2          2330696    0     00300184    0.0  MAIN_TASK_2056
>>>   0  5318   5          18446744071736105613 3     00300180   29.5  REG_TASK_2056
>>>   0  5319   28         36         0     00300182    0.0  WORK_TASK_2056
>>>   0  5321   38926      39159      0     00300380    0.0  CERECV_2056
>>>   0  5323   1159385    2438330    0     00300181    0.0  CESEND_2056
>>>   1  5710   0          18446744071675175740 0     00300184   76.8  HARDWARE_KERNEL
>>>   0  0      0          18446744071964064315 0     00000000    0.7  IRQ520: [timer]
>>>   1  0      0          232145209  0     00000000    0.0  IRQ520: [timer] 
>> OK, at least this bug is a bit easier to fix. Please try this patch
>> (which also takes the chance and extends the range of our stat counters
>> a bit):
>>
>> Index: xenomai/include/nucleus/stat.h
>> ===================================================================
>> --- xenomai/include/nucleus/stat.h	(Revision 4060)
>> +++ xenomai/include/nucleus/stat.h	(Arbeitskopie)
>> @@ -84,20 +84,20 @@ do { \
>>  
>>  
>>  typedef struct xnstat_counter {
>> -	int counter;
>> +	unsigned long counter;
>>  } xnstat_counter_t;
>>  
>> -static inline int xnstat_counter_inc(xnstat_counter_t *c)
>> +static inline unsigned long xnstat_counter_inc(xnstat_counter_t *c)
>>  {
>>  	return c->counter++;
>>  }
>>  
>> -static inline int xnstat_counter_get(xnstat_counter_t *c)
>> +static inline unsigned long xnstat_counter_get(xnstat_counter_t *c)
>>  {
>>  	return c->counter;
>>  }
>>  
>> -static inline void xnstat_counter_set(xnstat_counter_t *c, int value)
>> +static inline void xnstat_counter_set(xnstat_counter_t *c, unsigned long value)
>>  {
>>  	c->counter = value;
>>  }
> 
> OK to apply those bits?
>

Sure. Please apply to both branches.

> Jan


-- 
Philippe.



* Re: [Xenomai-core] [PATCH] Fix stat overruns on 64-bit (was: [Xenomai-help] Kernel panic: not syncing)
@ 2008-08-13 15:48 Fillod Stephane
  2008-08-13 17:02 ` Gilles Chanteperdrix
  2008-08-13 17:53 ` Philippe Gerum
  0 siblings, 2 replies; 24+ messages in thread
From: Fillod Stephane @ 2008-08-13 15:48 UTC (permalink / raw)
  To: Jan Kiszka, xenomai-core

Jan Kiszka wrote:
>/proc/xenomai/stat output is strange. Probably some type cast error, 
> because 18446744071739514846 = 0xFFFFFFFF8A939FDE and the appropriate 
> value perhaps should be 0x000000008A939FDE = 2324930526.
[...]

Reminds me that other pending patch for /proc/xenomai/faults:
https://mail.gna.org/public/xenomai-core/2007-12/msg00064.html

-- 
Stephane



* Re: [Xenomai-core] [PATCH] Fix stat overruns on 64-bit (was: [Xenomai-help] Kernel panic: not syncing)
  2008-08-13 15:48 [Xenomai-core] [PATCH] Fix stat overruns on 64-bit (was: [Xenomai-help] Kernel panic: not syncing) Fillod Stephane
@ 2008-08-13 17:02 ` Gilles Chanteperdrix
  2008-08-13 20:50   ` Philippe Gerum
  2008-08-13 17:53 ` Philippe Gerum
  1 sibling, 1 reply; 24+ messages in thread
From: Gilles Chanteperdrix @ 2008-08-13 17:02 UTC (permalink / raw)
  To: Fillod Stephane; +Cc: Jan Kiszka, xenomai-core

Fillod Stephane wrote:
> Jan Kiszka wrote:
>> /proc/xenomai/stat output is strange. Probably some type cast error, 
>> because 18446744071739514846 = 0xFFFFFFFF8A939FDE and the appropriate 
>> value perhaps should be 0x000000008A939FDE = 2324930526.
> [...]
> 
> Reminds me that other pending patch for /proc/xenomai/faults:
> https://mail.gna.org/public/xenomai-core/2007-12/msg00064.html

December 2007? Oh dear! You should remind us more often when we
forg^H^H^H^H take so much time to include your patches.

-- 
                                                 Gilles.



* Re: [Xenomai-core] [PATCH] Fix stat overruns on 64-bit (was: [Xenomai-help] Kernel panic: not syncing)
  2008-08-13 17:02 ` Gilles Chanteperdrix
@ 2008-08-13 20:50   ` Philippe Gerum
  0 siblings, 0 replies; 24+ messages in thread
From: Philippe Gerum @ 2008-08-13 20:50 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: Jan Kiszka, xenomai-core

Gilles Chanteperdrix wrote:
> Fillod Stephane wrote:
>> Jan Kiszka wrote:
>>> /proc/xenomai/stat output is strange. Probably some type cast error, 
>>> because 18446744071739514846 = 0xFFFFFFFF8A939FDE and the appropriate 
>>> value perhaps should be 0x000000008A939FDE = 2324930526.
>> [...]
>>
>> Reminds me that other pending patch for /proc/xenomai/faults:
>> https://mail.gna.org/public/xenomai-core/2007-12/msg00064.html
> 
> December 2007? Oh dear! You should remind us more often when we
> forg^H^H^H^H take so much time to include your patches.
> 

Well, technically, this patch was not forgotten, but was, mmff... "swapped out".
Fact is that my swapper-in sometimes gets swapped out as well. Working on it.

-- 
Philippe.



* Re: [Xenomai-core] [PATCH] Fix stat overruns on 64-bit (was: [Xenomai-help] Kernel panic: not syncing)
  2008-08-13 15:48 [Xenomai-core] [PATCH] Fix stat overruns on 64-bit (was: [Xenomai-help] Kernel panic: not syncing) Fillod Stephane
  2008-08-13 17:02 ` Gilles Chanteperdrix
@ 2008-08-13 17:53 ` Philippe Gerum
  1 sibling, 0 replies; 24+ messages in thread
From: Philippe Gerum @ 2008-08-13 17:53 UTC (permalink / raw)
  To: Fillod Stephane; +Cc: Jan Kiszka, xenomai-core

Fillod Stephane wrote:
> Jan Kiszka wrote:
>> /proc/xenomai/stat output is strange. Probably some type cast error, 
>> because 18446744071739514846 = 0xFFFFFFFF8A939FDE and the appropriate 
>> value perhaps should be 0x000000008A939FDE = 2324930526.
> [...]
> 
> Reminds me that other pending patch for /proc/xenomai/faults:
> https://mail.gna.org/public/xenomai-core/2007-12/msg00064.html
> 

Finally applied, thanks.

-- 
Philippe.



