[Xenomai-core] Frozen timer IRQ

* [Xenomai-core] Frozen timer IRQ
@ 2006-04-04 21:29 Jan Kiszka
  2006-04-05  7:13 ` Philippe Gerum
  2006-04-05 12:10 ` Gilles Chanteperdrix
  0 siblings, 2 replies; 21+ messages in thread
From: Jan Kiszka @ 2006-04-04 21:29 UTC (permalink / raw)
  To: xenomai-core

Hi,

my colleagues and I need some hint where to continue our search for the
cause of a weird cleanup issue:

An application of our robotics framework sometimes terminates (though
successfully) in a way that the system timer IRQ no longer arrives
afterwards or no re-program takes place anymore. All other Linux IRQs
are fine (Ethernet, keyboard, etc.). I cannot provide an easy test case
yet as besides the framework some expensive gyroscope and the 16550A
driver are involved.

Fortunately, we found a clean way of stabilising the application by
fixing our broken code :) and improving the serial driver (RTIOC_PURGE),
so that the original problem is solved now (unreliable startup and
cleanup). Anyway, the stopped timer is not yet explainable, and that's
why we plan to dig deeper.

Jan

* Re: [Xenomai-core] Frozen timer IRQ
  2006-04-04 21:29 [Xenomai-core] Frozen timer IRQ Jan Kiszka
@ 2006-04-05  7:13 ` Philippe Gerum
  2006-04-05 12:10 ` Gilles Chanteperdrix
  1 sibling, 0 replies; 21+ messages in thread
From: Philippe Gerum @ 2006-04-05  7:13 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai-core

Jan Kiszka wrote:
> Hi,
> 
> my colleagues and I need some hint where to continue our search for the
> cause of a weird cleanup issue:
> 
> An application of our robotics framework sometimes terminates (though
> successfully) in a way that the system timer IRQ no longer arrives
> afterwards or no re-program takes place anymore.

Assuming that the APIC is disabled in the kernel configuration, so that 
there could be an issue with the nucleus host timer, I would try to look 
at the state of this timer (XNTIMER_DEQUEUED?) right after the cleanup. 
I would also try to store a copy of the last timer object seen by 
xntimer_next_local_shot(), so that the timer id (htimer or not 
basically) and the programmed tick date could be looked at after the 
cleanup phase. Normally, if no other application timer is active, the 
host timer should be the only one to tick periodically until 
xnpod_shutdown is called, and thus should keep on being reprogrammed by 
xntimer_next_local_shot().

If xnpod_shutdown is called, then this is another story, and 
rthal_timer_release() should be inspected instead.
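
The "store a copy of the last timer object" idea could be sketched roughly as below. This is a standalone mock, not nucleus code: struct mock_timer, MOCK_TIMER_DEQUEUED, record_next_shot() and last_shot_was() are hypothetical names made up for illustration, and the real hook would sit wherever xntimer_next_local_shot() picks the timer to program.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the nucleus timer object; only the fields
 * needed for this sketch are modelled. */
struct mock_timer {
    const char *name;   /* e.g. "htimer" for the host timer */
    uint64_t date;      /* programmed tick date, arbitrary units */
    unsigned status;    /* status bits, e.g. MOCK_TIMER_DEQUEUED */
};

#define MOCK_TIMER_DEQUEUED 0x1u    /* assumed flag value */

/* Snapshot of the last timer handed to the (mock) programming routine. */
struct last_shot {
    const struct mock_timer *timer;
    uint64_t date;
    unsigned status;
    unsigned long count;    /* how many shots were recorded so far */
};

struct last_shot last_shot_record;

/* Call this from the place where the next shot gets programmed. */
void record_next_shot(const struct mock_timer *t)
{
    assert(t != NULL);
    last_shot_record.timer = t;
    last_shot_record.date = t->date;
    last_shot_record.status = t->status;
    last_shot_record.count++;
}

/* Returns nonzero if the last programmed timer was the given host timer. */
int last_shot_was(const struct mock_timer *host)
{
    return last_shot_record.timer == host;
}
```

After the cleanup phase, the snapshot tells whether the host timer was the last one programmed and for which date, which is the information asked for above.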

> All other Linux IRQs
> are fine (Ethernet, keyboard, etc.). I cannot provide an easy test case
> yet as besides the framework some expensive gyroscope and the 16550A
> driver are involved.
> 
> Fortunately, we found a clean way of stabilising the application by
> fixing our broken code :) and improving the serial driver (RTIOC_PURGE),
> so that the original problem is solved now (unreliable startup and
> cleanup). Anyway, the stopped timer is not yet explainable, and that's
> why we plan to dig deeper.
> 
> Jan

-- 

Philippe.

* Re: [Xenomai-core] Frozen timer IRQ
  2006-04-04 21:29 [Xenomai-core] Frozen timer IRQ Jan Kiszka
  2006-04-05  7:13 ` Philippe Gerum
@ 2006-04-05 12:10 ` Gilles Chanteperdrix
  2006-04-05 12:29   ` Philippe Gerum
  2006-04-05 12:38   ` Philippe Gerum
  1 sibling, 2 replies; 21+ messages in thread
From: Gilles Chanteperdrix @ 2006-04-05 12:10 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai-core

Jan Kiszka wrote:
 > Hi,
 > 
 > my colleagues and I need some hint where to continue our search for the
 > cause of a weird cleanup issue:
 > 
 > An application of our robotics framework sometimes terminates (though
 > successfully) in a way that the system timer IRQ no longer arrives
 > afterwards or no re-program takes place anymore. All other Linux IRQs
 > are fine (Ethernet, keyboard, etc.). I cannot provide an easy test case
 > yet as besides the framework some expensive gyroscope and the 16550A
 > driver are involved.

I observed a similar issue when xnpod_stop_timer was called when
shutting down the posix skin. I assumed that the problem was that
xnpod_shutdown already called xnpod_stop_timer, so xnpod_stop_timer (and
in particular xnarch_stop_timer) ended up being called twice.

-- 


					    Gilles Chanteperdrix.


* Re: [Xenomai-core] Frozen timer IRQ
  2006-04-05 12:10 ` Gilles Chanteperdrix
@ 2006-04-05 12:29   ` Philippe Gerum
  2006-04-05 12:38   ` Philippe Gerum
  1 sibling, 0 replies; 21+ messages in thread
From: Philippe Gerum @ 2006-04-05 12:29 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: Jan Kiszka, xenomai-core

Gilles Chanteperdrix wrote:
> Jan Kiszka wrote:
>  > Hi,
>  > 
>  > my colleagues and I need some hint where to continue our search for the
>  > cause of a weird cleanup issue:
>  > 
>  > An application of our robotics framework sometimes terminates (though
>  > successfully) in a way that the system timer IRQ no longer arrives
>  > afterwards or no re-program takes place anymore. All other Linux IRQs
>  > are fine (Ethernet, keyboard, etc.). I cannot provide an easy test case
>  > yet as besides the framework some expensive gyroscope and the 16550A
>  > driver are involved.
> 
> I observed a similar issue when xnpod_stop_timer was called when
> shutting down the posix skin. I assumed that the problem was that
> xnpod_shutdown already called xnpod_stop_timer, so xnpod_stop_timer (and
> in particular xnarch_stop_timer) ended up being called twice.
> 

The XNTIMED bit from the pod's status is checked to trap multiple 
invocations, so this should not -normally- cause any issue.

-- 

Philippe.


* Re: [Xenomai-core] Frozen timer IRQ
  2006-04-05 12:10 ` Gilles Chanteperdrix
  2006-04-05 12:29   ` Philippe Gerum
@ 2006-04-05 12:38   ` Philippe Gerum
  2006-04-05 13:05     ` Philippe Gerum
  1 sibling, 1 reply; 21+ messages in thread
From: Philippe Gerum @ 2006-04-05 12:38 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: Jan Kiszka, xenomai-core

Gilles Chanteperdrix wrote:
> Jan Kiszka wrote:
>  > Hi,
>  > 
>  > my colleagues and I need some hint where to continue our search for the
>  > cause of a weird cleanup issue:
>  > 
>  > An application of our robotics framework sometimes terminates (though
>  > successfully) in a way that the system timer IRQ no longer arrives
>  > afterwards or no re-program takes place anymore. All other Linux IRQs
>  > are fine (Ethernet, keyboard, etc.). I cannot provide an easy test case
>  > yet as besides the framework some expensive gyroscope and the 16550A
>  > driver are involved.
> 
> I observed a similar issue when xnpod_stop_timer was called when
> shutting down the posix skin. I assumed that the problem was that
> xnpod_shutdown already called xnpod_stop_timer, so xnpod_stop_timer (and
> in particular xnarch_stop_timer) ended up being called twice.
> 

Err, sorry. Forget about my previous reply: xnarch_stop_timer is _not_ 
protected by the XNTIMED flag, but only the last part of the 
housekeeping chores performed upon stopping the systimer are. IOW, this 
is a latent bug, and xnpod_stop_timer should be fixed.
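
For illustration only, the general pattern of making a stop routine safe against a second invocation might look like the sketch below. It is a userspace mock under stated assumptions: struct mock_pod, MOCK_XNTIMED, mock_arch_stop_timer() and mock_pod_stop_timer() are hypothetical names, and nothing here claims to reproduce the actual nucleus code or the eventual fix.

```c
#include <assert.h>

#define MOCK_XNTIMED 0x1u   /* hypothetical "system timer running" bit */

struct mock_pod {
    unsigned status;        /* status bits of the pod */
    int hw_stop_calls;      /* how often the hardware-level stop ran */
};

/* Stand-in for the arch-level stop routine, which must run only once. */
void mock_arch_stop_timer(struct mock_pod *pod)
{
    pod->hw_stop_calls++;
}

/* Guarded stop: test the flag before touching the hardware, so that a
 * second invocation becomes a no-op instead of releasing the timer twice. */
void mock_pod_stop_timer(struct mock_pod *pod)
{
    assert(pod != 0);

    if (!(pod->status & MOCK_XNTIMED))
        return;             /* already stopped, nothing to release */

    pod->status &= ~MOCK_XNTIMED;
    mock_arch_stop_timer(pod);
    /* the remaining housekeeping chores would follow here */
}
```

The point of the sketch is only the ordering: the flag check has to cover the hardware-level call as well, not just the trailing housekeeping.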

-- 

Philippe.


* Re: [Xenomai-core] Frozen timer IRQ
  2006-04-05 12:38   ` Philippe Gerum
@ 2006-04-05 13:05     ` Philippe Gerum
  2006-04-05 19:30       ` Jan Kiszka
  0 siblings, 1 reply; 21+ messages in thread
From: Philippe Gerum @ 2006-04-05 13:05 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: Jan Kiszka, xenomai-core

Philippe Gerum wrote:
> Gilles Chanteperdrix wrote:
> 
>> Jan Kiszka wrote:
>>  > Hi,
>>  >  > my colleagues and I need some hint where to continue our search 
>> for the
>>  > cause of a weird cleanup issue:
>>  >  > An application of our robotics framework sometimes terminates 
>> (though
>>  > successfully) in a way that the system timer IRQ no longer arrives
>>  > afterwards or no re-program takes place anymore. All other Linux IRQs
>>  > are fine (Ethernet, keyboard, etc.). I cannot provide an easy test 
>> case
>>  > yet as besides the framework some expensive gyroscope and the 16550A
>>  > driver are involved.
>>
>> I observed a similar issue when xnpod_stop_timer was called when
>> shutting down the posix skin. I assumed that the problem was that
>> xnpod_shutdown already called xnpod_stop_timer, so xnpod_stop_timer (and
>> in particular xnarch_stop_timer) ended up being called twice.
>>
> 
> Err, sorry. Forget about my previous reply: xnarch_stop_timer is _not_ 
> protected by the XNTIMED flag, but only the last part of the 
> housekeeping chores performed upon stopping the systimer are. IOW, this 
> is a latent bug, and xnpod_stop_timer should be fixed.
> 

Commit 884 should do that.

-- 

Philippe.


* Re: [Xenomai-core] Frozen timer IRQ
  2006-04-05 13:05     ` Philippe Gerum
@ 2006-04-05 19:30       ` Jan Kiszka
  2006-04-05 21:56         ` Jan Kiszka
  2006-04-06 17:10         ` [Xenomai-core] Frozen timer IRQ Philippe Gerum
  0 siblings, 2 replies; 21+ messages in thread
From: Jan Kiszka @ 2006-04-05 19:30 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: xenomai-core

Philippe Gerum wrote:
> Philippe Gerum wrote:
>> Gilles Chanteperdrix wrote:
>>
>>> Jan Kiszka wrote:
>>>  > Hi,
>>>  >  > my colleagues and I need some hint where to continue our search
>>> for the
>>>  > cause of a weird cleanup issue:
>>>  >  > An application of our robotics framework sometimes terminates
>>> (though
>>>  > successfully) in a way that the system timer IRQ no longer arrives
>>>  > afterwards or no re-program takes place anymore. All other Linux IRQs
>>>  > are fine (Ethernet, keyboard, etc.). I cannot provide an easy test
>>> case
>>>  > yet as besides the framework some expensive gyroscope and the 16550A
>>>  > driver are involved.
>>>
>>> I observed a similar issue when xnpod_stop_timer was called when
>>> shutting down the posix skin. I assumed that the problem was that
>>> xnpod_shutdown already called xnpod_stop_timer, so xnpod_stop_timer (and
>>> in particular xnarch_stop_timer) ended up being called twice.
>>>
>>
>> Err, sorry. Forget about my previous reply: xnarch_stop_timer is _not_
>> protected by the XNTIMED flag, but only the last part of the
>> housekeeping chores performed upon stopping the systimer are. IOW,
>> this is a latent bug, and xnpod_stop_timer should be fixed.
>>
> 
> Commit 884 should do that.
> 

Sorry for replying late: nope, this has no influence on our issue.

Well, someone put that damn piece of hardware on my desk, saying: "It
doesn't work." What he did not say is that there are multiple issues
contained :-/. I found and fixed (patch will follow) a severe bug in the
16550A driver, but the strange timer issue stays (though it's still
tricky to reproduce).

The point is - and that's likely why your patch doesn't help - that we
do not stop the system timer, i.e. unload all skins. We just terminate
an application. I did some research but failed to find a test case (only
our software "manages" to trigger this). Actually, it seems the hardware
timer is no longer working, because also other RT-tasks no longer time
out. Moreover, I checked nkpod->htimer.status, but it remains 0 all the
time. I need more time...

Jan

* Re: [Xenomai-core] Frozen timer IRQ
  2006-04-05 19:30       ` Jan Kiszka
@ 2006-04-05 21:56         ` Jan Kiszka
  2006-04-05 21:58           ` Jan Kiszka
  2006-04-06 15:04           ` Philippe Gerum
  2006-04-06 17:10         ` [Xenomai-core] Frozen timer IRQ Philippe Gerum
  1 sibling, 2 replies; 21+ messages in thread
From: Jan Kiszka @ 2006-04-05 21:56 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: xenomai-core

Jan Kiszka wrote:
> Philippe Gerum wrote:
>> Philippe Gerum wrote:
>>> Gilles Chanteperdrix wrote:
>>>
>>>> Jan Kiszka wrote:
>>>>  > Hi,
>>>>  >  > my colleagues and I need some hint where to continue our search
>>>> for the
>>>>  > cause of a weird cleanup issue:
>>>>  >  > An application of our robotics framework sometimes terminates
>>>> (though
>>>>  > successfully) in a way that the system timer IRQ no longer arrives
>>>>  > afterwards or no re-program takes place anymore. All other Linux IRQs
>>>>  > are fine (Ethernet, keyboard, etc.). I cannot provide an easy test
>>>> case
>>>>  > yet as besides the framework some expensive gyroscope and the 16550A
>>>>  > driver are involved.
>>>>
>>>> I observed a similar issue when xnpod_stop_timer was called when
>>>> shutting down the posix skin. I assumed that the problem was that
>>>> xnpod_shutdown already called xnpod_stop_timer, so xnpod_stop_timer (and
>>>> in particular xnarch_stop_timer) ended up being called twice.
>>>>
>>> Err, sorry. Forget about my previous reply: xnarch_stop_timer is _not_
>>> protected by the XNTIMED flag, but only the last part of the
>>> housekeeping chores performed upon stopping the systimer are. IOW,
>>> this is a latent bug, and xnpod_stop_timer should be fixed.
>>>
>> Commit 884 should do that.
>>
> 
> Sorry for replying late: nope, this has no influence on our issue.
> 
> Well, someone put that damn piece of hardware on my desk, saying: "It
> doesn't work." What he did not say is that there are multiple issues
> contained :-/. I found and fixed (patch will follow) a severe bug in the
> 16550A driver, but the strange timer issue stays (though it's still
> tricky to reproduce).
> 
> The point is - and that's likely why your patch doesn't help - that we
> do not stop the system timer, i.e. unload all skins. We just terminate
> an application. I did some research but failed to find a test case (only
> our software "manages" to trigger this). Actually, it seems the hardware
> timer is no longer working, because also other RT-tasks no longer time
> out. Moreover, I checked nkpod->htimer.status, but it remains 0 all the
> time. I need more time...
> 

Attached is an ipipe-freeze of the frozen system. It's taken at the time
the main thread of the terminating application has successfully
rt_task_join'ed the last remaining RT-thread. I took 2000 trace points
before and after that point and additionally instrumented
rthal_timer_program_shot() (special trace 0x01, the argument is the
delay). The interesting stuff happens around 600 us after the freeze: it
seems the scheduled Linux timer arrives then but doesn't get much
attention beyond from ipipe.

Any idea what to look for next? I have a "perfect" test system now,
though I still see no light at the end of the tunnel how to export it to
other boxes.
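
As a rough picture of this kind of instrumentation (a freeze that keeps a window of trace points around a trigger, plus a tagged probe carrying one argument), here is a small standalone sketch. It is not the I-pipe tracer: struct tracer, mock_trace_special(), mock_trace_freeze() and the tiny ring size are all assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TRACE_POINTS 8      /* tiny for illustration; the mail used 2000 */

struct trace_point {
    uint8_t id;             /* e.g. 0x01 for the timer-programming probe */
    uint64_t value;         /* e.g. the programmed delay */
};

struct tracer {
    struct trace_point ring[TRACE_POINTS];
    size_t head;            /* next slot to write */
    size_t recorded;        /* total points accepted so far */
    int freezing;           /* nonzero once a freeze was requested */
    size_t post_left;       /* points still accepted after the freeze */
};

/* Record one tagged trace point unless the trace is already frozen. */
void mock_trace_special(struct tracer *t, uint8_t id, uint64_t value)
{
    assert(t != NULL);

    if (t->freezing) {
        if (t->post_left == 0)
            return;         /* frozen: keep the captured window intact */
        t->post_left--;
    }
    t->ring[t->head].id = id;
    t->ring[t->head].value = value;
    t->head = (t->head + 1) % TRACE_POINTS;
    t->recorded++;
}

/* Request a freeze that still lets post_points further points through. */
void mock_trace_freeze(struct tracer *t, size_t post_points)
{
    assert(t != NULL);

    if (!t->freezing) {
        t->freezing = 1;
        t->post_left = post_points;
    }
}
```

A probe placed in the timer-programming path would then call mock_trace_special(&t, 0x01, delay), so that every reprogramming (or the lack of it) shows up in the frozen window.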

Enough for today.

Jan


PS: This trace was taken over 2.6.15 to exclude any issues with the new
2.6.16. Both kernels show the same effect.


* Re: [Xenomai-core] Frozen timer IRQ
  2006-04-05 21:56         ` Jan Kiszka
@ 2006-04-05 21:58           ` Jan Kiszka
  2006-04-06 15:04           ` Philippe Gerum
  1 sibling, 0 replies; 21+ messages in thread
From: Jan Kiszka @ 2006-04-05 21:58 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: xenomai-core


Jan Kiszka wrote:
> Jan Kiszka wrote:
>> Philippe Gerum wrote:
>>> Philippe Gerum wrote:
>>>> Gilles Chanteperdrix wrote:
>>>>
>>>>> Jan Kiszka wrote:
>>>>>  > Hi,
>>>>>  >  > my colleagues and I need some hint where to continue our search
>>>>> for the
>>>>>  > cause of a weird cleanup issue:
>>>>>  >  > An application of our robotics framework sometimes terminates
>>>>> (though
>>>>>  > successfully) in a way that the system timer IRQ no longer arrives
>>>>>  > afterwards or no re-program takes place anymore. All other Linux IRQs
>>>>>  > are fine (Ethernet, keyboard, etc.). I cannot provide an easy test
>>>>> case
>>>>>  > yet as besides the framework some expensive gyroscope and the 16550A
>>>>>  > driver are involved.
>>>>>
>>>>> I observed a similar issue when xnpod_stop_timer was called when
>>>>> shutting down the posix skin. I assumed that the problem was that
>>>>> xnpod_shutdown already called xnpod_stop_timer, so xnpod_stop_timer (and
>>>>> in particular xnarch_stop_timer) ended up being called twice.
>>>>>
>>>> Err, sorry. Forget about my previous reply: xnarch_stop_timer is _not_
>>>> protected by the XNTIMED flag, but only the last part of the
>>>> housekeeping chores performed upon stopping the systimer are. IOW,
>>>> this is a latent bug, and xnpod_stop_timer should be fixed.
>>>>
>>> Commit 884 should do that.
>>>
>> Sorry for replying late: nope, this has no influence on our issue.
>>
>> Well, someone put that damn piece of hardware on my desk, saying: "It
>> doesn't work." What he did not say is that there are multiple issues
>> contained :-/. I found and fixed (patch will follow) a severe bug in the
>> 16550A driver, but the strange timer issue stays (though it's still
>> tricky to reproduce).
>>
>> The point is - and that's likely why your patch doesn't help - that we
>> do not stop the system timer, i.e. unload all skins. We just terminate
>> an application. I did some research but failed to find a test case (only
>> our software "manages" to trigger this). Actually, it seems the hardware
>> timer is no longer working, because also other RT-tasks no longer time
>> out. Moreover, I checked nkpod->htimer.status, but it remains 0 all the
>> time. I need more time...
>>
> 
> Attached is an ipipe-freeze of the frozen system. It's taken at the time

F***, the usual "see [not-attached] attachment".


[-- Attachment #1.2: frozen-timer2.bz2 --]
[-- Type: application/octet-stream, Size: 18756 bytes --]

* Re: [Xenomai-core] Frozen timer IRQ
  2006-04-05 21:56         ` Jan Kiszka
  2006-04-05 21:58           ` Jan Kiszka
@ 2006-04-06 15:04           ` Philippe Gerum
  2006-04-06 15:29             ` Jan Kiszka
  1 sibling, 1 reply; 21+ messages in thread
From: Philippe Gerum @ 2006-04-06 15:04 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai-core

Jan Kiszka wrote:
> 
> Attached is an ipipe-freeze of the frozen system. It's taken at the time
> the main thread of the terminating application has successfully
> rt_task_join'ed the last remaining RT-thread. I took 2000 trace points
> before and after that point and additionally instrumented
> rthal_timer_program_shot() (special trace 0x01, the argument is the
> delay). The interesting stuff happens around 600 us after the freeze: it
> seems the scheduled Linux timer arrives then but doesn't get much
> attention beyond from ipipe.
> 
> Any idea what to look for next? I have a "perfect" test system now,
> though I still see no light at the end of the tunnel how to export it to
> other boxes.
> 
> Enough for today.
> 
> Jan
> 
> 
> PS: This trace was taken over 2.6.15 to exclude any issues with the new
> 2.6.16. Both kernels show the same effect.
> 

Does this patch make any difference?

--- ipipe-root.c~	2006-01-31 09:55:44.000000000 +0100
+++ ipipe-root.c	2006-04-06 17:01:49.000000000 +0200
@@ -328,9 +328,8 @@
  		/* Only sync virtual IRQs here, so that we don't recurse
  		   indefinitely in case of an external interrupt flood. */

-		if ((ipipe_root_domain->cpudata[cpuid].
-		     irq_pending_hi & IPIPE_IRQMASK_VIRT) != 0)
-			__ipipe_sync_stage(IPIPE_IRQMASK_VIRT);
+		if (ipipe_root_domain->cpudata[cpuid].irq_pending_hi != 0)
+			__ipipe_sync_stage(IPIPE_IRQMASK_ANY);
  	}
  #ifdef CONFIG_IPIPE_TRACE_IRQSOFF
  	ipipe_trace_end(0x8000000D);
-- 

Philippe.
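
To make the difference between the two conditions in the diff above easier to see, here is a small standalone sketch. It treats the pending word as a flat bitmask, which is a simplification, and the mask values are invented: MOCK_IRQMASK_ANY, MOCK_IRQMASK_VIRT and both helper functions are hypothetical and do not reflect the real I-pipe definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: the low 24 bits flag pending hardware IRQs and the
 * upper 8 bits flag pending virtual IRQs. The real I-pipe masks differ. */
#define MOCK_IRQMASK_ANY  UINT32_C(0xffffffff)
#define MOCK_IRQMASK_VIRT UINT32_C(0xff000000)

/* Original test: sync only when a virtual IRQ is pending. */
int needs_sync_virt_only(uint32_t pending_hi)
{
    return (pending_hi & MOCK_IRQMASK_VIRT) != 0;
}

/* Patched test: sync whenever anything at all is pending. */
int needs_sync_any(uint32_t pending_hi)
{
    return (pending_hi & MOCK_IRQMASK_ANY) != 0;
}
```

With only a hardware IRQ pending (say bit 0 in this mock layout), the first test skips the sync while the second one triggers it, which is the behavioural change the experiment probes.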


* Re: [Xenomai-core] Frozen timer IRQ
  2006-04-06 15:04           ` Philippe Gerum
@ 2006-04-06 15:29             ` Jan Kiszka
  2006-04-06 15:39               ` Philippe Gerum
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Kiszka @ 2006-04-06 15:29 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: xenomai-core

Philippe Gerum wrote:
> Jan Kiszka wrote:
>>
>> Attached is an ipipe-freeze of the frozen system. It's taken at the time
>> the main thread of the terminating application has successfully
>> rt_task_join'ed the last remaining RT-thread. I took 2000 trace points
>> before and after that point and additionally instrumented
>> rthal_timer_program_shot() (special trace 0x01, the argument is the
>> delay). The interesting stuff happens around 600 us after the freeze: it
>> seems the scheduled Linux timer arrives then but doesn't get much
>> attention beyond from ipipe.
>>
>> Any idea what to look for next? I have a "perfect" test system now,
>> though I still see no light at the end of the tunnel how to export it to
>> other boxes.
>>
>> Enough for today.
>>
>> Jan
>>
>>
>> PS: This trace was taken over 2.6.15 to exclude any issues with the new
>> 2.6.16. Both kernels show the same effect.
>>
> 
> Does this patch make any difference?
> 
> --- ipipe-root.c~    2006-01-31 09:55:44.000000000 +0100
> +++ ipipe-root.c    2006-04-06 17:01:49.000000000 +0200
> @@ -328,9 +328,8 @@
>          /* Only sync virtual IRQs here, so that we don't recurse
>             indefinitely in case of an external interrupt flood. */
> 
> -        if ((ipipe_root_domain->cpudata[cpuid].
> -             irq_pending_hi & IPIPE_IRQMASK_VIRT) != 0)
> -            __ipipe_sync_stage(IPIPE_IRQMASK_VIRT);
> +        if (ipipe_root_domain->cpudata[cpuid].irq_pending_hi != 0)
> +            __ipipe_sync_stage(IPIPE_IRQMASK_ANY);
>      }
>  #ifdef CONFIG_IPIPE_TRACE_IRQSOFF
>      ipipe_trace_end(0x8000000D);

Nope.

Where should I put my finger on to find out what's happening?

Jan


* Re: [Xenomai-core] Frozen timer IRQ
  2006-04-06 15:29             ` Jan Kiszka
@ 2006-04-06 15:39               ` Philippe Gerum
  2006-04-06 15:46                 ` Jan Kiszka
  0 siblings, 1 reply; 21+ messages in thread
From: Philippe Gerum @ 2006-04-06 15:39 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai-core

Jan Kiszka wrote:
> Philippe Gerum wrote:
> 
>>Jan Kiszka wrote:
>>
>>>Attached is an ipipe-freeze of the frozen system. It's taken at the time
>>>the main thread of the terminating application has successfully
>>>rt_task_join'ed the last remaining RT-thread. I took 2000 trace points
>>>before and after that point and additionally instrumented
>>>rthal_timer_program_shot() (special trace 0x01, the argument is the
>>>delay). The interesting stuff happens around 600 us after the freeze: it
>>>seems the scheduled Linux timer arrives then but doesn't get much
>>>attention beyond from ipipe.
>>>
>>>Any idea what to look for next? I have a "perfect" test system now,
>>>though I still see no light at the end of the tunnel how to export it to
>>>other boxes.
>>>
>>>Enough for today.
>>>
>>>Jan
>>>
>>>
>>>PS: This trace was taken over 2.6.15 to exclude any issues with the new
>>>2.6.16. Both kernels show the same effect.
>>>
>>
>>Does this patch make any difference?
>>
>>--- ipipe-root.c~    2006-01-31 09:55:44.000000000 +0100
>>+++ ipipe-root.c    2006-04-06 17:01:49.000000000 +0200
>>@@ -328,9 +328,8 @@
>>         /* Only sync virtual IRQs here, so that we don't recurse
>>            indefinitely in case of an external interrupt flood. */
>>
>>-        if ((ipipe_root_domain->cpudata[cpuid].
>>-             irq_pending_hi & IPIPE_IRQMASK_VIRT) != 0)
>>-            __ipipe_sync_stage(IPIPE_IRQMASK_VIRT);
>>+        if (ipipe_root_domain->cpudata[cpuid].irq_pending_hi != 0)
>>+            __ipipe_sync_stage(IPIPE_IRQMASK_ANY);
>>     }
>> #ifdef CONFIG_IPIPE_TRACE_IRQSOFF
>>     ipipe_trace_end(0x8000000D);
> 
> 
> Nope.

That's good news, actually. I would have been quite embarrassed if it did it.

> 
> Where should I put my finger on to find out what's happening?
> 

It seems that the pipeline log is not synced by __ipipe_unstall_iret_root.
We need to know why. Question: is the root stage stalled or unstalled by this
routine during the latest call before the box freezes?

PS: it would be nice to display the status of the current stage
(stalled/unstalled) and the one of the hw interrupt bit, for each trace.

> Jan
> 


-- 

Philippe.


* Re: [Xenomai-core] Frozen timer IRQ
  2006-04-06 15:39               ` Philippe Gerum
@ 2006-04-06 15:46                 ` Jan Kiszka
  2006-04-06 17:15                   ` Philippe Gerum
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Kiszka @ 2006-04-06 15:46 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: xenomai-core

Philippe Gerum wrote:
> Jan Kiszka wrote:
>> Philippe Gerum wrote:
>>
>>> Jan Kiszka wrote:
>>>
>>>> Attached is an ipipe-freeze of the frozen system. It's taken at the
>>>> time
>>>> the main thread of the terminating application has successfully
>>>> rt_task_join'ed the last remaining RT-thread. I took 2000 trace points
>>>> before and after that point and additionally instrumented
>>>> rthal_timer_program_shot() (special trace 0x01, the argument is the
>>>> delay). The interesting stuff happens around 600 us after the
>>>> freeze: it
>>>> seems the scheduled Linux timer arrives then but doesn't get much
>>>> attention beyond from ipipe.
>>>>
>>>> Any idea what to look for next? I have a "perfect" test system now,
>>>> though I still see no light at the end of the tunnel how to export
>>>> it to
>>>> other boxes.
>>>>
>>>> Enough for today.
>>>>
>>>> Jan
>>>>
>>>>
>>>> PS: This trace was taken over 2.6.15 to exclude any issues with the new
>>>> 2.6.16. Both kernels show the same effect.
>>>>
>>>
>>> Does this patch make any difference?
>>>
>>> --- ipipe-root.c~    2006-01-31 09:55:44.000000000 +0100
>>> +++ ipipe-root.c    2006-04-06 17:01:49.000000000 +0200
>>> @@ -328,9 +328,8 @@
>>>         /* Only sync virtual IRQs here, so that we don't recurse
>>>            indefinitely in case of an external interrupt flood. */
>>>
>>> -        if ((ipipe_root_domain->cpudata[cpuid].
>>> -             irq_pending_hi & IPIPE_IRQMASK_VIRT) != 0)
>>> -            __ipipe_sync_stage(IPIPE_IRQMASK_VIRT);
>>> +        if (ipipe_root_domain->cpudata[cpuid].irq_pending_hi != 0)
>>> +            __ipipe_sync_stage(IPIPE_IRQMASK_ANY);
>>>     }
>>> #ifdef CONFIG_IPIPE_TRACE_IRQSOFF
>>>     ipipe_trace_end(0x8000000D);
>>
>>
>> Nope.
> 
> That's good news, actually. I would have been quite embarrassed if it did
> it.
> 
>>
>> Where should I put my finger on to find out what's happening?
>>
> 
> It seems that the pipeline log is not synced by __ipipe_unstall_iret_root.
> We need to know why. Question: is the root stage stalled or unstalled by
> this
> routine during the latest call before the box freezes?

I'm currently switching my brain between too many tasks: Could you simply
tell me what variable to check so that I can hack some
ipipe_trace_special into the kernel?

> 
> PS: it would be nice to display the status of the current stage
> (stalled/unstalled) and the one of the hw interrupt bit, for each trace.

Patches are welcome :) - wait, you are the Adeos maintainer! ;)

Actually, the hw-irq state is already expressed by "|" at the head of
each line ("|" means "hw-IRQs off").

Jan


* Re: [Xenomai-core] Frozen timer IRQ
  2006-04-06 15:46                 ` Jan Kiszka
@ 2006-04-06 17:15                   ` Philippe Gerum
  2006-04-07 11:57                     ` Jan Kiszka
  2006-04-07 13:02                     ` Jan Kiszka
  0 siblings, 2 replies; 21+ messages in thread
From: Philippe Gerum @ 2006-04-06 17:15 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai-core

Jan Kiszka wrote:
> Philippe Gerum wrote:
> 
>>Jan Kiszka wrote:
>>
>>>Philippe Gerum wrote:
>>>
>>>
>>>>Jan Kiszka wrote:
>>>>
>>>>
>>>>>Attached is an ipipe-freeze of the frozen system. It's taken at the
>>>>>time
>>>>>the main thread of the terminating application has successfully
>>>>>rt_task_join'ed the last remaining RT-thread. I took 2000 trace points
>>>>>before and after that point and additionally instrumented
>>>>>rthal_timer_program_shot() (special trace 0x01, the argument is the
>>>>>delay). The interesting stuff happens around 600 us after the
>>>>>freeze: it
>>>>>seems the scheduled Linux timer arrives then but doesn't get much
>>>>>attention beyond from ipipe.
>>>>>
>>>>>Any idea what to look for next? I have a "perfect" test system now,
>>>>>though I still see no light at the end of the tunnel how to export
>>>>>it to
>>>>>other boxes.
>>>>>
>>>>>Enough for today.
>>>>>
>>>>>Jan
>>>>>
>>>>>
>>>>>PS: This trace was taken over 2.6.15 to exclude any issues with the new
>>>>>2.6.16. Both kernels show the same effect.
>>>>>
>>>>
>>>>Does this patch make any difference?
>>>>
>>>>--- ipipe-root.c~    2006-01-31 09:55:44.000000000 +0100
>>>>+++ ipipe-root.c    2006-04-06 17:01:49.000000000 +0200
>>>>@@ -328,9 +328,8 @@
>>>>        /* Only sync virtual IRQs here, so that we don't recurse
>>>>           indefinitely in case of an external interrupt flood. */
>>>>
>>>>-        if ((ipipe_root_domain->cpudata[cpuid].
>>>>-             irq_pending_hi & IPIPE_IRQMASK_VIRT) != 0)
>>>>-            __ipipe_sync_stage(IPIPE_IRQMASK_VIRT);
>>>>+        if (ipipe_root_domain->cpudata[cpuid].irq_pending_hi != 0)
>>>>+            __ipipe_sync_stage(IPIPE_IRQMASK_ANY);
>>>>    }
>>>>#ifdef CONFIG_IPIPE_TRACE_IRQSOFF
>>>>    ipipe_trace_end(0x8000000D);
>>>
>>>
>>>Nope.
>>
>>That's good news, actually. I would have been quite embarrassed if it did
>>it.
>>
>>
>>>Where should I put my finger on to find out what's happening?
>>>
>>
>>It seems that the pipeline log is not synced by __ipipe_unstall_iret_root.
>>We need to know why. Question: is the root stage stalled or unstalled by
>>this
>>routine during the latest call before the box freezes?
> 
> 
> I'm currently switching my brain between too many tasks: Could you simply
> tell me what variable to check so that I can hack some
> ipipe_trace_special into the kernel?

The value of the IPIPE_STALL_FLAG for the root domain upon exit from 
__ipipe_unstall_iret_root.
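
Why the stall flag matters can be pictured with the standalone mock below: an interrupt posted to a stalled stage is only logged, and it gets delivered when the stage is unstalled and its log is synced. If the stage were left stalled, the logged timer IRQ would sit there forever. All names (struct mock_stage, MOCK_STALL_FLAG, mock_post_irq() and friends) are hypothetical, and this is a simplified model, not the Adeos implementation.

```c
#include <assert.h>
#include <stdint.h>

#define MOCK_STALL_FLAG 0x1u    /* hypothetical stall bit */
#define MOCK_NR_IRQS 32u

struct mock_stage {
    unsigned status;                    /* MOCK_STALL_FLAG set => stalled */
    uint32_t pending;                   /* one bit per logged IRQ */
    unsigned delivered[MOCK_NR_IRQS];   /* delivery count per IRQ */
};

/* The value one would trace on exit from the unstall routine. */
int mock_stage_is_stalled(const struct mock_stage *s)
{
    return (s->status & MOCK_STALL_FLAG) != 0;
}

/* Deliver every logged IRQ and clear its pending bit. */
void mock_sync_stage(struct mock_stage *s)
{
    for (unsigned irq = 0; irq < MOCK_NR_IRQS; irq++) {
        uint32_t bit = UINT32_C(1) << irq;

        if (s->pending & bit) {
            s->pending &= ~bit;
            s->delivered[irq]++;
        }
    }
}

/* Log the IRQ; deliver it right away only if the stage is not stalled. */
void mock_post_irq(struct mock_stage *s, unsigned irq)
{
    assert(irq < MOCK_NR_IRQS);
    s->pending |= UINT32_C(1) << irq;
    if (!mock_stage_is_stalled(s))
        mock_sync_stage(s);
}

void mock_stall(struct mock_stage *s)
{
    s->status |= MOCK_STALL_FLAG;
}

/* Unstalling clears the flag and replays whatever was logged meanwhile. */
void mock_unstall(struct mock_stage *s)
{
    s->status &= ~MOCK_STALL_FLAG;
    mock_sync_stage(s);
}
```

In this model, tracing mock_stage_is_stalled() at the exit of the unstall path would show at once whether a pending timer IRQ still has a chance of being replayed.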

> 
> 
>>PS: it would be nice to display the status of the current stage
>>(stalled/unstalled) and the one of the hw interrupt bit, for each trace.
> 
> 
> Patches are welcome :) - wait, you are the Adeos maintainer! ;)
> 
> Actually, the hw-irq state is already expressed by "|" at the head of
> each line ("|" means "hw-IRQs off").
> 

Ok, I'm rather short on time too, so let's drop this for now and keep it
on the todo list so that we get back to this when time allows.

-- 

Philippe.



* Re: [Xenomai-core] Frozen timer IRQ
  2006-04-06 17:15                   ` Philippe Gerum
@ 2006-04-07 11:57                     ` Jan Kiszka
  2006-04-07 13:02                     ` Jan Kiszka
  1 sibling, 0 replies; 21+ messages in thread
From: Jan Kiszka @ 2006-04-07 11:57 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: xenomai-core


Philippe Gerum wrote:
> Jan Kiszka wrote:
>> Philippe Gerum wrote:
>>
>>> Jan Kiszka wrote:
>>>
>>>> Philippe Gerum wrote:
>>>>
>>>>
>>>>> Jan Kiszka wrote:
>>>>>
>>>>>
>>>>>> Attached is an ipipe-freeze of the frozen system. It's taken at the time
>>>>>> the main thread of the terminating application has successfully
>>>>>> rt_task_join'ed the last remaining RT-thread. I took 2000 trace points
>>>>>> before and after that point and additionally instrumented
>>>>>> rthal_timer_program_shot() (special trace 0x01, the argument is the
>>>>>> delay). The interesting stuff happens around 600 us after the freeze: it
>>>>>> seems the scheduled Linux timer arrives then but doesn't get much
>>>>>> attention beyond from ipipe.
>>>>>>
>>>>>> Any idea what to look for next? I have a "perfect" test system now,
>>>>>> though I still see no light at the end of the tunnel how to export it to
>>>>>> other boxes.
>>>>>>
>>>>>> Enough for today.
>>>>>>
>>>>>> Jan
>>>>>>
>>>>>>
>>>>>> PS: This trace was taken over 2.6.15 to exclude any issues with the new
>>>>>> 2.6.16. Both kernels show the same effect.
>>>>>>
>>>>>
>>>>> Does this patch make any difference?
>>>>>
>>>>> --- ipipe-root.c~    2006-01-31 09:55:44.000000000 +0100
>>>>> +++ ipipe-root.c    2006-04-06 17:01:49.000000000 +0200
>>>>> @@ -328,9 +328,8 @@
>>>>>        /* Only sync virtual IRQs here, so that we don't recurse
>>>>>           indefinitely in case of an external interrupt flood. */
>>>>>
>>>>> -        if ((ipipe_root_domain->cpudata[cpuid].
>>>>> -             irq_pending_hi & IPIPE_IRQMASK_VIRT) != 0)
>>>>> -            __ipipe_sync_stage(IPIPE_IRQMASK_VIRT);
>>>>> +        if (ipipe_root_domain->cpudata[cpuid].irq_pending_hi != 0)
>>>>> +            __ipipe_sync_stage(IPIPE_IRQMASK_ANY);
>>>>>    }
>>>>> #ifdef CONFIG_IPIPE_TRACE_IRQSOFF
>>>>>    ipipe_trace_end(0x8000000D);
>>>>
>>>>
>>>> Nope.
>>>
>>> That's good news, actually. I would have been quite embarrassed if it did
>>> it.
>>>
>>>
>>>> Where should I put my finger on to find out what's happening?
>>>>
>>>
>>> It seems that the pipeline log is not synced by __ipipe_unstall_iret_root.
>>> We need to know why. Question: is the root stage stalled or unstalled by
>>> this routine during the latest call before the box freezes?
>>
>>
>> I'm currently switching my brain between too many tasks: Could you simply
>> tell me what variable to check so that I can hack some
>> ipipe_trace_special into the kernel?
> 
> The value of the IPIPE_STALL_FLAG for the root domain upon exit from
> __ipipe_unstall_iret_root.

ipipe_root_domain->cpudata[cpuid].status is 0 on return.
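
To make that reading explicit, here is a minimal user-space sketch of how the
status word decodes. It assumes the stall flag is bit 0 of the per-CPU status
word (this matches the status values seen in this thread, but it is an
assumption here), and domain_is_stalled() is a hypothetical helper, not an
I-pipe function:

```c
#include <stdbool.h>

/* Assumption: the stall flag is bit 0 of the per-CPU status word. */
#define IPIPE_STALL_FLAG 0

/* Hypothetical helper: true if the stage owning 'status' is stalled. */
static bool domain_is_stalled(unsigned long status)
{
	return (status >> IPIPE_STALL_FLAG) & 1UL;
}
```

Under that assumption, a status of 0 means the root stage is unstalled on
return.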

Jan



* Re: [Xenomai-core] Frozen timer IRQ
  2006-04-06 17:15                   ` Philippe Gerum
  2006-04-07 11:57                     ` Jan Kiszka
@ 2006-04-07 13:02                     ` Jan Kiszka
  2006-04-07 16:28                       ` Philippe Gerum
  1 sibling, 1 reply; 21+ messages in thread
From: Jan Kiszka @ 2006-04-07 13:02 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: xenomai-core



Philippe Gerum wrote:
> Jan Kiszka wrote:
>> Philippe Gerum wrote:
>>> ...
>>> It seems that the pipeline log is not synced by __ipipe_unstall_iret_root.
>>> We need to know why. Question: is the root stage stalled or unstalled by
>>> this routine during the latest call before the box freezes?
>>
>>
>> I'm currently switching my brain between too many tasks: Could you simply
>> tell me what variable to check so that I can hack some
>> ipipe_trace_special into the kernel?
> 
> The value of the IPIPE_STALL_FLAG for the root domain upon exit from
> __ipipe_unstall_iret_root.
> 

The problem seems to be the stalled Xenomai domain:

>   fn                 1917    3.503  cond_resched+0x9 (console_conditional_schedule+0x16)
>  |fn                 1921    2.706  __ipipe_handle_irq+0xe (common_interrupt+0x18)
>  |fn                 1923    1.548  __ipipe_ack_common_irq+0x9 (__ipipe_handle_irq+0xc0)
>  |fn                 1925    4.390  mask_and_ack_8259A+0xb (__ipipe_ack_common_irq+0x47)
>  |(0x20) 0x00000000  1929    0.796  __ipipe_handle_irq+0x144 (common_interrupt+0x18)
>  |(0x30) 0x00000064  1930    0.766  __ipipe_handle_irq+0x15c (common_interrupt+0x18)
>  |(0x31) 0x00000064  1931    0.812  __ipipe_handle_irq+0x169 (common_interrupt+0x18)
>  |(0x32) 0x000000c8  1932    0.766  __ipipe_handle_irq+0x17e (common_interrupt+0x18)
>  |(0x32) 0x00000001  1932    0.781  __ipipe_handle_irq+0x188 (common_interrupt+0x18)
>  |(0x21) 0x00000000  1933    1.383  __ipipe_handle_irq+0x208 (common_interrupt+0x18)
>  |fn                 1934    1.413  __ipipe_stall_root+0x8 (resume_kernel+0x5)
>   fn                 1936    1.052  __ipipe_unstall_iret_root+0x8 (restore_raw+0x0)
>  |(0x11) 0x00000000  1937    0.932  __ipipe_unstall_iret_root+0x31 (restore_raw+0x0)
>  |(0x03) 0x00000000  1938    1.774  __ipipe_unstall_iret_root+0x64 (restore_raw+0x0)
>   fn                 1940    0.736  console_conditional_schedule+0x8 (fbcon_redraw+0xdf)

This was taken during the failing Linux timer tick with the attached
instrumentation hack.
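
How to read the trace: the two 0x32 records show the first stage visited
(priority 0xc8) with status 1, and the walk returns (0x21) without ever
reaching the 0x34 record. The following is a rough user-space model of that
walk, not the kernel code. struct stage and walk_pipeline() are simplified,
hypothetical stand-ins for struct ipipe_domain and __ipipe_walk_pipeline():

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for struct ipipe_domain; not the kernel layout. */
struct stage {
	int priority;      /* higher value = earlier in the pipeline */
	bool stalled;      /* models IPIPE_STALL_FLAG */
	unsigned pending;  /* models irq_pending_hi */
};

/*
 * Walk the stages from index 'start' down to and including 'current'.
 * Returns the index of the stage whose pending IRQs would be synced,
 * or -1 if the walk ends without delivering anything.
 */
static int walk_pipeline(const struct stage *s, size_t n,
			 size_t start, size_t current)
{
	for (size_t i = start; i < n; i++) {
		if (s[i].stalled)
			return -1;	/* Stalled stage: do not go further. */
		if (s[i].pending != 0)
			return (int)i;	/* This stage gets synced. */
		if (i == current)
			return -1;	/* Never look past the current domain. */
	}
	return -1;
}
```

With a stalled first stage (priority 200) ahead of a root stage (priority 100)
that has the timer IRQ pending, the model returns -1: the tick stays logged
but is not played.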

BTW, that trace hacking reminds me that we should really think about
making a kernel debugger run. I recently noticed that latest kgdb
applied with a single failing hunk on top of ipipe (2.6.15, x86). Maybe
it is just about making kgdb's irq-locks ipipe-aware and bypassing the
ipipe for int3 and the serial IRQ (so that ipipe can be debugged as
well) and catching the relevant exceptions. Hmm, the debugger seems to
get initialised in the "early" stage. Is this before or after ipipe setup?

Jan

[-- Attachment #1.2: ipipe-root-instr.patch --]
[-- Type: text/plain, Size: 2443 bytes --]

--- arch/i386/kernel/ipipe-root.c.orig	2006-04-05 23:13:45.000000000 +0200
+++ arch/i386/kernel/ipipe-root.c	2006-04-07 14:35:30.000000000 +0200
@@ -315,11 +315,13 @@ asmlinkage void __ipipe_unstall_iret_roo
 	   emulation. */
 
 	if (!(regs.eflags & X86_EFLAGS_IF)) {
+ipipe_trace_special(0x10, 0);
 		__set_bit(IPIPE_STALL_FLAG,
 			  &ipipe_root_domain->cpudata[cpuid].status);
 		ipipe_mark_domain_stall(ipipe_root_domain, cpuid);
 		regs.eflags |= X86_EFLAGS_IF;
 	} else {
+ipipe_trace_special(0x11, 0);
 		__clear_bit(IPIPE_STALL_FLAG,
 			    &ipipe_root_domain->cpudata[cpuid].status);
 
@@ -335,6 +337,7 @@ asmlinkage void __ipipe_unstall_iret_roo
 #ifdef CONFIG_IPIPE_TRACE_IRQSOFF
 	ipipe_trace_end(0x8000000D);
 #endif /* CONFIG_IPIPE_TRACE_IRQSOFF */
+ipipe_trace_special(0x03, ipipe_root_domain->cpudata[cpuid].status);
 }
 
 asmlinkage int __ipipe_syscall_root(struct pt_regs regs)
@@ -457,20 +460,26 @@ fastcall int __ipipe_divert_exception(st
 static inline void __ipipe_walk_pipeline(struct list_head *pos, int cpuid)
 {
 	struct ipipe_domain *this_domain = ipipe_percpu_domain[cpuid];
+ipipe_trace_special(0x30, ipipe_root_domain->priority);
+ipipe_trace_special(0x31, this_domain->priority);
 
 	while (pos != &__ipipe_pipeline) {
 		struct ipipe_domain *next_domain =
 		    list_entry(pos, struct ipipe_domain, p_link);
+ipipe_trace_special(0x32, next_domain->priority);
+ipipe_trace_special(0x32, next_domain->cpudata[cpuid].status);
 
 		if (test_bit
 		    (IPIPE_STALL_FLAG, &next_domain->cpudata[cpuid].status))
 			break;	/* Stalled stage -- do not go further. */
 
+ipipe_trace_special(0x34, 0);
 		if (next_domain->cpudata[cpuid].irq_pending_hi != 0) {
 
 			if (next_domain == this_domain)
 				__ipipe_sync_stage(IPIPE_IRQMASK_ANY);
 			else {
+ipipe_trace_special(0x35, 0);
 				__ipipe_switch_to(this_domain, next_domain,
 						  cpuid);
 
@@ -483,6 +492,7 @@ static inline void __ipipe_walk_pipeline
 					__ipipe_sync_stage(IPIPE_IRQMASK_ANY);
 			}
 
+ipipe_trace_special(0x36, 0);
 			break;
 		} else if (next_domain == this_domain)
 			break;
@@ -587,7 +597,9 @@ int __ipipe_handle_irq(struct pt_regs re
 	   marked as 'sticky'. This search does not go beyond the
 	   current domain in the pipeline. */
 
+ipipe_trace_special(0x20, 0);
 	__ipipe_walk_pipeline(head, cpuid);
+ipipe_trace_special(0x21, 0);
 
 	ipipe_load_cpuid();
 


* Re: [Xenomai-core] Frozen timer IRQ
  2006-04-07 13:02                     ` Jan Kiszka
@ 2006-04-07 16:28                       ` Philippe Gerum
  2006-04-07 16:39                         ` Philippe Gerum
  0 siblings, 1 reply; 21+ messages in thread
From: Philippe Gerum @ 2006-04-07 16:28 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai-core

Jan Kiszka wrote:
> 
> BTW, that trace hacking reminds me that we should really think about
> making a kernel debugger run. I recently noticed that latest kgdb
> applied with a single failing hunk on top of ipipe (2.6.15, x86). Maybe
> it is just about making kgdb's irq-locks ipipe-aware and bypassing the
> ipipe for int3 and the serial IRQ (so that ipipe can be debugged as
> well) and catching the relevant exceptions. Hmm, the debugger seems to
> get initialised in the "early" stage. Is this before or after ipipe setup?
> 

It depends. If "kgdbwait" is set in the bootargs to halt the kernel 
waiting for the remote GDB to connect to the target, kgdb starts before 
the ipipe. Otherwise, it's a late init, and kgdb starts after the ipipe 
is fully initialized.

-- 

Philippe.



* Re: [Xenomai-core] Frozen timer IRQ
  2006-04-07 16:28                       ` Philippe Gerum
@ 2006-04-07 16:39                         ` Philippe Gerum
  2006-04-07 18:00                           ` [Xenomai-core] Frozen timer IRQ - now traced with kgdb :) Jan Kiszka
  0 siblings, 1 reply; 21+ messages in thread
From: Philippe Gerum @ 2006-04-07 16:39 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: Jan Kiszka, xenomai-core

Philippe Gerum wrote:
> Jan Kiszka wrote:
> 
>>
>> BTW, that trace hacking reminds me that we should really think about
>> making a kernel debugger run. I recently noticed that latest kgdb
>> applied with a single failing hunk on top of ipipe (2.6.15, x86). Maybe
>> it is just about making kgdb's irq-locks ipipe-aware and bypassing the
>> ipipe for int3 and the serial IRQ (so that ipipe can be debugged as
>> well) and catching the relevant exceptions. Hmm, the debugger seems to
>> get initialised in the "early" stage. Is this before or after ipipe 
>> setup?
>>
> 
> It depends. If "kgdbwait" is set in the bootargs to halt the kernel 
> waiting for the remote GDB to connect to the target, kgdb starts before 
> the ipipe. Otherwise, it's a late init, and kgdb starts after the ipipe 
> is fully initialized.
> 

Basically, kgdb could start before the i-pipe as soon as a breakpoint is 
hit before the latter is enabled in init/main.c.

-- 

Philippe.



* Re: [Xenomai-core] Frozen timer IRQ - now traced with kgdb :)
  2006-04-07 16:39                         ` Philippe Gerum
@ 2006-04-07 18:00                           ` Jan Kiszka
  2006-04-09  9:40                             ` Philippe Gerum
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Kiszka @ 2006-04-07 18:00 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: xenomai-core



Philippe Gerum wrote:
> Philippe Gerum wrote:
>> Jan Kiszka wrote:
>>
>>>
>>> BTW, that trace hacking reminds me that we should really think about
>>> making a kernel debugger run. I recently noticed that latest kgdb
>>> applied with a single failing hunk on top of ipipe (2.6.15, x86). Maybe
>>> it is just about making kgdb's irq-locks ipipe-aware and bypassing the
>>> ipipe for int3 and the serial IRQ (so that ipipe can be debugged as
>>> well) and catching the relevant exceptions. Hmm, the debugger seems to
>>> get initialised in the "early" stage. Is this before or after ipipe
>>> setup?
>>>
>>
>> It depends. If "kgdbwait" is set in the bootargs to halt the kernel
>> waiting for the remote GDB to connect to the target, kgdb starts
>> before the ipipe. Otherwise, it's a late init, and kgdb starts after
>> the ipipe is fully initialized.
>>
> 
> Basically, kgdb could start before the i-pipe as soon as a breakpoint is
> hit before the latter is enabled in init/main.c.
> 

Yep, I dug deeper meanwhile and also came across this.

I already have a trivial hack running here. The most tricky part for me
was to learn quilt, but now I start to love it :). Here is a snapshot
series for 2.6.15.5:

<kgdb series from CVS>
prepare-ipipe-x86.patch
adeos-ipipe-2.6.15-i386-1.2-01.patch
kgdb-ipipe-x86.patch

I'm currently wondering if it makes sense to register a kgdb domain and
"officially" capture all involved IRQs and events. So far the serial
line IRQ is hard-coded (should be retrieved from some internal kgdb
structure later). Anyway, it seems to work quite well; I'm currently
stepping through a network IRQ at ipipe-level.


While playing with this tool a bit, displaying the ipipe structures,
and thinking about the original problem again, I wondered what could
cause a temporarily (as I think I have found out now) stalled xeno domain
without locking up the system? Some irq-lock leaks at driver level (i.e.
inside our own code)?

Jan

[-- Attachment #1.2: kgdb-ipipe-x86.patch --]
[-- Type: text/plain, Size: 3997 bytes --]

Index: linux-2.6.15.5/arch/i386/kernel/entry.S
===================================================================
--- linux-2.6.15.5.orig/arch/i386/kernel/entry.S	2006-04-07 16:53:39.000000000 +0200
+++ linux-2.6.15.5/arch/i386/kernel/entry.S	2006-04-07 16:53:40.000000000 +0200
@@ -194,7 +194,7 @@
 .previous
 
 
-ENTRY(ret_from_fork)
+KPROBE_ENTRY(ret_from_fork)
 	STI_COND_HW
 	pushl %eax
 	call schedule_tail
@@ -582,7 +582,7 @@
 	PUSH_XCODE(do_simd_coprocessor_error)
 	jmp error_code
 
-ENTRY(device_not_available)
+KPROBE_ENTRY(device_not_available)
 	pushl $-1			# mark this as an int
 	SAVE_ALL
 	DIVERT_EXCEPTION(device_not_available)
@@ -767,7 +767,7 @@
 	jmp error_code
 #endif
 
-ENTRY(spurious_interrupt_bug)
+KPROBE_ENTRY(spurious_interrupt_bug)
 	pushl $0
 	PUSH_XCODE(do_spurious_interrupt_bug)
 	jmp error_code
Index: linux-2.6.15.5/kernel/kgdb.c
===================================================================
--- linux-2.6.15.5.orig/kernel/kgdb.c	2006-04-07 16:30:51.000000000 +0200
+++ linux-2.6.15.5/kernel/kgdb.c	2006-04-07 16:57:35.000000000 +0200
@@ -740,7 +740,7 @@
 	unsigned long flags;
 	int processor;
 
-	local_irq_save(flags);
+	local_irq_save_hw(flags);
 	processor = smp_processor_id();
 	kgdb_info[processor].debuggerinfo = regs;
 	kgdb_info[processor].task = current;
@@ -770,7 +770,7 @@
 	/* Signal the master processor that we are done */
 	atomic_set(&procindebug[processor], 0);
 	spin_unlock(&slavecpulocks[processor]);
-	local_irq_restore(flags);
+	local_irq_restore_hw(flags);
 }
 #endif
 
@@ -1033,7 +1033,7 @@
 	 * Interrupts will be restored by the 'trap return' code, except when
 	 * single stepping.
 	 */
-	local_irq_save(flags);
+	local_irq_save_hw(flags);
 
 	/* Hold debugger_active */
 	procid = smp_processor_id();
@@ -1056,7 +1056,7 @@
 	if (atomic_read(&cpu_doing_single_step) != -1 &&
 	    atomic_read(&cpu_doing_single_step) != procid) {
 		atomic_set(&debugger_active, 0);
-		local_irq_restore(flags);
+		local_irq_restore_hw(flags);
 		goto acquirelock;
 	}
 
@@ -1556,7 +1556,7 @@
 kgdb_restore:
 	/* Free debugger_active */
 	atomic_set(&debugger_active, 0);
-	local_irq_restore(flags);
+	local_irq_restore_hw(flags);
 
 	return error;
 }
@@ -1925,9 +1925,9 @@
 	if (!kgdb_connected || atomic_read(&debugger_active) != 0)
 		return 0;
 	if ((code == SYS_RESTART) || (code == SYS_HALT) || (code == SYS_POWER_OFF)){
-		local_irq_save(flags);
+		local_irq_save_hw(flags);
 		put_packet("X00");
-		local_irq_restore(flags);
+		local_irq_restore_hw(flags);
 	}
 	return NOTIFY_DONE;
 }		
@@ -1942,9 +1942,9 @@
 	if (!kgdb_connected || atomic_read(&debugger_active) != 0)
 		return;
 
-	local_irq_save(flags);
+	local_irq_save_hw(flags);
 	kgdb_msg_write(s, count);
-	local_irq_restore(flags);
+	local_irq_restore_hw(flags);
 }
 
 static struct console kgdbcons = {
Index: linux-2.6.15.5/arch/i386/kernel/ipipe-root.c
===================================================================
--- linux-2.6.15.5.orig/arch/i386/kernel/ipipe-root.c	2006-04-07 16:53:39.000000000 +0200
+++ linux-2.6.15.5/arch/i386/kernel/ipipe-root.c	2006-04-07 17:48:00.000000000 +0200
@@ -111,6 +111,15 @@
 
 #endif	/* CONFIG_X86_LOCAL_APIC */
 
+#ifdef CONFIG_KGDB
+static struct ipipe_domain kgdb_domain;
+
+static void kgdb_domain_entry(void)
+{
+	
+}
+#endif /* CONFIG_KGDB */
+
 /* __ipipe_enable_pipeline() -- We are running on the boot CPU, hw
    interrupts are off, and secondary CPUs are still lost in space. */
 
@@ -248,6 +257,10 @@
 	ipipe_root_domain->irqs[IPIPE_SERVICE_IPI2].control &= ~IPIPE_SYSTEM_MASK;
 	ipipe_root_domain->irqs[IPIPE_SERVICE_IPI3].control &= ~IPIPE_SYSTEM_MASK;
 #endif	/* CONFIG_X86_LOCAL_APIC */
+
+#ifdef CONFIG_KGDB
+	ipipe_control_irq(4, 0, IPIPE_HANDLE_MASK|IPIPE_STICKY_MASK|IPIPE_SYSTEM_MASK);
+#endif /* CONFIG_KGDB */
 }
 
 static inline void __fixup_if(struct pt_regs *regs)

[-- Attachment #1.3: prepare-ipipe-x86.patch --]
[-- Type: text/plain, Size: 838 bytes --]

Index: linux-2.6.15.5/arch/i386/kernel/entry.S
===================================================================
--- linux-2.6.15.5.orig/arch/i386/kernel/entry.S	2006-04-07 16:42:54.000000000 +0200
+++ linux-2.6.15.5/arch/i386/kernel/entry.S	2006-04-07 16:47:23.000000000 +0200
@@ -123,7 +123,7 @@
 .previous
 
 
-KPROBE_ENTRY(ret_from_fork)
+ENTRY(ret_from_fork)
 	pushl %eax
 	call schedule_tail
 	GET_THREAD_INFO(%ebp)
@@ -470,7 +470,7 @@
 	pushl $do_simd_coprocessor_error
 	jmp error_code
 
-KPROBE_ENTRY(device_not_available)
+ENTRY(device_not_available)
 	pushl $-1			# mark this as an int
 	SAVE_ALL
 	movl %cr0, %eax
@@ -652,7 +652,7 @@
 	jmp error_code
 #endif
 
-KPROBE_ENTRY(spurious_interrupt_bug)
+ENTRY(spurious_interrupt_bug)
 	pushl $0
 	pushl $do_spurious_interrupt_bug
 	jmp error_code


* Re: [Xenomai-core] Frozen timer IRQ - now traced with kgdb :)
  2006-04-07 18:00                           ` [Xenomai-core] Frozen timer IRQ - now traced with kgdb :) Jan Kiszka
@ 2006-04-09  9:40                             ` Philippe Gerum
  0 siblings, 0 replies; 21+ messages in thread
From: Philippe Gerum @ 2006-04-09  9:40 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai-core

Jan Kiszka wrote:
> 
> 
> Yep, I dug deeper meanwhile and also came across this.
> 
> I already have a trivial hack running here. The most tricky part for me
> was to learn quilt, but now I start to love it :). Here is a snapshot
> series for 2.6.15.5:
> 
> <kgdb series from CVS>
> prepare-ipipe-x86.patch
> adeos-ipipe-2.6.15-i386-1.2-01.patch
> kgdb-ipipe-x86.patch
>

In order to ease patch maintenance, we should move the relevant portions 
of this infrastructure to the I-pipe patch directly (i.e. I-pipe 
specific kgdb-ipipe-* code).

> I'm currently wondering if it makes sense to register a kgdb domain and
> "officially" capture all involved IRQs and events. So far the serial
> line IRQ is hard-coded (should be retrieved from some internal kgdb
> structure later). Anyway, it seems to work quite well, I'm currently
> stepping through a network IRQ at ipipe-level.
> 

Having a separate domain would allow breaking into any runaway code from
lower priority domains even with disabled interrupts, except the ipipe
itself. That said, pushing a domain on top of Xenomai would break the
assumption that hw interrupts are indeed disabled when operating due to
the 'last domain optimization' feature, and introduce additional
jitter. The other option would be to install a KGDB 'redirector' in
__ipipe_handle_irq so that serial or network interrupts to KGDB would
never be blocked by the stall bit; I would actually prefer this one.
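
As a rough illustration of the redirector idea, here is a user-space sketch
under stated assumptions. debugger_irq[], pending_mask and handle_irq() are
hypothetical names and do not exist in the I-pipe or KGDB code:

```c
#include <stdbool.h>

#define NR_IRQS 16

enum irq_action { IRQ_DELIVERED, IRQ_LOGGED, IRQ_TO_DEBUGGER };

/* Hypothetical per-IRQ marker: true if the debugger owns this line. */
static bool debugger_irq[NR_IRQS];

/* Models the pipeline log of IRQs deferred by a stalled stage. */
static unsigned long pending_mask;

/* The caller guarantees irq < NR_IRQS. */
static enum irq_action handle_irq(unsigned irq, bool stage_stalled)
{
	if (debugger_irq[irq])
		return IRQ_TO_DEBUGGER;		/* redirected before any stall check */

	if (stage_stalled) {
		pending_mask |= 1UL << irq;	/* deferred until the stage unstalls */
		return IRQ_LOGGED;
	}

	return IRQ_DELIVERED;
}
```

Marking IRQ 4 (the serial line hard-coded in the patch above) as a debugger
IRQ lets it through even while the stage is stalled, whereas other IRQs are
only logged.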

> 
> While playing with this tool a bit, displaying the ipipe structures,
> and thinking about the original problem again, I wondered what could
> cause a temporarily (as I think I have found out now) stalled xeno domain
> without locking up the system? Some irq-lock leaks at driver level (i.e.
> inside our own code)?
> 

At first sight, it might be related to the way __ipipe_unstall_iret_root
operates. Basically, the idea is to make sure that the stall flag of the
root domain upon return from the pipelining process always reflects the
state of the hw interrupt flag at the time the processed event was taken
by the CPU. It seems that your testcase shows that under some
circumstances, the root stage might be spuriously left in a stalled state
by __ipipe_unstall_iret_root.
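
The intended rule can be written down as a tiny user-space sketch. It mirrors
the eflags test visible in the instrumentation patch earlier in the thread;
root_stalled_on_return() is a hypothetical helper name, and the sync of
pending interrupts done in the unstalled case is left out:

```c
#include <stdbool.h>

#define X86_EFLAGS_IF 0x00000200UL	/* interrupt-enable bit in EFLAGS */

/*
 * Returns the root stall state that should be in effect on return,
 * given the EFLAGS image saved when the event was taken: IF clear
 * means the root stage must be left stalled, IF set means unstalled.
 */
static bool root_stalled_on_return(unsigned long saved_eflags)
{
	return !(saved_eflags & X86_EFLAGS_IF);
}
```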

-- 

Philippe.


* Re: [Xenomai-core] Frozen timer IRQ
  2006-04-05 19:30       ` Jan Kiszka
  2006-04-05 21:56         ` Jan Kiszka
@ 2006-04-06 17:10         ` Philippe Gerum
  1 sibling, 0 replies; 21+ messages in thread
From: Philippe Gerum @ 2006-04-06 17:10 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai-core

Jan Kiszka wrote:
> Philippe Gerum wrote:
> 
>>Philippe Gerum wrote:
>>
>>>Gilles Chanteperdrix wrote:
>>>
>>>
>>>>Jan Kiszka wrote:
>>>> > Hi,
>>>> >  > my colleagues and I need some hint where to continue our search
>>>>for the
>>>> > cause of a weird cleanup issue:
>>>> >  > An application of our robotics framework sometimes terminates
>>>>(though
>>>> > successfully) in a way that the system timer IRQ no longer arrives
>>>> > afterwards or no re-program takes place anymore. All other Linux IRQs
>>>> > are fine (Ethernet, keyboard, etc.). I cannot provide an easy test
>>>>case
>>>> > yet as besides the framework some expensive gyroscope and the 16550A
>>>> > driver are involved.
>>>>
>>>>I observed a similar issue when xnpod_stop_timer was called when
>>>>shutting down the posix skin. I assumed that the problem was that
>>>>xnpod_shutdown already called xnpod_stop_timer, so xnpod_stop_timer (and
>>>>in particular xnarch_stop_timer) ended up being called twice.
>>>>
>>>
>>>Err, sorry. Forget about my previous reply: xnarch_stop_timer is _not_
>>>protected by the XNTIMED flag, but only the last part of the
>>>housekeeping chores performed upon stopping the systimer are. IOW,
>>>this is a latent bug, and xnpod_stop_timer should be fixed.
>>>
>>
>>Commit 884 should do that.
>>
> 
> 
> Sorry for replying late: nope, this has no influence on our issue.
> 

This fix was not intended to address this issue, but rather to clean up
the timer management code so that multiple releases as described by
Gilles don't cause havoc anymore, hopefully. So that's ok.
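
To illustrate the kind of guard meant here, a minimal user-space sketch
follows. It is not the actual change in commit 884; timer_running stands in
for the XNTIMED status bit, and all function names are hypothetical:

```c
#include <stdbool.h>

static bool timer_running;	/* stands in for the XNTIMED status bit */
static int hw_stop_calls;	/* counts entries into the hardware stop path */

/* Stands in for the arch-level stop routine. */
static void hw_stop_timer(void)
{
	hw_stop_calls++;
}

static void start_timer(void)
{
	timer_running = true;
}

/* Safe to call more than once: only the first call reaches the hardware. */
static void stop_timer(void)
{
	if (!timer_running)
		return;		/* already stopped: nothing to release */

	timer_running = false;
	hw_stop_timer();
}
```

With the whole stop path behind the flag, a second stop request from another
shutdown path becomes a no-op.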

-- 

Philippe.



end of thread, 2006-04-09  9:40 UTC

Thread overview: 21+ messages
2006-04-04 21:29 [Xenomai-core] Frozen timer IRQ Jan Kiszka
2006-04-05  7:13 ` Philippe Gerum
2006-04-05 12:10 ` Gilles Chanteperdrix
2006-04-05 12:29   ` Philippe Gerum
2006-04-05 12:38   ` Philippe Gerum
2006-04-05 13:05     ` Philippe Gerum
2006-04-05 19:30       ` Jan Kiszka
2006-04-05 21:56         ` Jan Kiszka
2006-04-05 21:58           ` Jan Kiszka
2006-04-06 15:04           ` Philippe Gerum
2006-04-06 15:29             ` Jan Kiszka
2006-04-06 15:39               ` Philippe Gerum
2006-04-06 15:46                 ` Jan Kiszka
2006-04-06 17:15                   ` Philippe Gerum
2006-04-07 11:57                     ` Jan Kiszka
2006-04-07 13:02                     ` Jan Kiszka
2006-04-07 16:28                       ` Philippe Gerum
2006-04-07 16:39                         ` Philippe Gerum
2006-04-07 18:00                           ` [Xenomai-core] Frozen timer IRQ - now traced with kgdb :) Jan Kiszka
2006-04-09  9:40                             ` Philippe Gerum
2006-04-06 17:10         ` [Xenomai-core] Frozen timer IRQ Philippe Gerum
