* [Xenomai-core] Frozen timer IRQ
@ 2006-04-04 21:29 Jan Kiszka
  2006-04-05 7:13 ` Philippe Gerum
  2006-04-05 12:10 ` Gilles Chanteperdrix
  0 siblings, 2 replies; 21+ messages in thread
From: Jan Kiszka @ 2006-04-04 21:29 UTC (permalink / raw)
  To: xenomai-core

[-- Attachment #1: Type: text/plain, Size: 812 bytes --]

Hi,

my colleagues and I need some hint where to continue our search for the
cause of a weird cleanup issue:

An application of our robotics framework sometimes terminates (though
successfully) in a way that the system timer IRQ no longer arrives
afterwards or no re-program takes place anymore. All other Linux IRQs
are fine (Ethernet, keyboard, etc.). I cannot provide an easy test case
yet as besides the framework some expensive gyroscope and the 16550A
driver are involved.

Fortunately, we found a clean way of stabilising the application by
fixing our broken code :) and improving the serial driver (RTIOC_PURGE),
so that the original problem is solved now (unreliable startup and
cleanup). Anyway, the stopped timer is not yet explainable, and that's
why we plan to dig deeper.

Jan

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 250 bytes --]

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Xenomai-core] Frozen timer IRQ 2006-04-04 21:29 [Xenomai-core] Frozen timer IRQ Jan Kiszka @ 2006-04-05 7:13 ` Philippe Gerum 2006-04-05 12:10 ` Gilles Chanteperdrix 1 sibling, 0 replies; 21+ messages in thread From: Philippe Gerum @ 2006-04-05 7:13 UTC (permalink / raw) To: Jan Kiszka; +Cc: xenomai-core Jan Kiszka wrote: > Hi, > > my colleagues and I need some hint where to continue our search for the > cause of a weird cleanup issue: > > An application of our robotics framework sometimes terminates (though > successfully) in a way that the system timer IRQ no longer arrives > afterwards or no re-program takes place anymore. Assuming that the APIC is disabled in the kernel configuration, so that there could be an issue with the nucleus host timer, I would try to look at the state of this timer (XNTIMER_DEQUEUED?) right after the cleanup. I would also try to store a copy of the last timer object seen by xntimer_next_local_shot(), so that the timer id (htimer or not basically) and the programmed tick date could be looked at after the cleanup phase. Normally, if no other application timer is active, the host timer should be the only one to tick periodically until xnpod_shutdown is called, and thus should keep on being reprogrammed by xntimer_next_local_shot(). If xnpod_shutdown is called, then this is another story, and rthal_timer_release() should be inspected instead. All other Linux IRQs > are fine (Ethernet, keyboard, etc.). I cannot provide an easy test case > yet as besides the framework some expensive gyroscope and the 16550A > driver are involved. > > Fortunately, we found a clean way of stabilising the application by > fixing our broken code :) and improving the serial driver (RTIOC_PURGE), > so that the original problem is solved now (unreliable startup and > cleanup). Anyway, the stopped timer is not yet explainable, and that's > why we plan to dig deeper. 
> > Jan > > > > ------------------------------------------------------------------------ > > _______________________________________________ > Xenomai-core mailing list > Xenomai-core@domain.hid > https://mail.gna.org/listinfo/xenomai-core -- Philippe. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Xenomai-core] Frozen timer IRQ
  2006-04-04 21:29 [Xenomai-core] Frozen timer IRQ Jan Kiszka
  2006-04-05 7:13 ` Philippe Gerum
@ 2006-04-05 12:10 ` Gilles Chanteperdrix
  2006-04-05 12:29 ` Philippe Gerum
  2006-04-05 12:38 ` Philippe Gerum
  1 sibling, 2 replies; 21+ messages in thread
From: Gilles Chanteperdrix @ 2006-04-05 12:10 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai-core

Jan Kiszka wrote:
 > Hi,
 >
 > my colleagues and I need some hint where to continue our search for the
 > cause of a weird cleanup issue:
 >
 > An application of our robotics framework sometimes terminates (though
 > successfully) in a way that the system timer IRQ no longer arrives
 > afterwards or no re-program takes place anymore. All other Linux IRQs
 > are fine (Ethernet, keyboard, etc.). I cannot provide an easy test case
 > yet as besides the framework some expensive gyroscope and the 16550A
 > driver are involved.

I observed a similar issue when xnpod_stop_timer was called when
shutting down the posix skin. I assumed that the problem was that
xnpod_shutdown already called xnpod_stop_timer, so xnpod_stop_timer (and
in particular xnarch_stop_timer) ended up being called twice.

--
Gilles Chanteperdrix.

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Xenomai-core] Frozen timer IRQ
  2006-04-05 12:10 ` Gilles Chanteperdrix
@ 2006-04-05 12:29 ` Philippe Gerum
  2006-04-05 12:38 ` Philippe Gerum
  1 sibling, 0 replies; 21+ messages in thread
From: Philippe Gerum @ 2006-04-05 12:29 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: Jan Kiszka, xenomai-core

Gilles Chanteperdrix wrote:
> Jan Kiszka wrote:
>  > Hi,
>  >
>  > my colleagues and I need some hint where to continue our search for the
>  > cause of a weird cleanup issue:
>  >
>  > An application of our robotics framework sometimes terminates (though
>  > successfully) in a way that the system timer IRQ no longer arrives
>  > afterwards or no re-program takes place anymore. All other Linux IRQs
>  > are fine (Ethernet, keyboard, etc.). I cannot provide an easy test case
>  > yet as besides the framework some expensive gyroscope and the 16550A
>  > driver are involved.
>
> I observed a similar issue when xnpod_stop_timer was called when
> shutting down the posix skin. I assumed that the problem was that
> xnpod_shutdown already called xnpod_stop_timer, so xnpod_stop_timer (and
> in particular xnarch_stop_timer) ended up being called twice.
>

The XNTIMED bit from the pod's status is checked to trap multiple
invocations, so this should not -normally- cause any issue.

--
Philippe.

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Xenomai-core] Frozen timer IRQ
  2006-04-05 12:10 ` Gilles Chanteperdrix
  2006-04-05 12:29 ` Philippe Gerum
@ 2006-04-05 12:38 ` Philippe Gerum
  2006-04-05 13:05 ` Philippe Gerum
  1 sibling, 1 reply; 21+ messages in thread
From: Philippe Gerum @ 2006-04-05 12:38 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: Jan Kiszka, xenomai-core

Gilles Chanteperdrix wrote:
> Jan Kiszka wrote:
>  > Hi,
>  >
>  > my colleagues and I need some hint where to continue our search for the
>  > cause of a weird cleanup issue:
>  >
>  > An application of our robotics framework sometimes terminates (though
>  > successfully) in a way that the system timer IRQ no longer arrives
>  > afterwards or no re-program takes place anymore. All other Linux IRQs
>  > are fine (Ethernet, keyboard, etc.). I cannot provide an easy test case
>  > yet as besides the framework some expensive gyroscope and the 16550A
>  > driver are involved.
>
> I observed a similar issue when xnpod_stop_timer was called when
> shutting down the posix skin. I assumed that the problem was that
> xnpod_shutdown already called xnpod_stop_timer, so xnpod_stop_timer (and
> in particular xnarch_stop_timer) ended up being called twice.
>

Err, sorry. Forget about my previous reply: xnarch_stop_timer is _not_
protected by the XNTIMED flag, but only the last part of the
housekeeping chores performed upon stopping the systimer are. IOW, this
is a latent bug, and xnpod_stop_timer should be fixed.

--
Philippe.

^ permalink raw reply [flat|nested] 21+ messages in thread
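[Editorial sketch] The latent bug described above is a stop routine whose status-flag check covers only part of its work, so a second call still reaches the hardware layer. The following is a minimal C sketch of the general pattern of guarding the whole stop path, assuming nothing about the real nucleus code: `demo_pod`, `DEMO_TIMED`, `demo_hw_timer_stop` and the other `demo_` names are hypothetical stand-ins, not Xenomai identifiers, and this is not the content of the commit mentioned later in the thread.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_TIMED 0x1u   /* hypothetical "timer is running" status bit */

/* Hypothetical stand-in for the pod; not the real nucleus structure. */
struct demo_pod {
    unsigned status;
    int hw_stop_calls;    /* counts how often the hardware layer was touched */
};

/* Stand-in for the architecture-level stop routine. */
static void demo_hw_timer_stop(struct demo_pod *pod)
{
    pod->hw_stop_calls++;
}

static void demo_start_timer(struct demo_pod *pod)
{
    assert(pod != NULL);
    pod->status |= DEMO_TIMED;
}

/* The flag test comes first, so the hardware call is covered by it too. */
static void demo_stop_timer(struct demo_pod *pod)
{
    assert(pod != NULL);
    if (!(pod->status & DEMO_TIMED))
        return;                   /* already stopped: nothing left to do */
    pod->status &= ~DEMO_TIMED;
    demo_hw_timer_stop(pod);
}
```

With this shape, calling `demo_stop_timer()` twice on a started pod touches the hardware stand-in only once; whether the real fix looks like this is not something the thread shows.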
* Re: [Xenomai-core] Frozen timer IRQ 2006-04-05 12:38 ` Philippe Gerum @ 2006-04-05 13:05 ` Philippe Gerum 2006-04-05 19:30 ` Jan Kiszka 0 siblings, 1 reply; 21+ messages in thread From: Philippe Gerum @ 2006-04-05 13:05 UTC (permalink / raw) To: Philippe Gerum; +Cc: Jan Kiszka, xenomai-core Philippe Gerum wrote: > Gilles Chanteperdrix wrote: > >> Jan Kiszka wrote: >> > Hi, >> > > my colleagues and I need some hint where to continue our search >> for the >> > cause of a weird cleanup issue: >> > > An application of our robotics framework sometimes terminates >> (though >> > successfully) in a way that the system timer IRQ no longer arrives >> > afterwards or no re-program takes place anymore. All other Linux IRQs >> > are fine (Ethernet, keyboard, etc.). I cannot provide an easy test >> case >> > yet as besides the framework some expensive gyroscope and the 16550A >> > driver are involved. >> >> I observed a similar issue when xnpod_stop_timer was called when >> shutting down the posix skin. I assumed that the problem was that >> xnpod_shutdown already called xnpod_stop_timer, so xnpod_stop_timer (and >> in particular xnarch_stop_timer) ended up being called twice. >> > > Err, sorry. Forget about my previous reply: xnarch_stop_timer is _not_ > protected by the XNTIMED flag, but only the last part of the > housekeeping chores performed upon stopping the systimer are. IOW, this > is a latent bug, and xnpod_stop_timer should be fixed. > Commit 884 should do that. -- Philippe. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Xenomai-core] Frozen timer IRQ 2006-04-05 13:05 ` Philippe Gerum @ 2006-04-05 19:30 ` Jan Kiszka 2006-04-05 21:56 ` Jan Kiszka 2006-04-06 17:10 ` [Xenomai-core] Frozen timer IRQ Philippe Gerum 0 siblings, 2 replies; 21+ messages in thread From: Jan Kiszka @ 2006-04-05 19:30 UTC (permalink / raw) To: Philippe Gerum; +Cc: xenomai-core [-- Attachment #1: Type: text/plain, Size: 2148 bytes --] Philippe Gerum wrote: > Philippe Gerum wrote: >> Gilles Chanteperdrix wrote: >> >>> Jan Kiszka wrote: >>> > Hi, >>> > > my colleagues and I need some hint where to continue our search >>> for the >>> > cause of a weird cleanup issue: >>> > > An application of our robotics framework sometimes terminates >>> (though >>> > successfully) in a way that the system timer IRQ no longer arrives >>> > afterwards or no re-program takes place anymore. All other Linux IRQs >>> > are fine (Ethernet, keyboard, etc.). I cannot provide an easy test >>> case >>> > yet as besides the framework some expensive gyroscope and the 16550A >>> > driver are involved. >>> >>> I observed a similar issue when xnpod_stop_timer was called when >>> shutting down the posix skin. I assumed that the problem was that >>> xnpod_shutdown already called xnpod_stop_timer, so xnpod_stop_timer (and >>> in particular xnarch_stop_timer) ended up being called twice. >>> >> >> Err, sorry. Forget about my previous reply: xnarch_stop_timer is _not_ >> protected by the XNTIMED flag, but only the last part of the >> housekeeping chores performed upon stopping the systimer are. IOW, >> this is a latent bug, and xnpod_stop_timer should be fixed. >> > > Commit 884 should do that. > Sorry for replying late: nope, this has no influence on our issue. Well, someone put that damn piece of hardware on my desk, saying: "It doesn't work." What he did not say is that there are multiple issues contained :-/. 
I found and fixed (patch will follow) a severe bug in the 16550A driver, but the strange timer issue stays (though it's still tricky to reproduce). The point is - and that's likely why your patch doesn't help - that we do not stop the system timer, i.e. unload all skins. We just terminate an application. I did some research but failed to find a test case (only our software "manages" to trigger this). Actually, it seems the hardware timer is no longer working, because also other RT-tasks no longer time out. Moreover, I checked nkpod->htimer.status, but it remains 0 all the time. I need more time... Jan [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 250 bytes --] ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Xenomai-core] Frozen timer IRQ 2006-04-05 19:30 ` Jan Kiszka @ 2006-04-05 21:56 ` Jan Kiszka 2006-04-05 21:58 ` Jan Kiszka 2006-04-06 15:04 ` Philippe Gerum 2006-04-06 17:10 ` [Xenomai-core] Frozen timer IRQ Philippe Gerum 1 sibling, 2 replies; 21+ messages in thread From: Jan Kiszka @ 2006-04-05 21:56 UTC (permalink / raw) To: Philippe Gerum; +Cc: xenomai-core [-- Attachment #1: Type: text/plain, Size: 3038 bytes --] Jan Kiszka wrote: > Philippe Gerum wrote: >> Philippe Gerum wrote: >>> Gilles Chanteperdrix wrote: >>> >>>> Jan Kiszka wrote: >>>> > Hi, >>>> > > my colleagues and I need some hint where to continue our search >>>> for the >>>> > cause of a weird cleanup issue: >>>> > > An application of our robotics framework sometimes terminates >>>> (though >>>> > successfully) in a way that the system timer IRQ no longer arrives >>>> > afterwards or no re-program takes place anymore. All other Linux IRQs >>>> > are fine (Ethernet, keyboard, etc.). I cannot provide an easy test >>>> case >>>> > yet as besides the framework some expensive gyroscope and the 16550A >>>> > driver are involved. >>>> >>>> I observed a similar issue when xnpod_stop_timer was called when >>>> shutting down the posix skin. I assumed that the problem was that >>>> xnpod_shutdown already called xnpod_stop_timer, so xnpod_stop_timer (and >>>> in particular xnarch_stop_timer) ended up being called twice. >>>> >>> Err, sorry. Forget about my previous reply: xnarch_stop_timer is _not_ >>> protected by the XNTIMED flag, but only the last part of the >>> housekeeping chores performed upon stopping the systimer are. IOW, >>> this is a latent bug, and xnpod_stop_timer should be fixed. >>> >> Commit 884 should do that. >> > > Sorry for replying late: nope, this has no influence on our issue. > > Well, someone put that damn piece of hardware on my desk, saying: "It > doesn't work." What he did not say is that there are multiple issues > contained :-/. 
I found and fixed (patch will follow) a severe bug in the > 16550A driver, but the strange timer issue stays (though it's still > tricky to reproduce). > > The point is - and that's likely why your patch doesn't help - that we > do not stop the system timer, i.e. unload all skins. We just terminate > an application. I did some research but failed to find a test case (only > our software "manages" to trigger this). Actually, it seems the hardware > timer is no longer working, because also other RT-tasks no longer time > out. Moreover, I checked nkpod->htimer.status, but it remains 0 all the > time. I need more time... > Attached is an ipipe-freeze of the frozen system. It's taken at the time the main thread of the terminating application has successfully rt_task_join'ed the last remaining RT-thread. I took 2000 trace points before and after that point and additionally instrumented rthal_timer_program_shot() (special trace 0x01, the argument is the delay). The interesting stuff happens around 600 us after the freeze: it seems the scheduled Linux timer arrives then but doesn't get much attention beyond from ipipe. Any idea what to look for next? I have a "perfect" test system now, though I still see no light at the end of the tunnel how to export it to other boxes. Enough for today. Jan PS: This trace was taken over 2.6.15 to exclude any issues with the new 2.6.16. Both kernels show the same effect. [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 250 bytes --] ^ permalink raw reply [flat|nested] 21+ messages in thread
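[Editorial sketch] The instrumentation described above (a special trace point with marker 0x01 carrying the programmed delay) can be pictured as a small ring buffer that is written just before the timer is programmed. This is only an illustration in plain C: `demo_trace_special`, `demo_program_shot` and the ring layout are hypothetical and do not reproduce the I-pipe tracer's API or data format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_TRACE_SLOTS 8u   /* small ring, for illustration only */

struct demo_trace_point {
    uint8_t id;       /* user-chosen marker, e.g. 0x01 for "timer programmed" */
    uint32_t value;   /* payload, e.g. the programmed delay */
};

static struct demo_trace_point demo_ring[DEMO_TRACE_SLOTS];
static unsigned demo_next;    /* total number of points recorded so far */

/* Hypothetical "special" trace call; overwrites the oldest slot when full. */
static void demo_trace_special(uint8_t id, uint32_t value)
{
    struct demo_trace_point *p = &demo_ring[demo_next % DEMO_TRACE_SLOTS];

    p->id = id;
    p->value = value;
    demo_next++;
}

/* Instrumented wrapper: log the delay, then run the programming step if given. */
static void demo_program_shot(uint32_t delay, void (*program)(uint32_t))
{
    demo_trace_special(0x01, delay);
    if (program != NULL)
        program(delay);
}

/* Most recently recorded point; at least one point must exist. */
static const struct demo_trace_point *demo_last_point(void)
{
    assert(demo_next > 0);
    return &demo_ring[(demo_next - 1) % DEMO_TRACE_SLOTS];
}
```

After a freeze, the last entries of such a ring show which delay was programmed last, which is the kind of evidence the attached trace is meant to provide.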
* Re: [Xenomai-core] Frozen timer IRQ 2006-04-05 21:56 ` Jan Kiszka @ 2006-04-05 21:58 ` Jan Kiszka 2006-04-06 15:04 ` Philippe Gerum 1 sibling, 0 replies; 21+ messages in thread From: Jan Kiszka @ 2006-04-05 21:58 UTC (permalink / raw) To: Philippe Gerum; +Cc: xenomai-core [-- Attachment #1.1: Type: text/plain, Size: 2410 bytes --] Jan Kiszka wrote: > Jan Kiszka wrote: >> Philippe Gerum wrote: >>> Philippe Gerum wrote: >>>> Gilles Chanteperdrix wrote: >>>> >>>>> Jan Kiszka wrote: >>>>> > Hi, >>>>> > > my colleagues and I need some hint where to continue our search >>>>> for the >>>>> > cause of a weird cleanup issue: >>>>> > > An application of our robotics framework sometimes terminates >>>>> (though >>>>> > successfully) in a way that the system timer IRQ no longer arrives >>>>> > afterwards or no re-program takes place anymore. All other Linux IRQs >>>>> > are fine (Ethernet, keyboard, etc.). I cannot provide an easy test >>>>> case >>>>> > yet as besides the framework some expensive gyroscope and the 16550A >>>>> > driver are involved. >>>>> >>>>> I observed a similar issue when xnpod_stop_timer was called when >>>>> shutting down the posix skin. I assumed that the problem was that >>>>> xnpod_shutdown already called xnpod_stop_timer, so xnpod_stop_timer (and >>>>> in particular xnarch_stop_timer) ended up being called twice. >>>>> >>>> Err, sorry. Forget about my previous reply: xnarch_stop_timer is _not_ >>>> protected by the XNTIMED flag, but only the last part of the >>>> housekeeping chores performed upon stopping the systimer are. IOW, >>>> this is a latent bug, and xnpod_stop_timer should be fixed. >>>> >>> Commit 884 should do that. >>> >> Sorry for replying late: nope, this has no influence on our issue. >> >> Well, someone put that damn piece of hardware on my desk, saying: "It >> doesn't work." What he did not say is that there are multiple issues >> contained :-/. 
I found and fixed (patch will follow) a severe bug in the >> 16550A driver, but the strange timer issue stays (though it's still >> tricky to reproduce). >> >> The point is - and that's likely why your patch doesn't help - that we >> do not stop the system timer, i.e. unload all skins. We just terminate >> an application. I did some research but failed to find a test case (only >> our software "manages" to trigger this). Actually, it seems the hardware >> timer is no longer working, because also other RT-tasks no longer time >> out. Moreover, I checked nkpod->htimer.status, but it remains 0 all the >> time. I need more time... >> > > Attached is an ipipe-freeze of the frozen system. It's taken at the time F***, the usual "see [not-attached] attachment". [-- Attachment #1.2: frozen-timer2.bz2 --] [-- Type: application/octet-stream, Size: 18756 bytes --] [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 250 bytes --] ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Xenomai-core] Frozen timer IRQ
  2006-04-05 21:56 ` Jan Kiszka
  2006-04-05 21:58 ` Jan Kiszka
@ 2006-04-06 15:04 ` Philippe Gerum
  2006-04-06 15:29 ` Jan Kiszka
  1 sibling, 1 reply; 21+ messages in thread
From: Philippe Gerum @ 2006-04-06 15:04 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai-core

Jan Kiszka wrote:
>
> Attached is an ipipe-freeze of the frozen system. It's taken at the time
> the main thread of the terminating application has successfully
> rt_task_join'ed the last remaining RT-thread. I took 2000 trace points
> before and after that point and additionally instrumented
> rthal_timer_program_shot() (special trace 0x01, the argument is the
> delay). The interesting stuff happens around 600 us after the freeze: it
> seems the scheduled Linux timer arrives then but doesn't get much
> attention beyond from ipipe.
>
> Any idea what to look for next? I have a "perfect" test system now,
> though I still see no light at the end of the tunnel how to export it to
> other boxes.
>
> Enough for today.
>
> Jan
>
>
> PS: This trace was taken over 2.6.15 to exclude any issues with the new
> 2.6.16. Both kernels show the same effect.
>

Does this patch make any difference?

--- ipipe-root.c~ 2006-01-31 09:55:44.000000000 +0100
+++ ipipe-root.c 2006-04-06 17:01:49.000000000 +0200
@@ -328,9 +328,8 @@
         /* Only sync virtual IRQs here, so that we don't recurse
            indefinitely in case of an external interrupt flood. */
 
-        if ((ipipe_root_domain->cpudata[cpuid].
-             irq_pending_hi & IPIPE_IRQMASK_VIRT) != 0)
-                __ipipe_sync_stage(IPIPE_IRQMASK_VIRT);
+        if (ipipe_root_domain->cpudata[cpuid].irq_pending_hi != 0)
+                __ipipe_sync_stage(IPIPE_IRQMASK_ANY);
     }
 #ifdef CONFIG_IPIPE_TRACE_IRQSOFF
     ipipe_trace_end(0x8000000D);

--
Philippe.

^ permalink raw reply [flat|nested] 21+ messages in thread
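[Editorial sketch] The experimental patch above widens a pending-interrupt check from "virtual IRQs only" to "any IRQ". The effect of such a mask can be shown with a deliberately simplified flat bitmask; the layout below (low half hardware IRQs, high half virtual IRQs) and the `DEMO_` names are assumptions made for illustration, not the pipeline's actual pending-word encoding.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: low 16 bits = hardware IRQs, high 16 bits = virtual IRQs. */
#define DEMO_MASK_ANY  0xffffffffu
#define DEMO_MASK_VIRT 0xffff0000u

/* Returns 1 if a sync pass restricted to `mask` would find pending work. */
static int demo_needs_sync(uint32_t pending, uint32_t mask)
{
    assert(mask != 0);    /* an empty mask would make the check meaningless */
    return (pending & mask) != 0;
}
```

In this simplified model a pending hardware IRQ (for instance bit 0, a timer tick) is invisible to the virtual-only check and visible to the any-IRQ check, which is the behavioural difference the patch was probing for.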
* Re: [Xenomai-core] Frozen timer IRQ
  2006-04-06 15:04 ` Philippe Gerum
@ 2006-04-06 15:29 ` Jan Kiszka
  2006-04-06 15:39 ` Philippe Gerum
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Kiszka @ 2006-04-06 15:29 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: xenomai-core

[-- Attachment #1: Type: text/plain, Size: 1721 bytes --]

Philippe Gerum wrote:
> Jan Kiszka wrote:
>>
>> Attached is an ipipe-freeze of the frozen system. It's taken at the time
>> the main thread of the terminating application has successfully
>> rt_task_join'ed the last remaining RT-thread. I took 2000 trace points
>> before and after that point and additionally instrumented
>> rthal_timer_program_shot() (special trace 0x01, the argument is the
>> delay). The interesting stuff happens around 600 us after the freeze: it
>> seems the scheduled Linux timer arrives then but doesn't get much
>> attention beyond from ipipe.
>>
>> Any idea what to look for next? I have a "perfect" test system now,
>> though I still see no light at the end of the tunnel how to export it to
>> other boxes.
>>
>> Enough for today.
>>
>> Jan
>>
>>
>> PS: This trace was taken over 2.6.15 to exclude any issues with the new
>> 2.6.16. Both kernels show the same effect.
>>
>
> Does this patch make any difference?
>
> --- ipipe-root.c~ 2006-01-31 09:55:44.000000000 +0100
> +++ ipipe-root.c 2006-04-06 17:01:49.000000000 +0200
> @@ -328,9 +328,8 @@
>          /* Only sync virtual IRQs here, so that we don't recurse
>             indefinitely in case of an external interrupt flood. */
>
> -        if ((ipipe_root_domain->cpudata[cpuid].
> -             irq_pending_hi & IPIPE_IRQMASK_VIRT) != 0)
> -                __ipipe_sync_stage(IPIPE_IRQMASK_VIRT);
> +        if (ipipe_root_domain->cpudata[cpuid].irq_pending_hi != 0)
> +                __ipipe_sync_stage(IPIPE_IRQMASK_ANY);
>      }
>  #ifdef CONFIG_IPIPE_TRACE_IRQSOFF
>      ipipe_trace_end(0x8000000D);

Nope.

Where should I put my finger on to find out what's happening?

Jan

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 250 bytes --]

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Xenomai-core] Frozen timer IRQ
  2006-04-06 15:29 ` Jan Kiszka
@ 2006-04-06 15:39 ` Philippe Gerum
  2006-04-06 15:46 ` Jan Kiszka
  0 siblings, 1 reply; 21+ messages in thread
From: Philippe Gerum @ 2006-04-06 15:39 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai-core

Jan Kiszka wrote:
> Philippe Gerum wrote:
>
>>Jan Kiszka wrote:
>>
>>>Attached is an ipipe-freeze of the frozen system. It's taken at the time
>>>the main thread of the terminating application has successfully
>>>rt_task_join'ed the last remaining RT-thread. I took 2000 trace points
>>>before and after that point and additionally instrumented
>>>rthal_timer_program_shot() (special trace 0x01, the argument is the
>>>delay). The interesting stuff happens around 600 us after the freeze: it
>>>seems the scheduled Linux timer arrives then but doesn't get much
>>>attention beyond from ipipe.
>>>
>>>Any idea what to look for next? I have a "perfect" test system now,
>>>though I still see no light at the end of the tunnel how to export it to
>>>other boxes.
>>>
>>>Enough for today.
>>>
>>>Jan
>>>
>>>
>>>PS: This trace was taken over 2.6.15 to exclude any issues with the new
>>>2.6.16. Both kernels show the same effect.
>>>
>>
>>Does this patch make any difference?
>>
>>--- ipipe-root.c~ 2006-01-31 09:55:44.000000000 +0100
>>+++ ipipe-root.c 2006-04-06 17:01:49.000000000 +0200
>>@@ -328,9 +328,8 @@
>>         /* Only sync virtual IRQs here, so that we don't recurse
>>            indefinitely in case of an external interrupt flood. */
>>
>>-        if ((ipipe_root_domain->cpudata[cpuid].
>>-             irq_pending_hi & IPIPE_IRQMASK_VIRT) != 0)
>>-                __ipipe_sync_stage(IPIPE_IRQMASK_VIRT);
>>+        if (ipipe_root_domain->cpudata[cpuid].irq_pending_hi != 0)
>>+                __ipipe_sync_stage(IPIPE_IRQMASK_ANY);
>>     }
>> #ifdef CONFIG_IPIPE_TRACE_IRQSOFF
>>     ipipe_trace_end(0x8000000D);
>
>
> Nope.

That's good news, actually. I would have been quite embarrassed if it did
it.

>
> Where should I put my finger on to find out what's happening?
>

It seems that the pipeline log is not synced by __ipipe_unstall_iret_root.
We need to know why. Question: is the root stage stalled or unstalled by this
routine during the latest call before the box freezes?

PS: it would be nice to display the status of the current stage
(stalled/unstalled) and the one of the hw interrupt bit, for each trace.

> Jan
>

--
Philippe.

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Xenomai-core] Frozen timer IRQ
  2006-04-06 15:39 ` Philippe Gerum
@ 2006-04-06 15:46 ` Jan Kiszka
  2006-04-06 17:15 ` Philippe Gerum
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Kiszka @ 2006-04-06 15:46 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: xenomai-core

[-- Attachment #1: Type: text/plain, Size: 2691 bytes --]

Philippe Gerum wrote:
> Jan Kiszka wrote:
>> Philippe Gerum wrote:
>>
>>> Jan Kiszka wrote:
>>>
>>>> Attached is an ipipe-freeze of the frozen system. It's taken at the
>>>> time
>>>> the main thread of the terminating application has successfully
>>>> rt_task_join'ed the last remaining RT-thread. I took 2000 trace points
>>>> before and after that point and additionally instrumented
>>>> rthal_timer_program_shot() (special trace 0x01, the argument is the
>>>> delay). The interesting stuff happens around 600 us after the
>>>> freeze: it
>>>> seems the scheduled Linux timer arrives then but doesn't get much
>>>> attention beyond from ipipe.
>>>>
>>>> Any idea what to look for next? I have a "perfect" test system now,
>>>> though I still see no light at the end of the tunnel how to export
>>>> it to
>>>> other boxes.
>>>>
>>>> Enough for today.
>>>>
>>>> Jan
>>>>
>>>>
>>>> PS: This trace was taken over 2.6.15 to exclude any issues with the new
>>>> 2.6.16. Both kernels show the same effect.
>>>>
>>>
>>> Does this patch make any difference?
>>>
>>> --- ipipe-root.c~ 2006-01-31 09:55:44.000000000 +0100
>>> +++ ipipe-root.c 2006-04-06 17:01:49.000000000 +0200
>>> @@ -328,9 +328,8 @@
>>>          /* Only sync virtual IRQs here, so that we don't recurse
>>>             indefinitely in case of an external interrupt flood. */
>>>
>>> -        if ((ipipe_root_domain->cpudata[cpuid].
>>> -             irq_pending_hi & IPIPE_IRQMASK_VIRT) != 0)
>>> -                __ipipe_sync_stage(IPIPE_IRQMASK_VIRT);
>>> +        if (ipipe_root_domain->cpudata[cpuid].irq_pending_hi != 0)
>>> +                __ipipe_sync_stage(IPIPE_IRQMASK_ANY);
>>>      }
>>>  #ifdef CONFIG_IPIPE_TRACE_IRQSOFF
>>>      ipipe_trace_end(0x8000000D);
>>
>>
>> Nope.
>
> That's good news, actually. I would have been quite embarrassed if it did
> it.
>
>>
>> Where should I put my finger on to find out what's happening?
>>
>
> It seems that the pipeline log is not synced by __ipipe_unstall_iret_root.
> We need to know why. Question: is the root stage stalled or unstalled by
> this
> routine during the latest call before the box freezes?

I'm currently switching my brain between too many tasks: Could you simply
tell me what variable to check so that I can hack some
ipipe_trace_special into the kernel?

>
> PS: it would be nice to display the status of the current stage
> (stalled/unstalled) and the one of the hw interrupt bit, for each trace.

Patches are welcome :) - wait, you are the Adeos maintainer! ;)

Actually, the hw-irq state is already expressed by "|" at the head of
each line ("|" means "hw-IRQs off").

Jan

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 250 bytes --]

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Xenomai-core] Frozen timer IRQ
  2006-04-06 15:46 ` Jan Kiszka
@ 2006-04-06 17:15 ` Philippe Gerum
  2006-04-07 11:57 ` Jan Kiszka
  2006-04-07 13:02 ` Jan Kiszka
  0 siblings, 2 replies; 21+ messages in thread
From: Philippe Gerum @ 2006-04-06 17:15 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai-core

Jan Kiszka wrote:
> Philippe Gerum wrote:
>
>>Jan Kiszka wrote:
>>
>>>Philippe Gerum wrote:
>>>
>>>
>>>>Jan Kiszka wrote:
>>>>
>>>>
>>>>>Attached is an ipipe-freeze of the frozen system. It's taken at the
>>>>>time
>>>>>the main thread of the terminating application has successfully
>>>>>rt_task_join'ed the last remaining RT-thread. I took 2000 trace points
>>>>>before and after that point and additionally instrumented
>>>>>rthal_timer_program_shot() (special trace 0x01, the argument is the
>>>>>delay). The interesting stuff happens around 600 us after the
>>>>>freeze: it
>>>>>seems the scheduled Linux timer arrives then but doesn't get much
>>>>>attention beyond from ipipe.
>>>>>
>>>>>Any idea what to look for next? I have a "perfect" test system now,
>>>>>though I still see no light at the end of the tunnel how to export
>>>>>it to
>>>>>other boxes.
>>>>>
>>>>>Enough for today.
>>>>>
>>>>>Jan
>>>>>
>>>>>
>>>>>PS: This trace was taken over 2.6.15 to exclude any issues with the new
>>>>>2.6.16. Both kernels show the same effect.
>>>>>
>>>>
>>>>Does this patch make any difference?
>>>>
>>>>--- ipipe-root.c~ 2006-01-31 09:55:44.000000000 +0100
>>>>+++ ipipe-root.c 2006-04-06 17:01:49.000000000 +0200
>>>>@@ -328,9 +328,8 @@
>>>>         /* Only sync virtual IRQs here, so that we don't recurse
>>>>            indefinitely in case of an external interrupt flood. */
>>>>
>>>>-        if ((ipipe_root_domain->cpudata[cpuid].
>>>>-             irq_pending_hi & IPIPE_IRQMASK_VIRT) != 0)
>>>>-                __ipipe_sync_stage(IPIPE_IRQMASK_VIRT);
>>>>+        if (ipipe_root_domain->cpudata[cpuid].irq_pending_hi != 0)
>>>>+                __ipipe_sync_stage(IPIPE_IRQMASK_ANY);
>>>>     }
>>>>#ifdef CONFIG_IPIPE_TRACE_IRQSOFF
>>>>     ipipe_trace_end(0x8000000D);
>>>
>>>
>>>Nope.
>>
>>That's good news, actually. I would have been quite embarrassed if it did
>>it.
>>
>>
>>>Where should I put my finger on to find out what's happening?
>>>
>>
>>It seems that the pipeline log is not synced by __ipipe_unstall_iret_root.
>>We need to know why. Question: is the root stage stalled or unstalled by
>>this
>>routine during the latest call before the box freezes?
>
>
> I'm currently switching my brain between too many tasks: Could you simply
> tell me what variable to check so that I can hack some
> ipipe_trace_special into the kernel?

The value of the IPIPE_STALL_FLAG for the root domain upon exit from
__ipipe_unstall_iret_root.

>
>
>>PS: it would be nice to display the status of the current stage
>>(stalled/unstalled) and the one of the hw interrupt bit, for each trace.
>
>
> Patches are welcome :) - wait, you are the Adeos maintainer! ;)
>
> Actually, the hw-irq state is already expressed by "|" at the head of
> each line ("|" means "hw-IRQs off").
>

Ok, I'm rather short in time too, so let's drop this for now and keep it
on the todo list so that we get back to this when time allows.

--
Philippe.

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Xenomai-core] Frozen timer IRQ
  2006-04-06 17:15 ` Philippe Gerum
@ 2006-04-07 11:57 ` Jan Kiszka
  2006-04-07 13:02 ` Jan Kiszka
  1 sibling, 0 replies; 21+ messages in thread
From: Jan Kiszka @ 2006-04-07 11:57 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: xenomai-core

[-- Attachment #1: Type: text/plain, Size: 2749 bytes --]

Philippe Gerum wrote:
> Jan Kiszka wrote:
>> Philippe Gerum wrote:
>>
>>> Jan Kiszka wrote:
>>>
>>>> Philippe Gerum wrote:
>>>>
>>>>
>>>>> Jan Kiszka wrote:
>>>>>
>>>>>
>>>>>> Attached is an ipipe-freeze of the frozen system. It's taken at the
>>>>>> time
>>>>>> the main thread of the terminating application has successfully
>>>>>> rt_task_join'ed the last remaining RT-thread. I took 2000 trace
>>>>>> points
>>>>>> before and after that point and additionally instrumented
>>>>>> rthal_timer_program_shot() (special trace 0x01, the argument is the
>>>>>> delay). The interesting stuff happens around 600 us after the
>>>>>> freeze: it
>>>>>> seems the scheduled Linux timer arrives then but doesn't get much
>>>>>> attention beyond from ipipe.
>>>>>>
>>>>>> Any idea what to look for next? I have a "perfect" test system now,
>>>>>> though I still see no light at the end of the tunnel how to export
>>>>>> it to
>>>>>> other boxes.
>>>>>>
>>>>>> Enough for today.
>>>>>>
>>>>>> Jan
>>>>>>
>>>>>>
>>>>>> PS: This trace was taken over 2.6.15 to exclude any issues with
>>>>>> the new
>>>>>> 2.6.16. Both kernels show the same effect.
>>>>>>
>>>>>
>>>>> Does this patch make any difference?
>>>>>
>>>>> --- ipipe-root.c~ 2006-01-31 09:55:44.000000000 +0100
>>>>> +++ ipipe-root.c 2006-04-06 17:01:49.000000000 +0200
>>>>> @@ -328,9 +328,8 @@
>>>>>          /* Only sync virtual IRQs here, so that we don't recurse
>>>>>             indefinitely in case of an external interrupt flood. */
>>>>>
>>>>> -        if ((ipipe_root_domain->cpudata[cpuid].
>>>>> -             irq_pending_hi & IPIPE_IRQMASK_VIRT) != 0)
>>>>> -                __ipipe_sync_stage(IPIPE_IRQMASK_VIRT);
>>>>> +        if (ipipe_root_domain->cpudata[cpuid].irq_pending_hi != 0)
>>>>> +                __ipipe_sync_stage(IPIPE_IRQMASK_ANY);
>>>>>      }
>>>>>  #ifdef CONFIG_IPIPE_TRACE_IRQSOFF
>>>>>      ipipe_trace_end(0x8000000D);
>>>>
>>>>
>>>> Nope.
>>>
>>> That's good news, actually. I would have been quite embarrassed if it did
>>> it.
>>>
>>>
>>>> Where should I put my finger on to find out what's happening?
>>>>
>>>
>>> It seems that the pipeline log is not synced by
>>> __ipipe_unstall_iret_root.
>>> We need to know why. Question: is the root stage stalled or unstalled by
>>> this
>>> routine during the latest call before the box freezes?
>>
>>
>> I'm currently switching my brain between too many tasks: Could you simply
>> tell me what variable to check so that I can hack some
>> ipipe_trace_special into the kernel?
>
> The value of the IPIPE_STALL_FLAG for the root domain upon exit from
> __ipipe_unstall_iret_root.

ipipe_root_domain->cpudata[cpuid].status is 0 on return.

Jan

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 250 bytes --]

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Xenomai-core] Frozen timer IRQ
From: Jan Kiszka @ 2006-04-07 13:02 UTC
To: Philippe Gerum; +Cc: xenomai-core

Philippe Gerum wrote:
> Jan Kiszka wrote:
>> Philippe Gerum wrote:
>>> ...
>>> It seems that the pipeline log is not synced by
>>> __ipipe_unstall_iret_root. We need to know why. Question: is the root
>>> stage stalled or unstalled by this routine during the latest call
>>> before the box freezes?
>>
>> I'm currently switching my brain between too many tasks: Could you simply
>> tell me what variable to check so that I can hack some
>> ipipe_trace_special into the kernel?
>
> The value of the IPIPE_STALL_FLAG for the root domain upon exit from
> __ipipe_unstall_iret_root.
>

The problem seems to be the stalled Xenomai domain:

>  fn                  1917  3.503  cond_resched+0x9 (console_conditional_schedule+0x16)
> |fn                  1921  2.706  __ipipe_handle_irq+0xe (common_interrupt+0x18)
> |fn                  1923  1.548  __ipipe_ack_common_irq+0x9 (__ipipe_handle_irq+0xc0)
> |fn                  1925  4.390  mask_and_ack_8259A+0xb (__ipipe_ack_common_irq+0x47)
> |(0x20)  0x00000000  1929  0.796  __ipipe_handle_irq+0x144 (common_interrupt+0x18)
> |(0x30)  0x00000064  1930  0.766  __ipipe_handle_irq+0x15c (common_interrupt+0x18)
> |(0x31)  0x00000064  1931  0.812  __ipipe_handle_irq+0x169 (common_interrupt+0x18)
> |(0x32)  0x000000c8  1932  0.766  __ipipe_handle_irq+0x17e (common_interrupt+0x18)
> |(0x32)  0x00000001  1932  0.781  __ipipe_handle_irq+0x188 (common_interrupt+0x18)
> |(0x21)  0x00000000  1933  1.383  __ipipe_handle_irq+0x208 (common_interrupt+0x18)
> |fn                  1934  1.413  __ipipe_stall_root+0x8 (resume_kernel+0x5)
>  fn                  1936  1.052  __ipipe_unstall_iret_root+0x8 (restore_raw+0x0)
> |(0x11)  0x00000000  1937  0.932  __ipipe_unstall_iret_root+0x31 (restore_raw+0x0)
> |(0x03)  0x00000000  1938  1.774  __ipipe_unstall_iret_root+0x64 (restore_raw+0x0)
>  fn                  1940  0.736  console_conditional_schedule+0x8 (fbcon_redraw+0xdf)

This was taken during the failing Linux timer tick with the attached
instrumentation hack.

BTW, that trace hacking reminds me that we should really think about
making a kernel debugger run. I recently noticed that latest kgdb
applied with a single failing hunk on top of ipipe (2.6.15, x86). Maybe
it is just about making kgdb's irq-locks ipipe-aware and bypassing the
ipipe for int3 and the serial IRQ (so that ipipe can be debugged as
well) and catching the relevant exceptions. Hmm, the debugger seems to
get initialised in the "early" stage. Is this before or after ipipe setup?

Jan

[-- Attachment #1.2: ipipe-root-instr.patch --]

--- arch/i386/kernel/ipipe-root.c.orig 2006-04-05 23:13:45.000000000 +0200
+++ arch/i386/kernel/ipipe-root.c 2006-04-07 14:35:30.000000000 +0200
@@ -315,11 +315,13 @@ asmlinkage void __ipipe_unstall_iret_roo
        emulation. */
 
     if (!(regs.eflags & X86_EFLAGS_IF)) {
+ipipe_trace_special(0x10, 0);
         __set_bit(IPIPE_STALL_FLAG,
                   &ipipe_root_domain->cpudata[cpuid].status);
         ipipe_mark_domain_stall(ipipe_root_domain, cpuid);
         regs.eflags |= X86_EFLAGS_IF;
     } else {
+ipipe_trace_special(0x11, 0);
         __clear_bit(IPIPE_STALL_FLAG,
                     &ipipe_root_domain->cpudata[cpuid].status);
 
@@ -335,6 +337,7 @@ asmlinkage void __ipipe_unstall_iret_roo
 #ifdef CONFIG_IPIPE_TRACE_IRQSOFF
     ipipe_trace_end(0x8000000D);
 #endif /* CONFIG_IPIPE_TRACE_IRQSOFF */
+ipipe_trace_special(0x03, ipipe_root_domain->cpudata[cpuid].status);
 }
 
 asmlinkage int __ipipe_syscall_root(struct pt_regs regs)
@@ -457,20 +460,26 @@ fastcall int __ipipe_divert_exception(st
 static inline void __ipipe_walk_pipeline(struct list_head *pos, int cpuid)
 {
     struct ipipe_domain *this_domain = ipipe_percpu_domain[cpuid];
+ipipe_trace_special(0x30, ipipe_root_domain->priority);
+ipipe_trace_special(0x31, this_domain->priority);
 
     while (pos != &__ipipe_pipeline) {
         struct ipipe_domain *next_domain =
             list_entry(pos, struct ipipe_domain, p_link);
+ipipe_trace_special(0x32, next_domain->priority);
+ipipe_trace_special(0x32, next_domain->cpudata[cpuid].status);
 
         if (test_bit
             (IPIPE_STALL_FLAG, &next_domain->cpudata[cpuid].status))
             break; /* Stalled stage -- do not go further. */
+ipipe_trace_special(0x34, 0);
 
         if (next_domain->cpudata[cpuid].irq_pending_hi != 0) {
 
             if (next_domain == this_domain)
                 __ipipe_sync_stage(IPIPE_IRQMASK_ANY);
             else {
+ipipe_trace_special(0x35, 0);
                 __ipipe_switch_to(this_domain, next_domain,
                                   cpuid);
 
@@ -483,6 +492,7 @@ static inline void __ipipe_walk_pipeline
                 __ipipe_sync_stage(IPIPE_IRQMASK_ANY);
             }
 
+ipipe_trace_special(0x36, 0);
             break;
         } else if (next_domain == this_domain)
             break;
@@ -587,7 +597,9 @@ int __ipipe_handle_irq(struct pt_regs re
        marked as 'sticky'. This search does not go beyond the
        current domain in the pipeline. */
 
+ipipe_trace_special(0x20, 0);
     __ipipe_walk_pipeline(head, cpuid);
+ipipe_trace_special(0x21, 0);
 
     ipipe_load_cpuid();
 
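For anyone reading the trace above without the I-pipe sources at hand, here is a minimal user-space sketch of the walk that the instrumented __ipipe_walk_pipeline() performs. It is only a model under stated assumptions: struct domain, walk_pipeline() and root_tick_delivered() are hypothetical names invented for illustration, not I-pipe code, and the priorities 200 and 100 are simply the values 0xc8 and 0x64 reported by the 0x32 and 0x30 trace points. It shows how a stalled stage ahead of Linux makes the walk stop before the logged tick is played.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for an I-pipe domain; NOT the real struct ipipe_domain. */
struct domain {
    const char *name;
    int priority;        /* higher value = earlier in the pipeline */
    int stalled;         /* models IPIPE_STALL_FLAG in cpudata[].status */
    unsigned pending;    /* models irq_pending_hi: non-zero if IRQs are logged */
    unsigned delivered;  /* how many times pending IRQs were played here */
};

/*
 * Walk the pipeline from its head, mirroring the control flow shown in the
 * patch context: stop at the first stalled stage, otherwise play pending
 * IRQs and stop, and never go beyond the current domain.
 * Returns the index of the stage where the walk stopped.
 */
size_t walk_pipeline(struct domain *pipe, size_t n, size_t current)
{
    for (size_t i = 0; i < n; i++) {
        if (pipe[i].stalled)
            return i;              /* stalled stage: do not go further */
        if (pipe[i].pending) {
            pipe[i].delivered++;   /* stands in for __ipipe_sync_stage() */
            pipe[i].pending = 0;
            return i;
        }
        if (i == current)
            return i;
    }
    return n;
}

/* Returns 1 if a timer tick already logged for Linux gets delivered. */
int root_tick_delivered(int xeno_stalled)
{
    struct domain pipe[2] = {
        { "Xenomai", 200, xeno_stalled, 0, 0 },
        { "Linux",   100, 0,            1, 0 },
    };

    walk_pipeline(pipe, 2, 1);     /* Linux (index 1) is the current domain */
    return pipe[1].delivered == 1;
}
```

With xeno_stalled set to 1 the walk returns at index 0, which matches the trace: the second 0x32 point reports status 1 for the priority-200 domain and the 0x34 point is never reached.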
* Re: [Xenomai-core] Frozen timer IRQ
From: Philippe Gerum @ 2006-04-07 16:28 UTC
To: Jan Kiszka; +Cc: xenomai-core

Jan Kiszka wrote:
>
> BTW, that trace hacking reminds me that we should really think about
> making a kernel debugger run. I recently noticed that latest kgdb
> applied with a single failing hunk on top of ipipe (2.6.15, x86). Maybe
> it is just about making kgdb's irq-locks ipipe-aware and bypassing the
> ipipe for int3 and the serial IRQ (so that ipipe can be debugged as
> well) and catching the relevant exceptions. Hmm, the debugger seems to
> get initialised in the "early" stage. Is this before or after ipipe setup?
>

It depends. If "kgdbwait" is set in the bootargs to halt the kernel waiting
for the remote GDB to connect to the target, kgdb starts before the ipipe.
Otherwise, it's a late init, and kgdb starts after the ipipe is fully
initialized.

--

Philippe.
* Re: [Xenomai-core] Frozen timer IRQ
From: Philippe Gerum @ 2006-04-07 16:39 UTC
To: Philippe Gerum; +Cc: Jan Kiszka, xenomai-core

Philippe Gerum wrote:
> Jan Kiszka wrote:
>
>> BTW, that trace hacking reminds me that we should really think about
>> making a kernel debugger run. I recently noticed that latest kgdb
>> applied with a single failing hunk on top of ipipe (2.6.15, x86). Maybe
>> it is just about making kgdb's irq-locks ipipe-aware and bypassing the
>> ipipe for int3 and the serial IRQ (so that ipipe can be debugged as
>> well) and catching the relevant exceptions. Hmm, the debugger seems to
>> get initialised in the "early" stage. Is this before or after ipipe
>> setup?
>>
>
> It depends. If "kgdbwait" is set in the bootargs to halt the kernel
> waiting for the remote GDB to connect to the target, kgdb starts before
> the ipipe. Otherwise, it's a late init, and kgdb starts after the ipipe
> is fully initialized.
>

Basically, kgdb could start before the i-pipe as soon as a breakpoint is hit
before the latter is enabled in init/main.c.

--

Philippe.
* Re: [Xenomai-core] Frozen timer IRQ - now traced with kgdb :)
From: Jan Kiszka @ 2006-04-07 18:00 UTC
To: Philippe Gerum; +Cc: xenomai-core

Philippe Gerum wrote:
> Philippe Gerum wrote:
>> Jan Kiszka wrote:
>>
>>> BTW, that trace hacking reminds me that we should really think about
>>> making a kernel debugger run. I recently noticed that latest kgdb
>>> applied with a single failing hunk on top of ipipe (2.6.15, x86). Maybe
>>> it is just about making kgdb's irq-locks ipipe-aware and bypassing the
>>> ipipe for int3 and the serial IRQ (so that ipipe can be debugged as
>>> well) and catching the relevant exceptions. Hmm, the debugger seems to
>>> get initialised in the "early" stage. Is this before or after ipipe
>>> setup?
>>>
>>
>> It depends. If "kgdbwait" is set in the bootargs to halt the kernel
>> waiting for the remote GDB to connect to the target, kgdb starts
>> before the ipipe. Otherwise, it's a late init, and kgdb starts after
>> the ipipe is fully initialized.
>>
>
> Basically, kgdb could start before the i-pipe as soon as a breakpoint is
> hit before the latter is enabled in init/main.c.
>

Yep, I dug deeper meanwhile and also came across this.

I already have a trivial hack running here. The most tricky part for me
was to learn quilt, but now I start to love it :). Here is a snapshot
series for 2.6.15.5:

<kgdb series from CVS>
prepare-ipipe-x86.patch
adeos-ipipe-2.6.15-i386-1.2-01.patch
kgdb-ipipe-x86.patch

I'm currently wondering if it makes sense to register a kgdb domain and
"officially" capture all involved IRQs and events. So far the serial
line IRQ is hard-coded (should be retrieved from some internal kgdb
structure later). Anyway, it seems to work quite well, I'm currently
stepping through a network IRQ at ipipe-level.

While playing with this tool a bit, displaying the ipipe structures,
and thinking about the original problem again, I wondered what could
cause a temporary (as I think I have found out now) stalled xeno domain
without locking up the system? Some irq-lock leaks at driver level (i.e.
inside our own code)?

Jan

[-- Attachment #1.2: kgdb-ipipe-x86.patch --]

Index: linux-2.6.15.5/arch/i386/kernel/entry.S
===================================================================
--- linux-2.6.15.5.orig/arch/i386/kernel/entry.S 2006-04-07 16:53:39.000000000 +0200
+++ linux-2.6.15.5/arch/i386/kernel/entry.S 2006-04-07 16:53:40.000000000 +0200
@@ -194,7 +194,7 @@
 
 .previous
 
-ENTRY(ret_from_fork)
+KPROBE_ENTRY(ret_from_fork)
     STI_COND_HW
     pushl %eax
     call schedule_tail
@@ -582,7 +582,7 @@
     PUSH_XCODE(do_simd_coprocessor_error)
     jmp error_code
 
-ENTRY(device_not_available)
+KPROBE_ENTRY(device_not_available)
     pushl $-1 # mark this as an int
     SAVE_ALL
     DIVERT_EXCEPTION(device_not_available)
@@ -767,7 +767,7 @@
     jmp error_code
 #endif
 
-ENTRY(spurious_interrupt_bug)
+KPROBE_ENTRY(spurious_interrupt_bug)
     pushl $0
     PUSH_XCODE(do_spurious_interrupt_bug)
     jmp error_code
Index: linux-2.6.15.5/kernel/kgdb.c
===================================================================
--- linux-2.6.15.5.orig/kernel/kgdb.c 2006-04-07 16:30:51.000000000 +0200
+++ linux-2.6.15.5/kernel/kgdb.c 2006-04-07 16:57:35.000000000 +0200
@@ -740,7 +740,7 @@
     unsigned long flags;
     int processor;
 
-    local_irq_save(flags);
+    local_irq_save_hw(flags);
     processor = smp_processor_id();
     kgdb_info[processor].debuggerinfo = regs;
     kgdb_info[processor].task = current;
@@ -770,7 +770,7 @@
     /* Signal the master processor that we are done */
     atomic_set(&procindebug[processor], 0);
     spin_unlock(&slavecpulocks[processor]);
-    local_irq_restore(flags);
+    local_irq_restore_hw(flags);
 }
 #endif
 
@@ -1033,7 +1033,7 @@
      * Interrupts will be restored by the 'trap return' code, except when
      * single stepping.
      */
-    local_irq_save(flags);
+    local_irq_save_hw(flags);
 
     /* Hold debugger_active */
     procid = smp_processor_id();
@@ -1056,7 +1056,7 @@
     if (atomic_read(&cpu_doing_single_step) != -1 &&
         atomic_read(&cpu_doing_single_step) != procid) {
         atomic_set(&debugger_active, 0);
-        local_irq_restore(flags);
+        local_irq_restore_hw(flags);
         goto acquirelock;
     }
 
@@ -1556,7 +1556,7 @@
 kgdb_restore:
     /* Free debugger_active */
     atomic_set(&debugger_active, 0);
-    local_irq_restore(flags);
+    local_irq_restore_hw(flags);
 
     return error;
 }
@@ -1925,9 +1925,9 @@
     if (!kgdb_connected || atomic_read(&debugger_active) != 0)
         return 0;
     if ((code == SYS_RESTART) || (code == SYS_HALT) || (code == SYS_POWER_OFF)){
-        local_irq_save(flags);
+        local_irq_save_hw(flags);
         put_packet("X00");
-        local_irq_restore(flags);
+        local_irq_restore_hw(flags);
     }
     return NOTIFY_DONE;
 }
@@ -1942,9 +1942,9 @@
     if (!kgdb_connected || atomic_read(&debugger_active) != 0)
         return;
 
-    local_irq_save(flags);
+    local_irq_save_hw(flags);
     kgdb_msg_write(s, count);
-    local_irq_restore(flags);
+    local_irq_restore_hw(flags);
 }
 
 static struct console kgdbcons = {
Index: linux-2.6.15.5/arch/i386/kernel/ipipe-root.c
===================================================================
--- linux-2.6.15.5.orig/arch/i386/kernel/ipipe-root.c 2006-04-07 16:53:39.000000000 +0200
+++ linux-2.6.15.5/arch/i386/kernel/ipipe-root.c 2006-04-07 17:48:00.000000000 +0200
@@ -111,6 +111,15 @@
 
 #endif /* CONFIG_X86_LOCAL_APIC */
 
+#ifdef CONFIG_KGDB
+static struct ipipe_domain kgdb_domain;
+
+static void kgdb_domain_entry(void)
+{
+
+}
+#endif /* CONFIG_KGDB */
+
 /* __ipipe_enable_pipeline() -- We are running on the boot CPU, hw
    interrupts are off, and secondary CPUs are still lost in space. */
 
@@ -248,6 +257,10 @@
     ipipe_root_domain->irqs[IPIPE_SERVICE_IPI2].control &= ~IPIPE_SYSTEM_MASK;
     ipipe_root_domain->irqs[IPIPE_SERVICE_IPI3].control &= ~IPIPE_SYSTEM_MASK;
 #endif /* CONFIG_X86_LOCAL_APIC */
+
+#ifdef CONFIG_KGDB
+    ipipe_control_irq(4, 0, IPIPE_HANDLE_MASK|IPIPE_STICKY_MASK|IPIPE_SYSTEM_MASK);
+#endif /* CONFIG_KGDB */
 }
 
 static inline void __fixup_if(struct pt_regs *regs)

[-- Attachment #1.3: prepare-ipipe-x86.patch --]

Index: linux-2.6.15.5/arch/i386/kernel/entry.S
===================================================================
--- linux-2.6.15.5.orig/arch/i386/kernel/entry.S 2006-04-07 16:42:54.000000000 +0200
+++ linux-2.6.15.5/arch/i386/kernel/entry.S 2006-04-07 16:47:23.000000000 +0200
@@ -123,7 +123,7 @@
 
 .previous
 
-KPROBE_ENTRY(ret_from_fork)
+ENTRY(ret_from_fork)
     pushl %eax
     call schedule_tail
     GET_THREAD_INFO(%ebp)
@@ -470,7 +470,7 @@
     pushl $do_simd_coprocessor_error
     jmp error_code
 
-KPROBE_ENTRY(device_not_available)
+ENTRY(device_not_available)
     pushl $-1 # mark this as an int
     SAVE_ALL
     movl %cr0, %eax
@@ -652,7 +652,7 @@
     jmp error_code
 #endif
 
-KPROBE_ENTRY(spurious_interrupt_bug)
+ENTRY(spurious_interrupt_bug)
     pushl $0
     pushl $do_spurious_interrupt_bug
     jmp error_code
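The kgdb.c hunks in this message swap local_irq_save()/local_irq_restore() for their _hw variants. A rough user-space model of why that matters on a pipelined kernel follows. It rests on the assumption (from how the I-pipe virtualizes interrupt masking for Linux) that the plain calls only toggle the root domain's stall bit while the _hw calls touch the real CPU interrupt flag. All names here (struct cpu_model, the model_* helpers) are hypothetical and exist only for this sketch.

```c
#include <assert.h>

/* Toy model of one CPU under the I-pipe; not kernel code. */
struct cpu_model {
    int hw_irqs_on;    /* models the real CPU interrupt-enable flag */
    int root_stalled;  /* models the root domain's virtual stall bit */
};

/* Models local_irq_save() on a pipelined kernel: virtual masking only. */
int model_irq_save(struct cpu_model *c)
{
    int old = c->root_stalled;
    c->root_stalled = 1;
    return old;
}

void model_irq_restore(struct cpu_model *c, int old)
{
    c->root_stalled = old;
}

/* Models local_irq_save_hw(): really masks interrupts at the CPU. */
int model_irq_save_hw(struct cpu_model *c)
{
    int old = c->hw_irqs_on;
    c->hw_irqs_on = 0;
    return old;
}

void model_irq_restore_hw(struct cpu_model *c, int old)
{
    c->hw_irqs_on = old;
}

/* Can the CPU still take an interrupt (e.g. for a higher-priority domain)? */
int cpu_can_take_irq(const struct cpu_model *c)
{
    return c->hw_irqs_on;
}

/* Returns 1 if the CPU can still be interrupted inside the critical section. */
int interruptible_inside(int use_hw_variant)
{
    struct cpu_model c = { 1, 0 };
    int flags, inside;

    if (use_hw_variant) {
        flags = model_irq_save_hw(&c);
        inside = cpu_can_take_irq(&c);
        model_irq_restore_hw(&c, flags);
    } else {
        flags = model_irq_save(&c);
        inside = cpu_can_take_irq(&c);
        model_irq_restore(&c, flags);
    }
    return inside;
}
```

In this model, a debugger section protected only by the virtual variant could still be preempted by an interrupt destined for a higher-priority domain, which is presumably what the patch wants to rule out while kgdb owns the CPU.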
* Re: [Xenomai-core] Frozen timer IRQ - now traced with kgdb :)
From: Philippe Gerum @ 2006-04-09 9:40 UTC
To: Jan Kiszka; +Cc: xenomai-core

Jan Kiszka wrote:
>
> Yep, I dug deeper meanwhile and also came across this.
>
> I already have a trivial hack running here. The most tricky part for me
> was to learn quilt, but now I start to love it :). Here is a snapshot
> series for 2.6.15.5:
>
> <kgdb series from CVS>
> prepare-ipipe-x86.patch
> adeos-ipipe-2.6.15-i386-1.2-01.patch
> kgdb-ipipe-x86.patch
>

In order to ease patch maintenance, we should move the relevant portions of
this infrastructure to the I-pipe patch directly (i.e. I-pipe specific
kgdb-ipipe-* code).

> I'm currently wondering if it makes sense to register a kgdb domain and
> "officially" capture all involved IRQs and events. So far the serial
> line IRQ is hard-coded (should be retrieved from some internal kgdb
> structure later). Anyway, it seems to work quite well, I'm currently
> stepping through a network IRQ at ipipe-level.
>

Having a separate domain would allow breaking into any runaway code from
lower priority domains even with disabled interrupts, except the ipipe
itself. This said, pushing a domain on top of Xenomai would break the
assumption that hw interrupts are indeed disabled when operating due to the
'last domain optimization' feature, and introduce additional jitter. The
other option would be to install a KGDB 'redirector' in __ipipe_handle_irq
so that serial or network interrupts to KGDB would never be blocked by the
stall bit; I would actually prefer this one.

>
> While playing with this tool a bit, displaying the ipipe structures,
> and thinking about the original problem again, I wondered what could
> cause a temporary (as I think I have found out now) stalled xeno domain
> without locking up the system? Some irq-lock leaks at driver level (i.e.
> inside our own code)?
>

At first sight, it might be related to the way __ipipe_unstall_iret_root
operates. Basically, the idea is to make sure that the stall flag of the root
domain upon return from the pipelining process always reflects the state of
the hw interrupt flag at the time the processed event was taken by the CPU. It
seems that your testcase shows that under some circumstances, the root stage
might be spuriously left in a stalled state by __ipipe_unstall_iret_root.

--

Philippe.
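The rule Philippe describes for __ipipe_unstall_iret_root can be sketched in a few lines of stand-alone C. This is only a model following the branch structure visible in the instrumentation patch context earlier in the thread (set the stall bit when the saved EFLAGS had IF clear, clear it otherwise). struct root_state, MODEL_EFLAGS_IF and model_unstall_iret_root() are hypothetical names, and the real routine also syncs pending interrupts, which is left out here.

```c
#include <assert.h>

#define MODEL_EFLAGS_IF 0x200u  /* x86 interrupt-enable flag (bit 9) */

/* Toy stand-in for the root domain's per-CPU status; not I-pipe code. */
struct root_state {
    unsigned stalled;  /* models IPIPE_STALL_FLAG */
};

/*
 * On return from an event, make the virtual stall bit mirror the hardware
 * interrupt flag as it was when the event was taken. The real flag is then
 * always re-enabled, since masking for Linux is done virtually.
 * Returns the EFLAGS value the interrupted context is resumed with.
 */
unsigned model_unstall_iret_root(struct root_state *root, unsigned saved_eflags)
{
    if (!(saved_eflags & MODEL_EFLAGS_IF)) {
        root->stalled = 1;                /* context had IRQs off: stay stalled */
        saved_eflags |= MODEL_EFLAGS_IF;  /* but keep hw IRQs on */
    } else {
        root->stalled = 0;                /* context had IRQs on: unstall */
    }
    return saved_eflags;
}

/* Helper for experiments: resulting stall bit for a given saved EFLAGS. */
unsigned stall_after_iret(unsigned saved_eflags)
{
    struct root_state root = { 0 };

    model_unstall_iret_root(&root, saved_eflags);
    return root.stalled;
}
```

Under this model the root stage can only come out stalled if the saved EFLAGS image had IF clear, so a spurious stall would point at that saved image (or at a later writer of the flag) rather than at the branch itself. That is a reading of the sketch, not a diagnosis of the bug discussed here.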
* Re: [Xenomai-core] Frozen timer IRQ
From: Philippe Gerum @ 2006-04-06 17:10 UTC
To: Jan Kiszka; +Cc: xenomai-core

Jan Kiszka wrote:
> Philippe Gerum wrote:
>
>>Philippe Gerum wrote:
>>
>>>Gilles Chanteperdrix wrote:
>>>
>>>>Jan Kiszka wrote:
>>>> > Hi,
>>>> >
>>>> > my colleagues and I need some hint where to continue our search
>>>> > for the cause of a weird cleanup issue:
>>>> >
>>>> > An application of our robotics framework sometimes terminates
>>>> > (though successfully) in a way that the system timer IRQ no longer
>>>> > arrives afterwards or no re-program takes place anymore. All other
>>>> > Linux IRQs are fine (Ethernet, keyboard, etc.). I cannot provide an
>>>> > easy test case yet as besides the framework some expensive
>>>> > gyroscope and the 16550A driver are involved.
>>>>
>>>>I observed a similar issue when xnpod_stop_timer was called when
>>>>shutting down the posix skin. I assumed that the problem was that
>>>>xnpod_shutdown already called xnpod_stop_timer, so xnpod_stop_timer (and
>>>>in particular xnarch_stop_timer) ended up being called twice.
>>>>
>>>
>>>Err, sorry. Forget about my previous reply: xnarch_stop_timer is _not_
>>>protected by the XNTIMED flag, but only the last part of the
>>>housekeeping chores performed upon stopping the systimer are. IOW,
>>>this is a latent bug, and xnpod_stop_timer should be fixed.
>>>
>>
>>Commit 884 should do that.
>>
>
> Sorry for replying late: nope, this has no influence on our issue.
>

This fix was not intended to address this issue, but rather to clean up the
timer management code so that multiple releases as described by Gilles don't
cause havoc anymore, hopefully. So that's ok.

--

Philippe.
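As an illustration of the double-release problem Gilles describes, here is a small stand-alone sketch of a stop routine guarded entirely by a "timer running" flag, so that a second call becomes a no-op. It is an assumption-laden model, not the content of commit 884 (which is not shown in this thread): struct pod_model, MODEL_TIMED and the model_* functions are hypothetical, with MODEL_TIMED standing in for the nucleus XNTIMED bit and the counter standing in for the call to xnarch_stop_timer.

```c
#include <assert.h>

#define MODEL_TIMED 0x1u  /* stands in for the nucleus XNTIMED status bit */

/* Toy stand-in for the pod state; not Xenomai code. */
struct pod_model {
    unsigned status;
    int hw_timer_releases;  /* counts what would be xnarch_stop_timer() calls */
};

void model_start_timer(struct pod_model *p)
{
    p->status |= MODEL_TIMED;
}

/*
 * Idempotent stop: the flag is checked before ANY teardown step, so the
 * hardware release cannot run twice even if both the skin cleanup and the
 * pod shutdown path call this function.
 */
void model_stop_timer(struct pod_model *p)
{
    if (!(p->status & MODEL_TIMED))
        return;                 /* already stopped: nothing to release */
    p->status &= ~MODEL_TIMED;
    p->hw_timer_releases++;     /* hardware timer released exactly once */
}

/* Returns how many hardware releases happen for `stops` consecutive stops. */
int releases_after(int stops)
{
    struct pod_model p = { 0, 0 };

    model_start_timer(&p);
    for (int i = 0; i < stops; i++)
        model_stop_timer(&p);
    return p.hw_timer_releases;
}
```

The design point is simply that the guard covers the hardware release as well as the housekeeping, which is the gap Philippe points out above.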