From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <443A5212.7020001@domain.hid>
Date: Mon, 10 Apr 2006 14:39:46 +0200
From: Philippe Gerum <rpm@xenomai.org>
MIME-Version: 1.0
Subject: Re: [Xenomai-core] [BUG?] stalled xeno domain
References: <4437D2EE.4040902@domain.hid> <4437E704.4010607@domain.hid>
	<443A437B.6010403@domain.hid>
In-Reply-To: <443A437B.6010403@domain.hid>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
List-Id: "Xenomai life and development \(bug reports, patches,
	discussions\)" <xenomai.xenomai.org>
List-Unsubscribe: <https://mail.gna.org/listinfo/xenomai-core>,
	<mailto:xenomai-core-request@domain.hid>
List-Archive: </public/xenomai-core>
List-Post: <mailto:xenomai@xenomai.org>
List-Help: <mailto:xenomai-core-request@domain.hid>
List-Subscribe: <https://mail.gna.org/listinfo/xenomai-core>,
	<mailto:xenomai-core-request@domain.hid>
To: Jan Kiszka <jan.kiszka@domain.hid>
Cc: xenomai-core <xenomai@xenomai.org>

Jan Kiszka wrote:
> Jan Kiszka wrote:
> 
>>Jan Kiszka wrote:
>>
>>>Hi Philippe,
>>>
>>>debugging is nice, tracing is still nicer. As you suggested, I extended
>>>the tracer with per-domain stall flags (needs some output clean-up,
>>>preliminary patch attached nevertheless).
>>>
>>>And here is the result (full trace attached):
>>>
>>>
>>>>:|    (0x51) 0x000000c8 -1113+   1.112  __ipipe_sync_stage+0x142 (ipipe_suspend_domain+0x56)
>>>>:|    fn                -1112+   1.789  __ipipe_sync_stage+0xe (ipipe_suspend_domain+0x56)
>>>>:|   *(0x50) 0x00000064 -1110+   1.293  __ipipe_sync_stage+0x97 (ipipe_suspend_domain+0x56)
>>>>:    *fn                -1109+   1.398  do_IRQ+0x8 (__ipipe_sync_stage+0xcf)
>>>>:    *fn                -1107+   2.105  __do_IRQ+0xc (do_IRQ+0x21)
>>>>:    *fn                -1105+   1.631  handle_IRQ_event+0xd (__do_IRQ+0x9e)
>>>>:    *fn                -1104+   2.413  timer_interrupt+0x9 (handle_IRQ_event+0x40)
>>>>:    *fn                -1101+   3.022  mark_offset_tsc+0xe (timer_interrupt+0x31)
>>>>:    *fn                -1098+   2.721  do_timer+0x9 (timer_interrupt+0x37)
>>>>:|   *fn                -1096+   2.112  __ipipe_handle_irq+0xe (common_interrupt+0x18)
>>>>:|   *fn                -1093+   1.210  __ipipe_ack_common_irq+0xc (__ipipe_handle_irq+0xc0)
>>>>:|   *(0x50) 0x00000064 -1092+   1.218  __ipipe_ack_common_irq+0x31 (__ipipe_handle_irq+0xc0)
>>>>:|   *fn                -1091+   1.834  mask_and_ack_8259A+0xb (__ipipe_ack_common_irq+0x5d)
>>>>:|   *(0x50) 0x00000064 -1089+   1.345  __ipipe_ack_common_irq+0x9d (__ipipe_handle_irq+0xc0)
>>>>:|   *fn                -1088    0.924  ipipe_suspend_domain+0xb (__ipipe_handle_irq+0x174)
>>>>:|   *(0x51) 0x000000c8 -1087+   1.172  ipipe_suspend_domain+0x26 (__ipipe_handle_irq+0x174)
>>>>:|   *fn                -1086+   2.030  __ipipe_sync_stage+0xe (ipipe_suspend_domain+0x56)
>>>>:|  **(0x50) 0x000000c8 -1084+   1.330  __ipipe_sync_stage+0x97 (ipipe_suspend_domain+0x56)
>>>>:|  **fn                -1082    0.879  xnintr_clock_handler+0x8 (__ipipe_sync_stage+0x12b)
>>>>:|  **fn                -1082+   1.218  xnintr_irq_handler+0xb (xnintr_clock_handler+0x18)
>>>>:|  **fn                -1080+   1.233  xnpod_announce_tick+0x9 (xnintr_irq_handler+0x2a)
>>>>:|  **(0x50) 0x000000c8 -1079+   1.736  xnpod_announce_tick+0x20 (xnintr_irq_handler+0x2a)
>>>>:|  **fn                -1077+   2.984  xntimer_do_tick_aperiodic+0xe (xnpod_announce_tick+0x36)
>>>>:|  **fn                -1074+   2.751  xnthread_timeout_handler+0x8 (xntimer_do_tick_aperiodic+0x18d)
>>>>:|  **fn                -1072+   1.864  xnpod_resume_thread+0xe (xnthread_timeout_handler+0x1d)
>>>>:|  **(0x50) 0x000000c8 -1070+   4.699  xnpod_resume_thread+0x2b (xnthread_timeout_handler+0x1d)
>>>>:|  **(0x50) 0x000000c8 -1065+   1.586  xnpod_resume_thread+0x2bf (xnthread_timeout_handler+0x1d)
>>>>:|  **(0x01) 0x00001237 -1063+   2.661  xntimer_do_tick_aperiodic+0x309 (xnpod_announce_tick+0x36)
>>>>:|  **(0x50) 0x000000c8 -1061+   1.263  xnpod_announce_tick+0x4f (xnintr_irq_handler+0x2a)
>>>>:|  **fn                -1060+   1.255  rthal_irq_end+0x8 (xnintr_irq_handler+0x46)
>>>>:|  **fn                -1058+   2.007  enable_8259A_irq+0x9 (rthal_irq_end+0x2e)
>>>>:|  **fn                -1056+   1.924  xnpod_schedule+0xe (xnintr_irq_handler+0x6e)
>>>>:|  **(0x50) 0x000000c8 -1054!  15.368  xnpod_schedule+0x53 (xnintr_irq_handler+0x6e)
>>>>:|  **fn                -1039!  13.300  __switch_to+0xc (xnpod_schedule+0x689)
>>>>:|  **(0x50) 0x000000c8 -1026+   1.766  xnpod_schedule+0x951 (xnpod_suspend_thread+0x27c)
>>>>:|  **(0x50) 0x000000c8 -1024!  18.195  xnpod_suspend_thread+0x2bb (rt_task_sleep+0x54)
>>>>:   **fn                -1006+   1.624  __ipipe_syscall_root+0x9 (system_call+0x20)
>>>>:   **fn                -1004+   1.406  __ipipe_dispatch_event+0xe (__ipipe_syscall_root+0x58)
>>>>:   **fn                -1003+   4.323  hisyscall_event+0xe (__ipipe_dispatch_event+0x5e)
>>>>:   **fn                 -998+   1.398  __rt_task_sleep+0xa (hisyscall_event+0x23c)
>>>>:   **fn                 -997+   1.789  __copy_from_user_ll+0xa (__rt_task_sleep+0x1a)
>>>>:   **fn                 -995!  15.345  rt_task_sleep+0xa (__rt_task_sleep+0x25)
>>>>:   **fn                 -980    0.879  __ipipe_syscall_root+0x9 (system_call+0x20)
>>>>:   **fn                 -979+   1.308  __ipipe_dispatch_event+0xe (__ipipe_syscall_root+0x58)
>>>>:   **fn                 -978+   2.451  hisyscall_event+0xe (__ipipe_dispatch_event+0x5e)
>>>>:   **fn                 -975+   1.631  sys_rtdm_ioctl+0x8 (hisyscall_event+0x23c)
>>>>:   **fn                 -974+   1.255  _rtdm_ioctl+0xc (sys_rtdm_ioctl+0x1b)
>>>>:   **fn                 -972+   4.180  rtdm_context_get+0xa (_rtdm_ioctl+0x20)
>>>>:   **fn                 -968+   1.203  __ipipe_syscall_root+0x9 (system_call+0x20)
>>>>:   **fn                 -967+   1.345  __ipipe_dispatch_event+0xe (__ipipe_syscall_root+0x58)
>>>
>>>The '*' mark a stalled domain, root domain on the right. It seems to me
>>>that xnpod_suspend_thread() leaves the xeno domain stalled on wake up.
>>>This gets recovered somehow later.
>>>
>>>But here we see a different effect which finally causes the tick to get
>>>frozen:
>>>
>>>
>>>>:     fn                 -504+   2.075  sched_clock+0xd (schedule+0x112)
>>>>:     fn                 -502    0.992  __ipipe_stall_root+0x8 (schedule+0x18e)
>>>>:    *(0x50) 0x00000064  -501+   4.022  __ipipe_stall_root+0x32 (schedule+0x18e)
>>>>:    *fn                 -497+   1.977  __ipipe_dispatch_event+0xe (schedule+0x412)
>>>>:    *fn                 -495+   4.210  schedule_event+0xd (__ipipe_dispatch_event+0x5e)
>>>>:   **(0x50) 0x000000c8  -491+   1.428  schedule_event+0x261 (__ipipe_dispatch_event+0x5e)
>>>>:|  **fn                 -490+   2.067  xnpod_schedule_runnable+0xe (schedule_event+0x28b)
>>>>:|  **fn                 -488    0.954  ipipe_unstall_pipeline_from+0xc (schedule_event+0x2c1)
>>>>:|   *(0x51) 0x000000c8  -487+   4.646  ipipe_unstall_pipeline_from+0x25 (schedule_event+0x2c1)
>>>>:|   *fn                 -482+   7.248  __switch_to+0xc (xnpod_schedule+0x689)
>>>>:|  **(0x50) 0x000000c8  -475+   1.421  xnpod_schedule+0x77f (xnpod_suspend_thread+0x27c)
>>>>:|  **(0x50) 0x000000c8  -473+   1.481  xnpod_schedule+0x951 (xnpod_suspend_thread+0x27c)
>>>>:|  **(0x50) 0x000000c8  -472+   1.654  xnpod_suspend_thread+0x2bb (xnshadow_relax+0x136)
>>>>:|  **(0x50) 0x000000c8  -470+   2.917  xnshadow_relax+0x154 (hisyscall_event+0x2ec)
>>>>:|  **fn                 -467+   1.375  ipipe_reenter_root+0xb (xnshadow_relax+0x204)
>>>>:|  **fn                 -466    0.954  __ipipe_unstall_root+0x8 (ipipe_reenter_root+0x26)
>>>>:|  * (0x51) 0x00000064  -465+   3.789  __ipipe_unstall_root+0x25 (ipipe_reenter_root+0x26)
>>>>:   * fn                 -461+   1.578  send_sig+0x8 (xnshadow_relax+0x228)
>>>>:   * fn                 -460+   1.496  send_sig_info+0xb (send_sig+0x1d)
>>>>:   * fn                 -458    0.909  __ipipe_test_and_stall_root+0x9 (send_sig_info+0x38)
>>>>:   **(0x50) 0x00000064  -457+   1.699  __ipipe_test_and_stall_root+0x27 (send_sig_info+0x38)
>>>>:   **fn                 -455+   1.203  specific_send_sig_info+0xb (send_sig_info+0x69)
>>>>:   **fn                 -454+   1.706  __ipipe_test_root+0x8 (specific_send_sig_info+0x16)
>>>>:   **fn                 -453+   3.360  sig_ignored+0x9 (specific_send_sig_info+0x29)
>>>>:   **fn                 -449+   1.714  send_signal+0xb (specific_send_sig_info+0x55)
>>>>:   **fn                 -447+   3.142  __sigqueue_alloc+0x9 (send_signal+0x3f)
>>>>:   **fn                 -444+   1.210  kmem_cache_alloc+0xa (__sigqueue_alloc+0x42)
>>>>:   **fn                 -443    0.909  __ipipe_test_and_stall_root+0x9 (kmem_cache_alloc+0x12)
>>>>:   **(0x50) 0x00000064  -442+   2.180  __ipipe_test_and_stall_root+0x27 (kmem_cache_alloc+0x12)
>>>>:   **fn                 -440+   1.218  __ipipe_restore_root+0x8 (kmem_cache_alloc+0x52)
>>>>:   **fn                 -439+   1.315  __ipipe_stall_root+0x8 (__ipipe_restore_root+0x11)
>>>
>>>No more recovery from the xeno stall, no more timer ticks.
>>>
>>>The special traces were generated via
>>>
>>>#define ipipe_mark_domain_stall(ipd,cpuid)     \
>>>	ipipe_trace_special(0x50, ipd->priority)
>>>#define ipipe_mark_domain_unstall(ipd,cpuid)   \
>>>	ipipe_trace_special(0x51, ipd->priority)
>>>
>>>Any ideas? I do not find anything fishy yet that we may have introduced
>>>to cause this problem.
>>>
>>
>>Hmm, what about this theory: the RT-thread which got woken up in the
>>first trace snippet suffered from a leaking IRQ lock. Its broken flags
>>got restored on wakeup, and also when the thread went for termination
>>(second trace). Therefore, we see this leaking domain stall. Possible
>>explanation?
>>
>>This would move I-pipe and Xenomai out of the focus, leaving only the
>>16550A driver and our middleware messaging service (also a RTDM driver,
>>currently my top favourite). My trace unfortunately does not last back
>>far enough to answer this, I will have a look at this on Monday.
> 
> 
> I should stop blaming others: it was a leaking lock in xeno_16550A. :D
> 
> 
>>If it turned out to be the reason, we should consider putting some
>>XENO_ASSERT guards in the return-to-user paths of RTDM services.
>>
> 
> 
> This was done yesterday and immediately pulled out the offending code a
> few minutes ago.
> 
> Philippe, do you see any remaining issues, e.g. that the leak survived
> the task termination? Does this have any meaning for correct driver and
> skin code?
> 

The only way I could see this leakage survive a switch transition would 
require it to happen over the root context, not over a primary context. 
Was it the case?

-- 

Philippe.