From mboxrd@z Thu Jan 1 00:00:00 1970 Message-ID: <443A56E2.1010204@domain.hid> Date: Mon, 10 Apr 2006 15:00:18 +0200 From: Jan Kiszka MIME-Version: 1.0 Subject: Re: [Xenomai-core] [BUG?] stalled xeno domain References: <4437D2EE.4040902@domain.hid> <4437E704.4010607@domain.hid> <443A437B.6010403@domain.hid> <443A5212.7020001@domain.hid> In-Reply-To: <443A5212.7020001@domain.hid> Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="------------enigBED9754C78CF6639A99064BB" Sender: jan.kiszka@domain.hid List-Id: "Xenomai life and development \(bug reports, patches, discussions\)" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Philippe Gerum Cc: xenomai-core This is an OpenPGP/MIME signed message (RFC 2440 and 3156) --------------enigBED9754C78CF6639A99064BB Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Philippe Gerum wrote: > Jan Kiszka wrote: >> Jan Kiszka wrote: >> >>> Jan Kiszka wrote: >>> >>>> Hi Philippe, >>>> >>>> debugging is nice, tracing is still nicer. As you suggested, I exten= ded >>>> the tracer with per-domain stall flags (needs some output clean-up, >>>> preliminary patch attached nevertheless). >>>> >>>> And here is the result (full trace attached): >>>> >>>> >>>>> :| (0x51) 0x000000c8 -1113+ 1.112 __ipipe_sync_stage+0x142 >>>>> (ipipe_suspend_domain+0x56) >>>>> :| fn -1112+ 1.789 __ipipe_sync_stage+0xe >>>>> (ipipe_suspend_domain+0x56) >>>>> :| *(0x50) 0x00000064 -1110+ 1.293 __ipipe_sync_stage+0x97 >>>>> (ipipe_suspend_domain+0x56) >>>>> : *fn -1109+ 1.398 do_IRQ+0x8 >>>>> (__ipipe_sync_stage+0xcf) >>>>> : *fn -1107+ 2.105 __do_IRQ+0xc (do_IRQ+0x21) >>>>> : *fn -1105+ 1.631 handle_IRQ_event+0xd >>>>> (__do_IRQ+0x9e) >>>>> : *fn -1104+ 2.413 timer_interrupt+0x9 >>>>> (handle_IRQ_event+0x40) >>>>> : *fn -1101+ 3.022 mark_offset_tsc+0xe >>>>> (timer_interrupt+0x31) >>>>> : *fn -1098+ 2.721 do_timer+0x9 >>>>> (timer_interrupt+0x37) >>>>> :| *fn -1096+ 2.112 __ipipe_handle_irq+0xe >>>>> (common_interrupt+0x18) >>>>> :| *fn -1093+ 1.210 __ipipe_ack_common_irq+0xc >>>>> (__ipipe_handle_irq+0xc0) >>>>> :| *(0x50) 0x00000064 -1092+ 1.218 __ipipe_ack_common_irq+0x31= >>>>> (__ipipe_handle_irq+0xc0) >>>>> :| *fn -1091+ 1.834 mask_and_ack_8259A+0xb >>>>> (__ipipe_ack_common_irq+0x5d) >>>>> :| *(0x50) 0x00000064 -1089+ 1.345 __ipipe_ack_common_irq+0x9d= >>>>> (__ipipe_handle_irq+0xc0) >>>>> :| *fn -1088 0.924 ipipe_suspend_domain+0xb >>>>> (__ipipe_handle_irq+0x174) >>>>> :| *(0x51) 0x000000c8 -1087+ 1.172 ipipe_suspend_domain+0x26 >>>>> (__ipipe_handle_irq+0x174) >>>>> :| *fn -1086+ 2.030 __ipipe_sync_stage+0xe >>>>> (ipipe_suspend_domain+0x56) >>>>> :| **(0x50) 0x000000c8 -1084+ 1.330 __ipipe_sync_stage+0x97 >>>>> (ipipe_suspend_domain+0x56) >>>>> :| **fn -1082 0.879 xnintr_clock_handler+0x8 >>>>> (__ipipe_sync_stage+0x12b) >>>>> :| **fn -1082+ 1.218 xnintr_irq_handler+0xb >>>>> (xnintr_clock_handler+0x18) >>>>> :| **fn -1080+ 1.233 xnpod_announce_tick+0x9 >>>>> (xnintr_irq_handler+0x2a) >>>>> :| **(0x50) 0x000000c8 -1079+ 1.736 xnpod_announce_tick+0x20 >>>>> (xnintr_irq_handler+0x2a) >>>>> :| **fn -1077+ 2.984=20 >>>>> xntimer_do_tick_aperiodic+0xe (xnpod_announce_tick+0x36) >>>>> :| **fn -1074+ 2.751=20 >>>>> xnthread_timeout_handler+0x8 (xntimer_do_tick_aperiodic+0x18d) >>>>> :| **fn -1072+ 1.864 xnpod_resume_thread+0xe >>>>> (xnthread_timeout_handler+0x1d) >>>>> :| **(0x50) 0x000000c8 -1070+ 4.699 xnpod_resume_thread+0x2b >>>>> (xnthread_timeout_handler+0x1d) >>>>> :| **(0x50) 0x000000c8 -1065+ 1.586 xnpod_resume_thread+0x2bf >>>>> (xnthread_timeout_handler+0x1d) >>>>> :| **(0x01) 0x00001237 -1063+ 2.661=20 >>>>> xntimer_do_tick_aperiodic+0x309 (xnpod_announce_tick+0x36) >>>>> :| **(0x50) 0x000000c8 -1061+ 1.263 xnpod_announce_tick+0x4f >>>>> (xnintr_irq_handler+0x2a) >>>>> :| **fn -1060+ 1.255 rthal_irq_end+0x8 >>>>> (xnintr_irq_handler+0x46) >>>>> :| **fn -1058+ 2.007 enable_8259A_irq+0x9 >>>>> (rthal_irq_end+0x2e) >>>>> :| **fn -1056+ 1.924 xnpod_schedule+0xe >>>>> (xnintr_irq_handler+0x6e) >>>>> :| **(0x50) 0x000000c8 -1054! 15.368 xnpod_schedule+0x53 >>>>> (xnintr_irq_handler+0x6e) >>>>> :| **fn -1039! 13.300 __switch_to+0xc >>>>> (xnpod_schedule+0x689) >>>>> :| **(0x50) 0x000000c8 -1026+ 1.766 xnpod_schedule+0x951 >>>>> (xnpod_suspend_thread+0x27c) >>>>> :| **(0x50) 0x000000c8 -1024! 18.195 xnpod_suspend_thread+0x2bb >>>>> (rt_task_sleep+0x54) >>>>> : **fn -1006+ 1.624 __ipipe_syscall_root+0x9 >>>>> (system_call+0x20) >>>>> : **fn -1004+ 1.406 __ipipe_dispatch_event+0xe >>>>> (__ipipe_syscall_root+0x58) >>>>> : **fn -1003+ 4.323 hisyscall_event+0xe >>>>> (__ipipe_dispatch_event+0x5e) >>>>> : **fn -998+ 1.398 __rt_task_sleep+0xa >>>>> (hisyscall_event+0x23c) >>>>> : **fn -997+ 1.789 __copy_from_user_ll+0xa >>>>> (__rt_task_sleep+0x1a) >>>>> : **fn -995! 15.345 rt_task_sleep+0xa >>>>> (__rt_task_sleep+0x25) >>>>> : **fn -980 0.879 __ipipe_syscall_root+0x9 >>>>> (system_call+0x20) >>>>> : **fn -979+ 1.308 __ipipe_dispatch_event+0xe >>>>> (__ipipe_syscall_root+0x58) >>>>> : **fn -978+ 2.451 hisyscall_event+0xe >>>>> (__ipipe_dispatch_event+0x5e) >>>>> : **fn -975+ 1.631 sys_rtdm_ioctl+0x8 >>>>> (hisyscall_event+0x23c) >>>>> : **fn -974+ 1.255 _rtdm_ioctl+0xc >>>>> (sys_rtdm_ioctl+0x1b) >>>>> : **fn -972+ 4.180 rtdm_context_get+0xa >>>>> (_rtdm_ioctl+0x20) >>>>> : **fn -968+ 1.203 __ipipe_syscall_root+0x9 >>>>> (system_call+0x20) >>>>> : **fn -967+ 1.345 __ipipe_dispatch_event+0xe >>>>> (__ipipe_syscall_root+0x58) >>>> >>>> The '*' mark a stalled domain, root domain on the right. It seems to= me >>>> that xnpod_suspend_thread() leaves the xeno domain stalled on wake u= p. >>>> This gets recovered somehow later. >>>> >>>> But here we see a different effect which finally causes the tick to = get >>>> frozen: >>>> >>>> >>>>> : fn -504+ 2.075 sched_clock+0xd >>>>> (schedule+0x112) >>>>> : fn -502 0.992 __ipipe_stall_root+0x8 >>>>> (schedule+0x18e) >>>>> : *(0x50) 0x00000064 -501+ 4.022 __ipipe_stall_root+0x32 >>>>> (schedule+0x18e) >>>>> : *fn -497+ 1.977 __ipipe_dispatch_event+0xe >>>>> (schedule+0x412) >>>>> : *fn -495+ 4.210 schedule_event+0xd >>>>> (__ipipe_dispatch_event+0x5e) >>>>> : **(0x50) 0x000000c8 -491+ 1.428 schedule_event+0x261 >>>>> (__ipipe_dispatch_event+0x5e) >>>>> :| **fn -490+ 2.067 xnpod_schedule_runnable+0xe= >>>>> (schedule_event+0x28b) >>>>> :| **fn -488 0.954=20 >>>>> ipipe_unstall_pipeline_from+0xc (schedule_event+0x2c1) >>>>> :| *(0x51) 0x000000c8 -487+ 4.646=20 >>>>> ipipe_unstall_pipeline_from+0x25 (schedule_event+0x2c1) >>>>> :| *fn -482+ 7.248 __switch_to+0xc >>>>> (xnpod_schedule+0x689) >>>>> :| **(0x50) 0x000000c8 -475+ 1.421 xnpod_schedule+0x77f >>>>> (xnpod_suspend_thread+0x27c) >>>>> :| **(0x50) 0x000000c8 -473+ 1.481 xnpod_schedule+0x951 >>>>> (xnpod_suspend_thread+0x27c) >>>>> :| **(0x50) 0x000000c8 -472+ 1.654 xnpod_suspend_thread+0x2bb >>>>> (xnshadow_relax+0x136) >>>>> :| **(0x50) 0x000000c8 -470+ 2.917 xnshadow_relax+0x154 >>>>> (hisyscall_event+0x2ec) >>>>> :| **fn -467+ 1.375 ipipe_reenter_root+0xb >>>>> (xnshadow_relax+0x204) >>>>> :| **fn -466 0.954 __ipipe_unstall_root+0x8 >>>>> (ipipe_reenter_root+0x26) >>>>> :| * (0x51) 0x00000064 -465+ 3.789 __ipipe_unstall_root+0x25 >>>>> (ipipe_reenter_root+0x26) >>>>> : * fn -461+ 1.578 send_sig+0x8 >>>>> (xnshadow_relax+0x228) >>>>> : * fn -460+ 1.496 send_sig_info+0xb >>>>> (send_sig+0x1d) >>>>> : * fn -458 0.909=20 >>>>> __ipipe_test_and_stall_root+0x9 (send_sig_info+0x38) >>>>> : **(0x50) 0x00000064 -457+ 1.699=20 >>>>> __ipipe_test_and_stall_root+0x27 (send_sig_info+0x38) >>>>> : **fn -455+ 1.203 specific_send_sig_info+0xb >>>>> (send_sig_info+0x69) >>>>> : **fn -454+ 1.706 __ipipe_test_root+0x8 >>>>> (specific_send_sig_info+0x16) >>>>> : **fn -453+ 3.360 sig_ignored+0x9 >>>>> (specific_send_sig_info+0x29) >>>>> : **fn -449+ 1.714 send_signal+0xb >>>>> (specific_send_sig_info+0x55) >>>>> : **fn -447+ 3.142 __sigqueue_alloc+0x9 >>>>> (send_signal+0x3f) >>>>> : **fn -444+ 1.210 kmem_cache_alloc+0xa >>>>> (__sigqueue_alloc+0x42) >>>>> : **fn -443 0.909=20 >>>>> __ipipe_test_and_stall_root+0x9 (kmem_cache_alloc+0x12) >>>>> : **(0x50) 0x00000064 -442+ 2.180=20 >>>>> __ipipe_test_and_stall_root+0x27 (kmem_cache_alloc+0x12) >>>>> : **fn -440+ 1.218 __ipipe_restore_root+0x8 >>>>> (kmem_cache_alloc+0x52) >>>>> : **fn -439+ 1.315 __ipipe_stall_root+0x8 >>>>> (__ipipe_restore_root+0x11) >>>> >>>> No more recovery from the xeno stall, no more timer ticks. >>>> >>>> The special traces were generated via >>>> >>>> #define ipipe_mark_domain_stall(ipd,cpuid) \ >>>> ipipe_trace_special(0x50, ipd->priority) >>>> #define ipipe_mark_domain_unstall(ipd,cpuid) \ >>>> ipipe_trace_special(0x51, ipd->priority) >>>> >>>> Any ideas? I do not find anything fishy yet that we may have introdu= ced >>>> to cause this problem. >>>> >>> >>> Hmm, what about this theory: the RT-thread which got woken up in the >>> first trace snippet suffered from a leaking IRQ lock. Its broken flag= s >>> got restored on wakeup, and also when the thread went for termination= >>> (second trace). Therefore, we see this leaking domain stall. Possible= >>> explanation? >>> >>> This would move I-pipe and Xenomai out of the focus, leaving only the= >>> 16550A driver and our middleware messaging service (also a RTDM drive= r, >>> currently my top favourite). My trace unfortunately does not last bac= k >>> far enough to answer this, I will have a look at this on Monday. >> >> >> I should stop blaming others: it was a leaking lock in xeno_16550A. :D= >> >> >>> If it turned out to be the reason, we should consider putting some >>> XENO_ASSERT guards in the return-to-user paths of RTDM services. >>> >> >> >> This was done yesterday and immediately pulled out the offending code = a >> few minutes ago. >> >> Philippe, do you see any remaining issues, e.g. that the leak survived= >> the task termination? Does this have any meaning for correct driver an= d >> skin code? >> >=20 > The only way I could see this leakage survive a switch transition would= > require it to happen over the root context, not over a primary context.= > Was it the case? >=20 The task had to leave from primary mode. If I forced it to secondary before terminating, the problem did not show up. Jan --------------enigBED9754C78CF6639A99064BB Content-Type: application/pgp-signature; name="signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="signature.asc" -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2 (MingW32) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFEOlbiniDOoMHTA+kRAm2FAJ9k9QCKPL1PDEhBWxTQbATsKPX4/wCfXlN/ 3c/MDQQv0h3/f80bBJMn5II= =bsCo -----END PGP SIGNATURE----- --------------enigBED9754C78CF6639A99064BB--