From mboxrd@z Thu Jan 1 00:00:00 1970 Message-ID: <528A4737.8040607@siemens.com> Date: Mon, 18 Nov 2013 17:58:31 +0100 From: Jan Kiszka MIME-Version: 1.0 References: <528A156C.1070501@xenomai.org> <528A21A9.9050502@siemens.com> <528A2486.4030606@xenomai.org> <528A2567.2000105@siemens.com> <528A27AF.3060608@xenomai.org> <528A2BB3.9040902@siemens.com> <528A2F96.7080302@xenomai.org> <528A3918.4000902@siemens.com> <528A3CEB.80909@xenomai.org> <528A446C.5060409@siemens.com> In-Reply-To: <528A446C.5060409@siemens.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Subject: Re: [Xenomai] [Xenomai-git] Jan Kiszka : switchtest: Account for invalid last_switch.from field List-Id: Discussions about the Xenomai project List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Gilles Chanteperdrix Cc: xenomai@xenomai.org On 2013-11-18 17:46, Jan Kiszka wrote: >>>>>>> Ah, maybe this is the real bug: >>>>>>> >>>>>>> diff --git a/kernel/drivers/testing/switchtest.c >>>>>>> b/kernel/drivers/testing/switchtest.c >>>>>>> index 7d17c5f..b5080a6 100644 >>>>>>> --- a/kernel/drivers/testing/switchtest.c >>>>>>> +++ b/kernel/drivers/testing/switchtest.c >>>>>>> @@ -404,7 +404,8 @@ static void rtswitch_ktask(void *cookie) >>>>>>> >>>>>>> to = task->base.index; >>>>>>> >>>>>>> - rtswitch_pend_rt(ctx, task->base.index); >>>>>>> + if (rtswitch_pend_rt(ctx, task->base.index) != 0) >>>>>>> + return; >>>>>>> >>>>>>> for(;;) { >>>>>>> if (task->base.flags & RTTST_SWTEST_USE_FPU) >>>>>>> >>>>>>> Still need to validate, will let you know. >>>>>> >>>>>> I do not think that it is the right fix either: we should not have an >>>>>> error in a ktask, because they are destroyed before anything else is >>>>>> destroyed. >>>>> >>>>> Not an error, a destroyed rtdm event, thus return code < 0. This didn't >>>>> matter so far as we kill the kernel task where it was blocked. Now it >>>>> has to properly leave its main function. >>>> >>>> The rtdm event is destroyed after the ktask. So, normally, this should >>>> not happen. >>> >>> True. OK, but rtdm_task_destroy calls xnthread_cancel, and that should >>> kick us out of rtdm_event_wait with -EINTR - to my understanding. >>> >>> In any case, the patch didn't help. Trying to get some traces now for a >>> better picture. >> >> Have you tried to check rtdm_task_join return value and print a message >> if it returns an error? I believe for the task to be still running after >> it was expected to terminate, rtdm_task_join would have to fail, at least. > > rtdm_task_join returns void. Also, rtdm_task_destroy already joins in > order to maintain the original synchronous behavior. > > I'm breaking a trace now when idx is -1 or rtswitch_pend_rt returns > non-zero, slowly trying to dig deeper. Bailing out from rtswitch_pend_rt remains correct and necessary (with the new RTDM semantics): rtdm_task_destroy will now kick us out of the rtdm_even_wait, and that at a time where no switch happened yet. However, stress-testing switchtest with early interrupts also triggers this oops once in a while: [ 279.547820] [Xenomai] closing RTDM file descriptor 0 [ 279.548966] BUG: unable to handle kernel paging request at ffffc90002986f50 [ 279.550668] IP: [] xntimer_destroy+0x110/0x290 [ 279.551563] PGD 3f830067 PUD 3f831067 PMD 3c299067 PTE 0 [ 279.552298] Oops: 0002 [#1] SMP DEBUG_PAGEALLOC [ 279.552947] Modules linked in: xt_tcpudp xt_limit xt_pkttype ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_raw xt_CT ipt_REJECT xt_state iptable_raw iptable_filter ip6table_mangle nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables ip6table_filter ip6_tables x_tables dm_crypt 9pnet_virtio tpm_tis tpm psmouse microcode 9pnet tpm_bios serio_raw pcspkr e1000 i2c_piix4 floppy intel_agp intel_gtt virtio_pci virtio_blk virtio virtio_ring ahci libahci [ 279.558564] CPU: 0 PID: 8217 Comm: rtuo_ufpp_ufps2 Not tainted 3.10.19+ #71 [ 279.558564] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 279.558564] task: ffff88003c128900 ti: ffff88003c138000 task.ti: ffff88003c138000 [ 279.558564] RIP: 0010:[] [] xntimer_destroy+0x110/0x290 [ 279.558564] RSP: 0000:ffff88003c13b948 EFLAGS: 00010206 [ 279.558564] RAX: ffff88003e2a9288 RBX: ffff88003a5024a8 RCX: ffffffff818471e0 [ 279.558564] RDX: ffffc90002986f48 RSI: dead000000200200 RDI: dead000000100100 [ 279.558564] RBP: ffff88003c13b988 R08: 0000000000000218 R09: ffffffff81669c40 [ 279.558564] R10: ffff88003c13b9a0 R11: ffffffff811a0fbc R12: 0000000000000003 [ 279.558564] R13: 0000000000000000 R14: 0000000000016a90 R15: ffff88003c13b9a0 [ 279.558564] FS: 00007fdd847d2950(0000) GS:ffff88003d400000(0000) knlGS:0000000000000000 [ 279.558564] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 279.558564] CR2: ffffc90002986f50 CR3: 0000000001915000 CR4: 00000000000006f0 [ 279.558564] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 279.558564] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 279.558564] I-pipe domain Linux [ 279.558564] Stack: [ 279.558564] ffff88003bae9260 ffffffff811a0fbc ffffc90002989540 0000000000000000 [ 279.558564] 0000000000000000 0000000000016a90 ffff88003a5024a8 ffff88003c13b9a0 [ 279.558564] ffff88003c13b9d8 ffffffff8114119f ffff88003c13b9c8 000002b6fdeeead4 [ 279.558564] Call Trace: [ 279.558564] [] ? __vunmap+0x9c/0x110 [ 279.558564] [] rtdm_timer_destroy+0xcf/0x220 [ 279.558564] [] rtswitch_close+0xe2/0x120 [ 279.558564] [] __rt_dev_close+0x39e/0x900 [ 279.558564] [] cleanup_process_files+0x1f4/0x2b0 [ 279.558564] [] rtdm_process_detach+0x1a/0x30 [ 279.558564] [] detach_ppd+0x1b/0x30 [ 279.558564] [] handle_taskexit_event+0x408/0x940 [ 279.558564] [] ipipe_kevent_hook+0x6d8/0x12a0 [ 279.558564] [] ? rb_commit+0xd0/0x150 [ 279.558564] [] ? ring_buffer_unlock_commit+0x25/0xa0 [ 279.558564] [] __ipipe_notify_kevent+0x9c/0x130 [ 279.558564] [] do_exit+0x80d/0xb00 [ 279.558564] [] ? ring_buffer_unlock_commit+0x25/0xa0 [ 279.558564] [] ? trace_buffer_unlock_commit+0x43/0x60 [ 279.558564] [] do_group_exit+0x52/0xc0 [ 279.558564] [] get_signal_to_deliver+0x242/0x5f0 [ 279.558564] [] do_signal+0x54/0x640 [ 279.558564] [] ? rcu_irq_exit+0xaf/0x100 [ 279.558564] [] ? __ipipe_do_sync_stage+0x18c/0x280 [ 279.558564] [] do_notify_resume+0x65/0x90 [ 279.558564] [] ? trace_hardirqs_on_thunk+0x3a/0x3c [ 279.558564] [] int_signal+0x12/0x17 [ 279.558564] Code: 80 00 00 00 48 c7 43 48 00 00 00 00 48 89 43 30 48 8b 83 88 00 00 00 48 bf 00 01 10 00 00 00 ad de 48 be 00 02 20 00 00 00 ad de <48> 89 42 08 48 89 10 48 8b 05 6a 67 82 00 48 89 bb 80 00 00 00 [ 279.558564] RIP [] xntimer_destroy+0x110/0x290 [ 279.558564] RSP [ 279.558564] CR2: ffffc90002986f50 This didn't happen when just catching idx==-1, likely because we are still racing somewhere else, and that other change just papered over this race. Jan -- Siemens AG, Corporate Technology, CT RTC ITP SES-DE Corporate Competence Center Embedded Linux