From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Fri, 08 Feb 2008 16:27:40 +0100
From: "Petr Cervenka"
MIME-Version: 1.0
Message-ID: <200802081627.15934@domain.hid>
References: 00710081503.15198@domain.hid> <200802061509.13010@domain.hid> <200802081341.13030@domain.hid> <47AC56A1.8090706@domain.hid>
In-Reply-To: <47AC56A1.8090706@domain.hid>
Content-Type: text/plain; charset=windows-1250
Content-Transfer-Encoding: QUOTED-PRINTABLE
Subject: Re: [Xenomai-help] FPU not available
List-Id: Help regarding installation and common use of Xenomai
To: rpm@xenomai.org
Cc: jan.kiszka@domain.hid, xenomai@xenomai.org

______________________________________________________________
> From: rpm@xenomai.org
> To: Petr Cervenka
> CC: jan.kiszka@domain.hid, xenomai@xenomai.org
> Date: 08.02.2008 14:25
> Subject: Re: [Xenomai-help] FPU not available
>
>Petr Cervenka wrote:
>>> Philippe Gerum wrote:
>>>> Jan Kiszka wrote:
>>>>> Jan Kiszka wrote:
>>>>>> Gilles Chanteperdrix wrote:
>>>>>>> On Wed, Feb 6, 2008 at 3:09 PM, Petr Cervenka wrote:
>>>>>>>> Hello.
>>>>>>>> Recently, we switched to a newer distribution of Linux (Kubuntu 7.10). During this switch we changed many things (Xenomai 2.4.1, Linux kernel 2.6.24, x86_64 architecture, ...).
>>>>>>>> Now we have a problem: in one of our tasks we are sometimes not able to use floating point operations (under very specific circumstances). In such a case, that task crashes immediately, but the rest of the application runs "normally". Output from dmesg is attached to this message. The task was created with the T_FPU flag.
>>>>>>>> Is there anything we can check or change?
>>>>>>>> Petr Cervenka
>>>>>>> I do not know if this is related to the issue you are facing, but the
>>>>>>> first FPU fault of a thread running in primary mode may be handled by
>>>>>>> Xenomai without switching to secondary mode.
>>>>>>> So, maybe the fault
>>>>>>> epilogue implicitly expects Xenomai to have switched the fault to
>>>>>>> secondary mode and uses some secondary mode services such as
>>>>>>> ipipe_restore_root, whereas the thread never left primary mode.
>>>>>>>
>>>>>> Good point! That is probably this path (and not the one I stared at):
>>>>>>
>>>>>> __ipipe_handle_exception()
>>>>>> ...
>>>>>> 	if (unlikely(ipipe_trap_notify(vector, regs))) {
>>>>>> 		local_irq_restore(flags);
>>>>>> 		return 1;
>>>>>> 	}
>>>>>>
>>>>>> That needs some more thought...
>>>>> Looking at the whole __ipipe_handle_exception, the problem is related to
>>>>> the early, context-independent __ipipe_stall_root(). Can we postpone
>>>>> this safely after having called any potential high-stage hooks for this
>>>>> exception, and then only if the callee migrated the thread to the root
>>>>> domain? Or is there a need to have the root domain stalled across the
>>>>> post-fault migration?
>>>>>
>>>> Someone from the root domain may want to get notified of the exceptions
>>>> occurring in that domain too, in which case we may not postpone the
>>>> virtual mask fixup after the notifier invocation, otherwise we would
>>>> call the handler with a broken interrupt state.
>>>>
>>>>> In the latter case, we would have to fiddle with the stall bits directly
>>>>> instead of calling local_irq_restore - not just to work around the
>>>>> BUG_ON, but also to avoid sync'ing root over potentially stalled
>>>>> non-root domains...
>>>>>
>>>> This used to be done by ipipe_restore_pipeline_nosync() in older
>>>> patches, but this one has disappeared after the flat log refactoring. We
>>>> indeed need to resurrect something alike in order to reset the stall bit
>>>> without calling the syncer, when taking the fast exit path after
>>>> ipipe_trap_notify().
>>> Hmm, so it could be fairly simple in fact:
>>>
>>> --- a/arch/x86/kernel/ipipe.c
>>> +++ b/arch/x86/kernel/ipipe.c
>>> @@ -755,7 +755,9 @@ int __ipipe_handle_exception(struct pt_r
>>>  #endif /* CONFIG_KGDB */
>>>
>>>  	if (unlikely(ipipe_trap_notify(vector, regs))) {
>>> -		local_irq_restore(flags);
>>> +		if (!flags)
>>> +			__clear_bit(IPIPE_STALL_FLAG,
>>> +				    &ipipe_root_cpudom_var(status));
>>>  		return 1;
>>>  	}
>>>
>>> Petr, ready to try?
>>>
>> I tried this patch and the problem (or the race condition) disappeared. ;-)
>> Is there any (easy) method to recognise whether the problem was solved?
>>
>
>This one won't break the whole thing...
>
>diff --git a/arch/x86/kernel/ipipe.c b/arch/x86/kernel/ipipe.c
>index ce24db7..af9d4c4 100644
>--- a/arch/x86/kernel/ipipe.c
>+++ b/arch/x86/kernel/ipipe.c
>@@ -758,6 +758,7 @@ int __ipipe_handle_exception(struct pt_regs *regs, long error_code, int vector)
> #endif /* CONFIG_KGDB */
>
> 	if (unlikely(ipipe_trap_notify(vector, regs))) {
>+		WARN_ON(!ipipe_root_domain_p);
> 		if (!flags)
> 			__clear_bit(IPIPE_STALL_FLAG,
> 				    &ipipe_root_cpudom_var(status));
>
I applied the "WARN_ON" patch and got the kernel bug again.
Dmesg output is attached. I hope it will help you a little bit this time.
It seems that the warning started to be printed and then the error happened (if I understand it correctly).
>> To your previous questions:
>> We use Athlon64 X2 (2 cores, 64-bit), kubuntu 7.10 amd64.
>> We have 2 real-time userspace applications: some kind of server for rtnet communication with special measuring hardware, and clients (1-4 instances) for some computing, configuration, ethernet communication, etc. Communication between server and clients is via named rt_queues. A "failing example" is perhaps impossible to provide.
>> Any attempt with IPIPE_DEBUG and tracer removes the race condition.
>>
>> Thank you VERY MUCH for your help and support (all of you).
>> Petr
>>
>>> Jan
>>>
>>> --
>>> Siemens AG, Corporate Technology, CT SE 2
>>> Corporate Competence Center Embedded Linux
>>>
>>
>>
>> _______________________________________________
>> Xenomai-help mailing list
>> Xenomai-help@domain.hid
>> https://mail.gna.org/listinfo/xenomai-help
>>
>
>
>--
>Philippe.
>
[ 52.570242] WARNING: at arch/x86/kernel/ipipe.c:758 __ipipe_handle_exception()
[ 52.570249] Pid: 4758, comm: REG_TASK_2056 Not tainted 2.6.24-adeos #4
[ 52.570251]
[ 52.570251] Call Trace:
[ 52.570283] ------------[ cut here ]------------
[ 52.570318] kernel BUG at kernel/ipipe/core.c:321!
[ 52.570351] invalid opcode: 0000 [1] PREEMPT SMP
[ 52.570449] CPU 0
[ 52.570499] Modules linked in: rt_r8169 rtpacket rtnet rfcomm l2cap bluetooth ppdev container ac sbs sbshc dock battery lp irtty_sir sir_dev irda psmouse parport_pc parport crc_ccitt serio_raw k8temp pcspkr shpchp pci_hotplug button i2c_nforce2 i2c_core af_packet ipv6 evdev ext3 jbd mbcache sg sd_mod sata_nv ata_generic forcedeth libata amd74xx scsi_mod ide_core ehci_hcd ohci_hcd usbcore fan fuse
[ 52.571610] Pid: 4758, comm: REG_TASK_2056 Not tainted 2.6.24-adeos #4
[ 52.571645] RIP: 0010:[] [] __ipipe_restore_root+0x47/0x50
[ 52.571714] RSP: 0000:ffff81003e16bd88 EFLAGS: 00010002
[ 52.571747] RAX: ffffffff8067caa0 RBX: 00000009e4457f0f RCX: 0000000000000003
[ 52.571782] RDX: ffff810080993000 RSI: ffffffff8022743f RDI: 0000000000000001
[ 52.571816] RBP: ffff81003e16bd88 R08: ffff810001008420 R09: 0000000000000004
[ 52.571851] R10: ffff81003e16bda8 R11: 0000000000000000 R12: 0000000000000001
[ 52.571886] R13: ffff81003e16a000 R14: ffff81003e16bffd R15: 0000000000000000
[ 52.571921] FS: 0000000040091950(0063) GS:ffffffff805f5000(0000) knlGS:0000000000000000
[ 52.571963] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 52.571997] CR2: 00002b6d57fb623d CR3: 000000003a0c5000 CR4: 00000000000006e0
[ 52.572031] DR0: 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000
[ 52.572066] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 52.572101] Process REG_TASK_2056 (pid: 4758, threadinfo ffff81003e16a000, task ffff810035432790)
[ 52.572144] Stack: ffff81003e16bdb8 ffffffff8023af74 ffff81003e16be98 ffffffff806777a8
[ 52.572288] ffff810080993000 ffff81003e16a000 ffff81003e16bdc8 ffffffff80273809
[ 52.572411] ffff81003e16bde8 ffffffff8027383e ffffffff8022743f ffff81003e16bf00
[ 52.572509] Call Trace:
[ 52.572569] [] cpu_clock+0x84/0xa0
[ 52.572604] [] get_timestamp+0x9/0x10
[ 52.572638] [] touch_softlockup_watchdog+0x2e/0x40
[ 52.572675] [] __ipipe_handle_exception+0x25f/0x270
[ 52.572711] [] touch_nmi_watchdog+0x1a/0x80
[ 52.572746] [] __ipipe_handle_exception+0x25f/0x270
[ 52.572783] [] print_trace_address+0x11/0x20
[ 52.572818] [] __ipipe_handle_exception+0x25f/0x270
[ 52.572853] [] dump_trace+0x10b/0x2c0
[ 52.572890] [] exception_event+0x48/0x60
[ 52.572925] [] show_trace+0x43/0x60
[ 52.572960] [] dump_stack+0x6a/0x80
[ 52.572995] [] __ipipe_handle_exception+0x25f/0x270
[ 52.573033] [] error_sti+0x1e/0x52
[ 52.573071]
[ 52.573100]
[ 52.573100] Code: 0f 0b eb fe 0f 1f 44 00 00 55 48 89 e5 41 56 41 55 41 54 53
[ 52.573595] RIP [] __ipipe_restore_root+0x47/0x50
[ 52.573650] RSP
[ 52.573690] ---[ end trace c09fed11ada7a064 ]---
[ 52.573723] note: REG_TASK_2056[4758] exited with preempt_count 1
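
The two exit paths discussed in the thread (local_irq_restore(flags) versus clearing the root stall bit directly) can be compared with a small user-space toy model. This is only a sketch under stated assumptions, not I-pipe code: every type and function name below is hypothetical, the BUG_ON is assumed to check that the caller runs in the root domain (which the thread suggests but does not quote), and "syncing" is reduced to a counter.

```c
#include <stdbool.h>

/* Toy model only: these types and functions are hypothetical and are
 * not part of the I-pipe or Xenomai API. */
enum model_domain { MODEL_ROOT, MODEL_HEAD };

struct model_cpu {
	enum model_domain current; /* domain the faulting thread runs in */
	bool root_stalled;         /* stands in for the root IPIPE_STALL_FLAG */
	int root_syncs;            /* times the root interrupt log was replayed */
	bool bug_hit;              /* stands in for the BUG_ON firing */
};

/* Handler entry: stall root unconditionally, return the previous state. */
static bool model_stall_root(struct model_cpu *cpu)
{
	bool was_stalled = cpu->root_stalled;

	cpu->root_stalled = true;
	return was_stalled;
}

/* Old exit path, standing in for local_irq_restore(flags). */
static void model_restore_root(struct model_cpu *cpu, bool was_stalled)
{
	if (cpu->current != MODEL_ROOT) {
		cpu->bug_hit = true;  /* assumed meaning of the BUG_ON */
		return;
	}
	cpu->root_stalled = was_stalled;
	if (!was_stalled)
		cpu->root_syncs++;    /* unstalling root replays pending IRQs */
}

/* Patched exit path: clear the bit, never sync, no domain check. */
static void model_restore_root_nosync(struct model_cpu *cpu, bool was_stalled)
{
	if (!was_stalled)
		cpu->root_stalled = false;
}

/* Run one fast exit (the notifier handled the trap) and return the result. */
static struct model_cpu model_fast_exit(enum model_domain dom, bool patched)
{
	struct model_cpu cpu = { dom, false, 0, false };
	bool was_stalled = model_stall_root(&cpu);

	if (patched)
		model_restore_root_nosync(&cpu, was_stalled);
	else
		model_restore_root(&cpu, was_stalled);
	return cpu;
}
```

In this model, model_fast_exit(MODEL_HEAD, false) ends with bug_hit set and the root stall bit still set, while model_fast_exit(MODEL_HEAD, true) ends with the bit cleared and no sync performed. The model says nothing about why the WARN_ON variant still ended in the BUG shown in the dmesg output.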