From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 10 Jul 2008 15:45:01 +0200
From: "Petr Cervenka"
MIME-Version: 1.0
Message-ID: <200807101545.25401@domain.hid>
References: 00807071745.31720@domain.hid> <48723D5D.6020008@domain.hid>
 <48732793.7090605@domain.hid> <487331AE.5070009@domain.hid>
 <48733483.2050204@domain.hid> <200807091719.17625@domain.hid>
 <4874E1D8.6020307@domain.hid>
In-Reply-To: <4874E1D8.6020307@domain.hid>
Content-Type: text/plain; charset="windows-1250"
Content-Transfer-Encoding: 8bit
Subject: Re: [Xenomai-help] Kernel panic: not syncing
List-Id: Help regarding installation and common use of Xenomai
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
To: rpm@xenomai.org
Cc: xenomai@xenomai.org

Philippe Gerum wrote:
>Petr Cervenka wrote:
>> Jan Kiszka wrote:
>>> Gilles Chanteperdrix wrote:
>>>> Jan Kiszka wrote:
>>>>> Philippe Gerum wrote:
>>>>>> Petr Cervenka wrote:
>>>>>>> Hello,
>>>>>>> I'm not sure if I'm not off topic.
>>>>>>> We use Linux 2.6.24 and Xenomai 2.4.1. Occasionally (once in few
>>>>>>> days) we get a kernel panic and I don't know if it's our fault or a
>>>>>>> problem of kernel/xenomai/adeos/configuration/hw/...
>>>>>>> If you have any questions, I'll try to answer them. Any help is
>>>>>>> welcome.
>>>>>> It is an I-pipe issue, probably. We have to somewhat forge the register frame
>>>>>> passed to the Linux tick handler, since we may delay that call. Some register
>>>>>> values the profiling code attempts to dereference to find the preempted code may
>>>>>> be wrong in our case.
>>>>>>
>>>>>> Could you 1) send back a disassembly of the profile_tick routine in your kernel
>>>>>> image, then apply the following patch to check whether it improves the situation
>>>>>> as well?
>>>>>> TIA,
>>>>>>
>>>>>> --- 2.6.24-x86-2.0-03/arch/x86/kernel/ipipe.c~ 2008-02-11 10:48:24.000000000 +0100
>>>>>> +++ 2.6.24-x86-2.0-03/arch/x86/kernel/ipipe.c 2008-07-07 17:55:36.000000000 +0200
>>>>>> @@ -933,12 +933,7 @@
>>>>>>  tick_regs->eip = regs.eip;
>>>>>>  tick_regs->ebp = regs.ebp;
>>>>>>  #else /* !CONFIG_X86_32 */
>>>>>> - tick_regs->ss = regs->ss;
>>>>>> - tick_regs->rsp = regs->rsp;
>>>>>> - tick_regs->eflags = regs->eflags;
>>>>>> - tick_regs->cs = regs->cs;
>>>>>> - tick_regs->rip = regs->rip;
>>>>>> - tick_regs->rbp = regs->rbp;
>>>>>> + *tick_regs = *regs;
>>>>>>  #endif /* !CONFIG_X86_32 */
>>>>> I'm fairly sure that this won't make a difference. According to Petr's
>>>>> first dump we crash in profile_pc, and there the kernel pokes around on
>>>>> the stack of the interrupted context (Petr, you are running SMP,
>>>>> right?). The question is if this stack may have vanished or may have
>>>>> been swapped out after capturing the registers.
>>>> When Xenomai has forwarded the tick to linux, Linux tick handler is
>>>> executed upon resume to user-space, so, if the stack had to vanish, it
>>>> would have to vanish upon execution of another interrupt handler before
>>>> the tick handler. However, I believe that only do_exit can kill a task,
>>>> and I am not sure if it can be called from an interrupt handler. As for
>>>> the stack being swapped out, it is kmalloced memory, so, it is impossible.
>>> Yes, vanishing stack is unlikely, more probable is an invalid state
>>> right from the beginning. I guess we need a full oops to say more.
>>>
>>> Petr, any chance to attach a serial cable to your box and catch those
>>> oopses completely via a second box? The register state would be telling,
>>> but also, as Philippe already requested, a disassembly of the involved
>>> function - in case it remains profile_pc: objdump -dS
>>> linux-.../arch/x86/kernel/time_64.o.
>>>
>>
>> To your questions:
>> we have only user space tasks, but we use rtdm driver (with ioctl, no tasks).
>> The processor is Athlon X2, 64-bit distribution (Kubuntu ?7.10?), x86_64 SMP PREEMPT,
>> Kernel 2.6.24, Xenomai 2.4.1, adeos-ipipe-2.6.24-x86-2.0-03
>> I could send you my kernel config file if you want.
>> I will try to learn the method of catching oopses via a serial cable attached to a second box. But I don't know if it will be possible to set up such an experiment for a time long enough to reproduce the error.
>>
>> The following disassembly is from a different machine than the one which has the kernel panics. I use it for developing and testing but it should have the same HW and kernel configuration. I can't be totally sure that the disassembly is correct, but I hope so. I just can't recognise it. Could you explain to me what "profile_pc+0x46/0x80" means? I assume the first number is the current RIP address (relative to routine start), so 0x256 in this case. But what does the second number mean?
>
>Size of the routine.
>
>>
>> 0000000000000210 :
>>  210: 48 83 ec 18             sub    $0x18,%rsp
>>  214: 48 89 5c 24 08          mov    %rbx,0x8(%rsp)
>>  219: 48 89 6c 24 10          mov    %rbp,0x10(%rsp)
>>  21e: 48 89 fb                mov    %rdi,%rbx
>>  221: f6 87 88 00 00 00 03    testb  $0x3,0x88(%rdi)
>>  228: 48 8b af 80 00 00 00    mov    0x80(%rdi),%rbp
>>  22f: 74 12                   je     243
>>  231: 48 89 e8                mov    %rbp,%rax
>>  234: 48 8b 5c 24 08          mov    0x8(%rsp),%rbx
>>  239: 48 8b 6c 24 10          mov    0x10(%rsp),%rbp
>>  23e: 48 83 c4 18             add    $0x18,%rsp
>>  242: c3                      retq
>>  243: 48 89 ef                mov    %rbp,%rdi
>>  246: e8 00 00 00 00          callq  24b
>>  24b: 85 c0                   test   %eax,%eax
>>  24d: 74 e2                   je     231
>>  24f: 48 8b 8b 98 00 00 00    mov    0x98(%rbx),%rcx
>>  256: 48 8b 11                mov    (%rcx),%rdx
>
>We are dereferencing invalid stack memory, using the stack pointer value of the
>preempted context.
>
>What could help is to have the registers dump which should appear in the oops
>message, and specifically the %rcx value.

I tried to set up the serial line console, but with only partial success.
I used this guide: http://www.av8n.com/computer/htm/kernel-lockup.htm (not the watchdog part).
Using gtkterm, I am now able to log in to the first (logged) machine from the second (logging) machine. But when I run

  echo "Hi there." > /dev/console

I see the text only on the current session (tty0) and not on the connected second machine (maybe gtkterm is not the right program for this). When I write directly to the serial port with

  echo "Hi there." > /dev/ttyS0

it freezes and I have to press Ctrl+C to continue (with: Interrupted signal call error).
In the dmesg output of the first machine, there is the line:

  console [ttyS0] enabled

The console on ttyS0, the getty on ttyS0 and gtkterm all use the same speed, 115200.
What have I done wrong?

Petr
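
For reference, the "symbol+offset/size" notation discussed in the quoted exchange above can be checked with a few lines of Python. This is only a sketch: the function names parse_oops_location and faulting_address are made up for illustration, and the start address 0x210 is taken from the disassembly listing above on the assumption that the listed routine really is profile_pc.

```python
import re


def parse_oops_location(text):
    """Split a 'symbol+0xOFFSET/0xSIZE' string into (symbol, offset, size)."""
    match = re.fullmatch(
        r"([\w.]+)\+0x([0-9a-fA-F]+)/0x([0-9a-fA-F]+)", text.strip()
    )
    if match is None:
        raise ValueError(f"not a symbol+offset/size location: {text!r}")
    symbol, offset, size = match.groups()
    return symbol, int(offset, 16), int(size, 16)


def faulting_address(symbol_start, text):
    """Return the address of the instruction, given the routine's start address."""
    _symbol, offset, size = parse_oops_location(text)
    if offset >= size:
        # The offset must lie inside the routine, whose length is 'size'.
        raise ValueError("offset lies outside the routine")
    return symbol_start + offset


if __name__ == "__main__":
    # 0x210 is where the routine starts in the object file listing (assumed).
    print(hex(faulting_address(0x210, "profile_pc+0x46/0x80")))
```

With these inputs the result is 0x256, which is the "mov (%rcx),%rdx" line in the listing; the second number (0x80) is only the length of the routine and does not enter the address.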