From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <522078F4.6080709@siemens.com>
Date: Fri, 30 Aug 2013 12:50:28 +0200
From: Jan Kiszka
MIME-Version: 1.0
References: <5215C4FE.707@inmess.de> <521871FD.30705@web.de> <521F3D6D.8050700@siemens.com> <5220498A.8050000@inmess.de>
In-Reply-To: <5220498A.8050000@inmess.de>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Subject: Re: [Xenomai] two dd processes: soft lockup
List-Id: Discussions about the Xenomai project
To: Benedikt Boeck
Cc: "xenomai@xenomai.org"

On 2013-08-30 09:28, Benedikt Boeck wrote:
> On 08/29/2013 02:24 PM, Jan Kiszka wrote:
>> On 2013-08-29 14:20, Benedikt Boeck wrote:
>>> On 08/28/2013 07:24 PM, Gilles Chanteperdrix wrote:
>>>> On 08/28/2013 08:00 AM, Jan Kiszka wrote:
>>>>> On 2013-08-27 23:28, Gilles Chanteperdrix wrote:
>>>>>> On 08/27/2013 03:56 PM, Benedikt Boeck wrote:
>>>>>>> On 08/27/2013 01:14 PM, Gilles Chanteperdrix wrote:
>>>>>>>> On 08/27/2013 12:47 PM, Benedikt Boeck wrote:
>>>>>>>>> On 08/24/2013 10:42 AM, Jan Kiszka wrote:
>>>>>>>>>> On 2013-08-22 09:59, Benedikt Boeck wrote:
>>>>>>>>>>> Hello,
>>>>>>>>>>>
>>>>>>>>>>> I got a problem. After starting two dd processes
>>>>>>>>>>> (if=/dev/zero of=/dev/null), I get first a few messages
>>>>>>>>>>> about soft lockup: kernel:[...] BUG: soft lockup -
>>>>>>>>>>> CPU#. stuck for ..s! [dd:...] and a little later a Kernel
>>>>>>>>>>> panic - not syncing: Fatal exception in interrupt
>>>>>>>>>>>
>>>>>>>>>>> But if I run xeno-test first and then start
>>>>>>>>>>> two dd processes I don't get a soft lockup. Strange?
>>>>>>>>>>>
>>>>>>>>>>> Also there is no soft lockup with a kernel that is identical
>>>>>>>>>>> except for a deactivated CONFIG_XENOMAI. A deactivated
>>>>>>>>>>> CONFIG_SMP also prevents the error. But I like to have
>>>>>>>>>>> both cores. Trying all four variants of
>>>>>>>>>>> CONFIG_SCHED_SMT=y/n and CONFIG_SCHED_MC=y/n didn't
>>>>>>>>>>> affect the error.
>>>>>>>>>>>
>>>>>>>>>>> Tested with different kernel versions (3.4.6, 3.5.7,
>>>>>>>>>>> 3.8) always with matching ipipe patch. I think with
>>>>>>>>>>> 3.8 I didn't get a kernel panic but still soft
>>>>>>>>>>> lockups. Using Xenomai 2.6.2.1 and haven't tried other
>>>>>>>>>>> versions yet. The processor is a Celeron Dual-Core
>>>>>>>>>>> T3100.
>>>>>>>>>>>
>>>>>>>>>>> Does somebody have an idea? Maybe I just made a simple
>>>>>>>>>>> (but effective) mistake.
>>>>>>>>>> I've tried your configuration on ipipe-3.8 [1] but wasn't
>>>>>>>>>> able to reproduce it in a 2-cpu VM with 2 dd instances.
>>>>>>>>>> Could you provide the 3.8 config as well that generates
>>>>>>>>>> the warnings for you? And please provide the information
>>>>>>>>>> Gilles asked for.
>>>>>>>>>>
>>>>>>>>>> Thanks, Jan
>>>>>>>>>>
>>>>>>>>>> [1]
>>>>>>>>>> http://git.xenomai.org/?p=ipipe.git;a=shortlog;h=refs/heads/ipipe-3.8
>>>>>>>>>>
>>>>>>>>> Thanks for your replies.
>>>>>>>>> Tested a 3.8 kernel again. I'm using
>>>>>>>>> http://download.gna.org/adeos/patches/v3.x/x86/ipipe-core-3.8-x86-1.patch
>>>>>>>>> as patch. Now I know for sure the dd processes generate
>>>>>>>>> soft lockups but no kernel panic occurs (running two
>>>>>>>>> hours). Attached today's 3.8 config.
>>>>>>>>>
>>>>>>>>> In my VM (VirtualBox) I can't produce the error either. But
>>>>>>>>> the host has a different CPU (Core 2 Quad Q6600, reduced
>>>>>>>>> number of cores for guest). Unfortunately I haven't got
>>>>>>>>> another system for testing here.
>>>>>>>>>
>>>>>>>>> Tested yesterday the 3.5.7 config with disabled
>>>>>>>>> CONFIG_NO_HZ: got the same behavior. Attached content of
>>>>>>>>> /proc/xenomai/timer and /proc/timer_list before starting dd
>>>>>>>>> or xeno-test (enabled NO_HZ). If needed I can also provide
>>>>>>>>> the content after calling xeno-test.
>>>>>>>> Yes please, the contents of the kernel boot logs would help
>>>>>>>> too, as well as /proc/interrupts before and after xeno-test.
>>>>>>>>
>>>>>>>> You have the HPET timer in broadcast mode, but the LAPIC
>>>>>>>> timers are started, now it would be interesting to know
>>>>>>>> whether they ticked, cat /proc/interrupts will tell us that.
>>>>>>>>
>>>>>>> To get matching output, I also captured the timer
>>>>>>> and timer_list anew. Here are the boot log, interrupts
>>>>>>> (after/before), timer (after/before), timer_list
>>>>>>> (after/before)
>>>>>> So, the HPET timer used as a PIT replacement for irq 0 only ticks 40
>>>>>> times. I had a similar issue when working on timers on I-pipe
>>>>>> core for 3.2.21. If I remember correctly, it was due to the fact
>>>>>> that IRQ 0 starts as a PIC interrupt, but at some point
>>>>>> transitions from PIC to IOAPIC, is masked at I-pipe level when it
>>>>>> is disabled, and never unmasked when enabled at IOAPIC level.
>>>>>> Maybe it is a similar issue? You can try compiling a kernel
>>>>>> without Xenomai support and see if irq 0 counter increments, and
>>>>>> boot with the "hpet=disable" argument, to disable the HPET, in
>>>>>> case I fixed the issue for the PIT, but not the HPET.
>>>>>>
>>>>> AFAIK, the broadcast timer is only used for kicking the APIC timers
>>>>> out of deep sleep states where they tend to stop. That should be
>>>>> rare, specifically when ACPI_PROCESSOR was disabled. And on modern
>>>>> systems (with continuously running APIC timers), IRQ0 doesn't fire
>>>>> at all after bootup.
>>>>>
>>>>> That said, it remains worthwhile trying what you suggested.
>>>> Indeed, mode 1 is shutdown, so the timer is not supposed to tick.
>>>>
>>> Unfortunately the patch didn't solve my problem. Do you need something
>>> tested with this kernel? Or are there other ideas I can test to locate
>>> the problem?
>> When the system reports the lockup, can you still interact with it in
>> some way? I'm wondering if we could extract a backtrace from running (or
>> stuck) tasks or even an ftrace dump.
>>
>>> Indeed the irq0 timer didn't increase with disabled xenomai and
>>> hpet=disabled/enabled. If I disable hpet the irq0 has 41/1 (cpu0/1)
>>> instead of 40/0 ticks.
>> Those ticks happen during boot-up, but the lockup is apparently later,
>> only if you start the dd processes. This is likely unrelated.
>>
>> Jan
>>
> Until the kernel panic occurs, I have a few minutes. Except with kernel
> 3.8, where no kernel panic occurs. So that could work. But in both cases
> the system is not very responsive. It seems that every keyboard input is
> buffered and executed between the lockups.

So we are losing IRQs - but maybe not all of them.

> I didn't try ssh, but I think it will not behave much better.
> Have no experience with backtrace or ftrace yet. For ftrace dump, which
> kernel options do I need to enable besides CONFIG_FTRACE?

Nothing in particular. You can record the trace with the trace-cmd tool,
maybe starting it via ssh and stopping it once the lockup becomes visible.
Set "-e all" in the first run.

> For backtrace
> I will need dd with debug symbols right? Unfortunately Debian7 seems not
> to have a coreutils-dbg package. So I will rebuild coreutils if needed.

I was thinking about sysrq-l or sysrq-w (e.g. via serial console to
capture the dump).

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux
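
Since the thread keeps comparing /proc/interrupts snapshots taken before
and after a test run, here is a minimal Python sketch of how two such
snapshots could be diffed to see which counters stopped moving. It is an
illustration only: the function names (parse_interrupts, diff_interrupts,
stalled) are made up for this example and are not part of any Xenomai or
kernel tooling, and it assumes the usual layout with a "CPU0 CPU1 ..."
header row followed by "name: count count ... description" rows.

```python
from itertools import takewhile


def parse_interrupts(text):
    """Parse /proc/interrupts text into {irq_name: [count_cpu0, count_cpu1, ...]}."""
    lines = text.strip().splitlines()
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    counts = {}
    for line in lines[1:]:
        name, _, rest = line.partition(":")
        # Take leading numeric fields only; rows such as ERR/MIS may carry
        # fewer counters than there are CPUs.
        numeric = takewhile(str.isdigit, rest.split()[:ncpus])
        counts[name.strip()] = [int(field) for field in numeric]
    return counts


def diff_interrupts(before, after):
    """Return per-CPU increments for every IRQ present in both snapshots."""
    delta = {}
    for name, new in after.items():
        old = before.get(name)
        if old is not None and len(old) == len(new):
            delta[name] = [n - o for o, n in zip(old, new)]
    return delta


def stalled(delta):
    """Names of IRQs whose counters did not move on any CPU."""
    return sorted(name for name, d in delta.items() if d and not any(d))
```

To use it, save the output of "cat /proc/interrupts" to one file before
starting the dd processes and to another once the lockups show up, read
both files, and pass their contents through parse_interrupts and
diff_interrupts. A per-CPU increment of zero for the local timer row on
one CPU would be a hint worth following up, not a diagnosis by itself.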