Date: Sun, 11 Jan 2015 12:26:04 -0800
From: "Paul E. McKenney"
To: "Stoidner, Christoph"
Cc: "linux-kernel@vger.kernel.org"
Subject: Re: Question concerning RCU
Message-ID: <20150111202604.GC8063@linux.vnet.ibm.com>
Reply-To: paulmck@linux.vnet.ibm.com
References: <94921a144a97457385ae95b838c3c6fa@EX132MBOX1A.de2.local> <20150106194317.GG5280@linux.vnet.ibm.com> <81b94fc89c774b71a967fc93823e9c63@EX132MBOX1A.de2.local>
In-Reply-To: <81b94fc89c774b71a967fc93823e9c63@EX132MBOX1A.de2.local>

On Sun, Jan 11, 2015 at 11:59:45AM +0000, Stoidner, Christoph wrote:
>
> Hi Paul,
>
> many thanks for your fast answer!
>
> Now I have changed my application so that it does not require
> Xenomai/I-Pipe anymore. That means my kernel is now built from
> mainline source, with preempt_rt only and no Xenomai or I-Pipe.
> However, the problem is exactly the same. After some runtime (minutes
> or hours) the kernel freezes and JTAG debugging shows that it ends up
> in an endless loop in rcu_print_task_stall (as described before).
>
> > First I have seen this. Were you doing lots of CPU-hotplug operations?
>
> My system has only one core. So I think there should not be any
> CPU-hotplugging.

OK, so no point in providing you that set of patches, then.

> > If you have more CPUs than the value of CONFIG_RCU_FANOUT (which
> > defaults to 16), and if your workload offlined a full block of CPUs (full
> > blocks being CPUs 0-15, 16-31, 32-47, and so on for the default value
> > of CONFIG_RCU_FANOUT), then there is a theoretical issue that -might-
> > cause the problem that you are seeing.
>
> So this could not happen on a single-core system. Am I right?

Yep, no way this can happen without a lot of CPUs and a lot of CPU
hotplugging.

> I have no idea how to find the problem. Do you have any more hints or ideas?

You got stack traces with the stall warnings, correct? If so, please
look at them and at Documentation/RCU/stallwarn.txt and see if the
kernel is looping somewhere inappropriate.

I am not familiar with the low-level ARM kernel code, but the stack
below leads me to suspect that your kernel is interrupting itself to
death or is improperly handling interrupts.

							Thanx, Paul

> Here is a backtrace when the problem has occurred on the system without Xenomai/I-Pipe:
>
> #0 rcu_print_task_stall (rnp=0xc0498dc8 ) at kernel/rcutree_plugin.h:528
> #1 0xc005cabc in print_other_cpu_stall (rsp=0xc0498dc8 ) at kernel/rcutree.c:885
> #2 check_cpu_stall (rdp=0x80000093, rsp=0xc0498dc8 ) at kernel/rcutree.c:977
> #3 __rcu_pending (rdp=0x80000093, rsp=0xc0498dc8 ) at kernel/rcutree.c:2750
> #4 rcu_pending (cpu=) at kernel/rcutree.c:2800
> #5 rcu_check_callbacks (cpu=, user=) at kernel/rcutree.c:2179
> #6 0xc0027648 in update_process_times (user_tick=0) at kernel/timer.c:1427
> #7 0xc004e840 in tick_sched_timer (timer=0xc0498860 ) at kernel/time/tick-sched.c:1095
> #8 0xc003a0dc in __run_hrtimer (timer=0xc0498860 , now=) at kernel/hrtimer.c:1363
> #9 0xc003ab4c in hrtimer_interrupt (dev=) at kernel/hrtimer.c:1582
> #10 0xc02bf7bc in mxs_timer_interrupt (irq=, dev_id=) at drivers/clocksource/mxs_timer.c:132
> #11 0xc0055154 in handle_irq_event_percpu (desc=0xc7804c00, action=0xc04b0520 ) at kernel/irq/handle.c:144
> #12 0xc0055320 in handle_irq_event (desc=0xc7804c00) at kernel/irq/handle.c:197
> #13 0xc00578b8 in handle_level_irq (irq=, desc=0xc7804c00) at kernel/irq/chip.c:406
> #14 0xc0054aec in generic_handle_irq_desc (desc=, irq=16) at include/linux/irqdesc.h:115
> #15 generic_handle_irq (irq=16) at kernel/irq/irqdesc.c:314
> #16 0xc000f58c in handle_IRQ (irq=16, regs=) at arch/arm/kernel/irq.c:80
> #17 0xc000e360 in __irq_svc () at arch/arm/kernel/entry-armv.S:202
> #18 0xc000e360 in __irq_svc () at arch/arm/kernel/entry-armv.S:202
> #19 0xc000e360 in __irq_svc () at arch/arm/kernel/entry-armv.S:202
> #20 0xc000e360 in __irq_svc () at arch/arm/kernel/entry-armv.S:202
> ...
>
> Thanks and regards,
> Christoph
>
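
One way to check a long backtrace for the pattern suspected above is to
collapse runs of identical consecutive frames. The following is a minimal
sketch in Python, not a definitive tool: the helper names (frame_functions,
repeated_runs) and the regular expression are assumptions made for this
illustration and are not part of gdb or the kernel. A long run of the same
frame is only a hint. It may point to nested interrupt entry, but it can
also appear when the debugger cannot unwind past an exception-entry frame.

```python
import re
from itertools import groupby

# Hypothetical pattern for gdb-style frame lines, with or without an address:
#   "#17 0xc000e360 in __irq_svc () at arch/arm/kernel/entry-armv.S:202"
#   "#15 generic_handle_irq (irq=16) at kernel/irq/irqdesc.c:314"
FRAME_RE = re.compile(r"#(\d+)\s+(?:0x[0-9a-fA-F]+\s+in\s+)?(\w+)\s*\(")


def frame_functions(backtrace):
    """Return the function name of every frame, innermost frame first."""
    return [match.group(2) for match in FRAME_RE.finditer(backtrace)]


def repeated_runs(functions, min_len=2):
    """Return (function, run_length) for each run of identical consecutive frames."""
    runs = []
    for name, group in groupby(functions):
        length = sum(1 for _ in group)
        if length >= min_len:
            runs.append((name, length))
    return runs


# The last frames of the backtrace quoted in this message.
SAMPLE = """\
#15 generic_handle_irq (irq=16) at kernel/irq/irqdesc.c:314
#16 0xc000f58c in handle_IRQ (irq=16, regs=) at arch/arm/kernel/irq.c:80
#17 0xc000e360 in __irq_svc () at arch/arm/kernel/entry-armv.S:202
#18 0xc000e360 in __irq_svc () at arch/arm/kernel/entry-armv.S:202
#19 0xc000e360 in __irq_svc () at arch/arm/kernel/entry-armv.S:202
#20 0xc000e360 in __irq_svc () at arch/arm/kernel/entry-armv.S:202
"""

if __name__ == "__main__":
    for name, length in repeated_runs(frame_functions(SAMPLE)):
        print(f"{name} repeats {length} times in a row")
```

For the frames shown, the only repeated run is __irq_svc, four times in a
row. The regular expression also matches lines that still carry the "> "
quote prefix, so a quoted backtrace can be pasted in unchanged.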