From mboxrd@z Thu Jan 1 00:00:00 1970
From: Andrew Morton
Subject: Re: [BUG] Soft-lockup during cpu-hotplug in VFS callpaths
Date: Wed, 24 Aug 2011 16:02:51 -0700
Message-ID: <20110824160251.3b8ba1f6.akpm@linux-foundation.org>
References: <4E550057.9070609@linux.vnet.ibm.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-pm@lists.linux-foundation.org, Nick Piggin
To: "Srivatsa S. Bhat"
Return-path:
In-Reply-To: <4E550057.9070609@linux.vnet.ibm.com>
Sender: linux-kernel-owner@vger.kernel.org
List-Id: linux-fsdevel.vger.kernel.org

On Wed, 24 Aug 2011 19:14:55 +0530
"Srivatsa S. Bhat" wrote:

> Hi,
>
> While running stressful cpu hotplug tests along with kernel compilation
> running in background, soft-lockups are detected on multiple CPUs.
> Sometimes this also leads to hard lockups and kernel panic.
> All the soft-lockups seem to occur at vfsmount_lock_local_cpu() or other VFS
> callpaths.
>
>
> [37108.410813] BUG: soft lockup - CPU#5 stuck for 22s! [cc1:29669]
>
> [37108.694781] Call Trace:
> [37108.697306] [] ? vfsmount_lock_local_lock_cpu+0x70/0x70
> [37108.704258] [] path_init+0x315/0x400
> [37108.709558] [] ? __raw_spin_lock_init+0x38/0x70
> [37108.715812] [] path_openat+0x8c/0x3f0
> [37108.721203] [] ? sched_clock+0x9/0x10
> [37108.726597] [] ? sched_clock_cpu+0xcd/0x110
> [37108.732508] [] ? trace_hardirqs_off+0xd/0x10
> [37108.738498] [] ? local_clock+0x6f/0x80
> [37108.743970] [] do_filp_open+0x49/0xa0
> [37108.749362] [] ? alloc_fd+0xc3/0x210
> [37108.754665] [] ? _raw_spin_unlock+0x2b/0x40
> [37108.760575] [] ? alloc_fd+0xc3/0x210
> [37108.765875] [] do_sys_open+0x107/0x1e0
> [37108.771352] [] ? audit_syscall_entry+0x1bf/0x1f0
> [37108.777695] [] sys_open+0x20/0x30
> [37108.782741] [] system_call_fastpath+0x16/0x1b
>
> Kernel version: 3.0.1, 3.0.3
> Hardware: Dual socket quad-core hyper-threaded Intel x86 machine
> Scenario:
> (a) Stressful cpu hotplug tests + kernel compilation
>
> (b) IRQ balancing had been disabled and all the IRQs were made to be
> routed to CPU 0 (except the ones that couldn't be routed).
>
> (c) Lockdep was enabled during kernel configuration.
>
> Steps (b) and (c) were done to dig deeper into the issue. However the same
> issue was observed by just doing step (a).
>
> Definitely there seems to be a race condition occurring here, because this
> issue is hit after sometime, after starting the tests. And the time it
> takes to hit the issue increases as we increase the number of debug print
> statements. In some cases (especially when the number of debug print
> statements were quite high), the stress on the machine had to be increased
> in order to hit the issue within measurable time. In my tests, a maximum
> of about 2 to 2.5 hours was sufficient, to hit this bug.
>
> Please find the console log attached with this mail.
>
> Any ideas on how to go about fixing this bug?

It's probably a bug in the core brlock implementation. I don't know
who would work on fixing that.