From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1751568Ab0JABYb (ORCPT ); Thu, 30 Sep 2010 21:24:31 -0400
Received: from bld-mail12.adl6.internode.on.net ([150.101.137.97]:40244
	"EHLO mail.internode.on.net" rhost-flags-OK-OK-OK-FAIL)
	by vger.kernel.org with ESMTP id S1751046Ab0JABYa (ORCPT );
	Thu, 30 Sep 2010 21:24:30 -0400
Date: Fri, 1 Oct 2010 11:24:25 +1000
From: Dave Chinner <david@fromorbit.com>
To: "J.H."
Cc: Kevin Hilman, users@kernel.org, linux-kernel
Subject: Re: [kernel.org users] cannot ssh to master.kernel.org
Message-ID: <20101001012425.GC5665@dastard>
References: <871v8bgpgb.fsf@deeprootsystems.com> <4CA51B61.8040404@kernel.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <4CA51B61.8040404@kernel.org>
User-Agent: Mutt/1.5.20 (2009-06-14)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Sep 30, 2010 at 04:21:05PM -0700, J.H. wrote:
> Hey everyone,
>
> So this morning we found 'master' basically spinning off the hook, with
> a consistently high load [READ: around 160ish]. Needless to say, mail
> was getting deferred due to the load, and I'm sure ssh was having
> similar issues.
>
> Several things were noticed on the system:
>
> - Several cron jobs had spun up multiple copies of themselves, despite
> locking being present in the jobs. I'm not sure if they were attempting
> to check the existing locks or if they were actually running through,
> but they were present.
>
> - The loads were relatively high, but not seemingly because of disk
> i/o. There were a great number of processes in the run state.
>
> - Attempting to explicitly kill many of the processes left them in a
> zombie state, but they were still consuming CPU resources. Attempting
> to kill them again did not result in the death of the processes, or the
> relinquishing of the CPU resources. Attempting to strace the processes
> yielded nothing of interest or use. lsof on a zombie process did return
> its currently open file handles, including tcp connections.
>
> - Disks all seemed to be readable and writeable.
>
> - A sysrq+l dump of the system in this messed-up state can be found at
> http://pastebin.osuosl.org/35126 (this was originally requested by
> Johannes Weiner)

I'd say the machine had a severe low memory condition, followed by
severe lock contention in the XFS inode cache shrinker, caused by
direct reclaim running in every process trying to allocate memory and
all of them trying to shrink the same cache. That is, every CPU has
traces like:

Sep 30 18:31:00 hera klogd: [] spin_lock+0xe/0x10 [xfs]
Sep 30 18:31:00 hera klogd: [] xfs_perag_get+0x2b/0x9d [xfs]
Sep 30 18:31:00 hera klogd: [] xfs_inode_ag_iterator+0x50/0xd8 [xfs]
Sep 30 18:31:00 hera klogd: [] ? xfs_reclaim_inode+0x0/0x206 [xfs]
Sep 30 18:31:00 hera klogd: [] xfs_reclaim_inode_shrink+0x66/0x12a [xfs]
Sep 30 18:31:00 hera klogd: [] shrink_slab+0xdb/0x15f
Sep 30 18:31:00 hera klogd: [] do_try_to_free_pages+0x21a/0x342
Sep 30 18:31:00 hera klogd: [] try_to_free_pages+0x6f/0x71
Sep 30 18:31:00 hera klogd: [] ? isolate_pages_global+0x0/0x22a
Sep 30 18:31:00 hera klogd: [] __alloc_pages_nodemask+0x3eb/0x636
Sep 30 18:31:00 hera klogd: [] alloc_pages_current+0x95/0x9e
Sep 30 18:31:00 hera klogd: [] alloc_slab_page+0x1b/0x28
Sep 30 18:31:00 hera klogd: [] new_slab+0x54/0x1d4
Sep 30 18:31:00 hera klogd: [] __slab_alloc+0x1f3/0x3a7
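To make the failure mode concrete, here is a toy userspace model of the
pattern (my own sketch for illustration, not XFS or kernel code): every
thread that needs memory drops into "direct reclaim" and runs the same
shrinker, so they all serialise on a single lock, the same shape as the
spin_lock/xfs_perag_get entries in the trace above.

/* toy model of concurrent direct reclaim hammering one global cache
 * lock; build with: gcc -O2 -pthread contention.c */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS        64      /* think: one allocating process per CPU */
#define NCALLS          10000   /* direct reclaim invocations per thread */

static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;
static long cached_objects = 100000000L;

/* stand-in for the shrinker: one global lock protecting one cache */
static void shrink_global_cache(int nr_to_scan)
{
        pthread_mutex_lock(&cache_lock);
        while (nr_to_scan-- > 0 && cached_objects > 0)
                cached_objects--;       /* "reclaim" one object */
        pthread_mutex_unlock(&cache_lock);
}

/* stand-in for a process allocating memory and entering direct reclaim */
static void *allocator(void *arg)
{
        (void)arg;
        for (int i = 0; i < NCALLS; i++)
                shrink_global_cache(128);
        return NULL;
}

int main(void)
{
        pthread_t tids[NTHREADS];

        for (int i = 0; i < NTHREADS; i++)
                pthread_create(&tids[i], NULL, allocator, NULL);
        for (int i = 0; i < NTHREADS; i++)
                pthread_join(tids[i], NULL);
        printf("objects left in cache: %ld\n", cached_objects);
        return 0;
}

With a userspace mutex the threads simply queue up on cache_lock instead
of making progress in parallel; the kernel equivalent here is a spinlock,
so every waiting CPU burns time spinning as well, which matches a load
around 160 with very little useful work getting done.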
> - kswapd0 has been using a lot of cpu time. I wouldn't be concerned
> about this if it was, say, 10% of a cpu, or maybe even 20% of a cpu
> for a short time. It has, however, been running in some cases at 100%
> cpu on a single core for hours on end. This seems to happen, in
> particular, under slightly higher loads, particularly when there are a
> number of rsyncs going on simultaneously.

Yes, this is definitely related to the hang. The rsyncs are generating
XFS inode cache pressure, and kswapd is running the shrinker to clean
it up.

master.kernel.org is running 2.6.34.7-56.fc11.x86_64, which has this
commit:

http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.34.y.git;a=commit;h=9bf729c0af67897ea8498ce17c29b0683f7f2028
(xfs: add a shrinker to background inode reclaim)

This was necessary to avoid a far worse OOM panic, but until about a
month ago nobody had any idea there were major lock contention problems
with this shrinker.

The fundamental problem here is not the XFS shrinker - it is the fact
that direct reclaim allows shrinkers that act on global caches to be
called concurrently on every CPU at once. Hence situations like this
can occur where parallel access to a global lock/list is the limiting
factor.

However, it's only been in the last month that I've really managed to
get the lock contention problem sorted out (because I found a
reproducible test case) and the shrinkers actually scaling. The series
they are in is:

http://git.kernel.org/?p=linux/kernel/git/dgc/xfsdev.git;a=shortlog;h=refs/heads/metadata-scale

which is queued up in the XFS tree for merging in the .37 merge window.
The specific commits involved are:

6c97772 xfs: serialise inode reclaim within an AG
e1a48db xfs: batch inode reclaim lookup
c727163 xfs: implement batched inode lookups for AG walking
7227905 xfs: split out inode walk inode grabbing
fa78a91 xfs: split inode AG walking into separate code for reclaim
ec9cb17 xfs: lockless per-ag lookups

This is a major rework of the entire inode cache walking code: it
converts AG lookups to be lockless (RCU), it batches inode lookups to
reduce lock traffic and hold times, it reworks the "can we reclaim this
inode" logic to avoid all locking until we are mostly certain the inode
can be reclaimed, and so on.

I'm not about to ask anyone to push these changes into .36 because they
need wider testing first, so it's going to be a while before stable
kernel backports are even possible.

> Replicating the rsyncs on a nearly identical box, running an older
> 2.6.30 kernel, did not show this much cpu usage from kswapd0. Johannes
> Weiner was helping me look into it yesterday, but I don't think
> anything was explicitly conclusive.

Yes, these issues started appearing around 2.6.34, when the XFS inode
cache reclaim was pushed into a background thread for scalability
reasons. I'd suggest that upgrading to the latest 2.6.35.y kernel would
be a good idea....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
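For anyone who wants a feel for what the batching in that series buys,
here is a small userspace sketch of the idea (again my own illustration,
not the actual XFS commits): pull a batch of candidates in one short
critical section, then do the per-object work with the lock dropped.

/* toy illustration of batched cache reclaim; build with:
 * gcc -O2 -pthread batching.c */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define BATCH_SIZE      32

struct obj {
        struct obj *next;
};

static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;
static struct obj *cache_head;

/* one lock round trip covers up to BATCH_SIZE objects instead of one */
static int shrink_cache_batched(void)
{
        struct obj *batch[BATCH_SIZE];
        int nr, i;

        pthread_mutex_lock(&cache_lock);
        for (nr = 0; nr < BATCH_SIZE && cache_head; nr++) {
                batch[nr] = cache_head;
                cache_head = cache_head->next;
        }
        pthread_mutex_unlock(&cache_lock);

        /* the expensive per-object work runs unlocked; the real series
         * also defers the "can we reclaim this inode" checks to here */
        for (i = 0; i < nr; i++)
                free(batch[i]);

        return nr;
}

int main(void)
{
        long batches = 0;

        for (int i = 0; i < 100000; i++) {      /* populate a fake cache */
                struct obj *o = malloc(sizeof(*o));
                if (!o)
                        break;
                o->next = cache_head;
                cache_head = o;
        }

        while (shrink_cache_batched() > 0)
                batches++;
        printf("drained the cache in %ld batches\n", batches);
        return 0;
}

Both the hold time per acquisition and the total number of acquisitions
drop by roughly the batch size, and the lockless (RCU) per-ag lookups in
the series take the AG lookup lock out of the picture entirely.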