From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 8 Sep 2011 10:43:24 -0700
From: Simon Kirby
Subject: Re: [Drbd-dev] [3.1-rc4] XFS+DRBD hangs
Message-ID: <20110908174324.GA8043@hostway.ca>
References: <20110907221505.GC21603@hostway.ca> <20110908151305.GJ14243@barkeeper1-xen.linbit>
In-Reply-To: <20110908151305.GJ14243@barkeeper1-xen.linbit>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Disposition: inline
Content-Transfer-Encoding: 7bit
List-Id: XFS Filesystem from SGI
Sender: xfs-bounces@oss.sgi.com
Errors-To: xfs-bounces@oss.sgi.com
To: drbd-dev@lists.linbit.com, linux-kernel@vger.kernel.org, xfs@oss.sgi.com

On Thu, Sep 08, 2011 at 05:13:05PM +0200, Lars Ellenberg wrote:

> Sorry for double posting on drbd-dev, I managed to strip the other
> lists from Cc.
>
> > We upgraded from 2.6.36 which seemed to have a page leak (file pages left
> > on the LRU) and so would eventually perform very poorly. 2.6.37 and
> > 2.6.38 seemed to have some unix socket issue that caused heartbeat to
> > wedge. Shall we enable lock debugging or something here?
>
> That could help us understand that stack trace.
>
> It looks like cpu 1 blocks in
>
> > [ 1532.427149] [] ? try_to_wake_up+0xc2/0x270
> > [ 1532.427149] <> [] default_wake_function+0xd/0x10
>
> Which does not make sense to me at all.

Well, good news, I think..
I believe this may be related to "PCI: Set PCI-E Max Payload Size on
fabric", added by b03e7495a862b02829. 3.1-rc5 is running now with a
patch that basically disables those changes, and it has been stable for
12 hours. Before, it usually hung within a few minutes.

The XFS people say it was very likely not 58d84c4ee0389ddeb86238d5,
which is the only other change between these versions that seems to be
in the hang path at all.

Also, when the thing hangs, it stops pinging immediately, and with the
PCI-E max payload change active, the device that raises a bus error is
actually the PCI-E to PCI-X bridge chip used to support the BCM5708
NICs, so that all seems related.

Simon-