From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jeff Layton
Subject: Re: NFS client hang on attempt to do async blocking posix lock enqueue
Date: Fri, 8 Feb 2008 15:54:14 -0500
Message-ID: <20080208155414.269f44d9@tleilax.poochiereds.net>
References: <20071129191532.GB17907@fieldses.org> <20080118230734.GE9754@fieldses.org> <34969391-0221-4AB3-99CE-ACC1817AD355@Sun.COM> <20080207232618.GF25374@fieldses.org> <20080208071502.3b952888@tleilax.poochiereds.net> <20080208143306.GA2177@fieldses.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Cc: "J. Bruce Fields" , Oleg Drokin , Marc Eshel , linux-fsdevel@vger.kernel.org, Manoj Naik
To: "david m. richter"
Return-path:
Received: from mx1.redhat.com ([66.187.233.31]:34352 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1763307AbYBHU5Y (ORCPT ); Fri, 8 Feb 2008 15:57:24 -0500
In-Reply-To:
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID:

On Fri, 8 Feb 2008 13:49:01 -0500 (EST)
"david m. richter" wrote:

> On Fri, 8 Feb 2008, J. Bruce Fields wrote:
>
> > On Fri, Feb 08, 2008 at 07:15:02AM -0500, Jeff Layton wrote:
> > > On Thu, 7 Feb 2008 18:26:18 -0500
> > > "J. Bruce Fields" wrote:
> > >
> > > > On Sun, Jan 20, 2008 at 09:58:59AM -0500, Oleg Drokin wrote:
> > > > > Hello!
> > > > >
> > > > > On Jan 18, 2008, at 6:07 PM, J. Bruce Fields wrote:
> > > > >
> > > > >> On Thu, Nov 29, 2007 at 02:41:57PM -0800, Marc Eshel wrote:
> > > > >>> The problem seems to be with the fact that the client and server are
> > > > >>> on
> > > > >>> the same machine. This test work fine with or without an underlaying
> > > > >>> fs
> > > > >>> that supports locking when the client and the server are on a
> > > > >>> different
> > > > >>> machines. Like you said the server is trying to send the grant
> > > > >>> message to
> > > > >>> the client but for some reason it fails when the client is on the
> > > > >>> same
> > > > >>> machine.
> > > > >> That *shouldn't* make a difference, so we need to take another look at
> > > > >> this--Oleg, this problem is still unfixed, right?
> > > > >
> > > > > Yes, I just pulled your latest nfs tree and I still can reproduce the
> > > > > problem.
> > > >
> > > > OK, we have finally reproduced this problem here, and David's working on
> > > > debugging. It does indeed seem to only be reproduceable with client and
> > > > server on the same machine. Thanks for the report....
> > > >
> > > > --b.
> > >
> > > It might be worth testing this both with and without the patchset I
> > > posted to linux-nfs recently to take care of the lockd hang. If
> > > lockd is stuck trying to rpc_ping itself then it probably would hang
> > > like this, wouldn't it?
> >
> > Of course! Yes, that fits.
> >
> > --b.
>
> right on, jeff, good catch and thanks for directing my attention
> to your patches.
>

Excellent! Glad that took care of it...

> i applied them on top of 2.6.23.1 and tested them on a cluster
> exporting GFS2 over NFS, using oleg's reproducer code. your patches fix
> that lockd hang.
>
> in a bit more detail, oleg's reproducer basically gets a
> whole-file read lock, tests the lock, upgrades to a whole-file exclusive
> lock, tests the lock, then unlocks. the problem was that when getting
> that exclusive lock things would hang. this only happened when the client
> and server were on the same machine, and i could reproduce it with NFS
> exporting GFS2 but not NFS exporting EXT3.
>

Interesting. It's not clear to me why the underlying filesystem would make
any difference there. Though now that I look, it looks like fl_grant
really only gets called from dlm code, and that queues up the block for
an immediate grant callback attempt. So perhaps that's the reason.

-- 
Jeff Layton
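
For reference, the lock sequence described above for Oleg's reproducer
(whole-file read lock, test, upgrade to whole-file exclusive lock, test,
unlock) could be sketched roughly as below. This is not Oleg's actual
code, which isn't shown in this thread. The function names, the choice
of Python, and the packed struct flock layout (assumed: 64-bit Linux)
are illustrative assumptions only.

```python
import fcntl
import os
import struct
import tempfile

# ASSUMPTION: struct flock layout on 64-bit Linux:
# short l_type, short l_whence, off_t l_start, off_t l_len, pid_t l_pid.
FLOCK_FMT = "hhqqi4x"


def query_lock(fd, lock_type):
    """F_GETLK: ask whether a whole-file lock of lock_type would conflict.

    Returns the l_type the kernel writes back; F_UNLCK means no conflict.
    """
    req = struct.pack(FLOCK_FMT, lock_type, os.SEEK_SET, 0, 0, 0)
    res = fcntl.fcntl(fd, fcntl.F_GETLK, req)
    return struct.unpack(FLOCK_FMT, res)[0]


def run_sequence(path):
    """Hypothetical stand-in for the reproducer's lock sequence."""
    steps = []
    fd = os.open(path, os.O_RDWR)
    try:
        # Blocking whole-file read lock (F_SETLKW with F_RDLCK).
        fcntl.lockf(fd, fcntl.LOCK_SH)
        steps.append("read-locked")
        steps.append(query_lock(fd, fcntl.F_WRLCK))

        # Blocking upgrade to a whole-file exclusive lock. Per the
        # description above, this is the step that hung.
        fcntl.lockf(fd, fcntl.LOCK_EX)
        steps.append("write-locked")
        steps.append(query_lock(fd, fcntl.F_RDLCK))

        fcntl.lockf(fd, fcntl.LOCK_UN)
        steps.append("unlocked")
    finally:
        os.close(fd)
    return steps


if __name__ == "__main__":
    # On a local temp file this completes immediately.
    with tempfile.NamedTemporaryFile() as tmp:
        print(run_sequence(tmp.name))
```

Pointed at a file on a loopback NFS mount instead of a local temp file,
the hang reported in this thread would presumably show up at the
blocking LOCK_EX call.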