From mboxrd@z Thu Jan  1 00:00:00 1970
From: Jeff Layton <jlayton@redhat.com>
Subject: Re: NFS client hang on attempt to do async blocking posix lock
 enqueue
Date: Fri, 8 Feb 2008 15:54:14 -0500
Message-ID: <20080208155414.269f44d9@tleilax.poochiereds.net>
References: <20071129191532.GB17907@fieldses.org>
	<OF8E546EB8.CB97D7AD-ON882573A2.007C3C35-882573A2.007CB41A@us.ibm.com>
	<20080118230734.GE9754@fieldses.org>
	<34969391-0221-4AB3-99CE-ACC1817AD355@Sun.COM>
	<20080207232618.GF25374@fieldses.org>
	<20080208071502.3b952888@tleilax.poochiereds.net>
	<20080208143306.GA2177@fieldses.org>
	<Pine.BSO.4.64.0802081302130.6952@citi.umich.edu>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Cc: "J. Bruce Fields" <bfields@fieldses.org>,
	Oleg Drokin <Oleg.Drokin@Sun.COM>,
	Marc Eshel <eshel@almaden.ibm.com>,
	linux-fsdevel@vger.kernel.org, Manoj Naik <manoj@almaden.ibm.com>
To: "david m. richter" <richterd@citi.umich.edu>
Return-path: <linux-fsdevel-owner@vger.kernel.org>
Received: from mx1.redhat.com ([66.187.233.31]:34352 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1763307AbYBHU5Y (ORCPT <rfc822;linux-fsdevel@vger.kernel.org>);
	Fri, 8 Feb 2008 15:57:24 -0500
In-Reply-To: <Pine.BSO.4.64.0802081302130.6952@citi.umich.edu>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID: <linux-fsdevel.vger.kernel.org>

On Fri, 8 Feb 2008 13:49:01 -0500 (EST)
"david m. richter" <richterd@citi.umich.edu> wrote:

> On Fri, 8 Feb 2008, J. Bruce Fields wrote:
> 
> > On Fri, Feb 08, 2008 at 07:15:02AM -0500, Jeff Layton wrote:
> > > On Thu, 7 Feb 2008 18:26:18 -0500
> > > "J. Bruce Fields" <bfields@fieldses.org> wrote:
> > > 
> > > > On Sun, Jan 20, 2008 at 09:58:59AM -0500, Oleg Drokin wrote:
> > > > > Hello!
> > > > >
> > > > > On Jan 18, 2008, at 6:07 PM, J. Bruce Fields wrote:
> > > > >
> > > > >> On Thu, Nov 29, 2007 at 02:41:57PM -0800, Marc Eshel wrote:
> > > > > >>> The problem seems to be with the fact that the client and server are
> > > > > >>> on the same machine. This test works fine with or without an
> > > > > >>> underlying fs that supports locking when the client and the server
> > > > > >>> are on different machines. Like you said, the server is trying to
> > > > > >>> send the grant message to the client, but for some reason it fails
> > > > > >>> when the client is on the same machine.
> > > > >> That *shouldn't* make a difference, so we need to take another look at
> > > > >> this--Oleg, this problem is still unfixed, right?
> > > > >
> > > > > Yes, I just pulled your latest nfs tree and I still can reproduce the  
> > > > > problem.
> > > > 
> > > > OK, we have finally reproduced this problem here, and David's working on
> > > > debugging.  It does indeed seem to only be reproducible with client and
> > > > server on the same machine.  Thanks for the report....
> > > > 
> > > > --b.
> > > 
> > > It might be worth testing this both with and without the patchset I
> > > posted to linux-nfs recently to take care of the lockd hang. If
> > > lockd is stuck trying to rpc_ping itself then it probably would hang
> > > like this, wouldn't it?
> > 
> > Of course!  Yes, that fits.
> > 
> > --b.
> 
> 	right on, jeff, good catch and thanks for directing my attention 
> to your patches.
> 

Excellent! Glad that took care of it...

> 	i applied them on top of 2.6.23.1 and tested them on a cluster 
> exporting GFS2 over NFS, using oleg's reproducer code.  your patches fix 
> that lockd hang.
> 
> 	in a bit more detail, oleg's reproducer basically gets a 
> whole-file read lock, tests the lock, upgrades to a whole-file exclusive 
> lock, tests the lock, then unlocks.  the problem was that when getting 
> that exclusive lock things would hang.  this only happened when the client 
> and server were on the same machine, and i could reproduce it with NFS 
> exporting GFS2 but not NFS exporting EXT3.
> 
> 

Interesting. It's not clear to me why the underlying filesystem would make
any difference there. Though now that I look, it looks like fl_grant
really only gets called from dlm code, and that queues up the block for
an immediate grant callback attempt. So perhaps that's the reason.
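
For anyone following along, the lock sequence David describes above can
be sketched roughly as below. This is not Oleg's reproducer, just a
minimal illustration under a few assumptions: the function name
run_lock_sequence is made up, it uses Python's fcntl.lockf() (which
takes POSIX locks via fcntl()), and it leaves out the two "test the
lock" steps to keep the sketch short.

```python
import fcntl
import os
import tempfile


def run_lock_sequence(path):
    """Take a whole-file read lock, upgrade it to exclusive, then unlock.

    Hypothetical helper for illustration only; returns the steps completed.
    """
    steps = []
    fd = os.open(path, os.O_RDWR)
    try:
        # Default len=0 starting at offset 0 covers the whole file.
        fcntl.lockf(fd, fcntl.LOCK_SH)
        steps.append("read")
        # Blocking upgrade to an exclusive lock; per the report, this is
        # the request that hung on the affected setup.
        fcntl.lockf(fd, fcntl.LOCK_EX)
        steps.append("exclusive")
        fcntl.lockf(fd, fcntl.LOCK_UN)
        steps.append("unlock")
    finally:
        os.close(fd)
    return steps


if __name__ == "__main__":
    # On a local filesystem this simply runs through all three steps.
    with tempfile.NamedTemporaryFile() as tmp:
        print(run_lock_sequence(tmp.name))
```

To exercise the case discussed in this thread, the path would have to
point at a file on an NFS mount of a GFS2 export, with client and server
on the same machine; on a local filesystem the sequence should just
complete.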

-- 
Jeff Layton <jlayton@redhat.com>