NFS client hang on attempt to do async blocking posix lock enqueue

linux-fsdevel.vger.kernel.org archive mirror

* NFS client hang on attempt to do async blocking posix lock enqueue
@ 2007-11-29 19:15 J. Bruce Fields
  2007-11-29 22:41 ` Marc Eshel
  0 siblings, 1 reply; 12+ messages in thread
From: J. Bruce Fields @ 2007-11-29 19:15 UTC (permalink / raw)
  To: Oleg Drokin; +Cc: Manoj Naik, linux-fsdevel, Marc Eshel

On Thu, Nov 29, 2007 at 02:04:40PM -0500, Oleg Drokin wrote:
> Hello!
>
>     There is a problem with blocking async posix lock enqueue in
>     2.6.22 and 2.6.23 kernels.  Lock call to underlying FS is done
>     just fine, but when fl_grant is called to inform lockd of
>     successful granting, nothing happens, and no reply to client is
>     sent.  The end result is client reports that the server is not
>     responding.  I enabled dprintks in the code and I see that
>     immediately after fl_grant, there is nlmsvc_grant_blocked message
>     (after callback: label) printed. Then server not responding
>     messages start, and after every message about "couldn't create
>     RPC handle for localhost" I see nlmsvc_grant_blocked "lockd:
>     GRANTing blocked lock" message again with no activity from
>     underlying FS.
>
>     I am attaching a reproducer that I have, it is quite simple
>     actually.  Take note, that path to file to lock is hardcoded, so
>     adjust for your environment please.  Locking should be performed
>     on a file that resides on nfs client mountpoint.
>
>     I reproduced the problem with 2.6.22 and 2.6.23 with Lustre (I am
>     working on adapting lustre to async posix locks API) and GFS2.
>     Setup is totally local, i.e. I have single node on which there is
>     gfs (both server and client) (or lustre - just client, but that
>     does not make any difference), nfs server and nfs client that
>     mounts exported gfs or lustre.

Thanks, I'll take a look.  Replying now just to add Marc to the cc:.

--b.


* Re: NFS client hang on attempt to do async blocking posix lock enqueue
  2007-11-29 19:15 NFS client hang on attempt to do async blocking posix lock enqueue J. Bruce Fields
@ 2007-11-29 22:41 ` Marc Eshel
  2008-01-18 23:07   ` J. Bruce Fields
  0 siblings, 1 reply; 12+ messages in thread
From: Marc Eshel @ 2007-11-29 22:41 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: linux-fsdevel, Manoj Naik, Oleg Drokin

The problem seems to be with the fact that the client and server are on
the same machine. This test works fine with or without an underlying fs
that supports locking when the client and the server are on different
machines. Like you said the server is trying to send the grant message to
the client but for some reason it fails when the client is on the same
machine.
Marc.


* Re: NFS client hang on attempt to do async blocking posix lock enqueue
  2007-11-29 22:41 ` Marc Eshel
@ 2008-01-18 23:07   ` J. Bruce Fields
  2008-01-20 14:58     ` Oleg Drokin
  0 siblings, 1 reply; 12+ messages in thread
From: J. Bruce Fields @ 2008-01-18 23:07 UTC (permalink / raw)
  To: Marc Eshel; +Cc: linux-fsdevel, Manoj Naik, Oleg Drokin

On Thu, Nov 29, 2007 at 02:41:57PM -0800, Marc Eshel wrote:
> The problem seems to be with the fact that the client and server are on
> the same machine. This test works fine with or without an underlying fs
> that supports locking when the client and the server are on different
> machines. Like you said the server is trying to send the grant message to
> the client but for some reason it fails when the client is on the same
> machine.

That *shouldn't* make a difference, so we need to take another look at
this--Oleg, this problem is still unfixed, right?

--b.



* Re: NFS client hang on attempt to do async blocking posix lock enqueue
  2008-01-18 23:07   ` J. Bruce Fields
@ 2008-01-20 14:58     ` Oleg Drokin
  2008-02-07 23:26       ` J. Bruce Fields
  0 siblings, 1 reply; 12+ messages in thread
From: Oleg Drokin @ 2008-01-20 14:58 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Marc Eshel, linux-fsdevel, Manoj Naik

Hello!

On Jan 18, 2008, at 6:07 PM, J. Bruce Fields wrote:

> On Thu, Nov 29, 2007 at 02:41:57PM -0800, Marc Eshel wrote:
>> The problem seems to be with the fact that the client and server are
>> on the same machine. This test works fine with or without an
>> underlying fs that supports locking when the client and the server
>> are on different machines. Like you said the server is trying to
>> send the grant message to the client but for some reason it fails
>> when the client is on the same machine.
> That *shouldn't* make a difference, so we need to take another look at
> this--Oleg, this problem is still unfixed, right?

Yes, I just pulled your latest nfs tree and I still can reproduce the  
problem.

Bye,
     Oleg


* Re: NFS client hang on attempt to do async blocking posix lock enqueue
  2008-01-20 14:58     ` Oleg Drokin
@ 2008-02-07 23:26       ` J. Bruce Fields
  2008-02-08 12:15         ` Jeff Layton
  0 siblings, 1 reply; 12+ messages in thread
From: J. Bruce Fields @ 2008-02-07 23:26 UTC (permalink / raw)
  To: Oleg Drokin; +Cc: Marc Eshel, linux-fsdevel, Manoj Naik, richterd

On Sun, Jan 20, 2008 at 09:58:59AM -0500, Oleg Drokin wrote:
> Yes, I just pulled your latest nfs tree and I still can reproduce the  
> problem.

OK, we have finally reproduced this problem here, and David's working on
debugging.  It does indeed seem to only be reproducible with client and
server on the same machine.  Thanks for the report....

--b.


* Re: NFS client hang on attempt to do async blocking posix lock enqueue
  2008-02-07 23:26       ` J. Bruce Fields
@ 2008-02-08 12:15         ` Jeff Layton
  2008-02-08 14:33           ` J. Bruce Fields
  0 siblings, 1 reply; 12+ messages in thread
From: Jeff Layton @ 2008-02-08 12:15 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Oleg Drokin, Marc Eshel, linux-fsdevel, Manoj Naik, richterd

On Thu, 7 Feb 2008 18:26:18 -0500
"J. Bruce Fields" <bfields@fieldses.org> wrote:

> OK, we have finally reproduced this problem here, and David's working on
> debugging.  It does indeed seem to only be reproducible with client and
> server on the same machine.  Thanks for the report....
> 
> --b.

It might be worth testing this both with and without the patchset I
posted to linux-nfs recently to take care of the lockd hang. If
lockd is stuck trying to rpc_ping itself then it probably would hang
like this, wouldn't it?

-- 
Jeff Layton <jlayton@redhat.com>


* Re: NFS client hang on attempt to do async blocking posix lock enqueue
  2008-02-08 12:15         ` Jeff Layton
@ 2008-02-08 14:33           ` J. Bruce Fields
  2008-02-08 18:49             ` david m. richter
  0 siblings, 1 reply; 12+ messages in thread
From: J. Bruce Fields @ 2008-02-08 14:33 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Oleg Drokin, Marc Eshel, linux-fsdevel, Manoj Naik, richterd

On Fri, Feb 08, 2008 at 07:15:02AM -0500, Jeff Layton wrote:
> It might be worth testing this both with and without the patchset I
> posted to linux-nfs recently to take care of the lockd hang. If
> lockd is stuck trying to rpc_ping itself then it probably would hang
> like this, wouldn't it?

Of course!  Yes, that fits.

--b.


* Re: NFS client hang on attempt to do async blocking posix lock enqueue
  2008-02-08 14:33           ` J. Bruce Fields
@ 2008-02-08 18:49             ` david m. richter
  2008-02-08 20:54               ` Jeff Layton
  0 siblings, 1 reply; 12+ messages in thread
From: david m. richter @ 2008-02-08 18:49 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Jeff Layton, Oleg Drokin, Marc Eshel, linux-fsdevel, Manoj Naik

On Fri, 8 Feb 2008, J. Bruce Fields wrote:

> On Fri, Feb 08, 2008 at 07:15:02AM -0500, Jeff Layton wrote:
> > It might be worth testing this both with and without the patchset I
> > posted to linux-nfs recently to take care of the lockd hang. If
> > lockd is stuck trying to rpc_ping itself then it probably would hang
> > like this, wouldn't it?
> 
> Of course!  Yes, that fits.
> 
> --b.

	right on, jeff, good catch and thanks for directing my attention 
to your patches.

	i applied them on top of 2.6.23.1 and tested them on a cluster 
exporting GFS2 over NFS, using oleg's reproducer code.  your patches fix 
that lockd hang.

	in a bit more detail, oleg's reproducer basically gets a 
whole-file read lock, tests the lock, upgrades to a whole-file exclusive 
lock, tests the lock, then unlocks.  the problem was that when getting 
that exclusive lock things would hang.  this only happened when the client 
and server were on the same machine, and i could reproduce it with NFS 
exporting GFS2 but not NFS exporting EXT3.
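
The sequence described above can be sketched in C roughly as follows. This is a condensed illustration, not the reproducer attached to the original report: the names `run_lock_sequence` and `whole_file` are made up here, and the path is whatever file on the NFS client mount gets passed in.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: describe a lock of the given type over the whole file. */
static void whole_file(struct flock *fl, short type)
{
        fl->l_type = type;
        fl->l_whence = SEEK_SET;
        fl->l_start = 0;
        fl->l_len = 0;          /* 0 means "through end of file" */
}

/*
 * Hypothetical function: read lock, test, upgrade to exclusive, test,
 * unlock.  Returns 0 if every fcntl() call succeeds, -1 otherwise.
 */
int run_lock_sequence(const char *path)
{
        struct flock fl;
        int fd = open(path, O_RDWR | O_CREAT, 0666);

        if (fd < 0) {
                perror("open");
                return -1;
        }

        whole_file(&fl, F_RDLCK);       /* whole-file read lock */
        if (fcntl(fd, F_SETLK, &fl) == -1)
                goto fail;
        if (fcntl(fd, F_GETLK, &fl) == -1)      /* test the lock */
                goto fail;

        whole_file(&fl, F_WRLCK);       /* blocking upgrade to exclusive */
        if (fcntl(fd, F_SETLKW, &fl) == -1)
                goto fail;
        if (fcntl(fd, F_GETLK, &fl) == -1)      /* test again */
                goto fail;

        whole_file(&fl, F_UNLCK);       /* unlock */
        if (fcntl(fd, F_SETLK, &fl) == -1)
                goto fail;

        close(fd);
        return 0;
fail:
        perror("fcntl");
        close(fd);
        return -1;
}
```

On a local filesystem a call such as `run_lock_sequence("/tmp/somefile")` should simply return 0. On an affected setup the F_SETLKW step is where the hang would be expected to show up.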


	thanks,

	d
	.


* Re: NFS client hang on attempt to do async blocking posix lock enqueue
  2008-02-08 18:49             ` david m. richter
@ 2008-02-08 20:54               ` Jeff Layton
  2008-02-08 21:12                 ` J. Bruce Fields
  0 siblings, 1 reply; 12+ messages in thread
From: Jeff Layton @ 2008-02-08 20:54 UTC (permalink / raw)
  To: david m. richter
  Cc: J. Bruce Fields, Oleg Drokin, Marc Eshel, linux-fsdevel,
	Manoj Naik

On Fri, 8 Feb 2008 13:49:01 -0500 (EST)
"david m. richter" <richterd@citi.umich.edu> wrote:

> 	right on, jeff, good catch and thanks for directing my attention 
> to your patches.
> 

Excellent! Glad that took care of it...

> 	i applied them on top of 2.6.23.1 and tested them on a cluster 
> exporting GFS2 over NFS, using oleg's reproducer code.  your patches fix 
> that lockd hang.
> 
> 	in a bit more detail, oleg's reproducer basically gets a 
> whole-file read lock, tests the lock, upgrades to a whole-file exclusive 
> lock, tests the lock, then unlocks.  the problem was that when getting 
> that exclusive lock things would hang.  this only happened when the client 
> and server were on the same machine, and i could reproduce it with NFS 
> exporting GFS2 but not NFS exporting EXT3.
> 
> 

Interesting. It's not clear to me why the underlying filesystem would make
any difference there. Though now that I look, it looks like fl_grant
really only gets called from dlm code, and that queues up the block for
an immediate grant callback attempt. So perhaps that's the reason.

-- 
Jeff Layton <jlayton@redhat.com>


* Re: NFS client hang on attempt to do async blocking posix lock enqueue
  2008-02-08 20:54               ` Jeff Layton
@ 2008-02-08 21:12                 ` J. Bruce Fields
  2008-02-08 21:27                   ` Jeff Layton
  0 siblings, 1 reply; 12+ messages in thread
From: J. Bruce Fields @ 2008-02-08 21:12 UTC (permalink / raw)
  To: Jeff Layton
  Cc: david m. richter, Oleg Drokin, Marc Eshel, linux-fsdevel,
	Manoj Naik

On Fri, Feb 08, 2008 at 03:54:14PM -0500, Jeff Layton wrote:
> Interesting. It's not clear to me why the underlying filesystem would make
> any difference there. Though now that I look, it looks like fl_grant
> really only gets called from dlm code, and that queues up the block for
> an immediate grant callback attempt. So perhaps that's the reason.

The asynchronous locking interface does something slightly cheesy for
blocking locks--instead of waiting for the filesystem to respond, it
just sends back a deny immediately (even if the lock might actually be
available), then responds later with a granted message when it discovers
it's available.

That works, but we should make it just wait to send the reply to the
original lock request until we've got a real answer, as we do for
nonblocking lock requests.  And in fact someone submitted a patch to do
that--I just haven't gotten the time to review it.  Urp.

So anyway the effect is that on ext3 this particular lock wouldn't have
required a grant reply, whereas on gfs2 it does.

Of course, what this means is that we'd hit the same problem on ext3 too
if the lock request did in fact legitimately block.  So grant callbacks
probably have never worked on ext3 over the loopback interface either.
Oops!

I bet nobody's ever noticed because we manage to recover by retrying the
lock after it's available (whereas in the gfs2 case the retry hits the
same problem).  So in practice for ext3 this probably just means
blocking lock requests take a lot longer over loopback than they would
otherwise.  And probably the only people that care about nlm performance
don't usually do local mounts like that.
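
To make the cases above concrete, here is a toy model of the reasoning in C. It is only a sketch of the argument in this mail, not lockd code: the enum, the function name and the three boolean inputs are all invented for illustration, and it assumes an asynchronous filesystem answers every blocking request (retries included) with an immediate deny.

```c
#include <stdbool.h>

/* Illustrative only; none of these names exist in lockd. */
enum lock_outcome {
        GRANTED_IN_REPLY,       /* the first reply already grants the lock */
        GRANTED_BY_CALLBACK,    /* the later GRANTED message reaches the client */
        GRANTED_BY_RETRY,       /* callback lost, but the client's retry succeeds */
        NEVER_GRANTED           /* callback lost and each retry is denied again */
};

/*
 * fs_is_async:        filesystem uses the asynchronous lock interface
 *                     (gfs2-like) rather than answering directly (ext3-like)
 * lock_is_contended:  some other holder really does block the request
 * callback_delivered: the server manages to send GRANTED to the client
 */
enum lock_outcome blocking_lock_outcome(bool fs_is_async,
                                        bool lock_is_contended,
                                        bool callback_delivered)
{
        /* An async filesystem gets an immediate deny even for a free lock. */
        bool denied_at_first = fs_is_async || lock_is_contended;

        if (!denied_at_first)
                return GRANTED_IN_REPLY;
        if (callback_delivered)
                return GRANTED_BY_CALLBACK;
        /* Callback lost: the client falls back to resending the request. */
        if (!fs_is_async)
                return GRANTED_BY_RETRY;        /* slow, but it recovers */
        return NEVER_GRANTED;                   /* the retry is denied again */
}
```

With the callback broken over loopback, `blocking_lock_outcome(false, true, false)` gives GRANTED_BY_RETRY (the slow ext3 case) while `blocking_lock_outcome(true, false, false)` gives NEVER_GRANTED (the gfs2 hang).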

--b.


* Re: NFS client hang on attempt to do async blocking posix lock enqueue
  2008-02-08 21:12                 ` J. Bruce Fields
@ 2008-02-08 21:27                   ` Jeff Layton
  0 siblings, 0 replies; 12+ messages in thread
From: Jeff Layton @ 2008-02-08 21:27 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: david m. richter, Oleg Drokin, Marc Eshel, linux-fsdevel,
	Manoj Naik

On Fri, 8 Feb 2008 16:12:28 -0500
"J. Bruce Fields" <bfields@fieldses.org> wrote:

> On Fri, Feb 08, 2008 at 03:54:14PM -0500, Jeff Layton wrote:
> > Interesting. It's not clear to me why the underlying filesystem would make
> > any difference there. Though now that I look, it looks like fl_grant
> > really only gets called from dlm code, and that queues up the block for
> > an immediate grant callback attempt. So perhaps that's the reason.
> 
> The asynchronous locking interface does something slightly cheesy for
> blocking locks--instead of waiting for the filesystem to respond, it
> just sends back a deny immediately (even if the lock might actually be
> available), then responds later with a granted message when it discovers
> it's available.
> 
> That works, but we should make it just wait to send the reply to the
> original lock request until we've got a real answer, as we do for
> nonblocking lock requests.  And in fact someone submitted a patch to do
> that--I just haven't gotten the time to review it.  Urp.
> 
> So anyway the effect is that on ext3 this particular lock wouldn't have
> required a grant reply, whereas on gfs2 it does.
> 
> Of course, what this means is that we'd hit the same problem on ext3 too
> if the lock request did in fact legitimately block.  So grant callbacks
> probably have never worked on ext3 over the loopback interface either.
> Oops!
> 

As best I can tell, the whole problem with rpc_pings was introduced
when we moved everything to the rpcbind stuff. Before that we generally
never did an rpc_ping when binding the client. This probably did work
until that was introduced.

> I bet nobody's ever noticed because we manage to recover by retrying the
> lock after it's available (whereas in the gfs2 case the retry hits the
> same problem).  So in practice for ext3 this probably just means
> blocking lock requests take a lot longer over loopback than they would
> otherwise.  And probably the only people that care about nlm performance
> don't usually do local mounts like that.
> 

-- 
Jeff Layton <jlayton@redhat.com>


* NFS client hang on attempt to do async blocking posix lock enqueue
@ 2007-11-29 19:04 Oleg Drokin
  0 siblings, 0 replies; 12+ messages in thread
From: Oleg Drokin @ 2007-11-29 19:04 UTC (permalink / raw)
  To: J. Bruce Fields, Manoj Naik; +Cc: linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 1415 bytes --]

Hello!

     There is a problem with blocking async posix lock enqueue in
     2.6.22 and 2.6.23 kernels.  Lock call to underlying FS is done
     just fine, but when fl_grant is called to inform lockd of
     successful granting, nothing happens, and no reply to client is
     sent.  The end result is client reports that the server is not
     responding.  I enabled dprintks in the code and I see that
     immediately after fl_grant, there is nlmsvc_grant_blocked message
     (after callback: label) printed.  Then server not responding
     messages start, and after every message about "couldn't create
     RPC handle for localhost" I see nlmsvc_grant_blocked "lockd:
     GRANTing blocked lock" message again with no activity from
     underlying FS.

     I am attaching a reproducer that I have, it is quite simple
     actually.  Take note, that path to file to lock is hardcoded, so
     adjust for your environment please.  Locking should be performed
     on a file that resides on nfs client mountpoint.

     I reproduced the problem with 2.6.22 and 2.6.23 with Lustre (I am
     working on adapting lustre to async posix locks API) and GFS2.
     Setup is totally local, i.e. I have single node on which there is
     gfs (both server and client) (or lustre - just client, but that
     does not make any difference), nfs server and nfs client that
     mounts exported gfs or lustre.

Bye,
     Oleg

[-- Attachment #2: flock.c --]
[-- Type: application/octet-stream, Size: 5959 bytes --]

/* -*- mode: c; c-basic-offset: 8; indent-tabs-mode: nil; -*-
 * vim:expandtab:shiftwidth=8:tabstop=8:
 *
 * Lustre Light user test program
 *
 *  Copyright (c) 2002, 2003 Cluster File Systems, Inc.
 *
 *   This file is part of Lustre, http://www.lustre.org.
 *
 *   Lustre is free software; you can redistribute it and/or
 *   modify it under the terms of version 2 of the GNU General Public
 *   License as published by the Free Software Foundation.
 *
 *   Lustre is distributed in the hope that it will be useful,
 *   but WITHOUT ANY WARRANTY; without even the implied warranty of
 *   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *   GNU General Public License for more details.
 *
 *   You should have received a copy of the GNU General Public License
 *   along with Lustre; if not, write to the Free Software
 *   Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
 */

#define _BSD_SOURCE

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <getopt.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/queue.h>
#include <signal.h>
#include <errno.h>
#include <dirent.h>
#include <sys/uio.h>
#include <sys/time.h>
#include <stdarg.h>

static char lustre_path[] = "/mnt/lustre2";

#define ENTRY(str)                                                      \
        do {                                                            \
                char buf[100];                                          \
                int len;                                                \
                sprintf(buf, "===== START %s: %s ", __FUNCTION__, (str)); \
                len = strlen(buf);                                      \
                if (len < 79) {                                         \
                        memset(buf+len, '=', 100-len);                  \
                        buf[79] = '\n';                                 \
                        buf[80] = 0;                                    \
                }                                                       \
                printf("%s", buf);                                      \
        } while (0)

#define LEAVE()                                                         \
        do {                                                            \
                char buf[100];                                          \
                int len;                                                \
                sprintf(buf, "===== END TEST %s: successfully ",        \
                        __FUNCTION__);                                  \
                len = strlen(buf);                                      \
                if (len < 79) {                                         \
                        memset(buf+len, '=', 100-len);                  \
                        buf[79] = '\n';                                 \
                        buf[80] = 0;                                    \
                }                                                       \
                printf("%s", buf);                                      \
        } while (0)

#define EXIT return

#define MAX_PATH_LENGTH 4096


int t_fcntl(int fd, int cmd, ...)
{
	va_list ap;
	long arg;
	struct flock *lock = NULL;
	int rc = -1;

	va_start(ap, cmd);
	switch (cmd) {
	case F_GETFL:
		va_end(ap);
		rc = fcntl(fd, cmd);
		if (rc == -1) {
			printf("fcntl GETFL failed: %s\n",
				 strerror(errno));
			EXIT(1);
		}
		break;
	case F_SETFL:
		/* callers pass an int flag word, so read it back as int */
		arg = va_arg(ap, int);
		va_end(ap);
		rc = fcntl(fd, cmd, arg);
		if (rc == -1) {
			printf("fcntl SETFL %ld failed: %s\n",
				 arg, strerror(errno));
			EXIT(1);
		}
		break;
	case F_GETLK:
	case F_SETLK:
	case F_SETLKW:
		lock = va_arg(ap, struct flock *);
		va_end(ap);
		rc = fcntl(fd, cmd, lock);
		if (rc == -1) {
			printf("fcntl cmd %d failed: %s\n",
				 cmd, strerror(errno));
			EXIT(1);
		}
		break;
	case F_DUPFD:
		/* F_DUPFD takes an int argument */
		arg = va_arg(ap, int);
		va_end(ap);
		rc = fcntl(fd, cmd, arg);
		if (rc == -1) {
			printf("fcntl F_DUPFD %d failed: %s\n",
				 (int)arg, strerror(errno));
			EXIT(1);
		}
		break;
	default:
		va_end(ap);
		printf("fcntl cmd %d not supported\n", cmd);
		EXIT(1);
	}
        if (lock)
                printf("fcntl %d = %d, ltype = %d\n", cmd, rc, lock->l_type);
	return rc;
}

int t_unlink(const char *path)
{
        int rc;

        rc = unlink(path);
        if (rc) {
                printf("unlink(%s) error: %s\n", path, strerror(errno));
                EXIT(-1);
        }
        return rc;
}

void t21()
{
        char file[MAX_PATH_LENGTH] = "";
        int fd, ret;
	struct flock lock = {
		.l_type = F_RDLCK,
		.l_whence = SEEK_SET,
	};

        ENTRY("basic fcntl support");
        snprintf(file, MAX_PATH_LENGTH, "%s/test_t21_file", lustre_path);

        fd = open(file, O_RDWR|O_CREAT, (mode_t)0666);
        if (fd < 0) {
                printf("error open file %s: %s\n", file, strerror(errno));
                exit(-1);
        }

        t_fcntl(fd, F_SETFL, O_APPEND);
        /* test the flag on ret itself; "!ret & O_APPEND" parsed wrongly */
        ret = t_fcntl(fd, F_GETFL);
        if (!(ret & O_APPEND)) {
                printf("error get flag: ret %x\n", ret);
                exit(-1);
        }

	t_fcntl(fd, F_SETLK, &lock);
	t_fcntl(fd, F_GETLK, &lock);
	lock.l_type = F_WRLCK;
	t_fcntl(fd, F_SETLKW, &lock);
	t_fcntl(fd, F_GETLK, &lock);
	lock.l_type = F_UNLCK;
	t_fcntl(fd, F_SETLK, &lock);

        close(fd);
        t_unlink(file);
        LEAVE();
}


int main(int argc, char * const argv[])
{
        /* Set D_VFSTRACE to see messages from ll_file_flock.
           The test passes either with -o flock or -o noflock 
           mount -o flock -t lustre uml1:/mds1/client /mnt/lustre */
        t21();

	printf("completed successfully\n");
	return 0;
}

[-- Attachment #3: Type: text/plain, Size: 1 bytes --]


