* 3.1.4: NFSv3 RPC scheduling issue?
@ 2011-12-05 16:50 Frank van Maarseveen
  2011-12-05 23:39 ` Trond Myklebust
  0 siblings, 1 reply; 10+ messages in thread
From: Frank van Maarseveen @ 2011-12-05 16:50 UTC (permalink / raw)
To: Linux NFS mailing list

After upgrading 50+ NFSv3 (over UDP) client machines from 3.0.x to
3.1.4 I occasionally noticed a machine with lots of processes hanging
in __rpc_execute() for a specific mount point with no progress at all.
Stack:

[<c17fe7e0>] schedule+0x30/0x50
[<c177e259>] rpc_wait_bit_killable+0x19/0x30
[<c17feeb5>] __wait_on_bit+0x45/0x70
[<c177e240>] ? rpc_release_task+0x110/0x110
[<c17fef3d>] out_of_line_wait_on_bit+0x5d/0x70
[<c177e240>] ? rpc_release_task+0x110/0x110
[<c108aed0>] ? autoremove_wake_function+0x40/0x40
[<c177e89b>] __rpc_execute+0xdb/0x1a0
...

Every reference to the specific mount point on the client machine hangs
and the server does not receive any related network traffic. The server
works fine for other identical client machines with the same export mounted.
Other mounts on the (now) broken client still work. Killing the hanging
client processes repairs the situation.

This has happened a couple of times on client machines with heavy (NFS)
load. The mount point was originally mounted by the automounter.

-- 
Frank

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: 3.1.4: NFSv3 RPC scheduling issue?
  2011-12-05 16:50 3.1.4: NFSv3 RPC scheduling issue? Frank van Maarseveen
@ 2011-12-05 23:39 ` Trond Myklebust
  2011-12-06  8:11   ` Frank van Maarseveen
  2011-12-06  9:04   ` Frank van Maarseveen
  0 siblings, 2 replies; 10+ messages in thread
From: Trond Myklebust @ 2011-12-05 23:39 UTC (permalink / raw)
To: Frank van Maarseveen; +Cc: Linux NFS mailing list

On Mon, 2011-12-05 at 17:50 +0100, Frank van Maarseveen wrote:
> After upgrading 50+ NFSv3 (over UDP) client machines from 3.0.x to
> 3.1.4 I occasionally noticed a machine with lots of processes hanging
> in __rpc_execute() for a specific mount point with no progress at all.
> Stack:
>
> [<c17fe7e0>] schedule+0x30/0x50
> [<c177e259>] rpc_wait_bit_killable+0x19/0x30
> [<c17feeb5>] __wait_on_bit+0x45/0x70
> [<c177e240>] ? rpc_release_task+0x110/0x110
> [<c17fef3d>] out_of_line_wait_on_bit+0x5d/0x70
> [<c177e240>] ? rpc_release_task+0x110/0x110
> [<c108aed0>] ? autoremove_wake_function+0x40/0x40
> [<c177e89b>] __rpc_execute+0xdb/0x1a0
> ...
>
> Every reference to the specific mount point on the client machine hangs
> and the server does not receive any related network traffic. The server
> works fine for other identical client machines with the same export mounted.
> Other mounts on the (now) broken client still work. Killing the hanging
> client processes repairs the situation.
>
> This has happened a couple of times on client machines with heavy (NFS)
> load. The mount-point has originally been mounted by the automounter.

A command of 'echo 0 > /proc/sys/sunrpc/rpc_debug' should display a
list of pending rpc_tasks as well as information on where they are
sleeping. Can you please try this on one of the hanging clients and
post the resulting dump?

Cheers
  Trond

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com
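[Editor's note: in practice the suggestion above amounts to something like the following sketch. It is not from the thread; it assumes a 3.1-era kernel where the task dump is written to the kernel log (so it is read back with dmesg), and it requires root. The output file name is arbitrary.]

```shell
# Trigger a dump of all pending rpc_tasks. Any write to rpc_debug dumps
# the task list; writing 0 additionally leaves all RPC debug flags off,
# so no extra debug noise is enabled as a side effect.
echo 0 > /proc/sys/sunrpc/rpc_debug

# The dump goes to the kernel ring buffer, not to the proc file itself.
dmesg | tail -n 100 > /tmp/rpc-dump.1
```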
* Re: 3.1.4: NFSv3 RPC scheduling issue?
  2011-12-05 23:39 ` Trond Myklebust
@ 2011-12-06  8:11   ` Frank van Maarseveen
  2011-12-06 19:57     ` Trond Myklebust
  0 siblings, 1 reply; 10+ messages in thread
From: Frank van Maarseveen @ 2011-12-06 8:11 UTC (permalink / raw)
To: Trond Myklebust; +Cc: Linux NFS mailing list

On Mon, Dec 05, 2011 at 06:39:36PM -0500, Trond Myklebust wrote:
> On Mon, 2011-12-05 at 17:50 +0100, Frank van Maarseveen wrote:
> > After upgrading 50+ NFSv3 (over UDP) client machines from 3.0.x to
> > 3.1.4 I occasionally noticed a machine with lots of processes hanging
> > in __rpc_execute() for a specific mount point with no progress at all.
> > Stack:
> >
> > [<c17fe7e0>] schedule+0x30/0x50
> > [<c177e259>] rpc_wait_bit_killable+0x19/0x30
> > [<c17feeb5>] __wait_on_bit+0x45/0x70
> > [<c177e240>] ? rpc_release_task+0x110/0x110
> > [<c17fef3d>] out_of_line_wait_on_bit+0x5d/0x70
> > [<c177e240>] ? rpc_release_task+0x110/0x110
> > [<c108aed0>] ? autoremove_wake_function+0x40/0x40
> > [<c177e89b>] __rpc_execute+0xdb/0x1a0
> > ...
> >
> > Every reference to the specific mount point on the client machine hangs
> > and the server does not receive any related network traffic. The server
> > works fine for other identical client machines with the same export mounted.
> > Other mounts on the (now) broken client still work. Killing the hanging
> > client processes repairs the situation.
> >
> > This has happened a couple of times on client machines with heavy (NFS)
> > load. The mount-point has originally been mounted by the automounter.
>
> An command of 'echo 0 > /proc/sys/sunrpc/rpc_debug', should display a

36477 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:none
36479 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 LOOKUP a:call_reserveresult q:xprt_sending
36484 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 LOOKUP a:call_reserveresult q:xprt_sending
36485 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 LOOKUP a:call_reserveresult q:xprt_sending
36486 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
36487 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 LOOKUP a:call_reserveresult q:xprt_sending
36488 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
36489 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 LOOKUP a:call_reserveresult q:xprt_sending
36490 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
36491 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
36492 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
36493 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
36494 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
36495 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
36496 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 GETATTR a:call_reserveresult q:xprt_sending
36497 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
36498 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 LOOKUP a:call_reserveresult q:xprt_sending
36499 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
36500 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
36501 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
36502 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
36503 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 LOOKUP a:call_reserveresult q:xprt_sending
36504 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
36505 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
36506 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
36507 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
36508 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
36509 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
36510 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
36511 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
36512 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
36513 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
36514 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
36515 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
36516 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
36517 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
36518 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
36519 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
36523 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
36560 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
36561 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
36562 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
36563 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
36564 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
36565 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
36566 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
36576 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 GETATTR a:call_reserveresult q:xprt_sending
36577 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
36578 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
36579 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
36580 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
36581 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
36582 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
36583 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
36592 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 GETATTR a:call_reserveresult q:xprt_sending
36618 0001 -11 ffff88008dc9db60 (null) 0 ffffffff8193ba60 nfsv3 WRITE a:call_reserveresult q:xprt_sending
21609 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending

-- 
Frank
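[Editor's note: not part of the thread, but a dump in this format is easier to eyeball when summarized. A small sketch, assuming the column layout shown above: field 9 is the NFS procedure and field 11 is the rpc_wait_queue.]

```shell
# Count pending rpc_tasks per NFS procedure and per rpc_wait_queue.
# Reads rpc_debug dump lines (in the format shown above) on stdin and
# prints "proc <name> <count>" and "queue <name> <count>" lines.
summarize_rpc_dump() {
    awk '{ proc[$9]++; queue[$11]++ }
         END {
             for (p in proc)  print "proc", p, proc[p]
             for (q in queue) print "queue", q, queue[q]
         }' | sort
}
```

For the dump above this reports almost all tasks as ACCESS/LOOKUP/GETATTR requests parked on q:xprt_sending, which is the pattern being discussed.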
* Re: 3.1.4: NFSv3 RPC scheduling issue?
  2011-12-06  8:11 ` Frank van Maarseveen
@ 2011-12-06 19:57   ` Trond Myklebust
  2011-12-07 13:43     ` Frank van Maarseveen
  0 siblings, 1 reply; 10+ messages in thread
From: Trond Myklebust @ 2011-12-06 19:57 UTC (permalink / raw)
To: Frank van Maarseveen; +Cc: Linux NFS mailing list

On Tue, 2011-12-06 at 09:11 +0100, Frank van Maarseveen wrote:
> On Mon, Dec 05, 2011 at 06:39:36PM -0500, Trond Myklebust wrote:
> > On Mon, 2011-12-05 at 17:50 +0100, Frank van Maarseveen wrote:
> > > After upgrading 50+ NFSv3 (over UDP) client machines from 3.0.x to
> > > 3.1.4 I occasionally noticed a machine with lots of processes hanging
> > > in __rpc_execute() for a specific mount point with no progress at all.
[...]
> > An command of 'echo 0 > /proc/sys/sunrpc/rpc_debug', should display a
>
> 36477 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:none
> 36479 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 LOOKUP a:call_reserveresult q:xprt_sending
[...]
> 36618 0001 -11 ffff88008dc9db60 (null) 0 ffffffff8193ba60 nfsv3 WRITE a:call_reserveresult q:xprt_sending
> 21609 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending

Hmm... Is this the full dump from that client? I have a theory about
what is going on with the second dump that you showed, but I really do
not understand this one... If the above trace is complete, then it would
indicate that the value of xprt->snd_task has been corrupted somehow.

Trond

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com
* Re: 3.1.4: NFSv3 RPC scheduling issue?
  2011-12-06 19:57 ` Trond Myklebust
@ 2011-12-07 13:43   ` Frank van Maarseveen
  2011-12-10  3:10     ` Trond Myklebust
  0 siblings, 1 reply; 10+ messages in thread
From: Frank van Maarseveen @ 2011-12-07 13:43 UTC (permalink / raw)
To: Trond Myklebust; +Cc: Linux NFS mailing list

On Tue, Dec 06, 2011 at 02:57:43PM -0500, Trond Myklebust wrote:
> On Tue, 2011-12-06 at 09:11 +0100, Frank van Maarseveen wrote:
> > On Mon, Dec 05, 2011 at 06:39:36PM -0500, Trond Myklebust wrote:
> > > On Mon, 2011-12-05 at 17:50 +0100, Frank van Maarseveen wrote:
> > > > After upgrading 50+ NFSv3 (over UDP) client machines from 3.0.x to
> > > > 3.1.4 I occasionally noticed a machine with lots of processes hanging
> > > > in __rpc_execute() for a specific mount point with no progress at all.
[...]
> > > An command of 'echo 0 > /proc/sys/sunrpc/rpc_debug', should display a
> >
> > 36477 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:none
[...]
> > 21609 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
>
> Hmm... Is this the full dump from that client? I have a theory about
> what is going on with the second dump that you showed, but I really do
> not understand this one... If the above trace is complete, then it would
> indicate that the value of xprt->snd_task has been corrupted somehow.

I did three more 'echo 0 > /proc/sys/sunrpc/rpc_debug' before repairing
the mount and the outputs differ slightly (file '1' contains the above
dump; 2, 3 and 4 are the others):

diff 1 2:
| 58d57
| < 21609 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
diff 1 3:
| 47d46
| < 36566 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
| 58c57
| < 21609 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
| ---
| > 33046 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
diff 1 4:
| 16d15
| < 36496 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 GETATTR a:call_reserveresult q:xprt_sending
| 47d45
| < 36566 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
| 58c56,57
| < 21609 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
| ---
| > 33046 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
| > 34798 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending

-- 
Frank
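[Editor's note: the capture-and-compare procedure Frank describes can be scripted. A sketch, not from the thread, under the same assumptions as before: the dump lands in the kernel log, the ' nfsv3 ' filter and file layout are hypothetical, and capture_dumps needs root.]

```shell
# Capture n successive rpc_debug dumps a few seconds apart into dir/1..n.
capture_dumps() {
    n=$1; dir=$2
    for i in $(seq 1 "$n"); do
        echo 0 > /proc/sys/sunrpc/rpc_debug    # trigger dump to kernel log
        dmesg | grep ' nfsv3 ' > "$dir/$i"     # crude filter for task lines
        sleep 5
    done
}

# Print task ids (first column) that occur in both capture files, i.e.
# requests that made no progress between the two dumps.
stuck_tasks() {
    a=$(mktemp)
    b=$(mktemp)
    awk '{print $1}' "$1" | sort > "$a"
    awk '{print $1}' "$2" | sort > "$b"
    comm -12 "$a" "$b"
    rm -f "$a" "$b"
}
```

Comparing only the task-id column sidesteps noise from fields that may legitimately change between dumps.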
* Re: 3.1.4: NFSv3 RPC scheduling issue?
  2011-12-07 13:43 ` Frank van Maarseveen
@ 2011-12-10  3:10   ` Trond Myklebust
  2011-12-11 12:40     ` Frank van Maarseveen
  2011-12-11 14:09     ` Frank van Maarseveen
  0 siblings, 2 replies; 10+ messages in thread
From: Trond Myklebust @ 2011-12-10 3:10 UTC (permalink / raw)
To: Frank van Maarseveen; +Cc: Linux NFS mailing list

On Wed, 2011-12-07 at 14:43 +0100, Frank van Maarseveen wrote:
> I did three more 'echo 0 > /proc/sys/sunrpc/rpc_debug' before repairing
> the mount and outputs differ slightly (file '1' contains above dump, 2,
> 3 and 4 are the others):
>
> diff 1 2:
> | 58d57
> | < 21609 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
> diff 1 3:
> | 47d46
> | < 36566 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
> | 58c57
> | < 21609 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
> | ---
> | > 33046 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
> diff 1 4:
> | 16d15
> | < 36496 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 GETATTR a:call_reserveresult q:xprt_sending
> | 47d45
> | < 36566 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
> | 58c56,57
> | < 21609 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
> | ---
> | > 33046 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
> | > 34798 0080 -11 ffff88008dc9db60 (null) 0 ffffffff81a68860 nfsv3 ACCESS a:call_reserveresult q:xprt_sending

I'm still mystified as to what is going on here... Would it be possible
to upgrade some of your clients to 3.1.5 (which contains a fix for a
sunrpc socket buffer problem) and then to add the following patch?
Cheers
  Trond

8<-------------------------------------------------------------------------------
>From 72a29422b5d4e00c5aae6466b415cb5b075bb032 Mon Sep 17 00:00:00 2001
From: Trond Myklebust <Trond.Myklebust@netapp.com>
Date: Fri, 9 Dec 2011 13:40:57 -0500
Subject: [PATCH 1/2] debugging

- Ensure that we clean up in case the call to queue xprt->task_cleanup
  fails (although it is hard to see how it could).
- Ensure that we don't hold onto a UDP congestion slot if we're queued
  up waiting for the socket lock.
- Add debugging printks to check for the above two conditions.
- Add further debugging printks to track slot allocation
---
 net/sunrpc/xprt.c |   56 +++++++++++++++++++++++++++++++++++++++-------------
 1 files changed, 42 insertions(+), 14 deletions(-)

diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index c64c0ef..284eb42 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -44,6 +44,8 @@
 #include <linux/workqueue.h>
 #include <linux/net.h>
 #include <linux/ktime.h>
+#include <linux/ratelimit.h>
+#include <linux/printk.h>
 
 #include <linux/sunrpc/clnt.h>
 #include <linux/sunrpc/metrics.h>
@@ -66,6 +68,7 @@ static void xprt_init(struct rpc_xprt *xprt, struct net *net);
 static void xprt_request_init(struct rpc_task *, struct rpc_xprt *);
 static void xprt_connect_status(struct rpc_task *task);
 static int __xprt_get_cong(struct rpc_xprt *, struct rpc_task *);
+static void __xprt_put_cong(struct rpc_xprt *, struct rpc_rqst *);
 
 static DEFINE_SPINLOCK(xprt_list_lock);
 static LIST_HEAD(xprt_list);
@@ -227,15 +230,28 @@ out_sleep:
 }
 EXPORT_SYMBOL_GPL(xprt_reserve_xprt);
 
+static void __xprt_clear_locked(struct rpc_xprt *xprt)
+{
+	smp_mb__before_clear_bit();
+	clear_bit(XPRT_LOCKED, &xprt->state);
+	smp_mb__after_clear_bit();
+}
+
+static void xprt_run_task_cleanup(struct rpc_xprt *xprt)
+{
+	if (queue_work(rpciod_workqueue, &xprt->task_cleanup) == 0) {
+		__xprt_clear_locked(xprt);
+		pr_err("%s: RPC task cleanup failed to start!\n", __func__);
+	}
+}
+
 static void
 xprt_clear_locked(struct rpc_xprt *xprt)
 {
 	xprt->snd_task = NULL;
-	if (!test_bit(XPRT_CLOSE_WAIT, &xprt->state) || xprt->shutdown) {
-		smp_mb__before_clear_bit();
-		clear_bit(XPRT_LOCKED, &xprt->state);
-		smp_mb__after_clear_bit();
-	} else
-		queue_work(rpciod_workqueue, &xprt->task_cleanup);
+	if (test_bit(XPRT_CLOSE_WAIT, &xprt->state) && !xprt->shutdown)
+		xprt_run_task_cleanup(xprt);
+	else
+		__xprt_clear_locked(xprt);
 }
 /*
@@ -271,12 +287,18 @@ out_sleep:
 	dprintk("RPC: %5u failed to lock transport %p\n", task->tk_pid, xprt);
 	task->tk_timeout = 0;
 	task->tk_status = -EAGAIN;
-	if (req == NULL)
-		priority = RPC_PRIORITY_LOW;
-	else if (!req->rq_ntrans)
+	priority = RPC_PRIORITY_LOW;
+	if (req != NULL) {
 		priority = RPC_PRIORITY_NORMAL;
-	else
-		priority = RPC_PRIORITY_HIGH;
+		if (req->rq_ntrans)
+			priority = RPC_PRIORITY_HIGH;
+		if (req->rq_cong) {
+			__xprt_put_cong(xprt, req);
+			pr_err("%s: don't wait for a transport write lock "
+				"while holding a congestion slot.\n",
+				__func__);
+		}
+	}
 	rpc_sleep_on_priority(&xprt->sending, task, NULL, priority);
 	return 0;
 }
@@ -636,7 +658,7 @@ void xprt_force_disconnect(struct rpc_xprt *xprt)
 	set_bit(XPRT_CLOSE_WAIT, &xprt->state);
 	/* Try to schedule an autoclose RPC call */
 	if (test_and_set_bit(XPRT_LOCKED, &xprt->state) == 0)
-		queue_work(rpciod_workqueue, &xprt->task_cleanup);
+		xprt_run_task_cleanup(xprt);
 	xprt_wake_pending_tasks(xprt, -EAGAIN);
 	spin_unlock_bh(&xprt->transport_lock);
 }
@@ -663,7 +685,7 @@ void xprt_conditional_disconnect(struct rpc_xprt *xprt, unsigned int cookie)
 	set_bit(XPRT_CLOSE_WAIT, &xprt->state);
 	/* Try to schedule an autoclose RPC call */
 	if (test_and_set_bit(XPRT_LOCKED, &xprt->state) == 0)
-		queue_work(rpciod_workqueue, &xprt->task_cleanup);
+		xprt_run_task_cleanup(xprt);
 	xprt_wake_pending_tasks(xprt, -EAGAIN);
 out:
 	spin_unlock_bh(&xprt->transport_lock);
@@ -681,7 +703,7 @@ xprt_init_autodisconnect(unsigned long data)
 		goto out_abort;
 	spin_unlock(&xprt->transport_lock);
 	set_bit(XPRT_CONNECTION_CLOSE, &xprt->state);
-	queue_work(rpciod_workqueue, &xprt->task_cleanup);
+	xprt_run_task_cleanup(xprt);
 	return;
 out_abort:
 	spin_unlock(&xprt->transport_lock);
@@ -942,6 +964,8 @@ static struct rpc_rqst *xprt_dynamic_alloc_slot(struct rpc_xprt *xprt, gfp_t gfp
 {
 	struct rpc_rqst *req = ERR_PTR(-EAGAIN);
 
+	if (atomic_read(&xprt->num_reqs) > xprt->max_reqs)
+		pr_err_ratelimited("%s: xprt->num_reqs is out of range!\n", __func__);
 	if (!atomic_add_unless(&xprt->num_reqs, 1, xprt->max_reqs))
 		goto out;
 	req = kzalloc(sizeof(struct rpc_rqst), gfp_flags);
@@ -955,6 +979,8 @@ out:
 
 static bool xprt_dynamic_free_slot(struct rpc_xprt *xprt, struct rpc_rqst *req)
 {
+	if (atomic_read(&xprt->num_reqs) < xprt->min_reqs)
+		pr_err_ratelimited("%s: xprt->num_reqs is out of range!\n", __func__);
 	if (atomic_add_unless(&xprt->num_reqs, -1, xprt->min_reqs)) {
 		kfree(req);
 		return true;
@@ -1042,6 +1068,8 @@ struct rpc_xprt *xprt_alloc(struct net *net, size_t size,
 	xprt->max_reqs = num_prealloc;
 	xprt->min_reqs = num_prealloc;
 	atomic_set(&xprt->num_reqs, num_prealloc);
+	if (num_prealloc < 1)
+		pr_err("%s: too few preallocated slots!\n", __func__);
 
 	return xprt;

-- 
1.7.7.3

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com

^ permalink raw reply related	[flat|nested] 10+ messages in thread
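[Editorial note: the second bullet of the patch above — don't sleep on the transport write lock while holding a congestion slot — is easiest to see with a toy model of the congestion-slot accounting in net/sunrpc/xprt.c. This is a simplified sketch, not the kernel code: the constants mirror RPC_CWNDSCALE/RPC_INITCWND from the sunrpc headers, but the class and method names are made up.]

```python
RPC_CWNDSCALE = 256           # one congestion "slot", scaled as in sunrpc
RPC_INITCWND = RPC_CWNDSCALE  # UDP transports start with a one-slot window

class ToyXprt:
    """Minimal model of __xprt_get_cong()/__xprt_put_cong()."""

    def __init__(self) -> None:
        self.cwnd = RPC_INITCWND  # current congestion window (scaled)
        self.cong = 0             # slots currently held (scaled)

    def get_cong(self) -> bool:
        # A request may proceed only if there is room in the window.
        if self.cong + RPC_CWNDSCALE > self.cwnd:
            return False
        self.cong += RPC_CWNDSCALE
        return True

    def put_cong(self) -> None:
        self.cong -= RPC_CWNDSCALE

xprt = ToyXprt()
assert xprt.get_cong()      # the first task grabs the only initial slot
assert not xprt.get_cong()  # every other task is congestion-blocked
# If the slot holder now sleeps indefinitely waiting for the transport
# write lock, nothing behind it can ever pass get_cong().  The patch
# releases the slot before sleeping, so a waiter can still progress:
xprt.put_cong()
assert xprt.get_cong()
```

Under this model, a single task parked on the `xprt->sending` queue while still holding its slot is enough to wedge the whole transport, which matches the "everything stuck behind one task" shape of the dumps in this thread.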
* Re: 3.1.4: NFSv3 RPC scheduling issue?
  2011-12-10  3:10 ` Trond Myklebust
@ 2011-12-11 12:40 ` Frank van Maarseveen
  2011-12-11 18:10   ` Frank van Maarseveen
  2011-12-11 14:09 ` Frank van Maarseveen
  1 sibling, 1 reply; 10+ messages in thread
From: Frank van Maarseveen @ 2011-12-11 12:40 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Linux NFS mailing list

On Fri, Dec 09, 2011 at 10:10:01PM -0500, Trond Myklebust wrote:
> [...]
> I'm still mystified as to what is going on here...
>
> Would it be possible to upgrade some of your clients to 3.1.5 (which
> contains a fix for a sunrpc socket buffer problem) and then to add the
> following patch?

Did so, the mount locked up and still is, ready for some more
experimentation. I don't see any difference however. Did an
'echo 0 >/proc/sys/sunrpc/rpc_debug' afterwards (see below).

A recipe which seems to trigger the issue (at least occasionally) is

cd /mount-point
ssh server echo 3 \>/proc/sys/vm/drop_caches
echo 3 >/proc/sys/vm/drop_caches
for i in `seq 100`
do
	du >/dev/null 2>&1 &
done

I'll try it on a pristine kernel to rule out some kernel patches (unlikely
to be the cause or trigger but just to be sure).
-pid- flgs status -client- --rqstp- -timeout ---ops--
29785 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:none
29902 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29903 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29907 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29920 0080 -11 f519b6b0 f2606980 0 c191c4ac nfsv3 GETATTR a:call_status q:xprt_sending
29921 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29922 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29923 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29924 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29925 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29926 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29927 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29928 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29929 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29930 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29931 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29932 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29933 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29934 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29935 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29936 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29937 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29938 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29939 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29940 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29941 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29942 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29943 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29944 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29945 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29946 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29947 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29948 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29949 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29950 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29951 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29952 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29953 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29954 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29955 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29956 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29957 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29958 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29959 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29960 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29961 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29962 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29963 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29964 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29965 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29966 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29967 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29968 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29969 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29970 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29971 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29972 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29973 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29974 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29975 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29976 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29977 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29978 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29979 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29980 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29981 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29982 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29983 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29984 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29985 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29986 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29987 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29988 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29989 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29990 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29991 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29992 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29993 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29994 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29995 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29996 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29997 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29998 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29999 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
30000 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
30001 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
30002 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
30003 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
30004 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
30005 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
30006 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
30007 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
30008 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
30009 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
30010 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
30011 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
30012 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
30013 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
30014 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
30015 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
30025 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 ACCESS a:call_reserveresult q:xprt_sending

-- 
Frank

^ permalink raw reply	[flat|nested] 10+ messages in thread
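[Editorial note: every dump in this thread has the same shape — at most one task in a:call_status actually holding the transport, and everything behind it parked in a:call_reserveresult on the xprt_sending queue. A small parser makes that pattern easy to spot in large dumps. This is a sketch; the function name is made up and the column layout is taken from the rpc_debug output above, which may differ across kernel versions.]

```python
from collections import Counter

def summarize_rpc_dump(text: str) -> Counter:
    """Count rpc_debug tasks by (procedure, action, queue).

    Expects lines like:
    29902 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
    """
    counts = Counter()
    for line in text.splitlines():
        fields = line.split()
        # Skip the "-pid- flgs ..." header and anything non-record-shaped.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        proc = fields[8]                          # NFS procedure, e.g. GETATTR
        action = fields[9].removeprefix("a:")     # current rpc_action
        queue = fields[10].removeprefix("q:") if len(fields) > 10 else "-"
        counts[(proc, action, queue)] += 1
    return counts

sample = """\
-pid- flgs status -client- --rqstp- -timeout ---ops--
29920 0080 -11 f519b6b0 f2606980 0 c191c4ac nfsv3 GETATTR a:call_status q:xprt_sending
29921 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
29922 0080 -11 f519b6b0 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
"""
for key, n in summarize_rpc_dump(sample).most_common():
    print(n, key)
```

Run against the full dump above, this would show ~100 tasks in `(GETATTR, call_reserveresult, xprt_sending)` against a single `call_status` holder, which is the wedged-transport signature.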
* Re: 3.1.4: NFSv3 RPC scheduling issue?
  2011-12-11 12:40 ` Frank van Maarseveen
@ 2011-12-11 18:10 ` Frank van Maarseveen
  0 siblings, 0 replies; 10+ messages in thread
From: Frank van Maarseveen @ 2011-12-11 18:10 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Linux NFS mailing list

On Sun, Dec 11, 2011 at 01:40:08PM +0100, Frank van Maarseveen wrote:
> On Fri, Dec 09, 2011 at 10:10:01PM -0500, Trond Myklebust wrote:
> > [...]
> > I'm still mystified as to what is going on here...
> >
> > Would it be possible to upgrade some of your clients to 3.1.5 (which
> > contains a fix for a sunrpc socket buffer problem) and then to add the
> > following patch?
>
> Did so, the mount locked up and still is, ready for some more
> experimentation. I don't see any difference however. Did an
> 'echo 0 >/proc/sys/sunrpc/rpc_debug' afterwards (see below).
>
> A recipe which seems to trigger the issue (at least occasionally) is
>
> cd /mount-point
> ssh server echo 3 \>/proc/sys/vm/drop_caches
> echo 3 >/proc/sys/vm/drop_caches
> for i in `seq 100`
> do
> 	du >/dev/null 2>&1 &
> done
>
> I'll try it on a pristine kernel to rule out some kernel patches (unlikely
> to be the cause or trigger but just to be sure).

Tried, same result: my own NFS client patches seem not to make any
difference, as I expected. The ICMP port-unreachable messages (see my
other mail) go away when I stop ypbind, and they are triggered by
"ypwhich" commands too, so I consider them no longer relevant.

Not much output this time after "echo 0 >/proc/sys/sunrpc/rpc_debug".
I tried twice:

-pid- flgs status -client- --rqstp- -timeout ---ops--
16020 0080 -11 f4778230 f325d0a0 0 c191b4ac nfsv3 GETATTR a:call_status q:xprt_sending
16038 0080 -11 f4778230 (null) 0 c191b4ac nfsv3 GETATTR a:call_reserveresult q:none
16041 0080 -11 f4778230 (null) 0 c191b4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
16045 0080 -11 f4778230 (null) 0 c191b4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
16048 0080 -11 f4778230 (null) 0 c191b4ac nfsv3 READDIRPLUS a:call_reserveresult q:xprt_sending
16060 0080 -11 f4778230 (null) 0 c191b4ac nfsv3 ACCESS a:call_reserveresult q:xprt_sending
16062 0080 -11 f4778230 (null) 0 c191b4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
16069 0080 -11 f4778230 (null) 0 c191b4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending

-pid- flgs status -client- --rqstp- -timeout ---ops--
16020 0080 -11 f4778230 f325d0a0 0 c191b4ac nfsv3 GETATTR a:call_status q:xprt_sending
16038 0080 -11 f4778230 (null) 0 c191b4ac nfsv3 GETATTR a:call_reserveresult q:none
16041 0080 -11 f4778230 (null) 0 c191b4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
16045 0080 -11 f4778230 (null) 0 c191b4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
16048 0080 -11 f4778230 (null) 0 c191b4ac nfsv3 READDIRPLUS a:call_reserveresult q:xprt_sending
16060 0080 -11 f4778230 (null) 0 c191b4ac nfsv3 ACCESS a:call_reserveresult q:xprt_sending
16062 0080 -11 f4778230 (null) 0 c191b4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
16069 0080 -11 f4778230 (null) 0 c191b4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending

The NFS client mounts from a machine holding many virtual NFS servers,
using a separate IP address for every export. When access on the client
hangs, the same export is still mountable on this NFS client using a
different server IP address (one NIC at both sides, btw). The dead
virtual server IP address seems only dead for NFS RPC, and only from the
client in question: there is no traffic going out.
Ping, rpcinfo et al just work. Mount on the client in trouble using the dead IP address but specifying a different virtual server export produces some traffic and then gets stuck too, I guess at the point when kernel needs to do NFS RPC. So, kernel NFS RPC from client drops dead for a specific server IP address. -- Frank ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: 3.1.4: NFSv3 RPC scheduling issue?
  2011-12-10  3:10 ` Trond Myklebust
  2011-12-11 12:40 ` Frank van Maarseveen
@ 2011-12-11 14:09 ` Frank van Maarseveen
  1 sibling, 0 replies; 10+ messages in thread
From: Frank van Maarseveen @ 2011-12-11 14:09 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Linux NFS mailing list

On Fri, Dec 09, 2011 at 10:10:01PM -0500, Trond Myklebust wrote:
> [...]
> I'm still mystified as to what is going on here...

There is some network traffic I don't understand but it might be
relevant. Client machine: 172.17.1.49, server hardware: 172.17.2.241.
The server has an additional IP address for every exported mount point.
The one which is mounted and blocked on the client doesn't show up in
the traffic below. The server itself serves NIS too, but the client is
bound to a different physical machine for this.

The following was captured (added some tshark -V output to clarify a bit):

Frame Time Data
1 0.000000 172.17.1.49 -> 172.17.255.255 Portmap V2 CALLIT Call
--------------------------------------------------------------------
User Datagram Protocol, Src Port: 38242 (38242), Dst Port: sunrpc (111)
Remote Procedure Call, Type:Call XID:0x42d68775
    Program: Portmap (100000)
    Program Version: 2
    Procedure: CALLIT (5)
    Credentials
        Flavor: AUTH_UNIX (1)
        Length: 40
    Verifier
        Flavor: AUTH_NULL (0)
        Length: 0
Portmap DOMAIN_NONACK call
    [Program Version: 2]
    [V2 Procedure: CALLIT (5)]
    Program: YPSERV (100004)
    Version: 2
    Procedure: DOMAIN_NONACK (2)
    Argument length: 8
    Domain: TSKU
        length: 4
        contents: TSKU

2 0.000544 172.17.2.241 -> 172.17.1.49 UDP Source port: rda Destination port: 38242
-----------------------------------------------------------------------------------------
User Datagram Protocol, Src Port: rda (630), Dst Port: 38242 (38242)
Data (36 bytes)  // no clue from tshark

0000  42 d6 87 75 00 00 00 01 00 00 00 00 00 00 00 00   B..u............
0010  00 00 00 00 00 00 00 00 00 00 03 5b 00 00 00 04   ...........[....
0020  00 00 00 01                                       ....

3 0.000740 172.17.1.49 -> 172.17.2.241 ICMP Destination unreachable (Port unreachable)
-------------------------------------------------------------------------------------------
Internet Control Message Protocol
    Type: 3 (Destination unreachable)
    Code: 3 (Port unreachable)
User Datagram Protocol, Src Port: rda (630), Dst Port: 38242 (38242)

4 5.016300 Dell_e7:1f:f0 -> Dell_38:31:b8 ARP Who has 172.17.1.49? Tell 172.17.2.241
5 5.016443 Dell_38:31:b8 -> Dell_e7:1f:f0 ARP 172.17.1.49 is at 00:14:22:38:31:b8
6 20.001647 172.17.1.49 -> 172.17.255.255 Portmap V2 CALLIT Call
7 20.002132 172.17.2.241 -> 172.17.1.49 UDP Source port: rda Destination port: 49408
8 20.002375 172.17.1.49 -> 172.17.2.241 ICMP Destination unreachable (Port unreachable)
9 25.013384 Dell_38:31:b8 -> Dell_e7:1f:f0 ARP Who has 172.17.2.241? Tell 172.17.1.49
10 25.013397 Dell_e7:1f:f0 -> Dell_38:31:b8 ARP 172.17.2.241 is at 00:18:8b:e7:1f:f0
11 40.003525 172.17.1.49 -> 172.17.255.255 Portmap V2 CALLIT Call
12 40.004003 172.17.2.241 -> 172.17.1.49 UDP Source port: rda Destination port: 51755
13 40.004249 172.17.1.49 -> 172.17.2.241 ICMP Destination unreachable (Port unreachable)

The UDP port seems random and doesn't show up in rpcinfo -p for client
or server. What bothers me is the ICMP Destination unreachable. It
applies to a port (code=3), but it might have some adverse effect on
other traffic from the same hardware to the client. And I don't
understand why the ICMP error occurs. But maybe this is all irrelevant.

-- 
Frank

^ permalink raw reply	[flat|nested] 10+ messages in thread
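[Editorial note: the 36 bytes that tshark could not classify actually parse cleanly as an ONC RPC reply carrying the same XID (0x42d68775) as the broadcast CALLIT in frame 1, which fits the later finding that these are ypserv broadcast replies arriving after the client's ypbind socket has closed (hence the ICMP port-unreachable). A sketch of the decode; the field layout follows the RPC message protocol, and reading the last two words as the CALLIT result (port, then opaque result) is an interpretation, not something confirmed in the thread.]

```python
import struct

# Hex bytes of frame 2's 36-byte UDP payload, copied from the capture above.
payload = bytes.fromhex(
    "42d68775 00000001 00000000 00000000"
    "00000000 00000000 0000035b 00000004"
    "00000001"
)

# ONC RPC messages are big-endian XDR; unpack the first eight 32-bit words.
(xid, mtype, reply_stat,
 verf_flavor, verf_len, accept_stat,
 word6, word7) = struct.unpack_from(">8I", payload, 0)

assert xid == 0x42d68775                   # same XID as the broadcast call
assert mtype == 1                          # REPLY
assert reply_stat == 0                     # MSG_ACCEPTED
assert verf_flavor == 0 and verf_len == 0  # AUTH_NULL verifier
assert accept_stat == 0                    # SUCCESS

# For a PMAPPROC_CALLIT reply, the results would be: the remote port of
# the forwarded program, then the opaque result of the forwarded call
# (here, ypserv DOMAIN_NONACK returning TRUE).
port, res_len = word6, word7
result = payload[32:32 + res_len]
print(f"port={port} result={result.hex()}")
```

So the "mystery" packet would be a perfectly ordinary portmapper CALLIT reply (ypserv on port 859, result TRUE) that nothing on the client is listening for any more.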
* Re: 3.1.4: NFSv3 RPC scheduling issue?
  2011-12-05 23:39 ` Trond Myklebust
  2011-12-06  8:11 ` Frank van Maarseveen
@ 2011-12-06  9:04 ` Frank van Maarseveen
  1 sibling, 0 replies; 10+ messages in thread
From: Frank van Maarseveen @ 2011-12-06 9:04 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Linux NFS mailing list

On Mon, Dec 05, 2011 at 06:39:36PM -0500, Trond Myklebust wrote:
> On Mon, 2011-12-05 at 17:50 +0100, Frank van Maarseveen wrote:
> > After upgrading 50+ NFSv3 (over UDP) client machines from 3.0.x to
> > 3.1.4 I occasionally noticed a machine with lots of processes hanging
> > in __rpc_execute() for a specific mount point with no progress at all.
> > Stack:
> >
> > [<c17fe7e0>] schedule+0x30/0x50
> > [<c177e259>] rpc_wait_bit_killable+0x19/0x30
> > [<c17feeb5>] __wait_on_bit+0x45/0x70
> > [<c177e240>] ? rpc_release_task+0x110/0x110
> > [<c17fef3d>] out_of_line_wait_on_bit+0x5d/0x70
> > [<c177e240>] ? rpc_release_task+0x110/0x110
> > [<c108aed0>] ? autoremove_wake_function+0x40/0x40
> > [<c177e89b>] __rpc_execute+0xdb/0x1a0
> > ...
> >
> > Every reference to the specific mount point on the client machine hangs
> > and the server does not receive any related network traffic. The server
> > works fine for other identical client machines with the same export mounted.
> > Other mounts on the (now) broken client still work. Killing the hanging
> > client processes repairs the situation.
> >
> > This has happened a couple of times on client machines with heavy (NFS)
> > load. The mount-point has originally been mounted by the automounter.
>
> A command of 'echo 0 > /proc/sys/sunrpc/rpc_debug' should display a
> list of pending rpc_tasks as well as information on where they are
> sleeping.
> Can you please try this on one of the hanging clients and post the
> resulting dump?
Here's another one:

-pid- flgs status -client- --rqstp- -timeout ---ops--
28050 0080 -11 c2d3c460 (null) 0 c191c4ac nfsv3 ACCESS a:call_reserveresult q:none
28074 0080 -11 c2d3c460 c8e82000 0 c191c4ac nfsv3 LOOKUP a:call_status q:xprt_sending
28078 0080 -11 c2d3c460 (null) 0 c191c4ac nfsv3 LOOKUP a:call_reserveresult q:xprt_sending
28080 0080 -11 c2d3c460 (null) 0 c191c4ac nfsv3 ACCESS a:call_reserveresult q:xprt_sending
28085 0001 -11 c2d3c460 (null) 0 c182e34c nfsv3 READ a:call_reserveresult q:xprt_sending
28086 0001 -11 c2d3c460 (null) 0 c182e34c nfsv3 READ a:call_reserveresult q:xprt_sending
28087 0001 -11 c2d3c460 (null) 0 c182e34c nfsv3 READ a:call_reserveresult q:xprt_sending
28089 0001 -11 c2d3c460 (null) 0 c182e34c nfsv3 READ a:call_reserveresult q:xprt_sending
28090 0001 -11 c2d3c460 (null) 0 c182e34c nfsv3 READ a:call_reserveresult q:xprt_sending
28091 0001 -11 c2d3c460 (null) 0 c182e34c nfsv3 READ a:call_reserveresult q:xprt_sending
28092 0001 -11 c2d3c460 (null) 0 c182e34c nfsv3 READ a:call_reserveresult q:xprt_sending
28093 0001 -11 c2d3c460 (null) 0 c182e34c nfsv3 READ a:call_reserveresult q:xprt_sending
28094 0001 -11 c2d3c460 (null) 0 c182e34c nfsv3 READ a:call_reserveresult q:xprt_sending
28095 0001 -11 c2d3c460 (null) 0 c182e34c nfsv3 READ a:call_reserveresult q:xprt_sending
28096 0001 -11 c2d3c460 (null) 0 c182e34c nfsv3 READ a:call_reserveresult q:xprt_sending
28097 0001 -11 c2d3c460 (null) 0 c182e34c nfsv3 READ a:call_reserveresult q:xprt_sending
28098 0001 -11 c2d3c460 (null) 0 c182e34c nfsv3 READ a:call_reserveresult q:xprt_sending
28099 0001 -11 c2d3c460 (null) 0 c182e34c nfsv3 READ a:call_reserveresult q:xprt_sending
28100 0001 -11 c2d3c460 (null) 0 c182e34c nfsv3 READ a:call_reserveresult q:xprt_sending
28106 0001 -11 c2d3c460 (null) 0 c182e34c nfsv3 READ a:call_reserveresult q:xprt_sending
28107 0001 -11 c2d3c460 (null) 0 c182e34c nfsv3 READ a:call_reserveresult q:xprt_sending
28108 0001 -11 c2d3c460 (null) 0 c182e34c nfsv3 READ a:call_reserveresult q:xprt_sending
28109 0001 -11 c2d3c460 (null) 0 c182e34c nfsv3 READ a:call_reserveresult q:xprt_sending
28111 0001 -11 c2d3c460 (null) 0 c182e34c nfsv3 READ a:call_reserveresult q:xprt_sending
28112 0001 -11 c2d3c460 (null) 0 c182e34c nfsv3 READ a:call_reserveresult q:xprt_sending
28113 0001 -11 c2d3c460 (null) 0 c182e34c nfsv3 READ a:call_reserveresult q:xprt_sending
28114 0001 -11 c2d3c460 (null) 0 c182e34c nfsv3 READ a:call_reserveresult q:xprt_sending
28115 0001 -11 c2d3c460 (null) 0 c182e34c nfsv3 READ a:call_reserveresult q:xprt_sending
28116 0001 -11 c2d3c460 (null) 0 c182e34c nfsv3 READ a:call_reserveresult q:xprt_sending
28117 0001 -11 c2d3c460 (null) 0 c182e34c nfsv3 READ a:call_reserveresult q:xprt_sending
28118 0001 -11 c2d3c460 (null) 0 c182e34c nfsv3 READ a:call_reserveresult q:xprt_sending
28119 0001 -11 c2d3c460 (null) 0 c182e34c nfsv3 READ a:call_reserveresult q:xprt_sending
28120 0001 -11 c2d3c460 (null) 0 c182e34c nfsv3 READ a:call_reserveresult q:xprt_sending
28121 0001 -11 c2d3c460 (null) 0 c182e34c nfsv3 READ a:call_reserveresult q:xprt_sending
28131 0080 -11 c2d3c460 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
28144 0080 -11 c2d3c460 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
28145 0080 -11 c2d3c460 (null) 0 c191c4ac nfsv3 ACCESS a:call_reserveresult q:xprt_sending
28169 0080 -11 c2d3c460 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
28170 0080 -11 c2d3c460 (null) 0 c191c4ac nfsv3 ACCESS a:call_reserveresult q:xprt_sending
28207 0080 -11 c2d3c460 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
28210 0080 -11 c2d3c460 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
28228 0080 -11 c2d3c460 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
28237 0080 -11 c2d3c460 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
28297 0080 -11 c2d3c460 (null) 0 c191c4ac nfsv3 LOOKUP a:call_reserveresult q:xprt_sending
28306 0080 -11 c2d3c460 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
28311 0080 -11 c2d3c460 (null) 0 c191c4ac nfsv3 LOOKUP a:call_reserveresult q:xprt_sending
28385 0080 -11 c2d3c460 (null) 0 c191c4ac nfsv3 GETATTR a:call_reserveresult q:xprt_sending
28401 0080 -11 c2d3c460 (null) 0 c191c4ac nfsv3 ACCESS a:call_reserveresult q:xprt_sending
28915 0080 -11 c2d3c460 (null) 0 c191c4ac nfsv3 ACCESS a:call_reserveresult q:xprt_sending
29279 0080 -11 c2d3c460 (null) 0 c191c4ac nfsv3 ACCESS a:call_reserveresult q:xprt_sending
29393 0080 -11 c2d3c460 (null) 0 c191c4ac nfsv3 ACCESS a:call_reserveresult q:xprt_sending
29469 0080 -11 c2d3c460 (null) 0 c191c4ac nfsv3 ACCESS a:call_reserveresult q:xprt_sending
37587 0080 -11 c2d3c460 (null) 0 c191c4ac nfsv3 FSSTAT a:call_reserveresult q:xprt_sending

-- 
Frank

^ permalink raw reply	[flat|nested] 10+ messages in thread
end of thread, other threads:[~2011-12-11 18:10 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-12-05 16:50 3.1.4: NFSv3 RPC scheduling issue? Frank van Maarseveen
2011-12-05 23:39 ` Trond Myklebust
2011-12-06  8:11   ` Frank van Maarseveen
2011-12-06 19:57     ` Trond Myklebust
2011-12-07 13:43       ` Frank van Maarseveen
2011-12-10  3:10         ` Trond Myklebust
2011-12-11 12:40           ` Frank van Maarseveen
2011-12-11 18:10             ` Frank van Maarseveen
2011-12-11 14:09           ` Frank van Maarseveen
2011-12-06  9:04   ` Frank van Maarseveen