* Re: nfs client hang [not found] <99613C19B13C5D40914FB8930657FA9303365708DE@uk-ex-mbx1.terastack.bluearc.com> @ 2010-07-27 7:25 ` Andy Chittenden 2010-07-27 10:53 ` Andy Chittenden 0 siblings, 1 reply; 18+ messages in thread From: Andy Chittenden @ 2010-07-27 7:25 UTC (permalink / raw) To: Linux Kernel Mailing List (linux-kernel@vger.kernel.org); +Cc: Trond Myklebust On 2010-07-23 13:36, Andy Chittenden wrote: >> IE the client starts a connection and then closes it again without sending data. > Once this happens, here's some rpcdebug info for the rpc module using 2.6.34.1 kernel: > > ... lots of the following nfsv3 WRITE requests: > [ 7670.026741] 57793 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog > [ 7670.026759] 57794 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog > [ 7670.026778] 57795 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog > [ 7670.026797] 57796 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog > [ 7670.026815] 57797 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog > [ 7670.026834] 57798 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog > [ 7670.026853] 57799 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog > [ 7670.026871] 57800 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog > [ 7670.026890] 57801 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog > [ 7670.026909] 57802 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog > [ 7680.520042] RPC: worker connecting xprt ffff88013e62d800 via tcp to 10.1.6.102 (port 2049) > [ 7680.520066] RPC: ffff88013e62d800 connect status 99 connected 0 sock state 7 > [ 7680.520074] RPC: 33550 __rpc_wake_up_task (now 4296812426) > [ 7680.520079] RPC: 33550 disabling timer > [ 7680.520084] RPC: 33550 removed from queue ffff88013e62db20 "xprt_pending" > [ 7680.520089] RPC: __rpc_wake_up_task done > [ 7680.520094] RPC: 33550 __rpc_execute flags=0x1 > [ 7680.520098] RPC: 33550 xprt_connect_status: retrying > [ 7680.520103] RPC: 33550 call_connect_status (status -11) > [ 7680.520108] RPC: 33550 call_transmit (status 0) > [ 7680.520112] RPC: 33550 xprt_prepare_transmit > [ 7680.520118] RPC: 33550 rpc_xdr_encode (status 0) > [ 7680.520123] RPC: 33550 marshaling UNIX cred ffff88012e002300 > [ 7680.520130] RPC: 33550 using AUTH_UNIX cred ffff88012e002300 to wrap rpc data > [ 7680.520136] RPC: 33550 xprt_transmit(32920) > [ 7680.520145] RPC: xs_tcp_send_request(32920) = -32 > [ 7680.520151] RPC: xs_tcp_state_change client ffff88013e62d800... > [ 7680.520156] RPC: state 7 conn 0 dead 0 zapped 1 I changed that debug to output sk_shutdown too. That has a value of 2 (IE SEND_SHUTDOWN). Looking at tcp_sendmsg(), I see this: err = -EPIPE; if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN)) goto out_err; which correlates with the trace "xs_tcp_send_request(32920) = -32". So, this looks like a problem in the sockets/tcp layer. The rpc layer issues a shutdown and then reconnects using the same socket. So either sk_shutdown needs zeroing once the shutdown completes or should be zeroed on subsequent connect. The latter sounds safer. -- Andy, BlueArc Engineering ^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: nfs client hang 2010-07-27 7:25 ` nfs client hang Andy Chittenden @ 2010-07-27 10:53 ` Andy Chittenden 2010-07-27 12:21 ` Eric Dumazet 0 siblings, 1 reply; 18+ messages in thread From: Andy Chittenden @ 2010-07-27 10:53 UTC (permalink / raw) To: Andy Chittenden, Linux Kernel Mailing List (linux-kernel@vger.kernel.org) Cc: Trond Myklebust > >> IE the client starts a connection and then closes it again without sending data. > > Once this happens, here's some rpcdebug info for the rpc module using 2.6.34.1 kernel: > > > > ... lots of the following nfsv3 WRITE requests: > > [ 7670.026741] 57793 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog > > [ 7670.026759] 57794 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog > > [ 7670.026778] 57795 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog > > [ 7670.026797] 57796 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog > > [ 7670.026815] 57797 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog > > [ 7670.026834] 57798 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog > > [ 7670.026853] 57799 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog > > [ 7670.026871] 57800 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog > > [ 7670.026890] 57801 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog > > [ 7670.026909] 57802 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog > > [ 7680.520042] RPC: worker connecting xprt ffff88013e62d800 via tcp to 10.1.6.102 (port 2049) > > [ 7680.520066] RPC: ffff88013e62d800 connect status 99 connected 0 sock state 7 > > [ 7680.520074] RPC: 33550 __rpc_wake_up_task (now 4296812426) > > [ 7680.520079] RPC: 33550 disabling timer > > [ 7680.520084] RPC: 33550 removed from queue ffff88013e62db20 "xprt_pending" > > [ 7680.520089] RPC: __rpc_wake_up_task done > > [ 7680.520094] RPC: 33550 __rpc_execute flags=0x1 > > [ 7680.520098] RPC: 33550 xprt_connect_status: retrying > > [ 7680.520103] RPC: 33550 call_connect_status (status -11) > > [ 7680.520108] RPC: 33550 call_transmit (status 0) > > [ 7680.520112] RPC: 33550 xprt_prepare_transmit > > [ 7680.520118] RPC: 33550 rpc_xdr_encode (status 0) > > [ 7680.520123] RPC: 33550 marshaling UNIX cred ffff88012e002300 > > [ 7680.520130] RPC: 33550 using AUTH_UNIX cred ffff88012e002300 to wrap rpc data > > [ 7680.520136] RPC: 33550 xprt_transmit(32920) > > [ 7680.520145] RPC: xs_tcp_send_request(32920) = -32 > > [ 7680.520151] RPC: xs_tcp_state_change client ffff88013e62d800... > > [ 7680.520156] RPC: state 7 conn 0 dead 0 zapped 1 > I changed that debug to output sk_shutdown too. That has a value of 2 > (IE SEND_SHUTDOWN). Looking at tcp_sendmsg(), I see this: > err = -EPIPE; > if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN)) > goto out_err; > which correlates with the trace "xs_tcp_send_request(32920) = -32". So, > this looks like a problem in the sockets/tcp layer. The rpc layer issues > a shutdown and then reconnects using the same socket. So either > sk_shutdown needs zeroing once the shutdown completes or should be > zeroed on subsequent connect. The latter sounds safer. This patch for 2.6.34.1 fixes the issue: --- /home/company/software/src/linux-2.6.34.1/net/ipv4/tcp_output.c 2010-07-27 08:46:46.917000000 +0100 +++ net/ipv4/tcp_output.c 2010-07-27 09:19:16.000000000 +0100 @@ -2522,6 +2522,13 @@ struct tcp_sock *tp = tcp_sk(sk); __u8 rcv_wscale; + /* clear down any previous shutdown attempts so that + * reconnects on a socket that's been shutdown leave the + * socket in a usable state (otherwise tcp_sendmsg() returns + * -EPIPE). + */ + sk->sk_shutdown = 0; + /* We'll fix this up when we get a response from the other end. * See tcp_input.c:tcp_rcv_state_process case TCP_SYN_SENT. */ As I mentioned in my first message, we first saw this issue in 2.6.32 as supplied by debian (linux-image-2.6.32-5-amd64 Version: 2.6.32-17). It looks like the same patch would fix the problem there too. -- Andy, BlueArc Engineering ^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: nfs client hang 2010-07-27 10:53 ` Andy Chittenden @ 2010-07-27 12:21 ` Eric Dumazet 2010-07-27 12:51 ` Andy Chittenden 2010-07-27 17:28 ` Chuck Lever 0 siblings, 2 replies; 18+ messages in thread From: Eric Dumazet @ 2010-07-27 12:21 UTC (permalink / raw) To: Andy Chittenden Cc: Linux Kernel Mailing List (linux-kernel@vger.kernel.org), Trond Myklebust, netdev Le mardi 27 juillet 2010 à 11:53 +0100, Andy Chittenden a écrit : > > >> IE the client starts a connection and then closes it again without sending data. > > > Once this happens, here's some rpcdebug info for the rpc module using 2.6.34.1 kernel: > > > > > > ... lots of the following nfsv3 WRITE requests: > > > [ 7670.026741] 57793 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog > > > [ 7670.026759] 57794 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog > > > [ 7670.026778] 57795 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog > > > [ 7670.026797] 57796 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog > > > [ 7670.026815] 57797 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog > > > [ 7670.026834] 57798 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog > > > [ 7670.026853] 57799 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog > > > [ 7670.026871] 57800 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog > > > [ 7670.026890] 57801 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog > > > [ 7670.026909] 57802 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog > > > [ 7680.520042] RPC: worker connecting xprt ffff88013e62d800 via tcp to 10.1.6.102 (port 2049) > > > [ 7680.520066] RPC: ffff88013e62d800 connect status 99 connected 0 sock state 7 > > > [ 7680.520074] RPC: 33550 __rpc_wake_up_task (now 4296812426) > > > [ 7680.520079] RPC: 33550 disabling timer > > > [ 7680.520084] RPC: 33550 removed from queue ffff88013e62db20 "xprt_pending" > > > [ 7680.520089] RPC: __rpc_wake_up_task done > > > [ 7680.520094] RPC: 33550 __rpc_execute flags=0x1 > > > [ 7680.520098] RPC: 33550 xprt_connect_status: retrying > > > [ 7680.520103] RPC: 33550 call_connect_status (status -11) > > > [ 7680.520108] RPC: 33550 call_transmit (status 0) > > > [ 7680.520112] RPC: 33550 xprt_prepare_transmit > > > [ 7680.520118] RPC: 33550 rpc_xdr_encode (status 0) > > > [ 7680.520123] RPC: 33550 marshaling UNIX cred ffff88012e002300 > > > [ 7680.520130] RPC: 33550 using AUTH_UNIX cred ffff88012e002300 to wrap rpc data > > > [ 7680.520136] RPC: 33550 xprt_transmit(32920) > > > [ 7680.520145] RPC: xs_tcp_send_request(32920) = -32 > > > [ 7680.520151] RPC: xs_tcp_state_change client ffff88013e62d800... > > > [ 7680.520156] RPC: state 7 conn 0 dead 0 zapped 1 > > > I changed that debug to output sk_shutdown too. That has a value of 2 > > (IE SEND_SHUTDOWN). Looking at tcp_sendmsg(), I see this: > > > err = -EPIPE; > > if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN)) > > goto out_err; > > > which correlates with the trace "xs_tcp_send_request(32920) = -32". So, > > this looks like a problem in the sockets/tcp layer. The rpc layer issues > > a shutdown and then reconnects using the same socket. So either > > sk_shutdown needs zeroing once the shutdown completes or should be > > zeroed on subsequent connect. The latter sounds safer. > > This patch for 2.6.34.1 fixes the issue: > > --- /home/company/software/src/linux-2.6.34.1/net/ipv4/tcp_output.c 2010-07-27 08:46:46.917000000 +0100 > +++ net/ipv4/tcp_output.c 2010-07-27 09:19:16.000000000 +0100 > @@ -2522,6 +2522,13 @@ > struct tcp_sock *tp = tcp_sk(sk); > __u8 rcv_wscale; > > + /* clear down any previous shutdown attempts so that > + * reconnects on a socket that's been shutdown leave the > + * socket in a usable state (otherwise tcp_sendmsg() returns > + * -EPIPE). > + */ > + sk->sk_shutdown = 0; > + > /* We'll fix this up when we get a response from the other end. > * See tcp_input.c:tcp_rcv_state_process case TCP_SYN_SENT. > */ > > As I mentioned in my first message, we first saw this issue in 2.6.32 as supplied by debian (linux-image-2.6.32-5-amd64 Version: 2.6.32-17). It looks like the same patch would fix the problem there too. > CC netdev This reminds me a similar problem we had in the past, fixed with commit 1fdf475a (tcp: tcp_disconnect() should clear window_clamp) But tcp_disconnect() already clears sk->sk_shutdown If NFS calls tcp_disconnect(), then shutdown(), there is a problem. Maybe xs_tcp_shutdown() should make some sanity tests ? Following sequence is legal, and your patch might break it. fd = socket(...); shutdown(fd, SHUT_WR); ... connect(fd, ...); ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: nfs client hang 2010-07-27 12:21 ` Eric Dumazet @ 2010-07-27 12:51 ` Andy Chittenden 2010-07-27 17:28 ` Chuck Lever 1 sibling, 0 replies; 18+ messages in thread From: Andy Chittenden @ 2010-07-27 12:51 UTC (permalink / raw) To: Eric Dumazet Cc: Linux Kernel Mailing List (linux-kernel@vger.kernel.org), Trond Myklebust, netdev On 2010-07-27 13:21, Eric Dumazet wrote: > Le mardi 27 juillet 2010 à 11:53 +0100, Andy Chittenden a écrit : >>>>> IE the client starts a connection and then closes it again without sending data. >>>> Once this happens, here's some rpcdebug info for the rpc module using 2.6.34.1 kernel: >>>> >>>> ... lots of the following nfsv3 WRITE requests: >>>> [ 7670.026741] 57793 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog >>>> [ 7670.026759] 57794 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog >>>> [ 7670.026778] 57795 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog >>>> [ 7670.026797] 57796 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog >>>> [ 7670.026815] 57797 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog >>>> [ 7670.026834] 57798 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog >>>> [ 7670.026853] 57799 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog >>>> [ 7670.026871] 57800 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog >>>> [ 7670.026890] 57801 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog >>>> [ 7670.026909] 57802 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog >>>> [ 7680.520042] RPC: worker connecting xprt ffff88013e62d800 via tcp to 10.1.6.102 (port 2049) >>>> [ 7680.520066] RPC: ffff88013e62d800 connect status 99 connected 0 sock state 7 >>>> [ 7680.520074] RPC: 33550 __rpc_wake_up_task (now 4296812426) >>>> [ 7680.520079] RPC: 33550 disabling timer >>>> [ 7680.520084] RPC: 33550 removed from queue ffff88013e62db20 "xprt_pending" >>>> [ 7680.520089] RPC: __rpc_wake_up_task done >>>> [ 7680.520094] RPC: 33550 __rpc_execute flags=0x1 >>>> [ 7680.520098] RPC: 33550 xprt_connect_status: retrying >>>> [ 7680.520103] RPC: 33550 call_connect_status (status -11) >>>> [ 7680.520108] RPC: 33550 call_transmit (status 0) >>>> [ 7680.520112] RPC: 33550 xprt_prepare_transmit >>>> [ 7680.520118] RPC: 33550 rpc_xdr_encode (status 0) >>>> [ 7680.520123] RPC: 33550 marshaling UNIX cred ffff88012e002300 >>>> [ 7680.520130] RPC: 33550 using AUTH_UNIX cred ffff88012e002300 to wrap rpc data >>>> [ 7680.520136] RPC: 33550 xprt_transmit(32920) >>>> [ 7680.520145] RPC: xs_tcp_send_request(32920) = -32 >>>> [ 7680.520151] RPC: xs_tcp_state_change client ffff88013e62d800... >>>> [ 7680.520156] RPC: state 7 conn 0 dead 0 zapped 1 >>> I changed that debug to output sk_shutdown too. That has a value of 2 >>> (IE SEND_SHUTDOWN). Looking at tcp_sendmsg(), I see this: >>> err = -EPIPE; >>> if (sk->sk_err || (sk->sk_shutdown& SEND_SHUTDOWN)) >>> goto out_err; >>> which correlates with the trace "xs_tcp_send_request(32920) = -32". So, >>> this looks like a problem in the sockets/tcp layer. The rpc layer issues >>> a shutdown and then reconnects using the same socket. So either >>> sk_shutdown needs zeroing once the shutdown completes or should be >>> zeroed on subsequent connect. The latter sounds safer. >> This patch for 2.6.34.1 fixes the issue: >> >> --- /home/company/software/src/linux-2.6.34.1/net/ipv4/tcp_output.c 2010-07-27 08:46:46.917000000 +0100 >> +++ net/ipv4/tcp_output.c 2010-07-27 09:19:16.000000000 +0100 >> @@ -2522,6 +2522,13 @@ >> struct tcp_sock *tp = tcp_sk(sk); >> __u8 rcv_wscale; >> >> + /* clear down any previous shutdown attempts so that >> + * reconnects on a socket that's been shutdown leave the >> + * socket in a usable state (otherwise tcp_sendmsg() returns >> + * -EPIPE). >> + */ >> + sk->sk_shutdown = 0; >> + >> /* We'll fix this up when we get a response from the other end. >> * See tcp_input.c:tcp_rcv_state_process case TCP_SYN_SENT. >> */ >> >> As I mentioned in my first message, we first saw this issue in 2.6.32 as supplied by debian (linux-image-2.6.32-5-amd64 Version: 2.6.32-17). It looks like the same patch would fix the problem there too. >> > CC netdev > > This reminds me a similar problem we had in the past, fixed with commit > 1fdf475a (tcp: tcp_disconnect() should clear window_clamp) > > But tcp_disconnect() already clears sk->sk_shutdown > > If NFS calls tcp_disconnect(), then shutdown(), there is a problem. > > Maybe xs_tcp_shutdown() should make some sanity tests ? > > Following sequence is legal, and your patch might break it. > > fd = socket(...); > shutdown(fd, SHUT_WR); > ... > connect(fd, ...); > > > Thanks for the response. From my reading of the RPC code, because nothing clears the sk_shutdown flag, the RPC code goes into a loop when recovering from packets being lost: shutdown connect send fails so repeat My patch stops the NFS client hang that I was seeing but I'm not an expert on either the socket layer, RPC code or NFS code so I'm happy for someone else to come up with the alternative, correct fix. -- Andy, BlueArc Engineering ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: nfs client hang 2010-07-27 12:21 ` Eric Dumazet @ 2010-07-27 17:28 ` Chuck Lever 2010-07-27 17:28 ` Chuck Lever 1 sibling, 0 replies; 18+ messages in thread From: Chuck Lever @ 2010-07-27 17:28 UTC (permalink / raw) To: Eric Dumazet Cc: Andy Chittenden, Linux Kernel Mailing List (linux-kernel@vger.kernel.org), Trond Myklebust, netdev, Linux NFS Mailing List Add CC: linux-nfs@vger.kernel.org On 07/27/10 08:21 AM, Eric Dumazet wrote: > Le mardi 27 juillet 2010 à 11:53 +0100, Andy Chittenden a écrit : >>>>> IE the client starts a connection and then closes it again without sending data. >>>> Once this happens, here's some rpcdebug info for the rpc module using 2.6.34.1 kernel: >>>> >>>> ... lots of the following nfsv3 WRITE requests: >>>> [ 7670.026741] 57793 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog >>>> [ 7670.026759] 57794 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog >>>> [ 7670.026778] 57795 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog >>>> [ 7670.026797] 57796 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog >>>> [ 7670.026815] 57797 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog >>>> [ 7670.026834] 57798 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog >>>> [ 7670.026853] 57799 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog >>>> [ 7670.026871] 57800 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog >>>> [ 7670.026890] 57801 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog >>>> [ 7670.026909] 57802 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog >>>> [ 7680.520042] RPC: worker connecting xprt ffff88013e62d800 via tcp to 10.1.6.102 (port 2049) >>>> [ 7680.520066] RPC: ffff88013e62d800 connect status 99 connected 0 sock state 7 >>>> [ 7680.520074] RPC: 33550 __rpc_wake_up_task (now 4296812426) >>>> [ 7680.520079] RPC: 33550 disabling timer >>>> [ 7680.520084] RPC: 33550 removed from queue ffff88013e62db20 "xprt_pending" >>>> [ 7680.520089] RPC: __rpc_wake_up_task done >>>> [ 7680.520094] RPC: 33550 __rpc_execute flags=0x1 >>>> [ 7680.520098] RPC: 33550 xprt_connect_status: retrying >>>> [ 7680.520103] RPC: 33550 call_connect_status (status -11) >>>> [ 7680.520108] RPC: 33550 call_transmit (status 0) >>>> [ 7680.520112] RPC: 33550 xprt_prepare_transmit >>>> [ 7680.520118] RPC: 33550 rpc_xdr_encode (status 0) >>>> [ 7680.520123] RPC: 33550 marshaling UNIX cred ffff88012e002300 >>>> [ 7680.520130] RPC: 33550 using AUTH_UNIX cred ffff88012e002300 to wrap rpc data >>>> [ 7680.520136] RPC: 33550 xprt_transmit(32920) >>>> [ 7680.520145] RPC: xs_tcp_send_request(32920) = -32 >>>> [ 7680.520151] RPC: xs_tcp_state_change client ffff88013e62d800... >>>> [ 7680.520156] RPC: state 7 conn 0 dead 0 zapped 1 >> >>> I changed that debug to output sk_shutdown too. That has a value of 2 >>> (IE SEND_SHUTDOWN). Looking at tcp_sendmsg(), I see this: >> >>> err = -EPIPE; >>> if (sk->sk_err || (sk->sk_shutdown& SEND_SHUTDOWN)) >>> goto out_err; >> >>> which correlates with the trace "xs_tcp_send_request(32920) = -32". So, >>> this looks like a problem in the sockets/tcp layer. The rpc layer issues >>> a shutdown and then reconnects using the same socket. So either >>> sk_shutdown needs zeroing once the shutdown completes or should be >>> zeroed on subsequent connect. The latter sounds safer. >> This patch for 2.6.34.1 fixes the issue: >> >> --- /home/company/software/src/linux-2.6.34.1/net/ipv4/tcp_output.c 2010-07-27 08:46:46.917000000 +0100 >> +++ net/ipv4/tcp_output.c 2010-07-27 09:19:16.000000000 +0100 >> @@ -2522,6 +2522,13 @@ >> struct tcp_sock *tp = tcp_sk(sk); >> __u8 rcv_wscale; >> >> + /* clear down any previous shutdown attempts so that >> + * reconnects on a socket that's been shutdown leave the >> + * socket in a usable state (otherwise tcp_sendmsg() returns >> + * -EPIPE). >> + */ >> + sk->sk_shutdown = 0; >> + >> /* We'll fix this up when we get a response from the other end. >> * See tcp_input.c:tcp_rcv_state_process case TCP_SYN_SENT. >> */ >> >> As I mentioned in my first message, we first saw this issue in 2.6.32 as supplied by debian (linux-image-2.6.32-5-amd64 Version: 2.6.32-17). It looks like the same patch would fix the problem there too. >> > > CC netdev > > This reminds me a similar problem we had in the past, fixed with commit > 1fdf475a (tcp: tcp_disconnect() should clear window_clamp) > > But tcp_disconnect() already clears sk->sk_shutdown > > If NFS calls tcp_disconnect(), then shutdown(), there is a problem. If tcp_disconnect() was called at some point, I would expect to see a message from xs_error_report() in the debugging output. Perhaps tcp_disconnect() is not being invoked at all? > Maybe xs_tcp_shutdown() should make some sanity tests ? > > Following sequence is legal, and your patch might break it. > > fd = socket(...); > shutdown(fd, SHUT_WR); > ... > connect(fd, ...); I looked closely at some of Andy's debugging output from the linux-kernel mailing list archive. I basically agree that the network layer is returning -EPIPE from tcp_sendmsg(), which the RPC client logic does not expect. But it's not clear to me how it gets into this state. > [ 7728.520042] RPC: worker connecting xprt ffff88013e62d800 via tcp to 10.1.6.102 (port 2049) > [ 7728.520093] RPC: ffff88013e62d800 connect status 115 connected 0 sock state 2 "sock state 2" => sk->sk_state == TCP_SYN_SENT > [ 7728.520884] RPC: xs_tcp_state_change client ffff88013e62d800... > [ 7728.520889] RPC: state 1 conn 0 dead 0 zapped 1 "state 1" => sk->sk_state == TCP_ESTABLISHED RPC client wakes up this RPC task now that the connection is established. > [ 7728.520896] RPC: 33550 __rpc_wake_up_task (now 4296824426) > [ 7728.520900] RPC: 33550 disabling timer > [ 7728.520906] RPC: 33550 removed from queue ffff88013e62db20 "xprt_pending" > [ 7728.520912] RPC: __rpc_wake_up_task done > [ 7728.520932] RPC: 33550 __rpc_execute flags=0x1 > [ 7728.520937] RPC: 33550 xprt_connect_status: retrying > [ 7728.520942] RPC: 33550 call_connect_status (status -11) The awoken RPC task's status is -EAGAIN, which prevents a reconnection attempt. > [ 7728.520947] RPC: 33550 call_transmit (status 0) > [ 7728.520951] RPC: 33550 xprt_prepare_transmit > [ 7728.520957] RPC: 33550 rpc_xdr_encode (status 0) > [ 7728.520962] RPC: 33550 marshaling UNIX cred ffff88012e002300 > [ 7728.520969] RPC: 33550 using AUTH_UNIX cred ffff88012e002300 to wrap rpc data > [ 7728.520976] RPC: 33550 xprt_transmit(32920) RPC client encodes the request and attempts to send it. > [ 7728.520984] RPC: xs_tcp_send_request(32920) = -32 Network layer says -EPIPE, for some reason. RPC client calls kernel_sock_shutdown(SHUT_WR). > [ 7728.520997] RPC: xs_tcp_state_change client ffff88013e62d800... > [ 7728.521007] RPC: state 4 conn 1 dead 0 zapped 1 "state 4" => sk->sk_state == TCP_FIN_WAIT1 The RPC client sets up a linger timeout. > [ 7728.521013] RPC: 33550 call_status (status -32) > [ 7728.521018] RPC: 33550 call_bind (status 0) > [ 7728.521023] RPC: 33550 call_connect xprt ffff88013e62d800 is not connected > [ 7728.521028] RPC: 33550 xprt_connect xprt ffff88013e62d800 is not connected > [ 7728.521035] RPC: 33550 sleep_on(queue "xprt_pending" time 4296824426) > [ 7728.521040] RPC: 33550 added to queue ffff88013e62db20 "xprt_pending" > [ 7728.521045] RPC: 33550 setting alarm for 60000 ms RPC client puts this RPC task to sleep. > [ 7728.521439] RPC: xs_tcp_state_change client ffff88013e62d800... > [ 7728.521444] RPC: state 5 conn 0 dead 0 zapped 1 "state 5" => sk->sk_state == TCP_FIN_WAIT2 > [ 7728.521602] RPC: xs_tcp_state_change client ffff88013e62d800... > [ 7728.521608] RPC: state 7 conn 0 dead 0 zapped 1 > [ 7728.521612] RPC: disconnected transport ffff88013e62d800 "state 7" => sk->sk_state == TCP_CLOSE The RPC client marks the socket closed. and the linger timeout is cancelled. At this point, sk_shutdown should be set to zero, correct? I don't see an xs_error_report() call here, which would confirm that the socket took a trip through tcp_disconnect(). > [ 7728.521617] RPC: 33550 __rpc_wake_up_task (now 4296824426) > [ 7728.521621] RPC: 33550 disabling timer > [ 7728.521626] RPC: 33550 removed from queue ffff88013e62db20 "xprt_pending" > [ 7728.521631] RPC: __rpc_wake_up_task done RPC client wakes up the RPC task. Meanwhile... > [ 7728.521636] RPC: xs_tcp_state_change client ffff88013e62d800... > [ 7728.521641] RPC: state 7 conn 0 dead 0 zapped 1 > [ 7728.521645] RPC: disconnected transport ffff88013e62d800 > [ 7728.521649] RPC: xs_tcp_data_ready... ... network layer calls closed socket's data_ready method... while the awoken RPC task gets underway. > [ 7728.521662] RPC: 33550 __rpc_execute flags=0x1 > [ 7728.521666] RPC: 33550 xprt_connect_status: retrying > [ 7728.521671] RPC: 33550 call_connect_status (status -11) The awoken RPC task's status is -EAGAIN, which prevents a reconnection attempt, even though there is no established connection. RPC client barrels on to send the request again. > [ 7728.521675] RPC: 33550 call_transmit (status 0) > [ 7728.521679] RPC: 33550 xprt_prepare_transmit > [ 7728.521683] RPC: 33550 rpc_xdr_encode (status 0) > [ 7728.521688] RPC: 33550 marshaling UNIX cred ffff88012e002300 > [ 7728.521694] RPC: 33550 using AUTH_UNIX cred ffff88012e002300 to wrap rpc data > [ 7728.521699] RPC: 33550 xprt_transmit(32920) > [ 7728.521704] RPC: xs_tcp_send_request(32920) = -32 RPC client attempts to send again, gets -EPIPE, and calls kernel_sock_shutdown(SHUT_WR). If there is no connection established, the RPC client expects -ENOTCONN, in which case it will attempt to reconnect here. > [ 7728.521709] RPC: xs_tcp_state_change client ffff88013e62d800... > [ 7728.521714] RPC: state 7 conn 0 dead 0 zapped 1 > [ 7728.521718] RPC: disconnected transport ffff88013e62d800 "state 7" => sk->sk_state == TCP_CLOSE Following this, the RPC client attempts to retransmit the request repeatedly, but the socket remains in the TCP_CLOSE state. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: nfs client hang @ 2010-07-27 17:28 ` Chuck Lever 0 siblings, 0 replies; 18+ messages in thread From: Chuck Lever @ 2010-07-27 17:28 UTC (permalink / raw) To: Eric Dumazet Cc: Andy Chittenden, Linux Kernel Mailing List (linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org), Trond Myklebust, netdev, Linux NFS Mailing List Add CC: linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On 07/27/10 08:21 AM, Eric Dumazet wrote: > Le mardi 27 juillet 2010 à 11:53 +0100, Andy Chittenden a écrit : >>>>> IE the client starts a connection and then closes it again without sending data. >>>> Once this happens, here's some rpcdebug info for the rpc module using 2.6.34.1 kernel: >>>> >>>> ... lots of the following nfsv3 WRITE requests: >>>> [ 7670.026741] 57793 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog >>>> [ 7670.026759] 57794 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog >>>> [ 7670.026778] 57795 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog >>>> [ 7670.026797] 57796 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog >>>> [ 7670.026815] 57797 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog >>>> [ 7670.026834] 57798 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog >>>> [ 7670.026853] 57799 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog >>>> [ 7670.026871] 57800 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog >>>> [ 7670.026890] 57801 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog >>>> [ 7670.026909] 57802 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog >>>> [ 7680.520042] RPC: worker connecting xprt ffff88013e62d800 via tcp to 10.1.6.102 (port 2049) >>>> [ 7680.520066] RPC: ffff88013e62d800 connect status 99 connected 0 sock state 7 >>>> [ 7680.520074] RPC: 33550 __rpc_wake_up_task (now 4296812426) >>>> [ 7680.520079] RPC: 33550 disabling timer >>>> [ 7680.520084] RPC: 33550 removed from queue ffff88013e62db20 "xprt_pending" >>>> [ 7680.520089] RPC: __rpc_wake_up_task done >>>> [ 7680.520094] RPC: 33550 __rpc_execute flags=0x1 >>>> [ 7680.520098] RPC: 33550 xprt_connect_status: retrying >>>> [ 7680.520103] RPC: 33550 call_connect_status (status -11) >>>> [ 7680.520108] RPC: 33550 call_transmit (status 0) >>>> [ 7680.520112] RPC: 33550 xprt_prepare_transmit >>>> [ 7680.520118] RPC: 33550 rpc_xdr_encode (status 0) >>>> [ 7680.520123] RPC: 33550 marshaling UNIX cred ffff88012e002300 >>>> [ 7680.520130] RPC: 33550 using AUTH_UNIX cred ffff88012e002300 to wrap rpc data >>>> [ 7680.520136] RPC: 33550 xprt_transmit(32920) >>>> [ 7680.520145] RPC: xs_tcp_send_request(32920) = -32 >>>> [ 7680.520151] RPC: xs_tcp_state_change client ffff88013e62d800... >>>> [ 7680.520156] RPC: state 7 conn 0 dead 0 zapped 1 >> >>> I changed that debug to output sk_shutdown too. That has a value of 2 >>> (IE SEND_SHUTDOWN). Looking at tcp_sendmsg(), I see this: >> >>> err = -EPIPE; >>> if (sk->sk_err || (sk->sk_shutdown& SEND_SHUTDOWN)) >>> goto out_err; >> >>> which correlates with the trace "xs_tcp_send_request(32920) = -32". So, >>> this looks like a problem in the sockets/tcp layer. The rpc layer issues >>> a shutdown and then reconnects using the same socket. So either >>> sk_shutdown needs zeroing once the shutdown completes or should be >>> zeroed on subsequent connect. The latter sounds safer. >> This patch for 2.6.34.1 fixes the issue: >> >> --- /home/company/software/src/linux-2.6.34.1/net/ipv4/tcp_output.c 2010-07-27 08:46:46.917000000 +0100 >> +++ net/ipv4/tcp_output.c 2010-07-27 09:19:16.000000000 +0100 >> @@ -2522,6 +2522,13 @@ >> struct tcp_sock *tp = tcp_sk(sk); >> __u8 rcv_wscale; >> >> + /* clear down any previous shutdown attempts so that >> + * reconnects on a socket that's been shutdown leave the >> + * socket in a usable state (otherwise tcp_sendmsg() returns >> + * -EPIPE). >> + */ >> + sk->sk_shutdown = 0; >> + >> /* We'll fix this up when we get a response from the other end. >> * See tcp_input.c:tcp_rcv_state_process case TCP_SYN_SENT. >> */ >> >> As I mentioned in my first message, we first saw this issue in 2.6.32 as supplied by debian (linux-image-2.6.32-5-amd64 Version: 2.6.32-17). It looks like the same patch would fix the problem there too. >> > > CC netdev > > This reminds me a similar problem we had in the past, fixed with commit > 1fdf475a (tcp: tcp_disconnect() should clear window_clamp) > > But tcp_disconnect() already clears sk->sk_shutdown > > If NFS calls tcp_disconnect(), then shutdown(), there is a problem. If tcp_disconnect() was called at some point, I would expect to see a message from xs_error_report() in the debugging output. Perhaps tcp_disconnect() is not being invoked at all? > Maybe xs_tcp_shutdown() should make some sanity tests ? > > Following sequence is legal, and your patch might break it. > > fd = socket(...); > shutdown(fd, SHUT_WR); > ... > connect(fd, ...); I looked closely at some of Andy's debugging output from the linux-kernel mailing list archive. I basically agree that the network layer is returning -EPIPE from tcp_sendmsg(), which the RPC client logic does not expect. But it's not clear to me how it gets into this state. > [ 7728.520042] RPC: worker connecting xprt ffff88013e62d800 via tcp to 10.1.6.102 (port 2049) > [ 7728.520093] RPC: ffff88013e62d800 connect status 115 connected 0 sock state 2 "sock state 2" => sk->sk_state == TCP_SYN_SENT > [ 7728.520884] RPC: xs_tcp_state_change client ffff88013e62d800... > [ 7728.520889] RPC: state 1 conn 0 dead 0 zapped 1 "state 1" => sk->sk_state == TCP_ESTABLISHED RPC client wakes up this RPC task now that the connection is established. > [ 7728.520896] RPC: 33550 __rpc_wake_up_task (now 4296824426) > [ 7728.520900] RPC: 33550 disabling timer > [ 7728.520906] RPC: 33550 removed from queue ffff88013e62db20 "xprt_pending" > [ 7728.520912] RPC: __rpc_wake_up_task done > [ 7728.520932] RPC: 33550 __rpc_execute flags=0x1 > [ 7728.520937] RPC: 33550 xprt_connect_status: retrying > [ 7728.520942] RPC: 33550 call_connect_status (status -11) The awoken RPC task's status is -EAGAIN, which prevents a reconnection attempt. > [ 7728.520947] RPC: 33550 call_transmit (status 0) > [ 7728.520951] RPC: 33550 xprt_prepare_transmit > [ 7728.520957] RPC: 33550 rpc_xdr_encode (status 0) > [ 7728.520962] RPC: 33550 marshaling UNIX cred ffff88012e002300 > [ 7728.520969] RPC: 33550 using AUTH_UNIX cred ffff88012e002300 to wrap rpc data > [ 7728.520976] RPC: 33550 xprt_transmit(32920) RPC client encodes the request and attempts to send it. > [ 7728.520984] RPC: xs_tcp_send_request(32920) = -32 Network layer says -EPIPE, for some reason. RPC client calls kernel_sock_shutdown(SHUT_WR). > [ 7728.520997] RPC: xs_tcp_state_change client ffff88013e62d800... > [ 7728.521007] RPC: state 4 conn 1 dead 0 zapped 1 "state 4" => sk->sk_state == TCP_FIN_WAIT1 The RPC client sets up a linger timeout. > [ 7728.521013] RPC: 33550 call_status (status -32) > [ 7728.521018] RPC: 33550 call_bind (status 0) > [ 7728.521023] RPC: 33550 call_connect xprt ffff88013e62d800 is not connected > [ 7728.521028] RPC: 33550 xprt_connect xprt ffff88013e62d800 is not connected > [ 7728.521035] RPC: 33550 sleep_on(queue "xprt_pending" time 4296824426) > [ 7728.521040] RPC: 33550 added to queue ffff88013e62db20 "xprt_pending" > [ 7728.521045] RPC: 33550 setting alarm for 60000 ms RPC client puts this RPC task to sleep. > [ 7728.521439] RPC: xs_tcp_state_change client ffff88013e62d800... > [ 7728.521444] RPC: state 5 conn 0 dead 0 zapped 1 "state 5" => sk->sk_state == TCP_FIN_WAIT2 > [ 7728.521602] RPC: xs_tcp_state_change client ffff88013e62d800... > [ 7728.521608] RPC: state 7 conn 0 dead 0 zapped 1 > [ 7728.521612] RPC: disconnected transport ffff88013e62d800 "state 7" => sk->sk_state == TCP_CLOSE The RPC client marks the socket closed. and the linger timeout is cancelled. At this point, sk_shutdown should be set to zero, correct? I don't see an xs_error_report() call here, which would confirm that the socket took a trip through tcp_disconnect(). > [ 7728.521617] RPC: 33550 __rpc_wake_up_task (now 4296824426) > [ 7728.521621] RPC: 33550 disabling timer > [ 7728.521626] RPC: 33550 removed from queue ffff88013e62db20 "xprt_pending" > [ 7728.521631] RPC: __rpc_wake_up_task done RPC client wakes up the RPC task. Meanwhile... > [ 7728.521636] RPC: xs_tcp_state_change client ffff88013e62d800... > [ 7728.521641] RPC: state 7 conn 0 dead 0 zapped 1 > [ 7728.521645] RPC: disconnected transport ffff88013e62d800 > [ 7728.521649] RPC: xs_tcp_data_ready... ... network layer calls closed socket's data_ready method... while the awoken RPC task gets underway. > [ 7728.521662] RPC: 33550 __rpc_execute flags=0x1 > [ 7728.521666] RPC: 33550 xprt_connect_status: retrying > [ 7728.521671] RPC: 33550 call_connect_status (status -11) The awoken RPC task's status is -EAGAIN, which prevents a reconnection attempt, even though there is no established connection. RPC client barrels on to send the request again. > [ 7728.521675] RPC: 33550 call_transmit (status 0) > [ 7728.521679] RPC: 33550 xprt_prepare_transmit > [ 7728.521683] RPC: 33550 rpc_xdr_encode (status 0) > [ 7728.521688] RPC: 33550 marshaling UNIX cred ffff88012e002300 > [ 7728.521694] RPC: 33550 using AUTH_UNIX cred ffff88012e002300 to wrap rpc data > [ 7728.521699] RPC: 33550 xprt_transmit(32920) > [ 7728.521704] RPC: xs_tcp_send_request(32920) = -32 RPC client attempts to send again, gets -EPIPE, and calls kernel_sock_shutdown(SHUT_WR). If there is no connection established, the RPC client expects -ENOTCONN, in which case it will attempt to reconnect here. > [ 7728.521709] RPC: xs_tcp_state_change client ffff88013e62d800... > [ 7728.521714] RPC: state 7 conn 0 dead 0 zapped 1 > [ 7728.521718] RPC: disconnected transport ffff88013e62d800 "state 7" => sk->sk_state == TCP_CLOSE Following this, the RPC client attempts to retransmit the request repeatedly, but the socket remains in the TCP_CLOSE state. -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: nfs client hang 2010-07-27 17:28 ` Chuck Lever (?) @ 2010-07-28 7:08 ` Andy Chittenden -1 siblings, 0 replies; 18+ messages in thread From: Andy Chittenden @ 2010-07-28 7:08 UTC (permalink / raw) To: Chuck Lever, Eric Dumazet Cc: Linux Kernel Mailing List (linux-kernel@vger.kernel.org), Trond Myklebust, netdev, Linux NFS Mailing List PiBJIGRvbid0IHNlZSBhbiB4c19lcnJvcl9yZXBvcnQoKSBjYWxsIGhlcmUsIHdoaWNoIHdvdWxk IGNvbmZpcm0gdGhhdCB0aGUgc29ja2V0IHRvb2sgYSB0cmlwIHRocm91Z2ggdGNwX2Rpc2Nvbm5l Y3QoKS4NCg0KRnJvbSBteSByZWFkaW5nIG9mIHRjcF9kaXNjb25uZWN0KCksIGl0IGNhbGxzIHNr LT5za19lcnJvcl9yZXBvcnQoc2spIHVuY29uZGl0aW9uYWxseSBzbyBhcyB0aGVyZSdzIG5vIHhz X2Vycm9yX3JlcG9ydCgpLCB0aGF0IHN1cmVseSBtZWFucyB0aGUgZXhhY3Qgb3Bwb3NpdGU6IHRj cF9kaXNjb25uZWN0KCkgd2Fzbid0IGNhbGxlZC4gSWYgaXQncyBub3QgY2FsbGVkLCBza19zaHV0 ZG93biBpcyBub3QgY2xlYXJlZC4gQW5kIG15IHJldmlzZWQgdHJhY2luZyBjb25maXJtZWQgdGhh dCBpdCB3YXMgc2V0IHRvIA0KU0VORF9TSFVURE9XTi4NCg0KLS0gDQpBbmR5LCBCbHVlQXJjIEVu Z2luZWVyaW5nDQoNCg0KLS0tLS1PcmlnaW5hbCBNZXNzYWdlLS0tLS0NCkZyb206IENodWNrIExl dmVyIFttYWlsdG86Y2h1Y2subGV2ZXJAb3JhY2xlLmNvbV0gDQpTZW50OiAyNyBKdWx5IDIwMTAg MTg6MjkNClRvOiBFcmljIER1bWF6ZXQNCkNjOiBBbmR5IENoaXR0ZW5kZW47IExpbnV4IEtlcm5l bCBNYWlsaW5nIExpc3QgKGxpbnV4LWtlcm5lbEB2Z2VyLmtlcm5lbC5vcmcpOyBUcm9uZCBNeWts ZWJ1c3Q7IG5ldGRldjsgTGludXggTkZTIE1haWxpbmcgTGlzdA0KU3ViamVjdDogUmU6IG5mcyBj bGllbnQgaGFuZw0KDQpBZGQgQ0M6IGxpbnV4LW5mc0B2Z2VyLmtlcm5lbC5vcmcNCg0KT24gMDcv MjcvMTAgMDg6MjEgQU0sIEVyaWMgRHVtYXpldCB3cm90ZToNCj4gTGUgbWFyZGkgMjcganVpbGxl dCAyMDEwIMOgIDExOjUzICswMTAwLCBBbmR5IENoaXR0ZW5kZW4gYSDDqWNyaXQgOg0KPj4+Pj4g SUUgdGhlIGNsaWVudCBzdGFydHMgYSBjb25uZWN0aW9uIGFuZCB0aGVuIGNsb3NlcyBpdCBhZ2Fp biB3aXRob3V0IHNlbmRpbmcgZGF0YS4NCj4+Pj4gT25jZSB0aGlzIGhhcHBlbnMsIGhlcmUncyBz b21lIHJwY2RlYnVnIGluZm8gZm9yIHRoZSBycGMgbW9kdWxlIHVzaW5nIDIuNi4zNC4xIGtlcm5l bDoNCj4+Pj4NCj4+Pj4gLi4uIGxvdHMgb2YgdGhlIGZvbGxvd2luZyBuZnN2MyBXUklURSByZXF1 ZXN0czoNCj4+Pj4gWyA3NjcwLjAyNjc0MV0gNTc3OTMgMDAwMSAgICAtMTEgZmZmZjg4MDEyZTMy YjAwMCAgIChudWxsKSAgICAgICAgMCBmZmZmZmZmZmEwM2JlYjEwIG5mc3YzIFdSSVRFIGE6Y2Fs bF9yZXNlcnZlcmVzdWx0IHE6eHBydF9iYWNrbG9nDQo+Pj4+IFsgNzY3MC4wMjY3NTldIDU3Nzk0 IDAwMDEgICAgLTExIGZmZmY4ODAxMmUzMmIwMDAgICAobnVsbCkgICAgICAgIDAgZmZmZmZmZmZh MDNiZWIxMCBuZnN2MyBXUklURSBhOmNhbGxfcmVzZXJ2ZXJlc3VsdCBxOnhwcnRfYmFja2xvZw0K Pj4+PiBbIDc2NzAuMDI2Nzc4XSA1Nzc5NSAwMDAxICAgIC0xMSBmZmZmODgwMTJlMzJiMDAwICAg KG51bGwpICAgICAgICAwIGZmZmZmZmZmYTAzYmViMTAgbmZzdjMgV1JJVEUgYTpjYWxsX3Jlc2Vy dmVyZXN1bHQgcTp4cHJ0X2JhY2tsb2cNCj4+Pj4gWyA3NjcwLjAyNjc5N10gNTc3OTYgMDAwMSAg ICAtMTEgZmZmZjg4MDEyZTMyYjAwMCAgIChudWxsKSAgICAgICAgMCBmZmZmZmZmZmEwM2JlYjEw IG5mc3YzIFdSSVRFIGE6Y2FsbF9yZXNlcnZlcmVzdWx0IHE6eHBydF9iYWNrbG9nDQo+Pj4+IFsg NzY3MC4wMjY4MTVdIDU3Nzk3IDAwMDEgICAgLTExIGZmZmY4ODAxMmUzMmIwMDAgICAobnVsbCkg ICAgICAgIDAgZmZmZmZmZmZhMDNiZWIxMCBuZnN2MyBXUklURSBhOmNhbGxfcmVzZXJ2ZXJlc3Vs dCBxOnhwcnRfYmFja2xvZw0KPj4+PiBbIDc2NzAuMDI2ODM0XSA1Nzc5OCAwMDAxICAgIC0xMSBm ZmZmODgwMTJlMzJiMDAwICAgKG51bGwpICAgICAgICAwIGZmZmZmZmZmYTAzYmViMTAgbmZzdjMg V1JJVEUgYTpjYWxsX3Jlc2VydmVyZXN1bHQgcTp4cHJ0X2JhY2tsb2cNCj4+Pj4gWyA3NjcwLjAy Njg1M10gNTc3OTkgMDAwMSAgICAtMTEgZmZmZjg4MDEyZTMyYjAwMCAgIChudWxsKSAgICAgICAg MCBmZmZmZmZmZmEwM2JlYjEwIG5mc3YzIFdSSVRFIGE6Y2FsbF9yZXNlcnZlcmVzdWx0IHE6eHBy dF9iYWNrbG9nDQo+Pj4+IFsgNzY3MC4wMjY4NzFdIDU3ODAwIDAwMDEgICAgLTExIGZmZmY4ODAx MmUzMmIwMDAgICAobnVsbCkgICAgICAgIDAgZmZmZmZmZmZhMDNiZWIxMCBuZnN2MyBXUklURSBh OmNhbGxfcmVzZXJ2ZXJlc3VsdCBxOnhwcnRfYmFja2xvZw0KPj4+PiBbIDc2NzAuMDI2ODkwXSA1 NzgwMSAwMDAxICAgIC0xMSBmZmZmODgwMTJlMzJiMDAwICAgKG51bGwpICAgICAgICAwIGZmZmZm ZmZmYTAzYmViMTAgbmZzdjMgV1JJVEUgYTpjYWxsX3Jlc2VydmVyZXN1bHQgcTp4cHJ0X2JhY2ts b2cNCj4+Pj4gWyA3NjcwLjAyNjkwOV0gNTc4MDIgMDAwMSAgICAtMTEgZmZmZjg4MDEyZTMyYjAw MCAgIChudWxsKSAgICAgICAgMCBmZmZmZmZmZmEwM2JlYjEwIG5mc3YzIFdSSVRFIGE6Y2FsbF9y ZXNlcnZlcmVzdWx0IHE6eHBydF9iYWNrbG9nDQo+Pj4+IFsgNzY4MC41MjAwNDJdIFJQQzogICAg ICAgd29ya2VyIGNvbm5lY3RpbmcgeHBydCBmZmZmODgwMTNlNjJkODAwIHZpYSB0Y3AgdG8gMTAu MS42LjEwMiAocG9ydCAyMDQ5KQ0KPj4+PiBbIDc2ODAuNTIwMDY2XSBSUEM6ICAgICAgIGZmZmY4 ODAxM2U2MmQ4MDAgY29ubmVjdCBzdGF0dXMgOTkgY29ubmVjdGVkIDAgc29jayBzdGF0ZSA3DQo+ Pj4+IFsgNzY4MC41MjAwNzRdIFJQQzogMzM1NTAgX19ycGNfd2FrZV91cF90YXNrIChub3cgNDI5 NjgxMjQyNikNCj4+Pj4gWyA3NjgwLjUyMDA3OV0gUlBDOiAzMzU1MCBkaXNhYmxpbmcgdGltZXIN Cj4+Pj4gWyA3NjgwLjUyMDA4NF0gUlBDOiAzMzU1MCByZW1vdmVkIGZyb20gcXVldWUgZmZmZjg4 MDEzZTYyZGIyMCAieHBydF9wZW5kaW5nIg0KPj4+PiBbIDc2ODAuNTIwMDg5XSBSUEM6ICAgICAg IF9fcnBjX3dha2VfdXBfdGFzayBkb25lDQo+Pj4+IFsgNzY4MC41MjAwOTRdIFJQQzogMzM1NTAg X19ycGNfZXhlY3V0ZSBmbGFncz0weDENCj4+Pj4gWyA3NjgwLjUyMDA5OF0gUlBDOiAzMzU1MCB4 cHJ0X2Nvbm5lY3Rfc3RhdHVzOiByZXRyeWluZw0KPj4+PiBbIDc2ODAuNTIwMTAzXSBSUEM6IDMz NTUwIGNhbGxfY29ubmVjdF9zdGF0dXMgKHN0YXR1cyAtMTEpDQo+Pj4+IFsgNzY4MC41MjAxMDhd IFJQQzogMzM1NTAgY2FsbF90cmFuc21pdCAoc3RhdHVzIDApDQo+Pj4+IFsgNzY4MC41MjAxMTJd IFJQQzogMzM1NTAgeHBydF9wcmVwYXJlX3RyYW5zbWl0DQo+Pj4+IFsgNzY4MC41MjAxMThdIFJQ QzogMzM1NTAgcnBjX3hkcl9lbmNvZGUgKHN0YXR1cyAwKQ0KPj4+PiBbIDc2ODAuNTIwMTIzXSBS UEM6IDMzNTUwIG1hcnNoYWxpbmcgVU5JWCBjcmVkIGZmZmY4ODAxMmUwMDIzMDANCj4+Pj4gWyA3 NjgwLjUyMDEzMF0gUlBDOiAzMzU1MCB1c2luZyBBVVRIX1VOSVggY3JlZCBmZmZmODgwMTJlMDAy MzAwIHRvIHdyYXAgcnBjIGRhdGENCj4+Pj4gWyA3NjgwLjUyMDEzNl0gUlBDOiAzMzU1MCB4cHJ0 X3RyYW5zbWl0KDMyOTIwKQ0KPj4+PiBbIDc2ODAuNTIwMTQ1XSBSUEM6ICAgICAgIHhzX3RjcF9z ZW5kX3JlcXVlc3QoMzI5MjApID0gLTMyDQo+Pj4+IFsgNzY4MC41MjAxNTFdIFJQQzogICAgICAg eHNfdGNwX3N0YXRlX2NoYW5nZSBjbGllbnQgZmZmZjg4MDEzZTYyZDgwMC4uLg0KPj4+PiBbIDc2 ODAuNTIwMTU2XSBSUEM6ICAgICAgIHN0YXRlIDcgY29ubiAwIGRlYWQgMCB6YXBwZWQgMQ0KPj4N Cj4+PiBJIGNoYW5nZWQgdGhhdCBkZWJ1ZyB0byBvdXRwdXQgc2tfc2h1dGRvd24gdG9vLiBUaGF0 IGhhcyBhIHZhbHVlIG9mIDINCj4+PiAoSUUgU0VORF9TSFVURE9XTikuIExvb2tpbmcgYXQgdGNw X3NlbmRtc2coKSwgSSBzZWUgdGhpczoNCj4+DQo+Pj4gICAgICAgICAgIGVyciA9IC1FUElQRTsN Cj4+PiAgICAgICAgICAgaWYgKHNrLT5za19lcnIgfHwgKHNrLT5za19zaHV0ZG93biYgIFNFTkRf U0hVVERPV04pKQ0KPj4+ICAgICAgICAgICAgICAgICAgIGdvdG8gb3V0X2VycjsNCj4+DQo+Pj4g d2hpY2ggY29ycmVsYXRlcyB3aXRoIHRoZSB0cmFjZSAieHNfdGNwX3NlbmRfcmVxdWVzdCgzMjky MCkgPSAtMzIiLiBTbywNCj4+PiB0aGlzIGxvb2tzIGxpa2UgYSBwcm9ibGVtIGluIHRoZSBzb2Nr ZXRzL3RjcCBsYXllci4gVGhlIHJwYyBsYXllciBpc3N1ZXMNCj4+PiBhIHNodXRkb3duIGFuZCB0 aGVuIHJlY29ubmVjdHMgdXNpbmcgdGhlIHNhbWUgc29ja2V0LiBTbyBlaXRoZXINCj4+PiBza19z aHV0ZG93biBuZWVkcyB6ZXJvaW5nIG9uY2UgdGhlIHNodXRkb3duIGNvbXBsZXRlcyBvciBzaG91 bGQgYmUNCj4+PiB6ZXJvZWQgb24gc3Vic2VxdWVudCBjb25uZWN0LiBUaGUgbGF0dGVyIHNvdW5k cyBzYWZlci4NCg0KPj4gVGhpcyBwYXRjaCBmb3IgMi42LjM0LjEgZml4ZXMgdGhlIGlzc3VlOg0K Pj4NCj4+IC0tLSAvaG9tZS9jb21wYW55L3NvZnR3YXJlL3NyYy9saW51eC0yLjYuMzQuMS9uZXQv aXB2NC90Y3Bfb3V0cHV0LmMgICAgIDIwMTAtMDctMjcgMDg6NDY6NDYuOTE3MDAwMDAwICswMTAw DQo+PiArKysgbmV0L2lwdjQvdGNwX291dHB1dC5jICAgICAgIDIwMTAtMDctMjcgMDk6MTk6MTYu MDAwMDAwMDAwICswMTAwDQo+PiBAQCAtMjUyMiw2ICsyNTIyLDEzIEBADQo+PiAgICAgICAgICBz dHJ1Y3QgdGNwX3NvY2sgKnRwID0gdGNwX3NrKHNrKTsNCj4+ICAgICAgICAgIF9fdTggcmN2X3dz Y2FsZTsNCj4+DQo+PiArICAgICAgIC8qIGNsZWFyIGRvd24gYW55IHByZXZpb3VzIHNodXRkb3du IGF0dGVtcHRzIHNvIHRoYXQNCj4+ICsgICAgICAgICogcmVjb25uZWN0cyBvbiBhIHNvY2tldCB0 aGF0J3MgYmVlbiBzaHV0ZG93biBsZWF2ZSB0aGUNCj4+ICsgICAgICAgICogc29ja2V0IGluIGEg dXNhYmxlIHN0YXRlIChvdGhlcndpc2UgdGNwX3NlbmRtc2coKSByZXR1cm5zDQo+PiArICAgICAg ICAqIC1FUElQRSkuDQo+PiArICAgICAgICAqLw0KPj4gKyAgICAgICBzay0+c2tfc2h1dGRvd24g PSAwOw0KPj4gKw0KPj4gICAgICAgICAgLyogV2UnbGwgZml4IHRoaXMgdXAgd2hlbiB3ZSBnZXQg YSByZXNwb25zZSBmcm9tIHRoZSBvdGhlciBlbmQuDQo+PiAgICAgICAgICAgKiBTZWUgdGNwX2lu cHV0LmM6dGNwX3Jjdl9zdGF0ZV9wcm9jZXNzIGNhc2UgVENQX1NZTl9TRU5ULg0KPj4gICAgICAg ICAgICovDQo+Pg0KPj4gQXMgSSBtZW50aW9uZWQgaW4gbXkgZmlyc3QgbWVzc2FnZSwgd2UgZmly c3Qgc2F3IHRoaXMgaXNzdWUgaW4gMi42LjMyIGFzIHN1cHBsaWVkIGJ5IGRlYmlhbiAobGludXgt aW1hZ2UtMi42LjMyLTUtYW1kNjQgVmVyc2lvbjogMi42LjMyLTE3KS4gSXQgbG9va3MgbGlrZSB0 aGUgc2FtZSBwYXRjaCB3b3VsZCBmaXggdGhlIHByb2JsZW0gdGhlcmUgdG9vLg0KPj4NCj4NCj4g Q0MgbmV0ZGV2DQo+DQo+IFRoaXMgcmVtaW5kcyBtZSBhIHNpbWlsYXIgcHJvYmxlbSB3ZSBoYWQg aW4gdGhlIHBhc3QsIGZpeGVkIHdpdGggY29tbWl0DQo+IDFmZGY0NzVhICh0Y3A6IHRjcF9kaXNj b25uZWN0KCkgc2hvdWxkIGNsZWFyIHdpbmRvd19jbGFtcCkNCj4NCj4gQnV0IHRjcF9kaXNjb25u ZWN0KCkgYWxyZWFkeSBjbGVhcnMgc2stPnNrX3NodXRkb3duDQo+DQo+IElmIE5GUyBjYWxscyB0 Y3BfZGlzY29ubmVjdCgpLCB0aGVuIHNodXRkb3duKCksIHRoZXJlIGlzIGEgcHJvYmxlbS4NCg0K SWYgdGNwX2Rpc2Nvbm5lY3QoKSB3YXMgY2FsbGVkIGF0IHNvbWUgcG9pbnQsIEkgd291bGQgZXhw ZWN0IHRvIHNlZSBhIA0KbWVzc2FnZSBmcm9tIHhzX2Vycm9yX3JlcG9ydCgpIGluIHRoZSBkZWJ1 Z2dpbmcgb3V0cHV0LiAgUGVyaGFwcyANCnRjcF9kaXNjb25uZWN0KCkgaXMgbm90IGJlaW5nIGlu dm9rZWQgYXQgYWxsPw0KDQo+IE1heWJlIHhzX3RjcF9zaHV0ZG93bigpIHNob3VsZCBtYWtlIHNv bWUgc2FuaXR5IHRlc3RzID8NCj4NCj4gRm9sbG93aW5nIHNlcXVlbmNlIGlzIGxlZ2FsLCBhbmQg eW91ciBwYXRjaCBtaWdodCBicmVhayBpdC4NCj4NCj4gZmQgPSBzb2NrZXQoLi4uKTsNCj4gc2h1 dGRvd24oZmQsIFNIVVRfV1IpOw0KPiAuLi4NCj4gY29ubmVjdChmZCwgLi4uKTsNCg0KDQpJIGxv b2tlZCBjbG9zZWx5IGF0IHNvbWUgb2YgQW5keSdzIGRlYnVnZ2luZyBvdXRwdXQgZnJvbSB0aGUg DQpsaW51eC1rZXJuZWwgbWFpbGluZyBsaXN0IGFyY2hpdmUuICBJIGJhc2ljYWxseSBhZ3JlZSB0 aGF0IHRoZSBuZXR3b3JrIA0KbGF5ZXIgaXMgcmV0dXJuaW5nIC1FUElQRSBmcm9tIHRjcF9zZW5k bXNnKCksIHdoaWNoIHRoZSBSUEMgY2xpZW50IGxvZ2ljIA0KZG9lcyBub3QgZXhwZWN0LiBCdXQg aXQncyBub3QgY2xlYXIgdG8gbWUgaG93IGl0IGdldHMgaW50byB0aGlzIHN0YXRlLg0KDQo+IFsg NzcyOC41MjAwNDJdIFJQQzogICAgICAgd29ya2VyIGNvbm5lY3RpbmcgeHBydCBmZmZmODgwMTNl NjJkODAwIHZpYSB0Y3AgdG8gMTAuMS42LjEwMiAocG9ydCAyMDQ5KQ0KID4gWyA3NzI4LjUyMDA5 M10gUlBDOiAgICAgICBmZmZmODgwMTNlNjJkODAwIGNvbm5lY3Qgc3RhdHVzIDExNSANCmNvbm5l Y3RlZCAwIHNvY2sgc3RhdGUgMg0KDQoic29jayBzdGF0ZSAyIiA9PiBzay0+c2tfc3RhdGUgPT0g VENQX1NZTl9TRU5UDQoNCiA+IFsgNzcyOC41MjA4ODRdIFJQQzogICAgICAgeHNfdGNwX3N0YXRl X2NoYW5nZSBjbGllbnQgZmZmZjg4MDEzZTYyZDgwMC4uLg0KID4gWyA3NzI4LjUyMDg4OV0gUlBD OiAgICAgICBzdGF0ZSAxIGNvbm4gMCBkZWFkIDAgemFwcGVkIDENCg0KInN0YXRlIDEiID0+IHNr LT5za19zdGF0ZSA9PSBUQ1BfRVNUQUJMSVNIRUQNCg0KUlBDIGNsaWVudCB3YWtlcyB1cCB0aGlz IFJQQyB0YXNrIG5vdyB0aGF0IHRoZSBjb25uZWN0aW9uIGlzIGVzdGFibGlzaGVkLg0KDQo+IFsg NzcyOC41MjA4OTZdIFJQQzogMzM1NTAgX19ycGNfd2FrZV91cF90YXNrIChub3cgNDI5NjgyNDQy NikNCj4gWyA3NzI4LjUyMDkwMF0gUlBDOiAzMzU1MCBkaXNhYmxpbmcgdGltZXINCj4gWyA3NzI4 LjUyMDkwNl0gUlBDOiAzMzU1MCByZW1vdmVkIGZyb20gcXVldWUgZmZmZjg4MDEzZTYyZGIyMCAi eHBydF9wZW5kaW5nIg0KPiBbIDc3MjguNTIwOTEyXSBSUEM6ICAgICAgIF9fcnBjX3dha2VfdXBf dGFzayBkb25lDQo+IFsgNzcyOC41MjA5MzJdIFJQQzogMzM1NTAgX19ycGNfZXhlY3V0ZSBmbGFn cz0weDENCj4gWyA3NzI4LjUyMDkzN10gUlBDOiAzMzU1MCB4cHJ0X2Nvbm5lY3Rfc3RhdHVzOiBy ZXRyeWluZw0KPiBbIDc3MjguNTIwOTQyXSBSUEM6IDMzNTUwIGNhbGxfY29ubmVjdF9zdGF0dXMg KHN0YXR1cyAtMTEpDQoNClRoZSBhd29rZW4gUlBDIHRhc2sncyBzdGF0dXMgaXMgLUVBR0FJTiwg d2hpY2ggcHJldmVudHMgYSByZWNvbm5lY3Rpb24gDQphdHRlbXB0Lg0KDQo+IFsgNzcyOC41MjA5 NDddIFJQQzogMzM1NTAgY2FsbF90cmFuc21pdCAoc3RhdHVzIDApDQo+IFsgNzcyOC41MjA5NTFd IFJQQzogMzM1NTAgeHBydF9wcmVwYXJlX3RyYW5zbWl0DQo+IFsgNzcyOC41MjA5NTddIFJQQzog MzM1NTAgcnBjX3hkcl9lbmNvZGUgKHN0YXR1cyAwKQ0KPiBbIDc3MjguNTIwOTYyXSBSUEM6IDMz NTUwIG1hcnNoYWxpbmcgVU5JWCBjcmVkIGZmZmY4ODAxMmUwMDIzMDANCj4gWyA3NzI4LjUyMDk2 OV0gUlBDOiAzMzU1MCB1c2luZyBBVVRIX1VOSVggY3JlZCBmZmZmODgwMTJlMDAyMzAwIHRvIHdy YXAgcnBjIGRhdGENCj4gWyA3NzI4LjUyMDk3Nl0gUlBDOiAzMzU1MCB4cHJ0X3RyYW5zbWl0KDMy OTIwKQ0KDQpSUEMgY2xpZW50IGVuY29kZXMgdGhlIHJlcXVlc3QgYW5kIGF0dGVtcHRzIHRvIHNl bmQgaXQuDQoNCj4gWyA3NzI4LjUyMDk4NF0gUlBDOiAgICAgICB4c190Y3Bfc2VuZF9yZXF1ZXN0 KDMyOTIwKSA9IC0zMg0KDQpOZXR3b3JrIGxheWVyIHNheXMgLUVQSVBFLCBmb3Igc29tZSByZWFz b24uICBSUEMgY2xpZW50IGNhbGxzIA0Ka2VybmVsX3NvY2tfc2h1dGRvd24oU0hVVF9XUikuDQoN Cj4gWyA3NzI4LjUyMDk5N10gUlBDOiAgICAgICB4c190Y3Bfc3RhdGVfY2hhbmdlIGNsaWVudCBm ZmZmODgwMTNlNjJkODAwLi4uDQo+IFsgNzcyOC41MjEwMDddIFJQQzogICAgICAgc3RhdGUgNCBj b25uIDEgZGVhZCAwIHphcHBlZCAxDQoNCiJzdGF0ZSA0IiA9PiBzay0+c2tfc3RhdGUgPT0gVENQ X0ZJTl9XQUlUMQ0KDQpUaGUgUlBDIGNsaWVudCBzZXRzIHVwIGEgbGluZ2VyIHRpbWVvdXQuDQoN Cj4gWyA3NzI4LjUyMTAxM10gUlBDOiAzMzU1MCBjYWxsX3N0YXR1cyAoc3RhdHVzIC0zMikNCj4g WyA3NzI4LjUyMTAxOF0gUlBDOiAzMzU1MCBjYWxsX2JpbmQgKHN0YXR1cyAwKQ0KPiBbIDc3Mjgu NTIxMDIzXSBSUEM6IDMzNTUwIGNhbGxfY29ubmVjdCB4cHJ0IGZmZmY4ODAxM2U2MmQ4MDAgaXMg bm90IGNvbm5lY3RlZA0KPiBbIDc3MjguNTIxMDI4XSBSUEM6IDMzNTUwIHhwcnRfY29ubmVjdCB4 cHJ0IGZmZmY4ODAxM2U2MmQ4MDAgaXMgbm90IGNvbm5lY3RlZA0KPiBbIDc3MjguNTIxMDM1XSBS UEM6IDMzNTUwIHNsZWVwX29uKHF1ZXVlICJ4cHJ0X3BlbmRpbmciIHRpbWUgNDI5NjgyNDQyNikN Cj4gWyA3NzI4LjUyMTA0MF0gUlBDOiAzMzU1MCBhZGRlZCB0byBxdWV1ZSBmZmZmODgwMTNlNjJk YjIwICJ4cHJ0X3BlbmRpbmciDQo+IFsgNzcyOC41MjEwNDVdIFJQQzogMzM1NTAgc2V0dGluZyBh bGFybSBmb3IgNjAwMDAgbXMNCg0KUlBDIGNsaWVudCBwdXRzIHRoaXMgUlBDIHRhc2sgdG8gc2xl ZXAuDQoNCj4gWyA3NzI4LjUyMTQzOV0gUlBDOiAgICAgICB4c190Y3Bfc3RhdGVfY2hhbmdlIGNs aWVudCBmZmZmODgwMTNlNjJkODAwLi4uDQo+IFsgNzcyOC41MjE0NDRdIFJQQzogICAgICAgc3Rh dGUgNSBjb25uIDAgZGVhZCAwIHphcHBlZCAxDQoNCiJzdGF0ZSA1IiA9PiBzay0+c2tfc3RhdGUg PT0gVENQX0ZJTl9XQUlUMg0KDQo+IFsgNzcyOC41MjE2MDJdIFJQQzogICAgICAgeHNfdGNwX3N0 YXRlX2NoYW5nZSBjbGllbnQgZmZmZjg4MDEzZTYyZDgwMC4uLg0KPiBbIDc3MjguNTIxNjA4XSBS UEM6ICAgICAgIHN0YXRlIDcgY29ubiAwIGRlYWQgMCB6YXBwZWQgMQ0KPiBbIDc3MjguNTIxNjEy XSBSUEM6ICAgICAgIGRpc2Nvbm5lY3RlZCB0cmFuc3BvcnQgZmZmZjg4MDEzZTYyZDgwMA0KDQoi c3RhdGUgNyIgPT4gc2stPnNrX3N0YXRlID09IFRDUF9DTE9TRQ0KDQpUaGUgUlBDIGNsaWVudCBt YXJrcyB0aGUgc29ja2V0IGNsb3NlZC4gYW5kIHRoZSBsaW5nZXIgdGltZW91dCBpcyANCmNhbmNl bGxlZC4gIEF0IHRoaXMgcG9pbnQsIHNrX3NodXRkb3duIHNob3VsZCBiZSBzZXQgdG8gemVybywg Y29ycmVjdD8gDQpJIGRvbid0IHNlZSBhbiB4c19lcnJvcl9yZXBvcnQoKSBjYWxsIGhlcmUsIHdo aWNoIHdvdWxkIGNvbmZpcm0gdGhhdCB0aGUgDQpzb2NrZXQgdG9vayBhIHRyaXAgdGhyb3VnaCB0 Y3BfZGlzY29ubmVjdCgpLg0KDQo+IFsgNzcyOC41MjE2MTddIFJQQzogMzM1NTAgX19ycGNfd2Fr ZV91cF90YXNrIChub3cgNDI5NjgyNDQyNikNCj4gWyA3NzI4LjUyMTYyMV0gUlBDOiAzMzU1MCBk aXNhYmxpbmcgdGltZXINCj4gWyA3NzI4LjUyMTYyNl0gUlBDOiAzMzU1MCByZW1vdmVkIGZyb20g cXVldWUgZmZmZjg4MDEzZTYyZGIyMCAieHBydF9wZW5kaW5nIg0KPiBbIDc3MjguNTIxNjMxXSBS UEM6ICAgICAgIF9fcnBjX3dha2VfdXBfdGFzayBkb25lDQoNClJQQyBjbGllbnQgd2FrZXMgdXAg dGhlIFJQQyB0YXNrLiAgTWVhbndoaWxlLi4uDQoNCj4gWyA3NzI4LjUyMTYzNl0gUlBDOiAgICAg ICB4c190Y3Bfc3RhdGVfY2hhbmdlIGNsaWVudCBmZmZmODgwMTNlNjJkODAwLi4uDQo+IFsgNzcy OC41MjE2NDFdIFJQQzogICAgICAgc3RhdGUgNyBjb25uIDAgZGVhZCAwIHphcHBlZCAxDQo+IFsg NzcyOC41MjE2NDVdIFJQQzogICAgICAgZGlzY29ubmVjdGVkIHRyYW5zcG9ydCBmZmZmODgwMTNl NjJkODAwDQo+IFsgNzcyOC41MjE2NDldIFJQQzogICAgICAgeHNfdGNwX2RhdGFfcmVhZHkuLi4N Cg0KLi4uIG5ldHdvcmsgbGF5ZXIgY2FsbHMgY2xvc2VkIHNvY2tldCdzIGRhdGFfcmVhZHkgbWV0 aG9kLi4uIHdoaWxlIHRoZSANCmF3b2tlbiBSUEMgdGFzayBnZXRzIHVuZGVyd2F5Lg0KDQo+IFsg NzcyOC41MjE2NjJdIFJQQzogMzM1NTAgX19ycGNfZXhlY3V0ZSBmbGFncz0weDENCj4gWyA3NzI4 LjUyMTY2Nl0gUlBDOiAzMzU1MCB4cHJ0X2Nvbm5lY3Rfc3RhdHVzOiByZXRyeWluZw0KPiBbIDc3 MjguNTIxNjcxXSBSUEM6IDMzNTUwIGNhbGxfY29ubmVjdF9zdGF0dXMgKHN0YXR1cyAtMTEpDQoN ClRoZSBhd29rZW4gUlBDIHRhc2sncyBzdGF0dXMgaXMgLUVBR0FJTiwgd2hpY2ggcHJldmVudHMg YSByZWNvbm5lY3Rpb24gDQphdHRlbXB0LCBldmVuIHRob3VnaCB0aGVyZSBpcyBubyBlc3RhYmxp c2hlZCBjb25uZWN0aW9uLg0KDQpSUEMgY2xpZW50IGJhcnJlbHMgb24gdG8gc2VuZCB0aGUgcmVx dWVzdCBhZ2Fpbi4NCg0KPiBbIDc3MjguNTIxNjc1XSBSUEM6IDMzNTUwIGNhbGxfdHJhbnNtaXQg KHN0YXR1cyAwKQ0KPiBbIDc3MjguNTIxNjc5XSBSUEM6IDMzNTUwIHhwcnRfcHJlcGFyZV90cmFu c21pdA0KPiBbIDc3MjguNTIxNjgzXSBSUEM6IDMzNTUwIHJwY194ZHJfZW5jb2RlIChzdGF0dXMg MCkNCj4gWyA3NzI4LjUyMTY4OF0gUlBDOiAzMzU1MCBtYXJzaGFsaW5nIFVOSVggY3JlZCBmZmZm ODgwMTJlMDAyMzAwDQo+IFsgNzcyOC41MjE2OTRdIFJQQzogMzM1NTAgdXNpbmcgQVVUSF9VTklY IGNyZWQgZmZmZjg4MDEyZTAwMjMwMCB0byB3cmFwIHJwYyBkYXRhDQo+IFsgNzcyOC41MjE2OTld IFJQQzogMzM1NTAgeHBydF90cmFuc21pdCgzMjkyMCkNCj4gWyA3NzI4LjUyMTcwNF0gUlBDOiAg ICAgICB4c190Y3Bfc2VuZF9yZXF1ZXN0KDMyOTIwKSA9IC0zMg0KDQpSUEMgY2xpZW50IGF0dGVt cHRzIHRvIHNlbmQgYWdhaW4sIGdldHMgLUVQSVBFLCBhbmQgY2FsbHMgDQprZXJuZWxfc29ja19z aHV0ZG93bihTSFVUX1dSKS4gIElmIHRoZXJlIGlzIG5vIGNvbm5lY3Rpb24gZXN0YWJsaXNoZWQs IA0KdGhlIFJQQyBjbGllbnQgZXhwZWN0cyAtRU5PVENPTk4sIGluIHdoaWNoIGNhc2UgaXQgd2ls bCBhdHRlbXB0IHRvIA0KcmVjb25uZWN0IGhlcmUuDQoNCj4gWyA3NzI4LjUyMTcwOV0gUlBDOiAg ICAgICB4c190Y3Bfc3RhdGVfY2hhbmdlIGNsaWVudCBmZmZmODgwMTNlNjJkODAwLi4uDQo+IFsg NzcyOC41MjE3MTRdIFJQQzogICAgICAgc3RhdGUgNyBjb25uIDAgZGVhZCAwIHphcHBlZCAxDQo+ IFsgNzcyOC41MjE3MThdIFJQQzogICAgICAgZGlzY29ubmVjdGVkIHRyYW5zcG9ydCBmZmZmODgw MTNlNjJkODAwDQoNCiJzdGF0ZSA3IiA9PiBzay0+c2tfc3RhdGUgPT0gVENQX0NMT1NFDQoNCkZv bGxvd2luZyB0aGlzLCB0aGUgUlBDIGNsaWVudCBhdHRlbXB0cyB0byByZXRyYW5zbWl0IHRoZSBy ZXF1ZXN0IA0KcmVwZWF0ZWRseSwgYnV0IHRoZSBzb2NrZXQgcmVtYWlucyBpbiB0aGUgVENQX0NM T1NFIHN0YXRlLg0KTu+/ve+/ve+/ve+/ve+/vXLvv73vv71577+977+977+9Yu+/vVjvv73vv73H p3bvv71e77+9Kd66ey5u77+9K++/ve+/ve+/ve+/vXvvv73vv73vv70i77+977+9Xm7vv71y77+9 77+977+9eu+/vRrvv73vv71o77+977+977+977+9Ju+/ve+/vR7vv71H77+977+977+9aO+/vQMo 77+96ZqO77+93aJqIu+/ve+/vRrvv70bbe+/ve+/ve+/ve+/ve+/vXrvv73elu+/ve+/ve+/vWbv v73vv73vv71o77+977+977+9fu+/vW3vv70= ^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: nfs client hang @ 2010-07-28 7:08 ` Andy Chittenden 0 siblings, 0 replies; 18+ messages in thread From: Andy Chittenden @ 2010-07-28 7:08 UTC (permalink / raw) To: Chuck Lever, Eric Dumazet Cc: Linux Kernel Mailing List (linux-kernel@vger.kernel.org), Trond Myklebust, netdev, Linux NFS Mailing List > I don't see an xs_error_report() call here, which would confirm that the socket took a trip through tcp_disconnect(). From my reading of tcp_disconnect(), it calls sk->sk_error_report(sk) unconditionally so as there's no xs_error_report(), that surely means the exact opposite: tcp_disconnect() wasn't called. If it's not called, sk_shutdown is not cleared. And my revised tracing confirmed that it was set to SEND_SHUTDOWN. -- Andy, BlueArc Engineering -----Original Message----- From: Chuck Lever [mailto:chuck.lever@oracle.com] Sent: 27 July 2010 18:29 To: Eric Dumazet Cc: Andy Chittenden; Linux Kernel Mailing List (linux-kernel@vger.kernel.org); Trond Myklebust; netdev; Linux NFS Mailing List Subject: Re: nfs client hang Add CC: linux-nfs@vger.kernel.org On 07/27/10 08:21 AM, Eric Dumazet wrote: > Le mardi 27 juillet 2010 à 11:53 +0100, Andy Chittenden a écrit : >>>>> IE the client starts a connection and then closes it again without sending data. >>>> Once this happens, here's some rpcdebug info for the rpc module using 2.6.34.1 kernel: >>>> >>>> ... lots of the following nfsv3 WRITE requests: >>>> [ 7670.026741] 57793 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog >>>> [ 7670.026759] 57794 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog >>>> [ 7670.026778] 57795 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog >>>> [ 7670.026797] 57796 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog >>>> [ 7670.026815] 57797 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog >>>> [ 7670.026834] 57798 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog >>>> [ 7670.026853] 57799 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog >>>> [ 7670.026871] 57800 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog >>>> [ 7670.026890] 57801 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog >>>> [ 7670.026909] 57802 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog >>>> [ 7680.520042] RPC: worker connecting xprt ffff88013e62d800 via tcp to 10.1.6.102 (port 2049) >>>> [ 7680.520066] RPC: ffff88013e62d800 connect status 99 connected 0 sock state 7 >>>> [ 7680.520074] RPC: 33550 __rpc_wake_up_task (now 4296812426) >>>> [ 7680.520079] RPC: 33550 disabling timer >>>> [ 7680.520084] RPC: 33550 removed from queue ffff88013e62db20 "xprt_pending" >>>> [ 7680.520089] RPC: __rpc_wake_up_task done >>>> [ 7680.520094] RPC: 33550 __rpc_execute flags=0x1 >>>> [ 7680.520098] RPC: 33550 xprt_connect_status: retrying >>>> [ 7680.520103] RPC: 33550 call_connect_status (status -11) >>>> [ 7680.520108] RPC: 33550 call_transmit (status 0) >>>> [ 7680.520112] RPC: 33550 xprt_prepare_transmit >>>> [ 7680.520118] RPC: 33550 rpc_xdr_encode (status 0) >>>> [ 7680.520123] RPC: 33550 marshaling UNIX cred ffff88012e002300 >>>> [ 7680.520130] RPC: 33550 using AUTH_UNIX cred ffff88012e002300 to wrap rpc data >>>> [ 7680.520136] RPC: 33550 xprt_transmit(32920) >>>> [ 7680.520145] RPC: xs_tcp_send_request(32920) = -32 >>>> [ 7680.520151] RPC: xs_tcp_state_change client ffff88013e62d800... >>>> [ 7680.520156] RPC: state 7 conn 0 dead 0 zapped 1 >> >>> I changed that debug to output sk_shutdown too. That has a value of 2 >>> (IE SEND_SHUTDOWN). Looking at tcp_sendmsg(), I see this: >> >>> err = -EPIPE; >>> if (sk->sk_err || (sk->sk_shutdown& SEND_SHUTDOWN)) >>> goto out_err; >> >>> which correlates with the trace "xs_tcp_send_request(32920) = -32". So, >>> this looks like a problem in the sockets/tcp layer. The rpc layer issues >>> a shutdown and then reconnects using the same socket. So either >>> sk_shutdown needs zeroing once the shutdown completes or should be >>> zeroed on subsequent connect. The latter sounds safer. >> This patch for 2.6.34.1 fixes the issue: >> >> --- /home/company/software/src/linux-2.6.34.1/net/ipv4/tcp_output.c 2010-07-27 08:46:46.917000000 +0100 >> +++ net/ipv4/tcp_output.c 2010-07-27 09:19:16.000000000 +0100 >> @@ -2522,6 +2522,13 @@ >> struct tcp_sock *tp = tcp_sk(sk); >> __u8 rcv_wscale; >> >> + /* clear down any previous shutdown attempts so that >> + * reconnects on a socket that's been shutdown leave the >> + * socket in a usable state (otherwise tcp_sendmsg() returns >> + * -EPIPE). >> + */ >> + sk->sk_shutdown = 0; >> + >> /* We'll fix this up when we get a response from the other end. >> * See tcp_input.c:tcp_rcv_state_process case TCP_SYN_SENT. >> */ >> >> As I mentioned in my first message, we first saw this issue in 2.6.32 as supplied by debian (linux-image-2.6.32-5-amd64 Version: 2.6.32-17). It looks like the same patch would fix the problem there too. >> > > CC netdev > > This reminds me a similar problem we had in the past, fixed with commit > 1fdf475a (tcp: tcp_disconnect() should clear window_clamp) > > But tcp_disconnect() already clears sk->sk_shutdown > > If NFS calls tcp_disconnect(), then shutdown(), there is a problem. If tcp_disconnect() was called at some point, I would expect to see a message from xs_error_report() in the debugging output. Perhaps tcp_disconnect() is not being invoked at all? > Maybe xs_tcp_shutdown() should make some sanity tests ? > > Following sequence is legal, and your patch might break it. > > fd = socket(...); > shutdown(fd, SHUT_WR); > ... > connect(fd, ...); I looked closely at some of Andy's debugging output from the linux-kernel mailing list archive. I basically agree that the network layer is returning -EPIPE from tcp_sendmsg(), which the RPC client logic does not expect. But it's not clear to me how it gets into this state. > [ 7728.520042] RPC: worker connecting xprt ffff88013e62d800 via tcp to 10.1.6.102 (port 2049) > [ 7728.520093] RPC: ffff88013e62d800 connect status 115 connected 0 sock state 2 "sock state 2" => sk->sk_state == TCP_SYN_SENT > [ 7728.520884] RPC: xs_tcp_state_change client ffff88013e62d800... > [ 7728.520889] RPC: state 1 conn 0 dead 0 zapped 1 "state 1" => sk->sk_state == TCP_ESTABLISHED RPC client wakes up this RPC task now that the connection is established. > [ 7728.520896] RPC: 33550 __rpc_wake_up_task (now 4296824426) > [ 7728.520900] RPC: 33550 disabling timer > [ 7728.520906] RPC: 33550 removed from queue ffff88013e62db20 "xprt_pending" > [ 7728.520912] RPC: __rpc_wake_up_task done > [ 7728.520932] RPC: 33550 __rpc_execute flags=0x1 > [ 7728.520937] RPC: 33550 xprt_connect_status: retrying > [ 7728.520942] RPC: 33550 call_connect_status (status -11) The awoken RPC task's status is -EAGAIN, which prevents a reconnection attempt. > [ 7728.520947] RPC: 33550 call_transmit (status 0) > [ 7728.520951] RPC: 33550 xprt_prepare_transmit > [ 7728.520957] RPC: 33550 rpc_xdr_encode (status 0) > [ 7728.520962] RPC: 33550 marshaling UNIX cred ffff88012e002300 > [ 7728.520969] RPC: 33550 using AUTH_UNIX cred ffff88012e002300 to wrap rpc data > [ 7728.520976] RPC: 33550 xprt_transmit(32920) RPC client encodes the request and attempts to send it. > [ 7728.520984] RPC: xs_tcp_send_request(32920) = -32 Network layer says -EPIPE, for some reason. RPC client calls kernel_sock_shutdown(SHUT_WR). > [ 7728.520997] RPC: xs_tcp_state_change client ffff88013e62d800... > [ 7728.521007] RPC: state 4 conn 1 dead 0 zapped 1 "state 4" => sk->sk_state == TCP_FIN_WAIT1 The RPC client sets up a linger timeout. > [ 7728.521013] RPC: 33550 call_status (status -32) > [ 7728.521018] RPC: 33550 call_bind (status 0) > [ 7728.521023] RPC: 33550 call_connect xprt ffff88013e62d800 is not connected > [ 7728.521028] RPC: 33550 xprt_connect xprt ffff88013e62d800 is not connected > [ 7728.521035] RPC: 33550 sleep_on(queue "xprt_pending" time 4296824426) > [ 7728.521040] RPC: 33550 added to queue ffff88013e62db20 "xprt_pending" > [ 7728.521045] RPC: 33550 setting alarm for 60000 ms RPC client puts this RPC task to sleep. > [ 7728.521439] RPC: xs_tcp_state_change client ffff88013e62d800... > [ 7728.521444] RPC: state 5 conn 0 dead 0 zapped 1 "state 5" => sk->sk_state == TCP_FIN_WAIT2 > [ 7728.521602] RPC: xs_tcp_state_change client ffff88013e62d800... > [ 7728.521608] RPC: state 7 conn 0 dead 0 zapped 1 > [ 7728.521612] RPC: disconnected transport ffff88013e62d800 "state 7" => sk->sk_state == TCP_CLOSE The RPC client marks the socket closed. and the linger timeout is cancelled. At this point, sk_shutdown should be set to zero, correct? I don't see an xs_error_report() call here, which would confirm that the socket took a trip through tcp_disconnect(). > [ 7728.521617] RPC: 33550 __rpc_wake_up_task (now 4296824426) > [ 7728.521621] RPC: 33550 disabling timer > [ 7728.521626] RPC: 33550 removed from queue ffff88013e62db20 "xprt_pending" > [ 7728.521631] RPC: __rpc_wake_up_task done RPC client wakes up the RPC task. Meanwhile... > [ 7728.521636] RPC: xs_tcp_state_change client ffff88013e62d800... > [ 7728.521641] RPC: state 7 conn 0 dead 0 zapped 1 > [ 7728.521645] RPC: disconnected transport ffff88013e62d800 > [ 7728.521649] RPC: xs_tcp_data_ready... ... network layer calls closed socket's data_ready method... while the awoken RPC task gets underway. > [ 7728.521662] RPC: 33550 __rpc_execute flags=0x1 > [ 7728.521666] RPC: 33550 xprt_connect_status: retrying > [ 7728.521671] RPC: 33550 call_connect_status (status -11) The awoken RPC task's status is -EAGAIN, which prevents a reconnection attempt, even though there is no established connection. RPC client barrels on to send the request again. > [ 7728.521675] RPC: 33550 call_transmit (status 0) > [ 7728.521679] RPC: 33550 xprt_prepare_transmit > [ 7728.521683] RPC: 33550 rpc_xdr_encode (status 0) > [ 7728.521688] RPC: 33550 marshaling UNIX cred ffff88012e002300 > [ 7728.521694] RPC: 33550 using AUTH_UNIX cred ffff88012e002300 to wrap rpc data > [ 7728.521699] RPC: 33550 xprt_transmit(32920) > [ 7728.521704] RPC: xs_tcp_send_request(32920) = -32 RPC client attempts to send again, gets -EPIPE, and calls kernel_sock_shutdown(SHUT_WR). If there is no connection established, the RPC client expects -ENOTCONN, in which case it will attempt to reconnect here. > [ 7728.521709] RPC: xs_tcp_state_change client ffff88013e62d800... > [ 7728.521714] RPC: state 7 conn 0 dead 0 zapped 1 > [ 7728.521718] RPC: disconnected transport ffff88013e62d800 "state 7" => sk->sk_state == TCP_CLOSE Following this, the RPC client attempts to retransmit the request repeatedly, but the socket remains in the TCP_CLOSE state. ^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: nfs client hang @ 2010-07-28 7:08 ` Andy Chittenden 0 siblings, 0 replies; 18+ messages in thread From: Andy Chittenden @ 2010-07-28 7:08 UTC (permalink / raw) To: Chuck Lever, Eric Dumazet Cc: Linux Kernel Mailing List (linux-kernel@vger.kernel.org), Trond Myklebust, netdev, Linux NFS Mailing List [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #1: Type: text/plain; charset="utf-8", Size: 11398 bytes --] > I don't see an xs_error_report() call here, which would confirm that the socket took a trip through tcp_disconnect(). >From my reading of tcp_disconnect(), it calls sk->sk_error_report(sk) unconditionally so as there's no xs_error_report(), that surely means the exact opposite: tcp_disconnect() wasn't called. If it's not called, sk_shutdown is not cleared. And my revised tracing confirmed that it was set to SEND_SHUTDOWN. -- Andy, BlueArc Engineering -----Original Message----- From: Chuck Lever [mailto:chuck.lever@oracle.com] Sent: 27 July 2010 18:29 To: Eric Dumazet Cc: Andy Chittenden; Linux Kernel Mailing List (linux-kernel@vger.kernel.org); Trond Myklebust; netdev; Linux NFS Mailing List Subject: Re: nfs client hang Add CC: linux-nfs@vger.kernel.org On 07/27/10 08:21 AM, Eric Dumazet wrote: > Le mardi 27 juillet 2010 à 11:53 +0100, Andy Chittenden a écrit : >>>>> IE the client starts a connection and then closes it again without sending data. >>>> Once this happens, here's some rpcdebug info for the rpc module using 2.6.34.1 kernel: >>>> >>>> ... lots of the following nfsv3 WRITE requests: >>>> [ 7670.026741] 57793 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog >>>> [ 7670.026759] 57794 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog >>>> [ 7670.026778] 57795 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog >>>> [ 7670.026797] 57796 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog >>>> [ 7670.026815] 57797 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog >>>> [ 7670.026834] 57798 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog >>>> [ 7670.026853] 57799 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog >>>> [ 7670.026871] 57800 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog >>>> [ 7670.026890] 57801 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog >>>> [ 7670.026909] 57802 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog >>>> [ 7680.520042] RPC: worker connecting xprt ffff88013e62d800 via tcp to 10.1.6.102 (port 2049) >>>> [ 7680.520066] RPC: ffff88013e62d800 connect status 99 connected 0 sock state 7 >>>> [ 7680.520074] RPC: 33550 __rpc_wake_up_task (now 4296812426) >>>> [ 7680.520079] RPC: 33550 disabling timer >>>> [ 7680.520084] RPC: 33550 removed from queue ffff88013e62db20 "xprt_pending" >>>> [ 7680.520089] RPC: __rpc_wake_up_task done >>>> [ 7680.520094] RPC: 33550 __rpc_execute flags=0x1 >>>> [ 7680.520098] RPC: 33550 xprt_connect_status: retrying >>>> [ 7680.520103] RPC: 33550 call_connect_status (status -11) >>>> [ 7680.520108] RPC: 33550 call_transmit (status 0) >>>> [ 7680.520112] RPC: 33550 xprt_prepare_transmit >>>> [ 7680.520118] RPC: 33550 rpc_xdr_encode (status 0) >>>> [ 7680.520123] RPC: 33550 marshaling UNIX cred ffff88012e002300 >>>> [ 7680.520130] RPC: 33550 using AUTH_UNIX cred ffff88012e002300 to wrap rpc data >>>> [ 7680.520136] RPC: 33550 xprt_transmit(32920) >>>> [ 7680.520145] RPC: xs_tcp_send_request(32920) = -32 >>>> [ 7680.520151] RPC: xs_tcp_state_change client ffff88013e62d800... >>>> [ 7680.520156] RPC: state 7 conn 0 dead 0 zapped 1 >> >>> I changed that debug to output sk_shutdown too. That has a value of 2 >>> (IE SEND_SHUTDOWN). Looking at tcp_sendmsg(), I see this: >> >>> err = -EPIPE; >>> if (sk->sk_err || (sk->sk_shutdown& SEND_SHUTDOWN)) >>> goto out_err; >> >>> which correlates with the trace "xs_tcp_send_request(32920) = -32". So, >>> this looks like a problem in the sockets/tcp layer. The rpc layer issues >>> a shutdown and then reconnects using the same socket. So either >>> sk_shutdown needs zeroing once the shutdown completes or should be >>> zeroed on subsequent connect. The latter sounds safer. >> This patch for 2.6.34.1 fixes the issue: >> >> --- /home/company/software/src/linux-2.6.34.1/net/ipv4/tcp_output.c 2010-07-27 08:46:46.917000000 +0100 >> +++ net/ipv4/tcp_output.c 2010-07-27 09:19:16.000000000 +0100 >> @@ -2522,6 +2522,13 @@ >> struct tcp_sock *tp = tcp_sk(sk); >> __u8 rcv_wscale; >> >> + /* clear down any previous shutdown attempts so that >> + * reconnects on a socket that's been shutdown leave the >> + * socket in a usable state (otherwise tcp_sendmsg() returns >> + * -EPIPE). >> + */ >> + sk->sk_shutdown = 0; >> + >> /* We'll fix this up when we get a response from the other end. >> * See tcp_input.c:tcp_rcv_state_process case TCP_SYN_SENT. >> */ >> >> As I mentioned in my first message, we first saw this issue in 2.6.32 as supplied by debian (linux-image-2.6.32-5-amd64 Version: 2.6.32-17). It looks like the same patch would fix the problem there too. >> > > CC netdev > > This reminds me a similar problem we had in the past, fixed with commit > 1fdf475a (tcp: tcp_disconnect() should clear window_clamp) > > But tcp_disconnect() already clears sk->sk_shutdown > > If NFS calls tcp_disconnect(), then shutdown(), there is a problem. If tcp_disconnect() was called at some point, I would expect to see a message from xs_error_report() in the debugging output. Perhaps tcp_disconnect() is not being invoked at all? > Maybe xs_tcp_shutdown() should make some sanity tests ? > > Following sequence is legal, and your patch might break it. > > fd = socket(...); > shutdown(fd, SHUT_WR); > ... > connect(fd, ...); I looked closely at some of Andy's debugging output from the linux-kernel mailing list archive. I basically agree that the network layer is returning -EPIPE from tcp_sendmsg(), which the RPC client logic does not expect. But it's not clear to me how it gets into this state. > [ 7728.520042] RPC: worker connecting xprt ffff88013e62d800 via tcp to 10.1.6.102 (port 2049) > [ 7728.520093] RPC: ffff88013e62d800 connect status 115 connected 0 sock state 2 "sock state 2" => sk->sk_state == TCP_SYN_SENT > [ 7728.520884] RPC: xs_tcp_state_change client ffff88013e62d800... > [ 7728.520889] RPC: state 1 conn 0 dead 0 zapped 1 "state 1" => sk->sk_state == TCP_ESTABLISHED RPC client wakes up this RPC task now that the connection is established. > [ 7728.520896] RPC: 33550 __rpc_wake_up_task (now 4296824426) > [ 7728.520900] RPC: 33550 disabling timer > [ 7728.520906] RPC: 33550 removed from queue ffff88013e62db20 "xprt_pending" > [ 7728.520912] RPC: __rpc_wake_up_task done > [ 7728.520932] RPC: 33550 __rpc_execute flags=0x1 > [ 7728.520937] RPC: 33550 xprt_connect_status: retrying > [ 7728.520942] RPC: 33550 call_connect_status (status -11) The awoken RPC task's status is -EAGAIN, which prevents a reconnection attempt. > [ 7728.520947] RPC: 33550 call_transmit (status 0) > [ 7728.520951] RPC: 33550 xprt_prepare_transmit > [ 7728.520957] RPC: 33550 rpc_xdr_encode (status 0) > [ 7728.520962] RPC: 33550 marshaling UNIX cred ffff88012e002300 > [ 7728.520969] RPC: 33550 using AUTH_UNIX cred ffff88012e002300 to wrap rpc data > [ 7728.520976] RPC: 33550 xprt_transmit(32920) RPC client encodes the request and attempts to send it. > [ 7728.520984] RPC: xs_tcp_send_request(32920) = -32 Network layer says -EPIPE, for some reason. RPC client calls kernel_sock_shutdown(SHUT_WR). > [ 7728.520997] RPC: xs_tcp_state_change client ffff88013e62d800... > [ 7728.521007] RPC: state 4 conn 1 dead 0 zapped 1 "state 4" => sk->sk_state == TCP_FIN_WAIT1 The RPC client sets up a linger timeout. > [ 7728.521013] RPC: 33550 call_status (status -32) > [ 7728.521018] RPC: 33550 call_bind (status 0) > [ 7728.521023] RPC: 33550 call_connect xprt ffff88013e62d800 is not connected > [ 7728.521028] RPC: 33550 xprt_connect xprt ffff88013e62d800 is not connected > [ 7728.521035] RPC: 33550 sleep_on(queue "xprt_pending" time 4296824426) > [ 7728.521040] RPC: 33550 added to queue ffff88013e62db20 "xprt_pending" > [ 7728.521045] RPC: 33550 setting alarm for 60000 ms RPC client puts this RPC task to sleep. > [ 7728.521439] RPC: xs_tcp_state_change client ffff88013e62d800... > [ 7728.521444] RPC: state 5 conn 0 dead 0 zapped 1 "state 5" => sk->sk_state == TCP_FIN_WAIT2 > [ 7728.521602] RPC: xs_tcp_state_change client ffff88013e62d800... > [ 7728.521608] RPC: state 7 conn 0 dead 0 zapped 1 > [ 7728.521612] RPC: disconnected transport ffff88013e62d800 "state 7" => sk->sk_state == TCP_CLOSE The RPC client marks the socket closed. and the linger timeout is cancelled. At this point, sk_shutdown should be set to zero, correct? I don't see an xs_error_report() call here, which would confirm that the socket took a trip through tcp_disconnect(). > [ 7728.521617] RPC: 33550 __rpc_wake_up_task (now 4296824426) > [ 7728.521621] RPC: 33550 disabling timer > [ 7728.521626] RPC: 33550 removed from queue ffff88013e62db20 "xprt_pending" > [ 7728.521631] RPC: __rpc_wake_up_task done RPC client wakes up the RPC task. Meanwhile... > [ 7728.521636] RPC: xs_tcp_state_change client ffff88013e62d800... > [ 7728.521641] RPC: state 7 conn 0 dead 0 zapped 1 > [ 7728.521645] RPC: disconnected transport ffff88013e62d800 > [ 7728.521649] RPC: xs_tcp_data_ready... ... network layer calls closed socket's data_ready method... while the awoken RPC task gets underway. > [ 7728.521662] RPC: 33550 __rpc_execute flags=0x1 > [ 7728.521666] RPC: 33550 xprt_connect_status: retrying > [ 7728.521671] RPC: 33550 call_connect_status (status -11) The awoken RPC task's status is -EAGAIN, which prevents a reconnection attempt, even though there is no established connection. RPC client barrels on to send the request again. > [ 7728.521675] RPC: 33550 call_transmit (status 0) > [ 7728.521679] RPC: 33550 xprt_prepare_transmit > [ 7728.521683] RPC: 33550 rpc_xdr_encode (status 0) > [ 7728.521688] RPC: 33550 marshaling UNIX cred ffff88012e002300 > [ 7728.521694] RPC: 33550 using AUTH_UNIX cred ffff88012e002300 to wrap rpc data > [ 7728.521699] RPC: 33550 xprt_transmit(32920) > [ 7728.521704] RPC: xs_tcp_send_request(32920) = -32 RPC client attempts to send again, gets -EPIPE, and calls kernel_sock_shutdown(SHUT_WR). If there is no connection established, the RPC client expects -ENOTCONN, in which case it will attempt to reconnect here. > [ 7728.521709] RPC: xs_tcp_state_change client ffff88013e62d800... > [ 7728.521714] RPC: state 7 conn 0 dead 0 zapped 1 > [ 7728.521718] RPC: disconnected transport ffff88013e62d800 "state 7" => sk->sk_state == TCP_CLOSE Following this, the RPC client attempts to retransmit the request repeatedly, but the socket remains in the TCP_CLOSE state. ÿôèº{.nÇ+·®+%Ëÿ±éݶ\x17¥wÿº{.nÇ+·¥{±þG«éÿ{ayº\x1dÊÚë,j\a¢f£¢·hïêÿêçz_è®\x03(éÝ¢j"ú\x1a¶^[m§ÿÿ¾\a«þG«éÿ¢¸?¨èÚ&£ø§~á¶iOæ¬z·vØ^\x14\x04\x1a¶^[m§ÿÿÃ\fÿ¶ìÿ¢¸?I¥ ^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: nfs client hang 2010-07-27 17:28 ` Chuck Lever (?) @ 2010-07-28 7:24 ` Andy Chittenden -1 siblings, 0 replies; 18+ messages in thread From: Andy Chittenden @ 2010-07-28 7:24 UTC (permalink / raw) To: Andy Chittenden, Chuck Lever, Eric Dumazet Cc: Linux Kernel Mailing List (linux-kernel@vger.kernel.org), Trond Myklebust, netdev, Linux NFS Mailing List cmVzZW5kaW5nIGFzIGl0IHNlZW1zIHRvIGhhdmUgYmVlbiBjb3JydXB0ZWQgb24gTEtNTCENCg0K PiBUaGUgUlBDIGNsaWVudCBtYXJrcyB0aGUgc29ja2V0IGNsb3NlZC4gYW5kIHRoZSBsaW5nZXIg dGltZW91dCBpcyANCj4gY2FuY2VsbGVkLiAgQXQgdGhpcyBwb2ludCwgc2tfc2h1dGRvd24gc2hv dWxkIGJlIHNldCB0byB6ZXJvLCBjb3JyZWN0PyANCj4gSSBkb24ndCBzZWUgYW4geHNfZXJyb3Jf cmVwb3J0KCkgY2FsbCBoZXJlLCB3aGljaCB3b3VsZCBjb25maXJtIHRoYXQgdGhlIA0KPiBzb2Nr ZXQgdG9vayBhIHRyaXAgdGhyb3VnaCB0Y3BfZGlzY29ubmVjdCgpLg0KDQpGcm9tIG15IHJlYWRp bmcgb2YgdGNwX2Rpc2Nvbm5lY3QoKSwgaXQgY2FsbHMgc2stPnNrX2Vycm9yX3JlcG9ydChzaykg dW5jb25kaXRpb25hbGx5IHNvIGFzIHRoZXJlJ3Mgbm8geHNfZXJyb3JfcmVwb3J0KCksIHRoYXQg c3VyZWx5IG1lYW5zIHRoZSBleGFjdCBvcHBvc2l0ZTogdGNwX2Rpc2Nvbm5lY3QoKSB3YXNuJ3Qg Y2FsbGVkLiBJZiBpdCdzIG5vdCBjYWxsZWQsIHNrX3NodXRkb3duIGlzIG5vdCBjbGVhcmVkLiBB bmQgbXkgcmV2aXNlZCB0cmFjaW5nIGNvbmZpcm1lZCB0aGF0IGl0IHdhcyBzZXQgdG8gDQpTRU5E X1NIVVRET1dOLg0KDQotLSANCkFuZHksIEJsdWVBcmMgRW5naW5lZXJpbmcNCg== ^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: nfs client hang @ 2010-07-28 7:24 ` Andy Chittenden 0 siblings, 0 replies; 18+ messages in thread From: Andy Chittenden @ 2010-07-28 7:24 UTC (permalink / raw) To: Andy Chittenden, Chuck Lever, Eric Dumazet Cc: Linux Kernel Mailing List (linux-kernel@vger.kernel.org), Trond Myklebust, netdev, Linux NFS Mailing List resending as it seems to have been corrupted on LKML! > The RPC client marks the socket closed. and the linger timeout is > cancelled. At this point, sk_shutdown should be set to zero, correct? > I don't see an xs_error_report() call here, which would confirm that the > socket took a trip through tcp_disconnect(). From my reading of tcp_disconnect(), it calls sk->sk_error_report(sk) unconditionally so as there's no xs_error_report(), that surely means the exact opposite: tcp_disconnect() wasn't called. If it's not called, sk_shutdown is not cleared. And my revised tracing confirmed that it was set to SEND_SHUTDOWN. -- Andy, BlueArc Engineering ^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: nfs client hang @ 2010-07-28 7:24 ` Andy Chittenden 0 siblings, 0 replies; 18+ messages in thread From: Andy Chittenden @ 2010-07-28 7:24 UTC (permalink / raw) To: Andy Chittenden, Chuck Lever, Eric Dumazet Cc: Linux Kernel Mailing List (linux-kernel@vger.kernel.org), Trond Myklebust, netdev, Linux NFS Mailing List [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #1: Type: text/plain; charset="utf-8", Size: 822 bytes --] resending as it seems to have been corrupted on LKML! > The RPC client marks the socket closed. and the linger timeout is > cancelled. At this point, sk_shutdown should be set to zero, correct? > I don't see an xs_error_report() call here, which would confirm that the > socket took a trip through tcp_disconnect(). >From my reading of tcp_disconnect(), it calls sk->sk_error_report(sk) unconditionally so as there's no xs_error_report(), that surely means the exact opposite: tcp_disconnect() wasn't called. If it's not called, sk_shutdown is not cleared. And my revised tracing confirmed that it was set to SEND_SHUTDOWN. -- Andy, BlueArc Engineering ÿôèº{.nÇ+·®+%Ëÿ±éݶ\x17¥wÿº{.nÇ+·¥{±þG«éÿ{ayº\x1dÊÚë,j\a¢f£¢·hïêÿêçz_è®\x03(éÝ¢j"ú\x1a¶^[m§ÿÿ¾\a«þG«éÿ¢¸?¨èÚ&£ø§~á¶iOæ¬z·vØ^\x14\x04\x1a¶^[m§ÿÿÃ\fÿ¶ìÿ¢¸?I¥ ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: nfs client hang @ 2010-07-28 17:37 ` Chuck Lever 0 siblings, 0 replies; 18+ messages in thread From: Chuck Lever @ 2010-07-28 17:37 UTC (permalink / raw) To: Andy Chittenden Cc: Eric Dumazet, Linux Kernel Mailing List (linux-kernel@vger.kernel.org), Trond Myklebust, netdev, Linux NFS Mailing List On 07/28/10 03:24 AM, Andy Chittenden wrote: > resending as it seems to have been corrupted on LKML! > >> The RPC client marks the socket closed. and the linger timeout is >> cancelled. At this point, sk_shutdown should be set to zero, correct? >> I don't see an xs_error_report() call here, which would confirm that the >> socket took a trip through tcp_disconnect(). > > From my reading of tcp_disconnect(), it calls sk->sk_error_report(sk) > unconditionally so as there's no xs_error_report(), that surely means > the exact opposite: tcp_disconnect() wasn't called. If it's not > called, sk_shutdown is not cleared. And my revised tracing confirmed > that it was set to SEND_SHUTDOWN. Sorry, that's what I meant above. An xs_error_report() debugging message at that point in the log would confirm that the socket took a trip through tcp_disconnect(). But I don't see such a message. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: nfs client hang @ 2010-07-28 17:37 ` Chuck Lever 0 siblings, 0 replies; 18+ messages in thread From: Chuck Lever @ 2010-07-28 17:37 UTC (permalink / raw) To: Andy Chittenden Cc: Eric Dumazet, Linux Kernel Mailing List (linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org), Trond Myklebust, netdev, Linux NFS Mailing List On 07/28/10 03:24 AM, Andy Chittenden wrote: > resending as it seems to have been corrupted on LKML! > >> The RPC client marks the socket closed. and the linger timeout is >> cancelled. At this point, sk_shutdown should be set to zero, correct? >> I don't see an xs_error_report() call here, which would confirm that the >> socket took a trip through tcp_disconnect(). > > From my reading of tcp_disconnect(), it calls sk->sk_error_report(sk) > unconditionally so as there's no xs_error_report(), that surely means > the exact opposite: tcp_disconnect() wasn't called. If it's not > called, sk_shutdown is not cleared. And my revised tracing confirmed > that it was set to SEND_SHUTDOWN. Sorry, that's what I meant above. An xs_error_report() debugging message at that point in the log would confirm that the socket took a trip through tcp_disconnect(). But I don't see such a message. -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: nfs client hang @ 2010-07-29 10:10 ` Andy Chittenden 0 siblings, 0 replies; 18+ messages in thread From: Andy Chittenden @ 2010-07-29 10:10 UTC (permalink / raw) To: Chuck Lever Cc: Eric Dumazet, Linux Kernel Mailing List (linux-kernel@vger.kernel.org), Trond Myklebust, netdev, Linux NFS Mailing List On 2010-07-28 18:37, Chuck Lever wrote: > On 07/28/10 03:24 AM, Andy Chittenden wrote: >> resending as it seems to have been corrupted on LKML! >> >>> The RPC client marks the socket closed. and the linger timeout is >>> cancelled. At this point, sk_shutdown should be set to zero, correct? >>> I don't see an xs_error_report() call here, which would confirm that the >>> socket took a trip through tcp_disconnect(). >> From my reading of tcp_disconnect(), it calls sk->sk_error_report(sk) >> unconditionally so as there's no xs_error_report(), that surely means >> the exact opposite: tcp_disconnect() wasn't called. If it's not >> called, sk_shutdown is not cleared. And my revised tracing confirmed >> that it was set to SEND_SHUTDOWN. > Sorry, that's what I meant above. > > An xs_error_report() debugging message at that point in the log would > confirm that the socket took a trip through tcp_disconnect(). But I > don't see such a message. I don't see how tcp_disconnect() gets called if the application does a shutdown when the state is TCP_ESTABLISHED (or a myriad of other states). It just seems to send a FIN. Should tcp_disconnect() be called? If so, how? Alternatively, I wonder whether my patch that set sk_shutdown to 0 in tcp_connect_init() is the correct fix after all. -- Andy, BlueArc Engineering ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: nfs client hang @ 2010-07-29 10:10 ` Andy Chittenden 0 siblings, 0 replies; 18+ messages in thread From: Andy Chittenden @ 2010-07-29 10:10 UTC (permalink / raw) To: Chuck Lever Cc: Eric Dumazet, Linux Kernel Mailing List (linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org), Trond Myklebust, netdev, Linux NFS Mailing List On 2010-07-28 18:37, Chuck Lever wrote: > On 07/28/10 03:24 AM, Andy Chittenden wrote: >> resending as it seems to have been corrupted on LKML! >> >>> The RPC client marks the socket closed. and the linger timeout is >>> cancelled. At this point, sk_shutdown should be set to zero, correct? >>> I don't see an xs_error_report() call here, which would confirm that the >>> socket took a trip through tcp_disconnect(). >> From my reading of tcp_disconnect(), it calls sk->sk_error_report(sk) >> unconditionally so as there's no xs_error_report(), that surely means >> the exact opposite: tcp_disconnect() wasn't called. If it's not >> called, sk_shutdown is not cleared. And my revised tracing confirmed >> that it was set to SEND_SHUTDOWN. > Sorry, that's what I meant above. > > An xs_error_report() debugging message at that point in the log would > confirm that the socket took a trip through tcp_disconnect(). But I > don't see such a message. I don't see how tcp_disconnect() gets called if the application does a shutdown when the state is TCP_ESTABLISHED (or a myriad of other states). It just seems to send a FIN. Should tcp_disconnect() be called? If so, how? Alternatively, I wonder whether my patch that set sk_shutdown to 0 in tcp_connect_init() is the correct fix after all. -- Andy, BlueArc Engineering -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 18+ messages in thread
* General Linux Kernel / User Space License 2010-07-28 17:37 ` Chuck Lever (?) (?) @ 2011-07-07 17:01 ` Mitchell Erblich 2011-07-07 20:35 ` Chris Friesen -1 siblings, 1 reply; 18+ messages in thread From: Mitchell Erblich @ 2011-07-07 17:01 UTC (permalink / raw) To: Linux Kernel Mailing List (linux-kernel@vger.kernel.org) Please include me in the response. I am interested in any legal restriction of copying different parts of the kernel into the user space and then building an app around those changes. The app would have a src directory with the copied parts. Thanks, Mitchell Erblich ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: General Linux Kernel / User Space License 2011-07-07 17:01 ` General Linux Kernel / User Space License Mitchell Erblich @ 2011-07-07 20:35 ` Chris Friesen 0 siblings, 0 replies; 18+ messages in thread From: Chris Friesen @ 2011-07-07 20:35 UTC (permalink / raw) To: Mitchell Erblich; +Cc: Linux Kernel Mailing List (linux-kernel@vger.kernel.org) On 07/07/2011 11:01 AM, Mitchell Erblich wrote: > > Please include me in the response. > > I am interested in any legal restriction of copying different parts of the > kernel into the user space and then building an app around > those changes. The kernel is mostly GPLv2. Treat the code appropriately. Chris Chris Friesen Software Developer GENBAND chris.friesen@genband.com www.genband.com ^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2011-07-07 20:36 UTC | newest]
Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <99613C19B13C5D40914FB8930657FA9303365708DE@uk-ex-mbx1.terastack.bluearc.com>
2010-07-27 7:25 ` nfs client hang Andy Chittenden
2010-07-27 10:53 ` Andy Chittenden
2010-07-27 12:21 ` Eric Dumazet
2010-07-27 12:51 ` Andy Chittenden
2010-07-27 17:28 ` Chuck Lever
2010-07-27 17:28 ` Chuck Lever
2010-07-28 7:08 ` Andy Chittenden
2010-07-28 7:08 ` Andy Chittenden
2010-07-28 7:08 ` Andy Chittenden
2010-07-28 7:24 ` Andy Chittenden
2010-07-28 7:24 ` Andy Chittenden
2010-07-28 7:24 ` Andy Chittenden
2010-07-28 17:37 ` Chuck Lever
2010-07-28 17:37 ` Chuck Lever
2010-07-29 10:10 ` Andy Chittenden
2010-07-29 10:10 ` Andy Chittenden
2011-07-07 17:01 ` General Linux Kernel / User Space License Mitchell Erblich
2011-07-07 20:35 ` Chris Friesen
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.