* NFSv3 TCP socket stuck when all slots used and server goes away
From: Simon Kirby @ 2013-03-06 9:51 UTC
To: linux-nfs

We had an issue with a Pacemaker/CRM HA-NFSv3 setup where one particular export hit an XFS locking issue on one node and got completely stuck. Upon failing over, service recovered for all clients that hadn't hit the mount since the issue occurred, but almost all of the usual clients (which also statfs commonly as a monitoring check) sat forever (>20 minutes) without reconnecting.

It seems that the clients filled the RPC slots with requests over the TCP socket to the NFS VIP and the server ack'd everything at the TCP layer, but was not able to reply to anything due to the FS locking issue. When we failed over the VIP to the other node, service was restored, but the clients stuck this way continued to sit with nothing to tickle the TCP layer. netstat shows a socket with no send-queue, in ESTABLISHED state, and with no timer enabled:

tcp 0 0 c:724 s:2049 ESTABLISHED - off (0.00/0/0)

The mountpoint options used are: rw,hard,intr,tcp,vers=3

The export options are: rw,async,hide,no_root_squash,no_subtree_check,mp

Is this expected behaviour? I suspect if TCP keepalive were enabled, the socket would eventually get torn down as soon as the client tries to send something to the (effectively rebooted / swapped) NFS server and gets an RST. However, as-is, there seems to be nothing here that would eventually cause anything to happen. Am I missing something?

Simon-
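The netstat line quoted in this message can be checked mechanically when many clients need to be inspected. The following is a rough sketch in Python, assuming the column layout of the quoted line (protocol, Recv-Q, Send-Q, local address, remote address, state, program, timer). The names `SocketLine`, `parse_netstat_line` and `looks_idle_and_untimed` are hypothetical helpers made up for this illustration, and real netstat output may differ between versions.

```python
from typing import NamedTuple


class SocketLine(NamedTuple):
    proto: str
    recv_q: int
    send_q: int
    local: str
    remote: str
    state: str
    timer: str


def parse_netstat_line(line: str) -> SocketLine:
    """Parse one TCP line laid out like the one quoted in the thread."""
    fields = line.split()
    # The timer column is taken from the end of the line, e.g.
    # "off (0.00/0/0)", because the program column may contain spaces.
    return SocketLine(
        proto=fields[0],
        recv_q=int(fields[1]),
        send_q=int(fields[2]),
        local=fields[3],
        remote=fields[4],
        state=fields[5],
        timer=fields[-2],
    )


def looks_idle_and_untimed(sock: SocketLine) -> bool:
    """True for an ESTABLISHED socket with empty queues and no timer armed."""
    return (
        sock.state == "ESTABLISHED"
        and sock.recv_q == 0
        and sock.send_q == 0
        and sock.timer == "off"
    )


if __name__ == "__main__":
    line = "tcp 0 0 c:724 s:2049 ESTABLISHED - off (0.00/0/0)"
    print(looks_idle_and_untimed(parse_netstat_line(line)))
```

A socket that matches this check has nothing queued and no retransmit or keepalive timer pending, which is the condition described above: nothing on the client side will cause a packet to be sent on its own.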
* RE: NFSv3 TCP socket stuck when all slots used and server goes away
From: Myklebust, Trond @ 2013-03-06 14:06 UTC
To: Simon Kirby, linux-nfs@vger.kernel.org

> -----Original Message-----
> From: linux-nfs-owner@vger.kernel.org [mailto:linux-nfs-owner@vger.kernel.org] On Behalf Of Simon Kirby
> Sent: Wednesday, March 06, 2013 4:52 AM
> To: linux-nfs@vger.kernel.org
> Subject: NFSv3 TCP socket stuck when all slots used and server goes away
>
> We had an issue with a Pacemaker/CRM HA-NFSv3 setup where one particular export hit an XFS locking issue on one node and got completely stuck. Upon failing over, service recovered for all clients that hadn't hit the mount since the issue occurred, but almost all of the usual clients (which also statfs commonly as a monitoring check) sat forever (>20 minutes) without reconnecting.
>
> It seems that the clients filled the RPC slots with requests over the TCP socket to the NFS VIP and the server ack'd everything at the TCP layer, but was not able to reply to anything due to the FS locking issue. When we failed over the VIP to the other node, service was restored, but the clients stuck this way continued to sit with nothing to tickle the TCP layer. netstat shows a socket with no send-queue, in ESTABLISHED state, and with no timer enabled:
>
> tcp 0 0 c:724 s:2049 ESTABLISHED - off (0.00/0/0)
>
> The mountpoint options used are: rw,hard,intr,tcp,vers=3
>
> The export options are: rw,async,hide,no_root_squash,no_subtree_check,mp
>
> Is this expected behaviour? I suspect if TCP keepalive were enabled, the socket would eventually get torn down as soon as the client tries to send something to the (effectively rebooted / swapped) NFS server and gets an RST. However, as-is, there seems to be nothing here that would eventually cause anything to happen. Am I missing something?

Which client? Did the server close the connection?

Cheers
Trond
* Re: NFSv3 TCP socket stuck when all slots used and server goes away
From: Simon Kirby @ 2013-03-06 21:20 UTC
To: Myklebust, Trond
Cc: linux-nfs@vger.kernel.org

On Wed, Mar 06, 2013 at 02:06:01PM +0000, Myklebust, Trond wrote:

> > -----Original Message-----
> > From: linux-nfs-owner@vger.kernel.org [mailto:linux-nfs-owner@vger.kernel.org] On Behalf Of Simon Kirby
> > Sent: Wednesday, March 06, 2013 4:52 AM
> > To: linux-nfs@vger.kernel.org
> > Subject: NFSv3 TCP socket stuck when all slots used and server goes away
> >
> > We had an issue with a Pacemaker/CRM HA-NFSv3 setup where one particular export hit an XFS locking issue on one node and got completely stuck. Upon failing over, service recovered for all clients that hadn't hit the mount since the issue occurred, but almost all of the usual clients (which also statfs commonly as a monitoring check) sat forever (>20 minutes) without reconnecting.
> >
> > It seems that the clients filled the RPC slots with requests over the TCP socket to the NFS VIP and the server ack'd everything at the TCP layer, but was not able to reply to anything due to the FS locking issue. When we failed over the VIP to the other node, service was restored, but the clients stuck this way continued to sit with nothing to tickle the TCP layer. netstat shows a socket with no send-queue, in ESTABLISHED state, and with no timer enabled:
> >
> > tcp 0 0 c:724 s:2049 ESTABLISHED - off (0.00/0/0)
> >
> > The mountpoint options used are: rw,hard,intr,tcp,vers=3
> >
> > The export options are: rw,async,hide,no_root_squash,no_subtree_check,mp
> >
> > Is this expected behaviour? I suspect if TCP keepalive were enabled, the socket would eventually get torn down as soon as the client tries to send something to the (effectively rebooted / swapped) NFS server and gets an RST. However, as-is, there seems to be nothing here that would eventually cause anything to happen. Am I missing something?
>
> Which client? Did the server close the connection?

Oh. 3.2.16 knfsd server, 3.2.36 - 3.2.39 clients (about 20 of them).

The server did not close the connection but got stonith'd by the other node (equivalent to a hard reboot of a single node). The socket doesn't get a FIN or anything, because the server just goes away. When it comes back, there is nothing on the server to know that the socket ever existed. With no send-queue and nothing un-acked from the client's view, and no keepalive timer or anything else, the client never seems to send anything, so it doesn't ever poke the server and get back an RST to tear down the socket on the client side, allowing it to reconnect.

I have dmesg saved from an "rpcdebug -m rpc -c" after this occurred, but I didn't paste it originally because I am wondering if the client _is_ supposed to re-issue requests over the RPC TCP socket if no response is received after this long. With no timeo specified, /proc/mounts shows the default timeo is 600 (in tenths of a second, so 60 seconds), retrans 2. Is it supposed to send something over the socket again every 60 seconds if all slots were previously used to issue NFS requests but nothing has been answered?

http://0x.ca/sim/ref/3.2.39/rpcdebug.txt

Cheers,

Simon-
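To make the keepalive idea in this message concrete, here is a minimal userspace sketch in Python of what enabling TCP keepalive on an ordinary socket looks like. It is only an illustration under stated assumptions: the NFS client's RPC socket lives in the kernel and is not configured through this API, and the helper name `enable_keepalive` and the idle/interval/count values are invented for the example rather than taken from the thread.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, count=5):
    """Ask the kernel to probe an idle TCP connection (hypothetical helper).

    idle:     seconds of silence before the first probe
    interval: seconds between unanswered probes
    count:    unanswered probes before the connection is dropped
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These tunables are platform-specific (present on Linux), so each
    # one is guarded instead of being assumed to exist.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
    return sock


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(s)
    print("keepalive enabled:",
          s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
    s.close()
```

With keepalive armed, an otherwise silent connection sends a probe once the idle period expires. A peer that has rebooted and no longer knows the connection would be expected to answer that probe with an RST, which is the event the client in this report never receives because it never sends anything.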
* RE: NFSv3 TCP socket stuck when all slots used and server goes away
From: Myklebust, Trond @ 2013-03-06 21:31 UTC
To: Simon Kirby
Cc: linux-nfs@vger.kernel.org

> -----Original Message-----
> From: Simon Kirby [mailto:sim@hostway.ca]
> Sent: Wednesday, March 06, 2013 4:21 PM
> To: Myklebust, Trond
> Cc: linux-nfs@vger.kernel.org
> Subject: Re: NFSv3 TCP socket stuck when all slots used and server goes away
>
> On Wed, Mar 06, 2013 at 02:06:01PM +0000, Myklebust, Trond wrote:
>
> > > -----Original Message-----
> > > From: linux-nfs-owner@vger.kernel.org [mailto:linux-nfs-owner@vger.kernel.org] On Behalf Of Simon Kirby
> > > Sent: Wednesday, March 06, 2013 4:52 AM
> > > To: linux-nfs@vger.kernel.org
> > > Subject: NFSv3 TCP socket stuck when all slots used and server goes away
> > >
> > > We had an issue with a Pacemaker/CRM HA-NFSv3 setup where one particular export hit an XFS locking issue on one node and got completely stuck. Upon failing over, service recovered for all clients that hadn't hit the mount since the issue occurred, but almost all of the usual clients (which also statfs commonly as a monitoring check) sat forever (>20 minutes) without reconnecting.
> > >
> > > It seems that the clients filled the RPC slots with requests over the TCP socket to the NFS VIP and the server ack'd everything at the TCP layer, but was not able to reply to anything due to the FS locking issue. When we failed over the VIP to the other node, service was restored, but the clients stuck this way continued to sit with nothing to tickle the TCP layer. netstat shows a socket with no send-queue, in ESTABLISHED state, and with no timer enabled:
> > >
> > > tcp 0 0 c:724 s:2049 ESTABLISHED - off (0.00/0/0)
> > >
> > > The mountpoint options used are: rw,hard,intr,tcp,vers=3
> > >
> > > The export options are: rw,async,hide,no_root_squash,no_subtree_check,mp
> > >
> > > Is this expected behaviour? I suspect if TCP keepalive were enabled, the socket would eventually get torn down as soon as the client tries to send something to the (effectively rebooted / swapped) NFS server and gets an RST. However, as-is, there seems to be nothing here that would eventually cause anything to happen. Am I missing something?
> >
> > Which client? Did the server close the connection?
>
> Oh. 3.2.16 knfsd server, 3.2.36 - 3.2.39 clients (about 20 of them).
>
> The server did not close the connection but got stonith'd by the other node (equivalent to a hard reboot of a single node). The socket doesn't get a FIN or anything, because the server just goes away. When it comes back, there is nothing on the server to know that the socket ever existed. With no send-queue and nothing un-acked from the client's view, and no keepalive timer or anything else, the client never seems to send anything, so it doesn't ever poke the server and get back an RST to tear down the socket on the client side, allowing it to reconnect.
>
> I have dmesg saved from an "rpcdebug -m rpc -c" after this occurred, but I didn't paste it originally because I am wondering if the client _is_ supposed to re-issue requests over the RPC TCP socket if no response is received after this long. With no timeo specified, /proc/mounts shows the default timeo is 600 (in tenths of a second, so 60 seconds), retrans 2. Is it supposed to send something over the socket again every 60 seconds if all slots were previously used to issue NFS requests but nothing has been answered?
>
> http://0x.ca/sim/ref/3.2.39/rpcdebug.txt
>
> Cheers,

The client should normally retransmit after the timeout, at which point it will discover that the other end is disconnected. It might take a few minutes though; your timeouts appear to have hit the maximum of 3 minutes between retries.

Is there no traffic seen on the wire at all?

Cheers
Trond
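The retry timing discussed in this reply can be worked through with a little arithmetic. The sketch below is in Python and rests on assumptions that should be read as such: nfs(5) documents `timeo` in tenths of a second and describes a linear backoff for NFS over TCP, while the ceiling used here, `timeo * (retrans + 1)`, is a guess chosen because it reproduces the 3 minutes mentioned above for timeo=600 and retrans=2. The function names and the `cap_seconds` parameter are hypothetical, and none of this is taken from the kernel source.

```python
def timeo_to_seconds(timeo):
    """Convert the timeo mount option (tenths of a second) to seconds."""
    return timeo / 10


def retry_intervals(timeo=600, retrans=2, attempts=5, cap_seconds=600):
    """Model a linear backoff: each wait grows by timeo up to a ceiling.

    The ceiling is an assumption: the smaller of timeo * (retrans + 1)
    and cap_seconds, all expressed in seconds.
    """
    base = timeo_to_seconds(timeo)
    ceiling = min(base * (retrans + 1), cap_seconds)
    # n-th wait is base * (n + 1), clamped to the assumed ceiling.
    return [min(base * (n + 1), ceiling) for n in range(attempts)]


if __name__ == "__main__":
    print(timeo_to_seconds(600))   # 60.0
    print(retry_intervals())       # [60.0, 120.0, 180.0, 180.0, 180.0]
```

Under these assumptions a client mounted with the options from this thread would wait 60 seconds, then 120, then 180 between retransmissions, so a packet capture that runs for at least a few minutes should be long enough to tell whether any retry reaches the wire.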