linux-nfs.vger.kernel.org archive mirror
* RDMA connection lost and not re-opened
From: scar @ 2018-05-03 20:40 UTC
  To: linux-nfs

We are using NFSoRDMA on our cluster, which runs CentOS 6.9 with 
kernel 2.6.32-696.1.1.el6.x86_64.  2 of the 10 clients had to be 
rebooted recently, apparently because an NFS connection was closed but 
not reopened.  For example, we commonly see these messages:

May  2 14:46:08 n006 kernel: rpcrdma: connection to 10.10.11.249:2050 closed (-103)
May  2 15:42:39 n006 kernel: rpcrdma: connection to 10.10.11.249:2050 on mlx4_0, memreg 5 slots 32 ird 16
May  2 15:42:44 n006 kernel: rpcrdma: connection to 10.10.11.249:2050 on mlx4_0, memreg 5 slots 32 ird 16
May  2 16:04:02 n006 kernel: rpcrdma: connection to 10.10.11.249:2050 closed (-103)
May  2 16:04:02 n006 kernel: rpcrdma: connection to 10.10.11.249:2050 on mlx4_0, memreg 5 slots 32 ird 16
May  2 18:46:00 n006 kernel: rpcrdma: connection to 10.10.11.249:2050 closed (-103)
May  2 19:16:09 n006 kernel: rpcrdma: connection to 10.10.11.249:2050 on mlx4_0, memreg 5 slots 32 ird 16
May  2 19:28:49 n006 kernel: rpcrdma: connection to 10.10.11.249:2050 closed (-103)
May  2 21:14:42 n006 kernel: rpcrdma: connection to 10.10.11.10:20049 closed (-103)
May  3 11:51:13 n006 kernel: rpcrdma: connection to 10.10.11.249:2050 on mlx4_0, memreg 5 slots 32 ird 16
May  3 11:56:13 n006 kernel: rpcrdma: connection to 10.10.11.249:2050 closed (-103)
May  3 13:14:34 n006 kernel: rpcrdma: connection to 10.10.11.249:2050 on mlx4_0, memreg 5 slots 32 ird 16


I asked about these messages previously and they are just normal 
operation.  You can see the connection is usually reopened 
if the resource is still required, but the message at 21:14:42 was not 
followed by a matching re-open message, and this is about the time the 
client hung and became unresponsive.  I noticed similar messages on the 
other client that had to be rebooted:

May  2 15:46:52 n001 kernel: rpcrdma: connection to 10.10.11.249:2050 closed (-103)
May  2 16:08:39 n001 kernel: rpcrdma: connection to 10.10.11.249:2050 on mlx4_0, memreg 5 slots 32 ird 16
May  2 19:14:23 n001 kernel: rpcrdma: connection to 10.10.11.249:2050 closed (-103)
May  2 21:14:38 n001 kernel: rpcrdma: connection to 10.10.11.10:20049 closed (-103)
May  3 11:54:58 n001 kernel: rpcrdma: connection to 10.10.11.249:2050 on mlx4_0, memreg 5 slots 32 ird 16
May  3 11:59:59 n001 kernel: rpcrdma: connection to 10.10.11.249:2050 closed (-103)
May  3 12:50:57 n001 kernel: rpcrdma: connection to 10.10.11.249:2050 on mlx4_0, memreg 5 slots 32 ird 16
May  3 12:55:58 n001 kernel: rpcrdma: connection to 10.10.11.249:2050 closed (-103)


You can see on each machine that the connection to 10.10.11.249:2050 was 
re-opened when I tried to log in today (May 3), but the connection to 
10.10.11.10:20049 was not re-opened.  Meanwhile, our other clients still 
have the connection to 10.10.11.10:20049, and the server at 10.10.11.10 
is working fine.
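
To make the pattern above easier to spot across many clients, here is a minimal Python sketch that reports every (client, server endpoint) pair whose most recent rpcrdma message is a close.  It assumes each syslog message sits on a single line (unwrapped) and matches only the two message formats quoted in this thread; the names still_closed and last_events are made up for illustration.

```python
import re

# Matches the two rpcrdma client messages quoted in this thread:
#   "... connection to <peer> closed (<errno>)"
#   "... connection to <peer> on <device>, memreg ..."
EVENT = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:]+)\s+(?P<host>\S+)\s+kernel:\s+rpcrdma:\s+"
    r"connection to (?P<peer>\S+)\s+(?P<what>closed|on)\b"
)

def last_events(lines):
    """Return {(host, peer): (timestamp, state)} for the latest event seen."""
    latest = {}
    for line in lines:
        m = EVENT.match(line)
        if m:
            state = "closed" if m.group("what") == "closed" else "open"
            latest[(m.group("host"), m.group("peer"))] = (m.group("ts"), state)
    return latest

def still_closed(lines):
    """List (host, peer, timestamp) entries whose latest event is a close."""
    return sorted(
        (host, peer, ts)
        for (host, peer), (ts, state) in last_events(lines).items()
        if state == "closed"
    )

# Three lines taken from the n006 log above, joined onto single lines.
sample = [
    "May  2 19:28:49 n006 kernel: rpcrdma: connection to 10.10.11.249:2050 closed (-103)",
    "May  2 21:14:42 n006 kernel: rpcrdma: connection to 10.10.11.10:20049 closed (-103)",
    "May  3 13:14:34 n006 kernel: rpcrdma: connection to 10.10.11.249:2050 on mlx4_0, memreg 5 slots 32 ird 16",
]

for host, peer, ts in still_closed(sample):
    print(host, peer, "closed at", ts)
```

A close as the latest event is not a problem by itself, since an idle mount is only reconnected when it is needed again, so this only narrows down which endpoints to look at.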

Any idea why this happened and how it could be resolved without 
having to reboot the server and lose work?

Thanks


* Re: RDMA connection lost and not re-opened
From: scar @ 2018-05-03 23:02 UTC
  To: linux-nfs

I also noticed these errors on the NFS server 10.10.11.10:

May  2 21:27:59 pac kernel: svcrdma: failed to send reply chunks, rc=-5
May  2 21:27:59 pac kernel: nfsd: peername failed (err 107)!
May  2 21:27:59 pac kernel: nfsd: peername failed (err 107)!
May  2 21:27:59 pac kernel: nfsd: peername failed (err 107)!
May  2 21:27:59 pac kernel: nfsd: peername failed (err 107)!
May  2 21:27:59 pac kernel: nfsd: peername failed (err 107)!
May  2 21:27:59 pac kernel: nfsd: peername failed (err 107)!
May  2 21:27:59 pac kernel: nfsd: peername failed (err 107)!
May  2 21:27:59 pac kernel: nfsd: peername failed (err 107)!
May  2 21:27:59 pac kernel: nfsd: peername failed (err 107)!
May  2 21:27:59 pac kernel: nfsd: peername failed (err 107)!
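
For reference, the numbers in these messages (and the -103 in the client logs) look like Linux errno values, which the kernel usually prints negated.  The short Python sketch below decodes them under that assumption.  The table is hardcoded with the standard Linux values so the result does not depend on the machine running the script, and the helper name decode is made up for illustration.

```python
# Standard Linux errno values for the three codes seen in this thread.
# Hardcoded (an assumption that the messages print plain errno values)
# so the output is the same on any platform.
LINUX_ERRNO = {
    5: ("EIO", "Input/output error"),
    103: ("ECONNABORTED", "Software caused connection abort"),
    107: ("ENOTCONN", "Transport endpoint is not connected"),
}

def decode(code):
    """Turn a (possibly negated) errno value into a readable string."""
    name, text = LINUX_ERRNO.get(abs(code), ("UNKNOWN", "not in this table"))
    return f"{code}: {name} ({text})"

# -103 from the rpcrdma client messages, -5 and 107 from the server log.
for code in (-103, -5, 107):
    print(decode(code))
```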




