* pNFS DS infinite reconnect if process get interrupted
@ 2025-06-01 18:00 Mkrtchyan, Tigran
0 siblings, 0 replies; only message in thread
From: Mkrtchyan, Tigran @ 2025-06-01 18:00 UTC (permalink / raw)
To: linux-nfs
Dear NFS fellows,
From time to time, we observe pNFS hangs in an infinite reconnect loop to a DS.
We have a pNFS deployment that uses the flexfiles layout and ~1000 tightly
coupled data servers over NFSv4.1. The users' applications run on an HPC cluster
with RHEL9 (5.14.0-503.38.1.el9_5.x86_64). When applications exceed their
allocated time slot, they get killed by the batch system (Slurm). Some of those
killed applications hang, so the batch system marks such nodes as 'dead' and
takes them out of scheduling; as a result, hundreds of CPUs end up disabled.
Although the issue is hard to reproduce (it takes 1-2 days to trigger),
here is my analysis of what happens, based on network captures, ftrace, and the
kernel source (RHEL and upstream 6.15).
When an application gets killed (SIGTERM) while the NFS client is connecting
to a DS, the client ends up in an infinite connect-disconnect loop (the DS
listens on port 24009):
tcpdump:
1 0.000000000 13a.b.c → 13x.y.z TCP 76 759 → 24009 [SYN] Seq=0 Win=32120 Len=0 MSS=1460 SACK_PERM=1 TSval=715791705 TSecr=0 WS=128
2 0.000023143 13x.y.z → 13a.b.c TCP 76 24009 → 759 [SYN, ACK] Seq=0 Ack=1 Win=31856 Len=0 MSS=1460 SACK_PERM=1 TSval=1948502509 TSecr=715791705 WS=128
3 0.000071151 13a.b.c → 13x.y.z TCP 68 759 → 24009 [ACK] Seq=1 Ack=1 Win=32128 Len=0 TSval=715791705 TSecr=1948502509
4 0.000224823 13a.b.c → 13x.y.z TCP 68 759 → 24009 [FIN, ACK] Seq=1 Ack=1 Win=32128 Len=0 TSval=715791706 TSecr=1948502509
5 0.000348922 13x.y.z → 13a.b.c TCP 68 24009 → 759 [FIN, ACK] Seq=1 Ack=2 Win=31872 Len=0 TSval=1948502509 TSecr=715791706
6 0.000394673 13a.b.c → 13x.y.z TCP 68 759 → 24009 [ACK] Seq=2 Ack=2 Win=32128 Len=0 TSval=715791706 TSecr=1948502509
ftrace:
<...>-1227943 [008] ....... 2419533.462117: xprt_create: peer=[13x.y.z]:24009 state=BOUND
<...>-1227943 [008] ....... 2419533.462164: rpc_clnt_new: client=00000349 peer=[13x.y.z]:24009 program=nfs server=13x.y.z xprtsec=none flags=DISCRTRY|INFINITE_SLOTS|NO_RETRANS_TIMEOUT
<...>-1227943 [008] ....... 2419533.462165: rpc_task_begin: task:00000001@00000349 flags=NULLCREDS|DYNAMIC|SOFT|SOFTCONN|CRED_NOREF runstate=ACTIVE status=0 action=0x0
<...>-1227943 [008] ....... 2419533.462165: rpc_task_run_action: task:00000001@00000349 flags=NULLCREDS|DYNAMIC|SOFT|SOFTCONN|CRED_NOREF runstate=RUNNING|ACTIVE status=0 action=rpc_prepare_task [sunrpc]
<...>-1227943 [008] ....... 2419533.462165: rpc_task_run_action: task:00000001@00000349 flags=NULLCREDS|DYNAMIC|SOFT|SOFTCONN|CRED_NOREF runstate=RUNNING|ACTIVE status=0 action=call_start [sunrpc]
<...>-1227943 [008] ....... 2419533.462165: rpc_request: task:00000001@00000349 nfsv4 NULL (sync)
<...>-1227943 [008] ....... 2419533.462166: rpc_task_run_action: task:00000001@00000349 flags=NULLCREDS|DYNAMIC|SOFT|SOFTCONN|CRED_NOREF runstate=RUNNING|ACTIVE status=0 action=call_reserve [sunrpc]
<...>-1227943 [008] ....... 2419533.462166: xprt_reserve: task:00000001@00000349 xid=0xd88c82e6
<...>-1227943 [008] ....... 2419533.462166: rpc_task_run_action: task:00000001@00000349 flags=NULLCREDS|DYNAMIC|SOFT|SOFTCONN|CRED_NOREF runstate=RUNNING|ACTIVE status=0 action=call_reserveresult [sunrpc]
<...>-1227943 [008] ....... 2419533.462167: rpc_task_run_action: task:00000001@00000349 flags=NULLCREDS|DYNAMIC|SOFT|SOFTCONN|CRED_NOREF runstate=RUNNING|ACTIVE status=0 action=call_refresh [sunrpc]
<...>-1227943 [008] ....... 2419533.462167: rpc_task_run_action: task:00000001@00000349 flags=NULLCREDS|DYNAMIC|SOFT|SOFTCONN|CRED_NOREF runstate=RUNNING|ACTIVE status=0 action=call_refreshresult [sunrpc]
<...>-1227943 [008] ....... 2419533.462167: rpc_task_run_action: task:00000001@00000349 flags=NULLCREDS|DYNAMIC|SOFT|SOFTCONN|CRED_NOREF runstate=RUNNING|ACTIVE status=0 action=call_allocate [sunrpc]
<...>-1227943 [008] ....... 2419533.462167: rpc_buf_alloc: task:00000001@00000349 callsize=56 recvsize=32 status=0
<...>-1227943 [008] ....... 2419533.462167: rpc_task_run_action: task:00000001@00000349 flags=NULLCREDS|DYNAMIC|SOFT|SOFTCONN|CRED_NOREF runstate=RUNNING|ACTIVE status=0 action=call_encode [sunrpc]
<...>-1227943 [008] ....... 2419533.462168: rpc_task_run_action: task:00000001@00000349 flags=NULLCREDS|DYNAMIC|SOFT|SOFTCONN|CRED_NOREF runstate=RUNNING|ACTIVE|NEED_XMIT|NEED_RECV status=0 action=call_connect [sunrpc]
<...>-1227943 [008] ....1.. 2419533.462168: xprt_reserve_xprt: task:00000001@00000349 snd_task:00000001
<...>-1227943 [008] ....... 2419533.462169: xprt_connect: peer=[13x.y.z]:24009 state=LOCKED|BOUND
<...>-1227943 [008] ....1.. 2419533.462169: rpc_task_sleep: task:00000001@00000349 flags=NULLCREDS|DYNAMIC|SOFT|SOFTCONN|CRED_NOREF runstate=RUNNING|ACTIVE|NEED_XMIT|NEED_RECV status=0 timeout=0 queue=xprt_pending
<...>-1227943 [008] ....... 2419533.462170: rpc_task_sync_sleep: task:00000001@00000349 flags=NULLCREDS|DYNAMIC|SOFT|SOFTCONN|CRED_NOREF runstate=QUEUED|ACTIVE|NEED_XMIT|NEED_RECV status=0 action=call_connect_status [sunrpc]
<...>-1227943 [008] ....... 2419533.462172: rpc_task_signalled: task:00000001@00000349 flags=NULLCREDS|DYNAMIC|SOFT|SOFTCONN|CRED_NOREF runstate=QUEUED|ACTIVE|NEED_XMIT|NEED_RECV status=0 action=call_connect_status [sunrpc]
<...>-1227943 [008] ....1.. 2419533.462172: rpc_task_wakeup: task:00000001@00000349 flags=NULLCREDS|DYNAMIC|SOFT|SOFTCONN|CRED_NOREF runstate=QUEUED|ACTIVE|NEED_XMIT|NEED_RECV|SIGNALLED status=0 timeout=60000 queue=xprt_pending
<...>-1227943 [008] ....... 2419533.462173: rpc_task_sync_wake: task:00000001@00000349 flags=NULLCREDS|DYNAMIC|SOFT|SOFTCONN|CRED_NOREF runstate=RUNNING|ACTIVE|NEED_XMIT|NEED_RECV|SIGNALLED status=0 action=call_connect_status [sunrpc]
<...>-1227943 [008] ....... 2419533.462173: rpc_task_run_action: task:00000001@00000349 flags=NULLCREDS|DYNAMIC|SOFT|SOFTCONN|CRED_NOREF runstate=RUNNING|ACTIVE|NEED_XMIT|NEED_RECV|SIGNALLED status=-512 action=rpc_exit_task [sunrpc]
<...>-1227943 [008] ....... 2419533.462173: rpc_task_end: task:00000001@00000349 flags=NULLCREDS|DYNAMIC|SOFT|SOFTCONN|CRED_NOREF runstate=RUNNING|ACTIVE|NEED_XMIT|NEED_RECV|SIGNALLED status=-512 action=call_connect_status [sunrpc]
<...>-1227943 [008] ....... 2419533.462173: rpc_stats_latency: task:00000001@00000349 xid=0xd88c82e6 nfsv4 NULL backlog=0 rtt=0 execute=8
<...>-1227943 [008] ....... 2419533.462173: rpc_task_call_done: task:00000001@00000349 flags=NULLCREDS|DYNAMIC|SOFT|SOFTCONN|CRED_NOREF runstate=RUNNING|ACTIVE|NEED_XMIT|NEED_RECV|SIGNALLED status=-512 action=rpc_default_callback [sunrpc]
<...>-1227943 [008] ....1.. 2419533.462174: xprt_release_xprt: task:00000001@00000349 snd_task:ffffffff
<...>-1227943 [008] ....... 2419533.462175: rpc_clnt_shutdown: client=00000349
<...>-1227943 [008] ....... 2419533.462179: rpc_clnt_release: client=00000349
My assumption is that flexfilelayoutdev.c#nfs4_ff_layout_prepare_ds gets an error from
nfs4_pnfs_ds_connect with status -ERESTARTSYS (the status=-512 seen in the ftrace above),
which is set by rpc_signal_task, but the error is not propagated as fatal (for example,
it is not caught by a check like nfs_error_is_fatal). Thus it is treated as a transient
error and retried indefinitely.
Does this sound like a reasonable analysis? How can we prove it, and how can we fix it?
Best regards,
Tigran.