linux-nfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Trond Myklebust <trondmy@kernel.org>
To: Dai Ngo <dai.ngo@oracle.com>, anna@kernel.org
Cc: linux-nfs@vger.kernel.org
Subject: Re: [PATCH 1/1] SUNRPC: call_connect_status needs to destroy transport on ETIMEDOUT before retry
Date: Tue, 05 Aug 2025 10:41:35 -0700	[thread overview]
Message-ID: <5fde3098a1ae24966def603f127986d6a417e5c4.camel@kernel.org> (raw)
In-Reply-To: <14935b2e-8124-4f46-a63e-331328d49d12@oracle.com>

On Tue, 2025-08-05 at 10:07 -0700, Dai Ngo wrote:
> 
> On 8/5/25 9:41 AM, Trond Myklebust wrote:
> > On Tue, 2025-08-05 at 08:46 -0700, Dai Ngo wrote:
> > > On 8/4/25 4:55 PM, Trond Myklebust wrote:
> > > > On Mon, 2025-08-04 at 13:13 -0700, Dai Ngo wrote:
> > > > > On 8/4/25 12:21 PM, Trond Myklebust wrote:
> > > > > > On Mon, 2025-08-04 at 12:08 -0700, Dai Ngo wrote:
> > > > > > > Currently, when an RPC connection times out during the
> > > > > > > connect
> > > > > > > phase,
> > > > > > > the task is retried by placing it back on the pending
> > > > > > > queue
> > > > > > > and
> > > > > > > waiting
> > > > > > > again. In some cases, the timeout occurs because TCP is
> > > > > > > unable to
> > > > > > > send
> > > > > > > the SYN packet. This situation most often arises on bare
> > > > > > > metal
> > > > > > > systems
> > > > > > > at boot time, when the NFS mount is attempted while the
> > > > > > > network
> > > > > > > link
> > > > > > > appears to be up but is not yet stable.
> > > > > > > 
> > > > > > > This patch addresses the issue by updating
> > > > > > > call_connect_status to
> > > > > > > destroy
> > > > > > > the transport on ETIMEDOUT error before retrying the
> > > > > > > connection.
> > > > > > > This
> > > > > > > ensures that subsequent connection attempts use a fresh
> > > > > > > transport,
> > > > > > > reducing the likelihood of repeated failures due to
> > > > > > > lingering
> > > > > > > network
> > > > > > > issues.
> > > > > > > 
> > > > > > > Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
> > > > > > > ---
> > > > > > >     net/sunrpc/clnt.c | 2 +-
> > > > > > >     1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > > > 
> > > > > > > diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
> > > > > > > index 21426c3049d3..701b742750c5 100644
> > > > > > > --- a/net/sunrpc/clnt.c
> > > > > > > +++ b/net/sunrpc/clnt.c
> > > > > > > @@ -2215,6 +2215,7 @@ call_connect_status(struct rpc_task
> > > > > > > *task)
> > > > > > >     	case -EHOSTUNREACH:
> > > > > > >     	case -EPIPE:
> > > > > > >     	case -EPROTO:
> > > > > > > +	case -ETIMEDOUT:
> > > > > > >     		xprt_conditional_disconnect(task-
> > > > > > > >tk_rqstp-
> > > > > > > > rq_xprt,
> > > > > > >     					    task-
> > > > > > > >tk_rqstp-
> > > > > > > > rq_connect_cookie);
> > > > > > >     		if (RPC_IS_SOFTCONN(task))
> > > > > > > @@ -2225,7 +2226,6 @@ call_connect_status(struct rpc_task
> > > > > > > *task)
> > > > > > >     	case -EADDRINUSE:
> > > > > > >     	case -ENOTCONN:
> > > > > > >     	case -EAGAIN:
> > > > > > > -	case -ETIMEDOUT:
> > > > > > >     		if (!(task->tk_flags &
> > > > > > > RPC_TASK_NO_ROUND_ROBIN)
> > > > > > > &&
> > > > > > >     		    (task->tk_flags & RPC_TASK_MOVEABLE)
> > > > > > > &&
> > > > > > >     		    test_bit(XPRT_REMOVE, &xprt->state))
> > > > > > > {
> > > > > > Why is this needed? The ETIMEDOUT is supposed to be a task
> > > > > > level
> > > > > > error,
> > > > > > not a connection level thing.
> > > > > If TCP was not able to sent the SYN out on due to the
> > > > > transient
> > > > > error
> > > > > with the link status and the task just turns around a wait
> > > > > again,
> > > > > since
> > > > > TCP does not retry the SYN, eventually systemd times out and
> > > > > stops
> > > > > the
> > > > > mount.
> > > > > 
> > > > > 
> > > > > Below is the snippet from the system log with rpcdebug
> > > > > enabled:
> > > > > 
> > > > > 
> > > > > Jun 20 10:23:01 qa-i360-odi03 kernel: i40e 0000:98:00.0 eth1:
> > > > > NIC
> > > > > Link is Up, 10 Gbps Full Duplex, Flow Control: None
> > > > > Jun 20 10:23:09 qa-i360-odi03 NetworkManager[1511]: <info>
> > > > > [1750414989.6033] manager: startup complete
> > > > > 
> > > > > ... <NFS mount starts>
> > > > > Jun 20 10:23:09 qa-i360-odi03 systemd[1]: Mounting /odi...
> > > > > ...
> > > > > Jun 20 10:23:09 qa-i360-odi03 kernel: RPC:     1 added to
> > > > > queue
> > > > > 0000000093f481cd "xprt_pending"
> > > > > Jun 20 10:23:09 qa-i360-odi03 kernel: RPC:     1 setting
> > > > > alarm
> > > > > for
> > > > > 60000 ms
> > > > > 
> > > > > ... <link status down & up>
> > > > > Jun 20 10:23:10 qa-i360-odi03 kernel: tg3 0000:04:00.0 em1:
> > > > > Link
> > > > > is
> > > > > up at 1000 Mbps, full duplex
> > > > > Jun 20 10:23:10 qa-i360-odi03 NetworkManager[1511]: <info>
> > > > > [1750414990.6359] device (em1): state change: disconnected ->
> > > > > prepare
> > > > > (reason 'none', sys-iface-state: 'managed')
> > > > > Jun 20 10:23:10 qa-i360-odi03 NetworkManager[1511]: <info>
> > > > > [1750414990.6361] device (em1): state change: prepare ->
> > > > > config
> > > > > (reason 'none', sys-iface-state: 'managed')
> > > > > Jun 20 10:23:10 qa-i360-odi03 NetworkManager[1511]: <info>
> > > > > [1750414990.6364] device (em1): state change: config -> ip-
> > > > > config
> > > > > (reason 'none', sys-iface-state: 'managed')
> > > > > Jun 20 10:23:10 qa-i360-odi03 NetworkManager[1511]: <info>
> > > > > [1750414990.6383] device (em1): Activation: successful,
> > > > > device
> > > > > activated.
> > > > > 
> > > > > ... <connect timed out>
> > > > > Jun 20 10:24:11 qa-i360-odi03 kernel: RPC:     1 timeout
> > > > > Jun 20 10:24:11 qa-i360-odi03 kernel: RPC:     1
> > > > > __rpc_wake_up_task
> > > > > (now 4294742016)
> > > > > Jun 20 10:24:11 qa-i360-odi03 kernel: RPC:     1 disabling
> > > > > timer
> > > > > Jun 20 10:24:11 qa-i360-odi03 kernel: RPC:     1 removed from
> > > > > queue
> > > > > 0000000093f481cd "xprt_pending"
> > > > > Jun 20 10:24:11 qa-i360-odi03 kernel: RPC:     1
> > > > > call_connect_status
> > > > > (status -110)
> > > > > 
> > > > > ... <wait again>
> > > > > Jun 20 10:24:11 qa-i360-odi03 kernel: RPC:     1
> > > > > sleep_on(queue
> > > > > "xprt_pending" time 4294742016)
> > > > > Jun 20 10:24:11 qa-i360-odi03 kernel: RPC:     1 added to
> > > > > queue
> > > > > 0000000093f481cd "xprt_pending"
> > > > > 
> > > > > ... <systemd timed out>
> > > > > Jun 20 10:24:39 qa-i360-odi03 systemd[1]: odi.mount: Mounting
> > > > > timed
> > > > > out. Terminating.
> > > > > 
> > > > > Jun 20 10:24:39 qa-i360-odi03 kernel: RPC:     1 got signal
> > > > > Jun 20 10:24:39 qa-i360-odi03 kernel: RPC:     1
> > > > > __rpc_wake_up_task
> > > > > (now 4294770229)
> > > > > Jun 20 10:24:39 qa-i360-odi03 kernel: RPC:     1 disabling
> > > > > timer
> > > > > Jun 20 10:24:39 qa-i360-odi03 kernel: RPC:     1 removed from
> > > > > queue
> > > > > 0000000093f481cd "xprt_pending"
> > > > > 
> > > > > Jun 20 10:24:39 qa-i360-odi03 kernel: <-- nfs4_try_mount() =
> > > > > -512
> > > > > [error]
> > > > > Jun 20 10:24:39 qa-i360-odi03 systemd[1]: odi.mount: Mount
> > > > > process
> > > > > exited, code=killed status=15
> > > > > Jun 20 10:24:39 qa-i360-odi03 systemd[1]: odi.mount: Failed
> > > > > with
> > > > > result 'timeout'.
> > > > > 
> > > > > This patch forces TCP to send the SYN on ETIMEDOUT error.
> > > > > 
> > > > Something is very wrong here...
> > > > 
> > > > If this patch is correct, and the call to
> > > > xprt_conditional_disconnect()
> > > > does indeed cause the socket to close, then something must have
> > > > bumped
> > > > xprt->connect_cookie.
> > > I'm a little confused here, xprt_conditional_disconnect only
> > > closes
> > > the
> > > socket if the connect cookie is still the same:
> > > 
> > >           if (cookie != xprt->connect_cookie)
> > >                   goto out;
> > > 
> > > So in this case the xprt->connect_cookie has not been bumped.
> > You're right... Sorry...
> > 
> > So does this mean that the socket is still in the SYN-SENT state?
> > It is
> > normally supposed to remain in that state for 60 seconds, and
> > resend
> > SYN using an exponential back off.
> 
> Apparently this was not done by TCP in this scenario:
> 
> ... mount starts
> Jun 20 10:23:09 qa-i360-odi03 systemd[1]: Mounting /odi...
> Jun 20 10:23:09 qa-i360-odi03 kernel: RPC:     1 added to queue
> 0000000093f481cd "xprt_pending"
> 
> Jun 20 10:23:10 qa-i360-odi03 NetworkManager[1511]: <info> 
> [1750414990.6383] device (em1): Activation: successful, device
> activated.
> 
> ... RPC task timed out on connect after 62 secs.  Note that at this
> time the network link status was already up for more than 60 secs
> Jun 20 10:24:11 qa-i360-odi03 kernel: RPC:     1 timeout
> Jun 20 10:24:11 qa-i360-odi03 kernel: RPC:     1 __rpc_wake_up_task
> (now 4294742016)
> 
> ... RPC task goes back and wait again.
> Jun 20 10:24:11 qa-i360-odi03 kernel: RPC:     1 sleep_on(queue
> "xprt_pending" time 4294742016)
> Jun 20 10:24:11 qa-i360-odi03 kernel: RPC:     1 added to queue
> 0000000093f481cd "xprt_pending"
> 
> ... nothing happened and systemd timed out after 90 secs and kill the
> mount
> Jun 20 10:24:39 qa-i360-odi03 systemd[1]: odi.mount: Mounting timed
> out. Terminating.
> Jun 20 10:24:39 qa-i360-odi03 kernel: <-- nfs4_try_mount() = -512
> [error]
> Jun 20 10:24:39 qa-i360-odi03 systemd[1]: odi.mount: Mount process
> exited, code=killed status=15
> Jun 20 10:24:39 qa-i360-odi03 systemd[1]: odi.mount: Failed with
> result 'timeout'.
> 
> I don't know why TCP does not retry the SYN after the networkk link
> is
> up again but I though instead of trying to figure out and fix the
> problem
> at TCP layer, we can make the RPC connect process a bit more robust
> by using
> a new socket for retry.
> 

We should never be using individual RPC message timeouts as a measure
for whether or not the TCP connection is down.

For the initial connection attempt, we use the TCP_SYNCNT socket option
to decide when to give up. Once the connection is established, we set
TCP_USER_TIMEOUT as well as socket keepalive to ensure that the
connection is still alive.

I'm willing to consider further socket-specific timers, but not if the
reason is to work around bugs (assuming that is indeed the case here).

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trondmy@kernel.org, trond.myklebust@hammerspace.com

  reply	other threads:[~2025-08-05 17:41 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-08-04 19:08 [PATCH 1/1] SUNRPC: call_connect_status needs to destroy transport on ETIMEDOUT before retry Dai Ngo
2025-08-04 19:21 ` Trond Myklebust
2025-08-04 20:13   ` Dai Ngo
2025-08-04 23:55     ` Trond Myklebust
2025-08-05 15:46       ` Dai Ngo
2025-08-05 16:41         ` Trond Myklebust
2025-08-05 17:07           ` Dai Ngo
2025-08-05 17:41             ` Trond Myklebust [this message]
2025-08-05 18:25               ` Dai Ngo
  -- strict thread matches above, loose matches on Subject: below --
2025-08-04 18:46 Dai Ngo
2025-08-04 18:49 ` Chuck Lever

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5fde3098a1ae24966def603f127986d6a417e5c4.camel@kernel.org \
    --to=trondmy@kernel.org \
    --cc=anna@kernel.org \
    --cc=dai.ngo@oracle.com \
    --cc=linux-nfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).