linux-nfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Dai Ngo <dai.ngo@oracle.com>
To: Trond Myklebust <trondmy@kernel.org>, anna@kernel.org
Cc: linux-nfs@vger.kernel.org
Subject: Re: [PATCH 1/1] SUNRPC: call_connect_status needs to destroy transport on ETIMEDOUT before retry
Date: Tue, 5 Aug 2025 11:25:49 -0700	[thread overview]
Message-ID: <9795ce6a-7b8c-4a1e-a536-0f804de502b3@oracle.com> (raw)
In-Reply-To: <5fde3098a1ae24966def603f127986d6a417e5c4.camel@kernel.org>


On 8/5/25 10:41 AM, Trond Myklebust wrote:
> On Tue, 2025-08-05 at 10:07 -0700, Dai Ngo wrote:
>> On 8/5/25 9:41 AM, Trond Myklebust wrote:
>>> On Tue, 2025-08-05 at 08:46 -0700, Dai Ngo wrote:
>>>> On 8/4/25 4:55 PM, Trond Myklebust wrote:
>>>>> On Mon, 2025-08-04 at 13:13 -0700, Dai Ngo wrote:
>>>>>> On 8/4/25 12:21 PM, Trond Myklebust wrote:
>>>>>>> On Mon, 2025-08-04 at 12:08 -0700, Dai Ngo wrote:
>>>>>>>> Currently, when an RPC connection times out during the
>>>>>>>> connect
>>>>>>>> phase,
>>>>>>>> the task is retried by placing it back on the pending
>>>>>>>> queue
>>>>>>>> and
>>>>>>>> waiting
>>>>>>>> again. In some cases, the timeout occurs because TCP is
>>>>>>>> unable to
>>>>>>>> send
>>>>>>>> the SYN packet. This situation most often arises on bare
>>>>>>>> metal
>>>>>>>> systems
>>>>>>>> at boot time, when the NFS mount is attempted while the
>>>>>>>> network
>>>>>>>> link
>>>>>>>> appears to be up but is not yet stable.
>>>>>>>>
>>>>>>>> This patch addresses the issue by updating
>>>>>>>> call_connect_status to
>>>>>>>> destroy
>>>>>>>> the transport on ETIMEDOUT error before retrying the
>>>>>>>> connection.
>>>>>>>> This
>>>>>>>> ensures that subsequent connection attempts use a fresh
>>>>>>>> transport,
>>>>>>>> reducing the likelihood of repeated failures due to
>>>>>>>> lingering
>>>>>>>> network
>>>>>>>> issues.
>>>>>>>>
>>>>>>>> Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
>>>>>>>> ---
>>>>>>>>      net/sunrpc/clnt.c | 2 +-
>>>>>>>>      1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
>>>>>>>> index 21426c3049d3..701b742750c5 100644
>>>>>>>> --- a/net/sunrpc/clnt.c
>>>>>>>> +++ b/net/sunrpc/clnt.c
>>>>>>>> @@ -2215,6 +2215,7 @@ call_connect_status(struct rpc_task
>>>>>>>> *task)
>>>>>>>>      	case -EHOSTUNREACH:
>>>>>>>>      	case -EPIPE:
>>>>>>>>      	case -EPROTO:
>>>>>>>> +	case -ETIMEDOUT:
>>>>>>>>      		xprt_conditional_disconnect(task-
>>>>>>>>> tk_rqstp-
>>>>>>>>> rq_xprt,
>>>>>>>>      					    task-
>>>>>>>>> tk_rqstp-
>>>>>>>>> rq_connect_cookie);
>>>>>>>>      		if (RPC_IS_SOFTCONN(task))
>>>>>>>> @@ -2225,7 +2226,6 @@ call_connect_status(struct rpc_task
>>>>>>>> *task)
>>>>>>>>      	case -EADDRINUSE:
>>>>>>>>      	case -ENOTCONN:
>>>>>>>>      	case -EAGAIN:
>>>>>>>> -	case -ETIMEDOUT:
>>>>>>>>      		if (!(task->tk_flags &
>>>>>>>> RPC_TASK_NO_ROUND_ROBIN)
>>>>>>>> &&
>>>>>>>>      		    (task->tk_flags & RPC_TASK_MOVEABLE)
>>>>>>>> &&
>>>>>>>>      		    test_bit(XPRT_REMOVE, &xprt->state))
>>>>>>>> {
>>>>>>> Why is this needed? The ETIMEDOUT is supposed to be a task
>>>>>>> level
>>>>>>> error,
>>>>>>> not a connection level thing.
>>>>>> If TCP was not able to sent the SYN out on due to the
>>>>>> transient
>>>>>> error
>>>>>> with the link status and the task just turns around a wait
>>>>>> again,
>>>>>> since
>>>>>> TCP does not retry the SYN, eventually systemd times out and
>>>>>> stops
>>>>>> the
>>>>>> mount.
>>>>>>
>>>>>>
>>>>>> Below is the snippet from the system log with rpcdebug
>>>>>> enabled:
>>>>>>
>>>>>>
>>>>>> Jun 20 10:23:01 qa-i360-odi03 kernel: i40e 0000:98:00.0 eth1:
>>>>>> NIC
>>>>>> Link is Up, 10 Gbps Full Duplex, Flow Control: None
>>>>>> Jun 20 10:23:09 qa-i360-odi03 NetworkManager[1511]: <info>
>>>>>> [1750414989.6033] manager: startup complete
>>>>>>
>>>>>> ... <NFS mount starts>
>>>>>> Jun 20 10:23:09 qa-i360-odi03 systemd[1]: Mounting /odi...
>>>>>> ...
>>>>>> Jun 20 10:23:09 qa-i360-odi03 kernel: RPC:     1 added to
>>>>>> queue
>>>>>> 0000000093f481cd "xprt_pending"
>>>>>> Jun 20 10:23:09 qa-i360-odi03 kernel: RPC:     1 setting
>>>>>> alarm
>>>>>> for
>>>>>> 60000 ms
>>>>>>
>>>>>> ... <link status down & up>
>>>>>> Jun 20 10:23:10 qa-i360-odi03 kernel: tg3 0000:04:00.0 em1:
>>>>>> Link
>>>>>> is
>>>>>> up at 1000 Mbps, full duplex
>>>>>> Jun 20 10:23:10 qa-i360-odi03 NetworkManager[1511]: <info>
>>>>>> [1750414990.6359] device (em1): state change: disconnected ->
>>>>>> prepare
>>>>>> (reason 'none', sys-iface-state: 'managed')
>>>>>> Jun 20 10:23:10 qa-i360-odi03 NetworkManager[1511]: <info>
>>>>>> [1750414990.6361] device (em1): state change: prepare ->
>>>>>> config
>>>>>> (reason 'none', sys-iface-state: 'managed')
>>>>>> Jun 20 10:23:10 qa-i360-odi03 NetworkManager[1511]: <info>
>>>>>> [1750414990.6364] device (em1): state change: config -> ip-
>>>>>> config
>>>>>> (reason 'none', sys-iface-state: 'managed')
>>>>>> Jun 20 10:23:10 qa-i360-odi03 NetworkManager[1511]: <info>
>>>>>> [1750414990.6383] device (em1): Activation: successful,
>>>>>> device
>>>>>> activated.
>>>>>>
>>>>>> ... <connect timed out>
>>>>>> Jun 20 10:24:11 qa-i360-odi03 kernel: RPC:     1 timeout
>>>>>> Jun 20 10:24:11 qa-i360-odi03 kernel: RPC:     1
>>>>>> __rpc_wake_up_task
>>>>>> (now 4294742016)
>>>>>> Jun 20 10:24:11 qa-i360-odi03 kernel: RPC:     1 disabling
>>>>>> timer
>>>>>> Jun 20 10:24:11 qa-i360-odi03 kernel: RPC:     1 removed from
>>>>>> queue
>>>>>> 0000000093f481cd "xprt_pending"
>>>>>> Jun 20 10:24:11 qa-i360-odi03 kernel: RPC:     1
>>>>>> call_connect_status
>>>>>> (status -110)
>>>>>>
>>>>>> ... <wait again>
>>>>>> Jun 20 10:24:11 qa-i360-odi03 kernel: RPC:     1
>>>>>> sleep_on(queue
>>>>>> "xprt_pending" time 4294742016)
>>>>>> Jun 20 10:24:11 qa-i360-odi03 kernel: RPC:     1 added to
>>>>>> queue
>>>>>> 0000000093f481cd "xprt_pending"
>>>>>>
>>>>>> ... <systemd timed out>
>>>>>> Jun 20 10:24:39 qa-i360-odi03 systemd[1]: odi.mount: Mounting
>>>>>> timed
>>>>>> out. Terminating.
>>>>>>
>>>>>> Jun 20 10:24:39 qa-i360-odi03 kernel: RPC:     1 got signal
>>>>>> Jun 20 10:24:39 qa-i360-odi03 kernel: RPC:     1
>>>>>> __rpc_wake_up_task
>>>>>> (now 4294770229)
>>>>>> Jun 20 10:24:39 qa-i360-odi03 kernel: RPC:     1 disabling
>>>>>> timer
>>>>>> Jun 20 10:24:39 qa-i360-odi03 kernel: RPC:     1 removed from
>>>>>> queue
>>>>>> 0000000093f481cd "xprt_pending"
>>>>>>
>>>>>> Jun 20 10:24:39 qa-i360-odi03 kernel: <-- nfs4_try_mount() =
>>>>>> -512
>>>>>> [error]
>>>>>> Jun 20 10:24:39 qa-i360-odi03 systemd[1]: odi.mount: Mount
>>>>>> process
>>>>>> exited, code=killed status=15
>>>>>> Jun 20 10:24:39 qa-i360-odi03 systemd[1]: odi.mount: Failed
>>>>>> with
>>>>>> result 'timeout'.
>>>>>>
>>>>>> This patch forces TCP to send the SYN on ETIMEDOUT error.
>>>>>>
>>>>> Something is very wrong here...
>>>>>
>>>>> If this patch is correct, and the call to
>>>>> xprt_conditional_disconnect()
>>>>> does indeed cause the socket to close, then something must have
>>>>> bumped
>>>>> xprt->connect_cookie.
>>>> I'm a little confused here, xprt_conditional_disconnect only
>>>> closes
>>>> the
>>>> socket if the connect cookie is still the same:
>>>>
>>>>            if (cookie != xprt->connect_cookie)
>>>>                    goto out;
>>>>
>>>> So in this case the xprt->connect_cookie has not been bumped.
>>> You're right... Sorry...
>>>
>>> So does this mean that the socket is still in the SYN-SENT state?
>>> It is
>>> normally supposed to remain in that state for 60 seconds, and
>>> resend
>>> SYN using an exponential back off.
>> Apparently this was not done by TCP in this scenario:
>>
>> ... mount starts
>> Jun 20 10:23:09 qa-i360-odi03 systemd[1]: Mounting /odi...
>> Jun 20 10:23:09 qa-i360-odi03 kernel: RPC:     1 added to queue
>> 0000000093f481cd "xprt_pending"
>>
>> Jun 20 10:23:10 qa-i360-odi03 NetworkManager[1511]: <info>
>> [1750414990.6383] device (em1): Activation: successful, device
>> activated.
>>
>> ... RPC task timed out on connect after 62 secs.  Note that at this
>> time the network link status was already up for more than 60 secs
>> Jun 20 10:24:11 qa-i360-odi03 kernel: RPC:     1 timeout
>> Jun 20 10:24:11 qa-i360-odi03 kernel: RPC:     1 __rpc_wake_up_task
>> (now 4294742016)
>>
>> ... RPC task goes back and wait again.
>> Jun 20 10:24:11 qa-i360-odi03 kernel: RPC:     1 sleep_on(queue
>> "xprt_pending" time 4294742016)
>> Jun 20 10:24:11 qa-i360-odi03 kernel: RPC:     1 added to queue
>> 0000000093f481cd "xprt_pending"
>>
>> ... nothing happened and systemd timed out after 90 secs and kill the
>> mount
>> Jun 20 10:24:39 qa-i360-odi03 systemd[1]: odi.mount: Mounting timed
>> out. Terminating.
>> Jun 20 10:24:39 qa-i360-odi03 kernel: <-- nfs4_try_mount() = -512
>> [error]
>> Jun 20 10:24:39 qa-i360-odi03 systemd[1]: odi.mount: Mount process
>> exited, code=killed status=15
>> Jun 20 10:24:39 qa-i360-odi03 systemd[1]: odi.mount: Failed with
>> result 'timeout'.
>>
>> I don't know why TCP does not retry the SYN after the networkk link
>> is
>> up again but I though instead of trying to figure out and fix the
>> problem
>> at TCP layer, we can make the RPC connect process a bit more robust
>> by using
>> a new socket for retry.
>>
> We should never be using individual RPC message timeouts as a measure
> for whether or not the TCP connection is down.
>
> For the initial connection attempt, we use the TCP_SYNCNT socket option
> to decide when to give up.

The problem we're having is the initial connection attempt at boot time,
NFS mount started while the network link has been stable yet. NetworkManager
could also be blamed here, it claims 'startup complete' which allows
systemd to start the NFS mounts.

And for some reasons TCP did not send the SYN under this flaky link condition,
even after the network link is up again; we don't see any SYN in the network
trace. The TCP_SYNCNT option is not in play here.

This seems like a bug in the TCP/network layer and it seems to me that
you prefer to push the responsibility to resolve this problem down to
TCP/network stack.

This problem is reproducible, do you want to collect any additional data
before making the final decision?

-Dai

>   Once the connection is established, we set
> TCP_USER_TIMEOUT as well as socket keepalive to ensure that the
> connection is still alive.
>
> I'm willing to consider further socket-specific timers, but not if the
> reason is to work around bugs (assuming that is indeed the case here).
>

  reply	other threads:[~2025-08-05 18:26 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-08-04 19:08 [PATCH 1/1] SUNRPC: call_connect_status needs to destroy transport on ETIMEDOUT before retry Dai Ngo
2025-08-04 19:21 ` Trond Myklebust
2025-08-04 20:13   ` Dai Ngo
2025-08-04 23:55     ` Trond Myklebust
2025-08-05 15:46       ` Dai Ngo
2025-08-05 16:41         ` Trond Myklebust
2025-08-05 17:07           ` Dai Ngo
2025-08-05 17:41             ` Trond Myklebust
2025-08-05 18:25               ` Dai Ngo [this message]
  -- strict thread matches above, loose matches on Subject: below --
2025-08-04 18:46 Dai Ngo
2025-08-04 18:49 ` Chuck Lever

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=9795ce6a-7b8c-4a1e-a536-0f804de502b3@oracle.com \
    --to=dai.ngo@oracle.com \
    --cc=anna@kernel.org \
    --cc=linux-nfs@vger.kernel.org \
    --cc=trondmy@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).