linux-nfs.vger.kernel.org archive mirror
From: Dai Ngo <dai.ngo@oracle.com>
To: Chuck Lever <chuck.lever@oracle.com>, Christoph Hellwig <hch@lst.de>
Cc: jlayton@kernel.org, neilb@ownmail.net, okorniev@redhat.com,
	tom@talpey.com, linux-nfs@vger.kernel.org
Subject: Re: [PATCH 2/3] NFSD: Do not fence the client on NFS4ERR_RETRY_UNCACHED_REP error
Date: Mon, 3 Nov 2025 12:03:07 -0800
Message-ID: <0ccaa75a-9e7d-478d-b33c-1c777eb29dd6@oracle.com>
In-Reply-To: <f4ddebf0-7039-47c9-8e20-9622c8b33ddd@oracle.com>


On 11/3/25 11:14 AM, Dai Ngo wrote:
>
> On 11/3/25 10:57 AM, Chuck Lever wrote:
>> On 11/3/25 1:50 PM, Dai Ngo wrote:
>>> On 11/3/25 6:16 AM, Chuck Lever wrote:
>>>> On 11/3/25 6:45 AM, Christoph Hellwig wrote:
>>>>> On Sat, Nov 01, 2025 at 11:51:34AM -0700, Dai Ngo wrote:
>>>>>> NFS4ERR_RETRY_UNCACHED_REP error means the client has seen and replied
>>>>>> to the layout recall, so no fencing is needed.
>>>>> RFC 5661 specifies that error as:
>>>>>
>>>>>     The requester has attempted a retry of a Compound that it
>>>>>     previously requested not be placed in the reply cache.
>>>>>
>>>>> which to me doesn't imply a positive action here.
>>>> Agreed, this status code seems like a loss of synchronization of
>>>> session state between the client and server, or an implementation
>>>> bug. I.e., it seems to mean that, at the very least, session
>>>> re-negotiation is needed, at first blush. Should the server mark a
>>>> callback channel FAULT, for instance?
>>>>
>>>>
>>>>> But I'm not an
>>>>> expert at reply cache semantics, so I'll leave others to correct me.
>>>>> But please add a more detailed comment and commit log as this is
>>>>> completely unintuitive.
>>>> The session state and the state of the layout are at two different
>>>> and separate layers. Connect the dots to show that not fencing is
>>>> the correct action and will result in recovery of full backchannel
>>>> operation while maintaining the integrity of the file's content.
>>>>
>>>> So IMHO reviewers need this patch description to provide:
>>>>
>>>> - How this came up during your testing (and maybe a small reproducer)
>>>>
>>>> - An explanation of why leaving the client unfenced is appropriate
>>>>
>>>> - A discussion of what will happen when the server subsequently sends
>>>>     another operation on this session slot
>>> Here is the sequence of events that leads to
>>> NFS4ERR_RETRY_UNCACHED_REP:
>>>
>>> 1. Server sends CB_LAYOUTRECALL with stateID seqid 2
>>> 2. Client replies NFS4ERR_NOMATCHING_LAYOUT
>>> 3. Server does not receive the reply due to a hard hang: no server
>>>     thread is available to service the reply (I will post a fix for
>>>     this problem)
>>> 4. Server RPC times out waiting for the reply, nfsd4_cb_sequence_done
>>>     is called with cb_seq_status == 1, nfsd4_mark_cb_fault is called
>>>     and the request is re-queued.
>>> 5. Client receives the same CB_LAYOUTRECALL with stateID seqid 2
>>>     again and this time client replies with NFS4ERR_RETRY_UNCACHED_REP.
>>>
>>> This process repeats forever from step 4.
>>>
>>> In this scenario, the server does not have a chance to service the
>>> reply, therefore nfsd4_cb_layout_done is not called and no fencing
>>> happens. However, if a server thread somehow becomes available and
>>> nfsd4_cb_layout_done is called with the NFS4ERR_RETRY_UNCACHED_REP
>>> error, then the client is fenced. This stops the client from
>>> accessing the SCSI target for all layouts, which I think is a bit
>>> harsh and unnecessary.
>>>
>>> This problem can be easily reproduced by running the git test.
>> The problem is step 3, above. NFS4ERR_RETRY_UNCACHED_REP is not a
>> fix for that,
>
> Agreed, as I said, I will post a separate fix for the hang.
>
>>   and I disagree that fencing is harsh, because
>> NFS4ERR_RETRY_UNCACHED_REP is supposed to be quite rare, and of course
>> there are other ways this error can happen.
>
> Yes, this error should be rare. But is fencing the client the correct
> solution for it? IMHO, NFS4ERR_RETRY_UNCACHED_REP means the client has
> received and replied to the server; the server simply did not see the
> reply, for whatever reason. I think in this case we should just mark
> the back channel down and let the client recover it, instead of
> fencing the client.
>
>>
>> I don't understand the assessment that "the server does not have a
>> chance to service the reply". The server /sends/ replies. For the
>> backchannel, there should be an nfsd thread waiting for the reply...
>> unless I've misunderstood something.
>
> If all the server threads are waiting in __break_lease then there is
> no available server thread to service the replies, or any incoming
> requests from the client. That's the hard hang problem that I mentioned
> above.
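
For reference, here is a minimal sketch in C of the client-side slot check
behind steps 1 to 5 quoted above. It is only a model, not NFSD or client
code: struct cb_slot, client_handle_cb() and replay_lost_reply() are
hypothetical names, the whole compound is collapsed into a single status,
and the model assumes the recall is sent without asking the client to
cache the reply. The status values are the ones assigned in RFC 5661.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Status values as assigned in RFC 5661. */
enum {
	NFS4_OK                    = 0,
	NFS4ERR_NOMATCHING_LAYOUT  = 10060,
	NFS4ERR_SEQ_MISORDERED     = 10063,
	NFS4ERR_RETRY_UNCACHED_REP = 10068,
};

/* Hypothetical model of one backchannel slot as tracked by the client. */
struct cb_slot {
	uint32_t seqid;         /* last slot sequence ID executed */
	bool     cached;        /* was the last reply kept in the reply cache? */
	int      cached_status; /* status of the last reply, if cached */
};

/*
 * Hypothetical: status the client returns for a CB_LAYOUTRECALL compound
 * that arrives on this slot with the given slot sequence ID.
 */
static int client_handle_cb(struct cb_slot *slot, uint32_t seqid, bool cachethis)
{
	assert(slot);

	if (seqid == slot->seqid) {
		/* Retry of the last request: replay if cached, else fail. */
		return slot->cached ? slot->cached_status
				    : NFS4ERR_RETRY_UNCACHED_REP;
	}
	if (seqid != slot->seqid + 1)
		return NFS4ERR_SEQ_MISORDERED;

	/* New request: execute it (step 2 replies NOMATCHING_LAYOUT). */
	slot->seqid = seqid;
	slot->cached = cachethis;
	slot->cached_status = NFS4ERR_NOMATCHING_LAYOUT;
	return slot->cached_status;
}

/*
 * Hypothetical: the first reply is lost (step 3) and the server re-queues
 * the same request 'retries' times (steps 4 and 5). Returns the status
 * of the last reply the client produced.
 */
static int replay_lost_reply(struct cb_slot *slot, uint32_t seqid, int retries)
{
	int status = client_handle_cb(slot, seqid, false);

	for (int i = 0; i < retries; i++)
		status = client_handle_cb(slot, seqid, false);
	return status;
}
```

Starting from a zeroed slot, replay_lost_reply(&slot, 1, n) yields
NFS4ERR_RETRY_UNCACHED_REP for any n >= 1, which matches the loop from
step 4: the client has already executed the recall, and re-sending it on
the same slot with the same sequence ID cannot produce a different answer.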

I can either (1) drop this patch and keep the existing behavior or
(2) mark the back channel down and let recovery take place.

What do you think?

-Dai

