public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Jeff Layton <jlayton@kernel.org>
To: Thomas Haynes <loghyr@gmail.com>
Cc: Olga Kornievskaia <aglo@umich.edu>,
	Trond Myklebust <trondmy@kernel.org>,
	 Anna Schumaker	 <anna@kernel.org>,
	linux-nfs@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2 2/2] nfs: update inode ctime after removexattr operation
Date: Tue, 31 Mar 2026 18:36:18 -0400	[thread overview]
Message-ID: <2c9dc8067c77ae8cefecd5224eebee62cecb6781.camel@kernel.org> (raw)
In-Reply-To: <acxGNz07W03ijzM2@mana>

On Tue, 2026-03-31 at 15:16 -0700, Thomas Haynes wrote:
> On Tue, Mar 31, 2026 at 05:19:33PM -0800, Jeff Layton wrote:
> > On Tue, 2026-03-31 at 11:47 -0700, Thomas Haynes wrote:
> > > On Tue, Mar 31, 2026 at 07:42:56AM -0800, Jeff Layton wrote:
> > > > On Fri, 2026-03-27 at 11:22 -0700, Thomas Haynes wrote:
> > > > > On Fri, Mar 27, 2026 at 12:59:54PM -0800, Jeff Layton wrote:
> > > > > > On Fri, 2026-03-27 at 12:20 -0400, Olga Kornievskaia wrote:
> > > > > > > On Fri, Mar 27, 2026 at 11:50 AM Jeff Layton <jlayton@kernel.org> wrote:
> > > > > > > > 
> > > > > > > > On Fri, 2026-03-27 at 11:11 -0400, Olga Kornievskaia wrote:
> > > > > > > > > On Tue, Mar 24, 2026 at 1:32 PM Jeff Layton <jlayton@kernel.org> wrote:
> > > > > > > > > > 
> > > > > > > > > > xfstest generic/728 fails with delegated timestamps. The client does a
> > > > > > > > > > removexattr and then a stat to test the ctime, which doesn't change. The
> > > > > > > > > > stat() doesn't trigger a GETATTR because of the delegated timestamps, so
> > > > > > > > > > it relies on the cached ctime, which is wrong.
> > > > > > > > > > 
> > > > > > > > > > The setxattr compound has a trailing GETATTR, which ensures that its
> > > > > > > > > > ctime gets updated. Follow the same strategy with removexattr.
> > > > > > > > > 
> > > > > > > > > This approach relies on the fact that the server the serves delegated
> > > > > > > > > attributes would update change_attr on operations which might now
> > > > > > > > > necessarily happen (ie, linux server does not update change_attribute
> > > > > > > > > on writes or clone). I propose an alternative fix for the failing
> > > > > > > > > generic/728.
> > > > > > > > > 
> > > > > > > > > diff --git a/fs/nfs/nfs42proc.c b/fs/nfs/nfs42proc.c
> > > > > > > > > index 7b3ca68fb4bb..ede1835a45b3 100644
> > > > > > > > > --- a/fs/nfs/nfs42proc.c
> > > > > > > > > +++ b/fs/nfs/nfs42proc.c
> > > > > > > > > @@ -1389,7 +1389,13 @@ static int _nfs42_proc_removexattr(struct inode
> > > > > > > > > *inode, const char *name)
> > > > > > > > >             &res.seq_res, 1);
> > > > > > > > >         trace_nfs4_removexattr(inode, name, ret);
> > > > > > > > >         if (!ret)
> > > > > > > > > -               nfs4_update_changeattr(inode, &res.cinfo, timestamp, 0);
> > > > > > > > > +               if (nfs_have_delegated_attributes(inode)) {
> > > > > > > > > +                       nfs_update_delegated_mtime(inode);
> > > > > > > > > +                       spin_lock(&inode->i_lock);
> > > > > > > > > +                       nfs_set_cache_invalid(inode, NFS_INO_INVALID_BLOCKS);
> > > > > > > > > +                       spin_unlock(&inode->i_lock);
> > > > > > > > > +               } else
> > > > > > > > > +                       nfs4_update_changeattr(inode, &res.cinfo, timestamp, 0);
> > > > > > > > > 
> > > > > > > > >         return ret;
> > > > > > > > >  }
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > What's the advantage of doing it this way?
> > > > > > > > 
> > > > > > > > You just sent a REMOVEXATTR operation to the server that will change
> > > > > > > > the mtime there. The server has the most up-to-date version of the
> > > > > > > > mtime and ctime at that point.
> > > > > > > 
> > > > > > > In presence of delegated attributes, Is the server required to update
> > > > > > > its mtime/ctime on an operation? As I mentioned, the linux server does
> > > > > > > not update its ctime/mtime for WRITE, CLONE, COPY.
> > > > > > > 
> > > > > > > Is possible that
> > > > > > > some implementations might be different and also do not update the
> > > > > > > ctime/mtime on REMOVEXATTR?
> > > > > > > 
> > > > > > > Therefore I was suggesting that the patch
> > > > > > > relies on the fact that it would receive an updated value. Of course
> > > > > > > perhaps all implementations are done the same as the linux server and
> > > > > > > my point is moot. I didn't see anything in the spec that clarifies
> > > > > > > what the server supposed to do (and client rely on).
> > > > > > > 
> > > > > > 
> > > > > > (cc'ing Tom)
> > > > > > 
> > > > > > That is a very good point.
> > > > > > 
> > > > > > My interpretation was that delegated timestamps generally covered
> > > > > > writes, but SETATTR style operations that do anything beyond only
> > > > > > changing the mtime can't be cached.
> > > > > > 
> > > > > > We probably need some delstid spec clarification: for what operations
> > > > > > is the server required to disable timestamp updates when a write
> > > > > > delegation is outstanding?
> > > > > > 
> > > > > > In the case of nfsd, we disable timestamp updates for WRITE/COPY/CLONE
> > > > > > but not SETATTR/SETXATTR/REMOVEXATTR.
> > > > > > 
> > > > > > How does the Hammerspace anvil behave? Does it disable c/mtime updates
> > > > > > for writes when there is an outstanding timestamp delegation like we're
> > > > > > doing in nfsd? If so, does it do the same for
> > > > > > SETATTR/SETXATTR/REMOVEXATTR operations as well?
> > > > > 
> > > > > Jeff,
> > > > > 
> > > > > I think the right way to look at this is closer to how size is
> > > > > handled under delegation in RFC8881, rather than as a per-op rule.
> > > > > 
> > > > > In our implementation, because we are acting as an MDS and data I/O
> > > > > goes to DSes, we already treat size as effectively delegated when
> > > > > a write layout is outstanding. The MDS does not maintain authoritative
> > > > > size locally in that case. We may refresh size/timestamps internally
> > > > > (e.g., on GETATTR by querying DSes), but we don’t treat that as
> > > > > overriding the delegated authority.
> > > > > 
> > > > > For timestamps, our behavior is effectively the same model. When
> > > > > the client holds the relevant delegation, the server does not
> > > > > consider itself authoritative for ctime/mtime. If current values
> > > > > are needed, we can obtain them from the client (e.g., via CB_GETATTR),
> > > > > and the client must present the delegation stateid to demonstrate
> > > > > that authority. So the authority follows the delegation, not the
> > > > > specific operation.
> > > > > 
> > > > > That said, I don’t think we’ve fully resolved the semantics for all
> > > > > metadata-style ops either. WRITE and SETATTR are clear in our model,
> > > > > but for things like CLONE/COPY/SETXATTR/REMOVEXATTR, we’ve likely
> > > > > been relying on assumptions rather than a fully consistent rule.
> > > > > I.e., CLONE and COPY we just pass through to the DS and we don't
> > > > > implement SETXATTR/REMOVEXATTR.
> > > > > 
> > > > > So the spec question, as I see it, is not whether REMOVEXATTR (or
> > > > > any particular op) should update ctime/mtime, but whether delegated
> > > > > timestamps are meant to follow the same attribute-authority model
> > > > > as delegated size in RFC8881. If so, then we expect that the server
> > > > > should query the client via CB_GETATTR to return updated ctime/mtime
> > > > > after such operations while the delegation is outstanding.
> > > > > 
> > > > 
> > > > The dilemma we have is: because we _do_ allow local processes to stat()
> > > > files that have an outstanding write delegation, we can never allow the
> > > > ctime in particular to roll backward (modulo clock jumps).
> > > 
> > > I agree we do not want visible ctime rollback, but I do not see how
> > > that can be guaranteed from delegated timestamps alone when the
> > > authoritative timestamp may be generated on a different node with
> > > a different clock and the object may change during the CB_GETATTR
> > > window. That seems to require either monotonic clamping of exposed
> > > ctime, or treating change_attr rather than ctime as the real
> > > serialization signal.
> > > 

We basically disable c/mtime updates on the delegation when it's handed
out and record the current time (A). We then clamp the delegated
timestamp update to between A and "now" when the SETATTR is done.
That's enough to ensure that nothing rolls backward but we can still
respect the client's update assuming the clocks are close enough.

> > > > 
> > > > If we're dealing with changes that have been cached in the client and
> > > > are being lazily flushed out, then we can't update the timestamp when
> > > > that operation occurs. The time of the RPC to flush the changes will
> > > > almost certainly be later than the cached timestamps on the client that
> > > > will eventually be set, so when the client comes back we'd end up
> > > > violating the rollback rule.
> > > > 
> > > > Our only option is to freeze timestamp updates on anything that might
> > > > represent such an operation. So far, we only do that on WRITE and COPY
> > > > operations -- in general, operations that require an open file, since
> > > > FMODE_NOCMTIME is attached to the file.
> > > > 
> > > > Some SETATTRs that only update the mtime and atime can be cached on the
> > > > client by virtue of the fact that it's authoritative for timestamps.
> > > > There are some exceptions though:
> > > > 
> > > > - atime-only updates can't be cached since the ctime won't change with
> > > > a timestamp update if the mtime didn't change
> > > > 
> > > > - if you set the mtime to a time that is later than the time you got
> > > > the delegation from the server, but earlier than the current time, you
> > > > can't cache that. The ctime would be later than the mtime in that case,
> > > > and we don't have a mechanism to handle that in a delegated timestamp
> > > > SETATTR.
> > > > 
> > > > I don't see how you could reasonably buffer a SETXATTR or REMOVEXATTR
> > > > operation to be sent later. These need to be done synchronously since
> > > > they could always fail for some reason and we don't have a mechanism at
> > > > the syscall layer to handle a deferred error. They also only update the
> > > > ctime and not the mtime, and we have no mechanism to do that with
> > > > delegated timestamps.
> > > > 
> > > > Based on that, I think the client and server both need to ignore the
> > > > timestamp delegation on a SETXATTR or REMOVEXATTR. The server should
> > > > update the ctime and the client needs to send a trailing GETATTR on the
> > > > REMOVEXATTR compound in order to get it and the change attr.
> > > 
> > > One concern I have with a per-op formulation is extensibility. If
> > > delegated timestamp behavior is defined by enumerating specific
> > > operations, then every new operation added to the protocol creates
> > > a fresh ambiguity until the spec is updated again. It seems better
> > > to define the behavior in terms of operation properties - e.g., whether
> > > the operation is synchronously visible, can be deferred/cached at
> > > the client, and whether it affects only ctime versus mtime/atime -
> > > so future operations can be classified without reopening the base
> > > rule.
> > > 
> > > I.e., I can't tell if you want me to update the spec with
> > > guidance per-op or you are just documenting what you did.
> > > 
> > > 
> > 
> > I think we probably need some guidance in the spec, and I think that
> > guidance comes down to: operations that don't have a way to report a
> > delayed error condition can't be buffered on the client and must
> > continue to be done synchronously even if a delegation is held.
> > 
> > By way of example: if I do a write() on the client I can buffer that
> > because userland can eventually do an fsync() to see if it succeeded.
> > This is not true for syscalls like setxattr() or removexattr(), or most
> > syscalls that result in a SETATTR operation (chmod(), chown(), etc).
> > 
> > They must be done synchronously because:
> > 
> > 1/ there's no way to update only the ctime in a delegated timestamp
> > update
> > 
> > ...and...
> > 
> > 2/ these syscalls can fail, so we can't return from them until we know
> > the outcome.
> > 
> > How best to phrase this guidance, I'm not sure...
> > -- 
> > Jeff Layton <jlayton@kernel.org>
> 
> One concern I have with the "ops that can report delayed error"
> formulation is that, in practice, any NFSv4 operation may encounter
> a temporary condition and return NFS4ERR_DELAY at execution time.
> Even a simple operation like GETFH may need to allocate reply
> resources and fail transiently. As such, the ability to convey delay
> is not a distinguishing property of a particular operation.
> 
> I think the relevant distinction is instead whether the client may
> complete the syscall locally and defer authoritative execution
> and/or result until a later point without violating its semantics.
> Some operations fit that model, but others do not.
> 
> Even WRITE is not inherently in the "may be deferred" category;
> that depends on the data path. For example, in a pNFS configuration
> where WRITE is routed synchronously to an MDS, completion semantics
> differ from a buffered writeback model. This suggests that the
> classification should not be tied to specific opcodes, but to the
> completion semantics of the operation in the context in which it
> is executed.
> 

Yeah, I phrased that poorly. I didn't mean something that would return
NFS4ERR_DELAY, but rather operations like unstable WRITEs and
asynchronous COPY operations (and anything else that falls into that
model) that can fail after the server has returned from the initial
operation.

> Given that, it may be more useful for the spec to describe delegated
> timestamp behavior in terms of these semantics:
> 
>    o Operations for which the client may lawfully acknowledge
>    completion before authoritative server execution has completed
>    may use delegated timestamp authority, subject to the constraints
>    of that deferred-completion model.
> 
>    o Operations that require authoritative execution status before
>    completion of the user-visible request must be executed synchronously
>    and must not rely on deferred timestamp resolution.
> 
> This avoids per-op enumeration and allows future operations to be
> classified based on their semantics rather than requiring explicit
> specification updates.

Ack. Sounds good to me.
-- 
Jeff Layton <jlayton@kernel.org>

      reply	other threads:[~2026-03-31 22:36 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-03-24 17:32 [PATCH v2 0/2] nfs: delegated attribute fixes Jeff Layton
2026-03-24 17:32 ` [PATCH v2 1/2] nfs: fix utimensat() for atime with delegated timestamps Jeff Layton
2026-03-25 12:30   ` Jeff Layton
2026-04-07 13:28     ` Olga Kornievskaia
2026-04-07 21:24       ` Olga Kornievskaia
2026-03-24 17:32 ` [PATCH v2 2/2] nfs: update inode ctime after removexattr operation Jeff Layton
2026-03-27 15:11   ` Olga Kornievskaia
2026-03-27 15:50     ` Jeff Layton
2026-03-27 16:20       ` Olga Kornievskaia
2026-03-27 16:59         ` Jeff Layton
2026-03-27 18:22           ` Thomas Haynes
2026-03-27 19:36             ` Olga Kornievskaia
     [not found]               ` <acbfV0C-fAy4nZ-i@mana>
2026-03-31 18:17                 ` Olga Kornievskaia
2026-03-31 18:50                   ` Thomas Haynes
2026-03-31 21:10                     ` Olga Kornievskaia
2026-03-31 22:57                   ` Rick Macklem
2026-03-31 11:42             ` Jeff Layton
2026-03-31 18:47               ` Thomas Haynes
2026-03-31 21:19                 ` Jeff Layton
2026-03-31 22:16                   ` Thomas Haynes
2026-03-31 22:36                     ` Jeff Layton [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=2c9dc8067c77ae8cefecd5224eebee62cecb6781.camel@kernel.org \
    --to=jlayton@kernel.org \
    --cc=aglo@umich.edu \
    --cc=anna@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-nfs@vger.kernel.org \
    --cc=loghyr@gmail.com \
    --cc=trondmy@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox