Linux XFS filesystem development
 help / color / mirror / Atom feed
From: Trond Myklebust <trondmy@kernel.org>
To: Jingbo Xu <jefflexu@linux.alibaba.com>,
	anna@kernel.org,  linux-nfs@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, joseph.qi@linux.alibaba.com,
	 linux-xfs@vger.kernel.org
Subject: Re: [PATCH] NFS: invalidate i_blocks after COMMIT to fix stale block count on NFSv4
Date: Tue, 23 Jun 2026 00:51:16 -0400	[thread overview]
Message-ID: <2098148dfb9e156dd88242afecdea5d372b7a169.camel@kernel.org> (raw)
In-Reply-To: <71d4ac02-d760-49ab-a01c-e7d1a6662a18@linux.alibaba.com>

On Tue, 2026-06-23 at 12:04 +0800, Jingbo Xu wrote:

> +cc [linux-xfs@vger.kernel.org](mailto:linux-xfs@vger.kernel.org)
> 
> On 6/22/26 9:56 PM, Trond Myklebust wrote:
> 
> > On Mon, 2026-06-22 at 14:00 +0800, Jingbo Xu wrote:
> > 
> > > NFSv4 COMMIT compound does not include GETATTR, and
> > > nfs4_commit_done_cb
> > > does not refresh inode attributes. Meanwhile, every WRITE marks
> > > NFS_INO_INVALID_BLOCKS via nfs_post_op_update_inode_force_wcc_locked.
> > > 
> > > After COMMIT, i_blocks remains stale until the next stat() triggers a
> > > full revalidation. In writeback-heavy workloads where COMMITs happen
> > > without intervening stat() calls, the cached block count can stay
> > > indefinitely wrong.
> > > 
> > > Mark NFS_INO_INVALID_BLOCKS on successful COMMIT completion so that
> > > the
> > > next nfs_getattr requesting STATX_BLOCKS will issue a GETATTR with
> > > SPACE_USED, fetching the correct value from the server.
> > > 
> > > This matches NFSv3 behavior where nfs3_commit_done already calls
> > > nfs_refresh_inode with the wcc_data post-op attributes.
> > > 
> > > Reproduce with xfstests generic/694 on NFSv4.0 loopback:
> > > 
> > >   Server:
> > >     mount /dev/vdc /data/test
> > >     mount /dev/vdd /data/scratch
> > >     exportfs -o insecure,rw,sync,no_root_squash,fsid=1
> > > 127.0.0.1:/data/test
> > >     exportfs -o insecure,rw,sync,no_root_squash,fsid=2
> > > 127.0.0.1:/data/scratch
> > > 
> > >   Client:
> > >     mount -t nfs -o vers=4.0 localhost:/data/test /mnt/test
> > >     mount -t nfs -o vers=4.0 localhost:/data/scratch /mnt/scratch
> > > 
> > >   local.config:
> > >     export TEST_FS_MOUNT_OPTS="-o vers=4.0"
> > >     export MOUNT_OPTIONS="-o vers=4.0"
> > >     export FSTYP=nfs
> > >     export TEST_DEV=localhost:/data/test
> > >     export SCRATCH_DEV=localhost:/data/scratch
> > >     export TEST_DIR=/mnt/test
> > >     export SCRATCH_MNT=/mnt/scratch
> > > 
> > > This fixes xfstests generic/694.
> > > 
> > > Assisted-by: Qoder:Qwen3.7-Max
> > > Signed-off-by: Jingbo Xu <[jefflexu@linux.alibaba.com](mailto:jefflexu@linux.alibaba.com)>
> > > ---
> > >  fs/nfs/write.c | 2 ++
> > >  1 file changed, 2 insertions(+)
> > > 
> > > diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> > > index d7c399763ad9..88c5c9f7434c 100644
> > > --- a/fs/nfs/write.c
> > > +++ b/fs/nfs/write.c
> > > @@ -1851,6 +1851,8 @@ static void nfs_commit_release_pages(struct
> > > nfs_commit_data *data)
> > >  		/* Latency breaker */
> > >  		cond_resched();
> > >  	}
> > > +	if (status >= 0)
> > > +		nfs_set_cache_invalid(data->inode,
> > > NFS_INO_INVALID_BLOCKS);
> > >  
> > >  	nfs_init_cinfo(&cinfo, data->inode, data->dreq);
> > >  	nfs_commit_end(cinfo.mds);
> > 
> > 
> > That sounds like an XFS bug, not an NFS bug. COMMIT isn't changing the
> > data contents of the file: it is just ensuring that the existing data
> > is persisted onto disk.
> > 
> 
> 
> Yes the underlying backend filesystem is indeed xfs.
> 
> I retest it and can confirm that this failure is much likely
> reproducible on NFS upon XFS, while NFS upon EXT4 succeeds for 50
> consecutive test runs.
> 
> Beside XFS itself also passes the test.
> 
> 
> To be honest I'm not familiar with NFS, following is what AI concludes:
> 
> ```
> Root Cause Timing Sequence: NFSv4.0 Stale i_blocks After syncfs
> 
> 
>    Client issues multiple UNSTABLE WRITEs — each WRITE compound includes
> a piggy-backed GETATTR that returns the server's current SPACE_USED. The
> client updates inode->i_blocks from the last completed WRITE's post-op
> attributes.
> 
>    Server-side delayed allocation — the export's local filesystem
> (ext4/xfs) uses delayed allocation. Metadata blocks (indirect blocks /
> extent tree nodes) are not yet allocated during most WRITE handling;
> they materialize only when the local fs performs its own writeback.
> 
>    Last WRITE completes before server metadata writeback (~80%
> probability in user's environment) — the final GETATTR returns
> SPACE_USED = 8388608 (data only, no metadata block). Client caches
> i_blocks = 8388608.
> 
>    syncfs triggers COMMIT — nfs_write_inode(WB_SYNC_ALL) calls
> __nfs_commit_inode, which sends a COMMIT RPC to the server.
>    Server executes vfs_fsync_range — this forces the local fs writeback,
> which now allocates the metadata block. Server's true SPACE_USED becomes
> 8388616 (+8 sectors = 4 KiB).
> 
>    COMMIT response carries no file attributes — per RFC 7530 §16.3.3,
> COMMIT4resok contains only a write verifier. The XDR decoder
> (nfs4_xdr_dec_commit) never calls decode_getfattr.
> 
>    nfs_commit_release_pages performs no attribute invalidation — it
> neither sends a follow-up GETATTR nor marks NFS_INO_INVALID_BLOCKS. The
> stale cached value persists.
> 
>    Subsequent stat returns stale i_blocks — cache_validity is clean, so
> nfs_getattr serves the cached value 8388608 without revalidation.
> 
>    After cycle_mount, fresh GETATTR returns 8388616 — mismatch detected,
> test fails.
> ```
>


I agree with your AI that this may indeed be a consequence of XFS's delayed allocation feature. However that hardly changes the fact that it would be a violation of the POSIX rules for how write(), fsync() and stat() are supposed to work in this situation.

My point is that I fail to see why we should degrade the performance of the generic NFS client in order to address a bug in one of the back end filesystems (in this case XFS) being exported.

So strong NACK to this patch from me.

```
-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trondmy@kernel.org, trond.myklebust@hammerspace.com

```


  reply	other threads:[~2026-06-23  4:51 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20260622060038.13731-1-jefflexu@linux.alibaba.com>
     [not found] ` <2d09ab9a74d1eb21c99454dba8be597612d20efa.camel@kernel.org>
2026-06-23  4:04   ` [PATCH] NFS: invalidate i_blocks after COMMIT to fix stale block count on NFSv4 Jingbo Xu
2026-06-23  4:51     ` Trond Myklebust [this message]
2026-06-23  4:54       ` Jingbo Xu
2026-06-23 13:40         ` Brian Foster

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=2098148dfb9e156dd88242afecdea5d372b7a169.camel@kernel.org \
    --to=trondmy@kernel.org \
    --cc=anna@kernel.org \
    --cc=jefflexu@linux.alibaba.com \
    --cc=joseph.qi@linux.alibaba.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-nfs@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox