From: Trond Myklebust <trondmy@kernel.org>
To: Jingbo Xu <jefflexu@linux.alibaba.com>,
anna@kernel.org, linux-nfs@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, joseph.qi@linux.alibaba.com,
linux-xfs@vger.kernel.org
Subject: Re: [PATCH] NFS: invalidate i_blocks after COMMIT to fix stale block count on NFSv4
Date: Tue, 23 Jun 2026 00:51:16 -0400 [thread overview]
Message-ID: <2098148dfb9e156dd88242afecdea5d372b7a169.camel@kernel.org> (raw)
In-Reply-To: <71d4ac02-d760-49ab-a01c-e7d1a6662a18@linux.alibaba.com>
On Tue, 2026-06-23 at 12:04 +0800, Jingbo Xu wrote:
> +cc [linux-xfs@vger.kernel.org](mailto:linux-xfs@vger.kernel.org)
>
> On 6/22/26 9:56 PM, Trond Myklebust wrote:
>
> > On Mon, 2026-06-22 at 14:00 +0800, Jingbo Xu wrote:
> >
> > > NFSv4 COMMIT compound does not include GETATTR, and
> > > nfs4_commit_done_cb
> > > does not refresh inode attributes. Meanwhile, every WRITE marks
> > > NFS_INO_INVALID_BLOCKS via nfs_post_op_update_inode_force_wcc_locked.
> > >
> > > After COMMIT, i_blocks remains stale until the next stat() triggers a
> > > full revalidation. In writeback-heavy workloads where COMMITs happen
> > > without intervening stat() calls, the cached block count can stay
> > > indefinitely wrong.
> > >
> > > Mark NFS_INO_INVALID_BLOCKS on successful COMMIT completion so that
> > > the
> > > next nfs_getattr requesting STATX_BLOCKS will issue a GETATTR with
> > > SPACE_USED, fetching the correct value from the server.
> > >
> > > This matches NFSv3 behavior where nfs3_commit_done already calls
> > > nfs_refresh_inode with the wcc_data post-op attributes.
> > >
> > > Reproduce with xfstests generic/694 on NFSv4.0 loopback:
> > >
> > > Server:
> > > mount /dev/vdc /data/test
> > > mount /dev/vdd /data/scratch
> > > exportfs -o insecure,rw,sync,no_root_squash,fsid=1
> > > 127.0.0.1:/data/test
> > > exportfs -o insecure,rw,sync,no_root_squash,fsid=2
> > > 127.0.0.1:/data/scratch
> > >
> > > Client:
> > > mount -t nfs -o vers=4.0 localhost:/data/test /mnt/test
> > > mount -t nfs -o vers=4.0 localhost:/data/scratch /mnt/scratch
> > >
> > > local.config:
> > > export TEST_FS_MOUNT_OPTS="-o vers=4.0"
> > > export MOUNT_OPTIONS="-o vers=4.0"
> > > export FSTYP=nfs
> > > export TEST_DEV=localhost:/data/test
> > > export SCRATCH_DEV=localhost:/data/scratch
> > > export TEST_DIR=/mnt/test
> > > export SCRATCH_MNT=/mnt/scratch
> > >
> > > This fixes xfstests generic/694.
> > >
> > > Assisted-by: Qoder:Qwen3.7-Max
> > > Signed-off-by: Jingbo Xu <[jefflexu@linux.alibaba.com](mailto:jefflexu@linux.alibaba.com)>
> > > ---
> > > fs/nfs/write.c | 2 ++
> > > 1 file changed, 2 insertions(+)
> > >
> > > diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> > > index d7c399763ad9..88c5c9f7434c 100644
> > > --- a/fs/nfs/write.c
> > > +++ b/fs/nfs/write.c
> > > @@ -1851,6 +1851,8 @@ static void nfs_commit_release_pages(struct
> > > nfs_commit_data *data)
> > > /* Latency breaker */
> > > cond_resched();
> > > }
> > > + if (status >= 0)
> > > + nfs_set_cache_invalid(data->inode,
> > > NFS_INO_INVALID_BLOCKS);
> > >
> > > nfs_init_cinfo(&cinfo, data->inode, data->dreq);
> > > nfs_commit_end(cinfo.mds);
> >
> >
> > That sounds like an XFS bug, not an NFS bug. COMMIT isn't changing the
> > data contents of the file: it is just ensuring that the existing data
> > is persisted onto disk.
> >
>
>
> Yes the underlying backend filesystem is indeed xfs.
>
> I retest it and can confirm that this failure is much likely
> reproducible on NFS upon XFS, while NFS upon EXT4 succeeds for 50
> consecutive test runs.
>
> Beside XFS itself also passes the test.
>
>
> To be honest I'm not familiar with NFS, following is what AI concludes:
>
> ```
> Root Cause Timing Sequence: NFSv4.0 Stale i_blocks After syncfs
>
>
> Client issues multiple UNSTABLE WRITEs — each WRITE compound includes
> a piggy-backed GETATTR that returns the server's current SPACE_USED. The
> client updates inode->i_blocks from the last completed WRITE's post-op
> attributes.
>
> Server-side delayed allocation — the export's local filesystem
> (ext4/xfs) uses delayed allocation. Metadata blocks (indirect blocks /
> extent tree nodes) are not yet allocated during most WRITE handling;
> they materialize only when the local fs performs its own writeback.
>
> Last WRITE completes before server metadata writeback (~80%
> probability in user's environment) — the final GETATTR returns
> SPACE_USED = 8388608 (data only, no metadata block). Client caches
> i_blocks = 8388608.
>
> syncfs triggers COMMIT — nfs_write_inode(WB_SYNC_ALL) calls
> __nfs_commit_inode, which sends a COMMIT RPC to the server.
> Server executes vfs_fsync_range — this forces the local fs writeback,
> which now allocates the metadata block. Server's true SPACE_USED becomes
> 8388616 (+8 sectors = 4 KiB).
>
> COMMIT response carries no file attributes — per RFC 7530 §16.3.3,
> COMMIT4resok contains only a write verifier. The XDR decoder
> (nfs4_xdr_dec_commit) never calls decode_getfattr.
>
> nfs_commit_release_pages performs no attribute invalidation — it
> neither sends a follow-up GETATTR nor marks NFS_INO_INVALID_BLOCKS. The
> stale cached value persists.
>
> Subsequent stat returns stale i_blocks — cache_validity is clean, so
> nfs_getattr serves the cached value 8388608 without revalidation.
>
> After cycle_mount, fresh GETATTR returns 8388616 — mismatch detected,
> test fails.
> ```
>
I agree with your AI that this may indeed be a consequence of XFS's delayed allocation feature. However that hardly changes the fact that it would be a violation of the POSIX rules for how write(), fsync() and stat() are supposed to work in this situation.
My point is that I fail to see why we should degrade the performance of the generic NFS client in order to address a bug in one of the back end filesystems (in this case XFS) being exported.
So strong NACK to this patch from me.
```
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trondmy@kernel.org, trond.myklebust@hammerspace.com
```
next prev parent reply other threads:[~2026-06-23 4:51 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <20260622060038.13731-1-jefflexu@linux.alibaba.com>
[not found] ` <2d09ab9a74d1eb21c99454dba8be597612d20efa.camel@kernel.org>
2026-06-23 4:04 ` [PATCH] NFS: invalidate i_blocks after COMMIT to fix stale block count on NFSv4 Jingbo Xu
2026-06-23 4:51 ` Trond Myklebust [this message]
2026-06-23 4:54 ` Jingbo Xu
2026-06-23 13:40 ` Brian Foster
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=2098148dfb9e156dd88242afecdea5d372b7a169.camel@kernel.org \
--to=trondmy@kernel.org \
--cc=anna@kernel.org \
--cc=jefflexu@linux.alibaba.com \
--cc=joseph.qi@linux.alibaba.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-nfs@vger.kernel.org \
--cc=linux-xfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox