* [PATCH] NFS: invalidate i_blocks after COMMIT to fix stale block count on NFSv4
@ 2026-06-22 6:00 Jingbo Xu
2026-06-22 13:56 ` Trond Myklebust
0 siblings, 1 reply; 6+ messages in thread
From: Jingbo Xu @ 2026-06-22 6:00 UTC (permalink / raw)
To: trondmy, anna, linux-nfs; +Cc: linux-kernel, joseph.qi
NFSv4 COMMIT compound does not include GETATTR, and nfs4_commit_done_cb
does not refresh inode attributes. Meanwhile, every WRITE marks
NFS_INO_INVALID_BLOCKS via nfs_post_op_update_inode_force_wcc_locked.
After COMMIT, i_blocks remains stale until the next stat() triggers a
full revalidation. In writeback-heavy workloads where COMMITs happen
without intervening stat() calls, the cached block count can stay
indefinitely wrong.
Mark NFS_INO_INVALID_BLOCKS on successful COMMIT completion so that the
next nfs_getattr requesting STATX_BLOCKS will issue a GETATTR with
SPACE_USED, fetching the correct value from the server.
This matches NFSv3 behavior where nfs3_commit_done already calls
nfs_refresh_inode with the wcc_data post-op attributes.
Reproduce with xfstests generic/694 on NFSv4.0 loopback:
Server:
mount /dev/vdc /data/test
mount /dev/vdd /data/scratch
exportfs -o insecure,rw,sync,no_root_squash,fsid=1 127.0.0.1:/data/test
exportfs -o insecure,rw,sync,no_root_squash,fsid=2 127.0.0.1:/data/scratch
Client:
mount -t nfs -o vers=4.0 localhost:/data/test /mnt/test
mount -t nfs -o vers=4.0 localhost:/data/scratch /mnt/scratch
local.config:
export TEST_FS_MOUNT_OPTS="-o vers=4.0"
export MOUNT_OPTIONS="-o vers=4.0"
export FSTYP=nfs
export TEST_DEV=localhost:/data/test
export SCRATCH_DEV=localhost:/data/scratch
export TEST_DIR=/mnt/test
export SCRATCH_MNT=/mnt/scratch
This fixes xfstests generic/694.
Assisted-by: Qoder:Qwen3.7-Max
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
---
fs/nfs/write.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index d7c399763ad9..88c5c9f7434c 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -1851,6 +1851,8 @@ static void nfs_commit_release_pages(struct nfs_commit_data *data)
/* Latency breaker */
cond_resched();
}
+ if (status >= 0)
+ nfs_set_cache_invalid(data->inode, NFS_INO_INVALID_BLOCKS);
nfs_init_cinfo(&cinfo, data->inode, data->dreq);
nfs_commit_end(cinfo.mds);
--
2.19.1.6.gb485710b
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH] NFS: invalidate i_blocks after COMMIT to fix stale block count on NFSv4
2026-06-22 6:00 [PATCH] NFS: invalidate i_blocks after COMMIT to fix stale block count on NFSv4 Jingbo Xu
@ 2026-06-22 13:56 ` Trond Myklebust
2026-06-23 4:04 ` Jingbo Xu
0 siblings, 1 reply; 6+ messages in thread
From: Trond Myklebust @ 2026-06-22 13:56 UTC (permalink / raw)
To: Jingbo Xu, anna, linux-nfs; +Cc: linux-kernel, joseph.qi
On Mon, 2026-06-22 at 14:00 +0800, Jingbo Xu wrote:
> NFSv4 COMMIT compound does not include GETATTR, and
> nfs4_commit_done_cb
> does not refresh inode attributes. Meanwhile, every WRITE marks
> NFS_INO_INVALID_BLOCKS via nfs_post_op_update_inode_force_wcc_locked.
>
> After COMMIT, i_blocks remains stale until the next stat() triggers a
> full revalidation. In writeback-heavy workloads where COMMITs happen
> without intervening stat() calls, the cached block count can stay
> indefinitely wrong.
>
> Mark NFS_INO_INVALID_BLOCKS on successful COMMIT completion so that
> the
> next nfs_getattr requesting STATX_BLOCKS will issue a GETATTR with
> SPACE_USED, fetching the correct value from the server.
>
> This matches NFSv3 behavior where nfs3_commit_done already calls
> nfs_refresh_inode with the wcc_data post-op attributes.
>
> Reproduce with xfstests generic/694 on NFSv4.0 loopback:
>
> Server:
> mount /dev/vdc /data/test
> mount /dev/vdd /data/scratch
> exportfs -o insecure,rw,sync,no_root_squash,fsid=1
> 127.0.0.1:/data/test
> exportfs -o insecure,rw,sync,no_root_squash,fsid=2
> 127.0.0.1:/data/scratch
>
> Client:
> mount -t nfs -o vers=4.0 localhost:/data/test /mnt/test
> mount -t nfs -o vers=4.0 localhost:/data/scratch /mnt/scratch
>
> local.config:
> export TEST_FS_MOUNT_OPTS="-o vers=4.0"
> export MOUNT_OPTIONS="-o vers=4.0"
> export FSTYP=nfs
> export TEST_DEV=localhost:/data/test
> export SCRATCH_DEV=localhost:/data/scratch
> export TEST_DIR=/mnt/test
> export SCRATCH_MNT=/mnt/scratch
>
> This fixes xfstests generic/694.
>
> Assisted-by: Qoder:Qwen3.7-Max
> Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
> ---
> fs/nfs/write.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index d7c399763ad9..88c5c9f7434c 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -1851,6 +1851,8 @@ static void nfs_commit_release_pages(struct
> nfs_commit_data *data)
> /* Latency breaker */
> cond_resched();
> }
> + if (status >= 0)
> + nfs_set_cache_invalid(data->inode,
> NFS_INO_INVALID_BLOCKS);
>
> nfs_init_cinfo(&cinfo, data->inode, data->dreq);
> nfs_commit_end(cinfo.mds);
That sounds like an XFS bug, not an NFS bug. COMMIT isn't changing the
data contents of the file: it is just ensuring that the existing data
is persisted onto disk.
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trondmy@kernel.org, trond.myklebust@hammerspace.com
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] NFS: invalidate i_blocks after COMMIT to fix stale block count on NFSv4
2026-06-22 13:56 ` Trond Myklebust
@ 2026-06-23 4:04 ` Jingbo Xu
2026-06-23 4:51 ` Trond Myklebust
0 siblings, 1 reply; 6+ messages in thread
From: Jingbo Xu @ 2026-06-23 4:04 UTC (permalink / raw)
To: Trond Myklebust, anna, linux-nfs; +Cc: linux-kernel, joseph.qi, linux-xfs
+cc linux-xfs@vger.kernel.org
On 6/22/26 9:56 PM, Trond Myklebust wrote:
> On Mon, 2026-06-22 at 14:00 +0800, Jingbo Xu wrote:
>> NFSv4 COMMIT compound does not include GETATTR, and
>> nfs4_commit_done_cb
>> does not refresh inode attributes. Meanwhile, every WRITE marks
>> NFS_INO_INVALID_BLOCKS via nfs_post_op_update_inode_force_wcc_locked.
>>
>> After COMMIT, i_blocks remains stale until the next stat() triggers a
>> full revalidation. In writeback-heavy workloads where COMMITs happen
>> without intervening stat() calls, the cached block count can stay
>> indefinitely wrong.
>>
>> Mark NFS_INO_INVALID_BLOCKS on successful COMMIT completion so that
>> the
>> next nfs_getattr requesting STATX_BLOCKS will issue a GETATTR with
>> SPACE_USED, fetching the correct value from the server.
>>
>> This matches NFSv3 behavior where nfs3_commit_done already calls
>> nfs_refresh_inode with the wcc_data post-op attributes.
>>
>> Reproduce with xfstests generic/694 on NFSv4.0 loopback:
>>
>> Server:
>> mount /dev/vdc /data/test
>> mount /dev/vdd /data/scratch
>> exportfs -o insecure,rw,sync,no_root_squash,fsid=1
>> 127.0.0.1:/data/test
>> exportfs -o insecure,rw,sync,no_root_squash,fsid=2
>> 127.0.0.1:/data/scratch
>>
>> Client:
>> mount -t nfs -o vers=4.0 localhost:/data/test /mnt/test
>> mount -t nfs -o vers=4.0 localhost:/data/scratch /mnt/scratch
>>
>> local.config:
>> export TEST_FS_MOUNT_OPTS="-o vers=4.0"
>> export MOUNT_OPTIONS="-o vers=4.0"
>> export FSTYP=nfs
>> export TEST_DEV=localhost:/data/test
>> export SCRATCH_DEV=localhost:/data/scratch
>> export TEST_DIR=/mnt/test
>> export SCRATCH_MNT=/mnt/scratch
>>
>> This fixes xfstests generic/694.
>>
>> Assisted-by: Qoder:Qwen3.7-Max
>> Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
>> ---
>> fs/nfs/write.c | 2 ++
>> 1 file changed, 2 insertions(+)
>>
>> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
>> index d7c399763ad9..88c5c9f7434c 100644
>> --- a/fs/nfs/write.c
>> +++ b/fs/nfs/write.c
>> @@ -1851,6 +1851,8 @@ static void nfs_commit_release_pages(struct
>> nfs_commit_data *data)
>> /* Latency breaker */
>> cond_resched();
>> }
>> + if (status >= 0)
>> + nfs_set_cache_invalid(data->inode,
>> NFS_INO_INVALID_BLOCKS);
>>
>> nfs_init_cinfo(&cinfo, data->inode, data->dreq);
>> nfs_commit_end(cinfo.mds);
>
> That sounds like an XFS bug, not an NFS bug. COMMIT isn't changing the
> data contents of the file: it is just ensuring that the existing data
> is persisted onto disk.
>
Yes the underlying backend filesystem is indeed xfs.
I retest it and can confirm that this failure is much likely
reproducible on NFS upon XFS, while NFS upon EXT4 succeeds for 50
consecutive test runs.
Beside XFS itself also passes the test.
To be honest I'm not familiar with NFS, following is what AI concludes:
```
Root Cause Timing Sequence: NFSv4.0 Stale i_blocks After syncfs
Client issues multiple UNSTABLE WRITEs — each WRITE compound includes
a piggy-backed GETATTR that returns the server's current SPACE_USED. The
client updates inode->i_blocks from the last completed WRITE's post-op
attributes.
Server-side delayed allocation — the export's local filesystem
(ext4/xfs) uses delayed allocation. Metadata blocks (indirect blocks /
extent tree nodes) are not yet allocated during most WRITE handling;
they materialize only when the local fs performs its own writeback.
Last WRITE completes before server metadata writeback (~80%
probability in user's environment) — the final GETATTR returns
SPACE_USED = 8388608 (data only, no metadata block). Client caches
i_blocks = 8388608.
syncfs triggers COMMIT — nfs_write_inode(WB_SYNC_ALL) calls
__nfs_commit_inode, which sends a COMMIT RPC to the server.
Server executes vfs_fsync_range — this forces the local fs writeback,
which now allocates the metadata block. Server's true SPACE_USED becomes
8388616 (+8 sectors = 4 KiB).
COMMIT response carries no file attributes — per RFC 7530 §16.3.3,
COMMIT4resok contains only a write verifier. The XDR decoder
(nfs4_xdr_dec_commit) never calls decode_getfattr.
nfs_commit_release_pages performs no attribute invalidation — it
neither sends a follow-up GETATTR nor marks NFS_INO_INVALID_BLOCKS. The
stale cached value persists.
Subsequent stat returns stale i_blocks — cache_validity is clean, so
nfs_getattr serves the cached value 8388608 without revalidation.
After cycle_mount, fresh GETATTR returns 8388616 — mismatch detected,
test fails.
```
--
Thanks,
Jingbo
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] NFS: invalidate i_blocks after COMMIT to fix stale block count on NFSv4
2026-06-23 4:04 ` Jingbo Xu
@ 2026-06-23 4:51 ` Trond Myklebust
2026-06-23 4:54 ` Jingbo Xu
0 siblings, 1 reply; 6+ messages in thread
From: Trond Myklebust @ 2026-06-23 4:51 UTC (permalink / raw)
To: Jingbo Xu, anna, linux-nfs; +Cc: linux-kernel, joseph.qi, linux-xfs
On Tue, 2026-06-23 at 12:04 +0800, Jingbo Xu wrote:
> +cc [linux-xfs@vger.kernel.org](mailto:linux-xfs@vger.kernel.org)
>
> On 6/22/26 9:56 PM, Trond Myklebust wrote:
>
> > On Mon, 2026-06-22 at 14:00 +0800, Jingbo Xu wrote:
> >
> > > NFSv4 COMMIT compound does not include GETATTR, and
> > > nfs4_commit_done_cb
> > > does not refresh inode attributes. Meanwhile, every WRITE marks
> > > NFS_INO_INVALID_BLOCKS via nfs_post_op_update_inode_force_wcc_locked.
> > >
> > > After COMMIT, i_blocks remains stale until the next stat() triggers a
> > > full revalidation. In writeback-heavy workloads where COMMITs happen
> > > without intervening stat() calls, the cached block count can stay
> > > indefinitely wrong.
> > >
> > > Mark NFS_INO_INVALID_BLOCKS on successful COMMIT completion so that
> > > the
> > > next nfs_getattr requesting STATX_BLOCKS will issue a GETATTR with
> > > SPACE_USED, fetching the correct value from the server.
> > >
> > > This matches NFSv3 behavior where nfs3_commit_done already calls
> > > nfs_refresh_inode with the wcc_data post-op attributes.
> > >
> > > Reproduce with xfstests generic/694 on NFSv4.0 loopback:
> > >
> > > Server:
> > > mount /dev/vdc /data/test
> > > mount /dev/vdd /data/scratch
> > > exportfs -o insecure,rw,sync,no_root_squash,fsid=1
> > > 127.0.0.1:/data/test
> > > exportfs -o insecure,rw,sync,no_root_squash,fsid=2
> > > 127.0.0.1:/data/scratch
> > >
> > > Client:
> > > mount -t nfs -o vers=4.0 localhost:/data/test /mnt/test
> > > mount -t nfs -o vers=4.0 localhost:/data/scratch /mnt/scratch
> > >
> > > local.config:
> > > export TEST_FS_MOUNT_OPTS="-o vers=4.0"
> > > export MOUNT_OPTIONS="-o vers=4.0"
> > > export FSTYP=nfs
> > > export TEST_DEV=localhost:/data/test
> > > export SCRATCH_DEV=localhost:/data/scratch
> > > export TEST_DIR=/mnt/test
> > > export SCRATCH_MNT=/mnt/scratch
> > >
> > > This fixes xfstests generic/694.
> > >
> > > Assisted-by: Qoder:Qwen3.7-Max
> > > Signed-off-by: Jingbo Xu <[jefflexu@linux.alibaba.com](mailto:jefflexu@linux.alibaba.com)>
> > > ---
> > > fs/nfs/write.c | 2 ++
> > > 1 file changed, 2 insertions(+)
> > >
> > > diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> > > index d7c399763ad9..88c5c9f7434c 100644
> > > --- a/fs/nfs/write.c
> > > +++ b/fs/nfs/write.c
> > > @@ -1851,6 +1851,8 @@ static void nfs_commit_release_pages(struct
> > > nfs_commit_data *data)
> > > /* Latency breaker */
> > > cond_resched();
> > > }
> > > + if (status >= 0)
> > > + nfs_set_cache_invalid(data->inode,
> > > NFS_INO_INVALID_BLOCKS);
> > >
> > > nfs_init_cinfo(&cinfo, data->inode, data->dreq);
> > > nfs_commit_end(cinfo.mds);
> >
> >
> > That sounds like an XFS bug, not an NFS bug. COMMIT isn't changing the
> > data contents of the file: it is just ensuring that the existing data
> > is persisted onto disk.
> >
>
>
> Yes the underlying backend filesystem is indeed xfs.
>
> I retest it and can confirm that this failure is much likely
> reproducible on NFS upon XFS, while NFS upon EXT4 succeeds for 50
> consecutive test runs.
>
> Beside XFS itself also passes the test.
>
>
> To be honest I'm not familiar with NFS, following is what AI concludes:
>
> ```
> Root Cause Timing Sequence: NFSv4.0 Stale i_blocks After syncfs
>
>
> Client issues multiple UNSTABLE WRITEs — each WRITE compound includes
> a piggy-backed GETATTR that returns the server's current SPACE_USED. The
> client updates inode->i_blocks from the last completed WRITE's post-op
> attributes.
>
> Server-side delayed allocation — the export's local filesystem
> (ext4/xfs) uses delayed allocation. Metadata blocks (indirect blocks /
> extent tree nodes) are not yet allocated during most WRITE handling;
> they materialize only when the local fs performs its own writeback.
>
> Last WRITE completes before server metadata writeback (~80%
> probability in user's environment) — the final GETATTR returns
> SPACE_USED = 8388608 (data only, no metadata block). Client caches
> i_blocks = 8388608.
>
> syncfs triggers COMMIT — nfs_write_inode(WB_SYNC_ALL) calls
> __nfs_commit_inode, which sends a COMMIT RPC to the server.
> Server executes vfs_fsync_range — this forces the local fs writeback,
> which now allocates the metadata block. Server's true SPACE_USED becomes
> 8388616 (+8 sectors = 4 KiB).
>
> COMMIT response carries no file attributes — per RFC 7530 §16.3.3,
> COMMIT4resok contains only a write verifier. The XDR decoder
> (nfs4_xdr_dec_commit) never calls decode_getfattr.
>
> nfs_commit_release_pages performs no attribute invalidation — it
> neither sends a follow-up GETATTR nor marks NFS_INO_INVALID_BLOCKS. The
> stale cached value persists.
>
> Subsequent stat returns stale i_blocks — cache_validity is clean, so
> nfs_getattr serves the cached value 8388608 without revalidation.
>
> After cycle_mount, fresh GETATTR returns 8388616 — mismatch detected,
> test fails.
> ```
>
I agree with your AI that this may indeed be a consequence of XFS's delayed allocation feature. However that hardly changes the fact that it would be a violation of the POSIX rules for how write(), fsync() and stat() are supposed to work in this situation.
My point is that I fail to see why we should degrade the performance of the generic NFS client in order to address a bug in one of the back end filesystems (in this case XFS) being exported.
So strong NACK to this patch from me.
```
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trondmy@kernel.org, trond.myklebust@hammerspace.com
```
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] NFS: invalidate i_blocks after COMMIT to fix stale block count on NFSv4
2026-06-23 4:51 ` Trond Myklebust
@ 2026-06-23 4:54 ` Jingbo Xu
2026-06-23 13:40 ` Brian Foster
0 siblings, 1 reply; 6+ messages in thread
From: Jingbo Xu @ 2026-06-23 4:54 UTC (permalink / raw)
To: Trond Myklebust, anna, linux-nfs; +Cc: linux-kernel, joseph.qi, linux-xfs
On 6/23/26 12:51 PM, Trond Myklebust wrote:
> On Tue, 2026-06-23 at 12:04 +0800, Jingbo Xu wrote:
>
>> +cc [linux-xfs@vger.kernel.org](mailto:linux-xfs@vger.kernel.org)
>>
>> On 6/22/26 9:56 PM, Trond Myklebust wrote:
>>
>>> On Mon, 2026-06-22 at 14:00 +0800, Jingbo Xu wrote:
>>>
>>>> NFSv4 COMMIT compound does not include GETATTR, and
>>>> nfs4_commit_done_cb
>>>> does not refresh inode attributes. Meanwhile, every WRITE marks
>>>> NFS_INO_INVALID_BLOCKS via nfs_post_op_update_inode_force_wcc_locked.
>>>>
>>>> After COMMIT, i_blocks remains stale until the next stat() triggers a
>>>> full revalidation. In writeback-heavy workloads where COMMITs happen
>>>> without intervening stat() calls, the cached block count can stay
>>>> indefinitely wrong.
>>>>
>>>> Mark NFS_INO_INVALID_BLOCKS on successful COMMIT completion so that
>>>> the
>>>> next nfs_getattr requesting STATX_BLOCKS will issue a GETATTR with
>>>> SPACE_USED, fetching the correct value from the server.
>>>>
>>>> This matches NFSv3 behavior where nfs3_commit_done already calls
>>>> nfs_refresh_inode with the wcc_data post-op attributes.
>>>>
>>>> Reproduce with xfstests generic/694 on NFSv4.0 loopback:
>>>>
>>>> Server:
>>>> mount /dev/vdc /data/test
>>>> mount /dev/vdd /data/scratch
>>>> exportfs -o insecure,rw,sync,no_root_squash,fsid=1
>>>> 127.0.0.1:/data/test
>>>> exportfs -o insecure,rw,sync,no_root_squash,fsid=2
>>>> 127.0.0.1:/data/scratch
>>>>
>>>> Client:
>>>> mount -t nfs -o vers=4.0 localhost:/data/test /mnt/test
>>>> mount -t nfs -o vers=4.0 localhost:/data/scratch /mnt/scratch
>>>>
>>>> local.config:
>>>> export TEST_FS_MOUNT_OPTS="-o vers=4.0"
>>>> export MOUNT_OPTIONS="-o vers=4.0"
>>>> export FSTYP=nfs
>>>> export TEST_DEV=localhost:/data/test
>>>> export SCRATCH_DEV=localhost:/data/scratch
>>>> export TEST_DIR=/mnt/test
>>>> export SCRATCH_MNT=/mnt/scratch
>>>>
>>>> This fixes xfstests generic/694.
>>>>
>>>> Assisted-by: Qoder:Qwen3.7-Max
>>>> Signed-off-by: Jingbo Xu <[jefflexu@linux.alibaba.com](mailto:jefflexu@linux.alibaba.com)>
>>>> ---
>>>> fs/nfs/write.c | 2 ++
>>>> 1 file changed, 2 insertions(+)
>>>>
>>>> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
>>>> index d7c399763ad9..88c5c9f7434c 100644
>>>> --- a/fs/nfs/write.c
>>>> +++ b/fs/nfs/write.c
>>>> @@ -1851,6 +1851,8 @@ static void nfs_commit_release_pages(struct
>>>> nfs_commit_data *data)
>>>> /* Latency breaker */
>>>> cond_resched();
>>>> }
>>>> + if (status >= 0)
>>>> + nfs_set_cache_invalid(data->inode,
>>>> NFS_INO_INVALID_BLOCKS);
>>>>
>>>> nfs_init_cinfo(&cinfo, data->inode, data->dreq);
>>>> nfs_commit_end(cinfo.mds);
>>>
>>>
>>> That sounds like an XFS bug, not an NFS bug. COMMIT isn't changing the
>>> data contents of the file: it is just ensuring that the existing data
>>> is persisted onto disk.
>>>
>>
>>
>> Yes the underlying backend filesystem is indeed xfs.
>>
>> I retest it and can confirm that this failure is much likely
>> reproducible on NFS upon XFS, while NFS upon EXT4 succeeds for 50
>> consecutive test runs.
>>
>> Beside XFS itself also passes the test.
>>
>>
>> To be honest I'm not familiar with NFS, following is what AI concludes:
>>
>> ```
>> Root Cause Timing Sequence: NFSv4.0 Stale i_blocks After syncfs
>>
>>
>> Client issues multiple UNSTABLE WRITEs — each WRITE compound includes
>> a piggy-backed GETATTR that returns the server's current SPACE_USED. The
>> client updates inode->i_blocks from the last completed WRITE's post-op
>> attributes.
>>
>> Server-side delayed allocation — the export's local filesystem
>> (ext4/xfs) uses delayed allocation. Metadata blocks (indirect blocks /
>> extent tree nodes) are not yet allocated during most WRITE handling;
>> they materialize only when the local fs performs its own writeback.
>>
>> Last WRITE completes before server metadata writeback (~80%
>> probability in user's environment) — the final GETATTR returns
>> SPACE_USED = 8388608 (data only, no metadata block). Client caches
>> i_blocks = 8388608.
>>
>> syncfs triggers COMMIT — nfs_write_inode(WB_SYNC_ALL) calls
>> __nfs_commit_inode, which sends a COMMIT RPC to the server.
>> Server executes vfs_fsync_range — this forces the local fs writeback,
>> which now allocates the metadata block. Server's true SPACE_USED becomes
>> 8388616 (+8 sectors = 4 KiB).
>>
>> COMMIT response carries no file attributes — per RFC 7530 §16.3.3,
>> COMMIT4resok contains only a write verifier. The XDR decoder
>> (nfs4_xdr_dec_commit) never calls decode_getfattr.
>>
>> nfs_commit_release_pages performs no attribute invalidation — it
>> neither sends a follow-up GETATTR nor marks NFS_INO_INVALID_BLOCKS. The
>> stale cached value persists.
>>
>> Subsequent stat returns stale i_blocks — cache_validity is clean, so
>> nfs_getattr serves the cached value 8388608 without revalidation.
>>
>> After cycle_mount, fresh GETATTR returns 8388616 — mismatch detected,
>> test fails.
>> ```
>>
>
>
> I agree with your AI that this may indeed be a consequence of XFS's delayed allocation feature. However that hardly changes the fact that it would be a violation of the POSIX rules for how write(), fsync() and stat() are supposed to work in this situation.
>
> My point is that I fail to see why we should degrade the performance of the generic NFS client in order to address a bug in one of the back end filesystems (in this case XFS) being exported.
>
Understand. Just have no idea which is the proper way to fix this.
--
Thanks,
Jingbo
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] NFS: invalidate i_blocks after COMMIT to fix stale block count on NFSv4
2026-06-23 4:54 ` Jingbo Xu
@ 2026-06-23 13:40 ` Brian Foster
0 siblings, 0 replies; 6+ messages in thread
From: Brian Foster @ 2026-06-23 13:40 UTC (permalink / raw)
To: Jingbo Xu
Cc: Trond Myklebust, anna, linux-nfs, linux-kernel, joseph.qi,
linux-xfs
On Tue, Jun 23, 2026 at 12:54:54PM +0800, Jingbo Xu wrote:
>
>
> On 6/23/26 12:51 PM, Trond Myklebust wrote:
> > On Tue, 2026-06-23 at 12:04 +0800, Jingbo Xu wrote:
> >
> >> +cc [linux-xfs@vger.kernel.org](mailto:linux-xfs@vger.kernel.org)
> >>
> >> On 6/22/26 9:56 PM, Trond Myklebust wrote:
> >>
> >>> On Mon, 2026-06-22 at 14:00 +0800, Jingbo Xu wrote:
> >>>
> >>>> NFSv4 COMMIT compound does not include GETATTR, and
> >>>> nfs4_commit_done_cb
> >>>> does not refresh inode attributes. Meanwhile, every WRITE marks
> >>>> NFS_INO_INVALID_BLOCKS via nfs_post_op_update_inode_force_wcc_locked.
> >>>>
> >>>> After COMMIT, i_blocks remains stale until the next stat() triggers a
> >>>> full revalidation. In writeback-heavy workloads where COMMITs happen
> >>>> without intervening stat() calls, the cached block count can stay
> >>>> indefinitely wrong.
> >>>>
> >>>> Mark NFS_INO_INVALID_BLOCKS on successful COMMIT completion so that
> >>>> the
> >>>> next nfs_getattr requesting STATX_BLOCKS will issue a GETATTR with
> >>>> SPACE_USED, fetching the correct value from the server.
> >>>>
> >>>> This matches NFSv3 behavior where nfs3_commit_done already calls
> >>>> nfs_refresh_inode with the wcc_data post-op attributes.
> >>>>
> >>>> Reproduce with xfstests generic/694 on NFSv4.0 loopback:
> >>>>
> >>>> Server:
> >>>> mount /dev/vdc /data/test
> >>>> mount /dev/vdd /data/scratch
> >>>> exportfs -o insecure,rw,sync,no_root_squash,fsid=1
> >>>> 127.0.0.1:/data/test
> >>>> exportfs -o insecure,rw,sync,no_root_squash,fsid=2
> >>>> 127.0.0.1:/data/scratch
> >>>>
> >>>> Client:
> >>>> mount -t nfs -o vers=4.0 localhost:/data/test /mnt/test
> >>>> mount -t nfs -o vers=4.0 localhost:/data/scratch /mnt/scratch
> >>>>
> >>>> local.config:
> >>>> export TEST_FS_MOUNT_OPTS="-o vers=4.0"
> >>>> export MOUNT_OPTIONS="-o vers=4.0"
> >>>> export FSTYP=nfs
> >>>> export TEST_DEV=localhost:/data/test
> >>>> export SCRATCH_DEV=localhost:/data/scratch
> >>>> export TEST_DIR=/mnt/test
> >>>> export SCRATCH_MNT=/mnt/scratch
> >>>>
> >>>> This fixes xfstests generic/694.
> >>>>
> >>>> Assisted-by: Qoder:Qwen3.7-Max
> >>>> Signed-off-by: Jingbo Xu <[jefflexu@linux.alibaba.com](mailto:jefflexu@linux.alibaba.com)>
> >>>> ---
> >>>> fs/nfs/write.c | 2 ++
> >>>> 1 file changed, 2 insertions(+)
> >>>>
> >>>> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> >>>> index d7c399763ad9..88c5c9f7434c 100644
> >>>> --- a/fs/nfs/write.c
> >>>> +++ b/fs/nfs/write.c
> >>>> @@ -1851,6 +1851,8 @@ static void nfs_commit_release_pages(struct
> >>>> nfs_commit_data *data)
> >>>> /* Latency breaker */
> >>>> cond_resched();
> >>>> }
> >>>> + if (status >= 0)
> >>>> + nfs_set_cache_invalid(data->inode,
> >>>> NFS_INO_INVALID_BLOCKS);
> >>>>
> >>>> nfs_init_cinfo(&cinfo, data->inode, data->dreq);
> >>>> nfs_commit_end(cinfo.mds);
> >>>
> >>>
> >>> That sounds like an XFS bug, not an NFS bug. COMMIT isn't changing the
> >>> data contents of the file: it is just ensuring that the existing data
> >>> is persisted onto disk.
> >>>
> >>
> >>
> >> Yes the underlying backend filesystem is indeed xfs.
> >>
> >> I retest it and can confirm that this failure is much likely
> >> reproducible on NFS upon XFS, while NFS upon EXT4 succeeds for 50
> >> consecutive test runs.
> >>
> >> Beside XFS itself also passes the test.
> >>
> >>
> >> To be honest I'm not familiar with NFS, following is what AI concludes:
> >>
> >> ```
> >> Root Cause Timing Sequence: NFSv4.0 Stale i_blocks After syncfs
> >>
> >>
> >> Client issues multiple UNSTABLE WRITEs — each WRITE compound includes
> >> a piggy-backed GETATTR that returns the server's current SPACE_USED. The
> >> client updates inode->i_blocks from the last completed WRITE's post-op
> >> attributes.
> >>
> >> Server-side delayed allocation — the export's local filesystem
> >> (ext4/xfs) uses delayed allocation. Metadata blocks (indirect blocks /
> >> extent tree nodes) are not yet allocated during most WRITE handling;
> >> they materialize only when the local fs performs its own writeback.
> >>
> >> Last WRITE completes before server metadata writeback (~80%
> >> probability in user's environment) — the final GETATTR returns
> >> SPACE_USED = 8388608 (data only, no metadata block). Client caches
> >> i_blocks = 8388608.
> >>
> >> syncfs triggers COMMIT — nfs_write_inode(WB_SYNC_ALL) calls
> >> __nfs_commit_inode, which sends a COMMIT RPC to the server.
> >> Server executes vfs_fsync_range — this forces the local fs writeback,
> >> which now allocates the metadata block. Server's true SPACE_USED becomes
> >> 8388616 (+8 sectors = 4 KiB).
> >>
> >> COMMIT response carries no file attributes — per RFC 7530 §16.3.3,
> >> COMMIT4resok contains only a write verifier. The XDR decoder
> >> (nfs4_xdr_dec_commit) never calls decode_getfattr.
> >>
> >> nfs_commit_release_pages performs no attribute invalidation — it
> >> neither sends a follow-up GETATTR nor marks NFS_INO_INVALID_BLOCKS. The
> >> stale cached value persists.
> >>
> >> Subsequent stat returns stale i_blocks — cache_validity is clean, so
> >> nfs_getattr serves the cached value 8388608 without revalidation.
> >>
> >> After cycle_mount, fresh GETATTR returns 8388616 — mismatch detected,
> >> test fails.
> >> ```
> >>
> >
> >
> > I agree with your AI that this may indeed be a consequence of XFS's delayed allocation feature. However that hardly changes the fact that it would be a violation of the POSIX rules for how write(), fsync() and stat() are supposed to work in this situation.
> >
> > My point is that I fail to see why we should degrade the performance of the generic NFS client in order to address a bug in one of the back end filesystems (in this case XFS) being exported.
> >
>
Is the implication here that st_blocks changing across an fsync is a
POSIX violation? If so, can you clarify where that is said? My
understanding is that the rules/definitions here are rather vague and
st_blocks is a simple count of blocks allocated to the inode (i.e., so
this is ambiguous wrt delalloc, metadata blocks, etc.), but I could be
missing something.
> Understand. Just have no idea which is the proper way to fix this.
>
If I follow the above sequence correctly, the behavior being reported is
basically that NFS submits a write and caches a post-op block count on
the way out, a commit/fsync completes and doesn't update the block count
on the client, but XFS actually changes the block count during the fsync
due to a metadata (presumably bmap btree) block allocation on writeback
and delalloc conversion.
generic/694 basically does a sync -> sample inode block count -> mount
cycle -> resample and compare test, and then fails if the samples don't
match across the mount cycle. This presumably fails on NFS due to the
block count caching above. I.e., the post-fsync sample returns the
pre-fsync cached value and thus implies the value has changed purely
across the mount cycle, right?
A couple things to note here.. g/694 can either use falloc to create the
file or fall back to buffered write if falloc fails. Is that a factor
going through NFS in this case (i.e. fallocate is not supported)? Also
it looks like the sync was added to the test to explicitly address this
sort of problem in other cases.
It feels a little odd to me to call this an fs bug, tbh. The intent of
the test seems more to catch an incorrect block count calculation as
opposed to enforce any sort of caching or accounting rules. The test may
or may not even perform writes, after all. It seems more like the issue
is just that the fsync that was added for consistency at the fs level is
not sufficient for NFS due to this attribute caching.
Is this actually a problem for NFS in practice or will it eventually get
the latest value from the server? If the latter, maybe the test simply
needs to be fixed up or skipped for NFS..?
Brian
>
> --
> Thanks,
> Jingbo
>
>
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2026-06-23 13:40 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-22 6:00 [PATCH] NFS: invalidate i_blocks after COMMIT to fix stale block count on NFSv4 Jingbo Xu
2026-06-22 13:56 ` Trond Myklebust
2026-06-23 4:04 ` Jingbo Xu
2026-06-23 4:51 ` Trond Myklebust
2026-06-23 4:54 ` Jingbo Xu
2026-06-23 13:40 ` Brian Foster
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.