From: Trond Myklebust <trondmy@kernel.org>
To: Jeff Layton <jlayton@kernel.org>, Anna Schumaker <anna@kernel.org>
Cc: Chuck Lever <chuck.lever@oracle.com>, linux-nfs@vger.kernel.org
Subject: Re: parts of pages on NFS being replaced by swaths of NULs
Date: Thu, 31 Jul 2025 17:56:08 -0400
Message-ID: <752db17aff35a92b79e4c7bd3003ed890fe91403.camel@kernel.org>
In-Reply-To: <1c42a7fd9677ad1aa9a3a53eda738b3a6da3728e.camel@kernel.org>
On Wed, 2025-07-30 at 10:52 -0400, Jeff Layton wrote:
> We've been seeing a rather nasty bit of data corruption with NFS in our
> environment. The clients in this env run a patched v6.9 kernel (mostly
> due to GPU driver requirements). Most of the patches are NFS
> containerization fixes.
>
> The workload is python scripts writing JSONL files sequentially using
> bog-standard buffered write() calls. We're fairly certain that userland
> is not seeking so there should be no gaps in the data written.
>
> The problem is that we see ranges of written files being replaced by
> NULs. The length of the file seemingly doesn't change from what it
> should be, but a chunk of it will be zeroed-out. Looking at the offsets
> of the zeroed-out ranges, the front part of one page is fine, but the
> data from some random offset in the page to the end of the page is
> zeroes.
>
> We have a reproducer but we have to run it in a heavily parallel
> configuration to make it happen, so it's evidently a tight race of some
> sort.
>
> We've turned up some tracepoints and reproduced this twice. What we see
> in both cases is that the client just doesn't write some section of the
> file.
>
> In the first trace, there is a gap of 2201 bytes between these two
> writes on the wire:
>
> kworker/u1038:1-2597138 [106] ..... 46138.516795:
> nfs_initiate_write: fileid=00:aa:10056165185 fhandle=0x6bd94d55
> offset=53248 count=1895 stable=UNSTABLE
> oil-localfs-252-2605046 [163] ..... 46138.551459:
> nfs_initiate_write: fileid=00:aa:10056165185 fhandle=0x6bd94d55
> offset=57344 count=443956 stable=FILE_SYNC
>
> The zeroed-out range is from 55143-57344. At the same time that the
> file is growing from 53248 to 55143 (due to sequential write()
> activity), the client is kicking off writeback for the range up to
> 55143. It's issuing 2 writes, one for 0-53248 and one for 53248-55143
> (note that I've filtered out all but one of the DS filehandles for
> brevity):
>
> oil-localfs-252-2605046 [162] ..... 46138.516414: nfs_size_grow:
> fileid=00:aa:10056165185 fhandle=0x8bfc64c9
> version=1753485366409158129 cursize=49152 newsize=50130
> oil-localfs-252-2605046 [162] ..... 46138.516593: nfs_size_grow:
> fileid=00:aa:10056165185 fhandle=0x8bfc64c9
> version=1753485366409158129 cursize=50130 newsize=53248
> kworker/u1038:1-2597138 [106] ..... 46138.516740:
> nfs_initiate_write: fileid=00:aa:10056165185 fhandle=0x6bd94d55
> offset=0 count=53248 stable=UNSTABLE
> oil-localfs-252-2605046 [162] ..... 46138.516753: nfs_size_grow:
> fileid=00:aa:10056165185 fhandle=0x8bfc64c9
> version=1753485366409158129 cursize=53248 newsize=55143
> kworker/u1038:1-2597138 [106] ..... 46138.516795:
> nfs_initiate_write: fileid=00:aa:10056165185 fhandle=0x6bd94d55
> offset=53248 count=1895 stable=UNSTABLE
> kworker/u1037:2-2871862 [097] ..... 46138.517659: nfs4_pnfs_write:
> error=0 (OK) fileid=00:aa:10056165185 fhandle=0x6bd94d55 offset=0
> count=53248 res=53248 stateid=1:0x79a9c471 layoutstateid=1:0xcbd8aaad
> kworker/u1037:2-2871862 [097] ..... 46138.517662:
> nfs_writeback_done: error=53248 fileid=00:aa:10056165185
> fhandle=0x6bd94d55 offset=0 count=53248 res=53248 stable=UNSTABLE
> verifier=5199cdae2816c899
> kworker/u1037:5-2593935 [226] ..... 46138.517669: nfs4_pnfs_write:
> error=0 (OK) fileid=00:aa:10056165185 fhandle=0x6bd94d55 offset=53248
> count=1895 res=1895 stateid=1:0x79a9c471 layoutstateid=1:0xcbd8aaad
> kworker/u1037:5-2593935 [226] ..... 46138.517672:
> nfs_writeback_done: error=1895 fileid=00:aa:10056165185
> fhandle=0x6bd94d55 offset=53248 count=1895 res=1895 stable=UNSTABLE
> verifier=5199cdae2816c899
> oil-localfs-252-2605046 [162] ..... 46138.518360: nfs_size_grow:
> fileid=00:aa:10056165185 fhandle=0x8bfc64c9
> version=1753485366409158129 cursize=55143 newsize=57344
> oil-localfs-252-2605046 [162] ..... 46138.518556: nfs_size_grow:
> fileid=00:aa:10056165185 fhandle=0x8bfc64c9
> version=1753485366409158129 cursize=57344 newsize=60156
>
> ...and just after writeback completes, we see the file size grow from
> 55143 to the end of the page (57344).
>
> The second trace has similar symptoms. There is a lot more (smaller)
> write activity (due to memory pressure?). There is a gap of 3791 bytes
> between these on-the-wire writes, however:
>
> kworker/u1036:0-2339252 [217] ..... 479572.054622:
> nfs_initiate_write: fileid=00:96:10067193438 fhandle=0xc9992232
> offset=221184 count=4401 stable=UNSTABLE
> kworker/u1030:1-2297876 [042] ..... 479572.074194:
> nfs_initiate_write: fileid=00:96:10067193438 fhandle=0xc9992232
> offset=229376 count=261898 stable=UNSTABLE
>
> Same situation -- the page at offset 225280 has 305 bytes on it, and
> the remainder is zeroed. This trace shows the same kind of racing
> write() and writeback activity as Friday's trace. At around the same
> time as the client was growing the file over the affected range,
> writeback was kicking off for everything up to the affected range (this
> has some other wb-related calls filtered for brevity):
>
> oil-localfs-86-727850 [215] ..... 479572.053987: nfs_size_grow:
> fileid=00:96:10067193438 fhandle=0x14c40498
> version=1753823598774309300 cursize=217088
> newsize=220572
>
> kworker/u1036:8-2339326 [088] ..... 479572.054008:
> nfs_initiate_write: fileid=00:96:10067193438 fhandle=0xc9992232
> offset=217088 count=3484
> stable=UNSTABLE
>
> oil-localfs-86-727850 [215] ..... 479572.054405: nfs_size_grow:
> fileid=00:96:10067193438 fhandle=0x14c40498
> version=1753823598774309300 cursize=220572
> newsize=221184
>
> kworker/u1036:1-2297875 [217] ..... 479572.054418:
> nfs_initiate_write: fileid=00:96:10067193438 fhandle=0xc9992232
> offset=220572 count=612
> stable=UNSTABLE
>
> oil-localfs-86-727850 [215] ..... 479572.054581: nfs_size_grow:
> fileid=00:96:10067193438 fhandle=0x14c40498
> version=1753823598774309300 cursize=221184
> newsize=225280
>
> oil-localfs-86-727850 [215] ..... 479572.054584: nfs_size_grow:
> fileid=00:96:10067193438 fhandle=0x14c40498
> version=1753823598774309300 cursize=225280
> newsize=225585
>
> kworker/u1036:0-2339252 [217] ..... 479572.054622:
> nfs_initiate_write: fileid=00:96:10067193438 fhandle=0xc9992232
> offset=221184 count=4401
> stable=UNSTABLE
>
> oil-localfs-86-727850 [215] ..... 479572.054997: nfs_size_grow:
> fileid=00:96:10067193438 fhandle=0x14c40498
> version=1753823598774309300 cursize=225585
> newsize=229376
>
> oil-localfs-86-727850 [215] ..... 479572.055190: nfs_size_grow:
> fileid=00:96:10067193438 fhandle=0x14c40498
> version=1753823598774309300 cursize=229376
> newsize=230598
>
>
> Could this be a race between extending an existing dirty page and
> writeback kicking off for the pre-extension range on the page? Maybe
> the client is clearing the dirty bit, thinking that the write covers
> the dirty range, when it actually has an outdated idea of what that
> range is, or doesn't properly check it?
>
> Traces for both events, filtered on the relevant fileid, are attached.
> I've rolled patches for some new tracepoints that I'm going to attempt
> to turn up next, but I thought that this was a good point to solicit
> ideas.
>
> Happy to entertain other thoughts or patches!

So... The fact that we are seeing an nfs_size_grow() for the hole at
offset 55143 means that either an existing request was updated or a new
one was created to cover that hole, and it must have been marked as
dirty.
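
For what it's worth, the numbers in both traces line up exactly with the
tail of a single page being dropped. A quick sketch of the arithmetic
(this assumes a 4096-byte page size, which the on-the-wire offsets are
consistent with):

# Relate the zeroed file ranges from the traces to page boundaries.
PAGE_SIZE = 4096  # assumed client page size

def describe_hole(hole_start, hole_end):
    """Relate a zeroed file range to the page that contains it."""
    page_start = (hole_start // PAGE_SIZE) * PAGE_SIZE
    page_end = page_start + PAGE_SIZE
    return {
        "page": (page_start, page_end),
        "bytes_written": hole_start - page_start,  # data that made it out
        "bytes_zeroed": hole_end - hole_start,     # the hole itself
        "hole_ends_on_page_boundary": hole_end == page_end,
    }

# First trace: zeroed range 55143-57344
print(describe_hole(55143, 57344))
# -> page (53248, 57344), 1895 bytes written, 2201 zeroed, ends on boundary

# Second trace: writes end at 221184 + 4401 = 225585, next one starts at 229376
print(describe_hole(225585, 229376))
# -> page (225280, 229376), 305 bytes written, 3791 zeroed, ends on boundary

In both cases the surviving byte count matches the last small on-the-wire
write for that page, and the hole runs to the end of that page.
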
I'm not seeing anything in the NFS code that can lose that request
without triggering either the nfs_write_error tracepoint, the
nfs_commit_error tracepoint, the nfs_invalidate_folio tracepoint or
else completing the write.
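
If it helps to double-check that on the next reproduction, here is a
minimal sketch for enabling those events alongside the ones already being
captured. The tracefs mount point and the "nfs" subsystem directory are
assumptions about the client setup; adjust to whatever
ls /sys/kernel/tracing/events/nfs/ actually shows.

# Minimal sketch (run as root): enable the relevant nfs tracepoints.
from pathlib import Path

TRACEFS = Path("/sys/kernel/tracing")
EVENTS = [
    "nfs/nfs_write_error",
    "nfs/nfs_commit_error",
    "nfs/nfs_invalidate_folio",
    "nfs/nfs_size_grow",
    "nfs/nfs_initiate_write",
    "nfs/nfs_writeback_done",
]

for event in EVENTS:
    # Writing "1" to an event's enable file turns that tracepoint on.
    (TRACEFS / "events" / event / "enable").write_text("1")

# Then capture /sys/kernel/tracing/trace_pipe while the reproducer runs and
# filter on the affected fileid afterwards.
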
The only other way I can see this data being lost is if something is
corrupting folio->private, or if the page cache is somehow managing to
throw away a dirty folio.
Of the two, for a while there was a netfs bug that would corrupt
folio->private, but I assume you're not using fscache?
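
To make the race that Jeff describes above a bit more concrete, here is a
toy model in plain Python (not NFS code; the class and method names are
purely illustrative) of the suspected interleaving: writeback snapshots
the dirty extent of a page, the application extends the dirty range
within that same page while the I/O is in flight, and completion then
clears the page's dirty state based on the stale snapshot:

import threading

class ToyPage:
    """Illustrative only: one page with a dirty byte range guarded by a lock."""
    def __init__(self):
        self.lock = threading.Lock()
        self.dirty = None                # (start, end) within the page, or None

    def write(self, start, end):
        """Application path: extend the dirty range (a buffered write())."""
        with self.lock:
            if self.dirty is None:
                self.dirty = (start, end)
            else:
                self.dirty = (min(self.dirty[0], start), max(self.dirty[1], end))

    def start_writeback(self):
        """Writeback path: snapshot what we currently think needs writing."""
        with self.lock:
            return self.dirty

    def end_writeback(self, snapshot):
        """Buggy completion: clears the dirty state without re-checking
        whether the dirty range grew while the I/O was in flight."""
        with self.lock:
            self.dirty = None            # anything dirtied after the snapshot is lost

page = ToyPage()
page.write(0, 1895)                      # file offsets 53248..55143 land in this page
snapshot = page.start_writeback()        # writeback snapshots (0, 1895)
page.write(1895, 4096)                   # app extends the same page up to 57344
page.end_writeback(snapshot)             # dirty state cleared from the stale snapshot
print(page.dirty)                        # None: bytes 1895..4096 are never written back

If something like that were happening, the tail of the page (file offsets
55143-57344 in the first trace) would never be sent to the server, which
matches the zeroed range observed. Whether the real code can actually do
this is exactly the open question.
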
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trondmy@kernel.org, trond.myklebust@hammerspace.com