From: Jeff Layton <jlayton@kernel.org>
To: Trond Myklebust <trondmy@kernel.org>, Anna Schumaker <anna@kernel.org>
Cc: Chuck Lever <chuck.lever@oracle.com>, linux-nfs@vger.kernel.org
Subject: Re: parts of pages on NFS being replaced by swaths of NULs
Date: Wed, 06 Aug 2025 10:20:54 -0400
Message-ID: <f6952e2503d1ec74a87051df7de216bca998cb50.camel@kernel.org>
In-Reply-To: <752db17aff35a92b79e4c7bd3003ed890fe91403.camel@kernel.org>

On Thu, 2025-07-31 at 17:56 -0400, Trond Myklebust wrote:
> On Wed, 2025-07-30 at 10:52 -0400, Jeff Layton wrote:
> > We've been seeing a rather nasty bit of data corruption with NFS in
> > our environment. The clients in this env run a patched v6.9 kernel
> > (mostly due to GPU driver requirements). Most of the patches are NFS
> > containerization fixes.
> >
> > The workload is python scripts writing JSONL files sequentially using
> > bog-standard buffered write() calls. We're fairly certain that userland
> > is not seeking so there should be no gaps in the data written.
> >
> > The problem is that we see ranges of written files being replaced by
> > NULs. The length of the file seemingly doesn't change from what it
> > should be, but a chunk of it will be zeroed-out. Looking at the
> > offsets of the zeroed out ranges, the front part of one page is fine,
> > but the data from some random offset in the page to the end of the
> > page is zeroes.
> >
> > We have a reproducer but we have to run it in a heavily parallel
> > configuration to make it happen, so it's evidently a tight race of
> > some sort.
> >
> > We've turned up some tracepoints and reproduced this twice. What we
> > see in both cases is that the client just doesn't write some section
> > of the file.
> >
> > In the first trace, there is a gap of 2201 bytes between these two
> > writes on the wire:
> >
> > kworker/u1038:1-2597138 [106] ..... 46138.516795: nfs_initiate_write: fileid=00:aa:10056165185 fhandle=0x6bd94d55 offset=53248 count=1895 stable=UNSTABLE
> > foo-localfs-252-2605046 [163] ..... 46138.551459: nfs_initiate_write: fileid=00:aa:10056165185 fhandle=0x6bd94d55 offset=57344 count=443956 stable=FILE_SYNC
> >
> > The zeroed-out range is from 55143-57344. At the same time that the
> > file is growing from 53248 to 55143 (due to sequential write()
> > activity), the client is kicking off writeback for the range up to
> > 55143. It's issuing 2 writes, one for 0-53248 and one for 53248-55143
> > (note that I've filtered out all but one of the DS filehandles for
> > brevity):
> >
> > foo-localfs-252-2605046 [162] ..... 46138.516414: nfs_size_grow: fileid=00:aa:10056165185 fhandle=0x8bfc64c9 version=1753485366409158129 cursize=49152 newsize=50130
> > foo-localfs-252-2605046 [162] ..... 46138.516593: nfs_size_grow: fileid=00:aa:10056165185 fhandle=0x8bfc64c9 version=1753485366409158129 cursize=50130 newsize=53248
> > kworker/u1038:1-2597138 [106] ..... 46138.516740: nfs_initiate_write: fileid=00:aa:10056165185 fhandle=0x6bd94d55 offset=0 count=53248 stable=UNSTABLE
> > foo-localfs-252-2605046 [162] ..... 46138.516753: nfs_size_grow: fileid=00:aa:10056165185 fhandle=0x8bfc64c9 version=1753485366409158129 cursize=53248 newsize=55143
> > kworker/u1038:1-2597138 [106] ..... 46138.516795: nfs_initiate_write: fileid=00:aa:10056165185 fhandle=0x6bd94d55 offset=53248 count=1895 stable=UNSTABLE
> > kworker/u1037:2-2871862 [097] ..... 46138.517659: nfs4_pnfs_write: error=0 (OK) fileid=00:aa:10056165185 fhandle=0x6bd94d55 offset=0 count=53248 res=53248 stateid=1:0x79a9c471 layoutstateid=1:0xcbd8aaad
> > kworker/u1037:2-2871862 [097] ..... 46138.517662: nfs_writeback_done: error=53248 fileid=00:aa:10056165185 fhandle=0x6bd94d55 offset=0 count=53248 res=53248 stable=UNSTABLE verifier=5199cdae2816c899
> > kworker/u1037:5-2593935 [226] ..... 46138.517669: nfs4_pnfs_write: error=0 (OK) fileid=00:aa:10056165185 fhandle=0x6bd94d55 offset=53248 count=1895 res=1895 stateid=1:0x79a9c471 layoutstateid=1:0xcbd8aaad
> > kworker/u1037:5-2593935 [226] ..... 46138.517672: nfs_writeback_done: error=1895 fileid=00:aa:10056165185 fhandle=0x6bd94d55 offset=53248 count=1895 res=1895 stable=UNSTABLE verifier=5199cdae2816c899
> > foo-localfs-252-2605046 [162] ..... 46138.518360: nfs_size_grow: fileid=00:aa:10056165185 fhandle=0x8bfc64c9 version=1753485366409158129 cursize=55143 newsize=57344
> > foo-localfs-252-2605046 [162] ..... 46138.518556: nfs_size_grow: fileid=00:aa:10056165185 fhandle=0x8bfc64c9 version=1753485366409158129 cursize=57344 newsize=60156
> >
> > ...and just after writeback completes, we see the file size grow from
> > 55143 to the end of the page (57344).
> >
> > The second trace has similar symptoms. There is a lot more (smaller)
> > write activity (due to memory pressure?). There is a gap of 3791 bytes
> > between these on-the-wire writes, however:
> >
> > kworker/u1036:0-2339252 [217] ..... 479572.054622: nfs_initiate_write: fileid=00:96:10067193438 fhandle=0xc9992232 offset=221184 count=4401 stable=UNSTABLE
> > kworker/u1030:1-2297876 [042] ..... 479572.074194: nfs_initiate_write: fileid=00:96:10067193438 fhandle=0xc9992232 offset=229376 count=261898 stable=UNSTABLE
> >
> > Same situation -- the page at offset 225280 has 305 bytes on it, and
> > the remainder is zeroed. This trace shows racing write() and writeback
> > activity similar to that in Friday's trace. At around the same time as
> > the client was growing the file over the affected range, writeback was
> > kicking off for everything up to the affected range (this has some
> > other wb related calls filtered for brevity):
> >
> > foo-localfs-86-727850 [215] ..... 479572.053987: nfs_size_grow: fileid=00:96:10067193438 fhandle=0x14c40498 version=1753823598774309300 cursize=217088 newsize=220572
> > kworker/u1036:8-2339326 [088] ..... 479572.054008: nfs_initiate_write: fileid=00:96:10067193438 fhandle=0xc9992232 offset=217088 count=3484 stable=UNSTABLE
> > foo-localfs-86-727850 [215] ..... 479572.054405: nfs_size_grow: fileid=00:96:10067193438 fhandle=0x14c40498 version=1753823598774309300 cursize=220572 newsize=221184
> > kworker/u1036:1-2297875 [217] ..... 479572.054418: nfs_initiate_write: fileid=00:96:10067193438 fhandle=0xc9992232 offset=220572 count=612 stable=UNSTABLE
> > foo-localfs-86-727850 [215] ..... 479572.054581: nfs_size_grow: fileid=00:96:10067193438 fhandle=0x14c40498 version=1753823598774309300 cursize=221184 newsize=225280
> > foo-localfs-86-727850 [215] ..... 479572.054584: nfs_size_grow: fileid=00:96:10067193438 fhandle=0x14c40498 version=1753823598774309300 cursize=225280 newsize=225585
> > kworker/u1036:0-2339252 [217] ..... 479572.054622: nfs_initiate_write: fileid=00:96:10067193438 fhandle=0xc9992232 offset=221184 count=4401 stable=UNSTABLE
> > foo-localfs-86-727850 [215] ..... 479572.054997: nfs_size_grow: fileid=00:96:10067193438 fhandle=0x14c40498 version=1753823598774309300 cursize=225585 newsize=229376
> > foo-localfs-86-727850 [215] ..... 479572.055190: nfs_size_grow: fileid=00:96:10067193438 fhandle=0x14c40498 version=1753823598774309300 cursize=229376 newsize=230598
> >
> >
> > Could this be a race between extending an existing dirty page, and
> > writeback kicking off for the pre-extension range on the page? Maybe
> > the client is clearing the dirty bit, thinking that the write covers
> > the dirty range, but it has an outdated idea about what that range is
> > or doesn't properly check?
> >
> > Traces for both events, filtered on the relevant fileid, are attached.
> > I've rolled patches for some new tracepoints that I'm going to attempt
> > to turn up next, but I thought that this was a good point to solicit
> > ideas.
> >
> > Happy to entertain other thoughts or patches!
>
> So... The fact that we are seeing a nfs_size_grow() for the hole at
> offset 55143 means that either an existing request was updated, or a
> new one was created in order to cover that hole, and it must have been
> marked as dirty.
>
> I'm not seeing anything in the NFS code that can lose that request
> without triggering either the nfs_write_error tracepoint, the
> nfs_commit_error tracepoint, the nfs_invalidate_folio tracepoint or
> else completing the write.
>
> The only other way I can see this data being lost is if something is
> corrupting folio->private, or if the page cache is somehow managing to
> throw away a dirty folio.
> Of the two, there was for a while a netfs bug which would corrupt
> folio->private, but I assume you're not using cachefs?

We reproduced this again, this time with some extra tracepoints that I
added. I'll post patches for those soon. I may need to add more.

Here's the writeup of it I did this morning. Some names changed to
protect the paranoid. Let me know if anyone has ideas:

In this case, the corruption happened fairly early. There is a 2262
byte hole between 10026 and 12288:
kworker/u1025:9-3435794 [000] ..... 42066.127742: nfs_initiate_write: fileid=00:82:10087279963 fhandle=0x483d45e0 offset=0 count=10026 stable=UNSTABLE
foo-localfs-161-1326370 [135] ..... 42066.169637: nfs_initiate_write: fileid=00:82:10087279963 fhandle=0x483d45e0 offset=12288 count=489012 stable=FILE_SYNC

The reproducer is a python script doing 5013 byte write() calls
exclusively. There is another layer between that and NFS though. It's
implemented as a FUSE fs that passes reads and writes through to NFS.
This layer alters the I/O pattern in an interesting way:

The first write from userland is 5013 bytes (side question: does foofs
use io_uring to do these writes?):
foo-localfs-161-1326370 [135] ..... 42066.127165: nfs_file_write: fileid=00:82:10087279963 fhandle=0xa6491ad0 version=1754454982197296994 offset=0 count=5013 ki_flags=

...but then it writes everything up to the end of the second page in
the next write() call:
foo-localfs-161-1326370 [135] ..... 42066.127486: nfs_file_write: fileid=00:82:10087279963 fhandle=0xa6491ad0 version=1754454982197296994 offset=5013 count=3179 ki_flags=

...then it writes the rest of that 5013 byte write:
foo-localfs-161-1326370 [135] ..... 42066.127717: nfs_file_write: fileid=00:82:10087279963 fhandle=0xa6491ad0 version=1754454982197296994 offset=8192 count=1834 ki_flags=

...so now we have 2 complete lines. The next write from userland is to
the end of the page:
foo-localfs-161-1326370 [135] ..... 42066.127954: nfs_file_write: fileid=00:82:10087279963 fhandle=0xa6491ad0 version=1754454982197296994 offset=10026 count=2262 ki_flags=

...and the pattern continues...
foo-localfs-161-1326370 [135] ..... 42066.129411: nfs_file_write: fileid=00:82:10087279963 fhandle=0xa6491ad0 version=1754454982197296994 offset=12288 count=2751 ki_flags=
foo-localfs-161-1326370 [135] ..... 42066.129749: nfs_file_write: fileid=00:82:10087279963 fhandle=0xa6491ad0 version=1754454982197296994 offset=15039 count=1345 ki_flags=
foo-localfs-161-1326370 [135] ..... 42066.130020: nfs_file_write: fileid=00:82:10087279963 fhandle=0xa6491ad0 version=1754454982197296994 offset=16384 count=3668 ki_flags=

This weird I/O pattern may help explain why this reproduces more easily
with this layer in the mix: it gives the NFS client ample opportunity
to have to extend existing pages. I plan to change my attempted
reproducer to mimic this pattern.
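
Something along these lines, maybe (just a sketch of what I have in
mind, not the actual foofs or reproducer code; the file path and record
payload are placeholders):

/*
 * Reproducer sketch (an assumption, not the real foofs or test code):
 * write 5013-byte "records", splitting each one at the next 4096-byte
 * page boundary the way the FUSE layer appears to, so that later
 * write()s keep extending partially-filled pages.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define PAGE_SZ 4096
#define REC_SZ  5013

static void write_all(int fd, const char *buf, size_t len)
{
	while (len > 0) {
		ssize_t n = write(fd, buf, len);

		if (n < 0) {
			perror("write");
			exit(1);
		}
		buf += n;
		len -= (size_t)n;
	}
}

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "repro.jsonl";
	char rec[REC_SZ];
	long long pos = 0;
	int fd, i;

	memset(rec, 'x', REC_SZ - 1);
	rec[REC_SZ - 1] = '\n';

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	for (i = 0; i < 100; i++) {
		/* Fill out the current page first, then write the rest
		 * of the record, mimicking the observed split. */
		size_t to_boundary = PAGE_SZ - (pos % PAGE_SZ);
		size_t first = to_boundary < REC_SZ ? to_boundary : REC_SZ;

		write_all(fd, rec, first);
		write_all(fd, rec + first, REC_SZ - first);
		pos += REC_SZ;
	}

	fsync(fd);
	close(fd);
	return 0;
}

Running a bunch of these in parallel, each on its own file on the NFS
mount, should recreate the heavily parallel configuration that seems to
be needed to hit the race.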

The new tracepoints give us a slightly clearer picture of the race (I
filtered out all but one of the DS filehandles for simplicity).

Here's the end of the write from 8192-10026. At this point the file
size is 10026 (everything looks normal so far, AFAICT):
foo-localfs-161-1326370 [135] ..... 42066.127720: nfs_size_grow: fileid=00:82:10087279963 fhandle=0xa6491ad0 version=1754454982197296994 cursize=8192 newsize=10026

Writeback for everything in the cache so far is kicked off:
kworker/u1025:9-3435794 [000] ..... 42066.127742: nfs_initiate_write: fileid=00:82:10087279963 fhandle=0x483d45e0 offset=0 count=10026 stable=UNSTABLE

A write comes in from foofs for 2262 bytes, and it calls into
write_begin. I suspect at this point nfs_write_begin is blocked on the
page lock:
foo-localfs-161-1326370 [135] ..... 42066.127954: nfs_file_write: fileid=00:82:10087279963 fhandle=0xa6491ad0 version=1754454982197296994 offset=10026 count=2262 ki_flags=
foo-localfs-161-1326370 [135] ..... 42066.127954: nfs_write_begin: fileid=00:82:10087279963 fhandle=0xa6491ad0 version=1754454982197296994 offset=10026 count=2262

Writeback to the DSes proceeds:
kworker/u1035:0-3302696 [209] ..... 42066.128388: nfs4_pnfs_write: error=0 (OK) fileid=00:82:10087279963 fhandle=0x483d45e0 offset=0 count=10026 res=10026 stateid=1:0x428af712 layoutstateid=1:0x47c699b3
kworker/u1035:0-3302696 [209] ..... 42066.128396: nfs_writeback_done: error=10026 fileid=00:82:10087279963 fhandle=0x483d45e0 offset=0 count=10026 res=10026 stable=UNSTABLE verifier=a376459679d60091

Eventually, foofs gets the page lock, and write_end is called. It
updates the folio via nfs_try_to_update_request():
foo-localfs-161-1326370 [135] ..... 42066.128429: nfs_write_end: fileid=00:82:10087279963 fhandle=0xa6491ad0 version=1754454982197296994 offset=10026 count=2262
foo-localfs-161-1326370 [135] ..... 42066.128429: nfs_update_folio: fileid=00:82:10087279963 fhandle=0xa6491ad0 version=1754454982197296994 offset=1834 count=2262
foo-localfs-161-1326370 [135] ..... 42066.128429: nfs_try_to_update_request: fileid=00:82:10087279963 fhandle=0xa6491ad0 version=1754454982197296994 offset=1834 count=2262

...COMMITs go out for the data just written (side question: there was
only a single WRITE call on the wire before this, so why didn't it just
do a SYNC write instead?):
kworker/u1028:3-3432139 [031] ..... 42066.128431: nfs_initiate_commit: fileid=00:82:10087279963 fhandle=0x483d45e0 offset=0 count=0
kworker/u1035:0-3302696 [209] ..... 42066.129158: nfs_commit_done: error=0 fileid=00:82:10087279963 fhandle=0x483d45e0 offset=0 stable=FILE_SYNC verifier=a376459679d60091

...the file size in the inode is grown:
foo-localfs-161-1326370 [135] ..... 42066.129179: nfs_size_grow: fileid=00:82:10087279963 fhandle=0xa6491ad0 version=1754454982197296994 cursize=10026 newsize=12288

...and then the next userland write happens:
foo-localfs-161-1326370 [135] ..... 42066.129411: nfs_file_write: fileid=00:82:10087279963 fhandle=0xa6491ad0 version=1754454982197296994 offset=12288 count=2751 ki_flags=

...and eventually the next on-the-wire write occurs due to an fsync,
which skips the partial page:
foo-localfs-161-1326370 [135] ..... 42066.169262: nfs_size_grow: fileid=00:82:10087279963 fhandle=0xa6491ad0 version=1754454982197296994 cursize=499712 newsize=501300
foo-localfs-161-1326370 [135] ..... 42066.169517: nfs_fsync_enter: fileid=00:82:10087279963 fhandle=0xa6491ad0 version=1754454982197296994 cache_validity=0x4700 (INVALID_CHANGE|INVALID_CTIME|INVALID_MTIME|INVALID_BLOCKS)
foo-localfs-161-1326370 [135] ..... 42066.169637: nfs_initiate_write: fileid=00:82:10087279963 fhandle=0x483d45e0 offset=12288 count=489012 stable=FILE_SYNC

It's still not 100% clear why part of the page didn't get written back.
I have a couple of theories at this point:

1/ The dirty bit is somehow either not being set properly in the first
   place for the 2262 byte write, or is getting cleared inappropriately
   during the WRITE that includes the first part of the page.

2/ The file size doesn't grow until quite late in the process. Perhaps
   we could be clamping the writeback range to the old file size and
   tossing out the rest of the page?

I think #1 is most probable.
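
To make #1 a little more concrete, here's a toy model of the
interleaving I'm worried about (this is not NFS code; the struct, its
fields, and the "buggy completion" step are invented purely for
illustration, using the byte ranges from the trace above):

/*
 * Toy model of the suspected lost-update race (theory #1). Not kernel
 * code; it just shows how clearing dirty state on WRITE completion
 * without rechecking the range would drop an extension that raced in
 * while the WRITE was on the wire.
 */
#include <stdbool.h>
#include <stdio.h>

struct page_req {
	unsigned int start;	/* dirty range start (file offset) */
	unsigned int end;	/* dirty range end (exclusive) */
	bool dirty;
};

int main(void)
{
	/* Dirty tail of the page at 8192 after the 1834-byte write. */
	struct page_req req = { .start = 8192, .end = 10026, .dirty = true };

	/* Writeback: snapshot the dirty range, put the WRITE on the wire. */
	unsigned int wb_end = req.end;
	printf("writeback: WRITE for [%u, %u)\n", req.start, wb_end);

	/* Meanwhile, write_end extends the same request to the end of
	 * the page (the 2262-byte write at offset 10026). */
	req.end = 12288;
	req.dirty = true;

	/* Hypothetical buggy completion: clear the dirty state without
	 * rechecking whether the range grew while the WRITE was in flight. */
	req.dirty = false;

	if (!req.dirty && req.end > wb_end)
		printf("lost dirty range: [%u, %u) never gets written back\n",
		       wb_end, req.end);
	return 0;
}

If something along those lines is happening, the request covering
8192-12288 ends up clean even though 10026-12288 was never put on the
wire, which would match the hole we see.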

My thinking at this point is that since we know the userland writes are
contiguous in this case, I can make nfs_writepages look for holes in
the data to be written back, and dump info about the page / nfs_page
with the hole.
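
As a userspace model of what that check would report (the real change
would live in the nfs_writepages path and walk the actual nfs_page
requests; this just shows the idea):

/*
 * Model of the planned debug check, not the actual kernel change:
 * walk write requests in offset order and report any gap, since this
 * workload writes strictly sequentially.
 */
#include <stdio.h>

struct wb_req {
	unsigned long long offset;
	unsigned long count;
};

static void check_for_holes(const struct wb_req *reqs, int nr)
{
	unsigned long long expected;
	int i;

	if (nr == 0)
		return;

	expected = reqs[0].offset;
	for (i = 0; i < nr; i++) {
		if (reqs[i].offset > expected)
			printf("hole: [%llu, %llu) is %llu bytes\n",
			       expected, reqs[i].offset,
			       reqs[i].offset - expected);
		expected = reqs[i].offset + reqs[i].count;
	}
}

int main(void)
{
	/* The two on-the-wire writes bracketing the corrupted range. */
	struct wb_req reqs[] = {
		{ 0,     10026  },
		{ 12288, 489012 },
	};

	check_for_holes(reqs, 2);	/* reports the 2262-byte hole at 10026 */
	return 0;
}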

Happy to entertain other ideas too.

Cheers,
--
Jeff Layton <jlayton@kernel.org>