linux-xfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Darrick J. Wong" <djwong@kernel.org>
To: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Cc: Brian Foster <bfoster@redhat.com>,
	John Garry <john.g.garry@oracle.com>,
	Zorro Lang <zlang@redhat.com>,
	fstests@vger.kernel.org, Ritesh Harjani <ritesh.list@gmail.com>,
	tytso@mit.edu, linux-xfs@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org
Subject: Re: [PATCH v7 04/12] ltp/fsx.c: Add atomic writes support to fsx
Date: Tue, 21 Oct 2025 10:44:06 -0700	[thread overview]
Message-ID: <20251021174406.GR6178@frogsfrogsfrogs> (raw)
In-Reply-To: <aPd1aKVFImgXpULn@li-dc0c254c-257c-11b2-a85c-98b6c1322444.ibm.com>

On Tue, Oct 21, 2025 at 05:28:32PM +0530, Ojaswin Mujoo wrote:
> On Tue, Oct 21, 2025 at 07:30:32AM -0400, Brian Foster wrote:
> > On Tue, Oct 21, 2025 at 03:58:23PM +0530, Ojaswin Mujoo wrote:
> > > On Mon, Oct 20, 2025 at 11:33:40AM +0100, John Garry wrote:
> > > > On 06/10/2025 14:20, Ojaswin Mujoo wrote:
> > > > > Hi Zorro, thanks for checking this. So correct me if im wrong but I
> > > > > understand that you have run this test on an atomic writes enabled
> > > > > kernel where the stack also supports atomic writes.
> > > > > 
> > > > > Looking at the bad data log:
> > > > > 
> > > > > 	+READ BAD DATA: offset = 0x1c000, size = 0x1803, fname = /mnt/xfstests/test/junk
> > > > > 	+OFFSET      GOOD    BAD     RANGE
> > > > > 	+0x1c000     0x0000  0xcdcd  0x0
> > > > > 	+operation# (mod 256) for the bad data may be 205
> > > > > 
> > > > > We see that 0x0000 was expected but we got 0xcdcd. Now the operation
> > > > > that caused this is indicated to be 205, but looking at that operation:
> > > > > 
> > > > > +205(205 mod 256): ZERO     0x6dbe6 thru 0x6e6aa	(0xac5 bytes)
> > > > > 
> > > > > This doesn't even overlap the range that is bad. (0x1c000 to 0x1c00f).
> > > > > Infact, it does seem like an unlikely coincidence that the actual data
> > > > > in the bad range is 0xcdcd which is something xfs_io -c "pwrite" writes
> > > > > to default (fsx writes random data in even offsets and operation num in
> > > > > odd).
> > > > > 
> > > > > I am able to replicate this but only on XFS but not on ext4 (atleast not
> > > > > in 20 runs).  I'm trying to better understand if this is a test issue or
> > > > > not. Will keep you update.
> > > > 
> > > > 
> > > > Hi Ojaswin,
> > > > 
> > > > Sorry for the very slow response.
> > > > 
> > > > Are you still checking this issue?
> > > > 
> > > > To replicate, should I just take latest xfs kernel and run this series on
> > > > top of latest xfstests? Is it 100% reproducible?
> > > > 
> > > > Thanks,
> > > > John
> > > 
> > > Hi John,
> > > 
> > > Yes Im looking into it but I'm now starting to run into some reflink/cow
> > > based concepts that are taking time to understand. Let me share what I
> > > have till now:
> > > 
> > > So the test.sh that I'm using can be found here [1] which just uses an
> > > fsx replay file (which replays all operations) present in the same repo
> > > [2]. If you see the replay file, there are a bunch of random operations
> > > followed by the last 2 commented out operations:
> > > 
> > > # copy_range 0xd000 0x1000 0x1d800 0x44000   <--- # operations <start> <len> <dest of copy> <filesize (can be ignored)>
> > > # mapread 0x1e000 0x1000 0x1e400 *
> > > 
> > > The copy_range here is the one which causes (or exposes) the corruption
> > > at 0x1e800 (the end of copy range destination gets corrupted).
> > > 
> > > To have more control, I commented these 2 operations and am doing it by
> > > hand in the test.sh file, with xfs_io. I'm also using a non atomic write
> > > device so we only have S/W fallback.
> > > 
> > > Now some observations:
> > > 
> > > 1. The copy_range operations is actually copying from a hole to a hole,
> > > so we should be reading all 0s. But What I see is the following happening:
> > > 
> > >   vfs_copy_file_range
> > >    do_splice_direct
> > >     do_splice_direct_actor
> > >      do_splice_read
> > >        # Adds the folio at src offset to the pipe. I confirmed this is all 0x0.
> > >      splice_direct_to_actor
> > >       direct_splice_actor
> > >        do_splice_from
> > >         iter_file_splice_write
> > >          xfs_file_write_iter
> > >           xfs_file_buffered_write
> > >            iomap_file_buferred_write
> > >             iomap_iter
> > >              xfs_buferred_write_iomap_begin
> > >                # Here we correctly see that there is noting at the
> > >                # destination in data fork, but somehow we find a mapped
> > >                # extent in cow fork which is returned to iomap.
> > >              iomap_write_iter
> > >               __iomap_write_begin
> > >                 # Here we notice folio is not uptodate and call
> > >                 # iomap_read_folio_range() to read from the cow_fork
> > >                 # mapping we found earlier. This results in folio having
> > >                 # incorrect data at 0x1e800 offset.
> > > 
> > >  So it seems like the fsx operations might be corrupting the cow fork state
> > >  somehow leading to stale data exposure. 
> > > 
> > > 2. If we disable atomic writes we dont hit the issue.
> > > 
> > > 3. If I do a -c pread of the destination range before doing the
> > > copy_range operation then I don't see the corruption any more.

Yeah, I stopped seeing failures after adding -X (verify data after every
operation) to FSX_AVOID.

> > > I'm now trying to figure out why the mapping returned is not IOMAP_HOLE
> > > as it should be. I don't know the COW path in xfs so there are some gaps
> > > in my understanding. Let me know if you need any other information since
> > > I'm reliably able to replicate on 6.17.0-rc4.
> > > 
> > 
> > I haven't followed your issue closely, but just on this hole vs. COW
> > thing, XFS has a bit of a quirk where speculative COW fork preallocation
> > can expand out over holes in the data fork. If iomap lookup for buffered
> > write sees COW fork blocks present, it reports those blocks as the
> > primary mapping even if the data fork happens to be a hole (since
> > there's no point in allocating blocks to the data fork when we can just
> > remap).

That sounds like a bug -- if a sub-fsblock write to an uncached file
range has to read data in from disk, then xfs needs to pass the data
fork mapping to iomap even if it's a read.

Can you capture the ftrace output of the iomap_iter_*map tracepoints?

> > Again I've no idea if this relates to your issue or what you're
> > referring to as a hole (i.e. data fork only?), but just pointing it out.
> > The latest iomap/xfs patches I posted a few days ago kind of dance
> > around this a bit, but I was somewhat hoping that maybe the cleanups
> > there would trigger some thoughts on better iomap reporting in that
> > regard.
> 
> Hi Brian, Thanks for the details and yes by "hole" i did mean hole in
> data fork only. The part that I'm now confused about is does this sort
> of preallocation extent hold any valid data? IIUC it should not, so I

No.  Mappings in the cow fork are not fully written and should never be
used for reads.

> would expect it to trigger iomap_block_needs_zeroing() to write zeroes
> to the folio. Instead, what I see in the issue is that we are trying to
> do disk read.

Hrm.  Part of the problem here might be that iomap_read_folio_range
ignores iomap_iter::srcmap if it's type IOMAP_HOLE (see
iomap_iter_srcmap), even if the filesystem actually *set* the srcmap to
a hole.

FWIW I see a somewhat different failure -- not data corruption, but
pwrite returning failure:

--- /run/fstests/bin/tests/generic/521.out      2025-07-15 14:45:15.100315255 -0700
+++ /var/tmp/fstests/generic/521.out.bad        2025-10-21 10:33:39.032263811 -0700
@@ -1,2 +1,668 @@
 QA output created by 521
+dowrite: write: Input/output error
+LOG DUMP (661 total operations):
+1(  1 mod 256): TRUNCATE UP    from 0x0 to 0x1d000
+2(  2 mod 256): DEDUPE 0x19000 thru 0x1bfff    (0x3000 bytes) to 0x13000 thru 0x15fff
+3(  3 mod 256): SKIPPED (no operation)
+4(  4 mod 256): PUNCH    0x5167 thru 0x12d1c   (0xdbb6 bytes)
+5(  5 mod 256): WRITE    0x79000 thru 0x86fff  (0xe000 bytes) HOLE
+6(  6 mod 256): PUNCH    0x32344 thru 0x36faf  (0x4c6c bytes)
+7(  7 mod 256): READ     0x0 thru 0xfff        (0x1000 bytes)
+8(  8 mod 256): WRITE    0xe000 thru 0x11fff   (0x4000 bytes)
+9(  9 mod 256): PUNCH    0x71324 thru 0x86fff  (0x15cdc bytes)
+10( 10 mod 256): MAPREAD  0x5b000 thru 0x6d218 (0x12219 bytes)
+11( 11 mod 256): COLLAPSE 0x70000 thru 0x79fff (0xa000 bytes)
+12( 12 mod 256): WRITE    0x41000 thru 0x50fff (0x10000 bytes)
+13( 13 mod 256): INSERT 0x39000 thru 0x4dfff   (0x15000 bytes)
+14( 14 mod 256): WRITE    0x34000 thru 0x37fff (0x4000 bytes)
+15( 15 mod 256): MAPREAD  0x55000 thru 0x6ee44 (0x19e45 bytes)
+16( 16 mod 256): READ     0x46000 thru 0x55fff (0x10000 bytes)
+17( 17 mod 256): PUNCH    0x1ccea thru 0x23b2e (0x6e45 bytes)
+18( 18 mod 256): COPY 0x2a000 thru 0x35fff     (0xc000 bytes) to 0x52000 thru 0x5dfff
+19( 19 mod 256): SKIPPED (no operation)
+20( 20 mod 256): WRITE    0x10000 thru 0x1ffff (0x10000 bytes)
<snip>
+645(133 mod 256): READ     0x5000 thru 0x16fff (0x12000 bytes)
+646(134 mod 256): PUNCH    0x3a51d thru 0x41978        (0x745c bytes)
+647(135 mod 256): FALLOC   0x47f4c thru 0x54867        (0xc91b bytes) INTERIOR
+648(136 mod 256): WRITE    0xa000 thru 0x1dfff (0x14000 bytes)
+649(137 mod 256): CLONE 0x83000 thru 0x89fff   (0x7000 bytes) to 0x4b000 thru 0x51fff
+650(138 mod 256): TRUNCATE DOWN        from 0x8bac4 to 0x7e000
+651(139 mod 256): MAPWRITE 0x13000 thru 0x170e6        (0x40e7 bytes)
+652(140 mod 256): XCHG 0x6a000 thru 0x7cfff    (0x13000 bytes) to 0x8000 thru 0x1afff
+653(141 mod 256): XCHG 0x35000 thru 0x3cfff    (0x8000 bytes) to 0x1b000 thru 0x22fff
+654(142 mod 256): CLONE 0x47000 thru 0x60fff   (0x1a000 bytes) to 0x65000 thru 0x7efff
+655(143 mod 256): DEDUPE 0x79000 thru 0x7dfff  (0x5000 bytes) to 0x6e000 thru 0x72fff
+656(144 mod 256): XCHG 0x4d000 thru 0x5ffff    (0x13000 bytes) to 0x8000 thru 0x1afff
+657(145 mod 256): PUNCH    0x7194f thru 0x7efff        (0xd6b1 bytes)
+658(146 mod 256): PUNCH    0x7af7e thru 0x7efff        (0x4082 bytes)
+659(147 mod 256): MAPREAD  0x77000 thru 0x7e55d        (0x755e bytes)
+660(148 mod 256): READ     0x58000 thru 0x64fff        (0xd000 bytes)
+661(149 mod 256): WRITE    0x88000 thru 0x8bfff        (0x4000 bytes) HOLE
+Log of operations saved to "/mnt/junk.fsxops"; replay with --replay-ops
+Correct content saved for comparison
+(maybe hexdump "/mnt/junk" vs "/mnt/junk.fsxgood")

Curiously there are no EIO errors logged in dmesg.

--D

> Regards,
> ojaswin
> > 
> > Brian
> 
> > 
> > > [1]
> > > https://github.com/OjaswinM/fsx-aw-issue/tree/master
> > > 
> > > [2] https://github.com/OjaswinM/fsx-aw-issue/blob/master/repro.fsxops
> > > 
> > > regards,
> > > ojaswin
> > > 
> > 
> 

  reply	other threads:[~2025-10-21 17:44 UTC|newest]

Thread overview: 49+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-09-19  6:47 [PATCH v7 00/11] Add more tests for multi fs block atomic writes Ojaswin Mujoo
2025-09-19  6:47 ` [PATCH v7 01/12] common/rc: Add _min() and _max() helpers Ojaswin Mujoo
2025-09-19  6:47 ` [PATCH v7 02/12] common/rc: Add fio atomic write helpers Ojaswin Mujoo
2025-09-19 16:27   ` Darrick J. Wong
2025-09-19  6:47 ` [PATCH v7 03/12] common/rc: Add a helper to run fsx on a given file Ojaswin Mujoo
2025-09-19  6:47 ` [PATCH v7 04/12] ltp/fsx.c: Add atomic writes support to fsx Ojaswin Mujoo
2025-09-28  8:55   ` Zorro Lang
2025-09-28 13:19   ` Zorro Lang
2025-10-02 17:56     ` Ojaswin Mujoo
2025-10-03 17:19       ` Zorro Lang
2025-10-05 12:57         ` Ojaswin Mujoo
2025-10-05 15:39           ` Zorro Lang
2025-10-06 13:20             ` Ojaswin Mujoo
2025-10-07  9:58               ` Ojaswin Mujoo
2025-10-17 16:01                 ` Zorro Lang
2025-10-17 16:27                   ` Darrick J. Wong
2025-10-17 18:47                     ` Zorro Lang
2025-10-17 22:52                       ` Darrick J. Wong
2025-10-20 10:33               ` John Garry
2025-10-21 10:28                 ` Ojaswin Mujoo
2025-10-21 11:30                   ` Brian Foster
2025-10-21 11:58                     ` Ojaswin Mujoo
2025-10-21 17:44                       ` Darrick J. Wong [this message]
2025-10-22  7:40                         ` Ojaswin Mujoo
2025-10-23 15:44                           ` John Garry
2025-10-23 17:55                             ` Darrick J. Wong
2025-10-29 18:11   ` [PATCH] xfs: fix write failures in software-provided atomic writes Darrick J. Wong
2025-10-29 18:13     ` Darrick J. Wong
2025-10-30 13:52     ` John Garry
2025-10-30 15:01       ` Darrick J. Wong
2025-10-30 16:35         ` John Garry
2025-10-30 19:38           ` John Garry
2025-10-31  4:30             ` Darrick J. Wong
2025-10-31 10:17               ` John Garry
2025-10-31 17:13                 ` Darrick J. Wong
2025-11-03 12:16                   ` John Garry
2025-11-03 18:01                     ` Darrick J. Wong
2025-10-31  8:08             ` Ojaswin Mujoo
2025-10-31 10:04               ` John Garry
2025-09-19  6:47 ` [PATCH v7 05/12] generic: Add atomic write test using fio crc check verifier Ojaswin Mujoo
2025-10-28  9:42   ` Ojaswin Mujoo
2025-11-01  9:00     ` Zorro Lang
2025-09-19  6:47 ` [PATCH v7 06/12] generic: Add atomic write test using fio verify on file mixed mappings Ojaswin Mujoo
2025-09-19  6:48 ` [PATCH v7 07/12] generic: Add atomic write multi-fsblock O_[D]SYNC tests Ojaswin Mujoo
2025-09-19  6:48 ` [PATCH v7 08/12] generic: Stress fsx with atomic writes enabled Ojaswin Mujoo
2025-09-19  6:48 ` [PATCH v7 09/12] generic: Add sudden shutdown tests for multi block atomic writes Ojaswin Mujoo
2025-09-19  6:48 ` [PATCH v7 10/12] ext4: Test atomic write and ioend codepaths with bigalloc Ojaswin Mujoo
2025-09-19  6:48 ` [PATCH v7 11/12] ext4: Test atomic writes allocation and write " Ojaswin Mujoo
2025-09-19  6:48 ` [PATCH v7 12/12] ext4: Atomic write test for extent split across leaf nodes Ojaswin Mujoo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20251021174406.GR6178@frogsfrogsfrogs \
    --to=djwong@kernel.org \
    --cc=bfoster@redhat.com \
    --cc=fstests@vger.kernel.org \
    --cc=john.g.garry@oracle.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=ojaswin@linux.ibm.com \
    --cc=ritesh.list@gmail.com \
    --cc=tytso@mit.edu \
    --cc=zlang@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).