From: "Darrick J. Wong" <djwong@kernel.org>
To: John Garry <john.g.garry@oracle.com>
Cc: Carlos Maiolino <cem@kernel.org>,
Ojaswin Mujoo <ojaswin@linux.ibm.com>,
Zorro Lang <zlang@redhat.com>,
fstests@vger.kernel.org, Ritesh Harjani <ritesh.list@gmail.com>,
linux-xfs@vger.kernel.org
Subject: Re: [PATCH] xfs: fix write failures in software-provided atomic writes
Date: Fri, 31 Oct 2025 10:13:30 -0700 [thread overview]
Message-ID: <20251031171330.GC6178@frogsfrogsfrogs> (raw)
In-Reply-To: <d787aed1-19ad-4fb9-ba64-33d754d46e5f@oracle.com>
On Fri, Oct 31, 2025 at 10:17:56AM +0000, John Garry wrote:
> On 31/10/2025 04:30, Darrick J. Wong wrote:
> > > @@ -1215,6 +1216,7 @@ xfs_atomic_write_cow_iomap_begin(
> > > return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags, IOMAP_F_SHARED, seq);
> > >
> > > I think that the problem may be that we were converting an inappropriate
> > > number of blocks from unwritten to real allocations (but never writing to
> > > the excess blocks). Does it look ok?
> > That looks like a good correction to me; I'll run that on my test fleet
> > overnight and we'll see what happens. Thanks for putting this together!
>
> Cool, but I am not confident that it is a completely correct. Here's the
> updated code:
>
> int error;
> u64 seq;
> + xfs_filblks_t count_fsb_orig = count_fsb;
>
> ASSERT(flags & IOMAP_WRITE);
> ASSERT(flags & IOMAP_DIRECT);
> @@ -1202,7 +1203,7 @@ xfs_atomic_write_cow_iomap_begin(
> found:
> if (cmap.br_state != XFS_EXT_NORM) {
> error = xfs_reflink_convert_cow_locked(ip, offset_fsb,
> - count_fsb);
> + count_fsb_orig);
> if (error)
> goto out_unlock;
> cmap.br_state = XFS_EXT_NORM;
Er... this is exactly the same snippet as yesterday. <confused>
(That snippet seems to have survived overnight fsx)
> cmap may be longer than count_fsb_orig (which was the failing scenario). In
> that case, after calling xfs_reflink_convert_cow_locked(), we would have
> partially converted cmap, so it is proper to set cmap.br_state =
> XFS_EXT_NORM?
Hrm. Let me walk through this function again.
At the start, {offset,count}_fsb express the offset/length parameters,
but in fsblock units and expanded to align with an fsblock.
If the cow fork has a mapping (cmap) that starts before the offset, we
trim the mapping to the desired range and goto found. cmap cannot end
before offset_fsb because that's how xfs_iext_lookup_extent works.
If the cow fork doesn't have a mapping or the one it does have doesn't
start until after offset_fsb, then we trim count_fsb so that
(offset_fsb, count_fsb) is the range that isn't mapped to space.
Then we cycle the ILOCK to get a transaction and do the lookup again
due to "extent layout could have changed since the unlock". Same rules
as the first lookup. I wonder if that second xfs_trim_extent should be
using orig_count_fsb, because at this point it's trimming the cmap to
the shorter range. It's probably not a big deal because iomap will
just call ->iomap_begin on the rest of the range, but it's more work.
If the second lookup doesn't produce a mapping, then we call
xfs_bmapi_write to fill the hole and drop down to @found.
Now, we're at the found: label, having arrived here in one of three
ways:
1) The first cmap lookup found at least one block that it can use.
(offset_fsb, count_fsb) is the original range that iomap asked for.
This mapping could be written or unwritten, and it's been trimmed
so that it won't extend outside the requested write range.
2) The second cmap lookup found at least one block that it can use.
(offset_fsb, count_fsb) is a truncated range because the cow fork
has a mapping that starts at (offset_fsb + count_fsb). This mapping
could also be written or unwritten, and it also has been trimmed so
that it won't extend outside the hole range.
Why do we trim cmap to the hole range? The original write range
will suite iomap just fine.
3) We xfs_bmapi_write'd a hole, and now we have an unwritten mapping.
(offset_fsb, count_fsb) is the same truncated range from 2).
cmap is potentially much larger than (offset_fsb, count_fsb).
Why do we not trim this mapping to the original write range?
If at this point the mapping is unwritten, we convert it to written
mapping with xfs_reflink_convert_cow_locked. offset_fsb retains its
original value, but what is count_fsb? It's either the original value
(1) or the smaller one from filling the hole (2) or (3)? Why don't we
pass the cmap startoff/blockcount to this function?
As an aside: Changing count_fsb makes it harder for me to understand
what's going on in this function.
Now, having converted either the range we arrived at via (1-3) (or with
your patch, the original range) to written state, we set br_state and
pass that back to iomap. I think in case (3) it's possible that xfs
incorrectly reports to iomap an overly large mapping that might not
actually reflect the cow fork contents, because we only converted so
much of the mapping state.
> We should trim cmap to count_fsb_orig also, right?
iomap doesn't care if the mapping it receives spans more space than just
the range it asked for, but XFS shouldn't be exposing mappings that
don't reflect the actual state of the cow fork.
I think there are several things wrong with this function:
A) xfs_bmapi_write can return a much larger unwritten mapping than what
the caller asked for. We convert part of that range to written, but
return the entire written mapping to iomap even though that's
inaccurate.
B) The arguments to xfs_reflink_convert_cow_locked are wrong -- an
unwritten mapping could be *smaller* than the write range (or even
the hole range). In this case, we convert too much file range to
written state because we then return a smaller mapping to iomap.
C) It doesn't handle delalloc mappings. This I covered in the patch
that I already sent to the list.
D) Reassigning count_fsb to handle the hole means that if the second
cmap lookup attempt succeeds (due to racing with someone else) we
trim the mapping more than is strictly necessary. The changing
meaning of count_fsb makes this harder to notice.
E) The tracepoint is kinda wrong because @length is mutated. That makes
it harder to chase the data flows through this function because you
can't just grep on the pos/bytecount strings.
F) We don't actually check that the br_state = XFS_EXT_NORM assignment
is accurate, i.e. that the cow fork contains a written mapping for
the range that we're interested in.
G) Somewhat inadequate documentation of why we need to xfs_trim_extent
so aggressively in this function.
H) Not sure why xfs_iomap_end_fsb is used here, the vfs already clamped
the write range to s_maxbytes.
> I don't think that it makes much of a difference, but it seems the proper
> thing to do. Maybe the subsequent traces length values would be inconsistent
> with other path to @found label if we don't trim.
With A-H fixed, all the atomic writes issues I was seeing went away.
I'll run this through the atomic writes QA tests and send a fix series
to the list.
--D
> Thank,
> John
>
>
>
next prev parent reply other threads:[~2025-10-31 17:13 UTC|newest]
Thread overview: 49+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-09-19 6:47 [PATCH v7 00/11] Add more tests for multi fs block atomic writes Ojaswin Mujoo
2025-09-19 6:47 ` [PATCH v7 01/12] common/rc: Add _min() and _max() helpers Ojaswin Mujoo
2025-09-19 6:47 ` [PATCH v7 02/12] common/rc: Add fio atomic write helpers Ojaswin Mujoo
2025-09-19 16:27 ` Darrick J. Wong
2025-09-19 6:47 ` [PATCH v7 03/12] common/rc: Add a helper to run fsx on a given file Ojaswin Mujoo
2025-09-19 6:47 ` [PATCH v7 04/12] ltp/fsx.c: Add atomic writes support to fsx Ojaswin Mujoo
2025-09-28 8:55 ` Zorro Lang
2025-09-28 13:19 ` Zorro Lang
2025-10-02 17:56 ` Ojaswin Mujoo
2025-10-03 17:19 ` Zorro Lang
2025-10-05 12:57 ` Ojaswin Mujoo
2025-10-05 15:39 ` Zorro Lang
2025-10-06 13:20 ` Ojaswin Mujoo
2025-10-07 9:58 ` Ojaswin Mujoo
2025-10-17 16:01 ` Zorro Lang
2025-10-17 16:27 ` Darrick J. Wong
2025-10-17 18:47 ` Zorro Lang
2025-10-17 22:52 ` Darrick J. Wong
2025-10-20 10:33 ` John Garry
2025-10-21 10:28 ` Ojaswin Mujoo
2025-10-21 11:30 ` Brian Foster
2025-10-21 11:58 ` Ojaswin Mujoo
2025-10-21 17:44 ` Darrick J. Wong
2025-10-22 7:40 ` Ojaswin Mujoo
2025-10-23 15:44 ` John Garry
2025-10-23 17:55 ` Darrick J. Wong
2025-10-29 18:11 ` [PATCH] xfs: fix write failures in software-provided atomic writes Darrick J. Wong
2025-10-29 18:13 ` Darrick J. Wong
2025-10-30 13:52 ` John Garry
2025-10-30 15:01 ` Darrick J. Wong
2025-10-30 16:35 ` John Garry
2025-10-30 19:38 ` John Garry
2025-10-31 4:30 ` Darrick J. Wong
2025-10-31 10:17 ` John Garry
2025-10-31 17:13 ` Darrick J. Wong [this message]
2025-11-03 12:16 ` John Garry
2025-11-03 18:01 ` Darrick J. Wong
2025-10-31 8:08 ` Ojaswin Mujoo
2025-10-31 10:04 ` John Garry
2025-09-19 6:47 ` [PATCH v7 05/12] generic: Add atomic write test using fio crc check verifier Ojaswin Mujoo
2025-10-28 9:42 ` Ojaswin Mujoo
2025-11-01 9:00 ` Zorro Lang
2025-09-19 6:47 ` [PATCH v7 06/12] generic: Add atomic write test using fio verify on file mixed mappings Ojaswin Mujoo
2025-09-19 6:48 ` [PATCH v7 07/12] generic: Add atomic write multi-fsblock O_[D]SYNC tests Ojaswin Mujoo
2025-09-19 6:48 ` [PATCH v7 08/12] generic: Stress fsx with atomic writes enabled Ojaswin Mujoo
2025-09-19 6:48 ` [PATCH v7 09/12] generic: Add sudden shutdown tests for multi block atomic writes Ojaswin Mujoo
2025-09-19 6:48 ` [PATCH v7 10/12] ext4: Test atomic write and ioend codepaths with bigalloc Ojaswin Mujoo
2025-09-19 6:48 ` [PATCH v7 11/12] ext4: Test atomic writes allocation and write " Ojaswin Mujoo
2025-09-19 6:48 ` [PATCH v7 12/12] ext4: Atomic write test for extent split across leaf nodes Ojaswin Mujoo
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20251031171330.GC6178@frogsfrogsfrogs \
--to=djwong@kernel.org \
--cc=cem@kernel.org \
--cc=fstests@vger.kernel.org \
--cc=john.g.garry@oracle.com \
--cc=linux-xfs@vger.kernel.org \
--cc=ojaswin@linux.ibm.com \
--cc=ritesh.list@gmail.com \
--cc=zlang@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox