From: "Darrick J. Wong" <djwong@kernel.org>
To: John Garry <john.g.garry@oracle.com>
Cc: Carlos Maiolino <cem@kernel.org>,
Ojaswin Mujoo <ojaswin@linux.ibm.com>,
Zorro Lang <zlang@redhat.com>,
fstests@vger.kernel.org, Ritesh Harjani <ritesh.list@gmail.com>,
linux-xfs@vger.kernel.org
Subject: Re: [PATCH] xfs: fix write failures in software-provided atomic writes
Date: Mon, 3 Nov 2025 10:01:12 -0800 [thread overview]
Message-ID: <20251103180112.GD196370@frogsfrogsfrogs> (raw)
In-Reply-To: <64128e92-44a7-4830-86e7-2c98383a9f28@oracle.com>
On Mon, Nov 03, 2025 at 12:16:14PM +0000, John Garry wrote:
> On 31/10/2025 17:13, Darrick J. Wong wrote:
> > On Fri, Oct 31, 2025 at 10:17:56AM +0000, John Garry wrote:
> > > On 31/10/2025 04:30, Darrick J. Wong wrote:
> > > > > @@ -1215,6 +1216,7 @@ xfs_atomic_write_cow_iomap_begin(
> > > > > return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags, IOMAP_F_SHARED, seq);
> > > > >
> > > > > I think that the problem may be that we were converting an inappropriate
> > > > > number of blocks from unwritten to real allocations (but never writing to
> > > > > the excess blocks). Does it look ok?
> > > > That looks like a good correction to me; I'll run that on my test fleet
> > > > overnight and we'll see what happens. Thanks for putting this together!
> > >
> > > Cool, but I am not confident that it is a completely correct. Here's the
> > > updated code:
> > >
> > > int error;
> > > u64 seq;
> > > + xfs_filblks_t count_fsb_orig = count_fsb;
> > >
> > > ASSERT(flags & IOMAP_WRITE);
> > > ASSERT(flags & IOMAP_DIRECT);
> > > @@ -1202,7 +1203,7 @@ xfs_atomic_write_cow_iomap_begin(
> > > found:
> > > if (cmap.br_state != XFS_EXT_NORM) {
> > > error = xfs_reflink_convert_cow_locked(ip, offset_fsb,
> > > - count_fsb);
> > > + count_fsb_orig);
> > > if (error)
> > > goto out_unlock;
> > > cmap.br_state = XFS_EXT_NORM;
> >
> > Er... this is exactly the same snippet as yesterday. <confused>
>
> Yes, I was just showing it for context.
>
> >
> > (That snippet seems to have survived overnight fsx)
> >
> > > cmap may be longer than count_fsb_orig (which was the failing scenario). In
> > > that case, after calling xfs_reflink_convert_cow_locked(), we would have
> > > partially converted cmap, so it is proper to set cmap.br_state =
> > > XFS_EXT_NORM?
> >
> > Hrm. Let me walk through this function again.
> >
> > At the start, {offset,count}_fsb express the offset/length parameters,
> > but in fsblock units and expanded to align with an fsblock.
>
> I do also note that the usage of xfs_iomap_end_fsb() is a bit dubious, as
> this will truncate the write if it goes over s_maxbytes. However, we never
> want to truncate an atomic write.
>
> >
> > If the cow fork has a mapping (cmap) that starts before the offset, we
> > trim the mapping to the desired range and goto found. cmap cannot end
> > before offset_fsb because that's how xfs_iext_lookup_extent works.
> >
> > If the cow fork doesn't have a mapping or the one it does have doesn't
> > start until after offset_fsb, then we trim count_fsb so that
> > (offset_fsb, count_fsb) is the range that isn't mapped to space.
> >
> > Then we cycle the ILOCK to get a transaction and do the lookup again
> > due to "extent layout could have changed since the unlock". Same rules
> > as the first lookup. I wonder if that second xfs_trim_extent should be
> > using orig_count_fsb, because at this point it's trimming the cmap to
> > the shorter range.
>
> Yes, I think so.
>
> > It's probably not a big deal because iomap will
> > just call ->iomap_begin on the rest of the range, but it's more work.
> >
> > If the second lookup doesn't produce a mapping, then we call
> > xfs_bmapi_write to fill the hole and drop down to @found.
> >
> > Now, we're at the found: label, having arrived here in one of three
> > ways:
> >
> > 1) The first cmap lookup found at least one block that it can use.
> > (offset_fsb, count_fsb) is the original range that iomap asked for.
> > This mapping could be written or unwritten, and it's been trimmed
> > so that it won't extend outside the requested write range.
> >
> > 2) The second cmap lookup found at least one block that it can use.
> > (offset_fsb, count_fsb) is a truncated range because the cow fork
> > has a mapping that starts at (offset_fsb + count_fsb). This mapping
> > could also be written or unwritten, and it also has been trimmed so
> > that it won't extend outside the hole range.
> >
> > Why do we trim cmap to the hole range? The original write range
> > will suite iomap just fine.
>
> I think that the xfs_iext_lookup_extent() and xfs_trim_extent() pattern may
> have been just copied without considering this.
>
> >
> > 3) We xfs_bmapi_write'd a hole, and now we have an unwritten mapping.
> > (offset_fsb, count_fsb) is the same truncated range from 2).
> > cmap is potentially much larger than (offset_fsb, count_fsb).
> >
> > Why do we not trim this mapping to the original write range?
> >
>
> I don't know, but this is what I was suggesting should happen.
>
> > If at this point the mapping is unwritten, we convert it to written
> > mapping with xfs_reflink_convert_cow_locked. offset_fsb retains its
> > original value, but what is count_fsb? It's either the original value
> > (1) or the smaller one from filling the hole (2) or (3)? Why don't we
> > pass the cmap startoff/blockcount to this function?
>
> I think that we should
>
> >
> > As an aside: Changing count_fsb makes it harder for me to understand
> > what's going on in this function.
> >
> > Now, having converted either the range we arrived at via (1-3) (or with
> > your patch, the original range) to written state, we set br_state and
> > pass that back to iomap. I think in case (3) it's possible that xfs
> > incorrectly reports to iomap an overly large mapping that might not
> > actually reflect the cow fork contents, because we only converted so
> > much of the mapping state.
> >
> > > We should trim cmap to count_fsb_orig also, right?
> >
> > iomap doesn't care if the mapping it receives spans more space than just
> > the range it asked for, but XFS shouldn't be exposing mappings that
> > don't reflect the actual state of the cow fork.
> >
> > I think there are several things wrong with this function:
> >
> > A) xfs_bmapi_write can return a much larger unwritten mapping than what
> > the caller asked for. We convert part of that range to written, but
> > return the entire written mapping to iomap even though that's
> > inaccurate.
> >
> > B) The arguments to xfs_reflink_convert_cow_locked are wrong -- an
> > unwritten mapping could be *smaller* than the write range (or even
> > the hole range). In this case, we convert too much file range to
> > written state because we then return a smaller mapping to iomap.
> >
> > C) It doesn't handle delalloc mappings. This I covered in the patch
> > that I already sent to the list.
> >
> > D) Reassigning count_fsb to handle the hole means that if the second
> > cmap lookup attempt succeeds (due to racing with someone else) we
> > trim the mapping more than is strictly necessary. The changing
> > meaning of count_fsb makes this harder to notice.
> >
> > E) The tracepoint is kinda wrong because @length is mutated. That makes
> > it harder to chase the data flows through this function because you
> > can't just grep on the pos/bytecount strings.
>
> Yes, I was noticing this also.
>
> >
> > F) We don't actually check that the br_state = XFS_EXT_NORM assignment
> > is accurate, i.e. that the cow fork contains a written mapping for
> > the range that we're interested in.
> >
> > G) Somewhat inadequate documentation of why we need to xfs_trim_extent
> > so aggressively in this function.
> >
> > H) Not sure why xfs_iomap_end_fsb is used here, the vfs already clamped
> > the write range to s_maxbytes.
>
> Please note that I mentioned about this above.
>
> >
> > > I don't think that it makes much of a difference, but it seems the proper
> > > thing to do. Maybe the subsequent traces length values would be inconsistent
> > > with other path to @found label if we don't trim.
> >
> > With A-H fixed, all the atomic writes issues I was seeing went away.
> > I'll run this through the atomic writes QA tests and send a fix series
> > to the list.
>
> ok, thanks!
<nod> Fixes have now been sent to the list.
--D
next prev parent reply other threads:[~2025-11-03 18:01 UTC|newest]
Thread overview: 49+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-09-19 6:47 [PATCH v7 00/11] Add more tests for multi fs block atomic writes Ojaswin Mujoo
2025-09-19 6:47 ` [PATCH v7 01/12] common/rc: Add _min() and _max() helpers Ojaswin Mujoo
2025-09-19 6:47 ` [PATCH v7 02/12] common/rc: Add fio atomic write helpers Ojaswin Mujoo
2025-09-19 16:27 ` Darrick J. Wong
2025-09-19 6:47 ` [PATCH v7 03/12] common/rc: Add a helper to run fsx on a given file Ojaswin Mujoo
2025-09-19 6:47 ` [PATCH v7 04/12] ltp/fsx.c: Add atomic writes support to fsx Ojaswin Mujoo
2025-09-28 8:55 ` Zorro Lang
2025-09-28 13:19 ` Zorro Lang
2025-10-02 17:56 ` Ojaswin Mujoo
2025-10-03 17:19 ` Zorro Lang
2025-10-05 12:57 ` Ojaswin Mujoo
2025-10-05 15:39 ` Zorro Lang
2025-10-06 13:20 ` Ojaswin Mujoo
2025-10-07 9:58 ` Ojaswin Mujoo
2025-10-17 16:01 ` Zorro Lang
2025-10-17 16:27 ` Darrick J. Wong
2025-10-17 18:47 ` Zorro Lang
2025-10-17 22:52 ` Darrick J. Wong
2025-10-20 10:33 ` John Garry
2025-10-21 10:28 ` Ojaswin Mujoo
2025-10-21 11:30 ` Brian Foster
2025-10-21 11:58 ` Ojaswin Mujoo
2025-10-21 17:44 ` Darrick J. Wong
2025-10-22 7:40 ` Ojaswin Mujoo
2025-10-23 15:44 ` John Garry
2025-10-23 17:55 ` Darrick J. Wong
2025-10-29 18:11 ` [PATCH] xfs: fix write failures in software-provided atomic writes Darrick J. Wong
2025-10-29 18:13 ` Darrick J. Wong
2025-10-30 13:52 ` John Garry
2025-10-30 15:01 ` Darrick J. Wong
2025-10-30 16:35 ` John Garry
2025-10-30 19:38 ` John Garry
2025-10-31 4:30 ` Darrick J. Wong
2025-10-31 10:17 ` John Garry
2025-10-31 17:13 ` Darrick J. Wong
2025-11-03 12:16 ` John Garry
2025-11-03 18:01 ` Darrick J. Wong [this message]
2025-10-31 8:08 ` Ojaswin Mujoo
2025-10-31 10:04 ` John Garry
2025-09-19 6:47 ` [PATCH v7 05/12] generic: Add atomic write test using fio crc check verifier Ojaswin Mujoo
2025-10-28 9:42 ` Ojaswin Mujoo
2025-11-01 9:00 ` Zorro Lang
2025-09-19 6:47 ` [PATCH v7 06/12] generic: Add atomic write test using fio verify on file mixed mappings Ojaswin Mujoo
2025-09-19 6:48 ` [PATCH v7 07/12] generic: Add atomic write multi-fsblock O_[D]SYNC tests Ojaswin Mujoo
2025-09-19 6:48 ` [PATCH v7 08/12] generic: Stress fsx with atomic writes enabled Ojaswin Mujoo
2025-09-19 6:48 ` [PATCH v7 09/12] generic: Add sudden shutdown tests for multi block atomic writes Ojaswin Mujoo
2025-09-19 6:48 ` [PATCH v7 10/12] ext4: Test atomic write and ioend codepaths with bigalloc Ojaswin Mujoo
2025-09-19 6:48 ` [PATCH v7 11/12] ext4: Test atomic writes allocation and write " Ojaswin Mujoo
2025-09-19 6:48 ` [PATCH v7 12/12] ext4: Atomic write test for extent split across leaf nodes Ojaswin Mujoo
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20251103180112.GD196370@frogsfrogsfrogs \
--to=djwong@kernel.org \
--cc=cem@kernel.org \
--cc=fstests@vger.kernel.org \
--cc=john.g.garry@oracle.com \
--cc=linux-xfs@vger.kernel.org \
--cc=ojaswin@linux.ibm.com \
--cc=ritesh.list@gmail.com \
--cc=zlang@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.