Re: [PATCH] xfs: fix write failures in software-provided atomic writes

linux-xfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: "Darrick J. Wong" <djwong@kernel.org>
To: John Garry <john.g.garry@oracle.com>
Cc: Carlos Maiolino <cem@kernel.org>,
	Ojaswin Mujoo <ojaswin@linux.ibm.com>,
	Zorro Lang <zlang@redhat.com>,
	fstests@vger.kernel.org, Ritesh Harjani <ritesh.list@gmail.com>,
	linux-xfs@vger.kernel.org
Subject: Re: [PATCH] xfs: fix write failures in software-provided atomic writes
Date: Mon, 3 Nov 2025 10:01:12 -0800	[thread overview]
Message-ID: <20251103180112.GD196370@frogsfrogsfrogs> (raw)
In-Reply-To: <64128e92-44a7-4830-86e7-2c98383a9f28@oracle.com>

On Mon, Nov 03, 2025 at 12:16:14PM +0000, John Garry wrote:
> On 31/10/2025 17:13, Darrick J. Wong wrote:
> > On Fri, Oct 31, 2025 at 10:17:56AM +0000, John Garry wrote:
> > > On 31/10/2025 04:30, Darrick J. Wong wrote:
> > > > > @@ -1215,6 +1216,7 @@ xfs_atomic_write_cow_iomap_begin(
> > > > >    	return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags, IOMAP_F_SHARED, seq);
> > > > > 
> > > > > I think that the problem may be that we were converting an inappropriate
> > > > > number of blocks from unwritten to real allocations (but never writing to
> > > > > the excess blocks). Does it look ok?
> > > > That looks like a good correction to me; I'll run that on my test fleet
> > > > overnight and we'll see what happens.  Thanks for putting this together!
> > > 
> > > Cool, but I am not confident that it is a completely correct. Here's the
> > > updated code:
> > > 
> > >   	int			error;
> > >   	u64			seq;
> > > +	xfs_filblks_t		count_fsb_orig = count_fsb;
> > > 
> > >   	ASSERT(flags & IOMAP_WRITE);
> > >   	ASSERT(flags & IOMAP_DIRECT);
> > > @@ -1202,7 +1203,7 @@ xfs_atomic_write_cow_iomap_begin(
> > >   found:
> > >   	if (cmap.br_state != XFS_EXT_NORM) {
> > >   		error = xfs_reflink_convert_cow_locked(ip, offset_fsb,
> > > -				count_fsb);
> > > +				count_fsb_orig);
> > >   		if (error)
> > >   			goto out_unlock;
> > >   		cmap.br_state = XFS_EXT_NORM;
> > 
> > Er... this is exactly the same snippet as yesterday. <confused>
> 
> Yes, I was just showing it for context.
> 
> > 
> > (That snippet seems to have survived overnight fsx)
> > 
> > > cmap may be longer than count_fsb_orig (which was the failing scenario). In
> > > that case, after calling xfs_reflink_convert_cow_locked(), we would have
> > > partially converted cmap, so it is proper to set cmap.br_state =
> > > XFS_EXT_NORM?
> > 
> > Hrm.  Let me walk through this function again.
> > 
> > At the start, {offset,count}_fsb express the offset/length parameters,
> > but in fsblock units and expanded to align with an fsblock.
> 
> I do also note that the usage of xfs_iomap_end_fsb() is a bit dubious, as
> this will truncate the write if it goes over s_maxbytes. However, we never
> want to truncate an atomic write.
> 
> > 
> > If the cow fork has a mapping (cmap) that starts before the offset, we
> > trim the mapping to the desired range and goto found.  cmap cannot end
> > before offset_fsb because that's how xfs_iext_lookup_extent works.
> > 
> > If the cow fork doesn't have a mapping or the one it does have doesn't
> > start until after offset_fsb, then we trim count_fsb so that
> > (offset_fsb, count_fsb) is the range that isn't mapped to space.
> > 
> > Then we cycle the ILOCK to get a transaction and do the lookup again
> > due to "extent layout could have changed since the unlock".  Same rules
> > as the first lookup.  I wonder if that second xfs_trim_extent should be
> > using orig_count_fsb, because at this point it's trimming the cmap to
> > the shorter range.
> 
> Yes, I think so.
> 
> > It's probably not a big deal because iomap will
> > just call ->iomap_begin on the rest of the range, but it's more work.
> > 
> > If the second lookup doesn't produce a mapping, then we call
> > xfs_bmapi_write to fill the hole and drop down to @found.
> > 
> > Now, we're at the found: label, having arrived here in one of three
> > ways:
> > 
> > 1) The first cmap lookup found at least one block that it can use.
> >     (offset_fsb, count_fsb) is the original range that iomap asked for.
> >     This mapping could be written or unwritten, and it's been trimmed
> >     so that it won't extend outside the requested write range.
> > 
> > 2) The second cmap lookup found at least one block that it can use.
> >     (offset_fsb, count_fsb) is a truncated range because the cow fork
> >     has a mapping that starts at (offset_fsb + count_fsb).  This mapping
> >     could also be written or unwritten, and it also has been trimmed so
> >     that it won't extend outside the hole range.
> > 
> >     Why do we trim cmap to the hole range?  The original write range
> >     will suite iomap just fine.
> 
> I think that the xfs_iext_lookup_extent() and xfs_trim_extent() pattern may
> have been just copied without considering this.
> 
> > 
> > 3) We xfs_bmapi_write'd a hole, and now we have an unwritten mapping.
> >     (offset_fsb, count_fsb) is the same truncated range from 2).
> >     cmap is potentially much larger than (offset_fsb, count_fsb).
> > 
> >     Why do we not trim this mapping to the original write range?
> > 
> 
> I don't know, but this is what I was suggesting should happen.
> 
> > If at this point the mapping is unwritten, we convert it to written
> > mapping with xfs_reflink_convert_cow_locked.  offset_fsb retains its
> > original value, but what is count_fsb?  It's either the original value
> > (1) or the smaller one from filling the hole (2) or (3)?  Why don't we
> > pass the cmap startoff/blockcount to this function?
> 
> I think that we should
> 
> > 
> > As an aside: Changing count_fsb makes it harder for me to understand
> > what's going on in this function.
> > 
> > Now, having converted either the range we arrived at via (1-3) (or with
> > your patch, the original range) to written state, we set br_state and
> > pass that back to iomap.  I think in case (3) it's possible that xfs
> > incorrectly reports to iomap an overly large mapping that might not
> > actually reflect the cow fork contents, because we only converted so
> > much of the mapping state.
> > 
> > > We should trim cmap to count_fsb_orig also, right?
> > 
> > iomap doesn't care if the mapping it receives spans more space than just
> > the range it asked for, but XFS shouldn't be exposing mappings that
> > don't reflect the actual state of the cow fork.
> > 
> > I think there are several things wrong with this function:
> > 
> > A) xfs_bmapi_write can return a much larger unwritten mapping than what
> >     the caller asked for.  We convert part of that range to written, but
> >     return the entire written mapping to iomap even though that's
> >     inaccurate.
> > 
> > B) The arguments to xfs_reflink_convert_cow_locked are wrong -- an
> >     unwritten mapping could be *smaller* than the write range (or even
> >     the hole range).  In this case, we convert too much file range to
> >     written state because we then return a smaller mapping to iomap.
> > 
> > C) It doesn't handle delalloc mappings.  This I covered in the patch
> >     that I already sent to the list.
> > 
> > D) Reassigning count_fsb to handle the hole means that if the second
> >     cmap lookup attempt succeeds (due to racing with someone else) we
> >     trim the mapping more than is strictly necessary.  The changing
> >     meaning of count_fsb makes this harder to notice.
> > 
> > E) The tracepoint is kinda wrong because @length is mutated.  That makes
> >     it harder to chase the data flows through this function because you
> >     can't just grep on the pos/bytecount strings.
> 
> Yes, I was noticing this also.
> 
> > 
> > F) We don't actually check that the br_state = XFS_EXT_NORM assignment
> >     is accurate, i.e. that the cow fork contains a written mapping for
> >     the range that we're interested in.
> > 
> > G) Somewhat inadequate documentation of why we need to xfs_trim_extent
> >     so aggressively in this function.
> > 
> > H) Not sure why xfs_iomap_end_fsb is used here, the vfs already clamped
> >     the write range to s_maxbytes.
> 
> Please note that I mentioned about this above.
> 
> > 
> > > I don't think that it makes much of a difference, but it seems the proper
> > > thing to do. Maybe the subsequent traces length values would be inconsistent
> > > with other path to @found label if we don't trim.
> > 
> > With A-H fixed, all the atomic writes issues I was seeing went away.
> > I'll run this through the atomic writes QA tests and send a fix series
> > to the list.
> 
> ok, thanks!

<nod> Fixes have now been sent to the list.

--D

next prev parent reply	other threads:[~2025-11-03 18:01 UTC|newest]

Thread overview: 49+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-09-19  6:47 [PATCH v7 00/11] Add more tests for multi fs block atomic writes Ojaswin Mujoo
2025-09-19  6:47 ` [PATCH v7 01/12] common/rc: Add _min() and _max() helpers Ojaswin Mujoo
2025-09-19  6:47 ` [PATCH v7 02/12] common/rc: Add fio atomic write helpers Ojaswin Mujoo
2025-09-19 16:27   ` Darrick J. Wong
2025-09-19  6:47 ` [PATCH v7 03/12] common/rc: Add a helper to run fsx on a given file Ojaswin Mujoo
2025-09-19  6:47 ` [PATCH v7 04/12] ltp/fsx.c: Add atomic writes support to fsx Ojaswin Mujoo
2025-09-28  8:55   ` Zorro Lang
2025-09-28 13:19   ` Zorro Lang
2025-10-02 17:56     ` Ojaswin Mujoo
2025-10-03 17:19       ` Zorro Lang
2025-10-05 12:57         ` Ojaswin Mujoo
2025-10-05 15:39           ` Zorro Lang
2025-10-06 13:20             ` Ojaswin Mujoo
2025-10-07  9:58               ` Ojaswin Mujoo
2025-10-17 16:01                 ` Zorro Lang
2025-10-17 16:27                   ` Darrick J. Wong
2025-10-17 18:47                     ` Zorro Lang
2025-10-17 22:52                       ` Darrick J. Wong
2025-10-20 10:33               ` John Garry
2025-10-21 10:28                 ` Ojaswin Mujoo
2025-10-21 11:30                   ` Brian Foster
2025-10-21 11:58                     ` Ojaswin Mujoo
2025-10-21 17:44                       ` Darrick J. Wong
2025-10-22  7:40                         ` Ojaswin Mujoo
2025-10-23 15:44                           ` John Garry
2025-10-23 17:55                             ` Darrick J. Wong
2025-10-29 18:11   ` [PATCH] xfs: fix write failures in software-provided atomic writes Darrick J. Wong
2025-10-29 18:13     ` Darrick J. Wong
2025-10-30 13:52     ` John Garry
2025-10-30 15:01       ` Darrick J. Wong
2025-10-30 16:35         ` John Garry
2025-10-30 19:38           ` John Garry
2025-10-31  4:30             ` Darrick J. Wong
2025-10-31 10:17               ` John Garry
2025-10-31 17:13                 ` Darrick J. Wong
2025-11-03 12:16                   ` John Garry
2025-11-03 18:01                     ` Darrick J. Wong [this message]
2025-10-31  8:08             ` Ojaswin Mujoo
2025-10-31 10:04               ` John Garry
2025-09-19  6:47 ` [PATCH v7 05/12] generic: Add atomic write test using fio crc check verifier Ojaswin Mujoo
2025-10-28  9:42   ` Ojaswin Mujoo
2025-11-01  9:00     ` Zorro Lang
2025-09-19  6:47 ` [PATCH v7 06/12] generic: Add atomic write test using fio verify on file mixed mappings Ojaswin Mujoo
2025-09-19  6:48 ` [PATCH v7 07/12] generic: Add atomic write multi-fsblock O_[D]SYNC tests Ojaswin Mujoo
2025-09-19  6:48 ` [PATCH v7 08/12] generic: Stress fsx with atomic writes enabled Ojaswin Mujoo
2025-09-19  6:48 ` [PATCH v7 09/12] generic: Add sudden shutdown tests for multi block atomic writes Ojaswin Mujoo
2025-09-19  6:48 ` [PATCH v7 10/12] ext4: Test atomic write and ioend codepaths with bigalloc Ojaswin Mujoo
2025-09-19  6:48 ` [PATCH v7 11/12] ext4: Test atomic writes allocation and write " Ojaswin Mujoo
2025-09-19  6:48 ` [PATCH v7 12/12] ext4: Atomic write test for extent split across leaf nodes Ojaswin Mujoo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20251103180112.GD196370@frogsfrogsfrogs \
    --to=djwong@kernel.org \
    --cc=cem@kernel.org \
    --cc=fstests@vger.kernel.org \
    --cc=john.g.garry@oracle.com \
    --cc=linux-xfs@vger.kernel.org \
    --cc=ojaswin@linux.ibm.com \
    --cc=ritesh.list@gmail.com \
    --cc=zlang@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).