From: Dave Chinner <david@fromorbit.com>
To: Jan Kara <jack@suse.cz>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH RFC] xfs: Don't hold XFS_ILOCK_SHARED over log force during fsync
Date: Wed, 17 Sep 2025 07:32:11 +1000 [thread overview]
Message-ID: <aMnXW_sEk_wTPnvB@dread.disaster.area> (raw)
In-Reply-To: <vpsyvzbupclvb76axyzytms5rh5yzubcyj5l5h2iwpk3d7xf6a@dw6pemmdfcka>
On Tue, Sep 16, 2025 at 03:32:42PM +0200, Jan Kara wrote:
> On Tue 16-09-25 16:15:32, Dave Chinner wrote:
> > On Thu, Sep 11, 2025 at 10:59:15AM +1000, Dave Chinner wrote:
> > > i.e. if we clear the commit sequences on last unpin (i.e. in
> > > xfs_inode_item_unpin) then an item that is not in the CIL (and so
> > > doesn't have dirty metadata) will have no associated commit
> > > sequence number set.
> > >
> > > Hence if ili_datasync_commit_seq is non-zero, then by definition the
> > > inode must be pinned and has been dirtied for datasync purposes.
> > > That means we can simply query ili_datasync_commit_seq in
> > > xfs_bmbt_to_iomap() to set IOMAP_F_DIRTY.
> > >
> > > I suspect that the above fsync code can then become:
> > >
> > > spin_lock(&iip->ili_lock);
> > > if (datasync)
> > > seq = iip->ili_datasync_commit_seq;
> > > else
> > > seq = iip->ili_commit_seq;
> > > spin_unlock(&iip->ili_lock);
> > >
> > > if (!seq)
> > > return 0;
> > > return xfs_log_force_seq(ip->i_mount, seq, XFS_LOG_SYNC, log_flushed);
> > >
> > > For the same reason. i.e. a non-zero sequence number implies the
> > > inode log item is dirty in the CIL and pinned.
> > >
> > > At this point, we really don't care about races with transaction
> > > commits. f(data)sync should only wait for modifications that have
> > > been fully completed. If they haven't set the commit sequence in the
> > > log item, they haven't fully completed. If the commit sequence is
> > > already set, the the CIL push will co-ordinate appropriately with
> > > commits to ensure correct data integrity behaviour occurs.
> > >
> > > Hence I think that if we tie the sequence number clearing to the
> > > inode being removed from the CIL (i.e. last unpin) we can drop all
> > > the pin checks and use the commit sequence numbers directly to
> > > determine what the correct behaviour should be...
> >
> > Here's a patch that implements this. It appears to pass fstests
> > without any regressions on my test VMs. Can you test it and check
> > that it retains the expected performance improvement for
> > O_DSYNC+DIO on fallocate()d space?
>
> Heh, I just wanted to send my version of the patch after all the tests
> passed :). Anyway, I've given your patch a spin with the test I have and
> its performance looks good. So feel free to add:
>
> Tested-by: Jan Kara <jack@suse.cz>
Thanks!
> BTW I don't have customer setup with DB2 available where the huge
> difference is visible (I'll send them backport of the patch to their SUSE
> kernel once we settle on it) but I have written a tool that replays the
> same set of pwrites from same set of threads I've captured from syscall
> trace. It reproduces only about 20% difference between good & bad kernels
> on my test machine but it was good enough for the bisection and analysis
> and the customer confirmed that the revert of what I've bisected to
> actually fixes their issue (rwsem reader lockstealing logic).
It was also recently bisected on RHEL 8.x to the introduction of
rwsem spin-on-owner changes from back in 2019. Might be the same
commit you are talking about, but either way it's an indication of
rwsem lock contention rather than a problem with the rwsems
themselves.
> So I'm
> reasonably confident I'm really reproducing their issue.
Ok, that's good to know. I was thinking that maybe a fio recipe
might show it up, too, but I'm not sure about that nor do I have the
time to investigate it...
> BTW, your patch seems to be about 2% slower on average than what I have
> written and somewhat more variable. It may be just a bad luck but
> I suspect it may be related to the fact that I ended up using READ_ONCE for
> reads of ili_commit_seq and ili_datasync_commit_seq while you use ili_lock.
Possibly....
> So I just wanted to suggest that as a possible optimization (my patch
> attached for reference). But regardless of whether you do the change or not
> I think the patch is good to go.
I was on the fence about using READ_ONCE/WRITE_ONCE.
However, xfs_csn_t is 64 bit and READ_ONCE/WRITE_ONCE doesn't
prevent torn reads of 64 bit variables on 32 bit platforms. A torn
read of a commit sequence number will result in a transient data
integrity guarantee failure, and so I decided to err on the side of
caution....
.....
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index 829675700fcd..2a90e156b072 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -145,18 +145,7 @@ xfs_inode_item_precommit(
> flags |= XFS_ILOG_CORE;
> }
>
> - /*
> - * Record the specific change for fdatasync optimisation. This allows
> - * fdatasync to skip log forces for inodes that are only timestamp
> - * dirty. Once we've processed the XFS_ILOG_IVERSION flag, convert it
> - * to XFS_ILOG_CORE so that the actual on-disk dirty tracking
> - * (ili_fields) correctly tracks that the version has changed.
> - */
> spin_lock(&iip->ili_lock);
> - iip->ili_fsync_fields |= (flags & ~XFS_ILOG_IVERSION);
> - if (flags & XFS_ILOG_IVERSION)
> - flags = ((flags & ~XFS_ILOG_IVERSION) | XFS_ILOG_CORE);
> -
> /*
> * Inode verifiers do not check that the CoW extent size hint is an
> * integer multiple of the rt extent size on a directory with both
> @@ -204,6 +193,23 @@ xfs_inode_item_precommit(
> xfs_trans_brelse(tp, bp);
> }
>
> + /*
> + * Set the transaction dirty state we've created back in inode item
> + * before mangling flags for storing on disk. We use the value later in
> + * xfs_inode_item_committing() to determine whether the transaction is
> + * relevant for fdatasync or not. ili_dirty_flags gets cleared in
> + * xfs_trans_ijoin() before adding inode to the next transaction.
> + */
> + iip->ili_dirty_flags = flags;
> +
> + /*
> + * Now convert XFS_ILOG_IVERSION flag to XFS_ILOG_CORE so that the
> + * actual on-disk dirty tracking (ili_fields) correctly tracks that the
> + * version has changed.
> + */
> + if (flags & XFS_ILOG_IVERSION)
> + flags = ((flags & ~XFS_ILOG_IVERSION) | XFS_ILOG_CORE);
> +
OK, I think I might have missed this. I'll check/fix it, and send an
updated version for inclusion.
-Dave.
--
Dave Chinner
david@fromorbit.com
next prev parent reply other threads:[~2025-09-16 21:32 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-09-08 15:12 [PATCH RFC] xfs: Don't hold XFS_ILOCK_SHARED over log force during fsync Jan Kara
2025-09-09 0:18 ` Dave Chinner
2025-09-10 9:05 ` Jan Kara
2025-09-11 0:59 ` Dave Chinner
2025-09-16 6:15 ` Dave Chinner
2025-09-16 13:32 ` Jan Kara
2025-09-16 21:32 ` Dave Chinner [this message]
2025-09-17 10:07 ` Jan Kara
2025-09-17 10:27 ` Lukas Herbolt
[not found] ` <CAM4Jq_5kSfwPRiVsGD67n3ftoNPsdXOwMx0jxxQ4f8T9kcqgcw@mail.gmail.com>
2025-09-17 15:05 ` Jan Kara
2025-09-17 15:16 ` Lukas Herbolt
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aMnXW_sEk_wTPnvB@dread.disaster.area \
--to=david@fromorbit.com \
--cc=jack@suse.cz \
--cc=linux-xfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox