Re: [PATCH RFC] xfs: Don't hold XFS_ILOCK_SHARED over log force during fsync

Linux XFS filesystem development
 help / color / mirror / Atom feed

From: Dave Chinner <david@fromorbit.com>
To: Jan Kara <jack@suse.cz>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH RFC] xfs: Don't hold XFS_ILOCK_SHARED over log force during fsync
Date: Wed, 17 Sep 2025 07:32:11 +1000	[thread overview]
Message-ID: <aMnXW_sEk_wTPnvB@dread.disaster.area> (raw)
In-Reply-To: <vpsyvzbupclvb76axyzytms5rh5yzubcyj5l5h2iwpk3d7xf6a@dw6pemmdfcka>

On Tue, Sep 16, 2025 at 03:32:42PM +0200, Jan Kara wrote:
> On Tue 16-09-25 16:15:32, Dave Chinner wrote:
> > On Thu, Sep 11, 2025 at 10:59:15AM +1000, Dave Chinner wrote:
> > > i.e. if we clear the commit sequences on last unpin (i.e. in
> > > xfs_inode_item_unpin) then an item that is not in the CIL (and so
> > > doesn't have dirty metadata) will have no associated commit
> > > sequence number set.
> > > 
> > > Hence if ili_datasync_commit_seq is non-zero, then by definition the
> > > inode must be pinned and has been dirtied for datasync purposes.
> > > That means we can simply query ili_datasync_commit_seq in
> > > xfs_bmbt_to_iomap() to set IOMAP_F_DIRTY.
> > > 
> > > I suspect that the above fsync code can then become:
> > > 
> > > 	spin_lock(&iip->ili_lock);
> > > 	if (datasync)
> > > 		seq = iip->ili_datasync_commit_seq;
> > > 	else
> > > 		seq = iip->ili_commit_seq;
> > > 	spin_unlock(&iip->ili_lock);
> > > 
> > > 	if (!seq)
> > > 		return 0;
> > > 	return xfs_log_force_seq(ip->i_mount, seq, XFS_LOG_SYNC, log_flushed);
> > > 
> > > For the same reason. i.e. a non-zero sequence number implies the
> > > inode log item is dirty in the CIL and pinned.
> > > 
> > > At this point, we really don't care about races with transaction
> > > commits. f(data)sync should only wait for modifications that have
> > > been fully completed. If they haven't set the commit sequence in the
> > > log item, they haven't fully completed. If the commit sequence is
> > > already set, the the CIL push will co-ordinate appropriately with
> > > commits to ensure correct data integrity behaviour occurs.
> > > 
> > > Hence I think that if we tie the sequence number clearing to the
> > > inode being removed from the CIL (i.e. last unpin) we can drop all
> > > the pin checks and use the commit sequence numbers directly to
> > > determine what the correct behaviour should be...
> > 
> > Here's a patch that implements this. It appears to pass fstests
> > without any regressions on my test VMs. Can you test it and check
> > that it retains the expected performance improvement for
> > O_DSYNC+DIO on fallocate()d space?
> 
> Heh, I just wanted to send my version of the patch after all the tests
> passed :). Anyway, I've given your patch a spin with the test I have and
> its performance looks good. So feel free to add:
> 
> Tested-by: Jan Kara <jack@suse.cz>

Thanks!

> BTW I don't have customer setup with DB2 available where the huge
> difference is visible (I'll send them backport of the patch to their SUSE
> kernel once we settle on it) but I have written a tool that replays the
> same set of pwrites from same set of threads I've captured from syscall
> trace. It reproduces only about 20% difference between good & bad kernels
> on my test machine but it was good enough for the bisection and analysis
> and the customer confirmed that the revert of what I've bisected to
> actually fixes their issue (rwsem reader lockstealing logic).

It was also recently bisected on RHEL 8.x to the introduction of
rwsem spin-on-owner changes from back in 2019. Might be the same
commit you are talking about, but either way it's an indication of
rwsem lock contention rather than a problem with the rwsems
themselves.

> So I'm
> reasonably confident I'm really reproducing their issue.

Ok, that's good to know. I was thinking that maybe a fio recipe
might show it up, too, but I'm not sure about that nor do I have the
time to investigate it...

> BTW, your patch seems to be about 2% slower on average than what I have
> written and somewhat more variable. It may be just a bad luck but
> I suspect it may be related to the fact that I ended up using READ_ONCE for
> reads of ili_commit_seq and ili_datasync_commit_seq while you use ili_lock.

Possibly....

> So I just wanted to suggest that as a possible optimization (my patch
> attached for reference). But regardless of whether you do the change or not
> I think the patch is good to go.

I was on the fence about using READ_ONCE/WRITE_ONCE.

However, xfs_csn_t is 64 bit and READ_ONCE/WRITE_ONCE doesn't
prevent torn reads of 64 bit variables on 32 bit platforms. A torn
read of a commit sequence number will result in a transient data
integrity guarantee failure, and so I decided to err on the side of
caution....

.....

> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index 829675700fcd..2a90e156b072 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -145,18 +145,7 @@ xfs_inode_item_precommit(
>  		flags |= XFS_ILOG_CORE;
>  	}
>  
> -	/*
> -	 * Record the specific change for fdatasync optimisation. This allows
> -	 * fdatasync to skip log forces for inodes that are only timestamp
> -	 * dirty. Once we've processed the XFS_ILOG_IVERSION flag, convert it
> -	 * to XFS_ILOG_CORE so that the actual on-disk dirty tracking
> -	 * (ili_fields) correctly tracks that the version has changed.
> -	 */
>  	spin_lock(&iip->ili_lock);
> -	iip->ili_fsync_fields |= (flags & ~XFS_ILOG_IVERSION);
> -	if (flags & XFS_ILOG_IVERSION)
> -		flags = ((flags & ~XFS_ILOG_IVERSION) | XFS_ILOG_CORE);
> -
>  	/*
>  	 * Inode verifiers do not check that the CoW extent size hint is an
>  	 * integer multiple of the rt extent size on a directory with both
> @@ -204,6 +193,23 @@ xfs_inode_item_precommit(
>  		xfs_trans_brelse(tp, bp);
>  	}
>  
> +	/*
> +	 * Set the transaction dirty state we've created back in inode item
> +	 * before mangling flags for storing on disk. We use the value later in
> +	 * xfs_inode_item_committing() to determine whether the transaction is
> +	 * relevant for fdatasync or not. ili_dirty_flags gets cleared in
> +	 * xfs_trans_ijoin() before adding inode to the next transaction.
> +	 */
> +	iip->ili_dirty_flags = flags;
> +
> +	/*
> +	 * Now convert XFS_ILOG_IVERSION flag to XFS_ILOG_CORE so that the
> +	 * actual on-disk dirty tracking (ili_fields) correctly tracks that the
> +	 * version has changed.
> +	 */
> +	if (flags & XFS_ILOG_IVERSION)
> +		flags = ((flags & ~XFS_ILOG_IVERSION) | XFS_ILOG_CORE);
> +

OK, I think I might have missed this. I'll check/fix it, and send an
updated version for inclusion.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

next prev parent reply	other threads:[~2025-09-16 21:32 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-09-08 15:12 [PATCH RFC] xfs: Don't hold XFS_ILOCK_SHARED over log force during fsync Jan Kara
2025-09-09  0:18 ` Dave Chinner
2025-09-10  9:05   ` Jan Kara
2025-09-11  0:59     ` Dave Chinner
2025-09-16  6:15       ` Dave Chinner
2025-09-16 13:32         ` Jan Kara
2025-09-16 21:32           ` Dave Chinner [this message]
2025-09-17 10:07             ` Jan Kara
2025-09-17 10:27               ` Lukas Herbolt
     [not found]               ` <CAM4Jq_5kSfwPRiVsGD67n3ftoNPsdXOwMx0jxxQ4f8T9kcqgcw@mail.gmail.com>
2025-09-17 15:05                 ` Jan Kara
2025-09-17 15:16                   ` Lukas Herbolt

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aMnXW_sEk_wTPnvB@dread.disaster.area \
    --to=david@fromorbit.com \
    --cc=jack@suse.cz \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox