All of lore.kernel.org
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: Jan Kara <jack@suse.cz>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH RFC] xfs: Don't hold XFS_ILOCK_SHARED over log force during fsync
Date: Wed, 17 Sep 2025 07:32:11 +1000	[thread overview]
Message-ID: <aMnXW_sEk_wTPnvB@dread.disaster.area> (raw)
In-Reply-To: <vpsyvzbupclvb76axyzytms5rh5yzubcyj5l5h2iwpk3d7xf6a@dw6pemmdfcka>

On Tue, Sep 16, 2025 at 03:32:42PM +0200, Jan Kara wrote:
> On Tue 16-09-25 16:15:32, Dave Chinner wrote:
> > On Thu, Sep 11, 2025 at 10:59:15AM +1000, Dave Chinner wrote:
> > > i.e. if we clear the commit sequences on last unpin (i.e. in
> > > xfs_inode_item_unpin) then an item that is not in the CIL (and so
> > > doesn't have dirty metadata) will have no associated commit
> > > sequence number set.
> > > 
> > > Hence if ili_datasync_commit_seq is non-zero, then by definition the
> > > inode must be pinned and has been dirtied for datasync purposes.
> > > That means we can simply query ili_datasync_commit_seq in
> > > xfs_bmbt_to_iomap() to set IOMAP_F_DIRTY.
> > > 
> > > I suspect that the above fsync code can then become:
> > > 
> > > 	spin_lock(&iip->ili_lock);
> > > 	if (datasync)
> > > 		seq = iip->ili_datasync_commit_seq;
> > > 	else
> > > 		seq = iip->ili_commit_seq;
> > > 	spin_unlock(&iip->ili_lock);
> > > 
> > > 	if (!seq)
> > > 		return 0;
> > > 	return xfs_log_force_seq(ip->i_mount, seq, XFS_LOG_SYNC, log_flushed);
> > > 
> > > For the same reason. i.e. a non-zero sequence number implies the
> > > inode log item is dirty in the CIL and pinned.
> > > 
> > > At this point, we really don't care about races with transaction
> > > commits. f(data)sync should only wait for modifications that have
> > > been fully completed. If they haven't set the commit sequence in the
> > > log item, they haven't fully completed. If the commit sequence is
> > > already set, the the CIL push will co-ordinate appropriately with
> > > commits to ensure correct data integrity behaviour occurs.
> > > 
> > > Hence I think that if we tie the sequence number clearing to the
> > > inode being removed from the CIL (i.e. last unpin) we can drop all
> > > the pin checks and use the commit sequence numbers directly to
> > > determine what the correct behaviour should be...
> > 
> > Here's a patch that implements this. It appears to pass fstests
> > without any regressions on my test VMs. Can you test it and check
> > that it retains the expected performance improvement for
> > O_DSYNC+DIO on fallocate()d space?
> 
> Heh, I just wanted to send my version of the patch after all the tests
> passed :). Anyway, I've given your patch a spin with the test I have and
> its performance looks good. So feel free to add:
> 
> Tested-by: Jan Kara <jack@suse.cz>

Thanks!

> BTW I don't have customer setup with DB2 available where the huge
> difference is visible (I'll send them backport of the patch to their SUSE
> kernel once we settle on it) but I have written a tool that replays the
> same set of pwrites from same set of threads I've captured from syscall
> trace. It reproduces only about 20% difference between good & bad kernels
> on my test machine but it was good enough for the bisection and analysis
> and the customer confirmed that the revert of what I've bisected to
> actually fixes their issue (rwsem reader lockstealing logic).

It was also recently bisected on RHEL 8.x to the introduction of
rwsem spin-on-owner changes from back in 2019. Might be the same
commit you are talking about, but either way it's an indication of
rwsem lock contention rather than a problem with the rwsems
themselves.

> So I'm
> reasonably confident I'm really reproducing their issue.

Ok, that's good to know. I was thinking that maybe a fio recipe
might show it up, too, but I'm not sure about that nor do I have the
time to investigate it...

> BTW, your patch seems to be about 2% slower on average than what I have
> written and somewhat more variable. It may be just a bad luck but
> I suspect it may be related to the fact that I ended up using READ_ONCE for
> reads of ili_commit_seq and ili_datasync_commit_seq while you use ili_lock.

Possibly....

> So I just wanted to suggest that as a possible optimization (my patch
> attached for reference). But regardless of whether you do the change or not
> I think the patch is good to go.

I was on the fence about using READ_ONCE/WRITE_ONCE.

However, xfs_csn_t is 64 bit and READ_ONCE/WRITE_ONCE doesn't
prevent torn reads of 64 bit variables on 32 bit platforms. A torn
read of a commit sequence number will result in a transient data
integrity guarantee failure, and so I decided to err on the side of
caution....

.....

> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index 829675700fcd..2a90e156b072 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -145,18 +145,7 @@ xfs_inode_item_precommit(
>  		flags |= XFS_ILOG_CORE;
>  	}
>  
> -	/*
> -	 * Record the specific change for fdatasync optimisation. This allows
> -	 * fdatasync to skip log forces for inodes that are only timestamp
> -	 * dirty. Once we've processed the XFS_ILOG_IVERSION flag, convert it
> -	 * to XFS_ILOG_CORE so that the actual on-disk dirty tracking
> -	 * (ili_fields) correctly tracks that the version has changed.
> -	 */
>  	spin_lock(&iip->ili_lock);
> -	iip->ili_fsync_fields |= (flags & ~XFS_ILOG_IVERSION);
> -	if (flags & XFS_ILOG_IVERSION)
> -		flags = ((flags & ~XFS_ILOG_IVERSION) | XFS_ILOG_CORE);
> -
>  	/*
>  	 * Inode verifiers do not check that the CoW extent size hint is an
>  	 * integer multiple of the rt extent size on a directory with both
> @@ -204,6 +193,23 @@ xfs_inode_item_precommit(
>  		xfs_trans_brelse(tp, bp);
>  	}
>  
> +	/*
> +	 * Set the transaction dirty state we've created back in inode item
> +	 * before mangling flags for storing on disk. We use the value later in
> +	 * xfs_inode_item_committing() to determine whether the transaction is
> +	 * relevant for fdatasync or not. ili_dirty_flags gets cleared in
> +	 * xfs_trans_ijoin() before adding inode to the next transaction.
> +	 */
> +	iip->ili_dirty_flags = flags;
> +
> +	/*
> +	 * Now convert XFS_ILOG_IVERSION flag to XFS_ILOG_CORE so that the
> +	 * actual on-disk dirty tracking (ili_fields) correctly tracks that the
> +	 * version has changed.
> +	 */
> +	if (flags & XFS_ILOG_IVERSION)
> +		flags = ((flags & ~XFS_ILOG_IVERSION) | XFS_ILOG_CORE);
> +

OK, I think I might have missed this. I'll check/fix it, and send an
updated version for inclusion.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

  reply	other threads:[~2025-09-16 21:32 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-09-08 15:12 [PATCH RFC] xfs: Don't hold XFS_ILOCK_SHARED over log force during fsync Jan Kara
2025-09-09  0:18 ` Dave Chinner
2025-09-10  9:05   ` Jan Kara
2025-09-11  0:59     ` Dave Chinner
2025-09-16  6:15       ` Dave Chinner
2025-09-16 13:32         ` Jan Kara
2025-09-16 21:32           ` Dave Chinner [this message]
2025-09-17 10:07             ` Jan Kara
2025-09-17 10:27               ` Lukas Herbolt
     [not found]               ` <CAM4Jq_5kSfwPRiVsGD67n3ftoNPsdXOwMx0jxxQ4f8T9kcqgcw@mail.gmail.com>
2025-09-17 15:05                 ` Jan Kara
2025-09-17 15:16                   ` Lukas Herbolt

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aMnXW_sEk_wTPnvB@dread.disaster.area \
    --to=david@fromorbit.com \
    --cc=jack@suse.cz \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.