* [PATCH RFC] xfs: Don't hold XFS_ILOCK_SHARED over log force during fsync @ 2025-09-08 15:12 Jan Kara 2025-09-09 0:18 ` Dave Chinner 0 siblings, 1 reply; 11+ messages in thread From: Jan Kara @ 2025-09-08 15:12 UTC (permalink / raw) To: linux-xfs; +Cc: Jan Kara Holding XFS_ILOCK_SHARED over log force in xfs_fsync_flush_log() significantly increases contention on ILOCK for O_DSYNC | O_DIRECT writes to a file preallocated with fallocate (thus DIO happens to unwritten extents and we need ILOCK in exclusive mode for timestamp modifications and extent conversions). But holding ILOCK over the log force doesn't seem strictly necessary for correctness. We are just using it as a mechanism to make sure parallel fsyncs all wait for the log force to complete, but that can also be achieved without holding ILOCK. With this patch a DB2 database restore operation speeds up by a factor of about 2.5x in a VM with 4 CPUs, 16GB of RAM and an NVMe SSD as a backing store. Signed-off-by: Jan Kara <jack@suse.cz> --- fs/xfs/xfs_file.c | 33 ++++++++++++++++++++++----------- fs/xfs/xfs_inode_item.c | 1 + fs/xfs/xfs_inode_item.h | 1 + 3 files changed, 24 insertions(+), 11 deletions(-) I've chosen to add ili_fsync_flushing_fields to xfs_inode_log_item since that seemed in line with how the code is structured. Arguably that is unnecessarily wasteful since in practice we use just one bit of information from ili_fsync_fields and one bit from ili_fsync_flushing_fields. If people prefer a more space-efficient solution, I can certainly do that. This is marked as RFC because I'm not quite sure I haven't missed some subtlety in the XFS logging mechanisms that will make this crash & burn badly in some corner case (but fstests seem to be passing fine for me).
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index b04c59d87378..2bb793c8c179 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -80,9 +80,13 @@ xfs_fsync_seq( struct xfs_inode *ip, bool datasync) { + unsigned int sync_fields; + if (!xfs_ipincount(ip)) return 0; - if (datasync && !(ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP)) + sync_fields = ip->i_itemp->ili_fsync_fields | + ip->i_itemp->ili_fsync_flushing_fields; + if (datasync && !(sync_fields & ~XFS_ILOG_TIMESTAMP)) return 0; return ip->i_itemp->ili_commit_seq; } @@ -92,13 +96,14 @@ xfs_fsync_seq( * log up to the latest LSN that touched the inode. * * If we have concurrent fsync/fdatasync() calls, we need them to all block on - * the log force before we clear the ili_fsync_fields field. This ensures that - * we don't get a racing sync operation that does not wait for the metadata to - * hit the journal before returning. If we race with clearing ili_fsync_fields, - * then all that will happen is the log force will do nothing as the lsn will - * already be on disk. We can't race with setting ili_fsync_fields because that - * is done under XFS_ILOCK_EXCL, and that can't happen because we hold the lock - * shared until after the ili_fsync_fields is cleared. + * the log force until it is finished. Thus we clear ili_fsync_fields so that + * new modifications since starting log force can accumulate there and just + * save old ili_fsync_fields value to ili_fsync_flushing_fields so that + * concurrent fsyncs can use that to determine whether they need to wait for + * running log force or not. This ensures that we don't get a racing sync + * operation that does not wait for the metadata to hit the journal before + * returning. If we race with clearing ili_fsync_fields, then all that will + * happen is the log force will do nothing as the lsn will already be on disk. 
*/ static int xfs_fsync_flush_log( @@ -112,14 +117,20 @@ xfs_fsync_flush_log( xfs_ilock(ip, XFS_ILOCK_SHARED); seq = xfs_fsync_seq(ip, datasync); if (seq) { + spin_lock(&ip->i_itemp->ili_lock); + ip->i_itemp->ili_fsync_flushing_fields = + ip->i_itemp->ili_fsync_fields; + ip->i_itemp->ili_fsync_fields = 0; + spin_unlock(&ip->i_itemp->ili_lock); + xfs_iunlock(ip, XFS_ILOCK_SHARED); error = xfs_log_force_seq(ip->i_mount, seq, XFS_LOG_SYNC, log_flushed); - spin_lock(&ip->i_itemp->ili_lock); - ip->i_itemp->ili_fsync_fields = 0; + ip->i_itemp->ili_fsync_flushing_fields = 0; spin_unlock(&ip->i_itemp->ili_lock); + } else { + xfs_iunlock(ip, XFS_ILOCK_SHARED); } - xfs_iunlock(ip, XFS_ILOCK_SHARED); return error; } diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c index 829675700fcd..39d15eb9311d 100644 --- a/fs/xfs/xfs_inode_item.c +++ b/fs/xfs/xfs_inode_item.c @@ -1056,6 +1056,7 @@ xfs_iflush_abort_clean( iip->ili_last_fields = 0; iip->ili_fields = 0; iip->ili_fsync_fields = 0; + iip->ili_fsync_flushing_fields = 0; iip->ili_flush_lsn = 0; iip->ili_item.li_buf = NULL; list_del_init(&iip->ili_item.li_bio_list); diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h index ba92ce11a011..018ba4cd3fab 100644 --- a/fs/xfs/xfs_inode_item.h +++ b/fs/xfs/xfs_inode_item.h @@ -33,6 +33,7 @@ struct xfs_inode_log_item { unsigned int ili_last_fields; /* fields when flushed */ unsigned int ili_fields; /* fields to be logged */ unsigned int ili_fsync_fields; /* logged since last fsync */ + unsigned int ili_fsync_flushing_fields; /* fields currently being committed by fsync */ xfs_lsn_t ili_flush_lsn; /* lsn at last flush */ xfs_csn_t ili_commit_seq; /* last transaction commit */ }; -- 2.51.0 ^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH RFC] xfs: Don't hold XFS_ILOCK_SHARED over log force during fsync 2025-09-08 15:12 [PATCH RFC] xfs: Don't hold XFS_ILOCK_SHARED over log force during fsync Jan Kara @ 2025-09-09 0:18 ` Dave Chinner 2025-09-10 9:05 ` Jan Kara 0 siblings, 1 reply; 11+ messages in thread From: Dave Chinner @ 2025-09-09 0:18 UTC (permalink / raw) To: Jan Kara; +Cc: linux-xfs On Mon, Sep 08, 2025 at 05:12:49PM +0200, Jan Kara wrote: > Holding XFS_ILOCK_SHARED over log force in xfs_fsync_flush_log() > significantly increases contention on ILOCK for O_DSYNC | O_DIRECT > writes to file preallocated with fallocate (thus DIO happens to > unwritten extents and we need ILOCK in exclusive mode for timestamp > modifications and extent conversions). But holding ILOCK over the log > force doesn't seem strictly necessary for correctness. That was introduced a long while back in 2015 when the ili_fsync_fields flags were introduced to optimise O_DSYNC to avoid timestamp updates from causing log forces. That was commit fc0561cefc04 ("xfs: optimise away log forces on timestamp updates for fdatasync"). > We are just using > it for a mechanism to make sure parallel fsyncs all wait for log force > to complete but that can be also achieved without holding ILOCK. Not exactly. It was used to ensure that we correctly serialised the setting of newly dirty fsync flags against the log force that allows us to clear the existing fsync flags. It requires the ILOCK_EXCL to set new flags, hence holding the ILOCK_SHARED was sufficient to achieve this. At the time, the ili_lock did not exist (introduced in 2020), so the only way to serialise inode log item flags updates was to use the ILOCK. But now with the ili_lock, we can update fields safely without holding the ILOCK... > With this patch DB2 database restore operation speeds up by a factor of > about 2.5x in a VM with 4 CPUs, 16GB of RAM and NVME SSD as a backing > store.
*nod* > Signed-off-by: Jan Kara <jack@suse.cz> > --- > fs/xfs/xfs_file.c | 33 ++++++++++++++++++++++----------- > fs/xfs/xfs_inode_item.c | 1 + > fs/xfs/xfs_inode_item.h | 1 + > 3 files changed, 24 insertions(+), 11 deletions(-) > > I've chosen adding ili_fsync_flushing_fields to xfs_inode_log_item since that > seemed in line with how the code is structured. Arguably that is unnecessarily > wasteful since in practice we use just one bit of information from > ili_fsync_fields and one bit from ili_fsync_flushing_fields. If people prefer > more space efficient solution, I can certainly do that. So you are trying to mirror the ili_fields -> ili_last_fields behaviour w.r.t. inode flushing? > This is marked as RFC because I'm not quite sure I didn't miss some subtlety > in XFS logging mechanisms and this will crash & burn badly in some corner > case (but fstests seem to be passing fine for me). Yeah, that's the issue, isn't it? > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c > index b04c59d87378..2bb793c8c179 100644 > --- a/fs/xfs/xfs_file.c > +++ b/fs/xfs/xfs_file.c > @@ -80,9 +80,13 @@ xfs_fsync_seq( > struct xfs_inode *ip, > bool datasync) > { > + unsigned int sync_fields; > + > if (!xfs_ipincount(ip)) > return 0; > - if (datasync && !(ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP)) > + sync_fields = ip->i_itemp->ili_fsync_fields | > + ip->i_itemp->ili_fsync_flushing_fields; Racy read - the new ili_fsync_flushing_fields field is protected by ili_lock, not ILOCK.... > + if (datasync && !(sync_fields & ~XFS_ILOG_TIMESTAMP)) > return 0; > return ip->i_itemp->ili_commit_seq; > } These xfs_fsync_seq() checks are the real reason we need to hold the ILOCK for fsync correctness. The ili_fsync_fields, pin count and the ili_commit_seq are updated as part of the transaction commit processing. 
The only lock they share across this context is the ILOCK_EXCL, and the update order is this:

__xfs_trans_commit
  xfs_trans_run_precommits
    ->iop_precommit
      xfs_inode_item_precommit        >>>> updates ili_fsync_fields
  xlog_cil_commit
    xlog_cil_insert_items
      xlog_cil_insert_format_items
        xfs_cil_prepare_item
      ->iop_pin
        xfs_inode_item_pin()          >>>> pin count increment
    ->iop_committing(seq)
      xfs_inode_item_committing(seq)  >>>> writes ili_commit_seq
        xfs_inode_item_release()
          xfs_iunlock(ILOCK_EXCL)     >>>> inode now unlocked

Hence in xfs_file_fsync(), we have to hold the ILOCK_SHARED to determine if the log force needs to be done. The ILOCK was extended to cover the log force in commit fc0561cefc04 because we needed to clear the ili_fsync_fields after the log force, but it had to be done in a way that avoided racing with setting new fsync bits in the transaction commit. Using the ILOCK was the only way to do that, so the lock was then held across the log force... Ok, so it looks like it is safe to drop the ILOCK across the log force as long as we have some other way to ensure we don't drop newly dirtied fsync fields on the ground. > @@ -92,13 +96,14 @@ xfs_fsync_seq( > * log up to the latest LSN that touched the inode. > * > * If we have concurrent fsync/fdatasync() calls, we need them to all block on > - * the log force before we clear the ili_fsync_fields field. This ensures that > - * we don't get a racing sync operation that does not wait for the metadata to > - * hit the journal before returning. If we race with clearing ili_fsync_fields, > - * then all that will happen is the log force will do nothing as the lsn will > - * already be on disk. We can't race with setting ili_fsync_fields because that > - * is done under XFS_ILOCK_EXCL, and that can't happen because we hold the lock > - * shared until after the ili_fsync_fields is cleared. > + * the log force until it is finished.
Thus we clear ili_fsync_fields so that > + * new modifications since starting log force can accumulate there and just > + * save old ili_fsync_fields value to ili_fsync_flushing_fields so that > + * concurrent fsyncs can use that to determine whether they need to wait for > + * running log force or not. This ensures that we don't get a racing sync > + * operation that does not wait for the metadata to hit the journal before > + * returning. If we race with clearing ili_fsync_fields, then all that will > + * happen is the log force will do nothing as the lsn will already be on disk. > */ static int > xfs_fsync_flush_log( > @@ -112,14 +117,20 @@ xfs_fsync_flush_log( > xfs_ilock(ip, XFS_ILOCK_SHARED); > seq = xfs_fsync_seq(ip, datasync); > if (seq) { > + spin_lock(&ip->i_itemp->ili_lock); > + ip->i_itemp->ili_fsync_flushing_fields = > + ip->i_itemp->ili_fsync_fields; > + ip->i_itemp->ili_fsync_fields = 0; > + spin_unlock(&ip->i_itemp->ili_lock); This is racy. If we get three fdatasync()s at the same time, they can do:

t0: fsync = ILOG_CORE
t0: flushing = ILOG_CORE, fsync = 0
t0: <log force>
t1: xfs_fsync_seq: fields = ILOG_CORE
t1: flushing = 0, fsync = 0
t1: <log force>
t2: xfs_fsync_seq: fields = 0
t2: <skips datasync>
......

In this case t2 should have waited on the log force like t0 and t1, but the lack of dirty fsync fields has allowed it to skip them. We avoid these problems with ili_fields/ili_last_fields by always ORing the ili_last_fields back into the ili_fields whenever we update it. This means we don't lose bits that were set by previous operations that are still in flight.
> + xfs_iunlock(ip, XFS_ILOCK_SHARED); > error = xfs_log_force_seq(ip->i_mount, seq, XFS_LOG_SYNC, > log_flushed); > - > spin_lock(&ip->i_itemp->ili_lock); > - ip->i_itemp->ili_fsync_fields = 0; > + ip->i_itemp->ili_fsync_flushing_fields = 0; > spin_unlock(&ip->i_itemp->ili_lock); > + } else { > + xfs_iunlock(ip, XFS_ILOCK_SHARED); > } > - xfs_iunlock(ip, XFS_ILOCK_SHARED); > return error; > } Hence it seems better to reintegrate xfs_fsync_seq() because it cleans up the locking and logic. I'd also call it "ili_last_fsync_fields" to match the naming that the inode flushing code uses (i.e. ili_last_fields): struct xfs_inode_log_item *iip = ip->i_itemp; unsigned int sync_fields; xfs_csn_t seq = 0; xfs_ilock(ip, XFS_ILOCK_SHARED); if (!xfs_ipincount(ip)) { xfs_iunlock(ip, XFS_ILOCK_SHARED); return 0; } spin_lock(&iip->ili_lock); sync_fields = iip->ili_fsync_fields | iip->ili_last_fsync_fields; /* * Don't force the log for O_DSYNC operations on inodes that * only have dirty timestamps. Timestamps are not necessary * for data integrity so we can skip them in this case. 
*/ if (!datasync || (sync_fields & ~XFS_ILOG_TIMESTAMP)) seq = iip->ili_commit_seq; iip->ili_last_fsync_fields |= iip->ili_fsync_fields; iip->ili_fsync_fields = 0; } spin_unlock(&iip->ili_lock); xfs_iunlock(ip, XFS_ILOCK_SHARED); if (!seq) return 0; error = xfs_log_force_seq(ip->i_mount, seq, XFS_LOG_SYNC, log_flushed); spin_lock(&iip->ili_lock); iip->ili_last_fsync_fields = 0; spin_unlock(&iip->ili_lock); return error; > > diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c > index 829675700fcd..39d15eb9311d 100644 > --- a/fs/xfs/xfs_inode_item.c > +++ b/fs/xfs/xfs_inode_item.c > @@ -1056,6 +1056,7 @@ xfs_iflush_abort_clean( > iip->ili_last_fields = 0; > iip->ili_fields = 0; > iip->ili_fsync_fields = 0; > + iip->ili_fsync_flushing_fields = 0; > iip->ili_flush_lsn = 0; > iip->ili_item.li_buf = NULL; > list_del_init(&iip->ili_item.li_bio_list); I also suspect that ili_last_fsync_fields needs to be cleared in xfs_iflush() at the same time ili_fsync_fields is cleared. i.e. flushing for writeback absolutely clears the fsync status of the inode because, by definition, the inode is not pinned in the journal and so is not dirty in memory. Hence a fsync at that point should result in a no-op.... -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH RFC] xfs: Don't hold XFS_ILOCK_SHARED over log force during fsync 2025-09-09 0:18 ` Dave Chinner @ 2025-09-10 9:05 ` Jan Kara 2025-09-11 0:59 ` Dave Chinner 0 siblings, 1 reply; 11+ messages in thread From: Jan Kara @ 2025-09-10 9:05 UTC (permalink / raw) To: Dave Chinner; +Cc: Jan Kara, linux-xfs [-- Attachment #1: Type: text/plain, Size: 9014 bytes --] On Tue 09-09-25 10:18:59, Dave Chinner wrote: > On Mon, Sep 08, 2025 at 05:12:49PM +0200, Jan Kara wrote: > > Holding XFS_ILOCK_SHARED over log force in xfs_fsync_flush_log() > > significantly increases contention on ILOCK for O_DSYNC | O_DIRECT > > writes to file preallocated with fallocate (thus DIO happens to > > unwritten extents and we need ILOCK in exclusive mode for timestamp > > modifications and extent conversions). But holding ILOCK over the log > > force doesn't seem strictly necessary for correctness. > > That was introduced a long while back in 2015 when the > ili_fsync_fields flags were introduced to optimise O_DSYNC to avoid > timestamp updates from causing log forces. That was commit > fc0561cefc04 ("xfs: optimise away log forces on timestamp updates for > fdatasync"). > > > We are just using > > it for a mechanism to make sure parallel fsyncs all wait for log force > > to complete but that can be also achieved without holding ILOCK. > > Not exactly. It was used to ensure that we correctly serialised > the setting of newly dirty fsync flags against the log force that > allows us to clearing the existing fsync flags. It requires the > ILOCK_EXCL to set new flags, hence holding the ILOCK_SHARED was > sufficient to acheive this. > > At the time, the ili_lock did not exist (introduced in 2020), so the > only way to serialise inode log item flags updates was to use the > ILOCK. But now with the ili_lock, we can update fields safely > without holding the ILOCK... 
> > > With this patch DB2 database restore operation speeds up by a factor of > > about 2.5x in a VM with 4 CPUs, 16GB of RAM and NVME SSD as a backing > > store. > > *nod* > > > Signed-off-by: Jan Kara <jack@suse.cz> > > --- > > fs/xfs/xfs_file.c | 33 ++++++++++++++++++++++----------- > > fs/xfs/xfs_inode_item.c | 1 + > > fs/xfs/xfs_inode_item.h | 1 + > > 3 files changed, 24 insertions(+), 11 deletions(-) > > > > I've chosen adding ili_fsync_flushing_fields to xfs_inode_log_item since that > > seemed in line with how the code is structured. Arguably that is unnecessarily > > wasteful since in practice we use just one bit of information from > > ili_fsync_fields and one bit from ili_fsync_flushing_fields. If people prefer > > more space efficient solution, I can certainly do that. > > So you are trying to mirror the ili_fields -> ili_last_fields > behaviour w.r.t. inode flushing? Kind of yes. > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c > > index b04c59d87378..2bb793c8c179 100644 > > --- a/fs/xfs/xfs_file.c > > +++ b/fs/xfs/xfs_file.c > > @@ -80,9 +80,13 @@ xfs_fsync_seq( > > struct xfs_inode *ip, > > bool datasync) > > { > > + unsigned int sync_fields; > > + > > if (!xfs_ipincount(ip)) > > return 0; > > - if (datasync && !(ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP)) > > + sync_fields = ip->i_itemp->ili_fsync_fields | > > + ip->i_itemp->ili_fsync_flushing_fields; > > Racy read - the new ili_fsync_flushing_fields field is protected by > ili_lock, not ILOCK.... Yup, luckily easy to fix. > > + if (datasync && !(sync_fields & ~XFS_ILOG_TIMESTAMP)) > > return 0; > > return ip->i_itemp->ili_commit_seq; > > } > > These xfs_fsync_seq() checks are the real reason we need to hold the > ILOCK for fsync correctness. > > The ili_fsync_fields, pin count and the ili_commit_seq are updated > as part of the transaction commit processing. 
The only lock they > share across this context is the ILOCK_EXCL, and the update order is > this: > > __xfs_trans_commit > xfs_trans_run_precommits > ->iop_precommit > xfs_inode_item_precommit >>>> updates ili_fsync_fields > xlog_cil_commit > xlog_cil_insert_items > xlog_cil_insert_format_items > xfs_cil_prepare_item > ->iop_pin > xfs_inode_item_pin() >>>> pin count increment > ->iop_committing(seq) > xfs_inode_item_committing(seq) >>>> writes ili_commit_seq > xfs_inode_item_release() > xfs_iunlock(ILOCK_EXCL) >>>> inode now unlocked > > > Hence in xfs_file_fsync(), we have to hold the ILOCK_SHARED to > determine if the log force needs to be done. > > The ILOCK was extended to cover the log force in commit fc0561cefc04 > because we needed to clear the ili_fsync_fields after the log force, > but it had to be done in a way that avoided racing with setting new > fsync bits in the transaction commit. Using the ILOCK was the only > way to do that, so the lock was then held across the log force... > > Ok, so it look slike it is safe to drop the ILOCK across the log > force as long as we have some other way to ensure we don't drop > newly dirtied fsync fields on the ground. Thanks for the details! > > @@ -112,14 +117,20 @@ xfs_fsync_flush_log( > > xfs_ilock(ip, XFS_ILOCK_SHARED); > > seq = xfs_fsync_seq(ip, datasync); > > if (seq) { > > + spin_lock(&ip->i_itemp->ili_lock); > > + ip->i_itemp->ili_fsync_flushing_fields = > > + ip->i_itemp->ili_fsync_fields; > > + ip->i_itemp->ili_fsync_fields = 0; > > + spin_unlock(&ip->i_itemp->ili_lock); > > This is racy. if we get three fdatasync()s at the same time, they > can do: > > t0 t1 t2 > > fsync = ILOG_CORE > flushing = ILOG_CORE > fsync = 0 > <log force> > xfs_fsync_seq > fields = ILOG_CORE > fsync = 0 > flushing = 0 > fsync = 0 > <log force> > xfs_fsync_seq > fields = 0 > <skips datasync> > ...... 
> <log force completes> > > In this case t2 should have waited on the log force like t0 and t1, > but the lack of dirty fsync fields has allowed it to skip them. > > We avoid these problems with ili_fields/ili_last_fields by always > ORing the ili_last_fields back into the ili_fields whenever we > update it. This means we don't lose bits that were set by previous > operations that are still in flight. Good spotting! > > + xfs_iunlock(ip, XFS_ILOCK_SHARED); > > error = xfs_log_force_seq(ip->i_mount, seq, XFS_LOG_SYNC, > > log_flushed); > > - > > spin_lock(&ip->i_itemp->ili_lock); > > - ip->i_itemp->ili_fsync_fields = 0; > > + ip->i_itemp->ili_fsync_flushing_fields = 0; > > spin_unlock(&ip->i_itemp->ili_lock); > > + } else { > > + xfs_iunlock(ip, XFS_ILOCK_SHARED); > > } > > - xfs_iunlock(ip, XFS_ILOCK_SHARED); > > return error; > > } > > Hence it seems better to reintegrate xfs_fsync_seq() because it > cleans up the locking and logic. I'd also call it > "ili_last_fsync_fields" to match the naming that the inode flushing > code uses (i.e. ili_last_fields): > > struct xfs_inode_log_item *iip = ip->i_itemp; > unsigned int sync_fields; > xfs_csn_t seq = 0; > > xfs_ilock(ip, XFS_ILOCK_SHARED); > if (!xfs_ipincount(ip)) { > xfs_iunlock(ip, XFS_ILOCK_SHARED); > return 0; > } > > spin_lock(&iip->ili_lock); > sync_fields = iip->ili_fsync_fields | iip->ili_last_fsync_fields; > > /* > * Don't force the log for O_DSYNC operations on inodes that > * only have dirty timestamps. Timestamps are not necessary > * for data integrity so we can skip them in this case. 
> */ > if (!datasync || (sync_fields & ~XFS_ILOG_TIMESTAMP)) > seq = iip->ili_commit_seq; > iip->ili_last_fsync_fields |= iip->ili_fsync_fields; > iip->ili_fsync_fields = 0; > } > spin_unlock(&iip->ili_lock); > xfs_iunlock(ip, XFS_ILOCK_SHARED); > > if (!seq) > return 0; > > error = xfs_log_force_seq(ip->i_mount, seq, XFS_LOG_SYNC, > log_flushed); > spin_lock(&iip->ili_lock); > iip->ili_last_fsync_fields = 0; > spin_unlock(&iip->ili_lock); > > return error; After some thinking... But isn't this still racy in a different way? Like:

t0: fsync = ILOG_CORE
t0: last_fsync = ILOG_CORE, fsync = 0
t0: <log force lsn1>
t1: modifies file - this is now possible because we don't hold ILOCK anymore
t1: calls fsync so: fsync = ILOG_CORE
t1: last_fsync = ILOG_CORE, fsync = 0
t1: <log force lsn2>
......
t0: <log force lsn1 completes>
t0: last_fsync = 0
t2: fsync: fsync == 0 && last_fsync == 0 => <skips fsync>

Hence t2 didn't wait until the log force of lsn2 completed, which looks like a bug. The problem is that last_fsync was cleared too early. We'd need to clear it only once the latest log force for the inode has finished (which is presumably doable by remembering the latest forced seq in xfs_inode_log_item). But I'm wondering whether we aren't overcomplicating this. Cannot we just, instead of ili_fsync_fields, maintain the latest seq when the inode has been modified in a way relevant for fdatasync? Let's call that ili_datasync_commit_seq. Then in case of datasync we'll call xfs_log_force_seq() for ili_datasync_commit_seq and in case of fsync for ili_commit_seq. No need for complex clearing of ili_fsync_fields... The only thing I'm not sure about is how to best handle the ili_fsync_fields check in xfs_bmbt_to_iomap() - I haven't found a function to check whether a particular seq has already been committed or not. Tentative patch attached.
Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR [-- Attachment #2: 0001-xfs-Don-t-hold-XFS_ILOCK_SHARED-over-log-force-durin.patch --] [-- Type: text/x-patch, Size: 5958 bytes --] From d699b7901ccb3e85e197c6dcc1b3ffd2a33707ef Mon Sep 17 00:00:00 2001 From: Jan Kara <jack@suse.cz> Date: Thu, 31 Jul 2025 11:24:01 +0200 Subject: [PATCH v2] xfs: Don't hold XFS_ILOCK_SHARED over log force during fsync Holding XFS_ILOCK_SHARED over log force in xfs_fsync_flush_log() significantly increases contention on ILOCK for O_DSYNC | O_DIRECT writes to file preallocated with fallocate (thus DIO happens to unwritten extents and we need ILOCK in exclusive mode for timestamp modifications and extent conversions). But holding ILOCK over the log force doesn't seem strictly necessary for correctness. We are just using it for a mechanism to make sure parallel fsyncs all wait for log force to complete but that can be also achieved without holding ILOCK. With this patch DB2 database restore operation speeds up by a factor of about 2.5x in a VM with 4 CPUs, 16GB of RAM and NVME SSD as a backing store. Signed-off-by: Jan Kara <jack@suse.cz> --- fs/xfs/xfs_file.c | 48 ++++++++++++++--------------------------- fs/xfs/xfs_inode.c | 2 -- fs/xfs/xfs_inode_item.c | 12 ++++++++--- fs/xfs/xfs_inode_item.h | 4 +++- 4 files changed, 28 insertions(+), 38 deletions(-) diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index b04c59d87378..c5b8ebddc179 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -75,30 +75,10 @@ xfs_dir_fsync( return xfs_log_force_inode(ip); } -static xfs_csn_t -xfs_fsync_seq( - struct xfs_inode *ip, - bool datasync) -{ - if (!xfs_ipincount(ip)) - return 0; - if (datasync && !(ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP)) - return 0; - return ip->i_itemp->ili_commit_seq; -} - /* * All metadata updates are logged, which means that we just have to flush the - * log up to the latest LSN that touched the inode. 
- * - * If we have concurrent fsync/fdatasync() calls, we need them to all block on - * the log force before we clear the ili_fsync_fields field. This ensures that - * we don't get a racing sync operation that does not wait for the metadata to - * hit the journal before returning. If we race with clearing ili_fsync_fields, - * then all that will happen is the log force will do nothing as the lsn will - * already be on disk. We can't race with setting ili_fsync_fields because that - * is done under XFS_ILOCK_EXCL, and that can't happen because we hold the lock - * shared until after the ili_fsync_fields is cleared. + * log up to the latest LSN that modified the inode metadata relevant for the + * fsync/fdatasync(). */ static int xfs_fsync_flush_log( @@ -106,21 +86,25 @@ xfs_fsync_flush_log( bool datasync, int *log_flushed) { - int error = 0; + struct xfs_inode_log_item *iip = ip->i_itemp; xfs_csn_t seq; xfs_ilock(ip, XFS_ILOCK_SHARED); - seq = xfs_fsync_seq(ip, datasync); - if (seq) { - error = xfs_log_force_seq(ip->i_mount, seq, XFS_LOG_SYNC, - log_flushed); - - spin_lock(&ip->i_itemp->ili_lock); - ip->i_itemp->ili_fsync_fields = 0; - spin_unlock(&ip->i_itemp->ili_lock); + if (!xfs_ipincount(ip)) { + xfs_iunlock(ip, XFS_ILOCK_SHARED); + return 0; } + + if (datasync) + seq = iip->ili_datasync_commit_seq; + else + seq = iip->ili_commit_seq; xfs_iunlock(ip, XFS_ILOCK_SHARED); - return error; + + if (!seq) + return 0; + + return xfs_log_force_seq(ip->i_mount, seq, XFS_LOG_SYNC, log_flushed); } STATIC int diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index 9c39251961a3..209b8aba6238 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -1656,7 +1656,6 @@ xfs_ifree_mark_inode_stale( spin_lock(&iip->ili_lock); iip->ili_last_fields = iip->ili_fields; iip->ili_fields = 0; - iip->ili_fsync_fields = 0; spin_unlock(&iip->ili_lock); ASSERT(iip->ili_last_fields); @@ -2502,7 +2501,6 @@ xfs_iflush( spin_lock(&iip->ili_lock); iip->ili_last_fields = iip->ili_fields; 
iip->ili_fields = 0; - iip->ili_fsync_fields = 0; set_bit(XFS_LI_FLUSHING, &iip->ili_item.li_flags); spin_unlock(&iip->ili_lock); diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c index 829675700fcd..27f28bea0a8f 100644 --- a/fs/xfs/xfs_inode_item.c +++ b/fs/xfs/xfs_inode_item.c @@ -153,7 +153,10 @@ xfs_inode_item_precommit( * (ili_fields) correctly tracks that the version has changed. */ spin_lock(&iip->ili_lock); - iip->ili_fsync_fields |= (flags & ~XFS_ILOG_IVERSION); + if (flags & ~(XFS_ILOG_IVERSION | XFS_ILOG_TIMESTAMP)) + iip->ili_datasync_tx = true; + else + iip->ili_datasync_tx = false; if (flags & XFS_ILOG_IVERSION) flags = ((flags & ~XFS_ILOG_IVERSION) | XFS_ILOG_CORE); @@ -863,7 +866,11 @@ xfs_inode_item_committing( struct xfs_log_item *lip, xfs_csn_t seq) { - INODE_ITEM(lip)->ili_commit_seq = seq; + struct xfs_inode_log_item *iip = INODE_ITEM(lip); + + iip->ili_commit_seq = seq; + if (iip->ili_datasync_tx) + iip->ili_datasync_commit_seq = seq; return xfs_inode_item_release(lip); } @@ -1055,7 +1062,6 @@ xfs_iflush_abort_clean( { iip->ili_last_fields = 0; iip->ili_fields = 0; - iip->ili_fsync_fields = 0; iip->ili_flush_lsn = 0; iip->ili_item.li_buf = NULL; list_del_init(&iip->ili_item.li_bio_list); diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h index ba92ce11a011..7a3364f3a6b4 100644 --- a/fs/xfs/xfs_inode_item.h +++ b/fs/xfs/xfs_inode_item.h @@ -32,9 +32,11 @@ struct xfs_inode_log_item { spinlock_t ili_lock; /* flush state lock */ unsigned int ili_last_fields; /* fields when flushed */ unsigned int ili_fields; /* fields to be logged */ - unsigned int ili_fsync_fields; /* logged since last fsync */ + bool ili_datasync_tx; /* does fdatasync need to flush current tx? 
*/ xfs_lsn_t ili_flush_lsn; /* lsn at last flush */ xfs_csn_t ili_commit_seq; /* last transaction commit */ + /* last transaction commit with changes relevant for fdatasync */ + xfs_csn_t ili_datasync_commit_seq; }; static inline int xfs_inode_clean(struct xfs_inode *ip) -- 2.51.0 ^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH RFC] xfs: Don't hold XFS_ILOCK_SHARED over log force during fsync 2025-09-10 9:05 ` Jan Kara @ 2025-09-11 0:59 ` Dave Chinner 2025-09-16 6:15 ` Dave Chinner 0 siblings, 1 reply; 11+ messages in thread From: Dave Chinner @ 2025-09-11 0:59 UTC (permalink / raw) To: Jan Kara; +Cc: linux-xfs On Wed, Sep 10, 2025 at 11:05:18AM +0200, Jan Kara wrote: > On Tue 09-09-25 10:18:59, Dave Chinner wrote: > > On Mon, Sep 08, 2025 at 05:12:49PM +0200, Jan Kara wrote: > > > Holding XFS_ILOCK_SHARED over log force in xfs_fsync_flush_log() > > > significantly increases contention on ILOCK for O_DSYNC | O_DIRECT > > > writes to file preallocated with fallocate (thus DIO happens to > > > unwritten extents and we need ILOCK in exclusive mode for timestamp > > > modifications and extent conversions). But holding ILOCK over the log > > > force doesn't seem strictly necessary for correctness. > > > > That was introduced a long while back in 2015 when the > > ili_fsync_fields flags were introduced to optimise O_DSYNC to avoid > > timestamp updates from causing log forces. That was commit > > fc0561cefc04 ("xfs: optimise away log forces on timestamp updates for > > fdatasync"). ..... > > > + xfs_iunlock(ip, XFS_ILOCK_SHARED); > > > error = xfs_log_force_seq(ip->i_mount, seq, XFS_LOG_SYNC, > > > log_flushed); > > > - > > > spin_lock(&ip->i_itemp->ili_lock); > > > - ip->i_itemp->ili_fsync_fields = 0; > > > + ip->i_itemp->ili_fsync_flushing_fields = 0; > > > spin_unlock(&ip->i_itemp->ili_lock); > > > + } else { > > > + xfs_iunlock(ip, XFS_ILOCK_SHARED); > > > } > > > - xfs_iunlock(ip, XFS_ILOCK_SHARED); > > > return error; > > > } > > > > Hence it seems better to reintegrate xfs_fsync_seq() because it > > cleans up the locking and logic. I'd also call it > > "ili_last_fsync_fields" to match the naming that the inode flushing > > code uses (i.e. 
ili_last_fields): > > > > struct xfs_inode_log_item *iip = ip->i_itemp; > > unsigned int sync_fields; > > xfs_csn_t seq = 0; > > > > xfs_ilock(ip, XFS_ILOCK_SHARED); > > if (!xfs_ipincount(ip)) { > > xfs_iunlock(ip, XFS_ILOCK_SHARED); > > return 0; > > } > > > > spin_lock(&iip->ili_lock); > > sync_fields = iip->ili_fsync_fields | iip->ili_last_fsync_fields; > > > > /* > > * Don't force the log for O_DSYNC operations on inodes that > > * only have dirty timestamps. Timestamps are not necessary > > * for data integrity so we can skip them in this case. > > */ > > if (!datasync || (sync_fields & ~XFS_ILOG_TIMESTAMP)) > > seq = iip->ili_commit_seq; > > iip->ili_last_fsync_fields |= iip->ili_fsync_fields; > > iip->ili_fsync_fields = 0; > > } > > spin_unlock(&iip->ili_lock); > > xfs_iunlock(ip, XFS_ILOCK_SHARED); > > > > if (!seq) > > return 0; > > > > error = xfs_log_force_seq(ip->i_mount, seq, XFS_LOG_SYNC, > > log_flushed); > > spin_lock(&iip->ili_lock); > > iip->ili_last_fsync_fields = 0; > > spin_unlock(&iip->ili_lock); > > > > return error; > > After some thinking... But isn't this still racy in a different way? Like: > > t0 t1 t2 > > fsync = ILOG_CORE > last_fsync = ILOG_CORE > fsync = 0 > <log force lsn1> > modifies file - this is now possible because we > don't hold ILOCK anymore > calls fsync > so: > fsync = ILOG_CORE > last_fsync = ILOG_CORE > fsync = 0 > <log force lsn2> > ...... > <log force lsn1 completes> > last_fsync = 0 > fsync > fsync == 0 && last_fsync == 0 => > <skips fsync> > > Hence t2 didn't wait until log force of lsn2 completed which looks like a > bug. Yes, you are right. I didn't think about that case... > The problem is last_fsync was cleared too early. We'd need to clear it only > once the latest log force for the inode has finished (which presumably > doable by remembering the latest forced seq in xfs_inode_log_item). > > But I'm wondering whether we aren't overcomplicating this. 
Cannot we just > instead of ili_fsync_fields maintain the latest seq when the inode has been > modified in a way relevant for fdatasync? Trying to work out the implications of not pushing the latest sequence when a datasync operation is asked for is making my head hurt. I -think- it will work, but.... > Let's call that > ili_datasync_commit_seq. Then in case of datasync we'll call > xfs_log_force_seq() for ili_datasync_commit_seq and in case of fsync for > ili_commit_seq. No need for complex clearing of ili_fsync_fields... The > only thing I'm not sure about is how to best handle the ili_fsync_fields > check in xfs_bmbt_to_iomap() - I haven't found a function to check whether a > particular seq has been already committed or not. .... this is problematic. It would need to look something like xlog_item_in_current_chkpt(). That checks the log item is dirty in the current checkpoint, but doesn't check against all the checkpoints that are currently in flight. To do that, we would have to walk the cil->xc_committing list to check against the sequence numbers of all the outstanding checkpoints that are in flight. However, we -really- don't want to be iterating that list on every IO; that would make the cil->xc_push_lock a very hot lock. That's going to cause performance regressions, not improve things. Something different needs to be done here... > Tentative patch attached. ....
> static int > xfs_fsync_flush_log( > @@ -106,21 +86,25 @@ xfs_fsync_flush_log( > bool datasync, > int *log_flushed) > { > - int error = 0; > + struct xfs_inode_log_item *iip = ip->i_itemp; > xfs_csn_t seq; > > xfs_ilock(ip, XFS_ILOCK_SHARED); > - seq = xfs_fsync_seq(ip, datasync); > - if (seq) { > - error = xfs_log_force_seq(ip->i_mount, seq, XFS_LOG_SYNC, > - log_flushed); > - > - spin_lock(&ip->i_itemp->ili_lock); > - ip->i_itemp->ili_fsync_fields = 0; > - spin_unlock(&ip->i_itemp->ili_lock); > + if (!xfs_ipincount(ip)) { > + xfs_iunlock(ip, XFS_ILOCK_SHARED); > + return 0; > } > + > + if (datasync) > + seq = iip->ili_datasync_commit_seq; > + else > + seq = iip->ili_commit_seq; > xfs_iunlock(ip, XFS_ILOCK_SHARED); > - return error; > + > + if (!seq) > + return 0; > + > + return xfs_log_force_seq(ip->i_mount, seq, XFS_LOG_SYNC, log_flushed); OK, so why do we still need the ILOCK and pin count check here? If we don't check xfs_ipincount() and simply pass the relevant sequence number to xfs_log_force_seq(), then it will check if the sequence has already been committed and skip it if it has already been done. Yes, this has the same cil->xc_push_lock contention issue as xfs_bmbt_to_iomap() would have, but we can mostly avoid this with a further modification to the sequence number behaviour. i.e. if we clear the commit sequences on last unpin (i.e. in xfs_inode_item_unpin) then an item that is not in the CIL (and so doesn't have dirty metadata) will have no associated commit sequence number set. Hence if ili_datasync_commit_seq is non-zero, then by definition the inode must be pinned and has been dirtied for datasync purposes. That means we can simply query ili_datasync_commit_seq in xfs_bmbt_to_iomap() to set IOMAP_F_DIRTY. 
I suspect that the above fsync code can then become: spin_lock(&iip->ili_lock); if (datasync) seq = iip->ili_datasync_commit_seq; else seq = iip->ili_commit_seq; spin_unlock(&iip->ili_lock); if (!seq) return 0; return xfs_log_force_seq(ip->i_mount, seq, XFS_LOG_SYNC, log_flushed); For the same reason. i.e. a non-zero sequence number implies the inode log item is dirty in the CIL and pinned. At this point, we really don't care about races with transaction commits. f(data)sync should only wait for modifications that have been fully completed. If they haven't set the commit sequence in the log item, they haven't fully completed. If the commit sequence is already set, then the CIL push will co-ordinate appropriately with commits to ensure correct data integrity behaviour occurs. Hence I think that if we tie the sequence number clearing to the inode being removed from the CIL (i.e. last unpin) we can drop all the pin checks and use the commit sequence numbers directly to determine what the correct behaviour should be... > @@ -863,7 +866,11 @@ xfs_inode_item_committing( > struct xfs_log_item *lip, > xfs_csn_t seq) > { > - INODE_ITEM(lip)->ili_commit_seq = seq; > + struct xfs_inode_log_item *iip = INODE_ITEM(lip); > + > + iip->ili_commit_seq = seq; > + if (iip->ili_datasync_tx) > + iip->ili_datasync_commit_seq = seq; > return xfs_inode_item_release(lip); I think I'd prefer that this be based on iip->ili_dirty_flags. i.e. we currently clear that field at the end of inode_item_precommit, but if we leave it intact until the committing callback we can look directly at those flags and then zero them unconditionally. i.e.
struct xfs_inode_log_item *iip = INODE_ITEM(lip); spin_lock(&iip->ili_lock); iip->ili_commit_seq = seq; if (iip->ili_dirty_flags & ~(XFS_ILOG_IVERSION|XFS_ILOG_TIMESTAMP)) iip->ili_datasync_commit_seq = seq; spin_unlock(&iip->ili_lock); /* * Clear all the transaction specific dirty state and * release the log item reference now we are done committing * the transaction. */ iip->ili_dirty_flags = 0; return xfs_inode_item_release(lip); This way we don't need an extra field in the inode log item just to hold a temporary state between ->precommit and ->committing callbacks. -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH RFC] xfs: Don't hold XFS_ILOCK_SHARED over log force during fsync 2025-09-11 0:59 ` Dave Chinner @ 2025-09-16 6:15 ` Dave Chinner 2025-09-16 13:32 ` Jan Kara 0 siblings, 1 reply; 11+ messages in thread From: Dave Chinner @ 2025-09-16 6:15 UTC (permalink / raw) To: Jan Kara; +Cc: linux-xfs On Thu, Sep 11, 2025 at 10:59:15AM +1000, Dave Chinner wrote: > i.e. if we clear the commit sequences on last unpin (i.e. in > xfs_inode_item_unpin) then an item that is not in the CIL (and so > doesn't have dirty metadata) will have no associated commit > sequence number set. > > Hence if ili_datasync_commit_seq is non-zero, then by definition the > inode must be pinned and has been dirtied for datasync purposes. > That means we can simply query ili_datasync_commit_seq in > xfs_bmbt_to_iomap() to set IOMAP_F_DIRTY. > > I suspect that the above fsync code can then become: > > spin_lock(&iip->ili_lock); > if (datasync) > seq = iip->ili_datasync_commit_seq; > else > seq = iip->ili_commit_seq; > spin_unlock(&iip->ili_lock); > > if (!seq) > return 0; > return xfs_log_force_seq(ip->i_mount, seq, XFS_LOG_SYNC, log_flushed); > > For the same reason. i.e. a non-zero sequence number implies the > inode log item is dirty in the CIL and pinned. > > At this point, we really don't care about races with transaction > commits. f(data)sync should only wait for modifications that have > been fully completed. If they haven't set the commit sequence in the > log item, they haven't fully completed. If the commit sequence is > already set, the the CIL push will co-ordinate appropriately with > commits to ensure correct data integrity behaviour occurs. > > Hence I think that if we tie the sequence number clearing to the > inode being removed from the CIL (i.e. last unpin) we can drop all > the pin checks and use the commit sequence numbers directly to > determine what the correct behaviour should be... Here's a patch that implements this. 
It appears to pass fstests without any regressions on my test VMs. Can you test it and check that it retains the expected performance improvement for O_DSYNC+DIO on fallocate()d space? -Dave. -- Dave Chinner david@fromorbit.com xfs: rework datasync tracking and execution From: Dave Chinner <dchinner@redhat.com> Jan Kara reported that the shared ILOCK held across the journal flush during fdatasync operations slows down O_DSYNC DIO on unwritten extents significantly. The underlying issue is that unwritten extent conversion needs the ILOCK exclusive, whilst the datasync operation after the extent conversion holds it shared. Hence we cannot be flushing the journal for one IO completion whilst at the same time doing unwritten extent conversion on another IO completion on the same inode. This means that IO completions lock-step, and IO performance is dependent on the journal flush latency. Jan demonstrated that reducing the fdatasync lock hold time can improve O_DSYNC DIO to unwritten extents performance by 2.5x. Discussion on that patch found issues with the method, and we came to the conclusion that separately tracking datasync flush sequences was the best approach to solving the problem. The fsync code uses the ILOCK to serialise against concurrent modifications in the transaction commit phase. In a transaction commit, there are several disjoint updates to inode log item state that need to be considered atomically by the fsync code. These operations are all done under ILOCK_EXCL context: 1. ili_fsync_flags is updated in ->iop_precommit 2. i_pincount is updated in ->iop_pin before it is added to the CIL 3. ili_commit_seq is updated in ->iop_committing, after it has been added to the CIL In fsync, we need to: 1. check that the inode is dirty in the journal (ipincount) 2. check that ili_fsync_flags is set 3. grab the ili_commit_seq if a journal flush is needed 4.
clear the ili_fsync_flags to ensure that new modifications that require fsync are tracked in ->iop_precommit correctly The serialisation of ipincount/ili_commit_seq is needed to ensure that we don't try to unnecessarily flush the journal. The serialisation of ili_fsync_flags being set in ->iop_precommit and cleared in fsync post journal flush is required for correctness. Hence holding the ILOCK_SHARED in xfs_file_fsync() performs all this serialisation for us. Ideally, we want to remove the need to hold the ILOCK_SHARED in xfs_file_fsync() for best performance. We start with the observation that fsync/fdatasync() only need to wait for operations that have been completed. Hence operations that are still being committed have not completed and datasync operations do not need to wait for them. This means we can use a single point in time in the commit process to signal "this modification is complete". This is what ->iop_committing is supposed to provide - it is the point at which the object is unlocked after the modification has been recorded in the CIL. Hence we could use ili_commit_seq to determine if we should flush the journal. In theory, we can already do this. However, in practice this will expose an internal global CIL lock to the IO path. The ipincount() checks optimise away the need to take this lock - if the inode is not pinned, then it is not in the CIL and we don't need to check if a journal flush at ili_commit_seq needs to be performed. The reason this is needed is that the ili_commit_seq is never cleared. Once it is set, it remains set even once the journal has been committed and the object has been unpinned. Hence we have to look at the journal's internal commit sequence state to determine if ili_commit_seq needs to be acted on or not. We can solve this by clearing ili_commit_seq when the inode is unpinned.
If we clear it atomically with the last unpin going away, then we are guaranteed that new modifications will order correctly as they add a new pin counts and we won't clear a sequence number for an active modification in the CIL. Further, we can then allow the per-transaction flag state to propagate into ->iop_committing (instead of clearing it in ->iop_precommit) and that will allow us to determine if the modification needs a full fsync or just a datasync, and so we can record a separate datasync sequence number (Jan's idea!) and then use that in the fdatasync path instead of the full fsync sequence number. With this infrastructure in place, we no longer need the ILOCK_SHARED in the fsync path. All serialisation is done against the commit sequence numbers - if the sequence number is set, then we have to flush the journal. If it is not set, then we have nothing to do. This greatly simplifies the fsync implementation.... Signed-off-by: Dave Chinner <dchinner@redhat.com> --- fs/xfs/xfs_file.c | 75 ++++++++++++++++++++++--------------------------- fs/xfs/xfs_inode.c | 25 +++++++++++------ fs/xfs/xfs_inode_item.c | 58 ++++++++++++++++++++++++++++++-------- fs/xfs/xfs_inode_item.h | 10 ++++++- fs/xfs/xfs_iomap.c | 15 ++++++++-- 5 files changed, 119 insertions(+), 64 deletions(-) diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index f96fbf5c54c9..2702fef2c90c 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -75,52 +75,47 @@ xfs_dir_fsync( return xfs_log_force_inode(ip); } -static xfs_csn_t -xfs_fsync_seq( - struct xfs_inode *ip, - bool datasync) -{ - if (!xfs_ipincount(ip)) - return 0; - if (datasync && !(ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP)) - return 0; - return ip->i_itemp->ili_commit_seq; -} - /* - * All metadata updates are logged, which means that we just have to flush the - * log up to the latest LSN that touched the inode. 
+ * All metadata updates are logged, which means that we just have to push the + * journal to the required sequence number that holds the updates. We track + * datasync commits separately to full sync commits, and hence only need to + * select the correct sequence number for the log force here. * - * If we have concurrent fsync/fdatasync() calls, we need them to all block on - * the log force before we clear the ili_fsync_fields field. This ensures that - * we don't get a racing sync operation that does not wait for the metadata to - * hit the journal before returning. If we race with clearing ili_fsync_fields, - * then all that will happen is the log force will do nothing as the lsn will - * already be on disk. We can't race with setting ili_fsync_fields because that - * is done under XFS_ILOCK_EXCL, and that can't happen because we hold the lock - * shared until after the ili_fsync_fields is cleared. + * We don't have to serialise against concurrent modifications, as we do not + * have to wait for modifications that have not yet completed. We define a + * transaction commit as completing when the commit sequence number is updated, + * hence if the sequence number has not updated, the sync operation has been + * run before the commit completed and we don't have to wait for it. + * + * If we have concurrent fsync/fdatasync() calls, the sequence numbers remain + * set on the log item until - at least - the journal flush completes. In + * reality, they are only cleared when the inode is fully unpinned (i.e. + * persistent in the journal and not dirty in the CIL), and so we rely on + * xfs_log_force_seq() either skipping sequences that have been persisted or + * waiting on sequences that are still in flight to correctly order concurrent + * sync operations.
*/ -static int +static int xfs_fsync_flush_log( struct xfs_inode *ip, bool datasync, int *log_flushed) { - int error = 0; - xfs_csn_t seq; + struct xfs_inode_log_item *iip = ip->i_itemp; + xfs_csn_t seq = 0; - xfs_ilock(ip, XFS_ILOCK_SHARED); - seq = xfs_fsync_seq(ip, datasync); - if (seq) { - error = xfs_log_force_seq(ip->i_mount, seq, XFS_LOG_SYNC, + spin_lock(&iip->ili_lock); + if (datasync) + seq = iip->ili_datasync_seq; + else + seq = iip->ili_commit_seq; + spin_unlock(&iip->ili_lock); + + if (!seq) + return 0; + + return xfs_log_force_seq(ip->i_mount, seq, XFS_LOG_SYNC, log_flushed); - - spin_lock(&ip->i_itemp->ili_lock); - ip->i_itemp->ili_fsync_fields = 0; - spin_unlock(&ip->i_itemp->ili_lock); - } - xfs_iunlock(ip, XFS_ILOCK_SHARED); - return error; } STATIC int @@ -158,12 +153,10 @@ xfs_file_fsync( error = blkdev_issue_flush(mp->m_ddev_targp->bt_bdev); /* - * Any inode that has dirty modifications in the log is pinned. The - * racy check here for a pinned inode will not catch modifications - * that happen concurrently to the fsync call, but fsync semantics - * only require to sync previously completed I/O. + * If the inode has an inode log item attached, it may need the journal + * flushed to persist any changes the log item might be tracking.
*/ - if (xfs_ipincount(ip)) { + if (ip->i_itemp) { err2 = xfs_fsync_flush_log(ip, datasync, &log_flushed); if (err2 && !error) error = err2; diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index 0ddb9ce0f5e3..b5619ed5667b 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -1667,7 +1667,6 @@ xfs_ifree_mark_inode_stale( spin_lock(&iip->ili_lock); iip->ili_last_fields = iip->ili_fields; iip->ili_fields = 0; - iip->ili_fsync_fields = 0; spin_unlock(&iip->ili_lock); ASSERT(iip->ili_last_fields); @@ -1832,12 +1831,20 @@ static void xfs_iunpin( struct xfs_inode *ip) { - xfs_assert_ilocked(ip, XFS_ILOCK_EXCL | XFS_ILOCK_SHARED); + struct xfs_inode_log_item *iip = ip->i_itemp; + xfs_csn_t seq = 0; trace_xfs_inode_unpin_nowait(ip, _RET_IP_); + xfs_assert_ilocked(ip, XFS_ILOCK_EXCL | XFS_ILOCK_SHARED); + + spin_lock(&iip->ili_lock); + seq = iip->ili_commit_seq; + spin_unlock(&iip->ili_lock); + if (!seq) + return; /* Give the log a push to start the unpinning I/O */ - xfs_log_force_seq(ip->i_mount, ip->i_itemp->ili_commit_seq, 0, NULL); + xfs_log_force_seq(ip->i_mount, seq, 0, NULL); } @@ -2506,7 +2513,6 @@ xfs_iflush( spin_lock(&iip->ili_lock); iip->ili_last_fields = iip->ili_fields; iip->ili_fields = 0; - iip->ili_fsync_fields = 0; set_bit(XFS_LI_FLUSHING, &iip->ili_item.li_flags); spin_unlock(&iip->ili_lock); @@ -2665,12 +2671,15 @@ int xfs_log_force_inode( struct xfs_inode *ip) { + struct xfs_inode_log_item *iip = ip->i_itemp; xfs_csn_t seq = 0; - xfs_ilock(ip, XFS_ILOCK_SHARED); - if (xfs_ipincount(ip)) - seq = ip->i_itemp->ili_commit_seq; - xfs_iunlock(ip, XFS_ILOCK_SHARED); + if (!iip) + return 0; + + spin_lock(&iip->ili_lock); + seq = iip->ili_commit_seq; + spin_unlock(&iip->ili_lock); if (!seq) return 0; diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c index afb6cadf7793..83b94b437696 100644 --- a/fs/xfs/xfs_inode_item.c +++ b/fs/xfs/xfs_inode_item.c @@ -153,7 +153,6 @@ xfs_inode_item_precommit( * (ili_fields) correctly tracks that the 
version has changed. */ spin_lock(&iip->ili_lock); - iip->ili_fsync_fields |= (flags & ~XFS_ILOG_IVERSION); if (flags & XFS_ILOG_IVERSION) flags = ((flags & ~XFS_ILOG_IVERSION) | XFS_ILOG_CORE); @@ -214,12 +213,6 @@ xfs_inode_item_precommit( spin_unlock(&iip->ili_lock); xfs_inode_item_precommit_check(ip); - - /* - * We are done with the log item transaction dirty state, so clear it so - * that it doesn't pollute future transactions. - */ - iip->ili_dirty_flags = 0; return 0; } @@ -729,13 +722,24 @@ xfs_inode_item_unpin( struct xfs_log_item *lip, int remove) { - struct xfs_inode *ip = INODE_ITEM(lip)->ili_inode; + struct xfs_inode_log_item *iip = INODE_ITEM(lip); + struct xfs_inode *ip = iip->ili_inode; trace_xfs_inode_unpin(ip, _RET_IP_); ASSERT(lip->li_buf || xfs_iflags_test(ip, XFS_ISTALE)); ASSERT(atomic_read(&ip->i_pincount) > 0); - if (atomic_dec_and_test(&ip->i_pincount)) + + /* + * If this is the last unpin, then the inode no longer needs a journal + * flush to persist it. Hence we can clear the commit sequence numbers + * as a fsync/fdatasync operation on the inode at this point is a no-op. + */ + if (atomic_dec_and_lock(&ip->i_pincount, &iip->ili_lock)) { + iip->ili_commit_seq = 0; + iip->ili_datasync_seq = 0; + spin_unlock(&iip->ili_lock); wake_up_bit(&ip->i_flags, __XFS_IPINNED_BIT); + } } STATIC uint @@ -863,12 +867,45 @@ xfs_inode_item_committed( return lsn; } +/* + * The modification is now complete, so before we unlock the inode we need to + * update the commit sequence numbers for data integrity journal flushes. We + * always record the commit sequence number (ili_commit_seq) so that anything + * that needs a full journal sync will capture all of this modification. + * + * We then + * check if the changes will impact a datasync (O_DSYNC) journal flush. If the + * changes will require a datasync flush, then we also record the sequence in + * ili_datasync_seq. 
+ * + * These commit sequence numbers will get cleared atomically with the inode being + * unpinned (i.e. pin count goes to zero), and so it will only be set when the + * inode is dirty in the journal. This removes the need for checking if the + * inode is pinned to determine if a journal flush is necessary, and hence + * removes the need for holding the ILOCK_SHARED in xfs_file_fsync() to + * serialise pin counts against commit sequence number updates. + * + */ STATIC void xfs_inode_item_committing( struct xfs_log_item *lip, xfs_csn_t seq) { - INODE_ITEM(lip)->ili_commit_seq = seq; + struct xfs_inode_log_item *iip = INODE_ITEM(lip); + + spin_lock(&iip->ili_lock); + iip->ili_commit_seq = seq; + if (iip->ili_dirty_flags & ~(XFS_ILOG_IVERSION | XFS_ILOG_TIMESTAMP)) + iip->ili_datasync_seq = seq; + spin_unlock(&iip->ili_lock); + + /* + * Clear the per-transaction dirty flags now that we have finished + * recording the transaction's inode modifications in the CIL and are + * about to release and (maybe) unlock the inode. + */ + iip->ili_dirty_flags = 0; + return xfs_inode_item_release(lip); } @@ -1060,7 +1097,6 @@ xfs_iflush_abort_clean( { iip->ili_last_fields = 0; iip->ili_fields = 0; - iip->ili_fsync_fields = 0; iip->ili_flush_lsn = 0; iip->ili_item.li_buf = NULL; list_del_init(&iip->ili_item.li_bio_list); diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h index ba92ce11a011..2ddcca41714f 100644 --- a/fs/xfs/xfs_inode_item.h +++ b/fs/xfs/xfs_inode_item.h @@ -32,9 +32,17 @@ struct xfs_inode_log_item { spinlock_t ili_lock; /* flush state lock */ unsigned int ili_last_fields; /* fields when flushed */ unsigned int ili_fields; /* fields to be logged */ - unsigned int ili_fsync_fields; /* logged since last fsync */ xfs_lsn_t ili_flush_lsn; /* lsn at last flush */ + + /* + * We record the sequence number for every inode modification, as + * well as those that only require fdatasync operations for data + * integrity. 
This allows optimisation of the O_DSYNC/fdatasync path + * without needing to track what modifications the journal is currently + * carrying for the inode. These are protected by the above ili_lock. + */ xfs_csn_t ili_commit_seq; /* last transaction commit */ + xfs_csn_t ili_datasync_seq; /* for datasync optimisation */ }; static inline int xfs_inode_clean(struct xfs_inode *ip) diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c index 2a74f2957341..f8c925220005 100644 --- a/fs/xfs/xfs_iomap.c +++ b/fs/xfs/xfs_iomap.c @@ -149,9 +149,18 @@ xfs_bmbt_to_iomap( iomap->bdev = target->bt_bdev; iomap->flags = iomap_flags; - if (xfs_ipincount(ip) && - (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP)) - iomap->flags |= IOMAP_F_DIRTY; + /* + * If the inode is dirty for datasync purposes, let iomap know so it + * doesn't elide the IO completion journal flushes on O_DSYNC IO. + */ + if (ip->i_itemp) { + struct xfs_inode_log_item *iip = ip->i_itemp; + + spin_lock(&iip->ili_lock); + if (iip->ili_datasync_seq) + iomap->flags |= IOMAP_F_DIRTY; + spin_unlock(&iip->ili_lock); + } iomap->validity_cookie = sequence_cookie; return 0; ^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH RFC] xfs: Don't hold XFS_ILOCK_SHARED over log force during fsync 2025-09-16 6:15 ` Dave Chinner @ 2025-09-16 13:32 ` Jan Kara 2025-09-16 21:32 ` Dave Chinner 0 siblings, 1 reply; 11+ messages in thread From: Jan Kara @ 2025-09-16 13:32 UTC (permalink / raw) To: Dave Chinner; +Cc: Jan Kara, linux-xfs [-- Attachment #1: Type: text/plain, Size: 20016 bytes --] On Tue 16-09-25 16:15:32, Dave Chinner wrote: > On Thu, Sep 11, 2025 at 10:59:15AM +1000, Dave Chinner wrote: > > i.e. if we clear the commit sequences on last unpin (i.e. in > > xfs_inode_item_unpin) then an item that is not in the CIL (and so > > doesn't have dirty metadata) will have no associated commit > > sequence number set. > > > > Hence if ili_datasync_commit_seq is non-zero, then by definition the > > inode must be pinned and has been dirtied for datasync purposes. > > That means we can simply query ili_datasync_commit_seq in > > xfs_bmbt_to_iomap() to set IOMAP_F_DIRTY. > > > > I suspect that the above fsync code can then become: > > > > spin_lock(&iip->ili_lock); > > if (datasync) > > seq = iip->ili_datasync_commit_seq; > > else > > seq = iip->ili_commit_seq; > > spin_unlock(&iip->ili_lock); > > > > if (!seq) > > return 0; > > return xfs_log_force_seq(ip->i_mount, seq, XFS_LOG_SYNC, log_flushed); > > > > For the same reason. i.e. a non-zero sequence number implies the > > inode log item is dirty in the CIL and pinned. > > > > At this point, we really don't care about races with transaction > > commits. f(data)sync should only wait for modifications that have > > been fully completed. If they haven't set the commit sequence in the > > log item, they haven't fully completed. If the commit sequence is > > already set, the the CIL push will co-ordinate appropriately with > > commits to ensure correct data integrity behaviour occurs. > > > > Hence I think that if we tie the sequence number clearing to the > > inode being removed from the CIL (i.e. 
last unpin) we can drop all > > the pin checks and use the commit sequence numbers directly to > > determine what the correct behaviour should be... > > Here's a patch that implements this. It appears to pass fstests > without any regressions on my test VMs. Can you test it and check > that it retains the expected performance improvement for > O_DSYNC+DIO on fallocate()d space? Heh, I just wanted to send my version of the patch after all the tests passed :). Anyway, I've given your patch a spin with the test I have and its performance looks good. So feel free to add: Tested-by: Jan Kara <jack@suse.cz> BTW I don't have a customer setup with DB2 available where the huge difference is visible (I'll send them a backport of the patch to their SUSE kernel once we settle on it) but I have written a tool that replays the same set of pwrites from the same set of threads I've captured from a syscall trace. It reproduces only about a 20% difference between good & bad kernels on my test machine but it was good enough for the bisection and analysis and the customer confirmed that the revert of what I've bisected to actually fixes their issue (rwsem reader lockstealing logic). So I'm reasonably confident I'm really reproducing their issue. BTW, your patch seems to be about 2% slower on average than what I have written and somewhat more variable. It may be just bad luck but I suspect it may be related to the fact that I ended up using READ_ONCE for reads of ili_commit_seq and ili_datasync_commit_seq while you use ili_lock. So I just wanted to suggest that as a possible optimization (my patch attached for reference). But regardless of whether you do the change or not I think the patch is good to go. Honza > xfs: rework datasync tracking and execution > > From: Dave Chinner <dchinner@redhat.com> > > Jan Kara reported that the shared ILOCK held across the journal > flush during fdatasync operations slows down O_DSYNC DIO on > unwritten extents significantly.
The underlying issue is that > unwritten extent conversion needs the ILOCK exclusive, whilst the > datasync operation after the extent conversion holds it shared. > > Hence we cannot be flushing the journal for one IO completion whilst > at the same time doing unwritten extent conversion on another IO > completion on the same inode. THis means that IO completions > lock-step, and IO performance is dependent on the journal flush > latency. > > Jan demostrated that reducing the ifdatasync lock hold time can > improve O_DSYNC DIO to unwritten extents performance by 2.5x. > Discussion on that patch found issues with the method, and we > came to the conclusion that seperately tracking datasync flush > seqeunces was the best approach to solving the problem. > > The fsync code uses the ILOCK to serialise against concurrent > modifications in the transaction commit phase. In a transaction > commit, there are several disjoint updates to inode log item state > that need to be considered atomically by the fsync code. These > operations are allo done under ILOCK_EXCL context: > > 1. ili_fsync_flags is updated in ->iop_precommit > 2. i_pincount is updated in ->iop_pin before it is added to the CIL > 3. ili_commit_seq is updated in ->iop_committing, after it has been > added to the CIL > > In fsync, we need to: > > 1. check that the inode is dirty in the journal (ipincount) > 2. check that ili_fsync_flags is set > 3. grab the ili_commit_seq if a journal flush is needed > 4. clear the ili_fsync_flags to ensure that new modifications that > require fsync are tracked in ->iop_precommit correctly > > The serialisation of ipincount/ili_commit_seq is needed > to ensure that we don't try to unnecessarily flush the journal. > > The serialisation of ili_fsync_flags being set in > ->iop_precommit and cleared in fsync post journal flush is > required for correctness. > > Hence holding the ILOCK_SHARED in xfs_file_fsync() performs all this > serialisation for us. 
Ideally, we want to remove the need to hold > the ILOCK_SHARED in xfs_file_fsync() for best performance. > > We start with the observation that fsync/fdatasync() only need to > wait for operations that have been completed. Hence operations that > are still being committed have not completed and datasync operations > do not need to wait for them. > > This means we can use a single point in time in the commit process > to signal "this modification is complete". This is what > ->iop_committing is supposed to provide - it is the point at which > the object is unlocked after the modification has been recorded in > the CIL. Hence we could use ili_commit_seq to determine if we should > flush the journal. > > In theory, we can already do this. However, in practice this will > expose an internal global CIL lock to the IO path. The ipincount() > checks optimise away the need to take this lock - if the inode is > not pinned, then it is not in the CIL and we don't need to check if > a journal flush at ili_commit_seq needs to be performed. > > The reason this is needed is that the ili_commit_seq is never > cleared. Once it is set, it remains set even once the journal has > been committed and the object has been unpinned. Hence we have to > look that journal internal commit sequence state to determine if > ili_commit_seq needs to be acted on or not. > > We can solve this by clearing ili_commit_seq when the inode is > unpinned. If we clear it atomically with the last unpin going away, > then we are guaranteed that new modifications will order correctly > as they add a new pin counts and we won't clear a sequence number > for an active modification in the CIL. > > Further, we can then allow the per-transaction flag state to > propagate into ->iop_committing (instead of clearing it in > ->iop_precommit) and that will allow us to determine if the > modification needs a full fsync or just a datasync, and so we can > record a separate datasync sequence number (Jan's idea!) 
and then > use that in the fdatasync path instead of the full fsync sequence > number. > > With this infrastructure in place, we no longer need the > ILOCK_SHARED in the fsync path. All serialisation is done against > the commit sequence numbers - if the sequence number is set, then we > have to flush the journal. If it is not set, then we have nothing to > do. This greatly simplifies the fsync implementation.... > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > --- > fs/xfs/xfs_file.c | 75 ++++++++++++++++++++++--------------------------- > fs/xfs/xfs_inode.c | 25 +++++++++++------ > fs/xfs/xfs_inode_item.c | 58 ++++++++++++++++++++++++++++++-------- > fs/xfs/xfs_inode_item.h | 10 ++++++- > fs/xfs/xfs_iomap.c | 15 ++++++++-- > 5 files changed, 119 insertions(+), 64 deletions(-) > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c > index f96fbf5c54c9..2702fef2c90c 100644 > --- a/fs/xfs/xfs_file.c > +++ b/fs/xfs/xfs_file.c > @@ -75,52 +75,47 @@ xfs_dir_fsync( > return xfs_log_force_inode(ip); > } > > -static xfs_csn_t > -xfs_fsync_seq( > - struct xfs_inode *ip, > - bool datasync) > -{ > - if (!xfs_ipincount(ip)) > - return 0; > - if (datasync && !(ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP)) > - return 0; > - return ip->i_itemp->ili_commit_seq; > -} > - > /* > - * All metadata updates are logged, which means that we just have to flush the > - * log up to the latest LSN that touched the inode. > + * All metadata updates are logged, which means that we just have to push the > + * journal to the required sequence number than holds the updates. We track > + * datasync commits separately to full sync commits, and hence only need to > + * select the correct sequence number for the log force here. > * > - * If we have concurrent fsync/fdatasync() calls, we need them to all block on > - * the log force before we clear the ili_fsync_fields field. 
This ensures that > - * we don't get a racing sync operation that does not wait for the metadata to > - * hit the journal before returning. If we race with clearing ili_fsync_fields, > - * then all that will happen is the log force will do nothing as the lsn will > - * already be on disk. We can't race with setting ili_fsync_fields because that > - * is done under XFS_ILOCK_EXCL, and that can't happen because we hold the lock > - * shared until after the ili_fsync_fields is cleared. > + * We don't have to serialise against concurrent modifications, as we do not > + * have to wait for modifications that have not yet completed. We define a > + * transaction commit as completing when the commit sequence number is updated, > + * hence if the sequence number has not updated, the sync operation has been > + * run before the commit completed and we don't have to wait for it. > + * > + * If we have concurrent fsync/fdatasync() calls, the sequence numbers remain > + * set on the log item until - at least - the journal flush completes. In > + * reality, they are only cleared when the inode is fully unpinned (i.e. > + * persistent in the journal and not dirty in the CIL), and so we rely on > + * xfs_log_force_seq() either skipping sequences that have been persisted or > + * waiting on sequences that are still in flight to correctly order concurrent > + * sync operations. 
> */ > -static int > +static int > xfs_fsync_flush_log( > struct xfs_inode *ip, > bool datasync, > int *log_flushed) > { > - int error = 0; > - xfs_csn_t seq; > + struct xfs_inode_log_item *iip = ip->i_itemp; > + xfs_csn_t seq = 0; > > - xfs_ilock(ip, XFS_ILOCK_SHARED); > - seq = xfs_fsync_seq(ip, datasync); > - if (seq) { > - error = xfs_log_force_seq(ip->i_mount, seq, XFS_LOG_SYNC, > + spin_lock(&iip->ili_lock); > + if (datasync) > + seq = iip->ili_datasync_seq; > + else > + seq = iip->ili_commit_seq; > + spin_unlock(&iip->ili_lock); > + > + if (!seq) > + return 0; > + > + return xfs_log_force_seq(ip->i_mount, seq, XFS_LOG_SYNC, > log_flushed); > - > - spin_lock(&ip->i_itemp->ili_lock); > - ip->i_itemp->ili_fsync_fields = 0; > - spin_unlock(&ip->i_itemp->ili_lock); > - } > - xfs_iunlock(ip, XFS_ILOCK_SHARED); > - return error; > } > > STATIC int > @@ -158,12 +153,10 @@ xfs_file_fsync( > error = blkdev_issue_flush(mp->m_ddev_targp->bt_bdev); > > /* > - * Any inode that has dirty modifications in the log is pinned. The > - * racy check here for a pinned inode will not catch modifications > - * that happen concurrently to the fsync call, but fsync semantics > - * only require to sync previously completed I/O. > + * If the inode has an inode log item attached, it may need the journal > + * flushed to persist any changes the log item might be tracking. 
> */ > - if (xfs_ipincount(ip)) { > + if (ip->i_itemp) { > err2 = xfs_fsync_flush_log(ip, datasync, &log_flushed); > if (err2 && !error) > error = err2; > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c > index 0ddb9ce0f5e3..b5619ed5667b 100644 > --- a/fs/xfs/xfs_inode.c > +++ b/fs/xfs/xfs_inode.c > @@ -1667,7 +1667,6 @@ xfs_ifree_mark_inode_stale( > spin_lock(&iip->ili_lock); > iip->ili_last_fields = iip->ili_fields; > iip->ili_fields = 0; > - iip->ili_fsync_fields = 0; > spin_unlock(&iip->ili_lock); > ASSERT(iip->ili_last_fields); > > @@ -1832,12 +1831,20 @@ static void > xfs_iunpin( > struct xfs_inode *ip) > { > - xfs_assert_ilocked(ip, XFS_ILOCK_EXCL | XFS_ILOCK_SHARED); > + struct xfs_inode_log_item *iip = ip->i_itemp; > + xfs_csn_t seq = 0; > > trace_xfs_inode_unpin_nowait(ip, _RET_IP_); > + xfs_assert_ilocked(ip, XFS_ILOCK_EXCL | XFS_ILOCK_SHARED); > + > + spin_lock(&iip->ili_lock); > + seq = iip->ili_commit_seq; > + spin_unlock(&iip->ili_lock); > + if (!seq) > + return; > > /* Give the log a push to start the unpinning I/O */ > - xfs_log_force_seq(ip->i_mount, ip->i_itemp->ili_commit_seq, 0, NULL); > + xfs_log_force_seq(ip->i_mount, seq, 0, NULL); > > } > > @@ -2506,7 +2513,6 @@ xfs_iflush( > spin_lock(&iip->ili_lock); > iip->ili_last_fields = iip->ili_fields; > iip->ili_fields = 0; > - iip->ili_fsync_fields = 0; > set_bit(XFS_LI_FLUSHING, &iip->ili_item.li_flags); > spin_unlock(&iip->ili_lock); > > @@ -2665,12 +2671,15 @@ int > xfs_log_force_inode( > struct xfs_inode *ip) > { > + struct xfs_inode_log_item *iip = ip->i_itemp; > xfs_csn_t seq = 0; > > - xfs_ilock(ip, XFS_ILOCK_SHARED); > - if (xfs_ipincount(ip)) > - seq = ip->i_itemp->ili_commit_seq; > - xfs_iunlock(ip, XFS_ILOCK_SHARED); > + if (!iip) > + return 0; > + > + spin_lock(&iip->ili_lock); > + seq = iip->ili_commit_seq; > + spin_unlock(&iip->ili_lock); > > if (!seq) > return 0; > diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c > index afb6cadf7793..83b94b437696 100644 > --- 
a/fs/xfs/xfs_inode_item.c > +++ b/fs/xfs/xfs_inode_item.c > @@ -153,7 +153,6 @@ xfs_inode_item_precommit( > * (ili_fields) correctly tracks that the version has changed. > */ > spin_lock(&iip->ili_lock); > - iip->ili_fsync_fields |= (flags & ~XFS_ILOG_IVERSION); > if (flags & XFS_ILOG_IVERSION) > flags = ((flags & ~XFS_ILOG_IVERSION) | XFS_ILOG_CORE); > > @@ -214,12 +213,6 @@ xfs_inode_item_precommit( > spin_unlock(&iip->ili_lock); > > xfs_inode_item_precommit_check(ip); > - > - /* > - * We are done with the log item transaction dirty state, so clear it so > - * that it doesn't pollute future transactions. > - */ > - iip->ili_dirty_flags = 0; > return 0; > } > > @@ -729,13 +722,24 @@ xfs_inode_item_unpin( > struct xfs_log_item *lip, > int remove) > { > - struct xfs_inode *ip = INODE_ITEM(lip)->ili_inode; > + struct xfs_inode_log_item *iip = INODE_ITEM(lip); > + struct xfs_inode *ip = iip->ili_inode; > > trace_xfs_inode_unpin(ip, _RET_IP_); > ASSERT(lip->li_buf || xfs_iflags_test(ip, XFS_ISTALE)); > ASSERT(atomic_read(&ip->i_pincount) > 0); > - if (atomic_dec_and_test(&ip->i_pincount)) > + > + /* > + * If this is the last unpin, then the inode no longer needs a journal > + * flush to persist it. Hence we can clear the commit sequence numbers > + * as a fsync/fdatasync operation on the inode at this point is a no-op. > + */ > + if (atomic_dec_and_lock(&ip->i_pincount, &iip->ili_lock)) { > + iip->ili_commit_seq = 0; > + iip->ili_datasync_seq = 0; > + spin_unlock(&iip->ili_lock); > wake_up_bit(&ip->i_flags, __XFS_IPINNED_BIT); > + } > } > > STATIC uint > @@ -863,12 +867,45 @@ xfs_inode_item_committed( > return lsn; > } > > +/* > + * The modification is now complete, so before we unlock the inode we need to > + * update the commit sequence numbers for data integrity journal flushes. We > + * always record the commit sequence number (ili_commit_seq) so that anything > + * that needs a full journal sync will capture all of this modification. 
> + * > + * We then > + * check if the changes will impact a datasync (O_DSYNC) journal flush. If the > + * changes will require a datasync flush, then we also record the sequence in > + * ili_datasync_seq. > + * > + * These commit sequence numbers will get cleared atomically with the inode being > + * unpinned (i.e. pin count goes to zero), and so they will only be set when the > + * inode is dirty in the journal. This removes the need for checking if the > + * inode is pinned to determine if a journal flush is necessary, and hence > + * removes the need for holding the ILOCK_SHARED in xfs_file_fsync() to > + * serialise pin counts against commit sequence number updates. > + * > + */ > STATIC void > xfs_inode_item_committing( > struct xfs_log_item *lip, > xfs_csn_t seq) > { > - INODE_ITEM(lip)->ili_commit_seq = seq; > + struct xfs_inode_log_item *iip = INODE_ITEM(lip); > + > + spin_lock(&iip->ili_lock); > + iip->ili_commit_seq = seq; > + if (iip->ili_dirty_flags & ~(XFS_ILOG_IVERSION | XFS_ILOG_TIMESTAMP)) > + iip->ili_datasync_seq = seq; > + spin_unlock(&iip->ili_lock); > + > + /* > + * Clear the per-transaction dirty flags now that we have finished > + * recording the transaction's inode modifications in the CIL and are > + * about to release and (maybe) unlock the inode. 
> + */ > + iip->ili_dirty_flags = 0; > + > return xfs_inode_item_release(lip); > } > > @@ -1060,7 +1097,6 @@ xfs_iflush_abort_clean( > { > iip->ili_last_fields = 0; > iip->ili_fields = 0; > - iip->ili_fsync_fields = 0; > iip->ili_flush_lsn = 0; > iip->ili_item.li_buf = NULL; > list_del_init(&iip->ili_item.li_bio_list); > diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h > index ba92ce11a011..2ddcca41714f 100644 > --- a/fs/xfs/xfs_inode_item.h > +++ b/fs/xfs/xfs_inode_item.h > @@ -32,9 +32,17 @@ struct xfs_inode_log_item { > spinlock_t ili_lock; /* flush state lock */ > unsigned int ili_last_fields; /* fields when flushed */ > unsigned int ili_fields; /* fields to be logged */ > - unsigned int ili_fsync_fields; /* logged since last fsync */ > xfs_lsn_t ili_flush_lsn; /* lsn at last flush */ > + > + /* > + * We record the sequence number for every inode modification, as > + * well as those that only require fdatasync operations for data > + * integrity. This allows optimisation of the O_DSYNC/fdatasync path > + * without needing to track what modifications the journal is currently > + * carrying for the inode. These are protected by the above ili_lock. > + */ > xfs_csn_t ili_commit_seq; /* last transaction commit */ > + xfs_csn_t ili_datasync_seq; /* for datasync optimisation */ > }; > > static inline int xfs_inode_clean(struct xfs_inode *ip) > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c > index 2a74f2957341..f8c925220005 100644 > --- a/fs/xfs/xfs_iomap.c > +++ b/fs/xfs/xfs_iomap.c > @@ -149,9 +149,18 @@ xfs_bmbt_to_iomap( > iomap->bdev = target->bt_bdev; > iomap->flags = iomap_flags; > > - if (xfs_ipincount(ip) && > - (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP)) > - iomap->flags |= IOMAP_F_DIRTY; > + /* > + * If the inode is dirty for datasync purposes, let iomap know so it > + * doesn't elide the IO completion journal flushes on O_DSYNC IO. 
> + */ > + if (ip->i_itemp) { > + struct xfs_inode_log_item *iip = ip->i_itemp; > + > + spin_lock(&iip->ili_lock); > + if (iip->ili_datasync_seq) > + iomap->flags |= IOMAP_F_DIRTY; > + spin_unlock(&iip->ili_lock); > + } > > iomap->validity_cookie = sequence_cookie; > return 0; -- Jan Kara <jack@suse.com> SUSE Labs, CR [-- Attachment #2: 0001-xfs-Don-t-hold-XFS_ILOCK_SHARED-over-log-force-durin.patch --] [-- Type: text/x-patch, Size: 9723 bytes --] From eeb10cf33fe01491a037fb184ed5872a227fe39e Mon Sep 17 00:00:00 2001 From: Jan Kara <jack@suse.cz> Date: Thu, 31 Jul 2025 11:24:01 +0200 Subject: [PATCH] xfs: Don't hold XFS_ILOCK_SHARED over log force during fsync Holding XFS_ILOCK_SHARED over log force in xfs_fsync_flush_log() significantly increases contention on ILOCK for O_DSYNC | O_DIRECT writes to file preallocated with fallocate (thus DIO happens to unwritten extents and we need ILOCK in exclusive mode for timestamp modifications and extent conversions). But holding ILOCK over the log force doesn't seem strictly necessary for correctness. We are just using it for a mechanism to make sure parallel fsyncs all wait for log force to complete but that can be also achieved without holding ILOCK. We introduce a new mechanism which tracks for each inode the sequence number of the last transaction that is relevant for fdatasync(2) and use that instead of maintaining ili_fsync_fields. With this patch DB2 database restore operation speeds up by a factor of about 2.5x in a VM with 4 CPUs, 16GB of RAM and NVME SSD as a backing store. 
Signed-off-by: Jan Kara <jack@suse.cz> --- fs/xfs/xfs_file.c | 51 +++++++++++-------------------- fs/xfs/xfs_inode.c | 2 -- fs/xfs/xfs_inode_item.c | 68 +++++++++++++++++++++++++++++------------ fs/xfs/xfs_inode_item.h | 3 +- fs/xfs/xfs_iomap.c | 2 +- 5 files changed, 68 insertions(+), 58 deletions(-) diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index f96fbf5c54c9..c7beee51a1bf 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -75,30 +75,10 @@ xfs_dir_fsync( return xfs_log_force_inode(ip); } -static xfs_csn_t -xfs_fsync_seq( - struct xfs_inode *ip, - bool datasync) -{ - if (!xfs_ipincount(ip)) - return 0; - if (datasync && !(ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP)) - return 0; - return ip->i_itemp->ili_commit_seq; -} - /* * All metadata updates are logged, which means that we just have to flush the - * log up to the latest LSN that touched the inode. - * - * If we have concurrent fsync/fdatasync() calls, we need them to all block on - * the log force before we clear the ili_fsync_fields field. This ensures that - * we don't get a racing sync operation that does not wait for the metadata to - * hit the journal before returning. If we race with clearing ili_fsync_fields, - * then all that will happen is the log force will do nothing as the lsn will - * already be on disk. We can't race with setting ili_fsync_fields because that - * is done under XFS_ILOCK_EXCL, and that can't happen because we hold the lock - * shared until after the ili_fsync_fields is cleared. + * log up to the latest LSN that modified the inode metadata relevant for the + * fsync/fdatasync(). 
*/ static int xfs_fsync_flush_log( @@ -106,21 +86,24 @@ xfs_fsync_flush_log( bool datasync, int *log_flushed) { - int error = 0; + struct xfs_inode_log_item *iip = ip->i_itemp; xfs_csn_t seq; - xfs_ilock(ip, XFS_ILOCK_SHARED); - seq = xfs_fsync_seq(ip, datasync); - if (seq) { - error = xfs_log_force_seq(ip->i_mount, seq, XFS_LOG_SYNC, - log_flushed); + /* + * We only care about IO user knows has completed before calling fsync. + * Thus we don't care about races with currently running transactions + * possibly updating the sequence numbers. They can only make us force + * a later seq than strictly needed. + */ + if (datasync) + seq = READ_ONCE(iip->ili_datasync_commit_seq); + else + seq = READ_ONCE(iip->ili_commit_seq); - spin_lock(&ip->i_itemp->ili_lock); - ip->i_itemp->ili_fsync_fields = 0; - spin_unlock(&ip->i_itemp->ili_lock); - } - xfs_iunlock(ip, XFS_ILOCK_SHARED); - return error; + if (!seq) + return 0; + + return xfs_log_force_seq(ip->i_mount, seq, XFS_LOG_SYNC, log_flushed); } STATIC int diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index 9c39251961a3..209b8aba6238 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -1656,7 +1656,6 @@ xfs_ifree_mark_inode_stale( spin_lock(&iip->ili_lock); iip->ili_last_fields = iip->ili_fields; iip->ili_fields = 0; - iip->ili_fsync_fields = 0; spin_unlock(&iip->ili_lock); ASSERT(iip->ili_last_fields); @@ -2502,7 +2501,6 @@ xfs_iflush( spin_lock(&iip->ili_lock); iip->ili_last_fields = iip->ili_fields; iip->ili_fields = 0; - iip->ili_fsync_fields = 0; set_bit(XFS_LI_FLUSHING, &iip->ili_item.li_flags); spin_unlock(&iip->ili_lock); diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c index 829675700fcd..2a90e156b072 100644 --- a/fs/xfs/xfs_inode_item.c +++ b/fs/xfs/xfs_inode_item.c @@ -145,18 +145,7 @@ xfs_inode_item_precommit( flags |= XFS_ILOG_CORE; } - /* - * Record the specific change for fdatasync optimisation. 
This allows - * fdatasync to skip log forces for inodes that are only timestamp - * dirty. Once we've processed the XFS_ILOG_IVERSION flag, convert it - * to XFS_ILOG_CORE so that the actual on-disk dirty tracking - * (ili_fields) correctly tracks that the version has changed. - */ spin_lock(&iip->ili_lock); - iip->ili_fsync_fields |= (flags & ~XFS_ILOG_IVERSION); - if (flags & XFS_ILOG_IVERSION) - flags = ((flags & ~XFS_ILOG_IVERSION) | XFS_ILOG_CORE); - /* * Inode verifiers do not check that the CoW extent size hint is an * integer multiple of the rt extent size on a directory with both @@ -204,6 +193,23 @@ xfs_inode_item_precommit( xfs_trans_brelse(tp, bp); } + /* + * Set the transaction dirty state we've created back in inode item + * before mangling flags for storing on disk. We use the value later in + * xfs_inode_item_committing() to determine whether the transaction is + * relevant for fdatasync or not. ili_dirty_flags gets cleared in + * xfs_trans_ijoin() before adding inode to the next transaction. + */ + iip->ili_dirty_flags = flags; + + /* + * Now convert XFS_ILOG_IVERSION flag to XFS_ILOG_CORE so that the + * actual on-disk dirty tracking (ili_fields) correctly tracks that the + * version has changed. + */ + if (flags & XFS_ILOG_IVERSION) + flags = ((flags & ~XFS_ILOG_IVERSION) | XFS_ILOG_CORE); + /* * Always OR in the bits from the ili_last_fields field. This is to * coordinate with the xfs_iflush() and xfs_buf_inode_iodone() routines @@ -215,11 +221,6 @@ xfs_inode_item_precommit( xfs_inode_item_precommit_check(ip); - /* - * We are done with the log item transaction dirty state, so clear it so - * that it doesn't pollute future transactions. 
- */ - iip->ili_dirty_flags = 0; return 0; } @@ -729,13 +730,26 @@ xfs_inode_item_unpin( struct xfs_log_item *lip, int remove) { - struct xfs_inode *ip = INODE_ITEM(lip)->ili_inode; + struct xfs_inode_log_item *iip = INODE_ITEM(lip); + struct xfs_inode *ip = iip->ili_inode; trace_xfs_inode_unpin(ip, _RET_IP_); ASSERT(lip->li_buf || xfs_iflags_test(ip, XFS_ISTALE)); ASSERT(atomic_read(&ip->i_pincount) > 0); - if (atomic_dec_and_test(&ip->i_pincount)) + /* + * We can race with new commit pinning inode again and then + * xfs_inode_item_committing() setting fsync sequence numbers. Hence we + * have to grab ili_lock if dropping the last inode pin to be sure we + * either see i_pincount increment from the new commit or + * xfs_inode_item_committing() will wait with its setting of sequence + * numbers. + */ + if (atomic_dec_and_lock(&ip->i_pincount, &iip->ili_lock)) { + WRITE_ONCE(iip->ili_commit_seq, 0); + WRITE_ONCE(iip->ili_datasync_commit_seq, 0); + spin_unlock(&iip->ili_lock); wake_up_bit(&ip->i_flags, __XFS_IPINNED_BIT); + } } STATIC uint @@ -863,7 +877,22 @@ xfs_inode_item_committing( struct xfs_log_item *lip, xfs_csn_t seq) { - INODE_ITEM(lip)->ili_commit_seq = seq; + struct xfs_inode_log_item *iip = INODE_ITEM(lip); + + /* + * ili_lock protects us from races with xfs_inode_item_unpin(). See + * comment in that function for details. + */ + spin_lock(&iip->ili_lock); + WRITE_ONCE(iip->ili_commit_seq, seq); + /* + * Record the specific sequence for fdatasync optimisation. This allows + * fdatasync to skip log forces for inodes that are only timestamp + * dirty. 
+ */ + if (iip->ili_dirty_flags & ~(XFS_ILOG_IVERSION | XFS_ILOG_TIMESTAMP)) + WRITE_ONCE(iip->ili_datasync_commit_seq, seq); + spin_unlock(&iip->ili_lock); return xfs_inode_item_release(lip); } @@ -1055,7 +1084,6 @@ xfs_iflush_abort_clean( { iip->ili_last_fields = 0; iip->ili_fields = 0; - iip->ili_fsync_fields = 0; iip->ili_flush_lsn = 0; iip->ili_item.li_buf = NULL; list_del_init(&iip->ili_item.li_bio_list); diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h index ba92ce11a011..bdcd4b6ebba4 100644 --- a/fs/xfs/xfs_inode_item.h +++ b/fs/xfs/xfs_inode_item.h @@ -32,9 +32,10 @@ struct xfs_inode_log_item { spinlock_t ili_lock; /* flush state lock */ unsigned int ili_last_fields; /* fields when flushed */ unsigned int ili_fields; /* fields to be logged */ - unsigned int ili_fsync_fields; /* logged since last fsync */ xfs_lsn_t ili_flush_lsn; /* lsn at last flush */ xfs_csn_t ili_commit_seq; /* last transaction commit */ + /* last transaction commit with changes relevant for fdatasync */ + xfs_csn_t ili_datasync_commit_seq; }; static inline int xfs_inode_clean(struct xfs_inode *ip) diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c index 2a74f2957341..23539cc29a37 100644 --- a/fs/xfs/xfs_iomap.c +++ b/fs/xfs/xfs_iomap.c @@ -150,7 +150,7 @@ xfs_bmbt_to_iomap( iomap->flags = iomap_flags; if (xfs_ipincount(ip) && - (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP)) + READ_ONCE(ip->i_itemp->ili_datasync_commit_seq)) iomap->flags |= IOMAP_F_DIRTY; iomap->validity_cookie = sequence_cookie; -- 2.51.0 ^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH RFC] xfs: Don't hold XFS_ILOCK_SHARED over log force during fsync 2025-09-16 13:32 ` Jan Kara @ 2025-09-16 21:32 ` Dave Chinner 2025-09-17 10:07 ` Jan Kara 0 siblings, 1 reply; 11+ messages in thread From: Dave Chinner @ 2025-09-16 21:32 UTC (permalink / raw) To: Jan Kara; +Cc: linux-xfs On Tue, Sep 16, 2025 at 03:32:42PM +0200, Jan Kara wrote: > On Tue 16-09-25 16:15:32, Dave Chinner wrote: > > On Thu, Sep 11, 2025 at 10:59:15AM +1000, Dave Chinner wrote: > > > i.e. if we clear the commit sequences on last unpin (i.e. in > > > xfs_inode_item_unpin) then an item that is not in the CIL (and so > > > doesn't have dirty metadata) will have no associated commit > > > sequence number set. > > > > > > Hence if ili_datasync_commit_seq is non-zero, then by definition the > > > inode must be pinned and has been dirtied for datasync purposes. > > > That means we can simply query ili_datasync_commit_seq in > > > xfs_bmbt_to_iomap() to set IOMAP_F_DIRTY. > > > > > > I suspect that the above fsync code can then become: > > > > > > spin_lock(&iip->ili_lock); > > > if (datasync) > > > seq = iip->ili_datasync_commit_seq; > > > else > > > seq = iip->ili_commit_seq; > > > spin_unlock(&iip->ili_lock); > > > > > > if (!seq) > > > return 0; > > > return xfs_log_force_seq(ip->i_mount, seq, XFS_LOG_SYNC, log_flushed); > > > > > > For the same reason. i.e. a non-zero sequence number implies the > > > inode log item is dirty in the CIL and pinned. > > > > > > At this point, we really don't care about races with transaction > > > commits. f(data)sync should only wait for modifications that have > > > been fully completed. If they haven't set the commit sequence in the > > > log item, they haven't fully completed. If the commit sequence is > > > already set, the CIL push will co-ordinate appropriately with > > > commits to ensure correct data integrity behaviour occurs. 
> > > > > > Hence I think that if we tie the sequence number clearing to the > > > inode being removed from the CIL (i.e. last unpin) we can drop all > > > the pin checks and use the commit sequence numbers directly to > > > determine what the correct behaviour should be... > > > > Here's a patch that implements this. It appears to pass fstests > > without any regressions on my test VMs. Can you test it and check > > that it retains the expected performance improvement for > > O_DSYNC+DIO on fallocate()d space? > > Heh, I just wanted to send my version of the patch after all the tests > passed :). Anyway, I've given your patch a spin with the test I have and > its performance looks good. So feel free to add: > > Tested-by: Jan Kara <jack@suse.cz> Thanks! > BTW I don't have customer setup with DB2 available where the huge > difference is visible (I'll send them backport of the patch to their SUSE > kernel once we settle on it) but I have written a tool that replays the > same set of pwrites from same set of threads I've captured from syscall > trace. It reproduces only about 20% difference between good & bad kernels > on my test machine but it was good enough for the bisection and analysis > and the customer confirmed that the revert of what I've bisected to > actually fixes their issue (rwsem reader lockstealing logic). It was also recently bisected on RHEL 8.x to the introduction of rwsem spin-on-owner changes from back in 2019. Might be the same commit you are talking about, but either way it's an indication of rwsem lock contention rather than a problem with the rwsems themselves. > So I'm > reasonably confident I'm really reproducing their issue. Ok, that's good to know. I was thinking that maybe a fio recipe might show it up, too, but I'm not sure about that nor do I have the time to investigate it... > BTW, your patch seems to be about 2% slower on average than what I have > written and somewhat more variable. 
It may be just a bad luck but > I suspect it may be related to the fact that I ended up using READ_ONCE for > reads of ili_commit_seq and ili_datasync_commit_seq while you use ili_lock. Possibly.... > So I just wanted to suggest that as a possible optimization (my patch > attached for reference). But regardless of whether you do the change or not > I think the patch is good to go. I was on the fence about using READ_ONCE/WRITE_ONCE. However, xfs_csn_t is 64 bit and READ_ONCE/WRITE_ONCE doesn't prevent torn reads of 64 bit variables on 32 bit platforms. A torn read of a commit sequence number will result in a transient data integrity guarantee failure, and so I decided to err on the side of caution.... ..... > diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c > index 829675700fcd..2a90e156b072 100644 > --- a/fs/xfs/xfs_inode_item.c > +++ b/fs/xfs/xfs_inode_item.c > @@ -145,18 +145,7 @@ xfs_inode_item_precommit( > flags |= XFS_ILOG_CORE; > } > > - /* > - * Record the specific change for fdatasync optimisation. This allows > - * fdatasync to skip log forces for inodes that are only timestamp > - * dirty. Once we've processed the XFS_ILOG_IVERSION flag, convert it > - * to XFS_ILOG_CORE so that the actual on-disk dirty tracking > - * (ili_fields) correctly tracks that the version has changed. > - */ > spin_lock(&iip->ili_lock); > - iip->ili_fsync_fields |= (flags & ~XFS_ILOG_IVERSION); > - if (flags & XFS_ILOG_IVERSION) > - flags = ((flags & ~XFS_ILOG_IVERSION) | XFS_ILOG_CORE); > - > /* > * Inode verifiers do not check that the CoW extent size hint is an > * integer multiple of the rt extent size on a directory with both > @@ -204,6 +193,23 @@ xfs_inode_item_precommit( > xfs_trans_brelse(tp, bp); > } > > + /* > + * Set the transaction dirty state we've created back in inode item > + * before mangling flags for storing on disk. 
We use the value later in > + * xfs_inode_item_committing() to determine whether the transaction is > + * relevant for fdatasync or not. ili_dirty_flags gets cleared in > + * xfs_trans_ijoin() before adding inode to the next transaction. > + */ > + iip->ili_dirty_flags = flags; > + > + /* > + * Now convert XFS_ILOG_IVERSION flag to XFS_ILOG_CORE so that the > + * actual on-disk dirty tracking (ili_fields) correctly tracks that the > + * version has changed. > + */ > + if (flags & XFS_ILOG_IVERSION) > + flags = ((flags & ~XFS_ILOG_IVERSION) | XFS_ILOG_CORE); > + OK, I think I might have missed this. I'll check/fix it, and send an updated version for inclusion. -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH RFC] xfs: Don't hold XFS_ILOCK_SHARED over log force during fsync 2025-09-16 21:32 ` Dave Chinner @ 2025-09-17 10:07 ` Jan Kara 2025-09-17 10:27 ` Lukas Herbolt [not found] ` <CAM4Jq_5kSfwPRiVsGD67n3ftoNPsdXOwMx0jxxQ4f8T9kcqgcw@mail.gmail.com> 0 siblings, 2 replies; 11+ messages in thread From: Jan Kara @ 2025-09-17 10:07 UTC (permalink / raw) To: Dave Chinner; +Cc: Jan Kara, linux-xfs On Wed 17-09-25 07:32:11, Dave Chinner wrote: > On Tue, Sep 16, 2025 at 03:32:42PM +0200, Jan Kara wrote: > > On Tue 16-09-25 16:15:32, Dave Chinner wrote: > > > On Thu, Sep 11, 2025 at 10:59:15AM +1000, Dave Chinner wrote: > > > > i.e. if we clear the commit sequences on last unpin (i.e. in > > > > xfs_inode_item_unpin) then an item that is not in the CIL (and so > > > > doesn't have dirty metadata) will have no associated commit > > > > sequence number set. > > > > > > > > Hence if ili_datasync_commit_seq is non-zero, then by definition the > > > > inode must be pinned and has been dirtied for datasync purposes. > > > > That means we can simply query ili_datasync_commit_seq in > > > > xfs_bmbt_to_iomap() to set IOMAP_F_DIRTY. > > > > > > > > I suspect that the above fsync code can then become: > > > > > > > > spin_lock(&iip->ili_lock); > > > > if (datasync) > > > > seq = iip->ili_datasync_commit_seq; > > > > else > > > > seq = iip->ili_commit_seq; > > > > spin_unlock(&iip->ili_lock); > > > > > > > > if (!seq) > > > > return 0; > > > > return xfs_log_force_seq(ip->i_mount, seq, XFS_LOG_SYNC, log_flushed); > > > > > > > > For the same reason. i.e. a non-zero sequence number implies the > > > > inode log item is dirty in the CIL and pinned. > > > > > > > > At this point, we really don't care about races with transaction > > > > commits. f(data)sync should only wait for modifications that have > > > > been fully completed. If they haven't set the commit sequence in the > > > > log item, they haven't fully completed. 
If the commit sequence is > > > > already set, the CIL push will co-ordinate appropriately with > > > > commits to ensure correct data integrity behaviour occurs. > > > > > > > > Hence I think that if we tie the sequence number clearing to the > > > > inode being removed from the CIL (i.e. last unpin) we can drop all > > > > the pin checks and use the commit sequence numbers directly to > > > > determine what the correct behaviour should be... > > > > > > Here's a patch that implements this. It appears to pass fstests > > > without any regressions on my test VMs. Can you test it and check > > > that it retains the expected performance improvement for > > > O_DSYNC+DIO on fallocate()d space? > > > > Heh, I just wanted to send my version of the patch after all the tests > > passed :). Anyway, I've given your patch a spin with the test I have and > > its performance looks good. So feel free to add: > > > > Tested-by: Jan Kara <jack@suse.cz> > > Thanks! > > > BTW I don't have customer setup with DB2 available where the huge > > difference is visible (I'll send them backport of the patch to their SUSE > > kernel once we settle on it) but I have written a tool that replays the > > same set of pwrites from same set of threads I've captured from syscall > > trace. It reproduces only about 20% difference between good & bad kernels > > on my test machine but it was good enough for the bisection and analysis > > and the customer confirmed that the revert of what I've bisected to > > actually fixes their issue (rwsem reader lockstealing logic). > > It was also recently bisected on RHEL 8.x to the introduction of > rwsem spin-on-owner changes from back in 2019. Might be the same > commit you are talking about, but either way it's an indication of > rwsem lock contention rather than a problem with the rwsems > themselves. Right. I've also come to the conclusion that the real problem is the too heavy use of ILOCK and not the rwsem behavior itself. Hence this patch :). 
> > So I'm > reasonably confident I'm really reproducing their issue. > > Ok, that's good to know. I was thinking that maybe a fio recipe > might show it up, too, but I'm not sure about that nor do I have the > time to investigate it... I was actually trying hard to come up with a fio recipe to reproduce this but I've failed. As you are noting in your changelog, this workload is bound by log throughput and one of the obvious differences between fast and slow kernels is that fast kernels do fewer, larger log forces while slow kernels do many tiny log forces (with obvious consequences for throughput, in particular because we tend to relog the same blocks over and over again - the slow kernels end up logging about 3x as much data in total). Now with fio the jobs were always managing to cram enough changes in one log force for the difference to not be visible. Somehow the distribution of writes among threads (and possibly their location determining how the btree gets fragmented and which blocks get logged) DB2 creates is pretty peculiar so that it makes such a big difference. > > So I just wanted to suggest that as a possible optimization (my patch > > attached for reference). But regardless of whether you do the change or not > > I think the patch is good to go. > > I was on the fence about using READ_ONCE/WRITE_ONCE. > > However, xfs_csn_t is 64 bit and READ_ONCE/WRITE_ONCE doesn't > prevent torn reads of 64 bit variables on 32 bit platforms. A torn > read of a commit sequence number will result in a transient data > integrity guarantee failure, and so I decided to err on the side of > caution.... Hum, right. I didn't think of 32-bits. 
> > diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c > > index 829675700fcd..2a90e156b072 100644 > > --- a/fs/xfs/xfs_inode_item.c > > +++ b/fs/xfs/xfs_inode_item.c > > @@ -145,18 +145,7 @@ xfs_inode_item_precommit( > > flags |= XFS_ILOG_CORE; > > } > > > > - /* > > - * Record the specific change for fdatasync optimisation. This allows > > - * fdatasync to skip log forces for inodes that are only timestamp > > - * dirty. Once we've processed the XFS_ILOG_IVERSION flag, convert it > > - * to XFS_ILOG_CORE so that the actual on-disk dirty tracking > > - * (ili_fields) correctly tracks that the version has changed. > > - */ > > spin_lock(&iip->ili_lock); > > - iip->ili_fsync_fields |= (flags & ~XFS_ILOG_IVERSION); > > - if (flags & XFS_ILOG_IVERSION) > > - flags = ((flags & ~XFS_ILOG_IVERSION) | XFS_ILOG_CORE); > > - > > /* > > * Inode verifiers do not check that the CoW extent size hint is an > > * integer multiple of the rt extent size on a directory with both > > @@ -204,6 +193,23 @@ xfs_inode_item_precommit( > > xfs_trans_brelse(tp, bp); > > } > > > > + /* > > + * Set the transaction dirty state we've created back in inode item > > + * before mangling flags for storing on disk. We use the value later in > > + * xfs_inode_item_committing() to determine whether the transaction is > > + * relevant for fdatasync or not. ili_dirty_flags gets cleared in > > + * xfs_trans_ijoin() before adding inode to the next transaction. > > + */ > > + iip->ili_dirty_flags = flags; > > + > > + /* > > + * Now convert XFS_ILOG_IVERSION flag to XFS_ILOG_CORE so that the > > + * actual on-disk dirty tracking (ili_fields) correctly tracks that the > > + * version has changed. > > + */ > > + if (flags & XFS_ILOG_IVERSION) > > + flags = ((flags & ~XFS_ILOG_IVERSION) | XFS_ILOG_CORE); > > + > > OK, I think I might have missed this. I'll check/fix it, and send an > updated version for inclusion. 
Yeah, your version may miss that we've set XFS_ILOG_CORE in flags in xfs_inode_item_precommit(). Frankly, I wasn't sure whether fdatasync() not flushing the log in these cases is fine or not, so I've just preserved the existing behavior in my patch. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH RFC] xfs: Don't hold XFS_ILOCK_SHARED over log force during fsync 2025-09-17 10:07 ` Jan Kara @ 2025-09-17 10:27 ` Lukas Herbolt [not found] ` <CAM4Jq_5kSfwPRiVsGD67n3ftoNPsdXOwMx0jxxQ4f8T9kcqgcw@mail.gmail.com> 1 sibling, 0 replies; 11+ messages in thread From: Lukas Herbolt @ 2025-09-17 10:27 UTC (permalink / raw) To: Jan Kara; +Cc: Dave Chinner, linux-xfs >> I was actually trying hard to come up with a fio recipe to reproduce this >> but I've failed. I have a similar issue with DB2 observed during index reorg/rebuild. The DB opens the files with O_DSYNC|O_DIRECT and writes with libaio, 256 x 4k writes (mostly sequential). I believe the 256 and 4k depend on the DB2 internal page size and extent size. Anyway, with the job file below I am getting around ~18MB/s without the patch and around ~75MB/s with your patch. Need to test Dave's as well. The fio file looks like this --- [global] name=fio-seq-write filename_format=/mnt/test/$jobname/$jobnum/fio.$filenum.$jobnum rw=write bs=4k direct=1 numjobs=10 time_based runtime=20 nrfiles=1 size=4G ioengine=libaio iodepth=16 iodepth_batch_submit=16 iodepth_batch_complete_min=16 iodepth_batch_complete_max=16 group_reporting=1 gtod_reduce=1 sync=dsync fadvise_hint=1 [posix] fallocate=posix stonewall --- On Wed, Sep 17, 2025 at 12:07 PM Jan Kara <jack@suse.cz> wrote: > > On Wed 17-09-25 07:32:11, Dave Chinner wrote: > > On Tue, Sep 16, 2025 at 03:32:42PM +0200, Jan Kara wrote: > > > On Tue 16-09-25 16:15:32, Dave Chinner wrote: > > > > On Thu, Sep 11, 2025 at 10:59:15AM +1000, Dave Chinner wrote: > > > > > i.e. if we clear the commit sequences on last unpin (i.e. in > > > > > xfs_inode_item_unpin) then an item that is not in the CIL (and so > > > > > doesn't have dirty metadata) will have no associated commit > > > > > sequence number set. > > > > > > > > > > Hence if ili_datasync_commit_seq is non-zero, then by definition the > > > > > inode must be pinned and has been dirtied for datasync purposes.
> > > > > That means we can simply query ili_datasync_commit_seq in > > > > > xfs_bmbt_to_iomap() to set IOMAP_F_DIRTY. > > > > > > > > > > I suspect that the above fsync code can then become: > > > > > > > > > > spin_lock(&iip->ili_lock); > > > > > if (datasync) > > > > > seq = iip->ili_datasync_commit_seq; > > > > > else > > > > > seq = iip->ili_commit_seq; > > > > > spin_unlock(&iip->ili_lock); > > > > > > > > > > if (!seq) > > > > > return 0; > > > > > return xfs_log_force_seq(ip->i_mount, seq, XFS_LOG_SYNC, log_flushed); > > > > > > > > > > For the same reason. i.e. a non-zero sequence number implies the > > > > > inode log item is dirty in the CIL and pinned. > > > > > > > > > > At this point, we really don't care about races with transaction > > > > > commits. f(data)sync should only wait for modifications that have > > > > > been fully completed. If they haven't set the commit sequence in the > > > > > log item, they haven't fully completed. If the commit sequence is > > > > > already set, the the CIL push will co-ordinate appropriately with > > > > > commits to ensure correct data integrity behaviour occurs. > > > > > > > > > > Hence I think that if we tie the sequence number clearing to the > > > > > inode being removed from the CIL (i.e. last unpin) we can drop all > > > > > the pin checks and use the commit sequence numbers directly to > > > > > determine what the correct behaviour should be... > > > > > > > > Here's a patch that implements this. It appears to pass fstests > > > > without any regressions on my test VMs. Can you test it and check > > > > that it retains the expected performance improvement for > > > > O_DSYNC+DIO on fallocate()d space? > > > > > > Heh, I just wanted to send my version of the patch after all the tests > > > passed :). Anyway, I've given your patch a spin with the test I have and > > > its performance looks good. So feel free to add: > > > > > > Tested-by: Jan Kara <jack@suse.cz> > > > > Thanks! 
> > > > > BTW I don't have customer setup with DB2 available where the huge > > > difference is visible (I'll send them backport of the patch to their SUSE > > > kernel once we settle on it) but I have written a tool that replays the > > > same set of pwrites from same set of threads I've captured from syscall > > > trace. It reproduces only about 20% difference between good & bad kernels > > > on my test machine but it was good enough for the bisection and analysis > > > and the customer confirmed that the revert of what I've bisected to > > > actually fixes their issue (rwsem reader lockstealing logic). > > > > It was also recently bisected on RHEL 8.x to the introduction of > > rwsem spin-on-owner changes from back in 2019. Might be the same > > commit you are talking about, but either way it's an indication of > > rwsem lock contention rather than a problem with the rwsems > > themselves. > > Right. I've also come to a conclusion that the real problem is the too > heavy use of ILOCK and not the rwsem behavior itself. Hence this patch :). > > > > So I'm > > > reasonably confident I'm really reproducing their issue. > > > > Ok, that's good to know. I was thinking that maybe a fio recipe > > might show it up, too, but I'm not sure about that nor do I have the > > time to investigate it... > > I was actually trying hard to come up with a fio recipe to reproduce this > but I've failed. As you are noting in your changelog, this workload is > bound by log throughput and one of the obvious differences between fast and > slow kernels is that fast kernels do less larger log forces while slow > kernels do many tiny log forces (with obvious consequences for throughput, > in particular because we tend to relog the same blocks over and over again > - the slow kernels end up logging about 3x as much data in total). Now > with fio the jobs were always managing to cram enough changes in one log > force for the difference to not be visible. 
Somehow the distribution of > writes among threads (and possibly their location determining how the btree > gets fragmented and which blocks get logged) DB2 creates is pretty peculiar > so that it makes such a big difference. > > > > So I just wanted to suggest that as a possible optimization (my patch > > > attached for reference). But regardless of whether you do the change or not > > > I think the patch is good to go. > > > > I was on the fence about using READ_ONCE/WRITE_ONCE. > > > > However, xfs_csn_t is 64 bit and READ_ONCE/WRITE_ONCE doesn't > > prevent torn reads of 64 bit variables on 32 bit platforms. A torn > > read of a commit sequence number will result in a transient data > > integrity guarantee failure, and so I decided to err on the side of > > caution.... > > Hum, right. I didn't think of 32-bits. > > > > diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c > > > index 829675700fcd..2a90e156b072 100644 > > > --- a/fs/xfs/xfs_inode_item.c > > > +++ b/fs/xfs/xfs_inode_item.c > > > @@ -145,18 +145,7 @@ xfs_inode_item_precommit( > > > flags |= XFS_ILOG_CORE; > > > } > > > > > > - /* > > > - * Record the specific change for fdatasync optimisation. This allows > > > - * fdatasync to skip log forces for inodes that are only timestamp > > > - * dirty. Once we've processed the XFS_ILOG_IVERSION flag, convert it > > > - * to XFS_ILOG_CORE so that the actual on-disk dirty tracking > > > - * (ili_fields) correctly tracks that the version has changed. 
> > > - */ > > > spin_lock(&iip->ili_lock); > > > - iip->ili_fsync_fields |= (flags & ~XFS_ILOG_IVERSION); > > > - if (flags & XFS_ILOG_IVERSION) > > > - flags = ((flags & ~XFS_ILOG_IVERSION) | XFS_ILOG_CORE); > > > - > > > /* > > > * Inode verifiers do not check that the CoW extent size hint is an > > > * integer multiple of the rt extent size on a directory with both > > > @@ -204,6 +193,23 @@ xfs_inode_item_precommit( > > > xfs_trans_brelse(tp, bp); > > > } > > > > > > + /* > > > + * Set the transaction dirty state we've created back in inode item > > > + * before mangling flags for storing on disk. We use the value later in > > > + * xfs_inode_item_committing() to determine whether the transaction is > > > + * relevant for fdatasync or not. ili_dirty_flags gets cleared in > > > + * xfs_trans_ijoin() before adding inode to the next transaction. > > > + */ > > > + iip->ili_dirty_flags = flags; > > > + > > > + /* > > > + * Now convert XFS_ILOG_IVERSION flag to XFS_ILOG_CORE so that the > > > + * actual on-disk dirty tracking (ili_fields) correctly tracks that the > > > + * version has changed. > > > + */ > > > + if (flags & XFS_ILOG_IVERSION) > > > + flags = ((flags & ~XFS_ILOG_IVERSION) | XFS_ILOG_CORE); > > > + > > > > OK, I think I might have missed this. I'll check/fix it, and send an > > updated version for inclusion. > > Yeah, your version may miss we've set XFS_ILOG_CORE in flags in > xfs_inode_item_precommit(). Frankly, I wasn't sure whether fdatasync() not > flushing the log in these cases it fine or not so I've just preserved the > existing behavior in my patch. > > Honza > > -- > Jan Kara <jack@suse.com> > SUSE Labs, CR > -- Lukas Herbolt SSME Red Hat ^ permalink raw reply [flat|nested] 11+ messages in thread
[parent not found: <CAM4Jq_5kSfwPRiVsGD67n3ftoNPsdXOwMx0jxxQ4f8T9kcqgcw@mail.gmail.com>]
* Re: [PATCH RFC] xfs: Don't hold XFS_ILOCK_SHARED over log force during fsync [not found] ` <CAM4Jq_5kSfwPRiVsGD67n3ftoNPsdXOwMx0jxxQ4f8T9kcqgcw@mail.gmail.com> @ 2025-09-17 15:05 ` Jan Kara 2025-09-17 15:16 ` Lukas Herbolt 0 siblings, 1 reply; 11+ messages in thread From: Jan Kara @ 2025-09-17 15:05 UTC (permalink / raw) To: Lukas Herbolt; +Cc: Jan Kara, Dave Chinner, linux-xfs On Wed 17-09-25 12:22:46, Lukas Herbolt wrote: > >> I was actually trying hard to come up with a fio recipe to reproduce > this > >> but I've failed. > I have a similar issue with DB2 observed during index reorg/rebuild. The db > opens > the files with O_DSYNC|O_DIRECT and writes with libaio 256 x 4k writes > (mostly > sequential) I believe that with the 256 and 4K depends on DB2 > internal page size > and extent size. Anyway with bellow I am getting around ~18MB/s without the > patch > and around ~75MB/s with your patch. Need to test Dave's as well. > > The fio file looks like this > --- > [global] > name=fio-seq-write > filename_format=/mnt/test/$jobname/$jobnum/fio.$filenum.$jobnum > rw=write > bs=4k > direct=1 > numjobs=10 > time_based > runtime=20 > nrfiles=1 > size=4G > ioengine=libaio > iodepth=16 > iodepth_batch_submit=16 > iodepth_batch_complete_min=16 > iodepth_batch_complete_max=16 > group_reporting=1 > gtod_reduce=1 > sync=dsync > fadvise_hint=1 > > [posix] > fallocate=posix > stonewall > --- Thanks for sharing this! The results of this fio job vary significantly on my test machine (even if I increase the runtime to several minutes) but indeed average throughput seems to be about 30% better with the patch applied, so I think it's a good enough smoke test.
Honza > > On Wed, Sep 17, 2025 at 12:07 PM Jan Kara <jack@suse.cz> wrote: > > > On Wed 17-09-25 07:32:11, Dave Chinner wrote: > > > On Tue, Sep 16, 2025 at 03:32:42PM +0200, Jan Kara wrote: > > > > On Tue 16-09-25 16:15:32, Dave Chinner wrote: > > > > > On Thu, Sep 11, 2025 at 10:59:15AM +1000, Dave Chinner wrote: > > > > > > i.e. if we clear the commit sequences on last unpin (i.e. in > > > > > > xfs_inode_item_unpin) then an item that is not in the CIL (and so > > > > > > doesn't have dirty metadata) will have no associated commit > > > > > > sequence number set. > > > > > > > > > > > > Hence if ili_datasync_commit_seq is non-zero, then by definition > > the > > > > > > inode must be pinned and has been dirtied for datasync purposes. > > > > > > That means we can simply query ili_datasync_commit_seq in > > > > > > xfs_bmbt_to_iomap() to set IOMAP_F_DIRTY. > > > > > > > > > > > > I suspect that the above fsync code can then become: > > > > > > > > > > > > spin_lock(&iip->ili_lock); > > > > > > if (datasync) > > > > > > seq = iip->ili_datasync_commit_seq; > > > > > > else > > > > > > seq = iip->ili_commit_seq; > > > > > > spin_unlock(&iip->ili_lock); > > > > > > > > > > > > if (!seq) > > > > > > return 0; > > > > > > return xfs_log_force_seq(ip->i_mount, seq, XFS_LOG_SYNC, > > log_flushed); > > > > > > > > > > > > For the same reason. i.e. a non-zero sequence number implies the > > > > > > inode log item is dirty in the CIL and pinned. > > > > > > > > > > > > At this point, we really don't care about races with transaction > > > > > > commits. f(data)sync should only wait for modifications that have > > > > > > been fully completed. If they haven't set the commit sequence in > > the > > > > > > log item, they haven't fully completed. If the commit sequence is > > > > > > already set, the the CIL push will co-ordinate appropriately with > > > > > > commits to ensure correct data integrity behaviour occurs. 
> > > > > > > > > > > > Hence I think that if we tie the sequence number clearing to the > > > > > > inode being removed from the CIL (i.e. last unpin) we can drop all > > > > > > the pin checks and use the commit sequence numbers directly to > > > > > > determine what the correct behaviour should be... > > > > > > > > > > Here's a patch that implements this. It appears to pass fstests > > > > > without any regressions on my test VMs. Can you test it and check > > > > > that it retains the expected performance improvement for > > > > > O_DSYNC+DIO on fallocate()d space? > > > > > > > > Heh, I just wanted to send my version of the patch after all the tests > > > > passed :). Anyway, I've given your patch a spin with the test I have > > and > > > > its performance looks good. So feel free to add: > > > > > > > > Tested-by: Jan Kara <jack@suse.cz> > > > > > > Thanks! > > > > > > > BTW I don't have customer setup with DB2 available where the huge > > > > difference is visible (I'll send them backport of the patch to their > > SUSE > > > > kernel once we settle on it) but I have written a tool that replays the > > > > same set of pwrites from same set of threads I've captured from syscall > > > > trace. It reproduces only about 20% difference between good & bad > > kernels > > > > on my test machine but it was good enough for the bisection and > > analysis > > > > and the customer confirmed that the revert of what I've bisected to > > > > actually fixes their issue (rwsem reader lockstealing logic). > > > > > > It was also recently bisected on RHEL 8.x to the introduction of > > > rwsem spin-on-owner changes from back in 2019. Might be the same > > > commit you are talking about, but either way it's an indication of > > > rwsem lock contention rather than a problem with the rwsems > > > themselves. > > > > Right. I've also come to a conclusion that the real problem is the too > > heavy use of ILOCK and not the rwsem behavior itself. Hence this patch :). 
> > > > > > So I'm > > > > reasonably confident I'm really reproducing their issue. > > > > > > Ok, that's good to know. I was thinking that maybe a fio recipe > > > might show it up, too, but I'm not sure about that nor do I have the > > > time to investigate it... > > > > I was actually trying hard to come up with a fio recipe to reproduce this > > but I've failed. As you are noting in your changelog, this workload is > > bound by log throughput and one of the obvious differences between fast and > > slow kernels is that fast kernels do less larger log forces while slow > > kernels do many tiny log forces (with obvious consequences for throughput, > > in particular because we tend to relog the same blocks over and over again > > - the slow kernels end up logging about 3x as much data in total). Now > > with fio the jobs were always managing to cram enough changes in one log > > force for the difference to not be visible. Somehow the distribution of > > writes among threads (and possibly their location determining how the btree > > gets fragmented and which blocks get logged) DB2 creates is pretty peculiar > > so that it makes such a big difference. > > > > > > So I just wanted to suggest that as a possible optimization (my patch > > > > attached for reference). But regardless of whether you do the change > > or not > > > > I think the patch is good to go. > > > > > > I was on the fence about using READ_ONCE/WRITE_ONCE. > > > > > > However, xfs_csn_t is 64 bit and READ_ONCE/WRITE_ONCE doesn't > > > prevent torn reads of 64 bit variables on 32 bit platforms. A torn > > > read of a commit sequence number will result in a transient data > > > integrity guarantee failure, and so I decided to err on the side of > > > caution.... > > > > Hum, right. I didn't think of 32-bits. 
> > > > > > diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c > > > > index 829675700fcd..2a90e156b072 100644 > > > > --- a/fs/xfs/xfs_inode_item.c > > > > +++ b/fs/xfs/xfs_inode_item.c > > > > @@ -145,18 +145,7 @@ xfs_inode_item_precommit( > > > > flags |= XFS_ILOG_CORE; > > > > } > > > > > > > > - /* > > > > - * Record the specific change for fdatasync optimisation. This > > allows > > > > - * fdatasync to skip log forces for inodes that are only timestamp > > > > - * dirty. Once we've processed the XFS_ILOG_IVERSION flag, convert > > it > > > > - * to XFS_ILOG_CORE so that the actual on-disk dirty tracking > > > > - * (ili_fields) correctly tracks that the version has changed. > > > > - */ > > > > spin_lock(&iip->ili_lock); > > > > - iip->ili_fsync_fields |= (flags & ~XFS_ILOG_IVERSION); > > > > - if (flags & XFS_ILOG_IVERSION) > > > > - flags = ((flags & ~XFS_ILOG_IVERSION) | XFS_ILOG_CORE); > > > > - > > > > /* > > > > * Inode verifiers do not check that the CoW extent size hint is an > > > > * integer multiple of the rt extent size on a directory with both > > > > @@ -204,6 +193,23 @@ xfs_inode_item_precommit( > > > > xfs_trans_brelse(tp, bp); > > > > } > > > > > > > > + /* > > > > + * Set the transaction dirty state we've created back in inode item > > > > + * before mangling flags for storing on disk. We use the value > > later in > > > > + * xfs_inode_item_committing() to determine whether the > > transaction is > > > > + * relevant for fdatasync or not. ili_dirty_flags gets cleared in > > > > + * xfs_trans_ijoin() before adding inode to the next transaction. > > > > + */ > > > > + iip->ili_dirty_flags = flags; > > > > + > > > > + /* > > > > + * Now convert XFS_ILOG_IVERSION flag to XFS_ILOG_CORE so that the > > > > + * actual on-disk dirty tracking (ili_fields) correctly tracks > > that the > > > > + * version has changed. 
> > > > + */ > > > > + if (flags & XFS_ILOG_IVERSION) > > > > + flags = ((flags & ~XFS_ILOG_IVERSION) | XFS_ILOG_CORE); > > > > + > > > > > > OK, I think I might have missed this. I'll check/fix it, and send an > > > updated version for inclusion. > > > > Yeah, your version may miss we've set XFS_ILOG_CORE in flags in > > xfs_inode_item_precommit(). Frankly, I wasn't sure whether fdatasync() not > > flushing the log in these cases it fine or not so I've just preserved the > > existing behavior in my patch. > > > > Honza > > > > -- > > Jan Kara <jack@suse.com> > > SUSE Labs, CR > > > > > > -- > Lukas Herbolt > SSME > Red Hat -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH RFC] xfs: Don't hold XFS_ILOCK_SHARED over log force during fsync 2025-09-17 15:05 ` Jan Kara @ 2025-09-17 15:16 ` Lukas Herbolt 0 siblings, 0 replies; 11+ messages in thread From: Lukas Herbolt @ 2025-09-17 15:16 UTC (permalink / raw) To: Jan Kara; +Cc: Dave Chinner, linux-xfs I forgot to mention: I am not sure whether fio always re-preallocates the file extents if writes (i.e. unwritten -> written conversions) to that file have already happened. Also, the mount in my test is done with nodiratime, noatime. So, to be sure, I was running the following before each fio run --- find /mnt/test/posix -type f -exec fallocate -v -z -l 1G {} \;; xfs_bmap -vp /mnt/test/posix/*/* --- and checking that the extents are unwritten. With that, the results were quite stable, I would say. On Wed, Sep 17, 2025 at 5:05 PM Jan Kara <jack@suse.cz> wrote: > > On Wed 17-09-25 12:22:46, Lukas Herbolt wrote: > > >> I was actually trying hard to come up with a fio recipe to reproduce > > this > > >> but I've failed. > > I have a similar issue with DB2 observed during index reorg/rebuild. The db > > opens > > the files with O_DSYNC|O_DIRECT and writes with libaio 256 x 4k writes > > (mostly > > sequential) I believe that with the 256 and 4K depends on DB2 > > internal page size > > and extent size. Anyway with bellow I am getting around ~18MB/s without the > > patch > > and around ~75MB/s with your patch. Need to test Dave's as well. > > > > The fio file looks like this > > --- > > [global] > > name=fio-seq-write > > filename_format=/mnt/test/$jobname/$jobnum/fio.$filenum.$jobnum > > rw=write > > bs=4k > > direct=1 > > numjobs=10 > > time_based > > runtime=20 > > nrfiles=1 > > size=4G > > ioengine=libaio > > iodepth=16 > > iodepth_batch_submit=16 > > iodepth_batch_complete_min=16 > > iodepth_batch_complete_max=16 > > group_reporting=1 > > gtod_reduce=1 > > sync=dsync > > fadvise_hint=1 > > > > [posix] > > fallocate=posix > > stonewall > > --- > > Thanks for sharing this!
The results of this fio job vary significantly on > my test machine (even if I increase the runtime to several minutes) but > indeed average throughput seems to be about 30% better with the patch > applied so I think it's good enough smoke test. > > Honza > > > > > On Wed, Sep 17, 2025 at 12:07 PM Jan Kara <jack@suse.cz> wrote: > > > > > On Wed 17-09-25 07:32:11, Dave Chinner wrote: > > > > On Tue, Sep 16, 2025 at 03:32:42PM +0200, Jan Kara wrote: > > > > > On Tue 16-09-25 16:15:32, Dave Chinner wrote: > > > > > > On Thu, Sep 11, 2025 at 10:59:15AM +1000, Dave Chinner wrote: > > > > > > > i.e. if we clear the commit sequences on last unpin (i.e. in > > > > > > > xfs_inode_item_unpin) then an item that is not in the CIL (and so > > > > > > > doesn't have dirty metadata) will have no associated commit > > > > > > > sequence number set. > > > > > > > > > > > > > > Hence if ili_datasync_commit_seq is non-zero, then by definition > > > the > > > > > > > inode must be pinned and has been dirtied for datasync purposes. > > > > > > > That means we can simply query ili_datasync_commit_seq in > > > > > > > xfs_bmbt_to_iomap() to set IOMAP_F_DIRTY. > > > > > > > > > > > > > > I suspect that the above fsync code can then become: > > > > > > > > > > > > > > spin_lock(&iip->ili_lock); > > > > > > > if (datasync) > > > > > > > seq = iip->ili_datasync_commit_seq; > > > > > > > else > > > > > > > seq = iip->ili_commit_seq; > > > > > > > spin_unlock(&iip->ili_lock); > > > > > > > > > > > > > > if (!seq) > > > > > > > return 0; > > > > > > > return xfs_log_force_seq(ip->i_mount, seq, XFS_LOG_SYNC, > > > log_flushed); > > > > > > > > > > > > > > For the same reason. i.e. a non-zero sequence number implies the > > > > > > > inode log item is dirty in the CIL and pinned. > > > > > > > > > > > > > > At this point, we really don't care about races with transaction > > > > > > > commits. f(data)sync should only wait for modifications that have > > > > > > > been fully completed. 
If they haven't set the commit sequence in > > > the > > > > > > > log item, they haven't fully completed. If the commit sequence is > > > > > > > already set, the the CIL push will co-ordinate appropriately with > > > > > > > commits to ensure correct data integrity behaviour occurs. > > > > > > > > > > > > > > Hence I think that if we tie the sequence number clearing to the > > > > > > > inode being removed from the CIL (i.e. last unpin) we can drop all > > > > > > > the pin checks and use the commit sequence numbers directly to > > > > > > > determine what the correct behaviour should be... > > > > > > > > > > > > Here's a patch that implements this. It appears to pass fstests > > > > > > without any regressions on my test VMs. Can you test it and check > > > > > > that it retains the expected performance improvement for > > > > > > O_DSYNC+DIO on fallocate()d space? > > > > > > > > > > Heh, I just wanted to send my version of the patch after all the tests > > > > > passed :). Anyway, I've given your patch a spin with the test I have > > > and > > > > > its performance looks good. So feel free to add: > > > > > > > > > > Tested-by: Jan Kara <jack@suse.cz> > > > > > > > > Thanks! > > > > > > > > > BTW I don't have customer setup with DB2 available where the huge > > > > > difference is visible (I'll send them backport of the patch to their > > > SUSE > > > > > kernel once we settle on it) but I have written a tool that replays the > > > > > same set of pwrites from same set of threads I've captured from syscall > > > > > trace. It reproduces only about 20% difference between good & bad > > > kernels > > > > > on my test machine but it was good enough for the bisection and > > > analysis > > > > > and the customer confirmed that the revert of what I've bisected to > > > > > actually fixes their issue (rwsem reader lockstealing logic). 
> > > > > > > > It was also recently bisected on RHEL 8.x to the introduction of > > > > rwsem spin-on-owner changes from back in 2019. Might be the same > > > > commit you are talking about, but either way it's an indication of > > > > rwsem lock contention rather than a problem with the rwsems > > > > themselves. > > > > > > Right. I've also come to a conclusion that the real problem is the too > > > heavy use of ILOCK and not the rwsem behavior itself. Hence this patch :). > > > > > > > > So I'm > > > > > reasonably confident I'm really reproducing their issue. > > > > > > > > Ok, that's good to know. I was thinking that maybe a fio recipe > > > > might show it up, too, but I'm not sure about that nor do I have the > > > > time to investigate it... > > > > > > I was actually trying hard to come up with a fio recipe to reproduce this > > > but I've failed. As you are noting in your changelog, this workload is > > > bound by log throughput and one of the obvious differences between fast and > > > slow kernels is that fast kernels do less larger log forces while slow > > > kernels do many tiny log forces (with obvious consequences for throughput, > > > in particular because we tend to relog the same blocks over and over again > > > - the slow kernels end up logging about 3x as much data in total). Now > > > with fio the jobs were always managing to cram enough changes in one log > > > force for the difference to not be visible. Somehow the distribution of > > > writes among threads (and possibly their location determining how the btree > > > gets fragmented and which blocks get logged) DB2 creates is pretty peculiar > > > so that it makes such a big difference. > > > > > > > > So I just wanted to suggest that as a possible optimization (my patch > > > > > attached for reference). But regardless of whether you do the change > > > or not > > > > > I think the patch is good to go. > > > > > > > > I was on the fence about using READ_ONCE/WRITE_ONCE. 
> > > > > > > > However, xfs_csn_t is 64 bit and READ_ONCE/WRITE_ONCE doesn't > > > > prevent torn reads of 64 bit variables on 32 bit platforms. A torn > > > > read of a commit sequence number will result in a transient data > > > > integrity guarantee failure, and so I decided to err on the side of > > > > caution.... > > > > > > Hum, right. I didn't think of 32-bits. > > > > > > > > diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c > > > > > index 829675700fcd..2a90e156b072 100644 > > > > > --- a/fs/xfs/xfs_inode_item.c > > > > > +++ b/fs/xfs/xfs_inode_item.c > > > > > @@ -145,18 +145,7 @@ xfs_inode_item_precommit( > > > > > flags |= XFS_ILOG_CORE; > > > > > } > > > > > > > > > > - /* > > > > > - * Record the specific change for fdatasync optimisation. This > > > allows > > > > > - * fdatasync to skip log forces for inodes that are only timestamp > > > > > - * dirty. Once we've processed the XFS_ILOG_IVERSION flag, convert > > > it > > > > > - * to XFS_ILOG_CORE so that the actual on-disk dirty tracking > > > > > - * (ili_fields) correctly tracks that the version has changed. > > > > > - */ > > > > > spin_lock(&iip->ili_lock); > > > > > - iip->ili_fsync_fields |= (flags & ~XFS_ILOG_IVERSION); > > > > > - if (flags & XFS_ILOG_IVERSION) > > > > > - flags = ((flags & ~XFS_ILOG_IVERSION) | XFS_ILOG_CORE); > > > > > - > > > > > /* > > > > > * Inode verifiers do not check that the CoW extent size hint is an > > > > > * integer multiple of the rt extent size on a directory with both > > > > > @@ -204,6 +193,23 @@ xfs_inode_item_precommit( > > > > > xfs_trans_brelse(tp, bp); > > > > > } > > > > > > > > > > + /* > > > > > + * Set the transaction dirty state we've created back in inode item > > > > > + * before mangling flags for storing on disk. We use the value > > > later in > > > > > + * xfs_inode_item_committing() to determine whether the > > > transaction is > > > > > + * relevant for fdatasync or not. 
ili_dirty_flags gets cleared in > > > > > + * xfs_trans_ijoin() before adding inode to the next transaction. > > > > > + */ > > > > > + iip->ili_dirty_flags = flags; > > > > > + > > > > > + /* > > > > > + * Now convert XFS_ILOG_IVERSION flag to XFS_ILOG_CORE so that the > > > > > + * actual on-disk dirty tracking (ili_fields) correctly tracks > > > that the > > > > > + * version has changed. > > > > > + */ > > > > > + if (flags & XFS_ILOG_IVERSION) > > > > > + flags = ((flags & ~XFS_ILOG_IVERSION) | XFS_ILOG_CORE); > > > > > + > > > > > > > > OK, I think I might have missed this. I'll check/fix it, and send an > > > > updated version for inclusion. > > > > > > Yeah, your version may miss we've set XFS_ILOG_CORE in flags in > > > xfs_inode_item_precommit(). Frankly, I wasn't sure whether fdatasync() not > > > flushing the log in these cases it fine or not so I've just preserved the > > > existing behavior in my patch. > > > > > > Honza > > > > > > -- > > > Jan Kara <jack@suse.com> > > > SUSE Labs, CR > > > > > > > > > > -- > > Lukas Herbolt > > SSME > > Red Hat > -- > Jan Kara <jack@suse.com> > SUSE Labs, CR > -- Lukas Herbolt SSME Red Hat ^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread (newest message: 2025-09-17 15:17 UTC)
Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-09-08 15:12 [PATCH RFC] xfs: Don't hold XFS_ILOCK_SHARED over log force during fsync Jan Kara
2025-09-09 0:18 ` Dave Chinner
2025-09-10 9:05 ` Jan Kara
2025-09-11 0:59 ` Dave Chinner
2025-09-16 6:15 ` Dave Chinner
2025-09-16 13:32 ` Jan Kara
2025-09-16 21:32 ` Dave Chinner
2025-09-17 10:07 ` Jan Kara
2025-09-17 10:27 ` Lukas Herbolt
[not found] ` <CAM4Jq_5kSfwPRiVsGD67n3ftoNPsdXOwMx0jxxQ4f8T9kcqgcw@mail.gmail.com>
2025-09-17 15:05 ` Jan Kara
2025-09-17 15:16 ` Lukas Herbolt
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox