[PATCH] xfs: don't take a spinlock unconditionally in the DIO fastpath

public inbox for linux-xfs@vger.kernel.org

* [PATCH] xfs: don't take a spinlock unconditionally in the DIO fastpath
@ 2021-05-19  1:19 Dave Chinner
  2021-05-19  7:59 ` Carlos Maiolino
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Dave Chinner @ 2021-05-19  1:19 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Because this happens at high thread counts on high IOPS devices
doing mixed read/write AIO-DIO to a single file at about a million
iops:

   64.09%     0.21%  [kernel]            [k] io_submit_one
   - 63.87% io_submit_one
      - 44.33% aio_write
         - 42.70% xfs_file_write_iter
            - 41.32% xfs_file_dio_write_aligned
               - 25.51% xfs_file_write_checks
                  - 21.60% _raw_spin_lock
                     - 21.59% do_raw_spin_lock
                        - 19.70% __pv_queued_spin_lock_slowpath

This also happens on the IO completion path:

   22.89%     0.69%  [kernel]            [k] xfs_dio_write_end_io
   - 22.49% xfs_dio_write_end_io
      - 21.79% _raw_spin_lock
         - 20.97% do_raw_spin_lock
            - 20.10% __pv_queued_spin_lock_slowpath

IOWs, fio is burning ~14 whole CPUs on this spin lock.

So, do an unlocked check against inode size first, then if we are
at/beyond EOF, take the spinlock and recheck. This makes the
spinlock disappear from the overwrite fastpath.

I'd like to report that fixing this makes things go faster. It
doesn't - it just exposes the XFS_ILOCK as the next severe
contention point doing extent mapping lookups, and that now burns
all the 14 CPUs this spinlock was burning.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_file.c | 42 +++++++++++++++++++++++++++++++-----------
 1 file changed, 31 insertions(+), 11 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 396ef36dcd0a..c068dcd414f4 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -384,21 +384,30 @@ xfs_file_write_checks(
 		}
 		goto restart;
 	}
+
 	/*
 	 * If the offset is beyond the size of the file, we need to zero any
 	 * blocks that fall between the existing EOF and the start of this
-	 * write.  If zeroing is needed and we are currently holding the
-	 * iolock shared, we need to update it to exclusive which implies
-	 * having to redo all checks before.
+	 * write.  If zeroing is needed and we are currently holding the iolock
+	 * shared, we need to update it to exclusive which implies having to
+	 * redo all checks before.
+	 *
+	 * We need to serialise against EOF updates that occur in IO completions
+	 * here. We want to make sure that nobody is changing the size while we
+	 * do this check until we have placed an IO barrier (i.e.  hold the
+	 * XFS_IOLOCK_EXCL) that prevents new IO from being dispatched.  The
+	 * spinlock effectively forms a memory barrier once we have the
+	 * XFS_IOLOCK_EXCL so we are guaranteed to see the latest EOF value and
+	 * hence be able to correctly determine if we need to run zeroing.
 	 *
-	 * We need to serialise against EOF updates that occur in IO
-	 * completions here. We want to make sure that nobody is changing the
-	 * size while we do this check until we have placed an IO barrier (i.e.
-	 * hold the XFS_IOLOCK_EXCL) that prevents new IO from being dispatched.
-	 * The spinlock effectively forms a memory barrier once we have the
-	 * XFS_IOLOCK_EXCL so we are guaranteed to see the latest EOF value
-	 * and hence be able to correctly determine if we need to run zeroing.
+	 * We can do an unlocked check here safely as IO completion can only
+	 * extend EOF. Truncate is locked out at this point, so the EOF can
+	 * not move backwards, only forwards. Hence we only need to take the
+	 * slow path and spin locks when we are at or beyond the current EOF.
 	 */
+	if (iocb->ki_pos <= i_size_read(inode))
+		goto out;
+
 	spin_lock(&ip->i_flags_lock);
 	isize = i_size_read(inode);
 	if (iocb->ki_pos > isize) {
@@ -426,7 +435,7 @@ xfs_file_write_checks(
 			drained_dio = true;
 			goto restart;
 		}
-	
+
 		trace_xfs_zero_eof(ip, isize, iocb->ki_pos - isize);
 		error = iomap_zero_range(inode, isize, iocb->ki_pos - isize,
 				NULL, &xfs_buffered_write_iomap_ops);
@@ -435,6 +444,7 @@ xfs_file_write_checks(
 	} else
 		spin_unlock(&ip->i_flags_lock);
 
+out:
 	return file_modified(file);
 }
 
@@ -500,7 +510,17 @@ xfs_dio_write_end_io(
 	 * other IO completions here to update the EOF. Failing to serialise
 	 * here can result in EOF moving backwards and Bad Things Happen when
 	 * that occurs.
+	 *
+	 * As IO completion only ever extends EOF, we can do an unlocked check
+	 * here to avoid taking the spinlock. If we land within the current EOF,
+	 * then we do not need to do an extending update at all, and we don't
+	 * need to take the lock to check this. If we race with an update moving
+	 * EOF, then we'll either still be beyond EOF and need to take the lock,
+	 * or we'll be within EOF and we don't need to take it at all.
 	 */
+	if (offset + size <= i_size_read(inode))
+		goto out;
+
 	spin_lock(&ip->i_flags_lock);
 	if (offset + size > i_size_read(inode)) {
 		i_size_write(inode, offset + size);
-- 
2.31.1
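
For readers following along, the "unlocked check, then lock and recheck"
pattern that the patch applies in xfs_dio_write_end_io() can be sketched in
userspace roughly as follows. This is a minimal model under stated
assumptions, not the kernel code: struct inode_model, inode_model_init(),
model_lock(), model_unlock() and model_extend_eof() are hypothetical names,
a C11 atomic_flag stands in for ip->i_flags_lock, and an atomic integer
stands in for i_size. It relies on the property stated in the patch
comments, namely that the size can only move forwards while this code runs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the inode fields the patch touches. */
struct inode_model {
	_Atomic long long size;   /* plays the role of i_size */
	atomic_flag       lock;   /* plays the role of ip->i_flags_lock */
	unsigned long     locked; /* how many times the lock was taken */
};

static void inode_model_init(struct inode_model *ip, long long size)
{
	atomic_init(&ip->size, size);
	atomic_flag_clear(&ip->lock);
	ip->locked = 0;
}

static void model_lock(struct inode_model *ip)
{
	/* Simple test-and-set spinlock; spins until the flag was clear. */
	while (atomic_flag_test_and_set_explicit(&ip->lock,
						 memory_order_acquire))
		;
	ip->locked++; /* only modified while holding the lock */
}

static void model_unlock(struct inode_model *ip)
{
	atomic_flag_clear_explicit(&ip->lock, memory_order_release);
}

/*
 * Completion-side size update. Only ever moves the size forwards.
 * Returns true if this call extended the size.
 */
static bool model_extend_eof(struct inode_model *ip, long long end)
{
	bool extended = false;

	assert(end >= 0);

	/* Fast path: the IO landed within EOF, so there is nothing to update. */
	if (end <= atomic_load(&ip->size))
		return false;

	/* Slow path: recheck under the lock, another completion may have won. */
	model_lock(ip);
	if (end > atomic_load(&ip->size)) {
		atomic_store(&ip->size, end);
		extended = true;
	}
	model_unlock(ip);
	return extended;
}
```

In this model a call whose end offset is inside the current size returns
without touching the lock at all, which is the overwrite fast path the patch
is after. A stale read in the fast path can only see a size that is too
small, so the worst case is an unnecessary trip through the locked recheck.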



* Re: [PATCH] xfs: don't take a spinlock unconditionally in the DIO fastpath
  2021-05-19  1:19 [PATCH] xfs: don't take a spinlock unconditionally in the DIO fastpath Dave Chinner
@ 2021-05-19  7:59 ` Carlos Maiolino
  2021-05-19 12:20   ` Dave Chinner
  2021-05-20 23:33 ` Darrick J. Wong
  2021-05-31 17:58 ` riteshh
  2 siblings, 1 reply; 10+ messages in thread
From: Carlos Maiolino @ 2021-05-19  7:59 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, May 19, 2021 at 11:19:20AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Because this happens at high thread counts on high IOPS devices
> doing mixed read/write AIO-DIO to a single file at about a million
> iops:
> 
>    64.09%     0.21%  [kernel]            [k] io_submit_one
>    - 63.87% io_submit_one
>       - 44.33% aio_write
>          - 42.70% xfs_file_write_iter
>             - 41.32% xfs_file_dio_write_aligned
>                - 25.51% xfs_file_write_checks
>                   - 21.60% _raw_spin_lock
>                      - 21.59% do_raw_spin_lock
>                         - 19.70% __pv_queued_spin_lock_slowpath
> 
> This also happens on the IO completion path:
> 
>    22.89%     0.69%  [kernel]            [k] xfs_dio_write_end_io
>    - 22.49% xfs_dio_write_end_io
>       - 21.79% _raw_spin_lock
>          - 20.97% do_raw_spin_lock
>             - 20.10% __pv_queued_spin_lock_slowpath
> 
> IOWs, fio is burning ~14 whole CPUs on this spin lock.
> 
> So, do an unlocked check against inode size first, then if we are
> at/beyond EOF, take the spinlock and recheck. This makes the
> spinlock disappear from the overwrite fastpath.
> 
> I'd like to report that fixing this makes things go faster.

maybe you meant this does not make things go faster?

> It
> doesn't - it just exposes the XFS_ILOCK as the next severe
> contention point doing extent mapping lookups, and that now burns
> all the 14 CPUs this spinlock was burning.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

The patch looks good, and the comments about why it's safe to not take the
spinlock (especially why the EOF can't be moved back) are very welcome.

Feel free to add:
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>

> ---
>  fs/xfs/xfs_file.c | 42 +++++++++++++++++++++++++++++++-----------
>  1 file changed, 31 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 396ef36dcd0a..c068dcd414f4 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -384,21 +384,30 @@ xfs_file_write_checks(
>  		}
>  		goto restart;
>  	}
> +
>  	/*
>  	 * If the offset is beyond the size of the file, we need to zero any
>  	 * blocks that fall between the existing EOF and the start of this
> -	 * write.  If zeroing is needed and we are currently holding the
> -	 * iolock shared, we need to update it to exclusive which implies
> -	 * having to redo all checks before.
> +	 * write.  If zeroing is needed and we are currently holding the iolock
> +	 * shared, we need to update it to exclusive which implies having to
> +	 * redo all checks before.
> +	 *
> +	 * We need to serialise against EOF updates that occur in IO completions
> +	 * here. We want to make sure that nobody is changing the size while we
> +	 * do this check until we have placed an IO barrier (i.e.  hold the
> +	 * XFS_IOLOCK_EXCL) that prevents new IO from being dispatched.  The
> +	 * spinlock effectively forms a memory barrier once we have the
> +	 * XFS_IOLOCK_EXCL so we are guaranteed to see the latest EOF value and
> +	 * hence be able to correctly determine if we need to run zeroing.
>  	 *
> -	 * We need to serialise against EOF updates that occur in IO
> -	 * completions here. We want to make sure that nobody is changing the
> -	 * size while we do this check until we have placed an IO barrier (i.e.
> -	 * hold the XFS_IOLOCK_EXCL) that prevents new IO from being dispatched.
> -	 * The spinlock effectively forms a memory barrier once we have the
> -	 * XFS_IOLOCK_EXCL so we are guaranteed to see the latest EOF value
> -	 * and hence be able to correctly determine if we need to run zeroing.
> +	 * We can do an unlocked check here safely as IO completion can only
> +	 * extend EOF. Truncate is locked out at this point, so the EOF can
> +	 * not move backwards, only forwards. Hence we only need to take the
> +	 * slow path and spin locks when we are at or beyond the current EOF.
>  	 */
> +	if (iocb->ki_pos <= i_size_read(inode))
> +		goto out;
> +
>  	spin_lock(&ip->i_flags_lock);
>  	isize = i_size_read(inode);
>  	if (iocb->ki_pos > isize) {
> @@ -426,7 +435,7 @@ xfs_file_write_checks(
>  			drained_dio = true;
>  			goto restart;
>  		}
> -	
> +
>  		trace_xfs_zero_eof(ip, isize, iocb->ki_pos - isize);
>  		error = iomap_zero_range(inode, isize, iocb->ki_pos - isize,
>  				NULL, &xfs_buffered_write_iomap_ops);
> @@ -435,6 +444,7 @@ xfs_file_write_checks(
>  	} else
>  		spin_unlock(&ip->i_flags_lock);
>  
> +out:
>  	return file_modified(file);
>  }
>  
> @@ -500,7 +510,17 @@ xfs_dio_write_end_io(
>  	 * other IO completions here to update the EOF. Failing to serialise
>  	 * here can result in EOF moving backwards and Bad Things Happen when
>  	 * that occurs.
> +	 *
> +	 * As IO completion only ever extends EOF, we can do an unlocked check
> +	 * here to avoid taking the spinlock. If we land within the current EOF,
> +	 * then we do not need to do an extending update at all, and we don't
> +	 * need to take the lock to check this. If we race with an update moving
> +	 * EOF, then we'll either still be beyond EOF and need to take the lock,
> +	 * or we'll be within EOF and we don't need to take it at all.
>  	 */
> +	if (offset + size <= i_size_read(inode))
> +		goto out;
> +
>  	spin_lock(&ip->i_flags_lock);
>  	if (offset + size > i_size_read(inode)) {
>  		i_size_write(inode, offset + size);
> -- 
> 2.31.1
> 

-- 
Carlos



* Re: [PATCH] xfs: don't take a spinlock unconditionally in the DIO fastpath
  2021-05-19  7:59 ` Carlos Maiolino
@ 2021-05-19 12:20   ` Dave Chinner
  0 siblings, 0 replies; 10+ messages in thread
From: Dave Chinner @ 2021-05-19 12:20 UTC (permalink / raw)
  To: linux-xfs

On Wed, May 19, 2021 at 09:59:29AM +0200, Carlos Maiolino wrote:
> On Wed, May 19, 2021 at 11:19:20AM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Because this happens at high thread counts on high IOPS devices
> > doing mixed read/write AIO-DIO to a single file at about a million
> > iops:
> > 
> >    64.09%     0.21%  [kernel]            [k] io_submit_one
> >    - 63.87% io_submit_one
> >       - 44.33% aio_write
> >          - 42.70% xfs_file_write_iter
> >             - 41.32% xfs_file_dio_write_aligned
> >                - 25.51% xfs_file_write_checks
> >                   - 21.60% _raw_spin_lock
> >                      - 21.59% do_raw_spin_lock
> >                         - 19.70% __pv_queued_spin_lock_slowpath
> > 
> > This also happens on the IO completion path:
> > 
> >    22.89%     0.69%  [kernel]            [k] xfs_dio_write_end_io
> >    - 22.49% xfs_dio_write_end_io
> >       - 21.79% _raw_spin_lock
> >          - 20.97% do_raw_spin_lock
> >             - 20.10% __pv_queued_spin_lock_slowpath
> > 
> > IOWs, fio is burning ~14 whole CPUs on this spin lock.
> > 
> > So, do an unlocked check against inode size first, then if we are
> > at/beyond EOF, take the spinlock and recheck. This makes the
> > spinlock disappear from the overwrite fastpath.
> > 
> > I'd like to report that fixing this makes things go faster.
> 
> maybe you meant this does not make things go faster?

Yes, that is what this statement means. That is, I'd -like- to
report that things went faster, but reality doesn't care about what
I'd -like- to have happen, as the next sentence explained... :(

> > It
> > doesn't - it just exposes the XFS_ILOCK as the next severe
> > contention point doing extent mapping lookups, and that now burns
> > all the 14 CPUs this spinlock was burning.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> 
> The patch looks good, and the comments about why it's safe to not take the
> spinlock (especially why the EOF can't be moved back) are very welcome.
> 
> Feel free to add:
> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>

thanks!

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH] xfs: don't take a spinlock unconditionally in the DIO fastpath
  2021-05-19  1:19 [PATCH] xfs: don't take a spinlock unconditionally in the DIO fastpath Dave Chinner
  2021-05-19  7:59 ` Carlos Maiolino
@ 2021-05-20 23:33 ` Darrick J. Wong
  2021-05-25  7:18   ` Dave Chinner
  2021-05-31 17:58 ` riteshh
  2 siblings, 1 reply; 10+ messages in thread
From: Darrick J. Wong @ 2021-05-20 23:33 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, May 19, 2021 at 11:19:20AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Because this happens at high thread counts on high IOPS devices
> doing mixed read/write AIO-DIO to a single file at about a million
> iops:
> 
>    64.09%     0.21%  [kernel]            [k] io_submit_one
>    - 63.87% io_submit_one
>       - 44.33% aio_write
>          - 42.70% xfs_file_write_iter
>             - 41.32% xfs_file_dio_write_aligned
>                - 25.51% xfs_file_write_checks
>                   - 21.60% _raw_spin_lock
>                      - 21.59% do_raw_spin_lock
>                         - 19.70% __pv_queued_spin_lock_slowpath
> 
> This also happens on the IO completion path:
> 
>    22.89%     0.69%  [kernel]            [k] xfs_dio_write_end_io
>    - 22.49% xfs_dio_write_end_io
>       - 21.79% _raw_spin_lock
>          - 20.97% do_raw_spin_lock
>             - 20.10% __pv_queued_spin_lock_slowpath                                                                                                            ▒

Super long line there.

> 
> IOWs, fio is burning ~14 whole CPUs on this spin lock.
> 
> So, do an unlocked check against inode size first, then if we are
> at/beyond EOF, take the spinlock and recheck. This makes the
> spinlock disappear from the overwrite fastpath.
> 
> I'd like to report that fixing this makes things go faster. It
> doesn't - it just exposes the XFS_ILOCK as the next severe
> contention point doing extent mapping lookups, and that now burns
> all the 14 CPUs this spinlock was burning.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_file.c | 42 +++++++++++++++++++++++++++++++-----------
>  1 file changed, 31 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 396ef36dcd0a..c068dcd414f4 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -384,21 +384,30 @@ xfs_file_write_checks(
>  		}
>  		goto restart;
>  	}
> +
>  	/*
>  	 * If the offset is beyond the size of the file, we need to zero any
>  	 * blocks that fall between the existing EOF and the start of this
> -	 * write.  If zeroing is needed and we are currently holding the
> -	 * iolock shared, we need to update it to exclusive which implies
> -	 * having to redo all checks before.
> +	 * write.  If zeroing is needed and we are currently holding the iolock
> +	 * shared, we need to update it to exclusive which implies having to
> +	 * redo all checks before.
> +	 *
> +	 * We need to serialise against EOF updates that occur in IO completions
> +	 * here. We want to make sure that nobody is changing the size while we
> +	 * do this check until we have placed an IO barrier (i.e.  hold the
> +	 * XFS_IOLOCK_EXCL) that prevents new IO from being dispatched.  The
> +	 * spinlock effectively forms a memory barrier once we have the
> +	 * XFS_IOLOCK_EXCL so we are guaranteed to see the latest EOF value and
> +	 * hence be able to correctly determine if we need to run zeroing.
>  	 *
> -	 * We need to serialise against EOF updates that occur in IO
> -	 * completions here. We want to make sure that nobody is changing the
> -	 * size while we do this check until we have placed an IO barrier (i.e.
> -	 * hold the XFS_IOLOCK_EXCL) that prevents new IO from being dispatched.
> -	 * The spinlock effectively forms a memory barrier once we have the
> -	 * XFS_IOLOCK_EXCL so we are guaranteed to see the latest EOF value
> -	 * and hence be able to correctly determine if we need to run zeroing.
> +	 * We can do an unlocked check here safely as IO completion can only
> +	 * extend EOF. Truncate is locked out at this point, so the EOF can
> +	 * not move backwards, only forwards. Hence we only need to take the
> +	 * slow path and spin locks when we are at or beyond the current EOF.
>  	 */
> +	if (iocb->ki_pos <= i_size_read(inode))
> +		goto out;
> +
>  	spin_lock(&ip->i_flags_lock);
>  	isize = i_size_read(inode);
>  	if (iocb->ki_pos > isize) {
> @@ -426,7 +435,7 @@ xfs_file_write_checks(
>  			drained_dio = true;
>  			goto restart;
>  		}
> -	
> +
>  		trace_xfs_zero_eof(ip, isize, iocb->ki_pos - isize);
>  		error = iomap_zero_range(inode, isize, iocb->ki_pos - isize,
>  				NULL, &xfs_buffered_write_iomap_ops);
> @@ -435,6 +444,7 @@ xfs_file_write_checks(
>  	} else
>  		spin_unlock(&ip->i_flags_lock);
>  
> +out:
>  	return file_modified(file);
>  }
>  
> @@ -500,7 +510,17 @@ xfs_dio_write_end_io(
>  	 * other IO completions here to update the EOF. Failing to serialise
>  	 * here can result in EOF moving backwards and Bad Things Happen when
>  	 * that occurs.
> +	 *
> +	 * As IO completion only ever extends EOF, we can do an unlocked check
> +	 * here to avoid taking the spinlock. If we land within the current EOF,
> +	 * then we do not need to do an extending update at all, and we don't
> +	 * need to take the lock to check this. If we race with an update moving
> +	 * EOF, then we'll either still be beyond EOF and need to take the lock,
> +	 * or we'll be within EOF and we don't need to take it at all.

Is truncate locked out at this point too?  I /think/ it is since we
still hold the iolock (shared or excl) which blocks truncate?

--D

>  	 */
> +	if (offset + size <= i_size_read(inode))
> +		goto out;
> +
>  	spin_lock(&ip->i_flags_lock);
>  	if (offset + size > i_size_read(inode)) {
>  		i_size_write(inode, offset + size);
> -- 
> 2.31.1
> 


* Re: [PATCH] xfs: don't take a spinlock unconditionally in the DIO fastpath
  2021-05-20 23:33 ` Darrick J. Wong
@ 2021-05-25  7:18   ` Dave Chinner
  0 siblings, 0 replies; 10+ messages in thread
From: Dave Chinner @ 2021-05-25  7:18 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, May 20, 2021 at 04:33:32PM -0700, Darrick J. Wong wrote:
> On Wed, May 19, 2021 at 11:19:20AM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Because this happens at high thread counts on high IOPS devices
> > doing mixed read/write AIO-DIO to a single file at about a million
> > iops:
> > 
> >    64.09%     0.21%  [kernel]            [k] io_submit_one
> >    - 63.87% io_submit_one
> >       - 44.33% aio_write
> >          - 42.70% xfs_file_write_iter
> >             - 41.32% xfs_file_dio_write_aligned
> >                - 25.51% xfs_file_write_checks
> >                   - 21.60% _raw_spin_lock
> >                      - 21.59% do_raw_spin_lock
> >                         - 19.70% __pv_queued_spin_lock_slowpath
> > 
> > This also happens on the IO completion path:
> > 
> >    22.89%     0.69%  [kernel]            [k] xfs_dio_write_end_io
> >    - 22.49% xfs_dio_write_end_io
> >       - 21.79% _raw_spin_lock
> >          - 20.97% do_raw_spin_lock
> >             - 20.10% __pv_queued_spin_lock_slowpath                                                                                                            ▒
> 
> Super long line there.

Ah, forgot to trim it.

> > @@ -500,7 +510,17 @@ xfs_dio_write_end_io(
> >  	 * other IO completions here to update the EOF. Failing to serialise
> >  	 * here can result in EOF moving backwards and Bad Things Happen when
> >  	 * that occurs.
> > +	 *
> > +	 * As IO completion only ever extends EOF, we can do an unlocked check
> > +	 * here to avoid taking the spinlock. If we land within the current EOF,
> > +	 * then we do not need to do an extending update at all, and we don't
> > +	 * need to take the lock to check this. If we race with an update moving
> > +	 * EOF, then we'll either still be beyond EOF and need to take the lock,
> > +	 * or we'll be within EOF and we don't need to take it at all.
> 
> Is truncate locked out at this point too?  I /think/ it is since we
> still hold the iolock (shared or excl) which blocks truncate?

truncate and fallocate are locked out because the inode dio count is
still elevated at this point. i.e. they'll block in inode_dio_wait()
until we return to iomap_dio_complete() and it (eventually) calls
inode_dio_end()....
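
As a rough illustration of that point, the elevated DIO count can be modelled
in userspace like this. It is only a sketch under stated assumptions:
struct dio_model and the model_*() helpers are hypothetical names, and
model_try_truncate() returns false where the kernel would instead sleep in
inode_dio_wait() until the count drops to zero.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the inode state that matters here. */
struct dio_model {
	atomic_int dio_count; /* number of direct IOs still in flight */
	long long  size;      /* plays the role of i_size */
};

static void dio_model_init(struct dio_model *m, long long size)
{
	atomic_init(&m->dio_count, 0);
	m->size = size;
}

/* Submission side: mark one more direct IO as in flight. */
static void model_dio_begin(struct dio_model *m)
{
	atomic_fetch_add(&m->dio_count, 1);
}

/* Completion side: called only after the end_io work has finished. */
static void model_dio_end(struct dio_model *m)
{
	int old = atomic_fetch_sub(&m->dio_count, 1);

	assert(old > 0);
	(void)old;
}

/*
 * Non-blocking model of truncate: it may only change the size once no
 * direct IO is in flight. The kernel would wait here rather than fail.
 */
static bool model_try_truncate(struct dio_model *m, long long new_size)
{
	if (atomic_load(&m->dio_count) != 0)
		return false;
	m->size = new_size;
	return true;
}
```

Because the completion handler runs between model_dio_begin() and
model_dio_end(), a truncate cannot shrink the size underneath it in this
model, which mirrors why the unlocked check in the completion path only has
to cope with the size moving forwards.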

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH] xfs: don't take a spinlock unconditionally in the DIO fastpath
  2021-05-19  1:19 [PATCH] xfs: don't take a spinlock unconditionally in the DIO fastpath Dave Chinner
  2021-05-19  7:59 ` Carlos Maiolino
  2021-05-20 23:33 ` Darrick J. Wong
@ 2021-05-31 17:58 ` riteshh
  2021-06-01 23:15   ` Dave Chinner
  2 siblings, 1 reply; 10+ messages in thread
From: riteshh @ 2021-05-31 17:58 UTC (permalink / raw)
  To: Dave Chinner, Jan Kara; +Cc: linux-xfs

On 21/05/19 11:19AM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> Because this happens at high thread counts on high IOPS devices
> doing mixed read/write AIO-DIO to a single file at about a million
> iops:
>
>    64.09%     0.21%  [kernel]            [k] io_submit_one
>    - 63.87% io_submit_one
>       - 44.33% aio_write
>          - 42.70% xfs_file_write_iter
>             - 41.32% xfs_file_dio_write_aligned
>                - 25.51% xfs_file_write_checks
>                   - 21.60% _raw_spin_lock
>                      - 21.59% do_raw_spin_lock
>                         - 19.70% __pv_queued_spin_lock_slowpath
>
> This also happens on the IO completion path:
>
>    22.89%     0.69%  [kernel]            [k] xfs_dio_write_end_io
>    - 22.49% xfs_dio_write_end_io
>       - 21.79% _raw_spin_lock
>          - 20.97% do_raw_spin_lock
>             - 20.10% __pv_queued_spin_lock_slowpath
>
> IOWs, fio is burning ~14 whole CPUs on this spin lock.
>
> So, do an unlocked check against inode size first, then if we are
> at/beyond EOF, take the spinlock and recheck. This makes the
> spinlock disappear from the overwrite fastpath.
>
> I'd like to report that fixing this makes things go faster. It
> doesn't - it just exposes the XFS_ILOCK as the next severe
> contention point doing extent mapping lookups, and that now burns
> all the 14 CPUs this spinlock was burning.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_file.c | 42 +++++++++++++++++++++++++++++++-----------
>  1 file changed, 31 insertions(+), 11 deletions(-)
>
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 396ef36dcd0a..c068dcd414f4 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -384,21 +384,30 @@ xfs_file_write_checks(
>  		}
>  		goto restart;
>  	}
> +
>  	/*
>  	 * If the offset is beyond the size of the file, we need to zero any
>  	 * blocks that fall between the existing EOF and the start of this
> -	 * write.  If zeroing is needed and we are currently holding the
> -	 * iolock shared, we need to update it to exclusive which implies
> -	 * having to redo all checks before.
> +	 * write.  If zeroing is needed and we are currently holding the iolock
> +	 * shared, we need to update it to exclusive which implies having to
> +	 * redo all checks before.
> +	 *
> +	 * We need to serialise against EOF updates that occur in IO completions
> +	 * here. We want to make sure that nobody is changing the size while we
> +	 * do this check until we have placed an IO barrier (i.e.  hold the
> +	 * XFS_IOLOCK_EXCL) that prevents new IO from being dispatched.  The
> +	 * spinlock effectively forms a memory barrier once we have the
> +	 * XFS_IOLOCK_EXCL so we are guaranteed to see the latest EOF value and
> +	 * hence be able to correctly determine if we need to run zeroing.
>  	 *
> -	 * We need to serialise against EOF updates that occur in IO
> -	 * completions here. We want to make sure that nobody is changing the
> -	 * size while we do this check until we have placed an IO barrier (i.e.
> -	 * hold the XFS_IOLOCK_EXCL) that prevents new IO from being dispatched.
> -	 * The spinlock effectively forms a memory barrier once we have the
> -	 * XFS_IOLOCK_EXCL so we are guaranteed to see the latest EOF value
> -	 * and hence be able to correctly determine if we need to run zeroing.
> +	 * We can do an unlocked check here safely as IO completion can only
> +	 * extend EOF. Truncate is locked out at this point, so the EOF can
> +	 * not move backwards, only forwards. Hence we only need to take the
> +	 * slow path and spin locks when we are at or beyond the current EOF.
>  	 */
> +	if (iocb->ki_pos <= i_size_read(inode))
> +		goto out;
> +
>  	spin_lock(&ip->i_flags_lock);
>  	isize = i_size_read(inode);
>  	if (iocb->ki_pos > isize) {

Hello Dave/Jan,

Sorry about some silly queries here. But locking sometimes can get confusing and
needs a background context/history.

So,
I was going through the XFS DIO path and couldn't fully understand the
difference between how xfs_file_dio_write_unaligned() and
xfs_file_dio_write_aligned() take the XFS iolock (inode rwsem) with
different exclusivity (exclusive vs. shared).

In the xfs_**_unaligned() function, we also check if (ki_pos + count >= isize()).
If so, we take an exclusive iolock.
In the xfs_**_aligned() function, however, we always take a shared iolock.

Can you please help me understand why that is? In the case of an extending
aligned write, won't we need an exclusive iolock for XFS?
Or, IIUC, this exclusive lock is mostly needed to prevent two sub-block zeroing
operations from running in parallel (which could cause corruption if it
happened), and this can only happen with unaligned writes.

Whereas ext4, I guess, still takes an exclusive lock even for aligned extending
writes, possibly because updating inode->i_size and handling orphan inodes
require it to take the inode rwsem exclusively.

For XFS, on the other hand, the inode->i_size update happens under a different
lock, the ip->i_flags_lock spinlock.

Is my understanding complete and correct?
Or did I miss anything here?

Thanks
ritesh


> @@ -426,7 +435,7 @@ xfs_file_write_checks(
>  			drained_dio = true;
>  			goto restart;
>  		}
> -
> +
>  		trace_xfs_zero_eof(ip, isize, iocb->ki_pos - isize);
>  		error = iomap_zero_range(inode, isize, iocb->ki_pos - isize,
>  				NULL, &xfs_buffered_write_iomap_ops);
> @@ -435,6 +444,7 @@ xfs_file_write_checks(
>  	} else
>  		spin_unlock(&ip->i_flags_lock);
>
> +out:
>  	return file_modified(file);
>  }
>
> @@ -500,7 +510,17 @@ xfs_dio_write_end_io(
>  	 * other IO completions here to update the EOF. Failing to serialise
>  	 * here can result in EOF moving backwards and Bad Things Happen when
>  	 * that occurs.
> +	 *
> +	 * As IO completion only ever extends EOF, we can do an unlocked check
> +	 * here to avoid taking the spinlock. If we land within the current EOF,
> +	 * then we do not need to do an extending update at all, and we don't
> +	 * need to take the lock to check this. If we race with an update moving
> +	 * EOF, then we'll either still be beyond EOF and need to take the lock,
> +	 * or we'll be within EOF and we don't need to take it at all.
>  	 */
> +	if (offset + size <= i_size_read(inode))
> +		goto out;
> +
>  	spin_lock(&ip->i_flags_lock);
>  	if (offset + size > i_size_read(inode)) {
>  		i_size_write(inode, offset + size);
> --
> 2.31.1
>


* Re: [PATCH] xfs: don't take a spinlock unconditionally in the DIO fastpath
  2021-05-31 17:58 ` riteshh
@ 2021-06-01 23:15   ` Dave Chinner
  2021-06-03 14:54     ` riteshh
  0 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2021-06-01 23:15 UTC (permalink / raw)
  To: riteshh; +Cc: Jan Kara, linux-xfs

On Mon, May 31, 2021 at 11:28:25PM +0530, riteshh wrote:
> On 21/05/19 11:19AM, Dave Chinner wrote:
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -384,21 +384,30 @@ xfs_file_write_checks(
> >  		}
> >  		goto restart;
> >  	}
> > +
> >  	/*
> >  	 * If the offset is beyond the size of the file, we need to zero any
> >  	 * blocks that fall between the existing EOF and the start of this
> > -	 * write.  If zeroing is needed and we are currently holding the
> > -	 * iolock shared, we need to update it to exclusive which implies
> > -	 * having to redo all checks before.
> > +	 * write.  If zeroing is needed and we are currently holding the iolock
> > +	 * shared, we need to update it to exclusive which implies having to
> > +	 * redo all checks before.
> > +	 *
> > +	 * We need to serialise against EOF updates that occur in IO completions
> > +	 * here. We want to make sure that nobody is changing the size while we
> > +	 * do this check until we have placed an IO barrier (i.e.  hold the
> > +	 * XFS_IOLOCK_EXCL) that prevents new IO from being dispatched.  The
> > +	 * spinlock effectively forms a memory barrier once we have the
> > +	 * XFS_IOLOCK_EXCL so we are guaranteed to see the latest EOF value and
> > +	 * hence be able to correctly determine if we need to run zeroing.
> >  	 *
> > -	 * We need to serialise against EOF updates that occur in IO
> > -	 * completions here. We want to make sure that nobody is changing the
> > -	 * size while we do this check until we have placed an IO barrier (i.e.
> > -	 * hold the XFS_IOLOCK_EXCL) that prevents new IO from being dispatched.
> > -	 * The spinlock effectively forms a memory barrier once we have the
> > -	 * XFS_IOLOCK_EXCL so we are guaranteed to see the latest EOF value
> > -	 * and hence be able to correctly determine if we need to run zeroing.
> > +	 * We can do an unlocked check here safely as IO completion can only
> > +	 * extend EOF. Truncate is locked out at this point, so the EOF can
> > +	 * not move backwards, only forwards. Hence we only need to take the
> > +	 * slow path and spin locks when we are at or beyond the current EOF.
> >  	 */
> > +	if (iocb->ki_pos <= i_size_read(inode))
> > +		goto out;
> > +
> >  	spin_lock(&ip->i_flags_lock);
> >  	isize = i_size_read(inode);
> >  	if (iocb->ki_pos > isize) {
> 
> Hello Dave/Jan,
> 
> Sorry about some silly queries here. But locking sometimes can get confusing and
> needs a background context/history.
> 
> So,
> I was going through the XFS DIO path and I couldn't completely get this below
> difference between xfs_file_dio_write_unaligned() v/s
> xfs_file_dio_write_aligned() checks for taking xfs iolock (inode rwsem)
> with different exclusivity(exclusive v/s shared).
> 
> In the xfs_**_unaligned() function, we also check if (ki_pos + count >= isize()).
> If yes, then we go for an exclusive iolock.
> While in the xfs_**_aligned() function, we always take the shared iolock.
> 
> Can you please help me understand why is that? In case of an extending aligned
> write, won't we need an exclusive iolock for XFS?

No. Extending the file is a slowpath operation which requires
exclusive locking. We always take the shared lock first if we can
because that's the normal fast path operation and so we optimise for
that case.

In the aligned DIO case, we check for sub-block EOF zeroing in
xfs_file_write_checks(). If needed, we upgrade the lock to exclusive
while the EOF zeroing is done. Once we return back to the aligned IO
code, we'll demote that exclusive lock back to shared for the block
aligned IO that we are issuing.

> Or IIUC, this exclusive lock is mostly needed to prevent two sub-block zeroing
> operations from running in parallel (which could cause corruption),
> and this can only happen with unaligned writes.

The exclusive lock is needed for serialising zeroing operations,
whether it be zeroing for EOF extension or sub-block zeroing for
unaligned writes.

The reason for the EOF checks in the unaligned case is right there
in the comment above the EOF checks:

        /*
         * Extending writes need exclusivity because of the sub-block zeroing
         * that the DIO code always does for partial tail blocks beyond EOF, so
         * don't even bother trying the fast path in this case.
         */

IOWs, there is no possible "fast path" shared locking case for
unaligned extending DIOs, so we just take the exclusive lock right
from the start.
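
As a rough illustration of the two cases above, here is a minimal
user-space sketch in C. It is not the XFS code: enum iolock_mode,
struct dio_req, pick_initial_iolock() and needs_eof_zeroing() are
hypothetical names, and the conditions are taken only from what is
described in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for XFS_IOLOCK_SHARED / XFS_IOLOCK_EXCL. */
enum iolock_mode { IOLOCK_SHARED, IOLOCK_EXCL };

/* Hypothetical description of one direct IO write. */
struct dio_req {
	int64_t pos;		/* start offset, like iocb->ki_pos */
	int64_t count;		/* length of the write in bytes */
	bool block_aligned;	/* true if pos and count are block aligned */
};

/*
 * Pick the lock mode to start with. Unaligned writes that reach or pass
 * EOF have no shared fast path, so they start exclusive. Everything else
 * starts shared and only upgrades later if EOF zeroing turns out to be
 * needed.
 */
static enum iolock_mode
pick_initial_iolock(const struct dio_req *req, int64_t isize)
{
	assert(req->pos >= 0 && req->count >= 0);

	if (!req->block_aligned && req->pos + req->count >= isize)
		return IOLOCK_EXCL;
	return IOLOCK_SHARED;
}

/*
 * A write that starts beyond EOF must zero the range between the old EOF
 * and the start of the write, which is what forces the upgrade to the
 * exclusive lock in the aligned case.
 */
static bool
needs_eof_zeroing(const struct dio_req *req, int64_t isize)
{
	return req->pos > isize;
}
```

In this model an aligned write starting beyond EOF still gets
IOLOCK_SHARED from pick_initial_iolock(), and it is the later
needs_eof_zeroing() check that triggers the upgrade.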

> Whereas I guess ext4 still takes the exclusive lock even for aligned extending
> writes, possibly because updating inode->i_size and handling orphan inodes
> require it to take the inode rwsem exclusively.
> 
> While for XFS, the inode->i_size update happens under a different lock, the
> spinlock ip->i_flags_lock.

XFS is serialising DIO completion against DIO submission here rather
than anything else to do with inode size updates which are,
generally speaking, serialised at a higher level by various
combinations of i_rwsem, MMAPLOCK, ILOCK and inode_dio_wait().
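
The completion-vs-submission serialisation on the file size can be
modelled in user space roughly as below. This is only a sketch under
stated assumptions: struct model_inode, model_end_io() and
model_write_checks() are hypothetical names, a pthread mutex stands in
for ip->i_flags_lock, and a C11 atomic stands in for
i_size_read()/i_size_write(). It relies on the same premise as the
patch, namely that the size can only grow while these paths run.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space model of the inode fields involved. */
struct model_inode {
	pthread_mutex_t flags_lock;	/* stands in for ip->i_flags_lock */
	_Atomic int64_t size;		/* stands in for inode->i_size */
};

/*
 * Completion side. Returns true if the lock had to be taken.
 * If the IO ended within the current EOF there is nothing to extend,
 * so the lock is skipped entirely.
 */
static bool
model_end_io(struct model_inode *ip, int64_t offset, int64_t size)
{
	int64_t end = offset + size;

	if (end <= atomic_load(&ip->size))
		return false;		/* unlocked fast path */

	pthread_mutex_lock(&ip->flags_lock);
	/* Recheck under the lock: another completion may have moved EOF. */
	if (end > atomic_load(&ip->size))
		atomic_store(&ip->size, end);
	pthread_mutex_unlock(&ip->flags_lock);
	return true;
}

/*
 * Submission side. Returns true if the write starts beyond EOF, in which
 * case *isize holds the size sampled under the lock. The caller would
 * then go on to the exclusive-lock zeroing slow path.
 */
static bool
model_write_checks(struct model_inode *ip, int64_t pos, int64_t *isize)
{
	bool beyond_eof;

	if (pos <= atomic_load(&ip->size))
		return false;		/* unlocked fast path */

	pthread_mutex_lock(&ip->flags_lock);
	*isize = atomic_load(&ip->size);
	beyond_eof = pos > *isize;
	pthread_mutex_unlock(&ip->flags_lock);
	return beyond_eof;
}
```

Because the size only moves forwards, a stale unlocked read can only
under-estimate EOF. The worst case is taking the lock when it was not
strictly needed, and the recheck under the lock then gives the right
answer.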

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH] xfs: don't take a spinlock unconditionally in the DIO fastpath
  2021-06-01 23:15   ` Dave Chinner
@ 2021-06-03 14:54     ` riteshh
  0 siblings, 0 replies; 10+ messages in thread
From: riteshh @ 2021-06-03 14:54 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Jan Kara, linux-xfs

On 21/06/02 09:15AM, Dave Chinner wrote:
> On Mon, May 31, 2021 at 11:28:25PM +0530, riteshh wrote:
> > On 21/05/19 11:19AM, Dave Chinner wrote:
> > > --- a/fs/xfs/xfs_file.c
> > > +++ b/fs/xfs/xfs_file.c
> > > @@ -384,21 +384,30 @@ xfs_file_write_checks(
> > >  		}
> > >  		goto restart;
> > >  	}
> > > +
> > >  	/*
> > >  	 * If the offset is beyond the size of the file, we need to zero any
> > >  	 * blocks that fall between the existing EOF and the start of this
> > > -	 * write.  If zeroing is needed and we are currently holding the
> > > -	 * iolock shared, we need to update it to exclusive which implies
> > > -	 * having to redo all checks before.
> > > +	 * write.  If zeroing is needed and we are currently holding the iolock
> > > +	 * shared, we need to update it to exclusive which implies having to
> > > +	 * redo all checks before.
> > > +	 *
> > > +	 * We need to serialise against EOF updates that occur in IO completions
> > > +	 * here. We want to make sure that nobody is changing the size while we
> > > +	 * do this check until we have placed an IO barrier (i.e.  hold the
> > > +	 * XFS_IOLOCK_EXCL) that prevents new IO from being dispatched.  The
> > > +	 * spinlock effectively forms a memory barrier once we have the
> > > +	 * XFS_IOLOCK_EXCL so we are guaranteed to see the latest EOF value and
> > > +	 * hence be able to correctly determine if we need to run zeroing.
> > >  	 *
> > > -	 * We need to serialise against EOF updates that occur in IO
> > > -	 * completions here. We want to make sure that nobody is changing the
> > > -	 * size while we do this check until we have placed an IO barrier (i.e.
> > > -	 * hold the XFS_IOLOCK_EXCL) that prevents new IO from being dispatched.
> > > -	 * The spinlock effectively forms a memory barrier once we have the
> > > -	 * XFS_IOLOCK_EXCL so we are guaranteed to see the latest EOF value
> > > -	 * and hence be able to correctly determine if we need to run zeroing.
> > > +	 * We can do an unlocked check here safely as IO completion can only
> > > +	 * extend EOF. Truncate is locked out at this point, so the EOF can
> > > +	 * not move backwards, only forwards. Hence we only need to take the
> > > +	 * slow path and spin locks when we are at or beyond the current EOF.
> > >  	 */
> > > +	if (iocb->ki_pos <= i_size_read(inode))
> > > +		goto out;
> > > +
> > >  	spin_lock(&ip->i_flags_lock);
> > >  	isize = i_size_read(inode);
> > >  	if (iocb->ki_pos > isize) {
> >
> > Hello Dave/Jan,
> >
> > Sorry about some silly queries here. But locking sometimes can get confusing and
> > needs a background context/history.
> >
> > So,
> > I was going through the XFS DIO path and I couldn't completely get this below
> > difference between xfs_file_dio_write_unaligned() v/s
> > xfs_file_dio_write_aligned() checks for taking xfs iolock (inode rwsem)
> > with different exclusivity(exclusive v/s shared).
> >
> > In the xfs_**_unaligned() function, we also check if (ki_pos + count >= isize()).
> > If yes, then we go for an exclusive iolock.
> > While in the xfs_**_aligned() function, we always take the shared iolock.
> >
> > Can you please help me understand why is that? In case of an extending aligned
> > write, won't we need an exclusive iolock for XFS?
>
> No. Extending the file is a slowpath operation which requires
> exclusive locking. We always take the shared lock first if we can
> because that's the normal fast path operation and so we optimise for
> that case.
>
> In the aligned DIO case, we check for sub-block EOF zeroing in
> xfs_file_write_checks(). If needed, we upgrade the lock to exclusive
> while the EOF zeroing is done. Once we return back to the aligned IO
> code, we'll demote that exclusive lock back to shared for the block
> aligned IO that we are issuing.
>
> > Or IIUC, this exclusive lock is mostly needed to prevent two sub-block zeroing
> > operations from running in parallel (which could cause corruption),
> > and this can only happen with unaligned writes.
>
> The exclusive lock is needed for serialising zeroing operations,
> whether it be zeroing for EOF extension or sub-block zeroing for
> unaligned writes.
>
> The reason for the EOF checks in the unaligned case is right there
> in the comment above the EOF checks:
>
>         /*
>          * Extending writes need exclusivity because of the sub-block zeroing
>          * that the DIO code always does for partial tail blocks beyond EOF, so
>          * don't even bother trying the fast path in this case.
>          */
>
> IOWs, there is no possible "fast path" shared locking case for
> unaligned extending DIOs, so we just take the exclusive lock right
> from the start.
>
> > Whereas I guess ext4 still takes the exclusive lock even for aligned extending
> > writes, possibly because updating inode->i_size and handling orphan inodes
> > require it to take the inode rwsem exclusively.
> >
> > While for XFS, the inode->i_size update happens under a different lock, the
> > spinlock ip->i_flags_lock.
>
> XFS is serialising DIO completion against DIO submission here rather
> than anything else to do with inode size updates which are,
> generally speaking, serialised at a higher level by various
> combinations of i_rwsem, MMAPLOCK, ILOCK and inode_dio_wait().

Thanks a lot, Dave, for the detailed explanation.
This makes things quite clear now from the XFS side.

-ritesh

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

* [PATCH] xfs: don't take a spinlock unconditionally in the DIO fastpath
@ 2021-06-02 21:58 Dave Chinner
  2021-06-02 23:00 ` Darrick J. Wong
  0 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2021-06-02 21:58 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Because this happens at high thread counts on high IOPS devices
doing mixed read/write AIO-DIO to a single file at about a million
iops:

   64.09%     0.21%  [kernel]            [k] io_submit_one
   - 63.87% io_submit_one
      - 44.33% aio_write
         - 42.70% xfs_file_write_iter
            - 41.32% xfs_file_dio_write_aligned
               - 25.51% xfs_file_write_checks
                  - 21.60% _raw_spin_lock
                     - 21.59% do_raw_spin_lock
                        - 19.70% __pv_queued_spin_lock_slowpath

This also happens on the IO completion path:

   22.89%     0.69%  [kernel]            [k] xfs_dio_write_end_io
   - 22.49% xfs_dio_write_end_io
      - 21.79% _raw_spin_lock
         - 20.97% do_raw_spin_lock
            - 20.10% __pv_queued_spin_lock_slowpath

IOWs, fio is burning ~14 whole CPUs on this spin lock.

So, do an unlocked check against inode size first, then if we are
at/beyond EOF, take the spinlock and recheck. This makes the
spinlock disappear from the overwrite fastpath.

I'd like to report that fixing this makes things go faster. It
doesn't - it just exposes the XFS_ILOCK as the next severe
contention point doing extent mapping lookups, and that now burns
all the 14 CPUs this spinlock was burning.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
---
 fs/xfs/xfs_file.c | 42 +++++++++++++++++++++++++++++++-----------
 1 file changed, 31 insertions(+), 11 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 396ef36dcd0a..c068dcd414f4 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -384,21 +384,30 @@ xfs_file_write_checks(
 		}
 		goto restart;
 	}
+
 	/*
 	 * If the offset is beyond the size of the file, we need to zero any
 	 * blocks that fall between the existing EOF and the start of this
-	 * write.  If zeroing is needed and we are currently holding the
-	 * iolock shared, we need to update it to exclusive which implies
-	 * having to redo all checks before.
+	 * write.  If zeroing is needed and we are currently holding the iolock
+	 * shared, we need to update it to exclusive which implies having to
+	 * redo all checks before.
+	 *
+	 * We need to serialise against EOF updates that occur in IO completions
+	 * here. We want to make sure that nobody is changing the size while we
+	 * do this check until we have placed an IO barrier (i.e.  hold the
+	 * XFS_IOLOCK_EXCL) that prevents new IO from being dispatched.  The
+	 * spinlock effectively forms a memory barrier once we have the
+	 * XFS_IOLOCK_EXCL so we are guaranteed to see the latest EOF value and
+	 * hence be able to correctly determine if we need to run zeroing.
 	 *
-	 * We need to serialise against EOF updates that occur in IO
-	 * completions here. We want to make sure that nobody is changing the
-	 * size while we do this check until we have placed an IO barrier (i.e.
-	 * hold the XFS_IOLOCK_EXCL) that prevents new IO from being dispatched.
-	 * The spinlock effectively forms a memory barrier once we have the
-	 * XFS_IOLOCK_EXCL so we are guaranteed to see the latest EOF value
-	 * and hence be able to correctly determine if we need to run zeroing.
+	 * We can do an unlocked check here safely as IO completion can only
+	 * extend EOF. Truncate is locked out at this point, so the EOF can
+	 * not move backwards, only forwards. Hence we only need to take the
+	 * slow path and spin locks when we are at or beyond the current EOF.
 	 */
+	if (iocb->ki_pos <= i_size_read(inode))
+		goto out;
+
 	spin_lock(&ip->i_flags_lock);
 	isize = i_size_read(inode);
 	if (iocb->ki_pos > isize) {
@@ -426,7 +435,7 @@ xfs_file_write_checks(
 			drained_dio = true;
 			goto restart;
 		}
-	
+
 		trace_xfs_zero_eof(ip, isize, iocb->ki_pos - isize);
 		error = iomap_zero_range(inode, isize, iocb->ki_pos - isize,
 				NULL, &xfs_buffered_write_iomap_ops);
@@ -435,6 +444,7 @@ xfs_file_write_checks(
 	} else
 		spin_unlock(&ip->i_flags_lock);
 
+out:
 	return file_modified(file);
 }
 
@@ -500,7 +510,17 @@ xfs_dio_write_end_io(
 	 * other IO completions here to update the EOF. Failing to serialise
 	 * here can result in EOF moving backwards and Bad Things Happen when
 	 * that occurs.
+	 *
+	 * As IO completion only ever extends EOF, we can do an unlocked check
+	 * here to avoid taking the spinlock. If we land within the current EOF,
+	 * then we do not need to do an extending update at all, and we don't
+	 * need to take the lock to check this. If we race with an update moving
+	 * EOF, then we'll either still be beyond EOF and need to take the lock,
+	 * or we'll be within EOF and we don't need to take it at all.
 	 */
+	if (offset + size <= i_size_read(inode))
+		goto out;
+
 	spin_lock(&ip->i_flags_lock);
 	if (offset + size > i_size_read(inode)) {
 		i_size_write(inode, offset + size);
-- 
2.31.1


* Re: [PATCH] xfs: don't take a spinlock unconditionally in the DIO fastpath
  2021-06-02 21:58 Dave Chinner
@ 2021-06-02 23:00 ` Darrick J. Wong
  0 siblings, 0 replies; 10+ messages in thread
From: Darrick J. Wong @ 2021-06-02 23:00 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Jun 03, 2021 at 07:58:02AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Because this happens at high thread counts on high IOPS devices
> doing mixed read/write AIO-DIO to a single file at about a million
> iops:
> 
>    64.09%     0.21%  [kernel]            [k] io_submit_one
>    - 63.87% io_submit_one
>       - 44.33% aio_write
>          - 42.70% xfs_file_write_iter
>             - 41.32% xfs_file_dio_write_aligned
>                - 25.51% xfs_file_write_checks
>                   - 21.60% _raw_spin_lock
>                      - 21.59% do_raw_spin_lock
>                         - 19.70% __pv_queued_spin_lock_slowpath
> 
> This also happens on the IO completion path:
> 
>    22.89%     0.69%  [kernel]            [k] xfs_dio_write_end_io
>    - 22.49% xfs_dio_write_end_io
>       - 21.79% _raw_spin_lock
>          - 20.97% do_raw_spin_lock
>             - 20.10% __pv_queued_spin_lock_slowpath
> 
> IOWs, fio is burning ~14 whole CPUs on this spin lock.
> 
> So, do an unlocked check against inode size first, then if we are
> at/beyond EOF, take the spinlock and recheck. This makes the
> spinlock disappear from the overwrite fastpath.
> 
> I'd like to report that fixing this makes things go faster. It
> doesn't - it just exposes the XFS_ILOCK as the next severe
> contention point doing extent mapping lookups, and that now burns
> all the 14 CPUs this spinlock was burning.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>

Looks good,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_file.c | 42 +++++++++++++++++++++++++++++++-----------
>  1 file changed, 31 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 396ef36dcd0a..c068dcd414f4 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -384,21 +384,30 @@ xfs_file_write_checks(
>  		}
>  		goto restart;
>  	}
> +
>  	/*
>  	 * If the offset is beyond the size of the file, we need to zero any
>  	 * blocks that fall between the existing EOF and the start of this
> -	 * write.  If zeroing is needed and we are currently holding the
> -	 * iolock shared, we need to update it to exclusive which implies
> -	 * having to redo all checks before.
> +	 * write.  If zeroing is needed and we are currently holding the iolock
> +	 * shared, we need to update it to exclusive which implies having to
> +	 * redo all checks before.
> +	 *
> +	 * We need to serialise against EOF updates that occur in IO completions
> +	 * here. We want to make sure that nobody is changing the size while we
> +	 * do this check until we have placed an IO barrier (i.e.  hold the
> +	 * XFS_IOLOCK_EXCL) that prevents new IO from being dispatched.  The
> +	 * spinlock effectively forms a memory barrier once we have the
> +	 * XFS_IOLOCK_EXCL so we are guaranteed to see the latest EOF value and
> +	 * hence be able to correctly determine if we need to run zeroing.
>  	 *
> -	 * We need to serialise against EOF updates that occur in IO
> -	 * completions here. We want to make sure that nobody is changing the
> -	 * size while we do this check until we have placed an IO barrier (i.e.
> -	 * hold the XFS_IOLOCK_EXCL) that prevents new IO from being dispatched.
> -	 * The spinlock effectively forms a memory barrier once we have the
> -	 * XFS_IOLOCK_EXCL so we are guaranteed to see the latest EOF value
> -	 * and hence be able to correctly determine if we need to run zeroing.
> +	 * We can do an unlocked check here safely as IO completion can only
> +	 * extend EOF. Truncate is locked out at this point, so the EOF can
> +	 * not move backwards, only forwards. Hence we only need to take the
> +	 * slow path and spin locks when we are at or beyond the current EOF.
>  	 */
> +	if (iocb->ki_pos <= i_size_read(inode))
> +		goto out;
> +
>  	spin_lock(&ip->i_flags_lock);
>  	isize = i_size_read(inode);
>  	if (iocb->ki_pos > isize) {
> @@ -426,7 +435,7 @@ xfs_file_write_checks(
>  			drained_dio = true;
>  			goto restart;
>  		}
> -	
> +
>  		trace_xfs_zero_eof(ip, isize, iocb->ki_pos - isize);
>  		error = iomap_zero_range(inode, isize, iocb->ki_pos - isize,
>  				NULL, &xfs_buffered_write_iomap_ops);
> @@ -435,6 +444,7 @@ xfs_file_write_checks(
>  	} else
>  		spin_unlock(&ip->i_flags_lock);
>  
> +out:
>  	return file_modified(file);
>  }
>  
> @@ -500,7 +510,17 @@ xfs_dio_write_end_io(
>  	 * other IO completions here to update the EOF. Failing to serialise
>  	 * here can result in EOF moving backwards and Bad Things Happen when
>  	 * that occurs.
> +	 *
> +	 * As IO completion only ever extends EOF, we can do an unlocked check
> +	 * here to avoid taking the spinlock. If we land within the current EOF,
> +	 * then we do not need to do an extending update at all, and we don't
> +	 * need to take the lock to check this. If we race with an update moving
> +	 * EOF, then we'll either still be beyond EOF and need to take the lock,
> +	 * or we'll be within EOF and we don't need to take it at all.
>  	 */
> +	if (offset + size <= i_size_read(inode))
> +		goto out;
> +
>  	spin_lock(&ip->i_flags_lock);
>  	if (offset + size > i_size_read(inode)) {
>  		i_size_write(inode, offset + size);
> -- 
> 2.31.1
> 

Thread overview: 10+ messages
2021-05-19  1:19 [PATCH] xfs: don't take a spinlock unconditionally in the DIO fastpath Dave Chinner
2021-05-19  7:59 ` Carlos Maiolino
2021-05-19 12:20   ` Dave Chinner
2021-05-20 23:33 ` Darrick J. Wong
2021-05-25  7:18   ` Dave Chinner
2021-05-31 17:58 ` riteshh
2021-06-01 23:15   ` Dave Chinner
2021-06-03 14:54     ` riteshh
  -- strict thread matches above, loose matches on Subject: below --
2021-06-02 21:58 Dave Chinner
2021-06-02 23:00 ` Darrick J. Wong
