From: "Darrick J. Wong" <djwong@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH] xfs: don't take a spinlock unconditionally in the DIO fastpath
Date: Wed, 2 Jun 2021 16:00:51 -0700 [thread overview]
Message-ID: <20210602230051.GO26380@locust> (raw)
In-Reply-To: <20210602215802.24753-1-david@fromorbit.com>
On Thu, Jun 03, 2021 at 07:58:02AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> Because this happens at high thread counts on high IOPS devices
> doing mixed read/write AIO-DIO to a single file at about a million
> iops:
>
> 64.09% 0.21% [kernel] [k] io_submit_one
> - 63.87% io_submit_one
> - 44.33% aio_write
> - 42.70% xfs_file_write_iter
> - 41.32% xfs_file_dio_write_aligned
> - 25.51% xfs_file_write_checks
> - 21.60% _raw_spin_lock
> - 21.59% do_raw_spin_lock
> - 19.70% __pv_queued_spin_lock_slowpath
>
> This also happens of the IO completion IO path:
>
> 22.89% 0.69% [kernel] [k] xfs_dio_write_end_io
> - 22.49% xfs_dio_write_end_io
> - 21.79% _raw_spin_lock
> - 20.97% do_raw_spin_lock
> - 20.10% __pv_queued_spin_lock_slowpath
>
> IOWs, fio is burning ~14 whole CPUs on this spin lock.
>
> So, do an unlocked check against inode size first, then if we are
> at/beyond EOF, take the spinlock and recheck. This makes the
> spinlock disappear from the overwrite fastpath.
>
> I'd like to report that fixing this makes things go faster. It
> doesn't - it just exposes the the XFS_ILOCK as the next severe
> contention point doing extent mapping lookups, and that now burns
> all the 14 CPUs this spinlock was burning.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Looks good,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
--D
> ---
> fs/xfs/xfs_file.c | 42 +++++++++++++++++++++++++++++++-----------
> 1 file changed, 31 insertions(+), 11 deletions(-)
>
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 396ef36dcd0a..c068dcd414f4 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -384,21 +384,30 @@ xfs_file_write_checks(
> }
> goto restart;
> }
> +
> /*
> * If the offset is beyond the size of the file, we need to zero any
> * blocks that fall between the existing EOF and the start of this
> - * write. If zeroing is needed and we are currently holding the
> - * iolock shared, we need to update it to exclusive which implies
> - * having to redo all checks before.
> + * write. If zeroing is needed and we are currently holding the iolock
> + * shared, we need to update it to exclusive which implies having to
> + * redo all checks before.
> + *
> + * We need to serialise against EOF updates that occur in IO completions
> + * here. We want to make sure that nobody is changing the size while we
> + * do this check until we have placed an IO barrier (i.e. hold the
> + * XFS_IOLOCK_EXCL) that prevents new IO from being dispatched. The
> + * spinlock effectively forms a memory barrier once we have the
> + * XFS_IOLOCK_EXCL so we are guaranteed to see the latest EOF value and
> + * hence be able to correctly determine if we need to run zeroing.
> *
> - * We need to serialise against EOF updates that occur in IO
> - * completions here. We want to make sure that nobody is changing the
> - * size while we do this check until we have placed an IO barrier (i.e.
> - * hold the XFS_IOLOCK_EXCL) that prevents new IO from being dispatched.
> - * The spinlock effectively forms a memory barrier once we have the
> - * XFS_IOLOCK_EXCL so we are guaranteed to see the latest EOF value
> - * and hence be able to correctly determine if we need to run zeroing.
> + * We can do an unlocked check here safely as IO completion can only
> + * extend EOF. Truncate is locked out at this point, so the EOF can
> + * not move backwards, only forwards. Hence we only need to take the
> + * slow path and spin locks when we are at or beyond the current EOF.
> */
> + if (iocb->ki_pos <= i_size_read(inode))
> + goto out;
> +
> spin_lock(&ip->i_flags_lock);
> isize = i_size_read(inode);
> if (iocb->ki_pos > isize) {
> @@ -426,7 +435,7 @@ xfs_file_write_checks(
> drained_dio = true;
> goto restart;
> }
> -
> +
> trace_xfs_zero_eof(ip, isize, iocb->ki_pos - isize);
> error = iomap_zero_range(inode, isize, iocb->ki_pos - isize,
> NULL, &xfs_buffered_write_iomap_ops);
> @@ -435,6 +444,7 @@ xfs_file_write_checks(
> } else
> spin_unlock(&ip->i_flags_lock);
>
> +out:
> return file_modified(file);
> }
>
> @@ -500,7 +510,17 @@ xfs_dio_write_end_io(
> * other IO completions here to update the EOF. Failing to serialise
> * here can result in EOF moving backwards and Bad Things Happen when
> * that occurs.
> + *
> + * As IO completion only ever extends EOF, we can do an unlocked check
> + * here to avoid taking the spinlock. If we land within the current EOF,
> + * then we do not need to do an extending update at all, and we don't
> + * need to take the lock to check this. If we race with an update moving
> + * EOF, then we'll either still be beyond EOF and need to take the lock,
> + * or we'll be within EOF and we don't need to take it at all.
> */
> + if (offset + size <= i_size_read(inode))
> + goto out;
> +
> spin_lock(&ip->i_flags_lock);
> if (offset + size > i_size_read(inode)) {
> i_size_write(inode, offset + size);
> --
> 2.31.1
>
next prev parent reply other threads:[~2021-06-02 23:00 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-06-02 21:58 [PATCH] xfs: don't take a spinlock unconditionally in the DIO fastpath Dave Chinner
2021-06-02 23:00 ` Darrick J. Wong [this message]
-- strict thread matches above, loose matches on Subject: below --
2021-05-19 1:19 Dave Chinner
2021-05-19 7:59 ` Carlos Maiolino
2021-05-19 12:20 ` Dave Chinner
2021-05-20 23:33 ` Darrick J. Wong
2021-05-25 7:18 ` Dave Chinner
2021-05-31 17:58 ` riteshh
2021-06-01 23:15 ` Dave Chinner
2021-06-03 14:54 ` riteshh
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20210602230051.GO26380@locust \
--to=djwong@kernel.org \
--cc=david@fromorbit.com \
--cc=linux-xfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox