Re: xfs_buf_lock vs aio

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Dave Chinner <david@fromorbit.com>
To: Avi Kivity <avi@scylladb.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: xfs_buf_lock vs aio
Date: Fri, 9 Feb 2018 09:11:53 +1100	[thread overview]
Message-ID: <20180208221153.GF20266@dastard> (raw)
In-Reply-To: <ed83833b-5024-40f2-b335-2dcab7fddf0e@scylladb.com>

On Thu, Feb 08, 2018 at 10:24:11AM +0200, Avi Kivity wrote:
> On 02/08/2018 01:33 AM, Dave Chinner wrote:
> >On Wed, Feb 07, 2018 at 07:20:17PM +0200, Avi Kivity wrote:
> >>As usual, I'm having my lovely io_submit()s sleeping. This time some
> >>detailed traces. 4.14.15.

[....]

> >>Forcing the log, so sleeping with ILOCK taken.
> >Because it's trying to reallocate an extent that is pinned in the
> >log and is marked stale. i.e. we are reallocating a recently freed
> >metadata extent that hasn't been committed to disk yet. IOWs, it's
> >the metadata form of the "log force to clear a busy extent so we can
> >re-use it" condition....
> >
> >There's nothing you can do to reliably avoid this - it's a sign that
> >you're running low on free space in an AG because it's recycling
> >recently freed space faster than the CIL is being committed to disk.
> >
> >You could speed up background journal syncs to try to reduce the
> >async checkpoint latency that allows busy extents to build up
> >(/proc/sys/fs/xfs/xfssyncd_centisecs) but that also impacts on
> >journal overhead and IO latency, etc.
> 
> Perhaps xfs should auto-tune this variable.

That's not a fix. That's a nasty hack that attempts to hide the
underlying problem of selecting AGs and/or free space that requires
a log force to be used instead of finding other, un-encumbered
freespace present in the filesystem.

i.e. We know what the underlying problem is here, and there isn't a
quick fix for it. You're just going to have to live with that until
we work out a reliable, robust way of avoiding having busy extents
block allocations.

> >>reactor-29  3336 [029]  3580.420137: xfs:xfs_ilock: dev 9:0 ino
> >>0x50000a9f flags ILOCK_EXCL caller kretprobe_trampoline
> >>             7fffa0858572 xfs_ilock ([kernel.kallsyms])
> >>             7fff8105f5a0 kretprobe_trampoline ([kernel.kallsyms])
> >>             7fff8127314c file_update_time ([kernel.kallsyms])
> >>             7fffa0849ab9 xfs_file_aio_write_checks ([kernel.kallsyms])
> >>             7fffa0849bce xfs_file_dio_aio_write ([kernel.kallsyms])
> >>             7fffa084a13f xfs_file_write_iter ([kernel.kallsyms])
> >>             7fff812ab24f aio_write ([kernel.kallsyms])
> >>             7fff812ab90e do_io_submit ([kernel.kallsyms])
> >>             7fff812ac4c0 sys_io_submit ([kernel.kallsyms])
> >>             7fff81005917 do_syscall_64 ([kernel.kallsyms])
> >>             7fff81802115 return_from_SYSCALL_64 ([kernel.kallsyms])
> >>io_submit() stalls. Later, task 8146 relinquishes the lock, which
> >>(after passing through another worker thread), is finally acquired
> >>by io_submit() which the continues.
> >Yup, so it's the timestamp updates at IO submission that are
> >blocking waiting for the *ILOCK* lock.
> >
> >[.....]
> >
> >>The other is to check that ILOCK can be taken exclusively in
> >>xfs_file_dio_aio_write, if IOCB_NOWAIT. If this is the way to go, I
> >>might venture a patch.
> >You're getting your locks mixed up - xfs_file_dio_aio_write()
> >doesn't ever take the ILOCK directly.  It takes the _IOLOCK_
> >directly, and that's what IOCB_NOWAIT acts on.
> 
> It also takes ILOCK by calling file_update_time(), as seen in the
> trace above. I've seen that it does a _nowait acquisition of the
> IOLOCK, but that's not enough.
> 
> If we could pass the iocb to file_update_time() then
> xfs_vn_update_time, then it could perform a _nowait acquisition of
> ILOCK. Otherwise, we can just check if ILOCK is acquirable in
> xfs_file_aio_write_checks(). That's racy, but better than nothing.

Sure, but that's not the issue with such suggestions. I'm not about
to propose a patch to hack an iocb through generic infrastructure
that doesn't need an iocb. Doing stuff like that will only get me
shouted at because it's a massive layering violation and I should
(and do!) know better....

> >require the _ILOCK_ to be taken exclusively, and there's no way
> >of knowing if that is going to be needed at the level of
> >xfs_file_dio_aio_write().
> >
> >There are hooks in xfs_file_aio_write_checks() to avoid timestamp
> >updates on write (FMODE_NOCMTIME) but there is no way to set this
> >from userspace. It's used exclusively by the XFS open-by-handle
> >code so that utilities like online defrag can write data into
> >files without changing their user-visible timestamps.
> >
> 
> xfs_file_aio_write_checks() knows its going to call
> file_update_time(), and it can guess that file_update_time() will
> call xfs_vn_update_time(). So just before that call, check that
> ILOCK is free and EAGAIN if not.  Maybe that patch won't win any
> beauty contests but it will reduce some of my pain.

Again, grose, unmaintable layering violations. Expeditious solution
to cater for your immediate need - you're welcome to do this with
your own kernels, but it's not a solution we can really carry
upstream.

And that brings me to the real issue here: If your requirement
really is "never block io_submit(), ever", then the correct change
to make is to the aio code so that it will /always queue IO/ and
never attempt to submit it directly. If you get that done, then we
can stop playing this tiresome whack-a-mole game in XFS.....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

next prev parent reply	other threads:[~2018-02-08 22:11 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-02-07 17:20 xfs_buf_lock vs aio Avi Kivity
2018-02-07 23:33 ` Dave Chinner
2018-02-08  8:24   ` Avi Kivity
2018-02-08 22:11     ` Dave Chinner [this message]
2018-02-09 12:11       ` Avi Kivity
2018-02-09 23:10         ` Dave Chinner
2018-02-12  9:33           ` Avi Kivity
2018-02-13  5:18             ` Dave Chinner
2018-02-13 23:14               ` Darrick J. Wong
2018-02-14  2:16                 ` Dave Chinner
2018-02-14 12:01                   ` Avi Kivity
2018-02-14 12:07               ` Avi Kivity
2018-02-14 12:18                 ` Avi Kivity
2018-02-14 23:56                 ` Dave Chinner
2018-02-15  9:36                   ` Avi Kivity
2018-02-15 21:30                     ` Dave Chinner
2018-02-16  8:07                       ` Avi Kivity
2018-02-19  2:40                         ` Dave Chinner
2018-02-19  4:48                           ` Dave Chinner
2018-02-25 17:47                           ` Avi Kivity

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180208221153.GF20266@dastard \
    --to=david@fromorbit.com \
    --cc=avi@scylladb.com \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.