linux-xfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: Avi Kivity <avi@scylladb.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: xfs_buf_lock vs aio
Date: Fri, 9 Feb 2018 09:11:53 +1100	[thread overview]
Message-ID: <20180208221153.GF20266@dastard> (raw)
In-Reply-To: <ed83833b-5024-40f2-b335-2dcab7fddf0e@scylladb.com>

On Thu, Feb 08, 2018 at 10:24:11AM +0200, Avi Kivity wrote:
> On 02/08/2018 01:33 AM, Dave Chinner wrote:
> >On Wed, Feb 07, 2018 at 07:20:17PM +0200, Avi Kivity wrote:
> >>As usual, I'm having my lovely io_submit()s sleeping. This time some
> >>detailed traces. 4.14.15.

[....]

> >>Forcing the log, so sleeping with ILOCK taken.
> >Because it's trying to reallocate an extent that is pinned in the
> >log and is marked stale. i.e. we are reallocating a recently freed
> >metadata extent that hasn't been committed to disk yet. IOWs, it's
> >the metadata form of the "log force to clear a busy extent so we can
> >re-use it" condition....
> >
> >There's nothing you can do to reliably avoid this - it's a sign that
> >you're running low on free space in an AG because it's recycling
> >recently freed space faster than the CIL is being committed to disk.
> >
> >You could speed up background journal syncs to try to reduce the
> >async checkpoint latency that allows busy extents to build up
> >(/proc/sys/fs/xfs/xfssyncd_centisecs) but that also impacts on
> >journal overhead and IO latency, etc.
> 
> Perhaps xfs should auto-tune this variable.

That's not a fix. That's a nasty hack that attempts to hide the
underlying problem of selecting AGs and/or free space that requires
a log force to be used instead of finding other, un-encumbered
freespace present in the filesystem.

i.e. We know what the underlying problem is here, and there isn't a
quick fix for it. You're just going to have to live with that until
we work out a reliable, robust way of avoiding having busy extents
block allocations.

> >>reactor-29  3336 [029]  3580.420137: xfs:xfs_ilock: dev 9:0 ino
> >>0x50000a9f flags ILOCK_EXCL caller kretprobe_trampoline
> >>             7fffa0858572 xfs_ilock ([kernel.kallsyms])
> >>             7fff8105f5a0 kretprobe_trampoline ([kernel.kallsyms])
> >>             7fff8127314c file_update_time ([kernel.kallsyms])
> >>             7fffa0849ab9 xfs_file_aio_write_checks ([kernel.kallsyms])
> >>             7fffa0849bce xfs_file_dio_aio_write ([kernel.kallsyms])
> >>             7fffa084a13f xfs_file_write_iter ([kernel.kallsyms])
> >>             7fff812ab24f aio_write ([kernel.kallsyms])
> >>             7fff812ab90e do_io_submit ([kernel.kallsyms])
> >>             7fff812ac4c0 sys_io_submit ([kernel.kallsyms])
> >>             7fff81005917 do_syscall_64 ([kernel.kallsyms])
> >>             7fff81802115 return_from_SYSCALL_64 ([kernel.kallsyms])
> >>io_submit() stalls. Later, task 8146 relinquishes the lock, which
> >>(after passing through another worker thread), is finally acquired
> >>by io_submit() which the continues.
> >Yup, so it's the timestamp updates at IO submission that are
> >blocking waiting for the *ILOCK* lock.
> >
> >[.....]
> >
> >>The other is to check that ILOCK can be taken exclusively in
> >>xfs_file_dio_aio_write, if IOCB_NOWAIT. If this is the way to go, I
> >>might venture a patch.
> >You're getting your locks mixed up - xfs_file_dio_aio_write()
> >doesn't ever take the ILOCK directly.  It takes the _IOLOCK_
> >directly, and that's what IOCB_NOWAIT acts on.
> 
> It also takes ILOCK by calling file_update_time(), as seen in the
> trace above. I've seen that it does a _nowait acquisition of the
> IOLOCK, but that's not enough.
> 
> If we could pass the iocb to file_update_time() then
> xfs_vn_update_time, then it could perform a _nowait acquisition of
> ILOCK. Otherwise, we can just check if ILOCK is acquirable in
> xfs_file_aio_write_checks(). That's racy, but better than nothing.

Sure, but that's not the issue with such suggestions. I'm not about
to propose a patch to hack an iocb through generic infrastructure
that doesn't need an iocb. Doing stuff like that will only get me
shouted at because it's a massive layering violation and I should
(and do!) know better....

> >require the _ILOCK_ to be taken exclusively, and there's no way
> >of knowing if that is going to be needed at the level of
> >xfs_file_dio_aio_write().
> >
> >There are hooks in xfs_file_aio_write_checks() to avoid timestamp
> >updates on write (FMODE_NOCMTIME) but there is no way to set this
> >from userspace. It's used exclusively by the XFS open-by-handle
> >code so that utilities like online defrag can write data into
> >files without changing their user-visible timestamps.
> >
> 
> xfs_file_aio_write_checks() knows its going to call
> file_update_time(), and it can guess that file_update_time() will
> call xfs_vn_update_time(). So just before that call, check that
> ILOCK is free and EAGAIN if not.  Maybe that patch won't win any
> beauty contests but it will reduce some of my pain.

Again, grose, unmaintable layering violations. Expeditious solution
to cater for your immediate need - you're welcome to do this with
your own kernels, but it's not a solution we can really carry
upstream.

And that brings me to the real issue here: If your requirement
really is "never block io_submit(), ever", then the correct change
to make is to the aio code so that it will /always queue IO/ and
never attempt to submit it directly. If you get that done, then we
can stop playing this tiresome whack-a-mole game in XFS.....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

  reply	other threads:[~2018-02-08 22:11 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-02-07 17:20 xfs_buf_lock vs aio Avi Kivity
2018-02-07 23:33 ` Dave Chinner
2018-02-08  8:24   ` Avi Kivity
2018-02-08 22:11     ` Dave Chinner [this message]
2018-02-09 12:11       ` Avi Kivity
2018-02-09 23:10         ` Dave Chinner
2018-02-12  9:33           ` Avi Kivity
2018-02-13  5:18             ` Dave Chinner
2018-02-13 23:14               ` Darrick J. Wong
2018-02-14  2:16                 ` Dave Chinner
2018-02-14 12:01                   ` Avi Kivity
2018-02-14 12:07               ` Avi Kivity
2018-02-14 12:18                 ` Avi Kivity
2018-02-14 23:56                 ` Dave Chinner
2018-02-15  9:36                   ` Avi Kivity
2018-02-15 21:30                     ` Dave Chinner
2018-02-16  8:07                       ` Avi Kivity
2018-02-19  2:40                         ` Dave Chinner
2018-02-19  4:48                           ` Dave Chinner
2018-02-25 17:47                           ` Avi Kivity

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180208221153.GF20266@dastard \
    --to=david@fromorbit.com \
    --cc=avi@scylladb.com \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).