public inbox for linux-xfs@vger.kernel.org
From: Brian Foster <bfoster@redhat.com>
To: Avi Kivity <avi@scylladb.com>
Cc: Glauber Costa <glauber@scylladb.com>, xfs@oss.sgi.com
Subject: Re: sleeps and waits during io_submit
Date: Tue, 1 Dec 2015 19:13:29 -0500	[thread overview]
Message-ID: <20151202001329.GA9633@bfoster.bfoster> (raw)
In-Reply-To: <565DF472.8080101@scylladb.com>

On Tue, Dec 01, 2015 at 09:26:42PM +0200, Avi Kivity wrote:
> On 12/01/2015 08:51 PM, Brian Foster wrote:
> >On Tue, Dec 01, 2015 at 07:09:29PM +0200, Avi Kivity wrote:
> >>
> >>On 12/01/2015 06:29 PM, Brian Foster wrote:
> >>>On Tue, Dec 01, 2015 at 06:08:51PM +0200, Avi Kivity wrote:
> >>>>On 12/01/2015 06:01 PM, Brian Foster wrote:
> >>>>>On Tue, Dec 01, 2015 at 05:22:38PM +0200, Avi Kivity wrote:
> >>>>>>On 12/01/2015 04:56 PM, Brian Foster wrote:
> >>>>>>>On Tue, Dec 01, 2015 at 03:58:28PM +0200, Avi Kivity wrote:
> >>>>>>>>On 12/01/2015 03:11 PM, Brian Foster wrote:
> >>>>>>>>>On Tue, Dec 01, 2015 at 11:08:47AM +0200, Avi Kivity wrote:
> >>>>>>>>>>On 11/30/2015 06:14 PM, Brian Foster wrote:
> >>>>>>>>>>>On Mon, Nov 30, 2015 at 04:29:13PM +0200, Avi Kivity wrote:
> >>>>>>>>>>>>On 11/30/2015 04:10 PM, Brian Foster wrote:
...
> >>The case of waiting for I/O is much more worrying, because I/O latencies are
> >>much higher.  But it seems like most of the DIO path does not trigger
> >>locking around I/O (and we are careful to avoid the ones that do, like
> >>writing beyond eof).
> >>
> >>(sorry for repeating myself, I have the feeling we are talking past each
> >>other and want to be on the same page)
> >>
> >Yeah, my point is just that the thread blocking on I/O doesn't mean
> >the cpu can't carry on with some useful work for another task.
> 
> In our case, there is no other task.  We run one thread per logical core, so
> if that thread gets blocked, the cpu idles.
> 
> The whole point of io_submit() is to issue an I/O and let the caller
> continue processing immediately.  It is the equivalent of O_NONBLOCK for
> networking code.  If O_NONBLOCK did block from time to time, practically all
> modern network applications would see a huge performance drop.
> 

Ok, but my understanding is that O_NONBLOCK would return an error code
in the blocking case such that userspace can do something else or retry
from a blockable context. I think this is similar to what hch posted
with respect to the pwritev2() bits for nonblocking buffered I/O, or
what I was asking about earlier with regard to returning an error if
some blocking would otherwise occur.
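For comparison, the nonblocking-socket behavior being described here can
be demonstrated from userspace. This is a minimal sketch using a pipe as
a stand-in for a socket: the write never sleeps; once the buffer fills,
the caller gets EAGAIN and can retry from a blockable context.

```python
import errno
import os

# Create a pipe and make the write end nonblocking, as O_NONBLOCK
# would for a socket.
r, w = os.pipe()
os.set_blocking(w, False)

got_eagain = False
try:
    # Keep writing until the pipe buffer fills; instead of blocking,
    # the kernel then fails the write with EAGAIN.
    while True:
        os.write(w, b"x" * 65536)
except BlockingIOError as e:
    got_eagain = (e.errno == errno.EAGAIN)

print(got_eagain)  # True
os.close(r)
os.close(w)
```

The point of the pattern is that the error return, not sleeping, is the
contract: the caller decides when and where to retry.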

> >
> >>>>>  We submit an I/O which is
> >>>>>asynchronous in nature and wait on a completion, which causes the cpu to
> >>>>>schedule and execute another task until the completion is set by I/O
> >>>>>completion (via an async callback). At that point, the issuing thread
> >>>>>continues where it left off. I suspect I'm missing something... can you
> >>>>>elaborate on what you'd do differently here (and how it helps)?
> >>>>Just apply the same technique everywhere: convert locks to trylock +
> >>>>schedule a continuation on failure.
> >>>>
> >>>I'm certainly not an expert on the kernel scheduling, locking and
> >>>serialization mechanisms, but my understanding is that most things
> >>>outside of spin locks are reschedule points. For example, the
> >>>wait_for_completion() calls XFS uses to wait on I/O boil down to
> >>>schedule_timeout() calls. Buffer locks are implemented as semaphores and
> >>>down() can end up in the same place.
> >>But, for the most part, XFS seems to be able to avoid sleeping.  The call to
> >>__blockdev_direct_IO only launches the I/O, so any locking is only around
> >>cpu operations and, unless there is contention, won't cause us to sleep in
> >>io_submit().
> >>
> >>Trying to follow the code, it looks like xfs_get_blocks_direct (and
> >>__blockdev_direct_IO's get_block parameter in general) is synchronous, so
> >>we're just lucky to have everything in cache.  If it isn't, we block right
> >>there.  I really hope I'm misreading this and some other magic is happening
> >>elsewhere instead of this.
> >>
> >Nope, it's synchronous from a code perspective. The
> >xfs_bmapi_read()->xfs_iread_extents() path could have to read in the
> >inode bmap metadata if it hasn't been done already. Note that this
> >should only happen once as everything is stored in-core, so in most
> >cases this is skipped. It's also possible extents are read in via some
> >other path/operation on the inode before an async I/O happens to be
> >submitted (e.g., see some of the other xfs_bmapi_read() callers).
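The trylock-plus-continuation approach suggested earlier in the thread
can be sketched in userspace terms. All names here (with_lock, drain,
deferred) are invented for illustration and say nothing about how the
kernel would actually structure this:

```python
import threading
from collections import deque

lock = threading.Lock()
deferred = deque()   # continuations waiting for the lock
log = []

def with_lock(continuation):
    # Trylock: never sleep. On contention, queue the rest of the
    # operation and return to the caller immediately.
    if lock.acquire(blocking=False):
        try:
            continuation()
        finally:
            lock.release()
    else:
        deferred.append(continuation)

def drain():
    # Run when the lock holder releases: retry deferred continuations.
    while deferred:
        with_lock(deferred.popleft())

with_lock(lambda: log.append("first"))   # lock free: runs inline

lock.acquire()                           # simulate contention
with_lock(lambda: log.append("second"))  # deferred, submitter not blocked
lock.release()
drain()
print(log)  # ['first', 'second']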
> 
> Is there (could we add) some ioctl to prime this cache?  We could call it
> from a worker thread where we don't mind blocking during open.
> 

I suppose that's possible, or the worker thread could perform some
existing operation known to prime the cache. I don't think it's worth
getting into without a concrete example, however. The extent read
example we're batting around might never be a problem in practice (as
you've noted, due to file size) if files are truncated and recycled,
for example.
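A worker-thread priming pass along those lines might look like the
following sketch. Whether a single small read is actually enough to pull
the bmap metadata in-core on a given kernel is an assumption here, not a
guarantee:

```python
import os
import tempfile
import threading

primed = []

def prime_extent_cache(path, out):
    # Runs on a worker thread where blocking during open is acceptable;
    # the read (hopefully) faults the extent metadata in-core once, so
    # later async submits from the reactor thread find it cached.
    fd = os.open(path, os.O_RDONLY)
    try:
        out.append(os.pread(fd, 1, 0))  # may sleep; that's fine here
    finally:
        os.close(fd)

# Demo against a scratch file standing in for a data file.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"hello")
    path = f.name

t = threading.Thread(target=prime_extent_cache, args=(path, primed))
t.start()
t.join()
os.unlink(path)
print(primed)  # [b'h']
```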

> What is the eviction policy for this cache?   Is it simply the block
> device's page cache?
> 

IIUC the extent list stays around until the inode is reclaimed. There's
a separate buffer cache for metadata buffers. Both types of objects
would be reclaimed based on memory pressure.

> What about the write path, will we see the same problems there?  I would
> guess the problem is less severe there if the metadata is written with
> writeback policy.
> 

Metadata is modified in-core and handed off to the logging
infrastructure via a transaction. The log is flushed to disk some time
later and metadata writeback occurs asynchronously via the xfsaild
thread.

Brian

> >
> >Either way, the extents have to be read in at some point and I'd expect
> >that cpu to schedule onto some other task while that thread waits on I/O
> >to complete (read-ahead could also be a factor here, but I haven't
> >really dug into how that is triggered for buffers).
> 
> To provide an example, our application, which is a database, faces this
> exact problem at a higher level.  Data is stored in data files, and data
> items' locations are stored in index files. When we read a bit of data, we
> issue an index read, and pass it a continuation to be executed when the read
> completes.  This latter continuation parses the data and passes it to the
> code that prepares it for merging with data from other data files, and an
> eventual return to the user.
> 
> Having written code for over a year in this style, I've come to expect it to
> be used everywhere asynchronous I/O is used, but I realize it is fairly hard
> without good support from a framework that allows continuations to be
> composed in a natural way.
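As a toy model of the continuation style described above (all names
invented; a real reactor would reap completions from the kernel rather
than a local queue):

```python
from collections import deque

pending = deque()   # completed "I/Os" waiting for their continuation
results = []

def read_async(data, continuation):
    # Stand-in for io_submit(): record the I/O and the continuation to
    # run on completion, then return to the caller immediately.
    pending.append((data, continuation))

def run():
    # The reactor loop: dispatch each completion to its continuation.
    while pending:
        data, cont = pending.popleft()
        cont(data)

# Stage 1: the index read completes; its continuation parses the entry
# and chains the data read.
def on_index_read(index_bytes):
    offset = int(index_bytes)
    read_async(f"data@{offset}", on_data_read)

# Stage 2: the data read completes; hand the item to the merge /
# return-to-user stage.
def on_data_read(data_bytes):
    results.append(data_bytes)

read_async("42", on_index_read)
run()
print(results)  # ['data@42']
```

No stage ever sleeps; each one either finishes or schedules the next
continuation, which is exactly why a blocking io_submit() breaks the
model.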
> 
> 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


Thread overview: 58+ messages
2015-11-28  2:43 sleeps and waits during io_submit Glauber Costa
2015-11-30 14:10 ` Brian Foster
2015-11-30 14:29   ` Avi Kivity
2015-11-30 16:14     ` Brian Foster
2015-12-01  9:08       ` Avi Kivity
2015-12-01 13:11         ` Brian Foster
2015-12-01 13:58           ` Avi Kivity
2015-12-01 14:01             ` Glauber Costa
2015-12-01 14:37               ` Avi Kivity
2015-12-01 20:45               ` Dave Chinner
2015-12-01 20:56                 ` Avi Kivity
2015-12-01 23:41                   ` Dave Chinner
2015-12-02  8:23                     ` Avi Kivity
2015-12-01 14:56             ` Brian Foster
2015-12-01 15:22               ` Avi Kivity
2015-12-01 16:01                 ` Brian Foster
2015-12-01 16:08                   ` Avi Kivity
2015-12-01 16:29                     ` Brian Foster
2015-12-01 17:09                       ` Avi Kivity
2015-12-01 18:03                         ` Carlos Maiolino
2015-12-01 19:07                           ` Avi Kivity
2015-12-01 21:19                             ` Dave Chinner
2015-12-01 21:38                               ` Avi Kivity
2015-12-01 23:06                                 ` Dave Chinner
2015-12-02  9:02                                   ` Avi Kivity
2015-12-02 12:57                                     ` Carlos Maiolino
2015-12-02 23:19                                     ` Dave Chinner
2015-12-03 12:52                                       ` Avi Kivity
2015-12-04  3:16                                         ` Dave Chinner
2015-12-08 13:52                                           ` Avi Kivity
2015-12-08 23:13                                             ` Dave Chinner
2015-12-01 18:51                         ` Brian Foster
2015-12-01 19:07                           ` Glauber Costa
2015-12-01 19:35                             ` Brian Foster
2015-12-01 19:45                               ` Avi Kivity
2015-12-01 19:26                           ` Avi Kivity
2015-12-01 19:41                             ` Christoph Hellwig
2015-12-01 19:50                               ` Avi Kivity
2015-12-02  0:13                             ` Brian Foster [this message]
2015-12-02  0:57                               ` Dave Chinner
2015-12-02  8:38                                 ` Avi Kivity
2015-12-02  8:34                               ` Avi Kivity
2015-12-08  6:03                                 ` Dave Chinner
2015-12-08 13:56                                   ` Avi Kivity
2015-12-08 23:32                                     ` Dave Chinner
2015-12-09  8:37                                       ` Avi Kivity
2015-12-01 21:04                 ` Dave Chinner
2015-12-01 21:10                   ` Glauber Costa
2015-12-01 21:39                     ` Dave Chinner
2015-12-01 21:24                   ` Avi Kivity
2015-12-01 21:31                     ` Glauber Costa
2015-11-30 15:49   ` Glauber Costa
2015-12-01 13:11     ` Brian Foster
2015-12-01 13:39       ` Glauber Costa
2015-12-01 14:02         ` Brian Foster
2015-11-30 23:10 ` Dave Chinner
2015-11-30 23:51   ` Glauber Costa
2015-12-01 20:30     ` Dave Chinner
