From: Brian Foster <bfoster@redhat.com>
To: Avi Kivity <avi@scylladb.com>
Cc: Glauber Costa <glauber@scylladb.com>, xfs@oss.sgi.com
Subject: Re: sleeps and waits during io_submit
Date: Tue, 1 Dec 2015 13:51:13 -0500
Message-ID: <20151201185113.GG26129@bfoster.bfoster>
In-Reply-To: <565DD449.5090101@scylladb.com>

On Tue, Dec 01, 2015 at 07:09:29PM +0200, Avi Kivity wrote:
>
>
> On 12/01/2015 06:29 PM, Brian Foster wrote:
> >On Tue, Dec 01, 2015 at 06:08:51PM +0200, Avi Kivity wrote:
> >>
> >>On 12/01/2015 06:01 PM, Brian Foster wrote:
> >>>On Tue, Dec 01, 2015 at 05:22:38PM +0200, Avi Kivity wrote:
> >>>>On 12/01/2015 04:56 PM, Brian Foster wrote:
> >>>>>On Tue, Dec 01, 2015 at 03:58:28PM +0200, Avi Kivity wrote:
> >>>>>>On 12/01/2015 03:11 PM, Brian Foster wrote:
> >>>>>>>On Tue, Dec 01, 2015 at 11:08:47AM +0200, Avi Kivity wrote:
> >>>>>>>>On 11/30/2015 06:14 PM, Brian Foster wrote:
> >>>>>>>>>On Mon, Nov 30, 2015 at 04:29:13PM +0200, Avi Kivity wrote:
> >>>>>>>>>>On 11/30/2015 04:10 PM, Brian Foster wrote:
> >>>>>>>...
...
> >>>>>>Won't io_submit() also trigger metadata I/O? Or is that all deferred to
> >>>>>>async tasks? I don't mind them blocking each other as long as they let my
> >>>>>>io_submit alone.
> >>>>>>
> >>>>>Yeah, it can trigger metadata reads, force the log (the stale buffer
> >>>>>example) or push the AIL (wait on log space). Metadata changes made
> >>>>>directly via your I/O request are logged/committed via transactions,
> >>>>>which are generally processed asynchronously from that point on.
> >>>>>
> >>>>>>> io_submit() can probably block in a variety of
> >>>>>>>places afaict... it might have to read in the inode extent map, allocate
> >>>>>>>blocks, take inode/ag locks, reserve log space for transactions, etc.
> >>>>>>Any chance of changing all that to be asynchronous? Doesn't sound too hard,
> >>>>>>if somebody else has to do it.
> >>>>>>
> >>>>>I'm not following... if the fs needs to read in the inode extent map to
> >>>>>prepare for an allocation, what else can the thread do but wait? Are you
> >>>>>suggesting the request kick off whatever the blocking action happens to
> >>>>>be asynchronously and return with an error such that the request can be
> >>>>>retried later?
> >>>>Not quite, it should be invisible to the caller.
> >>>>
> >>>>That is, the code called by io_submit() (file_operations::write_iter, it
> >>>>seems to be called today) can kick off this operation and have it continue
> >>>>from where it left off.
> >>>Isn't that generally what happens today?
> >>You tell me. According to $subject, apparently not enough. Maybe we're
> >>triggering it more often, or we suffer more when it does trigger (the latter
> >>probably more likely).
> >>
> >The original mail describes looking at the sched:sched_switch tracepoint
> >which, on a quick look, appears to fire whenever a cpu context switch
> >occurs. This likely triggers any time we wait on an I/O or a contended
> >lock (among other situations I'm sure), and it signifies that something
> >else is going to execute in our place until this thread can make
> >progress.
>
> For us, nothing else can execute in our place, we usually have exactly one
> thread per logical core. So we are heavily dependent on io_submit not
> sleeping.
>
Yes, this "coroutine model" makes more sense to me from the application
perspective. I'm just trying to understand what you're after from the
kernel perspective.

> The case of a contended lock is, to me, less worrying. It can be reduced by
> using more allocation groups, which is apparently the shared resource under
> contention.
>
Yep.

> The case of waiting for I/O is much more worrying, because I/O latencies are
> much higher. But it seems like most of the DIO path does not trigger
> locking around I/O (and we are careful to avoid the ones that do, like
> writing beyond eof).
>
> (sorry for repeating myself, I have the feeling we are talking past each
> other and want to be on the same page)
>
Yeah, my point is that just because the thread blocked on I/O doesn't
mean the cpu can't carry on with some useful work for another task.

> >
> >>> We submit an I/O which is
> >>>asynchronous in nature and wait on a completion, which causes the cpu to
> >>>schedule and execute another task until the completion is set by I/O
> >>>completion (via an async callback). At that point, the issuing thread
> >>>continues where it left off. I suspect I'm missing something... can you
> >>>elaborate on what you'd do differently here (and how it helps)?
> >>Just apply the same technique everywhere: convert locks to trylock +
> >>schedule a continuation on failure.
> >>
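
To make the "trylock + schedule a continuation on failure" idea concrete, here is a minimal sketch in Python of a single-threaded, shard-style lock that never sleeps. The TryLock class and its method names are hypothetical, made up for illustration; this is not kernel or Seastar code. A caller that cannot take the lock leaves its continuation on a queue and returns immediately, and whoever releases the lock hands it to the next queued continuation.

```python
from collections import deque


class TryLock:
    """Hypothetical non-blocking lock for a single-threaded event loop."""

    def __init__(self):
        self.held = False
        self.waiters = deque()  # continuations waiting for the lock

    def acquire_then(self, continuation):
        """Run continuation now if the lock is free, else queue it.

        Never blocks. Returns True if the continuation ran immediately.
        """
        if not self.held:
            self.held = True
            continuation()
            return True
        self.waiters.append(continuation)
        return False

    def release(self):
        """Release the lock, handing it straight to the next waiter if any."""
        if self.waiters:
            next_continuation = self.waiters.popleft()
            next_continuation()  # lock stays held on behalf of the waiter
        else:
            self.held = False


log = []
lock = TryLock()
lock.acquire_then(lambda: log.append("A holds lock"))
ran_now = lock.acquire_then(lambda: log.append("B holds lock"))  # contended
lock.release()  # A is done; B's continuation runs here
lock.release()  # B is done; lock is free again
```

The submitting code gets control back in both the contended and uncontended case, which is the property being asked for from io_submit().
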
> >I'm certainly not an expert on the kernel scheduling, locking and
> >serialization mechanisms, but my understanding is that most things
> >outside of spin locks are reschedule points. For example, the
> >wait_for_completion() calls XFS uses to wait on I/O boil down to
> >schedule_timeout() calls. Buffer locks are implemented as semaphores and
> >down() can end up in the same place.
>
> But, for the most part, XFS seems to be able to avoid sleeping. The call to
> __blockdev_direct_IO only launches the I/O, so any locking is only around
> cpu operations and, unless there is contention, won't cause us to sleep in
> io_submit().
>
> Trying to follow the code, it looks like xfs_get_blocks_direct (and
> __blockdev_direct_IO's get_block parameter in general) is synchronous, so
> we're just lucky to have everything in cache. If it isn't, we block right
> there. I really hope I'm misreading this and some other magic is happening
> elsewhere instead of this.
>
Nope, it's synchronous from a code perspective. The
xfs_bmapi_read()->xfs_iread_extents() path could have to read in the
inode bmap metadata if it hasn't been done already. Note that this
should only happen once as everything is stored in-core, so in most
cases this is skipped. It's also possible extents are read in via some
other path/operation on the inode before an async I/O happens to be
submitted (e.g., see some of the other xfs_bmapi_read() callers).
Either way, the extents have to be read in at some point and I'd expect
that cpu to schedule onto some other task while that thread waits on I/O
to complete (read-ahead could also be a factor here, but I haven't
really dug into how that is triggered for buffers).
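
The "read once, then in-core" behaviour described above can be sketched roughly as follows. This is a simplified Python model, not XFS code: the InCoreInode class, the extent tuple layout (file block, length, disk block) and fake_disk_read() are assumptions made for the example. Only the first lookup pays for the synchronous read; later lookups are served from memory.

```python
class InCoreInode:
    """Hypothetical stand-in for an inode whose extent list is loaded lazily."""

    def __init__(self, read_extents_from_disk):
        self._read_extents_from_disk = read_extents_from_disk
        self.extents = None  # None means "not read in yet"
        self.disk_reads = 0

    def map_block(self, file_block):
        """Translate a file block number to a disk block number."""
        if self.extents is None:
            # Synchronous: the caller waits here, but only the first time.
            self.disk_reads += 1
            self.extents = self._read_extents_from_disk()
        for start, length, disk_block in self.extents:
            if start <= file_block < start + length:
                return disk_block + (file_block - start)
        return None  # a hole: no extent covers this block


def fake_disk_read():
    # Pretend metadata read; the extents are invented for illustration.
    return [(0, 8, 100), (8, 4, 500)]


inode = InCoreInode(fake_disk_read)
first = inode.map_block(3)   # triggers the one-time extent read
second = inode.map_block(9)  # served from the in-core list
```
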
Brian
> >Brian
> >
> >>>>Seastar (the async user framework which we use to drive xfs) makes writing
> >>>>code like this easy, using continuations; but of course from ordinary
> >>>>threaded code it can be quite hard.
> >>>>
> >>>>btw, there was an attempt to make ext[34] async using this method, but I
> >>>>think it was ripped out. Yes, the mortal remains can still be seen with
> >>>>'git grep EIOCBQUEUED'.
> >>>>
> >>>>>>>It sounds to me that first and foremost you want to make sure you don't
> >>>>>>>have however many parallel operations you typically have running
> >>>>>>>contending on the same inodes or AGs. Hint: creating files under
> >>>>>>>separate subdirectories is a quick and easy way to allocate inodes under
> >>>>>>>separate AGs (the agno is encoded into the upper bits of the inode
> >>>>>>>number).
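
For testing where files land, the AG number can be recovered from an inode number with a shift. The sketch below is in Python and assumes the usual XFS layout, where the low (agblklog + inopblog) bits hold the AG-relative inode number and the bits above them hold the agno. The geometry values used here (AGBLKLOG = 22, INOPBLOG = 3) are hypothetical; the real ones come from the filesystem's superblock.

```python
import os


def ino_to_agno(ino, agblklog, inopblog):
    """Return the allocation group number encoded in an XFS inode number."""
    return ino >> (agblklog + inopblog)


def agno_of_path(path, agblklog, inopblog):
    """Look up a file's inode number with stat() and decode its AG."""
    return ino_to_agno(os.stat(path).st_ino, agblklog, inopblog)


# Hypothetical geometry, for illustration only.
AGBLKLOG = 22  # log2 of blocks per AG, rounded up
INOPBLOG = 3   # log2 of inodes per block

# Build an example inode number that lives in AG 2.
example_ino = (2 << (AGBLKLOG + INOPBLOG)) | 128
example_agno = ino_to_agno(example_ino, AGBLKLOG, INOPBLOG)
```

Running agno_of_path() over files created in different subdirectories is one way to check that they were spread across AGs as intended.
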
> >>>>>>Unfortunately our directory layout cannot be changed. And doesn't this
> >>>>>>require having agcount == O(number of active files)? That is easily in the
> >>>>>>thousands.
> >>>>>>
> >>>>>I think Glauber's O(nr_cpus) comment is probably the more likely
> >>>>>ballpark, but really it's something you'll probably just need to test to
> >>>>>see how far you need to go to avoid AG contention.
> >>>>>
> >>>>>I'm primarily throwing the subdir thing out there for testing purposes.
> >>>>>It's just an easy way to create inodes in a bunch of separate AGs so you
> >>>>>can determine whether/how much it really helps with modified AG counts.
> >>>>>I don't know enough about your application design to really comment on
> >>>>>that...
> >>>>We have O(cpus) shards that operate independently. Each shard writes 32MB
> >>>>commitlog files (that are pre-truncated to 32MB to allow concurrent writes
> >>>>without blocking); the files are then flushed and closed, and later removed.
> >>>>In parallel there are sequential writes and reads of large files (using 128kB
> >>>>buffers), as well as random reads. Files are immutable (append-only), and
> >>>>if a file is being written, it is not concurrently read. In general files
> >>>>are not shared across shards. All I/O is async and O_DIRECT. open(),
> >>>>truncate(), fdatasync(), and friends are called from a helper thread.
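
The pre-truncation step described above might look something like this in Python. It is only a sketch: the file name and the create_pretruncated() helper are hypothetical, the file is created in a temporary directory, and O_DIRECT and the async submission itself are left out because they depend on the filesystem and on an AIO binding.

```python
import os
import tempfile

COMMITLOG_SIZE = 32 * 1024 * 1024  # 32MB, as in the description above


def create_pretruncated(path, size=COMMITLOG_SIZE):
    """Create a file and extend it to its final size before any writes.

    With the size set up front, later writes land inside EOF rather than
    extending the file.
    """
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
    os.ftruncate(fd, size)
    return fd


tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "commitlog-0.log")  # hypothetical file name
fd = create_pretruncated(path)
size_after = os.fstat(fd).st_size

# Clean up the example file.
os.close(fd)
os.remove(path)
os.rmdir(tmpdir)
```
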
> >>>>
> >>>>As far as I can tell it should be a very friendly load for XFS and SSDs.
> >>>>
> >>>>>>> Reducing the frequency of block allocation/frees might also be
> >>>>>>>another help (e.g., preallocate and reuse files,
> >>>>>>Isn't that discouraged for SSDs?
> >>>>>>
> >>>>>Perhaps, if you're referring to the fact that the blocks are never freed
> >>>>>and thus never discarded..? Are you running fstrim?
> >>>>mount -o discard. And yes, overwrites are supposedly more expensive than
> >>>>trim old data + allocate new data, but maybe if you compare it with the work
> >>>>XFS has to do, perhaps the tradeoff is bad.
> >>>>
> >>>Ok, my understanding is that '-o discard' is not recommended in favor of
> >>>periodic fstrim for performance reasons, but that may or may not still
> >>>be the case.
> >>I understand that most SSDs have queued trim these days, but maybe I'm
> >>optimistic.
> >>
>