From: Avi Kivity <avi@scylladb.com>
To: Glauber Costa <glauber@scylladb.com>, xfs@oss.sgi.com
Subject: Re: sleeps and waits during io_submit
Date: Tue, 1 Dec 2015 21:07:14 +0200 [thread overview]
Message-ID: <565DEFE2.2000308@scylladb.com> (raw)
In-Reply-To: <20151201180321.GA4762@redhat.com>
On 12/01/2015 08:03 PM, Carlos Maiolino wrote:
> Hi Avi,
>
>>> else is going to execute in our place until this thread can make
>>> progress.
>> For us, nothing else can execute in our place, we usually have exactly one
>> thread per logical core. So we are heavily dependent on io_submit not
>> sleeping.
>>
>> The case of a contended lock is, to me, less worrying. It can be reduced by
>> using more allocation groups, which is apparently the shared resource under
>> contention.
>>
> I apologize if I misread your previous comments, but, IIRC you said you can't
> change the directory structure your application is using, and IIRC your
> application does not spread files across several directories.
I miswrote somewhat: the application writes data files and commitlog
files. The data file directory structure is fixed due to compatibility
concerns (it is not a single directory, but some workloads will see most
access on files in a single directory). The commitlog directory
structure is more relaxed, and we can split it into a directory per
shard (= cpu) or something else.
If worst comes to worst, we'll hack around this and distribute the data
files into more directories, and provide some hack for compatibility.
> XFS spreads files across the allocation groups, based on the directory these
> files are created in,
Idea: create the files in some subdirectory, and immediately move them
to their required location.
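A user-space sketch of this trick (paths and names here are illustrative, not from the thread): create the file in a scratch subdirectory so that XFS allocates the inode in that directory's AG, then rename() it into its required location. Within one filesystem, rename only moves the directory entry, so the inode keeps the AG chosen at creation time, at the cost of one extra system call.

```python
import os

def create_in_ag_subdir(scratch_dir, final_path):
    """Create a file under scratch_dir (whose AG XFS will use for the
    new inode), then rename it to final_path.  rename(2) within one
    filesystem moves only the directory entry, so the inode -- and
    therefore its allocation group -- is unchanged by the move."""
    tmp_path = os.path.join(scratch_dir, os.path.basename(final_path))
    fd = os.open(tmp_path, os.O_CREAT | os.O_WRONLY | os.O_EXCL, 0o644)
    os.close(fd)
    os.rename(tmp_path, final_path)
    return os.stat(final_path).st_ino
```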
> trying to keep files as close as possible from their
> metadata.
This is pointless for an SSD. Perhaps XFS should randomize the AG on
nonrotational media instead.
> Directories are spread across the AGs in a 'round-robin' way: each
> new directory will be created in the next allocation group, and xfs will try
> to allocate the files in the same AG as their parent directory. (Take a look at
> the 'rotorstep' sysctl option for xfs).
>
> So, unless you have the files distributed across enough directories, increasing
> the number of allocation groups may not change the lock contention you're
> facing in this case.
>
> I really don't remember if it has been mentioned already, but if not, it might
> be worth taking this point into consideration.
Thanks. I think you should really consider randomizing the AG for SSDs;
meanwhile, we can just use the creation-directory hack to get the
same effect, at the cost of an extra system call. So at least for this
problem, there is a solution.
> anyway, just my 0.02
>
>> The case of waiting for I/O is much more worrying, because I/O latencies are
>> much higher. But it seems like most of the DIO path does not trigger
>> locking around I/O (and we are careful to avoid the ones that do, like
>> writing beyond eof).
>>
>> (sorry for repeating myself, I have the feeling we are talking past each
>> other and want to be on the same page)
>>
>>>>> We submit an I/O which is
>>>>> asynchronous in nature and wait on a completion, which causes the cpu to
>>>>> schedule and execute another task until the completion is set by I/O
>>>>> completion (via an async callback). At that point, the issuing thread
>>>>> continues where it left off. I suspect I'm missing something... can you
>>>>> elaborate on what you'd do differently here (and how it helps)?
>>>> Just apply the same technique everywhere: convert locks to trylock +
>>>> schedule a continuation on failure.
>>>>
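The trylock-plus-continuation idea above can be illustrated with a minimal user-space analogy (the Reactor class and all names are made up for this sketch; in the kernel this would correspond to trylock variants plus deferring the rest of the operation to a work item):

```python
import threading
from collections import deque

class Reactor:
    """Toy single-threaded run queue standing in for a continuation
    scheduler.  Purely illustrative -- not any real kernel or Seastar API."""
    def __init__(self):
        self.runq = deque()

    def schedule(self, task):
        self.runq.append(task)

    def run(self):
        while self.runq:
            self.runq.popleft()()

def with_lock(reactor, lock, critical_section):
    """Try the lock; on failure, reschedule the attempt as a new
    continuation instead of sleeping, so the reactor stays free to run
    other work in the meantime."""
    def attempt():
        if lock.acquire(blocking=False):   # trylock: never sleeps
            try:
                critical_section()
            finally:
                lock.release()
        else:
            reactor.schedule(attempt)      # retry later as a continuation
    reactor.schedule(attempt)
```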
>>> I'm certainly not an expert on the kernel scheduling, locking and
>>> serialization mechanisms, but my understanding is that most things
>>> outside of spin locks are reschedule points. For example, the
>>> wait_for_completion() calls XFS uses to wait on I/O boil down to
>>> schedule_timeout() calls. Buffer locks are implemented as semaphores and
>>> down() can end up in the same place.
>> But, for the most part, XFS seems to be able to avoid sleeping. The call to
>> __blockdev_direct_IO only launches the I/O, so any locking is only around
>> cpu operations and, unless there is contention, won't cause us to sleep in
>> io_submit().
>>
>> Trying to follow the code, it looks like xfs_get_blocks_direct (and
>> __blockdev_direct_IO's get_block parameter in general) is synchronous, so
>> we're just lucky to have everything in cache. If it isn't, we block right
>> there. I really hope I'm misreading this and some other magic is happening
>> elsewhere instead of this.
>>
>>> Brian
>>>
>>>>>> Seastar (the async user framework which we use to drive xfs) makes writing
>>>>>> code like this easy, using continuations; but of course from ordinary
>>>>>> threaded code it can be quite hard.
>>>>>>
>>>>>> btw, there was an attempt to make ext[34] async using this method, but I
>>>>>> think it was ripped out. Yes, the mortal remains can still be seen with
>>>>>> 'git grep EIOCBQUEUED'.
>>>>>>
>>>>>>>>> It sounds to me that first and foremost you want to make sure you don't
>>>>>>>>> have however many parallel operations you typically have running
>>>>>>>>> contending on the same inodes or AGs. Hint: creating files under
>>>>>>>>> separate subdirectories is a quick and easy way to allocate inodes under
>>>>>>>>> separate AGs (the agno is encoded into the upper bits of the inode
>>>>>>>>> number).
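As an aside, the encoding mentioned here can be sketched as follows (the geometry values in the example are assumptions; on a real filesystem the shift is derived from the superblock's agblklog and inopblog fields, obtainable via xfs_info):

```python
def xfs_ino_to_agno(ino, agblklog, inopblog):
    """AG number from an XFS inode number: the low bits index the inode
    within its block and the block within its AG; everything above that
    is the AG number.  agblklog/inopblog are log2 of blocks-per-AG and
    inodes-per-block, taken from the superblock."""
    return ino >> (agblklog + inopblog)
```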
>>>>>>>> Unfortunately our directory layout cannot be changed. And doesn't this
>>>>>>>> require having agcount == O(number of active files)? That is easily in the
>>>>>>>> thousands.
>>>>>>>>
>>>>>>> I think Glauber's O(nr_cpus) comment is probably the more likely
>>>>>>> ballpark, but really it's something you'll probably just need to test to
>>>>>>> see how far you need to go to avoid AG contention.
>>>>>>>
>>>>>>> I'm primarily throwing the subdir thing out there for testing purposes.
>>>>>>> It's just an easy way to create inodes in a bunch of separate AGs so you
>>>>>>> can determine whether/how much it really helps with modified AG counts.
>>>>>>> I don't know enough about your application design to really comment on
>>>>>>> that...
>>>>>> We have O(cpus) shards that operate independently. Each shard writes 32MB
>>>>>> commitlog files (that are pre-truncated to 32MB to allow concurrent writes
>>>>>> without blocking); the files are then flushed and closed, and later removed.
>>>>>> In parallel there are sequential writes and reads of large files using 128kB
>>>>>> buffers, as well as random reads. Files are immutable (append-only), and
>>>>>> if a file is being written, it is not concurrently read. In general files
>>>>>> are not shared across shards. All I/O is async and O_DIRECT. open(),
>>>>>> truncate(), fdatasync(), and friends are called from a helper thread.
>>>>>>
>>>>>> As far as I can tell it should be a very friendly load for XFS and SSDs.
>>>>>>
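The pre-truncation trick described above can be sketched like this (size and helper names are illustrative; O_DIRECT and aligned buffers are omitted so the sketch runs anywhere, but the real application would use them): truncating the file to its final size up front means later writes inside that range never extend the file, and size-extending writes are among the cases that block in the write path.

```python
import os

COMMITLOG_SIZE = 32 * 1024 * 1024  # 32MB, as in the workload description

def open_commitlog(path):
    """Create a commitlog file pre-truncated to its final size, so that
    subsequent writes within [0, COMMITLOG_SIZE) never change the file
    size.  The real application would also pass O_DIRECT and write from
    suitably aligned buffers."""
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
    os.ftruncate(fd, COMMITLOG_SIZE)
    return fd

def append_block(fd, offset, data):
    """Write at a known offset inside the preallocated range; since the
    write never crosses the pre-set EOF, it is not size-extending."""
    assert offset + len(data) <= COMMITLOG_SIZE
    os.pwrite(fd, data, offset)
```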
>>>>>>>>> Reducing the frequency of block allocation/frees might also be
>>>>>>>>> another help (e.g., preallocate and reuse files,
>>>>>>>> Isn't that discouraged for SSDs?
>>>>>>>>
>>>>>>> Perhaps, if you're referring to the fact that the blocks are never freed
>>>>>>> and thus never discarded..? Are you running fstrim?
>>>>>> mount -o discard. And yes, overwrites are supposedly more expensive than
>>>>>> trim old data + allocate new data, but compared with the work XFS has to
>>>>>> do, perhaps the tradeoff is bad.
>>>>>>
>>>>> Ok, my understanding is that '-o discard' is not recommended in favor of
>>>>> periodic fstrim for performance reasons, but that may or may not still
>>>>> be the case.
>>>> I understand that most SSDs have queued trim these days, but maybe I'm
>>>> optimistic.
>>>>
>> _______________________________________________
>> xfs mailing list
>> xfs@oss.sgi.com
>> http://oss.sgi.com/mailman/listinfo/xfs