From: Avi Kivity <avi@scylladb.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Brian Foster <bfoster@redhat.com>,
Glauber Costa <glauber@scylladb.com>,
xfs@oss.sgi.com
Subject: Re: sleeps and waits during io_submit
Date: Wed, 2 Dec 2015 10:23:05 +0200
Message-ID: <565EAA69.80003@scylladb.com>
In-Reply-To: <20151201234139.GE19199@dastard>
On 12/02/2015 01:41 AM, Dave Chinner wrote:
> On Tue, Dec 01, 2015 at 10:56:01PM +0200, Avi Kivity wrote:
>> On 12/01/2015 10:45 PM, Dave Chinner wrote:
>>> On Tue, Dec 01, 2015 at 09:01:13AM -0500, Glauber Costa wrote:
>>> The difference is an allocation can block waiting on IO, and the
>>> CPU can then go off and run another process, which then tries to do
>>> an allocation. So you might only have 4 CPUs, but a workload that
>>> can have a hundred active allocations at once (not uncommon in
>>> file server workloads).
>> But for us, probably not much more. We try to restrict active I/Os
>> to the effective disk queue depth (more than that and they just turn
>> sour waiting in the disk queue).
>>
>>
>>> On workloads that are roughly 1 process per CPU, it's typical that
>>> agcount = 2 * N cpus gives pretty good results on large filesystems.
>> This is probably using sync calls. Using async calls you can have
>> many more I/Os in progress (but still limited by effective disk
>> queue depth).
> Ah, no. Even with async IO you don't want unbound allocation
> concurrency.
Unbound, certainly not.
But if my disk wants 100 concurrent operations to deliver maximum
bandwidth, and XFS wants fewer concurrent allocations to satisfy some
internal constraint, then I can't satisfy both.
To be fair, the number 100 was measured for 4k reads. It's sure to be
much lower for 128k writes, and since we set an extent size hint of 1MB,
only 1/8th of those will be allocating. So I expect things to work in
practice, at least with the current generation of disks. Unfortunately
disk bandwidth is growing faster than latency is improving, which means
that the effective concurrency is increasing.
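In concrete numbers (the queue depth, write size, and hint are the
figures above; the rest is arithmetic):

```python
QUEUE_DEPTH = 100           # effective queue depth, measured for 4k reads
WRITE_SIZE = 128 * 1024     # bytes per write
EXTSIZE_HINT = 1024 * 1024  # 1MB extent size hint

# With a 1MB hint, an allocation is needed only once per hint-sized
# chunk, i.e. on 1 in (EXTSIZE_HINT / WRITE_SIZE) writes.
writes_per_alloc = EXTSIZE_HINT // WRITE_SIZE

# Upper bound on concurrent allocations if the whole queue were writes.
allocating_writes = QUEUE_DEPTH / writes_per_alloc

print(writes_per_alloc, allocating_writes)  # 8 12.5
```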
> The allocation algorithms rely on having contiguous
> free space extents that are much larger than the allocations being
> done to work effectively and minimise file fragmentation. If you
> chop the filesystem up into lots of small AGs, then it accelerates
> the rate at which the free space gets chopped up into smaller
> extents and performance then suffers. It's the same problem as
> running a large filesystem near ENOSPC for an extended period of
> time, which again is something we most definitely don't recommend
> you do in production systems.
I understand. I guess it makes AG randomization even more important
for our use case.
What happens when an AG fills up? Can a file overflow to another AG?
>
>>> If you've got 400GB filesystems or you are using spinning disks,
>>> then you probably don't want to go above 16 AGs, because then you
>>> have problems with maintaining contiguous free space and you'll
>>> seek the spinning disks to death....
>> We're concentrating on SSDs for now.
> Sure, so "problems with maintaining contiguous free space" is what
> you need to be concerned about.
Right. Luckily our allocation patterns are very friendly towards that.
We have append-only files that grow rapidly, then are immutable for a
time, then are deleted. (It is a log-structured database, so a natural
fit for SSDs.)
We can increase our extent size hint if it will help the SSD any.
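For the record, the hint can be bumped per-file with
`xfs_io -c "extsize 2m" <file>`, or programmatically via the
FS_IOC_FSSETXATTR ioctl. A minimal Python sketch of the latter follows;
the struct layout and ioctl numbers are taken from linux/fs.h on x86-64
and should be checked against your own kernel headers -- treat it as an
illustration, not vetted code:

```python
import fcntl
import os
import struct

# struct fsxattr from linux/fs.h (28 bytes):
#   __u32 fsx_xflags, fsx_extsize, fsx_nextents, fsx_projid;
#   unsigned char fsx_pad[12];
FSXATTR_FMT = "=IIII12s"
FS_IOC_FSGETXATTR = 0x801c581f  # _IOR('X', 31, struct fsxattr)
FS_IOC_FSSETXATTR = 0x401c5820  # _IOW('X', 32, struct fsxattr)
FS_XFLAG_EXTSIZE = 0x00000800   # marks fsx_extsize as valid

def set_extsize(path, extsize_bytes):
    """Set the per-file extent size hint (in bytes) on an XFS file."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # Read-modify-write so we preserve the other fsxattr fields.
        buf = bytearray(struct.calcsize(FSXATTR_FMT))
        fcntl.ioctl(fd, FS_IOC_FSGETXATTR, buf)
        xflags, _, nextents, projid, pad = struct.unpack(FSXATTR_FMT, bytes(buf))
        newattr = struct.pack(FSXATTR_FMT, xflags | FS_XFLAG_EXTSIZE,
                              extsize_bytes, nextents, projid, pad)
        fcntl.ioctl(fd, FS_IOC_FSSETXATTR, newattr)
    finally:
        os.close(fd)
```

The ioctl only succeeds on filesystems that support extent size hints
(XFS among them), and only takes effect for data written after the hint
is set.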
>
>>>>>> 'mount -o ikeep,'
>>>>> Interesting. Our files are large so we could try this.
>>> Keep in mind that ikeep means that inode allocation permanently
>>> fragments free space, which can affect how large files are allocated
>>> once you truncate/rm the original files.
>> We can try to prime this by allocating a lot of inodes up front,
>> then removing them, so that this doesn't happen.
> Again - what problem have you measured that inode preallocation will
> solve in your application? Don't make changes just because you
> *think* it will fix what you *think* is a problem. Measure, analyse,
> solve, in that order.
We are now investigating what we can do to fix the problem; we aren't
committing to any solution yet. We certainly plan to be sure of what
the problem is before we fix it.
Up until a few days ago we never saw XFS block, and were very
happy -- but that was with 90us, 450k IOPS disks. With the slower
disks, accessed through a certain hypervisor, we do see XFS block, and
it is very worrying.