From: Avi Kivity <avi@scylladb.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Glauber Costa <glauber@scylladb.com>, xfs@oss.sgi.com
Subject: Re: sleeps and waits during io_submit
Date: Tue, 1 Dec 2015 23:38:29 +0200
Message-ID: <565E1355.4020900@scylladb.com>
In-Reply-To: <20151201211914.GZ19199@dastard>
On 12/01/2015 11:19 PM, Dave Chinner wrote:
> On Tue, Dec 01, 2015 at 09:07:14PM +0200, Avi Kivity wrote:
>> On 12/01/2015 08:03 PM, Carlos Maiolino wrote:
>>> Hi Avi,
>>>
>>>>> else is going to execute in our place until this thread can make
>>>>> progress.
>>>> For us, nothing else can execute in our place, we usually have exactly one
>>>> thread per logical core. So we are heavily dependent on io_submit not
>>>> sleeping.
>>>>
>>>> The case of a contended lock is, to me, less worrying. It can be reduced by
>>>> using more allocation groups, which is apparently the shared resource under
>>>> contention.
>>>>
>>> I apologize if I misread your previous comments, but, IIRC you said you can't
>>> change the directory structure your application is using, and IIRC your
>>> application does not spread files across several directories.
>> I miswrote somewhat: the application writes data files and commitlog
>> files. The data file directory structure is fixed due to
>> compatibility concerns (it is not a single directory, but some
>> workloads will see most access on files in a single directory). The
>> commitlog directory structure is more relaxed, and we can split it
>> to a directory per shard (=cpu) or something else.
>>
>> If worst comes to worst, we'll hack around this and distribute the
>> data files into more directories, and provide some hack for
>> compatibility.
>>
>>> XFS spreads files across the allocation groups, based on the directory these
>>> files are created in,
>> Idea: create the files in some subdirectory, and immediately move
>> them to their required location.
> See xfs_fsr.
Can you elaborate? I don't see how it is applicable.
My hack involves creating the file in a random directory and, while it
is still zero sized, moving it to its final directory. This is simply to
defeat the AG selection heuristic. No data is copied.
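A minimal sketch of the hack, to make it concrete. The function name, the
"scratch-NN" directory layout, and the default of 16 scratch directories are
all made up for illustration; it assumes the scratch directories live on the
same filesystem as the final location (so the rename is a pure metadata
operation) and that the AG is chosen from the directory in which the file is
created, as described above.

```python
import os
import random
import uuid


def create_in_scratch_then_move(scratch_root, final_path, n_scratch=16):
    """Create an empty file in a randomly chosen scratch directory, then
    rename it to final_path while it is still zero sized.

    Returns an open file descriptor; the caller writes the data and closes it.
    All names here are hypothetical, not part of any existing tool.
    """
    # Pick one of n_scratch sibling directories at random, hoping the
    # filesystem places them in different allocation groups.
    scratch_dir = os.path.join(
        scratch_root, "scratch-%02d" % random.randrange(n_scratch))
    os.makedirs(scratch_dir, exist_ok=True)

    # Create the file under a unique temporary name; O_EXCL guards against
    # accidentally reusing an existing file.
    tmp_path = os.path.join(scratch_dir, uuid.uuid4().hex)
    fd = os.open(tmp_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    try:
        # No data has been written yet, so only the directory entry moves.
        os.rename(tmp_path, final_path)
    except BaseException:
        os.close(fd)
        os.unlink(tmp_path)
        raise
    # The descriptor stays valid across the rename.
    return fd
```

The caller would then write through the returned descriptor as usual, e.g.
fd = create_in_scratch_then_move("/data/.scratch", "/data/ks/table/file.db"),
where both paths are placeholders.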
>>> trying to keep files as close as possible to their
>>> metadata.
>> This is pointless for an SSD. Perhaps XFS should randomize the AG on
>> nonrotational media instead.
> Actually, no, it is not pointless. SSDs do not require optimisation
> for minimal seek time, but data locality is still just as important
> as spinning disks, if not more so. Why? Because the garbage
> collection routines in the SSDs are all about locality and we can't
> drive garbage collection effectively via discard operations if the
> filesystem is not keeping temporally related files close together in
> its block address space.
In my case, files in the same directory are not temporally related. But
I understand where the heuristic comes from.
Maybe an ioctl to set a directory attribute "the files in this directory
are not temporally related"?
I imagine this will be useful for many server applications.
> e.g. If the files in a directory are all close together, and the
> directory is removed, we then leave a big empty contiguous region in
> the filesystem free space map, and when we send discards over that
> we end up with a single big trim and the drive handles that far more
Would this not be defeated if a directory that happens to share the
allocation group gets populated simultaneously?
> effectively than lots of little trims (i.e. one per file) that the
> drive cannot do anything useful with because they are all smaller
> than the internal SSD page/block sizes and so get ignored. This is
> one of the reasons fstrim is so much more efficient and effective
> than using the discard mount option.
In my use case, the files are fairly large, and there is constant
rewriting (not in-place: files are read, merged, and written back). So
I'm worried an fstrim can happen too late.
>
> And, well, XFS is designed to operate on storage devices made up of
> more than one drive, so the way AGs are selected is designed to
> give long term load balancing (both for space usage and
> instantaneous performance). With the existing algorithms we've not
> had any issues with SSD lifetimes, long term performance
> degradation, etc, so there's no evidence that we actually need to
> change the fundamental allocation algorithms specially for SSDs.
>
Ok. Maybe the SSDs can deal with untrimmed overwrites efficiently,
provided the I/O sizes are large enough.