From: Ric Wheeler <rwheeler@redhat.com>
To: Sage Weil <sweil@redhat.com>
Cc: Orit Wasserman <owasserm@redhat.com>, ceph-devel@vger.kernel.org
Subject: Re: newstore direction
Date: Thu, 22 Oct 2015 22:06:03 -0400
Message-ID: <5629960B.7030108@redhat.com>
In-Reply-To: <alpine.DEB.2.00.1510220528140.16833@cobra.newdream.net>
On 10/22/2015 08:50 AM, Sage Weil wrote:
> On Wed, 21 Oct 2015, Ric Wheeler wrote:
>> You will have to trust me on this as the Red Hat person who spoke to pretty
>> much all of our key customers about local file systems and storage - customers
>> all have migrated over to using normal file systems under Oracle/DB2.
>> Typically, they use XFS or ext4. I don't know of any non-standard file
>> systems and only have seen one account running on a raw block store in 8 years
>> :)
>>
>> If you have a pre-allocated file and write using O_DIRECT, your IO path is
>> identical in terms of IO's sent to the device.
>>
>> If we are causing additional IO's, then we really need to spend some time
>> talking to the local file system gurus about this in detail. I can help with
>> that conversation.
> If the file is truly preallocated (that is, prewritten with zeros...
> fallocate doesn't help here because the extents are marked unwritten), then
> sure: there is very little change in the data path.
>
> But at that point, what is the point? This only works if you have one (or
> a few) huge files and the user space app already has all the complexity of
> a filesystem-like thing (with its own internal journal, allocators,
> garbage collection, etc.). Do they just do this to ease administrative
> tasks like backup?

I think that the key here is that if we fsync() like crazy - regardless of
whether we write to a file system or to some new, yet-to-be-defined block device
primitive store - we are limited to the IOPS of that particular block device.

Ignoring exotic hardware configs for people who can afford all-SSD devices, we
will have rotating, high-capacity, slow spinning drives for *a long time* as the
eventual tier. Given that assumption, we need to do better than to be limited
to the synchronous IOPS of a slow drive. When we have commodity pricing for
things like persistent DRAM, then I agree that writing directly to that medium
makes sense (but you can do that with DAX by effectively mapping it into the
process address space).

Specifically, moving away from a file system with some inefficiencies will only
boost performance from, say, 20-30 IOPS to roughly 40-50 IOPS.

The way this has been handled traditionally for things like databases is:
* batch up the transactions that need to be destaged
* issue an O_DIRECT async IO for all of the elements that need to be written
  (bypassing the page cache, going directly to the backing store)
* wait for completion

We should probably add to that sequence an fsync() of the directory (or of a
file in the file system) to ensure that any volatile write cache is flushed, but
there is *no* reason to fsync() each file.

I think that we need to look at why the write pattern is so heavily synchronous
and single-threaded if we are hoping to extract from any given storage tier its
maximum performance.

Batching writes this way can raise your file creations (or allocations) per
second from a few dozen to a few hundred or more.

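As a rough illustration of the batch / submit / wait / flush-once sequence
above, here is a minimal sketch. It makes several assumptions that are not part
of this thread: a single backing file with fixed 4096-byte slots stands in for
the store, a thread pool stands in for kernel async IO (which the Python
standard library does not expose), and the names open_store, write_block and
destage are made up for the example. It falls back to buffered IO if the file
system rejects O_DIRECT.

```python
import mmap
import os
from concurrent.futures import ThreadPoolExecutor

BLOCK = 4096  # assumed block size; O_DIRECT needs aligned buffers, offsets, lengths


def open_store(path, nblocks):
    """Open and size the backing file, using O_DIRECT where it is accepted."""
    flags = os.O_RDWR | os.O_CREAT
    try:
        fd = os.open(path, flags | getattr(os, "O_DIRECT", 0), 0o644)
    except OSError:
        # Assumption: fall back to buffered IO if the file system rejects O_DIRECT.
        fd = os.open(path, flags, 0o644)
    os.ftruncate(fd, nblocks * BLOCK)  # sets the size only; nothing is prewritten
    return fd


def write_block(fd, slot, payload):
    """Write one payload (at most BLOCK bytes) into block number `slot`."""
    buf = mmap.mmap(-1, BLOCK)  # anonymous mappings are page aligned and zeroed
    try:
        buf[:len(payload)] = payload
        written = os.pwrite(fd, buf, slot * BLOCK)
    finally:
        buf.close()
    if written != BLOCK:
        raise IOError("short write")
    return slot


def destage(fd, batch, workers=8):
    """Submit every write in the batch, wait for all of them, then flush once."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(write_block, fd, slot, data) for slot, data in batch]
        done = [f.result() for f in futures]  # wait for completion
    os.fsync(fd)  # one flush for the whole batch, not one per write
    return done
```

Calling destage(fd, [(0, b"a"), (2, b"ccc")]) issues both writes concurrently
and pays for a single fsync(), which is the point of the sequence: the cost of
the flush is shared by everything in the batch.
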
The complexity that you save by not writing a new block-level allocation
strategy is:
* if you lay out a lot of small objects on the block store that can grow, we
  will quickly end up implementing very complicated techniques that file systems
  solved a long time ago (pre-allocation, etc.)
* multi-stream aware allocation if you have multiple processes writing to the
  same store
* tracking things like allocated-but-unwritten space (can happen if some process
  "pokes" a hole in an object, common with things like virtual machine images)

Once we end up handling all of that in new, untested code, I think that we end
up with a lot of pain and only minimal gain in terms of performance.

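To make the allocation states mentioned above concrete (a hole, space that is
allocated but unwritten, and space that has actually been written), here is a
small sketch. The helper names and the 1 MiB size are made up for the example,
and what the preallocated case looks like on disk depends on the file system.

```python
import os

SIZE = 1 << 20  # 1 MiB, an arbitrary size for illustration


def make_sparse(path):
    """Set the size only: the whole file is a hole with no blocks behind it."""
    with open(path, "wb") as f:
        f.truncate(SIZE)


def make_preallocated(path):
    """Reserve space; file systems with unwritten extents mark it unwritten."""
    with open(path, "wb") as f:
        os.posix_fallocate(f.fileno(), 0, SIZE)


def make_prewritten(path):
    """Actually write zeros, so every block is both allocated and written."""
    with open(path, "wb") as f:
        f.write(bytes(SIZE))
        f.flush()
        os.fsync(f.fileno())


def allocated_bytes(path):
    """Space backing the file (st_blocks counts 512-byte units on Linux)."""
    return os.stat(path).st_blocks * 512
```

All three files have the same size and read back as zeros; they differ only in
what the file system has to track underneath, which allocated_bytes() can hint
at. A new block-level store would have to keep the same bookkeeping itself.
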
ric
>
>
> This is the fundamental tradeoff:
>
> 1) We have a file per object. We fsync like crazy and the fact that
> there are two independent layers journaling and managing different types
> of consistency penalizes us.
>
> 1b) We get clever and start using obscure and/or custom ioctls in the file
> system to work around what it is used to: we swap extents to avoid
> write-ahead (see Christoph's patch), O_NOMTIME, unprivileged
> open-by-handle, batch fsync, O_ATOMIC, setext ioctl, etc.
>
> 2) We preallocate huge files and write a user-space object system that
> lives within it (pretending the file is a block device). The file system
> rarely gets in the way (assuming the file is prewritten and we don't do
> anything stupid). But it doesn't give us anything a block device
> wouldn't, and it doesn't save us any complexity in our code.
>
> At the end of the day, 1 and 1b are always going to be slower than 2.
> And although 1b performs a bit better than 1, it has similar (user-space)
> complexity to 2. On the other hand, if you step back and view the
> entire stack (ceph-osd + XFS), 1 and 1b are *significantly* more complex
> than 2... and yet still slower. Given we ultimately have to support both
> (both as an upstream and as a distro), that's not very attractive.
>
> Also note that every time we have strayed off the reservation from the
> beaten path (1) to anything mildly exotic (1b) we have been bitten by
> obscure file system bugs. And that's assuming we get everything we need
> upstream... which is probably a year's endeavour.
>
> Don't get me wrong: I'm all for making changes to file systems to better
> support systems like Ceph. Things like O_NOCMTIME and O_ATOMIC make a
> huge amount of sense for a ton of different systems. But our situation is
> a bit different: we always own the entire device (and often the server),
> so there is no need to share with other users or apps (and when you do,
> you just use the existing FileStore backend). And as you know performance
> is a huge pain point. We are already handicapped by virtue of being
> distributed and strongly consistent; we can't afford to give away more to
> a storage layer that isn't providing us much (or the right) value.
>
> And I'm tired of half measures. I want the OSD to be as fast as we can
> make it given the architectural constraints (RADOS consistency and
> ordering semantics). This is truly low-hanging fruit: it's modular,
> self-contained, pluggable, and this will be my third time around this
> particular block.
>
> sage