From: Matt Benjamin <mbenjamin@redhat.com>
To: Sage Weil <sweil@redhat.com>
Cc: John Spray <jspray@redhat.com>,
Ceph Development <ceph-devel@vger.kernel.org>
Subject: Re: newstore direction
Date: Tue, 20 Oct 2015 16:42:54 -0400 (EDT)
Message-ID: <888344827.35094163.1445373774060.JavaMail.zimbra@redhat.com>
In-Reply-To: <alpine.DEB.2.00.1510201251140.16833@cobra.newdream.net>
We mostly assumed that sort-of transactional file systems, perhaps hosted in user space, were the most tractable trajectory. I have seen newstore and keyvalue store as essentially congruent approaches using database primitives (and I am interested in what you make of Russell Sears). I'm skeptical of any hope of keeping things "simple." Like Martin downthread, I have found that most systems I have seen (filers, ZFS) make use of a fast, durable commit log and then flex out...something else.
--
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103
http://www.redhat.com/en/technologies/storage
tel. 734-707-0660
fax. 734-769-8938
cel. 734-216-5309
----- Original Message -----
> From: "Sage Weil" <sweil@redhat.com>
> To: "John Spray" <jspray@redhat.com>
> Cc: "Ceph Development" <ceph-devel@vger.kernel.org>
> Sent: Tuesday, October 20, 2015 4:00:23 PM
> Subject: Re: newstore direction
>
> On Tue, 20 Oct 2015, John Spray wrote:
> > On Mon, Oct 19, 2015 at 8:49 PM, Sage Weil <sweil@redhat.com> wrote:
> > > - We have to size the kv backend storage (probably still an XFS
> > > partition) vs the block storage. Maybe we do this anyway (put metadata
> > > on SSD!) so it won't matter. But what happens when we are storing gobs
> > > of rgw index data or cephfs metadata? Suddenly we are pulling storage
> > > out of a different pool and those aren't currently fungible.
> >
> > This is the concerning bit for me -- the other parts one "just" has to
> > get the code right, but this problem could linger and be something we
> > have to keep explaining to users indefinitely. It reminds me of cases
> > in other systems where users had to make an educated guess about inode
> > size up front, depending on whether you're expecting to efficiently
> > store a lot of xattrs.
> >
> > In practice it's rare for users to make these kinds of decisions well
> > up-front: it really needs to be adjustable later, ideally
> > automatically. That could be pretty straightforward if the KV part
> > was stored directly on block storage, instead of having XFS in the
> > mix. I'm not quite up with the state of the art in this area: are
> > there any reasonable alternatives for the KV part that would consume
> > some defined range of a block device from userspace, instead of
> > sitting on top of a filesystem?
>
> I agree: this is my primary concern with the raw block approach.
>
> There are some KV alternatives that could consume block, but the problem
> would be similar: we need to dynamically size up or down the kv portion of
> the device.
>
> I see two basic options:
>
> 1) Wire into the Env abstraction in rocksdb to provide something just
> smart enough to let rocksdb work. It isn't much: named files (not that
> many--we could easily keep the file table in ram), always written
> sequentially, to be read later with random access. All of the code is
> written around abstractions of SequentialFileWriter so that everything
> posix is neatly hidden in env_posix (and there are various other env
> implementations for in-memory mock tests etc.).
>
> 2) Use something like dm-thin to sit between the raw block device and XFS
> (for rocksdb) and the block device consumed by newstore. As long as XFS
> doesn't fragment horrifically (it shouldn't, given we *always* write ~4mb
> files in their entirety) we can fstrim and size down the fs portion. If
> we similarly make newstore's allocator stick to large blocks only we would
> be able to size down the block portion as well. Typical dm-thin block
> sizes seem to range from 64KB to 512KB, which seems reasonable enough to
> me. In fact, we could likely just size the fs volume at something
> conservatively large (like 90%) and rely on -o discard or periodic fstrim
> to keep its actual utilization in check.
>
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
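Option 1 in the quoted message asks for very little from the layer under rocksdb: a small set of named files, a file table that fits in RAM, sequential appends, and random-access reads. A minimal sketch of that shape follows, in Python for brevity. Everything in it is an assumption made for illustration: the class name RawBlockFileStore, the fixed extent size, and the bytearray standing in for a raw block device are all hypothetical, and none of it is RocksDB's Env interface or newstore code. The file table here lives only in memory; a real implementation would also have to persist it and handle crashes.

```python
class RawBlockFileStore:
    """Toy append-only file layer over one flat byte region.

    Hypothetical illustration only: the bytearray stands in for a raw
    block device, and the file table is never persisted.
    """

    def __init__(self, device_size, extent_size=4096):
        self.device = bytearray(device_size)      # stand-in for the device
        self.extent_size = extent_size
        self.free_extents = list(range(device_size // extent_size))
        # In-RAM file table: name -> list of extent numbers and byte size.
        self.files = {}

    def create(self, name):
        if name in self.files:
            raise FileExistsError(name)
        self.files[name] = {"extents": [], "size": 0}

    def append(self, name, data):
        """Write data at the end of the file, allocating extents as needed."""
        f = self.files[name]
        pos = 0
        while pos < len(data):
            off = f["size"] % self.extent_size
            if off == 0:
                # File is empty or its last extent is full: take a new one.
                if not self.free_extents:
                    raise OSError("no free extents")
                f["extents"].append(self.free_extents.pop(0))
            n = min(self.extent_size - off, len(data) - pos)
            start = f["extents"][-1] * self.extent_size + off
            self.device[start:start + n] = data[pos:pos + n]
            f["size"] += n
            pos += n

    def read(self, name, offset, length):
        """Random-access read; returns fewer bytes if it hits end of file."""
        f = self.files[name]
        end = min(offset + length, f["size"])
        out = bytearray()
        while offset < end:
            idx, off = divmod(offset, self.extent_size)
            n = min(self.extent_size - off, end - offset)
            start = f["extents"][idx] * self.extent_size + off
            out += self.device[start:start + n]
            offset += n
        return bytes(out)

    def delete(self, name):
        """Drop the file and return its extents to the free list."""
        f = self.files.pop(name)
        self.free_extents.extend(f["extents"])


store = RawBlockFileStore(device_size=64, extent_size=16)
store.create("a.sst")
store.append("a.sst", b"x" * 20)
store.create("b.log")
store.append("b.log", b"hello")
store.append("a.sst", b"y" * 20)
```

Because files are only ever appended to and then deleted whole, the allocator never has to deal with overwrites or holes inside a file, which is what keeps the layer this small. Growing or shrinking the kv share of the device then comes down to how many extents sit on the free list.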