Re: newstore direction - Wido den Hollander

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Wido den Hollander <wido@42on.com>
To: Sage Weil <sweil@redhat.com>, ceph-devel@vger.kernel.org
Subject: Re: newstore direction
Date: Mon, 19 Oct 2015 23:18:52 +0200	[thread overview]
Message-ID: <56255E3C.1040607@42on.com> (raw)
In-Reply-To: <alpine.DEB.2.00.1510191216200.4188@cobra.newdream.net>

On 10/19/2015 09:49 PM, Sage Weil wrote:
> The current design is based on two simple ideas:
> 
>  1) a key/value interface is better way to manage all of our internal 
> metadata (object metadata, attrs, layout, collection membership, 
> write-ahead logging, overlay data, etc.)
> 
>  2) a file system is well suited for storage object data (as files).
> 
> So far 1 is working out well, but I'm questioning the wisdom of #2.  A few 
> things:
> 
>  - We currently write the data to the file, fsync, then commit the kv 
> transaction.  That's at least 3 IOs: one for the data, one for the fs 
> journal, one for the kv txn to commit (at least once my rocksdb changes 
> land... the kv commit is currently 2-3).  So two people are managing 
> metadata, here: the fs managing the file metadata (with its own 
> journal) and the kv backend (with its journal).
> 
>  - On read we have to open files by name, which means traversing the fs 
> namespace.  Newstore tries to keep it as flat and simple as possible, but 
> at a minimum it is a couple btree lookups.  We'd love to use open by 
> handle (which would reduce this to 1 btree traversal), but running 
> the daemon as ceph and not root makes that hard...
> 
>  - ...and file systems insist on updating mtime on writes, even when it is 
> a overwrite with no allocation changes.  (We don't care about mtime.)  
> O_NOCMTIME patches exist but it is hard to get these past the kernel 
> brainfreeze.
> 
>  - XFS is (probably) never going going to give us data checksums, which we 
> want desperately.
> 
> But what's the alternative?  My thought is to just bite the bullet and 
> consume a raw block device directly.  Write an allocator, hopefully keep 
> it pretty simple, and manage it in kv store along with all of our other 
> metadata.
> 
> Wins:
> 
>  - 2 IOs for most: one to write the data to unused space in the block 
> device, one to commit our transaction (vs 4+ before).  For overwrites, 
> we'd have one io to do our write-ahead log (kv journal), then do 
> the overwrite async (vs 4+ before).
> 
>  - No concern about mtime getting in the way
> 
>  - Faster reads (no fs lookup)
> 
>  - Similarly sized metadata for most objects.  If we assume most objects 
> are not fragmented, then the metadata to store the block offsets is about 
> the same size as the metadata to store the filenames we have now. 
> 
> Problems:
> 
>  - We have to size the kv backend storage (probably still an XFS 
> partition) vs the block storage.  Maybe we do this anyway (put metadata on 
> SSD!) so it won't matter.  But what happens when we are storing gobs of 
> rgw index data or cephfs metadata?  Suddenly we are pulling storage out of 
> a different pool and those aren't currently fungible.
> 
>  - We have to write and maintain an allocator.  I'm still optimistic this 
> can be reasonbly simple, especially for the flash case (where 
> fragmentation isn't such an issue as long as our blocks are reasonbly 
> sized).  For disk we may beed to be moderately clever.
> 
>  - We'll need a fsck to ensure our internal metadata is consistent.  The 
> good news is it'll just need to validate what we have stored in the kv 
> store.
> 
> Other thoughts:
> 
>  - We might want to consider whether dm-thin or bcache or other block 
> layers might help us with elasticity of file vs block areas.
> 

I've been using bcache for a while now in production and that helped a lot.

Intel SSDs with GPT. First few partitions as Journals and then one big
partition for bcache.

/dev/bcache0    2.8T  264G  2.5T  10% /var/lib/ceph/osd/ceph-60
/dev/bcache1    2.8T  317G  2.5T  12% /var/lib/ceph/osd/ceph-61
/dev/bcache2    2.8T  303G  2.5T  11% /var/lib/ceph/osd/ceph-62
/dev/bcache3    2.8T  316G  2.5T  12% /var/lib/ceph/osd/ceph-63
/dev/bcache4    2.8T  167G  2.6T   6% /var/lib/ceph/osd/ceph-64
/dev/bcache5    2.8T  295G  2.5T  11% /var/lib/ceph/osd/ceph-65

The maintainers from bcache also presented bcachefs:
https://lkml.org/lkml/2015/8/21/22

"checksumming, compression: currently only zlib is supported for
compression, and for checksumming there's crc32c and a 64 bit checksum."

Wouldn't that be something that can be leveraged from? Consuming a raw
block device seems like re-inventing the wheel to me. I might be wrong
though.

I have no idea how stable bcachefs is, but it might be worth looking in to.

>  - Rocksdb can push colder data to a second directory, so we could have a 
> fast ssd primary area (for wal and most metadata) and a second hdd 
> directory for stuff it has to push off.  Then have a conservative amount 
> of file space on the hdd.  If our block fills up, use the existing file 
> mechanism to put data there too.  (But then we have to maintain both the 
> current kv + file approach and not go all-in on kv + block.)
> 
> Thoughts?
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on

next prev parent reply	other threads:[~2015-10-19 21:18 UTC|newest]

Thread overview: 71+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-10-19 19:49 newstore direction Sage Weil
2015-10-19 20:22 ` Robert LeBlanc
2015-10-19 20:30 ` Somnath Roy
2015-10-19 20:54   ` Sage Weil
2015-10-19 22:21     ` James (Fei) Liu-SSI
2015-10-20  2:24       ` Chen, Xiaoxi
2015-10-20 12:30         ` Sage Weil
2015-10-20 13:19           ` Mark Nelson
2015-10-20 17:04             ` kernel neophyte
2015-10-21 10:06             ` Allen Samuels
2015-10-21 13:35               ` Mark Nelson
2015-10-21 16:10                 ` Chen, Xiaoxi
2015-10-22  1:09                   ` Allen Samuels
2015-10-20  2:32       ` Varada Kari
2015-10-20  2:40         ` Chen, Xiaoxi
2015-10-20 12:34       ` Sage Weil
2015-10-20 20:18         ` Martin Millnert
2015-10-20 20:32         ` James (Fei) Liu-SSI
2015-10-20 20:39           ` James (Fei) Liu-SSI
2015-10-20 21:20           ` Sage Weil
2015-10-19 21:18 ` Wido den Hollander [this message]
2015-10-19 22:40 ` Varada Kari
2015-10-20  0:48 ` John Spray
2015-10-20 20:00   ` Sage Weil
2015-10-20 20:36     ` Gregory Farnum
2015-10-20 21:47       ` Sage Weil
2015-10-20 22:23         ` Ric Wheeler
2015-10-21 13:32           ` Sage Weil
2015-10-21 13:50             ` Ric Wheeler
2015-10-23  6:21               ` Howard Chu
2015-10-23 11:06                 ` Ric Wheeler
2015-10-23 11:47                   ` Ric Wheeler
2015-10-23 14:59                     ` Howard Chu
2015-10-23 16:37                       ` Ric Wheeler
2015-10-23 18:59                       ` Gregory Farnum
2015-10-23 21:23                         ` Howard Chu
2015-10-20 20:42     ` Matt Benjamin
2015-10-22 12:32     ` Milosz Tanski
2015-10-23  3:16       ` Howard Chu
2015-10-23 13:27         ` Milosz Tanski
2015-10-20  2:08 ` Haomai Wang
2015-10-20 12:25   ` Sage Weil
2015-10-20  7:06 ` Dałek, Piotr
2015-10-20 18:31 ` Ric Wheeler
2015-10-20 19:44   ` Sage Weil
2015-10-20 21:43     ` Ric Wheeler
2015-10-20 19:44   ` Yehuda Sadeh-Weinraub
2015-10-21  8:22   ` Orit Wasserman
2015-10-21 11:18     ` Ric Wheeler
2015-10-21 17:30       ` Sage Weil
2015-10-22  8:31         ` Christoph Hellwig
2015-10-22 12:50       ` Sage Weil
2015-10-22 17:42         ` James (Fei) Liu-SSI
2015-10-22 23:42           ` Samuel Just
2015-10-23  0:10             ` Samuel Just
2015-10-23  1:26             ` Allen Samuels
2015-10-23  2:06         ` Ric Wheeler
2015-10-21 10:06   ` Allen Samuels
2015-10-21 11:24     ` Ric Wheeler
2015-10-21 14:14       ` Mark Nelson
2015-10-21 15:51         ` Ric Wheeler
2015-10-21 19:37           ` Mark Nelson
2015-10-21 21:20             ` Martin Millnert
2015-10-22  2:12               ` Allen Samuels
2015-10-22  8:51                 ` Orit Wasserman
2015-10-22  0:53       ` Allen Samuels
2015-10-22  1:16         ` Ric Wheeler
2015-10-22  1:22           ` Allen Samuels
2015-10-23  2:10             ` Ric Wheeler
2015-10-21 13:44     ` Mark Nelson
2015-10-22  1:39       ` Allen Samuels

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=56255E3C.1040607@42on.com \
    --to=wido@42on.com \
    --cc=ceph-devel@vger.kernel.org \
    --cc=sweil@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.