From: Mark Nelson <mnelson@redhat.com>
To: Allen Samuels <Allen.Samuels@sandisk.com>,
	Sage Weil <sweil@redhat.com>,
	"Chen, Xiaoxi" <xiaoxi.chen@intel.com>
Cc: "James (Fei) Liu-SSI" <james.liu@ssi.samsung.com>,
	Somnath Roy <Somnath.Roy@sandisk.com>,
	"ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
Subject: Re: newstore direction
Date: Wed, 21 Oct 2015 08:35:44 -0500
Message-ID: <562794B0.8050005@redhat.com>
In-Reply-To: <7334B4281E425749B85E08CF7EC6F8534383DD10@SACMBXIP03.sdcorp.global.sandisk.com>

Thanks Allen!  The devil is always in the details.  Know of anything 
else that looks promising?

Mark

On 10/21/2015 05:06 AM, Allen Samuels wrote:
> I doubt that NVMKV will be useful for two reasons:
>
> (1) It relies on the unique sparse-mapping addressing capabilities of the FusionIO VSL interface, so it won't run on standard SSDs.
> (2) NVMKV doesn't provide any form of in-order enumeration (i.e., no range operations on keys). This is pretty much required for deep scrubbing.
>
>
> Allen Samuels
> Software Architect, Fellow, Systems and Software Solutions
>
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416
> allen.samuels@SanDisk.com
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
> Sent: Tuesday, October 20, 2015 6:20 AM
> To: Sage Weil <sweil@redhat.com>; Chen, Xiaoxi <xiaoxi.chen@intel.com>
> Cc: James (Fei) Liu-SSI <james.liu@ssi.samsung.com>; Somnath Roy <Somnath.Roy@sandisk.com>; ceph-devel@vger.kernel.org
> Subject: Re: newstore direction
>
> On 10/20/2015 07:30 AM, Sage Weil wrote:
>> On Tue, 20 Oct 2015, Chen, Xiaoxi wrote:
>>> +1. Nowadays K-V DBs care more about very small key-value pairs, say
>>> several bytes to a few KB, but in the SSD case we only care about 4KB or
>>> 8KB. In this respect NVMKV is a good design, and it seems some of the SSD
>>> vendors are also trying to build this kind of interface. We had an
>>> NVM-L library, but it is still under development.
>>
>> Do you have an NVMKV link?  I see a paper and a stale github repo..
>> not sure if I'm looking at the right thing.
>>
>> My concern with using a key/value interface for the object data is
>> that you end up with lots of key/value pairs (e.g., $inode_$offset =
>> $4kb_of_data) that are pretty inefficient to store and (depending on
>> the implementation) tend to break alignment.  I don't think these
>> interfaces are targeted toward block-sized/aligned payloads.  Storing
>> just the metadata (block allocation map) w/ the kv api and storing the
>> data directly on a block/page interface makes more sense to me.
>>
>> sage
>
> I get the feeling that some of the folks that were involved with nvmkv at Fusion IO have left.  Nisha Talagala is now out at Parallel Systems for instance.  http://pmem.io might be a better bet, though I haven't looked closely at it.
>
> Mark
>
>>
>>
>>>> -----Original Message-----
>>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI
>>>> Sent: Tuesday, October 20, 2015 6:21 AM
>>>> To: Sage Weil; Somnath Roy
>>>> Cc: ceph-devel@vger.kernel.org
>>>> Subject: RE: newstore direction
>>>>
>>>> Hi Sage and Somnath,
>>>>     In my humble opinion, there is another, more aggressive solution
>>>> than a key/value store on a raw block device as the backend for the
>>>> objectstore: a new key/value SSD device with transaction support would
>>>> be ideal to solve these issues. First of all, it is a raw SSD device.
>>>> Secondly, it provides a key/value interface directly from the SSD.
>>>> Thirdly, it can provide transaction support, so consistency will be
>>>> guaranteed by the hardware device. It pretty much satisfies all of the
>>>> objectstore's needs without any extra overhead, since there is no
>>>> extra layer between the device and the objectstore.
>>>>      Either way, I strongly support having Ceph own its data format
>>>> instead of relying on a filesystem.
>>>>
>>>>     Regards,
>>>>     James
>>>>
>>>> -----Original Message-----
>>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
>>>> Sent: Monday, October 19, 2015 1:55 PM
>>>> To: Somnath Roy
>>>> Cc: ceph-devel@vger.kernel.org
>>>> Subject: RE: newstore direction
>>>>
>>>> On Mon, 19 Oct 2015, Somnath Roy wrote:
>>>>> Sage,
>>>>> I fully support that.  If we want to saturate SSDs, we need to get
>>>>> rid of this filesystem overhead (which I am in the process of measuring).
>>>>> Also, it would be good if we could eliminate the dependency on the k/v
>>>>> dbs (for storing allocators and all). The reason is the unknown
>>>>> write amplification they cause.
>>>>
>>>> My hope is to keep it behind the KeyValueDB interface (and/or change
>>>> it as appropriate) so that other backends can be easily swapped in
>>>> (e.g. a btree-based one for high-end flash).
>>>>
>>>> sage
>>>>
>>>>
>>>>>
>>>>> Thanks & Regards
>>>>> Somnath
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: ceph-devel-owner@vger.kernel.org
>>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
>>>>> Sent: Monday, October 19, 2015 12:49 PM
>>>>> To: ceph-devel@vger.kernel.org
>>>>> Subject: newstore direction
>>>>>
>>>>> The current design is based on two simple ideas:
>>>>>
>>>>>    1) a key/value interface is a better way to manage all of our
>>>>> internal metadata (object metadata, attrs, layout, collection
>>>>> membership, write-ahead logging, overlay data, etc.)
>>>>>
>>>>>    2) a file system is well suited for storing object data (as files).
>>>>>
>>>>> So far #1 is working out well, but I'm questioning the wisdom of #2.
>>>>> A few things:
>>>>>
>>>>>    - We currently write the data to the file, fsync, then commit the
>>>>> kv transaction.  That's at least 3 IOs: one for the data, one for
>>>>> the fs journal, one for the kv txn to commit (at least once my
>>>>> rocksdb changes land... the kv commit is currently 2-3).  So two
>>>>> people are managing metadata here: the fs managing the file
>>>>> metadata (with its own journal) and the kv backend (with its
>>>>> journal).
>>>>>
>>>>>    - On read we have to open files by name, which means traversing
>>>>> the fs namespace.  Newstore tries to keep it as flat and simple as
>>>>> possible, but at a minimum it is a couple of btree lookups.  We'd love
>>>>> to use open by handle (which would reduce this to 1 btree
>>>>> traversal), but running the daemon as ceph and not root makes that
>>>>> hard...
>>>>>
>>>>>    - ...and file systems insist on updating mtime on writes, even
>>>>> when it is an overwrite with no allocation changes.  (We don't care
>>>>> about mtime.)  O_NOCMTIME patches exist but it is hard to get these
>>>>> past the kernel brainfreeze.
>>>>>
>>>>>    - XFS is (probably) never going to give us data checksums,
>>>>> which we want desperately.
>>>>>
>>>>> But what's the alternative?  My thought is to just bite the bullet
>>>>> and consume a raw block device directly.  Write an allocator,
>>>>> hopefully keep it pretty simple, and manage it in the kv store along
>>>>> with all of our other metadata.
>>>>>
>>>>> Wins:
>>>>>
>>>>>    - 2 IOs for most: one to write the data to unused space in the
>>>>> block device, one to commit our transaction (vs 4+ before).  For
>>>>> overwrites, we'd have one io to do our write-ahead log (kv journal),
>>>>> then do the overwrite async (vs 4+ before).
>>>>>
>>>>>    - No concern about mtime getting in the way
>>>>>
>>>>>    - Faster reads (no fs lookup)
>>>>>
>>>>>    - Similarly sized metadata for most objects.  If we assume most
>>>>> objects are not fragmented, then the metadata to store the block
>>>>> offsets is about the same size as the metadata to store the
>>>>> filenames we have now.
>>>>>
>>>>> Problems:
>>>>>
>>>>>    - We have to size the kv backend storage (probably still an XFS
>>>>> partition) vs the block storage.  Maybe we do this anyway (put
>>>>> metadata on SSD!) so it won't matter.  But what happens when we are
>>>>> storing gobs of rgw index data or cephfs metadata?  Suddenly we are
>>>>> pulling storage out of a different pool and those aren't currently
>>>>> fungible.
>>>>>
>>>>>    - We have to write and maintain an allocator.  I'm still
>>>>> optimistic this can be reasonably simple, especially for the flash
>>>>> case (where fragmentation isn't such an issue as long as our blocks
>>>>> are reasonably sized).  For disk we may need to be moderately clever.
>>>>>
>>>>>    - We'll need a fsck to ensure our internal metadata is
>>>>> consistent.  The good news is it'll just need to validate what we
>>>>> have stored in the kv store.
>>>>>
>>>>> Other thoughts:
>>>>>
>>>>>    - We might want to consider whether dm-thin or bcache or other
>>>>> block layers might help us with elasticity of file vs block areas.
>>>>>
>>>>>    - Rocksdb can push colder data to a second directory, so we could
>>>>> have a fast ssd primary area (for wal and most metadata) and a
>>>>> second hdd directory for stuff it has to push off.  Then have a
>>>>> conservative amount of file space on the hdd.  If our block fills
>>>>> up, use the existing file mechanism to put data there too.  (But
>>>>> then we have to maintain both the current kv + file approach and
>>>>> not go all-in on kv + block.)
>>>>>
>>>>> Thoughts?
>>>>> sage
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>> in the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>
>>>>> ________________________________
>>>>>
>>>>> PLEASE NOTE: The information contained in this electronic mail
>>>>> message is intended only for the use of the designated recipient(s)
>>>>> named above. If the reader of this message is not the intended
>>>>> recipient, you are hereby notified that you have received this
>>>>> message in error and that any review, dissemination, distribution,
>>>>> or copying of this message is strictly prohibited. If you have
>>>>> received this communication in error, please notify the sender by
>>>>> telephone or e-mail (as shown above) immediately and destroy any and
>>>>> all copies of this message in your possession (whether hard copies
>>>>> or electronically stored copies).
>>>>>
>
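
[Editor's note, not part of the original message.] Allen's second point in the quoted text is that deep scrubbing wants in-order enumeration of keys. Below is a minimal Python sketch of what such a range operation looks like on an ordered key/value store. The names OrderedKV and scrub_collection and the "pg1/objA" key layout are hypothetical illustrations; they are not taken from Ceph, NVMKV, or any real library.

```python
import bisect


class OrderedKV:
    """Toy in-memory key/value store that keeps its keys sorted."""

    def __init__(self):
        self._keys = []   # sorted list of keys
        self._vals = {}   # key -> value

    def put(self, key, value):
        # Only insert into the sorted list the first time a key is seen.
        if key not in self._vals:
            bisect.insort(self._keys, key)
        self._vals[key] = value

    def get(self, key):
        return self._vals[key]

    def range(self, start, end):
        """Yield (key, value) for start <= key < end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        for key in self._keys[lo:hi]:
            yield key, self._vals[key]


def scrub_collection(store, prefix):
    """Return the keys under `prefix` in the order a scrub pass would visit them.

    Assumption: keys are ASCII strings, so prefix + "\xff" sorts after
    every key that starts with prefix.
    """
    return [key for key, _ in store.range(prefix, prefix + "\xff")]


store = OrderedKV()
store.put("pg1/objB", b"b")
store.put("pg2/objA", b"c")
store.put("pg1/objA", b"a")
print(scrub_collection(store, "pg1/"))
```

A store that only supports point lookups by hashed key (as the quoted text describes NVMKV) has no equivalent of range() short of visiting every key, which is why the missing enumeration matters for scrubbing.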

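[Editor's note, not part of the original message.] The original proposal quoted above describes a two-IO write path for new objects: write the data to unused space on the raw block device, then commit one kv transaction that records where it went. The following is a rough Python sketch of that idea under heavy simplification. RawBlockStore, its free list, and the "object/" key prefix are assumptions made for illustration, not newstore code; the bytearray stands in for the block device and the dict stands in for the kv store.

```python
class RawBlockStore:
    """Toy model of 'data on a raw block device, metadata in a kv store'."""

    BLOCK = 4096

    def __init__(self, num_blocks):
        self.device = bytearray(num_blocks * self.BLOCK)  # stand-in for the block device
        self.kv = {}                                      # stand-in for the kv store
        self.free = list(range(num_blocks))               # free block numbers, lowest first
        self.io_count = 0                                 # logical IOs issued so far

    def write_new_object(self, name, data):
        # Round up to whole blocks; an empty object still takes one block.
        nblocks = max(1, -(-len(data) // self.BLOCK))
        if nblocks > len(self.free):
            raise OSError("no free space")
        blocks = self.free[:nblocks]

        # IO 1: write the data into unused space.  Counted as one IO,
        # which assumes the chosen blocks are contiguous on the device.
        for i, blk in enumerate(blocks):
            chunk = data[i * self.BLOCK:(i + 1) * self.BLOCK]
            start = blk * self.BLOCK
            self.device[start:start + len(chunk)] = chunk
        self.io_count += 1

        # IO 2: commit a single kv transaction that records the object's
        # blocks and takes them out of the free list.
        self.kv["object/" + name] = {"blocks": blocks, "length": len(data)}
        del self.free[:nblocks]
        self.io_count += 1

    def read(self, name):
        # No filesystem namespace lookup: one kv get, then direct block reads.
        meta = self.kv["object/" + name]
        out = bytearray()
        for blk in meta["blocks"]:
            out += self.device[blk * self.BLOCK:(blk + 1) * self.BLOCK]
        return bytes(out[:meta["length"]])


store = RawBlockStore(num_blocks=8)
store.write_new_object("a", b"x" * 5000)
print(store.io_count, store.kv["object/a"]["blocks"])
```

The sketch leaves out everything that makes a real allocator hard (fragmentation, freeing, crash consistency, the write-ahead path for overwrites); it only shows why the data write and the metadata commit can be the only two IOs.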