From mboxrd@z Thu Jan 1 00:00:00 1970
From: Mark Nelson
Subject: Re: newstore direction
Date: Tue, 20 Oct 2015 08:19:37 -0500
Message-ID: <56263F69.9080700@redhat.com>
References: <755F6B91B3BE364F9BCA11EA3F9E0C6F3174851F@SACMBXIP02.sdcorp.global.sandisk.com> <99767EA2E27DD44DB4E9F9B9ACA458C056195331@SSIEXCH-MB3.ssi.samsung.com> <6F3FA899187F0043BA1827A69DA2F7CC036341E8@shsmsx102.ccr.corp.intel.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
Received: from mx1.redhat.com ([209.132.183.28]:43897 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751757AbbJTNTk (ORCPT ); Tue, 20 Oct 2015 09:19:40 -0400
In-Reply-To:
Sender: ceph-devel-owner@vger.kernel.org
List-ID:
To: Sage Weil , "Chen, Xiaoxi"
Cc: "James (Fei) Liu-SSI" , Somnath Roy , "ceph-devel@vger.kernel.org"

On 10/20/2015 07:30 AM, Sage Weil wrote:
> On Tue, 20 Oct 2015, Chen, Xiaoxi wrote:
>> +1. Nowadays K-V DBs care more about very small key-value pairs, say
>> several bytes to a few KB, but in the SSD case we only care about 4KB or
>> 8KB. In this way, NVMKV is a good design, and it seems some of the SSD
>> vendors are also trying to build this kind of interface; we had an NVM-L
>> library but still under development.
>
> Do you have an NVMKV link? I see a paper and a stale github repo.. not
> sure if I'm looking at the right thing.
>
> My concern with using a key/value interface for the object data is that
> you end up with lots of key/value pairs (e.g., $inode_$offset =
> $4kb_of_data) that are pretty inefficient to store and (depending on the
> implementation) tend to break alignment. I don't think these interfaces
> are targeted toward block-sized/aligned payloads. Storing just the
> metadata (block allocation map) w/ the kv api and storing the data
> directly on a block/page interface makes more sense to me.
>
> sage

I get the feeling that some of the folks that were involved with nvmkv
at Fusion IO have left. Nisha Talagala is now out at Parallel Systems,
for instance. http://pmem.io might be a better bet, though I haven't
looked closely at it.

Mark

>
>
>>> -----Original Message-----
>>> From: ceph-devel-owner@vger.kernel.org
>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI
>>> Sent: Tuesday, October 20, 2015 6:21 AM
>>> To: Sage Weil; Somnath Roy
>>> Cc: ceph-devel@vger.kernel.org
>>> Subject: RE: newstore direction
>>>
>>> Hi Sage and Somnath,
>>> In my humble opinion, there is another, more aggressive solution than a
>>> raw-block-device-based key-value store as the backend for objectstore.
>>> The new key-value SSD device with transaction support would be ideal to
>>> solve the issues. First of all, it is a raw SSD device. Secondly, it
>>> provides a key-value interface directly from the SSD. Thirdly, it can
>>> provide transaction support; consistency will be guaranteed by the
>>> hardware device. It pretty much satisfies all of objectstore's needs
>>> without any extra overhead, since there is no extra layer between the
>>> device and objectstore.
>>> Either way, I strongly support Ceph having its own data format instead
>>> of relying on a filesystem.
>>>
>>> Regards,
>>> James
>>>
>>> -----Original Message-----
>>> From: ceph-devel-owner@vger.kernel.org
>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
>>> Sent: Monday, October 19, 2015 1:55 PM
>>> To: Somnath Roy
>>> Cc: ceph-devel@vger.kernel.org
>>> Subject: RE: newstore direction
>>>
>>> On Mon, 19 Oct 2015, Somnath Roy wrote:
>>>> Sage,
>>>> I fully support that. If we want to saturate SSDs, we need to get
>>>> rid of this filesystem overhead (which I am in the process of
>>>> measuring). Also, it will be good if we can eliminate the dependency
>>>> on the k/v dbs (for storing allocators and all). The reason is the
>>>> unknown write amps they cause.
>>>
>>> My hope is to stay behind the KeyValueDB interface (and/or change it as
>>> appropriate) so that other backends can be easily swapped in (e.g. a
>>> btree-based one for high-end flash).
>>>
>>> sage
>>>
>>>
>>>>
>>>> Thanks & Regards
>>>> Somnath
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: ceph-devel-owner@vger.kernel.org
>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
>>>> Sent: Monday, October 19, 2015 12:49 PM
>>>> To: ceph-devel@vger.kernel.org
>>>> Subject: newstore direction
>>>>
>>>> The current design is based on two simple ideas:
>>>>
>>>> 1) a key/value interface is a better way to manage all of our internal
>>>> metadata (object metadata, attrs, layout, collection membership,
>>>> write-ahead logging, overlay data, etc.)
>>>>
>>>> 2) a file system is well suited for storing object data (as files).
>>>>
>>>> So far 1 is working out well, but I'm questioning the wisdom of #2. A
>>>> few things:
>>>>
>>>> - We currently write the data to the file, fsync, then commit the kv
>>>> transaction. That's at least 3 IOs: one for the data, one for the fs
>>>> journal, one for the kv txn to commit (at least once my rocksdb
>>>> changes land... the kv commit is currently 2-3). So two people are
>>>> managing metadata here: the fs managing the file metadata (with its
>>>> own journal) and the kv backend (with its journal).
>>>>
>>>> - On read we have to open files by name, which means traversing the fs
>>>> namespace. Newstore tries to keep it as flat and simple as possible,
>>>> but at a minimum it is a couple of btree lookups. We'd love to use
>>>> open by handle (which would reduce this to 1 btree traversal), but
>>>> running the daemon as ceph and not root makes that hard...
>>>>
>>>> - ...and file systems insist on updating mtime on writes, even when it
>>>> is an overwrite with no allocation changes. (We don't care about
>>>> mtime.) O_NOCMTIME patches exist but it is hard to get these past the
>>>> kernel brainfreeze.
>>>>
>>>> - XFS is (probably) never going to give us data checksums, which we
>>>> want desperately.
>>>>
>>>> But what's the alternative? My thought is to just bite the bullet and
>>>> consume a raw block device directly. Write an allocator, hopefully
>>>> keep it pretty simple, and manage it in the kv store along with all of
>>>> our other metadata.
>>>>
>>>> Wins:
>>>>
>>>> - 2 IOs for most: one to write the data to unused space in the block
>>>> device, one to commit our transaction (vs 4+ before). For overwrites,
>>>> we'd have one io to do our write-ahead log (kv journal), then do the
>>>> overwrite async (vs 4+ before).
>>>>
>>>> - No concern about mtime getting in the way
>>>>
>>>> - Faster reads (no fs lookup)
>>>>
>>>> - Similarly sized metadata for most objects. If we assume most objects
>>>> are not fragmented, then the metadata to store the block offsets is
>>>> about the same size as the metadata to store the filenames we have
>>>> now.
>>>>
>>>> Problems:
>>>>
>>>> - We have to size the kv backend storage (probably still an XFS
>>>> partition) vs the block storage. Maybe we do this anyway (put
>>>> metadata on SSD!) so it won't matter. But what happens when we are
>>>> storing gobs of rgw index data or cephfs metadata? Suddenly we are
>>>> pulling storage out of a different pool and those aren't currently
>>>> fungible.
>>>>
>>>> - We have to write and maintain an allocator. I'm still optimistic
>>>> this can be reasonably simple, especially for the flash case (where
>>>> fragmentation isn't such an issue as long as our blocks are reasonably
>>>> sized). For disk we may need to be moderately clever.
>>>>
>>>> - We'll need a fsck to ensure our internal metadata is consistent. The
>>>> good news is it'll just need to validate what we have stored in the kv
>>>> store.
>>>>
>>>> Other thoughts:
>>>>
>>>> - We might want to consider whether dm-thin or bcache or other block
>>>> layers might help us with elasticity of file vs block areas.
>>>>
>>>> - Rocksdb can push colder data to a second directory, so we could
>>>> have a fast ssd primary area (for wal and most metadata) and a second
>>>> hdd directory for stuff it has to push off. Then have a conservative
>>>> amount of file space on the hdd. If our block fills up, use the
>>>> existing file mechanism to put data there too. (But then we have to
>>>> maintain both the current kv + file approach and not go all-in on kv +
>>>> block.)
>>>>
>>>> Thoughts?
>>>> sage
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>> in the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>
>>>> ________________________________
>>>>
>>>> PLEASE NOTE: The information contained in this electronic mail message
>>>> is intended only for the use of the designated recipient(s) named
>>>> above. If the reader of this message is not the intended recipient,
>>>> you are hereby notified that you have received this message in error
>>>> and that any review, dissemination, distribution, or copying of this
>>>> message is strictly prohibited. If you have received this
>>>> communication in error, please notify the sender by telephone or
>>>> e-mail (as shown above) immediately and destroy any and all copies of
>>>> this message in your possession (whether hard copies or electronically
>>>> stored copies).
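
For readers following the thread, here is a minimal sketch of the "kv +
raw block" write path proposed in the original message: data goes to
unused space on the block device, then a single kv transaction records
where it landed. Everything in it is an assumption made for
illustration: BlockStore, BLOCK, the "object/" and "alloc/free" key
names, the first-fit policy, and the in-memory bytearray and dict that
stand in for the device and the kv store. None of it is Ceph or
newstore code, and overwrites, freeing, and crash recovery are left out.

```python
# Hypothetical sketch; none of these names are Ceph or newstore APIs.
BLOCK = 4096  # assumed allocation unit in bytes


class BlockStore:
    def __init__(self, num_blocks):
        self.dev = bytearray(num_blocks * BLOCK)  # stand-in for the raw block device
        self.kv = {}                              # stand-in for the kv store (metadata only)
        self.free = [(0, num_blocks)]             # free extents as (start_block, length)

    def _allocate(self, nblocks):
        """First-fit: carve nblocks from the first free extent that is large enough."""
        for i, (start, length) in enumerate(self.free):
            if length >= nblocks:
                if length == nblocks:
                    del self.free[i]
                else:
                    self.free[i] = (start + nblocks, length - nblocks)
                return start
        raise MemoryError("no free extent large enough")

    def write_new(self, name, data):
        """New-object write: IO 1 stores the data, IO 2 commits the metadata."""
        nblocks = max(1, -(-len(data) // BLOCK))  # ceiling division
        start = self._allocate(nblocks)
        off = start * BLOCK
        self.dev[off:off + len(data)] = data      # IO 1: data into unused blocks
        # IO 2: one kv "transaction" records the extent and the allocator state.
        self.kv["object/" + name] = (start, nblocks, len(data))
        self.kv["alloc/free"] = list(self.free)
        return start

    def read(self, name):
        """Read: one kv lookup yields the block offset; no fs namespace traversal."""
        start, _nblocks, size = self.kv["object/" + name]
        off = start * BLOCK
        return bytes(self.dev[off:off + size])


store = BlockStore(8)
store.write_new("a", b"x" * 5000)  # spans two 4 KB blocks
store.write_new("b", b"hi")        # takes the next free block
print(store.read("b"))
```

The object record here is a (start, nblocks, size) tuple, which is the
sense in which an unfragmented object's block metadata stays about as
small as a filename would be.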