* Blueprint: Add LevelDB support to ceph cluster backend store
From: Haomai Wang @ 2013-07-31 3:10 UTC
To: ceph-devel@vger.kernel.org

Every node of a ceph cluster has a backend filesystem, such as btrfs,
xfs or ext4, that provides storage for data objects, whose locations
are determined by the CRUSH algorithm. There should exist an abstract
interface sitting between the osd and the backend store, allowing
different backend store implementations. Currently, we only have a
general POSIX interface. LevelDB is a fast key-value storage library
written at Google that provides an ordered mapping from string keys to
string values. We could implement a LevelDB backend that supports the
basic operations corresponding to the POSIX operations. A LevelDB
driver enables the gateway to communicate with LevelDB to store
objects on a per-node basis.

A LevelDB driver is attractive to folks who have a special use case
such as a write-heavy system. If we can abstract a general interface,
we can choose another DBM if we find it more suitable, such as Kyoto
Cabinet or BDB. Furthermore, we can choose the backend store for each
OSD node, so we can have different OSD types for special purposes.

Expected Results: Objects can be stored reliably in LevelDB. The IO
performance and recovery process should be comparable to the original
stores. And for special cases, the LevelDB driver should have much
better performance than the local filesystem backend driver. Snapshots
and any other features you think of are optional.

Best regards,
Wheats
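A minimal sketch of the kind of abstract interface the proposal describes, with one file-per-object backend and one key/value backend behind it. Every name here (`BackendStore`, `PosixStore`, `KeyValueStore`, `roundtrip`) is hypothetical and not taken from the Ceph source, and a plain dict stands in for LevelDB so the example needs no external library.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class BackendStore(ABC):
    """Hypothetical interface between an OSD and its local store."""

    @abstractmethod
    def write(self, name: str, data: bytes) -> None: ...

    @abstractmethod
    def read(self, name: str) -> bytes: ...

    @abstractmethod
    def remove(self, name: str) -> None: ...


class PosixStore(BackendStore):
    """One file per object inside a directory (names must not contain '/')."""

    def __init__(self, root: str) -> None:
        self.root = root

    def _path(self, name: str) -> str:
        return os.path.join(self.root, name)

    def write(self, name: str, data: bytes) -> None:
        with open(self._path(name), "wb") as f:
            f.write(data)

    def read(self, name: str) -> bytes:
        with open(self._path(name), "rb") as f:
            return f.read()

    def remove(self, name: str) -> None:
        os.remove(self._path(name))


class KeyValueStore(BackendStore):
    """Stand-in for a LevelDB-style backend; a dict replaces the real library."""

    def __init__(self) -> None:
        self.db = {}

    def write(self, name: str, data: bytes) -> None:
        self.db[name.encode()] = bytes(data)

    def read(self, name: str) -> bytes:
        return self.db[name.encode()]

    def remove(self, name: str) -> None:
        del self.db[name.encode()]


def roundtrip(store: BackendStore) -> bytes:
    """Write, read back and delete one object through the common interface."""
    store.write("obj1", b"hello")
    data = store.read("obj1")
    store.remove("obj1")
    return data


if __name__ == "__main__":
    print(roundtrip(KeyValueStore()))
    with tempfile.TemporaryDirectory() as d:
        print(roundtrip(PosixStore(d)))
```

The caller only sees `BackendStore`, so choosing a backend per OSD comes down to which class is instantiated.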
* Re: Blueprint: Add LevelDB support to ceph cluster backend store
From: Alex Elsayed @ 2013-07-30 22:54 UTC
To: ceph-devel

I posted this as a comment on the blueprint, but I figured I'd say it here:

The thing I'd worry about here is that LevelDB's performance (along with
that of various other K/V stores) falls off a cliff for large values.

Symas (who make LMDB, used by OpenLDAP) did some benchmarking that shows
drastic performance loss with 100KB values on both read and write:
http://symas.com/mdb/microbench/#sec4

It's not just disk latency, either - an SSD showed the same behavior:
http://symas.com/mdb/microbench/#sec7

I'd recommend REALLY careful benchmarking with a variety of loads (and value
sizes).
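One way to run the kind of benchmark suggested above is a small harness that writes the same total volume at several value sizes and reports throughput for each. This is only a sketch: `bench_put` is a made-up helper, the sizes are arbitrary, and the dict used below measures nothing about disks. It only shows where a real store's put function would be plugged in.

```python
import os
import time


def bench_put(put, value_sizes, total_bytes=1 << 20):
    """Return {value_size: MB/s} for writing about total_bytes at each size."""
    results = {}
    for size in value_sizes:
        value = os.urandom(size)            # incompressible payload
        count = max(1, total_bytes // size)
        start = time.perf_counter()
        for i in range(count):
            put(b"key-%012d" % i, value)
        # Guard against a zero interval on very fast in-memory stores.
        elapsed = max(time.perf_counter() - start, 1e-9)
        results[size] = (count * size) / elapsed / 1e6
    return results


if __name__ == "__main__":
    store = {}                              # stand-in; swap in a real put()
    sizes = [1 << 10, 4 << 10, 128 << 10, 1 << 20]
    for size, mbps in bench_put(store.__setitem__, sizes).items():
        print(f"{size:>8} B values: {mbps:10.1f} MB/s")
```

Running the same harness against each candidate store, on the same device, gives numbers that can be compared across value sizes.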
* Re: Blueprint: Add LevelDB support to ceph cluster backend store
From: Gregory Farnum @ 2013-07-31 5:56 UTC
To: Alex Elsayed, haomaiwang
Cc: ceph-devel@vger.kernel.org

On Tue, Jul 30, 2013 at 3:54 PM, Alex Elsayed <eternaleye@gmail.com> wrote:
> I posted this as a comment on the blueprint, but I figured I'd say it here:
>
> The thing I'd worry about here is that LevelDB's performance (along with
> that of various other K/V stores) falls off a cliff for large values.
>
> Symas (who make LMDB, used by OpenLDAP) did some benchmarking that shows
> drastic performance loss with 100KB values on both read and write:
> http://symas.com/mdb/microbench/#sec4
>
> It's not just disk latency, either - an SSD showed the same behavior:
> http://symas.com/mdb/microbench/#sec7
>
> I'd recommend REALLY careful benchmarking with a variety of loads (and value
> sizes).

There are various users of leveldb who have tuned it more for workloads
like this; Riak has some stuff (not sure how much) and I believe
HyperDex has some code changes that do a bunch of things, including
better support for large writes.

One thing to keep in mind is that we do already have leveldb in the OSD;
it uses that for "omap" and keeping track of a lot of object metadata
and lookaside stuff. I've asked before about using leveldb as a backing
store and the big trouble with it is that it assumes it's feasible to
copy the values it stores several times; with 4MB objects it really
isn't. That doesn't mean it can't be appropriate for other kinds of
workloads, though, and there are several interface layers for providing
a backing store that could make this pluggable.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
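The cost of copying values several times can be put into rough numbers. The sketch below assumes a log-structured store that writes each value once to its log, once more when the in-memory table is flushed, and again each time compaction rewrites it. The rewrite counts are illustrative guesses, not measurements of leveldb, and `device_bytes_written` is a made-up helper.

```python
MB = 1 << 20


def device_bytes_written(value_size, compaction_rewrites):
    """Bytes hitting the device for one value under the assumed model."""
    log_and_flush = 2                       # assumption: one log write, one flush
    return value_size * (log_and_flush + compaction_rewrites)


if __name__ == "__main__":
    for rewrites in (0, 2, 4):              # hypothetical compaction counts
        total = device_bytes_written(4 * MB, rewrites)
        print(f"{rewrites} compaction rewrites: "
              f"{total // MB} MB written for one 4 MB object")
```

Under this model even a modest number of rewrites multiplies the write volume of every 4MB object, which is negligible for small keys and values but not for whole objects.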
* Re: Blueprint: Add LevelDB support to ceph cluster backend store
From: 袁冬 @ 2013-07-31 6:04 UTC
To: Alex Elsayed
Cc: ceph-devel

We had the same idea and have already tested LevelDB performance against
Btrfs. The result is negative, especially for big-block IO.

                          1KB Block  4KB Block  8KB Block  128KB Block  1MB Block
LevelDB with Compress:    1.77MB/s   5.15MB/s   6.44MB/s   7.64MB/s     13.61MB/s
LevelDB without Compress: 1.12MB/s   3.21MB/s   4.57MB/s   7.28MB/s     13.28MB/s
Btrfs:                    13.84MB/s  12.96MB/s  18.29MB/s  95.26MB/s    109.23MB/s

On 31 July 2013 06:54, Alex Elsayed <eternaleye@gmail.com> wrote:
> I posted this as a comment on the blueprint, but I figured I'd say it here:
>
> The thing I'd worry about here is that LevelDB's performance (along with
> that of various other K/V stores) falls off a cliff for large values.
>
> Symas (who make LMDB, used by OpenLDAP) did some benchmarking that shows
> drastic performance loss with 100KB values on both read and write:
> http://symas.com/mdb/microbench/#sec4
>
> It's not just disk latency, either - an SSD showed the same behavior:
> http://symas.com/mdb/microbench/#sec7
>
> I'd recommend REALLY careful benchmarking with a variety of loads (and value
> sizes).

--
Dong Yuan
Email:yuandong1222@gmail.com
* Re: Blueprint: Add LevelDB support to ceph cluster backend store
From: 袁冬 @ 2013-07-31 6:07 UTC
To: Alex Elsayed
Cc: ceph-devel

A better format result:

1KB Block
  LevelDB with Compress:    1.77MB/s
  LevelDB without Compress: 1.12MB/s
  Btrfs:                    13.84MB/s

4KB Block
  LevelDB with Compress:    5.15MB/s
  LevelDB without Compress: 3.21MB/s
  Btrfs:                    12.96MB/s

8KB Block
  LevelDB with Compress:    6.44MB/s
  LevelDB without Compress: 4.57MB/s
  Btrfs:                    18.29MB/s

128KB Block
  LevelDB with Compress:    7.64MB/s
  LevelDB without Compress: 7.28MB/s
  Btrfs:                    95.26MB/s

1MB Block
  LevelDB with Compress:    13.61MB/s
  LevelDB without Compress: 13.28MB/s
  Btrfs:                    109.23MB/s

--
Dong Yuan
Email:yuandong1222@gmail.com
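The size of the gap is easier to see as a ratio. The snippet below takes the figures posted above and, for each block size, divides the Btrfs throughput by the better of the two LevelDB results. The dictionary layout and the `btrfs_speedup` name are made up for this example; the numbers are copied from the message.

```python
# Throughput in MB/s as reported in the message above.
results = {
    "1KB":   {"leveldb_compress": 1.77,  "leveldb_plain": 1.12,  "btrfs": 13.84},
    "4KB":   {"leveldb_compress": 5.15,  "leveldb_plain": 3.21,  "btrfs": 12.96},
    "8KB":   {"leveldb_compress": 6.44,  "leveldb_plain": 4.57,  "btrfs": 18.29},
    "128KB": {"leveldb_compress": 7.64,  "leveldb_plain": 7.28,  "btrfs": 95.26},
    "1MB":   {"leveldb_compress": 13.61, "leveldb_plain": 13.28, "btrfs": 109.23},
}


def btrfs_speedup(row):
    """Btrfs throughput relative to the faster LevelDB configuration."""
    best_leveldb = max(row["leveldb_compress"], row["leveldb_plain"])
    return row["btrfs"] / best_leveldb


speedups = {block: btrfs_speedup(row) for block, row in results.items()}

if __name__ == "__main__":
    for block, ratio in speedups.items():
        print(f"{block:>6} block: Btrfs is {ratio:.1f}x the best LevelDB result")
```

By this measure the gap is narrowest at 4KB blocks and widest at 128KB blocks, which matches the "especially for big block IO" remark.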
* Re: Blueprint: Add LevelDB support to ceph cluster backend store
From: Sage Weil @ 2013-07-31 6:01 UTC
To: Haomai Wang
Cc: ceph-devel@vger.kernel.org

Hi Haomai,

On Wed, 31 Jul 2013, Haomai Wang wrote:
> Every node of a ceph cluster has a backend filesystem, such as btrfs,
> xfs or ext4, that provides storage for data objects, whose locations
> are determined by the CRUSH algorithm. There should exist an abstract
> interface sitting between the osd and the backend store, allowing
> different backend store implementations. Currently, we only have a
> general POSIX interface. LevelDB is a fast key-value storage library
> written at Google that provides an ordered mapping from string keys to
> string values. We could implement a LevelDB backend that supports the
> basic operations corresponding to the POSIX operations. A LevelDB
> driver enables the gateway to communicate with LevelDB to store
> objects on a per-node basis.
>
> A LevelDB driver is attractive to folks who have a special use case
> such as a write-heavy system. If we can abstract a general interface,
> we can choose another DBM if we find it more suitable, such as Kyoto
> Cabinet or BDB. Furthermore, we can choose the backend store for each
> OSD node, so we can have different OSD types for special purposes.
>
> Expected Results: Objects can be stored reliably in LevelDB. The IO
> performance and recovery process should be comparable to the original
> stores. And for special cases, the LevelDB driver should have much
> better performance than the local filesystem backend driver. Snapshots
> and any other features you think of are optional.

I added a comment in the wiki, but I'll reply here.

Much of what you're talking about is already in place:

- There is an ObjectStore.h abstraction of the local storage. The only
  up to date implementation is FileStore, which uses a combination
  of a local file system and leveldb, but other backends have been used
  in the past, and new ones can be easily added.

- We currently use leveldb for the 'omap' component of rados objects.
  That is, each rados object has a bytestream portion (like a file),
  attr (like extended attributes), and an omap (keys/values). All or
  none of those interfaces can be used for any given object, although
  most users only use one interface at a time. The main limitation here
  if you want to use leveldb only is that we still have an inode in the
  file system to represent each object, even when it contains only
  key/value pairs.

- The use of leveldb itself is also well abstracted by a KeyValueDB
  interface, so other key/value libraries could be swapped in instead.
  The main other component is a middle layer that wraps the kv
  store to provide copy-on-write type semantics for each object's set of
  keys (to facilitate the snapshot functionality in rados/ceph).

If you have a workload that you want to be purely key/value based, it
would be possible to write a much simpler ObjectStore implementation that
ignores or trivially implements the byte and attr portions of the object
in leveldb (or the KeyValueDB abstraction). It would have very different
performance characteristics than what we're doing now, of course. You
might also be interested in looking at the HyperLevelDB project, which is
a fork of leveldb that focuses on multithreading and compaction
performance.

We've heard from other people who are interested in wiring different
key/value backends into the OSD, so any work to make it easier to do that
would be great!

sage
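The three parts of a rados object described above (bytestream, attrs, omap) can all be laid out in a single ordered key/value space. The sketch below is one possible layout, not the real ObjectStore or KeyValueDB API: the class, its method names and the key encoding are all hypothetical, a dict stands in for leveldb, and the copy-on-write layer needed for snapshots is left out.

```python
class KVObjectStore:
    """Hypothetical object store that keeps everything in one key/value map."""

    def __init__(self):
        self.kv = {}    # stand-in for an ordered store such as leveldb

    @staticmethod
    def _key(obj, kind, sub=""):
        # Assumed encoding: "<object>\0<kind>\0<subkey>". Object names
        # containing a NUL byte would need escaping in a real store.
        return f"{obj}\x00{kind}\x00{sub}"

    # Bytestream portion (like a file), stored as a single value.
    def write(self, obj, data):
        self.kv[self._key(obj, "data")] = bytes(data)

    def read(self, obj):
        return self.kv.get(self._key(obj, "data"), b"")

    # Attr portion (like extended attributes).
    def set_attr(self, obj, name, value):
        self.kv[self._key(obj, "attr", name)] = value

    def get_attr(self, obj, name):
        return self.kv[self._key(obj, "attr", name)]

    # Omap portion (keys/values), listed with a prefix scan in key order.
    def omap_set(self, obj, key, value):
        self.kv[self._key(obj, "omap", key)] = value

    def omap_items(self, obj):
        prefix = self._key(obj, "omap")
        n = len(prefix)
        return [(k[n:], v) for k, v in sorted(self.kv.items())
                if k.startswith(prefix)]


if __name__ == "__main__":
    store = KVObjectStore()
    store.write("obj", b"payload")
    store.set_attr("obj", "owner", b"haomai")
    store.omap_set("obj", "b", b"2")
    store.omap_set("obj", "a", b"1")
    print(store.read("obj"), store.get_attr("obj", "owner"), store.omap_items("obj"))
```

Storing the whole bytestream as one value is the "trivially implements" option; it avoids the per-object inode but inherits the large-value behaviour of whatever key/value library sits underneath.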
* Re: Blueprint: Add LevelDB support to ceph cluster backend store
From: Haomai Wang @ 2013-07-31 6:38 UTC
To: Sage Weil
Cc: ceph-devel@vger.kernel.org

2013-7-31, 2:01, Sage Weil <sage@inktank.com> wrote:
> Hi Haomai,
>
> On Wed, 31 Jul 2013, Haomai Wang wrote:
>> Every node of a ceph cluster has a backend filesystem, such as btrfs,
>> xfs or ext4, that provides storage for data objects, whose locations
>> are determined by the CRUSH algorithm. There should exist an abstract
>> interface sitting between the osd and the backend store, allowing
>> different backend store implementations. Currently, we only have a
>> general POSIX interface. LevelDB is a fast key-value storage library
>> written at Google that provides an ordered mapping from string keys to
>> string values. We could implement a LevelDB backend that supports the
>> basic operations corresponding to the POSIX operations. A LevelDB
>> driver enables the gateway to communicate with LevelDB to store
>> objects on a per-node basis.
>>
>> A LevelDB driver is attractive to folks who have a special use case
>> such as a write-heavy system. If we can abstract a general interface,
>> we can choose another DBM if we find it more suitable, such as Kyoto
>> Cabinet or BDB. Furthermore, we can choose the backend store for each
>> OSD node, so we can have different OSD types for special purposes.
>>
>> Expected Results: Objects can be stored reliably in LevelDB. The IO
>> performance and recovery process should be comparable to the original
>> stores. And for special cases, the LevelDB driver should have much
>> better performance than the local filesystem backend driver. Snapshots
>> and any other features you think of are optional.
>
> I added a comment in the wiki, but I'll reply here.
>
> Much of what you're talking about is already in place:
>
> - There is an ObjectStore.h abstraction of the local storage. The only
>   up to date implementation is FileStore, which uses a combination
>   of a local file system and leveldb, but other backends have been used
>   in the past, and new ones can be easily added.
>
> - We currently use leveldb for the 'omap' component of rados objects.
>   That is, each rados object has a bytestream portion (like a file),
>   attr (like extended attributes), and an omap (keys/values). All or
>   none of those interfaces can be used for any given object, although
>   most users only use one interface at a time. The main limitation here
>   if you want to use leveldb only is that we still have an inode in the
>   file system to represent each object, even when it contains only
>   key/value pairs.
>
> - The use of leveldb itself is also well abstracted by a KeyValueDB
>   interface, so other key/value libraries could be swapped in instead.
>   The main other component is a middle layer that wraps the kv
>   store to provide copy-on-write type semantics for each object's set of
>   keys (to facilitate the snapshot functionality in rados/ceph).
>
> If you have a workload that you want to be purely key/value based, it
> would be possible to write a much simpler ObjectStore implementation that
> ignores or trivially implements the byte and attr portions of the object
> in leveldb (or the KeyValueDB abstraction). It would have very different
> performance characteristics than what we're doing now, of course. You
> might also be interested in looking at the HyperLevelDB project, which is
> a fork of leveldb that focuses on multithreading and compaction
> performance.

I'm happy to hear it. I think there may be one thing you left out. If we
abstract a unified interface (or several different ones), different pools
can use different backends for different situations. For example, two
LevelDB-backed OSD nodes could form a distributed k/v store, while three
Btrfs OSD nodes serve a traditional use case. This gives users more room
for imagination.

> We've heard from other people who are interested in wiring different
> key/value backends into the OSD, so any work to make it easier to do that
> would be great!
>
> sage

Best regards,
Wheats
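The two-plus-three example above can be written down as a small mapping from pools to the OSDs whose backend they want. This is only an illustration of the grouping idea: the OSD ids, pool names and the `osds_for_pool` helper are invented, and in a real cluster placement is decided by CRUSH rather than by a lookup like this.

```python
# Hypothetical cluster: two LevelDB-backed OSDs and three Btrfs-backed OSDs.
osd_backend = {0: "leveldb", 1: "leveldb", 2: "btrfs", 3: "btrfs", 4: "btrfs"}

# Hypothetical pools, each asking for one kind of backend.
pool_backend = {"kv-pool": "leveldb", "general-pool": "btrfs"}


def osds_for_pool(pool):
    """Return the ids of the OSDs whose backend matches what the pool wants."""
    wanted = pool_backend[pool]
    return sorted(osd for osd, backend in osd_backend.items() if backend == wanted)


if __name__ == "__main__":
    for pool in pool_backend:
        print(pool, "->", osds_for_pool(pool))
```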
* Re: Blueprint: Add LevelDB support to ceph cluster backend store
From: Sage Weil @ 2013-08-27 23:01 UTC
To: Haomai Wang
Cc: ceph-devel@vger.kernel.org

Hi Haomai,

I just wanted to check in to see if things have progressed at all since we
talked at CDS. If you have any questions or there is anything I can
help with, let me know! I'd love to see this alternative backend make it
into Emperor.

Thanks!
sage
* Re: Blueprint: Add LevelDB support to ceph cluster backend store
From: Haomai Wang @ 2013-08-28 14:12 UTC
To: Sage Weil
Cc: ceph-devel@vger.kernel.org

On Aug 28, 2013, at 7:01 AM, Sage Weil <sage@inktank.com> wrote:

> Hi Haomai,
>
> I just wanted to check in to see if things have progressed at all since we
> talked at CDS. If you have any questions or there is anything I can
> help with, let me know! I'd love to see this alternative backend make it
> into Emperor.

Yes, I'm ready to do it. May I ask how to register the blueprint in
redmine? Is it right to do it directly? Can I follow an example blueprint?

> Thanks!
> sage

Best regards,
Wheats
* Re: Blueprint: Add LevelDB support to ceph cluster backend store
From: Sage Weil @ 2013-08-28 16:17 UTC
To: Haomai Wang
Cc: ceph-devel@vger.kernel.org

On Wed, 28 Aug 2013, Haomai Wang wrote:
> On Aug 28, 2013, at 7:01 AM, Sage Weil <sage@inktank.com> wrote:
>
> > Hi Haomai,
> >
> > I just wanted to check in to see if things have progressed at all since we
> > talked at CDS. If you have any questions or there is anything I can
> > help with, let me know! I'd love to see this alternative backend make it
> > into Emperor.
> Yes, I'm ready to do it. May I ask how to register the blueprint in
> redmine? Is it right to do it directly? Can I follow an example blueprint?

There is no magic connection between the blueprints and redmine (yet).
Just create a redmine account (if you haven't already) and open a Feature
ticket, and cut&paste or link back to the blueprint. (I've added you to
the developer group, which lets you do a number of things that you
couldn't before.)

sage
Thread overview: 10+ messages

2013-07-31  3:10 Blueprint: Add LevelDB support to ceph cluster backend store Haomai Wang
2013-07-30 22:54 ` Alex Elsayed
2013-07-31  5:56   ` Gregory Farnum
2013-07-31  6:04   ` 袁冬
2013-07-31  6:07     ` 袁冬
2013-07-31  6:01 ` Sage Weil
2013-07-31  6:38   ` Haomai Wang
2013-08-27 23:01     ` Sage Weil
2013-08-28 14:12       ` Haomai Wang
2013-08-28 16:17         ` Sage Weil