* mon switch from leveldb to rocksdb @ 2016-05-02 18:49 Sage Weil 2016-05-02 19:00 ` Howard Chu 2016-05-02 21:25 ` Wido den Hollander 0 siblings, 2 replies; 18+ messages in thread From: Sage Weil @ 2016-05-02 18:49 UTC (permalink / raw) To: ceph-devel We're thinking about switching the default backend on the mon from leveldb to rocksdb. Rocksdb is better maintained, has a stronger feature set, is generally faster, and is linked statically, which means we won't be vulnerable to buggy distro packages. There is one blocker, though. Some distro leveldbs name the sst files with the .ldb suffix. (Some don't; very annoying.) There is a unit test in rocksdb that tries to verify that ldb is silently renamed to sst, and it passes, but the test is incomplete: the test fails to verify that ldb/sst files can actually be read, and it turns out only the 'check' path (not the normal open-and-read path) handles ldb properly. Anyway, once that works, rocksdb will magically upgrade an existing store from leveldb to rocksdb. Note that once that happens you can't switch from rocksdb back to leveldb without recreating the mon. Alternatively, we could not worry about upgrading existing leveldb instances and just make newly created mons default to rocksdb. 1) Thoughts on moving to rocksdb in general? 2) Importance of leveldb->rocksdb conversion? 3) Anyone want to fix the ldb handling in rocksdb? Thanks! sage ^ permalink raw reply [flat|nested] 18+ messages in thread
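To make the suffix problem concrete, here is a minimal Python sketch that lists which table files in a store directory carry the .ldb suffix and what their .sst names would be. It only plans the renames and touches nothing on disk. The function name `plan_ldb_renames` is made up for this illustration, and this is not how rocksdb itself handles the files; it is just one way to inspect a copy of a mon store before experimenting.

```python
from pathlib import Path


def plan_ldb_renames(store_dir):
    """Return (old_path, new_path) pairs for every *.ldb file in store_dir.

    Hypothetical helper for illustration only: it computes the .sst name
    each .ldb table file would get, without renaming anything.
    """
    store = Path(store_dir)
    # sorted() keeps the output stable across runs and filesystems
    return [(p, p.with_suffix(".sst")) for p in sorted(store.glob("*.ldb"))]
```

Calling `plan_ldb_renames` on a copy of a store directory gives a dry-run list that can be reviewed by hand. Whether renaming alone is enough for rocksdb to read the files is exactly the open question raised above, so this should not be read as a conversion procedure.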
* Re: mon switch from leveldb to rocksdb 2016-05-02 18:49 mon switch from leveldb to rocksdb Sage Weil @ 2016-05-02 19:00 ` Howard Chu 2016-05-03 13:34 ` Mark Nelson 2016-05-02 21:25 ` Wido den Hollander 1 sibling, 1 reply; 18+ messages in thread From: Howard Chu @ 2016-05-02 19:00 UTC (permalink / raw) To: Sage Weil, ceph-devel Sage Weil wrote: > 1) Thoughts on moving to rocksdb in general? Are you actually prepared to undertake all of the measurement and tuning required to make RocksDB actually work well? You're switching from an (abandoned/unsupported) engine with only a handful of config parameters to one with ~40-50 params, all of which have critical but unpredictable impact on resource consumption and performance. -- -- Howard Chu CTO, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/ ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: mon switch from leveldb to rocksdb 2016-05-02 19:00 ` Howard Chu @ 2016-05-03 13:34 ` Mark Nelson 2016-05-03 16:41 ` Gregory Farnum 0 siblings, 1 reply; 18+ messages in thread From: Mark Nelson @ 2016-05-03 13:34 UTC (permalink / raw) To: Howard Chu, Sage Weil, ceph-devel On 05/02/2016 02:00 PM, Howard Chu wrote: > Sage Weil wrote: >> 1) Thoughts on moving to rocksdb in general? > > Are you actually prepared to undertake all of the measurement and tuning > required to make RocksDB actually work well? You're switching from an > (abandoned/unsupported) engine with only a handful of config parameters > to one with ~40-50 params, all of which have critical but unpredictable > impact on resource consumption and performance. > You are absolutely correct, and there are definitely pitfalls we need to watch out for with the number of tunables in rocksdb. At least on the performance side, two of the big issues we've hit with leveldb are compaction related. In some scenarios compaction proceeds more slowly than writes come in, resulting in ever-growing db sizes. The other issue is that compaction is single threaded and this can cause stalls and general mayhem when things get really heavily loaded. My hope is that if we do go with rocksdb, even in a sub-optimally tuned state, we'll be better off than we were with leveldb. We did some very preliminary benchmarks a couple of years ago (admittedly with a too-small dataset) basically comparing the (at the time) stock ceph leveldb settings vs rocksdb. At this dataset size, leveldb looked much better for reads, but much worse for writes. I suspect with much larger data sets, the write issues will only compound with the compaction issues and will start having a much bigger impact. Indeed, if you look at the scatterplots for leveldb, you'll see a regular set of high latency writes. In rocksdb we saw much better looking write behavior, but overall reads were slower. 
We didn't do any real tuning to improve read performance in the leveled compaction tests, but I think we'll be starting out in a much better place to improve them than we are with leveldb. https://drive.google.com/file/d/0B2gTBZrkrnpZN3JFV3RZeVBPWlU/view?usp=sharing Mark ^ permalink raw reply [flat|nested] 18+ messages in thread
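The "compaction slower than incoming writes" failure mode described above can be pictured with a toy model. The following Python sketch is an assumption-laden illustration, not a model of leveldb or rocksdb internals: it treats writes and compaction as constant rates and tracks the uncompacted backlog second by second. The function name and the rates are invented for the example.

```python
def pending_bytes(write_rate, compact_rate, seconds, start=0.0):
    """Toy model: uncompacted backlog after `seconds` at constant rates.

    write_rate and compact_rate are in bytes per second. The backlog can
    never go below zero, because compaction cannot consume data that has
    not been written yet.
    """
    backlog = float(start)
    for _ in range(seconds):
        backlog = max(0.0, backlog + write_rate - compact_rate)
    return backlog
```

With writes at 10 units/s and compaction at 8 units/s the backlog grows without bound (200 units after 100 seconds in this model), while swapping the two rates keeps it at zero. Real stores add write stalls and multi-level effects on top of this, which is where the tuning discussion comes in.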
* Re: mon switch from leveldb to rocksdb 2016-05-03 13:34 ` Mark Nelson @ 2016-05-03 16:41 ` Gregory Farnum 2016-05-03 17:01 ` Mark Nelson 0 siblings, 1 reply; 18+ messages in thread From: Gregory Farnum @ 2016-05-03 16:41 UTC (permalink / raw) To: Mark Nelson; +Cc: Howard Chu, Sage Weil, ceph-devel On Tue, May 3, 2016 at 6:34 AM, Mark Nelson <mnelson@redhat.com> wrote: > On 05/02/2016 02:00 PM, Howard Chu wrote: >> >> Sage Weil wrote: >>> >>> 1) Thoughts on moving to rocksdb in general? >> >> >> Are you actually prepared to undertake all of the measurement and tuning >> required to make RocksDB actually work well? You're switching from an >> (abandoned/unsupported) engine with only a handful of config parameters >> to one with ~40-50 params, all of which have critical but unpredictable >> impact on resource consumption and performance. >> > > You are absolutely correct, and there are definitely pitfalls we need to > watch out for with the number of tunables in rocksdb. At least on the > performance side two of the big issues we've hit with leveldb compaction > related. In some scenarios compaction happens slower than the number of > writes coming in resulting in ever-growing db sizes. The other issue is > that compaction is single threaded and this can cause stalls and general > mayhem when things get really heavily loaded. My hope is that if we do go > with rocksdb, even in a sub-optimally tuned state, we'll be better off than > we were with leveldb. > > We did some very preliminary benchmarks a couple of years ago (admittedly a > too-small dataset size) basically comparing the (at the time) stock ceph > leveldb settings vs rocksdb. On this set size, leveldb looked much better > for reads, but much worse for writes. That's actually a bit troubling — many of our monitor problems have arisen from slow reads, rather than slow writes. I suspect we want to eliminate this before switching, if it's a concern. 
...Although I think I did see a monitor caching layer go by, so maybe it's a moot point now? -Greg -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: mon switch from leveldb to rocksdb 2016-05-03 16:41 ` Gregory Farnum @ 2016-05-03 17:01 ` Mark Nelson 2016-05-03 17:17 ` Sage Weil 0 siblings, 1 reply; 18+ messages in thread From: Mark Nelson @ 2016-05-03 17:01 UTC (permalink / raw) To: Gregory Farnum; +Cc: Howard Chu, Sage Weil, ceph-devel On 05/03/2016 11:41 AM, Gregory Farnum wrote: > On Tue, May 3, 2016 at 6:34 AM, Mark Nelson <mnelson@redhat.com> wrote: >> On 05/02/2016 02:00 PM, Howard Chu wrote: >>> >>> Sage Weil wrote: >>>> >>>> 1) Thoughts on moving to rocksdb in general? >>> >>> >>> Are you actually prepared to undertake all of the measurement and tuning >>> required to make RocksDB actually work well? You're switching from an >>> (abandoned/unsupported) engine with only a handful of config parameters >>> to one with ~40-50 params, all of which have critical but unpredictable >>> impact on resource consumption and performance. >>> >> >> You are absolutely correct, and there are definitely pitfalls we need to >> watch out for with the number of tunables in rocksdb. At least on the >> performance side two of the big issues we've hit with leveldb compaction >> related. In some scenarios compaction happens slower than the number of >> writes coming in resulting in ever-growing db sizes. The other issue is >> that compaction is single threaded and this can cause stalls and general >> mayhem when things get really heavily loaded. My hope is that if we do go >> with rocksdb, even in a sub-optimally tuned state, we'll be better off than >> we were with leveldb. >> >> We did some very preliminary benchmarks a couple of years ago (admittedly a >> too-small dataset size) basically comparing the (at the time) stock ceph >> leveldb settings vs rocksdb. On this set size, leveldb looked much better >> for reads, but much worse for writes. > > That's actually a bit troubling — many of our monitor problems have > arisen from slow reads, rather than slow writes. 
I suspect we want to > eliminate this before switching, if it's a concern. > > ...Although I think I did see a monitor caching layer go by, so maybe > it's a moot point now? Yeah, I suspect that's helping significantly. I think, based at least on what I remember seeing, I'm more concerned about high latency events than average read performance though. I.e., if there is a compaction storm, which store is going to handle it more gracefully with less spiky behavior? In those leveldb tests we only saw writes and write trims hit by those periodic 10-60 second high-latency spikes, but if I recall the mon has (or at least had?) a global lock where write stalls would basically make the whole monitor stall. I think Joao might have improved that after we did this testing but I don't remember the details at this point. > -Greg ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: mon switch from leveldb to rocksdb 2016-05-03 17:01 ` Mark Nelson @ 2016-05-03 17:17 ` Sage Weil 2016-05-03 17:20 ` Gregory Farnum 0 siblings, 1 reply; 18+ messages in thread From: Sage Weil @ 2016-05-03 17:17 UTC (permalink / raw) To: Mark Nelson; +Cc: Gregory Farnum, Howard Chu, ceph-devel [-- Attachment #1: Type: TEXT/PLAIN, Size: 3355 bytes --] On Tue, 3 May 2016, Mark Nelson wrote: > On 05/03/2016 11:41 AM, Gregory Farnum wrote: > > On Tue, May 3, 2016 at 6:34 AM, Mark Nelson <mnelson@redhat.com> wrote: > > > On 05/02/2016 02:00 PM, Howard Chu wrote: > > > > > > > > Sage Weil wrote: > > > > > > > > > > 1) Thoughts on moving to rocksdb in general? > > > > > > > > > > > > Are you actually prepared to undertake all of the measurement and tuning > > > > required to make RocksDB actually work well? You're switching from an > > > > (abandoned/unsupported) engine with only a handful of config parameters > > > > to one with ~40-50 params, all of which have critical but unpredictable > > > > impact on resource consumption and performance. > > > > > > > > > > You are absolutely correct, and there are definitely pitfalls we need to > > > watch out for with the number of tunables in rocksdb. At least on the > > > performance side two of the big issues we've hit with leveldb compaction > > > related. In some scenarios compaction happens slower than the number of > > > writes coming in resulting in ever-growing db sizes. The other issue is > > > that compaction is single threaded and this can cause stalls and general > > > mayhem when things get really heavily loaded. My hope is that if we do go > > > with rocksdb, even in a sub-optimally tuned state, we'll be better off > > > than > > > we were with leveldb. > > > > > > We did some very preliminary benchmarks a couple of years ago (admittedly > > > a > > > too-small dataset size) basically comparing the (at the time) stock ceph > > > leveldb settings vs rocksdb. 
On this set size, leveldb looked much better > > > for reads, but much worse for writes. > > > > That's actually a bit troubling — many of our monitor problems have > > arisen from slow reads, rather than slow writes. I suspect we want to > > eliminate this before switching, if it's a concern. > > > > ...Although I think I did see a monitor caching layer go by, so maybe > > it's a moot point now? > > Yeah, I suspect that's helping significantly. I think based at least one what > I remember seeing I'm more concerned about high latency events than average > read performance though. IE if there is a compaction storm, which store is > going to handle it more gracefully with less spikey behavior? I'm most worried about the read storm that happens on each commit to fetch all the just-updated PG stat keys. The other data in the mon is just noise in comparison, I think, with the exception of the OSDMaps... which IIRC is what the cache you mention was for. The initial PR, https://github.com/ceph/ceph/pull/8888 just makes the backend choice persistent. Rocksdb is still experimental. There's an accompanying ceph-qa-suite pr so that we test both. Once we do some performance evaluation we can decide whether the switch is safe as-is, if more work (caching layer or tuning) is needed, or if it's a bad idea. > In those leveldb tests we only saw writes and write trims hit by those > periodic 10-60 second high-latency spikes, but if I recall the mon has (or at > least had?) a global lock where write stalls would basically make the whole > monitor stall. I think Joao might have improved that after we did this > testing but I don't remember the details at this point. I don't think any of this locking has changed... sage ^ permalink raw reply [flat|nested] 18+ messages in thread
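As a rough picture of what "making the backend choice persistent" could look like, here is a minimal Python sketch. It is not taken from the PR: the marker file name, the function names, and the fallback to leveldb for stores without a marker are all assumptions made for this illustration.

```python
from pathlib import Path

KNOWN_BACKENDS = ("leveldb", "rocksdb")
DEFAULT_BACKEND = "leveldb"   # assumed fallback for stores with no marker
MARKER_NAME = "kv_backend"    # file name chosen for this sketch


def record_backend(mon_dir, backend):
    """Write the chosen backend into the mon data directory at creation time."""
    if backend not in KNOWN_BACKENDS:
        raise ValueError(f"unknown backend: {backend!r}")
    Path(mon_dir, MARKER_NAME).write_text(backend + "\n")


def stored_backend(mon_dir):
    """Return the backend recorded for this store, ignoring current defaults."""
    marker = Path(mon_dir, MARKER_NAME)
    if not marker.exists():
        return DEFAULT_BACKEND
    name = marker.read_text().strip()
    if name not in KNOWN_BACKENDS:
        raise ValueError(f"unknown backend recorded in {marker}: {name!r}")
    return name
```

The point of the design is that an existing store keeps opening with the engine it was created with, even after the configured default changes, so flipping the default only affects newly created mons.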
* Re: mon switch from leveldb to rocksdb 2016-05-03 17:17 ` Sage Weil @ 2016-05-03 17:20 ` Gregory Farnum 2016-05-03 17:23 ` Sage Weil 0 siblings, 1 reply; 18+ messages in thread From: Gregory Farnum @ 2016-05-03 17:20 UTC (permalink / raw) To: Sage Weil; +Cc: Mark Nelson, Howard Chu, ceph-devel On Tue, May 3, 2016 at 10:17 AM, Sage Weil <sweil@redhat.com> wrote: > On Tue, 3 May 2016, Mark Nelson wrote: >> On 05/03/2016 11:41 AM, Gregory Farnum wrote: >> > On Tue, May 3, 2016 at 6:34 AM, Mark Nelson <mnelson@redhat.com> wrote: >> > > On 05/02/2016 02:00 PM, Howard Chu wrote: >> > > > >> > > > Sage Weil wrote: >> > > > > >> > > > > 1) Thoughts on moving to rocksdb in general? >> > > > >> > > > >> > > > Are you actually prepared to undertake all of the measurement and tuning >> > > > required to make RocksDB actually work well? You're switching from an >> > > > (abandoned/unsupported) engine with only a handful of config parameters >> > > > to one with ~40-50 params, all of which have critical but unpredictable >> > > > impact on resource consumption and performance. >> > > > >> > > >> > > You are absolutely correct, and there are definitely pitfalls we need to >> > > watch out for with the number of tunables in rocksdb. At least on the >> > > performance side two of the big issues we've hit with leveldb compaction >> > > related. In some scenarios compaction happens slower than the number of >> > > writes coming in resulting in ever-growing db sizes. The other issue is >> > > that compaction is single threaded and this can cause stalls and general >> > > mayhem when things get really heavily loaded. My hope is that if we do go >> > > with rocksdb, even in a sub-optimally tuned state, we'll be better off >> > > than >> > > we were with leveldb. >> > > >> > > We did some very preliminary benchmarks a couple of years ago (admittedly >> > > a >> > > too-small dataset size) basically comparing the (at the time) stock ceph >> > > leveldb settings vs rocksdb. 
On this set size, leveldb looked much better >> > > for reads, but much worse for writes. >> > >> > That's actually a bit troubling — many of our monitor problems have >> > arisen from slow reads, rather than slow writes. I suspect we want to >> > eliminate this before switching, if it's a concern. >> > >> > ...Although I think I did see a monitor caching layer go by, so maybe >> > it's a moot point now? >> >> Yeah, I suspect that's helping significantly. I think based at least one what >> I remember seeing I'm more concerned about high latency events than average >> read performance though. IE if there is a compaction storm, which store is >> going to handle it more gracefully with less spikey behavior? > > I'm most worried about the read storm that happens on each commit to fetch > all the just-updated PG stat keys. The other data in the mon is just > noise in comparison, I think, with the exception of the OSDMaps... which > IIRC is what the cache you mention was for. > > The initial PR, > > https://github.com/ceph/ceph/pull/8888 > > just makes the backend choice persistent. Rocksdb is still experimental. > There's an accompanying ceph-qa-suite pr so that we test both. Once we > do some performance evaluation we can decide whether the switch is safe > as-is, if more work (caching layer or tuning) is needed, or if it's a bad > idea. > >> In those leveldb tests we only saw writes and write trims hit by those >> periodic 10-60 second high-latency spikes, but if I recall the mon has (or at >> least had?) a global lock where write stalls would basically make the whole >> monitor stall. I think Joao might have improved that after we did this >> testing but I don't remember the details at this point. > > I don't think any of this locking has changed... The paxos state machine is no longer blocked for reads while an unrelated write is happening. Nor are older-version reads on the writing subsystem. That fix is post-firefly, right? 
-Greg ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: mon switch from leveldb to rocksdb 2016-05-03 17:20 ` Gregory Farnum @ 2016-05-03 17:23 ` Sage Weil 2016-05-03 17:32 ` Mark Nelson 0 siblings, 1 reply; 18+ messages in thread From: Sage Weil @ 2016-05-03 17:23 UTC (permalink / raw) To: Gregory Farnum; +Cc: Mark Nelson, Howard Chu, ceph-devel On Tue, 3 May 2016, Gregory Farnum wrote: > >> In those leveldb tests we only saw writes and write trims hit by those > >> periodic 10-60 second high-latency spikes, but if I recall the mon has (or at > >> least had?) a global lock where write stalls would basically make the whole > >> monitor stall. I think Joao might have improved that after we did this > >> testing but I don't remember the details at this point. > > > > I don't think any of this locking has changed... > > The paxos state machine is no longer blocked for reads while an > unrelated write is happening. Nor are older-version reads on the > writing subsystem. That fix is post-firefly, right? Oh yeah, it was new in hammer, I think. I forgot those tests were that old! sage ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: mon switch from leveldb to rocksdb 2016-05-03 17:23 ` Sage Weil @ 2016-05-03 17:32 ` Mark Nelson 0 siblings, 0 replies; 18+ messages in thread From: Mark Nelson @ 2016-05-03 17:32 UTC (permalink / raw) To: Sage Weil, Gregory Farnum; +Cc: Howard Chu, ceph-devel On 05/03/2016 12:23 PM, Sage Weil wrote: > On Tue, 3 May 2016, Gregory Farnum wrote: >>>> In those leveldb tests we only saw writes and write trims hit by those >>>> periodic 10-60 second high-latency spikes, but if I recall the mon has (or at >>>> least had?) a global lock where write stalls would basically make the whole >>>> monitor stall. I think Joao might have improved that after we did this >>>> testing but I don't remember the details at this point. >>> >>> I don't think any of this locking has changed... >> >> The paxos state machine is no longer blocked for reads while an >> unrelated write is happening. Nor are older-version reads on the >> writing subsystem. That fix is post-firefly, right? > > Oh yeah, it was new in hammer, I think. I forgot those tests were that > old! Time goes by fast when you are having fun? Indeed those tests are nearly 2 years old at this point. I guess my question is: given the state machine improvements, do we expect those leveldb write/wtrim latency spikes to still cause major mon stalls? On the other hand, is the osdmap read cache enough to help offset the lower read performance in (this configuration of) rocksdb? I'm still worried about those big leveldb write latency spikes, but maybe it's less of an issue now and the average read performance is a bigger issue. > > sage > ^ permalink raw reply [flat|nested] 18+ messages in thread
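The distinction between average performance and latency spikes is easy to lose in a single benchmark number. The Python sketch below shows one way to report both; the function name and the nearest-rank percentile choice are assumptions for illustration, and no numbers here come from the benchmarks discussed in the thread.

```python
import statistics


def summarize(latencies_ms):
    """Return mean, 99th percentile (nearest rank) and max of the samples."""
    if not latencies_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(latencies_ms)
    n = len(ordered)
    # nearest-rank p99: index ceil(0.99 * n) - 1, computed in integers
    idx = (99 * n + 99) // 100 - 1
    return {
        "mean": statistics.fmean(ordered),
        "p99": ordered[idx],
        "max": ordered[-1],
    }
```

A store with mostly fast operations and occasional stalls can show a better mean than a steadier store while having a far worse p99, which is why comparing the two engines on tail latency as well as averages would help answer the questions above.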
* Re: mon switch from leveldb to rocksdb 2016-05-02 18:49 mon switch from leveldb to rocksdb Sage Weil 2016-05-02 19:00 ` Howard Chu @ 2016-05-02 21:25 ` Wido den Hollander 2016-05-02 21:42 ` Shinobu Kinjo 1 sibling, 1 reply; 18+ messages in thread From: Wido den Hollander @ 2016-05-02 21:25 UTC (permalink / raw) To: Sage Weil, ceph-devel > On 2 May 2016 at 20:49, Sage Weil <sweil@redhat.com> wrote: > > > We're thinking about switching the default backend on the mon from leveldb > to rocksdb. Rocksdb is better maintained, has a stronger feature set, is > generally faster, and is linked statically, which means we won't be > vulnerable to buggy distro packages. > > There is one blocker, though. Some distro leveldbs name the sst files > with the .ldb suffix. (Some don't; very annoying.) There is a unit test > in rocksdb that tries to verify that ldb is silently renamed to sst, > and it passes, but the test is incomplete: the test fails to verify > that ldb/sst files can actually be read, and it turns out only the 'check' > path (not the normal open and read it path) handles ldb properly. > > Anyway, once that works, rocksdb will magically upgrade from leveldb to > rocksdb. Note that once that happens you can't switch from rocksdb back > to leveldb without recreating the mon. > > Alternatively, we could not worry about upgrading existing leveldb > instances and just make newly created mons default to rocksdb. > > 1) Thoughts on moving to rocksdb in general? > > 2) Importance of leveldb->rocksdb conversion? > I would not touch this auto conversion at first. I know there are things to gain, but is the gain large enough to be worth potentially corrupting monitors? Is it that LevelDB doesn't handle large cluster load, for example? IMHO the majority of Ceph clusters are still far below 500 OSDs. Personally I always try to stay away from touching the MONs' datastore. It always feels a bit scary. Wido > 3) Anyone want to fix the ldb handling in rocksdb? > > Thanks! 
> sage ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: mon switch from leveldb to rocksdb 2016-05-02 21:25 ` Wido den Hollander @ 2016-05-02 21:42 ` Shinobu Kinjo 2016-05-02 21:47 ` Sage Weil 0 siblings, 1 reply; 18+ messages in thread From: Shinobu Kinjo @ 2016-05-02 21:42 UTC (permalink / raw) To: Wido den Hollander; +Cc: Sage Weil, Ceph Development If possible, it would be much better to make it pluggable so that we select what we want. On Tue, May 3, 2016 at 6:25 AM, Wido den Hollander <wido@42on.com> wrote: > >> Op 2 mei 2016 om 20:49 schreef Sage Weil <sweil@redhat.com>: >> >> >> We're thinking about switching the default backend on the mon from leveldb >> to rocksdb. Rocksdb is better maintained, has a stronger feature set, is >> generally faster, and is linked statically, which means we won't be >> vulnerable to buggy distro packages. >> >> There is one blocker, though. Some distro leveldbs name the sst files >> with the .ldb suffix. (Some don't; very annoying.) There is a unit test >> in rocksdb that tries to verify that ldb is silently renamed to sst, >> and it passes, but the test is incomplete: the test failes to verify >> that ldb/sst files can actually be read, and it turns out only the 'check' >> path (not the normal open and read it path) handles ldb properly. >> >> Anyway, once that works, rocksdb will magically upgrade from leveldb to >> rocksdb. Note that once that happens you can't switch from rocksdb back >> to leveldb without recreating the mon. >> >> Alternatively, we could not worry about upgrading existing leveldb >> instances and just make newly created mons default to rocksdb. >> >> 1) Thoughts on moving to rocksdb in general? >> >> 2) Importance of leveldb->rocksdb conversion? >> > > I would not touch this auto conversion at first. I know there is things to gain, but is it enough to gain that it might be worth while potentially corrupting monitors? > > Is it that LevelDB doesn't handle large cluster load for example? Imho the majority of Ceph clusters is still far below 500 OSDs. 
> > Personally I always try to stay away from touching the MONs datastore. Always feels a bit scary. > > Wido > >> 3) Anyone want to fix the ldb handling in rocksdb? >> >> Thanks! >> sage -- Email: shinobu@linux.com GitHub: shinobu-x Blog: Life with Distributed Computational System based on OpenSource ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: mon switch from leveldb to rocksdb 2016-05-02 21:42 ` Shinobu Kinjo @ 2016-05-02 21:47 ` Sage Weil 2016-05-03 5:25 ` Zhou, Yuan 0 siblings, 1 reply; 18+ messages in thread From: Sage Weil @ 2016-05-02 21:47 UTC (permalink / raw) To: skinjo; +Cc: Wido den Hollander, Ceph Development On Tue, 3 May 2016, Shinobu Kinjo wrote: > If possible, it would be much better to make it pluggable so that we > select what we want. Yeah, that is the plan. The mon_keyvaluedb will select leveldb or rocksdb. We'd just switch the default over at some point, once we're satisfied with stability. After thinking about this some more I agree with Wido that the conversion isn't useful enough to bother with. We can just make new mons use rocksdb, and if someone wants to convert, they can add/remove/replace mons in their cluster to get there. sage > > On Tue, May 3, 2016 at 6:25 AM, Wido den Hollander <wido@42on.com> wrote: > > > >> Op 2 mei 2016 om 20:49 schreef Sage Weil <sweil@redhat.com>: > >> > >> > >> We're thinking about switching the default backend on the mon from leveldb > >> to rocksdb. Rocksdb is better maintained, has a stronger feature set, is > >> generally faster, and is linked statically, which means we won't be > >> vulnerable to buggy distro packages. > >> > >> There is one blocker, though. Some distro leveldbs name the sst files > >> with the .ldb suffix. (Some don't; very annoying.) There is a unit test > >> in rocksdb that tries to verify that ldb is silently renamed to sst, > >> and it passes, but the test is incomplete: the test failes to verify > >> that ldb/sst files can actually be read, and it turns out only the 'check' > >> path (not the normal open and read it path) handles ldb properly. > >> > >> Anyway, once that works, rocksdb will magically upgrade from leveldb to > >> rocksdb. Note that once that happens you can't switch from rocksdb back > >> to leveldb without recreating the mon. 
> >> > >> Alternatively, we could not worry about upgrading existing leveldb > >> instances and just make newly created mons default to rocksdb. > >> > >> 1) Thoughts on moving to rocksdb in general? > >> > >> 2) Importance of leveldb->rocksdb conversion? > >> > > > > I would not touch this auto conversion at first. I know there is things to gain, but is it enough to gain that it might be worth while potentially corrupting monitors? > > > > Is it that LevelDB doesn't handle large cluster load for example? Imho the majority of Ceph clusters is still far below 500 OSDs. > > > > Personally I always try to stay away from touching the MONs datastore. Always feels a bit scary. > > > > Wido > > > >> 3) Anyone want to fix the ldb handling in rocksdb? > >> > >> Thanks! > >> sage > > > > -- > Email: > shinobu@linux.com > GitHub: > shinobu-x > Blog: > Life with Distributed Computational System based on OpenSource ^ permalink raw reply [flat|nested] 18+ messages in thread
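The "pluggable" plan, with an option named mon_keyvaluedb selecting leveldb or rocksdb, can be sketched as a small registry keyed by backend name. This Python example is only an illustration of the selection pattern: the classes are in-memory stand-ins rather than bindings to either engine, and everything except the option name mon_keyvaluedb is invented here.

```python
BACKENDS = {}


def register(name):
    """Class decorator that records a store class under a backend name."""
    def wrap(cls):
        cls.name = name
        BACKENDS[name] = cls
        return cls
    return wrap


class MemoryStore:
    """In-memory stand-in; a real backend would wrap an actual engine."""
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


@register("leveldb")
class LevelDBStandIn(MemoryStore):
    pass


@register("rocksdb")
class RocksDBStandIn(MemoryStore):
    pass


def open_store(config, default="leveldb"):
    """Pick the backend named by config['mon_keyvaluedb'], else the default."""
    name = config.get("mon_keyvaluedb", default)
    if name not in BACKENDS:
        raise ValueError(f"unsupported mon_keyvaluedb: {name!r}")
    return BACKENDS[name]()
```

Switching the default later then comes down to changing the `default` argument, while an explicit setting such as `open_store({"mon_keyvaluedb": "rocksdb"})` keeps working either way.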
* RE: mon switch from leveldb to rocksdb
  2016-05-02 21:47 ` Sage Weil
@ 2016-05-03 5:25 ` Zhou, Yuan
  2016-05-03 5:28 ` Somnath Roy
  2016-05-03 12:24 ` Sage Weil
  0 siblings, 2 replies; 18+ messages in thread
From: Zhou, Yuan @ 2016-05-03 5:25 UTC
To: Sage Weil, skinjo@redhat.com; +Cc: Wido den Hollander, Ceph Development

Hi Sage,

how about the filestore_omap_backend? It's set to leveldb by default now. Would it be set to rocksdb also?

thanks, -yuan

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
Sent: Tuesday, May 3, 2016 5:47 AM
To: skinjo@redhat.com
Cc: Wido den Hollander <wido@42on.com>; Ceph Development <ceph-devel@vger.kernel.org>
Subject: Re: mon switch from leveldb to rocksdb

On Tue, 3 May 2016, Shinobu Kinjo wrote:
> If possible, it would be much better to make it pluggable so that we
> select what we want.

Yeah, that is the plan. The mon_keyvaluedb will select leveldb or rocksdb. We'd just switch the default over at some point, once we're satisfied with stability.

After thinking about this some more I agree with Wido that the conversion isn't useful enough to bother with. We can just make new mons use rocksdb, and if someone wants to convert, they can add/remove/replace mons in their cluster to get there.

sage

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
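As a rough illustration of the pluggable selection discussed above, the following Python sketch resolves which key/value backend would be used from a ceph.conf-style fragment. The option names mon_keyvaluedb and filestore_omap_backend and their leveldb defaults come from the thread. Everything else is an assumption made for illustration: the section placement ([mon], [osd], [global]), the use of Python's configparser as a stand-in for Ceph's own config parser, and the helper name resolve_backend.

```python
import configparser

# Backends named in the thread; anything else is rejected.
KNOWN_BACKENDS = {"leveldb", "rocksdb"}

# Defaults as stated in the thread at the time of writing (both leveldb).
THREAD_DEFAULTS = {
    "mon_keyvaluedb": "leveldb",
    "filestore_omap_backend": "leveldb",
}


def resolve_backend(conf_text, option, section):
    """Return the backend chosen for `option` in a ceph.conf-style string.

    Hypothetical helper: looks in [global] first, then lets the
    daemon-specific section override it, and falls back to the
    thread's stated default when the option is absent.
    """
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    backend = THREAD_DEFAULTS[option]
    for name in ("global", section):
        if parser.has_option(name, option):
            backend = parser.get(name, option).strip().lower()
    if backend not in KNOWN_BACKENDS:
        raise ValueError(f"unsupported backend for {option}: {backend!r}")
    return backend


# Illustrative fragment: only the mon option is overridden.
EXAMPLE_CONF = """
[global]
fsid = 00000000-0000-0000-0000-000000000000

[mon]
mon_keyvaluedb = rocksdb
"""

if __name__ == "__main__":
    print("mon:", resolve_backend(EXAMPLE_CONF, "mon_keyvaluedb", "mon"))
    print("filestore omap:",
          resolve_backend(EXAMPLE_CONF, "filestore_omap_backend", "osd"))
```

The point of the sketch is only that the two options are independent: overriding the mon backend leaves the FileStore omap backend at its default.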
* RE: mon switch from leveldb to rocksdb
  2016-05-03 5:25 ` Zhou, Yuan
@ 2016-05-03 5:28 ` Somnath Roy
  2016-05-03 6:00 ` Shinobu Kinjo
  2016-05-03 12:24 ` Sage Weil
  1 sibling, 1 reply; 18+ messages in thread
From: Somnath Roy @ 2016-05-03 5:28 UTC
To: Zhou, Yuan, Sage Weil, skinjo@redhat.com
Cc: Wido den Hollander, Ceph Development

I think filestore already supports rocksdb as the OMAP backend.

Thanks & Regards
Somnath
* Re: mon switch from leveldb to rocksdb
  2016-05-03 5:28 ` Somnath Roy
@ 2016-05-03 6:00 ` Shinobu Kinjo
  2016-05-03 6:29 ` Somnath Roy
  0 siblings, 1 reply; 18+ messages in thread
From: Shinobu Kinjo @ 2016-05-03 6:00 UTC
To: Somnath Roy; +Cc: Yuan Zhou, Sage Weil, Wido den Hollander, Ceph Development

> I think filestore already supports rocksdb as the OMAP backend.

If the RocksDB library is there, yes...

What is really challenging here, to me, is what Sage mentioned:

> if someone wants to convert, they can add/remove/replace mons in their cluster to get there.

Maybe this is a related issue:

https://github.com/facebook/rocksdb/issues/677

What do you think?

Cheers,
Shinobu.
* RE: mon switch from leveldb to rocksdb
  2016-05-03 6:00 ` Shinobu Kinjo
@ 2016-05-03 6:29 ` Somnath Roy
  2016-05-03 8:10 ` Shinobu Kinjo
  0 siblings, 1 reply; 18+ messages in thread
From: Somnath Roy @ 2016-05-03 6:29 UTC
To: Shinobu Kinjo; +Cc: Yuan Zhou, Sage Weil, Wido den Hollander, Ceph Development

You need to recreate the OSDs (mkfs) in order to move to rocksdb; it is not a seamless transition, as far as I know.
* Re: mon switch from leveldb to rocksdb
  2016-05-03 6:29 ` Somnath Roy
@ 2016-05-03 8:10 ` Shinobu Kinjo
  0 siblings, 0 replies; 18+ messages in thread
From: Shinobu Kinjo @ 2016-05-03 8:10 UTC
To: Somnath Roy; +Cc: Yuan Zhou, Sage Weil, Wido den Hollander, Ceph Development

On Tue, May 3, 2016 at 3:29 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
> You need to recreate the OSDs (mkfs) in order to move to rocksdb; it is not a seamless transition, as far as I know.

Yeah, you're right.

>> >> We're thinking about switching the default backend on the mon from
>> >> leveldb to rocksdb.

But we're talking about the mon...

--
Email: shinobu@linux.com
GitHub: shinobu-x
Blog: Life with Distributed Computational System based on OpenSource
* RE: mon switch from leveldb to rocksdb
  2016-05-03 5:25 ` Zhou, Yuan
  2016-05-03 5:28 ` Somnath Roy
@ 2016-05-03 12:24 ` Sage Weil
  1 sibling, 0 replies; 18+ messages in thread
From: Sage Weil @ 2016-05-03 12:24 UTC
To: Zhou, Yuan; +Cc: skinjo@redhat.com, Wido den Hollander, Ceph Development

On Tue, 3 May 2016, Zhou, Yuan wrote:
> Hi Sage,
>
> how about the filestore_omap_backend? It's set to leveldb by default
> now. Would it be set to rocksdb also?

I'd rather leave FileStore alone since it will eventually be deprecated. It's also more sensitive to performance variation and we'd need to be a lot more careful making any changes.

sage
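Returning to the blocker from the original post (some distro leveldb builds name their table files with a .ldb suffix rather than .sst), a small Python sketch like the one below could tell an operator which naming an existing store uses. It only inspects file names and says nothing about whether rocksdb can actually read the files. The function names are hypothetical, and which directory to pass (for example a mon's store directory) is left to the caller as an assumption.

```python
import os
import tempfile
from collections import Counter


def table_file_suffixes(store_dir):
    """Count table files in `store_dir` by suffix (.ldb vs .sst)."""
    counts = Counter()
    for name in os.listdir(store_dir):
        _, ext = os.path.splitext(name)
        if ext in (".ldb", ".sst"):
            counts[ext] += 1
    return counts


def uses_ldb_suffix(store_dir):
    """Return True if any table file in the directory ends in .ldb."""
    return table_file_suffixes(store_dir)[".ldb"] > 0


if __name__ == "__main__":
    # Demo against a throwaway directory with made-up file names.
    with tempfile.TemporaryDirectory() as demo_dir:
        for name in ("000005.ldb", "000007.ldb", "CURRENT", "MANIFEST-000004"):
            open(os.path.join(demo_dir, name), "w").close()
        print(dict(table_file_suffixes(demo_dir)))
        print("uses .ldb suffix:", uses_ldb_suffix(demo_dir))
```

A store that reports only .sst files would not be affected by the suffix problem described in the thread; one that reports .ldb files would be.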
end of thread

Thread overview: 18+ messages
2016-05-02 18:49 mon switch from leveldb to rocksdb Sage Weil
2016-05-02 19:00 ` Howard Chu
2016-05-03 13:34 ` Mark Nelson
2016-05-03 16:41 ` Gregory Farnum
2016-05-03 17:01 ` Mark Nelson
2016-05-03 17:17 ` Sage Weil
2016-05-03 17:20 ` Gregory Farnum
2016-05-03 17:23 ` Sage Weil
2016-05-03 17:32 ` Mark Nelson
2016-05-02 21:25 ` Wido den Hollander
2016-05-02 21:42 ` Shinobu Kinjo
2016-05-02 21:47 ` Sage Weil
2016-05-03 5:25 ` Zhou, Yuan
2016-05-03 5:28 ` Somnath Roy
2016-05-03 6:00 ` Shinobu Kinjo
2016-05-03 6:29 ` Somnath Roy
2016-05-03 8:10 ` Shinobu Kinjo
2016-05-03 12:24 ` Sage Weil