* mon switch from leveldb to rocksdb
@ 2016-05-02 18:49 Sage Weil
2016-05-02 19:00 ` Howard Chu
2016-05-02 21:25 ` Wido den Hollander
0 siblings, 2 replies; 18+ messages in thread
From: Sage Weil @ 2016-05-02 18:49 UTC (permalink / raw)
To: ceph-devel
We're thinking about switching the default backend on the mon from leveldb
to rocksdb. Rocksdb is better maintained, has a stronger feature set, is
generally faster, and is linked statically, which means we won't be
vulnerable to buggy distro packages.
There is one blocker, though. Some distro leveldbs name the sst files
with the .ldb suffix. (Some don't; very annoying.) There is a unit test
in rocksdb that tries to verify that ldb is silently renamed to sst,
and it passes, but the test is incomplete: the test fails to verify
that ldb/sst files can actually be read, and it turns out only the 'check'
path (not the normal open-and-read path) handles ldb properly.
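To see whether a given store would hit the suffix problem described above, one could count the table files by suffix. The following is only a minimal sketch: the helper names are hypothetical, the directory to scan is whatever path the mon store lives at on your system, and it inspects file names only (it does not open the database or say anything about readability).

```python
from collections import Counter
from pathlib import Path


def count_table_suffixes(store_dir):
    """Count table files by suffix (.ldb vs .sst) directly under store_dir."""
    counts = Counter()
    for path in Path(store_dir).iterdir():
        # Only regular files with one of the two table suffixes are counted;
        # other leveldb files (CURRENT, LOG, MANIFEST-*) are ignored.
        if path.is_file() and path.suffix in (".ldb", ".sst"):
            counts[path.suffix] += 1
    return counts


def needs_ldb_handling(store_dir):
    """Return True if the store contains at least one .ldb table file."""
    return count_table_suffixes(store_dir)[".ldb"] > 0
```

A store that reports only .sst files would not depend on the ldb handling at all; one that reports any .ldb files would.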
Anyway, once that works, rocksdb will magically upgrade from leveldb to
rocksdb. Note that once that happens you can't switch from rocksdb back
to leveldb without recreating the mon.
Alternatively, we could not worry about upgrading existing leveldb
instances and just make newly created mons default to rocksdb.
1) Thoughts on moving to rocksdb in general?
2) Importance of leveldb->rocksdb conversion?
3) Anyone want to fix the ldb handling in rocksdb?
Thanks!
sage
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: mon switch from leveldb to rocksdb
2016-05-02 18:49 mon switch from leveldb to rocksdb Sage Weil
@ 2016-05-02 19:00 ` Howard Chu
2016-05-03 13:34 ` Mark Nelson
2016-05-02 21:25 ` Wido den Hollander
1 sibling, 1 reply; 18+ messages in thread
From: Howard Chu @ 2016-05-02 19:00 UTC (permalink / raw)
To: Sage Weil, ceph-devel
Sage Weil wrote:
> 1) Thoughts on moving to rocksdb in general?
Are you actually prepared to undertake all of the measurement and tuning
required to make RocksDB actually work well? You're switching from an
(abandoned/unsupported) engine with only a handful of config parameters to one
with ~40-50 params, all of which have critical but unpredictable impact on
resource consumption and performance.
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: mon switch from leveldb to rocksdb
2016-05-02 18:49 mon switch from leveldb to rocksdb Sage Weil
2016-05-02 19:00 ` Howard Chu
@ 2016-05-02 21:25 ` Wido den Hollander
2016-05-02 21:42 ` Shinobu Kinjo
1 sibling, 1 reply; 18+ messages in thread
From: Wido den Hollander @ 2016-05-02 21:25 UTC (permalink / raw)
To: Sage Weil, ceph-devel
> Op 2 mei 2016 om 20:49 schreef Sage Weil <sweil@redhat.com>:
>
>
> We're thinking about switching the default backend on the mon from leveldb
> to rocksdb. Rocksdb is better maintained, has a stronger feature set, is
> generally faster, and is linked statically, which means we won't be
> vulnerable to buggy distro packages.
>
> There is one blocker, though. Some distro leveldbs name the sst files
> with the .ldb suffix. (Some don't; very annoying.) There is a unit test
> in rocksdb that tries to verify that ldb is silently renamed to sst,
> and it passes, but the test is incomplete: the test fails to verify
> that ldb/sst files can actually be read, and it turns out only the 'check'
> path (not the normal open-and-read path) handles ldb properly.
>
> Anyway, once that works, rocksdb will magically upgrade from leveldb to
> rocksdb. Note that once that happens you can't switch from rocksdb back
> to leveldb without recreating the mon.
>
> Alternatively, we could not worry about upgrading existing leveldb
> instances and just make newly created mons default to rocksdb.
>
> 1) Thoughts on moving to rocksdb in general?
>
> 2) Importance of leveldb->rocksdb conversion?
>
I would not touch this auto conversion at first. I know there are things to gain, but is the gain large enough to be worth the risk of corrupting monitors?
Is it that LevelDB doesn't handle large cluster load, for example? IMHO the majority of Ceph clusters are still far below 500 OSDs.
Personally I always try to stay away from touching the MONs' datastore. It always feels a bit scary.
Wido
> 3) Anyone want to fix the ldb handling in rocksdb?
>
> Thanks!
> sage
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: mon switch from leveldb to rocksdb
2016-05-02 21:25 ` Wido den Hollander
@ 2016-05-02 21:42 ` Shinobu Kinjo
2016-05-02 21:47 ` Sage Weil
0 siblings, 1 reply; 18+ messages in thread
From: Shinobu Kinjo @ 2016-05-02 21:42 UTC (permalink / raw)
To: Wido den Hollander; +Cc: Sage Weil, Ceph Development
If possible, it would be much better to make it pluggable so that we
select what we want.
--
Email:
shinobu@linux.com
GitHub:
shinobu-x
Blog:
Life with Distributed Computational System based on OpenSource
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: mon switch from leveldb to rocksdb
2016-05-02 21:42 ` Shinobu Kinjo
@ 2016-05-02 21:47 ` Sage Weil
2016-05-03 5:25 ` Zhou, Yuan
0 siblings, 1 reply; 18+ messages in thread
From: Sage Weil @ 2016-05-02 21:47 UTC (permalink / raw)
To: skinjo; +Cc: Wido den Hollander, Ceph Development
On Tue, 3 May 2016, Shinobu Kinjo wrote:
> If possible, it would be much better to make it pluggable so that we
> select what we want.
Yeah, that is the plan. The mon_keyvaluedb option will select leveldb or
rocksdb. We'd just switch the default over at some point, once we're
satisfied with stability.
After thinking about this some more I agree with Wido that the conversion
isn't useful enough to bother with. We can just make new mons use
rocksdb, and if someone wants to convert, they can add/remove/replace mons
in their cluster to get there.
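As a rough illustration of the selection described above, the snippet below renders a ceph.conf fragment that sets mon_keyvaluedb. It is a sketch under assumptions: the option name and its two values come from this thread, while the choice of the [mon] section and the helper name render_mon_backend_conf are illustrative, and the setting is assumed to matter only when a mon is created, since existing stores are not converted.

```python
import configparser
import io

# Backend names mentioned in this thread.
KNOWN_BACKENDS = ("leveldb", "rocksdb")


def render_mon_backend_conf(backend):
    """Return a ceph.conf-style fragment selecting the mon key/value backend."""
    if backend not in KNOWN_BACKENDS:
        raise ValueError("unknown mon backend: %r" % (backend,))
    conf = configparser.ConfigParser()
    # Placing the option under [mon] is an assumption made for illustration.
    conf["mon"] = {"mon_keyvaluedb": backend}
    buf = io.StringIO()
    conf.write(buf)
    return buf.getvalue()


if __name__ == "__main__":
    print(render_mon_backend_conf("rocksdb"))
```

With a fragment like this in place before a new mon is created, converting a cluster would amount to replacing mons one at a time, as suggested above.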
sage
^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: mon switch from leveldb to rocksdb
2016-05-02 21:47 ` Sage Weil
@ 2016-05-03 5:25 ` Zhou, Yuan
2016-05-03 5:28 ` Somnath Roy
2016-05-03 12:24 ` Sage Weil
0 siblings, 2 replies; 18+ messages in thread
From: Zhou, Yuan @ 2016-05-03 5:25 UTC (permalink / raw)
To: Sage Weil, skinjo@redhat.com; +Cc: Wido den Hollander, Ceph Development
Hi Sage,
How about filestore_omap_backend? It's set to leveldb by default now. Would its default be switched to rocksdb as well?
thanks, -yuan
^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: mon switch from leveldb to rocksdb
2016-05-03 5:25 ` Zhou, Yuan
@ 2016-05-03 5:28 ` Somnath Roy
2016-05-03 6:00 ` Shinobu Kinjo
2016-05-03 12:24 ` Sage Weil
1 sibling, 1 reply; 18+ messages in thread
From: Somnath Roy @ 2016-05-03 5:28 UTC (permalink / raw)
To: Zhou, Yuan, Sage Weil, skinjo@redhat.com
Cc: Wido den Hollander, Ceph Development
I think filestore already supports rocksdb as the omap backend.
Thanks & Regards
Somnath
PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: mon switch from leveldb to rocksdb
2016-05-03 5:28 ` Somnath Roy
@ 2016-05-03 6:00 ` Shinobu Kinjo
2016-05-03 6:29 ` Somnath Roy
0 siblings, 1 reply; 18+ messages in thread
From: Shinobu Kinjo @ 2016-05-03 6:00 UTC (permalink / raw)
To: Somnath Roy; +Cc: Yuan Zhou, Sage Weil, Wido den Hollander, Ceph Development
> I think filestore is already supporting rocksdb as OMAP..
If the RocksDB library is there, yes...
What looks really challenging to me is what Sage mentioned:
> if someone wants to convert, they can add/remove/replace mons in their cluster to get there.
Maybe this is a related issue:
https://github.com/facebook/rocksdb/issues/677
What do you think?
Cheers,
Shinobu.
^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: mon switch from leveldb to rocksdb
2016-05-03 6:00 ` Shinobu Kinjo
@ 2016-05-03 6:29 ` Somnath Roy
2016-05-03 8:10 ` Shinobu Kinjo
0 siblings, 1 reply; 18+ messages in thread
From: Somnath Roy @ 2016-05-03 6:29 UTC (permalink / raw)
To: Shinobu Kinjo; +Cc: Yuan Zhou, Sage Weil, Wido den Hollander, Ceph Development
You need to recreate the OSDs (mkfs) in order to move to rocksdb; as far as I know, it is not a seamless transition.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: mon switch from leveldb to rocksdb
2016-05-03 6:29 ` Somnath Roy
@ 2016-05-03 8:10 ` Shinobu Kinjo
0 siblings, 0 replies; 18+ messages in thread
From: Shinobu Kinjo @ 2016-05-03 8:10 UTC (permalink / raw)
To: Somnath Roy; +Cc: Yuan Zhou, Sage Weil, Wido den Hollander, Ceph Development
On Tue, May 3, 2016 at 3:29 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
> You need to recreate OSDs (mkfs) in order to move to rocksdb; it is not a seamless transition, as far as I know.
Yeah, you're right.
>> >> We're thinking about switching the default backend on the mon from
>> >> leveldb to rocksdb.
But we're talking about mon...
>
> -----Original Message-----
> From: Shinobu Kinjo [mailto:skinjo@redhat.com]
> Sent: Monday, May 02, 2016 11:00 PM
> To: Somnath Roy
> Cc: Yuan Zhou; Sage Weil; Wido den Hollander; Ceph Development
> Subject: Re: mon switch from leveldb to rocksdb
>
>> I think filestore is already supporting rocksdb as OMAP..
>
> If the RocksDB library is there, yes...
>
> What is really a challenge here, to me, is what Sage mentioned:
>
>> if someone wants to convert, they can add/remove/replace mons in their cluster to get there.
>
> Maybe this is a related issue:
>
> https://github.com/facebook/rocksdb/issues/677
>
> What do you think?
>
> Cheers,
> Shinobu.
>
> ----- Original Message -----
> From: "Somnath Roy" <Somnath.Roy@sandisk.com>
> To: "Yuan Zhou" <yuan.zhou@intel.com>, "Sage Weil" <sage@newdream.net>, skinjo@redhat.com
> Cc: "Wido den Hollander" <wido@42on.com>, "Ceph Development" <ceph-devel@vger.kernel.org>
> Sent: Tuesday, May 3, 2016 2:28:56 PM
> Subject: RE: mon switch from leveldb to rocksdb
>
> I think filestore is already supporting rocksdb as OMAP..
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Zhou, Yuan
> Sent: Monday, May 02, 2016 10:25 PM
> To: Sage Weil; skinjo@redhat.com
> Cc: Wido den Hollander; Ceph Development
> Subject: RE: mon switch from leveldb to rocksdb
>
> Hi Sage,
>
> how about the filestore_omap_backend? It's set to leveldb by default now. Would it be set to rocksdb also?
>
> thanks, -yuan
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Tuesday, May 3, 2016 5:47 AM
> To: skinjo@redhat.com
> Cc: Wido den Hollander <wido@42on.com>; Ceph Development <ceph-devel@vger.kernel.org>
> Subject: Re: mon switch from leveldb to rocksdb
>
> On Tue, 3 May 2016, Shinobu Kinjo wrote:
>> If possible, it would be much better to make it pluggable so that we
>> select what we want.
>
> Yeah, that is the plan. The mon_keyvaluedb will select leveldb or rocksdb. We'd just switch the default over at some point, once we're satisfied with stability.
>
> After thinking about this some more I agree with Wido that the conversion isn't useful enough to bother with. We can just make new mons use rocksdb, and if someone wants to convert, they can add/remove/replace mons in their cluster to get there.
>
> sage
>
>
>
>>
>> On Tue, May 3, 2016 at 6:25 AM, Wido den Hollander <wido@42on.com> wrote:
>> >
>> >> On 2 May 2016 at 20:49, Sage Weil <sweil@redhat.com> wrote:
>> >>
>> >>
>> >> We're thinking about switching the default backend on the mon from
>> >> leveldb to rocksdb. Rocksdb is better maintained, has a stronger
>> >> feature set, is generally faster, and is linked statically, which
>> >> means we won't be vulnerable to buggy distro packages.
>> >>
>> >> There is one blocker, though. Some distro leveldbs name the sst
>> >> files with the .ldb suffix. (Some don't; very annoying.) There is
>> >> a unit test in rocksdb that tries to verify that ldb is silently
>> >> renamed to sst, and it passes, but the test is incomplete: the test
>> >> fails to verify that ldb/sst files can actually be read, and it turns
>> >> out only the 'check' path (not the normal open-and-read path) handles
>> >> ldb properly.
>> >>
>> >> Anyway, once that works, rocksdb will magically upgrade from
>> >> leveldb to rocksdb. Note that once that happens you can't switch
>> >> from rocksdb back to leveldb without recreating the mon.
>> >>
>> >> Alternatively, we could not worry about upgrading existing leveldb
>> >> instances and just make newly created mons default to rocksdb.
>> >>
>> >> 1) Thoughts on moving to rocksdb in general?
>> >>
>> >> 2) Importance of leveldb->rocksdb conversion?
>> >>
>> >
>> > I would not touch this auto conversion at first. I know there are things to gain, but is the gain large enough to be worth potentially corrupting monitors?
>> >
>> > Is it that LevelDB doesn't handle large cluster load, for example? IMHO the majority of Ceph clusters are still far below 500 OSDs.
>> >
>> > Personally, I always try to stay away from touching the MONs' datastore. It always feels a bit scary.
>> >
>> > Wido
>> >
>> >> 3) Anyone want to fix the ldb handling in rocksdb?
>> >>
>> >> Thanks!
>> >> sage
>> >>
>>
>>
>>
>> --
>> Email:
>> shinobu@linux.com
>> GitHub:
>> shinobu-x
>> Blog:
>> Life with Distributed Computational System based on OpenSource
--
Email:
shinobu@linux.com
GitHub:
shinobu-x
Blog:
Life with Distributed Computational System based on OpenSource
^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: mon switch from leveldb to rocksdb
2016-05-03 5:25 ` Zhou, Yuan
2016-05-03 5:28 ` Somnath Roy
@ 2016-05-03 12:24 ` Sage Weil
1 sibling, 0 replies; 18+ messages in thread
From: Sage Weil @ 2016-05-03 12:24 UTC (permalink / raw)
To: Zhou, Yuan; +Cc: skinjo@redhat.com, Wido den Hollander, Ceph Development
On Tue, 3 May 2016, Zhou, Yuan wrote:
> Hi Sage,
>
> how about the filestore_omap_backend? It's set to leveldb by default
> now. Would it be set to rocksdb also?
I'd rather leave FileStore alone since it will eventually be deprecated.
It's also more sensitive to performance variation and we'd need to be a
lot more careful making any changes.
sage
>
> thanks, -yuan
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Tuesday, May 3, 2016 5:47 AM
> To: skinjo@redhat.com
> Cc: Wido den Hollander <wido@42on.com>; Ceph Development <ceph-devel@vger.kernel.org>
> Subject: Re: mon switch from leveldb to rocksdb
>
> On Tue, 3 May 2016, Shinobu Kinjo wrote:
> > If possible, it would be much better to make it pluggable so that we
> > select what we want.
>
> Yeah, that is the plan. The mon_keyvaluedb will select leveldb or rocksdb. We'd just switch the default over at some point, once we're satisfied with stability.
>
> After thinking about this some more I agree with Wido that the conversion isn't useful enough to bother with. We can just make new mons use rocksdb, and if someone wants to convert, they can add/remove/replace mons in their cluster to get there.
>
> sage
>
>
>
> >
> > On Tue, May 3, 2016 at 6:25 AM, Wido den Hollander <wido@42on.com> wrote:
> > >
> > >> On 2 May 2016 at 20:49, Sage Weil <sweil@redhat.com> wrote:
> > >>
> > >>
> > >> We're thinking about switching the default backend on the mon from
> > >> leveldb to rocksdb. Rocksdb is better maintained, has a stronger
> > >> feature set, is generally faster, and is linked statically, which
> > >> means we won't be vulnerable to buggy distro packages.
> > >>
> > >> There is one blocker, though. Some distro leveldbs name the sst
> > >> files with the .ldb suffix. (Some don't; very annoying.) There is
> > >> a unit test in rocksdb that tries to verify that ldb is silently
> > >> renamed to sst, and it passes, but the test is incomplete: the test
> > >> fails to verify that ldb/sst files can actually be read, and it turns
> > >> out only the 'check' path (not the normal open-and-read path) handles
> > >> ldb properly.
> > >>
> > >> Anyway, once that works, rocksdb will magically upgrade from
> > >> leveldb to rocksdb. Note that once that happens you can't switch
> > >> from rocksdb back to leveldb without recreating the mon.
> > >>
> > >> Alternatively, we could not worry about upgrading existing leveldb
> > >> instances and just make newly created mons default to rocksdb.
> > >>
> > >> 1) Thoughts on moving to rocksdb in general?
> > >>
> > >> 2) Importance of leveldb->rocksdb conversion?
> > >>
> > >
> > > I would not touch this auto conversion at first. I know there are things to gain, but is the gain large enough to be worth potentially corrupting monitors?
> > >
> > > Is it that LevelDB doesn't handle large cluster load, for example? IMHO the majority of Ceph clusters are still far below 500 OSDs.
> > >
> > > Personally, I always try to stay away from touching the MONs' datastore. It always feels a bit scary.
> > >
> > > Wido
> > >
> > >> 3) Anyone want to fix the ldb handling in rocksdb?
> > >>
> > >> Thanks!
> > >> sage
> > >>
> >
> >
> >
> > --
> > Email:
> > shinobu@linux.com
> > GitHub:
> > shinobu-x
> > Blog:
> > Life with Distributed Computational System based on OpenSource
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: mon switch from leveldb to rocksdb
2016-05-02 19:00 ` Howard Chu
@ 2016-05-03 13:34 ` Mark Nelson
2016-05-03 16:41 ` Gregory Farnum
0 siblings, 1 reply; 18+ messages in thread
From: Mark Nelson @ 2016-05-03 13:34 UTC (permalink / raw)
To: Howard Chu, Sage Weil, ceph-devel
On 05/02/2016 02:00 PM, Howard Chu wrote:
> Sage Weil wrote:
>> 1) Thoughts on moving to rocksdb in general?
>
> Are you actually prepared to undertake all of the measurement and tuning
> required to make RocksDB actually work well? You're switching from an
> (abandoned/unsupported) engine with only a handful of config parameters
> to one with ~40-50 params, all of which have critical but unpredictable
> impact on resource consumption and performance.
>
You are absolutely correct, and there are definitely pitfalls we need to
watch out for with the number of tunables in rocksdb. At least on the
performance side, two of the big issues we've hit with leveldb are
compaction related. In some scenarios compaction runs slower than the
rate of incoming writes, resulting in ever-growing db sizes. The other
issue is that compaction is single threaded, and this can cause stalls
and general mayhem when things get really heavily loaded. My hope is
that if we do go with rocksdb, even in a sub-optimally tuned state,
we'll be better off than we were with leveldb.
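To make the first of these compaction problems concrete, here is a
back-of-the-envelope sketch in Python. It assumes constant write and
compaction rates; the function name and the rates are hypothetical and
are not measurements from any test mentioned in this thread. Whenever
writes arrive faster than compaction retires them, the uncompacted
backlog grows linearly with time.

```python
def pending_backlog(write_rate, compaction_rate, seconds, initial=0.0):
    """Uncompacted data left after `seconds`, in the same unit as the rates.

    Assumes both rates are constant (a simplification); the backlog can
    never drop below zero.
    """
    backlog = initial + (write_rate - compaction_rate) * seconds
    return max(backlog, 0.0)


# Hypothetical rates in MB/s over one hour.
print(pending_backlog(write_rate=12.0, compaction_rate=10.0, seconds=3600))  # 7200.0
print(pending_backlog(write_rate=8.0, compaction_rate=10.0, seconds=3600))   # 0.0
```

In the first case the store is 7200 MB behind after an hour and keeps
falling further behind; in the second, compaction keeps up.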
We did some very preliminary benchmarks a couple of years ago
(admittedly with a too-small dataset) basically comparing the (at the
time) stock ceph leveldb settings vs rocksdb. At this dataset size,
leveldb looked much better for reads, but much worse for writes. I
suspect that with much larger data sets, the write issues will compound
with the compaction issues and start having a much bigger impact.
Indeed, if you look at the scatterplots for leveldb, you'll see a
regular set of high latency writes. In rocksdb we saw much better
looking write behavior, but overall reads were slower. We didn't do any
real tuning to improve read performance in the leveled compaction tests,
but I think we'll be starting out in a much better place to improve them
than we are with leveldb.
https://drive.google.com/file/d/0B2gTBZrkrnpZN3JFV3RZeVBPWlU/view?usp=sharing
Mark
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: mon switch from leveldb to rocksdb
2016-05-03 13:34 ` Mark Nelson
@ 2016-05-03 16:41 ` Gregory Farnum
2016-05-03 17:01 ` Mark Nelson
0 siblings, 1 reply; 18+ messages in thread
From: Gregory Farnum @ 2016-05-03 16:41 UTC (permalink / raw)
To: Mark Nelson; +Cc: Howard Chu, Sage Weil, ceph-devel
On Tue, May 3, 2016 at 6:34 AM, Mark Nelson <mnelson@redhat.com> wrote:
> On 05/02/2016 02:00 PM, Howard Chu wrote:
>>
>> Sage Weil wrote:
>>>
>>> 1) Thoughts on moving to rocksdb in general?
>>
>>
>> Are you actually prepared to undertake all of the measurement and tuning
>> required to make RocksDB actually work well? You're switching from an
>> (abandoned/unsupported) engine with only a handful of config parameters
>> to one with ~40-50 params, all of which have critical but unpredictable
>> impact on resource consumption and performance.
>>
>
> You are absolutely correct, and there are definitely pitfalls we need to
> watch out for with the number of tunables in rocksdb. At least on the
> performance side, two of the big issues we've hit with leveldb are compaction
> related. In some scenarios compaction runs slower than the rate of incoming
> writes, resulting in ever-growing db sizes. The other issue is
> that compaction is single threaded and this can cause stalls and general
> mayhem when things get really heavily loaded. My hope is that if we do go
> with rocksdb, even in a sub-optimally tuned state, we'll be better off than
> we were with leveldb.
>
> We did some very preliminary benchmarks a couple of years ago (admittedly a
> too-small dataset size) basically comparing the (at the time) stock ceph
> leveldb settings vs rocksdb. On this set size, leveldb looked much better
> for reads, but much worse for writes.
That's actually a bit troubling — many of our monitor problems have
arisen from slow reads, rather than slow writes. I suspect we want to
eliminate this before switching, if it's a concern.
...Although I think I did see a monitor caching layer go by, so maybe
it's a moot point now?
-Greg
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: mon switch from leveldb to rocksdb
2016-05-03 16:41 ` Gregory Farnum
@ 2016-05-03 17:01 ` Mark Nelson
2016-05-03 17:17 ` Sage Weil
0 siblings, 1 reply; 18+ messages in thread
From: Mark Nelson @ 2016-05-03 17:01 UTC (permalink / raw)
To: Gregory Farnum; +Cc: Howard Chu, Sage Weil, ceph-devel
On 05/03/2016 11:41 AM, Gregory Farnum wrote:
> On Tue, May 3, 2016 at 6:34 AM, Mark Nelson <mnelson@redhat.com> wrote:
>> On 05/02/2016 02:00 PM, Howard Chu wrote:
>>>
>>> Sage Weil wrote:
>>>>
>>>> 1) Thoughts on moving to rocksdb in general?
>>>
>>>
>>> Are you actually prepared to undertake all of the measurement and tuning
>>> required to make RocksDB actually work well? You're switching from an
>>> (abandoned/unsupported) engine with only a handful of config parameters
>>> to one with ~40-50 params, all of which have critical but unpredictable
>>> impact on resource consumption and performance.
>>>
>>
>> You are absolutely correct, and there are definitely pitfalls we need to
>> watch out for with the number of tunables in rocksdb. At least on the
>> performance side, two of the big issues we've hit with leveldb are compaction
>> related. In some scenarios compaction runs slower than the rate of incoming
>> writes, resulting in ever-growing db sizes. The other issue is
>> that compaction is single threaded and this can cause stalls and general
>> mayhem when things get really heavily loaded. My hope is that if we do go
>> with rocksdb, even in a sub-optimally tuned state, we'll be better off than
>> we were with leveldb.
>>
>> We did some very preliminary benchmarks a couple of years ago (admittedly a
>> too-small dataset size) basically comparing the (at the time) stock ceph
>> leveldb settings vs rocksdb. On this set size, leveldb looked much better
>> for reads, but much worse for writes.
>
> That's actually a bit troubling — many of our monitor problems have
> arisen from slow reads, rather than slow writes. I suspect we want to
> eliminate this before switching, if it's a concern.
>
> ...Although I think I did see a monitor caching layer go by, so maybe
> it's a moot point now?
Yeah, I suspect that's helping significantly. Based at least on what I
remember seeing, though, I'm more concerned about high-latency events
than average read performance. I.e., if there is a compaction storm,
which store is going to handle it more gracefully, with less spiky
behavior?
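The difference between average performance and high-latency events can
be illustrated with a small Python sketch. The two latency samples below
are invented for illustration (they are not taken from the benchmark
linked earlier in this thread), and `percentile` is a hypothetical
nearest-rank helper. The point is only that a store can have the better
mean and still the far worse tail.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile of `samples` for pct in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


# Hypothetical per-operation latencies in milliseconds.
steady = [8] * 100               # uniformly mediocre
spiky = [1] * 98 + [300] * 2     # usually fast, occasionally stalled

for name, samples in (("steady", steady), ("spiky", spiky)):
    print(name, statistics.mean(samples), percentile(samples, 99))
```

Here the spiky sample has the lower mean (6.98 ms against 8 ms) but a
99th percentile of 300 ms against 8 ms, which is why comparing averages
alone would not settle the question above.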
In those leveldb tests we only saw writes and write trims hit by those
periodic 10-60 second high-latency spikes, but if I recall correctly the
mon has (or at least had?) a global lock, so write stalls would basically
make the whole monitor stall. I think Joao might have improved that after
we did this testing, but I don't remember the details at this point.
> -Greg
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: mon switch from leveldb to rocksdb
2016-05-03 17:01 ` Mark Nelson
@ 2016-05-03 17:17 ` Sage Weil
2016-05-03 17:20 ` Gregory Farnum
0 siblings, 1 reply; 18+ messages in thread
From: Sage Weil @ 2016-05-03 17:17 UTC (permalink / raw)
To: Mark Nelson; +Cc: Gregory Farnum, Howard Chu, ceph-devel
On Tue, 3 May 2016, Mark Nelson wrote:
> On 05/03/2016 11:41 AM, Gregory Farnum wrote:
> > On Tue, May 3, 2016 at 6:34 AM, Mark Nelson <mnelson@redhat.com> wrote:
> > > On 05/02/2016 02:00 PM, Howard Chu wrote:
> > > >
> > > > Sage Weil wrote:
> > > > >
> > > > > 1) Thoughts on moving to rocksdb in general?
> > > >
> > > >
> > > > Are you actually prepared to undertake all of the measurement and tuning
> > > > required to make RocksDB actually work well? You're switching from an
> > > > (abandoned/unsupported) engine with only a handful of config parameters
> > > > to one with ~40-50 params, all of which have critical but unpredictable
> > > > impact on resource consumption and performance.
> > > >
> > >
> > > You are absolutely correct, and there are definitely pitfalls we need to
> > > watch out for with the number of tunables in rocksdb. At least on the
> > > performance side, two of the big issues we've hit with leveldb are compaction
> > > related. In some scenarios compaction runs slower than the rate of incoming
> > > writes, resulting in ever-growing db sizes. The other issue is
> > > that compaction is single threaded and this can cause stalls and general
> > > mayhem when things get really heavily loaded. My hope is that if we do go
> > > with rocksdb, even in a sub-optimally tuned state, we'll be better off
> > > than
> > > we were with leveldb.
> > >
> > > We did some very preliminary benchmarks a couple of years ago (admittedly
> > > a
> > > too-small dataset size) basically comparing the (at the time) stock ceph
> > > leveldb settings vs rocksdb. On this set size, leveldb looked much better
> > > for reads, but much worse for writes.
> >
> > That's actually a bit troubling — many of our monitor problems have
> > arisen from slow reads, rather than slow writes. I suspect we want to
> > eliminate this before switching, if it's a concern.
> >
> > ...Although I think I did see a monitor caching layer go by, so maybe
> > it's a moot point now?
>
> Yeah, I suspect that's helping significantly. Based at least on what I
> remember seeing, though, I'm more concerned about high-latency events than
> average read performance. I.e., if there is a compaction storm, which store
> is going to handle it more gracefully, with less spiky behavior?
I'm most worried about the read storm that happens on each commit to fetch
all the just-updated PG stat keys. The other data in the mon is just
noise in comparison, I think, with the exception of the OSDMaps... which
IIRC is what the cache you mention was for.
The initial PR,
https://github.com/ceph/ceph/pull/8888
just makes the backend choice persistent. Rocksdb is still experimental.
There's an accompanying ceph-qa-suite pr so that we test both. Once we
do some performance evaluation we can decide whether the switch is safe
as-is, if more work (caching layer or tuning) is needed, or if it's a bad
idea.
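As a rough illustration of what a persistent backend choice could look
like, the Python sketch below records the backend in a marker file when
a store is created and reads it back on open, so that a later change to
the default does not affect existing stores. This is an assumption about
the general approach, not a description of what the PR does: the marker
file name `kv_backend_marker` and both helper functions are hypothetical.

```python
import os
import tempfile

LEGACY_BACKEND = "leveldb"          # stores created before any marker existed
MARKER_NAME = "kv_backend_marker"   # hypothetical file name


def create_store(path, backend):
    """Create a store directory and record which backend it uses."""
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, MARKER_NAME), "w") as f:
        f.write(backend + "\n")


def backend_for(path):
    """Return the recorded backend, or the legacy one if no marker exists."""
    try:
        with open(os.path.join(path, MARKER_NAME)) as f:
            return f.read().strip()
    except FileNotFoundError:
        return LEGACY_BACKEND


with tempfile.TemporaryDirectory() as root:
    old_store = os.path.join(root, "mon.a")
    os.makedirs(old_store)                  # pre-existing store, no marker
    new_store = os.path.join(root, "mon.b")
    create_store(new_store, "rocksdb")      # newly created store
    print(backend_for(old_store), backend_for(new_store))
```

With this shape, flipping the default only changes what is passed to
`create_store` for new stores; an existing store keeps opening with the
backend it was created with.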
> In those leveldb tests we only saw writes and write trims hit by those
> periodic 10-60 second high-latency spikes, but if I recall the mon has (or at
> least had?) a global lock where write stalls would basically make the whole
> monitor stall. I think Joao might have improved that after we did this
> testing but I don't remember the details at this point.
I don't think any of this locking has changed...
sage
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: mon switch from leveldb to rocksdb
2016-05-03 17:17 ` Sage Weil
@ 2016-05-03 17:20 ` Gregory Farnum
2016-05-03 17:23 ` Sage Weil
0 siblings, 1 reply; 18+ messages in thread
From: Gregory Farnum @ 2016-05-03 17:20 UTC (permalink / raw)
To: Sage Weil; +Cc: Mark Nelson, Howard Chu, ceph-devel
On Tue, May 3, 2016 at 10:17 AM, Sage Weil <sweil@redhat.com> wrote:
> On Tue, 3 May 2016, Mark Nelson wrote:
>> On 05/03/2016 11:41 AM, Gregory Farnum wrote:
>> > On Tue, May 3, 2016 at 6:34 AM, Mark Nelson <mnelson@redhat.com> wrote:
>> > > On 05/02/2016 02:00 PM, Howard Chu wrote:
>> > > >
>> > > > Sage Weil wrote:
>> > > > >
>> > > > > 1) Thoughts on moving to rocksdb in general?
>> > > >
>> > > >
>> > > > Are you actually prepared to undertake all of the measurement and tuning
>> > > > required to make RocksDB actually work well? You're switching from an
>> > > > (abandoned/unsupported) engine with only a handful of config parameters
>> > > > to one with ~40-50 params, all of which have critical but unpredictable
>> > > > impact on resource consumption and performance.
>> > > >
>> > >
>> > > You are absolutely correct, and there are definitely pitfalls we need to
>> > > watch out for with the number of tunables in rocksdb. At least on the
>> > > performance side, two of the big issues we've hit with leveldb are compaction
>> > > related. In some scenarios compaction runs slower than the rate of incoming
>> > > writes, resulting in ever-growing db sizes. The other issue is
>> > > that compaction is single threaded and this can cause stalls and general
>> > > mayhem when things get really heavily loaded. My hope is that if we do go
>> > > with rocksdb, even in a sub-optimally tuned state, we'll be better off
>> > > than
>> > > we were with leveldb.
>> > >
>> > > We did some very preliminary benchmarks a couple of years ago (admittedly
>> > > a
>> > > too-small dataset size) basically comparing the (at the time) stock ceph
>> > > leveldb settings vs rocksdb. On this set size, leveldb looked much better
>> > > for reads, but much worse for writes.
>> >
>> > That's actually a bit troubling — many of our monitor problems have
>> > arisen from slow reads, rather than slow writes. I suspect we want to
>> > eliminate this before switching, if it's a concern.
>> >
>> > ...Although I think I did see a monitor caching layer go by, so maybe
>> > it's a moot point now?
>>
>> Yeah, I suspect that's helping significantly. Based at least on what I
>> remember seeing, though, I'm more concerned about high-latency events than
>> average read performance. I.e., if there is a compaction storm, which store
>> is going to handle it more gracefully, with less spiky behavior?
>
> I'm most worried about the read storm that happens on each commit to fetch
> all the just-updated PG stat keys. The other data in the mon is just
> noise in comparison, I think, with the exception of the OSDMaps... which
> IIRC is what the cache you mention was for.
>
> The initial PR,
>
> https://github.com/ceph/ceph/pull/8888
>
> just makes the backend choice persistent. Rocksdb is still experimental.
> There's an accompanying ceph-qa-suite pr so that we test both. Once we
> do some performance evaluation we can decide whether the switch is safe
> as-is, if more work (caching layer or tuning) is needed, or if it's a bad
> idea.
>
>> In those leveldb tests we only saw writes and write trims hit by those
>> periodic 10-60 second high-latency spikes, but if I recall the mon has (or at
>> least had?) a global lock where write stalls would basically make the whole
>> monitor stall. I think Joao might have improved that after we did this
>> testing but I don't remember the details at this point.
>
> I don't think any of this locking has changed...
The paxos state machine is no longer blocked for reads while an
unrelated write is happening. Nor are older-version reads on the
writing subsystem. That fix is post-firefly, right?
-Greg
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: mon switch from leveldb to rocksdb
2016-05-03 17:20 ` Gregory Farnum
@ 2016-05-03 17:23 ` Sage Weil
2016-05-03 17:32 ` Mark Nelson
0 siblings, 1 reply; 18+ messages in thread
From: Sage Weil @ 2016-05-03 17:23 UTC (permalink / raw)
To: Gregory Farnum; +Cc: Mark Nelson, Howard Chu, ceph-devel
On Tue, 3 May 2016, Gregory Farnum wrote:
> >> In those leveldb tests we only saw writes and write trims hit by those
> >> periodic 10-60 second high-latency spikes, but if I recall the mon has (or at
> >> least had?) a global lock where write stalls would basically make the whole
> >> monitor stall. I think Joao might have improved that after we did this
> >> testing but I don't remember the details at this point.
> >
> > I don't think any of this locking has changed...
>
> The paxos state machine is no longer blocked for reads while an
> unrelated write is happening. Nor are older-version reads on the
> writing subsystem. That fix is post-firefly, right?
Oh yeah, it was new in hammer, I think. I forgot those tests were that
old!
sage
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: mon switch from leveldb to rocksdb
2016-05-03 17:23 ` Sage Weil
@ 2016-05-03 17:32 ` Mark Nelson
0 siblings, 0 replies; 18+ messages in thread
From: Mark Nelson @ 2016-05-03 17:32 UTC (permalink / raw)
To: Sage Weil, Gregory Farnum; +Cc: Howard Chu, ceph-devel
On 05/03/2016 12:23 PM, Sage Weil wrote:
> On Tue, 3 May 2016, Gregory Farnum wrote:
>>>> In those leveldb tests we only saw writes and write trims hit by those
>>>> periodic 10-60 second high-latency spikes, but if I recall the mon has (or at
>>>> least had?) a global lock where write stalls would basically make the whole
>>>> monitor stall. I think Joao might have improved that after we did this
>>>> testing but I don't remember the details at this point.
>>>
>>> I don't think any of this locking has changed...
>>
>> The paxos state machine is no longer blocked for reads while an
>> unrelated write is happening. Nor are older-version reads on the
>> writing subsystem. That fix is post-firefly, right?
>
> Oh yeah, it was new in hammer, I think. I forgot those tests were that
> old!
Time goes by fast when you are having fun? Indeed those tests are
nearly 2 years old at this point.
I guess my question is: given the state machine improvements, do we
expect those leveldb write/wtrim latency spikes to still cause major mon
stalls? On the other hand, is the osdmap read cache enough to help
offset the lower read performance in (this configuration of) rocksdb?
I'm still worried about those big leveldb write latency spikes, but
maybe it's less of an issue now and the average read performance is a
bigger issue.
>
> sage
>
^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread (newest message: 2016-05-03 17:32 UTC)
Thread overview: 18+ messages
2016-05-02 18:49 mon switch from leveldb to rocksdb Sage Weil
2016-05-02 19:00 ` Howard Chu
2016-05-03 13:34 ` Mark Nelson
2016-05-03 16:41 ` Gregory Farnum
2016-05-03 17:01 ` Mark Nelson
2016-05-03 17:17 ` Sage Weil
2016-05-03 17:20 ` Gregory Farnum
2016-05-03 17:23 ` Sage Weil
2016-05-03 17:32 ` Mark Nelson
2016-05-02 21:25 ` Wido den Hollander
2016-05-02 21:42 ` Shinobu Kinjo
2016-05-02 21:47 ` Sage Weil
2016-05-03 5:25 ` Zhou, Yuan
2016-05-03 5:28 ` Somnath Roy
2016-05-03 6:00 ` Shinobu Kinjo
2016-05-03 6:29 ` Somnath Roy
2016-05-03 8:10 ` Shinobu Kinjo
2016-05-03 12:24 ` Sage Weil