mon switch from leveldb to rocksdb

* mon switch from leveldb to rocksdb
@ 2016-05-02 18:49 Sage Weil
  2016-05-02 19:00 ` Howard Chu
  2016-05-02 21:25 ` Wido den Hollander
  0 siblings, 2 replies; 18+ messages in thread
From: Sage Weil @ 2016-05-02 18:49 UTC
  To: ceph-devel

We're thinking about switching the default backend on the mon from leveldb 
to rocksdb.  Rocksdb is better maintained, has a stronger feature set, is 
generally faster, and is linked statically, which means we won't be 
vulnerable to buggy distro packages.

There is one blocker, though.  Some distro leveldbs name the sst files 
with the .ldb suffix.  (Some don't; very annoying.)  There is a unit test 
in rocksdb that tries to verify that ldb is silently renamed to sst, 
and it passes, but the test is incomplete: it fails to verify that
ldb/sst files can actually be read, and it turns out only the 'check'
path (not the normal open-and-read path) handles ldb properly.
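
To make the suffix problem concrete, here is a minimal Python sketch (not part of rocksdb or Ceph) that scans a store directory and reports which table files carry the .ldb suffix and the .sst name each would map to. The function names and the sample file names are hypothetical. Nothing is renamed on disk, and the sketch says nothing about whether rocksdb can actually read the files; it only illustrates the naming mismatch.

```python
import os
import tempfile


def plan_ldb_renames(db_dir):
    """Return sorted (old_name, new_name) pairs for every *.ldb file in db_dir.

    This only reports what a rename pass would touch; it does not
    modify the directory.
    """
    plan = []
    for name in sorted(os.listdir(db_dir)):
        stem, ext = os.path.splitext(name)
        if ext == ".ldb":
            plan.append((name, stem + ".sst"))
    return plan


def demo():
    # Build a throwaway directory that mimics a store with mixed suffixes.
    # The file names below are made up for illustration.
    with tempfile.TemporaryDirectory() as d:
        for name in ("000005.ldb", "000007.sst", "CURRENT", "MANIFEST-000004"):
            open(os.path.join(d, name), "w").close()
        return plan_ldb_renames(d)
```

Calling demo() lists only the .ldb file; the .sst file and the metadata files are left out of the plan.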

Anyway, once that works, rocksdb will magically upgrade from leveldb to 
rocksdb.  Note that once that happens you can't switch from rocksdb back 
to leveldb without recreating the mon.

Alternatively, we could not worry about upgrading existing leveldb 
instances and just make newly created mons default to rocksdb.

1) Thoughts on moving to rocksdb in general?

2) Importance of leveldb->rocksdb conversion?

3) Anyone want to fix the ldb handling in rocksdb?

Thanks!
sage

* Re: mon switch from leveldb to rocksdb
  2016-05-02 18:49 mon switch from leveldb to rocksdb Sage Weil
@ 2016-05-02 19:00 ` Howard Chu
  2016-05-03 13:34   ` Mark Nelson
  2016-05-02 21:25 ` Wido den Hollander
  1 sibling, 1 reply; 18+ messages in thread
From: Howard Chu @ 2016-05-02 19:00 UTC
  To: Sage Weil, ceph-devel

Sage Weil wrote:
> 1) Thoughts on moving to rocksdb in general?

Are you actually prepared to undertake all of the measurement and tuning
required to make RocksDB work well? You're switching from an
(abandoned/unsupported) engine with only a handful of config parameters to one 
with ~40-50 params, all of which have critical but unpredictable impact on 
resource consumption and performance.

-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/

* Re: mon switch from leveldb to rocksdb
  2016-05-02 18:49 mon switch from leveldb to rocksdb Sage Weil
  2016-05-02 19:00 ` Howard Chu
@ 2016-05-02 21:25 ` Wido den Hollander
  2016-05-02 21:42   ` Shinobu Kinjo
  1 sibling, 1 reply; 18+ messages in thread
From: Wido den Hollander @ 2016-05-02 21:25 UTC
  To: Sage Weil, ceph-devel


> On 2 May 2016 at 20:49, Sage Weil <sweil@redhat.com> wrote:
> 1) Thoughts on moving to rocksdb in general?
> 
> 2) Importance of leveldb->rocksdb conversion?
> 

I would not touch this auto conversion at first. I know there are things to gain, but is the gain large enough to be worth the risk of corrupting monitors?

Is the concern that LevelDB doesn't handle the load of large clusters, for example? IMHO the majority of Ceph clusters are still far below 500 OSDs.

Personally I always try to stay away from touching the MONs' datastore. It always feels a bit scary.

Wido

> 3) Anyone want to fix the ldb handling in rocksdb?

* Re: mon switch from leveldb to rocksdb
  2016-05-02 21:25 ` Wido den Hollander
@ 2016-05-02 21:42   ` Shinobu Kinjo
  2016-05-02 21:47     ` Sage Weil
  0 siblings, 1 reply; 18+ messages in thread
From: Shinobu Kinjo @ 2016-05-02 21:42 UTC
  To: Wido den Hollander; +Cc: Sage Weil, Ceph Development

If possible, it would be much better to make it pluggable so that we
can select what we want.

-- 
Email:
shinobu@linux.com
GitHub:
shinobu-x
Blog:
Life with Distributed Computational System based on OpenSource


* Re: mon switch from leveldb to rocksdb
  2016-05-02 21:42   ` Shinobu Kinjo
@ 2016-05-02 21:47     ` Sage Weil
  2016-05-03  5:25       ` Zhou, Yuan
  0 siblings, 1 reply; 18+ messages in thread
From: Sage Weil @ 2016-05-02 21:47 UTC
  To: skinjo; +Cc: Wido den Hollander, Ceph Development

On Tue, 3 May 2016, Shinobu Kinjo wrote:
> If possible, it would be much better to make it pluggable so that we
> select what we want.

Yeah, that is the plan.  The mon_keyvaluedb option will select leveldb or
rocksdb.  We'd just switch the default over at some point, once we're
satisfied with stability.

After thinking about this some more I agree with Wido that the conversion 
isn't useful enough to bother with.  We can just make new mons use 
rocksdb, and if someone wants to convert, they can add/remove/replace mons 
in their cluster to get there.
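
The pluggable selection described above could look roughly like the following Python sketch. It is an illustration only, not Ceph's implementation: the class names, the registry, and the dict-based config are hypothetical. Only the option name mon_keyvaluedb and the two backend names come from this thread.

```python
class KeyValueBackend:
    """Hypothetical base class for a mon key/value store."""

    name = ""

    def __init__(self, path):
        self.path = path


class LevelDBBackend(KeyValueBackend):
    name = "leveldb"


class RocksDBBackend(KeyValueBackend):
    name = "rocksdb"


# Registry of available backends, keyed by the value of mon_keyvaluedb.
BACKENDS = {cls.name: cls for cls in (LevelDBBackend, RocksDBBackend)}

# The default lives in one place, so switching it later is a one-line change.
DEFAULT_BACKEND = "leveldb"


def create_store(path, config):
    """Create a store for a new mon using the configured backend."""
    kind = config.get("mon_keyvaluedb", DEFAULT_BACKEND)
    if kind not in BACKENDS:
        raise ValueError("unknown mon_keyvaluedb value: %r" % kind)
    return BACKENDS[kind](path)
```

In this sketch, changing DEFAULT_BACKEND affects only newly created stores that do not set the option explicitly, which matches the idea of leaving existing mons alone.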

sage




* RE: mon switch from leveldb to rocksdb
  2016-05-02 21:47     ` Sage Weil
@ 2016-05-03  5:25       ` Zhou, Yuan
  2016-05-03  5:28         ` Somnath Roy
  2016-05-03 12:24         ` Sage Weil
  0 siblings, 2 replies; 18+ messages in thread
From: Zhou, Yuan @ 2016-05-03  5:25 UTC
  To: Sage Weil, skinjo@redhat.com; +Cc: Wido den Hollander, Ceph Development

Hi Sage, 

How about filestore_omap_backend? It's set to leveldb by default now. Would it also be switched to rocksdb?

thanks, -yuan


* RE: mon switch from leveldb to rocksdb
  2016-05-03  5:25       ` Zhou, Yuan
@ 2016-05-03  5:28         ` Somnath Roy
  2016-05-03  6:00           ` Shinobu Kinjo
  2016-05-03 12:24         ` Sage Weil
  1 sibling, 1 reply; 18+ messages in thread
From: Somnath Roy @ 2016-05-03  5:28 UTC
  To: Zhou, Yuan, Sage Weil, skinjo@redhat.com
  Cc: Wido den Hollander, Ceph Development

I think filestore already supports rocksdb as the OMAP backend.

Thanks & Regards
Somnath


* Re: mon switch from leveldb to rocksdb
  2016-05-03  5:28         ` Somnath Roy
@ 2016-05-03  6:00           ` Shinobu Kinjo
  2016-05-03  6:29             ` Somnath Roy
  0 siblings, 1 reply; 18+ messages in thread
From: Shinobu Kinjo @ 2016-05-03  6:00 UTC
  To: Somnath Roy; +Cc: Yuan Zhou, Sage Weil, Wido den Hollander, Ceph Development

> I think filestore is already supporting rocksdb as OMAP..

If the RocksDB library is there, yes...

What is really a challenge here, to me, is what Sage mentioned:

> if someone wants to convert, they can add/remove/replace mons in their cluster to get there.

Maybe this is a related issue:

https://github.com/facebook/rocksdb/issues/677

What do you think?

Cheers,
Shinobu.


* RE: mon switch from leveldb to rocksdb
  2016-05-03  6:00           ` Shinobu Kinjo
@ 2016-05-03  6:29             ` Somnath Roy
  2016-05-03  8:10               ` Shinobu Kinjo
  0 siblings, 1 reply; 18+ messages in thread
From: Somnath Roy @ 2016-05-03  6:29 UTC
  To: Shinobu Kinjo; +Cc: Yuan Zhou, Sage Weil, Wido den Hollander, Ceph Development

You need to recreate the OSDs (mkfs) in order to move to rocksdb; as far as I know, it is not a seamless transition.

-----Original Message-----
From: Shinobu Kinjo [mailto:skinjo@redhat.com] 
Sent: Monday, May 02, 2016 11:00 PM
To: Somnath Roy
Cc: Yuan Zhou; Sage Weil; Wido den Hollander; Ceph Development
Subject: Re: mon switch from leveldb to rocksdb

> I think filestore is already supporting rocksdb as OMAP..

If the RocksDB library is there, yes...

What is really challenge in here to me is, as Sage mentioned:

> if someone wants to convert, they can add/remove/replace mons in their cluster to get there.

Maybe this is a related issue:

https://github.com/facebook/rocksdb/issues/677

What do you think?

Cheers,
Shinobu.

----- Original Message -----
From: "Somnath Roy" <Somnath.Roy@sandisk.com>
To: "Yuan Zhou" <yuan.zhou@intel.com>, "Sage Weil" <sage@newdream.net>, skinjo@redhat.com
Cc: "Wido den Hollander" <wido@42on.com>, "Ceph Development" <ceph-devel@vger.kernel.org>
Sent: Tuesday, May 3, 2016 2:28:56 PM
Subject: RE: mon switch from leveldb to rocksdb

I think filestore is already supporting rocksdb as OMAP..

Thanks & Regards
Somnath

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Zhou, Yuan
Sent: Monday, May 02, 2016 10:25 PM
To: Sage Weil; skinjo@redhat.com
Cc: Wido den Hollander; Ceph Development
Subject: RE: mon switch from leveldb to rocksdb

Hi Sage,

how about the filestore_omap_backend? It's set to leveldb by default now. Would it be set to rocksdb also?

thanks, -yuan

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
Sent: Tuesday, May 3, 2016 5:47 AM
To: skinjo@redhat.com
Cc: Wido den Hollander <wido@42on.com>; Ceph Development <ceph-devel@vger.kernel.org>
Subject: Re: mon switch from leveldb to rocksdb

On Tue, 3 May 2016, Shinobu Kinjo wrote:
> If possible, it would be much better to make it pluggable so that we 
> select what we want.

Yeah, that is the plan.  The mon_keyvaluedb will select leveldb or rocksdb.  We'd just switch the default over at some point, once we're satisfied with stability.

After thinking about this some more I agree with Wido that the conversion isn't useful enough to bother with.  We can just make new mons use rocksdb, and if someone wants to convert, they can add/remove/replace mons in their cluster to get there.

sage



>
> On Tue, May 3, 2016 at 6:25 AM, Wido den Hollander <wido@42on.com> wrote:
> >
> >> Op 2 mei 2016 om 20:49 schreef Sage Weil <sweil@redhat.com>:
> >>
> >>
> >> We're thinking about switching the default backend on the mon from 
> >> leveldb to rocksdb.  Rocksdb is better maintained, has a stronger 
> >> feature set, is generally faster, and is linked statically, which 
> >> means we won't be vulnerable to buggy distro packages.
> >>
> >> There is one blocker, though.  Some distro leveldbs name the sst 
> >> files with the .ldb suffix.  (Some don't; very annoying.)  There is 
> >> a unit test in rocksdb that tries to verify that ldb is silently 
> >> renamed to sst, and it passes, but the test is incomplete: the test 
> >> failes to verify that ldb/sst files can actually be read, and it turns out only the 'check'
> >> path (not the normal open and read it path) handles ldb properly.
> >>
> >> Anyway, once that works, rocksdb will magically upgrade from 
> >> leveldb to rocksdb.  Note that once that happens you can't switch 
> >> from rocksdb back to leveldb without recreating the mon.
> >>
> >> Alternatively, we could not worry about upgrading existing leveldb 
> >> instances and just make newly created mons default to rocksdb.
> >>
> >> 1) Thoughts on moving to rocksdb in general?
> >>
> >> 2) Importance of leveldb->rocksdb conversion?
> >>
> >
> > I would not touch this auto conversion at first. I know there is things to gain, but is it enough to gain that it might be worth while potentially corrupting monitors?
> >
> > Is it that LevelDB doesn't handle large cluster load for example? Imho the majority of Ceph clusters is still far below 500 OSDs.
> >
> > Personally I always try to stay away from touching the MONs datastore. Always feels a bit scary.
> >
> > Wido
> >
> >> 3) Anyone want to fix the ldb handling in rocksdb?
> >>
> >> Thanks!
> >> sage
> >>
>
>
>
> --
> Email:
> shinobu@linux.com
> GitHub:
> shinobu-x
> Blog:
> Life with Distributed Computational System based on OpenSource
>
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: mon switch from leveldb to rocksdb
  2016-05-03  6:29             ` Somnath Roy
@ 2016-05-03  8:10               ` Shinobu Kinjo
  0 siblings, 0 replies; 18+ messages in thread
From: Shinobu Kinjo @ 2016-05-03  8:10 UTC (permalink / raw)
  To: Somnath Roy; +Cc: Yuan Zhou, Sage Weil, Wido den Hollander, Ceph Development

On Tue, May 3, 2016 at 3:29 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
> You need to recreate OSDs (mkfs) in order to move to rocksdb; it is not a seamless transition as far as I know.

Yeah, you're right.

>> >> We're thinking about switching the default backend on the mon from
>> >> leveldb to rocksdb.

But we're talking about mon...

>
> -----Original Message-----
> From: Shinobu Kinjo [mailto:skinjo@redhat.com]
> Sent: Monday, May 02, 2016 11:00 PM
> To: Somnath Roy
> Cc: Yuan Zhou; Sage Weil; Wido den Hollander; Ceph Development
> Subject: Re: mon switch from leveldb to rocksdb
>
>> I think filestore is already supporting rocksdb as OMAP..
>
> If the RocksDB library is there, yes...
>
> The real challenge here, to me, is what Sage mentioned:
>
>> if someone wants to convert, they can add/remove/replace mons in their cluster to get there.
>
> Maybe this is a related issue:
>
> https://github.com/facebook/rocksdb/issues/677
>
> What do you think?
>
> Cheers,
> Shinobu.
>
> ----- Original Message -----
> From: "Somnath Roy" <Somnath.Roy@sandisk.com>
> To: "Yuan Zhou" <yuan.zhou@intel.com>, "Sage Weil" <sage@newdream.net>, skinjo@redhat.com
> Cc: "Wido den Hollander" <wido@42on.com>, "Ceph Development" <ceph-devel@vger.kernel.org>
> Sent: Tuesday, May 3, 2016 2:28:56 PM
> Subject: RE: mon switch from leveldb to rocksdb
>
> I think filestore is already supporting rocksdb as OMAP..
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Zhou, Yuan
> Sent: Monday, May 02, 2016 10:25 PM
> To: Sage Weil; skinjo@redhat.com
> Cc: Wido den Hollander; Ceph Development
> Subject: RE: mon switch from leveldb to rocksdb
>
> Hi Sage,
>
> how about the filestore_omap_backend? It's set to leveldb by default now. Would it be set to rocksdb also?
>
> thanks, -yuan
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Tuesday, May 3, 2016 5:47 AM
> To: skinjo@redhat.com
> Cc: Wido den Hollander <wido@42on.com>; Ceph Development <ceph-devel@vger.kernel.org>
> Subject: Re: mon switch from leveldb to rocksdb
>
> On Tue, 3 May 2016, Shinobu Kinjo wrote:
>> If possible, it would be much better to make it pluggable so that we
>> select what we want.
>
> Yeah, that is the plan.  The mon_keyvaluedb will select leveldb or rocksdb.  We'd just switch the default over at some point, once we're satisfied with stability.
>
> After thinking about this some more I agree with Wido that the conversion isn't useful enough to bother with.  We can just make new mons use rocksdb, and if someone wants to convert, they can add/remove/replace mons in their cluster to get there.
>
> sage
>
>
>
>>
>> On Tue, May 3, 2016 at 6:25 AM, Wido den Hollander <wido@42on.com> wrote:
>> >
>> >> Op 2 mei 2016 om 20:49 schreef Sage Weil <sweil@redhat.com>:
>> >>
>> >>
>> >> We're thinking about switching the default backend on the mon from
>> >> leveldb to rocksdb.  Rocksdb is better maintained, has a stronger
>> >> feature set, is generally faster, and is linked statically, which
>> >> means we won't be vulnerable to buggy distro packages.
>> >>
>> >> There is one blocker, though.  Some distro leveldbs name the sst
>> >> files with the .ldb suffix.  (Some don't; very annoying.)  There is
>> >> a unit test in rocksdb that tries to verify that ldb is silently
>> >> renamed to sst, and it passes, but the test is incomplete: the test
>> >> fails to verify that ldb/sst files can actually be read, and it turns out only the 'check'
>> >> path (not the normal open-and-read path) handles ldb properly.
>> >>
>> >> Anyway, once that works, rocksdb will magically upgrade from
>> >> leveldb to rocksdb.  Note that once that happens you can't switch
>> >> from rocksdb back to leveldb without recreating the mon.
>> >>
>> >> Alternatively, we could not worry about upgrading existing leveldb
>> >> instances and just make newly created mons default to rocksdb.
>> >>
>> >> 1) Thoughts on moving to rocksdb in general?
>> >>
>> >> 2) Importance of leveldb->rocksdb conversion?
>> >>
>> >
>> > I would not touch this auto conversion at first. I know there is things to gain, but is it enough to gain that it might be worth while potentially corrupting monitors?
>> >
>> > Is it that LevelDB doesn't handle large cluster load for example? Imho the majority of Ceph clusters is still far below 500 OSDs.
>> >
>> > Personally I always try to stay away from touching the MONs datastore. Always feels a bit scary.
>> >
>> > Wido
>> >
>> >> 3) Anyone want to fix the ldb handling in rocksdb?
>> >>
>> >> Thanks!
>> >> sage
>> >>
>>
>>
>>
>> --
>> Email:
>> shinobu@linux.com
>> GitHub:
>> shinobu-x
>> Blog:
>> Life with Distributed Computational System based on OpenSource
>>
>>



-- 
Email:
shinobu@linux.com
GitHub:
shinobu-x
Blog:
Life with Distributed Computational System based on OpenSource

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: mon switch from leveldb to rocksdb
  2016-05-03  5:25       ` Zhou, Yuan
  2016-05-03  5:28         ` Somnath Roy
@ 2016-05-03 12:24         ` Sage Weil
  1 sibling, 0 replies; 18+ messages in thread
From: Sage Weil @ 2016-05-03 12:24 UTC (permalink / raw)
  To: Zhou, Yuan; +Cc: skinjo@redhat.com, Wido den Hollander, Ceph Development

On Tue, 3 May 2016, Zhou, Yuan wrote:
> Hi Sage, 
> 
> how about the filestore_omap_backend? It's set to leveldb by default 
> now. Would it be set to rocksdb also?

I'd rather leave FileStore alone since it will eventually be deprecated.  
It's also more sensitive to performance variation and we'd need to be a 
lot more careful making any changes.

sage



> 
> thanks, -yuan
> 
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Tuesday, May 3, 2016 5:47 AM
> To: skinjo@redhat.com
> Cc: Wido den Hollander <wido@42on.com>; Ceph Development <ceph-devel@vger.kernel.org>
> Subject: Re: mon switch from leveldb to rocksdb
> 
> On Tue, 3 May 2016, Shinobu Kinjo wrote:
> > If possible, it would be much better to make it pluggable so that we 
> > select what we want.
> 
> Yeah, that is the plan.  The mon_keyvaluedb will select leveldb or rocksdb.  We'd just switch the default over at some point, once we're satisfied with stability.
> 
> After thinking about this some more I agree with Wido that the conversion isn't useful enough to bother with.  We can just make new mons use rocksdb, and if someone wants to convert, they can add/remove/replace mons in their cluster to get there.
> 
> sage
> 
> 
> 
> > 
> > On Tue, May 3, 2016 at 6:25 AM, Wido den Hollander <wido@42on.com> wrote:
> > >
> > >> Op 2 mei 2016 om 20:49 schreef Sage Weil <sweil@redhat.com>:
> > >>
> > >>
> > >> We're thinking about switching the default backend on the mon from 
> > >> leveldb to rocksdb.  Rocksdb is better maintained, has a stronger 
> > >> feature set, is generally faster, and is linked statically, which 
> > >> means we won't be vulnerable to buggy distro packages.
> > >>
> > >> There is one blocker, though.  Some distro leveldbs name the sst 
> > >> files with the .ldb suffix.  (Some don't; very annoying.)  There is 
> > >> a unit test in rocksdb that tries to verify that ldb is silently 
> > >> renamed to sst, and it passes, but the test is incomplete: the test 
> > >> fails to verify that ldb/sst files can actually be read, and it turns out only the 'check'
> > >> path (not the normal open-and-read path) handles ldb properly.
> > >>
> > >> Anyway, once that works, rocksdb will magically upgrade from 
> > >> leveldb to rocksdb.  Note that once that happens you can't switch 
> > >> from rocksdb back to leveldb without recreating the mon.
> > >>
> > >> Alternatively, we could not worry about upgrading existing leveldb 
> > >> instances and just make newly created mons default to rocksdb.
> > >>
> > >> 1) Thoughts on moving to rocksdb in general?
> > >>
> > >> 2) Importance of leveldb->rocksdb conversion?
> > >>
> > >
> > > I would not touch this auto conversion at first. I know there is things to gain, but is it enough to gain that it might be worth while potentially corrupting monitors?
> > >
> > > Is it that LevelDB doesn't handle large cluster load for example? Imho the majority of Ceph clusters is still far below 500 OSDs.
> > >
> > > Personally I always try to stay away from touching the MONs datastore. Always feels a bit scary.
> > >
> > > Wido
> > >
> > >> 3) Anyone want to fix the ldb handling in rocksdb?
> > >>
> > >> Thanks!
> > >> sage
> > >>
> > 
> > 
> > 
> > --
> > Email:
> > shinobu@linux.com
> > GitHub:
> > shinobu-x
> > Blog:
> > Life with Distributed Computational System based on OpenSource
> > 
> > 
> 
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: mon switch from leveldb to rocksdb
  2016-05-02 19:00 ` Howard Chu
@ 2016-05-03 13:34   ` Mark Nelson
  2016-05-03 16:41     ` Gregory Farnum
  0 siblings, 1 reply; 18+ messages in thread
From: Mark Nelson @ 2016-05-03 13:34 UTC (permalink / raw)
  To: Howard Chu, Sage Weil, ceph-devel

On 05/02/2016 02:00 PM, Howard Chu wrote:
> Sage Weil wrote:
>> 1) Thoughts on moving to rocksdb in general?
>
> Are you actually prepared to undertake all of the measurement and tuning
> required to make RocksDB actually work well? You're switching from an
> (abandoned/unsupported) engine with only a handful of config parameters
> to one with ~40-50 params, all of which have critical but unpredictable
> impact on resource consumption and performance.
>

You are absolutely correct, and there are definitely pitfalls we need to 
watch out for with the number of tunables in rocksdb.  At least on the 
performance side, two of the big issues we've hit with leveldb are 
compaction related.  In some scenarios compaction cannot keep up with the 
rate of writes coming in, resulting in ever-growing db sizes.  The other issue is 
that compaction is single threaded and this can cause stalls and general 
mayhem when things get really heavily loaded.  My hope is that if we do 
go with rocksdb, even in a sub-optimally tuned state, we'll be better 
off than we were with leveldb.
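
To make the "compaction falls behind" case above concrete, here is a toy 
model in Python.  It is only a sketch: the function name and the rates 
are invented for illustration and are not taken from leveldb, rocksdb, 
or any Ceph measurement.  It just shows that whenever the sustained 
write rate exceeds the compaction rate, the uncompacted backlog grows 
without bound.

```python
def simulate_backlog(write_rate, compact_rate, seconds):
    """Return the uncompacted backlog (arbitrary units) after each second.

    Hypothetical model: every second `write_rate` units arrive and a
    single compaction thread removes at most `compact_rate` units.
    """
    backlog = 0.0
    history = []
    for _ in range(seconds):
        backlog += write_rate                   # new writes land
        backlog -= min(backlog, compact_rate)   # compaction drains what it can
        history.append(backlog)
    return history


# Illustrative rates only; these are assumptions, not benchmark results.
falling_behind = simulate_backlog(write_rate=12.0, compact_rate=10.0, seconds=60)
keeping_up = simulate_backlog(write_rate=8.0, compact_rate=10.0, seconds=60)

print("backlog when writes outpace compaction:", falling_behind[-1])
print("backlog when compaction keeps up:", keeping_up[-1])
```

In this model the only ways out are to slow the writers down (stalls) or 
to add compaction capacity, which is roughly the trade-off being 
discussed in this thread.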

We did some very preliminary benchmarks a couple of years ago 
(admittedly a too-small dataset size) basically comparing the (at the 
time) stock ceph leveldb settings vs rocksdb.  On this set size, leveldb 
looked much better for reads, but much worse for writes.  I suspect with 
much larger data sets, the write issues will only compound with the 
compaction issues and will start having a much bigger impact.

Indeed, if you look at the scatterplots for leveldb, you'll see a 
regular set of high latency writes. In rocksdb we saw much better 
looking write behavior, but overall reads were slower.  We didn't do any 
real tuning to improve read performance in the leveled compaction tests, 
but I think we'll be starting out in a much better place to improve them 
than we are with leveldb.

https://drive.google.com/file/d/0B2gTBZrkrnpZN3JFV3RZeVBPWlU/view?usp=sharing

Mark

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: mon switch from leveldb to rocksdb
  2016-05-03 13:34   ` Mark Nelson
@ 2016-05-03 16:41     ` Gregory Farnum
  2016-05-03 17:01       ` Mark Nelson
  0 siblings, 1 reply; 18+ messages in thread
From: Gregory Farnum @ 2016-05-03 16:41 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Howard Chu, Sage Weil, ceph-devel

On Tue, May 3, 2016 at 6:34 AM, Mark Nelson <mnelson@redhat.com> wrote:
> On 05/02/2016 02:00 PM, Howard Chu wrote:
>>
>> Sage Weil wrote:
>>>
>>> 1) Thoughts on moving to rocksdb in general?
>>
>>
>> Are you actually prepared to undertake all of the measurement and tuning
>> required to make RocksDB actually work well? You're switching from an
>> (abandoned/unsupported) engine with only a handful of config parameters
>> to one with ~40-50 params, all of which have critical but unpredictable
>> impact on resource consumption and performance.
>>
>
> You are absolutely correct, and there are definitely pitfalls we need to
> watch out for with the number of tunables in rocksdb.  At least on the
> performance side two of the big issues we've hit with leveldb compaction
> related.  In some scenarios compaction happens slower than the number of
> writes coming in resulting in ever-growing db sizes.  The other issue is
> that compaction is single threaded and this can cause stalls and general
> mayhem when things get really heavily loaded.  My hope is that if we do go
> with rocksdb, even in a sub-optimally tuned state, we'll be better off than
> we were with leveldb.
>
> We did some very preliminary benchmarks a couple of years ago (admittedly a
> too-small dataset size) basically comparing the (at the time) stock ceph
> leveldb settings vs rocksdb.  On this set size, leveldb looked much better
> for reads, but much worse for writes.

That's actually a bit troubling — many of our monitor problems have
arisen from slow reads, rather than slow writes. I suspect we want to
eliminate this before switching, if it's a concern.

...Although I think I did see a monitor caching layer go by, so maybe
it's a moot point now?
-Greg

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: mon switch from leveldb to rocksdb
  2016-05-03 16:41     ` Gregory Farnum
@ 2016-05-03 17:01       ` Mark Nelson
  2016-05-03 17:17         ` Sage Weil
  0 siblings, 1 reply; 18+ messages in thread
From: Mark Nelson @ 2016-05-03 17:01 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Howard Chu, Sage Weil, ceph-devel



On 05/03/2016 11:41 AM, Gregory Farnum wrote:
> On Tue, May 3, 2016 at 6:34 AM, Mark Nelson <mnelson@redhat.com> wrote:
>> On 05/02/2016 02:00 PM, Howard Chu wrote:
>>>
>>> Sage Weil wrote:
>>>>
>>>> 1) Thoughts on moving to rocksdb in general?
>>>
>>>
>>> Are you actually prepared to undertake all of the measurement and tuning
>>> required to make RocksDB actually work well? You're switching from an
>>> (abandoned/unsupported) engine with only a handful of config parameters
>>> to one with ~40-50 params, all of which have critical but unpredictable
>>> impact on resource consumption and performance.
>>>
>>
>> You are absolutely correct, and there are definitely pitfalls we need to
>> watch out for with the number of tunables in rocksdb.  At least on the
>> performance side two of the big issues we've hit with leveldb compaction
>> related.  In some scenarios compaction happens slower than the number of
>> writes coming in resulting in ever-growing db sizes.  The other issue is
>> that compaction is single threaded and this can cause stalls and general
>> mayhem when things get really heavily loaded.  My hope is that if we do go
>> with rocksdb, even in a sub-optimally tuned state, we'll be better off than
>> we were with leveldb.
>>
>> We did some very preliminary benchmarks a couple of years ago (admittedly a
>> too-small dataset size) basically comparing the (at the time) stock ceph
>> leveldb settings vs rocksdb.  On this set size, leveldb looked much better
>> for reads, but much worse for writes.
>
> That's actually a bit troubling — many of our monitor problems have
> arisen from slow reads, rather than slow writes. I suspect we want to
> eliminate this before switching, if it's a concern.
>
> ...Although I think I did see a monitor caching layer go by, so maybe
> it's a moot point now?

Yeah, I suspect that's helping significantly.  Based at least on what I 
remember seeing, though, I'm more concerned about high-latency events 
than average read performance.  I.e., if there is a compaction storm, 
which store is going to handle it more gracefully, with less spiky 
behavior?
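
As a rough illustration of why the average and the tail can point in 
different directions, here is a small Python sketch.  The two latency 
sets are made up for the example (they are not the benchmark data 
referred to in this thread), and `percentile` is a simple nearest-rank 
helper written for this sketch.

```python
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile; `pct` is an integer from 1 to 100."""
    ordered = sorted(samples)
    rank = (len(ordered) * pct + 99) // 100   # ceil(n * pct / 100)
    return ordered[rank - 1]


# Invented latencies in milliseconds, for illustration only.
# Store A: usually fast, but 2% of operations hit a long stall.
store_a = [2] * 980 + [500] * 20
# Store B: slower on average per operation, but steady.
store_b = [15] * 1000

a_mean, a_p99 = statistics.mean(store_a), percentile(store_a, 99)
b_mean, b_p99 = statistics.mean(store_b), percentile(store_b, 99)

print("store A: mean", a_mean, "ms, p99", a_p99, "ms")
print("store B: mean", b_mean, "ms, p99", b_p99, "ms")
```

With these invented numbers, store A wins on the mean and loses badly 
on the 99th percentile, which is why a comparison of the two backends 
probably needs to look at both.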

In those leveldb tests we only saw writes and write trims hit by those 
periodic 10-60 second high-latency spikes, but if I recall the mon has 
(or at least had?) a global lock where write stalls would basically make 
the whole monitor stall.  I think Joao might have improved that after we 
did this testing but I don't remember the details at this point.

> -Greg
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: mon switch from leveldb to rocksdb
  2016-05-03 17:01       ` Mark Nelson
@ 2016-05-03 17:17         ` Sage Weil
  2016-05-03 17:20           ` Gregory Farnum
  0 siblings, 1 reply; 18+ messages in thread
From: Sage Weil @ 2016-05-03 17:17 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Gregory Farnum, Howard Chu, ceph-devel

On Tue, 3 May 2016, Mark Nelson wrote:
> On 05/03/2016 11:41 AM, Gregory Farnum wrote:
> > On Tue, May 3, 2016 at 6:34 AM, Mark Nelson <mnelson@redhat.com> wrote:
> > > On 05/02/2016 02:00 PM, Howard Chu wrote:
> > > > 
> > > > Sage Weil wrote:
> > > > > 
> > > > > 1) Thoughts on moving to rocksdb in general?
> > > > 
> > > > 
> > > > Are you actually prepared to undertake all of the measurement and tuning
> > > > required to make RocksDB actually work well? You're switching from an
> > > > (abandoned/unsupported) engine with only a handful of config parameters
> > > > to one with ~40-50 params, all of which have critical but unpredictable
> > > > impact on resource consumption and performance.
> > > > 
> > > 
> > > You are absolutely correct, and there are definitely pitfalls we need to
> > > watch out for with the number of tunables in rocksdb.  At least on the
> > > performance side two of the big issues we've hit with leveldb compaction
> > > related.  In some scenarios compaction happens slower than the number of
> > > writes coming in resulting in ever-growing db sizes.  The other issue is
> > > that compaction is single threaded and this can cause stalls and general
> > > mayhem when things get really heavily loaded.  My hope is that if we do go
> > > with rocksdb, even in a sub-optimally tuned state, we'll be better off
> > > than
> > > we were with leveldb.
> > > 
> > > We did some very preliminary benchmarks a couple of years ago (admittedly
> > > a
> > > too-small dataset size) basically comparing the (at the time) stock ceph
> > > leveldb settings vs rocksdb.  On this set size, leveldb looked much better
> > > for reads, but much worse for writes.
> > 
> > That's actually a bit troubling — many of our monitor problems have
> > arisen from slow reads, rather than slow writes. I suspect we want to
> > eliminate this before switching, if it's a concern.
> > 
> > ...Although I think I did see a monitor caching layer go by, so maybe
> > it's a moot point now?
> 
> Yeah, I suspect that's helping significantly.  I think based at least one what
> I remember seeing I'm more concerned about high latency events than average
> read performance though.  IE if there is a compaction storm, which store is
> going to handle it more gracefully with less spikey behavior?

I'm most worried about the read storm that happens on each commit to fetch 
all the just-updated PG stat keys.  The other data in the mon is just 
noise in comparison, I think, with the exception of the OSDMaps... which 
IIRC is what the cache you mention was for.

The initial PR,

	https://github.com/ceph/ceph/pull/8888

just makes the backend choice persistent.  Rocksdb is still experimental.  
There's an accompanying ceph-qa-suite pr so that we test both.  Once we 
do some performance evaluation we can decide whether the switch is safe 
as-is, if more work (caching layer or tuning) is needed, or if it's a bad 
idea.
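
For readers wondering what "makes the backend choice persistent" could 
look like, here is a minimal Python sketch of the general idea.  It is 
not the code from the PR: the marker file name, the function name, and 
the on-disk layout are all assumptions made for illustration.  The 
point is only that the backend is recorded when the store is created 
and then wins over whatever the configured default later becomes.

```python
import os
import tempfile

MARKER = "kv_backend"  # hypothetical marker file name, assumed for this sketch


def resolve_backend(data_dir, configured_default):
    """Return the backend recorded in data_dir, recording the default if absent."""
    path = os.path.join(data_dir, MARKER)
    if os.path.exists(path):
        # Existing store: the recorded choice wins over the current default.
        with open(path) as f:
            return f.read().strip()
    # New store: remember which backend it was created with.
    with open(path, "w") as f:
        f.write(configured_default + "\n")
    return configured_default


with tempfile.TemporaryDirectory() as mon_dir:
    first = resolve_backend(mon_dir, "leveldb")   # store created under old default
    second = resolve_backend(mon_dir, "rocksdb")  # default flipped later
    print(first, second)
```

Under this scheme, flipping the default only affects newly created 
mons, which matches the add/remove/replace path suggested earlier in 
the thread for anyone who wants to convert.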

> In those leveldb tests we only saw writes and write trims hit by those
> periodic 10-60 second high-latency spikes, but if I recall the mon has (or at
> least had?) a global lock where write stalls would basically make the whole
> monitor stall.  I think Joao might have improved that after we did this
> testing but I don't remember the details at this point.

I don't think any of this locking has changed...

sage

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: mon switch from leveldb to rocksdb
  2016-05-03 17:17         ` Sage Weil
@ 2016-05-03 17:20           ` Gregory Farnum
  2016-05-03 17:23             ` Sage Weil
  0 siblings, 1 reply; 18+ messages in thread
From: Gregory Farnum @ 2016-05-03 17:20 UTC (permalink / raw)
  To: Sage Weil; +Cc: Mark Nelson, Howard Chu, ceph-devel

On Tue, May 3, 2016 at 10:17 AM, Sage Weil <sweil@redhat.com> wrote:
> On Tue, 3 May 2016, Mark Nelson wrote:
>> On 05/03/2016 11:41 AM, Gregory Farnum wrote:
>> > On Tue, May 3, 2016 at 6:34 AM, Mark Nelson <mnelson@redhat.com> wrote:
>> > > On 05/02/2016 02:00 PM, Howard Chu wrote:
>> > > >
>> > > > Sage Weil wrote:
>> > > > >
>> > > > > 1) Thoughts on moving to rocksdb in general?
>> > > >
>> > > >
>> > > > Are you actually prepared to undertake all of the measurement and tuning
>> > > > required to make RocksDB actually work well? You're switching from an
>> > > > (abandoned/unsupported) engine with only a handful of config parameters
>> > > > to one with ~40-50 params, all of which have critical but unpredictable
>> > > > impact on resource consumption and performance.
>> > > >
>> > >
>> > > You are absolutely correct, and there are definitely pitfalls we need to
>> > > watch out for with the number of tunables in rocksdb.  At least on the
>> > > performance side two of the big issues we've hit with leveldb compaction
>> > > related.  In some scenarios compaction happens slower than the number of
>> > > writes coming in resulting in ever-growing db sizes.  The other issue is
>> > > that compaction is single threaded and this can cause stalls and general
>> > > mayhem when things get really heavily loaded.  My hope is that if we do go
>> > > with rocksdb, even in a sub-optimally tuned state, we'll be better off
>> > > than
>> > > we were with leveldb.
>> > >
>> > > We did some very preliminary benchmarks a couple of years ago (admittedly
>> > > a
>> > > too-small dataset size) basically comparing the (at the time) stock ceph
>> > > leveldb settings vs rocksdb.  On this set size, leveldb looked much better
>> > > for reads, but much worse for writes.
>> >
>> > That's actually a bit troubling — many of our monitor problems have
>> > arisen from slow reads, rather than slow writes. I suspect we want to
>> > eliminate this before switching, if it's a concern.
>> >
>> > ...Although I think I did see a monitor caching layer go by, so maybe
>> > it's a moot point now?
>>
>> Yeah, I suspect that's helping significantly.  I think based at least one what
>> I remember seeing I'm more concerned about high latency events than average
>> read performance though.  IE if there is a compaction storm, which store is
>> going to handle it more gracefully with less spikey behavior?
>
> I'm most worried about the read storm that happens on each commit to fetch
> all the just-updated PG stat keys.  The other data in the mon is just
> noise in comparison, I think, with the exception of the OSDMaps... which
> IIRC is what the cache you mention was for.
>
> The initial PR,
>
>         https://github.com/ceph/ceph/pull/8888
>
> just makes the backend choice persistent.  Rocksdb is still experimental.
> There's an accompanying ceph-qa-suite pr so that we test both.  Once we
> do some performance evaluation we can decide whether the switch is safe
> as-is, if more work (caching layer or tuning) is needed, or if it's a bad
> idea.
>
>> In those leveldb tests we only saw writes and write trims hit by those
>> periodic 10-60 second high-latency spikes, but if I recall the mon has (or at
>> least had?) a global lock where write stalls would basically make the whole
>> monitor stall.  I think Joao might have improved that after we did this
>> testing but I don't remember the details at this point.
>
> I don't think any of this locking has changed...

The paxos state machine is no longer blocked for reads while an
unrelated write is happening. Nor are older-version reads on the
writing subsystem. That fix is post-firefly, right?
-Greg

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: mon switch from leveldb to rocksdb
  2016-05-03 17:20           ` Gregory Farnum
@ 2016-05-03 17:23             ` Sage Weil
  2016-05-03 17:32               ` Mark Nelson
  0 siblings, 1 reply; 18+ messages in thread
From: Sage Weil @ 2016-05-03 17:23 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Mark Nelson, Howard Chu, ceph-devel

On Tue, 3 May 2016, Gregory Farnum wrote:
> >> In those leveldb tests we only saw writes and write trims hit by those
> >> periodic 10-60 second high-latency spikes, but if I recall the mon has (or at
> >> least had?) a global lock where write stalls would basically make the whole
> >> monitor stall.  I think Joao might have improved that after we did this
> >> testing but I don't remember the details at this point.
> >
> > I don't think any of this locking has changed...
> 
> The paxos state machine is no longer blocked for reads while an
> unrelated write is happening. Nor are older-version reads on the
> writing subsystem. That fix is post-firefly, right?

Oh yeah, it was new in hammer, I think.  I forgot those tests were that 
old!

sage


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: mon switch from leveldb to rocksdb
  2016-05-03 17:23             ` Sage Weil
@ 2016-05-03 17:32               ` Mark Nelson
  0 siblings, 0 replies; 18+ messages in thread
From: Mark Nelson @ 2016-05-03 17:32 UTC (permalink / raw)
  To: Sage Weil, Gregory Farnum; +Cc: Howard Chu, ceph-devel

On 05/03/2016 12:23 PM, Sage Weil wrote:
> On Tue, 3 May 2016, Gregory Farnum wrote:
>>>> In those leveldb tests we only saw writes and write trims hit by those
>>>> periodic 10-60 second high-latency spikes, but if I recall the mon has (or at
>>>> least had?) a global lock where write stalls would basically make the whole
>>>> monitor stall.  I think Joao might have improved that after we did this
>>>> testing but I don't remember the details at this point.
>>>
>>> I don't think any of this locking has changed...
>>
>> The paxos state machine is no longer blocked for reads while an
>> unrelated write is happening. Nor are older-version reads on the
>> writing subsystem. That fix is post-firefly, right?
>
> Oh yeah, it was new in hammer, I think.  I forgot those tests were that
> old!

Time goes by fast when you are having fun?  Indeed those tests are 
nearly 2 years old at this point.

I guess my question is: given the state machine improvements, do we 
expect those leveldb write/wtrim latency spikes to still cause major mon 
stalls?  On the other hand, is the osdmap read cache enough to help 
offset the lower read performance in (this configuration of) rocksdb?

I'm still worried about those big leveldb write latency spikes, but 
maybe it's less of an issue now and the average read performance is a 
bigger issue.

>
> sage
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2016-05-03 17:32 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-05-02 18:49 mon switch from leveldb to rocksdb Sage Weil
2016-05-02 19:00 ` Howard Chu
2016-05-03 13:34   ` Mark Nelson
2016-05-03 16:41     ` Gregory Farnum
2016-05-03 17:01       ` Mark Nelson
2016-05-03 17:17         ` Sage Weil
2016-05-03 17:20           ` Gregory Farnum
2016-05-03 17:23             ` Sage Weil
2016-05-03 17:32               ` Mark Nelson
2016-05-02 21:25 ` Wido den Hollander
2016-05-02 21:42   ` Shinobu Kinjo
2016-05-02 21:47     ` Sage Weil
2016-05-03  5:25       ` Zhou, Yuan
2016-05-03  5:28         ` Somnath Roy
2016-05-03  6:00           ` Shinobu Kinjo
2016-05-03  6:29             ` Somnath Roy
2016-05-03  8:10               ` Shinobu Kinjo
2016-05-03 12:24         ` Sage Weil
