Re: [ceph-users] keyvaluestore backend metadata overhead

* Re: [ceph-users] keyvaluestore backend metadata overhead
       [not found] <CAC8iE5iHTEfSQL978paWpu9hSfUbE65OVT_dKi2P=yvWSQ5JhA@mail.gmail.com>
@ 2015-01-29 22:51 ` Sage Weil
  2015-01-30  2:46   ` Haomai Wang
  2015-01-30 14:46   ` Chris Pacejo
  0 siblings, 2 replies; 15+ messages in thread
From: Sage Weil @ 2015-01-29 22:51 UTC (permalink / raw)
  To: Chris Pacejo; +Cc: ceph-devel, haomaiwang

Hi Chris,

[Moving this thread to ceph-devel, which is probably a bit more 
appropriate.]

On Thu, 29 Jan 2015, Chris Pacejo wrote:
> Hi, we've been experimenting with the keyvaluestore backend, and have found
> that, on every object write (e.g. with `rados put`), a single transaction is
> issued containing an additional 9 KeyValueDB writes, beyond those which
> constitute the object data.  Given the key names, these are clearly all
> metadata of some sort, but this poses a problem when the objects themselves
> are very small.  Given the default strip block size of 4 KiB, with objects
> of size 36 KiB or less, half or more of all key-value store writes are
> metadata writes.  With objects of size 4 KiB or less, the metadata overhead
> grows to 90%+.
> 
> Is there any way to reduce the number of metadata rows which must be written
> with each object?

There is a level (or two) of indirection in KeyValueStore's 
GenericObjectMap that is there to allow object cloning.  I wonder if we 
will want to facilitate a backend that doesn't implement clone and can 
only be used for pools that disallow clone and snap operations.

There is also some key consolidation in the OSD layer we talked about in
the Wednesday performance call that will cut this down some!
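
For a sense of scale, the overhead figures in the quoted question can be
reproduced with a short calculation. This is only a sketch: it assumes each
strip of object data becomes exactly one key-value write, and it takes the
9 extra metadata writes and the 4 KiB strip size from the report above. The
function name is made up for illustration.

```python
import math

STRIP_SIZE = 4 * 1024   # default strip block size reported in the thread (4 KiB)
METADATA_WRITES = 9     # extra KeyValueDB writes observed per object write


def metadata_fraction(object_size, strip_size=STRIP_SIZE,
                      metadata_writes=METADATA_WRITES):
    """Fraction of key-value writes that are metadata for one object write.

    Assumption: one key-value write per strip of object data.
    """
    data_writes = max(1, math.ceil(object_size / strip_size))
    return metadata_writes / (metadata_writes + data_writes)


for size_kib in (4, 36, 1024):
    frac = metadata_fraction(size_kib * 1024)
    print(f"{size_kib:>5} KiB object: {frac:.0%} of key-value writes are metadata")
```

Under that assumption a 4 KiB object gives 9 of 10 writes as metadata and a
36 KiB object gives 9 of 18, which matches the 90% and 50% quoted above.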

> (Alternatively, if there is a way to convince the OSD to issue multiple
> concurrent write transactions, that would also help.  But even with
> "keyvaluestore op threads" set as high as 64, and `rados bench` issuing 64
> concurrent writes, we never see more than a single active write transaction
> on the (multithread-capable) backend.  Is there some other option we're
> missing?)

sage

* Re: [ceph-users] keyvaluestore backend metadata overhead
  2015-01-29 22:51 ` [ceph-users] keyvaluestore backend metadata overhead Sage Weil
@ 2015-01-30  2:46   ` Haomai Wang
  2015-01-30 15:41     ` Chris Pacejo
  2015-01-30 14:46   ` Chris Pacejo
  1 sibling, 1 reply; 15+ messages in thread
From: Haomai Wang @ 2015-01-30  2:46 UTC (permalink / raw)
  To: Sage Weil; +Cc: Chris Pacejo, ceph-devel@vger.kernel.org

Hi Chris,

The metadata overhead needs to be resolved at a higher level.
KeyValueStore doesn't add extra metadata during normal I/O, except for the
occasional header save, which happens only when the header changes.

As for active writes, why do you think there is only one active write
in the keyvaluestore threads? I just checked the runtime perf data again,
and it looks fine: multiple writes can submit transactions concurrently.

-- 
Best Regards,

Wheat

* Re: [ceph-users] keyvaluestore backend metadata overhead
  2015-01-29 22:51 ` [ceph-users] keyvaluestore backend metadata overhead Sage Weil
  2015-01-30  2:46   ` Haomai Wang
@ 2015-01-30 14:46   ` Chris Pacejo
  1 sibling, 0 replies; 15+ messages in thread
From: Chris Pacejo @ 2015-01-30 14:46 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel, haomaiwang

Hi Sage, thanks for the quick reply.

On Thu, Jan 29, 2015 at 5:51 PM, Sage Weil <sage@newdream.net> wrote:
> There is a level (or two) of indirection in KeyValueStore's
> GenericObjectMap that is there to allow object cloning.  I wonder if we
> will want to facilitate a backend that doesn't implement clone and can
> only be used for pools that disallow clone and snap operations.

That would be perfect for us.  We need neither cloning nor snapshots.


> There is also some key consolidation in the OSD layer we talked about in
> the wednesday performance call that will cut this down some!

Awesome.  Every key-value pair saved will be a huge performance boost for us!

* Re: [ceph-users] keyvaluestore backend metadata overhead
  2015-01-30  2:46   ` Haomai Wang
@ 2015-01-30 15:41     ` Chris Pacejo
  2015-01-30 15:52       ` Haomai Wang
  0 siblings, 1 reply; 15+ messages in thread
From: Chris Pacejo @ 2015-01-30 15:41 UTC (permalink / raw)
  To: Haomai Wang; +Cc: Sage Weil, ceph-devel@vger.kernel.org

Hi Haomai,

On Thu, Jan 29, 2015 at 9:46 PM, Haomai Wang <haomaiwang@gmail.com> wrote:
> For metadata overhead, we need to resolve it at upper level,
> keyvaluestore won't add extra metadata in normal io except rarely
> header save which only update when header changed.

Unfortunately, our write workload is dominated by object creates.

> As for active write, why do you think it there only one active write
> in keyvaluestore threads? I just check runtime perf data again, it
> looks fine that multi write can do concurrently submit transaction.

We've implemented a MySQL backend for KeyValueDB in the hopes of
getting better performance than LevelDB (what we're currently seeing
is on par).  Internally, it uses a LIFO connection pool, from which
connections are leased for the duration of a transaction commit or
snapshot walk (to permit concurrent transactions).  Watching the
connection activity in MySQL using "SHOW PROCESSLIST", during most
runs, it's clear that, for the duration of the write benchmark, all
but two of the connections remain idle.  (During cleanup, I do see
more connections used, and I have on occasion seen more used during
writes.)  So while it's possible the transactions are being built
concurrently, they aren't (or are with a very low probability) being
submitted (via submit_transaction_sync()) concurrently.
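
To illustrate why idle connections in a LIFO pool point to low submit
concurrency, here is a minimal sketch of such a pool. It is not the actual
backend code; the class name, the `lease()` method, and the string
"connections" are all hypothetical stand-ins.

```python
import threading
from contextlib import contextmanager


class LifoPool:
    """Hypothetical LIFO pool: the most recently returned connection is
    handed out first, so rarely needed connections stay idle."""

    def __init__(self, connections):
        self._idle = list(connections)
        self._cond = threading.Condition()

    @contextmanager
    def lease(self):
        with self._cond:
            while not self._idle:
                self._cond.wait()       # block until a connection is returned
            conn = self._idle.pop()     # take from the top of the stack (LIFO)
        try:
            yield conn
        finally:
            with self._cond:
                self._idle.append(conn)  # return to the top of the stack
                self._cond.notify()


pool = LifoPool(["conn-0", "conn-1", "conn-2", "conn-3"])

# Strictly sequential leases keep reusing the same connection.
used = set()
for _ in range(10):
    with pool.lease() as conn:
        used.add(conn)
print(sorted(used))
```

With this discipline the number of distinct connections that ever become
busy tracks the peak number of overlapping leases, which is why a mostly
idle process list suggests that commits rarely overlap.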

(It's entirely possible that a bug in our code, or misdocumented
behavior in the MySQL client, excludes concurrent threads from using
open MySQL connections, but I *have* seen concurrent transaction
commits, only rarely.)

You mention "runtime perf data", is there a simple way to query the
OSD's idea of how many concurrent transaction submits it is issuing?
In the meantime I'll instrument our backend to track this value
itself.

Thanks!

* Re: [ceph-users] keyvaluestore backend metadata overhead
  2015-01-30 15:41     ` Chris Pacejo
@ 2015-01-30 15:52       ` Haomai Wang
  2015-01-30 16:08         ` Chris Pacejo
  2015-02-03 15:13         ` Chris Pacejo
  0 siblings, 2 replies; 15+ messages in thread
From: Haomai Wang @ 2015-01-30 15:52 UTC (permalink / raw)
  To: Chris Pacejo; +Cc: Sage Weil, ceph-devel@vger.kernel.org

On Fri, Jan 30, 2015 at 11:41 PM, Chris Pacejo <cpacejo@clearskydata.com> wrote:
> Hi Haomai,
>
> On Thu, Jan 29, 2015 at 9:46 PM, Haomai Wang <haomaiwang@gmail.com> wrote:
>> For metadata overhead, we need to resolve it at upper level,
>> keyvaluestore won't add extra metadata in normal io except rarely
>> header save which only update when header changed.
>
> Unfortunately, our write workload is dominated by object creates.
>
>
>> As for active write, why do you think it there only one active write
>> in keyvaluestore threads? I just check runtime perf data again, it
>> looks fine that multi write can do concurrently submit transaction.
>
> We've implemented a MySQL backend for KeyValueDB in the hopes of
> getting better performance than LevelDB (what we're currently seeing
> is on par).  Internally, it uses a LIFO connection pool, from which
> connections are leased for the duration of a transaction commit or
> snapshot walk (to permit concurrent transactions).  Watching the
> connection activity in MySQL using "SHOW PROCESSLIST", during most
> runs, it's clear that, for the duration of the write benchmark, all
> but two of the connections remain idle.  (During cleanup, I do see
> more connections used, and I have on occasion seen more used during
> writes.)  So while it's possible the transactions are being built
> concurrently, they aren't (or are with a very low probability) being
> submitted (via submit_transaction_sync()) concurrently.
>
> (It's entirely possible that a bug in our code, or misdocumented
> behavior in the MySQL client, excludes concurrent threads from using
> open MySQL connections, but I *have* seen concurrent transaction
> commits, only rarely.)
>
> You mention "runtime perf data", is there a simple way to query the
> OSD's idea of how many concurrent transaction submits it is issuing?
> In the meantime I'll instrument our backend to track this value
> itself.
>
> Thanks!

It's really a surprise that you implemented a MySQL backend. May I ask what
the purpose is? I think it may not be a good fit for keyvaluestore.

You can simply sum the time spent in submit_transaction_sync; with fully
concurrent submits it should come to roughly the elapsed time multiplied by
the number of op threads.
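
One way to read this suggestion is as a time-weighted average: total time
spent inside submits divided by wall-clock time gives the average number of
submits in flight. The sketch below assumes that reading; `SubmitTimer` and
its methods are hypothetical names, not part of Ceph.

```python
import threading
import time
from contextlib import contextmanager


class SubmitTimer:
    """Accumulates time spent inside a timed region across all threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self.total = 0.0

    @contextmanager
    def timed(self):
        start = time.monotonic()
        try:
            yield
        finally:
            spent = time.monotonic() - start
            with self._lock:
                self.total += spent

    def average_concurrency(self, elapsed):
        """Average number of timed regions in flight over `elapsed` seconds."""
        return self.total / elapsed


timer = SubmitTimer()
with timer.timed():
    pass  # a real backend would wrap its submit call here

# Worked example: 4 threads each busy for a whole 10 s run sum to 40 s.
example = SubmitTimer()
example.total = 40.0
print(example.average_concurrency(10.0))
```

A result close to the op thread count would mean the submits overlap
almost fully; a result close to 1 would mean they are effectively serial.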

-- 
Best Regards,

Wheat

* Re: [ceph-users] keyvaluestore backend metadata overhead
  2015-01-30 15:52       ` Haomai Wang
@ 2015-01-30 16:08         ` Chris Pacejo
  2015-01-30 16:18           ` Haomai Wang
  2015-02-03 15:13         ` Chris Pacejo
  1 sibling, 1 reply; 15+ messages in thread
From: Chris Pacejo @ 2015-01-30 16:08 UTC (permalink / raw)
  To: Haomai Wang; +Cc: Sage Weil, ceph-devel@vger.kernel.org

On Fri, Jan 30, 2015 at 10:52 AM, Haomai Wang <haomaiwang@gmail.com> wrote:
> It's really a surprise that you impl a MySQL backend. Could I know the
> purpose? Because it may not fit with keyvaluestore I think.

We've found it to perform better (in isolation) than LevelDB.  We were
able to map KeyValueDB's interface to it fairly painlessly, and I
believe correctly.  (The only major catch was that we needed to buffer
operations within a transaction and execute them all at once on
submit, to prevent MySQL unnecessarily holding locks for the duration
of long-lived transactions.)
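
The buffering approach described here can be sketched as follows. This is
an illustration under assumptions, not the real backend: a dict stands in
for the database, a `threading.Lock` stands in for the locks MySQL would
hold, and the method names are only loosely modelled on a key-value
transaction interface.

```python
import threading


class BufferedTransaction:
    """Records operations in memory and applies them only at submit time,
    so the (stand-in) database lock is held just for the final batch."""

    def __init__(self):
        self._ops = []

    def set(self, prefix, key, value):
        self._ops.append(("set", prefix, key, value))

    def rmkey(self, prefix, key):
        self._ops.append(("rm", prefix, key, None))

    def submit(self, store, lock):
        # Nothing has touched the store until now; apply everything at once.
        with lock:
            for op, prefix, key, value in self._ops:
                if op == "set":
                    store[(prefix, key)] = value
                else:
                    store.pop((prefix, key), None)
        self._ops.clear()


store = {}                    # hypothetical stand-in for the database
db_lock = threading.Lock()    # hypothetical stand-in for database locks

txn = BufferedTransaction()
txn.set("meta", "obj1", b"header")
txn.set("data", "obj1.0", b"payload")
txn.rmkey("meta", "stale")
assert store == {}            # building the transaction takes no locks
txn.submit(store, db_lock)
print(sorted(store))
```

The trade-off is that reads issued while the transaction is being built do
not see its pending writes, which is acceptable only if callers never rely
on that.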

> You can simply calculate the sum of submit_transaction_sync consuming
> time, it would be the multiple of the op thread number.

I will try this, thanks.

* Re: [ceph-users] keyvaluestore backend metadata overhead
  2015-01-30 16:08         ` Chris Pacejo
@ 2015-01-30 16:18           ` Haomai Wang
  2015-02-01 14:50             ` Chen, Xiaoxi
  0 siblings, 1 reply; 15+ messages in thread
From: Haomai Wang @ 2015-01-30 16:18 UTC (permalink / raw)
  To: Chris Pacejo; +Cc: Sage Weil, ceph-devel@vger.kernel.org

Although I'm still a bit confused, I'm glad to see more attempts.
More test results are welcome!

-- 
Best Regards,

Wheat

* RE: [ceph-users] keyvaluestore backend metadata overhead
  2015-01-30 16:18           ` Haomai Wang
@ 2015-02-01 14:50             ` Chen, Xiaoxi
  2015-02-03 15:03               ` Chris Pacejo
  0 siblings, 1 reply; 15+ messages in thread
From: Chen, Xiaoxi @ 2015-02-01 14:50 UTC (permalink / raw)
  To: Haomai Wang, Chris Pacejo; +Cc: Sage Weil, ceph-devel@vger.kernel.org

We can always use a structured database in an unstructured way, and I think it's workable in theory, but why choose MySQL?

As discussed a while ago, any LSM-structured database design will suffer in performance due to write amplification. Is the reason for going to MySQL only to avoid LSM, or to try a B-tree-like structure? If so, maybe LMDB is a better choice (although it hasn't yet proven itself production-ready)?

* Re: [ceph-users] keyvaluestore backend metadata overhead
  2015-02-01 14:50             ` Chen, Xiaoxi
@ 2015-02-03 15:03               ` Chris Pacejo
  2015-02-04  3:15                 ` Mark Nelson
  0 siblings, 1 reply; 15+ messages in thread
From: Chris Pacejo @ 2015-02-03 15:03 UTC (permalink / raw)
  To: Chen, Xiaoxi; +Cc: Haomai Wang, Sage Weil, ceph-devel@vger.kernel.org

Hi Xiaoxi,

On Sun, Feb 1, 2015 at 9:50 AM, Chen, Xiaoxi <xiaoxi.chen@intel.com> wrote:
> We can always use a structure database in an unstructured way, I think it's workable in theory, but  why choose MySQL?

In our internal performance tests, it performed better than LevelDB
and some others, and it's well-proven.  It's not our first choice, nor
are we done investigating other options.  But we'll check out LMDB,
thanks for the pointer.

Regardless, the issues we're seeing are equally applicable to any
key-value backend.

* Re: [ceph-users] keyvaluestore backend metadata overhead
  2015-01-30 15:52       ` Haomai Wang
  2015-01-30 16:08         ` Chris Pacejo
@ 2015-02-03 15:13         ` Chris Pacejo
  2015-02-03 20:25           ` Chris Pacejo
  1 sibling, 1 reply; 15+ messages in thread
From: Chris Pacejo @ 2015-02-03 15:13 UTC (permalink / raw)
  To: Haomai Wang; +Cc: Sage Weil, ceph-devel@vger.kernel.org

On Fri, Jan 30, 2015 at 10:52 AM, Haomai Wang <haomaiwang@gmail.com> wrote:
>>> As for active write, why do you think it there only one active write
>>> in keyvaluestore threads? I just check runtime perf data again, it
>>> looks fine that multi write can do concurrently submit transaction.
>
> You can simply calculate the sum of submit_transaction_sync consuming
> time, it would be the multiple of the op thread number.

I've instrumented submit_transaction to tick up/down an atomic
counter.  While in certain situations (resource-constrained VM; OSD
startup), I do see up to "keyvaluestore op threads" number of
concurrent transaction submits reported by this counter, on real
hardware, during `rados bench` with 2700-byte objects, I never see
more than 3 concurrent submits; on average, I see 2.  I don't know the
OSD's internals well enough to speculate on the cause, but it's worth
noting that the OSD processes consume a lot of CPU (170%+) during
these benchmarks (compared to 14% for 1 MiB objects).
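
For readers who want to reproduce this kind of measurement, the following
is a rough sketch of an up/down counter that also records its peak. It is
not the instrumentation actually used; the class name is hypothetical, and
a lock replaces the atomic counter for simplicity.

```python
import threading
from contextlib import contextmanager


class InFlightGauge:
    """Counts how many tracked regions are active and remembers the peak."""

    def __init__(self):
        self._lock = threading.Lock()
        self.current = 0
        self.peak = 0

    @contextmanager
    def track(self):
        with self._lock:
            self.current += 1                      # tick up on entry
            self.peak = max(self.peak, self.current)
        try:
            yield
        finally:
            with self._lock:
                self.current -= 1                  # tick down on exit


gauge = InFlightGauge()
barrier = threading.Barrier(3)


def fake_submit():
    # Stand-in for a transaction submit; the barrier forces all three
    # calls to overlap so the demo is deterministic.
    with gauge.track():
        barrier.wait()


threads = [threading.Thread(target=fake_submit) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(gauge.peak, gauge.current)
```

Sampling `current` periodically would give an average like the one
reported above, while `peak` corresponds to the "never more than 3" figure.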

We'll keep experimenting, but we're definitely excited about the
possibility of reducing metadata overhead.

* Re: [ceph-users] keyvaluestore backend metadata overhead
  2015-02-03 15:13         ` Chris Pacejo
@ 2015-02-03 20:25           ` Chris Pacejo
  2015-02-04  2:31             ` Haomai Wang
  0 siblings, 1 reply; 15+ messages in thread
From: Chris Pacejo @ 2015-02-03 20:25 UTC (permalink / raw)
  To: ceph-devel@vger.kernel.org

The below observations (including high CPU usage by the OSDs) hold
true when we (roughly) double the performance of the MySQL backend by
pointing it to SSDs instead of rotary media.  This causes us to
suspect that our current bottleneck is not the extra load placed on
the backend by the metadata, but rather something in the OSD which
causes it to be unable to saturate the backend.  Any thoughts?


On Tue, Feb 3, 2015 at 10:13 AM, Chris Pacejo <cpacejo@clearskydata.com> wrote:
> I've instrumented submit_transaction to tick up/down an atomic
> counter.  While in certain situations (resource-constrained VM; OSD
> startup), I do see up to "keyvaluestore op threads" number of
> concurrent transaction submits reported by this counter, on real
> hardware, during `rados bench` with 2700-byte objects, I never see
> more than 3 concurrent submits; on average, I see 2.  I don't know the
> OSD's internals well enough to speculate on the cause, but it's worth
> noting that the OSD processes consume a lot of CPU (170%+) during
> these benchmarks (compared to 14% for 1 MiB objects).

* Re: [ceph-users] keyvaluestore backend metadata overhead
  2015-02-03 20:25           ` Chris Pacejo
@ 2015-02-04  2:31             ` Haomai Wang
  2015-02-09 18:29               ` Chris Pacejo
  0 siblings, 1 reply; 15+ messages in thread
From: Haomai Wang @ 2015-02-04  2:31 UTC (permalink / raw)
  To: Chris Pacejo; +Cc: ceph-devel@vger.kernel.org

On Wed, Feb 4, 2015 at 4:25 AM, Chris Pacejo <cpacejo@clearskydata.com> wrote:
> The below observations (including high CPU usage by the OSDs) hold
> true when we (roughly) double the performance of the MySQL backend by
> pointing it to SSDs instead of rotary media.  This causes us to
> suspect that our current bottleneck is not the extra load placed on
> the backend by the metadata; but rather something in the OSD which
> causes it to be unable to saturate the backend.  Any thoughts?
>

Maybe more detailed numbers would help us a bit.

-- 
Best Regards,

Wheat

* Re: [ceph-users] keyvaluestore backend metadata overhead
  2015-02-03 15:03               ` Chris Pacejo
@ 2015-02-04  3:15                 ` Mark Nelson
  0 siblings, 0 replies; 15+ messages in thread
From: Mark Nelson @ 2015-02-04  3:15 UTC (permalink / raw)
  To: Chris Pacejo, Chen, Xiaoxi
  Cc: Haomai Wang, Sage Weil, ceph-devel@vger.kernel.org

On 02/03/2015 09:03 AM, Chris Pacejo wrote:
> Hi Xiaoxi,
>
> On Sun, Feb 1, 2015 at 9:50 AM, Chen, Xiaoxi <xiaoxi.chen@intel.com> wrote:
>> We can always use a structure database in an unstructured way, I think it's workable in theory, but  why choose MySQL?
>
> In our internal performance tests, it performed better than LevelDB
> and some others, and it's well-proven.  It's not our first choice, nor
> are we done investigating other options.  But we'll check out LMDB,
> thanks for the pointer.
>
> Regardless, the issues we're seeing are equally applicable to any
> key-value backend.

You may also wish to try the rocksdb backend with universal compaction 
rather than leveled compaction.

* Re: [ceph-users] keyvaluestore backend metadata overhead
  2015-02-04  2:31             ` Haomai Wang
@ 2015-02-09 18:29               ` Chris Pacejo
  2015-02-10  6:33                 ` Haomai Wang
  0 siblings, 1 reply; 15+ messages in thread
From: Chris Pacejo @ 2015-02-09 18:29 UTC (permalink / raw)
  To: Haomai Wang; +Cc: ceph-devel@vger.kernel.org

On Tue, Feb 3, 2015 at 9:31 PM, Haomai Wang <haomaiwang@gmail.com> wrote:
> Maybe more detail number can help us a bit.

Here's what we're testing with and what we observe:

Hardware:
 2x6-core hyperthreaded Xeon E5-2620 v2 2.10GHz CPU
 8x8 GiB DDR3 RAM
 4x4 TB 7200 RPM 8 ms 183 MB/s SAS rotary disks

Software:
 CentOS 7
 Ceph 0.91
 4 OSDs
 osd pool default size = 1 (just for testing!)
 keyvaluestore op threads = 16
 keyvaluestore backend = mysql (our own backend)
 one MySQL process per OSD, each writing to a separate disk

Test setup:
 rados bench on a fresh install
 256 concurrent writes
 360 seconds
 2700 byte objects, and 1 MiB objects
 measure throughput with rados bench
 measure CPU usage by observing top
 measure max concurrent transaction submits by instrumenting the
KeyValueDB interface

With this setup, we observe that, with 2700 byte objects:

 7.4 MiB/s (~2900 ops/s) throughput,
 170%/170%/60%/60% OSD CPU usage,
 200%/200%/65%/65% MySQL CPU usage, and
 3/3/1/1 maximum concurrent transaction submits;

and with 1 MiB objects:

 50.7 MiB/s (~51 ops/s) throughput,
 14%/14%/4%/4% OSD CPU usage,
 50%/50%/15%/15% MySQL CPU usage, and
 3/3/1/1 maximum concurrent transaction submits.
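
The ops/s figures in parentheses follow directly from the throughput and
object size. A quick check, using only the numbers listed above (the helper
name is made up):

```python
MIB = 1024 * 1024


def ops_per_second(throughput_mib_s, object_bytes):
    """Convert a throughput in MiB/s to object writes per second."""
    return throughput_mib_s * MIB / object_bytes


small = ops_per_second(7.4, 2700)    # 2700-byte objects
large = ops_per_second(50.7, MIB)    # 1 MiB objects

print(f"2700-byte objects: {small:.0f} ops/s")
print(f"1 MiB objects:     {large:.1f} ops/s")
print(f"ratio:             {small / large:.0f}x")
```

This gives about 2874 ops/s and 50.7 ops/s, in line with the rounded
~2900 and ~51 above. The small-object run therefore issues roughly 57
times as many operations per second, which may account for part of the
higher CPU usage.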

We know that our transaction concurrency measurement is not buggy, as
it will consistently report up to `keyvaluestore op threads`
concurrent submits both on OSD startup on this same hardware, and
during benchmarking in a resource-constrained VM.  We are pretty sure
MySQL is not the bottleneck, since we've been able to throw much more
at it (concurrently); at least 10 kops/s per instance.  (Sequentially
it is not so good; hence our fixation on the low transaction
concurrency!)

Let me know if there are any other figures which would be helpful in
diagnosing why the OSDs are not issuing as many concurrent
transactions as we'd like, or why they are using so much CPU.  Thanks
for your help.

- Chris

* Re: [ceph-users] keyvaluestore backend metadata overhead
  2015-02-09 18:29               ` Chris Pacejo
@ 2015-02-10  6:33                 ` Haomai Wang
  0 siblings, 0 replies; 15+ messages in thread
From: Haomai Wang @ 2015-02-10  6:33 UTC (permalink / raw)
  To: Chris Pacejo; +Cc: ceph-devel@vger.kernel.org

On Tue, Feb 10, 2015 at 2:29 AM, Chris Pacejo <cpacejo@clearskydata.com> wrote:
> On Tue, Feb 3, 2015 at 9:31 PM, Haomai Wang <haomaiwang@gmail.com> wrote:
>> Maybe more detail number can help us a bit.
>
> Here's what we're testing with and what we observe:
>
> Hardware:
>  2x6-core hyperthreaded Xeon E5-2620 v2 2.10GHz CPU
>  8x8 GiB DDR3 RAM
>  4x4 TB 7200 RPM 8 ms 183 MB/s SAS rotary disks
>
> Software:
>  CentOS 7
>  CEPH 0.91
>  4 OSDs
>  osd pool default size = 1 (just for testing!)
>  keyvaluestore op threads = 16
>  keyvaluestore backend = mysql (our own backend)
>  one MySQL process per OSD, each writing to a separate disk
>
> Test setup:
>  rados bench on a fresh install
>  256 concurrent writes
>  360 seconds
>  2700 byte objects, and 1 MiB objects
>  measure throughput with rados bench
>  measure CPU usage by observing top
>  measure max concurrent transaction submits by instrumenting the
> KeyValueDB interface
>
> With this setup, we observe that, with 2700 byte objects:
>
>  7.4 MiB/s (~2900 ops/s) throughput,
>  170%/170%/60%/60% OSD CPU usage,
>  200%/200%/65%/65% MySQL CPU usage, and
>  3/3/1/1 maximum concurrent transaction submits;
>
> and with 1 MiB objects:
>
>  50.7 MiB/s (~51 ops/s) throughput,
>  14%/14%/4%/4% OSD CPU usage,
>  50%/50%/15%/15% MySQL CPU usage, and
>  3/3/1/1 maximum concurrent transaction submits.

It looks like the ops are a little unbalanced across the four OSDs?

>
> We know that our transaction concurrency measurement is not buggy, as
> it will consistently report up to `keyvaluestore op threads`
> concurrent submits both on OSD startup on this same hardware, and
> during benchmarking in a resource-constrained VM.  We are pretty sure
> MySQL is not the bottleneck, since we've been able to throw much more
> at it (concurrently); at least 10 kops/s per instance.  (Sequentially
> it is not so good; hence our fixation on the low transaction
> concurrency!)
>
> Let me know if there are any other figures which would be helpful in
> diagnosing why the OSDs are not issuing as many concurrent
> transactions as we'd like, or why they are using so much CPU.  Thanks
> for your help.

I think you can look at the perf dump output to see whether any throttle
queue is full, such as the keyvaluestore queue.
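
A check along these lines could be scripted roughly as below. This is a
sketch under assumptions: it expects JSON such as that printed by
`ceph daemon osd.N perf dump`, and it assumes throttle counters appear as
sections whose names start with "throttle-" and carry "val" and "max"
fields, with a "max" of 0 meaning no limit. The sample data is made up for
illustration and is not real OSD output.

```python
import json


def full_throttles(perf_dump, threshold=0.95):
    """Return (section, val, max) for throttles at or near their limit."""
    full = []
    for section, counters in perf_dump.items():
        if not section.startswith("throttle-"):
            continue
        limit = counters.get("max", 0)
        value = counters.get("val", 0)
        if limit > 0 and value >= threshold * limit:
            full.append((section, value, limit))
    return full


# Made-up sample; the section names here are hypothetical.
sample = json.loads("""
{
  "throttle-example_queue_ops":   {"val": 50,   "max": 50},
  "throttle-example_queue_bytes": {"val": 1024, "max": 104857600},
  "throttle-example_unlimited":   {"val": 7,    "max": 0},
  "osd":                          {"op_w": 12345}
}
""")

for section, value, limit in full_throttles(sample):
    print(f"{section}: {value}/{limit}")
```

A throttle that sits at its limit for the whole benchmark would be a
candidate for whatever is capping the number of in-flight transactions.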

Sorry, I still can't think of anything above the objectstore backend that
would limit concurrency; in most cases, the backend should be the
bottleneck.

>
> - Chris



-- 
Best Regards,

Wheat
