Reproducing allocator performance differences

* Reproducing allocator performance differences
@ 2015-10-01 15:32 Curley, Matthew
  2015-10-01 17:18 ` Mark Nelson
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Curley, Matthew @ 2015-10-01 15:32 UTC (permalink / raw)
  To: ceph-devel@vger.kernel.org

We've been trying to reproduce the allocator performance impact on 4K random reads seen in the Hackathon (and more recent tests).  At this point though, we're not seeing any significant difference between tcmalloc and jemalloc so we're looking for thoughts on what we're doing wrong.  Or at least some suggestions to try out.

More detail here:
https://drive.google.com/file/d/0B2kp18maR7axTmU5WG9WclNKQlU/view?usp=sharing

Thanks for any input!

--MC

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Reproducing allocator performance differences
  2015-10-01 15:32 Reproducing allocator performance differences Curley, Matthew
@ 2015-10-01 17:18 ` Mark Nelson
  2015-10-01 18:09   ` Curley, Matthew
  2015-10-02  6:07 ` Dałek, Piotr
  2015-10-02  6:10 ` Alexandre DERUMIER
  2 siblings, 1 reply; 12+ messages in thread
From: Mark Nelson @ 2015-10-01 17:18 UTC (permalink / raw)
  To: Curley, Matthew, ceph-devel@vger.kernel.org

On 10/01/2015 10:32 AM, Curley, Matthew wrote:
> We've been trying to reproduce the allocator performance impact on 4K random reads seen in the Hackathon (and more recent tests).  At this point though, we're not seeing any significant difference between tcmalloc and jemalloc so we're looking for thoughts on what we're doing wrong.  Or at least some suggestions to try out.
>
> More detail here:
> https://drive.google.com/file/d/0B2kp18maR7axTmU5WG9WclNKQlU/view?usp=sharing
>
> Thanks for any input!

Hi Matthew,

I can point out a couple of differences in our setups:

1) I have 4 NVMe cards with 4 OSDs per card in each node, i.e. 16 OSDs
total per node.  I'm also running the fio processes on the same nodes as
the OSDs, so there is far less CPU available per OSD in my setup.

2) You have more memory per node than I do (and far more memory per OSD)

3) I'm using fio with the librbd engine, not fio+libaio on kernel RBD.
It would be interesting to know if this is having an effect.

4) I'm using RBD cache (and allowing writeback before flush)

5) I'm not using nobarriers
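
The two client paths contrasted in point 3 can be sketched as fio invocations. This is only a minimal sketch, not the job files used in these tests: the image name, device path, pool, client name, queue depth, and runtime are placeholder assumptions, and only the engine selection reflects the difference described above.

```python
import shlex


def fio_randread_cmd(engine, target, iodepth=32, numjobs=1, runtime_s=300,
                     pool="rbd", client="admin"):
    """Return a fio argv list for a 4K random-read run.

    engine="rbd":    userspace librbd path; target is an RBD image name.
    engine="libaio": kernel RBD path; target is a mapped block device.
    Every default value here is a placeholder, not a setting from the thread.
    """
    common = [
        "fio", "--name=randread-4k", "--rw=randread", "--bs=4k",
        f"--iodepth={iodepth}", f"--numjobs={numjobs}",
        "--time_based", f"--runtime={runtime_s}", "--group_reporting",
    ]
    if engine == "rbd":
        return common + ["--ioengine=rbd", f"--clientname={client}",
                         f"--pool={pool}", f"--rbdname={target}"]
    if engine == "libaio":
        # O_DIRECT so the client page cache does not absorb the reads.
        return common + ["--ioengine=libaio", "--direct=1",
                         f"--filename={target}"]
    raise ValueError(f"unsupported engine: {engine}")


if __name__ == "__main__":
    # "image0" and "/dev/rbd0" are hypothetical names.
    print(shlex.join(fio_randread_cmd("rbd", "image0")))
    print(shlex.join(fio_randread_cmd("libaio", "/dev/rbd0")))
```

Keeping everything except the engine identical makes it easier to attribute any difference to the librbd versus kernel RBD path.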

I suspect that in my setup I am very much bound by things other than the 
NVMe cards.  I think we should look at this in terms of per-node 
throughput rather than per-OSD.  What I find very interesting is that 
you are seeing much higher per-node tcmalloc performance than I am but 
fairly similar per-node jemalloc performance.  For 4K random reads I saw
about 14K IOPS per node with tcmalloc+32MB TC (thread cache) and around
40K IOPS per node with tcmalloc+128MB TC or jemalloc.  It appears to me that
for both tcmalloc and jemalloc you saw around 50K IOPS per node in the 4 
OSD per card case.
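
For reference, the 32MB and 128MB figures can be turned into the byte values that gperftools tcmalloc reads from the TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES environment variable, and the two per-node numbers into a ratio. This is a sketch under assumptions: it takes "TC" to mean the total thread cache size, and how the variable reaches the ceph-osd processes (init script, sysconfig file, wrapper) depends on the installation and is not shown.

```python
import os

MIB = 1024 * 1024


def thread_cache_env(size_mib, base_env=None):
    """Return a copy of the environment with the tcmalloc total thread
    cache limit set to size_mib mebibytes."""
    env = dict(os.environ if base_env is None else base_env)
    env["TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES"] = str(size_mib * MIB)
    return env


def speedup(baseline_iops, tuned_iops):
    """Ratio of tuned to baseline throughput."""
    return tuned_iops / baseline_iops


if __name__ == "__main__":
    print(thread_cache_env(128, base_env={}))
    # Per-node figures quoted above: ~14K IOPS (32MB) vs ~40K IOPS (128MB).
    print(round(speedup(14_000, 40_000), 2))
```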

A couple of thoughts:

1) Did you happen to record any CPU usage data during your tests? 
Perhaps with only 4 OSDs per node there is less CPU contention.

2) Did you test 4K random writes?  It would be interesting to see if 
those results show the same behavior.

3) I'm going to assume that since you saw differences in performance 
with different queue depths that this is O_DIRECT?  Did you sync/drop 
cache on the OSDs before the tests?  Was the data pre-filled on the RBD 
volumes?

4) Even given the above, you have a lot more memory available for buffer 
cache.  Did you happen to look at how many of the IOs were actually 
hitting the NVMe devices?
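
The sync/drop-cache step asked about in point 3 can be sketched as below. It assumes a Linux host and root privileges on each OSD node; the function and its parameters are illustrative, and the path argument exists only so the logic can be exercised without touching /proc.

```python
import os


def drop_caches(mode=3, path="/proc/sys/vm/drop_caches"):
    """Flush dirty pages, then ask the kernel to drop clean caches.

    mode 1 = page cache, 2 = dentries and inodes, 3 = both.
    Writing to the real path requires root.
    """
    if mode not in (1, 2, 3):
        raise ValueError("mode must be 1, 2, or 3")
    os.sync()  # dirty pages cannot be dropped, so write them back first
    with open(path, "w") as f:
        f.write(f"{mode}\n")
```

Running this on every OSD node immediately before each test keeps page-cache state from one run from leaking into the next.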

Mark

>
> --MC
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: Reproducing allocator performance differences
  2015-10-01 17:18 ` Mark Nelson
@ 2015-10-01 18:09   ` Curley, Matthew
  2015-10-02  3:39     ` Chaitanya Huilgol
  2015-10-13 23:32     ` Curley, Matthew
  0 siblings, 2 replies; 12+ messages in thread
From: Curley, Matthew @ 2015-10-01 18:09 UTC (permalink / raw)
  To: Mark Nelson, ceph-devel@vger.kernel.org

Thanks a bunch for the feedback Mark.  I'll push this back to the guy doing the test runs and get more data, including the writes.

Some responses:
* There's definitely a fair amount of CPU available even at higher queue depths, but I don't have current results.  I'll get a colmux grab for a representative sample.

* We did try fio with librbd (and multiple block devices/workers per client) previously on a different rig; what we saw was no real benefit over kernel + libaio.  We'll get concrete data on this rig though.

* Yes on fio with direct I/O, yes on the pre-fill, and yes on the drop cache (with a 3).  Not dropping cache has actually caused some frustrating inconsistency in results, but that's a whole different topic.

* For these results--especially with more outstanding I/O--the reads very quickly end up being served entirely from page cache, and you see almost nothing at the NVMe device.  I was less worried for this particular test since we were after demonstrating a % shift in processing efficiency at the OSD rather than an accurate representation of the backing storage, but correct me if that's a poor assumption here.

* We'll try to track more closely to your memory per OSD ratio.  When we shift the block device size and reduce the kernel memory to force a % of I/O to not come from page cache, overall performance definitely drops (about 70K IOPS between lowest and highest results, for consistent queue depth and client count).
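
One rough way to put a number on how much I/O reaches the NVMe devices is to compare the "reads completed" counter in /proc/diskstats before and after a run with the number of reads the clients issued. The sketch below is only an estimate: request merging and readahead mean device reads do not map one-to-one to client reads, and the device name and sample counters are made up for illustration.

```python
def reads_completed(diskstats_text, device):
    """Return the reads-completed counter for a device from the text of
    /proc/diskstats (fourth whitespace-separated field of its line)."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[2] == device:
            return int(fields[3])
    raise KeyError(device)


def device_read_fraction(before, after, device, client_reads):
    """Fraction of client reads that showed up as device reads."""
    delta = reads_completed(after, device) - reads_completed(before, device)
    return delta / client_reads


if __name__ == "__main__":
    # Hypothetical snapshots; on a real node read /proc/diskstats instead.
    before = "259 0 nvme0n1 1000 0 8000 50 200 0 1600 20 0 70 70"
    after = "259 0 nvme0n1 1500 0 12000 75 200 0 1600 20 0 95 95"
    print(device_read_fraction(before, after, "nvme0n1", client_reads=10_000))
```

A fraction near zero would suggest the run was served almost entirely from page cache.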

--MC

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: Reproducing allocator performance differences
  2015-10-01 18:09   ` Curley, Matthew
@ 2015-10-02  3:39     ` Chaitanya Huilgol
  2015-10-13 23:32     ` Curley, Matthew
  1 sibling, 0 replies; 12+ messages in thread
From: Chaitanya Huilgol @ 2015-10-02  3:39 UTC (permalink / raw)
  To: Curley, Matthew, Mark Nelson, ceph-devel@vger.kernel.org

We have been able to hit consistently bad tcmalloc performance with the following test:

- Configure more than one pool and create RBD images in each pool
- Start I/O on the RBD images of one pool
- After a short time, start I/O on the images from the other pool

This is easier to reproduce with a larger number of shards and fewer PGs.

This should induce movement of memory between the tcmalloc thread caches and the central free list, which results in slower allocs/frees.
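
The staggered two-pool test can be sketched as a small launcher. Everything specific here is an assumption for illustration: the pool and image names, the 30-second delay, the use of fio's rbd engine, and the 4K random-read pattern (the message does not say which workload was used).

```python
import subprocess
import time


def fio_rbd_cmd(pool, image, runtime_s):
    """fio argv for a 4K random-read job against one RBD image."""
    return ["fio", f"--name={pool}-{image}", "--ioengine=rbd",
            "--clientname=admin", f"--pool={pool}", f"--rbdname={image}",
            "--rw=randread", "--bs=4k", "--iodepth=32",
            "--time_based", f"--runtime={runtime_s}"]


def staggered_plan(images_a, images_b, delay_s, runtime_s,
                   pool_a="pool-a", pool_b="pool-b"):
    """Pool A jobs start at t=0; pool B jobs start delay_s later and are
    shortened so that all jobs finish at about the same time."""
    plan = [(0, fio_rbd_cmd(pool_a, img, runtime_s)) for img in images_a]
    late_runtime = max(runtime_s - delay_s, 1)
    plan += [(delay_s, fio_rbd_cmd(pool_b, img, late_runtime))
             for img in images_b]
    return plan


def run(plan, dry_run=True):
    """Launch each job at its offset; with dry_run only print the plan."""
    start = time.monotonic()
    procs = []
    for delay_s, argv in sorted(plan, key=lambda item: item[0]):
        if dry_run:
            print(f"t+{delay_s}s: {' '.join(argv)}")
            continue
        remaining = delay_s - (time.monotonic() - start)
        if remaining > 0:
            time.sleep(remaining)
        procs.append(subprocess.Popen(argv))
    return [p.wait() for p in procs]


if __name__ == "__main__":
    run(staggered_plan(["img0", "img1"], ["img0", "img1"],
                       delay_s=30, runtime_s=300))
```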

Regards,
Chaitanya


^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: Reproducing allocator performance differences
  2015-10-01 18:09   ` Curley, Matthew
  2015-10-02  3:39     ` Chaitanya Huilgol
@ 2015-10-13 23:32     ` Curley, Matthew
  2015-10-14 19:04       ` Mark Nelson
  1 sibling, 1 reply; 12+ messages in thread
From: Curley, Matthew @ 2015-10-13 23:32 UTC (permalink / raw)
  To: Mark Nelson, ceph-devel@vger.kernel.org

Got a pass of results out here:
https://drive.google.com/file/d/0B2kp18maR7axRmdVZGMyckRYc0k/view?usp=sharing

Bringing the config closer to the memory-per-OSD ratio Mark used, and switching to fio with the rbd engine plus multiple block devices per client, does appear to reproduce significant performance differences under load.  That holds for random reads, anyway; the testing we did didn't show much difference on writes.  Still not seeing the hackathon's 700K+ IOPS, but I didn't really expect that :)

Would be interested in any further feedback, and also in how meaningful the impact of any allocator changes from today's model is.  Clearly there appear to be cases where we're not seeing a difference between Hammer tcmalloc and jemalloc, but is it always just a matter of time before degradation kicks in?  Or do the changes not really affect certain configurations/data sets?

Thanks for the help!
-- MC

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Reproducing allocator performance differences
  2015-10-13 23:32     ` Curley, Matthew
@ 2015-10-14 19:04       ` Mark Nelson
  0 siblings, 0 replies; 12+ messages in thread
From: Mark Nelson @ 2015-10-14 19:04 UTC (permalink / raw)
  To: Curley, Matthew, ceph-devel@vger.kernel.org

Hi Matthew,

Glad to hear you were able to see a similar effect for reads at least! 
FWIW, I also have not been able to hit 700K IOPS, though my CPUs are
slower than the ones they are using at Intel.  On my setup I'm hitting
about 40K read IOPS per node and about 13-14K write IOPS per node with
3X replication and journals on the same NVMe devices as the data.  In 
the couple of tests I've run I'm getting around 160K read IOPS with 4 
nodes and ~330K read IOPS with 8 nodes.
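
As a quick check on how these numbers scale, the aggregate figures can be compared against perfectly linear scaling from the per-node rate. This is plain arithmetic on the approximate values quoted above; the function name is illustrative.

```python
def scaling_efficiency(per_node_iops, nodes, measured_iops):
    """Measured aggregate IOPS divided by the ideal linear aggregate."""
    return measured_iops / (per_node_iops * nodes)


if __name__ == "__main__":
    # ~40K read IOPS per node, ~160K at 4 nodes, ~330K at 8 nodes.
    print(scaling_efficiency(40_000, 4, 160_000))
    print(round(scaling_efficiency(40_000, 8, 330_000), 3))
```

Values close to 1.0 indicate that read throughput grows roughly linearly with node count over this range, within the rounding of the quoted figures.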

Mark

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: Reproducing allocator performance differences
  2015-10-01 15:32 Reproducing allocator performance differences Curley, Matthew
  2015-10-01 17:18 ` Mark Nelson
@ 2015-10-02  6:07 ` Dałek, Piotr
  2015-10-02  6:55   ` Alexandre DERUMIER
  2015-10-02  6:10 ` Alexandre DERUMIER
  2 siblings, 1 reply; 12+ messages in thread
From: Dałek, Piotr @ 2015-10-02  6:07 UTC (permalink / raw)
  To: Curley, Matthew, ceph-devel@vger.kernel.org

> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of Curley, Matthew
> Sent: Thursday, October 01, 2015 5:33 PM
> 
> We've been trying to reproduce the allocator performance impact on 4K
> random reads seen in the Hackathon (and more recent tests).  At this point
> though, we're not seeing any significant difference between tcmalloc and
> jemalloc so we're looking for thoughts on what we're doing wrong.  Or at
> least some suggestions to try out.
> 
> More detail here:
> https://drive.google.com/file/d/0B2kp18maR7axTmU5WG9WclNKQlU/view?
> usp=sharing
> 
> Thanks for any input!

This problem is more obvious on writes than on reads (I noticed that CPU usage on writes is far more erratic than on reads, where it's pretty much constant).  Also, more clients would be better (or worse, depending on how you look at it).
Finally, it gets more serious with time: once the tcmalloc caches fill up, it starts to slow down.


With best regards / Pozdrawiam
Piotr Dałek

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Reproducing allocator performance differences
  2015-10-02  6:07 ` Dałek, Piotr
@ 2015-10-02  6:55   ` Alexandre DERUMIER
  2015-10-02  7:24     ` Dałek, Piotr
  0 siblings, 1 reply; 12+ messages in thread
From: Alexandre DERUMIER @ 2015-10-02  6:55 UTC (permalink / raw)
  To: Dałek, Piotr; +Cc: Curley, Matthew, ceph-devel

>>also - more clients would be better (or worse, depending on how you look at it).

It's quite possible; if I remember correctly, I could trigger it more easily using fio with a large numjobs (30-40).

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: Reproducing allocator performance differences
  2015-10-02  6:55   ` Alexandre DERUMIER
@ 2015-10-02  7:24     ` Dałek, Piotr
  2015-10-02 11:25       ` Alexandre DERUMIER
  0 siblings, 1 reply; 12+ messages in thread
From: Dałek, Piotr @ 2015-10-02  7:24 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: Curley, Matthew, ceph-devel

> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of Alexandre DERUMIER
> Sent: Friday, October 02, 2015 8:55 AM
> 
> >>also - more clients would be better (or worse, depending on how you look
> at it).
> 
> It's quite possible, if I remember, I could trigger more easily with fio with a lot
> of numjobs (30-40)

I use rados bench -t 128 for that :)

With best regards / Pozdrawiam
Piotr Dałek



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Reproducing allocator performance differences
  2015-10-02  7:24     ` Dałek, Piotr
@ 2015-10-02 11:25       ` Alexandre DERUMIER
  2015-10-02 11:33         ` Dałek, Piotr
  0 siblings, 1 reply; 12+ messages in thread
From: Alexandre DERUMIER @ 2015-10-02 11:25 UTC (permalink / raw)
  To: Dałek, Piotr; +Cc: Curley, Matthew, ceph-devel

>>I use rados bench -t 128 for that :)

I'm not sure it's exactly the same.

In my rados bench tests,
I need to launch multiple "rados bench" processes in parallel to scale to high speeds

(like fio -numjobs, which creates multiple fio processes).

----- Original Message -----
From: "Dałek, Piotr" <Piotr.Dalek@ts.fujitsu.com>
To: "aderumier" <aderumier@odiso.com>
Cc: "Curley, Matthew" <matthew.curley@hpe.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
Sent: Friday, October 2, 2015 09:24:24
Subject: RE: Reproducing allocator performance differences

> -----Original Message----- 
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel- 
> owner@vger.kernel.org] On Behalf Of Alexandre DERUMIER 
> Sent: Friday, October 02, 2015 8:55 AM 
> 
> >>also - more clients would be better (or worse, depending on how you look 
> at it). 
> 
> It's quite possible, if I remember, I could trigger more easily with fio with a lot 
> of numjobs (30-40) 

I use rados bench -t 128 for that :) 

With best regards / Pozdrawiam 
Piotr Dałek 

* RE: Reproducing allocator performance differences
  2015-10-02 11:25       ` Alexandre DERUMIER
@ 2015-10-02 11:33         ` Dałek, Piotr
  0 siblings, 0 replies; 12+ messages in thread
From: Dałek, Piotr @ 2015-10-02 11:33 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: Curley, Matthew, ceph-devel

> -----Original Message-----
> From: Alexandre DERUMIER [mailto:aderumier@odiso.com]
> Sent: Friday, October 02, 2015 1:26 PM
> To: Dałek, Piotr
> Cc: Curley, Matthew; ceph-devel
> Subject: Re: Reproducing allocator performance differences
> 
> >>I use rados bench -t 128 for that :)
> 
> I'm not sure it's exactly the same
> 
> on my rados bench tests,
> I need to launch multiple "rados bench" process in parallel to scale at high
> speed.
> 
> (like fio -numjobs which create multiple fio process )

It's *quite* similar as long as you use a recent rados tool (one with https://github.com/ceph/ceph/pull/4690 integrated). Also, --no-verify helps on read/seq benches.

With best regards / Pozdrawiam
Piotr Dałek

* Re: Reproducing allocator performance differences
  2015-10-01 15:32 Reproducing allocator performance differences Curley, Matthew
  2015-10-01 17:18 ` Mark Nelson
  2015-10-02  6:07 ` Dałek, Piotr
@ 2015-10-02  6:10 ` Alexandre DERUMIER
  2 siblings, 0 replies; 12+ messages in thread
From: Alexandre DERUMIER @ 2015-10-02  6:10 UTC (permalink / raw)
  To: Curley, Matthew; +Cc: ceph-devel

I was able to reproduce it many times with this config: 4k randread, Intel S3610 SSD, testing with a small RBD image that can be kept in buffer memory.

I think I was able to trigger it easily at around 150-200k IOPS per OSD.



auth_cluster_required = none
auth_service_required = none
auth_client_required = none
filestore_xattr_use_omap = true
osd_pool_default_min_size = 1
debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_journaler = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
osd_op_threads = 5
filestore_op_threads = 4
osd_op_num_threads_per_shard = 2
osd_op_num_shards = 10
filestore_fd_cache_size = 64
filestore_fd_cache_shards = 32
ms_nocrc = true
ms_dispatch_throttle_bytes = 0
cephx_sign_messages = false
cephx_require_signatures = false
throttler_perf_counter = false
ms_crc_header = false
ms_crc_data = false

[osd]
osd_client_message_size_cap = 0
osd_client_message_cap = 0
osd_enable_op_tracker = false
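
As a rough sanity check on the threading options above, here is a small sketch that parses an excerpt of the config and multiplies osd_op_num_shards by osd_op_num_threads_per_shard. Two assumptions: the options listed before [osd] belong to a [global] section (the header is not shown in the mail), and the sharded op queue runs shards x threads-per-shard workers per OSD, as far as I understand it. The function name is hypothetical.

```python
import configparser

# Excerpt of the settings posted above (threading-related options only).
FRAGMENT = """\
osd_op_threads = 5
filestore_op_threads = 4
osd_op_num_threads_per_shard = 2
osd_op_num_shards = 10

[osd]
osd_enable_op_tracker = false
"""


def sharded_op_threads(conf_text):
    """Return osd_op_num_shards * osd_op_num_threads_per_shard."""
    parser = configparser.ConfigParser(interpolation=None)
    # Assumption: options before the first section header are [global].
    parser.read_string("[global]\n" + conf_text)
    section = parser["global"]
    shards = section.getint("osd_op_num_shards")
    per_shard = section.getint("osd_op_num_threads_per_shard")
    return shards * per_shard
```

With the values above this gives 10 x 2 = 20 op worker threads per OSD, which is worth comparing against the core count of the OSD nodes when trying to reproduce the result.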




----- Original Message -----
From: "Curley, Matthew" <matthew.curley@hpe.com>
To: "ceph-devel" <ceph-devel@vger.kernel.org>
Sent: Thursday, October 1, 2015 17:32:49
Subject: Reproducing allocator performance differences

We've been trying to reproduce the allocator performance impact on 4K random reads seen in the Hackathon (and more recent tests). At this point though, we're not seeing any significant difference between tcmalloc and jemalloc so we're looking for thoughts on what we're doing wrong. Or at least some suggestions to try out. 

More detail here: 
https://drive.google.com/file/d/0B2kp18maR7axTmU5WG9WclNKQlU/view?usp=sharing 

Thanks for any input! 

--MC 

Thread overview: 12+ messages
2015-10-01 15:32 Reproducing allocator performance differences Curley, Matthew
2015-10-01 17:18 ` Mark Nelson
2015-10-01 18:09   ` Curley, Matthew
2015-10-02  3:39     ` Chaitanya Huilgol
2015-10-13 23:32     ` Curley, Matthew
2015-10-14 19:04       ` Mark Nelson
2015-10-02  6:07 ` Dałek, Piotr
2015-10-02  6:55   ` Alexandre DERUMIER
2015-10-02  7:24     ` Dałek, Piotr
2015-10-02 11:25       ` Alexandre DERUMIER
2015-10-02 11:33         ` Dałek, Piotr
2015-10-02  6:10 ` Alexandre DERUMIER
