Performance variation across RBD clients on different pools in all SSD setup

* Performance variation across RBD clients on different pools in all SSD setup - tcmalloc issue
@ 2014-12-03 11:41 Chaitanya Huilgol
  2014-12-03 12:53 ` Dan van der Ster
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Chaitanya Huilgol @ 2014-12-03 11:41 UTC (permalink / raw)
  To: ceph-devel@vger.kernel.org

Hi All,

We are seeing large variations in read performance across RBD clients on different pools. Below is a summary of our findings:

- The first client to start I/O after a cluster restart (ceph start/stop on all OSD nodes) gets the best performance.
- Clients started later show 40% to 70% lower performance. This is seen even when the first client's I/O is stopped before the second client's I/O starts.
- Adding performance counters showed a large increase in latency (up to 3x) across the entire path, with no specific point of increased latency.
- On further investigation, we traced the root cause to degraded tcmalloc performance, which adds latency across the entire path.
- The variation grows as we increase the number of op worker shards. With fewer shards the variation is smaller, but this results in more lock contention and is not a good option for SSD-based clusters.
- The variation is observed even when the RBD images have never been written, which indicates that this is not a filesystem issue.

Below is a snippet of perf top output for the two runs:

(1)    TCmalloc  - Client-1
  2.68%  ceph-osd                 [.] crush_hash32_3
  2.65%  libtcmalloc.so.4.1.2     [.] operator new(unsigned long)
  1.66%  [kernel]                 [k] _raw_spin_lock
  1.56%  libstdc++.so.6.0.19      [.] std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&)
  1.51%  libtcmalloc.so.4.1.2     [.] operator delete(void*)

(2)    TCmalloc - Client-2 (note the significant increase in tcmalloc's internal code paths that release memory to and fetch it from the central free list)

 14.75%  libtcmalloc.so.4.1.2     [.] tcmalloc::CentralFreeList::FetchFromSpans()
  7.46%  libtcmalloc.so.4.1.2     [.] tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int)
  6.71%  libtcmalloc.so.4.1.2     [.] tcmalloc::CentralFreeList::ReleaseToSpans(void*)
  1.68%  libtcmalloc.so.4.1.2     [.] operator new(unsigned long)
  1.57%  ceph-osd                 [.] crush_hash32_3

Tying it all together, it looks like new client I/O on a different pool changes how the OSD shards are used, which in turn moves memory between the thread-local caches and the central free lists.
Increasing the tcmalloc thread cache limit with 'TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES' alleviates the issue in our test setups. However, this is only a temporary workaround, and it also bloats OSD memory usage.
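
As a rough cross-check of snippets (1) and (2) above, the short Python sketch below sums the sampled percentages per shared object. It is only an illustration: the function name `share_by_object` is made up for this example, and because the snippets list only the top few symbols, the totals are lower bounds on the real allocator share.

```python
from collections import defaultdict

# perf top lines copied from snippets (1) and (2) in this message.
CLIENT_1 = """
  2.68%  ceph-osd                 [.] crush_hash32_3
  2.65%  libtcmalloc.so.4.1.2     [.] operator new(unsigned long)
  1.66%  [kernel]                 [k] _raw_spin_lock
  1.56%  libstdc++.so.6.0.19      [.] std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&)
  1.51%  libtcmalloc.so.4.1.2     [.] operator delete(void*)
"""

CLIENT_2 = """
 14.75%  libtcmalloc.so.4.1.2     [.] tcmalloc::CentralFreeList::FetchFromSpans()
  7.46%  libtcmalloc.so.4.1.2     [.] tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int)
  6.71%  libtcmalloc.so.4.1.2     [.] tcmalloc::CentralFreeList::ReleaseToSpans(void*)
  1.68%  libtcmalloc.so.4.1.2     [.] operator new(unsigned long)
  1.57%  ceph-osd                 [.] crush_hash32_3
"""


def share_by_object(perf_text):
    """Sum the percentage column per shared object (second column)."""
    totals = defaultdict(float)
    for line in perf_text.splitlines():
        fields = line.split(None, 2)
        # Skip blank lines and anything that does not start with "NN.NN%".
        if len(fields) < 2 or not fields[0].endswith("%"):
            continue
        totals[fields[1]] += float(fields[0].rstrip("%"))
    return dict(totals)


for name, text in (("Client-1", CLIENT_1), ("Client-2", CLIENT_2)):
    share = share_by_object(text).get("libtcmalloc.so.4.1.2", 0.0)
    print(f"{name}: libtcmalloc >= {share:.2f}% of samples")
```

On the lines shown, libtcmalloc accounts for about 4.16% of samples for Client-1 and about 30.60% for Client-2.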

We have also tested glibc malloc and jemalloc based builds, where this issue is not seen; both hold up well. Below is the perf output from those tests:

(3)    Glibc - malloc : Any client - no significant change

  3.00%  libc-2.19.so         [.] _int_malloc
  2.65%  libc-2.19.so         [.] malloc
  2.47%  libc-2.19.so         [.] _int_free
  2.33%  ceph-osd             [.] crush_hash32_3
  1.63%  [kernel]             [k] _raw_spin_lock
  1.38%  libstdc++.so.6.0.19  [.] std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&)

(4)    Jemalloc  - Any client - no significant changes

  2.47%  ceph-osd                 [.] crush_hash32_3
  2.25%  libjemalloc.so.1         [.] free
  2.07%  libc-2.19.so             [.] 0x0000000000081070
  1.95%  libjemalloc.so.1         [.] malloc
  1.65%  [kernel]                 [k] _raw_spin_lock
  1.60%  libstdc++.so.6.0.19      [.] std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&)

IMHO, we should probably look at the following in general for better performance with less variation:

- Add a jemalloc option for ceph builds
- Look at ways to distribute PGs evenly across the shards. With a larger number of shards, some shards do not get exercised at all while others are overloaded
- Look at decreasing heap activity in the I/O path (index Manager, Hash Index, LFN index, etc.)
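
To illustrate the second point, here is a minimal Python sketch of how a handful of PGs can land unevenly on the op worker shards. It assumes the shard is chosen as the PG seed modulo the number of shards, which is an assumption made for this example and not something confirmed in this thread; the PG seeds and the helper `shard_load` are also hypothetical.

```python
from collections import Counter


def shard_load(pg_seeds, num_shards):
    """Return the number of PGs per shard.

    Assumption for illustration only: shard = pg_seed % num_shards.
    """
    counts = Counter(seed % num_shards for seed in pg_seeds)
    return [counts.get(shard, 0) for shard in range(num_shards)]


# Hypothetical PG seeds that happen to be hosted on one OSD.
pg_seeds = [3, 8, 13, 18, 23, 41, 52, 66]

for num_shards in (2, 5):
    load = shard_load(pg_seeds, num_shards)
    idle = load.count(0)
    print(f"{num_shards} shards: load={load}, idle shards={idle}")
```

With these made-up seeds, two shards split the PGs 4/4, while five shards leave two shards idle and put five of the eight PGs on a single shard.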

We can discuss this further in today's performance meeting.

Thanks,
Chaitanya



________________________________

PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).



* Re: Performance variation across RBD clients on different pools in all SSD setup - tcmalloc issue
  2014-12-03 11:41 Performance variation across RBD clients on different pools in all SSD setup - tcmalloc issue Chaitanya Huilgol
@ 2014-12-03 12:53 ` Dan van der Ster
  2014-12-03 15:21   ` Chaitanya Huilgol
  2014-12-03 14:21 ` Mark Nelson
  2014-12-03 15:59 ` Sage Weil
  2 siblings, 1 reply; 7+ messages in thread
From: Dan van der Ster @ 2014-12-03 12:53 UTC (permalink / raw)
  To: Chaitanya Huilgol; +Cc: ceph-devel@vger.kernel.org

On Wed, Dec 3, 2014 at 12:41 PM, Chaitanya Huilgol
<Chaitanya.Huilgol@sandisk.com> wrote:
> Hi All,
>
> We are seeing large read performance variations across RBD clients on different pools. Below is the summary of our findings
>
> - First client starting I/O after a cluster restart (ceph start/stop on all OSD nodes) gets the best performance
> - Clients started later exhibit 40% to 70% degraded performance, This is seen even in cases where first client I/O is stopped before starting the second client I/O
> -  Adding performance counters showed large increase in latency across the entire path and no specific point of increased latency - upto 3x increase in latency
> - On further investigation we have root caused this to degradation in tcmalloc performance inducing large latency across the entire path
> - Also the variation is more as we increase the number of op worker shards, with lower shards the variation is lesser but this results in more lock contention and is not a good option for SSD based clusters
> - Variation is observed even when the RBD images are not written at all thus indicating that this is not a filesystem issue
>
> Below is a snippet of perf top output for the two runs:
>
> (1)    TCmalloc  - Client-1
>    2.68%  ceph-osd                 [.] crush_hash32_3
>   2.65%  libtcmalloc.so.4.1.2     [.] operator new(unsigned long)
>   1.66%  [kernel]                 [k] _raw_spin_lock
>   1.56%  libstdc++.so.6.0.19      [.] std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&)
>   1.51%  libtcmalloc.so.4.1.2     [.] operator delete(void*)
>
> (2)    TCmalloc - Client -2 (note significant increase in TCmalloc internal free to central list code paths)
>
> 14.75%  libtcmalloc.so.4.1.2     [.] tcmalloc::CentralFreeList::FetchFromSpans()
>   7.46%  libtcmalloc.so.4.1.2     [.] tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int)
>   6.71%  libtcmalloc.so.4.1.2     [.] tcmalloc::CentralFreeList::ReleaseToSpans(void*)
>   1.68%  libtcmalloc.so.4.1.2     [.] operator new(unsigned long)
>   1.57%  ceph-osd                 [.] crush_hash32_3
>
> Tying it all together, It looks like the new client I/O on a different pool induces change in how the OSD shards are used, this would induce movement of memory to/from the thread local caches to the central free lists.
> Increasing the TCmalloc thread cache limit with 'TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES' alleviates the issues in our test setups. However this is a temporary resolution - this also bloats the OSD memory usage
>

I've noticed that tcmalloc is quite visible in perf top, but I never
looked closer because we don't even have debug symbols enabled in our
tcmalloc. Here's a production dumpling ceph-osd right now:

Samples: 35K of event 'cycles', Event count (approx.): 4040795974,
Thread: ceph-osd(13976)
 87.81%  libtcmalloc.so.4.1.0.#prelink#.P1wCcj  [.] 0x0000000000017e6f
  1.41%  libpthread-2.12.so                     [.] pthread_mutex_lock
  1.40%  libstdc++.so.6.0.13                    [.] 0x0000000000065b8c

What value did you use for TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES and
how well did it alleviate the problem? I assume env
TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=x ceph-osd ... is sufficient to
override this?

Cheers, Dan


* RE: Performance variation across RBD clients on different pools in all SSD setup - tcmalloc issue
  2014-12-03 12:53 ` Dan van der Ster
@ 2014-12-03 15:21   ` Chaitanya Huilgol
  0 siblings, 0 replies; 7+ messages in thread
From: Chaitanya Huilgol @ 2014-12-03 15:21 UTC (permalink / raw)
  To: Dan van der Ster; +Cc: ceph-devel@vger.kernel.org

Hi Dan,

I think the default value of 'TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES' is 16M; we increased it to 128M (this needs further tuning). The heap stats from the OSD show only about 30M in the thread caches, though.
With the default setting, we have seen performance drop by 70% on the second client. With the tuning we have not seen this drop, though I guess it may just be postponing the problem.

Setting env in ceph-osd.conf will do it.
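
For reference, a minimal Python sketch of launching a process with the larger limit. It assumes "128M" means 128 MiB expressed in bytes; the helper names are invented for this example, and the child process is a stand-in that only echoes the variable (a real deployment would start ceph-osd with its usual arguments instead).

```python
import os
import subprocess
import sys

VAR = "TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES"


def env_with_thread_cache(mebibytes, base_env=None):
    """Copy an environment and set the tcmalloc thread cache limit in bytes."""
    env = dict(os.environ if base_env is None else base_env)
    env[VAR] = str(mebibytes * 1024 * 1024)
    return env


def run_with_thread_cache(argv, mebibytes):
    """Run argv with the limit set; raises CalledProcessError on failure."""
    return subprocess.run(argv, env=env_with_thread_cache(mebibytes), check=True)


if __name__ == "__main__":
    # Stand-in child: prints the value it inherited from the parent.
    child = [sys.executable, "-c", f"import os; print(os.environ['{VAR}'])"]
    run_with_thread_cache(child, 128)
```

Under that assumption, 128M corresponds to 134217728 bytes and the 16M default to 16777216 bytes.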

Regards,
Chaitanya


* Re: Performance variation across RBD clients on different pools in all SSD setup - tcmalloc issue
  2014-12-03 11:41 Performance variation across RBD clients on different pools in all SSD setup - tcmalloc issue Chaitanya Huilgol
  2014-12-03 12:53 ` Dan van der Ster
@ 2014-12-03 14:21 ` Mark Nelson
  2014-12-03 15:31   ` Chaitanya Huilgol
  2014-12-03 15:59 ` Sage Weil
  2 siblings, 1 reply; 7+ messages in thread
From: Mark Nelson @ 2014-12-03 14:21 UTC (permalink / raw)
  To: Chaitanya Huilgol, ceph-devel@vger.kernel.org



On 12/03/2014 05:41 AM, Chaitanya Huilgol wrote:
> Hi All,
>
> We are seeing large read performance variations across RBD clients on different pools. Below is the summary of our findings
>
> - First client starting I/O after a cluster restart (ceph start/stop on all OSD nodes) gets the best performance
> - Clients started later exhibit 40% to 70% degraded performance, This is seen even in cases where first client I/O is stopped before starting the second client I/O
> -  Adding performance counters showed large increase in latency across the entire path and no specific point of increased latency - upto 3x increase in latency
> - On further investigation we have root caused this to degradation in tcmalloc performance inducing large latency across the entire path
> - Also the variation is more as we increase the number of op worker shards, with lower shards the variation is lesser but this results in more lock contention and is not a good option for SSD based clusters
> - Variation is observed even when the RBD images are not written at all thus indicating that this is not a filesystem issue
>
> Below is a snippet of perf top output for the two runs:
>
> (1)    TCmalloc  - Client-1
>     2.68%  ceph-osd                 [.] crush_hash32_3
>    2.65%  libtcmalloc.so.4.1.2     [.] operator new(unsigned long)
>    1.66%  [kernel]                 [k] _raw_spin_lock
>    1.56%  libstdc++.so.6.0.19      [.] std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&)
>    1.51%  libtcmalloc.so.4.1.2     [.] operator delete(void*)
>
> (2)    TCmalloc - Client -2 (note significant increase in TCmalloc internal free to central list code paths)
>
> 14.75%  libtcmalloc.so.4.1.2     [.] tcmalloc::CentralFreeList::FetchFromSpans()
>    7.46%  libtcmalloc.so.4.1.2     [.] tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int)
>    6.71%  libtcmalloc.so.4.1.2     [.] tcmalloc::CentralFreeList::ReleaseToSpans(void*)
>    1.68%  libtcmalloc.so.4.1.2     [.] operator new(unsigned long)
>    1.57%  ceph-osd                 [.] crush_hash32_3
>
> Tying it all together, It looks like the new client I/O on a different pool induces change in how the OSD shards are used, this would induce movement of memory to/from the thread local caches to the central free lists.
> Increasing the TCmalloc thread cache limit with 'TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES' alleviates the issues in our test setups. However this is a temporary resolution - this also bloats the OSD memory usage
>
> We have also tested with glibc malloc and jemalloc based builds where this issue is not seen, both hold up well, below is the perf output from the tests
>
> (3)    Glibc - malloc : Any client - no significant change
>
>    3.00%  libc-2.19.so         [.] _int_malloc
>    2.65%  libc-2.19.so         [.] malloc
>    2.47%  libc-2.19.so         [.] _int_free
>    2.33%  ceph-osd             [.] crush_hash32_3
>    1.63%  [kernel]             [k] _raw_spin_lock
>    1.38%  libstdc++.so.6.0.19  [.] std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&)
>
> (4)    Jemalloc  - Any client - no significant changes
>
>    2.47%  ceph-osd                 [.] crush_hash32_3
>    2.25%  libjemalloc.so.1         [.] free
>    2.07%  libc-2.19.so             [.] 0x0000000000081070
>    1.95%  libjemalloc.so.1         [.] malloc
>    1.65%  [kernel]                 [k] _raw_spin_lock
>    1.60%  libstdc++.so.6.0.19      [.] std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&)
>
> IMHO, we should probably look at the following in general for better performance with less variation
>
> - Add jemalloc option for ceph builds
> - Look at ways to evenly distribute PGs across the shards  - with larger number of shards some shards do not get exercised at all while some are overloaded
> - Look at decreasing heap activity in the I/O path (index Manager, Hash Index, LFN index etc.)
>
> We can discuss this further in todays performance meeting

This is a fantastic writeup, Chaitanya.  Please add it to the performance
meeting agenda.

fwiw there are some interesting benchmarks and discussion of different 
allocators here:

http://www.percona.com/blog/2012/07/05/impact-of-memory-allocators-on-mysql-performance/
http://www.reddit.com/r/programming/comments/18zija/github_got_30_better_performance_using_tcmalloc/

I would definitely be in favor of at least exploring options other than 
tcmalloc.

Mark


* RE: Performance variation across RBD clients on different pools in all SSD setup - tcmalloc issue
  2014-12-03 14:21 ` Mark Nelson
@ 2014-12-03 15:31   ` Chaitanya Huilgol
  0 siblings, 0 replies; 7+ messages in thread
From: Chaitanya Huilgol @ 2014-12-03 15:31 UTC (permalink / raw)
  To: mnelson@redhat.com, ceph-devel@vger.kernel.org

Thanks Mark, I have added this item to the agenda for today's meeting.

Regards,
Chaitanya


* Re: Performance variation across RBD clients on different pools in all SSD setup - tcmalloc issue
  2014-12-03 11:41 Performance variation across RBD clients on different pools in all SSD setup - tcmalloc issue Chaitanya Huilgol
  2014-12-03 12:53 ` Dan van der Ster
  2014-12-03 14:21 ` Mark Nelson
@ 2014-12-03 15:59 ` Sage Weil
  2014-12-03 17:41   ` Vijayendra Shamanna
  2 siblings, 1 reply; 7+ messages in thread
From: Sage Weil @ 2014-12-03 15:59 UTC (permalink / raw)
  To: Chaitanya Huilgol; +Cc: ceph-devel@vger.kernel.org

On Wed, 3 Dec 2014, Chaitanya Huilgol wrote:
> (2)    TCmalloc - Client -2 (note significant increase in TCmalloc internal free to central list code paths)
> 
> 14.75%  libtcmalloc.so.4.1.2     [.] tcmalloc::CentralFreeList::FetchFromSpans()
>   7.46%  libtcmalloc.so.4.1.2     [.] tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int)
>   6.71%  libtcmalloc.so.4.1.2     [.] tcmalloc::CentralFreeList::ReleaseToSpans(void*)
>   1.68%  libtcmalloc.so.4.1.2     [.] operator new(unsigned long)
>   1.57%  ceph-osd                 [.] crush_hash32_3

Yikes!

> IMHO, we should probably look at the following in general for better 
> performance with less variation
> 
> - Add jemalloc option for ceph builds

Definitely.

Several years ago we saw serious heap fragmentation issues with glibc.  I
suspect newer versions are less problematic.  It may also be that newer
versions of tcmalloc behave better (not sure if we're linking against the
latest version?).  In any case, we should have build support for all
options.  We'll need to be careful when making a change, though.  The best
choice may also vary on a per-distro basis.

> - Look at ways to evenly distribute PGs across the shards - with larger 
> number of shards some shards do not get exercised at all while some are 
> overloaded

Ok

> - Look at decreasing heap activity in the I/O path (index Manager, Hash 
> Index, LFN index etc.)

Yes.  Unfortunately I think this is a long tail... lots of small changes 
needed before we'll see much impact.

sage


* RE: Performance variation across RBD clients on different pools in all SSD setup - tcmalloc issue
  2014-12-03 15:59 ` Sage Weil
@ 2014-12-03 17:41   ` Vijayendra Shamanna
  0 siblings, 0 replies; 7+ messages in thread
From: Vijayendra Shamanna @ 2014-12-03 17:41 UTC (permalink / raw)
  To: Sage Weil, Chaitanya Huilgol; +Cc: ceph-devel@vger.kernel.org

Sage,

We tested with the latest version of tcmalloc as well, and it exhibited the same behavior.

Viju

