Re: Performance variation across RBD clients on different pools in all SSD setup - tcmalloc issue

From: Mark Nelson <mark.nelson@inktank.com>
To: Chaitanya Huilgol <Chaitanya.Huilgol@sandisk.com>,
	"ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
Subject: Re: Performance variation across RBD clients on different pools in all SSD setup - tcmalloc issue
Date: Wed, 03 Dec 2014 08:21:25 -0600
Message-ID: <547F1C65.1060102@redhat.com>
In-Reply-To: <9E914F5BD7F48A4782456CEB550A42280A74C0F6@SACMBXIP01.sdcorp.global.sandisk.com>



On 12/03/2014 05:41 AM, Chaitanya Huilgol wrote:
> Hi All,
>
> We are seeing large read performance variations across RBD clients on different pools. Below is a summary of our findings:
>
> - First client starting I/O after a cluster restart (ceph start/stop on all OSD nodes) gets the best performance
> - Clients started later exhibit 40% to 70% degraded performance. This is seen even when the first client's I/O is stopped before the second client's I/O is started
> - Adding performance counters showed a large increase in latency (up to 3x) across the entire path, with no specific point of increased latency
> - On further investigation, we have root-caused this to a degradation in tcmalloc performance, which induces large latency across the entire path
> - The variation also grows as we increase the number of op worker shards. With fewer shards the variation is smaller, but this results in more lock contention and is not a good option for SSD-based clusters
> - Variation is observed even when the RBD images are not written at all, which indicates that this is not a filesystem issue
>
> Below is a snippet of perf top output for the two runs:
>
> (1)    TCmalloc  - Client-1
>    2.68%  ceph-osd                 [.] crush_hash32_3
>    2.65%  libtcmalloc.so.4.1.2     [.] operator new(unsigned long)
>    1.66%  [kernel]                 [k] _raw_spin_lock
>    1.56%  libstdc++.so.6.0.19      [.] std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&)
>    1.51%  libtcmalloc.so.4.1.2     [.] operator delete(void*)
>
> (2)    TCmalloc - Client-2 (note the significant increase in TCmalloc internal free-to-central-list code paths)
>
>   14.75%  libtcmalloc.so.4.1.2     [.] tcmalloc::CentralFreeList::FetchFromSpans()
>    7.46%  libtcmalloc.so.4.1.2     [.] tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int)
>    6.71%  libtcmalloc.so.4.1.2     [.] tcmalloc::CentralFreeList::ReleaseToSpans(void*)
>    1.68%  libtcmalloc.so.4.1.2     [.] operator new(unsigned long)
>    1.57%  ceph-osd                 [.] crush_hash32_3
>
> Tying it all together, it looks like new client I/O on a different pool changes how the OSD shards are used, which would induce movement of memory between the thread-local caches and the central free lists.
> Increasing the TCmalloc thread cache limit with 'TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES' alleviates the issue in our test setups. However, this is only a temporary resolution, because it also bloats the OSD memory usage.
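
A minimal sketch of the workaround described above, assuming the OSD is started from a wrapper that controls its environment. The variable name comes from the message; the helper name, the 128 MiB value, and the stand-in child process are illustrative assumptions, not settings from the thread. The variable has to be in place before the process starts.

```python
import os
import subprocess
import sys

# Environment variable named in the message above.
ENV_VAR = "TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES"


def env_with_thread_cache_limit(limit_bytes, base_env=None):
    """Return a copy of the environment with the thread cache limit set.

    Hypothetical helper; it only prepares the environment for a child process.
    """
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    env = dict(os.environ if base_env is None else base_env)
    env[ENV_VAR] = str(limit_bytes)
    return env


# Assumption: 128 MiB is an illustrative value, not a figure from the thread.
env = env_with_thread_cache_limit(128 * 1024 * 1024)

# Stand-in child process that only echoes the variable. A real setup would
# start the tcmalloc-linked ceph-osd binary from this environment instead.
child = subprocess.run(
    [sys.executable, "-c", f"import os; print(os.environ['{ENV_VAR}'])"],
    env=env,
    capture_output=True,
    text=True,
    check=True,
)
print(child.stdout.strip())
```

As the message notes, a larger limit trades memory for less central free list traffic, so any value would need to be checked against the OSD's memory footprint.
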
>
> We have also tested glibc malloc and jemalloc based builds, where this issue is not seen; both hold up well. Below is the perf output from those tests:
>
> (3)    Glibc - malloc : Any client - no significant change
>
>    3.00%  libc-2.19.so         [.] _int_malloc
>    2.65%  libc-2.19.so         [.] malloc
>    2.47%  libc-2.19.so         [.] _int_free
>    2.33%  ceph-osd             [.] crush_hash32_3
>    1.63%  [kernel]             [k] _raw_spin_lock
>    1.38%  libstdc++.so.6.0.19  [.] std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&)
>
> (4)    Jemalloc  - Any client - no significant changes
>
>    2.47%  ceph-osd                 [.] crush_hash32_3
>    2.25%  libjemalloc.so.1         [.] free
>    2.07%  libc-2.19.so             [.] 0x0000000000081070
>    1.95%  libjemalloc.so.1         [.] malloc
>    1.65%  [kernel]                 [k] _raw_spin_lock
>    1.60%  libstdc++.so.6.0.19      [.] std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&)
>
> IMHO, we should probably look at the following in general for better performance with less variation:
>
> - Add a jemalloc option for ceph builds
> - Look at ways to evenly distribute PGs across the shards. With a larger number of shards, some shards do not get exercised at all while others are overloaded
> - Look at decreasing heap activity in the I/O path (index Manager, Hash Index, LFN index etc.)
>
> We can discuss this further in today's performance meeting.
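
To illustrate the shard imbalance point above, here is a toy model, assuming PGs are assigned to shards by a hash taken modulo the shard count. The SHA-256 hash, the PG names, and the function names are hypothetical and are not Ceph's actual mapping; the sketch only shows how to count per-shard load and idle shards.

```python
import hashlib
from collections import Counter


def shard_for(pg_id, num_shards):
    """Map a PG identifier to a shard index (toy hash-modulo mapping)."""
    digest = hashlib.sha256(pg_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_shards


def shard_load(pg_ids, num_shards):
    """Return the number of PGs that land on each shard, indexed by shard."""
    counts = Counter(shard_for(pg, num_shards) for pg in pg_ids)
    return [counts.get(shard, 0) for shard in range(num_shards)]


# Assumption: 32 active PGs with made-up names of the form "<pool>.<hex>".
pgs = [f"1.{i:x}" for i in range(32)]

# Compare a small and a large shard count for the same set of PGs.
for num_shards in (5, 25):
    load = shard_load(pgs, num_shards)
    print(f"shards={num_shards} load={load} "
          f"idle={load.count(0)} busiest={max(load)}")
```

In a model like this, the more shards there are relative to active PGs, the more likely it is that some shards receive nothing while others hold several PGs.
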

This is a fantastic write-up, Chaitanya.  Please add it to the performance
meeting agenda.

FWIW, there are some interesting benchmarks and discussions of different
allocators here:

http://www.percona.com/blog/2012/07/05/impact-of-memory-allocators-on-mysql-performance/
http://www.reddit.com/r/programming/comments/18zija/github_got_30_better_performance_using_tcmalloc/

I would definitely be in favor of at least exploring options other than 
tcmalloc.

Mark

>
> Thanks,
> Chaitanya
