From: David McBride
Subject: Bounding OSD memory requirements during peering/recovery
Date: Sun, 08 Feb 2015 16:05:13 +0000
Message-ID: <54D78939.4000708@cam.ac.uk>
To: Ceph-devel

Hello,

I'm trying to understand the memory requirements for a Ceph node,
particularly when it is undergoing recovery. Comments, suggestions,
pointers are all welcome.

(This is my second attempt at sending this email; it appeared to get
eaten the first time, probably because it had a 1MB .heap file
attached.)

Background:
==========

I've got a fairly tortured prototype Ceph cluster. It was left
unattended for several months, as I was needed elsewhere, but now I'm
returning to it, with an eye to continuing to build production
services on it if I have sufficient confidence in its capabilities.

In the intervening time, several root filesystems on cluster nodes
filled up (because of poorly configured logging, as well as MONs being
co-located with OSDs for expediency) and several drives were also
unceremoniously pulled out for reuse elsewhere.

A subsequent recovery is proving problematic: if all OSDs are started
concurrently, they substantially exceed the amount of RAM available on
the hosts during peering, and are killed off by the kernel OOM killer.

(They are then restarted by Upstart, resulting in thrashing for a
while, up until something unknown goes awry and the machine stops
sending telemetry and no longer responds to SSH. That's a separate
problem.)

Looking at tcmalloc-accounted heap statistics, I've seen individual
OSDs using 9GB+ of RAM; looking at RSS sizes on individual machines,
I've seen process images exceeding 16GB. On 12-disk machines with 32GB
of RAM each, this is problematic.

So, I've started looking at the data structures and algorithms that
govern OSD recovery. I've found the following references:

  http://ceph.com/docs/master/dev/placement-group/
  http://ceph.com/docs/master/dev/peering/
  http://ceph.com/docs/master/rados/operations/monitoring-osd-pg/
  http://ceph.com/docs/master/dev/osd_internals/map_message_handling/
  http://dachary.org/?p=2061

… and hope to develop an understanding of an upper bound on the memory
utilization that an efficient implementation of the algorithms
described would require.

I've also been trying to collect memory profiles for OSD processes as
they're operating, to compare theory with reality.
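
To put the figures above in perspective, here is a rough
back-of-the-envelope sketch in Python. It only uses the numbers quoted
in this message (32GB hosts, 12 OSDs per host, 9GB+ observed per OSD);
the function names and the `reserve_gb` parameter are made up for
illustration, and an even split of RAM between OSDs is an assumption.

```python
def per_osd_budget_gb(host_ram_gb, osds_per_host, reserve_gb=0.0):
    """RAM each OSD could use if host memory were split evenly.

    reserve_gb is a hypothetical allowance for the OS and other daemons.
    """
    return (host_ram_gb - reserve_gb) / osds_per_host


def max_concurrent_osds(host_ram_gb, per_osd_peak_gb, reserve_gb=0.0):
    """How many OSDs fit in RAM at once if each reaches the given peak."""
    return int((host_ram_gb - reserve_gb) // per_osd_peak_gb)


# Figures from this message: 32GB hosts, 12 OSDs each, 9GB+ seen per OSD.
budget = per_osd_budget_gb(32, 12)   # roughly 2.67GB per OSD
fit = max_concurrent_osds(32, 9)     # 3 OSDs before RAM is exhausted

print(f"even-split budget per OSD: {budget:.2f} GB")
print(f"OSDs that fit at a 9 GB peak: {fit} of 12")
```

In other words, an OSD peaking at 9GB uses more than three times an
even share of the host's memory, which is consistent with the OOM
kills seen when all twelve are started together.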

Memory profiling:
================

For example, having found an OSD using ~6GB of memory, I turned on
heap profiling, and dumped its state using `ceph tell osd.N heap
start_profiler; ceph tell osd.N heap dump`:

> ------------------------------------------------
> MALLOC:     6167528240 ( 5881.8 MiB) Bytes in use by application
> MALLOC: +     18309120 (   17.5 MiB) Bytes in page heap freelist
> MALLOC: +     39689152 (   37.9 MiB) Bytes in central cache freelist
> MALLOC: +      4750960 (    4.5 MiB) Bytes in transfer cache freelist
> MALLOC: +     25223840 (   24.1 MiB) Bytes in thread cache freelists
> MALLOC: +     27603096 (   26.3 MiB) Bytes in malloc metadata
> MALLOC:   ------------
> MALLOC: =   6283104408 ( 5992.0 MiB) Actual memory used (physical + swap)
> MALLOC: +      2080768 (    2.0 MiB) Bytes released to OS (aka unmapped)
> MALLOC:   ------------
> MALLOC: =   6285185176 ( 5994.0 MiB) Virtual address space used
> MALLOC:
> MALLOC:         374907              Spans in use
> MALLOC:            335              Thread heaps in use
> MALLOC:           8192              Tcmalloc page size
> ------------------------------------------------

However, the heap dumps so generated only appear to show memory
allocations (made? touched?) since heap profiling was enabled:

> google-pprof --text /usr/bin/ceph-osd osd.25.profile.0001.heap
> Using local file /usr/bin/ceph-osd.
> Using local file osd.25.profile.0001.heap.
> Total: 0.0 MB
>      0.0  46.7%  46.7%      0.0  59.0% SimpleMessenger::add_accept_pipe
> [...]

Note the "Total: 0.0 MB", which differs wildly from the stats reported
by tcmalloc, and the RSS of the process reported by the kernel.

So, for testing purposes, I selectively started up ~20% of the OSDs,
each invoked with the setting

  CEPH_HEAP_PROFILER_INIT=1

… defined in their environment to cause the heap profiler to be
started at OSD start-time. This has a significant CPU and memory
overhead.

Also set were the cluster flags:

  noout,nobackfill,norecover,noscrub,nodeep-scrub

… to avoid commingling memory requirements due to peering with other
factors.

I've produced a number of .heap files which show >= 1000MB of memory
allocated in an RB tree as a result of
PG::RecoveryState::RecoveryMachine::send_notify, PG::read_info and
MOSDPGNotify::decode_payload (or descendants).

An example heapfile from a fairly typical OSD can currently be fetched
from:

  http://people.ds.cam.ac.uk/dwm37/tmp/osd.0.profile.0124.heap

This was produced by the binaries from the Ceph 'trusty' repository;
`ceph -v` returns:

> ceph version 0.92 (00a3ac3b67d93860e7f0b6e07319f11b14d0fec0)

Running pprof in interactive mode and running `top30 --cum` on this
heapfile reports:

> Total: 2172.3 MB
>   1705.9  78.5%  78.5%   1748.4  80.5% __gnu_cxx::new_allocator::construct (inline)
>      0.0   0.0%  78.5%   1600.7  73.7% std::_Rb_tree::_M_create_node (inline)
>      0.0   0.0%  78.5%   1367.9  63.0% start_thread
>      0.0   0.0%  78.5%   1367.6  63.0% ioperm
>      0.0   0.0%  78.5%    963.4  44.4% ThreadPool::worker
>      0.0   0.0%  78.5%    963.3  44.3% ThreadPool::WorkThread::entry
>      0.0   0.0%  78.5%    951.0  43.8% OSD::process_peering_events
>      0.0   0.0%  78.5%    950.9  43.8% OSD::PeeringWQ::_process
>      0.0   0.0%  78.5%    949.8  43.7% PG::RecoveryState::handle_event (inline)
>      0.0   0.0%  78.5%    949.8  43.7% boost::statechart::detail::send_function::operator (inline)
>      0.0   0.0%  78.5%    949.8  43.7% boost::statechart::simple_state::react_impl
>      0.0   0.0%  78.5%    949.8  43.7% boost::statechart::state_machine::process_event (inline)
>      0.0   0.0%  78.5%    949.8  43.7% boost::statechart::state_machine::send_event
>      0.0   0.0%  78.5%    949.8  43.7% local_react (inline)
>      0.0   0.0%  78.5%    949.8  43.7% local_react_impl (inline)
>      0.0   0.0%  78.5%    949.8  43.7% operator (inline)
>      0.0   0.0%  78.5%    949.8  43.7% react (inline)
>      0.0   0.0%  78.5%    948.5  43.7% std::vector::push_back (inline)
>      0.0   0.0%  78.5%    948.3  43.7% PG::RecoveryState::RecoveryMachine::send_notify
>      0.0   0.0%  78.5%    947.1  43.6% std::vector::_M_insert_aux
>      0.0   0.0%  78.5%    947.0  43.6% _Rb_tree (inline)
>      0.0   0.0%  78.5%    947.0  43.6% map (inline)
>      0.0   0.0%  78.5%    947.0  43.6% std::_Rb_tree::_M_clone_node (inline)
>      0.0   0.0%  78.5%    947.0  43.6% std::_Rb_tree::_M_copy
>      0.0   0.0%  78.5%    809.8  37.3% construct (inline)
>      0.0   0.0%  78.5%    808.4  37.2% std::pair::pair
>      0.0   0.0%  78.5%    804.2  37.0% __libc_start_main
>      0.0   0.0%  78.5%    804.2  37.0% _start
>      0.0   0.0%  78.5%    804.2  37.0% main
>      0.0   0.0%  78.5%    803.6  37.0% OSD::init

This appears to show a large amount of memory, nearly a gigabyte,
allocated by boost::statechart, which is slightly surprising as the
FAQ for boost::statechart quotes a ~1KB memory footprint per
state-machine:

  http://www.boost.org/doc/libs/1_35_0/libs/statechart/doc/faq.html#EmbeddedApplications

Perhaps something unexpected is happening here? I'm almost hoping that
statechart is being subtly misused or misconfigured in some way that,
if fixed, would result in a significant drop in memory utilization…!

Quantifying problem-size:
========================

Given that the log-merging stage of PG recovery appears to be the
expensive one, I queried the statistics of those PGs which seemed to
be taking a long time to peer, via `ceph pg query`.

These showed that, for at least a handful of those PGs, the
recovery_state past_intervals list contained on the order of 200-300
entries. (I have no feel as to whether this is excessive.)
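
For anyone wanting to repeat the past_intervals count above without
reading the JSON by eye, a minimal Python sketch follows. It assumes
the `ceph pg query` output has been parsed into a dict whose
"recovery_state" key holds a list of state entries, some of which
carry a "past_intervals" list; that layout is an assumption and may
differ between Ceph versions. The sample data is synthetic (not output
from this cluster), and the function name and the keys inside each
interval are made up for illustration.

```python
import json


def past_interval_counts(pg_query):
    """Map each recovery_state entry's name to its past_intervals length.

    Entries without a past_intervals list are skipped. The key names are
    assumed, not taken from a specification.
    """
    counts = {}
    for state in pg_query.get("recovery_state", []):
        intervals = state.get("past_intervals")
        if isinstance(intervals, list):
            counts[state.get("name", "<unnamed>")] = len(intervals)
    return counts


# Synthetic stand-in for parsed `ceph pg query` output (illustrative only).
sample = json.loads("""
{
  "recovery_state": [
    {"name": "Started/Primary/Peering",
     "past_intervals": [{"first": 1, "last": 2},
                        {"first": 3, "last": 4},
                        {"first": 5, "last": 9}]},
    {"name": "Started"}
  ]
}
""")

counts = past_interval_counts(sample)
print(counts)
```

Running this over the saved query output of every slow-to-peer PG
would give a distribution of interval counts rather than the handful
of spot checks described above.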

Unused memory:
=============

One thing I note is that I still sometimes see OSDs with large
fractions of their memory allocation sitting on the tcmalloc freelist,
e.g.:

> osd.0 tcmalloc heap stats:------------------------------------------------
> MALLOC:     2226810584 ( 2123.7 MiB) Bytes in use by application
> MALLOC: +   1421361152 ( 1355.5 MiB) Bytes in page heap freelist
> MALLOC: +     41864920 (   39.9 MiB) Bytes in central cache freelist
> MALLOC: +      5215680 (    5.0 MiB) Bytes in transfer cache freelist
> MALLOC: +     18508944 (   17.7 MiB) Bytes in thread cache freelists
> MALLOC: +     16216216 (   15.5 MiB) Bytes in malloc metadata
> MALLOC:   ------------
> MALLOC: =   3729977496 ( 3557.2 MiB) Actual memory used (physical + swap)
> MALLOC: +     32792576 (   31.3 MiB) Bytes released to OS (aka unmapped)
> MALLOC:   ------------
> MALLOC: =   3762770072 ( 3588.5 MiB) Virtual address space used
> MALLOC:
> MALLOC:         144565              Spans in use
> MALLOC:            225              Thread heaps in use
> MALLOC:           8192              Tcmalloc page size
> ------------------------------------------------

This is despite having:

  TCMALLOC_RELEASE_RATE=10

… set in the environment of each OSD process. This doesn't help with
contention for RAM between processes!

(I have mentioned this before, though hadn't at that time yet tried
running OSDs with TCMALLOC_RELEASE_RATE. See also:

  http://www.spinics.net/lists/ceph-devel/msg18769.html

… for history.

Note for anyone intending to reproduce this experiment: Upstart
overrides should be written to a file named
/etc/init/ceph-{osd,mon}.override, not ceph-{osd,mon}.conf.override as
I incorrectly specified previously.)
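
Returning to the osd.0 stats quoted above: the share of memory held on
tcmalloc freelists can be worked out mechanically. The following
Python sketch parses the "MALLOC:" lines of such a dump and sums the
freelist entries; the regular expression and function names are
illustrative assumptions based only on the layout shown in this
message, not a description of any official parser.

```python
import re

# Matches lines such as:
#   MALLOC: +   1421361152 ( 1355.5 MiB) Bytes in page heap freelist
STAT_LINE = re.compile(r"MALLOC:\s*[+=]?\s*(\d+)\s*\(\s*[\d.]+ MiB\)\s*(.+)")


def parse_tcmalloc_stats(text):
    """Return {description: bytes} for every sized MALLOC line."""
    stats = {}
    for line in text.splitlines():
        match = STAT_LINE.search(line)
        if match:
            stats[match.group(2).strip()] = int(match.group(1))
    return stats


def freelist_fraction(stats):
    """Fraction of 'Actual memory used' sitting on any tcmalloc freelist."""
    free = sum(v for k, v in stats.items() if "freelist" in k)
    return free / stats["Actual memory used (physical + swap)"]


# The osd.0 figures from this message.
dump = """
MALLOC:     2226810584 ( 2123.7 MiB) Bytes in use by application
MALLOC: +   1421361152 ( 1355.5 MiB) Bytes in page heap freelist
MALLOC: +     41864920 (   39.9 MiB) Bytes in central cache freelist
MALLOC: +      5215680 (    5.0 MiB) Bytes in transfer cache freelist
MALLOC: +     18508944 (   17.7 MiB) Bytes in thread cache freelists
MALLOC: +     16216216 (   15.5 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =   3729977496 ( 3557.2 MiB) Actual memory used (physical + swap)
"""

stats = parse_tcmalloc_stats(dump)
print(f"held on freelists: {freelist_fraction(stats):.1%}")
```

For this dump, a little under 40% of the memory the process is holding
is sitting idle on freelists, almost all of it in the page heap.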

Leak detection:
==============

Not yet being familiar with the data structures or algorithms that
govern PG recovery, it's not clear to me whether this memory usage is
expected for a 120-OSD cluster with 2048 PGs, or whether there might
be some variety of leak (or inefficient memory-use pattern). It
doesn't help that I'm not a C++ hacker. :-)

Reading around the subject, I came across `LeakSanitizer`, a
Clang/LLVM facility:

  https://code.google.com/p/address-sanitizer/wiki/LeakSanitizer

… as well as ticket #9756, which suggests using Clang's static
analysis capabilities to help flag potentially problematic code:

  http://tracker.ceph.com/issues/9756

I might spend some time this weekend to see if I can help advance that
ticket.

(I note that http://ceph.com/gitbuilders.cgi now returns 404; perhaps
that has been superseded by some Red Hat-internal facility?)

Cheers,
David
-- 
David McBride
Unix Specialist, University Information Services