* Bounding OSD memory requirements during peering/recovery
From: David McBride @ 2015-02-08 16:05 UTC
To: Ceph-devel

Hello,

I'm trying to understand the memory requirements for a Ceph node,
particularly when it is undergoing recovery.

Comments, suggestions, pointers are all welcome.

(This is my second attempt at sending this email; it appeared to get
eaten the first time -- probably because it had a 1MB .heap file
attached.)


Background:
==========

I've got a fairly tortured prototype Ceph cluster. It was left
unattended for several months, as I was needed elsewhere -- but now
I'm returning to it, with an eye to continuing to build production
services on it if I have sufficient confidence in its capabilities.

In the intervening time, several root filesystems on cluster nodes went
full (because of poorly configured logging, as well as MONs being
co-located with OSDs for expediency) and several drives were also
unceremoniously pulled out for reuse elsewhere.

A subsequent recovery is proving problematic: if all OSDs are started
concurrently, they substantially exceed the amount of RAM available on
the hosts during peering, and are killed off by the kernel OOM killer.

(They are then restarted by Upstart, resulting in thrashing for a
while, up until something unknown goes awry and the machine stops
sending telemetry and no longer responds to SSH. That's a separate
problem.)

Looking at tcmalloc-accounted heap statistics, I've seen individual OSDs
using 9GB+ of RAM; looking at RSS sizes of individual machines, I've
seen process-images exceeding 16GB. On 12-disk machines with 32GB of
RAM each, this is problematic.

So, I've started looking at the data-structures and algorithms that
govern OSD recovery. I've found the following references:

  http://ceph.com/docs/master/dev/placement-group/
  http://ceph.com/docs/master/dev/peering/
  http://ceph.com/docs/master/rados/operations/monitoring-osd-pg/
  http://ceph.com/docs/master/dev/osd_internals/map_message_handling/
  http://dachary.org/?p=2061

… and hope to develop an understanding of an upper bound on memory
utilization that an efficient implementation of the algorithms described
would require.

I've also been trying to collect memory profiles for OSD processes as
they're operating, to compare theory with reality.


Memory profiling:
================

For example, having found an OSD using ~6GB of memory, I turned on heap
profiling, and dumped its state using `ceph tell osd.N heap
start_profiler; ceph tell osd.N heap dump`:

> ------------------------------------------------
> MALLOC:     6167528240 ( 5881.8 MiB) Bytes in use by application
> MALLOC: +     18309120 (   17.5 MiB) Bytes in page heap freelist
> MALLOC: +     39689152 (   37.9 MiB) Bytes in central cache freelist
> MALLOC: +      4750960 (    4.5 MiB) Bytes in transfer cache freelist
> MALLOC: +     25223840 (   24.1 MiB) Bytes in thread cache freelists
> MALLOC: +     27603096 (   26.3 MiB) Bytes in malloc metadata
> MALLOC:   ------------
> MALLOC: =   6283104408 ( 5992.0 MiB) Actual memory used (physical + swap)
> MALLOC: +      2080768 (    2.0 MiB) Bytes released to OS (aka unmapped)
> MALLOC:   ------------
> MALLOC: =   6285185176 ( 5994.0 MiB) Virtual address space used
> MALLOC:
> MALLOC:         374907              Spans in use
> MALLOC:            335              Thread heaps in use
> MALLOC:           8192              Tcmalloc page size
> ------------------------------------------------

However, the heap dumps so generated only appear to show memory
allocations (made? touched?) since heap profiling was enabled:

> google-pprof --text /usr/bin/ceph-osd osd.25.profile.0001.heap
> Using local file /usr/bin/ceph-osd.
> Using local file osd.25.profile.0001.heap.
> Total: 0.0 MB
>      0.0  46.7%  46.7%      0.0  59.0% SimpleMessenger::add_accept_pipe
> [...]
Note the "Total: 0.0MB", which differs wildly from the stats reported by
tcmalloc, and the RSS of the process reported by the kernel.

So, for testing purposes, I selectively started up ~20% of the OSDs,
each invoked with the setting

  CEPH_HEAP_PROFILER_INIT=1

… defined in their environment to cause the heap profiler to be
started at OSD start-time. This has a significant CPU and memory
overhead.

Also set were the cluster flags:

  noout,nobackfill,norecover,noscrub,nodeep-scrub

… to avoid commingling memory requirements due to peering with other
factors.

I've produced a number of .heap files which show >= 1000MB of memory
allocated in an RB tree as a result of
PG::RecoveryState::RecoveryMachine::send_notify, PG::read_info and
MOSDPGNotify::decode_payload (or descendants).

An example heapfile from a fairly typical OSD can currently be fetched
from:

  http://people.ds.cam.ac.uk/dwm37/tmp/osd.0.profile.0124.heap

This was produced by the binaries from the Ceph 'trusty' repository;
`ceph -v` returns:

> ceph version 0.92 (00a3ac3b67d93860e7f0b6e07319f11b14d0fec0)

Running pprof in interactive mode and running `top30 --cum` on this
heapfile reports:

> Total: 2172.3 MB
>   1705.9  78.5%  78.5%   1748.4  80.5% __gnu_cxx::new_allocator::construct (inline)
>      0.0   0.0%  78.5%   1600.7  73.7% std::_Rb_tree::_M_create_node (inline)
>      0.0   0.0%  78.5%   1367.9  63.0% start_thread
>      0.0   0.0%  78.5%   1367.6  63.0% ioperm
>      0.0   0.0%  78.5%    963.4  44.4% ThreadPool::worker
>      0.0   0.0%  78.5%    963.3  44.3% ThreadPool::WorkThread::entry
>      0.0   0.0%  78.5%    951.0  43.8% OSD::process_peering_events
>      0.0   0.0%  78.5%    950.9  43.8% OSD::PeeringWQ::_process
>      0.0   0.0%  78.5%    949.8  43.7% PG::RecoveryState::handle_event (inline)
>      0.0   0.0%  78.5%    949.8  43.7% boost::statechart::detail::send_function::operator (inline)
>      0.0   0.0%  78.5%    949.8  43.7% boost::statechart::simple_state::react_impl
>      0.0   0.0%  78.5%    949.8  43.7% boost::statechart::state_machine::process_event (inline)
>      0.0   0.0%  78.5%    949.8  43.7% boost::statechart::state_machine::send_event
>      0.0   0.0%  78.5%    949.8  43.7% local_react (inline)
>      0.0   0.0%  78.5%    949.8  43.7% local_react_impl (inline)
>      0.0   0.0%  78.5%    949.8  43.7% operator (inline)
>      0.0   0.0%  78.5%    949.8  43.7% react (inline)
>      0.0   0.0%  78.5%    948.5  43.7% std::vector::push_back (inline)
>      0.0   0.0%  78.5%    948.3  43.7% PG::RecoveryState::RecoveryMachine::send_notify
>      0.0   0.0%  78.5%    947.1  43.6% std::vector::_M_insert_aux
>      0.0   0.0%  78.5%    947.0  43.6% _Rb_tree (inline)
>      0.0   0.0%  78.5%    947.0  43.6% map (inline)
>      0.0   0.0%  78.5%    947.0  43.6% std::_Rb_tree::_M_clone_node (inline)
>      0.0   0.0%  78.5%    947.0  43.6% std::_Rb_tree::_M_copy
>      0.0   0.0%  78.5%    809.8  37.3% construct (inline)
>      0.0   0.0%  78.5%    808.4  37.2% std::pair::pair
>      0.0   0.0%  78.5%    804.2  37.0% __libc_start_main
>      0.0   0.0%  78.5%    804.2  37.0% _start
>      0.0   0.0%  78.5%    804.2  37.0% main
>      0.0   0.0%  78.5%    803.6  37.0% OSD::init

This appears to show a large amount of memory -- nearly a gigabyte --
allocated by boost::statechart, which is slightly surprising as the FAQ
for boost::statechart quotes a ~1KB memory footprint per state-machine:

  http://www.boost.org/doc/libs/1_35_0/libs/statechart/doc/faq.html#EmbeddedApplications

Perhaps something unexpected is happening here? I'm almost hoping that
statechart is being subtly misused or misconfigured in some way that,
if fixed, would result in a significant drop in memory utilization…!


Quantifying problem-size:
========================

Given that it appears to be the log-merging stage of PG recovery that
is expensive, I queried the statistics of those PGs which seemed to be
taking a long time to peer, via `ceph pg <pgid> query`.

These showed that, for at least a handful of those PGs, the
recovery_state past_intervals list contained on the order of 200-300
entries.

(I have no feel as to whether this is excessive.)
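As a rough illustration of the tally described above, here is a minimal
Python sketch that counts past_intervals entries in saved
`ceph pg <pgid> query` output. It assumes the JSON carries a top-level
`recovery_state` list whose entries may hold a `past_intervals` list; that
layout is an assumption and may differ between Ceph versions. The sample
document and the function name `count_past_intervals` are made up for
illustration.

```python
import json


def count_past_intervals(pg_query_json):
    """Return the longest past_intervals list found under recovery_state."""
    doc = json.loads(pg_query_json)
    longest = 0
    for state in doc.get("recovery_state", []):
        # Not every recovery state carries past_intervals; treat absence as empty.
        intervals = state.get("past_intervals", [])
        longest = max(longest, len(intervals))
    return longest


# Hypothetical, heavily trimmed stand-in for real `ceph pg <pgid> query` output.
sample = json.dumps({
    "state": "peering",
    "recovery_state": [
        {"name": "Started/Primary/Peering",
         "past_intervals": [{"first": i, "last": i} for i in range(250)]},
        {"name": "Started"},
    ],
})

print(count_past_intervals(sample))
```

Running this over the query output of each slow-to-peer PG would give a
quick view of which PGs carry the longest interval histories.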


Unused memory:
=============

One thing I note is that I still sometimes see OSDs with large fractions
of their memory allocation sitting on the tcmalloc freelist, e.g.:

> osd.0 tcmalloc heap stats:------------------------------------------------
> MALLOC:     2226810584 ( 2123.7 MiB) Bytes in use by application
> MALLOC: +   1421361152 ( 1355.5 MiB) Bytes in page heap freelist
> MALLOC: +     41864920 (   39.9 MiB) Bytes in central cache freelist
> MALLOC: +      5215680 (    5.0 MiB) Bytes in transfer cache freelist
> MALLOC: +     18508944 (   17.7 MiB) Bytes in thread cache freelists
> MALLOC: +     16216216 (   15.5 MiB) Bytes in malloc metadata
> MALLOC:   ------------
> MALLOC: =   3729977496 ( 3557.2 MiB) Actual memory used (physical + swap)
> MALLOC: +     32792576 (   31.3 MiB) Bytes released to OS (aka unmapped)
> MALLOC:   ------------
> MALLOC: =   3762770072 ( 3588.5 MiB) Virtual address space used
> MALLOC:
> MALLOC:         144565              Spans in use
> MALLOC:            225              Thread heaps in use
> MALLOC:           8192              Tcmalloc page size
> ------------------------------------------------

This is despite having:

  TCMALLOC_RELEASE_RATE=10

… set in the environment of each OSD process. This doesn't help with
contention for RAM between processes!

(I have mentioned this before, though hadn't at that time yet tried
running OSDs with TCMALLOC_RELEASE_RATE. See also:

  http://www.spinics.net/lists/ceph-devel/msg18769.html

… for history.

Note for anyone intending to reproduce this experiment: Upstart
overrides should be written to a file named
/etc/init/ceph-{osd,mon}.override, not ceph-{osd,mon}.conf.override as I
incorrectly specified previously.)


Leak detection:
==============

Not yet being familiar with the data-structures or algorithms that
govern PG recovery, it's not clear to me whether this is memory usage
that is expected or not for a 120-OSD cluster with 2048 PGs -- or
whether there might be some variety of leak (or inefficient memory-use
pattern).

It doesn't help that I'm not a C++ hacker. :-)

Reading around the subject, I came across `LeakSanitizer`, a Clang/LLVM
facility:

  https://code.google.com/p/address-sanitizer/wiki/LeakSanitizer

… as well as ticket #9756, which suggests using Clang's other static
analysis capabilities to help flag potentially problematic code:

  http://tracker.ceph.com/issues/9756

I might spend some time this weekend to see if I can help advance that
ticket.

(I note that http://ceph.com/gitbuilders.cgi now returns 404; perhaps
that has been superseded by some RedHat-internal facility?)

Cheers,
David
-- 
David McBride <dwm37@cam.ac.uk>
Unix Specialist, University Information Services
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
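To put a number on the "unused memory" observation in the message above,
here is a small Python sketch that parses tcmalloc heap-stats text and
reports what fraction of the "Actual memory used" figure is sitting on
freelists. The figures in SAMPLE are copied from the osd.0 dump quoted in
that message; the function names and the regular expression are
illustrative assumptions, not part of any Ceph or tcmalloc tooling.

```python
import re

# Matches lines such as:
#   "MALLOC: +   1421361152 ( 1355.5 MiB) Bytes in page heap freelist"
LINE_RE = re.compile(r"MALLOC:\s*[+=]?\s*(\d+)\s*\(\s*[\d.]+ MiB\)\s*(.+)")

SAMPLE = """\
MALLOC:     2226810584 ( 2123.7 MiB) Bytes in use by application
MALLOC: +   1421361152 ( 1355.5 MiB) Bytes in page heap freelist
MALLOC: +     41864920 (   39.9 MiB) Bytes in central cache freelist
MALLOC: +      5215680 (    5.0 MiB) Bytes in transfer cache freelist
MALLOC: +     18508944 (   17.7 MiB) Bytes in thread cache freelists
MALLOC: +     16216216 (   15.5 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =   3729977496 ( 3557.2 MiB) Actual memory used (physical + swap)
"""


def parse_heap_stats(text):
    """Map each stats label to its byte count; non-matching lines are skipped."""
    stats = {}
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match:
            stats[match.group(2).strip()] = int(match.group(1))
    return stats


def freelist_fraction(stats):
    """Fraction of 'Actual memory used' held on any tcmalloc freelist."""
    free = sum(v for k, v in stats.items() if "freelist" in k)
    return free / stats["Actual memory used (physical + swap)"]


stats = parse_heap_stats(SAMPLE)
print("%.1f%% of used memory is on freelists" % (100 * freelist_fraction(stats)))
```

For the osd.0 dump this comes out at a little under 40%, almost all of it
in the page heap freelist.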

* Re: Bounding OSD memory requirements during peering/recovery
From: David McBride @ 2015-02-08 20:05 UTC
To: Ceph-devel

On 08/02/15 16:05, David McBride wrote:

> Reading around the subject, I came across `LeakSanitizer`, a Clang/LLVM
> facility:
>
>   https://code.google.com/p/address-sanitizer/wiki/LeakSanitizer
>
> … as well as ticket #9756, which suggests using Clang's other static
> analysis capabilities to help flag potentially problematic code:
>
>   http://tracker.ceph.com/issues/9756

I've gone ahead and implemented this. I've submitted a pull-request via
GitHub, visible here:

  https://github.com/ceph/autobuild-ceph/pull/22

I've not tried to replicate the gitbuilder environment directly, so
these changes are untested, though they should work -- at least, once
someone's added 'clang' to the list of packages to be autoprovisioned!

Cheers,
David
-- 
David McBride <dwm37@cam.ac.uk>
Unix Specialist, University Information Services

* Re: Bounding OSD memory requirements during peering/recovery
From: David McBride @ 2015-02-09 10:38 UTC
To: Ceph-devel

On 08/02/15 20:05, David McBride wrote:

>   https://github.com/ceph/autobuild-ceph/pull/22
>
> I've not tried to replicate the gitbuilder environment directly, so
> these changes are untested, though should work -- at least, once
> someone's added 'clang' to the list of packages to be autoprovisioned!

I've now updated this pull request; now also implemented:

 * Updates to fabfile.py to cause clang (and clang-analyzer on RPM
   machines) to be installed prior to builds.

 * Added the '-analyze' hostname affix, which causes Ceph to be built
   with the 'scan-build' static-analysis wrapper. As a side-effect of
   compilation, a static-analysis of Ceph's code will also be run; the
   resulting report will be deposited in scan-build.tmp/.

 * Tweaked the environment of clang builds so that it shouldn't
   generate spurious errors when being run with versions of
   ccache < 3.2.

Cheers,
David
-- 
David McBride <dwm37@cam.ac.uk>
Unix Specialist, University Information Services

* Re: Bounding OSD memory requirements during peering/recovery
From: Gregory Farnum @ 2015-02-09 15:31 UTC
To: David McBride; +Cc: Ceph-devel

Right. So, memory usage of an OSD is usually linear in the number of PGs
it hosts. However, that memory can also grow based on at least one other
thing: the number of OSD Maps required to go through peering. It
*looks* to me like this is what you're running in to, not growth on the
number of state machines. In particular, those past_intervals you
mentioned. ;)

Anyway, I'm afraid I don't have any magic cure-all for you. This kind of
long-term dirtied Ceph cluster is something I've only seen once or twice
and I've never led a recovery on them. But the effort usually involves,
as you've done, limiting the number of OSDs per host that are doing
recovery at once (which probably means starting one OSD at a time until
stability, rather than one per host!), disabling recovery (as you've
already done), ...and occasionally hacking up the map history. :/

Good luck!
-Greg

* Re: Bounding OSD memory requirements during peering/recovery
From: David McBride @ 2015-02-09 21:36 UTC
To: Gregory Farnum; +Cc: Ceph-devel

On 09/02/15 15:31, Gregory Farnum wrote:

> So, memory usage of an OSD is usually linear in the number of PGs it
> hosts. However, that memory can also grow based on at least one other
> thing: the number of OSD Maps required to go through peering. It
> *looks* to me like this is what you're running in to, not growth on
> the number of state machines. In particular, those past_intervals you
> mentioned. ;)

Hi Greg,

Right, that sounds entirely plausible, and is very helpful.

In practice, that means I'll need to be careful to avoid this situation
occurring in production -- but given that's unlikely to occur except in
the case of non-trivial neglect, I don't think I need be particularly
concerned.

(Happily, I'm in the situation that my existing cluster is purely for
testing purposes; the data is expendable.)

That said, for my own peace of mind, it would be valuable to have a
procedure that can be used to recover from this state, even if it's
unlikely to occur in practice.

I'm currently running an experiment where I augment the RAM of each OSD
node with 10GB swapfiles on each spinning OSD disk, so that there's a
big-enough backing-store to complete log reconstruction.

(You obviously wouldn't want to operate in this manner during normal
production operation -- the loss of a single drive would cause a hard
machine-crash, and the performance will be fairly diabolical,
particularly if you allow client workloads to carry on in the
background.)

I did try enabling zswap on the Utopic LTS kernel as supplied as an
option in Ubuntu 14.04; however, the kernel was not stable in such a
configuration and several machines crashed under memory pressure.

I do have OSDs committing suicide periodically, probably because they're
insufficiently responsive to heartbeats as they start to hit swap. This
is before experimenting with the various OSD tuning dials for timeouts,
so some improvement may be possible.

In the meantime, I've configured the ceph-osd Upstart jobs to apply a
post-exec command of `sleep 3600` to reduce the rate at which they're
respawned.

So far, the resulting configuration seems to be making progress, albeit
moderately slowly.

Cheers,
David
-- 
David McBride <dwm37@cam.ac.uk>
Unix Specialist, University Information Services

* Re: Bounding OSD memory requirements during peering/recovery
From: Sage Weil @ 2015-02-10 1:51 UTC
To: David McBride; +Cc: Gregory Farnum, Ceph-devel

On Mon, 9 Feb 2015, David McBride wrote:
> On 09/02/15 15:31, Gregory Farnum wrote:
>
> > So, memory usage of an OSD is usually linear in the number of PGs it
> > hosts. However, that memory can also grow based on at least one other
> > thing: the number of OSD Maps required to go through peering. It
> > *looks* to me like this is what you're running in to, not growth on
> > the number of state machines. In particular, those past_intervals you
> > mentioned. ;)
>
> Hi Greg,
>
> Right, that sounds entirely plausible, and is very helpful.
>
> In practice, that means I'll need to be careful to avoid this situation
> occurring in production -- but given that's unlikely to occur except in the
> case of non-trivial neglect, I don't think I need be particularly concerned.
>
> (Happily, I'm in the situation that my existing cluster is purely for testing
> purposes; the data is expendable.)
>
> That said, for my own peace of mind, it would be valuable to have a procedure
> that can be used to recover from this state, even if it's unlikely to occur in
> practice.

The best luck I've had recovering from situations like this is something
like:

- stop all osds
- osd set nodown
- osd set nobackfill
- osd set noup
- set map cache size smaller to reduce memory footprint.

	osd map cache size = 50
	osd map max advance = 25
	osd map share max epochs = 25
	osd pg epoch persisted max stale = 25

  (basically, keep most of those values in sync, and smaller than the
  map cache)

- start all osds, let them catch up on their maps.  (if they can't fit
  in memory at this point then another creative solution will be needed)
- unset noup so that everyone peers at once

It may also help to try to match the in/out state with where the data
actually resides (i.e. mark an osd back in if it was marked out but the
cluster didn't rebalance).

> I'm currently running an experiment where I augment the RAM of each OSD node
> with 10GB swapfiles on each spinning OSD disk, so that there's a big-enough
> backing-store to complete log reconstruction.

Swap tends to not work very well.. make sure nodown is set if you have
to go this route or else osds will get marked down when they miss
heartbeats...

sage
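The procedure in the message above can be written down as an ordered,
dry-run checklist. The Python sketch below only builds and prints the
steps and executes nothing. The `ceph osd set` / `ceph osd unset` flag
commands are the standard CLI forms of the "osd set ..." steps; how OSDs
are stopped and started depends on the init system, so those steps are
left as manual notes. The function name `recovery_plan` and its parameters
are hypothetical.

```python
def recovery_plan(map_cache_size=50, limit=25):
    """Return the recovery steps as (kind, text) tuples; nothing is executed."""
    # The advice is to keep the other values in sync and below the map cache.
    if limit >= map_cache_size:
        raise ValueError("limit must be smaller than the map cache size")
    conf_lines = [
        "osd map cache size = %d" % map_cache_size,
        "osd map max advance = %d" % limit,
        "osd map share max epochs = %d" % limit,
        "osd pg epoch persisted max stale = %d" % limit,
    ]
    return [
        ("manual", "stop all OSDs"),
        ("cli", "ceph osd set nodown"),
        ("cli", "ceph osd set nobackfill"),
        ("cli", "ceph osd set noup"),
        ("conf", "\n".join(conf_lines)),
        ("manual", "start all OSDs and let them catch up on their maps"),
        ("cli", "ceph osd unset noup"),
    ]


for kind, text in recovery_plan():
    # Indent continuation lines so the config fragment stays readable.
    print("[%s] %s" % (kind, text.replace("\n", "\n         ")))
```

Keeping the plan as data makes it easy to review before anything is
typed into a production cluster.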
* Re: Bounding OSD memory requirements during peering/recovery
  2015-02-10 1:51 ` Sage Weil
@ 2015-03-09 15:42 ` Dan van der Ster
  2015-03-09 15:47 ` Gregory Farnum
  0 siblings, 1 reply; 14+ messages in thread
From: Dan van der Ster @ 2015-03-09 15:42 UTC (permalink / raw)
  To: Sage Weil; +Cc: David McBride, Gregory Farnum, Ceph-devel

Hi Sage,

On Tue, Feb 10, 2015 at 2:51 AM, Sage Weil <sage@newdream.net> wrote:
> On Mon, 9 Feb 2015, David McBride wrote:
>> On 09/02/15 15:31, Gregory Farnum wrote:
>>
>> > So, memory usage of an OSD is usually linear in the number of PGs it
>> > hosts. However, that memory can also grow based on at least one other
>> > thing: the number of OSD Maps required to go through peering. It
>> > *looks* to me like this is what you're running in to, not growth on
>> > the number of state machines. In particular, those past_intervals you
>> > mentioned. ;)
>>
>> Hi Greg,
>>
>> Right, that sounds entirely plausible, and is very helpful.
>>
>> In practice, that means I'll need to be careful to avoid this situation
>> occurring in production -- but given that's unlikely to occur except in the
>> case of non-trivial neglect, I don't think I need be particularly concerned.
>>
>> (Happily, I'm in the situation that my existing cluster is purely for testing
>> purposes; the data is expendable.)
>>
>> That said, for my own peace of mind, it would be valuable to have a procedure
>> that can be used to recover from this state, even if it's unlikely to occur in
>> practice.
>
> The best luck I've had recovering from situations is something like:
>
> - stop all osds
> - osd set nodown
> - osd set nobackfill
> - osd set noup
> - set map cache size smaller to reduce memory footprint.
>
>   osd map cache size = 50
>   osd map max advance = 25
>   osd map share max epochs = 25
>   osd pg epoch persisted max stale = 25
>

The settings above have proven very useful when setting up some of our
new OSD servers, which have little memory per OSD: 64GB RAM for 48x4TB
OSDs.

Before applying these settings (plus one more, below) we saw memory
usage of around 2-3GB per OSD when they were freshly created. After a
restart the processes stayed under 300-400MB. It seems that the initial
bootstrapping -- fetching the most recent 500 osdmaps in batches of 100
at a time -- causes the osd map cache to exceed its 50-entry limit, and
that memory is then never freed. To fix this we also had to lower the
"osd map message max" setting on the mons; with that change, OSD memory
stays under 500MB per process.

Currently we're happily running a large [1] number of OSDs with the
following configuration:

[global]
  osd map message max = 10

[osd]
  osd map cache size = 20
  osd map max advance = 10
  osd map share max epochs = 10
  osd pg epoch persisted max stale = 10

and the memory consumption is 400-500MB per process, even during
backfilling.

So far we haven't seen any drawbacks to this configuration. Should we
expect any problems if we continue with this small osdmap cache
permanently?

Best Regards, Dan

[1] "large" in this case means the osdmap is 4.6MB in size
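The numbers in this message can be checked with some back-of-the-envelope arithmetic. The sketch below assumes, as a simplification, that every cached epoch costs one full osdmap of the quoted 4.6MB; it ignores incremental maps, any sharing between epochs, and all other OSD memory. The function name `osdmap_cache_mb` is hypothetical.

```python
MAP_SIZE_MB = 4.6  # osdmap size quoted in the message above


def osdmap_cache_mb(cached_epochs, map_size_mb=MAP_SIZE_MB):
    """Rough upper bound: each cached epoch holds one full map."""
    return cached_epochs * map_size_mb


if __name__ == "__main__":
    scenarios = [
        ("bootstrap burst, 500 maps held", 500),
        ("osd map cache size = 50", 50),
        ("osd map cache size = 20", 20),
    ]
    for label, epochs in scenarios:
        print(f"{label}: ~{osdmap_cache_mb(epochs):.0f} MB")
```

Under that assumption, 500 maps come to roughly 2.3GB, which is in the same range as the 2-3GB per OSD reported for freshly created OSDs, while a 20-entry cache is bounded at under 100MB of map data.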
* Re: Bounding OSD memory requirements during peering/recovery
  2015-03-09 15:42 ` Dan van der Ster
@ 2015-03-09 15:47 ` Gregory Farnum
  2015-03-13 11:24 ` Dan van der Ster
  0 siblings, 1 reply; 14+ messages in thread
From: Gregory Farnum @ 2015-03-09 15:47 UTC (permalink / raw)
  To: Dan van der Ster; +Cc: Sage Weil, David McBride, Ceph-devel

On Mon, Mar 9, 2015 at 8:42 AM, Dan van der Ster <dan@vanderster.com> wrote:
> Hi Sage,
>
> On Tue, Feb 10, 2015 at 2:51 AM, Sage Weil <sage@newdream.net> wrote:
>> The best luck I've had recovering from situations is something like:
>>
>> - stop all osds
>> - osd set nodown
>> - osd set nobackfill
>> - osd set noup
>> - set map cache size smaller to reduce memory footprint.
>>
>>   osd map cache size = 50
>>   osd map max advance = 25
>>   osd map share max epochs = 25
>>   osd pg epoch persisted max stale = 25

It can cause extreme slowness if you get into a failure situation and
your OSDs need to calculate past intervals across more maps than will
fit in the cache. :(
That said, this might be a good idea as long as you're conscious of
needing to set it back if you get into trouble later on.
-Greg
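The slowdown described here is the usual behaviour of a small LRU cache under a repeated sequential scan. The toy model below is only an illustration and not Ceph's actual map cache: the class `LRUCache` and the function `scan` are hypothetical, and each "pass" stands in for one PG walking the same range of epochs.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny LRU cache that only counts hits and misses."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, epoch):
        if epoch in self.entries:
            self.entries.move_to_end(epoch)  # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1  # stand-in for re-reading the map from disk
            self.entries[epoch] = True
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict least recently used


def scan(cache_size, epochs, passes):
    """Walk epochs 0..epochs-1 in order, `passes` times; return (hits, misses)."""
    cache = LRUCache(cache_size)
    for _ in range(passes):
        for epoch in range(epochs):
            cache.get(epoch)
    return cache.hits, cache.misses


if __name__ == "__main__":
    print("cache 20, 100 epochs, 10 passes:", scan(20, 100, 10))
    print("cache 200, 100 epochs, 10 passes:", scan(200, 100, 10))
```

In this model a cache smaller than the scanned range never hits, because each epoch is evicted before the next pass reaches it, whereas a cache at least as large as the range misses only on the first pass.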
* Re: Bounding OSD memory requirements during peering/recovery
  2015-03-09 15:47 ` Gregory Farnum
@ 2015-03-13 11:24 ` Dan van der Ster
  [not found] ` <f943965c-b279-4e5f-ac47-1dc6443e594d@email.android.com>
  0 siblings, 1 reply; 14+ messages in thread
From: Dan van der Ster @ 2015-03-13 11:24 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Sage Weil, David McBride, Ceph-devel

On Mon, Mar 9, 2015 at 4:47 PM, Gregory Farnum <greg@gregs42.com> wrote:
> It can cause extreme slowness if you get into a failure situation and
> your OSDs need to calculate past intervals across more maps than will
> fit in the cache. :(

... extreme slowness, or is it also possible to get into a situation
where the PGs are stuck incomplete forever?

I ask because we actually had a network issue this morning that left
OSDs flapping and caused a lot of osdmap epoch churn. Now our network
has stabilized but 10 PGs are incomplete, even though all the OSDs are
up. One PG looks like this, for example:

pg 75.45 is stuck inactive for 87351.077529, current state incomplete, last acting [6689,1919,2329]
pg 75.45 is stuck unclean for 87351.096198, current state incomplete, last acting [6689,1919,2329]
pg 75.45 is incomplete, acting [6689,1919,2329]

1919 3.62000 osd.1919 up 1.00000 1.00000
2329 3.62000 osd.2329 up 1.00000 1.00000
6689 3.62000 osd.6689 up 1.00000 1.00000

The pg query output is here: http://pastebin.com/WyTAU69W

Is that a result of these short map caches, or could it be something
else? (we're running 0.93-76-gc35f422)
WWGD (what would Greg do?) to activate these PGs?

Thanks! Dan
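When many PGs are stuck, it can help to tally them from the health output. The sketch below parses "is stuck" lines of the form quoted above; the regular expression and the function name `parse_stuck_line` are assumptions based only on those sample lines, not on any documented output format, and each report is expected on a single line.

```python
import re

# Pattern derived from the sample lines in the message above (an assumption).
STUCK_RE = re.compile(
    r"pg (?P<pgid>\S+) is stuck (?P<kind>\w+) for (?P<seconds>[\d.]+), "
    r"current state (?P<state>[^,\s]+), last acting \[(?P<acting>[\d,]+)\]"
)


def parse_stuck_line(line):
    """Return a dict describing one 'is stuck' line, or None if it does not match."""
    match = STUCK_RE.search(line)
    if match is None:
        return None
    return {
        "pgid": match.group("pgid"),
        "kind": match.group("kind"),
        "seconds": float(match.group("seconds")),
        "state": match.group("state"),
        "acting": [int(osd) for osd in match.group("acting").split(",")],
    }


if __name__ == "__main__":
    sample = ("pg 75.45 is stuck inactive for 87351.077529, "
              "current state incomplete, last acting [6689,1919,2329]")
    print(parse_stuck_line(sample))
```

The first entry of the acting list is the primary, which is the OSD id used in the pg-temp suggestion later in the thread.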
[parent not found: <f943965c-b279-4e5f-ac47-1dc6443e594d@email.android.com>]
* Re: Bounding OSD memory requirements during peering/recovery
  [not found] ` <f943965c-b279-4e5f-ac47-1dc6443e594d@email.android.com>
@ 2015-03-13 12:52 ` Dan van der Ster
  2015-03-13 15:36 ` Dan van der Ster
  0 siblings, 1 reply; 14+ messages in thread
From: Dan van der Ster @ 2015-03-13 12:52 UTC (permalink / raw)
  To: Sage Weil; +Cc: Gregory Farnum, David McBride, Ceph-devel

Hi Sage,

Losing a message would have been plausible given the network issue we
had today.

I tried:

# ceph osd pg-temp 75.45 6689
set 75.45 pg_temp mapping to [6689]

then waited a bit. It's still incomplete -- the only difference is that
now I see two more past_intervals in the pg. Full query here:
http://pastebin.com/TU7vVLpj

I didn't have debug_osd above zero when I did that. Should I try again
with debug_osd 20?

Thanks :)

Dan

On Fri, Mar 13, 2015 at 12:59 PM, Sage Weil <sage@newdream.net> wrote:
> This looks a bit like the osds may have lost a message, actually. You can
> kick an individual pg to repeer with something like
>
> ceph osd pg-temp 75.45 6689
>
> See if that makes it go?
>
> sage
* Re: Bounding OSD memory requirements during peering/recovery
  2015-03-13 12:52 ` Dan van der Ster
@ 2015-03-13 15:36 ` Dan van der Ster
  2015-03-13 20:42 ` Samuel Just
  0 siblings, 1 reply; 14+ messages in thread
From: Dan van der Ster @ 2015-03-13 15:36 UTC (permalink / raw)
  To: Sage Weil; +Cc: Gregory Farnum, David McBride, Ceph-devel

On Fri, Mar 13, 2015 at 1:52 PM, Dan van der Ster <dan@vanderster.com> wrote:
> I didn't have debug_osd above zero when I did that. Should I try again
> with debug_osd 20?

I tried again with logging. The pg goes like this:

incomplete -> inactive -> remapped -> remapped+peering -> remapped ->
inactive -> peering -> incomplete

The killer seems to be:

2015-03-13 16:15:43.476925 7f3c2e055700 10 osd.6689 pg_epoch: 67050
pg[75.45( v 66245'4028 (49044'1025,66245'4028] local-les=61515 n=3994
ec=48759 les/c 66791/66791 67037/67050/67037) [6689,1919,2329]/[6689]
r=0 lpr=67050 pi=66787-67049/13 crt=66226'4026 lcod 0'0 mlcod 0'0
remapped+peering] choose_acting no suitable info found (incomplete
backfills?), reverting to up

Full log is here: http://pastebin.com/hZUBD9NT

Do you have an idea what went wrong here? BTW, our firefly "prod"
cluster suffered from the same network problem today, but all of that
cluster's PGs recovered nicely.
Does the hammer RC have different peering logic that might apply here?

Thanks! Dan
* Re: Bounding OSD memory requirements during peering/recovery
  2015-03-13 15:36 ` Dan van der Ster
@ 2015-03-13 20:42 ` Samuel Just
  2015-03-13 20:53 ` Samuel Just
  0 siblings, 1 reply; 14+ messages in thread
From: Samuel Just @ 2015-03-13 20:42 UTC (permalink / raw)
  To: Dan van der Ster, Sage Weil; +Cc: Gregory Farnum, David McBride, Ceph-devel

I've opened a bug for this (http://tracker.ceph.com/issues/11110). I bet
it's related to the new logic for allowing recovery below min_size.
Exactly what sha1 was running on the osds during this time period?
-Sam
* Re: Bounding OSD memory requirements during peering/recovery
  2015-03-13 20:42 ` Samuel Just
@ 2015-03-13 20:53 ` Samuel Just
  2015-03-13 21:24 ` Dan van der Ster
  0 siblings, 1 reply; 14+ messages in thread
From: Samuel Just @ 2015-03-13 20:53 UTC (permalink / raw)
  To: Dan van der Ster, Sage Weil; +Cc: Gregory Farnum, David McBride, Ceph-devel

Also, are you certain that all were running the same version?
-Sam

On 03/13/2015 01:42 PM, Samuel Just wrote:
> I've opened a bug for this (http://tracker.ceph.com/issues/11110), I
> bet it's related to the new logic for allowing recovery below
> min_size. Exactly what sha1 was running on the osds during this time
> period?
> -Sam
* Re: Bounding OSD memory requirements during peering/recovery
  2015-03-13 20:53 ` Samuel Just
@ 2015-03-13 21:24 ` Dan van der Ster
  0 siblings, 0 replies; 14+ messages in thread
From: Dan van der Ster @ 2015-03-13 21:24 UTC
To: Samuel Just; +Cc: Sage Weil, Gregory Farnum, David McBride, Ceph-devel

Yup, all running 0.93-76-gc35f422 (from gitbuilder just after Sage
merged the latest straw2 fix...).
I just uploaded the ceph.log to help understand the issue. Let me know
if I can help further :)

Thanks! Dan

On Fri, Mar 13, 2015 at 9:53 PM, Samuel Just <sjust@redhat.com> wrote:
> Also, are you certain that all were running the same version?
> -Sam
>
> On 03/13/2015 01:42 PM, Samuel Just wrote:
>>
>> I've opened a bug for this (http://tracker.ceph.com/issues/11110), I bet
>> it's related to the new logic for allowing recovery below min_size. Exactly
>> what sha1 was running on the osds during this time period?
>> -Sam
>>
>> On 03/13/2015 08:36 AM, Dan van der Ster wrote:
>>>
>>> On Fri, Mar 13, 2015 at 1:52 PM, Dan van der Ster <dan@vanderster.com> wrote:
>>>>
>>>> Hi Sage,
>>>>
>>>> Losing a message would have been plausible given the network issue we
>>>> had today.
>>>>
>>>> I tried:
>>>>
>>>> # ceph osd pg-temp 75.45 6689
>>>> set 75.45 pg_temp mapping to [6689]
>>>>
>>>> then waited a bit. It's still incomplete -- the only difference is now
>>>> I see two more past_intervals in the pg. Full query here:
>>>> http://pastebin.com/TU7vVLpj
>>>>
>>>> I didn't have debug_osd above zero when I did that. Should I try again
>>>> with debug_osd 20?
>>>
>>> I tried again with logging. The pg goes like this:
>>>
>>> incomplete -> inactive -> remapped -> remapped+peering -> remapped ->
>>> inactive -> peering -> incomplete
>>>
>>> The killer seems to be:
>>>
>>> 2015-03-13 16:15:43.476925 7f3c2e055700 10 osd.6689 pg_epoch: 67050
>>> pg[75.45( v 66245'4028 (49044'1025,66245'4028] local-les=61515 n=3994
>>> ec=48759 les/c 66791/66791 67037/67050/67037) [6689,1919,2329]/[6689]
>>> r=0 lpr=67050 pi=66787-67049/13 crt=66226'4026 lcod 0'0 mlcod 0'0
>>> remapped+peering] choose_acting no suitable info found (incomplete
>>> backfills?), reverting to up
>>>
>>> Full log is here: http://pastebin.com/hZUBD9NT
>>>
>>> Do you have an idea what went wrong here? BTW, our firefly "prod"
>>> cluster suffered from the same network problem today, but all of that
>>> cluster's PGs recovered nicely.
>>> Does the hammer RC have different peering logic that might apply here?
>>>
>>> Thanks! Dan
>>>
>>>> Thanks :)
>>>>
>>>> Dan
>>>>
>>>> On Fri, Mar 13, 2015 at 12:59 PM, Sage Weil <sage@newdream.net> wrote:
>>>>>
>>>>> This looks a bit like the osds may have lost a message, actually. You can
>>>>> kick an individual pg to repeer with something like
>>>>>
>>>>> ceph osd pg-temp 75.45 6689
>>>>>
>>>>> See if that makes it go?
>>>>>
>>>>> sage
>>>>>
>>>>> On March 13, 2015 7:24:48 AM EDT, Dan van der Ster <dan@vanderster.com> wrote:
>>>>>>
>>>>>> On Mon, Mar 9, 2015 at 4:47 PM, Gregory Farnum <greg@gregs42.com> wrote:
>>>>>>>
>>>>>>> On Mon, Mar 9, 2015 at 8:42 AM, Dan van der Ster <dan@vanderster.com> wrote:
>>>>>>>>
>>>>>>>> Hi Sage,
>>>>>>>>
>>>>>>>> On Tue, Feb 10, 2015 at 2:51 AM, Sage Weil <sage@newdream.net> wrote:
>>>>>>>>>
>>>>>>>>> On Mon, 9 Feb 2015, David McBride wrote:
>>>>>>>>>>
>>>>>>>>>> On 09/02/15 15:31, Gregory Farnum wrote:
>>>>>>>>>>
>>>>>>>>>>> So, memory usage of an OSD is usually linear in the number of PGs
>>>>>>>>>>> it hosts. However, that memory can also grow based on at least one
>>>>>>>>>>> other thing: the number of OSD Maps required to go through peering.
>>>>>>>>>>> It *looks* to me like this is what you're running into, not growth
>>>>>>>>>>> on the number of state machines. In particular, those
>>>>>>>>>>> past_intervals you mentioned. ;)
>>>>>>>>>>
>>>>>>>>>> Hi Greg,
>>>>>>>>>>
>>>>>>>>>> Right, that sounds entirely plausible, and is very helpful.
>>>>>>>>>>
>>>>>>>>>> In practice, that means I'll need to be careful to avoid this
>>>>>>>>>> situation occurring in production -- but given that's unlikely to
>>>>>>>>>> occur except in the case of non-trivial neglect, I don't think I
>>>>>>>>>> need be particularly concerned.
>>>>>>>>>>
>>>>>>>>>> (Happily, I'm in the situation that my existing cluster is purely
>>>>>>>>>> for testing purposes; the data is expendable.)
>>>>>>>>>>
>>>>>>>>>> That said, for my own peace of mind, it would be valuable to have a
>>>>>>>>>> procedure that can be used to recover from this state, even if it's
>>>>>>>>>> unlikely to occur in practice.
>>>>>>>>>
>>>>>>>>> The best luck I've had recovering from situations is something like:
>>>>>>>>>
>>>>>>>>> - stop all osds
>>>>>>>>> - osd set nodown
>>>>>>>>> - osd set nobackfill
>>>>>>>>> - osd set noup
>>>>>>>>> - set map cache size smaller to reduce memory footprint.
>>>>>>>>>
>>>>>>>>> osd map cache size = 50
>>>>>>>>> osd map max advance = 25
>>>>>>>>> osd map share max epochs = 25
>>>>>>>>> osd pg epoch persisted max stale = 25
>>>>>>>
>>>>>>> It can cause extreme slowness if you get into a failure situation and
>>>>>>> your OSDs need to calculate past intervals across more maps than will
>>>>>>> fit in the cache. :(
>>>>>>
>>>>>> .. extreme slowness or is it also possible to get into a situation
>>>>>> where the PGs are stuck incomplete forever?
>>>>>>
>>>>>> The reason I ask is because we actually had a network issue this
>>>>>> morning that left OSDs flapping and a lot of osdmap epoch churn. Now
>>>>>> our network has stabilized but 10 PGs are incomplete, even though all
>>>>>> the OSDs are up. One PG looks like this, for example:
>>>>>>
>>>>>> pg 75.45 is stuck inactive for 87351.077529, current state incomplete, last acting [6689,1919,2329]
>>>>>> pg 75.45 is stuck unclean for 87351.096198, current state incomplete, last acting [6689,1919,2329]
>>>>>> pg 75.45 is incomplete, acting [6689,1919,2329]
>>>>>>
>>>>>> 1919 3.62000 osd.1919 up 1.00000 1.00000
>>>>>> 2329 3.62000 osd.2329 up 1.00000 1.00000
>>>>>> 6689 3.62000 osd.6689 up 1.00000 1.00000
>>>>>>
>>>>>> The pg query output here: http://pastebin.com/WyTAU69W
>>>>>>
>>>>>> Is that a result of these short map caches or could it be something
>>>>>> else? (we're running 0.93-76-gc35f422)
>>>>>> WWGD (what would Greg do?) to activate these PGs?
>>>>>>
>>>>>> Thanks! Dan
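For reference, the recovery steps quoted from Sage Weil in this thread (stop the OSDs, set nodown/nobackfill/noup, shrink the map caches) could be written out roughly as below. This is only a sketch under assumptions: the `run` dry-run wrapper, the `DRY_RUN` variable and the function names are made up for illustration, placing the options under an `[osd]` section of ceph.conf is a guess (the thread does not say where they go), and the quoted list is cut off by the replies, so whatever follows these steps is not shown here.

```shell
#!/bin/sh
# Sketch of the quoted recovery steps. DRY_RUN defaults to 1, so the
# commands are only printed; set DRY_RUN=0 to actually invoke ceph.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, else execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Step 1 ("stop all osds") is left to the init system and is not automated.
# Steps 2-4: set the cluster flags in the order given in the quoted list.
set_recovery_flags() {
    for flag in nodown nobackfill noup; do
        run ceph osd set "$flag"
    done
}

# Step 5: the smaller map cache settings, with the values from the thread.
# The [osd] section header is an assumption.
print_map_cache_conf() {
    cat <<'EOF'
[osd]
osd map cache size = 50
osd map max advance = 25
osd map share max epochs = 25
osd pg epoch persisted max stale = 25
EOF
}

set_recovery_flags
print_map_cache_conf
```

Note Greg's caveat from the same exchange: with caches this small, OSDs that need to calculate past intervals across more maps than fit in the cache can become extremely slow.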
Thread overview: 14+ messages
2015-02-08 16:05 Bounding OSD memory requirements during peering/recovery David McBride
2015-02-08 20:05 ` David McBride
2015-02-09 10:38 ` David McBride
2015-02-09 15:31 ` Gregory Farnum
2015-02-09 21:36 ` David McBride
2015-02-10 1:51 ` Sage Weil
2015-03-09 15:42 ` Dan van der Ster
2015-03-09 15:47 ` Gregory Farnum
2015-03-13 11:24 ` Dan van der Ster
[not found] ` <f943965c-b279-4e5f-ac47-1dc6443e594d@email.android.com>
2015-03-13 12:52 ` Dan van der Ster
2015-03-13 15:36 ` Dan van der Ster
2015-03-13 20:42 ` Samuel Just
2015-03-13 20:53 ` Samuel Just
2015-03-13 21:24 ` Dan van der Ster