* Bounding OSD memory requirements during peering/recovery
@ 2015-02-08 16:05 David McBride
2015-02-08 20:05 ` David McBride
2015-02-09 15:31 ` Gregory Farnum
0 siblings, 2 replies; 14+ messages in thread
From: David McBride @ 2015-02-08 16:05 UTC
To: Ceph-devel
Hello,
I'm trying to understand the memory requirements for a Ceph node,
particularly when it is undergoing recovery.
Comments, suggestions, pointers are all welcome.
(This is my second attempt at sending this email; it appeared to get
eaten the first time — probably because it had a 1MB .heap file attached.)
Background:
==========
I've got a fairly tortured prototype Ceph cluster. It was left
unattended for several months, as I was needed elsewhere, but now
I'm returning to it with an eye to continuing to build production
services on it if I have sufficient confidence in its capabilities.
In the intervening time, several root filesystems on cluster nodes went
full (because of poorly configured logging, as well as MONs being
co-located with OSDs for expediency) and several drives were also
unceremoniously pulled out for reuse elsewhere.
A subsequent recovery is proving problematic: if all OSDs are started
concurrently, they are substantially exceeding the amount of RAM
available on the hosts during peering, and are being killed off by the
kernel OOM killer.
(And then subsequently being restarted by Upstart, resulting in
thrashing for a while, up until something unknown goes awry and the
machine stops sending telemetry and no longer responds to SSH. That's a
separate problem.)
Looking at tcmalloc-accounted heap statistics, I've seen individual OSDs
using 9GB+ of RAM; looking at RSS sizes of individual machines, I've
seen process-images exceeding 16GB. On 12-disk machines with 32GB of
RAM each, this is problematic.
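To put rough numbers on that, here is a minimal sketch of the per-OSD
RAM budget. It assumes host RAM is split evenly between OSDs and that
nothing is reserved for the OS or page cache, which is optimistic; the
function name is purely illustrative.

```python
def per_osd_budget_gb(host_ram_gb: float, osds_per_host: int) -> float:
    """Evenly divide host RAM among OSDs (ignores OS and page-cache needs)."""
    if osds_per_host <= 0:
        raise ValueError("osds_per_host must be positive")
    return host_ram_gb / osds_per_host


# Figures from this message: 12-disk hosts with 32GB RAM, OSDs seen at 9GB+.
budget = per_osd_budget_gb(32, 12)
demand_if_all_at_9gb = 12 * 9

print(f"budget per OSD: {budget:.2f} GB")
print(f"demand if every OSD reached 9 GB: {demand_if_all_at_9gb} GB vs 32 GB of RAM")
```

Even a single OSD at the observed sizes uses several times its even
share of the host's RAM.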
So, I've started looking at the data-structures and algorithms that
govern OSD recovery. I've found the following references:
http://ceph.com/docs/master/dev/placement-group/
http://ceph.com/docs/master/dev/peering/
http://ceph.com/docs/master/rados/operations/monitoring-osd-pg/
http://ceph.com/docs/master/dev/osd_internals/map_message_handling/
http://dachary.org/?p=2061
… and hope to develop an understanding of an upper bound on memory
utilization that an efficient implementation of the algorithms described
would require.
I've also been trying to collect memory profiles for OSD processes as
they're operating, to compare theory with reality.
Memory profiling:
================
For example, having found an OSD using ~6GB of memory, I turned on heap
profiling, and dumped its state using `ceph tell osd.N heap
start_profiler; ceph tell osd.N heap dump`:
> ------------------------------------------------
> MALLOC: 6167528240 ( 5881.8 MiB) Bytes in use by application
> MALLOC: + 18309120 ( 17.5 MiB) Bytes in page heap freelist
> MALLOC: + 39689152 ( 37.9 MiB) Bytes in central cache freelist
> MALLOC: + 4750960 ( 4.5 MiB) Bytes in transfer cache freelist
> MALLOC: + 25223840 ( 24.1 MiB) Bytes in thread cache freelists
> MALLOC: + 27603096 ( 26.3 MiB) Bytes in malloc metadata
> MALLOC: ------------
> MALLOC: = 6283104408 ( 5992.0 MiB) Actual memory used (physical + swap)
> MALLOC: + 2080768 ( 2.0 MiB) Bytes released to OS (aka unmapped)
> MALLOC: ------------
> MALLOC: = 6285185176 ( 5994.0 MiB) Virtual address space used
> MALLOC:
> MALLOC: 374907 Spans in use
> MALLOC: 335 Thread heaps in use
> MALLOC: 8192 Tcmalloc page size
> ------------------------------------------------
However, the heap dumps so generated only appear to show memory
allocations (made? touched?) since heap profiling was enabled:
> google-pprof --text /usr/bin/ceph-osd osd.25.profile.0001.heap
> Using local file /usr/bin/ceph-osd.
> Using local file osd.25.profile.0001.heap.
> Total: 0.0 MB
> 0.0 46.7% 46.7% 0.0 59.0% SimpleMessenger::add_accept_pipe
> [...]
Note the "Total: 0.0MB", which differs wildly from the stats reported by
tcmalloc, and the RSS of the process reported by the kernel.
So, for testing purposes, I selectively started up ~20% of the OSDs,
each invoked with the setting
CEPH_HEAP_PROFILER_INIT=1
… defined in their environment to cause the heap profiler to be
started at OSD start-time. This has a significant CPU and memory
overhead.
Also set were the cluster flags:
noout,nobackfill,norecover,noscrub,nodeep-scrub
… to avoid commingling memory requirements due to peering with other
factors.
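For anyone repeating this, a small sketch that builds the
`ceph osd set <flag>` invocations for the flags listed above. It only
prints the commands rather than running them; the helper name is
hypothetical.

```python
# Cluster flags used in this experiment to isolate peering from other work.
PEERING_ONLY_FLAGS = ["noout", "nobackfill", "norecover", "noscrub", "nodeep-scrub"]


def flag_commands(flags, action="set"):
    """Build one `ceph osd set|unset <flag>` argv list per flag."""
    if action not in ("set", "unset"):
        raise ValueError("action must be 'set' or 'unset'")
    return [["ceph", "osd", action, flag] for flag in flags]


for argv in flag_commands(PEERING_ONLY_FLAGS):
    # Printed for review; pass argv to subprocess.run() to apply it.
    print(" ".join(argv))
```

Calling `flag_commands(PEERING_ONLY_FLAGS, action="unset")` gives the
matching commands to clear the flags afterwards.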
I've produced a number of .heap files which show >= 1000MB of memory
allocated in an RB tree as a result of
PG::RecoveryState::RecoveryMachine::send_notify, PG::read_info and
MOSDPGNotify::decode_payload (or descendants).
An example heapfile from a fairly typical OSD can currently be fetched from:
http://people.ds.cam.ac.uk/dwm37/tmp/osd.0.profile.0124.heap
This was produced by the binaries from the Ceph 'trusty' repository;
`ceph -v` returns:
> ceph version 0.92 (00a3ac3b67d93860e7f0b6e07319f11b14d0fec0)
Running pprof in interactive mode and running `top30 --cum` on this
heapfile reports:
> Total: 2172.3 MB
> 1705.9 78.5% 78.5% 1748.4 80.5% __gnu_cxx::new_allocator::construct (inline)
> 0.0 0.0% 78.5% 1600.7 73.7% std::_Rb_tree::_M_create_node (inline)
> 0.0 0.0% 78.5% 1367.9 63.0% start_thread
> 0.0 0.0% 78.5% 1367.6 63.0% ioperm
> 0.0 0.0% 78.5% 963.4 44.4% ThreadPool::worker
> 0.0 0.0% 78.5% 963.3 44.3% ThreadPool::WorkThread::entry
> 0.0 0.0% 78.5% 951.0 43.8% OSD::process_peering_events
> 0.0 0.0% 78.5% 950.9 43.8% OSD::PeeringWQ::_process
> 0.0 0.0% 78.5% 949.8 43.7% PG::RecoveryState::handle_event (inline)
> 0.0 0.0% 78.5% 949.8 43.7% boost::statechart::detail::send_function::operator (inline)
> 0.0 0.0% 78.5% 949.8 43.7% boost::statechart::simple_state::react_impl
> 0.0 0.0% 78.5% 949.8 43.7% boost::statechart::state_machine::process_event (inline)
> 0.0 0.0% 78.5% 949.8 43.7% boost::statechart::state_machine::send_event
> 0.0 0.0% 78.5% 949.8 43.7% local_react (inline)
> 0.0 0.0% 78.5% 949.8 43.7% local_react_impl (inline)
> 0.0 0.0% 78.5% 949.8 43.7% operator (inline)
> 0.0 0.0% 78.5% 949.8 43.7% react (inline)
> 0.0 0.0% 78.5% 948.5 43.7% std::vector::push_back (inline)
> 0.0 0.0% 78.5% 948.3 43.7% PG::RecoveryState::RecoveryMachine::send_notify
> 0.0 0.0% 78.5% 947.1 43.6% std::vector::_M_insert_aux
> 0.0 0.0% 78.5% 947.0 43.6% _Rb_tree (inline)
> 0.0 0.0% 78.5% 947.0 43.6% map (inline)
> 0.0 0.0% 78.5% 947.0 43.6% std::_Rb_tree::_M_clone_node (inline)
> 0.0 0.0% 78.5% 947.0 43.6% std::_Rb_tree::_M_copy
> 0.0 0.0% 78.5% 809.8 37.3% construct (inline)
> 0.0 0.0% 78.5% 808.4 37.2% std::pair::pair
> 0.0 0.0% 78.5% 804.2 37.0% __libc_start_main
> 0.0 0.0% 78.5% 804.2 37.0% _start
> 0.0 0.0% 78.5% 804.2 37.0% main
> 0.0 0.0% 78.5% 803.6 37.0% OSD::init
This appears to show a large amount of memory — nearly a gigabyte —
allocated by boost::statechart, which is slightly surprising as the FAQ
for boost::statechart quotes a ~1KB memory footprint per state-machine:
http://www.boost.org/doc/libs/1_35_0/libs/statechart/doc/faq.html#EmbeddedApplications
Perhaps something unexpected is happening here? I'm almost hoping that
statechart is being subtly misused or misconfigured in some way that,
if fixed, would result in a significant drop in memory utilization…!
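One way to read the table above: in pprof's text output the first
column ("flat") is memory allocated directly in a function, and the
fourth ("cum") also includes everything it calls. The sketch below
parses a few rows copied from the output above and separates the two.
The parser is illustrative and only handles this text format.

```python
import re
from typing import List, NamedTuple


class Frame(NamedTuple):
    flat_mb: float  # allocated directly in this function
    cum_mb: float   # allocated in this function and its callees
    name: str


# flat, flat%, sum%, cum, cum%, function name
ROW = re.compile(r"^\s*([\d.]+)\s+[\d.]+%\s+[\d.]+%\s+([\d.]+)\s+[\d.]+%\s+(.+?)\s*$")


def parse_pprof_text(text: str) -> List[Frame]:
    """Parse rows of `pprof --text` / `top --cum` output; skip other lines."""
    frames = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            frames.append(Frame(float(m.group(1)), float(m.group(2)), m.group(3)))
    return frames


SAMPLE = """\
Total: 2172.3 MB
  1705.9  78.5%  78.5%   1748.4  80.5% __gnu_cxx::new_allocator::construct (inline)
     0.0   0.0%  78.5%   1600.7  73.7% std::_Rb_tree::_M_create_node (inline)
     0.0   0.0%  78.5%    951.0  43.8% OSD::process_peering_events
     0.0   0.0%  78.5%    948.3  43.7% PG::RecoveryState::RecoveryMachine::send_notify
     0.0   0.0%  78.5%    947.0  43.6% std::_Rb_tree::_M_copy
"""

frames = parse_pprof_text(SAMPLE)
direct_allocators = [f.name for f in frames if f.flat_mb > 0]
print("frames with non-zero flat allocation:", direct_allocators)
```

In these rows only the allocator's `construct` frame has a non-zero
flat figure. The statechart frames carry cumulative totals only, which
suggests they are on the call path of the allocations rather than
holding the memory themselves.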
Quantifying problem-size:
========================
Given that the log-merging stage of PG recovery appears to be the
expensive part, I queried the statistics of those PGs which seemed to
be taking a long time to peer, via `ceph pg <pgid> query`.
These showed that, for at least a handful of those PGs, the
recovery_state past_intervals list contained on the order of 200-300
entries.
(I have no feel as to whether this is excessive.)
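A sketch of how those counts could be collected from the JSON that
`ceph pg <pgid> query` prints. Rather than assume exactly where
past_intervals sits in the output, it walks the whole document and
measures every list stored under that key. The sample document here
is a synthetic stand-in, not real cluster output.

```python
import json


def count_past_intervals(node):
    """Return the length of every list found under a 'past_intervals' key."""
    counts = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "past_intervals" and isinstance(value, list):
                counts.append(len(value))
            else:
                counts.extend(count_past_intervals(value))
    elif isinstance(node, list):
        for item in node:
            counts.extend(count_past_intervals(item))
    return counts


# Synthetic example; field names other than recovery_state and
# past_intervals are illustrative assumptions.
sample = json.loads("""
{"state": "peering",
 "recovery_state": [
   {"name": "Started/Primary/Peering",
    "past_intervals": [{"first": 10, "last": 12}, {"first": 13, "last": 20}]}
 ]}
""")

print(count_past_intervals(sample))
```

Feeding it the saved output of the query for each slow PG would give a
quick per-PG tally to compare across the cluster.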
Unused memory:
=============
One thing I note is that I still sometimes see OSDs with large fractions
of their memory allocation sitting on the tcmalloc freelist, e.g.:
> osd.0 tcmalloc heap stats:------------------------------------------------
> MALLOC: 2226810584 ( 2123.7 MiB) Bytes in use by application
> MALLOC: + 1421361152 ( 1355.5 MiB) Bytes in page heap freelist
> MALLOC: + 41864920 ( 39.9 MiB) Bytes in central cache freelist
> MALLOC: + 5215680 ( 5.0 MiB) Bytes in transfer cache freelist
> MALLOC: + 18508944 ( 17.7 MiB) Bytes in thread cache freelists
> MALLOC: + 16216216 ( 15.5 MiB) Bytes in malloc metadata
> MALLOC: ------------
> MALLOC: = 3729977496 ( 3557.2 MiB) Actual memory used (physical + swap)
> MALLOC: + 32792576 ( 31.3 MiB) Bytes released to OS (aka unmapped)
> MALLOC: ------------
> MALLOC: = 3762770072 ( 3588.5 MiB) Virtual address space used
> MALLOC:
> MALLOC: 144565 Spans in use
> MALLOC: 225 Thread heaps in use
> MALLOC: 8192 Tcmalloc page size
> ------------------------------------------------
This is despite having:
TCMALLOC_RELEASE_RATE=10
… set in the environment of each OSD process. This doesn't help with
contention for RAM between processes!
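To quantify "large fractions", a short sketch that parses the tcmalloc
summary lines and computes the share of the process's memory sitting
on the page heap freelist. The regular expression is written against
the format shown above and is an assumption about its stability.

```python
import re

# Matches e.g. "MALLOC: +   1421361152 ( 1355.5 MiB) Bytes in page heap freelist"
STAT = re.compile(r"MALLOC:\s*[+=]?\s*(\d+)\s*\(\s*[\d.]+ MiB\)\s*(.+?)\s*$")


def parse_tcmalloc_stats(text):
    """Map each stat label to its byte count."""
    stats = {}
    for line in text.splitlines():
        m = STAT.search(line)
        if m:
            stats[m.group(2)] = int(m.group(1))
    return stats


SAMPLE = """\
MALLOC:     2226810584 ( 2123.7 MiB) Bytes in use by application
MALLOC: +   1421361152 ( 1355.5 MiB) Bytes in page heap freelist
MALLOC: =   3729977496 ( 3557.2 MiB) Actual memory used (physical + swap)
"""

stats = parse_tcmalloc_stats(SAMPLE)
idle_fraction = (stats["Bytes in page heap freelist"]
                 / stats["Actual memory used (physical + swap)"])
print(f"{idle_fraction:.1%} of osd.0's memory is on the page heap freelist")
```

For the osd.0 figures above this comes to roughly 38% of the memory
the process holds.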
(I have mentioned this before, though hadn't at that time yet tried
running OSDs with TCMALLOC_RELEASE_RATE. See also:
http://www.spinics.net/lists/ceph-devel/msg18769.html
… for history.
Note for anyone intending to reproduce this experiment: Upstart
overrides should be written to a file named
/etc/init/ceph-{osd,mon}.override, not ceph-{osd,mon}.conf.override as I
incorrectly specified previously.)
Leak detection:
==============
Not yet being familiar with the data-structures or algorithms that
govern PG recovery, it's not clear to me whether this is memory usage
that is expected or not for a 120-OSD cluster with 2048 PGs — or
whether there might be some variety of leak (or inefficient memory-use
pattern.)
It doesn't help that I'm not a C++ hacker. :-)
Reading around the subject, I came across LeakSanitizer, a clang/LLVM
facility:
https://code.google.com/p/address-sanitizer/wiki/LeakSanitizer
… as well as ticket #9756, which suggests using Clang's other static
analysis capabilities to help flag potentially problematic code:
http://tracker.ceph.com/issues/9756
I might spend some time this weekend to see if I can help advance that
ticket.
(I note that http://ceph.com/gitbuilders.cgi now returns 404; perhaps
that has been superseded by some Red Hat-internal facility?)
Cheers,
David
--
David McBride <dwm37@cam.ac.uk>
Unix Specialist, University Information Services
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: Bounding OSD memory requirements during peering/recovery
2015-02-08 16:05 Bounding OSD memory requirements during peering/recovery David McBride
@ 2015-02-08 20:05 ` David McBride
2015-02-09 10:38 ` David McBride
2015-02-09 15:31 ` Gregory Farnum
1 sibling, 1 reply; 14+ messages in thread
From: David McBride @ 2015-02-08 20:05 UTC
To: Ceph-devel
On 08/02/15 16:05, David McBride wrote:
> Reading around the subject, I came across LeakSanitizer, a clang/LLVM
> facility:
>
> https://code.google.com/p/address-sanitizer/wiki/LeakSanitizer
>
> … as well as ticket #9756, which suggests using Clang's other static
> analysis capabilities to help flag potentially problematic code:
>
> http://tracker.ceph.com/issues/9756
I've gone ahead and implemented this. I've submitted a pull-request via
Github, visible here:
https://github.com/ceph/autobuild-ceph/pull/22
I've not tried to replicate the gitbuilder environment directly, so
these changes are untested, though should work — at least, once
someone's added 'clang' to the list of packages to be autoprovisioned!
Cheers,
David
--
David McBride <dwm37@cam.ac.uk>
Unix Specialist, University Information Services
* Re: Bounding OSD memory requirements during peering/recovery
2015-02-08 20:05 ` David McBride
@ 2015-02-09 10:38 ` David McBride
0 siblings, 0 replies; 14+ messages in thread
From: David McBride @ 2015-02-09 10:38 UTC
To: Ceph-devel
On 08/02/15 20:05, David McBride wrote:
> https://github.com/ceph/autobuild-ceph/pull/22
>
> I've not tried to replicate the gitbuilder environment directly, so
> these changes are untested, though should work — at least, once
> someone's added 'clang' to the list of packages to be autoprovisioned!
I've now updated this pull request. It now also:
 * Updates fabfile.py so that clang (and clang-analyzer on RPM
   machines) is installed prior to builds.
 * Adds the '-analyze' hostname affix, which causes Ceph to be built
   with the 'scan-build' static-analysis wrapper. As a side-effect of
   compilation, a static analysis of Ceph's code is also run; the
   resulting report is deposited in scan-build.tmp/.
 * Tweaks the environment of clang builds so that they shouldn't
   generate spurious errors when run with versions of
   ccache < 3.2.
Cheers,
David
--
David McBride <dwm37@cam.ac.uk>
Unix Specialist, University Information Services
* Re: Bounding OSD memory requirements during peering/recovery
2015-02-08 16:05 Bounding OSD memory requirements during peering/recovery David McBride
2015-02-08 20:05 ` David McBride
@ 2015-02-09 15:31 ` Gregory Farnum
2015-02-09 21:36 ` David McBride
1 sibling, 1 reply; 14+ messages in thread
From: Gregory Farnum @ 2015-02-09 15:31 UTC
To: David McBride; +Cc: Ceph-devel
Right.
So, memory usage of an OSD is usually linear in the number of PGs it
hosts. However, that memory can also grow based on at least one other
thing: the number of OSD Maps required to go through peering. It
*looks* to me like this is what you're running in to, not growth on
the number of state machines. In particular, those past_intervals you
mentioned. ;)
Anyway, I'm afraid I don't have any magic cure-all for you. This kind
of long-term dirtied Ceph cluster is something I've only seen once or
twice and I've never led a recovery on them. But the effort usually
involves, as you've done, limiting the number of OSDs per host that
are doing recovery at once (which probably means starting one OSD at a
time until stability, rather than one per host!), disabling recovery
(as you've already done), ...and occasionally hacking up the map
history. :/
Good luck!
-Greg
* Re: Bounding OSD memory requirements during peering/recovery
2015-02-09 15:31 ` Gregory Farnum
@ 2015-02-09 21:36 ` David McBride
2015-02-10 1:51 ` Sage Weil
0 siblings, 1 reply; 14+ messages in thread
From: David McBride @ 2015-02-09 21:36 UTC
To: Gregory Farnum; +Cc: Ceph-devel
On 09/02/15 15:31, Gregory Farnum wrote:
> So, memory usage of an OSD is usually linear in the number of PGs it
> hosts. However, that memory can also grow based on at least one other
> thing: the number of OSD Maps required to go through peering. It
> *looks* to me like this is what you're running in to, not growth on
> the number of state machines. In particular, those past_intervals you
> mentioned. ;)
Hi Greg,
Right, that sounds entirely plausible, and is very helpful.
In practice, that means I'll need to be careful to avoid this situation
occurring in production — but given that's unlikely to occur except in
the case of non-trivial neglect, I don't think I need be particularly
concerned.
(Happily, I'm in the situation that my existing cluster is purely for
testing purposes; the data is expendable.)
That said, for my own peace of mind, it would be valuable to have a
procedure that can be used to recover from this state, even if it's
unlikely to occur in practice.
I'm currently running an experiment where I augment the RAM of each OSD
node with 10GB swapfiles on each spinning OSD disk, so that there's a
big-enough backing-store to complete log reconstruction.
(You obviously wouldn't want to operate in this manner during normal
production operation — the loss of a single drive would cause a hard
machine-crash, and the performance will be fairly diabolical,
particularly if you allow client workloads to carry on in the background.)
I did try enabling zswap on the Utopic LTS kernel as supplied as an
option in Ubuntu 14.04; however, the kernel was not stable in such a
configuration and several machines crashed under memory pressure.
I do have OSDs committing suicide periodically, probably because they're
insufficiently responsive to heartbeats as they start to hit swap. This
is before experimenting with the various OSD tuning dials for timeouts,
so some improvement may be possible.
In the meantime, I've configured the ceph-osd Upstart jobs to apply a
post-exec command of `sleep 3600` to reduce the rate at which they're
respawned.
So far, the resulting configuration seems to be making progress, albeit
moderately slowly.
Cheers,
David
--
David McBride <dwm37@cam.ac.uk>
Unix Specialist, University Information Services
* Re: Bounding OSD memory requirements during peering/recovery
2015-02-09 21:36 ` David McBride
@ 2015-02-10 1:51 ` Sage Weil
2015-03-09 15:42 ` Dan van der Ster
0 siblings, 1 reply; 14+ messages in thread
From: Sage Weil @ 2015-02-10 1:51 UTC
To: David McBride; +Cc: Gregory Farnum, Ceph-devel
On Mon, 9 Feb 2015, David McBride wrote:
> On 09/02/15 15:31, Gregory Farnum wrote:
>
> > So, memory usage of an OSD is usually linear in the number of PGs it
> > hosts. However, that memory can also grow based on at least one other
> > thing: the number of OSD Maps required to go through peering. It
> > *looks* to me like this is what you're running in to, not growth on
> > the number of state machines. In particular, those past_intervals you
> > mentioned. ;)
>
> Hi Greg,
>
> Right, that sounds entirely plausible, and is very helpful.
>
> In practice, that means I'll need to be careful to avoid this situation
> occurring in production, but given that's unlikely to occur except in the
> case of non-trivial neglect, I don't think I need be particularly concerned.
>
> (Happily, I'm in the situation that my existing cluster is purely for testing
> purposes; the data is expendable.)
>
> That said, for my own peace of mind, it would be valuable to have a procedure
> that can be used to recover from this state, even if it's unlikely to occur in
> practice.
The best luck I've had recovering from this kind of situation is with
something like:
 - stop all osds
 - ceph osd set nodown
 - ceph osd set nobackfill
 - ceph osd set noup
 - set map cache size smaller to reduce memory footprint:
     osd map cache size = 50
     osd map max advance = 25
     osd map share max epochs = 25
     osd pg epoch persisted max stale = 25
   (basically, keep most of those values in sync, and smaller than
   the map cache)
 - start all osds, let them catch up on their maps. (If they can't fit
   in memory at this point then another creative solution will be
   needed.)
 - ceph osd unset noup, so that everyone peers at once
It may also help to try to match the in/out state with where the data
actually resides (i.e. mark an osd back in if it was marked out but the
cluster didn't rebalance).
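The steps above can be sketched as an ordered plan, which makes the
sequencing explicit: flags first, smaller map settings next, and noup
cleared only once every OSD has caught up. This is an illustration
only. The function name and the "manual"/"command"/"config" labels are
made up here, and stopping and starting OSDs is left as a manual step
because it depends on the init system.

```python
def recovery_plan(map_cache_size=50, sync_value=25):
    """Return the suggested recovery steps as ordered (kind, detail) pairs."""
    steps = [("manual", "stop all OSDs")]
    # Keep OSDs from being marked down or up, and hold off backfill.
    steps += [("command", f"ceph osd set {flag}")
              for flag in ("nodown", "nobackfill", "noup")]
    steps += [
        # ceph.conf settings: related values in sync, below the cache size.
        ("config", f"osd map cache size = {map_cache_size}"),
        ("config", f"osd map max advance = {sync_value}"),
        ("config", f"osd map share max epochs = {sync_value}"),
        ("config", f"osd pg epoch persisted max stale = {sync_value}"),
        ("manual", "start all OSDs and wait for them to catch up on their maps"),
        # Only now let everyone come up and peer at once.
        ("command", "ceph osd unset noup"),
    ]
    return steps


for number, (kind, detail) in enumerate(recovery_plan(), start=1):
    print(f"{number:2d}. [{kind}] {detail}")
```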
> I'm currently running an experiment where I augment the RAM of each OSD node
> with 10GB swapfiles on each spinning OSD disk, so that there's a big-enough
> backing-store to complete log reconstruction.
Swap tends not to work very well. Make sure nodown is set if you have
to go this route, or else osds will get marked down when they miss
heartbeats...
sage
* Re: Bounding OSD memory requirements during peering/recovery
2015-02-10 1:51 ` Sage Weil
@ 2015-03-09 15:42 ` Dan van der Ster
2015-03-09 15:47 ` Gregory Farnum
0 siblings, 1 reply; 14+ messages in thread
From: Dan van der Ster @ 2015-03-09 15:42 UTC
To: Sage Weil; +Cc: David McBride, Gregory Farnum, Ceph-devel
Hi Sage,
On Tue, Feb 10, 2015 at 2:51 AM, Sage Weil <sage@newdream.net> wrote:
> On Mon, 9 Feb 2015, David McBride wrote:
>> On 09/02/15 15:31, Gregory Farnum wrote:
>>
>> > So, memory usage of an OSD is usually linear in the number of PGs it
>> > hosts. However, that memory can also grow based on at least one other
>> > thing: the number of OSD Maps required to go through peering. It
>> > *looks* to me like this is what you're running in to, not growth on
>> > the number of state machines. In particular, those past_intervals you
>> > mentioned. ;)
>>
>> Hi Greg,
>>
>> Right, that sounds entirely plausible, and is very helpful.
>>
>> In practice, that means I'll need to be careful to avoid this situation
>> occurring in production -- but given that's unlikely to occur except in the
>> case of non-trivial neglect, I don't think I need be particularly concerned.
>>
>> (Happily, I'm in the situation that my existing cluster is purely for testing
>> purposes; the data is expendable.)
>>
>> That said, for my own peace of mind, it would be valuable to have a procedure
>> that can be used to recover from this state, even if it's unlikely to occur in
>> practice.
>
> The best luck I've had recovering from situations is something like:
>
> - stop all osds
> - osd set nodown
> - osd set nobackfill
> - osd set noup
> - set map cache size smaller to reduce memory footprint.
>
> osd map cache size = 50
> osd map max advance = 25
> osd map share max epochs = 25
> osd pg epoch persisted max stale = 25
>
The settings above have proven very useful when setting up some of our
new OSD servers, which don't have much memory per OSD: 64GB RAM for
48x4TB OSDs.
Prior to applying these settings (plus one more, below) we were seeing
memory usage of around 2-3GB per OSD when they were freshly created. After
a restart the processes stayed under 300-400MB.
It seems that the initial bootstrapping -- fetching the 500 most recent
osdmaps in batches of 100 at a time -- causes the osd map cache to
exceed its 50-entry limit, and that memory is then never freed. We
found that to fix this we also had to lower the "osd map message max"
setting on the mons; with that change, OSD memory stays under
500MB per process.
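A back-of-the-envelope check of those numbers, as a sketch only. It assumes every cached epoch costs roughly as much memory as the 4.6MB osdmap size given in footnote [1] of this message; the real in-memory cost per map may well differ, so treat the results as orders of magnitude.

```python
OSDMAP_SIZE_MB = 4.6  # osdmap size from footnote [1]; in-memory cost is an assumption


def cache_footprint_mb(cached_maps, map_size_mb=OSDMAP_SIZE_MB):
    """Rough bound: every cached epoch is held as one full-size map."""
    return cached_maps * map_size_mb


if __name__ == "__main__":
    scenarios = [
        ("500 maps held after bootstrap", 500),
        ("osd map cache size = 50", 50),
        ("osd map cache size = 20", 20),
    ]
    for label, maps in scenarios:
        print(f"{label}: ~{cache_footprint_mb(maps):.0f} MB")
```

Under that assumption, 500 retained maps come to about 2.3GB, which is in the same range as the 2-3GB per OSD seen on freshly created OSDs, while a 20-entry cache would account for under 100MB.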
Currently we're happily running a large [1] number of OSDs with the
following configuration:
[global]
osd map message max = 10
[osd]
osd map cache size = 20
osd map max advance = 10
osd map share max epochs = 10
osd pg epoch persisted max stale = 10
With this, memory consumption is 400-500MB per process, even during
backfilling, and so far we haven't seen any drawbacks to this
configuration. Should we expect any problems if we keep this
small osdmap cache permanently?
Best Regards,
Dan
[1] "large" in this case means the osdmap is 4.6MB in size
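The rule of thumb from earlier in the thread (keep the related values in sync and smaller than the map cache) can be checked mechanically against a configuration like the one above. This is a hedged sketch: the helper name is invented, it only understands the space-separated option spelling used in this message, and the extra check on "osd map message max" reflects my reading of the bootstrapping observation rather than a documented constraint.

```python
import configparser

# The configuration quoted in this message.
DAN_CONF = """
[global]
osd map message max = 10
[osd]
osd map cache size = 20
osd map max advance = 10
osd map share max epochs = 10
osd pg epoch persisted max stale = 10
"""

# Options that should stay in sync with each other and below the cache size.
BOUNDED = [
    "osd map max advance",
    "osd map share max epochs",
    "osd pg epoch persisted max stale",
]


def check_map_settings(conf_text):
    """Return a list of problems; an empty list means the values are consistent."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    cache = parser.getint("osd", "osd map cache size")
    problems = []
    for name in BOUNDED:
        value = parser.getint("osd", name)
        if value >= cache:
            problems.append(f"{name} = {value} is not smaller than the cache ({cache})")
    # Assumption: a single map message should not carry more maps than the cache holds.
    message_max = parser.getint("global", "osd map message max")
    if message_max > cache:
        problems.append(f"osd map message max = {message_max} exceeds the cache ({cache})")
    return problems


if __name__ == "__main__":
    print(check_map_settings(DAN_CONF) or "settings look consistent")
```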
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Bounding OSD memory requirements during peering/recovery
2015-03-09 15:42 ` Dan van der Ster
@ 2015-03-09 15:47 ` Gregory Farnum
2015-03-13 11:24 ` Dan van der Ster
0 siblings, 1 reply; 14+ messages in thread
From: Gregory Farnum @ 2015-03-09 15:47 UTC (permalink / raw)
To: Dan van der Ster; +Cc: Sage Weil, David McBride, Ceph-devel
On Mon, Mar 9, 2015 at 8:42 AM, Dan van der Ster <dan@vanderster.com> wrote:
> Hi Sage,
>
> On Tue, Feb 10, 2015 at 2:51 AM, Sage Weil <sage@newdream.net> wrote:
>> On Mon, 9 Feb 2015, David McBride wrote:
>>> On 09/02/15 15:31, Gregory Farnum wrote:
>>>
>>> > So, memory usage of an OSD is usually linear in the number of PGs it
>>> > hosts. However, that memory can also grow based on at least one other
>>> > thing: the number of OSD Maps required to go through peering. It
>>> > *looks* to me like this is what you're running in to, not growth on
>>> > the number of state machines. In particular, those past_intervals you
>>> > mentioned. ;)
>>>
>>> Hi Greg,
>>>
>>> Right, that sounds entirely plausible, and is very helpful.
>>>
>>> In practice, that means I'll need to be careful to avoid this situation
>>> occurring in production -- but given that's unlikely to occur except in the
>>> case of non-trivial neglect, I don't think I need be particularly concerned.
>>>
>>> (Happily, I'm in the situation that my existing cluster is purely for testing
>>> purposes; the data is expendable.)
>>>
>>> That said, for my own peace of mind, it would be valuable to have a procedure
>>> that can be used to recover from this state, even if it's unlikely to occur in
>>> practice.
>>
>> The best luck I've had recovering from situations is something like:
>>
>> - stop all osds
>> - osd set nodown
>> - osd set nobackfill
>> - osd set noup
>> - set map cache size smaller to reduce memory footprint.
>>
>> osd map cache size = 50
>> osd map max advance = 25
>> osd map share max epochs = 25
>> osd pg epoch persisted max stale = 25
It can cause extreme slowness if you get into a failure situation and
your OSDs need to calculate past intervals across more maps than will
fit in the cache. :(
That said, this might be a good idea as long as you're conscious of
needing to set it back if you get into trouble later on.
-Greg
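To illustrate why walking more maps than fit in the cache is so slow, here is a toy model, not Ceph's actual cache. It assumes a plain LRU keyed by epoch, that every miss stands for re-reading a map from disk, and that several passes over the same epoch range represent several PGs each computing their past intervals.

```python
from collections import OrderedDict


class LruMapCache:
    """Tiny LRU keyed by epoch; an illustrative stand-in for an osdmap cache."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, epoch):
        if epoch in self.entries:
            self.entries.move_to_end(epoch)  # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1  # models a map being loaded from disk
            self.entries[epoch] = True
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict least recently used


def scan(cache_size, first_epoch, last_epoch, passes):
    """Walk the epoch range `passes` times and return (hits, misses)."""
    cache = LruMapCache(cache_size)
    for _ in range(passes):
        for epoch in range(first_epoch, last_epoch + 1):
            cache.get(epoch)
    return cache.hits, cache.misses


if __name__ == "__main__":
    print("cache 20, 100 epochs, 3 passes:", scan(20, 1, 100, 3))
    print("cache 200, 100 epochs, 3 passes:", scan(200, 1, 100, 3))
```

In this model, three passes over 100 epochs with a 20-entry cache never hit (0 hits, 300 misses), because each map is evicted before it is needed again; with a cache larger than the range, only the first pass misses (200 hits, 100 misses).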
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Bounding OSD memory requirements during peering/recovery
2015-03-09 15:47 ` Gregory Farnum
@ 2015-03-13 11:24 ` Dan van der Ster
[not found] ` <f943965c-b279-4e5f-ac47-1dc6443e594d@email.android.com>
0 siblings, 1 reply; 14+ messages in thread
From: Dan van der Ster @ 2015-03-13 11:24 UTC (permalink / raw)
To: Gregory Farnum; +Cc: Sage Weil, David McBride, Ceph-devel
On Mon, Mar 9, 2015 at 4:47 PM, Gregory Farnum <greg@gregs42.com> wrote:
> On Mon, Mar 9, 2015 at 8:42 AM, Dan van der Ster <dan@vanderster.com> wrote:
>> Hi Sage,
>>
>> On Tue, Feb 10, 2015 at 2:51 AM, Sage Weil <sage@newdream.net> wrote:
>>> On Mon, 9 Feb 2015, David McBride wrote:
>>>> On 09/02/15 15:31, Gregory Farnum wrote:
>>>>
>>>> > So, memory usage of an OSD is usually linear in the number of PGs it
>>>> > hosts. However, that memory can also grow based on at least one other
>>>> > thing: the number of OSD Maps required to go through peering. It
>>>> > *looks* to me like this is what you're running in to, not growth on
>>>> > the number of state machines. In particular, those past_intervals you
>>>> > mentioned. ;)
>>>>
>>>> Hi Greg,
>>>>
>>>> Right, that sounds entirely plausible, and is very helpful.
>>>>
>>>> In practice, that means I'll need to be careful to avoid this situation
>>>> occurring in production -- but given that's unlikely to occur except in the
>>>> case of non-trivial neglect, I don't think I need be particularly concerned.
>>>>
>>>> (Happily, I'm in the situation that my existing cluster is purely for testing
>>>> purposes; the data is expendable.)
>>>>
>>>> That said, for my own peace of mind, it would be valuable to have a procedure
>>>> that can be used to recover from this state, even if it's unlikely to occur in
>>>> practice.
>>>
>>> The best luck I've had recovering from situations is something like:
>>>
>>> - stop all osds
>>> - osd set nodown
>>> - osd set nobackfill
>>> - osd set noup
>>> - set map cache size smaller to reduce memory footprint.
>>>
>>> osd map cache size = 50
>>> osd map max advance = 25
>>> osd map share max epochs = 25
>>> osd pg epoch persisted max stale = 25
>
> It can cause extreme slowness if you get into a failure situation and
> your OSDs need to calculate past intervals across more maps than will
> fit in the cache. :(
Just extreme slowness, or is it also possible to get into a situation
where the PGs are stuck incomplete forever?
The reason I ask is that we actually had a network issue this
morning that left OSDs flapping and caused a lot of osdmap epoch churn.
Our network has now stabilized, but 10 PGs are incomplete even though all
the OSDs are up. One PG looks like this, for example:
pg 75.45 is stuck inactive for 87351.077529, current state incomplete, last acting [6689,1919,2329]
pg 75.45 is stuck unclean for 87351.096198, current state incomplete, last acting [6689,1919,2329]
pg 75.45 is incomplete, acting [6689,1919,2329]

1919 3.62000 osd.1919 up 1.00000 1.00000
2329 3.62000 osd.2329 up 1.00000 1.00000
6689 3.62000 osd.6689 up 1.00000 1.00000
The pg query output here: http://pastebin.com/WyTAU69W
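When there are several such PGs, health lines like the ones quoted above can be collected with a small parser. This is only a sketch: it assumes each health line is on a single line (the mail wraps them), matches only the short "pg <id> is <state>, acting [...]" form, and the function name is invented for this example.

```python
import re

# The three health lines from this message, with the mail's wrapping undone.
HEALTH_LINES = """\
pg 75.45 is stuck inactive for 87351.077529, current state incomplete, last acting [6689,1919,2329]
pg 75.45 is stuck unclean for 87351.096198, current state incomplete, last acting [6689,1919,2329]
pg 75.45 is incomplete, acting [6689,1919,2329]
"""

# Matches the summary form "pg <id> is <state>, acting [a,b,c]".
PG_LINE = re.compile(r"^pg (\S+) is (\S+), acting \[([\d,]+)\]$")


def incomplete_pgs(text):
    """Return {pgid: acting_osd_ids} for PGs whose summary state is 'incomplete'."""
    found = {}
    for line in text.splitlines():
        match = PG_LINE.match(line.strip())
        if match and match.group(2) == "incomplete":
            found[match.group(1)] = [int(osd) for osd in match.group(3).split(",")]
    return found


if __name__ == "__main__":
    for pgid, acting in incomplete_pgs(HEALTH_LINES).items():
        print(pgid, "acting", acting, "primary", acting[0])
```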
Is that a result of these short map caches or could it be something
else? (we're running 0.93-76-gc35f422)
WWGD (what would Greg do?) to activate these PGs?
Thanks! Dan
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Bounding OSD memory requirements during peering/recovery
[not found] ` <f943965c-b279-4e5f-ac47-1dc6443e594d@email.android.com>
@ 2015-03-13 12:52 ` Dan van der Ster
2015-03-13 15:36 ` Dan van der Ster
0 siblings, 1 reply; 14+ messages in thread
From: Dan van der Ster @ 2015-03-13 12:52 UTC (permalink / raw)
To: Sage Weil; +Cc: Gregory Farnum, David McBride, Ceph-devel
Hi Sage,
Losing a message would have been plausible given the network issue we had today.
I tried:
# ceph osd pg-temp 75.45 6689
set 75.45 pg_temp mapping to [6689]
then waited a bit. It's still incomplete -- the only difference is now
I see two more past_intervals in the pg. Full query here:
http://pastebin.com/TU7vVLpj
I didn't have debug_osd above zero when I did that. Should I try again
with debug_osd 20?
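For reference, one logged retry could be laid out as the sequence below. This is a sketch under assumptions: the helper name is made up, it only builds the command lines and runs nothing, raising the log level through "ceph tell ... injectargs" is one common way to do it rather than something prescribed in this thread, and the level restored afterwards (0 here, matching "not above zero") should be whatever the cluster normally uses.

```python
import shlex


def repeer_commands(pgid, primary_osd, debug_level=20, quiet_level=0):
    """Build (not run) the commands for one logged re-peer attempt."""
    target = f"osd.{primary_osd}"
    return [
        # Raise OSD logging on the primary before kicking the PG.
        ["ceph", "tell", target, "injectargs", f"--debug-osd {debug_level}"],
        # The pg-temp kick suggested in this thread.
        ["ceph", "osd", "pg-temp", pgid, str(primary_osd)],
        # Inspect the PG state afterwards.
        ["ceph", "pg", pgid, "query"],
        # Put logging back to the quiet level.
        ["ceph", "tell", target, "injectargs", f"--debug-osd {quiet_level}"],
    ]


if __name__ == "__main__":
    for argv in repeer_commands("75.45", 6689):
        print(shlex.join(argv))
```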
Thanks :)
Dan
On Fri, Mar 13, 2015 at 12:59 PM, Sage Weil <sage@newdream.net> wrote:
> This looks a bit like the osds may have lost a message, actually. You can
> kick an individual pg to repeer with something like
>
> ceph osd pg-temp 75.45 6689
>
> See if that makes it go?
>
> sage
>
>
>
> On March 13, 2015 7:24:48 AM EDT, Dan van der Ster <dan@vanderster.com>
> wrote:
>>
>> On Mon, Mar 9, 2015 at 4:47 PM, Gregory Farnum <greg@gregs42.com> wrote:
>>>
>>> On Mon, Mar 9, 2015 at 8:42 AM, Dan van der Ster <dan@vanderster.com>
>>> wrote:
>>>>
>>>> Hi Sage,
>>>>
>>>> On Tue, Feb 10, 2015 at 2:51 AM, Sage Weil <sage@newdream.net> wrote:
>>>>>
>>>>> On Mon, 9 Feb 2015, David McBride wrote:
>>>>>>
>>>>>> On 09/02/15 15:31, Gregory Farnum wrote:
>>>>>>
>>>>>>> So, memory
>>>>>>> usage of an OSD is usually linear in the number of PGs it
>>>>>>> hosts. However, that memory can also grow based on at least one
>>>>>>> other
>>>>>>> thing: the number of OSD Maps required to go through peering. It
>>>>>>> *looks* to me like this is what you're running in to, not growth on
>>>>>>> the number of state machines. In particular, those past_intervals
>>>>>>> you
>>>>>>> mentioned. ;)
>>>>>>
>>>>>>
>>>>>> Hi Greg,
>>>>>>
>>>>>> Right, that sounds entirely plausible, and is very helpful.
>>>>>>
>>>>>> In practice, that means I'll need to be careful to avoid this
>>>>>> situation
>>>>>> occurring in production -- but given that's unlikely to occur except
>>>>>> in the
>>>>>> case of non-trivial neglect, I don't think I need be particularly
>>>>>> concerned.
>>>>>>
>>>>>> (Happily, I'm in the situation that my existing cluster is purely for
>>>>>> testing
>>>>>> purposes; the data is expendable.)
>>>>>>
>>>>>> That said, for my own peace of mind, it would be valuable to have a
>>>>>> procedure
>>>>>> that can be used to recover from this
>>>>>> state, even if it's unlikely to occur in
>>>>>> practice.
>>>>>
>>>>>
>>>>> The best luck I've had recovering from situations is something like:
>>>>>
>>>>> - stop all osds
>>>>> - osd set nodown
>>>>> - osd set nobackfill
>>>>> - osd set noup
>>>>> - set map cache size smaller to reduce memory footprint.
>>>>>
>>>>> osd map cache size = 50
>>>>> osd map max advance = 25
>>>>> osd map share max epochs = 25
>>>>> osd pg epoch persisted max stale = 25
>>>
>>>
>>> It can cause extreme slowness if you get into a failure situation and
>>> your OSDs need to calculate past intervals across more maps than will
>>> fit in the cache. :(
>>
>>
>> .. extreme slowness or is it also possible to get into a situation
>> where the PGs are stuck incomplete forever?
>>
>> The reason I ask is because we actually had a network issue this
>> morning that left OSDs flapping and a lot of osdmap epoch churn. Now
>> our network has
>> stabilized but 10 PGs are incomplete, even though all
>> the OSDs are up. One PG looks like this, for example:
>>
>> pg 75.45 is stuck inactive for 87351.077529, current state incomplete,
>> last acting [6689,1919,2329]
>> pg 75.45 is stuck unclean for 87351.096198, current state incomplete,
>> last acting [6689,1919,2329]
>> pg 75.45 is incomplete, acting [6689,1919,2329]
>>
>> 1919 3.62000 osd.1919 up
>> 1.00000 1.00000
>> 2329 3.62000 osd.2329 up
>> 1.00000 1.00000
>> 6689 3.62000 osd.6689 up
>> 1.00000 1.00000
>>
>> The pg query output here: http://pastebin.com/WyTAU69W
>>
>> Is that a result of these short map caches or could it be something
>> else? (we're running 0.93-76-gc35f422)
>> WWGD (what would Greg do?) to activate these PGs?
>>
>> Thanks! Dan
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Bounding OSD memory requirements during peering/recovery
2015-03-13 12:52 ` Dan van der Ster
@ 2015-03-13 15:36 ` Dan van der Ster
2015-03-13 20:42 ` Samuel Just
0 siblings, 1 reply; 14+ messages in thread
From: Dan van der Ster @ 2015-03-13 15:36 UTC (permalink / raw)
To: Sage Weil; +Cc: Gregory Farnum, David McBride, Ceph-devel
On Fri, Mar 13, 2015 at 1:52 PM, Dan van der Ster <dan@vanderster.com> wrote:
> Hi Sage,
>
> Losing a message would have been plausible given the network issue we had today.
>
> I tried:
>
> # ceph osd pg-temp 75.45 6689
> set 75.45 pg_temp mapping to [6689]
>
> then waited a bit. It's still incomplete -- the only difference is now
> I see two more past_intervals in the pg. Full query here:
> http://pastebin.com/TU7vVLpj
>
> I didn't have debug_osd above zero when I did that. Should I try again
> with debug_osd 20?
I tried again with logging. The pg goes like this:
incomplete -> inactive -> remapped -> remapped+peering -> remapped ->
inactive -> peering -> incomplete
The killer seems to be:
2015-03-13 16:15:43.476925 7f3c2e055700 10 osd.6689 pg_epoch: 67050
pg[75.45( v 66245'4028 (49044'1025,66245'4028] local-les=61515 n=3994
ec=48759 les/c 66791/66791 67037/67050/67037) [6689,1919,2329]/[6689]
r=0 lpr=67050 pi=66787-67049/13 crt=66226'4026 lcod 0'0 mlcod 0'0
remapped+peering] choose_acting no suitable info found (incomplete
backfills?), reverting to up
Full log is here: http://pastebin.com/hZUBD9NT
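The fields in that debug line can be pulled apart to compare the past-interval range with the map cache size. This is a sketch based on my reading of the line, which is an assumption rather than something confirmed in this thread: "[...]/[...]" is taken as up/acting, and "pi=A-B/N" as N past intervals spanning epochs A through B.

```python
import re

# The log line quoted above, with the mail's line wrapping undone.
LOG_LINE = (
    "2015-03-13 16:15:43.476925 7f3c2e055700 10 osd.6689 pg_epoch: 67050 "
    "pg[75.45( v 66245'4028 (49044'1025,66245'4028] local-les=61515 n=3994 "
    "ec=48759 les/c 66791/66791 67037/67050/67037) [6689,1919,2329]/[6689] "
    "r=0 lpr=67050 pi=66787-67049/13 crt=66226'4026 lcod 0'0 mlcod 0'0 "
    "remapped+peering] choose_acting no suitable info found (incomplete "
    "backfills?), reverting to up"
)

FIELDS = re.compile(
    r"pg_epoch: (?P<epoch>\d+) pg\[(?P<pgid>[0-9a-f.]+)\(.*"
    r"\[(?P<up>[\d,]+)\]/\[(?P<acting>[\d,]+)\].*"
    r"pi=(?P<pi_first>\d+)-(?P<pi_last>\d+)/(?P<pi_count>\d+)"
)


def parse_peering_line(line):
    """Extract epoch, up/acting sets and the past-interval range, or None."""
    match = FIELDS.search(line)
    if match is None:
        return None
    first, last = int(match["pi_first"]), int(match["pi_last"])
    return {
        "pgid": match["pgid"],
        "epoch": int(match["epoch"]),
        "up": [int(osd) for osd in match["up"].split(",")],
        "acting": [int(osd) for osd in match["acting"].split(",")],
        "pi_epoch_span": last - first + 1,  # epochs covered by past intervals
        "pi_count": int(match["pi_count"]),
    }


if __name__ == "__main__":
    print(parse_peering_line(LOG_LINE))
```

Read this way, the line shows 13 past intervals covering 263 epochs, well beyond the 20-entry map cache in use on this cluster, though that alone does not establish the cause of the incomplete state.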
Do you have any idea what went wrong here? BTW, our firefly "prod"
cluster suffered from the same network problem today, but all of that
cluster's PGs recovered nicely.
Does the hammer RC have different peering logic that might apply here?
Thanks! Dan
>
> Thanks :)
>
> Dan
>
> On Fri, Mar 13, 2015 at 12:59 PM, Sage Weil <sage@newdream.net> wrote:
>> This looks a bit like the osds may have lost a message, actually. You can
>> kick an individual pg to repeer with something like
>>
>> ceph osd pg-temp 75.45 6689
>>
>> See if that makes it go?
>>
>> sage
>>
>>
>>
>> On March 13, 2015 7:24:48 AM EDT, Dan van der Ster <dan@vanderster.com>
>> wrote:
>>>
>>> On Mon, Mar 9, 2015 at 4:47 PM, Gregory Farnum <greg@gregs42.com> wrote:
>>>>
>>>> On Mon, Mar 9, 2015 at 8:42 AM, Dan van der Ster <dan@vanderster.com>
>>>> wrote:
>>>>>
>>>>> Hi Sage,
>>>>>
>>>>> On Tue, Feb 10, 2015 at 2:51 AM, Sage Weil <sage@newdream.net> wrote:
>>>>>>
>>>>>> On Mon, 9 Feb 2015, David McBride wrote:
>>>>>>>
>>>>>>> On 09/02/15 15:31, Gregory Farnum wrote:
>>>>>>>
>>>>>>>> So, memory
>>>>>>>> usage of an OSD is usually linear in the number of PGs it
>>>>>>>> hosts. However, that memory can also grow based on at least one
>>>>>>>> other
>>>>>>>> thing: the number of OSD Maps required to go through peering. It
>>>>>>>> *looks* to me like this is what you're running in to, not growth on
>>>>>>>> the number of state machines. In particular, those past_intervals
>>>>>>>> you
>>>>>>>> mentioned. ;)
>>>>>>>
>>>>>>>
>>>>>>> Hi Greg,
>>>>>>>
>>>>>>> Right, that sounds entirely plausible, and is very helpful.
>>>>>>>
>>>>>>> In practice, that means I'll need to be careful to avoid this
>>>>>>> situation
>>>>>>> occurring in production -- but given that's unlikely to occur except
>>>>>>> in the
>>>>>>> case of non-trivial neglect, I don't think I need be particularly
>>>>>>> concerned.
>>>>>>>
>>>>>>> (Happily, I'm in the situation that my existing cluster is purely for
>>>>>>> testing
>>>>>>> purposes; the data is expendable.)
>>>>>>>
>>>>>>> That said, for my own peace of mind, it would be valuable to have a
>>>>>>> procedure
>>>>>>> that can be used to recover from this
>>>>>>> state, even if it's unlikely to occur in
>>>>>>> practice.
>>>>>>
>>>>>>
>>>>>> The best luck I've had recovering from situations is something like:
>>>>>>
>>>>>> - stop all osds
>>>>>> - osd set nodown
>>>>>> - osd set nobackfill
>>>>>> - osd set noup
>>>>>> - set map cache size smaller to reduce memory footprint.
>>>>>>
>>>>>> osd map cache size = 50
>>>>>> osd map max advance = 25
>>>>>> osd map share max epochs = 25
>>>>>> osd pg epoch persisted max stale = 25
>>>>
>>>>
>>>> It can cause extreme slowness if you get into a failure situation and
>>>> your OSDs need to calculate past intervals across more maps than will
>>>> fit in the cache. :(
>>>
>>>
>>> .. extreme slowness or is it also possible to get into a situation
>>> where the PGs are stuck incomplete forever?
>>>
>>> The reason I ask is because we actually had a network issue this
>>> morning that left OSDs flapping and a lot of osdmap epoch churn. Now
>>> our network has
>>> stabilized but 10 PGs are incomplete, even though all
>>> the OSDs are up. One PG looks like this, for example:
>>>
>>> pg 75.45 is stuck inactive for 87351.077529, current state incomplete,
>>> last acting [6689,1919,2329]
>>> pg 75.45 is stuck unclean for 87351.096198, current state incomplete,
>>> last acting [6689,1919,2329]
>>> pg 75.45 is incomplete, acting [6689,1919,2329]
>>>
>>> 1919 3.62000 osd.1919 up
>>> 1.00000 1.00000
>>> 2329 3.62000 osd.2329 up
>>> 1.00000 1.00000
>>> 6689 3.62000 osd.6689 up
>>> 1.00000 1.00000
>>>
>>> The pg query output here: http://pastebin.com/WyTAU69W
>>>
>>> Is that a result of these short map caches or could it be something
>>> else? (we're running 0.93-76-gc35f422)
>>> WWGD (what would Greg do?) to activate these PGs?
>>>
>>> Thanks! Dan
>>>
>>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Bounding OSD memory requirements during peering/recovery
2015-03-13 15:36 ` Dan van der Ster
@ 2015-03-13 20:42 ` Samuel Just
2015-03-13 20:53 ` Samuel Just
0 siblings, 1 reply; 14+ messages in thread
From: Samuel Just @ 2015-03-13 20:42 UTC (permalink / raw)
To: Dan van der Ster, Sage Weil; +Cc: Gregory Farnum, David McBride, Ceph-devel
I've opened a bug for this (http://tracker.ceph.com/issues/11110). I
bet it's related to the new logic for allowing recovery below min_size.
Exactly what sha1 was running on the osds during this time period?
-Sam
On 03/13/2015 08:36 AM, Dan van der Ster wrote:
> On Fri, Mar 13, 2015 at 1:52 PM, Dan van der Ster <dan@vanderster.com> wrote:
>> Hi Sage,
>>
>> Losing a message would have been plausible given the network issue we had today.
>>
>> I tried:
>>
>> # ceph osd pg-temp 75.45 6689
>> set 75.45 pg_temp mapping to [6689]
>>
>> then waited a bit. It's still incomplete -- the only difference is now
>> I see two more past_intervals in the pg. Full query here:
>> http://pastebin.com/TU7vVLpj
>>
>> I didn't have debug_osd above zero when I did that. Should I try again
>> with debug_osd 20?
> I tried again with logging. The pg goes like this:
>
> incomplete -> inactive -> remapped -> remapped+peering -> remapped ->
> inactive -> peering -> incomplete
>
> The killer seems to be:
>
> 2015-03-13 16:15:43.476925 7f3c2e055700 10 osd.6689 pg_epoch: 67050
> pg[75.45( v 66245'4028 (49044'1025,66245'4028] local-les=61515 n=3994
> ec=48759 les/c 66791/66791 67037/67050/67037) [6689,1919,2329]/[6689]
> r=0 lpr=67050 pi=66787-67049/13 crt=66226'4026 lcod 0'0 mlcod 0'0
> remapped+peering] choose_acting no suitable info found (incomplete
> backfills?), reverting to up
>
> Full log is here: http://pastebin.com/hZUBD9NT
>
> Do you have an idea what went wrong here? BTW, our firefly "prod"
> cluster suffered from the same network problem today, but all of those
> cluster's PGs recovered nicely.
> Does the hammer RC have different peering logic that might apply here?
>
> Thanks! Dan
>
>
>
>> Thanks :)
>>
>> Dan
>>
>> On Fri, Mar 13, 2015 at 12:59 PM, Sage Weil <sage@newdream.net> wrote:
>>> This looks a bit like the osds may have lost a message, actually. You can
>>> kick an individual pg to repeer with something like
>>>
>>> ceph osd pg-temp 75.45 6689
>>>
>>> See if that makes it go?
>>>
>>> sage
>>>
>>>
>>>
>>> On March 13, 2015 7:24:48 AM EDT, Dan van der Ster <dan@vanderster.com>
>>> wrote:
>>>> On Mon, Mar 9, 2015 at 4:47 PM, Gregory Farnum <greg@gregs42.com> wrote:
>>>>> On Mon, Mar 9, 2015 at 8:42 AM, Dan van der Ster <dan@vanderster.com>
>>>>> wrote:
>>>>>> Hi Sage,
>>>>>>
>>>>>> On Tue, Feb 10, 2015 at 2:51 AM, Sage Weil <sage@newdream.net> wrote:
>>>>>>> On Mon, 9 Feb 2015, David McBride wrote:
>>>>>>>> On 09/02/15 15:31, Gregory Farnum wrote:
>>>>>>>>
>>>>>>>>> So, memory
>>>>>>>>> usage of an OSD is usually linear in the number of PGs it
>>>>>>>>> hosts. However, that memory can also grow based on at least one
>>>>>>>>> other
>>>>>>>>> thing: the number of OSD Maps required to go through peering. It
>>>>>>>>> *looks* to me like this is what you're running in to, not growth on
>>>>>>>>> the number of state machines. In particular, those past_intervals
>>>>>>>>> you
>>>>>>>>> mentioned. ;)
>>>>>>>>
>>>>>>>> Hi Greg,
>>>>>>>>
>>>>>>>> Right, that sounds entirely plausible, and is very helpful.
>>>>>>>>
>>>>>>>> In practice, that means I'll need to be careful to avoid this
>>>>>>>> situation
>>>>>>>> occurring in production -- but given that's unlikely to occur except
>>>>>>>> in the
>>>>>>>> case of non-trivial neglect, I don't think I need be particularly
>>>>>>>> concerned.
>>>>>>>>
>>>>>>>> (Happily, I'm in the situation that my existing cluster is purely for
>>>>>>>> testing
>>>>>>>> purposes; the data is expendable.)
>>>>>>>>
>>>>>>>> That said, for my own peace of mind, it would be valuable to have a
>>>>>>>> procedure
>>>>>>>> that can be used to recover from this
>>>>>>>> state, even if it's unlikely to occur in
>>>>>>>> practice.
>>>>>>>
>>>>>>> The best luck I've had recovering from situations is something like:
>>>>>>>
>>>>>>> - stop all osds
>>>>>>> - osd set nodown
>>>>>>> - osd set nobackfill
>>>>>>> - osd set noup
>>>>>>> - set map cache size smaller to reduce memory footprint.
>>>>>>>
>>>>>>> osd map cache size = 50
>>>>>>> osd map max advance = 25
>>>>>>> osd map share max epochs = 25
>>>>>>> osd pg epoch persisted max stale = 25
>>>>>
>>>>> It can cause extreme slowness if you get into a failure situation and
>>>>> your OSDs need to calculate past intervals across more maps than will
>>>>> fit in the cache. :(
>>>>
>>>> .. extreme slowness or is it also possible to get into a situation
>>>> where the PGs are stuck incomplete forever?
>>>>
>>>> The reason I ask is because we actually had a network issue this
>>>> morning that left OSDs flapping and a lot of osdmap epoch churn. Now
>>>> our network has
>>>> stabilized but 10 PGs are incomplete, even though all
>>>> the OSDs are up. One PG looks like this, for example:
>>>>
>>>> pg 75.45 is stuck inactive for 87351.077529, current state incomplete,
>>>> last acting [6689,1919,2329]
>>>> pg 75.45 is stuck unclean for 87351.096198, current state incomplete,
>>>> last acting [6689,1919,2329]
>>>> pg 75.45 is incomplete, acting [6689,1919,2329]
>>>>
>>>> 1919 3.62000 osd.1919 up
>>>> 1.00000 1.00000
>>>> 2329 3.62000 osd.2329 up
>>>> 1.00000 1.00000
>>>> 6689 3.62000 osd.6689 up
>>>> 1.00000 1.00000
>>>>
>>>> The pg query output here: http://pastebin.com/WyTAU69W
>>>>
>>>> Is that a result of these short map caches or could it be something
>>>> else? (we're running 0.93-76-gc35f422)
>>>> WWGD (what would Greg do?) to activate these PGs?
>>>>
>>>> Thanks! Dan
>>>>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Bounding OSD memory requirements during peering/recovery
2015-03-13 20:42 ` Samuel Just
@ 2015-03-13 20:53 ` Samuel Just
2015-03-13 21:24 ` Dan van der Ster
0 siblings, 1 reply; 14+ messages in thread
From: Samuel Just @ 2015-03-13 20:53 UTC (permalink / raw)
To: Dan van der Ster, Sage Weil; +Cc: Gregory Farnum, David McBride, Ceph-devel
Also, are you certain that all were running the same version?
-Sam
On 03/13/2015 01:42 PM, Samuel Just wrote:
> I've opened a bug for this (http://tracker.ceph.com/issues/11110), I
> bet it's related to the new logic for allowing recovery below
> min_size. Exactly what sha1 was running on the osds during this time
> period?
> -Sam
>
> On 03/13/2015 08:36 AM, Dan van der Ster wrote:
>> On Fri, Mar 13, 2015 at 1:52 PM, Dan van der Ster
>> <dan@vanderster.com> wrote:
>>> Hi Sage,
>>>
>>> Losing a message would have been plausible given the network issue
>>> we had today.
>>>
>>> I tried:
>>>
>>> # ceph osd pg-temp 75.45 6689
>>> set 75.45 pg_temp mapping to [6689]
>>>
>>> then waited a bit. It's still incomplete -- the only difference is now
>>> I see two more past_intervals in the pg. Full query here:
>>> http://pastebin.com/TU7vVLpj
>>>
>>> I didn't have debug_osd above zero when I did that. Should I try again
>>> with debug_osd 20?
>> I tried again with logging. The pg goes like this:
>>
>> incomplete -> inactive -> remapped -> remapped+peering -> remapped ->
>> inactive -> peering -> incomplete
>>
>> The killer seems to be:
>>
>> 2015-03-13 16:15:43.476925 7f3c2e055700 10 osd.6689 pg_epoch: 67050
>> pg[75.45( v 66245'4028 (49044'1025,66245'4028] local-les=61515 n=3994
>> ec=48759 les/c 66791/66791 67037/67050/67037) [6689,1919,2329]/[6689]
>> r=0 lpr=67050 pi=66787-67049/13 crt=66226'4026 lcod 0'0 mlcod 0'0
>> remapped+peering] choose_acting no suitable info found (incomplete
>> backfills?), reverting to up
>>
>> Full log is here: http://pastebin.com/hZUBD9NT
>>
>> Do you have an idea what went wrong here? BTW, our firefly "prod"
>> cluster suffered from the same network problem today, but all of those
>> cluster's PGs recovered nicely.
>> Does the hammer RC have different peering logic that might apply here?
>>
>> Thanks! Dan
>>
>>
>>
>>> Thanks :)
>>>
>>> Dan
>>>
>>> On Fri, Mar 13, 2015 at 12:59 PM, Sage Weil <sage@newdream.net> wrote:
>>>> This looks a bit like the osds may have lost a message,
>>>> actually. You can
>>>> kick an individual pg to repeer with something like
>>>>
>>>> ceph osd pg-temp 75.45 6689
>>>>
>>>> See if that makes it go?
>>>>
>>>> sage
>>>>
>>>>
>>>>
>>>> On March 13, 2015 7:24:48 AM EDT, Dan van der Ster
>>>> <dan@vanderster.com>
>>>> wrote:
>>>>> On Mon, Mar 9, 2015 at 4:47 PM, Gregory Farnum <greg@gregs42.com>
>>>>> wrote:
>>>>>> On Mon, Mar 9, 2015 at 8:42 AM, Dan van der Ster
>>>>>> <dan@vanderster.com>
>>>>>> wrote:
>>>>>>> Hi Sage,
>>>>>>>
>>>>>>> On Tue, Feb 10, 2015 at 2:51 AM, Sage Weil <sage@newdream.net>
>>>>>>> wrote:
>>>>>>>> On Mon, 9 Feb 2015, David McBride wrote:
>>>>>>>>> On 09/02/15 15:31, Gregory Farnum wrote:
>>>>>>>>>
>>>>>>>>>> So, memory
>>>>>>>>>> usage of an OSD is usually linear in the number of PGs it
>>>>>>>>>> hosts. However, that memory can also grow based on at least
>>>>>>>>>> one
>>>>>>>>>> other
>>>>>>>>>> thing: the number of OSD Maps required to go through
>>>>>>>>>> peering. It
>>>>>>>>>> *looks* to me like this is what you're running in to, not
>>>>>>>>>> growth on
>>>>>>>>>> the number of state machines. In particular, those
>>>>>>>>>> past_intervals
>>>>>>>>>> you
>>>>>>>>>> mentioned. ;)
>>>>>>>>>
>>>>>>>>> Hi Greg,
>>>>>>>>>
>>>>>>>>> Right, that sounds entirely plausible, and is very helpful.
>>>>>>>>>
>>>>>>>>> In practice, that means I'll need to be careful to avoid this
>>>>>>>>> situation
>>>>>>>>> occurring in production -- but given that's unlikely to occur
>>>>>>>>> except
>>>>>>>>> in the
>>>>>>>>> case of non-trivial neglect, I don't think I need be
>>>>>>>>> particularly
>>>>>>>>> concerned.
>>>>>>>>>
>>>>>>>>> (Happily, I'm in the situation that my existing cluster is
>>>>>>>>> purely for
>>>>>>>>> testing
>>>>>>>>> purposes; the data is expendable.)
>>>>>>>>>
>>>>>>>>> That said, for my own peace of mind, it would be valuable to
>>>>>>>>> have a
>>>>>>>>> procedure
>>>>>>>>> that can be used to recover from this
>>>>>>>>> state, even if it's unlikely to occur in
>>>>>>>>> practice.
>>>>>>>>
>>>>>>>> The best luck I've had recovering from situations is
>>>>>>>> something like:
>>>>>>>>
>>>>>>>> - stop all osds
>>>>>>>> - osd set nodown
>>>>>>>> - osd set nobackfill
>>>>>>>> - osd set noup
>>>>>>>> - set map cache size smaller to reduce memory footprint.
>>>>>>>>
>>>>>>>> osd map cache size = 50
>>>>>>>> osd map max advance = 25
>>>>>>>> osd map share max epochs = 25
>>>>>>>> osd pg epoch persisted max stale = 25
>>>>>>
>>>>>> It can cause extreme slowness if you get into a failure
>>>>>> situation and
>>>>>> your OSDs need to calculate past intervals across more maps
>>>>>> than will
>>>>>> fit in the cache. :(
>>>>>
>>>>> .. extreme slowness or is it also possible to get into a situation
>>>>> where the PGs are stuck incomplete forever?
>>>>>
>>>>> The reason I ask is because we actually had a network issue this
>>>>> morning that left OSDs flapping and a lot of osdmap epoch churn. Now
>>>>> our network has
>>>>> stabilized but 10 PGs are incomplete, even though all
>>>>> the OSDs are up. One PG looks like this, for example:
>>>>>
>>>>> pg 75.45 is stuck inactive for 87351.077529, current state
>>>>> incomplete,
>>>>> last acting [6689,1919,2329]
>>>>> pg 75.45 is stuck unclean for 87351.096198, current state incomplete,
>>>>> last acting [6689,1919,2329]
>>>>> pg 75.45 is incomplete, acting [6689,1919,2329]
>>>>>
>>>>> 1919 3.62000 osd.1919 up
>>>>> 1.00000 1.00000
>>>>> 2329 3.62000 osd.2329 up
>>>>> 1.00000 1.00000
>>>>> 6689 3.62000 osd.6689 up
>>>>> 1.00000 1.00000
>>>>>
>>>>> The pg query output here: http://pastebin.com/WyTAU69W
>>>>>
>>>>> Is that a result of these short map caches or could it be something
>>>>> else? (we're running 0.93-76-gc35f422)
>>>>> WWGD (what would Greg do?) to activate these PGs?
>>>>>
>>>>> Thanks! Dan
>>>>>
>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Bounding OSD memory requirements during peering/recovery
2015-03-13 20:53 ` Samuel Just
@ 2015-03-13 21:24 ` Dan van der Ster
0 siblings, 0 replies; 14+ messages in thread
From: Dan van der Ster @ 2015-03-13 21:24 UTC (permalink / raw)
To: Samuel Just; +Cc: Sage Weil, Gregory Farnum, David McBride, Ceph-devel
Yup, all running 0.93-76-gc35f422 (from gitbuilder just after Sage merged the
latest straw2 fix...). I just uploaded the ceph.log to help understand
the issue. Let me know if I can help further :)
Thanks! Dan
On Fri, Mar 13, 2015 at 9:53 PM, Samuel Just <sjust@redhat.com> wrote:
> Also, are you certain that all were running the same version?
> -Sam
>
>
> On 03/13/2015 01:42 PM, Samuel Just wrote:
>>
>> I've opened a bug for this (http://tracker.ceph.com/issues/11110); I bet
>> it's related to the new logic for allowing recovery below min_size. Exactly
>> what sha1 was running on the osds during this time period?
>> -Sam
>>
>> On 03/13/2015 08:36 AM, Dan van der Ster wrote:
>>>
>>> On Fri, Mar 13, 2015 at 1:52 PM, Dan van der Ster <dan@vanderster.com>
>>> wrote:
>>>>
>>>> Hi Sage,
>>>>
>>>> Losing a message would have been plausible given the network issue we
>>>> had today.
>>>>
>>>> I tried:
>>>>
>>>> # ceph osd pg-temp 75.45 6689
>>>> set 75.45 pg_temp mapping to [6689]
>>>>
>>>> then waited a bit. It's still incomplete -- the only difference is now
>>>> I see two more past_intervals in the pg. Full query here:
>>>> http://pastebin.com/TU7vVLpj
>>>>
>>>> I didn't have debug_osd above zero when I did that. Should I try again
>>>> with debug_osd 20?
>>>
>>> I tried again with logging. The pg goes like this:
>>>
>>> incomplete -> inactive -> remapped -> remapped+peering -> remapped ->
>>> inactive -> peering -> incomplete
>>>
>>> The killer seems to be:
>>>
>>> 2015-03-13 16:15:43.476925 7f3c2e055700 10 osd.6689 pg_epoch: 67050
>>> pg[75.45( v 66245'4028 (49044'1025,66245'4028] local-les=61515 n=3994
>>> ec=48759 les/c 66791/66791 67037/67050/67037) [6689,1919,2329]/[6689]
>>> r=0 lpr=67050 pi=66787-67049/13 crt=66226'4026 lcod 0'0 mlcod 0'0
>>> remapped+peering] choose_acting no suitable info found (incomplete
>>> backfills?), reverting to up
>>>
>>> Full log is here: http://pastebin.com/hZUBD9NT
>>>
>>> Do you have an idea what went wrong here? BTW, our firefly "prod"
>>> cluster suffered from the same network problem today, but all of that
>>> cluster's PGs recovered nicely.
>>> Does the hammer RC have different peering logic that might apply here?
>>>
>>> Thanks! Dan
>>>
>>>
>>>
>>>> Thanks :)
>>>>
>>>> Dan
>>>>
>>>> On Fri, Mar 13, 2015 at 12:59 PM, Sage Weil <sage@newdream.net> wrote:
>>>>>
>>>>> This looks a bit like the osds may have lost a message, actually.
>>>>> You can kick an individual pg to repeer with something like
>>>>>
>>>>> ceph osd pg-temp 75.45 6689
>>>>>
>>>>> See if that makes it go?
>>>>>
>>>>> sage
>>>>>
>>>>>
>>>>>
>>>>> On March 13, 2015 7:24:48 AM EDT, Dan van der Ster <dan@vanderster.com>
>>>>> wrote:
>>>>>>
>>>>>> On Mon, Mar 9, 2015 at 4:47 PM, Gregory Farnum <greg@gregs42.com>
>>>>>> wrote:
>>>>>>>
>>>>>>> On Mon, Mar 9, 2015 at 8:42 AM, Dan van der Ster
>>>>>>> <dan@vanderster.com>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi Sage,
>>>>>>>>
>>>>>>>> On Tue, Feb 10, 2015 at 2:51 AM, Sage Weil <sage@newdream.net>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> On Mon, 9 Feb 2015, David McBride wrote:
>>>>>>>>>>
>>>>>>>>>> On 09/02/15 15:31, Gregory Farnum wrote:
>>>>>>>>>>
>>>>>>>>>>> So, memory usage of an OSD is usually linear in the number of
>>>>>>>>>>> PGs it hosts. However, that memory can also grow based on at
>>>>>>>>>>> least one other thing: the number of OSD Maps required to go
>>>>>>>>>>> through peering. It *looks* to me like this is what you're
>>>>>>>>>>> running into, not growth in the number of state machines. In
>>>>>>>>>>> particular, those past_intervals you mentioned. ;)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hi Greg,
>>>>>>>>>>
>>>>>>>>>> Right, that sounds entirely plausible, and is very helpful.
>>>>>>>>>>
>>>>>>>>>> In practice, that means I'll need to be careful to avoid this
>>>>>>>>>> situation occurring in production -- but given that's unlikely
>>>>>>>>>> to occur except in the case of non-trivial neglect, I don't
>>>>>>>>>> think I need be particularly concerned.
>>>>>>>>>>
>>>>>>>>>> (Happily, I'm in the situation that my existing cluster is
>>>>>>>>>> purely for testing purposes; the data is expendable.)
>>>>>>>>>>
>>>>>>>>>> That said, for my own peace of mind, it would be valuable to
>>>>>>>>>> have a procedure that can be used to recover from this state,
>>>>>>>>>> even if it's unlikely to occur in practice.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The best luck I've had recovering from this kind of situation is
>>>>>>>>> something like:
>>>>>>>>>
>>>>>>>>> - stop all osds
>>>>>>>>> - osd set nodown
>>>>>>>>> - osd set nobackfill
>>>>>>>>> - osd set noup
>>>>>>>>> - set map cache size smaller to reduce memory footprint.
>>>>>>>>>
>>>>>>>>> osd map cache size = 50
>>>>>>>>> osd map max advance = 25
>>>>>>>>> osd map share max epochs = 25
>>>>>>>>> osd pg epoch persisted max stale = 25
>>>>>>>
>>>>>>>
>>>>>>> It can cause extreme slowness if you get into a failure situation
>>>>>>> and your OSDs need to calculate past intervals across more maps
>>>>>>> than will fit in the cache. :(
>>>>>>
>>>>>>
>>>>>> .. extreme slowness or is it also possible to get into a situation
>>>>>> where the PGs are stuck incomplete forever?
>>>>>>
>>>>>> The reason I ask is because we actually had a network issue this
>>>>>> morning that left OSDs flapping and a lot of osdmap epoch churn. Now
>>>>>> our network has stabilized but 10 PGs are incomplete, even though all
>>>>>> the OSDs are up. One PG looks like this, for example:
>>>>>>
>>>>>> pg 75.45 is stuck inactive for 87351.077529, current state incomplete,
>>>>>> last acting [6689,1919,2329]
>>>>>> pg 75.45 is stuck unclean for 87351.096198, current state incomplete,
>>>>>> last acting [6689,1919,2329]
>>>>>> pg 75.45 is incomplete, acting [6689,1919,2329]
>>>>>>
>>>>>> 1919 3.62000 osd.1919 up 1.00000 1.00000
>>>>>> 2329 3.62000 osd.2329 up 1.00000 1.00000
>>>>>> 6689 3.62000 osd.6689 up 1.00000 1.00000
>>>>>>
>>>>>> The pg query output here: http://pastebin.com/WyTAU69W
>>>>>>
>>>>>> Is that a result of these short map caches or could it be something
>>>>>> else? (we're running 0.93-76-gc35f422)
>>>>>> WWGD (what would Greg do?) to activate these PGs?
>>>>>>
>>>>>> Thanks! Dan
>
>
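The point quoted above about past intervals spanning more maps than fit in the cache can be checked against the debug line from osd.6689. The Python sketch below is a rough aid, not a diagnosis. It assumes the "pi=66787-67049/13" field reads as first epoch, last epoch and interval count, counts epochs inclusively, and borrows the value 50 from the "osd map cache size" suggestion quoted above; whether that value was in effect on this cluster is not stated anywhere in the thread. The names PI_RE, past_interval_span and fits_in_cache are illustrative only.

```python
import re

# Assumed layout of the field: pi=<first epoch>-<last epoch>/<interval count>
PI_RE = re.compile(r"\bpi=(\d+)-(\d+)/(\d+)")

# Value taken from the "osd map cache size = 50" suggestion quoted above.
# Assumption: not confirmed to be the setting actually used on this cluster.
SUGGESTED_MAP_CACHE_SIZE = 50


def past_interval_span(log_line):
    """Extract the past-interval range from a PG debug line, or None if absent."""
    m = PI_RE.search(log_line)
    if m is None:
        return None
    first, last, count = (int(g) for g in m.groups())
    return {
        "first": first,
        "last": last,
        "intervals": count,
        # Inclusive count of osdmap epochs covered by the range.
        "epochs": last - first + 1,
    }


def fits_in_cache(span, cache_size=SUGGESTED_MAP_CACHE_SIZE):
    """True if every epoch in the span could sit in a cache of cache_size maps."""
    return span["epochs"] <= cache_size


# Fragment of the osd.6689 line quoted in this message.
LINE = "r=0 lpr=67050 pi=66787-67049/13 crt=66226'4026 lcod 0'0 mlcod 0'0"
SPAN = past_interval_span(LINE)
```

Under those assumptions SPAN covers 263 epochs in 13 intervals, so fits_in_cache(SPAN) is False for a 50-map cache. That is consistent with the slowness described above, but it does not by itself explain why the PG stays incomplete.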
end of thread, newest message: 2015-03-13 21:24 UTC
Thread overview: 14+ messages
2015-02-08 16:05 Bounding OSD memory requirements during peering/recovery David McBride
2015-02-08 20:05 ` David McBride
2015-02-09 10:38 ` David McBride
2015-02-09 15:31 ` Gregory Farnum
2015-02-09 21:36 ` David McBride
2015-02-10 1:51 ` Sage Weil
2015-03-09 15:42 ` Dan van der Ster
2015-03-09 15:47 ` Gregory Farnum
2015-03-13 11:24 ` Dan van der Ster
[not found] ` <f943965c-b279-4e5f-ac47-1dc6443e594d@email.android.com>
2015-03-13 12:52 ` Dan van der Ster
2015-03-13 15:36 ` Dan van der Ster
2015-03-13 20:42 ` Samuel Just
2015-03-13 20:53 ` Samuel Just
2015-03-13 21:24 ` Dan van der Ster