Bounding OSD memory requirements during peering/recovery

* Bounding OSD memory requirements during peering/recovery
@ 2015-02-08 16:05 David McBride
  2015-02-08 20:05 ` David McBride
  2015-02-09 15:31 ` Gregory Farnum
  0 siblings, 2 replies; 14+ messages in thread
From: David McBride @ 2015-02-08 16:05 UTC
  To: Ceph-devel

Hello,

I'm trying to understand the memory requirements for a Ceph node,
particularly when it is undergoing recovery.

Comments, suggestions, pointers are all welcome.

(This is my second attempt at sending this email; it appeared to get 
eaten the first time — probably because it had a 1MB .heap file attached.)

Background:
==========

I've got a fairly tortured prototype Ceph cluster.  It was left
unattended for several months while I was needed elsewhere, but I'm
now returning to it with an eye to continuing to build production
services on it, if I have sufficient confidence in its capabilities.

In the intervening time, several root filesystems on cluster nodes went
full (because of poorly configured logging, as well as MONs being
co-located with OSDs for expediency) and several drives were also
unceremoniously pulled out for reuse elsewhere.

A subsequent recovery is proving problematic: if all OSDs are started
concurrently, they are substantially exceeding the amount of RAM
available on the hosts during peering, and are being killed off by the
kernel OOM killer.

(They are then restarted by Upstart, resulting in thrashing for a
while, until something unknown goes awry and the machine stops sending
telemetry and no longer responds to SSH.  That's a separate problem.)

Looking at tcmalloc-accounted heap statistics, I've seen individual OSDs
using 9GB+ of RAM; looking at per-process RSS on individual machines,
I've seen process images exceeding 16GB.  On 12-disk machines with 32GB
of RAM each, this is problematic.
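
To make the scale of the mismatch concrete, here is a minimal sketch of the per-OSD memory budget implied by the figures above.  The helper name and the `reserved_gb` parameter are hypothetical; only the 32GB, 12-disk and 9GB figures come from the text.

```python
# Rough per-OSD memory budget for one host, using the figures quoted above.
# `reserved_gb` is a hypothetical allowance for the OS and other daemons.

def per_osd_budget_gb(host_ram_gb: float, osds_per_host: int,
                      reserved_gb: float = 0.0) -> float:
    """Evenly divide the RAM left after the reservation among the OSDs."""
    return (host_ram_gb - reserved_gb) / osds_per_host

HOST_RAM_GB = 32          # RAM per host, from the text
OSDS_PER_HOST = 12        # one OSD per disk, from the text
OBSERVED_HEAP_GB = 9      # "9GB+" seen in tcmalloc statistics

budget = per_osd_budget_gb(HOST_RAM_GB, OSDS_PER_HOST)
fit = HOST_RAM_GB // OBSERVED_HEAP_GB

print(f"even share per OSD: {budget:.2f} GB")
print(f"OSDs that fit at {OBSERVED_HEAP_GB} GB each: {fit} of {OSDS_PER_HOST}")
```

With no reservation at all, each OSD's even share is well under 3GB, so only a few OSDs at the observed heap size can coexist on one host.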

So, I've started looking at the data-structures and algorithms that
govern OSD recovery.  I've found the following references:

  http://ceph.com/docs/master/dev/placement-group/
  http://ceph.com/docs/master/dev/peering/
  http://ceph.com/docs/master/rados/operations/monitoring-osd-pg/
  http://ceph.com/docs/master/dev/osd_internals/map_message_handling/
  http://dachary.org/?p=2061

… and hope to develop an understanding of an upper bound on memory
utilization that an efficient implementation of the algorithms described
would require.

I've also been trying to collect memory profiles for OSD processes as
they're operating, to compare theory with reality.

Memory profiling:
================

For example, having found an OSD using ~6GB of memory, I turned on heap
profiling, and dumped its state using `ceph tell osd.N heap
start_profiler; ceph tell osd.N heap dump`:

> ------------------------------------------------
> MALLOC:     6167528240 ( 5881.8 MiB) Bytes in use by application
> MALLOC: +     18309120 (   17.5 MiB) Bytes in page heap freelist
> MALLOC: +     39689152 (   37.9 MiB) Bytes in central cache freelist
> MALLOC: +      4750960 (    4.5 MiB) Bytes in transfer cache freelist
> MALLOC: +     25223840 (   24.1 MiB) Bytes in thread cache freelists
> MALLOC: +     27603096 (   26.3 MiB) Bytes in malloc metadata
> MALLOC:   ------------
> MALLOC: =   6283104408 ( 5992.0 MiB) Actual memory used (physical + swap)
> MALLOC: +      2080768 (    2.0 MiB) Bytes released to OS (aka unmapped)
> MALLOC:   ------------
> MALLOC: =   6285185176 ( 5994.0 MiB) Virtual address space used
> MALLOC:
> MALLOC:         374907              Spans in use
> MALLOC:            335              Thread heaps in use
> MALLOC:           8192              Tcmalloc page size
> ------------------------------------------------

However, the heap dumps so generated only appear to show memory
allocations (made? touched?) since heap profiling was enabled:

> google-pprof --text /usr/bin/ceph-osd osd.25.profile.0001.heap
> Using local file /usr/bin/ceph-osd.
> Using local file osd.25.profile.0001.heap.
> Total: 0.0 MB
>      0.0  46.7%  46.7%      0.0  59.0% SimpleMessenger::add_accept_pipe
> [...]

Note the "Total: 0.0 MB", which differs wildly from the stats reported by
tcmalloc and from the RSS of the process reported by the kernel.

So, for testing purposes, I selectively started up ~20% of the OSDs,
each invoked with the setting

   CEPH_HEAP_PROFILER_INIT=1

… defined in their environment to cause the heap profiler to be
started at OSD start-time.  This has a significant CPU and memory
overhead.

Also set were the cluster flags:

   noout,nobackfill,norecover,noscrub,nodeep-scrub

… to avoid commingling memory requirements due to peering with other
factors.

I've produced a number of .heap files which show >= 1000MB of memory
allocated in an RB tree as a result of
PG::RecoveryState::RecoveryMachine::send_notify, PG::read_info and
MOSDPGNotify::decode_payload (or descendants).

An example heapfile from a fairly typical OSD can currently be fetched from:

   http://people.ds.cam.ac.uk/dwm37/tmp/osd.0.profile.0124.heap

This was produced by the binaries from the Ceph 'trusty' repository; 
`ceph -v` returns:

> ceph version 0.92 (00a3ac3b67d93860e7f0b6e07319f11b14d0fec0)

Running pprof in interactive mode and running `top30 --cum` on this 
heapfile reports:

> Total: 2172.3 MB
>   1705.9  78.5%  78.5%   1748.4  80.5% __gnu_cxx::new_allocator::construct (inline)
>      0.0   0.0%  78.5%   1600.7  73.7% std::_Rb_tree::_M_create_node (inline)
>      0.0   0.0%  78.5%   1367.9  63.0% start_thread
>      0.0   0.0%  78.5%   1367.6  63.0% ioperm
>      0.0   0.0%  78.5%    963.4  44.4% ThreadPool::worker
>      0.0   0.0%  78.5%    963.3  44.3% ThreadPool::WorkThread::entry
>      0.0   0.0%  78.5%    951.0  43.8% OSD::process_peering_events
>      0.0   0.0%  78.5%    950.9  43.8% OSD::PeeringWQ::_process
>      0.0   0.0%  78.5%    949.8  43.7% PG::RecoveryState::handle_event (inline)
>      0.0   0.0%  78.5%    949.8  43.7% boost::statechart::detail::send_function::operator (inline)
>      0.0   0.0%  78.5%    949.8  43.7% boost::statechart::simple_state::react_impl
>      0.0   0.0%  78.5%    949.8  43.7% boost::statechart::state_machine::process_event (inline)
>      0.0   0.0%  78.5%    949.8  43.7% boost::statechart::state_machine::send_event
>      0.0   0.0%  78.5%    949.8  43.7% local_react (inline)
>      0.0   0.0%  78.5%    949.8  43.7% local_react_impl (inline)
>      0.0   0.0%  78.5%    949.8  43.7% operator (inline)
>      0.0   0.0%  78.5%    949.8  43.7% react (inline)
>      0.0   0.0%  78.5%    948.5  43.7% std::vector::push_back (inline)
>      0.0   0.0%  78.5%    948.3  43.7% PG::RecoveryState::RecoveryMachine::send_notify
>      0.0   0.0%  78.5%    947.1  43.6% std::vector::_M_insert_aux
>      0.0   0.0%  78.5%    947.0  43.6% _Rb_tree (inline)
>      0.0   0.0%  78.5%    947.0  43.6% map (inline)
>      0.0   0.0%  78.5%    947.0  43.6% std::_Rb_tree::_M_clone_node (inline)
>      0.0   0.0%  78.5%    947.0  43.6% std::_Rb_tree::_M_copy
>      0.0   0.0%  78.5%    809.8  37.3% construct (inline)
>      0.0   0.0%  78.5%    808.4  37.2% std::pair::pair
>      0.0   0.0%  78.5%    804.2  37.0% __libc_start_main
>      0.0   0.0%  78.5%    804.2  37.0% _start
>      0.0   0.0%  78.5%    804.2  37.0% main
>      0.0   0.0%  78.5%    803.6  37.0% OSD::init

This appears to show a large amount of memory — nearly a gigabyte — 
allocated by boost::statechart, which is slightly surprising as the FAQ 
for boost::statechart quotes a ~1KB memory footprint per state-machine:

http://www.boost.org/doc/libs/1_35_0/libs/statechart/doc/faq.html#EmbeddedApplications

Perhaps something unexpected is happening here?  I'm almost hoping that
statechart is being subtly misused or misconfigured in some way that,
if fixed, would result in a significant drop in memory utilization…!
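
One way to read the listing above: in pprof's text report, the first column is memory allocated directly by a function ("flat") and the fourth is memory allocated by that function plus everything it calls ("cumulative").  The sketch below parses a few lines copied from the report and separates the two.  The parser and its names are illustrative and not part of pprof.

```python
import re
from typing import List, NamedTuple

class Frame(NamedTuple):
    flat_mb: float   # allocated directly in this function
    cum_mb: float    # allocated in this function and its callees
    symbol: str

# Columns: flat, flat%, sum%, cum, cum%, symbol
LINE = re.compile(
    r"^\s*([\d.]+)\s+[\d.]+%\s+[\d.]+%\s+([\d.]+)\s+[\d.]+%\s+(.+?)\s*$")

def parse_top(text: str) -> List[Frame]:
    """Parse `pprof --text` / `top --cum` rows, skipping non-matching lines."""
    frames = []
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            frames.append(Frame(float(m.group(1)), float(m.group(2)),
                                m.group(3)))
    return frames

# A few rows copied from the report above.
SAMPLE = """\
Total: 2172.3 MB
  1705.9  78.5%  78.5%   1748.4  80.5% __gnu_cxx::new_allocator::construct (inline)
     0.0   0.0%  78.5%    949.8  43.7% boost::statechart::state_machine::send_event
     0.0   0.0%  78.5%    948.3  43.7% PG::RecoveryState::RecoveryMachine::send_notify
     0.0   0.0%  78.5%    947.0  43.6% std::_Rb_tree::_M_copy
"""

frames = parse_top(SAMPLE)
direct = [f.symbol for f in frames if f.flat_mb > 0]
callers_only = [f.symbol for f in frames if f.flat_mb == 0]
print("allocating directly:", direct)
print("on the call path only:", callers_only)
```

In these rows the statechart frames have a flat value of 0.0 and nearly the same cumulative value as send_notify and the map copy beneath them.  That suggests statechart appears here as a caller on the stack rather than as the owner of the memory.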

Quantifying problem-size:
========================

Given that the log-merging stage of PG recovery appears to be the
expensive part, I queried the statistics of those PGs which seemed to
be taking a long time to peer, via `ceph pg <pgid> query`.

These showed that, for at least a handful of those PGs, the
recovery_state past_intervals list contained on the order of 200-300
entries.

(I have no feel as to whether this is excessive.)
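
As a sketch of how such a count could be taken from saved query output, the snippet below walks a JSON document and reports the longest past_intervals list.  The field layout (a `recovery_state` array whose entries may carry a `past_intervals` array) is an assumption about the output format, and the sample document is synthetic.

```python
import json

def count_past_intervals(pg_query: dict) -> int:
    """Return the length of the longest past_intervals list found in any
    recovery_state entry, or 0 if there is none."""
    counts = [
        len(state["past_intervals"])
        for state in pg_query.get("recovery_state", [])
        if isinstance(state.get("past_intervals"), list)
    ]
    return max(counts, default=0)

# Synthetic stand-in for saved `ceph pg <pgid> query` output.
# The field layout is assumed, not taken from a real cluster.
sample = json.loads("""
{
  "state": "peering",
  "recovery_state": [
    {"name": "Started/Primary/Peering",
     "past_intervals": [
       {"first": 100, "last": 120},
       {"first": 121, "last": 130},
       {"first": 131, "last": 145}
     ]},
    {"name": "Started"}
  ]
}
""")

print(count_past_intervals(sample))
```

Running this over the saved output for every slow-peering PG would show whether the 200-300 figure is typical or confined to a few PGs.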

Unused memory:
=============

One thing I note is that I still sometimes see OSDs with large fractions 
of their memory allocation sitting on the tcmalloc freelist, e.g.:

> osd.0 tcmalloc heap stats:------------------------------------------------
> MALLOC:     2226810584 ( 2123.7 MiB) Bytes in use by application
> MALLOC: +   1421361152 ( 1355.5 MiB) Bytes in page heap freelist
> MALLOC: +     41864920 (   39.9 MiB) Bytes in central cache freelist
> MALLOC: +      5215680 (    5.0 MiB) Bytes in transfer cache freelist
> MALLOC: +     18508944 (   17.7 MiB) Bytes in thread cache freelists
> MALLOC: +     16216216 (   15.5 MiB) Bytes in malloc metadata
> MALLOC:   ------------
> MALLOC: =   3729977496 ( 3557.2 MiB) Actual memory used (physical + swap)
> MALLOC: +     32792576 (   31.3 MiB) Bytes released to OS (aka unmapped)
> MALLOC:   ------------
> MALLOC: =   3762770072 ( 3588.5 MiB) Virtual address space used
> MALLOC:
> MALLOC:         144565              Spans in use
> MALLOC:            225              Thread heaps in use
> MALLOC:           8192              Tcmalloc page size
> ------------------------------------------------

This is despite having:

   TCMALLOC_RELEASE_RATE=10

… set in the environment of each OSD process.  This doesn't help with
contention for RAM between processes!

(I have mentioned this before, though hadn't at that time yet tried 
running OSDs with TCMALLOC_RELEASE_RATE. See also:

   http://www.spinics.net/lists/ceph-devel/msg18769.html

… for history.

Note for anyone intending to reproduce this experiment: Upstart 
overrides should be written to a file named 
/etc/init/ceph-{osd,mon}.override, not ceph-{osd,mon}.conf.override as I 
incorrectly specified previously.)
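
To put a number on the idle memory, this minimal sketch parses the osd.0 statistics quoted above and sums the freelist lines.  The parser is illustrative; the figures are copied from the dump.

```python
import re

# Matches lines such as:
#   MALLOC: +   1421361152 ( 1355.5 MiB) Bytes in page heap freelist
STAT = re.compile(r"MALLOC:\s*[+=]?\s*(\d+)\s*\(\s*[\d.]+ MiB\)\s*(.+)$")

def parse_heap_stats(text: str) -> dict:
    """Map each tcmalloc statistic's description to its byte count."""
    stats = {}
    for line in text.splitlines():
        m = STAT.search(line)
        if m:
            stats[m.group(2).strip()] = int(m.group(1))
    return stats

SAMPLE = """\
MALLOC:     2226810584 ( 2123.7 MiB) Bytes in use by application
MALLOC: +   1421361152 ( 1355.5 MiB) Bytes in page heap freelist
MALLOC: +     41864920 (   39.9 MiB) Bytes in central cache freelist
MALLOC: +      5215680 (    5.0 MiB) Bytes in transfer cache freelist
MALLOC: +     18508944 (   17.7 MiB) Bytes in thread cache freelists
MALLOC: +     16216216 (   15.5 MiB) Bytes in malloc metadata
MALLOC: =   3729977496 ( 3557.2 MiB) Actual memory used (physical + swap)
"""

stats = parse_heap_stats(SAMPLE)
free = sum(v for k, v in stats.items() if "freelist" in k)
total = stats["Actual memory used (physical + swap)"]
pct = 100 * free / total
print(f"{free / 2**20:.1f} MiB on freelists, {pct:.1f}% of memory in use")
```

For this dump, roughly 40% of the memory tcmalloc holds is sitting on freelists rather than in use by the application.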

Leak detection:
==============

Not yet being familiar with the data structures or algorithms that
govern PG recovery, it's not clear to me whether this memory usage is
expected for a 120-OSD cluster with 2048 PGs, or whether there might
be some variety of leak (or inefficient memory-use pattern).
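
For a sense of scale, the sketch below estimates how many PG copies each OSD hosts in a cluster of this size.  The replica count is not stated above, so the values tried here are assumptions, as is treating 2048 as the cluster-wide PG total and all OSDs as up and evenly loaded.

```python
def pg_copies_per_osd(pg_count: int, osd_count: int, replicas: int) -> float:
    """Average number of PG copies hosted per OSD, assuming an even spread."""
    return pg_count * replicas / osd_count

PG_COUNT = 2048    # from the text
OSD_COUNT = 120    # from the text

# Replica counts are assumptions; the text does not give the pool size.
for replicas in (2, 3):
    avg = pg_copies_per_osd(PG_COUNT, OSD_COUNT, replicas)
    print(f"size={replicas}: about {avg:.1f} PG copies per OSD")
```

If fewer OSDs are actually running, as in the partial start-up described earlier, the per-OSD share rises accordingly.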

It doesn't help that I'm not a C++ hacker. :-)

Reading around the subject, I came across LeakSanitizer, a Clang/LLVM
facility:

  https://code.google.com/p/address-sanitizer/wiki/LeakSanitizer

… as well as ticket #9756, which suggests using Clang's other static
analysis capabilities to help flag potentially problematic code:

  http://tracker.ceph.com/issues/9756

I might spend some time this weekend to see if I can help advance that
ticket.

(I note that http://ceph.com/gitbuilders.cgi now returns 404; perhaps
that has been superseded by some Red Hat-internal facility?)

Cheers,
David
-- 
David McBride <dwm37@cam.ac.uk>
Unix Specialist, University Information Services
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

* Re: Bounding OSD memory requirements during peering/recovery
  2015-02-08 16:05 Bounding OSD memory requirements during peering/recovery David McBride
@ 2015-02-08 20:05 ` David McBride
  2015-02-09 10:38   ` David McBride
  2015-02-09 15:31 ` Gregory Farnum
  1 sibling, 1 reply; 14+ messages in thread
From: David McBride @ 2015-02-08 20:05 UTC
  To: Ceph-devel

On 08/02/15 16:05, David McBride wrote:

> Reading around the subject, I came across LeakSanitizer, a Clang/LLVM
> facility:
>
>   https://code.google.com/p/address-sanitizer/wiki/LeakSanitizer
>
> … as well as ticket #9756, which suggests using Clang's other static
> analysis capabilities to help flag potentially problematic code:
>
>   http://tracker.ceph.com/issues/9756

I've gone ahead and implemented this.  I've submitted a pull request via
GitHub, visible here:

  https://github.com/ceph/autobuild-ceph/pull/22

I've not tried to replicate the gitbuilder environment directly, so 
these changes are untested, though should work — at least, once 
someone's added 'clang' to the list of packages to be autoprovisioned!

Cheers,
David
-- 
David McBride <dwm37@cam.ac.uk>
Unix Specialist, University Information Services

* Re: Bounding OSD memory requirements during peering/recovery
  2015-02-08 20:05 ` David McBride
@ 2015-02-09 10:38   ` David McBride
  0 siblings, 0 replies; 14+ messages in thread
From: David McBride @ 2015-02-09 10:38 UTC
  To: Ceph-devel

On 08/02/15 20:05, David McBride wrote:

>   https://github.com/ceph/autobuild-ceph/pull/22
>
> I've not tried to replicate the gitbuilder environment directly, so
> these changes are untested, though should work — at least, once
> someone's added 'clang' to the list of packages to be autoprovisioned!

I've now updated this pull request.  It now also:

  * Updates fabfile.py to cause clang (and clang-analyzer on RPM
    machines) to be installed prior to builds.

  * Adds the '-analyze' hostname affix, which causes Ceph to be built
    with the 'scan-build' static-analysis wrapper.  As a side-effect of
    compilation, a static analysis of Ceph's code is also run; the
    resulting report is deposited in scan-build.tmp/.

  * Tweaks the environment of clang builds so that they shouldn't
    generate spurious errors when run with versions of
    ccache < 3.2.

Cheers,
David
-- 
David McBride <dwm37@cam.ac.uk>
Unix Specialist, University Information Services

* Re: Bounding OSD memory requirements during peering/recovery
  2015-02-08 16:05 Bounding OSD memory requirements during peering/recovery David McBride
  2015-02-08 20:05 ` David McBride
@ 2015-02-09 15:31 ` Gregory Farnum
  2015-02-09 21:36   ` David McBride
  1 sibling, 1 reply; 14+ messages in thread
From: Gregory Farnum @ 2015-02-09 15:31 UTC
  To: David McBride; +Cc: Ceph-devel

Right.

So, memory usage of an OSD is usually linear in the number of PGs it
hosts. However, that memory can also grow based on at least one other
thing: the number of OSD Maps required to go through peering. It
*looks* to me like this is what you're running into, not growth in
the number of state machines. In particular, those past_intervals you
mentioned. ;)

Anyway, I'm afraid I don't have any magic cure-all for you. This kind
of long-term dirtied Ceph cluster is something I've only seen once or
twice and I've never led a recovery on them. But the effort usually
involves, as you've done, limiting the number of OSDs per host that
are doing recovery at once (which probably means starting one OSD at a
time until stability, rather than one per host!), disabling recovery
(as you've already done), ...and occasionally hacking up the map
history. :/

Good luck!
-Greg

* Re: Bounding OSD memory requirements during peering/recovery
  2015-02-09 15:31 ` Gregory Farnum
@ 2015-02-09 21:36   ` David McBride
  2015-02-10  1:51     ` Sage Weil
  0 siblings, 1 reply; 14+ messages in thread
From: David McBride @ 2015-02-09 21:36 UTC
  To: Gregory Farnum; +Cc: Ceph-devel

On 09/02/15 15:31, Gregory Farnum wrote:

> So, memory usage of an OSD is usually linear in the number of PGs it
> hosts. However, that memory can also grow based on at least one other
> thing: the number of OSD Maps required to go through peering. It
> *looks* to me like this is what you're running into, not growth in
> the number of state machines. In particular, those past_intervals you
> mentioned. ;)

Hi Greg,

Right, that sounds entirely plausible, and is very helpful.

In practice, that means I'll need to be careful to avoid this situation 
occurring in production — but given that's unlikely to occur except in 
the case of non-trivial neglect, I don't think I need be particularly 
concerned.

(Happily, I'm in the situation that my existing cluster is purely for 
testing purposes; the data is expendable.)

That said, for my own peace of mind, it would be valuable to have a 
procedure that can be used to recover from this state, even if it's 
unlikely to occur in practice.

I'm currently running an experiment where I augment the RAM of each OSD 
node with 10GB swapfiles on each spinning OSD disk, so that there's a 
big-enough backing-store to complete log reconstruction.

(You obviously wouldn't want to operate in this manner during normal 
production operation — the loss of a single drive would cause a hard 
machine-crash, and the performance will be fairly diabolical, 
particularly if you allow client workloads to carry on in the background.)

I did try enabling zswap on the Utopic LTS kernel as supplied as an 
option in Ubuntu 14.04; however, the kernel was not stable in such a 
configuration and several machines crashed under memory pressure.

I do have OSDs committing suicide periodically, probably because they're 
insufficiently responsive to heartbeats as they start to hit swap.  This 
is before experimenting with the various OSD tuning dials for timeouts, 
so some improvement may be possible.

In the meantime, I've configured the ceph-osd Upstart jobs to apply a 
post-exec command of `sleep 3600` to reduce the rate at which they're 
respawned.

So far, the resulting configuration seems to be making progress, albeit 
moderately slowly.

Cheers,
David
-- 
David McBride <dwm37@cam.ac.uk>
Unix Specialist, University Information Services

* Re: Bounding OSD memory requirements during peering/recovery
  2015-02-09 21:36   ` David McBride
@ 2015-02-10  1:51     ` Sage Weil
  2015-03-09 15:42       ` Dan van der Ster
  0 siblings, 1 reply; 14+ messages in thread
From: Sage Weil @ 2015-02-10  1:51 UTC
  To: David McBride; +Cc: Gregory Farnum, Ceph-devel

On Mon, 9 Feb 2015, David McBride wrote:
> On 09/02/15 15:31, Gregory Farnum wrote:
> 
> > So, memory usage of an OSD is usually linear in the number of PGs it
> > hosts. However, that memory can also grow based on at least one other
> > thing: the number of OSD Maps required to go through peering. It
> > *looks* to me like this is what you're running into, not growth in
> > the number of state machines. In particular, those past_intervals you
> > mentioned. ;)
> 
> Hi Greg,
> 
> Right, that sounds entirely plausible, and is very helpful.
> 
> In practice, that means I'll need to be careful to avoid this situation
> occurring in production -- but given that's unlikely to occur except in the
> case of non-trivial neglect, I don't think I need be particularly concerned.
> 
> (Happily, I'm in the situation that my existing cluster is purely for testing
> purposes; the data is expendable.)
> 
> That said, for my own peace of mind, it would be valuable to have a procedure
> that can be used to recover from this state, even if it's unlikely to occur in
> practice.

The best luck I've had recovering from situations like this is with
something like:

- stop all osds
- ceph osd set nodown
- ceph osd set nobackfill
- ceph osd set noup
- set the map cache size smaller to reduce memory footprint:

  osd map cache size = 50
  osd map max advance = 25
  osd map share max epochs = 25
  osd pg epoch persisted max stale = 25

(basically, keep most of those values in sync, and smaller than 
the map cache)

- start all osds, let them catch up on their maps.  (if they can't fit in 
memory at this point then another creative solution will be needed)
- unset noup so that everyone peers at once

It may also help to try to match the in/out state with where the data 
actually resides (i.e. mark an osd back in if it was marked out but the 
cluster didn't rebalance).
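
As a rough illustration of the "keep most of those values in sync, and
smaller than the map cache" rule, here is a minimal Python sketch. The
helper name render_osd_map_settings and the choice of half the cache size
are assumptions made for the example; only the four option names and the
50/25 values come from the settings listed above.

```python
def render_osd_map_settings(cache_size=50):
    """Render an [osd] ceph.conf fragment for a given map cache size.

    Assumption: the three dependent options are set to half the cache
    size, mirroring the 50 -> 25 ratio in the settings quoted above.
    """
    if cache_size < 2:
        raise ValueError("cache_size must be at least 2")
    dependent = cache_size // 2
    settings = {
        "osd map cache size": cache_size,
        "osd map max advance": dependent,
        "osd map share max epochs": dependent,
        "osd pg epoch persisted max stale": dependent,
    }
    lines = ["[osd]"]
    lines += [f"  {key} = {value}" for key, value in settings.items()]
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_osd_map_settings(50))
```

Deriving the dependent values from one number makes it harder to end up
with an option that is accidentally larger than the cache.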

> I'm currently running an experiment where I augment the RAM of each OSD node
> with 10GB swapfiles on each spinning OSD disk, so that there's a big-enough
> backing-store to complete log reconstruction.

Swap tends not to work very well. Make sure nodown is set if you have to
go this route, or else OSDs will get marked down when they miss
heartbeats.

sage



* Re: Bounding OSD memory requirements during peering/recovery
  2015-02-10  1:51     ` Sage Weil
@ 2015-03-09 15:42       ` Dan van der Ster
  2015-03-09 15:47         ` Gregory Farnum
  0 siblings, 1 reply; 14+ messages in thread
From: Dan van der Ster @ 2015-03-09 15:42 UTC (permalink / raw)
  To: Sage Weil; +Cc: David McBride, Gregory Farnum, Ceph-devel

Hi Sage,

On Tue, Feb 10, 2015 at 2:51 AM, Sage Weil <sage@newdream.net> wrote:
> On Mon, 9 Feb 2015, David McBride wrote:
>> On 09/02/15 15:31, Gregory Farnum wrote:
>>
>> > So, memory usage of an OSD is usually linear in the number of PGs it
>> > hosts. However, that memory can also grow based on at least one other
>> > thing: the number of OSD Maps required to go through peering. It
>> > *looks* to me like this is what you're running in to, not growth on
>> > the number of state machines. In particular, those past_intervals you
>> > mentioned. ;)
>>
>> Hi Greg,
>>
>> Right, that sounds entirely plausible, and is very helpful.
>>
>> In practice, that means I'll need to be careful to avoid this situation
>> occurring in production, but given that's unlikely to occur except in the
>> case of non-trivial neglect, I don't think I need be particularly concerned.
>>
>> (Happily, I'm in the situation that my existing cluster is purely for testing
>> purposes; the data is expendable.)
>>
>> That said, for my own peace of mind, it would be valuable to have a procedure
>> that can be used to recover from this state, even if it's unlikely to occur in
>> practice.
>
> The best luck I've had recovering from situations is something like:
>
> - stop all osds
> - osd set nodown
> - osd set nobackfill
> - osd set noup
> - set map cache size smaller to reduce memory footprint.
>
>   osd map cache size = 50
>   osd map max advance = 25
>   osd map share max epochs = 25
>   osd pg epoch persisted max stale = 25
>

The settings above have proven very useful when setting up
some of our new OSD servers with not much memory per OSD: 64GB RAM for
48x4TB OSDs.
Prior to applying these settings (plus one more, below) we were seeing
memory usage around 2-3GB per OSD when they were freshly created. After a
restart the processes stayed under 300-400MB.

It seems that the initial bootstrapping (fetching the most recent 500
osdmaps in batches of 100 at a time) causes the osd map cache to
exceed its 50-entry limit, and that memory is then never freed. We
found that to fix this we also had to lower the "osd map message max"
setting on the mons; that way, OSD memory stays under
500MB per process.

Currently we're happily running a large [1] number of OSDs with the
following configuration:

[global]
   osd map message max = 10

[osd]
   osd map cache size = 20
   osd map max advance = 10
   osd map share max epochs = 10
   osd pg epoch persisted max stale = 10

and the memory consumption is 400-500MB per process, even during
backfilling. So far we haven't seen any drawbacks to this
configuration. Should we expect any problems if we continue with this
small osdmap cache permanently?

Best Regards,
Dan

[1] "large" in this case means the osdmap is 4.6MB in size
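
The figures in this message can be put side by side with a little
arithmetic. The Python sketch below is only a back-of-envelope estimate:
it assumes each cached osdmap costs about its quoted 4.6MB in memory and
that 1GB is 1024MB, neither of which is stated in the thread. The function
names are made up for the example.

```python
# Figures quoted in the message above.
OSDMAP_MB = 4.6        # osdmap size, footnote [1]
HOST_RAM_GB = 64       # RAM per OSD server
OSDS_PER_HOST = 48     # OSDs per server


def cached_maps_mb(num_maps, map_mb=OSDMAP_MB):
    """Approximate memory held by `num_maps` cached osdmaps, in MB.

    Assumption: each cached map costs about its quoted size in memory.
    """
    return num_maps * map_mb


def ram_budget_per_osd_mb(host_ram_gb=HOST_RAM_GB, osds=OSDS_PER_HOST):
    """RAM available per OSD if the host's memory were split evenly, in MB."""
    return host_ram_gb * 1024 / osds


if __name__ == "__main__":
    print(f"budget per OSD:   {ram_budget_per_osd_mb():.0f} MB")
    print(f"500 cached maps:  {cached_maps_mb(500):.0f} MB")
    print(f"20 cached maps:   {cached_maps_mb(20):.0f} MB")
```

Under those assumptions, 500 maps come to roughly 2.3GB, which is in the
same range as the 2-3GB per OSD reported above, while 20 maps come to
roughly 92MB. The even split gives each OSD a budget of about 1.3GB.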


* Re: Bounding OSD memory requirements during peering/recovery
  2015-03-09 15:42       ` Dan van der Ster
@ 2015-03-09 15:47         ` Gregory Farnum
  2015-03-13 11:24           ` Dan van der Ster
  0 siblings, 1 reply; 14+ messages in thread
From: Gregory Farnum @ 2015-03-09 15:47 UTC (permalink / raw)
  To: Dan van der Ster; +Cc: Sage Weil, David McBride, Ceph-devel

On Mon, Mar 9, 2015 at 8:42 AM, Dan van der Ster <dan@vanderster.com> wrote:
> Hi Sage,
>
> On Tue, Feb 10, 2015 at 2:51 AM, Sage Weil <sage@newdream.net> wrote:
>> On Mon, 9 Feb 2015, David McBride wrote:
>>> On 09/02/15 15:31, Gregory Farnum wrote:
>>>
>>> > So, memory usage of an OSD is usually linear in the number of PGs it
>>> > hosts. However, that memory can also grow based on at least one other
>>> > thing: the number of OSD Maps required to go through peering. It
>>> > *looks* to me like this is what you're running in to, not growth on
>>> > the number of state machines. In particular, those past_intervals you
>>> > mentioned. ;)
>>>
>>> Hi Greg,
>>>
>>> Right, that sounds entirely plausible, and is very helpful.
>>>
>>> In practice, that means I'll need to be careful to avoid this situation
>>> occurring in production, but given that's unlikely to occur except in the
>>> case of non-trivial neglect, I don't think I need be particularly concerned.
>>>
>>> (Happily, I'm in the situation that my existing cluster is purely for testing
>>> purposes; the data is expendable.)
>>>
>>> That said, for my own peace of mind, it would be valuable to have a procedure
>>> that can be used to recover from this state, even if it's unlikely to occur in
>>> practice.
>>
>> The best luck I've had recovering from situations is something like:
>>
>> - stop all osds
>> - osd set nodown
>> - osd set nobackfill
>> - osd set noup
>> - set map cache size smaller to reduce memory footprint.
>>
>>   osd map cache size = 50
>>   osd map max advance = 25
>>   osd map share max epochs = 25
>>   osd pg epoch persisted max stale = 25

It can cause extreme slowness if you get into a failure situation and
your OSDs need to calculate past intervals across more maps than will
fit in the cache. :(

That said, this might be a good idea as long as you're conscious of
needing to set it back if you get into trouble later on.
-Greg
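
The slowness described here can be illustrated with a toy model. The
Python sketch below simulates a generic LRU cache and counts how often a
map has to be loaded when the same range of epochs is walked several
times. It is not Ceph's cache implementation; the function name and the
idea of modelling each PG as one pass over the epoch range are assumptions
made for the example.

```python
from collections import OrderedDict


def count_map_loads(cache_size, epochs, passes):
    """Count cache misses for `passes` sequential walks over `epochs` maps.

    Models a plain LRU cache holding at most `cache_size` entries.
    """
    cache = OrderedDict()
    loads = 0
    for _ in range(passes):
        for epoch in range(epochs):
            if epoch in cache:
                cache.move_to_end(epoch)       # mark as most recently used
            else:
                loads += 1                     # map must be loaded again
                cache[epoch] = True
                if len(cache) > cache_size:
                    cache.popitem(last=False)  # evict least recently used
    return loads


if __name__ == "__main__":
    # Range fits in the cache: only the first walk loads maps.
    print(count_map_loads(cache_size=20, epochs=15, passes=10))
    # Range exceeds the cache: every access in every walk is a miss.
    print(count_map_loads(cache_size=20, epochs=500, passes=10))
```

In this model, once the range is larger than the cache, each walk evicts
the entries the next walk needs first, so nothing is ever reused.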


* Re: Bounding OSD memory requirements during peering/recovery
  2015-03-09 15:47         ` Gregory Farnum
@ 2015-03-13 11:24           ` Dan van der Ster
       [not found]             ` <f943965c-b279-4e5f-ac47-1dc6443e594d@email.android.com>
  0 siblings, 1 reply; 14+ messages in thread
From: Dan van der Ster @ 2015-03-13 11:24 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Sage Weil, David McBride, Ceph-devel

On Mon, Mar 9, 2015 at 4:47 PM, Gregory Farnum <greg@gregs42.com> wrote:
> On Mon, Mar 9, 2015 at 8:42 AM, Dan van der Ster <dan@vanderster.com> wrote:
>> Hi Sage,
>>
>> On Tue, Feb 10, 2015 at 2:51 AM, Sage Weil <sage@newdream.net> wrote:
>>> On Mon, 9 Feb 2015, David McBride wrote:
>>>> On 09/02/15 15:31, Gregory Farnum wrote:
>>>>
>>>> > So, memory usage of an OSD is usually linear in the number of PGs it
>>>> > hosts. However, that memory can also grow based on at least one other
>>>> > thing: the number of OSD Maps required to go through peering. It
>>>> > *looks* to me like this is what you're running in to, not growth on
>>>> > the number of state machines. In particular, those past_intervals you
>>>> > mentioned. ;)
>>>>
>>>> Hi Greg,
>>>>
>>>> Right, that sounds entirely plausible, and is very helpful.
>>>>
>>>> In practice, that means I'll need to be careful to avoid this situation
>>>> occurring in production, but given that's unlikely to occur except in the
>>>> case of non-trivial neglect, I don't think I need be particularly concerned.
>>>>
>>>> (Happily, I'm in the situation that my existing cluster is purely for testing
>>>> purposes; the data is expendable.)
>>>>
>>>> That said, for my own peace of mind, it would be valuable to have a procedure
>>>> that can be used to recover from this state, even if it's unlikely to occur in
>>>> practice.
>>>
>>> The best luck I've had recovering from situations is something like:
>>>
>>> - stop all osds
>>> - osd set nodown
>>> - osd set nobackfill
>>> - osd set noup
>>> - set map cache size smaller to reduce memory footprint.
>>>
>>>   osd map cache size = 50
>>>   osd map max advance = 25
>>>   osd map share max epochs = 25
>>>   osd pg epoch persisted max stale = 25
>
> It can cause extreme slowness if you get into a failure situation and
> your OSDs need to calculate past intervals across more maps than will
> fit in the cache. :(

Only extreme slowness, or is it also possible to get into a situation
where the PGs are stuck incomplete forever?

The reason I ask is because we actually had a network issue this
morning that left OSDs flapping and a lot of osdmap epoch churn. Now
our network has stabilized but 10 PGs are incomplete, even though all
the OSDs are up. One PG looks like this, for example:

pg 75.45 is stuck inactive for 87351.077529, current state incomplete,
last acting [6689,1919,2329]
pg 75.45 is stuck unclean for 87351.096198, current state incomplete,
last acting [6689,1919,2329]
pg 75.45 is incomplete, acting [6689,1919,2329]

1919     3.62000     osd.1919     up     1.00000     1.00000
2329     3.62000     osd.2329     up     1.00000     1.00000
6689     3.62000     osd.6689     up     1.00000     1.00000

The pg query output here: http://pastebin.com/WyTAU69W

Is that a result of these short map caches or could it be something
else?  (we're running 0.93-76-gc35f422)
WWGD (what would Greg do?) to activate these PGs?

Thanks! Dan
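
When several PGs are stuck, it can help to pull the PG id, state, and
acting set out of health output like the lines quoted above. The Python
sketch below is one hypothetical way to do that with a regular expression;
the pattern is written against the three sample lines in this message
only, and the function name is made up for the example.

```python
import re

# Sample text copied from the message above (lines wrapped as posted).
SAMPLE = """\
pg 75.45 is stuck inactive for 87351.077529, current state incomplete,
last acting [6689,1919,2329]
pg 75.45 is stuck unclean for 87351.096198, current state incomplete,
last acting [6689,1919,2329]
pg 75.45 is incomplete, acting [6689,1919,2329]
"""

# \s+ after the commas tolerates the line wrapping introduced by email.
STUCK_RE = re.compile(
    r"pg (?P<pgid>\d+\.[0-9a-f]+) is stuck (?P<kind>\w+) "
    r"for (?P<secs>[\d.]+),\s+"
    r"current state (?P<state>[\w+]+),\s+"
    r"last acting \[(?P<acting>[\d,]+)\]"
)


def parse_stuck_pgs(text):
    """Return one dict per 'is stuck' entry found in `text`."""
    results = []
    for match in STUCK_RE.finditer(text):
        results.append({
            "pgid": match.group("pgid"),
            "kind": match.group("kind"),
            "seconds": float(match.group("secs")),
            "state": match.group("state"),
            "acting": [int(osd) for osd in match.group("acting").split(",")],
        })
    return results


if __name__ == "__main__":
    for entry in parse_stuck_pgs(SAMPLE):
        print(entry)
```

The third sample line ("is incomplete, acting ...") has a different shape
and is deliberately not matched by this pattern.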


* Re: Bounding OSD memory requirements during peering/recovery
       [not found]             ` <f943965c-b279-4e5f-ac47-1dc6443e594d@email.android.com>
@ 2015-03-13 12:52               ` Dan van der Ster
  2015-03-13 15:36                 ` Dan van der Ster
  0 siblings, 1 reply; 14+ messages in thread
From: Dan van der Ster @ 2015-03-13 12:52 UTC (permalink / raw)
  To: Sage Weil; +Cc: Gregory Farnum, David McBride, Ceph-devel

Hi Sage,

Losing a message would have been plausible given the network issue we had today.

I tried:

# ceph osd pg-temp 75.45 6689
set 75.45 pg_temp mapping to [6689]

then waited a bit. It's still incomplete -- the only difference is now
I see two more past_intervals in the pg. Full query here:
http://pastebin.com/TU7vVLpj

I didn't have debug_osd above zero when I did that. Should I try again
with debug_osd 20?

Thanks :)

Dan

On Fri, Mar 13, 2015 at 12:59 PM, Sage Weil <sage@newdream.net> wrote:
> This looks a bit like the osds may have lost a message, actually.  You can
> kick an individual pg to repeer with something like
>
> ceph osd pg-temp 75.45 6689
>
> See if that makes it go?
>
> sage


* Re: Bounding OSD memory requirements during peering/recovery
  2015-03-13 12:52               ` Dan van der Ster
@ 2015-03-13 15:36                 ` Dan van der Ster
  2015-03-13 20:42                   ` Samuel Just
  0 siblings, 1 reply; 14+ messages in thread
From: Dan van der Ster @ 2015-03-13 15:36 UTC (permalink / raw)
  To: Sage Weil; +Cc: Gregory Farnum, David McBride, Ceph-devel

On Fri, Mar 13, 2015 at 1:52 PM, Dan van der Ster <dan@vanderster.com> wrote:
> Hi Sage,
>
> Losing a message would have been plausible given the network issue we had today.
>
> I tried:
>
> # ceph osd pg-temp 75.45 6689
> set 75.45 pg_temp mapping to [6689]
>
> then waited a bit. It's still incomplete -- the only difference is now
> I see two more past_intervals in the pg. Full query here:
> http://pastebin.com/TU7vVLpj
>
> I didn't have debug_osd above zero when I did that. Should I try again
> with debug_osd 20?

I tried again with logging. The pg goes like this:

incomplete -> inactive -> remapped -> remapped+peering -> remapped ->
inactive -> peering -> incomplete

The killer seems to be:

2015-03-13 16:15:43.476925 7f3c2e055700 10 osd.6689 pg_epoch: 67050
pg[75.45( v 66245'4028 (49044'1025,66245'4028] local-les=61515 n=3994
ec=48759 les/c 66791/66791 67037/67050/67037) [6689,1919,2329]/[6689]
r=0 lpr=67050 pi=66787-67049/13 crt=66226'4026 lcod 0'0 mlcod 0'0
remapped+peering] choose_acting no suitable info found (incomplete
backfills?), reverting to up

Full log is here: http://pastebin.com/hZUBD9NT

Do you have an idea what went wrong here? BTW, our firefly "prod"
cluster suffered from the same network problem today, but all of that
cluster's PGs recovered nicely.
Does the hammer RC have different peering logic that might apply here?

Thanks! Dan
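
The state sequence above ends where it started, which is what makes the
PG look stuck rather than slow. A small Python sketch of that check
follows. The helper names are hypothetical, and the chain is copied from
this message as text; it is not produced by any Ceph tool in this form.

```python
# State chain copied from the message above (wrapped as posted).
CHAIN = """incomplete -> inactive -> remapped -> remapped+peering -> remapped ->
inactive -> peering -> incomplete"""


def parse_transitions(chain):
    """Split an 'a -> b -> c' chain into a list of state names."""
    return [state.strip() for state in chain.split("->") if state.strip()]


def returns_to_start(states):
    """True if the chain has moved at least once and ended in its first state."""
    return len(states) > 1 and states[0] == states[-1]


if __name__ == "__main__":
    states = parse_transitions(CHAIN)
    print(states)
    print("back to starting state:", returns_to_start(states))
```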



>
> Thanks :)
>
> Dan
>
> On Fri, Mar 13, 2015 at 12:59 PM, Sage Weil <sage@newdream.net> wrote:
>> This looks a bit like the osds may have lost a message, actually.  You can
>> kick an individual pg to repeer with something like
>>
>> ceph osd pg-temp 75.45 6689
>>
>> See if that makes it go?
>>
>> sage


* Re: Bounding OSD memory requirements during peering/recovery
  2015-03-13 15:36                 ` Dan van der Ster
@ 2015-03-13 20:42                   ` Samuel Just
  2015-03-13 20:53                     ` Samuel Just
  0 siblings, 1 reply; 14+ messages in thread
From: Samuel Just @ 2015-03-13 20:42 UTC (permalink / raw)
  To: Dan van der Ster, Sage Weil; +Cc: Gregory Farnum, David McBride, Ceph-devel

I've opened a bug for this (http://tracker.ceph.com/issues/11110). I
bet it's related to the new logic for allowing recovery below min_size.
Exactly what sha1 was running on the osds during this time period?
-Sam

On 03/13/2015 08:36 AM, Dan van der Ster wrote:
> On Fri, Mar 13, 2015 at 1:52 PM, Dan van der Ster <dan@vanderster.com> wrote:
>> Hi Sage,
>>
>> Losing a message would have been plausible given the network issue we had today.
>>
>> I tried:
>>
>> # ceph osd pg-temp 75.45 6689
>> set 75.45 pg_temp mapping to [6689]
>>
>> then waited a bit. It's still incomplete -- the only difference is now
>> I see two more past_intervals in the pg. Full query here:
>> http://pastebin.com/TU7vVLpj
>>
>> I didn't have debug_osd above zero when I did that. Should I try again
>> with debug_osd 20?
> I tried again with logging. The pg goes like this:
>
> incomplete -> inactive -> remapped -> remapped+peering -> remapped ->
> inactive -> peering -> incomplete
>
> The killer seems to be:
>
> 2015-03-13 16:15:43.476925 7f3c2e055700 10 osd.6689 pg_epoch: 67050
> pg[75.45( v 66245'4028 (49044'1025,66245'4028] local-les=61515 n=3994
> ec=48759 les/c 66791/66791 67037/67050/67037) [6689,1919,2329]/[6689]
> r=0 lpr=67050 pi=66787-67049/13 crt=66226'4026 lcod 0'0 mlcod 0'0
> remapped+peering] choose_acting no suitable info found (incomplete
> backfills?), reverting to up
>
> Full log is here: http://pastebin.com/hZUBD9NT
>
> Do you have an idea what went wrong here? BTW, our firefly "prod"
> cluster suffered from the same network problem today, but all of those
> cluster's PGs recovered nicely.
> Does the hammer RC have different peering logic that might apply here?
>
> Thanks! Dan
>
>
>
>> Thanks :)
>>
>> Dan
>>
>> On Fri, Mar 13, 2015 at 12:59 PM, Sage Weil <sage@newdream.net> wrote:
>>> This looks a bit like the osds may have lost a message, actually.  You can
>>> kick an individual pg to repeer with something like
>>>
>>> ceph osd pg-temp 75.45 6689
>>>
>>> See if that makes it go?
>>>
>>> sage



* Re: Bounding OSD memory requirements during peering/recovery
  2015-03-13 20:42                   ` Samuel Just
@ 2015-03-13 20:53                     ` Samuel Just
  2015-03-13 21:24                       ` Dan van der Ster
  0 siblings, 1 reply; 14+ messages in thread
From: Samuel Just @ 2015-03-13 20:53 UTC (permalink / raw)
  To: Dan van der Ster, Sage Weil; +Cc: Gregory Farnum, David McBride, Ceph-devel

Also, are you certain that all the OSDs were running the same version?
-Sam

On 03/13/2015 01:42 PM, Samuel Just wrote:
> I've opened a bug for this  (http://tracker.ceph.com/issues/11110), I 
> bet it's related to the new logic for allowing recovery below 
> min_size.  Exactly what sha1 was running on the osds during this time 
> period?
> -Sam
>
> On 03/13/2015 08:36 AM, Dan van der Ster wrote:
>> On Fri, Mar 13, 2015 at 1:52 PM, Dan van der Ster 
>> <dan@vanderster.com> wrote:
>>> Hi Sage,
>>>
>>> Losing a message would have been plausible given the network issue 
>>> we had today.
>>>
>>> I tried:
>>>
>>> # ceph osd pg-temp 75.45 6689
>>> set 75.45 pg_temp mapping to [6689]
>>>
>>> then waited a bit. It's still incomplete -- the only difference is now
>>> I see two more past_intervals in the pg. Full query here:
>>> http://pastebin.com/TU7vVLpj
>>>
>>> I didn't have debug_osd above zero when I did that. Should I try again
>>> with debug_osd 20?
>> I tried again with logging. The pg goes like this:
>>
>> incomplete -> inactive -> remapped -> remapped+peering -> remapped ->
>> inactive -> peering -> incomplete
>>
>> The killer seems to be:
>>
>> 2015-03-13 16:15:43.476925 7f3c2e055700 10 osd.6689 pg_epoch: 67050
>> pg[75.45( v 66245'4028 (49044'1025,66245'4028] local-les=61515 n=3994
>> ec=48759 les/c 66791/66791 67037/67050/67037) [6689,1919,2329]/[6689]
>> r=0 lpr=67050 pi=66787-67049/13 crt=66226'4026 lcod 0'0 mlcod 0'0
>> remapped+peering] choose_acting no suitable info found (incomplete
>> backfills?), reverting to up
>>
>> Full log is here: http://pastebin.com/hZUBD9NT
>>
>> Do you have an idea what went wrong here? BTW, our firefly "prod"
>> cluster suffered from the same network problem today, but all of those
>> cluster's PGs recovered nicely.
>> Does the hammer RC have different peering logic that might apply here?
>>
>> Thanks! Dan
>>
>>
>>
>>> Thanks :)
>>>
>>> Dan
>>>
>>> On Fri, Mar 13, 2015 at 12:59 PM, Sage Weil <sage@newdream.net> wrote:
>>>> This looks a bit like the osds may have lost a message, 
>>>> actually.  You can
>>>> kick an individual pg to repeer with something like
>>>>
>>>> ceph osd pg-temp 75.45 6689
>>>>
>>>> See if that makes it go?
>>>>
>>>> sage
>>>>
>>>>
>>>>
>>>> On March 13, 2015 7:24:48 AM EDT, Dan van der Ster 
>>>> <dan@vanderster.com>
>>>> wrote:
>>>>> On Mon, Mar 9, 2015 at 4:47 PM, Gregory Farnum <greg@gregs42.com> 
>>>>> wrote:
>>>>>>   On Mon, Mar 9, 2015 at 8:42 AM, Dan van der Ster 
>>>>>> <dan@vanderster.com>
>>>>>> wrote:
>>>>>>>   Hi Sage,
>>>>>>>
>>>>>>>   On Tue, Feb 10, 2015 at 2:51 AM, Sage Weil <sage@newdream.net> 
>>>>>>> wrote:
>>>>>>>>   On Mon, 9 Feb 2015, David McBride wrote:
>>>>>>>>>   On 09/02/15 15:31, Gregory Farnum wrote:
>>>>>>>>>
>>>>>>>>>>   So, memory
>>>>>>>>>> usage of an OSD is usually linear in the number of PGs it
>>>>>>>>>>   hosts. However, that memory can also grow based on at least 
>>>>>>>>>> one
>>>>>>>>>> other
>>>>>>>>>>   thing: the number of OSD Maps required to go through 
>>>>>>>>>> peering. It
>>>>>>>>>>   *looks* to me like this is what you're running in to, not 
>>>>>>>>>> growth on
>>>>>>>>>>   the number of state machines. In particular, those 
>>>>>>>>>> past_intervals
>>>>>>>>>> you
>>>>>>>>>>   mentioned. ;)
>>>>>>>>>
>>>>>>>>>   Hi Greg,
>>>>>>>>>
>>>>>>>>>   Right, that sounds entirely plausible, and is very helpful.
>>>>>>>>>
>>>>>>>>>   In practice, that means I'll need to be careful to avoid this
>>>>>>>>> situation
>>>>>>>>>   occurring in production ? but given that's unlikely to occur 
>>>>>>>>> except
>>>>>>>>> in the
>>>>>>>>>   case of non-trivial neglect, I don't think I need be 
>>>>>>>>> particularly
>>>>>>>>> concerned.
>>>>>>>>>
>>>>>>>>>   (Happily, I'm in the situation that my existing cluster is 
>>>>>>>>> purely for
>>>>>>>>> testing
>>>>>>>>>   purposes; the data is expendable.)
>>>>>>>>>
>>>>>>>>>   That said, for my own peace of mind, it would be valuable to 
>>>>>>>>> have a
>>>>>>>>> procedure
>>>>>>>>>   that can be used to recover from this
>>>>>>>>> state, even if it's unlikely to occur in
>>>>>>>>>   practice.
>>>>>>>>
>>>>>>>>   The best luck I've had recovering from situations is 
>>>>>>>> something like:
>>>>>>>>
>>>>>>>>   - stop all osds
>>>>>>>>   - osd set nodown
>>>>>>>>   - osd set nobackfill
>>>>>>>>   - osd set noup
>>>>>>>>   - set map cache size smaller to reduce memory footprint.
>>>>>>>>
>>>>>>>>     osd map cache size = 50
>>>>>>>>     osd map max advance = 25
>>>>>>>>     osd map share max epochs = 25
>>>>>>>>     osd pg epoch persisted max stale = 25
>>>>>>
>>>>>>   It can cause extreme slowness if you get into a failure situation
>>>>>>   and your OSDs need to calculate past intervals across more maps
>>>>>>   than will fit in the cache. :(
>>>>>
>>>>> .. extreme slowness or is it also possible to get into a situation
>>>>> where the PGs are stuck incomplete forever?
>>>>>
>>>>> The reason I ask is because we actually had a network issue this
>>>>> morning that left OSDs flapping and a lot of osdmap epoch churn. Now
>>>>> our network has stabilized but 10 PGs are incomplete, even though all
>>>>> the OSDs are up. One PG looks like this, for example:
>>>>>
>>>>> pg 75.45 is stuck inactive for 87351.077529, current state incomplete,
>>>>> last acting [6689,1919,2329]
>>>>> pg 75.45 is stuck unclean for 87351.096198, current state incomplete,
>>>>> last acting [6689,1919,2329]
>>>>> pg 75.45 is incomplete, acting [6689,1919,2329]
>>>>>
>>>>> 1919     3.62000 osd.1919                      up  1.00000          1.00000
>>>>> 2329     3.62000 osd.2329                      up  1.00000          1.00000
>>>>> 6689     3.62000 osd.6689                      up  1.00000          1.00000
>>>>>
>>>>> The pg query output here: http://pastebin.com/WyTAU69W
>>>>>
>>>>> Is that a result of these short map caches or could it be something
>>>>> else?  (we're running 0.93-76-gc35f422)
>>>>> WWGD (what would Greg do?) to activate these PGs?
>>>>>
>>>>> Thanks! Dan
>>>>> -- 
>>>>> To unsubscribe from this list: send the line "unsubscribe 
>>>>> ceph-devel" in
>>>>> the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>>
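For reference, the recovery steps quoted above can be sketched as a dry-run script. This is only an illustration under assumptions: the quoted list abbreviates the flag commands as "osd set ...", so the full `ceph osd set <flag>` form is spelled out here; the function names and the `[osd]` section placement are hypothetical; and nothing is executed, the script only prints the commands and a ceph.conf fragment with the values quoted above. Stopping the OSDs depends on the init system and is left as a comment.

```python
# Dry-run sketch of the quoted recovery steps: prints commands, never runs them.

# Flags named in the quoted procedure, in the order they were listed.
FLAGS = ["nodown", "nobackfill", "noup"]

# Reduced map-cache settings, with the values quoted in the thread.
MAP_CACHE_SETTINGS = {
    "osd map cache size": 50,
    "osd map max advance": 25,
    "osd map share max epochs": 25,
    "osd pg epoch persisted max stale": 25,
}


def flag_commands(flags):
    """Return the 'ceph osd set <flag>' command line for each flag, in order."""
    return ["ceph osd set {}".format(flag) for flag in flags]


def conf_fragment(settings, section="osd"):
    """Render the settings as a ceph.conf-style section (assumed to be [osd])."""
    lines = ["[{}]".format(section)]
    lines += ["    {} = {}".format(key, value) for key, value in settings.items()]
    return "\n".join(lines)


if __name__ == "__main__":
    print("# stop all OSDs first (init-system specific, not shown)")
    for command in flag_commands(FLAGS):
        print(command)
    print(conf_fragment(MAP_CACHE_SETTINGS))
```

Keeping the steps as data makes it easy to review the order before typing anything on a real cluster.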



* Re: Bounding OSD memory requirements during peering/recovery
  2015-03-13 20:53                     ` Samuel Just
@ 2015-03-13 21:24                       ` Dan van der Ster
  0 siblings, 0 replies; 14+ messages in thread
From: Dan van der Ster @ 2015-03-13 21:24 UTC (permalink / raw)
  To: Samuel Just; +Cc: Sage Weil, Gregory Farnum, David McBride, Ceph-devel

Yup, all running 0.93-76-gc35f422 (from gitbuilder just after Sage merged the
latest straw2 fix...). I just uploaded the ceph.log to help understand
the issue. Let me know if I can help further :)
Thanks! Dan
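
Since the exact build matters here, a small sketch of how a version string such as 0.93-76-gc35f422 breaks down may be useful. It assumes the string follows the usual `git describe` layout (tag, number of commits since the tag, then "g" plus an abbreviated sha1); the name `parse_describe` is made up for this illustration.

```python
import re

# Assumed layout (git describe style): <tag>-<commits since tag>-g<abbrev sha1>
DESCRIBE_RE = re.compile(r"^(?P<tag>.+)-(?P<commits>\d+)-g(?P<sha1>[0-9a-f]+)$")


def parse_describe(version):
    """Split a git-describe style version into (tag, commits, sha1)."""
    match = DESCRIBE_RE.match(version)
    if match is None:
        raise ValueError("not a git-describe style version: %r" % version)
    return match.group("tag"), int(match.group("commits")), match.group("sha1")


if __name__ == "__main__":
    # Version string taken from this message.
    print(parse_describe("0.93-76-gc35f422"))
```

Under that assumption, the abbreviated sha1 in the last field is the piece that answers the "what sha1" question.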

On Fri, Mar 13, 2015 at 9:53 PM, Samuel Just <sjust@redhat.com> wrote:
> Also, are you certain that all were running the same version?
> -Sam
>
>
> On 03/13/2015 01:42 PM, Samuel Just wrote:
>>
>> I've opened a bug for this (http://tracker.ceph.com/issues/11110); I bet
>> it's related to the new logic for allowing recovery below min_size.  Exactly
>> what sha1 was running on the osds during this time period?
>> -Sam
>>
>> On 03/13/2015 08:36 AM, Dan van der Ster wrote:
>>>
>>> On Fri, Mar 13, 2015 at 1:52 PM, Dan van der Ster <dan@vanderster.com>
>>> wrote:
>>>>
>>>> Hi Sage,
>>>>
>>>> Losing a message would have been plausible given the network issue we
>>>> had today.
>>>>
>>>> I tried:
>>>>
>>>> # ceph osd pg-temp 75.45 6689
>>>> set 75.45 pg_temp mapping to [6689]
>>>>
>>>> then waited a bit. It's still incomplete -- the only difference is now
>>>> I see two more past_intervals in the pg. Full query here:
>>>> http://pastebin.com/TU7vVLpj
>>>>
>>>> I didn't have debug_osd above zero when I did that. Should I try again
>>>> with debug_osd 20?
>>>
>>> I tried again with logging. The pg goes like this:
>>>
>>> incomplete -> inactive -> remapped -> remapped+peering -> remapped ->
>>> inactive -> peering -> incomplete
>>>
>>> The killer seems to be:
>>>
>>> 2015-03-13 16:15:43.476925 7f3c2e055700 10 osd.6689 pg_epoch: 67050
>>> pg[75.45( v 66245'4028 (49044'1025,66245'4028] local-les=61515 n=3994
>>> ec=48759 les/c 66791/66791 67037/67050/67037) [6689,1919,2329]/[6689]
>>> r=0 lpr=67050 pi=66787-67049/13 crt=66226'4026 lcod 0'0 mlcod 0'0
>>> remapped+peering] choose_acting no suitable info found (incomplete
>>> backfills?), reverting to up
>>>
>>> Full log is here: http://pastebin.com/hZUBD9NT
>>>
>>> Do you have an idea what went wrong here? BTW, our firefly "prod"
>>> cluster suffered from the same network problem today, but all of that
>>> cluster's PGs recovered nicely.
>>> Does the hammer RC have different peering logic that might apply here?
>>>
>>> Thanks! Dan
>>>
>>>
>>>
>>>> Thanks :)
>>>>
>>>> Dan
>>>>
>>>> On Fri, Mar 13, 2015 at 12:59 PM, Sage Weil <sage@newdream.net> wrote:
>>>>>
>>>>> This looks a bit like the osds may have lost a message, actually.
>>>>> You can kick an individual pg to repeer with something like
>>>>>
>>>>> ceph osd pg-temp 75.45 6689
>>>>>
>>>>> See if that makes it go?
>>>>>
>>>>> sage
>>>>>
>>>>>
>>>>>
>>>>> On March 13, 2015 7:24:48 AM EDT, Dan van der Ster <dan@vanderster.com>
>>>>> wrote:
>>>>>>
>>>>>> On Mon, Mar 9, 2015 at 4:47 PM, Gregory Farnum <greg@gregs42.com>
>>>>>> wrote:
>>>>>>>
>>>>>>>   On Mon, Mar 9, 2015 at 8:42 AM, Dan van der Ster
>>>>>>> <dan@vanderster.com>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>>   Hi Sage,
>>>>>>>>
>>>>>>>>   On Tue, Feb 10, 2015 at 2:51 AM, Sage Weil <sage@newdream.net>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>   On Mon, 9 Feb 2015, David McBride wrote:
>>>>>>>>>>
>>>>>>>>>>   On 09/02/15 15:31, Gregory Farnum wrote:
>>>>>>>>>>
>>>>>>>>>>>   So, memory usage of an OSD is usually linear in the number of
>>>>>>>>>>>   PGs it hosts. However, that memory can also grow based on at
>>>>>>>>>>>   least one other thing: the number of OSD Maps required to go
>>>>>>>>>>>   through peering. It *looks* to me like this is what you're
>>>>>>>>>>>   running into, not growth in the number of state machines. In
>>>>>>>>>>>   particular, those past_intervals you mentioned. ;)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>   Hi Greg,
>>>>>>>>>>
>>>>>>>>>>   Right, that sounds entirely plausible, and is very helpful.
>>>>>>>>>>
>>>>>>>>>>   In practice, that means I'll need to be careful to avoid this
>>>>>>>>>>   situation occurring in production -- but given that's unlikely
>>>>>>>>>>   to occur except in the case of non-trivial neglect, I don't
>>>>>>>>>>   think I need be particularly concerned.
>>>>>>>>>>
>>>>>>>>>>   (Happily, I'm in the situation that my existing cluster is
>>>>>>>>>>   purely for testing purposes; the data is expendable.)
>>>>>>>>>>
>>>>>>>>>>   That said, for my own peace of mind, it would be valuable to
>>>>>>>>>>   have a procedure that can be used to recover from this state,
>>>>>>>>>>   even if it's unlikely to occur in practice.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>   The best luck I've had recovering from these situations is
>>>>>>>>>   something like:
>>>>>>>>>
>>>>>>>>>   - stop all osds
>>>>>>>>>   - osd set nodown
>>>>>>>>>   - osd set nobackfill
>>>>>>>>>   - osd set noup
>>>>>>>>>   - set map cache size smaller to reduce memory footprint.
>>>>>>>>>
>>>>>>>>>     osd map cache size = 50
>>>>>>>>>     osd map max advance = 25
>>>>>>>>>     osd map share max epochs = 25
>>>>>>>>>     osd pg epoch persisted max stale = 25
>>>>>>>
>>>>>>>
>>>>>>>   It can cause extreme slowness if you get into a failure situation
>>>>>>>   and your OSDs need to calculate past intervals across more maps
>>>>>>>   than will fit in the cache. :(
>>>>>>
>>>>>>
>>>>>> .. extreme slowness or is it also possible to get into a situation
>>>>>> where the PGs are stuck incomplete forever?
>>>>>>
>>>>>> The reason I ask is because we actually had a network issue this
>>>>>> morning that left OSDs flapping and a lot of osdmap epoch churn. Now
>>>>>> our network has stabilized but 10 PGs are incomplete, even though all
>>>>>> the OSDs are up. One PG looks like this, for example:
>>>>>>
>>>>>> pg 75.45 is stuck inactive for 87351.077529, current state incomplete,
>>>>>> last acting [6689,1919,2329]
>>>>>> pg 75.45 is stuck unclean for 87351.096198, current state incomplete,
>>>>>> last acting [6689,1919,2329]
>>>>>> pg 75.45 is incomplete, acting [6689,1919,2329]
>>>>>>
>>>>>> 1919     3.62000 osd.1919                      up  1.00000          1.00000
>>>>>> 2329     3.62000 osd.2329                      up  1.00000          1.00000
>>>>>> 6689     3.62000 osd.6689                      up  1.00000          1.00000
>>>>>>
>>>>>> The pg query output here: http://pastebin.com/WyTAU69W
>>>>>>
>>>>>> Is that a result of these short map caches or could it be something
>>>>>> else?  (we're running 0.93-76-gc35f422)
>>>>>> WWGD (what would Greg do?) to activate these PGs?
>>>>>>
>>>>>> Thanks! Dan
>
>
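
To relate the log line quoted above to the map cache settings, here is a rough sketch. It assumes the `pi=66787-67049/13` field means "past intervals covering epochs 66787 through 67049, in 13 intervals", which is an interpretation of the PG log prefix that this thread does not confirm. It then compares that epoch span with the `osd map cache size = 50` value from the quoted procedure. The helper names are hypothetical.

```python
import re

# Assumed field layout: pi=<first epoch>-<last epoch>/<number of intervals>
PI_RE = re.compile(r"\bpi=(\d+)-(\d+)/(\d+)")


def parse_past_intervals(log_line):
    """Return (first_epoch, last_epoch, interval_count) from a 'pi=a-b/n' field."""
    match = PI_RE.search(log_line)
    if match is None:
        raise ValueError("no pi= field found")
    first, last, count = (int(group) for group in match.groups())
    return first, last, count


def epochs_spanned(first, last):
    """Number of osdmap epochs in the inclusive range [first, last]."""
    return last - first + 1


def exceeds_cache(first, last, cache_size):
    """True if the epoch span is larger than the map cache size (in maps)."""
    return epochs_spanned(first, last) > cache_size


if __name__ == "__main__":
    # Excerpt of the choose_acting log line quoted in this thread.
    line = "r=0 lpr=67050 pi=66787-67049/13 crt=66226'4026 lcod 0'0 mlcod 0'0"
    first, last, count = parse_past_intervals(line)
    print(epochs_spanned(first, last), count, exceeds_cache(first, last, 50))
```

Under that reading the span is 263 epochs, well over a 50-map cache, which is the kind of case the slowness warning describes. Whether it has anything to do with the PGs staying incomplete is a separate question.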


Thread overview: 14+ messages
2015-02-08 16:05 Bounding OSD memory requirements during peering/recovery David McBride
2015-02-08 20:05 ` David McBride
2015-02-09 10:38   ` David McBride
2015-02-09 15:31 ` Gregory Farnum
2015-02-09 21:36   ` David McBride
2015-02-10  1:51     ` Sage Weil
2015-03-09 15:42       ` Dan van der Ster
2015-03-09 15:47         ` Gregory Farnum
2015-03-13 11:24           ` Dan van der Ster
     [not found]             ` <f943965c-b279-4e5f-ac47-1dc6443e594d@email.android.com>
2015-03-13 12:52               ` Dan van der Ster
2015-03-13 15:36                 ` Dan van der Ster
2015-03-13 20:42                   ` Samuel Just
2015-03-13 20:53                     ` Samuel Just
2015-03-13 21:24                       ` Dan van der Ster
