Re: Bounding OSD memory requirements during peering/recovery

From: Samuel Just <sjust@redhat.com>
To: Dan van der Ster <dan@vanderster.com>, Sage Weil <sage@newdream.net>
Cc: Gregory Farnum <greg@gregs42.com>,
	David McBride <dwm37@cam.ac.uk>,
	Ceph-devel <ceph-devel@vger.kernel.org>
Subject: Re: Bounding OSD memory requirements during peering/recovery
Date: Fri, 13 Mar 2015 13:53:10 -0700
Message-ID: <55034E36.9070402@redhat.com>
In-Reply-To: <55034B9C.8040000@redhat.com>

Also, are you certain that all of the OSDs were running the same version?
-Sam

On 03/13/2015 01:42 PM, Samuel Just wrote:
> I've opened a bug for this (http://tracker.ceph.com/issues/11110). I
> bet it's related to the new logic for allowing recovery below
> min_size. Exactly what sha1 was running on the osds during this time
> period?
> -Sam
>
> On 03/13/2015 08:36 AM, Dan van der Ster wrote:
>> On Fri, Mar 13, 2015 at 1:52 PM, Dan van der Ster <dan@vanderster.com> wrote:
>>> Hi Sage,
>>>
>>> Losing a message would have been plausible given the network issue 
>>> we had today.
>>>
>>> I tried:
>>>
>>> # ceph osd pg-temp 75.45 6689
>>> set 75.45 pg_temp mapping to [6689]
>>>
>>> then waited a bit. It's still incomplete -- the only difference is now
>>> I see two more past_intervals in the pg. Full query here:
>>> http://pastebin.com/TU7vVLpj
>>>
>>> I didn't have debug_osd above zero when I did that. Should I try again
>>> with debug_osd 20?
>> I tried again with logging. The pg goes like this:
>>
>> incomplete -> inactive -> remapped -> remapped+peering -> remapped ->
>> inactive -> peering -> incomplete
>>
>> The killer seems to be:
>>
>> 2015-03-13 16:15:43.476925 7f3c2e055700 10 osd.6689 pg_epoch: 67050
>> pg[75.45( v 66245'4028 (49044'1025,66245'4028] local-les=61515 n=3994
>> ec=48759 les/c 66791/66791 67037/67050/67037) [6689,1919,2329]/[6689]
>> r=0 lpr=67050 pi=66787-67049/13 crt=66226'4026 lcod 0'0 mlcod 0'0
>> remapped+peering] choose_acting no suitable info found (incomplete
>> backfills?), reverting to up
>>
>> Full log is here: http://pastebin.com/hZUBD9NT
>>
>> Do you have an idea what went wrong here? BTW, our firefly "prod"
>> cluster suffered from the same network problem today, but all of that
>> cluster's PGs recovered nicely.
>> Does the hammer RC have different peering logic that might apply here?
>>
>> Thanks! Dan
>>
>>
>>
>>> Thanks :)
>>>
>>> Dan
>>>
>>> On Fri, Mar 13, 2015 at 12:59 PM, Sage Weil <sage@newdream.net> wrote:
>>>> This looks a bit like the osds may have lost a message, actually.  You can
>>>> kick an individual pg to repeer with something like
>>>>
>>>> ceph osd pg-temp 75.45 6689
>>>>
>>>> See if that makes it go?
>>>>
>>>> sage
>>>>
>>>>
>>>>
>>>> On March 13, 2015 7:24:48 AM EDT, Dan van der Ster <dan@vanderster.com> wrote:
>>>>> On Mon, Mar 9, 2015 at 4:47 PM, Gregory Farnum <greg@gregs42.com> wrote:
>>>>>>   On Mon, Mar 9, 2015 at 8:42 AM, Dan van der Ster <dan@vanderster.com> wrote:
>>>>>>>   Hi Sage,
>>>>>>>
>>>>>>>   On Tue, Feb 10, 2015 at 2:51 AM, Sage Weil <sage@newdream.net> wrote:
>>>>>>>>   On Mon, 9 Feb 2015, David McBride wrote:
>>>>>>>>>   On 09/02/15 15:31, Gregory Farnum wrote:
>>>>>>>>>
>>>>>>>>>>   So, memory usage of an OSD is usually linear in the number of PGs it
>>>>>>>>>>   hosts. However, that memory can also grow based on at least one other
>>>>>>>>>>   thing: the number of OSD Maps required to go through peering. It
>>>>>>>>>>   *looks* to me like this is what you're running in to, not growth on
>>>>>>>>>>   the number of state machines. In particular, those past_intervals you
>>>>>>>>>>   mentioned. ;)
>>>>>>>>>
>>>>>>>>>   Hi Greg,
>>>>>>>>>
>>>>>>>>>   Right, that sounds entirely plausible, and is very helpful.
>>>>>>>>>
>>>>>>>>>   In practice, that means I'll need to be careful to avoid this situation
>>>>>>>>>   occurring in production -- but given that's unlikely to occur except in the
>>>>>>>>>   case of non-trivial neglect, I don't think I need be particularly concerned.
>>>>>>>>>
>>>>>>>>>   (Happily, I'm in the situation that my existing cluster is purely for testing
>>>>>>>>>   purposes; the data is expendable.)
>>>>>>>>>
>>>>>>>>>   That said, for my own peace of mind, it would be valuable to have a procedure
>>>>>>>>>   that can be used to recover from this state, even if it's unlikely to occur in
>>>>>>>>>   practice.
>>>>>>>>
>>>>>>>>   The best luck I've had recovering from situations is something like:
>>>>>>>>
>>>>>>>>   - stop all osds
>>>>>>>>   - osd set nodown
>>>>>>>>   - osd set nobackfill
>>>>>>>>   - osd set noup
>>>>>>>>   - set map cache size smaller to reduce memory footprint.
>>>>>>>>
>>>>>>>>     osd map cache size = 50
>>>>>>>>     osd map max advance = 25
>>>>>>>>     osd map share max epochs = 25
>>>>>>>>     osd pg epoch persisted max stale = 25
>>>>>>
>>>>>>   It can cause extreme slowness if you get into a failure situation and
>>>>>>   your OSDs need to calculate past intervals across more maps than will
>>>>>>   fit in the cache. :(
>>>>>
>>>>> .. extreme slowness or is it also possible to get into a situation
>>>>> where the PGs are stuck incomplete forever?
>>>>>
>>>>> The reason I ask is because we actually had a network issue this
>>>>> morning that left OSDs flapping and a lot of osdmap epoch churn. Now
>>>>> our network has stabilized but 10 PGs are incomplete, even though all
>>>>> the OSDs are up. One PG looks like this, for example:
>>>>>
>>>>> pg 75.45 is stuck inactive for 87351.077529, current state incomplete,
>>>>> last acting [6689,1919,2329]
>>>>> pg 75.45 is stuck unclean for 87351.096198, current state incomplete,
>>>>> last acting [6689,1919,2329]
>>>>> pg 75.45 is incomplete, acting [6689,1919,2329]
>>>>>
>>>>> 1919     3.62000 osd.1919                      up  1.00000          1.00000
>>>>> 2329     3.62000 osd.2329                      up  1.00000          1.00000
>>>>> 6689     3.62000 osd.6689                      up  1.00000          1.00000
>>>>>
>>>>> The pg query output here: http://pastebin.com/WyTAU69W
>>>>>
>>>>> Is that a result of these short map caches or could it be something
>>>>> else?  (we're running 0.93-76-gc35f422)
>>>>> WWGD (what would Greg do?) to activate these PGs?
>>>>>
>>>>> Thanks! Dan
>>>>> -- 
>>>>> To unsubscribe from this list: send the line "unsubscribe 
>>>>> ceph-devel" in
>>>>> the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>>
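
For anyone triaging a similar report: the "stuck" lines Dan pasted can be
summarised per PG and checked against the set of up OSDs. What follows is
only a rough sketch, not part of any Ceph tooling. The line format is
assumed from the three sample lines quoted in this thread (it may differ
between releases), and the names parse_stuck_pgs and all_acting_up are
hypothetical helpers invented for illustration.

```python
import re

# Sample lines copied from the report in this thread (quote markers stripped).
HEALTH_DETAIL = """\
pg 75.45 is stuck inactive for 87351.077529, current state incomplete, last acting [6689,1919,2329]
pg 75.45 is stuck unclean for 87351.096198, current state incomplete, last acting [6689,1919,2329]
pg 75.45 is incomplete, acting [6689,1919,2329]
"""

# Assumed format, inferred only from the sample lines above.
STUCK_RE = re.compile(
    r"^pg (?P<pgid>\S+) is stuck (?P<kind>\w+) for (?P<secs>[\d.]+), "
    r"current state (?P<state>[^,\s]+), last acting \[(?P<acting>[\d,]*)\]$"
)


def parse_stuck_pgs(text):
    """Return {pgid: {"state": str, "acting": [int], "stuck": {kind: secs}}}."""
    pgs = {}
    for line in text.splitlines():
        m = STUCK_RE.match(line.strip())
        if m is None:
            # Skips anything else, e.g. "pg X is incomplete, acting [...]".
            continue
        entry = pgs.setdefault(
            m["pgid"], {"state": m["state"], "acting": [], "stuck": {}}
        )
        entry["acting"] = [int(x) for x in m["acting"].split(",") if x]
        entry["stuck"][m["kind"]] = float(m["secs"])
    return pgs


def all_acting_up(pg, up_osds):
    """True if every OSD in the PG's last acting set is in up_osds."""
    return all(osd in up_osds for osd in pg["acting"])


if __name__ == "__main__":
    up = {1919, 2329, 6689}  # the three OSDs reported as "up" in the thread
    for pgid, pg in parse_stuck_pgs(HEALTH_DETAIL).items():
        print(pgid, pg["state"], pg["acting"], "all up:", all_acting_up(pg, up))
```

A PG that is still incomplete while all_acting_up returns True matches the
case described here: every OSD in the acting set is up, yet peering does not
complete.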

Thread overview: 14+ messages
2015-02-08 16:05 Bounding OSD memory requirements during peering/recovery David McBride
2015-02-08 20:05 ` David McBride
2015-02-09 10:38   ` David McBride
2015-02-09 15:31 ` Gregory Farnum
2015-02-09 21:36   ` David McBride
2015-02-10  1:51     ` Sage Weil
2015-03-09 15:42       ` Dan van der Ster
2015-03-09 15:47         ` Gregory Farnum
2015-03-13 11:24           ` Dan van der Ster
     [not found]             ` <f943965c-b279-4e5f-ac47-1dc6443e594d@email.android.com>
2015-03-13 12:52               ` Dan van der Ster
2015-03-13 15:36                 ` Dan van der Ster
2015-03-13 20:42                   ` Samuel Just
2015-03-13 20:53                     ` Samuel Just [this message]
2015-03-13 21:24                       ` Dan van der Ster
