From mboxrd@z Thu Jan 1 00:00:00 1970
From: Samuel Just
Subject: Re: Bounding OSD memory requirements during peering/recovery
Date: Fri, 13 Mar 2015 13:53:10 -0700
Message-ID: <55034E36.9070402@redhat.com>
References: <54D78939.4000708@cam.ac.uk> <54D92850.5080409@cam.ac.uk>
 <55034B9C.8040000@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
Received: from mx1.redhat.com ([209.132.183.28]:44695 "EHLO mx1.redhat.com"
 rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
 id S1752488AbbCMUxL (ORCPT ); Fri, 13 Mar 2015 16:53:11 -0400
In-Reply-To: <55034B9C.8040000@redhat.com>
Sender: ceph-devel-owner@vger.kernel.org
List-ID:
To: Dan van der Ster, Sage Weil
Cc: Gregory Farnum, David McBride, Ceph-devel

Also, are you certain that all were running the same version?
-Sam

On 03/13/2015 01:42 PM, Samuel Just wrote:
> I've opened a bug for this (http://tracker.ceph.com/issues/11110), I
> bet it's related to the new logic for allowing recovery below
> min_size. Exactly what sha1 was running on the osds during this time
> period?
> -Sam
>
> On 03/13/2015 08:36 AM, Dan van der Ster wrote:
>> On Fri, Mar 13, 2015 at 1:52 PM, Dan van der Ster wrote:
>>> Hi Sage,
>>>
>>> Losing a message would have been plausible given the network issue
>>> we had today.
>>>
>>> I tried:
>>>
>>> # ceph osd pg-temp 75.45 6689
>>> set 75.45 pg_temp mapping to [6689]
>>>
>>> then waited a bit. It's still incomplete -- the only difference is now
>>> I see two more past_intervals in the pg. Full query here:
>>> http://pastebin.com/TU7vVLpj
>>>
>>> I didn't have debug_osd above zero when I did that. Should I try again
>>> with debug_osd 20?
>> I tried again with logging.
>> The pg goes like this:
>>
>> incomplete -> inactive -> remapped -> remapped+peering -> remapped ->
>> inactive -> peering -> incomplete
>>
>> The killer seems to be:
>>
>> 2015-03-13 16:15:43.476925 7f3c2e055700 10 osd.6689 pg_epoch: 67050
>> pg[75.45( v 66245'4028 (49044'1025,66245'4028] local-les=61515 n=3994
>> ec=48759 les/c 66791/66791 67037/67050/67037) [6689,1919,2329]/[6689]
>> r=0 lpr=67050 pi=66787-67049/13 crt=66226'4026 lcod 0'0 mlcod 0'0
>> remapped+peering] choose_acting no suitable info found (incomplete
>> backfills?), reverting to up
>>
>> Full log is here: http://pastebin.com/hZUBD9NT
>>
>> Do you have an idea what went wrong here? BTW, our firefly "prod"
>> cluster suffered from the same network problem today, but all of that
>> cluster's PGs recovered nicely.
>> Does the hammer RC have different peering logic that might apply here?
>>
>> Thanks! Dan
>>
>>> Thanks :)
>>>
>>> Dan
>>>
>>> On Fri, Mar 13, 2015 at 12:59 PM, Sage Weil wrote:
>>>> This looks a bit like the osds may have lost a message, actually.
>>>> You can kick an individual pg to repeer with something like
>>>>
>>>> ceph osd pg-temp 75.45 6689
>>>>
>>>> See if that makes it go?
>>>>
>>>> sage
>>>>
>>>> On March 13, 2015 7:24:48 AM EDT, Dan van der Ster wrote:
>>>>> On Mon, Mar 9, 2015 at 4:47 PM, Gregory Farnum wrote:
>>>>>> On Mon, Mar 9, 2015 at 8:42 AM, Dan van der Ster wrote:
>>>>>>> Hi Sage,
>>>>>>>
>>>>>>> On Tue, Feb 10, 2015 at 2:51 AM, Sage Weil wrote:
>>>>>>>> On Mon, 9 Feb 2015, David McBride wrote:
>>>>>>>>> On 09/02/15 15:31, Gregory Farnum wrote:
>>>>>>>>>
>>>>>>>>>> So, memory usage of an OSD is usually linear in the number of
>>>>>>>>>> PGs it hosts. However, that memory can also grow based on at
>>>>>>>>>> least one other thing: the number of OSD Maps required to go
>>>>>>>>>> through peering.
>>>>>>>>>> It *looks* to me like this is what you're running into, not
>>>>>>>>>> growth in the number of state machines. In particular, those
>>>>>>>>>> past_intervals you mentioned. ;)
>>>>>>>>>
>>>>>>>>> Hi Greg,
>>>>>>>>>
>>>>>>>>> Right, that sounds entirely plausible, and is very helpful.
>>>>>>>>>
>>>>>>>>> In practice, that means I'll need to be careful to avoid this
>>>>>>>>> situation occurring in production -- but given that's unlikely
>>>>>>>>> to occur except in the case of non-trivial neglect, I don't
>>>>>>>>> think I need be particularly concerned.
>>>>>>>>>
>>>>>>>>> (Happily, I'm in the situation that my existing cluster is
>>>>>>>>> purely for testing purposes; the data is expendable.)
>>>>>>>>>
>>>>>>>>> That said, for my own peace of mind, it would be valuable to
>>>>>>>>> have a procedure that can be used to recover from this state,
>>>>>>>>> even if it's unlikely to occur in practice.
>>>>>>>>
>>>>>>>> The best luck I've had recovering from situations is
>>>>>>>> something like:
>>>>>>>>
>>>>>>>> - stop all osds
>>>>>>>> - osd set nodown
>>>>>>>> - osd set nobackfill
>>>>>>>> - osd set noup
>>>>>>>> - set map cache size smaller to reduce memory footprint.
>>>>>>>>
>>>>>>>> osd map cache size = 50
>>>>>>>> osd map max advance = 25
>>>>>>>> osd map share max epochs = 25
>>>>>>>> osd pg epoch persisted max stale = 25
>>>>>>
>>>>>> It can cause extreme slowness if you get into a failure situation
>>>>>> and your OSDs need to calculate past intervals across more maps
>>>>>> than will fit in the cache. :(
>>>>>
>>>>> .. extreme slowness or is it also possible to get into a situation
>>>>> where the PGs are stuck incomplete forever?
>>>>>
>>>>> The reason I ask is because we actually had a network issue this
>>>>> morning that left OSDs flapping and a lot of osdmap epoch churn.
>>>>> Now our network has stabilized but 10 PGs are incomplete, even
>>>>> though all the OSDs are up. One PG looks like this, for example:
>>>>>
>>>>> pg 75.45 is stuck inactive for 87351.077529, current state incomplete, last acting [6689,1919,2329]
>>>>> pg 75.45 is stuck unclean for 87351.096198, current state incomplete, last acting [6689,1919,2329]
>>>>> pg 75.45 is incomplete, acting [6689,1919,2329]
>>>>>
>>>>> 1919 3.62000 osd.1919 up 1.00000 1.00000
>>>>> 2329 3.62000 osd.2329 up 1.00000 1.00000
>>>>> 6689 3.62000 osd.6689 up 1.00000 1.00000
>>>>>
>>>>> The pg query output here: http://pastebin.com/WyTAU69W
>>>>>
>>>>> Is that a result of these short map caches or could it be something
>>>>> else? (we're running 0.93-76-gc35f422)
>>>>> WWGD (what would Greg do?) to activate these PGs?
>>>>>
>>>>> Thanks! Dan
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>> the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html