From mboxrd@z Thu Jan 1 00:00:00 1970
From: Stefan Priebe
Subject: Re: still recovery issues with cuttlefish
Date: Wed, 21 Aug 2013 21:37:18 +0200
Message-ID: <521516EE.9070405@profihost.ag>
References: <51FA1AC1.8040207@profihost.ag>
 <51FBF765.9030700@profihost.ag>
 <51FBFE85.5040700@profihost.ag>
 <5203A597.4060701@cloudapt.com>
 <5203DFAE.9070100@profihost.ag>
 <52068FB6.1080209@profihost.ag>
 <673B805F-B036-4066-B8AD-770E6464B64C@profihost.ag>
 <520B2BF0.2030208@profihost.ag>
 <5214DCA1.3040003@cloudapt.com>
 <52150513.8040908@profihost.ag>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
Received: from mail-ph.de-nserver.de ([85.158.179.214]:33908 "EHLO
 mail-ph.de-nserver.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
 with ESMTP id S1752069Ab3HUThZ (ORCPT ); Wed, 21 Aug 2013 15:37:25 -0400
In-Reply-To:
Sender: ceph-devel-owner@vger.kernel.org
List-ID:
To: Samuel Just
Cc: Mike Dawson, "josh.durgin@inktank.com", "ceph-devel@vger.kernel.org", Sage Weil

Hi Sam,

On 21.08.2013 21:13, Samuel Just wrote:
> As long as the request is for an object which is up to date on the
> primary, the request will be served without waiting for recovery.

Sure, but remember that with a random 4K VM workload a lot of objects go
out of date pretty soon.

> A request only waits on recovery if the particular object being read or
> written must be recovered.

Yes, but under 4K load this can be a lot of objects.

> Your issue was that recovering the
> particular object being requested was unreasonably slow due to
> silliness in the recovery code which you disabled by disabling
> osd_recover_clone_overlap.

Yes and no. It's better now, but still far from good or perfect. My VMs
do not crash anymore, but I still see a bunch of slow requests (just
around 10 messages) and still a VERY high I/O load on the disks during
recovery.

> In cases where the primary osd is significantly behind, we do make one
> of the other osds primary during recovery in order to expedite
> requests (pgs in this state are shown as remapped).

Oh, I have never seen that, but at least in my case even 60s is a very
long timeframe and the OSD is very stressed during recovery. Is it
possible for me to set this value?

Stefan

> -Sam
>
> On Wed, Aug 21, 2013 at 11:21 AM, Stefan Priebe wrote:
>> On 21.08.2013 17:32, Samuel Just wrote:
>>
>>> Have you tried setting osd_recover_clone_overlap to false? That
>>> seemed to help with Stefan's issue.
>>
>> This might sound a bit harsh, but maybe that is due to my limited
>> English skills ;-)
>>
>> I still think that Ceph's recovery system is broken by design. If an OSD
>> comes back (was offline), all write requests for PGs where this OSD is
>> primary are immediately targeted at it. If it is not up to date for a
>> PG, it tries to recover that one immediately, which costs 4MB / block.
>> If you have a lot of small writes all over your OSDs and PGs, you're
>> stuck, as your OSD has to recover ALL its PGs immediately, or at least
>> lots of them, WHICH can't work. This is totally crazy.
>>
>> I think the right way would be:
>> 1.) if an OSD goes down, the replicas become primaries
>>
>> or
>>
>> 2.) an OSD which does not have an up-to-date PG should redirect to the
>> OSD holding the secondary or third replica.
>>
>> Both result in a really smooth and slow recovery without any stress,
>> even under heavy 4K workloads like rbd-backed VMs.
>>
>> Thanks for reading!
>>
>> Greets Stefan
>>
>>
>>> -Sam
>>>
>>> On Wed, Aug 21, 2013 at 8:28 AM, Mike Dawson wrote:
>>>>
>>>> Sam/Josh,
>>>>
>>>> We upgraded from 0.61.7 to 0.67.1 during a maintenance window this
>>>> morning, hoping it would improve this situation, but there was no
>>>> appreciable change.
>>>>
>>>> One node in our cluster fsck'ed after a reboot and got a bit behind.
>>>> Our instances backed by RBD volumes were OK at that point, but once
>>>> the node booted fully and the OSDs started, all Windows instances with
>>>> rbd volumes experienced very choppy performance and were unable to
>>>> ingest video surveillance traffic and commit it to disk. Once the
>>>> cluster got back to HEALTH_OK, they resumed normal operation.
>>>>
>>>> I tried for a time with conservative recovery settings
>>>> (osd max backfills = 1, osd recovery op priority = 1, and
>>>> osd recovery max active = 1). No improvement for the guests. So I went
>>>> to more aggressive settings to get things moving faster. That
>>>> decreased the duration of the outage.
>>>>
>>>> During the entire period of recovery/backfill, the network looked
>>>> fine... nowhere close to saturation. iowait on all drives looked fine
>>>> as well.
>>>>
>>>> Any ideas?
>>>>
>>>> Thanks,
>>>> Mike Dawson
>>>>
>>>>
>>>> On 8/14/2013 3:04 AM, Stefan Priebe - Profihost AG wrote:
>>>>>
>>>>> The same problem still occurs. I will need to check when I have time
>>>>> to gather logs again.
>>>>>
>>>>> On 14.08.2013 01:11, Samuel Just wrote:
>>>>>>
>>>>>> I'm not sure, but your logs did show that you had >16 recovery ops in
>>>>>> flight, so it's worth a try. If it doesn't help, you should collect
>>>>>> the same set of logs and I'll look again. Also, there are a few other
>>>>>> patches between 0.61.7 and current cuttlefish which may help.
>>>>>> -Sam
>>>>>>
>>>>>> On Tue, Aug 13, 2013 at 2:03 PM, Stefan Priebe - Profihost AG wrote:
>>>>>>>
>>>>>>> On 13.08.2013 at 22:43, Samuel Just wrote:
>>>>>>>
>>>>>>>> I just backported a couple of patches from next to fix a bug where we
>>>>>>>> weren't respecting the osd_recovery_max_active config in some cases
>>>>>>>> (1ea6b56170fc9e223e7c30635db02fa2ad8f4b4e). You can either try the
>>>>>>>> current cuttlefish branch or wait for a 0.61.8 release.
>>>>>>>
>>>>>>> Thanks!
>>>>>>> Are you sure that this is the issue? I don't believe so, but I'll
>>>>>>> give it a try. I already tested a branch from Sage where he fixed a
>>>>>>> race regarding max active some weeks ago. So active recovering was
>>>>>>> max 1, but the issue didn't go away.
>>>>>>>
>>>>>>> Stefan
>>>>>>>
>>>>>>>> -Sam
>>>>>>>>
>>>>>>>> On Mon, Aug 12, 2013 at 10:34 PM, Samuel Just wrote:
>>>>>>>>>
>>>>>>>>> I got swamped today. I should be able to look tomorrow. Sorry!
>>>>>>>>> -Sam
>>>>>>>>>
>>>>>>>>> On Mon, Aug 12, 2013 at 9:39 PM, Stefan Priebe - Profihost AG wrote:
>>>>>>>>>>
>>>>>>>>>> Did you take a look?
>>>>>>>>>>
>>>>>>>>>> Stefan
>>>>>>>>>>
>>>>>>>>>> On 11.08.2013 at 05:50, Samuel Just wrote:
>>>>>>>>>>
>>>>>>>>>>> Great! I'll take a look on Monday.
>>>>>>>>>>> -Sam
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Aug 10, 2013 at 12:08 PM, Stefan Priebe wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Samuel,
>>>>>>>>>>>>
>>>>>>>>>>>> On 09.08.2013 23:44, Samuel Just wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I think Stefan's problem is probably distinct from Mike's.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Stefan: Can you reproduce the problem with
>>>>>>>>>>>>>
>>>>>>>>>>>>> debug osd = 20
>>>>>>>>>>>>> debug filestore = 20
>>>>>>>>>>>>> debug ms = 1
>>>>>>>>>>>>> debug optracker = 20
>>>>>>>>>>>>>
>>>>>>>>>>>>> on a few osds (including the restarted osd), and upload those
>>>>>>>>>>>>> osd logs along with the ceph.log from before killing the osd
>>>>>>>>>>>>> until after the cluster becomes clean again?
>>>>>>>>>>>>
>>>>>>>>>>>> Done. You'll find the logs in the cephdrop folder:
>>>>>>>>>>>> slow_requests_recovering_cuttlefish
>>>>>>>>>>>>
>>>>>>>>>>>> osd.52 was the one recovering.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>
>>>>>>>>>>>> Greets,
>>>>>>>>>>>> Stefan
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
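
For reference, the options named in this thread could be collected in
ceph.conf roughly as shown here. This is only a sketch: the option names
and values are the ones quoted in the messages above, while placing them
together in the [osd] section and the grouping into two blocks are
assumptions made for illustration. It is not a tuning recommendation;
Mike reported no improvement for the guests with the conservative
recovery values.

```ini
[osd]
    ; Workaround discussed in the thread for slow per-object recovery
    osd recover clone overlap = false

    ; Conservative recovery settings Mike tried (no improvement reported)
    osd max backfills = 1
    osd recovery op priority = 1
    osd recovery max active = 1

    ; Debug levels Sam asked for while reproducing the problem
    ; (verbose; assumed to be enabled only on the few OSDs being examined)
    debug osd = 20
    debug filestore = 20
    debug ms = 1
    debug optracker = 20
```

The debug block is meant to be removed again once the logs covering the
period from before the OSD restart until the cluster is clean have been
collected, since these levels produce very large log files.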