From: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
To: Samuel Just <sam.just@inktank.com>
Cc: Mike Dawson <mike.dawson@cloudapt.com>,
	"josh.durgin@inktank.com" <josh.durgin@inktank.com>,
	"ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>,
	Sage Weil <sage@inktank.com>
Subject: Re: still recovery issues with cuttlefish
Date: Thu, 22 Aug 2013 09:41:45 +0200
Message-ID: <5215C0B9.5080307@profihost.ag>
In-Reply-To: <CA+4uBUbLhpMgPPcmhxdv=DhRoXktmB61mtiMNn6X7wgBWZxksw@mail.gmail.com>

On 22.08.2013 05:34, Samuel Just wrote:
> It's not really possible at this time to control that limit because
> changing the primary is actually fairly expensive and doing it
> unnecessarily would probably make the situation much worse

I'm sorry, but remapping or backfilling is far less expensive on all of
my machines than recovering.

While backfilling I see around 8-10% I/O wait, while under recovery I
see 40-50%.


> (it's
> mostly necessary for backfilling, which is expensive anyway).  It
> seems like forwarding IO on an object which needs to be recovered to a
> replica with the object would be the next step.  Certainly something
> to consider for the future.

Yes, this would be the solution.
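
For illustration only, the forwarding idea could be sketched as the toy
Python model below. It is not Ceph code: the Osd class, pick_target(), the
object names, and the replica names osd.7 and osd.13 are all made up for the
example, and it ignores write ordering and everything else a real
implementation would have to handle.

```python
from dataclasses import dataclass, field


@dataclass
class Osd:
    """Toy stand-in for an OSD: tracks which objects it holds up to date."""
    name: str
    up_to_date: set = field(default_factory=set)


def pick_target(obj, primary, replicas):
    """Return the OSD that should serve a request for obj.

    Serve from the primary when it has a current copy; otherwise forward
    to the first replica that does. Return None if nobody has a current
    copy, in which case the request would have to wait for recovery.
    """
    if obj in primary.up_to_date:
        return primary
    for replica in replicas:
        if obj in replica.up_to_date:
            return replica
    return None


# Hypothetical state: the restarted primary missed a recent write to "obj_b".
primary = Osd("osd.52", {"obj_a"})
replicas = [Osd("osd.7", {"obj_a", "obj_b"}), Osd("osd.13", {"obj_a", "obj_b"})]

print(pick_target("obj_a", primary, replicas).name)  # primary is current
print(pick_target("obj_b", primary, replicas).name)  # forwarded to a replica
```

With something like this, a request would only block when no OSD in the set
holds a current copy of the object.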

Stefan

> -Sam
> 
> On Wed, Aug 21, 2013 at 12:37 PM, Stefan Priebe <s.priebe@profihost.ag> wrote:
>> Hi Sam,
>> On 21.08.2013 21:13, Samuel Just wrote:
>>
>>> As long as the request is for an object which is up to date on the
>>> primary, the request will be served without waiting for recovery.
>>
>>
>> Sure, but remember that with a random 4K VM workload, a lot of objects go
>> out of date pretty quickly.
>>
>>
>>> A request only waits on recovery if the particular object being read or
>>> written must be recovered.
>>
>>
>> Yes, but under a 4K load this can be a lot.
>>
>>
>>> Your issue was that recovering the
>>> particular object being requested was unreasonably slow due to
>>> silliness in the recovery code which you disabled by disabling
>>> osd_recover_clone_overlap.
>>
>>
>> Yes and no. It's better now, but far from good or perfect. My VMs do not
>> crash anymore, but I still have a bunch of slow requests (just around
>> 10 messages) and still a VERY high I/O load on the disks during recovery.
>>
>>
>>> In cases where the primary osd is significantly behind, we do make one
>>> of the other osds primary during recovery in order to expedite
>>> requests (pgs in this state are shown as remapped).
>>
>>
>> Oh, I have never seen that, but at least in my case even 60s is a very
>> long time, and the OSD is under heavy stress during recovery. Is it
>> possible for me to set this value?
>>
>>
>> Stefan
>>
>>> -Sam
>>>
>>> On Wed, Aug 21, 2013 at 11:21 AM, Stefan Priebe <s.priebe@profihost.ag>
>>> wrote:
>>>>
>>>> On 21.08.2013 17:32, Samuel Just wrote:
>>>>
>>>>> Have you tried setting osd_recover_clone_overlap to false?  That
>>>>> seemed to help with Stefan's issue.
>>>>
>>>>
>>>>
>>>> This might sound a bit harsh, but maybe that is due to my limited English
>>>> skills ;-)
>>>>
>>>> I still think that Ceph's recovery system is broken by design. If an OSD
>>>> comes back (after being offline), all write requests for PGs where it is
>>>> primary are immediately targeted at this OSD. If it is not up to date for
>>>> a PG, it tries to recover that PG immediately, which costs 4MB per block.
>>>> If you have a lot of small writes all over your OSDs and PGs, you're in
>>>> trouble, as your OSD has to recover ALL its PGs immediately, or at least
>>>> lots of them, WHICH can't work. This is totally crazy.
>>>>
>>>> I think the right way would be:
>>>> 1.) if an OSD goes down, the replicas become primaries
>>>>
>>>> or
>>>>
>>>> 2.) an OSD which does not have an up-to-date PG should redirect to the OSD
>>>> holding the secondary or third replica.
>>>>
>>>> Both would allow a really smooth and slow recovery without any stress,
>>>> even under heavy 4K workloads like RBD-backed VMs.
>>>>
>>>> Thanks for reading!
>>>>
>>>> Greets Stefan
>>>>
>>>>
>>>>
>>>>> -Sam
>>>>>
>>>>> On Wed, Aug 21, 2013 at 8:28 AM, Mike Dawson <mike.dawson@cloudapt.com>
>>>>> wrote:
>>>>>>
>>>>>>
>>>>>> Sam/Josh,
>>>>>>
>>>>>> We upgraded from 0.61.7 to 0.67.1 during a maintenance window this
>>>>>> morning, hoping it would improve this situation, but there was no
>>>>>> appreciable change.
>>>>>>
>>>>>> One node in our cluster fsck'ed after a reboot and got a bit behind. Our
>>>>>> instances backed by RBD volumes were OK at that point, but once the node
>>>>>> booted fully and the OSDs started, all Windows instances with rbd volumes
>>>>>> experienced very choppy performance and were unable to ingest video
>>>>>> surveillance traffic and commit it to disk. Once the cluster got back to
>>>>>> HEALTH_OK, they resumed normal operation.
>>>>>>
>>>>>> I tried for a time with conservative recovery settings (osd max backfills
>>>>>> = 1, osd recovery op priority = 1, and osd recovery max active = 1). No
>>>>>> improvement for the guests. So I went to more aggressive settings to get
>>>>>> things moving faster. That decreased the duration of the outage.
>>>>>>
>>>>>> During the entire period of recovery/backfill, the network looked fine,
>>>>>> nowhere close to saturation. iowait on all drives looked fine as well.
>>>>>>
>>>>>> Any ideas?
>>>>>>
>>>>>> Thanks,
>>>>>> Mike Dawson
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 8/14/2013 3:04 AM, Stefan Priebe - Profihost AG wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> The same problem still occurs. I will need to check when I have time to
>>>>>>> gather logs again.
>>>>>>>
>>>>>>> On 14.08.2013 01:11, Samuel Just wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I'm not sure, but your logs did show that you had >16 recovery ops in
>>>>>>>> flight, so it's worth a try.  If it doesn't help, you should collect
>>>>>>>> the same set of logs and I'll look again.  Also, there are a few other
>>>>>>>> patches between 0.61.7 and current cuttlefish which may help.
>>>>>>>> -Sam
>>>>>>>>
>>>>>>>> On Tue, Aug 13, 2013 at 2:03 PM, Stefan Priebe - Profihost AG
>>>>>>>> <s.priebe@profihost.ag> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 13.08.2013 at 22:43, Samuel Just <sam.just@inktank.com> wrote:
>>>>>>>>>
>>>>>>>>>> I just backported a couple of patches from next to fix a bug where we
>>>>>>>>>> weren't respecting the osd_recovery_max_active config in some cases
>>>>>>>>>> (1ea6b56170fc9e223e7c30635db02fa2ad8f4b4e).  You can either try the
>>>>>>>>>> current cuttlefish branch or wait for a 0.61.8 release.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks! Are you sure that this is the issue? I don't believe so, but
>>>>>>>>> I'll give it a try. I already tested a branch from Sage where he fixed
>>>>>>>>> a race regarding max active some weeks ago. So active recovery was
>>>>>>>>> limited to 1, but the issue didn't go away.
>>>>>>>>>
>>>>>>>>> Stefan
>>>>>>>>>
>>>>>>>>>> -Sam
>>>>>>>>>>
>>>>>>>>>> On Mon, Aug 12, 2013 at 10:34 PM, Samuel Just
>>>>>>>>>> <sam.just@inktank.com>
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I got swamped today.  I should be able to look tomorrow.  Sorry!
>>>>>>>>>>> -Sam
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Aug 12, 2013 at 9:39 PM, Stefan Priebe - Profihost AG
>>>>>>>>>>> <s.priebe@profihost.ag> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Did you take a look?
>>>>>>>>>>>>
>>>>>>>>>>>> Stefan
>>>>>>>>>>>>
>>>>>>>>>>>> On 11.08.2013 at 05:50, Samuel Just
>>>>>>>>>>>> <sam.just@inktank.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Great!  I'll take a look on Monday.
>>>>>>>>>>>>> -Sam
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, Aug 10, 2013 at 12:08 PM, Stefan Priebe
>>>>>>>>>>>>> <s.priebe@profihost.ag> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Samuel,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 09.08.2013 23:44, Samuel Just wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think Stefan's problem is probably distinct from Mike's.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Stefan: Can you reproduce the problem with
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> debug osd = 20
>>>>>>>>>>>>>>> debug filestore = 20
>>>>>>>>>>>>>>> debug ms = 1
>>>>>>>>>>>>>>> debug optracker = 20
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> on a few osds (including the restarted osd), and upload those osd logs
>>>>>>>>>>>>>>> along with the ceph.log from before killing the osd until after the
>>>>>>>>>>>>>>> cluster becomes clean again?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Done. You'll find the logs in the cephdrop folder:
>>>>>>>>>>>>>> slow_requests_recovering_cuttlefish
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> osd.52 was the one recovering
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Greets,
>>>>>>>>>>>>>> Stefan
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe
>>>>>>>>>>>>> ceph-devel" in
>>>>>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>>>>>> More majordomo info at
>>>>>>>>>>>>> http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>

