From: Mike Dawson <mike.dawson@cloudapt.com>
To: Samuel Just <sam.just@inktank.com>
Cc: Yann ROBIN <yann.robin@youscribe.com>,
Stefan Priebe - Profihost AG <s.priebe@profihost.ag>,
"josh.durgin@inktank.com" <josh.durgin@inktank.com>,
"ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
Subject: Re: still recovery issues with cuttlefish
Date: Wed, 21 Aug 2013 13:55:30 -0400
Message-ID: <5214FF12.1070903@cloudapt.com>
In-Reply-To: <CA+4uBUaEdV18VR3aT1rkfKp0b9NtDabpNChWO0DCRaDBFhH8Vw@mail.gmail.com>
Sam,
Tried it. Injected with
'ceph tell osd.* injectargs -- --no_osd_recover_clone_overlap',
then stopped one OSD for ~1 minute.
Upon restart, all my Windows VMs had issues until the cluster returned to
HEALTH_OK.
The recovery was taking an abnormally long time, so I reverted away from
--no_osd_recover_clone_overlap after about 10 minutes to get back to HEALTH_OK.
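In case it helps anyone repeat the test, here is a minimal sketch of the toggle as a small script. The disable form is the command quoted above. The re-enable form (--osd_recover_clone_overlap=true) and the helper names are assumptions, so check them against your Ceph release before running with dry_run=False.

```python
import shlex
import subprocess

# Disable form copied from the command quoted above. The re-enable form is an
# assumption and may need adjusting for your Ceph release.
DISABLE_ARGS = ["--", "--no_osd_recover_clone_overlap"]
ENABLE_ARGS = ["--osd_recover_clone_overlap=true"]  # assumed syntax


def injectargs_cmd(enable, target="osd.*"):
    """Build the argv that toggles osd_recover_clone_overlap (hypothetical helper)."""
    extra = ENABLE_ARGS if enable else DISABLE_ARGS
    return ["ceph", "tell", target, "injectargs"] + extra


def run(cmd, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(" ".join(shlex.quote(part) for part in cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run(injectargs_cmd(enable=False))  # turn clone overlap off before the test
    run(injectargs_cmd(enable=True))   # revert afterwards
```

The dry-run default only prints the commands, so nothing touches the cluster until you have checked them.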
Interestingly, a Raring guest running a different video surveillance
package proceeded without any issue whatsoever.
Here is an image of the traffic to some of these Windows guests:
http://www.gammacode.com/upload/rbd-hang-with-clone-overlap.jpg
Ceph was outside of HEALTH_OK between ~12:55 and 13:10. Most of these
instances rebooted due to an app error caused by the I/O hang shortly
after 13:10.
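To line the traffic graph up against cluster state, one option is to log the output of 'ceph health' with a timestamp and reduce it to windows. This is a rough sketch that assumes such samples were collected. The function name and the sample values are illustrative only; the times simply mirror the approximate window above and are not measured data.

```python
from datetime import datetime


def unhealthy_windows(samples):
    """Return (start, end) pairs during which status was not HEALTH_OK.

    samples: iterable of (datetime, status_string) in chronological order.
    A window still open at the last sample is closed at that sample's time.
    """
    windows = []
    start = None
    last_ts = None
    for ts, status in samples:
        ok = status.startswith("HEALTH_OK")
        if not ok and start is None:
            start = ts                      # cluster just left HEALTH_OK
        elif ok and start is not None:
            windows.append((start, ts))     # cluster is healthy again
            start = None
        last_ts = ts
    if start is not None:
        windows.append((start, last_ts))
    return windows


# Illustrative samples only (not real log data).
SAMPLES = [
    (datetime(2013, 8, 21, 12, 50), "HEALTH_OK"),
    (datetime(2013, 8, 21, 12, 55), "HEALTH_WARN"),
    (datetime(2013, 8, 21, 13, 0), "HEALTH_WARN"),
    (datetime(2013, 8, 21, 13, 10), "HEALTH_OK"),
]

if __name__ == "__main__":
    for begin, end in unhealthy_windows(SAMPLES):
        print(begin.time(), "to", end.time(), "duration", end - begin)
```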
These Windows instances are booted as COW clones from a Glance image
using Cinder. They also have a second RBD volume for bulk storage. I'm
using qemu 1.5.2.
Thanks,
Mike
On 8/21/2013 1:12 PM, Samuel Just wrote:
> Ah, thanks for the correction.
> -Sam
>
> On Wed, Aug 21, 2013 at 9:25 AM, Yann ROBIN <yann.robin@youscribe.com> wrote:
>> It's osd recover clone overlap (see http://tracker.ceph.com/issues/5401)
>>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Samuel Just
>> Sent: Wednesday, 21 August 2013 17:33
>> To: Mike Dawson
>> Cc: Stefan Priebe - Profihost AG; josh.durgin@inktank.com; ceph-devel@vger.kernel.org
>> Subject: Re: still recovery issues with cuttlefish
>>
>> Have you tried setting osd_recovery_clone_overlap to false? That seemed to help with Stefan's issue.
>> -Sam
>>
>> On Wed, Aug 21, 2013 at 8:28 AM, Mike Dawson <mike.dawson@cloudapt.com> wrote:
>>> Sam/Josh,
>>>
>>> We upgraded from 0.61.7 to 0.67.1 during a maintenance window this
>>> morning, hoping it would improve this situation, but there was no appreciable change.
>>>
>>> One node in our cluster fsck'ed after a reboot and got a bit behind.
>>> Our instances backed by RBD volumes were OK at that point, but once
>>> the node booted fully and the OSDs started, all Windows instances with
>>> rbd volumes experienced very choppy performance and were unable to
>>> ingest video surveillance traffic and commit it to disk. Once the
>>> cluster got back to HEALTH_OK, they resumed normal operation.
>>>
>>> I tried for a time with conservative recovery settings (osd max
>>> backfills = 1, osd recovery op priority = 1, and osd recovery max
>>> active = 1). No improvement for the guests. So I went to more
>>> aggressive settings to get things moving faster. That decreased the duration of the outage.
>>>
>>> During the entire period of recovery/backfill, the network looked
>>> fine, nowhere close to saturation. iowait on all drives looked fine as well.
>>>
>>> Any ideas?
>>>
>>> Thanks,
>>> Mike Dawson
>>>
>>>
>>>
>>> On 8/14/2013 3:04 AM, Stefan Priebe - Profihost AG wrote:
>>>>
>>>> The same problem still occurs. I will need to check when I have time to
>>>> gather logs again.
>>>>
>>>> On 14.08.2013 01:11, Samuel Just wrote:
>>>>>
>>>>> I'm not sure, but your logs did show that you had >16 recovery ops
>>>>> in flight, so it's worth a try. If it doesn't help, you should
>>>>> collect the same set of logs and I'll look again. Also, there are a few
>>>>> other patches between 61.7 and current cuttlefish which may help.
>>>>> -Sam
>>>>>
>>>>> On Tue, Aug 13, 2013 at 2:03 PM, Stefan Priebe - Profihost AG
>>>>> <s.priebe@profihost.ag> wrote:
>>>>>>
>>>>>>
>>>>>> On 13.08.2013 at 22:43, Samuel Just <sam.just@inktank.com> wrote:
>>>>>>
>>>>>>> I just backported a couple of patches from next to fix a bug where
>>>>>>> we weren't respecting the osd_recovery_max_active config in some
>>>>>>> cases (1ea6b56170fc9e223e7c30635db02fa2ad8f4b4e). You can either
>>>>>>> try the current cuttlefish branch or wait for a 61.8 release.
>>>>>>
>>>>>>
>>>>>> Thanks! Are you sure that this is the issue? I don't believe so,
>>>>>> but I'll give it a try. I already tested a branch from Sage where
>>>>>> he fixed a race regarding max active some weeks ago. So active
>>>>>> recovery was capped at 1, but the issue didn't go away.
>>>>>>
>>>>>> Stefan
>>>>>>
>>>>>>> -Sam
>>>>>>>
>>>>>>> On Mon, Aug 12, 2013 at 10:34 PM, Samuel Just
>>>>>>> <sam.just@inktank.com>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> I got swamped today. I should be able to look tomorrow. Sorry!
>>>>>>>> -Sam
>>>>>>>>
>>>>>>>> On Mon, Aug 12, 2013 at 9:39 PM, Stefan Priebe - Profihost AG
>>>>>>>> <s.priebe@profihost.ag> wrote:
>>>>>>>>>
>>>>>>>>> Did you take a look?
>>>>>>>>>
>>>>>>>>> Stefan
>>>>>>>>>
>>>>>>>>> On 11.08.2013 at 05:50, Samuel Just <sam.just@inktank.com> wrote:
>>>>>>>>>
>>>>>>>>>> Great! I'll take a look on Monday.
>>>>>>>>>> -Sam
>>>>>>>>>>
>>>>>>>>>> On Sat, Aug 10, 2013 at 12:08 PM, Stefan Priebe
>>>>>>>>>> <s.priebe@profihost.ag> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Samuel,
>>>>>>>>>>>
>>>>>>>>>>> On 09.08.2013 23:44, Samuel Just wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I think Stefan's problem is probably distinct from Mike's.
>>>>>>>>>>>>
>>>>>>>>>>>> Stefan: Can you reproduce the problem with
>>>>>>>>>>>>
>>>>>>>>>>>> debug osd = 20
>>>>>>>>>>>> debug filestore = 20
>>>>>>>>>>>> debug ms = 1
>>>>>>>>>>>> debug optracker = 20
>>>>>>>>>>>>
>>>>>>>>>>>> on a few osds (including the restarted osd), and upload those
>>>>>>>>>>>> osd logs along with the ceph.log from before killing the osd
>>>>>>>>>>>> until after the cluster becomes clean again?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Done. You'll find the logs in the cephdrop folder:
>>>>>>>>>>> slow_requests_recovering_cuttlefish
>>>>>>>>>>>
>>>>>>>>>>> osd.52 was the one recovering
>>>>>>>>>>>
>>>>>>>>>>> Thanks!
>>>>>>>>>>>
>>>>>>>>>>> Greets,
>>>>>>>>>>> Stefan
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe
>>>>>>>>>> ceph-devel" in the body of a message to
>>>>>>>>>> majordomo@vger.kernel.org More majordomo info at
>>>>>>>>>> http://vger.kernel.org/majordomo-info.html
>>>>>>>
>>>>>
>>>
>>
>>
Thread overview: 40+ messages
2013-08-01 8:22 still recovery issues with cuttlefish Stefan Priebe - Profihost AG
2013-08-01 14:50 ` Andrey Korolyov
2013-08-01 18:38 ` Samuel Just
2013-08-02 17:56 ` Andrey Korolyov
2013-08-01 18:34 ` Samuel Just
2013-08-01 18:34 ` Stefan Priebe
2013-08-01 18:36 ` Samuel Just
2013-08-01 18:36 ` Samuel Just
2013-08-01 18:46 ` Stefan Priebe
2013-08-01 18:54 ` Mike Dawson
2013-08-01 19:07 ` Stefan Priebe
2013-08-01 21:23 ` Samuel Just
2013-08-02 7:44 ` Stefan Priebe
2013-08-02 17:35 ` Samuel Just
2013-08-02 18:16 ` Stefan Priebe
2013-08-02 18:21 ` Samuel Just
2013-08-02 18:46 ` Stefan Priebe
2013-08-08 14:05 ` Mike Dawson
2013-08-08 15:43 ` Oliver Francke
2013-08-08 18:13 ` Stefan Priebe
2013-08-09 21:44 ` Samuel Just
2013-08-10 19:08 ` Stefan Priebe
2013-08-11 3:50 ` Samuel Just
2013-08-13 4:39 ` Stefan Priebe - Profihost AG
2013-08-13 5:34 ` Samuel Just
2013-08-13 20:43 ` Samuel Just
2013-08-13 21:03 ` Stefan Priebe - Profihost AG
2013-08-13 23:11 ` Samuel Just
2013-08-14 7:04 ` Stefan Priebe - Profihost AG
2013-08-21 15:28 ` Mike Dawson
2013-08-21 15:32 ` Samuel Just
2013-08-21 16:25 ` Yann ROBIN
2013-08-21 17:12 ` Samuel Just
2013-08-21 17:55 ` Mike Dawson [this message]
2013-08-21 18:05 ` Samuel Just
2013-08-21 18:21 ` Stefan Priebe
2013-08-21 19:13 ` Samuel Just
2013-08-21 19:37 ` Stefan Priebe
2013-08-22 3:34 ` Samuel Just
2013-08-22 7:41 ` Stefan Priebe - Profihost AG