From: Josh Durgin <josh.durgin@dreamhost.com>
To: Oliver Francke <Oliver.Francke@filoo.de>
Cc: ceph-devel@vger.kernel.org
Subject: Re: q. about rbd-header
Date: Wed, 14 Mar 2012 14:59:50 -0700
Message-ID: <4F6114D6.4000809@dreamhost.com>
In-Reply-To: <89CB135C-CB66-4240-B89C-EFDFCB8AACF4@filoo.de>

On 03/14/2012 01:49 PM, Oliver Francke wrote:
> Well,
>
> nobody able to shed some light on this?
> Did some math and found out how to fill the size bytes.

Sorry I didn't respond faster.

> But, one question never got answered:
>      - why, with busy VMs, is it so often the first block that is affected,
>        resulting in damaged grub loaders/partition tables/filesystems?
>        Is this some NULL/zero pointer thingy in case of a ceph failure?

My guess is that the first object is not the only one affected, but it's
where the loss of an object is most easily noticeable. If an object doesn't
exist, it's treated as being full of zeros, which might go undetected for a
long time if it's e.g. some temp or log file that's not reread and verified.
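
To illustrate the point, here is a minimal Python sketch. It is not librbd
code: the dict-backed store, the 4 MiB object size, and the helper names are
all made up for illustration. A missing object reads back as zeros, so losing
the object that holds the boot sector shows up at the next boot, while losing
an object that only held old log data still reads without error.

```python
OBJECT_SIZE = 4 * 1024 * 1024  # assumption: 4 MiB objects


def read_object(store, index, object_size=OBJECT_SIZE):
    """Return object `index` from `store`, zero-filled if short or missing."""
    data = store.get(index, b"")
    return data.ljust(object_size, b"\0")


def has_boot_signature(first_object):
    """Check for the 0x55 0xAA MBR signature at bytes 510-511."""
    return first_object[510:512] == b"\x55\xaa"


# Hypothetical image: object 0 holds a boot sector, object 7 holds log data.
boot_sector = b"\0" * 510 + b"\x55\xaa"
store = {0: boot_sector, 7: b"old log lines\n"}

assert has_boot_signature(read_object(store, 0))

# Lose both objects: reads still succeed, but return only zeros.
del store[0]
del store[7]
lost_boot = read_object(store, 0)
lost_log = read_object(store, 7)
```

After the loss, has_boot_signature(lost_boot) is false, so the VM fails to
boot right away, whereas lost_log is just a run of zeros that nothing checks.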

> If you demand some broken images… we have many of them to investigate,
> unfortunately.

We'd really like to find the root cause of the problem. One possibility 
is some bad interaction between osds running different versions. This 
caused one issue with recovery stxShadow saw yesterday, for example 
(http://tracker.newdream.net/issues/2132). Had you been doing rolling 
upgrades of osds before these problems appeared? If so, do you know 
which versions you had running concurrently?

Are your osds often restarting?

What we'd need to diagnose this are osd logs during recovery with:

debug osd = 20
debug ms = 1

Once you detect the problem, a log from each OSD that stores a replica of
the PG containing the bad/missing object should be enough.
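
To work out which object name to look for, a rough Python sketch of the naming
follows. It assumes the old-format image layout: data objects named
<prefix>.<object number as 12 hex digits>, and the header stored in an object
named <image name>.rbd. The default order of 22 (4 MiB objects) is also an
assumption, so use the order of your own image. The helper names are
hypothetical.

```python
def data_object_name(prefix, offset, order=22):
    """Name of the data object covering byte `offset` of the image.

    Assumes objects of 2**order bytes and a 12-hex-digit object number.
    """
    object_no = offset >> order
    return "%s.%012x" % (prefix, object_no)


def header_object_name(image):
    """Name of the header object, assuming the '<image>.rbd' convention."""
    return image + ".rbd"


# Example with a made-up prefix: the first two objects of an image.
first = data_object_name("rb.0.0", 0)
second = data_object_name("rb.0.0", 4 * 1024 * 1024)
```

The resulting name can then be passed to "ceph osd map <pool> <object>", which
should show the PG and the OSDs that the object maps to.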

And just to make sure, you aren't writing to these rbd images from 
multiple places, right? This wouldn't cause the missing header objects, 
but is likely to cause corruption of the image data. This could happen, 
for example, by rolling an image back to a snapshot while a vm is 
running on it.

Josh

> Maybe this sounds a bit harsh, but after the 5th night shift trying to repair
> images and keep customers calm, I think this is forgivable.
>
> Oliver.
>
> On 14.03.2012 at 16:05, Oliver Francke wrote:
>
>> Hey,
>>
>> anybody out there who could explain the structure of an rbd header? After
>> the last crash we have about 10 images with a:
>>    2012-03-14 15:22:47.998790 7f45a61e3760 librbd: Error reading
>> header: 2 No such file or directory
>> error opening image vm-266-disk-1.rbd: 2 No such file or directory
>> ... error?
>> I understand the "rb.x.y" prefix and the 2^0x16 block size. But
>> the size/count encoding is not intuitive ;)
>>
>> Besides one file, where I "created" a header and put it back into the pool
>> via "rados put", and got some files back, many of the other images with
>> lost headers have different sizes.
>>
>> We had bad luck again: too many crashed VMs, too much data loss...
>>
>> Comments welcome ;)
>>
>> Oliver.
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html