Is Ceph recovery able to handle massive crash

* Is Ceph recovery able to handle massive crash
@ 2013-01-05 12:19 Denis Fondras
  2013-01-05 15:24 ` Gregory Farnum
  2013-01-07 17:25 ` Denis Fondras
  0 siblings, 2 replies; 14+ messages in thread
From: Denis Fondras @ 2013-01-05 12:19 UTC (permalink / raw)
  To: ceph-devel@vger.kernel.org

Hello all,

I'm using Ceph 0.55.1 on Debian Wheezy (1 mon, 1 mds and 3 osds over
btrfs), and every once in a while an OSD process crashes (almost never
the same OSD twice).
This time two OSDs crashed in a row, leaving me with only one replica. I
was able to bring the two crashed OSDs back up and recovery started.
Unfortunately, the "source" OSD crashed during recovery and now I have
some lost PGs.

If I manage to bring the primary OSD up again, can I expect the lost PGs
to be recovered too?

Regards,
Denis

* Re: Is Ceph recovery able to handle massive crash
  2013-01-05 12:19 Is Ceph recovery able to handle massive crash Denis Fondras
@ 2013-01-05 15:24 ` Gregory Farnum
  2013-01-07 17:25 ` Denis Fondras
  1 sibling, 0 replies; 14+ messages in thread
From: Gregory Farnum @ 2013-01-05 15:24 UTC (permalink / raw)
  To: Denis Fondras; +Cc: ceph-devel@vger.kernel.org

On Saturday, January 5, 2013 at 4:19 AM, Denis Fondras wrote:
> Hello all,
> 
> I'm using Ceph 0.55.1 on Debian Wheezy (1 mon, 1 mds and 3 osds over
> btrfs), and every once in a while an OSD process crashes (almost never
> the same OSD twice).
> This time two OSDs crashed in a row, leaving me with only one replica. I
> was able to bring the two crashed OSDs back up and recovery started.
> Unfortunately, the "source" OSD crashed during recovery and now I have
> some lost PGs.
>
> If I manage to bring the primary OSD up again, can I expect the lost PGs
> to be recovered too?


Yes, it will recover just fine. Ceph is strictly consistent and so you won't lose any data unless you lose the disks.
-Greg


* Re: Is Ceph recovery able to handle massive crash
  2013-01-05 12:19 Is Ceph recovery able to handle massive crash Denis Fondras
  2013-01-05 15:24 ` Gregory Farnum
@ 2013-01-07 17:25 ` Denis Fondras
  2013-01-07 21:30   ` Gregory Farnum
  1 sibling, 1 reply; 14+ messages in thread
From: Denis Fondras @ 2013-01-07 17:25 UTC (permalink / raw)
  To: ceph-devel@vger.kernel.org

Hello all,

> I'm using Ceph 0.55.1 on Debian Wheezy (1 mon, 1 mds and 3 osds over
> btrfs), and every once in a while an OSD process crashes (almost never
> the same OSD twice).
> This time two OSDs crashed in a row, leaving me with only one replica. I
> was able to bring the two crashed OSDs back up and recovery started.
> Unfortunately, the "source" OSD crashed during recovery and now I have
> some lost PGs.
>
> If I manage to bring the primary OSD up again, can I expect the lost PGs
> to be recovered too?
>

OK, so it seems I can't bring my primary OSD back to life :-(

---8<---------------
health HEALTH_WARN 72 pgs incomplete; 72 pgs stuck inactive; 72 pgs stuck unclean
monmap e1: 1 mons at {a=192.168.0.132:6789/0}, election epoch 1, quorum 0 a
osdmap e1130: 3 osds: 2 up, 2 in
  pgmap v1567492: 624 pgs: 552 active+clean, 72 incomplete; 1633 GB data, 4766 GB used, 3297 GB / 8383 GB avail
  mdsmap e127: 1/1/1 up {0=a=up:active}

2013-01-07 18:11:10.852673 mon.0 [INF] pgmap v1567492: 624 pgs: 552 active+clean, 72 incomplete; 1633 GB data, 4766 GB used, 3297 GB / 8383 GB avail
---8<---------------

When I run "rbd list", I can see all my images.
When I run "rbd map", I can map only a few of them, and when I try to
mount the devices, none of them mounts (the mount process hangs and I
cannot even ^C it).

Is there something I can try?

Thank you in advance,
Denis

* Re: Is Ceph recovery able to handle massive crash
  2013-01-07 17:25 ` Denis Fondras
@ 2013-01-07 21:30   ` Gregory Farnum
  2013-01-08  8:44     ` Denis Fondras
  0 siblings, 1 reply; 14+ messages in thread
From: Gregory Farnum @ 2013-01-07 21:30 UTC (permalink / raw)
  To: Denis Fondras; +Cc: ceph-devel@vger.kernel.org

On Monday, January 7, 2013 at 9:25 AM, Denis Fondras wrote:
> Hello all,
> 
> Ok, so it seems I can't bring back to life my primary OSD :-(
> 
> ---8<---------------
> health HEALTH_WARN 72 pgs incomplete; 72 pgs stuck inactive; 72 pgs 
> stuck unclean
> monmap e1: 1 mons at {a=192.168.0.132:6789/0}, election epoch 1, quorum 0 a
> osdmap e1130: 3 osds: 2 up, 2 in
> pgmap v1567492: 624 pgs: 552 active+clean, 72 incomplete; 1633 GB 
> data, 4766 GB used, 3297 GB / 8383 GB avail
> mdsmap e127: 1/1/1 up {0=a=up:active}
> 
> 2013-01-07 18:11:10.852673 mon.0 [INF] pgmap v1567492: 624 pgs: 552 
> active+clean, 72 incomplete; 1633 GB data, 4766 GB used, 3297 GB / 8383 
> GB avail
> ---8<---------------
> 
> When I "rbd list", I can see all my images.
> When I do "rbd map", I can map only a few of them and when I mount the 
> devices, none can mount (the mount process hangs and I cannot even ^C 
> the process).
> 
> Is there something I can try ?

What's wrong with your primary OSD? In general they shouldn't really be crashing that frequently and if you've got a new bug we'd like to diagnose and fix it.

If that can't be done (or it's a hardware failure or something), you can mark the OSD lost, but that might lose data and then you will be sad.
-Greg


* Re: Is Ceph recovery able to handle massive crash
  2013-01-07 21:30   ` Gregory Farnum
@ 2013-01-08  8:44     ` Denis Fondras
  2013-01-08 12:57       ` Denis Fondras
  2013-01-08 17:09       ` Gregory Farnum
  0 siblings, 2 replies; 14+ messages in thread
From: Denis Fondras @ 2013-01-08  8:44 UTC (permalink / raw)
  To: ceph-devel@vger.kernel.org

Hello,

I tried to upgrade to 0.56.1 this morning as it could help with 
recovery. No luck so far...

> What's wrong with your primary OSD?

I don't know what's really wrong. The disk seems fine.

> In general they shouldn't really be crashing that frequently and if you've got a new bug we'd like to diagnose and fix it.

I don't know if it is hardware related (it seems not, as I tested each
part). So it might be an issue with btrfs (Linux 3.5), Ceph, or some
other piece of software.
However, I would like to get this resolved. Just tell me what you need
and what I can do.

> If that can't be done (or it's a hardware failure or something), you can mark the OSD lost, but that might lose data and then you will be sad.

Well, if I must lose data, I'd really like to try everything else first :)

Denis

* Re: Is Ceph recovery able to handle massive crash
  2013-01-08  8:44     ` Denis Fondras
@ 2013-01-08 12:57       ` Denis Fondras
  2013-01-08 13:10         ` Wido den Hollander
  2013-01-08 13:51         ` Moore, Shawn M
  2013-01-08 17:09       ` Gregory Farnum
  1 sibling, 2 replies; 14+ messages in thread
From: Denis Fondras @ 2013-01-08 12:57 UTC (permalink / raw)
  To: ceph-devel@vger.kernel.org

Hello,

I'm wondering whether I can take every "rb.0.8e10.3e2219d7.*" object
from the OSD drive, cat them together, and end up with a usable raw
volume from which I could recover my data?

Everything seems to be there, but I don't know the order of the RBD
objects. Are the last bytes of the file name the offset of the block?

Regards,
Denis

* Re: Is Ceph recovery able to handle massive crash
  2013-01-08 12:57       ` Denis Fondras
@ 2013-01-08 13:10         ` Wido den Hollander
  2013-01-08 13:36           ` Wido den Hollander
  2013-01-08 13:51         ` Moore, Shawn M
  1 sibling, 1 reply; 14+ messages in thread
From: Wido den Hollander @ 2013-01-08 13:10 UTC (permalink / raw)
  To: Denis Fondras; +Cc: ceph-devel@vger.kernel.org

On 01/08/2013 01:57 PM, Denis Fondras wrote:
> Hello,
>
> I'm wondering if I can get every "rb.0.8e10.3e2219d7.*" from the OSD
> drive and cat them together and get back a usable raw volume from which
> I could get back my data ?
>

Yes, that is doable. The only problem is that RBD is sparse, so you'd
have to fill each missing object with 4MB of zeroes.

But yes, it's doable if you gather all the objects and fill the rest up
with zeroes.
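
As a rough illustration of how an object name relates to a position in the volume, here is a minimal Python sketch. It assumes object names of the form <prefix>.<12 hex digits>, where the suffix is the object index (not a byte offset), and the default 4MB object size. The helper name object_offset is made up for this example, and the file names on the OSD's disk may carry extra decoration around the object name that this does not handle.

```python
# Sketch only: map an RBD object name to the byte offset it covers.
# Assumes the default 4MB object size; an image created with a
# different object size would need a different value here.
OBJECT_SIZE = 4 * 1024 * 1024


def object_offset(name, object_size=OBJECT_SIZE):
    """Return the byte offset in the image where object `name` starts.

    The text after the last dot is read as a hexadecimal object index,
    so the offset is index * object_size.
    """
    suffix = name.rsplit(".", 1)[1]
    return int(suffix, 16) * object_size
```

Under these assumptions, an index with no matching object corresponds to a 4MB hole at that offset.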

> Everything seems to be there but I don't know the order of the rbd
> objects. Are the last bytes of the file name the offset of the block ?
>

There was a quick perl command for this to generate all the suffixes, 
but I can't seem to find it right now.

Wido

> Regards,
> Denis
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Is Ceph recovery able to handle massive crash
  2013-01-08 13:10         ` Wido den Hollander
@ 2013-01-08 13:36           ` Wido den Hollander
  0 siblings, 0 replies; 14+ messages in thread
From: Wido den Hollander @ 2013-01-08 13:36 UTC (permalink / raw)
  To: Denis Fondras; +Cc: ceph-devel@vger.kernel.org

On 01/08/2013 02:10 PM, Wido den Hollander wrote:
>
> There was a quick perl command for this to generate all the suffixes,
> but I can't seem to find it right now.
>

You could do something like this to generate the names of all the
objects you should need; any that don't exist should be filled with 4MB
of zeroes.

perl -e 'while ($s < (SIZE_IN_MB / 4)) { printf "BLOCK_PREFIX.%012x\n", $s; $s++}'

SIZE_IN_MB is the size of the block device in MB, and BLOCK_PREFIX can
be something like "rb.0.1016.238e1f29".

Wido
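
For readers who prefer Python, the same enumeration might look like the sketch below, which also reports which expected objects are absent from a directory. It assumes 4MB objects and that the files are named exactly like the objects; expected_objects and missing_objects are hypothetical helper names, not part of any Ceph tool.

```python
import math
import os
from typing import List

OBJECT_SIZE_MB = 4  # same 4MB object size the one-liner assumes


def expected_objects(prefix: str, size_in_mb: int) -> List[str]:
    """List every object name an image of size_in_mb should map to."""
    count = math.ceil(size_in_mb / OBJECT_SIZE_MB)
    return [f"{prefix}.{i:012x}" for i in range(count)]


def missing_objects(directory: str, prefix: str, size_in_mb: int) -> List[str]:
    """Return the expected object names that have no file in directory."""
    present = set(os.listdir(directory))
    return [n for n in expected_objects(prefix, size_in_mb) if n not in present]
```

The names returned by missing_objects are the ones that would have to be treated as 4MB of zeroes.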

* RE: Is Ceph recovery able to handle massive crash
  2013-01-08 12:57       ` Denis Fondras
  2013-01-08 13:10         ` Wido den Hollander
@ 2013-01-08 13:51         ` Moore, Shawn M
  2013-01-08 14:53           ` Denis Fondras
  1 sibling, 1 reply; 14+ messages in thread
From: Moore, Shawn M @ 2013-01-08 13:51 UTC (permalink / raw)
  To: Denis Fondras, ceph-devel@vger.kernel.org

If you know the prefix (which it seems you do) and the original size of the rbd, you should be able to use my utility.

https://github.com/smmoore/ceph/blob/master/rbd_restore.sh

You will need all the rados files in the current working directory you execute the script from.  We have used it many times so far and it works for us.  I have not had any outside feedback on its usage.  But if you are truly missing any files, it will seek over them and your rbd might be corrupt.  Likewise, if a file itself is damaged, it will write what is in that file to the rebuild.
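
The seek-over-missing-files approach can be sketched in Python as follows. This is not the script linked above, only an illustration under stated assumptions: the object files sit in one directory, are named exactly <prefix>.<12 hex digits>, and use the default 4MB object size. The function rebuild_image and its parameters are hypothetical.

```python
import os
import re

OBJECT_SIZE = 4 * 1024 * 1024  # assumed default RBD object size
HEX_SUFFIX = re.compile(r"[0-9a-f]{12}")


def rebuild_image(src_dir, prefix, image_size, out_path,
                  object_size=OBJECT_SIZE):
    """Write each object found in src_dir at its offset in out_path.

    Missing objects are simply skipped, which leaves holes that read
    back as zeroes. Returns the number of objects written.
    """
    written = 0
    with open(out_path, "wb") as out:
        # Set the final size up front so trailing holes are included.
        out.truncate(image_size)
        for name in sorted(os.listdir(src_dir)):
            head, _, suffix = name.rpartition(".")
            if head != prefix or not HEX_SUFFIX.fullmatch(suffix):
                continue  # not an object of this image
            offset = int(suffix, 16) * object_size
            if offset >= image_size:
                continue  # lies beyond the stated image size
            with open(os.path.join(src_dir, name), "rb") as obj:
                # Never write past the end of the image.
                data = obj.read()[: image_size - offset]
            out.seek(offset)
            out.write(data)
            written += 1
    return written
```

As with any rebuild of this kind, a damaged object file is copied as-is, so the result is only as good as the files it was built from.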

HTH,
Shawn

* Re: Is Ceph recovery able to handle massive crash
  2013-01-08 13:51         ` Moore, Shawn M
@ 2013-01-08 14:53           ` Denis Fondras
  0 siblings, 0 replies; 14+ messages in thread
From: Denis Fondras @ 2013-01-08 14:53 UTC (permalink / raw)
  To: ceph-devel@vger.kernel.org

On 08/01/2013 14:51, Moore, Shawn M wrote:
> If you know the prefix (which it seems you do) and the original size of the rbd, you should be able to use my utility.
>
> https://github.com/smmoore/ceph/blob/master/rbd_restore.sh
>
> You will need all the rados files in the current working directory you execute the script from.  We have used it many times so far and it works for us.  I have not had any outside feedback on its usage.  But if you are truly missing any files, it will seek over them and your rbd might be corrupt.  Likewise, if a file itself is damaged, it will write what is in that file to the rebuild.
>

Thank you very much Shawn, it was your script that gave me the idea of
rebuilding the RBD from files ;-)

I wrote my own script, which finds the RBD files itself.

Denis

* Re: Is Ceph recovery able to handle massive crash
  2013-01-08  8:44     ` Denis Fondras
  2013-01-08 12:57       ` Denis Fondras
@ 2013-01-08 17:09       ` Gregory Farnum
  2013-01-08 19:44         ` Denis Fondras
  1 sibling, 1 reply; 14+ messages in thread
From: Gregory Farnum @ 2013-01-08 17:09 UTC (permalink / raw)
  To: Denis Fondras; +Cc: ceph-devel@vger.kernel.org

On Tue, Jan 8, 2013 at 12:44 AM, Denis Fondras <ceph@ledeuns.net> wrote:
>> What's wrong with your primary OSD?
>
>
> I don't know what's really wrong. The disk seems fine.

What error message do you get when you try and turn it on? If the
daemon is crashing, what is the backtrace?
-Greg

* Re: Is Ceph recovery able to handle massive crash
  2013-01-08 17:09       ` Gregory Farnum
@ 2013-01-08 19:44         ` Denis Fondras
  2013-01-08 23:36           ` Gregory Farnum
  0 siblings, 1 reply; 14+ messages in thread
From: Denis Fondras @ 2013-01-08 19:44 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel@vger.kernel.org

Hello,

> What error message do you get when you try and turn it on? If the
> daemon is crashing, what is the backtrace?

The daemon is crashing. Here is the full log if you want to take a look:
http://vps.ledeuns.net/ceph-osd.0.log.gz

The RBD rebuild script helped to get the data back. I will now try to 
rebuild a Ceph cluster and do some more tests.

Denis

* Re: Is Ceph recovery able to handle massive crash
  2013-01-08 19:44         ` Denis Fondras
@ 2013-01-08 23:36           ` Gregory Farnum
  2013-01-09  8:30             ` Denis Fondras
  0 siblings, 1 reply; 14+ messages in thread
From: Gregory Farnum @ 2013-01-08 23:36 UTC (permalink / raw)
  To: Denis Fondras; +Cc: ceph-devel@vger.kernel.org

On Tue, Jan 8, 2013 at 11:44 AM, Denis Fondras <ceph@ledeuns.net> wrote:
> Hello,
>
>
>> What error message do you get when you try and turn it on? If the
>> daemon is crashing, what is the backtrace?
>
>
> The daemon is crashing. Here is the full log if you want to take a look :
> http://vps.ledeuns.net/ceph-osd.0.log.gz
>
> The RBD rebuild script helped to get the data back. I will now try to
> rebuild a Ceph cluster and do some more tests.
>
> Denis

It looks like it's taking approximately forever for writes to complete
to disk; it's shutting down because threads are going off to write and
not coming back. If you set "osd op thread timeout = 60" (or 120) it
might manage to churn through, but I'd look into why the writes are
taking so long — bad disk, fragmented btrfs filesystem, or something
else.
-Greg

* Re: Is Ceph recovery able to handle massive crash
  2013-01-08 23:36           ` Gregory Farnum
@ 2013-01-09  8:30             ` Denis Fondras
  0 siblings, 0 replies; 14+ messages in thread
From: Denis Fondras @ 2013-01-09  8:30 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel@vger.kernel.org

Hello,

On 09/01/2013 00:36, Gregory Farnum wrote:
>
> It looks like it's taking approximately forever for writes to complete
> to disk; it's shutting down because threads are going off to write and
> not coming back. If you set "osd op thread timeout = 60" (or 120) it
> might manage to churn through, but I'd look into why the writes are
> taking so long — bad disk, fragmented btrfs filesystem, or something
> else.


I believe it is a btrfs issue: when I run mkfs.btrfs on the volume and
rejoin it to the cluster, it works (the OSD stays up).

Denis