HEALTH_WARNING

* HEALTH_WARNING
       [not found] <835540127.13427.1301716690785.JavaMail.root@mail.linserv.se>
@ 2011-04-02  3:59 ` Martin Wilderoth
  2011-04-02  8:22   ` HEALTH_WARNING Wido den Hollander
  0 siblings, 1 reply; 8+ messages in thread
From: Martin Wilderoth @ 2011-04-02  3:59 UTC (permalink / raw)
  To: ceph-devel

Hello,

One of my hosts (the one running osd2 and osd3) ran out of disk space on the
root file system (log files), so I restarted ceph. I discovered the low disk
space during the restart.

ceph health gives a message like this

HEALTH_WARN osdmonitor: num_osds = 6, num_up_osds = 4, num_in_osds = 4 Some PGs are: degraded,peering

Now osd.1 is dead; all the others are running.

How do I get the running one up and in? And how do I know which OSD it is?

How do I recover the dead one?

 Regards Martin
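
The counters in the health line above already say how many OSDs are affected, though not which ones. A minimal Python sketch, assuming the summary keeps the `num_* = N` layout quoted in this message (the function names are made up for illustration):

```python
import re


def parse_osd_counts(health_line):
    """Collect every `num_xxx = N` counter found in a health summary line."""
    pairs = re.findall(r"(num_\w+)\s*=\s*(\d+)", health_line)
    return {name: int(value) for name, value in pairs}


def summarize(health_line):
    """Return how many OSDs are not up and how many are not in."""
    counts = parse_osd_counts(health_line)
    return {
        "down": counts["num_osds"] - counts["num_up_osds"],
        "out": counts["num_osds"] - counts["num_in_osds"],
    }


# The exact line reported in this thread.
line = ("HEALTH_WARN osdmonitor: num_osds = 6, num_up_osds = 4, "
        "num_in_osds = 4 Some PGs are: degraded,peering")
print(summarize(line))  # {'down': 2, 'out': 2}
```

Here two of the six OSDs are neither up nor in, so the summary alone cannot identify them; a per-OSD listing is needed for that.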

* Re: HEALTH_WARNING
  2011-04-02  3:59 ` HEALTH_WARNING Martin Wilderoth
@ 2011-04-02  8:22   ` Wido den Hollander
  0 siblings, 0 replies; 8+ messages in thread
From: Wido den Hollander @ 2011-04-02  8:22 UTC (permalink / raw)
  To: Martin Wilderoth; +Cc: ceph-devel

Hi,

On Sat, 2011-04-02 at 05:59 +0200, Martin Wilderoth wrote:
> Hello,
> 
> One of my hosts (the one running osd2 and osd3) ran out of disk space on the
> root file system (log files), so I restarted ceph. I discovered the low disk
> space during the restart.
> 

Do you have separate partitions for your OSD data? Or do you have one
big / partition? I'd recommend a separate partition for your OSDs.

> ceph health gives a message like this
> 
> HEALTH_WARN osdmonitor: num_osds = 6, num_up_osds = 4, num_in_osds = 4 Some PGs are: degraded,peering
> 
> Now osd.1 is dead; all the others are running.
>
> How do I get the running one up and in? And how do I know which OSD it is?
> 

$ ceph osd dump -o -

That should tell you which OSD is down/out.

> How do I recover the dead one?
> 

Normally starting the OSD would be enough. Look closely though, you
might have hit a bug which caused the OSD to crash. If so, there should
be a file called "core" in / which has a core-dump and could tell why
the OSD crashed:

$ gdb /usr/bin/cosd /core

Make sure you have the debug symbols (-dbg packages) installed when
doing so.

If you then monitor 'ceph -w', you should see the cluster recover, and
all OSDs should be up & in.

Wido
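
The dump output can also be filtered by a script rather than read by eye. The Python sketch below assumes each OSD appears on its own line as an `osdN` (or `osd.N`) name followed by an up/down word and an in/out word. That layout and the SAMPLE text are assumptions for illustration, not output captured from this cluster, so the pattern may need adjusting to what your version prints.

```python
import re

# Assumed line layout: "<osd name> <up|down> <in|out> ...".
OSD_LINE = re.compile(r"^osd\.?(\d+)\s+(up|down)\s+(in|out)\b")


def osds_needing_attention(dump_text):
    """Return (id, up_state, in_state) for every OSD that is down or out."""
    problems = []
    for raw in dump_text.splitlines():
        match = OSD_LINE.match(raw.strip())
        if match and (match.group(2) == "down" or match.group(3) == "out"):
            problems.append((int(match.group(1)), match.group(2), match.group(3)))
    return problems


# Hypothetical dump excerpt, invented for this example.
SAMPLE = """\
osd0 up in weight 1
osd1 down out weight 0
osd2 up in weight 1
osd3 down in weight 1
"""

print(osds_needing_attention(SAMPLE))
```

Lines that do not match the assumed layout are skipped silently, so an empty result should be checked against the raw dump before concluding that everything is up and in.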

* Re: HEALTH_WARNING
       [not found] <1463999357.13436.1301740919511.JavaMail.root@mail.linserv.se>
@ 2011-04-02 10:55 ` Martin Wilderoth
  2011-04-02 15:04   ` HEALTH_WARNING Henry Chang
  2011-04-03 23:38   ` HEALTH_WARNING Gregory Farnum
  0 siblings, 2 replies; 8+ messages in thread
From: Martin Wilderoth @ 2011-04-02 10:55 UTC (permalink / raw)
  To: ceph-devel

Hello,

I have separate partitions for my OSDs and the btrfs file system.
I also use SSDs for journaling.

But I got a problem when the root file system on one host filled up with
log files: the file system reported that it was out of disk space.

But the OSDs were not filled to 100%. Later I realised that the root file
system on one of the OSD hosts (osd2 and osd3) had no space left, due to too
much logging.

The only way I know to recover is to create a new filesystem in the cluster :-)
But that's bad for the data :-)

When I get problems with one OSD, it seems as if they crash one by one.
And I don't know how to get them up again without deleting all the data.


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

* Re: HEALTH_WARNING
  2011-04-02 10:55 ` HEALTH_WARNING Martin Wilderoth
@ 2011-04-02 15:04   ` Henry Chang
  2011-04-02 18:23     ` HEALTH_WARNING Martin Wilderoth
  2011-04-03 23:38   ` HEALTH_WARNING Gregory Farnum
  1 sibling, 1 reply; 8+ messages in thread
From: Henry Chang @ 2011-04-02 15:04 UTC (permalink / raw)
  To: Martin Wilderoth; +Cc: ceph-devel

> The only way I know to recover is to create a new filesystem in the cluster :-)
> But that's bad for the data :-)
>
> When I get problems with one OSD, it seems as if they crash one by one.
> And I don't know how to get them up again without deleting all the data.
>

Which version of ceph are you running? We had a similar problem before.
I would like to know whether the recent fixes for OSD recovery (>= v0.25.1)
have already resolved this problem.

Thanks,
Henry

* Re: HEALTH_WARNING
  2011-04-02 15:04   ` HEALTH_WARNING Henry Chang
@ 2011-04-02 18:23     ` Martin Wilderoth
  0 siblings, 0 replies; 8+ messages in thread
From: Martin Wilderoth @ 2011-04-02 18:23 UTC (permalink / raw)
  To: ceph-devel

I'm running 0.25.2 from the Debian squeeze repository, so it's the latest, I think.

* Re: HEALTH_WARNING
  2011-04-02 10:55 ` HEALTH_WARNING Martin Wilderoth
  2011-04-02 15:04   ` HEALTH_WARNING Henry Chang
@ 2011-04-03 23:38   ` Gregory Farnum
  1 sibling, 0 replies; 8+ messages in thread
From: Gregory Farnum @ 2011-04-03 23:38 UTC (permalink / raw)
  To: Martin Wilderoth; +Cc: ceph-devel

On Sat, Apr 2, 2011 at 3:55 AM, Martin Wilderoth
<martin.wilderoth@linserv.se> wrote:
> Hello,
>
> I have separate partitions for my OSDs and the btrfs file system.
> I also use SSDs for journaling.
>
> But I got a problem when the root file system on one host filled up with
> log files: the file system reported that it was out of disk space.
>
> But the OSDs were not filled to 100%. Later I realised that the root file
> system on one of the OSD hosts (osd2 and osd3) had no space left, due to too
> much logging.
>
> The only way I know to recover is to create a new filesystem in the cluster :-)
> But that's bad for the data :-)
>
> When I get problems with one OSD, it seems as if they crash one by one.
> And I don't know how to get them up again without deleting all the data.
You should be able to simply clear up some space (don't remove any of
the actual OSD data though!) and then start up the OSD daemon, at
which point it ought to automatically rejoin the cluster.
Is this not working? If not, please start up the daemon with higher
levels of debug logging and put the logs somewhere accessible.
-Greg
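
Before deleting anything, it can help to see how full the root file system is and which files take the space. A minimal Python sketch using only the standard library. The `/var/log` path is an assumption about where the logs live, and both function names are made up for illustration. The script only reports, it does not delete, so the OSD data directories are never touched.

```python
import os
import shutil


def free_fraction(path="/"):
    """Fraction of the file system holding `path` that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def largest_files(directory, limit=5):
    """Return the `limit` largest files below `directory` as (size, path)."""
    sizes = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            full = os.path.join(root, name)
            try:
                sizes.append((os.path.getsize(full), full))
            except OSError:
                continue  # file vanished or is unreadable; skip it
    return sorted(sizes, reverse=True)[:limit]


if __name__ == "__main__":
    print(f"free on /: {free_fraction('/'):.1%}")
    # Assumed log location; point this at wherever your logs are written.
    for size, path in largest_files("/var/log"):
        print(f"{size:>12}  {path}")
```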

* Re: HEALTH_WARNING
       [not found] <290366553.13874.1302029956409.JavaMail.root@mail.linserv.se>
@ 2011-04-05 19:07 ` Martin Wilderoth
  2011-04-06 17:13   ` HEALTH_WARNING Josh Durgin
  0 siblings, 1 reply; 8+ messages in thread
From: Martin Wilderoth @ 2011-04-05 19:07 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel

I did clear some data and then restarted, but the OSD didn't come online
again. Instead, the OSDs ran for some time and then died one by one.

I re-created the filesystem and transferred the data again, with a similar
result. This time the filesystem was not filled up.
It seems as if the filesystem is hanging, and I can't get any response from it.

I have done the same process again. During the creation it complained about
journaling:
hdparm -W 0 /dev/sda2. This time I made sure it didn't complain about the
hdparm setting of the SSD disks while I was creating the filesystem.

On the host where the filesystem is mounted I have seen some "connection
failed" messages in dmesg:

[16143.534936] libceph: client4428 fsid 19be9ae7-cdf8-cb03-4178-568342d30fa5
[16143.535092] libceph: mon0 10.0.6.10:6789 session established
[16224.427969] libceph: mon0 10.0.6.10:6789 socket closed
[16224.427975] libceph: mon0 10.0.6.10:6789 session lost, hunting for new mon
[16224.429637] libceph: mon0 10.0.6.10:6789 connection failed
[16233.700478] libceph: mon1 10.0.6.11:6789 connection failed
[16243.716405] libceph: mon2 10.0.6.12:6789 connection failed
[16253.728529] libceph: mon2 10.0.6.12:6789 connection failed
[17008.794981] libceph: client4107 fsid 2c3fefe7-3362-f541-27b4-64176adb3f22
[17008.795127] libceph: mon0 10.0.6.10:6789 session established

I'm not sure I have everything configured correctly.

Regards Martin
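
The dmesg excerpt above can be tallied per monitor to make the pattern easier to see. A minimal Python sketch, assuming the `libceph:` line layout shown in this message (the function name is made up for illustration):

```python
import re
from collections import Counter

MON_EVENT = re.compile(r"libceph: (mon\d+) (\S+) (.+)$")
FSID = re.compile(r"libceph: client\d+ fsid (\S+)")


def summarize_dmesg(text):
    """Count 'connection failed' events per monitor and list distinct fsids."""
    failures = Counter()
    fsids = []
    for raw in text.splitlines():
        line = raw.strip()
        fsid = FSID.search(line)
        if fsid:
            if fsid.group(1) not in fsids:
                fsids.append(fsid.group(1))
            continue
        event = MON_EVENT.search(line)
        if event and event.group(3) == "connection failed":
            failures[event.group(1)] += 1
    return failures, fsids


# The lines quoted in this message.
DMESG = """\
[16143.534936] libceph: client4428 fsid 19be9ae7-cdf8-cb03-4178-568342d30fa5
[16143.535092] libceph: mon0 10.0.6.10:6789 session established
[16224.427969] libceph: mon0 10.0.6.10:6789 socket closed
[16224.427975] libceph: mon0 10.0.6.10:6789 session lost, hunting for new mon
[16224.429637] libceph: mon0 10.0.6.10:6789 connection failed
[16233.700478] libceph: mon1 10.0.6.11:6789 connection failed
[16243.716405] libceph: mon2 10.0.6.12:6789 connection failed
[16253.728529] libceph: mon2 10.0.6.12:6789 connection failed
[17008.794981] libceph: client4107 fsid 2c3fefe7-3362-f541-27b4-64176adb3f22
[17008.795127] libceph: mon0 10.0.6.10:6789 session established
"""

failures, fsids = summarize_dmesg(DMESG)
print(dict(failures))
print(fsids)
```

On these lines, all three monitors refuse connections within about 30 seconds of each other, and the fsid differs between the two sessions, which would be consistent with the filesystem having been re-created in between.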

* Re: HEALTH_WARNING
  2011-04-05 19:07 ` HEALTH_WARNING Martin Wilderoth
@ 2011-04-06 17:13   ` Josh Durgin
  0 siblings, 0 replies; 8+ messages in thread
From: Josh Durgin @ 2011-04-06 17:13 UTC (permalink / raw)
  To: Martin Wilderoth; +Cc: ceph-devel

On Tue, 5 Apr 2011 21:07:52 +0200 (CEST), Martin Wilderoth
<martin.wilderoth@linserv.se> wrote:
> I did clear some data and then restarted, but the OSD didn't come online
> again. Instead, the OSDs ran for some time and then died one by one.
>
> I re-created the filesystem and transferred the data again, with a
> similar result. This time the filesystem was not filled up.
> It seems as if the filesystem is hanging, and I can't get any response from it.
>
> I have done the same process again. During the creation it complained
> about journaling:
> hdparm -W 0 /dev/sda2. This time I made sure it didn't complain about
> the hdparm setting of the SSD disks while I was creating the filesystem.
>
> On the host where the filesystem is mounted I have seen some
> "connection failed" messages in dmesg:
>
> [16143.534936] libceph: client4428 fsid 19be9ae7-cdf8-cb03-4178-568342d30fa5
> [16143.535092] libceph: mon0 10.0.6.10:6789 session established
> [16224.427969] libceph: mon0 10.0.6.10:6789 socket closed
> [16224.427975] libceph: mon0 10.0.6.10:6789 session lost, hunting for new mon
> [16224.429637] libceph: mon0 10.0.6.10:6789 connection failed
> [16233.700478] libceph: mon1 10.0.6.11:6789 connection failed
> [16243.716405] libceph: mon2 10.0.6.12:6789 connection failed
> [16253.728529] libceph: mon2 10.0.6.12:6789 connection failed
> [17008.794981] libceph: client4107 fsid 2c3fefe7-3362-f541-27b4-64176adb3f22
> [17008.795127] libceph: mon0 10.0.6.10:6789 session established
>
> I'm not sure I have everything configured correctly.

You may have hit a bug in the OSDs - could you add this to your
ceph.conf in the [osd] section, restart the osd daemons, and post the 
logs somewhere accessible?

         debug ms = 1
         debug osd = 25
         debug monc = 20
         debug journal = 20
         debug filestore = 10

We can probably help you debug this faster on IRC (#ceph on irc.oftc.net).

Thanks,
Josh Durgin
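
If the same debug settings have to be applied on several hosts, the edit can be scripted. The Python sketch below merges the five settings listed above into the [osd] section using the standard configparser module. It is only a sketch under assumptions: it expects unindented `key = value` lines (configparser treats indented lines as continuations of the previous value), it drops comments when writing, and the EXAMPLE text with its `osd data` path is hypothetical rather than taken from a real ceph.conf.

```python
import configparser
import io

# The settings requested in the message above.
DEBUG_SETTINGS = {
    "debug ms": "1",
    "debug osd": "25",
    "debug monc": "20",
    "debug journal": "20",
    "debug filestore": "10",
}


def with_osd_debug(conf_text):
    """Return conf_text with the debug settings merged into [osd]."""
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(conf_text)
    if not parser.has_section("osd"):
        parser.add_section("osd")
    for key, value in DEBUG_SETTINGS.items():
        parser.set("osd", key, value)
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


# Hypothetical minimal config, invented for this example.
EXAMPLE = "[osd]\nosd data = /srv/osd\n"
print(with_osd_debug(EXAMPLE))
```

Review the result by hand before replacing a live ceph.conf, and restart the OSD daemons afterwards as described above.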
