Re: osd stops


* Re: osd stops
       [not found] <64610990.14485.1302631358989.JavaMail.root@mail.linserv.se>
@ 2011-04-12 18:05 ` Martin Wilderoth
  2011-04-12 18:24   ` Gregory Farnum
  0 siblings, 1 reply; 9+ messages in thread
From: Martin Wilderoth @ 2011-04-12 18:05 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel

Thanks for the answer, now I know the reason. Some of my OSDs were about 90% full, and dmesg also shows btrfs errors on the hosts. I will run the test with another filesystem, ext3 :-) or is some other filesystem better? It's a BackupPC filesystem with a lot of hardlinks and data that I would like to test running in Ceph.

----- Original message -----
From: "Gregory Farnum" <gregory.farnum@dreamhost.com>
To: "Martin Wilderoth" <martin.wilderoth@linserv.se>
Cc: ceph-devel@vger.kernel.org
Sent: Tuesday, 12 Apr 2011 19:24:27
Subject: Re: osd stops

Ah. It looks like you're running btrfs and you have a very full disk. Unfortunately btrfs doesn't handle low-disk situations (above ~80% utilization -- yes, it's annoying) very well and so it's failing to perform pretty basic tasks and is propagating those failures up to the OSD. If you really need to run that close to full utilization you're going to need to use another underlying filesystem, or add more disks/nodes to spread the data across. 
Sorry. :( 

-Greg 
On Tuesday, April 12, 2011 at 9:26 AM, Martin Wilderoth wrote: 
> I have run some tests and I always seem to hit the same problem.
> While transferring data, I suddenly get an I/O error and a superblock problem.
> This occurs when the filesystem is approximately 80% full.
>
> ceph health reports no error. I restart the system (-a stop, then -a start);
> after that the system is degraded and the OSD stops.
>
> The log of the first failing OSD shows:
> 
> 2011-04-12 17:51:07.716513 7f02365b8700 -- 0.0.0.0:6802/20180 >> 10.0.6.12:6802/13633 pipe(0x2e1da00 sd=22 pgs=0 cs=0 l=0).fault first fault 
> 2011-04-12 17:51:07.716868 7f02365b8700 -- 0.0.0.0:6802/20180 >> 10.0.6.12:6802/13633 pipe(0x2e1da00 sd=22 pgs=0 cs=0 l=0).connect claims to be 0.0.0.0:6802/15976 not 10.0.6.12:6802/13633 - wrong node! 
> os/FileStore.cc: In function 'void FileStore::sync_entry()', in thread '0x7f023f9ce700' 
> os/FileStore.cc: 2674: FAILED assert(r == 0) 
> ceph version 0.26 (commit:9981ff90968398da43c63106694d661f5e3d07d5) 
> 1: (FileStore::sync_entry()+0x1975) [0x59f165] 
> 2: (FileStore::SyncThread::entry()+0xd) [0x5a8a7d] 
> 3: (()+0x68ba) [0x7f024602b8ba] 
> 4: (clone()+0x6d) [0x7f0244cc002d] 
> ceph version 0.26 (commit:9981ff90968398da43c63106694d661f5e3d07d5) 
> 1: (FileStore::sync_entry()+0x1975) [0x59f165] 
> 2: (FileStore::SyncThread::entry()+0xd) [0x5a8a7d] 
> 3: (()+0x68ba) [0x7f024602b8ba] 
> 4: (clone()+0x6d) [0x7f0244cc002d] 
> *** Caught signal (Aborted) ** 
> in thread 0x7f023f9ce700 
> ceph version 0.26 (commit:9981ff90968398da43c63106694d661f5e3d07d5) 
> 1: /usr/bin/cosd() [0x61e42c] 
> 2: (()+0xef60) [0x7f0246033f60] 
> 3: (gsignal()+0x35) [0x7f0244c23165] 
> 4: (abort()+0x180) [0x7f0244c25f70] 
> 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f02454b6dc5] 
> 6: (()+0xcb166) [0x7f02454b5166] 
> 7: (()+0xcb193) [0x7f02454b5193] 
> 8: (()+0xcb28e) [0x7f02454b528e] 
> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x373) [0x6061e3] 
> 10: (FileStore::sync_entry()+0x1975) [0x59f165] 
> 11: (FileStore::SyncThread::entry()+0xd) [0x5a8a7d] 
> 12: (()+0x68ba) [0x7f024602b8ba] 
> 13: (clone()+0x6d) [0x7f0244cc002d] 
> 
> the second failing osd 
> 
> 2011-04-12 18:03:36.036420 7f39c6ce7700 FileStore: sync_entry timed out after 600 seconds. 
> ceph version 0.26 (commit:9981ff90968398da43c63106694d661f5e3d07d5) 
> 2011-04-12 18:03:36.036494 1: (SafeTimer::timer_thread()+0x36b) [0x601afb] 
> 2011-04-12 18:03:36.036509 2: (SafeTimerThread::entry()+0xd) [0x6042cd] 
> 2011-04-12 18:03:36.036528 3: (()+0x68ba) [0x7f39d034a8ba] 
> 2011-04-12 18:03:36.036541 4: (clone()+0x6d) [0x7f39cefdf02d] 
> 2011-04-12 18:03:36.036551 os/FileStore.cc: In function 'virtual void SyncEntryTimeout::finish(int)', in thread '0x7f39c6ce7700' 
> os/FileStore.cc: 2573: FAILED assert(0) 
> ceph version 0.26 (commit:9981ff90968398da43c63106694d661f5e3d07d5) 
> 1: (SyncEntryTimeout::finish(int)+0xf4) [0x5a0b34] 
> 2: (SafeTimer::timer_thread()+0x36b) [0x601afb] 
> 3: (SafeTimerThread::entry()+0xd) [0x6042cd] 
> 4: (()+0x68ba) [0x7f39d034a8ba] 
> 5: (clone()+0x6d) [0x7f39cefdf02d] 
> ceph version 0.26 (commit:9981ff90968398da43c63106694d661f5e3d07d5) 
> 1: (SyncEntryTimeout::finish(int)+0xf4) [0x5a0b34] 
> 2: (SafeTimer::timer_thread()+0x36b) [0x601afb] 
> 3: (SafeTimerThread::entry()+0xd) [0x6042cd] 
> 4: (()+0x68ba) [0x7f39d034a8ba] 
> 5: (clone()+0x6d) [0x7f39cefdf02d] 
> *** Caught signal (Aborted) ** 
> in thread 0x7f39c6ce7700 
> ceph version 0.26 (commit:9981ff90968398da43c63106694d661f5e3d07d5) 
> 1: /usr/bin/cosd() [0x61e42c] 
> 2: (()+0xef60) [0x7f39d0352f60] 
> 3: (gsignal()+0x35) [0x7f39cef42165] 
> 4: (abort()+0x180) [0x7f39cef44f70] 
> 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f39cf7d5dc5] 
> 6: (()+0xcb166) [0x7f39cf7d4166] 
> 7: (()+0xcb193) [0x7f39cf7d4193] 
> 8: (()+0xcb28e) [0x7f39cf7d428e] 
> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x373) [0x6061e3] 
> 10: (SyncEntryTimeout::finish(int)+0xf4) [0x5a0b34] 
> 11: (SafeTimer::timer_thread()+0x36b) [0x601afb] 
> 12: (SafeTimerThread::entry()+0xd) [0x6042cd] 
> 13: (()+0x68ba) [0x7f39d034a8ba] 
> 14: (clone()+0x6d) [0x7f39cefdf02d] 
> 
> regards Martin 
> -- 
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in 
> the body of a message to majordomo@vger.kernel.org 
> More majordomo info at http://vger.kernel.org/majordomo-info.html 
> 


* Re: osd stops
  2011-04-12 18:05 ` osd stops Martin Wilderoth
@ 2011-04-12 18:24   ` Gregory Farnum
  2011-04-13 12:12     ` Martin Wilderoth
  0 siblings, 1 reply; 9+ messages in thread
From: Gregory Farnum @ 2011-04-12 18:24 UTC (permalink / raw)
  To: Martin Wilderoth; +Cc: ceph-devel

On Tuesday, April 12, 2011 at 11:05 AM, Martin Wilderoth wrote:
> Thanks for the answer, now I know the reason. Some of my OSDs were about 90% full, and dmesg also shows btrfs errors on the hosts. I will run the test with another filesystem, ext3 :-) or is some other filesystem better? It's a BackupPC filesystem with a lot of hardlinks and data that I would like to test running in Ceph.

ext3 or really any other FS will handle it better, although Ceph itself is also not super-resilient to such situations. Eventually we will have automatic rebalancing of data but it's not in there right now.

Could you maybe send along your config file and the local filesystem statistics on each of your OSDs? CRUSH is pseudo-random, so it's not going to have perfectly even utilization, but if the variance is too high we'll want to look into it sooner rather than later.
-Greg
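
To illustrate the point about uneven utilization, here is a minimal sketch that spreads equally sized objects across six OSDs with a plain hash. It is not CRUSH, only a stand-in for any pseudo-random placement; the function names (place_objects, imbalance) and the object naming scheme are hypothetical and chosen for illustration.

```python
import hashlib


def place_objects(num_objects, num_osds):
    """Assign objects to OSDs with a simple hash (a stand-in, NOT CRUSH)."""
    counts = [0] * num_osds
    for i in range(num_objects):
        # Hypothetical object names; any stable key would do.
        digest = hashlib.sha256(f"object-{i}".encode()).digest()
        osd = int.from_bytes(digest[:8], "big") % num_osds
        counts[osd] += 1
    return counts


def imbalance(counts):
    """Ratio of the fullest OSD to the mean; 1.0 would be perfectly even."""
    mean = sum(counts) / len(counts)
    return max(counts) / mean


if __name__ == "__main__":
    counts = place_objects(1000, 6)
    print("objects per OSD:", counts)
    print("fullest OSD / mean: %.3f" % imbalance(counts))
```

Even with identical object sizes, the per-OSD counts differ, so the fullest OSD reaches any given utilization threshold before the cluster average does.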


* Re: osd stops
  2011-04-12 18:24   ` Gregory Farnum
@ 2011-04-13 12:12     ` Martin Wilderoth
  2011-04-13 19:38       ` Gregory Farnum
  0 siblings, 1 reply; 9+ messages in thread
From: Martin Wilderoth @ 2011-04-13 12:12 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel

This is my config,

;
; Sample ceph ceph.conf file.
;
; This file defines cluster membership, the various locations
; that Ceph stores data, and any other runtime options.

; If a 'host' is defined for a daemon, the start/stop script will
; verify that it matches the hostname (or else ignore it).  If it is
; not defined, it is assumed that the daemon is intended to start on
; the current host (e.g., in a setup with a startup.conf on each
; node).

; global
[global]
        ; enable secure authentication
        auth supported = cephx
        keyring = /etc/ceph/keyring.bin

        ; allow ourselves to open a lot of files
        max open files = 131072
        pid file = /var/run/ceph/$name.pid
        debug ms = 1

; monitors
;  You need at least one.  You need at least three if you want to
;  tolerate any node failures.  Always create an odd number.
[mon]
        mon data = /data/mon$id

        ; logging, for debugging monitor crashes, in order of
        ; their likelihood of being helpful :)
        ;debug ms = 1
        ;debug mon = 20
        ;debug paxos = 20
        ;debug auth = 20

[mon0]
        host = ceph1
        mon addr = 10.0.6.10:6789

[mon1]
        host = ceph2
        mon addr = 10.0.6.11:6789

[mon2]
        host = ceph3
        mon addr = 10.0.6.12:6789

; mds
;  You need at least one.  Define two to get a standby.
[mds]
        ; where the mds keeps its secret encryption keys
        keyring = /etc/ceph/keyring.$name

        ; mds logging to debug issues.
        ;debug ms = 1
        ;debug mds = 20

[mds0]
        host = ceph1

[mds1]
        host = ceph2

[mds2]
        host = ceph3

; osd
;  You need at least one.  Two if you want data to be replicated.
;  Define as many as you like.
[osd]
        sudo = true
        ; This is where the btrfs volume will be mounted.
        osd data = /data/osd$id
        ; where the osd keeps its secret encryption keys
        keyring = /etc/ceph/keyring.$name

        ; Ideally, make this a separate disk or partition.  A few
        ; hundred MB should be enough; more if you have fast or many
        ; disks.  You can use a file under the osd data dir if need be
        ; (e.g. /data/osd$id/journal), but it will be slower than a
        ; separate disk or partition.

        ; This is an example of a file-based journal.
        ;osd journal = /data/osd$id/journal
        ;osd journal size = 1000 ; journal size, in megabytes

        ; osd logging to debug osd issues, in order of likelihood of being
        ; helpful
;       debug ms = 1
;       debug osd = 25
;       debug monc = 20
;       debug journal = 20
;       debug filestore = 10
;       osd use stale snap = true

[osd0]
        host = ceph1

        ; if 'btrfs devs' is not specified, you're responsible for
        ; setting up the 'osd data' dir.  if it is not btrfs, things
        ; will behave up until you try to recover from a crash (which
        ; is usually fine for basic testing).
        btrfs devs = /dev/sdc
        osd journal = /dev/sda1

[osd1]
        host = ceph1
        btrfs devs = /dev/sdd
        osd journal = /dev/sda2

[osd2]
        host = ceph2
        btrfs devs = /dev/sdc
        osd journal = /dev/sda1

[osd3]
        host = ceph2
        btrfs devs = /dev/sdd
        osd journal = /dev/sda2

[osd4]
        host = ceph3
        btrfs devs = /dev/sdc
        osd journal = /dev/sda1

[osd5]
        host = ceph3
        btrfs devs = /dev/sdd
        osd journal = /dev/sda2
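
For anyone who wants to tabulate which device and journal each OSD uses in a config like the one above, here is a rough sketch. It assumes the file can be read as generic INI with Python's configparser, which is not Ceph's own parser and may disagree with it on edge cases. SAMPLE_CONF is a shortened excerpt and osd_layout is a hypothetical helper name.

```python
import configparser
import re

# Shortened excerpt in the same style as the config above.
SAMPLE_CONF = """\
[osd]
        osd data = /data/osd$id

[osd0]
        host = ceph1
        ; comment lines like this one are skipped
        btrfs devs = /dev/sdc
        osd journal = /dev/sda1

[osd1]
        host = ceph1
        btrfs devs = /dev/sdd
        osd journal = /dev/sda2
"""


def osd_layout(conf_text):
    """Map each [osdN] section to its host, data device and journal."""
    # interpolation=None so values are returned exactly as written.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    layout = {}
    for section in parser.sections():
        if re.fullmatch(r"osd\d+", section):
            layout[section] = {
                "host": parser.get(section, "host"),
                "data_dev": parser.get(section, "btrfs devs"),
                "journal": parser.get(section, "osd journal"),
            }
    return layout


if __name__ == "__main__":
    for name, info in osd_layout(SAMPLE_CONF).items():
        print(name, info["host"], info["data_dev"], info["journal"])
```

The generic [osd] section is skipped on purpose, since it holds defaults rather than a concrete daemon.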

Here are the disk statistics, taken after osd2 and osd4 crashed.

/dev/sdc             143373312 124954676  18418636  88% /data/osd0
/dev/sdd             143373312 137639524   5733788  97% /data/osd1

/dev/sdc             143373312 120350584  23022728  84% /data/osd2
/dev/sdd             143373312 141986188   1387124 100% /data/osd3

/dev/sdc             143373312 112025716  31347596  79% /data/osd4
/dev/sdd             143373312 115163124  28210188  81% /data/osd5

I will send some statistics for ext3 as well.

----- Original message -----
From: "Gregory Farnum" <gregory.farnum@dreamhost.com>
To: "Martin Wilderoth" <martin.wilderoth@linserv.se>
Cc: ceph-devel@vger.kernel.org
Sent: Tuesday, 12 Apr 2011 14:24:14
Subject: Re: osd stops

On Tuesday, April 12, 2011 at 11:05 AM, Martin Wilderoth wrote: 
Thanks for the answer, now I know the reason. Some of my OSDs were about 90% full, and dmesg also shows btrfs errors on the hosts. I will run the test with another filesystem, ext3 :-) or is some other filesystem better? It's a BackupPC filesystem with a lot of hardlinks and data that I would like to test running in Ceph.

ext3 or really any other FS will handle it better, although Ceph itself is also not super-resilient to such situations. Eventually we will have automatic rebalancing of data but it's not in there right now.

Could you maybe send along your config file and the local filesystem statistics on each of your OSDs? CRUSH is pseudo-random, so it's not going to have perfectly even utilization, but if the variance is too high we'll want to look into it sooner rather than later.
-Greg 




* Re: osd stops
  2011-04-13 12:12     ` Martin Wilderoth
@ 2011-04-13 19:38       ` Gregory Farnum
  2011-04-13 19:43         ` Gregory Farnum
  0 siblings, 1 reply; 9+ messages in thread
From: Gregory Farnum @ 2011-04-13 19:38 UTC (permalink / raw)
  To: Martin Wilderoth; +Cc: ceph-devel

On Wednesday, April 13, 2011 at 5:12 AM, Martin Wilderoth wrote:
> Here are the disk statistics, taken after osd2 and osd4 crashed.
> 
> /dev/sdc 143373312 124954676 18418636 88% /data/osd0
> /dev/sdd 143373312 137639524 5733788 97% /data/osd1
> 
> /dev/sdc 143373312 120350584 23022728 84% /data/osd2
> /dev/sdd 143373312 141986188 1387124 100% /data/osd3
> 
> /dev/sdc 143373312 112025716 31347596 79% /data/osd4
> /dev/sdd 143373312 115163124 28210188 81% /data/osd5
> 
> I will send some statistics for ext3 as well.

Am I reading this right, each of those disks is ~137MB? Those are some very small disks; I actually don't think you'll have much luck with the OSDs on something that small just because random balancing on disks that small won't work out very well -- there's too much variation when the total disk is only ~30 times larger than the default stripe size.
-Greg





* Re: osd stops
  2011-04-13 19:38       ` Gregory Farnum
@ 2011-04-13 19:43         ` Gregory Farnum
  0 siblings, 0 replies; 9+ messages in thread
From: Gregory Farnum @ 2011-04-13 19:43 UTC (permalink / raw)
  To: Martin Wilderoth; +Cc: ceph-devel


On Wednesday, April 13, 2011 at 12:38 PM, Gregory Farnum wrote: 
> On Wednesday, April 13, 2011 at 5:12 AM, Martin Wilderoth wrote:
> > Here are the disk statistics, taken after osd2 and osd4 crashed.
> > 
> > /dev/sdc 143373312 124954676 18418636 88% /data/osd0
> > /dev/sdd 143373312 137639524 5733788 97% /data/osd1
> > 
> > /dev/sdc 143373312 120350584 23022728 84% /data/osd2
> > /dev/sdd 143373312 141986188 1387124 100% /data/osd3
> > 
> > /dev/sdc 143373312 112025716 31347596 79% /data/osd4
> > /dev/sdd 143373312 115163124 28210188 81% /data/osd5
> > 
> > I will send some statistics for ext3 as well.
> 
> Am I reading this right, each of those disks is ~137MB? Those are some very small disks; I actually don't think you'll have much luck with the OSDs on something that small just because random balancing on disks that small won't work out very well -- there's too much variation when the total disk is only ~30 times larger than the default stripe size.
> -Greg
Never mind, I just realized that df outputs in 1KB blocks by default -- I was thinking it was bytes for some reason.
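
As a worked check of the units, the sketch below parses the df lines quoted in this thread, treating the size columns as 1 KiB blocks (df's default), and reports each OSD's size in GiB and its used fraction. The names parse_df and DF_LINES are hypothetical. The used fraction here is simply used/total, which can differ slightly from the Use% column that df prints.

```python
# df output quoted in the thread; columns are:
# device, total (1 KiB blocks), used, available, use%, mount point.
DF_LINES = """\
/dev/sdc 143373312 124954676 18418636 88% /data/osd0
/dev/sdd 143373312 137639524 5733788 97% /data/osd1
/dev/sdc 143373312 120350584 23022728 84% /data/osd2
/dev/sdd 143373312 141986188 1387124 100% /data/osd3
/dev/sdc 143373312 112025716 31347596 79% /data/osd4
/dev/sdd 143373312 115163124 28210188 81% /data/osd5
"""

KIB_PER_GIB = 1024 * 1024


def parse_df(text):
    """Return one dict per df row with size in GiB and used fraction."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 6:
            continue  # skip headers, blank lines or wrapped rows
        _device, total_k, used_k, _avail_k, _pct, mount = fields
        rows.append({
            "mount": mount,
            "total_gib": int(total_k) / KIB_PER_GIB,
            "used_frac": int(used_k) / int(total_k),
        })
    return rows


if __name__ == "__main__":
    rows = parse_df(DF_LINES)
    for row in rows:
        print("%-12s %7.1f GiB  %5.1f%% used"
              % (row["mount"], row["total_gib"], 100 * row["used_frac"]))
    fracs = [row["used_frac"] for row in rows]
    print("spread between fullest and emptiest: %.1f points"
          % (100 * (max(fracs) - min(fracs))))
```

Read as 1 KiB blocks, each disk comes out at roughly 137 GiB rather than 137 MB.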


[parent not found: <ab2410b5-fe4c-4600-a2c0-f36a708fb6e2@mail.linserv.se>]

* osd stops
       [not found] <ab2410b5-fe4c-4600-a2c0-f36a708fb6e2@mail.linserv.se>
@ 2013-04-14  5:07 ` Martin Wilderoth
  0 siblings, 0 replies; 9+ messages in thread
From: Martin Wilderoth @ 2013-04-14  5:07 UTC (permalink / raw)
  To: ceph-devel

Hello,

I have a Ceph cluster running bobtail 0.56.4.

I have been playing with RBD images and KVM.
The hardware for the OSDs and mons is 3 x HP G5 with 2 x 2TB SATA disks
and 2 x SAS disks for the journals (old hardware).

I have one host running kvm-rbd against the setup.
The network is a 1GB network.

The scenario is as follows: I run my VM servers for a day or several,
then things hang, the cluster starts to recover, and 2 OSDs plus a mon are down.
I then have to restart that server.
Health goes back to OK after recovery and I can continue.

I have noticed that I get high load from ceph-mon and the OSDs
on one of the hosts. The host that gets the high load is not always the same.
I guess that's the host whose mon the KVM host is using.

What I can't figure out is why the OSDs are stopping/crashing,
and why all the load goes to that one server's mon, if that is what is happening.
Is it possible to limit something? It feels like I'm overloading the system.

I have not turned on debug logging. Please let me know if any logs would be of interest.

 /Best Regards Martin

[parent not found: <1608788961.14465.1302625479260.JavaMail.root@mail.linserv.se>]

* osd stops
       [not found] <1608788961.14465.1302625479260.JavaMail.root@mail.linserv.se>
@ 2011-04-12 16:26 ` Martin Wilderoth
  2011-04-12 16:57   ` Wido den Hollander
  2011-04-12 17:24   ` Gregory Farnum
  0 siblings, 2 replies; 9+ messages in thread
From: Martin Wilderoth @ 2011-04-12 16:26 UTC (permalink / raw)
  To: ceph-devel

I have run some tests and I always seem to hit the same problem.
While transferring data, I suddenly get an I/O error and a superblock problem.
This occurs when the filesystem is approximately 80% full.

ceph health reports no error. I restart the system (-a stop, then -a start);
after that the system is degraded and the OSD stops.

The log of the first failing OSD shows:

2011-04-12 17:51:07.716513 7f02365b8700 -- 0.0.0.0:6802/20180 >> 10.0.6.12:6802/13633 pipe(0x2e1da00 sd=22 pgs=0 cs=0 l=0).fault first fault
2011-04-12 17:51:07.716868 7f02365b8700 -- 0.0.0.0:6802/20180 >> 10.0.6.12:6802/13633 pipe(0x2e1da00 sd=22 pgs=0 cs=0 l=0).connect claims to be 0.0.0.0:6802/15976 not 10.0.6.12:6802/13633 - wrong node!
os/FileStore.cc: In function 'void FileStore::sync_entry()', in thread '0x7f023f9ce700'
os/FileStore.cc: 2674: FAILED assert(r == 0)
 ceph version 0.26 (commit:9981ff90968398da43c63106694d661f5e3d07d5)
 1: (FileStore::sync_entry()+0x1975) [0x59f165]
 2: (FileStore::SyncThread::entry()+0xd) [0x5a8a7d]
 3: (()+0x68ba) [0x7f024602b8ba]
 4: (clone()+0x6d) [0x7f0244cc002d]
 ceph version 0.26 (commit:9981ff90968398da43c63106694d661f5e3d07d5)
 1: (FileStore::sync_entry()+0x1975) [0x59f165]
 2: (FileStore::SyncThread::entry()+0xd) [0x5a8a7d]
 3: (()+0x68ba) [0x7f024602b8ba]
 4: (clone()+0x6d) [0x7f0244cc002d]
*** Caught signal (Aborted) **
 in thread 0x7f023f9ce700
 ceph version 0.26 (commit:9981ff90968398da43c63106694d661f5e3d07d5)
 1: /usr/bin/cosd() [0x61e42c]
 2: (()+0xef60) [0x7f0246033f60]
 3: (gsignal()+0x35) [0x7f0244c23165]
 4: (abort()+0x180) [0x7f0244c25f70]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f02454b6dc5]
 6: (()+0xcb166) [0x7f02454b5166]
 7: (()+0xcb193) [0x7f02454b5193]
 8: (()+0xcb28e) [0x7f02454b528e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x373) [0x6061e3]
 10: (FileStore::sync_entry()+0x1975) [0x59f165]
 11: (FileStore::SyncThread::entry()+0xd) [0x5a8a7d]
 12: (()+0x68ba) [0x7f024602b8ba]
 13: (clone()+0x6d) [0x7f0244cc002d]

the second failing osd

2011-04-12 18:03:36.036420 7f39c6ce7700 FileStore: sync_entry timed out after 600 seconds.
 ceph version 0.26 (commit:9981ff90968398da43c63106694d661f5e3d07d5)
2011-04-12 18:03:36.036494 1: (SafeTimer::timer_thread()+0x36b) [0x601afb]
2011-04-12 18:03:36.036509 2: (SafeTimerThread::entry()+0xd) [0x6042cd]
2011-04-12 18:03:36.036528 3: (()+0x68ba) [0x7f39d034a8ba]
2011-04-12 18:03:36.036541 4: (clone()+0x6d) [0x7f39cefdf02d]
2011-04-12 18:03:36.036551 os/FileStore.cc: In function 'virtual void SyncEntryTimeout::finish(int)', in thread '0x7f39c6ce7700'
os/FileStore.cc: 2573: FAILED assert(0)
 ceph version 0.26 (commit:9981ff90968398da43c63106694d661f5e3d07d5)
 1: (SyncEntryTimeout::finish(int)+0xf4) [0x5a0b34]
 2: (SafeTimer::timer_thread()+0x36b) [0x601afb]
 3: (SafeTimerThread::entry()+0xd) [0x6042cd]
 4: (()+0x68ba) [0x7f39d034a8ba]
 5: (clone()+0x6d) [0x7f39cefdf02d]
 ceph version 0.26 (commit:9981ff90968398da43c63106694d661f5e3d07d5)
 1: (SyncEntryTimeout::finish(int)+0xf4) [0x5a0b34]
 2: (SafeTimer::timer_thread()+0x36b) [0x601afb]
 3: (SafeTimerThread::entry()+0xd) [0x6042cd]
 4: (()+0x68ba) [0x7f39d034a8ba]
 5: (clone()+0x6d) [0x7f39cefdf02d]
*** Caught signal (Aborted) **
 in thread 0x7f39c6ce7700
 ceph version 0.26 (commit:9981ff90968398da43c63106694d661f5e3d07d5)
 1: /usr/bin/cosd() [0x61e42c]
 2: (()+0xef60) [0x7f39d0352f60]
 3: (gsignal()+0x35) [0x7f39cef42165]
 4: (abort()+0x180) [0x7f39cef44f70]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f39cf7d5dc5]
 6: (()+0xcb166) [0x7f39cf7d4166]
 7: (()+0xcb193) [0x7f39cf7d4193]
 8: (()+0xcb28e) [0x7f39cf7d428e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x373) [0x6061e3]
 10: (SyncEntryTimeout::finish(int)+0xf4) [0x5a0b34]
 11: (SafeTimer::timer_thread()+0x36b) [0x601afb]
 12: (SafeTimerThread::entry()+0xd) [0x6042cd]
 13: (()+0x68ba) [0x7f39d034a8ba]
 14: (clone()+0x6d) [0x7f39cefdf02d]
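
When sifting through long OSD logs like the two above, it can help to pull out just the failed assertions. The sketch below does that for lines in the "file: line: FAILED assert(expr)" shape seen in these logs. The regular expression and the helper name failed_asserts are assumptions based only on this excerpt, so other Ceph versions may format these lines differently.

```python
import re

# Matches lines such as "os/FileStore.cc: 2674: FAILED assert(r == 0)".
ASSERT_RE = re.compile(r"(\S+): (\d+): FAILED assert\((.*)\)")

# A few lines copied from the logs in this message.
SAMPLE_LOG = """\
os/FileStore.cc: In function 'void FileStore::sync_entry()', in thread '0x7f023f9ce700'
os/FileStore.cc: 2674: FAILED assert(r == 0)
 1: (FileStore::sync_entry()+0x1975) [0x59f165]
2011-04-12 18:03:36.036420 7f39c6ce7700 FileStore: sync_entry timed out after 600 seconds.
os/FileStore.cc: 2573: FAILED assert(0)
"""


def failed_asserts(log_text):
    """Return (source_file, line_number, expression) for each failed assert."""
    found = []
    for line in log_text.splitlines():
        match = ASSERT_RE.search(line)
        if match:
            found.append((match.group(1), int(match.group(2)), match.group(3)))
    return found


if __name__ == "__main__":
    for source, lineno, expr in failed_asserts(SAMPLE_LOG):
        print("%s:%d assert(%s)" % (source, lineno, expr))
```

Here the first OSD fails a return-code check in sync_entry, while the second hits the unconditional assert raised by the 600-second sync timeout.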

regards Martin


* Re: osd stops
  2011-04-12 16:26 ` Martin Wilderoth
@ 2011-04-12 16:57   ` Wido den Hollander
  2011-04-12 17:24   ` Gregory Farnum
  1 sibling, 0 replies; 9+ messages in thread
From: Wido den Hollander @ 2011-04-12 16:57 UTC (permalink / raw)
  To: Martin Wilderoth; +Cc: ceph-devel

Hi Martin,

On Tue, 2011-04-12 at 18:26 +0200, Martin Wilderoth wrote:
> I have run some tests and I always seem to hit the same problem.
> While transferring data, I suddenly get an I/O error and a superblock problem.
> This occurs when the filesystem is approximately 80% full.
>
> ceph health reports no error. I restart the system (-a stop, then -a start);
> after that the system is degraded and the OSD stops.
>
> The log of the first failing OSD shows:
> 
> 2011-04-12 17:51:07.716513 7f02365b8700 -- 0.0.0.0:6802/20180 >> 10.0.6.12:6802/13633 pipe(0x2e1da00 sd=22 pgs=0 cs=0 l=0).fault first fault
> 2011-04-12 17:51:07.716868 7f02365b8700 -- 0.0.0.0:6802/20180 >> 10.0.6.12:6802/13633 pipe(0x2e1da00 sd=22 pgs=0 cs=0 l=0).connect claims to be 0.0.0.0:6802/15976 not 10.0.6.12:6802/13633 - wrong node!
> os/FileStore.cc: In function 'void FileStore::sync_entry()', in thread '0x7f023f9ce700'
> os/FileStore.cc: 2674: FAILED assert(r == 0)
>  ceph version 0.26 (commit:9981ff90968398da43c63106694d661f5e3d07d5)
>  1: (FileStore::sync_entry()+0x1975) [0x59f165]
>  2: (FileStore::SyncThread::entry()+0xd) [0x5a8a7d]
>  3: (()+0x68ba) [0x7f024602b8ba]
>  4: (clone()+0x6d) [0x7f0244cc002d]
>  ceph version 0.26 (commit:9981ff90968398da43c63106694d661f5e3d07d5)
>  1: (FileStore::sync_entry()+0x1975) [0x59f165]
>  2: (FileStore::SyncThread::entry()+0xd) [0x5a8a7d]
>  3: (()+0x68ba) [0x7f024602b8ba]
>  4: (clone()+0x6d) [0x7f0244cc002d]
> *** Caught signal (Aborted) **
>  in thread 0x7f023f9ce700
>  ceph version 0.26 (commit:9981ff90968398da43c63106694d661f5e3d07d5)
>  1: /usr/bin/cosd() [0x61e42c]
>  2: (()+0xef60) [0x7f0246033f60]
>  3: (gsignal()+0x35) [0x7f0244c23165]
>  4: (abort()+0x180) [0x7f0244c25f70]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f02454b6dc5]
>  6: (()+0xcb166) [0x7f02454b5166]
>  7: (()+0xcb193) [0x7f02454b5193]
>  8: (()+0xcb28e) [0x7f02454b528e]
>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x373) [0x6061e3]
>  10: (FileStore::sync_entry()+0x1975) [0x59f165]
>  11: (FileStore::SyncThread::entry()+0xd) [0x5a8a7d]
>  12: (()+0x68ba) [0x7f024602b8ba]
>  13: (clone()+0x6d) [0x7f0244cc002d]
> 
> the second failing osd
> 
> 2011-04-12 18:03:36.036420 7f39c6ce7700 FileStore: sync_entry timed out after 600 seconds.
>  ceph version 0.26 (commit:9981ff90968398da43c63106694d661f5e3d07d5)
> 2011-04-12 18:03:36.036494 1: (SafeTimer::timer_thread()+0x36b) [0x601afb]
> 2011-04-12 18:03:36.036509 2: (SafeTimerThread::entry()+0xd) [0x6042cd]
> 2011-04-12 18:03:36.036528 3: (()+0x68ba) [0x7f39d034a8ba]
> 2011-04-12 18:03:36.036541 4: (clone()+0x6d) [0x7f39cefdf02d]
> 2011-04-12 18:03:36.036551 os/FileStore.cc: In function 'virtual void SyncEntryTimeout::finish(int)', in thread '0x7f39c6ce7700'
> os/FileStore.cc: 2573: FAILED assert(0)
>  ceph version 0.26 (commit:9981ff90968398da43c63106694d661f5e3d07d5)
>  1: (SyncEntryTimeout::finish(int)+0xf4) [0x5a0b34]
>  2: (SafeTimer::timer_thread()+0x36b) [0x601afb]
>  3: (SafeTimerThread::entry()+0xd) [0x6042cd]
>  4: (()+0x68ba) [0x7f39d034a8ba]
>  5: (clone()+0x6d) [0x7f39cefdf02d]
>  ceph version 0.26 (commit:9981ff90968398da43c63106694d661f5e3d07d5)
>  1: (SyncEntryTimeout::finish(int)+0xf4) [0x5a0b34]
>  2: (SafeTimer::timer_thread()+0x36b) [0x601afb]
>  3: (SafeTimerThread::entry()+0xd) [0x6042cd]
>  4: (()+0x68ba) [0x7f39d034a8ba]
>  5: (clone()+0x6d) [0x7f39cefdf02d]
> *** Caught signal (Aborted) **
>  in thread 0x7f39c6ce7700
>  ceph version 0.26 (commit:9981ff90968398da43c63106694d661f5e3d07d5)
>  1: /usr/bin/cosd() [0x61e42c]
>  2: (()+0xef60) [0x7f39d0352f60]
>  3: (gsignal()+0x35) [0x7f39cef42165]
>  4: (abort()+0x180) [0x7f39cef44f70]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f39cf7d5dc5]
>  6: (()+0xcb166) [0x7f39cf7d4166]
>  7: (()+0xcb193) [0x7f39cf7d4193]
>  8: (()+0xcb28e) [0x7f39cf7d428e]
>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x373) [0x6061e3]
>  10: (SyncEntryTimeout::finish(int)+0xf4) [0x5a0b34]
>  11: (SafeTimer::timer_thread()+0x36b) [0x601afb]
>  12: (SafeTimerThread::entry()+0xd) [0x6042cd]
>  13: (()+0x68ba) [0x7f39d034a8ba]
>  14: (clone()+0x6d) [0x7f39cefdf02d]

It seems to me that you have a disk I/O problem, where the OSD can't
commit its data fast enough and exits.

Does "dmesg" show any disk errors?

Wido

> 
> regards Martin

* Re: osd stops
  2011-04-12 16:26 ` Martin Wilderoth
  2011-04-12 16:57   ` Wido den Hollander
@ 2011-04-12 17:24   ` Gregory Farnum
  1 sibling, 0 replies; 9+ messages in thread
From: Gregory Farnum @ 2011-04-12 17:24 UTC (permalink / raw)
  To: Martin Wilderoth; +Cc: ceph-devel

Ah. It looks like you're running btrfs and you have a very full disk. Unfortunately btrfs doesn't handle low-disk situations (above ~80% utilization -- yes, it's annoying) very well and so it's failing to perform pretty basic tasks and is propagating those failures up to the OSD. If you really need to run that close to full utilization you're going to need to use another underlying filesystem, or add more disks/nodes to spread the data across.
Sorry. :(
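
One way to catch this before the OSDs fall over is to watch utilization of the OSD data directories. The sketch below is a minimal, hypothetical check: the 0.80 threshold simply mirrors the "~80%" figure mentioned above and is not a documented btrfs or Ceph limit, and the helper names are made up for illustration.

```python
import shutil

# Assumption: 0.80 mirrors the "~80% utilization" figure mentioned above;
# it is not a documented btrfs or Ceph limit.
WARN_FRACTION = 0.80


def used_fraction(total, used):
    """Return used space as a fraction of total (0.0 for an empty device)."""
    return used / total if total else 0.0


def check_paths(paths, warn_fraction=WARN_FRACTION):
    """Return (path, fraction) for every path at or above the threshold."""
    flagged = []
    for path in paths:
        usage = shutil.disk_usage(path)  # named tuple: total, used, free
        frac = used_fraction(usage.total, usage.used)
        if frac >= warn_fraction:
            flagged.append((path, frac))
    return flagged


if __name__ == "__main__":
    # "/" is used so the demo runs anywhere; on an OSD host one would pass
    # the OSD data directories instead (e.g. /data/osd0, per "osd data").
    for path, frac in check_paths(["/"]):
        print("%s is %.0f%% full" % (path, 100 * frac))
```

Note that this only reports how full a filesystem is; it does nothing to rebalance data between OSDs.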

-Greg
On Tuesday, April 12, 2011 at 9:26 AM, Martin Wilderoth wrote:
> I have run some tests and I always seem to hit the same problem.
> While transferring data, I suddenly get an I/O error and a superblock problem.
> This occurs when the filesystem is approximately 80% full.
>
> ceph health reports no error. I restart the system (-a stop, then -a start);
> after that the system is degraded and the OSD stops.
>
> The log of the first failing OSD shows:
> 
> 2011-04-12 17:51:07.716513 7f02365b8700 -- 0.0.0.0:6802/20180 >> 10.0.6.12:6802/13633 pipe(0x2e1da00 sd=22 pgs=0 cs=0 l=0).fault first fault
> 2011-04-12 17:51:07.716868 7f02365b8700 -- 0.0.0.0:6802/20180 >> 10.0.6.12:6802/13633 pipe(0x2e1da00 sd=22 pgs=0 cs=0 l=0).connect claims to be 0.0.0.0:6802/15976 not 10.0.6.12:6802/13633 - wrong node!
> os/FileStore.cc: In function 'void FileStore::sync_entry()', in thread '0x7f023f9ce700'
> os/FileStore.cc: 2674: FAILED assert(r == 0)
>  ceph version 0.26 (commit:9981ff90968398da43c63106694d661f5e3d07d5)
>  1: (FileStore::sync_entry()+0x1975) [0x59f165]
>  2: (FileStore::SyncThread::entry()+0xd) [0x5a8a7d]
>  3: (()+0x68ba) [0x7f024602b8ba]
>  4: (clone()+0x6d) [0x7f0244cc002d]
>  ceph version 0.26 (commit:9981ff90968398da43c63106694d661f5e3d07d5)
>  1: (FileStore::sync_entry()+0x1975) [0x59f165]
>  2: (FileStore::SyncThread::entry()+0xd) [0x5a8a7d]
>  3: (()+0x68ba) [0x7f024602b8ba]
>  4: (clone()+0x6d) [0x7f0244cc002d]
> *** Caught signal (Aborted) **
>  in thread 0x7f023f9ce700
>  ceph version 0.26 (commit:9981ff90968398da43c63106694d661f5e3d07d5)
>  1: /usr/bin/cosd() [0x61e42c]
>  2: (()+0xef60) [0x7f0246033f60]
>  3: (gsignal()+0x35) [0x7f0244c23165]
>  4: (abort()+0x180) [0x7f0244c25f70]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f02454b6dc5]
>  6: (()+0xcb166) [0x7f02454b5166]
>  7: (()+0xcb193) [0x7f02454b5193]
>  8: (()+0xcb28e) [0x7f02454b528e]
>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x373) [0x6061e3]
>  10: (FileStore::sync_entry()+0x1975) [0x59f165]
>  11: (FileStore::SyncThread::entry()+0xd) [0x5a8a7d]
>  12: (()+0x68ba) [0x7f024602b8ba]
>  13: (clone()+0x6d) [0x7f0244cc002d]
> 
> the second failing osd
> 
> 2011-04-12 18:03:36.036420 7f39c6ce7700 FileStore: sync_entry timed out after 600 seconds.
>  ceph version 0.26 (commit:9981ff90968398da43c63106694d661f5e3d07d5)
> 2011-04-12 18:03:36.036494 1: (SafeTimer::timer_thread()+0x36b) [0x601afb]
> 2011-04-12 18:03:36.036509 2: (SafeTimerThread::entry()+0xd) [0x6042cd]
> 2011-04-12 18:03:36.036528 3: (()+0x68ba) [0x7f39d034a8ba]
> 2011-04-12 18:03:36.036541 4: (clone()+0x6d) [0x7f39cefdf02d]
> 2011-04-12 18:03:36.036551 os/FileStore.cc: In function 'virtual void SyncEntryTimeout::finish(int)', in thread '0x7f39c6ce7700'
> os/FileStore.cc: 2573: FAILED assert(0)
>  ceph version 0.26 (commit:9981ff90968398da43c63106694d661f5e3d07d5)
>  1: (SyncEntryTimeout::finish(int)+0xf4) [0x5a0b34]
>  2: (SafeTimer::timer_thread()+0x36b) [0x601afb]
>  3: (SafeTimerThread::entry()+0xd) [0x6042cd]
>  4: (()+0x68ba) [0x7f39d034a8ba]
>  5: (clone()+0x6d) [0x7f39cefdf02d]
>  ceph version 0.26 (commit:9981ff90968398da43c63106694d661f5e3d07d5)
>  1: (SyncEntryTimeout::finish(int)+0xf4) [0x5a0b34]
>  2: (SafeTimer::timer_thread()+0x36b) [0x601afb]
>  3: (SafeTimerThread::entry()+0xd) [0x6042cd]
>  4: (()+0x68ba) [0x7f39d034a8ba]
>  5: (clone()+0x6d) [0x7f39cefdf02d]
> *** Caught signal (Aborted) **
>  in thread 0x7f39c6ce7700
>  ceph version 0.26 (commit:9981ff90968398da43c63106694d661f5e3d07d5)
>  1: /usr/bin/cosd() [0x61e42c]
>  2: (()+0xef60) [0x7f39d0352f60]
>  3: (gsignal()+0x35) [0x7f39cef42165]
>  4: (abort()+0x180) [0x7f39cef44f70]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f39cf7d5dc5]
>  6: (()+0xcb166) [0x7f39cf7d4166]
>  7: (()+0xcb193) [0x7f39cf7d4193]
>  8: (()+0xcb28e) [0x7f39cf7d428e]
>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x373) [0x6061e3]
>  10: (SyncEntryTimeout::finish(int)+0xf4) [0x5a0b34]
>  11: (SafeTimer::timer_thread()+0x36b) [0x601afb]
>  12: (SafeTimerThread::entry()+0xd) [0x6042cd]
>  13: (()+0x68ba) [0x7f39d034a8ba]
>  14: (clone()+0x6d) [0x7f39cefdf02d]
> 
> regards Martin

end of thread, other threads:[~2013-04-14  5:18 UTC | newest]

Thread overview: 9+ messages
     [not found] <64610990.14485.1302631358989.JavaMail.root@mail.linserv.se>
2011-04-12 18:05 ` osd stops Martin Wilderoth
2011-04-12 18:24   ` Gregory Farnum
2011-04-13 12:12     ` Martin Wilderoth
2011-04-13 19:38       ` Gregory Farnum
2011-04-13 19:43         ` Gregory Farnum
     [not found] <ab2410b5-fe4c-4600-a2c0-f36a708fb6e2@mail.linserv.se>
2013-04-14  5:07 ` Martin Wilderoth
     [not found] <1608788961.14465.1302625479260.JavaMail.root@mail.linserv.se>
2011-04-12 16:26 ` Martin Wilderoth
2011-04-12 16:57   ` Wido den Hollander
2011-04-12 17:24   ` Gregory Farnum
