* Hit suicide timeout after adding new osd @ 2013-01-17 14:35 Jens Kristian Søgaard 2013-01-17 14:47 ` Wido den Hollander 0 siblings, 1 reply; 37+ messages in thread From: Jens Kristian Søgaard @ 2013-01-17 14:35 UTC (permalink / raw) To: ceph-devel@vger.kernel.org Hi guys, I had a functioning Ceph system that reported HEALTH_OK. It was running with 3 osds on 3 servers. Then I added an extra osd on 1 of the servers using the commands from the documentation here: http://ceph.com/docs/master/rados/operations/add-or-rm-osds/ Shortly after I did that 2 of the existing osds crashed. I restarted them and after some hours they were up and running again, but soon one of them crashed again - and a third existing osd crashed as well. I restarted those two and waited some hours for them to come up. A short while later one of them crashed again. I have then restarted restarted that last one and watched the logs closely. It seems the same patterns repeats itself every time. It starts up doing its normal maintenance before going "up" (takes a long while). Then it seems to be running, but logs the following every 5 seconds: heartbeat_map is_healthy 'OSD::op_tp thread 0x7f051b7f6700' had timed out after 30 After some time it logs: =================================================== heartbeat_map is_healthy 'OSD::op_tp thread 0x7f051b7f6700' had suicide timed out after 300 2013-01-17 15:24:35.051524 7f053f149700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f053f149700 time 2013-01-17 15:24:33.849654 common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout") ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7) 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2eb) [0x8462bb] 2: (ceph::HeartbeatMap::is_healthy()+0x8e) [0x846a9e] 3: (ceph::HeartbeatMap::check_touch_file()+0x28) [0x846cc8] 4: (CephContextServiceThread::entry()+0x55) [0x8e01c5] 5: /lib64/libpthread.so.0() [0x360de07d14] 6: (clone()+0x6d) [0x360d6f167d] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. 2013-01-17 15:24:35.301183 7f053f149700 -1 *** Caught signal (Aborted) ** in thread 7f053f149700 ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7) 1: /usr/bin/ceph-osd() [0x82ea90] 2: /lib64/libpthread.so.0() [0x360de0efe0] 3: (gsignal()+0x35) [0x360d635925] 4: (abort()+0x148) [0x360d6370d8] 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x3611660dad] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. =================================================== How can I avoid this? - is it a bug, or have I done something wrong? I'm running Ceph 0.56.1 from the official RPMs on Fedora 17. The underlying disks and network connectivity has been tested and nothing seems to be wrong there. Thanks in advance for your assistance! -- Jens Kristian Søgaard, Mermaid Consulting ApS, jens@mermaidconsulting.dk, http://www.mermaidconsulting.com/ -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hit suicide timeout after adding new osd 2013-01-17 14:35 Hit suicide timeout after adding new osd Jens Kristian Søgaard @ 2013-01-17 14:47 ` Wido den Hollander 2013-01-17 14:50 ` Stefan Priebe 0 siblings, 1 reply; 37+ messages in thread From: Wido den Hollander @ 2013-01-17 14:47 UTC (permalink / raw) To: Jens Kristian Søgaard; +Cc: ceph-devel@vger.kernel.org Hi, On 01/17/2013 03:35 PM, Jens Kristian Søgaard wrote: > Hi guys, > > I had a functioning Ceph system that reported HEALTH_OK. It was running > with 3 osds on 3 servers. > > Then I added an extra osd on 1 of the servers using the commands from > the documentation here: > > http://ceph.com/docs/master/rados/operations/add-or-rm-osds/ > > Shortly after I did that 2 of the existing osds crashed. > > I restarted them and after some hours they were up and running again, > but soon one of them crashed again - and a third existing osd crashed as > well. I restarted those two and waited some hours for them to come up. A > short while later one of them crashed again. > > I have then restarted restarted that last one and watched the logs > closely. It seems the same patterns repeats itself every time. It starts > up doing its normal maintenance before going "up" (takes a long while). > Then it seems to be running, but logs the following every 5 seconds: > > heartbeat_map is_healthy 'OSD::op_tp thread 0x7f051b7f6700' had timed > out after 30 > > After some time it logs: > > =================================================== > heartbeat_map is_healthy 'OSD::op_tp thread 0x7f051b7f6700' had suicide > timed out after 300 > > 2013-01-17 15:24:35.051524 7f053f149700 -1 common/HeartbeatMap.cc: In > function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, > const char*, time_t)' thread 7f053f149700 time 2013-01-17 15:24:33.849654 > common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout") > > ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7) > 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, > long)+0x2eb) [0x8462bb] > 2: (ceph::HeartbeatMap::is_healthy()+0x8e) [0x846a9e] > 3: (ceph::HeartbeatMap::check_touch_file()+0x28) [0x846cc8] > 4: (CephContextServiceThread::entry()+0x55) [0x8e01c5] > 5: /lib64/libpthread.so.0() [0x360de07d14] > 6: (clone()+0x6d) [0x360d6f167d] > NOTE: a copy of the executable, or `objdump -rdS <executable>` is > needed to interpret this. > > 2013-01-17 15:24:35.301183 7f053f149700 -1 *** Caught signal (Aborted) ** > in thread 7f053f149700 > > ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7) > 1: /usr/bin/ceph-osd() [0x82ea90] > 2: /lib64/libpthread.so.0() [0x360de0efe0] > 3: (gsignal()+0x35) [0x360d635925] > 4: (abort()+0x148) [0x360d6370d8] > 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x3611660dad] > NOTE: a copy of the executable, or `objdump -rdS <executable>` is > needed to interpret this. > =================================================== > > How can I avoid this? - is it a bug, or have I done something wrong? > I think you are seeing the same issue as I noticed about two weeks ago: http://www.spinics.net/lists/ceph-devel/msg11328.html See this issue: http://tracker.newdream.net/issues/3714 I can't find branch wip-3714 anymore, so it might be already merged into next. You might want to try building from 'next' yourself or fetch some new packages from the RPM repos: http://eu.ceph.com/docs/master/install/rpm/ Wido > I'm running Ceph 0.56.1 from the official RPMs on Fedora 17. > The underlying disks and network connectivity has been tested and > nothing seems to be wrong there. > > Thanks in advance for your assistance! -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hit suicide timeout after adding new osd 2013-01-17 14:47 ` Wido den Hollander @ 2013-01-17 14:50 ` Stefan Priebe 2013-01-17 15:33 ` Wido den Hollander 0 siblings, 1 reply; 37+ messages in thread From: Stefan Priebe @ 2013-01-17 14:50 UTC (permalink / raw) To: Wido den Hollander; +Cc: Jens Kristian Søgaard, ceph-devel@vger.kernel.org Hi, Am 17.01.2013 15:47, schrieb Wido den Hollander: > You might want to try building from 'next' yourself or fetch some new > packages from the RPM repos: http://eu.ceph.com/docs/master/install/rpm/ Should it be backported to bobtail branch as well? Stefan ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hit suicide timeout after adding new osd 2013-01-17 14:50 ` Stefan Priebe @ 2013-01-17 15:33 ` Wido den Hollander 2013-01-17 15:37 ` Stefan Priebe 0 siblings, 1 reply; 37+ messages in thread From: Wido den Hollander @ 2013-01-17 15:33 UTC (permalink / raw) To: Stefan Priebe; +Cc: Jens Kristian Søgaard, ceph-devel@vger.kernel.org Hi, On 01/17/2013 03:50 PM, Stefan Priebe wrote: > Hi, > > Am 17.01.2013 15:47, schrieb Wido den Hollander: >> You might want to try building from 'next' yourself or fetch some new >> packages from the RPM repos: http://eu.ceph.com/docs/master/install/rpm/ > > Should it be backported to bobtail branch as well? > Don't think it's in there. I found the commit in the next branch: https://github.com/ceph/ceph/commit/7e94f6f1a7b7a865433edacd6a521f6ea1170eac Doesn't seem to be in 'bobtail'. Wido > Stefan > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hit suicide timeout after adding new osd 2013-01-17 15:33 ` Wido den Hollander @ 2013-01-17 15:37 ` Stefan Priebe 2013-01-17 17:17 ` Sage Weil 0 siblings, 1 reply; 37+ messages in thread From: Stefan Priebe @ 2013-01-17 15:37 UTC (permalink / raw) To: Wido den Hollander Cc: Jens Kristian Søgaard, ceph-devel@vger.kernel.org, Sage Weil Hi Sage, Am 17.01.2013 16:33, schrieb Wido den Hollander: > Hi, > On 01/17/2013 03:50 PM, Stefan Priebe wrote: >> Hi, >> Am 17.01.2013 15:47, schrieb Wido den Hollander: >>> You might want to try building from 'next' yourself or fetch some new >>> packages from the RPM repos: http://eu.ceph.com/docs/master/install/rpm/ >> >> Should it be backported to bobtail branch as well? >> > > Don't think it's in there. I found the commit in the next branch: > https://github.com/ceph/ceph/commit/7e94f6f1a7b7a865433edacd6a521f6ea1170eac > > Doesn't seem to be in 'bobtail'. > > Wido Was this forgotten? I mean a lot of people have posted about git suicide timeout with 0.56 oder 0.56.1 so it should be also in the bobtail branch? Greets, Stefan ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hit suicide timeout after adding new osd 2013-01-17 15:37 ` Stefan Priebe @ 2013-01-17 17:17 ` Sage Weil 2013-01-17 20:32 ` Jens Kristian Søgaard 0 siblings, 1 reply; 37+ messages in thread From: Sage Weil @ 2013-01-17 17:17 UTC (permalink / raw) To: Stefan Priebe Cc: Wido den Hollander, Jens Kristian Søgaard, ceph-devel@vger.kernel.org The fix for this is in v0.56.1, so I think you are seeing something else. If this is something you can reproduce with 'debug osd = 10' or 20 and 'debug ms = 1', that would be extremely helpful! Thanks sage On Thu, 17 Jan 2013, Stefan Priebe wrote: > Hi Sage, > > Am 17.01.2013 16:33, schrieb Wido den Hollander: > > Hi, > > On 01/17/2013 03:50 PM, Stefan Priebe wrote: > > > Hi, > > > Am 17.01.2013 15:47, schrieb Wido den Hollander: > > > > You might want to try building from 'next' yourself or fetch some new > > > > packages from the RPM repos: http://eu.ceph.com/docs/master/install/rpm/ > > > > > > Should it be backported to bobtail branch as well? > > > > > > > Don't think it's in there. I found the commit in the next branch: > > https://github.com/ceph/ceph/commit/7e94f6f1a7b7a865433edacd6a521f6ea1170eac > > > > Doesn't seem to be in 'bobtail'. > > > > Wido > > Was this forgotten? I mean a lot of people have posted about git suicide > timeout with 0.56 oder 0.56.1 so it should be also in the bobtail branch? > > Greets, > Stefan > > ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hit suicide timeout after adding new osd 2013-01-17 17:17 ` Sage Weil @ 2013-01-17 20:32 ` Jens Kristian Søgaard 2013-01-17 22:03 ` Sage Weil 0 siblings, 1 reply; 37+ messages in thread From: Jens Kristian Søgaard @ 2013-01-17 20:32 UTC (permalink / raw) To: Sage Weil; +Cc: Stefan Priebe, Wido den Hollander, ceph-devel@vger.kernel.org Hi Sage, > The fix for this is in v0.56.1, so I think you are seeing something else. > If this is something you can reproduce with 'debug osd = 10' or 20 and > 'debug ms = 1', that would be extremely helpful! I have set the debug level and restarted the osd. It initialized for about an hour and then crashed after 5 minutes with the same error message. You can download the full logs here: http://bit.ly/WbcSQp Thanks for your help! -- Jens Kristian Søgaard, Mermaid Consulting ApS, jens@mermaidconsulting.dk, http://www.mermaidconsulting.com/ -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hit suicide timeout after adding new osd 2013-01-17 20:32 ` Jens Kristian Søgaard @ 2013-01-17 22:03 ` Sage Weil 2013-01-18 11:24 ` Jens Kristian Søgaard 0 siblings, 1 reply; 37+ messages in thread From: Sage Weil @ 2013-01-17 22:03 UTC (permalink / raw) To: Jens Kristian Søgaard Cc: Stefan Priebe, Wido den Hollander, ceph-devel@vger.kernel.org Hi Jens- On Thu, 17 Jan 2013, Jens Kristian S?gaard wrote: > Hi Sage, > > > The fix for this is in v0.56.1, so I think you are seeing something else. > > If this is something you can reproduce with 'debug osd = 10' or 20 and > > 'debug ms = 1', that would be extremely helpful! > > I have set the debug level and restarted the osd. It initialized for about an > hour and then crashed after 5 minutes with the same error message. > > You can download the full logs here: > > http://bit.ly/WbcSQp > > Thanks for your help! Unfortunately I can't see what the thread gets stuck doing after it stops doing work (at an apparently normal point). Is there any chance you can attach to it with gdb as soon as the log slows down and the initial timeout messages appear? Or check the core file and see what thread 7fd8c1ff3700 is up to? Also, 'debug tp = 20' would narrow things down slightly. Thanks! sage ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hit suicide timeout after adding new osd 2013-01-17 22:03 ` Sage Weil @ 2013-01-18 11:24 ` Jens Kristian Søgaard 2013-01-18 21:28 ` Sage Weil 0 siblings, 1 reply; 37+ messages in thread From: Jens Kristian Søgaard @ 2013-01-18 11:24 UTC (permalink / raw) To: Sage Weil; +Cc: Stefan Priebe, Wido den Hollander, ceph-devel@vger.kernel.org Hi Sage, > Unfortunately I can't see what the thread gets stuck doing after it stops > doing work (at an apparently normal point). Is there any chance you can > attach to it with gdb as soon as the log slows down and the initial > timeout messages appear? Or check the core file and see what thread > 7fd8c1ff3700 is up to? Here's the info from the core file: (gdb) thread 45 [Switching to thread 45 (Thread 0x7fd8c1ff3700 (LWP 995))] #0 0x000000360de0acb4 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0 (gdb) bt #0 0x000000360de0acb4 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0 #1 0x00000000006e34a9 in RWLock::get_read (this=0x1aa54c8) at common/RWLock.h:51 #2 0x00000000006a0857 in OSD::queue_want_up_thru (this=this@entry=0x1aa44f0, want=want@entry=460) at osd/OSD.cc:2585 Python Exception <type 'exceptions.IndexError'> list index out of range: #3 0x00000000006ce175 in OSD::process_peering_events (this=0x1aa44f0, pgs=std::list) at osd/OSD.cc:6193 Python Exception <type 'exceptions.IndexError'> list index out of range: #4 0x0000000000709617 in OSD::PeeringWQ::_process (this=<optimized out>, pgs=std::list) at osd/OSD.h:718 #5 0x00000000008cbefc in ThreadPool::worker (this=0x1aa4938, wt=0x3c539a0) at common/WorkQueue.cc:113 #6 0x00000000008cce70 in ThreadPool::WorkThread::entry (this=<optimized out>) at common/WorkQueue.h:288 #7 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #8 0x000000360d6f167d in clone () from /lib64/libc.so.6 Does this help? -- Jens Kristian Søgaard, Mermaid Consulting ApS, jens@mermaidconsulting.dk, http://www.mermaidconsulting.com/ -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hit suicide timeout after adding new osd 2013-01-18 11:24 ` Jens Kristian Søgaard @ 2013-01-18 21:28 ` Sage Weil 2013-01-18 21:36 ` Jens Kristian Søgaard 0 siblings, 1 reply; 37+ messages in thread From: Sage Weil @ 2013-01-18 21:28 UTC (permalink / raw) To: Jens Kristian Søgaard Cc: Stefan Priebe, Wido den Hollander, ceph-devel@vger.kernel.org On Fri, 18 Jan 2013, Jens Kristian S?gaard wrote: > Hi Sage, > > > Unfortunately I can't see what the thread gets stuck doing after it stops > > doing work (at an apparently normal point). Is there any chance you can > > attach to it with gdb as soon as the log slows down and the initial timeout > > messages appear? Or check the core file and see what thread 7fd8c1ff3700 is > > up to? > > Here's the info from the core file: > > (gdb) thread 45 > [Switching to thread 45 (Thread 0x7fd8c1ff3700 (LWP 995))] > #0 0x000000360de0acb4 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0 > > (gdb) bt > #0 0x000000360de0acb4 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0 > #1 0x00000000006e34a9 in RWLock::get_read (this=0x1aa54c8) at > common/RWLock.h:51 > #2 0x00000000006a0857 in OSD::queue_want_up_thru (this=this@entry=0x1aa44f0, > want=want@entry=460) at osd/OSD.cc:2585 > Python Exception <type 'exceptions.IndexError'> list index out of range: > #3 0x00000000006ce175 in OSD::process_peering_events (this=0x1aa44f0, > pgs=std::list) at osd/OSD.cc:6193 > Python Exception <type 'exceptions.IndexError'> list index out of range: > #4 0x0000000000709617 in OSD::PeeringWQ::_process (this=<optimized out>, > pgs=std::list) at osd/OSD.h:718 > #5 0x00000000008cbefc in ThreadPool::worker (this=0x1aa4938, wt=0x3c539a0) at > common/WorkQueue.cc:113 > #6 0x00000000008cce70 in ThreadPool::WorkThread::entry (this=<optimized out>) > at common/WorkQueue.h:288 > #7 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 > #8 0x000000360d6f167d in clone () from /lib64/libc.so.6 Getting closer.. it looks like someone is blocked with map_lock held, or leaked the lock. Can you do a 'thread apply all bt' with that same core file/process? sage > > > Does this help? > > -- > Jens Kristian S?gaard, Mermaid Consulting ApS, > jens@mermaidconsulting.dk, > http://www.mermaidconsulting.com/ > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > > ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hit suicide timeout after adding new osd 2013-01-18 21:28 ` Sage Weil @ 2013-01-18 21:36 ` Jens Kristian Søgaard 2013-01-18 21:44 ` Sage Weil 0 siblings, 1 reply; 37+ messages in thread From: Jens Kristian Søgaard @ 2013-01-18 21:36 UTC (permalink / raw) To: Sage Weil; +Cc: Stefan Priebe, Wido den Hollander, ceph-devel@vger.kernel.org Hi, > Getting closer.. it looks like someone is blocked with map_lock held, or > leaked the lock. Can you do a 'thread apply all bt' with that same core > file/process? Sure, here's the result: (gdb) thread apply all bt Thread 59 (Thread 0x7fd8ce7fc700 (LWP 981)): #0 0x000000360d6e60b7 in unlink () from /lib64/libc.so.6 #1 0x00000000007fa741 in LFNIndex::lfn_unlink (this=this@entry=0x7fd8c8000c50, path=std::vector of length 3, capacity 4 = {...}, hoid=..., mangled_name="rb.0.11b7.4a933baa.00000000a3c0__head_B9A6E309__0") at os/LFNIndex.cc:758 #2 0x00000000007fb463 in LFNIndex::remove_object (this=0x7fd8c8000c50, from=std::vector of length 3, capacity 4 = {...}, hoid=...) at os/LFNIndex.cc:293 #3 0x0000000000802dd1 in HashIndex::_remove (this=0x7fd8c8000c50, path=std::vector of length 3, capacity 4 = {...}, hoid=..., mangled_name=...) at os/HashIndex.cc:273 #4 0x00000000007fbc0e in LFNIndex::unlink (this=0x7fd8c8000c50, hoid=...) at os/LFNIndex.cc:87 #5 0x00000000007b471f in FileStore::lfn_unlink (this=this@entry=0x1c25130, cid=..., o=..., spos=...) at os/FileStore.cc:361 #6 0x00000000007b4960 in FileStore::_remove (this=this@entry=0x1c25130, cid=..., oid=..., spos=...) at os/FileStore.cc:2841 #7 0x00000000007c967a in FileStore::_do_transaction (this=this@entry=0x1c25130, t=..., op_seq=op_seq@entry=13768542, trans_num=trans_num@entry=0) at os/FileStore.cc:2405 Python Exception <type 'exceptions.IndexError'> list index out of range: #8 0x00000000007cb47c in FileStore::do_transactions (this=0x1c25130, tls=std::list, op_seq=13768542) at os/FileStore.cc:2122 #9 0x00000000007a0b68 in FileStore::_do_op (this=0x1c25130, osr=<optimized out>) at os/FileStore.cc:1955 #10 0x00000000008cbefc in ThreadPool::worker (this=0x1c25b38, wt=0x1c65630) at common/WorkQueue.cc:113 #11 0x00000000008cce70 in ThreadPool::WorkThread::entry (this=<optimized out>) at common/WorkQueue.h:288 #12 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #13 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 58 (Thread 0x7fd8d41f2700 (LWP 1018)): #0 0x000000360d6e8bcf in poll () from /lib64/libc.so.6 #1 0x000000000096fcbc in Pipe::tcp_read_wait (this=this@entry=0x7fd884001320) at msg/Pipe.cc:1991 #2 0x000000000096ff80 in Pipe::tcp_read (this=this@entry=0x7fd884001320, buf=<optimized out>, buf@entry=0x7fd8d41f1c8f "\377\001", len=len@entry=1) at msg/Pipe.cc:1964 #3 0x0000000000980384 in Pipe::reader (this=0x7fd884001320) at msg/Pipe.cc:1222 #4 0x00000000009833bd in Pipe::Reader::entry (this=<optimized out>) at msg/Pipe.h:47 #5 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #6 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 57 (Thread 0x7fd8d42f3700 (LWP 1016)): #0 0x000000360de0b595 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x000000000097a10e in Wait (mutex=..., this=0x7fd87c000fe8) at common/Cond.h:55 #2 Pipe::writer (this=0x7fd87c000e10) at msg/Pipe.cc:1470 #3 0x000000000098329d in Pipe::Writer::entry (this=<optimized out>) at msg/Pipe.h:59 #4 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #5 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 56 (Thread 0x7fd8e46c5700 (LWP 1003)): #0 0x000000360de0b595 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x000000000097a10e in Wait (mutex=..., this=0x948a6b8) at common/Cond.h:55 #2 Pipe::writer (this=0x948a4e0) at msg/Pipe.cc:1470 #3 0x000000000098329d in Pipe::Writer::entry (this=<optimized out>) at msg/Pipe.h:59 #4 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #5 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 55 (Thread 0x7fd8d44f5700 (LWP 1015)): #0 0x000000360de0b595 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x000000000097a10e in Wait (mutex=..., this=0x7fd884000fe8) at common/Cond.h:55 #2 Pipe::writer (this=0x7fd884000e10) at msg/Pipe.cc:1470 #3 0x000000000098329d in Pipe::Writer::entry (this=<optimized out>) at msg/Pipe.h:59 #4 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #5 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 54 (Thread 0x7fd8e691d700 (LWP 32747)): #0 0x000000360de0b595 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x000000000084daeb in ceph::log::Log::entry (this=0x1a82b80) at log/Log.cc:323 #2 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #3 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 53 (Thread 0x7fd8e45c4700 (LWP 1006)): #0 0x000000360d6e8bcf in poll () from /lib64/libc.so.6 #1 0x000000000096fcbc in Pipe::tcp_read_wait (this=this@entry=0x948a4e0) at msg/Pipe.cc:1991 #2 0x000000000096ff80 in Pipe::tcp_read (this=this@entry=0x948a4e0, buf=<optimized out>, buf@entry=0x7fd8e45c3c8f "\377*", len=len@entry=1) at msg/Pipe.cc:1964 #3 0x0000000000980384 in Pipe::reader (this=0x948a4e0) at msg/Pipe.cc:1222 #4 0x00000000009833bd in Pipe::Reader::entry (this=<optimized out>) at msg/Pipe.h:47 #5 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #6 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 52 (Thread 0x7fd8baffd700 (LWP 1000)): #0 0x000000360de0b902 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000006b5642 in WaitUntil (when=..., mutex=..., this=0x1aa4fe8) at common/Cond.h:71 #2 WaitInterval (interval=..., mutex=..., cct=<optimized out>, this=0x1aa4fe8) at common/Cond.h:79 #3 OSD::heartbeat_entry (this=0x1aa44f0) at osd/OSD.cc:2086 #4 0x00000000006f9f0d in OSD::T_Heartbeat::entry (this=<optimized out>) at osd/OSD.h:548 #5 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #6 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 51 (Thread 0x7fd8cc7f8700 (LWP 1019)): #0 0x000000360de0b595 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 ---Type <return> to continue, or q <return> to quit--- #1 0x000000000097a10e in Wait (mutex=..., this=0x7fd8840014f8) at common/Cond.h:55 #2 Pipe::writer (this=0x7fd884001320) at msg/Pipe.cc:1470 #3 0x000000000098329d in Pipe::Writer::entry (this=<optimized out>) at msg/Pipe.h:59 #4 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #5 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 50 (Thread 0x7fd8bbfff700 (LWP 998)): #0 0x000000360de0b595 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000007a0723 in Wait (mutex=..., this=0x1c25970) at common/Cond.h:55 #2 FileStore::op_queue_reserve_throttle (this=this@entry=0x1c25130, o=o@entry=0x7fd8b0066070) at os/FileStore.cc:1925 Python Exception <type 'exceptions.IndexError'> list index out of range: #3 0x00000000007bde7a in FileStore::queue_transactions (this=0x1c25130, posr=<optimized out>, tls=empty std::list, onreadable=0x7fd8b006a430, ondisk=0x7fd8b00329f0, onreadable_sync=<optimized out>, osd_op=std::tr1::shared_ptr (empty) 0x0) at os/FileStore.cc:2030 #4 0x00000000007ce702 in ObjectStore::queue_transaction (this=0x1c25130, osr=osr@entry=0x3c24260, t=t@entry=0x7fd8b0100340, onreadable=0x7fd8b006a430, ondisk=ondisk@entry=0x7fd8b00329f0, onreadable_sync=onreadable_sync@entry=0x0, op=std::tr1::shared_ptr (empty) 0x0) at os/ObjectStore.h:645 #5 0x00000000006a9c22 in OSD::RemoveWQ::_process (this=0x1aa5a60, item=0x7db97f0) at osd/OSD.cc:2316 #6 0x00000000008cbefc in ThreadPool::worker (this=0x1aa4c48, wt=0x3c4d860) at common/WorkQueue.cc:113 #7 0x00000000008cce70 in ThreadPool::WorkThread::entry (this=<optimized out>) at common/WorkQueue.h:288 #8 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #9 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 49 (Thread 0x7fd8e43c2700 (LWP 1008)): #0 0x000000360de0b595 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x000000000097a10e in Wait (mutex=..., this=0x7fd88c000a98) at common/Cond.h:55 #2 Pipe::writer (this=0x7fd88c0008c0) at msg/Pipe.cc:1470 #3 0x000000000098329d in Pipe::Writer::entry (this=<optimized out>) at msg/Pipe.h:59 #4 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #5 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 48 (Thread 0x7fd8cc2f3700 (LWP 1026)): #0 0x000000360d6e8bcf in poll () from /lib64/libc.so.6 #1 0x000000000096fcbc in Pipe::tcp_read_wait (this=this@entry=0x7fd858000d90) at msg/Pipe.cc:1991 #2 0x000000000096ff80 in Pipe::tcp_read (this=this@entry=0x7fd858000d90, buf=<optimized out>, buf@entry=0x7fd8cc2f2c8f "\377]", len=len@entry=1) at msg/Pipe.cc:1964 #3 0x0000000000980384 in Pipe::reader (this=0x7fd858000d90) at msg/Pipe.cc:1222 #4 0x00000000009833bd in Pipe::Reader::entry (this=<optimized out>) at msg/Pipe.h:47 #5 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #6 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 47 (Thread 0x7fd8c17f2700 (LWP 996)): #0 0x000000360de0b595 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000007a0723 in Wait (mutex=..., this=0x1c25970) at common/Cond.h:55 #2 FileStore::op_queue_reserve_throttle (this=this@entry=0x1c25130, o=o@entry=0x7fd8bc47aae0) at os/FileStore.cc:1925 Python Exception <type 'exceptions.IndexError'> list index out of range: #3 0x00000000007bde7a in FileStore::queue_transactions (this=0x1c25130, posr=<optimized out>, tls=empty std::list, onreadable=0x7fd8bc469fd0, ondisk=0x7fd8bc9ff4c0, onreadable_sync=<optimized out>, osd_op=std::tr1::shared_ptr (empty) 0x0) at os/FileStore.cc:2030 #4 0x00000000007ce702 in ObjectStore::queue_transaction (this=0x1c25130, osr=0x26ea4c0, t=0x7fd8bc84abd0, onreadable=0x7fd8bc469fd0, ondisk=0x7fd8bc9ff4c0, onreadable_sync=0x0, op=std::tr1::shared_ptr (empty) 0x0) at os/ObjectStore.h:645 #5 0x00000000006a070c in OSD::dispatch_context_transaction (this=this@entry=0x1aa44f0, ctx=..., pg=pg@entry=0xb425ae0) at osd/OSD.cc:4895 Python Exception <type 'exceptions.IndexError'> list index out of range: #6 0x00000000006cdf7d in OSD::process_peering_events (this=0x1aa44f0, pgs=std::list) at osd/OSD.cc:6188 Python Exception <type 'exceptions.IndexError'> list index out of range: #7 0x0000000000709617 in OSD::PeeringWQ::_process (this=<optimized out>, pgs=std::list) at osd/OSD.h:718 #8 0x00000000008cbefc in ThreadPool::worker (this=0x1aa4938, wt=0x3c4f310) at common/WorkQueue.cc:113 #9 0x00000000008cce70 in ThreadPool::WorkThread::entry (this=<optimized out>) at common/WorkQueue.h:288 #10 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #11 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 46 (Thread 0x7fd8e42c1700 (LWP 1009)): #0 0x000000360d6e8bcf in poll () from /lib64/libc.so.6 #1 0x000000000096fcbc in Pipe::tcp_read_wait (this=this@entry=0x7fd8840008c0) at msg/Pipe.cc:1991 #2 0x000000000096ff80 in Pipe::tcp_read (this=this@entry=0x7fd8840008c0, buf=<optimized out>, buf@entry=0x7fd8e42c0c8f "\377\002", len=len@entry=1) at msg/Pipe.cc:1964 #3 0x0000000000980384 in Pipe::reader (this=0x7fd8840008c0) at msg/Pipe.cc:1222 #4 0x00000000009833bd in Pipe::Reader::entry (this=<optimized out>) at msg/Pipe.h:47 #5 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #6 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 45 (Thread 0x7fd8c1ff3700 (LWP 995)): #0 0x000000360de0acb4 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0 #1 0x00000000006e34a9 in RWLock::get_read (this=0x1aa54c8) at common/RWLock.h:51 #2 0x00000000006a0857 in OSD::queue_want_up_thru (this=this@entry=0x1aa44f0, want=want@entry=460) at osd/OSD.cc:2585 Python Exception <type 'exceptions.IndexError'> list index out of range: #3 0x00000000006ce175 in OSD::process_peering_events (this=0x1aa44f0, pgs=std::list) at osd/OSD.cc:6193 Python Exception <type 'exceptions.IndexError'> list index out of range: #4 0x0000000000709617 in OSD::PeeringWQ::_process (this=<optimized out>, pgs=std::list) at osd/OSD.h:718 #5 0x00000000008cbefc in ThreadPool::worker (this=0x1aa4938, wt=0x3c539a0) at common/WorkQueue.cc:113 #6 0x00000000008cce70 in ThreadPool::WorkThread::entry (this=<optimized out>) at common/WorkQueue.h:288 #7 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #8 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 44 (Thread 0x7fd8cc1f2700 (LWP 1027)): #0 0x000000360d6e8bcf in poll () from /lib64/libc.so.6 #1 0x000000000096fcbc in Pipe::tcp_read_wait (this=this@entry=0x7fd858001d10) at msg/Pipe.cc:1991 #2 0x000000000096ff80 in Pipe::tcp_read (this=this@entry=0x7fd858001d10, buf=<optimized out>, buf@entry=0x7fd8cc1f1c8f "\377]", len=len@entry=1) at msg/Pipe.cc:1964 #3 0x0000000000980384 in Pipe::reader (this=0x7fd858001d10) at msg/Pipe.cc:1222 #4 0x00000000009833bd in Pipe::Reader::entry (this=<optimized out>) at msg/Pipe.h:47 ---Type <return> to continue, or q <return> to quit--- #5 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #6 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 43 (Thread 0x7fd8e41c0700 (LWP 1010)): #0 0x000000360d6e8bcf in poll () from /lib64/libc.so.6 #1 0x000000000096fcbc in Pipe::tcp_read_wait (this=this@entry=0x7fd87c0008c0) at msg/Pipe.cc:1991 #2 0x000000000096ff80 in Pipe::tcp_read (this=this@entry=0x7fd87c0008c0, buf=<optimized out>, buf@entry=0x7fd8e41bfc8f "\377_", len=len@entry=1) at msg/Pipe.cc:1964 #3 0x0000000000980384 in Pipe::reader (this=0x7fd87c0008c0) at msg/Pipe.cc:1222 #4 0x00000000009833bd in Pipe::Reader::entry (this=<optimized out>) at msg/Pipe.h:47 #5 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #6 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 42 (Thread 0x7fd8c27f4700 (LWP 994)): #0 0x000000360de0b595 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x000000000084091f in Wait (mutex=..., this=0x7fffed153d70) at common/Cond.h:55 #2 Finisher::finisher_thread_entry (this=0x7fffed153d10) at common/Finisher.cc:80 #3 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #4 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 41 (Thread 0x7fd8cc3f4700 (LWP 1028)): #0 0x000000360d6e8bcf in poll () from /lib64/libc.so.6 #1 0x000000000096fcbc in Pipe::tcp_read_wait (this=this@entry=0x7fd858001470) at msg/Pipe.cc:1991 #2 0x000000000096ff80 in Pipe::tcp_read (this=this@entry=0x7fd858001470, buf=<optimized out>, buf@entry=0x7fd8cc3f3c8f "\377]", len=len@entry=1) at msg/Pipe.cc:1964 #3 0x0000000000980384 in Pipe::reader (this=0x7fd858001470) at msg/Pipe.cc:1222 #4 0x00000000009833bd in Pipe::Reader::entry (this=<optimized out>) at msg/Pipe.h:47 #5 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #6 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 40 (Thread 0x7fd8c2ff5700 (LWP 993)): #0 0x000000360de0b902 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000008c7ed8 in WaitUntil (when=..., mutex=..., this=0x7fffed153c58) at common/Cond.h:71 #2 SafeTimer::timer_thread (this=0x7fffed153c48) at common/Timer.cc:115 #3 0x00000000008c8e0d in SafeTimerThread::entry (this=<optimized out>) at common/Timer.cc:38 #4 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #5 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 39 (Thread 0x7fd8e691e780 (LWP 32746)): #0 0x000000360de08e60 in pthread_join () from /lib64/libpthread.so.0 #1 0x00000000008c3322 in Thread::join (this=0x1a86380, prval=<optimized out>) at common/Thread.cc:121 #2 0x00000000008bc028 in SimpleMessenger::wait (this=0x1a86180) at msg/SimpleMessenger.cc:490 #3 0x0000000000611a31 in main (argc=<optimized out>, argv=<optimized out>) at ceph_osd.cc:446 Thread 38 (Thread 0x7fd8d45f6700 (LWP 1013)): #0 0x000000360de0b595 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x000000000097a10e in Wait (mutex=..., this=0x7fd87c000a98) at common/Cond.h:55 #2 Pipe::writer (this=0x7fd87c0008c0) at msg/Pipe.cc:1470 #3 0x000000000098329d in Pipe::Writer::entry (this=<optimized out>) at msg/Pipe.h:59 #4 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #5 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 37 (Thread 0x7fd8c37f6700 (LWP 992)): #0 0x000000360d6e8bcf in poll () from /lib64/libc.so.6 #1 0x0000000000947c31 in Accepter::entry (this=0x1a88a40) at msg/Accepter.cc:198 #2 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #3 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 36 (Thread 0x7fd8c0ff1700 (LWP 997)): #0 0x000000360de0b902 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000008cbbe0 in WaitUntil (when=..., mutex=..., this=0x1aa4b38) at common/Cond.h:71 #2 WaitInterval (interval=..., mutex=..., cct=<optimized out>, this=0x1aa4b38) at common/Cond.h:79 #3 ThreadPool::worker (this=0x1aa4ac0, wt=0x3c4fcb0) at common/WorkQueue.cc:131 #4 0x00000000008cce70 in ThreadPool::WorkThread::entry (this=<optimized out>) at common/WorkQueue.h:288 #5 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #6 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 35 (Thread 0x7fd8c3ff7700 (LWP 991)): #0 0x000000360de0b595 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x000000000096345b in Wait (mutex=..., this=0x1a88ad0) at common/Cond.h:55 #2 DispatchQueue::entry (this=0x1a88a68) at msg/DispatchQueue.cc:117 #3 0x00000000008c0a4d in DispatchQueue::DispatchThread::entry (this=<optimized out>) at msg/DispatchQueue.h:85 #4 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #5 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 34 (Thread 0x7fd8d46f7700 (LWP 1012)): #0 0x000000360d6e8bcf in poll () from /lib64/libc.so.6 #1 0x000000000096fcbc in Pipe::tcp_read_wait (this=this@entry=0x7fd884000e10) at msg/Pipe.cc:1991 ---Type <return> to continue, or q <return> to quit--- #2 0x000000000096ff80 in Pipe::tcp_read (this=this@entry=0x7fd884000e10, buf=<optimized out>, buf@entry=0x7fd8d46f6c8f "\377", len=len@entry=1) at msg/Pipe.cc:1964 #3 0x0000000000980384 in Pipe::reader (this=0x7fd884000e10) at msg/Pipe.cc:1222 #4 0x00000000009833bd in Pipe::Reader::entry (this=<optimized out>) at msg/Pipe.h:47 #5 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #6 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 33 (Thread 0x7fd8c57fa700 (LWP 988)): #0 0x000000360de0b595 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000007a0723 in Wait (mutex=..., this=0x1c25970) at common/Cond.h:55 #2 FileStore::op_queue_reserve_throttle (this=this@entry=0x1c25130, o=o@entry=0x7fd8681ab170) at os/FileStore.cc:1925 Python Exception <type 'exceptions.IndexError'> list index out of range: #3 0x00000000007bde7a in FileStore::queue_transactions (this=0x1c25130, posr=<optimized out>, tls=empty std::list, onreadable=0x7fd8681b0010, ondisk=0x0, onreadable_sync=<optimized out>, osd_op=std::tr1::shared_ptr (empty) 0x0) at os/FileStore.cc:2030 #4 0x00000000007ce702 in ObjectStore::queue_transaction (this=0x1c25130, osr=osr@entry=0x0, t=t@entry=0x7fd8681a3770, onreadable=onreadable@entry=0x7fd8681b0010, ondisk=ondisk@entry=0x0, onreadable_sync=onreadable_sync@entry=0x7fd8681a2e00, op=std::tr1::shared_ptr (empty) 0x0) at os/ObjectStore.h:645 #5 0x00000000006dc1f1 in OSD::handle_osd_map (this=this@entry=0x1aa44f0, m=m@entry=0x7fd87802a8b0) at osd/OSD.cc:3951 #6 0x00000000006dd73b in OSD::_dispatch (this=this@entry=0x1aa44f0, m=m@entry=0x7fd87802a8b0) at osd/OSD.cc:3417 #7 0x00000000006dddff in OSD::ms_dispatch (this=0x1aa44f0, m=0x7fd87802a8b0) at osd/OSD.cc:3220 #8 0x00000000009635f3 in ms_deliver_dispatch (m=0x7fd87802a8b0, this=0x1a86c90) at msg/Messenger.h:549 #9 DispatchQueue::entry (this=0x1a86d78) at msg/DispatchQueue.cc:107 #10 0x00000000008c0a4d in DispatchQueue::DispatchThread::entry (this=<optimized out>) at msg/DispatchQueue.h:85 #11 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #12 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 32 (Thread 0x7fd8e44c3700 (LWP 1007)): #0 0x000000360d6e8bcf in poll () from /lib64/libc.so.6 #1 0x000000000096fcbc in Pipe::tcp_read_wait (this=this@entry=0x7fd88c0008c0) at msg/Pipe.cc:1991 #2 0x000000000096ff80 in Pipe::tcp_read (this=this@entry=0x7fd88c0008c0, buf=<optimized out>, buf@entry=0x7fd8e44c2c8f "\377", len=len@entry=1) at msg/Pipe.cc:1964 #3 0x0000000000980384 in Pipe::reader (this=0x7fd88c0008c0) at msg/Pipe.cc:1222 #4 0x00000000009833bd in Pipe::Reader::entry (this=<optimized out>) at msg/Pipe.h:47 #5 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #6 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 31 (Thread 0x7fd8c5ffb700 (LWP 987)): #0 0x000000360d6e8bcf in poll () from /lib64/libc.so.6 #1 0x0000000000947c31 in Accepter::entry (this=0x1a86240) at msg/Accepter.cc:198 #2 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #3 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 30 (Thread 0x7fd8cc4f5700 (LWP 1023)): #0 0x000000360de0b595 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x000000000097a10e in Wait (mutex=..., this=0x7fd858001ee8) at common/Cond.h:55 #2 Pipe::writer (this=0x7fd858001d10) at msg/Pipe.cc:1470 #3 0x000000000098329d in Pipe::Writer::entry (this=<optimized out>) at msg/Pipe.h:59 #4 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #5 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 29 (Thread 0x7fd8b9ffb700 (LWP 1002)): #0 0x000000360de0b595 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000008c785a in Wait (mutex=..., this=0x1aa5f70) at common/Cond.h:55 #2 SafeTimer::timer_thread (this=0x1aa5f60) at common/Timer.cc:113 #3 0x00000000008c8e0d in SafeTimerThread::entry (this=<optimized out>) at common/Timer.cc:38 #4 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #5 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 28 (Thread 0x7fd8c77fe700 (LWP 984)): #0 0x000000360de0b595 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x000000000084091f in Wait (mutex=..., this=0x1c25638) at common/Cond.h:55 #2 Finisher::finisher_thread_entry (this=0x1c255d8) at common/Finisher.cc:80 #3 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #4 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 27 (Thread 0x7fd8cc5f6700 (LWP 1022)): #0 0x000000360de0b595 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x000000000097a10e in Wait (mutex=..., this=0x7fd858001648) at common/Cond.h:55 #2 Pipe::writer (this=0x7fd858001470) at msg/Pipe.cc:1470 #3 0x000000000098329d in Pipe::Writer::entry (this=<optimized out>) at msg/Pipe.h:59 #4 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #5 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 26 (Thread 0x7fd8cd7fa700 (LWP 983)): #0 0x000000360de0b595 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x000000000084091f in Wait (mutex=..., this=0x1c25a68) at common/Cond.h:55 #2 Finisher::finisher_thread_entry (this=0x1c25a08) at common/Finisher.cc:80 #3 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #4 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 25 (Thread 0x7fd8ba7fc700 (LWP 1001)): ---Type <return> to continue, or q <return> to quit--- #0 0x000000360de0b595 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x000000000084091f in Wait (mutex=..., this=0x1aa6210) at common/Cond.h:55 #2 Finisher::finisher_thread_entry (this=0x1aa61b0) at common/Finisher.cc:80 #3 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #4 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 24 (Thread 0x7fd8cc6f7700 (LWP 1021)): #0 0x000000360de0b595 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x000000000097a10e in Wait (mutex=..., this=0x7fd858000f68) at common/Cond.h:55 #2 Pipe::writer (this=0x7fd858000d90) at msg/Pipe.cc:1470 #3 0x000000000098329d in Pipe::Writer::entry (this=<optimized out>) at msg/Pipe.h:59 #4 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #5 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 23 (Thread 0x7fd8cdffb700 (LWP 982)): #0 0x000000360de0b595 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000007b569c in Wait (mutex=..., this=0x1c25cf0) at common/Cond.h:55 #2 FileStore::flusher_entry (this=0x1c25130) at os/FileStore.cc:3272 #3 0x00000000007cea0d in FileStore::FlusherThread::entry (this=<optimized out>) at os/FileStore.h:262 #4 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #5 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 22 (Thread 0x7fd8bb7fe700 (LWP 999)): #0 0x000000360de0b902 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000008cbbe0 in WaitUntil (when=..., mutex=..., this=0x1aa4e48) at common/Cond.h:71 #2 WaitInterval (interval=..., mutex=..., cct=<optimized out>, this=0x1aa4e48) at common/Cond.h:79 #3 ThreadPool::worker (this=0x1aa4dd0, wt=0x3c45700) at common/WorkQueue.cc:131 #4 0x00000000008cce70 in ThreadPool::WorkThread::entry (this=<optimized out>) at common/WorkQueue.h:288 #5 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #6 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 21 (Thread 0x7fd8d43f4700 (LWP 1014)): #0 0x000000360d6e8bcf in poll () from /lib64/libc.so.6 #1 0x000000000096fcbc in Pipe::tcp_read_wait (this=this@entry=0x7fd87c000e10) at msg/Pipe.cc:1991 #2 0x000000000096ff80 in Pipe::tcp_read (this=this@entry=0x7fd87c000e10, buf=<optimized out>, buf@entry=0x7fd8d43f3c8f "\377`", len=len@entry=1) at msg/Pipe.cc:1964 #3 0x0000000000980384 in Pipe::reader (this=0x7fd87c000e10) at msg/Pipe.cc:1222 #4 0x00000000009833bd in Pipe::Reader::entry (this=<optimized out>) at msg/Pipe.h:47 #5 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #6 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 20 (Thread 0x7fd8ceffd700 (LWP 980)): #0 0x000000360de0ddcd in __lll_lock_wait () from /lib64/libpthread.so.0 #1 0x000000360de09c56 in _L_lock_840 () from /lib64/libpthread.so.0 #2 0x000000360de09b58 in pthread_mutex_lock () from /lib64/libpthread.so.0 #3 0x000000000089d85b in Mutex::Lock (this=0x7fd8b0000d58, no_lockdep=<optimized out>) at common/Mutex.cc:90 #4 0x00000000007a0ae5 in FileStore::_do_op (this=0x1c25130, osr=0x7fd8b0000c90) at os/FileStore.cc:1951 #5 0x00000000008cbefc in ThreadPool::worker (this=0x1c25b38, wt=0x1aae360) at common/WorkQueue.cc:113 #6 0x00000000008cce70 in ThreadPool::WorkThread::entry (this=<optimized out>) at common/WorkQueue.h:288 #7 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #8 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 19 (Thread 0x7fd8c47f8700 (LWP 990)): #0 0x000000360de0b595 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x000000000096345b in Wait (mutex=..., this=0x1a87bb0) at common/Cond.h:55 #2 DispatchQueue::entry (this=0x1a87b48) at msg/DispatchQueue.cc:117 #3 0x00000000008c0a4d in DispatchQueue::DispatchThread::entry (this=<optimized out>) at msg/DispatchQueue.h:85 #4 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #5 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 18 (Thread 0x7fd8cf7fe700 (LWP 979)): #0 0x000000360de0b595 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x000000000084091f in Wait (mutex=..., this=0x1c251a8) at common/Cond.h:55 #2 Finisher::finisher_thread_entry (this=0x1c25148) at common/Finisher.cc:80 #3 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #4 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 17 (Thread 0x7fd8d47f8700 (LWP 1011)): #0 0x000000360de0b595 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x000000000097a10e in Wait (mutex=..., this=0x7fd884000a98) at common/Cond.h:55 #2 Pipe::writer (this=0x7fd8840008c0) at msg/Pipe.cc:1470 #3 0x000000000098329d in Pipe::Writer::entry (this=<optimized out>) at msg/Pipe.h:59 #4 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #5 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 16 (Thread 0x7fd8c6ffd700 (LWP 985)): #0 0x000000360de0b902 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 ---Type <return> to continue, or q <return> to quit--- #1 0x00000000008c7ed8 in WaitUntil (when=..., mutex=..., this=0x1c25818) at common/Cond.h:71 #2 SafeTimer::timer_thread (this=0x1c25808) at common/Timer.cc:115 #3 0x00000000008c8e0d in SafeTimerThread::entry (this=<optimized out>) at common/Timer.cc:38 #4 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #5 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 15 (Thread 0x7fd8cffff700 (LWP 978)): #0 0x000000360de0b595 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000006e283a in Cond::Wait (this=0x1bbd920, mutex=...) at common/Cond.h:55 #2 0x0000000000797a15 in FileJournal::write_finish_thread_entry (this=0x1bbd5b0) at os/FileJournal.cc:1316 #3 0x00000000006dfa7d in FileJournal::WriteFinisher::entry (this=<optimized out>) at os/FileJournal.h:303 #4 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #5 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 14 (Thread 0x7fd8c4ff9700 (LWP 989)): #0 0x000000360d6e8bcf in poll () from /lib64/libc.so.6 #1 0x0000000000947c31 in Accepter::entry (this=0x1a86d50) at msg/Accepter.cc:198 #2 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #3 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 13 (Thread 0x7fd8d57fa700 (LWP 977)): #0 0x000000360de0b595 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x000000000079a96e in Wait (mutex=..., this=0x1bbd6e8) at common/Cond.h:55 #2 FileJournal::write_thread_entry (this=0x1bbd5b0) at os/FileJournal.cc:1108 #3 0x00000000006dfa9d in FileJournal::Writer::entry (this=<optimized out>) at os/FileJournal.h:293 #4 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #5 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 12 (Thread 0x7fd8d67fc700 (LWP 969)): #0 0x000000360de0b595 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x0000000000829703 in BGThread (this=0x1aa3be0) at util/env_posix.cc:522 #2 leveldb::(anonymous namespace)::PosixEnv::BGThreadWrapper (arg=0x1aa3be0) at util/env_posix.cc:470 #3 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #4 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 11 (Thread 0x7fd8d77fe700 (LWP 965)): #0 0x000000360de0b595 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000008bce60 in Wait (mutex=..., this=0x1a870f0) at common/Cond.h:55 #2 SimpleMessenger::reaper_entry (this=0x1a86c90) at msg/SimpleMessenger.cc:206 #3 0x00000000008c148d in SimpleMessenger::ReaperThread::entry (this=<optimized out>) at msg/SimpleMessenger.h:411 #4 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #5 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 10 (Thread 0x7fd8d6ffd700 (LWP 968)): #0 0x000000360de0b595 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000008c785a in Wait (mutex=..., this=0x1aa6098) at common/Cond.h:55 #2 SafeTimer::timer_thread (this=0x1aa6088) at common/Timer.cc:113 #3 0x00000000008c8e0d in SafeTimerThread::entry (this=<optimized out>) at common/Timer.cc:38 #4 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #5 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 9 (Thread 0x7fd8c67fc700 (LWP 986)): #0 0x000000360de0ddcd in __lll_lock_wait () from /lib64/libpthread.so.0 #1 0x000000360de09c56 in _L_lock_840 () from /lib64/libpthread.so.0 #2 0x000000360de09b58 in pthread_mutex_lock () from /lib64/libpthread.so.0 #3 0x000000000089d85b in Mutex::Lock (this=0x1aa4500, no_lockdep=<optimized out>) at common/Mutex.cc:90 #4 0x00000000006ddb37 in OSD::ms_dispatch (this=0x1aa44f0, m=0x7fd880003930) at osd/OSD.cc:3211 #5 0x00000000009635f3 in ms_deliver_dispatch (m=0x7fd880003930, this=0x1a86180) at msg/Messenger.h:549 #6 DispatchQueue::entry (this=0x1a86268) at msg/DispatchQueue.cc:107 #7 0x00000000008c0a4d in DispatchQueue::DispatchThread::entry (this=<optimized out>) at msg/DispatchQueue.h:85 #8 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #9 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 8 (Thread 0x7fd8d7fff700 (LWP 967)): #0 0x000000360de0ddcd in __lll_lock_wait () from /lib64/libpthread.so.0 #1 0x000000360de1043c in _L_cond_lock_848 () from /lib64/libpthread.so.0 #2 0x000000360de10318 in __pthread_mutex_cond_lock () from /lib64/libpthread.so.0 #3 0x000000360de0b9b6 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #4 0x00000000008c7ed8 in WaitUntil (when=..., mutex=..., this=0x1aa4568) at common/Cond.h:71 #5 SafeTimer::timer_thread (this=0x1aa4558) at common/Timer.cc:115 #6 0x00000000008c8e0d in SafeTimerThread::entry (this=<optimized out>) at common/Timer.cc:38 #7 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #8 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 7 (Thread 0x7fd8ccff9700 (LWP 964)): #0 0x000000360de0b595 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 ---Type <return> to continue, or q <return> to quit--- #1 0x00000000008bce60 in Wait (mutex=..., this=0x1a88de0) at common/Cond.h:55 #2 SimpleMessenger::reaper_entry (this=0x1a88980) at msg/SimpleMessenger.cc:206 #3 0x00000000008c148d in SimpleMessenger::ReaperThread::entry (this=<optimized out>) at msg/SimpleMessenger.h:411 #4 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #5 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 6 (Thread 0x7fd8d5ffb700 (LWP 976)): #0 0x000000360de0b902 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000007cd471 in Cond::WaitUntil (this=this@entry=0x1c25768, mutex=..., when=...) at common/Cond.h:71 #2 0x00000000007cd58e in Cond::WaitInterval (this=this@entry=0x1c25768, cct=<optimized out>, mutex=..., interval=...) at common/Cond.h:79 #3 0x00000000007c3463 in FileStore::sync_entry (this=0x1c25130) at os/FileStore.cc:3312 #4 0x00000000007cff6d in FileStore::SyncThread::entry (this=<optimized out>) at os/FileStore.h:115 #5 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #6 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 5 (Thread 0x7fd8d4ff9700 (LWP 963)): #0 0x000000360de0b595 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000008bce60 in Wait (mutex=..., this=0x1a87ec0) at common/Cond.h:55 #2 SimpleMessenger::reaper_entry (this=0x1a87a60) at msg/SimpleMessenger.cc:206 #3 0x00000000008c148d in SimpleMessenger::ReaperThread::entry (this=<optimized out>) at msg/SimpleMessenger.h:411 #4 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #5 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 4 (Thread 0x7fd8e4eda700 (LWP 966)): #0 0x000000360d6ea9e3 in select () from /lib64/libc.so.6 #1 0x000000000082f6fa in SignalHandler::entry (this=0x1aa3410) at global/signal_handler.cc:210 #2 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #3 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 3 (Thread 0x7fd8e56db700 (LWP 32750)): #0 0x000000360d6e8bcf in poll () from /lib64/libc.so.6 #1 0x00000000008cefcd in AdminSocket::entry (this=0x1a830d0) at common/admin_socket.cc:228 #2 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #3 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 2 (Thread 0x7fd8c7fff700 (LWP 962)): #0 0x000000360de0b595 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000008bce60 in Wait (mutex=..., this=0x1a865e0) at common/Cond.h:55 #2 SimpleMessenger::reaper_entry (this=0x1a86180) at msg/SimpleMessenger.cc:206 #3 0x00000000008c148d in SimpleMessenger::ReaperThread::entry (this=<optimized out>) at msg/SimpleMessenger.h:411 #4 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #5 0x000000360d6f167d in clone () from /lib64/libc.so.6 Thread 1 (Thread 0x7fd8e5edc700 (LWP 32749)): #0 0x000000360de0eebb in raise () from /lib64/libpthread.so.0 #1 0x000000000082eb46 in reraise_fatal (signum=6) at global/signal_handler.cc:58 #2 handle_fatal_signal (signum=6) at global/signal_handler.cc:104 #3 <signal handler called> #4 0x000000360d635925 in raise () from /lib64/libc.so.6 #5 0x000000360d6370d8 in abort () from /lib64/libc.so.6 #6 0x0000003611660dad in __gnu_cxx::__verbose_terminate_handler() () from /lib64/libstdc++.so.6 #7 0x000000361165eea6 in ?? () from /lib64/libstdc++.so.6 #8 0x000000361165eed3 in std::terminate() () from /lib64/libstdc++.so.6 #9 0x000000361165f0fe in __cxa_throw () from /lib64/libstdc++.so.6 #10 0x00000000008d510d in ceph::__ceph_assert_fail (assertion=0x9edbf1 "0 == \"hit suicide timeout\"", file=<optimized out>, line=78, func=0x9edd40 "bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)") at common/assert.cc:77 #11 0x00000000008462bb in ceph::HeartbeatMap::_check (this=this@entry=0x1a831d0, h=<optimized out>, who=who@entry=0x9edc66 "is_healthy", now=now@entry=1358454183) at common/HeartbeatMap.cc:78 #12 0x0000000000846a9e in ceph::HeartbeatMap::is_healthy (this=this@entry=0x1a831d0) at common/HeartbeatMap.cc:118 #13 0x0000000000846cc8 in ceph::HeartbeatMap::check_touch_file (this=0x1a831d0) at common/HeartbeatMap.cc:129 #14 0x00000000008e01c5 in CephContextServiceThread::entry (this=0x1aa02d0) at common/ceph_context.cc:68 #15 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #16 0x000000360d6f167d in clone () from /lib64/libc.so.6 -- Jens Kristian Søgaard, Mermaid Consulting ApS, jens@mermaidconsulting.dk, http://www.mermaidconsulting.com/ -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hit suicide timeout after adding new osd 2013-01-18 21:36 ` Jens Kristian Søgaard @ 2013-01-18 21:44 ` Sage Weil 2013-01-19 9:25 ` Jens Kristian Søgaard 0 siblings, 1 reply; 37+ messages in thread From: Sage Weil @ 2013-01-18 21:44 UTC (permalink / raw) To: Jens Kristian Søgaard Cc: Stefan Priebe, Wido den Hollander, ceph-devel@vger.kernel.org On Fri, 18 Jan 2013, Jens Kristian S?gaard wrote: > Hi, > > > Getting closer.. it looks like someone is blocked with map_lock held, or > > leaked the lock. Can you do a 'thread apply all bt' with that same core > > file/process? > > Sure, here's the result: Ah, I think it's the PG removal workqueue being unfriendly. Annoyingly I made a kludge around this for another cluster hitting it and then forgot to get it integrated. Can you give wip-pg-removal a go and see if that resolves the problem? It's building right now.. packages should show up shortly on gitbuilder.ceph.com once it goes green here http://ceph.com/gitbuilder.cgi Thanks! sage ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hit suicide timeout after adding new osd 2013-01-18 21:44 ` Sage Weil @ 2013-01-19 9:25 ` Jens Kristian Søgaard 2013-01-19 16:44 ` Sage Weil 0 siblings, 1 reply; 37+ messages in thread From: Jens Kristian Søgaard @ 2013-01-19 9:25 UTC (permalink / raw) To: Sage Weil; +Cc: Stefan Priebe, Wido den Hollander, ceph-devel@vger.kernel.org Hi Sage, Thanks for the speedy reply! > to get it integrated. Can you give wip-pg-removal a go and see if > that resolves the problem? It's building right now.. packages should > show up shortly on gitbuilder.ceph.com once it goes green here > http://ceph.com/gitbuilder.cgi It is only listed for OpenSUSE 12 where it is in yellow state. I'm running Fedora 17. I would really much like to test the fix soon, as multiple osds have now crashed with the same error. I now have unfound pgs and the cluster is no longer usable. -- Jens Kristian Søgaard, Mermaid Consulting ApS, jens@mermaidconsulting.dk, http://www.mermaidconsulting.com/ -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hit suicide timeout after adding new osd 2013-01-19 9:25 ` Jens Kristian Søgaard @ 2013-01-19 16:44 ` Sage Weil 2013-01-19 17:56 ` Jens Kristian Søgaard 0 siblings, 1 reply; 37+ messages in thread From: Sage Weil @ 2013-01-19 16:44 UTC (permalink / raw) To: Jens Kristian Søgaard Cc: Stefan Priebe, Wido den Hollander, ceph-devel@vger.kernel.org On Sat, 19 Jan 2013, Jens Kristian S?gaard wrote: > Hi Sage, > > Thanks for the speedy reply! > > > to get it integrated. Can you give wip-pg-removal a go and see if > > that resolves the problem? It's building right now.. packages should > > show up shortly on gitbuilder.ceph.com once it goes green here > > http://ceph.com/gitbuilder.cgi > > It is only listed for OpenSUSE 12 where it is in yellow state. Whoops, I forgot about the time zones... the fix is merged into master and a I dropped the wip branch. I just repushed the patches (on top of bobtail) for you. http://gitbuilder.ceph.com/ceph-rpm-fc17-x86_64-basic/ref/wip-pg-removal/ Thanks! sage > > I'm running Fedora 17. > > I would really much like to test the fix soon, as multiple osds have now > crashed with the same error. I now have unfound pgs and the cluster is > no longer usable. > > -- > Jens Kristian S?gaard, Mermaid Consulting ApS, > jens@mermaidconsulting.dk, > http://www.mermaidconsulting.com/ > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > > ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hit suicide timeout after adding new osd 2013-01-19 16:44 ` Sage Weil @ 2013-01-19 17:56 ` Jens Kristian Søgaard 2013-01-19 18:19 ` Sage Weil 0 siblings, 1 reply; 37+ messages in thread From: Jens Kristian Søgaard @ 2013-01-19 17:56 UTC (permalink / raw) To: Sage Weil; +Cc: Stefan Priebe, Wido den Hollander, ceph-devel@vger.kernel.org Hi Sage, > a I dropped the wip branch. I just repushed the patches (on top of > bobtail) for you. Thanks for the new build! I have finished testing it on one osd - sadly the osd crashed again, but now with a new stack trace. 0> 2013-01-19 18:48:44.636938 7fb5597f2700 -1 ./osd/OSDMap.h: In function 'const epoch_t& OSDMap::get_up_thru(int) const' thread 7fb5597f2700 time 2013-01-19 18:48:44.499494 ./osd/OSDMap.h: 367: FAILED assert(exists(osd)) ceph version 0.56.1-25-g25a6b1b (25a6b1b325db2a2b45963f83623c447ec577c5ef) 1: /usr/bin/ceph-osd() [0x60db42] 2: /usr/bin/ceph-osd() [0x6e3b35] 3: (pg_interval_t::check_new_interval(std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, unsigned int, unsigned int, std::tr1::shared_ptr<OSDMap const>, std::tr1::shared_ptr<OSDMap const>, long, pg_t, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > >*, std::ostream*)+0x250) [0x935590] 4: (PG::start_peering_interval(std::tr1::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&)+0x353) [0x7563c3] 5: (PG::RecoveryState::Reset::react(PG::AdvMap const&)+0x21e) [0x75887e] 6: (boost::statechart::detail::reaction_result boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::local_react_impl_non_empty::local_react_impl<boost::mpl::list5<boost::statechart::custom_reaction<PG::AdvMap>, boost::statechart::custom_reaction<PG::ActMap>, boost::statechart::custom_reaction<PG::NullEvt>, boost::statechart::custom_reaction<PG::FlushedEvt>, boost::statechart::transition<boost::statechart::event_base, PG::RecoveryState::Crashed, boost::statechart::detail::no_context<boost::statechart::event_base>, &boost::statechart::detail::no_context<boost::statechart::event_base>::no_function> >, boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0> >(boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>&, boost::statechart::event_base const&, void const*)+0x86) [0x78abb6] 7: (boost::statechart::detail::reaction_result boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::local_react_impl_non_empty::local_react_impl<boost::mpl::list<boost::statechart::custom_reaction<PG::QueryState>, boost::statechart::custom_reaction<PG::AdvMap>, boost::statechart::custom_reaction<PG::ActMap>, boost::statechart::custom_reaction<PG::NullEvt>, boost::statechart::custom_reaction<PG::FlushedEvt>, boost::statechart::transition<boost::statechart::event_base, PG::RecoveryState::Crashed, boost::statechart::detail::no_context<boost::statechart::event_base>, &boost::statechart::detail::no_context<boost::statechart::event_base>::no_function>, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0> >(boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>&, boost::statechart::event_base const&, void const*)+0x53) [0x78ac33] 8: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x76f58b] 9: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x19) [0x76f619] 10: (PG::RecoveryState::handle_event(boost::statechart::event_base const&, PG::RecoveryCtx*)+0x4d) [0x76f6cd] 11: (PG::handle_advance_map(std::tr1::shared_ptr<OSDMap const>, std::tr1::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, std::vector<int, std::allocator<int> >&, PG::RecoveryCtx*)+0x196) [0x72bf46] 12: (OSD::advance_pg(unsigned int, PG*, PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >*)+0x48b) [0x6cf14b] 13: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&)+0x2a6) [0x6cf7f6] 14: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&)+0x17) [0x70a3f7] 15: (ThreadPool::worker(ThreadPool::WorkThread*)+0x95c) [0x8ccccc] 16: (ThreadPool::WorkThread::entry()+0x10) [0x8cdc40] 17: /lib64/libpthread.so.0() [0x360de07d14] 18: (clone()+0x6d) [0x360d6f167d] -- Jens Kristian Søgaard, Mermaid Consulting ApS, jens@mermaidconsulting.dk, http://www.mermaidconsulting.com/ -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hit suicide timeout after adding new osd 2013-01-19 17:56 ` Jens Kristian Søgaard @ 2013-01-19 18:19 ` Sage Weil 2013-01-19 18:40 ` Jens Kristian Søgaard 0 siblings, 1 reply; 37+ messages in thread From: Sage Weil @ 2013-01-19 18:19 UTC (permalink / raw) To: Jens Kristian Søgaard Cc: Stefan Priebe, Wido den Hollander, ceph-devel@vger.kernel.org On Sat, 19 Jan 2013, Jens Kristian S?gaard wrote: > Hi Sage, > > > a I dropped the wip branch. I just repushed the patches (on top of bobtail) > > for you. > > Thanks for the new build! > > I have finished testing it on one osd - sadly the osd crashed again, but now > with a new stack trace. > > 0> 2013-01-19 18:48:44.636938 7fb5597f2700 -1 ./osd/OSDMap.h: In function > 'const epoch_t& OSDMap::get_up_thru(int) const' thread 7fb5597f2700 time > 2013-01-19 18:48:44.499494 > ./osd/OSDMap.h: 367: FAILED assert(exists(osd)) Do you have a full log for this? sage > > ceph version 0.56.1-25-g25a6b1b (25a6b1b325db2a2b45963f83623c447ec577c5ef) > 1: /usr/bin/ceph-osd() [0x60db42] > 2: /usr/bin/ceph-osd() [0x6e3b35] > 3: (pg_interval_t::check_new_interval(std::vector<int, std::allocator<int> > > const&, std::vector<int, std::allocator<int> > const&, std::vector<int, > std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, > unsigned int, unsigned int, std::tr1::shared_ptr<OSDMap const>, > std::tr1::shared_ptr<OSDMap const>, long, pg_t, std::map<unsigned int, > pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int > const, pg_interval_t> > >*, std::ostream*)+0x250) [0x935590] > 4: (PG::start_peering_interval(std::tr1::shared_ptr<OSDMap const>, > std::vector<int, std::allocator<int> > const&, std::vector<int, > std::allocator<int> > const&)+0x353) [0x7563c3] > 5: (PG::RecoveryState::Reset::react(PG::AdvMap const&)+0x21e) [0x75887e] > 6: (boost::statechart::detail::reaction_result > boost::statechart::simple_state<PG::RecoveryState::Reset, > PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, > mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, > mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, > mpl_::na, mpl_::na, mpl_::na, mpl_::na>, > (boost::statechart::history_mode)0>::local_react_impl_non_empty::local_react_impl<boost::mpl::list5<boost::statechart::custom_reaction<PG::AdvMap>, > boost::statechart::custom_reaction<PG::ActMap>, > boost::statechart::custom_reaction<PG::NullEvt>, > boost::statechart::custom_reaction<PG::FlushedEvt>, > boost::statechart::transition<boost::statechart::event_base, > PG::RecoveryState::Crashed, > boost::statechart::detail::no_context<boost::statechart::event_base>, > &boost::statechart::detail::no_context<boost::statechart::event_base>::no_function> > >, boost::statechart::simple_state<PG::RecoveryState::Reset, > PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, > mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, > mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, > mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0> > >(boost::statechart::simple_state<PG::RecoveryState::Reset, > PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, > mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, > mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, > mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>&, > boost::statechart::event_base const&, void const*)+0x86) [0x78abb6] > 7: (boost::statechart::detail::reaction_result > boost::statechart::simple_state<PG::RecoveryState::Reset, > PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, > mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, > mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, > mpl_::na, mpl_::na, mpl_::na, mpl_::na>, > (boost::statechart::history_mode)0>::local_react_impl_non_empty::local_react_impl<boost::mpl::list<boost::statechart::custom_reaction<PG::QueryState>, > boost::statechart::custom_reaction<PG::AdvMap>, > boost::statechart::custom_reaction<PG::ActMap>, > boost::statechart::custom_reaction<PG::NullEvt>, > boost::statechart::custom_reaction<PG::FlushedEvt>, > boost::statechart::transition<boost::statechart::event_base, > PG::RecoveryState::Crashed, > boost::statechart::detail::no_context<boost::statechart::event_base>, > &boost::statechart::detail::no_context<boost::statechart::event_base>::no_function>, > mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, > mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, > boost::statechart::simple_state<PG::RecoveryState::Reset, > PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, > mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, > mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, > mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0> > >(boost::statechart::simple_state<PG::RecoveryState::Reset, > PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, > mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, > mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, > mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>&, > boost::statechart::event_base const&, void const*)+0x53) [0x78ac33] > 8: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, > PG::RecoveryState::Initial, std::allocator<void>, > boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base > const&)+0x5b) [0x76f58b] > 9: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, > PG::RecoveryState::Initial, std::allocator<void>, > boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base > const&)+0x19) [0x76f619] > 10: (PG::RecoveryState::handle_event(boost::statechart::event_base const&, > PG::RecoveryCtx*)+0x4d) [0x76f6cd] > 11: (PG::handle_advance_map(std::tr1::shared_ptr<OSDMap const>, > std::tr1::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, > std::vector<int, std::allocator<int> >&, PG::RecoveryCtx*)+0x196) [0x72bf46] > 12: (OSD::advance_pg(unsigned int, PG*, PG::RecoveryCtx*, > std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, > std::allocator<boost::intrusive_ptr<PG> > >*)+0x48b) [0x6cf14b] > 13: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > > const&)+0x2a6) [0x6cf7f6] > 14: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > > const&)+0x17) [0x70a3f7] > 15: (ThreadPool::worker(ThreadPool::WorkThread*)+0x95c) [0x8ccccc] > 16: (ThreadPool::WorkThread::entry()+0x10) [0x8cdc40] > 17: /lib64/libpthread.so.0() [0x360de07d14] > 18: (clone()+0x6d) [0x360d6f167d] > > > -- > Jens Kristian S?gaard, Mermaid Consulting ApS, > jens@mermaidconsulting.dk, > http://www.mermaidconsulting.com/ > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > > ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hit suicide timeout after adding new osd 2013-01-19 18:19 ` Sage Weil @ 2013-01-19 18:40 ` Jens Kristian Søgaard 2013-01-19 20:08 ` Sage Weil 2013-01-21 0:14 ` Sage Weil 0 siblings, 2 replies; 37+ messages in thread From: Jens Kristian Søgaard @ 2013-01-19 18:40 UTC (permalink / raw) To: Sage Weil; +Cc: Stefan Priebe, Wido den Hollander, ceph-devel@vger.kernel.org Hi Sage, > Do you have a full log for this? I upped the log level and started the osd again. It ran for 23 seconds and then suddenly crashed out of the blue. The last log lines were: 2013-01-19 19:31:39.975475 7f50de7fc700 10 osd.2 pg_epoch: 416 pg[0.fc( v 164'38593 (164'37592,164'38593] local-les=247 n=2622 ec=1 les/c 247/156 379/379/379) [1,3] r=-1 lpr=379 pi=152-378/14 lcod 0'0 inactive NOTIFY] state<Reset>: Reset advmap 2013-01-19 19:31:39.975483 7f50de7fc700 10 osd.2 pg_epoch: 416 pg[0.fc( v 164'38593 (164'37592,164'38593] local-les=247 n=2622 ec=1 les/c 247/156 379/379/379) [1,3] r=-1 lpr=379 pi=152-378/14 lcod 0'0 inactive NOTIFY] _calc_past_interval_range: already have past intervals back to 156 2013-01-19 19:31:39.975495 7f50de7fc700 10 osd.2 pg_epoch: 416 pg[0.fc( v 164'38593 (164'37592,164'38593] local-les=247 n=2622 ec=1 les/c 247/156 379/379/379) [1,3] r=-1 lpr=379 pi=152-378/14 lcod 0'0 inactive NOTIFY] handle_advance_map [1,3]/[1,3] 2013-01-19 19:31:39.975505 7f50de7fc700 10 osd.2 pg_epoch: 417 pg[0.fc( v 164'38593 (164'37592,164'38593] local-les=247 n=2622 ec=1 les/c 247/156 379/379/379) [1,3] r=-1 lpr=379 pi=152-378/14 lcod 0'0 inactive NOTIFY] state<Reset>: Reset advmap 2013-01-19 19:31:39.975513 7f50de7fc700 10 The stack trace from the core file shows: Program terminated with signal 6, Aborted. #0 0x000000360de0eebb in raise () from /lib64/libpthread.so.0 Missing separate debuginfos, use: debuginfo-install boost-thread-1.48.0-13.fc17.x86_64 glibc-2.15-57.fc17.x86_64 libaio-0.3.109-5.fc17.x86_64 libgcc-4.7.2-2.fc17.x86_64 libstdc++-4.7.2-2.fc17.x86_64 libuuid-2.21.2-2.fc17.x86_64 nspr-4.9.2-1.fc17.x86_64 nss-3.13.5-1.fc17.x86_64 nss-softokn-3.13.5-1.fc17.x86_64 nss-softokn-freebl-3.13.5-1.fc17.x86_64 nss-util-3.13.5-1.fc17.x86_64 sqlite-3.7.11-3.fc17.x86_64 (gdb) bt #0 0x000000360de0eebb in raise () from /lib64/libpthread.so.0 #1 0x000000000082f7a6 in reraise_fatal (signum=6) at global/signal_handler.cc:58 #2 handle_fatal_signal (signum=6) at global/signal_handler.cc:104 #3 <signal handler called> #4 0x000000360d635925 in raise () from /lib64/libc.so.6 #5 0x000000360d6370d8 in abort () from /lib64/libc.so.6 #6 0x0000003611660dad in __gnu_cxx::__verbose_terminate_handler() () from /lib64/libstdc++.so.6 #7 0x000000361165eea6 in ?? () from /lib64/libstdc++.so.6 #8 0x000000361165eed3 in std::terminate() () from /lib64/libstdc++.so.6 #9 0x000000361165f0fe in __cxa_throw () from /lib64/libstdc++.so.6 #10 0x00000000008d5edd in ceph::__ceph_assert_fail (assertion=0x99b1b8 "exists(osd)", file=<optimized out>, line=367, func=0x99fa20 "const epoch_t& OSDMap::get_up_thru(int) const") at common/assert.cc:77 #11 0x000000000060db42 in OSDMap::get_up_thru (osd=<optimized out>, this=<optimized out>) at osd/OSDMap.h:367 #12 0x00000000006e3b35 in OSDMap::get_up_thru (this=<optimized out>, osd=<optimized out>) at osd/OSDMap.h:369 #13 0x0000000000935590 in pg_interval_t::check_new_interval (old_acting=..., new_acting=..., old_up=..., new_up=..., same_interval_since=553, last_epoch_clean=425, osdmap=std::tr1::shared_ptr (count 83, weak 1) 0x2d59530, lastmap=std::tr1::shared_ptr (count 59, weak 1) 0x2e85650, pool_id=0, pgid=..., past_intervals=0xc62ef78, out=0x0) at osd/osd_types.cc:1537 #14 0x00000000007563c3 in PG::start_peering_interval (this=this@entry=0xc62e880, lastmap=std::tr1::shared_ptr (count 59, weak 1) 0x2e85650, newup=std::vector of length 2, capacity 2 = {...}, newacting=std::vector of length 3, capacity 3 = {...}) at osd/PG.cc:4624 #15 0x000000000075887e in PG::RecoveryState::Reset::react (this=this@entry=0x9581270, advmap=...) at osd/PG.cc:5241 #16 0x000000000078abb6 in react<PG::RecoveryState::Reset, boost::statechart::event_base, void const*> (evt=..., stt=..., eventType=<optimized out>) at /usr/include/boost/statechart/custom_reaction.hpp:42 #17 boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::local_react_impl_non_empty::local_react_impl<boost::mpl::list5<boost::statechart::custom_reaction<PG::AdvMap>, boost::statechart::custom_reaction<PG::ActMap>, boost::statechart::custom_reaction<PG::NullEvt>, boost::statechart::custom_reaction<PG::FlushedEvt>, boost::statechart::transition<boost::statechart::event_base, PG::RecoveryState::Crashed, boost::statechart::detail::no_context<boost::statechart::event_base>, &boost::statechart::detail::no_context<boost::statechart::event_base>::no_function> >, boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0> > (stt=..., evt=..., eventType=eventType@entry=0xcdc260) at /usr/include/boost/statechart/simple_state.hpp:816 #18 0x000000000078ac33 in local_react<boost::mpl::list5<boost::statechart::custom_reaction<PG::AdvMap>, boost::statechart::custom_reaction<PG::ActMap>, boost::statechart::custom_reaction<PG::NullEvt>, boost::statechart::custom_reaction<PG::FlushedEvt>, boost::statechart::transition<boost::statechart::event_base, PG::RecoveryState::Crashed> > > (eventType=0xcdc260, evt=..., this=0x9581270) at /usr/include/boost/statechart/simple_state.hpp:851 #19 boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::local_react_impl_non_empty::local_react_impl<boost::mpl::list<boost::statechart::custom_reaction<PG::QueryState>, boost::statechart::custom_reaction<PG::AdvMap>, boost::statechart::custom_reaction<PG::ActMap>, boost::statechart::custom_reaction<PG::NullEvt>, boost::statechart::custom_reaction<PG::FlushedEvt>, boost::statechart::transition<boost::statechart::event_base, PG::RecoveryState::Crashed, boost::statechart::detail::no_context<boost::statechart::event_base>, &boost::statechart::detail::no_context<boost::statechart::event_base>::no_function>, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0> > (stt=..., evt=..., eventType=0xcdc260) at /usr/include/boost/statechart/simple_state.hpp:820 #20 0x000000000076f58b in operator() (this=<synthetic pointer>) at /usr/include/boost/statechart/state_machine.hpp:87 #21 operator()<boost::statechart::detail::send_function<boost::statechart::detail::state_base<std::allocator<void>, boost::statechart::detail::rtti_policy>, boost::statechart::event_base, const void*>, boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial>::exception_event_handler> (action=..., this=<optimized out>) at /usr/include/boost/statechart/null_exception_translator.hpp:33 #22 boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event (this=0xc62fb50, evt=...) at /usr/include/boost/statechart/state_machine.hpp:885 #23 0x000000000076f619 in boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event (this=this@entry=0xc62fb50, evt=...) at /usr/include/boost/statechart/state_machine.hpp:275 #24 0x000000000076f6cd in PG::RecoveryState::handle_event (this=0xc62fb50, evt=..., rctx=0x7f50ddffaa70) at osd/PG.h:1682 #25 0x000000000072bf46 in PG::handle_advance_map (this=0xc62e880, osdmap=std::tr1::shared_ptr (count 83, weak 1) 0x2d59530, lastmap=std::tr1::shared_ptr (count 59, weak 1) 0x2e85650, newup=std::vector of length 2, capacity 2 = {...}, newacting=std::vector of length 3, capacity 4 = {...}, rctx=0x7f50ddffaa70) at osd/PG.cc:5050 #26 0x00000000006cf14b in OSD::advance_pg (this=this@entry=0x2a27640, osd_epoch=760, pg=pg@entry=0xc62e880, rctx=rctx@entry=0x7f50ddffaa70, new_pgs=new_pgs@entry=0x7f50ddffaa40) at osd/OSD.cc:4042 Python Exception <type 'exceptions.IndexError'> list index out of range: #27 0x00000000006cf7f6 in OSD::process_peering_events (this=0x2a27640, pgs=std::list) at osd/OSD.cc:6170 Python Exception <type 'exceptions.IndexError'> list index out of range: #28 0x000000000070a3f7 in OSD::PeeringWQ::_process (this=<optimized out>, pgs=std::list) at osd/OSD.h:718 #29 0x00000000008ccccc in ThreadPool::worker (this=0x2a27a88, wt=0x5cd2cd0) at common/WorkQueue.cc:113 #30 0x00000000008cdc40 in ThreadPool::WorkThread::entry (this=<optimized out>) at common/WorkQueue.h:288 #31 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 #32 0x000000360d6f167d in clone () from /lib64/libc.so.6 Do you want a full copy of the log file? It generated 128 MB of logs in those seconds. -- Jens Kristian Søgaard, Mermaid Consulting ApS, jens@mermaidconsulting.dk, http://www.mermaidconsulting.com/ -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hit suicide timeout after adding new osd 2013-01-19 18:40 ` Jens Kristian Søgaard @ 2013-01-19 20:08 ` Sage Weil 2013-01-19 20:29 ` Jens Kristian Søgaard 2013-01-21 0:14 ` Sage Weil 1 sibling, 1 reply; 37+ messages in thread From: Sage Weil @ 2013-01-19 20:08 UTC (permalink / raw) To: Jens Kristian Søgaard Cc: Stefan Priebe, Wido den Hollander, ceph-devel@vger.kernel.org On Sat, 19 Jan 2013, Jens Kristian S?gaard wrote: > Hi Sage, > > > Do you have a full log for this? > > I upped the log level and started the osd again. > > It ran for 23 seconds and then suddenly crashed out of the blue. > > [...] > > Do you want a full copy of the log file? > > It generated 128 MB of logs in those seconds. Yes, please! That looks like the same spot as last time. sage ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hit suicide timeout after adding new osd 2013-01-19 20:08 ` Sage Weil @ 2013-01-19 20:29 ` Jens Kristian Søgaard 2013-01-19 22:04 ` Sage Weil 0 siblings, 1 reply; 37+ messages in thread From: Jens Kristian Søgaard @ 2013-01-19 20:29 UTC (permalink / raw) To: Sage Weil; +Cc: Stefan Priebe, Wido den Hollander, ceph-devel@vger.kernel.org Hi Sage, >> It generated 128 MB of logs in those seconds. > Yes, please! That looks like the same spot as last time. Here's a download link for the logs: http://bit.ly/11Hn7BN -- Jens Kristian Søgaard, Mermaid Consulting ApS, jens@mermaidconsulting.dk, http://www.mermaidconsulting.com/ -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hit suicide timeout after adding new osd 2013-01-19 20:29 ` Jens Kristian Søgaard @ 2013-01-19 22:04 ` Sage Weil 0 siblings, 0 replies; 37+ messages in thread From: Sage Weil @ 2013-01-19 22:04 UTC (permalink / raw) To: Jens Kristian Søgaard Cc: Stefan Priebe, Wido den Hollander, ceph-devel@vger.kernel.org On Sat, 19 Jan 2013, Jens Kristian S?gaard wrote: > Hi Sage, > > >> It generated 128 MB of logs in those seconds. > > > Yes, please! That looks like the same spot as last time. > > Here's a download link for the logs: > > http://bit.ly/11Hn7BN Can you 'ceph osd getmap 555 -o 555' and 'ceph osd getmap 556 -o 556', and attach those to the bug below? http://tracker.newdream.net/issues/3879 Thanks! sage > > -- > Jens Kristian S?gaard, Mermaid Consulting ApS, > jens@mermaidconsulting.dk, > http://www.mermaidconsulting.com/ > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > > ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hit suicide timeout after adding new osd 2013-01-19 18:40 ` Jens Kristian Søgaard 2013-01-19 20:08 ` Sage Weil @ 2013-01-21 0:14 ` Sage Weil 2013-01-21 6:59 ` Jens Kristian Søgaard 1 sibling, 1 reply; 37+ messages in thread From: Sage Weil @ 2013-01-21 0:14 UTC (permalink / raw) To: Jens Kristian Søgaard Cc: Stefan Priebe, Wido den Hollander, ceph-devel@vger.kernel.org The secondary crash is fixed by 17160843d0c523359d8fa934418ff2c1f7bffb25. Also backported to bobtail. If you have the core from the second instance of the heartbeat timeout, from the wip-pg-removal branch, please post the 'thread apply all bt' output. Thanks! sage On Sat, 19 Jan 2013, Jens Kristian S?gaard wrote: > Hi Sage, > > > Do you have a full log for this? > > I upped the log level and started the osd again. > > It ran for 23 seconds and then suddenly crashed out of the blue. > > The last log lines were: > > 2013-01-19 19:31:39.975475 7f50de7fc700 10 osd.2 pg_epoch: 416 pg[0.fc( v > 164'38593 (164'37592,164'38593] local-les=247 n=2622 ec=1 les/c 247/156 > 379/379/379) [1,3] r=-1 lpr=379 pi=152-378/14 lcod 0'0 inactive NOTIFY] > state<Reset>: Reset advmap > 2013-01-19 19:31:39.975483 7f50de7fc700 10 osd.2 pg_epoch: 416 pg[0.fc( v > 164'38593 (164'37592,164'38593] local-les=247 n=2622 ec=1 les/c 247/156 > 379/379/379) [1,3] r=-1 lpr=379 pi=152-378/14 lcod 0'0 inactive NOTIFY] > _calc_past_interval_range: already have past intervals back to 156 > 2013-01-19 19:31:39.975495 7f50de7fc700 10 osd.2 pg_epoch: 416 pg[0.fc( v > 164'38593 (164'37592,164'38593] local-les=247 n=2622 ec=1 les/c 247/156 > 379/379/379) [1,3] r=-1 lpr=379 pi=152-378/14 lcod 0'0 inactive NOTIFY] > handle_advance_map [1,3]/[1,3] > 2013-01-19 19:31:39.975505 7f50de7fc700 10 osd.2 pg_epoch: 417 pg[0.fc( v > 164'38593 (164'37592,164'38593] local-les=247 n=2622 ec=1 les/c 247/156 > 379/379/379) [1,3] r=-1 lpr=379 pi=152-378/14 lcod 0'0 inactive NOTIFY] > state<Reset>: Reset advmap > 2013-01-19 19:31:39.975513 7f50de7fc700 10 > > > The stack trace from the core file shows: > > Program terminated with signal 6, Aborted. > #0 0x000000360de0eebb in raise () from /lib64/libpthread.so.0 > Missing separate debuginfos, use: debuginfo-install > boost-thread-1.48.0-13.fc17.x86_64 glibc-2.15-57.fc17.x86_64 > libaio-0.3.109-5.fc17.x86_64 libgcc-4.7.2-2.fc17.x86_64 > libstdc++-4.7.2-2.fc17.x86_64 libuuid-2.21.2-2.fc17.x86_64 > nspr-4.9.2-1.fc17.x86_64 nss-3.13.5-1.fc17.x86_64 > nss-softokn-3.13.5-1.fc17.x86_64 nss-softokn-freebl-3.13.5-1.fc17.x86_64 > nss-util-3.13.5-1.fc17.x86_64 sqlite-3.7.11-3.fc17.x86_64 > (gdb) bt > #0 0x000000360de0eebb in raise () from /lib64/libpthread.so.0 > #1 0x000000000082f7a6 in reraise_fatal (signum=6) at > global/signal_handler.cc:58 > #2 handle_fatal_signal (signum=6) at global/signal_handler.cc:104 > #3 <signal handler called> > #4 0x000000360d635925 in raise () from /lib64/libc.so.6 > #5 0x000000360d6370d8 in abort () from /lib64/libc.so.6 > #6 0x0000003611660dad in __gnu_cxx::__verbose_terminate_handler() () from > /lib64/libstdc++.so.6 > #7 0x000000361165eea6 in ?? () from /lib64/libstdc++.so.6 > #8 0x000000361165eed3 in std::terminate() () from /lib64/libstdc++.so.6 > #9 0x000000361165f0fe in __cxa_throw () from /lib64/libstdc++.so.6 > #10 0x00000000008d5edd in ceph::__ceph_assert_fail (assertion=0x99b1b8 > "exists(osd)", file=<optimized out>, line=367, func=0x99fa20 "const epoch_t& > OSDMap::get_up_thru(int) const") at common/assert.cc:77 > #11 0x000000000060db42 in OSDMap::get_up_thru (osd=<optimized out>, > this=<optimized out>) at osd/OSDMap.h:367 > #12 0x00000000006e3b35 in OSDMap::get_up_thru (this=<optimized out>, > osd=<optimized out>) at osd/OSDMap.h:369 > #13 0x0000000000935590 in pg_interval_t::check_new_interval (old_acting=..., > new_acting=..., old_up=..., new_up=..., same_interval_since=553, > last_epoch_clean=425, osdmap=std::tr1::shared_ptr (count 83, weak 1) > 0x2d59530, > lastmap=std::tr1::shared_ptr (count 59, weak 1) 0x2e85650, pool_id=0, > pgid=..., past_intervals=0xc62ef78, out=0x0) at osd/osd_types.cc:1537 > #14 0x00000000007563c3 in PG::start_peering_interval > (this=this@entry=0xc62e880, lastmap=std::tr1::shared_ptr (count 59, weak 1) > 0x2e85650, newup=std::vector of length 2, capacity 2 = {...}, > newacting=std::vector of length 3, capacity 3 = {...}) at osd/PG.cc:4624 > #15 0x000000000075887e in PG::RecoveryState::Reset::react > (this=this@entry=0x9581270, advmap=...) at osd/PG.cc:5241 > #16 0x000000000078abb6 in react<PG::RecoveryState::Reset, > boost::statechart::event_base, void const*> (evt=..., stt=..., > eventType=<optimized out>) at > /usr/include/boost/statechart/custom_reaction.hpp:42 > #17 boost::statechart::simple_state<PG::RecoveryState::Reset, > PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, > mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, > mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, > mpl_::na, mpl_::na, mpl_::na, mpl_::na>, > (boost::statechart::history_mode)0>::local_react_impl_non_empty::local_react_impl<boost::mpl::list5<boost::statechart::custom_reaction<PG::AdvMap>, > boost::statechart::custom_reaction<PG::ActMap>, > boost::statechart::custom_reaction<PG::NullEvt>, > boost::statechart::custom_reaction<PG::FlushedEvt>, > boost::statechart::transition<boost::statechart::event_base, > PG::RecoveryState::Crashed, > boost::statechart::detail::no_context<boost::statechart::event_base>, > &boost::statechart::detail::no_context<boost::statechart::event_base>::no_function> > >, boost::statechart::simple_state<PG::RecoveryState::Reset, > PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, > mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, > mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, > mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0> > > (stt=..., evt=..., eventType=eventType@entry=0xcdc260) at > /usr/include/boost/statechart/simple_state.hpp:816 > #18 0x000000000078ac33 in > local_react<boost::mpl::list5<boost::statechart::custom_reaction<PG::AdvMap>, > boost::statechart::custom_reaction<PG::ActMap>, > boost::statechart::custom_reaction<PG::NullEvt>, > boost::statechart::custom_reaction<PG::FlushedEvt>, > boost::statechart::transition<boost::statechart::event_base, > PG::RecoveryState::Crashed> > > (eventType=0xcdc260, evt=..., this=0x9581270) > at /usr/include/boost/statechart/simple_state.hpp:851 > #19 boost::statechart::simple_state<PG::RecoveryState::Reset, > PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, > mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, > mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, > mpl_::na, mpl_::na, mpl_::na, mpl_::na>, > (boost::statechart::history_mode)0>::local_react_impl_non_empty::local_react_impl<boost::mpl::list<boost::statechart::custom_reaction<PG::QueryState>, > boost::statechart::custom_reaction<PG::AdvMap>, > boost::statechart::custom_reaction<PG::ActMap>, > boost::statechart::custom_reaction<PG::NullEvt>, > boost::statechart::custom_reaction<PG::FlushedEvt>, > boost::statechart::transition<boost::statechart::event_base, > PG::RecoveryState::Crashed, > boost::statechart::detail::no_context<boost::statechart::event_base>, > &boost::statechart::detail::no_context<boost::statechart::event_base>::no_function>, > mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, > mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, > boost::statechart::simple_state<PG::RecoveryState::Reset, > PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, > mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, > mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, > mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0> > > (stt=..., evt=..., eventType=0xcdc260) at > /usr/include/boost/statechart/simple_state.hpp:820 > #20 0x000000000076f58b in operator() (this=<synthetic pointer>) at > /usr/include/boost/statechart/state_machine.hpp:87 > #21 > operator()<boost::statechart::detail::send_function<boost::statechart::detail::state_base<std::allocator<void>, > boost::statechart::detail::rtti_policy>, boost::statechart::event_base, const > void*>, boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, > PG::RecoveryState::Initial>::exception_event_handler> (action=..., > this=<optimized out>) at > /usr/include/boost/statechart/null_exception_translator.hpp:33 > #22 boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, > PG::RecoveryState::Initial, std::allocator<void>, > boost::statechart::null_exception_translator>::send_event (this=0xc62fb50, > evt=...) at /usr/include/boost/statechart/state_machine.hpp:885 > #23 0x000000000076f619 in > boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, > PG::RecoveryState::Initial, std::allocator<void>, > boost::statechart::null_exception_translator>::process_event > (this=this@entry=0xc62fb50, evt=...) > at /usr/include/boost/statechart/state_machine.hpp:275 > #24 0x000000000076f6cd in PG::RecoveryState::handle_event (this=0xc62fb50, > evt=..., rctx=0x7f50ddffaa70) at osd/PG.h:1682 > #25 0x000000000072bf46 in PG::handle_advance_map (this=0xc62e880, > osdmap=std::tr1::shared_ptr (count 83, weak 1) 0x2d59530, > lastmap=std::tr1::shared_ptr (count 59, weak 1) 0x2e85650, newup=std::vector > of length 2, capacity 2 = {...}, > newacting=std::vector of length 3, capacity 4 = {...}, > rctx=0x7f50ddffaa70) at osd/PG.cc:5050 > #26 0x00000000006cf14b in OSD::advance_pg (this=this@entry=0x2a27640, > osd_epoch=760, pg=pg@entry=0xc62e880, rctx=rctx@entry=0x7f50ddffaa70, > new_pgs=new_pgs@entry=0x7f50ddffaa40) at osd/OSD.cc:4042 > Python Exception <type 'exceptions.IndexError'> list index out of range: > #27 0x00000000006cf7f6 in OSD::process_peering_events (this=0x2a27640, > pgs=std::list) at osd/OSD.cc:6170 > Python Exception <type 'exceptions.IndexError'> list index out of range: > #28 0x000000000070a3f7 in OSD::PeeringWQ::_process (this=<optimized out>, > pgs=std::list) at osd/OSD.h:718 > #29 0x00000000008ccccc in ThreadPool::worker (this=0x2a27a88, wt=0x5cd2cd0) at > common/WorkQueue.cc:113 > #30 0x00000000008cdc40 in ThreadPool::WorkThread::entry (this=<optimized out>) > at common/WorkQueue.h:288 > #31 0x000000360de07d14 in start_thread () from /lib64/libpthread.so.0 > #32 0x000000360d6f167d in clone () from /lib64/libc.so.6 > > > Do you want a full copy of the log file? > > It generated 128 MB of logs in those seconds. > > -- > Jens Kristian S?gaard, Mermaid Consulting ApS, > jens@mermaidconsulting.dk, > http://www.mermaidconsulting.com/ > > ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hit suicide timeout after adding new osd 2013-01-21 0:14 ` Sage Weil @ 2013-01-21 6:59 ` Jens Kristian Søgaard 2013-01-21 7:11 ` Sage Weil 0 siblings, 1 reply; 37+ messages in thread From: Jens Kristian Søgaard @ 2013-01-21 6:59 UTC (permalink / raw) To: Sage Weil; +Cc: Stefan Priebe, Wido den Hollander, ceph-devel@vger.kernel.org Hi Sage, > If you have the core from the second instance of the heartbeat timeout, > from the wip-pg-removal branch, please post the 'thread apply all bt' > output. Here's the thread stack traces: http://bit.ly/XrGeca And here's the logs (debug=20) from same crash: http://bit.ly/WyltfU -- Jens Kristian Søgaard, Mermaid Consulting ApS, jens@mermaidconsulting.dk, http://www.mermaidconsulting.com/ -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hit suicide timeout after adding new osd 2013-01-21 6:59 ` Jens Kristian Søgaard @ 2013-01-21 7:11 ` Sage Weil 2013-01-23 12:14 ` Jens Kristian Søgaard 0 siblings, 1 reply; 37+ messages in thread From: Sage Weil @ 2013-01-21 7:11 UTC (permalink / raw) To: Jens Kristian Søgaard Cc: Stefan Priebe, Wido den Hollander, ceph-devel@vger.kernel.org On Mon, 21 Jan 2013, Jens Kristian S?gaard wrote: > Hi Sage, > > > If you have the core from the second instance of the heartbeat timeout, from > > the wip-pg-removal branch, please post the 'thread apply all bt' output. > > Here's the thread stack traces: > > http://bit.ly/XrGeca > > And here's the logs (debug=20) from same crash: > > http://bit.ly/WyltfU Followed up on IRC, but for everyone else's benefit: I think the problem now is just that 'osd target transaction size' is too big (default is 300). Recommended 50.. let's see how that goes. Even smaller (20 or 25) would probably be fine. Let me know how it goes! Thanks- sage ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hit suicide timeout after adding new osd 2013-01-21 7:11 ` Sage Weil @ 2013-01-23 12:14 ` Jens Kristian Søgaard 2013-01-23 12:26 ` Wido den Hollander 2013-01-23 20:59 ` Jens Kristian Søgaard 0 siblings, 2 replies; 37+ messages in thread From: Jens Kristian Søgaard @ 2013-01-23 12:14 UTC (permalink / raw) To: Sage Weil; +Cc: Stefan Priebe, Wido den Hollander, ceph-devel@vger.kernel.org Hi Sage, > I think the problem now is just that 'osd target transaction size' is > too big (default is 300). Recommended 50.. let's see how that goes. > Even smaller (20 or 25) would probably be fine. I set it to 50, and that seems to have solved all my problems. After a day or so my cluster got to a HEALTH_OK state again. It has been running for a few days now without any crashes! Thanks for all your help! -- Jens Kristian Søgaard, Mermaid Consulting ApS, jens@mermaidconsulting.dk, http://www.mermaidconsulting.com/ -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hit suicide timeout after adding new osd 2013-01-23 12:14 ` Jens Kristian Søgaard @ 2013-01-23 12:26 ` Wido den Hollander 2013-01-23 12:29 ` Jens Kristian Søgaard 2013-01-23 13:13 ` Sage Weil 2013-01-23 20:59 ` Jens Kristian Søgaard 1 sibling, 2 replies; 37+ messages in thread From: Wido den Hollander @ 2013-01-23 12:26 UTC (permalink / raw) To: Jens Kristian Søgaard Cc: Sage Weil, Stefan Priebe, ceph-devel@vger.kernel.org On 01/23/2013 01:14 PM, Jens Kristian Søgaard wrote: > Hi Sage, > >> I think the problem now is just that 'osd target transaction size' is >> too big (default is 300). Recommended 50.. let's see how that goes. >> Even smaller (20 or 25) would probably be fine. > Going through the code and reading that this solved it for Jens, could this issue be traced back to less powerful CPUs? I've seen this on Atom and Fusion platforms which both don't excel in their computing power. From what I read is that the OSD by default does 300 transactions and then commits them? If the CPU is to slow to handle all the work timeouts can occur because it can't do all the transactions inside the set window? By lowering the number of transactions it sends out a heartbeat more often thus keeping itself alive. Correct? Wido > I set it to 50, and that seems to have solved all my problems. > > After a day or so my cluster got to a HEALTH_OK state again. It has been > running for a few days now without any crashes! > > Thanks for all your help! > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hit suicide timeout after adding new osd 2013-01-23 12:26 ` Wido den Hollander @ 2013-01-23 12:29 ` Jens Kristian Søgaard 2013-01-23 13:13 ` Sage Weil 1 sibling, 0 replies; 37+ messages in thread From: Jens Kristian Søgaard @ 2013-01-23 12:29 UTC (permalink / raw) To: Wido den Hollander; +Cc: Sage Weil, Stefan Priebe, ceph-devel@vger.kernel.org Hi Wido, > Going through the code and reading that this solved it for Jens, could > this issue be traced back to less powerful CPUs? Depends on what you mean by less powerful. All my OSD servers are equipped with Xeon E5606 CPUs. That is a quad-core 2.13 Ghz CPU. They are not used for anything else than Ceph. -- Jens Kristian Søgaard, Mermaid Consulting ApS, jens@mermaidconsulting.dk, http://www.mermaidconsulting.com/ -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hit suicide timeout after adding new osd 2013-01-23 12:26 ` Wido den Hollander 2013-01-23 12:29 ` Jens Kristian Søgaard @ 2013-01-23 13:13 ` Sage Weil 1 sibling, 0 replies; 37+ messages in thread From: Sage Weil @ 2013-01-23 13:13 UTC (permalink / raw) To: Wido den Hollander Cc: Jens Kristian Søgaard, Stefan Priebe, ceph-devel@vger.kernel.org On Wed, 23 Jan 2013, Wido den Hollander wrote: > On 01/23/2013 01:14 PM, Jens Kristian S?gaard wrote: > > Hi Sage, > > > > > I think the problem now is just that 'osd target transaction size' is > > > too big (default is 300). Recommended 50.. let's see how that goes. > > > Even smaller (20 or 25) would probably be fine. > > > > Going through the code and reading that this solved it for Jens, could this > issue be traced back to less powerful CPUs? > > I've seen this on Atom and Fusion platforms which both don't excel in their > computing power. > > From what I read is that the OSD by default does 300 transactions and then > commits them? If the CPU is to slow to handle all the work timeouts can occur > because it can't do all the transactions inside the set window? > > By lowering the number of transactions it sends out a heartbeat more often > thus keeping itself alive. > > Correct? In this case, it controls how many operations we stuff into an atomic transaction when doing something big (like deleting an entire PG). The speed is as much about the storage as the CPU, although I'm sure a small CPU helps slow things down. The thread needs to be able to do those N unlinks (or whatever) within the heartbeat interval or else the OSD will consider the thread stuck and zap itself. I think 300 was just a silly initial value... Te default is now either 30 or 50. sage > > Wido > > > I set it to 50, and that seems to have solved all my problems. > > > > After a day or so my cluster got to a HEALTH_OK state again. It has been > > running for a few days now without any crashes! > > > > Thanks for all your help! > > > > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > > ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hit suicide timeout after adding new osd 2013-01-23 12:14 ` Jens Kristian Søgaard 2013-01-23 12:26 ` Wido den Hollander @ 2013-01-23 20:59 ` Jens Kristian Søgaard 2013-01-23 22:56 ` Andrey Korolyov 2013-01-24 4:28 ` Sage Weil 1 sibling, 2 replies; 37+ messages in thread From: Jens Kristian Søgaard @ 2013-01-23 20:59 UTC (permalink / raw) To: Sage Weil; +Cc: Stefan Priebe, Wido den Hollander, ceph-devel@vger.kernel.org Hi Sage, >> I think the problem now is just that 'osd target transaction size' is > I set it to 50, and that seems to have solved all my problems. > After a day or so my cluster got to a HEALTH_OK state again. It has been > running for a few days now without any crashes! Hmm, one of the OSDs crashed again, sadly. It logs: -2> 2013-01-23 18:01:23.563624 7f67524da700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f673affd700' had timed out after 60 -1> 2013-01-23 18:01:23.563657 7f67524da700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f673affd700' had suicide timed out after 180 0> 2013-01-23 18:01:24.257996 7f67524da700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f67524da700 time 2013-01-23 18:01:23.563677 common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout") With this stack trace: ceph version 0.56.1-26-g3bd8f6b (3bd8f6b7235eb14cab778e3c6dcdc636aff4f539) 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2eb) [0x846ecb] 2: (ceph::HeartbeatMap::is_healthy()+0x8e) [0x8476ae] 3: (ceph::HeartbeatMap::check_touch_file()+0x28) [0x8478d8] 4: (CephContextServiceThread::entry()+0x55) [0x8e0f45] 5: /lib64/libpthread.so.0() [0x3cbc807d14] 6: (clone()+0x6d) [0x3cbc0f167d] I have saved the core file, if there's anything in there you need? Or do you think I just need to set the target transaction size even lower than 50? -- Jens Kristian Søgaard, Mermaid Consulting ApS, jens@mermaidconsulting.dk, http://www.mermaidconsulting.com/ -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hit suicide timeout after adding new osd 2013-01-23 20:59 ` Jens Kristian Søgaard @ 2013-01-23 22:56 ` Andrey Korolyov 2013-01-24 4:39 ` Sage Weil 2013-01-24 4:28 ` Sage Weil 1 sibling, 1 reply; 37+ messages in thread From: Andrey Korolyov @ 2013-01-23 22:56 UTC (permalink / raw) To: Jens Kristian Søgaard Cc: Sage Weil, Stefan Priebe, Wido den Hollander, ceph-devel@vger.kernel.org On Thu, Jan 24, 2013 at 12:59 AM, Jens Kristian Søgaard <jens@mermaidconsulting.dk> wrote: > Hi Sage, > >>> I think the problem now is just that 'osd target transaction size' is >> >> I set it to 50, and that seems to have solved all my problems. >> >> After a day or so my cluster got to a HEALTH_OK state again. It has been >> running for a few days now without any crashes! > > > Hmm, one of the OSDs crashed again, sadly. > > It logs: > > -2> 2013-01-23 18:01:23.563624 7f67524da700 1 heartbeat_map is_healthy > 'FileStore::op_tp thread 0x7f673affd700' had timed out after 60 > -1> 2013-01-23 18:01:23.563657 7f67524da700 1 heartbeat_map is_healthy > 'FileStore::op_tp thread 0x7f673affd700' had suicide timed out after 180 > 0> 2013-01-23 18:01:24.257996 7f67524da700 -1 common/HeartbeatMap.cc: > In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, > const char*, time_t)' thread 7f67524da700 time 2013-01-23 18:01:23.563677 > > common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout") > > > With this stack trace: > > ceph version 0.56.1-26-g3bd8f6b (3bd8f6b7235eb14cab778e3c6dcdc636aff4f539) > 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, > long)+0x2eb) [0x846ecb] > 2: (ceph::HeartbeatMap::is_healthy()+0x8e) [0x8476ae] > 3: (ceph::HeartbeatMap::check_touch_file()+0x28) [0x8478d8] > 4: (CephContextServiceThread::entry()+0x55) [0x8e0f45] > 5: /lib64/libpthread.so.0() [0x3cbc807d14] > 6: (clone()+0x6d) [0x3cbc0f167d] > > > I have saved the core file, if there's anything in there you need? > > Or do you think I just need to set the target transaction size even lower > than 50? > > I was able to catch this too on rejoin to very busy cluster and seems I need to lower this value at least at start time. Also c5fe0965572c074a2a33660719ce3222d18c1464 has increased overall time before restarted or new osd will join a cluster, and for 2M objects/3T of replicated data restart of the cluster was took almost a hour before it actually begins to work. The worst thing is that a single osd, if restarted, will mark as up after couple of minutes, then after almost half of hour(eating 100 percent of one cpu, ) as down and then cluster will start to redistribute data after 300s timeout, osd still doing something. > -- > Jens Kristian Søgaard, Mermaid Consulting ApS, > jens@mermaidconsulting.dk, > http://www.mermaidconsulting.com/ > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hit suicide timeout after adding new osd 2013-01-23 22:56 ` Andrey Korolyov @ 2013-01-24 4:39 ` Sage Weil 2013-01-24 7:44 ` Andrey Korolyov 0 siblings, 1 reply; 37+ messages in thread From: Sage Weil @ 2013-01-24 4:39 UTC (permalink / raw) To: Andrey Korolyov Cc: Jens Kristian Søgaard, Stefan Priebe, Wido den Hollander, ceph-devel@vger.kernel.org On Thu, 24 Jan 2013, Andrey Korolyov wrote: > On Thu, Jan 24, 2013 at 12:59 AM, Jens Kristian S?gaard > <jens@mermaidconsulting.dk> wrote: > > Hi Sage, > > > >>> I think the problem now is just that 'osd target transaction size' is > >> > >> I set it to 50, and that seems to have solved all my problems. > >> > >> After a day or so my cluster got to a HEALTH_OK state again. It has been > >> running for a few days now without any crashes! > > > > > > Hmm, one of the OSDs crashed again, sadly. > > > > It logs: > > > > -2> 2013-01-23 18:01:23.563624 7f67524da700 1 heartbeat_map is_healthy > > 'FileStore::op_tp thread 0x7f673affd700' had timed out after 60 > > -1> 2013-01-23 18:01:23.563657 7f67524da700 1 heartbeat_map is_healthy > > 'FileStore::op_tp thread 0x7f673affd700' had suicide timed out after 180 > > 0> 2013-01-23 18:01:24.257996 7f67524da700 -1 common/HeartbeatMap.cc: > > In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, > > const char*, time_t)' thread 7f67524da700 time 2013-01-23 18:01:23.563677 > > > > common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout") > > > > > > With this stack trace: > > > > ceph version 0.56.1-26-g3bd8f6b (3bd8f6b7235eb14cab778e3c6dcdc636aff4f539) > > 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, > > long)+0x2eb) [0x846ecb] > > 2: (ceph::HeartbeatMap::is_healthy()+0x8e) [0x8476ae] > > 3: (ceph::HeartbeatMap::check_touch_file()+0x28) [0x8478d8] > > 4: (CephContextServiceThread::entry()+0x55) [0x8e0f45] > > 5: /lib64/libpthread.so.0() [0x3cbc807d14] > > 6: (clone()+0x6d) [0x3cbc0f167d] > > > > > > I have saved the core file, if there's anything in there you need? > > > > Or do you think I just need to set the target transaction size even lower > > than 50? > > > > > > I was able to catch this too on rejoin to very busy cluster and seems > I need to lower this value at least at start time. Also > c5fe0965572c074a2a33660719ce3222d18c1464 has increased overall time > before restarted or new osd will join a cluster, and for 2M objects/3T > of replicated data restart of the cluster was took almost a hour > before it actually begins to work. The worst thing is that a single > osd, if restarted, will mark as up after couple of minutes, then after > almost half of hour(eating 100 percent of one cpu, ) as down and then > cluster will start to redistribute data after 300s timeout, osd still > doing something. Okay, something is very wrong. Can you reproduce this with a log? Or even a partial log while it is spinning? You can adjust the log level on a running process with ceph --admin-daemon /var/run/ceph-osd.NN.asok config set debug_osd 20 ceph --admin-daemon /var/run/ceph-osd.NN.asok config set debug_ms 1 We haven't been able to reproduce this, so I'm very much interested in any light you can shine here. Thanks! sage ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hit suicide timeout after adding new osd 2013-01-24 4:39 ` Sage Weil @ 2013-01-24 7:44 ` Andrey Korolyov 2013-01-24 18:01 ` Sage Weil 0 siblings, 1 reply; 37+ messages in thread From: Andrey Korolyov @ 2013-01-24 7:44 UTC (permalink / raw) To: Sage Weil Cc: Jens Kristian Søgaard, Stefan Priebe, Wido den Hollander, ceph-devel@vger.kernel.org On Thu, Jan 24, 2013 at 8:39 AM, Sage Weil <sage@inktank.com> wrote: > On Thu, 24 Jan 2013, Andrey Korolyov wrote: >> On Thu, Jan 24, 2013 at 12:59 AM, Jens Kristian S?gaard >> <jens@mermaidconsulting.dk> wrote: >> > Hi Sage, >> > >> >>> I think the problem now is just that 'osd target transaction size' is >> >> >> >> I set it to 50, and that seems to have solved all my problems. >> >> >> >> After a day or so my cluster got to a HEALTH_OK state again. It has been >> >> running for a few days now without any crashes! >> > >> > >> > Hmm, one of the OSDs crashed again, sadly. >> > >> > It logs: >> > >> > -2> 2013-01-23 18:01:23.563624 7f67524da700 1 heartbeat_map is_healthy >> > 'FileStore::op_tp thread 0x7f673affd700' had timed out after 60 >> > -1> 2013-01-23 18:01:23.563657 7f67524da700 1 heartbeat_map is_healthy >> > 'FileStore::op_tp thread 0x7f673affd700' had suicide timed out after 180 >> > 0> 2013-01-23 18:01:24.257996 7f67524da700 -1 common/HeartbeatMap.cc: >> > In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, >> > const char*, time_t)' thread 7f67524da700 time 2013-01-23 18:01:23.563677 >> > >> > common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout") >> > >> > >> > With this stack trace: >> > >> > ceph version 0.56.1-26-g3bd8f6b (3bd8f6b7235eb14cab778e3c6dcdc636aff4f539) >> > 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, >> > long)+0x2eb) [0x846ecb] >> > 2: (ceph::HeartbeatMap::is_healthy()+0x8e) [0x8476ae] >> > 3: (ceph::HeartbeatMap::check_touch_file()+0x28) [0x8478d8] >> > 4: (CephContextServiceThread::entry()+0x55) [0x8e0f45] >> > 5: /lib64/libpthread.so.0() [0x3cbc807d14] >> > 6: (clone()+0x6d) [0x3cbc0f167d] >> > >> > >> > I have saved the core file, if there's anything in there you need? >> > >> > Or do you think I just need to set the target transaction size even lower >> > than 50? >> > >> > >> >> I was able to catch this too on rejoin to very busy cluster and seems >> I need to lower this value at least at start time. Also >> c5fe0965572c074a2a33660719ce3222d18c1464 has increased overall time >> before restarted or new osd will join a cluster, and for 2M objects/3T >> of replicated data restart of the cluster was took almost a hour >> before it actually begins to work. The worst thing is that a single >> osd, if restarted, will mark as up after couple of minutes, then after >> almost half of hour(eating 100 percent of one cpu, ) as down and then >> cluster will start to redistribute data after 300s timeout, osd still >> doing something. > > Okay, something is very wrong. Can you reproduce this with a log? Or > even a partial log while it is spinning? You can adjust the log level on > a running process with > > ceph --admin-daemon /var/run/ceph-osd.NN.asok config set debug_osd 20 > ceph --admin-daemon /var/run/ceph-osd.NN.asok config set debug_ms 1 > > We haven't been able to reproduce this, so I'm very much interested in any > light you can shine here. > Unfortunately cluster finally hit ``suicide timeout'' by every osd, so there was no logs, only some backtraces[1]. Yesterday after an osd was not able to join cluster in a hour, I decided to wait until data is remapped, then tried to restart cluster, leaving it overnight, to morning all osd processes are dead, with the same backtraces. Before it, after a silly node crash(related to deadlocks in kernel kvm code), some pgs remains to stay in peering state without any blocker in json output, so I had decided to restart osd to which primary copy belongs, because it helped before. So most interesting part is missing, but I`ll reformat cluster soon and will try to catch this again after filling some data in. [1]. http://xdel.ru/downloads/ceph-log/osd-heartbeat/ > Thanks! > sage > > ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hit suicide timeout after adding new osd 2013-01-24 7:44 ` Andrey Korolyov @ 2013-01-24 18:01 ` Sage Weil 2013-02-17 11:21 ` Andrey Korolyov 0 siblings, 1 reply; 37+ messages in thread From: Sage Weil @ 2013-01-24 18:01 UTC (permalink / raw) To: Andrey Korolyov Cc: Jens Kristian Søgaard, Stefan Priebe, Wido den Hollander, ceph-devel@vger.kernel.org On Thu, 24 Jan 2013, Andrey Korolyov wrote: > On Thu, Jan 24, 2013 at 8:39 AM, Sage Weil <sage@inktank.com> wrote: > > On Thu, 24 Jan 2013, Andrey Korolyov wrote: > >> On Thu, Jan 24, 2013 at 12:59 AM, Jens Kristian S?gaard > >> <jens@mermaidconsulting.dk> wrote: > >> > Hi Sage, > >> > > >> >>> I think the problem now is just that 'osd target transaction size' is > >> >> > >> >> I set it to 50, and that seems to have solved all my problems. > >> >> > >> >> After a day or so my cluster got to a HEALTH_OK state again. It has been > >> >> running for a few days now without any crashes! > >> > > >> > > >> > Hmm, one of the OSDs crashed again, sadly. > >> > > >> > It logs: > >> > > >> > -2> 2013-01-23 18:01:23.563624 7f67524da700 1 heartbeat_map is_healthy > >> > 'FileStore::op_tp thread 0x7f673affd700' had timed out after 60 > >> > -1> 2013-01-23 18:01:23.563657 7f67524da700 1 heartbeat_map is_healthy > >> > 'FileStore::op_tp thread 0x7f673affd700' had suicide timed out after 180 > >> > 0> 2013-01-23 18:01:24.257996 7f67524da700 -1 common/HeartbeatMap.cc: > >> > In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, > >> > const char*, time_t)' thread 7f67524da700 time 2013-01-23 18:01:23.563677 > >> > > >> > common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout") > >> > > >> > > >> > With this stack trace: > >> > > >> > ceph version 0.56.1-26-g3bd8f6b (3bd8f6b7235eb14cab778e3c6dcdc636aff4f539) > >> > 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, > >> > long)+0x2eb) [0x846ecb] > >> > 2: (ceph::HeartbeatMap::is_healthy()+0x8e) [0x8476ae] > >> > 3: (ceph::HeartbeatMap::check_touch_file()+0x28) [0x8478d8] > >> > 4: (CephContextServiceThread::entry()+0x55) [0x8e0f45] > >> > 5: /lib64/libpthread.so.0() [0x3cbc807d14] > >> > 6: (clone()+0x6d) [0x3cbc0f167d] > >> > > >> > > >> > I have saved the core file, if there's anything in there you need? > >> > > >> > Or do you think I just need to set the target transaction size even lower > >> > than 50? > >> > > >> > > >> > >> I was able to catch this too on rejoin to very busy cluster and seems > >> I need to lower this value at least at start time. Also > >> c5fe0965572c074a2a33660719ce3222d18c1464 has increased overall time > >> before restarted or new osd will join a cluster, and for 2M objects/3T > >> of replicated data restart of the cluster was took almost a hour > >> before it actually begins to work. The worst thing is that a single > >> osd, if restarted, will mark as up after couple of minutes, then after > >> almost half of hour(eating 100 percent of one cpu, ) as down and then > >> cluster will start to redistribute data after 300s timeout, osd still > >> doing something. > > > > Okay, something is very wrong. Can you reproduce this with a log? Or > > even a partial log while it is spinning? You can adjust the log level on > > a running process with > > > > ceph --admin-daemon /var/run/ceph-osd.NN.asok config set debug_osd 20 > > ceph --admin-daemon /var/run/ceph-osd.NN.asok config set debug_ms 1 > > > > We haven't been able to reproduce this, so I'm very much interested in any > > light you can shine here. > > > > Unfortunately cluster finally hit ``suicide timeout'' by every osd, so > there was no logs, only some backtraces[1]. > Yesterday after an osd was not able to join cluster in a hour, I > decided to wait until data is remapped, then tried to restart cluster, > leaving it overnight, to morning all osd processes are dead, with the > same backtraces. Before it, after a silly node crash(related to > deadlocks in kernel kvm code), some pgs remains to stay in peering > state without any blocker in json output, so I had decided to restart > osd to which primary copy belongs, because it helped before. So most > interesting part is missing, but I`ll reformat cluster soon and will > try to catch this again after filling some data in. > > [1]. http://xdel.ru/downloads/ceph-log/osd-heartbeat/ Thanks, I believe I see the problem. The peering workqueue is way behind, and it is trying to it all in one lump, timing out the work queue. The workaround is to increase the timeout. We'll put together a proper fix. sage ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hit suicide timeout after adding new osd 2013-01-24 18:01 ` Sage Weil @ 2013-02-17 11:21 ` Andrey Korolyov 2013-02-17 17:52 ` Sage Weil 0 siblings, 1 reply; 37+ messages in thread From: Andrey Korolyov @ 2013-02-17 11:21 UTC (permalink / raw) To: Sage Weil Cc: Jens Kristian Søgaard, Stefan Priebe, Wido den Hollander, ceph-devel@vger.kernel.org On Thu, Jan 24, 2013 at 10:01 PM, Sage Weil <sage@inktank.com> wrote: > On Thu, 24 Jan 2013, Andrey Korolyov wrote: >> On Thu, Jan 24, 2013 at 8:39 AM, Sage Weil <sage@inktank.com> wrote: >> > On Thu, 24 Jan 2013, Andrey Korolyov wrote: >> >> On Thu, Jan 24, 2013 at 12:59 AM, Jens Kristian S?gaard >> >> <jens@mermaidconsulting.dk> wrote: >> >> > Hi Sage, >> >> > >> >> >>> I think the problem now is just that 'osd target transaction size' is >> >> >> >> >> >> I set it to 50, and that seems to have solved all my problems. >> >> >> >> >> >> After a day or so my cluster got to a HEALTH_OK state again. It has been >> >> >> running for a few days now without any crashes! >> >> > >> >> > >> >> > Hmm, one of the OSDs crashed again, sadly. >> >> > >> >> > It logs: >> >> > >> >> > -2> 2013-01-23 18:01:23.563624 7f67524da700 1 heartbeat_map is_healthy >> >> > 'FileStore::op_tp thread 0x7f673affd700' had timed out after 60 >> >> > -1> 2013-01-23 18:01:23.563657 7f67524da700 1 heartbeat_map is_healthy >> >> > 'FileStore::op_tp thread 0x7f673affd700' had suicide timed out after 180 >> >> > 0> 2013-01-23 18:01:24.257996 7f67524da700 -1 common/HeartbeatMap.cc: >> >> > In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, >> >> > const char*, time_t)' thread 7f67524da700 time 2013-01-23 18:01:23.563677 >> >> > >> >> > common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout") >> >> > >> >> > >> >> > With this stack trace: >> >> > >> >> > ceph version 0.56.1-26-g3bd8f6b (3bd8f6b7235eb14cab778e3c6dcdc636aff4f539) >> >> > 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, >> >> > long)+0x2eb) [0x846ecb] >> >> > 2: (ceph::HeartbeatMap::is_healthy()+0x8e) [0x8476ae] >> >> > 3: (ceph::HeartbeatMap::check_touch_file()+0x28) [0x8478d8] >> >> > 4: (CephContextServiceThread::entry()+0x55) [0x8e0f45] >> >> > 5: /lib64/libpthread.so.0() [0x3cbc807d14] >> >> > 6: (clone()+0x6d) [0x3cbc0f167d] >> >> > >> >> > >> >> > I have saved the core file, if there's anything in there you need? >> >> > >> >> > Or do you think I just need to set the target transaction size even lower >> >> > than 50? >> >> > >> >> > >> >> >> >> I was able to catch this too on rejoin to very busy cluster and seems >> >> I need to lower this value at least at start time. Also >> >> c5fe0965572c074a2a33660719ce3222d18c1464 has increased overall time >> >> before restarted or new osd will join a cluster, and for 2M objects/3T >> >> of replicated data restart of the cluster was took almost a hour >> >> before it actually begins to work. The worst thing is that a single >> >> osd, if restarted, will mark as up after couple of minutes, then after >> >> almost half of hour(eating 100 percent of one cpu, ) as down and then >> >> cluster will start to redistribute data after 300s timeout, osd still >> >> doing something. >> > >> > Okay, something is very wrong. Can you reproduce this with a log? Or >> > even a partial log while it is spinning? You can adjust the log level on >> > a running process with >> > >> > ceph --admin-daemon /var/run/ceph-osd.NN.asok config set debug_osd 20 >> > ceph --admin-daemon /var/run/ceph-osd.NN.asok config set debug_ms 1 >> > >> > We haven't been able to reproduce this, so I'm very much interested in any >> > light you can shine here. >> > >> >> Unfortunately cluster finally hit ``suicide timeout'' by every osd, so >> there was no logs, only some backtraces[1]. >> Yesterday after an osd was not able to join cluster in a hour, I >> decided to wait until data is remapped, then tried to restart cluster, >> leaving it overnight, to morning all osd processes are dead, with the >> same backtraces. Before it, after a silly node crash(related to >> deadlocks in kernel kvm code), some pgs remains to stay in peering >> state without any blocker in json output, so I had decided to restart >> osd to which primary copy belongs, because it helped before. So most >> interesting part is missing, but I`ll reformat cluster soon and will >> try to catch this again after filling some data in. >> >> [1]. http://xdel.ru/downloads/ceph-log/osd-heartbeat/ > > Thanks, I believe I see the problem. The peering workqueue is way behind, > and it is trying to it all in one lump, timing out the work queue. The > workaround is to increase the timeout. We'll put together a proper fix. > > sage Hi Sage, Single OSDs still not able to join a cluster after restart, osd process eats one core and reads disk by long continuous periods, about hundreds of seconds, then staying eating 100% of core, then repeat. On relatively new cluster, it is not repeatable even with almost same data commit, only week or two of writes, snapshot creation, etc. exposes that. Please see log below: 2013-02-17 12:08:17.503992 7fbe8795c780 0 ceph version 0.56.3-2-g290a352 (290a352c3f9e241deac562e980ac8c6a74033ba6), process ceph-osd, pid 29283 starting osd.26 at :/0 osd_data /var/lib/ceph/osd/26 /var/lib/ceph/osd/journal/journal26 2013-02-17 12:08:17.508193 7fbe8795c780 1 accepter.accepter.bind my_inst.addr is 0.0.0.0:6803/29283 need_addr=1 2013-02-17 12:08:17.508222 7fbe8795c780 1 accepter.accepter.bind my_inst.addr is 0.0.0.0:6804/29283 need_addr=1 2013-02-17 12:08:17.508244 7fbe8795c780 1 accepter.accepter.bind my_inst.addr is 0.0.0.0:6805/29283 need_addr=1 2013-02-17 12:08:17.508665 7fbe8795c780 5 filestore(/var/lib/ceph/osd/26) basedir /var/lib/ceph/osd/26 journal /var/lib/ceph/osd/journal/journal26 2013-02-17 12:08:17.508712 7fbe8795c780 10 filestore(/var/lib/ceph/osd/26) mount fsid is 67783a66-178e-430a-a031-757bd45ae718 2013-02-17 12:08:17.512110 7fbe8795c780 0 filestore(/var/lib/ceph/osd/26) mount FIEMAP ioctl is supported and appears to work 2013-02-17 12:08:17.512122 7fbe8795c780 0 filestore(/var/lib/ceph/osd/26) mount FIEMAP ioctl is disabled via 'filestore fiemap' config option 2013-02-17 12:08:17.512590 7fbe8795c780 0 filestore(/var/lib/ceph/osd/26) mount did NOT detect btrfs 2013-02-17 12:08:17.532726 7fbe8795c780 0 filestore(/var/lib/ceph/osd/26) mount syncfs(2) syscall fully supported (by glibc and kernel) 2013-02-17 12:08:17.532901 7fbe8795c780 0 filestore(/var/lib/ceph/osd/26) mount found snaps <> 2013-02-17 12:08:17.532929 7fbe8795c780 5 filestore(/var/lib/ceph/osd/26) mount op_seq is 53467469 2013-02-17 12:08:17.534953 7fbe8795c780 20 filestore (init)dbobjectmap: seq is 852347 2013-02-17 12:08:17.534981 7fbe8795c780 10 filestore(/var/lib/ceph/osd/26) open_journal at /var/lib/ceph/osd/journal/journal26 2013-02-17 12:08:17.535035 7fbe8795c780 0 filestore(/var/lib/ceph/osd/26) mount: enabling WRITEAHEAD journal mode: btrfs not detected 2013-02-17 12:08:17.535046 7fbe8795c780 10 filestore(/var/lib/ceph/osd/26) list_collections 2013-02-17 12:08:17.668919 7fbe83ba2700 20 filestore(/var/lib/ceph/osd/26) sync_entry waiting for max_interval 5.000000 2013-02-17 12:08:17.669993 7fbe80b9c700 20 filestore(/var/lib/ceph/osd/26) flusher_entry start 2013-02-17 12:08:17.670012 7fbe80b9c700 20 filestore(/var/lib/ceph/osd/26) flusher_entry sleeping 2013-02-17 12:08:17.670104 7fbe8795c780 5 filestore(/var/lib/ceph/osd/26) umount /var/lib/ceph/osd/26 2013-02-17 12:08:17.670126 7fbe83ba2700 20 filestore(/var/lib/ceph/osd/26) sync_entry force_sync set 2013-02-17 12:08:17.670162 7fbe80b9c700 20 filestore(/var/lib/ceph/osd/26) flusher_entry awoke 2013-02-17 12:08:17.670169 7fbe80b9c700 20 filestore(/var/lib/ceph/osd/26) flusher_entry finish ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hit suicide timeout after adding new osd 2013-02-17 11:21 ` Andrey Korolyov @ 2013-02-17 17:52 ` Sage Weil 0 siblings, 0 replies; 37+ messages in thread From: Sage Weil @ 2013-02-17 17:52 UTC (permalink / raw) To: Andrey Korolyov Cc: Jens Kristian Søgaard, Stefan Priebe, Wido den Hollander, ceph-devel@vger.kernel.org On Sun, 17 Feb 2013, Andrey Korolyov wrote: > On Thu, Jan 24, 2013 at 10:01 PM, Sage Weil <sage@inktank.com> wrote: > > On Thu, 24 Jan 2013, Andrey Korolyov wrote: > >> On Thu, Jan 24, 2013 at 8:39 AM, Sage Weil <sage@inktank.com> wrote: > >> > On Thu, 24 Jan 2013, Andrey Korolyov wrote: > >> >> On Thu, Jan 24, 2013 at 12:59 AM, Jens Kristian S?gaard > >> >> <jens@mermaidconsulting.dk> wrote: > >> >> > Hi Sage, > >> >> > > >> >> >>> I think the problem now is just that 'osd target transaction size' is > >> >> >> > >> >> >> I set it to 50, and that seems to have solved all my problems. > >> >> >> > >> >> >> After a day or so my cluster got to a HEALTH_OK state again. It has been > >> >> >> running for a few days now without any crashes! > >> >> > > >> >> > > >> >> > Hmm, one of the OSDs crashed again, sadly. > >> >> > > >> >> > It logs: > >> >> > > >> >> > -2> 2013-01-23 18:01:23.563624 7f67524da700 1 heartbeat_map is_healthy > >> >> > 'FileStore::op_tp thread 0x7f673affd700' had timed out after 60 > >> >> > -1> 2013-01-23 18:01:23.563657 7f67524da700 1 heartbeat_map is_healthy > >> >> > 'FileStore::op_tp thread 0x7f673affd700' had suicide timed out after 180 > >> >> > 0> 2013-01-23 18:01:24.257996 7f67524da700 -1 common/HeartbeatMap.cc: > >> >> > In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, > >> >> > const char*, time_t)' thread 7f67524da700 time 2013-01-23 18:01:23.563677 > >> >> > > >> >> > common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout") > >> >> > > >> >> > > >> >> > With this stack trace: > >> >> > > >> >> > ceph version 0.56.1-26-g3bd8f6b (3bd8f6b7235eb14cab778e3c6dcdc636aff4f539) > >> >> > 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, > >> >> > long)+0x2eb) [0x846ecb] > >> >> > 2: (ceph::HeartbeatMap::is_healthy()+0x8e) [0x8476ae] > >> >> > 3: (ceph::HeartbeatMap::check_touch_file()+0x28) [0x8478d8] > >> >> > 4: (CephContextServiceThread::entry()+0x55) [0x8e0f45] > >> >> > 5: /lib64/libpthread.so.0() [0x3cbc807d14] > >> >> > 6: (clone()+0x6d) [0x3cbc0f167d] > >> >> > > >> >> > > >> >> > I have saved the core file, if there's anything in there you need? > >> >> > > >> >> > Or do you think I just need to set the target transaction size even lower > >> >> > than 50? > >> >> > > >> >> > > >> >> > >> >> I was able to catch this too on rejoin to very busy cluster and seems > >> >> I need to lower this value at least at start time. Also > >> >> c5fe0965572c074a2a33660719ce3222d18c1464 has increased overall time > >> >> before restarted or new osd will join a cluster, and for 2M objects/3T > >> >> of replicated data restart of the cluster was took almost a hour > >> >> before it actually begins to work. The worst thing is that a single > >> >> osd, if restarted, will mark as up after couple of minutes, then after > >> >> almost half of hour(eating 100 percent of one cpu, ) as down and then > >> >> cluster will start to redistribute data after 300s timeout, osd still > >> >> doing something. > >> > > >> > Okay, something is very wrong. Can you reproduce this with a log? Or > >> > even a partial log while it is spinning? You can adjust the log level on > >> > a running process with > >> > > >> > ceph --admin-daemon /var/run/ceph-osd.NN.asok config set debug_osd 20 > >> > ceph --admin-daemon /var/run/ceph-osd.NN.asok config set debug_ms 1 > >> > > >> > We haven't been able to reproduce this, so I'm very much interested in any > >> > light you can shine here. > >> > > >> > >> Unfortunately cluster finally hit ``suicide timeout'' by every osd, so > >> there was no logs, only some backtraces[1]. > >> Yesterday after an osd was not able to join cluster in a hour, I > >> decided to wait until data is remapped, then tried to restart cluster, > >> leaving it overnight, to morning all osd processes are dead, with the > >> same backtraces. Before it, after a silly node crash(related to > >> deadlocks in kernel kvm code), some pgs remains to stay in peering > >> state without any blocker in json output, so I had decided to restart > >> osd to which primary copy belongs, because it helped before. So most > >> interesting part is missing, but I`ll reformat cluster soon and will > >> try to catch this again after filling some data in. > >> > >> [1]. http://xdel.ru/downloads/ceph-log/osd-heartbeat/ > > > > Thanks, I believe I see the problem. The peering workqueue is way behind, > > and it is trying to it all in one lump, timing out the work queue. The > > workaround is to increase the timeout. We'll put together a proper fix. > > > > sage > > Hi Sage, > > Single OSDs still not able to join a cluster after restart, osd > process eats one core and reads disk by long continuous periods, about > hundreds of seconds, then staying eating 100% of core, then repeat. On > relatively new cluster, it is not repeatable even with almost same > data commit, only week or two of writes, snapshot creation, etc. > exposes that. Please see log below: This sounds like the fixed code chewing through PGs with very old epochs due to the (now fixed) bug in .2. Are there some pools that were completely idle on this cluster prior to the upgrade? The problem before was that we weren't persisting the new osdmap epochs for those PGs if they never saw any IO, which means they have to do catch-up on restart. They will come up eventually. Alternatively, you can stop ceph-osd and manually delete those PG directories (i'm guessing its the data and metadata pools). And/or, you can delete those pools from the cluster prior to the restart. Sorry about this--I should have made a specific mention about this possibility in the .3 release notes. I'll add something there today. sage > > 2013-02-17 12:08:17.503992 7fbe8795c780 0 ceph version > 0.56.3-2-g290a352 (290a352c3f9e241deac562e980ac8c6a74033ba6), process > ceph-osd, pid 29283 > starting osd.26 at :/0 osd_data /var/lib/ceph/osd/26 > /var/lib/ceph/osd/journal/journal26 > 2013-02-17 12:08:17.508193 7fbe8795c780 1 accepter.accepter.bind > my_inst.addr is 0.0.0.0:6803/29283 need_addr=1 > 2013-02-17 12:08:17.508222 7fbe8795c780 1 accepter.accepter.bind > my_inst.addr is 0.0.0.0:6804/29283 need_addr=1 > 2013-02-17 12:08:17.508244 7fbe8795c780 1 accepter.accepter.bind > my_inst.addr is 0.0.0.0:6805/29283 need_addr=1 > 2013-02-17 12:08:17.508665 7fbe8795c780 5 > filestore(/var/lib/ceph/osd/26) basedir /var/lib/ceph/osd/26 journal > /var/lib/ceph/osd/journal/journal26 > 2013-02-17 12:08:17.508712 7fbe8795c780 10 > filestore(/var/lib/ceph/osd/26) mount fsid is > 67783a66-178e-430a-a031-757bd45ae718 > 2013-02-17 12:08:17.512110 7fbe8795c780 0 > filestore(/var/lib/ceph/osd/26) mount FIEMAP ioctl is supported and > appears to work > 2013-02-17 12:08:17.512122 7fbe8795c780 0 > filestore(/var/lib/ceph/osd/26) mount FIEMAP ioctl is disabled via > 'filestore fiemap' config option > 2013-02-17 12:08:17.512590 7fbe8795c780 0 > filestore(/var/lib/ceph/osd/26) mount did NOT detect btrfs > 2013-02-17 12:08:17.532726 7fbe8795c780 0 > filestore(/var/lib/ceph/osd/26) mount syncfs(2) syscall fully > supported (by glibc and kernel) > 2013-02-17 12:08:17.532901 7fbe8795c780 0 > filestore(/var/lib/ceph/osd/26) mount found snaps <> > 2013-02-17 12:08:17.532929 7fbe8795c780 5 > filestore(/var/lib/ceph/osd/26) mount op_seq is 53467469 > 2013-02-17 12:08:17.534953 7fbe8795c780 20 filestore > (init)dbobjectmap: seq is 852347 > 2013-02-17 12:08:17.534981 7fbe8795c780 10 > filestore(/var/lib/ceph/osd/26) open_journal at > /var/lib/ceph/osd/journal/journal26 > 2013-02-17 12:08:17.535035 7fbe8795c780 0 > filestore(/var/lib/ceph/osd/26) mount: enabling WRITEAHEAD journal > mode: btrfs not detected > 2013-02-17 12:08:17.535046 7fbe8795c780 10 > filestore(/var/lib/ceph/osd/26) list_collections > 2013-02-17 12:08:17.668919 7fbe83ba2700 20 > filestore(/var/lib/ceph/osd/26) sync_entry waiting for max_interval > 5.000000 > 2013-02-17 12:08:17.669993 7fbe80b9c700 20 > filestore(/var/lib/ceph/osd/26) flusher_entry start > 2013-02-17 12:08:17.670012 7fbe80b9c700 20 > filestore(/var/lib/ceph/osd/26) flusher_entry sleeping > 2013-02-17 12:08:17.670104 7fbe8795c780 5 > filestore(/var/lib/ceph/osd/26) umount /var/lib/ceph/osd/26 > 2013-02-17 12:08:17.670126 7fbe83ba2700 20 > filestore(/var/lib/ceph/osd/26) sync_entry force_sync set > 2013-02-17 12:08:17.670162 7fbe80b9c700 20 > filestore(/var/lib/ceph/osd/26) flusher_entry awoke > 2013-02-17 12:08:17.670169 7fbe80b9c700 20 > filestore(/var/lib/ceph/osd/26) flusher_entry finish > > ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hit suicide timeout after adding new osd 2013-01-23 20:59 ` Jens Kristian Søgaard 2013-01-23 22:56 ` Andrey Korolyov @ 2013-01-24 4:28 ` Sage Weil 2013-01-24 10:08 ` Jens Kristian Søgaard 1 sibling, 1 reply; 37+ messages in thread From: Sage Weil @ 2013-01-24 4:28 UTC (permalink / raw) To: Jens Kristian Søgaard Cc: Stefan Priebe, Wido den Hollander, ceph-devel@vger.kernel.org On Wed, 23 Jan 2013, Jens Kristian S?gaard wrote: > Hi Sage, > > > > I think the problem now is just that 'osd target transaction size' is > > I set it to 50, and that seems to have solved all my problems. > > After a day or so my cluster got to a HEALTH_OK state again. It has been > > running for a few days now without any crashes! > > Hmm, one of the OSDs crashed again, sadly. > > It logs: > > -2> 2013-01-23 18:01:23.563624 7f67524da700 1 heartbeat_map is_healthy > 'FileStore::op_tp thread 0x7f673affd700' had timed out after 60 > -1> 2013-01-23 18:01:23.563657 7f67524da700 1 heartbeat_map is_healthy > 'FileStore::op_tp thread 0x7f673affd700' had suicide timed out after 180 > 0> 2013-01-23 18:01:24.257996 7f67524da700 -1 common/HeartbeatMap.cc: In > function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const > char*, time_t)' thread 7f67524da700 time 2013-01-23 18:01:23.563677 > common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout") > > > With this stack trace: > > ceph version 0.56.1-26-g3bd8f6b (3bd8f6b7235eb14cab778e3c6dcdc636aff4f539) > 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, > long)+0x2eb) [0x846ecb] > 2: (ceph::HeartbeatMap::is_healthy()+0x8e) [0x8476ae] > 3: (ceph::HeartbeatMap::check_touch_file()+0x28) [0x8478d8] > 4: (CephContextServiceThread::entry()+0x55) [0x8e0f45] > 5: /lib64/libpthread.so.0() [0x3cbc807d14] > 6: (clone()+0x6d) [0x3cbc0f167d] > > > I have saved the core file, if there's anything in there you need? > > Or do you think I just need to set the target transaction size even lower than > 50? Can you share the output from 'thread apply all bt' so we can see what it was doing? thanks! s ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hit suicide timeout after adding new osd 2013-01-24 4:28 ` Sage Weil @ 2013-01-24 10:08 ` Jens Kristian Søgaard 2013-01-24 18:06 ` Sage Weil 0 siblings, 1 reply; 37+ messages in thread From: Jens Kristian Søgaard @ 2013-01-24 10:08 UTC (permalink / raw) To: Sage Weil; +Cc: Stefan Priebe, Wido den Hollander, ceph-devel@vger.kernel.org Hi Sage, > Can you share the output from 'thread apply all bt' so we can see what it > was doing? Yes, ofcourse! You can download the output here: http://bit.ly/UY7TDe -- Jens Kristian Søgaard, Mermaid Consulting ApS, jens@mermaidconsulting.dk, http://wwww.mermaidconsulting.com/ -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hit suicide timeout after adding new osd 2013-01-24 10:08 ` Jens Kristian Søgaard @ 2013-01-24 18:06 ` Sage Weil 0 siblings, 0 replies; 37+ messages in thread From: Sage Weil @ 2013-01-24 18:06 UTC (permalink / raw) To: Jens Kristian Søgaard Cc: Stefan Priebe, Wido den Hollander, ceph-devel@vger.kernel.org On Thu, 24 Jan 2013, Jens Kristian S?gaard wrote: > Hi Sage, > > > Can you share the output from 'thread apply all bt' so we can see what it > > was doing? > > Yes, ofcourse! > > You can download the output here: > > http://bit.ly/UY7TDe Sadly your problem is a bit different than Andrey's. It looks like maybe it's just busy, its hard to tell from the snapshot in the stack traces. If you can enable some logs leading up to the crash, that'd be awesome. Thanks! sage ^ permalink raw reply [flat|nested] 37+ messages in thread
end of thread, other threads:[~2013-02-17 17:52 UTC | newest] Thread overview: 37+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2013-01-17 14:35 Hit suicide timeout after adding new osd Jens Kristian Søgaard 2013-01-17 14:47 ` Wido den Hollander 2013-01-17 14:50 ` Stefan Priebe 2013-01-17 15:33 ` Wido den Hollander 2013-01-17 15:37 ` Stefan Priebe 2013-01-17 17:17 ` Sage Weil 2013-01-17 20:32 ` Jens Kristian Søgaard 2013-01-17 22:03 ` Sage Weil 2013-01-18 11:24 ` Jens Kristian Søgaard 2013-01-18 21:28 ` Sage Weil 2013-01-18 21:36 ` Jens Kristian Søgaard 2013-01-18 21:44 ` Sage Weil 2013-01-19 9:25 ` Jens Kristian Søgaard 2013-01-19 16:44 ` Sage Weil 2013-01-19 17:56 ` Jens Kristian Søgaard 2013-01-19 18:19 ` Sage Weil 2013-01-19 18:40 ` Jens Kristian Søgaard 2013-01-19 20:08 ` Sage Weil 2013-01-19 20:29 ` Jens Kristian Søgaard 2013-01-19 22:04 ` Sage Weil 2013-01-21 0:14 ` Sage Weil 2013-01-21 6:59 ` Jens Kristian Søgaard 2013-01-21 7:11 ` Sage Weil 2013-01-23 12:14 ` Jens Kristian Søgaard 2013-01-23 12:26 ` Wido den Hollander 2013-01-23 12:29 ` Jens Kristian Søgaard 2013-01-23 13:13 ` Sage Weil 2013-01-23 20:59 ` Jens Kristian Søgaard 2013-01-23 22:56 ` Andrey Korolyov 2013-01-24 4:39 ` Sage Weil 2013-01-24 7:44 ` Andrey Korolyov 2013-01-24 18:01 ` Sage Weil 2013-02-17 11:21 ` Andrey Korolyov 2013-02-17 17:52 ` Sage Weil 2013-01-24 4:28 ` Sage Weil 2013-01-24 10:08 ` Jens Kristian Søgaard 2013-01-24 18:06 ` Sage Weil
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.