* mon crash on debian wheezy
From: Xiaopong Tran @ 2012-08-24 8:12 UTC
To: ceph-devel@vger.kernel.org

Hello,

I've been running 0.48argonaut in production for over a month without any issues, and today I suddenly lost one mon. Looking at the syslog file, I see the following trace log, but I couldn't see what's wrong from it. However, this event created a gigantic core file. Here's its size:

-rw------- 1 root root 16085647360 Aug 24 14:53 core

This happened while we were migrating data from our old storage to Ceph. We are running about 20 processes migrating data into Ceph, while about 30 more application processes read from and write new data to it.

The following is from syslog:

Aug 24 14:50:15 s100001 kernel: [3076872.019074] INFO: task ceph-mon:1686 blocked for more than 120 seconds.
Aug 24 14:50:38 s100001 kernel: [3076872.019092] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 24 14:50:38 s100001 kernel: [3076872.019109] ceph-mon D ffff88082f253740 0 1686 1 0x00000000
Aug 24 14:50:38 s100001 kernel: [3076872.019113] ffff88080b977710 0000000000000086 ffff880800000001 ffff88080c328ee0
Aug 24 14:50:38 s100001 kernel: [3076872.019118] 0000000000013740 ffff88080d4dbfd8 ffff88080d4dbfd8 ffff88080b977710
Aug 24 14:50:38 s100001 kernel: [3076872.019122] 0000000000000246 0000000100000246 ffff88080bfa7400 ffff88080b977710
Aug 24 14:50:38 s100001 kernel: [3076872.019126] Call Trace:
Aug 24 14:50:38 s100001 kernel: [3076872.019133] [<ffffffff8104986f>] ? exit_mm+0x97/0x122
Aug 24 14:50:38 s100001 kernel: [3076872.019136] [<ffffffff81049b40>] ? do_exit+0x246/0x6fc
Aug 24 14:50:38 s100001 kernel: [3076872.019139] [<ffffffff8104a276>] ? do_group_exit+0x74/0x9e
Aug 24 14:50:38 s100001 kernel: [3076872.019144] [<ffffffff81055bb8>] ? get_signal_to_deliver+0x46d/0x48f
Aug 24 14:50:38 s100001 kernel: [3076872.019149] [<ffffffff8100de33>] ? do_signal+0x38/0x610
Aug 24 14:50:38 s100001 kernel: [3076872.019152] [<ffffffff810151c5>] ? init_fpu+0x84/0x91
Aug 24 14:50:38 s100001 kernel: [3076872.019155] [<ffffffff81015d2e>] ? restore_i387_xstate+0x113/0x15d
Aug 24 14:50:38 s100001 kernel: [3076872.019158] [<ffffffff8105676b>] ? do_sigaltstack+0xaa/0x13e
Aug 24 14:50:38 s100001 kernel: [3076872.019162] [<ffffffff8106f2f9>] ? sys_futex+0x138/0x147
Aug 24 14:50:38 s100001 kernel: [3076872.019166] [<ffffffff8100e441>] ? do_notify_resume+0x25/0x68
Aug 24 14:50:38 s100001 kernel: [3076872.019170] [<ffffffff8134fe60>] ? int_signal+0x12/0x17
Aug 24 14:50:38 s100001 kernel: [3076872.019173] INFO: task ceph-mon:1687 blocked for more than 120 seconds.
Aug 24 14:50:38 s100001 kernel: [3076872.019188] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 24 14:50:38 s100001 kernel: [3076872.019205] ceph-mon D ffff88080cb8a400 0 1687 1 0x00000000
Aug 24 14:50:38 s100001 kernel: [3076872.019208] ffff88080cb8a400 0000000000000086 ffff88080cba0860 ffff88080b92b6d0
Aug 24 14:50:38 s100001 kernel: [3076872.019212] 0000000000013740 ffff88080d869fd8 ffff88080d869fd8 ffff88080cb8a400
Aug 24 14:50:38 s100001 kernel: [3076872.019216] 0000000000000246 0000000000000246 ffff88080bfa7400 ffff88080cb8a400
Aug 24 14:50:38 s100001 kernel: [3076872.019220] Call Trace:
Aug 24 14:50:38 s100001 kernel: [3076872.019223] [<ffffffff8104986f>] ? exit_mm+0x97/0x122
Aug 24 14:50:38 s100001 kernel: [3076872.019226] [<ffffffff81049b40>] ? do_exit+0x246/0x6fc
Aug 24 14:50:38 s100001 kernel: [3076872.019229] [<ffffffff8104a276>] ? do_group_exit+0x74/0x9e
Aug 24 14:50:38 s100001 kernel: [3076872.019232] [<ffffffff81055bb8>] ? get_signal_to_deliver+0x46d/0x48f
Aug 24 14:50:38 s100001 kernel: [3076872.019235] [<ffffffff8100de33>] ? do_signal+0x38/0x610
Aug 24 14:50:38 s100001 kernel: [3076872.019238] [<ffffffff8106f2f9>] ? sys_futex+0x138/0x147
Aug 24 14:50:38 s100001 kernel: [3076872.019241] [<ffffffff8100e441>] ? do_notify_resume+0x25/0x68
Aug 24 14:50:38 s100001 kernel: [3076872.019246] [<ffffffff810f96a2>] ? sys_write+0x5f/0x6b
Aug 24 14:50:38 s100001 kernel: [3076872.019248] [<ffffffff8134fe60>] ? int_signal+0x12/0x17
Aug 24 14:50:38 s100001 kernel: [3076872.019251] INFO: task ceph-mon:1727 blocked for more than 120 seconds.
Aug 24 14:50:38 s100001 kernel: [3076872.019266] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 24 14:50:38 s100001 kernel: [3076872.019283] ceph-mon D ffff88080dff7710 0 1727 1 0x00000000
Aug 24 14:50:38 s100001 kernel: [3076872.019286] ffff88080dff7710 0000000000000086 ffff88080cba0860 ffff88080c39e340
Aug 24 14:50:38 s100001 kernel: [3076872.019290] 0000000000013740 ffff88080e241fd8 ffff88080e241fd8 ffff88080dff7710
Aug 24 14:50:38 s100001 kernel: [3076872.019294] 0000000000000246 0000000000000246 ffff88080bfa7400 ffff88080dff7710
Aug 24 14:50:38 s100001 kernel: [3076872.019297] Call Trace:
Aug 24 14:50:38 s100001 kernel: [3076872.019300] [<ffffffff8104986f>] ? exit_mm+0x97/0x122
Aug 24 14:50:38 s100001 kernel: [3076872.019303] [<ffffffff81049b40>] ? do_exit+0x246/0x6fc
Aug 24 14:50:38 s100001 kernel: [3076872.019307] [<ffffffff8104a276>] ? do_group_exit+0x74/0x9e
Aug 24 14:50:38 s100001 kernel: [3076872.019310] [<ffffffff81055bb8>] ? get_signal_to_deliver+0x46d/0x48f
Aug 24 14:50:38 s100001 kernel: [3076872.019313] [<ffffffff8100de33>] ? do_signal+0x38/0x610
Aug 24 14:50:38 s100001 kernel: [3076872.019316] [<ffffffff8106f2f9>] ? sys_futex+0x138/0x147
Aug 24 14:50:38 s100001 kernel: [3076872.019319] [<ffffffff8100e441>] ? do_notify_resume+0x25/0x68
Aug 24 14:50:38 s100001 kernel: [3076872.019322] [<ffffffff8134fe60>] ? int_signal+0x12/0x17
Aug 24 14:50:38 s100001 kernel: [3076872.019324] INFO: task ceph-mon:1737 blocked for more than 120 seconds.
Aug 24 14:50:38 s100001 kernel: [3076872.019339] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 24 14:50:38 s100001 kernel: [3076872.019356] ceph-mon D ffff88082f213740 0 1737 1 0x00000000
Aug 24 14:50:38 s100001 kernel: [3076872.019359] ffff88080b976930 0000000000000086 ffff880000000000 ffffffff8160d020
Aug 24 14:50:38 s100001 kernel: [3076872.019363] 0000000000013740 ffff88080dde1fd8 ffff88080dde1fd8 ffff88080b976930
Aug 24 14:50:38 s100001 kernel: [3076872.019367] 0000000000000202 000000010519fcf0 ffff88080cba0860 ffff88080b976930
Aug 24 14:50:38 s100001 kernel: [3076872.019370] Call Trace:
Aug 24 14:50:38 s100001 kernel: [3076872.019373] [<ffffffff8104986f>] ? exit_mm+0x97/0x122
Aug 24 14:50:38 s100001 kernel: [3076872.019376] [<ffffffff81049b40>] ? do_exit+0x246/0x6fc
Aug 24 14:50:38 s100001 kernel: [3076872.019379] [<ffffffff8104a276>] ? do_group_exit+0x74/0x9e
Aug 24 14:50:38 s100001 kernel: [3076872.019382] [<ffffffff81055bb8>] ? get_signal_to_deliver+0x46d/0x48f
Aug 24 14:50:38 s100001 kernel: [3076872.019385] [<ffffffff8100de33>] ? do_signal+0x38/0x610
Aug 24 14:50:38 s100001 kernel: [3076872.019389] [<ffffffff81036457>] ? should_resched+0x5/0x23
Aug 24 14:50:38 s100001 kernel: [3076872.019392] [<ffffffff81049ff4>] ? do_exit+0x6fa/0x6fc
Aug 24 14:50:38 s100001 kernel: [3076872.019395] [<ffffffff8100d755>] ? __switch_to+0x1e5/0x258
Aug 24 14:50:38 s100001 kernel: [3076872.019398] [<ffffffff8100e441>] ? do_notify_resume+0x25/0x68
Aug 24 14:50:38 s100001 kernel: [3076872.019400] [<ffffffff8134fe60>] ? int_signal+0x12/0x17
Aug 24 14:50:38 s100001 kernel: [3076872.019403] INFO: task ceph-mon:1738 blocked for more than 120 seconds.
Aug 24 14:50:38 s100001 kernel: [3076872.019418] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 24 14:50:38 s100001 kernel: [3076872.019435] ceph-mon D ffff88080e39cab0 0 1738 1 0x00000000
Aug 24 14:50:38 s100001 kernel: [3076872.019438] ffff88080e39cab0 0000000000000086 ffff88080cba0860 ffff8807fb06a0c0
Aug 24 14:50:38 s100001 kernel: [3076872.019442] 0000000000013740 ffff88080c929fd8 ffff88080c929fd8 ffff88080e39cab0
Aug 24 14:50:38 s100001 kernel: [3076872.019446] 0000000000000293 0000000000000293 ffff88080bfa7400 ffff88080e39cab0
Aug 24 14:50:38 s100001 kernel: [3076872.019449] Call Trace:
Aug 24 14:50:38 s100001 kernel: [3076872.019452] [<ffffffff8104986f>] ? exit_mm+0x97/0x122
Aug 24 14:50:38 s100001 kernel: [3076872.019455] [<ffffffff81049b40>] ? do_exit+0x246/0x6fc
Aug 24 14:50:38 s100001 kernel: [3076872.019459] [<ffffffff81035a19>] ? set_task_rq+0x23/0x35
Aug 24 14:50:38 s100001 kernel: [3076872.019463] [<ffffffff8103eb0d>] ? set_task_cpu+0xc1/0xd4
Aug 24 14:50:38 s100001 kernel: [3076872.019466] [<ffffffff8104a276>] ? do_group_exit+0x74/0x9e
Aug 24 14:50:38 s100001 kernel: [3076872.019469] [<ffffffff81055bb8>] ? get_signal_to_deliver+0x46d/0x48f
Aug 24 14:50:38 s100001 kernel: [3076872.019473] [<ffffffff811a90ec>] ? cpumask_next_and+0x28/0x34
Aug 24 14:50:38 s100001 kernel: [3076872.019476] [<ffffffff81035a19>] ? set_task_rq+0x23/0x35
Aug 24 14:50:38 s100001 kernel: [3076872.019479] [<ffffffff8100de33>] ? do_signal+0x38/0x610
Aug 24 14:50:38 s100001 kernel: [3076872.019482] [<ffffffff8103ac16>] ? enqueue_task_fair+0x7f/0x185
Aug 24 14:50:38 s100001 kernel: [3076872.019485] [<ffffffff8103703b>] ? test_tsk_need_resched+0xa/0x13
Aug 24 14:50:38 s100001 kernel: [3076872.019488] [<ffffffff8103a303>] ? resched_task+0x39/0x65
Aug 24 14:50:38 s100001 kernel: [3076872.019490] [<ffffffff8103ad52>] ? check_preempt_curr+0x36/0x5f
Aug 24 14:50:38 s100001 kernel: [3076872.019493] [<ffffffff8103f836>] ? wake_up_new_task+0xb9/0xc2
Aug 24 14:50:38 s100001 kernel: [3076872.019496] [<ffffffff8104605f>] ? do_fork+0x196/0x219
Aug 24 14:50:38 s100001 kernel: [3076872.019499] [<ffffffff81053bd8>] ? recalc_sigpending+0x23/0x3c
Aug 24 14:50:38 s100001 kernel: [3076872.019502] [<ffffffff81054271>] ? __set_task_blocked+0x5e/0x65
Aug 24 14:50:38 s100001 kernel: [3076872.019505] [<ffffffff8106f2f9>] ? sys_futex+0x138/0x147
Aug 24 14:50:38 s100001 kernel: [3076872.019508] [<ffffffff8100e441>] ? do_notify_resume+0x25/0x68
Aug 24 14:50:38 s100001 kernel: [3076872.019511] [<ffffffff8134fe60>] ? int_signal+0x12/0x17
Aug 24 14:50:38 s100001 kernel: [3076872.019513] INFO: task ceph-mon:1739 blocked for more than 120 seconds.
Aug 24 14:50:38 s100001 kernel: [3076872.019528] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 24 14:50:38 s100001 kernel: [3076872.019545] ceph-mon D ffff88080be943c0 0 1739 1 0x00000000
Aug 24 14:50:38 s100001 kernel: [3076872.019549] ffff88080be943c0 0000000000000086 ffff880800000001 ffff88080b6027b0
Aug 24 14:50:38 s100001 kernel: [3076872.019552] 0000000000013740 ffff88080db47fd8 ffff88080db47fd8 ffff88080be943c0
Aug 24 14:50:38 s100001 kernel: [3076872.019556] 0000000000000246 0000000100000246 ffff88080bfa7400 ffff88080be943c0
Aug 24 14:50:38 s100001 kernel: [3076872.019560] Call Trace:
Aug 24 14:50:38 s100001 kernel: [3076872.019563] [<ffffffff8104986f>] ? exit_mm+0x97/0x122
Aug 24 14:50:38 s100001 kernel: [3076872.019566] [<ffffffff81049b40>] ? do_exit+0x246/0x6fc
Aug 24 14:50:38 s100001 kernel: [3076872.019569] [<ffffffff8104a276>] ? do_group_exit+0x74/0x9e
Aug 24 14:50:38 s100001 kernel: [3076872.019572] [<ffffffff81055bb8>] ? get_signal_to_deliver+0x46d/0x48f
Aug 24 14:50:38 s100001 kernel: [3076872.019575] [<ffffffff8100de33>] ? do_signal+0x38/0x610
Aug 24 14:50:38 s100001 kernel: [3076872.019579] [<ffffffff810ea0cb>] ? kmem_cache_free+0x2d/0x69
Aug 24 14:50:38 s100001 kernel: [3076872.019582] [<ffffffff811091f8>] ? dentry_kill+0x120/0x12b
Aug 24 14:50:38 s100001 kernel: [3076872.019585] [<ffffffff8106f2f9>] ? sys_futex+0x138/0x147
Aug 24 14:50:39 s100001 kernel: [3076872.019588] [<ffffffff8100e441>] ? do_notify_resume+0x25/0x68
Aug 24 14:50:47 s100001 kernel: [3076872.019591] [<ffffffff810f7fde>] ? filp_close+0x62/0x6a
Aug 24 14:50:47 s100001 kernel: [3076872.019594] [<ffffffff8134fe60>] ? int_signal+0x12/0x17
Aug 24 14:50:47 s100001 kernel: [3076872.019597] INFO: task ceph-mon:1740 blocked for more than 120 seconds.
Aug 24 14:50:47 s100001 kernel: [3076872.019612] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 24 14:50:47 s100001 kernel: [3076872.019643] ceph-mon D ffff88080bc29710 0 1740 1 0x00000000
Aug 24 14:50:47 s100001 kernel: [3076872.019646] ffff88080bc29710 0000000000000086 ffff88080cba0860 ffff880145a9d510
Aug 24 14:50:47 s100001 kernel: [3076872.019650] 0000000000013740 ffff88080c921fd8 ffff88080c921fd8 ffff88080bc29710
Aug 24 14:50:47 s100001 kernel: [3076872.019654] 0000000000000293 0000000000000293 ffff88080bfa7400 ffff88080bc29710
Aug 24 14:50:47 s100001 kernel: [3076872.019657] Call Trace:
Aug 24 14:50:47 s100001 kernel: [3076872.019660] [<ffffffff8104986f>] ? exit_mm+0x97/0x122
Aug 24 14:50:47 s100001 kernel: [3076872.019663] [<ffffffff81049b40>] ? do_exit+0x246/0x6fc
Aug 24 14:50:47 s100001 kernel: [3076872.019669] [<ffffffff81024afa>] ? default_send_IPI_mask_sequence_phys+0x4b/0x6a
Aug 24 14:50:47 s100001 kernel: [3076872.019673] [<ffffffff813498bf>] ? _cond_resched+0x7/0x1c
Aug 24 14:50:47 s100001 kernel: [3076872.019677] [<ffffffff8104a276>] ? do_group_exit+0x74/0x9e
Aug 24 14:50:47 s100001 kernel: [3076872.019679] [<ffffffff81055bb8>] ? get_signal_to_deliver+0x46d/0x48f
Aug 24 14:50:47 s100001 kernel: [3076872.019683] [<ffffffff8100de33>] ? do_signal+0x38/0x610
Aug 24 14:50:47 s100001 kernel: [3076872.019686] [<ffffffff8100e441>] ? do_notify_resume+0x25/0x68
Aug 24 14:50:47 s100001 kernel: [3076872.019688] [<ffffffff810f9637>] ? sys_read+0x5f/0x6b
Aug 24 14:50:47 s100001 kernel: [3076872.019691] [<ffffffff8134fe60>] ? int_signal+0x12/0x17
Aug 24 14:50:47 s100001 kernel: [3076872.019694] INFO: task ceph-mon:1818 blocked for more than 120 seconds.
Aug 24 14:50:47 s100001 kernel: [3076872.019722] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 24 14:50:47 s100001 kernel: [3076872.019767] ceph-mon D ffff88082f2b3740 0 1818 1 0x00000000
Aug 24 14:50:47 s100001 kernel: [3076872.019770] ffff88080b92b6d0 0000000000000086 ffff880800000000 ffff88082bb9e200
Aug 24 14:50:47 s100001 kernel: [3076872.019774] 0000000000013740 ffff88080da6ffd8 ffff88080da6ffd8 ffff88080b92b6d0
Aug 24 14:50:47 s100001 kernel: [3076872.019777] ffff88080b92b6d0 000000010b92b6d0 0000000000000293 ffff88080b92b6d0
Aug 24 14:50:47 s100001 kernel: [3076872.019781] Call Trace:
Aug 24 14:50:47 s100001 kernel: [3076872.019784] [<ffffffff8104986f>] ? exit_mm+0x97/0x122
Aug 24 14:50:47 s100001 kernel: [3076872.019787] [<ffffffff81049b40>] ? do_exit+0x246/0x6fc
Aug 24 14:50:47 s100001 kernel: [3076872.019792] [<ffffffff810b5155>] ? generic_file_aio_write+0xa7/0xb5
Aug 24 14:50:47 s100001 kernel: [3076872.019795] [<ffffffff8104a276>] ? do_group_exit+0x74/0x9e
Aug 24 14:50:47 s100001 kernel: [3076872.019798] [<ffffffff81055bb8>] ? get_signal_to_deliver+0x46d/0x48f
Aug 24 14:50:47 s100001 kernel: [3076872.019801] [<ffffffff8100de33>] ? do_signal+0x38/0x610
Aug 24 14:50:47 s100001 kernel: [3076872.019805] [<ffffffff8100e441>] ? do_notify_resume+0x25/0x68
Aug 24 14:50:47 s100001 kernel: [3076872.019807] [<ffffffff810f96a2>] ? sys_write+0x5f/0x6b
Aug 24 14:50:47 s100001 kernel: [3076872.019810] [<ffffffff8134fe60>] ? int_signal+0x12/0x17
Aug 24 14:50:47 s100001 kernel: [3076872.019812] INFO: task ceph-mon:1819 blocked for more than 120 seconds.
Aug 24 14:50:47 s100001 kernel: [3076872.019841] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 24 14:50:47 s100001 kernel: [3076872.019885] ceph-mon D ffff88080bf7e400 0 1819 1 0x00000000
Aug 24 14:50:47 s100001 kernel: [3076872.019888] ffff88080bf7e400 0000000000000086 0000000000000000 ffff8807fa200180
Aug 24 14:50:47 s100001 kernel: [3076872.019892] 0000000000013740 ffff88080db2bfd8 ffff88080db2bfd8 ffff88080bf7e400
Aug 24 14:50:47 s100001 kernel: [3076872.019896] ffff88080bf7e400 ffff88080cba0800 ffff88080bf7e400 ffff88080bf7e400
Aug 24 14:50:47 s100001 kernel: [3076872.019900] Call Trace:
Aug 24 14:50:47 s100001 kernel: [3076872.019903] [<ffffffff8104986f>] ? exit_mm+0x97/0x122
Aug 24 14:50:47 s100001 kernel: [3076872.019906] [<ffffffff81049b40>] ? do_exit+0x246/0x6fc
Aug 24 14:50:47 s100001 kernel: [3076872.019909] [<ffffffff8104a276>] ? do_group_exit+0x74/0x9e
Aug 24 14:50:47 s100001 kernel: [3076872.019912] [<ffffffff81055bb8>] ? get_signal_to_deliver+0x46d/0x48f
Aug 24 14:50:47 s100001 kernel: [3076872.019915] [<ffffffff8100de33>] ? do_signal+0x38/0x610
Aug 24 14:50:47 s100001 kernel: [3076872.019919] [<ffffffff8106f2f9>] ? sys_futex+0x138/0x147
Aug 24 14:50:47 s100001 kernel: [3076872.019922] [<ffffffff8100e441>] ? do_notify_resume+0x25/0x68
Aug 24 14:50:47 s100001 kernel: [3076872.019925] [<ffffffff8134fe60>] ? int_signal+0x12/0x17
Aug 24 14:50:47 s100001 kernel: [3076872.019927] INFO: task ceph-mon:1820 blocked for more than 120 seconds.
Aug 24 14:50:47 s100001 kernel: [3076872.019956] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 24 14:50:47 s100001 kernel: [3076872.020000] ceph-mon D ffff88080bcd49b0 0 1820 1 0x00000000
Aug 24 14:50:47 s100001 kernel: [3076872.020003] ffff88080bcd49b0 0000000000000086 0000000000000246 ffff88080b977710
Aug 24 14:50:47 s100001 kernel: [3076872.020007] 0000000000013740 ffff88080ae6dfd8 ffff88080ae6dfd8 ffff88080bcd49b0
Aug 24 14:50:47 s100001 kernel: [3076872.020010] ffff88080bcd49b0 ffff88080cba0800 ffff88080bcd49b0 ffff88080bcd49b0
Aug 24 14:50:47 s100001 kernel: [3076872.020014] Call Trace:
Aug 24 14:50:47 s100001 kernel: [3076872.020017] [<ffffffff8104986f>] ? exit_mm+0x97/0x122
Aug 24 14:50:47 s100001 kernel: [3076872.020020] [<ffffffff81049b40>] ? do_exit+0x246/0x6fc
Aug 24 14:50:47 s100001 kernel: [3076872.020023] [<ffffffff8104a276>] ? do_group_exit+0x74/0x9e
Aug 24 14:50:47 s100001 kernel: [3076872.020026] [<ffffffff81055bb8>] ? get_signal_to_deliver+0x46d/0x48f
Aug 24 14:50:47 s100001 kernel: [3076872.020030] [<ffffffff8100de33>] ? do_signal+0x38/0x610
Aug 24 14:50:47 s100001 kernel: [3076872.020033] [<ffffffff8106f2f9>] ? sys_futex+0x138/0x147
Aug 24 14:50:47 s100001 kernel: [3076872.020036] [<ffffffff8100e441>] ? do_notify_resume+0x25/0x68
Aug 24 14:50:47 s100001 kernel: [3076872.020039] [<ffffffff8134fe60>] ? int_signal+0x12/0x17
Aug 24 15:17:01 s100001 /USR/SBIN/CRON[19946]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)

By looking at this log, can we tell what was going on? I restarted the mon and everything is back to normal.

Please let me know if I can provide other information.

Thanks
Xiaopong
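
A long hung-task dump like the syslog excerpt in this message is easier to read once it is reduced to one entry per blocked thread. The following is a minimal Python sketch, not part of the original report, that groups the stack symbols under each "INFO: task ... blocked" header and lists the symbols shared by every thread. The function names `summarize_hung_tasks` and `common_symbols` are hypothetical, and the regular expressions assume the one-entry-per-line syslog format shown in this message.

```python
import re

# Header of a hung-task report, e.g.
# "INFO: task ceph-mon:1686 blocked for more than 120 seconds."
HUNG_RE = re.compile(
    r"INFO: task (?P<comm>\S+):(?P<pid>\d+) blocked for more than \d+ seconds"
)
# One stack frame, e.g. "[<ffffffff8104986f>] ? exit_mm+0x97/0x122"
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?:\?\s+)?(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def summarize_hung_tasks(lines):
    """Map (command, pid) to the list of stack symbols reported for it."""
    reports = {}
    current = None
    for line in lines:
        header = HUNG_RE.search(line)
        if header:
            current = (header.group("comm"), int(header.group("pid")))
            reports[current] = []
            continue
        frame = FRAME_RE.search(line)
        if frame and current is not None:
            reports[current].append(frame.group("sym"))
    return reports


def common_symbols(reports):
    """Symbols present in every report, in the order of the first report."""
    traces = list(reports.values())
    if not traces:
        return []
    shared = set(traces[0]).intersection(*(set(t) for t in traces[1:]))
    return [sym for sym in dict.fromkeys(traces[0]) if sym in shared]


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python hung_tasks.py syslog.txt
    with open(sys.argv[1]) as fh:
        found = summarize_hung_tasks(fh)
    for (comm, pid), symbols in found.items():
        print(comm, pid, " <- ".join(symbols))
    print("shared by all threads:", ", ".join(common_symbols(found)))
```

Run against the excerpt above, a summary of this kind makes it easy to see which frames every ceph-mon thread has in common and which threads differ only in the system call they were interrupted in.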
* Re: mon crash on debian wheezy
From: Sage Weil @ 2012-08-24 16:28 UTC
To: Xiaopong Tran; +Cc: ceph-devel@vger.kernel.org

On Fri, 24 Aug 2012, Xiaopong Tran wrote:
> Hello,
>
> I've been running the 0.48argonaut on production for over a month
> without any issue. and today, I suddenly lost one mon. Taking a look
> into the syslog file, I see the following trace log. I just couldn't
> see what's wrong from the trace log. However, this event created
> a gigantic core file. Here's the size of the core file:
>
> -rw------- 1 root root 16085647360 Aug 24 14:53 core
>
> This happened while we were migrating data from our old storage
> to the ceph. We are running about 20 processes, migrating data
> into ceph, while there are about 30 more application processes
> reading from and writing new data to it.
>
> The following is from syslog:

We've seen these backtraces before too, but haven't figured out what causes them. (See, for example, http://tracker.newdream.net/issues/2026.)

Was there anything in the mon's log file? In most cases, a crash results in a stack trace of ceph-mon in the mon log file.

Glad to hear everything recovered nicely afterwards. :)

Thanks!
sage
> Aug 24 14:50:47 s100001 kernel: [3076872.020000] ceph-mon D > ffff88080bcd49b0 0 1820 1 0x00000000 > Aug 24 14:50:47 s100001 kernel: [3076872.020003] ffff88080bcd49b0 > 0000000000000086 0000000000000246 ffff88080b977710 > Aug 24 14:50:47 s100001 kernel: [3076872.020007] 0000000000013740 > ffff88080ae6dfd8 ffff88080ae6dfd8 ffff88080bcd49b0 > Aug 24 14:50:47 s100001 kernel: [3076872.020010] ffff88080bcd49b0 > ffff88080cba0800 ffff88080bcd49b0 ffff88080bcd49b0 > Aug 24 14:50:47 s100001 kernel: [3076872.020014] Call Trace: > Aug 24 14:50:47 s100001 kernel: [3076872.020017] [<ffffffff8104986f>] ? > exit_mm+0x97/0x122 > Aug 24 14:50:47 s100001 kernel: [3076872.020020] [<ffffffff81049b40>] ? > do_exit+0x246/0x6fc > Aug 24 14:50:47 s100001 kernel: [3076872.020023] [<ffffffff8104a276>] ? > do_group_exit+0x74/0x9e > Aug 24 14:50:47 s100001 kernel: [3076872.020026] [<ffffffff81055bb8>] ? > get_signal_to_deliver+0x46d/0x48f > Aug 24 14:50:47 s100001 kernel: [3076872.020030] [<ffffffff8100de33>] ? > do_signal+0x38/0x610 > Aug 24 14:50:47 s100001 kernel: [3076872.020033] [<ffffffff8106f2f9>] ? > sys_futex+0x138/0x147 > Aug 24 14:50:47 s100001 kernel: [3076872.020036] [<ffffffff8100e441>] ? > do_notify_resume+0x25/0x68 > Aug 24 14:50:47 s100001 kernel: [3076872.020039] [<ffffffff8134fe60>] ? > int_signal+0x12/0x17 > Aug 24 15:17:01 s100001 /USR/SBIN/CRON[19946]: (root) CMD ( cd / && > run-parts --report /etc/cron.hourly) > > By looking at this log, could we tell what was going on? I restarted mon > and everything is back to normal. > > Please let me if I can provide other information. > > Thanks > > Xiaopong > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > > ^ permalink raw reply [flat|nested] 5+ messages in thread
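A long run of hung-task warnings like the one quoted above is easier to read once it is condensed. What follows is a minimal Python sketch, assuming the syslog excerpt is available as plain text. The function name `summarize_hung_tasks` and both regular expressions are illustrative choices and are not part of Ceph or the kernel. It lists which tasks were reported as blocked and counts how often each function name appears in the call traces.

```python
import re
from collections import Counter

# Kernel hung-task warning, e.g.
# "INFO: task ceph-mon:1739 blocked for more than 120 seconds."
# "[\s>]+" tolerates the "> " quote markers left by wrapped mail text.
HUNG_TASK = re.compile(
    r"INFO: task (\S+):(\d+)[\s>]+blocked for more than (\d+) seconds"
)

# Call-trace frame, e.g. "? exit_mm+0x97/0x122".
FRAME = re.compile(r"\?[\s>]*([A-Za-z_]\w*)\+0x[0-9a-f]+/0x[0-9a-f]+")


def summarize_hung_tasks(text):
    """Return ([(task name, pid, seconds), ...], Counter of frame names)."""
    tasks = [
        (name, int(pid), int(secs))
        for name, pid, secs in HUNG_TASK.findall(text)
    ]
    frames = Counter(FRAME.findall(text))
    return tasks, frames


# A few lines taken from the syslog excerpt in this thread.
SAMPLE = """\
Aug 24 14:50:38 s100001 kernel: [3076872.019513] INFO: task ceph-mon:1739 blocked for more than 120 seconds.
Aug 24 14:50:38 s100001 kernel: [3076872.019560] Call Trace:
Aug 24 14:50:38 s100001 kernel: [3076872.019563] [<ffffffff8104986f>] ? exit_mm+0x97/0x122
Aug 24 14:50:38 s100001 kernel: [3076872.019566] [<ffffffff81049b40>] ? do_exit+0x246/0x6fc
"""

if __name__ == "__main__":
    tasks, frames = summarize_hung_tasks(SAMPLE)
    for name, pid, secs in tasks:
        print(f"{name} pid={pid} blocked > {secs}s")
    for func, count in frames.most_common():
        print(f"{count:3d}  {func}")
```

In the excerpts shown here, every trace includes get_signal_to_deliver, do_group_exit, do_exit and exit_mm. That is consistent with threads that were already exiting on a signal when the warning fired. This is only a reading of the traces, not a confirmed diagnosis.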
* Re: mon crash on debian wheezy 2012-08-24 16:28 ` Sage Weil @ 2012-08-28 14:50 ` Xiaopong Tran 2012-08-28 16:21 ` Gregory Farnum 0 siblings, 1 reply; 5+ messages in thread From: Xiaopong Tran @ 2012-08-28 14:50 UTC (permalink / raw) To: Sage Weil; +Cc: ceph-devel@vger.kernel.org On 08/25/2012 12:28 AM, Sage Weil wrote: > On Fri, 24 Aug 2012, Xiaopong Tran wrote: >> Hello, >> >> I've been running the 0.48argonaut on production for over a month >> without any issue. and today, I suddenly lost one mon. Taking a look >> into the syslog file, I see the following trace log. I just couldn't >> see what's wrong from the trace log. However, this event created >> a gigantic core file. Here's the size of the core file: >> >> -rw------- 1 root root 16085647360 Aug 24 14:53 core >> >> This happened while we were migrating data from our old storage >> to the ceph. We are running about 20 processes, migrating data >> into ceph, while there are about 30 more application processes >> reading from and writing new data to it. >> >> The following is from syslog: > > We've seen these backtraces before too, but haven't figured out what > causes them. (See, for example, http://tracker.newdream.net/issues/2026.) > > Was there anything in the mon's log file? In most cases, a crash results > in a stack trace of ceph-mon in the mon log file. > > Glad to hear everything recovered nicely afterwards. :) > > Thanks! > sage > Ah well, I got two crashes in less than 3 days. I browsed through the mon log files and the ceph log files, and there is nothing suspicious: no trace dump or anything. One thing I don't understand: after the mon has crashed and is no longer running, who is creating that empty mon log? The same question goes for the osd. I had two osds go down today, and I also see empty osd log files. And how does the crash end up generating such a huge core file? If there's any information I can provide, I'd be happy to do so. Thanks Xiaopong ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: mon crash on debian wheezy 2012-08-28 14:50 ` Xiaopong Tran @ 2012-08-28 16:21 ` Gregory Farnum 2012-08-29 1:56 ` Xiaopong Tran 0 siblings, 1 reply; 5+ messages in thread From: Gregory Farnum @ 2012-08-28 16:21 UTC (permalink / raw) To: Xiaopong Tran; +Cc: Sage Weil, ceph-devel@vger.kernel.org On Tue, Aug 28, 2012 at 7:50 AM, Xiaopong Tran <xiaopong.tran@gmail.com> wrote: > On 08/25/2012 12:28 AM, Sage Weil wrote: >> >> On Fri, 24 Aug 2012, Xiaopong Tran wrote: >>> >>> Hello, >>> >>> I've been running the 0.48argonaut on production for over a month >>> without any issue. and today, I suddenly lost one mon. Taking a look >>> into the syslog file, I see the following trace log. I just couldn't >>> see what's wrong from the trace log. However, this event created >>> a gigantic core file. Here's the size of the core file: >>> >>> -rw------- 1 root root 16085647360 Aug 24 14:53 core >>> >>> This happened while we were migrating data from our old storage >>> to the ceph. We are running about 20 processes, migrating data >>> into ceph, while there are about 30 more application processes >>> reading from and writing new data to it. >>> >>> The following is from syslog: >> >> >> We've seen these backtraces before too, but haven't figured out what >> causes them. (See, for example, http://tracker.newdream.net/issues/2026.) >> >> Was there anything in the mon's log file? In most cases, a crash results >> in a stack trace of ceph-mon in the mon log file. >> >> Glad to hear everything recovered nicely afterwards. :) >> >> Thanks! >> sage >> > > Ah well, I got two crashes in less than 3 days. I browsed thru the > mon log files, and the ceph log files, and there is nothing suspicious, > no trace dump or anything. > > One question I don't get is, after mon has crashed, it's not running > anymore, who is creating that empty mon log? The same question goes > for osd. I had two osd down today, and I also see empty osd log files. 
> > And how does the crash end up generating such a huge core file? > > If there's any information I can provide, I'd be happy to do so. Can you extract the backtrace from the core dump? ^ permalink raw reply [flat|nested] 5+ messages in thread
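One common way to get a backtrace out of a core file is to run gdb in batch mode against the binary that produced it. What follows is a minimal Python sketch under stated assumptions: gdb is installed, the binary path `/usr/bin/ceph-mon` and the core file name `core` are placeholders for the real locations, and the helper names `gdb_backtrace_cmd` and `extract_backtrace` are hypothetical. Without matching debug symbols, the frames will mostly show raw addresses.

```python
import shlex
import subprocess


def gdb_backtrace_cmd(binary, core):
    """Build a non-interactive gdb command that prints every thread's backtrace.

    --batch makes gdb exit after running the commands given with -ex;
    "thread apply all bt" prints a backtrace for each thread in the core.
    """
    return ["gdb", "--batch", "-ex", "thread apply all bt", binary, core]


def extract_backtrace(binary, core):
    """Run gdb and return its standard output (the backtraces) as text."""
    result = subprocess.run(
        gdb_backtrace_cmd(binary, core),
        capture_output=True,
        text=True,
        check=False,  # gdb may still print useful output on a non-zero exit
    )
    return result.stdout


if __name__ == "__main__":
    # Placeholder paths (assumptions); only prints the command, does not run gdb.
    print(shlex.join(gdb_backtrace_cmd("/usr/bin/ceph-mon", "core")))
```

With a core file as large as the one reported here, loading it can take a while. Redirecting the output of `extract_backtrace` to a file makes it easier to attach to a reply.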
* Re: mon crash on debian wheezy 2012-08-28 16:21 ` Gregory Farnum @ 2012-08-29 1:56 ` Xiaopong Tran 0 siblings, 0 replies; 5+ messages in thread From: Xiaopong Tran @ 2012-08-29 1:56 UTC (permalink / raw) To: Gregory Farnum; +Cc: Sage Weil, ceph-devel@vger.kernel.org On 08/29/2012 12:21 AM, Gregory Farnum wrote: > On Tue, Aug 28, 2012 at 7:50 AM, Xiaopong Tran <xiaopong.tran@gmail.com> wrote: >> On 08/25/2012 12:28 AM, Sage Weil wrote: >>> >>> On Fri, 24 Aug 2012, Xiaopong Tran wrote: >>>> >>>> Hello, >>>> >>>> I've been running the 0.48argonaut on production for over a month >>>> without any issue. and today, I suddenly lost one mon. Taking a look >>>> into the syslog file, I see the following trace log. I just couldn't >>>> see what's wrong from the trace log. However, this event created >>>> a gigantic core file. Here's the size of the core file: >>>> >>>> -rw------- 1 root root 16085647360 Aug 24 14:53 core >>>> >>>> This happened while we were migrating data from our old storage >>>> to the ceph. We are running about 20 processes, migrating data >>>> into ceph, while there are about 30 more application processes >>>> reading from and writing new data to it. >>>> >>>> The following is from syslog: >>> >>> >>> We've seen these backtraces before too, but haven't figured out what >>> causes them. (See, for example, http://tracker.newdream.net/issues/2026.) >>> >>> Was there anything in the mon's log file? In most cases, a crash results >>> in a stack trace of ceph-mon in the mon log file. >>> >>> Glad to hear everything recovered nicely afterwards. :) >>> >>> Thanks! >>> sage >>> >> >> Ah well, I got two crashes in less than 3 days. I browsed thru the >> mon log files, and the ceph log files, and there is nothing suspicious, >> no trace dump or anything. >> >> One question I don't get is, after mon has crashed, it's not running >> anymore, who is creating that empty mon log? The same question goes >> for osd. 
I had two osd down today, and I also see empty osd log files. >> >> And how does the crash end up generating such a huge core file? >> >> If there's any information I can provide, I'd be happy to do so. > > Can you extract the backtrace from the core dump? > Will try to do that, it's a big one though :) Thanks Xiaopong ^ permalink raw reply [flat|nested] 5+ messages in thread
Thread overview: 5+ messages
2012-08-24  8:12 mon crash on debian wheezy Xiaopong Tran
2012-08-24 16:28 ` Sage Weil
2012-08-28 14:50   ` Xiaopong Tran
2012-08-28 16:21     ` Gregory Farnum
2012-08-29  1:56       ` Xiaopong Tran