* Re: [ceph-users] Scrub shutdown the OSD process
       [not found] ` <1366046351.26980.0.camel@localhost>
@ 2013-04-15 17:57   ` Gregory Farnum
       [not found]     ` <CAPYLRzjz3FVvm5OH-TBN3R8tmv8LREFYKTEX41Zk7DTLjoPXkA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 7+ messages in thread
From: Gregory Farnum @ 2013-04-15 17:57 UTC (permalink / raw)
  To: ceph-devel@vger.kernel.org; +Cc: ceph-users@lists.ceph.com

On Mon, Apr 15, 2013 at 10:19 AM, Olivier Bonvalet <ceph.list@daevel.fr> wrote:
> On Monday, 15 April 2013 at 10:16 -0700, Gregory Farnum wrote:
>> Are you saying you saw this problem more than once, and so you
>> completely wiped the OSD in question, then brought it back into the
>> cluster, and now it's seeing this error again?
>
> Yes, it's exactly that.
>
>> Are any other OSDs experiencing this issue?
>
> No, only this one has the problem.

Did you run scrubs while this node was out of the cluster? If you
wiped the data and this is recurring then this is apparently an issue
with the cluster state, not just one node, and any other primary for
the broken PG(s) should crash as well. Can you verify by taking this
one down and then doing a full scrub?
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply [flat|nested] 7+ messages in thread
[parent not found: <CAPYLRzjz3FVvm5OH-TBN3R8tmv8LREFYKTEX41Zk7DTLjoPXkA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: Scrub shutdown the OSD process
       [not found]     ` <CAPYLRzjz3FVvm5OH-TBN3R8tmv8LREFYKTEX41Zk7DTLjoPXkA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-04-15 18:32       ` Olivier Bonvalet
  2013-04-16  6:56       ` Olivier Bonvalet
  1 sibling, 0 replies; 7+ messages in thread
From: Olivier Bonvalet @ 2013-04-15 18:32 UTC (permalink / raw)
  To: Gregory Farnum
  Cc: ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org

On Monday, 15 April 2013 at 10:57 -0700, Gregory Farnum wrote:
> On Mon, Apr 15, 2013 at 10:19 AM, Olivier Bonvalet <ceph.list@daevel.fr> wrote:
> > On Monday, 15 April 2013 at 10:16 -0700, Gregory Farnum wrote:
> >> Are you saying you saw this problem more than once, and so you
> >> completely wiped the OSD in question, then brought it back into the
> >> cluster, and now it's seeing this error again?
> >
> > Yes, it's exactly that.
> >
> >> Are any other OSDs experiencing this issue?
> >
> > No, only this one has the problem.
>
> Did you run scrubs while this node was out of the cluster? If you
> wiped the data and this is recurring then this is apparently an issue
> with the cluster state, not just one node, and any other primary for
> the broken PG(s) should crash as well. Can you verify by taking this
> one down and then doing a full scrub?
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

Also note that no PG is marked "corrupted". I only have PGs in the
"active+remapped" or "active+degraded" state.

^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Scrub shutdown the OSD process
       [not found]     ` <CAPYLRzjz3FVvm5OH-TBN3R8tmv8LREFYKTEX41Zk7DTLjoPXkA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2013-04-15 18:32       ` Olivier Bonvalet
@ 2013-04-16  6:56       ` Olivier Bonvalet
  2013-04-17  9:48         ` [ceph-users] " Olivier Bonvalet
  1 sibling, 1 reply; 7+ messages in thread
From: Olivier Bonvalet @ 2013-04-16 6:56 UTC (permalink / raw)
  To: Gregory Farnum
  Cc: ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org

On Monday, 15 April 2013 at 10:57 -0700, Gregory Farnum wrote:
> On Mon, Apr 15, 2013 at 10:19 AM, Olivier Bonvalet <ceph.list@daevel.fr> wrote:
> > On Monday, 15 April 2013 at 10:16 -0700, Gregory Farnum wrote:
> >> Are you saying you saw this problem more than once, and so you
> >> completely wiped the OSD in question, then brought it back into the
> >> cluster, and now it's seeing this error again?
> >
> > Yes, it's exactly that.
> >
> >> Are any other OSDs experiencing this issue?
> >
> > No, only this one has the problem.
>
> Did you run scrubs while this node was out of the cluster? If you
> wiped the data and this is recurring then this is apparently an issue
> with the cluster state, not just one node, and any other primary for
> the broken PG(s) should crash as well. Can you verify by taking this
> one down and then doing a full scrub?
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>

So I marked this OSD as "out" to rebalance the data and be able to
re-run a scrub. You are probably right, since 3 other OSDs on the same
host are now down.

I still don't have any PG in error (the cluster is in HEALTH_WARN
status), but something is going wrong.

In syslog I have : Apr 16 02:07:08 alim ceph-osd: 2013-04-16 02:07:06.742915 7fe651131700 -1 filestore(/var/lib/ceph/osd/ceph-31) could not find d452999/rb.0.2d76b.238e1f29.00000000162d/f99//4 in index: (2) No such file or directory Apr 16 02:07:08 alim ceph-osd: 2013-04-16 02:07:06.742999 7fe651131700 -1 filestore(/var/lib/ceph/osd/ceph-31) could not find 85242299/rb.0.1367.2ae8944a.000000001f9c/f98//4 in index: (2) No such file or directory Apr 16 03:41:11 alim ceph-osd: 2013-04-16 03:41:11.758150 7fe64f12d700 -1 osd.31 48020 heartbeat_check: no reply from osd.5 since 2013-04-16 03:40:50.349130 (cutoff 2013-04-16 03:40:51.758149) Apr 16 04:27:40 alim ceph-osd: 2013-04-16 04:27:40.529492 7fe65e14b700 -1 osd.31 48416 heartbeat_check: no reply from osd.26 since 2013-04-16 04:27:20.203868 (cutoff 2013-04-16 04:27:20.529489) Apr 16 04:27:41 alim ceph-osd: 2013-04-16 04:27:41.529609 7fe65e14b700 -1 osd.31 48416 heartbeat_check: no reply from osd.26 since 2013-04-16 04:27:20.203868 (cutoff 2013-04-16 04:27:21.529605) Apr 16 05:01:43 alim ceph-osd: 2013-04-16 05:01:43.440257 7fe64f12d700 -1 osd.31 48602 heartbeat_check: no reply from osd.4 since 2013-04-16 05:01:22.529918 (cutoff 2013-04-16 05:01:23.440257) Apr 16 05:01:43 alim ceph-osd: 2013-04-16 05:01:43.523985 7fe65e14b700 -1 osd.31 48602 heartbeat_check: no reply from osd.4 since 2013-04-16 05:01:22.529918 (cutoff 2013-04-16 05:01:23.523984) Apr 16 05:55:28 alim ceph-osd: 2013-04-16 05:55:27.770327 7fe65e14b700 -1 osd.31 48847 heartbeat_check: no reply from osd.26 since 2013-04-16 05:55:07.392502 (cutoff 2013-04-16 05:55:07.770323) Apr 16 05:55:28 alim ceph-osd: 2013-04-16 05:55:28.497600 7fe64f12d700 -1 osd.31 48847 heartbeat_check: no reply from osd.26 since 2013-04-16 05:55:07.392502 (cutoff 2013-04-16 05:55:08.497598) Apr 16 06:04:13 alim ceph-osd: 2013-04-16 06:04:13.051839 7fe65012f700 -1 osd/ReplicatedPG.cc: In function 'virtual void ReplicatedPG::_scrub(ScrubMap&)' thread 7fe65012f700 time 2013-04-16 
06:04:12.843977#012osd/ReplicatedPG.cc: 7188: FAILED assert(head != hobject_t())#012#012 ceph version 0.56.4-4-gd89ab0e (d89ab0ea6fa8d0961cad82f6a81eccbd3bbd3f55)#012 1: (ReplicatedPG::_scrub(ScrubMap&)+0x1a78) [0x57a038]#012 2: (PG::scrub_compare_maps()+0xeb8) [0x696c18]#012 3: (PG::chunky_scrub()+0x2d9) [0x6c37f9]#012 4: (PG::scrub()+0x145) [0x6c4e55]#012 5: (OSD::ScrubWQ::_process(PG*)+0xc) [0x64048c]#012 6: (ThreadPool::worker(ThreadPool::WorkThread*)+0x879) [0x815179]#012 7: (ThreadPool::WorkThread::entry()+0x10) [0x817980]#012 8: (()+0x68ca) [0x7fe6626558ca]#012 9: (clone()+0x6d) [0x7fe661184b6d]#012 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. Apr 16 06:04:13 alim ceph-osd: 0> 2013-04-16 06:04:13.051839 7fe65012f700 -1 osd/ReplicatedPG.cc: In function 'virtual void ReplicatedPG::_scrub(ScrubMap&)' thread 7fe65012f700 time 2013-04-16 06:04:12.843977#012osd/ReplicatedPG.cc: 7188: FAILED assert(head != hobject_t())#012#012 ceph version 0.56.4-4-gd89ab0e (d89ab0ea6fa8d0961cad82f6a81eccbd3bbd3f55)#012 1: (ReplicatedPG::_scrub(ScrubMap&)+0x1a78) [0x57a038]#012 2: (PG::scrub_compare_maps()+0xeb8) [0x696c18]#012 3: (PG::chunky_scrub()+0x2d9) [0x6c37f9]#012 4: (PG::scrub()+0x145) [0x6c4e55]#012 5: (OSD::ScrubWQ::_process(PG*)+0xc) [0x64048c]#012 6: (ThreadPool::worker(ThreadPool::WorkThread*)+0x879) [0x815179]#012 7: (ThreadPool::WorkThread::entry()+0x10) [0x817980]#012 8: (()+0x68ca) [0x7fe6626558ca]#012 9: (clone()+0x6d) [0x7fe661184b6d]#012 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. 
Apr 16 06:04:13 alim ceph-osd: 2013-04-16 06:04:13.277072 7fe65012f700 -1 *** Caught signal (Aborted) **#012 in thread 7fe65012f700#012#012 ceph version 0.56.4-4-gd89ab0e (d89ab0ea6fa8d0961cad82f6a81eccbd3bbd3f55)#012 1: /usr/bin/ceph-osd() [0x7a6289]#012 2: (()+0xeff0) [0x7fe66265dff0]#012 3: (gsignal()+0x35) [0x7fe6610e71b5]#012 4: (abort()+0x180) [0x7fe6610e9fc0]#012 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7fe66197bdc5]#012 6: (()+0xcb166) [0x7fe66197a166]#012 7: (()+0xcb193) [0x7fe66197a193]#012 8: (()+0xcb28e) [0x7fe66197a28e]#012 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7c9) [0x8f9549]#012 10: (ReplicatedPG::_scrub(ScrubMap&)+0x1a78) [0x57a038]#012 11: (PG::scrub_compare_maps()+0xeb8) [0x696c18]#012 12: (PG::chunky_scrub()+0x2d9) [0x6c37f9]#012 13: (PG::scrub()+0x145) [0x6c4e55]#012 14: (OSD::ScrubWQ::_process(PG*)+0xc) [0x64048c]#012 15: (ThreadPool::worker(ThreadPool::WorkThread*)+0x879) [0x815179]#012 16: (ThreadPool::WorkThread::entry()+0x10) [0x817980]#012 17: (()+0x68ca) [0x7fe6626558ca]#012 18: (clone()+0x6d) [0x7fe661184b6d]#012 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. 
Apr 16 06:04:13 alim ceph-osd: 0> 2013-04-16 06:04:13.277072 7fe65012f700 -1 *** Caught signal (Aborted) **#012 in thread 7fe65012f700#012#012 ceph version 0.56.4-4-gd89ab0e (d89ab0ea6fa8d0961cad82f6a81eccbd3bbd3f55)#012 1: /usr/bin/ceph-osd() [0x7a6289]#012 2: (()+0xeff0) [0x7fe66265dff0]#012 3: (gsignal()+0x35) [0x7fe6610e71b5]#012 4: (abort()+0x180) [0x7fe6610e9fc0]#012 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7fe66197bdc5]#012 6: (()+0xcb166) [0x7fe66197a166]#012 7: (()+0xcb193) [0x7fe66197a193]#012 8: (()+0xcb28e) [0x7fe66197a28e]#012 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7c9) [0x8f9549]#012 10: (ReplicatedPG::_scrub(ScrubMap&)+0x1a78) [0x57a038]#012 11: (PG::scrub_compare_maps()+0xeb8) [0x696c18]#012 12: (PG::chunky_scrub()+0x2d9) [0x6c37f9]#012 13: (PG::scrub()+0x145) [0x6c4e55]#012 14: (OSD::ScrubWQ::_process(PG*)+0xc) [0x64048c]#012 15: (ThreadPool::worker(ThreadPool::WorkThread*)+0x879) [0x815179]#012 16: (ThreadPool::WorkThread::entry()+0x10) [0x817980]#012 17: (()+0x68ca) [0x7fe6626558ca]#012 18: (clone()+0x6d) [0x7fe661184b6d]#012 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. 
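A note on reading the syslog excerpt above: the `#012` sequences look like rsyslog's control-character escaping, where a control character is written as `#` followed by its three-digit octal code (`#012` is a newline). Assuming that is what happened here, a small Python sketch like the following can turn one syslog record back into a readable multi-line backtrace. The function name `unescape_syslog` and the `sample` string are illustrative only, not part of Ceph or rsyslog.

```python
import re

# Assumption: control characters were escaped rsyslog-style, i.e. '#'
# followed by a three-digit octal code. Only the control range
# (000-037 and 177) is matched, so text such as "#42" is left alone.
_ESCAPE = re.compile(r"#(0[0-3][0-7]|177)")


def unescape_syslog(line: str) -> str:
    """Replace #ooo octal escapes (e.g. #012 = newline) with the real character."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 8)), line)


# Shortened sample taken from the syslog excerpt in this thread.
sample = (
    "osd/ReplicatedPG.cc: 7188: FAILED assert(head != hobject_t())#012#012"
    " ceph version 0.56.4-4-gd89ab0e#012"
    " 1: (ReplicatedPG::_scrub(ScrubMap&)+0x1a78) [0x57a038]#012"
    " 2: (PG::scrub_compare_maps()+0xeb8) [0x696c18]"
)

if __name__ == "__main__":
    print(unescape_syslog(sample))
```

A literal `#012` typed into a log message would also be converted, so this is only suitable for reading logs, not for rewriting them in place.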
and last lines from osd.24.log are : -10> 2013-04-16 08:08:54.991371 7f5bb4569700 2 osd.24 pg_epoch: 49397 pg[3.7c( v 49397'11387809 (49355'11386808,49397'11387809] local-les=49382 n=18795 ec=56 les/c 49382/49397 49375/49375/49375) [24,13,6] r=0 lpr=49375 mlcod 49397'11387808 active+clean+scrubbing+deep snaptrimq=[1f86~5,1f8c~8,2bd8~e]] scrub osd.6 has 10 items -9> 2013-04-16 08:08:54.991876 7f5bb4569700 2 osd.24 pg_epoch: 49397 pg[3.7c( v 49397'11387809 (49355'11386808,49397'11387809] local-les=49382 n=18795 ec=56 les/c 49382/49397 49375/49375/49375) [24,13,6] r=0 lpr=49375 mlcod 49397'11387808 active+clean+scrubbing+deep snaptrimq=[1f86~5,1f8c~8,2bd8~e]] 3.7c osd.24 inconsistent snapcolls on 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 found expected 12d7 3.7c osd.13 inconsistent snapcolls on 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 found expected 12d7 3.7c osd.13: soid 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 size 4194304 != known size 0, digest 1360833101 != known digest 0 3.7c osd.6 inconsistent snapcolls on 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 found expected 12d7 3.7c osd.6: soid 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 size 4194304 != known size 0, digest 1360833101 != known digest 0 -8> 2013-04-16 08:08:54.991906 7f5bb4569700 0 log [ERR] : 3.7c osd.24 inconsistent snapcolls on 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 found expected 12d7 -7> 2013-04-16 08:08:54.991913 7f5bb4569700 0 log [ERR] : 3.7c osd.13 inconsistent snapcolls on 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 found expected 12d7 -6> 2013-04-16 08:08:54.991915 7f5bb4569700 0 log [ERR] : 3.7c osd.13: soid 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 size 4194304 != known size 0, digest 1360833101 != known digest 0 -5> 2013-04-16 08:08:54.991917 7f5bb4569700 0 log [ERR] : 3.7c osd.6 inconsistent snapcolls on 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 found expected 12d7 -4> 2013-04-16 08:08:54.991919 7f5bb4569700 0 log [ERR] 
: 3.7c osd.6: soid 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 size 4194304 != known size 0, digest 1360833101 != known digest 0 -3> 2013-04-16 08:08:54.991986 7f5bb4569700 0 log [ERR] : deep-scrub 3.7c 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 on disk size (0) does not match object info size (4194304) -2> 2013-04-16 08:08:54.993813 7f5bbbd78700 5 --OSD::tracker-- reqid: client.1811920.1:164200641, seq: 4575, time: 2013-04-16 08:08:54.993812, event: op_applied, request: osd_op(client.1811920.1:164200641 rb.0.2d766.238e1f29.000000002c80 [write 442368~4096] 3.31aa16ec snapc 2bc6=[2bc6,2b84,2b33,2ada,2a0e,2940,285f,2775,2633,25e8,24d1,2452,2363,22c6,21a9,20db,1f98]) -1> 2013-04-16 08:08:54.993901 7f5bbbd78700 5 --OSD::tracker-- reqid: client.1811920.1:164200641, seq: 4575, time: 2013-04-16 08:08:54.993901, event: done, request: osd_op(client.1811920.1:164200641 rb.0.2d766.238e1f29.000000002c80 [write 442368~4096] 3.31aa16ec snapc 2bc6=[2bc6,2b84,2b33,2ada,2a0e,2940,285f,2775,2633,25e8,24d1,2452,2363,22c6,21a9,20db,1f98]) 0> 2013-04-16 08:08:54.994227 7f5bb4569700 -1 osd/ReplicatedPG.cc: In function 'virtual void ReplicatedPG::_scrub(ScrubMap&)' thread 7f5bb4569700 time 2013-04-16 08:08:54.991990 osd/ReplicatedPG.cc: 7188: FAILED assert(head != hobject_t()) ceph version 0.56.4-4-gd89ab0e (d89ab0ea6fa8d0961cad82f6a81eccbd3bbd3f55) 1: (ReplicatedPG::_scrub(ScrubMap&)+0x1a78) [0x57a038] 2: (PG::scrub_compare_maps()+0xeb8) [0x696c18] 3: (PG::chunky_scrub()+0x2d9) [0x6c37f9] 4: (PG::scrub()+0x145) [0x6c4e55] 5: (OSD::ScrubWQ::_process(PG*)+0xc) [0x64048c] 6: (ThreadPool::worker(ThreadPool::WorkThread*)+0x879) [0x815179] 7: (ThreadPool::WorkThread::entry()+0x10) [0x817980] 8: (()+0x68ca) [0x7f5bc72908ca] 9: (clone()+0x6d) [0x7f5bc5dbfb6d] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. 
--- logging levels --- 0/ 5 none 0/ 1 lockdep 0/ 1 context 1/ 1 crush 1/ 5 mds 1/ 5 mds_balancer 1/ 5 mds_locker 1/ 5 mds_log 1/ 5 mds_log_expire 1/ 5 mds_migrator 0/ 1 buffer 0/ 1 timer 0/ 1 filer 0/ 1 striper 0/ 1 objecter 0/ 5 rados 0/ 5 rbd 0/ 5 journaler 0/ 5 objectcacher 0/ 5 client 0/ 5 osd 0/ 5 optracker 0/ 5 objclass 1/ 3 filestore 1/ 3 journal 0/ 5 ms 1/ 5 mon 0/10 monc 0/ 5 paxos 0/ 5 tp 1/ 5 auth 1/ 5 crypto 1/ 1 finisher 1/ 5 heartbeatmap 1/ 5 perfcounter 1/ 5 rgw 1/ 5 hadoop 1/ 5 javaclient 1/ 5 asok 1/ 1 throttle -1/-1 (syslog threshold) -1/-1 (stderr threshold) max_recent 10000 max_new 1000 log_file /var/log/ceph/osd.24.log --- end dump of recent events --- 2013-04-16 08:08:55.163535 7f5bb4569700 -1 *** Caught signal (Aborted) ** in thread 7f5bb4569700 ceph version 0.56.4-4-gd89ab0e (d89ab0ea6fa8d0961cad82f6a81eccbd3bbd3f55) 1: /usr/bin/ceph-osd() [0x7a6289] 2: (()+0xeff0) [0x7f5bc7298ff0] 3: (gsignal()+0x35) [0x7f5bc5d221b5] 4: (abort()+0x180) [0x7f5bc5d24fc0] 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f5bc65b6dc5] 6: (()+0xcb166) [0x7f5bc65b5166] 7: (()+0xcb193) [0x7f5bc65b5193] 8: (()+0xcb28e) [0x7f5bc65b528e] 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7c9) [0x8f9549] 10: (ReplicatedPG::_scrub(ScrubMap&)+0x1a78) [0x57a038] 11: (PG::scrub_compare_maps()+0xeb8) [0x696c18] 12: (PG::chunky_scrub()+0x2d9) [0x6c37f9] 13: (PG::scrub()+0x145) [0x6c4e55] 14: (OSD::ScrubWQ::_process(PG*)+0xc) [0x64048c] 15: (ThreadPool::worker(ThreadPool::WorkThread*)+0x879) [0x815179] 16: (ThreadPool::WorkThread::entry()+0x10) [0x817980] 17: (()+0x68ca) [0x7f5bc72908ca] 18: (clone()+0x6d) [0x7f5bc5dbfb6d] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. 
--- begin dump of recent events --- 0> 2013-04-16 08:08:55.163535 7f5bb4569700 -1 *** Caught signal (Aborted) ** in thread 7f5bb4569700 ceph version 0.56.4-4-gd89ab0e (d89ab0ea6fa8d0961cad82f6a81eccbd3bbd3f55) 1: /usr/bin/ceph-osd() [0x7a6289] 2: (()+0xeff0) [0x7f5bc7298ff0] 3: (gsignal()+0x35) [0x7f5bc5d221b5] 4: (abort()+0x180) [0x7f5bc5d24fc0] 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f5bc65b6dc5] 6: (()+0xcb166) [0x7f5bc65b5166] 7: (()+0xcb193) [0x7f5bc65b5193] 8: (()+0xcb28e) [0x7f5bc65b528e] 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7c9) [0x8f9549] 10: (ReplicatedPG::_scrub(ScrubMap&)+0x1a78) [0x57a038] 11: (PG::scrub_compare_maps()+0xeb8) [0x696c18] 12: (PG::chunky_scrub()+0x2d9) [0x6c37f9] 13: (PG::scrub()+0x145) [0x6c4e55] 14: (OSD::ScrubWQ::_process(PG*)+0xc) [0x64048c] 15: (ThreadPool::worker(ThreadPool::WorkThread*)+0x879) [0x815179] 16: (ThreadPool::WorkThread::entry()+0x10) [0x817980] 17: (()+0x68ca) [0x7f5bc72908ca] 18: (clone()+0x6d) [0x7f5bc5dbfb6d] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. 
--- logging levels --- 0/ 5 none 0/ 1 lockdep 0/ 1 context 1/ 1 crush 1/ 5 mds 1/ 5 mds_balancer 1/ 5 mds_locker 1/ 5 mds_log 1/ 5 mds_log_expire 1/ 5 mds_migrator 0/ 1 buffer 0/ 1 timer 0/ 1 filer 0/ 1 striper 0/ 1 objecter 0/ 5 rados 0/ 5 rbd 0/ 5 journaler 0/ 5 objectcacher 0/ 5 client 0/ 5 osd 0/ 5 optracker 0/ 5 objclass 1/ 3 filestore 1/ 3 journal 0/ 5 ms 1/ 5 mon 0/10 monc 0/ 5 paxos 0/ 5 tp 1/ 5 auth 1/ 5 crypto 1/ 1 finisher 1/ 5 heartbeatmap 1/ 5 perfcounter 1/ 5 rgw 1/ 5 hadoop 1/ 5 javaclient 1/ 5 asok 1/ 1 throttle -1/-1 (syslog threshold) -1/-1 (stderr threshold) max_recent 10000 max_new 1000 log_file /var/log/ceph/osd.24.log --- end dump of recent events --- _______________________________________________ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ^ permalink raw reply [flat|nested] 7+ messages in thread
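For anyone triaging the osd.24.log excerpt above, here is a minimal Python sketch that pulls the PG id, the reporting OSD and the object out of the "size ... != known size ..." lines, so the affected objects can be listed per PG. The regular expression is written against the lines quoted in this thread only. The field order assumed for the object id (hash/name/snap/namespace/pool) is a guess based on how the ids look here, not something confirmed in the thread, and the names `SizeMismatch`, `parse_size_mismatch` and `split_soid` are made up for this example.

```python
import re
from typing import NamedTuple, Optional


class SizeMismatch(NamedTuple):
    pgid: str
    osd: int
    soid: str
    size: int
    known_size: int


# Matches e.g. "3.7c osd.13: soid <object id> size 4194304 != known size 0"
_SIZE_RE = re.compile(
    r"(?P<pgid>\d+\.[0-9a-f]+) osd\.(?P<osd>\d+): soid (?P<soid>\S+) "
    r"size (?P<size>\d+) != known size (?P<known>\d+)"
)


def parse_size_mismatch(line: str) -> Optional[SizeMismatch]:
    """Return the parsed mismatch, or None if the line is something else."""
    m = _SIZE_RE.search(line)
    if m is None:
        return None
    return SizeMismatch(
        pgid=m.group("pgid"),
        osd=int(m.group("osd")),
        soid=m.group("soid"),
        size=int(m.group("size")),
        known_size=int(m.group("known")),
    )


def split_soid(soid: str) -> dict:
    """Split an object id, ASSUMING the layout hash/name/snap/namespace/pool."""
    hash_hex, name, snap_hex, namespace, pool = soid.split("/")
    return {
        "hash": int(hash_hex, 16),
        "name": name,
        "snap": int(snap_hex, 16),  # assumed to be a hexadecimal snapshot id
        "namespace": namespace,
        "pool": int(pool),
    }


if __name__ == "__main__":
    line = (
        "log [ERR] : 3.7c osd.13: soid "
        "7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 "
        "size 4194304 != known size 0, digest 1360833101 != known digest 0"
    )
    hit = parse_size_mismatch(line)
    print(hit)
    print(split_soid(hit.soid))
```

In the excerpt, osd.13 and osd.6 both report 4194304 bytes (4 MiB) for the same object against a known size of 0, which matches the later "on disk size (0) does not match object info size (4194304)" error for PG 3.7c.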
* Re: [ceph-users] Scrub shutdown the OSD process 2013-04-16 6:56 ` Olivier Bonvalet @ 2013-04-17 9:48 ` Olivier Bonvalet 2013-04-17 18:52 ` Olivier Bonvalet 0 siblings, 1 reply; 7+ messages in thread From: Olivier Bonvalet @ 2013-04-17 9:48 UTC (permalink / raw) To: Gregory Farnum; +Cc: ceph-devel@vger.kernel.org, ceph-users@lists.ceph.com Le mardi 16 avril 2013 à 08:56 +0200, Olivier Bonvalet a écrit : > > > Le lundi 15 avril 2013 à 10:57 -0700, Gregory Farnum a écrit : > > On Mon, Apr 15, 2013 at 10:19 AM, Olivier Bonvalet <ceph.list@daevel.fr> wrote: > > > Le lundi 15 avril 2013 à 10:16 -0700, Gregory Farnum a écrit : > > >> Are you saying you saw this problem more than once, and so you > > >> completely wiped the OSD in question, then brought it back into the > > >> cluster, and now it's seeing this error again? > > > > > > Yes, it's exactly that. > > > > > > > > >> Are any other OSDs experiencing this issue? > > > > > > No, only this one have the problem. > > > > Did you run scrubs while this node was out of the cluster? If you > > wiped the data and this is recurring then this is apparently an issue > > with the cluster state, not just one node, and any other primary for > > the broken PG(s) should crash as well. Can you verify by taking this > > one down and then doing a full scrub? > > -Greg > > Software Engineer #42 @ http://inktank.com | http://ceph.com > > _______________________________________________ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > > So, I mark this OSD as "out" to balance data and be able to re-do a > scrum. You are probably right, since I now have 3 other OSD on the same > host which are down. > > I still haven't any PG in error (and the cluster is in HEALTH_WARN > status), but something goes wrong. 
> > In syslog I have : > > Apr 16 02:07:08 alim ceph-osd: 2013-04-16 02:07:06.742915 7fe651131700 -1 filestore(/var/lib/ceph/osd/ceph-31) could not find d452999/rb.0.2d76b.238e1f29.00000000162d/f99//4 in index: (2) No such file or directory > Apr 16 02:07:08 alim ceph-osd: 2013-04-16 02:07:06.742999 7fe651131700 -1 filestore(/var/lib/ceph/osd/ceph-31) could not find 85242299/rb.0.1367.2ae8944a.000000001f9c/f98//4 in index: (2) No such file or directory > Apr 16 03:41:11 alim ceph-osd: 2013-04-16 03:41:11.758150 7fe64f12d700 -1 osd.31 48020 heartbeat_check: no reply from osd.5 since 2013-04-16 03:40:50.349130 (cutoff 2013-04-16 03:40:51.758149) > Apr 16 04:27:40 alim ceph-osd: 2013-04-16 04:27:40.529492 7fe65e14b700 -1 osd.31 48416 heartbeat_check: no reply from osd.26 since 2013-04-16 04:27:20.203868 (cutoff 2013-04-16 04:27:20.529489) > Apr 16 04:27:41 alim ceph-osd: 2013-04-16 04:27:41.529609 7fe65e14b700 -1 osd.31 48416 heartbeat_check: no reply from osd.26 since 2013-04-16 04:27:20.203868 (cutoff 2013-04-16 04:27:21.529605) > Apr 16 05:01:43 alim ceph-osd: 2013-04-16 05:01:43.440257 7fe64f12d700 -1 osd.31 48602 heartbeat_check: no reply from osd.4 since 2013-04-16 05:01:22.529918 (cutoff 2013-04-16 05:01:23.440257) > Apr 16 05:01:43 alim ceph-osd: 2013-04-16 05:01:43.523985 7fe65e14b700 -1 osd.31 48602 heartbeat_check: no reply from osd.4 since 2013-04-16 05:01:22.529918 (cutoff 2013-04-16 05:01:23.523984) > Apr 16 05:55:28 alim ceph-osd: 2013-04-16 05:55:27.770327 7fe65e14b700 -1 osd.31 48847 heartbeat_check: no reply from osd.26 since 2013-04-16 05:55:07.392502 (cutoff 2013-04-16 05:55:07.770323) > Apr 16 05:55:28 alim ceph-osd: 2013-04-16 05:55:28.497600 7fe64f12d700 -1 osd.31 48847 heartbeat_check: no reply from osd.26 since 2013-04-16 05:55:07.392502 (cutoff 2013-04-16 05:55:08.497598) > Apr 16 06:04:13 alim ceph-osd: 2013-04-16 06:04:13.051839 7fe65012f700 -1 osd/ReplicatedPG.cc: In function 'virtual void ReplicatedPG::_scrub(ScrubMap&)' thread 
7fe65012f700 time 2013-04-16 06:04:12.843977#012osd/ReplicatedPG.cc: 7188: FAILED assert(head != hobject_t())#012#012 ceph version 0.56.4-4-gd89ab0e (d89ab0ea6fa8d0961cad82f6a81eccbd3bbd3f55)#012 1: (ReplicatedPG::_scrub(ScrubMap&)+0x1a78) [0x57a038]#012 2: (PG::scrub_compare_maps()+0xeb8) [0x696c18]#012 3: (PG::chunky_scrub()+0x2d9) [0x6c37f9]#012 4: (PG::scrub()+0x145) [0x6c4e55]#012 5: (OSD::ScrubWQ::_process(PG*)+0xc) [0x64048c]#012 6: (ThreadPool::worker(ThreadPool::WorkThread*)+0x879) [0x815179]#012 7: (ThreadPool::WorkThread::entry()+0x10) [0x817980]#012 8: (()+0x68ca) [0x7fe6626558ca]#012 9: (clone()+0x6d) [0x7fe661184b6d]#012 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. > Apr 16 06:04:13 alim ceph-osd: 0> 2013-04-16 06:04:13.051839 7fe65012f700 -1 osd/ReplicatedPG.cc: In function 'virtual void ReplicatedPG::_scrub(ScrubMap&)' thread 7fe65012f700 time 2013-04-16 06:04:12.843977#012osd/ReplicatedPG.cc: 7188: FAILED assert(head != hobject_t())#012#012 ceph version 0.56.4-4-gd89ab0e (d89ab0ea6fa8d0961cad82f6a81eccbd3bbd3f55)#012 1: (ReplicatedPG::_scrub(ScrubMap&)+0x1a78) [0x57a038]#012 2: (PG::scrub_compare_maps()+0xeb8) [0x696c18]#012 3: (PG::chunky_scrub()+0x2d9) [0x6c37f9]#012 4: (PG::scrub()+0x145) [0x6c4e55]#012 5: (OSD::ScrubWQ::_process(PG*)+0xc) [0x64048c]#012 6: (ThreadPool::worker(ThreadPool::WorkThread*)+0x879) [0x815179]#012 7: (ThreadPool::WorkThread::entry()+0x10) [0x817980]#012 8: (()+0x68ca) [0x7fe6626558ca]#012 9: (clone()+0x6d) [0x7fe661184b6d]#012 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. 
> Apr 16 06:04:13 alim ceph-osd: 2013-04-16 06:04:13.277072 7fe65012f700 -1 *** Caught signal (Aborted) **#012 in thread 7fe65012f700#012#012 ceph version 0.56.4-4-gd89ab0e (d89ab0ea6fa8d0961cad82f6a81eccbd3bbd3f55)#012 1: /usr/bin/ceph-osd() [0x7a6289]#012 2: (()+0xeff0) [0x7fe66265dff0]#012 3: (gsignal()+0x35) [0x7fe6610e71b5]#012 4: (abort()+0x180) [0x7fe6610e9fc0]#012 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7fe66197bdc5]#012 6: (()+0xcb166) [0x7fe66197a166]#012 7: (()+0xcb193) [0x7fe66197a193]#012 8: (()+0xcb28e) [0x7fe66197a28e]#012 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7c9) [0x8f9549]#012 10: (ReplicatedPG::_scrub(ScrubMap&)+0x1a78) [0x57a038]#012 11: (PG::scrub_compare_maps()+0xeb8) [0x696c18]#012 12: (PG::chunky_scrub()+0x2d9) [0x6c37f9]#012 13: (PG::scrub()+0x145) [0x6c4e55]#012 14: (OSD::ScrubWQ::_process(PG*)+0xc) [0x64048c]#012 15: (ThreadPool::worker(ThreadPool::WorkThread*)+0x879) [0x815179]#012 16: (ThreadPool::WorkThread::entry()+0x10) [0x817980]#012 17: (()+0x68ca) [0x7fe6626558ca]#012 18: (clone()+0x6d) [0x7fe661184b6d]#012 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. 
> Apr 16 06:04:13 alim ceph-osd: 0> 2013-04-16 06:04:13.277072 7fe65012f700 -1 *** Caught signal (Aborted) **#012 in thread 7fe65012f700#012#012 ceph version 0.56.4-4-gd89ab0e (d89ab0ea6fa8d0961cad82f6a81eccbd3bbd3f55)#012 1: /usr/bin/ceph-osd() [0x7a6289]#012 2: (()+0xeff0) [0x7fe66265dff0]#012 3: (gsignal()+0x35) [0x7fe6610e71b5]#012 4: (abort()+0x180) [0x7fe6610e9fc0]#012 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7fe66197bdc5]#012 6: (()+0xcb166) [0x7fe66197a166]#012 7: (()+0xcb193) [0x7fe66197a193]#012 8: (()+0xcb28e) [0x7fe66197a28e]#012 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7c9) [0x8f9549]#012 10: (ReplicatedPG::_scrub(ScrubMap&)+0x1a78) [0x57a038]#012 11: (PG::scrub_compare_maps()+0xeb8) [0x696c18]#012 12: (PG::chunky_scrub()+0x2d9) [0x6c37f9]#012 13: (PG::scrub()+0x145) [0x6c4e55]#012 14: (OSD::ScrubWQ::_process(PG*)+0xc) [0x64048c]#012 15: (ThreadPool::worker(ThreadPool::WorkThread*)+0x879) [0x815179]#012 16: (ThreadPool::WorkThread::entry()+0x10) [0x817980]#012 17: (()+0x68ca) [0x7fe6626558ca]#012 18: (clone()+0x6d) [0x7fe661184b6d]#012 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. 
> > > and last lines from osd.24.log are : > > -10> 2013-04-16 08:08:54.991371 7f5bb4569700 2 osd.24 pg_epoch: 49397 pg[3.7c( v 49397'11387809 (49355'11386808,49397'11387809] local-les=49382 n=18795 ec=56 les/c 49382/49397 49375/49375/49375) [24,13,6] r=0 lpr=49375 mlcod 49397'11387808 active+clean+scrubbing+deep snaptrimq=[1f86~5,1f8c~8,2bd8~e]] scrub osd.6 has 10 items > -9> 2013-04-16 08:08:54.991876 7f5bb4569700 2 osd.24 pg_epoch: 49397 pg[3.7c( v 49397'11387809 (49355'11386808,49397'11387809] local-les=49382 n=18795 ec=56 les/c 49382/49397 49375/49375/49375) [24,13,6] r=0 lpr=49375 mlcod 49397'11387808 active+clean+scrubbing+deep snaptrimq=[1f86~5,1f8c~8,2bd8~e]] 3.7c osd.24 inconsistent snapcolls on 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 found expected 12d7 > 3.7c osd.13 inconsistent snapcolls on 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 found expected 12d7 > 3.7c osd.13: soid 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 size 4194304 != known size 0, digest 1360833101 != known digest 0 > 3.7c osd.6 inconsistent snapcolls on 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 found expected 12d7 > 3.7c osd.6: soid 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 size 4194304 != known size 0, digest 1360833101 != known digest 0 > > -8> 2013-04-16 08:08:54.991906 7f5bb4569700 0 log [ERR] : 3.7c osd.24 inconsistent snapcolls on 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 found expected 12d7 > -7> 2013-04-16 08:08:54.991913 7f5bb4569700 0 log [ERR] : 3.7c osd.13 inconsistent snapcolls on 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 found expected 12d7 > -6> 2013-04-16 08:08:54.991915 7f5bb4569700 0 log [ERR] : 3.7c osd.13: soid 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 size 4194304 != known size 0, digest 1360833101 != known digest 0 > -5> 2013-04-16 08:08:54.991917 7f5bb4569700 0 log [ERR] : 3.7c osd.6 inconsistent snapcolls on 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 found expected 12d7 > -4> 2013-04-16 
08:08:54.991919 7f5bb4569700 0 log [ERR] : 3.7c osd.6: soid 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 size 4194304 != known size 0, digest 1360833101 != known digest 0 > -3> 2013-04-16 08:08:54.991986 7f5bb4569700 0 log [ERR] : deep-scrub 3.7c 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 on disk size (0) does not match object info size (4194304) > -2> 2013-04-16 08:08:54.993813 7f5bbbd78700 5 --OSD::tracker-- reqid: client.1811920.1:164200641, seq: 4575, time: 2013-04-16 08:08:54.993812, event: op_applied, request: osd_op(client.1811920.1:164200641 rb.0.2d766.238e1f29.000000002c80 [write 442368~4096] 3.31aa16ec snapc 2bc6=[2bc6,2b84,2b33,2ada,2a0e,2940,285f,2775,2633,25e8,24d1,2452,2363,22c6,21a9,20db,1f98]) > -1> 2013-04-16 08:08:54.993901 7f5bbbd78700 5 --OSD::tracker-- reqid: client.1811920.1:164200641, seq: 4575, time: 2013-04-16 08:08:54.993901, event: done, request: osd_op(client.1811920.1:164200641 rb.0.2d766.238e1f29.000000002c80 [write 442368~4096] 3.31aa16ec snapc 2bc6=[2bc6,2b84,2b33,2ada,2a0e,2940,285f,2775,2633,25e8,24d1,2452,2363,22c6,21a9,20db,1f98]) > 0> 2013-04-16 08:08:54.994227 7f5bb4569700 -1 osd/ReplicatedPG.cc: In function 'virtual void ReplicatedPG::_scrub(ScrubMap&)' thread 7f5bb4569700 time 2013-04-16 08:08:54.991990 > osd/ReplicatedPG.cc: 7188: FAILED assert(head != hobject_t()) > > ceph version 0.56.4-4-gd89ab0e (d89ab0ea6fa8d0961cad82f6a81eccbd3bbd3f55) > 1: (ReplicatedPG::_scrub(ScrubMap&)+0x1a78) [0x57a038] > 2: (PG::scrub_compare_maps()+0xeb8) [0x696c18] > 3: (PG::chunky_scrub()+0x2d9) [0x6c37f9] > 4: (PG::scrub()+0x145) [0x6c4e55] > 5: (OSD::ScrubWQ::_process(PG*)+0xc) [0x64048c] > 6: (ThreadPool::worker(ThreadPool::WorkThread*)+0x879) [0x815179] > 7: (ThreadPool::WorkThread::entry()+0x10) [0x817980] > 8: (()+0x68ca) [0x7f5bc72908ca] > 9: (clone()+0x6d) [0x7f5bc5dbfb6d] > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. 
> > --- logging levels --- > 0/ 5 none > 0/ 1 lockdep > 0/ 1 context > 1/ 1 crush > 1/ 5 mds > 1/ 5 mds_balancer > 1/ 5 mds_locker > 1/ 5 mds_log > 1/ 5 mds_log_expire > 1/ 5 mds_migrator > 0/ 1 buffer > 0/ 1 timer > 0/ 1 filer > 0/ 1 striper > 0/ 1 objecter > 0/ 5 rados > 0/ 5 rbd > 0/ 5 journaler > 0/ 5 objectcacher > 0/ 5 client > 0/ 5 osd > 0/ 5 optracker > 0/ 5 objclass > 1/ 3 filestore > 1/ 3 journal > 0/ 5 ms > 1/ 5 mon > 0/10 monc > 0/ 5 paxos > 0/ 5 tp > 1/ 5 auth > 1/ 5 crypto > 1/ 1 finisher > 1/ 5 heartbeatmap > 1/ 5 perfcounter > 1/ 5 rgw > 1/ 5 hadoop > 1/ 5 javaclient > 1/ 5 asok > 1/ 1 throttle > -1/-1 (syslog threshold) > -1/-1 (stderr threshold) > max_recent 10000 > max_new 1000 > log_file /var/log/ceph/osd.24.log > --- end dump of recent events --- > 2013-04-16 08:08:55.163535 7f5bb4569700 -1 *** Caught signal (Aborted) ** > in thread 7f5bb4569700 > > ceph version 0.56.4-4-gd89ab0e (d89ab0ea6fa8d0961cad82f6a81eccbd3bbd3f55) > 1: /usr/bin/ceph-osd() [0x7a6289] > 2: (()+0xeff0) [0x7f5bc7298ff0] > 3: (gsignal()+0x35) [0x7f5bc5d221b5] > 4: (abort()+0x180) [0x7f5bc5d24fc0] > 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f5bc65b6dc5] > 6: (()+0xcb166) [0x7f5bc65b5166] > 7: (()+0xcb193) [0x7f5bc65b5193] > 8: (()+0xcb28e) [0x7f5bc65b528e] > 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7c9) [0x8f9549] > 10: (ReplicatedPG::_scrub(ScrubMap&)+0x1a78) [0x57a038] > 11: (PG::scrub_compare_maps()+0xeb8) [0x696c18] > 12: (PG::chunky_scrub()+0x2d9) [0x6c37f9] > 13: (PG::scrub()+0x145) [0x6c4e55] > 14: (OSD::ScrubWQ::_process(PG*)+0xc) [0x64048c] > 15: (ThreadPool::worker(ThreadPool::WorkThread*)+0x879) [0x815179] > 16: (ThreadPool::WorkThread::entry()+0x10) [0x817980] > 17: (()+0x68ca) [0x7f5bc72908ca] > 18: (clone()+0x6d) [0x7f5bc5dbfb6d] > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. 
> > --- begin dump of recent events --- > 0> 2013-04-16 08:08:55.163535 7f5bb4569700 -1 *** Caught signal (Aborted) ** > in thread 7f5bb4569700 > > ceph version 0.56.4-4-gd89ab0e (d89ab0ea6fa8d0961cad82f6a81eccbd3bbd3f55) > 1: /usr/bin/ceph-osd() [0x7a6289] > 2: (()+0xeff0) [0x7f5bc7298ff0] > 3: (gsignal()+0x35) [0x7f5bc5d221b5] > 4: (abort()+0x180) [0x7f5bc5d24fc0] > 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f5bc65b6dc5] > 6: (()+0xcb166) [0x7f5bc65b5166] > 7: (()+0xcb193) [0x7f5bc65b5193] > 8: (()+0xcb28e) [0x7f5bc65b528e] > 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7c9) [0x8f9549] > 10: (ReplicatedPG::_scrub(ScrubMap&)+0x1a78) [0x57a038] > 11: (PG::scrub_compare_maps()+0xeb8) [0x696c18] > 12: (PG::chunky_scrub()+0x2d9) [0x6c37f9] > 13: (PG::scrub()+0x145) [0x6c4e55] > 14: (OSD::ScrubWQ::_process(PG*)+0xc) [0x64048c] > 15: (ThreadPool::worker(ThreadPool::WorkThread*)+0x879) [0x815179] > 16: (ThreadPool::WorkThread::entry()+0x10) [0x817980] > 17: (()+0x68ca) [0x7f5bc72908ca] > 18: (clone()+0x6d) [0x7f5bc5dbfb6d] > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. 
> > --- logging levels --- > 0/ 5 none > 0/ 1 lockdep > 0/ 1 context > 1/ 1 crush > 1/ 5 mds > 1/ 5 mds_balancer > 1/ 5 mds_locker > 1/ 5 mds_log > 1/ 5 mds_log_expire > 1/ 5 mds_migrator > 0/ 1 buffer > 0/ 1 timer > 0/ 1 filer > 0/ 1 striper > 0/ 1 objecter > 0/ 5 rados > 0/ 5 rbd > 0/ 5 journaler > 0/ 5 objectcacher > 0/ 5 client > 0/ 5 osd > 0/ 5 optracker > 0/ 5 objclass > 1/ 3 filestore > 1/ 3 journal > 0/ 5 ms > 1/ 5 mon > 0/10 monc > 0/ 5 paxos > 0/ 5 tp > 1/ 5 auth > 1/ 5 crypto > 1/ 1 finisher > 1/ 5 heartbeatmap > 1/ 5 perfcounter > 1/ 5 rgw > 1/ 5 hadoop > 1/ 5 javaclient > 1/ 5 asok > 1/ 1 throttle > -1/-1 (syslog threshold) > -1/-1 (stderr threshold) > max_recent 10000 > max_new 1000 > log_file /var/log/ceph/osd.24.log > --- end dump of recent events --- > > > > > > > > > > > > > > > > _______________________________________________ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com So, what can I do to fix that ? -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 7+ messages in thread
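For readers triaging a dump like the one above, the scrub size-mismatch errors can be pulled out mechanically. The following is a minimal Python sketch, not a Ceph tool: the regular expression is an assumption inferred only from the [ERR] lines quoted in this thread (ceph 0.56.4), and the function name size_mismatches is hypothetical.

```python
import re
from collections import defaultdict

# Assumed format, taken from the [ERR] lines quoted in this thread:
#   "<pg> osd.<id>: soid <object> size <bytes> != known size <bytes>"
SIZE_MISMATCH = re.compile(
    r"(?P<pg>\d+\.[0-9a-f]+) osd\.(?P<osd>\d+): soid (?P<soid>\S+) "
    r"size (?P<size>\d+) != known size (?P<known>\d+)"
)


def size_mismatches(log_text):
    """Map (pg, object) to a list of (osd, reported size, known size)."""
    found = defaultdict(list)
    for m in SIZE_MISMATCH.finditer(log_text):
        found[(m["pg"], m["soid"])].append(
            (int(m["osd"]), int(m["size"]), int(m["known"]))
        )
    return dict(found)


if __name__ == "__main__":
    # Message text copied from the dump above (timestamp prefix omitted).
    sample = (
        "log [ERR] : 3.7c osd.6: soid "
        "7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 "
        "size 4194304 != known size 0, digest 1360833101 != known digest 0"
    )
    for (pg, soid), hits in size_mismatches(sample).items():
        for osd, size, known in hits:
            print(f"pg {pg} osd.{osd} {soid}: {size} bytes vs known {known}")
```

The reported size of 4194304 bytes is exactly 4 MiB (4 * 1024 * 1024), so each error here compares a full 4 MiB object against a copy recorded as 0 bytes.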
* Re: Scrub shutdown the OSD process 2013-04-17 9:48 ` [ceph-users] " Olivier Bonvalet @ 2013-04-17 18:52 ` Olivier Bonvalet 2013-04-20 7:10 ` Olivier Bonvalet 0 siblings, 1 reply; 7+ messages in thread
From: Olivier Bonvalet @ 2013-04-17 18:52 UTC (permalink / raw)
To: ceph-devel-u79uwXL29TY76Z2rM5mHXA
Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org

Some additional information: today at 18:57:40, PG 3.1 [19,5,28] had a scrub date of "2013-03-28 08:38:12.858041", and OSD 28 was recovering. Ten minutes later (at 19:07:40), PG 3.1 had a scrub date of today.

But at 19:41:04 I saw an error in syslog:

osd.10 52042 heartbeat_check: no reply from osd.28 since 2013-04-17 19:40:43.565511

So, since 19:47:44, PG 3.1 [19,5] has been in the "active+degraded" state, and its scrub date has reverted to "2013-03-28 08:38:12.858041". And of course osd.28 is DOWN, because the process aborted:

 0> 2013-04-17 19:40:46.791010 7f6658f5a700 -1 *** Caught signal (Aborted) **
 in thread 7f6658f5a700

 ceph version 0.56.4-4-gd89ab0e (d89ab0ea6fa8d0961cad82f6a81eccbd3bbd3f55)
 1: /usr/bin/ceph-osd() [0x7a6289]
 2: (()+0xeff0) [0x7f666b488ff0]
 3: (gsignal()+0x35) [0x7f6669f121b5]
 4: (abort()+0x180) [0x7f6669f14fc0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f666a7a6dc5]
 6: (()+0xcb166) [0x7f666a7a5166]
 7: (()+0xcb193) [0x7f666a7a5193]
 8: (()+0xcb28e) [0x7f666a7a528e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7c9) [0x8f9549]
 10: (ReplicatedPG::_scrub(ScrubMap&)+0x1a78) [0x57a038]
 11: (PG::scrub_compare_maps()+0xeb8) [0x696c18]
 12: (PG::chunky_scrub()+0x2d9) [0x6c37f9]
 13: (PG::scrub()+0x145) [0x6c4e55]
 14: (OSD::ScrubWQ::_process(PG*)+0xc) [0x64048c]
 15: (ThreadPool::worker(ThreadPool::WorkThread*)+0x879) [0x815179]
 16: (ThreadPool::WorkThread::entry()+0x10) [0x817980]
 17: (()+0x68ca) [0x7f666b4808ca]
 18: (clone()+0x6d) [0x7f6669fafb6d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
What I don't understand is why the OSD process crashes instead of marking that PG "corrupted". Is that PG really corrupted, or is this just an OSD bug? Thanks, Olivier ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Scrub shutdown the OSD process 2013-04-17 18:52 ` Olivier Bonvalet @ 2013-04-20 7:10 ` Olivier Bonvalet 2013-04-22 17:05 ` Scrub shutdown the OSD process / data loss Olivier Bonvalet 0 siblings, 1 reply; 7+ messages in thread
From: Olivier Bonvalet @ 2013-04-20 7:10 UTC (permalink / raw)
To: ceph-devel-u79uwXL29TY76Z2rM5mHXA
Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org

On Wednesday, 17 April 2013 at 20:52 +0200, Olivier Bonvalet wrote:
> What I didn't understand is why the OSD process crash, instead of
> marking that PG "corrupted", and does that PG really "corrupted" are
> is
> this just an OSD bug ?

Once again, a bit more information: while looking into one of the faulty PGs (3.d), I found this:

 -592> 2013-04-20 08:31:56.838280 7f0f41d1b700 0 log [ERR] : 3.d osd.25 inconsistent snapcolls on a8620b0d/rb.0.15c26.238e1f29.000000004603/12d7//3 found expected 12d7
 -591> 2013-04-20 08:31:56.838284 7f0f41d1b700 0 log [ERR] : 3.d osd.4 inconsistent snapcolls on a8620b0d/rb.0.15c26.238e1f29.000000004603/12d7//3 found expected 12d7
 -590> 2013-04-20 08:31:56.838290 7f0f41d1b700 0 log [ERR] : 3.d osd.4: soid a8620b0d/rb.0.15c26.238e1f29.000000004603/12d7//3 size 4194304 != known size 0
 -589> 2013-04-20 08:31:56.838292 7f0f41d1b700 0 log [ERR] : 3.d osd.11 inconsistent snapcolls on a8620b0d/rb.0.15c26.238e1f29.000000004603/12d7//3 found expected 12d7
 -588> 2013-04-20 08:31:56.838294 7f0f41d1b700 0 log [ERR] : 3.d osd.11: soid a8620b0d/rb.0.15c26.238e1f29.000000004603/12d7//3 size 4194304 != known size 0
 -587> 2013-04-20 08:31:56.838395 7f0f41d1b700 0 log [ERR] : scrub 3.d a8620b0d/rb.0.15c26.238e1f29.000000004603/12d7//3 on disk size (0) does not match object info size (4194304)

I preferred to verify, so I checked the files on disk:

# md5sum /var/lib/ceph/osd/ceph-*/current/3.d_head/DIR_D/DIR_0/DIR_B/DIR_0/rb.0.15c26.238e1f29.000000004603__12d7_A8620B0D__3
217ac2518dfe9e1502e5bfedb8be29b8 /var/lib/ceph/osd/ceph-4/current/3.d_head/DIR_D/DIR_0/DIR_B/DIR_0/rb.0.15c26.238e1f29.000000004603__12d7_A8620B0D__3 (4MB)
217ac2518dfe9e1502e5bfedb8be29b8 /var/lib/ceph/osd/ceph-11/current/3.d_head/DIR_D/DIR_0/DIR_B/DIR_0/rb.0.15c26.238e1f29.000000004603__12d7_A8620B0D__3 (4MB)
d41d8cd98f00b204e9800998ecf8427e /var/lib/ceph/osd/ceph-25/current/3.d_head/DIR_D/DIR_0/DIR_B/DIR_0/rb.0.15c26.238e1f29.000000004603__12d7_A8620B0D__3 (0B)

So this object is identical on OSD 4 and 11, but is empty on OSD 25. Since 4 is the master, this should not be a problem, so I tried a repair, without any success:

ceph pg repair 3.d

Is there a way to force a rewrite of this replica?

^ permalink raw reply [flat|nested] 7+ messages in thread
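The md5sum comparison above can also be scripted when many objects need checking. What follows is a minimal Python sketch under stated assumptions, not a Ceph utility: the helper names (md5_of, group_replicas, demo_with_temp_files) are hypothetical, and the demo uses temporary files standing in for the three replica copies rather than real OSD paths. One fact worth noting is that d41d8cd98f00b204e9800998ecf8427e is the MD5 of zero bytes of input, which is consistent with the 0B copy reported on OSD 25.

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path

# MD5 of empty input; matches the digest md5sum printed for the 0B copy.
EMPTY_MD5 = hashlib.md5(b"").hexdigest()


def md5_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large objects are not read into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_replicas(paths):
    """Group replica file paths by digest; more than one group means a mismatch."""
    groups = defaultdict(list)
    for p in paths:
        groups[md5_of(p)].append(str(p))
    return dict(groups)


def demo_with_temp_files():
    """Simulate three copies of one object: two identical, one empty."""
    contents = [("osd-4", b"x" * 4096), ("osd-11", b"x" * 4096), ("osd-25", b"")]
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for name, data in contents:
            p = Path(tmp) / name
            p.write_bytes(data)
            paths.append(p)
        groups = group_replicas(paths)
    # Keep only file names, since the temporary directory is gone by now.
    return {d: sorted(Path(m).name for m in members) for d, members in groups.items()}


if __name__ == "__main__":
    for digest, members in demo_with_temp_files().items():
        note = " (empty file)" if digest == EMPTY_MD5 else ""
        print(digest, members, note)
```

On a real cluster the list of paths would come from the same glob used with md5sum above. Comparing digests only shows which copies differ; it does not by itself say which copy is the correct one.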
* Re: Scrub shutdown the OSD process / data loss 2013-04-20 7:10 ` Olivier Bonvalet @ 2013-04-22 17:05 ` Olivier Bonvalet 0 siblings, 0 replies; 7+ messages in thread
From: Olivier Bonvalet @ 2013-04-22 17:05 UTC (permalink / raw)
To: ceph-devel-u79uwXL29TY76Z2rM5mHXA
Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org

On Saturday, 20 April 2013 at 09:10 +0200, Olivier Bonvalet wrote:
> On Wednesday, 17 April 2013 at 20:52 +0200, Olivier Bonvalet wrote:
> > What I didn't understand is why the OSD process crash, instead of
> > marking that PG "corrupted", and does that PG really "corrupted" are
> > is
> > this just an OSD bug ?
>
> Once again, a bit more informations : by searching informations about
> one of this faulty PG (3.d), I found that :
>
> -592> 2013-04-20 08:31:56.838280 7f0f41d1b700 0 log [ERR] : 3.d osd.25 inconsistent snapcolls on a8620b0d/rb.0.15c26.238e1f29.000000004603/12d7//3 found expected 12d7
> -591> 2013-04-20 08:31:56.838284 7f0f41d1b700 0 log [ERR] : 3.d osd.4 inconsistent snapcolls on a8620b0d/rb.0.15c26.238e1f29.000000004603/12d7//3 found expected 12d7
> -590> 2013-04-20 08:31:56.838290 7f0f41d1b700 0 log [ERR] : 3.d osd.4: soid a8620b0d/rb.0.15c26.238e1f29.000000004603/12d7//3 size 4194304 != known size 0
> -589> 2013-04-20 08:31:56.838292 7f0f41d1b700 0 log [ERR] : 3.d osd.11 inconsistent snapcolls on a8620b0d/rb.0.15c26.238e1f29.000000004603/12d7//3 found expected 12d7
> -588> 2013-04-20 08:31:56.838294 7f0f41d1b700 0 log [ERR] : 3.d osd.11: soid a8620b0d/rb.0.15c26.238e1f29.000000004603/12d7//3 size 4194304 != known size 0
> -587> 2013-04-20 08:31:56.838395 7f0f41d1b700 0 log [ERR] : scrub 3.d a8620b0d/rb.0.15c26.238e1f29.000000004603/12d7//3 on disk size (0) does not match object info size (4194304)
>
> I prefered to verify, so I found that :
>
> # md5sum /var/lib/ceph/osd/ceph-*/current/3.d_head/DIR_D/DIR_0/DIR_B/DIR_0/rb.0.15c26.238e1f29.000000004603__12d7_A8620B0D__3
> 217ac2518dfe9e1502e5bfedb8be29b8 /var/lib/ceph/osd/ceph-4/current/3.d_head/DIR_D/DIR_0/DIR_B/DIR_0/rb.0.15c26.238e1f29.000000004603__12d7_A8620B0D__3 (4MB)
> 217ac2518dfe9e1502e5bfedb8be29b8 /var/lib/ceph/osd/ceph-11/current/3.d_head/DIR_D/DIR_0/DIR_B/DIR_0/rb.0.15c26.238e1f29.000000004603__12d7_A8620B0D__3 (4MB)
> d41d8cd98f00b204e9800998ecf8427e /var/lib/ceph/osd/ceph-25/current/3.d_head/DIR_D/DIR_0/DIR_B/DIR_0/rb.0.15c26.238e1f29.000000004603__12d7_A8620B0D__3 (0B)
>
> So this object is identical on OSD 4 and 11, but is empty on OSD 25.
> Since 4 is the master, this should not be a problem, so I try a repair,
> without any success :
> ceph pg repair 3.d
>
> Is there a way to force rewrite of this replica ?

I don't know if it's related, but I am seeing data loss on multiple RBD images in my cluster (corrupted filesystems, databases, and some empty files). I suppose it is related.

^ permalink raw reply [flat|nested] 7+ messages in thread
Thread overview: 7+ messages
[not found] <1366018923.3018.3.camel@localhost>
[not found] ` <CAPYLRziFNxnUhyxHRUCzZBF6oQwrWt8YdxBtDrKzMMNuyuW=YQ@mail.gmail.com>
[not found] ` <1366046351.26980.0.camel@localhost>
2013-04-15 17:57 ` [ceph-users] Scrub shutdown the OSD process Gregory Farnum
[not found] ` <CAPYLRzjz3FVvm5OH-TBN3R8tmv8LREFYKTEX41Zk7DTLjoPXkA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-04-15 18:32 ` Olivier Bonvalet
2013-04-16 6:56 ` Olivier Bonvalet
2013-04-17 9:48 ` [ceph-users] " Olivier Bonvalet
2013-04-17 18:52 ` Olivier Bonvalet
2013-04-20 7:10 ` Olivier Bonvalet
2013-04-22 17:05 ` Scrub shutdown the OSD process / data loss Olivier Bonvalet