Re: [ceph-users] Scrub shutdown the OSD process

* Re: [ceph-users] Scrub shutdown the OSD process
       [not found]   ` <1366046351.26980.0.camel@localhost>
@ 2013-04-15 17:57     ` Gregory Farnum
       [not found]       ` <CAPYLRzjz3FVvm5OH-TBN3R8tmv8LREFYKTEX41Zk7DTLjoPXkA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 7+ messages in thread
From: Gregory Farnum @ 2013-04-15 17:57 UTC (permalink / raw)
  To: ceph-devel@vger.kernel.org; +Cc: ceph-users@lists.ceph.com

On Mon, Apr 15, 2013 at 10:19 AM, Olivier Bonvalet <ceph.list@daevel.fr> wrote:
> On Monday 15 April 2013 at 10:16 -0700, Gregory Farnum wrote:
>> Are you saying you saw this problem more than once, and so you
>> completely wiped the OSD in question, then brought it back into the
>> cluster, and now it's seeing this error again?
>
> Yes, it's exactly that.
>
>
>> Are any other OSDs experiencing this issue?
>
> No, only this one has the problem.

Did you run scrubs while this node was out of the cluster? If you
wiped the data and this is recurring then this is apparently an issue
with the cluster state, not just one node, and any other primary for
the broken PG(s) should crash as well. Can you verify by taking this
one down and then doing a full scrub?
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

* Re: Scrub shutdown the OSD process
       [not found]       ` <CAPYLRzjz3FVvm5OH-TBN3R8tmv8LREFYKTEX41Zk7DTLjoPXkA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-04-15 18:32         ` Olivier Bonvalet
  2013-04-16  6:56         ` Olivier Bonvalet
  1 sibling, 0 replies; 7+ messages in thread
From: Olivier Bonvalet @ 2013-04-15 18:32 UTC (permalink / raw)
  To: Gregory Farnum
  Cc: ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org

Also note that no PG is marked "corrupted". I only have PGs in
"active+remapped" or "active+degraded".

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

* Re: Scrub shutdown the OSD process
       [not found]       ` <CAPYLRzjz3FVvm5OH-TBN3R8tmv8LREFYKTEX41Zk7DTLjoPXkA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2013-04-15 18:32         ` Olivier Bonvalet
@ 2013-04-16  6:56         ` Olivier Bonvalet
  2013-04-17  9:48           ` [ceph-users] " Olivier Bonvalet
  1 sibling, 1 reply; 7+ messages in thread
From: Olivier Bonvalet @ 2013-04-16  6:56 UTC (permalink / raw)
  To: Gregory Farnum
  Cc: ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org



So, I marked this OSD as "out" to rebalance the data and be able to
redo a scrub. You are probably right, since I now have 3 other OSDs on
the same host which are down.

I still don't have any PG in error (and the cluster is in HEALTH_WARN
status), but something is going wrong.

In syslog I have:

Apr 16 02:07:08 alim ceph-osd: 2013-04-16 02:07:06.742915 7fe651131700 -1 filestore(/var/lib/ceph/osd/ceph-31) could not find d452999/rb.0.2d76b.238e1f29.00000000162d/f99//4 in index: (2) No such file or directory
Apr 16 02:07:08 alim ceph-osd: 2013-04-16 02:07:06.742999 7fe651131700 -1 filestore(/var/lib/ceph/osd/ceph-31) could not find 85242299/rb.0.1367.2ae8944a.000000001f9c/f98//4 in index: (2) No such file or directory
Apr 16 03:41:11 alim ceph-osd: 2013-04-16 03:41:11.758150 7fe64f12d700 -1 osd.31 48020 heartbeat_check: no reply from osd.5 since 2013-04-16 03:40:50.349130 (cutoff 2013-04-16 03:40:51.758149)
Apr 16 04:27:40 alim ceph-osd: 2013-04-16 04:27:40.529492 7fe65e14b700 -1 osd.31 48416 heartbeat_check: no reply from osd.26 since 2013-04-16 04:27:20.203868 (cutoff 2013-04-16 04:27:20.529489)
Apr 16 04:27:41 alim ceph-osd: 2013-04-16 04:27:41.529609 7fe65e14b700 -1 osd.31 48416 heartbeat_check: no reply from osd.26 since 2013-04-16 04:27:20.203868 (cutoff 2013-04-16 04:27:21.529605)
Apr 16 05:01:43 alim ceph-osd: 2013-04-16 05:01:43.440257 7fe64f12d700 -1 osd.31 48602 heartbeat_check: no reply from osd.4 since 2013-04-16 05:01:22.529918 (cutoff 2013-04-16 05:01:23.440257)
Apr 16 05:01:43 alim ceph-osd: 2013-04-16 05:01:43.523985 7fe65e14b700 -1 osd.31 48602 heartbeat_check: no reply from osd.4 since 2013-04-16 05:01:22.529918 (cutoff 2013-04-16 05:01:23.523984)
Apr 16 05:55:28 alim ceph-osd: 2013-04-16 05:55:27.770327 7fe65e14b700 -1 osd.31 48847 heartbeat_check: no reply from osd.26 since 2013-04-16 05:55:07.392502 (cutoff 2013-04-16 05:55:07.770323)
Apr 16 05:55:28 alim ceph-osd: 2013-04-16 05:55:28.497600 7fe64f12d700 -1 osd.31 48847 heartbeat_check: no reply from osd.26 since 2013-04-16 05:55:07.392502 (cutoff 2013-04-16 05:55:08.497598)
Apr 16 06:04:13 alim ceph-osd: 2013-04-16 06:04:13.051839 7fe65012f700 -1 osd/ReplicatedPG.cc: In function 'virtual void ReplicatedPG::_scrub(ScrubMap&)' thread 7fe65012f700 time 2013-04-16 06:04:12.843977#012osd/ReplicatedPG.cc: 7188: FAILED assert(head != hobject_t())#012#012 ceph version 0.56.4-4-gd89ab0e (d89ab0ea6fa8d0961cad82f6a81eccbd3bbd3f55)#012 1: (ReplicatedPG::_scrub(ScrubMap&)+0x1a78) [0x57a038]#012 2: (PG::scrub_compare_maps()+0xeb8) [0x696c18]#012 3: (PG::chunky_scrub()+0x2d9) [0x6c37f9]#012 4: (PG::scrub()+0x145) [0x6c4e55]#012 5: (OSD::ScrubWQ::_process(PG*)+0xc) [0x64048c]#012 6: (ThreadPool::worker(ThreadPool::WorkThread*)+0x879) [0x815179]#012 7: (ThreadPool::WorkThread::entry()+0x10) [0x817980]#012 8: (()+0x68ca) [0x7fe6626558ca]#012 9: (clone()+0x6d) [0x7fe661184b6d]#012 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Apr 16 06:04:13 alim ceph-osd:      0> 2013-04-16 06:04:13.051839 7fe65012f700 -1 osd/ReplicatedPG.cc: In function 'virtual void ReplicatedPG::_scrub(ScrubMap&)' thread 7fe65012f700 time 2013-04-16 06:04:12.843977#012osd/ReplicatedPG.cc: 7188: FAILED assert(head != hobject_t())#012#012 ceph version 0.56.4-4-gd89ab0e (d89ab0ea6fa8d0961cad82f6a81eccbd3bbd3f55)#012 1: (ReplicatedPG::_scrub(ScrubMap&)+0x1a78) [0x57a038]#012 2: (PG::scrub_compare_maps()+0xeb8) [0x696c18]#012 3: (PG::chunky_scrub()+0x2d9) [0x6c37f9]#012 4: (PG::scrub()+0x145) [0x6c4e55]#012 5: (OSD::ScrubWQ::_process(PG*)+0xc) [0x64048c]#012 6: (ThreadPool::worker(ThreadPool::WorkThread*)+0x879) [0x815179]#012 7: (ThreadPool::WorkThread::entry()+0x10) [0x817980]#012 8: (()+0x68ca) [0x7fe6626558ca]#012 9: (clone()+0x6d) [0x7fe661184b6d]#012 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Apr 16 06:04:13 alim ceph-osd: 2013-04-16 06:04:13.277072 7fe65012f700 -1 *** Caught signal (Aborted) **#012 in thread 7fe65012f700#012#012 ceph version 0.56.4-4-gd89ab0e (d89ab0ea6fa8d0961cad82f6a81eccbd3bbd3f55)#012 1: /usr/bin/ceph-osd() [0x7a6289]#012 2: (()+0xeff0) [0x7fe66265dff0]#012 3: (gsignal()+0x35) [0x7fe6610e71b5]#012 4: (abort()+0x180) [0x7fe6610e9fc0]#012 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7fe66197bdc5]#012 6: (()+0xcb166) [0x7fe66197a166]#012 7: (()+0xcb193) [0x7fe66197a193]#012 8: (()+0xcb28e) [0x7fe66197a28e]#012 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7c9) [0x8f9549]#012 10: (ReplicatedPG::_scrub(ScrubMap&)+0x1a78) [0x57a038]#012 11: (PG::scrub_compare_maps()+0xeb8) [0x696c18]#012 12: (PG::chunky_scrub()+0x2d9) [0x6c37f9]#012 13: (PG::scrub()+0x145) [0x6c4e55]#012 14: (OSD::ScrubWQ::_process(PG*)+0xc) [0x64048c]#012 15: (ThreadPool::worker(ThreadPool::WorkThread*)+0x879) [0x815179]#012 16: (ThreadPool::WorkThread::entry()+0x10) [0x817980]#012 17: (()+0x68ca) [0x7fe6626558ca]#012 18: (clone()+0x6d) [0x7fe661184b6d]#012 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Apr 16 06:04:13 alim ceph-osd:      0> 2013-04-16 06:04:13.277072 7fe65012f700 -1 *** Caught signal (Aborted) **#012 in thread 7fe65012f700#012#012 ceph version 0.56.4-4-gd89ab0e (d89ab0ea6fa8d0961cad82f6a81eccbd3bbd3f55)#012 1: /usr/bin/ceph-osd() [0x7a6289]#012 2: (()+0xeff0) [0x7fe66265dff0]#012 3: (gsignal()+0x35) [0x7fe6610e71b5]#012 4: (abort()+0x180) [0x7fe6610e9fc0]#012 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7fe66197bdc5]#012 6: (()+0xcb166) [0x7fe66197a166]#012 7: (()+0xcb193) [0x7fe66197a193]#012 8: (()+0xcb28e) [0x7fe66197a28e]#012 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7c9) [0x8f9549]#012 10: (ReplicatedPG::_scrub(ScrubMap&)+0x1a78) [0x57a038]#012 11: (PG::scrub_compare_maps()+0xeb8) [0x696c18]#012 12: (PG::chunky_scrub()+0x2d9) [0x6c37f9]#012 13: (PG::scrub()+0x145) [0x6c4e55]#012 14: (OSD::ScrubWQ::_process(PG*)+0xc) [0x64048c]#012 15: (ThreadPool::worker(ThreadPool::WorkThread*)+0x879) [0x815179]#012 16: (ThreadPool::WorkThread::entry()+0x10) [0x817980]#012 17: (()+0x68ca) [0x7fe6626558ca]#012 18: (clone()+0x6d) [0x7fe661184b6d]#012 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
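
The `#012` sequences in the syslog lines above are most likely escaped
newlines: rsyslog, by default, writes control characters as `#` followed
by three octal digits. A minimal Python sketch, assuming that escaping
scheme, to expand them so the backtrace reads one frame per line. The
function name `expand_syslog_escapes` is hypothetical, and the sample is
a shortened copy of the assert line above.

```python
import re

# Matches '#' followed by exactly three octal digits, e.g. '#012' (newline).
# Assumption: no legitimate log text contains such a sequence.
ESCAPE_RE = re.compile(r"#([0-7]{3})")


def expand_syslog_escapes(line: str) -> str:
    """Replace each '#ooo' octal escape with the character it encodes."""
    return ESCAPE_RE.sub(lambda m: chr(int(m.group(1), 8)), line)


# Shortened copy of the assert line from the syslog excerpt.
sample = (
    "osd/ReplicatedPG.cc: 7188: FAILED assert(head != hobject_t())#012#012"
    " ceph version 0.56.4-4-gd89ab0e#012"
    " 1: (ReplicatedPG::_scrub(ScrubMap&)+0x1a78) [0x57a038]"
)

print(expand_syslog_escapes(sample))
```

The same function can be applied to every line of a saved syslog file
before searching it for `FAILED assert`.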


And the last lines from osd.24.log are:

   -10> 2013-04-16 08:08:54.991371 7f5bb4569700  2 osd.24 pg_epoch: 49397 pg[3.7c( v 49397'11387809 (49355'11386808,49397'11387809] local-les=49382 n=18795 ec=56 les/c 49382/49397 49375/49375/49375) [24,13,6] r=0 lpr=49375 mlcod 49397'11387808 active+clean+scrubbing+deep snaptrimq=[1f86~5,1f8c~8,2bd8~e]] scrub   osd.6 has 10 items
    -9> 2013-04-16 08:08:54.991876 7f5bb4569700  2 osd.24 pg_epoch: 49397 pg[3.7c( v 49397'11387809 (49355'11386808,49397'11387809] local-les=49382 n=18795 ec=56 les/c 49382/49397 49375/49375/49375) [24,13,6] r=0 lpr=49375 mlcod 49397'11387808 active+clean+scrubbing+deep snaptrimq=[1f86~5,1f8c~8,2bd8~e]] 3.7c osd.24 inconsistent snapcolls on 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 found  expected 12d7
3.7c osd.13 inconsistent snapcolls on 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 found  expected 12d7
3.7c osd.13: soid 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 size 4194304 != known size 0, digest 1360833101 != known digest 0
3.7c osd.6 inconsistent snapcolls on 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 found  expected 12d7
3.7c osd.6: soid 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 size 4194304 != known size 0, digest 1360833101 != known digest 0

    -8> 2013-04-16 08:08:54.991906 7f5bb4569700  0 log [ERR] : 3.7c osd.24 inconsistent snapcolls on 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 found  expected 12d7
    -7> 2013-04-16 08:08:54.991913 7f5bb4569700  0 log [ERR] : 3.7c osd.13 inconsistent snapcolls on 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 found  expected 12d7
    -6> 2013-04-16 08:08:54.991915 7f5bb4569700  0 log [ERR] : 3.7c osd.13: soid 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 size 4194304 != known size 0, digest 1360833101 != known digest 0
    -5> 2013-04-16 08:08:54.991917 7f5bb4569700  0 log [ERR] : 3.7c osd.6 inconsistent snapcolls on 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 found  expected 12d7
    -4> 2013-04-16 08:08:54.991919 7f5bb4569700  0 log [ERR] : 3.7c osd.6: soid 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 size 4194304 != known size 0, digest 1360833101 != known digest 0
    -3> 2013-04-16 08:08:54.991986 7f5bb4569700  0 log [ERR] : deep-scrub 3.7c 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 on disk size (0) does not match object info size (4194304)
    -2> 2013-04-16 08:08:54.993813 7f5bbbd78700  5 --OSD::tracker-- reqid: client.1811920.1:164200641, seq: 4575, time: 2013-04-16 08:08:54.993812, event: op_applied, request: osd_op(client.1811920.1:164200641 rb.0.2d766.238e1f29.000000002c80 [write 442368~4096] 3.31aa16ec snapc 2bc6=[2bc6,2b84,2b33,2ada,2a0e,2940,285f,2775,2633,25e8,24d1,2452,2363,22c6,21a9,20db,1f98])
    -1> 2013-04-16 08:08:54.993901 7f5bbbd78700  5 --OSD::tracker-- reqid: client.1811920.1:164200641, seq: 4575, time: 2013-04-16 08:08:54.993901, event: done, request: osd_op(client.1811920.1:164200641 rb.0.2d766.238e1f29.000000002c80 [write 442368~4096] 3.31aa16ec snapc 2bc6=[2bc6,2b84,2b33,2ada,2a0e,2940,285f,2775,2633,25e8,24d1,2452,2363,22c6,21a9,20db,1f98])
     0> 2013-04-16 08:08:54.994227 7f5bb4569700 -1 osd/ReplicatedPG.cc: In function 'virtual void ReplicatedPG::_scrub(ScrubMap&)' thread 7f5bb4569700 time 2013-04-16 08:08:54.991990
osd/ReplicatedPG.cc: 7188: FAILED assert(head != hobject_t())

 ceph version 0.56.4-4-gd89ab0e (d89ab0ea6fa8d0961cad82f6a81eccbd3bbd3f55)
 1: (ReplicatedPG::_scrub(ScrubMap&)+0x1a78) [0x57a038]
 2: (PG::scrub_compare_maps()+0xeb8) [0x696c18]
 3: (PG::chunky_scrub()+0x2d9) [0x6c37f9]
 4: (PG::scrub()+0x145) [0x6c4e55]
 5: (OSD::ScrubWQ::_process(PG*)+0xc) [0x64048c]
 6: (ThreadPool::worker(ThreadPool::WorkThread*)+0x879) [0x815179]
 7: (ThreadPool::WorkThread::entry()+0x10) [0x817980]
 8: (()+0x68ca) [0x7f5bc72908ca]
 9: (clone()+0x6d) [0x7f5bc5dbfb6d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   0/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 hadoop
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -1/-1 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/osd.24.log
--- end dump of recent events ---
2013-04-16 08:08:55.163535 7f5bb4569700 -1 *** Caught signal (Aborted) **
 in thread 7f5bb4569700

 ceph version 0.56.4-4-gd89ab0e (d89ab0ea6fa8d0961cad82f6a81eccbd3bbd3f55)
 1: /usr/bin/ceph-osd() [0x7a6289]
 2: (()+0xeff0) [0x7f5bc7298ff0]
 3: (gsignal()+0x35) [0x7f5bc5d221b5]
 4: (abort()+0x180) [0x7f5bc5d24fc0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f5bc65b6dc5]
 6: (()+0xcb166) [0x7f5bc65b5166]
 7: (()+0xcb193) [0x7f5bc65b5193]
 8: (()+0xcb28e) [0x7f5bc65b528e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7c9) [0x8f9549]
 10: (ReplicatedPG::_scrub(ScrubMap&)+0x1a78) [0x57a038]
 11: (PG::scrub_compare_maps()+0xeb8) [0x696c18]
 12: (PG::chunky_scrub()+0x2d9) [0x6c37f9]
 13: (PG::scrub()+0x145) [0x6c4e55]
 14: (OSD::ScrubWQ::_process(PG*)+0xc) [0x64048c]
 15: (ThreadPool::worker(ThreadPool::WorkThread*)+0x879) [0x815179]
 16: (ThreadPool::WorkThread::entry()+0x10) [0x817980]
 17: (()+0x68ca) [0x7f5bc72908ca]
 18: (clone()+0x6d) [0x7f5bc5dbfb6d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
     0> 2013-04-16 08:08:55.163535 7f5bb4569700 -1 *** Caught signal (Aborted) **
 in thread 7f5bb4569700

 ceph version 0.56.4-4-gd89ab0e (d89ab0ea6fa8d0961cad82f6a81eccbd3bbd3f55)
 1: /usr/bin/ceph-osd() [0x7a6289]
 2: (()+0xeff0) [0x7f5bc7298ff0]
 3: (gsignal()+0x35) [0x7f5bc5d221b5]
 4: (abort()+0x180) [0x7f5bc5d24fc0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f5bc65b6dc5]
 6: (()+0xcb166) [0x7f5bc65b5166]
 7: (()+0xcb193) [0x7f5bc65b5193]
 8: (()+0xcb28e) [0x7f5bc65b528e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7c9) [0x8f9549]
 10: (ReplicatedPG::_scrub(ScrubMap&)+0x1a78) [0x57a038]
 11: (PG::scrub_compare_maps()+0xeb8) [0x696c18]
 12: (PG::chunky_scrub()+0x2d9) [0x6c37f9]
 13: (PG::scrub()+0x145) [0x6c4e55]
 14: (OSD::ScrubWQ::_process(PG*)+0xc) [0x64048c]
 15: (ThreadPool::worker(ThreadPool::WorkThread*)+0x879) [0x815179]
 16: (ThreadPool::WorkThread::entry()+0x10) [0x817980]
 17: (()+0x68ca) [0x7f5bc72908ca]
 18: (clone()+0x6d) [0x7f5bc5dbfb6d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   0/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 hadoop
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -1/-1 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/osd.24.log
--- end dump of recent events ---
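
To see at a glance which PG and object the scrub complained about before
the assert fired, the `log [ERR]` lines in the dump above can be grouped
by PG and object. A rough Python sketch: the regular expressions are
assumptions based only on the line shapes in this log, and the function
name `summarize_scrub_errors` is hypothetical.

```python
import re
from collections import defaultdict

ERR_MARK = "log [ERR] : "
# PG ids look like '3.7c' in this log (pool number, dot, hex suffix).
PG_RE = re.compile(r"\b(\d+\.[0-9a-f]+)\b")
# Object ids look like 'hash/name/snap//pool' in this log.
OBJ_RE = re.compile(r"\b[0-9a-f]+/\S+//\d+")


def summarize_scrub_errors(lines):
    """Return {(pg, object): [error messages]} for all '[ERR]' lines."""
    summary = defaultdict(list)
    for line in lines:
        _, sep, msg = line.partition(ERR_MARK)
        if not sep:
            continue  # not an error line
        pg = PG_RE.search(msg)
        obj = OBJ_RE.search(msg)
        if pg and obj:
            summary[(pg.group(1), obj.group(0))].append(msg.strip())
    return dict(summary)


# Three lines copied from the osd.24.log excerpt.
sample_lines = [
    "    -8> 2013-04-16 08:08:54.991906 7f5bb4569700  0 log [ERR] : 3.7c osd.24 inconsistent snapcolls on 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 found  expected 12d7",
    "    -6> 2013-04-16 08:08:54.991915 7f5bb4569700  0 log [ERR] : 3.7c osd.13: soid 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 size 4194304 != known size 0, digest 1360833101 != known digest 0",
    "    -3> 2013-04-16 08:08:54.991986 7f5bb4569700  0 log [ERR] : deep-scrub 3.7c 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 on disk size (0) does not match object info size (4194304)",
]

for (pg, obj), msgs in summarize_scrub_errors(sample_lines).items():
    print(pg, obj, len(msgs), "error line(s)")
```

In this dump every error line points at the same snapshot object in PG
3.7c.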

* Re: [ceph-users] Scrub shutdown the OSD process
  2013-04-16  6:56         ` Olivier Bonvalet
@ 2013-04-17  9:48           ` Olivier Bonvalet
  2013-04-17 18:52             ` Olivier Bonvalet
  0 siblings, 1 reply; 7+ messages in thread
From: Olivier Bonvalet @ 2013-04-17  9:48 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel@vger.kernel.org, ceph-users@lists.ceph.com

On Tuesday 16 April 2013 at 08:56 +0200, Olivier Bonvalet wrote:
>  7: (()+0xcb193) [0x7f5bc65b5193]
>  8: (()+0xcb28e) [0x7f5bc65b528e]
>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7c9) [0x8f9549]
>  10: (ReplicatedPG::_scrub(ScrubMap&)+0x1a78) [0x57a038]
>  11: (PG::scrub_compare_maps()+0xeb8) [0x696c18]
>  12: (PG::chunky_scrub()+0x2d9) [0x6c37f9]
>  13: (PG::scrub()+0x145) [0x6c4e55]
>  14: (OSD::ScrubWQ::_process(PG*)+0xc) [0x64048c]
>  15: (ThreadPool::worker(ThreadPool::WorkThread*)+0x879) [0x815179]
>  16: (ThreadPool::WorkThread::entry()+0x10) [0x817980]
>  17: (()+0x68ca) [0x7f5bc72908ca]
>  18: (clone()+0x6d) [0x7f5bc5dbfb6d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> 
> --- logging levels ---
>    0/ 5 none
>    0/ 1 lockdep
>    0/ 1 context
>    1/ 1 crush
>    1/ 5 mds
>    1/ 5 mds_balancer
>    1/ 5 mds_locker
>    1/ 5 mds_log
>    1/ 5 mds_log_expire
>    1/ 5 mds_migrator
>    0/ 1 buffer
>    0/ 1 timer
>    0/ 1 filer
>    0/ 1 striper
>    0/ 1 objecter
>    0/ 5 rados
>    0/ 5 rbd
>    0/ 5 journaler
>    0/ 5 objectcacher
>    0/ 5 client
>    0/ 5 osd
>    0/ 5 optracker
>    0/ 5 objclass
>    1/ 3 filestore
>    1/ 3 journal
>    0/ 5 ms
>    1/ 5 mon
>    0/10 monc
>    0/ 5 paxos
>    0/ 5 tp
>    1/ 5 auth
>    1/ 5 crypto
>    1/ 1 finisher
>    1/ 5 heartbeatmap
>    1/ 5 perfcounter
>    1/ 5 rgw
>    1/ 5 hadoop
>    1/ 5 javaclient
>    1/ 5 asok
>    1/ 1 throttle
>   -1/-1 (syslog threshold)
>   -1/-1 (stderr threshold)
>   max_recent     10000
>   max_new         1000
>   log_file /var/log/ceph/osd.24.log
> --- end dump of recent events ---


So, what can I do to fix this?




* Re: Scrub shutdown the OSD process
  2013-04-17  9:48           ` [ceph-users] " Olivier Bonvalet
@ 2013-04-17 18:52             ` Olivier Bonvalet
  2013-04-20  7:10               ` Olivier Bonvalet
  0 siblings, 1 reply; 7+ messages in thread
From: Olivier Bonvalet @ 2013-04-17 18:52 UTC (permalink / raw)
  To: ceph-devel-u79uwXL29TY76Z2rM5mHXA
  Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org

Some additional information:

Today at 18:57:40, PG 3.1 [19,5,28] had a scrub date of
"2013-03-28 08:38:12.858041", and OSD 28 was recovering.

Ten minutes later (at 19:07:40), PG 3.1 had a scrub date of today.

But at 19:41:04 I saw an error in syslog:
	osd.10 52042 heartbeat_check: no reply from osd.28 since 2013-04-17 19:40:43.565511

So, since 19:47:44, PG 3.1 [19,5] has been in the "active+degraded" state
and its scrub date has reverted to "2013-03-28 08:38:12.858041". And of
course osd.28 is DOWN, because the process aborted:

     0> 2013-04-17 19:40:46.791010 7f6658f5a700 -1 *** Caught signal (Aborted) **
 in thread 7f6658f5a700

 ceph version 0.56.4-4-gd89ab0e (d89ab0ea6fa8d0961cad82f6a81eccbd3bbd3f55)
 1: /usr/bin/ceph-osd() [0x7a6289]
 2: (()+0xeff0) [0x7f666b488ff0]
 3: (gsignal()+0x35) [0x7f6669f121b5]
 4: (abort()+0x180) [0x7f6669f14fc0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f666a7a6dc5]
 6: (()+0xcb166) [0x7f666a7a5166]
 7: (()+0xcb193) [0x7f666a7a5193]
 8: (()+0xcb28e) [0x7f666a7a528e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7c9) [0x8f9549]
 10: (ReplicatedPG::_scrub(ScrubMap&)+0x1a78) [0x57a038]
 11: (PG::scrub_compare_maps()+0xeb8) [0x696c18]
 12: (PG::chunky_scrub()+0x2d9) [0x6c37f9]
 13: (PG::scrub()+0x145) [0x6c4e55]
 14: (OSD::ScrubWQ::_process(PG*)+0xc) [0x64048c]
 15: (ThreadPool::worker(ThreadPool::WorkThread*)+0x879) [0x815179]
 16: (ThreadPool::WorkThread::entry()+0x10) [0x817980]
 17: (()+0x68ca) [0x7f666b4808ca]
 18: (clone()+0x6d) [0x7f6669fafb6d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.


What I don't understand is why the OSD process crashes instead of
marking that PG as "corrupted". And is that PG really corrupted, or is
this just an OSD bug?

Thanks,
Olivier


* Re: Scrub shutdown the OSD process
  2013-04-17 18:52             ` Olivier Bonvalet
@ 2013-04-20  7:10               ` Olivier Bonvalet
  2013-04-22 17:05                 ` Scrub shutdown the OSD process / data loss Olivier Bonvalet
  0 siblings, 1 reply; 7+ messages in thread
From: Olivier Bonvalet @ 2013-04-20  7:10 UTC (permalink / raw)
  To: ceph-devel-u79uwXL29TY76Z2rM5mHXA
  Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org

On Wednesday, 17 April 2013 at 20:52 +0200, Olivier Bonvalet wrote:
> What I don't understand is why the OSD process crashes instead of
> marking that PG as "corrupted". And is that PG really corrupted, or
> is this just an OSD bug?

Once again, a bit more information: while looking into one of the
faulty PGs (3.d), I found this:

  -592> 2013-04-20 08:31:56.838280 7f0f41d1b700  0 log [ERR] : 3.d osd.25 inconsistent snapcolls on a8620b0d/rb.0.15c26.238e1f29.000000004603/12d7//3 found  expected 12d7
  -591> 2013-04-20 08:31:56.838284 7f0f41d1b700  0 log [ERR] : 3.d osd.4 inconsistent snapcolls on a8620b0d/rb.0.15c26.238e1f29.000000004603/12d7//3 found  expected 12d7
  -590> 2013-04-20 08:31:56.838290 7f0f41d1b700  0 log [ERR] : 3.d osd.4: soid a8620b0d/rb.0.15c26.238e1f29.000000004603/12d7//3 size 4194304 != known size 0
  -589> 2013-04-20 08:31:56.838292 7f0f41d1b700  0 log [ERR] : 3.d osd.11 inconsistent snapcolls on a8620b0d/rb.0.15c26.238e1f29.000000004603/12d7//3 found  expected 12d7
  -588> 2013-04-20 08:31:56.838294 7f0f41d1b700  0 log [ERR] : 3.d osd.11: soid a8620b0d/rb.0.15c26.238e1f29.000000004603/12d7//3 size 4194304 != known size 0
  -587> 2013-04-20 08:31:56.838395 7f0f41d1b700  0 log [ERR] : scrub 3.d a8620b0d/rb.0.15c26.238e1f29.000000004603/12d7//3 on disk size (0) does not match object info size (4194304)
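
To see at a glance which OSDs are reporting what, the per-OSD error lines
above can be grouped with a few lines of Python. This is only a sketch:
`errors_by_osd` and the regular expression are hypothetical helpers written
for this log excerpt, not part of Ceph, and the pattern assumes the
"log [ERR] : <pg> osd.<id>" layout seen in these lines. Lines without an
"osd.<id>" field, such as the final "scrub 3.d ..." summary, are skipped.

```python
import re
from collections import defaultdict

# Assumed layout, taken from the excerpt above:
#   ... log [ERR] : <pg> osd.<id>[:] <message>
ERR_RE = re.compile(r"log \[ERR\] : (\d+\.[0-9a-f]+) osd\.(\d+):? (.*)$")

def errors_by_osd(lines):
    """Group scrub error messages as {(pg, osd_id): [message, ...]}."""
    grouped = defaultdict(list)
    for line in lines:
        match = ERR_RE.search(line)
        if match:
            pg, osd, message = match.groups()
            grouped[(pg, int(osd))].append(message)
    return dict(grouped)

# Object name and log lines copied from the excerpt above.
OBJ = "a8620b0d/rb.0.15c26.238e1f29.000000004603/12d7//3"
sample = [
    "  -592> 2013-04-20 08:31:56.838280 7f0f41d1b700  0 log [ERR] : 3.d osd.25 "
    f"inconsistent snapcolls on {OBJ} found  expected 12d7",
    "  -591> 2013-04-20 08:31:56.838284 7f0f41d1b700  0 log [ERR] : 3.d osd.4 "
    f"inconsistent snapcolls on {OBJ} found  expected 12d7",
    "  -590> 2013-04-20 08:31:56.838290 7f0f41d1b700  0 log [ERR] : 3.d osd.4: "
    f"soid {OBJ} size 4194304 != known size 0",
    "  -587> 2013-04-20 08:31:56.838395 7f0f41d1b700  0 log [ERR] : scrub 3.d "
    f"{OBJ} on disk size (0) does not match object info size (4194304)",
]

grouped = errors_by_osd(sample)
for (pg, osd), messages in sorted(grouped.items()):
    print(f"pg {pg} osd.{osd}: {len(messages)} error(s)")
    for message in messages:
        print("   ", message)
```

Feeding it a whole OSD log instead of `sample` should work the same way,
as long as the lines keep this layout.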


I wanted to verify this, and found the following:

# md5sum /var/lib/ceph/osd/ceph-*/current/3.d_head/DIR_D/DIR_0/DIR_B/DIR_0/rb.0.15c26.238e1f29.000000004603__12d7_A8620B0D__3
217ac2518dfe9e1502e5bfedb8be29b8  /var/lib/ceph/osd/ceph-4/current/3.d_head/DIR_D/DIR_0/DIR_B/DIR_0/rb.0.15c26.238e1f29.000000004603__12d7_A8620B0D__3 (4MB)
217ac2518dfe9e1502e5bfedb8be29b8  /var/lib/ceph/osd/ceph-11/current/3.d_head/DIR_D/DIR_0/DIR_B/DIR_0/rb.0.15c26.238e1f29.000000004603__12d7_A8620B0D__3 (4MB)
d41d8cd98f00b204e9800998ecf8427e  /var/lib/ceph/osd/ceph-25/current/3.d_head/DIR_D/DIR_0/DIR_B/DIR_0/rb.0.15c26.238e1f29.000000004603__12d7_A8620B0D__3 (0B)


So this object is identical on OSDs 4 and 11, but is empty on OSD 25.
Since OSD 4 is the primary, this should not be a problem, so I tried a
repair, without any success:
    ceph pg repair 3.d
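
The checksum comparison above can be sketched in Python as follows. It is
an illustration only: `group_replicas` is a hypothetical helper, and it
assumes the OSD id can be read from the "/ceph-<id>/" path component as in
the md5sum output above. The only outside fact it relies on is that
d41d8cd98f00b204e9800998ecf8427e is the MD5 digest of zero bytes, which is
why the copy on OSD 25 shows up as empty.

```python
import hashlib
import re
from collections import defaultdict

# MD5 of zero bytes: a replica with this digest is an empty file.
EMPTY_MD5 = hashlib.md5(b"").hexdigest()

# Assumption: the OSD id is the number in the ".../ceph-<id>/..." component.
OSD_RE = re.compile(r"/ceph-(\d+)/")

def group_replicas(md5sum_output):
    """Map each digest to the sorted list of OSD ids holding that content."""
    groups = defaultdict(list)
    for line in md5sum_output.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # not a "<digest>  <path>" line
        digest, path = parts[0], parts[1]
        match = OSD_RE.search(path)
        if match:
            groups[digest].append(int(match.group(1)))
    return {digest: sorted(osds) for digest, osds in groups.items()}

# Digests and path copied from the md5sum output above.
OBJ = ("current/3.d_head/DIR_D/DIR_0/DIR_B/DIR_0/"
       "rb.0.15c26.238e1f29.000000004603__12d7_A8620B0D__3")
output = "\n".join(
    f"{digest}  /var/lib/ceph/osd/ceph-{osd}/{OBJ}"
    for digest, osd in [
        ("217ac2518dfe9e1502e5bfedb8be29b8", 4),
        ("217ac2518dfe9e1502e5bfedb8be29b8", 11),
        ("d41d8cd98f00b204e9800998ecf8427e", 25),
    ]
)

groups = group_replicas(output)
for digest, osds in groups.items():
    note = " (empty file)" if digest == EMPTY_MD5 else ""
    print(f"{digest}: OSDs {osds}{note}")
```

Matching digests only show that the copies agree with each other; they do
not by themselves prove which copy holds the correct data.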


Is there a way to force a rewrite of this replica?

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


* Re: Scrub shutdown the OSD process / data loss
  2013-04-20  7:10               ` Olivier Bonvalet
@ 2013-04-22 17:05                 ` Olivier Bonvalet
  0 siblings, 0 replies; 7+ messages in thread
From: Olivier Bonvalet @ 2013-04-22 17:05 UTC (permalink / raw)
  To: ceph-devel-u79uwXL29TY76Z2rM5mHXA
  Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org

On Saturday, 20 April 2013 at 09:10 +0200, Olivier Bonvalet wrote:
> On Wednesday, 17 April 2013 at 20:52 +0200, Olivier Bonvalet wrote:
> > What I don't understand is why the OSD process crashes instead of
> > marking that PG as "corrupted". And is that PG really corrupted, or
> > is this just an OSD bug?
> 
> Once again, a bit more information: while looking into one of the
> faulty PGs (3.d), I found this:
> 
>   -592> 2013-04-20 08:31:56.838280 7f0f41d1b700  0 log [ERR] : 3.d osd.25 inconsistent snapcolls on a8620b0d/rb.0.15c26.238e1f29.000000004603/12d7//3 found  expected 12d7
>   -591> 2013-04-20 08:31:56.838284 7f0f41d1b700  0 log [ERR] : 3.d osd.4 inconsistent snapcolls on a8620b0d/rb.0.15c26.238e1f29.000000004603/12d7//3 found  expected 12d7
>   -590> 2013-04-20 08:31:56.838290 7f0f41d1b700  0 log [ERR] : 3.d osd.4: soid a8620b0d/rb.0.15c26.238e1f29.000000004603/12d7//3 size 4194304 != known size 0
>   -589> 2013-04-20 08:31:56.838292 7f0f41d1b700  0 log [ERR] : 3.d osd.11 inconsistent snapcolls on a8620b0d/rb.0.15c26.238e1f29.000000004603/12d7//3 found  expected 12d7
>   -588> 2013-04-20 08:31:56.838294 7f0f41d1b700  0 log [ERR] : 3.d osd.11: soid a8620b0d/rb.0.15c26.238e1f29.000000004603/12d7//3 size 4194304 != known size 0
>   -587> 2013-04-20 08:31:56.838395 7f0f41d1b700  0 log [ERR] : scrub 3.d a8620b0d/rb.0.15c26.238e1f29.000000004603/12d7//3 on disk size (0) does not match object info size (4194304)
> 
> 
> I wanted to verify this, and found the following:
> 
> # md5sum /var/lib/ceph/osd/ceph-*/current/3.d_head/DIR_D/DIR_0/DIR_B/DIR_0/rb.0.15c26.238e1f29.000000004603__12d7_A8620B0D__3
> 217ac2518dfe9e1502e5bfedb8be29b8  /var/lib/ceph/osd/ceph-4/current/3.d_head/DIR_D/DIR_0/DIR_B/DIR_0/rb.0.15c26.238e1f29.000000004603__12d7_A8620B0D__3 (4MB)
> 217ac2518dfe9e1502e5bfedb8be29b8  /var/lib/ceph/osd/ceph-11/current/3.d_head/DIR_D/DIR_0/DIR_B/DIR_0/rb.0.15c26.238e1f29.000000004603__12d7_A8620B0D__3 (4MB)
> d41d8cd98f00b204e9800998ecf8427e  /var/lib/ceph/osd/ceph-25/current/3.d_head/DIR_D/DIR_0/DIR_B/DIR_0/rb.0.15c26.238e1f29.000000004603__12d7_A8620B0D__3 (0B)
> 
> 
> So this object is identical on OSDs 4 and 11, but is empty on OSD 25.
> Since OSD 4 is the primary, this should not be a problem, so I tried a
> repair, without any success:
>     ceph pg repair 3.d
> 
> 
> Is there a way to force a rewrite of this replica?


I don't know whether it's related, but I am seeing data loss on multiple
RBD images in my cluster (corrupted filesystems, databases, and some
empty files).

I suppose it's related.
