* [BUG] ceph-mon crashes
@ 2012-11-11 21:51 Stefan Priebe
2012-11-11 22:20 ` Joao Eduardo Luis
0 siblings, 1 reply; 11+ messages in thread
From: Stefan Priebe @ 2012-11-11 21:51 UTC (permalink / raw)
To: ceph-devel@vger.kernel.org
Hello list,
i've now seen the following ceph-mon crash several times:
-22> 2012-11-11 22:45:10.941641 7f5d49168700 1
mon.b@1(leader).paxos(pgmap active c 11425..11925) is_readable
now=2012-11-11 22:45:10.941643 lease_expire=2012-11-11 22:45:15.313577
has v1017 lc 11925
-21> 2012-11-11 22:45:11.017199 7f5d49168700 1 --
10.255.0.101:6789/0 <== mon.2 10.255.0.102:6789/0 123 ==== paxos(logm
lease_ack lc 14415 fc 13913 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (1926207536
0 0) 0x58fe840 con 0x2c38c60
-20> 2012-11-11 22:45:11.084775 7f5d49168700 1 --
10.255.0.101:6789/0 <== mon.2 10.255.0.102:6789/0 124 ==== paxos(mdsmap
lease_ack lc 1 fc 1 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (2037515134 0 0)
0x4e59080 con 0x2c38c60
-19> 2012-11-11 22:45:11.084812 7f5d49168700 1 --
10.255.0.101:6789/0 <== mon.2 10.255.0.102:6789/0 125 ==== paxos(monmap
lease_ack lc 1 fc 1 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (3822314369 0 0)
0x4e58580 con 0x2c38c60
-18> 2012-11-11 22:45:11.230970 7f5d48065700 2 --
10.255.0.101:6789/0 >> 10.255.0.100:6789/0 pipe(0x2cc9b40 sd=17 :6789
pgs=16 cs=2 l=0).connect error 10.255.0.100:6789/0, 111: Connection refused
-17> 2012-11-11 22:45:11.231004 7f5d48065700 2 --
10.255.0.101:6789/0 >> 10.255.0.100:6789/0 pipe(0x2cc9b40 sd=17 :6789
pgs=16 cs=2 l=0).fault 111: Connection refused
-16> 2012-11-11 22:45:11.238427 7f5d49168700 1 --
10.255.0.101:6789/0 <== osd.33 10.255.0.102:6806/3537 1315 ====
pg_stats(478 pgs tid 185 v 1017) v1 ==== 158856+0+0 (1263639313 0 0)
0x2ca5b40 con 0x2ca4dc0
-15> 2012-11-11 22:45:11.238451 7f5d49168700 1
mon.b@1(leader).paxos(pgmap active c 11425..11925) is_readable
now=2012-11-11 22:45:11.238452 lease_expire=2012-11-11 22:45:15.313577
has v1017 lc 11925
-14> 2012-11-11 22:45:11.287036 7f5d49969700 5
mon.b@1(leader).paxos(pgmap active c 11425..11925) propose_new_value
11926 311714 bytes, gv 28550
-13> 2012-11-11 22:45:11.296084 7f5d49168700 1 --
10.255.0.101:6789/0 <== osd.41 10.255.0.103:6800/3324 1086 ====
pg_stats(440 pgs tid 185 v 1017) v1 ==== 146276+0+0 (1903337195 0 0)
0x39bad80 con 0x2ca4c60
-12> 2012-11-11 22:45:11.350697 7f5d49969700 1 --
10.255.0.101:6789/0 --> mon.2 10.255.0.102:6789/0 -- paxos(pgmap begin
lc 11925 fc 0 pn 3501 opn 0 gv {11926=28550}) v2 -- ?+0 0x4e60b00
-11> 2012-11-11 22:45:11.350790 7f5d49168700 1
mon.b@1(leader).paxos(pgmap updating c 11425..11925) is_readable
now=2012-11-11 22:45:11.350793 lease_expire=2012-11-11 22:45:15.313577
has v1017 lc 11925
-10> 2012-11-11 22:45:11.435570 7f5d49168700 1 --
10.255.0.101:6789/0 <== mon.2 10.255.0.102:6789/0 126 ==== paxos(pgmap
accept lc 11925 fc 0 pn 3501 opn 0 gv {}) v2 ==== 88+0+0 (3719346061 0
0) 0x4e60b00 con 0x2c38c60
-9> 2012-11-11 22:45:11.476091 7f5d49168700 1 --
10.255.0.101:6789/0 --> mon.2 10.255.0.102:6789/0 -- paxos(pgmap commit
lc 11926 fc 0 pn 3501 opn 0 gv {11926=28550}) v2 -- ?+0 0x4e58580
-8> 2012-11-11 22:45:11.476125 7f5d49168700 1 --
10.255.0.101:6789/0 --> mon.2 10.255.0.102:6789/0 -- paxos(pgmap lease
lc 11926 fc 11425 pn 0 opn 0 gv {}) v2 -- ?+0 0x4e59080
-7> 2012-11-11 22:45:11.593945 7f5d49168700 0 log [INF] : pgmap
v11926: 7032 pgs: 5977 active+clean, 62 active+remapped+wait_backfill,
586 active+degraded+wait_backfill, 32 active+remapped+backfilling, 18
active+degraded+backfilling, 343 active+degraded+remapped+wait_backfill,
2 remapped+peering, 12 active+degraded+remapped+backfilling; 61211 MB
data, 61662 MB used, 4185 GB / 4245 GB avail; 5419/31221 degraded (17.357%)
-6> 2012-11-11 22:45:11.593954 7f5d49168700 1 --
10.255.0.101:6789/0 --> mon.1 10.255.0.101:6789/0 -- log(1 entries) v1
-- ?+0 0x40c7200
-5> 2012-11-11 22:45:11.593969 7f5d49168700 1 --
10.255.0.101:6789/0 --> 10.255.0.103:6803/3461 -- pg_stats_ack(460 pgs
tid 185) v1 -- ?+0 0x3ab5a80 con 0x3bd5420
-4> 2012-11-11 22:45:11.594050 7f5d49168700 1 --
10.255.0.101:6789/0 --> 10.255.0.102:6806/3537 -- pg_stats_ack(478 pgs
tid 185) v1 -- ?+0 0x4f0b8c0 con 0x2ca4dc0
-3> 2012-11-11 22:45:11.594359 7f5d49168700 1
mon.b@1(leader).paxos(pgmap active c 11426..11926) is_readable
now=2012-11-11 22:45:11.594363 lease_expire=2012-11-11 22:45:16.476123
has v1017 lc 11926
-2> 2012-11-11 22:45:11.594736 7f5d49168700 1 --
10.255.0.101:6789/0 <== mon.1 10.255.0.101:6789/0 0 ==== log(1 entries)
v1 ==== 0+0+0 (0 0 0) 0x40c7200 con 0x2c382c0
-1> 2012-11-11 22:45:11.595139 7f5d49969700 1 mon.b@1(leader).osd
e1017 we have enough reports/reporters to mark osd.53 down
0> 2012-11-11 22:45:11.595601 7f5d49969700 -1 ./osd/OSDMap.h: In
function 'entity_inst_t OSDMap::get_inst(int) const' thread 7f5d49969700
time 2012-11-11 22:45:11.595149
./osd/OSDMap.h: 345: FAILED assert(is_up(osd))
ceph version 0.53-758-g26e5f2d (26e5f2d63f569b955bb07b50aa8f930ed9450bc4)
1: (OSDMonitor::check_failure(utime_t, int, failure_info_t&)+0xc09)
[0x4b14e9]
2: (OSDMonitor::check_failures(utime_t)+0x3a) [0x4b16ca]
3: (OSDMonitor::tick()+0xab) [0x4b213b]
4: (Monitor::tick()+0x6d) [0x477fcd]
5: (SafeTimer::timer_thread()+0x453) [0x58f203]
6: (SafeTimerThread::entry()+0xd) [0x5913dd]
7: (()+0x68ca) [0x7f5d4dbce8ca]
8: (clone()+0x6d) [0x7f5d4c456bfd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 5 lockdep
0/ 5 context
1/ 5 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 0 buffer
0/ 5 timer
0/ 5 filer
0/ 5 striper
0/ 0 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 5 filestore
1/ 5 journal
0/ 5 ms
1/ 5 mon
0/ 5 monc
0/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 hadoop
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000000
log_file /var/log/ceph/ceph-mon.b.log
--- end dump of recent events ---
2012-11-11 22:45:11.632718 7f5d49969700 -1 *** Caught signal (Aborted) **
in thread 7f5d49969700
ceph version 0.53-758-g26e5f2d (26e5f2d63f569b955bb07b50aa8f930ed9450bc4)
1: /usr/bin/ceph-mon() [0x53eb29]
2: (()+0xeff0) [0x7f5d4dbd6ff0]
3: (gsignal()+0x35) [0x7f5d4c3b9215]
4: (abort()+0x180) [0x7f5d4c3bc020]
5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f5d4cc4ddc5]
6: (()+0xcb166) [0x7f5d4cc4c166]
7: (()+0xcb193) [0x7f5d4cc4c193]
8: (()+0xcb28e) [0x7f5d4cc4c28e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x7c9) [0x55f249]
10: (OSDMonitor::check_failure(utime_t, int, failure_info_t&)+0xc09)
[0x4b14e9]
11: (OSDMonitor::check_failures(utime_t)+0x3a) [0x4b16ca]
12: (OSDMonitor::tick()+0xab) [0x4b213b]
13: (Monitor::tick()+0x6d) [0x477fcd]
14: (SafeTimer::timer_thread()+0x453) [0x58f203]
15: (SafeTimerThread::entry()+0xd) [0x5913dd]
16: (()+0x68ca) [0x7f5d4dbce8ca]
17: (clone()+0x6d) [0x7f5d4c456bfd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
--- begin dump of recent events ---
0> 2012-11-11 22:45:11.632718 7f5d49969700 -1 *** Caught signal
(Aborted) **
in thread 7f5d49969700
ceph version 0.53-758-g26e5f2d (26e5f2d63f569b955bb07b50aa8f930ed9450bc4)
1: /usr/bin/ceph-mon() [0x53eb29]
2: (()+0xeff0) [0x7f5d4dbd6ff0]
3: (gsignal()+0x35) [0x7f5d4c3b9215]
4: (abort()+0x180) [0x7f5d4c3bc020]
5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f5d4cc4ddc5]
6: (()+0xcb166) [0x7f5d4cc4c166]
7: (()+0xcb193) [0x7f5d4cc4c193]
8: (()+0xcb28e) [0x7f5d4cc4c28e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x7c9) [0x55f249]
10: (OSDMonitor::check_failure(utime_t, int, failure_info_t&)+0xc09)
[0x4b14e9]
11: (OSDMonitor::check_failures(utime_t)+0x3a) [0x4b16ca]
12: (OSDMonitor::tick()+0xab) [0x4b213b]
13: (Monitor::tick()+0x6d) [0x477fcd]
14: (SafeTimer::timer_thread()+0x453) [0x58f203]
15: (SafeTimerThread::entry()+0xd) [0x5913dd]
16: (()+0x68ca) [0x7f5d4dbce8ca]
17: (clone()+0x6d) [0x7f5d4c456bfd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 5 lockdep
0/ 5 context
1/ 5 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 0 buffer
0/ 5 timer
0/ 5 filer
0/ 5 striper
0/ 0 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 5 filestore
1/ 5 journal
0/ 5 ms
1/ 5 mon
0/ 5 monc
0/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 hadoop
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000000
log_file /var/log/ceph/ceph-mon.b.log
--- end dump of recent events ---
Stefan
^ permalink raw reply [flat|nested] 11+ messages in thread* Re: [BUG] ceph-mon crashes
2012-11-11 21:51 [BUG] ceph-mon crashes Stefan Priebe
@ 2012-11-11 22:20 ` Joao Eduardo Luis
2012-11-11 22:22 ` Stefan Priebe
0 siblings, 1 reply; 11+ messages in thread
From: Joao Eduardo Luis @ 2012-11-11 22:20 UTC (permalink / raw)
To: Stefan Priebe; +Cc: ceph-devel@vger.kernel.org
Hi Stefan,
Any chance you can get us the output of `ceph osd dump | grep 'osd.53'`?
Also, do you have any idea what happened to osd.53?
-Joao
On 11/11/2012 09:51 PM, Stefan Priebe wrote:
> Hello list,
>
> i've now seen the following ceph-mon crash several times:
> -22> 2012-11-11 22:45:10.941641 7f5d49168700 1
> mon.b@1(leader).paxos(pgmap active c 11425..11925) is_readable
> now=2012-11-11 22:45:10.941643 lease_expire=2012-11-11 22:45:15.313577
> has v1017 lc 11925
> -21> 2012-11-11 22:45:11.017199 7f5d49168700 1 --
> 10.255.0.101:6789/0 <== mon.2 10.255.0.102:6789/0 123 ==== paxos(logm
> lease_ack lc 14415 fc 13913 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (1926207536
> 0 0) 0x58fe840 con 0x2c38c60
> -20> 2012-11-11 22:45:11.084775 7f5d49168700 1 --
> 10.255.0.101:6789/0 <== mon.2 10.255.0.102:6789/0 124 ==== paxos(mdsmap
> lease_ack lc 1 fc 1 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (2037515134 0 0)
> 0x4e59080 con 0x2c38c60
> -19> 2012-11-11 22:45:11.084812 7f5d49168700 1 --
> 10.255.0.101:6789/0 <== mon.2 10.255.0.102:6789/0 125 ==== paxos(monmap
> lease_ack lc 1 fc 1 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (3822314369 0 0)
> 0x4e58580 con 0x2c38c60
> -18> 2012-11-11 22:45:11.230970 7f5d48065700 2 --
> 10.255.0.101:6789/0 >> 10.255.0.100:6789/0 pipe(0x2cc9b40 sd=17 :6789
> pgs=16 cs=2 l=0).connect error 10.255.0.100:6789/0, 111: Connection refused
> -17> 2012-11-11 22:45:11.231004 7f5d48065700 2 --
> 10.255.0.101:6789/0 >> 10.255.0.100:6789/0 pipe(0x2cc9b40 sd=17 :6789
> pgs=16 cs=2 l=0).fault 111: Connection refused
> -16> 2012-11-11 22:45:11.238427 7f5d49168700 1 --
> 10.255.0.101:6789/0 <== osd.33 10.255.0.102:6806/3537 1315 ====
> pg_stats(478 pgs tid 185 v 1017) v1 ==== 158856+0+0 (1263639313 0 0)
> 0x2ca5b40 con 0x2ca4dc0
> -15> 2012-11-11 22:45:11.238451 7f5d49168700 1
> mon.b@1(leader).paxos(pgmap active c 11425..11925) is_readable
> now=2012-11-11 22:45:11.238452 lease_expire=2012-11-11 22:45:15.313577
> has v1017 lc 11925
> -14> 2012-11-11 22:45:11.287036 7f5d49969700 5
> mon.b@1(leader).paxos(pgmap active c 11425..11925) propose_new_value
> 11926 311714 bytes, gv 28550
> -13> 2012-11-11 22:45:11.296084 7f5d49168700 1 --
> 10.255.0.101:6789/0 <== osd.41 10.255.0.103:6800/3324 1086 ====
> pg_stats(440 pgs tid 185 v 1017) v1 ==== 146276+0+0 (1903337195 0 0)
> 0x39bad80 con 0x2ca4c60
> -12> 2012-11-11 22:45:11.350697 7f5d49969700 1 --
> 10.255.0.101:6789/0 --> mon.2 10.255.0.102:6789/0 -- paxos(pgmap begin
> lc 11925 fc 0 pn 3501 opn 0 gv {11926=28550}) v2 -- ?+0 0x4e60b00
> -11> 2012-11-11 22:45:11.350790 7f5d49168700 1
> mon.b@1(leader).paxos(pgmap updating c 11425..11925) is_readable
> now=2012-11-11 22:45:11.350793 lease_expire=2012-11-11 22:45:15.313577
> has v1017 lc 11925
> -10> 2012-11-11 22:45:11.435570 7f5d49168700 1 --
> 10.255.0.101:6789/0 <== mon.2 10.255.0.102:6789/0 126 ==== paxos(pgmap
> accept lc 11925 fc 0 pn 3501 opn 0 gv {}) v2 ==== 88+0+0 (3719346061 0
> 0) 0x4e60b00 con 0x2c38c60
> -9> 2012-11-11 22:45:11.476091 7f5d49168700 1 --
> 10.255.0.101:6789/0 --> mon.2 10.255.0.102:6789/0 -- paxos(pgmap commit
> lc 11926 fc 0 pn 3501 opn 0 gv {11926=28550}) v2 -- ?+0 0x4e58580
> -8> 2012-11-11 22:45:11.476125 7f5d49168700 1 --
> 10.255.0.101:6789/0 --> mon.2 10.255.0.102:6789/0 -- paxos(pgmap lease
> lc 11926 fc 11425 pn 0 opn 0 gv {}) v2 -- ?+0 0x4e59080
> -7> 2012-11-11 22:45:11.593945 7f5d49168700 0 log [INF] : pgmap
> v11926: 7032 pgs: 5977 active+clean, 62 active+remapped+wait_backfill,
> 586 active+degraded+wait_backfill, 32 active+remapped+backfilling, 18
> active+degraded+backfilling, 343 active+degraded+remapped+wait_backfill,
> 2 remapped+peering, 12 active+degraded+remapped+backfilling; 61211 MB
> data, 61662 MB used, 4185 GB / 4245 GB avail; 5419/31221 degraded (17.357%)
> -6> 2012-11-11 22:45:11.593954 7f5d49168700 1 --
> 10.255.0.101:6789/0 --> mon.1 10.255.0.101:6789/0 -- log(1 entries) v1
> -- ?+0 0x40c7200
> -5> 2012-11-11 22:45:11.593969 7f5d49168700 1 --
> 10.255.0.101:6789/0 --> 10.255.0.103:6803/3461 -- pg_stats_ack(460 pgs
> tid 185) v1 -- ?+0 0x3ab5a80 con 0x3bd5420
> -4> 2012-11-11 22:45:11.594050 7f5d49168700 1 --
> 10.255.0.101:6789/0 --> 10.255.0.102:6806/3537 -- pg_stats_ack(478 pgs
> tid 185) v1 -- ?+0 0x4f0b8c0 con 0x2ca4dc0
> -3> 2012-11-11 22:45:11.594359 7f5d49168700 1
> mon.b@1(leader).paxos(pgmap active c 11426..11926) is_readable
> now=2012-11-11 22:45:11.594363 lease_expire=2012-11-11 22:45:16.476123
> has v1017 lc 11926
> -2> 2012-11-11 22:45:11.594736 7f5d49168700 1 --
> 10.255.0.101:6789/0 <== mon.1 10.255.0.101:6789/0 0 ==== log(1 entries)
> v1 ==== 0+0+0 (0 0 0) 0x40c7200 con 0x2c382c0
> -1> 2012-11-11 22:45:11.595139 7f5d49969700 1 mon.b@1(leader).osd
> e1017 we have enough reports/reporters to mark osd.53 down
> 0> 2012-11-11 22:45:11.595601 7f5d49969700 -1 ./osd/OSDMap.h: In
> function 'entity_inst_t OSDMap::get_inst(int) const' thread 7f5d49969700
> time 2012-11-11 22:45:11.595149
> ./osd/OSDMap.h: 345: FAILED assert(is_up(osd))
>
> ceph version 0.53-758-g26e5f2d (26e5f2d63f569b955bb07b50aa8f930ed9450bc4)
> 1: (OSDMonitor::check_failure(utime_t, int, failure_info_t&)+0xc09)
> [0x4b14e9]
> 2: (OSDMonitor::check_failures(utime_t)+0x3a) [0x4b16ca]
> 3: (OSDMonitor::tick()+0xab) [0x4b213b]
> 4: (Monitor::tick()+0x6d) [0x477fcd]
> 5: (SafeTimer::timer_thread()+0x453) [0x58f203]
> 6: (SafeTimerThread::entry()+0xd) [0x5913dd]
> 7: (()+0x68ca) [0x7f5d4dbce8ca]
> 8: (clone()+0x6d) [0x7f5d4c456bfd]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> needed to interpret this.
^ permalink raw reply [flat|nested] 11+ messages in thread* Re: [BUG] ceph-mon crashes
2012-11-11 22:20 ` Joao Eduardo Luis
@ 2012-11-11 22:22 ` Stefan Priebe
2012-11-11 23:26 ` Joao Eduardo Luis
0 siblings, 1 reply; 11+ messages in thread
From: Stefan Priebe @ 2012-11-11 22:22 UTC (permalink / raw)
To: Joao Eduardo Luis; +Cc: ceph-devel@vger.kernel.org
I wanted to remove osd.51 to 54. And executed the following commands:
ceph osd 51 out
ceph osd 52 out
ceph osd 53 out
ceph osd 54 out
Greets,
Stefan
Am 11.11.2012 23:20, schrieb Joao Eduardo Luis:
> Hi Stefan,
>
> Any chance you can get us the output of `ceph osd dump | grep 'osd.53'`?
>
> Also, do you have any idea what happened to osd.53?
>
> -Joao
>
> On 11/11/2012 09:51 PM, Stefan Priebe wrote:
>> Hello list,
>>
>> i've now seen the following ceph-mon crash several times:
>> -22> 2012-11-11 22:45:10.941641 7f5d49168700 1
>> mon.b@1(leader).paxos(pgmap active c 11425..11925) is_readable
>> now=2012-11-11 22:45:10.941643 lease_expire=2012-11-11 22:45:15.313577
>> has v1017 lc 11925
>> -21> 2012-11-11 22:45:11.017199 7f5d49168700 1 --
>> 10.255.0.101:6789/0 <== mon.2 10.255.0.102:6789/0 123 ==== paxos(logm
>> lease_ack lc 14415 fc 13913 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (1926207536
>> 0 0) 0x58fe840 con 0x2c38c60
>> -20> 2012-11-11 22:45:11.084775 7f5d49168700 1 --
>> 10.255.0.101:6789/0 <== mon.2 10.255.0.102:6789/0 124 ==== paxos(mdsmap
>> lease_ack lc 1 fc 1 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (2037515134 0 0)
>> 0x4e59080 con 0x2c38c60
>> -19> 2012-11-11 22:45:11.084812 7f5d49168700 1 --
>> 10.255.0.101:6789/0 <== mon.2 10.255.0.102:6789/0 125 ==== paxos(monmap
>> lease_ack lc 1 fc 1 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (3822314369 0 0)
>> 0x4e58580 con 0x2c38c60
>> -18> 2012-11-11 22:45:11.230970 7f5d48065700 2 --
>> 10.255.0.101:6789/0 >> 10.255.0.100:6789/0 pipe(0x2cc9b40 sd=17 :6789
>> pgs=16 cs=2 l=0).connect error 10.255.0.100:6789/0, 111: Connection refused
>> -17> 2012-11-11 22:45:11.231004 7f5d48065700 2 --
>> 10.255.0.101:6789/0 >> 10.255.0.100:6789/0 pipe(0x2cc9b40 sd=17 :6789
>> pgs=16 cs=2 l=0).fault 111: Connection refused
>> -16> 2012-11-11 22:45:11.238427 7f5d49168700 1 --
>> 10.255.0.101:6789/0 <== osd.33 10.255.0.102:6806/3537 1315 ====
>> pg_stats(478 pgs tid 185 v 1017) v1 ==== 158856+0+0 (1263639313 0 0)
>> 0x2ca5b40 con 0x2ca4dc0
>> -15> 2012-11-11 22:45:11.238451 7f5d49168700 1
>> mon.b@1(leader).paxos(pgmap active c 11425..11925) is_readable
>> now=2012-11-11 22:45:11.238452 lease_expire=2012-11-11 22:45:15.313577
>> has v1017 lc 11925
>> -14> 2012-11-11 22:45:11.287036 7f5d49969700 5
>> mon.b@1(leader).paxos(pgmap active c 11425..11925) propose_new_value
>> 11926 311714 bytes, gv 28550
>> -13> 2012-11-11 22:45:11.296084 7f5d49168700 1 --
>> 10.255.0.101:6789/0 <== osd.41 10.255.0.103:6800/3324 1086 ====
>> pg_stats(440 pgs tid 185 v 1017) v1 ==== 146276+0+0 (1903337195 0 0)
>> 0x39bad80 con 0x2ca4c60
>> -12> 2012-11-11 22:45:11.350697 7f5d49969700 1 --
>> 10.255.0.101:6789/0 --> mon.2 10.255.0.102:6789/0 -- paxos(pgmap begin
>> lc 11925 fc 0 pn 3501 opn 0 gv {11926=28550}) v2 -- ?+0 0x4e60b00
>> -11> 2012-11-11 22:45:11.350790 7f5d49168700 1
>> mon.b@1(leader).paxos(pgmap updating c 11425..11925) is_readable
>> now=2012-11-11 22:45:11.350793 lease_expire=2012-11-11 22:45:15.313577
>> has v1017 lc 11925
>> -10> 2012-11-11 22:45:11.435570 7f5d49168700 1 --
>> 10.255.0.101:6789/0 <== mon.2 10.255.0.102:6789/0 126 ==== paxos(pgmap
>> accept lc 11925 fc 0 pn 3501 opn 0 gv {}) v2 ==== 88+0+0 (3719346061 0
>> 0) 0x4e60b00 con 0x2c38c60
>> -9> 2012-11-11 22:45:11.476091 7f5d49168700 1 --
>> 10.255.0.101:6789/0 --> mon.2 10.255.0.102:6789/0 -- paxos(pgmap commit
>> lc 11926 fc 0 pn 3501 opn 0 gv {11926=28550}) v2 -- ?+0 0x4e58580
>> -8> 2012-11-11 22:45:11.476125 7f5d49168700 1 --
>> 10.255.0.101:6789/0 --> mon.2 10.255.0.102:6789/0 -- paxos(pgmap lease
>> lc 11926 fc 11425 pn 0 opn 0 gv {}) v2 -- ?+0 0x4e59080
>> -7> 2012-11-11 22:45:11.593945 7f5d49168700 0 log [INF] : pgmap
>> v11926: 7032 pgs: 5977 active+clean, 62 active+remapped+wait_backfill,
>> 586 active+degraded+wait_backfill, 32 active+remapped+backfilling, 18
>> active+degraded+backfilling, 343 active+degraded+remapped+wait_backfill,
>> 2 remapped+peering, 12 active+degraded+remapped+backfilling; 61211 MB
>> data, 61662 MB used, 4185 GB / 4245 GB avail; 5419/31221 degraded (17.357%)
>> -6> 2012-11-11 22:45:11.593954 7f5d49168700 1 --
>> 10.255.0.101:6789/0 --> mon.1 10.255.0.101:6789/0 -- log(1 entries) v1
>> -- ?+0 0x40c7200
>> -5> 2012-11-11 22:45:11.593969 7f5d49168700 1 --
>> 10.255.0.101:6789/0 --> 10.255.0.103:6803/3461 -- pg_stats_ack(460 pgs
>> tid 185) v1 -- ?+0 0x3ab5a80 con 0x3bd5420
>> -4> 2012-11-11 22:45:11.594050 7f5d49168700 1 --
>> 10.255.0.101:6789/0 --> 10.255.0.102:6806/3537 -- pg_stats_ack(478 pgs
>> tid 185) v1 -- ?+0 0x4f0b8c0 con 0x2ca4dc0
>> -3> 2012-11-11 22:45:11.594359 7f5d49168700 1
>> mon.b@1(leader).paxos(pgmap active c 11426..11926) is_readable
>> now=2012-11-11 22:45:11.594363 lease_expire=2012-11-11 22:45:16.476123
>> has v1017 lc 11926
>> -2> 2012-11-11 22:45:11.594736 7f5d49168700 1 --
>> 10.255.0.101:6789/0 <== mon.1 10.255.0.101:6789/0 0 ==== log(1 entries)
>> v1 ==== 0+0+0 (0 0 0) 0x40c7200 con 0x2c382c0
>> -1> 2012-11-11 22:45:11.595139 7f5d49969700 1 mon.b@1(leader).osd
>> e1017 we have enough reports/reporters to mark osd.53 down
>> 0> 2012-11-11 22:45:11.595601 7f5d49969700 -1 ./osd/OSDMap.h: In
>> function 'entity_inst_t OSDMap::get_inst(int) const' thread 7f5d49969700
>> time 2012-11-11 22:45:11.595149
>> ./osd/OSDMap.h: 345: FAILED assert(is_up(osd))
>>
>> ceph version 0.53-758-g26e5f2d (26e5f2d63f569b955bb07b50aa8f930ed9450bc4)
>> 1: (OSDMonitor::check_failure(utime_t, int, failure_info_t&)+0xc09)
>> [0x4b14e9]
>> 2: (OSDMonitor::check_failures(utime_t)+0x3a) [0x4b16ca]
>> 3: (OSDMonitor::tick()+0xab) [0x4b213b]
>> 4: (Monitor::tick()+0x6d) [0x477fcd]
>> 5: (SafeTimer::timer_thread()+0x453) [0x58f203]
>> 6: (SafeTimerThread::entry()+0xd) [0x5913dd]
>> 7: (()+0x68ca) [0x7f5d4dbce8ca]
>> 8: (clone()+0x6d) [0x7f5d4c456bfd]
>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>> needed to interpret this.
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
^ permalink raw reply [flat|nested] 11+ messages in thread* Re: [BUG] ceph-mon crashes
2012-11-11 22:22 ` Stefan Priebe
@ 2012-11-11 23:26 ` Joao Eduardo Luis
2012-11-12 6:21 ` Stefan Priebe - Profihost AG
0 siblings, 1 reply; 11+ messages in thread
From: Joao Eduardo Luis @ 2012-11-11 23:26 UTC (permalink / raw)
To: Stefan Priebe; +Cc: ceph-devel@vger.kernel.org
On 11/11/2012 10:22 PM, Stefan Priebe wrote:
> I wanted to remove osd.51 to 54. And executed the following commands:
> ceph osd 51 out
> ceph osd 52 out
> ceph osd 53 out
> ceph osd 54 out
>
> Greets,
> Stefan
Were those osds shutdown/killed before issuing the out commands?
And I'm assuming all the commands completed successfully with a 'marked
out osd.X'. Is this correct?
-Joao
> Am 11.11.2012 23:20, schrieb Joao Eduardo Luis:
>> Hi Stefan,
>>
>> Any chance you can get us the output of `ceph osd dump | grep 'osd.53'`?
>>
>> Also, do you have any idea what happened to osd.53?
>>
>> -Joao
>>
>> On 11/11/2012 09:51 PM, Stefan Priebe wrote:
>>> Hello list,
>>>
>>> i've now seen the following ceph-mon crash several times:
>>> -22> 2012-11-11 22:45:10.941641 7f5d49168700 1
>>> mon.b@1(leader).paxos(pgmap active c 11425..11925) is_readable
>>> now=2012-11-11 22:45:10.941643 lease_expire=2012-11-11 22:45:15.313577
>>> has v1017 lc 11925
>>> -21> 2012-11-11 22:45:11.017199 7f5d49168700 1 --
>>> 10.255.0.101:6789/0 <== mon.2 10.255.0.102:6789/0 123 ==== paxos(logm
>>> lease_ack lc 14415 fc 13913 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (1926207536
>>> 0 0) 0x58fe840 con 0x2c38c60
>>> -20> 2012-11-11 22:45:11.084775 7f5d49168700 1 --
>>> 10.255.0.101:6789/0 <== mon.2 10.255.0.102:6789/0 124 ==== paxos(mdsmap
>>> lease_ack lc 1 fc 1 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (2037515134 0 0)
>>> 0x4e59080 con 0x2c38c60
>>> -19> 2012-11-11 22:45:11.084812 7f5d49168700 1 --
>>> 10.255.0.101:6789/0 <== mon.2 10.255.0.102:6789/0 125 ==== paxos(monmap
>>> lease_ack lc 1 fc 1 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (3822314369 0 0)
>>> 0x4e58580 con 0x2c38c60
>>> -18> 2012-11-11 22:45:11.230970 7f5d48065700 2 --
>>> 10.255.0.101:6789/0 >> 10.255.0.100:6789/0 pipe(0x2cc9b40 sd=17 :6789
>>> pgs=16 cs=2 l=0).connect error 10.255.0.100:6789/0, 111: Connection
>>> refused
>>> -17> 2012-11-11 22:45:11.231004 7f5d48065700 2 --
>>> 10.255.0.101:6789/0 >> 10.255.0.100:6789/0 pipe(0x2cc9b40 sd=17 :6789
>>> pgs=16 cs=2 l=0).fault 111: Connection refused
>>> -16> 2012-11-11 22:45:11.238427 7f5d49168700 1 --
>>> 10.255.0.101:6789/0 <== osd.33 10.255.0.102:6806/3537 1315 ====
>>> pg_stats(478 pgs tid 185 v 1017) v1 ==== 158856+0+0 (1263639313 0 0)
>>> 0x2ca5b40 con 0x2ca4dc0
>>> -15> 2012-11-11 22:45:11.238451 7f5d49168700 1
>>> mon.b@1(leader).paxos(pgmap active c 11425..11925) is_readable
>>> now=2012-11-11 22:45:11.238452 lease_expire=2012-11-11 22:45:15.313577
>>> has v1017 lc 11925
>>> -14> 2012-11-11 22:45:11.287036 7f5d49969700 5
>>> mon.b@1(leader).paxos(pgmap active c 11425..11925) propose_new_value
>>> 11926 311714 bytes, gv 28550
>>> -13> 2012-11-11 22:45:11.296084 7f5d49168700 1 --
>>> 10.255.0.101:6789/0 <== osd.41 10.255.0.103:6800/3324 1086 ====
>>> pg_stats(440 pgs tid 185 v 1017) v1 ==== 146276+0+0 (1903337195 0 0)
>>> 0x39bad80 con 0x2ca4c60
>>> -12> 2012-11-11 22:45:11.350697 7f5d49969700 1 --
>>> 10.255.0.101:6789/0 --> mon.2 10.255.0.102:6789/0 -- paxos(pgmap begin
>>> lc 11925 fc 0 pn 3501 opn 0 gv {11926=28550}) v2 -- ?+0 0x4e60b00
>>> -11> 2012-11-11 22:45:11.350790 7f5d49168700 1
>>> mon.b@1(leader).paxos(pgmap updating c 11425..11925) is_readable
>>> now=2012-11-11 22:45:11.350793 lease_expire=2012-11-11 22:45:15.313577
>>> has v1017 lc 11925
>>> -10> 2012-11-11 22:45:11.435570 7f5d49168700 1 --
>>> 10.255.0.101:6789/0 <== mon.2 10.255.0.102:6789/0 126 ==== paxos(pgmap
>>> accept lc 11925 fc 0 pn 3501 opn 0 gv {}) v2 ==== 88+0+0 (3719346061 0
>>> 0) 0x4e60b00 con 0x2c38c60
>>> -9> 2012-11-11 22:45:11.476091 7f5d49168700 1 --
>>> 10.255.0.101:6789/0 --> mon.2 10.255.0.102:6789/0 -- paxos(pgmap commit
>>> lc 11926 fc 0 pn 3501 opn 0 gv {11926=28550}) v2 -- ?+0 0x4e58580
>>> -8> 2012-11-11 22:45:11.476125 7f5d49168700 1 --
>>> 10.255.0.101:6789/0 --> mon.2 10.255.0.102:6789/0 -- paxos(pgmap lease
>>> lc 11926 fc 11425 pn 0 opn 0 gv {}) v2 -- ?+0 0x4e59080
>>> -7> 2012-11-11 22:45:11.593945 7f5d49168700 0 log [INF] : pgmap
>>> v11926: 7032 pgs: 5977 active+clean, 62 active+remapped+wait_backfill,
>>> 586 active+degraded+wait_backfill, 32 active+remapped+backfilling, 18
>>> active+degraded+backfilling, 343 active+degraded+remapped+wait_backfill,
>>> 2 remapped+peering, 12 active+degraded+remapped+backfilling; 61211 MB
>>> data, 61662 MB used, 4185 GB / 4245 GB avail; 5419/31221 degraded
>>> (17.357%)
>>> -6> 2012-11-11 22:45:11.593954 7f5d49168700 1 --
>>> 10.255.0.101:6789/0 --> mon.1 10.255.0.101:6789/0 -- log(1 entries) v1
>>> -- ?+0 0x40c7200
>>> -5> 2012-11-11 22:45:11.593969 7f5d49168700 1 --
>>> 10.255.0.101:6789/0 --> 10.255.0.103:6803/3461 -- pg_stats_ack(460 pgs
>>> tid 185) v1 -- ?+0 0x3ab5a80 con 0x3bd5420
>>> -4> 2012-11-11 22:45:11.594050 7f5d49168700 1 --
>>> 10.255.0.101:6789/0 --> 10.255.0.102:6806/3537 -- pg_stats_ack(478 pgs
>>> tid 185) v1 -- ?+0 0x4f0b8c0 con 0x2ca4dc0
>>> -3> 2012-11-11 22:45:11.594359 7f5d49168700 1
>>> mon.b@1(leader).paxos(pgmap active c 11426..11926) is_readable
>>> now=2012-11-11 22:45:11.594363 lease_expire=2012-11-11 22:45:16.476123
>>> has v1017 lc 11926
>>> -2> 2012-11-11 22:45:11.594736 7f5d49168700 1 --
>>> 10.255.0.101:6789/0 <== mon.1 10.255.0.101:6789/0 0 ==== log(1 entries)
>>> v1 ==== 0+0+0 (0 0 0) 0x40c7200 con 0x2c382c0
>>> -1> 2012-11-11 22:45:11.595139 7f5d49969700 1 mon.b@1(leader).osd
>>> e1017 we have enough reports/reporters to mark osd.53 down
>>> 0> 2012-11-11 22:45:11.595601 7f5d49969700 -1 ./osd/OSDMap.h: In
>>> function 'entity_inst_t OSDMap::get_inst(int) const' thread 7f5d49969700
>>> time 2012-11-11 22:45:11.595149
>>> ./osd/OSDMap.h: 345: FAILED assert(is_up(osd))
>>>
>>> ceph version 0.53-758-g26e5f2d
>>> (26e5f2d63f569b955bb07b50aa8f930ed9450bc4)
>>> 1: (OSDMonitor::check_failure(utime_t, int, failure_info_t&)+0xc09)
>>> [0x4b14e9]
>>> 2: (OSDMonitor::check_failures(utime_t)+0x3a) [0x4b16ca]
>>> 3: (OSDMonitor::tick()+0xab) [0x4b213b]
>>> 4: (Monitor::tick()+0x6d) [0x477fcd]
>>> 5: (SafeTimer::timer_thread()+0x453) [0x58f203]
>>> 6: (SafeTimerThread::entry()+0xd) [0x5913dd]
>>> 7: (()+0x68ca) [0x7f5d4dbce8ca]
>>> 8: (clone()+0x6d) [0x7f5d4c456bfd]
>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>>> needed to interpret this.
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
^ permalink raw reply [flat|nested] 11+ messages in thread* Re: [BUG] ceph-mon crashes
2012-11-11 23:26 ` Joao Eduardo Luis
@ 2012-11-12 6:21 ` Stefan Priebe - Profihost AG
2012-11-12 14:58 ` Joao Eduardo Luis
0 siblings, 1 reply; 11+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-11-12 6:21 UTC (permalink / raw)
To: Joao Eduardo Luis; +Cc: ceph-devel@vger.kernel.org
Am 12.11.2012 um 00:26 schrieb Joao Eduardo Luis <joao.luis@inktank.com>:
> On 11/11/2012 10:22 PM, Stefan Priebe wrote:
>> I wanted to remove osd.51 to 54. And executed the following commands:
>> ceph osd 51 out
>> ceph osd 52 out
>> ceph osd 53 out
>> ceph osd 54 out
>>
>> Greets,
>> Stefan
>
> Were those osds shutdown/killed before issuing the out commands?
No
>
> And I'm assuming all the commands completed successfully with a 'marked
> out osd.X'. Is this correct?
Yes
Stefan
>
> -Joao
>
>
>> Am 11.11.2012 23:20, schrieb Joao Eduardo Luis:
>>> Hi Stefan,
>>>
>>> Any chance you can get us the output of `ceph osd dump | grep 'osd.53'`?
>>>
>>> Also, do you have any idea what happened to osd.53?
>>>
>>> -Joao
>>>
>>> On 11/11/2012 09:51 PM, Stefan Priebe wrote:
>>>> Hello list,
>>>>
>>>> i've now seen the following ceph-mon crash several times:
>>>> -22> 2012-11-11 22:45:10.941641 7f5d49168700 1
>>>> mon.b@1(leader).paxos(pgmap active c 11425..11925) is_readable
>>>> now=2012-11-11 22:45:10.941643 lease_expire=2012-11-11 22:45:15.313577
>>>> has v1017 lc 11925
>>>> -21> 2012-11-11 22:45:11.017199 7f5d49168700 1 --
>>>> 10.255.0.101:6789/0 <== mon.2 10.255.0.102:6789/0 123 ==== paxos(logm
>>>> lease_ack lc 14415 fc 13913 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (1926207536
>>>> 0 0) 0x58fe840 con 0x2c38c60
>>>> -20> 2012-11-11 22:45:11.084775 7f5d49168700 1 --
>>>> 10.255.0.101:6789/0 <== mon.2 10.255.0.102:6789/0 124 ==== paxos(mdsmap
>>>> lease_ack lc 1 fc 1 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (2037515134 0 0)
>>>> 0x4e59080 con 0x2c38c60
>>>> -19> 2012-11-11 22:45:11.084812 7f5d49168700 1 --
>>>> 10.255.0.101:6789/0 <== mon.2 10.255.0.102:6789/0 125 ==== paxos(monmap
>>>> lease_ack lc 1 fc 1 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (3822314369 0 0)
>>>> 0x4e58580 con 0x2c38c60
>>>> -18> 2012-11-11 22:45:11.230970 7f5d48065700 2 --
>>>> 10.255.0.101:6789/0 >> 10.255.0.100:6789/0 pipe(0x2cc9b40 sd=17 :6789
>>>> pgs=16 cs=2 l=0).connect error 10.255.0.100:6789/0, 111: Connection
>>>> refused
>>>> -17> 2012-11-11 22:45:11.231004 7f5d48065700 2 --
>>>> 10.255.0.101:6789/0 >> 10.255.0.100:6789/0 pipe(0x2cc9b40 sd=17 :6789
>>>> pgs=16 cs=2 l=0).fault 111: Connection refused
>>>> -16> 2012-11-11 22:45:11.238427 7f5d49168700 1 --
>>>> 10.255.0.101:6789/0 <== osd.33 10.255.0.102:6806/3537 1315 ====
>>>> pg_stats(478 pgs tid 185 v 1017) v1 ==== 158856+0+0 (1263639313 0 0)
>>>> 0x2ca5b40 con 0x2ca4dc0
>>>> -15> 2012-11-11 22:45:11.238451 7f5d49168700 1
>>>> mon.b@1(leader).paxos(pgmap active c 11425..11925) is_readable
>>>> now=2012-11-11 22:45:11.238452 lease_expire=2012-11-11 22:45:15.313577
>>>> has v1017 lc 11925
>>>> -14> 2012-11-11 22:45:11.287036 7f5d49969700 5
>>>> mon.b@1(leader).paxos(pgmap active c 11425..11925) propose_new_value
>>>> 11926 311714 bytes, gv 28550
>>>> -13> 2012-11-11 22:45:11.296084 7f5d49168700 1 --
>>>> 10.255.0.101:6789/0 <== osd.41 10.255.0.103:6800/3324 1086 ====
>>>> pg_stats(440 pgs tid 185 v 1017) v1 ==== 146276+0+0 (1903337195 0 0)
>>>> 0x39bad80 con 0x2ca4c60
>>>> -12> 2012-11-11 22:45:11.350697 7f5d49969700 1 --
>>>> 10.255.0.101:6789/0 --> mon.2 10.255.0.102:6789/0 -- paxos(pgmap begin
>>>> lc 11925 fc 0 pn 3501 opn 0 gv {11926=28550}) v2 -- ?+0 0x4e60b00
>>>> -11> 2012-11-11 22:45:11.350790 7f5d49168700 1
>>>> mon.b@1(leader).paxos(pgmap updating c 11425..11925) is_readable
>>>> now=2012-11-11 22:45:11.350793 lease_expire=2012-11-11 22:45:15.313577
>>>> has v1017 lc 11925
>>>> -10> 2012-11-11 22:45:11.435570 7f5d49168700 1 --
>>>> 10.255.0.101:6789/0 <== mon.2 10.255.0.102:6789/0 126 ==== paxos(pgmap
>>>> accept lc 11925 fc 0 pn 3501 opn 0 gv {}) v2 ==== 88+0+0 (3719346061 0
>>>> 0) 0x4e60b00 con 0x2c38c60
>>>> -9> 2012-11-11 22:45:11.476091 7f5d49168700 1 --
>>>> 10.255.0.101:6789/0 --> mon.2 10.255.0.102:6789/0 -- paxos(pgmap commit
>>>> lc 11926 fc 0 pn 3501 opn 0 gv {11926=28550}) v2 -- ?+0 0x4e58580
>>>> -8> 2012-11-11 22:45:11.476125 7f5d49168700 1 --
>>>> 10.255.0.101:6789/0 --> mon.2 10.255.0.102:6789/0 -- paxos(pgmap lease
>>>> lc 11926 fc 11425 pn 0 opn 0 gv {}) v2 -- ?+0 0x4e59080
>>>> -7> 2012-11-11 22:45:11.593945 7f5d49168700 0 log [INF] : pgmap
>>>> v11926: 7032 pgs: 5977 active+clean, 62 active+remapped+wait_backfill,
>>>> 586 active+degraded+wait_backfill, 32 active+remapped+backfilling, 18
>>>> active+degraded+backfilling, 343 active+degraded+remapped+wait_backfill,
>>>> 2 remapped+peering, 12 active+degraded+remapped+backfilling; 61211 MB
>>>> data, 61662 MB used, 4185 GB / 4245 GB avail; 5419/31221 degraded
>>>> (17.357%)
>>>> -6> 2012-11-11 22:45:11.593954 7f5d49168700 1 --
>>>> 10.255.0.101:6789/0 --> mon.1 10.255.0.101:6789/0 -- log(1 entries) v1
>>>> -- ?+0 0x40c7200
>>>> -5> 2012-11-11 22:45:11.593969 7f5d49168700 1 --
>>>> 10.255.0.101:6789/0 --> 10.255.0.103:6803/3461 -- pg_stats_ack(460 pgs
>>>> tid 185) v1 -- ?+0 0x3ab5a80 con 0x3bd5420
>>>> -4> 2012-11-11 22:45:11.594050 7f5d49168700 1 --
>>>> 10.255.0.101:6789/0 --> 10.255.0.102:6806/3537 -- pg_stats_ack(478 pgs
>>>> tid 185) v1 -- ?+0 0x4f0b8c0 con 0x2ca4dc0
>>>> -3> 2012-11-11 22:45:11.594359 7f5d49168700 1
>>>> mon.b@1(leader).paxos(pgmap active c 11426..11926) is_readable
>>>> now=2012-11-11 22:45:11.594363 lease_expire=2012-11-11 22:45:16.476123
>>>> has v1017 lc 11926
>>>> -2> 2012-11-11 22:45:11.594736 7f5d49168700 1 --
>>>> 10.255.0.101:6789/0 <== mon.1 10.255.0.101:6789/0 0 ==== log(1 entries)
>>>> v1 ==== 0+0+0 (0 0 0) 0x40c7200 con 0x2c382c0
>>>> -1> 2012-11-11 22:45:11.595139 7f5d49969700 1 mon.b@1(leader).osd
>>>> e1017 we have enough reports/reporters to mark osd.53 down
>>>> 0> 2012-11-11 22:45:11.595601 7f5d49969700 -1 ./osd/OSDMap.h: In
>>>> function 'entity_inst_t OSDMap::get_inst(int) const' thread 7f5d49969700
>>>> time 2012-11-11 22:45:11.595149
>>>> ./osd/OSDMap.h: 345: FAILED assert(is_up(osd))
>>>>
>>>> ceph version 0.53-758-g26e5f2d
>>>> (26e5f2d63f569b955bb07b50aa8f930ed9450bc4)
>>>> 1: (OSDMonitor::check_failure(utime_t, int, failure_info_t&)+0xc09)
>>>> [0x4b14e9]
>>>> 2: (OSDMonitor::check_failures(utime_t)+0x3a) [0x4b16ca]
>>>> 3: (OSDMonitor::tick()+0xab) [0x4b213b]
>>>> 4: (Monitor::tick()+0x6d) [0x477fcd]
>>>> 5: (SafeTimer::timer_thread()+0x453) [0x58f203]
>>>> 6: (SafeTimerThread::entry()+0xd) [0x5913dd]
>>>> 7: (()+0x68ca) [0x7f5d4dbce8ca]
>>>> 8: (clone()+0x6d) [0x7f5d4c456bfd]
>>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>>>> needed to interpret this.
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 11+ messages in thread* Re: [BUG] ceph-mon crashes
2012-11-12 6:21 ` Stefan Priebe - Profihost AG
@ 2012-11-12 14:58 ` Joao Eduardo Luis
2012-11-12 15:10 ` Stefan Priebe - Profihost AG
0 siblings, 1 reply; 11+ messages in thread
From: Joao Eduardo Luis @ 2012-11-12 14:58 UTC (permalink / raw)
To: Stefan Priebe - Profihost AG; +Cc: ceph-devel@vger.kernel.org
Hi Stefan,
Any chance you can get me a larger chunk of the log from the monitor
that was the leader by the time you issued those commands until the
point the monitor crashed (from the excerpt you provided, that should be
mon.b)?
-Joao
On 11/12/2012 06:21 AM, Stefan Priebe - Profihost AG wrote:
>
> Am 12.11.2012 um 00:26 schrieb Joao Eduardo Luis <joao.luis@inktank.com>:
>
>> On 11/11/2012 10:22 PM, Stefan Priebe wrote:
>>> I wanted to remove osd.51 to 54. And executed the following commands:
>>> ceph osd 51 out
>>> ceph osd 52 out
>>> ceph osd 53 out
>>> ceph osd 54 out
>>>
>>> Greets,
>>> Stefan
>>
>> Were those osds shutdown/killed before issuing the out commands?
> No
>
>
>>
>> And I'm assuming all the commands completed successfully with a 'marked
>> out osd.X'. Is this correct?
> Yes
>
> Stefan
>
>
>>
>> -Joao
>>
>>
>>> Am 11.11.2012 23:20, schrieb Joao Eduardo Luis:
>>>> Hi Stefan,
>>>>
>>>> Any chance you can get us the output of `ceph osd dump | grep 'osd.53'`?
>>>>
>>>> Also, do you have any idea what happened to osd.53?
>>>>
>>>> -Joao
>>>>
>>>> On 11/11/2012 09:51 PM, Stefan Priebe wrote:
>>>>> Hello list,
>>>>>
>>>>> i've now seen the following ceph-mon crash several times:
>>>>> -22> 2012-11-11 22:45:10.941641 7f5d49168700 1
>>>>> mon.b@1(leader).paxos(pgmap active c 11425..11925) is_readable
>>>>> now=2012-11-11 22:45:10.941643 lease_expire=2012-11-11 22:45:15.313577
>>>>> has v1017 lc 11925
>>>>> -21> 2012-11-11 22:45:11.017199 7f5d49168700 1 --
>>>>> 10.255.0.101:6789/0 <== mon.2 10.255.0.102:6789/0 123 ==== paxos(logm
>>>>> lease_ack lc 14415 fc 13913 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (1926207536
>>>>> 0 0) 0x58fe840 con 0x2c38c60
>>>>> -20> 2012-11-11 22:45:11.084775 7f5d49168700 1 --
>>>>> 10.255.0.101:6789/0 <== mon.2 10.255.0.102:6789/0 124 ==== paxos(mdsmap
>>>>> lease_ack lc 1 fc 1 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (2037515134 0 0)
>>>>> 0x4e59080 con 0x2c38c60
>>>>> -19> 2012-11-11 22:45:11.084812 7f5d49168700 1 --
>>>>> 10.255.0.101:6789/0 <== mon.2 10.255.0.102:6789/0 125 ==== paxos(monmap
>>>>> lease_ack lc 1 fc 1 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (3822314369 0 0)
>>>>> 0x4e58580 con 0x2c38c60
>>>>> -18> 2012-11-11 22:45:11.230970 7f5d48065700 2 --
>>>>> 10.255.0.101:6789/0 >> 10.255.0.100:6789/0 pipe(0x2cc9b40 sd=17 :6789
>>>>> pgs=16 cs=2 l=0).connect error 10.255.0.100:6789/0, 111: Connection
>>>>> refused
>>>>> -17> 2012-11-11 22:45:11.231004 7f5d48065700 2 --
>>>>> 10.255.0.101:6789/0 >> 10.255.0.100:6789/0 pipe(0x2cc9b40 sd=17 :6789
>>>>> pgs=16 cs=2 l=0).fault 111: Connection refused
>>>>> -16> 2012-11-11 22:45:11.238427 7f5d49168700 1 --
>>>>> 10.255.0.101:6789/0 <== osd.33 10.255.0.102:6806/3537 1315 ====
>>>>> pg_stats(478 pgs tid 185 v 1017) v1 ==== 158856+0+0 (1263639313 0 0)
>>>>> 0x2ca5b40 con 0x2ca4dc0
>>>>> -15> 2012-11-11 22:45:11.238451 7f5d49168700 1
>>>>> mon.b@1(leader).paxos(pgmap active c 11425..11925) is_readable
>>>>> now=2012-11-11 22:45:11.238452 lease_expire=2012-11-11 22:45:15.313577
>>>>> has v1017 lc 11925
>>>>> -14> 2012-11-11 22:45:11.287036 7f5d49969700 5
>>>>> mon.b@1(leader).paxos(pgmap active c 11425..11925) propose_new_value
>>>>> 11926 311714 bytes, gv 28550
>>>>> -13> 2012-11-11 22:45:11.296084 7f5d49168700 1 --
>>>>> 10.255.0.101:6789/0 <== osd.41 10.255.0.103:6800/3324 1086 ====
>>>>> pg_stats(440 pgs tid 185 v 1017) v1 ==== 146276+0+0 (1903337195 0 0)
>>>>> 0x39bad80 con 0x2ca4c60
>>>>> -12> 2012-11-11 22:45:11.350697 7f5d49969700 1 --
>>>>> 10.255.0.101:6789/0 --> mon.2 10.255.0.102:6789/0 -- paxos(pgmap begin
>>>>> lc 11925 fc 0 pn 3501 opn 0 gv {11926=28550}) v2 -- ?+0 0x4e60b00
>>>>> -11> 2012-11-11 22:45:11.350790 7f5d49168700 1
>>>>> mon.b@1(leader).paxos(pgmap updating c 11425..11925) is_readable
>>>>> now=2012-11-11 22:45:11.350793 lease_expire=2012-11-11 22:45:15.313577
>>>>> has v1017 lc 11925
>>>>> -10> 2012-11-11 22:45:11.435570 7f5d49168700 1 --
>>>>> 10.255.0.101:6789/0 <== mon.2 10.255.0.102:6789/0 126 ==== paxos(pgmap
>>>>> accept lc 11925 fc 0 pn 3501 opn 0 gv {}) v2 ==== 88+0+0 (3719346061 0
>>>>> 0) 0x4e60b00 con 0x2c38c60
>>>>> -9> 2012-11-11 22:45:11.476091 7f5d49168700 1 --
>>>>> 10.255.0.101:6789/0 --> mon.2 10.255.0.102:6789/0 -- paxos(pgmap commit
>>>>> lc 11926 fc 0 pn 3501 opn 0 gv {11926=28550}) v2 -- ?+0 0x4e58580
>>>>> -8> 2012-11-11 22:45:11.476125 7f5d49168700 1 --
>>>>> 10.255.0.101:6789/0 --> mon.2 10.255.0.102:6789/0 -- paxos(pgmap lease
>>>>> lc 11926 fc 11425 pn 0 opn 0 gv {}) v2 -- ?+0 0x4e59080
>>>>> -7> 2012-11-11 22:45:11.593945 7f5d49168700 0 log [INF] : pgmap
>>>>> v11926: 7032 pgs: 5977 active+clean, 62 active+remapped+wait_backfill,
>>>>> 586 active+degraded+wait_backfill, 32 active+remapped+backfilling, 18
>>>>> active+degraded+backfilling, 343 active+degraded+remapped+wait_backfill,
>>>>> 2 remapped+peering, 12 active+degraded+remapped+backfilling; 61211 MB
>>>>> data, 61662 MB used, 4185 GB / 4245 GB avail; 5419/31221 degraded
>>>>> (17.357%)
>>>>> -6> 2012-11-11 22:45:11.593954 7f5d49168700 1 --
>>>>> 10.255.0.101:6789/0 --> mon.1 10.255.0.101:6789/0 -- log(1 entries) v1
>>>>> -- ?+0 0x40c7200
>>>>> -5> 2012-11-11 22:45:11.593969 7f5d49168700 1 --
>>>>> 10.255.0.101:6789/0 --> 10.255.0.103:6803/3461 -- pg_stats_ack(460 pgs
>>>>> tid 185) v1 -- ?+0 0x3ab5a80 con 0x3bd5420
>>>>> -4> 2012-11-11 22:45:11.594050 7f5d49168700 1 --
>>>>> 10.255.0.101:6789/0 --> 10.255.0.102:6806/3537 -- pg_stats_ack(478 pgs
>>>>> tid 185) v1 -- ?+0 0x4f0b8c0 con 0x2ca4dc0
>>>>> -3> 2012-11-11 22:45:11.594359 7f5d49168700 1
>>>>> mon.b@1(leader).paxos(pgmap active c 11426..11926) is_readable
>>>>> now=2012-11-11 22:45:11.594363 lease_expire=2012-11-11 22:45:16.476123
>>>>> has v1017 lc 11926
>>>>> -2> 2012-11-11 22:45:11.594736 7f5d49168700 1 --
>>>>> 10.255.0.101:6789/0 <== mon.1 10.255.0.101:6789/0 0 ==== log(1 entries)
>>>>> v1 ==== 0+0+0 (0 0 0) 0x40c7200 con 0x2c382c0
>>>>> -1> 2012-11-11 22:45:11.595139 7f5d49969700 1 mon.b@1(leader).osd
>>>>> e1017 we have enough reports/reporters to mark osd.53 down
>>>>> 0> 2012-11-11 22:45:11.595601 7f5d49969700 -1 ./osd/OSDMap.h: In
>>>>> function 'entity_inst_t OSDMap::get_inst(int) const' thread 7f5d49969700
>>>>> time 2012-11-11 22:45:11.595149
>>>>> ./osd/OSDMap.h: 345: FAILED assert(is_up(osd))
>>>>>
>>>>> ceph version 0.53-758-g26e5f2d
>>>>> (26e5f2d63f569b955bb07b50aa8f930ed9450bc4)
>>>>> 1: (OSDMonitor::check_failure(utime_t, int, failure_info_t&)+0xc09)
>>>>> [0x4b14e9]
>>>>> 2: (OSDMonitor::check_failures(utime_t)+0x3a) [0x4b16ca]
>>>>> 3: (OSDMonitor::tick()+0xab) [0x4b213b]
>>>>> 4: (Monitor::tick()+0x6d) [0x477fcd]
>>>>> 5: (SafeTimer::timer_thread()+0x453) [0x58f203]
>>>>> 6: (SafeTimerThread::entry()+0xd) [0x5913dd]
>>>>> 7: (()+0x68ca) [0x7f5d4dbce8ca]
>>>>> 8: (clone()+0x6d) [0x7f5d4c456bfd]
>>>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>>>>> needed to interpret this.
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 11+ messages in thread* Re: [BUG] ceph-mon crashes
2012-11-12 14:58 ` Joao Eduardo Luis
@ 2012-11-12 15:10 ` Stefan Priebe - Profihost AG
2012-11-12 17:05 ` Joao Eduardo Luis
0 siblings, 1 reply; 11+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-11-12 15:10 UTC (permalink / raw)
To: Joao Eduardo Luis; +Cc: ceph-devel@vger.kernel.org
Am 12.11.2012 15:58, schrieb Joao Eduardo Luis:
> Hi Stefan,
>
>
> Any chance you can get me a larger chunk of the log from the monitor
> that was the leader by the time you issued those commands until the
> point the monitor crashed (from the excerpt you provided, that should be
> mon.b)?
Sure:
https://www.dropbox.com/s/8e604bihk56m0yd/ceph-mon.b.log.1.gz
Greets,
Stefan
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [BUG] ceph-mon crashes
2012-11-12 15:10 ` Stefan Priebe - Profihost AG
@ 2012-11-12 17:05 ` Joao Eduardo Luis
2012-11-12 18:30 ` Stefan Priebe
0 siblings, 1 reply; 11+ messages in thread
From: Joao Eduardo Luis @ 2012-11-12 17:05 UTC (permalink / raw)
To: Stefan Priebe - Profihost AG; +Cc: ceph-devel@vger.kernel.org
On 11/12/2012 03:10 PM, Stefan Priebe - Profihost AG wrote:
> Am 12.11.2012 15:58, schrieb Joao Eduardo Luis:
>> Hi Stefan,
>>
>>
>> Any chance you can get me a larger chunk of the log from the monitor
>> that was the leader by the time you issued those commands until the
>> point the monitor crashed (from the excerpt you provided, that should be
>> mon.b)?
>
> Sure:
> https://www.dropbox.com/s/8e604bihk56m0yd/ceph-mon.b.log.1.gz
>
> Greets,
> Stefan
Hi Stefan,
Thanks for the log.
Can you please confirm me that sometime between you issuing the out
command and mon.b failing, you had yet another monitor (maybe mon.a)
that was the leader but for some reason it was down by the time that
mon.b failed?
If so, could you provide the log for that monitor as well, given this
log doesn't have some infos I'm looking for?
-Joao
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [BUG] ceph-mon crashes
2012-11-12 17:05 ` Joao Eduardo Luis
@ 2012-11-12 18:30 ` Stefan Priebe
2012-11-12 20:40 ` Joao Eduardo Luis
0 siblings, 1 reply; 11+ messages in thread
From: Stefan Priebe @ 2012-11-12 18:30 UTC (permalink / raw)
To: Joao Eduardo Luis; +Cc: ceph-devel@vger.kernel.org
Hi Joao,
Am 12.11.2012 18:05, schrieb Joao Eduardo Luis:
> Can you please confirm me that sometime between you issuing the out
> command and mon.b failing, you had yet another monitor (maybe mon.a)
> that was the leader but for some reason it was down by the time that
> mon.b failed?
>
> If so, could you provide the log for that monitor as well, given this
> log doesn't have some infos I'm looking for?
Not sure but here are the logs of the other two mons:
https://www.dropbox.com/s/jztsedvj1b2kjje/ceph-mon.a.log.1.gz
https://www.dropbox.com/s/62jkfbbbgvs5o25/ceph-mon.c.log.1.gz
Thanks,
Stefan
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [BUG] ceph-mon crashes
2012-11-12 18:30 ` Stefan Priebe
@ 2012-11-12 20:40 ` Joao Eduardo Luis
2012-11-12 21:20 ` Stefan Priebe
0 siblings, 1 reply; 11+ messages in thread
From: Joao Eduardo Luis @ 2012-11-12 20:40 UTC (permalink / raw)
To: Stefan Priebe; +Cc: ceph-devel@vger.kernel.org
On 11/12/2012 06:30 PM, Stefan Priebe wrote:
> Hi Joao,
>
> Am 12.11.2012 18:05, schrieb Joao Eduardo Luis:
>> Can you please confirm me that sometime between you issuing the out
>> command and mon.b failing, you had yet another monitor (maybe mon.a)
>> that was the leader but for some reason it was down by the time that
>> mon.b failed?
>>
>> If so, could you provide the log for that monitor as well, given this
>> log doesn't have some infos I'm looking for?
> Not sure but here are the logs of the other two mons:
> https://www.dropbox.com/s/jztsedvj1b2kjje/ceph-mon.a.log.1.gz
>
> https://www.dropbox.com/s/62jkfbbbgvs5o25/ceph-mon.c.log.1.gz
>
> Thanks,
> Stefan
Thanks Stefan,
I'll be looking into this.
For future reference, I created issue #3477 on the tracker:
http://tracker.newdream.net/issues/3477
-Joao
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [BUG] ceph-mon crashes
2012-11-12 20:40 ` Joao Eduardo Luis
@ 2012-11-12 21:20 ` Stefan Priebe
0 siblings, 0 replies; 11+ messages in thread
From: Stefan Priebe @ 2012-11-12 21:20 UTC (permalink / raw)
To: Joao Eduardo Luis; +Cc: ceph-devel@vger.kernel.org
Thanks i'm subscribed to the tracker now.
Stefan
Am 12.11.2012 21:40, schrieb Joao Eduardo Luis:
> On 11/12/2012 06:30 PM, Stefan Priebe wrote:
>> Hi Joao,
>>
>> Am 12.11.2012 18:05, schrieb Joao Eduardo Luis:
>>> Can you please confirm me that sometime between you issuing the out
>>> command and mon.b failing, you had yet another monitor (maybe mon.a)
>>> that was the leader but for some reason it was down by the time that
>>> mon.b failed?
>>>
>>> If so, could you provide the log for that monitor as well, given this
>>> log doesn't have some infos I'm looking for?
>> Not sure but here are the logs of the other two mons:
>> https://www.dropbox.com/s/jztsedvj1b2kjje/ceph-mon.a.log.1.gz
>>
>> https://www.dropbox.com/s/62jkfbbbgvs5o25/ceph-mon.c.log.1.gz
>>
>> Thanks,
>> Stefan
>
> Thanks Stefan,
>
> I'll be looking into this.
>
> For future reference, I created issue #3477 on the tracker:
> http://tracker.newdream.net/issues/3477
>
> -Joao
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2012-11-12 21:20 UTC | newest]
Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-11-11 21:51 [BUG] ceph-mon crashes Stefan Priebe
2012-11-11 22:20 ` Joao Eduardo Luis
2012-11-11 22:22 ` Stefan Priebe
2012-11-11 23:26 ` Joao Eduardo Luis
2012-11-12 6:21 ` Stefan Priebe - Profihost AG
2012-11-12 14:58 ` Joao Eduardo Luis
2012-11-12 15:10 ` Stefan Priebe - Profihost AG
2012-11-12 17:05 ` Joao Eduardo Luis
2012-11-12 18:30 ` Stefan Priebe
2012-11-12 20:40 ` Joao Eduardo Luis
2012-11-12 21:20 ` Stefan Priebe
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.