All of lore.kernel.org
 help / color / mirror / Atom feed
* [BUG] ceph-mon crashes
@ 2012-11-11 21:51 Stefan Priebe
  2012-11-11 22:20 ` Joao Eduardo Luis
  0 siblings, 1 reply; 11+ messages in thread
From: Stefan Priebe @ 2012-11-11 21:51 UTC (permalink / raw)
  To: ceph-devel@vger.kernel.org

Hello list,

i've now seen the following ceph-mon crash several times:
    -22> 2012-11-11 22:45:10.941641 7f5d49168700  1 
mon.b@1(leader).paxos(pgmap active c 11425..11925) is_readable 
now=2012-11-11 22:45:10.941643 lease_expire=2012-11-11 22:45:15.313577 
has v1017 lc 11925
    -21> 2012-11-11 22:45:11.017199 7f5d49168700  1 -- 
10.255.0.101:6789/0 <== mon.2 10.255.0.102:6789/0 123 ==== paxos(logm 
lease_ack lc 14415 fc 13913 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (1926207536 
0 0) 0x58fe840 con 0x2c38c60
    -20> 2012-11-11 22:45:11.084775 7f5d49168700  1 -- 
10.255.0.101:6789/0 <== mon.2 10.255.0.102:6789/0 124 ==== paxos(mdsmap 
lease_ack lc 1 fc 1 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (2037515134 0 0) 
0x4e59080 con 0x2c38c60
    -19> 2012-11-11 22:45:11.084812 7f5d49168700  1 -- 
10.255.0.101:6789/0 <== mon.2 10.255.0.102:6789/0 125 ==== paxos(monmap 
lease_ack lc 1 fc 1 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (3822314369 0 0) 
0x4e58580 con 0x2c38c60
    -18> 2012-11-11 22:45:11.230970 7f5d48065700  2 -- 
10.255.0.101:6789/0 >> 10.255.0.100:6789/0 pipe(0x2cc9b40 sd=17 :6789 
pgs=16 cs=2 l=0).connect error 10.255.0.100:6789/0, 111: Connection refused
    -17> 2012-11-11 22:45:11.231004 7f5d48065700  2 -- 
10.255.0.101:6789/0 >> 10.255.0.100:6789/0 pipe(0x2cc9b40 sd=17 :6789 
pgs=16 cs=2 l=0).fault 111: Connection refused
    -16> 2012-11-11 22:45:11.238427 7f5d49168700  1 -- 
10.255.0.101:6789/0 <== osd.33 10.255.0.102:6806/3537 1315 ==== 
pg_stats(478 pgs tid 185 v 1017) v1 ==== 158856+0+0 (1263639313 0 0) 
0x2ca5b40 con 0x2ca4dc0
    -15> 2012-11-11 22:45:11.238451 7f5d49168700  1 
mon.b@1(leader).paxos(pgmap active c 11425..11925) is_readable 
now=2012-11-11 22:45:11.238452 lease_expire=2012-11-11 22:45:15.313577 
has v1017 lc 11925
    -14> 2012-11-11 22:45:11.287036 7f5d49969700  5 
mon.b@1(leader).paxos(pgmap active c 11425..11925) propose_new_value 
11926 311714 bytes, gv 28550
    -13> 2012-11-11 22:45:11.296084 7f5d49168700  1 -- 
10.255.0.101:6789/0 <== osd.41 10.255.0.103:6800/3324 1086 ==== 
pg_stats(440 pgs tid 185 v 1017) v1 ==== 146276+0+0 (1903337195 0 0) 
0x39bad80 con 0x2ca4c60
    -12> 2012-11-11 22:45:11.350697 7f5d49969700  1 -- 
10.255.0.101:6789/0 --> mon.2 10.255.0.102:6789/0 -- paxos(pgmap begin 
lc 11925 fc 0 pn 3501 opn 0 gv {11926=28550}) v2 -- ?+0 0x4e60b00
    -11> 2012-11-11 22:45:11.350790 7f5d49168700  1 
mon.b@1(leader).paxos(pgmap updating c 11425..11925) is_readable 
now=2012-11-11 22:45:11.350793 lease_expire=2012-11-11 22:45:15.313577 
has v1017 lc 11925
    -10> 2012-11-11 22:45:11.435570 7f5d49168700  1 -- 
10.255.0.101:6789/0 <== mon.2 10.255.0.102:6789/0 126 ==== paxos(pgmap 
accept lc 11925 fc 0 pn 3501 opn 0 gv {}) v2 ==== 88+0+0 (3719346061 0 
0) 0x4e60b00 con 0x2c38c60
     -9> 2012-11-11 22:45:11.476091 7f5d49168700  1 -- 
10.255.0.101:6789/0 --> mon.2 10.255.0.102:6789/0 -- paxos(pgmap commit 
lc 11926 fc 0 pn 3501 opn 0 gv {11926=28550}) v2 -- ?+0 0x4e58580
     -8> 2012-11-11 22:45:11.476125 7f5d49168700  1 -- 
10.255.0.101:6789/0 --> mon.2 10.255.0.102:6789/0 -- paxos(pgmap lease 
lc 11926 fc 11425 pn 0 opn 0 gv {}) v2 -- ?+0 0x4e59080
     -7> 2012-11-11 22:45:11.593945 7f5d49168700  0 log [INF] : pgmap 
v11926: 7032 pgs: 5977 active+clean, 62 active+remapped+wait_backfill, 
586 active+degraded+wait_backfill, 32 active+remapped+backfilling, 18 
active+degraded+backfilling, 343 active+degraded+remapped+wait_backfill, 
2 remapped+peering, 12 active+degraded+remapped+backfilling; 61211 MB 
data, 61662 MB used, 4185 GB / 4245 GB avail; 5419/31221 degraded (17.357%)
     -6> 2012-11-11 22:45:11.593954 7f5d49168700  1 -- 
10.255.0.101:6789/0 --> mon.1 10.255.0.101:6789/0 -- log(1 entries) v1 
-- ?+0 0x40c7200
     -5> 2012-11-11 22:45:11.593969 7f5d49168700  1 -- 
10.255.0.101:6789/0 --> 10.255.0.103:6803/3461 -- pg_stats_ack(460 pgs 
tid 185) v1 -- ?+0 0x3ab5a80 con 0x3bd5420
     -4> 2012-11-11 22:45:11.594050 7f5d49168700  1 -- 
10.255.0.101:6789/0 --> 10.255.0.102:6806/3537 -- pg_stats_ack(478 pgs 
tid 185) v1 -- ?+0 0x4f0b8c0 con 0x2ca4dc0
     -3> 2012-11-11 22:45:11.594359 7f5d49168700  1 
mon.b@1(leader).paxos(pgmap active c 11426..11926) is_readable 
now=2012-11-11 22:45:11.594363 lease_expire=2012-11-11 22:45:16.476123 
has v1017 lc 11926
     -2> 2012-11-11 22:45:11.594736 7f5d49168700  1 -- 
10.255.0.101:6789/0 <== mon.1 10.255.0.101:6789/0 0 ==== log(1 entries) 
v1 ==== 0+0+0 (0 0 0) 0x40c7200 con 0x2c382c0
     -1> 2012-11-11 22:45:11.595139 7f5d49969700  1 mon.b@1(leader).osd 
e1017  we have enough reports/reporters to mark osd.53 down
      0> 2012-11-11 22:45:11.595601 7f5d49969700 -1 ./osd/OSDMap.h: In 
function 'entity_inst_t OSDMap::get_inst(int) const' thread 7f5d49969700 
time 2012-11-11 22:45:11.595149
./osd/OSDMap.h: 345: FAILED assert(is_up(osd))

  ceph version 0.53-758-g26e5f2d (26e5f2d63f569b955bb07b50aa8f930ed9450bc4)
  1: (OSDMonitor::check_failure(utime_t, int, failure_info_t&)+0xc09) 
[0x4b14e9]
  2: (OSDMonitor::check_failures(utime_t)+0x3a) [0x4b16ca]
  3: (OSDMonitor::tick()+0xab) [0x4b213b]
  4: (Monitor::tick()+0x6d) [0x477fcd]
  5: (SafeTimer::timer_thread()+0x453) [0x58f203]
  6: (SafeTimerThread::entry()+0xd) [0x5913dd]
  7: (()+0x68ca) [0x7f5d4dbce8ca]
  8: (clone()+0x6d) [0x7f5d4c456bfd]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is 
needed to interpret this.

--- logging levels ---
    0/ 5 none
    0/ 5 lockdep
    0/ 5 context
    1/ 5 crush
    1/ 5 mds
    1/ 5 mds_balancer
    1/ 5 mds_locker
    1/ 5 mds_log
    1/ 5 mds_log_expire
    1/ 5 mds_migrator
    0/ 0 buffer
    0/ 5 timer
    0/ 5 filer
    0/ 5 striper
    0/ 0 objecter
    0/ 5 rados
    0/ 5 rbd
    0/ 5 journaler
    0/ 5 objectcacher
    0/ 5 client
    0/ 5 osd
    0/ 5 optracker
    0/ 5 objclass
    1/ 5 filestore
    1/ 5 journal
    0/ 5 ms
    1/ 5 mon
    0/ 5 monc
    0/ 5 paxos
    0/ 5 tp
    1/ 5 auth
    1/ 5 finisher
    1/ 5 heartbeatmap
    1/ 5 perfcounter
    1/ 5 rgw
    1/ 5 hadoop
    1/ 5 javaclient
    1/ 5 asok
    1/ 1 throttle
   -2/-2 (syslog threshold)
   -1/-1 (stderr threshold)
   max_recent     10000
   max_new      1000000
   log_file /var/log/ceph/ceph-mon.b.log
--- end dump of recent events ---
2012-11-11 22:45:11.632718 7f5d49969700 -1 *** Caught signal (Aborted) **
  in thread 7f5d49969700

  ceph version 0.53-758-g26e5f2d (26e5f2d63f569b955bb07b50aa8f930ed9450bc4)
  1: /usr/bin/ceph-mon() [0x53eb29]
  2: (()+0xeff0) [0x7f5d4dbd6ff0]
  3: (gsignal()+0x35) [0x7f5d4c3b9215]
  4: (abort()+0x180) [0x7f5d4c3bc020]
  5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f5d4cc4ddc5]
  6: (()+0xcb166) [0x7f5d4cc4c166]
  7: (()+0xcb193) [0x7f5d4cc4c193]
  8: (()+0xcb28e) [0x7f5d4cc4c28e]
  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x7c9) [0x55f249]
  10: (OSDMonitor::check_failure(utime_t, int, failure_info_t&)+0xc09) 
[0x4b14e9]
  11: (OSDMonitor::check_failures(utime_t)+0x3a) [0x4b16ca]
  12: (OSDMonitor::tick()+0xab) [0x4b213b]
  13: (Monitor::tick()+0x6d) [0x477fcd]
  14: (SafeTimer::timer_thread()+0x453) [0x58f203]
  15: (SafeTimerThread::entry()+0xd) [0x5913dd]
  16: (()+0x68ca) [0x7f5d4dbce8ca]
  17: (clone()+0x6d) [0x7f5d4c456bfd]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is 
needed to interpret this.

--- begin dump of recent events ---
      0> 2012-11-11 22:45:11.632718 7f5d49969700 -1 *** Caught signal 
(Aborted) **
  in thread 7f5d49969700

  ceph version 0.53-758-g26e5f2d (26e5f2d63f569b955bb07b50aa8f930ed9450bc4)
  1: /usr/bin/ceph-mon() [0x53eb29]
  2: (()+0xeff0) [0x7f5d4dbd6ff0]
  3: (gsignal()+0x35) [0x7f5d4c3b9215]
  4: (abort()+0x180) [0x7f5d4c3bc020]
  5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f5d4cc4ddc5]
  6: (()+0xcb166) [0x7f5d4cc4c166]
  7: (()+0xcb193) [0x7f5d4cc4c193]
  8: (()+0xcb28e) [0x7f5d4cc4c28e]
  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x7c9) [0x55f249]
  10: (OSDMonitor::check_failure(utime_t, int, failure_info_t&)+0xc09) 
[0x4b14e9]
  11: (OSDMonitor::check_failures(utime_t)+0x3a) [0x4b16ca]
  12: (OSDMonitor::tick()+0xab) [0x4b213b]
  13: (Monitor::tick()+0x6d) [0x477fcd]
  14: (SafeTimer::timer_thread()+0x453) [0x58f203]
  15: (SafeTimerThread::entry()+0xd) [0x5913dd]
  16: (()+0x68ca) [0x7f5d4dbce8ca]
  17: (clone()+0x6d) [0x7f5d4c456bfd]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is 
needed to interpret this.

--- logging levels ---
    0/ 5 none
    0/ 5 lockdep
    0/ 5 context
    1/ 5 crush
    1/ 5 mds
    1/ 5 mds_balancer
    1/ 5 mds_locker
    1/ 5 mds_log
    1/ 5 mds_log_expire
    1/ 5 mds_migrator
    0/ 0 buffer
    0/ 5 timer
    0/ 5 filer
    0/ 5 striper
    0/ 0 objecter
    0/ 5 rados
    0/ 5 rbd
    0/ 5 journaler
    0/ 5 objectcacher
    0/ 5 client
    0/ 5 osd
    0/ 5 optracker
    0/ 5 objclass
    1/ 5 filestore
    1/ 5 journal
    0/ 5 ms
    1/ 5 mon
    0/ 5 monc
    0/ 5 paxos
    0/ 5 tp
    1/ 5 auth
    1/ 5 finisher
    1/ 5 heartbeatmap
    1/ 5 perfcounter
    1/ 5 rgw
    1/ 5 hadoop
    1/ 5 javaclient
    1/ 5 asok
    1/ 1 throttle
   -2/-2 (syslog threshold)
   -1/-1 (stderr threshold)
   max_recent     10000
   max_new      1000000
   log_file /var/log/ceph/ceph-mon.b.log
--- end dump of recent events ---


Stefan

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [BUG] ceph-mon crashes
  2012-11-11 21:51 [BUG] ceph-mon crashes Stefan Priebe
@ 2012-11-11 22:20 ` Joao Eduardo Luis
  2012-11-11 22:22   ` Stefan Priebe
  0 siblings, 1 reply; 11+ messages in thread
From: Joao Eduardo Luis @ 2012-11-11 22:20 UTC (permalink / raw)
  To: Stefan Priebe; +Cc: ceph-devel@vger.kernel.org

Hi Stefan,

Any chance you can get us the output of `ceph osd dump | grep 'osd.53'`?

Also, do you have any idea what happened to osd.53?

  -Joao

On 11/11/2012 09:51 PM, Stefan Priebe wrote:
> Hello list,
> 
> i've now seen the following ceph-mon crash several times:
>    -22> 2012-11-11 22:45:10.941641 7f5d49168700  1
> mon.b@1(leader).paxos(pgmap active c 11425..11925) is_readable
> now=2012-11-11 22:45:10.941643 lease_expire=2012-11-11 22:45:15.313577
> has v1017 lc 11925
>    -21> 2012-11-11 22:45:11.017199 7f5d49168700  1 --
> 10.255.0.101:6789/0 <== mon.2 10.255.0.102:6789/0 123 ==== paxos(logm
> lease_ack lc 14415 fc 13913 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (1926207536
> 0 0) 0x58fe840 con 0x2c38c60
>    -20> 2012-11-11 22:45:11.084775 7f5d49168700  1 --
> 10.255.0.101:6789/0 <== mon.2 10.255.0.102:6789/0 124 ==== paxos(mdsmap
> lease_ack lc 1 fc 1 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (2037515134 0 0)
> 0x4e59080 con 0x2c38c60
>    -19> 2012-11-11 22:45:11.084812 7f5d49168700  1 --
> 10.255.0.101:6789/0 <== mon.2 10.255.0.102:6789/0 125 ==== paxos(monmap
> lease_ack lc 1 fc 1 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (3822314369 0 0)
> 0x4e58580 con 0x2c38c60
>    -18> 2012-11-11 22:45:11.230970 7f5d48065700  2 --
> 10.255.0.101:6789/0 >> 10.255.0.100:6789/0 pipe(0x2cc9b40 sd=17 :6789
> pgs=16 cs=2 l=0).connect error 10.255.0.100:6789/0, 111: Connection refused
>    -17> 2012-11-11 22:45:11.231004 7f5d48065700  2 --
> 10.255.0.101:6789/0 >> 10.255.0.100:6789/0 pipe(0x2cc9b40 sd=17 :6789
> pgs=16 cs=2 l=0).fault 111: Connection refused
>    -16> 2012-11-11 22:45:11.238427 7f5d49168700  1 --
> 10.255.0.101:6789/0 <== osd.33 10.255.0.102:6806/3537 1315 ====
> pg_stats(478 pgs tid 185 v 1017) v1 ==== 158856+0+0 (1263639313 0 0)
> 0x2ca5b40 con 0x2ca4dc0
>    -15> 2012-11-11 22:45:11.238451 7f5d49168700  1
> mon.b@1(leader).paxos(pgmap active c 11425..11925) is_readable
> now=2012-11-11 22:45:11.238452 lease_expire=2012-11-11 22:45:15.313577
> has v1017 lc 11925
>    -14> 2012-11-11 22:45:11.287036 7f5d49969700  5
> mon.b@1(leader).paxos(pgmap active c 11425..11925) propose_new_value
> 11926 311714 bytes, gv 28550
>    -13> 2012-11-11 22:45:11.296084 7f5d49168700  1 --
> 10.255.0.101:6789/0 <== osd.41 10.255.0.103:6800/3324 1086 ====
> pg_stats(440 pgs tid 185 v 1017) v1 ==== 146276+0+0 (1903337195 0 0)
> 0x39bad80 con 0x2ca4c60
>    -12> 2012-11-11 22:45:11.350697 7f5d49969700  1 --
> 10.255.0.101:6789/0 --> mon.2 10.255.0.102:6789/0 -- paxos(pgmap begin
> lc 11925 fc 0 pn 3501 opn 0 gv {11926=28550}) v2 -- ?+0 0x4e60b00
>    -11> 2012-11-11 22:45:11.350790 7f5d49168700  1
> mon.b@1(leader).paxos(pgmap updating c 11425..11925) is_readable
> now=2012-11-11 22:45:11.350793 lease_expire=2012-11-11 22:45:15.313577
> has v1017 lc 11925
>    -10> 2012-11-11 22:45:11.435570 7f5d49168700  1 --
> 10.255.0.101:6789/0 <== mon.2 10.255.0.102:6789/0 126 ==== paxos(pgmap
> accept lc 11925 fc 0 pn 3501 opn 0 gv {}) v2 ==== 88+0+0 (3719346061 0
> 0) 0x4e60b00 con 0x2c38c60
>     -9> 2012-11-11 22:45:11.476091 7f5d49168700  1 --
> 10.255.0.101:6789/0 --> mon.2 10.255.0.102:6789/0 -- paxos(pgmap commit
> lc 11926 fc 0 pn 3501 opn 0 gv {11926=28550}) v2 -- ?+0 0x4e58580
>     -8> 2012-11-11 22:45:11.476125 7f5d49168700  1 --
> 10.255.0.101:6789/0 --> mon.2 10.255.0.102:6789/0 -- paxos(pgmap lease
> lc 11926 fc 11425 pn 0 opn 0 gv {}) v2 -- ?+0 0x4e59080
>     -7> 2012-11-11 22:45:11.593945 7f5d49168700  0 log [INF] : pgmap
> v11926: 7032 pgs: 5977 active+clean, 62 active+remapped+wait_backfill,
> 586 active+degraded+wait_backfill, 32 active+remapped+backfilling, 18
> active+degraded+backfilling, 343 active+degraded+remapped+wait_backfill,
> 2 remapped+peering, 12 active+degraded+remapped+backfilling; 61211 MB
> data, 61662 MB used, 4185 GB / 4245 GB avail; 5419/31221 degraded (17.357%)
>     -6> 2012-11-11 22:45:11.593954 7f5d49168700  1 --
> 10.255.0.101:6789/0 --> mon.1 10.255.0.101:6789/0 -- log(1 entries) v1
> -- ?+0 0x40c7200
>     -5> 2012-11-11 22:45:11.593969 7f5d49168700  1 --
> 10.255.0.101:6789/0 --> 10.255.0.103:6803/3461 -- pg_stats_ack(460 pgs
> tid 185) v1 -- ?+0 0x3ab5a80 con 0x3bd5420
>     -4> 2012-11-11 22:45:11.594050 7f5d49168700  1 --
> 10.255.0.101:6789/0 --> 10.255.0.102:6806/3537 -- pg_stats_ack(478 pgs
> tid 185) v1 -- ?+0 0x4f0b8c0 con 0x2ca4dc0
>     -3> 2012-11-11 22:45:11.594359 7f5d49168700  1
> mon.b@1(leader).paxos(pgmap active c 11426..11926) is_readable
> now=2012-11-11 22:45:11.594363 lease_expire=2012-11-11 22:45:16.476123
> has v1017 lc 11926
>     -2> 2012-11-11 22:45:11.594736 7f5d49168700  1 --
> 10.255.0.101:6789/0 <== mon.1 10.255.0.101:6789/0 0 ==== log(1 entries)
> v1 ==== 0+0+0 (0 0 0) 0x40c7200 con 0x2c382c0
>     -1> 2012-11-11 22:45:11.595139 7f5d49969700  1 mon.b@1(leader).osd
> e1017  we have enough reports/reporters to mark osd.53 down
>      0> 2012-11-11 22:45:11.595601 7f5d49969700 -1 ./osd/OSDMap.h: In
> function 'entity_inst_t OSDMap::get_inst(int) const' thread 7f5d49969700
> time 2012-11-11 22:45:11.595149
> ./osd/OSDMap.h: 345: FAILED assert(is_up(osd))
> 
>  ceph version 0.53-758-g26e5f2d (26e5f2d63f569b955bb07b50aa8f930ed9450bc4)
>  1: (OSDMonitor::check_failure(utime_t, int, failure_info_t&)+0xc09)
> [0x4b14e9]
>  2: (OSDMonitor::check_failures(utime_t)+0x3a) [0x4b16ca]
>  3: (OSDMonitor::tick()+0xab) [0x4b213b]
>  4: (Monitor::tick()+0x6d) [0x477fcd]
>  5: (SafeTimer::timer_thread()+0x453) [0x58f203]
>  6: (SafeTimerThread::entry()+0xd) [0x5913dd]
>  7: (()+0x68ca) [0x7f5d4dbce8ca]
>  8: (clone()+0x6d) [0x7f5d4c456bfd]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> needed to interpret this.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [BUG] ceph-mon crashes
  2012-11-11 22:20 ` Joao Eduardo Luis
@ 2012-11-11 22:22   ` Stefan Priebe
  2012-11-11 23:26     ` Joao Eduardo Luis
  0 siblings, 1 reply; 11+ messages in thread
From: Stefan Priebe @ 2012-11-11 22:22 UTC (permalink / raw)
  To: Joao Eduardo Luis; +Cc: ceph-devel@vger.kernel.org

I wanted to remove osd.51 to 54. And executed the following commands:
ceph osd 51 out
ceph osd 52 out
ceph osd 53 out
ceph osd 54 out

Greets,
Stefan
Am 11.11.2012 23:20, schrieb Joao Eduardo Luis:
> Hi Stefan,
>
> Any chance you can get us the output of `ceph osd dump | grep 'osd.53'`?
>
> Also, do you have any idea what happened to osd.53?
>
>    -Joao
>
> On 11/11/2012 09:51 PM, Stefan Priebe wrote:
>> Hello list,
>>
>> i've now seen the following ceph-mon crash several times:
>>     -22> 2012-11-11 22:45:10.941641 7f5d49168700  1
>> mon.b@1(leader).paxos(pgmap active c 11425..11925) is_readable
>> now=2012-11-11 22:45:10.941643 lease_expire=2012-11-11 22:45:15.313577
>> has v1017 lc 11925
>>     -21> 2012-11-11 22:45:11.017199 7f5d49168700  1 --
>> 10.255.0.101:6789/0 <== mon.2 10.255.0.102:6789/0 123 ==== paxos(logm
>> lease_ack lc 14415 fc 13913 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (1926207536
>> 0 0) 0x58fe840 con 0x2c38c60
>>     -20> 2012-11-11 22:45:11.084775 7f5d49168700  1 --
>> 10.255.0.101:6789/0 <== mon.2 10.255.0.102:6789/0 124 ==== paxos(mdsmap
>> lease_ack lc 1 fc 1 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (2037515134 0 0)
>> 0x4e59080 con 0x2c38c60
>>     -19> 2012-11-11 22:45:11.084812 7f5d49168700  1 --
>> 10.255.0.101:6789/0 <== mon.2 10.255.0.102:6789/0 125 ==== paxos(monmap
>> lease_ack lc 1 fc 1 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (3822314369 0 0)
>> 0x4e58580 con 0x2c38c60
>>     -18> 2012-11-11 22:45:11.230970 7f5d48065700  2 --
>> 10.255.0.101:6789/0 >> 10.255.0.100:6789/0 pipe(0x2cc9b40 sd=17 :6789
>> pgs=16 cs=2 l=0).connect error 10.255.0.100:6789/0, 111: Connection refused
>>     -17> 2012-11-11 22:45:11.231004 7f5d48065700  2 --
>> 10.255.0.101:6789/0 >> 10.255.0.100:6789/0 pipe(0x2cc9b40 sd=17 :6789
>> pgs=16 cs=2 l=0).fault 111: Connection refused
>>     -16> 2012-11-11 22:45:11.238427 7f5d49168700  1 --
>> 10.255.0.101:6789/0 <== osd.33 10.255.0.102:6806/3537 1315 ====
>> pg_stats(478 pgs tid 185 v 1017) v1 ==== 158856+0+0 (1263639313 0 0)
>> 0x2ca5b40 con 0x2ca4dc0
>>     -15> 2012-11-11 22:45:11.238451 7f5d49168700  1
>> mon.b@1(leader).paxos(pgmap active c 11425..11925) is_readable
>> now=2012-11-11 22:45:11.238452 lease_expire=2012-11-11 22:45:15.313577
>> has v1017 lc 11925
>>     -14> 2012-11-11 22:45:11.287036 7f5d49969700  5
>> mon.b@1(leader).paxos(pgmap active c 11425..11925) propose_new_value
>> 11926 311714 bytes, gv 28550
>>     -13> 2012-11-11 22:45:11.296084 7f5d49168700  1 --
>> 10.255.0.101:6789/0 <== osd.41 10.255.0.103:6800/3324 1086 ====
>> pg_stats(440 pgs tid 185 v 1017) v1 ==== 146276+0+0 (1903337195 0 0)
>> 0x39bad80 con 0x2ca4c60
>>     -12> 2012-11-11 22:45:11.350697 7f5d49969700  1 --
>> 10.255.0.101:6789/0 --> mon.2 10.255.0.102:6789/0 -- paxos(pgmap begin
>> lc 11925 fc 0 pn 3501 opn 0 gv {11926=28550}) v2 -- ?+0 0x4e60b00
>>     -11> 2012-11-11 22:45:11.350790 7f5d49168700  1
>> mon.b@1(leader).paxos(pgmap updating c 11425..11925) is_readable
>> now=2012-11-11 22:45:11.350793 lease_expire=2012-11-11 22:45:15.313577
>> has v1017 lc 11925
>>     -10> 2012-11-11 22:45:11.435570 7f5d49168700  1 --
>> 10.255.0.101:6789/0 <== mon.2 10.255.0.102:6789/0 126 ==== paxos(pgmap
>> accept lc 11925 fc 0 pn 3501 opn 0 gv {}) v2 ==== 88+0+0 (3719346061 0
>> 0) 0x4e60b00 con 0x2c38c60
>>      -9> 2012-11-11 22:45:11.476091 7f5d49168700  1 --
>> 10.255.0.101:6789/0 --> mon.2 10.255.0.102:6789/0 -- paxos(pgmap commit
>> lc 11926 fc 0 pn 3501 opn 0 gv {11926=28550}) v2 -- ?+0 0x4e58580
>>      -8> 2012-11-11 22:45:11.476125 7f5d49168700  1 --
>> 10.255.0.101:6789/0 --> mon.2 10.255.0.102:6789/0 -- paxos(pgmap lease
>> lc 11926 fc 11425 pn 0 opn 0 gv {}) v2 -- ?+0 0x4e59080
>>      -7> 2012-11-11 22:45:11.593945 7f5d49168700  0 log [INF] : pgmap
>> v11926: 7032 pgs: 5977 active+clean, 62 active+remapped+wait_backfill,
>> 586 active+degraded+wait_backfill, 32 active+remapped+backfilling, 18
>> active+degraded+backfilling, 343 active+degraded+remapped+wait_backfill,
>> 2 remapped+peering, 12 active+degraded+remapped+backfilling; 61211 MB
>> data, 61662 MB used, 4185 GB / 4245 GB avail; 5419/31221 degraded (17.357%)
>>      -6> 2012-11-11 22:45:11.593954 7f5d49168700  1 --
>> 10.255.0.101:6789/0 --> mon.1 10.255.0.101:6789/0 -- log(1 entries) v1
>> -- ?+0 0x40c7200
>>      -5> 2012-11-11 22:45:11.593969 7f5d49168700  1 --
>> 10.255.0.101:6789/0 --> 10.255.0.103:6803/3461 -- pg_stats_ack(460 pgs
>> tid 185) v1 -- ?+0 0x3ab5a80 con 0x3bd5420
>>      -4> 2012-11-11 22:45:11.594050 7f5d49168700  1 --
>> 10.255.0.101:6789/0 --> 10.255.0.102:6806/3537 -- pg_stats_ack(478 pgs
>> tid 185) v1 -- ?+0 0x4f0b8c0 con 0x2ca4dc0
>>      -3> 2012-11-11 22:45:11.594359 7f5d49168700  1
>> mon.b@1(leader).paxos(pgmap active c 11426..11926) is_readable
>> now=2012-11-11 22:45:11.594363 lease_expire=2012-11-11 22:45:16.476123
>> has v1017 lc 11926
>>      -2> 2012-11-11 22:45:11.594736 7f5d49168700  1 --
>> 10.255.0.101:6789/0 <== mon.1 10.255.0.101:6789/0 0 ==== log(1 entries)
>> v1 ==== 0+0+0 (0 0 0) 0x40c7200 con 0x2c382c0
>>      -1> 2012-11-11 22:45:11.595139 7f5d49969700  1 mon.b@1(leader).osd
>> e1017  we have enough reports/reporters to mark osd.53 down
>>       0> 2012-11-11 22:45:11.595601 7f5d49969700 -1 ./osd/OSDMap.h: In
>> function 'entity_inst_t OSDMap::get_inst(int) const' thread 7f5d49969700
>> time 2012-11-11 22:45:11.595149
>> ./osd/OSDMap.h: 345: FAILED assert(is_up(osd))
>>
>>   ceph version 0.53-758-g26e5f2d (26e5f2d63f569b955bb07b50aa8f930ed9450bc4)
>>   1: (OSDMonitor::check_failure(utime_t, int, failure_info_t&)+0xc09)
>> [0x4b14e9]
>>   2: (OSDMonitor::check_failures(utime_t)+0x3a) [0x4b16ca]
>>   3: (OSDMonitor::tick()+0xab) [0x4b213b]
>>   4: (Monitor::tick()+0x6d) [0x477fcd]
>>   5: (SafeTimer::timer_thread()+0x453) [0x58f203]
>>   6: (SafeTimerThread::entry()+0xd) [0x5913dd]
>>   7: (()+0x68ca) [0x7f5d4dbce8ca]
>>   8: (clone()+0x6d) [0x7f5d4c456bfd]
>>   NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>> needed to interpret this.
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [BUG] ceph-mon crashes
  2012-11-11 22:22   ` Stefan Priebe
@ 2012-11-11 23:26     ` Joao Eduardo Luis
  2012-11-12  6:21       ` Stefan Priebe - Profihost AG
  0 siblings, 1 reply; 11+ messages in thread
From: Joao Eduardo Luis @ 2012-11-11 23:26 UTC (permalink / raw)
  To: Stefan Priebe; +Cc: ceph-devel@vger.kernel.org

On 11/11/2012 10:22 PM, Stefan Priebe wrote:
> I wanted to remove osd.51 to 54. And executed the following commands:
> ceph osd 51 out
> ceph osd 52 out
> ceph osd 53 out
> ceph osd 54 out
> 
> Greets,
> Stefan

Were those osds shutdown/killed before issuing the out commands?

And I'm assuming all the commands completed successfully with a 'marked
out osd.X'. Is this correct?

  -Joao


> Am 11.11.2012 23:20, schrieb Joao Eduardo Luis:
>> Hi Stefan,
>>
>> Any chance you can get us the output of `ceph osd dump | grep 'osd.53'`?
>>
>> Also, do you have any idea what happened to osd.53?
>>
>>    -Joao
>>
>> On 11/11/2012 09:51 PM, Stefan Priebe wrote:
>>> Hello list,
>>>
>>> i've now seen the following ceph-mon crash several times:
>>>     -22> 2012-11-11 22:45:10.941641 7f5d49168700  1
>>> mon.b@1(leader).paxos(pgmap active c 11425..11925) is_readable
>>> now=2012-11-11 22:45:10.941643 lease_expire=2012-11-11 22:45:15.313577
>>> has v1017 lc 11925
>>>     -21> 2012-11-11 22:45:11.017199 7f5d49168700  1 --
>>> 10.255.0.101:6789/0 <== mon.2 10.255.0.102:6789/0 123 ==== paxos(logm
>>> lease_ack lc 14415 fc 13913 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (1926207536
>>> 0 0) 0x58fe840 con 0x2c38c60
>>>     -20> 2012-11-11 22:45:11.084775 7f5d49168700  1 --
>>> 10.255.0.101:6789/0 <== mon.2 10.255.0.102:6789/0 124 ==== paxos(mdsmap
>>> lease_ack lc 1 fc 1 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (2037515134 0 0)
>>> 0x4e59080 con 0x2c38c60
>>>     -19> 2012-11-11 22:45:11.084812 7f5d49168700  1 --
>>> 10.255.0.101:6789/0 <== mon.2 10.255.0.102:6789/0 125 ==== paxos(monmap
>>> lease_ack lc 1 fc 1 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (3822314369 0 0)
>>> 0x4e58580 con 0x2c38c60
>>>     -18> 2012-11-11 22:45:11.230970 7f5d48065700  2 --
>>> 10.255.0.101:6789/0 >> 10.255.0.100:6789/0 pipe(0x2cc9b40 sd=17 :6789
>>> pgs=16 cs=2 l=0).connect error 10.255.0.100:6789/0, 111: Connection
>>> refused
>>>     -17> 2012-11-11 22:45:11.231004 7f5d48065700  2 --
>>> 10.255.0.101:6789/0 >> 10.255.0.100:6789/0 pipe(0x2cc9b40 sd=17 :6789
>>> pgs=16 cs=2 l=0).fault 111: Connection refused
>>>     -16> 2012-11-11 22:45:11.238427 7f5d49168700  1 --
>>> 10.255.0.101:6789/0 <== osd.33 10.255.0.102:6806/3537 1315 ====
>>> pg_stats(478 pgs tid 185 v 1017) v1 ==== 158856+0+0 (1263639313 0 0)
>>> 0x2ca5b40 con 0x2ca4dc0
>>>     -15> 2012-11-11 22:45:11.238451 7f5d49168700  1
>>> mon.b@1(leader).paxos(pgmap active c 11425..11925) is_readable
>>> now=2012-11-11 22:45:11.238452 lease_expire=2012-11-11 22:45:15.313577
>>> has v1017 lc 11925
>>>     -14> 2012-11-11 22:45:11.287036 7f5d49969700  5
>>> mon.b@1(leader).paxos(pgmap active c 11425..11925) propose_new_value
>>> 11926 311714 bytes, gv 28550
>>>     -13> 2012-11-11 22:45:11.296084 7f5d49168700  1 --
>>> 10.255.0.101:6789/0 <== osd.41 10.255.0.103:6800/3324 1086 ====
>>> pg_stats(440 pgs tid 185 v 1017) v1 ==== 146276+0+0 (1903337195 0 0)
>>> 0x39bad80 con 0x2ca4c60
>>>     -12> 2012-11-11 22:45:11.350697 7f5d49969700  1 --
>>> 10.255.0.101:6789/0 --> mon.2 10.255.0.102:6789/0 -- paxos(pgmap begin
>>> lc 11925 fc 0 pn 3501 opn 0 gv {11926=28550}) v2 -- ?+0 0x4e60b00
>>>     -11> 2012-11-11 22:45:11.350790 7f5d49168700  1
>>> mon.b@1(leader).paxos(pgmap updating c 11425..11925) is_readable
>>> now=2012-11-11 22:45:11.350793 lease_expire=2012-11-11 22:45:15.313577
>>> has v1017 lc 11925
>>>     -10> 2012-11-11 22:45:11.435570 7f5d49168700  1 --
>>> 10.255.0.101:6789/0 <== mon.2 10.255.0.102:6789/0 126 ==== paxos(pgmap
>>> accept lc 11925 fc 0 pn 3501 opn 0 gv {}) v2 ==== 88+0+0 (3719346061 0
>>> 0) 0x4e60b00 con 0x2c38c60
>>>      -9> 2012-11-11 22:45:11.476091 7f5d49168700  1 --
>>> 10.255.0.101:6789/0 --> mon.2 10.255.0.102:6789/0 -- paxos(pgmap commit
>>> lc 11926 fc 0 pn 3501 opn 0 gv {11926=28550}) v2 -- ?+0 0x4e58580
>>>      -8> 2012-11-11 22:45:11.476125 7f5d49168700  1 --
>>> 10.255.0.101:6789/0 --> mon.2 10.255.0.102:6789/0 -- paxos(pgmap lease
>>> lc 11926 fc 11425 pn 0 opn 0 gv {}) v2 -- ?+0 0x4e59080
>>>      -7> 2012-11-11 22:45:11.593945 7f5d49168700  0 log [INF] : pgmap
>>> v11926: 7032 pgs: 5977 active+clean, 62 active+remapped+wait_backfill,
>>> 586 active+degraded+wait_backfill, 32 active+remapped+backfilling, 18
>>> active+degraded+backfilling, 343 active+degraded+remapped+wait_backfill,
>>> 2 remapped+peering, 12 active+degraded+remapped+backfilling; 61211 MB
>>> data, 61662 MB used, 4185 GB / 4245 GB avail; 5419/31221 degraded
>>> (17.357%)
>>>      -6> 2012-11-11 22:45:11.593954 7f5d49168700  1 --
>>> 10.255.0.101:6789/0 --> mon.1 10.255.0.101:6789/0 -- log(1 entries) v1
>>> -- ?+0 0x40c7200
>>>      -5> 2012-11-11 22:45:11.593969 7f5d49168700  1 --
>>> 10.255.0.101:6789/0 --> 10.255.0.103:6803/3461 -- pg_stats_ack(460 pgs
>>> tid 185) v1 -- ?+0 0x3ab5a80 con 0x3bd5420
>>>      -4> 2012-11-11 22:45:11.594050 7f5d49168700  1 --
>>> 10.255.0.101:6789/0 --> 10.255.0.102:6806/3537 -- pg_stats_ack(478 pgs
>>> tid 185) v1 -- ?+0 0x4f0b8c0 con 0x2ca4dc0
>>>      -3> 2012-11-11 22:45:11.594359 7f5d49168700  1
>>> mon.b@1(leader).paxos(pgmap active c 11426..11926) is_readable
>>> now=2012-11-11 22:45:11.594363 lease_expire=2012-11-11 22:45:16.476123
>>> has v1017 lc 11926
>>>      -2> 2012-11-11 22:45:11.594736 7f5d49168700  1 --
>>> 10.255.0.101:6789/0 <== mon.1 10.255.0.101:6789/0 0 ==== log(1 entries)
>>> v1 ==== 0+0+0 (0 0 0) 0x40c7200 con 0x2c382c0
>>>      -1> 2012-11-11 22:45:11.595139 7f5d49969700  1 mon.b@1(leader).osd
>>> e1017  we have enough reports/reporters to mark osd.53 down
>>>       0> 2012-11-11 22:45:11.595601 7f5d49969700 -1 ./osd/OSDMap.h: In
>>> function 'entity_inst_t OSDMap::get_inst(int) const' thread 7f5d49969700
>>> time 2012-11-11 22:45:11.595149
>>> ./osd/OSDMap.h: 345: FAILED assert(is_up(osd))
>>>
>>>   ceph version 0.53-758-g26e5f2d
>>> (26e5f2d63f569b955bb07b50aa8f930ed9450bc4)
>>>   1: (OSDMonitor::check_failure(utime_t, int, failure_info_t&)+0xc09)
>>> [0x4b14e9]
>>>   2: (OSDMonitor::check_failures(utime_t)+0x3a) [0x4b16ca]
>>>   3: (OSDMonitor::tick()+0xab) [0x4b213b]
>>>   4: (Monitor::tick()+0x6d) [0x477fcd]
>>>   5: (SafeTimer::timer_thread()+0x453) [0x58f203]
>>>   6: (SafeTimerThread::entry()+0xd) [0x5913dd]
>>>   7: (()+0x68ca) [0x7f5d4dbce8ca]
>>>   8: (clone()+0x6d) [0x7f5d4c456bfd]
>>>   NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>>> needed to interpret this.
>>
>> -- 
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [BUG] ceph-mon crashes
  2012-11-11 23:26     ` Joao Eduardo Luis
@ 2012-11-12  6:21       ` Stefan Priebe - Profihost AG
  2012-11-12 14:58         ` Joao Eduardo Luis
  0 siblings, 1 reply; 11+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-11-12  6:21 UTC (permalink / raw)
  To: Joao Eduardo Luis; +Cc: ceph-devel@vger.kernel.org


Am 12.11.2012 um 00:26 schrieb Joao Eduardo Luis <joao.luis@inktank.com>:

> On 11/11/2012 10:22 PM, Stefan Priebe wrote:
>> I wanted to remove osd.51 to 54. And executed the following commands:
>> ceph osd 51 out
>> ceph osd 52 out
>> ceph osd 53 out
>> ceph osd 54 out
>> 
>> Greets,
>> Stefan
> 
> Were those osds shutdown/killed before issuing the out commands?
No


> 
> And I'm assuming all the commands completed successfully with a 'marked
> out osd.X'. Is this correct?
Yes

Stefan


> 
>  -Joao
> 
> 
>> Am 11.11.2012 23:20, schrieb Joao Eduardo Luis:
>>> Hi Stefan,
>>> 
>>> Any chance you can get us the output of `ceph osd dump | grep 'osd.53'`?
>>> 
>>> Also, do you have any idea what happened to osd.53?
>>> 
>>>   -Joao
>>> 
>>> On 11/11/2012 09:51 PM, Stefan Priebe wrote:
>>>> Hello list,
>>>> 
>>>> i've now seen the following ceph-mon crash several times:
>>>>    -22> 2012-11-11 22:45:10.941641 7f5d49168700  1
>>>> mon.b@1(leader).paxos(pgmap active c 11425..11925) is_readable
>>>> now=2012-11-11 22:45:10.941643 lease_expire=2012-11-11 22:45:15.313577
>>>> has v1017 lc 11925
>>>>    -21> 2012-11-11 22:45:11.017199 7f5d49168700  1 --
>>>> 10.255.0.101:6789/0 <== mon.2 10.255.0.102:6789/0 123 ==== paxos(logm
>>>> lease_ack lc 14415 fc 13913 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (1926207536
>>>> 0 0) 0x58fe840 con 0x2c38c60
>>>>    -20> 2012-11-11 22:45:11.084775 7f5d49168700  1 --
>>>> 10.255.0.101:6789/0 <== mon.2 10.255.0.102:6789/0 124 ==== paxos(mdsmap
>>>> lease_ack lc 1 fc 1 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (2037515134 0 0)
>>>> 0x4e59080 con 0x2c38c60
>>>>    -19> 2012-11-11 22:45:11.084812 7f5d49168700  1 --
>>>> 10.255.0.101:6789/0 <== mon.2 10.255.0.102:6789/0 125 ==== paxos(monmap
>>>> lease_ack lc 1 fc 1 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (3822314369 0 0)
>>>> 0x4e58580 con 0x2c38c60
>>>>    -18> 2012-11-11 22:45:11.230970 7f5d48065700  2 --
>>>> 10.255.0.101:6789/0 >> 10.255.0.100:6789/0 pipe(0x2cc9b40 sd=17 :6789
>>>> pgs=16 cs=2 l=0).connect error 10.255.0.100:6789/0, 111: Connection
>>>> refused
>>>>    -17> 2012-11-11 22:45:11.231004 7f5d48065700  2 --
>>>> 10.255.0.101:6789/0 >> 10.255.0.100:6789/0 pipe(0x2cc9b40 sd=17 :6789
>>>> pgs=16 cs=2 l=0).fault 111: Connection refused
>>>>    -16> 2012-11-11 22:45:11.238427 7f5d49168700  1 --
>>>> 10.255.0.101:6789/0 <== osd.33 10.255.0.102:6806/3537 1315 ====
>>>> pg_stats(478 pgs tid 185 v 1017) v1 ==== 158856+0+0 (1263639313 0 0)
>>>> 0x2ca5b40 con 0x2ca4dc0
>>>>    -15> 2012-11-11 22:45:11.238451 7f5d49168700  1
>>>> mon.b@1(leader).paxos(pgmap active c 11425..11925) is_readable
>>>> now=2012-11-11 22:45:11.238452 lease_expire=2012-11-11 22:45:15.313577
>>>> has v1017 lc 11925
>>>>    -14> 2012-11-11 22:45:11.287036 7f5d49969700  5
>>>> mon.b@1(leader).paxos(pgmap active c 11425..11925) propose_new_value
>>>> 11926 311714 bytes, gv 28550
>>>>    -13> 2012-11-11 22:45:11.296084 7f5d49168700  1 --
>>>> 10.255.0.101:6789/0 <== osd.41 10.255.0.103:6800/3324 1086 ====
>>>> pg_stats(440 pgs tid 185 v 1017) v1 ==== 146276+0+0 (1903337195 0 0)
>>>> 0x39bad80 con 0x2ca4c60
>>>>    -12> 2012-11-11 22:45:11.350697 7f5d49969700  1 --
>>>> 10.255.0.101:6789/0 --> mon.2 10.255.0.102:6789/0 -- paxos(pgmap begin
>>>> lc 11925 fc 0 pn 3501 opn 0 gv {11926=28550}) v2 -- ?+0 0x4e60b00
>>>>    -11> 2012-11-11 22:45:11.350790 7f5d49168700  1
>>>> mon.b@1(leader).paxos(pgmap updating c 11425..11925) is_readable
>>>> now=2012-11-11 22:45:11.350793 lease_expire=2012-11-11 22:45:15.313577
>>>> has v1017 lc 11925
>>>>    -10> 2012-11-11 22:45:11.435570 7f5d49168700  1 --
>>>> 10.255.0.101:6789/0 <== mon.2 10.255.0.102:6789/0 126 ==== paxos(pgmap
>>>> accept lc 11925 fc 0 pn 3501 opn 0 gv {}) v2 ==== 88+0+0 (3719346061 0
>>>> 0) 0x4e60b00 con 0x2c38c60
>>>>     -9> 2012-11-11 22:45:11.476091 7f5d49168700  1 --
>>>> 10.255.0.101:6789/0 --> mon.2 10.255.0.102:6789/0 -- paxos(pgmap commit
>>>> lc 11926 fc 0 pn 3501 opn 0 gv {11926=28550}) v2 -- ?+0 0x4e58580
>>>>     -8> 2012-11-11 22:45:11.476125 7f5d49168700  1 --
>>>> 10.255.0.101:6789/0 --> mon.2 10.255.0.102:6789/0 -- paxos(pgmap lease
>>>> lc 11926 fc 11425 pn 0 opn 0 gv {}) v2 -- ?+0 0x4e59080
>>>>     -7> 2012-11-11 22:45:11.593945 7f5d49168700  0 log [INF] : pgmap
>>>> v11926: 7032 pgs: 5977 active+clean, 62 active+remapped+wait_backfill,
>>>> 586 active+degraded+wait_backfill, 32 active+remapped+backfilling, 18
>>>> active+degraded+backfilling, 343 active+degraded+remapped+wait_backfill,
>>>> 2 remapped+peering, 12 active+degraded+remapped+backfilling; 61211 MB
>>>> data, 61662 MB used, 4185 GB / 4245 GB avail; 5419/31221 degraded
>>>> (17.357%)
>>>>     -6> 2012-11-11 22:45:11.593954 7f5d49168700  1 --
>>>> 10.255.0.101:6789/0 --> mon.1 10.255.0.101:6789/0 -- log(1 entries) v1
>>>> -- ?+0 0x40c7200
>>>>     -5> 2012-11-11 22:45:11.593969 7f5d49168700  1 --
>>>> 10.255.0.101:6789/0 --> 10.255.0.103:6803/3461 -- pg_stats_ack(460 pgs
>>>> tid 185) v1 -- ?+0 0x3ab5a80 con 0x3bd5420
>>>>     -4> 2012-11-11 22:45:11.594050 7f5d49168700  1 --
>>>> 10.255.0.101:6789/0 --> 10.255.0.102:6806/3537 -- pg_stats_ack(478 pgs
>>>> tid 185) v1 -- ?+0 0x4f0b8c0 con 0x2ca4dc0
>>>>     -3> 2012-11-11 22:45:11.594359 7f5d49168700  1
>>>> mon.b@1(leader).paxos(pgmap active c 11426..11926) is_readable
>>>> now=2012-11-11 22:45:11.594363 lease_expire=2012-11-11 22:45:16.476123
>>>> has v1017 lc 11926
>>>>     -2> 2012-11-11 22:45:11.594736 7f5d49168700  1 --
>>>> 10.255.0.101:6789/0 <== mon.1 10.255.0.101:6789/0 0 ==== log(1 entries)
>>>> v1 ==== 0+0+0 (0 0 0) 0x40c7200 con 0x2c382c0
>>>>     -1> 2012-11-11 22:45:11.595139 7f5d49969700  1 mon.b@1(leader).osd
>>>> e1017  we have enough reports/reporters to mark osd.53 down
>>>>      0> 2012-11-11 22:45:11.595601 7f5d49969700 -1 ./osd/OSDMap.h: In
>>>> function 'entity_inst_t OSDMap::get_inst(int) const' thread 7f5d49969700
>>>> time 2012-11-11 22:45:11.595149
>>>> ./osd/OSDMap.h: 345: FAILED assert(is_up(osd))
>>>> 
>>>>  ceph version 0.53-758-g26e5f2d
>>>> (26e5f2d63f569b955bb07b50aa8f930ed9450bc4)
>>>>  1: (OSDMonitor::check_failure(utime_t, int, failure_info_t&)+0xc09)
>>>> [0x4b14e9]
>>>>  2: (OSDMonitor::check_failures(utime_t)+0x3a) [0x4b16ca]
>>>>  3: (OSDMonitor::tick()+0xab) [0x4b213b]
>>>>  4: (Monitor::tick()+0x6d) [0x477fcd]
>>>>  5: (SafeTimer::timer_thread()+0x453) [0x58f203]
>>>>  6: (SafeTimerThread::entry()+0xd) [0x5913dd]
>>>>  7: (()+0x68ca) [0x7f5d4dbce8ca]
>>>>  8: (clone()+0x6d) [0x7f5d4c456bfd]
>>>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>>>> needed to interpret this.
>>> 
>>> -- 
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [BUG] ceph-mon crashes
  2012-11-12  6:21       ` Stefan Priebe - Profihost AG
@ 2012-11-12 14:58         ` Joao Eduardo Luis
  2012-11-12 15:10           ` Stefan Priebe - Profihost AG
  0 siblings, 1 reply; 11+ messages in thread
From: Joao Eduardo Luis @ 2012-11-12 14:58 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG; +Cc: ceph-devel@vger.kernel.org

Hi Stefan,


Any chance you can get me a larger chunk of the log from the monitor
that was the leader by the time you issued those commands until the
point the monitor crashed (from the excerpt you provided, that should be
mon.b)?


  -Joao
On 11/12/2012 06:21 AM, Stefan Priebe - Profihost AG wrote:
> 
> Am 12.11.2012 um 00:26 schrieb Joao Eduardo Luis <joao.luis@inktank.com>:
> 
>> On 11/11/2012 10:22 PM, Stefan Priebe wrote:
>>> I wanted to remove osd.51 to 54. And executed the following commands:
>>> ceph osd 51 out
>>> ceph osd 52 out
>>> ceph osd 53 out
>>> ceph osd 54 out
>>>
>>> Greets,
>>> Stefan
>>
>> Were those osds shutdown/killed before issuing the out commands?
> No
> 
> 
>>
>> And I'm assuming all the commands completed successfully with a 'marked
>> out osd.X'. Is this correct?
> Yes
> 
> Stefan
> 
> 
>>
>>  -Joao
>>
>>
>>> Am 11.11.2012 23:20, schrieb Joao Eduardo Luis:
>>>> Hi Stefan,
>>>>
>>>> Any chance you can get us the output of `ceph osd dump | grep 'osd.53'`?
>>>>
>>>> Also, do you have any idea what happened to osd.53?
>>>>
>>>>   -Joao
>>>>
>>>> On 11/11/2012 09:51 PM, Stefan Priebe wrote:
>>>>> Hello list,
>>>>>
>>>>> i've now seen the following ceph-mon crash several times:
>>>>>    -22> 2012-11-11 22:45:10.941641 7f5d49168700  1
>>>>> mon.b@1(leader).paxos(pgmap active c 11425..11925) is_readable
>>>>> now=2012-11-11 22:45:10.941643 lease_expire=2012-11-11 22:45:15.313577
>>>>> has v1017 lc 11925
>>>>>    -21> 2012-11-11 22:45:11.017199 7f5d49168700  1 --
>>>>> 10.255.0.101:6789/0 <== mon.2 10.255.0.102:6789/0 123 ==== paxos(logm
>>>>> lease_ack lc 14415 fc 13913 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (1926207536
>>>>> 0 0) 0x58fe840 con 0x2c38c60
>>>>>    -20> 2012-11-11 22:45:11.084775 7f5d49168700  1 --
>>>>> 10.255.0.101:6789/0 <== mon.2 10.255.0.102:6789/0 124 ==== paxos(mdsmap
>>>>> lease_ack lc 1 fc 1 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (2037515134 0 0)
>>>>> 0x4e59080 con 0x2c38c60
>>>>>    -19> 2012-11-11 22:45:11.084812 7f5d49168700  1 --
>>>>> 10.255.0.101:6789/0 <== mon.2 10.255.0.102:6789/0 125 ==== paxos(monmap
>>>>> lease_ack lc 1 fc 1 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (3822314369 0 0)
>>>>> 0x4e58580 con 0x2c38c60
>>>>>    -18> 2012-11-11 22:45:11.230970 7f5d48065700  2 --
>>>>> 10.255.0.101:6789/0 >> 10.255.0.100:6789/0 pipe(0x2cc9b40 sd=17 :6789
>>>>> pgs=16 cs=2 l=0).connect error 10.255.0.100:6789/0, 111: Connection
>>>>> refused
>>>>>    -17> 2012-11-11 22:45:11.231004 7f5d48065700  2 --
>>>>> 10.255.0.101:6789/0 >> 10.255.0.100:6789/0 pipe(0x2cc9b40 sd=17 :6789
>>>>> pgs=16 cs=2 l=0).fault 111: Connection refused
>>>>>    -16> 2012-11-11 22:45:11.238427 7f5d49168700  1 --
>>>>> 10.255.0.101:6789/0 <== osd.33 10.255.0.102:6806/3537 1315 ====
>>>>> pg_stats(478 pgs tid 185 v 1017) v1 ==== 158856+0+0 (1263639313 0 0)
>>>>> 0x2ca5b40 con 0x2ca4dc0
>>>>>    -15> 2012-11-11 22:45:11.238451 7f5d49168700  1
>>>>> mon.b@1(leader).paxos(pgmap active c 11425..11925) is_readable
>>>>> now=2012-11-11 22:45:11.238452 lease_expire=2012-11-11 22:45:15.313577
>>>>> has v1017 lc 11925
>>>>>    -14> 2012-11-11 22:45:11.287036 7f5d49969700  5
>>>>> mon.b@1(leader).paxos(pgmap active c 11425..11925) propose_new_value
>>>>> 11926 311714 bytes, gv 28550
>>>>>    -13> 2012-11-11 22:45:11.296084 7f5d49168700  1 --
>>>>> 10.255.0.101:6789/0 <== osd.41 10.255.0.103:6800/3324 1086 ====
>>>>> pg_stats(440 pgs tid 185 v 1017) v1 ==== 146276+0+0 (1903337195 0 0)
>>>>> 0x39bad80 con 0x2ca4c60
>>>>>    -12> 2012-11-11 22:45:11.350697 7f5d49969700  1 --
>>>>> 10.255.0.101:6789/0 --> mon.2 10.255.0.102:6789/0 -- paxos(pgmap begin
>>>>> lc 11925 fc 0 pn 3501 opn 0 gv {11926=28550}) v2 -- ?+0 0x4e60b00
>>>>>    -11> 2012-11-11 22:45:11.350790 7f5d49168700  1
>>>>> mon.b@1(leader).paxos(pgmap updating c 11425..11925) is_readable
>>>>> now=2012-11-11 22:45:11.350793 lease_expire=2012-11-11 22:45:15.313577
>>>>> has v1017 lc 11925
>>>>>    -10> 2012-11-11 22:45:11.435570 7f5d49168700  1 --
>>>>> 10.255.0.101:6789/0 <== mon.2 10.255.0.102:6789/0 126 ==== paxos(pgmap
>>>>> accept lc 11925 fc 0 pn 3501 opn 0 gv {}) v2 ==== 88+0+0 (3719346061 0
>>>>> 0) 0x4e60b00 con 0x2c38c60
>>>>>     -9> 2012-11-11 22:45:11.476091 7f5d49168700  1 --
>>>>> 10.255.0.101:6789/0 --> mon.2 10.255.0.102:6789/0 -- paxos(pgmap commit
>>>>> lc 11926 fc 0 pn 3501 opn 0 gv {11926=28550}) v2 -- ?+0 0x4e58580
>>>>>     -8> 2012-11-11 22:45:11.476125 7f5d49168700  1 --
>>>>> 10.255.0.101:6789/0 --> mon.2 10.255.0.102:6789/0 -- paxos(pgmap lease
>>>>> lc 11926 fc 11425 pn 0 opn 0 gv {}) v2 -- ?+0 0x4e59080
>>>>>     -7> 2012-11-11 22:45:11.593945 7f5d49168700  0 log [INF] : pgmap
>>>>> v11926: 7032 pgs: 5977 active+clean, 62 active+remapped+wait_backfill,
>>>>> 586 active+degraded+wait_backfill, 32 active+remapped+backfilling, 18
>>>>> active+degraded+backfilling, 343 active+degraded+remapped+wait_backfill,
>>>>> 2 remapped+peering, 12 active+degraded+remapped+backfilling; 61211 MB
>>>>> data, 61662 MB used, 4185 GB / 4245 GB avail; 5419/31221 degraded
>>>>> (17.357%)
>>>>>     -6> 2012-11-11 22:45:11.593954 7f5d49168700  1 --
>>>>> 10.255.0.101:6789/0 --> mon.1 10.255.0.101:6789/0 -- log(1 entries) v1
>>>>> -- ?+0 0x40c7200
>>>>>     -5> 2012-11-11 22:45:11.593969 7f5d49168700  1 --
>>>>> 10.255.0.101:6789/0 --> 10.255.0.103:6803/3461 -- pg_stats_ack(460 pgs
>>>>> tid 185) v1 -- ?+0 0x3ab5a80 con 0x3bd5420
>>>>>     -4> 2012-11-11 22:45:11.594050 7f5d49168700  1 --
>>>>> 10.255.0.101:6789/0 --> 10.255.0.102:6806/3537 -- pg_stats_ack(478 pgs
>>>>> tid 185) v1 -- ?+0 0x4f0b8c0 con 0x2ca4dc0
>>>>>     -3> 2012-11-11 22:45:11.594359 7f5d49168700  1
>>>>> mon.b@1(leader).paxos(pgmap active c 11426..11926) is_readable
>>>>> now=2012-11-11 22:45:11.594363 lease_expire=2012-11-11 22:45:16.476123
>>>>> has v1017 lc 11926
>>>>>     -2> 2012-11-11 22:45:11.594736 7f5d49168700  1 --
>>>>> 10.255.0.101:6789/0 <== mon.1 10.255.0.101:6789/0 0 ==== log(1 entries)
>>>>> v1 ==== 0+0+0 (0 0 0) 0x40c7200 con 0x2c382c0
>>>>>     -1> 2012-11-11 22:45:11.595139 7f5d49969700  1 mon.b@1(leader).osd
>>>>> e1017  we have enough reports/reporters to mark osd.53 down
>>>>>      0> 2012-11-11 22:45:11.595601 7f5d49969700 -1 ./osd/OSDMap.h: In
>>>>> function 'entity_inst_t OSDMap::get_inst(int) const' thread 7f5d49969700
>>>>> time 2012-11-11 22:45:11.595149
>>>>> ./osd/OSDMap.h: 345: FAILED assert(is_up(osd))
>>>>>
>>>>>  ceph version 0.53-758-g26e5f2d
>>>>> (26e5f2d63f569b955bb07b50aa8f930ed9450bc4)
>>>>>  1: (OSDMonitor::check_failure(utime_t, int, failure_info_t&)+0xc09)
>>>>> [0x4b14e9]
>>>>>  2: (OSDMonitor::check_failures(utime_t)+0x3a) [0x4b16ca]
>>>>>  3: (OSDMonitor::tick()+0xab) [0x4b213b]
>>>>>  4: (Monitor::tick()+0x6d) [0x477fcd]
>>>>>  5: (SafeTimer::timer_thread()+0x453) [0x58f203]
>>>>>  6: (SafeTimerThread::entry()+0xd) [0x5913dd]
>>>>>  7: (()+0x68ca) [0x7f5d4dbce8ca]
>>>>>  8: (clone()+0x6d) [0x7f5d4c456bfd]
>>>>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>>>>> needed to interpret this.
>>>>
>>>> -- 
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [BUG] ceph-mon crashes
  2012-11-12 14:58         ` Joao Eduardo Luis
@ 2012-11-12 15:10           ` Stefan Priebe - Profihost AG
  2012-11-12 17:05             ` Joao Eduardo Luis
  0 siblings, 1 reply; 11+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-11-12 15:10 UTC (permalink / raw)
  To: Joao Eduardo Luis; +Cc: ceph-devel@vger.kernel.org

Am 12.11.2012 15:58, schrieb Joao Eduardo Luis:
> Hi Stefan,
>
>
> Any chance you can get me a larger chunk of the log from the monitor
> that was the leader by the time you issued those commands until the
> point the monitor crashed (from the excerpt you provided, that should be
> mon.b)?

Sure:
https://www.dropbox.com/s/8e604bihk56m0yd/ceph-mon.b.log.1.gz

Greets,
Stefan

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [BUG] ceph-mon crashes
  2012-11-12 15:10           ` Stefan Priebe - Profihost AG
@ 2012-11-12 17:05             ` Joao Eduardo Luis
  2012-11-12 18:30               ` Stefan Priebe
  0 siblings, 1 reply; 11+ messages in thread
From: Joao Eduardo Luis @ 2012-11-12 17:05 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG; +Cc: ceph-devel@vger.kernel.org

On 11/12/2012 03:10 PM, Stefan Priebe - Profihost AG wrote:
> Am 12.11.2012 15:58, schrieb Joao Eduardo Luis:
>> Hi Stefan,
>>
>>
>> Any chance you can get me a larger chunk of the log from the monitor
>> that was the leader by the time you issued those commands until the
>> point the monitor crashed (from the excerpt you provided, that should be
>> mon.b)?
> 
> Sure:
> https://www.dropbox.com/s/8e604bihk56m0yd/ceph-mon.b.log.1.gz
> 
> Greets,
> Stefan

Hi Stefan,


Thanks for the log.

Can you please confirm me that sometime between you issuing the out
command and mon.b failing, you had yet another monitor (maybe mon.a)
that was the leader but for some reason it was down by the time that
mon.b failed?

If so, could you provide the log for that monitor as well, given this
log doesn't have some infos I'm looking for?


  -Joao

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [BUG] ceph-mon crashes
  2012-11-12 17:05             ` Joao Eduardo Luis
@ 2012-11-12 18:30               ` Stefan Priebe
  2012-11-12 20:40                 ` Joao Eduardo Luis
  0 siblings, 1 reply; 11+ messages in thread
From: Stefan Priebe @ 2012-11-12 18:30 UTC (permalink / raw)
  To: Joao Eduardo Luis; +Cc: ceph-devel@vger.kernel.org

Hi Joao,

Am 12.11.2012 18:05, schrieb Joao Eduardo Luis:
> Can you please confirm me that sometime between you issuing the out
> command and mon.b failing, you had yet another monitor (maybe mon.a)
> that was the leader but for some reason it was down by the time that
> mon.b failed?
>
> If so, could you provide the log for that monitor as well, given this
> log doesn't have some infos I'm looking for?
Not sure but here are the logs of the other two mons:
https://www.dropbox.com/s/jztsedvj1b2kjje/ceph-mon.a.log.1.gz

https://www.dropbox.com/s/62jkfbbbgvs5o25/ceph-mon.c.log.1.gz

Thanks,
Stefan

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [BUG] ceph-mon crashes
  2012-11-12 18:30               ` Stefan Priebe
@ 2012-11-12 20:40                 ` Joao Eduardo Luis
  2012-11-12 21:20                   ` Stefan Priebe
  0 siblings, 1 reply; 11+ messages in thread
From: Joao Eduardo Luis @ 2012-11-12 20:40 UTC (permalink / raw)
  To: Stefan Priebe; +Cc: ceph-devel@vger.kernel.org

On 11/12/2012 06:30 PM, Stefan Priebe wrote:
> Hi Joao,
> 
> Am 12.11.2012 18:05, schrieb Joao Eduardo Luis:
>> Can you please confirm me that sometime between you issuing the out
>> command and mon.b failing, you had yet another monitor (maybe mon.a)
>> that was the leader but for some reason it was down by the time that
>> mon.b failed?
>>
>> If so, could you provide the log for that monitor as well, given this
>> log doesn't have some infos I'm looking for?
> Not sure but here are the logs of the other two mons:
> https://www.dropbox.com/s/jztsedvj1b2kjje/ceph-mon.a.log.1.gz
> 
> https://www.dropbox.com/s/62jkfbbbgvs5o25/ceph-mon.c.log.1.gz
> 
> Thanks,
> Stefan

Thanks Stefan,

I'll be looking into this.

For future reference, I created issue #3477 on the tracker:
http://tracker.newdream.net/issues/3477

  -Joao

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [BUG] ceph-mon crashes
  2012-11-12 20:40                 ` Joao Eduardo Luis
@ 2012-11-12 21:20                   ` Stefan Priebe
  0 siblings, 0 replies; 11+ messages in thread
From: Stefan Priebe @ 2012-11-12 21:20 UTC (permalink / raw)
  To: Joao Eduardo Luis; +Cc: ceph-devel@vger.kernel.org

Thanks i'm subscribed to the tracker now.

Stefan

Am 12.11.2012 21:40, schrieb Joao Eduardo Luis:
> On 11/12/2012 06:30 PM, Stefan Priebe wrote:
>> Hi Joao,
>>
>> Am 12.11.2012 18:05, schrieb Joao Eduardo Luis:
>>> Can you please confirm me that sometime between you issuing the out
>>> command and mon.b failing, you had yet another monitor (maybe mon.a)
>>> that was the leader but for some reason it was down by the time that
>>> mon.b failed?
>>>
>>> If so, could you provide the log for that monitor as well, given this
>>> log doesn't have some infos I'm looking for?
>> Not sure but here are the logs of the other two mons:
>> https://www.dropbox.com/s/jztsedvj1b2kjje/ceph-mon.a.log.1.gz
>>
>> https://www.dropbox.com/s/62jkfbbbgvs5o25/ceph-mon.c.log.1.gz
>>
>> Thanks,
>> Stefan
>
> Thanks Stefan,
>
> I'll be looking into this.
>
> For future reference, I created issue #3477 on the tracker:
> http://tracker.newdream.net/issues/3477
>
>    -Joao
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2012-11-12 21:20 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-11-11 21:51 [BUG] ceph-mon crashes Stefan Priebe
2012-11-11 22:20 ` Joao Eduardo Luis
2012-11-11 22:22   ` Stefan Priebe
2012-11-11 23:26     ` Joao Eduardo Luis
2012-11-12  6:21       ` Stefan Priebe - Profihost AG
2012-11-12 14:58         ` Joao Eduardo Luis
2012-11-12 15:10           ` Stefan Priebe - Profihost AG
2012-11-12 17:05             ` Joao Eduardo Luis
2012-11-12 18:30               ` Stefan Priebe
2012-11-12 20:40                 ` Joao Eduardo Luis
2012-11-12 21:20                   ` Stefan Priebe

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.