* OSD crashed today in os/JournalingObjectStore.cc
@ 2012-12-05  9:56 Stefan Priebe - Profihost AG
  2012-12-05 14:41 ` Sage Weil
  0 siblings, 1 reply; 13+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-12-05  9:56 UTC
  To: ceph-devel@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 166 bytes --]

Hello list,

I updated to today's latest next branch, and after 20 minutes an OSD
crashed in os/JournalingObjectStore.cc.

Attached is the log.

Greets,
Stefan

[-- Attachment #2: ceph-osd.43.log --]
[-- Type: text/x-log, Size: 14643 bytes --]

2012-12-05 10:21:12.591166 7f57aeeb9700  0 monclient: hunting for new mon
2012-12-05 10:21:14.338644 7f578e966700  0 -- 10.255.0.103:6807/15121 >> 10.255.0.100:6802/28708 pipe(0xe061000 sd=67 :34107 pgs=50 cs=13 l=0).fault with nothing to send, going to standby
2012-12-05 10:21:14.338786 7f57c6368700  0 -- 10.255.0.103:0/15121 >> 10.255.0.100:6803/28708 pipe(0xd56e900 sd=28 :0 pgs=0 cs=0 l=1).fault
2012-12-05 10:21:15.748915 7f578eb68700  0 -- 10.255.0.103:6807/15121 >> 10.255.0.100:6808/29075 pipe(0xddd1480 sd=74 :6807 pgs=46 cs=27 l=0).fault with nothing to send, going to standby
2012-12-05 10:21:15.749020 7f578c23f700  0 -- 10.255.0.103:0/15121 >> 10.255.0.100:6809/29075 pipe(0xc96b6c0 sd=47 :0 pgs=0 cs=0 l=1).fault
2012-12-05 10:21:17.029751 7f5789f06700  0 -- 10.255.0.103:6807/15121 >> 10.255.0.100:6811/29438 pipe(0x11ed56c0 sd=75 :6807 pgs=76 cs=21 l=0).fault with nothing to send, going to standby
2012-12-05 10:21:17.029925 7f578be3b700  0 -- 10.255.0.103:0/15121 >> 10.255.0.100:6814/29438 pipe(0xcf876c0 sd=55 :0 pgs=0 cs=0 l=1).fault
2012-12-05 10:21:18.334263 7f578fa77700  0 -- 10.255.0.103:6807/15121 >> 10.255.0.100:6819/29801 pipe(0xd0bb480 sd=79 :6807 pgs=85 cs=43 l=0).fault with nothing to send, going to standby
2012-12-05 10:21:18.334403 7f578a007700  0 -- 10.255.0.103:0/15121 >> 10.255.0.100:6821/29801 pipe(0x12024b40 sd=28 :0 pgs=0 cs=0 l=1).fault
2012-12-05 10:21:20.375215 7f578fb78700  0 -- 10.255.0.103:6807/15121 >> 10.255.0.101:6801/8284 pipe(0xdb0ed80 sd=42 :6807 pgs=39 cs=9 l=0).fault with nothing to send, going to standby
2012-12-05 10:21:20.375381 7f578be3b700  0 -- 10.255.0.103:0/15121 >> 10.255.0.101:6802/8284 pipe(0x100656c0 sd=59 :0 pgs=0 cs=0 l=1).fault
2012-12-05 10:21:22.637693 7f5789a01700  0 -- 10.255.0.103:6807/15121 >> 10.255.0.101:6804/8467 pipe(0x13a23d80 sd=77 :6807 pgs=182 cs=15 l=0).fault with nothing to send, going to standby
2012-12-05 10:21:22.637861 7f578f976700  0 -- 10.255.0.103:0/15121 >> 10.255.0.101:6805/8467 pipe(0xd2dcb40 sd=28 :0 pgs=0 cs=0 l=1).fault
2012-12-05 10:21:24.777204 7f578a108700  0 -- 10.255.0.103:6807/15121 >> 10.255.0.101:6807/8647 pipe(0xd8eeb40 sd=40 :6807 pgs=257 cs=29 l=0).fault with nothing to send, going to standby
2012-12-05 10:21:24.777420 7f578b431700  0 -- 10.255.0.103:0/15121 >> 10.255.0.101:6808/8647 pipe(0xceb3900 sd=74 :0 pgs=0 cs=0 l=1).fault
2012-12-05 10:21:26.870074 7f578f16e700  0 -- 10.255.0.103:6807/15121 >> 10.255.0.101:6810/8877 pipe(0x114a56c0 sd=72 :6807 pgs=200 cs=13 l=0).fault with nothing to send, going to standby
2012-12-05 10:21:26.870281 7f578ce4b700  0 -- 10.255.0.103:0/15121 >> 10.255.0.101:6811/8877 pipe(0xceb3480 sd=51 :0 pgs=0 cs=0 l=1).fault
2012-12-05 10:21:28.977016 7f578f471700  0 -- 10.255.0.103:6807/15121 >> 10.255.0.102:6801/6127 pipe(0xd8ee900 sd=38 :6807 pgs=178 cs=15 l=0).fault with nothing to send, going to standby
2012-12-05 10:21:28.977174 7f578db58700  0 -- 10.255.0.103:0/15121 >> 10.255.0.102:6802/6127 pipe(0xceb36c0 sd=40 :0 pgs=0 cs=0 l=1).fault
2012-12-05 10:21:31.091973 7f578f370700  0 -- 10.255.0.103:6807/15121 >> 10.255.0.102:6806/6308 pipe(0xc96cd80 sd=36 :6807 pgs=260 cs=1 l=0).fault with nothing to send, going to standby
2012-12-05 10:21:31.092196 7f578f16e700  0 -- 10.255.0.103:0/15121 >> 10.255.0.102:6807/6308 pipe(0xdbbc6c0 sd=31 :0 pgs=0 cs=0 l=1).fault
2012-12-05 10:21:33.200579 7f578f26f700  0 -- 10.255.0.103:6807/15121 >> 10.255.0.102:6809/6491 pipe(0xc96cb40 sd=35 :6807 pgs=261 cs=1 l=0).fault with nothing to send, going to standby
2012-12-05 10:21:33.200853 7f578f471700  0 -- 10.255.0.103:0/15121 >> 10.255.0.102:6810/6491 pipe(0xe1cf480 sd=38 :0 pgs=0 cs=0 l=1).fault
2012-12-05 10:21:35.329384 7f578a70e700  0 -- 10.255.0.103:6807/15121 >> 10.255.0.102:6822/6670 pipe(0xfad4b40 sd=70 :6807 pgs=319 cs=9 l=0).fault with nothing to send, going to standby
2012-12-05 10:21:35.329523 7f578d754700  0 -- 10.255.0.103:0/15121 >> 10.255.0.102:6823/6670 pipe(0xfad4240 sd=72 :0 pgs=0 cs=0 l=1).fault
2012-12-05 10:21:42.031928 7f57c26e0700 -1 osd.43 923 *** Got signal Terminated ***
2012-12-05 10:21:42.032002 7f57c26e0700 -1 osd.43 923  pausing thread pools
2012-12-05 10:21:42.032007 7f57c26e0700 -1 osd.43 923  flushing io
2012-12-05 10:21:42.032015 7f57c26e0700 -1 osd.43 923  removing pid file
2012-12-05 10:21:42.032092 7f57c26e0700 -1 osd.43 923  exit
2012-12-05 10:21:43.608251 7fd046962780  0 filestore(/ceph/osd.43/) mount FIEMAP ioctl is supported and appears to work
2012-12-05 10:21:43.608262 7fd046962780  0 filestore(/ceph/osd.43/) mount FIEMAP ioctl is disabled via 'filestore fiemap' config option
2012-12-05 10:21:43.608495 7fd046962780  0 filestore(/ceph/osd.43/) mount did NOT detect btrfs
2012-12-05 10:21:43.613072 7fd046962780  0 filestore(/ceph/osd.43/) mount syscall(__NR_syncfs, fd) fully supported
2012-12-05 10:21:43.613151 7fd046962780  0 filestore(/ceph/osd.43/) mount found snaps <>
2012-12-05 10:21:43.615479 7fd046962780  0 filestore(/ceph/osd.43/) mount: enabling WRITEAHEAD journal mode: btrfs not detected
2012-12-05 10:21:43.638102 7fd046962780  0 journal  kernel version is 3.6.7
2012-12-05 10:21:43.768129 7fd046962780  0 journal  kernel version is 3.6.7
2012-12-05 10:21:43.819826 7fd046962780  0 filestore(/ceph/osd.43/) mount FIEMAP ioctl is supported and appears to work
2012-12-05 10:21:43.819835 7fd046962780  0 filestore(/ceph/osd.43/) mount FIEMAP ioctl is disabled via 'filestore fiemap' config option
2012-12-05 10:21:43.820065 7fd046962780  0 filestore(/ceph/osd.43/) mount did NOT detect btrfs
2012-12-05 10:21:43.821567 7fd046962780  0 filestore(/ceph/osd.43/) mount syscall(__NR_syncfs, fd) fully supported
2012-12-05 10:21:43.821622 7fd046962780  0 filestore(/ceph/osd.43/) mount found snaps <>
2012-12-05 10:21:43.822791 7fd046962780  0 filestore(/ceph/osd.43/) mount: enabling WRITEAHEAD journal mode: btrfs not detected
2012-12-05 10:21:43.837954 7fd046962780  0 journal  kernel version is 3.6.7
2012-12-05 10:21:43.898018 7fd046962780  0 journal  kernel version is 3.6.7
2012-12-05 10:46:40.709056 7fd03c4b6700 -1 os/JournalingObjectStore.cc: In function 'uint64_t JournalingObjectStore::ApplyManager::op_apply_start(uint64_t)' thread 7fd03c4b6700 time 2012-12-05 10:46:40.338489
os/JournalingObjectStore.cc: 134: FAILED assert(op > committed_seq)

 ceph version 0.55-142-g22f794d (22f794da074dd1b3221c484a5ae05b2ff1bd0fa4)
 1: (JournalingObjectStore::ApplyManager::op_apply_start(unsigned long)+0x816) [0x747626]
 2: (FileStore::_do_op(FileStore::OpSequencer*)+0x52) [0x703c22]
 3: (ThreadPool::worker(ThreadPool::WorkThread*)+0x82b) [0x82f81b]
 4: (ThreadPool::WorkThread::entry()+0x10) [0x832000]
 5: (()+0x68ca) [0x7fd04633f8ca]
 6: (clone()+0x6d) [0x7fd0447aeb6d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
   -29> 2012-12-05 10:21:43.592318 7fd046962780  5 asok(0x244b000) register_command perfcounters_dump hook 0x243f010
   -28> 2012-12-05 10:21:43.592340 7fd046962780  5 asok(0x244b000) register_command 1 hook 0x243f010
   -27> 2012-12-05 10:21:43.592342 7fd046962780  5 asok(0x244b000) register_command perf dump hook 0x243f010
   -26> 2012-12-05 10:21:43.592350 7fd046962780  5 asok(0x244b000) register_command perfcounters_schema hook 0x243f010
   -25> 2012-12-05 10:21:43.592354 7fd046962780  5 asok(0x244b000) register_command 2 hook 0x243f010
   -24> 2012-12-05 10:21:43.592357 7fd046962780  5 asok(0x244b000) register_command perf schema hook 0x243f010
   -23> 2012-12-05 10:21:43.592359 7fd046962780  5 asok(0x244b000) register_command config show hook 0x243f010
   -22> 2012-12-05 10:21:43.592361 7fd046962780  5 asok(0x244b000) register_command config set hook 0x243f010
   -21> 2012-12-05 10:21:43.592363 7fd046962780  5 asok(0x244b000) register_command log flush hook 0x243f010
   -20> 2012-12-05 10:21:43.592365 7fd046962780  5 asok(0x244b000) register_command log dump hook 0x243f010
   -19> 2012-12-05 10:21:43.592367 7fd046962780  5 asok(0x244b000) register_command log reopen hook 0x243f010
   -18> 2012-12-05 10:21:43.594773 7fd046962780  0 ceph version 0.55-142-g22f794d (22f794da074dd1b3221c484a5ae05b2ff1bd0fa4), process ceph-osd, pid 31785
   -17> 2012-12-05 10:21:43.595944 7fd046962780  1 finished global_init_daemonize
   -16> 2012-12-05 10:21:43.608251 7fd046962780  0 filestore(/ceph/osd.43/) mount FIEMAP ioctl is supported and appears to work
   -15> 2012-12-05 10:21:43.608262 7fd046962780  0 filestore(/ceph/osd.43/) mount FIEMAP ioctl is disabled via 'filestore fiemap' config option
   -14> 2012-12-05 10:21:43.608495 7fd046962780  0 filestore(/ceph/osd.43/) mount did NOT detect btrfs
   -13> 2012-12-05 10:21:43.613072 7fd046962780  0 filestore(/ceph/osd.43/) mount syscall(__NR_syncfs, fd) fully supported
   -12> 2012-12-05 10:21:43.613151 7fd046962780  0 filestore(/ceph/osd.43/) mount found snaps <>
   -11> 2012-12-05 10:21:43.615479 7fd046962780  0 filestore(/ceph/osd.43/) mount: enabling WRITEAHEAD journal mode: btrfs not detected
   -10> 2012-12-05 10:21:43.638102 7fd046962780  0 journal  kernel version is 3.6.7
    -9> 2012-12-05 10:21:43.768129 7fd046962780  0 journal  kernel version is 3.6.7
    -8> 2012-12-05 10:21:43.819826 7fd046962780  0 filestore(/ceph/osd.43/) mount FIEMAP ioctl is supported and appears to work
    -7> 2012-12-05 10:21:43.819835 7fd046962780  0 filestore(/ceph/osd.43/) mount FIEMAP ioctl is disabled via 'filestore fiemap' config option
    -6> 2012-12-05 10:21:43.820065 7fd046962780  0 filestore(/ceph/osd.43/) mount did NOT detect btrfs
    -5> 2012-12-05 10:21:43.821567 7fd046962780  0 filestore(/ceph/osd.43/) mount syscall(__NR_syncfs, fd) fully supported
    -4> 2012-12-05 10:21:43.821622 7fd046962780  0 filestore(/ceph/osd.43/) mount found snaps <>
    -3> 2012-12-05 10:21:43.822791 7fd046962780  0 filestore(/ceph/osd.43/) mount: enabling WRITEAHEAD journal mode: btrfs not detected
    -2> 2012-12-05 10:21:43.837954 7fd046962780  0 journal  kernel version is 3.6.7
    -1> 2012-12-05 10:21:43.898018 7fd046962780  0 journal  kernel version is 3.6.7
     0> 2012-12-05 10:46:40.709056 7fd03c4b6700 -1 os/JournalingObjectStore.cc: In function 'uint64_t JournalingObjectStore::ApplyManager::op_apply_start(uint64_t)' thread 7fd03c4b6700 time 2012-12-05 10:46:40.338489
os/JournalingObjectStore.cc: 134: FAILED assert(op > committed_seq)

 ceph version 0.55-142-g22f794d (22f794da074dd1b3221c484a5ae05b2ff1bd0fa4)
 1: (JournalingObjectStore::ApplyManager::op_apply_start(unsigned long)+0x816) [0x747626]
 2: (FileStore::_do_op(FileStore::OpSequencer*)+0x52) [0x703c22]
 3: (ThreadPool::worker(ThreadPool::WorkThread*)+0x82b) [0x82f81b]
 4: (ThreadPool::WorkThread::entry()+0x10) [0x832000]
 5: (()+0x68ca) [0x7fd04633f8ca]
 6: (clone()+0x6d) [0x7fd0447aeb6d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 0 lockdep
   0/ 0 context
   0/ 0 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 0 buffer
   0/ 0 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 0 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 0 osd
   0/ 0 optracker
   0/ 0 objclass
   0/ 0 filestore
   0/ 0 journal
   0/ 0 ms
   1/ 5 mon
   0/ 0 monc
   0/ 5 paxos
   0/ 0 tp
   0/ 0 auth
   1/ 5 crypto
   0/ 0 finisher
   0/ 0 heartbeatmap
   0/ 0 perfcounter
   1/ 5 rgw
   1/ 5 hadoop
   1/ 5 javaclient
   0/ 0 asok
   0/ 0 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent    100000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.43.log
--- end dump of recent events ---
2012-12-05 10:46:40.710600 7fd03c4b6700 -1 *** Caught signal (Aborted) **
 in thread 7fd03c4b6700

 ceph version 0.55-142-g22f794d (22f794da074dd1b3221c484a5ae05b2ff1bd0fa4)
 1: /usr/bin/ceph-osd() [0x797bd9]
 2: (()+0xeff0) [0x7fd046347ff0]
 3: (gsignal()+0x35) [0x7fd0447111b5]
 4: (abort()+0x180) [0x7fd044713fc0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7fd044fa5dc5]
 6: (()+0xcb166) [0x7fd044fa4166]
 7: (()+0xcb193) [0x7fd044fa4193]
 8: (()+0xcb28e) [0x7fd044fa428e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7c9) [0x7fb939]
 10: (JournalingObjectStore::ApplyManager::op_apply_start(unsigned long)+0x816) [0x747626]
 11: (FileStore::_do_op(FileStore::OpSequencer*)+0x52) [0x703c22]
 12: (ThreadPool::worker(ThreadPool::WorkThread*)+0x82b) [0x82f81b]
 13: (ThreadPool::WorkThread::entry()+0x10) [0x832000]
 14: (()+0x68ca) [0x7fd04633f8ca]
 15: (clone()+0x6d) [0x7fd0447aeb6d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
     0> 2012-12-05 10:46:40.710600 7fd03c4b6700 -1 *** Caught signal (Aborted) **
 in thread 7fd03c4b6700

 ceph version 0.55-142-g22f794d (22f794da074dd1b3221c484a5ae05b2ff1bd0fa4)
 1: /usr/bin/ceph-osd() [0x797bd9]
 2: (()+0xeff0) [0x7fd046347ff0]
 3: (gsignal()+0x35) [0x7fd0447111b5]
 4: (abort()+0x180) [0x7fd044713fc0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7fd044fa5dc5]
 6: (()+0xcb166) [0x7fd044fa4166]
 7: (()+0xcb193) [0x7fd044fa4193]
 8: (()+0xcb28e) [0x7fd044fa428e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7c9) [0x7fb939]
 10: (JournalingObjectStore::ApplyManager::op_apply_start(unsigned long)+0x816) [0x747626]
 11: (FileStore::_do_op(FileStore::OpSequencer*)+0x52) [0x703c22]
 12: (ThreadPool::worker(ThreadPool::WorkThread*)+0x82b) [0x82f81b]
 13: (ThreadPool::WorkThread::entry()+0x10) [0x832000]
 14: (()+0x68ca) [0x7fd04633f8ca]
 15: (clone()+0x6d) [0x7fd0447aeb6d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 0 lockdep
   0/ 0 context
   0/ 0 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 0 buffer
   0/ 0 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 0 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 0 osd
   0/ 0 optracker
   0/ 0 objclass
   0/ 0 filestore
   0/ 0 journal
   0/ 0 ms
   1/ 5 mon
   0/ 0 monc
   0/ 5 paxos
   0/ 0 tp
   0/ 0 auth
   1/ 5 crypto
   0/ 0 finisher
   0/ 0 heartbeatmap
   0/ 0 perfcounter
   1/ 5 rgw
   1/ 5 hadoop
   1/ 5 javaclient
   0/ 0 asok
   0/ 0 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent    100000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.43.log
--- end dump of recent events ---


* Re: OSD crashed today in os/JournalingObjectStore.cc
  2012-12-05  9:56 OSD crashed today in os/JournalingObjectStore.cc Stefan Priebe - Profihost AG
@ 2012-12-05 14:41 ` Sage Weil
  2012-12-05 16:05   ` Stefan Priebe - Profihost AG
  0 siblings, 1 reply; 13+ messages in thread
From: Sage Weil @ 2012-12-05 14:41 UTC
  To: Stefan Priebe - Profihost AG; +Cc: ceph-devel@vger.kernel.org

On Wed, 5 Dec 2012, Stefan Priebe - Profihost AG wrote:
> Hello list,
> 
> i updated to latest next from today and then after 20 minutes an OSD was
> crashing in os/JournalingObjectStore.cc.
> 
> Attached is the log.

Hmm, this is perplexing.  It might just be a bad assert, but I can't see 
how it could happen.  Any chance you can reproduce with

	debug journal = 0/10

in the [osd] section?  That will give us a dump if it fails the assert.
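
A note on the notation: Ceph debug settings are written as
log-level/memory-level. The first number controls what is written to
the log file, the second what is kept in memory and dumped when the
daemon crashes. The logging levels in the attached log list
"0/ 0 journal", so no journal messages were being gathered. A minimal
Python sketch of how such a value reads; the names DebugLevel and
parse_debug_level are hypothetical helpers for illustration, not part
of Ceph:

```python
from typing import NamedTuple


class DebugLevel(NamedTuple):
    log_level: int      # verbosity written to the log file
    memory_level: int   # verbosity kept in memory and dumped on a crash


def parse_debug_level(value: str) -> DebugLevel:
    """Parse a Ceph-style debug value such as '0/10' or '5'."""
    parts = [p.strip() for p in value.split("/")]
    if len(parts) == 1:
        # Assumption: a single number sets both levels to the same value.
        level = int(parts[0])
        return DebugLevel(level, level)
    if len(parts) == 2:
        return DebugLevel(int(parts[0]), int(parts[1]))
    raise ValueError(f"unexpected debug level: {value!r}")


if __name__ == "__main__":
    # The crash log above reports ' 0/ 0 journal': nothing is gathered.
    print(parse_debug_level("0/ 0"))
    # The requested setting keeps the log file quiet but gathers
    # journal messages up to level 10 in memory for the crash dump.
    print(parse_debug_level("0/10"))
```

With 0/10 the log file stays as quiet as before, but the "dump of
recent events" written on a crash would include journal messages up to
level 10.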

Thanks!
s


* Re: OSD crashed today in os/JournalingObjectStore.cc
  2012-12-05 14:41 ` Sage Weil
@ 2012-12-05 16:05   ` Stefan Priebe - Profihost AG
  2012-12-05 22:25     ` Stefan Priebe
  2012-12-05 23:36     ` Sage Weil
  0 siblings, 2 replies; 13+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-12-05 16:05 UTC
  To: Sage Weil; +Cc: ceph-devel@vger.kernel.org

There was a dump in the attached log.

Stefan

On 05.12.2012 at 15:41, Sage Weil <sage@inktank.com> wrote:

> On Wed, 5 Dec 2012, Stefan Priebe - Profihost AG wrote:
>> Hello list,
>> 
>> i updated to latest next from today and then after 20 minutes an OSD was
>> crashing in os/JournalingObjectStore.cc.
>> 
>> Attached is the log.
> 
> Hmm, this is perplexing.  It might just be a bad assert, but I can't see 
> how it could happen.  Any chance you can reproduce with
> 
>    debug journal = 0/10
> 
> in the [osd] section?  That will give us a dump if it fails the assert.
> 
> Thanks!
> s
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: OSD crashed today in os/JournalingObjectStore.cc
  2012-12-05 16:05   ` Stefan Priebe - Profihost AG
@ 2012-12-05 22:25     ` Stefan Priebe
  2012-12-05 22:29       ` Stefan Priebe
  2012-12-05 23:36     ` Sage Weil
  1 sibling, 1 reply; 13+ messages in thread
From: Stefan Priebe @ 2012-12-05 22:25 UTC
  To: Sage Weil; +Cc: ceph-devel@vger.kernel.org

Hello,

I have now had 8 OSDs fail again with the same error.

      0> 2012-12-05 23:10:41.213149 7f7fad109700 -1 os/JournalingObjectStore.cc: In function 'uint64_t JournalingObjectStore::ApplyManager::op_apply_start(uint64_t)' thread 7f7fad109700 time 2012-12-05 23:10:41.212454
os/JournalingObjectStore.cc: 134: FAILED assert(op > committed_seq)

  ceph version 0.55-142-g22f794d (22f794da074dd1b3221c484a5ae05b2ff1bd0fa4)
  1: (JournalingObjectStore::ApplyManager::op_apply_start(unsigned long)+0x816) [0x747626]
  2: (FileStore::_do_op(FileStore::OpSequencer*)+0x52) [0x703c22]
  3: (ThreadPool::worker(ThreadPool::WorkThread*)+0x82b) [0x82f81b]
  4: (ThreadPool::WorkThread::entry()+0x10) [0x832000]
  5: (()+0x68ca) [0x7f7fc17a78ca]
  6: (clone()+0x6d) [0x7f7fbfc16bfd]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
    0/ 5 none
    0/ 0 lockdep
    0/ 0 context
    0/ 0 crush
    1/ 5 mds
    1/ 5 mds_balancer
    1/ 5 mds_locker
    1/ 5 mds_log
    1/ 5 mds_log_expire
    1/ 5 mds_migrator
    0/ 0 buffer
    0/ 0 timer
    0/ 1 filer
    0/ 1 striper
    0/ 1 objecter
    0/ 5 rados
    0/ 5 rbd
    0/ 0 journaler
    0/ 5 objectcacher
    0/ 5 client
    0/ 0 osd
    0/ 0 optracker
    0/ 0 objclass
    0/ 0 filestore
    0/ 0 journal
    0/ 0 ms
    1/ 5 mon
    0/ 0 monc
    0/ 5 paxos
    0/ 0 tp
    0/ 0 auth
    1/ 5 crypto
    0/ 0 finisher
    0/ 0 heartbeatmap
    0/ 0 perfcounter
    1/ 5 rgw
    1/ 5 hadoop
    1/ 5 javaclient
    0/ 0 asok
    0/ 0 throttle
   -2/-2 (syslog threshold)
   -1/-1 (stderr threshold)
   max_recent    100000
   max_new         1000
   log_file /var/log/ceph/ceph-osd.13.log
--- end dump of recent events ---
2012-12-05 23:10:41.216011 7f7fad109700 -1 *** Caught signal (Aborted) **
  in thread 7f7fad109700

  ceph version 0.55-142-g22f794d (22f794da074dd1b3221c484a5ae05b2ff1bd0fa4)
  1: /usr/bin/ceph-osd() [0x797bd9]
  2: (()+0xeff0) [0x7f7fc17afff0]
  3: (gsignal()+0x35) [0x7f7fbfb79215]
  4: (abort()+0x180) [0x7f7fbfb7c020]
  5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f7fc040ddc5]
  6: (()+0xcb166) [0x7f7fc040c166]
  7: (()+0xcb193) [0x7f7fc040c193]
  8: (()+0xcb28e) [0x7f7fc040c28e]
  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7c9) [0x7fb939]
  10: (JournalingObjectStore::ApplyManager::op_apply_start(unsigned long)+0x816) [0x747626]
  11: (FileStore::_do_op(FileStore::OpSequencer*)+0x52) [0x703c22]
  12: (ThreadPool::worker(ThreadPool::WorkThread*)+0x82b) [0x82f81b]
  13: (ThreadPool::WorkThread::entry()+0x10) [0x832000]
  14: (()+0x68ca) [0x7f7fc17a78ca]
  15: (clone()+0x6d) [0x7f7fbfc16bfd]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
      0> 2012-12-05 23:10:41.216011 7f7fad109700 -1 *** Caught signal (Aborted) **
  in thread 7f7fad109700

  ceph version 0.55-142-g22f794d (22f794da074dd1b3221c484a5ae05b2ff1bd0fa4)
  1: /usr/bin/ceph-osd() [0x797bd9]
  2: (()+0xeff0) [0x7f7fc17afff0]
  3: (gsignal()+0x35) [0x7f7fbfb79215]
  4: (abort()+0x180) [0x7f7fbfb7c020]
  5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f7fc040ddc5]
  6: (()+0xcb166) [0x7f7fc040c166]
  7: (()+0xcb193) [0x7f7fc040c193]
  8: (()+0xcb28e) [0x7f7fc040c28e]
  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7c9) [0x7fb939]
  10: (JournalingObjectStore::ApplyManager::op_apply_start(unsigned long)+0x816) [0x747626]
  11: (FileStore::_do_op(FileStore::OpSequencer*)+0x52) [0x703c22]
  12: (ThreadPool::worker(ThreadPool::WorkThread*)+0x82b) [0x82f81b]
  13: (ThreadPool::WorkThread::entry()+0x10) [0x832000]
  14: (()+0x68ca) [0x7f7fc17a78ca]
  15: (clone()+0x6d) [0x7f7fbfc16bfd]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
    0/ 5 none
    0/ 0 lockdep
    0/ 0 context
    0/ 0 crush
    1/ 5 mds
    1/ 5 mds_balancer
    1/ 5 mds_locker
    1/ 5 mds_log
    1/ 5 mds_log_expire
    1/ 5 mds_migrator
    0/ 0 buffer
    0/ 0 timer
    0/ 1 filer
    0/ 1 striper
    0/ 1 objecter
    0/ 5 rados
    0/ 5 rbd
    0/ 0 journaler
    0/ 5 objectcacher
    0/ 5 client
    0/ 0 osd
    0/ 0 optracker
    0/ 0 objclass
    0/ 0 filestore
    0/ 0 journal
    0/ 0 ms
    1/ 5 mon
    0/ 0 monc
    0/ 5 paxos
    0/ 0 tp
    0/ 0 auth
    1/ 5 crypto
    0/ 0 finisher
    0/ 0 heartbeatmap
    0/ 0 perfcounter
    1/ 5 rgw
    1/ 5 hadoop
    1/ 5 javaclient
    0/ 0 asok
    0/ 0 throttle
   -2/-2 (syslog threshold)
   -1/-1 (stderr threshold)
   max_recent    100000
   max_new         1000
   log_file /var/log/ceph/ceph-osd.13.log
--- end dump of recent events ---
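
To check that several OSD logs hit the same assertion, the
"FAILED assert" lines can be pulled out mechanically. The following is
a rough Python sketch, assuming the line format shown in the dumps in
this thread (file, line number, then "FAILED assert(...)"); the names
FailedAssert and find_failed_asserts are hypothetical:

```python
import re
from typing import Iterable, List, NamedTuple


class FailedAssert(NamedTuple):
    source_file: str
    line: int
    condition: str


# Matches lines such as:
#   os/JournalingObjectStore.cc: 134: FAILED assert(op > committed_seq)
ASSERT_RE = re.compile(r"^\s*(\S+): (\d+): FAILED assert\((.*)\)\s*$")


def find_failed_asserts(lines: Iterable[str]) -> List[FailedAssert]:
    """Return every failed-assert record found in the given log lines."""
    found = []
    for line in lines:
        m = ASSERT_RE.match(line)
        if m:
            found.append(FailedAssert(m.group(1), int(m.group(2)), m.group(3)))
    return found


if __name__ == "__main__":
    import sys

    # Usage: python find_asserts.py /var/log/ceph/ceph-osd.*.log
    for path in sys.argv[1:]:
        with open(path, errors="replace") as fh:
            for hit in sorted(set(find_failed_asserts(fh))):
                print(f"{path}: {hit.source_file}:{hit.line} "
                      f"assert({hit.condition})")
```

Each crash writes the assert line more than once (in the log and again
in the event dump), so the sketch de-duplicates per file before
printing.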

Stefan
On 05.12.2012 17:05, Stefan Priebe - Profihost AG wrote:
> There was a dump in the attached log.
>
> Stefan
>
> On 05.12.2012 at 15:41, Sage Weil <sage@inktank.com> wrote:
>
>> On Wed, 5 Dec 2012, Stefan Priebe - Profihost AG wrote:
>>> Hello list,
>>>
>>> i updated to latest next from today and then after 20 minutes an OSD was
>>> crashing in os/JournalingObjectStore.cc.
>>>
>>> Attached is the log.
>>
>> Hmm, this is perplexing.  It might just be a bad assert, but I can't see
>> how it could happen.  Any chance you can reproduce with
>>
>>     debug journal = 0/10
>>
>> in the [osd] section?  That will give us a dump if it fails the assert.
>>
>> Thanks!
>> s


* Re: OSD crashed today in os/JournalingObjectStore.cc
  2012-12-05 22:25     ` Stefan Priebe
@ 2012-12-05 22:29       ` Stefan Priebe
  0 siblings, 0 replies; 13+ messages in thread
From: Stefan Priebe @ 2012-12-05 22:29 UTC
  To: Sage Weil; +Cc: ceph-devel@vger.kernel.org

Hello,

This seems to have started with commit 85574a3.

Stefan



* Re: OSD crashed today in os/JournalingObjectStore.cc
  2012-12-05 16:05   ` Stefan Priebe - Profihost AG
  2012-12-05 22:25     ` Stefan Priebe
@ 2012-12-05 23:36     ` Sage Weil
  2012-12-06  9:38       ` Stefan Priebe - Profihost AG
  1 sibling, 1 reply; 13+ messages in thread
From: Sage Weil @ 2012-12-05 23:36 UTC
  To: Stefan Priebe - Profihost AG; +Cc: ceph-devel@vger.kernel.org

On Wed, 5 Dec 2012, Stefan Priebe - Profihost AG wrote:
> There was a dump in the attached log.

The stack trace is there, but with 'debug journal = 0/20' in your conf it
will also dump all of the journal logging activity leading up to that
point.  Can you reproduce with that enabled?  That should tell me why
op <= committed_seq.
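
For reference, the assertion in the trace requires op > committed_seq,
so it fires when op is equal to or smaller than committed_seq. Below is
a toy Python model of that check. It is not Ceph's implementation: the
class ToyApplyManager and its commit() method are hypothetical, and the
meaning given to committed_seq (highest sequence number already
committed) is an assumption based on the name alone.

```python
class ToyApplyManager:
    """Toy model only; this is NOT Ceph's ApplyManager."""

    def __init__(self) -> None:
        # Assumption: highest op sequence number already committed.
        self.committed_seq = 0

    def commit(self, seq: int) -> None:
        """Record that everything up to seq has been committed."""
        self.committed_seq = max(self.committed_seq, seq)

    def op_apply_start(self, op: int) -> int:
        # Mirrors the condition from the log: assert(op > committed_seq).
        if not op > self.committed_seq:
            raise AssertionError(
                f"FAILED assert(op > committed_seq): "
                f"op={op} committed_seq={self.committed_seq}"
            )
        return op


if __name__ == "__main__":
    mgr = ToyApplyManager()
    mgr.commit(5)
    mgr.op_apply_start(6)       # passes: 6 > 5
    try:
        mgr.op_apply_start(5)   # fails: 5 is not greater than 5
    except AssertionError as err:
        print(err)
```

In this model the check can only fail if an op starts applying after a
commit has already covered its sequence number, which is the ordering
the journal debug output should help confirm or rule out.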

Thanks!
sage


> 
> Stefan
> 
> On 05.12.2012 at 15:41, Sage Weil <sage@inktank.com> wrote:
> 
> > On Wed, 5 Dec 2012, Stefan Priebe - Profihost AG wrote:
> >> Hello list,
> >> 
> >> i updated to latest next from today and then after 20 minutes an OSD was
> >> crashing in os/JournalingObjectStore.cc.
> >> 
> >> Attached is the log.
> > 
> > Hmm, this is perplexing.  It might just be a bad assert, but I can't see 
> > how it could happen.  Any chance you can reproduce with
> > 
> >    debug journal = 0/10
> > 
> > in the [osd] section?  That will give us a dump if it fails the assert.
> > 
> > Thanks!
> > s


* Re: OSD crashed today in os/JournalingObjectStore.cc
  2012-12-05 23:36     ` Sage Weil
@ 2012-12-06  9:38       ` Stefan Priebe - Profihost AG
  2012-12-06 14:43         ` Sage Weil
  2012-12-07  0:38         ` Sage Weil
  0 siblings, 2 replies; 13+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-12-06  9:38 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel@vger.kernel.org

Hi,

here is a new dump / crash:
https://www.dropbox.com/s/1qhg0dd0fv17q10/ceph-osd.54.log.gz

Stefan


* Re: OSD crashed today in os/JournalingObjectStore.cc
  2012-12-06  9:38       ` Stefan Priebe - Profihost AG
@ 2012-12-06 14:43         ` Sage Weil
  2012-12-06 14:47           ` Stefan Priebe - Profihost AG
  2012-12-07  0:38         ` Sage Weil
  1 sibling, 1 reply; 13+ messages in thread
From: Sage Weil @ 2012-12-06 14:43 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG; +Cc: ceph-devel@vger.kernel.org

On Thu, 6 Dec 2012, Stefan Priebe - Profihost AG wrote:
> Hi,
> 
> here is a new dump / crash:
> https://www.dropbox.com/s/1qhg0dd0fv17q10/ceph-osd.54.log.gz

Awesome, thanks!  I see the bug now.  Working out a fix.

In the meantime, you can revert 85574a36226611ccf0fb7591fd275a2bdcca2bad 
and 528108485be7912069087822e5b7a1a2f1dd515e.

sage



* Re: OSD crashed today in os/JournalingObjectStore.cc
  2012-12-06 14:43         ` Sage Weil
@ 2012-12-06 14:47           ` Stefan Priebe - Profihost AG
  0 siblings, 0 replies; 13+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-12-06 14:47 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel@vger.kernel.org

Hello,

As it was crashing pretty often (for a few minutes only 9 of 20 OSDs 
were still alive), I reverted back to 
3ace9a7c66c79847d215f4bd38f40ca8b07bab8e

Which works fine ;-)

Stefan


* Re: OSD crashed today in os/JournalingObjectStore.cc
  2012-12-06  9:38       ` Stefan Priebe - Profihost AG
  2012-12-06 14:43         ` Sage Weil
@ 2012-12-07  0:38         ` Sage Weil
  2012-12-07  7:49           ` Stefan Priebe - Profihost AG
  1 sibling, 1 reply; 13+ messages in thread
From: Sage Weil @ 2012-12-07  0:38 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG; +Cc: ceph-devel@vger.kernel.org

Hi Stefan,

I've pushed a few patches to wip-filestore2 that simplify and fix this 
code.  Can you give it a go?

Thanks!
sage



* Re: OSD crashed today in os/JournalingObjectStore.cc
  2012-12-07  0:38         ` Sage Weil
@ 2012-12-07  7:49           ` Stefan Priebe - Profihost AG
  2012-12-07 11:02             ` Sage Weil
  0 siblings, 1 reply; 13+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-12-07  7:49 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel@vger.kernel.org

Hi Sage,

Thanks for patching this. I'm not sure whether I can test this. I've 
moved my systems to initial production tests and I can't risk 
downtime or data loss again ;-)

How stable are these fixes?

Greets,
Stefan

* Re: OSD crashed today in os/JournalingObjectStore.cc
  2012-12-07  7:49           ` Stefan Priebe - Profihost AG
@ 2012-12-07 11:02             ` Sage Weil
  2012-12-07 11:29               ` Stefan Priebe - Profihost AG
  0 siblings, 1 reply; 13+ messages in thread
From: Sage Weil @ 2012-12-07 11:02 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG; +Cc: ceph-devel@vger.kernel.org

On Fri, 7 Dec 2012, Stefan Priebe - Profihost AG wrote:
> Hi Sage,
> 
> Thanks for patching this. I'm not sure whether I can test this. I've moved my
> systems to initial production tests and I can't risk downtime or data loss
> again ;-)
> 
> How stable are these fixes?

Only tested on my laptop.  :)  

I'll put them through our qa first, and make sure our stress tests 
can trigger the old failure.

Thanks!
sage



* Re: OSD crashed today in os/JournalingObjectStore.cc
  2012-12-07 11:02             ` Sage Weil
@ 2012-12-07 11:29               ` Stefan Priebe - Profihost AG
  0 siblings, 0 replies; 13+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-12-07 11:29 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel@vger.kernel.org

Hello Sage,

That would be great. Can you get back to me once that is done?

Greets
Stefan


Thread overview: 13+ messages
2012-12-05  9:56 OSD crashed today in os/JournalingObjectStore.cc Stefan Priebe - Profihost AG
2012-12-05 14:41 ` Sage Weil
2012-12-05 16:05   ` Stefan Priebe - Profihost AG
2012-12-05 22:25     ` Stefan Priebe
2012-12-05 22:29       ` Stefan Priebe
2012-12-05 23:36     ` Sage Weil
2012-12-06  9:38       ` Stefan Priebe - Profihost AG
2012-12-06 14:43         ` Sage Weil
2012-12-06 14:47           ` Stefan Priebe - Profihost AG
2012-12-07  0:38         ` Sage Weil
2012-12-07  7:49           ` Stefan Priebe - Profihost AG
2012-12-07 11:02             ` Sage Weil
2012-12-07 11:29               ` Stefan Priebe - Profihost AG
