From: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
To: "ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
Subject: Re: Upgrading from 0.61.5 to 0.61.6 ended in disaster
Date: Wed, 24 Jul 2013 09:37:33 +0200
Message-ID: <51EF843D.1070107@profihost.ag>
In-Reply-To: <51EF7CC8.9070507@profihost.ag>
Hi,
I uploaded my ceph mon store to cephdrop:
/home/cephdrop/ceph-mon-failed-assert-0.61.6/mon.tar.gz
Hopefully someone can find the culprit soon.
The mon fails at this assert in mon/OSDMonitor.cc (line 156, in
OSDMonitor::update_from_paxos):
// if we trigger this, then there's something else going with the store
// state, and we shouldn't want to work around it without knowing what
// exactly happened.
assert(latest_full > 0);
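For anyone who wants to check their own mon logs for the same failure, here is a minimal sketch (not part of ceph) that pulls the failed assertion out of log text. It assumes the "file: line: FAILED assert(expr)" layout seen in my log; the function name find_failed_asserts is made up for illustration.

```python
import re

# Matches lines such as:
#   mon/OSDMonitor.cc: 156: FAILED assert(latest_full > 0)
# The layout is assumed from the log quoted in this thread.
ASSERT_RE = re.compile(
    r"(?P<file>\S+): (?P<line>\d+): FAILED assert\((?P<expr>.*)\)"
)


def find_failed_asserts(log_text):
    """Return unique (file, line, expression) tuples in order of appearance."""
    seen = []
    for raw in log_text.splitlines():
        # Drop mail quoting ("> ") and surrounding whitespace, if any.
        line = raw.lstrip("> ").strip()
        match = ASSERT_RE.search(line)
        if match is None:
            continue
        entry = (match.group("file"), int(match.group("line")), match.group("expr"))
        # The same assertion is logged again in the "recent events" dump,
        # so skip repeats.
        if entry not in seen:
            seen.append(entry)
    return seen


if __name__ == "__main__":
    sample = "mon/OSDMonitor.cc: 156: FAILED assert(latest_full > 0)\n"
    for path, lineno, expr in find_failed_asserts(sample):
        print(path, lineno, expr)
```

Reading the whole log file into a string and passing it to find_failed_asserts is enough to see whether a mon died on the same assert.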
Stefan
On 24.07.2013 09:05, Stefan Priebe - Profihost AG wrote:
> Hi,
>
> Today I wanted to upgrade from 0.61.5 to 0.61.6 to get rid of the mon bug.
>
> But this ended in a complete disaster.
>
> What I did:
> 1.) recompiled ceph tagged with 0.61.6
> 2.) installed the new ceph version on all machines
> 3.) tried to restart JUST ONE mon
>
> This failed with:
> [1774]: (33) Numerical argument out of domain
> failed: 'ulimit -n 8192; /usr/bin/ceph-mon -i a --pid-file
> /var/run/ceph/mon.a.pid -c /etc/ceph/ceph.conf '
>
> 2013-07-24 08:41:43.086951 7f53c185d700 -1 mon.a@0(leader) e1 *** Got
> Signal Terminated ***
> 2013-07-24 08:41:43.088090 7f53c185d700 0 quorum service shutdown
> 2013-07-24 08:41:43.088094 7f53c185d700 0 mon.a@0(???).health(3840)
> HealthMonitor::service_shutdown 1 services
> 2013-07-24 08:41:43.088097 7f53c185d700 0 quorum service shutdown
> 2013-07-24 08:41:44.224104 7fae6384a780 0 ceph version
> 0.61.6-15-g85db066 (85db0667307ac803c753d16fa374dd2fc29d76f3), process
> ceph-mon, pid 29871
> 2013-07-24 08:41:56.097385 7fae6384a780 -1 mon/OSDMonitor.cc: In
> function 'virtual void OSDMonitor::update_from_paxos(bool*)' thread
> 7fae6384a780 time 2013-07-24 08:41:56.096683
> mon/OSDMonitor.cc: 156: FAILED assert(latest_full > 0)
>
> ceph version 0.61.6-15-g85db066 (85db0667307ac803c753d16fa374dd2fc29d76f3)
> 1: (OSDMonitor::update_from_paxos(bool*)+0x2413) [0x50f5a3]
> 2: (PaxosService::refresh(bool*)+0xe6) [0x4f2c66]
> 3: (Monitor::refresh_from_paxos(bool*)+0x57) [0x48f7b7]
> 4: (Monitor::init_paxos()+0xe5) [0x48f955]
> 5: (Monitor::preinit()+0x679) [0x4bba79]
> 6: (main()+0x36b0) [0x484bb0]
> 7: (__libc_start_main()+0xfd) [0x7fae619a6c8d]
> 8: /usr/bin/ceph-mon() [0x4801e9]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> needed to interpret this.
>
> --- begin dump of recent events ---
> -13> 2013-07-24 08:41:44.222821 7fae6384a780 5 asok(0x2698000)
> register_command perfcounters_dump hook 0x2682010
> -12> 2013-07-24 08:41:44.222835 7fae6384a780 5 asok(0x2698000)
> register_command 1 hook 0x2682010
> -11> 2013-07-24 08:41:44.222837 7fae6384a780 5 asok(0x2698000)
> register_command perf dump hook 0x2682010
> -10> 2013-07-24 08:41:44.222842 7fae6384a780 5 asok(0x2698000)
> register_command perfcounters_schema hook 0x2682010
> -9> 2013-07-24 08:41:44.222845 7fae6384a780 5 asok(0x2698000)
> register_command 2 hook 0x2682010
> -8> 2013-07-24 08:41:44.222847 7fae6384a780 5 asok(0x2698000)
> register_command perf schema hook 0x2682010
> -7> 2013-07-24 08:41:44.222849 7fae6384a780 5 asok(0x2698000)
> register_command config show hook 0x2682010
> -6> 2013-07-24 08:41:44.222852 7fae6384a780 5 asok(0x2698000)
> register_command config set hook 0x2682010
> -5> 2013-07-24 08:41:44.222854 7fae6384a780 5 asok(0x2698000)
> register_command log flush hook 0x2682010
> -4> 2013-07-24 08:41:44.222856 7fae6384a780 5 asok(0x2698000)
> register_command log dump hook 0x2682010
> -3> 2013-07-24 08:41:44.222859 7fae6384a780 5 asok(0x2698000)
> register_command log reopen hook 0x2682010
> -2> 2013-07-24 08:41:44.224104 7fae6384a780 0 ceph version
> 0.61.6-15-g85db066 (85db0667307ac803c753d16fa374dd2fc29d76f3), process
> ceph-mon, pid 29871
> -1> 2013-07-24 08:41:44.224397 7fae6384a780 1 finished
> global_init_daemonize
> 0> 2013-07-24 08:41:56.097385 7fae6384a780 -1 mon/OSDMonitor.cc: In
> function 'virtual void OSDMonitor::update_from_paxos(bool*)' thread
> 7fae6384a780 time 2013-07-24 08:41:56.096683
> mon/OSDMonitor.cc: 156: FAILED assert(latest_full > 0)
>
> ceph version 0.61.6-15-g85db066 (85db0667307ac803c753d16fa374dd2fc29d76f3)
> 1: (OSDMonitor::update_from_paxos(bool*)+0x2413) [0x50f5a3]
> 2: (PaxosService::refresh(bool*)+0xe6) [0x4f2c66]
> 3: (Monitor::refresh_from_paxos(bool*)+0x57) [0x48f7b7]
> 4: (Monitor::init_paxos()+0xe5) [0x48f955]
> 5: (Monitor::preinit()+0x679) [0x4bba79]
> 6: (main()+0x36b0) [0x484bb0]
> 7: (__libc_start_main()+0xfd) [0x7fae619a6c8d]
> 8: /usr/bin/ceph-mon() [0x4801e9]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> needed to interpret this.
>
> 4.) I thought: no problem, mon.b and mon.c are still running. BUT all OSDs
> were still trying to reach mon.a:
>
> 2013-07-24 08:41:43.088997 7f011268f700 0 monclient: hunting for new mon
> 2013-07-24 08:41:56.792449 7f0109e7e700 0 -- 10.255.0.82:6802/29397 >>
> 10.255.0.100:6789/0 pipe(0x489e000 sd=286 :0 s=1 pgs=0 cs=0 l=1).fault
> 2013-07-24 08:42:02.792990 7f0116b6c700 0 -- 10.255.0.82:6802/29397 >>
> 10.255.0.100:6789/0 pipe(0x3c02780 sd=256 :0 s=1 pgs=0 cs=0 l=1).fault
> 2013-07-24 08:42:11.793525 7f0109d7d700 0 -- 10.255.0.82:6802/29397 >>
> 10.255.0.100:6789/0 pipe(0x84ec280 sd=256 :0 s=1 pgs=0 cs=0 l=1).fault
> 2013-07-24 08:42:23.794315 7f0109e7e700 0 -- 10.255.0.82:6802/29397 >>
> 10.255.0.100:6789/0 pipe(0x44c7b80 sd=286 :0 s=1 pgs=0 cs=0 l=1).fault
> 2013-07-24 08:42:27.621336 7f0122d2e700 0 log [WRN] : 5 slow requests,
> 5 included below; oldest blocked for > 30.378391 secs
> 2013-07-24 08:42:27.621344 7f0122d2e700 0 log [WRN] : slow request
> 30.378391 seconds old, received at 2013-07-24 08:41:57.242902:
> osd_op(client.14727601.0:3839848
> rbd_data.e0b5b26b8b4567.0000000000005b5a [write 684032~4096] 5.816d89d1
> snapc bef=[bef] e142137) v4 currently wait for new map
> 2013-07-24 08:42:27.621348 7f0122d2e700 0 log [WRN] : slow request
> 30.195074 seconds old, received at 2013-07-24 08:41:57.426219:
> osd_op(client.14828945.0:1088870
> rbd_data.e245696b8b4567.000000000000140e [write 988160~7168] 5.ed959c36
> snapc b80=[b80] e142137) v4 currently wait for new map
> 2013-07-24 08:42:27.621350 7f0122d2e700 0 log [WRN] : slow request
> 30.148871 seconds old, received at 2013-07-24 08:41:57.472422:
> osd_op(client.14667314.0:2818172
> rbd_data.dfcaa86b8b4567.0000000000000a13 [write 1654784~4096] 5.6972a67e
> snapc baa=[baa] e142137) v4 currently wait for new map
> 2013-07-24 08:42:27.621351 7f0122d2e700 0 log [WRN] : slow request
> 30.148829 seconds old, received at 2013-07-24 08:41:57.472464:
> osd_op(client.14667314.0:2818173
> rbd_data.dfcaa86b8b4567.0000000000000a13 [write 1957888~4096] 5.6972a67e
> snapc baa=[baa] e142137) v4 currently wait for new map
> 2013-07-24 08:42:27.621352 7f0122d2e700 0 log [WRN] : slow request
> 30.148784 seconds old, received at 2013-07-24 08:41:57.472509:
> osd_op(client.14667314.0:2818174
> rbd_data.dfcaa86b8b4567.0000000000000a13 [write 1966080~4096] 5.6972a67e
> snapc baa=[baa] e142137) v4 currently wait for new map
>
> ...
>
> 2013-07-24 08:50:20.826687 7f00ee6d9700 0 -- 10.255.0.82:6802/29397 >>
> 10.255.0.100:6789/0 pipe(0xdf02280 sd=288 :0 s=1 pgs=0 cs=0 l=1).fault
> 2013-07-24 08:50:26.826914 7f00f1697700 0 -- 10.255.0.82:6802/29397 >>
> 10.255.0.100:6789/0 pipe(0x465a000 sd=229 :0 s=1 pgs=0 cs=0 l=1).fault
> 2013-07-24 08:50:40.713100 7f00ee6d9700 0 -- 10.255.0.82:6802/29397 >>
> 10.255.0.100:6789/0 pipe(0x4383680 sd=281 :0 s=1 pgs=0 cs=0 l=1).fault
> 2013-07-24 08:50:44.828164 7f011392a700 0 -- 10.255.0.82:6802/29397 >>
> 10.255.0.100:6789/0 pipe(0x41ecf00 sd=281 :0 s=1 pgs=0 cs=0 l=1).fault
> 2013-07-24 08:51:02.829357 7f00f1697700 0 -- 10.255.0.82:6802/29397 >>
> 10.255.0.100:6789/0 pipe(0x1d8b180 sd=281 :0 s=1 pgs=0 cs=0 l=1).fault
>
> Stefan
>