From: Josh Pieper <jjp@pobox.com>
To: Sage Weil <sage@newdream.net>
Cc: ceph-devel@vger.kernel.org
Subject: Re: 0.40 OSD - Address family not supported by protocol
Date: Wed, 18 Jan 2012 23:00:33 -0500
Message-ID: <20120119040033.GY4585@rcn.com>
In-Reply-To: <Pine.LNX.4.64.1201171422010.24000@cobra.newdream.net>
Sage,
Thanks for sorting out the root cause!
-Josh
Sage Weil wrote:
> Hi Josh,
>
> I just sorted this out. The problem was that the encoding for
> OSDSuperblock was changed, and that struct was embedded in the MOSDBoot
> message. Some of your OSDs restarted before the monitors, so the old
> monitors saw the new structure and misdecoded the message with garbage
> (well, zeros) for the heartbeat address. This made it into the OSDMap,
> and a very impolite assert in the messenger code made the process crash
> when it got an error from socket(2).
>
> The assert and error handling are cleaned up. There isn't a nice way to
> fix the behavior of the old code, though, so for everyone else:
> upgrade/restart the monitors before the osds to avoid triggering this. If
> you do hit it, restarting the OSDs (possibly a couple of times) will clear it up.
> Once all of the ':/0' values disappear from 'ceph osd dump' you're in the
> clear.
>
> sage
>
>
> http://tracker.newdream.net/issues/1942
>
> On Sat, 14 Jan 2012, Josh Pieper wrote:
>
> > I just upgraded our test cluster to 0.40, and immediately after
> > starting up I get asserts in all the OSDs. I've inlined a relevant
> > backtrace below. Is there anything else that would be useful for
> > debugging?
> >
> > Our test cluster is 3 Ubuntu 11.10 amd64 machines, each with a mon and
> > osd.
> >
> > Looking at an strace, it is pretty clearly asking for an invalid
> > address family, although I'm not sure where it is coming from.
> >
> > [pid 30648] socket(PF_UNSPEC, SOCK_STREAM, 0 <unfinished ...>
> > [pid 30648] <... socket resumed> ) = -1 EAFNOSUPPORT (Address family not supported by protocol)
> >
> > -Josh
> >
> > -------
> > 2012-01-14 09:31:03.395266 7f67edf08700 -- 10.1.10.71:6801/27529 >> 10.1.10.73:6801/8127 pipe(0x14e0780 sd=19 pgs=0 cs=0 l=0).connect claims to be 10.1.10.73:6801/24029 not 10.1.10.73:6801/8127 - wrong node!
> > 2012-01-14 09:31:03.395579 7f67ede07700 -- :/27530 >> :/0 pipe(0x14e0500 sd=-1 pgs=0 cs=0 l=0).connect couldn't created socket Address family not supported by protocol
> > msg/SimpleMessenger.cc: In function 'int SimpleMessenger::Pipe::connect()', in thread '7f67ede07700'
> > msg/SimpleMessenger.cc: 1038: FAILED assert(0)
> > ceph version 0.40 (commit:7eea40ea37fb3a68a2042a2218c9b8c9c40a843e)
> > 1: (SimpleMessenger::Pipe::connect()+0x87c) [0x5be81c]
> > 2: (SimpleMessenger::Pipe::writer()+0x456) [0x5c1536]
> > 3: (SimpleMessenger::Pipe::Writer::entry()+0xd) [0x4b228d]
> > 4: (()+0x7efc) [0x7f67ffdf4efc]
> > 5: (clone()+0x6d) [0x7f67fe42589d]
> > ceph version 0.40 (commit:7eea40ea37fb3a68a2042a2218c9b8c9c40a843e)
> > 1: (SimpleMessenger::Pipe::connect()+0x87c) [0x5be81c]
> > 2: (SimpleMessenger::Pipe::writer()+0x456) [0x5c1536]
> > 3: (SimpleMessenger::Pipe::Writer::entry()+0xd) [0x4b228d]
> > 4: (()+0x7efc) [0x7f67ffdf4efc]
> > 5: (clone()+0x6d) [0x7f67fe42589d]
> > *** Caught signal (Aborted) **
> > in thread 7f67ede07700
> > ceph version 0.40 (commit:7eea40ea37fb3a68a2042a2218c9b8c9c40a843e)
> > 1: /usr/bin/ceph-osd() [0x5fd926]
> > 2: (()+0x10060) [0x7f67ffdfd060]
> > 3: (gsignal()+0x35) [0x7f67fe37a3a5]
> > 4: (abort()+0x17b) [0x7f67fe37db0b]
> > 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f67fec38d7d]
> > 6: (()+0xb9f26) [0x7f67fec36f26]
> > 7: (()+0xb9f53) [0x7f67fec36f53]
> > 8: (()+0xba04e) [0x7f67fec3704e]
> > 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x193) [0x5cfd33]
> > 10: (SimpleMessenger::Pipe::connect()+0x87c) [0x5be81c]
> > 11: (SimpleMessenger::Pipe::writer()+0x456) [0x5c1536]
> > 12: (SimpleMessenger::Pipe::Writer::entry()+0xd) [0x4b228d]
> > 13: (()+0x7efc) [0x7f67ffdf4efc]
> > 14: (clone()+0x6d) [0x7f67fe42589d]
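For anyone who wants to see the socket(2) failure from the strace above without a Ceph cluster, here is a minimal sketch in Python. It is not Ceph code: `try_socket` is a hypothetical helper, and the sketch assumes a Linux host, where asking for the unspecified address family is rejected by the kernel. It also reports the error instead of asserting, which is roughly the behavior one would want from a caller.

```python
import errno
import os
import socket


def try_socket(family):
    """Try to create a TCP socket; return None on success, else the errno."""
    try:
        s = socket.socket(family, socket.SOCK_STREAM)
    except OSError as e:
        return e.errno
    s.close()
    return None


if __name__ == "__main__":
    # AF_UNSPEC is the "no family" value; strace prints it as PF_UNSPEC.
    err = try_socket(socket.AF_UNSPEC)
    if err is None:
        print("socket created")
    else:
        print("socket() failed:", errno.errorcode.get(err, err), os.strerror(err))
```

On the Linux machines in this thread the failure corresponds to EAFNOSUPPORT; other platforms may report a different error.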
--
Shaw's Principle:
Build a system that even a fool can use, and only a fool will
want to use it.
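
The ':/0' check described in the quoted message can be scripted. The following is a small sketch, assuming only that blank addresses show up as a standalone ':/0' token in the text of 'ceph osd dump'. The helper name `lines_with_blank_addr` and the sample text are made up for illustration and are not real dump output.

```python
def lines_with_blank_addr(dump_text):
    """Return the lines of dump_text that contain a standalone ':/0' token."""
    return [line for line in dump_text.splitlines() if ":/0" in line.split()]


# Made-up sample for illustration, NOT real `ceph osd dump` output.
SAMPLE = """\
osd.0 up 10.1.10.71:6800/27529 10.1.10.71:6801/27529
osd.1 up 10.1.10.72:6800/8127 :/0
"""

if __name__ == "__main__":
    bad = lines_with_blank_addr(SAMPLE)
    for line in bad:
        print("still blank:", line)
    if bad:
        print("restart the affected OSDs and check again")
    else:
        print("in the clear")
```

To use it on a live cluster, save the output of 'ceph osd dump' to a file and pass the file contents to the function in place of SAMPLE.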