From: "Székelyi Szabolcs" <szekelyi@niif.hu>
To: ceph-devel@vger.kernel.org
Subject: Re: OSD doesn't start
Date: Sun, 08 Jul 2012 20:51:38 +0200
Message-ID: <7377491.S2NCfnprEH@mranderson>
In-Reply-To: <1680690.nczT3S6HBC@mranderson>
On 2012. July 6. 01:33:13 Székelyi Szabolcs wrote:
> On 2012. July 5. 16:12:42 Székelyi Szabolcs wrote:
> > On 2012. July 4. 09:34:04 Gregory Farnum wrote:
> > > Hrm, it looks like the OSD data directory got a little busted somehow.
> > > How did you perform your upgrade? (That is, how did you kill your
> > > daemons, in what order, and when did you bring them back up.)
> >
> > Since it would be hard and long to describe in text, I've collected the
> > relevant log entries, sorted by time at http://pastebin.com/Ev3M4DQ9 . The
> > short story is that after seeing that the OSDs won't start, I tried to
> > bring down the whole cluster and start it up from scratch. It didn't
> > change anything, so I rebooted the two machines (running all three
> > daemons), to see if it changes anything. It didn't and I gave up.
> >
> > My ceph config is available at http://pastebin.com/KKNjmiWM .
> >
> > Since this is my test cluster, I'm not very concerned about the data on
> > it. But the other one, with the same config, is dying I think. ceph-fuse
> > is eating around 75% CPU on the sole monitor ("cc") node. The monitor
> > about 15%. On the other two nodes, the OSD eats around 50%, the MDS 15%,
> > the monitor another 10%. No Ceph filesystem activity is going on at the
> > moment. Blktrace reports about 1kB/s disk traffic on the partition
> > hosting the OSD data dir. The data seems to be accessible at the moment,
> > but I'm afraid that my production cluster will end up in a similar
> > situation after upgrade, so I don't dare to touch it.
> >
> > Do you have any suggestion what I should check?
>
> Yes, it definitely looks like dying. Besides the above symptoms all clients'
> ceph-fuse burn the CPU, there are unreadable files on the fs (tar blocks on
> them infinitely), the FUSE clients emit messages like
>
> ceph-fuse: 2012-07-05 23:21:41.583692 7f444dfd5700 0 -- client_ip:0/1181
> send_message dropped message ping v1 because of no pipe on con 0x1034000
>
> every 5 seconds. I tried to backup the data on it, but it got blocked in the
> middle. Since then I'm unable to get any data out of it, not even by
> killing ceph-fuse and remounting the fs.

So it looks like the recent leap second caused all my troubles... After a
colleague applied the workaround described here[0], the load on the nodes went
back to normal, but the cluster was still sick. For example, after stopping one
of the monitors, `ceph -s` still showed all the monitors as up & running,
whereas at least one of them should clearly have been marked down (there was
no ceph-mon process there).
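
For reference, the fix that circulated widely for this leap-second bug was to
set the system clock to its current value, commonly written as
`date -s "$(date)"`. The sketch below assumes that this is what [0] describes
and that it is run as root; it uses an explicit timestamp format instead of
the default `date` output, and the `DRY_RUN` guard and the `reset_clock` name
are hypothetical, added only so the command can be inspected without touching
the clock.

```shell
# Sketch only: re-set the clock to its current value (requires root).
# DRY_RUN and reset_clock are hypothetical names introduced for this example.
reset_clock() {
    # Capture the current local time in an unambiguous format.
    now="$(date '+%Y-%m-%d %H:%M:%S')"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Default: only print what would be executed.
        echo "would run: date -s \"$now\""
    else
        # Setting the time is what clears the post-leap-second CPU load.
        date -s "$now"
    fi
}

reset_clock
```

Running it with `DRY_RUN=0` as root would actually set the clock. As described
above, this only brought the load down; the monitors still reported stale
state until the whole cluster was restarted.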

Finally I stopped the whole cluster (BTW `ceph stop`, documented here[1],
no longer works; it replies something like 'unrecognized subsystem'),
rebooted all the nodes, and everything came up as it should have.

Cheers,
--
cc
[0] http://www.h-online.com/open/news/item/Leap-second-bug-in-Linux-wastes-electricity-1631462.html
[1] http://ceph.com/docs/master/control/