From: Kevin Decherf <kevin@kdecherf.com>
To: ceph-devel@vger.kernel.org
Cc: support@clever-cloud.com
Subject: Re: Crash and strange things on MDS
Date: Mon, 11 Feb 2013 14:05:18 +0100
Message-ID: <20130211130518.GN6997@kdecherf.com>
In-Reply-To: <20130204180154.GO3286@kdecherf.com>
On Mon, Feb 04, 2013 at 07:01:54PM +0100, Kevin Decherf wrote:
> Hey everyone,
>
> It's my first post here to expose a potential issue I found today using
> Ceph 0.56.1.
>
> The cluster configuration is, briefly: 27 osd of ~900GB and 3 MON/MDS.
> All nodes are running Exherbo (source-based distribution) with Ceph
> 0.56.1 and Linux 3.7.0. We are only using CephFS on this cluster which
> is mounted on ~60 clients (increasing each day). Objects are replicated
> three times and the cluster handles only 7GB of data atm for 350k
> objects.
>
> In certain conditions (I don't know them atm), some clients hang,
> generate CPU overloads (kworker) and are unable to make any IO on
> Ceph. The active MDS has ~20Mbps in/out during the issue (less than
> 2Mbps in normal activity). I don't know if it's directly linked but we
> also observe a lot of missing files at the same time.
>
> The problem is similar to this one [1].
>
> A restart of the client or the MDS was enough before today, but we found
> a new behavior: the active MDS consumes a lot of CPU during 3 to 5 hours
> with ~25% clients hanging.
>
> In logs I found a segfault with this backtrace [2] and 100,000 dumped
> events during the first hang. We observed another hang which produces
> a lot of these events (in debug mode):
> - "mds.0.server FAIL on ESTALE but attempting recovery"
> - "mds.0.server reply_request -116 (Stale NFS file handle)
> client_request(client.10991:1031 getattr As #1000004bab0
> RETRY=132)"
>
> We have no profiling tools available on these nodes, and I don't know
> what I should search in the 35 GB log file.
>
> Note: the segmentation fault occurred only once but the problem was
> observed four times on this cluster.
>
> Any help may be appreciated.
>
> References:
> [1] http://www.spinics.net/lists/ceph-devel/msg04903.html
> [2] ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
> 1: /usr/bin/ceph-mds() [0x817e82]
> 2: (()+0xf140) [0x7f9091d30140]
> 3: (MDCache::request_drop_foreign_locks(MDRequest*)+0x21) [0x5b9dc1]
> 4: (MDCache::request_drop_locks(MDRequest*)+0x19) [0x5baae9]
> 5: (MDCache::request_cleanup(MDRequest*)+0x60) [0x5bab70]
> 6: (MDCache::request_kill(MDRequest*)+0x80) [0x5bae90]
> 7: (Server::journal_close_session(Session*, int)+0x372) [0x549aa2]
> 8: (Server::kill_session(Session*)+0x137) [0x549c67]
> 9: (Server::find_idle_sessions()+0x12a6) [0x54b0d6]
> 10: (MDS::tick()+0x338) [0x4da928]
> 11: (SafeTimer::timer_thread()+0x1af) [0x78151f]
> 12: (SafeTimerThread::entry()+0xd) [0x782bad]
> 13: (()+0x7ddf) [0x7f9091d28ddf]
> 14: (clone()+0x6d) [0x7f90909cc24d]

I found a possible cause and a way to reproduce this issue.

We now have ~90 clients for 18GB / 650k objects, and the storm occurs
when we run an "intensive IO" command (tar of the whole pool / rsync
in one folder) on one of our clients. That client is the only one that
uses ceph-fuse; I don't know whether the issue is limited to it.

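
As a starting point for searching the large MDS log, here is a minimal
sketch that tallies the ESTALE replies per client. It assumes the debug
log lines have the same shape as the reply_request line quoted above and
that each event sits on a single line; the function name and the regex
are illustrative, not part of Ceph.

```python
import re
from collections import Counter

# Pattern modelled on the single reply_request line quoted in this thread.
# The real log layout (timestamps, thread ids, prefixes) is an assumption.
ESTALE_RE = re.compile(
    r"reply_request -116 .*?"
    r"client_request\(client\.(\d+):(\d+) (\w+).*?RETRY=(\d+)"
)


def tally_estale(lines):
    """Count ESTALE (-116) replies per client and track the highest RETRY seen.

    `lines` is any iterable of log lines, e.g. an open file object, so the
    log is streamed rather than loaded into memory.
    """
    per_client = Counter()
    max_retry = {}
    for line in lines:
        match = ESTALE_RE.search(line)
        if match is None:
            continue
        client = int(match.group(1))
        retry = int(match.group(4))
        per_client[client] += 1
        max_retry[client] = max(max_retry.get(client, 0), retry)
    return per_client, max_retry
```

Passing an open file (for example `with open(path) as f: tally_estale(f)`,
where `path` is wherever the MDS log lives) would show whether the storm
comes from one client, such as the ceph-fuse one, or from many.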
Any idea?
Cheers,
--
Kevin Decherf - @Kdecherf
GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F
http://kdecherf.com