mds'es stuck in resolve(and one duplicated?)

From: Andrew Thompson @ 2012-09-10 21:36 UTC (permalink / raw)
  To: ceph-devel

[-- Attachment #1: Type: text/plain, Size: 1304 bytes --]

Greetings,

Has anyone seen this or got ideas on how to fix it?

mdsmap e18399: 3/3/3 up {0=b=up:resolve,1=a=up:resolve(laggy or 
crashed),2=a=up:resolve(laggy or crashed)}

Notice that the 2nd and 3rd MDS entries have the same letter ("a"). I'm
not sure how that happened; I'm guessing a typo in my ceph.conf.

Taking mds.a down doesn't help, b just stays in resolve.

mds.a is only running on a single instance, even though it shows as up 
twice.

When I take an MDS down and start it back up, it goes through a couple
of states and then sticks at resolve.

I've tried the method listed here, but can't see any change: 
http://www.sebastien-han.fr/blog/2012/07/04/remove-a-mds-server-from-a-ceph-cluster/

I tried "ceph mds stop X" as mentioned here 
http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/2585 , but 
see the results below:

athompson@ceph01:~$ sudo ceph mds stop 0
mds.0 not active (up:resolve)
athompson@ceph01:~$ sudo ceph mds stop 1
mds.1 not active (up:resolve)
athompson@ceph01:~$ sudo ceph mds stop 2
mds.2 not active (up:resolve)

I've attached the results of `ceph mds dump -o -`

Currently, mds.b.log is full of these reset/connect messages, followed
by the point where I issued a `service ceph stop mds` a few minutes ago
(see attached).

Thanks,
Andrew.

-- 
Andrew Thompson
http://aktzero.com/


[-- Attachment #2: mds-dump.txt --]
[-- Type: text/plain, Size: 847 bytes --]

athompson@ceph01:~$ sudo ceph mds dump -o -
dumped mdsmap epoch 18493
epoch   18493
flags   0
created 2012-08-10 16:25:06.747103
modified        2012-09-10 17:29:20.826226
tableserver     0
root    0
session_timeout 60
session_autoclose       300
last_failure    3430
last_failure_osd_epoch  426
compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object}
max_mds 3
in      0,1,2
up      {0=5401,1=5524,2=5506}
failed
stopped
data_pools      [0,0]
metadata_pool   1
5401:   172.19.7.54:6800/13793 'b' mds.0.9 up:resolve seq 149 laggy since 2012-09-10 17:21:05.270280
5524:   172.19.7.39:6800/8536 'a' mds.1.11 up:resolve seq 4 laggy since 2012-09-08 02:52:20.668649
5506:   172.19.7.39:6800/7930 'a' mds.2.3 up:resolve seq 5 laggy since 2012-09-08 02:48:05.433724
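
The duplication is easier to see if the three daemon lines at the end of the dump are grouped by daemon name. The following is only a rough sketch: the regular expression is written against the three lines shown here, not against any documented output format, and reading `mds.1.11` as rank 1, incarnation 11 is an assumption. The helper names `ranks_by_name` and `duplicated_names` are made up for illustration.

```python
import re
from collections import defaultdict

# The three daemon lines copied verbatim from the dump above.
DUMP = """\
5401:   172.19.7.54:6800/13793 'b' mds.0.9 up:resolve seq 149 laggy since 2012-09-10 17:21:05.270280
5524:   172.19.7.39:6800/8536 'a' mds.1.11 up:resolve seq 4 laggy since 2012-09-08 02:52:20.668649
5506:   172.19.7.39:6800/7930 'a' mds.2.3 up:resolve seq 5 laggy since 2012-09-08 02:48:05.433724
"""

# Assumed layout: "<gid>: <addr> '<name>' mds.<rank>.<incarnation> <state> ..."
LINE_RE = re.compile(
    r"^(?P<gid>\d+):\s+(?P<addr>\S+)\s+'(?P<name>[^']+)'\s+"
    r"mds\.(?P<rank>\d+)\.(?P<inc>\d+)\s+(?P<state>\S+)"
)


def ranks_by_name(dump_text):
    """Map each daemon name to the (rank, gid, addr) entries listed for it."""
    held = defaultdict(list)
    for line in dump_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            held[match["name"]].append(
                (int(match["rank"]), int(match["gid"]), match["addr"])
            )
    return dict(held)


def duplicated_names(dump_text):
    """Return only the names that appear against more than one rank."""
    return {
        name: entries
        for name, entries in ranks_by_name(dump_text).items()
        if len(entries) > 1
    }


if __name__ == "__main__":
    for name, entries in duplicated_names(DUMP).items():
        print(name, entries)
```

Run against the dump above, this reports only the name `a`, listed against ranks 1 and 2. The two entries share the same host and port but differ in the part of the address after the slash.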


[-- Attachment #3: ceph-mds.b.log --]
[-- Type: text/plain, Size: 1158 bytes --]

2012-09-10 16:54:23.595995 7f843c55b700  0 mds.0.9 ms_handle_reset on 172.19.7.56:6800/8509
2012-09-10 16:54:23.598638 7f843c55b700  0 mds.0.9 ms_handle_connect on 172.19.7.56:6800/8509
2012-09-10 17:09:09.367041 7f843c55b700  0 mds.0.9 ms_handle_reset on 172.19.7.39:6804/6522
2012-09-10 17:09:09.370663 7f843c55b700  0 mds.0.9 ms_handle_connect on 172.19.7.39:6804/6522
2012-09-10 17:09:22.891795 7f843c55b700  0 mds.0.9 ms_handle_reset on 172.19.7.39:6801/6430
2012-09-10 17:09:22.894177 7f843c55b700  0 mds.0.9 ms_handle_connect on 172.19.7.39:6801/6430
2012-09-10 17:09:23.210881 7f843c55b700  0 mds.0.9 ms_handle_reset on 172.19.7.54:6801/14003
2012-09-10 17:09:23.214310 7f843c55b700  0 mds.0.9 ms_handle_connect on 172.19.7.54:6801/14003
2012-09-10 17:09:23.699220 7f843c55b700  0 mds.0.9 ms_handle_reset on 172.19.7.56:6800/8509
2012-09-10 17:09:23.701789 7f843c55b700  0 mds.0.9 ms_handle_connect on 172.19.7.56:6800/8509
2012-09-10 17:21:28.125699 7f843cd5c700 -1 mds.0.9 *** got signal Terminated ***
2012-09-10 17:21:28.125755 7f843cd5c700  1 mds.0.9 suicide.  wanted down:dne, now up:resolve
2012-09-10 17:21:28.386805 7f84422a6780  0 stopped.

* Re: mds'es stuck in resolve(and one duplicated?)
From: Gregory Farnum @ 2012-09-13 18:37 UTC (permalink / raw)
  To: Andrew Thompson; +Cc: ceph-devel

On Mon, Sep 10, 2012 at 2:36 PM, Andrew Thompson <andrewkt@aktzero.com> wrote:
> Greetings,
>
> Has anyone seen this or got ideas on how to fix it?
>
> mdsmap e18399: 3/3/3 up {0=b=up:resolve,1=a=up:resolve(laggy or
> crashed),2=a=up:resolve(laggy or crashed)}
>
> Notice that the 2nd and 3rd MDS entries have the same letter ("a"). I'm
> not sure how that happened; I'm guessing a typo in my ceph.conf.

So you actually have three configured MDSes?
What you're seeing there is that the logical[1] MDS 1 and MDS 2 have
both taken so long to report in that the monitor thinks they've
probably died ("laggy or crashed"). You should check your ceph.conf
and see whether you in fact have two daemons with the same name. The
other possibility is that mds.a really was the most recent daemon to
take ownership of both of those logical identities, although if it
did, that's certainly a bug. (Probably both have a corrupted state
that is causing the daemon to assert out, and perhaps your MDSes are
being restarted by upstart but mds.a lasted a little longer than
mds.c?)
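
One quick way to check ceph.conf for two MDS sections with the same name is to scan the section headers and count them. This is a minimal sketch, not a Ceph tool: it assumes the daemons are declared in `[mds.X]` style sections, and the sample fragment, including its host names, is entirely hypothetical. The function names `mds_sections` and `repeated_sections` are illustrative only.

```python
import re
from collections import Counter

# Matches a section header such as "[mds.a]" at the start of a line.
# Commented-out headers ("; [mds.c]" or "# [mds.c]") are not matched.
SECTION_RE = re.compile(r"^\s*\[\s*(mds\.[^\]\s]+)\s*\]")


def mds_sections(conf_text):
    """Return the [mds.X] section names in order of appearance."""
    names = []
    for line in conf_text.splitlines():
        match = SECTION_RE.match(line)
        if match:
            names.append(match.group(1))
    return names


def repeated_sections(conf_text):
    """Return the section names that are declared more than once."""
    counts = Counter(mds_sections(conf_text))
    return sorted(name for name, count in counts.items() if count > 1)


# Hypothetical fragment showing the suspected typo: the third section
# was presumably meant to be [mds.c]. Host names are placeholders.
SAMPLE = """\
[mds.a]
    host = host-one
[mds.b]
    host = host-two
[mds.a]
    host = host-three
"""

if __name__ == "__main__":
    print(repeated_sections(SAMPLE))
```

To check a real file, read its contents and pass the text to `repeated_sections`; an empty list means no MDS section name is repeated.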

> Taking mds.a down doesn't help, b just stays in resolve.
Indeed not — the resolve phase is when the MDS daemons agree on who
has authority over what part of the filesystem hierarchy.

> mds.a is only running on a single instance, even though it shows as up
> twice.
>
> When I take an MDS down and start it back up, it goes through a couple of
> states and then sticks at resolve.
>
> I've tried the method listed here, but can't see any change:
> http://www.sebastien-han.fr/blog/2012/07/04/remove-a-mds-server-from-a-ceph-cluster/
>
> I tried "ceph mds stop X" as mentioned here
> http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/2585 , but see
> the results below:
>
> athompson@ceph01:~$ sudo ceph mds stop 0
> mds.0 not active (up:resolve)
> athompson@ceph01:~$ sudo ceph mds stop 1
> mds.1 not active (up:resolve)
> athompson@ceph01:~$ sudo ceph mds stop 2
> mds.2 not active (up:resolve)

"Stopping" an MDS requires it to cooperatively give its hierarchy away
to other MDS daemons. Since it's not running, it can't do that.

>
> I've attached the results of `ceph mds dump -o -`
>
> Currently, mds.b.log is full of these reset/connect messages, followed by
> the point where I issued a `service ceph stop mds` a few minutes ago
> (see attached).

You'll want to go look at mds.a — its log should have a backtrace that
might tell us more. Figuring this out will probably also require
enabling some debug logging, but I have to warn you it might not turn
out well — we say the POSIX filesystem isn't production ready for a
reason!

Sorry your email dropped through the cracks for a few days; I hope this helps.
-Greg
[1]: Since an MDS has no local storage, each configured system can be
associated with any "logical MDS", which consists solely of the MDS
log and the responsibility for part of the tree.