Greetings,

Has anyone seen this, or does anyone have ideas on how to fix it?

mdsmap e18399: 3/3/3 up {0=b=up:resolve,1=a=up:resolve(laggy or 
crashed),2=a=up:resolve(laggy or crashed)}

Notice that the 2nd and 3rd MDS entries have the same name ("a"). I'm not 
sure how that happened; I'm guessing a typo in my ceph.conf.
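
To make the duplicate easier to spot, here is a minimal sketch that parses a summary line like the one above and reports any daemon name holding more than one rank. It assumes the entries keep the rank=name=state layout shown above; the function name find_duplicate_mds_names is hypothetical and is not part of Ceph.

```python
import re
from collections import defaultdict


def find_duplicate_mds_names(summary):
    """Return {name: [ranks]} for daemon names that hold more than one rank."""
    # Take the text between the braces, e.g. "0=b=up:resolve,1=a=up:resolve(...)".
    match = re.search(r"\{(.*)\}", summary, re.DOTALL)
    if match is None:
        return {}
    ranks_by_name = defaultdict(list)
    # Each entry is assumed to look like rank=name=state.
    for rank, name in re.findall(r"(\d+)=([^=,]+)=", match.group(1)):
        ranks_by_name[name].append(int(rank))
    return {name: ranks for name, ranks in ranks_by_name.items() if len(ranks) > 1}


summary = ("mdsmap e18399: 3/3/3 up {0=b=up:resolve,1=a=up:resolve(laggy or "
           "crashed),2=a=up:resolve(laggy or crashed)}")
print(find_duplicate_mds_names(summary))  # {'a': [1, 2]}
```

For the line above, this reports that "a" holds ranks 1 and 2, which matches what I'm seeing.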

Taking mds.a down doesn't help; b just stays in resolve.

mds.a is only running on a single instance, even though it shows as up 
twice.

When I take an MDS down and start it back up, it goes through a couple 
of states and then sticks at resolve.

I've tried the method listed here, but can't see any change: 
http://www.sebastien-han.fr/blog/2012/07/04/remove-a-mds-server-from-a-ceph-cluster/

I tried "ceph mds stop X" as mentioned here 
http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/2585 , but 
got the results below:

athompson@ceph01:~$ sudo ceph mds stop 0
mds.0 not active (up:resolve)
athompson@ceph01:~$ sudo ceph mds stop 1
mds.1 not active (up:resolve)
athompson@ceph01:~$ sudo ceph mds stop 2
mds.2 not active (up:resolve)

I've attached the output of `ceph mds dump -o -`.

Currently, mds.b.log is full of these reset/connect messages, followed by 
the point where I issued a `service ceph stop mds` a few minutes ago (see 
attached).

Thanks,
Andrew.

-- 
Andrew Thompson
http://aktzero.com/