* Ceph cluster is unreachable because of authentication failure
@ 2014-01-14 12:04 GuangYang
       [not found] ` <BLU0-SMTP1356EEDE6A8ACF94947F22ADFBF0-MsuGFMq8XAE@public.gmane.org>
  0 siblings, 1 reply; 14+ messages in thread
From: GuangYang @ 2014-01-14 12:04 UTC (permalink / raw)
  To: ceph-users@lists.ceph.com; +Cc: Ceph Development

Hi ceph-users and ceph-devel,
I came across an issue after restarting the monitors of the cluster: authentication fails, which prevents running any ceph command.

After we did some maintenance work, I restarted an OSD. However, the OSD would not rejoin the cluster automatically after being restarted, even though a TCP dump showed it had already sent a message to the monitor asking to be added to the cluster.

So I suspected there might be an issue with the monitors and restarted them one by one (3 in total). However, after restarting the monitors, every ceph command fails with an authentication timeout:

2014-01-14 12:00:30.499397 7fc7f195e700  0 monclient(hunting): authenticate timed out after 300
2014-01-14 12:00:30.499440 7fc7f195e700  0 librados: client.admin authentication error (110) Connection timed out
Error connecting to cluster: Error

Any idea why this error happens (restarting an OSD results in the same error)?

I believe the authentication information is persisted on the mon's local disk. Is there a chance that this data got corrupted?

Thanks,
Guang
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Ceph cluster is unreachable because of authentication failure
       [not found] ` <BLU0-SMTP1356EEDE6A8ACF94947F22ADFBF0-MsuGFMq8XAE@public.gmane.org>
@ 2014-01-14 14:55   ` Sage Weil
       [not found]     ` <alpine.DEB.2.00.1401140654500.10628-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
       [not found]     ` <016D7F31-523E-4EC1-8222-7D4084BA400F@outlook.com>
  2014-01-18  8:55   ` Sherry Shahbazi
  1 sibling, 2 replies; 14+ messages in thread
From: Sage Weil @ 2014-01-14 14:55 UTC (permalink / raw)
  To: GuangYang
  Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org,
	Ceph Development

On Tue, 14 Jan 2014, GuangYang wrote:
> Hi ceph-users and ceph-devel,
> I came across an issue after restarting the monitors of the cluster: authentication fails, which prevents running any ceph command.
> 
> After we did some maintenance work, I restarted an OSD. However, the OSD would not rejoin the cluster automatically after being restarted, even though a TCP dump showed it had already sent a message to the monitor asking to be added to the cluster.
> 
> So I suspected there might be an issue with the monitors and restarted them one by one (3 in total). However, after restarting the monitors, every ceph command fails with an authentication timeout:
> 
> 2014-01-14 12:00:30.499397 7fc7f195e700  0 monclient(hunting): authenticate timed out after 300
> 2014-01-14 12:00:30.499440 7fc7f195e700  0 librados: client.admin authentication error (110) Connection timed out
> Error connecting to cluster: Error
> 
> Any idea why this error happens (restarting an OSD results in the same error)?
> 
> I believe the authentication information is persisted on the mon's local disk. Is there a chance that this data got corrupted?

That sounds unlikely, but you're right that the core problem is with the 
mons.  What does 

 ceph daemon mon.`hostname` mon_status

say?  Perhaps they are not forming a quorum and that is what is preventing 
authentication.
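
To make the quorum check concrete, here is a minimal Python sketch that reads mon_status JSON and reports whether a majority of the monitors in the monmap are in quorum. The helper name quorum_summary is hypothetical, and the sample document is a trimmed copy of the output posted later in this thread; on a live host the JSON would come from the mon_status command above.

```python
import json


def quorum_summary(mon_status_json):
    """Summarise a mon_status JSON document (hypothetical helper)."""
    status = json.loads(mon_status_json)
    mons = status["monmap"]["mons"]
    quorum = status.get("quorum", [])
    # A monitor cluster needs a strict majority of the monmap to form a quorum.
    needed = len(mons) // 2 + 1
    return {
        "name": status["name"],
        "state": status["state"],
        "in_quorum": len(quorum),
        "needed": needed,
        "has_quorum": len(quorum) >= needed,
    }


# Trimmed sample taken from the mon_status output reported in this thread.
sample = """
{ "name": "osd151", "rank": 2, "state": "electing",
  "election_epoch": 85469, "quorum": [],
  "monmap": { "epoch": 1, "mons": [
      { "rank": 0, "name": "osd152" },
      { "rank": 1, "name": "osd153" },
      { "rank": 2, "name": "osd151" }]}}
"""

print(quorum_summary(sample))
```

A result with has_quorum set to False would be consistent with the authentication timeouts, since clients cannot authenticate until the monitors form a quorum.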

sage


* Re: Ceph cluster is unreachable because of authentication failure
       [not found]     ` <alpine.DEB.2.00.1401140654500.10628-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
@ 2014-01-14 21:54       ` Guang
  0 siblings, 0 replies; 14+ messages in thread
From: Guang @ 2014-01-14 21:54 UTC (permalink / raw)
  To: Sage Weil
  Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org,
	Ceph Development

Thanks Sage.

-bash-4.1$ sudo ceph --admin-daemon /var/run/ceph/ceph-mon.osd151.asok mon_status
{ "name": "osd151",
  "rank": 2,
  "state": "electing",
  "election_epoch": 85469,
  "quorum": [],
  "outside_quorum": [],
  "extra_probe_peers": [],
  "sync_provider": [],
  "monmap": { "epoch": 1,
      "fsid": "b9cb3ea9-e1de-48b4-9e86-6921e2c537d2",
      "modified": "0.000000",
      "created": "0.000000",
      "mons": [
            { "rank": 0,
              "name": "osd152",
              "addr": "10.193.207.130:6789\/0"},
            { "rank": 1,
              "name": "osd153",
              "addr": "10.193.207.131:6789\/0"},
            { "rank": 2,
              "name": "osd151",
              "addr": "10.194.0.68:6789\/0"}]}}

And:

-bash-4.1$ sudo ceph --admin-daemon /var/run/ceph/ceph-mon.osd151.asok quorum_status
{ "election_epoch": 85480,
  "quorum": [
        0,
        1,
        2],
  "quorum_names": [
        "osd151",
        "osd152",
        "osd153"],
  "quorum_leader_name": "osd152",
  "monmap": { "epoch": 1,
      "fsid": "b9cb3ea9-e1de-48b4-9e86-6921e2c537d2",
      "modified": "0.000000",
      "created": "0.000000",
      "mons": [
            { "rank": 0,
              "name": "osd152",
              "addr": "10.193.207.130:6789\/0"},
            { "rank": 1,
              "name": "osd153",
              "addr": "10.193.207.131:6789\/0"},
            { "rank": 2,
              "name": "osd151",
              "addr": "10.194.0.68:6789\/0"}]}}


According to the status above, the election has finished and a leader has been selected.
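
The two outputs above report different election epochs (85469 and 85480), so it may be worth tracking how that number moves between samples. The Python sketch below is only an illustration. It assumes that Ceph's elector uses odd epochs while an election is running and even epochs once it has settled, which matches the "electing" state shown with 85469 but is not stated anywhere in this thread. Both helper names are hypothetical.

```python
def describe_epoch(epoch):
    """Classify an election epoch (assumption: odd = electing, even = stable)."""
    return "electing" if epoch % 2 == 1 else "stable"


def epoch_deltas(samples):
    """Given (label, election_epoch) pairs in time order, return the change between neighbours."""
    return [(a[0], b[0], b[1] - a[1]) for a, b in zip(samples, samples[1:])]


# Values copied from the mon_status and quorum_status outputs above.
samples = [("mon_status", 85469), ("quorum_status", 85480)]

for label, epoch in samples:
    print(label, epoch, describe_epoch(epoch))
print(epoch_deltas(samples))
```

An epoch that keeps climbing between samples would suggest the monitors are repeatedly calling new elections rather than holding a stable quorum.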

Thanks,
Guang


* Re: Ceph cluster is unreachable because of authentication failure
       [not found]     ` <016D7F31-523E-4EC1-8222-7D4084BA400F@outlook.com>
@ 2014-01-16  8:26       ` Guang
  2014-01-16 17:35         ` Sage Weil
  0 siblings, 1 reply; 14+ messages in thread
From: Guang @ 2014-01-16  8:26 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-users@lists.ceph.com, Ceph Development

I still have had no luck figuring out what is causing the authentication failure, so in order to get the cluster back, I tried:
  1. stop all daemons (mon & osd)
  2. change the configuration to disable cephx
  3. start mon daemons (3 in total)
  4. start osd daemon one by one

After finishing step 3, the cluster became reachable ('ceph -s' gives results):
-bash-4.1$ sudo ceph -s
  cluster b9cb3ea9-e1de-48b4-9e86-6921e2c537d2
   health HEALTH_WARN 2797 pgs degraded; 107 pgs down; 7503 pgs peering; 917 pgs recovering; 6079 pgs recovery_wait; 2957 pgs stale; 7771 pgs stuck inactive; 2957 pgs stuck stale; 16567 pgs stuck unclean; recovery 54346804/779462977 degraded (6.972%); 9/259724199 unfound (0.000%); 2 near full osd(s); 57/751 in osds are down; noout,nobackfill,norecover,noscrub,nodeep-scrub flag(s) set
   monmap e1: 3 mons at {osd151=10.194.0.68:6789/0,osd152=10.193.207.130:6789/0,osd153=10.193.207.131:6789/0}, election epoch 106022, quorum 0,1,2 osd151,osd152,osd153
   osdmap e134893: 781 osds: 694 up, 751 in
    pgmap v2388518: 22203 pgs: 26 inactive, 14 active, 79 stale+active+recovering, 5020 active+clean, 242 stale, 4352 active+recovery_wait, 616 stale+active+clean, 177 active+recovering+degraded, 6714 peering, 925 stale+active+recovery_wait, 86 down+peering, 1547 active+degraded, 32 stale+active+recovering+degraded, 648 stale+peering, 21 stale+down+peering, 239 stale+active+degraded, 651 active+recovery_wait+degraded, 30 remapped+peering, 151 stale+active+recovery_wait+degraded, 4 stale+remapped+peering, 629 active+recovering; 79656 GB data, 363 TB used, 697 TB / 1061 TB avail; 54346804/779462977 degraded (6.972%); 9/259724199 unfound (0.000%)
   mdsmap e1: 0/0/1 up
(at this point, all OSDs should be down).

When I tried to start an OSD daemon, the startup script hung. The hung process is:
root      80497  80496  0 08:18 pts/0    00:00:00 python /usr/bin/ceph --name=osd.22 --keyring=/var/lib/ceph/osd/ceph-22/keyring osd crush create-or-move -- 22 0.40 root=default host=osd173

When I ran strace on the startup script, I got the following trace (process 75873 is the process above). A futex call failed and then it looped indefinitely on:
   select(0, NULL, NULL, NULL, {0, 16000}) = 0 (Timeout)
Any idea what might trigger this?

======= STRACE (PARTIAL) ========== 
[pid 75873] futex(0xf707a0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 75878] mmap(NULL, 134217728, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7f5da6529000
[pid 75878] munmap(0x7f5da6529000, 28143616) = 0
[pid 75878] munmap(0x7f5dac000000, 38965248) = 0
[pid 75878] mprotect(0x7f5da8000000, 135168, PROT_READ|PROT_WRITE) = 0
[pid 75878] futex(0xf707a0, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 75873] <... futex resumed> )       = 0
[pid 75873] futex(0xdd3cb0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 75878] futex(0xdd3cb0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 75873] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid 75878] <... futex resumed> )       = 0
[pid 75873] select(0, NULL, NULL, NULL, {0, 1000} <unfinished ...>
[pid 75878] rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0
[pid 75878] mmap(NULL, 10489856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f5dadb28000
[pid 75878] mprotect(0x7f5dadb28000, 4096, PROT_NONE) = 0
[some entries omitted…]
[pid 75873] select(0, NULL, NULL, NULL, {0, 16000}) = 0 (Timeout)
[pid 75873] select(0, NULL, NULL, NULL, {0, 32000}) = 0 (Timeout)
[pid 75873] select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
[pid 75873] select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
[pid 75873] select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
[pid 75873] select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
[pid 75873] select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
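
The select() timeouts in the trace (1000, then 16000, 32000 and repeated 50000 microseconds) look like a doubling sleep capped at 50 ms. Python 2's threading.Condition.wait(timeout) polls in exactly that way, so one plausible reading is that the python ceph CLI is waiting for a reply from the monitor rather than failing on the futex. This is an assumption and is not confirmed anywhere in this thread. The sketch below reproduces the sequence; the function name and defaults are hypothetical.

```python
def poll_delays_us(count, start_us=500, cap_us=50000):
    """Return the first `count` sleep intervals (microseconds) of a doubling backoff capped at cap_us."""
    delays = []
    delay = start_us
    for _ in range(count):
        # Double the previous delay, but never sleep longer than the cap.
        delay = min(delay * 2, cap_us)
        delays.append(delay)
    return delays


print(poll_delays_us(9))
# [1000, 2000, 4000, 8000, 16000, 32000, 50000, 50000, 50000]
```

If this reading is right, the loop is a symptom of the command never getting an answer, and the monitor side is the place to look.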


Thanks,
Guang


* Re: Ceph cluster is unreachable because of authentication failure
  2014-01-16  8:26       ` Guang
@ 2014-01-16 17:35         ` Sage Weil
       [not found]           ` <BLU0-SMTP169D80D759610D226681DD3DFB80@phx.gbl>
  0 siblings, 1 reply; 14+ messages in thread
From: Sage Weil @ 2014-01-16 17:35 UTC (permalink / raw)
  To: Guang; +Cc: ceph-users@lists.ceph.com, Ceph Development

Hi Guang,

On Thu, 16 Jan 2014, Guang wrote:
> I still have had no luck figuring out what is causing the authentication failure, so in order to get the cluster back, I tried:
>   1. stop all daemons (mon & osd)
>   2. change the configuration to disable cephx
>   3. start mon daemons (3 in total)
>   4. start osd daemon one by one
>
> After finishing step 3, the cluster became reachable ('ceph -s' gives results):
> -bash-4.1$ sudo ceph -s
>   cluster b9cb3ea9-e1de-48b4-9e86-6921e2c537d2
>    health HEALTH_WARN 2797 pgs degraded; 107 pgs down; 7503 pgs peering; 917 pgs recovering; 6079 pgs recovery_wait; 2957 pgs stale; 7771 pgs stuck inactive; 2957 pgs stuck stale; 16567 pgs stuck unclean; recovery 54346804/779462977 degraded (6.972%); 9/259724199 unfound (0.000%); 2 near full osd(s); 57/751 in osds are down; noout,nobackfill,norecover,noscrub,nodeep-scrub flag(s) set
>    monmap e1: 3 mons at {osd151=10.194.0.68:6789/0,osd152=10.193.207.130:6789/0,osd153=10.193.207.131:6789/0}, election epoch 106022, quorum 0,1,2 osd151,osd152,osd153
>    osdmap e134893: 781 osds: 694 up, 751 in
>     pgmap v2388518: 22203 pgs: 26 inactive, 14 active, 79 stale+active+recovering, 5020 active+clean, 242 stale, 4352 active+recovery_wait, 616 stale+active+clean, 177 active+recovering+degraded, 6714 peering, 925 stale+active+recovery_wait, 86 down+peering, 1547 active+degraded, 32 stale+active+recovering+degraded, 648 stale+peering, 21 stale+down+peering, 239 stale+active+degraded, 651 active+recovery_wait+degraded, 30 remapped+peering, 151 stale+active+recovery_wait+degraded, 4 stale+remapped+peering, 629 active+recovering; 79656 GB data, 363 TB used, 697 TB / 1061 TB avail; 54346804/779462977 degraded (6.972%); 9/259724199 unfound (0.000%)
>    mdsmap e1: 0/0/1 up
> (at this point, all OSDs should be down).
> 
> When I tried to start an OSD daemon, the startup script hung. The hung process is:
> root      80497  80496  0 08:18 pts/0    00:00:00 python /usr/bin/ceph --name=osd.22 --keyring=/var/lib/ceph/osd/ceph-22/keyring osd crush create-or-move -- 22 0.40 root=default host=osd173
> 
> When I ran strace on the startup script, I got the following trace (process 75873 is the process above). A futex call failed and then it looped indefinitely on:
>    select(0, NULL, NULL, NULL, {0, 16000}) = 0 (Timeout)
> Any idea what might trigger this?

It is hard to tell from the strace what is going on.  Do you see
that the OSDs are booting in ceph.log (or ceph -w output)?  If not, I
would look at the osd daemon log for clues.  You may need to turn up
debugging to see (use 'ceph daemon osd.NNN config set debug_osd 20' to
adjust the level on the running daemon).
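
As a small illustration of the debug command, the Python sketch below builds the argument list for one or more OSD ids. The helper name is hypothetical, osd.22 is the OSD mentioned earlier in this thread, and the command has to be run on the host where that OSD's admin socket lives.

```python
def debug_osd_command(osd_id, level=20):
    """Build the argv that sets debug_osd on one running OSD through its admin socket."""
    return ["ceph", "daemon", "osd.%d" % osd_id,
            "config", "set", "debug_osd", str(level)]


# Print the command for each OSD of interest instead of running it.
for osd_id in (22,):
    print(" ".join(debug_osd_command(osd_id)))
```

Printing rather than executing keeps the sketch safe to run anywhere; the resulting argv can be passed to subprocess on the OSD host.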

If they are booting, it is mostly a matter of letting it recover and come 
up.  We have seen patterns where configuration or network issues have let 
the system bury itself under a series of osdmap updates.  If you see that 
in the log when you turn up debugging, or see the osds going up and down 
when you try to bring the cluster up, that could be what is going on.  A 
strategy that has worked there is to let all the osds catch up on their 
maps before trying to peer and join the cluster.  To do that, 'ceph osd 
set noup' (which prevents the osds from joining), wait for the ceph-osd 
processes to stop chewing on maps (watch the cpu utilization in top), and 
once they are all ready 'ceph osd unset noup' and let them join and peer 
all at once.
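
The noup strategy can be sketched in Python as follows. This is a rough outline under stated assumptions, not a tested recovery tool. The only ceph commands used are the two quoted above. The osds_idle callback is hypothetical and stands in for "watch the cpu utilization in top"; how idleness is actually measured is left to the operator.

```python
import subprocess
import time


def catch_up_then_join(run=subprocess.check_call,
                       osds_idle=lambda: True,
                       poll_seconds=10,
                       sleep=time.sleep):
    """Hold OSDs out with noup until they have caught up on maps, then release them."""
    # Prevent OSDs from being marked up while they process osdmaps.
    run(["ceph", "osd", "set", "noup"])
    # Wait until the (hypothetical) check reports the ceph-osd processes are quiet.
    while not osds_idle():
        sleep(poll_seconds)
    # Let all OSDs join and peer at once.
    run(["ceph", "osd", "unset", "noup"])
```

The run, osds_idle and sleep parameters are injectable so the flow can be exercised without a cluster. Note that both commands need a working monitor quorum to take effect.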

sage


* Re: Ceph cluster is unreachable because of authentication failure
       [not found]           ` <BLU0-SMTP169D80D759610D226681DD3DFB80@phx.gbl>
@ 2014-01-17 16:05             ` Sage Weil
       [not found]               ` <alpine.DEB.2.00.1401170804180.304-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
  0 siblings, 1 reply; 14+ messages in thread
From: Sage Weil @ 2014-01-17 16:05 UTC (permalink / raw)
  To: Guang; +Cc: ceph-users@lists.ceph.com, Ceph Development

On Fri, 17 Jan 2014, Guang wrote:
> Thanks Sage.
> 
> I have further narrowed the problem down: any command that uses the paxos service hangs. Details follow:
> 
> 1. I am able to run ceph status, osd dump, etc. However, the results are out of date (although I stopped all OSDs, this is not reflected in the ceph status report).
> 
> -bash-4.1$ sudo ceph -s
>   cluster b9cb3ea9-e1de-48b4-9e86-6921e2c537d2
>    health HEALTH_WARN 2797 pgs degraded; 107 pgs down; 7503 pgs peering; 917 pgs recovering; 6079 pgs recovery_wait; 2957 pgs stale; 7771 pgs stuck inactive; 2957 pgs stuck stale; 16567 pgs stuck unclean; recovery 54346804/779462977 degraded (6.972%); 9/259724199 unfound (0.000%); 2 near full osd(s); 57/751 in osds are down; noout,nobackfill,norecover,noscrub,nodeep-scrub flag(s) set
>    monmap e1: 3 mons at {osd151=10.194.0.68:6789/0,osd152=10.193.207.130:6789/0,osd153=10.193.207.131:6789/0}, election epoch 123278, quorum 0,1,2 osd151,osd152,osd153
>    osdmap e134893: 781 osds: 694 up, 751 in
>     pgmap v2388518: 22203 pgs: 26 inactive, 14 active, 79 stale+active+recovering, 5020 active+clean, 242 stale, 4352 active+recovery_wait, 616 stale+active+clean, 177 active+recovering+degraded, 6714 peering, 925 stale+active+recovery_wait, 86 down+peering, 1547 active+degraded, 32 stale+active+recovering+degraded, 648 stale+peering, 21 stale+down+peering, 239 stale+active+degraded, 651 active+recovery_wait+degraded, 30 remapped+peering, 151 stale+active+recovery_wait+degraded, 4 stale+remapped+peering, 629 active+recovering; 79656 GB data, 363 TB used, 697 TB / 1061 TB avail; 54346804/779462977 degraded (6.972%); 9/259724199 unfound (0.000%)
>    mdsmap e1: 0/0/1 up
> 
> 2. If I run a command that uses paxos, the command hangs forever. This includes ceph osd set noup, as well as the commands an OSD sends to the monitor when it starts (create-or-move).
> 
> I attached the corresponding monitor log (this looks like a bug).

I see the osd set command coming through, but it arrives while paxos is
converging, and the log seems to end before the mon would normally process
the delayed messages.  Is there a reason why the log fragment you attached
ends there, or did the process hang?
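
Because commands that go through paxos hang indefinitely in this state, it can help to wrap them in a timeout while collecting evidence. The Python 3 sketch below is a generic wrapper around subprocess.run and is not a ceph feature; the helper name is hypothetical.

```python
import subprocess


def run_with_timeout(argv, seconds):
    """Run argv; return (returncode, output), or None if it did not finish in time."""
    try:
        done = subprocess.run(argv,
                              stdout=subprocess.PIPE,
                              stderr=subprocess.STDOUT,
                              universal_newlines=True,
                              timeout=seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left hanging.
        return None
    return done.returncode, done.stdout
```

For example, run_with_timeout(["ceph", "osd", "set", "noup"], 30) would return None if the monitor never answers, which makes it easy to line the attempt up with the monitor log.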

Thanks-
sage

> I 
> 
> On Jan 17, 2014, at 1:35 AM, Sage Weil <sage@inktank.com> wrote:
> 
> > Hi Guang,
> > 
> > On Thu, 16 Jan 2014, Guang wrote:
> >> I still have bad the luck to figure out what is the problem making authentication failure, so in order to get the cluster back, I tried:
> >>  1. stop all daemons (mon & osd)
> >>  2. change the configuration to disable cephx
> >>  3. start mon daemons (3 in total)
> >>  4. start osd daemon one by one
> >> 
> >> After finishing step 3, the cluster can be reachable ('ceph -s' give results):
> >> -bash-4.1$ sudo ceph -s
> >>  cluster b9cb3ea9-e1de-48b4-9e86-6921e2c537d2
> >>   health HEALTH_WARN 2797 pgs degraded; 107 pgs down; 7503 pgs peering; 917 pgs recovering; 6079 pgs recovery_wait; 2957 pgs stale; 7771 pgs stuck inactive; 2957 pgs stuck stale; 16567 pgs stuck unclean; recovery 54346804/779462977 degraded (6.972%); 9/259724199 unfound (0.000%); 2 near full osd(s); 57/751 in osds are down; noout,nobackfill,norecover,noscrub,nodeep-scrub flag(s) set
> >>   monmap e1: 3 mons at {osd151=10.194.0.68:6789/0,osd152=10.193.207.130:6789/0,osd153=10.193.207.131:6789/0}, election epoch 106022, quorum 0,1,2 osd151,osd152,osd153
> >>   osdmap e134893: 781 osds: 694 up, 751 in
> >>    pgmap v2388518: 22203 pgs: 26 inactive, 14 active, 79 stale+active+recovering, 5020 active+clean, 242 stale, 4352 active+recovery_wait, 616 stale+active+clean, 177 active+recovering+degraded, 6714 peering, 925 stale+active+recovery_wait, 86 down+peering, 1547 active+degraded, 32 stale+active+recovering+degraded, 648 stale+peering, 21 stale+down+peering, 239 stale+active+degraded, 651 active+recovery_wait+degraded, 30 remapped+peering, 151 stale+active+recovery_wait+degraded, 4 stale+remapped+peering, 629 active+recovering; 79656 GB data, 363 TB used, 697 TB / 1061 TB avail; 54346804/779462977 degraded (6.972%); 9/259724199 unfound (0.000%)
> >>   mdsmap e1: 0/0/1 up
> >> (at this point, all OSDs should be down).
> >> 
> >> When I tried to start OSD daemon, the starting script got hang, and the process hang is:
> >> root      80497  80496  0 08:18 pts/0    00:00:00 python /usr/bin/ceph --name=osd.22 --keyring=/var/lib/ceph/osd/ceph-22/keyring osd crush create-or-move -- 22 0.40 root=default host=osd173
> >> 
> >> When I strace the starting script, I got the following traces (process 75873 is the above process), it failed with futex and then do a infinite loop:
> >>   select(0, NULL, NULL, NULL, {0, 16000}) = 0 (Timeout)
> >> Any idea what might trigger this?
> > 
> > It is hard to tell from the strace what is going on from this.  Do you see 
> > that the OSDs are booting in ceph.log (or ceph -w output)?  If not, I 
> > would look at the osd daemon log for clues.  You may need to turn up 
> > debugging to see (ceph daemon osd.NNN config set debug_osd 20 to adjust 
> > the level on the running daemon).
> > 
> > If they are booting, it is mostly a matter of letting it recover and come 
> > up.  We have seen patterns where configuration or network issues have let 
> > the system bury itself under a series of osdmap updates.  If you see that 
> > in the log when you turn up debugging, or see the osds going up and down 
> > when you try to bring the cluster up, that could be what is going on.  A 
> > strategy that has worked there is to let all the osds catch up on their 
> > maps before trying to peer and join the cluster.  To do that, 'ceph osd 
> > set noup' (which prevents the osds from joining), wait for the ceph-osd 
> > processes to stop chewing on maps (watch the cpu utilization in top), and 
> > once they are all ready 'ceph osd unset noup' and let them join and peer 
> > all at once.
> > 
> > sage
> > 
> >> 
> >> ======= STRACE (PARTIAL) ========== 
> >> [pid 75873] futex(0xf707a0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> >> [pid 75878] mmap(NULL, 134217728, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7f5da6529000
> >> [pid 75878] munmap(0x7f5da6529000, 28143616) = 0
> >> [pid 75878] munmap(0x7f5dac000000, 38965248) = 0
> >> [pid 75878] mprotect(0x7f5da8000000, 135168, PROT_READ|PROT_WRITE) = 0
> >> [pid 75878] futex(0xf707a0, FUTEX_WAKE_PRIVATE, 1) = 1
> >> [pid 75873] <... futex resumed> )       = 0
> >> [pid 75873] futex(0xdd3cb0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> >> [pid 75878] futex(0xdd3cb0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
> >> [pid 75873] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
> >> [pid 75878] <... futex resumed> )       = 0
> >> [pid 75873] select(0, NULL, NULL, NULL, {0, 1000} <unfinished ...>
> >> [pid 75878] rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0
> >> [pid 75878] mmap(NULL, 10489856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f5dadb28000
> >> [pid 75878] mprotect(0x7f5dadb28000, 4096, PROT_NONE) = 0
> >> [ omit some entries…]
> >> [pid 75873] select(0, NULL, NULL, NULL, {0, 16000}) = 0 (Timeout)
> >> [pid 75873] select(0, NULL, NULL, NULL, {0, 32000}) = 0 (Timeout)
> >> [pid 75873] select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
> >> [pid 75873] select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
> >> [pid 75873] select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
> >> [pid 75873] select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
> >> [pid 75873] select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
> >> 
> >> 
> >> Thanks,
> >> Guang
> >> 
> >> On Jan 15, 2014, at 5:54 AM, Guang <yguang11@outlook.com> wrote:
> >> 
> >>> Thanks Sage.
> >>> 
> >>> -bash-4.1$ sudo ceph --admin-daemon /var/run/ceph/ceph-mon.osd151.asok mon_status
> >>> { "name": "osd151",
> >>> "rank": 2,
> >>> "state": "electing",
> >>> "election_epoch": 85469,
> >>> "quorum": [],
> >>> "outside_quorum": [],
> >>> "extra_probe_peers": [],
> >>> "sync_provider": [],
> >>> "monmap": { "epoch": 1,
> >>>     "fsid": "b9cb3ea9-e1de-48b4-9e86-6921e2c537d2",
> >>>     "modified": "0.000000",
> >>>     "created": "0.000000",
> >>>     "mons": [
> >>>           { "rank": 0,
> >>>             "name": "osd152",
> >>>             "addr": "10.193.207.130:6789\/0"},
> >>>           { "rank": 1,
> >>>             "name": "osd153",
> >>>             "addr": "10.193.207.131:6789\/0"},
> >>>           { "rank": 2,
> >>>             "name": "osd151",
> >>>             "addr": "10.194.0.68:6789\/0"}]}}
> >>> 
> >>> And:
> >>> 
> >>> -bash-4.1$ sudo ceph --admin-daemon /var/run/ceph/ceph-mon.osd151.asok quorum_status
> >>> { "election_epoch": 85480,
> >>> "quorum": [
> >>>       0,
> >>>       1,
> >>>       2],
> >>> "quorum_names": [
> >>>       "osd151",
> >>>       "osd152",
> >>>       "osd153"],
> >>> "quorum_leader_name": "osd152",
> >>> "monmap": { "epoch": 1,
> >>>     "fsid": "b9cb3ea9-e1de-48b4-9e86-6921e2c537d2",
> >>>     "modified": "0.000000",
> >>>     "created": "0.000000",
> >>>     "mons": [
> >>>           { "rank": 0,
> >>>             "name": "osd152",
> >>>             "addr": "10.193.207.130:6789\/0"},
> >>>           { "rank": 1,
> >>>             "name": "osd153",
> >>>             "addr": "10.193.207.131:6789\/0"},
> >>>           { "rank": 2,
> >>>             "name": "osd151",
> >>>             "addr": "10.194.0.68:6789\/0"}]}}
> >>> 
> >>> 
> >>> The election has been finished with leader selected from the above status.
> >>> 
> >>> Thanks,
> >>> Guang
> >>> 
> >>> On Jan 14, 2014, at 10:55 PM, Sage Weil <sage@inktank.com> wrote:
> >>> 
> >>>> On Tue, 14 Jan 2014, GuangYang wrote:
> >>>>> Hi ceph-users and ceph-devel,
> >>>>> I came across an issue after restarting monitors of the cluster, that authentication fails which prevents running any ceph command.
> >>>>> 
> >>>>> After we did some maintenance work, I restart OSD, however, I found that the OSD would not join the cluster automatically after being restarted, though TCP dump showed it had already sent messenger to monitor telling add me into the cluster.
> >>>>> 
> >>>>> So that I suspected there might be some issues of monitor and I restarted monitor one by one (3 in total), however, after restarting monitors, all ceph command would fail saying authentication timeout…
> >>>>> 
> >>>>> 2014-01-14 12:00:30.499397 7fc7f195e700  0 monclient(hunting): authenticate timed out after 300
> >>>>> 2014-01-14 12:00:30.499440 7fc7f195e700  0 librados: client.admin authentication error (110) Connection timed out
> >>>>> Error connecting to cluster: Error
> >>>>> 
> >>>>> Any idea why such error happened (restarting OSD would result in the same error)?
> >>>>> 
> >>>>> I am thinking the authentication information is persisted in mon local disk and is there a chance those data got corrupted?
> >>>> 
> >>>> That sounds unlikely, but you're right that the core problem is with the 
> >>>> mons.  What does 
> >>>> 
> >>>> ceph daemon mon.`hostname` mon_status
> >>>> 
> >>>> say?  Perhaps they are not forming a quorum and that is what is preventing 
> >>>> authentication.
> >>>> 
> >>>> sage
> >>>> --
> >>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >>>> the body of a message to majordomo@vger.kernel.org
> >>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>> 
> >> 
> >> 
> > 
> 
> 
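
The two admin-socket outputs quoted above disagree: mon_status reports state "electing" with an empty quorum at election epoch 85469, while quorum_status at epoch 85480 lists ranks 0, 1 and 2. A minimal sketch of checking this programmatically follows. It assumes the JSON has the fields shown in the quoted output (name, rank, state, quorum); the function name summarize_mon_status and the abbreviated sample are illustrative only and not part of Ceph.

```python
import json

def summarize_mon_status(text):
    """Return (name, state, in_quorum) from `mon_status` JSON output."""
    status = json.loads(text)
    # A monitor is in quorum only if its own rank appears in the quorum list.
    in_quorum = status["rank"] in status["quorum"]
    return status["name"], status["state"], in_quorum

# Abbreviated copy of the mon_status output quoted in this thread.
sample = '{"name": "osd151", "rank": 2, "state": "electing", "election_epoch": 85469, "quorum": []}'
print(summarize_mon_status(sample))
```

Running this against each monitor a few times in a row would show whether the monitors settle into a quorum or keep re-entering elections.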

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Ceph cluster is unreachable because of authentication failure
       [not found] ` <BLU0-SMTP1356EEDE6A8ACF94947F22ADFBF0-MsuGFMq8XAE@public.gmane.org>
  2014-01-14 14:55   ` Sage Weil
@ 2014-01-18  8:55   ` Sherry Shahbazi
  1 sibling, 0 replies; 14+ messages in thread
From: Sherry Shahbazi @ 2014-01-18  8:55 UTC (permalink / raw)
  To: GuangYang, ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
  Cc: Ceph Development


[-- Attachment #1.1: Type: text/plain, Size: 1661 bytes --]

Hi Guang, 

Can you check the permissions of ceph.conf and ceph.client.admin.keyring? They should look like the following:

-rw-r--r-- 1 root root 719 Jan 17 17:34 ceph.conf
-rw-r--r-- 1 root root 64 Jan 17 11:58 ceph.client.admin.keyring
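
A small sketch of the same check in Python, in case it is easier to run across many hosts. It assumes the files live in /etc/ceph (the usual default, but not stated in this thread); the helper names ls_style_mode and world_readable are made up for illustration.

```python
import os
import stat

def ls_style_mode(path):
    """Return the permission string that `ls -l` shows, e.g. '-rw-r--r--'."""
    return stat.filemode(os.stat(path).st_mode)

def world_readable(path):
    """True if owner, group, and others all have read permission."""
    mode = os.stat(path).st_mode
    wanted = stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH
    return mode & wanted == wanted

if __name__ == "__main__":
    # Assumed default location; adjust if your cluster keeps these elsewhere.
    for name in ("ceph.conf", "ceph.client.admin.keyring"):
        path = os.path.join("/etc/ceph", name)
        if os.path.exists(path):
            print(path, ls_style_mode(path), world_readable(path))
```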

Regards
Sherry



On Wednesday, January 15, 2014 1:57 AM, GuangYang <yguang11-1ViLX0X+lBJBDgjK7y7TUQ@public.gmane.org> wrote:
 
Hi ceph-users and ceph-devel,
I came across an issue after restarting monitors of the cluster, that authentication fails which prevents running any ceph command.

After we did some maintenance work, I restart OSD, however, I found that the OSD would not join the cluster automatically after being restarted, though TCP dump showed it had already sent messenger to monitor telling add me into the cluster.

So that I suspected there might be some issues of monitor and I restarted monitor one by one (3 in total), however, after restarting monitors, all ceph command would fail saying authentication timeout…

2014-01-14 12:00:30.499397 7fc7f195e700  0 monclient(hunting): authenticate timed out after 300
2014-01-14 12:00:30.499440 7fc7f195e700  0 librados: client.admin authentication error (110) Connection timed out
Error connecting to cluster: Error

Any idea why such error happened (restarting OSD would result in the same error)?

I am thinking the authentication information is persisted in mon local disk and is there a chance those data got corrupted?

Thanks,
Guang
_______________________________________________
ceph-users mailing list
ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Ceph cluster is unreachable because of authentication failure
       [not found]               ` <alpine.DEB.2.00.1401170804180.304-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
@ 2014-01-19 13:21                 ` Guang
  2014-01-20 16:35                   ` Sage Weil
  0 siblings, 1 reply; 14+ messages in thread
From: Guang @ 2014-01-19 13:21 UTC (permalink / raw)
  To: Sage Weil
  Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org,
	Ceph Development

Thanks Sage.

I captured only part of the log because it was growing fast. The process did not hang, but I saw the same pattern repeating. Should I increase the log level and send the log over email? The issue reproduces consistently.

Thanks,
Guang

On Jan 18, 2014, at 12:05 AM, Sage Weil <sage-4GqslpFJ+cxBDgjK7y7TUQ@public.gmane.org> wrote:

> On Fri, 17 Jan 2014, Guang wrote:
>> Thanks Sage.
>> 
>> I further narrow down the problem to #any command using paxos service would hang#, following are details:
>> 
>> 1. I am able to run ceph status / osd dump, etc., however, the result are out of date (though I stopped all OSDs, it does not reflect in ceph status report).
>> 
>> -bash-4.1$ sudo ceph -s
>>  cluster b9cb3ea9-e1de-48b4-9e86-6921e2c537d2
>>   health HEALTH_WARN 2797 pgs degraded; 107 pgs down; 7503 pgs peering; 917 pgs recovering; 6079 pgs recovery_wait; 2957 pgs stale; 7771 pgs stuck inactive; 2957 pgs stuck stale; 16567 pgs stuck unclean; recovery 54346804/779462977 degraded (6.972%); 9/259724199 unfound (0.000%); 2 near full osd(s); 57/751 in osds are down; noout,nobackfill,norecover,noscrub,nodeep-scrub flag(s) set
>>   monmap e1: 3 mons at {osd151=10.194.0.68:6789/0,osd152=10.193.207.130:6789/0,osd153=10.193.207.131:6789/0}, election epoch 123278, quorum 0,1,2 osd151,osd152,osd153
>>   osdmap e134893: 781 osds: 694 up, 751 in
>>    pgmap v2388518: 22203 pgs: 26 inactive, 14 active, 79 stale+active+recovering, 5020 active+clean, 242 stale, 4352 active+recovery_wait, 616 stale+active+clean, 177 active+recovering+degraded, 6714 peering, 925 stale+active+recovery_wait, 86 down+peering, 1547 active+degraded, 32 stale+active+recovering+degraded, 648 stale+peering, 21 stale+down+peering, 239 stale+active+degraded, 651 active+recovery_wait+degraded, 30 remapped+peering, 151 stale+active+recovery_wait+degraded, 4 stale+remapped+peering, 629 active+recovering; 79656 GB data, 363 TB used, 697 TB / 1061 TB avail; 54346804/779462977 degraded (6.972%); 9/259724199 unfound (0.000%)
>>   mdsmap e1: 0/0/1 up
>> 
>> 2. If I run a command which uses paxos, the command will hang forever, this includes, ceph osd set noup (and also including those commands osd send to monitor when being started (create-or-add)).
>> 
>> I attached the corresponding monitor log (it is like a bug).
> 
> I see the osd set command coming through, but it arrives while paxos is 
> converging and the log seems to end before the mon would normally process 
> the delayed messages.  Is there a reason why the log fragment you attached 
> ends there, or did the process hang or something?
> 
> Thanks-
> sage
> 
>> I 
>> 
>> On Jan 17, 2014, at 1:35 AM, Sage Weil <sage-4GqslpFJ+cxBDgjK7y7TUQ@public.gmane.org> wrote:
>> 
>>> Hi Guang,
>>> 
>>> On Thu, 16 Jan 2014, Guang wrote:
>>>> I still have bad the luck to figure out what is the problem making authentication failure, so in order to get the cluster back, I tried:
>>>> 1. stop all daemons (mon & osd)
>>>> 2. change the configuration to disable cephx
>>>> 3. start mon daemons (3 in total)
>>>> 4. start osd daemon one by one
>>>> 
>>>> After finishing step 3, the cluster can be reachable ('ceph -s' give results):
>>>> -bash-4.1$ sudo ceph -s
>>>> cluster b9cb3ea9-e1de-48b4-9e86-6921e2c537d2
>>>>  health HEALTH_WARN 2797 pgs degraded; 107 pgs down; 7503 pgs peering; 917 pgs recovering; 6079 pgs recovery_wait; 2957 pgs stale; 7771 pgs stuck inactive; 2957 pgs stuck stale; 16567 pgs stuck unclean; recovery 54346804/779462977 degraded (6.972%); 9/259724199 unfound (0.000%); 2 near full osd(s); 57/751 in osds are down; noout,nobackfill,norecover,noscrub,nodeep-scrub flag(s) set
>>>>  monmap e1: 3 mons at {osd151=10.194.0.68:6789/0,osd152=10.193.207.130:6789/0,osd153=10.193.207.131:6789/0}, election epoch 106022, quorum 0,1,2 osd151,osd152,osd153
>>>>  osdmap e134893: 781 osds: 694 up, 751 in
>>>>   pgmap v2388518: 22203 pgs: 26 inactive, 14 active, 79 stale+active+recovering, 5020 active+clean, 242 stale, 4352 active+recovery_wait, 616 stale+active+clean, 177 active+recovering+degraded, 6714 peering, 925 stale+active+recovery_wait, 86 down+peering, 1547 active+degraded, 32 stale+active+recovering+degraded, 648 stale+peering, 21 stale+down+peering, 239 stale+active+degraded, 651 active+recovery_wait+degraded, 30 remapped+peering, 151 stale+active+recovery_wait+degraded, 4 stale+remapped+peering, 629 active+recovering; 79656 GB data, 363 TB used, 697 TB / 1061 TB avail; 54346804/779462977 degraded (6.972%); 9/259724199 unfound (0.000%)
>>>>  mdsmap e1: 0/0/1 up
>>>> (at this point, all OSDs should be down).
>>>> 
>>>> When I tried to start OSD daemon, the starting script got hang, and the process hang is:
>>>> root      80497  80496  0 08:18 pts/0    00:00:00 python /usr/bin/ceph --name=osd.22 --keyring=/var/lib/ceph/osd/ceph-22/keyring osd crush create-or-move -- 22 0.40 root=default host=osd173
>>>> 
>>>> When I strace the starting script, I got the following traces (process 75873 is the above process), it failed with futex and then do a infinite loop:
>>>>  select(0, NULL, NULL, NULL, {0, 16000}) = 0 (Timeout)
>>>> Any idea what might trigger this?
>>> 
>>> It is hard to tell from the strace what is going on from this.  Do you see 
>>> that the OSDs are booting in ceph.log (or ceph -w output)?  If not, I 
>>> would look at the osd daemon log for clues.  You may need to turn up 
>>> debugging to see (ceph daemon osd.NNN config set debug_osd 20 to adjust 
>>> the level on the running daemon).
>>> 
>>> If they are booting, it is mostly a matter of letting it recover and come 
>>> up.  We have seen patterns where configuration or network issues have let 
>>> the system bury itself under a series of osdmap updates.  If you see that 
>>> in the log when you turn up debugging, or see the osds going up and down 
>>> when you try to bring the cluster up, that could be what is going on.  A 
>>> strategy that has worked there is to let all the osds catch up on their 
>>> maps before trying to peer and join the cluster.  To do that, 'ceph osd 
>>> set noup' (which prevents the osds from joining), wait for the ceph-osd 
>>> processes to stop chewing on maps (watch the cpu utilization in top), and 
>>> once they are all ready 'ceph osd unset noup' and let them join and peer 
>>> all at once.
>>> 
>>> sage
>>> 
>>>> 
>>>> ======= STRACE (PARTIAL) ========== 
>>>> [pid 75873] futex(0xf707a0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
>>>> [pid 75878] mmap(NULL, 134217728, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7f5da6529000
>>>> [pid 75878] munmap(0x7f5da6529000, 28143616) = 0
>>>> [pid 75878] munmap(0x7f5dac000000, 38965248) = 0
>>>> [pid 75878] mprotect(0x7f5da8000000, 135168, PROT_READ|PROT_WRITE) = 0
>>>> [pid 75878] futex(0xf707a0, FUTEX_WAKE_PRIVATE, 1) = 1
>>>> [pid 75873] <... futex resumed> )       = 0
>>>> [pid 75873] futex(0xdd3cb0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
>>>> [pid 75878] futex(0xdd3cb0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
>>>> [pid 75873] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
>>>> [pid 75878] <... futex resumed> )       = 0
>>>> [pid 75873] select(0, NULL, NULL, NULL, {0, 1000} <unfinished ...>
>>>> [pid 75878] rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0
>>>> [pid 75878] mmap(NULL, 10489856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f5dadb28000
>>>> [pid 75878] mprotect(0x7f5dadb28000, 4096, PROT_NONE) = 0
>>>> [ omit some entries…]
>>>> [pid 75873] select(0, NULL, NULL, NULL, {0, 16000}) = 0 (Timeout)
>>>> [pid 75873] select(0, NULL, NULL, NULL, {0, 32000}) = 0 (Timeout)
>>>> [pid 75873] select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
>>>> [pid 75873] select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
>>>> [pid 75873] select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
>>>> [pid 75873] select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
>>>> [pid 75873] select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
>>>> 
>>>> 
>>>> Thanks,
>>>> Guang
>>>> 
>>>> On Jan 15, 2014, at 5:54 AM, Guang <yguang11-1ViLX0X+lBJBDgjK7y7TUQ@public.gmane.org> wrote:
>>>> 
>>>>> Thanks Sage.
>>>>> 
>>>>> -bash-4.1$ sudo ceph --admin-daemon /var/run/ceph/ceph-mon.osd151.asok mon_status
>>>>> { "name": "osd151",
>>>>> "rank": 2,
>>>>> "state": "electing",
>>>>> "election_epoch": 85469,
>>>>> "quorum": [],
>>>>> "outside_quorum": [],
>>>>> "extra_probe_peers": [],
>>>>> "sync_provider": [],
>>>>> "monmap": { "epoch": 1,
>>>>>    "fsid": "b9cb3ea9-e1de-48b4-9e86-6921e2c537d2",
>>>>>    "modified": "0.000000",
>>>>>    "created": "0.000000",
>>>>>    "mons": [
>>>>>          { "rank": 0,
>>>>>            "name": "osd152",
>>>>>            "addr": "10.193.207.130:6789\/0"},
>>>>>          { "rank": 1,
>>>>>            "name": "osd153",
>>>>>            "addr": "10.193.207.131:6789\/0"},
>>>>>          { "rank": 2,
>>>>>            "name": "osd151",
>>>>>            "addr": "10.194.0.68:6789\/0"}]}}
>>>>> 
>>>>> And:
>>>>> 
>>>>> -bash-4.1$ sudo ceph --admin-daemon /var/run/ceph/ceph-mon.osd151.asok quorum_status
>>>>> { "election_epoch": 85480,
>>>>> "quorum": [
>>>>>      0,
>>>>>      1,
>>>>>      2],
>>>>> "quorum_names": [
>>>>>      "osd151",
>>>>>      "osd152",
>>>>>      "osd153"],
>>>>> "quorum_leader_name": "osd152",
>>>>> "monmap": { "epoch": 1,
>>>>>    "fsid": "b9cb3ea9-e1de-48b4-9e86-6921e2c537d2",
>>>>>    "modified": "0.000000",
>>>>>    "created": "0.000000",
>>>>>    "mons": [
>>>>>          { "rank": 0,
>>>>>            "name": "osd152",
>>>>>            "addr": "10.193.207.130:6789\/0"},
>>>>>          { "rank": 1,
>>>>>            "name": "osd153",
>>>>>            "addr": "10.193.207.131:6789\/0"},
>>>>>          { "rank": 2,
>>>>>            "name": "osd151",
>>>>>            "addr": "10.194.0.68:6789\/0"}]}}
>>>>> 
>>>>> 
>>>>> The election has been finished with leader selected from the above status.
>>>>> 
>>>>> Thanks,
>>>>> Guang
>>>>> 
>>>>> On Jan 14, 2014, at 10:55 PM, Sage Weil <sage-4GqslpFJ+cxBDgjK7y7TUQ@public.gmane.org> wrote:
>>>>> 
>>>>>> On Tue, 14 Jan 2014, GuangYang wrote:
>>>>>>> Hi ceph-users and ceph-devel,
>>>>>>> I came across an issue after restarting monitors of the cluster, that authentication fails which prevents running any ceph command.
>>>>>>> 
>>>>>>> After we did some maintenance work, I restart OSD, however, I found that the OSD would not join the cluster automatically after being restarted, though TCP dump showed it had already sent messenger to monitor telling add me into the cluster.
>>>>>>> 
>>>>>>> So that I suspected there might be some issues of monitor and I restarted monitor one by one (3 in total), however, after restarting monitors, all ceph command would fail saying authentication timeout…
>>>>>>> 
>>>>>>> 2014-01-14 12:00:30.499397 7fc7f195e700  0 monclient(hunting): authenticate timed out after 300
>>>>>>> 2014-01-14 12:00:30.499440 7fc7f195e700  0 librados: client.admin authentication error (110) Connection timed out
>>>>>>> Error connecting to cluster: Error
>>>>>>> 
>>>>>>> Any idea why such error happened (restarting OSD would result in the same error)?
>>>>>>> 
>>>>>>> I am thinking the authentication information is persisted in mon local disk and is there a chance those data got corrupted?
>>>>>> 
>>>>>> That sounds unlikely, but you're right that the core problem is with the 
>>>>>> mons.  What does 
>>>>>> 
>>>>>> ceph daemon mon.`hostname` mon_status
>>>>>> 
>>>>>> say?  Perhaps they are not forming a quorum and that is what is preventing 
>>>>>> authentication.
>>>>>> 
>>>>>> sage
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>> 
>>>> 
>>>> 
>>> 
>> 
>> 
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Ceph cluster is unreachable because of authentication failure
  2014-01-19 13:21                 ` Guang
@ 2014-01-20 16:35                   ` Sage Weil
       [not found]                     ` <alpine.DEB.2.00.1401200835050.2149-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
  0 siblings, 1 reply; 14+ messages in thread
From: Sage Weil @ 2014-01-20 16:35 UTC (permalink / raw)
  To: Guang; +Cc: ceph-users@lists.ceph.com, Ceph Development

On Sun, 19 Jan 2014, Guang wrote:
> Thanks Sage.
> 
> I just captured part of the log (it was fast growing), the process did 
> not hang but I saw the same pattern repeatedly. Should I increase the 
> log level and send over email (it constantly reproduced)?

Sure!  A representative fragment of the repeating pattern should be 
enough.

s
> 
> Thanks,
> Guang
> 
> On Jan 18, 2014, at 12:05 AM, Sage Weil <sage@inktank.com> wrote:
> 
> > On Fri, 17 Jan 2014, Guang wrote:
> >> Thanks Sage.
> >> 
> >> I further narrow down the problem to #any command using paxos service would hang#, following are details:
> >> 
> >> 1. I am able to run ceph status / osd dump, etc., however, the result are out of date (though I stopped all OSDs, it does not reflect in ceph status report).
> >> 
> >> -bash-4.1$ sudo ceph -s
> >>  cluster b9cb3ea9-e1de-48b4-9e86-6921e2c537d2
> >>   health HEALTH_WARN 2797 pgs degraded; 107 pgs down; 7503 pgs peering; 917 pgs recovering; 6079 pgs recovery_wait; 2957 pgs stale; 7771 pgs stuck inactive; 2957 pgs stuck stale; 16567 pgs stuck unclean; recovery 54346804/779462977 degraded (6.972%); 9/259724199 unfound (0.000%); 2 near full osd(s); 57/751 in osds are down; noout,nobackfill,norecover,noscrub,nodeep-scrub flag(s) set
> >>   monmap e1: 3 mons at {osd151=10.194.0.68:6789/0,osd152=10.193.207.130:6789/0,osd153=10.193.207.131:6789/0}, election epoch 123278, quorum 0,1,2 osd151,osd152,osd153
> >>   osdmap e134893: 781 osds: 694 up, 751 in
> >>    pgmap v2388518: 22203 pgs: 26 inactive, 14 active, 79 stale+active+recovering, 5020 active+clean, 242 stale, 4352 active+recovery_wait, 616 stale+active+clean, 177 active+recovering+degraded, 6714 peering, 925 stale+active+recovery_wait, 86 down+peering, 1547 active+degraded, 32 stale+active+recovering+degraded, 648 stale+peering, 21 stale+down+peering, 239 stale+active+degraded, 651 active+recovery_wait+degraded, 30 remapped+peering, 151 stale+active+recovery_wait+degraded, 4 stale+remapped+peering, 629 active+recovering; 79656 GB data, 363 TB used, 697 TB / 1061 TB avail; 54346804/779462977 degraded (6.972%); 9/259724199 unfound (0.000%)
> >>   mdsmap e1: 0/0/1 up
> >> 
> >> 2. If I run a command which uses paxos, the command will hang forever, this includes, ceph osd set noup (and also including those commands osd send to monitor when being started (create-or-add)).
> >> 
> >> I attached the corresponding monitor log (it is like a bug).
> > 
> > I see the osd set command coming through, but it arrives while paxos is 
> > converging and the log seems to end before the mon would normally process 
> > the delayed messages.  Is there a reason why the log fragment you attached 
> > ends there, or did the process hang or something?
> > 
> > Thanks-
> > sage
> > 
> >> I 
> >> 
> >> On Jan 17, 2014, at 1:35 AM, Sage Weil <sage@inktank.com> wrote:
> >> 
> >>> Hi Guang,
> >>> 
> >>> On Thu, 16 Jan 2014, Guang wrote:
> >>>> I still have bad the luck to figure out what is the problem making authentication failure, so in order to get the cluster back, I tried:
> >>>> 1. stop all daemons (mon & osd)
> >>>> 2. change the configuration to disable cephx
> >>>> 3. start mon daemons (3 in total)
> >>>> 4. start osd daemon one by one
> >>>> 
> >>>> After finishing step 3, the cluster can be reachable ('ceph -s' give results):
> >>>> -bash-4.1$ sudo ceph -s
> >>>> cluster b9cb3ea9-e1de-48b4-9e86-6921e2c537d2
> >>>>  health HEALTH_WARN 2797 pgs degraded; 107 pgs down; 7503 pgs peering; 917 pgs recovering; 6079 pgs recovery_wait; 2957 pgs stale; 7771 pgs stuck inactive; 2957 pgs stuck stale; 16567 pgs stuck unclean; recovery 54346804/779462977 degraded (6.972%); 9/259724199 unfound (0.000%); 2 near full osd(s); 57/751 in osds are down; noout,nobackfill,norecover,noscrub,nodeep-scrub flag(s) set
> >>>>  monmap e1: 3 mons at {osd151=10.194.0.68:6789/0,osd152=10.193.207.130:6789/0,osd153=10.193.207.131:6789/0}, election epoch 106022, quorum 0,1,2 osd151,osd152,osd153
> >>>>  osdmap e134893: 781 osds: 694 up, 751 in
> >>>>   pgmap v2388518: 22203 pgs: 26 inactive, 14 active, 79 stale+active+recovering, 5020 active+clean, 242 stale, 4352 active+recovery_wait, 616 stale+active+clean, 177 active+recovering+degraded, 6714 peering, 925 stale+active+recovery_wait, 86 down+peering, 1547 active+degraded, 32 stale+active+recovering+degraded, 648 stale+peering, 21 stale+down+peering, 239 stale+active+degraded, 651 active+recovery_wait+degraded, 30 remapped+peering, 151 stale+active+recovery_wait+degraded, 4 stale+remapped+peering, 629 active+recovering; 79656 GB data, 363 TB used, 697 TB / 1061 TB avail; 54346804/779462977 degraded (6.972%); 9/259724199 unfound (0.000%)
> >>>>  mdsmap e1: 0/0/1 up
> >>>> (at this point, all OSDs should be down).
> >>>> 
> >>>> When I tried to start OSD daemon, the starting script got hang, and the process hang is:
> >>>> root      80497  80496  0 08:18 pts/0    00:00:00 python /usr/bin/ceph --name=osd.22 --keyring=/var/lib/ceph/osd/ceph-22/keyring osd crush create-or-move -- 22 0.40 root=default host=osd173
> >>>> 
> >>>> When I strace the starting script, I got the following traces (process 75873 is the above process), it failed with futex and then do a infinite loop:
> >>>>  select(0, NULL, NULL, NULL, {0, 16000}) = 0 (Timeout)
> >>>> Any idea what might trigger this?
> >>> 
> >>> It is hard to tell from the strace what is going on from this.  Do you see 
> >>> that the OSDs are booting in ceph.log (or ceph -w output)?  If not, I 
> >>> would look at the osd daemon log for clues.  You may need to turn up 
> >>> debugging to see (ceph daemon osd.NNN config set debug_osd 20 to adjust 
> >>> the level on the running daemon).
> >>> 
> >>> If they are booting, it is mostly a matter of letting it recover and come 
> >>> up.  We have seen patterns where configuration or network issues have let 
> >>> the system bury itself under a series of osdmap updates.  If you see that 
> >>> in the log when you turn up debugging, or see the osds going up and down 
> >>> when you try to bring the cluster up, that could be what is going on.  A 
> >>> strategy that has worked there is to let all the osds catch up on their 
> >>> maps before trying to peer and join the cluster.  To do that, 'ceph osd 
> >>> set noup' (which prevents the osds from joining), wait for the ceph-osd 
> >>> processes to stop chewing on maps (watch the cpu utilization in top), and 
> >>> once they are all ready 'ceph osd unset noup' and let them join and peer 
> >>> all at once.
> >>> 
> >>> sage
> >>> 
> >>>> 
> >>>> ======= STRACE (PARTIAL) ========== 
> >>>> [pid 75873] futex(0xf707a0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> >>>> [pid 75878] mmap(NULL, 134217728, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7f5da6529000
> >>>> [pid 75878] munmap(0x7f5da6529000, 28143616) = 0
> >>>> [pid 75878] munmap(0x7f5dac000000, 38965248) = 0
> >>>> [pid 75878] mprotect(0x7f5da8000000, 135168, PROT_READ|PROT_WRITE) = 0
> >>>> [pid 75878] futex(0xf707a0, FUTEX_WAKE_PRIVATE, 1) = 1
> >>>> [pid 75873] <... futex resumed> )       = 0
> >>>> [pid 75873] futex(0xdd3cb0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> >>>> [pid 75878] futex(0xdd3cb0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
> >>>> [pid 75873] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
> >>>> [pid 75878] <... futex resumed> )       = 0
> >>>> [pid 75873] select(0, NULL, NULL, NULL, {0, 1000} <unfinished ...>
> >>>> [pid 75878] rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0
> >>>> [pid 75878] mmap(NULL, 10489856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f5dadb28000
> >>>> [pid 75878] mprotect(0x7f5dadb28000, 4096, PROT_NONE) = 0
> >>>> [ omit some entries…]
> >>>> [pid 75873] select(0, NULL, NULL, NULL, {0, 16000}) = 0 (Timeout)
> >>>> [pid 75873] select(0, NULL, NULL, NULL, {0, 32000}) = 0 (Timeout)
> >>>> [pid 75873] select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
> >>>> [pid 75873] select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
> >>>> [pid 75873] select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
> >>>> [pid 75873] select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
> >>>> [pid 75873] select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
> >>>> 
> >>>> 
> >>>> Thanks,
> >>>> Guang
> >>>> 
> >>>> On Jan 15, 2014, at 5:54 AM, Guang <yguang11@outlook.com> wrote:
> >>>> 
> >>>>> Thanks Sage.
> >>>>> 
> >>>>> -bash-4.1$ sudo ceph --admin-daemon /var/run/ceph/ceph-mon.osd151.asok mon_status
> >>>>> { "name": "osd151",
> >>>>> "rank": 2,
> >>>>> "state": "electing",
> >>>>> "election_epoch": 85469,
> >>>>> "quorum": [],
> >>>>> "outside_quorum": [],
> >>>>> "extra_probe_peers": [],
> >>>>> "sync_provider": [],
> >>>>> "monmap": { "epoch": 1,
> >>>>>    "fsid": "b9cb3ea9-e1de-48b4-9e86-6921e2c537d2",
> >>>>>    "modified": "0.000000",
> >>>>>    "created": "0.000000",
> >>>>>    "mons": [
> >>>>>          { "rank": 0,
> >>>>>            "name": "osd152",
> >>>>>            "addr": "10.193.207.130:6789\/0"},
> >>>>>          { "rank": 1,
> >>>>>            "name": "osd153",
> >>>>>            "addr": "10.193.207.131:6789\/0"},
> >>>>>          { "rank": 2,
> >>>>>            "name": "osd151",
> >>>>>            "addr": "10.194.0.68:6789\/0"}]}}
> >>>>> 
> >>>>> And:
> >>>>> 
> >>>>> -bash-4.1$ sudo ceph --admin-daemon /var/run/ceph/ceph-mon.osd151.asok quorum_status
> >>>>> { "election_epoch": 85480,
> >>>>> "quorum": [
> >>>>>      0,
> >>>>>      1,
> >>>>>      2],
> >>>>> "quorum_names": [
> >>>>>      "osd151",
> >>>>>      "osd152",
> >>>>>      "osd153"],
> >>>>> "quorum_leader_name": "osd152",
> >>>>> "monmap": { "epoch": 1,
> >>>>>    "fsid": "b9cb3ea9-e1de-48b4-9e86-6921e2c537d2",
> >>>>>    "modified": "0.000000",
> >>>>>    "created": "0.000000",
> >>>>>    "mons": [
> >>>>>          { "rank": 0,
> >>>>>            "name": "osd152",
> >>>>>            "addr": "10.193.207.130:6789\/0"},
> >>>>>          { "rank": 1,
> >>>>>            "name": "osd153",
> >>>>>            "addr": "10.193.207.131:6789\/0"},
> >>>>>          { "rank": 2,
> >>>>>            "name": "osd151",
> >>>>>            "addr": "10.194.0.68:6789\/0"}]}}
> >>>>> 
> >>>>> 
> >>>>> The election has been finished with leader selected from the above status.
> >>>>> 
> >>>>> Thanks,
> >>>>> Guang
> >>>>> 
> >>>>> On Jan 14, 2014, at 10:55 PM, Sage Weil <sage@inktank.com> wrote:
> >>>>> 
> >>>>>> On Tue, 14 Jan 2014, GuangYang wrote:
> >>>>>>> Hi ceph-users and ceph-devel,
> >>>>>>> I came across an issue after restarting monitors of the cluster, that authentication fails which prevents running any ceph command.
> >>>>>>> 
> >>>>>>> After we did some maintenance work, I restart OSD, however, I found that the OSD would not join the cluster automatically after being restarted, though TCP dump showed it had already sent messenger to monitor telling add me into the cluster.
> >>>>>>> 
> >>>>>>> So that I suspected there might be some issues of monitor and I restarted monitor one by one (3 in total), however, after restarting monitors, all ceph command would fail saying authentication timeout…
> >>>>>>> 
> >>>>>>> 2014-01-14 12:00:30.499397 7fc7f195e700  0 monclient(hunting): authenticate timed out after 300
> >>>>>>> 2014-01-14 12:00:30.499440 7fc7f195e700  0 librados: client.admin authentication error (110) Connection timed out
> >>>>>>> Error connecting to cluster: Error
> >>>>>>> 
> >>>>>>> Any idea why such error happened (restarting OSD would result in the same error)?
> >>>>>>> 
> >>>>>>> I am thinking the authentication information is persisted in mon local disk and is there a chance those data got corrupted?
> >>>>>> 
> >>>>>> That sounds unlikely, but you're right that the core problem is with the 
> >>>>>> mons.  What does 
> >>>>>> 
> >>>>>> ceph daemon mon.`hostname` mon_status
> >>>>>> 
> >>>>>> say?  Perhaps they are not forming a quorum and that is what is preventing 
> >>>>>> authentication.
> >>>>>> 
> >>>>>> sage
> >>>>>> --
> >>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >>>>>> the body of a message to majordomo@vger.kernel.org
> >>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Ceph cluster is unreachable because of authentication failure
       [not found]                     ` <alpine.DEB.2.00.1401200835050.2149-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
@ 2014-01-22 11:34                       ` Guang
  2014-01-22 13:14                         ` [ceph-users] " Joao Eduardo Luis
  0 siblings, 1 reply; 14+ messages in thread
From: Guang @ 2014-01-22 11:34 UTC (permalink / raw)
  To: Sage Weil
  Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org,
	Ceph Development

[-- Attachment #1: Type: text/plain, Size: 280 bytes --]

Thanks Sage.

With debug_mon and debug_paxos set to 20, the log file grows too fast, so I set the log level to 10 and then: 1) ran the 'ceph osd set noin' command, 2) grepped the log for the keyword 'noin'. The monitor log is attached. Please help check it. Thanks very much!
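
A rough way to see how often the monitor changes state in a grepped log like the attached one is to tally the state printed in parentheses after the monitor name, e.g. "mon.osd152@0(electing)". The sketch below is only an illustration: the function name summarize_states and the regular expression are assumptions based on the line layout visible in this log, not part of any Ceph tool.

```python
import re
from collections import Counter

# Assumed line layout (taken from the attached log):
#   <date> <time> <thread id> <level> mon.<name>@<rank>(<state>) ...
# Messenger lines ("... 1 -- 10.193.207.130:6789/0 <== ...") do not match
# and are skipped.
STATE_RE = re.compile(r"^(\S+ \S+) \S+\s+\d+ mon\.(\S+)@(\d+)\((\w+)\)")


def summarize_states(lines):
    """Return (Counter of states, list of (timestamp, old, new) transitions)."""
    counts = Counter()
    transitions = []
    previous = None
    for line in lines:
        match = STATE_RE.match(line)
        if not match:
            continue
        timestamp, _name, _rank, state = match.groups()
        counts[state] += 1
        # Record a transition whenever the reported state differs from
        # the state on the previous matching line.
        if previous is not None and state != previous:
            transitions.append((timestamp, previous, state))
        previous = state
    return counts, transitions


# Three handle_command lines copied from the attached log.
SAMPLE = [
    '2014-01-21 12:24:03.851405 7fa5fbe21700  0 mon.osd152@0(electing) e1 '
    'handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1',
    '2014-01-21 12:24:07.515578 7fa5fbe21700  0 mon.osd152@0(leader) e1 '
    'handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1',
    '2014-01-21 12:24:53.029525 7fa5fbe21700  0 mon.osd152@0(electing) e1 '
    'handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1',
]

if __name__ == "__main__":
    counts, transitions = summarize_states(SAMPLE)
    print(dict(counts))
    for timestamp, old, new in transitions:
        print(timestamp, old, "->", new)
```

To run it over the whole attachment, pass an open file instead of SAMPLE, for example summarize_states(open("mon_osd_set_noin_hang.txt")).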

[-- Attachment #2: mon_osd_set_noin_hang.txt --]
[-- Type: text/plain, Size: 89210 bytes --]

2014-01-21 12:24:03.851372 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 273067662 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x2491b200 con 0x301c8e0
2014-01-21 12:24:03.851405 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:24:03.851411 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:24:07.515533 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 273091844 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x249cb980 con 0x301c8e0
2014-01-21 12:24:07.515578 7fa5fbe21700  0 mon.osd152@0(leader) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:24:07.515588 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:24:27.198169 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:24:28.412895 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:24:48.815132 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:24:53.029481 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 273116132 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x17993980 con 0x301c8e0
2014-01-21 12:24:53.029525 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:24:53.029534 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:25:16.754558 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:25:18.215278 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 273140317 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x1f178780 con 0x301c8e0
2014-01-21 12:25:18.215327 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:25:18.215336 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:25:39.689524 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:25:43.915315 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 273164608 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x21cf2f80 con 0x301c8e0
2014-01-21 12:25:43.915344 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:25:43.915350 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:25:48.113328 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 273189107 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0xba25500 con 0x301c8e0
2014-01-21 12:25:48.113357 7fa5fbe21700  0 mon.osd152@0(leader) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:25:48.113362 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:26:08.210631 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:26:09.474338 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:26:30.314818 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:26:34.724654 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 273213195 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x13a52080 con 0x301c8e0
2014-01-21 12:26:34.724683 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:26:34.724689 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:26:39.241503 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 273237403 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x2a7beb80 con 0x301c8e0
2014-01-21 12:26:39.241533 7fa5fbe21700  0 mon.osd152@0(leader) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:26:39.241539 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:26:57.721786 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:26:58.892716 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:27:20.739196 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:27:25.405511 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 273261497 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0xe794600 con 0x301c8e0
2014-01-21 12:27:25.405539 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:27:25.405545 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:27:48.802763 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:27:50.017238 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 273286009 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x27a64d80 con 0x301c8e0
2014-01-21 12:27:50.017268 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:27:50.017274 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:28:11.326143 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:28:15.486421 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 273310107 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x27cfee00 con 0x301c8e0
2014-01-21 12:28:15.486466 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:28:15.486475 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:28:37.664040 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:28:38.609918 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 273334208 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x23e50500 con 0x301c8e0
2014-01-21 12:28:38.609964 7fa5fbe21700  0 mon.osd152@0(leader) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:28:38.609977 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:28:58.422161 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:29:02.575579 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 273358417 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x1b020c80 con 0x301c8e0
2014-01-21 12:29:02.575609 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:29:02.575616 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:29:07.022398 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 273383143 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x1b021400 con 0x301c8e0
2014-01-21 12:29:07.022428 7fa5fbe21700  0 mon.osd152@0(leader) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:29:07.022434 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:29:24.110039 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:29:25.275149 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:29:26.221565 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:29:46.143811 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:29:50.003732 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 273407220 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x23979e00 con 0x301c8e0
2014-01-21 12:29:50.003759 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:29:50.003765 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:29:54.657933 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 273431666 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x27755000 con 0x301c8e0
2014-01-21 12:29:54.657962 7fa5fbe21700  0 mon.osd152@0(leader) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:29:54.657968 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:30:13.984553 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:30:15.203360 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:30:36.490604 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:30:40.233850 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 273455775 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x1a6a9180 con 0x301c8e0
2014-01-21 12:30:40.233895 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:30:40.233905 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:30:45.070754 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 273479992 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x1dbebe80 con 0x301c8e0
2014-01-21 12:30:45.070799 7fa5fbe21700  0 mon.osd152@0(leader) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:30:45.070808 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:31:04.239980 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:31:05.216624 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:31:25.616910 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:31:29.304254 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 273504726 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x739b200 con 0x301c8e0
2014-01-21 12:31:29.304282 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:31:29.304287 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:31:34.011713 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 273528838 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x2c781900 con 0x301c8e0
2014-01-21 12:31:34.011742 7fa5fbe21700  0 mon.osd152@0(leader) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:31:34.011748 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:31:52.517705 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:31:53.467222 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:32:18.857804 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:32:23.182206 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 273553344 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x1f92b700 con 0x301c8e0
2014-01-21 12:32:23.182235 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:32:23.182241 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:32:46.693447 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:32:47.671134 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 273577801 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x274a5f00 con 0x301c8e0
2014-01-21 12:32:47.671186 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:32:47.671195 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:33:08.137009 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:33:12.924730 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 273601982 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0xdff6680 con 0x301c8e0
2014-01-21 12:33:12.924758 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:33:12.924763 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:33:16.838721 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 273626561 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x1d079900 con 0x301c8e0
2014-01-21 12:33:16.838749 7fa5fbe21700  0 mon.osd152@0(leader) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:33:16.838754 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:33:35.833743 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:33:37.051731 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:33:57.556307 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:34:01.891963 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 273650685 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x2ad5d280 con 0x301c8e0
2014-01-21 12:34:01.891993 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:34:01.891999 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:34:05.449695 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 273675120 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x2c634600 con 0x301c8e0
2014-01-21 12:34:05.449723 7fa5fbe21700  0 mon.osd152@0(leader) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:34:05.449728 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:34:25.386430 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:34:26.664077 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:34:39.293399 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 273699199 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x2734d780 con 0x301c8e0
2014-01-21 12:34:39.293428 7fa5fbe21700  0 mon.osd152@0(leader) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:34:39.293434 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:34:56.764995 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:34:57.929211 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:35:20.750150 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:35:25.637025 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 273723382 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x17f1d500 con 0x301c8e0
2014-01-21 12:35:25.637053 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:35:25.637059 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:35:48.284640 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:35:49.529835 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 273747931 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x20125a00 con 0x301c8e0
2014-01-21 12:35:49.529865 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:35:49.529871 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:36:11.182698 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:36:15.910868 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 273772064 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x141c6400 con 0x301c8e0
2014-01-21 12:36:15.910894 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:36:15.910900 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:36:19.889472 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 273796304 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0xb425780 con 0x301c8e0
2014-01-21 12:36:19.889502 7fa5fbe21700  0 mon.osd152@0(leader) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:36:19.889508 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:36:42.184570 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:36:42.632321 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:36:46.834339 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 273820846 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x23e0df00 con 0x301c8e0
2014-01-21 12:36:46.834368 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:36:46.834374 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:36:51.260742 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 273844985 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x1ebbd500 con 0x301c8e0
2014-01-21 12:36:51.260770 7fa5fbe21700  0 mon.osd152@0(leader) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:36:51.260776 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:37:10.540199 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:37:11.907893 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:37:12.528533 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:37:16.599247 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 273869117 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x10f56180 con 0x301c8e0
2014-01-21 12:37:16.599275 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:37:16.599281 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:37:37.940171 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:37:39.336286 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:37:40.310657 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 273893470 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x23e03480 con 0x301c8e0
2014-01-21 12:37:40.310739 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:37:40.310748 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:38:04.774140 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:38:06.103501 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 273918065 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x27b96400 con 0x301c8e0
2014-01-21 12:38:06.103530 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:38:06.103536 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:38:13.357586 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 273942274 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x1d848780 con 0x301c8e0
2014-01-21 12:38:13.357618 7fa5fbe21700  0 mon.osd152@0(leader) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:38:13.357624 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:38:33.608010 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:38:34.222574 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:38:39.043513 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 273966619 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x26ba6900 con 0x301c8e0
2014-01-21 12:38:39.043542 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:38:39.043548 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:38:43.090934 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 273991283 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x253dc380 con 0x301c8e0
2014-01-21 12:38:43.090964 7fa5fbe21700  0 mon.osd152@0(leader) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:38:43.090971 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:39:03.118475 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:39:04.582335 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:39:05.219742 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:39:09.207392 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 274015424 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x58b7080 con 0x301c8e0
2014-01-21 12:39:09.207437 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:39:09.207446 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:39:33.545457 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:39:35.002865 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:39:40.033516 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 274039682 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x1efb6900 con 0x301c8e0
2014-01-21 12:39:40.033579 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:39:40.033589 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:39:40.094395 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 274064242 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x2c3eb200 con 0x301c8e0
2014-01-21 12:39:40.094423 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:39:40.094429 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:39:44.208521 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 274088941 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x156a5c80 con 0x301c8e0
2014-01-21 12:39:44.208548 7fa5fbe21700  0 mon.osd152@0(leader) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:39:44.208553 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:40:05.240497 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:40:05.251799 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:40:05.801058 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:40:09.999942 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 274113165 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x272a3e80 con 0x301c8e0
2014-01-21 12:40:09.999972 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:40:09.999979 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:40:31.794337 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:40:33.334604 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:40:34.603958 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 274137243 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x2ff76b80 con 0x301c8e0
2014-01-21 12:40:34.603986 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:40:34.603992 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:40:54.560978 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:40:58.383452 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 274161478 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x23493e80 con 0x301c8e0
2014-01-21 12:40:58.383497 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:40:58.383506 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:41:02.460721 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 274186052 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x15768280 con 0x301c8e0
2014-01-21 12:41:02.460767 7fa5fbe21700  0 mon.osd152@0(leader) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:41:02.460776 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:41:21.159561 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:41:22.128066 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:41:29.339218 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 274210127 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x21d62300 con 0x301c8e0
2014-01-21 12:41:29.339246 7fa5fbe21700  0 mon.osd152@0(leader) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:41:29.339252 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:41:49.615961 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:41:50.434930 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:41:55.499448 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 274234473 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x29588000 con 0x301c8e0
2014-01-21 12:41:55.499476 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:41:55.499482 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:42:16.531519 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:42:17.687598 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:42:18.704302 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 274258965 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x8f69e00 con 0x301c8e0
2014-01-21 12:42:18.704336 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:42:18.704342 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:42:44.306788 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:42:45.293560 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 274283130 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x29f71b80 con 0x301c8e0
2014-01-21 12:42:45.293604 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:42:45.293613 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:42:52.159900 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 274307291 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x1a438280 con 0x301c8e0
2014-01-21 12:42:52.159946 7fa5fbe21700  0 mon.osd152@0(leader) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:42:52.159956 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:43:12.721369 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:43:13.353255 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:43:18.297851 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 274331541 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x739da00 con 0x301c8e0
2014-01-21 12:43:18.297879 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:43:18.297885 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:43:21.597964 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 274355712 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x25a87d00 con 0x301c8e0
2014-01-21 12:43:21.598010 7fa5fbe21700  0 mon.osd152@0(leader) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:43:21.598019 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:43:38.367001 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:43:39.506520 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:43:40.445848 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:44:01.185559 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:44:05.364080 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 274379990 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x1e13f080 con 0x301c8e0
2014-01-21 12:44:05.364108 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:44:05.364113 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:44:09.684515 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 274404785 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x2394a300 con 0x301c8e0
2014-01-21 12:44:09.684545 7fa5fbe21700  0 mon.osd152@0(leader) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:44:09.684551 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:44:29.145187 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:44:30.409772 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:44:51.065857 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:44:55.631886 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 274428959 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x1e0efd00 con 0x301c8e0
2014-01-21 12:44:55.639824 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:44:55.639833 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:44:59.612749 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 274453034 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x5294b00 con 0x301c8e0
2014-01-21 12:44:59.612786 7fa5fbe21700  0 mon.osd152@0(leader) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:44:59.612792 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:45:19.247569 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:45:20.461755 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:45:40.729929 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:45:45.539968 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 274477315 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x3067b700 con 0x301c8e0
2014-01-21 12:45:45.539995 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:45:45.540000 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:46:09.104222 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:46:10.286208 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 274501392 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0xd7f6680 con 0x301c8e0
2014-01-21 12:46:10.286239 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:46:10.286245 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:46:31.695564 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:46:35.934508 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 274525783 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x1dbdfa80 con 0x301c8e0
2014-01-21 12:46:35.934535 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:46:35.934540 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:46:39.746151 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 274550478 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x1e796e00 con 0x301c8e0
2014-01-21 12:46:39.746179 7fa5fbe21700  0 mon.osd152@0(leader) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:46:39.746185 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:46:59.201204 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:47:00.445592 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:47:26.019461 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:47:26.980742 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 274574866 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x28e7bc00 con 0x301c8e0
2014-01-21 12:47:26.980777 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:47:26.980784 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:47:33.853432 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 274599161 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x1d01df00 con 0x301c8e0
2014-01-21 12:47:33.853460 7fa5fbe21700  0 mon.osd152@0(leader) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:47:33.853465 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:47:54.783141 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:47:55.279268 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:47:58.999101 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 274623445 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x2088fa80 con 0x301c8e0
2014-01-21 12:47:58.999128 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:47:58.999134 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:48:23.943261 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:48:25.132931 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:48:29.807202 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 274648153 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x515e900 con 0x301c8e0
2014-01-21 12:48:29.807231 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:48:29.807238 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:48:29.808135 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 274672335 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x289edf00 con 0x301c8e0
2014-01-21 12:48:29.808165 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:48:29.808171 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:48:34.557747 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 274696616 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x166c7580 con 0x301c8e0
2014-01-21 12:48:34.557777 7fa5fbe21700  0 mon.osd152@0(leader) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:48:34.557783 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:48:55.466204 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:48:55.466374 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:48:55.964376 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:48:59.634624 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 274720918 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x2803f300 con 0x301c8e0
2014-01-21 12:48:59.634668 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:48:59.634677 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:49:03.833361 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 274744991 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x16405280 con 0x301c8e0
2014-01-21 12:49:03.833405 7fa5fbe21700  0 mon.osd152@0(leader) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:49:03.833414 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:49:21.130148 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:49:22.298940 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:49:23.414053 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:49:36.000513 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 274769620 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x18bfc600 con 0x301c8e0
2014-01-21 12:49:36.000541 7fa5fbe21700  0 mon.osd152@0(leader) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:49:36.000546 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:49:53.071774 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:49:54.311747 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:50:18.132151 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:50:22.070256 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 274794003 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0xdff1680 con 0x301c8e0
2014-01-21 12:50:22.070286 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:50:22.070292 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:50:26.000276 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 274818200 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x19a87a80 con 0x301c8e0
2014-01-21 12:50:26.000320 7fa5fbe21700  0 mon.osd152@0(leader) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:50:26.000329 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:50:45.721502 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:50:46.789465 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:51:12.707215 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:51:13.990985 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 274842736 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x234b5a00 con 0x301c8e0
2014-01-21 12:51:13.991014 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:51:13.991020 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:51:21.553390 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 274866939 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x16256b80 con 0x301c8e0
2014-01-21 12:51:21.553418 7fa5fbe21700  0 mon.osd152@0(leader) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:51:21.553424 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:51:42.479953 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:51:43.097334 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:51:46.940050 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 274891133 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x2433e900 con 0x301c8e0
2014-01-21 12:51:46.940079 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:51:46.940085 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:52:09.372953 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:52:10.919483 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:52:12.146253 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 274915549 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x1b72fd00 con 0x301c8e0
2014-01-21 12:52:12.146284 7fa5fbe21700  0 mon.osd152@0(leader) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:52:12.146290 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:52:33.304222 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:52:37.977868 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 274940272 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x172ec100 con 0x301c8e0
2014-01-21 12:52:37.977897 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:52:37.977902 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:52:41.746116 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 274964756 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x2ab5e680 con 0x301c8e0
2014-01-21 12:52:41.746145 7fa5fbe21700  0 mon.osd152@0(leader) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:52:41.746151 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:52:59.745728 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:53:01.357609 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:53:02.659803 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:53:28.917455 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:53:30.101601 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 274989169 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x1512d000 con 0x301c8e0
2014-01-21 12:53:30.101630 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:53:30.101636 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:53:42.666665 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 275013377 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x202e6180 con 0x301c8e0
2014-01-21 12:53:42.666722 7fa5fbe21700  0 mon.osd152@0(leader) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:53:42.666728 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:54:00.889226 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:54:02.112789 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:54:24.445887 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:54:28.083024 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 275038244 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x51e0c80 con 0x301c8e0
2014-01-21 12:54:28.083053 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:54:28.083058 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:54:32.669030 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 275062315 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x24e1a580 con 0x301c8e0
2014-01-21 12:54:32.669075 7fa5fbe21700  0 mon.osd152@0(leader) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:54:32.669084 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:54:51.737360 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:54:52.973578 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:55:12.663760 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:55:17.259225 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 275086881 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x15987080 con 0x301c8e0
2014-01-21 12:55:17.259256 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:55:17.259262 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:55:21.528926 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 275111100 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x1e21f080 con 0x301c8e0
2014-01-21 12:55:21.528954 7fa5fbe21700  0 mon.osd152@0(leader) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:55:21.528960 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:55:40.680841 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:55:41.918032 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:56:02.271516 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:56:07.069396 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 275135320 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x24456180 con 0x301c8e0
2014-01-21 12:56:07.069423 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:56:07.069429 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:56:10.936449 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 275159391 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x2b7a6180 con 0x301c8e0
2014-01-21 12:56:10.936499 7fa5fbe21700  0 mon.osd152@0(leader) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:56:10.936508 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:56:29.493354 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:56:30.466862 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:56:56.555393 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:57:01.326269 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 275183971 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x52d8280 con 0x301c8e0
2014-01-21 12:57:01.326297 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:57:01.326302 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:57:04.926495 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 275208402 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x2004a580 con 0x301c8e0
2014-01-21 12:57:04.926539 7fa5fbe21700  0 mon.osd152@0(leader) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:57:04.926548 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:57:23.512244 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:57:24.536348 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:57:50.189644 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:57:51.490669 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 275232627 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x10475500 con 0x301c8e0
2014-01-21 12:57:51.490699 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:57:51.490705 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:57:58.398306 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 275256697 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x26cff080 con 0x301c8e0
2014-01-21 12:57:58.398335 7fa5fbe21700  0 mon.osd152@0(leader) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:57:58.398341 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:58:18.692241 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:58:19.240814 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:58:23.672852 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 275280917 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x15d81900 con 0x301c8e0
2014-01-21 12:58:23.672881 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:58:23.672887 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:58:27.724114 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 275305352 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x1cb90c80 con 0x301c8e0
2014-01-21 12:58:27.724143 7fa5fbe21700  0 mon.osd152@0(leader) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:58:27.724149 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:58:45.450467 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:58:46.721492 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:58:47.695072 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:59:13.155603 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:59:14.388980 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 275329581 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x297a9e00 con 0x301c8e0
2014-01-21 12:59:14.389008 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:59:14.389013 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:59:21.350523 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 275353812 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x1c7aa300 con 0x301c8e0
2014-01-21 12:59:21.350552 7fa5fbe21700  0 mon.osd152@0(leader) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:59:21.350557 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:59:41.494623 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:59:42.125603 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:59:47.101839 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 275378202 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x1b519180 con 0x301c8e0
2014-01-21 12:59:47.101868 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:59:47.101874 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 12:59:51.135899 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 275402483 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x2b1dd500 con 0x301c8e0
2014-01-21 12:59:51.135929 7fa5fbe21700  0 mon.osd152@0(leader) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 12:59:51.135935 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 13:00:09.221660 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 13:00:10.795320 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 13:00:12.082615 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 13:00:23.919564 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 275427408 ==== forward(mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 caps allow *) to leader v1 ==== 326+0+0 (2979896460 0 0) 0x23ec6180 con 0x301c8e0
2014-01-21 13:00:23.919609 7fa5fbe21700  0 mon.osd152@0(leader) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1
2014-01-21 13:00:23.919618 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 13:00:42.945430 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
2014-01-21 13:00:44.239739 7fa5fbe21700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) dispatch mon_command({"prefix": "osd set", "key": "noin"} v 0) v1 from client.757016 10.193.207.130:0/1054936
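
In the excerpt above, mon.osd152 repeatedly alternates between the (electing) and (leader) states while the same forwarded "osd set noin" command is re-dispatched. A small script along the following lines could summarize those state changes from a longer log. This is only a sketch: the regular expression is inferred from the line format seen in this fragment, and the names STATE_RE and state_transitions are made up for illustration.

```python
import re

# Assumed line shape (taken from the fragment above):
#   <date> <time> <thread-id> <level> mon.<name>@<rank>(<state>) ...
STATE_RE = re.compile(r"^(\S+ \S+) \S+\s+\d+ mon\.(\S+?)@\d+\((\w+)\)")


def state_transitions(lines):
    """Return (timestamp, old_state, new_state) for every monitor state change."""
    transitions = []
    previous = None
    for line in lines:
        match = STATE_RE.match(line)
        if match is None:
            continue  # messenger lines ("-- ip:port <== ...") carry no state
        timestamp, _name, state = match.groups()
        if previous is not None and state != previous:
            transitions.append((timestamp, previous, state))
        previous = state
    return transitions


# Four lines copied verbatim from the log fragment above.
sample = [
    '2014-01-21 12:48:59.634668 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1',
    '2014-01-21 12:49:03.833405 7fa5fbe21700  0 mon.osd152@0(leader) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1',
    '2014-01-21 12:50:22.070286 7fa5fbe21700  0 mon.osd152@0(electing) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1',
    '2014-01-21 12:50:26.000320 7fa5fbe21700  0 mon.osd152@0(leader) e1 handle_command mon_command({"prefix": "osd set", "key": "noin"} v 0) v1',
]

transitions = state_transitions(sample)
for timestamp, old, new in transitions:
    print(timestamp, old, "->", new)
```

To run it against a real log, pass an open file object instead of the sample list, e.g. state_transitions(open(path)) with path pointing at the monitor log.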

[-- Attachment #3: Type: text/plain, Size: 12738 bytes --]


Thanks,
Guang

On Jan 21, 2014, at 12:35 AM, Sage Weil <sage-4GqslpFJ+cxBDgjK7y7TUQ@public.gmane.org> wrote:

> On Sun, 19 Jan 2014, Guang wrote:
>> Thanks Sage.
>> 
>> I just captured part of the log (it was fast growing), the process did 
>> not hang but I saw the same pattern repeatedly. Should I increase the 
>> log level and send over email (it constantly reproduced)?
> 
> Sure!  A representative fragment of the repeating pattern should be 
> enough.
> 
> s
>> 
>> Thanks,
>> Guang
>> 
>> On Jan 18, 2014, at 12:05 AM, Sage Weil <sage-4GqslpFJ+cxBDgjK7y7TUQ@public.gmane.org> wrote:
>> 
>>> On Fri, 17 Jan 2014, Guang wrote:
>>>> Thanks Sage.
>>>> 
>>>> I further narrowed the problem down to #any command using the paxos service would hang#; details follow:
>>>> 
>>>> 1. I am able to run ceph status / osd dump, etc.; however, the results are out of date (although I stopped all OSDs, that is not reflected in the ceph status report).
>>>> 
>>>> -bash-4.1$ sudo ceph -s
>>>> cluster b9cb3ea9-e1de-48b4-9e86-6921e2c537d2
>>>>  health HEALTH_WARN 2797 pgs degraded; 107 pgs down; 7503 pgs peering; 917 pgs recovering; 6079 pgs recovery_wait; 2957 pgs stale; 7771 pgs stuck inactive; 2957 pgs stuck stale; 16567 pgs stuck unclean; recovery 54346804/779462977 degraded (6.972%); 9/259724199 unfound (0.000%); 2 near full osd(s); 57/751 in osds are down; noout,nobackfill,norecover,noscrub,nodeep-scrub flag(s) set
>>>>  monmap e1: 3 mons at {osd151=10.194.0.68:6789/0,osd152=10.193.207.130:6789/0,osd153=10.193.207.131:6789/0}, election epoch 123278, quorum 0,1,2 osd151,osd152,osd153
>>>>  osdmap e134893: 781 osds: 694 up, 751 in
>>>>   pgmap v2388518: 22203 pgs: 26 inactive, 14 active, 79 stale+active+recovering, 5020 active+clean, 242 stale, 4352 active+recovery_wait, 616 stale+active+clean, 177 active+recovering+degraded, 6714 peering, 925 stale+active+recovery_wait, 86 down+peering, 1547 active+degraded, 32 stale+active+recovering+degraded, 648 stale+peering, 21 stale+down+peering, 239 stale+active+degraded, 651 active+recovery_wait+degraded, 30 remapped+peering, 151 stale+active+recovery_wait+degraded, 4 stale+remapped+peering, 629 active+recovering; 79656 GB data, 363 TB used, 697 TB / 1061 TB avail; 54346804/779462977 degraded (6.972%); 9/259724199 unfound (0.000%)
>>>>  mdsmap e1: 0/0/1 up
>>>> 
>>>> 2. If I run a command which uses paxos, the command hangs forever. This includes ceph osd set noup, as well as the commands an OSD sends to the monitor when it is started (create-or-move).
>>>> 
>>>> I attached the corresponding monitor log (it looks like a bug).
>>> 
>>> I see the osd set command coming through, but it arrives while paxos is 
>>> converging and the log seems to end before the mon would normally process 
>>> the delayed messages.  Is there a reason why the log fragment you attached 
>>> ends there, or did the process hang or something?
>>> 
>>> Thanks-
>>> sage
>>> 
>>>> I 
>>>> 
>>>> On Jan 17, 2014, at 1:35 AM, Sage Weil <sage-4GqslpFJ+cxBDgjK7y7TUQ@public.gmane.org> wrote:
>>>> 
>>>>> Hi Guang,
>>>>> 
>>>>> On Thu, 16 Jan 2014, Guang wrote:
>>>>>> I still have had no luck figuring out what is causing the authentication failure, so in order to get the cluster back, I tried:
>>>>>> 1. stop all daemons (mon & osd)
>>>>>> 2. change the configuration to disable cephx
>>>>>> 3. start mon daemons (3 in total)
>>>>>> 4. start osd daemon one by one
>>>>>> 
>>>>>> After finishing step 3, the cluster is reachable ('ceph -s' gives results):
>>>>>> -bash-4.1$ sudo ceph -s
>>>>>> cluster b9cb3ea9-e1de-48b4-9e86-6921e2c537d2
>>>>>> health HEALTH_WARN 2797 pgs degraded; 107 pgs down; 7503 pgs peering; 917 pgs recovering; 6079 pgs recovery_wait; 2957 pgs stale; 7771 pgs stuck inactive; 2957 pgs stuck stale; 16567 pgs stuck unclean; recovery 54346804/779462977 degraded (6.972%); 9/259724199 unfound (0.000%); 2 near full osd(s); 57/751 in osds are down; noout,nobackfill,norecover,noscrub,nodeep-scrub flag(s) set
>>>>>> monmap e1: 3 mons at {osd151=10.194.0.68:6789/0,osd152=10.193.207.130:6789/0,osd153=10.193.207.131:6789/0}, election epoch 106022, quorum 0,1,2 osd151,osd152,osd153
>>>>>> osdmap e134893: 781 osds: 694 up, 751 in
>>>>>>  pgmap v2388518: 22203 pgs: 26 inactive, 14 active, 79 stale+active+recovering, 5020 active+clean, 242 stale, 4352 active+recovery_wait, 616 stale+active+clean, 177 active+recovering+degraded, 6714 peering, 925 stale+active+recovery_wait, 86 down+peering, 1547 active+degraded, 32 stale+active+recovering+degraded, 648 stale+peering, 21 stale+down+peering, 239 stale+active+degraded, 651 active+recovery_wait+degraded, 30 remapped+peering, 151 stale+active+recovery_wait+degraded, 4 stale+remapped+peering, 629 active+recovering; 79656 GB data, 363 TB used, 697 TB / 1061 TB avail; 54346804/779462977 degraded (6.972%); 9/259724199 unfound (0.000%)
>>>>>> mdsmap e1: 0/0/1 up
>>>>>> (at this point, all OSDs should be down).
>>>>>> 
>>>>>> When I tried to start an OSD daemon, the startup script hung; the hung process is:
>>>>>> root      80497  80496  0 08:18 pts/0    00:00:00 python /usr/bin/ceph --name=osd.22 --keyring=/var/lib/ceph/osd/ceph-22/keyring osd crush create-or-move -- 22 0.40 root=default host=osd173
>>>>>> 
>>>>>> When I strace the startup script, I get the following trace (process 75873 is the above process); a futex call fails and then it goes into an infinite loop of:
>>>>>> select(0, NULL, NULL, NULL, {0, 16000}) = 0 (Timeout)
>>>>>> Any idea what might trigger this?
>>>>> 
>>>>> It is hard to tell what is going on from the strace alone.  Do you see 
>>>>> that the OSDs are booting in ceph.log (or ceph -w output)?  If not, I 
>>>>> would look at the osd daemon log for clues.  You may need to turn up 
>>>>> debugging to see (ceph daemon osd.NNN config set debug_osd 20 to adjust 
>>>>> the level on the running daemon).
>>>>> 
>>>>> If they are booting, it is mostly a matter of letting it recover and come 
>>>>> up.  We have seen patterns where configuration or network issues have let 
>>>>> the system bury itself under a series of osdmap updates.  If you see that 
>>>>> in the log when you turn up debugging, or see the osds going up and down 
>>>>> when you try to bring the cluster up, that could be what is going on.  A 
>>>>> strategy that has worked there is to let all the osds catch up on their 
>>>>> maps before trying to peer and join the cluster.  To do that, 'ceph osd 
>>>>> set noup' (which prevents the osds from joining), wait for the ceph-osd 
>>>>> processes to stop chewing on maps (watch the cpu utilization in top), and 
>>>>> once they are all ready 'ceph osd unset noup' and let them join and peer 
>>>>> all at once.
>>>>> 
>>>>> sage
>>>>> 
>>>>>> 
>>>>>> ======= STRACE (PARTIAL) ========== 
>>>>>> [pid 75873] futex(0xf707a0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
>>>>>> [pid 75878] mmap(NULL, 134217728, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7f5da6529000
>>>>>> [pid 75878] munmap(0x7f5da6529000, 28143616) = 0
>>>>>> [pid 75878] munmap(0x7f5dac000000, 38965248) = 0
>>>>>> [pid 75878] mprotect(0x7f5da8000000, 135168, PROT_READ|PROT_WRITE) = 0
>>>>>> [pid 75878] futex(0xf707a0, FUTEX_WAKE_PRIVATE, 1) = 1
>>>>>> [pid 75873] <... futex resumed> )       = 0
>>>>>> [pid 75873] futex(0xdd3cb0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
>>>>>> [pid 75878] futex(0xdd3cb0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
>>>>>> [pid 75873] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
>>>>>> [pid 75878] <... futex resumed> )       = 0
>>>>>> [pid 75873] select(0, NULL, NULL, NULL, {0, 1000} <unfinished ...>
>>>>>> [pid 75878] rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0
>>>>>> [pid 75878] mmap(NULL, 10489856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f5dadb28000
>>>>>> [pid 75878] mprotect(0x7f5dadb28000, 4096, PROT_NONE) = 0
>>>>>> [ omit some entries?]
>>>>>> [pid 75873] select(0, NULL, NULL, NULL, {0, 16000}) = 0 (Timeout)
>>>>>> [pid 75873] select(0, NULL, NULL, NULL, {0, 32000}) = 0 (Timeout)
>>>>>> [pid 75873] select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
>>>>>> [pid 75873] select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
>>>>>> [pid 75873] select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
>>>>>> [pid 75873] select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
>>>>>> [pid 75873] select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
>>>>>> 
>>>>>> 
>>>>>> Thanks,
>>>>>> Guang
>>>>>> 
>>>>>> On Jan 15, 2014, at 5:54 AM, Guang <yguang11-1ViLX0X+lBJBDgjK7y7TUQ@public.gmane.org> wrote:
>>>>>> 
>>>>>>> Thanks Sage.
>>>>>>> 
>>>>>>> -bash-4.1$ sudo ceph --admin-daemon /var/run/ceph/ceph-mon.osd151.asok mon_status
>>>>>>> { "name": "osd151",
>>>>>>> "rank": 2,
>>>>>>> "state": "electing",
>>>>>>> "election_epoch": 85469,
>>>>>>> "quorum": [],
>>>>>>> "outside_quorum": [],
>>>>>>> "extra_probe_peers": [],
>>>>>>> "sync_provider": [],
>>>>>>> "monmap": { "epoch": 1,
>>>>>>>   "fsid": "b9cb3ea9-e1de-48b4-9e86-6921e2c537d2",
>>>>>>>   "modified": "0.000000",
>>>>>>>   "created": "0.000000",
>>>>>>>   "mons": [
>>>>>>>         { "rank": 0,
>>>>>>>           "name": "osd152",
>>>>>>>           "addr": "10.193.207.130:6789\/0"},
>>>>>>>         { "rank": 1,
>>>>>>>           "name": "osd153",
>>>>>>>           "addr": "10.193.207.131:6789\/0"},
>>>>>>>         { "rank": 2,
>>>>>>>           "name": "osd151",
>>>>>>>           "addr": "10.194.0.68:6789\/0"}]}}
>>>>>>> 
>>>>>>> And:
>>>>>>> 
>>>>>>> -bash-4.1$ sudo ceph --admin-daemon /var/run/ceph/ceph-mon.osd151.asok quorum_status
>>>>>>> { "election_epoch": 85480,
>>>>>>> "quorum": [
>>>>>>>     0,
>>>>>>>     1,
>>>>>>>     2],
>>>>>>> "quorum_names": [
>>>>>>>     "osd151",
>>>>>>>     "osd152",
>>>>>>>     "osd153"],
>>>>>>> "quorum_leader_name": "osd152",
>>>>>>> "monmap": { "epoch": 1,
>>>>>>>   "fsid": "b9cb3ea9-e1de-48b4-9e86-6921e2c537d2",
>>>>>>>   "modified": "0.000000",
>>>>>>>   "created": "0.000000",
>>>>>>>   "mons": [
>>>>>>>         { "rank": 0,
>>>>>>>           "name": "osd152",
>>>>>>>           "addr": "10.193.207.130:6789\/0"},
>>>>>>>         { "rank": 1,
>>>>>>>           "name": "osd153",
>>>>>>>           "addr": "10.193.207.131:6789\/0"},
>>>>>>>         { "rank": 2,
>>>>>>>           "name": "osd151",
>>>>>>>           "addr": "10.194.0.68:6789\/0"}]}}
>>>>>>> 
>>>>>>> 
>>>>>>> Judging from the above status, the election has finished and a leader has been selected.
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Guang
>>>>>>> 
>>>>>>> On Jan 14, 2014, at 10:55 PM, Sage Weil <sage-4GqslpFJ+cxBDgjK7y7TUQ@public.gmane.org> wrote:
>>>>>>> 
>>>>>>>> On Tue, 14 Jan 2014, GuangYang wrote:
>>>>>>>>> Hi ceph-users and ceph-devel,
>>>>>>>>> I came across an issue after restarting the monitors of the cluster: authentication fails, which prevents running any ceph command.
>>>>>>>>> 
>>>>>>>>> After we did some maintenance work, I restarted an OSD. However, the OSD did not rejoin the cluster automatically after the restart, even though a TCP dump showed it had already sent a message to the monitor asking to be added to the cluster.
>>>>>>>>> 
>>>>>>>>> So I suspected there might be an issue with the monitors and restarted them one by one (3 in total). After that, every ceph command fails with an authentication timeout…
>>>>>>>>> 
>>>>>>>>> 2014-01-14 12:00:30.499397 7fc7f195e700  0 monclient(hunting): authenticate timed out after 300
>>>>>>>>> 2014-01-14 12:00:30.499440 7fc7f195e700  0 librados: client.admin authentication error (110) Connection timed out
>>>>>>>>> Error connecting to cluster: Error
>>>>>>>>> 
>>>>>>>>> Any idea why this error happens (restarting an OSD results in the same error)?
>>>>>>>>> 
>>>>>>>>> I believe the authentication information is persisted on the monitors' local disk; is there a chance that data got corrupted?
>>>>>>>> 
>>>>>>>> That sounds unlikely, but you're right that the core problem is with the 
>>>>>>>> mons.  What does 
>>>>>>>> 
>>>>>>>> ceph daemon mon.`hostname` mon_status
>>>>>>>> 
>>>>>>>> say?  Perhaps they are not forming a quorum and that is what is preventing 
>>>>>>>> authentication.
>>>>>>>> 
>>>>>>>> sage
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>> 
>> 
> 


[-- Attachment #4: Type: text/plain, Size: 178 bytes --]

_______________________________________________
ceph-users mailing list
ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [ceph-users] Ceph cluster is unreachable because of authentication failure
  2014-01-22 11:34                       ` Guang
@ 2014-01-22 13:14                         ` Joao Eduardo Luis
       [not found]                           ` <BLU0-SMTP3186ED17CE3EC156E4EA064DFA60@phx.gbl>
  0 siblings, 1 reply; 14+ messages in thread
From: Joao Eduardo Luis @ 2014-01-22 13:14 UTC (permalink / raw)
  To: Guang, Sage Weil; +Cc: ceph-users@lists.ceph.com, Ceph Development

On 01/22/2014 11:34 AM, Guang wrote:
> Thanks Sage.
>
> With debug_mon and debug_paxos set to 20, the log file grows too fast, so I set the log level to 10 and then: 1) ran the 'ceph osd set noin' command, 2) grepped the log for the keyword 'noin'. The monitor log is attached. Please help check. Thanks very much!
>
>


The log doesn't show the relevant part due to only containing log 
messages mentioning the 'noin' keyword.

We need the portion of the log between a line containing

'(leader).*handle_command mon_command({"prefix": "osd set", "key": 
"noin"}.*'

and the first line (after that) containing 'won leader election'.

Otherwise we are missing what is causing the election to be triggered.
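
As a rough illustration of the extraction described above, here is a minimal Python sketch. It assumes the monitor log is plain text with one entry per line; the function name `extract_window` and the exact regular expression are illustrative choices built from the pattern quoted above, not something taken from the Ceph tooling.

```python
import re

# Start marker: the leader handling the 'osd set noin' command
# (regular expression adapted from the pattern quoted above).
START = re.compile(
    r'\(leader\).*handle_command mon_command\('
    r'\{"prefix": "osd set", "key": "noin"\}'
)
# End marker: the first later line announcing the election result.
END = "won leader election"


def extract_window(lines):
    """Return the lines from the first START match up to and including
    the first subsequent line containing END.

    If END never appears after the start line, everything from the
    start line to the end of the input is returned.
    """
    window = []
    capturing = False
    for line in lines:
        if not capturing and START.search(line):
            capturing = True
        if capturing:
            window.append(line)
            # Require at least one line after the start marker.
            if len(window) > 1 and END in line:
                break
    return window
```

To use it, open the monitor log (the path depends on the installation) and pass the file object to `extract_window`; the returned list can be written out unchanged and attached to a reply.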

   -Joao

>
>
> Thanks,
> Guang
>
> On Jan 21, 2014, at 12:35 AM, Sage Weil <sage@inktank.com> wrote:
>
>> On Sun, 19 Jan 2014, Guang wrote:
>>> Thanks Sage.
>>>
>>> I just captured part of the log (it was fast growing), the process did
>>> not hang but I saw the same pattern repeatedly. Should I increase the
>>> log level and send over email (it constantly reproduced)?
>>
>> Sure!  A representative fragment of the repeating pattern should be
>> enough.
>>
>> s
>>>
>>> Thanks,
>>> Guang
>>>
>>> On Jan 18, 2014, at 12:05 AM, Sage Weil <sage@inktank.com> wrote:
>>>
>>>> On Fri, 17 Jan 2014, Guang wrote:
>>>>> Thanks Sage.
>>>>>
>>>>> I further narrowed the problem down to #any command using the paxos service hangs#; details follow:
>>>>>
>>>>> 1. I am able to run ceph status / osd dump, etc.; however, the results are out of date (although I stopped all OSDs, the ceph status report does not reflect that).
>>>>>
>>>>> -bash-4.1$ sudo ceph -s
>>>>> cluster b9cb3ea9-e1de-48b4-9e86-6921e2c537d2
>>>>>   health HEALTH_WARN 2797 pgs degraded; 107 pgs down; 7503 pgs peering; 917 pgs recovering; 6079 pgs recovery_wait; 2957 pgs stale; 7771 pgs stuck inactive; 2957 pgs stuck stale; 16567 pgs stuck unclean; recovery 54346804/779462977 degraded (6.972%); 9/259724199 unfound (0.000%); 2 near full osd(s); 57/751 in osds are down; noout,nobackfill,norecover,noscrub,nodeep-scrub flag(s) set
>>>>>   monmap e1: 3 mons at {osd151=10.194.0.68:6789/0,osd152=10.193.207.130:6789/0,osd153=10.193.207.131:6789/0}, election epoch 123278, quorum 0,1,2 osd151,osd152,osd153
>>>>>   osdmap e134893: 781 osds: 694 up, 751 in
>>>>>    pgmap v2388518: 22203 pgs: 26 inactive, 14 active, 79 stale+active+recovering, 5020 active+clean, 242 stale, 4352 active+recovery_wait, 616 stale+active+clean, 177 active+recovering+degraded, 6714 peering, 925 stale+active+recovery_wait, 86 down+peering, 1547 active+degraded, 32 stale+active+recovering+degraded, 648 stale+peering, 21 stale+down+peering, 239 stale+active+degraded, 651 active+recovery_wait+degraded, 30 remapped+peering, 151 stale+active+recovery_wait+degraded, 4 stale+remapped+peering, 629 active+recovering; 79656 GB data, 363 TB used, 697 TB / 1061 TB avail; 54346804/779462977 degraded (6.972%); 9/259724199 unfound (0.000%)
>>>>>   mdsmap e1: 0/0/1 up
>>>>>
>>>>> 2. If I run a command that goes through paxos, it hangs forever. This includes ceph osd set noup, and also the commands an OSD sends to the monitor when it starts (create-or-move).
>>>>>
>>>>> I attached the corresponding monitor log (it looks like a bug).
>>>>
>>>> I see the osd set command coming through, but it arrives while paxos is
>>>> converging and the log seems to end before the mon would normally process
>>>> the delayed messages.  Is there a reason why the log fragment you attached
>>>> ends there, or did the process hang or something?
>>>>
>>>> Thanks-
>>>> sage
>>>>
>>>>> I
>>>>>
>>>>> On Jan 17, 2014, at 1:35 AM, Sage Weil <sage@inktank.com> wrote:
>>>>>
>>>>>> Hi Guang,
>>>>>>
>>>>>> On Thu, 16 Jan 2014, Guang wrote:
>>>>>>> I still have had no luck figuring out what is causing the authentication failure, so in order to get the cluster back, I tried:
>>>>>>> 1. stop all daemons (mon & osd)
>>>>>>> 2. change the configuration to disable cephx
>>>>>>> 3. start mon daemons (3 in total)
>>>>>>> 4. start osd daemon one by one
>>>>>>>
>>>>>>> After finishing step 3, the cluster is reachable ('ceph -s' gives results):
>>>>>>> -bash-4.1$ sudo ceph -s
>>>>>>> cluster b9cb3ea9-e1de-48b4-9e86-6921e2c537d2
>>>>>>> health HEALTH_WARN 2797 pgs degraded; 107 pgs down; 7503 pgs peering; 917 pgs recovering; 6079 pgs recovery_wait; 2957 pgs stale; 7771 pgs stuck inactive; 2957 pgs stuck stale; 16567 pgs stuck unclean; recovery 54346804/779462977 degraded (6.972%); 9/259724199 unfound (0.000%); 2 near full osd(s); 57/751 in osds are down; noout,nobackfill,norecover,noscrub,nodeep-scrub flag(s) set
>>>>>>> monmap e1: 3 mons at {osd151=10.194.0.68:6789/0,osd152=10.193.207.130:6789/0,osd153=10.193.207.131:6789/0}, election epoch 106022, quorum 0,1,2 osd151,osd152,osd153
>>>>>>> osdmap e134893: 781 osds: 694 up, 751 in
>>>>>>>   pgmap v2388518: 22203 pgs: 26 inactive, 14 active, 79 stale+active+recovering, 5020 active+clean, 242 stale, 4352 active+recovery_wait, 616 stale+active+clean, 177 active+recovering+degraded, 6714 peering, 925 stale+active+recovery_wait, 86 down+peering, 1547 active+degraded, 32 stale+active+recovering+degraded, 648 stale+peering, 21 stale+down+peering, 239 stale+active+degraded, 651 active+recovery_wait+degraded, 30 remapped+peering, 151 stale+active+recovery_wait+degraded, 4 stale+remapped+peering, 629 active+recovering; 79656 GB data, 363 TB used, 697 TB / 1061 TB avail; 54346804/779462977 degraded (6.972%); 9/259724199 unfound (0.000%)
>>>>>>> mdsmap e1: 0/0/1 up
>>>>>>> (at this point, all OSDs should be down).
>>>>>>>
>>>>>>> When I tried to start an OSD daemon, the start script hung; the hung process is:
>>>>>>> root      80497  80496  0 08:18 pts/0    00:00:00 python /usr/bin/ceph --name=osd.22 --keyring=/var/lib/ceph/osd/ceph-22/keyring osd crush create-or-move -- 22 0.40 root=default host=osd173
>>>>>>>
>>>>>>> When I strace the start script, I get the following traces (process 75873 is the above process); a futex call fails and then it loops indefinitely:
>>>>>>> select(0, NULL, NULL, NULL, {0, 16000}) = 0 (Timeout)
>>>>>>> Any idea what might trigger this?
>>>>>>
>>>>>> It is hard to tell from the strace alone what is going on.  Do you see
>>>>>> that the OSDs are booting in ceph.log (or ceph -w output)?  If not, I
>>>>>> would look at the osd daemon log for clues.  You may need to turn up
>>>>>> debugging to see (ceph daemon osd.NNN config set debug_osd 20 to adjust
>>>>>> the level on the running daemon).
>>>>>>
>>>>>> If they are booting, it is mostly a matter of letting it recover and come
>>>>>> up.  We have seen patterns where configuration or network issues have let
>>>>>> the system bury itself under a series of osdmap updates.  If you see that
>>>>>> in the log when you turn up debugging, or see the osds going up and down
>>>>>> when you try to bring the cluster up, that could be what is going on.  A
>>>>>> strategy that has worked there is to let all the osds catch up on their
>>>>>> maps before trying to peer and join the cluster.  To do that, 'ceph osd
>>>>>> set noup' (which prevents the osds from joining), wait for the ceph-osd
>>>>>> processes to stop chewing on maps (watch the cpu utilization in top), and
>>>>>> once they are all ready 'ceph osd unset noup' and let them join and peer
>>>>>> all at once.
>>>>>>
>>>>>> sage
>>>>>>
>>>>>>>
>>>>>>> ======= STRACE (PARTIAL) ==========
>>>>>>> [pid 75873] futex(0xf707a0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
>>>>>>> [pid 75878] mmap(NULL, 134217728, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7f5da6529000
>>>>>>> [pid 75878] munmap(0x7f5da6529000, 28143616) = 0
>>>>>>> [pid 75878] munmap(0x7f5dac000000, 38965248) = 0
>>>>>>> [pid 75878] mprotect(0x7f5da8000000, 135168, PROT_READ|PROT_WRITE) = 0
>>>>>>> [pid 75878] futex(0xf707a0, FUTEX_WAKE_PRIVATE, 1) = 1
>>>>>>> [pid 75873] <... futex resumed> )       = 0
>>>>>>> [pid 75873] futex(0xdd3cb0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
>>>>>>> [pid 75878] futex(0xdd3cb0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
>>>>>>> [pid 75873] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
>>>>>>> [pid 75878] <... futex resumed> )       = 0
>>>>>>> [pid 75873] select(0, NULL, NULL, NULL, {0, 1000} <unfinished ...>
>>>>>>> [pid 75878] rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0
>>>>>>> [pid 75878] mmap(NULL, 10489856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f5dadb28000
>>>>>>> [pid 75878] mprotect(0x7f5dadb28000, 4096, PROT_NONE) = 0
>>>>>>> [ omit some entries?]
>>>>>>> [pid 75873] select(0, NULL, NULL, NULL, {0, 16000}) = 0 (Timeout)
>>>>>>> [pid 75873] select(0, NULL, NULL, NULL, {0, 32000}) = 0 (Timeout)
>>>>>>> [pid 75873] select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
>>>>>>> [pid 75873] select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
>>>>>>> [pid 75873] select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
>>>>>>> [pid 75873] select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
>>>>>>> [pid 75873] select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Guang
>>>>>>>
>>>>>>> On Jan 15, 2014, at 5:54 AM, Guang <yguang11@outlook.com> wrote:
>>>>>>>
>>>>>>>> Thanks Sage.
>>>>>>>>
>>>>>>>> -bash-4.1$ sudo ceph --admin-daemon /var/run/ceph/ceph-mon.osd151.asok mon_status
>>>>>>>> { "name": "osd151",
>>>>>>>> "rank": 2,
>>>>>>>> "state": "electing",
>>>>>>>> "election_epoch": 85469,
>>>>>>>> "quorum": [],
>>>>>>>> "outside_quorum": [],
>>>>>>>> "extra_probe_peers": [],
>>>>>>>> "sync_provider": [],
>>>>>>>> "monmap": { "epoch": 1,
>>>>>>>>    "fsid": "b9cb3ea9-e1de-48b4-9e86-6921e2c537d2",
>>>>>>>>    "modified": "0.000000",
>>>>>>>>    "created": "0.000000",
>>>>>>>>    "mons": [
>>>>>>>>          { "rank": 0,
>>>>>>>>            "name": "osd152",
>>>>>>>>            "addr": "10.193.207.130:6789\/0"},
>>>>>>>>          { "rank": 1,
>>>>>>>>            "name": "osd153",
>>>>>>>>            "addr": "10.193.207.131:6789\/0"},
>>>>>>>>          { "rank": 2,
>>>>>>>>            "name": "osd151",
>>>>>>>>            "addr": "10.194.0.68:6789\/0"}]}}
>>>>>>>>
>>>>>>>> And:
>>>>>>>>
>>>>>>>> -bash-4.1$ sudo ceph --admin-daemon /var/run/ceph/ceph-mon.osd151.asok quorum_status
>>>>>>>> { "election_epoch": 85480,
>>>>>>>> "quorum": [
>>>>>>>>      0,
>>>>>>>>      1,
>>>>>>>>      2],
>>>>>>>> "quorum_names": [
>>>>>>>>      "osd151",
>>>>>>>>      "osd152",
>>>>>>>>      "osd153"],
>>>>>>>> "quorum_leader_name": "osd152",
>>>>>>>> "monmap": { "epoch": 1,
>>>>>>>>    "fsid": "b9cb3ea9-e1de-48b4-9e86-6921e2c537d2",
>>>>>>>>    "modified": "0.000000",
>>>>>>>>    "created": "0.000000",
>>>>>>>>    "mons": [
>>>>>>>>          { "rank": 0,
>>>>>>>>            "name": "osd152",
>>>>>>>>            "addr": "10.193.207.130:6789\/0"},
>>>>>>>>          { "rank": 1,
>>>>>>>>            "name": "osd153",
>>>>>>>>            "addr": "10.193.207.131:6789\/0"},
>>>>>>>>          { "rank": 2,
>>>>>>>>            "name": "osd151",
>>>>>>>>            "addr": "10.194.0.68:6789\/0"}]}}
>>>>>>>>
>>>>>>>>
>>>>>>>> Judging from the above status, the election has finished and a leader has been selected.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Guang
>>>>>>>>
>>>>>>>> On Jan 14, 2014, at 10:55 PM, Sage Weil <sage@inktank.com> wrote:
>>>>>>>>
>>>>>>>>> On Tue, 14 Jan 2014, GuangYang wrote:
>>>>>>>>>> Hi ceph-users and ceph-devel,
>>>>>>>>>> I came across an issue after restarting the monitors of the cluster: authentication fails, which prevents running any ceph command.
>>>>>>>>>>
>>>>>>>>>> After we did some maintenance work, I restarted an OSD. However, the OSD did not rejoin the cluster automatically after the restart, even though a TCP dump showed it had already sent a message to the monitor asking to be added to the cluster.
>>>>>>>>>>
>>>>>>>>>> So I suspected there might be an issue with the monitors and restarted them one by one (3 in total). After that, every ceph command fails with an authentication timeout…
>>>>>>>>>>
>>>>>>>>>> 2014-01-14 12:00:30.499397 7fc7f195e700  0 monclient(hunting): authenticate timed out after 300
>>>>>>>>>> 2014-01-14 12:00:30.499440 7fc7f195e700  0 librados: client.admin authentication error (110) Connection timed out
>>>>>>>>>> Error connecting to cluster: Error
>>>>>>>>>>
>>>>>>>>>> Any idea why this error happens (restarting an OSD results in the same error)?
>>>>>>>>>>
>>>>>>>>>> I believe the authentication information is persisted on the monitors' local disk; is there a chance that data got corrupted?
>>>>>>>>>
>>>>>>>>> That sounds unlikely, but you're right that the core problem is with the
>>>>>>>>> mons.  What does
>>>>>>>>>
>>>>>>>>> ceph daemon mon.`hostname` mon_status
>>>>>>>>>
>>>>>>>>> say?  Perhaps they are not forming a quorum and that is what is preventing
>>>>>>>>> authentication.
>>>>>>>>>
>>>>>>>>> sage
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>
>
>
>


-- 
Joao Eduardo Luis
Software Engineer | http://inktank.com | http://ceph.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Ceph cluster is unreachable because of authentication failure
       [not found]                             ` <BLU0-SMTP3186ED17CE3EC156E4EA064DFA60-MsuGFMq8XAE@public.gmane.org>
@ 2014-01-23 14:32                               ` Sage Weil
       [not found]                                 ` <alpine.DEB.2.00.1401230630150.18304-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
  0 siblings, 1 reply; 14+ messages in thread
From: Sage Weil @ 2014-01-23 14:32 UTC (permalink / raw)
  To: Guang; +Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org,
	Ceph Development

On Thu, 23 Jan 2014, Guang wrote:
> Hi Joao,
> Thanks for your reply!
> 
> I captured the log after seeing the 'noin' keyword and the log is attached.
> 
> Meanwhile, while checking the monitor logs, I see that an election happens every few seconds and each election can take several seconds, so the cluster is electing almost all the time (could that be the root cause of the out-of-date cluster status and the command failures?).
> 
> I captured one round of election logs, included below; please help check. Thanks very much.

It looks like this election was triggered by a slow or misbehaving mon.1 
(10.193.207.131).  There isn't enough before or after to know what the 
pattern is (does it always ack the previous election but slowly?  maybe 
its clock is off;  is its host overloaded or is it over a very slow 
link?).  I suspect, though, that you can get the cluster into quorum by 
just stopping the daemon.  Double-check the clock sync and then try to 
join it in.  Even with it down temporarily, though, the mons will become 
available.
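
One way to check whether a particular monitor keeps answering with old election epochs is to tally the incoming election messages in the leader's log. The following is only a sketch: the regular expression is modelled on the `<== mon.N ... election(<fsid> <op> <epoch>)` lines in the quoted log, and treating "epoch lower than the highest epoch received so far" as stale is an assumption made for illustration, not the elector's actual rule.

```python
import re
from collections import Counter

# Matches incoming election messages, e.g.
#   <== mon.1 10.193.207.131:6789/0 398310803 ==== election(<fsid> ack 204007) v4
INCOMING = re.compile(
    r"<== (mon\.\d+) \S+ \d+ ==== election\(\S+ (\w+) (\d+)\)"
)


def stale_election_messages(lines):
    """Count, per peer, incoming election messages whose epoch is lower
    than the highest epoch already seen in earlier incoming messages."""
    highest = 0
    stale = Counter()
    for line in lines:
        match = INCOMING.search(line)
        if match is None:
            continue
        peer = match.group(1)
        epoch = int(match.group(3))
        if epoch < highest:
            stale[peer] += 1
        highest = max(highest, epoch)
    return stale
```

A peer that dominates the resulting counter over a longer stretch of log would be the first candidate for the clock, load, and network checks mentioned above.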

sage

> 
> 2014-01-23 04:01:08.871622 7fa5fbe21700  5 mon.osd152@0(leader).elector(204008) handle_propose from mon.1
> 2014-01-23 04:01:08.871625 7fa5fbe21700  5 mon.osd152@0(leader).elector(204008)  got propose from old epoch, quorum is 0,2, mon.1 must have just started
> 2014-01-23 04:01:08.871627 7fa5fbe21700 10 mon.osd152@0(leader) e1 start_election
> 2014-01-23 04:01:08.871629 7fa5fbe21700 10 mon.osd152@0(electing) e1 _reset
> 2014-01-23 04:01:08.871635 7fa5fbe21700 10 mon.osd152@0(electing) e1 cancel_probe_timeout (none scheduled)
> 2014-01-23 04:01:08.871636 7fa5fbe21700 10 mon.osd152@0(electing) e1 timecheck_finish
> 2014-01-23 04:01:08.871640 7fa5fbe21700 10 mon.osd152@0(electing) e1 scrub_reset
> 2014-01-23 04:01:08.871642 7fa5fbe21700 10 mon.osd152@0(electing).paxos(paxos active c 4353345..4353903) restart -- canceling timeouts
> 2014-01-23 04:01:08.871646 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(pgmap 2388001..2388518) restart
> 2014-01-23 04:01:08.871649 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(mdsmap 1..1) restart
> 2014-01-23 04:01:08.871651 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) restart
> 2014-01-23 04:01:08.871653 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(logm 1793310..1793958) restart
> 2014-01-23 04:01:08.871654 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(logm 1793310..1793958)  canceling proposal_timer 0x47d3e40
> 2014-01-23 04:01:08.871657 7fa5fbe21700  7 mon.osd152@0(electing).log v1793958 _updated_log for mon.2 10.194.0.68:6789/0
> 2014-01-23 04:01:08.871663 7fa5fbe21700  1 -- 10.193.207.130:6789/0 --> 10.194.0.68:6789/0 -- route(log(last 39415) v1 tid 3682233) v2 -- ?+0 0x1236de80 con 0x301c8e0
> 2014-01-23 04:01:08.871677 7fa5fbe21700  7 mon.osd152@0(electing).log v1793958 _updated_log for mon.0 10.193.207.130:6789/0
> 2014-01-23 04:01:08.871681 7fa5fbe21700  1 -- 10.193.207.130:6789/0 --> 10.193.207.130:6789/0 -- log(last 192924) v1 -- ?+0 0x53ccec0 con 0x3018580
> 2014-01-23 04:01:08.871690 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(monmap 1..1) restart
> 2014-01-23 04:01:08.871693 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(auth 2010..2269) restart
> 2014-01-23 04:01:08.871696 7fa5fbe21700 10 mon.osd152@0(electing) e1 cancel_probe_timeout (none scheduled)
> 2014-01-23 04:01:08.871709 7fa5fbe21700  0 log [INF] : mon.osd152 calling new monitor election
> 2014-01-23 04:01:08.871732 7fa5fbe21700  1 -- 10.193.207.130:6789/0 --> mon.0 10.193.207.130:6789/0 -- log(2 entries) v1 -- ?+0 0x408b33c0
> 2014-01-23 04:01:08.871739 7fa5fbe21700  5 mon.osd152@0(electing).elector(204008) start -- can i be leader?
> 2014-01-23 04:01:08.871834 7fa5fbe21700  1 mon.osd152@0(electing).elector(204008) init, last seen epoch 204008
> 2014-01-23 04:01:08.871836 7fa5fbe21700 10 mon.osd152@0(electing).elector(204008) bump_epoch 204008 to 204009
> 2014-01-23 04:01:08.872379 7fa5fbe21700 10 mon.osd152@0(electing) e1 join_election
> 2014-01-23 04:01:08.872389 7fa5fbe21700 10 mon.osd152@0(electing) e1 _reset
> 2014-01-23 04:01:08.872391 7fa5fbe21700 10 mon.osd152@0(electing) e1 cancel_probe_timeout (none scheduled)
> 2014-01-23 04:01:08.872392 7fa5fbe21700 10 mon.osd152@0(electing) e1 timecheck_finish
> 2014-01-23 04:01:08.872394 7fa5fbe21700 10 mon.osd152@0(electing) e1 scrub_reset
> 2014-01-23 04:01:08.872396 7fa5fbe21700 10 mon.osd152@0(electing).paxos(paxos recovering c 4353345..4353903) restart -- canceling timeouts
> 2014-01-23 04:01:08.872399 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(pgmap 2388001..2388518) restart
> 2014-01-23 04:01:08.872401 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(mdsmap 1..1) restart
> 2014-01-23 04:01:08.872403 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) restart
> 2014-01-23 04:01:08.872404 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(logm 1793310..1793958) restart
> 2014-01-23 04:01:08.872406 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(monmap 1..1) restart
> 2014-01-23 04:01:08.872407 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(auth 2010..2269) restart
> 2014-01-23 04:01:08.872415 7fa5fbe21700  1 -- 10.193.207.130:6789/0 --> mon.1 10.193.207.131:6789/0 -- election(b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 propose 204009) v4 -- ?+0 0x2d611d40
> 2014-01-23 04:01:08.872434 7fa5fbe21700  1 -- 10.193.207.130:6789/0 --> mon.2 10.194.0.68:6789/0 -- election(b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 propose 204009) v4 -- ?+0 0x41505c40
> 2014-01-23 04:01:08.872456 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 463485294 ==== election(b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 propose 204009) v4 ==== 540+0+0 (3981094898 0 0) 0x2d610000 con 0x301c8e0
> 2014-01-23 04:01:08.872478 7fa5fbe21700  5 mon.osd152@0(electing).elector(204009) handle_propose from mon.2
> 2014-01-23 04:01:08.872485 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.0 10.193.207.130:6789/0 0 ==== log(2 entries) v1 ==== 0+0+0 (0 0 0) 0x1236e540 con 0x3018580
> 2014-01-23 04:01:08.872495 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(logm 1793310..1793958) dispatch log(2 entries) v1 from mon.0 10.193.207.130:6789/0
> 2014-01-23 04:01:08.872501 7fa5fbe21700  1 mon.osd152@0(electing).paxos(paxos recovering c 4353345..4353903) is_readable now=2014-01-23 04:01:08.872502 lease_expire=2014-01-23 04:01:13.871236 has v0 lc 4353903
> 2014-01-23 04:01:08.872509 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(logm 1793310..1793958)  waiting for paxos -> readable (v0)
> 2014-01-23 04:01:08.872511 7fa5fbe21700  1 mon.osd152@0(electing).paxos(paxos recovering c 4353345..4353903) is_readable now=2014-01-23 04:01:08.872512 lease_expire=2014-01-23 04:01:13.871236 has v0 lc 4353903
> 2014-01-23 04:01:08.872516 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.1 10.193.207.131:6789/0 398310803 ==== election(b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 ack 204007) v4 ==== 540+0+0 (382260333 0 0) 0x408b3600 con 0x301cba0
> 2014-01-23 04:01:08.872530 7fa5fbe21700  5 mon.osd152@0(electing).elector(204009) old epoch, dropping
> 2014-01-23 04:01:08.872534 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.0 10.193.207.130:6789/0 0 ==== log(2 entries) v1 ==== 0+0+0 (0 0 0) 0x41504c80 con 0x3018580
> 2014-01-23 04:01:08.872539 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(logm 1793310..1793958) dispatch log(2 entries) v1 from mon.0 10.193.207.130:6789/0
> 2014-01-23 04:01:08.872542 7fa5fbe21700  1 mon.osd152@0(electing).paxos(paxos recovering c 4353345..4353903) is_readable now=2014-01-23 04:01:08.872543 lease_expire=2014-01-23 04:01:13.871236 has v0 lc 4353903
> 2014-01-23 04:01:08.872547 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(logm 1793310..1793958)  waiting for paxos -> readable (v0)
> 2014-01-23 04:01:08.872548 7fa5fbe21700  1 mon.osd152@0(electing).paxos(paxos recovering c 4353345..4353903) is_readable now=2014-01-23 04:01:08.872549 lease_expire=2014-01-23 04:01:13.871236 has v0 lc 4353903
> 2014-01-23 04:01:08.872553 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.1 10.193.207.131:6789/0 398310804 ==== election(b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 propose 204009) v4 ==== 540+0+0 (3981094898 0 0) 0x408b7740 con 0x301cba0
> 2014-01-23 04:01:08.872565 7fa5fbe21700  5 mon.osd152@0(electing).elector(204009) handle_propose from mon.1
> 2014-01-23 04:01:08.872568 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.0 10.193.207.130:6789/0 0 ==== log(last 192924) v1 ==== 0+0+0 (0 0 0) 0x53ccec0 con 0x3018580
> 2014-01-23 04:01:08.872824 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.0 10.193.207.130:6789/0 0 ==== log(2 entries) v1 ==== 0+0+0 (0 0 0) 0x408b33c0 con 0x3018580
> 2014-01-23 04:01:08.872830 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(logm 1793310..1793958) dispatch log(2 entries) v1 from mon.0 10.193.207.130:6789/0
> 2014-01-23 04:01:08.872834 7fa5fbe21700  1 mon.osd152@0(electing).paxos(paxos recovering c 4353345..4353903) is_readable now=2014-01-23 04:01:08.872835 lease_expire=2014-01-23 04:01:13.871236 has v0 lc 4353903
> 2014-01-23 04:01:08.872839 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(logm 1793310..1793958)  waiting for paxos -> readable (v0)
> 2014-01-23 04:01:08.872841 7fa5fbe21700  1 mon.osd152@0(electing).paxos(paxos recovering c 4353345..4353903) is_readable now=2014-01-23 04:01:08.872841 lease_expire=2014-01-23 04:01:13.871236 has v0 lc 4353903
> 2014-01-23 04:01:08.872846 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== osd.448 10.194.0.143:6816/357 1 ==== auth(proto 0 28 bytes epoch 1) v1 ==== 58+0+0 (1794778480 0 0) 0x25398fc0 con 0x25347640
> 2014-01-23 04:01:08.872854 7fa5fbe21700  5 mon.osd152@0(electing) e1 discarding message auth(proto 0 28 bytes epoch 1) v1 and sending client elsewhere
> 2014-01-23 04:01:08.872857 7fa5fbe21700  1 -- 10.193.207.130:6789/0 mark_down 0x25347640 -- 0x21ee4380
> 2014-01-23 04:01:08.874236 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 463485295 ==== election(b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 propose 204011) v4 ==== 540+0+0 (229380923 0 0) 0x1236de80 con 0x301c8e0
> 2014-01-23 04:01:08.874266 7fa5fbe21700  5 mon.osd152@0(electing).elector(204009) handle_propose from mon.2
> 2014-01-23 04:01:08.874270 7fa5fbe21700 10 mon.osd152@0(electing).elector(204009) bump_epoch 204009 to 204011
> 2014-01-23 04:01:08.874755 7fa5fbe21700 10 mon.osd152@0(electing) e1 join_election
> 2014-01-23 04:01:08.874764 7fa5fbe21700 10 mon.osd152@0(electing) e1 _reset
> 2014-01-23 04:01:08.874767 7fa5fbe21700 10 mon.osd152@0(electing) e1 cancel_probe_timeout (none scheduled)
> 2014-01-23 04:01:08.874768 7fa5fbe21700 10 mon.osd152@0(electing) e1 timecheck_finish
> 2014-01-23 04:01:08.874770 7fa5fbe21700 10 mon.osd152@0(electing) e1 scrub_reset
> 2014-01-23 04:01:08.874771 7fa5fbe21700 10 mon.osd152@0(electing).paxos(paxos recovering c 4353345..4353903) restart -- canceling timeouts
> 2014-01-23 04:01:08.874775 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(pgmap 2388001..2388518) restart
> 2014-01-23 04:01:08.874777 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(mdsmap 1..1) restart
> 2014-01-23 04:01:08.874779 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) restart
> 2014-01-23 04:01:08.874780 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(logm 1793310..1793958) restart
> 2014-01-23 04:01:08.874782 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(monmap 1..1) restart
> 2014-01-23 04:01:08.874783 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(auth 2010..2269) restart
> 2014-01-23 04:01:08.874786 7fa5fbe21700 10 mon.osd152@0(electing) e1 start_election
> 2014-01-23 04:01:08.874787 7fa5fbe21700 10 mon.osd152@0(electing) e1 _reset
> 2014-01-23 04:01:08.874788 7fa5fbe21700 10 mon.osd152@0(electing) e1 cancel_probe_timeout (none scheduled)
> 2014-01-23 04:01:08.874790 7fa5fbe21700 10 mon.osd152@0(electing) e1 timecheck_finish
> 2014-01-23 04:01:08.874791 7fa5fbe21700 10 mon.osd152@0(electing) e1 scrub_reset
> 2014-01-23 04:01:08.874793 7fa5fbe21700 10 mon.osd152@0(electing).paxos(paxos recovering c 4353345..4353903) restart -- canceling timeouts
> 2014-01-23 04:01:08.874795 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(pgmap 2388001..2388518) restart
> 2014-01-23 04:01:08.874797 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(mdsmap 1..1) restart
> 2014-01-23 04:01:08.874798 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) restart
> 2014-01-23 04:01:08.874799 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(logm 1793310..1793958) restart
> 2014-01-23 04:01:08.874800 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(monmap 1..1) restart
> 2014-01-23 04:01:08.874801 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(auth 2010..2269) restart
> 2014-01-23 04:01:08.874803 7fa5fbe21700 10 mon.osd152@0(electing) e1 cancel_probe_timeout (none scheduled)
> 2014-01-23 04:01:08.874807 7fa5fbe21700  0 log [INF] : mon.osd152 calling new monitor election
> 2014-01-23 04:01:08.874824 7fa5fbe21700  1 -- 10.193.207.130:6789/0 --> mon.0 10.193.207.130:6789/0 -- log(2 entries) v1 -- ?+0 0x408b7740
> 2014-01-23 04:01:08.874831 7fa5fbe21700  5 mon.osd152@0(electing).elector(204011) start -- can i be leader?
> 2014-01-23 04:01:08.874867 7fa5fbe21700  1 mon.osd152@0(electing).elector(204011) init, last seen epoch 204011
> 2014-01-23 04:01:08.874873 7fa5fbe21700  1 -- 10.193.207.130:6789/0 --> mon.1 10.193.207.131:6789/0 -- election(b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 propose 204011) v4 -- ?+0 0x25398fc0
> 2014-01-23 04:01:08.874887 7fa5fbe21700  1 -- 10.193.207.130:6789/0 --> mon.2 10.194.0.68:6789/0 -- election(b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 propose 204011) v4 -- ?+0 0x408b4800
> 2014-01-23 04:01:08.874909 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 463485296 ==== election(b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 ack 204011) v4 ==== 540+0+0 (1598534457 0 0) 0x2d6133c0 con 0x301c8e0
> 2014-01-23 04:01:08.874932 7fa5fbe21700  5 mon.osd152@0(electing).elector(204011) handle_ack from mon.2
> 2014-01-23 04:01:08.874936 7fa5fbe21700  5 mon.osd152@0(electing).elector(204011)  so far i have {0=34359738367,2=34359738367}
> 2014-01-23 04:01:08.874943 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.0 10.193.207.130:6789/0 0 ==== log(2 entries) v1 ==== 0+0+0 (0 0 0) 0x408b7740 con 0x3018580
> 2014-01-23 04:01:08.874959 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(logm 1793310..1793958) dispatch log(2 entries) v1 from mon.0 10.193.207.130:6789/0
> 2014-01-23 04:01:08.874963 7fa5fbe21700  1 mon.osd152@0(electing).paxos(paxos recovering c 4353345..4353903) is_readable now=2014-01-23 04:01:08.874964 lease_expire=2014-01-23 04:01:13.871236 has v0 lc 4353903
> 2014-01-23 04:01:08.874969 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(logm 1793310..1793958)  waiting for paxos -> readable (v0)
> 2014-01-23 04:01:08.874971 7fa5fbe21700  1 mon.osd152@0(electing).paxos(paxos recovering c 4353345..4353903) is_readable now=2014-01-23 04:01:08.874971 lease_expire=2014-01-23 04:01:13.871236 has v0 lc 4353903
> 2014-01-23 04:01:08.875528 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 463485297 ==== election(b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 ack 204011) v4 ==== 540+0+0 (1598534457 0 0) 0x2d613600 con 0x301c8e0
> 2014-01-23 04:01:08.875553 7fa5fbe21700  5 mon.osd152@0(electing).elector(204011) handle_ack from mon.2
> 2014-01-23 04:01:08.875556 7fa5fbe21700  5 mon.osd152@0(electing).elector(204011)  so far i have {0=34359738367,2=34359738367}
> 2014-01-23 04:01:13.875028 7fa5fc822700  5 mon.osd152@0(electing).elector(204011) election timer expired
> 2014-01-23 04:01:13.875043 7fa5fc822700 10 mon.osd152@0(electing).elector(204011) bump_epoch 204011 to 204012
> 2014-01-23 04:01:13.875375 7fa5fc822700 10 mon.osd152@0(electing) e1 join_election
> 2014-01-23 04:01:13.875386 7fa5fc822700 10 mon.osd152@0(electing) e1 _reset
> 2014-01-23 04:01:13.875389 7fa5fc822700 10 mon.osd152@0(electing) e1 cancel_probe_timeout (none scheduled)
> 2014-01-23 04:01:13.875390 7fa5fc822700 10 mon.osd152@0(electing) e1 timecheck_finish
> 2014-01-23 04:01:13.875393 7fa5fc822700 10 mon.osd152@0(electing) e1 scrub_reset
> 2014-01-23 04:01:13.875395 7fa5fc822700 10 mon.osd152@0(electing).paxos(paxos recovering c 4353345..4353903) restart -- canceling timeouts
> 2014-01-23 04:01:13.875401 7fa5fc822700 10 mon.osd152@0(electing).paxosservice(pgmap 2388001..2388518) restart
> 2014-01-23 04:01:13.875404 7fa5fc822700 10 mon.osd152@0(electing).paxosservice(mdsmap 1..1) restart
> 2014-01-23 04:01:13.875407 7fa5fc822700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) restart
> 2014-01-23 04:01:13.875409 7fa5fc822700 10 mon.osd152@0(electing).paxosservice(logm 1793310..1793958) restart
> 2014-01-23 04:01:13.875411 7fa5fc822700 10 mon.osd152@0(electing).paxosservice(monmap 1..1) restart
> 2014-01-23 04:01:13.875413 7fa5fc822700 10 mon.osd152@0(electing).paxosservice(auth 2010..2269) restart
> 2014-01-23 04:01:13.875423 7fa5fc822700  1 -- 10.193.207.130:6789/0 --> mon.2 10.194.0.68:6789/0 -- election(b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 victory 204012) v4 -- ?+0 0x248bf980
> 2014-01-23 04:01:13.875445 7fa5fc822700 10 mon.osd152@0(electing) e1 win_election epoch 204012 quorum 0,2 features 34359738367
> 2014-01-23 04:01:13.875457 7fa5fc822700  0 log [INF] : mon.osd152@0 won leader election with quorum 0,2
> 2014-01-23 04:01:13.875481 7fa5fc822700  1 -- 10.193.207.130:6789/0 --> mon.0 10.193.207.130:6789/0 -- log(2 entries) v1 -- ?+0 0x248bc5c0
> 2014-01-23 04:01:13.875498 7fa5fc822700 10 mon.osd152@0(leader).paxos(paxos recovering c 4353345..4353903) leader_init -- starting paxos recovery
> 2014-01-23 04:01:13.875558 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.0 10.193.207.130:6789/0 0 ==== log(2 entries) v1 ==== 0+0+0 (0 0 0) 0x248bc5c0 con 0x3018580
> 2014-01-23 04:01:13.875690 7fa5fc822700 10 mon.osd152@0(leader).paxos(paxos recovering c 4353345..4353903) get_new_proposal_number = 9772500
> 2014-01-23 04:01:13.875703 7fa5fc822700 10 mon.osd152@0(leader).paxos(paxos recovering c 4353345..4353903) collect with pn 9772500
> 2014-01-23 04:01:13.875708 7fa5fc822700  1 -- 10.193.207.130:6789/0 --> mon.2 10.194.0.68:6789/0 -- paxos(collect lc 4353903 fc 4353345 pn 9772500 opn 0) v3 -- ?+0 0x167ceb80
> 2014-01-23 04:01:13.875725 7fa5fc822700 10 mon.osd152@0(leader).paxosservice(pgmap 2388001..2388518) election_finished
> 2014-01-23 04:01:13.875729 7fa5fc822700 10 mon.osd152@0(leader).paxosservice(pgmap 2388001..2388518) _active - not active
> 2014-01-23 04:01:13.875731 7fa5fc822700 10 mon.osd152@0(leader).paxosservice(mdsmap 1..1) election_finished
> 2014-01-23 04:01:13.875733 7fa5fc822700 10 mon.osd152@0(leader).paxosservice(mdsmap 1..1) _active - not active
> 2014-01-23 04:01:13.875735 7fa5fc822700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) election_finished
> 2014-01-23 04:01:13.875737 7fa5fc822700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) _active - not active
> 2014-01-23 04:01:13.875740 7fa5fc822700 10 mon.osd152@0(leader).paxosservice(logm 1793310..1793958) election_finished
> 2014-01-23 04:01:13.875741 7fa5fc822700 10 mon.osd152@0(leader).paxosservice(logm 1793310..1793958) _active - not active
> 2014-01-23 04:01:13.875743 7fa5fc822700 10 mon.osd152@0(leader).paxosservice(monmap 1..1) election_finished
> 2014-01-23 04:01:13.875745 7fa5fc822700 10 mon.osd152@0(leader).paxosservice(monmap 1..1) _active - not active
> 2014-01-23 04:01:13.875747 7fa5fc822700 10 mon.osd152@0(leader).paxosservice(auth 2010..2269) election_finished
> 
> 


* Re: Ceph cluster is unreachable because of authentication failure
       [not found]                                 ` <alpine.DEB.2.00.1401230630150.18304-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
@ 2014-01-24 11:41                                   ` Guang
       [not found]                                     ` <BLU0-SMTP461882F11F7B298D3D58ED9DFA10-MsuGFMq8XAE@public.gmane.org>
  0 siblings, 1 reply; 14+ messages in thread
From: Guang @ 2014-01-24 11:41 UTC (permalink / raw)
  To: Sage Weil
  Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org,
	Ceph Development



Thanks Sage.

Even after I stopped mon.1 (10.193.207.131), the problem persisted: the two remaining mons kept electing.
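
To put a number on "kept electing", a rough sketch like the one below can pull the election messages out of a monitor log and measure how long each round takes. The sample lines are copied from the mon.osd152 log quoted in this thread; the function names and the rule for pairing a "calling new monitor election" line with the next "won leader election" line are my own assumptions, not anything Ceph provides.

```python
import re
from datetime import datetime

# Sample lines copied from the mon.osd152 log quoted in this thread.
SAMPLE_LOG = """\
2014-01-23 04:01:08.871709 7fa5fbe21700  0 log [INF] : mon.osd152 calling new monitor election
2014-01-23 04:01:08.874807 7fa5fbe21700  0 log [INF] : mon.osd152 calling new monitor election
2014-01-23 04:01:13.875457 7fa5fc822700  0 log [INF] : mon.osd152@0 won leader election with quorum 0,2
"""

# timestamp, thread id, log level, then the cluster-log text we care about
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) \S+\s+\d+ log \[INF\] : "
    r"(\S+) (calling new monitor election|won leader election)"
)


def election_events(lines):
    """Return (timestamp, monitor, kind) tuples; kind is 'call' or 'win'."""
    events = []
    for line in lines:
        # Drop mail quote markers ("> ", ">> ") if the log was pasted from a reply.
        m = LINE_RE.match(line.lstrip("> "))
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
            kind = "call" if m.group(3).startswith("calling") else "win"
            events.append((ts, m.group(2), kind))
    return events


def election_durations(events):
    """Seconds from the first 'call' of a round to the next 'win' (assumed pairing)."""
    durations = []
    start = None
    for ts, _mon, kind in events:
        if kind == "call" and start is None:
            start = ts
        elif kind == "win" and start is not None:
            durations.append((ts - start).total_seconds())
            start = None
    return durations


if __name__ == "__main__":
    events = election_events(SAMPLE_LOG.splitlines())
    for seconds in election_durations(events):
        print(f"election round took {seconds:.3f}s")
```

Run against a full monitor log (read the file and pass its lines to `election_events`), the gaps between successive "win" timestamps would also show how long the quorum stays stable between rounds.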

I checked the network and the clocks, and both look fine.

One problem I noticed is that the %CPU of the monitor elected as leader is always around 100%, whether we have 3 monitors in or 2, while the other monitors' %CPU stays below 10%.

I ran *pstack* against the busy one and got the result below; it looks like thread #25 is what keeps it busy.

Meanwhile, I kept seeing entries like '2014-01-24 11:03:18.959087 7f87eff4b700  0 mon.osd153@1(leader) e1 handle_command mon_command({"prefix": "osd crush create-or-move", "args": ["root=default", "host=osd84"], "id": 725, "weight": 3.6299999999999999} v 0) v1' in the leader's log, even though all OSDs were stopped at the time. Is this some kind of event replay?

bash-4.1$ sudo pstack 1159
Thread 30 (Thread 0x7f87f7619700 (LWP 1160)):
#0  0x00000037a980b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000007b7acb in ceph::log::Log::entry() ()
#2  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#3  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 29 (Thread 0x7f87f64bd700 (LWP 1161)):
#0  0x00000037a980d811 in sem_timedwait () from /lib64/libpthread.so.0
#1  0x000000000070de60 in CephContextServiceThread::entry() ()
#2  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#3  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 28 (Thread 0x7f87f5abc700 (LWP 1162)):
#0  0x00000037a94df253 in poll () from /lib64/libc.so.6
#1  0x0000000000705d5a in AdminSocket::entry() ()
#2  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#3  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 27 (Thread 0x7f87f134d700 (LWP 1163)):
#0  0x00000037a980b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000731d31 in SimpleMessenger::reaper_entry() ()
#2  0x000000000073475d in SimpleMessenger::ReaperThread::entry() ()
#3  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#4  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 26 (Thread 0x7f87f094c700 (LWP 1164)):
#0  0x00000037a980e054 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00000037a9810e1d in _L_cond_lock_886 () from /lib64/libpthread.so.0
#2  0x00000037a9810cf7 in __pthread_mutex_cond_lock () from /lib64/libpthread.so.0
#3  0x00000037a980b875 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#4  0x000000000072715e in SafeTimer::timer_thread() ()
#5  0x0000000000728fbd in SafeTimerThread::entry() ()
#6  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#7  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 25 (Thread 0x7f87eff4b700 (LWP 1165)):
#0  0x0000000000707e8e in crush_hash32_3 ()
#1  0x0000000000734c14 in crush_choose ()
#2  0x000000000073518f in crush_do_rule ()
#3  0x000000000068a8a3 in CrushWrapper::do_rule(int, int, std::vector<int, std::allocator<int> >&, int, std::vector<unsigned int, std::allocator<unsigned int> > const&) const ()
#4  0x0000000000720dff in OSDMap::_pg_to_osds(pg_pool_t const&, pg_t, std::vector<int, std::allocator<int> >&) const ()
#5  0x0000000000720ff1 in OSDMap::pg_to_raw_up(pg_t, std::vector<int, std::allocator<int> >&) const ()
#6  0x0000000000590ee6 in OSDMonitor::remove_redundant_pg_temp() ()
#7  0x00000000005adfb5 in OSDMonitor::create_pending() ()
#8  0x000000000058cca9 in PaxosService::_active() ()
#9  0x000000000056063a in Context::complete(int) ()
#10 0x000000000056424d in finish_contexts(CephContext*, std::list<Context*, std::allocator<Context*> >&, int) ()
#11 0x000000000058461e in Paxos::handle_last(MMonPaxos*) ()
#12 0x000000000058579b in Paxos::dispatch(PaxosServiceMessage*) ()
#13 0x000000000055c9ca in Monitor::_ms_dispatch(Message*) ()
#14 0x0000000000578502 in Monitor::ms_dispatch(Message*) ()
#15 0x00000000007bc552 in DispatchQueue::entry() ()
#16 0x000000000073268d in DispatchQueue::DispatchThread::entry() ()
#17 0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#18 0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 24 (Thread 0x7f87ef54a700 (LWP 1166)):
#0  0x00000037a94df253 in poll () from /lib64/libc.so.6
#1  0x00000000007db3f9 in Accepter::entry() ()
#2  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#3  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 23 (Thread 0x7f87eea48700 (LWP 1168)):
#0  0x00000037a980b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000007ccc89 in Pipe::writer() ()
#2  0x00000000007da52d in Pipe::Writer::entry() ()
#3  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#4  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 22 (Thread 0x7f87ee947700 (LWP 1169)):
#0  0x00000037a94df253 in poll () from /lib64/libc.so.6
#1  0x0000000000646726 in SignalHandler::entry() ()
#2  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#3  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 21 (Thread 0x7f87eeb49700 (LWP 1173)):
#0  0x00000037a980b7bb in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000007c7147 in Pipe::fault(bool) ()
#2  0x00000000007c97a2 in Pipe::connect() ()
#3  0x00000000007ccf63 in Pipe::writer() ()
#4  0x00000000007da52d in Pipe::Writer::entry() ()
#5  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#6  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 20 (Thread 0x7f87edc33700 (LWP 1195)):
#0  0x00000037a94df253 in poll () from /lib64/libc.so.6
#1  0x00000000007c08f9 in Pipe::tcp_read_wait() ()
#2  0x00000000007c8755 in Pipe::tcp_read(char*, int) ()
#3  0x00000000007d757f in Pipe::reader() ()
#4  0x00000000007da54d in Pipe::Reader::entry() ()
#5  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#6  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 19 (Thread 0x7f87edb32700 (LWP 1196)):
#0  0x00000037a980b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000007ccc89 in Pipe::writer() ()
#2  0x00000000007da52d in Pipe::Writer::entry() ()
#3  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#4  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 18 (Thread 0x7f87eda31700 (LWP 1197)):
#0  0x00000037a94df253 in poll () from /lib64/libc.so.6
#1  0x00000000007c08f9 in Pipe::tcp_read_wait() ()
#2  0x00000000007c8755 in Pipe::tcp_read(char*, int) ()
#3  0x00000000007d757f in Pipe::reader() ()
#4  0x00000000007da54d in Pipe::Reader::entry() ()
#5  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#6  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 17 (Thread 0x7f87ed930700 (LWP 1198)):
#0  0x00000037a980b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000007ccc89 in Pipe::writer() ()
#2  0x00000000007da52d in Pipe::Writer::entry() ()
#3  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#4  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 16 (Thread 0x7f87edd34700 (LWP 1205)):
#0  0x00000037a94df253 in poll () from /lib64/libc.so.6
#1  0x00000000007c08f9 in Pipe::tcp_read_wait() ()
#2  0x00000000007c8755 in Pipe::tcp_read(char*, int) ()
#3  0x00000000007d757f in Pipe::reader() ()
#4  0x00000000007da54d in Pipe::Reader::entry() ()
#5  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#6  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 15 (Thread 0x7f87ede35700 (LWP 1211)):
#0  0x00000037a94df253 in poll () from /lib64/libc.so.6
#1  0x00000000007c08f9 in Pipe::tcp_read_wait() ()
#2  0x00000000007c8755 in Pipe::tcp_read(char*, int) ()
#3  0x00000000007d757f in Pipe::reader() ()
#4  0x00000000007da54d in Pipe::Reader::entry() ()
#5  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#6  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 14 (Thread 0x7f87ed82f700 (LWP 1212)):
#0  0x00000037a980b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000007ccc89 in Pipe::writer() ()
#2  0x00000000007da52d in Pipe::Writer::entry() ()
#3  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#4  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 13 (Thread 0x7f87ed72e700 (LWP 1213)):
#0  0x00000037a94df253 in poll () from /lib64/libc.so.6
#1  0x00000000007c08f9 in Pipe::tcp_read_wait() ()
#2  0x00000000007c8755 in Pipe::tcp_read(char*, int) ()
#3  0x00000000007d757f in Pipe::reader() ()
#4  0x00000000007da54d in Pipe::Reader::entry() ()
#5  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#6  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 12 (Thread 0x7f87ed62d700 (LWP 1214)):
#0  0x00000037a980b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000007ccc89 in Pipe::writer() ()
#2  0x00000000007da52d in Pipe::Writer::entry() ()
#3  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#4  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 11 (Thread 0x7f87ed52c700 (LWP 1215)):
#0  0x00000037a94df253 in poll () from /lib64/libc.so.6
#1  0x00000000007c08f9 in Pipe::tcp_read_wait() ()
#2  0x00000000007c8755 in Pipe::tcp_read(char*, int) ()
#3  0x00000000007d757f in Pipe::reader() ()
#4  0x00000000007da54d in Pipe::Reader::entry() ()
#5  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#6  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 10 (Thread 0x7f87ed42b700 (LWP 1216)):
#0  0x00000037a980b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000007ccc89 in Pipe::writer() ()
#2  0x00000000007da52d in Pipe::Writer::entry() ()
#3  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#4  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 9 (Thread 0x7f87ed32a700 (LWP 1217)):
#0  0x00000037a94df253 in poll () from /lib64/libc.so.6
#1  0x00000000007c08f9 in Pipe::tcp_read_wait() ()
#2  0x00000000007c8755 in Pipe::tcp_read(char*, int) ()
#3  0x00000000007d757f in Pipe::reader() ()
#4  0x00000000007da54d in Pipe::Reader::entry() ()
#5  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#6  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 8 (Thread 0x7f87ed229700 (LWP 1218)):
#0  0x00000037a980b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000007ccc89 in Pipe::writer() ()
#2  0x00000000007da52d in Pipe::Writer::entry() ()
#3  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#4  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 7 (Thread 0x7f87ecf26700 (LWP 1221)):
#0  0x00000037a94df253 in poll () from /lib64/libc.so.6
#1  0x00000000007c08f9 in Pipe::tcp_read_wait() ()
#2  0x00000000007c8755 in Pipe::tcp_read(char*, int) ()
#3  0x00000000007d757f in Pipe::reader() ()
#4  0x00000000007da54d in Pipe::Reader::entry() ()
#5  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#6  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 6 (Thread 0x7f87ece25700 (LWP 1222)):
#0  0x00000037a980b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000007ccc89 in Pipe::writer() ()
#2  0x00000000007da52d in Pipe::Writer::entry() ()
#3  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#4  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 5 (Thread 0x7f87ecd24700 (LWP 1223)):
#0  0x00000037a94df253 in poll () from /lib64/libc.so.6
#1  0x00000000007c08f9 in Pipe::tcp_read_wait() ()
#2  0x00000000007c8755 in Pipe::tcp_read(char*, int) ()
#3  0x00000000007d757f in Pipe::reader() ()
#4  0x00000000007da54d in Pipe::Reader::entry() ()
#5  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#6  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 4 (Thread 0x7f87ecc23700 (LWP 1224)):
#0  0x00000037a980b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000007ccc89 in Pipe::writer() ()
#2  0x00000000007da52d in Pipe::Writer::entry() ()
#3  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#4  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7f87ecb22700 (LWP 1225)):
#0  0x00000037a94df253 in poll () from /lib64/libc.so.6
#1  0x00000000007c08f9 in Pipe::tcp_read_wait() ()
#2  0x00000000007c8755 in Pipe::tcp_read(char*, int) ()
#3  0x00000000007d757f in Pipe::reader() ()
#4  0x00000000007da54d in Pipe::Reader::entry() ()
#5  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#6  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7f87eca21700 (LWP 1226)):
#0  0x00000037a980b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000007ccc89 in Pipe::writer() ()
#2  0x00000000007da52d in Pipe::Writer::entry() ()
#3  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#4  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f87f761a7a0 (LWP 1159)):
#0  0x00000037a98080ad in pthread_join () from /lib64/libpthread.so.0
#1  0x0000000000735352 in Thread::join(void**) ()
#2  0x000000000072f8ba in SimpleMessenger::wait() ()
#3  0x00000000005202fa in main ()
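
For anyone who wants to repeat this check, here is a rough sketch that scans pstack output and lists the threads whose innermost frame is not an obvious blocking call. The sample is abbreviated from the output above. The set of "blocking" function names is my own assumption based on what appears in this trace, so treat it as a heuristic rather than a definitive test.

```python
import re

# Abbreviated from the pstack output above: two waiting threads and the busy one.
SAMPLE_PSTACK = """\
Thread 26 (Thread 0x7f87f094c700 (LWP 1164)):
#0  0x00000037a980e054 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00000037a9810e1d in _L_cond_lock_886 () from /lib64/libpthread.so.0
Thread 25 (Thread 0x7f87eff4b700 (LWP 1165)):
#0  0x0000000000707e8e in crush_hash32_3 ()
#1  0x0000000000734c14 in crush_choose ()
#2  0x000000000073518f in crush_do_rule ()
Thread 24 (Thread 0x7f87ef54a700 (LWP 1166)):
#0  0x00000037a94df253 in poll () from /lib64/libc.so.6
#1  0x00000000007db3f9 in Accepter::entry() ()
"""

THREAD_RE = re.compile(r"^Thread (\d+) \(Thread 0x[0-9a-f]+ \(LWP (\d+)\)\):")
FRAME0_RE = re.compile(r"^#0\s+0x[0-9a-f]+ in (\S+) \(")

# Assumption: a thread whose innermost frame is one of these is waiting, not using CPU.
BLOCKING = {"poll", "sem_timedwait", "pthread_join", "__lll_lock_wait"}


def is_blocking(func):
    """Heuristic: known wait calls, plus any pthread_cond_* variant."""
    return func in BLOCKING or func.startswith("pthread_cond_")


def running_threads(pstack_text):
    """Return (thread number, LWP, innermost function) for threads not blocked."""
    result = []
    current = None
    for line in pstack_text.splitlines():
        t = THREAD_RE.match(line)
        if t:
            current = (int(t.group(1)), int(t.group(2)))
            continue
        f = FRAME0_RE.match(line)
        if f and current is not None and not is_blocking(f.group(1)):
            result.append((current[0], current[1], f.group(1)))
    return result


if __name__ == "__main__":
    for num, lwp, func in running_threads(SAMPLE_PSTACK):
        print(f"thread {num} (LWP {lwp}) is running in {func}")
```

A single snapshot only shows where each thread happened to be at that instant, so it is worth taking several and checking that the same thread keeps showing up. The LWP number can also be matched against the per-thread view of `top -H -p <pid>`.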


On Jan 23, 2014, at 10:32 PM, Sage Weil <sage-4GqslpFJ+cxBDgjK7y7TUQ@public.gmane.org> wrote:

> On Thu, 23 Jan 2014, Guang wrote:
>> Hi Joao,
>> Thanks for your reply!
>> 
>> I captured the log after seeing the 'noin' keyword and the log is attached.
>> 
>> Meanwhile, while checking the monitor logs, I see that it holds an election every few seconds and each election can take several seconds, so the cluster is electing almost all the time (could that be the root cause of the out-of-date cluster status and the command failures?).
>> 
>> I captured one round of election logs, shown below; please help check it. Thanks very much.
> 
> It looks like this election was triggered by a slow or misbehaving mon.1 
> (10.193.207.131).  There isn't enough before or after to know what the 
> pattern is (does it always ack the previous election but slowly?  maybe 
> its clock is off;  is its host overloaded or is it over a very slow 
> link?).  I suspect, though, that you can get the cluster into quorum by 
> just stopping the daemon.  Double-check the clock sync and then try to 
> join it in.  Even with it down temporarily, though, the mons will become 
> available.
> 
> sage
> 
>> 
>> 2014-01-23 04:01:08.871622 7fa5fbe21700  5 mon.osd152@0(leader).elector(204008) handle_propose from mon.1
>> 2014-01-23 04:01:08.871625 7fa5fbe21700  5 mon.osd152@0(leader).elector(204008)  got propose from old epoch, quorum is 0,2, mon.1 must have just started
>> 2014-01-23 04:01:08.871627 7fa5fbe21700 10 mon.osd152@0(leader) e1 start_election
>> 2014-01-23 04:01:08.871629 7fa5fbe21700 10 mon.osd152@0(electing) e1 _reset
>> 2014-01-23 04:01:08.871635 7fa5fbe21700 10 mon.osd152@0(electing) e1 cancel_probe_timeout (none scheduled)
>> 2014-01-23 04:01:08.871636 7fa5fbe21700 10 mon.osd152@0(electing) e1 timecheck_finish
>> 2014-01-23 04:01:08.871640 7fa5fbe21700 10 mon.osd152@0(electing) e1 scrub_reset
>> 2014-01-23 04:01:08.871642 7fa5fbe21700 10 mon.osd152@0(electing).paxos(paxos active c 4353345..4353903) restart -- canceling timeouts
>> 2014-01-23 04:01:08.871646 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(pgmap 2388001..2388518) restart
>> 2014-01-23 04:01:08.871649 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(mdsmap 1..1) restart
>> 2014-01-23 04:01:08.871651 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) restart
>> 2014-01-23 04:01:08.871653 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(logm 1793310..1793958) restart
>> 2014-01-23 04:01:08.871654 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(logm 1793310..1793958)  canceling proposal_timer 0x47d3e40
>> 2014-01-23 04:01:08.871657 7fa5fbe21700  7 mon.osd152@0(electing).log v1793958 _updated_log for mon.2 10.194.0.68:6789/0
>> 2014-01-23 04:01:08.871663 7fa5fbe21700  1 -- 10.193.207.130:6789/0 --> 10.194.0.68:6789/0 -- route(log(last 39415) v1 tid 3682233) v2 -- ?+0 0x1236de80 con 0x301c8e0
>> 2014-01-23 04:01:08.871677 7fa5fbe21700  7 mon.osd152@0(electing).log v1793958 _updated_log for mon.0 10.193.207.130:6789/0
>> 2014-01-23 04:01:08.871681 7fa5fbe21700  1 -- 10.193.207.130:6789/0 --> 10.193.207.130:6789/0 -- log(last 192924) v1 -- ?+0 0x53ccec0 con 0x3018580
>> 2014-01-23 04:01:08.871690 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(monmap 1..1) restart
>> 2014-01-23 04:01:08.871693 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(auth 2010..2269) restart
>> 2014-01-23 04:01:08.871696 7fa5fbe21700 10 mon.osd152@0(electing) e1 cancel_probe_timeout (none scheduled)
>> 2014-01-23 04:01:08.871709 7fa5fbe21700  0 log [INF] : mon.osd152 calling new monitor election
>> 2014-01-23 04:01:08.871732 7fa5fbe21700  1 -- 10.193.207.130:6789/0 --> mon.0 10.193.207.130:6789/0 -- log(2 entries) v1 -- ?+0 0x408b33c0
>> 2014-01-23 04:01:08.871739 7fa5fbe21700  5 mon.osd152@0(electing).elector(204008) start -- can i be leader?
>> 2014-01-23 04:01:08.871834 7fa5fbe21700  1 mon.osd152@0(electing).elector(204008) init, last seen epoch 204008
>> 2014-01-23 04:01:08.871836 7fa5fbe21700 10 mon.osd152@0(electing).elector(204008) bump_epoch 204008 to 204009
>> 2014-01-23 04:01:08.872379 7fa5fbe21700 10 mon.osd152@0(electing) e1 join_election
>> 2014-01-23 04:01:08.872389 7fa5fbe21700 10 mon.osd152@0(electing) e1 _reset
>> 2014-01-23 04:01:08.872391 7fa5fbe21700 10 mon.osd152@0(electing) e1 cancel_probe_timeout (none scheduled)
>> 2014-01-23 04:01:08.872392 7fa5fbe21700 10 mon.osd152@0(electing) e1 timecheck_finish
>> 2014-01-23 04:01:08.872394 7fa5fbe21700 10 mon.osd152@0(electing) e1 scrub_reset
>> 2014-01-23 04:01:08.872396 7fa5fbe21700 10 mon.osd152@0(electing).paxos(paxos recovering c 4353345..4353903) restart -- canceling timeouts
>> 2014-01-23 04:01:08.872399 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(pgmap 2388001..2388518) restart
>> 2014-01-23 04:01:08.872401 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(mdsmap 1..1) restart
>> 2014-01-23 04:01:08.872403 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) restart
>> 2014-01-23 04:01:08.872404 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(logm 1793310..1793958) restart
>> 2014-01-23 04:01:08.872406 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(monmap 1..1) restart
>> 2014-01-23 04:01:08.872407 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(auth 2010..2269) restart
>> 2014-01-23 04:01:08.872415 7fa5fbe21700  1 -- 10.193.207.130:6789/0 --> mon.1 10.193.207.131:6789/0 -- election(b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 propose 204009) v4 -- ?+0 0x2d611d40
>> 2014-01-23 04:01:08.872434 7fa5fbe21700  1 -- 10.193.207.130:6789/0 --> mon.2 10.194.0.68:6789/0 -- election(b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 propose 204009) v4 -- ?+0 0x41505c40
>> 2014-01-23 04:01:08.872456 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 463485294 ==== election(b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 propose 204009) v4 ==== 540+0+0 (3981094898 0 0) 0x2d610000 con 0x301c8e0
>> 2014-01-23 04:01:08.872478 7fa5fbe21700  5 mon.osd152@0(electing).elector(204009) handle_propose from mon.2
>> 2014-01-23 04:01:08.872485 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.0 10.193.207.130:6789/0 0 ==== log(2 entries) v1 ==== 0+0+0 (0 0 0) 0x1236e540 con 0x3018580
>> 2014-01-23 04:01:08.872495 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(logm 1793310..1793958) dispatch log(2 entries) v1 from mon.0 10.193.207.130:6789/0
>> 2014-01-23 04:01:08.872501 7fa5fbe21700  1 mon.osd152@0(electing).paxos(paxos recovering c 4353345..4353903) is_readable now=2014-01-23 04:01:08.872502 lease_expire=2014-01-23 04:01:13.871236 has v0 lc 4353903
>> 2014-01-23 04:01:08.872509 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(logm 1793310..1793958)  waiting for paxos -> readable (v0)
>> 2014-01-23 04:01:08.872511 7fa5fbe21700  1 mon.osd152@0(electing).paxos(paxos recovering c 4353345..4353903) is_readable now=2014-01-23 04:01:08.872512 lease_expire=2014-01-23 04:01:13.871236 has v0 lc 4353903
>> 2014-01-23 04:01:08.872516 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.1 10.193.207.131:6789/0 398310803 ==== election(b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 ack 204007) v4 ==== 540+0+0 (382260333 0 0) 0x408b3600 con 0x301cba0
>> 2014-01-23 04:01:08.872530 7fa5fbe21700  5 mon.osd152@0(electing).elector(204009) old epoch, dropping
>> 2014-01-23 04:01:08.872534 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.0 10.193.207.130:6789/0 0 ==== log(2 entries) v1 ==== 0+0+0 (0 0 0) 0x41504c80 con 0x3018580
>> 2014-01-23 04:01:08.872539 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(logm 1793310..1793958) dispatch log(2 entries) v1 from mon.0 10.193.207.130:6789/0
>> 2014-01-23 04:01:08.872542 7fa5fbe21700  1 mon.osd152@0(electing).paxos(paxos recovering c 4353345..4353903) is_readable now=2014-01-23 04:01:08.872543 lease_expire=2014-01-23 04:01:13.871236 has v0 lc 4353903
>> 2014-01-23 04:01:08.872547 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(logm 1793310..1793958)  waiting for paxos -> readable (v0)
>> 2014-01-23 04:01:08.872548 7fa5fbe21700  1 mon.osd152@0(electing).paxos(paxos recovering c 4353345..4353903) is_readable now=2014-01-23 04:01:08.872549 lease_expire=2014-01-23 04:01:13.871236 has v0 lc 4353903
>> 2014-01-23 04:01:08.872553 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.1 10.193.207.131:6789/0 398310804 ==== election(b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 propose 204009) v4 ==== 540+0+0 (3981094898 0 0) 0x408b7740 con 0x301cba0
>> 2014-01-23 04:01:08.872565 7fa5fbe21700  5 mon.osd152@0(electing).elector(204009) handle_propose from mon.1
>> 2014-01-23 04:01:08.872568 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.0 10.193.207.130:6789/0 0 ==== log(last 192924) v1 ==== 0+0+0 (0 0 0) 0x53ccec0 con 0x3018580
>> 2014-01-23 04:01:08.872824 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.0 10.193.207.130:6789/0 0 ==== log(2 entries) v1 ==== 0+0+0 (0 0 0) 0x408b33c0 con 0x3018580
>> 2014-01-23 04:01:08.872830 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(logm 1793310..1793958) dispatch log(2 entries) v1 from mon.0 10.193.207.130:6789/0
>> 2014-01-23 04:01:08.872834 7fa5fbe21700  1 mon.osd152@0(electing).paxos(paxos recovering c 4353345..4353903) is_readable now=2014-01-23 04:01:08.872835 lease_expire=2014-01-23 04:01:13.871236 has v0 lc 4353903
>> 2014-01-23 04:01:08.872839 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(logm 1793310..1793958)  waiting for paxos -> readable (v0)
>> 2014-01-23 04:01:08.872841 7fa5fbe21700  1 mon.osd152@0(electing).paxos(paxos recovering c 4353345..4353903) is_readable now=2014-01-23 04:01:08.872841 lease_expire=2014-01-23 04:01:13.871236 has v0 lc 4353903
>> 2014-01-23 04:01:08.872846 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== osd.448 10.194.0.143:6816/357 1 ==== auth(proto 0 28 bytes epoch 1) v1 ==== 58+0+0 (1794778480 0 0) 0x25398fc0 con 0x25347640
>> 2014-01-23 04:01:08.872854 7fa5fbe21700  5 mon.osd152@0(electing) e1 discarding message auth(proto 0 28 bytes epoch 1) v1 and sending client elsewhere
>> 2014-01-23 04:01:08.872857 7fa5fbe21700  1 -- 10.193.207.130:6789/0 mark_down 0x25347640 -- 0x21ee4380
>> 2014-01-23 04:01:08.874236 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 463485295 ==== election(b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 propose 204011) v4 ==== 540+0+0 (229380923 0 0) 0x1236de80 con 0x301c8e0
>> 2014-01-23 04:01:08.874266 7fa5fbe21700  5 mon.osd152@0(electing).elector(204009) handle_propose from mon.2
>> 2014-01-23 04:01:08.874270 7fa5fbe21700 10 mon.osd152@0(electing).elector(204009) bump_epoch 204009 to 204011
>> 2014-01-23 04:01:08.874755 7fa5fbe21700 10 mon.osd152@0(electing) e1 join_election
>> 2014-01-23 04:01:08.874764 7fa5fbe21700 10 mon.osd152@0(electing) e1 _reset
>> 2014-01-23 04:01:08.874767 7fa5fbe21700 10 mon.osd152@0(electing) e1 cancel_probe_timeout (none scheduled)
>> 2014-01-23 04:01:08.874768 7fa5fbe21700 10 mon.osd152@0(electing) e1 timecheck_finish
>> 2014-01-23 04:01:08.874770 7fa5fbe21700 10 mon.osd152@0(electing) e1 scrub_reset
>> 2014-01-23 04:01:08.874771 7fa5fbe21700 10 mon.osd152@0(electing).paxos(paxos recovering c 4353345..4353903) restart -- canceling timeouts
>> 2014-01-23 04:01:08.874775 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(pgmap 2388001..2388518) restart
>> 2014-01-23 04:01:08.874777 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(mdsmap 1..1) restart
>> 2014-01-23 04:01:08.874779 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) restart
>> 2014-01-23 04:01:08.874780 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(logm 1793310..1793958) restart
>> 2014-01-23 04:01:08.874782 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(monmap 1..1) restart
>> 2014-01-23 04:01:08.874783 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(auth 2010..2269) restart
>> 2014-01-23 04:01:08.874786 7fa5fbe21700 10 mon.osd152@0(electing) e1 start_election
>> 2014-01-23 04:01:08.874787 7fa5fbe21700 10 mon.osd152@0(electing) e1 _reset
>> 2014-01-23 04:01:08.874788 7fa5fbe21700 10 mon.osd152@0(electing) e1 cancel_probe_timeout (none scheduled)
>> 2014-01-23 04:01:08.874790 7fa5fbe21700 10 mon.osd152@0(electing) e1 timecheck_finish
>> 2014-01-23 04:01:08.874791 7fa5fbe21700 10 mon.osd152@0(electing) e1 scrub_reset
>> 2014-01-23 04:01:08.874793 7fa5fbe21700 10 mon.osd152@0(electing).paxos(paxos recovering c 4353345..4353903) restart -- canceling timeouts
>> 2014-01-23 04:01:08.874795 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(pgmap 2388001..2388518) restart
>> 2014-01-23 04:01:08.874797 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(mdsmap 1..1) restart
>> 2014-01-23 04:01:08.874798 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) restart
>> 2014-01-23 04:01:08.874799 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(logm 1793310..1793958) restart
>> 2014-01-23 04:01:08.874800 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(monmap 1..1) restart
>> 2014-01-23 04:01:08.874801 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(auth 2010..2269) restart
>> 2014-01-23 04:01:08.874803 7fa5fbe21700 10 mon.osd152@0(electing) e1 cancel_probe_timeout (none scheduled)
>> 2014-01-23 04:01:08.874807 7fa5fbe21700  0 log [INF] : mon.osd152 calling new monitor election
>> 2014-01-23 04:01:08.874824 7fa5fbe21700  1 -- 10.193.207.130:6789/0 --> mon.0 10.193.207.130:6789/0 -- log(2 entries) v1 -- ?+0 0x408b7740
>> 2014-01-23 04:01:08.874831 7fa5fbe21700  5 mon.osd152@0(electing).elector(204011) start -- can i be leader?
>> 2014-01-23 04:01:08.874867 7fa5fbe21700  1 mon.osd152@0(electing).elector(204011) init, last seen epoch 204011
>> 2014-01-23 04:01:08.874873 7fa5fbe21700  1 -- 10.193.207.130:6789/0 --> mon.1 10.193.207.131:6789/0 -- election(b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 propose 204011) v4 -- ?+0 0x25398fc0
>> 2014-01-23 04:01:08.874887 7fa5fbe21700  1 -- 10.193.207.130:6789/0 --> mon.2 10.194.0.68:6789/0 -- election(b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 propose 204011) v4 -- ?+0 0x408b4800
>> 2014-01-23 04:01:08.874909 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 463485296 ==== election(b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 ack 204011) v4 ==== 540+0+0 (1598534457 0 0) 0x2d6133c0 con 0x301c8e0
>> 2014-01-23 04:01:08.874932 7fa5fbe21700  5 mon.osd152@0(electing).elector(204011) handle_ack from mon.2
>> 2014-01-23 04:01:08.874936 7fa5fbe21700  5 mon.osd152@0(electing).elector(204011)  so far i have {0=34359738367,2=34359738367}
>> 2014-01-23 04:01:08.874943 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.0 10.193.207.130:6789/0 0 ==== log(2 entries) v1 ==== 0+0+0 (0 0 0) 0x408b7740 con 0x3018580
>> 2014-01-23 04:01:08.874959 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(logm 1793310..1793958) dispatch log(2 entries) v1 from mon.0 10.193.207.130:6789/0
>> 2014-01-23 04:01:08.874963 7fa5fbe21700  1 mon.osd152@0(electing).paxos(paxos recovering c 4353345..4353903) is_readable now=2014-01-23 04:01:08.874964 lease_expire=2014-01-23 04:01:13.871236 has v0 lc 4353903
>> 2014-01-23 04:01:08.874969 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(logm 1793310..1793958)  waiting for paxos -> readable (v0)
>> 2014-01-23 04:01:08.874971 7fa5fbe21700  1 mon.osd152@0(electing).paxos(paxos recovering c 4353345..4353903) is_readable now=2014-01-23 04:01:08.874971 lease_expire=2014-01-23 04:01:13.871236 has v0 lc 4353903
>> 2014-01-23 04:01:08.875528 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 463485297 ==== election(b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 ack 204011) v4 ==== 540+0+0 (1598534457 0 0) 0x2d613600 con 0x301c8e0
>> 2014-01-23 04:01:08.875553 7fa5fbe21700  5 mon.osd152@0(electing).elector(204011) handle_ack from mon.2
>> 2014-01-23 04:01:08.875556 7fa5fbe21700  5 mon.osd152@0(electing).elector(204011)  so far i have {0=34359738367,2=34359738367}
>> 2014-01-23 04:01:13.875028 7fa5fc822700  5 mon.osd152@0(electing).elector(204011) election timer expired
>> 2014-01-23 04:01:13.875043 7fa5fc822700 10 mon.osd152@0(electing).elector(204011) bump_epoch 204011 to 204012
>> 2014-01-23 04:01:13.875375 7fa5fc822700 10 mon.osd152@0(electing) e1 join_election
>> 2014-01-23 04:01:13.875386 7fa5fc822700 10 mon.osd152@0(electing) e1 _reset
>> 2014-01-23 04:01:13.875389 7fa5fc822700 10 mon.osd152@0(electing) e1 cancel_probe_timeout (none scheduled)
>> 2014-01-23 04:01:13.875390 7fa5fc822700 10 mon.osd152@0(electing) e1 timecheck_finish
>> 2014-01-23 04:01:13.875393 7fa5fc822700 10 mon.osd152@0(electing) e1 scrub_reset
>> 2014-01-23 04:01:13.875395 7fa5fc822700 10 mon.osd152@0(electing).paxos(paxos recovering c 4353345..4353903) restart -- canceling timeouts
>> 2014-01-23 04:01:13.875401 7fa5fc822700 10 mon.osd152@0(electing).paxosservice(pgmap 2388001..2388518) restart
>> 2014-01-23 04:01:13.875404 7fa5fc822700 10 mon.osd152@0(electing).paxosservice(mdsmap 1..1) restart
>> 2014-01-23 04:01:13.875407 7fa5fc822700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) restart
>> 2014-01-23 04:01:13.875409 7fa5fc822700 10 mon.osd152@0(electing).paxosservice(logm 1793310..1793958) restart
>> 2014-01-23 04:01:13.875411 7fa5fc822700 10 mon.osd152@0(electing).paxosservice(monmap 1..1) restart
>> 2014-01-23 04:01:13.875413 7fa5fc822700 10 mon.osd152@0(electing).paxosservice(auth 2010..2269) restart
>> 2014-01-23 04:01:13.875423 7fa5fc822700  1 -- 10.193.207.130:6789/0 --> mon.2 10.194.0.68:6789/0 -- election(b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 victory 204012) v4 -- ?+0 0x248bf980
>> 2014-01-23 04:01:13.875445 7fa5fc822700 10 mon.osd152@0(electing) e1 win_election epoch 204012 quorum 0,2 features 34359738367
>> 2014-01-23 04:01:13.875457 7fa5fc822700  0 log [INF] : mon.osd152@0 won leader election with quorum 0,2
>> 2014-01-23 04:01:13.875481 7fa5fc822700  1 -- 10.193.207.130:6789/0 --> mon.0 10.193.207.130:6789/0 -- log(2 entries) v1 -- ?+0 0x248bc5c0
>> 2014-01-23 04:01:13.875498 7fa5fc822700 10 mon.osd152@0(leader).paxos(paxos recovering c 4353345..4353903) leader_init -- starting paxos recovery
>> 2014-01-23 04:01:13.875558 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.0 10.193.207.130:6789/0 0 ==== log(2 entries) v1 ==== 0+0+0 (0 0 0) 0x248bc5c0 con 0x3018580
>> 2014-01-23 04:01:13.875690 7fa5fc822700 10 mon.osd152@0(leader).paxos(paxos recovering c 4353345..4353903) get_new_proposal_number = 9772500
>> 2014-01-23 04:01:13.875703 7fa5fc822700 10 mon.osd152@0(leader).paxos(paxos recovering c 4353345..4353903) collect with pn 9772500
>> 2014-01-23 04:01:13.875708 7fa5fc822700  1 -- 10.193.207.130:6789/0 --> mon.2 10.194.0.68:6789/0 -- paxos(collect lc 4353903 fc 4353345 pn 9772500 opn 0) v3 -- ?+0 0x167ceb80
>> 2014-01-23 04:01:13.875725 7fa5fc822700 10 mon.osd152@0(leader).paxosservice(pgmap 2388001..2388518) election_finished
>> 2014-01-23 04:01:13.875729 7fa5fc822700 10 mon.osd152@0(leader).paxosservice(pgmap 2388001..2388518) _active - not active
>> 2014-01-23 04:01:13.875731 7fa5fc822700 10 mon.osd152@0(leader).paxosservice(mdsmap 1..1) election_finished
>> 2014-01-23 04:01:13.875733 7fa5fc822700 10 mon.osd152@0(leader).paxosservice(mdsmap 1..1) _active - not active
>> 2014-01-23 04:01:13.875735 7fa5fc822700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) election_finished
>> 2014-01-23 04:01:13.875737 7fa5fc822700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) _active - not active
>> 2014-01-23 04:01:13.875740 7fa5fc822700 10 mon.osd152@0(leader).paxosservice(logm 1793310..1793958) election_finished
>> 2014-01-23 04:01:13.875741 7fa5fc822700 10 mon.osd152@0(leader).paxosservice(logm 1793310..1793958) _active - not active
>> 2014-01-23 04:01:13.875743 7fa5fc822700 10 mon.osd152@0(leader).paxosservice(monmap 1..1) election_finished
>> 2014-01-23 04:01:13.875745 7fa5fc822700 10 mon.osd152@0(leader).paxosservice(monmap 1..1) _active - not active
>> 2014-01-23 04:01:13.875747 7fa5fc822700 10 mon.osd152@0(leader).paxosservice(auth 2010..2269) election_finished
>> 
>> 
> 


[-- Attachment #1.2: Type: text/html, Size: 37575 bytes --]


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Ceph cluster is unreachable because of authentication failure
       [not found]                                     ` <BLU0-SMTP461882F11F7B298D3D58ED9DFA10-MsuGFMq8XAE@public.gmane.org>
@ 2014-02-08  8:49                                       ` GuangYang
  0 siblings, 0 replies; 14+ messages in thread
From: GuangYang @ 2014-02-08  8:49 UTC (permalink / raw)
  To: Sage Weil, Joao Eduardo Luis
  Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org,
	Ceph Development


[-- Attachment #1.1: Type: text/plain, Size: 36147 bytes --]

My latest finding is that the probe message sent to the leader gets a delayed response, which triggers an election.
Some related logs:
2014-02-08 08:17:44.731127 7f531743b700  1 -- 10.194.0.68:6789/0 <== mon.1 10.193.207.131:6789/0 4443 ==== time_check( ping e 361910 r 1 ) v1 ==== 36+0+0 (2539492411 0 0) 0x3efd1440 con 0x3205540
2014-02-08 08:17:44.731141 7f531743b700  1 -- 10.194.0.68:6789/0 --> 10.193.207.131:6789/0 -- time_check( pong e 361910 r 1 ts 2014-02-08 08:17:44.731141 ) v1 -- ?+0 0x3efd0000 con 0x3205540
2014-02-08 08:17:45.276032 7f531743b700  1 -- 10.194.0.68:6789/0 <== mon.1 10.193.207.131:6789/0 4444 ==== paxos(lease lc 4354335 fc 4353596 pn 0 opn 0) v3 ==== 80+0+0 (3400812800 0 0) 0x3911df00 con 0x3205540
2014-02-08 08:17:45.276081 7f531743b700  1 -- 10.194.0.68:6789/0 --> mon.1 10.193.207.131:6789/0 -- paxos(lease_ack lc 4354335 fc 4353596 pn 0 opn 0) v3 -- ?+0 0x34690780
2014-02-08 08:17:45.276370 7f531743b700  1 -- 10.194.0.68:6789/0 --> mon.1 10.193.207.131:6789/0 -- forward(log(2 entries) v1 caps allow *) to leader v1 -- ?+0 0x37101680
2014-02-08 08:17:55.276295 7f5317e3c700  1 -- 10.194.0.68:6789/0 --> mon.0 10.193.207.130:6789/0 -- mon_probe(probe b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 name osd151) v5 -- ?+0 0x1ba26900
2014-02-08 08:17:55.276345 7f5317e3c700  1 -- 10.194.0.68:6789/0 --> mon.1 10.193.207.131:6789/0 -- mon_probe(probe b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 name osd151) v5 -- ?+0 0x3742c600
2014-02-08 08:17:56.553844 7f531743b700  1 -- 10.194.0.68:6789/0 <== mon.1 10.193.207.131:6789/0 4445 ==== time_check( report e 361910 r 2 #skews 2 #latencies 2 ) v1 ==== 648+0+0 (967739232 0 0) 0x2367a1c0 con 0x3205540
2014-02-08 08:17:56.553875 7f531743b700  1 mon.osd151@2(probing) e1 handle_timecheck drop unexpected msg
2014-02-08 08:17:56.555878 7f531743b700  1 -- 10.194.0.68:6789/0 <== mon.1 10.193.207.131:6789/0 4446 ==== paxos(lease lc 4354335 fc 4353596 pn 0 opn 0) v3 ==== 80+0+0 (3052915262 0 0) 0x3742c600 con 0x3205540
2014-02-08 08:17:56.555910 7f531743b700  1 -- 10.194.0.68:6789/0 <== mon.1 10.193.207.131:6789/0 4447 ==== mon_probe(probe b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 name osd153) v5 ==== 55+0+0 (2421850105 0 0) 0x37101680 con 0x3205540
2014-02-08 08:17:56.555937 7f531743b700  1 -- 10.194.0.68:6789/0 --> 10.193.207.131:6789/0 -- mon_probe(reply b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 name osd151 paxos( fc 4353596 lc 4354335 )) v5 -- ?+0 0x3742c600 con 0x3205540
2014-02-08 08:17:56.555959 7f531743b700  1 -- 10.194.0.68:6789/0 <== mon.1 10.193.207.131:6789/0 4448 ==== mon_probe(reply b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 name osd153 paxos( fc 4353596 lc 4354335 )) v5 ==== 539+0+0 (4064466210 0 0) 0x34690780 con 0x3205540
2014-02-08 08:17:56.555996 7f531743b700  0 log [INF] : mon.osd151 calling new monitor election
The response seems to be delayed because the leader is too busy (high CPU %). Another thing to mention is that the cluster is undergoing rapid state changes (OSD hosts dying and rejoining).
I am thinking of reducing the monitor count from 3 to 1 so that it no longer needs to do probing and can then handle the status changes. Is that possible?
Thanks,
Guang
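
To put a number on the delayed probe response, here is a minimal Python sketch that pairs each outgoing mon_probe(probe ...) with the matching incoming mon_probe(reply ...) and reports the gap. It assumes the log line layout shown in the excerpt above (26-character timestamp, local address first, peer address second); the names probe_round_trips and SAMPLE are made up for illustration and are not part of Ceph.

```python
import re
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"
# Matches messenger addresses such as 10.193.207.131:6789/0.
ADDR_RE = re.compile(r"\d+\.\d+\.\d+\.\d+:\d+/\d+")

# Three lines copied from the mon.osd151 log excerpt.
SAMPLE = [
    "2014-02-08 08:17:55.276295 7f5317e3c700  1 -- 10.194.0.68:6789/0 --> mon.0 10.193.207.130:6789/0 -- mon_probe(probe b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 name osd151) v5 -- ?+0 0x1ba26900",
    "2014-02-08 08:17:55.276345 7f5317e3c700  1 -- 10.194.0.68:6789/0 --> mon.1 10.193.207.131:6789/0 -- mon_probe(probe b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 name osd151) v5 -- ?+0 0x3742c600",
    "2014-02-08 08:17:56.555959 7f531743b700  1 -- 10.194.0.68:6789/0 <== mon.1 10.193.207.131:6789/0 4448 ==== mon_probe(reply b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 name osd153 paxos( fc 4353596 lc 4354335 )) v5 ==== 539+0+0 (4064466210 0 0) 0x34690780 con 0x3205540",
]


def probe_round_trips(lines):
    """Return ({peer: seconds until reply}, [peers that never replied])."""
    pending = {}      # peer address -> time the probe was sent
    round_trips = {}  # peer address -> seconds between probe and reply
    for line in lines:
        addrs = ADDR_RE.findall(line)
        if len(addrs) < 2:
            continue  # not a messenger line with a local and a peer address
        peer = addrs[1]
        stamp = datetime.strptime(line[:26], TS_FORMAT)
        if " --> " in line and "mon_probe(probe " in line:
            pending[peer] = stamp
        elif " <== " in line and "mon_probe(reply " in line and peer in pending:
            round_trips[peer] = (stamp - pending.pop(peer)).total_seconds()
    return round_trips, sorted(pending)


replies, unanswered = probe_round_trips(SAMPLE)
for peer, seconds in replies.items():
    print(f"{peer}: reply after {seconds:.3f}s")
for peer in unanswered:
    print(f"{peer}: no reply in this excerpt")
```

On the lines above, the reply from mon.1 arrives about 1.28 s after the probe, and the probe to mon.0 has no reply within the excerpt.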
Subject: Re: [ceph-users] Ceph cluster is unreachable because of authentication failure
From: yguang11@outlook.com
Date: Fri, 24 Jan 2014 19:41:08 +0800
CC: joao.luis@inktank.com; ceph-users@lists.ceph.com; ceph-devel@vger.kernel.org
To: sage@inktank.com

Thanks Sage.
Even after I stopped mon.1 (10.193.207.131), the problem persisted: the two remaining mons kept electing.
I checked the network and the clocks, and they are all good.
One problem I noticed is the %CPU of the monitor elected as leader: it is always around 100%, whether we have 3 monitors in or 2, while the other monitors' %CPU is less than 10%.
I used *pstack* to check the busy one and got the following result; it looks like thread #25 is keeping it busy.
Meanwhile, I kept seeing logs like '2014-01-24 11:03:18.959087 7f87eff4b700  0 mon.osd153@1(leader) e1 handle_command mon_command({"prefix": "osd crush create-or-move", "args": ["root=default", "host=osd84"], "id": 725, "weight": 3.6299999999999999} v 0) v1' in the leader's log, even though all OSDs were stopped at the time. Is it something like event replay?
bash-4.1$ sudo pstack 1159
Thread 30 (Thread 0x7f87f7619700 (LWP 1160)):
#0  0x00000037a980b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000007b7acb in ceph::log::Log::entry() ()
#2  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#3  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 29 (Thread 0x7f87f64bd700 (LWP 1161)):
#0  0x00000037a980d811 in sem_timedwait () from /lib64/libpthread.so.0
#1  0x000000000070de60 in CephContextServiceThread::entry() ()
#2  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#3  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 28 (Thread 0x7f87f5abc700 (LWP 1162)):
#0  0x00000037a94df253 in poll () from /lib64/libc.so.6
#1  0x0000000000705d5a in AdminSocket::entry() ()
#2  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#3  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 27 (Thread 0x7f87f134d700 (LWP 1163)):
#0  0x00000037a980b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000731d31 in SimpleMessenger::reaper_entry() ()
#2  0x000000000073475d in SimpleMessenger::ReaperThread::entry() ()
#3  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#4  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 26 (Thread 0x7f87f094c700 (LWP 1164)):
#0  0x00000037a980e054 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00000037a9810e1d in _L_cond_lock_886 () from /lib64/libpthread.so.0
#2  0x00000037a9810cf7 in __pthread_mutex_cond_lock () from /lib64/libpthread.so.0
#3  0x00000037a980b875 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#4  0x000000000072715e in SafeTimer::timer_thread() ()
#5  0x0000000000728fbd in SafeTimerThread::entry() ()
#6  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#7  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 25 (Thread 0x7f87eff4b700 (LWP 1165)):
#0  0x0000000000707e8e in crush_hash32_3 ()
#1  0x0000000000734c14 in crush_choose ()
#2  0x000000000073518f in crush_do_rule ()
#3  0x000000000068a8a3 in CrushWrapper::do_rule(int, int, std::vector<int, std::allocator<int> >&, int, std::vector<unsigned int, std::allocator<unsigned int> > const&) const ()
#4  0x0000000000720dff in OSDMap::_pg_to_osds(pg_pool_t const&, pg_t, std::vector<int, std::allocator<int> >&) const ()
#5  0x0000000000720ff1 in OSDMap::pg_to_raw_up(pg_t, std::vector<int, std::allocator<int> >&) const ()
#6  0x0000000000590ee6 in OSDMonitor::remove_redundant_pg_temp() ()
#7  0x00000000005adfb5 in OSDMonitor::create_pending() ()
#8  0x000000000058cca9 in PaxosService::_active() ()
#9  0x000000000056063a in Context::complete(int) ()
#10 0x000000000056424d in finish_contexts(CephContext*, std::list<Context*, std::allocator<Context*> >&, int) ()
#11 0x000000000058461e in Paxos::handle_last(MMonPaxos*) ()
#12 0x000000000058579b in Paxos::dispatch(PaxosServiceMessage*) ()
#13 0x000000000055c9ca in Monitor::_ms_dispatch(Message*) ()
#14 0x0000000000578502 in Monitor::ms_dispatch(Message*) ()
#15 0x00000000007bc552 in DispatchQueue::entry() ()
#16 0x000000000073268d in DispatchQueue::DispatchThread::entry() ()
#17 0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#18 0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 24 (Thread 0x7f87ef54a700 (LWP 1166)):
#0  0x00000037a94df253 in poll () from /lib64/libc.so.6
#1  0x00000000007db3f9 in Accepter::entry() ()
#2  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#3  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 23 (Thread 0x7f87eea48700 (LWP 1168)):
#0  0x00000037a980b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000007ccc89 in Pipe::writer() ()
#2  0x00000000007da52d in Pipe::Writer::entry() ()
#3  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#4  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 22 (Thread 0x7f87ee947700 (LWP 1169)):
#0  0x00000037a94df253 in poll () from /lib64/libc.so.6
#1  0x0000000000646726 in SignalHandler::entry() ()
#2  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#3  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 21 (Thread 0x7f87eeb49700 (LWP 1173)):
#0  0x00000037a980b7bb in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000007c7147 in Pipe::fault(bool) ()
#2  0x00000000007c97a2 in Pipe::connect() ()
#3  0x00000000007ccf63 in Pipe::writer() ()
#4  0x00000000007da52d in Pipe::Writer::entry() ()
#5  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#6  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 20 (Thread 0x7f87edc33700 (LWP 1195)):
#0  0x00000037a94df253 in poll () from /lib64/libc.so.6
#1  0x00000000007c08f9 in Pipe::tcp_read_wait() ()
#2  0x00000000007c8755 in Pipe::tcp_read(char*, int) ()
#3  0x00000000007d757f in Pipe::reader() ()
#4  0x00000000007da54d in Pipe::Reader::entry() ()
#5  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#6  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 19 (Thread 0x7f87edb32700 (LWP 1196)):
#0  0x00000037a980b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000007ccc89 in Pipe::writer() ()
#2  0x00000000007da52d in Pipe::Writer::entry() ()
#3  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#4  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 18 (Thread 0x7f87eda31700 (LWP 1197)):
#0  0x00000037a94df253 in poll () from /lib64/libc.so.6
#1  0x00000000007c08f9 in Pipe::tcp_read_wait() ()
#2  0x00000000007c8755 in Pipe::tcp_read(char*, int) ()
#3  0x00000000007d757f in Pipe::reader() ()
#4  0x00000000007da54d in Pipe::Reader::entry() ()
#5  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#6  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 17 (Thread 0x7f87ed930700 (LWP 1198)):
#0  0x00000037a980b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000007ccc89 in Pipe::writer() ()
#2  0x00000000007da52d in Pipe::Writer::entry() ()
#3  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#4  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 16 (Thread 0x7f87edd34700 (LWP 1205)):
#0  0x00000037a94df253 in poll () from /lib64/libc.so.6
#1  0x00000000007c08f9 in Pipe::tcp_read_wait() ()
#2  0x00000000007c8755 in Pipe::tcp_read(char*, int) ()
#3  0x00000000007d757f in Pipe::reader() ()
#4  0x00000000007da54d in Pipe::Reader::entry() ()
#5  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#6  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 15 (Thread 0x7f87ede35700 (LWP 1211)):
#0  0x00000037a94df253 in poll () from /lib64/libc.so.6
#1  0x00000000007c08f9 in Pipe::tcp_read_wait() ()
#2  0x00000000007c8755 in Pipe::tcp_read(char*, int) ()
#3  0x00000000007d757f in Pipe::reader() ()
#4  0x00000000007da54d in Pipe::Reader::entry() ()
#5  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#6  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 14 (Thread 0x7f87ed82f700 (LWP 1212)):
#0  0x00000037a980b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000007ccc89 in Pipe::writer() ()
#2  0x00000000007da52d in Pipe::Writer::entry() ()
#3  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#4  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 13 (Thread 0x7f87ed72e700 (LWP 1213)):
#0  0x00000037a94df253 in poll () from /lib64/libc.so.6
#1  0x00000000007c08f9 in Pipe::tcp_read_wait() ()
#2  0x00000000007c8755 in Pipe::tcp_read(char*, int) ()
#3  0x00000000007d757f in Pipe::reader() ()
#4  0x00000000007da54d in Pipe::Reader::entry() ()
#5  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#6  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 12 (Thread 0x7f87ed62d700 (LWP 1214)):
#0  0x00000037a980b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000007ccc89 in Pipe::writer() ()
#2  0x00000000007da52d in Pipe::Writer::entry() ()
#3  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#4  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 11 (Thread 0x7f87ed52c700 (LWP 1215)):
#0  0x00000037a94df253 in poll () from /lib64/libc.so.6
#1  0x00000000007c08f9 in Pipe::tcp_read_wait() ()
#2  0x00000000007c8755 in Pipe::tcp_read(char*, int) ()
#3  0x00000000007d757f in Pipe::reader() ()
#4  0x00000000007da54d in Pipe::Reader::entry() ()
#5  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#6  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 10 (Thread 0x7f87ed42b700 (LWP 1216)):
#0  0x00000037a980b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000007ccc89 in Pipe::writer() ()
#2  0x00000000007da52d in Pipe::Writer::entry() ()
#3  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#4  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 9 (Thread 0x7f87ed32a700 (LWP 1217)):
#0  0x00000037a94df253 in poll () from /lib64/libc.so.6
#1  0x00000000007c08f9 in Pipe::tcp_read_wait() ()
#2  0x00000000007c8755 in Pipe::tcp_read(char*, int) ()
#3  0x00000000007d757f in Pipe::reader() ()
#4  0x00000000007da54d in Pipe::Reader::entry() ()
#5  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#6  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 8 (Thread 0x7f87ed229700 (LWP 1218)):
#0  0x00000037a980b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000007ccc89 in Pipe::writer() ()
#2  0x00000000007da52d in Pipe::Writer::entry() ()
#3  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#4  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 7 (Thread 0x7f87ecf26700 (LWP 1221)):
#0  0x00000037a94df253 in poll () from /lib64/libc.so.6
#1  0x00000000007c08f9 in Pipe::tcp_read_wait() ()
#2  0x00000000007c8755 in Pipe::tcp_read(char*, int) ()
#3  0x00000000007d757f in Pipe::reader() ()
#4  0x00000000007da54d in Pipe::Reader::entry() ()
#5  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#6  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 6 (Thread 0x7f87ece25700 (LWP 1222)):
#0  0x00000037a980b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000007ccc89 in Pipe::writer() ()
#2  0x00000000007da52d in Pipe::Writer::entry() ()
#3  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#4  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 5 (Thread 0x7f87ecd24700 (LWP 1223)):
#0  0x00000037a94df253 in poll () from /lib64/libc.so.6
#1  0x00000000007c08f9 in Pipe::tcp_read_wait() ()
#2  0x00000000007c8755 in Pipe::tcp_read(char*, int) ()
#3  0x00000000007d757f in Pipe::reader() ()
#4  0x00000000007da54d in Pipe::Reader::entry() ()
#5  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#6  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 4 (Thread 0x7f87ecc23700 (LWP 1224)):
#0  0x00000037a980b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000007ccc89 in Pipe::writer() ()
#2  0x00000000007da52d in Pipe::Writer::entry() ()
#3  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#4  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7f87ecb22700 (LWP 1225)):
#0  0x00000037a94df253 in poll () from /lib64/libc.so.6
#1  0x00000000007c08f9 in Pipe::tcp_read_wait() ()
#2  0x00000000007c8755 in Pipe::tcp_read(char*, int) ()
#3  0x00000000007d757f in Pipe::reader() ()
#4  0x00000000007da54d in Pipe::Reader::entry() ()
#5  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#6  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7f87eca21700 (LWP 1226)):
#0  0x00000037a980b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000007ccc89 in Pipe::writer() ()
#2  0x00000000007da52d in Pipe::Writer::entry() ()
#3  0x00000037a9807851 in start_thread () from /lib64/libpthread.so.0
#4  0x00000037a94e890d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f87f761a7a0 (LWP 1159)):
#0  0x00000037a98080ad in pthread_join () from /lib64/libpthread.so.0
#1  0x0000000000735352 in Thread::join(void**) ()
#2  0x000000000072f8ba in SimpleMessenger::wait() ()
#3  0x00000000005202fa in main ()
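
With this many threads it is easy to miss the one that is actually running. The following is a rough Python sketch, not a Ceph tool, that scans pstack output and lists the threads whose top frame is not one of a few common blocking calls. The WAITING list and the names busy_threads and SAMPLE are assumptions chosen for illustration, and a single snapshot only hints at where CPU time goes.

```python
import re

# Header printed by pstack/gdb for each thread.
THREAD_RE = re.compile(r"Thread (\d+) \(Thread 0x[0-9a-f]+ \(LWP (\d+)\)\):")
# Top frame of a thread; the signature runs up to the trailing " ()".
FRAME0_RE = re.compile(r"#0\s+0x[0-9a-f]+ in (.+?) \(\)")

# Assumed list of calls that mean "blocked, not burning CPU".
WAITING = {
    "poll",
    "pthread_cond_wait",
    "pthread_cond_timedwait",
    "sem_timedwait",
    "pthread_join",
    "__lll_lock_wait",
}

# Abbreviated excerpt (first two frames only) of three threads from the dump.
SAMPLE = """\
Thread 26 (Thread 0x7f87f094c700 (LWP 1164)):
#0  0x00000037a980e054 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00000037a9810e1d in _L_cond_lock_886 () from /lib64/libpthread.so.0
Thread 25 (Thread 0x7f87eff4b700 (LWP 1165)):
#0  0x0000000000707e8e in crush_hash32_3 ()
#1  0x0000000000734c14 in crush_choose ()
Thread 24 (Thread 0x7f87ef54a700 (LWP 1166)):
#0  0x00000037a94df253 in poll () from /lib64/libc.so.6
#1  0x00000000007db3f9 in Accepter::entry() ()
"""


def busy_threads(pstack_text):
    """Return (thread number, LWP, top function) for threads not in WAITING."""
    headers = list(THREAD_RE.finditer(pstack_text))
    busy = []
    for i, header in enumerate(headers):
        # A thread's frames run from its header to the next header.
        end = headers[i + 1].start() if i + 1 < len(headers) else len(pstack_text)
        frame = FRAME0_RE.search(pstack_text[header.end():end])
        if frame is None:
            continue
        # Drop the argument list and any symbol version suffix (@@GLIBC_...).
        func = frame.group(1).split("(")[0].split("@@")[0]
        if func not in WAITING:
            busy.append((int(header.group(1)), int(header.group(2)), func))
    return busy


for number, lwp, func in busy_threads(SAMPLE):
    print(f"thread {number} (LWP {lwp}) is running in {func}")
```

Because the matching is done with regular expressions over the whole text, it does not depend on each frame being on its own line. Comparing several dumps taken a few seconds apart gives a better picture than one.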

On Jan 23, 2014, at 10:32 PM, Sage Weil <sage@inktank.com> wrote:
On Thu, 23 Jan 2014, Guang wrote:
Hi Joao,
Thanks for your reply!

I captured the log after seeing the 'noin' keyword and the log is attached.

Meanwhile, while checking the monitor logs, I see that an election happens every few seconds and that each election can take several seconds, so the cluster is electing almost all the time (could that be the root cause of the out-of-date cluster status and the command failures we see?).
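
As a rough way to quantify how long an election round takes, the leader's own log can be scanned for the "calling new monitor election" and "won leader election" cluster-log lines. The Python sketch below does that; it assumes the log format seen in this thread and only works on the log of the monitor that wins, and the names election_rounds and SAMPLE are illustrative, not Ceph APIs.

```python
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"

# Three lines taken from the mon.osd152 log quoted in this thread.
SAMPLE = [
    "2014-01-23 04:01:08.871709 7fa5fbe21700  0 log [INF] : mon.osd152 calling new monitor election",
    "2014-01-23 04:01:08.874807 7fa5fbe21700  0 log [INF] : mon.osd152 calling new monitor election",
    "2014-01-23 04:01:13.875457 7fa5fc822700  0 log [INF] : mon.osd152@0 won leader election with quorum 0,2",
]


def parse_ts(line):
    # The first 26 characters hold the timestamp, e.g. 2014-01-23 04:01:08.871709
    return datetime.strptime(line[:26], TS_FORMAT)


def election_rounds(lines):
    """Return (called, won, seconds) for each completed election round."""
    rounds = []
    called = None  # time of the first election call since the last win
    for line in lines:
        if "calling new monitor election" in line:
            if called is None:
                called = parse_ts(line)
        elif "won leader election" in line and called is not None:
            won = parse_ts(line)
            rounds.append((called, won, (won - called).total_seconds()))
            called = None
    return rounds


for called, won, seconds in election_rounds(SAMPLE):
    print(f"called {called.time()}, won {won.time()}, took {seconds:.3f}s")
```

For the three sample lines this reports a single round of roughly five seconds. Running it over a longer stretch of log would show how much of the time the monitors spend electing.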

I captured one round of election logs, included below. Please help check. Thanks very much.

It looks like this election was triggered by a slow or misbehaving mon.1 
(10.193.207.131).  There isn't enough before or after to know what the 
pattern is (does it always ack the previous election but slowly?  maybe 
its clock is off;  is its host overloaded or is it over a very slow 
link?).  I suspect, though, that you can get the cluster into quorum by 
just stopping the daemon.  Double-check the clock sync and then try to 
join it in.  Even with it down temporarily, though, the mons will become 
available.

sage


2014-01-23 04:01:08.871622 7fa5fbe21700  5 mon.osd152@0(leader).elector(204008) handle_propose from mon.1
2014-01-23 04:01:08.871625 7fa5fbe21700  5 mon.osd152@0(leader).elector(204008)  got propose from old epoch, quorum is 0,2, mon.1 must have just started
2014-01-23 04:01:08.871627 7fa5fbe21700 10 mon.osd152@0(leader) e1 start_election
2014-01-23 04:01:08.871629 7fa5fbe21700 10 mon.osd152@0(electing) e1 _reset
2014-01-23 04:01:08.871635 7fa5fbe21700 10 mon.osd152@0(electing) e1 cancel_probe_timeout (none scheduled)
2014-01-23 04:01:08.871636 7fa5fbe21700 10 mon.osd152@0(electing) e1 timecheck_finish
2014-01-23 04:01:08.871640 7fa5fbe21700 10 mon.osd152@0(electing) e1 scrub_reset
2014-01-23 04:01:08.871642 7fa5fbe21700 10 mon.osd152@0(electing).paxos(paxos active c 4353345..4353903) restart -- canceling timeouts
2014-01-23 04:01:08.871646 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(pgmap 2388001..2388518) restart
2014-01-23 04:01:08.871649 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(mdsmap 1..1) restart
2014-01-23 04:01:08.871651 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) restart
2014-01-23 04:01:08.871653 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(logm 1793310..1793958) restart
2014-01-23 04:01:08.871654 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(logm 1793310..1793958)  canceling proposal_timer 0x47d3e40
2014-01-23 04:01:08.871657 7fa5fbe21700  7 mon.osd152@0(electing).log v1793958 _updated_log for mon.2 10.194.0.68:6789/0
2014-01-23 04:01:08.871663 7fa5fbe21700  1 -- 10.193.207.130:6789/0 --> 10.194.0.68:6789/0 -- route(log(last 39415) v1 tid 3682233) v2 -- ?+0 0x1236de80 con 0x301c8e0
2014-01-23 04:01:08.871677 7fa5fbe21700  7 mon.osd152@0(electing).log v1793958 _updated_log for mon.0 10.193.207.130:6789/0
2014-01-23 04:01:08.871681 7fa5fbe21700  1 -- 10.193.207.130:6789/0 --> 10.193.207.130:6789/0 -- log(last 192924) v1 -- ?+0 0x53ccec0 con 0x3018580
2014-01-23 04:01:08.871690 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(monmap 1..1) restart
2014-01-23 04:01:08.871693 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(auth 2010..2269) restart
2014-01-23 04:01:08.871696 7fa5fbe21700 10 mon.osd152@0(electing) e1 cancel_probe_timeout (none scheduled)
2014-01-23 04:01:08.871709 7fa5fbe21700  0 log [INF] : mon.osd152 calling new monitor election
2014-01-23 04:01:08.871732 7fa5fbe21700  1 -- 10.193.207.130:6789/0 --> mon.0 10.193.207.130:6789/0 -- log(2 entries) v1 -- ?+0 0x408b33c0
2014-01-23 04:01:08.871739 7fa5fbe21700  5 mon.osd152@0(electing).elector(204008) start -- can i be leader?
2014-01-23 04:01:08.871834 7fa5fbe21700  1 mon.osd152@0(electing).elector(204008) init, last seen epoch 204008
2014-01-23 04:01:08.871836 7fa5fbe21700 10 mon.osd152@0(electing).elector(204008) bump_epoch 204008 to 204009
2014-01-23 04:01:08.872379 7fa5fbe21700 10 mon.osd152@0(electing) e1 join_election
2014-01-23 04:01:08.872389 7fa5fbe21700 10 mon.osd152@0(electing) e1 _reset
2014-01-23 04:01:08.872391 7fa5fbe21700 10 mon.osd152@0(electing) e1 cancel_probe_timeout (none scheduled)
2014-01-23 04:01:08.872392 7fa5fbe21700 10 mon.osd152@0(electing) e1 timecheck_finish
2014-01-23 04:01:08.872394 7fa5fbe21700 10 mon.osd152@0(electing) e1 scrub_reset
2014-01-23 04:01:08.872396 7fa5fbe21700 10 mon.osd152@0(electing).paxos(paxos recovering c 4353345..4353903) restart -- canceling timeouts
2014-01-23 04:01:08.872399 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(pgmap 2388001..2388518) restart
2014-01-23 04:01:08.872401 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(mdsmap 1..1) restart
2014-01-23 04:01:08.872403 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) restart
2014-01-23 04:01:08.872404 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(logm 1793310..1793958) restart
2014-01-23 04:01:08.872406 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(monmap 1..1) restart
2014-01-23 04:01:08.872407 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(auth 2010..2269) restart
2014-01-23 04:01:08.872415 7fa5fbe21700  1 -- 10.193.207.130:6789/0 --> mon.1 10.193.207.131:6789/0 -- election(b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 propose 204009) v4 -- ?+0 0x2d611d40
2014-01-23 04:01:08.872434 7fa5fbe21700  1 -- 10.193.207.130:6789/0 --> mon.2 10.194.0.68:6789/0 -- election(b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 propose 204009) v4 -- ?+0 0x41505c40
2014-01-23 04:01:08.872456 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 463485294 ==== election(b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 propose 204009) v4 ==== 540+0+0 (3981094898 0 0) 0x2d610000 con 0x301c8e0
2014-01-23 04:01:08.872478 7fa5fbe21700  5 mon.osd152@0(electing).elector(204009) handle_propose from mon.2
2014-01-23 04:01:08.872485 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.0 10.193.207.130:6789/0 0 ==== log(2 entries) v1 ==== 0+0+0 (0 0 0) 0x1236e540 con 0x3018580
2014-01-23 04:01:08.872495 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(logm 1793310..1793958) dispatch log(2 entries) v1 from mon.0 10.193.207.130:6789/0
2014-01-23 04:01:08.872501 7fa5fbe21700  1 mon.osd152@0(electing).paxos(paxos recovering c 4353345..4353903) is_readable now=2014-01-23 04:01:08.872502 lease_expire=2014-01-23 04:01:13.871236 has v0 lc 4353903
2014-01-23 04:01:08.872509 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(logm 1793310..1793958)  waiting for paxos -> readable (v0)
2014-01-23 04:01:08.872511 7fa5fbe21700  1 mon.osd152@0(electing).paxos(paxos recovering c 4353345..4353903) is_readable now=2014-01-23 04:01:08.872512 lease_expire=2014-01-23 04:01:13.871236 has v0 lc 4353903
2014-01-23 04:01:08.872516 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.1 10.193.207.131:6789/0 398310803 ==== election(b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 ack 204007) v4 ==== 540+0+0 (382260333 0 0) 0x408b3600 con 0x301cba0
2014-01-23 04:01:08.872530 7fa5fbe21700  5 mon.osd152@0(electing).elector(204009) old epoch, dropping
2014-01-23 04:01:08.872534 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.0 10.193.207.130:6789/0 0 ==== log(2 entries) v1 ==== 0+0+0 (0 0 0) 0x41504c80 con 0x3018580
2014-01-23 04:01:08.872539 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(logm 1793310..1793958) dispatch log(2 entries) v1 from mon.0 10.193.207.130:6789/0
2014-01-23 04:01:08.872542 7fa5fbe21700  1 mon.osd152@0(electing).paxos(paxos recovering c 4353345..4353903) is_readable now=2014-01-23 04:01:08.872543 lease_expire=2014-01-23 04:01:13.871236 has v0 lc 4353903
2014-01-23 04:01:08.872547 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(logm 1793310..1793958)  waiting for paxos -> readable (v0)
2014-01-23 04:01:08.872548 7fa5fbe21700  1 mon.osd152@0(electing).paxos(paxos recovering c 4353345..4353903) is_readable now=2014-01-23 04:01:08.872549 lease_expire=2014-01-23 04:01:13.871236 has v0 lc 4353903
2014-01-23 04:01:08.872553 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.1 10.193.207.131:6789/0 398310804 ==== election(b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 propose 204009) v4 ==== 540+0+0 (3981094898 0 0) 0x408b7740 con 0x301cba0
2014-01-23 04:01:08.872565 7fa5fbe21700  5 mon.osd152@0(electing).elector(204009) handle_propose from mon.1
2014-01-23 04:01:08.872568 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.0 10.193.207.130:6789/0 0 ==== log(last 192924) v1 ==== 0+0+0 (0 0 0) 0x53ccec0 con 0x3018580
2014-01-23 04:01:08.872824 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.0 10.193.207.130:6789/0 0 ==== log(2 entries) v1 ==== 0+0+0 (0 0 0) 0x408b33c0 con 0x3018580
2014-01-23 04:01:08.872830 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(logm 1793310..1793958) dispatch log(2 entries) v1 from mon.0 10.193.207.130:6789/0
2014-01-23 04:01:08.872834 7fa5fbe21700  1 mon.osd152@0(electing).paxos(paxos recovering c 4353345..4353903) is_readable now=2014-01-23 04:01:08.872835 lease_expire=2014-01-23 04:01:13.871236 has v0 lc 4353903
2014-01-23 04:01:08.872839 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(logm 1793310..1793958)  waiting for paxos -> readable (v0)
2014-01-23 04:01:08.872841 7fa5fbe21700  1 mon.osd152@0(electing).paxos(paxos recovering c 4353345..4353903) is_readable now=2014-01-23 04:01:08.872841 lease_expire=2014-01-23 04:01:13.871236 has v0 lc 4353903
2014-01-23 04:01:08.872846 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== osd.448 10.194.0.143:6816/357 1 ==== auth(proto 0 28 bytes epoch 1) v1 ==== 58+0+0 (1794778480 0 0) 0x25398fc0 con 0x25347640
2014-01-23 04:01:08.872854 7fa5fbe21700  5 mon.osd152@0(electing) e1 discarding message auth(proto 0 28 bytes epoch 1) v1 and sending client elsewhere
2014-01-23 04:01:08.872857 7fa5fbe21700  1 -- 10.193.207.130:6789/0 mark_down 0x25347640 -- 0x21ee4380
2014-01-23 04:01:08.874236 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 463485295 ==== election(b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 propose 204011) v4 ==== 540+0+0 (229380923 0 0) 0x1236de80 con 0x301c8e0
2014-01-23 04:01:08.874266 7fa5fbe21700  5 mon.osd152@0(electing).elector(204009) handle_propose from mon.2
2014-01-23 04:01:08.874270 7fa5fbe21700 10 mon.osd152@0(electing).elector(204009) bump_epoch 204009 to 204011
2014-01-23 04:01:08.874755 7fa5fbe21700 10 mon.osd152@0(electing) e1 join_election
2014-01-23 04:01:08.874764 7fa5fbe21700 10 mon.osd152@0(electing) e1 _reset
2014-01-23 04:01:08.874767 7fa5fbe21700 10 mon.osd152@0(electing) e1 cancel_probe_timeout (none scheduled)
2014-01-23 04:01:08.874768 7fa5fbe21700 10 mon.osd152@0(electing) e1 timecheck_finish
2014-01-23 04:01:08.874770 7fa5fbe21700 10 mon.osd152@0(electing) e1 scrub_reset
2014-01-23 04:01:08.874771 7fa5fbe21700 10 mon.osd152@0(electing).paxos(paxos recovering c 4353345..4353903) restart -- canceling timeouts
2014-01-23 04:01:08.874775 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(pgmap 2388001..2388518) restart
2014-01-23 04:01:08.874777 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(mdsmap 1..1) restart
2014-01-23 04:01:08.874779 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) restart
2014-01-23 04:01:08.874780 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(logm 1793310..1793958) restart
2014-01-23 04:01:08.874782 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(monmap 1..1) restart
2014-01-23 04:01:08.874783 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(auth 2010..2269) restart
2014-01-23 04:01:08.874786 7fa5fbe21700 10 mon.osd152@0(electing) e1 start_election
2014-01-23 04:01:08.874787 7fa5fbe21700 10 mon.osd152@0(electing) e1 _reset
2014-01-23 04:01:08.874788 7fa5fbe21700 10 mon.osd152@0(electing) e1 cancel_probe_timeout (none scheduled)
2014-01-23 04:01:08.874790 7fa5fbe21700 10 mon.osd152@0(electing) e1 timecheck_finish
2014-01-23 04:01:08.874791 7fa5fbe21700 10 mon.osd152@0(electing) e1 scrub_reset
2014-01-23 04:01:08.874793 7fa5fbe21700 10 mon.osd152@0(electing).paxos(paxos recovering c 4353345..4353903) restart -- canceling timeouts
2014-01-23 04:01:08.874795 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(pgmap 2388001..2388518) restart
2014-01-23 04:01:08.874797 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(mdsmap 1..1) restart
2014-01-23 04:01:08.874798 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) restart
2014-01-23 04:01:08.874799 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(logm 1793310..1793958) restart
2014-01-23 04:01:08.874800 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(monmap 1..1) restart
2014-01-23 04:01:08.874801 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(auth 2010..2269) restart
2014-01-23 04:01:08.874803 7fa5fbe21700 10 mon.osd152@0(electing) e1 cancel_probe_timeout (none scheduled)
2014-01-23 04:01:08.874807 7fa5fbe21700  0 log [INF] : mon.osd152 calling new monitor election
2014-01-23 04:01:08.874824 7fa5fbe21700  1 -- 10.193.207.130:6789/0 --> mon.0 10.193.207.130:6789/0 -- log(2 entries) v1 -- ?+0 0x408b7740
2014-01-23 04:01:08.874831 7fa5fbe21700  5 mon.osd152@0(electing).elector(204011) start -- can i be leader?
2014-01-23 04:01:08.874867 7fa5fbe21700  1 mon.osd152@0(electing).elector(204011) init, last seen epoch 204011
2014-01-23 04:01:08.874873 7fa5fbe21700  1 -- 10.193.207.130:6789/0 --> mon.1 10.193.207.131:6789/0 -- election(b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 propose 204011) v4 -- ?+0 0x25398fc0
2014-01-23 04:01:08.874887 7fa5fbe21700  1 -- 10.193.207.130:6789/0 --> mon.2 10.194.0.68:6789/0 -- election(b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 propose 204011) v4 -- ?+0 0x408b4800
2014-01-23 04:01:08.874909 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 463485296 ==== election(b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 ack 204011) v4 ==== 540+0+0 (1598534457 0 0) 0x2d6133c0 con 0x301c8e0
2014-01-23 04:01:08.874932 7fa5fbe21700  5 mon.osd152@0(electing).elector(204011) handle_ack from mon.2
2014-01-23 04:01:08.874936 7fa5fbe21700  5 mon.osd152@0(electing).elector(204011)  so far i have {0=34359738367,2=34359738367}
2014-01-23 04:01:08.874943 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.0 10.193.207.130:6789/0 0 ==== log(2 entries) v1 ==== 0+0+0 (0 0 0) 0x408b7740 con 0x3018580
2014-01-23 04:01:08.874959 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(logm 1793310..1793958) dispatch log(2 entries) v1 from mon.0 10.193.207.130:6789/0
2014-01-23 04:01:08.874963 7fa5fbe21700  1 mon.osd152@0(electing).paxos(paxos recovering c 4353345..4353903) is_readable now=2014-01-23 04:01:08.874964 lease_expire=2014-01-23 04:01:13.871236 has v0 lc 4353903
2014-01-23 04:01:08.874969 7fa5fbe21700 10 mon.osd152@0(electing).paxosservice(logm 1793310..1793958)  waiting for paxos -> readable (v0)
2014-01-23 04:01:08.874971 7fa5fbe21700  1 mon.osd152@0(electing).paxos(paxos recovering c 4353345..4353903) is_readable now=2014-01-23 04:01:08.874971 lease_expire=2014-01-23 04:01:13.871236 has v0 lc 4353903
2014-01-23 04:01:08.875528 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.2 10.194.0.68:6789/0 463485297 ==== election(b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 ack 204011) v4 ==== 540+0+0 (1598534457 0 0) 0x2d613600 con 0x301c8e0
2014-01-23 04:01:08.875553 7fa5fbe21700  5 mon.osd152@0(electing).elector(204011) handle_ack from mon.2
2014-01-23 04:01:08.875556 7fa5fbe21700  5 mon.osd152@0(electing).elector(204011)  so far i have {0=34359738367,2=34359738367}
2014-01-23 04:01:13.875028 7fa5fc822700  5 mon.osd152@0(electing).elector(204011) election timer expired
2014-01-23 04:01:13.875043 7fa5fc822700 10 mon.osd152@0(electing).elector(204011) bump_epoch 204011 to 204012
2014-01-23 04:01:13.875375 7fa5fc822700 10 mon.osd152@0(electing) e1 join_election
2014-01-23 04:01:13.875386 7fa5fc822700 10 mon.osd152@0(electing) e1 _reset
2014-01-23 04:01:13.875389 7fa5fc822700 10 mon.osd152@0(electing) e1 cancel_probe_timeout (none scheduled)
2014-01-23 04:01:13.875390 7fa5fc822700 10 mon.osd152@0(electing) e1 timecheck_finish
2014-01-23 04:01:13.875393 7fa5fc822700 10 mon.osd152@0(electing) e1 scrub_reset
2014-01-23 04:01:13.875395 7fa5fc822700 10 mon.osd152@0(electing).paxos(paxos recovering c 4353345..4353903) restart -- canceling timeouts
2014-01-23 04:01:13.875401 7fa5fc822700 10 mon.osd152@0(electing).paxosservice(pgmap 2388001..2388518) restart
2014-01-23 04:01:13.875404 7fa5fc822700 10 mon.osd152@0(electing).paxosservice(mdsmap 1..1) restart
2014-01-23 04:01:13.875407 7fa5fc822700 10 mon.osd152@0(electing).paxosservice(osdmap 133136..134893) restart
2014-01-23 04:01:13.875409 7fa5fc822700 10 mon.osd152@0(electing).paxosservice(logm 1793310..1793958) restart
2014-01-23 04:01:13.875411 7fa5fc822700 10 mon.osd152@0(electing).paxosservice(monmap 1..1) restart
2014-01-23 04:01:13.875413 7fa5fc822700 10 mon.osd152@0(electing).paxosservice(auth 2010..2269) restart
2014-01-23 04:01:13.875423 7fa5fc822700  1 -- 10.193.207.130:6789/0 --> mon.2 10.194.0.68:6789/0 -- election(b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 victory 204012) v4 -- ?+0 0x248bf980
2014-01-23 04:01:13.875445 7fa5fc822700 10 mon.osd152@0(electing) e1 win_election epoch 204012 quorum 0,2 features 34359738367
2014-01-23 04:01:13.875457 7fa5fc822700  0 log [INF] : mon.osd152@0 won leader election with quorum 0,2
2014-01-23 04:01:13.875481 7fa5fc822700  1 -- 10.193.207.130:6789/0 --> mon.0 10.193.207.130:6789/0 -- log(2 entries) v1 -- ?+0 0x248bc5c0
2014-01-23 04:01:13.875498 7fa5fc822700 10 mon.osd152@0(leader).paxos(paxos recovering c 4353345..4353903) leader_init -- starting paxos recovery
2014-01-23 04:01:13.875558 7fa5fbe21700  1 -- 10.193.207.130:6789/0 <== mon.0 10.193.207.130:6789/0 0 ==== log(2 entries) v1 ==== 0+0+0 (0 0 0) 0x248bc5c0 con 0x3018580
2014-01-23 04:01:13.875690 7fa5fc822700 10 mon.osd152@0(leader).paxos(paxos recovering c 4353345..4353903) get_new_proposal_number = 9772500
2014-01-23 04:01:13.875703 7fa5fc822700 10 mon.osd152@0(leader).paxos(paxos recovering c 4353345..4353903) collect with pn 9772500
2014-01-23 04:01:13.875708 7fa5fc822700  1 -- 10.193.207.130:6789/0 --> mon.2 10.194.0.68:6789/0 -- paxos(collect lc 4353903 fc 4353345 pn 9772500 opn 0) v3 -- ?+0 0x167ceb80
2014-01-23 04:01:13.875725 7fa5fc822700 10 mon.osd152@0(leader).paxosservice(pgmap 2388001..2388518) election_finished
2014-01-23 04:01:13.875729 7fa5fc822700 10 mon.osd152@0(leader).paxosservice(pgmap 2388001..2388518) _active - not active
2014-01-23 04:01:13.875731 7fa5fc822700 10 mon.osd152@0(leader).paxosservice(mdsmap 1..1) election_finished
2014-01-23 04:01:13.875733 7fa5fc822700 10 mon.osd152@0(leader).paxosservice(mdsmap 1..1) _active - not active
2014-01-23 04:01:13.875735 7fa5fc822700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) election_finished
2014-01-23 04:01:13.875737 7fa5fc822700 10 mon.osd152@0(leader).paxosservice(osdmap 133136..134893) _active - not active
2014-01-23 04:01:13.875740 7fa5fc822700 10 mon.osd152@0(leader).paxosservice(logm 1793310..1793958) election_finished
2014-01-23 04:01:13.875741 7fa5fc822700 10 mon.osd152@0(leader).paxosservice(logm 1793310..1793958) _active - not active
2014-01-23 04:01:13.875743 7fa5fc822700 10 mon.osd152@0(leader).paxosservice(monmap 1..1) election_finished
2014-01-23 04:01:13.875745 7fa5fc822700 10 mon.osd152@0(leader).paxosservice(monmap 1..1) _active - not active
2014-01-23 04:01:13.875747 7fa5fc822700 10 mon.osd152@0(leader).paxosservice(auth 2010..2269) election_finished


[-- Attachment #1.2: Type: text/html, Size: 41474 bytes --]

[-- Attachment #2: Type: text/plain, Size: 178 bytes --]

_______________________________________________
ceph-users mailing list
ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


end of thread, other threads:[~2014-02-08  8:49 UTC | newest]

Thread overview: 14+ messages
2014-01-14 12:04 Ceph cluster is unreachable because of authentication failure GuangYang
     [not found] ` <BLU0-SMTP1356EEDE6A8ACF94947F22ADFBF0-MsuGFMq8XAE@public.gmane.org>
2014-01-14 14:55   ` Sage Weil
     [not found]     ` <alpine.DEB.2.00.1401140654500.10628-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
2014-01-14 21:54       ` Guang
     [not found]     ` <016D7F31-523E-4EC1-8222-7D4084BA400F@outlook.com>
2014-01-16  8:26       ` Guang
2014-01-16 17:35         ` Sage Weil
     [not found]           ` <BLU0-SMTP169D80D759610D226681DD3DFB80@phx.gbl>
2014-01-17 16:05             ` Sage Weil
     [not found]               ` <alpine.DEB.2.00.1401170804180.304-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
2014-01-19 13:21                 ` Guang
2014-01-20 16:35                   ` Sage Weil
     [not found]                     ` <alpine.DEB.2.00.1401200835050.2149-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
2014-01-22 11:34                       ` Guang
2014-01-22 13:14                         ` [ceph-users] " Joao Eduardo Luis
     [not found]                           ` <BLU0-SMTP3186ED17CE3EC156E4EA064DFA60@phx.gbl>
     [not found]                             ` <BLU0-SMTP3186ED17CE3EC156E4EA064DFA60-MsuGFMq8XAE@public.gmane.org>
2014-01-23 14:32                               ` Sage Weil
     [not found]                                 ` <alpine.DEB.2.00.1401230630150.18304-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
2014-01-24 11:41                                   ` Guang
     [not found]                                     ` <BLU0-SMTP461882F11F7B298D3D58ED9DFA10-MsuGFMq8XAE@public.gmane.org>
2014-02-08  8:49                                       ` GuangYang
2014-01-18  8:55   ` Sherry Shahbazi
