* Trouble getting a new file system to start, for v0.59 and newer
@ 2013-04-03 15:46 Jim Schutt
2013-04-03 15:58 ` Sage Weil
0 siblings, 1 reply; 12+ messages in thread
From: Jim Schutt @ 2013-04-03 15:46 UTC (permalink / raw)
To: Joao Eduardo Luis; +Cc: ceph-devel@vger.kernel.org
Hi Joao,
I alluded in an earlier thread about an issue I've been recently
having with starting a new filesystem, which I thought I had
tracked into the paxos subsystem. I believe I started having
this trouble when I started testing v0.59, and it's still there
in v0.60.
The basic configuration is 1 mon, 1 mds, 576 OSDs. Some time
after starting the mon and the OSDs, PG creation progress stops.
I'm finding this in my OSD logs, which I've been thinking is
suggestive of why OSDs stop making progress:
# cat osd.0.log | grep "acked PGStats"
2013-04-02 15:40:28.944178 7ffff387a700 1 osd.0 2 mon hasn't acked PGStats in 30.034349 seconds, reconnecting elsewhere
2013-04-02 15:42:24.570963 7ffff387a700 1 osd.0 14 mon hasn't acked PGStats in 30.072959 seconds, reconnecting elsewhere
2013-04-02 15:43:22.664921 7ffff387a700 1 osd.0 20 mon hasn't acked PGStats in 30.570739 seconds, reconnecting elsewhere
2013-04-02 15:43:52.672461 7ffff387a700 1 osd.0 20 mon hasn't acked PGStats in 30.007379 seconds, reconnecting elsewhere
2013-04-02 15:44:22.680691 7ffff387a700 1 osd.0 20 mon hasn't acked PGStats in 30.007095 seconds, reconnecting elsewhere
2013-04-02 15:44:52.688936 7ffff387a700 1 osd.0 20 mon hasn't acked PGStats in 30.007221 seconds, reconnecting elsewhere
2013-04-02 15:45:22.702066 7ffff387a700 1 osd.0 20 mon hasn't acked PGStats in 30.012264 seconds, reconnecting elsewhere
2013-04-02 15:45:52.711134 7ffff387a700 1 osd.0 20 mon hasn't acked PGStats in 30.007957 seconds, reconnecting elsewhere
2013-04-02 15:46:22.719845 7ffff387a700 1 osd.0 20 mon hasn't acked PGStats in 30.007859 seconds, reconnecting elsewhere
2013-04-02 15:46:52.729475 7ffff387a700 1 osd.0 20 mon hasn't acked PGStats in 30.008756 seconds, reconnecting elsewhere
I also thought paxos might be implicated because I noticed the auth
paxos service appears to stop making progress. The mon is getting
lots of auth paxos service messages from all the OSDs:
# cat mon.cs31.log | grep paxos | grep "dispatch auth" | head -1
2013-04-02 15:38:59.995457 7ffff2a8c700 10 mon.cs31@0(leader).paxosservice(auth) dispatch auth(proto 0 30 bytes epoch 1) v1 from client.? 172.17.131.29:0/52585
# cat mon.cs31.log | grep paxos | grep "dispatch auth" | wc -l
8802
but at some point the auth paxos service appears to stop making progress:
# cat mon.cs31.log | grep paxos | grep -v "dispatch auth" | grep auth
2013-04-02 15:38:42.895202 7ffff7fbe780 10 mon.cs31@0(probing).paxosservice(auth) restart
2013-04-02 15:38:42.905705 7ffff7fbe780 10 mon.cs31@0(probing).paxosservice(auth) restart
2013-04-02 15:38:59.182158 7ffff7fbe780 10 mon.cs31@0(probing).paxosservice(auth) restart
2013-04-02 15:38:59.192809 7ffff7fbe780 10 mon.cs31@0(probing).paxosservice(auth) restart
2013-04-02 15:38:59.290602 7ffff7fbe780 10 mon.cs31@0(leader).paxosservice(auth) election_finished
2013-04-02 15:38:59.290608 7ffff7fbe780 10 mon.cs31@0(leader).paxosservice(auth) _active
2013-04-02 15:38:59.300707 7ffff7fbe780 10 mon.cs31@0(leader).paxosservice(auth) scrub
2013-04-02 15:38:59.308260 7ffff7fbe780 7 mon.cs31@0(leader).paxosservice(auth) _active creating new pending
2013-04-02 15:38:59.329153 7ffff7fbe780 10 mon.cs31@0(leader).paxosservice(auth) propose_pending
2013-04-02 15:38:59.390988 7ffff7fbe780 10 mon.cs31@0(leader).paxosservice(auth) _active
2013-04-02 15:38:59.421164 7ffff7fbe780 10 mon.cs31@0(leader).paxosservice(auth) scrub
2013-04-02 15:38:59.427781 7ffff7fbe780 7 mon.cs31@0(leader).paxosservice(auth) _active creating new pending
2013-04-02 15:38:59.448001 7ffff7fbe780 10 mon.cs31@0(leader).paxosservice(auth) election_finished
2013-04-02 15:38:59.448027 7ffff7fbe780 10 mon.cs31@0(leader).paxosservice(auth) _active
2013-04-02 15:38:59.453691 7ffff7fbe780 10 mon.cs31@0(leader).paxosservice(auth) scrub
2013-04-02 15:38:59.459446 7ffff7fbe780 7 mon.cs31@0(leader).paxosservice(auth) _active creating new pending
2013-04-02 15:38:59.500315 7ffff328d700 10 mon.cs31@0(leader).paxosservice(auth) propose_pending
2013-04-02 15:38:59.569638 7ffff328d700 10 mon.cs31@0(leader).paxosservice(auth) _active
2013-04-02 15:38:59.593526 7ffff328d700 10 mon.cs31@0(leader).paxosservice(auth) scrub
2013-04-02 15:38:59.599250 7ffff328d700 7 mon.cs31@0(leader).paxosservice(auth) _active creating new pending
2013-04-02 15:39:39.390156 7ffff2a8c700 10 mon.cs31@0(leader).paxosservice(auth) propose_pending
2013-04-02 15:39:39.435906 7ffff2a8c700 10 mon.cs31@0(leader).paxosservice(auth) _active
2013-04-02 15:39:39.456743 7ffff2a8c700 10 mon.cs31@0(leader).paxosservice(auth) scrub
2013-04-02 15:39:39.462498 7ffff2a8c700 7 mon.cs31@0(leader).paxosservice(auth) _active creating new pending
even as it continues to receive auth paxos service messages:
# cat mon.cs31.log | grep paxos | grep -A 200000 "15:39:39.462498" | grep "dispatch auth" | wc -l
8799
for some time:
# cat mon.cs31.log | grep paxos | grep "dispatch auth" | tail -2
2013-04-02 15:46:56.090494 7ffff2a8c700 10 mon.cs31@0(leader).paxosservice(auth) dispatch auth(proto 0 28 bytes epoch 1) v1 from osd.379 172.17.131.47:6819/219107
2013-04-02 15:46:56.127352 7ffff2a8c700 10 mon.cs31@0(leader).paxosservice(auth) dispatch auth(proto 0 28 bytes epoch 1) v1 from osd.311 172.17.131.44:6823/184079
Is that all expected behavior for the auth paxos service?
To try to learn more about PGStat messages not getting acked, I caused
Pipe:writer() and Pipe:reader() to log messages at a lower log level as
they are sent and received, so I could see that vs. dispatch
at debug ms 2. Here's what the mon thinks is OSD 0's PGStat interaction:
# cat mon.cs31.log | grep "172.17.131.32:6800" | grep stats'
2013-04-02 15:39:58.923546 7ffff1a8a700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x12a1400 sd=20 :6789 s=2 pgs=1 cs=1 l=1).reader got message 13 0x82ca800 pg_stats(2139 pgs tid 1 v 2) v1
2013-04-02 15:40:03.927889 7ffff1a8a700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x12a1400 sd=20 :6789 s=2 pgs=1 cs=1 l=1).reader got message 15 0x88f5280 pg_stats(2139 pgs tid 2 v 2) v1
2013-04-02 15:40:08.936064 7ffff1a8a700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x12a1400 sd=20 :6789 s=2 pgs=1 cs=1 l=1).reader got message 17 0x88f5000 pg_stats(2139 pgs tid 3 v 2) v1
2013-04-02 15:40:13.943312 7ffff1a8a700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x12a1400 sd=20 :6789 s=2 pgs=1 cs=1 l=1).reader got message 19 0x8e35780 pg_stats(2139 pgs tid 4 v 2) v1
2013-04-02 15:40:18.949530 7ffff1a8a700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x12a1400 sd=20 :6789 s=2 pgs=1 cs=1 l=1).reader got message 21 0x8e35500 pg_stats(2139 pgs tid 5 v 2) v1
2013-04-02 15:40:23.953961 7ffff1a8a700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x12a1400 sd=20 :6789 s=2 pgs=1 cs=1 l=1).reader got message 23 0x8e35280 pg_stats(2139 pgs tid 6 v 2) v1
2013-04-02 15:40:42.812586 7fffa94ed700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).reader got message 5 0x9178500 pg_stats(2139 pgs tid 7 v 2) v1
2013-04-02 15:40:42.824276 7fffa94ed700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).reader got message 7 0x9178a00 pg_stats(2139 pgs tid 8 v 2) v1
2013-04-02 15:40:42.833998 7fffa94ed700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).reader got message 9 0x9178780 pg_stats(2139 pgs tid 9 v 3) v1
2013-04-02 15:40:42.839129 7fffa94ed700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).reader got message 10 0x917d000 pg_stats(2139 pgs tid 10 v 3) v1
2013-04-02 15:40:43.289351 7ffff2a8c700 1 -- 172.17.131.31:6789/0 <== osd.0 172.17.131.32:6800/217031 5 ==== pg_stats(2139 pgs tid 7 v 2) v1 ==== 780831+0+0 (3242785419 0 0) 0x9178500 con 0x8ef77a0
2013-04-02 15:40:43.289361 7ffff2a8c700 10 mon.cs31@0(leader).paxosservice(pgmap) dispatch pg_stats(2139 pgs tid 7 v 2) v1 from osd.0 172.17.131.32:6800/217031
2013-04-02 15:40:44.346624 7fffa94ed700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).reader got message 11 0x9178f00 pg_stats(2139 pgs tid 11 v 3) v1
2013-04-02 15:40:49.347796 7fffa94ed700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).reader got message 12 0x9178c80 pg_stats(2139 pgs tid 12 v 3) v1
2013-04-02 15:40:50.974070 7ffff328d700 1 -- 172.17.131.31:6789/0 --> 172.17.131.32:6800/217031 -- pg_stats_ack(2139 pgs tid 7) v1 -- ?+0 0x94efc40 con 0x8ef77a0
2013-04-02 15:40:50.975214 7ffff1a8a700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).writer sending message 9 0x94efc40pg_stats_ack(2139 pgs tid 7) v1
2013-04-02 15:40:52.148791 7ffff2a8c700 1 -- 172.17.131.31:6789/0 <== osd.0 172.17.131.32:6800/217031 7 ==== pg_stats(2139 pgs tid 8 v 2) v1 ==== 780831+0+0 (767410827 0 0) 0x9178a00 con 0x8ef77a0
2013-04-02 15:40:52.148798 7ffff2a8c700 10 mon.cs31@0(leader).paxosservice(pgmap) dispatch pg_stats(2139 pgs tid 8 v 2) v1 from osd.0 172.17.131.32:6800/217031
2013-04-02 15:40:52.184327 7ffff2a8c700 1 -- 172.17.131.31:6789/0 --> 172.17.131.32:6800/217031 -- pg_stats_ack(2139 pgs tid 0) v1 -- ?+0 0x95101c0 con 0x8ef77a0
2013-04-02 15:40:52.184887 7ffff1a8a700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).writer sending message 12 0x95101c0pg_stats_ack(2139 pgs tid 0) v1
2013-04-02 15:40:54.355489 7fffa94ed700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).reader got message 13 0x917d500 pg_stats(2139 pgs tid 13 v 3) v1
2013-04-02 15:40:59.366894 7fffa94ed700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).reader got message 15 0x12a1400 pg_stats(2139 pgs tid 14 v 3) v1
2013-04-02 15:41:04.369893 7fffa94ed700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).reader got message 16 0x88f5000 pg_stats(2139 pgs tid 15 v 3) v1
2013-04-02 15:41:09.383145 7fffa94ed700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).reader got message 17 0x8572580 pg_stats(2139 pgs tid 16 v 3) v1
2013-04-02 15:41:10.331005 7ffff2a8c700 1 -- 172.17.131.31:6789/0 <== osd.0 172.17.131.32:6800/217031 9 ==== pg_stats(2139 pgs tid 9 v 3) v1 ==== 780831+0+0 (2102719063 0 0) 0x9178780 con 0x8ef77a0
2013-04-02 15:41:10.331015 7ffff2a8c700 10 mon.cs31@0(leader).paxosservice(pgmap) dispatch pg_stats(2139 pgs tid 9 v 3) v1 from osd.0 172.17.131.32:6800/217031
2013-04-02 15:41:14.386773 7fffa94ed700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).reader got message 18 0x917df00 pg_stats(2139 pgs tid 17 v 3) v1
2013-04-02 15:41:18.238008 7ffff328d700 1 -- 172.17.131.31:6789/0 --> 172.17.131.32:6800/217031 -- pg_stats_ack(2139 pgs tid 9) v1 -- ?+0 0x887db00 con 0x8ef77a0
2013-04-02 15:41:18.239168 7ffff1a8a700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).writer sending message 15 0x887db00pg_stats_ack(2139 pgs tid 9) v1
2013-04-02 15:41:19.393287 7fffa94ed700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).reader got message 19 0x917dc80 pg_stats(1688 pgs tid 18 v 3) v1
2013-04-02 15:41:24.402543 7fffa94ed700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).reader got message 20 0x917da00 pg_stats(1688 pgs tid 19 v 3) v1
2013-04-02 15:41:29.409268 7fffa94ed700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).reader got message 22 0x917d780 pg_stats(1688 pgs tid 20 v 3) v1
2013-04-02 15:41:32.417687 7ffff2a8c700 1 -- 172.17.131.31:6789/0 <== osd.0 172.17.131.32:6800/217031 10 ==== pg_stats(2139 pgs tid 10 v 3) v1 ==== 780831+0+0 (1002730769 0 0) 0x917d000 con 0x8ef77a0
2013-04-02 15:41:32.417694 7ffff2a8c700 10 mon.cs31@0(leader).paxosservice(pgmap) dispatch pg_stats(2139 pgs tid 10 v 3) v1 from osd.0 172.17.131.32:6800/217031
2013-04-02 15:41:34.412255 7fffa94ed700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).reader got message 24 0x9178500 pg_stats(1688 pgs tid 21 v 3) v1
2013-04-02 15:41:39.515106 7ffff328d700 1 -- 172.17.131.31:6789/0 --> 172.17.131.32:6800/217031 -- pg_stats_ack(2139 pgs tid 10) v1 -- ?+0 0x8bdaf40 con 0x8ef77a0
2013-04-02 15:41:39.516862 7ffff1a8a700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).writer sending message 16 0x8bdaf40pg_stats_ack(2139 pgs tid 10) v1
2013-04-02 15:41:41.428677 7fffa94ed700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).reader got message 27 0x7436400 pg_stats(1956 pgs tid 22 v 4) v1
2013-04-02 15:41:46.433146 7fffa94ed700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).reader got message 29 0x7436180 pg_stats(1956 pgs tid 23 v 4) v1
2013-04-02 15:41:51.436995 7fffa94ed700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).reader got message 31 0x7435f00 pg_stats(1956 pgs tid 24 v 4) v1
2013-04-02 15:41:53.051925 7ffff2a8c700 1 -- 172.17.131.31:6789/0 <== osd.0 172.17.131.32:6800/217031 11 ==== pg_stats(2139 pgs tid 11 v 3) v1 ==== 780831+0+0 (676901400 0 0) 0x9178f00 con 0x8ef77a0
2013-04-02 15:41:53.051937 7ffff2a8c700 10 mon.cs31@0(leader).paxosservice(pgmap) dispatch pg_stats(2139 pgs tid 11 v 3) v1 from osd.0 172.17.131.32:6800/217031
2013-04-02 15:41:53.102194 7ffff2a8c700 1 -- 172.17.131.31:6789/0 <== osd.0 172.17.131.32:6800/217031 12 ==== pg_stats(2139 pgs tid 12 v 3) v1 ==== 780831+0+0 (998884584 0 0) 0x9178c80 con 0x8ef77a0
2013-04-02 15:41:53.102201 7ffff2a8c700 10 mon.cs31@0(leader).paxosservice(pgmap) dispatch pg_stats(2139 pgs tid 12 v 3) v1 from osd.0 172.17.131.32:6800/217031
2013-04-02 15:41:53.154485 7ffff2a8c700 1 -- 172.17.131.31:6789/0 <== osd.0 172.17.131.32:6800/217031 13 ==== pg_stats(2139 pgs tid 13 v 3) v1 ==== 780831+0+0 (1067499060 0 0) 0x917d500 con 0x8ef77a0
2013-04-02 15:41:53.154492 7ffff2a8c700 10 mon.cs31@0(leader).paxosservice(pgmap) dispatch pg_stats(2139 pgs tid 13 v 3) v1 from osd.0 172.17.131.32:6800/217031
2013-04-02 15:41:53.177214 7ffff2a8c700 1 -- 172.17.131.31:6789/0 <== osd.0 172.17.131.32:6800/217031 15 ==== pg_stats(2139 pgs tid 14 v 3) v1 ==== 780831+0+0 (2843573366 0 0) 0x12a1400 con 0x8ef77a0
2013-04-02 15:41:53.177220 7ffff2a8c700 10 mon.cs31@0(leader).paxosservice(pgmap) dispatch pg_stats(2139 pgs tid 14 v 3) v1 from osd.0 172.17.131.32:6800/217031
2013-04-02 15:41:53.183387 7ffff2a8c700 1 -- 172.17.131.31:6789/0 <== osd.0 172.17.131.32:6800/217031 16 ==== pg_stats(2139 pgs tid 15 v 3) v1 ==== 780831+0+0 (886357047 0 0) 0x88f5000 con 0x8ef77a0
2013-04-02 15:41:53.183393 7ffff2a8c700 10 mon.cs31@0(leader).paxosservice(pgmap) dispatch pg_stats(2139 pgs tid 15 v 3) v1 from osd.0 172.17.131.32:6800/217031
2013-04-02 15:41:53.188461 7ffff2a8c700 1 -- 172.17.131.31:6789/0 <== osd.0 172.17.131.32:6800/217031 17 ==== pg_stats(2139 pgs tid 16 v 3) v1 ==== 780831+0+0 (4068350826 0 0) 0x8572580 con 0x8ef77a0
2013-04-02 15:41:53.188467 7ffff2a8c700 10 mon.cs31@0(leader).paxosservice(pgmap) dispatch pg_stats(2139 pgs tid 16 v 3) v1 from osd.0 172.17.131.32:6800/217031
2013-04-02 15:41:53.191039 7ffff2a8c700 1 -- 172.17.131.31:6789/0 <== osd.0 172.17.131.32:6800/217031 18 ==== pg_stats(2139 pgs tid 17 v 3) v1 ==== 780831+0+0 (3808505220 0 0) 0x917df00 con 0x8ef77a0
2013-04-02 15:41:53.191044 7ffff2a8c700 10 mon.cs31@0(leader).paxosservice(pgmap) dispatch pg_stats(2139 pgs tid 17 v 3) v1 from osd.0 172.17.131.32:6800/217031
2013-04-02 15:41:53.195468 7ffff2a8c700 1 -- 172.17.131.31:6789/0 <== osd.0 172.17.131.32:6800/217031 19 ==== pg_stats(1688 pgs tid 18 v 3) v1 ==== 616216+0+0 (786716567 0 0) 0x917dc80 con 0x8ef77a0
2013-04-02 15:41:53.195474 7ffff2a8c700 10 mon.cs31@0(leader).paxosservice(pgmap) dispatch pg_stats(1688 pgs tid 18 v 3) v1 from osd.0 172.17.131.32:6800/217031
2013-04-02 15:41:53.197490 7ffff2a8c700 1 -- 172.17.131.31:6789/0 <== osd.0 172.17.131.32:6800/217031 20 ==== pg_stats(1688 pgs tid 19 v 3) v1 ==== 616216+0+0 (1691256403 0 0) 0x917da00 con 0x8ef77a0
2013-04-02 15:41:53.197495 7ffff2a8c700 10 mon.cs31@0(leader).paxosservice(pgmap) dispatch pg_stats(1688 pgs tid 19 v 3) v1 from osd.0 172.17.131.32:6800/217031
2013-04-02 15:41:53.203550 7ffff2a8c700 1 -- 172.17.131.31:6789/0 <== osd.0 172.17.131.32:6800/217031 22 ==== pg_stats(1688 pgs tid 20 v 3) v1 ==== 616216+0+0 (2465465593 0 0) 0x917d780 con 0x8ef77a0
2013-04-02 15:41:53.203556 7ffff2a8c700 10 mon.cs31@0(leader).paxosservice(pgmap) dispatch pg_stats(1688 pgs tid 20 v 3) v1 from osd.0 172.17.131.32:6800/217031
2013-04-02 15:41:53.205756 7ffff2a8c700 1 -- 172.17.131.31:6789/0 <== osd.0 172.17.131.32:6800/217031 24 ==== pg_stats(1688 pgs tid 21 v 3) v1 ==== 616216+0+0 (4004343678 0 0) 0x9178500 con 0x8ef77a0
2013-04-02 15:41:53.205762 7ffff2a8c700 10 mon.cs31@0(leader).paxosservice(pgmap) dispatch pg_stats(1688 pgs tid 21 v 3) v1 from osd.0 172.17.131.32:6800/217031
2013-04-02 15:41:53.208096 7ffff2a8c700 1 -- 172.17.131.31:6789/0 <== osd.0 172.17.131.32:6800/217031 27 ==== pg_stats(1956 pgs tid 22 v 4) v1 ==== 716960+0+0 (2541888992 0 0) 0x7436400 con 0x8ef77a0
2013-04-02 15:41:53.208102 7ffff2a8c700 10 mon.cs31@0(leader).paxosservice(pgmap) dispatch pg_stats(1956 pgs tid 22 v 4) v1 from osd.0 172.17.131.32:6800/217031
2013-04-02 15:41:53.210496 7ffff2a8c700 1 -- 172.17.131.31:6789/0 <== osd.0 172.17.131.32:6800/217031 29 ==== pg_stats(1956 pgs tid 23 v 4) v1 ==== 716960+0+0 (3133069244 0 0) 0x7436180 con 0x8ef77a0
2013-04-02 15:41:53.210502 7ffff2a8c700 10 mon.cs31@0(leader).paxosservice(pgmap) dispatch pg_stats(1956 pgs tid 23 v 4) v1 from osd.0 172.17.131.32:6800/217031
2013-04-02 15:41:53.212759 7ffff2a8c700 1 -- 172.17.131.31:6789/0 <== osd.0 172.17.131.32:6800/217031 31 ==== pg_stats(1956 pgs tid 24 v 4) v1 ==== 716960+0+0 (584283437 0 0) 0x7435f00 con 0x8ef77a0
2013-04-02 15:41:53.212764 7ffff2a8c700 10 mon.cs31@0(leader).paxosservice(pgmap) dispatch pg_stats(1956 pgs tid 24 v 4) v1 from osd.0 172.17.131.32:6800/217031
2013-04-02 15:41:54.480038 7ffff328d700 1 -- 172.17.131.31:6789/0 --> 172.17.131.32:6800/217031 -- pg_stats_ack(2139 pgs tid 11) v1 -- ?+0 0x95ab800 con 0x8ef77a0
2013-04-02 15:41:54.480618 7ffff328d700 1 -- 172.17.131.31:6789/0 --> 172.17.131.32:6800/217031 -- pg_stats_ack(2139 pgs tid 12) v1 -- ?+0 0x9554b40 con 0x8ef77a0
2013-04-02 15:41:54.481098 7ffff1a8a700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).writer sending message 32 0x95ab800pg_stats_ack(2139 pgs tid 11) v1
2013-04-02 15:41:54.481564 7ffff328d700 1 -- 172.17.131.31:6789/0 --> 172.17.131.32:6800/217031 -- pg_stats_ack(2139 pgs tid 13) v1 -- ?+0 0x95539c0 con 0x8ef77a0
2013-04-02 15:41:54.482128 7ffff1a8a700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).writer sending message 33 0x9554b40pg_stats_ack(2139 pgs tid 12) v1
2013-04-02 15:41:54.483234 7ffff1a8a700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).writer sending message 34 0x95539c0pg_stats_ack(2139 pgs tid 13) v1
2013-04-02 15:41:54.484486 7ffff328d700 1 -- 172.17.131.31:6789/0 --> 172.17.131.32:6800/217031 -- pg_stats_ack(2139 pgs tid 14) v1 -- ?+0 0x883f000 con 0x8ef77a0
2013-04-02 15:41:54.485222 7ffff328d700 1 -- 172.17.131.31:6789/0 --> 172.17.131.32:6800/217031 -- pg_stats_ack(2139 pgs tid 15) v1 -- ?+0 0x883e3c0 con 0x8ef77a0
2013-04-02 15:41:54.485378 7ffff1a8a700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).writer sending message 35 0x883f000pg_stats_ack(2139 pgs tid 14) v1
2013-04-02 15:41:54.486164 7ffff328d700 1 -- 172.17.131.31:6789/0 --> 172.17.131.32:6800/217031 -- pg_stats_ack(2139 pgs tid 16) v1 -- ?+0 0x8bdbb80 con 0x8ef77a0
2013-04-02 15:41:54.486358 7ffff1a8a700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).writer sending message 36 0x883e3c0pg_stats_ack(2139 pgs tid 15) v1
2013-04-02 15:41:54.487053 7ffff328d700 1 -- 172.17.131.31:6789/0 --> 172.17.131.32:6800/217031 -- pg_stats_ack(2139 pgs tid 17) v1 -- ?+0 0x8bdb480 con 0x8ef77a0
2013-04-02 15:41:54.487436 7ffff1a8a700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).writer sending message 37 0x8bdbb80pg_stats_ack(2139 pgs tid 16) v1
2013-04-02 15:41:54.488260 7ffff328d700 1 -- 172.17.131.31:6789/0 --> 172.17.131.32:6800/217031 -- pg_stats_ack(1688 pgs tid 18) v1 -- ?+0 0x8bdd400 con 0x8ef77a0
2013-04-02 15:41:54.488398 7ffff1a8a700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).writer sending message 38 0x8bdb480pg_stats_ack(2139 pgs tid 17) v1
2013-04-02 15:41:54.489004 7ffff328d700 1 -- 172.17.131.31:6789/0 --> 172.17.131.32:6800/217031 -- pg_stats_ack(1688 pgs tid 19) v1 -- ?+0 0x8bdd080 con 0x8ef77a0
2013-04-02 15:41:54.489306 7ffff1a8a700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).writer sending message 39 0x8bdd400pg_stats_ack(1688 pgs tid 18) v1
2013-04-02 15:41:54.489996 7ffff328d700 1 -- 172.17.131.31:6789/0 --> 172.17.131.32:6800/217031 -- pg_stats_ack(1688 pgs tid 20) v1 -- ?+0 0x88601c0 con 0x8ef77a0
2013-04-02 15:41:54.490083 7ffff1a8a700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).writer sending message 40 0x8bdd080pg_stats_ack(1688 pgs tid 19) v1
2013-04-02 15:41:54.490442 7ffff328d700 1 -- 172.17.131.31:6789/0 --> 172.17.131.32:6800/217031 -- pg_stats_ack(1688 pgs tid 21) v1 -- ?+0 0x8865e80 con 0x8ef77a0
2013-04-02 15:41:54.490837 7ffff1a8a700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).writer sending message 41 0x88601c0pg_stats_ack(1688 pgs tid 20) v1
2013-04-02 15:41:54.491416 7ffff328d700 1 -- 172.17.131.31:6789/0 --> 172.17.131.32:6800/217031 -- pg_stats_ack(1956 pgs tid 22) v1 -- ?+0 0x8bddb00 con 0x8ef77a0
2013-04-02 15:41:54.491807 7ffff1a8a700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).writer sending message 42 0x8865e80pg_stats_ack(1688 pgs tid 21) v1
2013-04-02 15:41:54.492195 7ffff328d700 1 -- 172.17.131.31:6789/0 --> 172.17.131.32:6800/217031 -- pg_stats_ack(1956 pgs tid 23) v1 -- ?+0 0x8bde900 con 0x8ef77a0
2013-04-02 15:41:54.492728 7ffff1a8a700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).writer sending message 43 0x8bddb00pg_stats_ack(1956 pgs tid 22) v1
2013-04-02 15:41:54.493313 7ffff328d700 1 -- 172.17.131.31:6789/0 --> 172.17.131.32:6800/217031 -- pg_stats_ack(1956 pgs tid 24) v1 -- ?+0 0x8bde580 con 0x8ef77a0
2013-04-02 15:41:54.493707 7ffff1a8a700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).writer sending message 44 0x8bde900pg_stats_ack(1956 pgs tid 23) v1
2013-04-02 15:41:54.494585 7ffff1a8a700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).writer sending message 45 0x8bde580pg_stats_ack(1956 pgs tid 24) v1
2013-04-02 15:42:00.522781 7fffa94ed700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).reader got message 36 0x7435c80 pg_stats(108 pgs tid 25 v 7) v1
2013-04-02 15:42:05.525728 7fffa94ed700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).reader got message 38 0x7435a00 pg_stats(108 pgs tid 26 v 7) v1
2013-04-02 15:42:10.530772 7fffa94ed700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).reader got message 39 0x7435780 pg_stats(108 pgs tid 27 v 8) v1
2013-04-02 15:42:15.565118 7fffa94ed700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).reader got message 42 0x1fa52080 pg_stats(108 pgs tid 28 v 11) v1
2013-04-02 15:42:33.891896 7fffa91ea700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917eb80 sd=6 :6789 s=2 pgs=3 cs=1 l=1).reader got message 4 0x1fa52080 pg_stats(108 pgs tid 30 v 14) v1
2013-04-02 15:42:33.892425 7fffa91ea700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917eb80 sd=6 :6789 s=2 pgs=3 cs=1 l=1).reader got message 5 0x7435780 pg_stats(108 pgs tid 31 v 14) v1
2013-04-02 15:42:33.893026 7fffa91ea700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917eb80 sd=6 :6789 s=2 pgs=3 cs=1 l=1).reader got message 9 0x1fb1fa80 pg_stats(108 pgs tid 32 v 16) v1
2013-04-02 15:42:34.574255 7fffa91ea700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917eb80 sd=6 :6789 s=2 pgs=3 cs=1 l=1).reader got message 11 0x1fb1fd00 pg_stats(108 pgs tid 33 v 16) v1
2013-04-02 15:42:39.575512 7fffa91ea700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917eb80 sd=6 :6789 s=2 pgs=3 cs=1 l=1).reader got message 12 0x1b4aaf80 pg_stats(108 pgs tid 34 v 16) v1
2013-04-02 15:42:48.310568 7ffff2a8c700 1 -- 172.17.131.31:6789/0 <== osd.0 172.17.131.32:6800/217031 4 ==== pg_stats(108 pgs tid 30 v 14) v1 ==== 40520+0+0 (2643097968 0 0) 0x1fa52080 con 0x8519760
2013-04-02 15:42:48.310576 7ffff2a8c700 10 mon.cs31@0(leader).paxosservice(pgmap) dispatch pg_stats(108 pgs tid 30 v 14) v1 from osd.0 172.17.131.32:6800/217031
2013-04-02 15:42:48.589731 7ffff328d700 1 -- 172.17.131.31:6789/0 --> 172.17.131.32:6800/217031 -- pg_stats_ack(108 pgs tid 30) v1 -- ?+0 0x8863d40 con 0x8519760
2013-04-02 15:42:48.589885 7fffa8659700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917eb80 sd=6 :6789 s=2 pgs=3 cs=1 l=1).writer sending message 5 0x8863d40pg_stats_ack(108 pgs tid 30) v1
2013-04-02 15:42:50.439778 7ffff2a8c700 1 -- 172.17.131.31:6789/0 <== osd.0 172.17.131.32:6800/217031 5 ==== pg_stats(108 pgs tid 31 v 14) v1 ==== 40520+0+0 (3699058855 0 0) 0x7435780 con 0x8519760
2013-04-02 15:42:50.439786 7ffff2a8c700 10 mon.cs31@0(leader).paxosservice(pgmap) dispatch pg_stats(108 pgs tid 31 v 14) v1 from osd.0 172.17.131.32:6800/217031
2013-04-02 15:42:50.439869 7ffff2a8c700 1 -- 172.17.131.31:6789/0 --> 172.17.131.32:6800/217031 -- pg_stats_ack(108 pgs tid 0) v1 -- ?+0 0xc26abc0 con 0x8519760
2013-04-02 15:42:50.439926 7fffa8659700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917eb80 sd=6 :6789 s=2 pgs=3 cs=1 l=1).writer sending message 6 0xc26abc0pg_stats_ack(108 pgs tid 0) v1
2013-04-02 15:42:50.442425 7fffa91ea700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917eb80 sd=6 :6789 s=2 pgs=3 cs=1 l=1).reader got message 14 0x2dc3c00 pg_stats(108 pgs tid 35 v 17) v1
2013-04-02 15:42:50.540570 7fffa91ea700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917eb80 sd=6 :6789 s=2 pgs=3 cs=1 l=1).reader got message 15 0x154a8780 pg_stats(108 pgs tid 36 v 18) v1
2013-04-02 15:42:51.087764 7ffff2a8c700 1 -- 172.17.131.31:6789/0 <== osd.0 172.17.131.32:6800/217031 9 ==== pg_stats(108 pgs tid 32 v 16) v1 ==== 40520+0+0 (3609079067 0 0) 0x1fb1fa80 con 0x8519760
2013-04-02 15:42:51.087770 7ffff2a8c700 10 mon.cs31@0(leader).paxosservice(pgmap) dispatch pg_stats(108 pgs tid 32 v 16) v1 from osd.0 172.17.131.32:6800/217031
2013-04-02 15:42:52.092802 7ffff328d700 1 -- 172.17.131.31:6789/0 --> 172.17.131.32:6800/217031 -- pg_stats_ack(108 pgs tid 32) v1 -- ?+0 0x1c62db00 con 0x8519760
2013-04-02 15:42:52.092869 7fffa8659700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917eb80 sd=6 :6789 s=2 pgs=3 cs=1 l=1).writer sending message 13 0x1c62db00pg_stats_ack(108 pgs tid 32) v1
2013-04-02 15:42:54.594268 7fffa91ea700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917eb80 sd=6 :6789 s=2 pgs=3 cs=1 l=1).reader got message 19 0x154a8280 pg_stats(108 pgs tid 37 v 20) v1
2013-04-02 15:43:22.177993 7fffa91ea700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917eb80 sd=6 :6789 s=2 pgs=3 cs=1 l=1).reader got message 20 0x154a8000 pg_stats(108 pgs tid 38 v 20) v1
2013-04-02 15:43:22.304697 7fffa91ea700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917eb80 sd=6 :6789 s=2 pgs=3 cs=1 l=1).reader got message 22 0x88f6400 pg_stats(108 pgs tid 39 v 20) v1
2013-04-02 15:43:22.354425 7fffa91ea700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917eb80 sd=6 :6789 s=2 pgs=3 cs=1 l=1).reader got message 23 0x181b9400 pg_stats(108 pgs tid 40 v 20) v1
2013-04-02 15:43:22.417111 7fffa91ea700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917eb80 sd=6 :6789 s=2 pgs=3 cs=1 l=1).reader got message 25 0xb42d780 pg_stats(108 pgs tid 41 v 20) v1
2013-04-02 15:43:22.454599 7fffa91ea700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917eb80 sd=6 :6789 s=2 pgs=3 cs=1 l=1).reader got message 26 0xb42f300 pg_stats(108 pgs tid 42 v 20) v1
Here is OSD 0's idea of its PGStat interaction with the mon:
# cat osd.0.log | grep -i stats
2013-04-02 15:39:58.909877 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(2139 pgs tid 1 v 2) v1 -- ?+0 0x1748500 con 0x16098c0
2013-04-02 15:39:58.918630 7ffff7fba700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748000 sd=27 :45108 s=2 pgs=8 cs=1 l=1).writer sending message 13 0x1748500pg_stats(2139 pgs tid 1 v 2) v1
2013-04-02 15:40:03.915544 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(2139 pgs tid 2 v 2) v1 -- ?+0 0x1748780 con 0x16098c0
2013-04-02 15:40:03.922304 7ffff7fba700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748000 sd=27 :45108 s=2 pgs=8 cs=1 l=1).writer sending message 15 0x1748780pg_stats(2139 pgs tid 2 v 2) v1
2013-04-02 15:40:08.920874 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(2139 pgs tid 3 v 2) v1 -- ?+0 0x1748500 con 0x16098c0
2013-04-02 15:40:08.929491 7ffff7fba700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748000 sd=27 :45108 s=2 pgs=8 cs=1 l=1).writer sending message 17 0x1748500pg_stats(2139 pgs tid 3 v 2) v1
2013-04-02 15:40:13.928781 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(2139 pgs tid 4 v 2) v1 -- ?+0 0x1748a00 con 0x16098c0
2013-04-02 15:40:13.937503 7ffff7fba700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748000 sd=27 :45108 s=2 pgs=8 cs=1 l=1).writer sending message 19 0x1748a00pg_stats(2139 pgs tid 4 v 2) v1
2013-04-02 15:40:18.936302 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(2139 pgs tid 5 v 2) v1 -- ?+0 0x1748780 con 0x16098c0
2013-04-02 15:40:18.944744 7ffff7fba700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748000 sd=27 :45108 s=2 pgs=8 cs=1 l=1).writer sending message 21 0x1748780pg_stats(2139 pgs tid 5 v 2) v1
2013-04-02 15:40:23.941554 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(2139 pgs tid 6 v 2) v1 -- ?+0 0x1748500 con 0x16098c0
2013-04-02 15:40:23.949853 7ffff7fba700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748000 sd=27 :45108 s=2 pgs=8 cs=1 l=1).writer sending message 23 0x1748500pg_stats(2139 pgs tid 6 v 2) v1
2013-04-02 15:40:28.944178 7ffff387a700 1 osd.0 2 mon hasn't acked PGStats in 30.034349 seconds, reconnecting elsewhere
2013-04-02 15:40:42.798689 7fffe485c700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(2139 pgs tid 7 v 2) v1 -- ?+0 0x1748a00 con 0x1609e40
2013-04-02 15:40:42.798728 7fffe485c700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(2139 pgs tid 8 v 2) v1 -- ?+0 0x1748000 con 0x1609e40
2013-04-02 15:40:42.798813 7fffe485c700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(2139 pgs tid 9 v 3) v1 -- ?+0 0x1748500 con 0x1609e40
2013-04-02 15:40:42.798833 7fffe485c700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(2139 pgs tid 10 v 3) v1 -- ?+0 0x1748780 con 0x1609e40
2013-04-02 15:40:42.804039 7fffcce2c700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).writer sending message 5 0x1748a00pg_stats(2139 pgs tid 7 v 2) v1
2013-04-02 15:40:42.815140 7fffcce2c700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).writer sending message 7 0x1748000pg_stats(2139 pgs tid 8 v 2) v1
2013-04-02 15:40:42.825539 7fffcce2c700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).writer sending message 9 0x1748500pg_stats(2139 pgs tid 9 v 3) v1
2013-04-02 15:40:42.833093 7fffcce2c700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).writer sending message 10 0x1748780pg_stats(2139 pgs tid 10 v 3) v1
2013-04-02 15:40:44.332101 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(2139 pgs tid 11 v 3) v1 -- ?+0 0x1749180 con 0x1609e40
2013-04-02 15:40:44.340297 7fffcce2c700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).writer sending message 11 0x1749180pg_stats(2139 pgs tid 11 v 3) v1
2013-04-02 15:40:49.338071 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(2139 pgs tid 12 v 3) v1 -- ?+0 0x1748a00 con 0x1609e40
2013-04-02 15:40:49.342361 7fffcce2c700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).writer sending message 12 0x1748a00pg_stats(2139 pgs tid 12 v 3) v1
2013-04-02 15:40:50.977357 7fffcd82e700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).reader got message 9 0x164fe00 pg_stats_ack(2139 pgs tid 7) v1
2013-04-02 15:40:50.977597 7fffe485c700 1 -- 172.17.131.32:6800/217031 <== mon.0 172.17.131.31:6789/0 9 ==== pg_stats_ack(2139 pgs tid 7) v1 ==== 62035+0+0 (726597769 0 0) 0x164fe00 con 0x1609e40
2013-04-02 15:40:52.187136 7fffcd82e700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).reader got message 12 0x164f380 pg_stats_ack(2139 pgs tid 0) v1
2013-04-02 15:40:52.187290 7fffe485c700 1 -- 172.17.131.32:6800/217031 <== mon.0 172.17.131.31:6789/0 12 ==== pg_stats_ack(2139 pgs tid 0) v1 ==== 62035+0+0 (726597769 0 0) 0x164f380 con 0x1609e40
2013-04-02 15:40:54.344074 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(2139 pgs tid 13 v 3) v1 -- ?+0 0x1748000 con 0x1609e40
2013-04-02 15:40:54.348415 7fffcce2c700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).writer sending message 13 0x1748000pg_stats(2139 pgs tid 13 v 3) v1
2013-04-02 15:40:59.351607 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(2139 pgs tid 14 v 3) v1 -- ?+0 0x1748500 con 0x1609e40
2013-04-02 15:40:59.359988 7fffcce2c700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).writer sending message 15 0x1748500pg_stats(2139 pgs tid 14 v 3) v1
2013-04-02 15:41:04.359122 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(2139 pgs tid 15 v 3) v1 -- ?+0 0x1748780 con 0x1609e40
2013-04-02 15:41:04.364092 7fffcce2c700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).writer sending message 16 0x1748780pg_stats(2139 pgs tid 15 v 3) v1
2013-04-02 15:41:09.366106 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(2139 pgs tid 16 v 3) v1 -- ?+0 0x1749400 con 0x1609e40
2013-04-02 15:41:09.374543 7fffcce2c700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).writer sending message 17 0x1749400pg_stats(2139 pgs tid 16 v 3) v1
2013-04-02 15:41:14.371889 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(2139 pgs tid 17 v 3) v1 -- ?+0 0x1748a00 con 0x1609e40
2013-04-02 15:41:14.380211 7fffcce2c700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).writer sending message 18 0x1748a00pg_stats(2139 pgs tid 17 v 3) v1
2013-04-02 15:41:18.240897 7fffcd82e700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).reader got message 15 0x164fe00 pg_stats_ack(2139 pgs tid 9) v1
2013-04-02 15:41:18.241056 7fffe485c700 1 -- 172.17.131.32:6800/217031 <== mon.0 172.17.131.31:6789/0 15 ==== pg_stats_ack(2139 pgs tid 9) v1 ==== 62035+0+0 (198780535 0 0) 0x164fe00 con 0x1609e40
2013-04-02 15:41:19.380719 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(1688 pgs tid 18 v 3) v1 -- ?+0 0x1749180 con 0x1609e40
2013-04-02 15:41:19.387148 7fffcce2c700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).writer sending message 19 0x1749180pg_stats(1688 pgs tid 18 v 3) v1
2013-04-02 15:41:24.389573 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(1688 pgs tid 19 v 3) v1 -- ?+0 0x1748780 con 0x1609e40
2013-04-02 15:41:24.396022 7fffcce2c700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).writer sending message 20 0x1748780pg_stats(1688 pgs tid 19 v 3) v1
2013-04-02 15:41:29.396986 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(1688 pgs tid 20 v 3) v1 -- ?+0 0x1748500 con 0x1609e40
2013-04-02 15:41:29.403403 7fffcce2c700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).writer sending message 22 0x1748500pg_stats(1688 pgs tid 20 v 3) v1
2013-04-02 15:41:34.401630 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(1688 pgs tid 21 v 3) v1 -- ?+0 0x1748000 con 0x1609e40
2013-04-02 15:41:34.407929 7fffcce2c700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).writer sending message 24 0x1748000pg_stats(1688 pgs tid 21 v 3) v1
2013-04-02 15:41:39.519241 7fffcd82e700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).reader got message 16 0x164d5c0 pg_stats_ack(2139 pgs tid 10) v1
2013-04-02 15:41:39.519384 7fffe485c700 1 -- 172.17.131.32:6800/217031 <== mon.0 172.17.131.31:6789/0 16 ==== pg_stats_ack(2139 pgs tid 10) v1 ==== 62035+0+0 (162752886 0 0) 0x164d5c0 con 0x1609e40
2013-04-02 15:41:41.416252 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(1956 pgs tid 22 v 4) v1 -- ?+0 0x174be80 con 0x1609e40
2013-04-02 15:41:41.423974 7fffcce2c700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).writer sending message 27 0x174be80pg_stats(1956 pgs tid 22 v 4) v1
2013-04-02 15:41:46.421537 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(1956 pgs tid 23 v 4) v1 -- ?+0 0x174bc00 con 0x1609e40
2013-04-02 15:41:46.429102 7fffcce2c700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).writer sending message 29 0x174bc00pg_stats(1956 pgs tid 23 v 4) v1
2013-04-02 15:41:51.425198 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(1956 pgs tid 24 v 4) v1 -- ?+0 0x174b980 con 0x1609e40
2013-04-02 15:41:51.433005 7fffcce2c700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).writer sending message 31 0x174b980pg_stats(1956 pgs tid 24 v 4) v1
2013-04-02 15:41:54.483344 7fffcd82e700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).reader got message 32 0x4732bc0 pg_stats_ack(2139 pgs tid 11) v1
2013-04-02 15:41:54.483470 7fffe485c700 1 -- 172.17.131.32:6800/217031 <== mon.0 172.17.131.31:6789/0 32 ==== pg_stats_ack(2139 pgs tid 11) v1 ==== 62035+0+0 (162752886 0 0) 0x4732bc0 con 0x1609e40
2013-04-02 15:41:54.484667 7fffcd82e700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).reader got message 33 0x4732680 pg_stats_ack(2139 pgs tid 12) v1
2013-04-02 15:41:54.484796 7fffe485c700 1 -- 172.17.131.32:6800/217031 <== mon.0 172.17.131.31:6789/0 33 ==== pg_stats_ack(2139 pgs tid 12) v1 ==== 62035+0+0 (162752886 0 0) 0x4732680 con 0x1609e40
2013-04-02 15:41:54.485954 7fffcd82e700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).reader got message 34 0x4732840 pg_stats_ack(2139 pgs tid 13) v1
2013-04-02 15:41:54.486291 7fffe485c700 1 -- 172.17.131.32:6800/217031 <== mon.0 172.17.131.31:6789/0 34 ==== pg_stats_ack(2139 pgs tid 13) v1 ==== 62035+0+0 (162752886 0 0) 0x4732840 con 0x1609e40
2013-04-02 15:41:54.487671 7fffcd82e700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).reader got message 35 0x4736040 pg_stats_ack(2139 pgs tid 14) v1
2013-04-02 15:41:54.487791 7fffe485c700 1 -- 172.17.131.32:6800/217031 <== mon.0 172.17.131.31:6789/0 35 ==== pg_stats_ack(2139 pgs tid 14) v1 ==== 62035+0+0 (162752886 0 0) 0x4736040 con 0x1609e40
2013-04-02 15:41:54.488818 7fffcd82e700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).reader got message 36 0x4b34440 pg_stats_ack(2139 pgs tid 15) v1
2013-04-02 15:41:54.489018 7fffe485c700 1 -- 172.17.131.32:6800/217031 <== mon.0 172.17.131.31:6789/0 36 ==== pg_stats_ack(2139 pgs tid 15) v1 ==== 62035+0+0 (162752886 0 0) 0x4b34440 con 0x1609e40
2013-04-02 15:41:54.490044 7fffcd82e700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).reader got message 37 0x164fa80 pg_stats_ack(2139 pgs tid 16) v1
2013-04-02 15:41:54.490173 7fffe485c700 1 -- 172.17.131.32:6800/217031 <== mon.0 172.17.131.31:6789/0 37 ==== pg_stats_ack(2139 pgs tid 16) v1 ==== 62035+0+0 (162752886 0 0) 0x164fa80 con 0x1609e40
2013-04-02 15:41:54.491252 7fffcd82e700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).reader got message 38 0x164f700 pg_stats_ack(2139 pgs tid 17) v1
2013-04-02 15:41:54.491376 7fffe485c700 1 -- 172.17.131.32:6800/217031 <== mon.0 172.17.131.31:6789/0 38 ==== pg_stats_ack(2139 pgs tid 17) v1 ==== 62035+0+0 (162752886 0 0) 0x164f700 con 0x1609e40
2013-04-02 15:41:54.492139 7fffcd82e700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).reader got message 39 0x4b34d00 pg_stats_ack(1688 pgs tid 18) v1
2013-04-02 15:41:54.492427 7fffe485c700 1 -- 172.17.131.32:6800/217031 <== mon.0 172.17.131.31:6789/0 39 ==== pg_stats_ack(1688 pgs tid 18) v1 ==== 48956+0+0 (1148031672 0 0) 0x4b34d00 con 0x1609e40
2013-04-02 15:41:54.493228 7fffcd82e700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).reader got message 40 0x3e63f00 pg_stats_ack(1688 pgs tid 19) v1
2013-04-02 15:41:54.493359 7fffe485c700 1 -- 172.17.131.32:6800/217031 <== mon.0 172.17.131.31:6789/0 40 ==== pg_stats_ack(1688 pgs tid 19) v1 ==== 48956+0+0 (1148031672 0 0) 0x3e63f00 con 0x1609e40
2013-04-02 15:41:54.494104 7fffcd82e700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).reader got message 41 0x3e63b80 pg_stats_ack(1688 pgs tid 20) v1
2013-04-02 15:41:54.494228 7fffe485c700 1 -- 172.17.131.32:6800/217031 <== mon.0 172.17.131.31:6789/0 41 ==== pg_stats_ack(1688 pgs tid 20) v1 ==== 48956+0+0 (1148031672 0 0) 0x3e63b80 con 0x1609e40
2013-04-02 15:41:54.494914 7fffcd82e700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).reader got message 42 0x3e640c0 pg_stats_ack(1688 pgs tid 21) v1
2013-04-02 15:41:54.494979 7fffe485c700 1 -- 172.17.131.32:6800/217031 <== mon.0 172.17.131.31:6789/0 42 ==== pg_stats_ack(1688 pgs tid 21) v1 ==== 48956+0+0 (1148031672 0 0) 0x3e640c0 con 0x1609e40
2013-04-02 15:41:54.495941 7fffcd82e700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).reader got message 43 0x17f5400 pg_stats_ack(1956 pgs tid 22) v1
2013-04-02 15:41:54.496066 7fffe485c700 1 -- 172.17.131.32:6800/217031 <== mon.0 172.17.131.31:6789/0 43 ==== pg_stats_ack(1956 pgs tid 22) v1 ==== 56728+0+0 (217165409 0 0) 0x17f5400 con 0x1609e40
2013-04-02 15:41:54.496825 7fffcd82e700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).reader got message 44 0x183fa80 pg_stats_ack(1956 pgs tid 23) v1
2013-04-02 15:41:54.496945 7fffe485c700 1 -- 172.17.131.32:6800/217031 <== mon.0 172.17.131.31:6789/0 44 ==== pg_stats_ack(1956 pgs tid 23) v1 ==== 56728+0+0 (217165409 0 0) 0x183fa80 con 0x1609e40
2013-04-02 15:41:54.497867 7fffcd82e700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).reader got message 45 0x1839a40 pg_stats_ack(1956 pgs tid 24) v1
2013-04-02 15:41:54.497983 7fffe485c700 1 -- 172.17.131.32:6800/217031 <== mon.0 172.17.131.31:6789/0 45 ==== pg_stats_ack(1956 pgs tid 24) v1 ==== 56728+0+0 (217165409 0 0) 0x1839a40 con 0x1609e40
2013-04-02 15:42:00.522449 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(108 pgs tid 25 v 7) v1 -- ?+0 0x4031b80 con 0x1609e40
2013-04-02 15:42:00.523046 7fffcce2c700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).writer sending message 36 0x4031b80pg_stats(108 pgs tid 25 v 7) v1
2013-04-02 15:42:05.525433 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(108 pgs tid 26 v 7) v1 -- ?+0 0x4031900 con 0x1609e40
2013-04-02 15:42:05.526040 7fffcce2c700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).writer sending message 38 0x4031900pg_stats(108 pgs tid 26 v 7) v1
2013-04-02 15:42:10.530320 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(108 pgs tid 27 v 8) v1 -- ?+0 0x4031680 con 0x1609e40
2013-04-02 15:42:10.530987 7fffcce2c700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).writer sending message 39 0x4031680pg_stats(108 pgs tid 27 v 8) v1
2013-04-02 15:42:15.564684 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(108 pgs tid 28 v 11) v1 -- ?+0 0x4031400 con 0x1609e40
2013-04-02 15:42:15.565270 7fffcce2c700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).writer sending message 42 0x4031400pg_stats(108 pgs tid 28 v 11) v1
2013-04-02 15:42:20.568779 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(108 pgs tid 29 v 13) v1 -- ?+0 0x5861180 con 0x1609e40
2013-04-02 15:42:20.569360 7fffcce2c700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).writer sending message 45 0x5861180pg_stats(108 pgs tid 29 v 13) v1
2013-04-02 15:42:24.570963 7ffff387a700 1 osd.0 14 mon hasn't acked PGStats in 30.072959 seconds, reconnecting elsewhere
2013-04-02 15:42:33.891060 7fffe485c700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(108 pgs tid 30 v 14) v1 -- ?+0 0x6794380 con 0x5819e40
2013-04-02 15:42:33.891072 7fffe485c700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(108 pgs tid 31 v 14) v1 -- ?+0 0x4031680 con 0x5819e40
2013-04-02 15:42:33.891149 7fffe485c700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(108 pgs tid 32 v 16) v1 -- ?+0 0x6793e80 con 0x5819e40
2013-04-02 15:42:33.891808 7fff9bc1f700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x5862800 sd=27 :47851 s=2 pgs=1425 cs=1 l=1).writer sending message 4 0x6794380pg_stats(108 pgs tid 30 v 14) v1
2013-04-02 15:42:33.892402 7fff9bc1f700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x5862800 sd=27 :47851 s=2 pgs=1425 cs=1 l=1).writer sending message 5 0x4031680pg_stats(108 pgs tid 31 v 14) v1
2013-04-02 15:42:33.893054 7fff9bc1f700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x5862800 sd=27 :47851 s=2 pgs=1425 cs=1 l=1).writer sending message 9 0x6793e80pg_stats(108 pgs tid 32 v 16) v1
2013-04-02 15:42:34.573582 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(108 pgs tid 33 v 16) v1 -- ?+0 0x6793700 con 0x5819e40
2013-04-02 15:42:34.574136 7fff9bc1f700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x5862800 sd=27 :47851 s=2 pgs=1425 cs=1 l=1).writer sending message 11 0x6793700pg_stats(108 pgs tid 33 v 16) v1
2013-04-02 15:42:39.575223 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(108 pgs tid 34 v 16) v1 -- ?+0 0x6792d00 con 0x5819e40
2013-04-02 15:42:39.575817 7fff9bc1f700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x5862800 sd=27 :47851 s=2 pgs=1425 cs=1 l=1).writer sending message 12 0x6792d00pg_stats(108 pgs tid 34 v 16) v1
2013-04-02 15:42:44.576975 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(108 pgs tid 35 v 17) v1 -- ?+0 0x6795f00 con 0x5819e40
2013-04-02 15:42:44.577611 7fff9bc1f700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x5862800 sd=27 :47851 s=2 pgs=1425 cs=1 l=1).writer sending message 14 0x6795f00pg_stats(108 pgs tid 35 v 17) v1
2013-04-02 15:42:48.591071 7fffcd82e700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x5862800 sd=27 :47851 s=2 pgs=1425 cs=1 l=1).reader got message 5 0x443d080 pg_stats_ack(108 pgs tid 30) v1
2013-04-02 15:42:48.591184 7fffe485c700 1 -- 172.17.131.32:6800/217031 <== mon.0 172.17.131.31:6789/0 5 ==== pg_stats_ack(108 pgs tid 30) v1 ==== 3136+0+0 (3131392639 0 0) 0x443d080 con 0x5819e40
2013-04-02 15:42:49.577888 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(108 pgs tid 36 v 18) v1 -- ?+0 0x6795c80 con 0x5819e40
2013-04-02 15:42:49.578479 7fff9bc1f700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x5862800 sd=27 :47851 s=2 pgs=1425 cs=1 l=1).writer sending message 15 0x6795c80pg_stats(108 pgs tid 36 v 18) v1
2013-04-02 15:42:50.441073 7fffcd82e700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x5862800 sd=27 :47851 s=2 pgs=1425 cs=1 l=1).reader got message 6 0x1823480 pg_stats_ack(108 pgs tid 0) v1
2013-04-02 15:42:50.441140 7fffe485c700 1 -- 172.17.131.32:6800/217031 <== mon.0 172.17.131.31:6789/0 6 ==== pg_stats_ack(108 pgs tid 0) v1 ==== 3136+0+0 (3131392639 0 0) 0x1823480 con 0x5819e40
2013-04-02 15:42:52.094048 7fffcd82e700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x5862800 sd=27 :47851 s=2 pgs=1425 cs=1 l=1).reader got message 13 0x6cbec80 pg_stats_ack(108 pgs tid 32) v1
2013-04-02 15:42:52.094159 7fffe485c700 1 -- 172.17.131.32:6800/217031 <== mon.0 172.17.131.31:6789/0 13 ==== pg_stats_ack(108 pgs tid 32) v1 ==== 3136+0+0 (3721100430 0 0) 0x6cbec80 con 0x5819e40
2013-04-02 15:42:54.594067 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(108 pgs tid 37 v 20) v1 -- ?+0 0x6797580 con 0x5819e40
2013-04-02 15:42:54.594729 7fff9bc1f700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x5862800 sd=27 :47851 s=2 pgs=1425 cs=1 l=1).writer sending message 19 0x6797580pg_stats(108 pgs tid 37 v 20) v1
2013-04-02 15:42:59.658795 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(108 pgs tid 38 v 20) v1 -- ?+0 0x174a300 con 0x5819e40
2013-04-02 15:42:59.659279 7fff9bc1f700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x5862800 sd=27 :47851 s=2 pgs=1425 cs=1 l=1).writer sending message 20 0x174a300pg_stats(108 pgs tid 38 v 20) v1
2013-04-02 15:43:04.660446 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(108 pgs tid 39 v 20) v1 -- ?+0 0x6793200 con 0x5819e40
2013-04-02 15:43:04.660952 7fff9bc1f700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x5862800 sd=27 :47851 s=2 pgs=1425 cs=1 l=1).writer sending message 22 0x6793200pg_stats(108 pgs tid 39 v 20) v1
2013-04-02 15:43:09.661467 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(108 pgs tid 40 v 20) v1 -- ?+0 0x6792a80 con 0x5819e40
2013-04-02 15:43:09.661950 7fff9bc1f700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x5862800 sd=27 :47851 s=2 pgs=1425 cs=1 l=1).writer sending message 23 0x6792a80pg_stats(108 pgs tid 40 v 20) v1
2013-04-02 15:43:14.662762 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(108 pgs tid 41 v 20) v1 -- ?+0 0x6792300 con 0x5819e40
2013-04-02 15:43:14.663314 7fff9bc1f700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x5862800 sd=27 :47851 s=2 pgs=1425 cs=1 l=1).writer sending message 25 0x6792300pg_stats(108 pgs tid 41 v 20) v1
2013-04-02 15:43:19.664054 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(108 pgs tid 42 v 20) v1 -- ?+0 0x6791b80 con 0x5819e40
2013-04-02 15:43:19.664776 7fff9bc1f700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x5862800 sd=27 :47851 s=2 pgs=1425 cs=1 l=1).writer sending message 26 0x6791b80pg_stats(108 pgs tid 42 v 20) v1
2013-04-02 15:43:22.664921 7ffff387a700 1 osd.0 20 mon hasn't acked PGStats in 30.570739 seconds, reconnecting elsewhere
2013-04-02 15:43:52.672461 7ffff387a700 1 osd.0 20 mon hasn't acked PGStats in 30.007379 seconds, reconnecting elsewhere
2013-04-02 15:44:22.680691 7ffff387a700 1 osd.0 20 mon hasn't acked PGStats in 30.007095 seconds, reconnecting elsewhere
2013-04-02 15:44:52.688936 7ffff387a700 1 osd.0 20 mon hasn't acked PGStats in 30.007221 seconds, reconnecting elsewhere
2013-04-02 15:45:22.702066 7ffff387a700 1 osd.0 20 mon hasn't acked PGStats in 30.012264 seconds, reconnecting elsewhere
2013-04-02 15:45:52.711134 7ffff387a700 1 osd.0 20 mon hasn't acked PGStats in 30.007957 seconds, reconnecting elsewhere
2013-04-02 15:46:22.719845 7ffff387a700 1 osd.0 20 mon hasn't acked PGStats in 30.007859 seconds, reconnecting elsewhere
2013-04-02 15:46:52.729475 7ffff387a700 1 osd.0 20 mon hasn't acked PGStats in 30.008756 seconds, reconnecting elsewhere
Why would OSD 0 stop trying to send PGStats to the mon?
Maybe the above is all normal behavior, so any thoughts on what I
should be looking for instead are greatly appreciated.
On the bright side, this issue is completely reproducible for me,
and I'll be happy to test anything you think might help sort it out.
-- Jim
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Trouble getting a new file system to start, for v0.59 and newer
2013-04-03 15:46 Trouble getting a new file system to start, for v0.59 and newer Jim Schutt
@ 2013-04-03 15:58 ` Sage Weil
2013-04-03 17:09 ` Jim Schutt
0 siblings, 1 reply; 12+ messages in thread
From: Sage Weil @ 2013-04-03 15:58 UTC (permalink / raw)
To: Jim Schutt; +Cc: Joao Eduardo Luis, ceph-devel@vger.kernel.org
Hi Jim,
What happens if you change 'osd mon ack timeout = 300' (from the
default of 30)? I suspect part of the problem is that the mons are just
slow enough that the osd's resend the same thing again and it snowballs
into more work for the monitor.
sage
On Wed, 3 Apr 2013, Jim Schutt wrote:
> Hi Joao,
>
> I alluded in an earlier thread about an issue I've been recently
> having with starting a new filesystem, which I thought I had
> tracked into the paxos subsystem. I believe I started having
> this trouble when I started testing v0.59, and it's still there
> in v0.60.
>
> The basic configuration is 1 mon, 1 mds, 576 OSDs. Some time
> after starting the mon and the OSDs, PG creation progress stops.
>
> I'm finding this in my OSD logs, which I've been thinking is
> suggestive of why OSDs stop making progress:
>
> # cat osd.0.log | grep "acked PGStats"
> 2013-04-02 15:40:28.944178 7ffff387a700 1 osd.0 2 mon hasn't acked PGStats in 30.034349 seconds, reconnecting elsewhere
> 2013-04-02 15:42:24.570963 7ffff387a700 1 osd.0 14 mon hasn't acked PGStats in 30.072959 seconds, reconnecting elsewhere
> 2013-04-02 15:43:22.664921 7ffff387a700 1 osd.0 20 mon hasn't acked PGStats in 30.570739 seconds, reconnecting elsewhere
> 2013-04-02 15:43:52.672461 7ffff387a700 1 osd.0 20 mon hasn't acked PGStats in 30.007379 seconds, reconnecting elsewhere
> 2013-04-02 15:44:22.680691 7ffff387a700 1 osd.0 20 mon hasn't acked PGStats in 30.007095 seconds, reconnecting elsewhere
> 2013-04-02 15:44:52.688936 7ffff387a700 1 osd.0 20 mon hasn't acked PGStats in 30.007221 seconds, reconnecting elsewhere
> 2013-04-02 15:45:22.702066 7ffff387a700 1 osd.0 20 mon hasn't acked PGStats in 30.012264 seconds, reconnecting elsewhere
> 2013-04-02 15:45:52.711134 7ffff387a700 1 osd.0 20 mon hasn't acked PGStats in 30.007957 seconds, reconnecting elsewhere
> 2013-04-02 15:46:22.719845 7ffff387a700 1 osd.0 20 mon hasn't acked PGStats in 30.007859 seconds, reconnecting elsewhere
> 2013-04-02 15:46:52.729475 7ffff387a700 1 osd.0 20 mon hasn't acked PGStats in 30.008756 seconds, reconnecting elsewhere
>
>
> I also thought paxos might be implicated because I noticed the auth
> paxos service appears to stop making progress. The mon is getting
> lots of auth paxos service messages from all the OSDs:
>
> # cat mon.cs31.log | grep paxos | grep "dispatch auth" | head -1
> 2013-04-02 15:38:59.995457 7ffff2a8c700 10 mon.cs31@0(leader).paxosservice(auth) dispatch auth(proto 0 30 bytes epoch 1) v1 from client.? 172.17.131.29:0/52585
>
> # cat mon.cs31.log | grep paxos | grep "dispatch auth" | wc -l
> 8802
>
> but at some point the auth paxos service appears to stop making progress:
>
> # cat mon.cs31.log | grep paxos | grep -v "dispatch auth" | grep auth
> 2013-04-02 15:38:42.895202 7ffff7fbe780 10 mon.cs31@0(probing).paxosservice(auth) restart
> 2013-04-02 15:38:42.905705 7ffff7fbe780 10 mon.cs31@0(probing).paxosservice(auth) restart
> 2013-04-02 15:38:59.182158 7ffff7fbe780 10 mon.cs31@0(probing).paxosservice(auth) restart
> 2013-04-02 15:38:59.192809 7ffff7fbe780 10 mon.cs31@0(probing).paxosservice(auth) restart
> 2013-04-02 15:38:59.290602 7ffff7fbe780 10 mon.cs31@0(leader).paxosservice(auth) election_finished
> 2013-04-02 15:38:59.290608 7ffff7fbe780 10 mon.cs31@0(leader).paxosservice(auth) _active
> 2013-04-02 15:38:59.300707 7ffff7fbe780 10 mon.cs31@0(leader).paxosservice(auth) scrub
> 2013-04-02 15:38:59.308260 7ffff7fbe780 7 mon.cs31@0(leader).paxosservice(auth) _active creating new pending
> 2013-04-02 15:38:59.329153 7ffff7fbe780 10 mon.cs31@0(leader).paxosservice(auth) propose_pending
> 2013-04-02 15:38:59.390988 7ffff7fbe780 10 mon.cs31@0(leader).paxosservice(auth) _active
> 2013-04-02 15:38:59.421164 7ffff7fbe780 10 mon.cs31@0(leader).paxosservice(auth) scrub
> 2013-04-02 15:38:59.427781 7ffff7fbe780 7 mon.cs31@0(leader).paxosservice(auth) _active creating new pending
> 2013-04-02 15:38:59.448001 7ffff7fbe780 10 mon.cs31@0(leader).paxosservice(auth) election_finished
> 2013-04-02 15:38:59.448027 7ffff7fbe780 10 mon.cs31@0(leader).paxosservice(auth) _active
> 2013-04-02 15:38:59.453691 7ffff7fbe780 10 mon.cs31@0(leader).paxosservice(auth) scrub
> 2013-04-02 15:38:59.459446 7ffff7fbe780 7 mon.cs31@0(leader).paxosservice(auth) _active creating new pending
> 2013-04-02 15:38:59.500315 7ffff328d700 10 mon.cs31@0(leader).paxosservice(auth) propose_pending
> 2013-04-02 15:38:59.569638 7ffff328d700 10 mon.cs31@0(leader).paxosservice(auth) _active
> 2013-04-02 15:38:59.593526 7ffff328d700 10 mon.cs31@0(leader).paxosservice(auth) scrub
> 2013-04-02 15:38:59.599250 7ffff328d700 7 mon.cs31@0(leader).paxosservice(auth) _active creating new pending
> 2013-04-02 15:39:39.390156 7ffff2a8c700 10 mon.cs31@0(leader).paxosservice(auth) propose_pending
> 2013-04-02 15:39:39.435906 7ffff2a8c700 10 mon.cs31@0(leader).paxosservice(auth) _active
> 2013-04-02 15:39:39.456743 7ffff2a8c700 10 mon.cs31@0(leader).paxosservice(auth) scrub
> 2013-04-02 15:39:39.462498 7ffff2a8c700 7 mon.cs31@0(leader).paxosservice(auth) _active creating new pending
>
> even as it continues to receive auth paxos service messages:
>
> # cat mon.cs31.log | grep paxos | grep -A 200000 "15:39:39.462498" | grep "dispatch auth" | wc -l
> 8799
>
> for some time:
>
> # cat mon.cs31.log | grep paxos | grep "dispatch auth" | tail -2
> 2013-04-02 15:46:56.090494 7ffff2a8c700 10 mon.cs31@0(leader).paxosservice(auth) dispatch auth(proto 0 28 bytes epoch 1) v1 from osd.379 172.17.131.47:6819/219107
> 2013-04-02 15:46:56.127352 7ffff2a8c700 10 mon.cs31@0(leader).paxosservice(auth) dispatch auth(proto 0 28 bytes epoch 1) v1 from osd.311 172.17.131.44:6823/184079
>
> Is that all expected behavior for the auth paxos service?
>
> To try to learn more about PGStat messages not getting acked, I caused
> Pipe:writer() and Pipe:reader() to log messages at a lower log level as
> they are sent and received, so I could see that vs. dispatch
> at debug ms 2. Here's what the mon thinks is OSD 0's PGStat interaction:
>
> # cat mon.cs31.log | grep "172.17.131.32:6800" | grep stats'
> 2013-04-02 15:39:58.923546 7ffff1a8a700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x12a1400 sd=20 :6789 s=2 pgs=1 cs=1 l=1).reader got message 13 0x82ca800 pg_stats(2139 pgs tid 1 v 2) v1
> 2013-04-02 15:40:03.927889 7ffff1a8a700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x12a1400 sd=20 :6789 s=2 pgs=1 cs=1 l=1).reader got message 15 0x88f5280 pg_stats(2139 pgs tid 2 v 2) v1
> 2013-04-02 15:40:08.936064 7ffff1a8a700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x12a1400 sd=20 :6789 s=2 pgs=1 cs=1 l=1).reader got message 17 0x88f5000 pg_stats(2139 pgs tid 3 v 2) v1
> 2013-04-02 15:40:13.943312 7ffff1a8a700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x12a1400 sd=20 :6789 s=2 pgs=1 cs=1 l=1).reader got message 19 0x8e35780 pg_stats(2139 pgs tid 4 v 2) v1
> 2013-04-02 15:40:18.949530 7ffff1a8a700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x12a1400 sd=20 :6789 s=2 pgs=1 cs=1 l=1).reader got message 21 0x8e35500 pg_stats(2139 pgs tid 5 v 2) v1
> 2013-04-02 15:40:23.953961 7ffff1a8a700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x12a1400 sd=20 :6789 s=2 pgs=1 cs=1 l=1).reader got message 23 0x8e35280 pg_stats(2139 pgs tid 6 v 2) v1
> 2013-04-02 15:40:42.812586 7fffa94ed700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).reader got message 5 0x9178500 pg_stats(2139 pgs tid 7 v 2) v1
> 2013-04-02 15:40:42.824276 7fffa94ed700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).reader got message 7 0x9178a00 pg_stats(2139 pgs tid 8 v 2) v1
> 2013-04-02 15:40:42.833998 7fffa94ed700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).reader got message 9 0x9178780 pg_stats(2139 pgs tid 9 v 3) v1
> 2013-04-02 15:40:42.839129 7fffa94ed700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).reader got message 10 0x917d000 pg_stats(2139 pgs tid 10 v 3) v1
> 2013-04-02 15:40:43.289351 7ffff2a8c700 1 -- 172.17.131.31:6789/0 <== osd.0 172.17.131.32:6800/217031 5 ==== pg_stats(2139 pgs tid 7 v 2) v1 ==== 780831+0+0 (3242785419 0 0) 0x9178500 con 0x8ef77a0
> 2013-04-02 15:40:43.289361 7ffff2a8c700 10 mon.cs31@0(leader).paxosservice(pgmap) dispatch pg_stats(2139 pgs tid 7 v 2) v1 from osd.0 172.17.131.32:6800/217031
> 2013-04-02 15:40:44.346624 7fffa94ed700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).reader got message 11 0x9178f00 pg_stats(2139 pgs tid 11 v 3) v1
> 2013-04-02 15:40:49.347796 7fffa94ed700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).reader got message 12 0x9178c80 pg_stats(2139 pgs tid 12 v 3) v1
> 2013-04-02 15:40:50.974070 7ffff328d700 1 -- 172.17.131.31:6789/0 --> 172.17.131.32:6800/217031 -- pg_stats_ack(2139 pgs tid 7) v1 -- ?+0 0x94efc40 con 0x8ef77a0
> 2013-04-02 15:40:50.975214 7ffff1a8a700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).writer sending message 9 0x94efc40pg_stats_ack(2139 pgs tid 7) v1
> 2013-04-02 15:40:52.148791 7ffff2a8c700 1 -- 172.17.131.31:6789/0 <== osd.0 172.17.131.32:6800/217031 7 ==== pg_stats(2139 pgs tid 8 v 2) v1 ==== 780831+0+0 (767410827 0 0) 0x9178a00 con 0x8ef77a0
> 2013-04-02 15:40:52.148798 7ffff2a8c700 10 mon.cs31@0(leader).paxosservice(pgmap) dispatch pg_stats(2139 pgs tid 8 v 2) v1 from osd.0 172.17.131.32:6800/217031
> 2013-04-02 15:40:52.184327 7ffff2a8c700 1 -- 172.17.131.31:6789/0 --> 172.17.131.32:6800/217031 -- pg_stats_ack(2139 pgs tid 0) v1 -- ?+0 0x95101c0 con 0x8ef77a0
> 2013-04-02 15:40:52.184887 7ffff1a8a700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).writer sending message 12 0x95101c0pg_stats_ack(2139 pgs tid 0) v1
> 2013-04-02 15:40:54.355489 7fffa94ed700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).reader got message 13 0x917d500 pg_stats(2139 pgs tid 13 v 3) v1
> 2013-04-02 15:40:59.366894 7fffa94ed700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).reader got message 15 0x12a1400 pg_stats(2139 pgs tid 14 v 3) v1
> 2013-04-02 15:41:04.369893 7fffa94ed700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).reader got message 16 0x88f5000 pg_stats(2139 pgs tid 15 v 3) v1
> 2013-04-02 15:41:09.383145 7fffa94ed700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).reader got message 17 0x8572580 pg_stats(2139 pgs tid 16 v 3) v1
> 2013-04-02 15:41:10.331005 7ffff2a8c700 1 -- 172.17.131.31:6789/0 <== osd.0 172.17.131.32:6800/217031 9 ==== pg_stats(2139 pgs tid 9 v 3) v1 ==== 780831+0+0 (2102719063 0 0) 0x9178780 con 0x8ef77a0
> 2013-04-02 15:41:10.331015 7ffff2a8c700 10 mon.cs31@0(leader).paxosservice(pgmap) dispatch pg_stats(2139 pgs tid 9 v 3) v1 from osd.0 172.17.131.32:6800/217031
> 2013-04-02 15:41:14.386773 7fffa94ed700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).reader got message 18 0x917df00 pg_stats(2139 pgs tid 17 v 3) v1
> 2013-04-02 15:41:18.238008 7ffff328d700 1 -- 172.17.131.31:6789/0 --> 172.17.131.32:6800/217031 -- pg_stats_ack(2139 pgs tid 9) v1 -- ?+0 0x887db00 con 0x8ef77a0
> 2013-04-02 15:41:18.239168 7ffff1a8a700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).writer sending message 15 0x887db00pg_stats_ack(2139 pgs tid 9) v1
> 2013-04-02 15:41:19.393287 7fffa94ed700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).reader got message 19 0x917dc80 pg_stats(1688 pgs tid 18 v 3) v1
> 2013-04-02 15:41:24.402543 7fffa94ed700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).reader got message 20 0x917da00 pg_stats(1688 pgs tid 19 v 3) v1
> 2013-04-02 15:41:29.409268 7fffa94ed700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).reader got message 22 0x917d780 pg_stats(1688 pgs tid 20 v 3) v1
> 2013-04-02 15:41:32.417687 7ffff2a8c700 1 -- 172.17.131.31:6789/0 <== osd.0 172.17.131.32:6800/217031 10 ==== pg_stats(2139 pgs tid 10 v 3) v1 ==== 780831+0+0 (1002730769 0 0) 0x917d000 con 0x8ef77a0
> 2013-04-02 15:41:32.417694 7ffff2a8c700 10 mon.cs31@0(leader).paxosservice(pgmap) dispatch pg_stats(2139 pgs tid 10 v 3) v1 from osd.0 172.17.131.32:6800/217031
> 2013-04-02 15:41:34.412255 7fffa94ed700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).reader got message 24 0x9178500 pg_stats(1688 pgs tid 21 v 3) v1
> 2013-04-02 15:41:39.515106 7ffff328d700 1 -- 172.17.131.31:6789/0 --> 172.17.131.32:6800/217031 -- pg_stats_ack(2139 pgs tid 10) v1 -- ?+0 0x8bdaf40 con 0x8ef77a0
> 2013-04-02 15:41:39.516862 7ffff1a8a700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).writer sending message 16 0x8bdaf40pg_stats_ack(2139 pgs tid 10) v1
> 2013-04-02 15:41:41.428677 7fffa94ed700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).reader got message 27 0x7436400 pg_stats(1956 pgs tid 22 v 4) v1
> 2013-04-02 15:41:46.433146 7fffa94ed700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).reader got message 29 0x7436180 pg_stats(1956 pgs tid 23 v 4) v1
> 2013-04-02 15:41:51.436995 7fffa94ed700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).reader got message 31 0x7435f00 pg_stats(1956 pgs tid 24 v 4) v1
> 2013-04-02 15:41:53.051925 7ffff2a8c700 1 -- 172.17.131.31:6789/0 <== osd.0 172.17.131.32:6800/217031 11 ==== pg_stats(2139 pgs tid 11 v 3) v1 ==== 780831+0+0 (676901400 0 0) 0x9178f00 con 0x8ef77a0
> 2013-04-02 15:41:53.051937 7ffff2a8c700 10 mon.cs31@0(leader).paxosservice(pgmap) dispatch pg_stats(2139 pgs tid 11 v 3) v1 from osd.0 172.17.131.32:6800/217031
> 2013-04-02 15:41:53.102194 7ffff2a8c700 1 -- 172.17.131.31:6789/0 <== osd.0 172.17.131.32:6800/217031 12 ==== pg_stats(2139 pgs tid 12 v 3) v1 ==== 780831+0+0 (998884584 0 0) 0x9178c80 con 0x8ef77a0
> 2013-04-02 15:41:53.102201 7ffff2a8c700 10 mon.cs31@0(leader).paxosservice(pgmap) dispatch pg_stats(2139 pgs tid 12 v 3) v1 from osd.0 172.17.131.32:6800/217031
> 2013-04-02 15:41:53.154485 7ffff2a8c700 1 -- 172.17.131.31:6789/0 <== osd.0 172.17.131.32:6800/217031 13 ==== pg_stats(2139 pgs tid 13 v 3) v1 ==== 780831+0+0 (1067499060 0 0) 0x917d500 con 0x8ef77a0
> 2013-04-02 15:41:53.154492 7ffff2a8c700 10 mon.cs31@0(leader).paxosservice(pgmap) dispatch pg_stats(2139 pgs tid 13 v 3) v1 from osd.0 172.17.131.32:6800/217031
> 2013-04-02 15:41:53.177214 7ffff2a8c700 1 -- 172.17.131.31:6789/0 <== osd.0 172.17.131.32:6800/217031 15 ==== pg_stats(2139 pgs tid 14 v 3) v1 ==== 780831+0+0 (2843573366 0 0) 0x12a1400 con 0x8ef77a0
> 2013-04-02 15:41:53.177220 7ffff2a8c700 10 mon.cs31@0(leader).paxosservice(pgmap) dispatch pg_stats(2139 pgs tid 14 v 3) v1 from osd.0 172.17.131.32:6800/217031
> 2013-04-02 15:41:53.183387 7ffff2a8c700 1 -- 172.17.131.31:6789/0 <== osd.0 172.17.131.32:6800/217031 16 ==== pg_stats(2139 pgs tid 15 v 3) v1 ==== 780831+0+0 (886357047 0 0) 0x88f5000 con 0x8ef77a0
> 2013-04-02 15:41:53.183393 7ffff2a8c700 10 mon.cs31@0(leader).paxosservice(pgmap) dispatch pg_stats(2139 pgs tid 15 v 3) v1 from osd.0 172.17.131.32:6800/217031
> 2013-04-02 15:41:53.188461 7ffff2a8c700 1 -- 172.17.131.31:6789/0 <== osd.0 172.17.131.32:6800/217031 17 ==== pg_stats(2139 pgs tid 16 v 3) v1 ==== 780831+0+0 (4068350826 0 0) 0x8572580 con 0x8ef77a0
> 2013-04-02 15:41:53.188467 7ffff2a8c700 10 mon.cs31@0(leader).paxosservice(pgmap) dispatch pg_stats(2139 pgs tid 16 v 3) v1 from osd.0 172.17.131.32:6800/217031
> 2013-04-02 15:41:53.191039 7ffff2a8c700 1 -- 172.17.131.31:6789/0 <== osd.0 172.17.131.32:6800/217031 18 ==== pg_stats(2139 pgs tid 17 v 3) v1 ==== 780831+0+0 (3808505220 0 0) 0x917df00 con 0x8ef77a0
> 2013-04-02 15:41:53.191044 7ffff2a8c700 10 mon.cs31@0(leader).paxosservice(pgmap) dispatch pg_stats(2139 pgs tid 17 v 3) v1 from osd.0 172.17.131.32:6800/217031
> 2013-04-02 15:41:53.195468 7ffff2a8c700 1 -- 172.17.131.31:6789/0 <== osd.0 172.17.131.32:6800/217031 19 ==== pg_stats(1688 pgs tid 18 v 3) v1 ==== 616216+0+0 (786716567 0 0) 0x917dc80 con 0x8ef77a0
> 2013-04-02 15:41:53.195474 7ffff2a8c700 10 mon.cs31@0(leader).paxosservice(pgmap) dispatch pg_stats(1688 pgs tid 18 v 3) v1 from osd.0 172.17.131.32:6800/217031
> 2013-04-02 15:41:53.197490 7ffff2a8c700 1 -- 172.17.131.31:6789/0 <== osd.0 172.17.131.32:6800/217031 20 ==== pg_stats(1688 pgs tid 19 v 3) v1 ==== 616216+0+0 (1691256403 0 0) 0x917da00 con 0x8ef77a0
> 2013-04-02 15:41:53.197495 7ffff2a8c700 10 mon.cs31@0(leader).paxosservice(pgmap) dispatch pg_stats(1688 pgs tid 19 v 3) v1 from osd.0 172.17.131.32:6800/217031
> 2013-04-02 15:41:53.203550 7ffff2a8c700 1 -- 172.17.131.31:6789/0 <== osd.0 172.17.131.32:6800/217031 22 ==== pg_stats(1688 pgs tid 20 v 3) v1 ==== 616216+0+0 (2465465593 0 0) 0x917d780 con 0x8ef77a0
> 2013-04-02 15:41:53.203556 7ffff2a8c700 10 mon.cs31@0(leader).paxosservice(pgmap) dispatch pg_stats(1688 pgs tid 20 v 3) v1 from osd.0 172.17.131.32:6800/217031
> 2013-04-02 15:41:53.205756 7ffff2a8c700 1 -- 172.17.131.31:6789/0 <== osd.0 172.17.131.32:6800/217031 24 ==== pg_stats(1688 pgs tid 21 v 3) v1 ==== 616216+0+0 (4004343678 0 0) 0x9178500 con 0x8ef77a0
> 2013-04-02 15:41:53.205762 7ffff2a8c700 10 mon.cs31@0(leader).paxosservice(pgmap) dispatch pg_stats(1688 pgs tid 21 v 3) v1 from osd.0 172.17.131.32:6800/217031
> 2013-04-02 15:41:53.208096 7ffff2a8c700 1 -- 172.17.131.31:6789/0 <== osd.0 172.17.131.32:6800/217031 27 ==== pg_stats(1956 pgs tid 22 v 4) v1 ==== 716960+0+0 (2541888992 0 0) 0x7436400 con 0x8ef77a0
> 2013-04-02 15:41:53.208102 7ffff2a8c700 10 mon.cs31@0(leader).paxosservice(pgmap) dispatch pg_stats(1956 pgs tid 22 v 4) v1 from osd.0 172.17.131.32:6800/217031
> 2013-04-02 15:41:53.210496 7ffff2a8c700 1 -- 172.17.131.31:6789/0 <== osd.0 172.17.131.32:6800/217031 29 ==== pg_stats(1956 pgs tid 23 v 4) v1 ==== 716960+0+0 (3133069244 0 0) 0x7436180 con 0x8ef77a0
> 2013-04-02 15:41:53.210502 7ffff2a8c700 10 mon.cs31@0(leader).paxosservice(pgmap) dispatch pg_stats(1956 pgs tid 23 v 4) v1 from osd.0 172.17.131.32:6800/217031
> 2013-04-02 15:41:53.212759 7ffff2a8c700 1 -- 172.17.131.31:6789/0 <== osd.0 172.17.131.32:6800/217031 31 ==== pg_stats(1956 pgs tid 24 v 4) v1 ==== 716960+0+0 (584283437 0 0) 0x7435f00 con 0x8ef77a0
> 2013-04-02 15:41:53.212764 7ffff2a8c700 10 mon.cs31@0(leader).paxosservice(pgmap) dispatch pg_stats(1956 pgs tid 24 v 4) v1 from osd.0 172.17.131.32:6800/217031
> 2013-04-02 15:41:54.480038 7ffff328d700 1 -- 172.17.131.31:6789/0 --> 172.17.131.32:6800/217031 -- pg_stats_ack(2139 pgs tid 11) v1 -- ?+0 0x95ab800 con 0x8ef77a0
> 2013-04-02 15:41:54.480618 7ffff328d700 1 -- 172.17.131.31:6789/0 --> 172.17.131.32:6800/217031 -- pg_stats_ack(2139 pgs tid 12) v1 -- ?+0 0x9554b40 con 0x8ef77a0
> 2013-04-02 15:41:54.481098 7ffff1a8a700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).writer sending message 32 0x95ab800pg_stats_ack(2139 pgs tid 11) v1
> 2013-04-02 15:41:54.481564 7ffff328d700 1 -- 172.17.131.31:6789/0 --> 172.17.131.32:6800/217031 -- pg_stats_ack(2139 pgs tid 13) v1 -- ?+0 0x95539c0 con 0x8ef77a0
> 2013-04-02 15:41:54.482128 7ffff1a8a700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).writer sending message 33 0x9554b40pg_stats_ack(2139 pgs tid 12) v1
> 2013-04-02 15:41:54.483234 7ffff1a8a700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).writer sending message 34 0x95539c0pg_stats_ack(2139 pgs tid 13) v1
> 2013-04-02 15:41:54.484486 7ffff328d700 1 -- 172.17.131.31:6789/0 --> 172.17.131.32:6800/217031 -- pg_stats_ack(2139 pgs tid 14) v1 -- ?+0 0x883f000 con 0x8ef77a0
> 2013-04-02 15:41:54.485222 7ffff328d700 1 -- 172.17.131.31:6789/0 --> 172.17.131.32:6800/217031 -- pg_stats_ack(2139 pgs tid 15) v1 -- ?+0 0x883e3c0 con 0x8ef77a0
> 2013-04-02 15:41:54.485378 7ffff1a8a700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).writer sending message 35 0x883f000pg_stats_ack(2139 pgs tid 14) v1
> 2013-04-02 15:41:54.486164 7ffff328d700 1 -- 172.17.131.31:6789/0 --> 172.17.131.32:6800/217031 -- pg_stats_ack(2139 pgs tid 16) v1 -- ?+0 0x8bdbb80 con 0x8ef77a0
> 2013-04-02 15:41:54.486358 7ffff1a8a700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).writer sending message 36 0x883e3c0pg_stats_ack(2139 pgs tid 15) v1
> 2013-04-02 15:41:54.487053 7ffff328d700 1 -- 172.17.131.31:6789/0 --> 172.17.131.32:6800/217031 -- pg_stats_ack(2139 pgs tid 17) v1 -- ?+0 0x8bdb480 con 0x8ef77a0
> 2013-04-02 15:41:54.487436 7ffff1a8a700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).writer sending message 37 0x8bdbb80pg_stats_ack(2139 pgs tid 16) v1
> 2013-04-02 15:41:54.488260 7ffff328d700 1 -- 172.17.131.31:6789/0 --> 172.17.131.32:6800/217031 -- pg_stats_ack(1688 pgs tid 18) v1 -- ?+0 0x8bdd400 con 0x8ef77a0
> 2013-04-02 15:41:54.488398 7ffff1a8a700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).writer sending message 38 0x8bdb480pg_stats_ack(2139 pgs tid 17) v1
> 2013-04-02 15:41:54.489004 7ffff328d700 1 -- 172.17.131.31:6789/0 --> 172.17.131.32:6800/217031 -- pg_stats_ack(1688 pgs tid 19) v1 -- ?+0 0x8bdd080 con 0x8ef77a0
> 2013-04-02 15:41:54.489306 7ffff1a8a700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).writer sending message 39 0x8bdd400pg_stats_ack(1688 pgs tid 18) v1
> 2013-04-02 15:41:54.489996 7ffff328d700 1 -- 172.17.131.31:6789/0 --> 172.17.131.32:6800/217031 -- pg_stats_ack(1688 pgs tid 20) v1 -- ?+0 0x88601c0 con 0x8ef77a0
> 2013-04-02 15:41:54.490083 7ffff1a8a700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).writer sending message 40 0x8bdd080pg_stats_ack(1688 pgs tid 19) v1
> 2013-04-02 15:41:54.490442 7ffff328d700 1 -- 172.17.131.31:6789/0 --> 172.17.131.32:6800/217031 -- pg_stats_ack(1688 pgs tid 21) v1 -- ?+0 0x8865e80 con 0x8ef77a0
> 2013-04-02 15:41:54.490837 7ffff1a8a700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).writer sending message 41 0x88601c0pg_stats_ack(1688 pgs tid 20) v1
> 2013-04-02 15:41:54.491416 7ffff328d700 1 -- 172.17.131.31:6789/0 --> 172.17.131.32:6800/217031 -- pg_stats_ack(1956 pgs tid 22) v1 -- ?+0 0x8bddb00 con 0x8ef77a0
> 2013-04-02 15:41:54.491807 7ffff1a8a700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).writer sending message 42 0x8865e80pg_stats_ack(1688 pgs tid 21) v1
> 2013-04-02 15:41:54.492195 7ffff328d700 1 -- 172.17.131.31:6789/0 --> 172.17.131.32:6800/217031 -- pg_stats_ack(1956 pgs tid 23) v1 -- ?+0 0x8bde900 con 0x8ef77a0
> 2013-04-02 15:41:54.492728 7ffff1a8a700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).writer sending message 43 0x8bddb00pg_stats_ack(1956 pgs tid 22) v1
> 2013-04-02 15:41:54.493313 7ffff328d700 1 -- 172.17.131.31:6789/0 --> 172.17.131.32:6800/217031 -- pg_stats_ack(1956 pgs tid 24) v1 -- ?+0 0x8bde580 con 0x8ef77a0
> 2013-04-02 15:41:54.493707 7ffff1a8a700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).writer sending message 44 0x8bde900pg_stats_ack(1956 pgs tid 23) v1
> 2013-04-02 15:41:54.494585 7ffff1a8a700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).writer sending message 45 0x8bde580pg_stats_ack(1956 pgs tid 24) v1
> 2013-04-02 15:42:00.522781 7fffa94ed700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).reader got message 36 0x7435c80 pg_stats(108 pgs tid 25 v 7) v1
> 2013-04-02 15:42:05.525728 7fffa94ed700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).reader got message 38 0x7435a00 pg_stats(108 pgs tid 26 v 7) v1
> 2013-04-02 15:42:10.530772 7fffa94ed700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).reader got message 39 0x7435780 pg_stats(108 pgs tid 27 v 8) v1
> 2013-04-02 15:42:15.565118 7fffa94ed700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917fa80 sd=596 :6789 s=2 pgs=2 cs=1 l=1).reader got message 42 0x1fa52080 pg_stats(108 pgs tid 28 v 11) v1
> 2013-04-02 15:42:33.891896 7fffa91ea700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917eb80 sd=6 :6789 s=2 pgs=3 cs=1 l=1).reader got message 4 0x1fa52080 pg_stats(108 pgs tid 30 v 14) v1
> 2013-04-02 15:42:33.892425 7fffa91ea700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917eb80 sd=6 :6789 s=2 pgs=3 cs=1 l=1).reader got message 5 0x7435780 pg_stats(108 pgs tid 31 v 14) v1
> 2013-04-02 15:42:33.893026 7fffa91ea700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917eb80 sd=6 :6789 s=2 pgs=3 cs=1 l=1).reader got message 9 0x1fb1fa80 pg_stats(108 pgs tid 32 v 16) v1
> 2013-04-02 15:42:34.574255 7fffa91ea700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917eb80 sd=6 :6789 s=2 pgs=3 cs=1 l=1).reader got message 11 0x1fb1fd00 pg_stats(108 pgs tid 33 v 16) v1
> 2013-04-02 15:42:39.575512 7fffa91ea700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917eb80 sd=6 :6789 s=2 pgs=3 cs=1 l=1).reader got message 12 0x1b4aaf80 pg_stats(108 pgs tid 34 v 16) v1
> 2013-04-02 15:42:48.310568 7ffff2a8c700 1 -- 172.17.131.31:6789/0 <== osd.0 172.17.131.32:6800/217031 4 ==== pg_stats(108 pgs tid 30 v 14) v1 ==== 40520+0+0 (2643097968 0 0) 0x1fa52080 con 0x8519760
> 2013-04-02 15:42:48.310576 7ffff2a8c700 10 mon.cs31@0(leader).paxosservice(pgmap) dispatch pg_stats(108 pgs tid 30 v 14) v1 from osd.0 172.17.131.32:6800/217031
> 2013-04-02 15:42:48.589731 7ffff328d700 1 -- 172.17.131.31:6789/0 --> 172.17.131.32:6800/217031 -- pg_stats_ack(108 pgs tid 30) v1 -- ?+0 0x8863d40 con 0x8519760
> 2013-04-02 15:42:48.589885 7fffa8659700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917eb80 sd=6 :6789 s=2 pgs=3 cs=1 l=1).writer sending message 5 0x8863d40pg_stats_ack(108 pgs tid 30) v1
> 2013-04-02 15:42:50.439778 7ffff2a8c700 1 -- 172.17.131.31:6789/0 <== osd.0 172.17.131.32:6800/217031 5 ==== pg_stats(108 pgs tid 31 v 14) v1 ==== 40520+0+0 (3699058855 0 0) 0x7435780 con 0x8519760
> 2013-04-02 15:42:50.439786 7ffff2a8c700 10 mon.cs31@0(leader).paxosservice(pgmap) dispatch pg_stats(108 pgs tid 31 v 14) v1 from osd.0 172.17.131.32:6800/217031
> 2013-04-02 15:42:50.439869 7ffff2a8c700 1 -- 172.17.131.31:6789/0 --> 172.17.131.32:6800/217031 -- pg_stats_ack(108 pgs tid 0) v1 -- ?+0 0xc26abc0 con 0x8519760
> 2013-04-02 15:42:50.439926 7fffa8659700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917eb80 sd=6 :6789 s=2 pgs=3 cs=1 l=1).writer sending message 6 0xc26abc0pg_stats_ack(108 pgs tid 0) v1
> 2013-04-02 15:42:50.442425 7fffa91ea700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917eb80 sd=6 :6789 s=2 pgs=3 cs=1 l=1).reader got message 14 0x2dc3c00 pg_stats(108 pgs tid 35 v 17) v1
> 2013-04-02 15:42:50.540570 7fffa91ea700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917eb80 sd=6 :6789 s=2 pgs=3 cs=1 l=1).reader got message 15 0x154a8780 pg_stats(108 pgs tid 36 v 18) v1
> 2013-04-02 15:42:51.087764 7ffff2a8c700 1 -- 172.17.131.31:6789/0 <== osd.0 172.17.131.32:6800/217031 9 ==== pg_stats(108 pgs tid 32 v 16) v1 ==== 40520+0+0 (3609079067 0 0) 0x1fb1fa80 con 0x8519760
> 2013-04-02 15:42:51.087770 7ffff2a8c700 10 mon.cs31@0(leader).paxosservice(pgmap) dispatch pg_stats(108 pgs tid 32 v 16) v1 from osd.0 172.17.131.32:6800/217031
> 2013-04-02 15:42:52.092802 7ffff328d700 1 -- 172.17.131.31:6789/0 --> 172.17.131.32:6800/217031 -- pg_stats_ack(108 pgs tid 32) v1 -- ?+0 0x1c62db00 con 0x8519760
> 2013-04-02 15:42:52.092869 7fffa8659700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917eb80 sd=6 :6789 s=2 pgs=3 cs=1 l=1).writer sending message 13 0x1c62db00pg_stats_ack(108 pgs tid 32) v1
> 2013-04-02 15:42:54.594268 7fffa91ea700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917eb80 sd=6 :6789 s=2 pgs=3 cs=1 l=1).reader got message 19 0x154a8280 pg_stats(108 pgs tid 37 v 20) v1
> 2013-04-02 15:43:22.177993 7fffa91ea700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917eb80 sd=6 :6789 s=2 pgs=3 cs=1 l=1).reader got message 20 0x154a8000 pg_stats(108 pgs tid 38 v 20) v1
> 2013-04-02 15:43:22.304697 7fffa91ea700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917eb80 sd=6 :6789 s=2 pgs=3 cs=1 l=1).reader got message 22 0x88f6400 pg_stats(108 pgs tid 39 v 20) v1
> 2013-04-02 15:43:22.354425 7fffa91ea700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917eb80 sd=6 :6789 s=2 pgs=3 cs=1 l=1).reader got message 23 0x181b9400 pg_stats(108 pgs tid 40 v 20) v1
> 2013-04-02 15:43:22.417111 7fffa91ea700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917eb80 sd=6 :6789 s=2 pgs=3 cs=1 l=1).reader got message 25 0xb42d780 pg_stats(108 pgs tid 41 v 20) v1
> 2013-04-02 15:43:22.454599 7fffa91ea700 2 -- 172.17.131.31:6789/0 >> 172.17.131.32:6800/217031 pipe(0x917eb80 sd=6 :6789 s=2 pgs=3 cs=1 l=1).reader got message 26 0xb42f300 pg_stats(108 pgs tid 42 v 20) v1
>
> Here is OSD 0's idea of its PGStat interaction with the mon:
>
> # cat osd.0.log | grep -i stats
> 2013-04-02 15:39:58.909877 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(2139 pgs tid 1 v 2) v1 -- ?+0 0x1748500 con 0x16098c0
> 2013-04-02 15:39:58.918630 7ffff7fba700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748000 sd=27 :45108 s=2 pgs=8 cs=1 l=1).writer sending message 13 0x1748500pg_stats(2139 pgs tid 1 v 2) v1
> 2013-04-02 15:40:03.915544 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(2139 pgs tid 2 v 2) v1 -- ?+0 0x1748780 con 0x16098c0
> 2013-04-02 15:40:03.922304 7ffff7fba700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748000 sd=27 :45108 s=2 pgs=8 cs=1 l=1).writer sending message 15 0x1748780pg_stats(2139 pgs tid 2 v 2) v1
> 2013-04-02 15:40:08.920874 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(2139 pgs tid 3 v 2) v1 -- ?+0 0x1748500 con 0x16098c0
> 2013-04-02 15:40:08.929491 7ffff7fba700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748000 sd=27 :45108 s=2 pgs=8 cs=1 l=1).writer sending message 17 0x1748500pg_stats(2139 pgs tid 3 v 2) v1
> 2013-04-02 15:40:13.928781 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(2139 pgs tid 4 v 2) v1 -- ?+0 0x1748a00 con 0x16098c0
> 2013-04-02 15:40:13.937503 7ffff7fba700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748000 sd=27 :45108 s=2 pgs=8 cs=1 l=1).writer sending message 19 0x1748a00pg_stats(2139 pgs tid 4 v 2) v1
> 2013-04-02 15:40:18.936302 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(2139 pgs tid 5 v 2) v1 -- ?+0 0x1748780 con 0x16098c0
> 2013-04-02 15:40:18.944744 7ffff7fba700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748000 sd=27 :45108 s=2 pgs=8 cs=1 l=1).writer sending message 21 0x1748780pg_stats(2139 pgs tid 5 v 2) v1
> 2013-04-02 15:40:23.941554 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(2139 pgs tid 6 v 2) v1 -- ?+0 0x1748500 con 0x16098c0
> 2013-04-02 15:40:23.949853 7ffff7fba700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748000 sd=27 :45108 s=2 pgs=8 cs=1 l=1).writer sending message 23 0x1748500pg_stats(2139 pgs tid 6 v 2) v1
> 2013-04-02 15:40:28.944178 7ffff387a700 1 osd.0 2 mon hasn't acked PGStats in 30.034349 seconds, reconnecting elsewhere
> 2013-04-02 15:40:42.798689 7fffe485c700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(2139 pgs tid 7 v 2) v1 -- ?+0 0x1748a00 con 0x1609e40
> 2013-04-02 15:40:42.798728 7fffe485c700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(2139 pgs tid 8 v 2) v1 -- ?+0 0x1748000 con 0x1609e40
> 2013-04-02 15:40:42.798813 7fffe485c700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(2139 pgs tid 9 v 3) v1 -- ?+0 0x1748500 con 0x1609e40
> 2013-04-02 15:40:42.798833 7fffe485c700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(2139 pgs tid 10 v 3) v1 -- ?+0 0x1748780 con 0x1609e40
> 2013-04-02 15:40:42.804039 7fffcce2c700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).writer sending message 5 0x1748a00pg_stats(2139 pgs tid 7 v 2) v1
> 2013-04-02 15:40:42.815140 7fffcce2c700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).writer sending message 7 0x1748000pg_stats(2139 pgs tid 8 v 2) v1
> 2013-04-02 15:40:42.825539 7fffcce2c700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).writer sending message 9 0x1748500pg_stats(2139 pgs tid 9 v 3) v1
> 2013-04-02 15:40:42.833093 7fffcce2c700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).writer sending message 10 0x1748780pg_stats(2139 pgs tid 10 v 3) v1
> 2013-04-02 15:40:44.332101 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(2139 pgs tid 11 v 3) v1 -- ?+0 0x1749180 con 0x1609e40
> 2013-04-02 15:40:44.340297 7fffcce2c700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).writer sending message 11 0x1749180pg_stats(2139 pgs tid 11 v 3) v1
> 2013-04-02 15:40:49.338071 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(2139 pgs tid 12 v 3) v1 -- ?+0 0x1748a00 con 0x1609e40
> 2013-04-02 15:40:49.342361 7fffcce2c700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).writer sending message 12 0x1748a00pg_stats(2139 pgs tid 12 v 3) v1
> 2013-04-02 15:40:50.977357 7fffcd82e700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).reader got message 9 0x164fe00 pg_stats_ack(2139 pgs tid 7) v1
> 2013-04-02 15:40:50.977597 7fffe485c700 1 -- 172.17.131.32:6800/217031 <== mon.0 172.17.131.31:6789/0 9 ==== pg_stats_ack(2139 pgs tid 7) v1 ==== 62035+0+0 (726597769 0 0) 0x164fe00 con 0x1609e40
> 2013-04-02 15:40:52.187136 7fffcd82e700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).reader got message 12 0x164f380 pg_stats_ack(2139 pgs tid 0) v1
> 2013-04-02 15:40:52.187290 7fffe485c700 1 -- 172.17.131.32:6800/217031 <== mon.0 172.17.131.31:6789/0 12 ==== pg_stats_ack(2139 pgs tid 0) v1 ==== 62035+0+0 (726597769 0 0) 0x164f380 con 0x1609e40
> 2013-04-02 15:40:54.344074 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(2139 pgs tid 13 v 3) v1 -- ?+0 0x1748000 con 0x1609e40
> 2013-04-02 15:40:54.348415 7fffcce2c700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).writer sending message 13 0x1748000pg_stats(2139 pgs tid 13 v 3) v1
> 2013-04-02 15:40:59.351607 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(2139 pgs tid 14 v 3) v1 -- ?+0 0x1748500 con 0x1609e40
> 2013-04-02 15:40:59.359988 7fffcce2c700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).writer sending message 15 0x1748500pg_stats(2139 pgs tid 14 v 3) v1
> 2013-04-02 15:41:04.359122 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(2139 pgs tid 15 v 3) v1 -- ?+0 0x1748780 con 0x1609e40
> 2013-04-02 15:41:04.364092 7fffcce2c700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).writer sending message 16 0x1748780pg_stats(2139 pgs tid 15 v 3) v1
> 2013-04-02 15:41:09.366106 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(2139 pgs tid 16 v 3) v1 -- ?+0 0x1749400 con 0x1609e40
> 2013-04-02 15:41:09.374543 7fffcce2c700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).writer sending message 17 0x1749400pg_stats(2139 pgs tid 16 v 3) v1
> 2013-04-02 15:41:14.371889 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(2139 pgs tid 17 v 3) v1 -- ?+0 0x1748a00 con 0x1609e40
> 2013-04-02 15:41:14.380211 7fffcce2c700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).writer sending message 18 0x1748a00pg_stats(2139 pgs tid 17 v 3) v1
> 2013-04-02 15:41:18.240897 7fffcd82e700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).reader got message 15 0x164fe00 pg_stats_ack(2139 pgs tid 9) v1
> 2013-04-02 15:41:18.241056 7fffe485c700 1 -- 172.17.131.32:6800/217031 <== mon.0 172.17.131.31:6789/0 15 ==== pg_stats_ack(2139 pgs tid 9) v1 ==== 62035+0+0 (198780535 0 0) 0x164fe00 con 0x1609e40
> 2013-04-02 15:41:19.380719 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(1688 pgs tid 18 v 3) v1 -- ?+0 0x1749180 con 0x1609e40
> 2013-04-02 15:41:19.387148 7fffcce2c700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).writer sending message 19 0x1749180pg_stats(1688 pgs tid 18 v 3) v1
> 2013-04-02 15:41:24.389573 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(1688 pgs tid 19 v 3) v1 -- ?+0 0x1748780 con 0x1609e40
> 2013-04-02 15:41:24.396022 7fffcce2c700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).writer sending message 20 0x1748780pg_stats(1688 pgs tid 19 v 3) v1
> 2013-04-02 15:41:29.396986 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(1688 pgs tid 20 v 3) v1 -- ?+0 0x1748500 con 0x1609e40
> 2013-04-02 15:41:29.403403 7fffcce2c700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).writer sending message 22 0x1748500pg_stats(1688 pgs tid 20 v 3) v1
> 2013-04-02 15:41:34.401630 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(1688 pgs tid 21 v 3) v1 -- ?+0 0x1748000 con 0x1609e40
> 2013-04-02 15:41:34.407929 7fffcce2c700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).writer sending message 24 0x1748000pg_stats(1688 pgs tid 21 v 3) v1
> 2013-04-02 15:41:39.519241 7fffcd82e700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).reader got message 16 0x164d5c0 pg_stats_ack(2139 pgs tid 10) v1
> 2013-04-02 15:41:39.519384 7fffe485c700 1 -- 172.17.131.32:6800/217031 <== mon.0 172.17.131.31:6789/0 16 ==== pg_stats_ack(2139 pgs tid 10) v1 ==== 62035+0+0 (162752886 0 0) 0x164d5c0 con 0x1609e40
> 2013-04-02 15:41:41.416252 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(1956 pgs tid 22 v 4) v1 -- ?+0 0x174be80 con 0x1609e40
> 2013-04-02 15:41:41.423974 7fffcce2c700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).writer sending message 27 0x174be80pg_stats(1956 pgs tid 22 v 4) v1
> 2013-04-02 15:41:46.421537 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(1956 pgs tid 23 v 4) v1 -- ?+0 0x174bc00 con 0x1609e40
> 2013-04-02 15:41:46.429102 7fffcce2c700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).writer sending message 29 0x174bc00pg_stats(1956 pgs tid 23 v 4) v1
> 2013-04-02 15:41:51.425198 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(1956 pgs tid 24 v 4) v1 -- ?+0 0x174b980 con 0x1609e40
> 2013-04-02 15:41:51.433005 7fffcce2c700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).writer sending message 31 0x174b980pg_stats(1956 pgs tid 24 v 4) v1
> 2013-04-02 15:41:54.483344 7fffcd82e700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).reader got message 32 0x4732bc0 pg_stats_ack(2139 pgs tid 11) v1
> 2013-04-02 15:41:54.483470 7fffe485c700 1 -- 172.17.131.32:6800/217031 <== mon.0 172.17.131.31:6789/0 32 ==== pg_stats_ack(2139 pgs tid 11) v1 ==== 62035+0+0 (162752886 0 0) 0x4732bc0 con 0x1609e40
> 2013-04-02 15:41:54.484667 7fffcd82e700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).reader got message 33 0x4732680 pg_stats_ack(2139 pgs tid 12) v1
> 2013-04-02 15:41:54.484796 7fffe485c700 1 -- 172.17.131.32:6800/217031 <== mon.0 172.17.131.31:6789/0 33 ==== pg_stats_ack(2139 pgs tid 12) v1 ==== 62035+0+0 (162752886 0 0) 0x4732680 con 0x1609e40
> 2013-04-02 15:41:54.485954 7fffcd82e700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).reader got message 34 0x4732840 pg_stats_ack(2139 pgs tid 13) v1
> 2013-04-02 15:41:54.486291 7fffe485c700 1 -- 172.17.131.32:6800/217031 <== mon.0 172.17.131.31:6789/0 34 ==== pg_stats_ack(2139 pgs tid 13) v1 ==== 62035+0+0 (162752886 0 0) 0x4732840 con 0x1609e40
> 2013-04-02 15:41:54.487671 7fffcd82e700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).reader got message 35 0x4736040 pg_stats_ack(2139 pgs tid 14) v1
> 2013-04-02 15:41:54.487791 7fffe485c700 1 -- 172.17.131.32:6800/217031 <== mon.0 172.17.131.31:6789/0 35 ==== pg_stats_ack(2139 pgs tid 14) v1 ==== 62035+0+0 (162752886 0 0) 0x4736040 con 0x1609e40
> 2013-04-02 15:41:54.488818 7fffcd82e700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).reader got message 36 0x4b34440 pg_stats_ack(2139 pgs tid 15) v1
> 2013-04-02 15:41:54.489018 7fffe485c700 1 -- 172.17.131.32:6800/217031 <== mon.0 172.17.131.31:6789/0 36 ==== pg_stats_ack(2139 pgs tid 15) v1 ==== 62035+0+0 (162752886 0 0) 0x4b34440 con 0x1609e40
> 2013-04-02 15:41:54.490044 7fffcd82e700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).reader got message 37 0x164fa80 pg_stats_ack(2139 pgs tid 16) v1
> 2013-04-02 15:41:54.490173 7fffe485c700 1 -- 172.17.131.32:6800/217031 <== mon.0 172.17.131.31:6789/0 37 ==== pg_stats_ack(2139 pgs tid 16) v1 ==== 62035+0+0 (162752886 0 0) 0x164fa80 con 0x1609e40
> 2013-04-02 15:41:54.491252 7fffcd82e700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).reader got message 38 0x164f700 pg_stats_ack(2139 pgs tid 17) v1
> 2013-04-02 15:41:54.491376 7fffe485c700 1 -- 172.17.131.32:6800/217031 <== mon.0 172.17.131.31:6789/0 38 ==== pg_stats_ack(2139 pgs tid 17) v1 ==== 62035+0+0 (162752886 0 0) 0x164f700 con 0x1609e40
> 2013-04-02 15:41:54.492139 7fffcd82e700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).reader got message 39 0x4b34d00 pg_stats_ack(1688 pgs tid 18) v1
> 2013-04-02 15:41:54.492427 7fffe485c700 1 -- 172.17.131.32:6800/217031 <== mon.0 172.17.131.31:6789/0 39 ==== pg_stats_ack(1688 pgs tid 18) v1 ==== 48956+0+0 (1148031672 0 0) 0x4b34d00 con 0x1609e40
> 2013-04-02 15:41:54.493228 7fffcd82e700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).reader got message 40 0x3e63f00 pg_stats_ack(1688 pgs tid 19) v1
> 2013-04-02 15:41:54.493359 7fffe485c700 1 -- 172.17.131.32:6800/217031 <== mon.0 172.17.131.31:6789/0 40 ==== pg_stats_ack(1688 pgs tid 19) v1 ==== 48956+0+0 (1148031672 0 0) 0x3e63f00 con 0x1609e40
> 2013-04-02 15:41:54.494104 7fffcd82e700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).reader got message 41 0x3e63b80 pg_stats_ack(1688 pgs tid 20) v1
> 2013-04-02 15:41:54.494228 7fffe485c700 1 -- 172.17.131.32:6800/217031 <== mon.0 172.17.131.31:6789/0 41 ==== pg_stats_ack(1688 pgs tid 20) v1 ==== 48956+0+0 (1148031672 0 0) 0x3e63b80 con 0x1609e40
> 2013-04-02 15:41:54.494914 7fffcd82e700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).reader got message 42 0x3e640c0 pg_stats_ack(1688 pgs tid 21) v1
> 2013-04-02 15:41:54.494979 7fffe485c700 1 -- 172.17.131.32:6800/217031 <== mon.0 172.17.131.31:6789/0 42 ==== pg_stats_ack(1688 pgs tid 21) v1 ==== 48956+0+0 (1148031672 0 0) 0x3e640c0 con 0x1609e40
> 2013-04-02 15:41:54.495941 7fffcd82e700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).reader got message 43 0x17f5400 pg_stats_ack(1956 pgs tid 22) v1
> 2013-04-02 15:41:54.496066 7fffe485c700 1 -- 172.17.131.32:6800/217031 <== mon.0 172.17.131.31:6789/0 43 ==== pg_stats_ack(1956 pgs tid 22) v1 ==== 56728+0+0 (217165409 0 0) 0x17f5400 con 0x1609e40
> 2013-04-02 15:41:54.496825 7fffcd82e700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).reader got message 44 0x183fa80 pg_stats_ack(1956 pgs tid 23) v1
> 2013-04-02 15:41:54.496945 7fffe485c700 1 -- 172.17.131.32:6800/217031 <== mon.0 172.17.131.31:6789/0 44 ==== pg_stats_ack(1956 pgs tid 23) v1 ==== 56728+0+0 (217165409 0 0) 0x183fa80 con 0x1609e40
> 2013-04-02 15:41:54.497867 7fffcd82e700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).reader got message 45 0x1839a40 pg_stats_ack(1956 pgs tid 24) v1
> 2013-04-02 15:41:54.497983 7fffe485c700 1 -- 172.17.131.32:6800/217031 <== mon.0 172.17.131.31:6789/0 45 ==== pg_stats_ack(1956 pgs tid 24) v1 ==== 56728+0+0 (217165409 0 0) 0x1839a40 con 0x1609e40
> 2013-04-02 15:42:00.522449 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(108 pgs tid 25 v 7) v1 -- ?+0 0x4031b80 con 0x1609e40
> 2013-04-02 15:42:00.523046 7fffcce2c700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).writer sending message 36 0x4031b80pg_stats(108 pgs tid 25 v 7) v1
> 2013-04-02 15:42:05.525433 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(108 pgs tid 26 v 7) v1 -- ?+0 0x4031900 con 0x1609e40
> 2013-04-02 15:42:05.526040 7fffcce2c700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).writer sending message 38 0x4031900pg_stats(108 pgs tid 26 v 7) v1
> 2013-04-02 15:42:10.530320 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(108 pgs tid 27 v 8) v1 -- ?+0 0x4031680 con 0x1609e40
> 2013-04-02 15:42:10.530987 7fffcce2c700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).writer sending message 39 0x4031680pg_stats(108 pgs tid 27 v 8) v1
> 2013-04-02 15:42:15.564684 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(108 pgs tid 28 v 11) v1 -- ?+0 0x4031400 con 0x1609e40
> 2013-04-02 15:42:15.565270 7fffcce2c700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).writer sending message 42 0x4031400pg_stats(108 pgs tid 28 v 11) v1
> 2013-04-02 15:42:20.568779 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(108 pgs tid 29 v 13) v1 -- ?+0 0x5861180 con 0x1609e40
> 2013-04-02 15:42:20.569360 7fffcce2c700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x1748c80 sd=24 :45195 s=2 pgs=1394 cs=1 l=1).writer sending message 45 0x5861180pg_stats(108 pgs tid 29 v 13) v1
> 2013-04-02 15:42:24.570963 7ffff387a700 1 osd.0 14 mon hasn't acked PGStats in 30.072959 seconds, reconnecting elsewhere
> 2013-04-02 15:42:33.891060 7fffe485c700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(108 pgs tid 30 v 14) v1 -- ?+0 0x6794380 con 0x5819e40
> 2013-04-02 15:42:33.891072 7fffe485c700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(108 pgs tid 31 v 14) v1 -- ?+0 0x4031680 con 0x5819e40
> 2013-04-02 15:42:33.891149 7fffe485c700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(108 pgs tid 32 v 16) v1 -- ?+0 0x6793e80 con 0x5819e40
> 2013-04-02 15:42:33.891808 7fff9bc1f700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x5862800 sd=27 :47851 s=2 pgs=1425 cs=1 l=1).writer sending message 4 0x6794380pg_stats(108 pgs tid 30 v 14) v1
> 2013-04-02 15:42:33.892402 7fff9bc1f700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x5862800 sd=27 :47851 s=2 pgs=1425 cs=1 l=1).writer sending message 5 0x4031680pg_stats(108 pgs tid 31 v 14) v1
> 2013-04-02 15:42:33.893054 7fff9bc1f700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x5862800 sd=27 :47851 s=2 pgs=1425 cs=1 l=1).writer sending message 9 0x6793e80pg_stats(108 pgs tid 32 v 16) v1
> 2013-04-02 15:42:34.573582 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(108 pgs tid 33 v 16) v1 -- ?+0 0x6793700 con 0x5819e40
> 2013-04-02 15:42:34.574136 7fff9bc1f700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x5862800 sd=27 :47851 s=2 pgs=1425 cs=1 l=1).writer sending message 11 0x6793700pg_stats(108 pgs tid 33 v 16) v1
> 2013-04-02 15:42:39.575223 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(108 pgs tid 34 v 16) v1 -- ?+0 0x6792d00 con 0x5819e40
> 2013-04-02 15:42:39.575817 7fff9bc1f700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x5862800 sd=27 :47851 s=2 pgs=1425 cs=1 l=1).writer sending message 12 0x6792d00pg_stats(108 pgs tid 34 v 16) v1
> 2013-04-02 15:42:44.576975 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(108 pgs tid 35 v 17) v1 -- ?+0 0x6795f00 con 0x5819e40
> 2013-04-02 15:42:44.577611 7fff9bc1f700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x5862800 sd=27 :47851 s=2 pgs=1425 cs=1 l=1).writer sending message 14 0x6795f00pg_stats(108 pgs tid 35 v 17) v1
> 2013-04-02 15:42:48.591071 7fffcd82e700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x5862800 sd=27 :47851 s=2 pgs=1425 cs=1 l=1).reader got message 5 0x443d080 pg_stats_ack(108 pgs tid 30) v1
> 2013-04-02 15:42:48.591184 7fffe485c700 1 -- 172.17.131.32:6800/217031 <== mon.0 172.17.131.31:6789/0 5 ==== pg_stats_ack(108 pgs tid 30) v1 ==== 3136+0+0 (3131392639 0 0) 0x443d080 con 0x5819e40
> 2013-04-02 15:42:49.577888 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(108 pgs tid 36 v 18) v1 -- ?+0 0x6795c80 con 0x5819e40
> 2013-04-02 15:42:49.578479 7fff9bc1f700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x5862800 sd=27 :47851 s=2 pgs=1425 cs=1 l=1).writer sending message 15 0x6795c80pg_stats(108 pgs tid 36 v 18) v1
> 2013-04-02 15:42:50.441073 7fffcd82e700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x5862800 sd=27 :47851 s=2 pgs=1425 cs=1 l=1).reader got message 6 0x1823480 pg_stats_ack(108 pgs tid 0) v1
> 2013-04-02 15:42:50.441140 7fffe485c700 1 -- 172.17.131.32:6800/217031 <== mon.0 172.17.131.31:6789/0 6 ==== pg_stats_ack(108 pgs tid 0) v1 ==== 3136+0+0 (3131392639 0 0) 0x1823480 con 0x5819e40
> 2013-04-02 15:42:52.094048 7fffcd82e700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x5862800 sd=27 :47851 s=2 pgs=1425 cs=1 l=1).reader got message 13 0x6cbec80 pg_stats_ack(108 pgs tid 32) v1
> 2013-04-02 15:42:52.094159 7fffe485c700 1 -- 172.17.131.32:6800/217031 <== mon.0 172.17.131.31:6789/0 13 ==== pg_stats_ack(108 pgs tid 32) v1 ==== 3136+0+0 (3721100430 0 0) 0x6cbec80 con 0x5819e40
> 2013-04-02 15:42:54.594067 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(108 pgs tid 37 v 20) v1 -- ?+0 0x6797580 con 0x5819e40
> 2013-04-02 15:42:54.594729 7fff9bc1f700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x5862800 sd=27 :47851 s=2 pgs=1425 cs=1 l=1).writer sending message 19 0x6797580pg_stats(108 pgs tid 37 v 20) v1
> 2013-04-02 15:42:59.658795 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(108 pgs tid 38 v 20) v1 -- ?+0 0x174a300 con 0x5819e40
> 2013-04-02 15:42:59.659279 7fff9bc1f700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x5862800 sd=27 :47851 s=2 pgs=1425 cs=1 l=1).writer sending message 20 0x174a300pg_stats(108 pgs tid 38 v 20) v1
> 2013-04-02 15:43:04.660446 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(108 pgs tid 39 v 20) v1 -- ?+0 0x6793200 con 0x5819e40
> 2013-04-02 15:43:04.660952 7fff9bc1f700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x5862800 sd=27 :47851 s=2 pgs=1425 cs=1 l=1).writer sending message 22 0x6793200pg_stats(108 pgs tid 39 v 20) v1
> 2013-04-02 15:43:09.661467 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(108 pgs tid 40 v 20) v1 -- ?+0 0x6792a80 con 0x5819e40
> 2013-04-02 15:43:09.661950 7fff9bc1f700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x5862800 sd=27 :47851 s=2 pgs=1425 cs=1 l=1).writer sending message 23 0x6792a80pg_stats(108 pgs tid 40 v 20) v1
> 2013-04-02 15:43:14.662762 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(108 pgs tid 41 v 20) v1 -- ?+0 0x6792300 con 0x5819e40
> 2013-04-02 15:43:14.663314 7fff9bc1f700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x5862800 sd=27 :47851 s=2 pgs=1425 cs=1 l=1).writer sending message 25 0x6792300pg_stats(108 pgs tid 41 v 20) v1
> 2013-04-02 15:43:19.664054 7ffff387a700 1 -- 172.17.131.32:6800/217031 --> 172.17.131.31:6789/0 -- pg_stats(108 pgs tid 42 v 20) v1 -- ?+0 0x6791b80 con 0x5819e40
> 2013-04-02 15:43:19.664776 7fff9bc1f700 2 -- 172.17.131.32:6800/217031 >> 172.17.131.31:6789/0 pipe(0x5862800 sd=27 :47851 s=2 pgs=1425 cs=1 l=1).writer sending message 26 0x6791b80pg_stats(108 pgs tid 42 v 20) v1
> 2013-04-02 15:43:22.664921 7ffff387a700 1 osd.0 20 mon hasn't acked PGStats in 30.570739 seconds, reconnecting elsewhere
> 2013-04-02 15:43:52.672461 7ffff387a700 1 osd.0 20 mon hasn't acked PGStats in 30.007379 seconds, reconnecting elsewhere
> 2013-04-02 15:44:22.680691 7ffff387a700 1 osd.0 20 mon hasn't acked PGStats in 30.007095 seconds, reconnecting elsewhere
> 2013-04-02 15:44:52.688936 7ffff387a700 1 osd.0 20 mon hasn't acked PGStats in 30.007221 seconds, reconnecting elsewhere
> 2013-04-02 15:45:22.702066 7ffff387a700 1 osd.0 20 mon hasn't acked PGStats in 30.012264 seconds, reconnecting elsewhere
> 2013-04-02 15:45:52.711134 7ffff387a700 1 osd.0 20 mon hasn't acked PGStats in 30.007957 seconds, reconnecting elsewhere
> 2013-04-02 15:46:22.719845 7ffff387a700 1 osd.0 20 mon hasn't acked PGStats in 30.007859 seconds, reconnecting elsewhere
> 2013-04-02 15:46:52.729475 7ffff387a700 1 osd.0 20 mon hasn't acked PGStats in 30.008756 seconds, reconnecting elsewhere
>
> Why would OSD 0 stop trying to send PGStats to the mon?
>
> Maybe the above is all normal behavior, so any thoughts on what I
> should be looking for instead are greatly appreciated.
>
> On the bright side, this issue is completely reproducible for me,
> and I'll be happy to test anything you think might help sort it out.
>
> -- Jim
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Trouble getting a new file system to start, for v0.59 and newer
2013-04-03 15:58 ` Sage Weil
@ 2013-04-03 17:09 ` Jim Schutt
2013-04-03 17:14 ` Gregory Farnum
0 siblings, 1 reply; 12+ messages in thread
From: Jim Schutt @ 2013-04-03 17:09 UTC (permalink / raw)
To: Sage Weil; +Cc: Joao Eduardo Luis, ceph-devel@vger.kernel.org
Hi Sage,
On 04/03/2013 09:58 AM, Sage Weil wrote:
> Hi Jim,
>
> What happens if you change 'osd mon ack timeout = 300' (from the
> default of 30)? I suspect part of the problem is that the mons are just
> slow enough that the osd's resend the same thing again and it snowballs
> into more work for the monitor.
Thanks, that helped. My OSDs aren't reconnecting to the mon any more,
and the new filesystem started up as expected.
Hmmm, it occurs to me that I upgraded my mon hosts to 10 GbE NICs at
about the same time I started testing v0.59. Perhaps before the upgrade
I was running right at the edge of that timeout. After the NIC upgrade
the PGStat messages come flooding in at startup, and they bunch up
enough that working through the backlog pushed me over the timeout cliff?
Is there any downside to using a large 'osd mon ack timeout', assuming I
run more than one mon? If so, I expect I'll work my way back from
'osd mon ack timeout = 300' to see how big it needs to be to stay reliable
for my configuration.
Sorry for the noise about paxos. At least it was useful
to help Joao find that debug log message that was more expensive
than expected....
Thanks -- Jim
>
> sage
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Trouble getting a new file system to start, for v0.59 and newer
2013-04-03 17:09 ` Jim Schutt
@ 2013-04-03 17:14 ` Gregory Farnum
2013-04-03 17:49 ` Gregory Farnum
0 siblings, 1 reply; 12+ messages in thread
From: Gregory Farnum @ 2013-04-03 17:14 UTC (permalink / raw)
To: Jim Schutt; +Cc: Sage Weil, Joao Eduardo Luis, ceph-devel@vger.kernel.org
On Wed, Apr 3, 2013 at 10:09 AM, Jim Schutt <jaschut@sandia.gov> wrote:
> Hi Sage,
>
> On 04/03/2013 09:58 AM, Sage Weil wrote:
>> Hi Jim,
>>
>> What happens if you change 'osd mon ack timeout = 300' (from the
>> default of 30)? I suspect part of the problem is that the mons are just
>> slow enough that the osd's resend the same thing again and it snowballs
>> into more work for the monitor.
>
> Thanks, that helped. My OSDs aren't reconnecting to the mon any more,
> and the new filesystem started up as expected.
>
> Hmmm, it occurs to me that I upgraded my mon hosts to 10 GbE NICs at
> about the same time I started testing v0.59. Perhaps before the upgrade
> I was running right at the edge of that timeout. After the NIC upgrade
> the PGStat messages come flooding in at startup, and they bunch up
> enough that working through the backlog pushed me over the timeout cliff?
>
> Is there any downside to using a large 'osd mon ack timeout', assuming I
> run more than one mon? If so, I expect I'll work my way back from
> 'osd mon ack timeout = 300' to see how big it needs to be to stay reliable
> for my configuration.
It's a timeout, so the generic downsides to larger timeouts — if the
monitor actually has gone away it's going to take the OSDs more time
to connect to somebody else for their updates and reports. This will
probably be most apparent if they're trying to peer and can't make
progress until they get acks from the monitors, but the one they're
connected to has died.
> Sorry for the noise about paxos. At least it was useful
> to help Joao find that debug log message that was more expensive
> than expected....
It's not noise — the reason this timeout is causing problems now is
that the monitor disk commits are taking so long that it looks like
they've failed. Which is bad. :/ So thanks for reporting it!
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Trouble getting a new file system to start, for v0.59 and newer
2013-04-03 17:14 ` Gregory Farnum
@ 2013-04-03 17:49 ` Gregory Farnum
2013-04-03 17:58 ` Jim Schutt
0 siblings, 1 reply; 12+ messages in thread
From: Gregory Farnum @ 2013-04-03 17:49 UTC (permalink / raw)
To: Jim Schutt; +Cc: Sage Weil, Joao Eduardo Luis, ceph-devel@vger.kernel.org
On Wed, Apr 3, 2013 at 10:14 AM, Gregory Farnum <greg@inktank.com> wrote:
> On Wed, Apr 3, 2013 at 10:09 AM, Jim Schutt <jaschut@sandia.gov> wrote:
>> Hi Sage,
>>
>> On 04/03/2013 09:58 AM, Sage Weil wrote:
>>> Hi Jim,
>>>
>>> What happens if you change 'osd mon ack timeout = 300' (from the
>>> default of 30)? I suspect part of the problem is that the mons are just
>>> slow enough that the osd's resend the same thing again and it snowballs
>>> into more work for the monitor.
>>
>> Thanks, that helped. My OSDs aren't reconnecting to the mon any more,
>> and the new filesystem started up as expected.
>>
>> Hmmm, it occurs to me that I upgraded my mon hosts to 10 GbE NICs at
>> about the same time I started testing v0.59. Perhaps before the upgrade
>> I was running right at the edge of that timeout. After the NIC upgrade
>> the PGStat messages come flooding in at startup, and they bunch up
>> enough that working through the backlog pushed me over the timeout cliff?
>>
>> Is there any downside to using a large 'osd mon ack timeout', assuming I
>> run more than one mon? If so, I expect I'll work my way back from
>> 'osd mon ack timeout = 300' to see how big it needs to be to stay reliable
>> for my configuration.
>
> It's a timeout, so the generic downsides to larger timeouts — if the
> monitor actually has gone away it's going to take the OSDs more time
> to connect to somebody else for their updates and reports. This will
> probably be most apparent if they're trying to peer and can't make
> progress until they get acks from the monitors, but the one they're
> connected to has died.
>
>
>> Sorry for the noise about paxos. At least it was useful
>> to help Joao find that debug log message that was more expensive
>> than expected....
>
> It's not noise — the reason this timeout is causing problems now is
> that the monitor disk commits are taking so long that it looks like
> they've failed. Which is bad. :/ So thanks for reporting it!
Sorry, guess I forgot some of the history since this piece at least is
resolved now. I'm surprised if 30-second timeouts are causing issues
without those overloads you were seeing; have you seen this issue
without your high debugging levels and without the bad PG commits (due
to debugging)?
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Trouble getting a new file system to start, for v0.59 and newer
2013-04-03 17:49 ` Gregory Farnum
@ 2013-04-03 17:58 ` Jim Schutt
2013-04-03 18:25 ` Sage Weil
0 siblings, 1 reply; 12+ messages in thread
From: Jim Schutt @ 2013-04-03 17:58 UTC (permalink / raw)
To: Gregory Farnum; +Cc: Sage Weil, Joao Eduardo Luis, ceph-devel@vger.kernel.org
On 04/03/2013 11:49 AM, Gregory Farnum wrote:
> On Wed, Apr 3, 2013 at 10:14 AM, Gregory Farnum <greg@inktank.com> wrote:
>> On Wed, Apr 3, 2013 at 10:09 AM, Jim Schutt <jaschut@sandia.gov> wrote:
>>> Hi Sage,
>>>
>>> On 04/03/2013 09:58 AM, Sage Weil wrote:
>>>> Hi Jim,
>>>>
>>>> What happens if you change 'osd mon ack timeout = 300' (from the
>>>> default of 30)? I suspect part of the problem is that the mons are just
>>>> slow enough that the osd's resend the same thing again and it snowballs
>>>> into more work for the monitor.
>>>
>>> Thanks, that helped. My OSDs aren't reconnecting to the mon any more,
>>> and the new filesystem started up as expected.
>>>
>>> Hmmm, it occurs to me that I upgraded my mon hosts to 10 GbE NICs at
>>> about the same time I started testing v0.59. Perhaps before the upgrade
>>> I was running right at the edge of that timeout. After the NIC upgrade
>>> the PGStat messages come flooding in at startup, and they bunch up
>>> enough that working through the backlog pushed me over the timeout cliff?
>>>
>>> Is there any downside to using a large 'osd mon ack timeout', assuming I
>>> run more than one mon? If so, I expect I'll work my way back from
>>> 'osd mon ack timeout = 300' to see how big it needs to be to stay reliable
>>> for my configuration.
>>
>> It's a timeout, so the generic downsides to larger timeouts — if the
>> monitor actually has gone away it's going to take the OSDs more time
>> to connect to somebody else for their updates and reports. This will
>> probably be most apparent if they're trying to peer and can't make
>> progress until they get acks from the monitors, but the one they're
>> connected to has died.
>>
>>
>>> Sorry for the noise about paxos. At least it was useful
>>> to help Joao find that debug log message that was more expensive
>>> than expected....
>>
>> It's not noise — the reason this timeout is causing problems now is
>> that the monitor disk commits are taking so long that it looks like
>> they've failed. Which is bad. :/ So thanks for reporting it!
>
> Sorry, guess I forgot some of the history since this piece at least is
> resolved now. I'm surprised if 30-second timeouts are causing issues
> without those overloads you were seeing; have you seen this issue
> without your high debugging levels and without the bad PG commits (due
> to debugging)?
I think so, because that's why I started with higher debugging
levels.
But, as it turns out, I'm just in the process of returning to my
testing of next, with all my debugging back to 0. So, I'll try
the default timeout of 30 seconds first. If I have trouble starting
up a new file system, I'll turn up the timeout and try again, without
any extra debugging. Either way, I'll let you know what happens.
-- Jim
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Trouble getting a new file system to start, for v0.59 and newer
2013-04-03 17:58 ` Jim Schutt
@ 2013-04-03 18:25 ` Sage Weil
2013-04-03 22:40 ` Jim Schutt
0 siblings, 1 reply; 12+ messages in thread
From: Sage Weil @ 2013-04-03 18:25 UTC (permalink / raw)
To: Jim Schutt; +Cc: Gregory Farnum, Joao Eduardo Luis, ceph-devel@vger.kernel.org
On Wed, 3 Apr 2013, Jim Schutt wrote:
> On 04/03/2013 11:49 AM, Gregory Farnum wrote:
> > On Wed, Apr 3, 2013 at 10:14 AM, Gregory Farnum <greg@inktank.com> wrote:
> >> On Wed, Apr 3, 2013 at 10:09 AM, Jim Schutt <jaschut@sandia.gov> wrote:
> >>> Hi Sage,
> >>>
> >>> On 04/03/2013 09:58 AM, Sage Weil wrote:
> >>>> Hi Jim,
> >>>>
> >>>> What happens if you change 'osd mon ack timeout = 300' (from the
> >>>> default of 30)? I suspect part of the problem is that the mons are just
> >>>> slow enough that the osd's resend the same thing again and it snowballs
> >>>> into more work for the monitor.
> >>>
> >>> Thanks, that helped. My OSDs aren't reconnecting to the mon any more,
> >>> and the new filesystem started up as expected.
> >>>
> >>> Hmmm, it occurs to me that I upgraded my mon hosts to 10 GbE NICs at
> >>> about the same time I started testing v0.59. Perhaps before the upgrade
> >>> I was running right at the edge of that timeout. After the NIC upgrade
> >>> the PGStat messages come flooding in at startup, and they bunch up
> >>> enough that working through the backlog pushed me over the timeout cliff?
> >>>
> >>> Is there any downside to using a large 'osd mon ack timeout', assuming I
> >>> run more than one mon? If so, I expect I'll work my way back from
> >>> 'osd mon ack timeout = 300' to see how big it needs to be to stay reliable
> >>> for my configuration.
> >>
> >> It's a timeout, so the generic downsides to larger timeouts ? if the
> >> monitor actually has gone away it's going to take the OSDs more time
> >> to connect to somebody else for their updates and reports. This will
> >> probably be most apparent if they're trying to peer and can't make
> >> progress until they get acks from the monitors, but the one they're
> >> connected to has died.
> >>
> >>
> >>> Sorry for the noise about paxos. At least it was useful
> >>> to help Joao find that debug log message that was more expensive
> >>> than expected....
> >>
> >> It's not noise ? the reason this timeout is causing problems now is
> >> that the monitor disk commits are taking so long that it looks like
> >> they've failed. Which is bad. :/ So thanks for reporting it!
> >
> > Sorry, guess I forgot some of the history since this piece at least is
> > resolved now. I'm surprised if 30-second timeouts are causing issues
> > without those overloads you were seeing; have you seen this issue
> > without your high debugging levels and without the bad PG commits (due
> > to debugging)?
>
> I think so, because that's why I started with higher debugging
> levels.
>
> But, as it turns out, I'm just in the process of returning to my
> testing of next, with all my debugging back to 0. So, I'll try
> the default timeout of 30 seconds first. If I have trouble starting
> up a new file system, I'll turn up the timeout and try again, without
> any extra debugging. Either way, I'll let you know what happens.
I would be curious to hear roughly what value between 30 and 300 is
sufficient, if you can experiment just a bit. We probably want to adjust
the default.
Perhaps more importantly, we'll need to look at the performance of the pg
stat updates on the mon. There is a refactor due in that code that should
improve life, but it's slated for dumpling.
sage
>
> -- Jim
>
> > -Greg
> > Software Engineer #42 @ http://inktank.com | http://ceph.com
> >
> >
>
>
>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Trouble getting a new file system to start, for v0.59 and newer
2013-04-03 18:25 ` Sage Weil
@ 2013-04-03 22:40 ` Jim Schutt
2013-04-03 22:51 ` Gregory Farnum
2013-04-04 19:14 ` Jim Schutt
0 siblings, 2 replies; 12+ messages in thread
From: Jim Schutt @ 2013-04-03 22:40 UTC (permalink / raw)
To: Sage Weil; +Cc: Gregory Farnum, Joao Eduardo Luis, ceph-devel@vger.kernel.org
On 04/03/2013 12:25 PM, Sage Weil wrote:
>>> > > Sorry, guess I forgot some of the history since this piece at least is
>>> > > resolved now. I'm surprised if 30-second timeouts are causing issues
>>> > > without those overloads you were seeing; have you seen this issue
>>> > > without your high debugging levels and without the bad PG commits (due
>>> > > to debugging)?
>> >
>> > I think so, because that's why I started with higher debugging
>> > levels.
>> >
>> > But, as it turns out, I'm just in the process of returning to my
>> > testing of next, with all my debugging back to 0. So, I'll try
>> > the default timeout of 30 seconds first. If I have trouble starting
>> > up a new file system, I'll turn up the timeout and try again, without
>> > any extra debugging. Either way, I'll let you know what happens.
> I would be curious to hear roughly what value between 30 and 300 is
> sufficient, if you can experiment just a bit. We probably want to adjust
> the default.
>
> Perhaps more importantly, we'll need to look at the performance of the pg
> stat updates on the mon. There is a refactor due in that code that should
> improve life, but it's slated for dumpling.
OK, here's some results, with all debugging at 0, using current next...
My testing is for 1 mon + 576 OSDs, 24/host. All my storage cluster hosts
use 10 GbE NICs now. The mon host uses an SSD for the mon data store.
My test procedure is to start 'ceph -w', start all the OSDs, and once
they're all running start the mon. I report the time from starting
the mon to all PGs active+clean.
# PGs osd mon ack startup notes
timeout time
------- ------------ -------- -----
55392 default >30:00 1
55392 300 18:36 2
55392 60 >30:00 3
55392 150 >30:00 4
55392 240 >30:00 5
55392 300 >30:00 2,6
notes:
1) lots of PGs marked stale, OSDs wrongly marked down
before I gave up on this case
2) OSDs report lots of slow requests for "pg_notify(...) v4
currently wait for new map"
3) some OSDs wrongly marked down, OSDs report some slow requests
for "pg_notify(...) v4 currently wait for new map"
before I gave up on this case
4) appeared to be making progress; then an OSD was marked
out at ~21 minutes; many more marked out before I
gave up on this case
5) some OSD reports of slow requests for "pg_notify",
some OSDs wrongly marked down, appeared to be making
progress, then stalled; then I gave up on this case
6) retried this case, appeared to be making progress, but
after ~18 min stalled at 19701 active+clean, 35691 peering
until I gave up
Hmmm, I didn't really expect the above results. I ran out of
time before attempting an even longer osd mon ack timeout.
But either we're on the wrong trail, or 300 is not sufficient.
Or, I'm doing something wrong and haven't yet figured out what
it is.
FWIW, on v0.57 or v0.58 I was testing with one pool at 256K PGs,
and my memory is a new filesystem started up in ~5 minutes. For
that testing I had to increase 'paxos propose interval' to two
or three seconds to keep the monitor writeout rate (as measured
by vmstat) down to a sustained 50-70 MB/s during start-up.
That was with a 1 GbE NIC in the mon; the reason I upgraded
it was a filesystem with 512K PGs was taking too long to start,
and I thought the mon might be network-limited since it had an
SSD for the mon data store.
For the testing above I used the default 'paxos propose interval'.
I don't know if it matters, but vmstat sees only a little data
being written on the mon system.
-- Jim
>
> sage
>
>> >
>> > -- Jim
>> >
>>> > > -Greg
>>> > > Software Engineer #42 @ http://inktank.com | http://ceph.com
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Trouble getting a new file system to start, for v0.59 and newer
2013-04-03 22:40 ` Jim Schutt
@ 2013-04-03 22:51 ` Gregory Farnum
2013-04-04 14:15 ` Jim Schutt
2013-04-04 19:14 ` Jim Schutt
1 sibling, 1 reply; 12+ messages in thread
From: Gregory Farnum @ 2013-04-03 22:51 UTC (permalink / raw)
To: Jim Schutt; +Cc: Sage Weil, Joao Eduardo Luis, ceph-devel@vger.kernel.org
On Wed, Apr 3, 2013 at 3:40 PM, Jim Schutt <jaschut@sandia.gov> wrote:
> On 04/03/2013 12:25 PM, Sage Weil wrote:
>>>> > > Sorry, guess I forgot some of the history since this piece at least is
>>>> > > resolved now. I'm surprised if 30-second timeouts are causing issues
>>>> > > without those overloads you were seeing; have you seen this issue
>>>> > > without your high debugging levels and without the bad PG commits (due
>>>> > > to debugging)?
>>> >
>>> > I think so, because that's why I started with higher debugging
>>> > levels.
>>> >
>>> > But, as it turns out, I'm just in the process of returning to my
>>> > testing of next, with all my debugging back to 0. So, I'll try
>>> > the default timeout of 30 seconds first. If I have trouble starting
>>> > up a new file system, I'll turn up the timeout and try again, without
>>> > any extra debugging. Either way, I'll let you know what happens.
>> I would be curious to hear roughly what value between 30 and 300 is
>> sufficient, if you can experiment just a bit. We probably want to adjust
>> the default.
>>
>> Perhaps more importantly, we'll need to look at the performance of the pg
>> stat updates on the mon. There is a refactor due in that code that should
>> improve life, but it's slated for dumpling.
>
> OK, here's some results, with all debugging at 0, using current next...
>
> My testing is for 1 mon + 576 OSDs, 24/host. All my storage cluster hosts
> use 10 GbE NICs now. The mon host uses an SSD for the mon data store.
> My test procedure is to start 'ceph -w', start all the OSDs, and once
> they're all running start the mon. I report the time from starting
> the mon to all PGs active+clean.
>
> # PGs osd mon ack startup notes
> timeout time
> ------- ------------ -------- -----
> 55392 default >30:00 1
> 55392 300 18:36 2
> 55392 60 >30:00 3
> 55392 150 >30:00 4
> 55392 240 >30:00 5
> 55392 300 >30:00 2,6
>
> notes:
> 1) lots of PGs marked stale, OSDs wrongly marked down
> before I gave up on this case
> 2) OSDs report lots of slow requests for "pg_notify(...) v4
> currently wait for new map"
> 3) some OSDs wrongly marked down, OSDs report some slow requests
> for "pg_notify(...) v4 currently wait for new map"
> before I gave up on this case
> 4) appeared to be making progress; then an OSD was marked
> out at ~21 minutes; many more marked out before I
> gave up on this case
> 5) some OSD reports of slow requests for "pg_notify",
> some OSDs wrongly marked down, appeared to be making
> progress, then stalled; then I gave up on this case
> 6) retried this case, appeared to be making progress, but
> after ~18 min stalled at 19701 active+clean, 35691 peering
> until I gave up
>
> Hmmm, I didn't really expect the above results. I ran out of
> time before attempting an even longer osd mon ack timeout.
>
> But either we're on the wrong trail, or 300 is not sufficient.
> Or, I'm doing something wrong and haven't yet figured out what
> it is.
>
> FWIW, on v0.57 or v0.58 I was testing with one pool at 256K PGs,
> and my memory is a new filesystem started up in ~5 minutes. For
> that testing I had to increase 'paxos propose interval' to two
> or three seconds to keep the monitor writeout rate (as measured
> by vmstat) down to a sustained 50-70 MB/s during start-up.
>
> That was with a 1 GbE NIC in the mon; the reason I upgraded
> it was a filesystem with 512K PGs was taking too long to start,
> and I thought the mon might be network-limited since it had an
> SSD for the mon data store.
>
> For the testing above I used the default 'paxos propose interval'.
> I don't know if it matters, but vmstat sees only a little data
> being written on the mon system.
That's odd; I'd actually expect to see more going to disk with v0.59
than previously. Is vmstat actually looking at disk IO, or might it be
missing DirectIO or something? (Not that I remember if LevelDB is
using those.)
However, I think you might want to increase your paxos propose
interval to where it was before — your OSDs are having trouble keeping
up with the number of maps that are being generated, based on the fact
that you have a lot of pg notifies stuck waiting for newer maps.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Trouble getting a new file system to start, for v0.59 and newer
2013-04-03 22:51 ` Gregory Farnum
@ 2013-04-04 14:15 ` Jim Schutt
2013-04-04 15:52 ` Jim Schutt
0 siblings, 1 reply; 12+ messages in thread
From: Jim Schutt @ 2013-04-04 14:15 UTC (permalink / raw)
To: Gregory Farnum; +Cc: Sage Weil, Joao Eduardo Luis, ceph-devel@vger.kernel.org
On 04/03/2013 04:51 PM, Gregory Farnum wrote:
> On Wed, Apr 3, 2013 at 3:40 PM, Jim Schutt <jaschut@sandia.gov> wrote:
>> On 04/03/2013 12:25 PM, Sage Weil wrote:
>>>>>>> Sorry, guess I forgot some of the history since this piece at least is
>>>>>>> resolved now. I'm surprised if 30-second timeouts are causing issues
>>>>>>> without those overloads you were seeing; have you seen this issue
>>>>>>> without your high debugging levels and without the bad PG commits (due
>>>>>>> to debugging)?
>>>>>
>>>>> I think so, because that's why I started with higher debugging
>>>>> levels.
>>>>>
>>>>> But, as it turns out, I'm just in the process of returning to my
>>>>> testing of next, with all my debugging back to 0. So, I'll try
>>>>> the default timeout of 30 seconds first. If I have trouble starting
>>>>> up a new file system, I'll turn up the timeout and try again, without
>>>>> any extra debugging. Either way, I'll let you know what happens.
>>> I would be curious to hear roughly what value between 30 and 300 is
>>> sufficient, if you can experiment just a bit. We probably want to adjust
>>> the default.
>>>
>>> Perhaps more importantly, we'll need to look at the performance of the pg
>>> stat updates on the mon. There is a refactor due in that code that should
>>> improve life, but it's slated for dumpling.
>>
>> OK, here's some results, with all debugging at 0, using current next...
>>
>> My testing is for 1 mon + 576 OSDs, 24/host. All my storage cluster hosts
>> use 10 GbE NICs now. The mon host uses an SSD for the mon data store.
>> My test procedure is to start 'ceph -w', start all the OSDs, and once
>> they're all running start the mon. I report the time from starting
>> the mon to all PGs active+clean.
>>
>> # PGs osd mon ack startup notes
>> timeout time
>> ------- ------------ -------- -----
>> 55392 default >30:00 1
>> 55392 300 18:36 2
>> 55392 60 >30:00 3
>> 55392 150 >30:00 4
>> 55392 240 >30:00 5
>> 55392 300 >30:00 2,6
>>
>> notes:
>> 1) lots of PGs marked stale, OSDs wrongly marked down
>> before I gave up on this case
>> 2) OSDs report lots of slow requests for "pg_notify(...) v4
>> currently wait for new map"
>> 3) some OSDs wrongly marked down, OSDs report some slow requests
>> for "pg_notify(...) v4 currently wait for new map"
>> before I gave up on this case
>> 4) appeared to be making progress; then an OSD was marked
>> out at ~21 minutes; many more marked out before I
>> gave up on this case
>> 5) some OSD reports of slow requests for "pg_notify",
>> some OSDs wrongly marked down, appeared to be making
>> progress, then stalled; then I gave up on this case
>> 6) retried this case, appeared to be making progress, but
>> after ~18 min stalled at 19701 active+clean, 35691 peering
>> until I gave up
>>
>> Hmmm, I didn't really expect the above results. I ran out of
>> time before attempting an even longer osd mon ack timeout.
>>
>> But either we're on the wrong trail, or 300 is not sufficient.
>> Or, I'm doing something wrong and haven't yet figured out what
>> it is.
>>
>> FWIW, on v0.57 or v0.58 I was testing with one pool at 256K PGs,
>> and my memory is a new filesystem started up in ~5 minutes. For
>> that testing I had to increase 'paxos propose interval' to two
>> or three seconds to keep the monitor writeout rate (as measured
>> by vmstat) down to a sustained 50-70 MB/s during start-up.
>>
>> That was with a 1 GbE NIC in the mon; the reason I upgraded
>> it was a filesystem with 512K PGs was taking too long to start,
>> and I thought the mon might be network-limited since it had an
>> SSD for the mon data store.
>>
>> For the testing above I used the default 'paxos propose interval'.
>> I don't know if it matters, but vmstat sees only a little data
>> being written on the mon system.
>
> That's odd; I'd actually expect to see more going to disk with v0.59
> than previously. Is vmstat actually looking at disk IO, or might it be
> missing DirectIO or something? (Not that I remember if LevelDB is
> using those.)
FWIW, 'dd oflag=direct' shows up in vmstat. But, I don't know if
that is relevant to what LevelDB might be doing...
> However, I think you might want to increase your paxos propose
> interval to where it was before — your OSDs are having trouble keeping
> up with the number of maps that are being generated, based on the fact
> that you have a lot of pg notifies stuck waiting for newer maps.
OK, I'll try that. But to clarify, in the past the default paxos
propose interval was good up to 128K PGs, or so.
Thanks -- Jim
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Trouble getting a new file system to start, for v0.59 and newer
2013-04-04 14:15 ` Jim Schutt
@ 2013-04-04 15:52 ` Jim Schutt
0 siblings, 0 replies; 12+ messages in thread
From: Jim Schutt @ 2013-04-04 15:52 UTC (permalink / raw)
To: Gregory Farnum; +Cc: Sage Weil, Joao Eduardo Luis, ceph-devel@vger.kernel.org
On 04/04/2013 08:15 AM, Jim Schutt wrote:
> On 04/03/2013 04:51 PM, Gregory Farnum wrote:
>> On Wed, Apr 3, 2013 at 3:40 PM, Jim Schutt <jaschut@sandia.gov> wrote:
>>> On 04/03/2013 12:25 PM, Sage Weil wrote:
>>>>>>>> Sorry, guess I forgot some of the history since this piece at least is
>>>>>>>> resolved now. I'm surprised if 30-second timeouts are causing issues
>>>>>>>> without those overloads you were seeing; have you seen this issue
>>>>>>>> without your high debugging levels and without the bad PG commits (due
>>>>>>>> to debugging)?
>>>>>>
>>>>>> I think so, because that's why I started with higher debugging
>>>>>> levels.
>>>>>>
>>>>>> But, as it turns out, I'm just in the process of returning to my
>>>>>> testing of next, with all my debugging back to 0. So, I'll try
>>>>>> the default timeout of 30 seconds first. If I have trouble starting
>>>>>> up a new file system, I'll turn up the timeout and try again, without
>>>>>> any extra debugging. Either way, I'll let you know what happens.
>>>> I would be curious to hear roughly what value between 30 and 300 is
>>>> sufficient, if you can experiment just a bit. We probably want to adjust
>>>> the default.
>>>>
>>>> Perhaps more importantly, we'll need to look at the performance of the pg
>>>> stat updates on the mon. There is a refactor due in that code that should
>>>> improve life, but it's slated for dumpling.
>>>
>>> OK, here's some results, with all debugging at 0, using current next...
>>>
>>> My testing is for 1 mon + 576 OSDs, 24/host. All my storage cluster hosts
>>> use 10 GbE NICs now. The mon host uses an SSD for the mon data store.
>>> My test procedure is to start 'ceph -w', start all the OSDs, and once
>>> they're all running start the mon. I report the time from starting
>>> the mon to all PGs active+clean.
>>>
>>> # PGs osd mon ack startup notes
>>> timeout time
>>> ------- ------------ -------- -----
>>> 55392 default >30:00 1
>>> 55392 300 18:36 2
>>> 55392 60 >30:00 3
>>> 55392 150 >30:00 4
>>> 55392 240 >30:00 5
>>> 55392 300 >30:00 2,6
>>>
>>> notes:
>>> 1) lots of PGs marked stale, OSDs wrongly marked down
>>> before I gave up on this case
>>> 2) OSDs report lots of slow requests for "pg_notify(...) v4
>>> currently wait for new map"
>>> 3) some OSDs wrongly marked down, OSDs report some slow requests
>>> for "pg_notify(...) v4 currently wait for new map"
>>> before I gave up on this case
>>> 4) appeared to be making progress; then an OSD was marked
>>> out at ~21 minutes; many more marked out before I
>>> gave up on this case
>>> 5) some OSD reports of slow requests for "pg_notify",
>>> some OSDs wrongly marked down, appeared to be making
>>> progress, then stalled; then I gave up on this case
>>> 6) retried this case, appeared to be making progress, but
>>> after ~18 min stalled at 19701 active+clean, 35691 peering
>>> until I gave up
>>>
>>> Hmmm, I didn't really expect the above results. I ran out of
>>> time before attempting an even longer osd mon ack timeout.
>>>
>>> But either we're on the wrong trail, or 300 is not sufficient.
>>> Or, I'm doing something wrong and haven't yet figured out what
>>> it is.
>>>
>>> FWIW, on v0.57 or v0.58 I was testing with one pool at 256K PGs,
>>> and my memory is a new filesystem started up in ~5 minutes. For
>>> that testing I had to increase 'paxos propose interval' to two
>>> or three seconds to keep the monitor writeout rate (as measured
>>> by vmstat) down to a sustained 50-70 MB/s during start-up.
>>>
>>> That was with a 1 GbE NIC in the mon; the reason I upgraded
>>> it was a filesystem with 512K PGs was taking too long to start,
>>> and I thought the mon might be network-limited since it had an
>>> SSD for the mon data store.
>>>
>>> For the testing above I used the default 'paxos propose interval'.
>>> I don't know if it matters, but vmstat sees only a little data
>>> being written on the mon system.
>>
>> That's odd; I'd actually expect to see more going to disk with v0.59
>> than previously. Is vmstat actually looking at disk IO, or might it be
>> missing DirectIO or something? (Not that I remember if LevelDB is
>> using those.)
>
> FWIW, 'dd oflag=direct' shows up in vmstat. But, I don't know if
> that is relevant to what LevelDB might be doing...
>
>> However, I think you might want to increase your paxos propose
>> interval to where it was before — your OSDs are having trouble keeping
>> up with the number of maps that are being generated, based on the fact
>> that you have a lot of pg notifies stuck waiting for newer maps.
>
> OK, I'll try that. But to clarify, in the past the default paxos
> propose interval was good up to 128K PGs, or so.
Hmmmph. With 'paxos propose interval = 3' and 'osd mon ack timeout = 300',
and no other debugging enabled, I still didn't get a new filesystem to
start up in <30 minutes. I did modify the "mon hasn't acked PGStats"
message to be debug level 0, and saw a few of those, but saw no slow
pg_notify requests reported.
I'm really puzzled by that 'osd mon ack timeout = 300' case I reported
above, which started in ~18 minutes. So far that's the only successful
start I've gotten with 30 min since I turned off debugging....
-- Jim
>
> Thanks -- Jim
>
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>
>>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Trouble getting a new file system to start, for v0.59 and newer
2013-04-03 22:40 ` Jim Schutt
2013-04-03 22:51 ` Gregory Farnum
@ 2013-04-04 19:14 ` Jim Schutt
1 sibling, 0 replies; 12+ messages in thread
From: Jim Schutt @ 2013-04-04 19:14 UTC (permalink / raw)
To: Sage Weil; +Cc: Gregory Farnum, Joao Eduardo Luis, ceph-devel@vger.kernel.org
On 04/03/2013 04:40 PM, Jim Schutt wrote:
> On 04/03/2013 12:25 PM, Sage Weil wrote:
>>>>>> >>> > > Sorry, guess I forgot some of the history since this piece at least is
>>>>>> >>> > > resolved now. I'm surprised if 30-second timeouts are causing issues
>>>>>> >>> > > without those overloads you were seeing; have you seen this issue
>>>>>> >>> > > without your high debugging levels and without the bad PG commits (due
>>>>>> >>> > > to debugging)?
>>>> >> >
>>>> >> > I think so, because that's why I started with higher debugging
>>>> >> > levels.
>>>> >> >
>>>> >> > But, as it turns out, I'm just in the process of returning to my
>>>> >> > testing of next, with all my debugging back to 0. So, I'll try
>>>> >> > the default timeout of 30 seconds first. If I have trouble starting
>>>> >> > up a new file system, I'll turn up the timeout and try again, without
>>>> >> > any extra debugging. Either way, I'll let you know what happens.
>> > I would be curious to hear roughly what value between 30 and 300 is
>> > sufficient, if you can experiment just a bit. We probably want to adjust
>> > the default.
>> >
>> > Perhaps more importantly, we'll need to look at the performance of the pg
>> > stat updates on the mon. There is a refactor due in that code that should
>> > improve life, but it's slated for dumpling.
> OK, here's some results, with all debugging at 0, using current next...
>
> My testing is for 1 mon + 576 OSDs, 24/host. All my storage cluster hosts
> use 10 GbE NICs now. The mon host uses an SSD for the mon data store.
> My test procedure is to start 'ceph -w', start all the OSDs, and once
> they're all running start the mon. I report the time from starting
> the mon to all PGs active+clean.
As a sanity check, to be sure I wasn't doing something differently
now than I remember doing before, I re-ran this test for v0.57,
v0.58, and v0.59, using default 'osd mon ack timeout', default
'paxos propose interval', and no debugging:
55,392 PGs 221,568 PGs
v0.57 1m 07s 9m 42s
v0.58 1m 04s 11m 44s
v0.59 >30m not attempted
The v0.57/v0.58 runs showed no signs of stress, e.g. no
slow op reports, etc.
The v0.59 run behaved as I previously reported, i.e.,
lots of stale peers, OSDs wrongly marked down, etc.,
before I gave up on it.
-- Jim
^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2013-04-04 19:14 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-04-03 15:46 Trouble getting a new file system to start, for v0.59 and newer Jim Schutt
2013-04-03 15:58 ` Sage Weil
2013-04-03 17:09 ` Jim Schutt
2013-04-03 17:14 ` Gregory Farnum
2013-04-03 17:49 ` Gregory Farnum
2013-04-03 17:58 ` Jim Schutt
2013-04-03 18:25 ` Sage Weil
2013-04-03 22:40 ` Jim Schutt
2013-04-03 22:51 ` Gregory Farnum
2013-04-04 14:15 ` Jim Schutt
2013-04-04 15:52 ` Jim Schutt
2013-04-04 19:14 ` Jim Schutt
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.