From: "Jim Schutt" <jaschut@sandia.gov>
To: Gregory Farnum <gregory.farnum@dreamhost.com>
Cc: ceph-devel@vger.kernel.org
Subject: Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
Date: Thu, 2 Feb 2012 13:31:18 -0700 [thread overview]
Message-ID: <4F2AF296.4070808@sandia.gov> (raw)
In-Reply-To: <4F2AF085.6000405@sandia.gov>
On 02/02/2012 01:22 PM, Jim Schutt wrote:
> On 02/02/2012 12:32 PM, Gregory Farnum wrote:
>> On Thu, Feb 2, 2012 at 11:06 AM, Jim Schutt<jaschut@sandia.gov> wrote:
>>> On 02/02/2012 10:52 AM, Gregory Farnum wrote:
>>>>
>>>> On Thu, Feb 2, 2012 at 7:29 AM, Jim Schutt<jaschut@sandia.gov> wrote:
>>>>> The typical pattern I see is that a run starts with tens of seconds
>>>>> of aggregate throughput> 2 GB/s. Then it drops and bounces around
>>>>> 500 - 1000 MB/s, with occasional excursions under 100 MB/s. Then
>>>>> it ramps back up near 2 GB/s again.
>>>>
>>>>
>>>> Hmm. 100MB/s is awfully low for this theory, but have you tried to
>>>> correlate the drops in throughput with the OSD journals running out of
>>>> space?
>>>
>>>
>>> A spot check of logs from my last run doesn't seem to have any
>>> "journal throttle: waited for" messages during a slowdown.
>>> Is that what you mean?
>>
>> I'd expect to see those, yes, but I actually meant the on-disk journal
>> itself getting full. I believe that should result in output like:
>> write_thread_entry full, going to sleep (waiting for commit)
>> ...although I now notice that's a much higher log level (20) than the
>> other messages (1/5).
>
> So I've been running OSDs with
> debug osd = 20
> debug journal = 20 ; local journaling
> debug filestore = 20 ; local object storage
> debug objector = 20
> debug ms = 20
> debug = 1
>
> I found 0 instances of "waiting for commit" in all my OSD logs for my last run.
>
> So I never waited on the journal?
>
>>
>>> During the fast part of I run I see lots of journal messages
>>> with this pattern:
>>>
>>> 2012-02-02 09:16:18.376996 7fe602e67700 journal put_throttle finished 12 ops
>>> and 50346596 bytes, now 22 ops and 90041106 bytes
>>> 2012-02-02 09:16:18.417507 7fe5eb436700 journal throttle: waited for bytes
>>> 2012-02-02 09:16:18.417656 7fe5e742e700 journal throttle: waited for bytes
>>> 2012-02-02 09:16:18.417756 7fe5f2444700 journal throttle: waited for bytes
>>> 2012-02-02 09:16:18.422157 7fe5ea434700 journal throttle: waited for bytes
>>> 2012-02-02 09:16:18.422186 7fe5e9c33700 journal throttle: waited for bytes
>>> 2012-02-02 09:16:18.424195 7fe5e642c700 journal throttle: waited for bytes
>>> 2012-02-02 09:16:18.427106 7fe5fb456700 journal throttle: waited for bytes
>>> 2012-02-02 09:16:18.427139 7fe5f7c4f700 journal throttle: waited for bytes
>>> 2012-02-02 09:16:18.427159 7fe5e5c2b700 journal throttle: waited for bytes
>>> 2012-02-02 09:16:18.427176 7fe5ee43c700 journal throttle: waited for bytes
>>> 2012-02-02 09:16:18.428299 7fe5f744e700 journal throttle: waited for bytes
>>> 2012-02-02 09:16:19.297369 7fe602e67700 journal put_throttle finished 12 ops
>>> and 50346596 bytes, now 21 ops and 85845571 bytes
>>>
>>> which I think means my journal is doing 50 MB/s, right?
>>
>> Generally, yes — although that'll also pop up if the store manages to
>> commit faster than the journal (unlikely). :)
>>
>>>> and your description makes me
>>>> think that throughput is initially constrained by sequential journal
>>>> writes but then the journal runs out of space and the OSD has to wait
>>>> for the main store to catch up (with random IO), and that sends the IO
>>>> patterns all to hell. (If you can say that random 4MB IOs are
>>>> hellish.)
>>>
>>>
>>> iostat 1 during the fast part of a run shows both journal and data
>>> partitions running at 45-50 MB/s. During the slow part of a run
>>> they both show similar but low data rates.
>>
>> All right. That's actually not that surprising; random 4MB writes are
>> pretty nice to a modern drive.
>>
>>>> I'm also curious about memory usage as a possible explanation for the
>>>> more dramatic drops.
>>>
>>> My OSD servers have 48 GB memory. During a run I rarely see less than
>>> 24 GB used by the page cache, with the rest mostly used by anonymous memory.
>>> I don't run with any swap.
>>>
>>> So far I'm looking at two behaviours I've noticed that seem anomalous to me.
>>>
>>> One is that I instrumented ms_dispatch(), and I see it take
>>> a half-second or more several hundred times, out of several
>>> thousand messages. Is that expected?
>>
>> How did you instrument it? If you wrapped the whole function it's
>> possible that those longer runs are actually chewing through several
>> messages that had to get waitlisted for some reason previously.
>> (That's the call to do_waiters().)
>
> Yep, I wrapped the whole function, and also instrumented taking osd_lock
> while I was there. About half the time that ms_dispatch() takes more than
> 0.5 seconds, taking osd_lock is responsible for the delay. There's two
> dispatch threads, one for ops and one for rep_ops, right? So one's
> waiting on the other?
>
>>
>>> Another is that once a message receive starts, I see ~50 messages
>>> that take tens of seconds to receive, when the nominal receive time is
>>> a half-second or less. I'm in the process of tooling up to collect
>>> tcpdump data on all my clients to try to catch what is going on with that.
>>
>> Again, how are you instrumenting that?
>
> I post-process the logs, looking at the time difference between
> "reader got .* policy throttler" and "reader got .* osd_op(client".
>
> When I find a candidate message, I grep the log for just that reader thread,
> and see, e.g., this:
>
> osd.0.log:1280693:2012-02-02 09:17:57.704508 7fe5c9099700 -- 172.17.131.32:6800/14974 >> 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader got 2670720 from policy throttler 48809510/50000000 seq 828/828 waiters 157/149 for src client.4301 tid=247
> osd.0.log:1280694:2012-02-02 09:17:57.704525 7fe5c9099700 -- 172.17.131.32:6800/14974 >> 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader wants 2670720 from dispatch throttler 41944358/66666666
> osd.0.log:1280701:2012-02-02 09:17:57.704654 7fe5c9099700 -- 172.17.131.32:6800/14974 >> 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader got front 128
> osd.0.log:1280705:2012-02-02 09:17:57.704752 7fe5c9099700 -- 172.17.131.32:6800/14974 >> 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader allocating new rx buffer at offset 0
> osd.0.log:1280710:2012-02-02 09:17:57.704873 7fe5c9099700 -- 172.17.131.32:6800/14974 >> 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader reading nonblocking into 0x11922000 len 2670592
> osd.0.log:1559767:2012-02-02 09:19:40.726589 7fe5c9099700 -- 172.17.131.32:6800/14974 >> 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader reading nonblocking into 0x11a6a5cc len 1325620
> osd.0.log:1561092:2012-02-02 09:19:40.927559 7fe5c9099700 -- 172.17.131.32:6800/14974 >> 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader reading nonblocking into 0x11a6ab74 len 1324172
>
> Note the ~2 minute delay (and ~300,000 lines of logging) between the first and second reads.
>
> During that time 129 sockets were processed - what makes sd=215 special?
>
> I've added tracepoints in my client kernel try_write(), and nothing seems
> unusual (that's with running the patch to ceph_write_space() I posted earlier):
>
[snip wrong trace output]
kworker/0:2-1790 [000] 1569.078614: ceph_try_write_msg_done: peer osd0 tid 207 seq 4 sent 4194304
kworker/0:2-1790 [000] 1569.078618: ceph_prepare_write_msg: peer osd0 tid 247 seq 5 sent 0
kworker/0:2-1790 [000] 1569.078621: ceph_try_write_msg: peer osd0 tid 247 seq 5 sent 0
kworker/0:2-1790 [000] 1569.281943: ceph_try_write_msg: peer osd0 tid 247 seq 5 sent 102588
kworker/0:2-1790 [000] 1569.299540: ceph_try_write_msg: peer osd0 tid 247 seq 5 sent 230012
kworker/0:2-1790 [000] 1569.303088: ceph_try_write_msg: peer osd0 tid 247 seq 5 sent 357436
kworker/0:2-1790 [000] 1569.305580: ceph_try_write_msg: peer osd0 tid 247 seq 5 sent 453004
kworker/0:2-1790 [000] 1569.308217: ceph_try_write_msg: peer osd0 tid 247 seq 5 sent 580428
kworker/0:2-1790 [000] 1569.310914: ceph_try_write_msg: peer osd0 tid 247 seq 5 sent 707852
kworker/0:2-1790 [000] 1569.313742: ceph_try_write_msg: peer osd0 tid 247 seq 5 sent 835276
kworker/0:2-1790 [000] 1569.316653: ceph_try_write_msg: peer osd0 tid 247 seq 5 sent 962700
kworker/0:2-1790 [000] 1569.319203: ceph_try_write_msg: peer osd0 tid 247 seq 5 sent 1090124
kworker/0:2-1790 [000] 1569.323786: ceph_try_write_msg: peer osd0 tid 247 seq 5 sent 1217548
kworker/0:2-1790 [000] 1569.325982: ceph_try_write_msg: peer osd0 tid 247 seq 5 sent 1344972
kworker/0:2-1790 [000] 1569.328577: ceph_try_write_msg: peer osd0 tid 247 seq 5 sent 1472396
kworker/0:2-1790 [000] 1573.844837: ceph_handle_reply_msg: peer osd0 tid 179 result 0 flags 0x00000025 (req ffff88018aa2c400)
kworker/0:2-1790 [000] 1573.845487: ceph_try_write_msg: peer osd0 tid 247 seq 5 sent 1599820
flush-ceph-1-11910 [000] 1576.891480: ceph_async_writepages_req: tid 377 osd0 ops 1 0x2201/0x0000/0x0000 pages 1024
kworker/0:2-1790 [000] 1602.214574: ceph_handle_reply_msg: peer osd0 tid 207 result 0 flags 0x00000025 (req ffff88018aa7d400)
kworker/0:2-1790 [000] 1602.215230: ceph_try_write_msg: peer osd0 tid 247 seq 5 sent 1663532
flush-ceph-1-11910 [000] 1636.926835: ceph_async_writepages_req: tid 410 osd0 ops 1 0x2201/0x0000/0x0000 pages 1024
kworker/0:2-1790 [000] 1775.409415: ceph_try_write_msg: peer osd0 tid 247 seq 5 sent 1663532
kworker/0:2-1790 [000] 1775.411643: ceph_try_write_msg: peer osd0 tid 247 seq 5 sent 1790956
kworker/0:2-1790 [000] 1775.414125: ceph_try_write_msg: peer osd0 tid 247 seq 5 sent 1918380
kworker/0:2-1790 [000] 1775.416520: ceph_try_write_msg: peer osd0 tid 247 seq 5 sent 2045804
kworker/0:2-1790 [000] 1775.419163: ceph_try_write_msg: peer osd0 tid 247 seq 5 sent 2173228
kworker/0:2-1790 [000] 1775.421620: ceph_try_write_msg: peer osd0 tid 247 seq 5 sent 2300652
kworker/0:2-1790 [000] 1775.423868: ceph_try_write_msg: peer osd0 tid 247 seq 5 sent 2428076
kworker/0:2-1790 [000] 1775.426260: ceph_try_write_msg: peer osd0 tid 247 seq 5 sent 2555500
kworker/0:2-1790 [000] 1775.426297: ceph_try_write_msg: peer osd0 tid 247 seq 5 sent 2670592
kworker/0:2-1790 [000] 1775.426298: ceph_try_write_msg_done: peer osd0 tid 247 seq 5 sent 2670592
There's a 170 second gap at 1602.215230 - why?
-- Jim
>
> Hence, tcpdump seems like a good idea?
>
> -- Jim
>
>>
>> -Greg
>>
>>
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
next prev parent reply other threads:[~2012-02-02 20:32 UTC|newest]
Thread overview: 47+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-02-01 15:54 [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load Jim Schutt
2012-02-01 15:54 ` [RFC PATCH 1/6] msgr: print message sequence number and tid when receiving message envelope Jim Schutt
2012-02-01 15:54 ` [RFC PATCH 2/6] common/Throttle: track sleep/wake sequences in Throttle, report them for policy throttler Jim Schutt
2012-02-01 15:54 ` [RFC PATCH 3/6] common/Throttle: throttle in FIFO order Jim Schutt
2012-02-02 17:53 ` Gregory Farnum
2012-02-02 18:31 ` Jim Schutt
2012-02-02 19:01 ` Gregory Farnum
2012-02-01 15:54 ` [RFC PATCH 4/6] common/Throttle: FIFO throttler doesn't need to signal waiters when max changes Jim Schutt
2012-02-01 15:54 ` [RFC PATCH 5/6] common/Throttle: make get() report number of waiters on entry/exit Jim Schutt
2012-02-01 15:54 ` [RFC PATCH 6/6] msg: log Message interactions with throttler Jim Schutt
2012-02-01 22:33 ` [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load Gregory Farnum
2012-02-02 15:38 ` Jim Schutt
[not found] ` <4F29CDAA.408@sandia.gov>
[not found] ` <CAF3hT9BZEP_FWS=qt8ivA++aDpPGGFzuD_PtMcvDRS2aDEN+hw@mail.gmail.com>
[not found] ` <4F2AABF5.6050803@sandia.gov>
2012-02-02 17:52 ` Gregory Farnum
2012-02-02 19:06 ` [EXTERNAL] " Jim Schutt
2012-02-02 19:15 ` Sage Weil
2012-02-02 19:33 ` Jim Schutt
2012-02-02 19:32 ` Gregory Farnum
2012-02-02 20:22 ` Jim Schutt
2012-02-02 20:31 ` Jim Schutt [this message]
2012-02-03 0:28 ` Gregory Farnum
2012-02-03 16:17 ` Jim Schutt
2012-02-03 17:06 ` Gregory Farnum
2012-02-03 23:33 ` Jim Schutt
[not found] ` <CAC-hyiHSNv_VgLcyVCrJ66HxTGFNBONrmmBddJk5326dLTKgkw@mail.gmail.com>
2012-02-04 0:04 ` Yehuda Sadeh Weinraub
2012-02-06 16:20 ` Jim Schutt
2012-02-06 17:22 ` Yehuda Sadeh Weinraub
2012-02-06 18:20 ` Jim Schutt
2012-02-06 18:35 ` Gregory Farnum
2012-02-09 20:53 ` Jim Schutt
2012-02-09 22:40 ` sridhar basam
2012-02-09 23:15 ` Jim Schutt
2012-02-10 0:34 ` Tommi Virtanen
2012-02-10 1:26 ` sridhar basam
2012-02-10 15:32 ` [EXTERNAL] " Jim Schutt
2012-02-10 17:13 ` sridhar basam
2012-02-10 23:09 ` Jim Schutt
2012-02-11 0:05 ` sridhar basam
2012-02-13 15:26 ` Jim Schutt
2012-02-03 17:07 ` Sage Weil
2012-02-24 15:38 ` Jim Schutt
2012-02-24 18:31 ` Tommi Virtanen
2012-02-24 18:38 ` Tommi Virtanen
2013-02-21 0:12 ` Sage Weil
2013-02-26 19:16 ` Jim Schutt
2013-02-26 19:36 ` Sage Weil
2013-02-28 19:37 ` Jim Schutt
2013-02-28 21:06 ` Sage Weil
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4F2AF296.4070808@sandia.gov \
--to=jaschut@sandia.gov \
--cc=ceph-devel@vger.kernel.org \
--cc=gregory.farnum@dreamhost.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.