All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Jim Schutt" <jaschut@sandia.gov>
To: Gregory Farnum <gregory.farnum@dreamhost.com>
Cc: Yehuda Sadeh Weinraub <yehuda.sadeh@dreamhost.com>,
	"ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
Subject: Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
Date: Thu, 9 Feb 2012 13:53:13 -0700	[thread overview]
Message-ID: <4F343239.2010907@sandia.gov> (raw)
In-Reply-To: <CAF3hT9CW7_CF4iT0cY858kgkke7Wu=TK7ULzPhFj-AW9jycyZA@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 8281 bytes --]

On 02/06/2012 11:35 AM, Gregory Farnum wrote:
> On Mon, Feb 6, 2012 at 10:20 AM, Jim Schutt<jaschut@sandia.gov>  wrote:
>> On 02/06/2012 10:22 AM, Yehuda Sadeh Weinraub wrote:
>>>
>>> On Mon, Feb 6, 2012 at 8:20 AM, Jim Schutt<jaschut@sandia.gov>    wrote:
>>
>>
>>>>
>>>> The above suggests to me that the slowdown is a result
>>>> of requests not getting submitted at the same rate as
>>>> when things are running well.
>>>>
>>>
>>> Yeah, it really looks like that. My suggestions wouldn't help there.
>>>
>>> I do see that when things go well the number of writes per device is
>>> capped at ~200 writes per second and the throughput per device is
>>> ~100MB/sec. Is 100MB/sec the expected device throughput?
>>
>>
>> Pretty much, at least for the outer tracks on a drive.  I've seen
>> ~108 MB/s with dd to a block device.  Also, I've got 8 drives per
>> SAS adapter with 6 Gb/s links, so it seems unlikely to me that my
>> disk subsystem is any sort of significant bottleneck.
>
> Well, you might try changing your throttling settings on the OSDs.
> ms_dispatch_throttle_bytes defaults to 100<<20 (100MB) and is used for
> throttling dispatch; osd_max_client_bytes defaults to 500<<20 (500MB)
> and is used to limit the amount of client data in memory (ie; messages
> are included in this throttler for their entire lifetime, not just
> while waiting for dispatch).
>
>

I've made a little progress isolating this.

"osd client message size cap =  5000000" makes the stall
completely reproducible (which also means I can reproduce
on two different network types, ethernet and IPoIB.), and I
am able to generate graphs of throttled/receive/process time
for each request received by an OSD (see attached SVG plot).

Such plots suggest to me my problem is caused by stalled
receives.  Using debug ms = 30 on my OSDs turns up instances
of this:

osd.0.log:4514502:2012-02-08 12:34:39.258276 7f6acec77700 -- 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173 pgs=7 cs=1 l=1).reader wants 4194432 from dispatch throttler 0/25000000
osd.0.log:4514503:2012-02-08 12:34:39.258298 7f6acec77700 -- 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173 pgs=7 cs=1 l=1).reader got front 128
osd.0.log:4514504:2012-02-08 12:34:39.258325 7f6acec77700 -- 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173 pgs=7 cs=1 l=1).reader allocating new rx buffer at offset 0
osd.0.log:4514507:2012-02-08 12:34:39.258423 7f6acec77700 -- 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173 pgs=7 cs=1 l=1).reader reading nonblocking into 0x1656c000 len 4194304
osd.0.log:4514509:2012-02-08 12:34:39.259060 7f6acec77700 -- 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173 pgs=7 cs=1 l=1).reader read 1369231 of 4194304
osd.0.log:4546819:2012-02-08 12:35:35.468156 7f6acec77700 -- 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173 pgs=7 cs=1 l=1).reader reading nonblocking into 0x166ba48f len 2825073
osd.0.log:4546820:2012-02-08 12:35:35.468189 7f6acec77700 -- 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173 pgs=7 cs=1 l=1).reader read 1448 of 2825073

which I take to mean that the reader thread sat in poll() for 56 secs, in
this case.

I was able to correlate such stalls with tcpdump output collected on
clients.  Here's an example from another run:

15:09:37.584600 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23631561, win 65535, options [nop,nop,TS val 1096144 ecr 1100575], length 0
15:09:37.584613 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23631561:23663417, ack 1218, win 20904, options [nop,nop,TS val 1100615 ecr 1096144], length 31856
15:09:37.584655 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23663417:23695273, ack 1218, win 20904, options [nop,nop,TS val 1100615 ecr 1096144], length 31856
15:09:37.624476 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23695273, win 65535, options [nop,nop,TS val 1096184 ecr 1100615], length 0
15:09:37.624489 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23695273:23727129, ack 1218, win 20904, options [nop,nop,TS val 1100655 ecr 1096184], length 31856
15:09:37.624532 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [P.], seq 23727129:23758985, ack 1218, win 20904, options [nop,nop,TS val 1100655 ecr 1096184], length 31856
15:09:37.664454 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23758985, win 65535, options [nop,nop,TS val 1096224 ecr 1100655], length 0
15:09:37.664468 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23758985:23790841, ack 1218, win 20904, options [nop,nop,TS val 1100695 ecr 1096224], length 31856
15:09:37.664506 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23790841:23822697, ack 1218, win 20904, options [nop,nop,TS val 1100695 ecr 1096224], length 31856
15:09:37.706937 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23822697, win 65535, options [nop,nop,TS val 1096266 ecr 1100695], length 0
15:09:37.706950 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23822697:23854553, ack 1218, win 20904, options [nop,nop,TS val 1100738 ecr 1096266], length 31856
15:09:37.706995 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [P.], seq 23854553:23886409, ack 1218, win 20904, options [nop,nop,TS val 1100738 ecr 1096266], length 31856
15:09:37.929946 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1100961 ecr 1096266], length 1448
15:09:38.376961 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1101408 ecr 1096266], length 1448
15:09:39.270947 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1102302 ecr 1096266], length 1448
15:09:41.056943 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1104088 ecr 1096266], length 1448
15:09:44.632946 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1107664 ecr 1096266], length 1448
15:09:51.784947 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1114816 ecr 1096266], length 1448
15:10:06.088945 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1129120 ecr 1096266], length 1448
15:10:34.728951 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1157760 ecr 1096266], length 1448
15:11:31.944946 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1096266], length 1448
15:11:31.945075 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23824145, win 65535, options [nop,nop,TS val 1210496 ecr 1214976], length 0
15:11:31.945091 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23886409:23889305, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210496], length 2896
15:11:31.945178 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23824145, win 65535, options [nop,nop,TS val 1210496 ecr 1214976,nop,nop,sack 1 {23886409:23887857}], length 0
15:11:31.945199 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23824145:23825593, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210496], length 1448
15:11:31.945207 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23825593:23827041, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210496], length 1448
15:11:31.945214 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23827041:23828489, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210496], length 1448

So in this case the client retransmitted for ~2 minutes with no response from
the OSD.  Note that during this time the client was talking to other OSDs on the
same server.

I want to try turning off GSO/GRO on my interfaces, but then
I think I need to post to netdev...

-- Jim

[-- Attachment #2: osd.0.msg-et.svg.bz2 --]
[-- Type: application/x-bzip, Size: 22502 bytes --]

  reply	other threads:[~2012-02-09 20:53 UTC|newest]

Thread overview: 47+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-02-01 15:54 [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load Jim Schutt
2012-02-01 15:54 ` [RFC PATCH 1/6] msgr: print message sequence number and tid when receiving message envelope Jim Schutt
2012-02-01 15:54 ` [RFC PATCH 2/6] common/Throttle: track sleep/wake sequences in Throttle, report them for policy throttler Jim Schutt
2012-02-01 15:54 ` [RFC PATCH 3/6] common/Throttle: throttle in FIFO order Jim Schutt
2012-02-02 17:53   ` Gregory Farnum
2012-02-02 18:31     ` Jim Schutt
2012-02-02 19:01       ` Gregory Farnum
2012-02-01 15:54 ` [RFC PATCH 4/6] common/Throttle: FIFO throttler doesn't need to signal waiters when max changes Jim Schutt
2012-02-01 15:54 ` [RFC PATCH 5/6] common/Throttle: make get() report number of waiters on entry/exit Jim Schutt
2012-02-01 15:54 ` [RFC PATCH 6/6] msg: log Message interactions with throttler Jim Schutt
2012-02-01 22:33 ` [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load Gregory Farnum
2012-02-02 15:38   ` Jim Schutt
     [not found]   ` <4F29CDAA.408@sandia.gov>
     [not found]     ` <CAF3hT9BZEP_FWS=qt8ivA++aDpPGGFzuD_PtMcvDRS2aDEN+hw@mail.gmail.com>
     [not found]       ` <4F2AABF5.6050803@sandia.gov>
2012-02-02 17:52         ` Gregory Farnum
2012-02-02 19:06           ` [EXTERNAL] " Jim Schutt
2012-02-02 19:15             ` Sage Weil
2012-02-02 19:33               ` Jim Schutt
2012-02-02 19:32             ` Gregory Farnum
2012-02-02 20:22               ` Jim Schutt
2012-02-02 20:31                 ` Jim Schutt
2012-02-03  0:28                 ` [EXTERNAL] " Gregory Farnum
2012-02-03 16:17                   ` Jim Schutt
2012-02-03 17:06                     ` Gregory Farnum
2012-02-03 23:33                       ` Jim Schutt
     [not found]                         ` <CAC-hyiHSNv_VgLcyVCrJ66HxTGFNBONrmmBddJk5326dLTKgkw@mail.gmail.com>
2012-02-04  0:04                           ` Yehuda Sadeh Weinraub
2012-02-06 16:20                           ` Jim Schutt
2012-02-06 17:22                             ` Yehuda Sadeh Weinraub
2012-02-06 18:20                               ` Jim Schutt
2012-02-06 18:35                                 ` Gregory Farnum
2012-02-09 20:53                                   ` Jim Schutt [this message]
2012-02-09 22:40                                     ` sridhar basam
2012-02-09 23:15                                       ` Jim Schutt
2012-02-10  0:34                                         ` Tommi Virtanen
2012-02-10  1:26                                         ` sridhar basam
2012-02-10 15:32                                           ` [EXTERNAL] " Jim Schutt
2012-02-10 17:13                                             ` sridhar basam
2012-02-10 23:09                                               ` Jim Schutt
2012-02-11  0:05                                                 ` sridhar basam
2012-02-13 15:26                                                   ` Jim Schutt
2012-02-03 17:07                     ` Sage Weil
2012-02-24 15:38           ` Jim Schutt
2012-02-24 18:31             ` Tommi Virtanen
2012-02-24 18:38               ` Tommi Virtanen
2013-02-21  0:12             ` Sage Weil
2013-02-26 19:16               ` Jim Schutt
2013-02-26 19:36                 ` Sage Weil
2013-02-28 19:37                   ` Jim Schutt
2013-02-28 21:06                     ` Sage Weil

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4F343239.2010907@sandia.gov \
    --to=jaschut@sandia.gov \
    --cc=ceph-devel@vger.kernel.org \
    --cc=gregory.farnum@dreamhost.com \
    --cc=yehuda.sadeh@dreamhost.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.