From: "Jim Schutt" <jaschut@sandia.gov>
To: Gregory Farnum <gregory.farnum@dreamhost.com>
Cc: ceph-devel@vger.kernel.org, sri@basam.org
Subject: Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
Date: Fri, 24 Feb 2012 08:38:11 -0700
Message-ID: <4F47AEE3.5080305@sandia.gov>
In-Reply-To: <CAF3hT9BNc4n4HBNEqsf+d6-Rjv7TC8nJ1VponJCBVpLB8=_F5Q@mail.gmail.com>
On 02/02/2012 10:52 AM, Gregory Farnum wrote:
> On Thu, Feb 2, 2012 at 7:29 AM, Jim Schutt <jaschut@sandia.gov> wrote:
>> I'm currently running 24 OSDs/server, one 1TB 7200 RPM SAS drive
>> per OSD. During a test I watch both OSD servers with both
>> vmstat and iostat.
>>
>> During a "good" period, vmstat says the server is sustaining > 2 GB/s
>> for multiple tens of seconds. Since I use replication factor 2, that
>> means that server is sustaining > 500 MB/s aggregate client throughput,
>> right? During such a period vmstat also reports ~10% CPU idle.
>>
>> During a "bad" period, vmstat says the server is doing ~200 MB/s,
>> with lots of idle cycles. It is during these periods that
>> messages stuck in the policy throttler build up such long
>> wait times. Sometimes I see really bad periods with aggregate
>> throughput per server < 100 MB/s.
>>
>> The typical pattern I see is that a run starts with tens of seconds
>> of aggregate throughput > 2 GB/s. Then it drops and bounces around
>> 500 - 1000 MB/s, with occasional excursions under 100 MB/s. Then
>> it ramps back up near 2 GB/s again.
>
> Hmm. 100MB/s is awfully low for this theory, but have you tried to
> correlate the drops in throughput with the OSD journals running out of
> space? I assume from your setup that they're sharing the disk with the
> store (although it works either way), and your description makes me
> think that throughput is initially constrained by sequential journal
> writes but then the journal runs out of space and the OSD has to wait
> for the main store to catch up (with random IO), and that sends the IO
> patterns all to hell. (If you can say that random 4MB IOs are
> hellish.)
> I'm also curious about memory usage as a possible explanation for the
> more dramatic drops.

I've finally figured out what is going on with this behaviour.
Looking at memory usage was the right track.

It turns out to be an unfortunate interaction between the
number of OSDs/server, the number of clients, TCP socket buffer
autotuning, the policy throttler, and the limit on the total
memory used by the TCP stack (net/ipv4/tcp_mem sysctl).

What happens is that for throttled reader threads, the
TCP stack will continue to receive data as long as there
is socket buffer space available and the sender has data to send.
As each reader thread receives successive messages, TCP socket
buffer autotuning increases the size of its socket buffer.
Eventually, due to the number of OSDs per server and the number
of clients trying to write, all the memory that net/ipv4/tcp_mem
allows the TCP stack to use is consumed by the socket buffers
of throttled reader threads.

When this happens, TCP processing is affected to the point
that the TCP stack cannot send ACKs on behalf of the reader
threads that aren't throttled. At that point the OSD stalls
until the TCP retransmit count on some connection is exceeded,
causing that connection to be reset.
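
The arithmetic behind that exhaustion can be sketched roughly as
follows. This is only an illustration, not something from the patch
series: the helper names are hypothetical, the 4096-byte page size is
an assumption (tcp_mem is expressed in pages), and it treats every
throttled socket as holding a completely full receive buffer, which
gives a lower bound on the number of sockets involved rather than a
prediction.

```python
def parse_tcp_mem(text, page_size=4096):
    """Convert the 'low pressure high' page counts from tcp_mem into bytes.

    page_size=4096 is an assumption; check the real page size on the host.
    """
    low, pressure, high = (int(field) * page_size for field in text.split())
    return {"low": low, "pressure": pressure, "high": high}


def parse_tcp_rmem(text):
    """tcp_rmem holds 'min default max' per-socket receive buffer sizes in bytes."""
    minimum, default, maximum = (int(field) for field in text.split())
    return {"min": minimum, "default": default, "max": maximum}


def sockets_to_exhaust(limit_bytes, per_socket_bytes):
    """How many completely full receive buffers fit under the TCP memory limit."""
    return limit_bytes // per_socket_bytes


def read_sysctl(name):
    """Read a value from /proc/sys/net/ipv4/ (Linux only)."""
    with open("/proc/sys/net/ipv4/" + name) as f:
        return f.read()
```

On an OSD server one could compare
sockets_to_exhaust(parse_tcp_mem(read_sysctl("tcp_mem"))["high"],
parse_tcp_rmem(read_sysctl("tcp_rmem"))["max"]) against the number of
client connections that can be throttled at once. If the second number
is larger, autotuned buffers alone may be able to use up everything
tcp_mem allows.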

Since my OSD servers don't run anything else, the simplest
solution for me is to turn off socket buffer autotuning
(net/ipv4/tcp_moderate_rcvbuf), and set the default socket
buffer size to something reasonable. 256k seems to be
working well for me right now.

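
A minimal sketch of what such a configuration might look like,
generated as a sysctl.conf fragment. The message does not say which
knob was used for the default buffer size, so this assumes the middle
("default") field of net.ipv4.tcp_rmem, and it reads "256k" as
262144 bytes; both are assumptions, and the function names are
hypothetical.

```python
def fixed_rcvbuf_settings(current_tcp_rmem, default_bytes=256 * 1024):
    """Build sysctl settings that disable receive-buffer autotuning.

    current_tcp_rmem is the existing 'min default max' string; only the
    default field is replaced (assumed to be the knob meant by
    "default socket buffer size").
    """
    minimum, _old_default, maximum = (int(f) for f in current_tcp_rmem.split())
    if default_bytes < minimum:
        raise ValueError("default must not be smaller than the minimum")
    maximum = max(maximum, default_bytes)  # keep min <= default <= max
    return {
        "net.ipv4.tcp_moderate_rcvbuf": "0",  # turn autotuning off
        "net.ipv4.tcp_rmem": f"{minimum} {default_bytes} {maximum}",
    }


def render_sysctl_conf(settings):
    """Render the settings as 'key = value' lines for a sysctl.conf file."""
    return "\n".join(f"{key} = {value}" for key, value in settings.items()) + "\n"
```

The rendered text could be written to a file under /etc/sysctl.d/ and
loaded with sysctl -p. Whether 256k suits a given cluster depends on
its OSD and client counts, so treat the value as a starting point.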
-- Jim
> -Greg
>
>