All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Jim Schutt" <jaschut@sandia.gov>
To: Sage Weil <sage@newdream.net>
Cc: Gregory Farnum <gregory.farnum@dreamhost.com>,
	ceph-devel@vger.kernel.org
Subject: Re: [EXTERNAL] Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
Date: Thu, 2 Feb 2012 12:33:49 -0700	[thread overview]
Message-ID: <4F2AE51D.6060709@sandia.gov> (raw)
In-Reply-To: <Pine.LNX.4.64.1202021110180.14452@cobra.newdream.net>

On 02/02/2012 12:15 PM, Sage Weil wrote:
> On Thu, 2 Feb 2012, Jim Schutt wrote:
>> On 02/02/2012 10:52 AM, Gregory Farnum wrote:
>>> On Thu, Feb 2, 2012 at 7:29 AM, Jim Schutt<jaschut@sandia.gov>   wrote:
>>>> I'm currently running 24 OSDs/server, one 1TB 7200 RPM SAS drive
>>>> per OSD.  During a test I watch both OSD servers with both
>>>> vmstat and iostat.
>>>>
>>>> During a "good" period, vmstat says the server is sustaining>   2 GB/s
>>>> for multiple tens of seconds.  Since I use replication factor 2, that
>>>> means that server is sustaining>   500 MB/s aggregate client throughput,
>>>> right?  During such a period vmstat also reports ~10% CPU idle.
>>>>
>>>> During a "bad" period, vmstat says the server is doing ~200 MB/s,
>>>> with lots of idle cycles.  It is during these periods that
>>>> messages stuck in the policy throttler build up such long
>>>> wait times.  Sometimes I see really bad periods with aggregate
>>>> throughput per server<   100 MB/s.
>>>>
>>>> The typical pattern I see is that a run starts with tens of seconds
>>>> of aggregate throughput>   2 GB/s.  Then it drops and bounces around
>>>> 500 - 1000 MB/s, with occasional excursions under 100 MB/s.  Then
>>>> it ramps back up near 2 GB/s again.
>>>
>>> Hmm. 100MB/s is awfully low for this theory, but have you tried to
>>> correlate the drops in throughput with the OSD journals running out of
>>> space?
>>
>> A spot check of logs from my last run doesn't seem to have any
>> "journal throttle: waited for" messages during a slowdown.
>> Is that what you mean?
>>
>> During the fast part of I run I see lots of journal messages
>> with this pattern:
>>
>> 2012-02-02 09:16:18.376996 7fe602e67700 journal put_throttle finished 12 ops
>> and 50346596 bytes, now 22 ops and 90041106 bytes
>> 2012-02-02 09:16:18.417507 7fe5eb436700 journal throttle: waited for bytes
>> 2012-02-02 09:16:18.417656 7fe5e742e700 journal throttle: waited for bytes
>> 2012-02-02 09:16:18.417756 7fe5f2444700 journal throttle: waited for bytes
>> 2012-02-02 09:16:18.422157 7fe5ea434700 journal throttle: waited for bytes
>> 2012-02-02 09:16:18.422186 7fe5e9c33700 journal throttle: waited for bytes
>> 2012-02-02 09:16:18.424195 7fe5e642c700 journal throttle: waited for bytes
>> 2012-02-02 09:16:18.427106 7fe5fb456700 journal throttle: waited for bytes
>> 2012-02-02 09:16:18.427139 7fe5f7c4f700 journal throttle: waited for bytes
>> 2012-02-02 09:16:18.427159 7fe5e5c2b700 journal throttle: waited for bytes
>> 2012-02-02 09:16:18.427176 7fe5ee43c700 journal throttle: waited for bytes
>> 2012-02-02 09:16:18.428299 7fe5f744e700 journal throttle: waited for bytes
>> 2012-02-02 09:16:19.297369 7fe602e67700 journal put_throttle finished 12 ops
>> and 50346596 bytes, now 21 ops and 85845571 bytes
>
> It occurs to me that part of the problem may be the current sync
> io behavior in the journal.  It ends up doing really big writes, which
> makes things bursty, and will get stuff blocked up behind the throttler.
> You might try making 'journal max write bytes' smaller?  Hmm, although
> it's currently 10MB, which isn't too bad.  So unless you've changed it
> from the default, that's probably not it.

I have changed it recently, but I was seeing this type of
behaviour before making that change.

FWIW, here's my current non-standard tunings.  I'm using
these because the standard ones work even worse for me
on this test case:

	osd op threads = 48
	filestore queue max ops = 16
	osd client message size cap = 50000000
	client oc max dirty =         50000000
	journal max write bytes =     50000000
	ms dispatch throttle bytes =  66666666
	client oc size =              100000000
	journal queue max bytes =     125000000
	filestore queue max bytes =   125000000
	objector inflight op bytes =  200000000


FWIW, turning down "filestore queue max ops" and turning up "osd op threads"
made the biggest positive impact on my performance levels on this test.

With default values for those I was seeing much worse stalling behaviour.

>
>> which I think means my journal is doing 50 MB/s, right?
>>
>>> I assume from your setup that they're sharing the disk with the
>>> store (although it works either way),
>>
>> I've got a 4 GB journal partition on the outer tracks of the disk.
>
> This is on the same disk as the osd data?  As an experiement, you could
> try putting the journal on a separate disk (half disks for journals, half
> for data).  That's obviously not what you want in the real world, but it
> would be interesting to see if contention for the spindle is responsible
> for this.
>
>> So far I'm looking at two behaviours I've noticed that seem anomalous to me.
>>
>> One is that I instrumented ms_dispatch(), and I see it take
>> a half-second or more several hundred times, out of several
>> thousand messages.  Is that expected?
>
> I don't think so, but it could happen if the cpu/memory contention, or in
> the case of osdmap updates where we block on io in that thread.

Hmmm.  Would those go into the filestore op queue?

I didn't think to check ms_dispatch() ET until after I had tuned
down "filestore queue max ops".  Is it worth checking this?

>
>> Another is that once a message receive starts, I see ~50 messages
>> that take tens of seconds to receive, when the nominal receive time is
>> a half-second or less.  I'm in the process of tooling up to collect
>> tcpdump data on all my clients to try to catch what is going on with that.
>>
>> Any other ideas on what to look for would be greatly appreciated.
>
> I'd rule out the journal+data on same disk as the source of pain first.
> If that's what's going on, we can take a closer look at speficically how
> to make it behave better!

OK, I'll try that next.

Thanks -- Jim

>
> sage
>
>



  reply	other threads:[~2012-02-02 19:34 UTC|newest]

Thread overview: 47+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-02-01 15:54 [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load Jim Schutt
2012-02-01 15:54 ` [RFC PATCH 1/6] msgr: print message sequence number and tid when receiving message envelope Jim Schutt
2012-02-01 15:54 ` [RFC PATCH 2/6] common/Throttle: track sleep/wake sequences in Throttle, report them for policy throttler Jim Schutt
2012-02-01 15:54 ` [RFC PATCH 3/6] common/Throttle: throttle in FIFO order Jim Schutt
2012-02-02 17:53   ` Gregory Farnum
2012-02-02 18:31     ` Jim Schutt
2012-02-02 19:01       ` Gregory Farnum
2012-02-01 15:54 ` [RFC PATCH 4/6] common/Throttle: FIFO throttler doesn't need to signal waiters when max changes Jim Schutt
2012-02-01 15:54 ` [RFC PATCH 5/6] common/Throttle: make get() report number of waiters on entry/exit Jim Schutt
2012-02-01 15:54 ` [RFC PATCH 6/6] msg: log Message interactions with throttler Jim Schutt
2012-02-01 22:33 ` [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load Gregory Farnum
2012-02-02 15:38   ` Jim Schutt
     [not found]   ` <4F29CDAA.408@sandia.gov>
     [not found]     ` <CAF3hT9BZEP_FWS=qt8ivA++aDpPGGFzuD_PtMcvDRS2aDEN+hw@mail.gmail.com>
     [not found]       ` <4F2AABF5.6050803@sandia.gov>
2012-02-02 17:52         ` Gregory Farnum
2012-02-02 19:06           ` [EXTERNAL] " Jim Schutt
2012-02-02 19:15             ` Sage Weil
2012-02-02 19:33               ` Jim Schutt [this message]
2012-02-02 19:32             ` Gregory Farnum
2012-02-02 20:22               ` Jim Schutt
2012-02-02 20:31                 ` Jim Schutt
2012-02-03  0:28                 ` [EXTERNAL] " Gregory Farnum
2012-02-03 16:17                   ` Jim Schutt
2012-02-03 17:06                     ` Gregory Farnum
2012-02-03 23:33                       ` Jim Schutt
     [not found]                         ` <CAC-hyiHSNv_VgLcyVCrJ66HxTGFNBONrmmBddJk5326dLTKgkw@mail.gmail.com>
2012-02-04  0:04                           ` Yehuda Sadeh Weinraub
2012-02-06 16:20                           ` Jim Schutt
2012-02-06 17:22                             ` Yehuda Sadeh Weinraub
2012-02-06 18:20                               ` Jim Schutt
2012-02-06 18:35                                 ` Gregory Farnum
2012-02-09 20:53                                   ` Jim Schutt
2012-02-09 22:40                                     ` sridhar basam
2012-02-09 23:15                                       ` Jim Schutt
2012-02-10  0:34                                         ` Tommi Virtanen
2012-02-10  1:26                                         ` sridhar basam
2012-02-10 15:32                                           ` [EXTERNAL] " Jim Schutt
2012-02-10 17:13                                             ` sridhar basam
2012-02-10 23:09                                               ` Jim Schutt
2012-02-11  0:05                                                 ` sridhar basam
2012-02-13 15:26                                                   ` Jim Schutt
2012-02-03 17:07                     ` Sage Weil
2012-02-24 15:38           ` Jim Schutt
2012-02-24 18:31             ` Tommi Virtanen
2012-02-24 18:38               ` Tommi Virtanen
2013-02-21  0:12             ` Sage Weil
2013-02-26 19:16               ` Jim Schutt
2013-02-26 19:36                 ` Sage Weil
2013-02-28 19:37                   ` Jim Schutt
2013-02-28 21:06                     ` Sage Weil

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4F2AE51D.6060709@sandia.gov \
    --to=jaschut@sandia.gov \
    --cc=ceph-devel@vger.kernel.org \
    --cc=gregory.farnum@dreamhost.com \
    --cc=sage@newdream.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.