All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Jim Schutt" <jaschut@sandia.gov>
To: Gregory Farnum <gregory.farnum@dreamhost.com>
Cc: ceph-devel@vger.kernel.org
Subject: Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
Date: Fri, 3 Feb 2012 09:17:43 -0700	[thread overview]
Message-ID: <4F2C08A7.2050507@sandia.gov> (raw)
In-Reply-To: <CAF3hT9AcxxNtscFczP8fShSsaBm_4zhLQZ2F5c7h1YswVaXHkA@mail.gmail.com>

On 02/02/2012 05:28 PM, Gregory Farnum wrote:
> On Thu, Feb 2, 2012 at 12:22 PM, Jim Schutt<jaschut@sandia.gov>  wrote:
>> I found 0 instances of "waiting for commit" in all my OSD logs for my last
>> run.
>>
>> So I never waited on the journal?
>
> Looks like it. Interesting.
>
>
>>>> So far I'm looking at two behaviours I've noticed that seem anomalous to
>>>> me.
>>>>
>>>> One is that I instrumented ms_dispatch(), and I see it take
>>>> a half-second or more several hundred times, out of several
>>>> thousand messages.  Is that expected?
>>>
>>>
>>> How did you instrument it? If you wrapped the whole function it's
>>> possible that those longer runs are actually chewing through several
>>> messages that had to get waitlisted for some reason previously.
>>> (That's the call to do_waiters().)
>>
>>
>> Yep, I wrapped the whole function, and also instrumented taking osd_lock
>> while I was there.  About half the time that ms_dispatch() takes more than
>> 0.5 seconds, taking osd_lock is responsible for the delay.  There's two
>> dispatch threads, one for ops and one for rep_ops, right?  So one's
>> waiting on the other?
>
> There's just one main dispatcher; no split for the ops and rep_ops .
> The reason for that "dispatch_running" is that if there are requests
> waiting then the tick() function will run through them if the
> messenger dispatch thread is currently idle.
> But it is possible for the Messenger to try and dispatch, and for that
> to be blocked while some amount of (usually trivial) work is being
> done by a different thread, yes. I don't think we've ever observed it
> being a problem for anything other than updating OSD maps, though...

Ah, OK.

I guess I was confused by my log output, e.g.:

osd.0.log:2277569:2012-02-02 09:23:41.666420 7fe5fe65e700 osd.0 31 ms_dispatch ET 0.990204 osd_lock ET 0.001438 msg 0xbe19400
osd.0.log:2277697:2012-02-02 09:23:41.669949 7fe5fee5f700 osd.0 31 ms_dispatch ET 0.993136 osd_lock ET 0.992708 msg 0x13afd680

I thought 7fe5fe65e700 and 7fe5fee5f700 identified the threads.

I need to go study that code some more....

>
>
>>>> Another is that once a message receive starts, I see ~50 messages
>>>> that take tens of seconds to receive, when the nominal receive time is
>>>> a half-second or less.  I'm in the process of tooling up to collect
>>>> tcpdump data on all my clients to try to catch what is going on with
>>>> that.
>>>
>>>
>>> Again, how are you instrumenting that?
>>
>>
>> I post-process the logs, looking at the time difference between
>> "reader got .* policy throttler" and "reader got .* osd_op(client".
>
> I guess the logging output must have changed a bit at some pointer (or
> was that one of your patches?). master has "reader wants" not "reader
> got" for the policy throttler. (Just got a little confused when
> checking the code.)

Yep, I added an extra message to make post-processing logs easier, sorry.

>
>> When I find a candidate message, I grep the log for just that reader thread,
>> and see, e.g., this:
>>
>> osd.0.log:1280693:2012-02-02 09:17:57.704508 7fe5c9099700 --
>> 172.17.131.32:6800/14974>>  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
>> pgs=49 cs=1 l=1).reader got 2670720 from policy throttler 48809510/50000000
>> seq 828/828 waiters 157/149 for src client.4301 tid=247
>> osd.0.log:1280694:2012-02-02 09:17:57.704525 7fe5c9099700 --
>> 172.17.131.32:6800/14974>>  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
>> pgs=49 cs=1 l=1).reader wants 2670720 from dispatch throttler
>> 41944358/66666666
>> osd.0.log:1280701:2012-02-02 09:17:57.704654 7fe5c9099700 --
>> 172.17.131.32:6800/14974>>  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
>> pgs=49 cs=1 l=1).reader got front 128
>> osd.0.log:1280705:2012-02-02 09:17:57.704752 7fe5c9099700 --
>> 172.17.131.32:6800/14974>>  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
>> pgs=49 cs=1 l=1).reader allocating new rx buffer at offset 0
>> osd.0.log:1280710:2012-02-02 09:17:57.704873 7fe5c9099700 --
>> 172.17.131.32:6800/14974>>  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
>> pgs=49 cs=1 l=1).reader reading nonblocking into 0x11922000 len 2670592
>> osd.0.log:1559767:2012-02-02 09:19:40.726589 7fe5c9099700 --
>> 172.17.131.32:6800/14974>>  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
>> pgs=49 cs=1 l=1).reader reading nonblocking into 0x11a6a5cc len 1325620
>> osd.0.log:1561092:2012-02-02 09:19:40.927559 7fe5c9099700 --
>> 172.17.131.32:6800/14974>>  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
>> pgs=49 cs=1 l=1).reader reading nonblocking into 0x11a6ab74 len 1324172
>>
>> Note the ~2 minute delay (and ~300,000 lines of logging) between the first
>> and second reads.
>>
>> During that time 129 sockets were processed - what makes sd=215 special?
>
> Hrm. Well, you can try turning up the messenger debugging to 30 and
> taking advantage of the "reader reading" "reader read" pair right
> around tcp_read_nonblocking.

OK, I'll give that a try as well, thanks.
>
>> I've added tracepoints in my client kernel try_write(), and nothing seems
>> unusual (that's with running the patch to ceph_write_space() I posted
>> earlier):
>>
>>      kworker/0:2-1790  [000]  1543.200887: ceph_try_write_msg_done: peer osd0
>> tid 179 seq 3 sent 4194304
>>      kworker/0:2-1790  [000]  1543.200901: ceph_prepare_write_msg: peer osd0
>> tid 207 seq 4 sent 0
> *snip*
>>      kworker/0:2-1790  [000]  1569.078614: ceph_try_write_msg_done: peer osd0
>> tid 207 seq 4 sent 4194304
>>
>> There's a 25 second gap at 1543.236256, but nothing like the
>> 100 second gap in the reader.
>>
>> Hence, tcpdump seems like a good idea?
>
> You do bring us interesting problems! Let us know what info you come up with.
>
> Oh, and I keep forgetting to ask: what does the write workload look
> like? At first I assumed this was a CephFS workload, but given that
> you're changing max message sizes and have half-second writes you're
> probably doing something else?

I'm just using "pdsh -f <number of clients> -w <client list>"
to start up a "dd conv=fdatasync" on each client, roughly
simultaneously.

I think the short messages are coming from writeback control.
I've got the writeback tracepoints enabled, and most of the time
I see things like this:

tc85.trace.log:166469:    flush-ceph-1-11910 [000]  1787.028175: writeback_single_inode: bdi ceph-1: ino=1099511712863 state=I_DIRTY_PAGES dirtied_when=4296424688 age=30 index=513024 to_write=1024 wrote=1024
tc85.trace.log:166474:    flush-ceph-1-11910 [000]  1787.028889: writeback_single_inode: bdi ceph-1: ino=1099511712863 state=I_DIRTY_PAGES dirtied_when=4296424688 age=30 index=514048 to_write=1024 wrote=1024

But occasionally I see this sort of thing:

tc85.trace.log:22410:    flush-ceph-1-11910 [001]  1546.957999: writeback_single_inode: bdi ceph-1: ino=1099511712863 state=I_DIRTY_PAGES dirtied_when=4296214116 age=0 index=295936 to_write=11264 wrote=11264
tc85.trace.log:29383:    flush-ceph-1-11910 [001]  1547.327652: writeback_single_inode: bdi ceph-1: ino=1099511712863 state=I_DIRTY_PAGES dirtied_when=4296214116 age=0 index=307200 to_write=11264 wrote=11264
tc85.trace.log:37048:    flush-ceph-1-11910 [001]  1548.861577: writeback_single_inode: bdi ceph-1: ino=1099511712863 state=I_DIRTY_PAGES dirtied_when=4296214116 age=2 index=316416 to_write=9216 wrote=9216
tc85.trace.log:42864:    flush-ceph-1-11910 [000]  1550.023496: writeback_single_inode: bdi ceph-1: ino=1099511712863 state=I_DIRTY_PAGES dirtied_when=4296214116 age=3 index=323584 to_write=7168 wrote=7168
tc85.trace.log:47626:    flush-ceph-1-11910 [000]  1550.976374: writeback_single_inode: bdi ceph-1: ino=1099511712863 state=I_DIRTY_PAGES dirtied_when=4296214116 age=4 index=329728 to_write=6144 wrote=6144
tc85.trace.log:51607:    flush-ceph-1-11910 [001]  1551.781108: writeback_single_inode: bdi ceph-1: ino=1099511712863 state=I_DIRTY_PAGES dirtied_when=4296214116 age=5 index=334848 to_write=5120 wrote=5120
tc85.trace.log:51998:    flush-ceph-1-11910 [001]  1551.860104: writeback_single_inode: bdi ceph-1: ino=1099511712863 state=I_DIRTY_PAGES dirtied_when=4296214116 age=5 index=339968 to_write=5120 wrote=5120
tc85.trace.log:52018:    flush-ceph-1-11910 [001]  1551.863599: writeback_single_inode: bdi ceph-1: ino=1099511712863 state=I_DIRTY_PAGES dirtied_when=4296214116 age=5 index=345088 to_write=5120 wrote=5120
tc85.trace.log:52034:    flush-ceph-1-11910 [001]  1551.866372: writeback_single_inode: bdi ceph-1: ino=1099511712863 state=I_DIRTY_PAGES dirtied_when=4296214116 age=5 index=350208 to_write=5120 wrote=5120
tc85.trace.log:52044:    flush-ceph-1-11910 [001]  1551.866767: writeback_single_inode: bdi ceph-1: ino=1099511712863 state=I_DIRTY_PAGES dirtied_when=4296214116 age=5 index=0 to_write=5120 wrote=648
tc85.trace.log:69705:    flush-ceph-1-11910 [000]  1576.878034: writeback_single_inode: bdi ceph-1: ino=1099511712863 state=I_DIRTY_PAGES dirtied_when=4296214116 age=30 index=352256 to_write=1024 wrote=1400
tc85.trace.log:69830:    flush-ceph-1-11910 [000]  1576.892907: writeback_single_inode: bdi ceph-1: ino=1099511712863 state= dirtied_when=4296214116 age=30 index=0 to_write=1024 wrote=576
tc85.trace.log:81609:    flush-ceph-1-11910 [001]  1606.907407: writeback_single_inode: bdi ceph-1: ino=1099511712863 state=I_DIRTY_PAGES dirtied_when=4296244466 age=30 index=378880 to_write=1024 wrote=1472
tc85.trace.log:81678:    flush-ceph-1-11910 [001]  1606.916107: writeback_single_inode: bdi ceph-1: ino=1099511712863 state= dirtied_when=4296244466 age=30 index=0 to_write=1024 wrote=831
tc85.trace.log:96729:    flush-ceph-1-11910 [000]  1636.918264: writeback_single_inode: bdi ceph-1: ino=1099511712863 state=I_DIRTY_PAGES dirtied_when=4296274503 age=30 index=393216 to_write=1024 wrote=1217
tc85.trace.log:96839:    flush-ceph-1-11910 [000]  1636.931363: writeback_single_inode: bdi ceph-1: ino=1099511712863 state= dirtied_when=4296274503 age=30 index=0 to_write=1024 wrote=933
tc85.trace.log:111179:    flush-ceph-1-11910 [001]  1666.932329: writeback_single_inode: bdi ceph-1: ino=1099511712863 state=I_DIRTY_PAGES dirtied_when=4296304541 age=30 index=415744 to_write=1024 wrote=1115
tc85.trace.log:111298:    flush-ceph-1-11910 [001]  1666.945162: writeback_single_inode: bdi ceph-1: ino=1099511712863 state= dirtied_when=4296304541 age=30 index=0 to_write=1024 wrote=941

I eventually want to understand what is happening here.....

BTW, should I post my ceph client tracepoint patches?  I ask because
it's not clear to me they would be useful to anyone but me.

-- Jim

> -Greg
>
>



  reply	other threads:[~2012-02-03 16:18 UTC|newest]

Thread overview: 47+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-02-01 15:54 [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load Jim Schutt
2012-02-01 15:54 ` [RFC PATCH 1/6] msgr: print message sequence number and tid when receiving message envelope Jim Schutt
2012-02-01 15:54 ` [RFC PATCH 2/6] common/Throttle: track sleep/wake sequences in Throttle, report them for policy throttler Jim Schutt
2012-02-01 15:54 ` [RFC PATCH 3/6] common/Throttle: throttle in FIFO order Jim Schutt
2012-02-02 17:53   ` Gregory Farnum
2012-02-02 18:31     ` Jim Schutt
2012-02-02 19:01       ` Gregory Farnum
2012-02-01 15:54 ` [RFC PATCH 4/6] common/Throttle: FIFO throttler doesn't need to signal waiters when max changes Jim Schutt
2012-02-01 15:54 ` [RFC PATCH 5/6] common/Throttle: make get() report number of waiters on entry/exit Jim Schutt
2012-02-01 15:54 ` [RFC PATCH 6/6] msg: log Message interactions with throttler Jim Schutt
2012-02-01 22:33 ` [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load Gregory Farnum
2012-02-02 15:38   ` Jim Schutt
     [not found]   ` <4F29CDAA.408@sandia.gov>
     [not found]     ` <CAF3hT9BZEP_FWS=qt8ivA++aDpPGGFzuD_PtMcvDRS2aDEN+hw@mail.gmail.com>
     [not found]       ` <4F2AABF5.6050803@sandia.gov>
2012-02-02 17:52         ` Gregory Farnum
2012-02-02 19:06           ` [EXTERNAL] " Jim Schutt
2012-02-02 19:15             ` Sage Weil
2012-02-02 19:33               ` Jim Schutt
2012-02-02 19:32             ` Gregory Farnum
2012-02-02 20:22               ` Jim Schutt
2012-02-02 20:31                 ` Jim Schutt
2012-02-03  0:28                 ` [EXTERNAL] " Gregory Farnum
2012-02-03 16:17                   ` Jim Schutt [this message]
2012-02-03 17:06                     ` Gregory Farnum
2012-02-03 23:33                       ` Jim Schutt
     [not found]                         ` <CAC-hyiHSNv_VgLcyVCrJ66HxTGFNBONrmmBddJk5326dLTKgkw@mail.gmail.com>
2012-02-04  0:04                           ` Yehuda Sadeh Weinraub
2012-02-06 16:20                           ` Jim Schutt
2012-02-06 17:22                             ` Yehuda Sadeh Weinraub
2012-02-06 18:20                               ` Jim Schutt
2012-02-06 18:35                                 ` Gregory Farnum
2012-02-09 20:53                                   ` Jim Schutt
2012-02-09 22:40                                     ` sridhar basam
2012-02-09 23:15                                       ` Jim Schutt
2012-02-10  0:34                                         ` Tommi Virtanen
2012-02-10  1:26                                         ` sridhar basam
2012-02-10 15:32                                           ` [EXTERNAL] " Jim Schutt
2012-02-10 17:13                                             ` sridhar basam
2012-02-10 23:09                                               ` Jim Schutt
2012-02-11  0:05                                                 ` sridhar basam
2012-02-13 15:26                                                   ` Jim Schutt
2012-02-03 17:07                     ` Sage Weil
2012-02-24 15:38           ` Jim Schutt
2012-02-24 18:31             ` Tommi Virtanen
2012-02-24 18:38               ` Tommi Virtanen
2013-02-21  0:12             ` Sage Weil
2013-02-26 19:16               ` Jim Schutt
2013-02-26 19:36                 ` Sage Weil
2013-02-28 19:37                   ` Jim Schutt
2013-02-28 21:06                     ` Sage Weil

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4F2C08A7.2050507@sandia.gov \
    --to=jaschut@sandia.gov \
    --cc=ceph-devel@vger.kernel.org \
    --cc=gregory.farnum@dreamhost.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.