From: "Jim Schutt" <jaschut@sandia.gov>
To: Sage Weil <sage@newdream.net>
Cc: Gregory Farnum <gregory.farnum@dreamhost.com>,
"ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
Subject: Re: cosd multi-second stalls cause "wrongly marked me down"
Date: Fri, 8 Apr 2011 16:11:03 -0600 [thread overview]
Message-ID: <4D9F87F7.6090203@sandia.gov> (raw)
In-Reply-To: <Pine.LNX.4.64.1104081343460.15928@cobra.newdream.net>
Sage Weil wrote:
> On Fri, 8 Apr 2011, Jim Schutt wrote:
>> Hi Sage,
>>
>> Sage Weil wrote:
>>> On Wed, 16 Feb 2011, Jim Schutt wrote:
>>>> On Wed, 2011-02-16 at 14:40 -0700, Gregory Farnum wrote:
>>>>> On Wednesday, February 16, 2011 at 1:25 PM, Jim Schutt wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I've been testing v0.24.3 w/ 64 clients against
>>>>>> 1 mon, 1 mds, 96 osds. Under heavy write load I
>>>>>> see:
>>>>>> [WRN] map e7 wrongly marked me down or wrong addr
>>>>>>
>>>>>> I was able to sort through the logs and discover that when
>>>>>> this happens I have large gaps (10 seconds or more) in osd
>>>>>> heatbeat processing. In those heartbeat gaps I've discovered
>>>>>> long periods (5-15 seconds) where an osd logs nothing, even
>>>>>> though I am running with debug osd/filestore/journal = 20.
>>>>>>
>>>>>> Is this a known issue?
>>>>> You're running on btrfs?
>>>> Yep.
>>> Are the cosd log files on the same btrfs volume as the btrfs data, or
>>> elsewhere? The heartbeat thread takes some pains to avoid any locks that
>>> may be contented and do avoid any disk io, so in theory a btrfs stall
>>> shouldn't affect anything. We may have missed something.. do you have a log
>>> showing this in action?
>> In the end, after all the various things I've tried, I think
>> that the root cause of this is relatively simple: I don't
>> have enough CPU cycles available on my servers to do the
>> amount of OSD processing required to service my client
>> load, given the number of OSDs per server I'm running.
>>
>> With too much work and not enough cycles to do it, the
>> one real-time component of Ceph, heartbeat processing,
>> eventually must miss its deadline (no heartbeat "observed"
>> in osd_heartbeat_grace seconds), since it requires work
>> done by components (messengers, memory allocation system)
>> that don't provide real-time guarantees.
>>
>> All of my experiences on this make perfect sense when
>> viewed from this perspective.
>>
>> For example, when working with tcmalloc, I learned I
>> could compile it with CXXFLAGS=-DTCMALLOC_LARGE_PAGES,
>> which causes tcmalloc to allow objects up to 256k in
>> its thread caches, rather than the default 32k. So I
>> used that in combination with a 256k stripe width, on
>> the theory that deallocating messages would mostly only
>> interact with the thread cache, but it didn't help.
>>
>> When looking at thread stacks generated by my
>> Mutex::LockOrAbort trick with a 5 sec wait to acquire
>> the pipe_lock, I often saw threads waiting on the
>> DoutLocker mutex. Since lots of Ceph debugging output
>> happens with other locks being held, debugging might
>> thus slow things down out of proportion to the processing
>> required to generate the log messages. Yet, when I
>> configured no debugging, I saw no improvement; it might
>> be that things got a little worse. This now makes sense
>> to me in light of my above hypothesis about not enough
>> available CPU cycles - there's still too much work to
>> do, even with no cycles spent on debugging output.
>>
>> What I didn't see very often in my thread stacks were
>> stack frames from tcmalloc. This doesn't make sense
>> to me if the memory allocation subsystem is the root
>> cause of my problem, but makes perfect sense if there's
>> not enough CPU cycles: not so much time is spent
>> deallocating memory, so it is caught in the act less
>> often by LockOrAbort.
>>
>> What finally seemed to help at avoiding missed heartbeats
>> in my configuration was the following combination:
>> turning off debugging, running with these throttling paramters:
>> osd client message size cap = 14000000
>> client oc size = 14000000
>> client oc max dirty = 35000000
>> filestore queue max bytes = 35000000
>> journal queue max bytes = 35000000
>> ms dispatch throttle bytes = 14000000
>> objector inflight op bytes = 35000000
>> and using a 512k stripe width.
>>
>> Evidently keeping a relatively small amount of data in
>> flight, in smaller chunks, allowed heartbeat processing to
>> hit its mark more often. But, it only delayed things, it
>> didn't solve the problem. This makes sense to me if the
>> root cause is that I don't have enough CPU cycles available
>> per OSD, because I didn't change the offered load.
>>
>> So, in the short term I guess I need to run fewer cosd
>> instances per server.
>
> There is one other thing to look at, and that's the number of threads used
> by each cosd process. Have you tried setting
>
> osd op threads = 1
>
> (or even 0, although I haven't tested that recently). That will limit the
> number of concurrent IOs in flight to the fs. Setting it to 0 will avoid
> using a thread pool at all and will process the IO in the message dispatch
> thread (though we haven't tested that recently so there may be issues).
I'll try this 2nd, since it's easy.
>
> I would also be interested in seeing a system level profile (oprofile?) to
> see where CPU time is being spent. There are likely low hanging fruit in
> the OSD that would reduce CPU overhead.
This will take me a little while, since I need to learn
about the tools. But since I need to learn about them
anyway, that's a good thing.
>
> I guess the other thing that would help to confirm this is to just halve
> the number of OSDs on your machines in a test and see if the problem goes
> away.
I was going to try this first, exactly because it seems like
a definitive test.
>
>> If my analysis above is correct, do you think anything
>> can be gained by running the heartbeat and heartbeat
>> dispatcher threads as SCHED_RR threads? Since tick() runs
>> heartbeat_check(), that would also need to be SCHED_RR,
>> or the heartbeats could arrive on time, but not checked
>> until it was too late.
>
> That sounds worth trying. I don't care much about the tick() thread,
> though... if the machine is loady and we can't check heartbeats that is at
> least fail-safe. And hopefully other nodes are able to catch the slow
> guy.
The reason I mentioned tick(), though, is that I was hoping to avoid
causing extra work checking PGs (and maybe starting on reduplication,
depending on how long things drag on?), during the interval until an
OSD notices that it was marked down. I.e. I want to avoid triggering
a cascade of erroneous failure diagnoses, because I worry about what
happens if there's a real hardware failure in that window. E.g., I
don't yet understand all the algorithms enough to predict what will
happen if the OSD holding the primary copy of some object is
erroneously marked down, and then a disk for the OSD holding
the replica fails. It seems like there has to be a window there
until the primary OSD is marked up again where read requests on
such objects cannot complete, assuming 2-way replication?
>
> In the meantime, it may also be prudent for us to lower our queue size
> thresholds. The current numbers were all pulled out of a hat (100MB?
> Sure!).
:)
FWIW, I wanted to check my reasoning for "ms dispatch throttle bytes"
with you. I thought I found in the code it was used in the rep op
dispatcher thread, so since I am doing 2-way replication, on average
every OSD should see one rep op for each client write op it sees.
According to that theory, I should set
(ms dispatch throttle bytes) = (replication level - 1)*(osd client message size cap)
Does that sound right to you?
Thanks -- Jim
>
> sage
>
>
next prev parent reply other threads:[~2011-04-08 22:11 UTC|newest]
Thread overview: 94+ messages / expand[flat|nested] mbox.gz Atom feed top
2011-02-16 21:25 cosd multi-second stalls cause "wrongly marked me down" Jim Schutt
2011-02-16 21:37 ` Wido den Hollander
2011-02-16 21:51 ` Jim Schutt
2011-02-16 21:40 ` Gregory Farnum
2011-02-16 21:50 ` Jim Schutt
2011-02-17 0:50 ` Sage Weil
2011-02-17 0:54 ` Sage Weil
2011-02-17 15:46 ` Jim Schutt
2011-02-17 16:11 ` Sage Weil
2011-02-17 23:31 ` Jim Schutt
2011-02-18 7:13 ` Sage Weil
2011-02-18 17:04 ` Jim Schutt
2011-02-18 17:15 ` Gregory Farnum
2011-02-18 18:41 ` Jim Schutt
2011-02-18 19:07 ` Colin McCabe
2011-02-18 20:48 ` Jim Schutt
2011-02-18 20:58 ` Sage Weil
2011-02-18 21:09 ` Jim Schutt
2011-03-09 16:02 ` Jim Schutt
2011-03-09 17:07 ` Gregory Farnum
2011-03-09 18:36 ` Jim Schutt
2011-03-09 19:37 ` Gregory Farnum
2011-03-10 23:09 ` Jim Schutt
2011-03-10 23:21 ` Sage Weil
2011-03-10 23:32 ` Jim Schutt
2011-03-10 23:40 ` Sage Weil
2011-03-11 14:51 ` Jim Schutt
2011-03-11 18:26 ` Jim Schutt
2011-03-11 18:37 ` Jim Schutt
2011-03-11 18:37 ` Sage Weil
2011-03-11 18:51 ` Jim Schutt
2011-03-11 19:09 ` Gregory Farnum
2011-03-11 19:13 ` Yehuda Sadeh Weinraub
2011-03-11 19:17 ` Yehuda Sadeh Weinraub
2011-03-11 19:16 ` Jim Schutt
2011-03-11 21:13 ` Jim Schutt
2011-03-11 21:37 ` Sage Weil
2011-03-11 22:21 ` Jim Schutt
2011-03-11 22:26 ` Jim Schutt
2011-03-11 22:45 ` Sage Weil
2011-03-11 23:29 ` Jim Schutt
2011-03-30 21:26 ` Jim Schutt
2011-03-30 21:55 ` Sage Weil
2011-03-31 14:16 ` Jim Schutt
2011-03-31 16:25 ` Sage Weil
2011-03-31 17:00 ` Jim Schutt
2011-03-31 17:10 ` Jim Schutt
2011-03-31 17:24 ` Sage Weil
2011-03-31 18:08 ` Jim Schutt
2011-03-31 18:41 ` Sage Weil
2011-04-01 22:38 ` Jim Schutt
2011-02-23 17:52 ` Jim Schutt
2011-02-23 18:12 ` Gregory Farnum
2011-02-23 18:54 ` Sage Weil
2011-02-23 19:12 ` Gregory Farnum
2011-02-23 19:23 ` Jim Schutt
2011-02-23 20:27 ` Gregory Farnum
2011-03-02 0:53 ` Sage Weil
2011-03-02 15:21 ` Jim Schutt
2011-03-02 17:10 ` Sage Weil
2011-03-02 20:54 ` Jim Schutt
2011-03-02 21:45 ` Sage Weil
2011-03-02 21:59 ` Jim Schutt
2011-03-02 22:57 ` Jim Schutt
2011-03-02 23:20 ` Gregory Farnum
2011-03-02 23:25 ` Jim Schutt
2011-03-02 23:33 ` Gregory Farnum
2011-03-03 2:26 ` Colin McCabe
2011-03-03 20:03 ` Jim Schutt
2011-03-03 20:47 ` Jim Schutt
2011-03-03 20:55 ` Yehuda Sadeh Weinraub
2011-03-03 21:45 ` Jim Schutt
2011-03-03 22:22 ` Sage Weil
2011-03-03 22:34 ` Jim Schutt
2011-03-03 21:53 ` Colin McCabe
2011-03-03 23:06 ` Jim Schutt
2011-03-03 23:30 ` Colin McCabe
2011-03-03 23:37 ` Jim Schutt
2011-03-03 5:03 ` Sage Weil
2011-03-03 16:35 ` Jim Schutt
2011-03-03 17:28 ` Jim Schutt
2011-03-03 18:04 ` Sage Weil
2011-03-03 18:42 ` Jim Schutt
2011-03-03 18:51 ` Sage Weil
2011-03-03 19:39 ` Jim Schutt
2011-04-08 16:23 ` Jim Schutt
2011-04-08 20:50 ` Sage Weil
2011-04-08 22:11 ` Jim Schutt [this message]
2011-04-08 23:10 ` Colin McCabe
2011-04-11 14:41 ` Jim Schutt
2011-04-11 16:25 ` Sage Weil
2011-04-11 20:14 ` Jim Schutt
2011-04-11 21:18 ` Jim Schutt
2011-04-11 23:23 ` Sage Weil
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4D9F87F7.6090203@sandia.gov \
--to=jaschut@sandia.gov \
--cc=ceph-devel@vger.kernel.org \
--cc=gregory.farnum@dreamhost.com \
--cc=sage@newdream.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.