avoiding false detection of down OSDs

* avoiding false detection of down OSDs
@ 2012-07-30 18:46 Gregory Farnum
  2012-07-30 22:47 ` Jim Schutt
  0 siblings, 1 reply; 7+ messages in thread
From: Gregory Farnum @ 2012-07-30 18:46 UTC (permalink / raw)
  To: ceph-devel

As Ceph gets deployed on larger clusters our most common scaling
issues have related to
1) our heartbeat system, and
2) handling the larger numbers of OSDMaps that get generated by
increases in the OSD (failures, boots, etc) and PG count (osd
up-thrus, pg_temp insertions, etc).

Lately we haven't had many issues with heartbeats when the OSDs are
happy, so it looks like the latest respin of the heartbeat code is
going to satisfy us going forward. Fast OSD map generation continues
to be a concern, but with the merge of Sam's new map handling code
recently (which reduced the amount of disk effort required to process
a map and shuffled responsibility out of the main OSD thread and into
the more highly-threaded PGs) it has become significantly less
expensive, and we have a number of implemented and planned changes
(from the short- to the long-term) to continue making it less painful.

However, we've started seeing a new issue at the intersection of these
separate problems: what happens when an OSD slows down because it's
processing too many maps, but continues to operate. In large clusters,
an OSD might go down and come back up with hundreds-to-thousands of
maps to process — often at the same time as other OSDs. We've started
to observe issues during software upgrades where a lot of OSDs come up
together and process so many maps that they run out of memory and
start swapping[1]. This can easily cause them to miss heartbeats long
enough to get marked down — but then they finish map processing, tell
the monitors they *are* alive, and get marked back up. This sequence
can cause so many new maps to generate that it repeats itself on the
new nodes, spreads to other nodes in the cluster, or even causes some
running OSD daemons to get marked out. We've taken to calling this
"OSD thrashing".

It would be great if we could come up with a systemic way to reduce
thrashing, independent from our efforts to reduce the triggering
factors. (For one thing, when only one node is thrashing we probably
want to mark it down to preserve performance, whereas when half the
cluster is thrashing we want to keep them up to reduce cluster-wide
load increases.) A few weeks ago some of us at Inktank had a meeting
to discuss the issue, and I've finally gotten around to writing it up
in this email so that we can ask for input from the wider community!

After discussing several approaches (including scaling heartbeat
intervals as more nodes are marked down, as nodes report being wrongly
marked down, putting caps on the number of nodes that can be auto
marked down and/or out, applying rate limiters to the auto-marks,
etc), we realized that what we really wanted was to do our best to
estimate the chances that an OSD which missed its heartbeat window was
simply laggy rather than being down.
While long-term I'm a proponent of pushing most of this heartbeat
handling logic to the OSDs, in the short term adjustments to the
algorithm are much easier to implement in the monitor (which has a lot
more state on the cluster already local). We came up with a broad
algorithm to estimate the chance that an OSD is laggy instead of down:
first, figure out the probability that the OSD is down based on its
past history, and then figure out that probability for the cluster
that the OSD belongs to.
Basically:
1) Keep track of when an OSD boots if it reports itself as fresh or as
wrongly-marked-down. Maintain the probability that the OSD is actually
down versus laggy based on that data and an exponential decay (more
recent reports matter more), and maintain the length of time the OSD
was laggy for in those cases.
2) When a sufficient number of failure reports come in to mark an OSD
down, additionally compute the laggy probability and laggy interval
for the reporters in aggregate.
3) Adjust the "heartbeat grace" locally on the monitor according to
the following formula:
    adjusted_heartbeat_grace = heartbeat_grace
        + laggy_interval * (1 / laggy_probability)
        + group_laggy_interval * (1 / group_laggy_probability)
4) If we reach the end of that adjusted heartbeat grace, and we have
not received failure cancellations (which already exist; when an OSD
gets a heartbeat from a node it's reported down but which isn't marked
down, the OSD sends a cancellation), then mark the OSD down.
5) When running the out check, adjust the "down to out interval" by
the same ratio we've adjusted the heartbeat grace by.
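
The steps above can be sketched roughly as follows. This is only an illustration in Python, not monitor code: the class name LaggyStats, the function names, and the decay weight are all assumptions made for the example. The grace formula is taken literally from step 3; the guard for a zero probability is an added assumption, since the message does not say what to do when there is no laggy history.

```python
from dataclasses import dataclass


@dataclass
class LaggyStats:
    """Decayed laggy history for one OSD or for a group of reporters.

    The field names and the decay weight are illustrative assumptions.
    """
    laggy_probability: float = 0.0  # decayed share of boots reported as wrongly-marked-down
    laggy_interval: float = 0.0     # decayed length of a laggy episode, in seconds

    def record_boot(self, wrongly_marked_down: bool, down_for: float,
                    weight: float = 0.3) -> None:
        # Step 1: exponentially weighted update, so recent boots matter more.
        sample = 1.0 if wrongly_marked_down else 0.0
        self.laggy_probability = (1 - weight) * self.laggy_probability + weight * sample
        if wrongly_marked_down:
            self.laggy_interval = (1 - weight) * self.laggy_interval + weight * down_for


def adjusted_heartbeat_grace(heartbeat_grace: float, osd: LaggyStats,
                             group: LaggyStats) -> float:
    """Step 3, following the formula as written in the message."""
    grace = heartbeat_grace
    for stats in (osd, group):
        # Assumption: with no laggy history, add nothing (also avoids dividing by zero).
        if stats.laggy_probability > 0:
            grace += stats.laggy_interval * (1 / stats.laggy_probability)
    return grace


def adjusted_down_out_interval(down_out_interval: float, heartbeat_grace: float,
                               adjusted_grace: float) -> float:
    """Step 5: stretch the down-to-out interval by the same ratio as the grace."""
    return down_out_interval * (adjusted_grace / heartbeat_grace)
```

Note that, read literally, the formula gives a larger extension when the laggy probability is smaller. A variant that multiplies by the probability instead would shrink the extension for OSDs that are rarely laggy; which behaviour is intended is not settled by the text above.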

This algorithm has several nice properties:
1) It allows us to independently account for both the probability that
the node is laggy, and for the length of time the node is usually
laggy for.
2) It localizes lagginess by PG relationships — if your Ceph cluster
has multiple pools stored in different locations, lagginess won't
cross those boundaries.
3) It's not too expensive, and by framing it the way we have (in terms
of estimating probabilities) we can shuffle the generic algorithm
around (eg, eventually move these calculations to the reporting OSDs).
There are a couple of things it doesn't do:
1) It doesn't do a good job of noticing that a particular rack is
laggy compared to other racks within the same pool.
2) It's all continuous — there isn't yet any sense of "don't guess
anybody is laggy until we've seen a certain amount of churn over the
last x minutes".
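
For the second limitation, one possible shape for such a gate is a sliding window over recent mark-down events. This is a hypothetical sketch only: the class ChurnGate, its parameters, and the idea of counting mark-downs as the measure of churn are assumptions, not part of the proposal.

```python
from collections import deque


class ChurnGate:
    """Allow laggy guesses only after enough recent mark-downs (hypothetical)."""

    def __init__(self, window_seconds: float, min_events: int):
        self.window_seconds = window_seconds
        self.min_events = min_events
        self._events = deque()  # timestamps of recent mark-down events, oldest first

    def record_markdown(self, now: float) -> None:
        self._events.append(now)

    def laggy_guessing_enabled(self, now: float) -> bool:
        # Drop events that have aged out of the window, then compare the count.
        while self._events and now - self._events[0] > self.window_seconds:
            self._events.popleft()
        return len(self._events) >= self.min_events
```

With the gate closed, the monitor would fall back to the unadjusted heartbeat grace.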

We think that this is a good start and that any necessary
modifications will be pretty easy to add, but if you have other ideas
or critiques we'd love to hear about them!
-Greg

[1]: And we are doing a lot of work to reduce memory consumption, but
while that can delay the problem it can't fix it.
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: avoiding false detection of down OSDs
  2012-07-30 18:46 avoiding false detection of down OSDs Gregory Farnum
@ 2012-07-30 22:47 ` Jim Schutt
  2012-07-31  0:24   ` Gregory Farnum
  0 siblings, 1 reply; 7+ messages in thread
From: Jim Schutt @ 2012-07-30 22:47 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel

Hi Greg,

Thanks for the write-up.  I have a couple questions below.

On 07/30/2012 12:46 PM, Gregory Farnum wrote:
> As Ceph gets deployed on larger clusters our most common scaling
> issues have related to
> 1) our heartbeat system, and
> 2) handling the larger numbers of OSDMaps that get generated by
> increases in the OSD (failures, boots, etc) and PG count (osd
> up-thrus, pg_temp insertions, etc).
>
> Lately we haven't had many issues with heartbeats when the OSDs are
> happy, so it looks like the latest respin of the heartbeat code is
> going to satisfy us going forward. Fast OSD map generation continues
> to be a concern, but with the merge of Sam's new map handling code
> recently (which reduced the amount of disk effort required to process
> a map and shuffled responsibility out of the main OSD thread and into
> the more highly-threaded PGs) it has become significantly less
> expensive, and we have a number of implemented and planned changes
> (from the short- to the long-term) to continue making it less painful.
>
> However, we've started seeing a new issue at the intersection of these
> separate problems: what happens when an OSD slows down because it's
> processing too many maps, but continues to operate. In large clusters,
> an OSD might go down and come back up with hundreds-to-thousands of
> maps to process — often at the same time as other OSDs. We've started
> to observe issues during software upgrades where a lot of OSDs come up
> together and process so many maps that they run out of memory and
> start swapping[1]. This can easily cause them to miss heartbeats long
> enough to get marked down — but then they finish map processing, tell
> the monitors they *are* alive, and get marked back up. This sequence
> can cause so many new maps to generate that it repeats itself on the
> new nodes, spreads to other nodes in the cluster, or even causes some
> running OSD daemons to get marked out. We've taken to calling this
> "OSD thrashing".
>
> It would be great if we could come up with a systemic way to reduce
> thrashing, independent from our efforts to reduce the triggering
> factors. (For one thing, when only one node is thrashing we probably
> want to mark it down to preserve performance, whereas when half the
> cluster is thrashing we want to keep them up to reduce cluster-wide
> load increases.) A few weeks ago some of us at Inktank had a meeting
> to discuss the issue, and I've finally gotten around to writing it up
> in this email so that we can ask for input from the wider community!
>
> After discussing several approaches (including scaling heartbeat
> intervals as more nodes are marked down, as nodes report being wrongly
> marked down, putting caps on the number of nodes that can be auto
> marked down and/or out, applying rate limiters to the auto-marks,
> etc), we realized that what we really wanted was to do our best to
> estimate the chances that an OSD which missed its heartbeat window was
> simply laggy rather than being down.

I don't understand the functional difference between an OSD that
is too busy to process its heartbeat in a timely fashion, and
one that is down.  In either case, it cannot meet its obligations
to its peers.

I understand that wrongly marking an OSD down adds unnecessary map
processing work.  Also, if an OSD is wrongly marked down then any
data that would be written to it while it is marked down will be
written to other OSDs, and will need to be migrated when that OSD
is marked back up.

I don't fully understand what is the impact of not marking down
an OSD that really is dead, particularly if the cluster is under
a heavy write load from many clients.  At the very least, write
requests that have a replica on such an OSD will stall waiting
for an ack that will never come, or a new map, right?

It seems to me that each of the discarded solutions has similar
properties as the favored solution: they address a symptom, rather
than the cause.

Above you mentioned that you are seeing these issues as you scaled
out a storage cluster, but none of the solutions you mentioned
address scaling.  Let's assume your preferred solution handles
this issue perfectly on the biggest cluster anyone has built
today.  What do you predict will happen when that cluster size
is scaled up by a factor of 2, or 10, or 100?

> While long-term I'm a proponent of pushing most of this heartbeat
> handling logic to the OSDs, in the short term adjustments to the
> algorithm are much easier to implement in the monitor (which has a lot
> more state on the cluster already local). We came up with a broad
> algorithm to estimate the chance that an OSD is laggy instead of down:
> first, figure out the probability that the OSD is down based on its
> past history, and then figure out that probability for the cluster
> that the OSD belongs to.
> Basically:
> 1) Keep track of when an OSD boots if it reports itself as fresh or as
> wrongly-marked-down. Maintain the probability that the OSD is actually
> down versus laggy based on that data and an exponential decay (more
> recent reports matter more), and maintain the length of time the OSD
> was laggy for in those cases.
> 2) When a sufficient number of failure reports come in to mark an OSD
> down, additionally compute the laggy probability and laggy interval
> for the reporters in aggregate.
> 3) Adjust the "heartbeat grace" locally on the monitor according to
> the following formula:
>      adjusted_heartbeat_grace = heartbeat_grace
>          + laggy_interval * (1 / laggy_probability)
>          + group_laggy_interval * (1 / group_laggy_probability)
> 4) If we reach the end of that adjusted heartbeat grace, and we have
> not received failure cancellations (which already exist; when an OSD
> gets a heartbeat from a node it's reported down but which isn't marked
> down, the OSD sends a cancellation), then mark the OSD down.
> 5) When running the out check, adjust the "down to out interval" by
> the same ratio we've adjusted the heartbeat grace by.
>
> This algorithm has several nice properties:
> 1) It allows us to independently account for both the probability that
> the node is laggy, and for the length of time the node is usually
> laggy for.

This implies to me you think the root cause of lagginess is
independent of client offered load.  Otherwise, if client offered
load does impact lagginess, then your estimate of the probability
that an OSD is laggy is only useful for as long as your offered load
doesn't change, no?

> 2) It localizes lagginess by PG relationships — if your Ceph cluster
> has multiple pools stored in different locations, lagginess won't
> cross those boundaries.
> 3) It's not too expensive, and by framing it the way we have (in terms
> of estimating probabilities) we can shuffle the generic algorithm
> around (eg, eventually move these calculations to the reporting OSDs).
> There are a couple of things it doesn't do:
> 1) It doesn't do a good job of noticing that a particular rack is
> laggy compared to other racks within the same pool.
> 2) It's all continuous — there isn't yet any sense of "don't guess
> anybody is laggy until we've seen a certain amount of churn over the
> last x minutes".
>
> We think that this is a good start and that any necessary
> modifications will be pretty easy to add, but if you have other ideas
> or critiques we'd love to hear about them!

As I mentioned above, I'm concerned this is addressing
symptoms, rather than root causes.  I'm concerned the
root cause has something to do with how the map processing
work scales with number of OSDs/PGs, and that this will
limit the maximum size of a Ceph storage cluster.

But, if you really just want to not mark down an OSD that is
laggy, I know this will sound simplistic, but I keep thinking
that the OSD knows for itself if it's up, even when the
heartbeat mechanism is backed up.  Couldn't there be some way
to ask an OSD suspected of being down whether it is or not,
separate from the heartbeat mechanism?  I mean, if you're
considering having the monitor ignore OSD down reports for a
while based on some estimate of past behavior, wouldn't it be
better for the monitor to just ask such an OSD, "hey, are you
still there?"  If it gets an immediate "I'm busy, come back later",
extend the grace period; otherwise, mark the OSD down.

Or, maybe have a multicast group that OSDs periodically
announce on - anyone considering marking an OSD down
would look for a recent "I'm alive!" announcement from
the OSD in question, and extend the heartbeat grace period
if it saw one.
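
Both suggestions above reduce to a small decision rule, sketched here in Python. Everything in it is hypothetical: no such probe message or announcement channel is described in this thread, and the names ProbeReply, decide_after_probe, and announcement_is_recent are invented for the illustration.

```python
from enum import Enum
from typing import Optional, Tuple


class ProbeReply(Enum):
    BUSY = "busy"  # the OSD answered promptly: alive, but behind on its work


def decide_after_probe(reply: Optional[ProbeReply], grace: float,
                       extension: float) -> Tuple[bool, float]:
    """Return (mark_down, new_grace); reply is None when the probe got no answer."""
    if reply is ProbeReply.BUSY:
        return False, grace + extension  # extend the grace period
    return True, grace                   # otherwise, mark the OSD down


def announcement_is_recent(last_announce: Optional[float], now: float,
                           max_age: float) -> bool:
    """True if an "I'm alive!" announcement was seen within max_age seconds."""
    return last_announce is not None and now - last_announce <= max_age
```

The open question with either rule is whether a swapping OSD can answer the probe or send the announcement any faster than it can answer a heartbeat.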

-- Jim

> -Greg
>
> [1]: And we are doing a lot of work to reduce memory consumption, but
> while that can delay the problem it can't fix it.




* Re: avoiding false detection of down OSDs
  2012-07-30 22:47 ` Jim Schutt
@ 2012-07-31  0:24   ` Gregory Farnum
  2012-07-31 15:07     ` [EXTERNAL] " Jim Schutt
  0 siblings, 1 reply; 7+ messages in thread
From: Gregory Farnum @ 2012-07-31  0:24 UTC (permalink / raw)
  To: Jim Schutt; +Cc: ceph-devel

On Mon, Jul 30, 2012 at 3:47 PM, Jim Schutt <jaschut@sandia.gov> wrote:
> I don't understand the functional difference between an OSD that
> is too busy to process its heartbeat in a timely fashion, and
> one that is down.  In either case, it cannot meet its obligations
> to its peers.
>
> I understand that wrongly marking an OSD down adds unnecessary map
> processing work.  Also, if an OSD is wrongly marked down then any
> data that would be written to it while it is marked down will be
> written to other OSDs, and will need to be migrated when that OSD
> is marked back up.
Right. I'm also somewhat uncomfortable with this distinction, but
there is a line that matters: if marking the OSD down and back up is
going to cause more delays than leaving the OSD up, then you don't
want to make any changes. There are specific scenarios we're running
into on systems with many hundreds of nodes where this is a problem.

> I don't fully understand what is the impact of not marking down
> an OSD that really is dead, particularly if the cluster is under
> a heavy write load from many clients.  At the very least, write
> requests that have a replica on such an OSD will stall waiting
> for an ack that will never come, or a new map, right?
Yep, that's precisely the effect.

> It seems to me that each of the discarded solutions has similar
> properties as the favored solution: they address a symptom, rather
> than the cause.
>
> Above you mentioned that you are seeing these issues as you scaled
> out a storage cluster, but none of the solutions you mentioned
> address scaling.  Let's assume your preferred solution handles
> this issue perfectly on the biggest cluster anyone has built
> today.  What do you predict will happen when that cluster size
> is scaled up by a factor of 2, or 10, or 100?
Sage should probably describe in more depth what we've seen since he's
looked at it the most, but I can expand on it a little. In argonaut
and earlier versions of Ceph, processing a new OSDMap for an OSD is
very expensive. I don't remember the precise numbers we'd whittled it
down to but it required at least one disk sync as well as pausing all
request processing for a while. If you combined this expense with a
large number of large maps (if, perhaps, one quarter of your 800-OSD
system had been down but not out for 6+ hours), you could cause memory
thrashing on OSDs as they came up, which could force them to become
very, very, veeery slow. In the next version of Ceph, map processing
is much less expensive (no syncs or full-system pauses required),
which will prevent request backup. And there are a huge number of ways
to reduce the memory utilization of maps, some of which can be
backported to argonaut and some of which can't.
Now, if we can't prevent our internal processes from running an OSD
out of memory, we'll have failed. But we don't think this is an
intractable problem; in fact we have reason to hope we've cleared it
up now that we've seen the problem — although we don't think it's
something that we can absolutely prevent on argonaut (too much code
churn).
So we're looking for something that we can apply to argonaut as a
band-aid, but that we can also keep around in case forces external to
Ceph start causing similar cluster-scale resource shortages beyond our
control (runaway co-located process eats up all the memory on lots of
boxes, switch fails and bandwidth gets cut in half, etc). If something
happens that means Ceph can only supply half as much throughput as it
was previously, then Ceph should provide that much throughput; right
now if that kind of incident occurs then Ceph won't provide any
throughput because it'll all be eaten by spurious recovery work.

>> This algorithm has several nice properties:
>> 1) It allows us to independently account for both the probability that
>> the node is laggy, and for the length of time the node is usually
>> laggy for.
>
> This implies to me you think the root cause of lagginess is
> independent of client offered load.  Otherwise, if client offered
> load does impact lagginess, then your estimate of the probability
> that an OSD is laggy is only useful for as long as your offered load
> doesn't change, no?
Approximately — client requests are throttled at ingress; all the
issues we've seen are caused by internal traffic.


>> 2) It localizes lagginess by PG relationships — if your Ceph cluster
>> has multiple pools stored in different locations, lagginess won't
>> cross those boundaries.
>> 3) It's not too expensive, and by framing it the way we have (in terms
>> of estimating probabilities) we can shuffle the generic algorithm
>> around (eg, eventually move these calculations to the reporting OSDs).
>> There are a couple of things it doesn't do:
>> 1) It doesn't do a good job of noticing that a particular rack is
>> laggy compared to other racks within the same pool.
>> 2) It's all continuous — there isn't yet any sense of "don't guess
>> anybody is laggy until we've seen a certain amount of churn over the
>> last x minutes".
>>
>> We think that this is a good start and that any necessary
>> modifications will be pretty easy to add, but if you have other ideas
>> or critiques we'd love to hear about them!
>
>
> As I mentioned above, I'm concerned this is addressing
> symptoms, rather than root causes.  I'm concerned the
> root cause has something to do with how the map processing
> work scales with number of OSDs/PGs, and that this will
> limit the maximum size of a Ceph storage cluster.
I think I discussed this above enough already? :)


> But, if you really just want to not mark down an OSD that is
> laggy, I know this will sound simplistic, but I keep thinking
> that the OSD knows for itself if it's up, even when the
> heartbeat mechanism is backed up.  Couldn't there be some way
> to ask an OSD suspected of being down whether it is or not,
> separate from the heartbeat mechanism?  I mean, if you're
> considering having the monitor ignore OSD down reports for a
> while based on some estimate of past behavior, wouldn't it be
> better for the monitor to just ask such an OSD, "hey, are you
> still there?"  If it gets an immediate "I'm busy, come back later",
> extend the grace period; otherwise, mark the OSD down.
Hmm. The concern is that if an OSD is stuck on disk swapping then it's
going to be just as stuck for the monitors as the OSDs — they're all
using the same network in the basic case, etc. We want to be able to
make that guess before the OSD is able to answer such questions.
But I'll think on if we could try something else similar.
-Greg


* Re: [EXTERNAL] Re: avoiding false detection of down OSDs
  2012-07-31  0:24   ` Gregory Farnum
@ 2012-07-31 15:07     ` Jim Schutt
  2012-07-31 18:14       ` Gregory Farnum
  0 siblings, 1 reply; 7+ messages in thread
From: Jim Schutt @ 2012-07-31 15:07 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel

On 07/30/2012 06:24 PM, Gregory Farnum wrote:
> On Mon, Jul 30, 2012 at 3:47 PM, Jim Schutt<jaschut@sandia.gov>  wrote:
>>
>> Above you mentioned that you are seeing these issues as you scaled
>> out a storage cluster, but none of the solutions you mentioned
>> address scaling.  Let's assume your preferred solution handles
>> this issue perfectly on the biggest cluster anyone has built
>> today.  What do you predict will happen when that cluster size
>> is scaled up by a factor of 2, or 10, or 100?
> Sage should probably describe in more depth what we've seen since he's
> looked at it the most, but I can expand on it a little. In argonaut
> and earlier versions of Ceph, processing a new OSDMap for an OSD is
> very expensive. I don't remember the precise numbers we'd whittled it
> down to but it required at least one disk sync as well as pausing all
> request processing for a while. If you combined this expense with a
> large number of large maps (if, perhaps, one quarter of your 800-OSD
> system had been down but not out for 6+ hours), you could cause memory
> thrashing on OSDs as they came up, which could force them to become
> very, very, veeery slow. In the next version of Ceph, map processing
> is much less expensive (no syncs or full-system pauses required),
> which will prevent request backup. And there are a huge number of ways
> to reduce the memory utilization of maps, some of which can be
> backported to argonaut and some of which can't.
> Now, if we can't prevent our internal processes from running an OSD
> out of memory, we'll have failed. But we don't think this is an
> intractable problem; in fact we have reason to hope we've cleared it
> up now that we've seen the problem — although we don't think it's
> something that we can absolutely prevent on argonaut (too much code
> churn).
> So we're looking for something that we can apply to argonaut as a
> band-aid, but that we can also keep around in case forces external to
> Ceph start causing similar cluster-scale resource shortages beyond our
> control (runaway co-located process eats up all the memory on lots of
> boxes, switch fails and bandwidth gets cut in half, etc). If something
> happens that means Ceph can only supply half as much throughput as it
> was previously, then Ceph should provide that much throughput; right
> now if that kind of incident occurs then Ceph won't provide any
> throughput because it'll all be eaten by spurious recovery work.

Ah, thanks for the extra context.  I hadn't fully appreciated
the proposal was primarily a mitigation for argonaut, and
otherwise as a fail-safe mechanism.

>>
>> As I mentioned above, I'm concerned this is addressing
>> symptoms, rather than root causes.  I'm concerned the
>> root cause has something to do with how the map processing
>> work scales with number of OSDs/PGs, and that this will
>> limit the maximum size of a Ceph storage cluster.
> I think I discussed this above enough already? :)

Yep, thanks.
>
>
>> But, if you really just want to not mark down an OSD that is
>> laggy, I know this will sound simplistic, but I keep thinking
>> that the OSD knows for itself if it's up, even when the
>> heartbeat mechanism is backed up.  Couldn't there be some way
>> to ask an OSD suspected of being down whether it is or not,
>> separate from the heartbeat mechanism?  I mean, if you're
>> considering having the monitor ignore OSD down reports for a
>> while based on some estimate of past behavior, wouldn't it be
>> better for the monitor to just ask such an OSD, "hey, are you
>> still there?"  If it gets an immediate "I'm busy, come back later",
>> extend the grace period; otherwise, mark the OSD down.
> Hmm. The concern is that if an OSD is stuck on disk swapping then it's
> going to be just as stuck for the monitors as the OSDs — they're all
> using the same network in the basic case, etc. We want to be able to
> make that guess before the OSD is able to answer such questions.
> But I'll think on if we could try something else similar.

OK - thanks.

Also, FWIW I've been running my Ceph servers with no swap,
and I've recently doubled the size of my storage cluster.
Is it possible to have map processing do a little memory
accounting and log it, or to provide some way to learn
that map processing is chewing up significant amounts of
memory?  Or maybe there's already a way to learn this that
I need to learn about?  I sometimes run into something that
shares some characteristics with what you describe, but is
primarily triggered by high client write load.  I'd like
to be able to confirm or deny it's the same basic issue
you've described.

Thanks -- Jim

> -Greg
>
>




* Re: [EXTERNAL] Re: avoiding false detection of down OSDs
  2012-07-31 15:07     ` [EXTERNAL] " Jim Schutt
@ 2012-07-31 18:14       ` Gregory Farnum
  2012-07-31 18:40         ` Sage Weil
  0 siblings, 1 reply; 7+ messages in thread
From: Gregory Farnum @ 2012-07-31 18:14 UTC (permalink / raw)
  To: Jim Schutt, Sage Weil; +Cc: ceph-devel

On Tue, Jul 31, 2012 at 8:07 AM, Jim Schutt <jaschut@sandia.gov> wrote:
> On 07/30/2012 06:24 PM, Gregory Farnum wrote:
>> Hmm. The concern is that if an OSD is stuck on disk swapping then it's
>> going to be just as stuck for the monitors as the OSDs — they're all
>> using the same network in the basic case, etc. We want to be able to
>> make that guess before the OSD is able to answer such questions.
>> But I'll think on if we could try something else similar.
>
>
> OK - thanks.
>
> Also, FWIW I've been running my Ceph servers with no swap,
> and I've recently doubled the size of my storage cluster.
> Is it possible to have map processing do a little memory
> accounting and log it, or to provide some way to learn
> that map processing is chewing up significant amounts of
> memory?  Or maybe there's already a way to learn this that
> I need to learn about?  I sometimes run into something that
> shares some characteristics with what you describe, but is
> primarily triggered by high client write load.  I'd like
> to be able to confirm or deny it's the same basic issue
> you've described.

I think that we've done all our diagnosis using profiling tools, but
there's now a map cache and it probably wouldn't be too difficult to
have it dump data via perfcounters if you poked around...anything like
this exist yet, Sage?
-Greg


* Re: [EXTERNAL] Re: avoiding false detection of down OSDs
  2012-07-31 18:14       ` Gregory Farnum
@ 2012-07-31 18:40         ` Sage Weil
  2012-07-31 19:58           ` Jim Schutt
  0 siblings, 1 reply; 7+ messages in thread
From: Sage Weil @ 2012-07-31 18:40 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Jim Schutt, ceph-devel

On Tue, 31 Jul 2012, Gregory Farnum wrote:
> On Tue, Jul 31, 2012 at 8:07 AM, Jim Schutt <jaschut@sandia.gov> wrote:
> > On 07/30/2012 06:24 PM, Gregory Farnum wrote:
> >> Hmm. The concern is that if an OSD is stuck on disk swapping then it's
> >> going to be just as stuck for the monitors as the OSDs -- they're all
> >> using the same network in the basic case, etc. We want to be able to
> >> make that guess before the OSD is able to answer such questions.
> >> But I'll think on if we could try something else similar.
> >
> >
> > OK - thanks.
> >
> > Also, FWIW I've been running my Ceph servers with no swap,
> > and I've recently doubled the size of my storage cluster.
> > Is it possible to have map processing do a little memory
> > accounting and log it, or to provide some way to learn
> > that map processing is chewing up significant amounts of
> > memory?  Or maybe there's already a way to learn this that
> > I need to learn about?  I sometimes run into something that
> > shares some characteristics with what you describe, but is
> > primarily triggered by high client write load.  I'd like
> > to be able to confirm or deny it's the same basic issue
> > you've described.
> 
> I think that we've done all our diagnosis using profiling tools, but
> there's now a map cache and it probably wouldn't be too difficult to
> have it dump data via perfcounters if you poked around...anything like
> this exist yet, Sage?

Much of the bad behavior was triggered by #2860, fixes for which just went 
into the stable and master branches yesterday.  It's difficult to fully 
observe the bad behavior, though (lots of time spent in 
generate_past_intervals, reading old maps off disk).  With the fix, we 
pretty much only process maps during handle_osd_map.

Adding perfcounters in the methods that grab a map out of the cache or 
(more importantly) read it off disk will give you better visibility into 
that.  It should be pretty easy to instrument that (and I'll gladly 
take patches that implement that... :).  Without knowing more about what 
you're seeing, it's hard to say if it's related, though.  This was 
triggered by long periods of unclean pgs and lots of data migration, not 
high load.
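
As a rough illustration of the kind of instrumentation described above,
here is a minimal sketch in Python.  It is not Ceph code and does not use
Ceph's perfcounter API; the class name CountingMapCache, the counter
names, and the fake_disk loader are all hypothetical.  It only shows the
idea of counting cache hits separately from the (more expensive) reads
off disk, so the two paths can be told apart afterwards.

```python
from collections import OrderedDict


class CountingMapCache:
    """Toy LRU cache of maps keyed by epoch, with simple counters.

    Hypothetical illustration only; names do not come from Ceph.
    """

    def __init__(self, load_from_disk, capacity=2):
        self._load = load_from_disk      # assumed callable: epoch -> bytes
        self._capacity = capacity
        self._cache = OrderedDict()      # epoch -> encoded map, oldest first
        self.counters = {
            "cache_hit": 0,
            "cache_miss": 0,
            "disk_read": 0,
            "disk_read_bytes": 0,
        }

    def get_map(self, epoch):
        if epoch in self._cache:
            self.counters["cache_hit"] += 1
            self._cache.move_to_end(epoch)   # mark as most recently used
            return self._cache[epoch]
        # Cache miss: this is the expensive path worth watching.
        self.counters["cache_miss"] += 1
        data = self._load(epoch)
        self.counters["disk_read"] += 1
        self.counters["disk_read_bytes"] += len(data)
        self._cache[epoch] = data
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)  # evict least recently used
        return data


def fake_disk(epoch):
    # Stand-in for reading an encoded map off disk.
    return b"x" * (100 + epoch)


cache = CountingMapCache(fake_disk, capacity=2)
for epoch in [1, 2, 1, 3, 2]:
    cache.get_map(epoch)
print(cache.counters)
```

In this toy run, a high ratio of disk_read to cache_hit would be the
signal that old maps are being pulled off disk repeatedly.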

sage



* Re: [EXTERNAL] Re: avoiding false detection of down OSDs
  2012-07-31 18:40         ` Sage Weil
@ 2012-07-31 19:58           ` Jim Schutt
  0 siblings, 0 replies; 7+ messages in thread
From: Jim Schutt @ 2012-07-31 19:58 UTC (permalink / raw)
  To: Sage Weil; +Cc: Gregory Farnum, ceph-devel

On 07/31/2012 12:40 PM, Sage Weil wrote:
> On Tue, 31 Jul 2012, Gregory Farnum wrote:
>> On Tue, Jul 31, 2012 at 8:07 AM, Jim Schutt <jaschut@sandia.gov> wrote:

>>> Also, FWIW I've been running my Ceph servers with no swap,
>>> and I've recently doubled the size of my storage cluster.
>>> Is it possible to have map processing do a little memory
>>> accounting and log it, or to provide some way to learn
>>> that map processing is chewing up significant amounts of
>>> memory?  Or maybe there's already a way to learn this that
>>> I need to learn about?  I sometimes run into something that
>>> shares some characteristics with what you describe, but is
>>> primarily triggered by high client write load.  I'd like
>>> to be able to confirm or deny it's the same basic issue
>>> you've described.
>>
>> I think that we've done all our diagnosis using profiling tools, but
>> there's now a map cache and it probably wouldn't be too difficult to
>> have it dump data via perfcounters if you poked around...anything like
>> this exist yet, Sage?
>
> Much of the bad behavior was triggered by #2860, fixes for which just went
> into the stable and master branches yesterday.  It's difficult to fully
> observe the bad behavior, though (lots of time spent in
> generate_past_intervals, reading old maps off disk).  With the fix, we
> pretty much only process maps during handle_osd_map.
>
> Adding perfcounters in the methods that grab a map out of the cache or
> (more importantly) read it off disk will give you better visibility into
> that.  It should be pretty easy to instrument that (and I'll gladly
> take patches that implement that... :).  Without knowing more about what
> you're seeing, it's hard to say if it's related, though.  This was
> triggered by long periods of unclean pgs and lots of data migration, not
> high load.

An issue I've been seeing is unusually high OSD memory use.
It seems to be triggered by Linux clients timing out requests
and resetting OSDs during a heavy write load, but I was hoping
to rule out any memory-use issues caused by map processing.
However, this morning I started testing your server wip-msgr
branch together with the kernel-side patches queued up for 3.6,
and so far with that combination I've been unable to trigger the
behavior I was seeing.  So, that's great news, and I think
confirms that issue was unrelated to any map issues.

I've also sometimes had issues with my cluster becoming unstable
when failing an OSD while the cluster is under a heavy write load,
but hadn't been successful at characterizing under what conditions
it couldn't recover.  I expect that situation is now improved as
well, and will retest.

Thanks -- Jim

>
> sage
>




end of thread, other threads:[~2012-07-31 19:59 UTC | newest]

Thread overview: 7+ messages
2012-07-30 18:46 avoiding false detection of down OSDs Gregory Farnum
2012-07-30 22:47 ` Jim Schutt
2012-07-31  0:24   ` Gregory Farnum
2012-07-31 15:07     ` [EXTERNAL] " Jim Schutt
2012-07-31 18:14       ` Gregory Farnum
2012-07-31 18:40         ` Sage Weil
2012-07-31 19:58           ` Jim Schutt
