From: Zenon Panoussis <oracle@provocation.net>
To: ceph-devel@vger.kernel.org
Subject: Re: Suicide
Date: Tue, 19 Apr 2011 12:45:17 +0200
Message-ID: <4DAD67BD.1010901@provocation.net>
In-Reply-To: <FF1D1635C657408E9A7F6275AB23DD61@gmail.com>



On 04/19/2011 01:02 AM, Gregory Farnum wrote:

> It's a bit more complicated than that. While we could probably do a better 
> job of controlling bandwidths, there are a lot of pieces devoted to handling 
> changes in disk performance and preventing the OSD from accepting more data 
> than it can handle -- much of this is tunable (it's not well-documented but 
> off the top of my head the ones to look at are osd_client_message_size_cap, 
> for how much client data to hold in-memory waiting on disk [defaults to 200MB], 
> and filestore_op_threads, which defaults to 2 but might be better at 1 or 0 
> for a very slow disk) . 

Indeed, I've seen cosd threads fighting each other to write to disk. Do
I understand correctly that editing OPTION(filestore_op_threads, 0, OPT_INT, 0)
in src/config.cc and recompiling is the only way to change this?

> The specific crash that I saw here meant that the OSD called sync on its 
> underlying filesystem and 6 minutes later the sync hadn't completed! The 
> system can handle slow disks but it became basically unresponsive, at which 
> point the proper response in a replicated scenario is to kill itself (the 
> data should all exist elsewhere). 

It seems the right thing to do, provided that the data does exist elsewhere
in sufficient replicas. This touches something that Colin wrote in this same
thread, so I'll merge it in:

> If we let
> unresponsive OSDs linger forever, we could get in a situation where
> the upper layer (the Ceph filesystem) continues to wait forever for an
> unresponsive OSD, even though there are lots of other OSDs that could
> easily take up the slack.

The requirements and availability of OSDs at any given moment are known,
so the reaction to an unresponsive OSD can be calculated. Let's say,
given 20 OSDs and a CRUSH rule that says data {min_data 3; max_data 10},
ceph could acknowledge a write as soon as it is committed to >= 3
replicas. If, on the other hand, it can't commit the data to at least 3
OSDs (or 3 journals, as the case might be), it should throw an error back
and tell the application that it can't write to disk, so that the
application can react appropriately.
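
To make that rule concrete, here is a minimal sketch in Python, purely
for illustration. The function name and the "ack"/"error" return values
are hypothetical, and min_data is the figure from my example rule above,
not anything taken from the ceph source:

```python
def write_outcome(committed, min_data):
    """Decide what to report to the client for a single write.

    committed: number of replicas (or journals) that have the data
               safely committed.
    min_data:  minimum number of committed copies required by the
               (hypothetical) rule before a write may be acknowledged.
    """
    if min_data < 1:
        raise ValueError("min_data must be at least 1")
    # Acknowledge only when the minimum is met; otherwise report an
    # error so the application can react.
    return "ack" if committed >= min_data else "error"
```

With min_data 3, two committed copies would yield "error" and three or
more would yield "ack".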

I have to admit though that I'm still rather confused as to how and in
which order data and metadata are passed around through memory and journal
to disk. The wiki says "The OSD does not acknowledge a write until it is
committed (stable) on disk" but does that mean "committed once" or "committed
on as many copies as min_data requires"? I suspect the former, because that would
explain how the bottlenecks I'm seeing can build up. On the other hand, a
stricter flush-min_data-before-acknowledging strategy as I describe above
would automatically solve most network and disk bandwidth calculation problems
by blocking writes unless and until there is sufficient bandwidth to commit
all required copies to disk.

Of course these are design considerations that I'm sure you must have gone
through time and time again, so I could very well be missing some essential
point. I hope you'll bear with me.

> This is like starving to death at the
> all-you-can-eat buffet, just because they're out of jell-o.

:)) That's the funniest comparison I've seen in a long time, but I'm not
sure it fully applies. Yes, waiting forever for an unresponsive OSD when
there are others around would be exactly that, so ceph should (quickly)
time out and try elsewhere. But acknowledging a write before it knows
for sure whether it can meet its min_data requirements is another story;
that's stuffing data and hoping that some OSDs somewhere will accept it,
when in fact they might all have called it a day and gone fishing.
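
Put together, the behaviour I have in mind looks roughly like the sketch
below (Python, for illustration only). Everything in it is an assumption
of mine: try_commit stands in for "send the data to this OSD and wait,
with a timeout, for it to commit", and min_data/max_data are the figures
from my example rule, not real ceph settings.

```python
def replicate(osds, try_commit, min_data, max_data):
    """Try OSDs in order, skipping any that fail or time out.

    try_commit(osd) is a hypothetical callback returning True if that
    OSD committed the data within its timeout, False otherwise.
    Returns (acknowledged, committed_osds).
    """
    committed = []
    for osd in osds:
        if len(committed) >= max_data:
            break  # enough copies, stop placing more
        if try_commit(osd):
            committed.append(osd)
        # An unresponsive OSD is simply skipped: move on to the next
        # one instead of waiting for it forever.
    # Acknowledge only if the minimum number of copies really exists.
    return len(committed) >= min_data, committed
```

If only OSDs 0, 2, 5 and 7 out of 20 respond, the write is acknowledged
with four copies; if they have all gone fishing, the result is
(False, []) and the application gets its error.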

Z


