Re: contraining crush placement possibilities

From: Li Wang <liwang@ubuntukylin.com>
To: Sage Weil <sage@inktank.com>, Gregory Farnum <greg@inktank.com>
Cc: Dan van der Ster <daniel.vanderster@cern.ch>,
	"ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
Subject: Re: contraining crush placement possibilities
Date: Mon, 10 Mar 2014 17:37:04 +0800
Message-ID: <531D87C0.2050304@ubuntukylin.com>
In-Reply-To: <alpine.DEB.2.00.1403070934450.3593@cobra.newdream.net>

pgp_num is an upper bound on the number of distinct OSD combinations,
right? If so, we can reduce pgp_num to constrain the possible
combinations, and the data loss probability then depends only on
pgp_num: roughly pgp_num / P(n, replica_num), where n is the number of
OSDs (a permutation rather than a combination, since (a, b) and (b, a)
are different PGs). Meanwhile we can still keep pg_num large. Will that
make the object distribution more uniform? Currently object_id is
mapped to pg_id, and pg_id is then mapped to an OSD combination. Why
are two levels of mapping needed? Why not map object_id to an OSD
combination directly? Would that give a more uniform distribution?
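
The two-level mapping and the pgp_num bound described above can be
illustrated with a toy model. This is only a sketch under loud
assumptions: random.Random(seed).sample() stands in for CRUSH, plain
modulo stands in for Ceph's stable-mod folding of pg_id onto pgp_num,
and all function names are hypothetical. It shows that the number of
distinct ordered OSD tuples never exceeds pgp_num, however large pg_num
is, and compares that count with P(n, replica_num).

```python
import hashlib
import math
import random


def stable_hash(value: str) -> int:
    """Deterministic 64-bit hash (a stand-in for Ceph's object hash)."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


def object_to_pg(object_id: str, pg_num: int) -> int:
    """First level: object_id -> pg_id."""
    return stable_hash(object_id) % pg_num


def pg_to_osds(pg_id: int, pgp_num: int, num_osds: int, replica_num: int) -> tuple:
    """Second level: pg_id -> ordered tuple of distinct OSDs.

    The PG id is folded onto one of pgp_num placement seeds, and the
    seed alone decides the OSDs (assumption: a seeded random sample
    replaces the real CRUSH computation).
    """
    seed = pg_id % pgp_num
    rng = random.Random(seed)
    return tuple(rng.sample(range(num_osds), replica_num))


def distinct_placements(pg_num: int, pgp_num: int, num_osds: int, replica_num: int) -> set:
    """Set of distinct ordered OSD tuples used across all PGs."""
    return {pg_to_osds(pg, pgp_num, num_osds, replica_num) for pg in range(pg_num)}


# Example values are arbitrary.
num_osds, replica_num, pg_num, pgp_num = 100, 3, 1024, 64
placements = distinct_placements(pg_num, pgp_num, num_osds, replica_num)
ordered_total = math.perm(num_osds, replica_num)  # P(n, replica_num)

print("object 'foo' ->", pg_to_osds(object_to_pg("foo", pg_num), pgp_num, num_osds, replica_num))
print("distinct placements:", len(placements), "of", ordered_total, "ordered tuples")
print("fraction in use:", len(placements) / ordered_total)
```

Note that a failure event is an unordered set of OSDs, so the fraction
of failed 3-OSD sets that hit a PG may be better compared against
C(n, replica_num) than P(n, replica_num); the two differ by a factor of
replica_num!.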

On 2014/3/8 1:43, Sage Weil wrote:
> On Fri, 7 Mar 2014, Gregory Farnum wrote:
>> On Fri, Mar 7, 2014 at 7:10 AM, Sage Weil <sage@inktank.com> wrote:
>>> On Fri, 7 Mar 2014, Dan van der Ster wrote:
>>>> On Thu, Mar 6, 2014 at 9:30 PM, Sage Weil <sage@inktank.com> wrote:
>>>>> Sheldon just
>>>>> pointed out a talk from ATC that discusses the basic problem:
>>>>>
>>>>>          https://www.usenix.org/conference/atc13/technical-sessions/presentation/cidon
>>>>>
>>>>> The situation with CRUSH is slightly better, I think, because the number
>>>>> of peers for a given OSD in a large cluster is bounded (pg_num /
>>>>> num_osds), but I think we may still be able to improve things.
>>>>
>>>> I'm surprised they didn't cite Ceph -- aren't copysets ~= placement groups?
>>>
>>> I think so (I didn't listen to the whole talk :).  My ears did perk up
>>> when Carlos (who was part of the original team at UCSC) asked the question
>>> about the CRUSH paper at the end, though. :)
>>>
>>> Anyway, now I'm thinking that this *is* really just all about tuning
>>> pg_num/pgp_num.  And of course managing failure domains in the CRUSH map
>>> as best we can to align placement with expected sources of correlated
>>> failure.  But again, I would appreciate any confirmation from others'
>>> intuitions or (better yet) a proper mathematical model.  This bit of my
>>> brain is full of cobwebs, and wasn't particularly strong here to begin
>>> with.
>>
>> Well, yes and no. They're constraining data sharing in order to reduce
>> the probability of any given data loss event, and we can reduce data
>> sharing by reducing the pgp_num. But the example you cited was "place
>> all copies in the top third of the selected racks", and that's a
>> little different because it means they can independently scale the
>> data sharing *within* that grouping to maintain a good data balance,
>> which CRUSH would have trouble with.
>> Unfortunately my intuition around probability and stats isn't much
>> good, so that's about as far as I can take this effectively. ;)
>
> Yeah I'm struggling with this too, but I *think* the top/middle/bottom
> rack analogy is just an easy way to think about constraining the placement
> options, which we're doing anyway with the placement group count--just in
> a way that looks random but is still sampling a small portion of the
> possible combinations.  In the end, whether you eliminate 8/9 of the
> options of the rack layers and *then* scale pg_num, or just scale pg_num,
> I think it still boils down to the number of distinct 3-disk sets out of
> the total possible 3-disk sets.
>
> Also, FWIW, the rack thing is equivalent to making 3 parallel trees so
> that the crush hierarchy goes like:
>
>   root
>   layer of rack (top/middle/bottom)
>   rack
>   host
>   osd
>
> and making the crush rule first pick 1 layer before doing the
> chooseleaf over racks.
>
> sage
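
The "eliminate 8/9 of the options" figure in the quoted
top/middle/bottom rack example can be checked with a short count. This
is a sketch under stated assumptions, not a model of CRUSH itself:
every rack holds the same number of OSDs, split evenly across the
layers, each replica lands in a different rack, and the function names
and example sizes are made up for illustration.

```python
import math
from fractions import Fraction


def unconstrained_sets(num_racks: int, osds_per_rack: int, replicas: int = 3) -> int:
    """Replica sets with one OSD in each of `replicas` distinct racks."""
    return math.comb(num_racks, replicas) * osds_per_rack ** replicas


def layered_sets(num_racks: int, osds_per_rack: int,
                 num_layers: int = 3, replicas: int = 3) -> int:
    """Same as above, but all replicas must share one layer (top/middle/bottom)."""
    osds_per_layer = osds_per_rack // num_layers  # assumes an even split
    return num_layers * math.comb(num_racks, replicas) * osds_per_layer ** replicas


# Example sizes are arbitrary: 10 racks of 30 OSDs, 3 layers, 3 replicas.
total = unconstrained_sets(num_racks=10, osds_per_rack=30)
kept = layered_sets(num_racks=10, osds_per_rack=30)
print("unconstrained:", total)
print("layered:", kept)
print("fraction kept:", Fraction(kept, total))
```

Under these assumptions the fraction kept is 1 / num_layers^2
regardless of rack count or rack size, which is 1/9 for three layers.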

Thread overview: 15+ messages
2014-03-06 20:30 contraining crush placement possibilities Sage Weil
2014-03-07  3:51 ` Li Wang
2014-03-07  3:53   ` Li Wang
2014-03-07  4:35     ` Li Wang
2014-03-07  5:03       ` Sage Weil
2014-03-07  8:32         ` lianghaoshen
2014-03-07  8:37         ` lianghaoshen
2014-03-07  8:45 ` Dan van der Ster
2014-03-07 10:30 ` Dan van der Ster
2014-03-07 15:10   ` Sage Weil
2014-03-07 17:29     ` Gregory Farnum
2014-03-07 17:43       ` Sage Weil
2014-03-07 18:00         ` Gregory Farnum
2014-03-10  9:37         ` Li Wang [this message]
2014-03-10 16:25           ` Gregory Farnum