contraining crush placement possibilities

* contraining crush placement possibilities
@ 2014-03-06 20:30 Sage Weil
  2014-03-07  3:51 ` Li Wang
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Sage Weil @ 2014-03-06 20:30 UTC (permalink / raw)
  To: ceph-devel

During the CRUSH CDS session yesterday I talked a bit about the desire to 
constrain the number of possible disk combinations so that we reduce the 
probability of a concurrent failure from causing data loss.  Sheldon just 
pointed out a talk from ATC that discusses the basic problem:

	https://www.usenix.org/conference/atc13/technical-sessions/presentation/cidon

The situation with CRUSH is slightly better, I think, because the number 
of peers for a given OSD in a large cluster is bounded (pg_num / 
num_osds), but I think we may still be able to improve things.

Last night it occurred to me that this is almost just having pgp_num < 
pg_num, but I think that's not quite right either.

If anyone has some clear intuition here, would love to hear it.  If there 
is anything we can do to improve things we definitely want to do it!

sage


* Re: contraining crush placement possibilities
  2014-03-06 20:30 contraining crush placement possibilities Sage Weil
@ 2014-03-07  3:51 ` Li Wang
  2014-03-07  3:53   ` Li Wang
  2014-03-07  8:45 ` Dan van der Ster
  2014-03-07 10:30 ` Dan van der Ster
  2 siblings, 1 reply; 15+ messages in thread
From: Li Wang @ 2014-03-07  3:51 UTC (permalink / raw)
  To: Sage Weil, ceph-devel

Just had a quick look. It seems CRUSH could meet the demand. Say we
have 100 OSDs and replica_num is 3: we partition the 100 OSDs into 3
trees, let 'take' iterate over the 3 trees, and select 1 OSD from each
tree. Then the probability of losing data is at most n*n*n/Cn3. Can we
make it better?


On 2014/3/7 4:30, Sage Weil wrote:
> During the CRUSH CDS session yesterday I talked a bit about the desire to
> constrain the number of possible disk combinations so that we reduce the
> probability of a concurrent failure from causing data loss.  Sheldon just
> pointed out a talk from ATC that discusses the basic problem:
>
> 	https://www.usenix.org/conference/atc13/technical-sessions/presentation/cidon
>
> The situation with CRUSH is slightly better, I think, because the number
> of peers for a given OSD in a large cluster is bounded (pg_num /
> num_osds), but I think we may still be able improve things.
>
> Last night it occurred to me that this is almost just having pgp_num <
> pg_num, but I think that's not quite right either.
>
> If anyone has some clear intuition here, would love to hear it.  If there
> is anything we can do to improve things we definitely want to do it!
>
> sage
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


* Re: contraining crush placement possibilities
  2014-03-07  3:51 ` Li Wang
@ 2014-03-07  3:53   ` Li Wang
  2014-03-07  4:35     ` Li Wang
  0 siblings, 1 reply; 15+ messages in thread
From: Li Wang @ 2014-03-07  3:53 UTC (permalink / raw)
  To: Sage Weil, ceph-devel

Provided 3 osds are down simultaneously

On 2014/3/7 11:51, Li Wang wrote:
> Just had a quick look. It seems crush could meet the demand,
> say, if we have 100 osds, replica_num is 3, then we partition the
> 100 osds into 3 trees, 'take' iterates on the 3 trees, for each tree,
> select 1 osd. Then the probability of losing data is at most n*n*n/Cn3,
> can we make it better?
>
>
> On 2014/3/7 4:30, Sage Weil wrote:
>> During the CRUSH CDS session yesterday I talked a bit about the desire to
>> constrain the number of possible disk combinations so that we reduce the
>> probability of a concurrent failure from causing data loss.  Sheldon just
>> pointed out a talk from ATC that discusses the basic problem:
>>
>>     https://www.usenix.org/conference/atc13/technical-sessions/presentation/cidon
>>
>>
>> The situation with CRUSH is slightly better, I think, because the number
>> of peers for a given OSD in a large cluster is bounded (pg_num /
>> num_osds), but I think we may still be able improve things.
>>
>> Last night it occurred to me that this is almost just having pgp_num <
>> pg_num, but I think that's not quite right either.
>>
>> If anyone has some clear intuition here, would love to hear it.  If there
>> is anything we can do to improve things we definitely want to do it!
>>
>> sage
>>

* Re: contraining crush placement possibilities
  2014-03-07  3:53   ` Li Wang
@ 2014-03-07  4:35     ` Li Wang
  2014-03-07  5:03       ` Sage Weil
  0 siblings, 1 reply; 15+ messages in thread
From: Li Wang @ 2014-03-07  4:35 UTC (permalink / raw)
  To: Sage Weil, ceph-devel

Sorry, it is (n/3)*(n/3)*(n/3)/Cn3 = n^3/(27*Cn3)
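
A minimal sketch of the corrected bound, assuming n is divisible by the replica count and that exactly 3 OSDs fail at once, chosen uniformly at random. The function name partitioned_fraction is hypothetical and only illustrates the arithmetic; it is not part of CRUSH or any Ceph tool.

```python
from math import comb


def partitioned_fraction(n: int, replicas: int = 3) -> float:
    """Fraction of all `replicas`-sized OSD sets a partitioned layout can use.

    Assumes the n OSDs are split into `replicas` equal trees and one OSD is
    taken from each tree, so (n/3)^3 sets are reachable for 3 replicas.
    """
    if n % replicas != 0:
        raise ValueError("n must be divisible by the replica count")
    usable = (n // replicas) ** replicas  # (n/3)*(n/3)*(n/3)
    total = comb(n, replicas)             # "Cn3", i.e. n choose 3
    return usable / total


if __name__ == "__main__":
    for n in (9, 99, 999):
        print(n, round(partitioned_fraction(n), 4))
```

For large n the ratio tends toward 6/27, so under these assumptions partitioning alone removes a bit more than three quarters of the possible 3-OSD sets.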

On 2014/3/7 11:53, Li Wang wrote:
> Provided 3 osds are down simultaneously
>
> On 2014/3/7 11:51, Li Wang wrote:
>> Just had a quick look. It seems crush could meet the demand,
>> say, if we have 100 osds, replica_num is 3, then we partition the
>> 100 osds into 3 trees, 'take' iterates on the 3 trees, for each tree,
>> select 1 osd. Then the probability of losing data is at most n*n*n/Cn3,
>> can we make it better?
>>
>>
>> On 2014/3/7 4:30, Sage Weil wrote:
>>> During the CRUSH CDS session yesterday I talked a bit about the
>>> desire to
>>> constrain the number of possible disk combinations so that we reduce the
>>> probability of a concurrent failure from causing data loss.  Sheldon
>>> just
>>> pointed out a talk from ATC that discusses the basic problem:
>>>
>>>
>>> https://www.usenix.org/conference/atc13/technical-sessions/presentation/cidon
>>>
>>>
>>>
>>> The situation with CRUSH is slightly better, I think, because the number
>>> of peers for a given OSD in a large cluster is bounded (pg_num /
>>> num_osds), but I think we may still be able improve things.
>>>
>>> Last night it occurred to me that this is almost just having pgp_num <
>>> pg_num, but I think that's not quite right either.
>>>
>>> If anyone has some clear intuition here, would love to hear it.  If
>>> there
>>> is anything we can do to improve things we definitely want to do it!
>>>
>>> sage
>>>

* Re: contraining crush placement possibilities
  2014-03-07  4:35     ` Li Wang
@ 2014-03-07  5:03       ` Sage Weil
  2014-03-07  8:32         ` lianghaoshen
  2014-03-07  8:37         ` lianghaoshen
  0 siblings, 2 replies; 15+ messages in thread
From: Sage Weil @ 2014-03-07  5:03 UTC (permalink / raw)
  To: Li Wang; +Cc: ceph-devel

On Fri, 7 Mar 2014, Li Wang wrote:
> Sorry, it is (n/3)*(n/3)*(n/3)/Cn3 = n^3/(27*Cn3)

Cn3 is "n choose 3"?

> > > > Last night it occurred to me that this is almost just having 
> > > > pgp_num < pg_num, but I think that's not quite right either.

Actually, maybe it is.  Basically, say there are X combinations of 3 disks 
= n choose 3.  Some fraction of these, say Y, are actually used by CRUSH.  
If we are to reduce that number, that implies that there are some PGs that 
are overlapping on the same set of disks.  Which more or less reduces to 
the case where pgp_num < pg_num, or the hashpspool flag isn't set, or any 
other behavior that makes more than one PG line up on the same disk.  
Just using fewer PGs in the system, in fact, would help here.  The main 
problem is that doing this tends to make the distribution less uniform, so 
there is a tradeoff.
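
A rough sketch of the X and Y counts described above. Uniform random sampling is used here purely as a stand-in for CRUSH (an assumption; real CRUSH placement is hierarchical and weighted), and the function names are hypothetical.

```python
import random
from math import comb


def distinct_sets(num_osds: int, num_pgs: int, replicas: int = 3, seed: int = 0) -> int:
    """Count distinct OSD sets (Y) when each PG picks `replicas` OSDs at random."""
    rng = random.Random(seed)
    used = {
        frozenset(rng.sample(range(num_osds), replicas))
        for _ in range(num_pgs)
    }
    return len(used)


def loss_probability(num_osds: int, num_pgs: int, replicas: int = 3, seed: int = 0) -> float:
    """Y / X: chance that `replicas` randomly failed OSDs match a used set."""
    y = distinct_sets(num_osds, num_pgs, replicas, seed)
    x = comb(num_osds, replicas)  # X = n choose 3 for 3 replicas
    return y / x


if __name__ == "__main__":
    for pgs in (100, 1000, 10000):
        print(pgs, loss_probability(100, pgs))
```

Under this simplified model, fewer PGs means fewer distinct sets and so a lower Y/X, which is the effect described above; the sketch says nothing about how uniform the resulting data distribution is.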

There is a reliability model in ceph-tools.git at

	https://github.com/ceph/ceph-tools/tree/master/models/reliability

that Mark Kampe built last year.  Sadly I haven't looked at it closely so 
I'm not sure if it captures this.

sage


* Re: contraining crush placement possibilities
  2014-03-07  5:03       ` Sage Weil
@ 2014-03-07  8:32         ` lianghaoshen
  2014-03-07  8:37         ` lianghaoshen
  1 sibling, 0 replies; 15+ messages in thread
From: lianghaoshen @ 2014-03-07  8:32 UTC (permalink / raw)
  To: Sage Weil; +Cc: Li Wang, ceph-devel

On 2014-03-07 13:03, Sage Weil wrote:
> On Fri, 7 Mar 2014, Li Wang wrote:
>> Sorry, it is (n/3)*(n/3)*(n/3)/Cn3 = n^3/(27*Cn3)
> Cn3 is "n choose 3"?
>
>>>>> Last night it occurred to me that this is almost just having 
>>>>> pgp_num < pg_num, but I think that's not quite right either.
> Actually, maybe it is.  Basically, say there are X combinations of 3 disks 
> = n choose 3.  Some fraction of these, say Y, are actually used by CRUSH.  
> If we are to reduce that number, that implies that there are some PGs that 
> are overlapping on the same set of disks.  Which more or less reduces to 
> the case where pgp_num < pg_num, or the hashpspool flag isn't set, or any 
> other behavior that makes more than one PG line up on the same disk.  
> Just using fewer PGs in the system, in fact, would help here.  The main 

Does it mean that we can calculate pgp_num from the reliability
requirement, osd_num, and replica_num, instead of using a fixed value
such as 100 PGs per OSD? In fact, when the number of OSDs in a failure
domain is small, 100 PGs per OSD can easily cover all of the OSDs, which
means data loss will occur when the down OSDs are in different failure
domains.

> problem is that doing this tends to make the distribution less uniform, so 
> there is a tradeoff.
>
> There is a reliability model in ceph-tools.git at
>
> 	https://github.com/ceph/ceph-tools/tree/master/models/reliability
>
> that Mark Kampe built last year.  Sadly I haven't looked at it closely so 
> I'm not sure if it captures this.
>
> sage


-- 
Best regards,
slhhust


-- 
Best regards,
Lianghao Shen


* Re: contraining crush placement possibilities
  2014-03-06 20:30 contraining crush placement possibilities Sage Weil
  2014-03-07  3:51 ` Li Wang
@ 2014-03-07  8:45 ` Dan van der Ster
  2014-03-07 10:30 ` Dan van der Ster
  2 siblings, 0 replies; 15+ messages in thread
From: Dan van der Ster @ 2014-03-07  8:45 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel@vger.kernel.org

On Thu, Mar 6, 2014 at 9:30 PM, Sage Weil <sage@inktank.com> wrote:
> If anyone has some clear intuition here, would love to hear it.  If there
> is anything we can do to improve things we definitely want to do it!


The thing is, if you constrain the number of OSD combinations, it just
amplifies the damage in case you do lose all replicas of a PG. So the
total expected data loss per year should stay the same, though the
probability of an incident decreases.
Cheers, Dan

-- Dan van der Ster || Data & Storage Services || CERN IT Department --
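
A small sketch of the tradeoff described in this message, under strong simplifying assumptions: exactly 3 OSDs fail at once, chosen uniformly at random; data is spread evenly over the distinct OSD sets in use; and no two sets are identical. The function name loss_profile is hypothetical.

```python
from fractions import Fraction
from math import comb


def loss_profile(num_osds: int, copysets: int, replicas: int = 3):
    """Return (P(incident), fraction lost per incident, expected fraction lost)."""
    total = comb(num_osds, replicas)          # all possible failure sets
    p_incident = Fraction(copysets, total)    # failure hits a set in use
    lost_per_incident = Fraction(1, copysets) # each set holds 1/copysets of data
    expected = p_incident * lost_per_incident
    return p_incident, lost_per_incident, expected


if __name__ == "__main__":
    for sets in (100, 1000, 10000):
        p, lost, exp = loss_profile(100, sets)
        print(sets, float(p), float(lost), float(exp))
```

In this model the expected fraction lost is always 1 / (n choose 3), regardless of how many sets are in use: shrinking the number of sets lowers the incident probability and raises the damage per incident by the same factor.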


* Re: contraining crush placement possibilities
  2014-03-06 20:30 contraining crush placement possibilities Sage Weil
  2014-03-07  3:51 ` Li Wang
  2014-03-07  8:45 ` Dan van der Ster
@ 2014-03-07 10:30 ` Dan van der Ster
  2014-03-07 15:10   ` Sage Weil
  2 siblings, 1 reply; 15+ messages in thread
From: Dan van der Ster @ 2014-03-07 10:30 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel@vger.kernel.org

On Thu, Mar 6, 2014 at 9:30 PM, Sage Weil <sage@inktank.com> wrote:
> Sheldon just
> pointed out a talk from ATC that discusses the basic problem:
>
>         https://www.usenix.org/conference/atc13/technical-sessions/presentation/cidon
>
> The situation with CRUSH is slightly better, I think, because the number
> of peers for a given OSD in a large cluster is bounded (pg_num /
> num_osds), but I think we may still be able improve things.


I'm surprised they didn't cite Ceph -- aren't copysets ~= placement groups?
Cheers, Dan

-- Dan van der Ster || Data & Storage Services || CERN IT Department --


* Re: contraining crush placement possibilities
  2014-03-07 10:30 ` Dan van der Ster
@ 2014-03-07 15:10   ` Sage Weil
  2014-03-07 17:29     ` Gregory Farnum
  0 siblings, 1 reply; 15+ messages in thread
From: Sage Weil @ 2014-03-07 15:10 UTC (permalink / raw)
  To: Dan van der Ster; +Cc: ceph-devel@vger.kernel.org

On Fri, 7 Mar 2014, Dan van der Ster wrote:
> On Thu, Mar 6, 2014 at 9:30 PM, Sage Weil <sage@inktank.com> wrote:
> > Sheldon just
> > pointed out a talk from ATC that discusses the basic problem:
> >
> >         https://www.usenix.org/conference/atc13/technical-sessions/presentation/cidon
> >
> > The situation with CRUSH is slightly better, I think, because the number
> > of peers for a given OSD in a large cluster is bounded (pg_num /
> > num_osds), but I think we may still be able improve things.
> 
> I'm surprised they didn't cite Ceph -- aren't copysets ~= placement groups?

I think so (I didn't listen to the whole talk :).  My ears did perk up 
when Carlos (who was part of the original team at UCSC) asked the question 
about the CRUSH paper at the end, though. :)

Anyway, now I'm thinking that this *is* really just all about tuning 
pg_num/pgp_num.  And of course managing failure domains in the CRUSH map 
as best we can to align placement with expected sources of correlated 
failure.  But again, I would appreciate any confirmation from others' 
intuitions or (better yet) a proper mathematical model.  This bit of my 
brain is full of cobwebs, and wasn't particularly strong here to begin 
with.

sage


* Re: contraining crush placement possibilities
  2014-03-07 15:10   ` Sage Weil
@ 2014-03-07 17:29     ` Gregory Farnum
  2014-03-07 17:43       ` Sage Weil
  0 siblings, 1 reply; 15+ messages in thread
From: Gregory Farnum @ 2014-03-07 17:29 UTC (permalink / raw)
  To: Sage Weil; +Cc: Dan van der Ster, ceph-devel@vger.kernel.org

On Fri, Mar 7, 2014 at 7:10 AM, Sage Weil <sage@inktank.com> wrote:
> On Fri, 7 Mar 2014, Dan van der Ster wrote:
>> On Thu, Mar 6, 2014 at 9:30 PM, Sage Weil <sage@inktank.com> wrote:
>> > Sheldon just
>> > pointed out a talk from ATC that discusses the basic problem:
>> >
>> >         https://www.usenix.org/conference/atc13/technical-sessions/presentation/cidon
>> >
>> > The situation with CRUSH is slightly better, I think, because the number
>> > of peers for a given OSD in a large cluster is bounded (pg_num /
>> > num_osds), but I think we may still be able improve things.
>>
>> I'm surprised they didn't cite Ceph -- aren't copysets ~= placement groups?
>
> I think so (I didn't listen to the whole talk :).  My ears did perk up
> when Carlos (who was part of the original team at UCSC) asked the question
> about the CRUSH paper at the end, though. :)
>
> Anyway, now I'm thinking that this *is* really just all about tuning
> pg_num/pgp_num.  And of course managing failure domains in the CRUSH map
> as best we can to align placement with expected sources of correlated
> failure.  But again, I would appreciate any confirmation from others'
> intuitions or (better yet) a proper mathematical model.  This bit of my
> brain is full of cobwebs, and wasn't particularly strong here to begin
> with.

Well, yes and no. They're constraining data sharing in order to reduce
the probability of any given data loss event, and we can reduce data
sharing by reducing the pgp_num. But the example you cited was "place
all copies in the top third of the selected racks", and that's a
little different because it means they can independently scale the
data sharing *within* that grouping to maintain a good data balance,
which CRUSH would have trouble with.
Unfortunately my intuition around probability and stats isn't much
good, so that's about as far as I can take this effectively. ;)
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


* Re: contraining crush placement possibilities
  2014-03-07 17:29     ` Gregory Farnum
@ 2014-03-07 17:43       ` Sage Weil
  2014-03-07 18:00         ` Gregory Farnum
  2014-03-10  9:37         ` Li Wang
  0 siblings, 2 replies; 15+ messages in thread
From: Sage Weil @ 2014-03-07 17:43 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Dan van der Ster, ceph-devel@vger.kernel.org

On Fri, 7 Mar 2014, Gregory Farnum wrote:
> On Fri, Mar 7, 2014 at 7:10 AM, Sage Weil <sage@inktank.com> wrote:
> > On Fri, 7 Mar 2014, Dan van der Ster wrote:
> >> On Thu, Mar 6, 2014 at 9:30 PM, Sage Weil <sage@inktank.com> wrote:
> >> > Sheldon just
> >> > pointed out a talk from ATC that discusses the basic problem:
> >> >
> >> >         https://www.usenix.org/conference/atc13/technical-sessions/presentation/cidon
> >> >
> >> > The situation with CRUSH is slightly better, I think, because the number
> >> > of peers for a given OSD in a large cluster is bounded (pg_num /
> >> > num_osds), but I think we may still be able improve things.
> >>
> >> I'm surprised they didn't cite Ceph -- aren't copysets ~= placement groups?
> >
> > I think so (I didn't listen to the whole talk :).  My ears did perk up
> > when Carlos (who was part of the original team at UCSC) asked the question
> > about the CRUSH paper at the end, though. :)
> >
> > Anyway, now I'm thinking that this *is* really just all about tuning
> > pg_num/pgp_num.  And of course managing failure domains in the CRUSH map
> > as best we can to align placement with expected sources of correlated
> > failure.  But again, I would appreciate any confirmation from others'
> > intuitions or (better yet) a proper mathematical model.  This bit of my
> > brain is full of cobwebs, and wasn't particularly strong here to begin
> > with.
> 
> Well, yes and no. They're constraining data sharing in order to reduce
> the probability of any given data loss event, and we can reduce data
> sharing by reducing the pgp_num. But the example you cited was "place
> all copies in the top third of the selected racks", and that's a
> little different because it means they can independently scale the
> data sharing *within* that grouping to maintain a good data balance,
> which CRUSH would have trouble with.
> Unfortunately my intuition around probability and stats isn't much
> good, so that's about as far as I can take this effectively. ;)

Yeah I'm struggling with this too, but I *think* the top/middle/bottom 
rack analogy is just an easy way to think about constraining the placement 
options, which we're doing anyway with the placement group count--just in 
a way that looks random but is still sampling a small portion of the 
possible combinations.  In the end, whether you eliminate 8/9 of the 
options of the rack layers and *then* scale pg_num, or just scale pg_num, 
I think it still boils down to the number of distinct 3-disk sets out of 
the total possible 3-disk sets.

Also, FWIW, the rack thing is equivalent to making 3 parallel trees so 
that the crush hierarchy goes like:

 root
 layer of rack (top/middle/bottom)
 rack
 host
 osd

and make the crush rule first pick 1 layer before doing the chooseleaf 
over racks.

sage


* Re: contraining crush placement possibilities
  2014-03-07 17:43       ` Sage Weil
@ 2014-03-07 18:00         ` Gregory Farnum
  2014-03-10  9:37         ` Li Wang
  1 sibling, 0 replies; 15+ messages in thread
From: Gregory Farnum @ 2014-03-07 18:00 UTC (permalink / raw)
  To: Sage Weil; +Cc: Dan van der Ster, ceph-devel@vger.kernel.org

On Fri, Mar 7, 2014 at 9:43 AM, Sage Weil <sage@inktank.com> wrote:
> On Fri, 7 Mar 2014, Gregory Farnum wrote:
>> On Fri, Mar 7, 2014 at 7:10 AM, Sage Weil <sage@inktank.com> wrote:
>> > On Fri, 7 Mar 2014, Dan van der Ster wrote:
>> >> On Thu, Mar 6, 2014 at 9:30 PM, Sage Weil <sage@inktank.com> wrote:
>> >> > Sheldon just
>> >> > pointed out a talk from ATC that discusses the basic problem:
>> >> >
>> >> >         https://www.usenix.org/conference/atc13/technical-sessions/presentation/cidon
>> >> >
>> >> > The situation with CRUSH is slightly better, I think, because the number
>> >> > of peers for a given OSD in a large cluster is bounded (pg_num /
>> >> > num_osds), but I think we may still be able improve things.
>> >>
>> >> I'm surprised they didn't cite Ceph -- aren't copysets ~= placement groups?
>> >
>> > I think so (I didn't listen to the whole talk :).  My ears did perk up
>> > when Carlos (who was part of the original team at UCSC) asked the question
>> > about the CRUSH paper at the end, though. :)
>> >
>> > Anyway, now I'm thinking that this *is* really just all about tuning
>> > pg_num/pgp_num.  And of course managing failure domains in the CRUSH map
>> > as best we can to align placement with expected sources of correlated
>> > failure.  But again, I would appreciate any confirmation from others'
>> > intuitions or (better yet) a proper mathematical model.  This bit of my
>> > brain is full of cobwebs, and wasn't particularly strong here to begin
>> > with.
>>
>> Well, yes and no. They're constraining data sharing in order to reduce
>> the probability of any given data loss event, and we can reduce data
>> sharing by reducing the pgp_num. But the example you cited was "place
>> all copies in the top third of the selected racks", and that's a
>> little different because it means they can independently scale the
>> data sharing *within* that grouping to maintain a good data balance,
>> which CRUSH would have trouble with.
>> Unfortunately my intuition around probability and stats isn't much
>> good, so that's about as far as I can take this effectively. ;)
>
> Yeah I'm struggling with this too, but I *think* the top/middle/bottom
> rack analogy is just an easy way to think about constraining the placement
> options, which we're doing anyway with the placement group count--just in
> a way that looks random but is still sampling a small portion of the
> possible combinations.  In the end, whether you eliminate 8/9 of the
> options of the rack layers and *then* scale pg_num, or just scale pg_num,
> I think it still boils down to the number of distinct 3-disk sets out of
> the total possible 3-disk sets.

Mmm, the bounds are very different in those two environments, though.
Let's say you have 3 racks of 9 OSDs; with CRUSH splitting across
racks you have 9^3=729 possible combinations of placement; with
thirded racks you have 3*(3^3)=81. If you constrain CRUSH to 81 PGs,
you're going to have a terrible distribution. But with a different
system it's easy to scale your shards within each grouping to maintain
balance within each group, and to adjust the boundaries between groups
as well.
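
The two counts above can be checked by brute-force enumeration. This is only a sketch of the counting argument: it assumes 3 racks of 9 OSDs, one replica per rack, and that "thirded" means all replicas share the same third (layer) of their rack. The constants and function names are hypothetical.

```python
from itertools import product

RACKS = 3
OSDS_PER_RACK = 9
LAYERS = 3  # top / middle / bottom third of each rack


def unconstrained():
    """All placements with one OSD position chosen from each rack."""
    return set(product(range(OSDS_PER_RACK), repeat=RACKS))


def same_layer():
    """Placements where every replica sits in the same layer of its rack."""
    per_layer = OSDS_PER_RACK // LAYERS
    return {
        combo
        for combo in unconstrained()
        if len({pos // per_layer for pos in combo}) == 1
    }


if __name__ == "__main__":
    print(len(unconstrained()), len(same_layer()))
```

The constrained set is one ninth of the unconstrained one, which matches the "eliminate 8/9 of the options" figure quoted earlier in the thread.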

>
> Also, FWIW, the rack thing is equivalent to making 3 parallel trees so
> that the crush hierarchy goes like:
>
>  root
>  layer of rack (top/middle/bottom)
>  rack
>  host
>  osd
>
> and make the crush rule first pick 1 layer before doing the chooseleaf
> over racks.

That I missed -- I was thinking we didn't have a good way to do the
split in CRUSH, but I guess if you're doing same-rack-pos then just
doing the split at the top you could probably emulate the system above
reasonably well...maybe? We should run some experiments with the crush
tester and figure out if we can get a reasonable data distribution
with reasonable PG counts under a schema like that.
-Greg


* Re: contraining crush placement possibilities
  2014-03-07 17:43       ` Sage Weil
  2014-03-07 18:00         ` Gregory Farnum
@ 2014-03-10  9:37         ` Li Wang
  2014-03-10 16:25           ` Gregory Farnum
  1 sibling, 1 reply; 15+ messages in thread
From: Li Wang @ 2014-03-10  9:37 UTC (permalink / raw)
  To: Sage Weil, Gregory Farnum; +Cc: Dan van der Ster, ceph-devel@vger.kernel.org

pgp_num is the upper bound on the number of OSD combinations, right?
So we can reduce pgp_num to constrain the possible combinations, and
the data loss probability depends only on pgp_num: say,
pgp_num/Pn(replica_num) (since (a, b) and (b, a) are different PGs, it
is a permutation rather than a combination). But if we still keep a
large pg_num, will that make the object distribution more uniform?
Currently object_id is mapped to pg_id, and pg_id is then mapped to an
OSD combination. Why are two levels of mapping needed? Why not map
object_id to OSD combinations directly? Would that achieve a more
uniform distribution?

On 2014/3/8 1:43, Sage Weil wrote:
> On Fri, 7 Mar 2014, Gregory Farnum wrote:
>> On Fri, Mar 7, 2014 at 7:10 AM, Sage Weil <sage@inktank.com> wrote:
>>> On Fri, 7 Mar 2014, Dan van der Ster wrote:
>>>> On Thu, Mar 6, 2014 at 9:30 PM, Sage Weil <sage@inktank.com> wrote:
>>>>> Sheldon just
>>>>> pointed out a talk from ATC that discusses the basic problem:
>>>>>
>>>>>          https://www.usenix.org/conference/atc13/technical-sessions/presentation/cidon
>>>>>
>>>>> The situation with CRUSH is slightly better, I think, because the number
>>>>> of peers for a given OSD in a large cluster is bounded (pg_num /
>>>>> num_osds), but I think we may still be able improve things.
>>>>
>>>> I'm surprised they didn't cite Ceph -- aren't copysets ~= placement groups?
>>>
>>> I think so (I didn't listen to the whole talk :).  My ears did perk up
>>> when Carlos (who was part of the original team at UCSC) asked the question
>>> about the CRUSH paper at the end, though. :)
>>>
>>> Anyway, now I'm thinking that this *is* really just all about tuning
>>> pg_num/pgp_num.  And of course managing failure domains in the CRUSH map
>>> as best we can to align placement with expected sources of correlated
>>> failure.  But again, I would appreciate any confirmation from others'
>>> intuitions or (better yet) a proper mathematical model.  This bit of my
>>> brain is full of cobwebs, and wasn't particularly strong here to begin
>>> with.
>>
>> Well, yes and no. They're constraining data sharing in order to reduce
>> the probability of any given data loss event, and we can reduce data
>> sharing by reducing the pgp_num. But the example you cited was "place
>> all copies in the top third of the selected racks", and that's a
>> little different because it means they can independently scale the
>> data sharing *within* that grouping to maintain a good data balance,
>> which CRUSH would have trouble with.
>> Unfortunately my intuition around probability and stats isn't much
>> good, so that's about as far as I can take this effectively. ;)
>
> Yeah I'm struggling with this too, but I *think* the top/middle/bottom
> rack analogy is just an easy way to think about constraining the placement
> options, which we're doing anyway with the placement group count--just in
> a way that looks random but is still sampling a small portion of the
> possible combinations.  In the end, whether you eliminate 8/9 of the
> options of the rack layers and *then* scale pg_num, or just scale pg_num,
> I think it still boils down to the number of distinct 3-disk sets out of
> the total possible 3-disk sets.
>
> Also, FWIW, the rack thing is equivalent to making 3 parallel trees so
> that the crush hierarchy goes like:
>
>   root
>   layer of rack (top/middle/bottom)
>   rack
>   host
>   osd
>
> and make the crush rule first pick 1 layer before doing the chooseleaf
> over racks.
>
> sage

* Re: contraining crush placement possibilities
  2014-03-10  9:37         ` Li Wang
@ 2014-03-10 16:25           ` Gregory Farnum
  0 siblings, 0 replies; 15+ messages in thread
From: Gregory Farnum @ 2014-03-10 16:25 UTC (permalink / raw)
  To: Li Wang; +Cc: Sage Weil, Dan van der Ster, ceph-devel@vger.kernel.org

Since pgp_num is constraining the placement, making the pg_num larger
isn't going to improve the balance.
Mapping directly from objects to OSDs would require a much higher
metadata overhead, which is part of the reason we have PGs.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
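
A toy sketch of why a larger pg_num does not add placements once pgp_num is fixed. Plain modulo folding is assumed here for illustration only; Ceph's real mapping from PG to placement seed differs in detail. The function names are hypothetical.

```python
def placement_seed(pg_id: int, pgp_num: int) -> int:
    """Fold a PG id onto one of pgp_num placement seeds (simplified: modulo)."""
    return pg_id % pgp_num


def distinct_placements(pg_num: int, pgp_num: int) -> int:
    """Number of distinct placement seeds reachable by pg_num PGs."""
    return len({placement_seed(pg, pgp_num) for pg in range(pg_num)})


if __name__ == "__main__":
    for pg_num in (128, 1024, 4096):
        print(pg_num, distinct_placements(pg_num, 128))
```

With pgp_num held at 128, every pg_num of 128 or more yields the same 128 placement seeds in this model, so the extra PGs stack onto the same OSD sets instead of spreading data more evenly.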


On Mon, Mar 10, 2014 at 2:37 AM, Li Wang <liwang@ubuntukylin.com> wrote:
> pgp_num is an upper bound on the number of OSD combinations, right?
> So we can reduce pgp_num to constrain the possible combinations,
> and the data loss probability then depends only on pgp_num,
> say, pgp_num/Pn(replica_num) (since (a, b) and (b, a) are different PGs,
> it is a permutation rather than a combination). But we can still
> maintain a big pg_num; will that make the object distribution more
> uniform? Currently object_id is mapped to pg_id, and pg_id is then mapped
> to an OSD combination. Why does it need two levels of mapping? Why not
> map object_id to OSD combinations directly? Would that achieve a more
> uniform distribution?
>
>
> On 2014/3/8 1:43, Sage Weil wrote:
>>
>> On Fri, 7 Mar 2014, Gregory Farnum wrote:
>>>
>>> On Fri, Mar 7, 2014 at 7:10 AM, Sage Weil <sage@inktank.com> wrote:
>>>>
>>>> On Fri, 7 Mar 2014, Dan van der Ster wrote:
>>>>>
>>>>> On Thu, Mar 6, 2014 at 9:30 PM, Sage Weil <sage@inktank.com> wrote:
>>>>>>
>>>>>> Sheldon just
>>>>>> pointed out a talk from ATC that discusses the basic problem:
>>>>>>
>>>>>>
>>>>>> https://www.usenix.org/conference/atc13/technical-sessions/presentation/cidon
>>>>>>
>>>>>> The situation with CRUSH is slightly better, I think, because the
>>>>>> number of peers for a given OSD in a large cluster is bounded
>>>>>> (pg_num / num_osds), but I think we may still be able to improve
>>>>>> things.
>>>>>
>>>>>
>>>>> I'm surprised they didn't cite Ceph -- aren't copysets ~= placement
>>>>> groups?
>>>>
>>>>
>>>> I think so (I didn't listen to the whole talk :).  My ears did perk up
>>>> when Carlos (who was part of the original team at UCSC) asked the
>>>> question about the CRUSH paper at the end, though. :)
>>>>
>>>> Anyway, now I'm thinking that this *is* really just all about tuning
>>>> pg_num/pgp_num.  And of course managing failure domains in the CRUSH map
>>>> as best we can to align placement with expected sources of correlated
>>>> failure.  But again, I would appreciate any confirmation from others'
>>>> intuitions or (better yet) a proper mathematical model.  This bit of my
>>>> brain is full of cobwebs, and wasn't particularly strong here to begin
>>>> with.
>>>
>>>
>>> Well, yes and no. They're constraining data sharing in order to reduce
>>> the probability of any given data loss event, and we can reduce data
>>> sharing by reducing the pgp_num. But the example you cited was "place
>>> all copies in the top third of the selected racks", and that's a
>>> little different because it means they can independently scale the
>>> data sharing *within* that grouping to maintain a good data balance,
>>> which CRUSH would have trouble with.
>>> Unfortunately my intuition around probability and stats isn't much
>>> good, so that's about as far as I can take this effectively. ;)
>>
>>
>> Yeah I'm struggling with this too, but I *think* the top/middle/bottom
>> rack analogy is just an easy way to think about constraining the placement
>> options, which we're doing anyway with the placement group count--just in
>> a way that looks random but is still sampling a small portion of the
>> possible combinations.  In the end, whether you eliminate 8/9 of the
>> options of the rack layers and *then* scale pg_num, or just scale pg_num,
>> I think it still boils down to the number of distinct 3-disk sets out of
>> the total possible 3-disk sets.
>>
>> Also, FWIW, the rack thing is equivalent to making 3 parallel trees so
>> that the crush hierarchy goes like:
>>
>>   root
>>   layer of rack (top/middle/bottom)
>>   rack
>>   host
>>   osd
>>
>> and make the crush rule first pick 1 layer before doing the chooseleaf
>> over racks.
>>
>> sage
>>
>


end of thread, other threads:[~2014-03-10 16:25 UTC | newest]

Thread overview: 15+ messages
2014-03-06 20:30 contraining crush placement possibilities Sage Weil
2014-03-07  3:51 ` Li Wang
2014-03-07  3:53   ` Li Wang
2014-03-07  4:35     ` Li Wang
2014-03-07  5:03       ` Sage Weil
2014-03-07  8:32         ` lianghaoshen
2014-03-07  8:37         ` lianghaoshen
2014-03-07  8:45 ` Dan van der Ster
2014-03-07 10:30 ` Dan van der Ster
2014-03-07 15:10   ` Sage Weil
2014-03-07 17:29     ` Gregory Farnum
2014-03-07 17:43       ` Sage Weil
2014-03-07 18:00         ` Gregory Farnum
2014-03-10  9:37         ` Li Wang
2014-03-10 16:25           ` Gregory Farnum
