* contraining crush placement possibilities
@ 2014-03-06 20:30 Sage Weil
2014-03-07 3:51 ` Li Wang
` (2 more replies)
0 siblings, 3 replies; 15+ messages in thread
From: Sage Weil @ 2014-03-06 20:30 UTC (permalink / raw)
To: ceph-devel
During the CRUSH CDS session yesterday I talked a bit about the desire to
constrain the number of possible disk combinations so that we reduce the
probability of a concurrent failure from causing data loss. Sheldon just
pointed out a talk from ATC that discusses the basic problem:
https://www.usenix.org/conference/atc13/technical-sessions/presentation/cidon
The situation with CRUSH is slightly better, I think, because the number
of peers for a given OSD in a large cluster is bounded (pg_num /
num_osds), but I think we may still be able to improve things.
Last night it occurred to me that this is almost just having pgp_num <
pg_num, but I think that's not quite right either.
If anyone has some clear intuition here, would love to hear it. If there
is anything we can do to improve things we definitely want to do it!
sage
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: contraining crush placement possibilities
2014-03-06 20:30 contraining crush placement possibilities Sage Weil
@ 2014-03-07 3:51 ` Li Wang
2014-03-07 3:53 ` Li Wang
2014-03-07 8:45 ` Dan van der Ster
2014-03-07 10:30 ` Dan van der Ster
2 siblings, 1 reply; 15+ messages in thread
From: Li Wang @ 2014-03-07 3:51 UTC (permalink / raw)
To: Sage Weil, ceph-devel
Just had a quick look. It seems CRUSH could meet the demand. Say we have
100 OSDs and replica_num is 3: we partition the 100 OSDs into 3 trees,
'take' iterates over the 3 trees, and for each tree we select 1 OSD. Then
the probability of losing data is at most n*n*n/Cn3. Can we make it better?
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: contraining crush placement possibilities
2014-03-07 3:51 ` Li Wang
@ 2014-03-07 3:53 ` Li Wang
2014-03-07 4:35 ` Li Wang
0 siblings, 1 reply; 15+ messages in thread
From: Li Wang @ 2014-03-07 3:53 UTC (permalink / raw)
To: Sage Weil, ceph-devel
That is, provided 3 OSDs are down simultaneously.
On 2014/3/7 11:51, Li Wang wrote:
> Just had a quick look. It seems crush could meet the demand,
> say, if we have 100 osds, replica_num is 3, then we partition the
> 100 osds into 3 trees, 'take' iterates on the 3 trees, for each tree,
> select 1 osd. Then the probability of losing data is at most n*n*n/Cn3,
> can we make it better?
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: contraining crush placement possibilities
2014-03-07 3:53 ` Li Wang
@ 2014-03-07 4:35 ` Li Wang
2014-03-07 5:03 ` Sage Weil
0 siblings, 1 reply; 15+ messages in thread
From: Li Wang @ 2014-03-07 4:35 UTC (permalink / raw)
To: Sage Weil, ceph-devel
Sorry, it is (n/3)*(n/3)*(n/3)/Cn3 = n^3/(27*Cn3)
On 2014/3/7 11:53, Li Wang wrote:
> Provided 3 osds are down simultaneously
^ permalink raw reply [flat|nested] 15+ messages in thread
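The corrected bound above can be checked numerically. The following is only a sketch of that arithmetic: the function name is made up for illustration, and it assumes the n OSDs are split into near-equal trees (34/33/33 for n = 100) because 100 is not divisible by 3. It returns the fraction of all 3-OSD subsets that a one-OSD-per-tree rule can ever produce, which is the chance that a random simultaneous 3-OSD failure could match a replica set at all.

```python
from math import comb


def partitioned_fraction(n: int, replicas: int = 3) -> float:
    """Fraction of all `replicas`-OSD subsets reachable when the n OSDs are
    split into `replicas` near-equal trees and one OSD is taken from each.

    Hypothetical helper, not part of Ceph.
    """
    base, extra = divmod(n, replicas)
    # The first `extra` trees get one additional OSD so the sizes sum to n.
    sizes = [base + 1 if i < extra else base for i in range(replicas)]
    reachable = 1
    for size in sizes:
        reachable *= size
    return reachable / comb(n, replicas)


if __name__ == "__main__":
    # For n = 100 this is 34*33*33 / C(100, 3), roughly 0.229.
    print(f"{partitioned_fraction(100):.4f}")
```

So under this partitioning a bit under a quarter of all possible triples remain reachable, before any further reduction from the PG count.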
* Re: contraining crush placement possibilities
2014-03-07 4:35 ` Li Wang
@ 2014-03-07 5:03 ` Sage Weil
2014-03-07 8:32 ` lianghaoshen
2014-03-07 8:37 ` lianghaoshen
0 siblings, 2 replies; 15+ messages in thread
From: Sage Weil @ 2014-03-07 5:03 UTC (permalink / raw)
To: Li Wang; +Cc: ceph-devel
On Fri, 7 Mar 2014, Li Wang wrote:
> Sorry, it is (n/3)*(n/3)*(n/3)/Cn3 = n^3/(27*Cn3)
Cn3 is "n choose 3"?
> > > > Last night it occurred to me that this is almost just having
> > > > pgp_num < pg_num, but I think that's not quite right either.
Actually, maybe it is. Basically, say there are X combinations of 3 disks
= n choose 3. Some fraction of these, say Y, are actually used by CRUSH.
If we are to reduce that number, that implies that there are some PGs that
are overlapping on the same set of disks. Which more or less reduces to
the case where pgp_num < pg_num, or the hashpspool flag isn't set, or any
other behavior that makes more than one PG line up on the same disk.
Just using fewer PGs in the system, in fact, would help here. The main
problem is that doing this tends to make the distribution less uniform, so
there is a tradeoff.
There is a reliability model in ceph-tools.git at
https://github.com/ceph/ceph-tools/tree/master/models/reliability
that Mark Kampe built last year. Sadly I haven't looked at it closely so
I'm not sure if it captures this.
sage
^ permalink raw reply [flat|nested] 15+ messages in thread
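To make the X and Y in the message above concrete, here is a rough sketch that counts how many distinct 3-OSD sets a given number of PGs actually touches. It is an assumption-laden toy: uniform random sampling stands in for CRUSH, failure domains are ignored, and both function names are invented for this illustration.

```python
import random
from math import comb


def distinct_sets_used(num_osds: int, pg_num: int,
                       replicas: int = 3, seed: int = 0) -> int:
    """Y: number of distinct replica sets hit by pg_num random placements.

    random.sample is a stand-in for CRUSH, not a model of it.
    """
    rng = random.Random(seed)
    used = set()
    for _ in range(pg_num):
        used.add(frozenset(rng.sample(range(num_osds), replicas)))
    return len(used)


def hit_probability(num_osds: int, pg_num: int,
                    replicas: int = 3, seed: int = 0) -> float:
    """Y / X: chance that a random simultaneous failure of `replicas`
    OSDs coincides with at least one PG's replica set."""
    x = comb(num_osds, replicas)
    return distinct_sets_used(num_osds, pg_num, replicas, seed) / x


if __name__ == "__main__":
    for pgs in (100, 1000, 10000):
        print(pgs, distinct_sets_used(100, pgs), hit_probability(100, pgs))
```

Running it for a few PG counts shows Y growing with pg_num while staying bounded by both pg_num and X, which is the sense in which fewer PGs lowers the chance of a loss event.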
* Re: contraining crush placement possibilities
2014-03-07 5:03 ` Sage Weil
@ 2014-03-07 8:32 ` lianghaoshen
2014-03-07 8:37 ` lianghaoshen
1 sibling, 0 replies; 15+ messages in thread
From: lianghaoshen @ 2014-03-07 8:32 UTC (permalink / raw)
To: Sage Weil; +Cc: Li Wang, ceph-devel
On 2014-03-07 13:03, Sage Weil wrote:
> Actually, maybe it is. Basically, say there are X combinations of 3 disks
> = n choose 3. Some fraction of these, say Y, are actually used by CRUSH.
> If we are to reduce that number, that implies that there are some PGs that
> are overlapping on the same set of disks. Which more or less reduces to
> the case where pgp_num < pg_num, or the hashpspool flag isn't set, or any
> other behavior that makes more than one PG line up on the same disk.
> Just using fewer PGs in the system, in fact, would help here. The main
Does this mean that we can calculate pgp_num from the reliability
requirement, osd_num, and replica_num, instead of using a fixed value,
namely 100 PGs per OSD? In fact, when the osd_num of a failure domain is
small, 100 PGs can easily cover all of the OSDs, which means data loss
will occur whenever the down OSDs are in different failure domains.
> problem is that doing this tends to make the distribution less uniform, so
> there is a tradeoff.
>
> There is a reliability model in ceph-tools.git at
>
> https://github.com/ceph/ceph-tools/tree/master/models/reliability
>
> that Mark Kampe built last year. Sadly I haven't looked at it closely so
> I'm not sure if it captures this.
>
> sage
--
Best regards,
Lianghao Shen
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: contraining crush placement possibilities
2014-03-06 20:30 contraining crush placement possibilities Sage Weil
2014-03-07 3:51 ` Li Wang
@ 2014-03-07 8:45 ` Dan van der Ster
2014-03-07 10:30 ` Dan van der Ster
2 siblings, 0 replies; 15+ messages in thread
From: Dan van der Ster @ 2014-03-07 8:45 UTC (permalink / raw)
To: Sage Weil; +Cc: ceph-devel@vger.kernel.org
On Thu, Mar 6, 2014 at 9:30 PM, Sage Weil <sage@inktank.com> wrote:
> If anyone has some clear intuition here, would love to hear it. If there
> is anything we can do to improve things we definitely want to do it!
The thing is, if you constrain the number of OSD combinations, it just
amplifies the damage in case you do lose all replicas of a PG. So the
total expected data loss per year should stay the same, though the
probability of an incident decreases.
Cheers, Dan
-- Dan van der Ster || Data & Storage Services || CERN IT Department --
^ permalink raw reply [flat|nested] 15+ messages in thread
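The tradeoff described above can be illustrated with exact fractions. This is a simplified sketch, not a reliability model: it assumes exactly `replicas` OSDs fail at once, chosen uniformly at random, and that the data is spread evenly over a given number of distinct replica sets (called `copysets` here, a name borrowed from the linked talk). The function name is hypothetical.

```python
from fractions import Fraction
from math import comb


def loss_stats(num_osds: int, copysets: int,
               total_data: int = 1, replicas: int = 3):
    """Return (P(loss event), data lost per event, expected loss) for one
    random simultaneous failure of `replicas` OSDs.

    Assumes data is spread evenly over `copysets` distinct replica sets.
    """
    p_event = Fraction(copysets, comb(num_osds, replicas))
    loss_per_event = Fraction(total_data, copysets)
    return p_event, loss_per_event, p_event * loss_per_event


if __name__ == "__main__":
    for sets_in_use in (100, 1000, 10000):
        p, per_event, expected = loss_stats(100, sets_in_use)
        print(sets_in_use, float(p), float(per_event), float(expected))
```

Under these assumptions the expected loss is total_data / C(n, 3) no matter how many replica sets are in use: using fewer sets makes an incident rarer but proportionally larger.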
* Re: contraining crush placement possibilities
2014-03-06 20:30 contraining crush placement possibilities Sage Weil
2014-03-07 3:51 ` Li Wang
2014-03-07 8:45 ` Dan van der Ster
@ 2014-03-07 10:30 ` Dan van der Ster
2014-03-07 15:10 ` Sage Weil
2 siblings, 1 reply; 15+ messages in thread
From: Dan van der Ster @ 2014-03-07 10:30 UTC (permalink / raw)
To: Sage Weil; +Cc: ceph-devel@vger.kernel.org
On Thu, Mar 6, 2014 at 9:30 PM, Sage Weil <sage@inktank.com> wrote:
> Sheldon just
> pointed out a talk from ATC that discusses the basic problem:
>
> https://www.usenix.org/conference/atc13/technical-sessions/presentation/cidon
>
> The situation with CRUSH is slightly better, I think, because the number
> of peers for a given OSD in a large cluster is bounded (pg_num /
> num_osds), but I think we may still be able improve things.
I'm surprised they didn't cite Ceph -- aren't copysets ~= placement groups?
Cheers, Dan
-- Dan van der Ster || Data & Storage Services || CERN IT Department --
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: contraining crush placement possibilities
2014-03-07 10:30 ` Dan van der Ster
@ 2014-03-07 15:10 ` Sage Weil
2014-03-07 17:29 ` Gregory Farnum
0 siblings, 1 reply; 15+ messages in thread
From: Sage Weil @ 2014-03-07 15:10 UTC (permalink / raw)
To: Dan van der Ster; +Cc: ceph-devel@vger.kernel.org
On Fri, 7 Mar 2014, Dan van der Ster wrote:
> I'm surprised they didn't cite Ceph -- aren't copysets ~= placement groups?
I think so (I didn't listen to the whole talk :). My ears did perk up
when Carlos (who was part of the original team at UCSC) asked the question
about the CRUSH paper at the end, though. :)
Anyway, now I'm thinking that this *is* really just all about tuning
pg_num/pgp_num. And of course managing failure domains in the CRUSH map
as best we can to align placement with expected sources of correlated
failure. But again, I would appreciate any confirmation from others'
intuitions or (better yet) a proper mathematical model. This bit of my
brain is full of cobwebs, and wasn't particularly strong here to begin
with.
sage
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: contraining crush placement possibilities
2014-03-07 15:10 ` Sage Weil
@ 2014-03-07 17:29 ` Gregory Farnum
2014-03-07 17:43 ` Sage Weil
0 siblings, 1 reply; 15+ messages in thread
From: Gregory Farnum @ 2014-03-07 17:29 UTC (permalink / raw)
To: Sage Weil; +Cc: Dan van der Ster, ceph-devel@vger.kernel.org
On Fri, Mar 7, 2014 at 7:10 AM, Sage Weil <sage@inktank.com> wrote:
> Anyway, now I'm thinking that this *is* really just all about tuning
> pg_num/pgp_num. And of course managing failure domains in the CRUSH map
> as best we can to align placement with expected sources of correlated
> failure. But again, I would appreciate any confirmation from others'
> intuitions or (better yet) a proper mathematical model. This bit of my
> brain is full of cobwebs, and wasn't particularly strong here to begin
> with.
Well, yes and no. They're constraining data sharing in order to reduce
the probability of any given data loss event, and we can reduce data
sharing by reducing the pgp_num. But the example you cited was "place
all copies in the top third of the selected racks", and that's a
little different because it means they can independently scale the
data sharing *within* that grouping to maintain a good data balance,
which CRUSH would have trouble with.
Unfortunately my intuition around probability and stats isn't much
good, so that's about as far as I can take this effectively. ;)
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: contraining crush placement possibilities
2014-03-07 17:29 ` Gregory Farnum
@ 2014-03-07 17:43 ` Sage Weil
2014-03-07 18:00 ` Gregory Farnum
2014-03-10 9:37 ` Li Wang
0 siblings, 2 replies; 15+ messages in thread
From: Sage Weil @ 2014-03-07 17:43 UTC (permalink / raw)
To: Gregory Farnum; +Cc: Dan van der Ster, ceph-devel@vger.kernel.org
On Fri, 7 Mar 2014, Gregory Farnum wrote:
> Well, yes and no. They're constraining data sharing in order to reduce
> the probability of any given data loss event, and we can reduce data
> sharing by reducing the pgp_num. But the example you cited was "place
> all copies in the top third of the selected racks", and that's a
> little different because it means they can independently scale the
> data sharing *within* that grouping to maintain a good data balance,
> which CRUSH would have trouble with.
> Unfortunately my intuition around probability and stats isn't much
> good, so that's about as far as I can take this effectively. ;)
Yeah I'm struggling with this too, but I *think* the top/middle/bottom
rack analogy is just an easy way to think about constraining the placement
options, which we're doing anyway with the placement group count--just in
a way that looks random but is still sampling a small portion of the
possible combinations. In the end, whether you eliminate 8/9 of the
options of the rack layers and *then* scale pg_num, or just scale pg_num,
I think it still boils down to the number of distinct 3-disk sets out of
the total possible 3-disk sets.
Also, FWIW, the rack thing is equivalent to making 3 parallel trees so
that the crush hierarchy goes like:
 root
  layer of rack (top/middle/bottom)
   rack
    host
     osd
and make the crush rule first pick 1 layer before doing the chooseleaf
over racks.
sage
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: contraining crush placement possibilities
2014-03-07 17:43 ` Sage Weil
@ 2014-03-07 18:00 ` Gregory Farnum
2014-03-10 9:37 ` Li Wang
1 sibling, 0 replies; 15+ messages in thread
From: Gregory Farnum @ 2014-03-07 18:00 UTC (permalink / raw)
To: Sage Weil; +Cc: Dan van der Ster, ceph-devel@vger.kernel.org
On Fri, Mar 7, 2014 at 9:43 AM, Sage Weil <sage@inktank.com> wrote:
> Yeah I'm struggling with this too, but I *think* the top/middle/bottom
> rack analogy is just an easy way to think about constraining the placement
> options, which we're doing anyway with the placement group count--just in
> a way that looks random but is still sampling a small portion of the
> possible combinations. In the end, whether you eliminate 8/9 of the
> options of the rack layers and *then* scale pg_num, or just scale pg_num,
> I think it still boils down to the number of distinct 3-disk sets out of
> the total possible 3-disk sets.
Mmm, the bounds are very different in those two environments, though.
Let's say you have 3 racks of 9 OSDs; with CRUSH splitting across
racks you have 9^3=729 possible combinations of placement; with
thirded racks you have 3*(3^3)=81. If you constrain CRUSH to 81 PGs,
you're going to have a terrible distribution. But with a different
system it's easy to scale your shards within each grouping to maintain
balance within each group, and to adjust the boundaries between groups
as well.
> Also, FWIW, the rack thing is equivalent to making 3 parallel trees so
> that the crush hierarchy goes like:
>
>  root
>   layer of rack (top/middle/bottom)
>    rack
>     host
>      osd
>
> and make the crush rule first pick 1 layer before doing the chooseleaf
> over racks.
That I missed -- I was thinking we didn't have a good way to do the
split in CRUSH, but I guess if you're doing same-rack-pos then just
doing the split at the top you could probably emulate the system above
reasonably well...maybe?
We should run some experiments with the crush tester and figure out if
we can get a reasonable data distribution with reasonable PG counts
under a schema like that.
-Greg
^ permalink raw reply [flat|nested] 15+ messages in thread
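The 729 versus 81 figures in the message above can be reproduced by brute-force enumeration. The sketch below assumes the same toy layout (3 racks, 9 OSDs per rack, each rack cut into top/middle/bottom thirds); the constants and function names are illustrative only, and nothing here runs CRUSH or the crush tester.

```python
from itertools import product

RACKS = 3
OSDS_PER_RACK = 9
LAYERS = 3  # top / middle / bottom
PER_LAYER = OSDS_PER_RACK // LAYERS


def unconstrained_sets() -> set:
    """Every replica set with exactly one OSD per rack, any position."""
    return {
        frozenset((rack, slots[rack]) for rack in range(RACKS))
        for slots in product(range(OSDS_PER_RACK), repeat=RACKS)
    }


def layered_sets() -> set:
    """Replica sets where all replicas come from the same layer:
    pick one layer first, then one OSD per rack within that layer."""
    sets = set()
    for layer in range(LAYERS):
        layer_slots = range(layer * PER_LAYER, (layer + 1) * PER_LAYER)
        for slots in product(layer_slots, repeat=RACKS):
            sets.add(frozenset((rack, slots[rack]) for rack in range(RACKS)))
    return sets


if __name__ == "__main__":
    print(len(unconstrained_sets()), len(layered_sets()))
```

The layered sets are a strict subset of the one-per-rack sets, which is why the layered hierarchy caps the number of usable combinations independently of the PG count.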
* Re: contraining crush placement possibilities
2014-03-07 17:43 ` Sage Weil
2014-03-07 18:00 ` Gregory Farnum
@ 2014-03-10 9:37 ` Li Wang
2014-03-10 16:25 ` Gregory Farnum
1 sibling, 1 reply; 15+ messages in thread
From: Li Wang @ 2014-03-10 9:37 UTC (permalink / raw)
To: Sage Weil, Gregory Farnum; +Cc: Dan van der Ster, ceph-devel@vger.kernel.org
pgp_num is the upper bound on the number of OSD combinations, right?
So we can reduce pgp_num to constrain the possible combinations, and
the data loss probability depends only on pgp_num, say,
pgp_num/Pn(replica_num). (Since (a, b) and (b, a) are different PGs,
it is a permutation rather than a combination.) But we can still
maintain a big pg_num; will that make the object distribution more
uniform? Currently object_id is mapped to pg_id, and then pg_id is
mapped to OSD combinations. Why does it need two levels of mapping?
Why not map object_id to OSD combinations directly? Would that achieve
a more uniform distribution?
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: contraining crush placement possibilities
2014-03-10 9:37 ` Li Wang
@ 2014-03-10 16:25 ` Gregory Farnum
0 siblings, 0 replies; 15+ messages in thread
From: Gregory Farnum @ 2014-03-10 16:25 UTC (permalink / raw)
To: Li Wang; +Cc: Sage Weil, Dan van der Ster, ceph-devel@vger.kernel.org
Since pgp_num is constraining the placement, making the pg_num larger
isn't going to improve the balance. Mapping directly from objects to
OSDs would require a much higher metadata overhead, which is part of
the reason we have PGs.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
On Mon, Mar 10, 2014 at 2:37 AM, Li Wang <liwang@ubuntukylin.com> wrote:
> pgp_num is the upper bound of number of OSD combinations, right?
> so we can reduce pgp_num to constrain the possible combinations,
> and the data loss probability is only dependent on pgp_num,
> say, pgp_num/Pn(replica_num) (Since (a, b) and (b, a) are different pgs, so
> it is permutation rather than combination). But we can still
> maintain a big pg_num, will it make the object distribution more
> uniform? Currently object_id is mapped to pg_id, then pg_id mapped to
> OSD combinations, why does it need two levels of mapping, why not map
> object_id to OSD combinations directly, will it achieve a more uniform
> distribution?
^ permalink raw reply [flat|nested] 15+ messages in thread
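The two-level mapping discussed in the last two messages can be sketched as follows. This is a deliberately simplified stand-in: plain modulo and Python's `random.sample` replace Ceph's real hashing and CRUSH, and all function names are invented for this illustration. The only point it shows is structural: placement is keyed on the PG id folded into pgp_num, so the number of distinct OSD sets is capped by pgp_num however large pg_num is.

```python
import hashlib
import random


def stable_hash(value: str) -> int:
    """Deterministic hash (unlike built-in hash(), not salted per process)."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


def object_to_pg(name: str, pg_num: int) -> int:
    """Level 1: object name -> placement group id (toy version)."""
    return stable_hash(name) % pg_num


def pg_to_osds(pg_id: int, pgp_num: int, num_osds: int, replicas: int = 3):
    """Level 2: PG id -> ordered OSD set, seeded by pg_id folded into pgp_num.

    random.sample stands in for CRUSH here.
    """
    placement_seed = pg_id % pgp_num
    rng = random.Random(placement_seed)
    return tuple(rng.sample(range(num_osds), replicas))


def distinct_osd_sets(pg_num: int, pgp_num: int, num_osds: int) -> int:
    """Count the distinct OSD sets used across all PGs."""
    return len({pg_to_osds(pg, pgp_num, num_osds) for pg in range(pg_num)})


if __name__ == "__main__":
    print(distinct_osd_sets(256, 256, 100), distinct_osd_sets(4096, 256, 100))
```

In this toy, raising pg_num from 256 to 4096 with pgp_num held at 256 yields exactly the same OSD sets, consistent with the reply that a larger pg_num alone does not improve balance.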