* Reliability model for RADOS - effects during second failures
From: Koleos Fuscus @ 2014-07-02 22:33 UTC
To: Loic Dachary, Kyle Bader; +Cc: Sage Weil, ceph-devel@vger.kernel.org

Hi Kyle, Loic,

The current code uses a “FIT rate multiplier” to account for, among other things, the effect of operations done in parallel. That multiplier (n) has an effect on Pfail. For the initial failure, it is calculated from the number of replicas and the stripe count, as seen in
https://github.com/ceph/ceph-tools/blob/master/models/reliability/RadosRely.py#L86.

What doesn’t make sense to me is the way the multiplier is calculated for the failure of the remaining copies in
https://github.com/ceph/ceph-tools/blob/master/models/reliability/RadosRely.py#L92
Why are the stripes not taken into account? What is the purpose of using the “declustering factor” in that equation? Is that equation correct? I read this note by Sage
https://www.mail-archive.com/ceph-devel@vger.kernel.org/msg01650.html
which tries to clarify the role of PGs, but it didn’t help me to understand it.

Besides, I have a simple question related to the equation on L86 for the initial failure. The striping process splits user content into a number of objects equal to the stripe count. That group of objects constitutes an object set. Each object is composed of one or more stripe units. All stripe units across the stripe count are written in parallel. Typically each object is mapped to a different disk. What happens when the object set is full and a new object is started? Are these new objects assigned to the same disks used for the previous, full object set?

Best

koleosfuscus

________________________________________________________________
"My reply is: the software has no known bugs, therefore it has not been updated."
Wietse Venema
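A minimal sketch of how a FIT rate multiplier can enter Pfail for the initial failure. This is not the code in RadosRely.py: it assumes an exponential failure model, and the function name `p_failure`, the FIT value of 800 and the replica and stripe counts are hypothetical, chosen only for illustration.

```python
import math

HOURS_PER_YEAR = 24 * 365
BILLION = 1e9  # FIT is expressed as failures per billion device-hours


def p_failure(fit, period_hours, mult=1):
    """Probability of at least one failure in the period (exponential model)."""
    rate = fit * mult / BILLION  # failures per hour, scaled by the multiplier
    return 1.0 - math.exp(-rate * period_hours)


# Hypothetical inputs: 3 replicas, stripe count 4, drive FIT of 800.
copies, stripe = 3, 4
p_single = p_failure(800, HOURS_PER_YEAR)
# Initial failure: any of the copies * stripe drives holding part of the
# object can fail, so the multiplier is copies * stripe (assumption taken
# from the description in the message above).
p_first = p_failure(800, HOURS_PER_YEAR, mult=copies * stripe)
print(p_single, p_first)
```

Under this assumption a larger stripe count raises the probability of a first failure, which is why the question of whether stripes should also appear in the second-failure multiplier matters.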
* Re: Reliability model for RADOS - effects during second failures
From: Kyle Bader @ 2014-07-03 5:09 UTC
To: Koleos Fuscus; +Cc: Loic Dachary, Sage Weil, ceph-devel@vger.kernel.org

> The current code uses a “FIT rate multiplier” to account for, among other
> things, the effect of operations done in parallel. That multiplier (n) has
> an effect on Pfail. For the initial failure, it is calculated from the
> number of replicas and the stripe count, as seen in
> https://github.com/ceph/ceph-tools/blob/master/models/reliability/RadosRely.py#L86.

I'm not sure what term we want to use for the thing whose durability we are calculating, but for the sake of this explanation I'll use "artifact", which will refer to a collection of objects that compose a:

1. RADOS object (stripe count=1)
2. RBD volume
3. RGW S3 or Swift object
4. RGW metadata pools
5. I'm probably forgetting something

My interpretation of the model's progression is:

1. Global population of placement groups, perhaps because we need the entire pool intact, e.g. RGW metadata pools (upper bound for stripe count).
2. Subsection of placement groups in which we will place portions of our artifact, e.g. based on the size of RBD/RGW artifacts striped across RADOS objects.
3. Multiplier, to account for the fact that the placement group will become degraded if any of its members are marked out due to failure.

> What doesn’t make sense to me is the way the multiplier is
> calculated for the failure of the remaining copies in
> https://github.com/ceph/ceph-tools/blob/master/models/reliability/RadosRely.py#L92
> Why are the stripes not taken into account?
Stripes are not taken into account because at this point in the model we are calculating the chances of the degraded placement group becoming further degraded by suffering the loss of another member. Failures of other placement groups in the same stripe during the recovery of our placement group should be calculated as an independent event.

> What is the purpose of
> using the “declustering factor” in that equation?

My understanding is that the declustering factor is synonymous with placement groups (PGs).

> Is that equation
> correct? I read this note by Sage
> https://www.mail-archive.com/ceph-devel@vger.kernel.org/msg01650.html
> which tries to clarify the role of PGs, but it didn’t help me to understand it.

To distribute objects across the cluster we need to divvy up objects into groupings; in the context of Ceph those groupings are PGs (placement groups). There is a cost associated with maintaining each placement group, and the benefit is that finer distribution granularity can improve utilization at the high end. This should be reflected in the full/nearfull tunables we set for our cluster:

http://ceph.com/docs/master/rados/configuration/mon-config-ref/#storage-capacity

> Besides, I have a simple question related to the equation on L86 for
> the initial failure. The striping process splits user content into a
> number of objects equal to the stripe count. That group
> of objects constitutes an object set. Each object is composed of one
> or more stripe units. All stripe units across the stripe count are written in
> parallel. Typically each object is mapped to a different disk. What
> happens when the object set is full and a new object is started?

It places a second (or further) object in one of the placement groups that already holds another object belonging to the same artifact. In this way you can have arbitrarily sized artifacts and still limit the number of placement groups involved, in order to reduce the probability of failure.
--
Kyle Bader - Inktank
Senior Solution Architect
* Re: Reliability model for RADOS - effects during second failures
From: Koleos Fuscus @ 2014-07-04 0:58 UTC
To: Kyle Bader; +Cc: Loic Dachary, Sage Weil, ceph-devel@vger.kernel.org

Hello Kyle,

Thanks for your e-mail.

> 1. RADOS object (stripe count=1)

If I understand correctly, a RADOS object can be stored in a stripe with count=n; maybe 1 is the default.

> My interpretation of the model's progression is:
> 1. Global population of placement groups, perhaps because we need the
> entire pool intact, e.g. RGW metadata pools (upper bound for stripe
> count).
> 2. Subsection of placement groups in which we will place portions of
> our artifact, e.g. based on the size of RBD/RGW artifacts striped across
> RADOS objects.
> 3. Multiplier, to account for the fact that the placement group will
> become degraded if any of its members are marked out due to failure.

I cannot understand what you said above. The current tool refers to a RADOS object. Do we need to differentiate things at a fine grain (RBD, RGW)? I am not sure it is relevant.

I will transcribe a few things from
https://github.com/ceph/ceph-tools/blob/master/models/reliability/README.html

"This is a model of the durability of a single, arbitrary object....That object lives in a PG."

I think it is more correct to say that the object lives not in a PG but in a pool. If the pool is replicated, the number of PGs inside a pool is (#OSDs x #PGs_per_OSD) / #replicas, rounded to the nearest power of two.

Now we can list the components that can fail in our model. An OSD node can fail. An OSD node can contain many disks, and each disk can fail. What does a PG failure mean? Does it make sense to have many PGs (from the same pool) on the same disk? If multiple PGs reside on the same disk, can a PG failure refer to the failure of a disk sector?
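The PG-count rule stated above can be sketched as follows. This is only an illustration of the formula as written in this message; the function name `pg_count` is hypothetical, and rounding a tie upward is an assumption, since the message does not say how ties are resolved.

```python
def pg_count(osds, pgs_per_osd, replicas):
    """(#OSDs x #PGs_per_OSD) / #replicas, rounded to the nearest power of two."""
    raw = osds * pgs_per_osd / replicas
    if raw < 1:
        return 1
    lower = 1 << (int(raw).bit_length() - 1)  # largest power of two <= raw
    upper = lower * 2
    # Pick whichever power of two is closer; a tie goes to the larger one.
    return lower if raw - lower < upper - raw else upper


# Hypothetical cluster sizes, for illustration only.
print(pg_count(4, 100, 3))
print(pg_count(100, 100, 3))
```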
First failure: At this point we need to introduce stripes into the equation. Since the original object gets striped and the stripes go to different OSDs, the stripe count is important. Therefore, the FIT rate multiplier includes "replicas*stripes" to calculate Pfail. That makes sense to me.

> Stripes are not taken into account because at this point in the model
> we are calculating the chances of the degraded placement group
> becoming further degraded by suffering the loss of another member.
> Failures of other placement groups in the same stripe during the
> recovery of our placement group should be calculated as an independent
> event.

I think I follow. But the concept of PGs/declustering is still giving me some concerns. To illustrate, I will use a toy example:

1. Object (example object: block of 100KB)
2. Object is striped into a 4-unit stripe: obj1 obj2 obj3 obj4 (each of 25KB)
3. Object is replicated 3-way: obj1_rep1, obj1_rep2, obj1_rep3, obj2_rep1, ....
4. Object is placed on different OSDs, and maybe in different PGs inside the same OSDs

Imagine this situation for 4 OSDs and 100 PGs per OSD:

OSD1: obj1_rep1, obj2_rep2...
OSD2: obj2_rep1, obj3_rep2, obj1_rep3...
OSD3: obj3_rep1, obj4_rep2...
OSD4: obj4_rep1, obj1_rep2...

Now, imagine that OSD1 fails. Let's say OSD1 has only one PG, so all the chunks inside OSD1 are missing. We focus our study on the durability of obj1. With the first failure, obj1_rep1 is lost. In addition, obj2_rep2 is also missing, but we ignore other elements of the same stripe. As you said, we are not interested in independent elements of degraded stripes... (some doubts remain regarding whether or not this obj2_rep2 should be considered in the repair process).

The repair process is launched after the first failure. It needs to copy all replicas to a spare OSD. I understand that declustering is necessary for performance, but why is it used here in the model?

A second failure occurs.
The FIT rate multiplier considers '#copies-1' and the 'declustering factor/PGs'. The period used to calculate Pfail is not the lifetime of the object but the repair time. Repair time is the number of bytes to be recovered divided by the repair speed and the declustering factor. Adding the declustering factor to the FIT multiplier therefore cancels the declustering factor in the repair time. I wonder why it is considered in the repair in the first place. Is it equivalent to the stripe count (pg=4 instead of the default value of 100)?

Best,

koleosfuscus
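The cancellation described above can be checked with a few lines of arithmetic. This is a sketch built only from the description in this message, not from RadosRely.py itself; the function name `second_failure_exposure` and the sample inputs (3 copies, 1 TB to recover, 50 MB/s) are hypothetical.

```python
def second_failure_exposure(copies, pgs, bytes_to_recover, repair_speed):
    """Return (multiplier, repair_seconds, multiplier * repair_seconds)."""
    # FIT rate multiplier as described in the thread: surviving copies times PGs.
    mult = (copies - 1) * pgs
    # Declustered repair: the recovery work is shared by `pgs` peers in parallel.
    repair_seconds = bytes_to_recover / (repair_speed * pgs)
    return mult, repair_seconds, mult * repair_seconds


# Hypothetical inputs: 3 copies, 1 TB to recover at 50 MB/s per peer.
for pgs in (4, 100):
    mult, seconds, exposure = second_failure_exposure(3, pgs, 1e12, 50e6)
    print(f"pgs={pgs} mult={mult} repair={seconds:.1f}s exposure={exposure:.1f}")
```

If Pfail depends only on the product of failure rate, multiplier and period (as it does under an exponential model, which is assumed here), the declustering factor drops out of the second-failure term: a larger factor shortens the repair window by exactly the amount by which it multiplies the number of drives at risk.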
* Re: Reliability model for RADOS - effects during second failures
From: Loic Dachary @ 2014-07-03 7:10 UTC
To: Koleos Fuscus; +Cc: ceph-devel@vger.kernel.org

Hi koleosfuscus,

On 03/07/2014 00:33, Koleos Fuscus wrote:
> Hi Kyle, Loic,
>
> The current code uses a “FIT rate multiplier” to account for, among other
> things, the effect of operations done in parallel. That multiplier (n) has
> an effect on Pfail. For the initial failure, it is calculated from the
> number of replicas and the stripe count, as seen in
> https://github.com/ceph/ceph-tools/blob/master/models/reliability/RadosRely.py#L86.
>
> What doesn’t make sense to me is the way the multiplier is
> calculated for the failure of the remaining copies in
> https://github.com/ceph/ceph-tools/blob/master/models/reliability/RadosRely.py#L92
> Why are the stripes not taken into account? What is the purpose of
> using the “declustering factor” in that equation? Is that equation
> correct? I read this note by Sage
> https://www.mail-archive.com/ceph-devel@vger.kernel.org/msg01650.html
> which tries to clarify the role of PGs, but it didn’t help me to understand it.

At the risk of adding confusion to the discussion, does the current reliability model make room to take into account what is described in anrg.usc.edu/~maheswaran/Xorbas.pdf under "4. Reliability Analysis"? In other words, is there a place where one could set things like "disks fail % of the time", "network is X Gb/s" and "repairing a disk failure requires reading B bytes from M disks"? As far as I understand, such factors cannot be expressed with a single formula, and this is why a Markov model is useful.
> Besides, I have a simple question related to the equation on L86 for
> the initial failure. The striping process splits user content into a
> number of objects equal to the stripe count. That group
> of objects constitutes an object set. Each object is composed of one
> or more stripe units. All stripe units across the stripe count are written in
> parallel. Typically each object is mapped to a different disk. What
> happens when the object set is full and a new object is started? Are
> these new objects assigned to the same disks used for the previous, full
> object set?

In an ideal situation, if a disk / OSD is full it means the whole cluster is full. Is it reasonable to ignore this situation when thinking about the reliability model? If not, could you explain how?

Cheers

--
Loïc Dachary, Artisan Logiciel Libre
* Re: Reliability model for RADOS - effects during second failures
From: Koleos Fuscus @ 2014-07-07 14:55 UTC
To: Loic Dachary; +Cc: ceph-devel@vger.kernel.org, Kyle Bader

Hi Loic,

> At the risk of adding confusion to the discussion, does

Indeed, you are right: answering questions with new questions adds confusion ;) I will open another thread to discuss your e-mail.

I am aware that it might be difficult to answer my previous mail, but I need to understand which parts of Ceph are being modelled in the original tool. The documentation is too vague. The author does not even mention Markov models anywhere in the code documentation.

Cheers,

Koleosfuscus