* About the blueprint OSD: Transactions
From: Li Wang @ 2015-03-03 9:32 UTC
To: Sage Weil, Josh Durgin, pmcgarry; +Cc: ceph-devel

Hi Sage,

We are very interested in the multi-object transaction support; we
think it is potentially very useful. We have read your implementation
description and summarized it below. Please check whether our
understanding is correct:

1 The client selects a master and sends the full txn to the master
2 The master holds the txn in memory and sends PREPAREs to the slaves
3 The slaves persist the PREPARE on the side and send PREPARE_ACK;
  if there is a compare-then-write operation and the comparison
  fails, the slave sends PREPARE_FAIL instead
4 The master collects all PREPARE_ACKs, applies the txn, and marks
  the txn COMMITTING; if a PREPARE_FAIL is received, the master sends
  ROLL_BACK to the slaves, and the slaves discard the prepared txn
5 Once persisted, the master sends COMMITs to the slaves
6 The master replies COMMITTED to the client, so that the client can
  proceed with other operations except reading the committed data
7 The slaves get the COMMIT, apply it, and reply with COMMIT_ACK
8 The master collects the COMMIT_ACKs and replies FINISHED to the
  client, so that the client can read the data
9 The master closes out the txn record

We think this is enough to implement a single transaction. However,
it does not take concurrent transactions into account: how to enforce
ordering and atomicity among the distributed transactions, and how to
do locking and deadlock avoidance. It seems there is some further
design work to do.

We are wondering if you can move this blueprint discussion to a
UTC+8 friendly time, so that we can take part.

Cheers,
Li Wang
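The nine steps summarized above can be sketched as a toy, single-process model. This is only an illustration under stated assumptions, not the blueprint's implementation: the `Slave` class, `run_txn`, and the `ROLLED_BACK` client reply are hypothetical names, persistence is modeled as a dictionary, the master's own part of the txn is omitted, and failure-path handling (OSD down, peering) is left out entirely.

```python
from dataclasses import dataclass, field


@dataclass
class Slave:
    """Hypothetical participant that owns a single integer-valued object."""
    value: int
    prepared: dict = field(default_factory=dict)  # txn_id -> pending new value

    def prepare(self, txn_id, expected, new):
        # Step 3: a failed compare-then-write yields PREPARE_FAIL.
        if expected is not None and expected != self.value:
            return "PREPARE_FAIL"
        self.prepared[txn_id] = new  # stands in for "persist PREPARE on the side"
        return "PREPARE_ACK"

    def rollback(self, txn_id):
        # Step 4, failure branch: discard the prepared txn, if any.
        self.prepared.pop(txn_id, None)

    def commit(self, txn_id):
        # Step 7: apply the prepared write.
        self.value = self.prepared.pop(txn_id)
        return "COMMIT_ACK"


def run_txn(txn_id, ops):
    """ops: list of (slave, expected_or_None, new_value).

    Returns the replies the client would see, in order.
    """
    # Steps 2-3: send PREPARE to every slave and collect the replies.
    replies = [slave.prepare(txn_id, exp, new) for slave, exp, new in ops]
    if "PREPARE_FAIL" in replies:
        for slave, _, _ in ops:
            slave.rollback(txn_id)
        return ["ROLLED_BACK"]  # hypothetical client-facing reply
    client = ["COMMITTED"]  # steps 4-6: all PREPARE_ACKs collected
    acks = [slave.commit(txn_id) for slave, _, _ in ops]
    if all(ack == "COMMIT_ACK" for ack in acks):
        client.append("FINISHED")  # step 8
    return client
```

For example, a txn whose compare-then-write fails on one slave leaves every participant unchanged, while a txn whose comparisons all hold produces `COMMITTED` followed by `FINISHED`.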
* Re: About the blueprint OSD: Transactions
From: Sage Weil @ 2015-03-03 22:52 UTC
To: Li Wang; +Cc: Josh Durgin, pmcgarry, ceph-devel

On Tue, 3 Mar 2015, Li Wang wrote:
> Hi Sage,
>
> We are very interested in the multi-object transaction support; we
> think it is potentially very useful. We have read your implementation
> description and summarized it below. Please check whether our
> understanding is correct:
>
> 1 The client selects a master and sends the full txn to the master
> 2 The master holds the txn in memory and sends PREPAREs to the slaves
> 3 The slaves persist the PREPARE on the side and send PREPARE_ACK;
>   if there is a compare-then-write operation and the comparison
>   fails, the slave sends PREPARE_FAIL instead
> 4 The master collects all PREPARE_ACKs, applies the txn, and marks
>   the txn COMMITTING; if a PREPARE_FAIL is received, the master sends
>   ROLL_BACK to the slaves, and the slaves discard the prepared txn
> 5 Once persisted, the master sends COMMITs to the slaves
> 6 The master replies COMMITTED to the client, so that the client can
>   proceed with other operations except reading the committed data
> 7 The slaves get the COMMIT, apply it, and reply with COMMIT_ACK
> 8 The master collects the COMMIT_ACKs and replies FINISHED to the
>   client, so that the client can read the data
> 9 The master closes out the txn record

Yep! Plus the failure path handling...

> We think this is enough to implement a single transaction. However,
> it does not take concurrent transactions into account: how to enforce
> ordering and atomicity among the distributed transactions, and how to
> do locking and deadlock avoidance. It seems there is some further
> design work to do.

Yeah. I think it would be nice if we can define a few simple flags
indicating whether the masters and/or slaves are readable during the
prepared-but-uncommitted phase, as there are different requirements for
different users.

And we need to pick a (simple!) deadlock avoidance approach. Maybe a
simple EAGAIN is enough and leave it to the clients to be consistent about
which object to choose as the master.

> We are wondering if you can move this blueprint discussion to a
> UTC+8 friendly time, so that we can take part.

I think Patrick is moving it!

Thanks-
sage
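One way to picture the readability flags mentioned above is a small bit-flag type checked on every read. This is a sketch under assumptions only: `TxnFlags`, its member names, and `read_object` are hypothetical and do not correspond to any existing Ceph API; the only idea taken from the message is that a read of an object in the prepared-but-uncommitted phase either proceeds or gets EAGAIN, depending on a per-transaction flag.

```python
import errno
from enum import Flag, auto


class TxnFlags(Flag):
    """Hypothetical per-transaction flags (names are assumptions)."""
    NONE = 0
    MASTER_READABLE = auto()  # master object readable while txn is prepared
    SLAVE_READABLE = auto()   # slave objects readable while txn is prepared


def read_object(value, has_prepared_txn, is_master, flags):
    """Return (errno_code, value); errno_code is 0 on success."""
    if has_prepared_txn:
        needed = TxnFlags.MASTER_READABLE if is_master else TxnFlags.SLAVE_READABLE
        if not (flags & needed):
            # Reader must retry once the txn commits or rolls back.
            return (errno.EAGAIN, None)
    return (0, value)
```

With `flags=TxnFlags.MASTER_READABLE`, a read on the master succeeds during the prepared phase while a read on a slave is told to retry.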
* Re: About the blueprint OSD: Transactions
From: Patrick McGarry @ 2015-03-03 23:03 UTC
To: Sage Weil; +Cc: Li Wang, Josh Durgin, ceph-devel

Yep, I bumped the OSD: Transactions discussion to the end of the day.
Let me know if you see anything else that looks amiss (including my
timezone math!). Thanks.

--
Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com || http://community.redhat.com
@scuttlemonkey || @ceph
[parent not found: <ACIAwADIAN8kzvlukrK0Farh.1.1425481937528.Hmail.liwang@ubuntukylin.com>]
* Re:Re: About the blueprint OSD: Transactions
From: Sage Weil @ 2015-03-05 0:55 UTC
To: Li Wang; +Cc: Josh Durgin, pmcgarry, ceph-devel

On Wed, 4 Mar 2015, Li Wang wrote:
> Hi Sage, Please take a look if the below works,
> [...]

I think this works. A few notes:

1- I don't think there's a need to persist the txn on the master until the
slaves reply with PREPARE_ACK.

2- This is basically optimistic concurrency with backoff if
possible deadlock is detected. I think we can do the same thing in the
proposal in the blueprint if a PREPARE sees that a txn (in-memory) is
pending or if a client txn is received and there is a pending PREPARE. In
the latter case, it seems like we should block and wait...

3- In either scheme, we can do full deadlock avoidance if we force
the master to be the lowest-sorting object name, or something like that.
But I think that will have a performance impact since there is likely a
best choice for master depending on the transaction itself... like a txn
that writes 4MB to an object and inserts a pointer in another object;
clearly the 4MB piece should be the master so that it is only written once
and doesn't cross the network.

sage

> 1 The client calculates the PG that the master object suggested by
>   the programmer belongs to, retrieves the primary OSD of that PG
>   (called the master), and sends the full transaction to it
> 2 The master persists the whole transaction in the corresponding PG
>   metadata
> 3 The master parses the transaction to obtain the set of slave OSDs,
>   which are the primary OSDs of the other PGs the transaction refers
>   to, and sends PREPARE, together with the part of the transaction to
>   be done on each individual PG, to the corresponding slave OSD
> 4 Each slave OSD checks whether there is a PREPARED-BUT-UNCOMMITTED
>   transaction in its PG metadata such that the two transactions share
>   at least one write operation on the same object. If so, the slave
>   OSD gives up preparing and replies PREPARE_AGAIN. Otherwise, it
>   performs all the read-and-comparison operations in its received
>   transaction part and replies PREPARE_FAIL if any of them fails. If
>   all succeed, it persists its transaction part in its PG metadata
>   and replies PREPARE_ACK
> 5 The master collects all PREPARE_ACKs and replies PREPARED to the
>   client. If a PREPARE_FAIL is received, the master replies ERROR to
>   the client and sends ROLL_BACK to the slaves; the slaves discard
>   their prepared transaction part, if any, and reply ROLL_BACK_ACK.
>   The master collects all ROLL_BACK_ACKs and discards the
>   transaction. If a PREPARE_AGAIN is received, the process is similar
>   to PREPARE_FAIL except that the master replies EAGAIN to the client
> 6 The master sends COMMIT to the slaves
> 7 The slaves get the COMMIT, commit their individual transaction
>   part, and reply COMMIT_ACK
> 8 The master collects all COMMIT_ACKs and replies COMMITTED to the
>   client
> 9 The master closes out the transaction record
>
> It seems to work without deadlocking under normal conditions.
> However, there are still many kinds of errors it needs to take into
> account, such as PG changes, OSDs going down, etc., doesn't it?
>
> Cheers,
> Li Wang
>
> -----Original Message-----
> From: Sage Weil
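Step 4 of the quoted proposal, the slave-side PREPARE handling, can be sketched as follows. This is a minimal illustration under assumptions: `SlavePG`, `handle_prepare`, and the dictionary shapes are hypothetical, and "PG metadata" is modeled as an in-process dictionary rather than anything persistent.

```python
class SlavePG:
    """Hypothetical slave-side state for one PG."""

    def __init__(self, objects):
        self.objects = dict(objects)  # object name -> current value
        self.prepared = {}            # txn_id -> {object name: new value}

    def handle_prepare(self, txn_id, compares, writes):
        """compares: {name: expected value}; writes: {name: new value}."""
        # Conflict check: another prepared-but-uncommitted txn writes
        # one of the objects this txn wants to write.
        for other_writes in self.prepared.values():
            if set(other_writes) & set(writes):
                return "PREPARE_AGAIN"
        # Read-and-comparison operations: any mismatch fails the prepare.
        for name, expected in compares.items():
            if self.objects.get(name) != expected:
                return "PREPARE_FAIL"
        # Persist this slave's part of the txn (modeled as a dict entry).
        self.prepared[txn_id] = dict(writes)
        return "PREPARE_ACK"
```

The three possible replies map directly onto the master's three branches in step 5: continue, report ERROR and roll back, or report EAGAIN and roll back.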
* Re: About the blueprint OSD: Transactions
From: Li Wang @ 2015-03-05 3:54 UTC
To: Sage Weil; +Cc: Josh Durgin, pmcgarry, ceph-devel

On 2015/3/5 8:56, Sage Weil wrote:
> On Wed, 4 Mar 2015, Li Wang wrote:
>> Hi Sage, Please take a look if the below works,
>> [...]
>
> I think this works. A few notes:
>
> 1- I don't think there's a need to persist the txn on the master until the
> slaves reply with PREPARE_ACK.

I think the txn must be persisted on the master side at the very
start: once the master has sent messages to the slaves, there must be
a mechanism for resending the ROLL_BACK message to the slaves if the
master goes down. Only a small part of the transaction information,
rather than all of it, may need to be persisted.

> 2- This is basically optimistic concurrency with backoff if
> possible deadlock is detected. I think we can do the same thing in the
> proposal in the blueprint if a PREPARE sees that a txn (in-memory) is
> pending or if a client txn is received and there is a pending PREPARE. In
> the latter case, it seems like we should block and wait...

Yes. We can divide the process into two steps. The first step is
PREPARE, which is only for deadlock avoidance and only involves
in-memory operations on the slaves' side. First, the master sends
PREPARE to the slaves. Each slave checks whether there is a pending
transaction in memory; if so, it replies EAGAIN to the master,
otherwise it replies PREPARE_ACK. This gives extremely fast deadlock
avoidance. The master collects all PREPARE_ACKs and sends COMMIT to
the slaves; the slaves then commit their transaction part to PG
metadata and reply COMMIT_ACK to the master.

> 3- In either scheme, we can do full deadlock avoidance if we force
> the master to be the lowest-sorting object name, or something like that.
> But I think that will have a performance impact since there is likely a
> best choice for master depending on the transaction itself... like a txn
> that writes 4MB to an object and inserts a pointer in another object;
> clearly the 4MB piece should be the master so that it is only written once
> and doesn't cross the network.
>
> sage
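The trade-off in note 3 (quoted in this message) can be made concrete with a few lines of arithmetic. This is a sketch with hypothetical helper names and made-up object names; the only figures taken from the thread are the 4MB write and the small pointer insert (assumed here to be 64 bytes). It models the cost as the number of payload bytes the master must forward to slaves.

```python
def master_by_name(writes):
    """Deterministic rule: the lowest-sorting object name is the master."""
    return min(writes)


def master_by_payload(writes):
    """Performance rule: the object receiving the most bytes is the master."""
    return max(writes, key=writes.get)


def bytes_forwarded(writes, master):
    """Payload bytes the master has to send on to the slaves."""
    return sum(size for name, size in writes.items() if name != master)


# Hypothetical txn: a 4MB data write plus a small pointer insert.
writes = {"index_obj": 64, "zz_data_obj": 4 * 1024 * 1024}
```

Here the name-based rule picks `index_obj` and forwards the whole 4MB to a slave, while the payload-based rule picks `zz_data_obj` and forwards only the 64-byte pointer update. The name-based rule buys a fixed ordering at that cost.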
* Re: About the blueprint OSD: Transactions
From: Sage Weil @ 2015-03-05 7:38 UTC
To: Li Wang; +Cc: Josh Durgin, pmcgarry, ceph-devel

On Thu, 5 Mar 2015, Li Wang wrote:
> On 2015/3/5 8:56, Sage Weil wrote:
> > On Wed, 4 Mar 2015, Li Wang wrote:
> > > Hi Sage, Please take a look if the below works,
> > > [...]
> >
> > I think this works. A few notes:
> >
> > 1- I don't think there's a need to persist the txn on the master until the
> > slaves reply with PREPARE_ACK.
>
> I think the txn must be persisted on the master side at the very
> start: once the master has sent messages to the slaves, there must be
> a mechanism for resending the ROLL_BACK message to the slaves if the
> master goes down. Only a small part of the transaction information,
> rather than all of it, may need to be persisted.

I think we can still skip it because it's not about durability (master and
slave are both PGs that are replicated), just about coordination. If the
master repeers, the slaves will ask whether to roll forward or back and the
(new) master will respond with ROLLBACK or COMMIT.

If you missed the CDS session it should be posted on youtube shortly... we
discussed both possibilities. We think the main difference is that in
your case you have to do a double write (prepare + commit on master) but
that hides the commit latency since you can reply when you get the
PREPARE_ACKs. In my proposal, you only write once on the master, but you
have to wait for the PREPAREs, and then write the COMMIT, and then reply
to the clients.. which will have a higher total latency.

> > 2- This is basically optimistic concurrency with backoff if
> > possible deadlock is detected. I think we can do the same thing in the
> > proposal in the blueprint if a PREPARE sees that a txn (in-memory) is
> > pending or if a client txn is received and there is a pending PREPARE. In
> > the latter case, it seems like we should block and wait...
>
> Yes. We can divide the process into two steps. The first step is
> PREPARE, which is only for deadlock avoidance and only involves
> in-memory operations on the slaves' side. First, the master sends
> PREPARE to the slaves. Each slave checks whether there is a pending
> transaction in memory; if so, it replies EAGAIN to the master,
> otherwise it replies PREPARE_ACK. This gives extremely fast deadlock
> avoidance. The master collects all PREPARE_ACKs and sends COMMIT to
> the slaves; the slaves then commit their transaction part to PG
> metadata and reply COMMIT_ACK to the master.

Backing off if any affected object has another in-flight transaction is
sufficient but also conservative since we'll fail/retry transactions that
actually could have completed w/o deadlocking. The alternative is to
leave it to the client to only propose transactions that won't conflict.
The latter is certainly an easier first version to implement :) but it may
also be that it's all that we want. Solving the deadlock avoidance in the
general case sucks. :(

Maybe a simple backoff like you propose is a decent middle ground... I
suspect, though, that a large portion of transactions in the real world
will be A+B, A+C, A+D, etc where they are non-deadlocking but do overlap
(e.g. on an index or metadata object).

sage
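To see why the simple backoff is conservative for the A+B, A+C, A+D pattern, consider a minimal admission check. This is a sketch under assumptions: `admit` and the shape of the `inflight` map are hypothetical, and the rule modeled is only "reject if any touched object already has an in-flight transaction".

```python
def admit(inflight, txn_objects):
    """Conservative backoff rule.

    inflight: {txn_id: set of object names} for transactions in flight.
    Returns "EAGAIN" if the new txn overlaps any of them, else "PREPARE_ACK".
    """
    busy = set().union(*inflight.values())
    return "EAGAIN" if busy & set(txn_objects) else "PREPARE_ACK"


# T1 = A+B is in flight; A is a shared index or metadata object.
inflight = {"T1": {"A", "B"}}
```

Under this rule `admit(inflight, {"A", "C"})` returns `EAGAIN`: A+C is told to retry purely because it shares A with A+B, even though the two transactions would not deadlock.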
* Re: About the blueprint OSD: Transactions
From: Li Wang @ 2015-03-10 2:09 UTC
To: Sage Weil; +Cc: Josh Durgin, pmcgarry, ceph-devel, Samuel Just

The atomicity semantics of a transaction must not be violated. Suppose
there are two concurrent transactions: T1 (Transaction 1) writes the
set of objects {A, B, C}, and T2 touches {B, C, D}, where each object
is on a different OSD, and A and D are selected as the respective
masters. For simplicity, suppose T1 writes the value 1 to each of its
objects, while T2 writes 2. Then only two results are legal, either
A=B=C=1 or B=C=D=2; it is forbidden that B=1 and C=2, or vice versa.
Suppose OSD_B receives the PREPAREs in the order (T1, T2), while OSD_C
receives them in the order (T2, T1). This can happen since T1 and T2
are managed by different masters. The operation sequence is as follows:

1. OSD_B receives PREPARE from T1 and does the preparation
2. OSD_C receives PREPARE from T2 and does the preparation
3. OSD_B receives PREPARE from T2, finds an in-flight transaction on
   B, and waits for T1 to finish
4. OSD_C receives PREPARE from T1, finds an in-flight transaction on
   C, and waits for T2 to finish

Obviously, this results in a deadlock. So if there is an in-flight
transaction that shares a write on the same object, the slave should
not wait. It also cannot accept, otherwise atomicity may be violated:
in the above example, if the two OSDs accept the PREPAREs in Steps 3
and 4, the final result after the two transactions finish will be
B=2, C=1. Note that forcing the master to be the lowest-sorting object
name does not seem to fix this problem either, if A and D sort lower
than B and C. So it seems the only option is to give up and retry in
such a case.
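The four-step interleaving above can be replayed with a small simulation that compares the "wait" policy with the "give up and retry" policy. It is a simplified model under assumptions: `deliver` and `has_deadlock` are hypothetical helpers, each OSD holds one object, and each transaction waits on at most one other transaction, which is all this example needs.

```python
def deliver(prepare_order, policy):
    """Replay PREPARE arrivals.

    prepare_order: list of (osd, txn) in arrival order.
    policy: "wait" or "eagain".
    Returns (holder, waiting, rejected).
    """
    holder = {}       # osd -> txn that prepared there first
    waiting = {}      # txn -> txn it is waiting for ("wait" policy)
    rejected = set()  # txns told to back off ("eagain" policy)
    for osd, txn in prepare_order:
        if osd not in holder:
            holder[osd] = txn
        elif policy == "wait":
            waiting[txn] = holder[osd]
        else:
            rejected.add(txn)
    return holder, waiting, rejected


def has_deadlock(waiting):
    """A cycle in the wait-for edges means no txn can ever finish."""
    for start in waiting:
        seen, cur = set(), start
        while cur in waiting:
            if cur in seen:
                return True
            seen.add(cur)
            cur = waiting[cur]
    return False


# Steps 1-4 from the message: OSD_B sees (T1, T2), OSD_C sees (T2, T1).
order = [("OSD_B", "T1"), ("OSD_C", "T2"), ("OSD_B", "T2"), ("OSD_C", "T1")]
```

With `policy="wait"` the wait-for edges form the cycle T2 -> T1 -> T2. With `policy="eagain"` nothing waits, but both T1 and T2 are rejected and have to be resent by their clients.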
Please check the following:

(1) The client calculates the PG that the master object suggested by
    the programmer belongs to, retrieves the primary OSD of that PG
    (called the master), and sends the full transaction to it
(2) The master holds the transaction in memory and sends PREPARE to
    the slaves
(3) Each slave checks whether there is at least one in-flight
    transaction on the same object. If so, it replies EAGAIN to the
    master; otherwise it replies PREPARE_ACK
(4) The master collects the PREPARE_ACKs and sends COMMIT to the
    slaves. If an EAGAIN is received, the master replies EAGAIN to the
    client, sends ROLL_BACK to any prepared slaves, discards the
    transaction, and expects the client to resend the transaction with
    a newer id
(5) Each slave performs all the read-and-comparison operations and
    replies EFAIL if any operation fails. If all succeed, the slave
    commits the transaction into the journal of the PG metadata and
    replies COMMIT_ACK to the master
(6) The master collects the COMMIT_ACKs, replies COMMITTED to the
    client, and sends APPLY to the slaves. If an EFAIL is received
    from a slave, the master replies EFAIL to the client, sends
    ROLL_BACK to the slaves, and discards the transaction
(7) Each slave applies the transaction from the journal or PG metadata
    to the actual objects and replies APPLY_ACK to the master
(8) The master collects the APPLY_ACKs, replies APPLIED to the client,
    and closes out the transaction

Note that this does not describe the persist operation on the master
side, because in terms of the PREPARE, COMMIT and APPLY process the
master acts exactly as a slave does. For example, in Step 3 the master
also checks whether there is a conflicting in-flight transaction.

Cheers,
Li Wang
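The PREPARE / COMMIT / APPLY flow in steps (1) to (8) of this message can be sketched as a sequential, single-process model. The names `Osd` and `run` are hypothetical, the in-memory conflict table and the journal are plain dictionaries, and the master is treated as just another participant, as the message suggests. Messaging, replication and crash recovery are not modeled.

```python
class Osd:
    """Hypothetical participant (master or slave) in the three-phase flow."""

    def __init__(self, objects):
        self.objects = dict(objects)  # object name -> value
        self.inflight = {}            # object name -> txn id (in memory only)
        self.journal = {}             # txn id -> {object name: new value}

    def prepare(self, txn_id, names):
        # Step 3: back off if any touched object has an in-flight txn.
        if any(name in self.inflight for name in names):
            return "EAGAIN"
        for name in names:
            self.inflight[name] = txn_id
        return "PREPARE_ACK"

    def commit(self, txn_id, compares, writes):
        # Step 5: run the comparisons, then journal the writes.
        if any(self.objects.get(n) != v for n, v in compares.items()):
            return "EFAIL"
        self.journal[txn_id] = dict(writes)
        return "COMMIT_ACK"

    def apply(self, txn_id):
        # Step 7: move journaled writes to the actual objects.
        self.objects.update(self.journal.pop(txn_id))
        self._release(txn_id)
        return "APPLY_ACK"

    def rollback(self, txn_id):
        self.journal.pop(txn_id, None)
        self._release(txn_id)

    def _release(self, txn_id):
        # Drop only the in-flight markers owned by this txn.
        self.inflight = {n: t for n, t in self.inflight.items() if t != txn_id}


def run(txn_id, parts):
    """parts: list of (osd, compares, writes). Returns the final client reply."""
    for osd, compares, writes in parts:
        if osd.prepare(txn_id, set(compares) | set(writes)) == "EAGAIN":
            for other, _, _ in parts:
                other.rollback(txn_id)
            return "EAGAIN"  # step 4: client resends with a newer id
    for osd, compares, writes in parts:
        if osd.commit(txn_id, compares, writes) == "EFAIL":
            for other, _, _ in parts:
                other.rollback(txn_id)
            return "EFAIL"   # step 6, failure branch
    # Step 6: the client would be told COMMITTED at this point.
    for osd, _, _ in parts:
        osd.apply(txn_id)
    return "APPLIED"         # step 8
```

Because `rollback` only releases markers owned by the given txn id, a rejected transaction never clears the in-flight state of the transaction it collided with.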