From: Li Wang <liwang@ubuntukylin.com>
To: Sage Weil <sweil@redhat.com>
Cc: Josh Durgin <josh.durgin@inktank.com>,
	pmcgarry@redhat.com, ceph-devel <ceph-devel@vger.kernel.org>
Subject: Re: About the blueprint OSD: Transactions
Date: Thu, 05 Mar 2015 11:54:30 +0800
Message-ID: <54F7D376.6070705@ubuntukylin.com>
In-Reply-To: <alpine.DEB.2.00.1503041638040.23553@cobra.newdream.net>



On 2015/3/5 8:56, Sage Weil wrote:
> On Wed, 4 Mar 2015, Li Wang wrote:
>> Hi Sage, Please take a look if the below works,
>> [...]
>
> I think this works.  A few notes:
>
> 1- I don't think there's a need to persist the txn on the master until the
> slaves reply with PREPARE_ACK.

I think the txn must be persisted on the master side at the very
beginning. Once the master has sent messages to the slaves, there must
be a mechanism by which ROLL_BACK can be resent to the slaves if the
master goes down. That said, only a small part of the transaction,
rather than the whole of it, may need to be persisted.
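
To make the point concrete, here is a minimal sketch of what such a
reduced record could look like. It is not Ceph code: TxnRecord,
MasterLog and recover() are hypothetical names, and a dict of JSON
strings stands in for the PG metadata. The assumption is that the
master stores only the txn id, the slave list and a state before it
sends PREPARE, and that after a restart it resends ROLL_BACK for every
record still in the PREPARING state.

```python
import json
from dataclasses import dataclass, asdict
from typing import Dict, List, Tuple


@dataclass
class TxnRecord:
    """Reduced transaction record (hypothetical): no payload, only routing info."""
    txn_id: str
    slaves: List[str]
    state: str  # only "PREPARING" is used in this sketch


class MasterLog:
    """Stand-in for the master's PG metadata: a dict of JSON strings."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def persist(self, rec: TxnRecord) -> None:
        # Must happen before the first PREPARE leaves the master.
        self._store[rec.txn_id] = json.dumps(asdict(rec))

    def remove(self, txn_id: str) -> None:
        self._store.pop(txn_id, None)

    def load_all(self) -> List[TxnRecord]:
        return [TxnRecord(**json.loads(v)) for v in self._store.values()]


def recover(log: MasterLog) -> List[Tuple[str, str, str]]:
    """After a master restart, list the (slave, message, txn_id) tuples to resend."""
    to_send = []
    for rec in log.load_all():
        if rec.state == "PREPARING":
            for slave in rec.slaves:
                to_send.append((slave, "ROLL_BACK", rec.txn_id))
    return to_send
```

The record carries no object data, which is the "only a few" part: it
is just enough for the restarted master to know whom to tell to roll
back.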

>
> 2- This is basically optimistic concurrency with backoff if
> possible deadlock is detected.  I think we can do the same thing in the
> proposal in the blueprint if a PREPARE sees that a txn (in-memory) is
> pending or if a client txn is received and there is a pending PREPARE.  In
> the latter case, it seems like we should block and wait...
>

Yes. We can divide the process into two steps. The first step is
PREPARE, which is only for deadlock avoidance and involves only
in-memory operations on the slaves. The master sends PREPARE to the
slaves, and each slave checks whether there is a pending transaction
in memory; if so, it replies EAGAIN to the master, otherwise it
replies PREPARE_ACK. This makes deadlock avoidance extremely fast. In
the second step, the master collects all PREPARE_ACKs and sends COMMIT
to the slaves; the slaves then commit their parts of the transaction
to PG metadata and reply COMMIT_ACK to the master.
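
The two steps can be sketched roughly as follows. This is only an
illustration under simplifying assumptions, not an implementation:
Slave and run_txn() are hypothetical names, each slave tracks at most
one pending txn in memory, a list stands in for PG metadata, and the
on_abort() release on EAGAIN is my assumption, since the description
above does not say how a slave forgets a PREPARE that was acked.

```python
from typing import List, Optional


class Slave:
    """Hypothetical slave OSD with one in-memory pending slot."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.pending: Optional[str] = None  # in memory only, never persisted
        self.committed: List[str] = []      # stand-in for PG metadata

    def on_prepare(self, txn_id: str) -> str:
        # Pure memory check: refuse if another txn is already pending.
        if self.pending is not None and self.pending != txn_id:
            return "EAGAIN"
        self.pending = txn_id
        return "PREPARE_ACK"

    def on_abort(self, txn_id: str) -> None:
        # Assumption: the master releases slaves that acked when it backs off.
        if self.pending == txn_id:
            self.pending = None

    def on_commit(self, txn_id: str) -> str:
        if self.pending != txn_id:
            raise ValueError("COMMIT for a txn that was not prepared")
        self.committed.append(txn_id)  # the only persistent write
        self.pending = None
        return "COMMIT_ACK"


def run_txn(txn_id: str, slaves: List[Slave]) -> str:
    """Master side: collect PREPARE_ACKs, then send COMMIT to every slave."""
    acked: List[Slave] = []
    for s in slaves:
        if s.on_prepare(txn_id) != "PREPARE_ACK":
            for a in acked:
                a.on_abort(txn_id)
            return "EAGAIN"
        acked.append(s)
    for s in slaves:
        s.on_commit(txn_id)
    return "COMMITTED"
```

Since PREPARE touches only memory, a conflicting txn is turned away
with EAGAIN before anything is written to PG metadata.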

> 3- In either scheme, we can do full deadlock avoidance if we force
> the master to be the lowest-sorting object name, or something like that.
> But I think that will have a performance impact since there is likely a
> best choice for master depending on the transaction itself... like a txn
> that writes 4MB to an object and inserts a pointer in another object;
> clearly the 4MB piece should be the master so that it is only written once
> and doesn't cross the network.
>
> sage
>
>
>
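
The trade-off in point 3 can be illustrated with two selection rules.
This is a toy sketch; both function names and the write_sizes mapping
(object name to bytes written by the txn) are assumptions made for
illustration only.

```python
from typing import Dict


def pick_master_lowest(write_sizes: Dict[str, int]) -> str:
    """Deadlock-avoiding rule: always the lowest-sorting object name."""
    return min(write_sizes)


def pick_master_largest_write(write_sizes: Dict[str, int]) -> str:
    """Performance rule: the object receiving the most data, ties broken by name."""
    return min(write_sizes, key=lambda name: (-write_sizes[name], name))


# A txn that writes 4MB to one object and a small pointer to another.
txn = {"index_obj": 64, "zdata_obj": 4 * 1024 * 1024}
lowest = pick_master_lowest(txn)
largest = pick_master_largest_write(txn)
```

Here the first rule picks index_obj, so the 4MB payload has to cross
the network to the slave, while the second rule picks zdata_obj and
gives up the global ordering.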
>> 1. The client calculates the PG that the master object (suggested by
>>    the programmer) belongs to, retrieves the primary OSD of that PG,
>>    called the master, and sends the full transaction to it.
>> 2. The master persists the whole transaction in the corresponding PG
>>    metadata.
>> 3. The master parses the transaction to obtain the set of slave OSDs,
>>    which are the primary OSDs of the other PGs the transaction refers
>>    to, and sends PREPARE, together with the part of the transaction
>>    that needs to be done on each individual PG, to the corresponding
>>    slave OSD.
>> 4. Each slave OSD checks whether there is a PREPARED-BUT-UNCOMMITTED
>>    transaction in its PG metadata such that the two transactions
>>    share at least one write operation on the same object. If so, the
>>    slave OSD gives up preparing and replies PREPARE_AGAIN. Otherwise,
>>    it performs all the read-and-comparison operations in its received
>>    transaction part and replies PREPARE_FAIL if any of the operations
>>    fails. If all succeed, it persists its transaction part in its PG
>>    metadata and replies PREPARE_ACK.
>> 5. The master collects all PREPARE_ACKs and replies PREPARED to the
>>    client. If a PREPARE_FAIL is received, the master replies ERROR to
>>    the client and sends ROLL_BACK to the slaves; each slave discards
>>    its prepared transaction part, if any, and replies ROLL_BACK_ACK
>>    to the master. The master collects all ROLL_BACK_ACKs and discards
>>    the transaction. If a PREPARE_AGAIN is received, the process is
>>    similar to the PREPARE_FAIL case, except that the master replies
>>    EAGAIN to the client.
>> 6. The master sends COMMIT to the slaves.
>> 7. The slaves receive COMMIT, commit their individual transaction
>>    parts, and reply COMMIT_ACK.
>> 8. The master collects all COMMIT_ACKs and replies COMMITTED to the
>>    client.
>> 9. The master closes out the transaction record.
>>
>> It seems to work without deadlocking under normal conditions.
>> However, there are still many kinds of errors it needs to take into
>> account, such as PG changes, OSDs going down, etc., doesn't it?
>>
>> Cheers,
>> Li Wang
>
>
>
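
For reference, the slave-side PREPARE check from step 4 of the quoted
proposal can be sketched as below. This is a simplified model, not
Ceph code: SlavePG is a hypothetical class, a dict of object name to
bytes stands in for the object store, and the prepared map stands in
for the transaction parts persisted in PG metadata.

```python
from typing import Dict, List, Set, Tuple


class SlavePG:
    """Hypothetical model of one PG on a slave OSD."""

    def __init__(self, objects: Dict[str, bytes]) -> None:
        self.objects = dict(objects)
        # txn_id -> objects written by a prepared-but-uncommitted txn
        self.prepared: Dict[str, Set[str]] = {}

    def prepare(
        self,
        txn_id: str,
        compares: List[Tuple[str, bytes]],
        writes: Dict[str, bytes],
    ) -> str:
        write_set = set(writes)
        # Conflict check: shares a written object with another prepared txn.
        for other_id, other_writes in self.prepared.items():
            if other_id != txn_id and write_set & other_writes:
                return "PREPARE_AGAIN"
        # Read-and-compare operations: every expected value must match.
        for obj, expected in compares:
            if self.objects.get(obj) != expected:
                return "PREPARE_FAIL"
        # Persist this txn part (here: remember its write set).
        self.prepared[txn_id] = write_set
        return "PREPARE_ACK"

    def rollback(self, txn_id: str) -> str:
        # Discard the prepared part, if any.
        self.prepared.pop(txn_id, None)
        return "ROLL_BACK_ACK"
```

Note that the conflict check runs before the comparisons, so a txn
that would fail its compares can still get PREPARE_AGAIN first; the
proposal orders the two checks the same way.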
