Re: [PATCH 04/39] mds: make sure table request id unique

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Yan, Zheng" <zheng.z.yan@intel.com>
To: Greg Farnum <greg@inktank.com>
Cc: Sage Weil <sage@inktank.com>, ceph-devel@vger.kernel.org
Subject: Re: [PATCH 04/39] mds: make sure table request id unique
Date: Thu, 21 Mar 2013 16:07:34 +0800	[thread overview]
Message-ID: <514ABFC6.3080100@intel.com> (raw)
In-Reply-To: <971E9C644F3C4AD9A4BFA042FC238F34@inktank.com>

On 03/21/2013 02:31 AM, Greg Farnum wrote:
> On Tuesday, March 19, 2013 at 11:49 PM, Yan, Zheng wrote:
>> On 03/20/2013 02:15 PM, Sage Weil wrote:
>>> On Wed, 20 Mar 2013, Yan, Zheng wrote:
>>>> On 03/20/2013 07:09 AM, Greg Farnum wrote:
>>>>> Hmm, this is definitely narrowing the race (probably enough to never hit it), but it's not actually eliminating it (if the restart happens after 4 billion requests?). More importantly this kind of symptom makes me worry that we might be papering over more serious issues with colliding states in the Table on restart.
>>>>> I don't have the MDSTable semantics in my head so I'll need to look into this later unless somebody else volunteers to do so?
>>>>  
>>>>  
>>>>  
>>>> Not just 4 billion requests, MDS restart has several stage, mdsmap epoch  
>>>> increases for each stage. I don't think there are any more colliding  
>>>> states in the table. The table client/server use two phase commit. it's  
>>>> similar to client request that involves multiple MDS. the reqid is  
>>>> analogy to client request id. The difference is client request ID is  
>>>> unique because new client always get an unique session id.
>>>  
>>>  
>>>  
>>> Each time a tid is consumed (at least for an update) it is journaled in  
>>> the EMetaBlob::table_tids list, right? So we could actually take a max  
>>> from journal replay and pick up where we left off? That seems like the  
>>> cleanest.
>>>  
>>> I'm not too worried about 2^32 tids, I guess, but it would be nicer to  
>>> avoid that possibility.
>>  
>>  
>>  
>> Can we re-use the client request ID as table client request ID ?
>>  
>> Regards
>> Yan, Zheng
> 
> Not sure what you're referring to here — do you mean the ID of the filesystem client request which prompted the update? I don't think that would work as client requests actually require two parts to be unique (the client GUID and the request seq number), and I'm pretty sure a single client request can spawn multiple Table updates.
> 

You are right, client request ID does not work.

> As I look over this more, it sure looks to me as if the effect of the code we have (when non-broken) is to rollback every non-committed request by an MDS which restarted — the only time it can handle the TableServer's "agree" with a different response is if the MDS was incorrectly marked out by the map. Am I parsing this correctly, Sage? Given that, and without having looked at the code more broadly, I think we want to add some sort of implicit or explicit handshake letting each of them know if the MDS actually disappeared. We use the process/address nonce to accomplish this in other places…
> -Greg
> 

The table server sends 'agree' message to table client after a 'prepare entry' is safely logged. The table server re-sends 'agree' message in two cases, one is the table client restarts, another is the table server itself restarts.
The purpose of re-sending 'agree' message is to check if the table client still wants to keep the update preparation. (The table client might crash before submitting the update). The purpose of reqid is associate table update
preparation request with the server's 'agree' reply message. The problem here is that the table client does not make sure reqid unique between restarts. If you feel 2^32 reqids are still enough, set the reqid to a randomized 64bit
value should be safe enough.

Thanks
Yan, Zheng
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

next prev parent reply	other threads:[~2013-03-21  8:07 UTC|newest]

Thread overview: 117+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-03-17 14:51 [PATCH 00/39] fixes for MDS cluster recovery Yan, Zheng
2013-03-17 14:51 ` [PATCH 01/39] mds: preserve subtree bounds until slave commit Yan, Zheng
2013-03-20 18:33   ` Greg Farnum
2013-03-17 14:51 ` [PATCH 02/39] mds: process finished contexts in batch Yan, Zheng
2013-03-20 18:33   ` Greg Farnum
2013-03-17 14:51 ` [PATCH 03/39] mds: fix MDCache::adjust_bounded_subtree_auth() Yan, Zheng
2013-03-20 18:33   ` Greg Farnum
2013-03-17 14:51 ` [PATCH 04/39] mds: make sure table request id unique Yan, Zheng
2013-03-19 23:09   ` Greg Farnum
2013-03-20  5:53     ` Yan, Zheng
2013-03-20  6:15       ` Sage Weil
2013-03-20  6:24         ` Yan, Zheng
2013-03-20  6:49         ` Yan, Zheng
2013-03-20 18:31           ` Greg Farnum
2013-03-21  8:07             ` Yan, Zheng [this message]
2013-03-21 22:03               ` Gregory Farnum
2013-03-25 11:30                 ` Yan, Zheng
2013-03-29 22:12                   ` Gregory Farnum
2013-03-17 14:51 ` [PATCH 05/39] mds: send table request when peer is in proper state Yan, Zheng
2013-03-20 18:34   ` Greg Farnum
2013-03-29 21:58   ` Gregory Farnum
2013-03-17 14:51 ` [PATCH 06/39] mds: make table client/server tolerate duplicated message Yan, Zheng
2013-03-29 22:00   ` Gregory Farnum
2013-03-31 13:21     ` Yan, Zheng
2013-03-17 14:51 ` [PATCH 07/39] mds: mark connection down when MDS fails Yan, Zheng
2013-03-20 18:37   ` Greg Farnum
2013-03-17 14:51 ` [PATCH 08/39] mds: consider MDS as recovered when it reaches clientreply state Yan, Zheng
2013-03-20 18:40   ` Greg Farnum
2013-03-21  2:22     ` Yan, Zheng
2013-03-21 21:43       ` Gregory Farnum
2013-03-20 19:09   ` Greg Farnum
2013-03-17 14:51 ` [PATCH 09/39] mds: defer eval gather locks when removing replica Yan, Zheng
2013-03-20 19:36   ` Greg Farnum
2013-03-21  2:29     ` Yan, Zheng
2013-03-17 14:51 ` [PATCH 10/39] mds: unify slave request waiting Yan, Zheng
2013-03-20 22:52   ` Sage Weil
2013-03-17 14:51 ` [PATCH 11/39] mds: don't delay processing replica buffer in slave request Yan, Zheng
2013-03-20 21:19   ` Greg Farnum
2013-03-21  2:38     ` Yan, Zheng
2013-03-21  4:15       ` Sage Weil
2013-03-21 21:48         ` Gregory Farnum
2013-03-17 14:51 ` [PATCH 12/39] mds: compose and send resolve messages in batch Yan, Zheng
2013-03-20 21:45   ` Gregory Farnum
2013-03-17 14:51 ` [PATCH 13/39] mds: don't send resolve message between active MDS Yan, Zheng
2013-03-20 21:56   ` Gregory Farnum
2013-03-21  2:55     ` Yan, Zheng
2013-03-21 21:55       ` Gregory Farnum
2013-03-17 14:51 ` [PATCH 14/39] mds: set resolve/rejoin gather MDS set in advance Yan, Zheng
2013-03-20 22:09   ` Gregory Farnum
2013-03-17 14:51 ` [PATCH 15/39] mds: don't send MDentry{Link,Unlink} before receiving cache rejoin Yan, Zheng
2013-03-20 22:17   ` Gregory Farnum
2013-03-17 14:51 ` [PATCH 16/39] mds: send cache rejoin messages after gathering all resolves Yan, Zheng
2013-03-20 22:57   ` Gregory Farnum
2013-03-17 14:51 ` [PATCH 17/39] mds: send resolve acks after master updates are safely logged Yan, Zheng
2013-03-20 22:58   ` Gregory Farnum
2013-03-17 14:51 ` [PATCH 18/39] mds: fix MDS recovery involving cross authority rename Yan, Zheng
2013-03-21 17:59   ` Gregory Farnum
2013-03-22  3:04     ` Yan, Zheng
2013-03-29 22:02       ` Gregory Farnum
2013-03-17 14:51 ` [PATCH 19/39] mds: remove MDCache::rejoin_fetch_dirfrags() Yan, Zheng
2013-03-20 22:58   ` Gregory Farnum
2013-03-17 14:51 ` [PATCH 20/39] mds: include replica nonce in MMDSCacheRejoin::inode_strong Yan, Zheng
2013-03-20 23:26   ` Gregory Farnum
2013-03-20 23:36     ` Sage Weil
2013-03-17 14:51 ` [PATCH 21/39] mds: encode dirfrag base in cache rejoin ack Yan, Zheng
2013-03-20 23:33   ` Gregory Farnum
2013-03-20 23:40     ` Gregory Farnum
2013-03-21  6:41     ` Yan, Zheng
2013-03-21 21:58       ` Gregory Farnum
2013-03-17 14:51 ` [PATCH 22/39] mds: handle linkage mismatch during cache rejoin Yan, Zheng
2013-03-21 21:23   ` Gregory Farnum
2013-03-22  3:05     ` Yan, Zheng
2013-03-25 16:14       ` Gregory Farnum
2013-03-26  7:21     ` Yan, Zheng
2013-03-29 22:09       ` Gregory Farnum
2013-03-17 14:51 ` [PATCH 23/39] mds: reqid for rejoinning authpin/wrlock need to be list Yan, Zheng
2013-03-20 23:59   ` Gregory Farnum
2013-03-17 14:51 ` [PATCH 24/39] mds: take object's versionlock when rejoinning xlock Yan, Zheng
2013-03-21  0:37   ` Gregory Farnum
2013-03-17 14:51 ` [PATCH 25/39] mds: share inode max size after MDS recovers Yan, Zheng
2013-03-21  0:45   ` Gregory Farnum
2013-03-17 14:51 ` [PATCH 26/39] mds: issue caps when lock state in replica become SYNC Yan, Zheng
2013-03-21  0:52   ` Gregory Farnum
2013-03-17 14:51 ` [PATCH 27/39] mds: send lock action message when auth MDS is in proper state Yan, Zheng
2013-03-21  3:12   ` Gregory Farnum
2013-03-21  3:20     ` Yan, Zheng
2013-03-17 14:51 ` [PATCH 28/39] mds: add dirty imported dirfrag to LogSegment Yan, Zheng
2013-03-21  3:14   ` Gregory Farnum
2013-03-17 14:51 ` [PATCH 29/39] mds: avoid double auth pin for file recovery Yan, Zheng
2013-03-21  3:20   ` Gregory Farnum
2013-03-21  3:33     ` Yan, Zheng
2013-03-21  4:20       ` Sage Weil
2013-03-21 21:58     ` Gregory Farnum
2013-03-17 14:51 ` [PATCH 30/39] mds: check MDS peer's state through mdsmap Yan, Zheng
2013-03-21  3:24   ` Gregory Farnum
2013-03-17 14:51 ` [PATCH 31/39] mds: unfreeze subtree if import aborts in PREPPED state Yan, Zheng
2013-03-21  3:27   ` Gregory Farnum
2013-03-17 14:51 ` [PATCH 32/39] mds: fix export cancel notification Yan, Zheng
2013-03-21  3:31   ` Gregory Farnum
2013-03-17 14:51 ` [PATCH 33/39] mds: notify bystanders if export aborts Yan, Zheng
2013-03-21  3:34   ` Gregory Farnum
2013-03-17 14:51 ` [PATCH 34/39] mds: don't open dirfrag while subtree is frozen Yan, Zheng
2013-03-21  3:38   ` Gregory Farnum
2013-03-17 14:51 ` [PATCH 35/39] mds: clear dirty inode rstat if import fails Yan, Zheng
2013-03-21  3:40   ` Gregory Farnum
2013-03-17 14:51 ` [PATCH 36/39] mds: try merging subtree after clear EXPORTBOUND Yan, Zheng
2013-03-21  3:44   ` Gregory Farnum
2013-03-17 14:51 ` [PATCH 37/39] mds: eval inodes with caps imported by cache rejoin message Yan, Zheng
2013-03-21  3:45   ` Gregory Farnum
2013-03-17 14:51 ` [PATCH 38/39] mds: don't replicate purging dentry Yan, Zheng
2013-03-21  3:46   ` Gregory Farnum
2013-03-17 14:51 ` [PATCH 39/39] mds: clear scatter dirty if replica inode has no auth subtree Yan, Zheng
2013-03-21  3:49   ` Gregory Farnum
2013-04-01  8:46 ` [PATCH 00/39] fixes for MDS cluster recovery Yan, Zheng
2013-04-01 17:00   ` Gregory Farnum
2013-04-01  8:51 ` [PATCH] mds: avoid sending duplicated table prepare/commit Yan, Zheng
2013-04-01  8:51   ` [PATCH] mds: don't roll back prepared table updates Yan, Zheng

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=514ABFC6.3080100@intel.com \
    --to=zheng.z.yan@intel.com \
    --cc=ceph-devel@vger.kernel.org \
    --cc=greg@inktank.com \
    --cc=sage@inktank.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.