From mboxrd@z Thu Jan 1 00:00:00 1970
From: Andreas Dilger
Date: Tue, 23 Dec 2008 16:37:53 -0700
Subject: [Lustre-devel] global epochs [an alternative proposal, long and dry].
In-Reply-To: <18767.58149.550264.505562@gargle.gargle.HOWL>
References: <18767.18277.958956.959956@gargle.gargle.HOWL>
 <494F7F6B.9080509@sun.com>
 <18767.35839.133024.625896@gargle.gargle.HOWL>
 <494FA7E8.7030200@sun.com>
 <18767.52005.485425.412677@gargle.gargle.HOWL>
 <494FD020.70909@sun.com>
 <18767.58149.550264.505562@gargle.gargle.HOWL>
Message-ID: <20081223233753.GJ5000@webber.adilger.int>
List-Id:
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: lustre-devel@lists.lustre.org

Nikita,
I still need more time to re-read and digest what you have written,
but thanks in advance for taking the time to explain it clearly and
precisely. This algorithm does seem to be related to the one originally
described in Peter's "Cluster Metadata Recovery" paper, where the epoch
numbers are carried by every request and reply, but it is much better
described.

I think it would help me understand it more easily if it were mapped
more closely onto a potential implementation and the issues we may see
there. For example, the issue with fsync possibly involving all(?) nodes
(including clients) is not obvious from your description.

Similarly, some description of the practical requirements for message
exchange, how easy/hard it would be to e.g. "find all undo records
related to...", and the practical bound on the number of operations that
might have to be kept in memory and/or rolled back/forward during
recovery would be useful.

In particular, the mention that clients need to participate to determine
the oldest uncommitted operation seems troublesome unless the servers
themselves can place a bound on this by the frequency of their commits.

On Dec 22, 2008 21:57 +0300, Nikita Danilov wrote:
> Any message is used as a transport for epochs, including any reply
> from a server. So a typical scenario would be
>
>
>        client                                   server
>        epoch = 8                                epoch = 9
>
>        LOCK --------------->
>             <-------------- REPLY
>        epoch = 9
>             <----- other message with epoch = 10 from somewhere
>        epoch = 10
>             ....
>
>        REINT --------------->
>             <-------------- REPLY
>        epoch = 10
>
>             <----- other message with epoch = 11 from somewhere
>        epoch = 11
>
>        REINT --------------->
>             <-------------- REPLY
>        epoch = 11
>
> etc. Note, that nothing prevents server from increasing its local epoch
> before replying to every reintegration (this was mentioned in the
> original document as an "extreme case"). With this policy there is never
> more than one reintegration on a given client in a given epoch, and we
> can indeed implement stability algorithm without clients.

I was wondering if we could make some analogies between the current
transno-based recovery system and your current proposal. For example,
in our current recovery we increment the transno on the server before
the reply for every reintegration, and because there is only a single
RPC in flight to the client, each RPC could be considered to be in a
separate "epoch", matching your "extreme case" above.

Similarly, I wonder if we could somehow map client (lack of) involvement
in epochs to our current configuration, and only require "client"
participation in the case of WBC or CMD?

One thing that crossed my mind at this point is that the 1.8 servers
already track recovery "epochs" for VBR using the transno (epoch is in
the high 32-bit word of the transno, counter is in the low 32-bit word).
These "recovery epochs" are not (currently) synchronized between
servers, but that would seem to be possible/needed in the future.

Alternately, we might consider the VBR recovery "epochs" to be the same
as the epochs you are proposing, so that a transno increment does not
affect these epochs except to order operations within the epoch.
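
To make the transno layout described above concrete (recovery epoch in
the high 32-bit word, counter in the low 32-bit word), here is a minimal
Python sketch. Only the 32/32 split comes from the description above;
the function names are hypothetical and are not taken from the Lustre
source, and the overflow handling is an assumption for illustration.

```python
# Hypothetical sketch of a 64-bit transno split into epoch and counter.
# The 32/32 split follows the description in this mail; all names here
# are illustrative assumptions, not Lustre identifiers.

EPOCH_BITS = 32
COUNTER_BITS = 32
EPOCH_MASK = (1 << EPOCH_BITS) - 1
COUNTER_MASK = (1 << COUNTER_BITS) - 1


def make_transno(epoch, counter):
    """Pack an epoch and a per-epoch counter into one 64-bit value."""
    if not 0 <= epoch <= EPOCH_MASK:
        raise ValueError("epoch does not fit in the epoch field")
    if not 0 <= counter <= COUNTER_MASK:
        raise ValueError("counter does not fit in the counter field")
    return (epoch << COUNTER_BITS) | counter


def transno_epoch(transno):
    """Return the epoch stored in the high word."""
    return transno >> COUNTER_BITS


def transno_counter(transno):
    """Return the counter stored in the low word."""
    return transno & COUNTER_MASK


def next_transno(transno):
    """Next operation in the same epoch: only the counter advances."""
    if transno_counter(transno) == COUNTER_MASK:
        raise OverflowError("counter exhausted, a new epoch is needed")
    return transno + 1


def start_new_epoch(transno):
    """Advance the epoch and reset the counter to zero."""
    return make_transno(transno_epoch(transno) + 1, 0)
```

With this layout, comparing two transnos as plain integers orders them
first by epoch and then by counter, so incrementing the transno orders
operations within an epoch without changing the epoch itself.
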
We would increment these epochs periodically (either due to too many
operations, or a time limit).

The current VBR epochs only make up 32 bits of the transno, but we might
consider increasing the size of this epoch field to allow more epochs.
If we need to do that, it should preferably be done ASAP, before the
1.8.0 release is made (this would be a trivial change at this stage).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.