From: Andreas Dilger <adilger@sun.com>
To: lustre-devel@lists.lustre.org
Subject: [Lustre-devel] global epochs [an alternative proposal, long and dry].
Date: Tue, 23 Dec 2008 16:37:53 -0700
Message-ID: <20081223233753.GJ5000@webber.adilger.int>
In-Reply-To: <18767.58149.550264.505562@gargle.gargle.HOWL>
Nikita,
I still need more time to re-read and digest what you have written,
but thanks in advance for taking the time to explain it clearly and
precisely. This algorithm does seem to be related to the one originally
described in Peter's "Cluster Metadata Recovery" paper, where the epoch
numbers are carried on every request and reply, but yours is much better
described.
I think it would help me understand it more easily if it could be mapped
more closely onto a potential implementation, and onto the issues we may
see there. For example, the issue with fsync possibly involving all(?)
nodes (including clients) is not obvious from your description.
Similarly, some description of the practical requirements for message
exchange, how easy/hard it would be to e.g. "find all undo records
related to...", and the practical bound of the number of operations that
might have to be kept in memory and/or rolled back/forward during
recovery would be useful.
In particular, the mention that clients need to participate to determine
the oldest uncommitted operation seems troublesome unless the servers
themselves can place a bound on this by the frequency of their commits.
On Dec 22, 2008 21:57 +0300, Nikita Danilov wrote:
> Any message is used as a transport for epochs, including any reply
> from a server. So a typical scenario would be
>
>
>         client                                   server
>        epoch = 8                                epoch = 9
>
>                    LOCK --------------->
>                    <-------------- REPLY
>        epoch = 9
>               <----- other message with epoch = 10 from somewhere
>        epoch = 10
>             ....
>
>                    REINT --------------->
>                    <-------------- REPLY
>        epoch = 10
>
>               <----- other message with epoch = 11 from somewhere
>        epoch = 11
>
>                    REINT --------------->
>                    <-------------- REPLY
>        epoch = 11
>
> etc. Note, that nothing prevents server from increasing its local epoch
> before replying to every reintegration (this was mentioned in the
> original document as an "extreme case"). With this policy there is never
> more than one reintegration on a given client in a given epoch, and we
> can indeed implement stability algorithm without clients.
I was wondering if we could draw some analogies between the current
transno-based recovery system and your proposal. For example, in our
current recovery we increment the transno on the server before the
reply to every reintegration, and because there is only a single RPC in
flight per client, each RPC could be considered to be in its own "epoch",
matching your "extreme case" above.
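
To make the analogy concrete, here is a rough sketch of the message
exchange quoted above. It is only a toy model, not Lustre code: the
Node and ExtremeServer classes are hypothetical, and the rule that a
receiver adopts the larger of its own epoch and the message's epoch is
an assumption read off the diagram rather than something taken from an
implementation.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """A client or server tracking a local epoch (hypothetical model)."""
    name: str
    epoch: int

    def send(self) -> int:
        # Every outgoing message, of any type, carries the sender's epoch.
        return self.epoch

    def receive(self, msg_epoch: int) -> None:
        # Assumed rule: epochs never go backwards; adopt the larger value.
        self.epoch = max(self.epoch, msg_epoch)


class ExtremeServer(Node):
    """Server that bumps its local epoch before every reintegration reply."""

    def reintegrate(self, msg_epoch: int) -> int:
        self.receive(msg_epoch)
        self.epoch += 1  # the "extreme case" policy
        return self.send()


# LOCK / REPLY from the diagram: client at 8, server at 9.
client = Node("client", 8)
server = Node("server", 9)
server.receive(client.send())
client.receive(server.send())
after_lock = client.epoch

# An unrelated message with epoch 10 arrives from somewhere.
client.receive(10)
after_other = client.epoch

# "Extreme case": every reintegration reply carries a fresh epoch.
extreme = ExtremeServer("server", 9)
c2 = Node("client", 8)
reply_epochs = []
for _ in range(3):
    reply = extreme.reintegrate(c2.send())
    c2.receive(reply)
    reply_epochs.append(reply)
```

With this policy no two reintegrations from the same client share an
epoch, which is the property the transno increment gives us today.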
Similarly, I wonder if we could somehow map client (lack of) involvement
in epochs to our current configuration, and only require "client"
participation in the case of WBC or CMD?
One thing that crossed my mind at this point is that the 1.8 servers already
track recovery "epochs" for VBR using the transno (epoch is in high 32-bit
word of transno, counter is in low 32-bit word). These "recovery epochs"
are not (currently) synchronized between servers, but that would seem to be
possible/needed in the future.
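
As an illustration of that layout, a minimal sketch of splitting a
64-bit transno into the two 32-bit words might look like the following.
The function names pack_transno and unpack_transno are made up for this
example and are not the names used in the 1.8 code.

```python
EPOCH_BITS = 32
COUNTER_BITS = 32
EPOCH_MASK = (1 << EPOCH_BITS) - 1
COUNTER_MASK = (1 << COUNTER_BITS) - 1


def pack_transno(epoch: int, counter: int) -> int:
    """Build a 64-bit transno: epoch in the high word, counter in the low word."""
    if not 0 <= epoch <= EPOCH_MASK:
        raise ValueError("epoch does not fit in 32 bits")
    if not 0 <= counter <= COUNTER_MASK:
        raise ValueError("counter does not fit in 32 bits")
    return (epoch << COUNTER_BITS) | counter


def unpack_transno(transno: int) -> tuple[int, int]:
    """Split a 64-bit transno back into (epoch, counter)."""
    return (transno >> COUNTER_BITS) & EPOCH_MASK, transno & COUNTER_MASK


example = pack_transno(3, 7)
roundtrip = unpack_transno(example)
```

A side effect of this layout is that plain integer comparison of two
transnos orders them by epoch first and by counter second.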
Alternately, we might consider the VBR recovery "epochs" to be the same
as the epochs you are proposing, and transno increment does not affect
these epochs except to order operations within the epoch. We would
increment these epochs periodically (either due to too many operations,
or time limit).
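
A sketch of what such a policy could look like, under assumptions of my
own: the TransnoAllocator class, its max_ops and max_age parameters,
and the choice to pass the current time in explicitly are all
hypothetical and only meant to show the shape of the idea.

```python
COUNTER_BITS = 32
COUNTER_MASK = (1 << COUNTER_BITS) - 1


class TransnoAllocator:
    """Hand out transnos; start a new epoch on an op-count or age limit."""

    def __init__(self, max_ops: int, max_age: float, epoch: int = 1,
                 start: float = 0.0) -> None:
        self.max_ops = max_ops      # operations allowed per epoch
        self.max_age = max_age      # seconds an epoch may stay open
        self.epoch = epoch
        self.counter = 0
        self.epoch_start = start

    def next_transno(self, now: float) -> int:
        too_many = self.counter >= self.max_ops
        too_old = now - self.epoch_start >= self.max_age
        if too_many or too_old:
            # Begin a new epoch; the counter only orders ops within it.
            self.epoch += 1
            self.counter = 0
            self.epoch_start = now
        self.counter += 1
        return (self.epoch << COUNTER_BITS) | self.counter


def split(transno: int) -> tuple[int, int]:
    """Return (epoch, counter) for a transno produced above."""
    return transno >> COUNTER_BITS, transno & COUNTER_MASK


alloc = TransnoAllocator(max_ops=2, max_age=10.0)
history = [split(alloc.next_transno(t)) for t in (0.0, 1.0, 2.0, 12.0)]
```

In the sample run the third operation starts a new epoch because the
operation limit was reached, and the fourth does so because the epoch
has been open for the full time limit.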
The current VBR epochs only make up 32 bits of the transno, but we might
consider increasing the size of this epoch field to allow more epochs.
If we need to do that, it should preferably be done ASAP, before the 1.8.0
release is made (this would be a trivial change at this stage).
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.