From: Alex Zhuravlev <Alex.Zhuravlev@Sun.COM>
To: lustre-devel@lists.lustre.org
Subject: [Lustre-devel] global epochs [an alternative proposal, long and dry].
Date: Mon, 22 Dec 2008 17:44:56 +0300
Message-ID: <494FA7E8.7030200@sun.com>
In-Reply-To: <18767.35839.133024.625896@gargle.gargle.HOWL>
Nikita Danilov wrote:
> > I find this relying on explicit request (lock in this case) as a disadvantage:
> > lock can be taken long before reintegration meaning epoch might be pinned for
>
> Hm.. a lock doesn't pin an epoch in any way.
well, I think it does, as you don't want to use an epoch that was received with a lock
a few minutes ago. if a node is in WBC mode and has been granted some STL-like lock, then
it may be sending a batch of a few MBs every, say, 5 minutes. there might be no interaction
between batches. this means the client would need to refresh its epoch. depending on the
workload, the client may be unable to send a batch while it waits for a new epoch, or it
may refresh the epoch and then send no real batches after that.
> Locks are only needed to make proof of S2 possible. Once lockless
> operation or SNS guarantee in some domain-specific way that no epoch can
> depend on a future one, we are fine.
well, I guess "in some domain-specific way" means additional complexity.
> > this means client actually should maintain many epochs at same time as any lock
> > enqueue can advance epoch.
>
> I don't understand what is meant by "maintaining an epoch" here. Epoch
> is just a number. Surely a client will keep in its memory (in the redo
> log) a list of updates tagged by multiple epochs, but I don't see any
> problem with this.
the problem is that with out-of-order epochs sent to different servers, the client can't
use the notion of "last_committed" anymore.
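
One way to read this objection, as a minimal sketch rather than Lustre code: the names `Update`, `prune_scalar` and `prune_per_server` are hypothetical, and the sketch assumes each server reports the highest epoch it has committed. A single scalar watermark then discards an update that its own server has not committed yet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    server: str  # target server (hypothetical name)
    epoch: int   # epoch the update was tagged with


def prune_scalar(log, last_committed):
    """Drop every update at or below one scalar watermark."""
    return [u for u in log if u.epoch > last_committed]


def prune_per_server(log, committed):
    """Drop an update only if its own server has committed its epoch."""
    return [u for u in log if u.epoch > committed.get(u.server, -1)]


# Updates in send order; epochs are not monotonic across servers.
log = [Update("mds1", 5), Update("mds2", 7), Update("mds1", 6)]

# Assumed reports: mds1 has committed up to epoch 5, mds2 up to epoch 7.
committed = {"mds1": 5, "mds2": 7}

# A scalar watermark of 7 also drops the epoch-6 update, which mds1
# has not committed.
unsafe = prune_scalar(log, max(committed.values()))

# Per-server tracking keeps that update in the redo log.
safe = prune_per_server(log, committed)
```

In this sketch the client ends up carrying one watermark per server (or per epoch) instead of a single number, which is the extra state the reply is pointing at.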
> > I think having SC is also drawback:
> > 1) choosing such node is additional complexity and delay
> > 2) failing of such node would need global resend of states
> > 3) many unrelated nodes can get stuck due to large redo logs
>
> As I pointed out, only the simplest `1-level star' form of a stability
> algorithm was described for simplicity. This algorithm is amenable to
> a lot of optimization, because it, in effect, has to find a running
> minimum in a distributed array, and this can be done in a scalable way:
the bad thing, IMHO, in all this is that all nodes making decisions must
understand the topology. a server should separate epochs from different clients,
and it's hard to send batches via some intermediate server/node.
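
The "running minimum in a distributed array" from the quoted text can be sketched as follows. This is an illustration under stated assumptions, not the proposed protocol: every node is assumed to report the smallest epoch it still holds volatile, and the function names, node names and tree layout are hypothetical. The 1-level star has one node collect every report; the tree variant lets intermediate nodes reduce their subtree first.

```python
def star_minimum(reports):
    """1-level star: one node takes the minimum over all reports."""
    return min(reports.values())


def tree_minimum(tree, reports, node):
    """Tree reduction: each node forwards the minimum of its subtree.

    `tree` maps a node to its children; `reports` maps a node to the
    smallest epoch it still holds volatile (nodes may be absent if they
    only aggregate). Every leaf is assumed to have a report.
    """
    values = [tree_minimum(tree, reports, child) for child in tree.get(node, [])]
    if node in reports:
        values.append(reports[node])
    return min(values)


# Hypothetical layout: a root aggregator, two servers, one client each.
tree = {"root": ["s1", "s2"], "s1": ["c1"], "s2": ["c2"]}
reports = {"s1": 10, "s2": 11, "c1": 12, "c2": 9}

star = star_minimum(reports)
reduced = tree_minimum(tree, reports, "root")
```

Both variants produce the same minimum; the tree variant is where the topology concern applies, because every aggregating node has to know which children it is waiting for.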
> Note, that this requires _no_ additional rpcs from the clients.
disagree. at least for distributed operations, the client has to report its non-volatile
epoch from time to time. in some cases we can use a protocol like ping, in others we can't.
> > given current epoch can be advanced by lock enqueue, client can get many used
> > epochs at same time, thus we'd have to track them all in the protocol.
>
> I am not sure I understand this. _Any_ message (including lock enqueue,
> REINT, MIN_VOLATILE, CONNECT, EVICT, etc.) potentially updates the epoch
> of a receiving node.
correct, this means the client may have many epochs to track. thus no last_committed anymore.
> Only until this node is evicted, and I think that no matter what is the
> pattern of failures, a single level of `tree reduction', can be delayed
> by no more than a single eviction timeout.
the problem is that this may affect unrelated nodes very easily.
> Actually, single-server operation can be discarded from a redo log as
> soon as it commits on the target server, because the latter can always
> redo it (possibly after undo). Given that majority of operations are
> single server, redo logs won't be much larger than they are to-day.
undo to redo? even longer recovery?
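
The redo-log rule in the quoted text can be sketched as below, as a hypothetical model rather than an actual implementation: `RedoEntry`, `on_commit` and `on_epoch_stable` are invented names, and the sketch assumes that multi-server entries are kept until their epoch is reported stable.

```python
from dataclasses import dataclass, field


@dataclass
class RedoEntry:
    op_id: int
    epoch: int
    servers: frozenset                      # servers the operation touches
    committed_on: set = field(default_factory=set)


def on_commit(log, op_id, server):
    """Record a commit; drop the entry at once if it is single-server."""
    kept = []
    for entry in log:
        if entry.op_id == op_id:
            entry.committed_on.add(server)
            single = len(entry.servers) == 1
            if single and entry.servers <= entry.committed_on:
                continue  # the target server can redo it on its own
        kept.append(entry)
    return kept


def on_epoch_stable(log, stable_epoch):
    """Assumed rule: drop everything at or below a stable epoch."""
    return [entry for entry in log if entry.epoch > stable_epoch]


log = [
    RedoEntry(1, 5, frozenset({"mds1"})),
    RedoEntry(2, 5, frozenset({"mds1", "mds2"})),
]
log = on_commit(log, 1, "mds1")       # single-server: discarded
log = on_commit(log, 2, "mds1")       # distributed: still retained
remaining_ids = [entry.op_id for entry in log]
after_stable = on_epoch_stable(log, 5)
```

The sketch only covers the client-side log size; it does not model the server-side undo that the question above is about.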
thanks, Alex