From: Mikhail Pershin <Mikhail.Pershin@Sun.COM>
To: lustre-devel@lists.lustre.org
Subject: [Lustre-devel] Recovering opens by reconstruction
Date: Tue, 07 Jul 2009 17:56:36 +0400
Message-ID: <op.uwpaclw3atmt0c@garden>
In-Reply-To: <20090706173441.GL15302@Sun.COM>

On Mon, 06 Jul 2009 21:34:41 +0400, Nicolas Williams  
<Nicolas.Williams@sun.com> wrote:

>
> In my proposal what would happen is that opens would only be recovered
> by _replay_ when the transaction had not yet been committed, otherwise
> the opens will be recovered by making a _new_ (non-replay) open RPC.
>

Yes, I understood that, and I agree that this looks like a cleaner
implementation, but I see the following problems so far:
  - there are two kinds of clients, new and old, and both must be handled
  - the client code would have to change a lot
  - the server needs to understand and handle this too

What do we gain from this? Sorry to be a nuisance, but it looks to me
like this can be solved in simpler ways. For example, you could add an
MGS_OPEN_REPLAY flag to such requests, so they would also be
distinguishable on the wire from transaction replays. Or we could
somehow reuse the lock replay functionality: locks are not kept as
saved RPCs either, but are enqueued as new requests. Open is very close
to this; I agree with the idea that the open handle has all the needed
info, so there is no need to keep the original RPC in this case.

My point is that the proposed solution looks overcomplicated just to
solve the signature problem, though it makes sense in general. If we are
going to reorganize open recovery and have time for it, it would be
better to move it out of the replay signature work into a separate task,
as it is quite complex.

>
> I'm not sure why a new stage would necessarily slow recovery in a
> significant way.  The new stage would not involve any writes to disk
> (though it would involve reads, reads which could then be cached and
> benefit the transaction recovery phase).

Not necessarily, but it can. This is not only about the open stage; it
is about the whole approach of doing recovery in stages, where all
clients must wait for every other client at each stage before they can
continue. We already have this in HEAD, and it extends the recovery
window. Lustre 1.8 had only a single recovery timer; Lustre 2.0 has
three stages, and the timer must be reset after each one. If all clients
are alive, the recovery time will be mostly the same, but if clients may
disappear during recovery, Lustre 2.0 recovery can already take up to
three times longer. Just imagine that one client is lost at each stage:
then at every stage all clients will wait until the timer expires. The
bigger the cluster, the more clients can be lost during recovery, so the
recovery time may differ significantly.
This also means that the server load is not well distributed over the
recovery time: the server waits, then processes all requests at once,
then waits again at the next stage, and so on.

Another point here is the possible use of version-based recovery
instead of transaction-based recovery. It bases recovery on object
versions, and then it makes no sense to wait for all clients at each
recovery stage, because all dependencies should be clear from the
versions and clients may finish recovery independently. Requests can
already be recovered by versions, and there is work on lock replays
using versions too.
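
A minimal C sketch of why versions remove the need for a global barrier,
assuming every object carries a version that each change bumps and each
replay records the version it saw before and produced after. The names
are hypothetical and do not reflect Lustre's actual structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-object state: a version bumped by every change. */
struct object { uint64_t version; };

/* A replayed change records the version it saw and the one it produced. */
struct replay { uint64_t pre_version; uint64_t post_version; };

/* Apply a replay only if the object is exactly in the state the original
 * operation saw. Otherwise an earlier change is missing, and the replay
 * cannot be applied safely. No other client needs to be consulted. */
static bool vbr_apply(struct object *obj, const struct replay *r)
{
    assert(r->post_version > r->pre_version);
    if (obj->version != r->pre_version)
        return false;
    obj->version = r->post_version;
    return true;
}
```

Each client's replays are checked only against the objects they touch,
so a client whose dependencies are satisfied can finish without waiting
for a stage timer.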

>
> Suppose we recovered opens after transactions: we'd still have
> additional costs for last unlinks since we'd have to put the object on
> an on-disk queue of orphans until all open state is recovered.  See
> above.

There is no additional cost for an open-unlink pair, because an orphan
is needed after the unlink anyway. The only exception is the replay of a
pure unlink. But we need to keep orphans after unlinks in other cases
anyway, e.g. delayed recovery, and that overhead is nothing compared
with the time that can be lost waiting for everyone, as described above.
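
A small C model of the open-unlink case, assuming the usual POSIX
semantics where an unlinked but still-open object stays alive until its
last close. The struct and functions are illustrative only, not Lustre's
orphan handling.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative object model, not Lustre's data structures. */
struct md_obj_model {
    unsigned nlink;      /* directory entries pointing at the object */
    unsigned open_count; /* open handles still referencing it */
    bool orphan;         /* no name left, but still open */
    bool destroyed;
};

static void model_unlink(struct md_obj_model *o)
{
    assert(o->nlink > 0);
    o->nlink--;
    if (o->nlink == 0) {
        if (o->open_count > 0)
            o->orphan = true;    /* open-unlink: keep it until last close */
        else
            o->destroyed = true; /* pure unlink: no orphan needed */
    }
}

static void model_close(struct md_obj_model *o)
{
    assert(o->open_count > 0);
    o->open_count--;
    if (o->open_count == 0 && o->orphan) {
        o->orphan = false;
        o->destroyed = true;
    }
}
```

The orphan exists in the open-unlink case regardless of when opens are
recovered; only a pure unlink skips it.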

In fact, this is already slightly outside the scope of the original idea
about how to organize open replay. It is more related to server recovery
handling, version recovery and delayed recovery, and can be discussed
later, once the open replay changes on the client are settled; it will
be clearer then.

-- 
Mikhail Pershin
Staff Engineer
Lustre Group
Sun Microsystems, Inc.
