[Lustre-devel] Imperative Recovery - forcing failover server stop blocking


From: Chris Horn <hornc@cray.com>
To: lustre-devel@lists.lustre.org
Subject: [Lustre-devel] Imperative Recovery - forcing failover server stop blocking
Date: Mon, 22 Jun 2009 13:21:10 -0500
Message-ID: <4A3FCB96.4010201@cray.com>
In-Reply-To: <06b201c9f362$49015b20$db041160$@com>

Eric Barton wrote:
> Consider a utility that runs on a client to notify it to reconnect to a
> failover server, and which completes with a success status only when the
> client has reconnected successfully.
>   
Would this be equivalent to monitoring the "completed_clients" field of
the recovery_status proc file?
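
To make the question concrete, here is a minimal sketch of what such
monitoring might look like.  It assumes the file contains a line of the
form "completed_clients: N/M"; that format, and the helper name
parse_completed_clients, are assumptions for illustration and not
something confirmed in this thread.

```python
import re
from typing import Optional, Tuple


def parse_completed_clients(text: str) -> Optional[Tuple[int, int]]:
    """Return (completed, expected) from recovery_status-style text.

    Assumes a line shaped like "completed_clients: 3/8" (an assumed
    format). Returns None when no such line is present.
    """
    match = re.search(
        r"^completed_clients:\s*(\d+)\s*/\s*(\d+)\s*$",
        text,
        re.MULTILINE,
    )
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

A monitor would read the proc file periodically, pass its contents to
this function, and treat completed == expected as "all clients
reconnected".
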
> If you run this utility on all clients after starting a failover server,
> you can notify the server to close the recovery window once all instances have
> completed since that tells you that all clients are healthy and ready to
> participate in recovery.
>   
Won't the server have already begun replay by this time, since it has
received connections from all clients?  That would make our notification
to the server (to close the recovery window) redundant.
> Of course, you can decide to stop waiting and proceed with the server
> notification at any time you like.  You can base this decision on a timeout,
> knowing how many clients have reconnected successfully, or any other criterion
> you choose - i.e. you are now the effective arbiter of client health.
>   
Our initial plan was to do just this.  We would have a proxy running on
the bootnode to aggregate client responses.  It would wait up to some
configurable timeout period, say clnt_timeout.  If it received a number
of responses equal to obd->obd_max_recoverable_clients before the
timeout expired, it would notify the server to stop waiting immediately
(though this is the situation described in my previous comment, where
the server has already heard from every client).  If the timeout expired
first, it would notify the server to stop waiting anyway.  However, it
occurred to me that we would get the same behavior by simply tuning the
server's recovery window down to whatever value we were going to assign
clnt_timeout.  It seems we would be going to an awful lot of trouble
just to gain a tunable recovery_window.  I'm not sure if this is a
result of our choosing poor criteria for when to notify the server to
stop waiting, or if there is something else (a use case perhaps) that
I'm missing.
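
For reference, the proxy's decision logic described above can be
sketched roughly as follows.  This is only an illustration of the plan,
not existing code: wait_for_clients, poll_responses and the polling
interval are hypothetical names and choices, and "expected" stands in
for obd->obd_max_recoverable_clients.

```python
import time
from typing import Callable


def wait_for_clients(
    expected: int,
    clnt_timeout: float,
    poll_responses: Callable[[], int],
    interval: float = 0.1,
) -> str:
    """Block until every client has responded or clnt_timeout elapses.

    poll_responses is a hypothetical callback returning how many client
    responses the proxy has aggregated so far. The return value names
    the reason the wait ended; in both cases the caller would then
    notify the server to stop waiting.
    """
    deadline = time.monotonic() + clnt_timeout
    while True:
        if poll_responses() >= expected:
            return "all_clients"
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return "timeout"
        # Sleep briefly, but never past the deadline.
        time.sleep(min(interval, remaining))
```

Note that both return paths lead to the same action (notify the server),
and the "all_clients" path is the case where the server has already seen
every client.  That is why the scheme looks equivalent to a recovery
window of length clnt_timeout.
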
>     Cheers,
>               Eric
>
>
>
