[Patch 07/12] Chunk: retry initial CLD session open

* [Patch 07/12] Chunk: retry initial CLD session open
From: Pete Zaitcev @ 2010-04-18  4:41 UTC
  To: Jeff Garzik; +Cc: Project Hail List

This was an error in the conversion to ncld. In the cldc code, we
kick the state machine and the natural retries do the rest. Any
failures occur there. But in ncld the original kick can fail too.

Five retries give the CLD server time to reboot. If it's still down,
clients refuse to start. This may or may not be a good idea.
We may yet change the retries to be infinite, but for now it's
better if builds terminate somehow in case of unexpected problems.

Signed-off-by: Pete Zaitcev <zaitcev@redhat.com>

---
 server/cldu.c |   12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

commit 44cdb98d2cceb2f4e081db2ee38ec60f1c1a8d8d
Author: Master <zaitcev@lembas.zaitcev.lan>
Date:   Sat Apr 17 19:50:06 2010 -0600

    Retry the initial connection to the CLD server.

diff --git a/server/cldu.c b/server/cldu.c
index fafcc3b..58edf4b 100644
--- a/server/cldu.c
+++ b/server/cldu.c
@@ -471,6 +471,8 @@ int cld_begin(const char *thishost, uint32_t nid, char *infopath,
 {
 	static struct cld_session *cs = &ses;
 	struct server_poll *sp;
+	int retry_cnt;
+	int newactive;
 
 	if (!nid)
 		return 0;
@@ -540,9 +542,15 @@ int cld_begin(const char *thishost, uint32_t nid, char *infopath,
 	 * -- Actually, it only works when recovering from CLD failure.
 	 *    Thereafter, any slave CLD redirects us to the master.
 	 */
-	if (cldu_set_cldc(cs, 0)) {
+	newactive = 0;
+	retry_cnt = 0;
+	for (;;) {
+		if (!cldu_set_cldc(cs, newactive))
+			break;
 		/* Already logged error */
-		goto err_net;
+		if (++retry_cnt == 5)
+			goto err_net;
+		newactive = cldu_nextactive(cs);
 	}
 
 	return 0;
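
The loop in the hunk above can be read in isolation as a bounded retry
that moves to another host after each failure. What follows is only a
minimal sketch of that shape, not the chunkd code: struct demo_session,
demo_set_cldc() and demo_nextactive() are hypothetical stand-ins for
struct cld_session, cldu_set_cldc() and cldu_nextactive(), and the
round-robin host selection is an assumption. Because the counter is
tested with ++retry_cnt == 5, the bound works out to five attempts in
total, the first one included.

```c
#define MAX_OPEN_ATTEMPTS 5	/* same bound as the patch */

/* Hypothetical stand-in for struct cld_session. */
struct demo_session {
	int nhosts;	/* number of known CLD hosts */
	int active;	/* index of the host tried most recently */
	int up_host;	/* index of the only host that answers, or -1 */
	int attempts;	/* open attempts made so far */
};

/* Stand-in for cldu_set_cldc(): 0 on success, nonzero on failure. */
int demo_set_cldc(struct demo_session *cs, int newactive)
{
	cs->attempts++;
	cs->active = newactive;
	return (newactive == cs->up_host) ? 0 : -1;
}

/* Stand-in for cldu_nextactive(): assumed to be plain round-robin. */
int demo_nextactive(const struct demo_session *cs)
{
	return (cs->active + 1) % cs->nhosts;
}

/* Same control flow as the patched cld_begin(): try, count, move on. */
int demo_open_with_retry(struct demo_session *cs)
{
	int newactive = 0;
	int retry_cnt = 0;

	for (;;) {
		if (!demo_set_cldc(cs, newactive))
			return 0;		/* session opened */
		if (++retry_cnt == MAX_OPEN_ATTEMPTS)
			return -1;		/* give up, like goto err_net */
		newactive = demo_nextactive(cs);
	}
}
```

With three hosts of which only the third answers, the sketch succeeds
on the third attempt. With no host answering, it gives up after the
fifth.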

* Re: [Patch 07/12] Chunk: retry initial CLD session open
From: Jeff Garzik @ 2010-04-19  3:58 UTC
  To: Pete Zaitcev; +Cc: Project Hail List

On 04/18/2010 12:41 AM, Pete Zaitcev wrote:
> This was an error in the conversion to ncld. In the cldc code, we
> kick the state machine and the natural retries do the rest. Any
> failures occur there. But in ncld the original kick can fail too.
>
> Five retries give the CLD server time to reboot. If it's still down,
> clients refuse to start. This may or may not be a good idea.
> We may yet change the retries to be infinite, but for now it's
> better if builds terminate somehow in case of unexpected problems.
>
> Signed-off-by: Pete Zaitcev <zaitcev@redhat.com>
>
> ---
>   server/cldu.c |   12 ++++++++++--
>   1 file changed, 10 insertions(+), 2 deletions(-)
>
> commit 44cdb98d2cceb2f4e081db2ee38ec60f1c1a8d8d
> Author: Master <zaitcev@lembas.zaitcev.lan>
> Date:   Sat Apr 17 19:50:06 2010 -0600
>
>      Retry the initial connection to the CLD server.

In the short term, this is acceptable.

In the medium term, this is a protocol detail that should be handled 
somewhere in libcldc.  We want all applications to behave the same way, 
including the method by which they attempt to contact a master.

Because there could be multiple CLD servers, you cannot think of retries
in the context of a single server.  This is crucial for work on the
#replica branch, but it is also somewhat relevant to #master, because we
might have multiple servers listed in SRV records as fallbacks from
which to choose.

You don't want each application implementing this logic, because we want
to enforce some level of predictability in master-seeking behavior, and
in making decisions about when contact attempts for -all- servers
should cease, as opposed to contact attempts for a -single- server.  You
don't want it to take 30 minutes to try all servers in a cluster,
retrying a number of times on server A, then moving on to server B, etc.

	Jeff
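
One way to picture the policy described in the reply above is a single
helper that owns the master-seeking loop and enforces one budget across
-all- servers, visiting them in rotation instead of exhausting retries
on server A before moving to server B. This is only a sketch under
stated assumptions: seek_master(), struct seek_policy, try_open_fn and
the demo_* helpers are hypothetical names and are not part of libcldc.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical callback: try to open a session, 0 on success. */
typedef int (*try_open_fn)(const char *host, void *arg);

/* Hypothetical policy: one cap shared by every server in the list. */
struct seek_policy {
	int max_total_attempts;
};

/*
 * Rotate through the server list, one attempt per server per pass,
 * until a server accepts or the shared budget is spent.
 * Returns the index of the accepting server, or -1.
 */
int seek_master(const char *const *hosts, size_t nhosts,
		const struct seek_policy *pol,
		try_open_fn try_open, void *arg)
{
	size_t i = 0;
	int n;

	if (nhosts == 0)
		return -1;
	for (n = 0; n < pol->max_total_attempts; n++) {
		if (try_open(hosts[i], arg) == 0)
			return (int)i;
		i = (i + 1) % nhosts;	/* next server, wrapping around */
	}
	return -1;
}

/* Demo probe: only up_host answers; calls counts the attempts made. */
struct demo_probe {
	const char *up_host;
	int calls;
};

int demo_try_open(const char *host, void *arg)
{
	struct demo_probe *p = arg;

	p->calls++;
	return (p->up_host && strcmp(host, p->up_host) == 0) ? 0 : -1;
}
```

Because the budget is counted over the whole list, the worst-case time
to give up does not grow with the number of servers, and every
application that calls the helper gives up at the same point.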
