From mboxrd@z Thu Jan 1 00:00:00 1970
From: Pete Zaitcev
Subject: [Patch 07/12] Chunk: retry initial CLD session open
Date: Sat, 17 Apr 2010 22:41:49 -0600
Message-ID: <20100417224149.69ab366e@redhat.com>
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Return-path:
Sender: hail-devel-owner@vger.kernel.org
List-ID:
Content-Type: text/plain; charset="us-ascii"
To: Jeff Garzik
Cc: Project Hail List

This was an error in the conversion to ncld. In the cldc code, we kick
the state machine and the natural retries do the rest. Any failures
occur there. But in ncld the original kick can fail too.

Five retries give the CLD server time to reboot. If it's down, then
clients refuse to start. This may be a bad idea, or may be not. We may
yet change the retries to be infinite, but for now it's better if builds
terminate somehow in case of unexpected problems.

Signed-off-by: Pete Zaitcev

---
 server/cldu.c |   12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

commit 44cdb98d2cceb2f4e081db2ee38ec60f1c1a8d8d
Author: Master
Date:   Sat Apr 17 19:50:06 2010 -0600

    Retry the initial connection to the CLD server.

diff --git a/server/cldu.c b/server/cldu.c
index fafcc3b..58edf4b 100644
--- a/server/cldu.c
+++ b/server/cldu.c
@@ -471,6 +471,8 @@ int cld_begin(const char *thishost, uint32_t nid, char *infopath,
 {
 	static struct cld_session *cs = &ses;
 	struct server_poll *sp;
+	int retry_cnt;
+	int newactive;
 
 	if (!nid)
 		return 0;
@@ -540,9 +542,15 @@ int cld_begin(const char *thishost, uint32_t nid, char *infopath,
 	 * -- Actually, it only works when recovering from CLD failure.
 	 * Thereafter, any slave CLD redirects us to the master.
 	 */
-	if (cldu_set_cldc(cs, 0)) {
+	newactive = 0;
+	retry_cnt = 0;
+	for (;;) {
+		if (!cldu_set_cldc(cs, newactive))
+			break;
 		/* Already logged error */
-		goto err_net;
+		if (++retry_cnt == 5)
+			goto err_net;
+		newactive = cldu_nextactive(cs);
 	}
 
 	return 0;
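
For readers without the tree at hand, the bounded retry can be tried out
in isolation with the sketch below. It is not the chunkd code:
struct demo_session, demo_open(), demo_next() and demo_begin() are
hypothetical stand-ins for struct cld_session, cldu_set_cldc(),
cldu_nextactive() and the loop in cld_begin(). The round-robin choice of
the next host is an assumption made for illustration only, and the
sketch does no waiting between attempts.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_OPEN_TRIES 5	/* same bound as in the patch */

/* Hypothetical stand-in for struct cld_session. */
struct demo_session {
	size_t nhosts;		/* number of known servers */
	size_t active;		/* index of the host tried last */
	int fail_budget;	/* demo knob: opens that fail before one succeeds */
	int attempts;		/* number of calls made to demo_open() */
};

/* Stand-in for cldu_set_cldc(): 0 on success, nonzero on failure. */
static int demo_open(struct demo_session *s, size_t host)
{
	s->active = host;
	s->attempts++;
	if (s->fail_budget > 0) {
		s->fail_budget--;
		return -1;
	}
	return 0;
}

/* Stand-in for cldu_nextactive(); round-robin is assumed here. */
static size_t demo_next(const struct demo_session *s)
{
	return (s->active + 1) % s->nhosts;
}

/* Returns 0 on success, -1 once MAX_OPEN_TRIES opens have failed. */
static int demo_begin(struct demo_session *s)
{
	size_t host = 0;
	int retry_cnt = 0;

	assert(s->nhosts > 0);
	for (;;) {
		if (!demo_open(s, host))
			return 0;
		if (++retry_cnt == MAX_OPEN_TRIES)
			return -1;	/* give up, like goto err_net */
		host = demo_next(s);
	}
}
```

With three hosts and two failing opens, demo_begin() succeeds on the
third attempt. If every open fails, it gives up after exactly five
attempts, which corresponds to the case where the client refuses to
start.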