From: Pete Zaitcev <zaitcev@redhat.com>
To: Project Hail List <hail-devel@vger.kernel.org>
Subject: Post-XDR CLD cannot keep session up
Date: Sun, 7 Feb 2010 00:00:47 -0700
Message-ID: <20100207000047.46f77de0@redhat.com>

Hi, Jeff & Colin:

It looks like you broke something in CLD; I'm not sure whether it is in
the server or the client. There are two possibly related bugs. But first,
here are the messages (chunkd is run with -D). Note that I have 2 servers
listed in DNS (both on port 4499), but only one is up.

Feb  6 23:36:10 hitlain cld[1934]: databases up
Feb  6 23:36:10 hitlain cld[1934]: Listening on :: port 4499
Feb  6 23:36:10 hitlain cld[1934]: initialized: verbose 0
Feb  6 23:37:10 hitlain chunkd[1967]: Verbose debug output enabled
Feb  6 23:37:10 hitlain chunkd[1968]: cldc_saveaddr: found CLD host hitlain.zaitcev.lan prio 10 weight 50
Feb  6 23:37:10 hitlain chunkd[1968]: cldc_saveaddr: found CLD host elanor.zaitcev.lan prio 10 weight 50
Feb  6 23:37:10 hitlain chunkd[1968]: Selected CLD host hitlain.zaitcev.lan port 4499
Feb  6 23:37:10 hitlain chunkd[1968]: Listening on host :: port 8082
Feb  6 23:37:10 hitlain chunkd[1968]: initialized
Feb  6 23:37:10 hitlain chunkd[1968]: New CLD session created, sid 05B521BF4071EBA2
Feb  6 23:37:10 hitlain chunkd[1968]: CLD file "/chunk-default/2" created
Feb  6 23:37:10 hitlain chunkd[1968]: CLD file "/chunk-default/2" written
Feb  6 23:39:45 hitlain chunkd[1968]: Session failed, sid 05B521BF4071EBA2
Feb  6 23:39:45 hitlain chunkd[1968]: Selected CLD host elanor.zaitcev.lan port 4499
Feb  6 23:39:45 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
Feb  6 23:39:50 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
Feb  6 23:39:55 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
Feb  6 23:40:00 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
Feb  6 23:40:05 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
Feb  6 23:40:10 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
Feb  6 23:40:15 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
Feb  6 23:40:21 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
Feb  6 23:40:26 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
Feb  6 23:40:31 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
Feb  6 23:40:36 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
Feb  6 23:40:41 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
Feb  6 23:40:46 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
Feb  6 23:40:51 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
Feb  6 23:40:56 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
Feb  6 23:41:01 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
Feb  6 23:41:06 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
Feb  6 23:41:11 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
Feb  6 23:41:16 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
Feb  6 23:41:21 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
Feb  6 23:41:26 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
Feb  6 23:41:31 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
Feb  6 23:41:36 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
Feb  6 23:41:41 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
Feb  6 23:41:46 hitlain chunkd[1968]: New CLD session creation failed: 17
Feb  6 23:41:46 hitlain chunkd[1968]: Session failed, sid 6C5A5E5D4D8F2270
Feb  6 23:41:46 hitlain chunkd[1968]: Selected CLD host hitlain.zaitcev.lan port 4499
Feb  6 23:41:46 hitlain chunkd[1968]: New CLD session created, sid 4E2A8ED73878F038
Feb  6 23:41:46 hitlain chunkd[1968]: CLD file "/chunk-default/2" created
Feb  6 23:41:46 hitlain chunkd[1968]: CLD lock(/chunk-default/2) failed: 11

So, first regression: the session ALWAYS fails, for no reason I can see.
It takes 2 minutes 35 seconds, as you can observe from the "Session failed"
message (session created at 23:37:10, failed at 23:39:45).
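
For anyone who wants to check the interval against their own logs, here
is a minimal sketch that computes the session lifetime from the two
relevant syslog lines. The helper names (stamp, session_lifetime) are
made up for illustration, and the year is an assumption passed in by
hand, since syslog timestamps omit it.

```python
from datetime import datetime

# Two lines copied verbatim from the log above.
LOG = """\
Feb  6 23:37:10 hitlain chunkd[1968]: New CLD session created, sid 05B521BF4071EBA2
Feb  6 23:39:45 hitlain chunkd[1968]: Session failed, sid 05B521BF4071EBA2
"""


def stamp(line, year=2010):
    """Parse the leading syslog timestamp; the year is supplied by the caller."""
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")


def session_lifetime(log, sid):
    """Return the time between creation and failure of session `sid`."""
    created = failed = None
    for line in log.splitlines():
        if not line.endswith("sid " + sid):
            continue
        if "New CLD session created" in line:
            created = stamp(line)
        elif "Session failed" in line:
            failed = stamp(line)
    if created is None or failed is None:
        raise ValueError(f"session {sid} is not fully logged")
    return failed - created


if __name__ == "__main__":
    print(session_lifetime(LOG, "05B521BF4071EBA2"))  # 0:02:35
```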

Second regression: the locks of a failed session are not removed (this is
what code 11 is). Once the original session fails, the CLD client cannot
re-acquire the lock, ever, until the daemon is restarted.
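
To make the suspected failure mode concrete, here is a toy model of it.
This is not CLD's code: ToyLockTable, replay and the release_on_expiry
switch are all hypothetical, and it assumes the first session held the
lock on /chunk-default/2 (which "re-acquire" implies). It only shows
that if a dead session's locks are not dropped, the replacement session
can never take the lock.

```python
class ToyLockTable:
    """Toy model of exclusive per-path locks owned by sessions (not CLD's real code)."""

    def __init__(self, release_on_expiry):
        self.release_on_expiry = release_on_expiry
        self.owner = {}  # path -> id of the session holding the lock

    def lock(self, sid, path):
        """Return True if `sid` holds the lock afterwards, False if another session does."""
        holder = self.owner.setdefault(path, sid)
        return holder == sid

    def expire(self, sid):
        """Mark session `sid` dead; drop its locks only if configured to do so."""
        if self.release_on_expiry:
            self.owner = {p: s for p, s in self.owner.items() if s != sid}


def replay(release_on_expiry):
    """Replay the sequence from the log: old session locks, dies, new session locks."""
    table = ToyLockTable(release_on_expiry)
    table.lock("05B521BF4071EBA2", "/chunk-default/2")
    table.expire("05B521BF4071EBA2")
    return table.lock("4E2A8ED73878F038", "/chunk-default/2")


if __name__ == "__main__":
    print("locks released on expiry:", replay(True))
    print("locks leaked on expiry:  ", replay(False))
```

With release_on_expiry set, the second session gets the lock; without
it, the second lock attempt fails, which matches the behaviour above.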

This definitely used to work before XDR, and it only takes 3 minutes
to fail. Do you guys run and use chunkd, or do you just do "make check" and
consider it done? I thought we talked about having virtually permanent
cells and long-living CLD clients, because this sort of thing keeps
cropping up.

Cheers,
-- Pete
