From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jeff Garzik
Subject: Re: Post-XDR CLD cannot keep session up
Date: Sun, 07 Feb 2010 17:28:05 -0500
Message-ID: <4B6F3E75.4040401@garzik.org>
References: <20100207000047.46f77de0@redhat.com>
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Return-path:
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=gamma;
        h=domainkey-signature:received:received:sender:message-id:date:from
         :user-agent:mime-version:to:cc:subject:references:in-reply-to
         :content-type:content-transfer-encoding;
        bh=pUOFZhx/kMGS4RIuiasaMbKgars876/AQQnqsgbAa2E=;
        b=mmWis5Yv2eOJTbTuNzIw38HbY7e3b1oFUxPd+EDCeWoNXjHY4LzMKOt0m7wh7+UbhQ
         zvcKQ2DXAF4/wA5YivBWlzNrYScPfme5BfatLQ8l3hAiXuGXXaG0le3/RgNdUJID1IBo
         7FdzfT0nA9pAvdexVtr/MoQz83eAbhXpNw43A=
In-Reply-To: <20100207000047.46f77de0@redhat.com>
Sender: hail-devel-owner@vger.kernel.org
List-ID:
Content-Type: text/plain; charset="us-ascii"; format="flowed"
To: Pete Zaitcev
Cc: Project Hail List

On 02/07/2010 02:00 AM, Pete Zaitcev wrote:
> Hi, Jeff & Colin:
>
> It looks like you broke something in CLD, not sure if server or client.
> There are two possibly related bugs. But first, here's the messages
> (The chunkd is run with -D). Note that I have 2 servers listed in DNS
> (both on port 4499), but only one is up.
>
> Feb 6 23:36:10 hitlain cld[1934]: databases up
> Feb 6 23:36:10 hitlain cld[1934]: Listening on :: port 4499
> Feb 6 23:36:10 hitlain cld[1934]: initialized: verbose 0
> Feb 6 23:37:10 hitlain chunkd[1967]: Verbose debug output enabled
> Feb 6 23:37:10 hitlain chunkd[1968]: cldc_saveaddr: found CLD host hitlain.zaitcev.lan prio 10 weight 50
> Feb 6 23:37:10 hitlain chunkd[1968]: cldc_saveaddr: found CLD host elanor.zaitcev.lan prio 10 weight 50
> Feb 6 23:37:10 hitlain chunkd[1968]: Selected CLD host hitlain.zaitcev.lan port 4499
> Feb 6 23:37:10 hitlain chunkd[1968]: Listening on host :: port 8082
> Feb 6 23:37:10 hitlain chunkd[1968]: initialized
> Feb 6 23:37:10 hitlain chunkd[1968]: New CLD session created, sid 05B521BF4071EBA2
> Feb 6 23:37:10 hitlain chunkd[1968]: CLD file "/chunk-default/2" created
> Feb 6 23:37:10 hitlain chunkd[1968]: CLD file "/chunk-default/2" written
> Feb 6 23:39:45 hitlain chunkd[1968]: Session failed, sid 05B521BF4071EBA2
> Feb 6 23:39:45 hitlain chunkd[1968]: Selected CLD host elanor.zaitcev.lan port 4499
> Feb 6 23:39:45 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
> Feb 6 23:39:50 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
> Feb 6 23:39:55 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
> Feb 6 23:40:00 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
> Feb 6 23:40:05 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
> Feb 6 23:40:10 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
> Feb 6 23:40:15 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
> Feb 6 23:40:21 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
> Feb 6 23:40:26 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
> Feb 6 23:40:31 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
> Feb 6 23:40:36 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
> Feb 6 23:40:41 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
> Feb 6 23:40:46 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
> Feb 6 23:40:51 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
> Feb 6 23:40:56 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
> Feb 6 23:41:01 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
> Feb 6 23:41:06 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
> Feb 6 23:41:11 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
> Feb 6 23:41:16 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
> Feb 6 23:41:21 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
> Feb 6 23:41:26 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
> Feb 6 23:41:31 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
> Feb 6 23:41:36 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
> Feb 6 23:41:41 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
> Feb 6 23:41:46 hitlain chunkd[1968]: New CLD session creation failed: 17
> Feb 6 23:41:46 hitlain chunkd[1968]: Session failed, sid 6C5A5E5D4D8F2270
> Feb 6 23:41:46 hitlain chunkd[1968]: Selected CLD host hitlain.zaitcev.lan port 4499
> Feb 6 23:41:46 hitlain chunkd[1968]: New CLD session created, sid 4E2A8ED73878F038
> Feb 6 23:41:46 hitlain chunkd[1968]: CLD file "/chunk-default/2" created
> Feb 6 23:41:46 hitlain chunkd[1968]: CLD lock(/chunk-default/2) failed: 11
>
> So, first regression: session ALWAYS fails, for no reason I can see.
> It takes 2 minutes 35 seconds, as you can observe from the "Session failed"
> message.
>
> Second regression: locks of failed session are not removed (this is
> what code 11 is). Once the original session fails, CLD client cannot
> re-acquire the lock, ever, until the daemon is restarted.

Thanks for the report. That is definitely annoying... I wonder if it is
related to the ping_open bug I fixed...

> This definitely used to work before the XDR, and it only takes 3 minutes
> to fail. Do you guys run and use chunkd or you just do "make check" and
> consider it done? I thought we talked about having virtually permanent
> cells and long-living CLD clients, because this sort of thing keeps
> cropping up.

My local one (shamefully not using SRV, like I should) is pretty
outdated, back to the latest released tarballs, since I dislike having
to lose data on upgrade ;-)

	Jeff
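
The 2 minutes 35 seconds figure in the quoted report can be checked against the log timestamps. What follows is only a minimal Python sketch, not anything from the Hail tree: the helper names parse_ts and session_lifetime are hypothetical, and the year 2010 is an assumption taken from the Date header, since syslog lines carry no year.

```python
from datetime import datetime

# Two lines copied from the quoted log. Syslog omits the year, so 2010
# (from the mail's Date header) is supplied by hand; that is an assumption.
LOG = """\
Feb 6 23:37:10 hitlain chunkd[1968]: New CLD session created, sid 05B521BF4071EBA2
Feb 6 23:39:45 hitlain chunkd[1968]: Session failed, sid 05B521BF4071EBA2
"""


def parse_ts(line, year=2010):
    """Return the timestamp at the start of a syslog line (hypothetical helper)."""
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")


def session_lifetime(log, sid):
    """Seconds between 'session created' and 'Session failed' for one sid."""
    created = failed = None
    for line in log.splitlines():
        if sid not in line:
            continue
        if "session created" in line:
            created = parse_ts(line)
        elif "Session failed" in line:
            failed = parse_ts(line)
    if created is None or failed is None:
        raise ValueError(f"incomplete session record for {sid}")
    return int((failed - created).total_seconds())


secs = session_lifetime(LOG, "05B521BF4071EBA2")
print(f"{secs // 60} min {secs % 60} s")  # prints "2 min 35 s"
```

Running the same function over a longer capture would show whether the session always dies after the same interval, which would point at a timeout rather than a random failure.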