From: Jeff Garzik <jeff@garzik.org>
To: Project Hail List <hail-devel@vger.kernel.org>
Cc: Pete Zaitcev <zaitcev@redhat.com>
Subject: Re: Post-XDR CLD cannot keep session up
Date: Tue, 09 Feb 2010 07:06:39 -0500
Message-ID: <4B714FCF.1060708@garzik.org>
In-Reply-To: <4B713A38.1010106@garzik.org>
On 02/09/2010 05:34 AM, Jeff Garzik wrote:
> On 02/07/2010 02:00 AM, Pete Zaitcev wrote:
>> Hi, Jeff & Colin:
>>
>> It looks like you broke something in CLD, not sure if server or client.
>> There are two possibly related bugs. But first, here are the messages
>> (chunkd is run with -D). Note that I have 2 servers listed in DNS
>> (both on port 4499), but only one is up.
>>
>> Feb 6 23:36:10 hitlain cld[1934]: databases up
>> Feb 6 23:36:10 hitlain cld[1934]: Listening on :: port 4499
>> Feb 6 23:36:10 hitlain cld[1934]: initialized: verbose 0
>> Feb 6 23:37:10 hitlain chunkd[1967]: Verbose debug output enabled
>> Feb 6 23:37:10 hitlain chunkd[1968]: cldc_saveaddr: found CLD host hitlain.zaitcev.lan prio 10 weight 50
>> Feb 6 23:37:10 hitlain chunkd[1968]: cldc_saveaddr: found CLD host elanor.zaitcev.lan prio 10 weight 50
>> Feb 6 23:37:10 hitlain chunkd[1968]: Selected CLD host hitlain.zaitcev.lan port 4499
>> Feb 6 23:37:10 hitlain chunkd[1968]: Listening on host :: port 8082
>> Feb 6 23:37:10 hitlain chunkd[1968]: initialized
>> Feb 6 23:37:10 hitlain chunkd[1968]: New CLD session created, sid 05B521BF4071EBA2
>> Feb 6 23:37:10 hitlain chunkd[1968]: CLD file "/chunk-default/2" created
>> Feb 6 23:37:10 hitlain chunkd[1968]: CLD file "/chunk-default/2" written
>> Feb 6 23:39:45 hitlain chunkd[1968]: Session failed, sid 05B521BF4071EBA2
>> Feb 6 23:39:45 hitlain chunkd[1968]: Selected CLD host elanor.zaitcev.lan port 4499
>> Feb 6 23:39:45 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:39:50 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:39:55 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:40:00 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:40:05 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:40:10 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:40:15 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:40:21 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:40:26 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:40:31 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:40:36 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:40:41 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:40:46 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:40:51 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:40:56 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:41:01 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:41:06 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:41:11 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:41:16 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:41:21 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:41:26 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:41:31 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:41:36 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:41:41 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:41:46 hitlain chunkd[1968]: New CLD session creation failed: 17
>> Feb 6 23:41:46 hitlain chunkd[1968]: Session failed, sid 6C5A5E5D4D8F2270
>> Feb 6 23:41:46 hitlain chunkd[1968]: Selected CLD host hitlain.zaitcev.lan port 4499
>> Feb 6 23:41:46 hitlain chunkd[1968]: New CLD session created, sid 4E2A8ED73878F038
>> Feb 6 23:41:46 hitlain chunkd[1968]: CLD file "/chunk-default/2" created
>> Feb 6 23:41:46 hitlain chunkd[1968]: CLD lock(/chunk-default/2) failed: 11
>>
>> So, first regression: session ALWAYS fails, for no reason I can see.
>> It takes 2 minutes 35 seconds, as you can observe from the "Session
>> failed" message.
>
>
> Well, session_timeout() is not being executed like it should be, by the
> core timer code. This could be memory corruption, a libtimer bug, or
> something else entirely. I can observe session_timeout() being updated
> to a new timer expiration, and then never being called again.

There is definitely something strange going on in the timer routines
that is causing session_timeout() not to run, even though it re-adds
itself to the timer list using cld_timer_add(). fprintf() debug output
in cld_timer_add() and cld_timers_run() is yielding unexpected results.
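
For anyone following along, the pattern in question looks roughly like
the sketch below: a callback that re-arms itself from inside the timer
run loop, with fprintf() tracing on both the add and run paths. This is
a toy model only. The toy_* names, the fake clock and the 5-second
interval are all made up for illustration; it is not the CLD timer code
and does not claim to reproduce the bug.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical toy types for illustration; NOT the CLD structures. */
struct toy_timer {
	struct toy_timer *next;
	long expires;                     /* absolute fake time */
	void (*cb)(struct toy_timer *);
	int fired;                        /* number of times cb has run */
};

struct toy_timer_list {
	struct toy_timer *head;           /* sorted, soonest expiry first */
	long now;                         /* fake clock, keeps the demo deterministic */
};

#define TOY_INTERVAL 5                    /* assumed re-arm interval */

static struct toy_timer_list toy_list;

/* Insert a timer in expiry order and trace the call. */
static void toy_timer_add(struct toy_timer_list *tl, struct toy_timer *t,
			  long expires)
{
	struct toy_timer **pp = &tl->head;

	assert(expires > tl->now);        /* a re-arm must land in the future */
	t->expires = expires;
	while (*pp && (*pp)->expires <= expires)
		pp = &(*pp)->next;
	t->next = *pp;
	*pp = t;
	fprintf(stderr, "add: timer %p expires %ld (now %ld)\n",
		(void *)t, expires, tl->now);
}

/* Run every timer whose expiry has passed; returns how many ran. */
static int toy_timers_run(struct toy_timer_list *tl)
{
	int n = 0;

	while (tl->head && tl->head->expires <= tl->now) {
		struct toy_timer *t = tl->head;

		tl->head = t->next;       /* unlink before the callback so it may re-add */
		t->next = NULL;
		fprintf(stderr, "run: timer %p fired at %ld\n",
			(void *)t, tl->now);
		t->fired++;
		t->cb(t);
		n++;
	}
	return n;
}

/* Self-rearming callback: queues itself again one interval from now. */
static void toy_session_timeout(struct toy_timer *t)
{
	toy_timer_add(&toy_list, t, toy_list.now + TOY_INTERVAL);
}

/* Advance the fake clock by `ticks` seconds; return how often the timer fired. */
int toy_demo(long ticks)
{
	struct toy_timer t = { .cb = toy_session_timeout };
	long i;

	toy_list.head = NULL;
	toy_list.now = 0;
	toy_timer_add(&toy_list, &t, TOY_INTERVAL);
	for (i = 0; i < ticks; i++) {
		toy_list.now++;
		toy_timers_run(&toy_list);
	}
	toy_list.head = NULL;             /* t lives on the stack; drop the reference */
	return t.fired;
}
```

Calling toy_demo(20) from a main() should show every "add:" line
followed later by a matching "run:" line. An "add:" with no later "run:"
for the same timer is the symptom described above.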

More debugging after sleep.

	Jeff

Thread overview: 5+ messages
2010-02-07 7:00 Post-XDR CLD cannot keep session up Pete Zaitcev
2010-02-07 22:28 ` Jeff Garzik
2010-02-09 10:34 ` Jeff Garzik
2010-02-09 12:06 ` Jeff Garzik [this message]
2010-02-09 16:12 ` Pete Zaitcev