From: Jeff Garzik <jeff@garzik.org>
To: Project Hail List <hail-devel@vger.kernel.org>
Cc: Pete Zaitcev <zaitcev@redhat.com>
Subject: Re: Post-XDR CLD cannot keep session up
Date: Tue, 09 Feb 2010 07:06:39 -0500
Message-ID: <4B714FCF.1060708@garzik.org>
In-Reply-To: <4B713A38.1010106@garzik.org>

On 02/09/2010 05:34 AM, Jeff Garzik wrote:
> On 02/07/2010 02:00 AM, Pete Zaitcev wrote:
>> Hi, Jeff & Colin:
>>
>> It looks like you broke something in CLD, not sure whether in the server
>> or the client. There are two possibly related bugs. But first, here are
>> the messages (chunkd is run with -D). Note that I have 2 servers listed
>> in DNS (both on port 4499), but only one is up.
>>
>> Feb 6 23:36:10 hitlain cld[1934]: databases up
>> Feb 6 23:36:10 hitlain cld[1934]: Listening on :: port 4499
>> Feb 6 23:36:10 hitlain cld[1934]: initialized: verbose 0
>> Feb 6 23:37:10 hitlain chunkd[1967]: Verbose debug output enabled
>> Feb 6 23:37:10 hitlain chunkd[1968]: cldc_saveaddr: found CLD host hitlain.zaitcev.lan prio 10 weight 50
>> Feb 6 23:37:10 hitlain chunkd[1968]: cldc_saveaddr: found CLD host elanor.zaitcev.lan prio 10 weight 50
>> Feb 6 23:37:10 hitlain chunkd[1968]: Selected CLD host hitlain.zaitcev.lan port 4499
>> Feb 6 23:37:10 hitlain chunkd[1968]: Listening on host :: port 8082
>> Feb 6 23:37:10 hitlain chunkd[1968]: initialized
>> Feb 6 23:37:10 hitlain chunkd[1968]: New CLD session created, sid 05B521BF4071EBA2
>> Feb 6 23:37:10 hitlain chunkd[1968]: CLD file "/chunk-default/2" created
>> Feb 6 23:37:10 hitlain chunkd[1968]: CLD file "/chunk-default/2" written
>> Feb 6 23:39:45 hitlain chunkd[1968]: Session failed, sid 05B521BF4071EBA2
>> Feb 6 23:39:45 hitlain chunkd[1968]: Selected CLD host elanor.zaitcev.lan port 4499
>> Feb 6 23:39:45 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:39:50 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:39:55 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:40:00 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:40:05 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:40:10 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:40:15 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:40:21 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:40:26 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:40:31 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:40:36 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:40:41 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:40:46 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:40:51 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:40:56 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:41:01 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:41:06 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:41:11 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:41:16 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:41:21 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:41:26 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:41:31 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:41:36 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:41:41 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:41:46 hitlain chunkd[1968]: New CLD session creation failed: 17
>> Feb 6 23:41:46 hitlain chunkd[1968]: Session failed, sid 6C5A5E5D4D8F2270
>> Feb 6 23:41:46 hitlain chunkd[1968]: Selected CLD host hitlain.zaitcev.lan port 4499
>> Feb 6 23:41:46 hitlain chunkd[1968]: New CLD session created, sid 4E2A8ED73878F038
>> Feb 6 23:41:46 hitlain chunkd[1968]: CLD file "/chunk-default/2" created
>> Feb 6 23:41:46 hitlain chunkd[1968]: CLD lock(/chunk-default/2) failed: 11
>>
>> So, first regression: session ALWAYS fails, for no reason I can see.
>> It takes 2 minutes 35 seconds, as you can observe from the
>> "Session failed" message.
>
>
> Well, session_timeout() is not being executed by the core timer code as
> it should be. This could be memory corruption, a libtimer bug, or
> something else entirely. I can observe the session_timeout() timer being
> re-armed with a new expiration, after which it is never called again.

There is definitely something strange going on in the timer routines
that is causing session_timeout() not to run, even though it re-adds
itself to the timer list using cld_timer_add(). fprintf() debug output
in cld_timer_add() and cld_timers_run() is yielding unexpected results.
More debugging after sleep.

Jeff

Thread overview: 5+ messages
2010-02-07 7:00 Post-XDR CLD cannot keep session up Pete Zaitcev
2010-02-07 22:28 ` Jeff Garzik
2010-02-09 10:34 ` Jeff Garzik
2010-02-09 12:06 ` Jeff Garzik [this message]
2010-02-09 16:12 ` Pete Zaitcev