From: Stefan Bader
Subject: Re: xenbus and the message of doom
Date: Thu, 15 Dec 2011 20:45:55 +0100
Message-ID: <4EEA4E73.9030002@canonical.com>
References: <4EEA4877.8010307@canonical.com> <20111215193942.GA7640@andromeda.dapyr.net>
In-Reply-To: <20111215193942.GA7640@andromeda.dapyr.net>
To: Konrad Rzeszutek Wilk
Cc: Olaf Hering, "xen-devel@lists.xensource.com", Konrad Rzeszutek Wilk
List-Id: xen-devel@lists.xenproject.org

On 15.12.2011 20:39, Konrad Rzeszutek Wilk wrote:
> On Thu, Dec 15, 2011 at 08:20:23PM +0100, Stefan Bader wrote:
>> I was investigating a bug report[1] about newer kernels (>3.1) not booting as
>> HVM guests on Amazon EC2. For some reason git bisect did give me some pain, but
>> it led me at least close and with some crash dump data I think I figured out
>> the problem.
>
> Stefan, thanks for finding this.
>

I realize I wanted to add the reference to our bug report but completely forgot
to do so. So just for completeness: http://bugs.launchpad.net/bugs/901305

> Olaf, what are your thoughts? Should I prep a patch to revert the patch
> below and then we can work on 3.3 and rethink this in 3.3? The clock is
> ticking for 3.2 and there is not much runway to fix stuff.
>
>>
>> commit ddacf5ef684a655abe2bb50c4b2a5b72ae0d5e05
>> Author: Olaf Hering
>> Date: Thu Sep 22 16:14:49 2011 +0200
>>
>> xen/pv-on-hvm kexec: add xs_reset_watches to shutdown watches from old
>> kernel
>>
>> This change introduced a xs_reset_watches() call. The problem seems to be that
>> there is at least some version of Xen (I was able to reproduce with a 3.4.3
>> version which I admit to deliberately not having updated) for which xenstore
>> will not return any reply.
>
> And oxenstore too, but Ian prepped a patch for this. Perhaps that is
> what Amazon is running.
>>
>> At least the backtraces in crash showed that xs_init had been calling
>> xs_reset_watches() and that was happily idling in read_reply(). Effectively
>> nothing was going on and the boot just hung.
>
> So at least we should have a timeout in read_reply. But I don't see
> anything in the code that we could immediately use.
>
>> By just not doing that xs_reset_watches() call, I was able to boot under the
>> same host. And for what it is worth there has not been an issue with Xen 4.1.1
>> and a 3.0 dom0 kernel. Just this "older" release is trouble.
>>
>> Now the big question is, should this never happen and the host needs urgent
>> updating? Or, should xs_talkv() set up a time limit and assume failure when not
>> receiving a message after that? I could imagine the latter might lead at least
>> to a more helpful "there is something wrong here, dude" than just hanging around
>> without any response. ;)
>>
>> -Stefan