From mboxrd@z Thu Jan 1 00:00:00 1970
From: Olaf Hering
Subject: Re: Need help with fixing the Xen waitqueue feature
Date: Tue, 8 Nov 2011 23:20:11 +0100
Message-ID: <20111108222011.GA23969@aepfle.de>
References: <20111108212024.GA5276@aepfle.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: Keir Fraser
Cc: xen-devel@lists.xensource.com
List-Id: xen-devel@lists.xenproject.org

On Tue, Nov 08, Keir Fraser wrote:

> On 08/11/2011 21:20, "Olaf Hering" wrote:
>
> > Another thing is that sometimes the host suddenly reboots without any
> > message. I think the reason for this is that a vcpu whose stack was put
> > aside and that was later resumed may find itself on another physical
> > cpu. And if that happens, wouldn't that invalidate some of the local
> > variables back in the callchain? If some of them point to the old
> > physical cpu, how could this be fixed? Perhaps a few "volatiles" are
> > needed in some places.
>
> From how many call sites can we end up on a wait queue? I know we were going
> to end up with a small and explicit number (e.g., in __hvm_copy()) but does
> this patch make it a more generally-used mechanism? There will unavoidably
> be many constraints on callers who want to be able to yield the cpu. We can
> add Linux-style get_cpu/put_cpu abstractions to catch some of them. Actually
> I don't think it's *that* common that hypercall contexts cache things like
> per-cpu pointers. But every caller will need auditing, I expect.

I haven't started to audit the callers. In my testing,
mem_event_put_request() is called from p2m_mem_paging_drop_page() and
p2m_mem_paging_populate(). The latter is called from more places.
My plan is to put the sleep into ept_get_entry(), but I'm not there yet.
First I want to test waitqueues in a rather simple code path like
mem_event_put_request().

> A sudden reboot is very extreme. No message even on a serial line? That most
> commonly indicates bad page tables. Most other bugs you'd at least get a
> double fault message.

There is no output on serial. I boot with this cmdline:

  vga=mode-normal console=com1 com1=57600 loglvl=all guest_loglvl=all
  sync_console conring_size=123456 maxcpus=8 dom0_vcpus_pin dom0_max_vcpus=2

My base changeset is 24003, the testhost is a Xeon X5670 @ 2.93GHz.

Olaf
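
For illustration only, the concern about a context that sleeps and then resumes on another physical cpu can be sketched in plain C. This is a standalone toy, not Xen code: percpu_state, current_cpu, this_cpu_state() and sleep_and_resume_on() are hypothetical stand-ins made up for this sketch, and the "sleep" is simulated by simply changing the cpu index. It only shows why a per-cpu pointer cached before a possible sleep may refer to the old cpu afterwards.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS 2

/* Hypothetical per-cpu state: one slot per physical cpu. */
struct percpu_state {
    int cpu_id;
    int counter;
};

static struct percpu_state percpu[NR_CPUS] = { { 0, 0 }, { 1, 0 } };

/* Stand-in for "the physical cpu this context is running on right now". */
static int current_cpu = 0;

/* Look up the slot that belongs to the cpu we are running on. */
static struct percpu_state *this_cpu_state(void)
{
    return &percpu[current_cpu];
}

/* Stand-in for sleeping on a wait queue and being resumed elsewhere. */
static void sleep_and_resume_on(int new_cpu)
{
    current_cpu = new_cpu;
}

/* Hazardous pattern: the pointer is looked up before the sleep and
 * reused afterwards, so it still refers to the old cpu's slot. */
static int update_with_cached_pointer(int resume_cpu)
{
    struct percpu_state *st = this_cpu_state();

    sleep_and_resume_on(resume_cpu);
    st->counter++;
    return st->cpu_id;
}

/* Safer pattern: repeat the lookup after any point that may sleep. */
static int update_with_fresh_lookup(int resume_cpu)
{
    struct percpu_state *st;

    sleep_and_resume_on(resume_cpu);
    st = this_cpu_state();
    st->counter++;
    return st->cpu_id;
}
```

In this toy, the cached-pointer variant updates the slot of the cpu it started on, while the fresh-lookup variant updates the slot of the cpu it resumed on. Whether any real caller of the wait queue code follows the first pattern is exactly what the audit mentioned above would have to establish.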