From mboxrd@z Thu Jan  1 00:00:00 1970
From: Olaf Hering <olaf@aepfle.de>
Subject: Re: Need help with fixing the Xen waitqueue feature
Date: Tue, 8 Nov 2011 23:20:11 +0100
Message-ID: <20111108222011.GA23969@aepfle.de>
References: <20111108212024.GA5276@aepfle.de>
	<CADF5835.245E1%keir.xen@gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Return-path: <xen-devel-bounces@lists.xensource.com>
Content-Disposition: inline
In-Reply-To: <CADF5835.245E1%keir.xen@gmail.com>
To: Keir Fraser <keir.xen@gmail.com>
Cc: xen-devel@lists.xensource.com
List-Id: xen-devel@lists.xenproject.org

On Tue, Nov 08, Keir Fraser wrote:

> On 08/11/2011 21:20, "Olaf Hering" <olaf@aepfle.de> wrote:
> 
> > Another thing is that sometimes the host suddenly reboots without any
> > message. I think the reason for this is that a vcpu whose stack was put
> > aside and that was later resumed may find itself on another physical
> > cpu. And if that happens, wouldnt that invalidate some of the local
> > variables back in the callchain? If some of them point to the old
> > physical cpu, how could this be fixed? Perhaps a few "volatiles" are
> > needed in some places.
> 
> From how many call sites can we end up on a wait queue? I know we were going
> to end up with a small and explicit number (e.g., in __hvm_copy()) but does
> this patch make it a more generally-used mechanism? There will unavoidably
> be many constraints on callers who want to be able to yield the cpu. We can
> add Linux-style get_cpu/put_cpu abstractions to catch some of them. Actually
> I don't think it's *that* common that hypercall contexts cache things like
> per-cpu pointers. But every caller will need auditing, I expect.

I haven't started to audit the callers yet. In my testing,
mem_event_put_request() is called from p2m_mem_paging_drop_page() and
p2m_mem_paging_populate(). The latter is itself called from more places.
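
To make clear what such an audit would look for, here is a minimal
userspace sketch in C of the stale per-cpu reference I suspect. It is not
Xen code: percpu_data, current_cpu, wait_and_migrate(), get_cpu_data() and
put_cpu_data() are all hypothetical names made up for this illustration,
and the "sleep" is only simulated by switching the cpu index. The
get/put pair is my guess at what a Linux-style get_cpu/put_cpu guard
could look like, not an existing interface.

```c
#include <assert.h>

#define NR_CPUS 4

/* Hypothetical per-cpu data; stands in for any per-cpu variable. */
struct percpu_data {
    int counter;
};

static struct percpu_data percpu[NR_CPUS];
static int current_cpu;    /* physical cpu the simulated vcpu runs on */
static int nosleep_depth;  /* > 0 while a per-cpu reference is held */

/* Take a reference to this cpu's data and mark the region as no-sleep. */
static struct percpu_data *get_cpu_data(void)
{
    nosleep_depth++;
    return &percpu[current_cpu];
}

/* Drop the reference taken by get_cpu_data(). */
static void put_cpu_data(void)
{
    assert(nosleep_depth > 0);
    nosleep_depth--;
}

/*
 * Simulated wait queue sleep: the vcpu is resumed on new_cpu.
 * Refuses (returns -1) while a per-cpu reference is still held.
 */
static int wait_and_migrate(int new_cpu)
{
    if (nosleep_depth > 0)
        return -1;
    current_cpu = new_cpu;
    return 0;
}

/* Suspected bug: pointer cached before the sleep and used afterwards. */
static int buggy_update(int new_cpu)
{
    struct percpu_data *p = &percpu[current_cpu]; /* cached, not guarded */

    wait_and_migrate(new_cpu);
    p->counter++;                        /* modifies the OLD cpu's data */
    return p == &percpu[current_cpu];    /* 0 means the pointer is stale */
}

/* Safer pattern: drop the reference before sleeping, re-fetch afterwards. */
static int safe_update(int new_cpu)
{
    struct percpu_data *p = get_cpu_data();
    int ok;

    p->counter++;
    put_cpu_data();

    if (wait_and_migrate(new_cpu) != 0)
        return -1;

    p = get_cpu_data();                  /* re-fetch on the new cpu */
    p->counter++;
    ok = (p == &percpu[current_cpu]);
    put_cpu_data();
    return ok;
}
```

In this sketch buggy_update() ends up writing to the data of the cpu it
started on, while safe_update() only touches the data of the cpu it is
currently running on. A guard like nosleep_depth would at least let a
debug build catch callers that try to sleep while holding such a pointer.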

My plan is to put the sleep into ept_get_entry(), but I'm not there yet.
First I want to test waitqueues in a rather simple code path like
mem_event_put_request().

> A sudden reboot is very extreme. No message even on a serial line? That most
> commonly indicates bad page tables. Most other bugs you'd at least get a
> double fault message.

There is no output on serial. I boot with this cmdline:
  vga=mode-normal console=com1 com1=57600 loglvl=all guest_loglvl=all
  sync_console conring_size=123456 maxcpus=8 dom0_vcpus_pin
  dom0_max_vcpus=2
My base changeset is 24003; the test host is a Xeon X5670 @ 2.93GHz.

Olaf