From mboxrd@z Thu Jan  1 00:00:00 1970
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Subject: Re: xen-4.1: PV domain hanging at startup, jiffies stopped
Date: Mon, 29 Aug 2011 16:59:38 -0400
Message-ID: <20110829205938.GB18697@dumpdata.com>
References: <4E5A3F0A.8060700@mimuw.edu.pl>
	<20110829200749.GA17265@dumpdata.com>
	<4E5BF4C3.2050108@mimuw.edu.pl>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Return-path: <xen-devel-bounces@lists.xensource.com>
Content-Disposition: inline
In-Reply-To: <4E5BF4C3.2050108@mimuw.edu.pl>
List-Unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xensource.com>
List-Help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-Subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: Marek Marczykowski <marmarek@mimuw.edu.pl>
Cc: "xen-devel@lists.xensource.com" <xen-devel@lists.xensource.com>, Joanna Rutkowska <joanna@invisiblethingslab.com>
List-Id: xen-devel@lists.xenproject.org

On Mon, Aug 29, 2011 at 10:21:23PM +0200, Marek Marczykowski wrote:
> On 29.08.2011 22:07, Konrad Rzeszutek Wilk wrote:
> > On Sun, Aug 28, 2011 at 03:13:46PM +0200, Marek Marczykowski wrote:
> >> Hey,
> >>
> >> I'm experiencing a strange problem: a non-deterministic PV domain hang, only
> >> on some machines (with a fast SSD drive). I've tried xen-4.1.0 and
> >> xen-4.1.1 with many different kernels:
> >> VM:
> >>  - 2.6.38.3 xenlinux based on SUSE package
> >>  - vanilla 3.0.3
> >>  - vanilla 3.1 rc2
> >> dom0:
> >>  - 2.6.38.3 xenlinux based on SUSE package
> >>  - vanilla 3.1 rc2
> >>
> >> The result is always the same: sometimes the VM hangs at startup, and SysRq-T
> >> shows modprobe waiting in "wait_for_devices" (specifically in schedule_timeout)
> >> and the jiffies counter not increasing between task-state dumps.
> >>
> >> The only thing I have found that is (probably) connected with this problem
> >> is these domU kernel messages:
> >> CE: xen increased min_delta_ns to 150000 nsec
> >> (...)
> >> CE: xen increased min_delta_ns to 4000000 nsec
> >> CE: Reprogramming failure. Giving up
> >>
> >> These messages do not appear in a successful boot.
> >>
> >> I've also tried some options to xen and domU kernel, but without success
> >> (all combinations):
> > 
> > BTW, your 'xencons=..' and 'swiotlb=force' are obsolete. Use
> > 'console=hvc0' and 'iommu=soft'. The 'swiotlb=force' kills performance.
> > 
> >> xen: tsc=unstable, cpufreq=none
> >> domU: nohz=off, clocksource=tsc
> >>
> >> Some combinations of the above options lowered the frequency of the problem
> >> (e.g. tsc=unstable + nohz=off), but it still happens quite often: about 1 in
> >> 15 boots fails.
> >>
> >> Do you have any idea what the cause is and what could help?
> > 
> > The problem looks to be a stuck xenwatch. So the problem is in Dom0, right?
> 
> This "R" state of xenwatch looks like a result of SysRq, which dumps data...
> 
> [  118.679707]  [<ffffffff812a8081>] handle_sysrq+0x21/0x30
> [  118.679707]  [<ffffffff8128db49>] sysrq_handler+0xb9/0xe0
> [  118.679707]  [<ffffffff8128ff50>] xenwatch_thread+0xb0/0x170
> 
> And the problem is at DomU boot, Dom0 works without any problems.

Ok, but I am still unsure where it is hanging in DomU. Can you boot the guest
with 'console=hvc0 debug initcall_debug loglevel=8 earlyprintk=xen' on its
kernel command line to get an idea of what is stuck in the guest? You might
also have better luck using 'xenctx' to get a stack trace of what is hanging
in the guest. (You will need the System.map file from the guest's kernel, but
that should be fairly easy to extract.)
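
As a rough sketch of both steps (the System.map path, the domain ID and the
xenctx location below are placeholders I am assuming, so adjust them to your
setup, and check 'xenctx --help' for the options your build supports): the
kernel arguments can go on the 'extra' line of the guest config file,

    extra = "console=hvc0 debug initcall_debug loglevel=8 earlyprintk=xen"

and once the guest is stuck, something along these lines from Dom0:

    # find the domain ID of the hung guest
    xm list
    # dump the VCPU state with a stack trace, resolving symbols via System.map
    /usr/lib/xen/bin/xenctx --symbol-table=/path/to/System.map --stack-trace <domid>

The System.map has to match the kernel the guest is actually running,
otherwise the symbol names in the trace will be misleading.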