From: Ian Campbell <ian.campbell@citrix.com>
To: Jim Fehlig <jfehlig@suse.com>
Cc: Anthony PERARD <anthony.perard@citrix.com>,
	xen-devel@lists.xensource.com, ian.jackson@eu.citrix.com
Subject: Re: [libvirt test] 55257: regressions - FAIL
Date: Fri, 15 May 2015 09:44:28 +0100
Message-ID: <1431679468.8943.0.camel@citrix.com>
In-Reply-To: <555511E5.3040304@suse.com>

On Thu, 2015-05-14 at 15:21 -0600, Jim Fehlig wrote:

> > FWIW http://logs.test-lab.xenproject.org/osstest/logs/55443/ seems to
> > have two more instances of this (amd64 and i386)
> 
> More cases of qemu not starting.  I'm not sure how we can get more
> details about that.

FWIW I dug into this a bit more yesterday, after discussing it with Ian
and others.

We wondered if qemu had crashed, but the logs show a timeout, and libxl
has code in the parent process which receives SIGCHLD, logs it and errors
out, so I think it probably isn't that, unless the monitoring code is
buggy somehow (not out of the question, it's probably not exercised
much).
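
To make that reasoning concrete, here is a minimal sketch (not libxl's
actual code) of a parent which watches a child during startup. It polls
the child's exit status instead of handling SIGCHLD, and the names
wait_for_startup and is_ready are hypothetical. The point is only that a
child which dies gets reported as an exit, while a child which stays
alive but never becomes ready gets reported as a timeout.

```python
import subprocess
import sys
import time

def wait_for_startup(proc, is_ready, timeout, interval=0.05):
    """Poll until the child is ready, has exited, or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = proc.poll()  # non-None once the child has exited
        if status is not None:
            return "exited (status %d)" % status
        if is_ready():
            return "ready"
        time.sleep(interval)
    return "startup timed out"

# A child that dies at once is reported as an exit, not as a timeout.
crasher = subprocess.Popen([sys.executable, "-c", "import sys; sys.exit(3)"])
crash_result = wait_for_startup(crasher, lambda: False, timeout=10.0)

# A child that stays alive but never becomes ready hits the timeout.
hanger = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
hang_result = wait_for_startup(hanger, lambda: False, timeout=0.5)
hanger.kill()
hanger.wait()
```

If the exit check were broken or skipped, both children would end up on
the timeout path, which is the "buggy monitoring code" possibility.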

Also we expect that a crash would produce a segfault message on the
kernel console, which didn't appear.

We also considered where stderr was going. libxl redirects std{out,err}
for qemu to the qemu-dm-debian.guest.osstest.log file, which we capture
and which is empty.
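
As a rough illustration of why an empty log file is informative, the
sketch below spawns a child with stdout and stderr pointed at a log
file. This is an assumption-laden stand-in, not how libxl does it: the
helper spawn_with_log and the file names are made up for the example. A
child which complains on stderr leaves a non-empty file, a silent one
leaves it empty.

```python
import os
import subprocess
import sys
import tempfile

def spawn_with_log(argv, log_path):
    """Start argv with stdout and stderr appended to log_path."""
    with open(log_path, "ab") as log:
        # The child keeps its own copy of the descriptor, so the parent
        # can close the file as soon as Popen returns.
        return subprocess.Popen(argv, stdin=subprocess.DEVNULL,
                                stdout=log, stderr=log)

log_dir = tempfile.mkdtemp()
noisy_log = os.path.join(log_dir, "noisy.log")  # hypothetical file names
quiet_log = os.path.join(log_dir, "quiet.log")

noisy = spawn_with_log(
    [sys.executable, "-c", "import sys; sys.stderr.write('boom\\n')"],
    noisy_log)
quiet = spawn_with_log([sys.executable, "-c", "pass"], quiet_log)
noisy.wait()
quiet.wait()

noisy_size = os.path.getsize(noisy_log)
quiet_size = os.path.getsize(quiet_log)
```

An empty file therefore only says the child wrote nothing to std{out,err};
it does not by itself say whether the child made progress.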

There was some question about where libvirt's own stderr was going
(/dev/null or perhaps the console) but it doesn't appear as if anything
is going wrong in libvirt itself and as above we capture the std* for
processes which we spawn ourselves.

Lastly libvirtd is still running and is shown in the ps logs captured.

> 
> >  but with no 
> > interesting logs still and a different one on ARM:
> >
> > http://logs.test-lab.xenproject.org/osstest/logs/55443/test-armhf-armhf-libvirt/11.ts-guest-start.log:
> > 2015-05-13 09:23:32.193+0000: 16389: info : libvirt version: 1.2.16
> > 2015-05-13 09:23:32.193+0000: 16389: warning : virKeepAliveTimerInternal:143 : No response from client 0xb7000c38 after 6 keepalive messages in 35 seconds
> > 2015-05-13 09:23:32.193+0000: 16390: warning : virKeepAliveTimerInternal:143 : No response from client 0xb7000c38 after 6 keepalive messages in 35 seconds
> > error: Failed to create domain from /etc/xen/debian.guest.osstest.cfg.xml
> > error: internal error: received hangup / error event on socket
> >   
> 
> In this case it seems libvirtd crashed.

http://logs.test-lab.xenproject.org/osstest/logs/55443/test-armhf-armhf-libvirt/arndale-lakeside-output-ps_wwwaxf_-eo_pid%2Ctty%2Cstat%2Ctime%2Cnice%2Cpsr%2Cpcpu%2Cpmem%2Cnwchan%2Cwchan%2325%2Cargs 

includes:
 2301 ?        DLl  00:00:00   0   0  0.0  1.6 ffffff fdget_pos                 /usr/local/sbin/libvirtd -d
16395 ?        S    00:00:00   0   0  0.0  0.5 24b6dc wait                       \_ /usr/local/sbin/libvirtd -d
16396 ?        Ssl  00:00:00   0   0  0.0  1.9 ffffff poll_schedule_timeout          \_ /usr/local/lib/xen/bin/qemu-system-i386 -xen-domid 1 -chardev socket,id=libxl-cmd,path=/var/run/xen/qmp-libxl-1,server,nowait -no-shutdown -mon chardev=libxl-cmd,mode=control -chardev socket,id=libxenstat-cmd,path=/var/run/xen/qmp-libxenstat-1,server,nowait -mon chardev=libxenstat-cmd,mode=control -nodefaults -xen-attach -name debian.guest.osstest -vnc none -display none -nographic -machine xenpv -m 512

So I don't think it has crashed; it even seems to have successfully
spawned a qemu.

Comparing the libxl-driver.log here with the amd64 case:

libxl: debug: libxl_event.c:537:watchfd_callback: watch w=0x7ff4d70595e0 wpath=/local/domain/0/device-model/1/state token=3/0: event epath=/local/domain/0/device-model/1/state

[arm stops here, amd64 continues with the remainder]

libxl: debug: libxl_aoutils.c:87:xswait_timeout_callback: domain 1 device model startup: xswait timeout (path=/local/domain/0/device-model/1/state)
libxl: debug: libxl_event.c:638:libxl__ev_xswatch_deregister: watch w=0x7ff4d70595e0 wpath=/local/domain/0/device-model/1/state token=3/0: deregister slotnum=3
libxl: error: libxl_exec.c:393:spawn_watch_event: domain 1 device model: startup timed out
libxl: debug: libxl_event.c:652:libxl__ev_xswatch_deregister: watch w=0x7ff4d70595e0: deregister unregistered
libxl: debug: libxl_event.c:652:libxl__ev_xswatch_deregister: watch w=0x7ff4d70595e0: deregister unregistered
libxl: error: libxl_dm.c:1565:device_model_spawn_outcome: domain 1 device model: spawn failed (rc=-3)
libxl: error: libxl_create.c:1362:domcreate_devmodel_started: device model did not start: -3
libxl: debug: libxl_dm.c:1678:kill_device_model: Device Model signaled
libxl: debug: libxl_event.c:652:libxl__ev_xswatch_deregister: watch w=0x7ff4d702f3c0: deregister unregistered
libxl: debug: libxl_event.c:652:libxl__ev_xswatch_deregister: watch w=0x7ff4d7031290: deregister unregistered
libxl: debug: libxl.c:1701:devices_destroy_cb: forked pid 18588 for destroy of domain 1
libxl: debug: libxl_event.c:1768:libxl__ao_complete: ao 0x7ff4d702ed60: complete, rc=-3
libxl: debug: libxl_event.c:1740:libxl__ao__destroy: ao 0x7ff4d702ed60: destroy

I wonder if we are somehow losing an event or getting the event loop screwed up.

Perhaps in the amd64 case we are somehow losing the xenstore watch, and
in the armhf case we are losing some other fd which interferes with
libvirt's own event loop?
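
To show what a lost watch would look like from the waiter's side, here
is a toy sketch using plain pipes in place of the xenstore watch fd. It
is an illustration of the hypothesis only, not of libxl's or libvirt's
event loops, and wait_for_event is a hypothetical helper. The state
change is written in both cases, but the loop which is watching the
wrong fd sees nothing and reports a timeout.

```python
import os
import select

def wait_for_event(watched_fds, timeout):
    """Return "event" if any watched fd becomes readable before the timeout."""
    readable, _, _ = select.select(watched_fds, [], [], timeout)
    return "event" if readable else "timed out"

state_r, state_w = os.pipe()   # stands in for the watch we care about
other_r, other_w = os.pipe()   # some unrelated fd

# The "device model" reports its state.
os.write(state_w, b"running")

# Watching the right fd: the event is delivered.
seen = wait_for_event([state_r], timeout=0.2)

# Watching the wrong fd (registration lost): same write, but a timeout.
missed = wait_for_event([other_r], timeout=0.2)

for fd in (state_r, state_w, other_r, other_w):
    os.close(fd)
```

From the outside the second case is indistinguishable from a process
which never reported its state at all.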

So I think we are looking at either a hang or an event processing SNAFU
rather than a crash.

BTW, in the above there is "Device Model signaled", which indicates that
kill(pid, SIGHUP) returned 0 rather than failing with ESRCH (in which
case it would log "Device Model already exited") or with anything else
(in which case it would log "failed to kill..."). So the qemu process
was actually present.
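
The three outcomes can be sketched as follows. This is a Python
approximation under my own naming (signal_outcome is hypothetical), not
the libxl source; Python reports ESRCH as ProcessLookupError.

```python
import errno
import os
import signal
import subprocess
import sys

def signal_outcome(pid, sig=signal.SIGHUP):
    """Classify the result of kill(pid, sig) into the three cases above."""
    try:
        os.kill(pid, sig)
    except ProcessLookupError:  # errno ESRCH: no such process
        return "already exited"
    except OSError as exc:      # e.g. EPERM
        return "failed to kill: " + errno.errorcode.get(exc.errno, "unknown")
    return "signaled"

# A live child: kill() succeeds.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
live = signal_outcome(child.pid)

# Once the child has been reaped, the same pid gives ESRCH.
child.wait()
gone = signal_outcome(child.pid)
```

One caveat: kill() also returns 0 for a process which has exited but has
not yet been reaped, so "signaled" shows the pid still existed, not
necessarily that the process was still running.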

The host is doing nothing other than running this one test case, so it
doesn't seem likely that we are really hitting the 30s qemu startup
timeout.

Ian.
