xen-devel.lists.xenproject.org archive mirror
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
To: "Ren, Yongjie" <yongjie.ren@intel.com>,
	george.dunlap@eu.citrix.com, xen@bugs.xenproject.org
Cc: "Xu, YongweiX" <yongweix.xu@intel.com>,
	"Liu, SongtaoX" <songtaox.liu@intel.com>,
	"Tian, Yongxue" <yongxue.tian@intel.com>,
	"xen-devel@lists.xen.org" <xen-devel@lists.xen.org>
Subject: Is: linux, xenbus mutex hangs when rebooting dom0 and guests hung." Was:Re: test report for Xen 4.3 RC1
Date: Fri, 8 Nov 2013 11:21:21 -0500	[thread overview]
Message-ID: <20131108162121.GA25007@phenom.dumpdata.com> (raw)
In-Reply-To: <20130528152156.GB3027@phenom.dumpdata.com>

On Tue, May 28, 2013 at 11:21:56AM -0400, Konrad Rzeszutek Wilk wrote:
> > > 5. Dom0 cannot be shutdown before PCI device detachment from guest
> > >   http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1826
> > 
> > Ok, I can reproduce that too.
> 
> This is what dom0 tells me:
> 
> [  483.586675] INFO: task init:4163 blocked for more than 120 seconds.
> [  483.603675] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  483.620747] init            D ffff880062b59c78  5904  4163      1 0x00000000
> [  483.637699]  ffff880062b59bc8 0000000000000
> [  483.655189]  ffff880062b58000 ffff880062b58000 ffff880062b58010 ffff880062b58000
> [  483.672505]  ffff880062b59fd8 ffff880062b58000 ffff880062f20180 ffff880078bca500
> [  483.689527] Call Trace:
> [  483.706298]  [<ffffffff816a0814>] schedule+0x24/0x70
> [  483.723604]  [<ffffffff813bb0dd>] read_reply+0xad/0x160
> [  483.741162]  [<ffffffff810b6b10>] ? wake_up_bit+0x40/0x40
> [  483.758572]  [<ffffffff813bb274>] xs_talkv+0xe4/0x1f0
> [  483.775741]  [<ffffffff813bb3c6>] xs_single+0x46/0x60
> [  483.792791]  [<ffffffff813bbab4>] xenbus_transaction_start+0x24/0x60
> [  483.809929]  [<ffffffff813ba202>] __xenbus_switch_state+0x32/0x120
> [  483.826947]  [<ffffffff8142df39>] ? __dev_printk+0x39/0x90
> [  483.843792]  [<ffffffff8142dfde>] ? _dev_info+0x4e/0x50
> [  483.860412]  [<ffffffff813ba2fb>] xenbus_switch_state+0xb/0x10
> [  483.877312]  [<ffffffff813bd487>] xenbus_dev_shutdown+0x37/0xa0
> [  483.894036]  [<ffffffff8142e275>] device_shutdown+0x15/0x180
> [  483.910605]  [<ffffffff810a8841>] kernel_restart_prepare+0x31/0x40
> [  483.927100]  [<ffffffff810a88a1>] kernel_restart+0x11
> [  483.943262]  [<ffffffff810a8ab5>] SYSC_reboot+0x1b5/0x260
> [  483.959480]  [<ffffffff810ed52d>] ? trace_hardirqs_on_caller+0x10d/0x1d0
> [  483.975786]  [<ffffffff810ed5fd>] ? trace_hardirqs_on+0xd/0x10
> [  483.991819]  [<ffffffff8119db03>] ? kmem_cache_free+0x123/0x360
> [  484.007675]  [<ffffffff8115c725>] ? __free_pages+0x25/0x
> [  484.023336]  [<ffffffff8115c9ac>] ? free_pages+0x4c/0x50
> [  484.039176]  [<ffffffff8108b527>] ? __mmdrop+0x67/0xd0
> [  484.055174]  [<ffffffff816aae95>] ? sysret_check+0x22/0x5d
> [  484.070747]  [<ffffffff810ed52d>] ? trace_hardirqs_on_caller+0x10d/0x1d0
> [  484.086121]  [<ffffffff810a8b69>] SyS_reboot+0x9/0x10
> [  484.101318]  [<ffffffff816aae69>] system_call_fastpath+0x16/0x1b
> [  484.116585] 3 locks held by init/4163:
> [  484.131650]  #0:  (reboot_mutex){+.+.+.}, at: [<ffffffff810a89e0>] SYSC_reboot+0xe0/0x260
> [  484.147704]  #1:  (&__lockdep_no_validate__){......}, at: [<ffffffff8142e323>] device_shutdown+0xc3/0x180
> [  484.164359]  #2:  (&xs_state.request_mutex){+.+...}, at: [<ffffffff813bb1fb>] xs_talkv+0x6b/0x1f0
> 

A bit of debugging shows that when we are in this state:


Sent SIGKILL to
[  100.454603] xen-pciback pci-1-0: shutdown

telnet> send brk 
[  110.134554] SysRq : HELP : loglevel(0-9) reboot(b) crash(c) terminate-all-tasks(e) memory-full-oom-kill(f) debug(g) kill-all-tasks(i) thaw-filesystems(j) sak(k) show-backtrace-all-active-cpus(l) show-memory-usage(m) nice-all-RT-tasks(n) poweroff(o) show-registers(p) show-all-timers(q) unraw(r) sync(s) show-task-states(t) unmount(u) force-fb(V) show-blocked-tasks(w) dump-ftrace-buffer(z) 

... snip..

 xenstored       x 0000000000000002  5504  3437      1 0x00000006
  ffff88006b6efc88 0000000000000246 0000000000000d6d ffff88006b6ee000
  ffff88006b6effd8 ffff88006b6ee000 ffff88006b6ee010 ffff88006b6ee000
  ffff88006b6effd8 ffff88006b6ee000 ffff88006bc39500 ffff8800788b5480
 Call Trace:
  [<ffffffff8110fede>] ? cgroup_exit+0x10e/0x130
  [<ffffffff816b1594>] schedule+0x24/0x70
  [<ffffffff8109c43d>] do_exit+0x79d/0xbc0
  [<ffffffff8109c981>] do_group_exit+0x51/0x140
  [<ffffffff810ae6f4>] get_signal_to_deliver+0x264/0x760
  [<ffffffff8104c49f>] do_signal+0x4f/0x610
  [<ffffffff811c62ce>] ? __sb_end_write+0x2e/0x60
  [<ffffffff811c3d39>] ? vfs_write+0x129/0x170
  [<ffffffff8104cabd>] do_notify_resume+0x5d/0x80
  [<ffffffff816bc372>] int_signal+0x12/0x17


The 'x' means that the task has been killed.

(The other two threads 'xenbus' and 'xenwatch' are sleeping).

Since xenstored can nowadays run in a separate domain - not just in
the initial domain - and can be restarted at any time, we can't
depend on its task pid. Nor can we depend on the other domain
telling us that it is dead.

The best we can do is to get out of the way of the shutdown
process and not hang on forever.

This patch should solve it:
From 228bb2fcde1267ed2a0b0d386f54d79ecacd0eb4 Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Fri, 8 Nov 2013 10:48:58 -0500
Subject: [PATCH] xen/xenbus: Avoid synchronous wait on XenBus stalling
 shutdown/restart.

'read_reply' works with 'process_msg' to read a reply off XenBus.
'process_msg' runs within the 'xenbus' thread. Whenever a message
shows up on XenBus it is put on the xs_state.reply_list list and
'read_reply' picks it up.

The problem is if the backend domain or the xenstored process is
killed. In that case 'xenbus' keeps waiting - and 'read_reply', if
called, is stuck forever waiting for reply_list to have some contents.

This is normally not a problem - the backend domain can come back,
or the xenstored process can be restarted. However, if the domain
is in the process of being powered off/restarted/halted, there is
no point in waiting for it to come back - we are effectively being
terminated and should not impede the progress.

This patch solves the problem by checking the 'system_state' value
to see whether we are heading towards shutdown. We also make the
wait mechanism a bit more asynchronous.

Fixes-Bug: http://bugs.xenproject.org/xen/bug/8
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 drivers/xen/xenbus/xenbus_xs.c |   24 +++++++++++++++++++++---
 1 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/drivers/xen/xenbus/xenbus_xs.c b/drivers/xen/xenbus/xenbus_xs.c
index b6d5fff..177fb19 100644
--- a/drivers/xen/xenbus/xenbus_xs.c
+++ b/drivers/xen/xenbus/xenbus_xs.c
@@ -148,9 +148,24 @@ static void *read_reply(enum xsd_sockmsg_type *type, unsigned int *len)
 
 	while (list_empty(&xs_state.reply_list)) {
 		spin_unlock(&xs_state.reply_lock);
-		/* XXX FIXME: Avoid synchronous wait for response here. */
-		wait_event(xs_state.reply_waitq,
-			   !list_empty(&xs_state.reply_list));
+		wait_event_timeout(xs_state.reply_waitq,
+				   !list_empty(&xs_state.reply_list),
+				   msecs_to_jiffies(500));
+
+		/*
+		 * If we are in the process of being shut-down there is
+		 * no point of trying to contact XenBus - it is either
+		 * killed (xenstored application) or the other domain
+		 * has been killed or is unreachable.
+		 */
+		switch (system_state) {
+			case SYSTEM_POWER_OFF:
+			case SYSTEM_RESTART:
+			case SYSTEM_HALT:
+				return ERR_PTR(-EIO);
+			default:
+				break;
+		}
 		spin_lock(&xs_state.reply_lock);
 	}
 
@@ -215,6 +230,9 @@ void *xenbus_dev_request_and_reply(struct xsd_sockmsg *msg)
 
 	mutex_unlock(&xs_state.request_mutex);
 
+	if (IS_ERR(ret))
+		return ret;
+
 	if ((msg->type == XS_TRANSACTION_END) ||
 	    ((req_msg.type == XS_TRANSACTION_START) &&
 	     (msg->type == XS_ERROR)))
-- 
1.7.7.6


Thread overview: 11+ messages
2013-05-27  3:49 test report for Xen 4.3 RC1 Ren, Yongjie
2013-05-28 15:15 ` Konrad Rzeszutek Wilk
2013-05-28 15:21   ` Konrad Rzeszutek Wilk
2013-05-28 15:24     ` George Dunlap
2013-11-11 10:22       ` Ian Campbell
2013-11-08 16:21     ` Konrad Rzeszutek Wilk [this message]
2013-11-08 16:30       ` Processed: Is: linux, xenbus mutex hangs when rebooting dom0 and guests hung." Was:Re: " xen
2013-11-10 20:20       ` Matt Wilson
2013-11-10 20:30         ` Processed: " xen
2013-11-11  2:40       ` Liu, SongtaoX
2013-11-11  2:45         ` Processed: " xen
