All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] Fix xenconsole after aborted save
@ 2009-04-20 13:23 Jiri Denemark
  2009-04-20 13:38 ` Gerd Hoffmann
  0 siblings, 1 reply; 11+ messages in thread
From: Jiri Denemark @ 2009-04-20 13:23 UTC (permalink / raw)
  To: xen-devel

[-- Attachment #1: Type: text/plain, Size: 403 bytes --]

Hi.

When a domain is resumed after aborted save (e.g., because of insufficient
space on a device where the domain is being saved), xm console cannot
send/read any data. The reason is that the event channel used by xenconsole
stays unbound.

This patch modifies xenconsoled to check current status of open event channels
and rebind them if necessary.

Signed-off-by: Jiri Denemark <jdenemar@redhat.com>

[-- Attachment #2: console-evtchn-rebind.patch --]
[-- Type: text/plain, Size: 1737 bytes --]

diff --git a/tools/console/daemon/io.c b/tools/console/daemon/io.c
--- a/tools/console/daemon/io.c
+++ b/tools/console/daemon/io.c
@@ -523,6 +523,7 @@ static int domain_create_ring(struct dom
 {
 	int err, remote_port, ring_ref, rc;
 	char *type, path[PATH_MAX];
+	int rebind = 0;
 
 	err = xs_gather(xs, dom->serialpath,
 			"ring-ref", "%u", &ring_ref,
@@ -548,8 +549,17 @@ static int domain_create_ring(struct dom
 	}
 	free(type);
 
-	if ((ring_ref == dom->ring_ref) && (remote_port == dom->remote_port))
-		goto out;
+	if ((ring_ref == dom->ring_ref) && (remote_port == dom->remote_port)) {
+		xc_evtchn_status_t status = {
+			.dom = DOMID_SELF,
+			.port = dom->local_port };
+
+		if (xc_evtchn_status(xc, &status)
+		    || status.status != EVTCHNSTAT_interdomain)
+			rebind = 1;
+		else
+			goto out;
+	}
 
 	if (ring_ref != dom->ring_ref) {
 		if (dom->interface != NULL)
@@ -565,17 +575,19 @@ static int domain_create_ring(struct dom
 		dom->ring_ref = ring_ref;
 	}
 
-	dom->local_port = -1;
-	dom->remote_port = -1;
-	if (dom->xce_handle != -1)
-		xc_evtchn_close(dom->xce_handle);
+	if (!rebind) {
+		dom->local_port = -1;
+		dom->remote_port = -1;
+		if (dom->xce_handle != -1)
+			xc_evtchn_close(dom->xce_handle);
 
-	/* Opening evtchn independently for each console is a bit
-	 * wasteful, but that's how the code is structured... */
-	dom->xce_handle = xc_evtchn_open();
-	if (dom->xce_handle == -1) {
-		err = errno;
-		goto out;
+		/* Opening evtchn independently for each console is a bit
+		 * wasteful, but that's how the code is structured... */
+		dom->xce_handle = xc_evtchn_open();
+		if (dom->xce_handle == -1) {
+			err = errno;
+			goto out;
+		}
 	}
  
 	rc = xc_evtchn_bind_interdomain(dom->xce_handle,

[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] Fix xenconsole after aborted save
  2009-04-20 13:23 [PATCH] Fix xenconsole after aborted save Jiri Denemark
@ 2009-04-20 13:38 ` Gerd Hoffmann
  2009-04-20 14:04   ` Jiri Denemark
  0 siblings, 1 reply; 11+ messages in thread
From: Gerd Hoffmann @ 2009-04-20 13:38 UTC (permalink / raw)
  To: Jiri Denemark; +Cc: xen-devel

On 04/20/09 15:23, Jiri Denemark wrote:
> Hi.
>
> When a domain is resumed after aborted save (e.g., because of insufficient
> space on a device where the domain is being saved), xm console cannot
> send/read any data. The reason is that the event channel used by xenconsole
> stays unbound.
>
> This patch modifies xenconsoled to check current status of open event channels
> and rebind them if necessary.

close() + open() is the sledge hammer approach (will work though).  Just 
unbind(local_port) should be enough.

cheers,
   Gerd

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] Fix xenconsole after aborted save
  2009-04-20 13:38 ` Gerd Hoffmann
@ 2009-04-20 14:04   ` Jiri Denemark
  2009-04-20 14:09     ` Gerd Hoffmann
  2009-04-20 14:41     ` Keir Fraser
  0 siblings, 2 replies; 11+ messages in thread
From: Jiri Denemark @ 2009-04-20 14:04 UTC (permalink / raw)
  To: Gerd Hoffmann; +Cc: xen-devel

On Mon, Apr 20, 2009 at 15:38:41 +0200, Gerd Hoffmann wrote:
> On 04/20/09 15:23, Jiri Denemark wrote:
>> Hi.
>>
>> When a domain is resumed after aborted save (e.g., because of insufficient
>> space on a device where the domain is being saved), xm console cannot
>> send/read any data. The reason is that the event channel used by xenconsole
>> stays unbound.
>>
>> This patch modifies xenconsoled to check current status of open event channels
>> and rebind them if necessary.
>
> close() + open() is the sledge hammer approach (will work though).  Just  
> unbind(local_port) should be enough.

It doesn't close() and open(), it just calls xc_evtchn_bind_interdomain() in
case the event channel is unbound. The close() + open() combination was there
before... I haven't touched that code except for skipping it when only rebind
is required.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] Fix xenconsole after aborted save
  2009-04-20 14:04   ` Jiri Denemark
@ 2009-04-20 14:09     ` Gerd Hoffmann
  2009-04-20 14:41     ` Keir Fraser
  1 sibling, 0 replies; 11+ messages in thread
From: Gerd Hoffmann @ 2009-04-20 14:09 UTC (permalink / raw)
  To: Jiri Denemark; +Cc: xen-devel

On 04/20/09 16:04, Jiri Denemark wrote:
> On Mon, Apr 20, 2009 at 15:38:41 +0200, Gerd Hoffmann wrote:
>> On 04/20/09 15:23, Jiri Denemark wrote:
>>> This patch modifies xenconsoled to check current status of open event channels
>>> and rebind them if necessary.
>> close() + open() is the sledge hammer approach (will work though).  Just
>> unbind(local_port) should be enough.
>
> It doesn't close() and open(), it just calls xc_evtchn_bind_interdomain() in
> case the event channel is unbound. The close() + open() combination was there
> before... I haven't touched that code except for skipping it when only rebind
> is required.

Oh, ok, got the logic wrong.  The patch looks fine to me then.

cheers,
   Gerd

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] Fix xenconsole after aborted save
  2009-04-20 14:04   ` Jiri Denemark
  2009-04-20 14:09     ` Gerd Hoffmann
@ 2009-04-20 14:41     ` Keir Fraser
  2009-04-20 15:58       ` Jiri Denemark
  2009-04-22  9:49       ` Chris Lalancette
  1 sibling, 2 replies; 11+ messages in thread
From: Keir Fraser @ 2009-04-20 14:41 UTC (permalink / raw)
  To: Jiri Denemark, Gerd Hoffmann; +Cc: xen-devel@lists.xensource.com

On 20/04/2009 15:04, "Jiri Denemark" <jdenemar@redhat.com> wrote:

>>> This patch modifies xenconsoled to check current status of open event
>>> channels
>>> and rebind them if necessary.
>> 
>> close() + open() is the sledge hammer approach (will work though).  Just
>> unbind(local_port) should be enough.
> 
> It doesn't close() and open(), it just calls xc_evtchn_bind_interdomain() in
> case the event channel is unbound. The close() + open() combination was there
> before... I haven't touched that code except for skipping it when only rebind
> is required.

And actually that is a bug, since you will leak the old dom->local_port. I
checked in an alternative patch as c/s 19561, so please take a look and test
that resolves your issue.

Another thing to note is I think this problem can only occur if the domU
does not support suspend cancellation (advertised as SUSPEND_CANCEL in
kernel elf notes -- see xen/xend/XendDomainInfo.py:resumeDomain()). Your
kernels should support that feature -- suspend cancellation (a.k.a. Resume)
is very likely to be hit-or-miss without it!

 -- Keir

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] Fix xenconsole after aborted save
  2009-04-20 14:41     ` Keir Fraser
@ 2009-04-20 15:58       ` Jiri Denemark
  2009-04-22  9:49       ` Chris Lalancette
  1 sibling, 0 replies; 11+ messages in thread
From: Jiri Denemark @ 2009-04-20 15:58 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel@lists.xensource.com, Gerd Hoffmann

> > It doesn't close() and open(), it just calls xc_evtchn_bind_interdomain() in
> > case the event channel is unbound. The close() + open() combination was there
> > before... I haven't touched that code except for skipping it when only rebind
> > is required.
> 
> And actually that is a bug, since you will leak the old dom->local_port.
Ah, ok, thanks for fixing that...

> I checked in an alternative patch as c/s 19561, so please take a look and
> test that resolves your issue.
Yes, the alternative patch works fine.
 
> Another thing to note is I think this problem can only occur if the domU
> does not support suspend cancellation
Sure, the problem only occurs for kernels without SUSPEND_CANCEL feature.

Jirka

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] Fix xenconsole after aborted save
  2009-04-20 14:41     ` Keir Fraser
  2009-04-20 15:58       ` Jiri Denemark
@ 2009-04-22  9:49       ` Chris Lalancette
  2009-04-22 10:22         ` Keir Fraser
  1 sibling, 1 reply; 11+ messages in thread
From: Chris Lalancette @ 2009-04-22  9:49 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Jiri Denemark, xen-devel@lists.xensource.com, Gerd Hoffmann

Keir Fraser wrote:
> Another thing to note is I think this problem can only occur if the domU
> does not support suspend cancellation (advertised as SUSPEND_CANCEL in
> kernel elf notes -- see xen/xend/XendDomainInfo.py:resumeDomain()). Your
> kernels should support that feature -- suspend cancellation (a.k.a. Resume)
> is very likely to be hit-or-miss without it!

Could you elaborate a bit on this?  I was under the impression that suspend
cancellation was there mostly for the netaccel bits, but I have to admit I
didn't look at it very closely.  What scenarios do the suspend cancel bits help?

-- 
Chris Lalancette

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] Fix xenconsole after aborted save
  2009-04-22  9:49       ` Chris Lalancette
@ 2009-04-22 10:22         ` Keir Fraser
  2009-04-22 12:41           ` John Levon
                             ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Keir Fraser @ 2009-04-22 10:22 UTC (permalink / raw)
  To: Chris Lalancette
  Cc: Jiri Denemark, xen-devel@lists.xensource.com, Gerd Hoffmann

On 22/04/2009 10:49, "Chris Lalancette" <clalance@redhat.com> wrote:

> Keir Fraser wrote:
>> Another thing to note is I think this problem can only occur if the domU
>> does not support suspend cancellation (advertised as SUSPEND_CANCEL in
>> kernel elf notes -- see xen/xend/XendDomainInfo.py:resumeDomain()). Your
>> kernels should support that feature -- suspend cancellation (a.k.a. Resume)
>> is very likely to be hit-or-miss without it!
> 
> Could you elaborate a bit on this?  I was under the impression that suspend
> cancellation was there mostly for the netaccel bits, but I have to admit I
> didn't look at it very closely.  What scenarios do the suspend cancel bits
> help?

Suspend failures (failure to save or to migrate). Also for live
checkpointing/snapshotting. The feature indicates that the guest is happy
for the suspend hypercall to return indicating 'failure/cancelled' and in
which case it can pretty much resume whatever it was doing without any of
the usual resume logic. The alternative is for the toolstack to make it look
like the domain has been restored/migrated, by resetting PV devices and the
like, and this is not much tested and almost inherently fragile.

The suspend_cancel hook you are thinking of is for any PV devices which *do*
need to know that a suspend was cancelled. Netaccel does for some reason
which I cannot recall.

Anyway, it is pretty easy and pretty important to support SUSPEND_CANCEL!

 -- Keir

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] Fix xenconsole after aborted save
  2009-04-22 10:22         ` Keir Fraser
@ 2009-04-22 12:41           ` John Levon
  2009-04-22 14:24           ` Chris Lalancette
  2009-04-22 16:15           ` Neil Turton
  2 siblings, 0 replies; 11+ messages in thread
From: John Levon @ 2009-04-22 12:41 UTC (permalink / raw)
  To: Keir Fraser
  Cc: Jiri Denemark, Chris Lalancette, xen-devel@lists.xensource.com,
	Gerd Hoffmann

On Wed, Apr 22, 2009 at 11:22:01AM +0100, Keir Fraser wrote:

> Anyway, it is pretty easy and pretty important to support SUSPEND_CANCEL!

Note that it's still the case that Solaris doesn't, unfortunately.

regards
john

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] Fix xenconsole after aborted save
  2009-04-22 10:22         ` Keir Fraser
  2009-04-22 12:41           ` John Levon
@ 2009-04-22 14:24           ` Chris Lalancette
  2009-04-22 16:15           ` Neil Turton
  2 siblings, 0 replies; 11+ messages in thread
From: Chris Lalancette @ 2009-04-22 14:24 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Jiri Denemark, xen-devel@lists.xensource.com, Gerd Hoffmann

Keir Fraser wrote:
> On 22/04/2009 10:49, "Chris Lalancette" <clalance@redhat.com> wrote:
> 
>> Keir Fraser wrote:
>>> Another thing to note is I think this problem can only occur if the domU
>>> does not support suspend cancellation (advertised as SUSPEND_CANCEL in
>>> kernel elf notes -- see xen/xend/XendDomainInfo.py:resumeDomain()). Your
>>> kernels should support that feature -- suspend cancellation (a.k.a. Resume)
>>> is very likely to be hit-or-miss without it!
>> Could you elaborate a bit on this?  I was under the impression that suspend
>> cancellation was there mostly for the netaccel bits, but I have to admit I
>> didn't look at it very closely.  What scenarios do the suspend cancel bits
>> help?
> 
> Suspend failures (failure to save or to migrate). Also for live
> checkpointing/snapshotting. The feature indicates that the guest is happy
> for the suspend hypercall to return indicating 'failure/cancelled' and in
> which case it can pretty much resume whatever it was doing without any of
> the usual resume logic. The alternative is for the toolstack to make it look
> like the domain has been restored/migrated, by resetting PV devices and the
> like, and this is not much tested and almost inherently fragile.
> 
> The suspend_cancel hook you are thinking of is for any PV devices which *do*
> need to know that a suspend was cancelled. Netaccel does for some reason
> which I cannot recall.
> 
> Anyway, it is pretty easy and pretty important to support SUSPEND_CANCEL!

OK, that makes sense.  Thanks Keir!

-- 
Chris Lalancette

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] Fix xenconsole after aborted save
  2009-04-22 10:22         ` Keir Fraser
  2009-04-22 12:41           ` John Levon
  2009-04-22 14:24           ` Chris Lalancette
@ 2009-04-22 16:15           ` Neil Turton
  2 siblings, 0 replies; 11+ messages in thread
From: Neil Turton @ 2009-04-22 16:15 UTC (permalink / raw)
  To: Keir Fraser
  Cc: Jiri Denemark, Chris Lalancette, xen-devel@lists.xensource.com,
	Gerd Hoffmann

Keir Fraser wrote:
> The suspend_cancel hook you are thinking of is for any PV devices which *do*
> need to know that a suspend was cancelled. Netaccel does for some reason
> which I cannot recall.

For the record, netaccel needs to remove any mappings of the hardware
during the suspend callback since those mappings are not valid after the
suspend has completed.  Those hardware mappings (or the mappings
appropriate for the new machine) are restored on resume as a result of
the standard netfront/netback connection being re-established.  If the
suspend is cancelled, the netfront/netback connection stays in place but
the hardware mappings need to be re-established which is why the
suspend_cancel callback is used.

Cheers, Neil.

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2009-04-22 16:15 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-04-20 13:23 [PATCH] Fix xenconsole after aborted save Jiri Denemark
2009-04-20 13:38 ` Gerd Hoffmann
2009-04-20 14:04   ` Jiri Denemark
2009-04-20 14:09     ` Gerd Hoffmann
2009-04-20 14:41     ` Keir Fraser
2009-04-20 15:58       ` Jiri Denemark
2009-04-22  9:49       ` Chris Lalancette
2009-04-22 10:22         ` Keir Fraser
2009-04-22 12:41           ` John Levon
2009-04-22 14:24           ` Chris Lalancette
2009-04-22 16:15           ` Neil Turton

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.