Re: [PATCH 0/6] Save state error handling (kill off no_migrate)

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Alex Williamson <alex.williamson@redhat.com>
To: "Michael S. Tsirkin" <mst@redhat.com>
Cc: qemu-devel@nongnu.org, kvm@vger.kernel.org, cam@cs.ualberta.ca,
	quintela@redhat.com, anthony@codemonkey.ws
Subject: Re: [PATCH 0/6] Save state error handling (kill off no_migrate)
Date: Tue, 09 Nov 2010 12:35:48 -0700	[thread overview]
Message-ID: <1289331348.14321.58.camel@x201> (raw)
In-Reply-To: <1289324646.14321.39.camel@x201>

On Tue, 2010-11-09 at 10:44 -0700, Alex Williamson wrote:
> On Tue, 2010-11-09 at 18:49 +0200, Michael S. Tsirkin wrote:
> > On Tue, Nov 09, 2010 at 09:30:45AM -0700, Alex Williamson wrote:
> > > On Tue, 2010-11-09 at 18:15 +0200, Michael S. Tsirkin wrote:
> > > > On Tue, Nov 09, 2010 at 08:47:00AM -0700, Alex Williamson wrote:
> > > > > > > But it could.  What if ivshmem is acting in a peer role, but has no
> > > > > > > clients, could it migrate?  What if ivshmem is migratable when the
> > > > > > > migration begins, but while the migration continues, a connection is
> > > > > > > setup and it becomes unmigratable.
> > > > > > 
> > > > > > Sounds like something we should work to prevent, not support :)
> > > > > 
> > > > > s/:)/:(/  why?
> > > > 
> > > > It will just confuse everyone. Also if it happens after sending
> > > > all of memory, it's pretty painful.
> > > 
> > > It happens after sending all of memory with no_migrate, and I think
> > > pushing that earlier might introduce some races around when
> > > register_device_unmigratable() can be called.
> > 
> > Good point. I guess we could check it twice just to speed things up.
> > 
> > > > > > >  Using this series, ivshmem would
> > > > > > > have multiple options how to support this.  It could a) NAK the
> > > > > > > migration, b) drop connections and prevent new connections until the
> > > > > > > migration finishes, c) detect that new connections have happened since
> > > > > > > the migration started and cancel.  And probably more.  no_migrate can
> > > > > > > only do a).  And in fact, we can only test no_migrate after the VM is
> > > > > > > stopped (after all memory is migrated) because otherwise it could race
> > > > > > > with devices setting no_migrate during migration.
> > > > > > 
> > > > > > We really want no_migrate to be static. changing it is abusing
> > > > > > the infrastructure.
> > > > > 
> > > > > You call it abusing, I call it making use of the infrastructure.  Why
> > > > > unnecessarily restrict ourselves?  Is return 0/-1 really that scary,
> > > > > unmaintainable, undebuggable?  I don't understand the resistance.
> > > > > 
> > > > > Alex
> > > > 
> > > > management really does not know how to handle unexpected
> > > > migration failures. They must be avoided.
> > > > 
> > > > There are some very special cases that fail migration. They are
> > > > currently easy to find with grep register_device_unmigratable.
> > > > I prefer to keep it that way.
> > > 
> > > How can management tools be improved to better handle unexpected
> > > migration failures when the only way for qemu to fail is an abort?
> > > We need the infrastructure to at least return an error first.  Do we just
> > > need to add some fprintfs to the save core to print the id string of the
> > > device that failed to save?  I just can't buy the "code is easier to
> > > grep" as an argument against adding better error handling to the save
> > > code path.
> > 
> > I just don't buy the 'we'll return meaningless error codes at random
> > point in time and management will figure it out' as an argument :)
> 
> Why is the error code meaningless?  The error code stops the migration
> in qemu and hopefully prints an error message (we could easily add an
> fprintf to the core save code to ensure we know the device responsible
> for the NAK).  From there we can figure out how to return the error to
> monitors and management tools, but we need to have a way to know there's
> an error first.
> 
> > >  Anyone else want to chime in?
> > > 
> > > Alex
> > 
> > Maybe try coding up some user using the new infrastructure to do
> > something useful, that register_device_unmigratable can't do.
> 
> With the number of people I hear complaining about how qemu has too many
> aborts, no error checking, and no way to return errors, I'm a little
> dumbfounded that there's such a roadblock to actually add some simple
> error handling.  Is it the error handling you're opposed to, or the way
> I'm using it to NAK a migration?

Would something like this in place of 6/6 ease your grep'ability and
debug'ability concerns?  Unfortunately it adds an assert, but if we
stomp on the vmsd, then the save state entry can't be unregistered.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---

diff --git a/savevm.c b/savevm.c
index 521edc8..b0aaa46 100644
--- a/savevm.c
+++ b/savevm.c
@@ -1039,7 +1039,6 @@ typedef struct SaveStateEntry {
     const VMStateDescription *vmsd;
     void *opaque;
     CompatEntry *compat;
-    int no_migrate;
 } SaveStateEntry;
 
 
@@ -1103,7 +1102,6 @@ int register_savevm_live(DeviceState *dev,
     se->load_state = load_state;
     se->opaque = opaque;
     se->vmsd = NULL;
-    se->no_migrate = 0;
 
     if (dev && dev->parent_bus && dev->parent_bus->info->get_dev_path) {
         char *id = dev->parent_bus->info->get_dev_path(dev);
@@ -1170,6 +1168,22 @@ void unregister_savevm(DeviceState *dev, const char *idstr, void *opaque)
     }
 }
 
+static int nomigrate_set_params(int blk_enable, int shared, void *opaque)
+{
+    return -EFAULT;
+}
+
+static int nomigrate_save_live_state(Monitor *mon, QEMUFile *f, int stage,
+                                     void *opaque)
+{
+    return -EFAULT;
+}
+
+static int nomigrate_save_state(QEMUFile *f, void *opaque)
+{
+    return -EFAULT;
+}
+
 /* mark a device as not to be migrated, that is the device should be
    unplugged before migration */
 void register_device_unmigratable(DeviceState *dev, const char *idstr,
@@ -1190,7 +1204,10 @@ void register_device_unmigratable(DeviceState *dev, const char *idstr,
 
     QTAILQ_FOREACH(se, &savevm_handlers, entry) {
         if (strcmp(se->idstr, id) == 0 && se->opaque == opaque) {
-            se->no_migrate = 1;
+            se->set_params = nomigrate_set_params;
+            se->save_live_state = nomigrate_save_live_state;
+            se->save_state = nomigrate_save_state;
+            assert(!se->vmsd);
         }
     }
 }
@@ -1410,10 +1427,6 @@ static int vmstate_load(QEMUFile *f, SaveStateEntry *se, int version_id)
 
 static int vmstate_save(QEMUFile *f, SaveStateEntry *se)
 {
-    if (se->no_migrate) {
-        return -1;
-    }
-
     if (!se->vmsd) {         /* Old style */
         return se->save_state(f, se->opaque);
     }
@@ -1443,6 +1456,9 @@ int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable,
 	}
 	ret = se->set_params(blk_enable, shared, se->opaque);
         if (ret < 0) {
+            monitor_printf(mon,
+                           "Save state begin blocked by device '%s', error:"
+                           " %s\n", se->idstr, strerror(-ret));
             return ret;
         }
     }
@@ -1470,6 +1486,9 @@ int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable,
 
         ret = se->save_live_state(mon, f, QEMU_VM_SECTION_START, se->opaque);
         if (ret < 0) {
+            monitor_printf(mon,
+                           "Save state begin blocked by device '%s', error:"
+                           " %s\n", se->idstr, strerror(-ret));
             return ret;
         }
     }
@@ -1503,6 +1522,9 @@ int qemu_savevm_state_iterate(Monitor *mon, QEMUFile *f)
                synchronized over and over again. */
             break;
         } else if (ret < 0) {
+            monitor_printf(mon,
+                           "Save state iterate blocked by device '%s', error:"
+                           " %s\n", se->idstr, strerror(-ret));
             return ret;
         }
     }
@@ -1535,6 +1557,9 @@ int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f)
 
         r = se->save_live_state(mon, f, QEMU_VM_SECTION_END, se->opaque);
         if (r < 0) {
+            monitor_printf(mon,
+                           "Save state complete blocked by device '%s', error:"
+                           " %s\n", se->idstr, strerror(-r));
             return r;
         }
     }
@@ -1559,7 +1584,9 @@ int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f)
 
         r = vmstate_save(f, se);
         if (r < 0) {
-            monitor_printf(mon, "cannot migrate with device '%s'\n", se->idstr);
+            monitor_printf(mon,
+                           "Save state complete blocked by device '%s', error:"
+                           " %s\n", se->idstr, strerror(-r));
             return r;
         }
     }

WARNING: multiple messages have this Message-ID (diff)

From: Alex Williamson <alex.williamson@redhat.com>
To: "Michael S. Tsirkin" <mst@redhat.com>
Cc: cam@cs.ualberta.ca, qemu-devel@nongnu.org, kvm@vger.kernel.org,
	quintela@redhat.com
Subject: [Qemu-devel] Re: [PATCH 0/6] Save state error handling (kill off no_migrate)
Date: Tue, 09 Nov 2010 12:35:48 -0700	[thread overview]
Message-ID: <1289331348.14321.58.camel@x201> (raw)
In-Reply-To: <1289324646.14321.39.camel@x201>

On Tue, 2010-11-09 at 10:44 -0700, Alex Williamson wrote:
> On Tue, 2010-11-09 at 18:49 +0200, Michael S. Tsirkin wrote:
> > On Tue, Nov 09, 2010 at 09:30:45AM -0700, Alex Williamson wrote:
> > > On Tue, 2010-11-09 at 18:15 +0200, Michael S. Tsirkin wrote:
> > > > On Tue, Nov 09, 2010 at 08:47:00AM -0700, Alex Williamson wrote:
> > > > > > > But it could.  What if ivshmem is acting in a peer role, but has no
> > > > > > > clients, could it migrate?  What if ivshmem is migratable when the
> > > > > > > migration begins, but while the migration continues, a connection is
> > > > > > > setup and it becomes unmigratable.
> > > > > > 
> > > > > > Sounds like something we should work to prevent, not support :)
> > > > > 
> > > > > s/:)/:(/  why?
> > > > 
> > > > It will just confuse everyone. Also if it happens after sending
> > > > all of memory, it's pretty painful.
> > > 
> > > It happens after sending all of memory with no_migrate, and I think
> > > pushing that earlier might introduce some races around when
> > > register_device_unmigratable() can be called.
> > 
> > Good point. I guess we could check it twice just to speed things up.
> > 
> > > > > > >  Using this series, ivshmem would
> > > > > > > have multiple options how to support this.  It could a) NAK the
> > > > > > > migration, b) drop connections and prevent new connections until the
> > > > > > > migration finishes, c) detect that new connections have happened since
> > > > > > > the migration started and cancel.  And probably more.  no_migrate can
> > > > > > > only do a).  And in fact, we can only test no_migrate after the VM is
> > > > > > > stopped (after all memory is migrated) because otherwise it could race
> > > > > > > with devices setting no_migrate during migration.
> > > > > > 
> > > > > > We really want no_migrate to be static. changing it is abusing
> > > > > > the infrastructure.
> > > > > 
> > > > > You call it abusing, I call it making use of the infrastructure.  Why
> > > > > unnecessarily restrict ourselves?  Is return 0/-1 really that scary,
> > > > > unmaintainable, undebuggable?  I don't understand the resistance.
> > > > > 
> > > > > Alex
> > > > 
> > > > management really does not know how to handle unexpected
> > > > migration failures. They must be avoided.
> > > > 
> > > > There are some very special cases that fail migration. They are
> > > > currently easy to find with grep register_device_unmigratable.
> > > > I prefer to keep it that way.
> > > 
> > > How can management tools be improved to better handle unexpected
> > > migration failures when the only way for qemu to fail is an abort?
> > > We need the infrastructure to at least return an error first.  Do we just
> > > need to add some fprintfs to the save core to print the id string of the
> > > device that failed to save?  I just can't buy the "code is easier to
> > > grep" as an argument against adding better error handling to the save
> > > code path.
> > 
> > I just don't buy the 'we'll return meaningless error codes at random
> > point in time and management will figure it out' as an argument :)
> 
> Why is the error code meaningless?  The error code stops the migration
> in qemu and hopefully prints an error message (we could easily add an
> fprintf to the core save code to ensure we know the device responsible
> for the NAK).  From there we can figure out how to return the error to
> monitors and management tools, but we need to have a way to know there's
> an error first.
> 
> > >  Anyone else want to chime in?
> > > 
> > > Alex
> > 
> > Maybe try coding up some user using the new infrastructure to do
> > something useful, that register_device_unmigratable can't do.
> 
> With the number of people I hear complaining about how qemu has too many
> aborts, no error checking, and no way to return errors, I'm a little
> dumbfounded that there's such a roadblock to actually add some simple
> error handling.  Is it the error handling you're opposed to, or the way
> I'm using it to NAK a migration?

Would something like this in place of 6/6 ease your grep'ability and
debug'ability concerns?  Unfortunately it adds an assert, but if we
stomp on the vmsd, then the save state entry can't be unregistered.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---

diff --git a/savevm.c b/savevm.c
index 521edc8..b0aaa46 100644
--- a/savevm.c
+++ b/savevm.c
@@ -1039,7 +1039,6 @@ typedef struct SaveStateEntry {
     const VMStateDescription *vmsd;
     void *opaque;
     CompatEntry *compat;
-    int no_migrate;
 } SaveStateEntry;
 
 
@@ -1103,7 +1102,6 @@ int register_savevm_live(DeviceState *dev,
     se->load_state = load_state;
     se->opaque = opaque;
     se->vmsd = NULL;
-    se->no_migrate = 0;
 
     if (dev && dev->parent_bus && dev->parent_bus->info->get_dev_path) {
         char *id = dev->parent_bus->info->get_dev_path(dev);
@@ -1170,6 +1168,22 @@ void unregister_savevm(DeviceState *dev, const char *idstr, void *opaque)
     }
 }
 
+static int nomigrate_set_params(int blk_enable, int shared, void *opaque)
+{
+    return -EFAULT;
+}
+
+static int nomigrate_save_live_state(Monitor *mon, QEMUFile *f, int stage,
+                                     void *opaque)
+{
+    return -EFAULT;
+}
+
+static int nomigrate_save_state(QEMUFile *f, void *opaque)
+{
+    return -EFAULT;
+}
+
 /* mark a device as not to be migrated, that is the device should be
    unplugged before migration */
 void register_device_unmigratable(DeviceState *dev, const char *idstr,
@@ -1190,7 +1204,10 @@ void register_device_unmigratable(DeviceState *dev, const char *idstr,
 
     QTAILQ_FOREACH(se, &savevm_handlers, entry) {
         if (strcmp(se->idstr, id) == 0 && se->opaque == opaque) {
-            se->no_migrate = 1;
+            se->set_params = nomigrate_set_params;
+            se->save_live_state = nomigrate_save_live_state;
+            se->save_state = nomigrate_save_state;
+            assert(!se->vmsd);
         }
     }
 }
@@ -1410,10 +1427,6 @@ static int vmstate_load(QEMUFile *f, SaveStateEntry *se, int version_id)
 
 static int vmstate_save(QEMUFile *f, SaveStateEntry *se)
 {
-    if (se->no_migrate) {
-        return -1;
-    }
-
     if (!se->vmsd) {         /* Old style */
         return se->save_state(f, se->opaque);
     }
@@ -1443,6 +1456,9 @@ int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable,
 	}
 	ret = se->set_params(blk_enable, shared, se->opaque);
         if (ret < 0) {
+            monitor_printf(mon,
+                           "Save state begin blocked by device '%s', error:"
+                           " %s\n", se->idstr, strerror(-ret));
             return ret;
         }
     }
@@ -1470,6 +1486,9 @@ int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable,
 
         ret = se->save_live_state(mon, f, QEMU_VM_SECTION_START, se->opaque);
         if (ret < 0) {
+            monitor_printf(mon,
+                           "Save state begin blocked by device '%s', error:"
+                           " %s\n", se->idstr, strerror(-ret));
             return ret;
         }
     }
@@ -1503,6 +1522,9 @@ int qemu_savevm_state_iterate(Monitor *mon, QEMUFile *f)
                synchronized over and over again. */
             break;
         } else if (ret < 0) {
+            monitor_printf(mon,
+                           "Save state iterate blocked by device '%s', error:"
+                           " %s\n", se->idstr, strerror(-ret));
             return ret;
         }
     }
@@ -1535,6 +1557,9 @@ int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f)
 
         r = se->save_live_state(mon, f, QEMU_VM_SECTION_END, se->opaque);
         if (r < 0) {
+            monitor_printf(mon,
+                           "Save state complete blocked by device '%s', error:"
+                           " %s\n", se->idstr, strerror(-r));
             return r;
         }
     }
@@ -1559,7 +1584,9 @@ int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f)
 
         r = vmstate_save(f, se);
         if (r < 0) {
-            monitor_printf(mon, "cannot migrate with device '%s'\n", se->idstr);
+            monitor_printf(mon,
+                           "Save state complete blocked by device '%s', error:"
+                           " %s\n", se->idstr, strerror(-r));
             return r;
         }
     }

next prev parent reply	other threads:[~2010-11-09 19:35 UTC|newest]

Thread overview: 52+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-10-06 20:58 [PATCH 0/6] Save state error handling (kill off no_migrate) Alex Williamson
2010-10-06 20:58 ` [Qemu-devel] " Alex Williamson
2010-10-06 20:59 ` [PATCH 1/6] savevm: Allow SaveStateHandler() to return error Alex Williamson
2010-10-06 20:59   ` [Qemu-devel] " Alex Williamson
2010-10-06 20:59 ` [PATCH 2/6] savevm: Allow vmsd->pre_save " Alex Williamson
2010-10-06 20:59   ` [Qemu-devel] " Alex Williamson
2010-10-06 20:59 ` [PATCH 3/6] pci: Allow pci_device_save() " Alex Williamson
2010-10-06 20:59   ` [Qemu-devel] " Alex Williamson
2010-10-06 20:59 ` [PATCH 4/6] virtio: Allow virtio_save() errors Alex Williamson
2010-10-06 20:59   ` [Qemu-devel] " Alex Williamson
2010-10-06 20:59 ` [PATCH 5/6] savevm: Allow set_params and save_live_state to error Alex Williamson
2010-10-06 20:59   ` [Qemu-devel] " Alex Williamson
2010-10-06 20:59 ` [PATCH 6/6] savevm: Remove register_device_unmigratable() Alex Williamson
2010-10-06 20:59   ` [Qemu-devel] " Alex Williamson
2010-10-07 16:55 ` [PATCH 0/6] Save state error handling (kill off no_migrate) Alex Williamson
2010-10-07 16:55   ` [Qemu-devel] " Alex Williamson
2010-11-08 11:40 ` Michael S. Tsirkin
2010-11-08 11:40   ` [Qemu-devel] " Michael S. Tsirkin
2010-11-08 14:59   ` Alex Williamson
2010-11-08 14:59     ` [Qemu-devel] " Alex Williamson
2010-11-08 16:54     ` Michael S. Tsirkin
2010-11-08 16:54       ` [Qemu-devel] " Michael S. Tsirkin
2010-11-08 17:20       ` Alex Williamson
2010-11-08 17:20         ` [Qemu-devel] " Alex Williamson
2010-11-08 20:59         ` Michael S. Tsirkin
2010-11-08 20:59           ` [Qemu-devel] " Michael S. Tsirkin
2010-11-08 21:23           ` Alex Williamson
2010-11-08 21:23             ` [Qemu-devel] " Alex Williamson
2010-11-09 12:00             ` Michael S. Tsirkin
2010-11-09 12:00               ` [Qemu-devel] " Michael S. Tsirkin
2010-11-09 14:58               ` Alex Williamson
2010-11-09 14:58                 ` [Qemu-devel] " Alex Williamson
2010-11-09 15:07                 ` Michael S. Tsirkin
2010-11-09 15:07                   ` [Qemu-devel] " Michael S. Tsirkin
2010-11-09 15:34                   ` Alex Williamson
2010-11-09 15:34                     ` [Qemu-devel] " Alex Williamson
2010-11-09 15:42                     ` Michael S. Tsirkin
2010-11-09 15:42                       ` [Qemu-devel] " Michael S. Tsirkin
2010-11-09 15:47                       ` Alex Williamson
2010-11-09 15:47                         ` [Qemu-devel] " Alex Williamson
2010-11-09 16:15                         ` Michael S. Tsirkin
2010-11-09 16:15                           ` [Qemu-devel] " Michael S. Tsirkin
2010-11-09 16:30                           ` Alex Williamson
2010-11-09 16:30                             ` [Qemu-devel] " Alex Williamson
2010-11-09 16:49                             ` Michael S. Tsirkin
2010-11-09 16:49                               ` [Qemu-devel] " Michael S. Tsirkin
2010-11-09 17:44                               ` Alex Williamson
2010-11-09 17:44                                 ` [Qemu-devel] " Alex Williamson
2010-11-09 19:35                                 ` Alex Williamson [this message]
2010-11-09 19:35                                   ` Alex Williamson
2010-11-16 10:23 ` Juan Quintela
2010-11-16 10:23   ` [Qemu-devel] " Juan Quintela

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:521edc8 dfblob:b0aaa46 dfblob:521edc8 dfblob:b0aaa46 )
 OR (
bs:"[Qemu-devel] Re: [PATCH 0/6] Save state error handling (kill off no_migrate)" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1289331348.14321.58.camel@x201 \
    --to=alex.williamson@redhat.com \
    --cc=anthony@codemonkey.ws \
    --cc=cam@cs.ualberta.ca \
    --cc=kvm@vger.kernel.org \
    --cc=mst@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=quintela@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.