[PATCH v2 00/27] Virtio net failover fixes

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2 00/27] Virtio net failover fixes
@ 2020-11-18  8:37 Juan Quintela
  2020-11-18  8:37 ` [PATCH v2 01/27] migration: Network Failover can't work with a paused guest Juan Quintela
                   ` (28 more replies)
  0 siblings, 29 replies; 58+ messages in thread
From: Juan Quintela @ 2020-11-18  8:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrangé, Eduardo Habkost, Michael S. Tsirkin,
	Jason Wang, Juan Quintela, Dr. David Alan Gilbert, Paolo Bonzini

Hi

This is a big rework of the network failover setup.  General idea is:
* We don't cache the name of the primary/standby devices
  We have several problems there with stale pointers
* After this:
- We can always remove either the primary/standby devices without trouble
- Pluggin/unplugging works
- We go to device opts to see what the failover device are.
  Notice that we are plugging/unplugging the device, so it is not critical.
- Once there, I "fixed" managedsave for libvirt (now gives an error instead o=
f just hanging)
* Fields not cached anymore:
- primary_device_dict
- primary_device_opts
- standby_id
- primary_device_id
- primary_dev
* I renamed the should_be_hidden() callback to hide device, but if
  people preffer the old name I can leave it.
* Add (some) doc to some functions
* Remove almost 100 lines of code while fixing things.

Please review.

Later, Juan.

Juan Quintela (27):
  migration: Network Failover can't work with a paused guest
  failover: fix indentantion
  failover: Use always atomics for primary_should_be_hidden
  failover: primary bus is only used once, and where it is set
  failover: Remove unused parameter
  failover: Remove external partially_hotplugged property
  failover: qdev_device_add() returns err or dev set
  failover: Rename bool to failover_primary_hidden
  failover: g_strcmp0() knows how to handle NULL
  failover: Remove primary_device_opts
  failover: remove standby_id variable
  failover: Remove primary_device_dict
  failover: Remove memory leak
  failover: simplify virtio_net_find_primary()
  failover: should_be_hidden() should take a bool
  failover: Rename function to hide_device()
  failover: virtio_net_connect_failover_devices() does nothing
  failover: Rename to failover_find_primary_device()
  failover: simplify qdev_device_add() failover case
  failover: simplify qdev_device_add()
  failover: make sure that id always exist
  failover: remove failover_find_primary_device() error parameter
  failover: split failover_find_primary_device_id()
  failover: We don't need to cache primary_device_id anymore
  failover: Caller of this two functions already have primary_dev
  failover: simplify failover_unplug_primary
  failover: Remove primary_dev member

 include/hw/qdev-core.h         |  28 ++--
 include/hw/virtio/virtio-net.h |   9 +-
 hw/core/qdev.c                 |  19 +--
 hw/net/virtio-net.c            | 298 +++++++++++++--------------------
 migration/migration.c          |  13 ++
 softmmu/qdev-monitor.c         |  43 ++---
 6 files changed, 167 insertions(+), 243 deletions(-)

--=20
2.26.2




^ permalink raw reply	[flat|nested] 58+ messages in thread

* [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
  2020-11-18  8:37 [PATCH v2 00/27] Virtio net failover fixes Juan Quintela
@ 2020-11-18  8:37 ` Juan Quintela
  2020-11-25 12:05   ` Dr. David Alan Gilbert
  2020-12-02 10:13   ` Michael S. Tsirkin
  2020-11-18  8:37 ` [PATCH v2 02/27] failover: fix indentantion Juan Quintela
                   ` (27 subsequent siblings)
  28 siblings, 2 replies; 58+ messages in thread
From: Juan Quintela @ 2020-11-18  8:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrangé, Eduardo Habkost, Michael S. Tsirkin,
	Jason Wang, Juan Quintela, Dr. David Alan Gilbert, Paolo Bonzini

If we have a paused guest, it can't unplug the network VF device, so
we wait there forever.  Just change the code to give one error on that
case.

Signed-off-by: Juan Quintela <quintela@redhat.com>
---
 migration/migration.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/migration/migration.c b/migration/migration.c
index 87a9b59f83..d44fc880f9 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3548,6 +3548,18 @@ static void *migration_thread(void *opaque)
     qemu_savevm_state_setup(s->to_dst_file);
 
     if (qemu_savevm_state_guest_unplug_pending()) {
+        /* if guest is paused, it can send back the wait event */
+        if (!runstate_is_running()) {
+            Error *local_err = NULL;
+
+            error_setg(&local_err, "migration: network failover and "
+                       "guest is paused'");
+            migrate_set_error(s, local_err);
+            error_free(local_err);
+            migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
+                              MIGRATION_STATUS_FAILED);
+            goto end;
+        }
         migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
                           MIGRATION_STATUS_WAIT_UNPLUG);
 
@@ -3597,6 +3609,7 @@ static void *migration_thread(void *opaque)
     }
 
     trace_migration_thread_after_loop();
+end:
     migration_iteration_finish(s);
     object_unref(OBJECT(s));
     rcu_unregister_thread();
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
  2020-11-18  8:37 ` [PATCH v2 01/27] migration: Network Failover can't work with a paused guest Juan Quintela
@ 2020-11-25 12:05   ` Dr. David Alan Gilbert
  2020-12-02 10:13   ` Michael S. Tsirkin
  1 sibling, 0 replies; 58+ messages in thread
From: Dr. David Alan Gilbert @ 2020-11-25 12:05 UTC (permalink / raw)
  To: Juan Quintela
  Cc: Daniel P. Berrangé, Eduardo Habkost, Michael S. Tsirkin,
	Jason Wang, qemu-devel, Paolo Bonzini

* Juan Quintela (quintela@redhat.com) wrote:
> If we have a paused guest, it can't unplug the network VF device, so
> we wait there forever.  Just change the code to give one error on that
> case.
> 
> Signed-off-by: Juan Quintela <quintela@redhat.com>

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  migration/migration.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 87a9b59f83..d44fc880f9 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3548,6 +3548,18 @@ static void *migration_thread(void *opaque)
>      qemu_savevm_state_setup(s->to_dst_file);
>  
>      if (qemu_savevm_state_guest_unplug_pending()) {
> +        /* if guest is paused, it can send back the wait event */
> +        if (!runstate_is_running()) {
> +            Error *local_err = NULL;
> +
> +            error_setg(&local_err, "migration: network failover and "
> +                       "guest is paused'");
> +            migrate_set_error(s, local_err);
> +            error_free(local_err);
> +            migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
> +                              MIGRATION_STATUS_FAILED);
> +            goto end;
> +        }
>          migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
>                            MIGRATION_STATUS_WAIT_UNPLUG);
>  
> @@ -3597,6 +3609,7 @@ static void *migration_thread(void *opaque)
>      }
>  
>      trace_migration_thread_after_loop();
> +end:
>      migration_iteration_finish(s);
>      object_unref(OBJECT(s));
>      rcu_unregister_thread();
> -- 
> 2.26.2
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
  2020-11-18  8:37 ` [PATCH v2 01/27] migration: Network Failover can't work with a paused guest Juan Quintela
  2020-11-25 12:05   ` Dr. David Alan Gilbert
@ 2020-12-02 10:13   ` Michael S. Tsirkin
  2020-12-02 10:27     ` Daniel P. Berrangé
  1 sibling, 1 reply; 58+ messages in thread
From: Michael S. Tsirkin @ 2020-12-02 10:13 UTC (permalink / raw)
  To: Juan Quintela
  Cc: Daniel P. Berrangé, Eduardo Habkost, Jason Wang, qemu-devel,
	Dr. David Alan Gilbert, Paolo Bonzini

On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan Quintela wrote:
> If we have a paused guest, it can't unplug the network VF device, so
> we wait there forever.  Just change the code to give one error on that
> case.
> 
> Signed-off-by: Juan Quintela <quintela@redhat.com>

It's certainly possible but it's management that created
this situation after all - why do we bother to enforce
a policy? It is possible that management will unpause immediately
afterwards and everything will proceed smoothly.

Yes migration will not happen until guest is
unpaused but the same it true of e.g. a guest that is stuck
because of a bug.




> ---
>  migration/migration.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 87a9b59f83..d44fc880f9 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3548,6 +3548,18 @@ static void *migration_thread(void *opaque)
>      qemu_savevm_state_setup(s->to_dst_file);
>  
>      if (qemu_savevm_state_guest_unplug_pending()) {
> +        /* if guest is paused, it can send back the wait event */
> +        if (!runstate_is_running()) {
> +            Error *local_err = NULL;
> +
> +            error_setg(&local_err, "migration: network failover and "
> +                       "guest is paused'");
> +            migrate_set_error(s, local_err);
> +            error_free(local_err);
> +            migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
> +                              MIGRATION_STATUS_FAILED);
> +            goto end;
> +        }
>          migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
>                            MIGRATION_STATUS_WAIT_UNPLUG);
>  
> @@ -3597,6 +3609,7 @@ static void *migration_thread(void *opaque)
>      }
>  
>      trace_migration_thread_after_loop();
> +end:
>      migration_iteration_finish(s);
>      object_unref(OBJECT(s));
>      rcu_unregister_thread();
> -- 
> 2.26.2



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
  2020-12-02 10:13   ` Michael S. Tsirkin
@ 2020-12-02 10:27     ` Daniel P. Berrangé
  2020-12-02 10:31       ` Michael S. Tsirkin
  0 siblings, 1 reply; 58+ messages in thread
From: Daniel P. Berrangé @ 2020-12-02 10:27 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Eduardo Habkost, Juan Quintela, Jason Wang,
	Dr. David Alan Gilbert, qemu-devel, Paolo Bonzini

On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. Tsirkin wrote:
> On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan Quintela wrote:
> > If we have a paused guest, it can't unplug the network VF device, so
> > we wait there forever.  Just change the code to give one error on that
> > case.
> > 
> > Signed-off-by: Juan Quintela <quintela@redhat.com>
> 
> It's certainly possible but it's management that created
> this situation after all - why do we bother to enforce
> a policy? It is possible that management will unpause immediately
> afterwards and everything will proceed smoothly.
> 
> Yes migration will not happen until guest is
> unpaused but the same it true of e.g. a guest that is stuck
> because of a bug.

That's pretty different behaviour from how migration normally handles
a paused guest, which is that it is guaranteed to complete the migration
in as short a time as network bandwidth allows.

Just ignoring the situation I think will lead to surprise apps / admins,
because the person/entity invoking the migration is not likely to have
checked wether this particular guest uses net failover or not before
invoking - they'll just be expecting a paused migration to run fast and
be guaranteed to complete.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
  2020-12-02 10:27     ` Daniel P. Berrangé
@ 2020-12-02 10:31       ` Michael S. Tsirkin
  2020-12-02 10:33         ` Michael S. Tsirkin
  2020-12-02 10:34         ` Daniel P. Berrangé
  0 siblings, 2 replies; 58+ messages in thread
From: Michael S. Tsirkin @ 2020-12-02 10:31 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Eduardo Habkost, Juan Quintela, Jason Wang,
	Dr. David Alan Gilbert, qemu-devel, Paolo Bonzini

On Wed, Dec 02, 2020 at 10:27:18AM +0000, Daniel P. BerrangÃ© wrote:
> On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. Tsirkin wrote:
> > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan Quintela wrote:
> > > If we have a paused guest, it can't unplug the network VF device, so
> > > we wait there forever.  Just change the code to give one error on that
> > > case.
> > > 
> > > Signed-off-by: Juan Quintela <quintela@redhat.com>
> > 
> > It's certainly possible but it's management that created
> > this situation after all - why do we bother to enforce
> > a policy? It is possible that management will unpause immediately
> > afterwards and everything will proceed smoothly.
> > 
> > Yes migration will not happen until guest is
> > unpaused but the same it true of e.g. a guest that is stuck
> > because of a bug.
> 
> That's pretty different behaviour from how migration normally handles
> a paused guest, which is that it is guaranteed to complete the migration
> in as short a time as network bandwidth allows.
> 
> Just ignoring the situation I think will lead to surprise apps / admins,
> because the person/entity invoking the migration is not likely to have
> checked wether this particular guest uses net failover or not before
> invoking - they'll just be expecting a paused migration to run fast and
> be guaranteed to complete.
> 
> Regards,
> Daniel

Okay I guess. But then shouldn't we handle the reverse situation too:
pausing guest after migration started but before device was
unplugged?


> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
  2020-12-02 10:31       ` Michael S. Tsirkin
@ 2020-12-02 10:33         ` Michael S. Tsirkin
  2020-12-02 10:51           ` Juan Quintela
  2020-12-02 10:34         ` Daniel P. Berrangé
  1 sibling, 1 reply; 58+ messages in thread
From: Michael S. Tsirkin @ 2020-12-02 10:33 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Eduardo Habkost, Juan Quintela, Jason Wang,
	Dr. David Alan Gilbert, qemu-devel, Paolo Bonzini

On Wed, Dec 02, 2020 at 05:31:53AM -0500, Michael S. Tsirkin wrote:
> On Wed, Dec 02, 2020 at 10:27:18AM +0000, Daniel P. BerrangÃ© wrote:
> > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. Tsirkin wrote:
> > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan Quintela wrote:
> > > > If we have a paused guest, it can't unplug the network VF device, so
> > > > we wait there forever.  Just change the code to give one error on that
> > > > case.
> > > > 
> > > > Signed-off-by: Juan Quintela <quintela@redhat.com>
> > > 
> > > It's certainly possible but it's management that created
> > > this situation after all - why do we bother to enforce
> > > a policy? It is possible that management will unpause immediately
> > > afterwards and everything will proceed smoothly.
> > > 
> > > Yes migration will not happen until guest is
> > > unpaused but the same it true of e.g. a guest that is stuck
> > > because of a bug.
> > 
> > That's pretty different behaviour from how migration normally handles
> > a paused guest, which is that it is guaranteed to complete the migration
> > in as short a time as network bandwidth allows.
> > 
> > Just ignoring the situation I think will lead to surprise apps / admins,
> > because the person/entity invoking the migration is not likely to have
> > checked wether this particular guest uses net failover or not before
> > invoking - they'll just be expecting a paused migration to run fast and
> > be guaranteed to complete.
> > 
> > Regards,
> > Daniel
> 
> Okay I guess. But then shouldn't we handle the reverse situation too:
> pausing guest after migration started but before device was
> unplugged?
> 

Thinking of which, I have no idea how we'd handle it - fail
pausing guest until migration is cancelled?

All this seems heavy handed to me ...

> > -- 
> > |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> > |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> > |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
  2020-12-02 10:33         ` Michael S. Tsirkin
@ 2020-12-02 10:51           ` Juan Quintela
  2020-12-02 10:55             ` Daniel P. Berrangé
  0 siblings, 1 reply; 58+ messages in thread
From: Juan Quintela @ 2020-12-02 10:51 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Daniel P. Berrangé, Eduardo Habkost, Jason Wang, qemu-devel,
	Dr. David Alan Gilbert, Paolo Bonzini

"Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Wed, Dec 02, 2020 at 05:31:53AM -0500, Michael S. Tsirkin wrote:
>> On Wed, Dec 02, 2020 at 10:27:18AM +0000, Daniel P. BerrangÃƒÂ© wrote:
>> > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. Tsirkin wrote:
>> > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan Quintela wrote:
>> > > > If we have a paused guest, it can't unplug the network VF device, so
>> > > > we wait there forever.  Just change the code to give one error on that
>> > > > case.
>> > > > 
>> > > > Signed-off-by: Juan Quintela <quintela@redhat.com>
>> > > 
>> > > It's certainly possible but it's management that created
>> > > this situation after all - why do we bother to enforce
>> > > a policy? It is possible that management will unpause immediately
>> > > afterwards and everything will proceed smoothly.
>> > > 
>> > > Yes migration will not happen until guest is
>> > > unpaused but the same it true of e.g. a guest that is stuck
>> > > because of a bug.
>> > 
>> > That's pretty different behaviour from how migration normally handles
>> > a paused guest, which is that it is guaranteed to complete the migration
>> > in as short a time as network bandwidth allows.
>> > 
>> > Just ignoring the situation I think will lead to surprise apps / admins,
>> > because the person/entity invoking the migration is not likely to have
>> > checked wether this particular guest uses net failover or not before
>> > invoking - they'll just be expecting a paused migration to run fast and
>> > be guaranteed to complete.
>> > 
>> > Regards,
>> > Daniel
>> 
>> Okay I guess. But then shouldn't we handle the reverse situation too:
>> pausing guest after migration started but before device was
>> unplugged?
>> 
>
> Thinking of which, I have no idea how we'd handle it - fail
> pausing guest until migration is cancelled?
>
> All this seems heavy handed to me ...

This is the minimal fix that I can think of.

Further solution would be:
- Add a new migration parameter: migrate-paused
- change libvirt to use the new parameter if it exist
- in qemu, when we do start migration (but after we wait for the unplug
  device) paused the guest before starting migration and resume it after
  migration finish.

My understanding talking with Laine is that they use this functionality
by default for migration, saving, etc, i.e. it is not an isolated case.

Later, Juan.



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
  2020-12-02 10:51           ` Juan Quintela
@ 2020-12-02 10:55             ` Daniel P. Berrangé
  2020-12-02 11:19               ` Michael S. Tsirkin
  0 siblings, 1 reply; 58+ messages in thread
From: Daniel P. Berrangé @ 2020-12-02 10:55 UTC (permalink / raw)
  To: Juan Quintela
  Cc: Eduardo Habkost, Michael S. Tsirkin, Jason Wang,
	Dr. David Alan Gilbert, qemu-devel, Paolo Bonzini

On Wed, Dec 02, 2020 at 11:51:05AM +0100, Juan Quintela wrote:
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > On Wed, Dec 02, 2020 at 05:31:53AM -0500, Michael S. Tsirkin wrote:
> >> On Wed, Dec 02, 2020 at 10:27:18AM +0000, Daniel P. BerrangÃƒÂ© wrote:
> >> > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. Tsirkin wrote:
> >> > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan Quintela wrote:
> >> > > > If we have a paused guest, it can't unplug the network VF device, so
> >> > > > we wait there forever.  Just change the code to give one error on that
> >> > > > case.
> >> > > > 
> >> > > > Signed-off-by: Juan Quintela <quintela@redhat.com>
> >> > > 
> >> > > It's certainly possible but it's management that created
> >> > > this situation after all - why do we bother to enforce
> >> > > a policy? It is possible that management will unpause immediately
> >> > > afterwards and everything will proceed smoothly.
> >> > > 
> >> > > Yes migration will not happen until guest is
> >> > > unpaused but the same it true of e.g. a guest that is stuck
> >> > > because of a bug.
> >> > 
> >> > That's pretty different behaviour from how migration normally handles
> >> > a paused guest, which is that it is guaranteed to complete the migration
> >> > in as short a time as network bandwidth allows.
> >> > 
> >> > Just ignoring the situation I think will lead to surprise apps / admins,
> >> > because the person/entity invoking the migration is not likely to have
> >> > checked wether this particular guest uses net failover or not before
> >> > invoking - they'll just be expecting a paused migration to run fast and
> >> > be guaranteed to complete.
> >> > 
> >> > Regards,
> >> > Daniel
> >> 
> >> Okay I guess. But then shouldn't we handle the reverse situation too:
> >> pausing guest after migration started but before device was
> >> unplugged?
> >> 
> >
> > Thinking of which, I have no idea how we'd handle it - fail
> > pausing guest until migration is cancelled?
> >
> > All this seems heavy handed to me ...
> 
> This is the minimal fix that I can think of.
> 
> Further solution would be:
> - Add a new migration parameter: migrate-paused
> - change libvirt to use the new parameter if it exist
> - in qemu, when we do start migration (but after we wait for the unplug
>   device) paused the guest before starting migration and resume it after
>   migration finish.

It would also have to handle issuing of paused after migration has
been started - delay the pause request until the nuplug is complete
is one answer.
 
> My understanding talking with Laine is that they use this functionality
> by default for migration, saving, etc, i.e. it is not an isolated case.

Yep,  save-to-disk always runs in the paused state, and migration is
also paused by default unless the mgmt app explicitl asks for live
migration.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
  2020-12-02 10:55             ` Daniel P. Berrangé
@ 2020-12-02 11:19               ` Michael S. Tsirkin
  2020-12-02 11:26                 ` Daniel P. Berrangé
  0 siblings, 1 reply; 58+ messages in thread
From: Michael S. Tsirkin @ 2020-12-02 11:19 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Eduardo Habkost, Juan Quintela, Jason Wang,
	Dr. David Alan Gilbert, qemu-devel, Paolo Bonzini

On Wed, Dec 02, 2020 at 10:55:15AM +0000, Daniel P. Berrangé wrote:
> On Wed, Dec 02, 2020 at 11:51:05AM +0100, Juan Quintela wrote:
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > On Wed, Dec 02, 2020 at 05:31:53AM -0500, Michael S. Tsirkin wrote:
> > >> On Wed, Dec 02, 2020 at 10:27:18AM +0000, Daniel P. BerrangÃƒÂ© wrote:
> > >> > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. Tsirkin wrote:
> > >> > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan Quintela wrote:
> > >> > > > If we have a paused guest, it can't unplug the network VF device, so
> > >> > > > we wait there forever.  Just change the code to give one error on that
> > >> > > > case.
> > >> > > > 
> > >> > > > Signed-off-by: Juan Quintela <quintela@redhat.com>
> > >> > > 
> > >> > > It's certainly possible but it's management that created
> > >> > > this situation after all - why do we bother to enforce
> > >> > > a policy? It is possible that management will unpause immediately
> > >> > > afterwards and everything will proceed smoothly.
> > >> > > 
> > >> > > Yes migration will not happen until guest is
> > >> > > unpaused but the same it true of e.g. a guest that is stuck
> > >> > > because of a bug.
> > >> > 
> > >> > That's pretty different behaviour from how migration normally handles
> > >> > a paused guest, which is that it is guaranteed to complete the migration
> > >> > in as short a time as network bandwidth allows.
> > >> > 
> > >> > Just ignoring the situation I think will lead to surprise apps / admins,
> > >> > because the person/entity invoking the migration is not likely to have
> > >> > checked wether this particular guest uses net failover or not before
> > >> > invoking - they'll just be expecting a paused migration to run fast and
> > >> > be guaranteed to complete.
> > >> > 
> > >> > Regards,
> > >> > Daniel
> > >> 
> > >> Okay I guess. But then shouldn't we handle the reverse situation too:
> > >> pausing guest after migration started but before device was
> > >> unplugged?
> > >> 
> > >
> > > Thinking of which, I have no idea how we'd handle it - fail
> > > pausing guest until migration is cancelled?
> > >
> > > All this seems heavy handed to me ...
> > 
> > This is the minimal fix that I can think of.
> > 
> > Further solution would be:
> > - Add a new migration parameter: migrate-paused
> > - change libvirt to use the new parameter if it exist
> > - in qemu, when we do start migration (but after we wait for the unplug
> >   device) paused the guest before starting migration and resume it after
> >   migration finish.
> 
> It would also have to handle issuing of paused after migration has
> been started - delay the pause request until the nuplug is complete
> is one answer.

Hmm my worry would be that pausing is one way to give cpu
resources back to host. It's problematic if guest can delay
that indefinitely.

> > My understanding talking with Laine is that they use this functionality
> > by default for migration, saving, etc, i.e. it is not an isolated case.
> 
> Yep,  save-to-disk always runs in the paused state, and migration is
> also paused by default unless the mgmt app explicitl asks for live
> migration.
> 
> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
  2020-12-02 11:19               ` Michael S. Tsirkin
@ 2020-12-02 11:26                 ` Daniel P. Berrangé
  2020-12-02 11:37                   ` Michael S. Tsirkin
  0 siblings, 1 reply; 58+ messages in thread
From: Daniel P. Berrangé @ 2020-12-02 11:26 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Eduardo Habkost, Juan Quintela, Jason Wang,
	Dr. David Alan Gilbert, qemu-devel, Paolo Bonzini

On Wed, Dec 02, 2020 at 06:19:29AM -0500, Michael S. Tsirkin wrote:
> On Wed, Dec 02, 2020 at 10:55:15AM +0000, Daniel P. Berrangé wrote:
> > On Wed, Dec 02, 2020 at 11:51:05AM +0100, Juan Quintela wrote:
> > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > On Wed, Dec 02, 2020 at 05:31:53AM -0500, Michael S. Tsirkin wrote:
> > > >> On Wed, Dec 02, 2020 at 10:27:18AM +0000, Daniel P. BerrangÃƒÂ© wrote:
> > > >> > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. Tsirkin wrote:
> > > >> > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan Quintela wrote:
> > > >> > > > If we have a paused guest, it can't unplug the network VF device, so
> > > >> > > > we wait there forever.  Just change the code to give one error on that
> > > >> > > > case.
> > > >> > > > 
> > > >> > > > Signed-off-by: Juan Quintela <quintela@redhat.com>
> > > >> > > 
> > > >> > > It's certainly possible but it's management that created
> > > >> > > this situation after all - why do we bother to enforce
> > > >> > > a policy? It is possible that management will unpause immediately
> > > >> > > afterwards and everything will proceed smoothly.
> > > >> > > 
> > > >> > > Yes migration will not happen until guest is
> > > >> > > unpaused but the same it true of e.g. a guest that is stuck
> > > >> > > because of a bug.
> > > >> > 
> > > >> > That's pretty different behaviour from how migration normally handles
> > > >> > a paused guest, which is that it is guaranteed to complete the migration
> > > >> > in as short a time as network bandwidth allows.
> > > >> > 
> > > >> > Just ignoring the situation I think will lead to surprise apps / admins,
> > > >> > because the person/entity invoking the migration is not likely to have
> > > >> > checked wether this particular guest uses net failover or not before
> > > >> > invoking - they'll just be expecting a paused migration to run fast and
> > > >> > be guaranteed to complete.
> > > >> > 
> > > >> > Regards,
> > > >> > Daniel
> > > >> 
> > > >> Okay I guess. But then shouldn't we handle the reverse situation too:
> > > >> pausing guest after migration started but before device was
> > > >> unplugged?
> > > >> 
> > > >
> > > > Thinking of which, I have no idea how we'd handle it - fail
> > > > pausing guest until migration is cancelled?
> > > >
> > > > All this seems heavy handed to me ...
> > > 
> > > This is the minimal fix that I can think of.
> > > 
> > > Further solution would be:
> > > - Add a new migration parameter: migrate-paused
> > > - change libvirt to use the new parameter if it exist
> > > - in qemu, when we do start migration (but after we wait for the unplug
> > >   device) paused the guest before starting migration and resume it after
> > >   migration finish.
> > 
> > It would also have to handle issuing of paused after migration has
> > been started - delay the pause request until the nuplug is complete
> > is one answer.
> 
> Hmm my worry would be that pausing is one way to give cpu
> resources back to host. It's problematic if guest can delay
> that indefinitely.

hmm, yes, that is awkward.  Perhaps we should just report an explicit
error then. In normal cases this won't happen, as unplug will have
easily completed before the mgmt app pauses the running migration.
In broken/malicious guest cases, this at least ives mgmt a heads up
that something is wrong and they might then decide to cancel the
migration.


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
  2020-12-02 11:26                 ` Daniel P. Berrangé
@ 2020-12-02 11:37                   ` Michael S. Tsirkin
  2020-12-02 12:01                     ` Daniel P. Berrangé
  0 siblings, 1 reply; 58+ messages in thread
From: Michael S. Tsirkin @ 2020-12-02 11:37 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Eduardo Habkost, Juan Quintela, Jason Wang,
	Dr. David Alan Gilbert, qemu-devel, Paolo Bonzini

On Wed, Dec 02, 2020 at 11:26:39AM +0000, Daniel P. Berrangé wrote:
> On Wed, Dec 02, 2020 at 06:19:29AM -0500, Michael S. Tsirkin wrote:
> > On Wed, Dec 02, 2020 at 10:55:15AM +0000, Daniel P. Berrangé wrote:
> > > On Wed, Dec 02, 2020 at 11:51:05AM +0100, Juan Quintela wrote:
> > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > > On Wed, Dec 02, 2020 at 05:31:53AM -0500, Michael S. Tsirkin wrote:
> > > > >> On Wed, Dec 02, 2020 at 10:27:18AM +0000, Daniel P. BerrangÃƒÂ© wrote:
> > > > >> > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. Tsirkin wrote:
> > > > >> > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan Quintela wrote:
> > > > >> > > > If we have a paused guest, it can't unplug the network VF device, so
> > > > >> > > > we wait there forever.  Just change the code to give one error on that
> > > > >> > > > case.
> > > > >> > > > 
> > > > >> > > > Signed-off-by: Juan Quintela <quintela@redhat.com>
> > > > >> > > 
> > > > >> > > It's certainly possible but it's management that created
> > > > >> > > this situation after all - why do we bother to enforce
> > > > >> > > a policy? It is possible that management will unpause immediately
> > > > >> > > afterwards and everything will proceed smoothly.
> > > > >> > > 
> > > > >> > > Yes migration will not happen until guest is
> > > > >> > > unpaused but the same it true of e.g. a guest that is stuck
> > > > >> > > because of a bug.
> > > > >> > 
> > > > >> > That's pretty different behaviour from how migration normally handles
> > > > >> > a paused guest, which is that it is guaranteed to complete the migration
> > > > >> > in as short a time as network bandwidth allows.
> > > > >> > 
> > > > >> > Just ignoring the situation I think will lead to surprise apps / admins,
> > > > >> > because the person/entity invoking the migration is not likely to have
> > > > >> > checked wether this particular guest uses net failover or not before
> > > > >> > invoking - they'll just be expecting a paused migration to run fast and
> > > > >> > be guaranteed to complete.
> > > > >> > 
> > > > >> > Regards,
> > > > >> > Daniel
> > > > >> 
> > > > >> Okay I guess. But then shouldn't we handle the reverse situation too:
> > > > >> pausing guest after migration started but before device was
> > > > >> unplugged?
> > > > >> 
> > > > >
> > > > > Thinking of which, I have no idea how we'd handle it - fail
> > > > > pausing guest until migration is cancelled?
> > > > >
> > > > > All this seems heavy handed to me ...
> > > > 
> > > > This is the minimal fix that I can think of.
> > > > 
> > > > Further solution would be:
> > > > - Add a new migration parameter: migrate-paused
> > > > - change libvirt to use the new parameter if it exist
> > > > - in qemu, when we do start migration (but after we wait for the unplug
> > > >   device) paused the guest before starting migration and resume it after
> > > >   migration finish.
> > > 
> > > It would also have to handle issuing of paused after migration has
> > > been started - delay the pause request until the nuplug is complete
> > > is one answer.
> > 
> > Hmm my worry would be that pausing is one way to give cpu
> > resources back to host. It's problematic if guest can delay
> > that indefinitely.
> 
> hmm, yes, that is awkward.  Perhaps we should just report an explicit
> error then.

Report an error in response to which command? Do you mean
fail migration?

> In normal cases this won't happen, as unplug will have
> easily completed before the mgmt app pauses the running migration.
> In broken/malicious guest cases, this at least ives mgmt a heads up
> that something is wrong and they might then decide to cancel the
> migration.
> 
> 
> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
  2020-12-02 11:37                   ` Michael S. Tsirkin
@ 2020-12-02 12:01                     ` Daniel P. Berrangé
  2020-12-03 11:21                       ` Michael S. Tsirkin
  2020-12-08 18:32                       ` Michael S. Tsirkin
  0 siblings, 2 replies; 58+ messages in thread
From: Daniel P. Berrangé @ 2020-12-02 12:01 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Eduardo Habkost, Juan Quintela, Jason Wang,
	Dr. David Alan Gilbert, qemu-devel, Paolo Bonzini

On Wed, Dec 02, 2020 at 06:37:46AM -0500, Michael S. Tsirkin wrote:
> On Wed, Dec 02, 2020 at 11:26:39AM +0000, Daniel P. Berrangé wrote:
> > On Wed, Dec 02, 2020 at 06:19:29AM -0500, Michael S. Tsirkin wrote:
> > > On Wed, Dec 02, 2020 at 10:55:15AM +0000, Daniel P. Berrangé wrote:
> > > > On Wed, Dec 02, 2020 at 11:51:05AM +0100, Juan Quintela wrote:
> > > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > > > On Wed, Dec 02, 2020 at 05:31:53AM -0500, Michael S. Tsirkin wrote:
> > > > > >> On Wed, Dec 02, 2020 at 10:27:18AM +0000, Daniel P. BerrangÃƒÂ© wrote:
> > > > > >> > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. Tsirkin wrote:
> > > > > >> > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan Quintela wrote:
> > > > > >> > > > If we have a paused guest, it can't unplug the network VF device, so
> > > > > >> > > > we wait there forever.  Just change the code to give one error on that
> > > > > >> > > > case.
> > > > > >> > > > 
> > > > > >> > > > Signed-off-by: Juan Quintela <quintela@redhat.com>
> > > > > >> > > 
> > > > > >> > > It's certainly possible but it's management that created
> > > > > >> > > this situation after all - why do we bother to enforce
> > > > > >> > > a policy? It is possible that management will unpause immediately
> > > > > >> > > afterwards and everything will proceed smoothly.
> > > > > >> > > 
> > > > > >> > > Yes migration will not happen until guest is
> > > > > >> > > unpaused but the same it true of e.g. a guest that is stuck
> > > > > >> > > because of a bug.
> > > > > >> > 
> > > > > >> > That's pretty different behaviour from how migration normally handles
> > > > > >> > a paused guest, which is that it is guaranteed to complete the migration
> > > > > >> > in as short a time as network bandwidth allows.
> > > > > >> > 
> > > > > >> > Just ignoring the situation I think will lead to surprise apps / admins,
> > > > > >> > because the person/entity invoking the migration is not likely to have
> > > > > >> > checked wether this particular guest uses net failover or not before
> > > > > >> > invoking - they'll just be expecting a paused migration to run fast and
> > > > > >> > be guaranteed to complete.
> > > > > >> > 
> > > > > >> > Regards,
> > > > > >> > Daniel
> > > > > >> 
> > > > > >> Okay I guess. But then shouldn't we handle the reverse situation too:
> > > > > >> pausing guest after migration started but before device was
> > > > > >> unplugged?
> > > > > >> 
> > > > > >
> > > > > > Thinking of which, I have no idea how we'd handle it - fail
> > > > > > pausing guest until migration is cancelled?
> > > > > >
> > > > > > All this seems heavy handed to me ...
> > > > > 
> > > > > This is the minimal fix that I can think of.
> > > > > 
> > > > > Further solution would be:
> > > > > - Add a new migration parameter: migrate-paused
> > > > > - change libvirt to use the new parameter if it exist
> > > > > - in qemu, when we do start migration (but after we wait for the unplug
> > > > >   device) paused the guest before starting migration and resume it after
> > > > >   migration finish.
> > > > 
> > > > It would also have to handle issuing of paused after migration has
> > > > been started - delay the pause request until the nuplug is complete
> > > > is one answer.
> > > 
> > > Hmm my worry would be that pausing is one way to give cpu
> > > resources back to host. It's problematic if guest can delay
> > > that indefinitely.
> > 
> > hmm, yes, that is awkward.  Perhaps we should just report an explicit
> > error then.
> 
> Report an error in response to which command? Do you mean
> fail migration?

If mgt attempt to pause an existing migration that hasn't finished
the PCI unplug stage, then fail the pause request.

> 
> > In normal cases this won't happen, as unplug will have
> > easily completed before the mgmt app pauses the running migration.
> > In broken/malicious guest cases, this at least ives mgmt a heads up
> > that something is wrong and they might then decide to cancel the
> > migration.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
  2020-12-02 12:01                     ` Daniel P. Berrangé
@ 2020-12-03 11:21                       ` Michael S. Tsirkin
  2020-12-03 11:32                         ` Daniel P. Berrangé
  2020-12-08 18:32                       ` Michael S. Tsirkin
  1 sibling, 1 reply; 58+ messages in thread
From: Michael S. Tsirkin @ 2020-12-03 11:21 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Eduardo Habkost, Juan Quintela, Jason Wang,
	Dr. David Alan Gilbert, qemu-devel, Paolo Bonzini

On Wed, Dec 02, 2020 at 12:01:21PM +0000, Daniel P. Berrangé wrote:
> On Wed, Dec 02, 2020 at 06:37:46AM -0500, Michael S. Tsirkin wrote:
> > On Wed, Dec 02, 2020 at 11:26:39AM +0000, Daniel P. Berrangé wrote:
> > > On Wed, Dec 02, 2020 at 06:19:29AM -0500, Michael S. Tsirkin wrote:
> > > > On Wed, Dec 02, 2020 at 10:55:15AM +0000, Daniel P. Berrangé wrote:
> > > > > On Wed, Dec 02, 2020 at 11:51:05AM +0100, Juan Quintela wrote:
> > > > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > > > > On Wed, Dec 02, 2020 at 05:31:53AM -0500, Michael S. Tsirkin wrote:
> > > > > > >> On Wed, Dec 02, 2020 at 10:27:18AM +0000, Daniel P. BerrangÃƒÂ© wrote:
> > > > > > >> > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. Tsirkin wrote:
> > > > > > >> > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan Quintela wrote:
> > > > > > >> > > > If we have a paused guest, it can't unplug the network VF device, so
> > > > > > >> > > > we wait there forever.  Just change the code to give one error on that
> > > > > > >> > > > case.
> > > > > > >> > > > 
> > > > > > >> > > > Signed-off-by: Juan Quintela <quintela@redhat.com>
> > > > > > >> > > 
> > > > > > >> > > It's certainly possible but it's management that created
> > > > > > >> > > this situation after all - why do we bother to enforce
> > > > > > >> > > a policy? It is possible that management will unpause immediately
> > > > > > >> > > afterwards and everything will proceed smoothly.
> > > > > > >> > > 
> > > > > > >> > > Yes migration will not happen until guest is
> > > > > > >> > > unpaused but the same it true of e.g. a guest that is stuck
> > > > > > >> > > because of a bug.
> > > > > > >> > 
> > > > > > >> > That's pretty different behaviour from how migration normally handles
> > > > > > >> > a paused guest, which is that it is guaranteed to complete the migration
> > > > > > >> > in as short a time as network bandwidth allows.
> > > > > > >> > 
> > > > > > >> > Just ignoring the situation I think will lead to surprise apps / admins,
> > > > > > >> > because the person/entity invoking the migration is not likely to have
> > > > > > >> > checked wether this particular guest uses net failover or not before
> > > > > > >> > invoking - they'll just be expecting a paused migration to run fast and
> > > > > > >> > be guaranteed to complete.
> > > > > > >> > 
> > > > > > >> > Regards,
> > > > > > >> > Daniel
> > > > > > >> 
> > > > > > >> Okay I guess. But then shouldn't we handle the reverse situation too:
> > > > > > >> pausing guest after migration started but before device was
> > > > > > >> unplugged?
> > > > > > >> 
> > > > > > >
> > > > > > > Thinking of which, I have no idea how we'd handle it - fail
> > > > > > > pausing guest until migration is cancelled?
> > > > > > >
> > > > > > > All this seems heavy handed to me ...
> > > > > > 
> > > > > > This is the minimal fix that I can think of.
> > > > > > 
> > > > > > Further solution would be:
> > > > > > - Add a new migration parameter: migrate-paused
> > > > > > - change libvirt to use the new parameter if it exist
> > > > > > - in qemu, when we do start migration (but after we wait for the unplug
> > > > > >   device) paused the guest before starting migration and resume it after
> > > > > >   migration finish.
> > > > > 
> > > > > It would also have to handle issuing of paused after migration has
> > > > > been started - delay the pause request until the nuplug is complete
> > > > > is one answer.
> > > > 
> > > > Hmm my worry would be that pausing is one way to give cpu
> > > > resources back to host. It's problematic if guest can delay
> > > > that indefinitely.
> > > 
> > > hmm, yes, that is awkward.  Perhaps we should just report an explicit
> > > error then.
> > 
> > Report an error in response to which command? Do you mean
> > fail migration?
> 
> If mgt attempt to pause an existing migration that hasn't finished
> the PCI unplug stage, then fail the pause request.

Pause guest not migration ...
Might be tricky ...

Let me ask this, why not just produce a warning
that migration wan't finish until guest actually runs?
User will then know and unpause the guest when he wants
migration to succeed ...


For example, user can restrict the amount of cpu
using cgroups to a level where almost no progress
is made. QEMU can't detect this ....



> > 
> > > In normal cases this won't happen, as unplug will have
> > > easily completed before the mgmt app pauses the running migration.
> > > In broken/malicious guest cases, this at least ives mgmt a heads up
> > > that something is wrong and they might then decide to cancel the
> > > migration.
> 
> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
  2020-12-03 11:21                       ` Michael S. Tsirkin
@ 2020-12-03 11:32                         ` Daniel P. Berrangé
  2020-12-03 11:40                           ` Michael S. Tsirkin
  0 siblings, 1 reply; 58+ messages in thread
From: Daniel P. Berrangé @ 2020-12-03 11:32 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Eduardo Habkost, Juan Quintela, Jason Wang, qemu-devel,
	Dr. David Alan Gilbert, Paolo Bonzini

On Thu, Dec 03, 2020 at 06:21:47AM -0500, Michael S. Tsirkin wrote:
> On Wed, Dec 02, 2020 at 12:01:21PM +0000, Daniel P. Berrangé wrote:
> > On Wed, Dec 02, 2020 at 06:37:46AM -0500, Michael S. Tsirkin wrote:
> > > On Wed, Dec 02, 2020 at 11:26:39AM +0000, Daniel P. Berrangé wrote:
> > > > On Wed, Dec 02, 2020 at 06:19:29AM -0500, Michael S. Tsirkin wrote:
> > > > > On Wed, Dec 02, 2020 at 10:55:15AM +0000, Daniel P. Berrangé wrote:
> > > > > > On Wed, Dec 02, 2020 at 11:51:05AM +0100, Juan Quintela wrote:
> > > > > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > > > > > On Wed, Dec 02, 2020 at 05:31:53AM -0500, Michael S. Tsirkin wrote:
> > > > > > > >> On Wed, Dec 02, 2020 at 10:27:18AM +0000, Daniel P. BerrangÃƒÂ© wrote:
> > > > > > > >> > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. Tsirkin wrote:
> > > > > > > >> > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan Quintela wrote:
> > > > > > > >> > > > If we have a paused guest, it can't unplug the network VF device, so
> > > > > > > >> > > > we wait there forever.  Just change the code to give one error on that
> > > > > > > >> > > > case.
> > > > > > > >> > > > 
> > > > > > > >> > > > Signed-off-by: Juan Quintela <quintela@redhat.com>
> > > > > > > >> > > 
> > > > > > > >> > > It's certainly possible but it's management that created
> > > > > > > >> > > this situation after all - why do we bother to enforce
> > > > > > > >> > > a policy? It is possible that management will unpause immediately
> > > > > > > >> > > afterwards and everything will proceed smoothly.
> > > > > > > >> > > 
> > > > > > > >> > > Yes migration will not happen until guest is
> > > > > > > >> > > unpaused but the same it true of e.g. a guest that is stuck
> > > > > > > >> > > because of a bug.
> > > > > > > >> > 
> > > > > > > >> > That's pretty different behaviour from how migration normally handles
> > > > > > > >> > a paused guest, which is that it is guaranteed to complete the migration
> > > > > > > >> > in as short a time as network bandwidth allows.
> > > > > > > >> > 
> > > > > > > >> > Just ignoring the situation I think will lead to surprise apps / admins,
> > > > > > > >> > because the person/entity invoking the migration is not likely to have
> > > > > > > >> > checked wether this particular guest uses net failover or not before
> > > > > > > >> > invoking - they'll just be expecting a paused migration to run fast and
> > > > > > > >> > be guaranteed to complete.
> > > > > > > >> > 
> > > > > > > >> > Regards,
> > > > > > > >> > Daniel
> > > > > > > >> 
> > > > > > > >> Okay I guess. But then shouldn't we handle the reverse situation too:
> > > > > > > >> pausing guest after migration started but before device was
> > > > > > > >> unplugged?
> > > > > > > >> 
> > > > > > > >
> > > > > > > > Thinking of which, I have no idea how we'd handle it - fail
> > > > > > > > pausing guest until migration is cancelled?
> > > > > > > >
> > > > > > > > All this seems heavy handed to me ...
> > > > > > > 
> > > > > > > This is the minimal fix that I can think of.
> > > > > > > 
> > > > > > > Further solution would be:
> > > > > > > - Add a new migration parameter: migrate-paused
> > > > > > > - change libvirt to use the new parameter if it exist
> > > > > > > - in qemu, when we do start migration (but after we wait for the unplug
> > > > > > >   device) paused the guest before starting migration and resume it after
> > > > > > >   migration finish.
> > > > > > 
> > > > > > It would also have to handle issuing of paused after migration has
> > > > > > been started - delay the pause request until the nuplug is complete
> > > > > > is one answer.
> > > > > 
> > > > > Hmm my worry would be that pausing is one way to give cpu
> > > > > resources back to host. It's problematic if guest can delay
> > > > > that indefinitely.
> > > > 
> > > > hmm, yes, that is awkward.  Perhaps we should just report an explicit
> > > > error then.
> > > 
> > > Report an error in response to which command? Do you mean
> > > fail migration?
> > 
> > If mgt attempt to pause an existing migration that hasn't finished
> > the PCI unplug stage, then fail the pause request.
> 
> Pause guest not migration ...
> Might be tricky ...
> 
> Let me ask this, why not just produce a warning
> that migration wan't finish until guest actually runs?
> User will then know and unpause the guest when he wants
> migration to succeed ...

A warning is going to be essentally invisible if the pause command
succeeeds. 

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
  2020-12-03 11:32                         ` Daniel P. Berrangé
@ 2020-12-03 11:40                           ` Michael S. Tsirkin
  2020-12-03 11:43                             ` Dr. David Alan Gilbert
  2020-12-03 11:45                             ` Daniel P. Berrangé
  0 siblings, 2 replies; 58+ messages in thread
From: Michael S. Tsirkin @ 2020-12-03 11:40 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Eduardo Habkost, Juan Quintela, Jason Wang, qemu-devel,
	Dr. David Alan Gilbert, Paolo Bonzini

On Thu, Dec 03, 2020 at 11:32:53AM +0000, Daniel P. Berrangé wrote:
> On Thu, Dec 03, 2020 at 06:21:47AM -0500, Michael S. Tsirkin wrote:
> > On Wed, Dec 02, 2020 at 12:01:21PM +0000, Daniel P. Berrangé wrote:
> > > On Wed, Dec 02, 2020 at 06:37:46AM -0500, Michael S. Tsirkin wrote:
> > > > On Wed, Dec 02, 2020 at 11:26:39AM +0000, Daniel P. Berrangé wrote:
> > > > > On Wed, Dec 02, 2020 at 06:19:29AM -0500, Michael S. Tsirkin wrote:
> > > > > > On Wed, Dec 02, 2020 at 10:55:15AM +0000, Daniel P. Berrangé wrote:
> > > > > > > On Wed, Dec 02, 2020 at 11:51:05AM +0100, Juan Quintela wrote:
> > > > > > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > > > > > > On Wed, Dec 02, 2020 at 05:31:53AM -0500, Michael S. Tsirkin wrote:
> > > > > > > > >> On Wed, Dec 02, 2020 at 10:27:18AM +0000, Daniel P. BerrangÃƒÂ© wrote:
> > > > > > > > >> > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. Tsirkin wrote:
> > > > > > > > >> > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan Quintela wrote:
> > > > > > > > >> > > > If we have a paused guest, it can't unplug the network VF device, so
> > > > > > > > >> > > > we wait there forever.  Just change the code to give one error on that
> > > > > > > > >> > > > case.
> > > > > > > > >> > > > 
> > > > > > > > >> > > > Signed-off-by: Juan Quintela <quintela@redhat.com>
> > > > > > > > >> > > 
> > > > > > > > >> > > It's certainly possible but it's management that created
> > > > > > > > >> > > this situation after all - why do we bother to enforce
> > > > > > > > >> > > a policy? It is possible that management will unpause immediately
> > > > > > > > >> > > afterwards and everything will proceed smoothly.
> > > > > > > > >> > > 
> > > > > > > > >> > > Yes migration will not happen until guest is
> > > > > > > > >> > > unpaused but the same it true of e.g. a guest that is stuck
> > > > > > > > >> > > because of a bug.
> > > > > > > > >> > 
> > > > > > > > >> > That's pretty different behaviour from how migration normally handles
> > > > > > > > >> > a paused guest, which is that it is guaranteed to complete the migration
> > > > > > > > >> > in as short a time as network bandwidth allows.
> > > > > > > > >> > 
> > > > > > > > >> > Just ignoring the situation I think will lead to surprise apps / admins,
> > > > > > > > >> > because the person/entity invoking the migration is not likely to have
> > > > > > > > >> > checked wether this particular guest uses net failover or not before
> > > > > > > > >> > invoking - they'll just be expecting a paused migration to run fast and
> > > > > > > > >> > be guaranteed to complete.
> > > > > > > > >> > 
> > > > > > > > >> > Regards,
> > > > > > > > >> > Daniel
> > > > > > > > >> 
> > > > > > > > >> Okay I guess. But then shouldn't we handle the reverse situation too:
> > > > > > > > >> pausing guest after migration started but before device was
> > > > > > > > >> unplugged?
> > > > > > > > >> 
> > > > > > > > >
> > > > > > > > > Thinking of which, I have no idea how we'd handle it - fail
> > > > > > > > > pausing guest until migration is cancelled?
> > > > > > > > >
> > > > > > > > > All this seems heavy handed to me ...
> > > > > > > > 
> > > > > > > > This is the minimal fix that I can think of.
> > > > > > > > 
> > > > > > > > Further solution would be:
> > > > > > > > - Add a new migration parameter: migrate-paused
> > > > > > > > - change libvirt to use the new parameter if it exist
> > > > > > > > - in qemu, when we do start migration (but after we wait for the unplug
> > > > > > > >   device) paused the guest before starting migration and resume it after
> > > > > > > >   migration finish.
> > > > > > > 
> > > > > > > It would also have to handle issuing of paused after migration has
> > > > > > > been started - delay the pause request until the nuplug is complete
> > > > > > > is one answer.
> > > > > > 
> > > > > > Hmm my worry would be that pausing is one way to give cpu
> > > > > > resources back to host. It's problematic if guest can delay
> > > > > > that indefinitely.
> > > > > 
> > > > > hmm, yes, that is awkward.  Perhaps we should just report an explicit
> > > > > error then.
> > > > 
> > > > Report an error in response to which command? Do you mean
> > > > fail migration?
> > > 
> > > If mgt attempt to pause an existing migration that hasn't finished
> > > the PCI unplug stage, then fail the pause request.
> > 
> > Pause guest not migration ...
> > Might be tricky ...
> > 
> > Let me ask this, why not just produce a warning
> > that migration wan't finish until guest actually runs?
> > User will then know and unpause the guest when he wants
> > migration to succeed ...
> 
> A warning is going to be essentally invisible if the pause command
> succeeeds. 
> 
> Regards,
> Daniel

I mean the situation here isn't earth shattering, an admin
created it. Maybe he will unpause shortly
and all will be well ...

How about we make it possible for admin to detect that the
reason for migration not making progress is that it is
waiting for unplug? And maybe that guest is paused too?

I just don't see how we can detect all cases and I am not
sure it is worth it to try and detect only some of them,
making users think they can rely on command failure to
detect them.


> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
  2020-12-03 11:40                           ` Michael S. Tsirkin
@ 2020-12-03 11:43                             ` Dr. David Alan Gilbert
  2020-12-03 12:02                               ` Michael S. Tsirkin
  2020-12-03 12:11                               ` Michael S. Tsirkin
  2020-12-03 11:45                             ` Daniel P. Berrangé
  1 sibling, 2 replies; 58+ messages in thread
From: Dr. David Alan Gilbert @ 2020-12-03 11:43 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Daniel P. Berrangé, Eduardo Habkost, Juan Quintela,
	Jason Wang, qemu-devel, Paolo Bonzini

* Michael S. Tsirkin (mst@redhat.com) wrote:
> On Thu, Dec 03, 2020 at 11:32:53AM +0000, Daniel P. Berrangé wrote:
> > On Thu, Dec 03, 2020 at 06:21:47AM -0500, Michael S. Tsirkin wrote:
> > > On Wed, Dec 02, 2020 at 12:01:21PM +0000, Daniel P. Berrangé wrote:
> > > > On Wed, Dec 02, 2020 at 06:37:46AM -0500, Michael S. Tsirkin wrote:
> > > > > On Wed, Dec 02, 2020 at 11:26:39AM +0000, Daniel P. Berrangé wrote:
> > > > > > On Wed, Dec 02, 2020 at 06:19:29AM -0500, Michael S. Tsirkin wrote:
> > > > > > > On Wed, Dec 02, 2020 at 10:55:15AM +0000, Daniel P. Berrangé wrote:
> > > > > > > > On Wed, Dec 02, 2020 at 11:51:05AM +0100, Juan Quintela wrote:
> > > > > > > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > > > > > > > On Wed, Dec 02, 2020 at 05:31:53AM -0500, Michael S. Tsirkin wrote:
> > > > > > > > > >> On Wed, Dec 02, 2020 at 10:27:18AM +0000, Daniel P. BerrangÃƒÂ© wrote:
> > > > > > > > > >> > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. Tsirkin wrote:
> > > > > > > > > >> > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan Quintela wrote:
> > > > > > > > > >> > > > If we have a paused guest, it can't unplug the network VF device, so
> > > > > > > > > >> > > > we wait there forever.  Just change the code to give one error on that
> > > > > > > > > >> > > > case.
> > > > > > > > > >> > > > 
> > > > > > > > > >> > > > Signed-off-by: Juan Quintela <quintela@redhat.com>
> > > > > > > > > >> > > 
> > > > > > > > > >> > > It's certainly possible but it's management that created
> > > > > > > > > >> > > this situation after all - why do we bother to enforce
> > > > > > > > > >> > > a policy? It is possible that management will unpause immediately
> > > > > > > > > >> > > afterwards and everything will proceed smoothly.
> > > > > > > > > >> > > 
> > > > > > > > > >> > > Yes migration will not happen until guest is
> > > > > > > > > >> > > unpaused but the same it true of e.g. a guest that is stuck
> > > > > > > > > >> > > because of a bug.
> > > > > > > > > >> > 
> > > > > > > > > >> > That's pretty different behaviour from how migration normally handles
> > > > > > > > > >> > a paused guest, which is that it is guaranteed to complete the migration
> > > > > > > > > >> > in as short a time as network bandwidth allows.
> > > > > > > > > >> > 
> > > > > > > > > >> > Just ignoring the situation I think will lead to surprise apps / admins,
> > > > > > > > > >> > because the person/entity invoking the migration is not likely to have
> > > > > > > > > >> > checked wether this particular guest uses net failover or not before
> > > > > > > > > >> > invoking - they'll just be expecting a paused migration to run fast and
> > > > > > > > > >> > be guaranteed to complete.
> > > > > > > > > >> > 
> > > > > > > > > >> > Regards,
> > > > > > > > > >> > Daniel
> > > > > > > > > >> 
> > > > > > > > > >> Okay I guess. But then shouldn't we handle the reverse situation too:
> > > > > > > > > >> pausing guest after migration started but before device was
> > > > > > > > > >> unplugged?
> > > > > > > > > >> 
> > > > > > > > > >
> > > > > > > > > > Thinking of which, I have no idea how we'd handle it - fail
> > > > > > > > > > pausing guest until migration is cancelled?
> > > > > > > > > >
> > > > > > > > > > All this seems heavy handed to me ...
> > > > > > > > > 
> > > > > > > > > This is the minimal fix that I can think of.
> > > > > > > > > 
> > > > > > > > > Further solution would be:
> > > > > > > > > - Add a new migration parameter: migrate-paused
> > > > > > > > > - change libvirt to use the new parameter if it exist
> > > > > > > > > - in qemu, when we do start migration (but after we wait for the unplug
> > > > > > > > >   device) paused the guest before starting migration and resume it after
> > > > > > > > >   migration finish.
> > > > > > > > 
> > > > > > > > It would also have to handle issuing of paused after migration has
> > > > > > > > been started - delay the pause request until the nuplug is complete
> > > > > > > > is one answer.
> > > > > > > 
> > > > > > > Hmm my worry would be that pausing is one way to give cpu
> > > > > > > resources back to host. It's problematic if guest can delay
> > > > > > > that indefinitely.
> > > > > > 
> > > > > > hmm, yes, that is awkward.  Perhaps we should just report an explicit
> > > > > > error then.
> > > > > 
> > > > > Report an error in response to which command? Do you mean
> > > > > fail migration?
> > > > 
> > > > If mgt attempt to pause an existing migration that hasn't finished
> > > > the PCI unplug stage, then fail the pause request.
> > > 
> > > Pause guest not migration ...
> > > Might be tricky ...
> > > 
> > > Let me ask this, why not just produce a warning
> > > that migration wan't finish until guest actually runs?
> > > User will then know and unpause the guest when he wants
> > > migration to succeed ...
> > 
> > A warning is going to be essentally invisible if the pause command
> > succeeeds. 
> > 
> > Regards,
> > Daniel
> 
> I mean the situation here isn't earth shattering, an admin
> created it. Maybe he will unpause shortly
> and all will be well ...
> 
> How about we make it possible for admin to detect that the
> reason for migration not making progress is that it is
> waiting for unplug? And maybe that guest is paused too?

We already know that from the state of the VM.

> I just don't see how we can detect all cases and I am not
> sure it is worth it to try and detect only some of them,
> making users think they can rely on command failure to
> detect them.

Another way to solve this would be to remove the unplugging from the
migration layer and leave it as a problem for the management layer to do
the unplug.

Dave

> 
> > -- 
> > |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> > |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> > |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
  2020-12-03 11:43                             ` Dr. David Alan Gilbert
@ 2020-12-03 12:02                               ` Michael S. Tsirkin
  2020-12-03 12:04                                 ` Dr. David Alan Gilbert
  2020-12-03 12:11                               ` Michael S. Tsirkin
  1 sibling, 1 reply; 58+ messages in thread
From: Michael S. Tsirkin @ 2020-12-03 12:02 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Daniel P. Berrangé, Eduardo Habkost, Juan Quintela,
	Jason Wang, qemu-devel, Paolo Bonzini

On Thu, Dec 03, 2020 at 11:43:41AM +0000, Dr. David Alan Gilbert wrote:
> * Michael S. Tsirkin (mst@redhat.com) wrote:
> > On Thu, Dec 03, 2020 at 11:32:53AM +0000, Daniel P. Berrangé wrote:
> > > On Thu, Dec 03, 2020 at 06:21:47AM -0500, Michael S. Tsirkin wrote:
> > > > On Wed, Dec 02, 2020 at 12:01:21PM +0000, Daniel P. Berrangé wrote:
> > > > > On Wed, Dec 02, 2020 at 06:37:46AM -0500, Michael S. Tsirkin wrote:
> > > > > > On Wed, Dec 02, 2020 at 11:26:39AM +0000, Daniel P. Berrangé wrote:
> > > > > > > On Wed, Dec 02, 2020 at 06:19:29AM -0500, Michael S. Tsirkin wrote:
> > > > > > > > On Wed, Dec 02, 2020 at 10:55:15AM +0000, Daniel P. Berrangé wrote:
> > > > > > > > > On Wed, Dec 02, 2020 at 11:51:05AM +0100, Juan Quintela wrote:
> > > > > > > > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > > > > > > > > On Wed, Dec 02, 2020 at 05:31:53AM -0500, Michael S. Tsirkin wrote:
> > > > > > > > > > >> On Wed, Dec 02, 2020 at 10:27:18AM +0000, Daniel P. BerrangÃƒÂ© wrote:
> > > > > > > > > > >> > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. Tsirkin wrote:
> > > > > > > > > > >> > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan Quintela wrote:
> > > > > > > > > > >> > > > If we have a paused guest, it can't unplug the network VF device, so
> > > > > > > > > > >> > > > we wait there forever.  Just change the code to give one error on that
> > > > > > > > > > >> > > > case.
> > > > > > > > > > >> > > > 
> > > > > > > > > > >> > > > Signed-off-by: Juan Quintela <quintela@redhat.com>
> > > > > > > > > > >> > > 
> > > > > > > > > > >> > > It's certainly possible but it's management that created
> > > > > > > > > > >> > > this situation after all - why do we bother to enforce
> > > > > > > > > > >> > > a policy? It is possible that management will unpause immediately
> > > > > > > > > > >> > > afterwards and everything will proceed smoothly.
> > > > > > > > > > >> > > 
> > > > > > > > > > >> > > Yes migration will not happen until guest is
> > > > > > > > > > >> > > unpaused but the same it true of e.g. a guest that is stuck
> > > > > > > > > > >> > > because of a bug.
> > > > > > > > > > >> > 
> > > > > > > > > > >> > That's pretty different behaviour from how migration normally handles
> > > > > > > > > > >> > a paused guest, which is that it is guaranteed to complete the migration
> > > > > > > > > > >> > in as short a time as network bandwidth allows.
> > > > > > > > > > >> > 
> > > > > > > > > > >> > Just ignoring the situation I think will lead to surprise apps / admins,
> > > > > > > > > > >> > because the person/entity invoking the migration is not likely to have
> > > > > > > > > > >> > checked wether this particular guest uses net failover or not before
> > > > > > > > > > >> > invoking - they'll just be expecting a paused migration to run fast and
> > > > > > > > > > >> > be guaranteed to complete.
> > > > > > > > > > >> > 
> > > > > > > > > > >> > Regards,
> > > > > > > > > > >> > Daniel
> > > > > > > > > > >> 
> > > > > > > > > > >> Okay I guess. But then shouldn't we handle the reverse situation too:
> > > > > > > > > > >> pausing guest after migration started but before device was
> > > > > > > > > > >> unplugged?
> > > > > > > > > > >> 
> > > > > > > > > > >
> > > > > > > > > > > Thinking of which, I have no idea how we'd handle it - fail
> > > > > > > > > > > pausing guest until migration is cancelled?
> > > > > > > > > > >
> > > > > > > > > > > All this seems heavy handed to me ...
> > > > > > > > > > 
> > > > > > > > > > This is the minimal fix that I can think of.
> > > > > > > > > > 
> > > > > > > > > > Further solution would be:
> > > > > > > > > > - Add a new migration parameter: migrate-paused
> > > > > > > > > > - change libvirt to use the new parameter if it exist
> > > > > > > > > > - in qemu, when we do start migration (but after we wait for the unplug
> > > > > > > > > >   device) paused the guest before starting migration and resume it after
> > > > > > > > > >   migration finish.
> > > > > > > > > 
> > > > > > > > > It would also have to handle issuing of paused after migration has
> > > > > > > > > been started - delay the pause request until the nuplug is complete
> > > > > > > > > is one answer.
> > > > > > > > 
> > > > > > > > Hmm my worry would be that pausing is one way to give cpu
> > > > > > > > resources back to host. It's problematic if guest can delay
> > > > > > > > that indefinitely.
> > > > > > > 
> > > > > > > hmm, yes, that is awkward.  Perhaps we should just report an explicit
> > > > > > > error then.
> > > > > > 
> > > > > > Report an error in response to which command? Do you mean
> > > > > > fail migration?
> > > > > 
> > > > > If mgt attempt to pause an existing migration that hasn't finished
> > > > > the PCI unplug stage, then fail the pause request.
> > > > 
> > > > Pause guest not migration ...
> > > > Might be tricky ...
> > > > 
> > > > Let me ask this, why not just produce a warning
> > > > that migration wan't finish until guest actually runs?
> > > > User will then know and unpause the guest when he wants
> > > > migration to succeed ...
> > > 
> > > A warning is going to be essentally invisible if the pause command
> > > succeeeds. 
> > > 
> > > Regards,
> > > Daniel
> > 
> > I mean the situation here isn't earth shattering, an admin
> > created it. Maybe he will unpause shortly
> > and all will be well ...
> > 
> > How about we make it possible for admin to detect that the
> > reason for migration not making progress is that it is
> > waiting for unplug? And maybe that guest is paused too?
> 
> We already know that from the state of the VM.

You don't know that migration is waiting for the guest
action, no.

This is what we care about here right?


> > I just don't see how we can detect all cases and I am not
> > sure it is worth it to try and detect only some of them,
> > making users think they can rely on command failure to
> > detect them.
> 
> Another way to solve this would be to remove the unplugging from the
> migration layer and leave it as a problem for the management layer to do
> the unplug.
> 
> Dave
> 
> > 
> > > -- 
> > > |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> > > |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> > > |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
> > 
> -- 
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
  2020-12-03 12:02                               ` Michael S. Tsirkin
@ 2020-12-03 12:04                                 ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 58+ messages in thread
From: Dr. David Alan Gilbert @ 2020-12-03 12:04 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Daniel P. Berrangé, Eduardo Habkost, Juan Quintela,
	Jason Wang, qemu-devel, Paolo Bonzini

* Michael S. Tsirkin (mst@redhat.com) wrote:
> On Thu, Dec 03, 2020 at 11:43:41AM +0000, Dr. David Alan Gilbert wrote:
> > * Michael S. Tsirkin (mst@redhat.com) wrote:
> > > On Thu, Dec 03, 2020 at 11:32:53AM +0000, Daniel P. Berrangé wrote:
> > > > On Thu, Dec 03, 2020 at 06:21:47AM -0500, Michael S. Tsirkin wrote:
> > > > > On Wed, Dec 02, 2020 at 12:01:21PM +0000, Daniel P. Berrangé wrote:
> > > > > > On Wed, Dec 02, 2020 at 06:37:46AM -0500, Michael S. Tsirkin wrote:
> > > > > > > On Wed, Dec 02, 2020 at 11:26:39AM +0000, Daniel P. Berrangé wrote:
> > > > > > > > On Wed, Dec 02, 2020 at 06:19:29AM -0500, Michael S. Tsirkin wrote:
> > > > > > > > > On Wed, Dec 02, 2020 at 10:55:15AM +0000, Daniel P. Berrangé wrote:
> > > > > > > > > > On Wed, Dec 02, 2020 at 11:51:05AM +0100, Juan Quintela wrote:
> > > > > > > > > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > > > > > > > > > On Wed, Dec 02, 2020 at 05:31:53AM -0500, Michael S. Tsirkin wrote:
> > > > > > > > > > > >> On Wed, Dec 02, 2020 at 10:27:18AM +0000, Daniel P. BerrangÃƒÂ© wrote:
> > > > > > > > > > > >> > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. Tsirkin wrote:
> > > > > > > > > > > >> > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan Quintela wrote:
> > > > > > > > > > > >> > > > If we have a paused guest, it can't unplug the network VF device, so
> > > > > > > > > > > >> > > > we wait there forever.  Just change the code to give one error on that
> > > > > > > > > > > >> > > > case.
> > > > > > > > > > > >> > > > 
> > > > > > > > > > > >> > > > Signed-off-by: Juan Quintela <quintela@redhat.com>
> > > > > > > > > > > >> > > 
> > > > > > > > > > > >> > > It's certainly possible but it's management that created
> > > > > > > > > > > >> > > this situation after all - why do we bother to enforce
> > > > > > > > > > > >> > > a policy? It is possible that management will unpause immediately
> > > > > > > > > > > >> > > afterwards and everything will proceed smoothly.
> > > > > > > > > > > >> > > 
> > > > > > > > > > > >> > > Yes migration will not happen until guest is
> > > > > > > > > > > >> > > unpaused but the same it true of e.g. a guest that is stuck
> > > > > > > > > > > >> > > because of a bug.
> > > > > > > > > > > >> > 
> > > > > > > > > > > >> > That's pretty different behaviour from how migration normally handles
> > > > > > > > > > > >> > a paused guest, which is that it is guaranteed to complete the migration
> > > > > > > > > > > >> > in as short a time as network bandwidth allows.
> > > > > > > > > > > >> > 
> > > > > > > > > > > >> > Just ignoring the situation I think will lead to surprise apps / admins,
> > > > > > > > > > > >> > because the person/entity invoking the migration is not likely to have
> > > > > > > > > > > >> > checked wether this particular guest uses net failover or not before
> > > > > > > > > > > >> > invoking - they'll just be expecting a paused migration to run fast and
> > > > > > > > > > > >> > be guaranteed to complete.
> > > > > > > > > > > >> > 
> > > > > > > > > > > >> > Regards,
> > > > > > > > > > > >> > Daniel
> > > > > > > > > > > >> 
> > > > > > > > > > > >> Okay I guess. But then shouldn't we handle the reverse situation too:
> > > > > > > > > > > >> pausing guest after migration started but before device was
> > > > > > > > > > > >> unplugged?
> > > > > > > > > > > >> 
> > > > > > > > > > > >
> > > > > > > > > > > > Thinking of which, I have no idea how we'd handle it - fail
> > > > > > > > > > > > pausing guest until migration is cancelled?
> > > > > > > > > > > >
> > > > > > > > > > > > All this seems heavy handed to me ...
> > > > > > > > > > > 
> > > > > > > > > > > This is the minimal fix that I can think of.
> > > > > > > > > > > 
> > > > > > > > > > > Further solution would be:
> > > > > > > > > > > - Add a new migration parameter: migrate-paused
> > > > > > > > > > > - change libvirt to use the new parameter if it exist
> > > > > > > > > > > - in qemu, when we do start migration (but after we wait for the unplug
> > > > > > > > > > >   device) paused the guest before starting migration and resume it after
> > > > > > > > > > >   migration finish.
> > > > > > > > > > 
> > > > > > > > > > It would also have to handle issuing of paused after migration has
> > > > > > > > > > been started - delay the pause request until the nuplug is complete
> > > > > > > > > > is one answer.
> > > > > > > > > 
> > > > > > > > > Hmm my worry would be that pausing is one way to give cpu
> > > > > > > > > resources back to host. It's problematic if guest can delay
> > > > > > > > > that indefinitely.
> > > > > > > > 
> > > > > > > > hmm, yes, that is awkward.  Perhaps we should just report an explicit
> > > > > > > > error then.
> > > > > > > 
> > > > > > > Report an error in response to which command? Do you mean
> > > > > > > fail migration?
> > > > > > 
> > > > > > If mgt attempt to pause an existing migration that hasn't finished
> > > > > > the PCI unplug stage, then fail the pause request.
> > > > > 
> > > > > Pause guest not migration ...
> > > > > Might be tricky ...
> > > > > 
> > > > > Let me ask this, why not just produce a warning
> > > > > that migration wan't finish until guest actually runs?
> > > > > User will then know and unpause the guest when he wants
> > > > > migration to succeed ...
> > > > 
> > > > A warning is going to be essentally invisible if the pause command
> > > > succeeeds. 
> > > > 
> > > > Regards,
> > > > Daniel
> > > 
> > > I mean the situation here isn't earth shattering, an admin
> > > created it. Maybe he will unpause shortly
> > > and all will be well ...
> > > 
> > > How about we make it possible for admin to detect that the
> > > reason for migration not making progress is that it is
> > > waiting for unplug? And maybe that guest is paused too?
> > 
> > We already know that from the state of the VM.
> 
> You don't know that migration is waiting for the guest
> action, no.
> 
> This is what we care about here right?

Yes we do, the migration status should be in
'MIGRATION_STATUS_WAIT_UNPLUG' (and that should have been notified
as an event).

Dave

> 
> > > I just don't see how we can detect all cases and I am not
> > > sure it is worth it to try and detect only some of them,
> > > making users think they can rely on command failure to
> > > detect them.
> > 
> > Another way to solve this would be to remove the unplugging from the
> > migration layer and leave it as a problem for the management layer to do
> > the unplug.
> > 
> > Dave
> > 
> > > 
> > > > -- 
> > > > |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> > > > |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> > > > |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
> > > 
> > -- 
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
  2020-12-03 11:43                             ` Dr. David Alan Gilbert
  2020-12-03 12:02                               ` Michael S. Tsirkin
@ 2020-12-03 12:11                               ` Michael S. Tsirkin
  2020-12-03 12:16                                 ` Daniel P. Berrangé
  1 sibling, 1 reply; 58+ messages in thread
From: Michael S. Tsirkin @ 2020-12-03 12:11 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Daniel P. Berrangé, Eduardo Habkost, Juan Quintela,
	Jason Wang, qemu-devel, Paolo Bonzini

On Thu, Dec 03, 2020 at 11:43:41AM +0000, Dr. David Alan Gilbert wrote:
> Another way to solve this would be to remove the unplugging from the
> migration layer and leave it as a problem for the management layer to do
> the unplug.

Daniel described the problem with modular management tools which expect
pausing or slowing down guest to cause migration to converge.

Point is, it actually *will* make it converge but only if you
pause it after unplug.

As it is, these tools fundamentally can not handle failover
requiring guest cooperation. Moving code between layers won't help.
Introducing failure modes as this patch does won't help either
especially since Daniel wrote there are countless tools like this.
We just break them all but have no resources to fix them,
this does not help at all.

We can just leave the situation as is.

Or if we do want to be nice to these tools, how about we
unpause the guest until unplug, then pause it again?
This actually addresses the problem instead of
shifting the blame, does it not?

-- 
MST

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
  2020-12-03 12:11                               ` Michael S. Tsirkin
@ 2020-12-03 12:16                                 ` Daniel P. Berrangé
  2020-12-08 18:48                                   ` Michael S. Tsirkin
  0 siblings, 1 reply; 58+ messages in thread
From: Daniel P. Berrangé @ 2020-12-03 12:16 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Eduardo Habkost, Juan Quintela, Jason Wang, qemu-devel,
	Dr. David Alan Gilbert, Paolo Bonzini

On Thu, Dec 03, 2020 at 07:11:17AM -0500, Michael S. Tsirkin wrote:
> On Thu, Dec 03, 2020 at 11:43:41AM +0000, Dr. David Alan Gilbert wrote:
> > Another way to solve this would be to remove the unplugging from the
> > migration layer and leave it as a problem for the management layer to do
> > the unplug.
> 
> Daniel described the problem with modular management tools which expect
> pausing or slowing down guest to cause migration to converge.
> 
> Point is, it actually *will* make it converge but only if you
> pause it after unplug.
> 
> As it is, these tools fundamentally can not handle failover
> requiring guest cooperation. Moving code between layers won't help.
> Introducing failure modes as this patch does won't help either
> especially since Daniel wrote there are countless tools like this.
> We just break them all but have no resources to fix them,
> this does not help at all.
> 
> We can just leave the situation as is.
> 
> Or if we do want to be nice to these tools, how about we
> unpause the guest until unplug, then pause it again?
> This actually addresses the problem instead of
> shifting the blame, does it not?

This is a very bad idea because it changes the execution status of the
guest behind the apps/admins back, and that cannot be assumed to be a
safe thing todo.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
  2020-12-03 12:16                                 ` Daniel P. Berrangé
@ 2020-12-08 18:48                                   ` Michael S. Tsirkin
  0 siblings, 0 replies; 58+ messages in thread
From: Michael S. Tsirkin @ 2020-12-08 18:48 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Eduardo Habkost, Juan Quintela, Jason Wang, qemu-devel,
	Dr. David Alan Gilbert, Paolo Bonzini

On Thu, Dec 03, 2020 at 12:16:24PM +0000, Daniel P. BerrangÃ© wrote:
> On Thu, Dec 03, 2020 at 07:11:17AM -0500, Michael S. Tsirkin wrote:
> > On Thu, Dec 03, 2020 at 11:43:41AM +0000, Dr. David Alan Gilbert wrote:
> > > Another way to solve this would be to remove the unplugging from the
> > > migration layer and leave it as a problem for the management layer to do
> > > the unplug.
> > 
> > Daniel described the problem with modular management tools which expect
> > pausing or slowing down guest to cause migration to converge.
> > 
> > Point is, it actually *will* make it converge but only if you
> > pause it after unplug.
> > 
> > As it is, these tools fundamentally can not handle failover
> > requiring guest cooperation. Moving code between layers won't help.
> > Introducing failure modes as this patch does won't help either
> > especially since Daniel wrote there are countless tools like this.
> > We just break them all but have no resources to fix them,
> > this does not help at all.
> > 
> > We can just leave the situation as is.
> > 
> > Or if we do want to be nice to these tools, how about we
> > unpause the guest until unplug, then pause it again?
> > This actually addresses the problem instead of
> > shifting the blame, does it not?
> 
> This is a very bad idea because it changes the execution status of the
> guest behind the apps/admins back, and that cannot be assumed to be a
> safe thing todo.
> 
> Regards,
> Daniel

My question is this: management gets an error since guest was paused
presumably by someone (admin?) outside its control.
How does it know when it is unpaused?


> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
  2020-12-03 11:40                           ` Michael S. Tsirkin
  2020-12-03 11:43                             ` Dr. David Alan Gilbert
@ 2020-12-03 11:45                             ` Daniel P. Berrangé
  2020-12-03 12:01                               ` Michael S. Tsirkin
  1 sibling, 1 reply; 58+ messages in thread
From: Daniel P. Berrangé @ 2020-12-03 11:45 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Eduardo Habkost, Juan Quintela, Jason Wang, qemu-devel,
	Dr. David Alan Gilbert, Paolo Bonzini

On Thu, Dec 03, 2020 at 06:40:11AM -0500, Michael S. Tsirkin wrote:
> On Thu, Dec 03, 2020 at 11:32:53AM +0000, Daniel P. Berrangé wrote:
> > On Thu, Dec 03, 2020 at 06:21:47AM -0500, Michael S. Tsirkin wrote:
> > > On Wed, Dec 02, 2020 at 12:01:21PM +0000, Daniel P. Berrangé wrote:
> > > > On Wed, Dec 02, 2020 at 06:37:46AM -0500, Michael S. Tsirkin wrote:
> > > > > On Wed, Dec 02, 2020 at 11:26:39AM +0000, Daniel P. Berrangé wrote:
> > > > > > On Wed, Dec 02, 2020 at 06:19:29AM -0500, Michael S. Tsirkin wrote:
> > > > > > > On Wed, Dec 02, 2020 at 10:55:15AM +0000, Daniel P. Berrangé wrote:
> > > > > > > > On Wed, Dec 02, 2020 at 11:51:05AM +0100, Juan Quintela wrote:
> > > > > > > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > > > > > > > On Wed, Dec 02, 2020 at 05:31:53AM -0500, Michael S. Tsirkin wrote:
> > > > > > > > > >> On Wed, Dec 02, 2020 at 10:27:18AM +0000, Daniel P. BerrangÃƒÂ© wrote:
> > > > > > > > > >> > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. Tsirkin wrote:
> > > > > > > > > >> > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan Quintela wrote:
> > > > > > > > > >> > > > If we have a paused guest, it can't unplug the network VF device, so
> > > > > > > > > >> > > > we wait there forever.  Just change the code to give one error on that
> > > > > > > > > >> > > > case.
> > > > > > > > > >> > > > 
> > > > > > > > > >> > > > Signed-off-by: Juan Quintela <quintela@redhat.com>
> > > > > > > > > >> > > 
> > > > > > > > > >> > > It's certainly possible but it's management that created
> > > > > > > > > >> > > this situation after all - why do we bother to enforce
> > > > > > > > > >> > > a policy? It is possible that management will unpause immediately
> > > > > > > > > >> > > afterwards and everything will proceed smoothly.
> > > > > > > > > >> > > 
> > > > > > > > > >> > > Yes migration will not happen until guest is
> > > > > > > > > >> > > unpaused but the same it true of e.g. a guest that is stuck
> > > > > > > > > >> > > because of a bug.
> > > > > > > > > >> > 
> > > > > > > > > >> > That's pretty different behaviour from how migration normally handles
> > > > > > > > > >> > a paused guest, which is that it is guaranteed to complete the migration
> > > > > > > > > >> > in as short a time as network bandwidth allows.
> > > > > > > > > >> > 
> > > > > > > > > >> > Just ignoring the situation I think will lead to surprise apps / admins,
> > > > > > > > > >> > because the person/entity invoking the migration is not likely to have
> > > > > > > > > >> > checked wether this particular guest uses net failover or not before
> > > > > > > > > >> > invoking - they'll just be expecting a paused migration to run fast and
> > > > > > > > > >> > be guaranteed to complete.
> > > > > > > > > >> > 
> > > > > > > > > >> > Regards,
> > > > > > > > > >> > Daniel
> > > > > > > > > >> 
> > > > > > > > > >> Okay I guess. But then shouldn't we handle the reverse situation too:
> > > > > > > > > >> pausing guest after migration started but before device was
> > > > > > > > > >> unplugged?
> > > > > > > > > >> 
> > > > > > > > > >
> > > > > > > > > > Thinking of which, I have no idea how we'd handle it - fail
> > > > > > > > > > pausing guest until migration is cancelled?
> > > > > > > > > >
> > > > > > > > > > All this seems heavy handed to me ...
> > > > > > > > > 
> > > > > > > > > This is the minimal fix that I can think of.
> > > > > > > > > 
> > > > > > > > > Further solution would be:
> > > > > > > > > - Add a new migration parameter: migrate-paused
> > > > > > > > > - change libvirt to use the new parameter if it exist
> > > > > > > > > - in qemu, when we do start migration (but after we wait for the unplug
> > > > > > > > >   device) paused the guest before starting migration and resume it after
> > > > > > > > >   migration finish.
> > > > > > > > 
> > > > > > > > It would also have to handle issuing of paused after migration has
> > > > > > > > been started - delay the pause request until the nuplug is complete
> > > > > > > > is one answer.
> > > > > > > 
> > > > > > > Hmm my worry would be that pausing is one way to give cpu
> > > > > > > resources back to host. It's problematic if guest can delay
> > > > > > > that indefinitely.
> > > > > > 
> > > > > > hmm, yes, that is awkward.  Perhaps we should just report an explicit
> > > > > > error then.
> > > > > 
> > > > > Report an error in response to which command? Do you mean
> > > > > fail migration?
> > > > 
> > > > If mgt attempt to pause an existing migration that hasn't finished
> > > > the PCI unplug stage, then fail the pause request.
> > > 
> > > Pause guest not migration ...
> > > Might be tricky ...
> > > 
> > > Let me ask this, why not just produce a warning
> > > that migration wan't finish until guest actually runs?
> > > User will then know and unpause the guest when he wants
> > > migration to succeed ...
> > 
> > A warning is going to be essentally invisible if the pause command
> > succeeeds. 
> 
> I mean the situation here isn't earth shattering, an admin
> created it. Maybe he will unpause shortly
> and all will be well ...

It isn't really about the admin.  It is about countless existing mgmt apps
that expect migration will always succeed if the VM is paused. The mgmt
apps triggering the migraiton is not neccessarily the same as the app
which introduced the use of NIC failover in the config.

eg in OpenStack Nova provides the VM config, but there are completely
separate apps that are built todo automation on top of Nova which 
this is liable to break. There's no human admin there to diagnose
this and re-try with unpause, as all the logic is in the apps.


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
  2020-12-03 11:45                             ` Daniel P. Berrangé
@ 2020-12-03 12:01                               ` Michael S. Tsirkin
  2020-12-03 12:06                                 ` Daniel P. Berrangé
  0 siblings, 1 reply; 58+ messages in thread
From: Michael S. Tsirkin @ 2020-12-03 12:01 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Eduardo Habkost, Juan Quintela, Jason Wang, qemu-devel,
	Dr. David Alan Gilbert, Paolo Bonzini

On Thu, Dec 03, 2020 at 11:45:12AM +0000, Daniel P. Berrangé wrote:
> On Thu, Dec 03, 2020 at 06:40:11AM -0500, Michael S. Tsirkin wrote:
> > On Thu, Dec 03, 2020 at 11:32:53AM +0000, Daniel P. Berrangé wrote:
> > > On Thu, Dec 03, 2020 at 06:21:47AM -0500, Michael S. Tsirkin wrote:
> > > > On Wed, Dec 02, 2020 at 12:01:21PM +0000, Daniel P. Berrangé wrote:
> > > > > On Wed, Dec 02, 2020 at 06:37:46AM -0500, Michael S. Tsirkin wrote:
> > > > > > On Wed, Dec 02, 2020 at 11:26:39AM +0000, Daniel P. Berrangé wrote:
> > > > > > > On Wed, Dec 02, 2020 at 06:19:29AM -0500, Michael S. Tsirkin wrote:
> > > > > > > > On Wed, Dec 02, 2020 at 10:55:15AM +0000, Daniel P. Berrangé wrote:
> > > > > > > > > On Wed, Dec 02, 2020 at 11:51:05AM +0100, Juan Quintela wrote:
> > > > > > > > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > > > > > > > > On Wed, Dec 02, 2020 at 05:31:53AM -0500, Michael S. Tsirkin wrote:
> > > > > > > > > > >> On Wed, Dec 02, 2020 at 10:27:18AM +0000, Daniel P. BerrangÃƒÂ© wrote:
> > > > > > > > > > >> > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. Tsirkin wrote:
> > > > > > > > > > >> > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan Quintela wrote:
> > > > > > > > > > >> > > > If we have a paused guest, it can't unplug the network VF device, so
> > > > > > > > > > >> > > > we wait there forever.  Just change the code to give one error on that
> > > > > > > > > > >> > > > case.
> > > > > > > > > > >> > > > 
> > > > > > > > > > >> > > > Signed-off-by: Juan Quintela <quintela@redhat.com>
> > > > > > > > > > >> > > 
> > > > > > > > > > >> > > It's certainly possible but it's management that created
> > > > > > > > > > >> > > this situation after all - why do we bother to enforce
> > > > > > > > > > >> > > a policy? It is possible that management will unpause immediately
> > > > > > > > > > >> > > afterwards and everything will proceed smoothly.
> > > > > > > > > > >> > > 
> > > > > > > > > > >> > > Yes migration will not happen until guest is
> > > > > > > > > > >> > > unpaused but the same it true of e.g. a guest that is stuck
> > > > > > > > > > >> > > because of a bug.
> > > > > > > > > > >> > 
> > > > > > > > > > >> > That's pretty different behaviour from how migration normally handles
> > > > > > > > > > >> > a paused guest, which is that it is guaranteed to complete the migration
> > > > > > > > > > >> > in as short a time as network bandwidth allows.
> > > > > > > > > > >> > 
> > > > > > > > > > >> > Just ignoring the situation I think will lead to surprise apps / admins,
> > > > > > > > > > >> > because the person/entity invoking the migration is not likely to have
> > > > > > > > > > >> > checked wether this particular guest uses net failover or not before
> > > > > > > > > > >> > invoking - they'll just be expecting a paused migration to run fast and
> > > > > > > > > > >> > be guaranteed to complete.
> > > > > > > > > > >> > 
> > > > > > > > > > >> > Regards,
> > > > > > > > > > >> > Daniel
> > > > > > > > > > >> 
> > > > > > > > > > >> Okay I guess. But then shouldn't we handle the reverse situation too:
> > > > > > > > > > >> pausing guest after migration started but before device was
> > > > > > > > > > >> unplugged?
> > > > > > > > > > >> 
> > > > > > > > > > >
> > > > > > > > > > > Thinking of which, I have no idea how we'd handle it - fail
> > > > > > > > > > > pausing guest until migration is cancelled?
> > > > > > > > > > >
> > > > > > > > > > > All this seems heavy handed to me ...
> > > > > > > > > > 
> > > > > > > > > > This is the minimal fix that I can think of.
> > > > > > > > > > 
> > > > > > > > > > Further solution would be:
> > > > > > > > > > - Add a new migration parameter: migrate-paused
> > > > > > > > > > - change libvirt to use the new parameter if it exist
> > > > > > > > > > - in qemu, when we do start migration (but after we wait for the unplug
> > > > > > > > > >   device) paused the guest before starting migration and resume it after
> > > > > > > > > >   migration finish.
> > > > > > > > > 
> > > > > > > > > It would also have to handle issuing of paused after migration has
> > > > > > > > > been started - delay the pause request until the nuplug is complete
> > > > > > > > > is one answer.
> > > > > > > > 
> > > > > > > > Hmm my worry would be that pausing is one way to give cpu
> > > > > > > > resources back to host. It's problematic if guest can delay
> > > > > > > > that indefinitely.
> > > > > > > 
> > > > > > > hmm, yes, that is awkward.  Perhaps we should just report an explicit
> > > > > > > error then.
> > > > > > 
> > > > > > Report an error in response to which command? Do you mean
> > > > > > fail migration?
> > > > > 
> > > > > If mgt attempt to pause an existing migration that hasn't finished
> > > > > the PCI unplug stage, then fail the pause request.
> > > > 
> > > > Pause guest not migration ...
> > > > Might be tricky ...
> > > > 
> > > > Let me ask this, why not just produce a warning
> > > > that migration wan't finish until guest actually runs?
> > > > User will then know and unpause the guest when he wants
> > > > migration to succeed ...
> > > 
> > > A warning is going to be essentally invisible if the pause command
> > > succeeeds. 
> > 
> > I mean the situation here isn't earth shattering, an admin
> > created it. Maybe he will unpause shortly
> > and all will be well ...
> 
> It isn't really about the admin.  It is about countless existing mgmt apps
> that expect migration will always succeed if the VM is paused. The mgmt
> apps triggering the migraiton is not neccessarily the same as the app
> which introduced the use of NIC failover in the config.
> 
> eg in OpenStack Nova provides the VM config, but there are completely
> separate apps that are built todo automation on top of Nova which 
> this is liable to break. There's no human admin there to diagnose
> this and re-try with unpause, as all the logic is in the apps.
> 
> 
> Regards,
> Daniel

So let's say pause fails. Won't this break these theoretical apps?


> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
  2020-12-03 12:01                               ` Michael S. Tsirkin
@ 2020-12-03 12:06                                 ` Daniel P. Berrangé
  2020-12-03 12:13                                   ` Michael S. Tsirkin
  0 siblings, 1 reply; 58+ messages in thread
From: Daniel P. Berrangé @ 2020-12-03 12:06 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Eduardo Habkost, Juan Quintela, Jason Wang, qemu-devel,
	Dr. David Alan Gilbert, Paolo Bonzini

On Thu, Dec 03, 2020 at 07:01:11AM -0500, Michael S. Tsirkin wrote:
> On Thu, Dec 03, 2020 at 11:45:12AM +0000, Daniel P. Berrangé wrote:
> > On Thu, Dec 03, 2020 at 06:40:11AM -0500, Michael S. Tsirkin wrote:
> > > On Thu, Dec 03, 2020 at 11:32:53AM +0000, Daniel P. Berrangé wrote:
> > > > On Thu, Dec 03, 2020 at 06:21:47AM -0500, Michael S. Tsirkin wrote:
> > > > > On Wed, Dec 02, 2020 at 12:01:21PM +0000, Daniel P. Berrangé wrote:
> > > > > > On Wed, Dec 02, 2020 at 06:37:46AM -0500, Michael S. Tsirkin wrote:
> > > > > > > On Wed, Dec 02, 2020 at 11:26:39AM +0000, Daniel P. Berrangé wrote:
> > > > > > > > On Wed, Dec 02, 2020 at 06:19:29AM -0500, Michael S. Tsirkin wrote:
> > > > > > > > > On Wed, Dec 02, 2020 at 10:55:15AM +0000, Daniel P. Berrangé wrote:
> > > > > > > > > > On Wed, Dec 02, 2020 at 11:51:05AM +0100, Juan Quintela wrote:
> > > > > > > > > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > > > > > > > > > On Wed, Dec 02, 2020 at 05:31:53AM -0500, Michael S. Tsirkin wrote:
> > > > > > > > > > > >> On Wed, Dec 02, 2020 at 10:27:18AM +0000, Daniel P. BerrangÃƒÂ© wrote:
> > > > > > > > > > > >> > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. Tsirkin wrote:
> > > > > > > > > > > >> > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan Quintela wrote:
> > > > > > > > > > > >> > > > If we have a paused guest, it can't unplug the network VF device, so
> > > > > > > > > > > >> > > > we wait there forever.  Just change the code to give one error on that
> > > > > > > > > > > >> > > > case.
> > > > > > > > > > > >> > > > 
> > > > > > > > > > > >> > > > Signed-off-by: Juan Quintela <quintela@redhat.com>
> > > > > > > > > > > >> > > 
> > > > > > > > > > > >> > > It's certainly possible but it's management that created
> > > > > > > > > > > >> > > this situation after all - why do we bother to enforce
> > > > > > > > > > > >> > > a policy? It is possible that management will unpause immediately
> > > > > > > > > > > >> > > afterwards and everything will proceed smoothly.
> > > > > > > > > > > >> > > 
> > > > > > > > > > > >> > > Yes migration will not happen until guest is
> > > > > > > > > > > >> > > unpaused but the same it true of e.g. a guest that is stuck
> > > > > > > > > > > >> > > because of a bug.
> > > > > > > > > > > >> > 
> > > > > > > > > > > >> > That's pretty different behaviour from how migration normally handles
> > > > > > > > > > > >> > a paused guest, which is that it is guaranteed to complete the migration
> > > > > > > > > > > >> > in as short a time as network bandwidth allows.
> > > > > > > > > > > >> > 
> > > > > > > > > > > >> > Just ignoring the situation I think will lead to surprise apps / admins,
> > > > > > > > > > > >> > because the person/entity invoking the migration is not likely to have
> > > > > > > > > > > >> > checked wether this particular guest uses net failover or not before
> > > > > > > > > > > >> > invoking - they'll just be expecting a paused migration to run fast and
> > > > > > > > > > > >> > be guaranteed to complete.
> > > > > > > > > > > >> > 
> > > > > > > > > > > >> > Regards,
> > > > > > > > > > > >> > Daniel
> > > > > > > > > > > >> 
> > > > > > > > > > > >> Okay I guess. But then shouldn't we handle the reverse situation too:
> > > > > > > > > > > >> pausing guest after migration started but before device was
> > > > > > > > > > > >> unplugged?
> > > > > > > > > > > >> 
> > > > > > > > > > > >
> > > > > > > > > > > > Thinking of which, I have no idea how we'd handle it - fail
> > > > > > > > > > > > pausing guest until migration is cancelled?
> > > > > > > > > > > >
> > > > > > > > > > > > All this seems heavy handed to me ...
> > > > > > > > > > > 
> > > > > > > > > > > This is the minimal fix that I can think of.
> > > > > > > > > > > 
> > > > > > > > > > > Further solution would be:
> > > > > > > > > > > - Add a new migration parameter: migrate-paused
> > > > > > > > > > > - change libvirt to use the new parameter if it exist
> > > > > > > > > > > - in qemu, when we do start migration (but after we wait for the unplug
> > > > > > > > > > >   device) paused the guest before starting migration and resume it after
> > > > > > > > > > >   migration finish.
> > > > > > > > > > 
> > > > > > > > > > It would also have to handle issuing of paused after migration has
> > > > > > > > > > been started - delay the pause request until the nuplug is complete
> > > > > > > > > > is one answer.
> > > > > > > > > 
> > > > > > > > > Hmm my worry would be that pausing is one way to give cpu
> > > > > > > > > resources back to host. It's problematic if guest can delay
> > > > > > > > > that indefinitely.
> > > > > > > > 
> > > > > > > > hmm, yes, that is awkward.  Perhaps we should just report an explicit
> > > > > > > > error then.
> > > > > > > 
> > > > > > > Report an error in response to which command? Do you mean
> > > > > > > fail migration?
> > > > > > 
> > > > > > If mgt attempt to pause an existing migration that hasn't finished
> > > > > > the PCI unplug stage, then fail the pause request.
> > > > > 
> > > > > Pause guest not migration ...
> > > > > Might be tricky ...
> > > > > 
> > > > > Let me ask this, why not just produce a warning
> > > > > that migration wan't finish until guest actually runs?
> > > > > User will then know and unpause the guest when he wants
> > > > > migration to succeed ...
> > > > 
> > > > A warning is going to be essentally invisible if the pause command
> > > > succeeeds. 
> > > 
> > > I mean the situation here isn't earth shattering, an admin
> > > created it. Maybe he will unpause shortly
> > > and all will be well ...
> > 
> > It isn't really about the admin.  It is about countless existing mgmt apps
> > that expect migration will always succeed if the VM is paused. The mgmt
> > apps triggering the migraiton is not neccessarily the same as the app
> > which introduced the use of NIC failover in the config.
> > 
> > eg in OpenStack Nova provides the VM config, but there are completely
> > separate apps that are built todo automation on top of Nova which 
> > this is liable to break. There's no human admin there to diagnose
> > this and re-try with unpause, as all the logic is in the apps.
> > 
> > 
> > Regards,
> > Daniel
> 
> So let's say pause fails. Won't this break these theoretical apps?

Yes, they are broken in a way that can now actually be seen, as opposed
to just hanging forever in migration.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
  2020-12-03 12:06                                 ` Daniel P. Berrangé
@ 2020-12-03 12:13                                   ` Michael S. Tsirkin
  0 siblings, 0 replies; 58+ messages in thread
From: Michael S. Tsirkin @ 2020-12-03 12:13 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Eduardo Habkost, Juan Quintela, Jason Wang, qemu-devel,
	Dr. David Alan Gilbert, Paolo Bonzini

On Thu, Dec 03, 2020 at 12:06:14PM +0000, Daniel P. BerrangÃ© wrote:
> > > It isn't really about the admin.  It is about countless existing mgmt apps
> > > that expect migration will always succeed if the VM is paused. The mgmt
> > > apps triggering the migraiton is not neccessarily the same as the app
> > > which introduced the use of NIC failover in the config.
> > > 
> > > eg in OpenStack Nova provides the VM config, but there are completely
> > > separate apps that are built todo automation on top of Nova which 
> > > this is liable to break. There's no human admin there to diagnose
> > > this and re-try with unpause, as all the logic is in the apps.
> > > 
> > > 
> > > Regards,
> > > Daniel
> > 
> > So let's say pause fails. Won't this break these theoretical apps?
> 
> Yes, they are broken in a way that can now actually be seen, as opposed
> to just hanging forever in migration.

Right but there are countless apps out there, and they are all broken,
maybe not too hard to debug if one knows where to look, but still ...

And again only true for pause then migrate, migrate then pause is still
broken, and if we fix it by failing pause then no one knows whether apps
are prepared to handle a failure in the pause command, previously if
used correctly it would never fail I think ...

-- 
MST



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
  2020-12-02 12:01                     ` Daniel P. Berrangé
  2020-12-03 11:21                       ` Michael S. Tsirkin
@ 2020-12-08 18:32                       ` Michael S. Tsirkin
  1 sibling, 0 replies; 58+ messages in thread
From: Michael S. Tsirkin @ 2020-12-08 18:32 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Eduardo Habkost, Juan Quintela, Jason Wang,
	Dr. David Alan Gilbert, qemu-devel, Paolo Bonzini

On Wed, Dec 02, 2020 at 12:01:21PM +0000, Daniel P. Berrangé wrote:
> On Wed, Dec 02, 2020 at 06:37:46AM -0500, Michael S. Tsirkin wrote:
> > On Wed, Dec 02, 2020 at 11:26:39AM +0000, Daniel P. Berrangé wrote:
> > > On Wed, Dec 02, 2020 at 06:19:29AM -0500, Michael S. Tsirkin wrote:
> > > > On Wed, Dec 02, 2020 at 10:55:15AM +0000, Daniel P. Berrangé wrote:
> > > > > On Wed, Dec 02, 2020 at 11:51:05AM +0100, Juan Quintela wrote:
> > > > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > > > > On Wed, Dec 02, 2020 at 05:31:53AM -0500, Michael S. Tsirkin wrote:
> > > > > > >> On Wed, Dec 02, 2020 at 10:27:18AM +0000, Daniel P. BerrangÃƒÂ© wrote:
> > > > > > >> > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. Tsirkin wrote:
> > > > > > >> > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan Quintela wrote:
> > > > > > >> > > > If we have a paused guest, it can't unplug the network VF device, so
> > > > > > >> > > > we wait there forever.  Just change the code to give one error on that
> > > > > > >> > > > case.
> > > > > > >> > > > 
> > > > > > >> > > > Signed-off-by: Juan Quintela <quintela@redhat.com>
> > > > > > >> > > 
> > > > > > >> > > It's certainly possible but it's management that created
> > > > > > >> > > this situation after all - why do we bother to enforce
> > > > > > >> > > a policy? It is possible that management will unpause immediately
> > > > > > >> > > afterwards and everything will proceed smoothly.
> > > > > > >> > > 
> > > > > > >> > > Yes migration will not happen until guest is
> > > > > > >> > > unpaused but the same it true of e.g. a guest that is stuck
> > > > > > >> > > because of a bug.
> > > > > > >> > 
> > > > > > >> > That's pretty different behaviour from how migration normally handles
> > > > > > >> > a paused guest, which is that it is guaranteed to complete the migration
> > > > > > >> > in as short a time as network bandwidth allows.
> > > > > > >> > 
> > > > > > >> > Just ignoring the situation I think will lead to surprise apps / admins,
> > > > > > >> > because the person/entity invoking the migration is not likely to have
> > > > > > >> > checked wether this particular guest uses net failover or not before
> > > > > > >> > invoking - they'll just be expecting a paused migration to run fast and
> > > > > > >> > be guaranteed to complete.
> > > > > > >> > 
> > > > > > >> > Regards,
> > > > > > >> > Daniel
> > > > > > >> 
> > > > > > >> Okay I guess. But then shouldn't we handle the reverse situation too:
> > > > > > >> pausing guest after migration started but before device was
> > > > > > >> unplugged?
> > > > > > >> 
> > > > > > >
> > > > > > > Thinking of which, I have no idea how we'd handle it - fail
> > > > > > > pausing guest until migration is cancelled?
> > > > > > >
> > > > > > > All this seems heavy handed to me ...
> > > > > > 
> > > > > > This is the minimal fix that I can think of.
> > > > > > 
> > > > > > Further solution would be:
> > > > > > - Add a new migration parameter: migrate-paused
> > > > > > - change libvirt to use the new parameter if it exist
> > > > > > - in qemu, when we do start migration (but after we wait for the unplug
> > > > > >   device) paused the guest before starting migration and resume it after
> > > > > >   migration finish.
> > > > > 
> > > > > It would also have to handle issuing of paused after migration has
> > > > > been started - delay the pause request until the nuplug is complete
> > > > > is one answer.
> > > > 
> > > > Hmm my worry would be that pausing is one way to give cpu
> > > > resources back to host. It's problematic if guest can delay
> > > > that indefinitely.
> > > 
> > > hmm, yes, that is awkward.  Perhaps we should just report an explicit
> > > error then.
> > 
> > Report an error in response to which command? Do you mean
> > fail migration?
> 
> If mgt attempt to pause an existing migration that hasn't finished
> the PCI unplug stage, then fail the pause request.

OK so I guess I'll apply the rest of the patchset, and let's see
a complete patch that makes pause and migrate mutually exclusive?


> > 
> > > In normal cases this won't happen, as unplug will have
> > > easily completed before the mgmt app pauses the running migration.
> > > In broken/malicious guest cases, this at least ives mgmt a heads up
> > > that something is wrong and they might then decide to cancel the
> > > migration.
> 
> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
  2020-12-02 10:31       ` Michael S. Tsirkin
  2020-12-02 10:33         ` Michael S. Tsirkin
@ 2020-12-02 10:34         ` Daniel P. Berrangé
  1 sibling, 0 replies; 58+ messages in thread
From: Daniel P. Berrangé @ 2020-12-02 10:34 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Eduardo Habkost, Juan Quintela, Jason Wang,
	Dr. David Alan Gilbert, qemu-devel, Paolo Bonzini

On Wed, Dec 02, 2020 at 05:31:50AM -0500, Michael S. Tsirkin wrote:
> On Wed, Dec 02, 2020 at 10:27:18AM +0000, Daniel P. BerrangÃ© wrote:
> > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. Tsirkin wrote:
> > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan Quintela wrote:
> > > > If we have a paused guest, it can't unplug the network VF device, so
> > > > we wait there forever.  Just change the code to give one error on that
> > > > case.
> > > > 
> > > > Signed-off-by: Juan Quintela <quintela@redhat.com>
> > > 
> > > It's certainly possible but it's management that created
> > > this situation after all - why do we bother to enforce
> > > a policy? It is possible that management will unpause immediately
> > > afterwards and everything will proceed smoothly.
> > > 
> > > Yes migration will not happen until guest is
> > > unpaused but the same it true of e.g. a guest that is stuck
> > > because of a bug.
> > 
> > That's pretty different behaviour from how migration normally handles
> > a paused guest, which is that it is guaranteed to complete the migration
> > in as short a time as network bandwidth allows.
> > 
> > Just ignoring the situation I think will lead to surprise apps / admins,
> > because the person/entity invoking the migration is not likely to have
> > checked wether this particular guest uses net failover or not before
> > invoking - they'll just be expecting a paused migration to run fast and
> > be guaranteed to complete.
>
> Okay I guess. But then shouldn't we handle the reverse situation too:
> pausing guest after migration started but before device was
> unplugged?

Yeah we likely want todo something there.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 58+ messages in thread

* [PATCH v2 02/27] failover: fix indentantion
  2020-11-18  8:37 [PATCH v2 00/27] Virtio net failover fixes Juan Quintela
  2020-11-18  8:37 ` [PATCH v2 01/27] migration: Network Failover can't work with a paused guest Juan Quintela
@ 2020-11-18  8:37 ` Juan Quintela
  2020-11-18  8:37 ` [PATCH v2 03/27] failover: Use always atomics for primary_should_be_hidden Juan Quintela
                   ` (26 subsequent siblings)
  28 siblings, 0 replies; 58+ messages in thread
From: Juan Quintela @ 2020-11-18  8:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrangé, Eduardo Habkost, Michael S. Tsirkin,
	Jason Wang, Juan Quintela, Dr. David Alan Gilbert, Paolo Bonzini

Once there, remove not needed cast.

Signed-off-by: Juan Quintela <quintela@redhat.com>
---
 hw/net/virtio-net.c    | 33 +++++++++++++++------------------
 softmmu/qdev-monitor.c |  4 ++--
 2 files changed, 17 insertions(+), 20 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 9179013ac4..1011a524bf 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -797,7 +797,7 @@ static void failover_add_primary(VirtIONet *n, Error **errp)
     }
 
     n->primary_device_opts = qemu_opts_find(qemu_find_opts("device"),
-            n->primary_device_id);
+                                            n->primary_device_id);
     if (n->primary_device_opts) {
         n->primary_dev = qdev_device_add(n->primary_device_opts, &err);
         if (err) {
@@ -814,9 +814,9 @@ static void failover_add_primary(VirtIONet *n, Error **errp)
     } else {
         error_setg(errp, "Primary device not found");
         error_append_hint(errp, "Virtio-net failover will not work. Make "
-            "sure primary device has parameter"
-            " failover_pair_id=<virtio-net-id>\n");
-}
+                          "sure primary device has parameter"
+                          " failover_pair_id=<virtio-net-id>\n");
+    }
     error_propagate(errp, err);
 }
 
@@ -824,7 +824,6 @@ static int is_my_primary(void *opaque, QemuOpts *opts, Error **errp)
 {
     VirtIONet *n = opaque;
     int ret = 0;
-
     const char *standby_id = qemu_opt_get(opts, "failover_pair_id");
 
     if (standby_id != NULL && (g_strcmp0(standby_id, n->netclient_name) == 0)) {
@@ -841,14 +840,14 @@ static DeviceState *virtio_net_find_primary(VirtIONet *n, Error **errp)
     Error *err = NULL;
 
     if (qemu_opts_foreach(qemu_find_opts("device"),
-                         is_my_primary, n, &err)) {
+                          is_my_primary, n, &err)) {
         if (err) {
             error_propagate(errp, err);
             return NULL;
         }
         if (n->primary_device_id) {
             dev = qdev_find_recursive(sysbus_get_default(),
-                    n->primary_device_id);
+                                      n->primary_device_id);
         } else {
             error_setg(errp, "Primary device id not found");
             return NULL;
@@ -857,8 +856,6 @@ static DeviceState *virtio_net_find_primary(VirtIONet *n, Error **errp)
     return dev;
 }
 
-
-
 static DeviceState *virtio_connect_failover_devices(VirtIONet *n,
                                                     DeviceState *dev,
                                                     Error **errp)
@@ -3126,9 +3123,9 @@ static bool failover_replug_primary(VirtIONet *n, Error **errp)
         return true;
     }
     if (!n->primary_device_opts) {
-        n->primary_device_opts = qemu_opts_from_qdict(
-                qemu_find_opts("device"),
-                n->primary_device_dict, errp);
+        n->primary_device_opts = qemu_opts_from_qdict(qemu_find_opts("device"),
+                                                      n->primary_device_dict,
+                                                      errp);
         if (!n->primary_device_opts) {
             return false;
         }
@@ -3176,8 +3173,8 @@ static void virtio_net_handle_migration_primary(VirtIONet *n,
     if (migration_in_setup(s) && !should_be_hidden) {
         if (failover_unplug_primary(n)) {
             vmstate_unregister(VMSTATE_IF(n->primary_dev),
-                    qdev_get_vmsd(n->primary_dev),
-                    n->primary_dev);
+                               qdev_get_vmsd(n->primary_dev),
+                               n->primary_dev);
             qapi_event_send_unplug_primary(n->primary_device_id);
             qatomic_set(&n->primary_should_be_hidden, true);
         } else {
@@ -3201,7 +3198,7 @@ static void virtio_net_migration_state_notifier(Notifier *notifier, void *data)
 }
 
 static int virtio_net_primary_should_be_hidden(DeviceListener *listener,
-            QemuOpts *device_opts)
+                                               QemuOpts *device_opts)
 {
     VirtIONet *n = container_of(listener, VirtIONet, primary_listener);
     bool match_found = false;
@@ -3211,11 +3208,11 @@ static int virtio_net_primary_should_be_hidden(DeviceListener *listener,
         return -1;
     }
     n->primary_device_dict = qemu_opts_to_qdict(device_opts,
-            n->primary_device_dict);
+                                                n->primary_device_dict);
     if (n->primary_device_dict) {
         g_free(n->standby_id);
         n->standby_id = g_strdup(qdict_get_try_str(n->primary_device_dict,
-                    "failover_pair_id"));
+                                                   "failover_pair_id"));
     }
     if (g_strcmp0(n->standby_id, n->netclient_name) == 0) {
         match_found = true;
@@ -3235,7 +3232,7 @@ static int virtio_net_primary_should_be_hidden(DeviceListener *listener,
     if (n->primary_device_dict) {
         g_free(n->primary_device_id);
         n->primary_device_id = g_strdup(qdict_get_try_str(
-                    n->primary_device_dict, "id"));
+                                            n->primary_device_dict, "id"));
         if (!n->primary_device_id) {
             warn_report("primary_device_id not set");
         }
diff --git a/softmmu/qdev-monitor.c b/softmmu/qdev-monitor.c
index bf79d0bbcd..a25f5d612c 100644
--- a/softmmu/qdev-monitor.c
+++ b/softmmu/qdev-monitor.c
@@ -573,10 +573,10 @@ void qdev_set_id(DeviceState *dev, const char *id)
 }
 
 static int is_failover_device(void *opaque, const char *name, const char *value,
-                        Error **errp)
+                              Error **errp)
 {
     if (strcmp(name, "failover_pair_id") == 0) {
-        QemuOpts *opts = (QemuOpts *)opaque;
+        QemuOpts *opts = opaque;
 
         if (qdev_should_hide_device(opts)) {
             return 1;
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v2 03/27] failover: Use always atomics for primary_should_be_hidden
  2020-11-18  8:37 [PATCH v2 00/27] Virtio net failover fixes Juan Quintela
  2020-11-18  8:37 ` [PATCH v2 01/27] migration: Network Failover can't work with a paused guest Juan Quintela
  2020-11-18  8:37 ` [PATCH v2 02/27] failover: fix indentantion Juan Quintela
@ 2020-11-18  8:37 ` Juan Quintela
  2020-11-18  8:37 ` [PATCH v2 04/27] failover: primary bus is only used once, and where it is set Juan Quintela
                   ` (25 subsequent siblings)
  28 siblings, 0 replies; 58+ messages in thread
From: Juan Quintela @ 2020-11-18  8:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrangé, Eduardo Habkost, Michael S. Tsirkin,
	Jason Wang, Juan Quintela, Dr. David Alan Gilbert, Paolo Bonzini

Signed-off-by: Juan Quintela <quintela@redhat.com>
---
 hw/net/virtio-net.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 1011a524bf..a0fa63e7cb 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -3136,7 +3136,7 @@ static bool failover_replug_primary(VirtIONet *n, Error **errp)
         return false;
     }
     qdev_set_parent_bus(n->primary_dev, n->primary_bus, &error_abort);
-    n->primary_should_be_hidden = false;
+    qatomic_set(&n->primary_should_be_hidden, false);
     if (!qemu_opt_set_bool(n->primary_device_opts,
                            "partially_hotplugged", true, errp)) {
         return false;
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v2 04/27] failover: primary bus is only used once, and where it is set
  2020-11-18  8:37 [PATCH v2 00/27] Virtio net failover fixes Juan Quintela
                   ` (2 preceding siblings ...)
  2020-11-18  8:37 ` [PATCH v2 03/27] failover: Use always atomics for primary_should_be_hidden Juan Quintela
@ 2020-11-18  8:37 ` Juan Quintela
  2020-11-18  8:37 ` [PATCH v2 05/27] failover: Remove unused parameter Juan Quintela
                   ` (24 subsequent siblings)
  28 siblings, 0 replies; 58+ messages in thread
From: Juan Quintela @ 2020-11-18  8:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrangé, Eduardo Habkost, Michael S. Tsirkin,
	Jason Wang, Juan Quintela, Dr. David Alan Gilbert, Paolo Bonzini

Just remove the struct member.

Signed-off-by: Juan Quintela <quintela@redhat.com>
---
 include/hw/virtio/virtio-net.h | 1 -
 hw/net/virtio-net.c            | 8 ++++----
 2 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
index f4852ac27b..c8da637d40 100644
--- a/include/hw/virtio/virtio-net.h
+++ b/include/hw/virtio/virtio-net.h
@@ -205,7 +205,6 @@ struct VirtIONet {
     QemuOpts *primary_device_opts;
     QDict *primary_device_dict;
     DeviceState *primary_dev;
-    BusState *primary_bus;
     char *primary_device_id;
     char *standby_id;
     bool primary_should_be_hidden;
diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index a0fa63e7cb..786d313330 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -804,7 +804,6 @@ static void failover_add_primary(VirtIONet *n, Error **errp)
             qemu_opts_del(n->primary_device_opts);
         }
         if (n->primary_dev) {
-            n->primary_bus = n->primary_dev->parent_bus;
             if (err) {
                 qdev_unplug(n->primary_dev, &err);
                 qdev_set_id(n->primary_dev, "");
@@ -3118,6 +3117,7 @@ static bool failover_replug_primary(VirtIONet *n, Error **errp)
     Error *err = NULL;
     HotplugHandler *hotplug_ctrl;
     PCIDevice *pdev = PCI_DEVICE(n->primary_dev);
+    BusState *primary_bus;
 
     if (!pdev->partially_hotplugged) {
         return true;
@@ -3130,12 +3130,12 @@ static bool failover_replug_primary(VirtIONet *n, Error **errp)
             return false;
         }
     }
-    n->primary_bus = n->primary_dev->parent_bus;
-    if (!n->primary_bus) {
+    primary_bus = n->primary_dev->parent_bus;
+    if (!primary_bus) {
         error_setg(errp, "virtio_net: couldn't find primary bus");
         return false;
     }
-    qdev_set_parent_bus(n->primary_dev, n->primary_bus, &error_abort);
+    qdev_set_parent_bus(n->primary_dev, primary_bus, &error_abort);
     qatomic_set(&n->primary_should_be_hidden, false);
     if (!qemu_opt_set_bool(n->primary_device_opts,
                            "partially_hotplugged", true, errp)) {
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v2 05/27] failover: Remove unused parameter
  2020-11-18  8:37 [PATCH v2 00/27] Virtio net failover fixes Juan Quintela
                   ` (3 preceding siblings ...)
  2020-11-18  8:37 ` [PATCH v2 04/27] failover: primary bus is only used once, and where it is set Juan Quintela
@ 2020-11-18  8:37 ` Juan Quintela
  2020-11-18  8:37 ` [PATCH v2 06/27] failover: Remove external partially_hotplugged property Juan Quintela
                   ` (23 subsequent siblings)
  28 siblings, 0 replies; 58+ messages in thread
From: Juan Quintela @ 2020-11-18  8:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrangé, Eduardo Habkost, Michael S. Tsirkin,
	Jason Wang, Juan Quintela, Dr. David Alan Gilbert, Paolo Bonzini

Signed-off-by: Juan Quintela <quintela@redhat.com>
---
 hw/net/virtio-net.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 786d313330..3f658d6246 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -855,9 +855,7 @@ static DeviceState *virtio_net_find_primary(VirtIONet *n, Error **errp)
     return dev;
 }
 
-static DeviceState *virtio_connect_failover_devices(VirtIONet *n,
-                                                    DeviceState *dev,
-                                                    Error **errp)
+static DeviceState *virtio_connect_failover_devices(VirtIONet *n, Error **errp)
 {
     DeviceState *prim_dev = NULL;
     Error *err = NULL;
@@ -928,7 +926,7 @@ static void virtio_net_set_features(VirtIODevice *vdev, uint64_t features)
         qatomic_set(&n->primary_should_be_hidden, false);
         failover_add_primary(n, &err);
         if (err) {
-            n->primary_dev = virtio_connect_failover_devices(n, n->qdev, &err);
+            n->primary_dev = virtio_connect_failover_devices(n, &err);
             if (err) {
                 goto out_err;
             }
@@ -3164,7 +3162,7 @@ static void virtio_net_handle_migration_primary(VirtIONet *n,
     should_be_hidden = qatomic_read(&n->primary_should_be_hidden);
 
     if (!n->primary_dev) {
-        n->primary_dev = virtio_connect_failover_devices(n, n->qdev, &err);
+        n->primary_dev = virtio_connect_failover_devices(n, &err);
         if (!n->primary_dev) {
             return;
         }
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v2 06/27] failover: Remove external partially_hotplugged property
  2020-11-18  8:37 [PATCH v2 00/27] Virtio net failover fixes Juan Quintela
                   ` (4 preceding siblings ...)
  2020-11-18  8:37 ` [PATCH v2 05/27] failover: Remove unused parameter Juan Quintela
@ 2020-11-18  8:37 ` Juan Quintela
  2020-11-18  8:37 ` [PATCH v2 07/27] failover: qdev_device_add() returns err or dev set Juan Quintela
                   ` (22 subsequent siblings)
  28 siblings, 0 replies; 58+ messages in thread
From: Juan Quintela @ 2020-11-18  8:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrangé, Eduardo Habkost, Michael S. Tsirkin,
	Jason Wang, Juan Quintela, Dr. David Alan Gilbert, Paolo Bonzini

It was only set "once", and with the wrong value. As far as I can see,
libvirt still don't use it.

Signed-off-by: Juan Quintela <quintela@redhat.com>
---
 hw/net/virtio-net.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 3f658d6246..6ca85627d8 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -3135,10 +3135,6 @@ static bool failover_replug_primary(VirtIONet *n, Error **errp)
     }
     qdev_set_parent_bus(n->primary_dev, primary_bus, &error_abort);
     qatomic_set(&n->primary_should_be_hidden, false);
-    if (!qemu_opt_set_bool(n->primary_device_opts,
-                           "partially_hotplugged", true, errp)) {
-        return false;
-    }
     hotplug_ctrl = qdev_get_hotplug_handler(n->primary_dev);
     if (hotplug_ctrl) {
         hotplug_handler_pre_plug(hotplug_ctrl, n->primary_dev, &err);
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v2 07/27] failover: qdev_device_add() returns err or dev set
  2020-11-18  8:37 [PATCH v2 00/27] Virtio net failover fixes Juan Quintela
                   ` (5 preceding siblings ...)
  2020-11-18  8:37 ` [PATCH v2 06/27] failover: Remove external partially_hotplugged property Juan Quintela
@ 2020-11-18  8:37 ` Juan Quintela
  2020-11-18  8:37 ` [PATCH v2 08/27] failover: Rename bool to failover_primary_hidden Juan Quintela
                   ` (21 subsequent siblings)
  28 siblings, 0 replies; 58+ messages in thread
From: Juan Quintela @ 2020-11-18  8:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrangé, Eduardo Habkost, Michael S. Tsirkin,
	Jason Wang, Juan Quintela, Dr. David Alan Gilbert, Paolo Bonzini

Never both.

Signed-off-by: Juan Quintela <quintela@redhat.com>
---
 hw/net/virtio-net.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 6ca85627d8..3e82108d42 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -803,13 +803,6 @@ static void failover_add_primary(VirtIONet *n, Error **errp)
         if (err) {
             qemu_opts_del(n->primary_device_opts);
         }
-        if (n->primary_dev) {
-            if (err) {
-                qdev_unplug(n->primary_dev, &err);
-                qdev_set_id(n->primary_dev, "");
-
-            }
-        }
     } else {
         error_setg(errp, "Primary device not found");
         error_append_hint(errp, "Virtio-net failover will not work. Make "
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v2 08/27] failover: Rename bool to failover_primary_hidden
  2020-11-18  8:37 [PATCH v2 00/27] Virtio net failover fixes Juan Quintela
                   ` (6 preceding siblings ...)
  2020-11-18  8:37 ` [PATCH v2 07/27] failover: qdev_device_add() returns err or dev set Juan Quintela
@ 2020-11-18  8:37 ` Juan Quintela
  2020-11-18  8:37 ` [PATCH v2 09/27] failover: g_strcmp0() knows how to handle NULL Juan Quintela
                   ` (20 subsequent siblings)
  28 siblings, 0 replies; 58+ messages in thread
From: Juan Quintela @ 2020-11-18  8:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrangé, Eduardo Habkost, Michael S. Tsirkin,
	Jason Wang, Juan Quintela, Dr. David Alan Gilbert, Paolo Bonzini

You should not use passive naming variables.
And once there, be able to search for them.

Signed-off-by: Juan Quintela <quintela@redhat.com>
---
 include/hw/virtio/virtio-net.h |  3 ++-
 hw/net/virtio-net.c            | 14 +++++++-------
 2 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
index c8da637d40..ca68be759f 100644
--- a/include/hw/virtio/virtio-net.h
+++ b/include/hw/virtio/virtio-net.h
@@ -207,7 +207,8 @@ struct VirtIONet {
     DeviceState *primary_dev;
     char *primary_device_id;
     char *standby_id;
-    bool primary_should_be_hidden;
+    /* primary failover device is hidden*/
+    bool failover_primary_hidden;
     bool failover;
     DeviceListener primary_listener;
     Notifier migration_state;
diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 3e82108d42..c221671852 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -916,7 +916,7 @@ static void virtio_net_set_features(VirtIODevice *vdev, uint64_t features)
 
     if (virtio_has_feature(features, VIRTIO_NET_F_STANDBY)) {
         qapi_event_send_failover_negotiated(n->netclient_name);
-        qatomic_set(&n->primary_should_be_hidden, false);
+        qatomic_set(&n->failover_primary_hidden, false);
         failover_add_primary(n, &err);
         if (err) {
             n->primary_dev = virtio_connect_failover_devices(n, &err);
@@ -3127,7 +3127,7 @@ static bool failover_replug_primary(VirtIONet *n, Error **errp)
         return false;
     }
     qdev_set_parent_bus(n->primary_dev, primary_bus, &error_abort);
-    qatomic_set(&n->primary_should_be_hidden, false);
+    qatomic_set(&n->failover_primary_hidden, false);
     hotplug_ctrl = qdev_get_hotplug_handler(n->primary_dev);
     if (hotplug_ctrl) {
         hotplug_handler_pre_plug(hotplug_ctrl, n->primary_dev, &err);
@@ -3148,7 +3148,7 @@ static void virtio_net_handle_migration_primary(VirtIONet *n,
     bool should_be_hidden;
     Error *err = NULL;
 
-    should_be_hidden = qatomic_read(&n->primary_should_be_hidden);
+    should_be_hidden = qatomic_read(&n->failover_primary_hidden);
 
     if (!n->primary_dev) {
         n->primary_dev = virtio_connect_failover_devices(n, &err);
@@ -3163,7 +3163,7 @@ static void virtio_net_handle_migration_primary(VirtIONet *n,
                                qdev_get_vmsd(n->primary_dev),
                                n->primary_dev);
             qapi_event_send_unplug_primary(n->primary_device_id);
-            qatomic_set(&n->primary_should_be_hidden, true);
+            qatomic_set(&n->failover_primary_hidden, true);
         } else {
             warn_report("couldn't unplug primary device");
         }
@@ -3213,8 +3213,8 @@ static int virtio_net_primary_should_be_hidden(DeviceListener *listener,
 
     n->primary_device_opts = device_opts;
 
-    /* primary_should_be_hidden is set during feature negotiation */
-    hide = qatomic_read(&n->primary_should_be_hidden);
+    /* failover_primary_hidden is set during feature negotiation */
+    hide = qatomic_read(&n->failover_primary_hidden);
 
     if (n->primary_device_dict) {
         g_free(n->primary_device_id);
@@ -3271,7 +3271,7 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
     if (n->failover) {
         n->primary_listener.should_be_hidden =
             virtio_net_primary_should_be_hidden;
-        qatomic_set(&n->primary_should_be_hidden, true);
+        qatomic_set(&n->failover_primary_hidden, true);
         device_listener_register(&n->primary_listener);
         n->migration_state.notify = virtio_net_migration_state_notifier;
         add_migration_state_change_notifier(&n->migration_state);
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v2 09/27] failover: g_strcmp0() knows how to handle NULL
  2020-11-18  8:37 [PATCH v2 00/27] Virtio net failover fixes Juan Quintela
                   ` (7 preceding siblings ...)
  2020-11-18  8:37 ` [PATCH v2 08/27] failover: Rename bool to failover_primary_hidden Juan Quintela
@ 2020-11-18  8:37 ` Juan Quintela
  2020-11-18  8:37 ` [PATCH v2 10/27] failover: Remove primary_device_opts Juan Quintela
                   ` (19 subsequent siblings)
  28 siblings, 0 replies; 58+ messages in thread
From: Juan Quintela @ 2020-11-18  8:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrangé, Eduardo Habkost, Michael S. Tsirkin,
	Jason Wang, Juan Quintela, Dr. David Alan Gilbert, Paolo Bonzini

Signed-off-by: Juan Quintela <quintela@redhat.com>
---
 hw/net/virtio-net.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index c221671852..e334f05352 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -818,7 +818,7 @@ static int is_my_primary(void *opaque, QemuOpts *opts, Error **errp)
     int ret = 0;
     const char *standby_id = qemu_opt_get(opts, "failover_pair_id");
 
-    if (standby_id != NULL && (g_strcmp0(standby_id, n->netclient_name) == 0)) {
+    if (g_strcmp0(standby_id, n->netclient_name) == 0) {
         n->primary_device_id = g_strdup(opts->id);
         ret = 1;
     }
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v2 10/27] failover: Remove primary_device_opts
  2020-11-18  8:37 [PATCH v2 00/27] Virtio net failover fixes Juan Quintela
                   ` (8 preceding siblings ...)
  2020-11-18  8:37 ` [PATCH v2 09/27] failover: g_strcmp0() knows how to handle NULL Juan Quintela
@ 2020-11-18  8:37 ` Juan Quintela
  2020-11-18  8:37 ` [PATCH v2 11/27] failover: remove standby_id variable Juan Quintela
                   ` (18 subsequent siblings)
  28 siblings, 0 replies; 58+ messages in thread
From: Juan Quintela @ 2020-11-18  8:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrangé, Eduardo Habkost, Michael S. Tsirkin,
	Jason Wang, Juan Quintela, Dr. David Alan Gilbert, Paolo Bonzini

It was really only used once, in failover_add_primary().  Just search
for it on global opts when it is needed.

Signed-off-by: Juan Quintela <quintela@redhat.com>
---
 include/hw/virtio/virtio-net.h |  1 -
 hw/net/virtio-net.c            | 21 +++++----------------
 2 files changed, 5 insertions(+), 17 deletions(-)

diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
index ca68be759f..7159e6c0a0 100644
--- a/include/hw/virtio/virtio-net.h
+++ b/include/hw/virtio/virtio-net.h
@@ -202,7 +202,6 @@ struct VirtIONet {
     AnnounceTimer announce_timer;
     bool needs_vnet_hdr_swap;
     bool mtu_bypass_backend;
-    QemuOpts *primary_device_opts;
     QDict *primary_device_dict;
     DeviceState *primary_dev;
     char *primary_device_id;
diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index e334f05352..2a99b0e0f6 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -791,17 +791,17 @@ static inline uint64_t virtio_net_supported_guest_offloads(VirtIONet *n)
 static void failover_add_primary(VirtIONet *n, Error **errp)
 {
     Error *err = NULL;
+    QemuOpts *opts;
 
     if (n->primary_dev) {
         return;
     }
 
-    n->primary_device_opts = qemu_opts_find(qemu_find_opts("device"),
-                                            n->primary_device_id);
-    if (n->primary_device_opts) {
-        n->primary_dev = qdev_device_add(n->primary_device_opts, &err);
+    opts = qemu_opts_find(qemu_find_opts("device"), n->primary_device_id);
+    if (opts) {
+        n->primary_dev = qdev_device_add(opts, &err);
         if (err) {
-            qemu_opts_del(n->primary_device_opts);
+            qemu_opts_del(opts);
         }
     } else {
         error_setg(errp, "Primary device not found");
@@ -856,7 +856,6 @@ static DeviceState *virtio_connect_failover_devices(VirtIONet *n, Error **errp)
     prim_dev = virtio_net_find_primary(n, &err);
     if (prim_dev) {
         n->primary_device_id = g_strdup(prim_dev->id);
-        n->primary_device_opts = prim_dev->opts;
     } else {
         error_propagate(errp, err);
     }
@@ -3113,14 +3112,6 @@ static bool failover_replug_primary(VirtIONet *n, Error **errp)
     if (!pdev->partially_hotplugged) {
         return true;
     }
-    if (!n->primary_device_opts) {
-        n->primary_device_opts = qemu_opts_from_qdict(qemu_find_opts("device"),
-                                                      n->primary_device_dict,
-                                                      errp);
-        if (!n->primary_device_opts) {
-            return false;
-        }
-    }
     primary_bus = n->primary_dev->parent_bus;
     if (!primary_bus) {
         error_setg(errp, "virtio_net: couldn't find primary bus");
@@ -3211,8 +3202,6 @@ static int virtio_net_primary_should_be_hidden(DeviceListener *listener,
         goto out;
     }
 
-    n->primary_device_opts = device_opts;
-
     /* failover_primary_hidden is set during feature negotiation */
     hide = qatomic_read(&n->failover_primary_hidden);
 
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v2 11/27] failover: remove standby_id variable
  2020-11-18  8:37 [PATCH v2 00/27] Virtio net failover fixes Juan Quintela
                   ` (9 preceding siblings ...)
  2020-11-18  8:37 ` [PATCH v2 10/27] failover: Remove primary_device_opts Juan Quintela
@ 2020-11-18  8:37 ` Juan Quintela
  2020-11-18  8:37 ` [PATCH v2 12/27] failover: Remove primary_device_dict Juan Quintela
                   ` (17 subsequent siblings)
  28 siblings, 0 replies; 58+ messages in thread
From: Juan Quintela @ 2020-11-18  8:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrangé, Eduardo Habkost, Michael S. Tsirkin,
	Jason Wang, Juan Quintela, Dr. David Alan Gilbert, Paolo Bonzini

We can calculate it, and we only use it once anyways.

Signed-off-by: Juan Quintela <quintela@redhat.com>
---
 include/hw/virtio/virtio-net.h |  1 -
 hw/net/virtio-net.c            | 11 +++--------
 2 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
index 7159e6c0a0..a055f39dd6 100644
--- a/include/hw/virtio/virtio-net.h
+++ b/include/hw/virtio/virtio-net.h
@@ -205,7 +205,6 @@ struct VirtIONet {
     QDict *primary_device_dict;
     DeviceState *primary_dev;
     char *primary_device_id;
-    char *standby_id;
     /* primary failover device is hidden*/
     bool failover_primary_hidden;
     bool failover;
diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 2a99b0e0f6..953d5c2bc8 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -3181,23 +3181,19 @@ static int virtio_net_primary_should_be_hidden(DeviceListener *listener,
     VirtIONet *n = container_of(listener, VirtIONet, primary_listener);
     bool match_found = false;
     bool hide = false;
+    const char *standby_id;
 
     if (!device_opts) {
         return -1;
     }
     n->primary_device_dict = qemu_opts_to_qdict(device_opts,
                                                 n->primary_device_dict);
-    if (n->primary_device_dict) {
-        g_free(n->standby_id);
-        n->standby_id = g_strdup(qdict_get_try_str(n->primary_device_dict,
-                                                   "failover_pair_id"));
-    }
-    if (g_strcmp0(n->standby_id, n->netclient_name) == 0) {
+    standby_id = qemu_opt_get(device_opts, "failover_pair_id");
+    if (g_strcmp0(standby_id, n->netclient_name) == 0) {
         match_found = true;
     } else {
         match_found = false;
         hide = false;
-        g_free(n->standby_id);
         n->primary_device_dict = NULL;
         goto out;
     }
@@ -3400,7 +3396,6 @@ static void virtio_net_device_unrealize(DeviceState *dev)
     if (n->failover) {
         device_listener_unregister(&n->primary_listener);
         g_free(n->primary_device_id);
-        g_free(n->standby_id);
         qobject_unref(n->primary_device_dict);
         n->primary_device_dict = NULL;
     }
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v2 12/27] failover: Remove primary_device_dict
  2020-11-18  8:37 [PATCH v2 00/27] Virtio net failover fixes Juan Quintela
                   ` (10 preceding siblings ...)
  2020-11-18  8:37 ` [PATCH v2 11/27] failover: remove standby_id variable Juan Quintela
@ 2020-11-18  8:37 ` Juan Quintela
  2020-11-18  8:37 ` [PATCH v2 13/27] failover: Remove memory leak Juan Quintela
                   ` (16 subsequent siblings)
  28 siblings, 0 replies; 58+ messages in thread
From: Juan Quintela @ 2020-11-18  8:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrangé, Eduardo Habkost, Michael S. Tsirkin,
	Jason Wang, Juan Quintela, Dr. David Alan Gilbert, Paolo Bonzini

It was only used once.  And we have there opts->id, so no need for it.

Signed-off-by: Juan Quintela <quintela@redhat.com>
---
 include/hw/virtio/virtio-net.h |  1 -
 hw/net/virtio-net.c            | 17 ++++-------------
 2 files changed, 4 insertions(+), 14 deletions(-)

diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
index a055f39dd6..fe353d8299 100644
--- a/include/hw/virtio/virtio-net.h
+++ b/include/hw/virtio/virtio-net.h
@@ -202,7 +202,6 @@ struct VirtIONet {
     AnnounceTimer announce_timer;
     bool needs_vnet_hdr_swap;
     bool mtu_bypass_backend;
-    QDict *primary_device_dict;
     DeviceState *primary_dev;
     char *primary_device_id;
     /* primary failover device is hidden*/
diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 953d5c2bc8..6e5a56a230 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -3186,28 +3186,21 @@ static int virtio_net_primary_should_be_hidden(DeviceListener *listener,
     if (!device_opts) {
         return -1;
     }
-    n->primary_device_dict = qemu_opts_to_qdict(device_opts,
-                                                n->primary_device_dict);
     standby_id = qemu_opt_get(device_opts, "failover_pair_id");
     if (g_strcmp0(standby_id, n->netclient_name) == 0) {
         match_found = true;
     } else {
         match_found = false;
         hide = false;
-        n->primary_device_dict = NULL;
         goto out;
     }
 
     /* failover_primary_hidden is set during feature negotiation */
     hide = qatomic_read(&n->failover_primary_hidden);
-
-    if (n->primary_device_dict) {
-        g_free(n->primary_device_id);
-        n->primary_device_id = g_strdup(qdict_get_try_str(
-                                            n->primary_device_dict, "id"));
-        if (!n->primary_device_id) {
-            warn_report("primary_device_id not set");
-        }
+    g_free(n->primary_device_id);
+    n->primary_device_id = g_strdup(device_opts->id);
+    if (!n->primary_device_id) {
+        warn_report("primary_device_id not set");
     }
 
 out:
@@ -3396,8 +3389,6 @@ static void virtio_net_device_unrealize(DeviceState *dev)
     if (n->failover) {
         device_listener_unregister(&n->primary_listener);
         g_free(n->primary_device_id);
-        qobject_unref(n->primary_device_dict);
-        n->primary_device_dict = NULL;
     }
 
     max_queues = n->multiqueue ? n->max_queues : 1;
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v2 13/27] failover: Remove memory leak
  2020-11-18  8:37 [PATCH v2 00/27] Virtio net failover fixes Juan Quintela
                   ` (11 preceding siblings ...)
  2020-11-18  8:37 ` [PATCH v2 12/27] failover: Remove primary_device_dict Juan Quintela
@ 2020-11-18  8:37 ` Juan Quintela
  2020-11-18  8:37 ` [PATCH v2 14/27] failover: simplify virtio_net_find_primary() Juan Quintela
                   ` (15 subsequent siblings)
  28 siblings, 0 replies; 58+ messages in thread
From: Juan Quintela @ 2020-11-18  8:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrangé, Eduardo Habkost, Michael S. Tsirkin,
	Jason Wang, Juan Quintela, Dr. David Alan Gilbert, Paolo Bonzini

Two things, at this point:

* n->primary_device_id has to be set, otherwise
  virtio_net_find_primary don't work.  So we have a leak here.

* it has to be exactly the same that prim_dev->id because what
  qdev_find_recursive() does is just compare this two values.

So remove the unneeded assignment and leaky bits.

Signed-off-by: Juan Quintela <quintela@redhat.com>
---
 hw/net/virtio-net.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 6e5a56a230..70fa372c08 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -854,9 +854,7 @@ static DeviceState *virtio_connect_failover_devices(VirtIONet *n, Error **errp)
     Error *err = NULL;
 
     prim_dev = virtio_net_find_primary(n, &err);
-    if (prim_dev) {
-        n->primary_device_id = g_strdup(prim_dev->id);
-    } else {
+    if (!prim_dev) {
         error_propagate(errp, err);
     }
 
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v2 14/27] failover: simplify virtio_net_find_primary()
  2020-11-18  8:37 [PATCH v2 00/27] Virtio net failover fixes Juan Quintela
                   ` (12 preceding siblings ...)
  2020-11-18  8:37 ` [PATCH v2 13/27] failover: Remove memory leak Juan Quintela
@ 2020-11-18  8:37 ` Juan Quintela
  2020-11-18  8:37 ` [PATCH v2 15/27] failover: should_be_hidden() should take a bool Juan Quintela
                   ` (14 subsequent siblings)
  28 siblings, 0 replies; 58+ messages in thread
From: Juan Quintela @ 2020-11-18  8:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrangé, Eduardo Habkost, Michael S. Tsirkin,
	Jason Wang, Juan Quintela, Dr. David Alan Gilbert, Paolo Bonzini

a - is_my_primary() never sets one error
b - If we return 1, primary_device_id is always set

Signed-off-by: Juan Quintela <quintela@redhat.com>
---
 hw/net/virtio-net.c | 18 +++---------------
 1 file changed, 3 insertions(+), 15 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 70fa372c08..881907d1bd 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -828,24 +828,12 @@ static int is_my_primary(void *opaque, QemuOpts *opts, Error **errp)
 
 static DeviceState *virtio_net_find_primary(VirtIONet *n, Error **errp)
 {
-    DeviceState *dev = NULL;
     Error *err = NULL;
 
-    if (qemu_opts_foreach(qemu_find_opts("device"),
-                          is_my_primary, n, &err)) {
-        if (err) {
-            error_propagate(errp, err);
-            return NULL;
-        }
-        if (n->primary_device_id) {
-            dev = qdev_find_recursive(sysbus_get_default(),
-                                      n->primary_device_id);
-        } else {
-            error_setg(errp, "Primary device id not found");
-            return NULL;
-        }
+    if (!qemu_opts_foreach(qemu_find_opts("device"), is_my_primary, n, &err)) {
+        return NULL;
     }
-    return dev;
+    return qdev_find_recursive(sysbus_get_default(), n->primary_device_id);
 }
 
 static DeviceState *virtio_connect_failover_devices(VirtIONet *n, Error **errp)
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v2 15/27] failover: should_be_hidden() should take a bool
  2020-11-18  8:37 [PATCH v2 00/27] Virtio net failover fixes Juan Quintela
                   ` (13 preceding siblings ...)
  2020-11-18  8:37 ` [PATCH v2 14/27] failover: simplify virtio_net_find_primary() Juan Quintela
@ 2020-11-18  8:37 ` Juan Quintela
  2020-11-18  8:37 ` [PATCH v2 16/27] failover: Rename function to hide_device() Juan Quintela
                   ` (13 subsequent siblings)
  28 siblings, 0 replies; 58+ messages in thread
From: Juan Quintela @ 2020-11-18  8:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrangé, Eduardo Habkost, Michael S. Tsirkin,
	Jason Wang, Juan Quintela, Dr. David Alan Gilbert, Paolo Bonzini

We didn't use at all the -1 value, and we don't really care.  It was
only used for the cases when this is not the device that we are
searching for.  And in that case we should not hide the device.

Once there, simplify virtio-Snet_primary_should_be_hidden.

Signed-off-by: Juan Quintela <quintela@redhat.com>
---
 include/hw/qdev-core.h |  2 +-
 hw/core/qdev.c         | 19 +++++--------------
 hw/net/virtio-net.c    | 27 +++++++--------------------
 3 files changed, 13 insertions(+), 35 deletions(-)

diff --git a/include/hw/qdev-core.h b/include/hw/qdev-core.h
index 5e737195b5..250f4edef6 100644
--- a/include/hw/qdev-core.h
+++ b/include/hw/qdev-core.h
@@ -200,7 +200,7 @@ struct DeviceListener {
      * inform qdev that a device should be hidden, depending on the device
      * opts, for example, to hide a standby device.
      */
-    int (*should_be_hidden)(DeviceListener *listener, QemuOpts *device_opts);
+    bool (*should_be_hidden)(DeviceListener *listener, QemuOpts *device_opts);
     QTAILQ_ENTRY(DeviceListener) link;
 };
 
diff --git a/hw/core/qdev.c b/hw/core/qdev.c
index 262bca716f..8f4b8f3cc1 100644
--- a/hw/core/qdev.c
+++ b/hw/core/qdev.c
@@ -214,26 +214,17 @@ void device_listener_unregister(DeviceListener *listener)
 
 bool qdev_should_hide_device(QemuOpts *opts)
 {
-    int rc = -1;
     DeviceListener *listener;
 
     QTAILQ_FOREACH(listener, &device_listeners, link) {
-       if (listener->should_be_hidden) {
-            /*
-             * should_be_hidden_will return
-             *  1 if device matches opts and it should be hidden
-             *  0 if device matches opts and should not be hidden
-             *  -1 if device doesn't match ops
-             */
-            rc = listener->should_be_hidden(listener, opts);
-        }
-
-        if (rc > 0) {
-            break;
+        if (listener->should_be_hidden) {
+            if (listener->should_be_hidden(listener, opts)) {
+                return true;
+            }
         }
     }
 
-    return rc > 0;
+    return false;
 }
 
 void qdev_set_legacy_instance_id(DeviceState *dev, int alias_id,
diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 881907d1bd..9f12d33da0 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -3161,24 +3161,19 @@ static void virtio_net_migration_state_notifier(Notifier *notifier, void *data)
     virtio_net_handle_migration_primary(n, s);
 }
 
-static int virtio_net_primary_should_be_hidden(DeviceListener *listener,
-                                               QemuOpts *device_opts)
+static bool virtio_net_primary_should_be_hidden(DeviceListener *listener,
+                                                QemuOpts *device_opts)
 {
     VirtIONet *n = container_of(listener, VirtIONet, primary_listener);
-    bool match_found = false;
-    bool hide = false;
+    bool hide;
     const char *standby_id;
 
     if (!device_opts) {
-        return -1;
+        return false;
     }
     standby_id = qemu_opt_get(device_opts, "failover_pair_id");
-    if (g_strcmp0(standby_id, n->netclient_name) == 0) {
-        match_found = true;
-    } else {
-        match_found = false;
-        hide = false;
-        goto out;
+    if (g_strcmp0(standby_id, n->netclient_name) != 0) {
+        return false;
     }
 
     /* failover_primary_hidden is set during feature negotiation */
@@ -3188,15 +3183,7 @@ static int virtio_net_primary_should_be_hidden(DeviceListener *listener,
     if (!n->primary_device_id) {
         warn_report("primary_device_id not set");
     }
-
-out:
-    if (match_found && hide) {
-        return 1;
-    } else if (match_found && !hide) {
-        return 0;
-    } else {
-        return -1;
-    }
+    return hide;
 }
 
 static void virtio_net_device_realize(DeviceState *dev, Error **errp)
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v2 16/27] failover: Rename function to hide_device()
  2020-11-18  8:37 [PATCH v2 00/27] Virtio net failover fixes Juan Quintela
                   ` (14 preceding siblings ...)
  2020-11-18  8:37 ` [PATCH v2 15/27] failover: should_be_hidden() should take a bool Juan Quintela
@ 2020-11-18  8:37 ` Juan Quintela
  2020-11-18  8:37 ` [PATCH v2 17/27] failover: virtio_net_connect_failover_devices() does nothing Juan Quintela
                   ` (12 subsequent siblings)
  28 siblings, 0 replies; 58+ messages in thread
From: Juan Quintela @ 2020-11-18  8:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrangé, Eduardo Habkost, Michael S. Tsirkin,
	Jason Wang, Juan Quintela, Dr. David Alan Gilbert, Paolo Bonzini

You should not use pasive.

Signed-off-by: Juan Quintela <quintela@redhat.com>
---
 include/hw/qdev-core.h | 28 +++++++++++++++-------------
 hw/core/qdev.c         |  4 ++--
 hw/net/virtio-net.c    |  7 +++----
 3 files changed, 20 insertions(+), 19 deletions(-)

diff --git a/include/hw/qdev-core.h b/include/hw/qdev-core.h
index 250f4edef6..6ac86db44e 100644
--- a/include/hw/qdev-core.h
+++ b/include/hw/qdev-core.h
@@ -81,16 +81,17 @@ typedef void (*BusUnrealize)(BusState *bus);
  * </note>
  *
  * # Hiding a device #
- * To hide a device, a DeviceListener function should_be_hidden() needs to
+ * To hide a device, a DeviceListener function hide_device() needs to
  * be registered.
- * It can be used to defer adding a device and therefore hide it from the
- * guest. The handler registering to this DeviceListener can save the QOpts
- * passed to it for re-using it later and must return that it wants the device
- * to be/remain hidden or not. When the handler function decides the device
- * shall not be hidden it will be added in qdev_device_add() and
- * realized as any other device. Otherwise qdev_device_add() will return early
- * without adding the device. The guest will not see a "hidden" device
- * until it was marked don't hide and qdev_device_add called again.
+ * It can be used to defer adding a device and therefore hide it from
+ * the guest. The handler registering to this DeviceListener can save
+ * the QOpts passed to it for re-using it later. It must return if it
+ * wants the device to be hidden or visible. When the handler function
+ * decides the device shall be visible it will be added with
+ * qdev_device_add() and realized as any other device. Otherwise
+ * qdev_device_add() will return early without adding the device. The
+ * guest will not see a "hidden" device until it was marked visible
+ * and qdev_device_add called again.
  *
  */
 struct DeviceClass {
@@ -196,11 +197,12 @@ struct DeviceListener {
     void (*realize)(DeviceListener *listener, DeviceState *dev);
     void (*unrealize)(DeviceListener *listener, DeviceState *dev);
     /*
-     * This callback is called upon init of the DeviceState and allows to
-     * inform qdev that a device should be hidden, depending on the device
-     * opts, for example, to hide a standby device.
+     * This callback is called upon init of the DeviceState and
+     * informs qdev if a device should be visible or hidden.  We can
+     * hide a failover device depending for example on the device
+     * opts.
      */
-    bool (*should_be_hidden)(DeviceListener *listener, QemuOpts *device_opts);
+    bool (*hide_device)(DeviceListener *listener, QemuOpts *device_opts);
     QTAILQ_ENTRY(DeviceListener) link;
 };
 
diff --git a/hw/core/qdev.c b/hw/core/qdev.c
index 8f4b8f3cc1..cbdff0b6c6 100644
--- a/hw/core/qdev.c
+++ b/hw/core/qdev.c
@@ -217,8 +217,8 @@ bool qdev_should_hide_device(QemuOpts *opts)
     DeviceListener *listener;
 
     QTAILQ_FOREACH(listener, &device_listeners, link) {
-        if (listener->should_be_hidden) {
-            if (listener->should_be_hidden(listener, opts)) {
+        if (listener->hide_device) {
+            if (listener->hide_device(listener, opts)) {
                 return true;
             }
         }
diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 9f12d33da0..747614ff2a 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -3161,8 +3161,8 @@ static void virtio_net_migration_state_notifier(Notifier *notifier, void *data)
     virtio_net_handle_migration_primary(n, s);
 }
 
-static bool virtio_net_primary_should_be_hidden(DeviceListener *listener,
-                                                QemuOpts *device_opts)
+static bool failover_hide_primary_device(DeviceListener *listener,
+                                         QemuOpts *device_opts)
 {
     VirtIONet *n = container_of(listener, VirtIONet, primary_listener);
     bool hide;
@@ -3220,8 +3220,7 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
     }
 
     if (n->failover) {
-        n->primary_listener.should_be_hidden =
-            virtio_net_primary_should_be_hidden;
+        n->primary_listener.hide_device = failover_hide_primary_device;
         qatomic_set(&n->failover_primary_hidden, true);
         device_listener_register(&n->primary_listener);
         n->migration_state.notify = virtio_net_migration_state_notifier;
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v2 17/27] failover: virtio_net_connect_failover_devices() does nothing
  2020-11-18  8:37 [PATCH v2 00/27] Virtio net failover fixes Juan Quintela
                   ` (15 preceding siblings ...)
  2020-11-18  8:37 ` [PATCH v2 16/27] failover: Rename function to hide_device() Juan Quintela
@ 2020-11-18  8:37 ` Juan Quintela
  2020-11-18  8:37 ` [PATCH v2 18/27] failover: Rename to failover_find_primary_device() Juan Quintela
                   ` (11 subsequent siblings)
  28 siblings, 0 replies; 58+ messages in thread
From: Juan Quintela @ 2020-11-18  8:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrangé, Eduardo Habkost, Michael S. Tsirkin,
	Jason Wang, Juan Quintela, Dr. David Alan Gilbert, Paolo Bonzini

It just calls virtio_net_find_primary(), so just update the callers.

Signed-off-by: Juan Quintela <quintela@redhat.com>
---
 hw/net/virtio-net.c | 17 ++---------------
 1 file changed, 2 insertions(+), 15 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 747614ff2a..c6200b924e 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -836,19 +836,6 @@ static DeviceState *virtio_net_find_primary(VirtIONet *n, Error **errp)
     return qdev_find_recursive(sysbus_get_default(), n->primary_device_id);
 }
 
-static DeviceState *virtio_connect_failover_devices(VirtIONet *n, Error **errp)
-{
-    DeviceState *prim_dev = NULL;
-    Error *err = NULL;
-
-    prim_dev = virtio_net_find_primary(n, &err);
-    if (!prim_dev) {
-        error_propagate(errp, err);
-    }
-
-    return prim_dev;
-}
-
 static void virtio_net_set_features(VirtIODevice *vdev, uint64_t features)
 {
     VirtIONet *n = VIRTIO_NET(vdev);
@@ -904,7 +891,7 @@ static void virtio_net_set_features(VirtIODevice *vdev, uint64_t features)
         qatomic_set(&n->failover_primary_hidden, false);
         failover_add_primary(n, &err);
         if (err) {
-            n->primary_dev = virtio_connect_failover_devices(n, &err);
+            n->primary_dev = virtio_net_find_primary(n, &err);
             if (err) {
                 goto out_err;
             }
@@ -3128,7 +3115,7 @@ static void virtio_net_handle_migration_primary(VirtIONet *n,
     should_be_hidden = qatomic_read(&n->failover_primary_hidden);
 
     if (!n->primary_dev) {
-        n->primary_dev = virtio_connect_failover_devices(n, &err);
+        n->primary_dev = virtio_net_find_primary(n, &err);
         if (!n->primary_dev) {
             return;
         }
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v2 18/27] failover: Rename to failover_find_primary_device()
  2020-11-18  8:37 [PATCH v2 00/27] Virtio net failover fixes Juan Quintela
                   ` (16 preceding siblings ...)
  2020-11-18  8:37 ` [PATCH v2 17/27] failover: virtio_net_connect_failover_devices() does nothing Juan Quintela
@ 2020-11-18  8:37 ` Juan Quintela
  2020-11-18  8:37 ` [PATCH v2 19/27] failover: simplify qdev_device_add() failover case Juan Quintela
                   ` (10 subsequent siblings)
  28 siblings, 0 replies; 58+ messages in thread
From: Juan Quintela @ 2020-11-18  8:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrangé, Eduardo Habkost, Michael S. Tsirkin,
	Jason Wang, Juan Quintela, Dr. David Alan Gilbert, Paolo Bonzini

This commit:
* Rename them to failover_find_primary_devices() so
  - it starts with failover_
  - it don't connect anything, just find the primary device
* Create documentation for the function

Signed-off-by: Juan Quintela <quintela@redhat.com>
---
 hw/net/virtio-net.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index c6200b924e..ff82f1017d 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -826,7 +826,13 @@ static int is_my_primary(void *opaque, QemuOpts *opts, Error **errp)
     return ret;
 }
 
-static DeviceState *virtio_net_find_primary(VirtIONet *n, Error **errp)
+/**
+ * Find the primary device for this failover virtio-net
+ *
+ * @n: VirtIONet device
+ * @errp: returns an error if this function fails
+ */
+static DeviceState *failover_find_primary_device(VirtIONet *n, Error **errp)
 {
     Error *err = NULL;
 
@@ -891,7 +897,7 @@ static void virtio_net_set_features(VirtIODevice *vdev, uint64_t features)
         qatomic_set(&n->failover_primary_hidden, false);
         failover_add_primary(n, &err);
         if (err) {
-            n->primary_dev = virtio_net_find_primary(n, &err);
+            n->primary_dev = failover_find_primary_device(n, &err);
             if (err) {
                 goto out_err;
             }
@@ -3115,7 +3121,7 @@ static void virtio_net_handle_migration_primary(VirtIONet *n,
     should_be_hidden = qatomic_read(&n->failover_primary_hidden);
 
     if (!n->primary_dev) {
-        n->primary_dev = virtio_net_find_primary(n, &err);
+        n->primary_dev = failover_find_primary_device(n, &err);
         if (!n->primary_dev) {
             return;
         }
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v2 19/27] failover: simplify qdev_device_add() failover case
  2020-11-18  8:37 [PATCH v2 00/27] Virtio net failover fixes Juan Quintela
                   ` (17 preceding siblings ...)
  2020-11-18  8:37 ` [PATCH v2 18/27] failover: Rename to failover_find_primary_device() Juan Quintela
@ 2020-11-18  8:37 ` Juan Quintela
  2020-11-18  8:37 ` [PATCH v2 20/27] failover: simplify qdev_device_add() Juan Quintela
                   ` (9 subsequent siblings)
  28 siblings, 0 replies; 58+ messages in thread
From: Juan Quintela @ 2020-11-18  8:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrangé, Eduardo Habkost, Michael S. Tsirkin,
	Jason Wang, Juan Quintela, Dr. David Alan Gilbert, Paolo Bonzini

Just put allthe logic inside the same if.

Signed-off-by: Juan Quintela <quintela@redhat.com>
---
 softmmu/qdev-monitor.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/softmmu/qdev-monitor.c b/softmmu/qdev-monitor.c
index a25f5d612c..12b7540f17 100644
--- a/softmmu/qdev-monitor.c
+++ b/softmmu/qdev-monitor.c
@@ -600,7 +600,6 @@ DeviceState *qdev_device_add(QemuOpts *opts, Error **errp)
     const char *driver, *path;
     DeviceState *dev = NULL;
     BusState *bus = NULL;
-    bool hide;
 
     driver = qemu_opt_get(opts, "driver");
     if (!driver) {
@@ -634,17 +633,19 @@ DeviceState *qdev_device_add(QemuOpts *opts, Error **errp)
             return NULL;
         }
     }
-    hide = should_hide_device(opts);
 
-    if ((hide || qdev_hotplug) && bus && !qbus_is_hotpluggable(bus)) {
+    if (should_hide_device(opts)) {
+        if (bus && !qbus_is_hotpluggable(bus)) {
+            error_setg(errp, QERR_BUS_NO_HOTPLUG, bus->name);
+        }
+        return NULL;
+    }
+
+    if (qdev_hotplug && bus && !qbus_is_hotpluggable(bus)) {
         error_setg(errp, QERR_BUS_NO_HOTPLUG, bus->name);
         return NULL;
     }
 
-    if (hide) {
-        return NULL;
-    }
-
     if (!migration_is_idle()) {
         error_setg(errp, "device_add not allowed while migrating");
         return NULL;
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v2 20/27] failover: simplify qdev_device_add()
  2020-11-18  8:37 [PATCH v2 00/27] Virtio net failover fixes Juan Quintela
                   ` (18 preceding siblings ...)
  2020-11-18  8:37 ` [PATCH v2 19/27] failover: simplify qdev_device_add() failover case Juan Quintela
@ 2020-11-18  8:37 ` Juan Quintela
  2020-11-18  8:37 ` [PATCH v2 21/27] failover: make sure that id always exist Juan Quintela
                   ` (8 subsequent siblings)
  28 siblings, 0 replies; 58+ messages in thread
From: Juan Quintela @ 2020-11-18  8:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrangé, Eduardo Habkost, Michael S. Tsirkin,
	Jason Wang, Juan Quintela, Dr. David Alan Gilbert, Paolo Bonzini

We don't need to walk the opts by hand.  qmp_opt_get() already does
that.  And then we can remove the functions that did that walk.

Signed-off-by: Juan Quintela <quintela@redhat.com>
---
 softmmu/qdev-monitor.c | 32 ++++++--------------------------
 1 file changed, 6 insertions(+), 26 deletions(-)

diff --git a/softmmu/qdev-monitor.c b/softmmu/qdev-monitor.c
index 12b7540f17..0e10f0466f 100644
--- a/softmmu/qdev-monitor.c
+++ b/softmmu/qdev-monitor.c
@@ -572,28 +572,6 @@ void qdev_set_id(DeviceState *dev, const char *id)
     }
 }
 
-static int is_failover_device(void *opaque, const char *name, const char *value,
-                              Error **errp)
-{
-    if (strcmp(name, "failover_pair_id") == 0) {
-        QemuOpts *opts = opaque;
-
-        if (qdev_should_hide_device(opts)) {
-            return 1;
-        }
-    }
-
-    return 0;
-}
-
-static bool should_hide_device(QemuOpts *opts)
-{
-    if (qemu_opt_foreach(opts, is_failover_device, opts, NULL) == 0) {
-        return false;
-    }
-    return true;
-}
-
 DeviceState *qdev_device_add(QemuOpts *opts, Error **errp)
 {
     DeviceClass *dc;
@@ -634,11 +612,13 @@ DeviceState *qdev_device_add(QemuOpts *opts, Error **errp)
         }
     }
 
-    if (should_hide_device(opts)) {
-        if (bus && !qbus_is_hotpluggable(bus)) {
-            error_setg(errp, QERR_BUS_NO_HOTPLUG, bus->name);
+    if (qemu_opt_get(opts, "failover_pair_id")) {
+        if (qdev_should_hide_device(opts)) {
+            if (bus && !qbus_is_hotpluggable(bus)) {
+                error_setg(errp, QERR_BUS_NO_HOTPLUG, bus->name);
+            }
+            return NULL;
         }
-        return NULL;
     }
 
     if (qdev_hotplug && bus && !qbus_is_hotpluggable(bus)) {
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v2 21/27] failover: make sure that id always exist
  2020-11-18  8:37 [PATCH v2 00/27] Virtio net failover fixes Juan Quintela
                   ` (19 preceding siblings ...)
  2020-11-18  8:37 ` [PATCH v2 20/27] failover: simplify qdev_device_add() Juan Quintela
@ 2020-11-18  8:37 ` Juan Quintela
  2020-11-18  8:37 ` [PATCH v2 22/27] failover: remove failover_find_primary_device() error parameter Juan Quintela
                   ` (7 subsequent siblings)
  28 siblings, 0 replies; 58+ messages in thread
From: Juan Quintela @ 2020-11-18  8:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrangé, Eduardo Habkost, Michael S. Tsirkin,
	Jason Wang, Juan Quintela, Dr. David Alan Gilbert, Paolo Bonzini

We check that it exist at device creation time, so we don't have to
check anywhere else.

Signed-off-by: Juan Quintela <quintela@redhat.com>
---
 hw/net/virtio-net.c    | 3 ---
 softmmu/qdev-monitor.c | 4 ++++
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index ff82f1017d..c708c03cf6 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -3173,9 +3173,6 @@ static bool failover_hide_primary_device(DeviceListener *listener,
     hide = qatomic_read(&n->failover_primary_hidden);
     g_free(n->primary_device_id);
     n->primary_device_id = g_strdup(device_opts->id);
-    if (!n->primary_device_id) {
-        warn_report("primary_device_id not set");
-    }
     return hide;
 }
 
diff --git a/softmmu/qdev-monitor.c b/softmmu/qdev-monitor.c
index 0e10f0466f..301089eaea 100644
--- a/softmmu/qdev-monitor.c
+++ b/softmmu/qdev-monitor.c
@@ -613,6 +613,10 @@ DeviceState *qdev_device_add(QemuOpts *opts, Error **errp)
     }
 
     if (qemu_opt_get(opts, "failover_pair_id")) {
+        if (!opts->id) {
+            error_setg(errp, "Device with failover_pair_id don't have id");
+            return NULL;
+        }
         if (qdev_should_hide_device(opts)) {
             if (bus && !qbus_is_hotpluggable(bus)) {
                 error_setg(errp, QERR_BUS_NO_HOTPLUG, bus->name);
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v2 22/27] failover: remove failover_find_primary_device() error parameter
  2020-11-18  8:37 [PATCH v2 00/27] Virtio net failover fixes Juan Quintela
                   ` (20 preceding siblings ...)
  2020-11-18  8:37 ` [PATCH v2 21/27] failover: make sure that id always exist Juan Quintela
@ 2020-11-18  8:37 ` Juan Quintela
  2020-11-18  8:37 ` [PATCH v2 23/27] failover: split failover_find_primary_device_id() Juan Quintela
                   ` (6 subsequent siblings)
  28 siblings, 0 replies; 58+ messages in thread
From: Juan Quintela @ 2020-11-18  8:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrangé, Eduardo Habkost, Michael S. Tsirkin,
	Jason Wang, Juan Quintela, Dr. David Alan Gilbert, Paolo Bonzini

It can never give one error.

Signed-off-by: Juan Quintela <quintela@redhat.com>
---
 hw/net/virtio-net.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index c708c03cf6..b994796734 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -832,7 +832,7 @@ static int is_my_primary(void *opaque, QemuOpts *opts, Error **errp)
  * @n: VirtIONet device
  * @errp: returns an error if this function fails
  */
-static DeviceState *failover_find_primary_device(VirtIONet *n, Error **errp)
+static DeviceState *failover_find_primary_device(VirtIONet *n)
 {
     Error *err = NULL;
 
@@ -897,10 +897,7 @@ static void virtio_net_set_features(VirtIODevice *vdev, uint64_t features)
         qatomic_set(&n->failover_primary_hidden, false);
         failover_add_primary(n, &err);
         if (err) {
-            n->primary_dev = failover_find_primary_device(n, &err);
-            if (err) {
-                goto out_err;
-            }
+            n->primary_dev = failover_find_primary_device(n);
             failover_add_primary(n, &err);
             if (err) {
                 goto out_err;
@@ -3121,7 +3118,7 @@ static void virtio_net_handle_migration_primary(VirtIONet *n,
     should_be_hidden = qatomic_read(&n->failover_primary_hidden);
 
     if (!n->primary_dev) {
-        n->primary_dev = failover_find_primary_device(n, &err);
+        n->primary_dev = failover_find_primary_device(n);
         if (!n->primary_dev) {
             return;
         }
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v2 23/27] failover: split failover_find_primary_device_id()
  2020-11-18  8:37 [PATCH v2 00/27] Virtio net failover fixes Juan Quintela
                   ` (21 preceding siblings ...)
  2020-11-18  8:37 ` [PATCH v2 22/27] failover: remove failover_find_primary_device() error parameter Juan Quintela
@ 2020-11-18  8:37 ` Juan Quintela
  2020-11-18  8:37 ` [PATCH v2 24/27] failover: We don't need to cache primary_device_id anymore Juan Quintela
                   ` (5 subsequent siblings)
  28 siblings, 0 replies; 58+ messages in thread
From: Juan Quintela @ 2020-11-18  8:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrangé, Eduardo Habkost, Michael S. Tsirkin,
	Jason Wang, Juan Quintela, Dr. David Alan Gilbert, Paolo Bonzini

So we can calculate the device id when we need it.

Signed-off-by: Juan Quintela <quintela@redhat.com>
---
 hw/net/virtio-net.c | 63 +++++++++++++++++++++++++++++++++------------
 1 file changed, 47 insertions(+), 16 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index b994796734..2c502c13fd 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -788,6 +788,49 @@ static inline uint64_t virtio_net_supported_guest_offloads(VirtIONet *n)
     return virtio_net_guest_offloads_by_features(vdev->guest_features);
 }
 
+typedef struct {
+    VirtIONet *n;
+    char *id;
+} FailoverId;
+
+/**
+ * Set the id of the failover primary device
+ *
+ * @opaque: FailoverId to setup
+ * @opts: opts for device we are handling
+ * @errp: returns an error if this function fails
+ */
+static int failover_set_primary(void *opaque, QemuOpts *opts, Error **errp)
+{
+    FailoverId *fid = opaque;
+    const char *standby_id = qemu_opt_get(opts, "failover_pair_id");
+
+    if (g_strcmp0(standby_id, fid->n->netclient_name) == 0) {
+        fid->id = g_strdup(opts->id);
+        return 1;
+    }
+
+    return 0;
+}
+
+/**
+ * Find the primary device id for this failover virtio-net
+ *
+ * @n: VirtIONet device
+ * @errp: returns an error if this function fails
+ */
+static char *failover_find_primary_device_id(VirtIONet *n)
+{
+    Error *err = NULL;
+    FailoverId fid;
+
+    if (!qemu_opts_foreach(qemu_find_opts("device"),
+                           failover_set_primary, &fid, &err)) {
+        return NULL;
+    }
+    return fid.id;
+}
+
 static void failover_add_primary(VirtIONet *n, Error **errp)
 {
     Error *err = NULL;
@@ -812,20 +855,6 @@ static void failover_add_primary(VirtIONet *n, Error **errp)
     error_propagate(errp, err);
 }
 
-static int is_my_primary(void *opaque, QemuOpts *opts, Error **errp)
-{
-    VirtIONet *n = opaque;
-    int ret = 0;
-    const char *standby_id = qemu_opt_get(opts, "failover_pair_id");
-
-    if (g_strcmp0(standby_id, n->netclient_name) == 0) {
-        n->primary_device_id = g_strdup(opts->id);
-        ret = 1;
-    }
-
-    return ret;
-}
-
 /**
  * Find the primary device for this failover virtio-net
  *
@@ -834,11 +863,13 @@ static int is_my_primary(void *opaque, QemuOpts *opts, Error **errp)
  */
 static DeviceState *failover_find_primary_device(VirtIONet *n)
 {
-    Error *err = NULL;
+    char *id = failover_find_primary_device_id(n);
 
-    if (!qemu_opts_foreach(qemu_find_opts("device"), is_my_primary, n, &err)) {
+    if (!id) {
         return NULL;
     }
+    n->primary_device_id = g_strdup(id);
+
     return qdev_find_recursive(sysbus_get_default(), n->primary_device_id);
 }
 
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v2 24/27] failover: We don't need to cache primary_device_id anymore
  2020-11-18  8:37 [PATCH v2 00/27] Virtio net failover fixes Juan Quintela
                   ` (22 preceding siblings ...)
  2020-11-18  8:37 ` [PATCH v2 23/27] failover: split failover_find_primary_device_id() Juan Quintela
@ 2020-11-18  8:37 ` Juan Quintela
  2020-11-18  8:37 ` [PATCH v2 25/27] failover: Caller of this two functions already have primary_dev Juan Quintela
                   ` (4 subsequent siblings)
  28 siblings, 0 replies; 58+ messages in thread
From: Juan Quintela @ 2020-11-18  8:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrangé, Eduardo Habkost, Michael S. Tsirkin,
	Jason Wang, Juan Quintela, Dr. David Alan Gilbert, Paolo Bonzini

Signed-off-by: Juan Quintela <quintela@redhat.com>
---
 include/hw/virtio/virtio-net.h |  1 -
 hw/net/virtio-net.c            | 20 ++++++++++----------
 2 files changed, 10 insertions(+), 11 deletions(-)

diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
index fe353d8299..efef64e02f 100644
--- a/include/hw/virtio/virtio-net.h
+++ b/include/hw/virtio/virtio-net.h
@@ -203,7 +203,6 @@ struct VirtIONet {
     bool needs_vnet_hdr_swap;
     bool mtu_bypass_backend;
     DeviceState *primary_dev;
-    char *primary_device_id;
     /* primary failover device is hidden*/
     bool failover_primary_hidden;
     bool failover;
diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 2c502c13fd..746ed3fb71 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -824,6 +824,7 @@ static char *failover_find_primary_device_id(VirtIONet *n)
     Error *err = NULL;
     FailoverId fid;
 
+    fid.n = n;
     if (!qemu_opts_foreach(qemu_find_opts("device"),
                            failover_set_primary, &fid, &err)) {
         return NULL;
@@ -835,12 +836,17 @@ static void failover_add_primary(VirtIONet *n, Error **errp)
 {
     Error *err = NULL;
     QemuOpts *opts;
+    char *id;
 
     if (n->primary_dev) {
         return;
     }
 
-    opts = qemu_opts_find(qemu_find_opts("device"), n->primary_device_id);
+    id = failover_find_primary_device_id(n);
+    if (!id) {
+        return;
+    }
+    opts = qemu_opts_find(qemu_find_opts("device"), id);
     if (opts) {
         n->primary_dev = qdev_device_add(opts, &err);
         if (err) {
@@ -868,9 +874,8 @@ static DeviceState *failover_find_primary_device(VirtIONet *n)
     if (!id) {
         return NULL;
     }
-    n->primary_device_id = g_strdup(id);
 
-    return qdev_find_recursive(sysbus_get_default(), n->primary_device_id);
+    return qdev_find_recursive(sysbus_get_default(), id);
 }
 
 static void virtio_net_set_features(VirtIODevice *vdev, uint64_t features)
@@ -3160,7 +3165,7 @@ static void virtio_net_handle_migration_primary(VirtIONet *n,
             vmstate_unregister(VMSTATE_IF(n->primary_dev),
                                qdev_get_vmsd(n->primary_dev),
                                n->primary_dev);
-            qapi_event_send_unplug_primary(n->primary_device_id);
+            qapi_event_send_unplug_primary(n->primary_dev->id);
             qatomic_set(&n->failover_primary_hidden, true);
         } else {
             warn_report("couldn't unplug primary device");
@@ -3186,7 +3191,6 @@ static bool failover_hide_primary_device(DeviceListener *listener,
                                          QemuOpts *device_opts)
 {
     VirtIONet *n = container_of(listener, VirtIONet, primary_listener);
-    bool hide;
     const char *standby_id;
 
     if (!device_opts) {
@@ -3198,10 +3202,7 @@ static bool failover_hide_primary_device(DeviceListener *listener,
     }
 
     /* failover_primary_hidden is set during feature negotiation */
-    hide = qatomic_read(&n->failover_primary_hidden);
-    g_free(n->primary_device_id);
-    n->primary_device_id = g_strdup(device_opts->id);
-    return hide;
+    return qatomic_read(&n->failover_primary_hidden);
 }
 
 static void virtio_net_device_realize(DeviceState *dev, Error **errp)
@@ -3378,7 +3379,6 @@ static void virtio_net_device_unrealize(DeviceState *dev)
 
     if (n->failover) {
         device_listener_unregister(&n->primary_listener);
-        g_free(n->primary_device_id);
     }
 
     max_queues = n->multiqueue ? n->max_queues : 1;
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v2 25/27] failover: Caller of this two functions already have primary_dev
  2020-11-18  8:37 [PATCH v2 00/27] Virtio net failover fixes Juan Quintela
                   ` (23 preceding siblings ...)
  2020-11-18  8:37 ` [PATCH v2 24/27] failover: We don't need to cache primary_device_id anymore Juan Quintela
@ 2020-11-18  8:37 ` Juan Quintela
  2020-11-18  8:37 ` [PATCH v2 26/27] failover: simplify failover_unplug_primary Juan Quintela
                   ` (3 subsequent siblings)
  28 siblings, 0 replies; 58+ messages in thread
From: Juan Quintela @ 2020-11-18  8:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrangé, Eduardo Habkost, Michael S. Tsirkin,
	Jason Wang, Juan Quintela, Dr. David Alan Gilbert, Paolo Bonzini

Pass it as an argument.

Signed-off-by: Juan Quintela <quintela@redhat.com>
---
 hw/net/virtio-net.c | 27 ++++++++++++++-------------
 1 file changed, 14 insertions(+), 13 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 746ed3fb71..b37e9cd1d9 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -3093,17 +3093,17 @@ void virtio_net_set_netclient_name(VirtIONet *n, const char *name,
     n->netclient_type = g_strdup(type);
 }
 
-static bool failover_unplug_primary(VirtIONet *n)
+static bool failover_unplug_primary(VirtIONet *n, DeviceState *dev)
 {
     HotplugHandler *hotplug_ctrl;
     PCIDevice *pci_dev;
     Error *err = NULL;
 
-    hotplug_ctrl = qdev_get_hotplug_handler(n->primary_dev);
+    hotplug_ctrl = qdev_get_hotplug_handler(dev);
     if (hotplug_ctrl) {
-        pci_dev = PCI_DEVICE(n->primary_dev);
+        pci_dev = PCI_DEVICE(dev);
         pci_dev->partially_hotplugged = true;
-        hotplug_handler_unplug_request(hotplug_ctrl, n->primary_dev, &err);
+        hotplug_handler_unplug_request(hotplug_ctrl, dev, &err);
         if (err) {
             error_report_err(err);
             return false;
@@ -3114,30 +3114,31 @@ static bool failover_unplug_primary(VirtIONet *n)
     return true;
 }
 
-static bool failover_replug_primary(VirtIONet *n, Error **errp)
+static bool failover_replug_primary(VirtIONet *n, DeviceState *dev,
+                                    Error **errp)
 {
     Error *err = NULL;
     HotplugHandler *hotplug_ctrl;
-    PCIDevice *pdev = PCI_DEVICE(n->primary_dev);
+    PCIDevice *pdev = PCI_DEVICE(dev);
     BusState *primary_bus;
 
     if (!pdev->partially_hotplugged) {
         return true;
     }
-    primary_bus = n->primary_dev->parent_bus;
+    primary_bus = dev->parent_bus;
     if (!primary_bus) {
         error_setg(errp, "virtio_net: couldn't find primary bus");
         return false;
     }
-    qdev_set_parent_bus(n->primary_dev, primary_bus, &error_abort);
+    qdev_set_parent_bus(dev, primary_bus, &error_abort);
     qatomic_set(&n->failover_primary_hidden, false);
-    hotplug_ctrl = qdev_get_hotplug_handler(n->primary_dev);
+    hotplug_ctrl = qdev_get_hotplug_handler(dev);
     if (hotplug_ctrl) {
-        hotplug_handler_pre_plug(hotplug_ctrl, n->primary_dev, &err);
+        hotplug_handler_pre_plug(hotplug_ctrl, dev, &err);
         if (err) {
             goto out;
         }
-        hotplug_handler_plug(hotplug_ctrl, n->primary_dev, &err);
+        hotplug_handler_plug(hotplug_ctrl, dev, &err);
     }
 
 out:
@@ -3161,7 +3162,7 @@ static void virtio_net_handle_migration_primary(VirtIONet *n,
     }
 
     if (migration_in_setup(s) && !should_be_hidden) {
-        if (failover_unplug_primary(n)) {
+        if (failover_unplug_primary(n, n->primary_dev)) {
             vmstate_unregister(VMSTATE_IF(n->primary_dev),
                                qdev_get_vmsd(n->primary_dev),
                                n->primary_dev);
@@ -3172,7 +3173,7 @@ static void virtio_net_handle_migration_primary(VirtIONet *n,
         }
     } else if (migration_has_failed(s)) {
         /* We already unplugged the device let's plug it back */
-        if (!failover_replug_primary(n, &err)) {
+        if (!failover_replug_primary(n, n->primary_dev, &err)) {
             if (err) {
                 error_report_err(err);
             }
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v2 26/27] failover: simplify failover_unplug_primary
  2020-11-18  8:37 [PATCH v2 00/27] Virtio net failover fixes Juan Quintela
                   ` (24 preceding siblings ...)
  2020-11-18  8:37 ` [PATCH v2 25/27] failover: Caller of this two functions already have primary_dev Juan Quintela
@ 2020-11-18  8:37 ` Juan Quintela
  2020-11-18  8:37 ` [PATCH v2 27/27] failover: Remove primary_dev member Juan Quintela
                   ` (2 subsequent siblings)
  28 siblings, 0 replies; 58+ messages in thread
From: Juan Quintela @ 2020-11-18  8:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrangé, Eduardo Habkost, Michael S. Tsirkin,
	Jason Wang, Juan Quintela, Dr. David Alan Gilbert, Paolo Bonzini

We can calculate device just once.

Signed-off-by: Juan Quintela <quintela@redhat.com>
---
 hw/net/virtio-net.c | 25 ++++++++++---------------
 1 file changed, 10 insertions(+), 15 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index b37e9cd1d9..9203d81780 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -3146,34 +3146,29 @@ out:
     return !err;
 }
 
-static void virtio_net_handle_migration_primary(VirtIONet *n,
-                                                MigrationState *s)
+static void virtio_net_handle_migration_primary(VirtIONet *n, MigrationState *s)
 {
     bool should_be_hidden;
     Error *err = NULL;
+    DeviceState *dev = failover_find_primary_device(n);
+
+    if (!dev) {
+        return;
+    }
 
     should_be_hidden = qatomic_read(&n->failover_primary_hidden);
 
-    if (!n->primary_dev) {
-        n->primary_dev = failover_find_primary_device(n);
-        if (!n->primary_dev) {
-            return;
-        }
-    }
-
     if (migration_in_setup(s) && !should_be_hidden) {
-        if (failover_unplug_primary(n, n->primary_dev)) {
-            vmstate_unregister(VMSTATE_IF(n->primary_dev),
-                               qdev_get_vmsd(n->primary_dev),
-                               n->primary_dev);
-            qapi_event_send_unplug_primary(n->primary_dev->id);
+        if (failover_unplug_primary(n, dev)) {
+            vmstate_unregister(VMSTATE_IF(dev), qdev_get_vmsd(dev), dev);
+            qapi_event_send_unplug_primary(dev->id);
             qatomic_set(&n->failover_primary_hidden, true);
         } else {
             warn_report("couldn't unplug primary device");
         }
     } else if (migration_has_failed(s)) {
         /* We already unplugged the device let's plug it back */
-        if (!failover_replug_primary(n, n->primary_dev, &err)) {
+        if (!failover_replug_primary(n, dev, &err)) {
             if (err) {
                 error_report_err(err);
             }
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v2 27/27] failover: Remove primary_dev member
  2020-11-18  8:37 [PATCH v2 00/27] Virtio net failover fixes Juan Quintela
                   ` (25 preceding siblings ...)
  2020-11-18  8:37 ` [PATCH v2 26/27] failover: simplify failover_unplug_primary Juan Quintela
@ 2020-11-18  8:37 ` Juan Quintela
  2020-11-18  8:53 ` [PATCH v2 00/27] Virtio net failover fixes Michael S. Tsirkin
  2020-12-02  9:55 ` Michael S. Tsirkin
  28 siblings, 0 replies; 58+ messages in thread
From: Juan Quintela @ 2020-11-18  8:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrangé, Eduardo Habkost, Michael S. Tsirkin,
	Jason Wang, Juan Quintela, Dr. David Alan Gilbert, Paolo Bonzini

Only three uses remained, and we can remove them on that case.

Signed-off-by: Juan Quintela <quintela@redhat.com>
---
 include/hw/virtio/virtio-net.h |  1 -
 hw/net/virtio-net.c            | 55 +++++++++++++++-------------------
 2 files changed, 24 insertions(+), 32 deletions(-)

diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
index efef64e02f..7e96d193aa 100644
--- a/include/hw/virtio/virtio-net.h
+++ b/include/hw/virtio/virtio-net.h
@@ -202,7 +202,6 @@ struct VirtIONet {
     AnnounceTimer announce_timer;
     bool needs_vnet_hdr_swap;
     bool mtu_bypass_backend;
-    DeviceState *primary_dev;
     /* primary failover device is hidden*/
     bool failover_primary_hidden;
     bool failover;
diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 9203d81780..044ac95f6f 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -832,13 +832,31 @@ static char *failover_find_primary_device_id(VirtIONet *n)
     return fid.id;
 }
 
+/**
+ * Find the primary device for this failover virtio-net
+ *
+ * @n: VirtIONet device
+ * @errp: returns an error if this function fails
+ */
+static DeviceState *failover_find_primary_device(VirtIONet *n)
+{
+    char *id = failover_find_primary_device_id(n);
+
+    if (!id) {
+        return NULL;
+    }
+
+    return qdev_find_recursive(sysbus_get_default(), id);
+}
+
 static void failover_add_primary(VirtIONet *n, Error **errp)
 {
     Error *err = NULL;
     QemuOpts *opts;
     char *id;
+    DeviceState *dev = failover_find_primary_device(n);
 
-    if (n->primary_dev) {
+    if (dev) {
         return;
     }
 
@@ -848,7 +866,7 @@ static void failover_add_primary(VirtIONet *n, Error **errp)
     }
     opts = qemu_opts_find(qemu_find_opts("device"), id);
     if (opts) {
-        n->primary_dev = qdev_device_add(opts, &err);
+        dev = qdev_device_add(opts, &err);
         if (err) {
             qemu_opts_del(opts);
         }
@@ -861,23 +879,6 @@ static void failover_add_primary(VirtIONet *n, Error **errp)
     error_propagate(errp, err);
 }
 
-/**
- * Find the primary device for this failover virtio-net
- *
- * @n: VirtIONet device
- * @errp: returns an error if this function fails
- */
-static DeviceState *failover_find_primary_device(VirtIONet *n)
-{
-    char *id = failover_find_primary_device_id(n);
-
-    if (!id) {
-        return NULL;
-    }
-
-    return qdev_find_recursive(sysbus_get_default(), id);
-}
-
 static void virtio_net_set_features(VirtIODevice *vdev, uint64_t features)
 {
     VirtIONet *n = VIRTIO_NET(vdev);
@@ -933,19 +934,9 @@ static void virtio_net_set_features(VirtIODevice *vdev, uint64_t features)
         qatomic_set(&n->failover_primary_hidden, false);
         failover_add_primary(n, &err);
         if (err) {
-            n->primary_dev = failover_find_primary_device(n);
-            failover_add_primary(n, &err);
-            if (err) {
-                goto out_err;
-            }
+            warn_report_err(err);
         }
     }
-    return;
-
-out_err:
-    if (err) {
-        warn_report_err(err);
-    }
 }
 
 static int virtio_net_handle_rx_mode(VirtIONet *n, uint8_t cmd,
@@ -3420,13 +3411,15 @@ static int virtio_net_pre_save(void *opaque)
 static bool primary_unplug_pending(void *opaque)
 {
     DeviceState *dev = opaque;
+    DeviceState *primary;
     VirtIODevice *vdev = VIRTIO_DEVICE(dev);
     VirtIONet *n = VIRTIO_NET(vdev);
 
     if (!virtio_vdev_has_feature(vdev, VIRTIO_NET_F_STANDBY)) {
         return false;
     }
-    return n->primary_dev ? n->primary_dev->pending_deleted_event : false;
+    primary = failover_find_primary_device(n);
+    return primary ? primary->pending_deleted_event : false;
 }
 
 static bool dev_unplug_pending(void *opaque)
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 00/27] Virtio net failover fixes
  2020-11-18  8:37 [PATCH v2 00/27] Virtio net failover fixes Juan Quintela
                   ` (26 preceding siblings ...)
  2020-11-18  8:37 ` [PATCH v2 27/27] failover: Remove primary_dev member Juan Quintela
@ 2020-11-18  8:53 ` Michael S. Tsirkin
  2020-12-02 10:16   ` Juan Quintela
  2020-12-02  9:55 ` Michael S. Tsirkin
  28 siblings, 1 reply; 58+ messages in thread
From: Michael S. Tsirkin @ 2020-11-18  8:53 UTC (permalink / raw)
  To: Juan Quintela
  Cc: Daniel P. Berrangé, Eduardo Habkost, Jason Wang, qemu-devel,
	Dr. David Alan Gilbert, Paolo Bonzini

On Wed, Nov 18, 2020 at 09:37:21AM +0100, Juan Quintela wrote:
> Hi
> 
> This is a big rework of the network failover setup.  General idea is:
> * We don't cache the name of the primary/standby devices
>   We have several problems there with stale pointers
> * After this:
> - We can always remove either the primary/standby devices without trouble
> - Pluggin/unplugging works
> - We go to device opts to see what the failover device are.
>   Notice that we are plugging/unplugging the device, so it is not critical.
> - Once there, I "fixed" managedsave for libvirt (now gives an error instead o=
> f just hanging)
> * Fields not cached anymore:
> - primary_device_dict
> - primary_device_opts
> - standby_id
> - primary_device_id
> - primary_dev
> * I renamed the should_be_hidden() callback to hide device, but if
>   people preffer the old name I can leave it.
> * Add (some) doc to some functions
> * Remove almost 100 lines of code while fixing things.
> 
> Please review.

OK that's great, any of this appropriate for 5.2?
The memory leak fix maybe?

> Later, Juan.
> 
> Juan Quintela (27):
>   migration: Network Failover can't work with a paused guest
>   failover: fix indentantion
>   failover: Use always atomics for primary_should_be_hidden
>   failover: primary bus is only used once, and where it is set
>   failover: Remove unused parameter
>   failover: Remove external partially_hotplugged property
>   failover: qdev_device_add() returns err or dev set
>   failover: Rename bool to failover_primary_hidden
>   failover: g_strcmp0() knows how to handle NULL
>   failover: Remove primary_device_opts
>   failover: remove standby_id variable
>   failover: Remove primary_device_dict
>   failover: Remove memory leak
>   failover: simplify virtio_net_find_primary()
>   failover: should_be_hidden() should take a bool
>   failover: Rename function to hide_device()
>   failover: virtio_net_connect_failover_devices() does nothing
>   failover: Rename to failover_find_primary_device()
>   failover: simplify qdev_device_add() failover case
>   failover: simplify qdev_device_add()
>   failover: make sure that id always exist
>   failover: remove failover_find_primary_device() error parameter
>   failover: split failover_find_primary_device_id()
>   failover: We don't need to cache primary_device_id anymore
>   failover: Caller of this two functions already have primary_dev
>   failover: simplify failover_unplug_primary
>   failover: Remove primary_dev member
> 
>  include/hw/qdev-core.h         |  28 ++--
>  include/hw/virtio/virtio-net.h |   9 +-
>  hw/core/qdev.c                 |  19 +--
>  hw/net/virtio-net.c            | 298 +++++++++++++--------------------
>  migration/migration.c          |  13 ++
>  softmmu/qdev-monitor.c         |  43 ++---
>  6 files changed, 167 insertions(+), 243 deletions(-)
> 
> --=20
> 2.26.2
> 



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 00/27] Virtio net failover fixes
  2020-11-18  8:53 ` [PATCH v2 00/27] Virtio net failover fixes Michael S. Tsirkin
@ 2020-12-02 10:16   ` Juan Quintela
  2020-12-02 10:30     ` Michael S. Tsirkin
  0 siblings, 1 reply; 58+ messages in thread
From: Juan Quintela @ 2020-12-02 10:16 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Daniel P. Berrangé, Eduardo Habkost, Jason Wang, qemu-devel,
	Dr. David Alan Gilbert, Paolo Bonzini

"Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Wed, Nov 18, 2020 at 09:37:21AM +0100, Juan Quintela wrote:
>> Hi
>> 
>> This is a big rework of the network failover setup.  General idea is:
>> * We don't cache the name of the primary/standby devices
>>   We have several problems there with stale pointers
>> * After this:
>> - We can always remove either the primary/standby devices without trouble
>> - Pluggin/unplugging works
>> - We go to device opts to see what the failover device are.
>>   Notice that we are plugging/unplugging the device, so it is not critical.
>> - Once there, I "fixed" managedsave for libvirt (now gives an error instead o=
>> f just hanging)
>> * Fields not cached anymore:
>> - primary_device_dict
>> - primary_device_opts
>> - standby_id
>> - primary_device_id
>> - primary_dev
>> * I renamed the should_be_hidden() callback to hide device, but if
>>   people preffer the old name I can leave it.
>> * Add (some) doc to some functions
>> * Remove almost 100 lines of code while fixing things.
>> 
>> Please review.
>
> OK that's great, any of this appropriate for 5.2?
> The memory leak fix maybe?

1st one is also a fix, current code just hangs the guest.
Rest of things .... current code fails a lot, but we are too late on the
cycle.

Later, Juan.


>> Later, Juan.
>> 
>> Juan Quintela (27):
>>   migration: Network Failover can't work with a paused guest
>>   failover: fix indentantion
>>   failover: Use always atomics for primary_should_be_hidden
>>   failover: primary bus is only used once, and where it is set
>>   failover: Remove unused parameter
>>   failover: Remove external partially_hotplugged property
>>   failover: qdev_device_add() returns err or dev set
>>   failover: Rename bool to failover_primary_hidden
>>   failover: g_strcmp0() knows how to handle NULL
>>   failover: Remove primary_device_opts
>>   failover: remove standby_id variable
>>   failover: Remove primary_device_dict
>>   failover: Remove memory leak
>>   failover: simplify virtio_net_find_primary()
>>   failover: should_be_hidden() should take a bool
>>   failover: Rename function to hide_device()
>>   failover: virtio_net_connect_failover_devices() does nothing
>>   failover: Rename to failover_find_primary_device()
>>   failover: simplify qdev_device_add() failover case
>>   failover: simplify qdev_device_add()
>>   failover: make sure that id always exist
>>   failover: remove failover_find_primary_device() error parameter
>>   failover: split failover_find_primary_device_id()
>>   failover: We don't need to cache primary_device_id anymore
>>   failover: Caller of this two functions already have primary_dev
>>   failover: simplify failover_unplug_primary
>>   failover: Remove primary_dev member
>> 
>>  include/hw/qdev-core.h         |  28 ++--
>>  include/hw/virtio/virtio-net.h |   9 +-
>>  hw/core/qdev.c                 |  19 +--
>>  hw/net/virtio-net.c            | 298 +++++++++++++--------------------
>>  migration/migration.c          |  13 ++
>>  softmmu/qdev-monitor.c         |  43 ++---
>>  6 files changed, 167 insertions(+), 243 deletions(-)
>> 
>> --=20
>> 2.26.2
>> 



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 00/27] Virtio net failover fixes
  2020-12-02 10:16   ` Juan Quintela
@ 2020-12-02 10:30     ` Michael S. Tsirkin
  0 siblings, 0 replies; 58+ messages in thread
From: Michael S. Tsirkin @ 2020-12-02 10:30 UTC (permalink / raw)
  To: Juan Quintela
  Cc: Daniel P. Berrangé, Eduardo Habkost, Jason Wang, qemu-devel,
	Dr. David Alan Gilbert, Paolo Bonzini

On Wed, Dec 02, 2020 at 11:16:04AM +0100, Juan Quintela wrote:
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > On Wed, Nov 18, 2020 at 09:37:21AM +0100, Juan Quintela wrote:
> >> Hi
> >> 
> >> This is a big rework of the network failover setup.  General idea is:
> >> * We don't cache the name of the primary/standby devices
> >>   We have several problems there with stale pointers
> >> * After this:
> >> - We can always remove either the primary/standby devices without trouble
> >> - Pluggin/unplugging works
> >> - We go to device opts to see what the failover device are.
> >>   Notice that we are plugging/unplugging the device, so it is not critical.
> >> - Once there, I "fixed" managedsave for libvirt (now gives an error instead o=
> >> f just hanging)
> >> * Fields not cached anymore:
> >> - primary_device_dict
> >> - primary_device_opts
> >> - standby_id
> >> - primary_device_id
> >> - primary_dev
> >> * I renamed the should_be_hidden() callback to hide device, but if
> >>   people preffer the old name I can leave it.
> >> * Add (some) doc to some functions
> >> * Remove almost 100 lines of code while fixing things.
> >> 
> >> Please review.
> >
> > OK that's great, any of this appropriate for 5.2?
> > The memory leak fix maybe?
> 
> 1st one is also a fix, current code just hangs the guest.


Hmm it does but then proceeds when you unpause so I'm not sure
it's a good idea for 5.2. It's late in the cycle to try to
handle management bugs ...

> Rest of things .... current code fails a lot, but we are too late on the
> cycle.
> 
> Later, Juan.
> 
> 
> >> Later, Juan.
> >> 
> >> Juan Quintela (27):
> >>   migration: Network Failover can't work with a paused guest
> >>   failover: fix indentantion
> >>   failover: Use always atomics for primary_should_be_hidden
> >>   failover: primary bus is only used once, and where it is set
> >>   failover: Remove unused parameter
> >>   failover: Remove external partially_hotplugged property
> >>   failover: qdev_device_add() returns err or dev set
> >>   failover: Rename bool to failover_primary_hidden
> >>   failover: g_strcmp0() knows how to handle NULL
> >>   failover: Remove primary_device_opts
> >>   failover: remove standby_id variable
> >>   failover: Remove primary_device_dict
> >>   failover: Remove memory leak
> >>   failover: simplify virtio_net_find_primary()
> >>   failover: should_be_hidden() should take a bool
> >>   failover: Rename function to hide_device()
> >>   failover: virtio_net_connect_failover_devices() does nothing
> >>   failover: Rename to failover_find_primary_device()
> >>   failover: simplify qdev_device_add() failover case
> >>   failover: simplify qdev_device_add()
> >>   failover: make sure that id always exist
> >>   failover: remove failover_find_primary_device() error parameter
> >>   failover: split failover_find_primary_device_id()
> >>   failover: We don't need to cache primary_device_id anymore
> >>   failover: Caller of this two functions already have primary_dev
> >>   failover: simplify failover_unplug_primary
> >>   failover: Remove primary_dev member
> >> 
> >>  include/hw/qdev-core.h         |  28 ++--
> >>  include/hw/virtio/virtio-net.h |   9 +-
> >>  hw/core/qdev.c                 |  19 +--
> >>  hw/net/virtio-net.c            | 298 +++++++++++++--------------------
> >>  migration/migration.c          |  13 ++
> >>  softmmu/qdev-monitor.c         |  43 ++---
> >>  6 files changed, 167 insertions(+), 243 deletions(-)
> >> 
> >> --=20
> >> 2.26.2
> >> 



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 00/27] Virtio net failover fixes
  2020-11-18  8:37 [PATCH v2 00/27] Virtio net failover fixes Juan Quintela
                   ` (27 preceding siblings ...)
  2020-11-18  8:53 ` [PATCH v2 00/27] Virtio net failover fixes Michael S. Tsirkin
@ 2020-12-02  9:55 ` Michael S. Tsirkin
  28 siblings, 0 replies; 58+ messages in thread
From: Michael S. Tsirkin @ 2020-12-02  9:55 UTC (permalink / raw)
  To: Juan Quintela
  Cc: Daniel P. Berrangé, Eduardo Habkost, Jason Wang, qemu-devel,
	Dr. David Alan Gilbert, Paolo Bonzini

On Wed, Nov 18, 2020 at 09:37:21AM +0100, Juan Quintela wrote:
> Hi

I tagged this for after the release. To help make sure this
is not lost pls ping me after the release. Thanks!


> This is a big rework of the network failover setup.  General idea is:
> * We don't cache the name of the primary/standby devices
>   We have several problems there with stale pointers
> * After this:
> - We can always remove either the primary/standby devices without trouble
> - Pluggin/unplugging works
> - We go to device opts to see what the failover device are.
>   Notice that we are plugging/unplugging the device, so it is not critical.
> - Once there, I "fixed" managedsave for libvirt (now gives an error instead o=
> f just hanging)
> * Fields not cached anymore:
> - primary_device_dict
> - primary_device_opts
> - standby_id
> - primary_device_id
> - primary_dev
> * I renamed the should_be_hidden() callback to hide device, but if
>   people preffer the old name I can leave it.
> * Add (some) doc to some functions
> * Remove almost 100 lines of code while fixing things.
> 
> Please review.
> 
> Later, Juan.
> 
> Juan Quintela (27):
>   migration: Network Failover can't work with a paused guest
>   failover: fix indentantion
>   failover: Use always atomics for primary_should_be_hidden
>   failover: primary bus is only used once, and where it is set
>   failover: Remove unused parameter
>   failover: Remove external partially_hotplugged property
>   failover: qdev_device_add() returns err or dev set
>   failover: Rename bool to failover_primary_hidden
>   failover: g_strcmp0() knows how to handle NULL
>   failover: Remove primary_device_opts
>   failover: remove standby_id variable
>   failover: Remove primary_device_dict
>   failover: Remove memory leak
>   failover: simplify virtio_net_find_primary()
>   failover: should_be_hidden() should take a bool
>   failover: Rename function to hide_device()
>   failover: virtio_net_connect_failover_devices() does nothing
>   failover: Rename to failover_find_primary_device()
>   failover: simplify qdev_device_add() failover case
>   failover: simplify qdev_device_add()
>   failover: make sure that id always exist
>   failover: remove failover_find_primary_device() error parameter
>   failover: split failover_find_primary_device_id()
>   failover: We don't need to cache primary_device_id anymore
>   failover: Caller of this two functions already have primary_dev
>   failover: simplify failover_unplug_primary
>   failover: Remove primary_dev member
> 
>  include/hw/qdev-core.h         |  28 ++--
>  include/hw/virtio/virtio-net.h |   9 +-
>  hw/core/qdev.c                 |  19 +--
>  hw/net/virtio-net.c            | 298 +++++++++++++--------------------
>  migration/migration.c          |  13 ++
>  softmmu/qdev-monitor.c         |  43 ++---
>  6 files changed, 167 insertions(+), 243 deletions(-)
> 
> --=20
> 2.26.2
> 



^ permalink raw reply	[flat|nested] 58+ messages in thread

end of thread, other threads:[~2020-12-08 19:10 UTC | newest]

Thread overview: 58+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2020-11-18  8:37 [PATCH v2 00/27] Virtio net failover fixes Juan Quintela
2020-11-18  8:37 ` [PATCH v2 01/27] migration: Network Failover can't work with a paused guest Juan Quintela
2020-11-25 12:05   ` Dr. David Alan Gilbert
2020-12-02 10:13   ` Michael S. Tsirkin
2020-12-02 10:27     ` Daniel P. Berrangé
2020-12-02 10:31       ` Michael S. Tsirkin
2020-12-02 10:33         ` Michael S. Tsirkin
2020-12-02 10:51           ` Juan Quintela
2020-12-02 10:55             ` Daniel P. Berrangé
2020-12-02 11:19               ` Michael S. Tsirkin
2020-12-02 11:26                 ` Daniel P. Berrangé
2020-12-02 11:37                   ` Michael S. Tsirkin
2020-12-02 12:01                     ` Daniel P. Berrangé
2020-12-03 11:21                       ` Michael S. Tsirkin
2020-12-03 11:32                         ` Daniel P. Berrangé
2020-12-03 11:40                           ` Michael S. Tsirkin
2020-12-03 11:43                             ` Dr. David Alan Gilbert
2020-12-03 12:02                               ` Michael S. Tsirkin
2020-12-03 12:04                                 ` Dr. David Alan Gilbert
2020-12-03 12:11                               ` Michael S. Tsirkin
2020-12-03 12:16                                 ` Daniel P. Berrangé
2020-12-08 18:48                                   ` Michael S. Tsirkin
2020-12-03 11:45                             ` Daniel P. Berrangé
2020-12-03 12:01                               ` Michael S. Tsirkin
2020-12-03 12:06                                 ` Daniel P. Berrangé
2020-12-03 12:13                                   ` Michael S. Tsirkin
2020-12-08 18:32                       ` Michael S. Tsirkin
2020-12-02 10:34         ` Daniel P. Berrangé
2020-11-18  8:37 ` [PATCH v2 02/27] failover: fix indentantion Juan Quintela
2020-11-18  8:37 ` [PATCH v2 03/27] failover: Use always atomics for primary_should_be_hidden Juan Quintela
2020-11-18  8:37 ` [PATCH v2 04/27] failover: primary bus is only used once, and where it is set Juan Quintela
2020-11-18  8:37 ` [PATCH v2 05/27] failover: Remove unused parameter Juan Quintela
2020-11-18  8:37 ` [PATCH v2 06/27] failover: Remove external partially_hotplugged property Juan Quintela
2020-11-18  8:37 ` [PATCH v2 07/27] failover: qdev_device_add() returns err or dev set Juan Quintela
2020-11-18  8:37 ` [PATCH v2 08/27] failover: Rename bool to failover_primary_hidden Juan Quintela
2020-11-18  8:37 ` [PATCH v2 09/27] failover: g_strcmp0() knows how to handle NULL Juan Quintela
2020-11-18  8:37 ` [PATCH v2 10/27] failover: Remove primary_device_opts Juan Quintela
2020-11-18  8:37 ` [PATCH v2 11/27] failover: remove standby_id variable Juan Quintela
2020-11-18  8:37 ` [PATCH v2 12/27] failover: Remove primary_device_dict Juan Quintela
2020-11-18  8:37 ` [PATCH v2 13/27] failover: Remove memory leak Juan Quintela
2020-11-18  8:37 ` [PATCH v2 14/27] failover: simplify virtio_net_find_primary() Juan Quintela
2020-11-18  8:37 ` [PATCH v2 15/27] failover: should_be_hidden() should take a bool Juan Quintela
2020-11-18  8:37 ` [PATCH v2 16/27] failover: Rename function to hide_device() Juan Quintela
2020-11-18  8:37 ` [PATCH v2 17/27] failover: virtio_net_connect_failover_devices() does nothing Juan Quintela
2020-11-18  8:37 ` [PATCH v2 18/27] failover: Rename to failover_find_primary_device() Juan Quintela
2020-11-18  8:37 ` [PATCH v2 19/27] failover: simplify qdev_device_add() failover case Juan Quintela
2020-11-18  8:37 ` [PATCH v2 20/27] failover: simplify qdev_device_add() Juan Quintela
2020-11-18  8:37 ` [PATCH v2 21/27] failover: make sure that id always exist Juan Quintela
2020-11-18  8:37 ` [PATCH v2 22/27] failover: remove failover_find_primary_device() error parameter Juan Quintela
2020-11-18  8:37 ` [PATCH v2 23/27] failover: split failover_find_primary_device_id() Juan Quintela
2020-11-18  8:37 ` [PATCH v2 24/27] failover: We don't need to cache primary_device_id anymore Juan Quintela
2020-11-18  8:37 ` [PATCH v2 25/27] failover: Caller of this two functions already have primary_dev Juan Quintela
2020-11-18  8:37 ` [PATCH v2 26/27] failover: simplify failover_unplug_primary Juan Quintela
2020-11-18  8:37 ` [PATCH v2 27/27] failover: Remove primary_dev member Juan Quintela
2020-11-18  8:53 ` [PATCH v2 00/27] Virtio net failover fixes Michael S. Tsirkin
2020-12-02 10:16   ` Juan Quintela
2020-12-02 10:30     ` Michael S. Tsirkin
2020-12-02  9:55 ` Michael S. Tsirkin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).