* Re: [Qemu-devel] [RFC PATCH 0/2] implement the failover feature for assigned network devices
[not found] <20190322134447.14831-1-jfreimann@redhat.com>
@ 2019-04-04 8:29 ` Jens Freimann
2019-04-05 8:56 ` Dr. David Alan Gilbert
2019-04-04 12:53 ` Daniel P. Berrangé
1 sibling, 1 reply; 26+ messages in thread
From: Jens Freimann @ 2019-04-04 8:29 UTC (permalink / raw)
To: qemu-devel
Cc: pkrempa, ehabkost, mst, mdroth, liran.alon, laine, ogerlitz,
ailan, dgilbert
ping
FYI: I'm also working on a few related tools to detect driver behaviour when
assigning a MAC to the vf device. Code is at https://github.com/jensfr/netfailover_driver_detect
regards,
Jens
On Fri, Mar 22, 2019 at 02:44:45PM +0100, Jens Freimann wrote:
>This is another attempt at implementing the host side of the
>net_failover concept
>(https://www.kernel.org/doc/html/latest/networking/net_failover.html)
>
>The general idea is that we have a pair of devices, a vfio-pci and a
>emulated device. Before migration the vfio device is unplugged and data
>flows to the emulated device, on the target side another vfio-pci device
>is plugged in to take over the data-path. In the guest the net_failover
>module will pair net devices with the same MAC address.
>
>* In the first patch the infrastructure for hiding the device is added
> for the qbus and qdev APIs. A "hidden" boolean is added to the device
> state and it is set based on a callback to the standby device which
> registers itself for handling the assessment: "should the primary device
> be hidden?" by cross validating the ids of the devices.
>
>* In the second patch the virtio-net uses the API to hide the vfio
> device and unhides it when the feature is acked.
>
>Previous discussion: https://patchwork.ozlabs.org/cover/989098/
>
>To summarize concerns/feedback from previous discussion:
>1.- guest OS can reject or worse _delay_ unplug by any amount of time.
> Migration might get stuck for unpredictable time with unclear reason.
> This approach combines two tricky things, hot/unplug and migration.
> -> We can surprise-remove the PCI device and in QEMU we can do all
> necessary rollbacks transparent to management software. Will it be
> easy, probably not.
>2. PCI devices are a precious ressource. The primary device should never
> be added to QEMU if it won't be used by guest instead of hiding it in
> QEMU.
> -> We only hotplug the device when the standby feature bit was
> negotiated. We save the device cmdline options until we need it for
> qdev_device_add()
> Hiding a device can be a useful concept to model. For example a
> pci device in a powered-off slot could be marked as hidden until the slot is
> powered on (mst).
>3. Management layer software should handle this. Open Stack already has
> components/code to handle unplug/replug VFIO devices and metadata to
> provide to the guest for detecting which devices should be paired.
> -> An approach that includes all software from firmware to
> higher-level management software wasn't tried in the last years. This is
> an attempt to keep it simple and contained in QEMU as much as possible.
>4. Hotplugging a device and then making it part of a failover setup is
> not possible
> -> addressed by extending qdev hotplug functions to check for hidden
> attribute, so e.g. device_add can be used to plug a device.
>
>There are still some open issues:
>
>Migration: I'm looking for something like a pre-migration hook that I
>could use to unplug the vfio-pci device. I tried with a migration
>notifier but it is called to late, i.e. after migration is aborted due
>to vfio-pci marked unmigrateable. I worked around this by setting it
>to migrateable and used a migration notifier on the virtio-net device.
>
>Commandline: There is a dependency between vfio-pci and virtio-net
>devices. One points to the other via new parameters
>primar=<primary qdev id> and standby='<standby qdev id>'. This means
>that the primary device needs to be specified after standby device on
>the qemu command line. Not sure how to solve this.
>
>Error handling: Patches don't cover all possible error scenarios yet.
>
>I have tested this with a mlx5 NIC and was able to migrate the VM with
>above mentioned workarounds for open problems.
>
>Command line example:
>
>qemu-system-x86_64 -enable-kvm -m 3072 -smp 3 \
> -machine q35,kernel-irqchip=split -cpu host \
> -k fr \
> -serial stdio \
> -net none \
> -qmp unix:/tmp/qmp.socket,server,nowait \
> -monitor telnet:127.0.0.1:5555,server,nowait \
> -device pcie-root-port,id=root0,multifunction=on,chassis=0,addr=0xa \
> -device pcie-root-port,id=root1,bus=pcie.0,chassis=1 \
> -device pcie-root-port,id=root2,bus=pcie.0,chassis=2 \
> -netdev tap,script=/root/bin/bridge.sh,downscript=no,id=hostnet1,vhost=on \
> -device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:6f:55:cc,bus=root2,primary=hostdev0 \
> -device vfio-pci,host=5e:00.2,id=hostdev0,bus=root1,standby=net1 \
> /root/rhel-guest-image-8.0-1781.x86_64.qcow2
>
>I'm grateful for any remarks or ideas!
>
>Thanks!
>
>regards,
>Jens
>
>Sameeh Jubran (2):
> qdev/qbus: Add hidden device support
> net/virtio: add failover support
>
> hw/core/qdev.c | 27 ++++++++++
> hw/net/virtio-net.c | 95 ++++++++++++++++++++++++++++++++++
> hw/pci/pci.c | 1 +
> include/hw/pci/pci.h | 2 +
> include/hw/qdev-core.h | 8 +++
> include/hw/virtio/virtio-net.h | 7 +++
> qdev-monitor.c | 48 +++++++++++++++--
> vl.c | 7 ++-
> 8 files changed, 189 insertions(+), 6 deletions(-)
>
>--
>2.20.1
>
>
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 0/2] implement the failover feature for assigned network devices
[not found] <20190322134447.14831-1-jfreimann@redhat.com>
2019-04-04 8:29 ` [Qemu-devel] [RFC PATCH 0/2] implement the failover feature for assigned network devices Jens Freimann
@ 2019-04-04 12:53 ` Daniel P. Berrangé
1 sibling, 0 replies; 26+ messages in thread
From: Daniel P. Berrangé @ 2019-04-04 12:53 UTC (permalink / raw)
To: Jens Freimann
Cc: qemu-devel, ehabkost, mst, mdroth, pkrempa, laine, liran.alon,
ogerlitz, ailan
On Fri, Mar 22, 2019 at 02:44:45PM +0100, Jens Freimann wrote:
> Commandline: There is a dependency between vfio-pci and virtio-net
> devices. One points to the other via new parameters
> primar=<primary qdev id> and standby='<standby qdev id>'. This means
> that the primary device needs to be specified after standby device on
> the qemu command line. Not sure how to solve this.
So we hae this pair of args
> -device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:6f:55:cc,bus=root2,primary=hostdev0 \
> -device vfio-pci,host=5e:00.2,id=hostdev0,bus=root1,standby=net1 \
In this you have denoted the "vfio-pci" device as the primary and
"virtio-net-pci" as the standby.
There's no need for the user to specify the "primary" property for for
the "virtio-net-pci" NIC though. We only need to tell QEMU the relationship
in one direction and it can set the relationship in the reverse direction
automatically.
In fact it is undesirable for the user to specify "primary" property, as
they should be able to start QEMU with only the virtio-net-pci device
present and then hot-plug a vfio-pci on the fly when it is available.
So I think the "virtio-net-pci" merely needs a flag to indicate that it
should be prepared to take part in a failover pair, but *not* take any
device ID from the user.
eg we should be able to start with just
-device virtio-net-pci,netdev=hostnet1,id=net1,\
mac=52:54:00:6f:55:cc,bus=root2,failover=on
When vfio-pci is then created, either via a further -device arg on later
in QMP via device_add, it only needs to specify standby=net1.
When vfio-pci is realized it can lookup the virtio-net-pci device
in the QOM tree and set its "primary" property to point back to its
own device ID. There should never be any need for the user to tell
virtio-net-pci what the device ID of the vfio-pci is.
Regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 0/2] implement the failover feature for assigned network devices
2019-04-04 8:29 ` [Qemu-devel] [RFC PATCH 0/2] implement the failover feature for assigned network devices Jens Freimann
@ 2019-04-05 8:56 ` Dr. David Alan Gilbert
2019-04-05 8:56 ` Dr. David Alan Gilbert
` (2 more replies)
0 siblings, 3 replies; 26+ messages in thread
From: Dr. David Alan Gilbert @ 2019-04-05 8:56 UTC (permalink / raw)
To: Jens Freimann, armbru
Cc: qemu-devel, pkrempa, ehabkost, mst, mdroth, liran.alon, laine,
ogerlitz, ailan
* Jens Freimann (jfreimann@redhat.com) wrote:
> ping
>
> FYI: I'm also working on a few related tools to detect driver behaviour when
> assigning a MAC to the vf device. Code is at https://github.com/jensfr/netfailover_driver_detect
Hi Jens,
I've not been following this too uch, but:
> regards,
> Jens
>
> On Fri, Mar 22, 2019 at 02:44:45PM +0100, Jens Freimann wrote:
> > This is another attempt at implementing the host side of the
> > net_failover concept
> > (https://www.kernel.org/doc/html/latest/networking/net_failover.html)
> >
> > The general idea is that we have a pair of devices, a vfio-pci and a
> > emulated device. Before migration the vfio device is unplugged and data
> > flows to the emulated device, on the target side another vfio-pci device
> > is plugged in to take over the data-path. In the guest the net_failover
> > module will pair net devices with the same MAC address.
> >
> > * In the first patch the infrastructure for hiding the device is added
> > for the qbus and qdev APIs. A "hidden" boolean is added to the device
> > state and it is set based on a callback to the standby device which
> > registers itself for handling the assessment: "should the primary device
> > be hidden?" by cross validating the ids of the devices.
> >
> > * In the second patch the virtio-net uses the API to hide the vfio
> > device and unhides it when the feature is acked.
> >
> > Previous discussion: https://patchwork.ozlabs.org/cover/989098/
> >
> > To summarize concerns/feedback from previous discussion:
> > 1.- guest OS can reject or worse _delay_ unplug by any amount of time.
> > Migration might get stuck for unpredictable time with unclear reason.
> > This approach combines two tricky things, hot/unplug and migration.
> > -> We can surprise-remove the PCI device and in QEMU we can do all
> > necessary rollbacks transparent to management software. Will it be
> > easy, probably not.
This sounds 'fun' - bonus cases are things like what happens if the
guest gets rebooted somewhere during the process or if it's currently
sitting in the bios/grub/etc
> > 2. PCI devices are a precious ressource. The primary device should never
> > be added to QEMU if it won't be used by guest instead of hiding it in
> > QEMU.
> > -> We only hotplug the device when the standby feature bit was
> > negotiated. We save the device cmdline options until we need it for
> > qdev_device_add()
> > Hiding a device can be a useful concept to model. For example a
> > pci device in a powered-off slot could be marked as hidden until the slot is
> > powered on (mst).
Are they really that precious? Personally it's not something I'd worry
about.
> > 3. Management layer software should handle this. Open Stack already has
> > components/code to handle unplug/replug VFIO devices and metadata to
> > provide to the guest for detecting which devices should be paired.
> > -> An approach that includes all software from firmware to
> > higher-level management software wasn't tried in the last years. This is
> > an attempt to keep it simple and contained in QEMU as much as possible.
> > 4. Hotplugging a device and then making it part of a failover setup is
> > not possible
> > -> addressed by extending qdev hotplug functions to check for hidden
> > attribute, so e.g. device_add can be used to plug a device.
> >
> > There are still some open issues:
> >
> > Migration: I'm looking for something like a pre-migration hook that I
> > could use to unplug the vfio-pci device. I tried with a migration
> > notifier but it is called to late, i.e. after migration is aborted due
> > to vfio-pci marked unmigrateable. I worked around this by setting it
> > to migrateable and used a migration notifier on the virtio-net device.
Why not just let this happen at the libvirt level; then you do the
hotunplug etc before you actually tell qemu anything about starting a
migration?
> > Commandline: There is a dependency between vfio-pci and virtio-net
> > devices. One points to the other via new parameters
> > primar=<primary qdev id> and standby='<standby qdev id>'. This means
> > that the primary device needs to be specified after standby device on
> > the qemu command line. Not sure how to solve this.
> >
> > Error handling: Patches don't cover all possible error scenarios yet.
> >
> > I have tested this with a mlx5 NIC and was able to migrate the VM with
> > above mentioned workarounds for open problems.
> >
> > Command line example:
> >
> > qemu-system-x86_64 -enable-kvm -m 3072 -smp 3 \
> > -machine q35,kernel-irqchip=split -cpu host \
> > -k fr \
> > -serial stdio \
> > -net none \
> > -qmp unix:/tmp/qmp.socket,server,nowait \
> > -monitor telnet:127.0.0.1:5555,server,nowait \
> > -device pcie-root-port,id=root0,multifunction=on,chassis=0,addr=0xa \
> > -device pcie-root-port,id=root1,bus=pcie.0,chassis=1 \
> > -device pcie-root-port,id=root2,bus=pcie.0,chassis=2 \
> > -netdev tap,script=/root/bin/bridge.sh,downscript=no,id=hostnet1,vhost=on \
> > -device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:6f:55:cc,bus=root2,primary=hostdev0 \
> > -device vfio-pci,host=5e:00.2,id=hostdev0,bus=root1,standby=net1 \
Yes, that's a bit grim; it's circular dependency on the 'hostdev0' and
'net1' id's. cc'ing in Markus.
Dave
> > /root/rhel-guest-image-8.0-1781.x86_64.qcow2
> >
> > I'm grateful for any remarks or ideas!
> >
> > Thanks!
> >
> > regards,
> > Jens
> >
> > Sameeh Jubran (2):
> > qdev/qbus: Add hidden device support
> > net/virtio: add failover support
> >
> > hw/core/qdev.c | 27 ++++++++++
> > hw/net/virtio-net.c | 95 ++++++++++++++++++++++++++++++++++
> > hw/pci/pci.c | 1 +
> > include/hw/pci/pci.h | 2 +
> > include/hw/qdev-core.h | 8 +++
> > include/hw/virtio/virtio-net.h | 7 +++
> > qdev-monitor.c | 48 +++++++++++++++--
> > vl.c | 7 ++-
> > 8 files changed, 189 insertions(+), 6 deletions(-)
> >
> > --
> > 2.20.1
> >
> >
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 0/2] implement the failover feature for assigned network devices
2019-04-05 8:56 ` Dr. David Alan Gilbert
@ 2019-04-05 8:56 ` Dr. David Alan Gilbert
2019-04-05 9:20 ` Jens Freimann
2019-04-05 23:22 ` Michael S. Tsirkin
2 siblings, 0 replies; 26+ messages in thread
From: Dr. David Alan Gilbert @ 2019-04-05 8:56 UTC (permalink / raw)
To: Jens Freimann, armbru
Cc: pkrempa, ehabkost, mst, mdroth, qemu-devel, liran.alon, laine,
ogerlitz, ailan
* Jens Freimann (jfreimann@redhat.com) wrote:
> ping
>
> FYI: I'm also working on a few related tools to detect driver behaviour when
> assigning a MAC to the vf device. Code is at https://github.com/jensfr/netfailover_driver_detect
Hi Jens,
I've not been following this too uch, but:
> regards,
> Jens
>
> On Fri, Mar 22, 2019 at 02:44:45PM +0100, Jens Freimann wrote:
> > This is another attempt at implementing the host side of the
> > net_failover concept
> > (https://www.kernel.org/doc/html/latest/networking/net_failover.html)
> >
> > The general idea is that we have a pair of devices, a vfio-pci and a
> > emulated device. Before migration the vfio device is unplugged and data
> > flows to the emulated device, on the target side another vfio-pci device
> > is plugged in to take over the data-path. In the guest the net_failover
> > module will pair net devices with the same MAC address.
> >
> > * In the first patch the infrastructure for hiding the device is added
> > for the qbus and qdev APIs. A "hidden" boolean is added to the device
> > state and it is set based on a callback to the standby device which
> > registers itself for handling the assessment: "should the primary device
> > be hidden?" by cross validating the ids of the devices.
> >
> > * In the second patch the virtio-net uses the API to hide the vfio
> > device and unhides it when the feature is acked.
> >
> > Previous discussion: https://patchwork.ozlabs.org/cover/989098/
> >
> > To summarize concerns/feedback from previous discussion:
> > 1.- guest OS can reject or worse _delay_ unplug by any amount of time.
> > Migration might get stuck for unpredictable time with unclear reason.
> > This approach combines two tricky things, hot/unplug and migration.
> > -> We can surprise-remove the PCI device and in QEMU we can do all
> > necessary rollbacks transparent to management software. Will it be
> > easy, probably not.
This sounds 'fun' - bonus cases are things like what happens if the
guest gets rebooted somewhere during the process or if it's currently
sitting in the bios/grub/etc
> > 2. PCI devices are a precious ressource. The primary device should never
> > be added to QEMU if it won't be used by guest instead of hiding it in
> > QEMU.
> > -> We only hotplug the device when the standby feature bit was
> > negotiated. We save the device cmdline options until we need it for
> > qdev_device_add()
> > Hiding a device can be a useful concept to model. For example a
> > pci device in a powered-off slot could be marked as hidden until the slot is
> > powered on (mst).
Are they really that precious? Personally it's not something I'd worry
about.
> > 3. Management layer software should handle this. Open Stack already has
> > components/code to handle unplug/replug VFIO devices and metadata to
> > provide to the guest for detecting which devices should be paired.
> > -> An approach that includes all software from firmware to
> > higher-level management software wasn't tried in the last years. This is
> > an attempt to keep it simple and contained in QEMU as much as possible.
> > 4. Hotplugging a device and then making it part of a failover setup is
> > not possible
> > -> addressed by extending qdev hotplug functions to check for hidden
> > attribute, so e.g. device_add can be used to plug a device.
> >
> > There are still some open issues:
> >
> > Migration: I'm looking for something like a pre-migration hook that I
> > could use to unplug the vfio-pci device. I tried with a migration
> > notifier but it is called to late, i.e. after migration is aborted due
> > to vfio-pci marked unmigrateable. I worked around this by setting it
> > to migrateable and used a migration notifier on the virtio-net device.
Why not just let this happen at the libvirt level; then you do the
hotunplug etc before you actually tell qemu anything about starting a
migration?
> > Commandline: There is a dependency between vfio-pci and virtio-net
> > devices. One points to the other via new parameters
> > primar=<primary qdev id> and standby='<standby qdev id>'. This means
> > that the primary device needs to be specified after standby device on
> > the qemu command line. Not sure how to solve this.
> >
> > Error handling: Patches don't cover all possible error scenarios yet.
> >
> > I have tested this with a mlx5 NIC and was able to migrate the VM with
> > above mentioned workarounds for open problems.
> >
> > Command line example:
> >
> > qemu-system-x86_64 -enable-kvm -m 3072 -smp 3 \
> > -machine q35,kernel-irqchip=split -cpu host \
> > -k fr \
> > -serial stdio \
> > -net none \
> > -qmp unix:/tmp/qmp.socket,server,nowait \
> > -monitor telnet:127.0.0.1:5555,server,nowait \
> > -device pcie-root-port,id=root0,multifunction=on,chassis=0,addr=0xa \
> > -device pcie-root-port,id=root1,bus=pcie.0,chassis=1 \
> > -device pcie-root-port,id=root2,bus=pcie.0,chassis=2 \
> > -netdev tap,script=/root/bin/bridge.sh,downscript=no,id=hostnet1,vhost=on \
> > -device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:6f:55:cc,bus=root2,primary=hostdev0 \
> > -device vfio-pci,host=5e:00.2,id=hostdev0,bus=root1,standby=net1 \
Yes, that's a bit grim; it's circular dependency on the 'hostdev0' and
'net1' id's. cc'ing in Markus.
Dave
> > /root/rhel-guest-image-8.0-1781.x86_64.qcow2
> >
> > I'm grateful for any remarks or ideas!
> >
> > Thanks!
> >
> > regards,
> > Jens
> >
> > Sameeh Jubran (2):
> > qdev/qbus: Add hidden device support
> > net/virtio: add failover support
> >
> > hw/core/qdev.c | 27 ++++++++++
> > hw/net/virtio-net.c | 95 ++++++++++++++++++++++++++++++++++
> > hw/pci/pci.c | 1 +
> > include/hw/pci/pci.h | 2 +
> > include/hw/qdev-core.h | 8 +++
> > include/hw/virtio/virtio-net.h | 7 +++
> > qdev-monitor.c | 48 +++++++++++++++--
> > vl.c | 7 ++-
> > 8 files changed, 189 insertions(+), 6 deletions(-)
> >
> > --
> > 2.20.1
> >
> >
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 0/2] implement the failover feature for assigned network devices
2019-04-05 8:56 ` Dr. David Alan Gilbert
2019-04-05 8:56 ` Dr. David Alan Gilbert
@ 2019-04-05 9:20 ` Jens Freimann
2019-04-05 9:20 ` Jens Freimann
2019-04-08 5:53 ` Markus Armbruster
2019-04-05 23:22 ` Michael S. Tsirkin
2 siblings, 2 replies; 26+ messages in thread
From: Jens Freimann @ 2019-04-05 9:20 UTC (permalink / raw)
To: Dr. David Alan Gilbert
Cc: armbru, qemu-devel, pkrempa, ehabkost, mst, mdroth, liran.alon,
laine, ogerlitz, ailan
On Fri, Apr 05, 2019 at 09:56:29AM +0100, Dr. David Alan Gilbert wrote:
>* Jens Freimann (jfreimann@redhat.com) wrote:
[...]
>> > To summarize concerns/feedback from previous discussion:
>> > 1.- guest OS can reject or worse _delay_ unplug by any amount of time.
>> > Migration might get stuck for unpredictable time with unclear reason.
>> > This approach combines two tricky things, hot/unplug and migration.
>> > -> We can surprise-remove the PCI device and in QEMU we can do all
>> > necessary rollbacks transparent to management software. Will it be
>> > easy, probably not.
>
>This sounds 'fun' - bonus cases are things like what happens if the
>guest gets rebooted somewhere during the process or if it's currently
>sitting in the bios/grub/etc
Yeah, I have to think about this...
>> > 2. PCI devices are a precious ressource. The primary device should never
>> > be added to QEMU if it won't be used by guest instead of hiding it in
>> > QEMU.
>> > -> We only hotplug the device when the standby feature bit was
>> > negotiated. We save the device cmdline options until we need it for
>> > qdev_device_add()
>> > Hiding a device can be a useful concept to model. For example a
>> > pci device in a powered-off slot could be marked as hidden until the slot is
>> > powered on (mst).
>
>Are they really that precious? Personally it's not something I'd worry
>about.
>
>> > 3. Management layer software should handle this. Open Stack already has
>> > components/code to handle unplug/replug VFIO devices and metadata to
>> > provide to the guest for detecting which devices should be paired.
>> > -> An approach that includes all software from firmware to
>> > higher-level management software wasn't tried in the last years. This is
>> > an attempt to keep it simple and contained in QEMU as much as possible.
>> > 4. Hotplugging a device and then making it part of a failover setup is
>> > not possible
>> > -> addressed by extending qdev hotplug functions to check for hidden
>> > attribute, so e.g. device_add can be used to plug a device.
>> >
>> > There are still some open issues:
>> >
>> > Migration: I'm looking for something like a pre-migration hook that I
>> > could use to unplug the vfio-pci device. I tried with a migration
>> > notifier but it is called to late, i.e. after migration is aborted due
>> > to vfio-pci marked unmigrateable. I worked around this by setting it
>> > to migrateable and used a migration notifier on the virtio-net device.
>
>Why not just let this happen at the libvirt level; then you do the
>hotunplug etc before you actually tell qemu anything about starting a
>migration?
Yes...the goal was to see if we can contain changes to QEMU (to keep
it simple, although it seems that covering all error cases won't be that simple :).
But I don't see a mechanism to trigger the unplug at the right moment
yet. So yes, maybe there's no way around involving libvirt at least
for this part...
>> > Commandline: There is a dependency between vfio-pci and virtio-net
>> > devices. One points to the other via new parameters
>> > primar=<primary qdev id> and standby='<standby qdev id>'. This means
>> > that the primary device needs to be specified after standby device on
>> > the qemu command line. Not sure how to solve this.
>> >
>> > Error handling: Patches don't cover all possible error scenarios yet.
>> >
>> > I have tested this with a mlx5 NIC and was able to migrate the VM with
>> > above mentioned workarounds for open problems.
>> >
>> > Command line example:
>> >
>> > qemu-system-x86_64 -enable-kvm -m 3072 -smp 3 \
>> > -machine q35,kernel-irqchip=split -cpu host \
>> > -k fr \
>> > -serial stdio \
>> > -net none \
>> > -qmp unix:/tmp/qmp.socket,server,nowait \
>> > -monitor telnet:127.0.0.1:5555,server,nowait \
>> > -device pcie-root-port,id=root0,multifunction=on,chassis=0,addr=0xa \
>> > -device pcie-root-port,id=root1,bus=pcie.0,chassis=1 \
>> > -device pcie-root-port,id=root2,bus=pcie.0,chassis=2 \
>> > -netdev tap,script=/root/bin/bridge.sh,downscript=no,id=hostnet1,vhost=on \
>> > -device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:6f:55:cc,bus=root2,primary=hostdev0 \
>> > -device vfio-pci,host=5e:00.2,id=hostdev0,bus=root1,standby=net1 \
>
>Yes, that's a bit grim; it's circular dependency on the 'hostdev0' and
>'net1' id's. cc'ing in Markus.
Dan had an idea how to avoid having to specify the id for the
virtio-net device. I'm currently looking into it, but it seems like it
should work.
regards,
Jens
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 0/2] implement the failover feature for assigned network devices
2019-04-05 9:20 ` Jens Freimann
@ 2019-04-05 9:20 ` Jens Freimann
2019-04-08 5:53 ` Markus Armbruster
1 sibling, 0 replies; 26+ messages in thread
From: Jens Freimann @ 2019-04-05 9:20 UTC (permalink / raw)
To: Dr. David Alan Gilbert
Cc: pkrempa, ehabkost, mst, qemu-devel, mdroth, armbru, liran.alon,
laine, ogerlitz, ailan
On Fri, Apr 05, 2019 at 09:56:29AM +0100, Dr. David Alan Gilbert wrote:
>* Jens Freimann (jfreimann@redhat.com) wrote:
[...]
>> > To summarize concerns/feedback from previous discussion:
>> > 1.- guest OS can reject or worse _delay_ unplug by any amount of time.
>> > Migration might get stuck for unpredictable time with unclear reason.
>> > This approach combines two tricky things, hot/unplug and migration.
>> > -> We can surprise-remove the PCI device and in QEMU we can do all
>> > necessary rollbacks transparent to management software. Will it be
>> > easy, probably not.
>
>This sounds 'fun' - bonus cases are things like what happens if the
>guest gets rebooted somewhere during the process or if it's currently
>sitting in the bios/grub/etc
Yeah, I have to think about this...
>> > 2. PCI devices are a precious ressource. The primary device should never
>> > be added to QEMU if it won't be used by guest instead of hiding it in
>> > QEMU.
>> > -> We only hotplug the device when the standby feature bit was
>> > negotiated. We save the device cmdline options until we need it for
>> > qdev_device_add()
>> > Hiding a device can be a useful concept to model. For example a
>> > pci device in a powered-off slot could be marked as hidden until the slot is
>> > powered on (mst).
>
>Are they really that precious? Personally it's not something I'd worry
>about.
>
>> > 3. Management layer software should handle this. Open Stack already has
>> > components/code to handle unplug/replug VFIO devices and metadata to
>> > provide to the guest for detecting which devices should be paired.
>> > -> An approach that includes all software from firmware to
>> > higher-level management software wasn't tried in the last years. This is
>> > an attempt to keep it simple and contained in QEMU as much as possible.
>> > 4. Hotplugging a device and then making it part of a failover setup is
>> > not possible
>> > -> addressed by extending qdev hotplug functions to check for hidden
>> > attribute, so e.g. device_add can be used to plug a device.
>> >
>> > There are still some open issues:
>> >
>> > Migration: I'm looking for something like a pre-migration hook that I
>> > could use to unplug the vfio-pci device. I tried with a migration
>> > notifier but it is called to late, i.e. after migration is aborted due
>> > to vfio-pci marked unmigrateable. I worked around this by setting it
>> > to migrateable and used a migration notifier on the virtio-net device.
>
>Why not just let this happen at the libvirt level; then you do the
>hotunplug etc before you actually tell qemu anything about starting a
>migration?
Yes...the goal was to see if we can contain changes to QEMU (to keep
it simple, although it seems that covering all error cases won't be that simple :).
But I don't see a mechanism to trigger the unplug at the right moment
yet. So yes, maybe there's no way around involving libvirt at least
for this part...
>> > Commandline: There is a dependency between vfio-pci and virtio-net
>> > devices. One points to the other via new parameters
>> > primar=<primary qdev id> and standby='<standby qdev id>'. This means
>> > that the primary device needs to be specified after standby device on
>> > the qemu command line. Not sure how to solve this.
>> >
>> > Error handling: Patches don't cover all possible error scenarios yet.
>> >
>> > I have tested this with a mlx5 NIC and was able to migrate the VM with
>> > above mentioned workarounds for open problems.
>> >
>> > Command line example:
>> >
>> > qemu-system-x86_64 -enable-kvm -m 3072 -smp 3 \
>> > -machine q35,kernel-irqchip=split -cpu host \
>> > -k fr \
>> > -serial stdio \
>> > -net none \
>> > -qmp unix:/tmp/qmp.socket,server,nowait \
>> > -monitor telnet:127.0.0.1:5555,server,nowait \
>> > -device pcie-root-port,id=root0,multifunction=on,chassis=0,addr=0xa \
>> > -device pcie-root-port,id=root1,bus=pcie.0,chassis=1 \
>> > -device pcie-root-port,id=root2,bus=pcie.0,chassis=2 \
>> > -netdev tap,script=/root/bin/bridge.sh,downscript=no,id=hostnet1,vhost=on \
>> > -device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:6f:55:cc,bus=root2,primary=hostdev0 \
>> > -device vfio-pci,host=5e:00.2,id=hostdev0,bus=root1,standby=net1 \
>
>Yes, that's a bit grim; it's circular dependency on the 'hostdev0' and
>'net1' id's. cc'ing in Markus.
Dan had an idea how to avoid having to specify the id for the
virtio-net device. I'm currently looking into it, but it seems like it
should work.
regards,
Jens
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 0/2] implement the failover feature for assigned network devices
2019-04-05 8:56 ` Dr. David Alan Gilbert
2019-04-05 8:56 ` Dr. David Alan Gilbert
2019-04-05 9:20 ` Jens Freimann
@ 2019-04-05 23:22 ` Michael S. Tsirkin
2019-04-05 23:22 ` Michael S. Tsirkin
` (3 more replies)
2 siblings, 4 replies; 26+ messages in thread
From: Michael S. Tsirkin @ 2019-04-05 23:22 UTC (permalink / raw)
To: Dr. David Alan Gilbert
Cc: Jens Freimann, armbru, qemu-devel, pkrempa, ehabkost, mdroth,
liran.alon, laine, ogerlitz, ailan
On Fri, Apr 05, 2019 at 09:56:29AM +0100, Dr. David Alan Gilbert wrote:
> * Jens Freimann (jfreimann@redhat.com) wrote:
> > ping
> >
> > FYI: I'm also working on a few related tools to detect driver behaviour when
> > assigning a MAC to the vf device. Code is at https://github.com/jensfr/netfailover_driver_detect
>
> Hi Jens,
> I've not been following this too uch, but:
>
> > regards,
> > Jens
> >
> > On Fri, Mar 22, 2019 at 02:44:45PM +0100, Jens Freimann wrote:
> > > This is another attempt at implementing the host side of the
> > > net_failover concept
> > > (https://www.kernel.org/doc/html/latest/networking/net_failover.html)
> > >
> > > The general idea is that we have a pair of devices, a vfio-pci and a
> > > emulated device. Before migration the vfio device is unplugged and data
> > > flows to the emulated device, on the target side another vfio-pci device
> > > is plugged in to take over the data-path. In the guest the net_failover
> > > module will pair net devices with the same MAC address.
> > >
> > > * In the first patch the infrastructure for hiding the device is added
> > > for the qbus and qdev APIs. A "hidden" boolean is added to the device
> > > state and it is set based on a callback to the standby device which
> > > registers itself for handling the assessment: "should the primary device
> > > be hidden?" by cross validating the ids of the devices.
> > >
> > > * In the second patch the virtio-net uses the API to hide the vfio
> > > device and unhides it when the feature is acked.
> > >
> > > Previous discussion: https://patchwork.ozlabs.org/cover/989098/
> > >
> > > To summarize concerns/feedback from previous discussion:
> > > 1.- guest OS can reject or worse _delay_ unplug by any amount of time.
> > > Migration might get stuck for unpredictable time with unclear reason.
> > > This approach combines two tricky things, hot/unplug and migration.
> > > -> We can surprise-remove the PCI device and in QEMU we can do all
> > > necessary rollbacks transparent to management software. Will it be
> > > easy, probably not.
>
> This sounds 'fun' - bonus cases are things like what happens if the
> guest gets rebooted somewhere during the process or if it's currently
> sitting in the bios/grub/etc
Um, during which process? Guests are gradually fixed to support
surprise removal well. Part of it is thunderbolt which makes
it incredibly easy. Yes - bios/grub will need to learn to
handle this well.
> > > 2. PCI devices are a precious ressource. The primary device should never
> > > be added to QEMU if it won't be used by guest instead of hiding it in
> > > QEMU.
> > > -> We only hotplug the device when the standby feature bit was
> > > negotiated. We save the device cmdline options until we need it for
> > > qdev_device_add()
> > > Hiding a device can be a useful concept to model. For example a
> > > pci device in a powered-off slot could be marked as hidden until the slot is
> > > powered on (mst).
>
> Are they really that precious? Personally it's not something I'd worry
> about.
>
> > > 3. Management layer software should handle this. Open Stack already has
> > > components/code to handle unplug/replug VFIO devices and metadata to
> > > provide to the guest for detecting which devices should be paired.
> > > -> An approach that includes all software from firmware to
> > > higher-level management software wasn't tried in the last years. This is
> > > an attempt to keep it simple and contained in QEMU as much as possible.
> > > 4. Hotplugging a device and then making it part of a failover setup is
> > > not possible
> > > -> addressed by extending qdev hotplug functions to check for hidden
> > > attribute, so e.g. device_add can be used to plug a device.
> > >
> > > There are still some open issues:
> > >
> > > Migration: I'm looking for something like a pre-migration hook that I
> > > could use to unplug the vfio-pci device. I tried with a migration
> > > notifier but it is called to late, i.e. after migration is aborted due
> > > to vfio-pci marked unmigrateable. I worked around this by setting it
> > > to migrateable and used a migration notifier on the virtio-net device.
>
> Why not just let this happen at the libvirt level; then you do the
> hotunplug etc before you actually tell qemu anything about starting a
> migration?
If qemu frees up resources (as it does on unplug) then libvirt
is not guaranteed it can roll the change back on e.g.
migration failure.
But really another issue is simply that it's a mechanism,
there's no policy that management needs to decide on.
Doing it at lowest possible level ensures all
upper layers benefit with minimal pain.
> > > Commandline: There is a dependency between vfio-pci and virtio-net
> > > devices. One points to the other via new parameters
> > > primar=<primary qdev id> and standby='<standby qdev id>'. This means
> > > that the primary device needs to be specified after standby device on
> > > the qemu command line. Not sure how to solve this.
> > >
> > > Error handling: Patches don't cover all possible error scenarios yet.
> > >
> > > I have tested this with a mlx5 NIC and was able to migrate the VM with
> > > above mentioned workarounds for open problems.
> > >
> > > Command line example:
> > >
> > > qemu-system-x86_64 -enable-kvm -m 3072 -smp 3 \
> > > -machine q35,kernel-irqchip=split -cpu host \
> > > -k fr \
> > > -serial stdio \
> > > -net none \
> > > -qmp unix:/tmp/qmp.socket,server,nowait \
> > > -monitor telnet:127.0.0.1:5555,server,nowait \
> > > -device pcie-root-port,id=root0,multifunction=on,chassis=0,addr=0xa \
> > > -device pcie-root-port,id=root1,bus=pcie.0,chassis=1 \
> > > -device pcie-root-port,id=root2,bus=pcie.0,chassis=2 \
> > > -netdev tap,script=/root/bin/bridge.sh,downscript=no,id=hostnet1,vhost=on \
> > > -device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:6f:55:cc,bus=root2,primary=hostdev0 \
> > > -device vfio-pci,host=5e:00.2,id=hostdev0,bus=root1,standby=net1 \
>
> Yes, that's a bit grim; it's circular dependency on the 'hostdev0' and
> 'net1' id's. cc'ing in Markus.
>
> Dave
>
> > > /root/rhel-guest-image-8.0-1781.x86_64.qcow2
> > >
> > > I'm grateful for any remarks or ideas!
> > >
> > > Thanks!
> > >
> > > regards,
> > > Jens
> > >
> > > Sameeh Jubran (2):
> > > qdev/qbus: Add hidden device support
> > > net/virtio: add failover support
> > >
> > > hw/core/qdev.c | 27 ++++++++++
> > > hw/net/virtio-net.c | 95 ++++++++++++++++++++++++++++++++++
> > > hw/pci/pci.c | 1 +
> > > include/hw/pci/pci.h | 2 +
> > > include/hw/qdev-core.h | 8 +++
> > > include/hw/virtio/virtio-net.h | 7 +++
> > > qdev-monitor.c | 48 +++++++++++++++--
> > > vl.c | 7 ++-
> > > 8 files changed, 189 insertions(+), 6 deletions(-)
> > >
> > > --
> > > 2.20.1
> > >
> > >
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 0/2] implement the failover feature for assigned network devices
2019-04-05 23:22 ` Michael S. Tsirkin
@ 2019-04-05 23:22 ` Michael S. Tsirkin
2019-04-05 23:46 ` Eduardo Habkost
` (2 subsequent siblings)
3 siblings, 0 replies; 26+ messages in thread
From: Michael S. Tsirkin @ 2019-04-05 23:22 UTC (permalink / raw)
To: Dr. David Alan Gilbert
Cc: pkrempa, ehabkost, qemu-devel, mdroth, armbru, liran.alon, laine,
ogerlitz, Jens Freimann, ailan
On Fri, Apr 05, 2019 at 09:56:29AM +0100, Dr. David Alan Gilbert wrote:
> * Jens Freimann (jfreimann@redhat.com) wrote:
> > ping
> >
> > FYI: I'm also working on a few related tools to detect driver behaviour when
> > assigning a MAC to the vf device. Code is at https://github.com/jensfr/netfailover_driver_detect
>
> Hi Jens,
> I've not been following this too uch, but:
>
> > regards,
> > Jens
> >
> > On Fri, Mar 22, 2019 at 02:44:45PM +0100, Jens Freimann wrote:
> > > This is another attempt at implementing the host side of the
> > > net_failover concept
> > > (https://www.kernel.org/doc/html/latest/networking/net_failover.html)
> > >
> > > The general idea is that we have a pair of devices, a vfio-pci and a
> > > emulated device. Before migration the vfio device is unplugged and data
> > > flows to the emulated device, on the target side another vfio-pci device
> > > is plugged in to take over the data-path. In the guest the net_failover
> > > module will pair net devices with the same MAC address.
> > >
> > > * In the first patch the infrastructure for hiding the device is added
> > > for the qbus and qdev APIs. A "hidden" boolean is added to the device
> > > state and it is set based on a callback to the standby device which
> > > registers itself for handling the assessment: "should the primary device
> > > be hidden?" by cross validating the ids of the devices.
> > >
> > > * In the second patch the virtio-net uses the API to hide the vfio
> > > device and unhides it when the feature is acked.
> > >
> > > Previous discussion: https://patchwork.ozlabs.org/cover/989098/
> > >
> > > To summarize concerns/feedback from previous discussion:
> > > 1.- guest OS can reject or worse _delay_ unplug by any amount of time.
> > > Migration might get stuck for unpredictable time with unclear reason.
> > > This approach combines two tricky things, hot/unplug and migration.
> > > -> We can surprise-remove the PCI device and in QEMU we can do all
> > > necessary rollbacks transparent to management software. Will it be
> > > easy, probably not.
>
> This sounds 'fun' - bonus cases are things like what happens if the
> guest gets rebooted somewhere during the process or if it's currently
> sitting in the bios/grub/etc
Um, during which process? Guests are gradually fixed to support
surprise removal well. Part of it is thunderbolt which makes
it incredibly easy. Yes - bios/grub will need to learn to
handle this well.
> > > 2. PCI devices are a precious ressource. The primary device should never
> > > be added to QEMU if it won't be used by guest instead of hiding it in
> > > QEMU.
> > > -> We only hotplug the device when the standby feature bit was
> > > negotiated. We save the device cmdline options until we need it for
> > > qdev_device_add()
> > > Hiding a device can be a useful concept to model. For example a
> > > pci device in a powered-off slot could be marked as hidden until the slot is
> > > powered on (mst).
>
> Are they really that precious? Personally it's not something I'd worry
> about.
>
> > > 3. Management layer software should handle this. Open Stack already has
> > > components/code to handle unplug/replug VFIO devices and metadata to
> > > provide to the guest for detecting which devices should be paired.
> > > -> An approach that includes all software from firmware to
> > > higher-level management software wasn't tried in the last years. This is
> > > an attempt to keep it simple and contained in QEMU as much as possible.
> > > 4. Hotplugging a device and then making it part of a failover setup is
> > > not possible
> > > -> addressed by extending qdev hotplug functions to check for hidden
> > > attribute, so e.g. device_add can be used to plug a device.
> > >
> > > There are still some open issues:
> > >
> > > Migration: I'm looking for something like a pre-migration hook that I
> > > could use to unplug the vfio-pci device. I tried with a migration
> > > notifier but it is called to late, i.e. after migration is aborted due
> > > to vfio-pci marked unmigrateable. I worked around this by setting it
> > > to migrateable and used a migration notifier on the virtio-net device.
>
> Why not just let this happen at the libvirt level; then you do the
> hotunplug etc before you actually tell qemu anything about starting a
> migration?
If qemu frees up resources (as it does on unplug) then libvirt
is not guaranteed it can roll the change back on e.g.
migration failure.
But really another issue is simply that it's a mechanism,
there's no policy that management needs to decide on.
Doing it at lowest possible level ensures all
upper layers benefit with minimal pain.
> > > Commandline: There is a dependency between vfio-pci and virtio-net
> > > devices. One points to the other via new parameters
> > > primar=<primary qdev id> and standby='<standby qdev id>'. This means
> > > that the primary device needs to be specified after standby device on
> > > the qemu command line. Not sure how to solve this.
> > >
> > > Error handling: Patches don't cover all possible error scenarios yet.
> > >
> > > I have tested this with a mlx5 NIC and was able to migrate the VM with
> > > above mentioned workarounds for open problems.
> > >
> > > Command line example:
> > >
> > > qemu-system-x86_64 -enable-kvm -m 3072 -smp 3 \
> > > -machine q35,kernel-irqchip=split -cpu host \
> > > -k fr \
> > > -serial stdio \
> > > -net none \
> > > -qmp unix:/tmp/qmp.socket,server,nowait \
> > > -monitor telnet:127.0.0.1:5555,server,nowait \
> > > -device pcie-root-port,id=root0,multifunction=on,chassis=0,addr=0xa \
> > > -device pcie-root-port,id=root1,bus=pcie.0,chassis=1 \
> > > -device pcie-root-port,id=root2,bus=pcie.0,chassis=2 \
> > > -netdev tap,script=/root/bin/bridge.sh,downscript=no,id=hostnet1,vhost=on \
> > > -device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:6f:55:cc,bus=root2,primary=hostdev0 \
> > > -device vfio-pci,host=5e:00.2,id=hostdev0,bus=root1,standby=net1 \
>
> Yes, that's a bit grim; it's circular dependency on the 'hostdev0' and
> 'net1' id's. cc'ing in Markus.
>
> Dave
>
> > > /root/rhel-guest-image-8.0-1781.x86_64.qcow2
> > >
> > > I'm grateful for any remarks or ideas!
> > >
> > > Thanks!
> > >
> > > regards,
> > > Jens
> > >
> > > Sameeh Jubran (2):
> > > qdev/qbus: Add hidden device support
> > > net/virtio: add failover support
> > >
> > > hw/core/qdev.c | 27 ++++++++++
> > > hw/net/virtio-net.c | 95 ++++++++++++++++++++++++++++++++++
> > > hw/pci/pci.c | 1 +
> > > include/hw/pci/pci.h | 2 +
> > > include/hw/qdev-core.h | 8 +++
> > > include/hw/virtio/virtio-net.h | 7 +++
> > > qdev-monitor.c | 48 +++++++++++++++--
> > > vl.c | 7 ++-
> > > 8 files changed, 189 insertions(+), 6 deletions(-)
> > >
> > > --
> > > 2.20.1
> > >
> > >
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 0/2] implement the failover feature for assigned network devices
2019-04-05 23:22 ` Michael S. Tsirkin
2019-04-05 23:22 ` Michael S. Tsirkin
@ 2019-04-05 23:46 ` Eduardo Habkost
2019-04-05 23:46 ` Eduardo Habkost
2019-04-08 5:26 ` Markus Armbruster
2019-04-08 9:16 ` Dr. David Alan Gilbert
2019-05-29 0:35 ` si-wei liu
3 siblings, 2 replies; 26+ messages in thread
From: Eduardo Habkost @ 2019-04-05 23:46 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Dr. David Alan Gilbert, Jens Freimann, armbru, qemu-devel,
pkrempa, mdroth, liran.alon, laine, ogerlitz, ailan
On Fri, Apr 05, 2019 at 07:22:35PM -0400, Michael S. Tsirkin wrote:
> On Fri, Apr 05, 2019 at 09:56:29AM +0100, Dr. David Alan Gilbert wrote:
> > * Jens Freimann (jfreimann@redhat.com) wrote:
> > > On Fri, Mar 22, 2019 at 02:44:45PM +0100, Jens Freimann wrote:
[...]
> > > > 3. Management layer software should handle this. Open Stack already has
> > > > components/code to handle unplug/replug VFIO devices and metadata to
> > > > provide to the guest for detecting which devices should be paired.
> > > > -> An approach that includes all software from firmware to
> > > > higher-level management software wasn't tried in the last years. This is
> > > > an attempt to keep it simple and contained in QEMU as much as possible.
> > > > 4. Hotplugging a device and then making it part of a failover setup is
> > > > not possible
> > > > -> addressed by extending qdev hotplug functions to check for hidden
> > > > attribute, so e.g. device_add can be used to plug a device.
> > > >
> > > > There are still some open issues:
> > > >
> > > > Migration: I'm looking for something like a pre-migration hook that I
> > > > could use to unplug the vfio-pci device. I tried with a migration
> > > > notifier but it is called to late, i.e. after migration is aborted due
> > > > to vfio-pci marked unmigrateable. I worked around this by setting it
> > > > to migrateable and used a migration notifier on the virtio-net device.
> >
> > Why not just let this happen at the libvirt level; then you do the
> > hotunplug etc before you actually tell qemu anything about starting a
> > migration?
>
> If qemu frees up resources (as it does on unplug) then libvirt
> is not guaranteed it can roll the change back on e.g.
> migration failure.
Why should we always free up resources on unplug?
Unplug of a disk device doesn't close the corresponding -blockdev.
Unplug of a serial device doesn't close the corresponding -chardev.
Unplug of a memory device doesn't close the corresponding memory backend.
Unplug of a crypto device doesn't close the corresponding crypto backend.
Why do we expect device_del of a passthrough PCI device to always
release the host side PCI device? We can provide a better API
than that.
>
> But really another issue is simply that it's a mechanism,
> there's no policy that management needs to decide on.
> Doing it at lowest possible level ensures all
> upper layers benefit with minimal pain.
I don't see a problem in trying to make this work with surprise
removal too. But if it is also possible to make this work
without surprise removal support on the guest side, why not
provide the mechanisms for the management layer to implement it?
--
Eduardo
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 0/2] implement the failover feature for assigned network devices
2019-04-05 23:46 ` Eduardo Habkost
@ 2019-04-05 23:46 ` Eduardo Habkost
2019-04-08 5:26 ` Markus Armbruster
1 sibling, 0 replies; 26+ messages in thread
From: Eduardo Habkost @ 2019-04-05 23:46 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: pkrempa, armbru, qemu-devel, mdroth, liran.alon, laine, ogerlitz,
Jens Freimann, ailan, Dr. David Alan Gilbert
On Fri, Apr 05, 2019 at 07:22:35PM -0400, Michael S. Tsirkin wrote:
> On Fri, Apr 05, 2019 at 09:56:29AM +0100, Dr. David Alan Gilbert wrote:
> > * Jens Freimann (jfreimann@redhat.com) wrote:
> > > On Fri, Mar 22, 2019 at 02:44:45PM +0100, Jens Freimann wrote:
[...]
> > > > 3. Management layer software should handle this. Open Stack already has
> > > > components/code to handle unplug/replug VFIO devices and metadata to
> > > > provide to the guest for detecting which devices should be paired.
> > > > -> An approach that includes all software from firmware to
> > > > higher-level management software wasn't tried in the last years. This is
> > > > an attempt to keep it simple and contained in QEMU as much as possible.
> > > > 4. Hotplugging a device and then making it part of a failover setup is
> > > > not possible
> > > > -> addressed by extending qdev hotplug functions to check for hidden
> > > > attribute, so e.g. device_add can be used to plug a device.
> > > >
> > > > There are still some open issues:
> > > >
> > > > Migration: I'm looking for something like a pre-migration hook that I
> > > > could use to unplug the vfio-pci device. I tried with a migration
> > > > notifier but it is called to late, i.e. after migration is aborted due
> > > > to vfio-pci marked unmigrateable. I worked around this by setting it
> > > > to migrateable and used a migration notifier on the virtio-net device.
> >
> > Why not just let this happen at the libvirt level; then you do the
> > hotunplug etc before you actually tell qemu anything about starting a
> > migration?
>
> If qemu frees up resources (as it does on unplug) then libvirt
> is not guaranteed it can roll the change back on e.g.
> migration failure.
Why should we always free up resources on unplug?
Unplug of a disk device doesn't close the corresponding -blockdev.
Unplug of a serial device doesn't close the corresponding -chardev.
Unplug of a memory device doesn't close the corresponding memory backend.
Unplug of a crypto device doesn't close the corresponding crypto backend.
Why do we expect device_del of a passthrough PCI device to always
release the host side PCI device? We can provide a better API
than that.
>
> But really another issue is simply that it's a mechanism,
> there's no policy that management needs to decide on.
> Doing it at lowest possible level ensures all
> upper layers benefit with minimal pain.
I don't see a problem in trying to make this work with surprise
removal too. But if it is also possible to make this work
without surprise removal support on the guest side, why not
provide the mechanisms for the management layer to implement it?
--
Eduardo
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 0/2] implement the failover feature for assigned network devices
2019-04-05 23:46 ` Eduardo Habkost
2019-04-05 23:46 ` Eduardo Habkost
@ 2019-04-08 5:26 ` Markus Armbruster
2019-04-08 5:26 ` Markus Armbruster
2019-04-12 19:50 ` Eduardo Habkost
1 sibling, 2 replies; 26+ messages in thread
From: Markus Armbruster @ 2019-04-08 5:26 UTC (permalink / raw)
To: Eduardo Habkost
Cc: Michael S. Tsirkin, pkrempa, armbru, qemu-devel, mdroth,
liran.alon, laine, ogerlitz, Jens Freimann, ailan,
Dr. David Alan Gilbert
Eduardo Habkost <ehabkost@redhat.com> writes:
> On Fri, Apr 05, 2019 at 07:22:35PM -0400, Michael S. Tsirkin wrote:
>> On Fri, Apr 05, 2019 at 09:56:29AM +0100, Dr. David Alan Gilbert wrote:
>> > * Jens Freimann (jfreimann@redhat.com) wrote:
>> > > On Fri, Mar 22, 2019 at 02:44:45PM +0100, Jens Freimann wrote:
> [...]
>> > > > 3. Management layer software should handle this. Open Stack already has
>> > > > components/code to handle unplug/replug VFIO devices and metadata to
>> > > > provide to the guest for detecting which devices should be paired.
>> > > > -> An approach that includes all software from firmware to
>> > > > higher-level management software wasn't tried in the last years. This is
>> > > > an attempt to keep it simple and contained in QEMU as much as possible.
>> > > > 4. Hotplugging a device and then making it part of a failover setup is
>> > > > not possible
>> > > > -> addressed by extending qdev hotplug functions to check for hidden
>> > > > attribute, so e.g. device_add can be used to plug a device.
>> > > >
>> > > > There are still some open issues:
>> > > >
>> > > > Migration: I'm looking for something like a pre-migration hook that I
>> > > > could use to unplug the vfio-pci device. I tried with a migration
>> > > > notifier but it is called to late, i.e. after migration is aborted due
>> > > > to vfio-pci marked unmigrateable. I worked around this by setting it
>> > > > to migrateable and used a migration notifier on the virtio-net device.
>> >
>> > Why not just let this happen at the libvirt level; then you do the
>> > hotunplug etc before you actually tell qemu anything about starting a
>> > migration?
>>
>> If qemu frees up resources (as it does on unplug) then libvirt
>> is not guaranteed it can roll the change back on e.g.
>> migration failure.
>
> Why should we always free up resources on unplug?
>
> Unplug of a disk device doesn't close the corresponding -blockdev.
It does for block backends created with -drive, and that was a mistake
we corrected with -blockdev.
> Unplug of a serial device doesn't close the corresponding -chardev.
> Unplug of a memory device doesn't close the corresponding memory backend.
> Unplug of a crypto device doesn't close the corresponding crypto backend.
>
> Why do we expect device_del of a passthrough PCI device to always
> release the host side PCI device? We can provide a better API
> than that.
device_del should free what device_add allocates.
Does device_add allocate the host side PCI device? If yes, should it?
[...]
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 0/2] implement the failover feature for assigned network devices
2019-04-08 5:26 ` Markus Armbruster
@ 2019-04-08 5:26 ` Markus Armbruster
2019-04-12 19:50 ` Eduardo Habkost
1 sibling, 0 replies; 26+ messages in thread
From: Markus Armbruster @ 2019-04-08 5:26 UTC (permalink / raw)
To: Eduardo Habkost
Cc: pkrempa, Michael S. Tsirkin, armbru, qemu-devel, mdroth,
liran.alon, laine, ogerlitz, Jens Freimann, ailan,
Dr. David Alan Gilbert
Eduardo Habkost <ehabkost@redhat.com> writes:
> On Fri, Apr 05, 2019 at 07:22:35PM -0400, Michael S. Tsirkin wrote:
>> On Fri, Apr 05, 2019 at 09:56:29AM +0100, Dr. David Alan Gilbert wrote:
>> > * Jens Freimann (jfreimann@redhat.com) wrote:
>> > > On Fri, Mar 22, 2019 at 02:44:45PM +0100, Jens Freimann wrote:
> [...]
>> > > > 3. Management layer software should handle this. Open Stack already has
>> > > > components/code to handle unplug/replug VFIO devices and metadata to
>> > > > provide to the guest for detecting which devices should be paired.
>> > > > -> An approach that includes all software from firmware to
>> > > > higher-level management software wasn't tried in the last years. This is
>> > > > an attempt to keep it simple and contained in QEMU as much as possible.
>> > > > 4. Hotplugging a device and then making it part of a failover setup is
>> > > > not possible
>> > > > -> addressed by extending qdev hotplug functions to check for hidden
>> > > > attribute, so e.g. device_add can be used to plug a device.
>> > > >
>> > > > There are still some open issues:
>> > > >
>> > > > Migration: I'm looking for something like a pre-migration hook that I
>> > > > could use to unplug the vfio-pci device. I tried with a migration
>> > > > notifier but it is called to late, i.e. after migration is aborted due
>> > > > to vfio-pci marked unmigrateable. I worked around this by setting it
>> > > > to migrateable and used a migration notifier on the virtio-net device.
>> >
>> > Why not just let this happen at the libvirt level; then you do the
>> > hotunplug etc before you actually tell qemu anything about starting a
>> > migration?
>>
>> If qemu frees up resources (as it does on unplug) then libvirt
>> is not guaranteed it can roll the change back on e.g.
>> migration failure.
>
> Why should we always free up resources on unplug?
>
> Unplug of a disk device doesn't close the corresponding -blockdev.
It does for block backends created with -drive, and that was a mistake
we corrected with -blockdev.
> Unplug of a serial device doesn't close the corresponding -chardev.
> Unplug of a memory device doesn't close the corresponding memory backend.
> Unplug of a crypto device doesn't close the corresponding crypto backend.
>
> Why do we expect device_del of a passthrough PCI device to always
> release the host side PCI device? We can provide a better API
> than that.
device_del should free what device_add allocates.
Does device_add allocate the host side PCI device? If yes, should it?
[...]
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 0/2] implement the failover feature for assigned network devices
2019-04-05 9:20 ` Jens Freimann
2019-04-05 9:20 ` Jens Freimann
@ 2019-04-08 5:53 ` Markus Armbruster
2019-04-08 5:53 ` Markus Armbruster
1 sibling, 1 reply; 26+ messages in thread
From: Markus Armbruster @ 2019-04-08 5:53 UTC (permalink / raw)
To: Jens Freimann
Cc: Dr. David Alan Gilbert, pkrempa, ehabkost, mst, qemu-devel,
mdroth, armbru, liran.alon, laine, ogerlitz, ailan
Jens Freimann <jfreimann@redhat.com> writes:
> On Fri, Apr 05, 2019 at 09:56:29AM +0100, Dr. David Alan Gilbert wrote:
>>* Jens Freimann (jfreimann@redhat.com) wrote:
> [...]
>>> > Commandline: There is a dependency between vfio-pci and virtio-net
>>> > devices. One points to the other via new parameters
>>> > primar=<primary qdev id> and standby='<standby qdev id>'. This means
>>> > that the primary device needs to be specified after standby device on
>>> > the qemu command line. Not sure how to solve this.
>>> >
>>> > Error handling: Patches don't cover all possible error scenarios yet.
>>> >
>>> > I have tested this with a mlx5 NIC and was able to migrate the VM with
>>> > above mentioned workarounds for open problems.
>>> >
>>> > Command line example:
>>> >
>>> > qemu-system-x86_64 -enable-kvm -m 3072 -smp 3 \
>>> > -machine q35,kernel-irqchip=split -cpu host \
>>> > -k fr \
>>> > -serial stdio \
>>> > -net none \
>>> > -qmp unix:/tmp/qmp.socket,server,nowait \
>>> > -monitor telnet:127.0.0.1:5555,server,nowait \
>>> > -device pcie-root-port,id=root0,multifunction=on,chassis=0,addr=0xa \
>>> > -device pcie-root-port,id=root1,bus=pcie.0,chassis=1 \
>>> > -device pcie-root-port,id=root2,bus=pcie.0,chassis=2 \
>>> > -netdev tap,script=/root/bin/bridge.sh,downscript=no,id=hostnet1,vhost=on \
>>> > -device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:6f:55:cc,bus=root2,primary=hostdev0 \
>>> > -device vfio-pci,host=5e:00.2,id=hostdev0,bus=root1,standby=net1 \
>>
>>Yes, that's a bit grim; it's circular dependency on the 'hostdev0' and
>>'net1' id's. cc'ing in Markus.
>
> Dan had an idea how to avoid having to specify the id for the
> virtio-net device. I'm currently looking into it, but it seems like it
> should work.
Excellent. A circular dependency between -device could only lead to
trouble.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 0/2] implement the failover feature for assigned network devices
2019-04-08 5:53 ` Markus Armbruster
@ 2019-04-08 5:53 ` Markus Armbruster
0 siblings, 0 replies; 26+ messages in thread
From: Markus Armbruster @ 2019-04-08 5:53 UTC (permalink / raw)
To: Jens Freimann
Cc: pkrempa, ehabkost, mst, armbru, qemu-devel, mdroth, liran.alon,
laine, ogerlitz, ailan, Dr. David Alan Gilbert
Jens Freimann <jfreimann@redhat.com> writes:
> On Fri, Apr 05, 2019 at 09:56:29AM +0100, Dr. David Alan Gilbert wrote:
>>* Jens Freimann (jfreimann@redhat.com) wrote:
> [...]
>>> > Commandline: There is a dependency between vfio-pci and virtio-net
>>> > devices. One points to the other via new parameters
>>> > primar=<primary qdev id> and standby='<standby qdev id>'. This means
>>> > that the primary device needs to be specified after standby device on
>>> > the qemu command line. Not sure how to solve this.
>>> >
>>> > Error handling: Patches don't cover all possible error scenarios yet.
>>> >
>>> > I have tested this with a mlx5 NIC and was able to migrate the VM with
>>> > above mentioned workarounds for open problems.
>>> >
>>> > Command line example:
>>> >
>>> > qemu-system-x86_64 -enable-kvm -m 3072 -smp 3 \
>>> > -machine q35,kernel-irqchip=split -cpu host \
>>> > -k fr \
>>> > -serial stdio \
>>> > -net none \
>>> > -qmp unix:/tmp/qmp.socket,server,nowait \
>>> > -monitor telnet:127.0.0.1:5555,server,nowait \
>>> > -device pcie-root-port,id=root0,multifunction=on,chassis=0,addr=0xa \
>>> > -device pcie-root-port,id=root1,bus=pcie.0,chassis=1 \
>>> > -device pcie-root-port,id=root2,bus=pcie.0,chassis=2 \
>>> > -netdev tap,script=/root/bin/bridge.sh,downscript=no,id=hostnet1,vhost=on \
>>> > -device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:6f:55:cc,bus=root2,primary=hostdev0 \
>>> > -device vfio-pci,host=5e:00.2,id=hostdev0,bus=root1,standby=net1 \
>>
>>Yes, that's a bit grim; it's circular dependency on the 'hostdev0' and
>>'net1' id's. cc'ing in Markus.
>
> Dan had an idea how to avoid having to specify the id for the
> virtio-net device. I'm currently looking into it, but it seems like it
> should work.
Excellent. A circular dependency between -device could only lead to
trouble.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 0/2] implement the failover feature for assigned network devices
2019-04-05 23:22 ` Michael S. Tsirkin
2019-04-05 23:22 ` Michael S. Tsirkin
2019-04-05 23:46 ` Eduardo Habkost
@ 2019-04-08 9:16 ` Dr. David Alan Gilbert
2019-04-08 9:16 ` Dr. David Alan Gilbert
` (2 more replies)
2019-05-29 0:35 ` si-wei liu
3 siblings, 3 replies; 26+ messages in thread
From: Dr. David Alan Gilbert @ 2019-04-08 9:16 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Jens Freimann, armbru, qemu-devel, pkrempa, ehabkost, mdroth,
liran.alon, laine, ogerlitz, ailan
* Michael S. Tsirkin (mst@redhat.com) wrote:
> On Fri, Apr 05, 2019 at 09:56:29AM +0100, Dr. David Alan Gilbert wrote:
> > * Jens Freimann (jfreimann@redhat.com) wrote:
> > > ping
> > >
> > > FYI: I'm also working on a few related tools to detect driver behaviour when
> > > assigning a MAC to the vf device. Code is at https://github.com/jensfr/netfailover_driver_detect
> >
> > Hi Jens,
> > I've not been following this too uch, but:
> >
> > > regards,
> > > Jens
> > >
> > > On Fri, Mar 22, 2019 at 02:44:45PM +0100, Jens Freimann wrote:
> > > > This is another attempt at implementing the host side of the
> > > > net_failover concept
> > > > (https://www.kernel.org/doc/html/latest/networking/net_failover.html)
> > > >
> > > > The general idea is that we have a pair of devices, a vfio-pci and a
> > > > emulated device. Before migration the vfio device is unplugged and data
> > > > flows to the emulated device, on the target side another vfio-pci device
> > > > is plugged in to take over the data-path. In the guest the net_failover
> > > > module will pair net devices with the same MAC address.
> > > >
> > > > * In the first patch the infrastructure for hiding the device is added
> > > > for the qbus and qdev APIs. A "hidden" boolean is added to the device
> > > > state and it is set based on a callback to the standby device which
> > > > registers itself for handling the assessment: "should the primary device
> > > > be hidden?" by cross validating the ids of the devices.
> > > >
> > > > * In the second patch the virtio-net uses the API to hide the vfio
> > > > device and unhides it when the feature is acked.
> > > >
> > > > Previous discussion: https://patchwork.ozlabs.org/cover/989098/
> > > >
> > > > To summarize concerns/feedback from previous discussion:
> > > > 1.- guest OS can reject or worse _delay_ unplug by any amount of time.
> > > > Migration might get stuck for unpredictable time with unclear reason.
> > > > This approach combines two tricky things, hot/unplug and migration.
> > > > -> We can surprise-remove the PCI device and in QEMU we can do all
> > > > necessary rollbacks transparent to management software. Will it be
> > > > easy, probably not.
> >
> > This sounds 'fun' - bonus cases are things like what happens if the
> > guest gets rebooted somewhere during the process or if it's currently
> > sitting in the bios/grub/etc
>
> Um, during which process? Guests are gradually fixed to support
> surprise removal well. Part of it is thunderbolt which makes
> it incredibly easy. Yes - bios/grub will need to learn to
> handle this well.
Ignoring the actual mechanism of the unplug itself; there are probably
loads of cases; e.g.
running with both cards
hot unplug real card
start migration
guest reboots
Kernel sees only the virtio card
migration completes
hotadd the real card back
so the guest has to know to pair the real card even though it booted
with only the virtio card.
I'm sure there are loads of other corners.
> > > > 2. PCI devices are a precious ressource. The primary device should never
> > > > be added to QEMU if it won't be used by guest instead of hiding it in
> > > > QEMU.
> > > > -> We only hotplug the device when the standby feature bit was
> > > > negotiated. We save the device cmdline options until we need it for
> > > > qdev_device_add()
> > > > Hiding a device can be a useful concept to model. For example a
> > > > pci device in a powered-off slot could be marked as hidden until the slot is
> > > > powered on (mst).
> >
> > Are they really that precious? Personally it's not something I'd worry
> > about.
> >
> > > > 3. Management layer software should handle this. Open Stack already has
> > > > components/code to handle unplug/replug VFIO devices and metadata to
> > > > provide to the guest for detecting which devices should be paired.
> > > > -> An approach that includes all software from firmware to
> > > > higher-level management software wasn't tried in the last years. This is
> > > > an attempt to keep it simple and contained in QEMU as much as possible.
> > > > 4. Hotplugging a device and then making it part of a failover setup is
> > > > not possible
> > > > -> addressed by extending qdev hotplug functions to check for hidden
> > > > attribute, so e.g. device_add can be used to plug a device.
> > > >
> > > > There are still some open issues:
> > > >
> > > > Migration: I'm looking for something like a pre-migration hook that I
> > > > could use to unplug the vfio-pci device. I tried with a migration
> > > > notifier but it is called to late, i.e. after migration is aborted due
> > > > to vfio-pci marked unmigrateable. I worked around this by setting it
> > > > to migrateable and used a migration notifier on the virtio-net device.
> >
> > Why not just let this happen at the libvirt level; then you do the
> > hotunplug etc before you actually tell qemu anything about starting a
> > migration?
>
> If qemu frees up resources (as it does on unplug) then libvirt
> is not guaranteed it can roll the change back on e.g.
> migration failure.
Can you explain this in a bit more detail please; do you mean if it
frees the netdev?
> But really another issue is simply that it's a mechanism,
> there's no policy that management needs to decide on.
> Doing it at lowest possible level ensures all
> upper layers benefit with minimal pain.
Network setups in things like OpenStack can be really complex and
involve interacting with switches etc - something somewhere might
have to reconfigure a switch when you pull the real card.
Dave
> > > > Commandline: There is a dependency between vfio-pci and virtio-net
> > > > devices. One points to the other via new parameters
> > > > primar=<primary qdev id> and standby='<standby qdev id>'. This means
> > > > that the primary device needs to be specified after standby device on
> > > > the qemu command line. Not sure how to solve this.
> > > >
> > > > Error handling: Patches don't cover all possible error scenarios yet.
> > > >
> > > > I have tested this with a mlx5 NIC and was able to migrate the VM with
> > > > above mentioned workarounds for open problems.
> > > >
> > > > Command line example:
> > > >
> > > > qemu-system-x86_64 -enable-kvm -m 3072 -smp 3 \
> > > > -machine q35,kernel-irqchip=split -cpu host \
> > > > -k fr \
> > > > -serial stdio \
> > > > -net none \
> > > > -qmp unix:/tmp/qmp.socket,server,nowait \
> > > > -monitor telnet:127.0.0.1:5555,server,nowait \
> > > > -device pcie-root-port,id=root0,multifunction=on,chassis=0,addr=0xa \
> > > > -device pcie-root-port,id=root1,bus=pcie.0,chassis=1 \
> > > > -device pcie-root-port,id=root2,bus=pcie.0,chassis=2 \
> > > > -netdev tap,script=/root/bin/bridge.sh,downscript=no,id=hostnet1,vhost=on \
> > > > -device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:6f:55:cc,bus=root2,primary=hostdev0 \
> > > > -device vfio-pci,host=5e:00.2,id=hostdev0,bus=root1,standby=net1 \
> >
> > Yes, that's a bit grim; it's circular dependency on the 'hostdev0' and
> > 'net1' id's. cc'ing in Markus.
> >
> > Dave
> >
> > > > /root/rhel-guest-image-8.0-1781.x86_64.qcow2
> > > >
> > > > I'm grateful for any remarks or ideas!
> > > >
> > > > Thanks!
> > > >
> > > > regards,
> > > > Jens
> > > >
> > > > Sameeh Jubran (2):
> > > > qdev/qbus: Add hidden device support
> > > > net/virtio: add failover support
> > > >
> > > > hw/core/qdev.c | 27 ++++++++++
> > > > hw/net/virtio-net.c | 95 ++++++++++++++++++++++++++++++++++
> > > > hw/pci/pci.c | 1 +
> > > > include/hw/pci/pci.h | 2 +
> > > > include/hw/qdev-core.h | 8 +++
> > > > include/hw/virtio/virtio-net.h | 7 +++
> > > > qdev-monitor.c | 48 +++++++++++++++--
> > > > vl.c | 7 ++-
> > > > 8 files changed, 189 insertions(+), 6 deletions(-)
> > > >
> > > > --
> > > > 2.20.1
> > > >
> > > >
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 0/2] implement the failover feature for assigned network devices
2019-04-08 9:16 ` Dr. David Alan Gilbert
@ 2019-04-08 9:16 ` Dr. David Alan Gilbert
2019-04-08 13:00 ` Jens Freimann
2019-04-08 13:22 ` Michael S. Tsirkin
2 siblings, 0 replies; 26+ messages in thread
From: Dr. David Alan Gilbert @ 2019-04-08 9:16 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: pkrempa, ehabkost, qemu-devel, mdroth, armbru, liran.alon, laine,
ogerlitz, Jens Freimann, ailan
* Michael S. Tsirkin (mst@redhat.com) wrote:
> On Fri, Apr 05, 2019 at 09:56:29AM +0100, Dr. David Alan Gilbert wrote:
> > * Jens Freimann (jfreimann@redhat.com) wrote:
> > > ping
> > >
> > > FYI: I'm also working on a few related tools to detect driver behaviour when
> > > assigning a MAC to the vf device. Code is at https://github.com/jensfr/netfailover_driver_detect
> >
> > Hi Jens,
> > I've not been following this too uch, but:
> >
> > > regards,
> > > Jens
> > >
> > > On Fri, Mar 22, 2019 at 02:44:45PM +0100, Jens Freimann wrote:
> > > > This is another attempt at implementing the host side of the
> > > > net_failover concept
> > > > (https://www.kernel.org/doc/html/latest/networking/net_failover.html)
> > > >
> > > > The general idea is that we have a pair of devices, a vfio-pci and a
> > > > emulated device. Before migration the vfio device is unplugged and data
> > > > flows to the emulated device, on the target side another vfio-pci device
> > > > is plugged in to take over the data-path. In the guest the net_failover
> > > > module will pair net devices with the same MAC address.
> > > >
> > > > * In the first patch the infrastructure for hiding the device is added
> > > > for the qbus and qdev APIs. A "hidden" boolean is added to the device
> > > > state and it is set based on a callback to the standby device which
> > > > registers itself for handling the assessment: "should the primary device
> > > > be hidden?" by cross validating the ids of the devices.
> > > >
> > > > * In the second patch the virtio-net uses the API to hide the vfio
> > > > device and unhides it when the feature is acked.
> > > >
> > > > Previous discussion: https://patchwork.ozlabs.org/cover/989098/
> > > >
> > > > To summarize concerns/feedback from previous discussion:
> > > > 1.- guest OS can reject or worse _delay_ unplug by any amount of time.
> > > > Migration might get stuck for unpredictable time with unclear reason.
> > > > This approach combines two tricky things, hot/unplug and migration.
> > > > -> We can surprise-remove the PCI device and in QEMU we can do all
> > > > necessary rollbacks transparent to management software. Will it be
> > > > easy, probably not.
> >
> > This sounds 'fun' - bonus cases are things like what happens if the
> > guest gets rebooted somewhere during the process or if it's currently
> > sitting in the bios/grub/etc
>
> Um, during which process? Guests are gradually fixed to support
> surprise removal well. Part of it is thunderbolt which makes
> it incredibly easy. Yes - bios/grub will need to learn to
> handle this well.
Ignoring the actual mechanism of the unplug itself; there are probably
loads of cases; e.g.
running with both cards
hot unplug real card
start migration
guest reboots
Kernel sees only the virtio card
migration completes
hotadd the real card back
so the guest has to know to pair the real card even though it booted
with only the virtio card.
I'm sure there are loads of other corners.
> > > > 2. PCI devices are a precious ressource. The primary device should never
> > > > be added to QEMU if it won't be used by guest instead of hiding it in
> > > > QEMU.
> > > > -> We only hotplug the device when the standby feature bit was
> > > > negotiated. We save the device cmdline options until we need it for
> > > > qdev_device_add()
> > > > Hiding a device can be a useful concept to model. For example a
> > > > pci device in a powered-off slot could be marked as hidden until the slot is
> > > > powered on (mst).
> >
> > Are they really that precious? Personally it's not something I'd worry
> > about.
> >
> > > > 3. Management layer software should handle this. Open Stack already has
> > > > components/code to handle unplug/replug VFIO devices and metadata to
> > > > provide to the guest for detecting which devices should be paired.
> > > > -> An approach that includes all software from firmware to
> > > > higher-level management software wasn't tried in the last years. This is
> > > > an attempt to keep it simple and contained in QEMU as much as possible.
> > > > 4. Hotplugging a device and then making it part of a failover setup is
> > > > not possible
> > > > -> addressed by extending qdev hotplug functions to check for hidden
> > > > attribute, so e.g. device_add can be used to plug a device.
> > > >
> > > > There are still some open issues:
> > > >
> > > > Migration: I'm looking for something like a pre-migration hook that I
> > > > could use to unplug the vfio-pci device. I tried with a migration
> > > > notifier but it is called to late, i.e. after migration is aborted due
> > > > to vfio-pci marked unmigrateable. I worked around this by setting it
> > > > to migrateable and used a migration notifier on the virtio-net device.
> >
> > Why not just let this happen at the libvirt level; then you do the
> > hotunplug etc before you actually tell qemu anything about starting a
> > migration?
>
> If qemu frees up resources (as it does on unplug) then libvirt
> is not guaranteed it can roll the change back on e.g.
> migration failure.
Can you explain this in a bit more detail please; do you mean if it
frees the netdev?
> But really another issue is simply that it's a mechanism,
> there's no policy that management needs to decide on.
> Doing it at lowest possible level ensures all
> upper layers benefit with minimal pain.
Network setups in things like OpenStack can be really complex and
involve interacting with switches etc - something somewhere might
have to reconfigure a switch when you pull the real card.
Dave
> > > > Commandline: There is a dependency between vfio-pci and virtio-net
> > > > devices. One points to the other via new parameters
> > > > primar=<primary qdev id> and standby='<standby qdev id>'. This means
> > > > that the primary device needs to be specified after standby device on
> > > > the qemu command line. Not sure how to solve this.
> > > >
> > > > Error handling: Patches don't cover all possible error scenarios yet.
> > > >
> > > > I have tested this with a mlx5 NIC and was able to migrate the VM with
> > > > above mentioned workarounds for open problems.
> > > >
> > > > Command line example:
> > > >
> > > > qemu-system-x86_64 -enable-kvm -m 3072 -smp 3 \
> > > > -machine q35,kernel-irqchip=split -cpu host \
> > > > -k fr \
> > > > -serial stdio \
> > > > -net none \
> > > > -qmp unix:/tmp/qmp.socket,server,nowait \
> > > > -monitor telnet:127.0.0.1:5555,server,nowait \
> > > > -device pcie-root-port,id=root0,multifunction=on,chassis=0,addr=0xa \
> > > > -device pcie-root-port,id=root1,bus=pcie.0,chassis=1 \
> > > > -device pcie-root-port,id=root2,bus=pcie.0,chassis=2 \
> > > > -netdev tap,script=/root/bin/bridge.sh,downscript=no,id=hostnet1,vhost=on \
> > > > -device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:6f:55:cc,bus=root2,primary=hostdev0 \
> > > > -device vfio-pci,host=5e:00.2,id=hostdev0,bus=root1,standby=net1 \
> >
> > Yes, that's a bit grim; it's circular dependency on the 'hostdev0' and
> > 'net1' id's. cc'ing in Markus.
> >
> > Dave
> >
> > > > /root/rhel-guest-image-8.0-1781.x86_64.qcow2
> > > >
> > > > I'm grateful for any remarks or ideas!
> > > >
> > > > Thanks!
> > > >
> > > > regards,
> > > > Jens
> > > >
> > > > Sameeh Jubran (2):
> > > > qdev/qbus: Add hidden device support
> > > > net/virtio: add failover support
> > > >
> > > > hw/core/qdev.c | 27 ++++++++++
> > > > hw/net/virtio-net.c | 95 ++++++++++++++++++++++++++++++++++
> > > > hw/pci/pci.c | 1 +
> > > > include/hw/pci/pci.h | 2 +
> > > > include/hw/qdev-core.h | 8 +++
> > > > include/hw/virtio/virtio-net.h | 7 +++
> > > > qdev-monitor.c | 48 +++++++++++++++--
> > > > vl.c | 7 ++-
> > > > 8 files changed, 189 insertions(+), 6 deletions(-)
> > > >
> > > > --
> > > > 2.20.1
> > > >
> > > >
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 0/2] implement the failover feature for assigned network devices
2019-04-08 9:16 ` Dr. David Alan Gilbert
2019-04-08 9:16 ` Dr. David Alan Gilbert
@ 2019-04-08 13:00 ` Jens Freimann
2019-04-08 13:00 ` Jens Freimann
2019-04-08 17:00 ` Dr. David Alan Gilbert
2019-04-08 13:22 ` Michael S. Tsirkin
2 siblings, 2 replies; 26+ messages in thread
From: Jens Freimann @ 2019-04-08 13:00 UTC (permalink / raw)
To: Dr. David Alan Gilbert
Cc: Michael S. Tsirkin, armbru, qemu-devel, pkrempa, ehabkost, mdroth,
liran.alon, laine, ogerlitz, ailan
On Mon, Apr 08, 2019 at 10:16:50AM +0100, Dr. David Alan Gilbert wrote:
>* Michael S. Tsirkin (mst@redhat.com) wrote:
>> On Fri, Apr 05, 2019 at 09:56:29AM +0100, Dr. David Alan Gilbert wrote:
>> > * Jens Freimann (jfreimann@redhat.com) wrote:
>> > > On Fri, Mar 22, 2019 at 02:44:45PM +0100, Jens Freimann wrote:
>> > > > This is another attempt at implementing the host side of the
>> > > > net_failover concept
>> > > > (https://www.kernel.org/doc/html/latest/networking/net_failover.html)
>> > > >
>> > > > The general idea is that we have a pair of devices, a vfio-pci and a
>> > > > emulated device. Before migration the vfio device is unplugged and data
>> > > > flows to the emulated device, on the target side another vfio-pci device
>> > > > is plugged in to take over the data-path. In the guest the net_failover
>> > > > module will pair net devices with the same MAC address.
>> > > >
>> > > > * In the first patch the infrastructure for hiding the device is added
>> > > > for the qbus and qdev APIs. A "hidden" boolean is added to the device
>> > > > state and it is set based on a callback to the standby device which
>> > > > registers itself for handling the assessment: "should the primary device
>> > > > be hidden?" by cross validating the ids of the devices.
>> > > >
>> > > > * In the second patch the virtio-net uses the API to hide the vfio
>> > > > device and unhides it when the feature is acked.
>> > > >
>> > > > Previous discussion: https://patchwork.ozlabs.org/cover/989098/
>> > > >
>> > > > To summarize concerns/feedback from previous discussion:
>> > > > 1.- guest OS can reject or worse _delay_ unplug by any amount of time.
>> > > > Migration might get stuck for unpredictable time with unclear reason.
>> > > > This approach combines two tricky things, hot/unplug and migration.
>> > > > -> We can surprise-remove the PCI device and in QEMU we can do all
>> > > > necessary rollbacks transparent to management software. Will it be
>> > > > easy, probably not.
>> >
>> > This sounds 'fun' - bonus cases are things like what happens if the
>> > guest gets rebooted somewhere during the process or if it's currently
>> > sitting in the bios/grub/etc
>>
>> Um, during which process? Guests are gradually fixed to support
>> surprise removal well. Part of it is thunderbolt which makes
>> it incredibly easy. Yes - bios/grub will need to learn to
>> handle this well.
>
>Ignoring the actual mechanism of the unplug itself; there are probably
>loads of cases; e.g.
>
> running with both cards
> hot unplug real card
> start migration
> guest reboots
> Kernel sees only the virtio card
> migration completes
> hotadd the real card back
>
>so the guest has to know to pair the real card even though it booted
>with only the virtio card.
Maybe I misunderstand, but, when the 'real card' is added back after
migration the net_failover driver in the guest will know to pair it
with the virtio card because they have the same MAC address. Did you
mean something else?
>I'm sure there are loads of other corners.
Probably yes.
regards,
Jens
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 0/2] implement the failover feature for assigned network devices
2019-04-08 13:00 ` Jens Freimann
@ 2019-04-08 13:00 ` Jens Freimann
2019-04-08 17:00 ` Dr. David Alan Gilbert
1 sibling, 0 replies; 26+ messages in thread
From: Jens Freimann @ 2019-04-08 13:00 UTC (permalink / raw)
To: Dr. David Alan Gilbert
Cc: pkrempa, ehabkost, Michael S. Tsirkin, armbru, qemu-devel, mdroth,
liran.alon, laine, ogerlitz, ailan
On Mon, Apr 08, 2019 at 10:16:50AM +0100, Dr. David Alan Gilbert wrote:
>* Michael S. Tsirkin (mst@redhat.com) wrote:
>> On Fri, Apr 05, 2019 at 09:56:29AM +0100, Dr. David Alan Gilbert wrote:
>> > * Jens Freimann (jfreimann@redhat.com) wrote:
>> > > On Fri, Mar 22, 2019 at 02:44:45PM +0100, Jens Freimann wrote:
>> > > > This is another attempt at implementing the host side of the
>> > > > net_failover concept
>> > > > (https://www.kernel.org/doc/html/latest/networking/net_failover.html)
>> > > >
>> > > > The general idea is that we have a pair of devices, a vfio-pci and a
>> > > > emulated device. Before migration the vfio device is unplugged and data
>> > > > flows to the emulated device, on the target side another vfio-pci device
>> > > > is plugged in to take over the data-path. In the guest the net_failover
>> > > > module will pair net devices with the same MAC address.
>> > > >
>> > > > * In the first patch the infrastructure for hiding the device is added
>> > > > for the qbus and qdev APIs. A "hidden" boolean is added to the device
>> > > > state and it is set based on a callback to the standby device which
>> > > > registers itself for handling the assessment: "should the primary device
>> > > > be hidden?" by cross validating the ids of the devices.
>> > > >
>> > > > * In the second patch the virtio-net uses the API to hide the vfio
>> > > > device and unhides it when the feature is acked.
>> > > >
>> > > > Previous discussion: https://patchwork.ozlabs.org/cover/989098/
>> > > >
>> > > > To summarize concerns/feedback from previous discussion:
>> > > > 1.- guest OS can reject or worse _delay_ unplug by any amount of time.
>> > > > Migration might get stuck for unpredictable time with unclear reason.
>> > > > This approach combines two tricky things, hot/unplug and migration.
>> > > > -> We can surprise-remove the PCI device and in QEMU we can do all
>> > > > necessary rollbacks transparent to management software. Will it be
>> > > > easy, probably not.
>> >
>> > This sounds 'fun' - bonus cases are things like what happens if the
>> > guest gets rebooted somewhere during the process or if it's currently
>> > sitting in the bios/grub/etc
>>
>> Um, during which process? Guests are gradually fixed to support
>> surprise removal well. Part of it is thunderbolt which makes
>> it incredibly easy. Yes - bios/grub will need to learn to
>> handle this well.
>
>Ignoring the actual mechanism of the unplug itself; there are probably
>loads of cases; e.g.
>
> running with both cards
> hot unplug real card
> start migration
> guest reboots
> Kernel sees only the virtio card
> migration completes
> hotadd the real card back
>
>so the guest has to know to pair the real card even though it booted
>with only the virtio card.
Maybe I misunderstand, but, when the 'real card' is added back after
migration the net_failover driver in the guest will know to pair it
with the virtio card because they have the same MAC address. Did you
mean something else?
>I'm sure there are loads of other corners.
Probably yes.
regards,
Jens
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 0/2] implement the failover feature for assigned network devices
2019-04-08 9:16 ` Dr. David Alan Gilbert
2019-04-08 9:16 ` Dr. David Alan Gilbert
2019-04-08 13:00 ` Jens Freimann
@ 2019-04-08 13:22 ` Michael S. Tsirkin
2019-04-08 13:22 ` Michael S. Tsirkin
2 siblings, 1 reply; 26+ messages in thread
From: Michael S. Tsirkin @ 2019-04-08 13:22 UTC (permalink / raw)
To: Dr. David Alan Gilbert
Cc: Jens Freimann, armbru, qemu-devel, pkrempa, ehabkost, mdroth,
liran.alon, laine, ogerlitz, ailan
On Mon, Apr 08, 2019 at 10:16:50AM +0100, Dr. David Alan Gilbert wrote:
> * Michael S. Tsirkin (mst@redhat.com) wrote:
> > On Fri, Apr 05, 2019 at 09:56:29AM +0100, Dr. David Alan Gilbert wrote:
> > > * Jens Freimann (jfreimann@redhat.com) wrote:
> > > > ping
> > > >
> > > > FYI: I'm also working on a few related tools to detect driver behaviour when
> > > > assigning a MAC to the vf device. Code is at https://github.com/jensfr/netfailover_driver_detect
> > >
> > > Hi Jens,
> > > I've not been following this too uch, but:
> > >
> > > > regards,
> > > > Jens
> > > >
> > > > On Fri, Mar 22, 2019 at 02:44:45PM +0100, Jens Freimann wrote:
> > > > > This is another attempt at implementing the host side of the
> > > > > net_failover concept
> > > > > (https://www.kernel.org/doc/html/latest/networking/net_failover.html)
> > > > >
> > > > > The general idea is that we have a pair of devices, a vfio-pci and a
> > > > > emulated device. Before migration the vfio device is unplugged and data
> > > > > flows to the emulated device, on the target side another vfio-pci device
> > > > > is plugged in to take over the data-path. In the guest the net_failover
> > > > > module will pair net devices with the same MAC address.
> > > > >
> > > > > * In the first patch the infrastructure for hiding the device is added
> > > > > for the qbus and qdev APIs. A "hidden" boolean is added to the device
> > > > > state and it is set based on a callback to the standby device which
> > > > > registers itself for handling the assessment: "should the primary device
> > > > > be hidden?" by cross validating the ids of the devices.
> > > > >
> > > > > * In the second patch the virtio-net uses the API to hide the vfio
> > > > > device and unhides it when the feature is acked.
> > > > >
> > > > > Previous discussion: https://patchwork.ozlabs.org/cover/989098/
> > > > >
> > > > > To summarize concerns/feedback from previous discussion:
> > > > > 1.- guest OS can reject or worse _delay_ unplug by any amount of time.
> > > > > Migration might get stuck for unpredictable time with unclear reason.
> > > > > This approach combines two tricky things, hot/unplug and migration.
> > > > > -> We can surprise-remove the PCI device and in QEMU we can do all
> > > > > necessary rollbacks transparent to management software. Will it be
> > > > > easy, probably not.
> > >
> > > This sounds 'fun' - bonus cases are things like what happens if the
> > > guest gets rebooted somewhere during the process or if it's currently
> > > sitting in the bios/grub/etc
> >
> > Um, during which process? Guests are gradually fixed to support
> > surprise removal well. Part of it is thunderbolt which makes
> > it incredibly easy. Yes - bios/grub will need to learn to
> > handle this well.
>
> Ignoring the actual mechanism of the unplug itself; there are probably
> loads of cases; e.g.
>
> running with both cards
> hot unplug real card
> start migration
> guest reboots
> Kernel sees only the virtio card
> migration completes
> hotadd the real card back
>
> so the guest has to know to pair the real card even though it booted
> with only the virtio card.
Yes - that's why virtio has a flag to notify guest there
will be a pair even if it's not there at the moment.
> I'm sure there are loads of other corners.
>
> > > > > 2. PCI devices are a precious ressource. The primary device should never
> > > > > be added to QEMU if it won't be used by guest instead of hiding it in
> > > > > QEMU.
> > > > > -> We only hotplug the device when the standby feature bit was
> > > > > negotiated. We save the device cmdline options until we need it for
> > > > > qdev_device_add()
> > > > > Hiding a device can be a useful concept to model. For example a
> > > > > pci device in a powered-off slot could be marked as hidden until the slot is
> > > > > powered on (mst).
> > >
> > > Are they really that precious? Personally it's not something I'd worry
> > > about.
> > >
> > > > > 3. Management layer software should handle this. Open Stack already has
> > > > > components/code to handle unplug/replug VFIO devices and metadata to
> > > > > provide to the guest for detecting which devices should be paired.
> > > > > -> An approach that includes all software from firmware to
> > > > > higher-level management software wasn't tried in the last years. This is
> > > > > an attempt to keep it simple and contained in QEMU as much as possible.
> > > > > 4. Hotplugging a device and then making it part of a failover setup is
> > > > > not possible
> > > > > -> addressed by extending qdev hotplug functions to check for hidden
> > > > > attribute, so e.g. device_add can be used to plug a device.
> > > > >
> > > > > There are still some open issues:
> > > > >
> > > > > Migration: I'm looking for something like a pre-migration hook that I
> > > > > could use to unplug the vfio-pci device. I tried with a migration
> > > > > notifier but it is called to late, i.e. after migration is aborted due
> > > > > to vfio-pci marked unmigrateable. I worked around this by setting it
> > > > > to migrateable and used a migration notifier on the virtio-net device.
> > >
> > > Why not just let this happen at the libvirt level; then you do the
> > > hotunplug etc before you actually tell qemu anything about starting a
> > > migration?
> >
> > If qemu frees up resources (as it does on unplug) then libvirt
> > is not guaranteed it can roll the change back on e.g.
> > migration failure.
>
> Can you explain this in a bit more detail please; do you mean if it
> frees the netdev?
>
> > But really another issue is simply that it's a mechanism,
> > there's no policy that management needs to decide on.
> > Doing it at lowest possible level ensures all
> > upper layers benefit with minimal pain.
>
> Network setups in things like OpenStack can be really complex and
> involve interacting with switches etc - something somewhere might
> have to reconfigure a switch when you pull the real card.
>
> Dave
>
> > > > > Commandline: There is a dependency between vfio-pci and virtio-net
> > > > > devices. One points to the other via new parameters
> > > > > primar=<primary qdev id> and standby='<standby qdev id>'. This means
> > > > > that the primary device needs to be specified after standby device on
> > > > > the qemu command line. Not sure how to solve this.
> > > > >
> > > > > Error handling: Patches don't cover all possible error scenarios yet.
> > > > >
> > > > > I have tested this with a mlx5 NIC and was able to migrate the VM with
> > > > > above mentioned workarounds for open problems.
> > > > >
> > > > > Command line example:
> > > > >
> > > > > qemu-system-x86_64 -enable-kvm -m 3072 -smp 3 \
> > > > > -machine q35,kernel-irqchip=split -cpu host \
> > > > > -k fr \
> > > > > -serial stdio \
> > > > > -net none \
> > > > > -qmp unix:/tmp/qmp.socket,server,nowait \
> > > > > -monitor telnet:127.0.0.1:5555,server,nowait \
> > > > > -device pcie-root-port,id=root0,multifunction=on,chassis=0,addr=0xa \
> > > > > -device pcie-root-port,id=root1,bus=pcie.0,chassis=1 \
> > > > > -device pcie-root-port,id=root2,bus=pcie.0,chassis=2 \
> > > > > -netdev tap,script=/root/bin/bridge.sh,downscript=no,id=hostnet1,vhost=on \
> > > > > -device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:6f:55:cc,bus=root2,primary=hostdev0 \
> > > > > -device vfio-pci,host=5e:00.2,id=hostdev0,bus=root1,standby=net1 \
> > >
> > > Yes, that's a bit grim; it's circular dependency on the 'hostdev0' and
> > > 'net1' id's. cc'ing in Markus.
> > >
> > > Dave
> > >
> > > > > /root/rhel-guest-image-8.0-1781.x86_64.qcow2
> > > > >
> > > > > I'm grateful for any remarks or ideas!
> > > > >
> > > > > Thanks!
> > > > >
> > > > > regards,
> > > > > Jens
> > > > >
> > > > > Sameeh Jubran (2):
> > > > > qdev/qbus: Add hidden device support
> > > > > net/virtio: add failover support
> > > > >
> > > > > hw/core/qdev.c | 27 ++++++++++
> > > > > hw/net/virtio-net.c | 95 ++++++++++++++++++++++++++++++++++
> > > > > hw/pci/pci.c | 1 +
> > > > > include/hw/pci/pci.h | 2 +
> > > > > include/hw/qdev-core.h | 8 +++
> > > > > include/hw/virtio/virtio-net.h | 7 +++
> > > > > qdev-monitor.c | 48 +++++++++++++++--
> > > > > vl.c | 7 ++-
> > > > > 8 files changed, 189 insertions(+), 6 deletions(-)
> > > > >
> > > > > --
> > > > > 2.20.1
> > > > >
> > > > >
> > > --
> > > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 0/2] implement the failover feature for assigned network devices
2019-04-08 13:22 ` Michael S. Tsirkin
@ 2019-04-08 13:22 ` Michael S. Tsirkin
0 siblings, 0 replies; 26+ messages in thread
From: Michael S. Tsirkin @ 2019-04-08 13:22 UTC (permalink / raw)
To: Dr. David Alan Gilbert
Cc: pkrempa, ehabkost, qemu-devel, mdroth, armbru, liran.alon, laine,
ogerlitz, Jens Freimann, ailan
On Mon, Apr 08, 2019 at 10:16:50AM +0100, Dr. David Alan Gilbert wrote:
> * Michael S. Tsirkin (mst@redhat.com) wrote:
> > On Fri, Apr 05, 2019 at 09:56:29AM +0100, Dr. David Alan Gilbert wrote:
> > > * Jens Freimann (jfreimann@redhat.com) wrote:
> > > > ping
> > > >
> > > > FYI: I'm also working on a few related tools to detect driver behaviour when
> > > > assigning a MAC to the vf device. Code is at https://github.com/jensfr/netfailover_driver_detect
> > >
> > > Hi Jens,
> > > I've not been following this too uch, but:
> > >
> > > > regards,
> > > > Jens
> > > >
> > > > On Fri, Mar 22, 2019 at 02:44:45PM +0100, Jens Freimann wrote:
> > > > > This is another attempt at implementing the host side of the
> > > > > net_failover concept
> > > > > (https://www.kernel.org/doc/html/latest/networking/net_failover.html)
> > > > >
> > > > > The general idea is that we have a pair of devices, a vfio-pci and a
> > > > > emulated device. Before migration the vfio device is unplugged and data
> > > > > flows to the emulated device, on the target side another vfio-pci device
> > > > > is plugged in to take over the data-path. In the guest the net_failover
> > > > > module will pair net devices with the same MAC address.
> > > > >
> > > > > * In the first patch the infrastructure for hiding the device is added
> > > > > for the qbus and qdev APIs. A "hidden" boolean is added to the device
> > > > > state and it is set based on a callback to the standby device which
> > > > > registers itself for handling the assessment: "should the primary device
> > > > > be hidden?" by cross validating the ids of the devices.
> > > > >
> > > > > * In the second patch the virtio-net uses the API to hide the vfio
> > > > > device and unhides it when the feature is acked.
> > > > >
> > > > > Previous discussion: https://patchwork.ozlabs.org/cover/989098/
> > > > >
> > > > > To summarize concerns/feedback from previous discussion:
> > > > > 1.- guest OS can reject or worse _delay_ unplug by any amount of time.
> > > > > Migration might get stuck for unpredictable time with unclear reason.
> > > > > This approach combines two tricky things, hot/unplug and migration.
> > > > > -> We can surprise-remove the PCI device and in QEMU we can do all
> > > > > necessary rollbacks transparent to management software. Will it be
> > > > > easy, probably not.
> > >
> > > This sounds 'fun' - bonus cases are things like what happens if the
> > > guest gets rebooted somewhere during the process or if it's currently
> > > sitting in the bios/grub/etc
> >
> > Um, during which process? Guests are gradually fixed to support
> > surprise removal well. Part of it is thunderbolt which makes
> > it incredibly easy. Yes - bios/grub will need to learn to
> > handle this well.
>
> Ignoring the actual mechanism of the unplug itself; there are probably
> loads of cases; e.g.
>
> running with both cards
> hot unplug real card
> start migration
> guest reboots
> Kernel sees only the virtio card
> migration completes
> hotadd the real card back
>
> so the guest has to know to pair the real card even though it booted
> with only the virtio card.
Yes - that's why virtio has a flag to notify guest there
will be a pair even if it's not there at the moment.
> I'm sure there are loads of other corners.
>
> > > > > 2. PCI devices are a precious ressource. The primary device should never
> > > > > be added to QEMU if it won't be used by guest instead of hiding it in
> > > > > QEMU.
> > > > > -> We only hotplug the device when the standby feature bit was
> > > > > negotiated. We save the device cmdline options until we need it for
> > > > > qdev_device_add()
> > > > > Hiding a device can be a useful concept to model. For example a
> > > > > pci device in a powered-off slot could be marked as hidden until the slot is
> > > > > powered on (mst).
> > >
> > > Are they really that precious? Personally it's not something I'd worry
> > > about.
> > >
> > > > > 3. Management layer software should handle this. Open Stack already has
> > > > > components/code to handle unplug/replug VFIO devices and metadata to
> > > > > provide to the guest for detecting which devices should be paired.
> > > > > -> An approach that includes all software from firmware to
> > > > > higher-level management software wasn't tried in the last years. This is
> > > > > an attempt to keep it simple and contained in QEMU as much as possible.
> > > > > 4. Hotplugging a device and then making it part of a failover setup is
> > > > > not possible
> > > > > -> addressed by extending qdev hotplug functions to check for hidden
> > > > > attribute, so e.g. device_add can be used to plug a device.
> > > > >
> > > > > There are still some open issues:
> > > > >
> > > > > Migration: I'm looking for something like a pre-migration hook that I
> > > > > could use to unplug the vfio-pci device. I tried with a migration
> > > > > notifier but it is called to late, i.e. after migration is aborted due
> > > > > to vfio-pci marked unmigrateable. I worked around this by setting it
> > > > > to migrateable and used a migration notifier on the virtio-net device.
> > >
> > > Why not just let this happen at the libvirt level; then you do the
> > > hotunplug etc before you actually tell qemu anything about starting a
> > > migration?
> >
> > If qemu frees up resources (as it does on unplug) then libvirt
> > is not guaranteed it can roll the change back on e.g.
> > migration failure.
>
> Can you explain this in a bit more detail please; do you mean if it
> frees the netdev?
>
> > But really another issue is simply that it's a mechanism,
> > there's no policy that management needs to decide on.
> > Doing it at lowest possible level ensures all
> > upper layers benefit with minimal pain.
>
> Network setups in things like OpenStack can be really complex and
> involve interacting with switches etc - something somewhere might
> have to reconfigure a switch when you pull the real card.
>
> Dave
>
> > > > > Commandline: There is a dependency between vfio-pci and virtio-net
> > > > > devices. One points to the other via new parameters
> > > > > primar=<primary qdev id> and standby='<standby qdev id>'. This means
> > > > > that the primary device needs to be specified after standby device on
> > > > > the qemu command line. Not sure how to solve this.
> > > > >
> > > > > Error handling: Patches don't cover all possible error scenarios yet.
> > > > >
> > > > > I have tested this with a mlx5 NIC and was able to migrate the VM with
> > > > > above mentioned workarounds for open problems.
> > > > >
> > > > > Command line example:
> > > > >
> > > > > qemu-system-x86_64 -enable-kvm -m 3072 -smp 3 \
> > > > > -machine q35,kernel-irqchip=split -cpu host \
> > > > > -k fr \
> > > > > -serial stdio \
> > > > > -net none \
> > > > > -qmp unix:/tmp/qmp.socket,server,nowait \
> > > > > -monitor telnet:127.0.0.1:5555,server,nowait \
> > > > > -device pcie-root-port,id=root0,multifunction=on,chassis=0,addr=0xa \
> > > > > -device pcie-root-port,id=root1,bus=pcie.0,chassis=1 \
> > > > > -device pcie-root-port,id=root2,bus=pcie.0,chassis=2 \
> > > > > -netdev tap,script=/root/bin/bridge.sh,downscript=no,id=hostnet1,vhost=on \
> > > > > -device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:6f:55:cc,bus=root2,primary=hostdev0 \
> > > > > -device vfio-pci,host=5e:00.2,id=hostdev0,bus=root1,standby=net1 \
> > >
> > > Yes, that's a bit grim; it's circular dependency on the 'hostdev0' and
> > > 'net1' id's. cc'ing in Markus.
> > >
> > > Dave
> > >
> > > > > /root/rhel-guest-image-8.0-1781.x86_64.qcow2
> > > > >
> > > > > I'm grateful for any remarks or ideas!
> > > > >
> > > > > Thanks!
> > > > >
> > > > > regards,
> > > > > Jens
> > > > >
> > > > > Sameeh Jubran (2):
> > > > > qdev/qbus: Add hidden device support
> > > > > net/virtio: add failover support
> > > > >
> > > > > hw/core/qdev.c | 27 ++++++++++
> > > > > hw/net/virtio-net.c | 95 ++++++++++++++++++++++++++++++++++
> > > > > hw/pci/pci.c | 1 +
> > > > > include/hw/pci/pci.h | 2 +
> > > > > include/hw/qdev-core.h | 8 +++
> > > > > include/hw/virtio/virtio-net.h | 7 +++
> > > > > qdev-monitor.c | 48 +++++++++++++++--
> > > > > vl.c | 7 ++-
> > > > > 8 files changed, 189 insertions(+), 6 deletions(-)
> > > > >
> > > > > --
> > > > > 2.20.1
> > > > >
> > > > >
> > > --
> > > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 0/2] implement the failover feature for assigned network devices
2019-04-08 13:00 ` Jens Freimann
2019-04-08 13:00 ` Jens Freimann
@ 2019-04-08 17:00 ` Dr. David Alan Gilbert
2019-04-08 17:00 ` Dr. David Alan Gilbert
1 sibling, 1 reply; 26+ messages in thread
From: Dr. David Alan Gilbert @ 2019-04-08 17:00 UTC (permalink / raw)
To: Jens Freimann
Cc: Michael S. Tsirkin, armbru, qemu-devel, pkrempa, ehabkost, mdroth,
liran.alon, laine, ogerlitz, ailan
* Jens Freimann (jfreimann@redhat.com) wrote:
> On Mon, Apr 08, 2019 at 10:16:50AM +0100, Dr. David Alan Gilbert wrote:
> > * Michael S. Tsirkin (mst@redhat.com) wrote:
> > > On Fri, Apr 05, 2019 at 09:56:29AM +0100, Dr. David Alan Gilbert wrote:
> > > > * Jens Freimann (jfreimann@redhat.com) wrote:
> > > > > On Fri, Mar 22, 2019 at 02:44:45PM +0100, Jens Freimann wrote:
> > > > > > This is another attempt at implementing the host side of the
> > > > > > net_failover concept
> > > > > > (https://www.kernel.org/doc/html/latest/networking/net_failover.html)
> > > > > >
> > > > > > The general idea is that we have a pair of devices, a vfio-pci and a
> > > > > > emulated device. Before migration the vfio device is unplugged and data
> > > > > > flows to the emulated device, on the target side another vfio-pci device
> > > > > > is plugged in to take over the data-path. In the guest the net_failover
> > > > > > module will pair net devices with the same MAC address.
> > > > > >
> > > > > > * In the first patch the infrastructure for hiding the device is added
> > > > > > for the qbus and qdev APIs. A "hidden" boolean is added to the device
> > > > > > state and it is set based on a callback to the standby device which
> > > > > > registers itself for handling the assessment: "should the primary device
> > > > > > be hidden?" by cross validating the ids of the devices.
> > > > > >
> > > > > > * In the second patch the virtio-net uses the API to hide the vfio
> > > > > > device and unhides it when the feature is acked.
> > > > > >
> > > > > > Previous discussion: https://patchwork.ozlabs.org/cover/989098/
> > > > > >
> > > > > > To summarize concerns/feedback from previous discussion:
> > > > > > 1.- guest OS can reject or worse _delay_ unplug by any amount of time.
> > > > > > Migration might get stuck for unpredictable time with unclear reason.
> > > > > > This approach combines two tricky things, hot/unplug and migration.
> > > > > > -> We can surprise-remove the PCI device and in QEMU we can do all
> > > > > > necessary rollbacks transparent to management software. Will it be
> > > > > > easy, probably not.
> > > >
> > > > This sounds 'fun' - bonus cases are things like what happens if the
> > > > guest gets rebooted somewhere during the process or if it's currently
> > > > sitting in the bios/grub/etc
> > >
> > > Um, during which process? Guests are gradually fixed to support
> > > surprise removal well. Part of it is thunderbolt which makes
> > > it incredibly easy. Yes - bios/grub will need to learn to
> > > handle this well.
> >
> > Ignoring the actual mechanism of the unplug itself; there are probably
> > loads of cases; e.g.
> >
> > running with both cards
> > hot unplug real card
> > start migration
> > guest reboots
> > Kernel sees only the virtio card
> > migration completes
> > hotadd the real card back
> >
> > so the guest has to know to pair the real card even though it booted
> > with only the virtio card.
>
> Maybe I misunderstand, but, when the 'real card' is added back after
> migration the net_failover driver in the guest will know to pair it
> with the virtio card because they have the same MAC address. Did you
> mean something else?
OK if it knows to do that.
> > I'm sure there are loads of other corners.
>
> Probably yes.
Yeh, that was just my worry - just there's loads of this type of corner
around reboots.
Dave
> regards,
> Jens
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 0/2] implement the failover feature for assigned network devices
2019-04-08 17:00 ` Dr. David Alan Gilbert
@ 2019-04-08 17:00 ` Dr. David Alan Gilbert
0 siblings, 0 replies; 26+ messages in thread
From: Dr. David Alan Gilbert @ 2019-04-08 17:00 UTC (permalink / raw)
To: Jens Freimann
Cc: pkrempa, ehabkost, Michael S. Tsirkin, armbru, qemu-devel, mdroth,
liran.alon, laine, ogerlitz, ailan
* Jens Freimann (jfreimann@redhat.com) wrote:
> On Mon, Apr 08, 2019 at 10:16:50AM +0100, Dr. David Alan Gilbert wrote:
> > * Michael S. Tsirkin (mst@redhat.com) wrote:
> > > On Fri, Apr 05, 2019 at 09:56:29AM +0100, Dr. David Alan Gilbert wrote:
> > > > * Jens Freimann (jfreimann@redhat.com) wrote:
> > > > > On Fri, Mar 22, 2019 at 02:44:45PM +0100, Jens Freimann wrote:
> > > > > > This is another attempt at implementing the host side of the
> > > > > > net_failover concept
> > > > > > (https://www.kernel.org/doc/html/latest/networking/net_failover.html)
> > > > > >
> > > > > > The general idea is that we have a pair of devices, a vfio-pci and a
> > > > > > emulated device. Before migration the vfio device is unplugged and data
> > > > > > flows to the emulated device, on the target side another vfio-pci device
> > > > > > is plugged in to take over the data-path. In the guest the net_failover
> > > > > > module will pair net devices with the same MAC address.
> > > > > >
> > > > > > * In the first patch the infrastructure for hiding the device is added
> > > > > > for the qbus and qdev APIs. A "hidden" boolean is added to the device
> > > > > > state and it is set based on a callback to the standby device which
> > > > > > registers itself for handling the assessment: "should the primary device
> > > > > > be hidden?" by cross validating the ids of the devices.
> > > > > >
> > > > > > * In the second patch the virtio-net uses the API to hide the vfio
> > > > > > device and unhides it when the feature is acked.
> > > > > >
> > > > > > Previous discussion: https://patchwork.ozlabs.org/cover/989098/
> > > > > >
> > > > > > To summarize concerns/feedback from previous discussion:
> > > > > > 1.- guest OS can reject or worse _delay_ unplug by any amount of time.
> > > > > > Migration might get stuck for unpredictable time with unclear reason.
> > > > > > This approach combines two tricky things, hot/unplug and migration.
> > > > > > -> We can surprise-remove the PCI device and in QEMU we can do all
> > > > > > necessary rollbacks transparent to management software. Will it be
> > > > > > easy, probably not.
> > > >
> > > > This sounds 'fun' - bonus cases are things like what happens if the
> > > > guest gets rebooted somewhere during the process or if it's currently
> > > > sitting in the bios/grub/etc
> > >
> > > Um, during which process? Guests are gradually fixed to support
> > > surprise removal well. Part of it is thunderbolt which makes
> > > it incredibly easy. Yes - bios/grub will need to learn to
> > > handle this well.
> >
> > Ignoring the actual mechanism of the unplug itself; there are probably
> > loads of cases; e.g.
> >
> > running with both cards
> > hot unplug real card
> > start migration
> > guest reboots
> > Kernel sees only the virtio card
> > migration completes
> > hotadd the real card back
> >
> > so the guest has to know to pair the real card even though it booted
> > with only the virtio card.
>
> Maybe I misunderstand, but, when the 'real card' is added back after
> migration the net_failover driver in the guest will know to pair it
> with the virtio card because they have the same MAC address. Did you
> mean something else?
OK if it knows to do that.
> > I'm sure there are loads of other corners.
>
> Probably yes.
Yeh, that was just my worry - just there's loads of this type of corner
around reboots.
Dave
> regards,
> Jens
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 0/2] implement the failover feature for assigned network devices
2019-04-08 5:26 ` Markus Armbruster
2019-04-08 5:26 ` Markus Armbruster
@ 2019-04-12 19:50 ` Eduardo Habkost
2019-04-12 19:50 ` Eduardo Habkost
1 sibling, 1 reply; 26+ messages in thread
From: Eduardo Habkost @ 2019-04-12 19:50 UTC (permalink / raw)
To: Markus Armbruster
Cc: Michael S. Tsirkin, pkrempa, qemu-devel, mdroth, liran.alon,
laine, ogerlitz, Jens Freimann, ailan, Dr. David Alan Gilbert
On Mon, Apr 08, 2019 at 07:26:16AM +0200, Markus Armbruster wrote:
> Eduardo Habkost <ehabkost@redhat.com> writes:
>
> > On Fri, Apr 05, 2019 at 07:22:35PM -0400, Michael S. Tsirkin wrote:
> >> On Fri, Apr 05, 2019 at 09:56:29AM +0100, Dr. David Alan Gilbert wrote:
> >> > * Jens Freimann (jfreimann@redhat.com) wrote:
> >> > > On Fri, Mar 22, 2019 at 02:44:45PM +0100, Jens Freimann wrote:
> > [...]
> >> > > > 3. Management layer software should handle this. Open Stack already has
> >> > > > components/code to handle unplug/replug VFIO devices and metadata to
> >> > > > provide to the guest for detecting which devices should be paired.
> >> > > > -> An approach that includes all software from firmware to
> >> > > > higher-level management software wasn't tried in the last years. This is
> >> > > > an attempt to keep it simple and contained in QEMU as much as possible.
> >> > > > 4. Hotplugging a device and then making it part of a failover setup is
> >> > > > not possible
> >> > > > -> addressed by extending qdev hotplug functions to check for hidden
> >> > > > attribute, so e.g. device_add can be used to plug a device.
> >> > > >
> >> > > > There are still some open issues:
> >> > > >
> >> > > > Migration: I'm looking for something like a pre-migration hook that I
> >> > > > could use to unplug the vfio-pci device. I tried with a migration
> >> > > > notifier but it is called to late, i.e. after migration is aborted due
> >> > > > to vfio-pci marked unmigrateable. I worked around this by setting it
> >> > > > to migrateable and used a migration notifier on the virtio-net device.
> >> >
> >> > Why not just let this happen at the libvirt level; then you do the
> >> > hotunplug etc before you actually tell qemu anything about starting a
> >> > migration?
> >>
> >> If qemu frees up resources (as it does on unplug) then libvirt
> >> is not guaranteed it can roll the change back on e.g.
> >> migration failure.
> >
> > Why should we always free up resources on unplug?
> >
> > Unplug of a disk device doesn't close the corresponding -blockdev.
>
> It does for block backends created with -drive, and that was a mistake
> we corrected with -blockdev.
>
> > Unplug of a serial device doesn't close the corresponding -chardev.
> > Unplug of a memory device doesn't close the corresponding memory backend.
> > Unplug of a crypto device doesn't close the corresponding crypto backend.
> >
> > Why do we expect device_del of a passthrough PCI device to always
> > release the host side PCI device? We can provide a better API
> > than that.
>
> device_del should free what device_add allocates.
Absolutely. Making unplug of the guest device not close the host
device implies in having the host device not being opened by
device_add (it would be opened by -object/object_add, I assume).
>
> Does device_add allocate the host side PCI device? If yes, should it?
I don't see any reason it should.
--
Eduardo
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 0/2] implement the failover feature for assigned network devices
2019-04-12 19:50 ` Eduardo Habkost
@ 2019-04-12 19:50 ` Eduardo Habkost
0 siblings, 0 replies; 26+ messages in thread
From: Eduardo Habkost @ 2019-04-12 19:50 UTC (permalink / raw)
To: Markus Armbruster
Cc: pkrempa, Michael S. Tsirkin, mdroth, qemu-devel, liran.alon,
laine, ogerlitz, Jens Freimann, ailan, Dr. David Alan Gilbert
On Mon, Apr 08, 2019 at 07:26:16AM +0200, Markus Armbruster wrote:
> Eduardo Habkost <ehabkost@redhat.com> writes:
>
> > On Fri, Apr 05, 2019 at 07:22:35PM -0400, Michael S. Tsirkin wrote:
> >> On Fri, Apr 05, 2019 at 09:56:29AM +0100, Dr. David Alan Gilbert wrote:
> >> > * Jens Freimann (jfreimann@redhat.com) wrote:
> >> > > On Fri, Mar 22, 2019 at 02:44:45PM +0100, Jens Freimann wrote:
> > [...]
> >> > > > 3. Management layer software should handle this. Open Stack already has
> >> > > > components/code to handle unplug/replug VFIO devices and metadata to
> >> > > > provide to the guest for detecting which devices should be paired.
> >> > > > -> An approach that includes all software from firmware to
> >> > > > higher-level management software wasn't tried in the last years. This is
> >> > > > an attempt to keep it simple and contained in QEMU as much as possible.
> >> > > > 4. Hotplugging a device and then making it part of a failover setup is
> >> > > > not possible
> >> > > > -> addressed by extending qdev hotplug functions to check for hidden
> >> > > > attribute, so e.g. device_add can be used to plug a device.
> >> > > >
> >> > > > There are still some open issues:
> >> > > >
> >> > > > Migration: I'm looking for something like a pre-migration hook that I
> >> > > > could use to unplug the vfio-pci device. I tried with a migration
> >> > > > notifier but it is called to late, i.e. after migration is aborted due
> >> > > > to vfio-pci marked unmigrateable. I worked around this by setting it
> >> > > > to migrateable and used a migration notifier on the virtio-net device.
> >> >
> >> > Why not just let this happen at the libvirt level; then you do the
> >> > hotunplug etc before you actually tell qemu anything about starting a
> >> > migration?
> >>
> >> If qemu frees up resources (as it does on unplug) then libvirt
> >> is not guaranteed it can roll the change back on e.g.
> >> migration failure.
> >
> > Why should we always free up resources on unplug?
> >
> > Unplug of a disk device doesn't close the corresponding -blockdev.
>
> It does for block backends created with -drive, and that was a mistake
> we corrected with -blockdev.
>
> > Unplug of a serial device doesn't close the corresponding -chardev.
> > Unplug of a memory device doesn't close the corresponding memory backend.
> > Unplug of a crypto device doesn't close the corresponding crypto backend.
> >
> > Why do we expect device_del of a passthrough PCI device to always
> > release the host side PCI device? We can provide a better API
> > than that.
>
> device_del should free what device_add allocates.
Absolutely. Making unplug of the guest device not close the host
device implies in having the host device not being opened by
device_add (it would be opened by -object/object_add, I assume).
>
> Does device_add allocate the host side PCI device? If yes, should it?
I don't see any reason it should.
--
Eduardo
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 0/2] implement the failover feature for assigned network devices
2019-04-05 23:22 ` Michael S. Tsirkin
` (2 preceding siblings ...)
2019-04-08 9:16 ` Dr. David Alan Gilbert
@ 2019-05-29 0:35 ` si-wei liu
2019-05-29 2:47 ` Michael S. Tsirkin
3 siblings, 1 reply; 26+ messages in thread
From: si-wei liu @ 2019-05-29 0:35 UTC (permalink / raw)
To: Michael S. Tsirkin, Dr. David Alan Gilbert
Cc: pkrempa, ehabkost, qemu-devel, mdroth, armbru, liran.alon, laine,
ogerlitz, Jens Freimann, ailan
On 4/5/2019 4:22 PM, Michael S. Tsirkin wrote:
> On Fri, Apr 05, 2019 at 09:56:29AM +0100, Dr. David Alan Gilbert wrote:
>> * Jens Freimann (jfreimann@redhat.com) wrote:
>>> ping
>>>
>>> FYI: I'm also working on a few related tools to detect driver behaviour when
>>> assigning a MAC to the vf device. Code is at https://github.com/jensfr/netfailover_driver_detect
>> Hi Jens,
>> I've not been following this too uch, but:
>>
>>> regards,
>>> Jens
>>>
>>> On Fri, Mar 22, 2019 at 02:44:45PM +0100, Jens Freimann wrote:
>>>> This is another attempt at implementing the host side of the
>>>> net_failover concept
>>>> (https://www.kernel.org/doc/html/latest/networking/net_failover.html)
>>>>
>>>> The general idea is that we have a pair of devices, a vfio-pci and a
>>>> emulated device. Before migration the vfio device is unplugged and data
>>>> flows to the emulated device, on the target side another vfio-pci device
>>>> is plugged in to take over the data-path. In the guest the net_failover
>>>> module will pair net devices with the same MAC address.
>>>>
>>>> * In the first patch the infrastructure for hiding the device is added
>>>> for the qbus and qdev APIs. A "hidden" boolean is added to the device
>>>> state and it is set based on a callback to the standby device which
>>>> registers itself for handling the assessment: "should the primary device
>>>> be hidden?" by cross validating the ids of the devices.
>>>>
>>>> * In the second patch the virtio-net uses the API to hide the vfio
>>>> device and unhides it when the feature is acked.
>>>>
>>>> Previous discussion: https://patchwork.ozlabs.org/cover/989098/
>>>>
>>>> To summarize concerns/feedback from previous discussion:
>>>> 1.- guest OS can reject or worse _delay_ unplug by any amount of time.
>>>> Migration might get stuck for unpredictable time with unclear reason.
>>>> This approach combines two tricky things, hot/unplug and migration.
>>>> -> We can surprise-remove the PCI device and in QEMU we can do all
>>>> necessary rollbacks transparent to management software. Will it be
>>>> easy, probably not.
>> This sounds 'fun' - bonus cases are things like what happens if the
>> guest gets rebooted somewhere during the process or if it's currently
>> sitting in the bios/grub/etc
> Um, during which process? Guests are gradually fixed to support
> surprise removal well. Part of it is thunderbolt which makes
> it incredibly easy. Yes - bios/grub will need to learn to
> handle this well.
I shared the same concern. As device emulator (QEMU), you know where
guest would reject or delay - it's even agnostic bios/grub should
respond to hot plug or not. You don't even know whether guest has the
support for ACPI hotplug, surprise removal, do you? How QEMU infer what
is the right disposition by telling apart these guest states?
-Siwei
>
>>>> 2. PCI devices are a precious ressource. The primary device should never
>>>> be added to QEMU if it won't be used by guest instead of hiding it in
>>>> QEMU.
>>>> -> We only hotplug the device when the standby feature bit was
>>>> negotiated. We save the device cmdline options until we need it for
>>>> qdev_device_add()
>>>> Hiding a device can be a useful concept to model. For example a
>>>> pci device in a powered-off slot could be marked as hidden until the slot is
>>>> powered on (mst).
>> Are they really that precious? Personally it's not something I'd worry
>> about.
>>
>>>> 3. Management layer software should handle this. Open Stack already has
>>>> components/code to handle unplug/replug VFIO devices and metadata to
>>>> provide to the guest for detecting which devices should be paired.
>>>> -> An approach that includes all software from firmware to
>>>> higher-level management software wasn't tried in the last years. This is
>>>> an attempt to keep it simple and contained in QEMU as much as possible.
>>>> 4. Hotplugging a device and then making it part of a failover setup is
>>>> not possible
>>>> -> addressed by extending qdev hotplug functions to check for hidden
>>>> attribute, so e.g. device_add can be used to plug a device.
>>>>
>>>> There are still some open issues:
>>>>
>>>> Migration: I'm looking for something like a pre-migration hook that I
>>>> could use to unplug the vfio-pci device. I tried with a migration
>>>> notifier but it is called to late, i.e. after migration is aborted due
>>>> to vfio-pci marked unmigrateable. I worked around this by setting it
>>>> to migrateable and used a migration notifier on the virtio-net device.
>> Why not just let this happen at the libvirt level; then you do the
>> hotunplug etc before you actually tell qemu anything about starting a
>> migration?
> If qemu frees up resources (as it does on unplug) then libvirt
> is not guaranteed it can roll the change back on e.g.
> migration failure.
>
> But really another issue is simply that it's a mechanism,
> there's no policy that management needs to decide on.
> Doing it at lowest possible level ensures all
> upper layers benefit with minimal pain.
>
>>>> Commandline: There is a dependency between vfio-pci and virtio-net
>>>> devices. One points to the other via new parameters
>>>> primar=<primary qdev id> and standby='<standby qdev id>'. This means
>>>> that the primary device needs to be specified after standby device on
>>>> the qemu command line. Not sure how to solve this.
>>>>
>>>> Error handling: Patches don't cover all possible error scenarios yet.
>>>>
>>>> I have tested this with a mlx5 NIC and was able to migrate the VM with
>>>> above mentioned workarounds for open problems.
>>>>
>>>> Command line example:
>>>>
>>>> qemu-system-x86_64 -enable-kvm -m 3072 -smp 3 \
>>>> -machine q35,kernel-irqchip=split -cpu host \
>>>> -k fr \
>>>> -serial stdio \
>>>> -net none \
>>>> -qmp unix:/tmp/qmp.socket,server,nowait \
>>>> -monitor telnet:127.0.0.1:5555,server,nowait \
>>>> -device pcie-root-port,id=root0,multifunction=on,chassis=0,addr=0xa \
>>>> -device pcie-root-port,id=root1,bus=pcie.0,chassis=1 \
>>>> -device pcie-root-port,id=root2,bus=pcie.0,chassis=2 \
>>>> -netdev tap,script=/root/bin/bridge.sh,downscript=no,id=hostnet1,vhost=on \
>>>> -device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:6f:55:cc,bus=root2,primary=hostdev0 \
>>>> -device vfio-pci,host=5e:00.2,id=hostdev0,bus=root1,standby=net1 \
>> Yes, that's a bit grim; it's circular dependency on the 'hostdev0' and
>> 'net1' id's. cc'ing in Markus.
>>
>> Dave
>>
>>>> /root/rhel-guest-image-8.0-1781.x86_64.qcow2
>>>>
>>>> I'm grateful for any remarks or ideas!
>>>>
>>>> Thanks!
>>>>
>>>> regards,
>>>> Jens
>>>>
>>>> Sameeh Jubran (2):
>>>> qdev/qbus: Add hidden device support
>>>> net/virtio: add failover support
>>>>
>>>> hw/core/qdev.c | 27 ++++++++++
>>>> hw/net/virtio-net.c | 95 ++++++++++++++++++++++++++++++++++
>>>> hw/pci/pci.c | 1 +
>>>> include/hw/pci/pci.h | 2 +
>>>> include/hw/qdev-core.h | 8 +++
>>>> include/hw/virtio/virtio-net.h | 7 +++
>>>> qdev-monitor.c | 48 +++++++++++++++--
>>>> vl.c | 7 ++-
>>>> 8 files changed, 189 insertions(+), 6 deletions(-)
>>>>
>>>> --
>>>> 2.20.1
>>>>
>>>>
>> --
>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 0/2] implement the failover feature for assigned network devices
2019-05-29 0:35 ` si-wei liu
@ 2019-05-29 2:47 ` Michael S. Tsirkin
0 siblings, 0 replies; 26+ messages in thread
From: Michael S. Tsirkin @ 2019-05-29 2:47 UTC (permalink / raw)
To: si-wei liu
Cc: pkrempa, ehabkost, armbru, qemu-devel, mdroth, liran.alon, laine,
ogerlitz, Jens Freimann, ailan, Dr. David Alan Gilbert
On Tue, May 28, 2019 at 05:35:26PM -0700, si-wei liu wrote:
>
>
> On 4/5/2019 4:22 PM, Michael S. Tsirkin wrote:
> > On Fri, Apr 05, 2019 at 09:56:29AM +0100, Dr. David Alan Gilbert wrote:
> > > * Jens Freimann (jfreimann@redhat.com) wrote:
> > > > ping
> > > >
> > > > FYI: I'm also working on a few related tools to detect driver behaviour when
> > > > assigning a MAC to the vf device. Code is at https://github.com/jensfr/netfailover_driver_detect
> > > Hi Jens,
> > > I've not been following this too uch, but:
> > >
> > > > regards,
> > > > Jens
> > > >
> > > > On Fri, Mar 22, 2019 at 02:44:45PM +0100, Jens Freimann wrote:
> > > > > This is another attempt at implementing the host side of the
> > > > > net_failover concept
> > > > > (https://www.kernel.org/doc/html/latest/networking/net_failover.html)
> > > > >
> > > > > The general idea is that we have a pair of devices, a vfio-pci and a
> > > > > emulated device. Before migration the vfio device is unplugged and data
> > > > > flows to the emulated device, on the target side another vfio-pci device
> > > > > is plugged in to take over the data-path. In the guest the net_failover
> > > > > module will pair net devices with the same MAC address.
> > > > >
> > > > > * In the first patch the infrastructure for hiding the device is added
> > > > > for the qbus and qdev APIs. A "hidden" boolean is added to the device
> > > > > state and it is set based on a callback to the standby device which
> > > > > registers itself for handling the assessment: "should the primary device
> > > > > be hidden?" by cross validating the ids of the devices.
> > > > >
> > > > > * In the second patch the virtio-net uses the API to hide the vfio
> > > > > device and unhides it when the feature is acked.
> > > > >
> > > > > Previous discussion: https://patchwork.ozlabs.org/cover/989098/
> > > > >
> > > > > To summarize concerns/feedback from previous discussion:
> > > > > 1.- guest OS can reject or worse _delay_ unplug by any amount of time.
> > > > > Migration might get stuck for unpredictable time with unclear reason.
> > > > > This approach combines two tricky things, hot/unplug and migration.
> > > > > -> We can surprise-remove the PCI device and in QEMU we can do all
> > > > > necessary rollbacks transparent to management software. Will it be
> > > > > easy, probably not.
> > > This sounds 'fun' - bonus cases are things like what happens if the
> > > guest gets rebooted somewhere during the process or if it's currently
> > > sitting in the bios/grub/etc
> > Um, during which process? Guests are gradually fixed to support
> > surprise removal well. Part of it is thunderbolt which makes
> > it incredibly easy. Yes - bios/grub will need to learn to
> > handle this well.
> I shared the same concern. As device emulator (QEMU), you know where guest
> would reject or delay - it's even agnostic bios/grub should respond to hot
> plug or not. You don't even know whether guest has the support for ACPI
> hotplug, surprise removal, do you? How QEMU infer what is the right
> disposition by telling apart these guest states?
>
> -Siwei
We can always add a feature bit for that :)
No feature bit -> primary stays hidden.
> >
> > > > > 2. PCI devices are a precious ressource. The primary device should never
> > > > > be added to QEMU if it won't be used by guest instead of hiding it in
> > > > > QEMU.
> > > > > -> We only hotplug the device when the standby feature bit was
> > > > > negotiated. We save the device cmdline options until we need it for
> > > > > qdev_device_add()
> > > > > Hiding a device can be a useful concept to model. For example a
> > > > > pci device in a powered-off slot could be marked as hidden until the slot is
> > > > > powered on (mst).
> > > Are they really that precious? Personally it's not something I'd worry
> > > about.
> > >
> > > > > 3. Management layer software should handle this. Open Stack already has
> > > > > components/code to handle unplug/replug VFIO devices and metadata to
> > > > > provide to the guest for detecting which devices should be paired.
> > > > > -> An approach that includes all software from firmware to
> > > > > higher-level management software wasn't tried in the last years. This is
> > > > > an attempt to keep it simple and contained in QEMU as much as possible.
> > > > > 4. Hotplugging a device and then making it part of a failover setup is
> > > > > not possible
> > > > > -> addressed by extending qdev hotplug functions to check for hidden
> > > > > attribute, so e.g. device_add can be used to plug a device.
> > > > >
> > > > > There are still some open issues:
> > > > >
> > > > > Migration: I'm looking for something like a pre-migration hook that I
> > > > > could use to unplug the vfio-pci device. I tried with a migration
> > > > > notifier but it is called to late, i.e. after migration is aborted due
> > > > > to vfio-pci marked unmigrateable. I worked around this by setting it
> > > > > to migrateable and used a migration notifier on the virtio-net device.
> > > Why not just let this happen at the libvirt level; then you do the
> > > hotunplug etc before you actually tell qemu anything about starting a
> > > migration?
> > If qemu frees up resources (as it does on unplug) then libvirt
> > is not guaranteed it can roll the change back on e.g.
> > migration failure.
> >
> > But really another issue is simply that it's a mechanism,
> > there's no policy that management needs to decide on.
> > Doing it at lowest possible level ensures all
> > upper layers benefit with minimal pain.
> >
> > > > > Commandline: There is a dependency between vfio-pci and virtio-net
> > > > > devices. One points to the other via new parameters
> > > > > primar=<primary qdev id> and standby='<standby qdev id>'. This means
> > > > > that the primary device needs to be specified after standby device on
> > > > > the qemu command line. Not sure how to solve this.
> > > > >
> > > > > Error handling: Patches don't cover all possible error scenarios yet.
> > > > >
> > > > > I have tested this with a mlx5 NIC and was able to migrate the VM with
> > > > > above mentioned workarounds for open problems.
> > > > >
> > > > > Command line example:
> > > > >
> > > > > qemu-system-x86_64 -enable-kvm -m 3072 -smp 3 \
> > > > > -machine q35,kernel-irqchip=split -cpu host \
> > > > > -k fr \
> > > > > -serial stdio \
> > > > > -net none \
> > > > > -qmp unix:/tmp/qmp.socket,server,nowait \
> > > > > -monitor telnet:127.0.0.1:5555,server,nowait \
> > > > > -device pcie-root-port,id=root0,multifunction=on,chassis=0,addr=0xa \
> > > > > -device pcie-root-port,id=root1,bus=pcie.0,chassis=1 \
> > > > > -device pcie-root-port,id=root2,bus=pcie.0,chassis=2 \
> > > > > -netdev tap,script=/root/bin/bridge.sh,downscript=no,id=hostnet1,vhost=on \
> > > > > -device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:6f:55:cc,bus=root2,primary=hostdev0 \
> > > > > -device vfio-pci,host=5e:00.2,id=hostdev0,bus=root1,standby=net1 \
> > > Yes, that's a bit grim; it's circular dependency on the 'hostdev0' and
> > > 'net1' id's. cc'ing in Markus.
> > >
> > > Dave
> > >
> > > > > /root/rhel-guest-image-8.0-1781.x86_64.qcow2
> > > > >
> > > > > I'm grateful for any remarks or ideas!
> > > > >
> > > > > Thanks!
> > > > >
> > > > > regards,
> > > > > Jens
> > > > >
> > > > > Sameeh Jubran (2):
> > > > > qdev/qbus: Add hidden device support
> > > > > net/virtio: add failover support
> > > > >
> > > > > hw/core/qdev.c | 27 ++++++++++
> > > > > hw/net/virtio-net.c | 95 ++++++++++++++++++++++++++++++++++
> > > > > hw/pci/pci.c | 1 +
> > > > > include/hw/pci/pci.h | 2 +
> > > > > include/hw/qdev-core.h | 8 +++
> > > > > include/hw/virtio/virtio-net.h | 7 +++
> > > > > qdev-monitor.c | 48 +++++++++++++++--
> > > > > vl.c | 7 ++-
> > > > > 8 files changed, 189 insertions(+), 6 deletions(-)
> > > > >
> > > > > --
> > > > > 2.20.1
> > > > >
> > > > >
> > > --
> > > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
^ permalink raw reply [flat|nested] 26+ messages in thread
end of thread, other threads:[~2019-05-29 2:49 UTC | newest]
Thread overview: 26+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <20190322134447.14831-1-jfreimann@redhat.com>
2019-04-04 8:29 ` [Qemu-devel] [RFC PATCH 0/2] implement the failover feature for assigned network devices Jens Freimann
2019-04-05 8:56 ` Dr. David Alan Gilbert
2019-04-05 8:56 ` Dr. David Alan Gilbert
2019-04-05 9:20 ` Jens Freimann
2019-04-05 9:20 ` Jens Freimann
2019-04-08 5:53 ` Markus Armbruster
2019-04-08 5:53 ` Markus Armbruster
2019-04-05 23:22 ` Michael S. Tsirkin
2019-04-05 23:22 ` Michael S. Tsirkin
2019-04-05 23:46 ` Eduardo Habkost
2019-04-05 23:46 ` Eduardo Habkost
2019-04-08 5:26 ` Markus Armbruster
2019-04-08 5:26 ` Markus Armbruster
2019-04-12 19:50 ` Eduardo Habkost
2019-04-12 19:50 ` Eduardo Habkost
2019-04-08 9:16 ` Dr. David Alan Gilbert
2019-04-08 9:16 ` Dr. David Alan Gilbert
2019-04-08 13:00 ` Jens Freimann
2019-04-08 13:00 ` Jens Freimann
2019-04-08 17:00 ` Dr. David Alan Gilbert
2019-04-08 17:00 ` Dr. David Alan Gilbert
2019-04-08 13:22 ` Michael S. Tsirkin
2019-04-08 13:22 ` Michael S. Tsirkin
2019-05-29 0:35 ` si-wei liu
2019-05-29 2:47 ` Michael S. Tsirkin
2019-04-04 12:53 ` Daniel P. Berrangé
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).