From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: Yishai Hadas <yishaih@nvidia.com>,
Jason Gunthorpe <jgg@nvidia.com>,
bhelgaas@google.com, saeedm@nvidia.com,
linux-pci@vger.kernel.org, kvm@vger.kernel.org,
netdev@vger.kernel.org, kuba@kernel.org, leonro@nvidia.com,
kwankhede@nvidia.com, mgurtovoy@nvidia.com, maorg@nvidia.com,
Cornelia Huck <cohuck@redhat.com>
Subject: Re: [PATCH V2 mlx5-next 12/14] vfio/mlx5: Implement vfio_pci driver for mlx5 devices
Date: Mon, 25 Oct 2021 19:47:29 +0100
Message-ID: <YXb7wejD1qckNrhC@work-vm>
In-Reply-To: <20211025115535.49978053.alex.williamson@redhat.com>
* Alex Williamson (alex.williamson@redhat.com) wrote:
> On Mon, 25 Oct 2021 17:34:01 +0100
> "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
>
> > * Alex Williamson (alex.williamson@redhat.com) wrote:
> > > [Cc +dgilbert, +cohuck]
> > >
> > > On Wed, 20 Oct 2021 11:28:04 +0300
> > > Yishai Hadas <yishaih@nvidia.com> wrote:
> > >
> > > > On 10/20/2021 2:04 AM, Jason Gunthorpe wrote:
> > > > > On Tue, Oct 19, 2021 at 02:58:56PM -0600, Alex Williamson wrote:
> > > > >> I think that gives us this table:
> > > > >>
> > > > >> | NDMA | RESUMING | SAVING | RUNNING |
> > > > >> +----------+----------+----------+----------+ ---
> > > > >> | X | 0 | 0 | 0 | ^
> > > > >> +----------+----------+----------+----------+ |
> > > > >> | 0 | 0 | 0 | 1 | |
> > > > >> +----------+----------+----------+----------+ |
> > > > >> | X | 0 | 1 | 0 |
> > > > >> +----------+----------+----------+----------+ NDMA value is either compatible
> > > > >> | 0 | 0 | 1 | 1 | to existing behavior or don't
> > > > >> +----------+----------+----------+----------+ care due to redundancy vs
> > > > >> | X | 1 | 0 | 0 | !_RUNNING/INVALID/ERROR
> > > > >> +----------+----------+----------+----------+
> > > > >> | X | 1 | 0 | 1 | |
> > > > >> +----------+----------+----------+----------+ |
> > > > >> | X | 1 | 1 | 0 | |
> > > > >> +----------+----------+----------+----------+ |
> > > > >> | X | 1 | 1 | 1 | v
> > > > >> +----------+----------+----------+----------+ ---
> > > > >> | 1 | 0 | 0 | 1 | ^
> > > > >> +----------+----------+----------+----------+ Desired new useful cases
> > > > >> | 1 | 0 | 1 | 1 | v
> > > > >> +----------+----------+----------+----------+ ---
> > > > >>
> > > > >> Specifically, rows 1, 3, 5 with NDMA = 1 are valid states a user can
> > > > >> set which are simply redundant to the NDMA = 0 cases.
> > > > > It seems right
> > > > >
> > > > >> Row 6 remains invalid due to lack of support for pre-copy (_RESUMING
> > > > >> | _RUNNING) and therefore cannot be set by userspace. Rows 7 & 8
> > > > >> are error states and cannot be set by userspace.
> > > > > I wonder, did Yishai's series capture this row 6 restriction? Yishai?
> > > >
> > > >
> > > > It seems so, by using the below check which includes the
> > > > !VFIO_DEVICE_STATE_VALID clause.
> > > >
> > > > if (old_state == VFIO_DEVICE_STATE_ERROR ||
> > > > !VFIO_DEVICE_STATE_VALID(state) ||
> > > > (state & ~MLX5VF_SUPPORTED_DEVICE_STATES))
> > > > return -EINVAL;
> > > >
> > > > Which is:
> > > >
> > > > #define VFIO_DEVICE_STATE_VALID(state) \
> > > > (state & VFIO_DEVICE_STATE_RESUMING ? \
> > > > (state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_RESUMING : 1)
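To make the concrete effect of a fourth bit easier to see, here is a sketch of how that validity check might be extended to tolerate an NDMA bit, mirroring the table above. The NDMA bit position and the exact validity rules are assumptions for illustration only; the current uAPI defines just _RUNNING/_SAVING/_RESUMING:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Existing uAPI bits from <linux/vfio.h> */
#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
                                     VFIO_DEVICE_STATE_SAVING | \
                                     VFIO_DEVICE_STATE_RESUMING)

/* Hypothetical quiescent-point bit; position chosen for illustration */
#define VFIO_DEVICE_STATE_NDMA      (1 << 3)

/*
 * Per the table, NDMA is either redundant or a don't-care for every
 * state the existing macro already accepts, so validity is decided on
 * the base bits alone: _RESUMING remains valid only on its own.
 */
static bool state_valid_with_ndma(uint32_t state)
{
	uint32_t base = state & VFIO_DEVICE_STATE_MASK;

	if (base & VFIO_DEVICE_STATE_RESUMING)
		return base == VFIO_DEVICE_STATE_RESUMING;
	return true;
}
```

Rows 6-8 of the table (any _RESUMING combination beyond _RESUMING alone) stay invalid, while NDMA | _RUNNING and NDMA | _SAVING | _RUNNING fall through as the new useful cases.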
> > > >
> > > > >
> > > > >> Like other bits, setting the bit should be effective at the completion
> > > > >> of writing device state. Therefore the device would need to flush any
> > > > >> outbound DMA queues before returning.
> > > > > Yes, the device commands are expected to achieve this.
> > > > >
> > > > >> The question I was really trying to get to though is whether we have a
> > > > >> supportable interface without such an extension. There's currently
> > > > >> only an experimental version of vfio migration support for PCI devices
> > > > >> in QEMU (afaik),
> > > > > If I recall this only matters if you have a VM that is causing
> > > > > migratable devices to interact with each other. So long as the devices
> > > > > are only interacting with the CPU this extra step is not strictly
> > > > > needed.
> > > > >
> > > > > So, single device cases can be fine as-is
> > > > >
> > > > > IMHO the multi-device case the VMM should probably demand this support
> > > > > from the migration drivers, otherwise it cannot know if it is safe for
> > > > > sure.
> > > > >
> > > > > A config option to override the block if the admin knows there is no
> > > > > use case to cause devices to interact - eg two NVMe devices without
> > > > > CMB do not have a useful interaction.
> > > > >
> > > > >> so it seems like we could make use of the bus-master bit to fill
> > > > >> this gap in QEMU currently, before we claim non-experimental
> > > > >> support, but this new device agnostic extension would be required
> > > > >> for non-PCI device support (and PCI support should adopt it as
> > > > >> available). Does that sound right? Thanks,
> > > > > I don't think the bus master support is really a substitute, tripping
> > > > > bus master will stop DMA but it will not do so in a clean way and is
> > > > > likely to be non-transparent to the VM's driver.
> > > > >
> > > > > The single-device-assigned case is a cleaner restriction, IMHO.
> > > > >
> > > > > Alternatively we can add the 4th bit and insist that migration drivers
> > > > > support all the states. I'm just unsure what other HW can do, I get
> > > > > the feeling people have been designing to the migration description in
> > > > > the header file for a while and this is a new idea.
> > >
> > > I'm wondering if we're imposing extra requirements on the !_RUNNING
> > > state that don't need to be there. For example, if we can assume that
> > > all devices within a userspace context are !_RUNNING before any of the
> > > devices begin to retrieve final state, then clearing of the _RUNNING
> > > bit becomes the device quiesce point and the beginning of reading
> > > device data is the point at which the device state is frozen and
> > > serialized. No new states required and essentially works with a slight
> > > rearrangement of the callbacks in this series. Why can't we do that?
> >
> > So without me actually understanding your bit encodings that closely, I
> > think the problem is we have to assume that any transition takes time.
> > From the QEMU point of view I think the requirement is when we stop the
> > machine (vm_stop_force_state(RUN_STATE_FINISH_MIGRATE) in
> > migration_completion) that at the point that call returns (with no
> > error) all devices are idle. That means you need a way to command the
> > device to go into the stopped state, and probably another to make sure
> > it's got there.
>
> In a way. We're essentially recognizing that we cannot stop a single
> device in isolation of others that might participate in peer-to-peer
> DMA with that device, so we need to make a pass to quiesce each device
> before we can ask the device to fully stop. This new device state bit
> is meant to be that quiescent point, devices can accept incoming DMA
> but should cease to generate any. Once all devices are quiesced then we
> can safely stop them.
It may need some further refinement; for example, in that quiesced state
do counters still tick? Will a NIC still respond to packets that don't
get forwarded to the host?
Note I still think you need a way to know when you have actually reached
these states; setting a bit in a register is asking nicely for a device
to go into a state - has it actually got there?
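The ask-then-confirm pattern being argued for here can be sketched against a toy device model. All names and the polling protocol are invented for illustration; no real driver interface is implied. The point is that the state write only *requests* quiescence, and a separate status poll confirms the device actually drained its outbound queues:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy device model: in-flight DMA drains one unit per poll interval */
struct fake_dev {
	bool quiesce_requested;
	int outbound_inflight;
};

/* "Asking nicely": set the bit; the device only stops issuing new DMA */
static void request_quiesce(struct fake_dev *d)
{
	d->quiesce_requested = true;
}

/* Simulated status read: existing transactions complete over time */
static bool poll_quiesced(struct fake_dev *d)
{
	if (d->quiesce_requested && d->outbound_inflight > 0)
		d->outbound_inflight--;
	return d->quiesce_requested && d->outbound_inflight == 0;
}

/* Returns 0 once the device confirms quiescence, -1 on timeout */
static int wait_for_quiesce(struct fake_dev *d, int max_polls)
{
	int i;

	request_quiesce(d);
	for (i = 0; i < max_polls; i++)
		if (poll_quiesced(d))
			return 0;
	return -1;
}
```

Without the confirmation step (the poll loop), userspace has set a bit but has no idea whether outbound DMA has actually ceased.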
> > Now, you could be a *little* more sloppy; you could allow a device to
> > carry on doing stuff purely with its own internal state up until the point
> > it needs to serialise; but that would have to be strictly internal state
> > only - if it can change any other devices state (or issue an interrupt,
> > change RAM etc) then you get into ordering issues on the serialisation
> > of multiple devices.
>
> Yep, that's the proposal that doesn't require a uAPI change, we loosen
> the definition of stopped to mean the device can no longer generate DMA
> or interrupts, and all internal processing outside of responding to
> incoming DMA should halt (essentially the same as the new quiescent
> state above). Once all devices are in this state, there should be no
> incoming DMA and we can safely collect per device migration data. If
> state changes occur beyond the point in time where userspace has
> initiated the collection of migration data, drivers have options for
> generating errors when userspace consumes that data.
How do you know that the last device has actually gone into that state?
Also be careful; it feels much more delicate - something might
accidentally start a transaction.
> AFAICT, the two approaches are equally valid. If we modify the uAPI to
> include this new quiescent state then userspace needs to make some hard
> choices about what configurations they support without such a feature.
> The majority of configurations are likely not exercising p2p between
> assigned devices, but the hypervisor can't know that. If we work
> within the existing uAPI, well there aren't any open source driver
> implementations yet anyway and any non-upstream implementations would
> need to be updated for this clarification. Existing userspace works
> better with no change, so long as they already follow the guideline
> that all devices in the userspace context must be stopped before the
> migration data of any device can be considered valid. Thanks,
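The two-pass ordering both proposals rely on - quiesce every device first, and only then stop and serialize each one - can be sketched like this. The struct and helper names are invented; this models only the ordering constraint, not any real API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy per-device migration state */
struct vdev {
	bool generating_dma;	/* device may still issue outbound DMA */
	bool accepting_dma;	/* device still accepts incoming (p2p) DMA */
	bool state_frozen;	/* migration data may now be read */
};

/* Pass 1: cease generating DMA, but keep accepting incoming writes */
static void quiesce(struct vdev *d)
{
	d->generating_dma = false;
}

/* Pass 2: safe only once no peer can write to this device any more */
static void stop_and_freeze(struct vdev *d)
{
	d->accepting_dma = false;
	d->state_frozen = true;
}

static void migrate_stop_all(struct vdev *devs, size_t n)
{
	size_t i;

	/* First pass: after this loop, no device generates DMA... */
	for (i = 0; i < n; i++)
		quiesce(&devs[i]);
	/*
	 * ...so the second pass can freeze each device without an
	 * in-flight peer-to-peer write changing already-serialized state.
	 */
	for (i = 0; i < n; i++)
		stop_and_freeze(&devs[i]);
}
```

Collapsing the two loops into one would reintroduce exactly the hazard discussed above: a not-yet-quiesced peer could DMA into a device whose migration data has already been read.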
Dave
> Alex
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK