From: Jason Gunthorpe <jgg@nvidia.com>
To: Cornelia Huck <cohuck@redhat.com>
Cc: Alex Williamson <alex.williamson@redhat.com>,
Jonathan Corbet <corbet@lwn.net>,
linux-doc@vger.kernel.org, kvm@vger.kernel.org,
Kirti Wankhede <kwankhede@nvidia.com>,
Max Gurtovoy <mgurtovoy@nvidia.com>,
Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>,
Yishai Hadas <yishaih@nvidia.com>
Subject: Re: [PATCH RFC] vfio: Documentation for the migration region
Date: Wed, 24 Nov 2021 14:40:20 -0400
Message-ID: <20211124184020.GM4670@nvidia.com>
In-Reply-To: <87fsrljxwq.fsf@redhat.com>
On Wed, Nov 24, 2021 at 05:55:49PM +0100, Cornelia Huck wrote:
> Yes, defining what we mean by "VCPU RUNNING" and "DIRTY TRACKING" first
> makes the most sense.
>
> (It also imposes some rules on userspace, doesn't it? Whatever it does,
> the interaction with vfio needs to be at least somewhat similar to what
> QEMU or another VMM would do. I wonder if we need to be more concrete
> here; but let's talk about the basic interface first.)
I don't think we need to have excessive precision here. The main
thrust of this as a spec is to define behaviors, which starts at the
'Actions on Set/Clear' section.
This part is informative so everyone has the same picture in their
mind about what it is we are trying to accomplish. This can be a bit
imprecise.
> > I don't think I like this statement - why/where would the overall flow
> > differ?
>
> What I meant to say: If we give userspace the flexibility to operate
> this, we also must give different device types some flexibility. While
> subchannels will follow the general flow, they'll probably condense/omit
> some steps, as I/O is quite different to PCI there.
I would say no - migration is general, no device type should get to
violate this spec. Did you have something specific in mind? There is
very little PCI specific here already.
> >> > + Normal operating state
> >> > + RUNNING, DIRTY TRACKING, VCPU RUNNING
> >> > + Log DMAs
> >> > + Stream all memory
> >>
> >> all memory accessed by the device?
> >
> > In this reference flow this is all VCPU memory. Ie you start global
> > dirty tracking both in VFIO and in the VCPU and start copying all VM
> > memory.
>
> So, general migration, not just the vfio specific parts?
Sure, as above precision isn't important here, the userspace doing
migration should start streaming whatever state it has covered by
dirty logging here.
> "subtly complicated" captures this well :(
Indeed. Frankly, my observation is the team here has invested a lot of
person hours trying to make sense of this and our well-researched take
'this is a FSM' was substantially different from Alex's version 'this
is control bits'. For the 'control bit' model few seem to understand
it at all, and the driver code is short but deceptively complicated.
> For example, if I interpret your list correctly, the driver should
> prioritize clearing RUNNING over setting SAVING | !RUNNING. What does
> that mean? If RUNNING is cleared, first deal with whatever action that
> triggers, then later check if it is actually a case of setting SAVING |
> !RUNNING, and perform the required actions for that?
Yes.
Since this is not a FSM a change between any two valid device_state
values is completely legal. Many of these involve multiple driver
steps. So all drivers must do the actions in the same order to have a
real ABI.
> Also, does e.g. SAVING | RUNNING mean that both SAVING and RUNNING are
> getting set, or only one of them, if the other was already set?
It always refers to the requested migration_state.
> > SAVING|0 -> SAVING|RUNNING
> > 0|RUNNING -> SAVING|RUNNING
> > 0 -> SAVING|RUNNING
Are all described as userspace requesting a migration_state
of SAVING | RUNNING
> > For clarity I didn't split things like that. All the continuous
> > behaviors start when the given bits begins and stop when the bits
> > end.
> >
> > Most of the actions talk about changes in the data window
>
> This might need some better terminology, I did not understand the split
> like that...
>
> "action trigger" is basically that the driver sets certain bits and a
> certain device action happens. "continuous" means that a certain device
> action is done as long as certain bits are set. Sounds a bit like edge
> triggered vs level triggered to me. What about:
Yes
> - event-triggered actions: bits get set/unset, an action needs to be
> done
"""Event-triggered actions happen when userspace requests a new
migration_state that differs from the current migration_state. Actions
happen on a bit group basis:"""
> - condition-triggered actions: as long as bits are set/unset, an action
> needs to be done
"""Continuous actions are in effect so long as the below migration_state bit
group is active:"""
> >> What does that mean? That the operation setting NDMA in device_state
> >> returns?
> >
> > Yes, it must be a synchronous behavior.
>
> To be extra clear: the _setting_ action (e.g. a write), not the
> condition (NDMA set)? Sorry if that sounds nitpicky, but I think we
> should eliminate possible points of confusion early on.
"""Whenever the kernel returns with a migration_state of NDMA there can be no
in progress DMAs."""
> I'm trying to understand this document without looking at the mlx5
> implementation: Somebody using it as a guide needs to be able to
> implement a driver without looking at another driver (unless they prefer
> to work with examples.) Using the mlx5 driver as the basis for
> _writing_ this document makes sense, but it needs to stand on its own.
That may be an ideal that is too hard to reach :(
Thanks,
Jason
Below is where I have things now:
VFIO migration driver API
-------------------------------------------------------------------------------
VFIO drivers that support migration implement a migration control register
called device_state in the struct vfio_device_migration_info which is in its
VFIO_REGION_TYPE_MIGRATION region.
The device_state controls both device action and continuous behaviour.
Setting/clearing bit groups triggers device action, and each bit controls a
continuous device behaviour.
Along with the device_state the migration driver provides a data window which
allows streaming migration data into or out of the device.
A lot of flexibility is provided to userspace in how it operates these
bits. What follows is a reference flow for saving device state in a live
migration, with all features, and an illustration of how the external non-VFIO
entities the VMM controls (VCPU_RUNNING and DIRTY_TRACKING) fit in.
  RUNNING, VCPU_RUNNING
     Normal operating state
  RUNNING, DIRTY_TRACKING, VCPU_RUNNING
     Log DMAs
     Stream all memory
  SAVING | RUNNING, DIRTY_TRACKING, VCPU_RUNNING
     Log internal device changes (pre-copy)
     Stream device state through the migration window
     While in this state repeat as desired:
        Atomic Read and Clear DMA Dirty log
        Stream dirty memory
  SAVING | NDMA | RUNNING, VCPU_RUNNING
     vIOMMU grace state
     Complete all in progress IO page faults, idle the vIOMMU
  SAVING | NDMA | RUNNING
     Peer to Peer DMA grace state
     Final snapshot of DMA dirty log (atomic not required)
  SAVING
     Stream final device state through the migration window
     Copy final dirty data
  0
     Device is halted
and the reference flow for resuming:
  RUNNING
     Issue VFIO_DEVICE_RESET to clear the internal device state
  0
     Device is halted
  RESUMING
     Push in migration data. Data captured during pre-copy should be
     prepended to data captured during SAVING.
  NDMA | RUNNING
     Peer to Peer DMA grace state
  RUNNING, VCPU_RUNNING
     Normal operating state
If the VMM has multiple VFIO devices undergoing migration then the grace
states act as cross device synchronization points. The VMM must bring all
devices to the grace state before advancing past it.
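As a rough illustration of the synchronization rule above, here is a small
Python sketch that walks several devices through the saving reference flow in
lock step. The bit values, the SAVE_FLOW table and the ToyDevice class are
assumptions made up for this example, they are not the uAPI, and the external
VCPU_RUNNING and DIRTY_TRACKING controls are left out.

```python
# Illustrative bit values only; NDMA in particular is an assumed encoding.
RUNNING, SAVING, RESUMING, NDMA = 1, 2, 4, 8

# device_state values of the saving reference flow, in order.
SAVE_FLOW = [
    RUNNING,                  # normal operation, DMA logging
    SAVING | RUNNING,         # pre-copy
    SAVING | NDMA | RUNNING,  # vIOMMU and peer to peer grace states
    SAVING,                   # final device state
    0,                        # device is halted
]


class ToyDevice:
    """Hypothetical stand-in for one VFIO device under migration."""

    def __init__(self, name):
        self.name = name
        self.state = RUNNING  # a freshly opened device FD is RUNNING

    def set_state(self, new_state):
        # A real VMM would write device_state in the migration region here.
        self.state = new_state


def save_all(devices, flow=SAVE_FLOW):
    """Move every device to each state before any device advances past it."""
    log = []
    for step, state in enumerate(flow):
        for dev in devices:
            dev.set_state(state)
            log.append((step, dev.name))
        # Barrier: all devices now share `state`, including the grace states.
        assert all(dev.state == state for dev in devices)
    return log
```

The point of the nesting is that the inner loop over devices completes before
the outer loop over states advances, so no device leaves a grace state while
another device has yet to enter it.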
The above reference flows are built around specific requirements on the
migration driver for its implementation of the migration_state input.
Event triggered actions happen when userspace requests a new migration_state
that differs from the current migration_state. Actions happen on a bit group
basis:
- SAVING | RUNNING
The device clears the data window and begins streaming 'pre copy' migration
data through the window. Devices that cannot log internal state changes
return a 0 length migration stream.
- SAVING | !RUNNING
The device captures its internal state that is not covered by internal
logging, as well as any logged changes.
The device clears the data window and begins streaming the captured
migration data through the window. Devices that cannot log internal state
changes stream all of their device state here.
- RESUMING
The data window is cleared, opened and can receive the migration data
stream.
- !RESUMING
All the data transferred into the data window is loaded into the device's
internal state. The migration driver can rely on userspace issuing a
VFIO_DEVICE_RESET prior to starting RESUMING.
To abort a RESUMING issue a VFIO_DEVICE_RESET.
If the migration data is invalid then the ERROR state must be set.
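One possible reading of the event triggered actions above, as a toy Python
model of the data window. Everything here (ToyMigrationDevice, the entered()
helper, the byte strings, the bit values) is hypothetical and only meant to
show the shape of the behavior. It treats a bit group as triggered when the
requested state matches the group and the previous state did not, and it
handles only one event per transition.

```python
RUNNING, SAVING, RESUMING = 1, 2, 4  # illustrative values for this sketch


def entered(old, new, mask, value):
    """True if `new` is in the bit group (mask, value) and `old` was not."""
    return (new & mask) == value and (old & mask) != value


class ToyMigrationDevice:
    """Hypothetical device that cannot log internal state changes."""

    def __init__(self):
        self.state = RUNNING       # freshly opened device
        self.internal = b""        # stand-in for the device's internal state
        self.window = bytearray()  # stand-in for the data window

    def set_state(self, new):
        old, self.state = self.state, new
        if entered(old, new, SAVING | RUNNING, SAVING | RUNNING):
            # Pre-copy: no internal logging, so the stream is 0 length.
            self.window = bytearray()
        elif entered(old, new, SAVING | RUNNING, SAVING):
            # SAVING | !RUNNING: stream all of the device state.
            self.window = bytearray(self.internal)
        elif entered(old, new, RESUMING, RESUMING):
            # Clear and open the window for the incoming stream.
            self.window = bytearray()
        elif entered(old, new, RESUMING, 0):
            # !RESUMING: load everything that was written to the window.
            self.internal = bytes(self.window)
            self.window = bytearray()
```

A usage sketch: on the source, set SAVING | RUNNING and read the (empty)
pre-copy stream, then set SAVING and read the final stream. On the
destination, set 0, then RESUMING, append the pre-copy data followed by the
final data to the window, and set RUNNING. The destination's internal state
should then equal the source's.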
Continuous actions are in effect when migration_state bit groups are active:
- RUNNING | NDMA
The device is not allowed to issue new DMA operations.
Whenever the kernel returns with a migration_state of NDMA there can be no
in progress DMAs.
- !RUNNING
The device should not change its internal state. Further implies the NDMA
behavior above.
- SAVING | !RUNNING
RESUMING | !RUNNING
The device may assume there are no incoming MMIO operations.
Internal state logging can stop.
- RUNNING
The device can alter its internal state and must respond to incoming MMIO.
- SAVING | RUNNING
The device is logging changes to the internal state.
- ERROR
The behavior of the device is largely undefined. The device must be
recovered by issuing VFIO_DEVICE_RESET or closing the device file
descriptor.
However, devices supporting NDMA must behave as though NDMA is asserted
during ERROR to avoid corrupting other devices or a VM during a failed
migration.
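To make the continuous actions above a little more concrete, the following
Python sketch maps a device_state value to the behaviors in effect. The
Behavior fields and the bit values are names invented for this example, and
the ERROR state is deliberately not modeled since its behavior is largely
undefined.

```python
from dataclasses import dataclass

RUNNING, SAVING, RESUMING, NDMA = 1, 2, 4, 8  # illustrative values only


@dataclass(frozen=True)
class Behavior:
    may_issue_dma: bool          # device may start new DMA operations
    may_change_state: bool       # device may alter its internal state
    must_handle_mmio: bool       # device must respond to incoming MMIO
    logs_internal_changes: bool  # internal (pre-copy) logging is active


def continuous_behavior(state):
    """Map a device_state value to the continuous behaviors listed above."""
    running = bool(state & RUNNING)
    return Behavior(
        # !RUNNING implies the NDMA behavior, so DMA needs RUNNING and !NDMA.
        may_issue_dma=running and not (state & NDMA),
        may_change_state=running,
        must_handle_mmio=running,
        logs_internal_changes=running and bool(state & SAVING),
    )
```

For example, SAVING | NDMA | RUNNING yields a device that still answers MMIO
and logs internal changes but issues no DMA, which is the grace state used in
the reference flow.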
When multiple bits change in the migration_state they may describe multiple
event triggered actions, and multiple changes to continuous actions. The
migration driver must process them in a priority order:
- SAVING | RUNNING
- NDMA
- !RUNNING
- SAVING | !RUNNING
- RESUMING
- !RESUMING
- RUNNING
- !NDMA
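A minimal Python sketch of how the priority order above could be applied to a
single device_state change. The bit values and the ordered_steps() helper are
assumptions for illustration. It treats a bit group as needing processing when
the requested state matches it and the previous state did not, which is an
interpretation rather than something the text spells out.

```python
RUNNING, SAVING, RESUMING, NDMA = 1, 2, 4, 8  # illustrative values only

# (label, mask, value) in the priority order listed above.
PRIORITY = [
    ("SAVING | RUNNING", SAVING | RUNNING, SAVING | RUNNING),
    ("NDMA", NDMA, NDMA),
    ("!RUNNING", RUNNING, 0),
    ("SAVING | !RUNNING", SAVING | RUNNING, SAVING),
    ("RESUMING", RESUMING, RESUMING),
    ("!RESUMING", RESUMING, 0),
    ("RUNNING", RUNNING, RUNNING),
    ("!NDMA", NDMA, 0),
]


def ordered_steps(old, new):
    """Bit groups that become active going from `old` to `new`, in order."""
    return [
        label
        for label, mask, value in PRIORITY
        if (new & mask) == value and (old & mask) != value
    ]
```

With this model, going from SAVING | NDMA | RUNNING to SAVING processes
!RUNNING first, then SAVING | !RUNNING, and drops NDMA last, so the device
never passes through a point where DMA is permitted again.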
In general, userspace can issue a VFIO_DEVICE_RESET ioctl and recover the
device back to device_state RUNNING. When a migration driver executes this
ioctl it should discard the data window and set migration_state to RUNNING as
part of resetting the device to a clean state. This must happen even if the
migration_state has errored. A freshly opened device FD should always be in
the RUNNING state.
The migration driver has limitations on what device state it can affect. Any
device state controlled by general kernel subsystems must not be changed
during RESUMING, and SAVING must tolerate mutation of this state. Change to
externally controlled device state can happen at any time, asynchronously, to
the migration (eg interrupt rebalancing).
Some examples of externally controlled state:
- MSI-X interrupt page
- MSI/legacy interrupt configuration
- Large parts of the PCI configuration space, ie common control bits
- PCI power management
- Changes via VFIO_DEVICE_SET_IRQS
During !RUNNING, especially during SAVING and RESUMING, the device may have
limitations on what it can tolerate. An ideal device will discard writes and
return all ones for reads on all incoming MMIO/PIO operations (exclusive of
the external state above) in !RUNNING. However, devices are free to have
undefined behavior if they receive MMIOs. This includes corrupting/aborting
the migration, dirtying pages, and segfaulting userspace.
However, a device may not compromise system integrity if it is subjected to a
MMIO. It can not trigger an error TLP, it can not trigger a Machine Check, and
it can not compromise device isolation.
There are several edge cases that userspace should keep in mind when
implementing migration:
- Device Peer to Peer DMA. In this case devices are able to issue DMAs to each
other's MMIO regions. The VMM can permit this if it maps the MMIO memory into
the IOMMU.
As Peer to Peer DMA is a MMIO touch like any other, it is important that
userspace suspend these accesses before entering any device_state where MMIO
is not permitted, such as !RUNNING. This can be accomplished with the NDMA
state. Userspace may also choose to remove MMIO mappings from the IOMMU if the
device does not support NDMA, and rely on that to guarantee quiet MMIO.
The Peer to Peer Grace States exist so that all devices may reach RUNNING
before any device is subjected to a MMIO access.
Failure to guarantee quiet MMIO may allow a hostile VM to use P2P to violate
the no-MMIO restriction during SAVING or RESUMING and corrupt the migration on
devices that cannot protect themselves.
- IOMMU Page faults handled in userspace can occur at any time. A migration
driver is not required to serialize in-progress page faults. It can assume
that all page faults are completed before entering SAVING | !RUNNING. Since
the guest VCPU is required to complete page faults the VMM can accomplish this
by asserting NDMA | VCPU_RUNNING and clearing all pending page faults before
clearing VCPU_RUNNING.
Devices that do not support NDMA cannot be configured to generate page faults
that require the VCPU to complete.
- pre-copy allows the device to implement a dirty log for its internal state.
During the SAVING | RUNNING state the data window should present the device
state being logged and during SAVING | !RUNNING the data window should present
the unlogged device state as well as the changes from the internal dirty log.
On RESUME these two data streams are concatenated together.
pre-copy is only concerned with internal device state. External DMAs are
covered by the separate DIRTY_TRACKING function.
- Atomic Read and Clear of the DMA log is a HW feature. If the tracker
cannot support this, then NDMA could be used to synthesize it less
efficiently.
- NDMA is optional. If the device does not support it then the NDMA states
are pushed down to the next step in the sequence and various behaviors that
rely on NDMA cannot be used.
- Migration control registers inside the same iommu_group as the VFIO device.
This immediately raises a security concern as userspace can use Peer to Peer
DMA to manipulate these migration control registers concurrently with
any kernel actions.
A device driver operating such a device must ensure that kernel integrity
can not be broken by hostile user space operating the migration MMIO
registers via peer to peer, at any point in the sequence. Notably the kernel
cannot use DMA to transfer any migration data.
However, as discussed above in the "Device Peer to Peer DMA" section, it can
assume quiet MMIO as a condition to have a successful and uncorrupted
migration.
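For the "Atomic Read and Clear" point in the list above, this is roughly what
synthesizing the operation with NDMA might look like from the VMM side. The
ToyTracker and ToyDmaDevice classes are hypothetical stand-ins, not VFIO
interfaces. The sketch only relies on the rule that no DMA is in progress once
NDMA has been set.

```python
class ToyTracker:
    """Hypothetical DMA dirty tracker without atomic read-and-clear."""

    def __init__(self):
        self.dirty = set()

    def read(self):
        return set(self.dirty)

    def clear(self):
        self.dirty.clear()


class ToyDmaDevice:
    """Hypothetical device whose DMA writes land in a shared tracker."""

    def __init__(self, tracker):
        self.tracker = tracker
        self.ndma = False

    def set_ndma(self, on):
        # Once this returns with NDMA set there is no DMA in progress.
        self.ndma = on

    def dma_write(self, page):
        if self.ndma:
            return False  # new DMA is not allowed while NDMA is asserted
        self.tracker.dirty.add(page)
        return True


def read_and_clear(devices, tracker):
    """Emulate an atomic read-and-clear by quiescing DMA around it."""
    for dev in devices:
        dev.set_ndma(True)
    try:
        pages = tracker.read()
        tracker.clear()
    finally:
        # Let DMA resume even if reading the log failed.
        for dev in devices:
            dev.set_ndma(False)
    return pages
```

This is less efficient than a hardware read-and-clear because every device
stalls its DMA for the duration of each sample.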
To elaborate details on the reference flows, they assume the following details
about the external behaviors:
- !VCPU_RUNNING
Userspace must not generate dirty pages or issue MMIO operations to devices.
For a VMM this would typically be a control toward KVM.
- DIRTY_TRACKING
Clear the DMA log and start DMA logging
DMA logs should be readable with an "atomic test and clear" to allow
continuous non-disruptive sampling of the log.
This is controlled by VFIO_IOMMU_DIRTY_PAGES_FLAG_START on the container
fd.
- !DIRTY_TRACKING
Freeze the DMA log, stop tracking and allow userspace to read it.
If userspace is going to have any use of the dirty log it must ensure
that all DMA is suspended before clearing DIRTY_TRACKING, for instance by
using NDMA or !RUNNING on all VFIO devices.