From: Alex Williamson <alex.williamson@redhat.com>
To: Kirti Wankhede <kwankhede@nvidia.com>
Cc: cjia@nvidia.com, kevin.tian@intel.com, ziye.yang@intel.com,
changpeng.liu@intel.com, yi.l.liu@intel.com, mlevitsk@redhat.com,
eskultet@redhat.com, cohuck@redhat.com, dgilbert@redhat.com,
jonathan.davies@nutanix.com, eauger@redhat.com, aik@ozlabs.ru,
pasic@linux.ibm.com, felipe@nutanix.com,
Zhengxiao.zx@Alibaba-inc.com, shuangtai.tst@alibaba-inc.com,
Ken.Xue@amd.com, zhi.a.wang@intel.com, yan.y.zhao@intel.com,
yulei.zhang@intel.com, qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [PATCH v3 1/5] VFIO KABI for migration interface
Date: Thu, 7 Mar 2019 12:45:18 -0700
Message-ID: <20190307124518.37a0908a@w520.home>
In-Reply-To: <f15fecfe-26c0-05fa-da3b-20156145d287@nvidia.com>

On Wed, 27 Feb 2019 01:35:01 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:
> Alex,
>
> On 2/22/2019 3:53 AM, Alex Williamson wrote:
> > On Wed, 20 Feb 2019 02:53:16 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >
> >> - Defined MIGRATION region type and sub-type.
> >> - Used 2 bits to define VFIO device states.
> >> Bit 0 => 0/1 => _STOPPED/_RUNNING
> >> Bit 1 => 0/1 => _RESUMING/_SAVING
> >> Combination of these bits defines VFIO device's state during migration
> >> _RUNNING => Normal VFIO device running state.
> >> _STOPPED => VFIO device stopped.
> >> _SAVING | _RUNNING => vCPUs are running, VFIO device is running but start
> >> saving state of device i.e. pre-copy state
> >> _SAVING | _STOPPED => vCPUs are stopped, VFIO device should be stopped, and
> >> save device state, i.e. stop-and-copy state
> >> _RESUMING => VFIO device resuming state.
> >
> > Shouldn't we have a non-_RESUMING/_SAVING run state? If these bits
> > directly indicate the direction of flow, maybe we need two bits:
> >
> > 00b - None, normal runtime
> > 01b - Saving
> > 10b - Resuming
> > 11b - Invalid/reserved (maybe a Failed state indicator)
> >
>
> There has to be 2 more states:
> _SAVING | _RUNNING => SAVING while device is RUNNING - pre-copy phase
> _SAVING | _STOPPED => SAVING while device is STOPPED - stop-and-copy phase
Sorry, this was unclear; a separate third bit would need to be the stop
bit. The above was only focusing on the data direction, because...
> So the 2 bits used in this patch are:
> 00b - _RESUMING
This is actually:
_RESUMING | _STOPPED
> 01b - _RUNNING - Normal Running
And this is really:
_RESUMING | _RUNNING
We're just selectively ignoring _RESUMING when we choose to. So my
question is really more like:
|0|00|
^ ^^
| ||
| |+-- Saving
| +--- Resuming
+----- Stopped
Where Saving|Resuming both set is either invalid or a failure indicator.
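As a sketch only (the bit positions and names here are illustrative, not a
proposed ABI), the three-bit layout above could be modeled as:

```python
# Illustrative model of the 3-bit device_state sketched above:
# bit 0 = Saving, bit 1 = Resuming, bit 2 = Stopped (hypothetical layout).
SAVING = 1 << 0
RESUMING = 1 << 1
STOPPED = 1 << 2

def is_valid(state):
    # Saving and Resuming set together is invalid (or a failure indicator).
    return (state & (SAVING | RESUMING)) != (SAVING | RESUMING)

RUNNING_NORMAL = 0                   # 000b - normal runtime
PRE_COPY = SAVING                    # 001b - saving while running
STOP_AND_COPY = SAVING | STOPPED     # 101b - saving while stopped
RESUME = RESUMING | STOPPED          # 110b - resuming, device stopped
```

This keeps "normal runtime" as the all-zero state and makes the
Saving|Resuming combination detectable as invalid.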
> 10b - _SAVING | _STOPPED - stop-and-copy phase
> 11b - _SAVING | _RUNNING - pre-copy phase
>
>
> >> - Defined vfio_device_migration_info structure which will be placed at 0th
> >> offset of migration region to get/set VFIO device related information.
> >> Defined members of structure and usage on read/write access:
> >> * device_state: (write only)
> >> To convey VFIO device state to be transitioned to.
> >
> > Seems trivial and potentially useful to support read here, we have 30
> > (or maybe 29) bits yet to define.
> >
>
> Ok, those bits can be used later.
>
> >> * pending bytes: (read only)
> >> To get pending bytes yet to be migrated for VFIO device
> >> * data_offset: (read/write)
> >> To get or set data offset in migration from where data exist
> >> during _SAVING and _RESUMING state
> >
> > What's the use case for writing this?
> >
>
> During resuming, user space application (QEMU) writes data to migration
> region. At that time QEMU writes data_offset from where data is written,
> so that the vendor driver can read data from that offset. If the data
> section is mmapped, data_offset is the start of the mmapped region,
> whereas if the data section is trapped, data_offset is
> sizeof(struct vfio_device_migration_info) + 1, just after the
> vfio_device_migration_info structure.
It doesn't make sense to me that this proposal allows the user to write
wherever they want, the vendor driver defines whether they support mmap
for a given operation and the user can choose to make use of mmap or
not. Therefore I'd suggest that the vendor driver expose a read-only
data_offset that matches a sparse mmap capability entry should the
driver support mmap. The user should always read or write data from the
vendor defined data_offset. Otherwise we have scenarios like the other
patch I commented on where the user implicitly decides that the first
mmap region large enough is the one to be used for a given operation,
removing the vendor driver's ability to decide whether it wants to
support mmap for that operation.
> >> * data_size: (write only)
> >> To convey size of data copied in migration region during _RESUMING
> >> state
> >
> > How to know how much is available for read?
>
> pending_bytes tells how much is still to be read.
But pending_bytes can be bigger than the region, right? Does the data
area necessarily extend to the end of the region?
> >> * start_pfn, page_size, total_pfns: (write only)
> >> To get bitmap of dirty pages from vendor driver from given
> >> start address for total_pfns.
> >
> > What would happen if a user wrote in 1MB for page size? Is the vendor
> > driver expected to support arbitrary page sizes? Are we only trying to
> > convey the page size and would that page size ever be other than
> > getpagesize()?
> >
> >> * copied_pfns: (read only)
> >> To get number of pfns bitmap copied in migration region.
> >> Vendor driver should copy the bitmap with bits set only for
> >> pages to be marked dirty in migration region. Vendor driver
> >> should return 0 if there are 0 pages dirty in requested
> >> range.
> >
> > This is useful, but I wonder if it's really a required feature for the
> > vendor driver. For instance, with mdev IOMMU support we could wrap an
> > arbitrary PCI device as mdev, but we don't necessarily have dirty page
> > tracking. Would a device need to report -1 here if it wanted to
> > indicate any page could be dirty if we only know how to collect the
> > state of the device itself for migration (ie. force the device to be
> > stopped first).
> >
>
> Does that mean if returned -1 here, mark all pages in the section as dirty?
Yes
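To make the -1 convention concrete, here is a small model (purely
illustrative; the helper name and its exact semantics are not part of the
proposal) of how user space might interpret copied_pfns:

```python
# Hypothetical interpretation of copied_pfns after user space writes
# start_pfn/page_size/total_pfns: -1 means the driver cannot track dirty
# pages, so every page in the requested range must be treated as dirty.
def pages_to_mark_dirty(copied_pfns, total_pfns, bitmap):
    if copied_pfns == -1:          # no tracking: assume everything dirty
        return list(range(total_pfns))
    if copied_pfns == 0:           # nothing dirty in the requested range
        return []
    # otherwise walk the bitmap returned in the migration region
    return [pfn for pfn in range(copied_pfns) if bitmap & (1 << pfn)]
```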
> >> Migration region looks like:
> >> ------------------------------------------------------------------
> >> |vfio_device_migration_info| data section |
> >> | | /////////////////////////////// |
> >> ------------------------------------------------------------------
> > ^ what's this?
> >
> >> ^ ^ ^
> >> offset 0-trapped part data.offset data.size
> >
> > Isn't data.size above really (data.offset + data.size)? '.' vs '_'
> > inconsistency vs above.
> >
>
> Yes, it should be '_'. I'll correct that.
>
>
> >> Data section is always followed by vfio_device_migration_info
> >> structure in the region, so data.offset will always be none-0.
> >
> > This seems exactly backwards from the diagram, data section follows
> > vfio_device_migration_info. Also, "non-zero".
> >
> >> Offset from where data is copied is decided by kernel driver, data
> >
> > But data_offset is listed as read-write.
> >
>
> data_offset is read during _SAVING state so QEMU knows from where to
> read data
> data_offset is written during _RESUMING state so that QEMU can convey
> offset in migration region to vendor driver from where data should be
> considered as input data.
Please let's drop the idea that the user can write data to an arbitrary
offset within the region. Does it serve any value?
> >> section can be trapped or mapped depending on how kernel driver
> >> defines data section. If mmapped, then data.offset should be page
> >> aligned, whereas the initial section which contains the
> >> vfio_device_migration_info structure might not end at offset which
> >> is page aligned.
> >>
> >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >> ---
> >> linux-headers/linux/vfio.h | 65 ++++++++++++++++++++++++++++++++++++++++++++++
> >> 1 file changed, 65 insertions(+)
> >>
> >> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> >> index 12a7b1dc53c8..1b12a9b95e00 100644
> >> --- a/linux-headers/linux/vfio.h
> >> +++ b/linux-headers/linux/vfio.h
> >> @@ -368,6 +368,71 @@ struct vfio_region_gfx_edid {
> >> */
> >> #define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD (1)
> >>
> >> +/* Migration region type and sub-type */
> >> +#define VFIO_REGION_TYPE_MIGRATION (2)
> >> +#define VFIO_REGION_SUBTYPE_MIGRATION (1)
> >> +
> >> +/**
> >> + * Structure vfio_device_migration_info is placed at 0th offset of
> >> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
> >> + * information. Field accesses from this structure are only supported at their
> >> + * native width and alignment, otherwise should return error.
> >> + *
> >> + * device_state: (write only)
> >> + *   To indicate to the vendor driver the state the VFIO device should be
> >> + *   transitioned to. If the device state transition fails, a write to this
> >> + *   field returns an error. It consists of 2 bits.
> >> + * - If bit 0 is set, it indicates the _RUNNING state; when it is cleared,
> >> + *   that indicates the _STOPPED state. When the device is changed to
> >> + *   _STOPPED, the driver should stop the device before the write returns.
> >> + * - If bit 1 is set, it indicates the _SAVING state; when it is cleared,
> >> + *   that indicates the _RESUMING state.
> >> + *
> >> + * pending bytes: (read only)
> >> + * Read pending bytes yet to be migrated from vendor driver
> >
> > Is this essentially a volatile value, changing based on data previously
> > copied and device activity?
>
> Yes.
>
> >
> >> + *
> >> + * data_offset: (read/write)
> >> + *   User application should read data_offset in the migration region,
> >> + *   from where it should read data during the _SAVING state.
> >> + *   User application should write data_offset in the migration region,
> >> + *   from where it has written data, during the _RESUMING state.
> >
> > Why wouldn't data_offset simply be fixed? Do we really want to support
> > the user writing a arbitrary offsets in the migration region? Each
> > chunk can simply start at the kernel provided data_offset.
> >
>
> data_offset differs based on whether the region is trapped or mmapped. In
> case the region is mmapped, QEMU writes data to the mmapped region and
> then writes the data_offset field, indicating data is present in the
> mmapped buffer. Read below for more detailed steps.
>
> >> + *
> >> + * data_size: (write only)
> >> + * User application should write size of data copied in migration region
> >> + * during _RESUMING state.
> >
> > How much does the user read on _SAVING then?
> >
>
> pending_bytes tells bytes that should be read on _SAVING.
So pending_bytes is always less than or equal to the size of the
region, thus pending_bytes cannot be used to gauge the _total_ pending
device state?
> >> + *
> >> + * start_pfn: (write only)
> >> + *   Start address pfn to get bitmap of dirty pages from vendor driver during
> >> + * _SAVING state.
> >> + *
> >> + * page_size: (write only)
> >> + * User application should write the page_size of pfn.
> >> + *
> >> + * total_pfns: (write only)
> >> + * Total pfn count from start_pfn for which dirty bitmap is requested.
> >
> > So effectively the area that begins at data_offset is dual purpose
> > (triple purpose vs Yan's, config, memory, and dirty bitmap) but the
> > protocol isn't entirely evident to me. I think we need to write it out
> > as Yan provided. If I'm in the _SAVING state, do I simply read from
> > data_offset to min(data_size, region.size - data_offset)? If that area
> > is mmap'd, how does the user indicate to the kernel to prepare the data
> > or that X bytes were acquired? In the _RESUMING state, do I write from
> > data_offset to min(data_size, region.size - data_offset) and indicate
> > the write using data_size?
> >
>
> Let me list down the steps for each state, hope that helps to answer all
> above questions.
>
> In _SAVING|_RUNNING device state:
> - read pending_bytes
> - read data_offset - indicates kernel driver to write data to staging
> buffer which is mmapped.
> - if data section is trapped, pread() from data_offset to
> min(pending_bytes, remaining_region)
> - if data section is mmaped, read mmaped buffer of size
> min(pending_bytes, remaining_region)
Ok, pending_bytes can describe more than the region currently holds.
It still seems worthwhile to have a data_size so that we can extend
this interface. For instance what if a driver wanted to trap data
accesses but mmap a dirty bitmap. This interface doesn't allow for
that since they're both necessarily covered by the same sparse mmap
offset.
> - Write data packet to file stream as below:
> {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data,
> VFIO_MIG_FLAG_END_OF_STATE }
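One iteration of the pre-copy steps above might look like the following
sketch; read_field and read_data are hypothetical stand-ins for the trapped
pread() or mmapped access, and REGION_SIZE is an assumed region size:

```python
REGION_SIZE = 1 << 20  # assumed size of the migration region

def precopy_iteration(read_field, read_data):
    # One _SAVING|_RUNNING iteration: read pending_bytes, then data_offset
    # (which tells the driver to stage data), then read up to
    # min(pending_bytes, remaining_region) bytes of device state.
    pending = read_field('pending_bytes')
    if pending == 0:
        return b''
    offset = read_field('data_offset')
    remaining_region = REGION_SIZE - offset
    return read_data(offset, min(pending, remaining_region))
```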
>
>
> In _SAVING|_STOPPED device state:
> a. read config space of device and save to migration file stream. This
> doesn't need to be from vendor driver. Any other special config state
> from driver can be saved as data in following iteration.
> b. read pending_bytes - indicates kernel driver to write data to staging
> buffer which is mmapped.
> c. if data section is trapped, pread() from data_offset to
> min(pending_bytes, remaining_region)
> d. if data section is mmaped, read mmaped buffer of size
> min(pending_bytes, remaining_region)
> e. Write data packet as below:
> {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
> f. iterate through steps b to e while (pending_bytes > 0)
What indicates to the vendor driver to deduct from pending_bytes and
advance the data? It seems like it's assumed that a read of
pending_bytes indicates this, but I don't like that implicit
interaction, a user could be interrupted and read pending_bytes again
to see if there's still data and introduce data loss. IMO, there
should be an explicit "Ok, I read # bytes, advance the data stream"
type operation.
> g. Write {VFIO_MIG_FLAG_END_OF_STATE}
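With the explicit acknowledgement suggested above, the stop-and-copy drain
loop could look like this sketch; ack_bytes is a hypothetical write telling
the driver to deduct from pending_bytes and advance the data stream, and
FakeDev is only a stand-in for exercising the loop:

```python
def drain_device_state(dev):
    # Stop-and-copy drain loop with an explicit "I consumed N bytes"
    # acknowledgement, so re-reading pending_bytes cannot lose data.
    chunks = []
    while True:
        pending = dev.read_field('pending_bytes')
        if pending == 0:
            break
        offset = dev.read_field('data_offset')
        size = min(pending, dev.region_size - offset)
        chunks.append(dev.read_data(offset, size))
        dev.ack_bytes(size)          # hypothetical explicit-ack write
    return b''.join(chunks)

class FakeDev:
    # Stand-in vendor driver: 10000 bytes of state, 4 KiB region.
    region_size = 4096
    def __init__(self):
        self.pending, self.offset = 10000, 64
    def read_field(self, name):
        return self.pending if name == 'pending_bytes' else self.offset
    def read_data(self, offset, size):
        return b'\0' * size
    def ack_bytes(self, size):
        self.pending -= size         # only the ack advances the stream
```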
>
>
> Dirty page tracking (.log_sync) is part of RAM copying state, where
> vendor driver provides the bitmap of pages which are dirtied by vendor
> driver through migration region and as part of RAM copy, those pages
> gets copied to file stream.
>
>
> In _RESUMING device state:
> - load config state.
> - For each data packet, until VFIO_MIG_FLAG_END_OF_STATE is reached:
> - read data_size from packet, read buffer of data_size
> if region is mmaped, write data of data_size to mmaped region.
> - write data_offset and data_size.
> In case of a mmapped region, the write to data_size indicates to the
> kernel driver that data is written in the staging buffer.
> - if region is trapped, pwrite() data of data_size from data_offset.
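The resume steps above could be sketched as follows; all accessor names are
hypothetical stand-ins for the trapped or mmapped region accesses:

```python
def resume_load(dev, stream_packets):
    # _RESUMING: place each packet at the driver-provided data_offset and
    # report the byte count via data_size to signal the data is staged.
    offset = dev.read_field('data_offset')
    for data in stream_packets:
        dev.write_data(offset, data)             # mmapped copy or pwrite()
        dev.write_field('data_size', len(data))  # driver consumes the data

class FakeResumeDev:
    # Stand-in for the migration region accessors (illustrative only).
    def __init__(self):
        self.staged = []
        self._buf = None
    def read_field(self, name):
        return 64                                # assumed data_offset
    def write_data(self, offset, data):
        self._buf = (offset, data)
    def write_field(self, name, value):
        self.staged.append(self._buf)
```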
I still think data_offset should be read_only and this should use the
same operation above to indicate how many bytes were written rather
than read. Thanks,
Alex
> >> + *
> >> + * copied_pfns: (read only)
> >> + * pfn count for which dirty bitmap is copied to migration region.
> >> + * Vendor driver should copy the bitmap with bits set only for pages to be
> >> + * marked dirty in migration region.
> >> + * Vendor driver should return 0 if there are 0 pages dirty in requested
> >> + * range.
> >> + */
> >> +
> >> +struct vfio_device_migration_info {
> >> + __u32 device_state; /* VFIO device state */
> >> +#define VFIO_DEVICE_STATE_RUNNING (1 << 0)
> >> +#define VFIO_DEVICE_STATE_SAVING (1 << 1)
> >> + __u32 reserved;
> >> + __u64 pending_bytes;
> >> + __u64 data_offset;
> >> + __u64 data_size;
> >> + __u64 start_pfn;
> >> + __u64 page_size;
> >> + __u64 total_pfns;
> >> + __u64 copied_pfns;
> >> +} __attribute__((packed));
> >> +
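For what it's worth, the field offsets implied by the packed layout above
can be checked with a quick sketch (assuming little-endian and the field
order shown):

```python
import struct

# Two __u32 followed by seven __u64, packed ('<' disables padding).
FMT = '<II7Q'
FIELDS = ['device_state', 'reserved', 'pending_bytes', 'data_offset',
          'data_size', 'start_pfn', 'page_size', 'total_pfns', 'copied_pfns']

def field_offsets():
    # Accumulate byte offsets field by field: 4-byte fields first, then 8s.
    offsets, off = {}, 0
    for name, fmt in zip(FIELDS, ['I', 'I'] + ['Q'] * 7):
        offsets[name] = off
        off += struct.calcsize(fmt)
    return offsets
```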
> >
> > As you mentioned, and with Yan's version, we still need to figure out
> > something with compatibility and versioning. Thanks,
> >
>
> Yes, we still need to figure out compatibility and versioning.
>
> Thanks,
> Kirti
>
> > Alex
> >
> >> /*
> >> * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
> >> * which allows direct access to non-MSIX registers which happened to be within
> >