From: Alex Williamson <alex.williamson@redhat.com>
To: Kirti Wankhede <kwankhede@nvidia.com>
Cc: Zhengxiao.zx@Alibaba-inc.com, kevin.tian@intel.com,
yi.l.liu@intel.com, cjia@nvidia.com, eskultet@redhat.com,
ziye.yang@intel.com, qemu-devel@nongnu.org, cohuck@redhat.com,
shuangtai.tst@alibaba-inc.com, dgilbert@redhat.com,
zhi.a.wang@intel.com, mlevitsk@redhat.com, pasic@linux.ibm.com,
aik@ozlabs.ru, eauger@redhat.com, felipe@nutanix.com,
jonathan.davies@nutanix.com, yan.y.zhao@intel.com,
changpeng.liu@intel.com, Ken.Xue@amd.com
Subject: Re: [Qemu-devel] [PATCH v4 01/13] vfio: KABI for migration interface
Date: Fri, 21 Jun 2019 16:01:04 -0600
Message-ID: <20190621160104.6893958f@x1.home>
In-Reply-To: <24404cfe-bb2d-551b-af8f-609a4bdfce38@nvidia.com>
On Sat, 22 Jun 2019 02:00:08 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:
> On 6/22/2019 1:30 AM, Alex Williamson wrote:
> > On Sat, 22 Jun 2019 01:05:48 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >
> >> On 6/21/2019 8:33 PM, Alex Williamson wrote:
> >>> On Fri, 21 Jun 2019 11:22:15 +0530
> >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>
> >>>> On 6/20/2019 10:48 PM, Alex Williamson wrote:
> >>>>> On Thu, 20 Jun 2019 20:07:29 +0530
> >>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>>>
> >>>>>> - Defined MIGRATION region type and sub-type.
> >>>>>> - Used 3 bits to define VFIO device states.
> >>>>>> Bit 0 => _RUNNING
> >>>>>> Bit 1 => _SAVING
> >>>>>> Bit 2 => _RESUMING
> >>>>>> Combinations of these bits define the VFIO device's state during migration:
> >>>>>> _STOPPED => All bits 0 indicate the VFIO device is stopped.
> >>>>>> _RUNNING => Normal VFIO device running state.
> >>>>>> _SAVING | _RUNNING => vCPUs are running and the VFIO device is running, but it
> >>>>>> starts saving device state, i.e. the pre-copy state.
> >>>>>> _SAVING => vCPUs are stopped and the VFIO device should be stopped while it
> >>>>>> saves device state, i.e. the stop-and-copy state.
> >>>>>> _RESUMING => VFIO device resuming state.
> >>>>>> _SAVING | _RESUMING => Invalid state; _SAVING and _RESUMING must not both be set.
> >>>>>> - Defined vfio_device_migration_info structure, which will be placed at
> >>>>>> offset 0 of the migration region to get/set VFIO device related information.
> >>>>>> Defined members of the structure and their usage on read/write access:
> >>>>>> * device_state: (read/write)
> >>>>>> To convey the state the VFIO device should be transitioned to. Only 3 bits
> >>>>>> are used as of now.
> >>>>>> * pending_bytes: (read only)
> >>>>>> To get the number of bytes yet to be migrated for the VFIO device.
> >>>>>> * data_offset: (read only)
> >>>>>> To get the offset in the migration region from where data exists during
> >>>>>> _SAVING and to where data should be written by the user space application
> >>>>>> during the _RESUMING state.
> >>>>>> * data_size: (read/write)
> >>>>>> To get and set the size of data copied in the migration region during the
> >>>>>> _SAVING and _RESUMING states.
> >>>>>> * start_pfn, page_size, total_pfns: (write only)
> >>>>>> To request the bitmap of dirty pages from the vendor driver for total_pfns
> >>>>>> starting from the given start address.
> >>>>>> * copied_pfns: (read only)
> >>>>>> To get the number of pfns whose bitmap was copied to the migration region.
> >>>>>> The vendor driver should copy the bitmap with bits set only for
> >>>>>> pages to be marked dirty in the migration region. The vendor driver
> >>>>>> should return 0 if there are no dirty pages in the requested
> >>>>>> range, and -1 to mark all pages in the section as dirty.
> >>>>>>
> >>>>>> Migration region looks like:
> >>>>>>  ------------------------------------------------------------------
> >>>>>>  |vfio_device_migration_info|            data section             |
> >>>>>>  |                          |      ///////////////////////////    |
> >>>>>>  ------------------------------------------------------------------
> >>>>>>  ^                           ^                                  ^
> >>>>>>  offset 0 - trapped part    data_offset                    data_size
> >>>>>> The data section always follows the vfio_device_migration_info
> >>>>>> structure in the region, so data_offset will always be non-0.
> >>>>>> The offset from where data is copied is decided by the kernel driver;
> >>>>>> the data section can be trapped or mmapped depending on how the kernel
> >>>>>> driver defines it. If mmapped, then data_offset should be page
> >>>>>> aligned, whereas the initial section, which contains the
> >>>>>> vfio_device_migration_info structure, might not end at a page-aligned
> >>>>>> offset.
> >>>>>>
> >>>>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >>>>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >>>>>> ---
> >>>>>> linux-headers/linux/vfio.h | 71 ++++++++++++++++++++++++++++++++++++++++++++++
> >>>>>> 1 file changed, 71 insertions(+)
> >>>>>>
> >>>>>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> >>>>>> index 24f505199f83..274ec477eb82 100644
> >>>>>> --- a/linux-headers/linux/vfio.h
> >>>>>> +++ b/linux-headers/linux/vfio.h
> >>>>>> @@ -372,6 +372,77 @@ struct vfio_region_gfx_edid {
> >>>>>> */
> >>>>>> #define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD (1)
> >>>>>>
> >>>>>> +/* Migration region type and sub-type */
> >>>>>> +#define VFIO_REGION_TYPE_MIGRATION (2)
> >>>>>> +#define VFIO_REGION_SUBTYPE_MIGRATION (1)
> >>>>>> +
> >>>>>> +/**
> >>>>>> + * The structure vfio_device_migration_info is placed at offset 0 of the
> >>>>>> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related
> >>>>>> + * migration information. Field accesses from this structure are only
> >>>>>> + * supported at their native width and alignment; otherwise, an error
> >>>>>> + * should be returned.
> >>>>>> + *
> >>>>>> + * device_state: (read/write)
> >>>>>> + * To indicate to the vendor driver the state the VFIO device should be
> >>>>>> + * transitioned to. If the device state transition fails, a write to this
> >>>>>> + * field returns an error. It consists of 3 bits:
> >>>>>> + * - If bit 0 is set, it indicates the _RUNNING state. When it is reset, that
> >>>>>> + * indicates the _STOPPED state. When the device is changed to _STOPPED, the
> >>>>>> + * driver should stop the device before the write returns.
> >>>>>> + * - If bit 1 is set, it indicates the _SAVING state.
> >>>>>> + * - If bit 2 is set, it indicates the _RESUMING state.
> >>>>>> + *
> >>>>>> + * pending_bytes: (read only)
> >>>>>> + * Read the number of bytes yet to be migrated from the vendor driver.
> >>>>>> + *
> >>>>>> + * data_offset: (read only)
> >>>>>> + * User application should read data_offset in the migration region to learn
> >>>>>> + * from where it should read data during the _SAVING state or write data
> >>>>>> + * during the _RESUMING state.
> >>>>>> + *
> >>>>>> + * data_size: (read/write)
> >>>>>> + * User application should read data_size to know the size of data copied in
> >>>>>> + * the migration region during the _SAVING state and write the size of data
> >>>>>> + * copied in the migration region during the _RESUMING state.
> >>>>>> + *
> >>>>>> + * start_pfn: (write only)
> >>>>>> + * Start address pfn from which to get the bitmap of dirty pages from the
> >>>>>> + * vendor driver during the _SAVING state.
> >>>>>> + *
> >>>>>> + * page_size: (write only)
> >>>>>> + * User application should write the page size of the pfns.
> >>>>>> + *
> >>>>>> + * total_pfns: (write only)
> >>>>>> + * Total pfn count, starting from start_pfn, for which the dirty bitmap is
> >>>>>> + * requested.
> >>>>>> + *
> >>>>>> + * copied_pfns: (read only)
> >>>>>> + * The pfn count for which the dirty bitmap was copied to the migration
> >>>>>> + * region. The vendor driver should copy the bitmap with bits set only for
> >>>>>> + * pages to be marked dirty in the migration region.
> >>>>>> + * The vendor driver should return 0 if there are no dirty pages in the
> >>>>>> + * requested range.
> >>>>>> + * The vendor driver should return -1 to mark all pages in the section as
> >>>>>> + * dirty.
> >>>>> Is the protocol that the user writes start_pfn/page_size/total_pfns in
> >>>>> any order and then the read of copied_pfns is what triggers the
> >>>>> snapshot?
> >>>>
> >>>> Yes.
> >>>>
> >>>>> Are start_pfn/page_size/total_pfns sticky such that a user
> >>>>> can write them once and get repeated refreshes of the dirty bitmap by
> >>>>> re-reading copied_pfns?
> >>>>
> >>>> Yes, and that bitmap should be for the given range (from start_pfn to
> >>>> start_pfn + total_pfns).
> >>>> Re-reading copied_pfns handles the case where the area the vendor driver
> >>>> reserved for the bitmap is smaller than the total bitmap size for the
> >>>> range (start_pfn to start_pfn + total_pfns); the user will then have to
> >>>> iterate until copied_pfns == total_pfns or until copied_pfns == 0 (that
> >>>> is, there are no dirty pages in the rest of the range).
> >>>
> >>> So reading copied_pfns triggers the data range to be updated, but the
> >>> caller cannot assume it to be synchronous and uses total_pfns to poll
> >>> that the update is complete? How does the vendor driver differentiate
> >>> the user polling for the previous update to finish versus requesting a
> >>> new update?
> >>>
> >>
> >> A write to start_pfn/page_size/total_pfns followed by a read of
> >> copied_pfns indicates a new update, whereas a sequential read of
> >> copied_pfns indicates polling for the previous update.
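
(In rough C, the distinction described above might look like the sketch
below. pread()/pwrite() on the device fd at the migration region offset
is the usual VFIO region access model; OFF() is a hypothetical helper
combining the region offset with offsetof() for the named field, and
error checking and #includes are omitted.)

	#define OFF(field) \
		(region_off + offsetof(struct vfio_device_migration_info, field))

	__u64 start = 0, pgsize = 4096, total = npfns, copied;

	/* New request: write the range, then read copied_pfns to trigger it. */
	pwrite(device_fd, &start, sizeof(start), OFF(start_pfn));
	pwrite(device_fd, &pgsize, sizeof(pgsize), OFF(page_size));
	pwrite(device_fd, &total, sizeof(total), OFF(total_pfns));
	pread(device_fd, &copied, sizeof(copied), OFF(copied_pfns));

	/* Polling for that same request: re-read copied_pfns without
	 * touching the write-only fields. */
	pread(device_fd, &copied, sizeof(copied), OFF(copied_pfns));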
> >
> > Hmm, this seems to contradict the answer to my question above where I
> > ask if the write fields are sticky so a user can trigger a refresh via
> > copied_pfns.
>
> Sorry, how does it contradict? Pasting it again below:
> >>>>> Are start_pfn/page_size/total_pfns sticky such that a user
> >>>>> can write them once and get repeated refreshes of the dirty bitmap by
> >>>>> re-reading copied_pfns?
> >>>>
> >>>> Yes, and that bitmap should be for the given range (from start_pfn to
> >>>> start_pfn + total_pfns).
> >>>> Re-reading copied_pfns handles the case where the area the vendor driver
> >>>> reserved for the bitmap is smaller than the total bitmap size for the
> >>>> range (start_pfn to start_pfn + total_pfns); the user will then have to
> >>>> iterate until copied_pfns == total_pfns or until copied_pfns == 0 (that
> >>>> is, there are no dirty pages in the rest of the range).
Sorry, I guess I misinterpreted again. So the vendor driver can return
copied_pfns < total_pfns if it has a buffer limitation, not as an
indication of its background progress in writing out the bitmap. Just
as a proof of concept, let's say the vendor driver has a 1 bit buffer
and I write 0 to start_pfn and 3 to total_pfns. I read copied_pfns,
which returns 1, so I read data_offset to find where this 1 bit is
located and then read my bit from that location. This is the dirty
state of the first pfn. I read copied_pfns again and it reports 2, so
I again read data_offset to find where the data is located, and it's my
job to remember that I've already read 1 bit, so 2 means there's only 1
bit available and it's the second pfn. I read the bit. I again read
copied_pfns, which now reports 3, I read data_offset to find the
location of the data, I remember that I've already read 2 bits, so I
read my bit for the 3rd pfn. This seems rather clumsy.
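
(Spelled out as code, the 1 bit buffer walk above might look like the
sketch below; OFF() as in the earlier sketch, with error handling and
the copied_pfns == -1 all-dirty case omitted.)

	__u64 done = 0, total = 3, copied, data_offset;
	__u8 bits;

	do {
		pread(device_fd, &copied, sizeof(copied), OFF(copied_pfns));
		pread(device_fd, &data_offset, sizeof(data_offset),
		      OFF(data_offset));
		/* copied is cumulative, so only bits [done, copied) are new,
		 * and it is the user's job to remember 'done' across
		 * iterations; with a 1 bit buffer, each pass delivers a
		 * single new bit at data_offset. */
		pread(device_fd, &bits, sizeof(bits), region_off + data_offset);
		done = copied;
	} while (done < total && copied != 0);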
Now that copied_pfns == total_pfns, what happens if I read copied_pfns
again? This is actually what I thought I was asking previously.
Should we expose the pfn buffer size and fault on writes larger than
that size, requiring the user to iterate start_pfn themselves? Are
there any operations where the user can assume data_offset is constant?
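
(For illustration, that alternative might look roughly like the loop
below, where bitmap_max_pfns is a hypothetical field advertising the
vendor driver's buffer size, and writes of total_pfns larger than it
would fault.)

	__u64 pfn, chunk, remaining;

	for (pfn = 0; pfn < npfns; pfn += chunk) {
		remaining = npfns - pfn;
		chunk = remaining < bitmap_max_pfns ? remaining
						    : bitmap_max_pfns;
		pwrite(device_fd, &pfn, sizeof(pfn), OFF(start_pfn));
		pwrite(device_fd, &chunk, sizeof(chunk), OFF(total_pfns));
		pread(device_fd, &copied, sizeof(copied), OFF(copied_pfns));
		/* ...read 'copied' bits of bitmap at region_off + data_offset... */
	}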
Thanks,

Alex