From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
To: Zhao Yan <yan.y.zhao@intel.com>
Cc: cjia@nvidia.com, kvm@vger.kernel.org, aik@ozlabs.ru,
Zhengxiao.zx@alibaba-inc.com, shuangtai.tst@alibaba-inc.com,
qemu-devel@nongnu.org, kwankhede@nvidia.com, eauger@redhat.com,
yi.l.liu@intel.com, eskultet@redhat.com, ziye.yang@intel.com,
mlevitsk@redhat.com, pasic@linux.ibm.com,
arei.gonglei@huawei.com, felipe@nutanix.com, Ken.Xue@amd.com,
kevin.tian@intel.com, alex.williamson@redhat.com,
intel-gvt-dev@lists.freedesktop.org, changpeng.liu@intel.com,
cohuck@redhat.com, zhi.a.wang@intel.com,
jonathan.davies@nutanix.com
Subject: Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration
Date: Wed, 20 Feb 2019 11:01:43 +0000
Message-ID: <20190220110142.GD2608@work-vm>
In-Reply-To: <20190220052838.GC16456@joy-OptiPlex-7040>
* Zhao Yan (yan.y.zhao@intel.com) wrote:
> On Tue, Feb 19, 2019 at 11:32:13AM +0000, Dr. David Alan Gilbert wrote:
> > * Yan Zhao (yan.y.zhao@intel.com) wrote:
> > > This patchset enables VFIO devices to have live migration capability.
> > > Currently it does not support post-copy phase.
> > >
> > > It follows Alex's comments on last version of VFIO live migration patches,
> > > including device states, VFIO device state region layout, dirty bitmap's
> > > query.
> >
> > Hi,
> > I've sent minor comments to later patches; but some minor general
> > comments:
> >
> > a) Never trust the incoming migration stream - it might be corrupt,
> > so check when you can.
> hi Dave
> Thanks for this suggestion. I'll add more checks for migration streams.
>
>
> > b) How do we detect if we're migrating from/to the wrong device or
> > version of device? Or say to a device with older firmware or perhaps
> > a device that has less device memory ?
> Actually it's still an open question for VFIO migration. Need to think about
> whether it's better to check that in libvirt or qemu (like a device magic
> along with version ?).
> This patchset is intended to settle down the main device state interfaces
> for VFIO migration. So that we can work on that and improve it.
>
>
> > c) Consider using the trace_ mechanism - it's really useful to
> > add to loops writing/reading data so that you can see when it fails.
> >
> > Dave
> >
> Got it. many thanks~~
>
>
> > (P.S. You have a few typos; grep your code for 'devcie', 'devie' and
> > 'migrtion'.)
>
> sorry :)
No problem.
Given the mails, I'm guessing you've mostly tested this on graphics
devices? Have you also checked with VFIO network cards?
Also see the mail I sent in reply to Kirti's series; we need to boil
these down to one solution.
Dave
> >
> > > Device Data
> > > -----------
> > > Device data is divided into three types: device memory, device config,
> > > and system memory dirty pages produced by device.
> > >
> > > Device config: data like MMIOs, page tables...
> > > Every device is supposed to possess device config data.
> > > Usually device config's size is small (no bigger than 10M), and it
> > > needs to be loaded in certain strict order.
> > > Therefore, device config only needs to be saved/loaded in
> > > stop-and-copy phase.
> > > The data of device config is held in device config region.
> > > Size of device config data is smaller than or equal to that of
> > > device config region.
> > >
> > > Device Memory: device's internal memory, standalone and outside system
> > > memory. It is usually very big.
> > > This kind of data needs to be saved / loaded in pre-copy and
> > > stop-and-copy phase.
> > > The data of device memory is held in device memory region.
> > > Size of device memory is usually larger than that of device
> > > memory region. qemu needs to save/load it in chunks of size of
> > > device memory region.
> > > Not all devices have device memory. IGD, for example, only uses system memory.
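
As a rough illustration of the chunking described above (a sketch only, not
taken from the patchset): device memory of total_size bytes is moved through a
device memory region of region_size bytes, one chunk at a time, with pos being
the offset written to device_memory.pos. The names struct chunk, chunk_count
and chunk_at are hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper type: one chunk of device memory to move through
 * the device memory region. Not part of the patchset. */
struct chunk {
    uint64_t pos;   /* offset within the total device memory buffer */
    uint64_t len;   /* bytes in this chunk, at most region_size */
};

/* Number of chunks needed to move total_size bytes through a region of
 * region_size bytes. Returns 0 when the region is empty. */
static uint64_t chunk_count(uint64_t total_size, uint64_t region_size)
{
    if (region_size == 0) {
        return 0;
    }
    return total_size / region_size + (total_size % region_size != 0);
}

/* Describe the idx-th chunk. Returns 0 on success, -1 if idx is out of range. */
static int chunk_at(uint64_t total_size, uint64_t region_size,
                    uint64_t idx, struct chunk *out)
{
    if (idx >= chunk_count(total_size, region_size)) {
        return -1;
    }
    out->pos = idx * region_size;
    out->len = total_size - out->pos;
    if (out->len > region_size) {
        out->len = region_size;   /* every chunk but the last is full */
    }
    return 0;
}
```

Only the last chunk can be shorter than the region, so a loader on the
destination side can reject any chunk whose pos + len exceeds the announced
total size.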
> > >
> > > System memory dirty pages: If a device produces dirty pages in system
> > > memory, it is able to get dirty bitmap for certain range of system
> > > memory. This dirty bitmap is queried in pre-copy and stop-and-copy
> > > phase in .log_sync callback. By setting dirty bitmap in .log_sync
> > > callback, dirty pages in system memory will be saved/loaded by ram's
> > > live migration code.
> > > The dirty bitmap of system memory is held in dirty bitmap region.
> > > If system memory range is larger than what the dirty bitmap region can
> > > hold, qemu will cut it into several chunks and get dirty bitmap in
> > > succession.
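
A minimal sketch of that splitting, under two assumptions that the cover
letter does not spell out: the bitmap uses one bit per page, and pages are 4k.
struct bitmap_query and split_bitmap_queries are hypothetical names, not code
from the series.

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL   /* assumption: 4k pages */

/* Hypothetical description of one dirty bitmap query: the values that
 * would be written to system_memory.start_addr and system_memory.page_nr. */
struct bitmap_query {
    uint64_t start_addr;
    uint64_t page_nr;
};

/* Split a range of page_nr pages starting at start_addr into queries that
 * each fit a dirty bitmap region of bitmap_region_size bytes (assuming one
 * bit per page). Stores at most max_out queries in out and returns the
 * number of queries needed; returns 0 if the region cannot hold any bits. */
static uint64_t split_bitmap_queries(uint64_t start_addr, uint64_t page_nr,
                                     uint64_t bitmap_region_size,
                                     struct bitmap_query *out, uint64_t max_out)
{
    uint64_t per_query = bitmap_region_size * 8;
    uint64_t n = 0;

    if (per_query == 0) {
        return 0;
    }
    while (page_nr > 0) {
        uint64_t now = page_nr < per_query ? page_nr : per_query;

        if (n < max_out) {
            out[n].start_addr = start_addr;
            out[n].page_nr = now;
        }
        start_addr += now * SKETCH_PAGE_SIZE;
        page_nr -= now;
        n++;
    }
    return n;
}
```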
> > >
> > >
> > > Device State Regions
> > > --------------------
> > > Vendor driver is required to expose two mandatory regions and another two
> > > optional regions if it plans to support device state management.
> > >
> > > So, there are up to four regions in total.
> > > One control region: mandatory.
> > > Get access via read/write system call.
> > > Its layout is defined in struct vfio_device_state_ctl
> > > Three data regions: mmaped into qemu.
> > > device config region: mandatory, holding data of device config
> > > device memory region: optional, holding data of device memory
> > > dirty bitmap region: optional, holding bitmap of system memory
> > > dirty pages
> > >
> > > (The reason why four separate regions are defined is that the unit of mmap
> > > system call is PAGE_SIZE, i.e. 4k bytes. So one read/write region for
> > > control and three mmaped regions for data seems better than one big region
> > > padded and sparse mmaped).
> > >
> > >
> > > kernel device state interface [1]
> > > --------------------------------------
> > > #define VFIO_DEVICE_STATE_INTERFACE_VERSION 1
> > > #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> > > #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> > >
> > > #define VFIO_DEVICE_STATE_RUNNING 0
> > > #define VFIO_DEVICE_STATE_STOP 1
> > > #define VFIO_DEVICE_STATE_LOGGING 2
> > >
> > > #define VFIO_DEVICE_DATA_ACTION_GET_BUFFER 1
> > > #define VFIO_DEVICE_DATA_ACTION_SET_BUFFER 2
> > > #define VFIO_DEVICE_DATA_ACTION_GET_BITMAP 3
> > >
> > > struct vfio_device_state_ctl {
> > > __u32 version; /* ro */
> > > __u32 device_state; /* VFIO device state, wo */
> > > __u32 caps; /* ro */
> > > struct {
> > > __u32 action; /* wo, GET_BUFFER or SET_BUFFER */
> > > __u64 size; /* rw */
> > > } device_config;
> > > struct {
> > > __u32 action; /* wo, GET_BUFFER or SET_BUFFER */
> > > __u64 size; /* rw */
> > > __u64 pos; /* the offset in total buffer of device memory */
> > > } device_memory;
> > > struct {
> > > __u64 start_addr; /* wo */
> > > __u64 page_nr; /* wo */
> > > } system_memory;
> > > };
> > >
> > > Device States
> > > -------------
> > > After migration is initialized, qemu sets the device state by writing to the
> > > device_state field of control region.
> > >
> > > Four states are defined for a VFIO device:
> > > RUNNING, RUNNING & LOGGING, STOP & LOGGING, STOP
> > >
> > > RUNNING: In this state, a VFIO device is in active state ready to receive
> > > commands from device driver.
> > > It is the default state that a VFIO device enters initially.
> > >
> > > STOP: In this state, a VFIO device is deactivated and does not interact
> > > with device driver.
> > >
> > > LOGGING: a special state that CANNOT exist independently. It must be
> > > set alongside state RUNNING or STOP (i.e. RUNNING & LOGGING,
> > > STOP & LOGGING).
> > > Qemu will set LOGGING state on in .save_setup callbacks, then vendor
> > > driver can start dirty data logging for device memory and system
> > > memory.
> > > LOGGING only impacts device/system memory. They return whole
> > > snapshot outside LOGGING and dirty data since last get operation
> > > inside LOGGING.
> > > Device config should be always accessible and return whole config
> > > snapshot regardless of LOGGING state.
> > >
> > > Note:
> > > The reason why RUNNING is the default state is that device's active state
> > > must not depend on device state interface.
> > > It is possible that region vfio_device_state_ctl fails to get registered.
> > > In that condition, a device needs to be in active state by default.
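
To make the four states concrete, here is a small sketch that assumes LOGGING
is combined with RUNNING or STOP by bitwise OR of the values defined in the
interface above (the cover letter does not state the encoding explicitly).
device_state_name and device_state_is_logging are hypothetical helpers.

```c
#include <stddef.h>
#include <stdint.h>

/* Values as listed in the kernel device state interface section. */
#define VFIO_DEVICE_STATE_RUNNING 0
#define VFIO_DEVICE_STATE_STOP    1
#define VFIO_DEVICE_STATE_LOGGING 2

/* Hypothetical helper: map a device_state value to one of the four
 * defined states, or NULL if the value is not a valid combination.
 * Assumes LOGGING is OR-ed onto RUNNING or STOP. */
static const char *device_state_name(uint32_t state)
{
    switch (state) {
    case VFIO_DEVICE_STATE_RUNNING:
        return "RUNNING";
    case VFIO_DEVICE_STATE_STOP:
        return "STOP";
    case VFIO_DEVICE_STATE_RUNNING | VFIO_DEVICE_STATE_LOGGING:
        return "RUNNING & LOGGING";
    case VFIO_DEVICE_STATE_STOP | VFIO_DEVICE_STATE_LOGGING:
        return "STOP & LOGGING";
    default:
        return NULL;
    }
}

/* Hypothetical helper: non-zero when dirty data logging is switched on. */
static int device_state_is_logging(uint32_t state)
{
    return (state & VFIO_DEVICE_STATE_LOGGING) != 0;
}
```

Because RUNNING is 0, the value 2 reads as RUNNING & LOGGING under this
encoding, which fits the rule that LOGGING never exists on its own.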
> > >
> > > Get Version & Get Caps
> > > ----------------------
> > > On migration init phase, qemu will probe the existence of device state
> > > regions of vendor driver, then get version of the device state interface
> > > from the r/w control region.
> > >
> > > Then it will probe VFIO device's data capability by reading caps field of
> > > control region.
> > > #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> > > #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> > > If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on, it will save/load data of
> > > device memory in pre-copy and stop-and-copy phase. The data of
> > > device memory is held in device memory region.
> > > If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is on, it will query dirty pages
> > > produced by VFIO device during pre-copy and stop-and-copy phase.
> > > The dirty bitmap of system memory is held in dirty bitmap region.
> > >
> > > If it fails to find the two mandatory regions, or the optional data regions
> > > corresponding to data caps, or if the version mismatches, it will set up a
> > > migration blocker and disable live migration for the VFIO device.
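
The probing rule above could be sketched roughly as follows. This is an
illustration, not the code in hw/vfio/migration.c: struct probe_result and
migration_supported are hypothetical, and the sketch treats caps as a bitmask.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values as listed in the kernel device state interface section. */
#define VFIO_DEVICE_STATE_INTERFACE_VERSION 1
#define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY  1
#define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY  2

/* Hypothetical summary of what probing the vendor driver found. */
struct probe_result {
    bool has_ctl_region;      /* control region, mandatory */
    bool has_config_region;   /* device config region, mandatory */
    bool has_memory_region;   /* device memory region, optional */
    bool has_bitmap_region;   /* dirty bitmap region, optional */
    uint32_t version;         /* version field of the control region */
    uint32_t caps;            /* caps field of the control region */
};

/* Returns true if live migration can be enabled; false means a migration
 * blocker should be set up instead. */
static bool migration_supported(const struct probe_result *p)
{
    if (!p->has_ctl_region || !p->has_config_region) {
        return false;
    }
    if (p->version != VFIO_DEVICE_STATE_INTERFACE_VERSION) {
        return false;
    }
    /* Each advertised data cap needs its matching data region. */
    if ((p->caps & VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY) && !p->has_memory_region) {
        return false;
    }
    if ((p->caps & VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY) && !p->has_bitmap_region) {
        return false;
    }
    return true;
}
```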
> > >
> > >
> > > Flows to call device state interface for VFIO live migration
> > > ------------------------------------------------------------
> > >
> > > Live migration save path:
> > >
> > > (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> > >
> > > MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
> > > |
> > > MIGRATION_STATUS_SAVE_SETUP
> > > |
> > > .save_setup callback -->
> > > get device memory size (whole snapshot size)
> > > get device memory buffer (whole snapshot data)
> > > set device state --> VFIO_DEVICE_STATE_RUNNING & VFIO_DEVICE_STATE_LOGGING
> > > |
> > > MIGRATION_STATUS_ACTIVE
> > > |
> > > .save_live_pending callback --> get device memory size (dirty data)
> > > .save_live_iteration callback --> get device memory buffer (dirty data)
> > > .log_sync callback --> get system memory dirty bitmap
> > > |
> > > (vcpu stops) --> set device state -->
> > > VFIO_DEVICE_STATE_STOP & VFIO_DEVICE_STATE_LOGGING
> > > |
> > > .save_live_complete_precopy callback -->
> > > get device memory size (dirty data)
> > > get device memory buffer (dirty data)
> > > get device config size (whole snapshot size)
> > > get device config buffer (whole snapshot data)
> > > |
> > > .save_cleanup callback --> set device state --> VFIO_DEVICE_STATE_STOP
> > > MIGRATION_STATUS_COMPLETED
> > >
> > > MIGRATION_STATUS_CANCELLED or
> > > MIGRATION_STATUS_FAILED
> > > |
> > > (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
> > >
> > >
> > > Live migration load path:
> > >
> > > (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> > >
> > > MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
> > > |
> > > (vcpu stops) --> set device state --> VFIO_DEVICE_STATE_STOP
> > > |
> > > MIGRATION_STATUS_ACTIVE
> > > |
> > > .load_state callback -->
> > > set device memory size, set device memory buffer, set device config size,
> > > set device config buffer
> > > |
> > > (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
> > > |
> > > MIGRATION_STATUS_COMPLETED
> > >
> > >
> > >
> > > In source VM side,
> > > In precopy phase,
> > > if a device has VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY on,
> > > qemu will first get whole snapshot of device memory in .save_setup
> > > callback, and then it will get total size of dirty data in device memory in
> > > .save_live_pending callback by reading device_memory.size field of control
> > > region.
> > > Then in .save_live_iteration callback, it will get buffer of device memory's
> > > dirty data chunk by chunk from device memory region by writing pos &
> > > action (GET_BUFFER) to device_memory.pos & device_memory.action fields of
> > > control region. (size of each chunk is the size of device memory data
> > > region).
> > > .save_live_pending and .save_live_iteration may be called several times in
> > > precopy phase to get dirty data in device memory.
> > >
> > > If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is off, callbacks in precopy phase
> > > like .save_setup, .save_live_pending, .save_live_iteration will not call
> > > vendor driver's device state interface to get data from device memory.
> > >
> > > In precopy phase, if a device has VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY on,
> > > .log_sync callback will get system memory dirty bitmap from dirty bitmap
> > > region by writing system memory's start address, page count and action
> > > (GET_BITMAP) to "system_memory.start_addr", "system_memory.page_nr", and
> > > "system_memory.action" fields of control region.
> > > If page count passed in .log_sync callback is larger than the bitmap size
> > > the dirty bitmap region supports, Qemu will cut it into chunks and call
> > > vendor driver's get system memory dirty bitmap interface.
> > > If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is off, .log_sync callback just
> > > returns without calling the vendor driver.
> > >
> > > In stop-and-copy phase, device state will be set to STOP & LOGGING first.
> > > In .save_live_complete_precopy callback,
> > > if VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on,
> > > get device memory size and get device memory buffer will be called again.
> > > After that,
> > > device config data is fetched from device config region by reading
> > > device_config.size of control region and writing action (GET_BUFFER) to
> > > device_config.action of control region.
> > > Then after migration completes, in cleanup handler, LOGGING state will be
> > > cleared (i.e. device state is set to STOP).
> > > Clearing LOGGING state in cleanup handler is in consideration of the case
> > > of "migration failed" and "migration cancelled". They can also leverage
> > > the cleanup handler to unset LOGGING state.
> > >
> > >
> > > References
> > > ----------
> > > 1. kernel side implementation of Device state interfaces:
> > > https://patchwork.freedesktop.org/series/56876/
> > >
> > >
> > > Yan Zhao (5):
> > > vfio/migration: define kernel interfaces
> > > vfio/migration: support device of device config capability
> > > vfio/migration: tracking of dirty page in system memory
> > > vfio/migration: turn on migration
> > > vfio/migration: support device memory capability
> > >
> > > hw/vfio/Makefile.objs | 2 +-
> > > hw/vfio/common.c | 26 ++
> > > hw/vfio/migration.c | 858 ++++++++++++++++++++++++++++++++++++++++++
> > > hw/vfio/pci.c | 10 +-
> > > hw/vfio/pci.h | 26 +-
> > > include/hw/vfio/vfio-common.h | 1 +
> > > linux-headers/linux/vfio.h | 260 +++++++++++++
> > > 7 files changed, 1174 insertions(+), 9 deletions(-)
> > > create mode 100644 hw/vfio/migration.c
> > >
> > > --
> > > 2.7.4
> > >
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > _______________________________________________
> > intel-gvt-dev mailing list
> > intel-gvt-dev@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK