From: Kirti Wankhede <kwankhede@nvidia.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: cjia@nvidia.com, kevin.tian@intel.com, ziye.yang@intel.com,
changpeng.liu@intel.com, yi.l.liu@intel.com, mlevitsk@redhat.com,
eskultet@redhat.com, cohuck@redhat.com, dgilbert@redhat.com,
jonathan.davies@nutanix.com, eauger@redhat.com, aik@ozlabs.ru,
pasic@linux.ibm.com, felipe@nutanix.com,
Zhengxiao.zx@Alibaba-inc.com, shuangtai.tst@alibaba-inc.com,
Ken.Xue@amd.com, zhi.a.wang@intel.com, yan.y.zhao@intel.com,
yulei.zhang@intel.com, qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [PATCH v3 3/5] Add migration functions for VFIO devices
Date: Wed, 27 Feb 2019 01:46:54 +0530
Message-ID: <dada4d97-a567-2238-e15e-aa007a139169@nvidia.com>
In-Reply-To: <20190221153822.34a1dcc4@w520.home>
On 2/22/2019 4:08 AM, Alex Williamson wrote:
> On Wed, 20 Feb 2019 02:53:18 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>
>> - Migration functions are implemented for VFIO_DEVICE_TYPE_PCI devices.
>> - Added SaveVMHandlers and implemented all basic functions required for live
>> migration.
>> - Added a VM state change handler to track the running or stopped state of the VM.
>> - Added a migration state change notifier to get notification on migration state
>> change. This state is translated to the VFIO device state and conveyed to the
>> vendor driver.
>> - Whether a VFIO device supports migration is decided based on the migration
>> region query. If the migration region query is successful then migration is
>> supported, else migration is blocked.
>> - Structure vfio_device_migration_info is mapped at the 0th offset of the
>> migration region and should always be trapped by the VFIO device's driver. Both
>> types of access, trapped and mmapped, are supported for the data section of
>> the region.
>> - To save device state, read pending_bytes and data_offset using structure
>> vfio_device_migration_info, then copy data from the region accordingly.
>> - To restore device state, write data_offset and data_size in the structure
>> and write data in the region.
>> - To get the dirty page bitmap, write the start address and pfn count, then read
>> the count of pfns copied and read those from the rest of the region or the
>> mmapped part of the region accordingly. This copy is iterated until the page
>> bitmap for all requested pfns is copied.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>> hw/vfio/Makefile.objs | 2 +-
>> hw/vfio/migration.c | 714 ++++++++++++++++++++++++++++++++++++++++++
>> include/hw/vfio/vfio-common.h | 20 ++
>> 3 files changed, 735 insertions(+), 1 deletion(-)
>> create mode 100644 hw/vfio/migration.c
>>
>> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
>> index abad8b818c9b..36033d1437c5 100644
>> --- a/hw/vfio/Makefile.objs
>> +++ b/hw/vfio/Makefile.objs
>> @@ -1,4 +1,4 @@
>> -obj-y += common.o spapr.o
>> +obj-y += common.o spapr.o migration.o
>> obj-$(CONFIG_VFIO_PCI) += pci.o pci-quirks.o display.o
>> obj-$(CONFIG_VFIO_CCW) += ccw.o
>> obj-$(CONFIG_VFIO_PLATFORM) += platform.o
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> new file mode 100644
>> index 000000000000..d7b6d972c043
>> --- /dev/null
>> +++ b/hw/vfio/migration.c
>> @@ -0,0 +1,714 @@
>> +/*
>> + * Migration support for VFIO devices
>> + *
>> + * Copyright NVIDIA, Inc. 2018
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2. See
>> + * the COPYING file in the top-level directory.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include <linux/vfio.h>
>> +
>> +#include "hw/vfio/vfio-common.h"
>> +#include "cpu.h"
>> +#include "migration/migration.h"
>> +#include "migration/qemu-file.h"
>> +#include "migration/register.h"
>> +#include "migration/blocker.h"
>> +#include "migration/misc.h"
>> +#include "qapi/error.h"
>> +#include "exec/ramlist.h"
>> +#include "exec/ram_addr.h"
>> +#include "pci.h"
>> +
>> +/*
>> + * Flags used as delimiter:
>> + * 0xffffffff => MSB 32-bit all 1s
>> + * 0xef10 => emulated (virtual) function IO
>
> :^)
>
>> + * 0x0000 => 16-bits reserved for flags
>> + */
>> +#define VFIO_MIG_FLAG_END_OF_STATE (0xffffffffef100001ULL)
>> +#define VFIO_MIG_FLAG_DEV_CONFIG_STATE (0xffffffffef100002ULL)
>> +#define VFIO_MIG_FLAG_DEV_SETUP_STATE (0xffffffffef100003ULL)
>> +#define VFIO_MIG_FLAG_DEV_DATA_STATE (0xffffffffef100004ULL)
>> +
>> +static void vfio_migration_region_exit(VFIODevice *vbasedev)
>> +{
>> + VFIOMigration *migration = vbasedev->migration;
>> +
>> + if (!migration) {
>> + return;
>> + }
>> +
>> + if (migration->region.buffer.size) {
>> + vfio_region_exit(&migration->region.buffer);
>> + vfio_region_finalize(&migration->region.buffer);
>> + }
>> +}
>> +
>> +static int vfio_migration_region_init(VFIODevice *vbasedev)
>> +{
>> + VFIOMigration *migration = vbasedev->migration;
>> + Object *obj = NULL;
>> + int ret = -EINVAL;
>> +
>> + if (!migration) {
>> + return ret;
>> + }
>> +
>> + /* Migration support added for PCI device only */
>> + if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
>> + obj = vfio_pci_get_object(vbasedev);
>> + }
>> +
>> + if (!obj) {
>> + return ret;
>> + }
>> +
>> + ret = vfio_region_setup(obj, vbasedev, &migration->region.buffer,
>> + migration->region.index, "migration");
>> + if (ret) {
>> + error_report("Failed to setup VFIO migration region %d: %s",
>> + migration->region.index, strerror(-ret));
>> + goto err;
>> + }
>> +
>> + if (!migration->region.buffer.size) {
>> + ret = -EINVAL;
>> + error_report("Invalid region size of VFIO migration region %d: %s",
>> + migration->region.index, strerror(-ret));
>> + goto err;
>> + }
>> +
>> + if (migration->region.buffer.mmaps) {
>> + ret = vfio_region_mmap(&migration->region.buffer);
>> + if (ret) {
>> + error_report("Failed to mmap VFIO migration region %d: %s",
>> + migration->region.index, strerror(-ret));
>> + goto err;
>> + }
>
> In patch 5 this eventually gets called from our realize function, so
> we're forcing the kernel to allocate whatever backing store it might
> use for these mmaps for the entire life of the device, in case maybe we
> want to migrate it. As in my reply to Yan, unless I'm misunderstanding
> how drivers will back this mmap, this seems really inefficient.
>
Backing storage can be allocated on access, that is from the fault handler,
but it will then remain allocated for the entire life of the device.
The mmap can instead be moved to .save_setup and the unmap to .save_cleanup;
similarly, during resume, .load_setup and .load_cleanup can be used.
I'll update the patch in the next iteration, roughly along the lines of the
sketch below.
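For reference, a rough sketch of what I have in mind. The callback names
follow the SaveVMHandlers added in this series; it assumes a
vfio_region_unmap() counterpart to vfio_region_mmap() (or vfio_region_exit()
if a plain unmap helper isn't available), and error handling is abbreviated:

/*
 * Sketch only: map the migration region lazily from the save path
 * instead of at realize time, so the kernel only has to back these
 * mmaps while a migration is actually in progress.
 */
static int vfio_migration_region_mmap(VFIODevice *vbasedev)
{
    VFIOMigration *migration = vbasedev->migration;
    int ret = 0;

    if (migration->region.buffer.mmaps) {
        ret = vfio_region_mmap(&migration->region.buffer);
        if (ret) {
            error_report("Failed to mmap VFIO migration region %d: %s",
                         migration->region.index, strerror(-ret));
        }
    }
    return ret;
}

static int vfio_save_setup(QEMUFile *f, void *opaque)
{
    VFIODevice *vbasedev = opaque;

    /* ... existing setup ... */
    return vfio_migration_region_mmap(vbasedev);
}

static void vfio_save_cleanup(void *opaque)
{
    VFIODevice *vbasedev = opaque;
    VFIOMigration *migration = vbasedev->migration;

    if (migration->region.buffer.mmaps) {
        /* assumes an unmap helper exists; vfio_region_exit() otherwise */
        vfio_region_unmap(&migration->region.buffer);
    }
}

The .load_setup/.load_cleanup callbacks on the destination would mirror this.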
>> + }
>> +
>> + return 0;
>> +
>> +err:
>> + vfio_migration_region_exit(vbasedev);
>> + return ret;
>> +}
>> +
>> +static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t state)
>> +{
>> + VFIOMigration *migration = vbasedev->migration;
>> + VFIORegion *region = &migration->region.buffer;
>> + int ret = 0;
>> +
>> + ret = pwrite(vbasedev->fd, &state, sizeof(state),
>> + region->fd_offset + offsetof(struct vfio_device_migration_info,
>> + device_state));
>> + if (ret < 0) {
>> + error_report("Failed to set migration state %d %s",
>> + ret, strerror(errno));
>> + return ret;
>> + }
>> +
>> + vbasedev->device_state = state;
>> + return 0;
>> +}
>> +
>> +void vfio_get_dirty_page_list(VFIODevice *vbasedev,
>> + uint64_t start_pfn,
>> + uint64_t pfn_count,
>> + uint64_t page_size)
>> +{
>> + VFIOMigration *migration = vbasedev->migration;
>> + VFIORegion *region = &migration->region.buffer;
>> + uint64_t count = 0, copied_pfns = 0;
>> + int ret;
>> +
>> + ret = pwrite(vbasedev->fd, &start_pfn, sizeof(start_pfn),
>> + region->fd_offset + offsetof(struct vfio_device_migration_info,
>> + start_pfn));
>> + if (ret < 0) {
>> + error_report("Failed to set dirty pages start address %d %s",
>> + ret, strerror(errno));
>> + return;
>> + }
>> +
>> + ret = pwrite(vbasedev->fd, &page_size, sizeof(page_size),
>> + region->fd_offset + offsetof(struct vfio_device_migration_info,
>> + page_size));
>> + if (ret < 0) {
>> + error_report("Failed to set dirty page size %d %s",
>> + ret, strerror(errno));
>> + return;
>> + }
>> +
>> + ret = pwrite(vbasedev->fd, &pfn_count, sizeof(pfn_count),
>> + region->fd_offset + offsetof(struct vfio_device_migration_info,
>> + total_pfns));
>> + if (ret < 0) {
>> + error_report("Failed to set dirty page total pfns %d %s",
>> + ret, strerror(errno));
>> + return;
>> + }
>> +
>> + do {
>> + uint64_t bitmap_size;
>> + void *buf = NULL;
>> + bool buffer_mmaped = false;
>> +
>> + /* Read dirty_pfns.copied */
>> + ret = pread(vbasedev->fd, &copied_pfns, sizeof(copied_pfns),
>> + region->fd_offset + offsetof(struct vfio_device_migration_info,
>> + copied_pfns));
>> + if (ret < 0) {
>> + error_report("Failed to get dirty pages bitmap count %d %s",
>> + ret, strerror(errno));
>> + return;
>> + }
>> +
>> + if (copied_pfns == 0) {
>> + /*
>> + * copied_pfns could be 0 if driver doesn't have any page to report
>> + * dirty in given range
>> + */
>> + break;
>> + }
>> +
>> + bitmap_size = (BITS_TO_LONGS(copied_pfns) + 1) * sizeof(unsigned long);
>> +
>> + if (region->mmaps) {
>> + int i;
>> +
>> + for (i = 0; i < region->nr_mmaps; i++) {
>> + if (region->mmaps[i].size >= bitmap_size) {
>> + buf = region->mmaps[i].mmap;
>> + buffer_mmaped = true;
>> + break;
>> + }
>
> Wait, the first sparse mmap region that's big enough to hold the bitmap
> is necessarily the one where we read it? Shouldn't we test that the
> mmap range covers the correct offset? This also makes me realize that
> if a driver wanted the device memory mmap'd but not the dirty memory
> backing, I don't think they can do it with this interface. Perhaps
> there should be separate data and dirty offset fields in the header so
> that a driver can place them at separate offsets with different mmap
> coverage if they desire. Thanks,
>
Makes sense. I'll add a dirty_offset field in the header, along the lines of
the sketch below.
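Something roughly like this; the existing fields are paraphrased from the
current header in patch 1/5, so their order here may not match exactly, and
the new field's name and placement are tentative:

/*
 * Tentative layout: a separate dirty_offset lets a driver expose the
 * device data and the dirty bitmap at different offsets within the
 * migration region, with different mmap coverage for each if desired.
 */
struct vfio_device_migration_info {
        __u32 device_state;     /* VFIO device state */
        __u32 reserved;
        __u64 pending_bytes;
        __u64 data_offset;      /* offset of device data within the region */
        __u64 data_size;
        __u64 start_pfn;        /* dirty page tracking window */
        __u64 page_size;
        __u64 total_pfns;
        __u64 copied_pfns;
        __u64 dirty_offset;     /* new: offset of dirty bitmap within the region */
};

vfio_get_dirty_page_list() would then check that an mmapped range actually
covers dirty_offset plus the bitmap size before using it, and fall back to
pread() otherwise.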
Thanks,
Kirti
> Alex
>