From mboxrd@z Thu Jan 1 00:00:00 1970
From: Yang Zhang
Subject: Re: [Qemu-devel] live migration vs device assignment (motivation)
Date: Thu, 10 Dec 2015 19:28:01 +0800
Message-ID: <566961C1.6030000@gmail.com>
References: <1448372127-28115-1-git-send-email-tianyu.lan@intel.com> <20151207165039.GA20210@redhat.com> <56685631.50700@intel.com> <20151210101840.GA2570@work-vm>
Mime-Version: 1.0
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
Cc: "Michael S. Tsirkin", qemu-devel@nongnu.org, emil.s.tantilov@intel.com, kvm@vger.kernel.org, ard.biesheuvel@linaro.org, aik@ozlabs.ru, donald.c.skidmore@intel.com, quintela@redhat.com, eddie.dong@intel.com, nrupal.jani@intel.com, agraf@suse.de, blauwirbel@gmail.com, cornelia.huck@de.ibm.com, alex.williamson@redhat.com, kraxel@redhat.com, anthony@codemonkey.ws, amit.shah@redhat.com, pbonzini@redhat.com, mark.d.rustad@intel.com, lcapitulino@redhat.com, gerlitz.or@gmail.com
To: "Dr. David Alan Gilbert", "Lan, Tianyu"
Return-path: 
Received: from mail-pa0-f46.google.com ([209.85.220.46]:33154 "EHLO mail-pa0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750762AbbLJL2S (ORCPT ); Thu, 10 Dec 2015 06:28:18 -0500
Received: by pabur14 with SMTP id ur14so46502815pab.0 for ; Thu, 10 Dec 2015 03:28:18 -0800 (PST)
In-Reply-To: <20151210101840.GA2570@work-vm>
Sender: kvm-owner@vger.kernel.org
List-ID: 

On 2015/12/10 18:18, Dr. David Alan Gilbert wrote:
> * Lan, Tianyu (tianyu.lan@intel.com) wrote:
>> On 12/8/2015 12:50 AM, Michael S. Tsirkin wrote:
>>> I thought about what this is doing at the high level, and I do see some
>>> value in what you are trying to do, but I also think we need to clarify
>>> the motivation a bit more. What you are saying is not really what the
>>> patches are doing.
>>>
>>> And with that clearer understanding of the motivation in mind (assuming
>>> it actually captures a real need), I would also like to suggest some
>>> changes.
>>
>> Motivation:
>> Most current solutions for migration with a passthrough device are
>> based on PCI hotplug, but that has side effects and doesn't work for
>> all devices.
>>
>> For NIC devices:
>> The PCI hotplug solution can work around network device migration
>> by switching between the VF and the PF.
>>
>> But switching the network interface introduces service downtime.
>>
>> I tested the service downtime by putting the VF and PV interfaces
>> into a bonded interface and pinging the bonded interface while
>> plugging and unplugging the VF:
>> 1) About 100ms when adding the VF
>> 2) About 30ms when deleting the VF
>>
>> It also requires the guest to do the switch configuration. These steps
>> are hard to manage and deploy for our customers. To maintain PV
>> performance during migration, the host side also needs to assign a VF
>> to the PV device, which affects scalability.
>>
>> These factors block SR-IOV NIC passthrough usage in cloud services and
>> OPNFV, which demand high network performance and stability.
>
> Right, I'll agree that it's hard to do migration of a VM which uses
> an SR-IOV device; and while I think it should be possible to bond a
> virtio device to a VF for networking and then hotplug the SR-IOV device,
> I agree it's hard to manage.
>
>> For other kinds of devices, this approach is hard to make work.
>> We are also adding migration support for the QAT (QuickAssist
>> Technology) device.
>>
>> QAT device use case introduction:
>> Server, networking, big data, and storage applications use QuickAssist
>> Technology to offload servers from handling compute-intensive
>> operations, such as:
>> 1) Symmetric cryptography functions, including cipher operations and
>> authentication operations
>> 2) Public key functions, including RSA, Diffie-Hellman, and elliptic
>> curve cryptography
>> 3) Compression and decompression functions, including DEFLATE and LZS
>>
>> PCI hotplug will not work for such devices during migration: in-flight
>> operations fail when the device is unplugged.
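The downtime figures above can be reproduced from plain ping output: probe the bonded interface at a fixed interval while the VF is plugged or unplugged, then count the gap in icmp_seq numbers. A minimal sketch of that calculation (a hypothetical helper, not part of the patch series):

```python
# Estimate service downtime from the icmp_seq numbers of the ping
# replies that came back. Each missing sequence number corresponds
# to one lost probe interval.

def downtime_ms(received_seqs, interval_ms=10):
    """Given the sorted icmp_seq numbers of received replies and the
    ping interval in ms, return the estimated downtime in ms."""
    lost = 0
    for prev, cur in zip(received_seqs, received_seqs[1:]):
        lost += cur - prev - 1  # probes that never got a reply
    return lost * interval_ms

# Example: with a 10ms ping interval, seqs 5..7 lost -> ~30ms outage,
# comparable to the VF-removal figure quoted above.
print(downtime_ms([1, 2, 3, 4, 8, 9], interval_ms=10))  # -> 30
```

This only bounds the outage to the probe granularity, of course; a shorter ping interval gives a tighter estimate.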
>
> I don't understand the QAT argument; if the device is purely an offload
> engine for performance, then why can't you fall back to doing the
> same operations in the VM or in QEMU if the card is unavailable?
> The tricky bit is dealing with outstanding operations.
>
>> So we are trying to implement a new solution which really migrates the
>> device state to the target machine and won't affect the user during
>> migration, with low service downtime.
>
> Right, that's a good aim - the only question is how to do it.
>
> It looks like this is always going to need some device-specific code;
> the question I see is whether that's in:
>    1) qemu
>    2) the host kernel
>    3) the guest kernel driver
>
> The objections to this series seem to be that it needs changes to (3);
> I can see the worry that the guest kernel driver might not get a chance
> to run during the right time in migration and it's painful having to
> change every guest driver (although your change is small).
>
> My question is what stage of the migration process do you expect to tell
> the guest kernel driver to do this?
>
> If you do it at the start of the migration, and quiesce the device,
> the migration might take a long time (say 30 minutes) - are you
> intending the device to be quiesced for this long? And where are
> you going to send the traffic?
> If you are, then do you need to do it via this PCI trick, or could
> you just do it via something higher level to quiesce the device?
>
> Or are you intending to do it just near the end of the migration?
> But then how do we know how long it will take the guest driver to
> respond?

Ideally, we would be able to leave the guest driver unmodified, but that
requires the hypervisor or QEMU to be aware of the device, which means we
may need a driver in the hypervisor or QEMU that handles the device on
behalf of the guest driver.
>
> It would be great if we could avoid changing the guest; but at least
> your guest driver changes don't actually seem to be that hardware
> specific; could your changes actually be moved to the generic PCI level
> so they could be made to work for lots of drivers?

It is impossible to use one common solution for all devices unless the
PCIe spec documents it clearly, and I think one day it will get there.
But before that, we need some workarounds in the guest driver to make it
work, even if it looks ugly.

-- 
best regards
yang