qemu-devel.nongnu.org archive mirror
From: Jason Wang <jasowang@redhat.com>
To: Stefan Hajnoczi <stefanha@redhat.com>
Cc: "Elena Ufimtseva" <elena.ufimtseva@oracle.com>,
	"John G Johnson" <john.g.johnson@oracle.com>,
	"mst@redhat.com" <mtsirkin@redhat.com>,
	"Janosch Frank" <frankja@linux.vnet.ibm.com>,
	"Stefan Hajnoczi" <stefanha@gmail.com>,
	qemu-devel <qemu-devel@nongnu.org>,
	"Kirti Wankhede" <kwankhede@nvidia.com>,
	"Gerd Hoffmann" <kraxel@redhat.com>,
	"Yan Vugenfirer" <yan@daynix.com>,
	"Jag Raman" <jag.raman@oracle.com>,
	"Eugenio Pérez" <eperezma@redhat.com>,
	"Anup Patel" <anup@brainfault.org>,
	"Claudio Imbrenda" <imbrenda@linux.vnet.ibm.com>,
	"Christian Borntraeger" <borntraeger@de.ibm.com>,
	"Roman Kagan" <rkagan@virtuozzo.com>,
	"Felipe Franciosi" <felipe@nutanix.com>,
	"Marc-André Lureau" <marcandre.lureau@redhat.com>,
	"Jens Freimann" <jfreimann@redhat.com>,
	"Philippe Mathieu-Daudé" <philmd@redhat.com>,
	"Stefano Garzarella" <sgarzare@redhat.com>,
	"Eduardo Habkost" <ehabkost@redhat.com>,
	"Sergio Lopez" <slp@redhat.com>,
	"Kashyap Chamarthy" <kchamart@redhat.com>,
	"Darren Kenny" <darren.kenny@oracle.com>,
	"Alex Williamson" <alex.williamson@redhat.com>,
	"Liran Alon" <liran.alon@oracle.com>,
	"Thanos Makatos" <thanos.makatos@nutanix.com>,
	"Alex Bennée" <alex.bennee@linaro.org>,
	"David Gibson" <david@gibson.dropbear.id.au>,
	"Kevin Wolf" <kwolf@redhat.com>,
	"Halil Pasic" <pasic@linux.vnet.ibm.com>,
	"Daniel P. Berrange" <berrange@redhat.com>,
	"Christophe de Dinechin" <dinechin@redhat.com>,
	"Paolo Bonzini" <pbonzini@redhat.com>, fam <fam@euphon.net>
Subject: Re: Out-of-Process Device Emulation session at KVM Forum 2020
Date: Tue, 3 Nov 2020 15:52:50 +0800
Message-ID: <c007455d-b9fc-32d5-a58c-fd8d17794996@redhat.com>
In-Reply-To: <20201102101308.GA42093@stefanha-x1.localdomain>


On 2020/11/2 6:13 PM, Stefan Hajnoczi wrote:
> On Mon, Nov 02, 2020 at 10:51:18AM +0800, Jason Wang wrote:
>> On 2020/10/30 9:15 PM, Stefan Hajnoczi wrote:
>>> On Fri, Oct 30, 2020 at 12:08 PM Jason Wang <jasowang@redhat.com> wrote:
>>>> On 2020/10/30 7:13 PM, Stefan Hajnoczi wrote:
>>>>> On Fri, Oct 30, 2020 at 9:46 AM Jason Wang <jasowang@redhat.com> wrote:
>>>>>> On 2020/10/30 2:21 PM, Stefan Hajnoczi wrote:
>>>>>>> On Fri, Oct 30, 2020 at 3:04 AM Alex Williamson
>>>>>>> <alex.williamson@redhat.com> wrote:
>>>>>>>> It's great to revisit ideas, but proclaiming a uAPI is bad solely
>>>>>>>> because the data transfer is opaque, without defining why that's bad,
>>>>>>>> evaluating the feasibility and implementation of defining a well
>>>>>>>> specified data format rather than protocol, including cross-vendor
>>>>>>>> support, or proposing any sort of alternative is not so helpful imo.
>>>>>>> The migration approaches in VFIO and vDPA/vhost were designed for
>>>>>>> different requirements and I think this is why there are different
>>>>>>> perspectives on this. Here is a comparison and how VFIO could be
>>>>>>> extended in the future. I see 3 levels of device state compatibility:
>>>>>>>
>>>>>>> 1. The device cannot save/load state blobs, instead userspace fetches
>>>>>>> and restores specific values of the device's runtime state (e.g. last
>>>>>>> processed ring index). This is the vhost approach.
>>>>>>>
>>>>>>> 2. The device can save/load state in a standard format. This is
>>>>>>> similar to #1 except that there is a single read/write blob interface
>>>>>>> instead of fine-grained get_FOO()/set_FOO() interfaces. This approach
>>>>>>> pushes the migration state parsing into the device so that userspace
>>>>>>> doesn't need knowledge of every device type. With this approach it is
>>>>>>> possible for a device from vendor A to migrate to a device from vendor
>>>>>>> B, as long as they both implement the same standard migration format.
>>>>>>> The limitation of this approach is that vendor-specific state cannot
>>>>>>> be transferred.
>>>>>>>
>>>>>>> 3. The device can save/load opaque blobs. This is the initial VFIO
>>>>>>> approach.
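The three levels listed above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the class names, the "RING" magic, the byte layouts and the migrate() helper are hypothetical and are not taken from vhost, vDPA or VFIO.

```python
import struct

# Level 1: fine-grained accessors. Userspace must know every field
# (here a single hypothetical ring index) to migrate the device.
class RingDevice:
    def __init__(self):
        self.last_avail_idx = 0

    def get_last_avail_idx(self):
        return self.last_avail_idx

    def set_last_avail_idx(self, value):
        self.last_avail_idx = value


# Level 2: a single save/load blob interface whose layout is a shared
# standard. Hypothetical layout: 4-byte magic + 16-bit ring index.
STANDARD_FORMAT = struct.Struct("<4sH")
MAGIC = b"RING"

class StandardBlobDevice(RingDevice):
    def save(self):
        return STANDARD_FORMAT.pack(MAGIC, self.last_avail_idx)

    def load(self, blob):
        if len(blob) != STANDARD_FORMAT.size:
            raise ValueError("not a standard-format blob")
        magic, idx = STANDARD_FORMAT.unpack(blob)
        if magic != MAGIC:
            raise ValueError("not a standard-format blob")
        self.last_avail_idx = idx


# Level 3: an opaque blob. Only the same vendor's implementation knows
# the layout (here a made-up "A" tag + big-endian index).
class VendorADevice(RingDevice):
    def save(self):
        return b"A" + self.last_avail_idx.to_bytes(2, "big")

    def load(self, blob):
        if len(blob) != 3 or blob[:1] != b"A":
            raise ValueError("not a vendor A blob")
        self.last_avail_idx = int.from_bytes(blob[1:3], "big")


def migrate(src, dst):
    """VMM-side transport for levels 2 and 3: moves bytes, parses nothing."""
    dst.load(src.save())
```

In this sketch migrate() works for any pair of devices that agree on a format and fails otherwise, which is the point of disagreement in the thread: at level 3 only the device, not the VMM, can tell whether a blob is meaningful.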
>>>>>> I still don't get why it must be opaque.
>>>>> If the device state format needs to be in the VMM then each device
>>>>> needs explicit enablement in each VMM (QEMU, cloud-hypervisor, etc).
>>>>>
>>>>> Let's invert the question: why does the VMM need to understand the
>>>>> device state of a _passthrough_ device?
>>>> For better manageability, compatibility and debuggability. If we depend
>>>> on an opaque structure, do we encourage each device to implement its own
>>>> migration protocol? That would be very challenging.
>>>>
>>>> For VFIO in the kernel, I suspect that a uAPI that may result in opaque
>>>> data being read or written from the guest violates the Linux uAPI
>>>> principle. It will be very hard, or even impossible, to maintain the
>>>> uABI. It looks to me like VFIO is the first subsystem trying to do this.
>>> I think our concepts of uAPI are different. The uAPI of read(2) and
>>> write(2) does not define the structure of the data buffers. VFIO
>>> device regions are exactly the same, the structure of the data is not
>>> defined by the kernel uAPI.
>>
>> I think we're talking about different things. It's not about the data
>> structure, it's about whether the data that is read from the kernel can
>> be understood by userspace.
>>
>>
>>> Maybe microcode and firmware loading is an example we agree on?
>>
>> I think not. They are bytecode that
>>
>> 1) has strict ABI definitions
>> 2) is understood by userspace
> No, they can be proprietary formats that neither the Linux kernel nor
> userspace can parse. For example, look at linux-firmware
> (https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/about/):
> it's just a collection of binary blobs. The format is not necessarily
> public. The only restriction on that repo is that the binary blob must
> be redistributable and users must be allowed to run them (i.e.
> proprietary licenses can be used).


I think not. Obviously each firmware should have its own ABI, no matter
whether it's public or proprietary. Proprietary firmware should be
understood by its proprietary userspace counterpart.


>
> Or look at other passthrough device interfaces like /dev/i2c or libusb.
> They expose data to userspace without requiring a defined format. It's
> the same as VFIO.


Again, there should be an ABI there (defined by either the device or a
spec), no matter whether or not it's a transport layer. And there will be
an endpoint in userspace that knows the whole format.


>
> In addition, look at kernel uAPIs where userspace acts simply as a data
> transport for opaque data (e.g. where a userspace helper facilitates
> communication but has no visibility of the data). I imagine that memory
> encryption relies on this because the host kernel and userspace do not
> have access to encrypted memory or associated state - but they need to
> help migrate them to other hosts.


Which uAPI do you mean here?


>
> I hope these examples show that such APIs don't pose a problem for the
> Linux uAPI and are already in use. VFIO device state isn't doing
> anything new here.


I feel that you tried to explain "why it can be" but not "why it must
be". Trying to find one or two subsystems that have an opaque uAPI without
an ABI (though I suspect there will be one) may not be convincing here.

Thanks


>
>>>>>>>      A device from vendor A cannot migrate to a device from
>>>>>>> vendor B because the format is incompatible. This approach works well
>>>>>>> when devices have unique guest-visible hardware interfaces so the
>>>>>>> guest wouldn't be able to handle migrating a device from vendor A to a
>>>>>>> device from vendor B anyway.
>>>>>> For VFIO I guess cross-vendor live migration can't succeed unless we
>>>>>> cheat with the device/vendor ID.
>>>>> Yes. I haven't looked into the details of PCI (Sub-)Device/Vendor IDs
>>>>> and how to best enable migration but I hope that can be solved. The
>>>>> simplest approach is to override the IDs and make them part of the
>>>>> guest configuration.
>>>> That would be very tricky (or require a whitelist). E.g. the opaque data
>>>> of the source may match the opaque data of the destination by chance.
>>> Luckily identifying things based on magic constants has been solved
>>> many times in the past.
>>>
>>> A central identifier registry prevents all collisions but is a pain to
>>> manage. Or use a 128-bit UUID and self-allocate the identifier with an
>>> extremely low chance of collision:
>>> https://en.wikipedia.org/wiki/Universally_unique_identifier#Collisions
>>
>> I may be missing something. I think we're talking about cross-vendor
>> live migration.
>>
>> Would you want the source and destination to have the same UUID or not?
>>
>> If they have different UUIDs, how could we know that we can live migrate
>> between them?
>>
>> If they have the same UUID, what is the rule that forces the vendors to
>> choose the same UUID (a spec)?
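One possible reading of the identifier idea, sketched only to make the question above concrete: the state blob carries a 128-bit UUID naming its format, and the destination refuses any blob whose format it does not list as supported. The UUID values, the function names and the 16-byte prefix layout are all assumptions; the thread does not define any of them.

```python
import uuid

# Hypothetical, self-allocated format identifiers (one per state format,
# not one per device). The values are placeholders for illustration.
VENDOR_A_FORMAT = uuid.UUID("11111111-2222-4333-8444-555555555555")
SHARED_SPEC_FORMAT = uuid.UUID("aaaaaaaa-bbbb-4ccc-8ddd-eeeeeeeeeeee")


def wrap_state(format_id, payload):
    """Source side: prefix the device state with its 16-byte format UUID."""
    return format_id.bytes + payload


def unwrap_state(supported_formats, blob):
    """Destination side: accept the payload only if the format is known."""
    if len(blob) < 16:
        raise ValueError("blob too short to carry a format UUID")
    format_id = uuid.UUID(bytes=blob[:16])
    if format_id not in supported_formats:
        raise ValueError(f"unsupported state format {format_id}")
    return format_id, blob[16:]
```

Under these assumptions, two devices can migrate between each other only if the destination lists the source's format UUID, so cross-vendor migration still needs both vendors to agree on a format such as SHARED_SPEC_FORMAT; the UUID only prevents a blob from being misinterpreted by chance.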
> I will send a separate email that describes how VFIO live migration can
> work in more detail. I think it's possible to do it with the existing
> ioctl interface that Kirti has proposed and still prevent the risk of
> incorrectly interpreting data that you have pointed out.
>
> The document that I'm sending will allow us to discuss in more detail
> and make the approach clearer.
>
> Stefan


