From: Jason Wang <jasowang@redhat.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: "Elena Ufimtseva" <elena.ufimtseva@oracle.com>,
"Janosch Frank" <frankja@linux.vnet.ibm.com>,
"mst@redhat.com" <mtsirkin@redhat.com>,
"John G Johnson" <john.g.johnson@oracle.com>,
"Stefan Hajnoczi" <stefanha@gmail.com>,
qemu-devel <qemu-devel@nongnu.org>,
"Kirti Wankhede" <kwankhede@nvidia.com>,
"Gerd Hoffmann" <kraxel@redhat.com>,
"Yan Vugenfirer" <yan@daynix.com>,
"Jag Raman" <jag.raman@oracle.com>,
"Anup Patel" <anup@brainfault.org>,
"Claudio Imbrenda" <imbrenda@linux.vnet.ibm.com>,
"Christian Borntraeger" <borntraeger@de.ibm.com>,
"Roman Kagan" <rkagan@virtuozzo.com>,
"Felipe Franciosi" <felipe@nutanix.com>,
"Marc-André Lureau" <marcandre.lureau@redhat.com>,
"Jens Freimann" <jfreimann@redhat.com>,
"Philippe Mathieu-Daudé" <philmd@redhat.com>,
"Stefano Garzarella" <sgarzare@redhat.com>,
"Eduardo Habkost" <ehabkost@redhat.com>,
"Sergio Lopez" <slp@redhat.com>,
"Kashyap Chamarthy" <kchamart@redhat.com>,
"Darren Kenny" <darren.kenny@oracle.com>,
"Liran Alon" <liran.alon@oracle.com>,
"Stefan Hajnoczi" <stefanha@redhat.com>,
"Paolo Bonzini" <pbonzini@redhat.com>,
"Alex Bennée" <alex.bennee@linaro.org>,
"David Gibson" <david@gibson.dropbear.id.au>,
"Kevin Wolf" <kwolf@redhat.com>,
"Halil Pasic" <pasic@linux.vnet.ibm.com>,
"Daniel P. Berrange" <berrange@redhat.com>,
"Christophe de Dinechin" <dinechin@redhat.com>,
"Thanos Makatos" <thanos.makatos@nutanix.com>,
fam <fam@euphon.net>
Subject: Re: Out-of-Process Device Emulation session at KVM Forum 2020
Date: Fri, 30 Oct 2020 17:31:39 +0800
Message-ID: <0b098087-86aa-11d1-058d-db43d0f89db8@redhat.com>
In-Reply-To: <20201029210407.33d6f008@x1.home>
On 2020/10/30 11:04 AM, Alex Williamson wrote:
> On Fri, 30 Oct 2020 09:11:23 +0800
> Jason Wang <jasowang@redhat.com> wrote:
>
>> On 2020/10/29 11:46 PM, Alex Williamson wrote:
>>> On Thu, 29 Oct 2020 23:09:33 +0800
>>> Jason Wang <jasowang@redhat.com> wrote:
>>>
>>>> On 2020/10/29 10:31 PM, Alex Williamson wrote:
>>>>> On Thu, 29 Oct 2020 21:02:05 +0800
>>>>> Jason Wang <jasowang@redhat.com> wrote:
>>>>>
>>>>>> On 2020/10/29 8:08 PM, Stefan Hajnoczi wrote:
>>>>>>> Here are notes from the session:
>>>>>>>
>>>>>>> protocol stability:
>>>>>>> * vhost-user already exists for existing third-party applications
>>>>>>> * vfio-user is more general but will take more time to develop
>>>>>>> * libvfio-user can be provided to allow device implementations
>>>>>>>
>>>>>>> management:
>>>>>>> * Should QEMU launch device emulation processes?
>>>>>>> * Nicer user experience
>>>>>>> * Technical blockers: forking, hotplug, and security are hard once
>>>>>>> QEMU has started running
>>>>>>> * Probably requires a new process model with a long-running
>>>>>>> QEMU management process proxying QMP requests to the emulator process
>>>>>>>
>>>>>>> migration:
>>>>>>> * dbus-vmstate
>>>>>>> * VFIO live migration ioctls
>>>>>>> * Source device can continue if migration fails
>>>>>>> * Opaque blobs are transferred to destination, destination can
>>>>>>> fail migration if it decides the blobs are incompatible
>>>>>> I'm not sure this can work:
>>>>>>
>>>>>> 1) Reading something that is opaque to userspace is probably a hint of
>>>>>> bad uAPI design.
>>>>>> 2) Has QEMU ever tried to migrate opaque blobs before? It's probably a
>>>>>> bad design of the migration protocol as well.
>>>>>>
>>>>>> It looks to me that having a migration driver in QEMU that can clearly
>>>>>> define each byte in the migration stream is a better approach.
>>>>> Any time during the previous two years of development might have been a
>>>>> more appropriate time to express your doubts.
>>>> I did raise doubts in this series[1], but the main issue is still there.
>>> That series is related to a migration compatibility interface, not the
>>> migration data itself.
>>
>> They are not independent. The compatibility interface design depends on
>> the migration data design. I raised the uAPI issue in that thread but
>> never got a response.
>>
>>
>>>
>>>> Is it legal to have a uAPI that turns out to be opaque to userspace?
>>>> (VFIO seems to be the first.) If it's not, the only choice is to do
>>>> that in QEMU.
>>> So you're suggesting that any time the kernel passes through opaque
>>> data that gets interpreted by some entity elsewhere, potentially with
>>> proprietary code, we're in legal jeopardy? VFIO is certainly not
>>> the first to do that (storage and network devices come to mind).
>>> Devices are essentially opaque data themselves; vfio provides access to
>>> (ex.) BARs, but the interpretation of what resides in a BAR is device
>>> specific. Sometimes it's defined in a public datasheet, sometimes not.
>>> Suggesting that we can't move opaque data through a uAPI seems rather
>>> absurd.
>>
>> No, I think we are talking about different things. What I meant is that
>> data carried via a uAPI should not be opaque to userspace. What you said
>> here is actually a good example of this. When you expose a BAR to
>> userspace, there should be a driver running in userspace that knows the
>> semantics of the BAR, so it's not opaque to userspace.
>
> But the thing running in userspace might be QEMU, which doesn't know
> the semantics of the BAR; it might not be until a driver in the guest
> that we have something that understands the BAR semantics beyond opaque
> data. We might have nested guests, so it could be passed through
> multiple userspaces as opaque data. The requirement makes no sense.
I don't see the difference. From the kernel's perspective they are all
userspace drivers, regardless of whether a guest is involved. No matter
how many levels sit in the middle, there will always be a final endpoint
that clearly knows the semantics of the BAR; the intermediate levels just
transport the uAPI to the upper levels.
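
To make that concrete, here is a minimal sketch of such a final endpoint:
a userspace driver that maps BAR0 through the standard VFIO region ioctls
and pokes one register. The doorbell at offset 0x0 is hypothetical and
error handling is omitted; the point is that only this last consumer knows
that semantic, while every layer in between just moves bytes:

#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/vfio.h>

static void ring_doorbell(int device_fd)
{
        struct vfio_region_info reg = {
                .argsz = sizeof(reg),
                .index = VFIO_PCI_BAR0_REGION_INDEX,
        };

        /* Ask the kernel where BAR0 lives in the device fd's offset space. */
        ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &reg);

        /* The kernel hands out the BAR as raw bytes... */
        volatile uint32_t *bar = mmap(NULL, reg.size, PROT_READ | PROT_WRITE,
                                      MAP_SHARED, device_fd, reg.offset);

        /* ...and only this endpoint knows that offset 0x0 is a doorbell. */
        bar[0] = 1;
}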
>
>
>>>>> Note that we're not talking about vDPA devices here, we're talking
>>>>> about arbitrary devices with arbitrary state. Some degree of migration
>>>>> support for assigned devices can be implemented in QEMU; Alex Graf
>>>>> proved this several years ago with i40evf. Years later, we don't have
>>>>> any vendors proposing device-specific migration code for QEMU.
>>>> Yes, but it's not necessarily VFIO either.
>>> I don't know what this means.
>>
>> I meant we can't assume VFIO is the only uAPI that will be used by QEMU.
>
> And we don't; DPDK, SPDK, and various other userspaces exist. All can
> take advantage of the migration uAPI that we've developed rather than
> implementing device-specific code in their projects.
Obviously, for a device with a higher level of abstraction like virtio,
using a bus-level device model for migration is a burden.
> I'm not sure how
> this is strengthening your argument for device specific migration code
> in QEMU, which would need to be replicated in every other userspace.
Is there any reason for such replication? Except for devices that have a
well-known interface like virtio, each device has unique
attributes/behaviors that need to be dealt with during live migration.
> As
> opaque data with a well defined protocol, each userspace can implement
> support for this migration protocol once and it should work independent
> of the device or vendor. It only requires support in the code
> implementing the device, which is already necessarily device specific.
>
>
>>>>> Clearly we're also trying to account for proprietary devices where even
>>>>> for suspend/resume support, proprietary drivers may be required for
>>>>> manipulating that internal state. When we move device emulation
>>>>> outside of QEMU, whether in kernel or to other userspace processes,
>>>>> does it still make sense to require code in QEMU to support
>>>>> interpretation of that device for migration purposes?
>>>> Well, we could extend QEMU to support proprietary modules (or do we
>>>> support that now?). Then it could talk to proprietary drivers via
>>>> either VFIO or a vendor-specific uAPI.
>>> Yikes, I thought out-of-process devices was exactly the compromise
>>> being developed to avoid QEMU supporting proprietary modules and ad-hoc
>>> vendor specific uAPIs.
>>
>> We can't even prevent this in the kernel, so I don't see how we can
>> make it possible for QEMU.
>
> The kernel is a different beast; it already supports loadable modules
> and, due to whatever pressures or market demands of the past, it allows
> non-GPL use of symbols necessary for some of those modules.
So this just answers my question. It's not hard to forecast that QEMU may
end up under similar pressure in the future. The request is simple:
connect a guest to a vendor-specific proprietary uAPI.
> QEMU has
> no module support outside of non-mainline forks. Clearly there is
> pressure to support sub-process and proprietary device emulation and
> it's our choice how we enable that. This vfio over socket approach is
> the mechanism we're trying to enable to avoid proprietary modules in
> QEMU proper.
vfio-user is not the first; vhost-user can do this already. I would
rather have vfio-user cover the cases that vhost-user can't, and where
possible we should encourage the use of vhost-user.
>
>
>>> I think you're actually questioning even the
>>> premise of developing a standardized API for out-of-process devices
>>> here. Thanks,
>>
>> Actually no, it's just a question that came to my mind when looking at
>> the VFIO migration compatibility patches. Since vfio-user is being
>> proposed, it's a good time to revisit them.
>
> A migration compatibility interface has not been determined for vfio.
> We currently rely on the vendor drivers to provide their own internal
> validation and harmlessly reject migration from an incompatible device.
So it looks like each vendor needs to implement its own migration
protocol instead of the well-defined ones in QEMU? I think the migration
folks can share more experience of how challenging that would be.
> It would be great if we could make progress on this, but it's a
> difficult problem, and one that I hope we can further address once we
> have a base level of migration support.
One thing missing from this summary is a way to detect migration
compatibility. I guess we can't simply depend on migration failure.
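
For instance (a hypothetical sketch; the "migration_version" sysfs
attribute is assumed here for illustration, not an existing uAPI):
management reads a version blob from the source device and writes it to
the destination, and the destination's vendor driver rejects the write
when the devices are incompatible, before any state ever moves:

#include <limits.h>
#include <stdbool.h>
#include <stdio.h>

static bool devices_compatible(const char *src_sysfs, const char *dst_sysfs)
{
        char ver[256], path[PATH_MAX];
        FILE *f;
        bool ok;

        /* Read the (vendor-defined) version blob from the source. */
        snprintf(path, sizeof(path), "%s/migration_version", src_sysfs);
        f = fopen(path, "r");
        if (!f)
                return false;
        if (!fgets(ver, sizeof(ver), f)) {
                fclose(f);
                return false;
        }
        fclose(f);

        /* Write it to the destination; the vendor driver is expected to
         * fail the write on mismatch, so incompatibility is known up
         * front instead of via a failed migration. */
        snprintf(path, sizeof(path), "%s/migration_version", dst_sysfs);
        f = fopen(path, "w");
        if (!f)
                return false;
        ok = fputs(ver, f) >= 0;
        if (fclose(f) != 0)
                ok = false;
        return ok;
}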
>
> It's great to revisit ideas, but proclaiming a uAPI is bad solely
> because the data transfer is opaque, without defining why that's bad,
Well, it should be sufficient that the opaque uAPI itself goes against
the Linux uAPI/ABI design principles. The side effects are obvious:
maintainability, debuggability and compatibility all suffer.
> evaluating the feasibility and implementation of defining a well
> specified data format rather than protocol, including cross-vendor
> support, or proposing any sort of alternative is not so helpful imo.
I don't get this. Why is proposing an alternative not helpful,
considering we're at an early stage?
>
> Note that we also migrate guest memory as opaque data; we don't require
> knowing the data structures it holds or how regions are used, we simply
> look for changes and transfer the new data. That's not so different
> from a vendor driver passing us a blob of data as "information it needs
> to replicate the device state at the target." Thanks,
That's completely different. Guest memory is:

1) not read from any uAPI
2) not opaque to the guest itself

But what QEMU is expected to read from the VFIO uAPI is completely opaque
to every upper layer.
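
To illustrate what "opaque" means here, this is roughly the save-side
loop against the migration region uAPI as proposed (a sketch based on the
v1 struct vfio_device_migration_info, after the device has been put into
the _SAVING state; error handling omitted). QEMU just forwards
(data_offset, data_size) windows of bytes it has no way to interpret:

#include <stddef.h>
#include <stdint.h>
#include <unistd.h>
#include <linux/vfio.h>

/* mig_off: start of the migration region within the device fd, as
 * reported by VFIO_DEVICE_GET_REGION_INFO for that region. */
static void save_device_state(int device_fd, off_t mig_off, int out_fd)
{
        uint64_t pending, data_offset, data_size;
        char buf[4096];

        for (;;) {
                pread(device_fd, &pending, sizeof(pending), mig_off +
                      offsetof(struct vfio_device_migration_info,
                               pending_bytes));
                if (!pending)
                        break;

                /* The vendor driver says where the next chunk lives... */
                pread(device_fd, &data_offset, sizeof(data_offset), mig_off +
                      offsetof(struct vfio_device_migration_info,
                               data_offset));
                pread(device_fd, &data_size, sizeof(data_size), mig_off +
                      offsetof(struct vfio_device_migration_info,
                               data_size));

                /* ...and we copy it byte-for-byte, blind to its content. */
                for (uint64_t done = 0; done < data_size; done += sizeof(buf)) {
                        size_t n = data_size - done < sizeof(buf) ?
                                   data_size - done : sizeof(buf);
                        pread(device_fd, buf, n, mig_off + data_offset + done);
                        write(out_fd, buf, n);
                }
        }
}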
Thanks
>
> Alex
>
>