* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-21 12:40 ` [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices Yishai Hadas
@ 2023-09-21 13:16 ` Michael S. Tsirkin
2023-09-21 14:11 ` Jason Gunthorpe
2023-09-21 13:33 ` Michael S. Tsirkin
` (4 subsequent siblings)
5 siblings, 1 reply; 181+ messages in thread
From: Michael S. Tsirkin @ 2023-09-21 13:16 UTC (permalink / raw)
To: Yishai Hadas
Cc: alex.williamson, jasowang, jgg, kvm, virtualization, parav, feliu,
jiri, kevin.tian, joao.m.martins, leonro, maorg
> MAINTAINERS | 6 +
> drivers/vfio/pci/Kconfig | 2 +
> drivers/vfio/pci/Makefile | 2 +
> drivers/vfio/pci/virtio/Kconfig | 15 +
> drivers/vfio/pci/virtio/Makefile | 4 +
> drivers/vfio/pci/virtio/cmd.c | 4 +-
> drivers/vfio/pci/virtio/cmd.h | 8 +
> drivers/vfio/pci/virtio/main.c | 546 +++++++++++++++++++++++++++++++
> 8 files changed, 585 insertions(+), 2 deletions(-)
> create mode 100644 drivers/vfio/pci/virtio/Kconfig
> create mode 100644 drivers/vfio/pci/virtio/Makefile
> create mode 100644 drivers/vfio/pci/virtio/main.c
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index bf0f54c24f81..5098418c8389 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -22624,6 +22624,12 @@ L: kvm@vger.kernel.org
> S: Maintained
> F: drivers/vfio/pci/mlx5/
>
> +VFIO VIRTIO PCI DRIVER
> +M: Yishai Hadas <yishaih@nvidia.com>
> +L: kvm@vger.kernel.org
> +S: Maintained
> +F: drivers/vfio/pci/virtio
> +
> VFIO PCI DEVICE SPECIFIC DRIVERS
> R: Jason Gunthorpe <jgg@nvidia.com>
> R: Yishai Hadas <yishaih@nvidia.com>
Tying two subsystems together like this is going to cause pain when
merging. God forbid there's something e.g. virtio net specific
(and there's going to be for sure) - now we are talking 3 subsystems.
Case in point: all other virtio drivers are nicely grouped, have a common
mailing list, etc. This one is completely separate, to the point
where people won't even remember to copy the virtio mailing list.
> diff --git a/drivers/vfio/pci/virtio/Kconfig b/drivers/vfio/pci/virtio/Kconfig
> new file mode 100644
> index 000000000000..89eddce8b1bd
> --- /dev/null
> +++ b/drivers/vfio/pci/virtio/Kconfig
> @@ -0,0 +1,15 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +config VIRTIO_VFIO_PCI
> + tristate "VFIO support for VIRTIO PCI devices"
> + depends on VIRTIO_PCI
> + select VFIO_PCI_CORE
> + help
> + This provides support for exposing VIRTIO VF devices using the VFIO
> + framework so that they can work with a legacy virtio driver in the guest.
> + Based on the PCIe spec, VFs do not support I/O Space; thus, VF BARs shall
> + not indicate I/O Space.
> + Because of that, this driver emulates an I/O BAR in software to let a VF be
> + seen as a transitional device in the guest and work with
> + a legacy driver.
> +
> + If you don't know what to do here, say N.
I don't promise we'll remember to poke at vfio if we tweak something
in the virtio kconfig.
--
MST
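For readers less familiar with the interface under discussion: the I/O space that the Kconfig text above talks about emulating is the fixed-layout legacy (virtio 0.9.5) register block. Below is a minimal, self-contained C sketch of that layout and of classifying an access; the offsets follow the legacy spec (they match the VIRTIO_PCI_* constants in include/uapi/linux/virtio_pci.h), while the decode helper and the LEGACY_DEV_CFG_START name are purely illustrative and not taken from the patch.

#include <stdio.h>

/* Legacy (virtio 0.9.5) I/O port layout; offsets as in include/uapi/linux/virtio_pci.h */
#define VIRTIO_PCI_HOST_FEATURES    0   /* 32-bit, read-only  */
#define VIRTIO_PCI_GUEST_FEATURES   4   /* 32-bit, read-write */
#define VIRTIO_PCI_QUEUE_PFN        8   /* 32-bit, read-write */
#define VIRTIO_PCI_QUEUE_NUM       12   /* 16-bit, read-only  */
#define VIRTIO_PCI_QUEUE_SEL       14   /* 16-bit, read-write */
#define VIRTIO_PCI_QUEUE_NOTIFY    16   /* 16-bit, write-only */
#define VIRTIO_PCI_STATUS          18   /*  8-bit, read-write */
#define VIRTIO_PCI_ISR             19   /*  8-bit, read-to-clear */
#define LEGACY_DEV_CFG_START       20   /* device-specific config starts here (no MSI-X) */

/* Illustrative decode: common registers vs. device-specific config space. */
static const char *legacy_region(unsigned int off)
{
	return off < LEGACY_DEV_CFG_START ? "common cfg" : "device cfg";
}

int main(void)
{
	printf("offset %d -> %s\n", VIRTIO_PCI_STATUS, legacy_region(VIRTIO_PCI_STATUS));
	printf("offset %d -> %s\n", 24, legacy_region(24));
	return 0;
}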
^ permalink raw reply related [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-21 13:16 ` Michael S. Tsirkin
@ 2023-09-21 14:11 ` Jason Gunthorpe
2023-09-21 14:16 ` Michael S. Tsirkin
0 siblings, 1 reply; 181+ messages in thread
From: Jason Gunthorpe @ 2023-09-21 14:11 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Yishai Hadas, alex.williamson, jasowang, kvm, virtualization,
parav, feliu, jiri, kevin.tian, joao.m.martins, leonro, maorg
On Thu, Sep 21, 2023 at 09:16:21AM -0400, Michael S. Tsirkin wrote:
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index bf0f54c24f81..5098418c8389 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -22624,6 +22624,12 @@ L: kvm@vger.kernel.org
> > S: Maintained
> > F: drivers/vfio/pci/mlx5/
> >
> > +VFIO VIRTIO PCI DRIVER
> > +M: Yishai Hadas <yishaih@nvidia.com>
> > +L: kvm@vger.kernel.org
> > +S: Maintained
> > +F: drivers/vfio/pci/virtio
> > +
> > VFIO PCI DEVICE SPECIFIC DRIVERS
> > R: Jason Gunthorpe <jgg@nvidia.com>
> > R: Yishai Hadas <yishaih@nvidia.com>
>
> Tying two subsystems together like this is going to cause pain when
> merging. God forbid there's something e.g. virtio net specific
> (and there's going to be for sure) - now we are talking 3
> subsystems.
Cross subsystem stuff is normal in the kernel. Drivers should be
placed in their most logical spot - this driver exposes a VFIO
interface so it belongs here.
Your exact argument works the same from the VFIO perspective, someone
has to have code that belongs to them outside their little sphere
here.
> Case in point all other virtio drivers are nicely grouped, have a common
> mailing list etc etc. This one is completely separate to the point
> where people won't even remember to copy the virtio mailing list.
The virtio mailing list should probably be added to the maintainers
entry.
Jason
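For concreteness, the amended entry being suggested would presumably look something like the fragment below. The virtualization list address is an assumption based on the Cc line of this thread, not something taken from a posted revision of the patch.

VFIO VIRTIO PCI DRIVER
M: Yishai Hadas <yishaih@nvidia.com>
L: kvm@vger.kernel.org
L: virtualization@lists.linux-foundation.org
S: Maintained
F: drivers/vfio/pci/virtio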
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-21 14:11 ` Jason Gunthorpe
@ 2023-09-21 14:16 ` Michael S. Tsirkin
2023-09-21 16:41 ` Jason Gunthorpe
0 siblings, 1 reply; 181+ messages in thread
From: Michael S. Tsirkin @ 2023-09-21 14:16 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Yishai Hadas, alex.williamson, jasowang, kvm, virtualization,
parav, feliu, jiri, kevin.tian, joao.m.martins, leonro, maorg
On Thu, Sep 21, 2023 at 11:11:25AM -0300, Jason Gunthorpe wrote:
> On Thu, Sep 21, 2023 at 09:16:21AM -0400, Michael S. Tsirkin wrote:
>
> > > diff --git a/MAINTAINERS b/MAINTAINERS
> > > index bf0f54c24f81..5098418c8389 100644
> > > --- a/MAINTAINERS
> > > +++ b/MAINTAINERS
> > > @@ -22624,6 +22624,12 @@ L: kvm@vger.kernel.org
> > > S: Maintained
> > > F: drivers/vfio/pci/mlx5/
> > >
> > > +VFIO VIRTIO PCI DRIVER
> > > +M: Yishai Hadas <yishaih@nvidia.com>
> > > +L: kvm@vger.kernel.org
> > > +S: Maintained
> > > +F: drivers/vfio/pci/virtio
> > > +
> > > VFIO PCI DEVICE SPECIFIC DRIVERS
> > > R: Jason Gunthorpe <jgg@nvidia.com>
> > > R: Yishai Hadas <yishaih@nvidia.com>
> >
> > Tying two subsystems together like this is going to cause pain when
> > merging. God forbid there's something e.g. virtio net specific
> > (and there's going to be for sure) - now we are talking 3
> > subsystems.
>
> Cross subsystem stuff is normal in the kernel.
Yea. But it's completely spurious here - virtio has its own way
to work with userspace which is vdpa and let's just use that.
Keeps things nice and contained.
> Drivers should be
> placed in their most logical spot - this driver exposes a VFIO
> interface so it belongs here.
>
> Your exact argument works the same from the VFIO perspective, someone
> has to have code that belongs to them outside their little sphere
> here.
>
> > Case in point all other virtio drivers are nicely grouped, have a common
> > mailing list etc etc. This one is completely separate to the point
> > where people won't even remember to copy the virtio mailing list.
>
> The virtio mailing list should probably be added to the maintainers
> enry
>
> Jason
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-21 14:16 ` Michael S. Tsirkin
@ 2023-09-21 16:41 ` Jason Gunthorpe
2023-09-21 16:53 ` Michael S. Tsirkin
0 siblings, 1 reply; 181+ messages in thread
From: Jason Gunthorpe @ 2023-09-21 16:41 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Yishai Hadas, alex.williamson, jasowang, kvm, virtualization,
parav, feliu, jiri, kevin.tian, joao.m.martins, leonro, maorg
On Thu, Sep 21, 2023 at 10:16:04AM -0400, Michael S. Tsirkin wrote:
> On Thu, Sep 21, 2023 at 11:11:25AM -0300, Jason Gunthorpe wrote:
> > On Thu, Sep 21, 2023 at 09:16:21AM -0400, Michael S. Tsirkin wrote:
> >
> > > > diff --git a/MAINTAINERS b/MAINTAINERS
> > > > index bf0f54c24f81..5098418c8389 100644
> > > > --- a/MAINTAINERS
> > > > +++ b/MAINTAINERS
> > > > @@ -22624,6 +22624,12 @@ L: kvm@vger.kernel.org
> > > > S: Maintained
> > > > F: drivers/vfio/pci/mlx5/
> > > >
> > > > +VFIO VIRTIO PCI DRIVER
> > > > +M: Yishai Hadas <yishaih@nvidia.com>
> > > > +L: kvm@vger.kernel.org
> > > > +S: Maintained
> > > > +F: drivers/vfio/pci/virtio
> > > > +
> > > > VFIO PCI DEVICE SPECIFIC DRIVERS
> > > > R: Jason Gunthorpe <jgg@nvidia.com>
> > > > R: Yishai Hadas <yishaih@nvidia.com>
> > >
> > > Tying two subsystems together like this is going to cause pain when
> > > merging. God forbid there's something e.g. virtio net specific
> > > (and there's going to be for sure) - now we are talking 3
> > > subsystems.
> >
> > Cross subsystem stuff is normal in the kernel.
>
> Yea. But it's completely spurious here - virtio has its own way
> to work with userspace which is vdpa and let's just use that.
> Keeps things nice and contained.
vdpa is not vfio, I don't know how you can suggest vdpa is a
replacement for a vfio driver. They are completely different
things.
Each side has its own strengths, and vfio especially is accelerating
in its capability in a way that vdpa is not. e.g. if an iommufd conversion
had been done by now for vdpa I might be more sympathetic. Asking for
someone else to do a huge amount of pointless work to improve vdpa
just to the level this vfio driver is already at is ridiculous.
vdpa is great for certain kinds of HW, let it focus on that, don't try
to paint it as an alternative to vfio. It isn't.
Jason
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-21 16:41 ` Jason Gunthorpe
@ 2023-09-21 16:53 ` Michael S. Tsirkin
2023-09-21 18:39 ` Jason Gunthorpe
0 siblings, 1 reply; 181+ messages in thread
From: Michael S. Tsirkin @ 2023-09-21 16:53 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Yishai Hadas, alex.williamson, jasowang, kvm, virtualization,
parav, feliu, jiri, kevin.tian, joao.m.martins, leonro, maorg
On Thu, Sep 21, 2023 at 01:41:39PM -0300, Jason Gunthorpe wrote:
> On Thu, Sep 21, 2023 at 10:16:04AM -0400, Michael S. Tsirkin wrote:
> > On Thu, Sep 21, 2023 at 11:11:25AM -0300, Jason Gunthorpe wrote:
> > > On Thu, Sep 21, 2023 at 09:16:21AM -0400, Michael S. Tsirkin wrote:
> > >
> > > > > diff --git a/MAINTAINERS b/MAINTAINERS
> > > > > index bf0f54c24f81..5098418c8389 100644
> > > > > --- a/MAINTAINERS
> > > > > +++ b/MAINTAINERS
> > > > > @@ -22624,6 +22624,12 @@ L: kvm@vger.kernel.org
> > > > > S: Maintained
> > > > > F: drivers/vfio/pci/mlx5/
> > > > >
> > > > > +VFIO VIRTIO PCI DRIVER
> > > > > +M: Yishai Hadas <yishaih@nvidia.com>
> > > > > +L: kvm@vger.kernel.org
> > > > > +S: Maintained
> > > > > +F: drivers/vfio/pci/virtio
> > > > > +
> > > > > VFIO PCI DEVICE SPECIFIC DRIVERS
> > > > > R: Jason Gunthorpe <jgg@nvidia.com>
> > > > > R: Yishai Hadas <yishaih@nvidia.com>
> > > >
> > > > Tying two subsystems together like this is going to cause pain when
> > > > merging. God forbid there's something e.g. virtio net specific
> > > > (and there's going to be for sure) - now we are talking 3
> > > > subsystems.
> > >
> > > Cross subsystem stuff is normal in the kernel.
> >
> > Yea. But it's completely spurious here - virtio has its own way
> > to work with userspace which is vdpa and let's just use that.
> > Keeps things nice and contained.
>
> vdpa is not vfio, I don't know how you can suggest vdpa is a
> replacement for a vfio driver. They are completely different
> things.
> Each side has its own strengths, and vfio especially is accelerating
> in its capability in way that vpda is not. eg if an iommufd conversion
> had been done by now for vdpa I might be more sympathetic.
Yea, I agree iommufd is a big problem with vdpa right now. Cindy was
sick and I didn't know and kept assuming she's working on this. I don't
think it's a huge amount of work though. I'll take a look.
Is there anything else though? Do tell.
> Asking for
> someone else to do a huge amount of pointless work to improve vdpa
> just to level of this vfio driver already is at is ridiculous.
>
> vdpa is great for certain kinds of HW, let it focus on that, don't try
> to paint it as an alternative to vfio. It isn't.
>
> Jason
There are a bunch of things that I think are important for virtio
that are completely out of scope for vfio, such as migrating
cross-vendor. What is the huge amount of work that I am asking you to do?
--
MST
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-21 16:53 ` Michael S. Tsirkin
@ 2023-09-21 18:39 ` Jason Gunthorpe
2023-09-21 19:13 ` Michael S. Tsirkin
` (2 more replies)
0 siblings, 3 replies; 181+ messages in thread
From: Jason Gunthorpe @ 2023-09-21 18:39 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Yishai Hadas, alex.williamson, jasowang, kvm, virtualization,
parav, feliu, jiri, kevin.tian, joao.m.martins, leonro, maorg
On Thu, Sep 21, 2023 at 12:53:04PM -0400, Michael S. Tsirkin wrote:
> > vdpa is not vfio, I don't know how you can suggest vdpa is a
> > replacement for a vfio driver. They are completely different
> > things.
> > Each side has its own strengths, and vfio especially is accelerating
> > in its capability in way that vpda is not. eg if an iommufd conversion
> > had been done by now for vdpa I might be more sympathetic.
>
> Yea, I agree iommufd is a big problem with vdpa right now. Cindy was
> sick and I didn't know and kept assuming she's working on this. I don't
> think it's a huge amount of work though. I'll take a look.
> Is there anything else though? Do tell.
Confidential compute will never work with VDPA's approach.
> There are a bunch of things that I think are important for virtio
> that are completely out of scope for vfio, such as migrating
> cross-vendor.
VFIO supports migration, if you want to have cross-vendor migration
then make a standard that describes the VFIO migration data format for
virtio devices.
> What is the huge amount of work am I asking to do?
You are asking us to invest in the complexity of VDPA throughout
(keep it working, keep it secure, invest time in deploying and
debugging in the field)
When it doesn't provide *ANY* value to the solution.
The starting point is a completely working vfio PCI function and the
end goal is to put that function into a VM. That is VFIO, not VDPA.
VDPA is fine for what it does, but it is not a reasonable replacement
for VFIO.
Jason
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-21 18:39 ` Jason Gunthorpe
@ 2023-09-21 19:13 ` Michael S. Tsirkin
2023-09-21 19:49 ` Jason Gunthorpe
2023-09-21 19:17 ` Michael S. Tsirkin
2023-09-22 3:45 ` Zhu, Lingshan
2 siblings, 1 reply; 181+ messages in thread
From: Michael S. Tsirkin @ 2023-09-21 19:13 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Yishai Hadas, alex.williamson, jasowang, kvm, virtualization,
parav, feliu, jiri, kevin.tian, joao.m.martins, leonro, maorg
On Thu, Sep 21, 2023 at 03:39:26PM -0300, Jason Gunthorpe wrote:
> On Thu, Sep 21, 2023 at 12:53:04PM -0400, Michael S. Tsirkin wrote:
> > > vdpa is not vfio, I don't know how you can suggest vdpa is a
> > > replacement for a vfio driver. They are completely different
> > > things.
> > > Each side has its own strengths, and vfio especially is accelerating
> > > in its capability in way that vpda is not. eg if an iommufd conversion
> > > had been done by now for vdpa I might be more sympathetic.
> >
> > Yea, I agree iommufd is a big problem with vdpa right now. Cindy was
> > sick and I didn't know and kept assuming she's working on this. I don't
> > think it's a huge amount of work though. I'll take a look.
> > Is there anything else though? Do tell.
>
> Confidential compute will never work with VDPA's approach.
I don't see how what this patchset is doing is different
wrt Confidential compute - you trap IO accesses and emulate.
Care to elaborate?
> > There are a bunch of things that I think are important for virtio
> > that are completely out of scope for vfio, such as migrating
> > cross-vendor.
>
> VFIO supports migration, if you want to have cross-vendor migration
> then make a standard that describes the VFIO migration data format for
> virtio devices.
This has nothing to do with data formats - you need two devices to
behave identically. Which is what VDPA is about really.
> > What is the huge amount of work am I asking to do?
>
> You are asking us to invest in the complexity of VDPA through out
> (keep it working, keep it secure, invest time in deploying and
> debugging in the field)
>
> When it doesn't provide *ANY* value to the solution.
There's no "the solution" - this sounds like a vendor only caring about
solutions that involve that vendor's hardware exclusively, a little.
> The starting point is a completely working vfio PCI function and the
> end goal is to put that function into a VM. That is VFIO, not VDPA.
>
> VPDA is fine for what it does, but it is not a reasonable replacement
> for VFIO.
>
> Jason
VDPA basically should be a kind of "VFIO for virtio".
--
MST
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-21 19:13 ` Michael S. Tsirkin
@ 2023-09-21 19:49 ` Jason Gunthorpe
2023-09-21 20:45 ` Michael S. Tsirkin
2023-09-22 3:01 ` Jason Wang
0 siblings, 2 replies; 181+ messages in thread
From: Jason Gunthorpe @ 2023-09-21 19:49 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Yishai Hadas, alex.williamson, jasowang, kvm, virtualization,
parav, feliu, jiri, kevin.tian, joao.m.martins, leonro, maorg
On Thu, Sep 21, 2023 at 03:13:10PM -0400, Michael S. Tsirkin wrote:
> On Thu, Sep 21, 2023 at 03:39:26PM -0300, Jason Gunthorpe wrote:
> > On Thu, Sep 21, 2023 at 12:53:04PM -0400, Michael S. Tsirkin wrote:
> > > > vdpa is not vfio, I don't know how you can suggest vdpa is a
> > > > replacement for a vfio driver. They are completely different
> > > > things.
> > > > Each side has its own strengths, and vfio especially is accelerating
> > > > in its capability in way that vpda is not. eg if an iommufd conversion
> > > > had been done by now for vdpa I might be more sympathetic.
> > >
> > > Yea, I agree iommufd is a big problem with vdpa right now. Cindy was
> > > sick and I didn't know and kept assuming she's working on this. I don't
> > > think it's a huge amount of work though. I'll take a look.
> > > Is there anything else though? Do tell.
> >
> > Confidential compute will never work with VDPA's approach.
>
> I don't see how what this patchset is doing is different
> wrt to Confidential compute - you trap IO accesses and emulate.
> Care to elaborate?
This patch series isn't about confidential compute, you asked about
the future. VFIO will support confidential compute in the future, VDPA
will not.
> > > There are a bunch of things that I think are important for virtio
> > > that are completely out of scope for vfio, such as migrating
> > > cross-vendor.
> >
> > VFIO supports migration, if you want to have cross-vendor migration
> > then make a standard that describes the VFIO migration data format for
> > virtio devices.
>
> This has nothing to do with data formats - you need two devices to
> behave identically. Which is what VDPA is about really.
We've been looking at VFIO live migration extensively. Device
mediation, like VDPA does, is one legitimate approach for live
migration. It suits a certain type of heterogeneous environment well.
But, it is equally legitimate to make the devices behave the same and
have them process a common migration data.
This can happen in public with standards, or it can happen in private
within a cloud operator's "private-standard" environment.
To date, in most of my discussions, I have not seen a strong appetite
for such public standards. In part due to the complexity.
Regardless, it is not the kernel community's job to insist on one
approach or the other.
> > You are asking us to invest in the complexity of VDPA through out
> > (keep it working, keep it secure, invest time in deploying and
> > debugging in the field)
> >
> > When it doesn't provide *ANY* value to the solution.
>
> There's no "the solution"
Nonsense.
> this sounds like a vendor only caring about solutions that involve
> that vendor's hardware exclusively, a little.
Not really.
Understand the DPU provider is not the vendor here. The DPU provider
gives a cloud operator an SDK to build these things. The operator is
the vendor from your perspective.
In many cases live migration never leaves the operator's confines in
the first place.
Even when it does, there is no real use case to live migrate a
virtio-net function from, say, AWS to GCP.
You are pushing for a lot of complexity and software that solves a
problem people in this space don't actually have.
As I said, VDPA is fine for the scenarios it addresses. It is an
alternative, not a replacement, for VFIO.
Jason
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-21 19:49 ` Jason Gunthorpe
@ 2023-09-21 20:45 ` Michael S. Tsirkin
2023-09-21 22:55 ` Jason Gunthorpe
2023-09-22 3:01 ` Jason Wang
1 sibling, 1 reply; 181+ messages in thread
From: Michael S. Tsirkin @ 2023-09-21 20:45 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Yishai Hadas, alex.williamson, jasowang, kvm, virtualization,
parav, feliu, jiri, kevin.tian, joao.m.martins, leonro, maorg
On Thu, Sep 21, 2023 at 04:49:46PM -0300, Jason Gunthorpe wrote:
> On Thu, Sep 21, 2023 at 03:13:10PM -0400, Michael S. Tsirkin wrote:
> > On Thu, Sep 21, 2023 at 03:39:26PM -0300, Jason Gunthorpe wrote:
> > > On Thu, Sep 21, 2023 at 12:53:04PM -0400, Michael S. Tsirkin wrote:
> > > > > vdpa is not vfio, I don't know how you can suggest vdpa is a
> > > > > replacement for a vfio driver. They are completely different
> > > > > things.
> > > > > Each side has its own strengths, and vfio especially is accelerating
> > > > > in its capability in way that vpda is not. eg if an iommufd conversion
> > > > > had been done by now for vdpa I might be more sympathetic.
> > > >
> > > > Yea, I agree iommufd is a big problem with vdpa right now. Cindy was
> > > > sick and I didn't know and kept assuming she's working on this. I don't
> > > > think it's a huge amount of work though. I'll take a look.
> > > > Is there anything else though? Do tell.
> > >
> > > Confidential compute will never work with VDPA's approach.
> >
> > I don't see how what this patchset is doing is different
> > wrt to Confidential compute - you trap IO accesses and emulate.
> > Care to elaborate?
>
> This patch series isn't about confidential compute, you asked about
> the future. VFIO will support confidential compute in the future, VDPA
> will not.
Nonsense, it already works.
But I did not ask about the future since I do not believe it
can be confidently predicted. I asked what is missing in VDPA
now for you to add this feature there and not in VFIO.
> > > > There are a bunch of things that I think are important for virtio
> > > > that are completely out of scope for vfio, such as migrating
> > > > cross-vendor.
> > >
> > > VFIO supports migration, if you want to have cross-vendor migration
> > > then make a standard that describes the VFIO migration data format for
> > > virtio devices.
> >
> > This has nothing to do with data formats - you need two devices to
> > behave identically. Which is what VDPA is about really.
>
> We've been looking at VFIO live migration extensively. Device
> mediation, like VDPA does, is one legitimate approach for live
> migration. It suites a certain type of heterogeneous environment well.
>
> But, it is equally legitimate to make the devices behave the same and
> have them process a common migration data.
>
> This can happen in public with standards, or it can happen in private
> within a cloud operator's "private-standard" environment.
>
> To date, in most of my discussions, I have not seen a strong appetite
> for such public standards. In part due to the complexity.
>
> Regardles, it is not the kernel communities job to insist on one
> approach or the other.
>
> > > You are asking us to invest in the complexity of VDPA through out
> > > (keep it working, keep it secure, invest time in deploying and
> > > debugging in the field)
> > >
> > > When it doesn't provide *ANY* value to the solution.
> >
> > There's no "the solution"
>
> Nonsense.
What, there's only one solution, that you use the definite article?
> > this sounds like a vendor only caring about solutions that involve
> > that vendor's hardware exclusively, a little.
>
> Not really.
>
> Understand the DPU provider is not the vendor here. The DPU provider
> gives a cloud operator a SDK to build these things. The operator is
> the vendor from your perspective.
>
> In many cases live migration never leaves the operator's confines in
> the first place.
>
> Even when it does, there is no real use case to live migrate a
> virtio-net function from, say, AWS to GCP.
>
> You are pushing for a lot of complexity and software that solves a
> problem people in this space don't actually have.
>
> As I said, VDPA is fine for the scenarios it addresses. It is an
> alternative, not a replacement, for VFIO.
>
> Jason
Yea, VDPA does trap and emulate for config accesses, which is exactly
what this patch does. So why it belongs in vfio, muddying up its
passthrough model, is beyond me - except that apparently there's some
specific deployment that happens to use vfio, so now whatever
that deployment needs has to go into vfio whether it belongs there or not.
--
MST
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-21 20:45 ` Michael S. Tsirkin
@ 2023-09-21 22:55 ` Jason Gunthorpe
2023-09-22 3:02 ` Jason Wang
2023-09-22 11:23 ` Michael S. Tsirkin
0 siblings, 2 replies; 181+ messages in thread
From: Jason Gunthorpe @ 2023-09-21 22:55 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Yishai Hadas, alex.williamson, jasowang, kvm, virtualization,
parav, feliu, jiri, kevin.tian, joao.m.martins, leonro, maorg
On Thu, Sep 21, 2023 at 04:45:45PM -0400, Michael S. Tsirkin wrote:
> On Thu, Sep 21, 2023 at 04:49:46PM -0300, Jason Gunthorpe wrote:
> > On Thu, Sep 21, 2023 at 03:13:10PM -0400, Michael S. Tsirkin wrote:
> > > On Thu, Sep 21, 2023 at 03:39:26PM -0300, Jason Gunthorpe wrote:
> > > > On Thu, Sep 21, 2023 at 12:53:04PM -0400, Michael S. Tsirkin wrote:
> > > > > > vdpa is not vfio, I don't know how you can suggest vdpa is a
> > > > > > replacement for a vfio driver. They are completely different
> > > > > > things.
> > > > > > Each side has its own strengths, and vfio especially is accelerating
> > > > > > in its capability in way that vpda is not. eg if an iommufd conversion
> > > > > > had been done by now for vdpa I might be more sympathetic.
> > > > >
> > > > > Yea, I agree iommufd is a big problem with vdpa right now. Cindy was
> > > > > sick and I didn't know and kept assuming she's working on this. I don't
> > > > > think it's a huge amount of work though. I'll take a look.
> > > > > Is there anything else though? Do tell.
> > > >
> > > > Confidential compute will never work with VDPA's approach.
> > >
> > > I don't see how what this patchset is doing is different
> > > wrt to Confidential compute - you trap IO accesses and emulate.
> > > Care to elaborate?
> >
> > This patch series isn't about confidential compute, you asked about
> > the future. VFIO will support confidential compute in the future, VDPA
> > will not.
>
> Nonsense it already works.
That isn't what I'm talking about. With a real PCI function and TDISP
we can actually DMA directly from the guest's memory without needing
the ugly bounce buffer hack. Then you can get decent performance.
> But I did not ask about the future since I do not believe it
> can be confidently predicted. I asked what is missing in VDPA
> now for you to add this feature there and not in VFIO.
I don't see that VDPA needs this, VDPA should process the IO BAR on
its own with its own logic, just like everything else it does.
This is specifically about avoiding mediation by relaying directly the
IO BAR operations to the device itself.
That is the entire irony, this whole scheme was designed and
standardized *specifically* to avoid complex mediation and here you
are saying we should just use mediation.
Jason
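To illustrate the relay point above, the shape of such a trap handler is roughly the standalone C sketch below. The opcode names and the relay_admin_cmd() helper are hypothetical stand-ins for the legacy-access admin commands proposed at OASIS and implemented in the series' cmd.c; treat this as a sketch of the idea, not as the driver's actual code.

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical opcodes standing in for the legacy-access admin commands;
 * the real names and values live in the spec proposal and in cmd.h/cmd.c. */
enum relay_opcode {
	RELAY_LEGACY_COMMON_CFG_WRITE,
	RELAY_LEGACY_DEV_CFG_WRITE,
};

/* Stub transport: a real driver would queue this command on the admin
 * virtqueue of the parent PF and wait for its completion. */
static int relay_admin_cmd(enum relay_opcode op, unsigned int off,
			   const void *buf, size_t len)
{
	printf("admin cmd op=%d off=%u len=%zu\n", op, off, len);
	return 0;
}

#define LEGACY_DEV_CFG_START 20 /* device config follows the common registers */

/*
 * Trap handler shape: a guest write to the emulated I/O BAR is not
 * interpreted here; it is turned 1:1 into a device command, so the
 * device keeps full control over its own behaviour.
 */
static int relay_legacy_io_write(unsigned int off, const void *data, size_t len)
{
	if (off < LEGACY_DEV_CFG_START)
		return relay_admin_cmd(RELAY_LEGACY_COMMON_CFG_WRITE, off,
				       data, len);
	return relay_admin_cmd(RELAY_LEGACY_DEV_CFG_WRITE,
			       off - LEGACY_DEV_CFG_START, data, len);
}

int main(void)
{
	uint8_t status = 0x0f;	/* e.g. a legacy driver setting device status */
	return relay_legacy_io_write(18, &status, sizeof(status));
}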
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-21 22:55 ` Jason Gunthorpe
@ 2023-09-22 3:02 ` Jason Wang
2023-09-22 11:23 ` Michael S. Tsirkin
1 sibling, 0 replies; 181+ messages in thread
From: Jason Wang @ 2023-09-22 3:02 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Michael S. Tsirkin, Yishai Hadas, alex.williamson, kvm,
virtualization, parav, feliu, jiri, kevin.tian, joao.m.martins,
leonro, maorg
On Fri, Sep 22, 2023 at 6:55 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Thu, Sep 21, 2023 at 04:45:45PM -0400, Michael S. Tsirkin wrote:
> > On Thu, Sep 21, 2023 at 04:49:46PM -0300, Jason Gunthorpe wrote:
> > > On Thu, Sep 21, 2023 at 03:13:10PM -0400, Michael S. Tsirkin wrote:
> > > > On Thu, Sep 21, 2023 at 03:39:26PM -0300, Jason Gunthorpe wrote:
> > > > > On Thu, Sep 21, 2023 at 12:53:04PM -0400, Michael S. Tsirkin wrote:
> > > > > > > vdpa is not vfio, I don't know how you can suggest vdpa is a
> > > > > > > replacement for a vfio driver. They are completely different
> > > > > > > things.
> > > > > > > Each side has its own strengths, and vfio especially is accelerating
> > > > > > > in its capability in way that vpda is not. eg if an iommufd conversion
> > > > > > > had been done by now for vdpa I might be more sympathetic.
> > > > > >
> > > > > > Yea, I agree iommufd is a big problem with vdpa right now. Cindy was
> > > > > > sick and I didn't know and kept assuming she's working on this. I don't
> > > > > > think it's a huge amount of work though. I'll take a look.
> > > > > > Is there anything else though? Do tell.
> > > > >
> > > > > Confidential compute will never work with VDPA's approach.
> > > >
> > > > I don't see how what this patchset is doing is different
> > > > wrt to Confidential compute - you trap IO accesses and emulate.
> > > > Care to elaborate?
> > >
> > > This patch series isn't about confidential compute, you asked about
> > > the future. VFIO will support confidential compute in the future, VDPA
> > > will not.
What blocks vDPA from supporting that?
> >
> > Nonsense it already works.
>
> That isn't what I'm talking about. With a real PCI function and TDISP
> we can actually DMA directly from the guest's memory without needing
> the ugly bounce buffer hack. Then you can get decent performance.
This series requires trapping the legacy I/O BAR in VFIO. Why
can TDISP work with trapping in VFIO but not in vDPA? If it works with
neither, how is TDISP relevant here?
>
> > But I did not ask about the future since I do not believe it
> > can be confidently predicted. I asked what is missing in VDPA
> > now for you to add this feature there and not in VFIO.
>
> I don't see that VDPA needs this, VDPA should process the IO BAR on
> its own with its own logic, just like everything else it does.
>
> This is specifically about avoiding mediation by relaying directly the
> IO BAR operations to the device itself.
So we had:
1) a new virtio specific driver for VFIO
2) the existing vp_vdpa driver
How much difference is there between them in the context of mediation or
relaying? Or is it hard to introduce admin commands on the vDPA bus?
> That is the entire irony, this whole scheme was designed and
> standardized *specifically* to avoid complex mediation and here you
> are saying we should just use mediation.
No, using "simple VFIO passthrough" is just fine.
Thanks
>
> Jason
>
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-21 22:55 ` Jason Gunthorpe
2023-09-22 3:02 ` Jason Wang
@ 2023-09-22 11:23 ` Michael S. Tsirkin
2023-09-22 12:15 ` Jason Gunthorpe
1 sibling, 1 reply; 181+ messages in thread
From: Michael S. Tsirkin @ 2023-09-22 11:23 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Yishai Hadas, alex.williamson, jasowang, kvm, virtualization,
parav, feliu, jiri, kevin.tian, joao.m.martins, leonro, maorg
On Thu, Sep 21, 2023 at 07:55:26PM -0300, Jason Gunthorpe wrote:
> On Thu, Sep 21, 2023 at 04:45:45PM -0400, Michael S. Tsirkin wrote:
> > On Thu, Sep 21, 2023 at 04:49:46PM -0300, Jason Gunthorpe wrote:
> > > On Thu, Sep 21, 2023 at 03:13:10PM -0400, Michael S. Tsirkin wrote:
> > > > On Thu, Sep 21, 2023 at 03:39:26PM -0300, Jason Gunthorpe wrote:
> > > > > On Thu, Sep 21, 2023 at 12:53:04PM -0400, Michael S. Tsirkin wrote:
> > > > > > > vdpa is not vfio, I don't know how you can suggest vdpa is a
> > > > > > > replacement for a vfio driver. They are completely different
> > > > > > > things.
> > > > > > > Each side has its own strengths, and vfio especially is accelerating
> > > > > > > in its capability in way that vpda is not. eg if an iommufd conversion
> > > > > > > had been done by now for vdpa I might be more sympathetic.
> > > > > >
> > > > > > Yea, I agree iommufd is a big problem with vdpa right now. Cindy was
> > > > > > sick and I didn't know and kept assuming she's working on this. I don't
> > > > > > think it's a huge amount of work though. I'll take a look.
> > > > > > Is there anything else though? Do tell.
> > > > >
> > > > > Confidential compute will never work with VDPA's approach.
> > > >
> > > > I don't see how what this patchset is doing is different
> > > > wrt to Confidential compute - you trap IO accesses and emulate.
> > > > Care to elaborate?
> > >
> > > This patch series isn't about confidential compute, you asked about
> > > the future. VFIO will support confidential compute in the future, VDPA
> > > will not.
> >
> > Nonsense it already works.
>
> That isn't what I'm talking about. With a real PCI function and TDISP
> we can actually DMA directly from the guest's memory without needing
> the ugly bounce buffer hack. Then you can get decent performance.
Aha, TDISP. But that one clearly does not need and can not use
this kind of hack?
> > But I did not ask about the future since I do not believe it
> > can be confidently predicted. I asked what is missing in VDPA
> > now for you to add this feature there and not in VFIO.
>
> I don't see that VDPA needs this, VDPA should process the IO BAR on
> its own with its own logic, just like everything else it does.
First, there's some logic here, such as translating legacy IO
offsets to modern ones, that could be reused.
But also, this is not just the IO BAR, which indeed can easily be done in
software. When a device operates in legacy mode there are subtle
differences from modern mode, such as a different header size for the net
device.
> This is specifically about avoiding mediation by relaying directly the
> IO BAR operations to the device itself.
>
> That is the entire irony, this whole scheme was designed and
> standardized *specifically* to avoid complex mediation and here you
> are saying we should just use mediation.
>
> Jason
Not exactly. What I had in mind is just having the logic in
the vdpa module so users don't need to know what the device
supports and what it doesn't. If we can, we bypass mediation
(to simplify the software stack); if we cannot, we do not.
Looking at it from the user's POV, it is just super confusing that
card ABC would need to be used with VDPA to drive legacy while
card DEF needs to be used with VFIO. And both VFIO and VDPA
will happily bind, too. Oh man ...
--
MST
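On the header-size point a few paragraphs up: a legacy virtio-net device that has not negotiated VIRTIO_NET_F_MRG_RXBUF prefixes each packet with the 10-byte header, while a modern (VIRTIO_F_VERSION_1) device always uses the 12-byte layout that includes num_buffers. A small self-contained C sketch of the two layouts follows; the fields mirror include/uapi/linux/virtio_net.h, but the struct names here are illustrative.

#include <stdint.h>
#include <stdio.h>

/* Legacy header, used when neither MRG_RXBUF nor VERSION_1 is negotiated. */
struct legacy_virtio_net_hdr {
	uint8_t  flags;
	uint8_t  gso_type;
	uint16_t hdr_len;
	uint16_t gso_size;
	uint16_t csum_start;
	uint16_t csum_offset;
};					/* 10 bytes */

/* Modern (VIRTIO_F_VERSION_1) header always carries num_buffers. */
struct modern_virtio_net_hdr {
	struct legacy_virtio_net_hdr hdr;
	uint16_t num_buffers;
};					/* 12 bytes */

int main(void)
{
	printf("legacy hdr: %zu bytes, modern hdr: %zu bytes\n",
	       sizeof(struct legacy_virtio_net_hdr),
	       sizeof(struct modern_virtio_net_hdr));
	return 0;
}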
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-22 11:23 ` Michael S. Tsirkin
@ 2023-09-22 12:15 ` Jason Gunthorpe
0 siblings, 0 replies; 181+ messages in thread
From: Jason Gunthorpe @ 2023-09-22 12:15 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Yishai Hadas, alex.williamson, jasowang, kvm, virtualization,
parav, feliu, jiri, kevin.tian, joao.m.martins, leonro, maorg
On Fri, Sep 22, 2023 at 07:23:28AM -0400, Michael S. Tsirkin wrote:
> On Thu, Sep 21, 2023 at 07:55:26PM -0300, Jason Gunthorpe wrote:
> Looking at it from user's POV, it is just super confusing that
> card ABC would need to be used with VDPA to drive legacy while
> card DEF needs to be used with VFIO. And both VFIO and VDPA
> will happily bind, too. Oh man ...
It is standard VFIO stuff. If you don't attach vfio then you get the
normal kernel virtio-net driver. If you turn that into VDPA then you get
that.
If you attach VFIO to the PCI function then you get VFIO.
There is nothing special here, we have good infrastructure to support
doing this already.
User gets to pick. I don't understand why you think the kernel side
should deny this choice.
Jason
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-21 19:49 ` Jason Gunthorpe
2023-09-21 20:45 ` Michael S. Tsirkin
@ 2023-09-22 3:01 ` Jason Wang
2023-09-22 12:11 ` Jason Gunthorpe
1 sibling, 1 reply; 181+ messages in thread
From: Jason Wang @ 2023-09-22 3:01 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Michael S. Tsirkin, Yishai Hadas, alex.williamson, kvm,
virtualization, parav, feliu, jiri, kevin.tian, joao.m.martins,
leonro, maorg
On Fri, Sep 22, 2023 at 3:49 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Thu, Sep 21, 2023 at 03:13:10PM -0400, Michael S. Tsirkin wrote:
> > On Thu, Sep 21, 2023 at 03:39:26PM -0300, Jason Gunthorpe wrote:
> > > On Thu, Sep 21, 2023 at 12:53:04PM -0400, Michael S. Tsirkin wrote:
> > > > > vdpa is not vfio, I don't know how you can suggest vdpa is a
> > > > > replacement for a vfio driver. They are completely different
> > > > > things.
> > > > > Each side has its own strengths, and vfio especially is accelerating
> > > > > in its capability in way that vpda is not. eg if an iommufd conversion
> > > > > had been done by now for vdpa I might be more sympathetic.
> > > >
> > > > Yea, I agree iommufd is a big problem with vdpa right now. Cindy was
> > > > sick and I didn't know and kept assuming she's working on this. I don't
> > > > think it's a huge amount of work though. I'll take a look.
> > > > Is there anything else though? Do tell.
> > >
> > > Confidential compute will never work with VDPA's approach.
> >
> > I don't see how what this patchset is doing is different
> > wrt to Confidential compute - you trap IO accesses and emulate.
> > Care to elaborate?
>
> This patch series isn't about confidential compute, you asked about
> the future. VFIO will support confidential compute in the future, VDPA
> will not.
>
> > > > There are a bunch of things that I think are important for virtio
> > > > that are completely out of scope for vfio, such as migrating
> > > > cross-vendor.
> > >
> > > VFIO supports migration, if you want to have cross-vendor migration
> > > then make a standard that describes the VFIO migration data format for
> > > virtio devices.
> >
> > This has nothing to do with data formats - you need two devices to
> > behave identically. Which is what VDPA is about really.
>
> We've been looking at VFIO live migration extensively. Device
> mediation, like VDPA does, is one legitimate approach for live
> migration. It suites a certain type of heterogeneous environment well.
>
> But, it is equally legitimate to make the devices behave the same and
> have them process a common migration data.
>
> This can happen in public with standards, or it can happen in private
> within a cloud operator's "private-standard" environment.
>
> To date, in most of my discussions, I have not seen a strong appetite
> for such public standards. In part due to the complexity.
>
> Regardles, it is not the kernel communities job to insist on one
> approach or the other.
>
> > > You are asking us to invest in the complexity of VDPA through out
> > > (keep it working, keep it secure, invest time in deploying and
> > > debugging in the field)
> > >
> > > When it doesn't provide *ANY* value to the solution.
> >
> > There's no "the solution"
>
> Nonsense.
>
> > this sounds like a vendor only caring about solutions that involve
> > that vendor's hardware exclusively, a little.
>
> Not really.
>
> Understand the DPU provider is not the vendor here. The DPU provider
> gives a cloud operator a SDK to build these things. The operator is
> the vendor from your perspective.
>
> In many cases live migration never leaves the operator's confines in
> the first place.
>
> Even when it does, there is no real use case to live migrate a
> virtio-net function from, say, AWS to GCP.
It can happen inside a single cloud vendor. For various reasons, DPUs must
be purchased from different vendors, and vDPA has been used in that
case.
I've asked them to present this, probably somewhere like KVM Forum.
>
> You are pushing for a lot of complexity and software that solves a
> problem people in this space don't actually have.
>
> As I said, VDPA is fine for the scenarios it addresses. It is an
> alternative, not a replacement, for VFIO.
We never tried to replace VFIO. I don't see any problem with just using
the current VFIO to assign a virtio-pci device to the guest.
The problem is the mediation (or what you called relaying) layer
you've invented.
Thanks
>
> Jason
>
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-22 3:01 ` Jason Wang
@ 2023-09-22 12:11 ` Jason Gunthorpe
2023-09-25 2:34 ` Jason Wang
0 siblings, 1 reply; 181+ messages in thread
From: Jason Gunthorpe @ 2023-09-22 12:11 UTC (permalink / raw)
To: Jason Wang
Cc: Michael S. Tsirkin, Yishai Hadas, alex.williamson, kvm,
virtualization, parav, feliu, jiri, kevin.tian, joao.m.martins,
leonro, maorg
On Fri, Sep 22, 2023 at 11:01:23AM +0800, Jason Wang wrote:
> > Even when it does, there is no real use case to live migrate a
> > virtio-net function from, say, AWS to GCP.
>
> It can happen inside a single cloud vendor. For some reasons, DPU must
> be purchased from different vendors. And vDPA has been used in that
> case.
Nope, you misunderstand the DPU scenario.
Look at something like vmware DPU enablement. vmware runs the software
side of the DPU and all their supported DPU HW, from every vendor,
generates the same PCI functions on the x86. They are the same because
the same software on the DPU side is creating them.
There is no reason to put a mediation layer in the x86 if you also
control the DPU.
Cloud vendors will similarly use DPUs to create PCI functions that
meet the cloud vendor's internal specification, regardless of DPU
vendor.
Fundamentally if you control the DPU SW and the hypervisor software
you do not need hypervisor mediation because everything you could do
in hypervisor mediation can just be done in the DPU. Putting it in the
DPU is better in every regard.
So, as I keep saying, in this scenario the goal is no mediation in the
hypervisor. It is pointless, everything you think you need to do there
is actually already being done in the DPU.
Once you commit to this configuration you are committed to VFIO in the
hypervisor. eg your DPU is likely also making NVMe and other PCI
functions too.
> The problem is the mediation (or what you called relaying) layer
> you've invented.
It is not mediation, it is implementing the OASIS spec for VFIO
support of IO BAR.
Jason
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-22 12:11 ` Jason Gunthorpe
@ 2023-09-25 2:34 ` Jason Wang
2023-09-25 12:26 ` Jason Gunthorpe
0 siblings, 1 reply; 181+ messages in thread
From: Jason Wang @ 2023-09-25 2:34 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Michael S. Tsirkin, Yishai Hadas, alex.williamson, kvm,
virtualization, parav, feliu, jiri, kevin.tian, joao.m.martins,
leonro, maorg
On Fri, Sep 22, 2023 at 8:11 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Fri, Sep 22, 2023 at 11:01:23AM +0800, Jason Wang wrote:
>
> > > Even when it does, there is no real use case to live migrate a
> > > virtio-net function from, say, AWS to GCP.
> >
> > It can happen inside a single cloud vendor. For some reasons, DPU must
> > be purchased from different vendors. And vDPA has been used in that
> > case.
>
> Nope, you misunderstand the DPU scenario.
>
> Look at something like vmware DPU enablement. vmware runs the software
> side of the DPU and all their supported DPU HW, from every vendor,
> generates the same PCI functions on the x86. They are the same because
> the same software on the DPU side is creating them.
>
> There is no reason to put a mediation layer in the x86 if you also
> control the DPU.
>
> Cloud vendors will similarly use DPUs to create a PCI functions that
> meet the cloud vendor's internal specification.
This can only work if:
1) the internal specification has finer grain than the virtio spec
2) so it can define what is not implemented in the virtio spec (like
migration and compatibility)
All of the above doesn't seem possible or realistic now, and it
actually risks being incompatible with the virtio spec. In the
future, when virtio has live migration support, they want to be able
to migrate between virtio and vDPA.
As I said, vDPA has been used for cross vendor live migration for a while.
> Regardless of DPU
> vendor.
>
> Fundamentally if you control the DPU SW and the hypervisor software
> you do not need hypervisor meditation because everything you could do
> in hypervisor mediation can just be done in the DPU. Putting it in the
> DPU is better in every regard.
>
> So, as I keep saying, in this scenario the goal is no mediation in the
> hypervisor.
That's pretty fine, but I don't see how trapping + relaying is not
mediation. Does it really matter what happens after trapping?
> It is pointless, everything you think you need to do there
> is actually already being done in the DPU.
Well, migration or even Qemu could be offloaded to the DPU as well. If
that's the direction, that's pretty fine.
Thanks
>
> Jason
>
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-25 2:34 ` Jason Wang
@ 2023-09-25 12:26 ` Jason Gunthorpe
2023-09-25 19:44 ` Michael S. Tsirkin
2023-09-26 4:37 ` Jason Wang
0 siblings, 2 replies; 181+ messages in thread
From: Jason Gunthorpe @ 2023-09-25 12:26 UTC (permalink / raw)
To: Jason Wang
Cc: Michael S. Tsirkin, Yishai Hadas, alex.williamson, kvm,
virtualization, parav, feliu, jiri, kevin.tian, joao.m.martins,
leonro, maorg
On Mon, Sep 25, 2023 at 10:34:54AM +0800, Jason Wang wrote:
> > Cloud vendors will similarly use DPUs to create a PCI functions that
> > meet the cloud vendor's internal specification.
>
> This can only work if:
>
> 1) the internal specification has finer garin than virtio spec
> 2) so it can define what is not implemented in the virtio spec (like
> migration and compatibility)
Yes, and that is what is happening. Realistically the "spec" is just a
piece of software that the cloud vendor owns which is simply ported to
multiple DPU vendors.
It is the same as VDPA. If VDPA can make multiple NIC vendors
consistent then why do you have a hard time believing we can do the
same thing just on the ARM side of a DPU?
> All of the above doesn't seem to be possible or realistic now, and it
> actually has a risk to be not compatible with virtio spec. In the
> future when virtio has live migration supported, they want to be able
> to migrate between virtio and vDPA.
Well, that is for the spec to design.
> > So, as I keep saying, in this scenario the goal is no mediation in the
> > hypervisor.
>
> That's pretty fine, but I don't think trapping + relying is not
> mediation. Does it really matter what happens after trapping?
It is not mediation in the sense that the kernel driver does not in
any way make decisions on the behavior of the device. It simply
transforms an IO operation into a device command and relays it to the
device. The device still fully controls its own behavior.
VDPA is very different from this. You might call them both mediation,
sure, but then you need another word to describe the additional
changes VDPA is doing.
> > It is pointless, everything you think you need to do there
> > is actually already being done in the DPU.
>
> Well, migration or even Qemu could be offloaded to DPU as well. If
> that's the direction that's pretty fine.
That's silly, of course qemu/kvm can't run in the DPU.
However, we can empty qemu and the hypervisor out so all it does is
run kvm and run vfio. In this model the DPU does all the OVS, storage,
"VPDA", etc. qemu is just a passive relay of the DPU PCI functions
into VM's vPCI functions.
So, everything VDPA was doing in the environment is migrated into the
DPU.
In this model the DPU is an extension of the hypervisor/qemu
environment and we shift code from x86 side to arm side to increase
security, save power and increase total system performance.
Jason
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-25 12:26 ` Jason Gunthorpe
@ 2023-09-25 19:44 ` Michael S. Tsirkin
2023-09-26 0:40 ` Jason Gunthorpe
2023-09-26 4:37 ` Jason Wang
1 sibling, 1 reply; 181+ messages in thread
From: Michael S. Tsirkin @ 2023-09-25 19:44 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Jason Wang, Yishai Hadas, alex.williamson, kvm, virtualization,
parav, feliu, jiri, kevin.tian, joao.m.martins, leonro, maorg
On Mon, Sep 25, 2023 at 09:26:07AM -0300, Jason Gunthorpe wrote:
> > > So, as I keep saying, in this scenario the goal is no mediation in the
> > > hypervisor.
> >
> > That's pretty fine, but I don't think trapping + relying is not
> > mediation. Does it really matter what happens after trapping?
>
> It is not mediation in the sense that the kernel driver does not in
> any way make decisions on the behavior of the device. It simply
> transforms an IO operation into a device command and relays it to the
> device. The device still fully controls its own behavior.
>
> VDPA is very different from this. You might call them both mediation,
> sure, but then you need another word to describe the additional
> changes VPDA is doing.
Sorry about hijacking the thread a little bit, but could you
call out some of the changes that are the most problematic
for you?
> > > It is pointless, everything you think you need to do there
> > > is actually already being done in the DPU.
> >
> > Well, migration or even Qemu could be offloaded to DPU as well. If
> > that's the direction that's pretty fine.
>
> That's silly, of course qemu/kvm can't run in the DPU.
>
> However, we can empty qemu and the hypervisor out so all it does is
> run kvm and run vfio. In this model the DPU does all the OVS, storage,
> "VPDA", etc. qemu is just a passive relay of the DPU PCI functions
> into VM's vPCI functions.
>
> So, everything VDPA was doing in the environment is migrated into the
> DPU.
>
> In this model the DPU is an extension of the hypervisor/qemu
> environment and we shift code from x86 side to arm side to increase
> security, save power and increase total system performance.
>
> Jason
I think I begin to understand. On the DPU you have some virtio
devices but also some non-virtio devices. So you have to
use VFIO to talk to the DPU. Reusing VFIO to talk to virtio
devices too, simplifies things for you. If guests will see
vendor-specific devices from the DPU anyway, it will be impossible
to migrate such guests away from the DPU so the cross-vendor
migration capability is less important in this use-case.
Is this a good summary?
--
MST
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-25 19:44 ` Michael S. Tsirkin
@ 2023-09-26 0:40 ` Jason Gunthorpe
2023-09-26 5:34 ` Michael S. Tsirkin
2023-09-26 5:42 ` Michael S. Tsirkin
0 siblings, 2 replies; 181+ messages in thread
From: Jason Gunthorpe @ 2023-09-26 0:40 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Jason Wang, Yishai Hadas, alex.williamson, kvm, virtualization,
parav, feliu, jiri, kevin.tian, joao.m.martins, leonro, maorg
On Mon, Sep 25, 2023 at 03:44:11PM -0400, Michael S. Tsirkin wrote:
> > VDPA is very different from this. You might call them both mediation,
> > sure, but then you need another word to describe the additional
> > changes VPDA is doing.
>
> Sorry about hijacking the thread a little bit, but could you
> call out some of the changes that are the most problematic
> for you?
I don't really know these details. The operators have an existing
virtio world that is ABI toward the VM for them, and they do not want
*anything* to change. The VM should be unaware of whether the virtio device is
created by old hypervisor software or new DPU software. It presents
exactly the same ABI.
So the challenge really is to convince them that VDPA delivers that, and
frankly, I don't think it does. ABI toward the VM is very important
here.
> > In this model the DPU is an extension of the hypervisor/qemu
> > environment and we shift code from x86 side to arm side to increase
> > security, save power and increase total system performance.
>
> I think I begin to understand. On the DPU you have some virtio
> devices but also some non-virtio devices. So you have to
> use VFIO to talk to the DPU. Reusing VFIO to talk to virtio
> devices too, simplifies things for you.
Yes
> If guests will see vendor-specific devices from the DPU anyway, it
> will be impossible to migrate such guests away from the DPU so the
> cross-vendor migration capability is less important in this
> use-case. Is this a good summary?
Well, sort of. As I said before, the vendor here is the cloud
operator, not the DPU supplier. The guest will see an AWS virtio-net
function, for example.
The operator will ensure that all their different implementations of
this function will interwork for migration.
So within the closed world of a single operator live migration will
work just fine.
Since the hypervisors controlled by the operator only migrate within
the operator's own environment anyhow, it is an already solved problem.
Jason
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-26 0:40 ` Jason Gunthorpe
@ 2023-09-26 5:34 ` Michael S. Tsirkin
2023-09-26 5:42 ` Michael S. Tsirkin
1 sibling, 0 replies; 181+ messages in thread
From: Michael S. Tsirkin @ 2023-09-26 5:34 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Jason Wang, Yishai Hadas, alex.williamson, kvm, virtualization,
parav, feliu, jiri, kevin.tian, joao.m.martins, leonro, maorg
On Mon, Sep 25, 2023 at 09:40:59PM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 25, 2023 at 03:44:11PM -0400, Michael S. Tsirkin wrote:
> > > VDPA is very different from this. You might call them both mediation,
> > > sure, but then you need another word to describe the additional
> > > changes VPDA is doing.
> >
> > Sorry about hijacking the thread a little bit, but could you
> > call out some of the changes that are the most problematic
> > for you?
>
> I don't really know these details. The operators have an existing
> virtio world that is ABI toward the VM for them, and they do not want
> *anything* to change. The VM should be unware if the virtio device is
> created by old hypervisor software or new DPU software. It presents
> exactly the same ABI.
>
> So the challenge really is to convince that VDPA delivers that, and
> frankly, I don't think it does. ABI toward the VM is very important
> here.
And to complete the picture, it is the DPU software/firmware that
is responsible for maintaining this ABI in your ideal world?
> > > In this model the DPU is an extension of the hypervisor/qemu
> > > environment and we shift code from x86 side to arm side to increase
> > > security, save power and increase total system performance.
> >
> > I think I begin to understand. On the DPU you have some virtio
> > devices but also some non-virtio devices. So you have to
> > use VFIO to talk to the DPU. Reusing VFIO to talk to virtio
> > devices too, simplifies things for you.
>
> Yes
>
> > If guests will see vendor-specific devices from the DPU anyway, it
> > will be impossible to migrate such guests away from the DPU so the
> > cross-vendor migration capability is less important in this
> > use-case. Is this a good summary?
>
> Well, sort of. As I said before, the vendor here is the cloud
> operator, not the DPU supplier. The guest will see an AWS virtio-net
> function, for example.
>
> The operator will ensure that all their different implementations of
> this function will interwork for migration.
>
> So within the closed world of a single operator live migration will
> work just fine.
>
> Since the hypervisors controlled by the operator only migrate within
> the operator's own environment anyhow, it is an already solved problem.
>
> Jason
Okay the picture emerges I think. Thanks! I'll try to summarize later
for everyone's benefit.
--
MST
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-26 0:40 ` Jason Gunthorpe
2023-09-26 5:34 ` Michael S. Tsirkin
@ 2023-09-26 5:42 ` Michael S. Tsirkin
2023-09-26 13:50 ` Jason Gunthorpe
1 sibling, 1 reply; 181+ messages in thread
From: Michael S. Tsirkin @ 2023-09-26 5:42 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Jason Wang, Yishai Hadas, alex.williamson, kvm, virtualization,
parav, feliu, jiri, kevin.tian, joao.m.martins, leonro, maorg
On Mon, Sep 25, 2023 at 09:40:59PM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 25, 2023 at 03:44:11PM -0400, Michael S. Tsirkin wrote:
> > > VDPA is very different from this. You might call them both mediation,
> > > sure, but then you need another word to describe the additional
> > > changes VDPA is doing.
> >
> > Sorry about hijacking the thread a little bit, but could you
> > call out some of the changes that are the most problematic
> > for you?
>
> I don't really know these details.
Maybe, you then should desist from saying things like "It entirely fails
to achieve the most important thing it needs to do!" You are not making
any new friends with saying this about a piece of software without
knowing the details.
--
MST
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-26 5:42 ` Michael S. Tsirkin
@ 2023-09-26 13:50 ` Jason Gunthorpe
2023-09-27 21:38 ` Michael S. Tsirkin
0 siblings, 1 reply; 181+ messages in thread
From: Jason Gunthorpe @ 2023-09-26 13:50 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Jason Wang, Yishai Hadas, alex.williamson, kvm, virtualization,
parav, feliu, jiri, kevin.tian, joao.m.martins, leonro, maorg
On Tue, Sep 26, 2023 at 01:42:52AM -0400, Michael S. Tsirkin wrote:
> On Mon, Sep 25, 2023 at 09:40:59PM -0300, Jason Gunthorpe wrote:
> > On Mon, Sep 25, 2023 at 03:44:11PM -0400, Michael S. Tsirkin wrote:
> > > > VDPA is very different from this. You might call them both mediation,
> > > > sure, but then you need another word to describe the additional
> > > > changes VDPA is doing.
> > >
> > > Sorry about hijacking the thread a little bit, but could you
> > > call out some of the changes that are the most problematic
> > > for you?
> >
> > I don't really know these details.
>
> Maybe, you then should desist from saying things like "It entirely fails
> to achieve the most important thing it needs to do!" You are not making
> any new friends with saying this about a piece of software without
> knowing the details.
I can't tell you what cloud operators are doing, but I can say with
confidence that it is not the same as VDPA. As I said, if you want to
know more details you need to ask a cloud operator.
Jason
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-26 13:50 ` Jason Gunthorpe
@ 2023-09-27 21:38 ` Michael S. Tsirkin
2023-09-27 23:20 ` Jason Gunthorpe
0 siblings, 1 reply; 181+ messages in thread
From: Michael S. Tsirkin @ 2023-09-27 21:38 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Jason Wang, Yishai Hadas, alex.williamson, kvm, virtualization,
parav, feliu, jiri, kevin.tian, joao.m.martins, leonro, maorg
On Tue, Sep 26, 2023 at 10:50:57AM -0300, Jason Gunthorpe wrote:
> On Tue, Sep 26, 2023 at 01:42:52AM -0400, Michael S. Tsirkin wrote:
> > On Mon, Sep 25, 2023 at 09:40:59PM -0300, Jason Gunthorpe wrote:
> > > On Mon, Sep 25, 2023 at 03:44:11PM -0400, Michael S. Tsirkin wrote:
> > > > > VDPA is very different from this. You might call them both mediation,
> > > > > sure, but then you need another word to describe the additional
> > > > > changes VDPA is doing.
> > > >
> > > > Sorry about hijacking the thread a little bit, but could you
> > > > call out some of the changes that are the most problematic
> > > > for you?
> > >
> > > I don't really know these details.
> >
> > Maybe, you then should desist from saying things like "It entirely fails
> > to achieve the most important thing it needs to do!" You are not making
> > any new friends with saying this about a piece of software without
> > knowing the details.
>
> I can't tell you what cloud operators are doing, but I can say with
> confidence that it is not the same as VDPA. As I said, if you want to
> know more details you need to ask a cloud operator.
>
> Jason
So it's not the changes that are problematic, it's that you have
customers who are not using vdpa. The "most important thing" that vdpa
fails at is simply converting your customers from vfio to vdpa.
--
MST
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-27 21:38 ` Michael S. Tsirkin
@ 2023-09-27 23:20 ` Jason Gunthorpe
2023-09-28 5:31 ` Michael S. Tsirkin
0 siblings, 1 reply; 181+ messages in thread
From: Jason Gunthorpe @ 2023-09-27 23:20 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Jason Wang, Yishai Hadas, alex.williamson, kvm, virtualization,
parav, feliu, jiri, kevin.tian, joao.m.martins, leonro, maorg
On Wed, Sep 27, 2023 at 05:38:55PM -0400, Michael S. Tsirkin wrote:
> On Tue, Sep 26, 2023 at 10:50:57AM -0300, Jason Gunthorpe wrote:
> > On Tue, Sep 26, 2023 at 01:42:52AM -0400, Michael S. Tsirkin wrote:
> > > On Mon, Sep 25, 2023 at 09:40:59PM -0300, Jason Gunthorpe wrote:
> > > > On Mon, Sep 25, 2023 at 03:44:11PM -0400, Michael S. Tsirkin wrote:
> > > > > > VDPA is very different from this. You might call them both mediation,
> > > > > > sure, but then you need another word to describe the additional
> > > > > > changes VDPA is doing.
> > > > >
> > > > > Sorry about hijacking the thread a little bit, but could you
> > > > > call out some of the changes that are the most problematic
> > > > > for you?
> > > >
> > > > I don't really know these details.
> > >
> > > Maybe, you then should desist from saying things like "It entirely fails
> > > to achieve the most important thing it needs to do!" You are not making
> > > any new friends with saying this about a piece of software without
> > > knowing the details.
> >
> > I can't tell you what cloud operators are doing, but I can say with
> > confidence that it is not the same as VDPA. As I said, if you want to
> > know more details you need to ask a cloud operator.
>
> So it's not the changes that are problematic, it's that you have
> customers who are not using vdpa. The "most important thing" that vdpa
> fails at is simply converting your customers from vfio to vdpa.
I said the most important thing was that VFIO presents exactly the
same virtio device to the VM as the baremetal. Do you dispute that,
technically, VDPA does not actually achieve that?
Then why is it so surprising that people don't want a solution that
changes the vPCI ABI they worked hard to create in the first place?
I'm still baffled why you think everyone should use vdpa..
Jason
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-27 23:20 ` Jason Gunthorpe
@ 2023-09-28 5:31 ` Michael S. Tsirkin
0 siblings, 0 replies; 181+ messages in thread
From: Michael S. Tsirkin @ 2023-09-28 5:31 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Jason Wang, Yishai Hadas, alex.williamson, kvm, virtualization,
parav, feliu, jiri, kevin.tian, joao.m.martins, leonro, maorg
On Wed, Sep 27, 2023 at 08:20:05PM -0300, Jason Gunthorpe wrote:
> On Wed, Sep 27, 2023 at 05:38:55PM -0400, Michael S. Tsirkin wrote:
> > On Tue, Sep 26, 2023 at 10:50:57AM -0300, Jason Gunthorpe wrote:
> > > On Tue, Sep 26, 2023 at 01:42:52AM -0400, Michael S. Tsirkin wrote:
> > > > On Mon, Sep 25, 2023 at 09:40:59PM -0300, Jason Gunthorpe wrote:
> > > > > On Mon, Sep 25, 2023 at 03:44:11PM -0400, Michael S. Tsirkin wrote:
> > > > > > > VDPA is very different from this. You might call them both mediation,
> > > > > > > sure, but then you need another word to describe the additional
> > > > > > > changes VDPA is doing.
> > > > > >
> > > > > > Sorry about hijacking the thread a little bit, but could you
> > > > > > call out some of the changes that are the most problematic
> > > > > > for you?
> > > > >
> > > > > I don't really know these details.
> > > >
> > > > Maybe, you then should desist from saying things like "It entirely fails
> > > > to achieve the most important thing it needs to do!" You are not making
> > > > any new friends with saying this about a piece of software without
> > > > knowing the details.
> > >
> > > I can't tell you what cloud operators are doing, but I can say with
> > > confidence that it is not the same as VDPA. As I said, if you want to
> > > know more details you need to ask a cloud operator.
> >
> > So it's not the changes that are problematic, it's that you have
> > customers who are not using vdpa. The "most important thing" that vdpa
> > fails at is simply converting your customers from vfio to vdpa.
>
> I said the most important thing was that VFIO presents exactly the
> same virtio device to the VM as the baremetal. Do you dispute that,
> technically, VDPA does not actually achieve that?
I dispute that it is the most important. The important thing is to have
guests work.
> Then why is it so surprising that people don't want a solution that
> changes the vPCI ABI they worked hard to create in the first place?
>
> I'm still baffled why you think everyone should use vdpa..
>
> Jason
They shouldn't. If you want proprietary extensions then vfio is the way
to go, I don't think vdpa will support that.
--
MST
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-25 12:26 ` Jason Gunthorpe
2023-09-25 19:44 ` Michael S. Tsirkin
@ 2023-09-26 4:37 ` Jason Wang
2023-09-26 5:33 ` Parav Pandit
1 sibling, 1 reply; 181+ messages in thread
From: Jason Wang @ 2023-09-26 4:37 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Michael S. Tsirkin, Yishai Hadas, alex.williamson, kvm,
virtualization, parav, feliu, jiri, kevin.tian, joao.m.martins,
leonro, maorg
On Mon, Sep 25, 2023 at 8:26 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Mon, Sep 25, 2023 at 10:34:54AM +0800, Jason Wang wrote:
>
> > > Cloud vendors will similarly use DPUs to create a PCI functions that
> > > meet the cloud vendor's internal specification.
> >
> > This can only work if:
> >
> > 1) the internal specification has finer grain than the virtio spec
> > 2) so it can define what is not implemented in the virtio spec (like
> > migration and compatibility)
>
> Yes, and that is what is happening. Realistically the "spec" is just a
> piece of software that the Cloud vendor owns which is simply ported to
> multiple DPU vendors.
>
> It is the same as VDPA. If VDPA can make multiple NIC vendors
> consistent then why do you have a hard time believing we can do the
> same thing just on the ARM side of a DPU?
I don't. We all know vDPA can do more than virtio.
>
> > All of the above doesn't seem to be possible or realistic now, and it
> > actually risks being incompatible with the virtio spec. In the
> > future, when virtio supports live migration, they want to be able
> > to migrate between virtio and vDPA.
>
> Well, that is for the spec to design.
Right, so if we consider migration from virtio to vDPA, it needs to
be designed in a way that allows more involvement from the hypervisor
rather than coupling it with a specific interface (like admin
virtqueues).
>
> > > So, as I keep saying, in this scenario the goal is no mediation in the
> > > hypervisor.
> >
> > That's pretty fine, but I don't think trapping + relaying is not
> > mediation. Does it really matter what happens after trapping?
>
> It is not mediation in the sense that the kernel driver does not in
> any way make decisions on the behavior of the device. It simply
> transforms an IO operation into a device command and relays it to the
> device. The device still fully controls its own behavior.
>
> VDPA is very different from this. You might call them both mediation,
> sure, but then you need another word to describe the additional
> changes VDPA is doing.
>
> > > It is pointless, everything you think you need to do there
> > > is actually already being done in the DPU.
> >
> > Well, migration or even Qemu could be offloaded to DPU as well. If
> > that's the direction that's pretty fine.
>
> That's silly, of course qemu/kvm can't run in the DPU.
KVM can't for sure but part of Qemu could. This model has been used.
>
> However, we can empty qemu and the hypervisor out so all it does is
> run kvm and run vfio. In this model the DPU does all the OVS, storage,
> "VPDA", etc. qemu is just a passive relay of the DPU PCI functions
> into VM's vPCI functions.
>
> So, everything VDPA was doing in the environment is migrated into the
> DPU.
It really depends on the use cases. For example, in the case of DPU
what if we want to provide multiple virtio devices through a single
VF?
>
> In this model the DPU is an extension of the hypervisor/qemu
> environment and we shift code from x86 side to arm side to increase
> security, save power and increase total system performance.
That's pretty fine.
Thanks
>
> Jason
>
^ permalink raw reply [flat|nested] 181+ messages in thread
* RE: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-26 4:37 ` Jason Wang
@ 2023-09-26 5:33 ` Parav Pandit
0 siblings, 0 replies; 181+ messages in thread
From: Parav Pandit @ 2023-09-26 5:33 UTC (permalink / raw)
To: Jason Wang, Jason Gunthorpe
Cc: Michael S. Tsirkin, Yishai Hadas, alex.williamson@redhat.com,
kvm@vger.kernel.org, virtualization@lists.linux-foundation.org,
Feng Liu, Jiri Pirko, kevin.tian@intel.com,
joao.m.martins@oracle.com, Leon Romanovsky, Maor Gottlieb
> From: Jason Wang <jasowang@redhat.com>
> Sent: Tuesday, September 26, 2023 10:08 AM
> Right, so if we consider migration from virtio to vDPA, it needs to be designed
> in a way that allows more involvement from the hypervisor rather than coupling it
> with a specific interface (like admin virtqueues).
It is not attached to the admin virtqueues.
One way to use it is via admin commands, as proposed at [1].
One can also define it without admin commands by explaining the technical difficulties where admin commands may not or cannot work.
[1] https://lists.oasis-open.org/archives/virtio-comment/202309/msg00061.html
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-21 18:39 ` Jason Gunthorpe
2023-09-21 19:13 ` Michael S. Tsirkin
@ 2023-09-21 19:17 ` Michael S. Tsirkin
2023-09-21 19:51 ` Jason Gunthorpe
2023-09-22 3:45 ` Zhu, Lingshan
2 siblings, 1 reply; 181+ messages in thread
From: Michael S. Tsirkin @ 2023-09-21 19:17 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Yishai Hadas, alex.williamson, jasowang, kvm, virtualization,
parav, feliu, jiri, kevin.tian, joao.m.martins, leonro, maorg
On Thu, Sep 21, 2023 at 03:39:26PM -0300, Jason Gunthorpe wrote:
> > What is the huge amount of work am I asking to do?
>
> You are asking us to invest in the complexity of VDPA through out
> (keep it working, keep it secure, invest time in deploying and
> debugging in the field)
I'm asking you to do nothing of the kind - I am saying that this code
will have to be duplicated in vdpa, and so I am asking what exactly is
missing to just keep it all there. So far you said iommufd and
note I didn't ask you to add iommufd to vdpa though that would be nice ;)
I just said I'll look into it in the next several days.
--
MST
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-21 19:17 ` Michael S. Tsirkin
@ 2023-09-21 19:51 ` Jason Gunthorpe
2023-09-21 20:55 ` Michael S. Tsirkin
0 siblings, 1 reply; 181+ messages in thread
From: Jason Gunthorpe @ 2023-09-21 19:51 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Yishai Hadas, alex.williamson, jasowang, kvm, virtualization,
parav, feliu, jiri, kevin.tian, joao.m.martins, leonro, maorg
On Thu, Sep 21, 2023 at 03:17:25PM -0400, Michael S. Tsirkin wrote:
> On Thu, Sep 21, 2023 at 03:39:26PM -0300, Jason Gunthorpe wrote:
> > > What is the huge amount of work am I asking to do?
> >
> > You are asking us to invest in the complexity of VDPA through out
> > (keep it working, keep it secure, invest time in deploying and
> > debugging in the field)
>
> I'm asking you to do nothing of the kind - I am saying that this code
> will have to be duplicated in vdpa,
Why would that be needed?
> and so I am asking what exactly is missing to just keep it all
> there.
VFIO. Seriously, we don't want unnecessary mediation in this path at
all.
> note I didn't ask you to add iommufd to vdpa though that would be
> nice ;)
I did once send someone to look.. It didn't succeed :(
Jason
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-21 19:51 ` Jason Gunthorpe
@ 2023-09-21 20:55 ` Michael S. Tsirkin
2023-09-21 23:08 ` Jason Gunthorpe
2023-09-25 4:44 ` Zhu, Lingshan
0 siblings, 2 replies; 181+ messages in thread
From: Michael S. Tsirkin @ 2023-09-21 20:55 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Yishai Hadas, alex.williamson, jasowang, kvm, virtualization,
parav, feliu, jiri, kevin.tian, joao.m.martins, leonro, maorg
On Thu, Sep 21, 2023 at 04:51:15PM -0300, Jason Gunthorpe wrote:
> On Thu, Sep 21, 2023 at 03:17:25PM -0400, Michael S. Tsirkin wrote:
> > On Thu, Sep 21, 2023 at 03:39:26PM -0300, Jason Gunthorpe wrote:
> > > > What is the huge amount of work am I asking to do?
> > >
> > > You are asking us to invest in the complexity of VDPA through out
> > > (keep it working, keep it secure, invest time in deploying and
> > > debugging in the field)
> >
> > I'm asking you to do nothing of the kind - I am saying that this code
> > will have to be duplicated in vdpa,
>
> Why would that be needed?
For the same reason it was developed in the 1st place - presumably
because it adds efficient legacy guest support with the right card?
I get it, you specifically don't need VDPA functionality, but I don't
see why this is universal, or common.
> > and so I am asking what exactly is missing to just keep it all
> > there.
>
> VFIO. Seriously, we don't want unnecessary mediation in this path at
> all.
But which mediation is necessary is exactly up to the specific use-case.
I have no idea why you would want all of VFIO to e.g. pass access to
random config registers to the guest when it's a virtio device and the
config registers are all nicely listed in the spec. I know nvidia
hardware is so great, it has super robust cards with less security holes
than the vdpa driver, but I very much doubt this is universal for all
virtio offload cards.
> > note I didn't ask you to add iommufd to vdpa though that would be
> > nice ;)
>
> I did once send someone to look.. It didn't succeed :(
>
> Jason
Pity. Maybe there's some big difficulty blocking this? I'd like to know.
--
MST
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-21 20:55 ` Michael S. Tsirkin
@ 2023-09-21 23:08 ` Jason Gunthorpe
2023-09-25 4:44 ` Zhu, Lingshan
1 sibling, 0 replies; 181+ messages in thread
From: Jason Gunthorpe @ 2023-09-21 23:08 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Yishai Hadas, alex.williamson, jasowang, kvm, virtualization,
parav, feliu, jiri, kevin.tian, joao.m.martins, leonro, maorg
On Thu, Sep 21, 2023 at 04:55:05PM -0400, Michael S. Tsirkin wrote:
> But which mediation is necessary is exactly up to the specific use-case.
> I have no idea why would you want all of VFIO to e.g. pass access to
> random config registers to the guest when it's a virtio device and the
> config registers are all nicely listed in the spec. I know nvidia
> hardware is so great, it has super robust cards with less security holes
> than the vdpa driver, but I very much doubt this is universal for all
> virtio offload cards.
The great thing about choice is that people can choose the
configuration that best meets their situation and needs.
Jason
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-21 20:55 ` Michael S. Tsirkin
2023-09-21 23:08 ` Jason Gunthorpe
@ 2023-09-25 4:44 ` Zhu, Lingshan
1 sibling, 0 replies; 181+ messages in thread
From: Zhu, Lingshan @ 2023-09-25 4:44 UTC (permalink / raw)
To: Michael S. Tsirkin, Jason Gunthorpe
Cc: kvm, maorg, virtualization, jiri, leonro
On 9/22/2023 4:55 AM, Michael S. Tsirkin wrote:
> On Thu, Sep 21, 2023 at 04:51:15PM -0300, Jason Gunthorpe wrote:
>> On Thu, Sep 21, 2023 at 03:17:25PM -0400, Michael S. Tsirkin wrote:
>>> On Thu, Sep 21, 2023 at 03:39:26PM -0300, Jason Gunthorpe wrote:
>>>>> What is the huge amount of work am I asking to do?
>>>> You are asking us to invest in the complexity of VDPA through out
>>>> (keep it working, keep it secure, invest time in deploying and
>>>> debugging in the field)
>>> I'm asking you to do nothing of the kind - I am saying that this code
>>> will have to be duplicated in vdpa,
>> Why would that be needed?
> For the same reason it was developed in the 1st place - presumably
> because it adds efficient legacy guest support with the right card?
> I get it, you specifically don't need VDPA functionality, but I don't
> see why this is universal, or common.
>
>
>>> and so I am asking what exactly is missing to just keep it all
>>> there.
>> VFIO. Seriously, we don't want unnecessary mediation in this path at
>> all.
> But which mediation is necessary is exactly up to the specific use-case.
> I have no idea why you would want all of VFIO to e.g. pass access to
> random config registers to the guest when it's a virtio device and the
> config registers are all nicely listed in the spec. I know nvidia
> hardware is so great, it has super robust cards with less security holes
> than the vdpa driver, but I very much doubt this is universal for all
> virtio offload cards.
I agree with MST.
>>> note I didn't ask you to add iommufd to vdpa though that would be
>>> nice ;)
>> I did once send someone to look.. It didn't succeed :(
>>
>> Jason
> Pity. Maybe there's some big difficulty blocking this? I'd like to know.
>
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-21 18:39 ` Jason Gunthorpe
2023-09-21 19:13 ` Michael S. Tsirkin
2023-09-21 19:17 ` Michael S. Tsirkin
@ 2023-09-22 3:45 ` Zhu, Lingshan
2 siblings, 0 replies; 181+ messages in thread
From: Zhu, Lingshan @ 2023-09-22 3:45 UTC (permalink / raw)
To: Jason Gunthorpe, Michael S. Tsirkin
Cc: Yishai Hadas, alex.williamson, jasowang, kvm, virtualization,
parav, feliu, jiri, kevin.tian, joao.m.martins, leonro, maorg
On 9/22/2023 2:39 AM, Jason Gunthorpe wrote:
> On Thu, Sep 21, 2023 at 12:53:04PM -0400, Michael S. Tsirkin wrote:
>>> vdpa is not vfio, I don't know how you can suggest vdpa is a
>>> replacement for a vfio driver. They are completely different
>>> things.
>>> Each side has its own strengths, and vfio especially is accelerating
>>> in its capability in way that vpda is not. eg if an iommufd conversion
>>> had been done by now for vdpa I might be more sympathetic.
>> Yea, I agree iommufd is a big problem with vdpa right now. Cindy was
>> sick and I didn't know and kept assuming she's working on this. I don't
>> think it's a huge amount of work though. I'll take a look.
>> Is there anything else though? Do tell.
> Confidential compute will never work with VDPA's approach.
I don't understand why vDPA cannot and will never support Confidential
computing.
Do you see any blockers?
>
>> There are a bunch of things that I think are important for virtio
>> that are completely out of scope for vfio, such as migrating
>> cross-vendor.
> VFIO supports migration, if you want to have cross-vendor migration
> then make a standard that describes the VFIO migration data format for
> virtio devices.
>
>> What is the huge amount of work am I asking to do?
> You are asking us to invest in the complexity of VDPA through out
> (keep it working, keep it secure, invest time in deploying and
> debugging in the field)
>
> When it doesn't provide *ANY* value to the solution.
>
> The starting point is a completely working vfio PCI function and the
> end goal is to put that function into a VM. That is VFIO, not VDPA.
>
> VPDA is fine for what it does, but it is not a reasonable replacement
> for VFIO.
>
> Jason
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-21 12:40 ` [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices Yishai Hadas
2023-09-21 13:16 ` Michael S. Tsirkin
@ 2023-09-21 13:33 ` Michael S. Tsirkin
2023-09-21 16:43 ` Alex Williamson
` (3 subsequent siblings)
5 siblings, 0 replies; 181+ messages in thread
From: Michael S. Tsirkin @ 2023-09-21 13:33 UTC (permalink / raw)
To: Yishai Hadas
Cc: alex.williamson, jasowang, jgg, kvm, virtualization, parav, feliu,
jiri, kevin.tian, joao.m.martins, leonro, maorg
On Thu, Sep 21, 2023 at 03:40:40PM +0300, Yishai Hadas wrote:
> +#define VIRTIO_LEGACY_IO_BAR_HEADER_LEN 20
> +#define VIRTIO_LEGACY_IO_BAR_MSIX_HEADER_LEN 4
This exactly duplicates part of VIRTIO_PCI_CONFIG_OFF.
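For reference, a minimal sketch of that overlap, assuming the legacy
definitions in include/uapi/linux/virtio_pci.h are the ones meant here:

/* Legacy virtio PCI layout: 20 bytes of common registers, then 4 more
 * bytes of MSI-X vectors when MSI-X is enabled, then the device
 * specific config.  The uapi header already encodes exactly that:
 */
#define VIRTIO_PCI_CONFIG_OFF(msix_enabled)	((msix_enabled) ? 24 : 20)

The two new defines re-encode the same 20 + 4 split that is later used
to size bar0_virtual_buf in main.c.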
--
MST
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-21 12:40 ` [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices Yishai Hadas
2023-09-21 13:16 ` Michael S. Tsirkin
2023-09-21 13:33 ` Michael S. Tsirkin
@ 2023-09-21 16:43 ` Alex Williamson
2023-09-21 16:52 ` Jason Gunthorpe
2023-09-21 19:58 ` Alex Williamson
` (2 subsequent siblings)
5 siblings, 1 reply; 181+ messages in thread
From: Alex Williamson @ 2023-09-21 16:43 UTC (permalink / raw)
To: Yishai Hadas
Cc: mst, jasowang, jgg, kvm, virtualization, parav, feliu, jiri,
kevin.tian, joao.m.martins, leonro, maorg
On Thu, 21 Sep 2023 15:40:40 +0300
Yishai Hadas <yishaih@nvidia.com> wrote:
> Introduce a vfio driver over virtio devices to support the legacy
> interface functionality for VFs.
>
> Background, from the virtio spec [1].
> --------------------------------------------------------------------
> In some systems, there is a need to support a virtio legacy driver with
> a device that does not directly support the legacy interface. In such
> scenarios, a group owner device can provide the legacy interface
> functionality for the group member devices. The driver of the owner
> device can then access the legacy interface of a member device on behalf
> of the legacy member device driver.
>
> For example, with the SR-IOV group type, group members (VFs) can not
> present the legacy interface in an I/O BAR in BAR0 as expected by the
> legacy pci driver. If the legacy driver is running inside a virtual
> machine, the hypervisor executing the virtual machine can present a
> virtual device with an I/O BAR in BAR0. The hypervisor intercepts the
> legacy driver accesses to this I/O BAR and forwards them to the group
> owner device (PF) using group administration commands.
> --------------------------------------------------------------------
>
> Specifically, this driver adds support for a virtio-net VF to be exposed
> as a transitional device to a guest driver and allows the legacy IO BAR
> functionality on top.
>
> This allows a VM which uses a legacy virtio-net driver in the guest to
> work transparently over a VF which its driver in the host is that new
> driver.
>
> The driver can be extended easily to support some other types of virtio
> devices (e.g virtio-blk), by adding in a few places the specific type
> properties as was done for virtio-net.
>
> For now, only the virtio-net use case was tested and as such we introduce
> the support only for such a device.
>
> Practically,
> Upon probing a VF for a virtio-net device, in case its PF supports
> legacy access over the virtio admin commands and the VF doesn't have BAR
> 0, we set some specific 'vfio_device_ops' to be able to simulate in SW a
> transitional device with I/O BAR in BAR 0.
>
> The existence of the simulated I/O bar is reported later on by
> overwriting the VFIO_DEVICE_GET_REGION_INFO command and the device
> exposes itself as a transitional device by overwriting some properties
> upon reading its config space.
>
> Once we report the existence of I/O BAR as BAR 0 a legacy driver in the
> guest may use it via read/write calls according to the virtio
> specification.
>
> Any read/write towards the control parts of the BAR will be captured by
> the new driver and will be translated into admin commands towards the
> device.
>
> Any data path read/write access (i.e. virtio driver notifications) will
> be forwarded to the physical BAR whose properties were supplied by
> the admin command VIRTIO_ADMIN_CMD_LEGACY_NOTIFY_INFO upon the probing/init flow.
>
> With that code in place a legacy driver in the guest has the look and
> feel as if having a transitional device with legacy support for both its
> control and data path flows.
Why do we need to enable a "legacy" driver in the guest? The very name
suggests there's an alternative driver that perhaps doesn't require
this I/O BAR. Why don't we just require the non-legacy driver in the
guest rather than increase our maintenance burden? Thanks,
Alex
>
> [1]
> https://github.com/oasis-tcs/virtio-spec/commit/03c2d32e5093ca9f2a17797242fbef88efe94b8c
>
> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> ---
> MAINTAINERS | 6 +
> drivers/vfio/pci/Kconfig | 2 +
> drivers/vfio/pci/Makefile | 2 +
> drivers/vfio/pci/virtio/Kconfig | 15 +
> drivers/vfio/pci/virtio/Makefile | 4 +
> drivers/vfio/pci/virtio/cmd.c | 4 +-
> drivers/vfio/pci/virtio/cmd.h | 8 +
> drivers/vfio/pci/virtio/main.c | 546 +++++++++++++++++++++++++++++++
> 8 files changed, 585 insertions(+), 2 deletions(-)
> create mode 100644 drivers/vfio/pci/virtio/Kconfig
> create mode 100644 drivers/vfio/pci/virtio/Makefile
> create mode 100644 drivers/vfio/pci/virtio/main.c
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index bf0f54c24f81..5098418c8389 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -22624,6 +22624,12 @@ L: kvm@vger.kernel.org
> S: Maintained
> F: drivers/vfio/pci/mlx5/
>
> +VFIO VIRTIO PCI DRIVER
> +M: Yishai Hadas <yishaih@nvidia.com>
> +L: kvm@vger.kernel.org
> +S: Maintained
> +F: drivers/vfio/pci/virtio
> +
> VFIO PCI DEVICE SPECIFIC DRIVERS
> R: Jason Gunthorpe <jgg@nvidia.com>
> R: Yishai Hadas <yishaih@nvidia.com>
> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
> index 8125e5f37832..18c397df566d 100644
> --- a/drivers/vfio/pci/Kconfig
> +++ b/drivers/vfio/pci/Kconfig
> @@ -65,4 +65,6 @@ source "drivers/vfio/pci/hisilicon/Kconfig"
>
> source "drivers/vfio/pci/pds/Kconfig"
>
> +source "drivers/vfio/pci/virtio/Kconfig"
> +
> endmenu
> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> index 45167be462d8..046139a4eca5 100644
> --- a/drivers/vfio/pci/Makefile
> +++ b/drivers/vfio/pci/Makefile
> @@ -13,3 +13,5 @@ obj-$(CONFIG_MLX5_VFIO_PCI) += mlx5/
> obj-$(CONFIG_HISI_ACC_VFIO_PCI) += hisilicon/
>
> obj-$(CONFIG_PDS_VFIO_PCI) += pds/
> +
> +obj-$(CONFIG_VIRTIO_VFIO_PCI) += virtio/
> diff --git a/drivers/vfio/pci/virtio/Kconfig b/drivers/vfio/pci/virtio/Kconfig
> new file mode 100644
> index 000000000000..89eddce8b1bd
> --- /dev/null
> +++ b/drivers/vfio/pci/virtio/Kconfig
> @@ -0,0 +1,15 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +config VIRTIO_VFIO_PCI
> + tristate "VFIO support for VIRTIO PCI devices"
> + depends on VIRTIO_PCI
> + select VFIO_PCI_CORE
> + help
> + This provides support for exposing VIRTIO VF devices using the VFIO
> + framework that can work with a legacy virtio driver in the guest.
> + Based on PCIe spec, VFs do not support I/O Space; thus, VF BARs shall
> + not indicate I/O Space.
> + As of that this driver emulated I/O BAR in software to let a VF be
> + seen as a transitional device in the guest and let it work with
> + a legacy driver.
> +
> + If you don't know what to do here, say N.
> diff --git a/drivers/vfio/pci/virtio/Makefile b/drivers/vfio/pci/virtio/Makefile
> new file mode 100644
> index 000000000000..584372648a03
> --- /dev/null
> +++ b/drivers/vfio/pci/virtio/Makefile
> @@ -0,0 +1,4 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +obj-$(CONFIG_VIRTIO_VFIO_PCI) += virtio-vfio-pci.o
> +virtio-vfio-pci-y := main.o cmd.o
> +
> diff --git a/drivers/vfio/pci/virtio/cmd.c b/drivers/vfio/pci/virtio/cmd.c
> index f068239cdbb0..aea9d25fbf1d 100644
> --- a/drivers/vfio/pci/virtio/cmd.c
> +++ b/drivers/vfio/pci/virtio/cmd.c
> @@ -44,7 +44,7 @@ int virtiovf_cmd_lr_write(struct virtiovf_pci_core_device *virtvdev, u16 opcode,
> {
> struct virtio_device *virtio_dev =
> virtio_pci_vf_get_pf_dev(virtvdev->core_device.pdev);
> - struct virtio_admin_cmd_data_lr_write *in;
> + struct virtio_admin_cmd_legacy_wr_data *in;
> struct scatterlist in_sg;
> struct virtio_admin_cmd cmd = {};
> int ret;
> @@ -74,7 +74,7 @@ int virtiovf_cmd_lr_read(struct virtiovf_pci_core_device *virtvdev, u16 opcode,
> {
> struct virtio_device *virtio_dev =
> virtio_pci_vf_get_pf_dev(virtvdev->core_device.pdev);
> - struct virtio_admin_cmd_data_lr_read *in;
> + struct virtio_admin_cmd_legacy_rd_data *in;
> struct scatterlist in_sg, out_sg;
> struct virtio_admin_cmd cmd = {};
> int ret;
> diff --git a/drivers/vfio/pci/virtio/cmd.h b/drivers/vfio/pci/virtio/cmd.h
> index c2a3645f4b90..347b1dc85570 100644
> --- a/drivers/vfio/pci/virtio/cmd.h
> +++ b/drivers/vfio/pci/virtio/cmd.h
> @@ -13,7 +13,15 @@
>
> struct virtiovf_pci_core_device {
> struct vfio_pci_core_device core_device;
> + u8 bar0_virtual_buf_size;
> + u8 *bar0_virtual_buf;
> + /* synchronize access to the virtual buf */
> + struct mutex bar_mutex;
> int vf_id;
> + void __iomem *notify_addr;
> + u32 notify_offset;
> + u8 notify_bar;
> + u8 pci_cmd_io :1;
> };
>
> int virtiovf_cmd_list_query(struct pci_dev *pdev, u8 *buf, int buf_size);
> diff --git a/drivers/vfio/pci/virtio/main.c b/drivers/vfio/pci/virtio/main.c
> new file mode 100644
> index 000000000000..2486991c49f3
> --- /dev/null
> +++ b/drivers/vfio/pci/virtio/main.c
> @@ -0,0 +1,546 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved
> + */
> +
> +#include <linux/device.h>
> +#include <linux/module.h>
> +#include <linux/mutex.h>
> +#include <linux/pci.h>
> +#include <linux/pm_runtime.h>
> +#include <linux/types.h>
> +#include <linux/uaccess.h>
> +#include <linux/vfio.h>
> +#include <linux/vfio_pci_core.h>
> +#include <linux/virtio_pci.h>
> +#include <linux/virtio_net.h>
> +#include <linux/virtio_pci_modern.h>
> +
> +#include "cmd.h"
> +
> +#define VIRTIO_LEGACY_IO_BAR_HEADER_LEN 20
> +#define VIRTIO_LEGACY_IO_BAR_MSIX_HEADER_LEN 4
> +
> +static int virtiovf_issue_lr_cmd(struct virtiovf_pci_core_device *virtvdev,
> + loff_t pos, char __user *buf,
> + size_t count, bool read)
> +{
> + u8 *bar0_buf = virtvdev->bar0_virtual_buf;
> + u16 opcode;
> + int ret;
> +
> + mutex_lock(&virtvdev->bar_mutex);
> + if (read) {
> + opcode = (pos < VIRTIO_PCI_CONFIG_OFF(true)) ?
> + VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_READ :
> + VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_READ;
> + ret = virtiovf_cmd_lr_read(virtvdev, opcode, pos,
> + count, bar0_buf + pos);
> + if (ret)
> + goto out;
> + if (copy_to_user(buf, bar0_buf + pos, count))
> + ret = -EFAULT;
> + goto out;
> + }
> +
> + if (copy_from_user(bar0_buf + pos, buf, count)) {
> + ret = -EFAULT;
> + goto out;
> + }
> +
> + opcode = (pos < VIRTIO_PCI_CONFIG_OFF(true)) ?
> + VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_WRITE :
> + VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_WRITE;
> + ret = virtiovf_cmd_lr_write(virtvdev, opcode, pos, count,
> + bar0_buf + pos);
> +out:
> + mutex_unlock(&virtvdev->bar_mutex);
> + return ret;
> +}
> +
> +static int
> +translate_io_bar_to_mem_bar(struct virtiovf_pci_core_device *virtvdev,
> + loff_t pos, char __user *buf,
> + size_t count, bool read)
> +{
> + struct vfio_pci_core_device *core_device = &virtvdev->core_device;
> + u16 queue_notify;
> + int ret;
> +
> + if (pos + count > virtvdev->bar0_virtual_buf_size)
> + return -EINVAL;
> +
> + switch (pos) {
> + case VIRTIO_PCI_QUEUE_NOTIFY:
> + if (count != sizeof(queue_notify))
> + return -EINVAL;
> + if (read) {
> + ret = vfio_pci_ioread16(core_device, true, &queue_notify,
> + virtvdev->notify_addr);
> + if (ret)
> + return ret;
> + if (copy_to_user(buf, &queue_notify,
> + sizeof(queue_notify)))
> + return -EFAULT;
> + break;
> + }
> +
> + if (copy_from_user(&queue_notify, buf, count))
> + return -EFAULT;
> +
> + ret = vfio_pci_iowrite16(core_device, true, queue_notify,
> + virtvdev->notify_addr);
> + break;
> + default:
> + ret = virtiovf_issue_lr_cmd(virtvdev, pos, buf, count, read);
> + }
> +
> + return ret ? ret : count;
> +}
> +
> +static bool range_contains_range(loff_t range1_start, size_t count1,
> + loff_t range2_start, size_t count2,
> + loff_t *start_offset)
> +{
> + if (range1_start <= range2_start &&
> + range1_start + count1 >= range2_start + count2) {
> + *start_offset = range2_start - range1_start;
> + return true;
> + }
> + return false;
> +}
> +
> +static ssize_t virtiovf_pci_read_config(struct vfio_device *core_vdev,
> + char __user *buf, size_t count,
> + loff_t *ppos)
> +{
> + struct virtiovf_pci_core_device *virtvdev = container_of(
> + core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> + loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> + loff_t copy_offset;
> + __le32 val32;
> + __le16 val16;
> + u8 val8;
> + int ret;
> +
> + ret = vfio_pci_core_read(core_vdev, buf, count, ppos);
> + if (ret < 0)
> + return ret;
> +
> + if (range_contains_range(pos, count, PCI_DEVICE_ID, sizeof(val16),
> + &copy_offset)) {
> + val16 = cpu_to_le16(0x1000);
> + if (copy_to_user(buf + copy_offset, &val16, sizeof(val16)))
> + return -EFAULT;
> + }
> +
> + if (virtvdev->pci_cmd_io &&
> + range_contains_range(pos, count, PCI_COMMAND, sizeof(val16),
> + &copy_offset)) {
> + if (copy_from_user(&val16, buf, sizeof(val16)))
> + return -EFAULT;
> + val16 |= cpu_to_le16(PCI_COMMAND_IO);
> + if (copy_to_user(buf + copy_offset, &val16, sizeof(val16)))
> + return -EFAULT;
> + }
> +
> + if (range_contains_range(pos, count, PCI_REVISION_ID, sizeof(val8),
> + &copy_offset)) {
> + /* Transitional needs to have revision 0 */
> + val8 = 0;
> + if (copy_to_user(buf + copy_offset, &val8, sizeof(val8)))
> + return -EFAULT;
> + }
> +
> + if (range_contains_range(pos, count, PCI_BASE_ADDRESS_0, sizeof(val32),
> + &copy_offset)) {
> + val32 = cpu_to_le32(PCI_BASE_ADDRESS_SPACE_IO);
> + if (copy_to_user(buf + copy_offset, &val32, sizeof(val32)))
> + return -EFAULT;
> + }
> +
> + if (range_contains_range(pos, count, PCI_SUBSYSTEM_ID, sizeof(val16),
> + &copy_offset)) {
> + /* Transitional devices use the PCI subsystem device id as
> + * virtio device id, same as legacy driver always did.
> + */
> + val16 = cpu_to_le16(VIRTIO_ID_NET);
> + if (copy_to_user(buf + copy_offset, &val16, sizeof(val16)))
> + return -EFAULT;
> + }
> +
> + return count;
> +}
> +
> +static ssize_t
> +virtiovf_pci_core_read(struct vfio_device *core_vdev, char __user *buf,
> + size_t count, loff_t *ppos)
> +{
> + struct virtiovf_pci_core_device *virtvdev = container_of(
> + core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> + struct pci_dev *pdev = virtvdev->core_device.pdev;
> + unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
> + loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> + int ret;
> +
> + if (!count)
> + return 0;
> +
> + if (index == VFIO_PCI_CONFIG_REGION_INDEX)
> + return virtiovf_pci_read_config(core_vdev, buf, count, ppos);
> +
> + if (index != VFIO_PCI_BAR0_REGION_INDEX)
> + return vfio_pci_core_read(core_vdev, buf, count, ppos);
> +
> + ret = pm_runtime_resume_and_get(&pdev->dev);
> + if (ret) {
> + pci_info_ratelimited(pdev, "runtime resume failed %d\n",
> + ret);
> + return -EIO;
> + }
> +
> + ret = translate_io_bar_to_mem_bar(virtvdev, pos, buf, count, true);
> + pm_runtime_put(&pdev->dev);
> + return ret;
> +}
> +
> +static ssize_t
> +virtiovf_pci_core_write(struct vfio_device *core_vdev, const char __user *buf,
> + size_t count, loff_t *ppos)
> +{
> + struct virtiovf_pci_core_device *virtvdev = container_of(
> + core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> + struct pci_dev *pdev = virtvdev->core_device.pdev;
> + unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
> + loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> + int ret;
> +
> + if (!count)
> + return 0;
> +
> + if (index == VFIO_PCI_CONFIG_REGION_INDEX) {
> + loff_t copy_offset;
> + u16 cmd;
> +
> + if (range_contains_range(pos, count, PCI_COMMAND, sizeof(cmd),
> + &copy_offset)) {
> + if (copy_from_user(&cmd, buf + copy_offset, sizeof(cmd)))
> + return -EFAULT;
> + virtvdev->pci_cmd_io = (cmd & PCI_COMMAND_IO);
> + }
> + }
> +
> + if (index != VFIO_PCI_BAR0_REGION_INDEX)
> + return vfio_pci_core_write(core_vdev, buf, count, ppos);
> +
> + ret = pm_runtime_resume_and_get(&pdev->dev);
> + if (ret) {
> + pci_info_ratelimited(pdev, "runtime resume failed %d\n", ret);
> + return -EIO;
> + }
> +
> + ret = translate_io_bar_to_mem_bar(virtvdev, pos, (char __user *)buf, count, false);
> + pm_runtime_put(&pdev->dev);
> + return ret;
> +}
> +
> +static int
> +virtiovf_pci_ioctl_get_region_info(struct vfio_device *core_vdev,
> + unsigned int cmd, unsigned long arg)
> +{
> + struct virtiovf_pci_core_device *virtvdev = container_of(
> + core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> + unsigned long minsz = offsetofend(struct vfio_region_info, offset);
> + void __user *uarg = (void __user *)arg;
> + struct vfio_region_info info = {};
> +
> + if (copy_from_user(&info, uarg, minsz))
> + return -EFAULT;
> +
> + if (info.argsz < minsz)
> + return -EINVAL;
> +
> + switch (info.index) {
> + case VFIO_PCI_BAR0_REGION_INDEX:
> + info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
> + info.size = virtvdev->bar0_virtual_buf_size;
> + info.flags = VFIO_REGION_INFO_FLAG_READ |
> + VFIO_REGION_INFO_FLAG_WRITE;
> + return copy_to_user(uarg, &info, minsz) ? -EFAULT : 0;
> + default:
> + return vfio_pci_core_ioctl(core_vdev, cmd, arg);
> + }
> +}
> +
> +static long
> +virtiovf_vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
> + unsigned long arg)
> +{
> + switch (cmd) {
> + case VFIO_DEVICE_GET_REGION_INFO:
> + return virtiovf_pci_ioctl_get_region_info(core_vdev, cmd, arg);
> + default:
> + return vfio_pci_core_ioctl(core_vdev, cmd, arg);
> + }
> +}
> +
> +static int
> +virtiovf_set_notify_addr(struct virtiovf_pci_core_device *virtvdev)
> +{
> + struct vfio_pci_core_device *core_device = &virtvdev->core_device;
> + int ret;
> +
> + /* Setup the BAR where the 'notify' exists to be used by vfio as well
> + * This will let us mmap it only once and use it when needed.
> + */
> + ret = vfio_pci_core_setup_barmap(core_device,
> + virtvdev->notify_bar);
> + if (ret)
> + return ret;
> +
> + virtvdev->notify_addr = core_device->barmap[virtvdev->notify_bar] +
> + virtvdev->notify_offset;
> + return 0;
> +}
> +
> +static int virtiovf_pci_open_device(struct vfio_device *core_vdev)
> +{
> + struct virtiovf_pci_core_device *virtvdev = container_of(
> + core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> + struct vfio_pci_core_device *vdev = &virtvdev->core_device;
> + int ret;
> +
> + ret = vfio_pci_core_enable(vdev);
> + if (ret)
> + return ret;
> +
> + if (virtvdev->bar0_virtual_buf) {
> + /* upon close_device() the vfio_pci_core_disable() is called
> + * and will close all the previous mmaps, so it seems that the
> + * valid life cycle for the 'notify' addr is per open/close.
> + */
> + ret = virtiovf_set_notify_addr(virtvdev);
> + if (ret) {
> + vfio_pci_core_disable(vdev);
> + return ret;
> + }
> + }
> +
> + vfio_pci_core_finish_enable(vdev);
> + return 0;
> +}
> +
> +static void virtiovf_pci_close_device(struct vfio_device *core_vdev)
> +{
> + vfio_pci_core_close_device(core_vdev);
> +}
> +
> +static int virtiovf_get_device_config_size(unsigned short device)
> +{
> + switch (device) {
> + case 0x1041:
> + /* network card */
> + return offsetofend(struct virtio_net_config, status);
> + default:
> + return 0;
> + }
> +}
> +
> +static int virtiovf_read_notify_info(struct virtiovf_pci_core_device *virtvdev)
> +{
> + u64 offset;
> + int ret;
> + u8 bar;
> +
> + ret = virtiovf_cmd_lq_read_notify(virtvdev,
> + VIRTIO_ADMIN_CMD_NOTIFY_INFO_FLAGS_OWNER_MEM,
> + &bar, &offset);
> + if (ret)
> + return ret;
> +
> + virtvdev->notify_bar = bar;
> + virtvdev->notify_offset = offset;
> + return 0;
> +}
> +
> +static int virtiovf_pci_init_device(struct vfio_device *core_vdev)
> +{
> + struct virtiovf_pci_core_device *virtvdev = container_of(
> + core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> + struct pci_dev *pdev;
> + int ret;
> +
> + ret = vfio_pci_core_init_dev(core_vdev);
> + if (ret)
> + return ret;
> +
> + pdev = virtvdev->core_device.pdev;
> + virtvdev->vf_id = pci_iov_vf_id(pdev);
> + if (virtvdev->vf_id < 0)
> + return -EINVAL;
> +
> + ret = virtiovf_read_notify_info(virtvdev);
> + if (ret)
> + return ret;
> +
> + virtvdev->bar0_virtual_buf_size = VIRTIO_LEGACY_IO_BAR_HEADER_LEN +
> + VIRTIO_LEGACY_IO_BAR_MSIX_HEADER_LEN +
> + virtiovf_get_device_config_size(pdev->device);
> + virtvdev->bar0_virtual_buf = kzalloc(virtvdev->bar0_virtual_buf_size,
> + GFP_KERNEL);
> + if (!virtvdev->bar0_virtual_buf)
> + return -ENOMEM;
> + mutex_init(&virtvdev->bar_mutex);
> + return 0;
> +}
> +
> +static void virtiovf_pci_core_release_dev(struct vfio_device *core_vdev)
> +{
> + struct virtiovf_pci_core_device *virtvdev = container_of(
> + core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> +
> + kfree(virtvdev->bar0_virtual_buf);
> + vfio_pci_core_release_dev(core_vdev);
> +}
> +
> +static const struct vfio_device_ops virtiovf_acc_vfio_pci_tran_ops = {
> + .name = "virtio-transitional-vfio-pci",
> + .init = virtiovf_pci_init_device,
> + .release = virtiovf_pci_core_release_dev,
> + .open_device = virtiovf_pci_open_device,
> + .close_device = virtiovf_pci_close_device,
> + .ioctl = virtiovf_vfio_pci_core_ioctl,
> + .read = virtiovf_pci_core_read,
> + .write = virtiovf_pci_core_write,
> + .mmap = vfio_pci_core_mmap,
> + .request = vfio_pci_core_request,
> + .match = vfio_pci_core_match,
> + .bind_iommufd = vfio_iommufd_physical_bind,
> + .unbind_iommufd = vfio_iommufd_physical_unbind,
> + .attach_ioas = vfio_iommufd_physical_attach_ioas,
> +};
> +
> +static const struct vfio_device_ops virtiovf_acc_vfio_pci_ops = {
> + .name = "virtio-acc-vfio-pci",
> + .init = vfio_pci_core_init_dev,
> + .release = vfio_pci_core_release_dev,
> + .open_device = virtiovf_pci_open_device,
> + .close_device = virtiovf_pci_close_device,
> + .ioctl = vfio_pci_core_ioctl,
> + .device_feature = vfio_pci_core_ioctl_feature,
> + .read = vfio_pci_core_read,
> + .write = vfio_pci_core_write,
> + .mmap = vfio_pci_core_mmap,
> + .request = vfio_pci_core_request,
> + .match = vfio_pci_core_match,
> + .bind_iommufd = vfio_iommufd_physical_bind,
> + .unbind_iommufd = vfio_iommufd_physical_unbind,
> + .attach_ioas = vfio_iommufd_physical_attach_ioas,
> +};
> +
> +static bool virtiovf_bar0_exists(struct pci_dev *pdev)
> +{
> + struct resource *res = pdev->resource;
> +
> + return res->flags ? true : false;
> +}
> +
> +#define VIRTIOVF_USE_ADMIN_CMD_BITMAP \
> + (BIT_ULL(VIRTIO_ADMIN_CMD_LIST_QUERY) | \
> + BIT_ULL(VIRTIO_ADMIN_CMD_LIST_USE) | \
> + BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_WRITE) | \
> + BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_READ) | \
> + BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_WRITE) | \
> + BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_READ) | \
> + BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_NOTIFY_INFO))
> +
> +static bool virtiovf_support_legacy_access(struct pci_dev *pdev)
> +{
> + int buf_size = DIV_ROUND_UP(VIRTIO_ADMIN_MAX_CMD_OPCODE, 64) * 8;
> + u8 *buf;
> + int ret;
> +
> + /* Only virtio-net is supported/tested so far */
> + if (pdev->device != 0x1041)
> + return false;
> +
> + buf = kzalloc(buf_size, GFP_KERNEL);
> + if (!buf)
> + return false;
> +
> + ret = virtiovf_cmd_list_query(pdev, buf, buf_size);
> + if (ret)
> + goto end;
> +
> + if ((le64_to_cpup((__le64 *)buf) & VIRTIOVF_USE_ADMIN_CMD_BITMAP) !=
> + VIRTIOVF_USE_ADMIN_CMD_BITMAP) {
> + ret = -EOPNOTSUPP;
> + goto end;
> + }
> +
> + /* confirm the used commands */
> + memset(buf, 0, buf_size);
> + *(__le64 *)buf = cpu_to_le64(VIRTIOVF_USE_ADMIN_CMD_BITMAP);
> + ret = virtiovf_cmd_list_use(pdev, buf, buf_size);
> +
> +end:
> + kfree(buf);
> + return ret ? false : true;
> +}
> +
> +static int virtiovf_pci_probe(struct pci_dev *pdev,
> + const struct pci_device_id *id)
> +{
> + const struct vfio_device_ops *ops = &virtiovf_acc_vfio_pci_ops;
> + struct virtiovf_pci_core_device *virtvdev;
> + int ret;
> +
> + if (pdev->is_virtfn && virtiovf_support_legacy_access(pdev) &&
> + !virtiovf_bar0_exists(pdev) && pdev->msix_cap)
> + ops = &virtiovf_acc_vfio_pci_tran_ops;
> +
> + virtvdev = vfio_alloc_device(virtiovf_pci_core_device, core_device.vdev,
> + &pdev->dev, ops);
> + if (IS_ERR(virtvdev))
> + return PTR_ERR(virtvdev);
> +
> + dev_set_drvdata(&pdev->dev, &virtvdev->core_device);
> + ret = vfio_pci_core_register_device(&virtvdev->core_device);
> + if (ret)
> + goto out;
> + return 0;
> +out:
> + vfio_put_device(&virtvdev->core_device.vdev);
> + return ret;
> +}
> +
> +static void virtiovf_pci_remove(struct pci_dev *pdev)
> +{
> + struct virtiovf_pci_core_device *virtvdev = dev_get_drvdata(&pdev->dev);
> +
> + vfio_pci_core_unregister_device(&virtvdev->core_device);
> + vfio_put_device(&virtvdev->core_device.vdev);
> +}
> +
> +static const struct pci_device_id virtiovf_pci_table[] = {
> + { PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_REDHAT_QUMRANET, PCI_ANY_ID) },
> + {}
> +};
> +
> +MODULE_DEVICE_TABLE(pci, virtiovf_pci_table);
> +
> +static struct pci_driver virtiovf_pci_driver = {
> + .name = KBUILD_MODNAME,
> + .id_table = virtiovf_pci_table,
> + .probe = virtiovf_pci_probe,
> + .remove = virtiovf_pci_remove,
> + .err_handler = &vfio_pci_core_err_handlers,
> + .driver_managed_dma = true,
> +};
> +
> +module_pci_driver(virtiovf_pci_driver);
> +
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Yishai Hadas <yishaih@nvidia.com>");
> +MODULE_DESCRIPTION(
> + "VIRTIO VFIO PCI - User Level meta-driver for VIRTIO device family");
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-21 16:43 ` Alex Williamson
@ 2023-09-21 16:52 ` Jason Gunthorpe
2023-09-21 17:01 ` Michael S. Tsirkin
0 siblings, 1 reply; 181+ messages in thread
From: Jason Gunthorpe @ 2023-09-21 16:52 UTC (permalink / raw)
To: Alex Williamson
Cc: Yishai Hadas, mst, jasowang, kvm, virtualization, parav, feliu,
jiri, kevin.tian, joao.m.martins, leonro, maorg
On Thu, Sep 21, 2023 at 10:43:50AM -0600, Alex Williamson wrote:
> > With that code in place a legacy driver in the guest has the look and
> > feel as if having a transitional device with legacy support for both its
> > control and data path flows.
>
> Why do we need to enable a "legacy" driver in the guest? The very name
> suggests there's an alternative driver that perhaps doesn't require
> this I/O BAR. Why don't we just require the non-legacy driver in the
> guest rather than increase our maintenance burden? Thanks,
It was my reaction also.
Apparently there is a big deployed base of people using old guest VMs
with old drivers and they do not want to update their VMs. It is the
same basic reason why qemu supports all those weird old machine types
and HW emulations. The desire is to support these old devices so that
old VMs can work unchanged.
Jason
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-21 16:52 ` Jason Gunthorpe
@ 2023-09-21 17:01 ` Michael S. Tsirkin
2023-09-21 17:07 ` Jason Gunthorpe
2023-09-21 17:09 ` Parav Pandit
0 siblings, 2 replies; 181+ messages in thread
From: Michael S. Tsirkin @ 2023-09-21 17:01 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Alex Williamson, Yishai Hadas, jasowang, kvm, virtualization,
parav, feliu, jiri, kevin.tian, joao.m.martins, leonro, maorg
On Thu, Sep 21, 2023 at 01:52:24PM -0300, Jason Gunthorpe wrote:
> On Thu, Sep 21, 2023 at 10:43:50AM -0600, Alex Williamson wrote:
>
> > > With that code in place a legacy driver in the guest has the look and
> > > feel as if having a transitional device with legacy support for both its
> > > control and data path flows.
> >
> > Why do we need to enable a "legacy" driver in the guest? The very name
> > suggests there's an alternative driver that perhaps doesn't require
> > this I/O BAR. Why don't we just require the non-legacy driver in the
> > guest rather than increase our maintenance burden? Thanks,
>
> It was my reaction also.
>
> Apparently there is a big deployed base of people using old guest VMs
> with old drivers and they do not want to update their VMs. It is the
> same basic reason why qemu supports all those weird old machine types
> and HW emulations. The desire is to support these old devices so that
> old VMs can work unchanged.
>
> Jason
And you are saying all these very old VMs use such a large number of
legacy devices that over-counting of locked memory due to vdpa not
correctly using iommufd is a problem that urgently needs to be solved
otherwise the solution has no value?
Another question I'm interested in is whether there's actually a
performance benefit to using this as compared to just software
vhost. I note there's a VM exit on each IO access, so ... perhaps?
Would be nice to see some numbers.
--
MST
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-21 17:01 ` Michael S. Tsirkin
@ 2023-09-21 17:07 ` Jason Gunthorpe
2023-09-21 17:21 ` Michael S. Tsirkin
2023-09-21 17:09 ` Parav Pandit
1 sibling, 1 reply; 181+ messages in thread
From: Jason Gunthorpe @ 2023-09-21 17:07 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Alex Williamson, Yishai Hadas, jasowang, kvm, virtualization,
parav, feliu, jiri, kevin.tian, joao.m.martins, leonro, maorg
On Thu, Sep 21, 2023 at 01:01:12PM -0400, Michael S. Tsirkin wrote:
> On Thu, Sep 21, 2023 at 01:52:24PM -0300, Jason Gunthorpe wrote:
> > On Thu, Sep 21, 2023 at 10:43:50AM -0600, Alex Williamson wrote:
> >
> > > > With that code in place a legacy driver in the guest has the look and
> > > > feel as if having a transitional device with legacy support for both its
> > > > control and data path flows.
> > >
> > > Why do we need to enable a "legacy" driver in the guest? The very name
> > > suggests there's an alternative driver that perhaps doesn't require
> > > this I/O BAR. Why don't we just require the non-legacy driver in the
> > > guest rather than increase our maintenance burden? Thanks,
> >
> > It was my reaction also.
> >
> > Apparently there is a big deployed base of people using old guest VMs
> > with old drivers and they do not want to update their VMs. It is the
> > same basic reason why qemu supports all those weird old machine types
> > and HW emulations. The desire is to support these old devices so that
> > old VMs can work unchanged.
> >
> > Jason
>
> And you are saying all these very old VMs use such a large number of
> legacy devices that over-counting of locked memory due to vdpa not
> correctly using iommufd is a problem that urgently needs to be solved
> otherwise the solution has no value?
No one has said that.
iommufd is gaining alot more functions than just pinned memory
accounting.
> Another question I'm interested in is whether there's actually a
> performance benefit to using this as compared to just software
> vhost. I note there's a VM exit on each IO access, so ... perhaps?
> Would be nice to see some numbers.
At least a single trap compared with an entire per-packet SW flow
undoubtedly uses a lot less CPU power in the hypervisor.
Jason
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-21 17:07 ` Jason Gunthorpe
@ 2023-09-21 17:21 ` Michael S. Tsirkin
2023-09-21 17:44 ` Jason Gunthorpe
0 siblings, 1 reply; 181+ messages in thread
From: Michael S. Tsirkin @ 2023-09-21 17:21 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Alex Williamson, Yishai Hadas, jasowang, kvm, virtualization,
parav, feliu, jiri, kevin.tian, joao.m.martins, leonro, maorg
On Thu, Sep 21, 2023 at 02:07:09PM -0300, Jason Gunthorpe wrote:
> On Thu, Sep 21, 2023 at 01:01:12PM -0400, Michael S. Tsirkin wrote:
> > On Thu, Sep 21, 2023 at 01:52:24PM -0300, Jason Gunthorpe wrote:
> > > On Thu, Sep 21, 2023 at 10:43:50AM -0600, Alex Williamson wrote:
> > >
> > > > > With that code in place a legacy driver in the guest has the look and
> > > > > feel as if having a transitional device with legacy support for both its
> > > > > control and data path flows.
> > > >
> > > > Why do we need to enable a "legacy" driver in the guest? The very name
> > > > suggests there's an alternative driver that perhaps doesn't require
> > > > this I/O BAR. Why don't we just require the non-legacy driver in the
> > > > guest rather than increase our maintenance burden? Thanks,
> > >
> > > It was my reaction also.
> > >
> > > Apparently there is a big deployed base of people using old guest VMs
> > > with old drivers and they do not want to update their VMs. It is the
> > > same basic reason why qemu supports all those weird old machine types
> > > and HW emulations. The desire is to support these old devices so that
> > > old VMs can work unchanged.
> > >
> > > Jason
> >
> > And you are saying all these very old VMs use such a large number of
> > legacy devices that over-counting of locked memory due to vdpa not
> > correctly using iommufd is a problem that urgently needs to be solved
> > otherwise the solution has no value?
>
> No one has said that.
>
> iommufd is gaining a lot more functions than just pinned memory
> accounting.
Yea it's very useful - it's also useful for vdpa whether this patchset
goes in or not. At some level, if vdpa can't keep up then maybe going
the vfio route is justified. I'm not sure why no one has fixed iommufd
yet - looks like a small amount of work. I'll see if I can address it
quickly because we already have virtio accelerators under vdpa and it
seems confusing to people to use vdpa for some and vfio for others, with
overlapping but slightly incompatible functionality. I'll get back next
week, in either case. I am however genuinely curious whether all the new
functionality is actually useful for these legacy guests.
> > Another question I'm interested in is whether there's actually a
> > performance benefit to using this as compared to just software
> > vhost. I note there's a VM exit on each IO access, so ... perhaps?
> > Would be nice to see some numbers.
>
> At least a single trap compared with an entire per-packet SW flow
> undoubtedly uses a lot less CPU power in the hypervisor.
>
> Jason
Something like the shadow vq thing will be more or less equivalent then?
That's upstream in qemu and needs no hardware support. Worth comparing
against. Anyway, there's presumably actual hardware this was tested
with, so why guess? Just test and post numbers.
--
MST
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-21 17:21 ` Michael S. Tsirkin
@ 2023-09-21 17:44 ` Jason Gunthorpe
2023-09-21 17:55 ` Michael S. Tsirkin
0 siblings, 1 reply; 181+ messages in thread
From: Jason Gunthorpe @ 2023-09-21 17:44 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Alex Williamson, Yishai Hadas, jasowang, kvm, virtualization,
parav, feliu, jiri, kevin.tian, joao.m.martins, leonro, maorg
On Thu, Sep 21, 2023 at 01:21:26PM -0400, Michael S. Tsirkin wrote:
> Yea it's very useful - it's also useful for vdpa whether this patchset
> goes in or not. At some level, if vdpa can't keep up then maybe going
> the vfio route is justified. I'm not sure why no one has fixed iommufd
> yet - looks like a small amount of work. I'll see if I can address it
> quickly because we already have virtio accelerators under vdpa and it
> seems confusing to people to use vdpa for some and vfio for others, with
> overlapping but slightly incompatible functionality. I'll get back next
> week, in either case. I am however genuinely curious whether all the new
> functionality is actually useful for these legacy guests.
It doesn't have much to do with the guests - this is new hypervisor
functionality to make the hypervisor do more things. This stuff can
still work with old VMs.
> > > Another question I'm interested in is whether there's actually a
> > > performance benefit to using this as compared to just software
> > > vhost. I note there's a VM exit on each IO access, so ... perhaps?
> > > Would be nice to see some numbers.
> >
> > At least a single trap compared with an entire per-packet SW flow
> > undoubtedly uses a lot less CPU power in the hypervisor.
>
> Something like the shadow vq thing will be more or less equivalent
> then?
Huh? It still has the entire netdev stack to go through on every
packet before it reaches the real virtio device.
> That's upstream in qemu and needs no hardware support. Worth comparing
> against. Anyway, there's presumably actual hardware this was tested
> with, so why guess? Just test and post numbers.
Our prior benchmarking put our VDPA/VFIO solutions at something like
2x-3x improvement over the qemu SW path it replaces.
Parav said 10% is lost, and 3x minus 10% is still roughly 3x better :)
I thought we all agreed on this when vdpa was created in the first
place, the all SW path was hopeless to get high performance out of?
Jason
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-21 17:44 ` Jason Gunthorpe
@ 2023-09-21 17:55 ` Michael S. Tsirkin
2023-09-21 18:16 ` Jason Gunthorpe
0 siblings, 1 reply; 181+ messages in thread
From: Michael S. Tsirkin @ 2023-09-21 17:55 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Alex Williamson, Yishai Hadas, jasowang, kvm, virtualization,
parav, feliu, jiri, kevin.tian, joao.m.martins, leonro, maorg
On Thu, Sep 21, 2023 at 02:44:50PM -0300, Jason Gunthorpe wrote:
> On Thu, Sep 21, 2023 at 01:21:26PM -0400, Michael S. Tsirkin wrote:
> > Yea it's very useful - it's also useful for vdpa whether this patchset
> > goes in or not. At some level, if vdpa can't keep up then maybe going
> > the vfio route is justified. I'm not sure why no one has fixed iommufd
> > yet - looks like a small amount of work. I'll see if I can address it
> > quickly because we already have virtio accelerators under vdpa and it
> > seems confusing to people to use vdpa for some and vfio for others, with
> > overlapping but slightly incompatible functionality. I'll get back next
> > week, in either case. I am however genuinely curious whether all the new
> > functionality is actually useful for these legacy guests.
>
> It doesn't have much to do with the guests - this is new hypervisor
> functionality to make the hypervisor do more things. This stuff can
> still work with old VMs.
>
> > > > Another question I'm interested in is whether there's actually a
> > > > performance benefit to using this as compared to just software
> > > > vhost. I note there's a VM exit on each IO access, so ... perhaps?
> > > > Would be nice to see some numbers.
> > >
> > > At least a single trap compared with an entire per-packet SW flow
> > > undoubtedly uses a lot less CPU power in the hypervisor.
> >
> > Something like the shadow vq thing will be more or less equivalent
> > then?
>
> Huh? It still has the entire netdev stack to go through on every
> packet before it reaches the real virtio device.
No - shadow vq just tweaks the descriptor and forwards it to
the modern vdpa hardware. No net stack involved.
> > That's upstream in qemu and needs no hardware support. Worth comparing
> > against. Anyway, there's presumably actual hardware this was tested
> > with, so why guess? Just test and post numbers.
>
> Our prior benchmarking put our VPDA/VFIO solutions at something like
> 2x-3x improvement over the qemu SW path it replaces.
> Parav said 10% is lost, so 10% of 3x is still 3x better :)
>
> I thought we all agreed on this when vdpa was created in the first
> place, the all SW path was hopeless to get high performance out of?
>
> Jason
That's not what I'm asking about though - not what shadow vq does,
shadow vq is a vdpa feature.
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-21 17:55 ` Michael S. Tsirkin
@ 2023-09-21 18:16 ` Jason Gunthorpe
2023-09-21 19:34 ` Michael S. Tsirkin
0 siblings, 1 reply; 181+ messages in thread
From: Jason Gunthorpe @ 2023-09-21 18:16 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Alex Williamson, Yishai Hadas, jasowang, kvm, virtualization,
parav, feliu, jiri, kevin.tian, joao.m.martins, leonro, maorg
On Thu, Sep 21, 2023 at 01:55:42PM -0400, Michael S. Tsirkin wrote:
> That's not what I'm asking about though - not what shadow vq does,
> shadow vq is a vdpa feature.
That's just VDPA then. We already talked about why VDPA is not a
replacement for VFIO.
I agree you can probably get decent performance out of choosing VDPA
over VFIO. That doesn't justify VDPA as a replacement for VFIO.
Jason
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-21 18:16 ` Jason Gunthorpe
@ 2023-09-21 19:34 ` Michael S. Tsirkin
2023-09-21 19:53 ` Jason Gunthorpe
0 siblings, 1 reply; 181+ messages in thread
From: Michael S. Tsirkin @ 2023-09-21 19:34 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Alex Williamson, Yishai Hadas, jasowang, kvm, virtualization,
parav, feliu, jiri, kevin.tian, joao.m.martins, leonro, maorg
On Thu, Sep 21, 2023 at 03:16:37PM -0300, Jason Gunthorpe wrote:
> On Thu, Sep 21, 2023 at 01:55:42PM -0400, Michael S. Tsirkin wrote:
>
> > That's not what I'm asking about though - not what shadow vq does,
> > shadow vq is a vdpa feature.
>
> That's just VDPA then. We already talked about why VDPA is not a
> replacement for VFIO.
It does however work universally, by software, without any special
hardware support. Which is kind of why I am curious - if VDPA needs this
proxy code because shadow vq is slower then that's an argument for not
having it in two places, and trying to improve vdpa to use iommufd if
that's easy/practical. If instead VDPA gives the same speed with just
shadow vq then keeping this hack in vfio seems like less of a problem.
Finally if VDPA is faster then maybe you will reconsider using it ;)
--
MST
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-21 19:34 ` Michael S. Tsirkin
@ 2023-09-21 19:53 ` Jason Gunthorpe
2023-09-21 20:16 ` Michael S. Tsirkin
2023-09-22 3:02 ` Jason Wang
0 siblings, 2 replies; 181+ messages in thread
From: Jason Gunthorpe @ 2023-09-21 19:53 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Alex Williamson, Yishai Hadas, jasowang, kvm, virtualization,
parav, feliu, jiri, kevin.tian, joao.m.martins, leonro, maorg
On Thu, Sep 21, 2023 at 03:34:03PM -0400, Michael S. Tsirkin wrote:
> that's easy/practical. If instead VDPA gives the same speed with just
> shadow vq then keeping this hack in vfio seems like less of a problem.
> Finally if VDPA is faster then maybe you will reconsider using it ;)
It is not all about the speed.
VDPA presents another large and complex software stack in the
hypervisor that can be eliminated by simply using VFIO. VFIO is
already required for other scenarios.
This is about reducing complexity, reducing attack surface and
increasing maintainability of the hypervisor environment.
Jason
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-21 19:53 ` Jason Gunthorpe
@ 2023-09-21 20:16 ` Michael S. Tsirkin
2023-09-21 22:48 ` Jason Gunthorpe
2023-09-22 3:02 ` Jason Wang
2023-09-22 3:02 ` Jason Wang
1 sibling, 2 replies; 181+ messages in thread
From: Michael S. Tsirkin @ 2023-09-21 20:16 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Alex Williamson, Yishai Hadas, jasowang, kvm, virtualization,
parav, feliu, jiri, kevin.tian, joao.m.martins, leonro, maorg
On Thu, Sep 21, 2023 at 04:53:45PM -0300, Jason Gunthorpe wrote:
> On Thu, Sep 21, 2023 at 03:34:03PM -0400, Michael S. Tsirkin wrote:
>
> > that's easy/practical. If instead VDPA gives the same speed with just
> > shadow vq then keeping this hack in vfio seems like less of a problem.
> > Finally if VDPA is faster then maybe you will reconsider using it ;)
>
> It is not all about the speed.
>
> VDPA presents another large and complex software stack in the
> hypervisor that can be eliminated by simply using VFIO.
If all you want is passing through your card to guest
then yes this can be addressed "by simply using VFIO".
And let me give you a simple example just from this patchset:
it assumes guest uses MSIX and just breaks if it doesn't.
As VDPA emulates, it can emulate INTx for the guest while doing MSI
on the host side. Yeah, modern guests use MSIX, but this is about legacy,
yes?
> VFIO is
> already required for other scenarios.
Required ... by some people? Most VMs I run don't use anything
outside of virtio.
> This is about reducing complexity, reducing attack surface and
> increasing maintainability of the hypervisor environment.
>
> Jason
Generally you get better security if you don't let guests poke at
hardware when they don't have to. But sure, matter of preference -
use VFIO, it's great. I am worried about the specific patchset though.
It seems to deal with emulating virtio which seems more like a vdpa
thing. If you start adding virtio emulation to vfio then won't
you just end up with another vdpa? And if no why not?
And I don't buy the "we already invested in this vfio based solution",
sorry - that's not a reason upstream has to maintain it.
--
MST
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-21 20:16 ` Michael S. Tsirkin
@ 2023-09-21 22:48 ` Jason Gunthorpe
2023-09-22 9:47 ` Michael S. Tsirkin
2023-09-22 3:02 ` Jason Wang
1 sibling, 1 reply; 181+ messages in thread
From: Jason Gunthorpe @ 2023-09-21 22:48 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Alex Williamson, Yishai Hadas, jasowang, kvm, virtualization,
parav, feliu, jiri, kevin.tian, joao.m.martins, leonro, maorg
On Thu, Sep 21, 2023 at 04:16:25PM -0400, Michael S. Tsirkin wrote:
> On Thu, Sep 21, 2023 at 04:53:45PM -0300, Jason Gunthorpe wrote:
> > On Thu, Sep 21, 2023 at 03:34:03PM -0400, Michael S. Tsirkin wrote:
> >
> > > that's easy/practical. If instead VDPA gives the same speed with just
> > > shadow vq then keeping this hack in vfio seems like less of a problem.
> > > Finally if VDPA is faster then maybe you will reconsider using it ;)
> >
> > It is not all about the speed.
> >
> > VDPA presents another large and complex software stack in the
> > hypervisor that can be eliminated by simply using VFIO.
>
> If all you want is passing through your card to guest
> then yes this can be addressed "by simply using VFIO".
That is pretty much the goal, yes.
> And let me give you a simple example just from this patchset:
> it assumes guest uses MSIX and just breaks if it doesn't.
It does? Really? Where did you see that?
> > VFIO is
> > already required for other scenarios.
>
> Required ... by some people? Most VMs I run don't use anything
> outside of virtio.
Yes, some people. The sorts of people who run large data centers.
> It seems to deal with emulating virtio which seems more like a vdpa
> thing.
Alex described it right, it creates an SW trapped IO bar that relays
the doorbell to an admin queue command.
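To sketch the idea (hypothetical names only, not the actual patchset
code: relay_legacy_admin_cmd() stands in for whatever issues the admin
command through the owning PF, and the *_WRITE opcodes are assumed
counterparts of the *_READ ones quoted later in this thread):
/* Trapped write to the emulated legacy IO BAR of a VF. Registers below
 * the device config window become "common cfg" admin commands, the
 * rest become "device cfg" admin commands; a queue notify (doorbell)
 * lands here like any other legacy register write.
 */
static int legacy_io_bar_write(struct my_vf *vf, loff_t pos,
			       const u8 *buf, size_t count)
{
	u16 opcode = (pos < VIRTIO_PCI_CONFIG_OFF(vf->msix_enabled)) ?
		     VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_WRITE :
		     VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_WRITE;
	return relay_legacy_admin_cmd(vf, opcode, pos, buf, count);
}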
> If you start adding virtio emulation to vfio then won't
> you just end up with another vdpa? And if no why not?
> And I don't buy the "we already invested in this vfio based solution",
> sorry - that's not a reason upstream has to maintain it.
I think you would be well justified to object to actual mediation,
like processing queues in VFIO or otherwise complex things.
Fortunately there is no need to do that with DPU HW. The legacy IO BAR
is a weird quirk that just cannot be done without a software trap, and
the OASIS standardization effort was for exactly this kind of
simplistic transformation.
I also don't buy the "upstream has to maintain it" line. The team that
submitted it will maintain it just fine, thank you.
Jason
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-21 22:48 ` Jason Gunthorpe
@ 2023-09-22 9:47 ` Michael S. Tsirkin
2023-09-22 12:23 ` Jason Gunthorpe
0 siblings, 1 reply; 181+ messages in thread
From: Michael S. Tsirkin @ 2023-09-22 9:47 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Alex Williamson, Yishai Hadas, jasowang, kvm, virtualization,
parav, feliu, jiri, kevin.tian, joao.m.martins, leonro, maorg
On Thu, Sep 21, 2023 at 07:48:36PM -0300, Jason Gunthorpe wrote:
> On Thu, Sep 21, 2023 at 04:16:25PM -0400, Michael S. Tsirkin wrote:
> > On Thu, Sep 21, 2023 at 04:53:45PM -0300, Jason Gunthorpe wrote:
> > > On Thu, Sep 21, 2023 at 03:34:03PM -0400, Michael S. Tsirkin wrote:
> > >
> > > > that's easy/practical. If instead VDPA gives the same speed with just
> > > > shadow vq then keeping this hack in vfio seems like less of a problem.
> > > > Finally if VDPA is faster then maybe you will reconsider using it ;)
> > >
> > > It is not all about the speed.
> > >
> > > VDPA presents another large and complex software stack in the
> > > hypervisor that can be eliminated by simply using VFIO.
> >
> > If all you want is passing through your card to guest
> > then yes this can be addressed "by simply using VFIO".
>
> That is pretty much the goal, yes.
>
> > And let me give you a simple example just from this patchset:
> > it assumes guest uses MSIX and just breaks if it doesn't.
>
> It does? Really? Where did you see that?
This thing apparently:
+ opcode = (pos < VIRTIO_PCI_CONFIG_OFF(true)) ?
+ VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_READ :
+ VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_READ;
That "true" is supposed to be whether guest enabled MSI or not.
> > > VFIO is
> > > already required for other scenarios.
> >
> > Required ... by some people? Most VMs I run don't use anything
> > outside of virtio.
>
> Yes, some people. The sorts of people who run large data centers.
>
> > It seems to deal with emulating virtio which seems more like a vdpa
> > thing.
>
> Alex described it right, it creates an SW trapped IO bar that relays
> the doorbell to an admin queue command.
>
> > If you start adding virtio emulation to vfio then won't
> > you just end up with another vdpa? And if no why not?
> > And I don't buy the "we already invested in this vfio based solution",
> > sorry - that's not a reason upstream has to maintain it.
>
> I think you would be well justified to object to actual mediation,
> like processing queues in VFIO or otherwise complex things.
This mediation is kind of smallish, I agree. Not completely devoid of
logic though.
> Fortunately there is no need to do that with DPU HW. The legacy IO BAR
> is a weird quirk that just cannot be done without a software trap, and
> the OASIS standardization effort was for exactly this kind of
> simplistic transformation.
>
> I also don't buy the "upstream has to maintain it" line. The team that
> submitted it will maintain it just fine, thank you.
it will require maintenance effort when virtio changes are made. For
example it pokes at the device state - I don't see specific races right
now but in the past we did e.g. reset the device to recover from errors
and we might start doing it again.
If more of the logic is under virtio directory where we'll remember
to keep it in loop, and will be able to reuse it from vdpa
down the road, I would be more sympathetic.
--
MST
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-22 9:47 ` Michael S. Tsirkin
@ 2023-09-22 12:23 ` Jason Gunthorpe
2023-09-22 15:45 ` Michael S. Tsirkin
0 siblings, 1 reply; 181+ messages in thread
From: Jason Gunthorpe @ 2023-09-22 12:23 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Alex Williamson, Yishai Hadas, jasowang, kvm, virtualization,
parav, feliu, jiri, kevin.tian, joao.m.martins, leonro, maorg
On Fri, Sep 22, 2023 at 05:47:23AM -0400, Michael S. Tsirkin wrote:
> it will require maintenance effort when virtio changes are made. For
> example it pokes at the device state - I don't see specific races right
> now but in the past we did e.g. reset the device to recover from errors
> and we might start doing it again.
>
> If more of the logic is under virtio directory where we'll remember
> to keep it in loop, and will be able to reuse it from vdpa
> down the road, I would be more sympathetic.
This is inevitable, the VFIO live migration driver will need all this
infrastructure too.
Jason
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-22 12:23 ` Jason Gunthorpe
@ 2023-09-22 15:45 ` Michael S. Tsirkin
0 siblings, 0 replies; 181+ messages in thread
From: Michael S. Tsirkin @ 2023-09-22 15:45 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Alex Williamson, Yishai Hadas, jasowang, kvm, virtualization,
parav, feliu, jiri, kevin.tian, joao.m.martins, leonro, maorg
On Fri, Sep 22, 2023 at 09:23:28AM -0300, Jason Gunthorpe wrote:
> On Fri, Sep 22, 2023 at 05:47:23AM -0400, Michael S. Tsirkin wrote:
>
> > it will require maintenance effort when virtio changes are made. For
> > example it pokes at the device state - I don't see specific races right
> > now but in the past we did e.g. reset the device to recover from errors
> > and we might start doing it again.
> >
> > If more of the logic is under virtio directory where we'll remember
> > to keep it in loop, and will be able to reuse it from vdpa
> > down the road, I would be more sympathetic.
>
> This is inevitable, the VFIO live migration driver will need all this
> infrastructure too.
>
> Jason
>
I am not sure what you are saying and what is inevitable.
VDPA for sure will want live migration support. I am not at all
sympathetic to efforts that want to duplicate that support for virtio
under VFIO. Put it in a library under the virtio directory,
with a sane, well-documented interface.
I don't maintain VFIO and Alex can merge what he wants,
but I won't merge patches that export virtio internals in a way
that will make virtio maintenance harder.
--
MST
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-21 20:16 ` Michael S. Tsirkin
2023-09-21 22:48 ` Jason Gunthorpe
@ 2023-09-22 3:02 ` Jason Wang
2023-09-22 12:22 ` Jason Gunthorpe
1 sibling, 1 reply; 181+ messages in thread
From: Jason Wang @ 2023-09-22 3:02 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Jason Gunthorpe, Alex Williamson, Yishai Hadas, kvm,
virtualization, parav, feliu, jiri, kevin.tian, joao.m.martins,
leonro, maorg
On Fri, Sep 22, 2023 at 4:16 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Thu, Sep 21, 2023 at 04:53:45PM -0300, Jason Gunthorpe wrote:
> > On Thu, Sep 21, 2023 at 03:34:03PM -0400, Michael S. Tsirkin wrote:
> >
> > > that's easy/practical. If instead VDPA gives the same speed with just
> > > shadow vq then keeping this hack in vfio seems like less of a problem.
> > > Finally if VDPA is faster then maybe you will reconsider using it ;)
> >
> > It is not all about the speed.
> >
> > VDPA presents another large and complex software stack in the
> > hypervisor that can be eliminated by simply using VFIO.
>
> If all you want is passing through your card to guest
> then yes this can be addressed "by simply using VFIO".
+1.
And what's more, using MMIO BAR0 then it can work for legacy.
I have virtio hardware handy from one vendor that works like this,
and I see it is done by a lot of other vendors.
Thanks
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-22 3:02 ` Jason Wang
@ 2023-09-22 12:22 ` Jason Gunthorpe
2023-09-22 12:25 ` Parav Pandit
0 siblings, 1 reply; 181+ messages in thread
From: Jason Gunthorpe @ 2023-09-22 12:22 UTC (permalink / raw)
To: Jason Wang
Cc: Michael S. Tsirkin, Alex Williamson, Yishai Hadas, kvm,
virtualization, parav, feliu, jiri, kevin.tian, joao.m.martins,
leonro, maorg
On Fri, Sep 22, 2023 at 11:02:21AM +0800, Jason Wang wrote:
> And what's more, using MMIO BAR0 then it can work for legacy.
Oh? How? Our team didn't think so.
Jason
^ permalink raw reply [flat|nested] 181+ messages in thread
* RE: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-22 12:22 ` Jason Gunthorpe
@ 2023-09-22 12:25 ` Parav Pandit
2023-09-22 15:13 ` Michael S. Tsirkin
2023-09-25 2:30 ` Jason Wang
0 siblings, 2 replies; 181+ messages in thread
From: Parav Pandit @ 2023-09-22 12:25 UTC (permalink / raw)
To: Jason Gunthorpe, Jason Wang
Cc: Michael S. Tsirkin, Alex Williamson, Yishai Hadas,
kvm@vger.kernel.org, virtualization@lists.linux-foundation.org,
Feng Liu, Jiri Pirko, kevin.tian@intel.com,
joao.m.martins@oracle.com, Leon Romanovsky, Maor Gottlieb
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, September 22, 2023 5:53 PM
> > And what's more, using MMIO BAR0 then it can work for legacy.
>
> Oh? How? Our team didn't think so.
It does not. It was already discussed.
The device reset in legacy is not synchronous.
The drivers do not wait for reset to complete; it was written for the sw backend.
Hence MMIO BAR0 is not the best option in real implementations.
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-22 12:25 ` Parav Pandit
@ 2023-09-22 15:13 ` Michael S. Tsirkin
2023-09-22 15:15 ` Jason Gunthorpe
2023-09-25 2:30 ` Jason Wang
1 sibling, 1 reply; 181+ messages in thread
From: Michael S. Tsirkin @ 2023-09-22 15:13 UTC (permalink / raw)
To: Parav Pandit
Cc: Jason Gunthorpe, Jason Wang, Alex Williamson, Yishai Hadas,
kvm@vger.kernel.org, virtualization@lists.linux-foundation.org,
Feng Liu, Jiri Pirko, kevin.tian@intel.com,
joao.m.martins@oracle.com, Leon Romanovsky, Maor Gottlieb
On Fri, Sep 22, 2023 at 12:25:06PM +0000, Parav Pandit wrote:
>
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Friday, September 22, 2023 5:53 PM
>
>
> > > And what's more, using MMIO BAR0 then it can work for legacy.
> >
> > Oh? How? Our team didn't think so.
>
> It does not. It was already discussed.
> The device reset in legacy is not synchronous.
> The drivers do not wait for reset to complete; it was written for the sw backend.
> Hence MMIO BAR0 is not the best option in real implementations.
Or maybe they made it synchronous in hardware, that's all.
After all same is true for the IO BAR0 e.g. for the PF: IO writes are posted anyway.
Whether that's possible would depend on the hardware architecture.
--
MST
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-22 15:13 ` Michael S. Tsirkin
@ 2023-09-22 15:15 ` Jason Gunthorpe
2023-09-22 15:40 ` Michael S. Tsirkin
0 siblings, 1 reply; 181+ messages in thread
From: Jason Gunthorpe @ 2023-09-22 15:15 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Parav Pandit, Jason Wang, Alex Williamson, Yishai Hadas,
kvm@vger.kernel.org, virtualization@lists.linux-foundation.org,
Feng Liu, Jiri Pirko, kevin.tian@intel.com,
joao.m.martins@oracle.com, Leon Romanovsky, Maor Gottlieb
On Fri, Sep 22, 2023 at 11:13:18AM -0400, Michael S. Tsirkin wrote:
> On Fri, Sep 22, 2023 at 12:25:06PM +0000, Parav Pandit wrote:
> >
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Friday, September 22, 2023 5:53 PM
> >
> >
> > > > And what's more, using MMIO BAR0 then it can work for legacy.
> > >
> > > Oh? How? Our team didn't think so.
> >
> > It does not. It was already discussed.
> > The device reset in legacy is not synchronous.
> > The drivers do not wait for reset to complete; it was written for the sw backend.
> > Hence MMIO BAR0 is not the best option in real implementations.
>
> Or maybe they made it synchronous in hardware, that's all.
> After all same is true for the IO BAR0 e.g. for the PF: IO writes
> are posted anyway.
IO writes are not posted in PCI.
Jason
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-22 15:15 ` Jason Gunthorpe
@ 2023-09-22 15:40 ` Michael S. Tsirkin
2023-09-22 16:22 ` Jason Gunthorpe
0 siblings, 1 reply; 181+ messages in thread
From: Michael S. Tsirkin @ 2023-09-22 15:40 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Parav Pandit, Jason Wang, Alex Williamson, Yishai Hadas,
kvm@vger.kernel.org, virtualization@lists.linux-foundation.org,
Feng Liu, Jiri Pirko, kevin.tian@intel.com,
joao.m.martins@oracle.com, Leon Romanovsky, Maor Gottlieb
On Fri, Sep 22, 2023 at 12:15:34PM -0300, Jason Gunthorpe wrote:
> On Fri, Sep 22, 2023 at 11:13:18AM -0400, Michael S. Tsirkin wrote:
> > On Fri, Sep 22, 2023 at 12:25:06PM +0000, Parav Pandit wrote:
> > >
> > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > Sent: Friday, September 22, 2023 5:53 PM
> > >
> > >
> > > > > And what's more, using MMIO BAR0 then it can work for legacy.
> > > >
> > > > Oh? How? Our team didn't think so.
> > >
> > > It does not. It was already discussed.
> > > The device reset in legacy is not synchronous.
> > > The drivers do not wait for reset to complete; it was written for the sw backend.
> > > Hence MMIO BAR0 is not the best option in real implementations.
> >
> > Or maybe they made it synchronous in hardware, that's all.
> > After all same is true for the IO BAR0 e.g. for the PF: IO writes
> > are posted anyway.
>
> IO writes are not posted in PCI.
Aha, I was confused. Thanks for the correction. I guess you just buffer
subsequent transactions while reset is going on and reset quickly enough
for it to be seamless then?
--
MST
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-22 15:40 ` Michael S. Tsirkin
@ 2023-09-22 16:22 ` Jason Gunthorpe
2023-09-25 17:36 ` Michael S. Tsirkin
0 siblings, 1 reply; 181+ messages in thread
From: Jason Gunthorpe @ 2023-09-22 16:22 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Parav Pandit, Jason Wang, Alex Williamson, Yishai Hadas,
kvm@vger.kernel.org, virtualization@lists.linux-foundation.org,
Feng Liu, Jiri Pirko, kevin.tian@intel.com,
joao.m.martins@oracle.com, Leon Romanovsky, Maor Gottlieb
On Fri, Sep 22, 2023 at 11:40:58AM -0400, Michael S. Tsirkin wrote:
> On Fri, Sep 22, 2023 at 12:15:34PM -0300, Jason Gunthorpe wrote:
> > On Fri, Sep 22, 2023 at 11:13:18AM -0400, Michael S. Tsirkin wrote:
> > > On Fri, Sep 22, 2023 at 12:25:06PM +0000, Parav Pandit wrote:
> > > >
> > > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > > Sent: Friday, September 22, 2023 5:53 PM
> > > >
> > > >
> > > > > > And what's more, using MMIO BAR0 then it can work for legacy.
> > > > >
> > > > > Oh? How? Our team didn't think so.
> > > >
> > > > It does not. It was already discussed.
> > > > The device reset in legacy is not synchronous.
> > > > The drivers do not wait for reset to complete; it was written for the sw backend.
> > > > Hence MMIO BAR0 is not the best option in real implementations.
> > >
> > > Or maybe they made it synchronous in hardware, that's all.
> > > After all same is true for the IO BAR0 e.g. for the PF: IO writes
> > > are posted anyway.
> >
> > IO writes are not posted in PCI.
>
> Aha, I was confused. Thanks for the correction. I guess you just buffer
> subsequent transactions while reset is going on and reset quickly enough
> for it to be seamless then?
From a hardware perspective the CPU issues a non-posted IO write and
then it stops processing until the far side returns an IO completion.
Using that you can emulate what the SW virtio model did and delay the
CPU from restarting until the reset is completed.
Since MMIO is always posted, this is not possible to emulate directly
using MMIO.
Converting IO into non-posted admin commands is a fairly close
recreation to what actual HW would do.
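That also falls out naturally from the relay model, since the trapped
access does not return to the guest until the admin command completes.
A rough sketch (hypothetical names, as in the earlier sketch):
/* Guest writes 0 to the legacy status register to reset the VF. The
 * vCPU is held in the IO-write exit until this handler returns, so
 * completing the admin command only after the device has finished the
 * reset gives the guest the same stall a non-posted IO write would.
 */
static int legacy_reset(struct my_vf *vf)
{
	u8 status = 0;
	return relay_legacy_admin_cmd(vf, VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_WRITE,
				      VIRTIO_PCI_STATUS, &status, sizeof(status));
}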
Jason
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-22 16:22 ` Jason Gunthorpe
@ 2023-09-25 17:36 ` Michael S. Tsirkin
0 siblings, 0 replies; 181+ messages in thread
From: Michael S. Tsirkin @ 2023-09-25 17:36 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Parav Pandit, Jason Wang, Alex Williamson, Yishai Hadas,
kvm@vger.kernel.org, virtualization@lists.linux-foundation.org,
Feng Liu, Jiri Pirko, kevin.tian@intel.com,
joao.m.martins@oracle.com, Leon Romanovsky, Maor Gottlieb
On Fri, Sep 22, 2023 at 01:22:33PM -0300, Jason Gunthorpe wrote:
> On Fri, Sep 22, 2023 at 11:40:58AM -0400, Michael S. Tsirkin wrote:
> > On Fri, Sep 22, 2023 at 12:15:34PM -0300, Jason Gunthorpe wrote:
> > > On Fri, Sep 22, 2023 at 11:13:18AM -0400, Michael S. Tsirkin wrote:
> > > > On Fri, Sep 22, 2023 at 12:25:06PM +0000, Parav Pandit wrote:
> > > > >
> > > > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > > > Sent: Friday, September 22, 2023 5:53 PM
> > > > >
> > > > >
> > > > > > > And what's more, using MMIO BAR0 then it can work for legacy.
> > > > > >
> > > > > > Oh? How? Our team didn't think so.
> > > > >
> > > > > It does not. It was already discussed.
> > > > > The device reset in legacy is not synchronous.
> > > > > The drivers do not wait for reset to complete; it was written for the sw backend.
> > > > > Hence MMIO BAR0 is not the best option in real implementations.
> > > >
> > > > Or maybe they made it synchronous in hardware, that's all.
> > > > After all same is true for the IO BAR0 e.g. for the PF: IO writes
> > > > are posted anyway.
> > >
> > > IO writes are not posted in PCI.
> >
> > Aha, I was confused. Thanks for the correction. I guess you just buffer
> > subsequent transactions while reset is going on and reset quickly enough
> > for it to be seamless then?
>
> From a hardware perspective the CPU issues a non-posted IO write and
> then it stops processing until the far side returns an IO completion.
>
> Using that you can emulate what the SW virtio model did and delay the
> CPU from restarting until the reset is completed.
>
> Since MMIO is always posted, this is not possible to emulate directly
> using MMIO.
>
> Converting IO into non-posted admin commands is a fairly close
> recreation to what actual HW would do.
>
> Jason
I thought you asked how it is possible for hardware to support reset if
all it does is replace the IO BAR with a memory BAR. The answer is that
since 2011 the reset is followed by a read of the status field (a change
that isn't much newer than the MSIX support from 2009 which this code
assumes anyway). If one uses a Linux driver from 2011 or later, then all
you need to do is defer the response to this read until after the reset
is complete.
If you are using older drivers or other OSes, then a reset done with a
posted write after the device has operated for a while might not be safe:
e.g. you might trigger races if you remove the driver from the system or
trigger hot unplug. For example:
static void virtio_pci_remove(struct pci_dev *pci_dev)
{
....
unregister_virtio_device(&vp_dev->vdev);
^^^^ triggers reset, then releases memory
....
pci_disable_device(pci_dev);
^^^ blocks DMA by clearing bus master
}
Here you could see some DMA into memory that has just been released.
As Jason mentions, hardware exists that is used under one of these two
restrictions on the guest (Linux since 2011, or no resets while DMA is
going on), and it works fine with these existing guests.
Given the restrictions, the virtio TC didn't elect to standardize this
approach and instead opted for the heavier approach of converting IO
into non-posted admin commands in software.
--
MST
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-22 12:25 ` Parav Pandit
2023-09-22 15:13 ` Michael S. Tsirkin
@ 2023-09-25 2:30 ` Jason Wang
2023-09-25 8:26 ` Parav Pandit
1 sibling, 1 reply; 181+ messages in thread
From: Jason Wang @ 2023-09-25 2:30 UTC (permalink / raw)
To: Parav Pandit
Cc: Jason Gunthorpe, Michael S. Tsirkin, Alex Williamson,
Yishai Hadas, kvm@vger.kernel.org,
virtualization@lists.linux-foundation.org, Feng Liu, Jiri Pirko,
kevin.tian@intel.com, joao.m.martins@oracle.com, Leon Romanovsky,
Maor Gottlieb
On Fri, Sep 22, 2023 at 8:25 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Friday, September 22, 2023 5:53 PM
>
>
> > > And what's more, using MMIO BAR0 then it can work for legacy.
> >
> > Oh? How? Our team didn't think so.
>
> It does not. It was already discussed.
> The device reset in legacy is not synchronous.
How do you know this?
> The drivers do not wait for reset to complete; it was written for the sw backend.
Do you see there's a flush after reset in the legacy driver?
static void vp_reset(struct virtio_device *vdev)
{
struct virtio_pci_device *vp_dev = to_vp_device(vdev);
/* 0 status means a reset. */
vp_legacy_set_status(&vp_dev->ldev, 0);
/* Flush out the status write, and flush in device writes,
* including MSi-X interrupts, if any. */
vp_legacy_get_status(&vp_dev->ldev);
/* Flush pending VQ/configuration callbacks. */
vp_synchronize_vectors(vdev);
}
Thanks
> Hence MMIO BAR0 is not the best option in real implementations.
>
^ permalink raw reply [flat|nested] 181+ messages in thread
* RE: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-25 2:30 ` Jason Wang
@ 2023-09-25 8:26 ` Parav Pandit
2023-09-25 18:36 ` Michael S. Tsirkin
2023-09-26 2:32 ` Jason Wang
0 siblings, 2 replies; 181+ messages in thread
From: Parav Pandit @ 2023-09-25 8:26 UTC (permalink / raw)
To: Jason Wang
Cc: Jason Gunthorpe, Michael S. Tsirkin, Alex Williamson,
Yishai Hadas, kvm@vger.kernel.org,
virtualization@lists.linux-foundation.org, Feng Liu, Jiri Pirko,
kevin.tian@intel.com, joao.m.martins@oracle.com, Leon Romanovsky,
Maor Gottlieb
> From: Jason Wang <jasowang@redhat.com>
> Sent: Monday, September 25, 2023 8:00 AM
>
> On Fri, Sep 22, 2023 at 8:25 PM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Friday, September 22, 2023 5:53 PM
> >
> >
> > > > And what's more, using MMIO BAR0 then it can work for legacy.
> > >
> > > Oh? How? Our team didn't think so.
> >
> > It does not. It was already discussed.
> > The device reset in legacy is not synchronous.
>
> How do you know this?
>
I am not sure of the motivation for repeating the same discussion already
done in OASIS with you and others in the past.
Anyway, please find the answer below.
About reset:
The legacy device specification did not enforce the below-cited driver requirement of 1.0:
"The driver SHOULD consider a driver-initiated reset complete when it reads device status as 0."
[1] https://ozlabs.org/~rusty/virtio-spec/virtio-0.9.5.pdf
> > The drivers do not wait for reset to complete; it was written for the sw
> backend.
>
> Do you see there's a flush after reset in the legacy driver?
>
Yes, it only flushes the write by reading it back. The driver does not
_wait_ for the reset to complete within the device, the way the 1.x flow
below does.
Please see the reset flow of the 1.x device below.
In fact, the comment in the 1.x vp_reset() also needs to be updated to
indicate that the driver needs to wait for the device to finish the reset.
I will send a separate patch improving this comment of vp_reset() to match the spec.
static void vp_reset(struct virtio_device *vdev)
{
struct virtio_pci_device *vp_dev = to_vp_device(vdev);
struct virtio_pci_modern_device *mdev = &vp_dev->mdev;
/* 0 status means a reset. */
vp_modern_set_status(mdev, 0);
/* After writing 0 to device_status, the driver MUST wait for a read of
* device_status to return 0 before reinitializing the device.
* This will flush out the status write, and flush in device writes,
* including MSI-X interrupts, if any.
*/
while (vp_modern_get_status(mdev))
msleep(1);
/* Flush pending VQ/configuration callbacks. */
vp_synchronize_vectors(vdev);
}
> static void vp_reset(struct virtio_device *vdev) {
> struct virtio_pci_device *vp_dev = to_vp_device(vdev);
> /* 0 status means a reset. */
> vp_legacy_set_status(&vp_dev->ldev, 0);
> /* Flush out the status write, and flush in device writes,
> * including MSi-X interrupts, if any. */
> vp_legacy_get_status(&vp_dev->ldev);
> /* Flush pending VQ/configuration callbacks. */
> vp_synchronize_vectors(vdev);
> }
>
> Thanks
>
>
>
> > Hence MMIO BAR0 is not the best option in real implementations.
> >
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-25 8:26 ` Parav Pandit
@ 2023-09-25 18:36 ` Michael S. Tsirkin
2023-09-26 2:34 ` Zhu, Lingshan
2023-09-26 3:45 ` Parav Pandit
2023-09-26 2:32 ` Jason Wang
1 sibling, 2 replies; 181+ messages in thread
From: Michael S. Tsirkin @ 2023-09-25 18:36 UTC (permalink / raw)
To: Parav Pandit
Cc: Jason Wang, Jason Gunthorpe, Alex Williamson, Yishai Hadas,
kvm@vger.kernel.org, virtualization@lists.linux-foundation.org,
Feng Liu, Jiri Pirko, kevin.tian@intel.com,
joao.m.martins@oracle.com, Leon Romanovsky, Maor Gottlieb
On Mon, Sep 25, 2023 at 08:26:33AM +0000, Parav Pandit wrote:
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Monday, September 25, 2023 8:00 AM
> >
> > On Fri, Sep 22, 2023 at 8:25 PM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > Sent: Friday, September 22, 2023 5:53 PM
> > >
> > >
> > > > > And what's more, using MMIO BAR0 then it can work for legacy.
> > > >
> > > > Oh? How? Our team didn't think so.
> > >
> > > It does not. It was already discussed.
> > > The device reset in legacy is not synchronous.
> >
> > How do you know this?
> >
> Not sure the motivation of same discussion done in the OASIS with you and others in past.
>
> Anyways, please find the answer below.
>
> About reset,
> The legacy device specification has not enforced below cited 1.0 driver requirement of 1.0.
>
> "The driver SHOULD consider a driver-initiated reset complete when it reads device status as 0."
>
> [1] https://ozlabs.org/~rusty/virtio-spec/virtio-0.9.5.pdf
Basically, I think any drivers that did not read status (Linux pre 2011)
before freeing memory under DMA have a reset path that is racy wrt DMA,
since memory writes are posted, and IO writes, while not posted, have a
completion that does not order posted transactions, e.g. from the PCI
Express spec, rule D2b:
An I/O or Configuration Write Completion is permitted to pass a Posted Request.
Having said that, a ton of driver races have been discovered on this
path in the years since; I suspect that if one cares about this then
just avoiding stress on reset is wise.
> > > The drivers do not wait for reset to complete; it was written for the sw
> > backend.
> >
> > Do you see there's a flush after reset in the legacy driver?
> >
> Yes. it only flushes the write by reading it. The driver does not get _wait_ for the reset to complete within the device like above.
One could conceivably do that wait in hardware, though. Just defer the
completion until the read is done.
> Please see the reset flow of 1.x device as below.
> In fact the comment of the 1.x device also needs to be updated to indicate that driver need to wait for the device to finish the reset.
> I will send separate patch for improving this comment of vp_reset() to match the spec.
>
> static void vp_reset(struct virtio_device *vdev)
> {
> struct virtio_pci_device *vp_dev = to_vp_device(vdev);
> struct virtio_pci_modern_device *mdev = &vp_dev->mdev;
>
> /* 0 status means a reset. */
> vp_modern_set_status(mdev, 0);
> /* After writing 0 to device_status, the driver MUST wait for a read of
> * device_status to return 0 before reinitializing the device.
> * This will flush out the status write, and flush in device writes,
> * including MSI-X interrupts, if any.
> */
> while (vp_modern_get_status(mdev))
> msleep(1);
> /* Flush pending VQ/configuration callbacks. */
> vp_synchronize_vectors(vdev);
> }
>
>
> > static void vp_reset(struct virtio_device *vdev) {
> > struct virtio_pci_device *vp_dev = to_vp_device(vdev);
> > /* 0 status means a reset. */
> > vp_legacy_set_status(&vp_dev->ldev, 0);
> > /* Flush out the status write, and flush in device writes,
> > * including MSi-X interrupts, if any. */
> > vp_legacy_get_status(&vp_dev->ldev);
> > /* Flush pending VQ/configuration callbacks. */
> > vp_synchronize_vectors(vdev);
> > }
> >
> > Thanks
> >
> >
> >
> > > Hence MMIO BAR0 is not the best option in real implementations.
> > >
>
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-25 18:36 ` Michael S. Tsirkin
@ 2023-09-26 2:34 ` Zhu, Lingshan
2023-09-26 3:45 ` Parav Pandit
1 sibling, 0 replies; 181+ messages in thread
From: Zhu, Lingshan @ 2023-09-26 2:34 UTC (permalink / raw)
To: Michael S. Tsirkin, Parav Pandit
Cc: kvm@vger.kernel.org, Maor Gottlieb,
virtualization@lists.linux-foundation.org, Jason Gunthorpe,
Jiri Pirko, Leon Romanovsky
On 9/26/2023 2:36 AM, Michael S. Tsirkin wrote:
> On Mon, Sep 25, 2023 at 08:26:33AM +0000, Parav Pandit wrote:
>>
>>> From: Jason Wang <jasowang@redhat.com>
>>> Sent: Monday, September 25, 2023 8:00 AM
>>>
>>> On Fri, Sep 22, 2023 at 8:25 PM Parav Pandit <parav@nvidia.com> wrote:
>>>>
>>>>> From: Jason Gunthorpe <jgg@nvidia.com>
>>>>> Sent: Friday, September 22, 2023 5:53 PM
>>>>
>>>>>> And what's more, using MMIO BAR0 then it can work for legacy.
>>>>> Oh? How? Our team didn't think so.
>>>> It does not. It was already discussed.
>>>> The device reset in legacy is not synchronous.
>>> How do you know this?
>>>
>> Not sure the motivation of same discussion done in the OASIS with you and others in past.
>>
>> Anyways, please find the answer below.
>>
>> About reset,
>> The legacy device specification has not enforced below cited 1.0 driver requirement of 1.0.
>>
>> "The driver SHOULD consider a driver-initiated reset complete when it reads device status as 0."
>>
>> [1] https://ozlabs.org/~rusty/virtio-spec/virtio-0.9.5.pdf
> Basically, I think any drivers that did not read status (linux pre 2011)
> before freeing memory under DMA have a reset path that is racy wrt DMA, since
> memory writes are posted and IO writes while not posted have completion
> that does not order posted transactions, e.g. from pci express spec:
> D2b
> An I/O or Configuration Write Completion 37 is permitted to pass a Posted Request.
> having said that there were a ton of driver races discovered on this
> path in the years since, I suspect if one cares about this then
> just avoiding stress on reset is wise.
>
>
>
>>>> The drivers do not wait for reset to complete; it was written for the sw
>>> backend.
>>>
>>> Do you see there's a flush after reset in the legacy driver?
>>>
>> Yes. it only flushes the write by reading it. The driver does not get _wait_ for the reset to complete within the device like above.
> One can thinkably do that wait in hardware, though. Just defer completion until
> read is done.
I agree with MST. At least Intel devices work fine with vfio-pci and
legacy driver without any changes.
So far so good.
Thanks
Zhu Lingshan
>
>> Please see the reset flow of 1.x device as below.
>> In fact the comment of the 1.x device also needs to be updated to indicate that driver need to wait for the device to finish the reset.
>> I will send separate patch for improving this comment of vp_reset() to match the spec.
>>
>> static void vp_reset(struct virtio_device *vdev)
>> {
>> struct virtio_pci_device *vp_dev = to_vp_device(vdev);
>> struct virtio_pci_modern_device *mdev = &vp_dev->mdev;
>>
>> /* 0 status means a reset. */
>> vp_modern_set_status(mdev, 0);
>> /* After writing 0 to device_status, the driver MUST wait for a read of
>> * device_status to return 0 before reinitializing the device.
>> * This will flush out the status write, and flush in device writes,
>> * including MSI-X interrupts, if any.
>> */
>> while (vp_modern_get_status(mdev))
>> msleep(1);
>> /* Flush pending VQ/configuration callbacks. */
>> vp_synchronize_vectors(vdev);
>> }
>>
>>
>>> static void vp_reset(struct virtio_device *vdev) {
>>> struct virtio_pci_device *vp_dev = to_vp_device(vdev);
>>> /* 0 status means a reset. */
>>> vp_legacy_set_status(&vp_dev->ldev, 0);
>>> /* Flush out the status write, and flush in device writes,
>>> * including MSi-X interrupts, if any. */
>>> vp_legacy_get_status(&vp_dev->ldev);
>>> /* Flush pending VQ/configuration callbacks. */
>>> vp_synchronize_vectors(vdev);
>>> }
>>>
>>> Thanks
>>>
>>>
>>>
>>>> Hence MMIO BAR0 is not the best option in real implementations.
>>>>
> _______________________________________________
> Virtualization mailing list
> Virtualization@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply [flat|nested] 181+ messages in thread
* RE: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-25 18:36 ` Michael S. Tsirkin
2023-09-26 2:34 ` Zhu, Lingshan
@ 2023-09-26 3:45 ` Parav Pandit
2023-09-26 4:37 ` Jason Wang
2023-10-12 10:52 ` Michael S. Tsirkin
1 sibling, 2 replies; 181+ messages in thread
From: Parav Pandit @ 2023-09-26 3:45 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Jason Wang, Jason Gunthorpe, Alex Williamson, Yishai Hadas,
kvm@vger.kernel.org, virtualization@lists.linux-foundation.org,
Feng Liu, Jiri Pirko, kevin.tian@intel.com,
joao.m.martins@oracle.com, Leon Romanovsky, Maor Gottlieb
> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Tuesday, September 26, 2023 12:06 AM
> One can thinkably do that wait in hardware, though. Just defer completion until
> read is done.
>
Once OASIS defines such a new interface, and if some hw vendor _actually_ wants to do such complex hw, maybe the vfio driver can adapt to it.
When we worked with you, we discussed that such hw does not have enough returns, and hence the technical committee chose to proceed with admin commands.
I will skip re-discussing all of it again here.
The current virtio spec delivers the best trade-offs of functionality, performance and lightweight implementation, with a forward path towards more features, such as the migration Jason explained.
All with near-zero driver, qemu and sw involvement for a rapidly growing feature set...
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-26 3:45 ` Parav Pandit
@ 2023-09-26 4:37 ` Jason Wang
2023-10-12 10:52 ` Michael S. Tsirkin
1 sibling, 0 replies; 181+ messages in thread
From: Jason Wang @ 2023-09-26 4:37 UTC (permalink / raw)
To: Parav Pandit
Cc: Michael S. Tsirkin, Jason Gunthorpe, Alex Williamson,
Yishai Hadas, kvm@vger.kernel.org,
virtualization@lists.linux-foundation.org, Feng Liu, Jiri Pirko,
kevin.tian@intel.com, joao.m.martins@oracle.com, Leon Romanovsky,
Maor Gottlieb
On Tue, Sep 26, 2023 at 11:45 AM Parav Pandit <parav@nvidia.com> wrote:
>
>
>
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Tuesday, September 26, 2023 12:06 AM
>
> > One can thinkably do that wait in hardware, though. Just defer completion until
> > read is done.
> >
> Once OASIS does such new interface and if some hw vendor _actually_ wants to do such complex hw, may be vfio driver can adopt to it.
It is you who is trying to revive legacy in the spec. We all know legacy
is tricky but works.
> When we worked with you, we discussed that there such hw does not have enough returns and hence technical committee choose to proceed with admin commands.
I don't think my questions regarding the legacy transport got good
answers at that time. What's more, we all know the spec allows us to fix,
work around or even deprecate a feature.
Thanks
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-26 3:45 ` Parav Pandit
2023-09-26 4:37 ` Jason Wang
@ 2023-10-12 10:52 ` Michael S. Tsirkin
2023-10-12 11:11 ` Parav Pandit
1 sibling, 1 reply; 181+ messages in thread
From: Michael S. Tsirkin @ 2023-10-12 10:52 UTC (permalink / raw)
To: Parav Pandit
Cc: Jason Wang, Jason Gunthorpe, Alex Williamson, Yishai Hadas,
kvm@vger.kernel.org, virtualization@lists.linux-foundation.org,
Feng Liu, Jiri Pirko, kevin.tian@intel.com,
joao.m.martins@oracle.com, Leon Romanovsky, Maor Gottlieb
On Tue, Sep 26, 2023 at 03:45:36AM +0000, Parav Pandit wrote:
>
>
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Tuesday, September 26, 2023 12:06 AM
>
> > One can thinkably do that wait in hardware, though. Just defer completion until
> > read is done.
> >
> Once OASIS does such new interface and if some hw vendor _actually_ wants to do such complex hw, may be vfio driver can adopt to it.
The reset behaviour I describe is already in the spec. What else do you
want OASIS to standardize? Virtio currently is just a register map; it
does not yet include suggestions on how exactly PCI Express transactions
look. You feel we should add that?
--
MST
^ permalink raw reply [flat|nested] 181+ messages in thread
* RE: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-10-12 10:52 ` Michael S. Tsirkin
@ 2023-10-12 11:11 ` Parav Pandit
2023-10-12 11:30 ` Michael S. Tsirkin
0 siblings, 1 reply; 181+ messages in thread
From: Parav Pandit @ 2023-10-12 11:11 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Jason Wang, Jason Gunthorpe, Alex Williamson, Yishai Hadas,
kvm@vger.kernel.org, virtualization@lists.linux-foundation.org,
Feng Liu, Jiri Pirko, kevin.tian@intel.com,
joao.m.martins@oracle.com, Leon Romanovsky, Maor Gottlieb
> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Thursday, October 12, 2023 4:23 PM
>
> On Tue, Sep 26, 2023 at 03:45:36AM +0000, Parav Pandit wrote:
> >
> >
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Tuesday, September 26, 2023 12:06 AM
> >
> > > One can thinkably do that wait in hardware, though. Just defer
> > > completion until read is done.
> > >
> > Once OASIS does such new interface and if some hw vendor _actually_ wants
> to do such complex hw, may be vfio driver can adopt to it.
>
> The reset behaviour I describe is already in the spec. What else do you want
> OASIS to standardize? Virtio currently is just a register map it does not yet
> include suggestions on how exactly do pci express transactions look. You feel we
> should add that?
The reset behavior in the spec for the modern interface, as listed in [1] and [2], is just fine.
What I meant is in the context of having MMIO-based legacy registers "defer completion until read is done".
I think you meant, "Just defer read completion until reset is done".
This means the hw needs to finish the device reset, for thousands of devices, within the PCI read completion timeout.
So if and when OASIS does such standardization, someone can implement it.
What I recollect is that OASIS did not standardize such an anti-scale approach and took the admin command approach, which achieves better scale.
Hope I clarified.
I am not expecting OASIS to do anything extra for legacy registers.
[1] The device MUST reset when 0 is written to device_status, and present a 0 in device_status once that is done.
[2] After writing 0 to device_status, the driver MUST wait for a read of device_status to return 0 before reinitializing
the device.
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-10-12 11:11 ` Parav Pandit
@ 2023-10-12 11:30 ` Michael S. Tsirkin
2023-10-12 11:40 ` Parav Pandit
0 siblings, 1 reply; 181+ messages in thread
From: Michael S. Tsirkin @ 2023-10-12 11:30 UTC (permalink / raw)
To: Parav Pandit
Cc: Jason Wang, Jason Gunthorpe, Alex Williamson, Yishai Hadas,
kvm@vger.kernel.org, virtualization@lists.linux-foundation.org,
Feng Liu, Jiri Pirko, kevin.tian@intel.com,
joao.m.martins@oracle.com, Leon Romanovsky, Maor Gottlieb
On Thu, Oct 12, 2023 at 11:11:20AM +0000, Parav Pandit wrote:
>
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Thursday, October 12, 2023 4:23 PM
> >
> > On Tue, Sep 26, 2023 at 03:45:36AM +0000, Parav Pandit wrote:
> > >
> > >
> > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > Sent: Tuesday, September 26, 2023 12:06 AM
> > >
> > > > One can thinkably do that wait in hardware, though. Just defer
> > > > completion until read is done.
> > > >
> > > Once OASIS does such new interface and if some hw vendor _actually_ wants
> > to do such complex hw, may be vfio driver can adopt to it.
> >
> > The reset behaviour I describe is already in the spec. What else do you want
> > OASIS to standardize? Virtio currently is just a register map; it does not yet
> > include suggestions on how exactly pci express transactions should look. Do you
> > feel we should add that?
>
> The reset behavior in the spec for modern devices, as listed in [1] and [2], is just fine.
>
> What I meant is in the context of having MMIO-based legacy registers to "defer completion until read is done".
> I think you meant, "Just defer read completion until reset is done".
yes
> This means the hw needs to finish the device reset for thousands of devices within the PCI read completion timeout.
no, each device does its own reset.
> So if OASIS does such standardization, someone can implement it.
>
> What I recollect is that OASIS did not standardize such an anti-scale approach and took the admin command approach, which achieves better scale.
> Hope I clarified.
You are talking about the extension for trap and emulate.
I am instead talking about devices that work with
existing legacy linux drivers with no traps.
> I am not expecting OASIS to do anything extra for legacy registers.
>
> [1] The device MUST reset when 0 is written to device_status, and present a 0 in device_status once that is done.
> [2] After writing 0 to device_status, the driver MUST wait for a read of device_status to return 0 before reinitializing
> the device.
We can add a note explaining that legacy drivers do not wait
after doing reset, that is not a problem.
If someone wants to make a device that works with existing
legacy linux drivers, they can do that.
Won't work with all drivers though, which is why oasis did not
want to standardize this.
--
MST
^ permalink raw reply [flat|nested] 181+ messages in thread
* RE: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-10-12 11:30 ` Michael S. Tsirkin
@ 2023-10-12 11:40 ` Parav Pandit
0 siblings, 0 replies; 181+ messages in thread
From: Parav Pandit @ 2023-10-12 11:40 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Jason Wang, Jason Gunthorpe, Alex Williamson, Yishai Hadas,
kvm@vger.kernel.org, virtualization@lists.linux-foundation.org,
Feng Liu, Jiri Pirko, kevin.tian@intel.com,
joao.m.martins@oracle.com, Leon Romanovsky, Maor Gottlieb
> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Thursday, October 12, 2023 5:00 PM
> I am instead talking about devices that work with existing legacy linux drivers
> with no traps.
>
Yep, I understood.
> > I am not expecting OASIS to do anything extra for legacy registers.
> >
> > [1] The device MUST reset when 0 is written to device_status, and present a 0
> in device_status once that is done.
> > [2] After writing 0 to device_status, the driver MUST wait for a read
> > of device_status to return 0 before reinitializing the device.
>
> We can add a note explaining that legacy drivers do not wait after doing reset,
> that is not a problem.
> If someone wants to make a device that works with existing legacy linux drivers,
> they can do that.
> Won't work with all drivers though, which is why oasis did not want to
> standardize this.
Ok. thanks.
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-25 8:26 ` Parav Pandit
2023-09-25 18:36 ` Michael S. Tsirkin
@ 2023-09-26 2:32 ` Jason Wang
2023-09-26 4:01 ` Parav Pandit
2023-09-26 11:49 ` Michael S. Tsirkin
1 sibling, 2 replies; 181+ messages in thread
From: Jason Wang @ 2023-09-26 2:32 UTC (permalink / raw)
To: Parav Pandit
Cc: Jason Gunthorpe, Michael S. Tsirkin, Alex Williamson,
Yishai Hadas, kvm@vger.kernel.org,
virtualization@lists.linux-foundation.org, Feng Liu, Jiri Pirko,
kevin.tian@intel.com, joao.m.martins@oracle.com, Leon Romanovsky,
Maor Gottlieb
On Mon, Sep 25, 2023 at 4:26 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Monday, September 25, 2023 8:00 AM
> >
> > On Fri, Sep 22, 2023 at 8:25 PM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > Sent: Friday, September 22, 2023 5:53 PM
> > >
> > >
> > > > > And what's more, using MMIO BAR0 then it can work for legacy.
> > > >
> > > > Oh? How? Our team didn't think so.
> > >
> > > It does not. It was already discussed.
> > > The device reset in legacy is not synchronous.
> >
> > How do you know this?
> >
> Not sure the motivation of same discussion done in the OASIS with you and others in past.
That is exactly the same point.
It's too late to define the legacy behaviour accurately in the spec so
people will be lost in the legacy maze easily.
>
> Anyways, please find the answer below.
>
> About reset,
> The legacy device specification has not enforced below cited 1.0 driver requirement of 1.0.
>
> "The driver SHOULD consider a driver-initiated reset complete when it reads device status as 0."
We are talking about how to make devices work for legacy drivers. So
it has nothing to do with 1.0.
>
> [1] https://ozlabs.org/~rusty/virtio-spec/virtio-0.9.5.pdf
>
> > > The drivers do not wait for reset to complete; it was written for the sw
> > backend.
> >
> > Do you see there's a flush after reset in the legacy driver?
> >
> Yes. it only flushes the write by reading it. The driver does not get _wait_ for the reset to complete within the device like above.
It's the implementation details in legacy. The device needs to make
sure (reset) the driver can work (is done before get_status return).
That's all.
^ permalink raw reply [flat|nested] 181+ messages in thread
* RE: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-26 2:32 ` Jason Wang
@ 2023-09-26 4:01 ` Parav Pandit
2023-09-26 4:37 ` Jason Wang
2023-09-26 11:49 ` Michael S. Tsirkin
1 sibling, 1 reply; 181+ messages in thread
From: Parav Pandit @ 2023-09-26 4:01 UTC (permalink / raw)
To: Jason Wang
Cc: Jason Gunthorpe, Michael S. Tsirkin, Alex Williamson,
Yishai Hadas, kvm@vger.kernel.org,
virtualization@lists.linux-foundation.org, Feng Liu, Jiri Pirko,
kevin.tian@intel.com, joao.m.martins@oracle.com, Leon Romanovsky,
Maor Gottlieb
> From: Jason Wang <jasowang@redhat.com>
> Sent: Tuesday, September 26, 2023 8:03 AM
>
> It's the implementation details in legacy. The device needs to make sure (reset)
> the driver can work (is done before get_status return).
It is part of the 0.9.5 and 1.x specifications, as I quoted that text above.
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-26 4:01 ` Parav Pandit
@ 2023-09-26 4:37 ` Jason Wang
2023-09-26 5:27 ` Parav Pandit
0 siblings, 1 reply; 181+ messages in thread
From: Jason Wang @ 2023-09-26 4:37 UTC (permalink / raw)
To: Parav Pandit
Cc: Jason Gunthorpe, Michael S. Tsirkin, Alex Williamson,
Yishai Hadas, kvm@vger.kernel.org,
virtualization@lists.linux-foundation.org, Feng Liu, Jiri Pirko,
kevin.tian@intel.com, joao.m.martins@oracle.com, Leon Romanovsky,
Maor Gottlieb
On Tue, Sep 26, 2023 at 12:01 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Tuesday, September 26, 2023 8:03 AM
> >
> > It's the implementation details in legacy. The device needs to make sure (reset)
> > the driver can work (is done before get_status return).
> It is part of the 0.9.5 and 1.x specifications, as I quoted that text above.
What I meant is: legacy devices need to find their way to make legacy
drivers work. That's how legacy works.
It's too late to add any normative requirements to the 0.9.5 spec. So the
device behaviour is actually defined by the legacy drivers. That is why it
is tricky.
If you can't find a way to make legacy drivers work, use modern.
That's it.
Thanks
^ permalink raw reply [flat|nested] 181+ messages in thread
* RE: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-26 4:37 ` Jason Wang
@ 2023-09-26 5:27 ` Parav Pandit
0 siblings, 0 replies; 181+ messages in thread
From: Parav Pandit @ 2023-09-26 5:27 UTC (permalink / raw)
To: Jason Wang
Cc: Jason Gunthorpe, Michael S. Tsirkin, Alex Williamson,
Yishai Hadas, kvm@vger.kernel.org,
virtualization@lists.linux-foundation.org, Feng Liu, Jiri Pirko,
kevin.tian@intel.com, joao.m.martins@oracle.com, Leon Romanovsky,
Maor Gottlieb
> From: Jason Wang <jasowang@redhat.com>
> Sent: Tuesday, September 26, 2023 10:07 AM
>
> If you can't find a way to make legacy drivers work, use modern.
>
Understood.
This vfio series makes the legacy drivers work.
Thanks.
> That's it.
>
> Thanks
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-26 2:32 ` Jason Wang
2023-09-26 4:01 ` Parav Pandit
@ 2023-09-26 11:49 ` Michael S. Tsirkin
2023-10-08 4:28 ` Jason Wang
1 sibling, 1 reply; 181+ messages in thread
From: Michael S. Tsirkin @ 2023-09-26 11:49 UTC (permalink / raw)
To: Jason Wang
Cc: Parav Pandit, Jason Gunthorpe, Alex Williamson, Yishai Hadas,
kvm@vger.kernel.org, virtualization@lists.linux-foundation.org,
Feng Liu, Jiri Pirko, kevin.tian@intel.com,
joao.m.martins@oracle.com, Leon Romanovsky, Maor Gottlieb
On Tue, Sep 26, 2023 at 10:32:39AM +0800, Jason Wang wrote:
> It's the implementation details in legacy. The device needs to make
> sure (reset) the driver can work (is done before get_status return).
I think that there's no way to make it reliably work for all legacy drivers.
They just assumed a software backend and did not bother with DMA
ordering. You can try to avoid resets, they are not that common so
things will tend to mostly work if you don't stress them too much with
things like hot plug/unplug in a loop. Or you can try to use a driver
after 2011 which is more aware of hardware ordering and flushes the
reset write with a read. One of these two tricks, I think, is the magic
behind the device exposing memory bar 0 that you mention.
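
For reference, a minimal sketch of that second trick (assuming the 0.9.5 legacy register layout from linux/virtio_pci.h; not lifted from any particular driver): the status write is flushed by a read, but nothing waits for the device-internal reset to finish.

#include <linux/io.h>
#include <linux/virtio_pci.h>   /* VIRTIO_PCI_STATUS, legacy layout */

/* Sketch only: 'ioaddr' is the mapped legacy register region. The read
 * flushes the posted status write; drivers from before ~2011 skipped
 * even that read, which is the problematic case discussed above.
 */
static void legacy_style_reset(void __iomem *ioaddr)
{
        iowrite8(0, ioaddr + VIRTIO_PCI_STATUS);   /* write 0 = reset  */
        ioread8(ioaddr + VIRTIO_PCI_STATUS);       /* flush the write  */
}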
--
MST
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-26 11:49 ` Michael S. Tsirkin
@ 2023-10-08 4:28 ` Jason Wang
0 siblings, 0 replies; 181+ messages in thread
From: Jason Wang @ 2023-10-08 4:28 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Parav Pandit, Jason Gunthorpe, Alex Williamson, Yishai Hadas,
kvm@vger.kernel.org, virtualization@lists.linux-foundation.org,
Feng Liu, Jiri Pirko, kevin.tian@intel.com,
joao.m.martins@oracle.com, Leon Romanovsky, Maor Gottlieb
On Tue, Sep 26, 2023 at 7:49 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Tue, Sep 26, 2023 at 10:32:39AM +0800, Jason Wang wrote:
> > It's the implementation details in legacy. The device needs to make
> > sure (reset) the driver can work (is done before get_status return).
>
> I think that there's no way to make it reliably work for all legacy drivers.
Yes, we may have ancient drivers.
>
> They just assumed a software backend and did not bother with DMA
> ordering. You can try to avoid resets, they are not that common so
> things will tend to mostly work if you don't stress them too much with
> things like hot plug/unplug in a loop. Or you can try to use a driver
> after 2011 which is more aware of hardware ordering and flushes the
> reset write with a read. One of these two tricks, I think, is the magic
> behind the device exposing memory bar 0 that you mention.
Right, this is what I see for hardware legacy devices shipped by some
cloud vendors.
Thanks
>
> --
> MST
>
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-21 19:53 ` Jason Gunthorpe
2023-09-21 20:16 ` Michael S. Tsirkin
@ 2023-09-22 3:02 ` Jason Wang
2023-09-22 12:25 ` Jason Gunthorpe
1 sibling, 1 reply; 181+ messages in thread
From: Jason Wang @ 2023-09-22 3:02 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Michael S. Tsirkin, Alex Williamson, Yishai Hadas, kvm,
virtualization, parav, feliu, jiri, kevin.tian, joao.m.martins,
leonro, maorg
On Fri, Sep 22, 2023 at 3:53 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Thu, Sep 21, 2023 at 03:34:03PM -0400, Michael S. Tsirkin wrote:
>
> > that's easy/practical. If instead VDPA gives the same speed with just
> > shadow vq then keeping this hack in vfio seems like less of a problem.
> > Finally if VDPA is faster then maybe you will reconsider using it ;)
>
> It is not all about the speed.
>
> VDPA presents another large and complex software stack in the
> hypervisor that can be eliminated by simply using VFIO.
vDPA supports standard virtio devices so how did you define complexity?
From the view of the application, what it wants is a simple virtio
device, not a virtio-pci device. That is what vDPA tries to present.
By simply counting LOCs, vdpa + vhost + vp_vdpa is much less code than
what VFIO has. It's not hard to expect that it will still be much less
even once iommufd support is done.
Thanks
> VFIO is
> already required for other scenarios.
>
> This is about reducing complexity, reducing attack surface and
> increasing maintainability of the hypervisor environment.
>
> Jason
>
>
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-22 3:02 ` Jason Wang
@ 2023-09-22 12:25 ` Jason Gunthorpe
2023-09-22 15:39 ` Michael S. Tsirkin
0 siblings, 1 reply; 181+ messages in thread
From: Jason Gunthorpe @ 2023-09-22 12:25 UTC (permalink / raw)
To: Jason Wang
Cc: Michael S. Tsirkin, Alex Williamson, Yishai Hadas, kvm,
virtualization, parav, feliu, jiri, kevin.tian, joao.m.martins,
leonro, maorg
On Fri, Sep 22, 2023 at 11:02:50AM +0800, Jason Wang wrote:
> On Fri, Sep 22, 2023 at 3:53 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Thu, Sep 21, 2023 at 03:34:03PM -0400, Michael S. Tsirkin wrote:
> >
> > > that's easy/practical. If instead VDPA gives the same speed with just
> > > shadow vq then keeping this hack in vfio seems like less of a problem.
> > > Finally if VDPA is faster then maybe you will reconsider using it ;)
> >
> > It is not all about the speed.
> >
> > VDPA presents another large and complex software stack in the
> > hypervisor that can be eliminated by simply using VFIO.
>
> vDPA supports standard virtio devices so how did you define
> complexity?
As I said, VFIO is already required for other devices in these VMs. So
anything incremental over base-line vfio-pci is complexity to
minimize.
Everything vdpa does is either redundant or unnecessary compared to
VFIO in these environments.
Jason
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-22 12:25 ` Jason Gunthorpe
@ 2023-09-22 15:39 ` Michael S. Tsirkin
2023-09-22 16:19 ` Jason Gunthorpe
0 siblings, 1 reply; 181+ messages in thread
From: Michael S. Tsirkin @ 2023-09-22 15:39 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Jason Wang, Alex Williamson, Yishai Hadas, kvm, virtualization,
parav, feliu, jiri, kevin.tian, joao.m.martins, leonro, maorg
On Fri, Sep 22, 2023 at 09:25:01AM -0300, Jason Gunthorpe wrote:
> On Fri, Sep 22, 2023 at 11:02:50AM +0800, Jason Wang wrote:
> > On Fri, Sep 22, 2023 at 3:53 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >
> > > On Thu, Sep 21, 2023 at 03:34:03PM -0400, Michael S. Tsirkin wrote:
> > >
> > > > that's easy/practical. If instead VDPA gives the same speed with just
> > > > shadow vq then keeping this hack in vfio seems like less of a problem.
> > > > Finally if VDPA is faster then maybe you will reconsider using it ;)
> > >
> > > It is not all about the speed.
> > >
> > > VDPA presents another large and complex software stack in the
> > > hypervisor that can be eliminated by simply using VFIO.
> >
> > vDPA supports standard virtio devices so how did you define
> > complexity?
>
> As I said, VFIO is already required for other devices in these VMs. So
> anything incremental over base-line vfio-pci is complexity to
> minimize.
>
> Everything vdpa does is either redundant or unnecessary compared to
> VFIO in these environments.
>
> Jason
Yes but you know. There are all kinds of environments. I guess you
consider yours the most mainstream and important, and are sure it will
always stay like this. But if there's a driver that does what you need
then you use that. You really should be explaining what vdpa
*does not* do that you need.
But anyway, if Alex wants to maintain this it's not too bad,
but I would like to see more code move into a library
living under the virtio directory. As it is structured now
it will make virtio core development harder.
--
MST
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-22 15:39 ` Michael S. Tsirkin
@ 2023-09-22 16:19 ` Jason Gunthorpe
2023-09-25 18:16 ` Michael S. Tsirkin
0 siblings, 1 reply; 181+ messages in thread
From: Jason Gunthorpe @ 2023-09-22 16:19 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Jason Wang, Alex Williamson, Yishai Hadas, kvm, virtualization,
parav, feliu, jiri, kevin.tian, joao.m.martins, leonro, maorg
On Fri, Sep 22, 2023 at 11:39:19AM -0400, Michael S. Tsirkin wrote:
> On Fri, Sep 22, 2023 at 09:25:01AM -0300, Jason Gunthorpe wrote:
> > On Fri, Sep 22, 2023 at 11:02:50AM +0800, Jason Wang wrote:
> > > On Fri, Sep 22, 2023 at 3:53 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > > >
> > > > On Thu, Sep 21, 2023 at 03:34:03PM -0400, Michael S. Tsirkin wrote:
> > > >
> > > > > that's easy/practical. If instead VDPA gives the same speed with just
> > > > > shadow vq then keeping this hack in vfio seems like less of a problem.
> > > > > Finally if VDPA is faster then maybe you will reconsider using it ;)
> > > >
> > > > It is not all about the speed.
> > > >
> > > > VDPA presents another large and complex software stack in the
> > > > hypervisor that can be eliminated by simply using VFIO.
> > >
> > > vDPA supports standard virtio devices so how did you define
> > > complexity?
> >
> > As I said, VFIO is already required for other devices in these VMs. So
> > anything incremental over base-line vfio-pci is complexity to
> > minimize.
> >
> > Everything vdpa does is either redundant or unnecessary compared to
> > VFIO in these environments.
> >
> > Jason
>
> Yes but you know. There are all kinds of environments. I guess you
> consider yours the most mainstream and important, and are sure it will
> always stay like this. But if there's a driver that does what you need
> then you use that.
Come on, you are the one saying we cannot do things in the best way
possible because you want your way of doing things to be the only way
allowed. Which of us thinks "yours the most mainstream and important" ??
I'm not telling you to throw away VDPA, I'm saying there are
legitimate real world use cases where VFIO is the appropriate
interface, not VDPA.
I want choice, not dogmatic exclusion that there is Only One True Way.
> You really should be explaining what vdpa *does not* do that you
> need.
I think I've done that enough, but if you have been following my
explanation you should see that the entire point of this design is to
allow a virtio device to be created inside a DPU to a specific
detailed specification (eg an AWS virtio-net device, for instance)
The implementation is in the DPU, and only the DPU.
At the end of the day VDPA uses mediation and creates some
RedHat/VDPA/Qemu virtio-net device in the guest. It is emphatically
NOT a perfect recreation of the "AWS virtio-net" we started out with.
It entirely fails to achieve the most important thing it needs to do!
Yishai will rework the series with your remarks, we can look again on
v2, thanks for all the input!
Jason
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-22 16:19 ` Jason Gunthorpe
@ 2023-09-25 18:16 ` Michael S. Tsirkin
2023-09-25 18:53 ` Jason Gunthorpe
0 siblings, 1 reply; 181+ messages in thread
From: Michael S. Tsirkin @ 2023-09-25 18:16 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Jason Wang, Alex Williamson, Yishai Hadas, kvm, virtualization,
parav, feliu, jiri, kevin.tian, joao.m.martins, leonro, maorg
On Fri, Sep 22, 2023 at 01:19:28PM -0300, Jason Gunthorpe wrote:
> On Fri, Sep 22, 2023 at 11:39:19AM -0400, Michael S. Tsirkin wrote:
> > On Fri, Sep 22, 2023 at 09:25:01AM -0300, Jason Gunthorpe wrote:
> > > On Fri, Sep 22, 2023 at 11:02:50AM +0800, Jason Wang wrote:
> > > > On Fri, Sep 22, 2023 at 3:53 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > > > >
> > > > > On Thu, Sep 21, 2023 at 03:34:03PM -0400, Michael S. Tsirkin wrote:
> > > > >
> > > > > > that's easy/practical. If instead VDPA gives the same speed with just
> > > > > > shadow vq then keeping this hack in vfio seems like less of a problem.
> > > > > > Finally if VDPA is faster then maybe you will reconsider using it ;)
> > > > >
> > > > > It is not all about the speed.
> > > > >
> > > > > VDPA presents another large and complex software stack in the
> > > > > hypervisor that can be eliminated by simply using VFIO.
> > > >
> > > > vDPA supports standard virtio devices so how did you define
> > > > complexity?
> > >
> > > As I said, VFIO is already required for other devices in these VMs. So
> > > anything incremental over base-line vfio-pci is complexity to
> > > minimize.
> > >
> > > Everything vdpa does is either redundant or unnecessary compared to
> > > VFIO in these environments.
> > >
> > > Jason
> >
> > Yes but you know. There are all kinds of environments. I guess you
> > consider yours the most mainstream and important, and are sure it will
> > always stay like this. But if there's a driver that does what you need
> > then you use that.
>
> Come on, you are the one saying we cannot do things in the best way
> possible because you want your way of doing things to be the only way
> allowed. Which of us thinks "yours the most mainstream and important" ??
>
> I'm not telling you to throw away VDPA, I'm saying there are
> legitimate real world use cases where VFIO is the appropriate
> interface, not VDPA.
>
> I want choice, not dogmatic exclusion that there is Only One True Way.
I don't particularly think there's only one way, vfio is already there.
I am specifically thinking about this patch, for example it
muddies the waters a bit: normally, I think, vfio exposes the device
with the same ID; suddenly it changes the ID visible to the guest.
But again, whether doing this kind of thing is OK is more up to Alex than me.
I do want to understand if there's a use-case that vdpa does not address
simply because it might be worth while to extend it to do so, and a
bunch of people working on it are at Red Hat and I might have some input
into how that labor is allocated. But if the use-case is simply "has to
be vfio and not vdpa" then I guess not.
> > You really should be explaining what vdpa *does not* do that you
> > need.
>
> I think I've done that enough, but if you have been following my
> explanation you should see that the entire point of this design is to
> allow a virtio device to be created inside a DPU to a specific
> detailed specification (eg an AWS virtio-net device, for instance)
>
> The implementation is in the DPU, and only the DPU.
>
> At the end of the day VDPA uses mediation and creates some
> RedHat/VDPA/Qemu virtio-net device in the guest. It is emphatically
> NOT a perfect recreation of the "AWS virtio-net" we started out with.
>
> It entirely fails to achieve the most important thing it needs to do!
It could be that we are using mediation differently - in my world it's
when there's some host software on the path between guest and hardware,
and this qualifies. The difference between what this patch does and
what vdpa does seems quantitative, not qualitative. Which might be
enough to motivate this work, I don't mind. But you seem to feel
it is qualitative and I am genuinely curious about it, because
if yes then it might lead e.g. the virtio standard in new directions.
I can *imagine* all kind of reasons to want to use vfio as compared to vdpa;
here are some examples I came up with, quickly:
- maybe you have drivers that poke at registers not in virtio spec:
vfio allows that, vdpa by design does not
- maybe you are using vfio with a lot of devices already and don't want
to special-case handling for virtio devices on the host
do any of the above motivations ring a bell? Some of the things you
said seem to hint at that. If yes, maybe include this in the cover
letter.
There is also a question of capability. Specifically iommufd support
is lacking in vdpa (though there are finally some RFC patches to
address that). All this is fine, could be enough to motivate
a work like this one. But I am very curious to know if there
is any other capability lacking in vdpa. I asked already and you
didn't answer so I guess not?
> Yishai will rework the series with your remarks, we can look again on
> v2, thanks for all the input!
>
> Jason
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-25 18:16 ` Michael S. Tsirkin
@ 2023-09-25 18:53 ` Jason Gunthorpe
2023-09-25 19:52 ` Michael S. Tsirkin
0 siblings, 1 reply; 181+ messages in thread
From: Jason Gunthorpe @ 2023-09-25 18:53 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Jason Wang, Alex Williamson, Yishai Hadas, kvm, virtualization,
parav, feliu, jiri, kevin.tian, joao.m.martins, leonro, maorg
On Mon, Sep 25, 2023 at 02:16:30PM -0400, Michael S. Tsirkin wrote:
> I do want to understand if there's a use-case that vdpa does not address
> simply because it might be worth while to extend it to do so, and a
> bunch of people working on it are at Red Hat and I might have some input
> into how that labor is allocated. But if the use-case is simply "has to
> be vfio and not vdpa" then I guess not.
If you strip away all the philosophical arguing, VDPA has no way to
isolate the control and data virtqs to different IOMMU configurations
with this single PCI function.
The existing HW VDPA drivers provided device specific ways to handle
this.
Without DMA isolation you can't assign the high speed data virtq's to
the VM without mediating them as well.
> It could be that we are using mediation differently - in my world it's
> when there's some host software on the path between guest and hardware,
> and this qualifies.
That is pretty general. As I said to Jason, if you want to use it that
way then you need to make up a new word to describe what VDPA does as
there is a clear difference in scope between this VFIO patch (relay IO
commands to the device) and VDPA (intercept all the control plane,
control virtq and bring it to a RedHat/qemu standard common behavior)
> There is also a question of capability. Specifically iommufd support
> is lacking in vdpa (though there are finally some RFC patches to
> address that). All this is fine, could be enough to motivate
> a work like this one.
I've answered many times, you just don't seem to like the answers or
dismiss them as not relevant to you.
Jason
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-25 18:53 ` Jason Gunthorpe
@ 2023-09-25 19:52 ` Michael S. Tsirkin
0 siblings, 0 replies; 181+ messages in thread
From: Michael S. Tsirkin @ 2023-09-25 19:52 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Jason Wang, Alex Williamson, Yishai Hadas, kvm, virtualization,
parav, feliu, jiri, kevin.tian, joao.m.martins, leonro, maorg
On Mon, Sep 25, 2023 at 03:53:18PM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 25, 2023 at 02:16:30PM -0400, Michael S. Tsirkin wrote:
>
> > I do want to understand if there's a use-case that vdpa does not address
> > simply because it might be worth while to extend it to do so, and a
> > bunch of people working on it are at Red Hat and I might have some input
> > into how that labor is allocated. But if the use-case is simply "has to
> > be vfio and not vdpa" then I guess not.
>
> If you strip away all the philosophical arguing, VDPA has no way to
> isolate the control and data virtqs to different IOMMU configurations
> with this single PCI function.
Aha, so address space/PASID support then?
> The existing HW VDPA drivers provided device specific ways to handle
> this.
>
> Without DMA isolation you can't assign the high speed data virtq's to
> the VM without mediating them as well.
>
> > It could be that we are using mediation differently - in my world it's
> > when there's some host software on the path between guest and hardware,
> > and this qualifies.
>
> That is pretty general. As I said to Jason, if you want to use it that
> way then you need to make up a new word to describe what VDPA does as
> there is a clear difference in scope between this VFIO patch (relay IO
> commands to the device) and VDPA (intercept all the control plane,
> control virtq and bring it to a RedHat/qemu standard common behavior)
IIUC VDPA itself does not really bring it to either RedHat or qemu
standard, it just allows userspace to control behaviour - if userspace
is qemu then it's qemu deciding how it behaves. Which I guess this
doesn't. Right? RedHat's not in the picture at all I think.
> > There is also a question of capability. Specifically iommufd support
> > is lacking in vdpa (though there are finally some RFC patches to
> > address that). All this is fine, could be enough to motivate
> > a work like this one.
>
> I've answered many times, you just don't seem to like the answers or
> dismiss them as not relevant to you.
>
> Jason
Not really; I think I lack some of the picture so I don't fully
understand. Or maybe I missed something else.
--
MST
^ permalink raw reply [flat|nested] 181+ messages in thread
* RE: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-21 17:01 ` Michael S. Tsirkin
2023-09-21 17:07 ` Jason Gunthorpe
@ 2023-09-21 17:09 ` Parav Pandit
2023-09-21 17:24 ` Michael S. Tsirkin
1 sibling, 1 reply; 181+ messages in thread
From: Parav Pandit @ 2023-09-21 17:09 UTC (permalink / raw)
To: Michael S. Tsirkin, Jason Gunthorpe
Cc: Alex Williamson, Yishai Hadas, jasowang@redhat.com,
kvm@vger.kernel.org, virtualization@lists.linux-foundation.org,
Feng Liu, Jiri Pirko, kevin.tian@intel.com,
joao.m.martins@oracle.com, Leon Romanovsky, Maor Gottlieb
> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Thursday, September 21, 2023 10:31 PM
> Another question I'm interested in is whether there's actually a performance
> benefit to using this as compared to just software vhost. I note there's a VM exit
> on each IO access, so ... perhaps?
> Would be nice to see some numbers.
Packet rate and bandwidth are close: only 10% lower than a modern device, due to the batching of driver notifications.
Bandwidth was tested with iperf with one and multiple queues.
Packet rate was tested with testpmd.
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-21 17:09 ` Parav Pandit
@ 2023-09-21 17:24 ` Michael S. Tsirkin
0 siblings, 0 replies; 181+ messages in thread
From: Michael S. Tsirkin @ 2023-09-21 17:24 UTC (permalink / raw)
To: Parav Pandit
Cc: Jason Gunthorpe, Alex Williamson, Yishai Hadas,
jasowang@redhat.com, kvm@vger.kernel.org,
virtualization@lists.linux-foundation.org, Feng Liu, Jiri Pirko,
kevin.tian@intel.com, joao.m.martins@oracle.com, Leon Romanovsky,
Maor Gottlieb
On Thu, Sep 21, 2023 at 05:09:04PM +0000, Parav Pandit wrote:
>
>
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Thursday, September 21, 2023 10:31 PM
>
> > Another question I'm interested in is whether there's actually a performance
> > benefit to using this as compared to just software vhost. I note there's a VM exit
> > on each IO access, so ... perhaps?
> > Would be nice to see some numbers.
>
> Packet rate and bandwidth are close: only 10% lower than a modern device, due to the batching of driver notifications.
> Bandwidth was tested with iperf with one and multiple queues.
> Packet rate was tested with testpmd.
Nice, good to know. Could you compare this with vdpa with shadow vq
enabled? That's probably the closest equivalent that needs
no kernel or hardware work.
--
MST
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-21 12:40 ` [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices Yishai Hadas
` (2 preceding siblings ...)
2023-09-21 16:43 ` Alex Williamson
@ 2023-09-21 19:58 ` Alex Williamson
2023-09-21 20:01 ` Jason Gunthorpe
` (2 more replies)
2023-09-22 10:10 ` Michael S. Tsirkin
2023-09-22 15:53 ` Michael S. Tsirkin
5 siblings, 3 replies; 181+ messages in thread
From: Alex Williamson @ 2023-09-21 19:58 UTC (permalink / raw)
To: Yishai Hadas
Cc: mst, jasowang, jgg, kvm, virtualization, parav, feliu, jiri,
kevin.tian, joao.m.martins, leonro, maorg
On Thu, 21 Sep 2023 15:40:40 +0300
Yishai Hadas <yishaih@nvidia.com> wrote:
> Introduce a vfio driver over virtio devices to support the legacy
> interface functionality for VFs.
>
> Background, from the virtio spec [1].
> --------------------------------------------------------------------
> In some systems, there is a need to support a virtio legacy driver with
> a device that does not directly support the legacy interface. In such
> scenarios, a group owner device can provide the legacy interface
> functionality for the group member devices. The driver of the owner
> device can then access the legacy interface of a member device on behalf
> of the legacy member device driver.
>
> For example, with the SR-IOV group type, group members (VFs) can not
> present the legacy interface in an I/O BAR in BAR0 as expected by the
> legacy pci driver. If the legacy driver is running inside a virtual
> machine, the hypervisor executing the virtual machine can present a
> virtual device with an I/O BAR in BAR0. The hypervisor intercepts the
> legacy driver accesses to this I/O BAR and forwards them to the group
> owner device (PF) using group administration commands.
> --------------------------------------------------------------------
>
> Specifically, this driver adds support for a virtio-net VF to be exposed
> as a transitional device to a guest driver and allows the legacy IO BAR
> functionality on top.
>
> This allows a VM which uses a legacy virtio-net driver in the guest to
> work transparently over a VF which its driver in the host is that new
> driver.
>
> The driver can be extended easily to support some other types of virtio
> devices (e.g virtio-blk), by adding in a few places the specific type
> properties as was done for virtio-net.
>
> For now, only the virtio-net use case was tested and as such we introduce
> the support only for such a device.
>
> Practically,
> Upon probing a VF for a virtio-net device, in case its PF supports
> legacy access over the virtio admin commands and the VF doesn't have BAR
> 0, we set some specific 'vfio_device_ops' to be able to simulate in SW a
> transitional device with I/O BAR in BAR 0.
>
> The existence of the simulated I/O bar is reported later on by
> overwriting the VFIO_DEVICE_GET_REGION_INFO command and the device
> exposes itself as a transitional device by overwriting some properties
> upon reading its config space.
>
> Once we report the existence of I/O BAR as BAR 0 a legacy driver in the
> guest may use it via read/write calls according to the virtio
> specification.
>
> Any read/write towards the control parts of the BAR will be captured by
> the new driver and will be translated into admin commands towards the
> device.
>
> Any data path read/write access (i.e. virtio driver notifications) will
> be forwarded to the physical BAR which its properties were supplied by
> the command VIRTIO_PCI_QUEUE_NOTIFY upon the probing/init flow.
>
> With that code in place a legacy driver in the guest has the look and
> feel as if having a transitional device with legacy support for both its
> control and data path flows.
>
> [1]
> https://github.com/oasis-tcs/virtio-spec/commit/03c2d32e5093ca9f2a17797242fbef88efe94b8c
>
> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> ---
> MAINTAINERS | 6 +
> drivers/vfio/pci/Kconfig | 2 +
> drivers/vfio/pci/Makefile | 2 +
> drivers/vfio/pci/virtio/Kconfig | 15 +
> drivers/vfio/pci/virtio/Makefile | 4 +
> drivers/vfio/pci/virtio/cmd.c | 4 +-
> drivers/vfio/pci/virtio/cmd.h | 8 +
> drivers/vfio/pci/virtio/main.c | 546 +++++++++++++++++++++++++++++++
> 8 files changed, 585 insertions(+), 2 deletions(-)
> create mode 100644 drivers/vfio/pci/virtio/Kconfig
> create mode 100644 drivers/vfio/pci/virtio/Makefile
> create mode 100644 drivers/vfio/pci/virtio/main.c
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index bf0f54c24f81..5098418c8389 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -22624,6 +22624,12 @@ L: kvm@vger.kernel.org
> S: Maintained
> F: drivers/vfio/pci/mlx5/
>
> +VFIO VIRTIO PCI DRIVER
> +M: Yishai Hadas <yishaih@nvidia.com>
> +L: kvm@vger.kernel.org
> +S: Maintained
> +F: drivers/vfio/pci/virtio
> +
> VFIO PCI DEVICE SPECIFIC DRIVERS
> R: Jason Gunthorpe <jgg@nvidia.com>
> R: Yishai Hadas <yishaih@nvidia.com>
> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
> index 8125e5f37832..18c397df566d 100644
> --- a/drivers/vfio/pci/Kconfig
> +++ b/drivers/vfio/pci/Kconfig
> @@ -65,4 +65,6 @@ source "drivers/vfio/pci/hisilicon/Kconfig"
>
> source "drivers/vfio/pci/pds/Kconfig"
>
> +source "drivers/vfio/pci/virtio/Kconfig"
> +
> endmenu
> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> index 45167be462d8..046139a4eca5 100644
> --- a/drivers/vfio/pci/Makefile
> +++ b/drivers/vfio/pci/Makefile
> @@ -13,3 +13,5 @@ obj-$(CONFIG_MLX5_VFIO_PCI) += mlx5/
> obj-$(CONFIG_HISI_ACC_VFIO_PCI) += hisilicon/
>
> obj-$(CONFIG_PDS_VFIO_PCI) += pds/
> +
> +obj-$(CONFIG_VIRTIO_VFIO_PCI) += virtio/
> diff --git a/drivers/vfio/pci/virtio/Kconfig b/drivers/vfio/pci/virtio/Kconfig
> new file mode 100644
> index 000000000000..89eddce8b1bd
> --- /dev/null
> +++ b/drivers/vfio/pci/virtio/Kconfig
> @@ -0,0 +1,15 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +config VIRTIO_VFIO_PCI
> + tristate "VFIO support for VIRTIO PCI devices"
> + depends on VIRTIO_PCI
> + select VFIO_PCI_CORE
> + help
> + This provides support for exposing VIRTIO VF devices using the VFIO
> + framework that can work with a legacy virtio driver in the guest.
> + Based on PCIe spec, VFs do not support I/O Space; thus, VF BARs shall
> + not indicate I/O Space.
> + As of that this driver emulated I/O BAR in software to let a VF be
> + seen as a transitional device in the guest and let it work with
> + a legacy driver.
> +
> + If you don't know what to do here, say N.
> diff --git a/drivers/vfio/pci/virtio/Makefile b/drivers/vfio/pci/virtio/Makefile
> new file mode 100644
> index 000000000000..584372648a03
> --- /dev/null
> +++ b/drivers/vfio/pci/virtio/Makefile
> @@ -0,0 +1,4 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +obj-$(CONFIG_VIRTIO_VFIO_PCI) += virtio-vfio-pci.o
> +virtio-vfio-pci-y := main.o cmd.o
> +
> diff --git a/drivers/vfio/pci/virtio/cmd.c b/drivers/vfio/pci/virtio/cmd.c
> index f068239cdbb0..aea9d25fbf1d 100644
> --- a/drivers/vfio/pci/virtio/cmd.c
> +++ b/drivers/vfio/pci/virtio/cmd.c
> @@ -44,7 +44,7 @@ int virtiovf_cmd_lr_write(struct virtiovf_pci_core_device *virtvdev, u16 opcode,
> {
> struct virtio_device *virtio_dev =
> virtio_pci_vf_get_pf_dev(virtvdev->core_device.pdev);
> - struct virtio_admin_cmd_data_lr_write *in;
> + struct virtio_admin_cmd_legacy_wr_data *in;
> struct scatterlist in_sg;
> struct virtio_admin_cmd cmd = {};
> int ret;
> @@ -74,7 +74,7 @@ int virtiovf_cmd_lr_read(struct virtiovf_pci_core_device *virtvdev, u16 opcode,
> {
> struct virtio_device *virtio_dev =
> virtio_pci_vf_get_pf_dev(virtvdev->core_device.pdev);
> - struct virtio_admin_cmd_data_lr_read *in;
> + struct virtio_admin_cmd_legacy_rd_data *in;
> struct scatterlist in_sg, out_sg;
> struct virtio_admin_cmd cmd = {};
> int ret;
> diff --git a/drivers/vfio/pci/virtio/cmd.h b/drivers/vfio/pci/virtio/cmd.h
> index c2a3645f4b90..347b1dc85570 100644
> --- a/drivers/vfio/pci/virtio/cmd.h
> +++ b/drivers/vfio/pci/virtio/cmd.h
> @@ -13,7 +13,15 @@
>
> struct virtiovf_pci_core_device {
> struct vfio_pci_core_device core_device;
> + u8 bar0_virtual_buf_size;
> + u8 *bar0_virtual_buf;
> + /* synchronize access to the virtual buf */
> + struct mutex bar_mutex;
> int vf_id;
> + void __iomem *notify_addr;
> + u32 notify_offset;
> + u8 notify_bar;
> + u8 pci_cmd_io :1;
> };
>
> int virtiovf_cmd_list_query(struct pci_dev *pdev, u8 *buf, int buf_size);
> diff --git a/drivers/vfio/pci/virtio/main.c b/drivers/vfio/pci/virtio/main.c
> new file mode 100644
> index 000000000000..2486991c49f3
> --- /dev/null
> +++ b/drivers/vfio/pci/virtio/main.c
> @@ -0,0 +1,546 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved
> + */
> +
> +#include <linux/device.h>
> +#include <linux/module.h>
> +#include <linux/mutex.h>
> +#include <linux/pci.h>
> +#include <linux/pm_runtime.h>
> +#include <linux/types.h>
> +#include <linux/uaccess.h>
> +#include <linux/vfio.h>
> +#include <linux/vfio_pci_core.h>
> +#include <linux/virtio_pci.h>
> +#include <linux/virtio_net.h>
> +#include <linux/virtio_pci_modern.h>
> +
> +#include "cmd.h"
> +
> +#define VIRTIO_LEGACY_IO_BAR_HEADER_LEN 20
> +#define VIRTIO_LEGACY_IO_BAR_MSIX_HEADER_LEN 4
> +
> +static int virtiovf_issue_lr_cmd(struct virtiovf_pci_core_device *virtvdev,
> + loff_t pos, char __user *buf,
> + size_t count, bool read)
> +{
> + u8 *bar0_buf = virtvdev->bar0_virtual_buf;
> + u16 opcode;
> + int ret;
> +
> + mutex_lock(&virtvdev->bar_mutex);
> + if (read) {
> + opcode = (pos < VIRTIO_PCI_CONFIG_OFF(true)) ?
> + VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_READ :
> + VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_READ;
> + ret = virtiovf_cmd_lr_read(virtvdev, opcode, pos,
> + count, bar0_buf + pos);
> + if (ret)
> + goto out;
> + if (copy_to_user(buf, bar0_buf + pos, count))
> + ret = -EFAULT;
> + goto out;
> + }
> +
> + if (copy_from_user(bar0_buf + pos, buf, count)) {
> + ret = -EFAULT;
> + goto out;
> + }
> +
> + opcode = (pos < VIRTIO_PCI_CONFIG_OFF(true)) ?
> + VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_WRITE :
> + VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_WRITE;
> + ret = virtiovf_cmd_lr_write(virtvdev, opcode, pos, count,
> + bar0_buf + pos);
> +out:
> + mutex_unlock(&virtvdev->bar_mutex);
> + return ret;
> +}
> +
> +static int
> +translate_io_bar_to_mem_bar(struct virtiovf_pci_core_device *virtvdev,
> + loff_t pos, char __user *buf,
> + size_t count, bool read)
> +{
> + struct vfio_pci_core_device *core_device = &virtvdev->core_device;
> + u16 queue_notify;
> + int ret;
> +
> + if (pos + count > virtvdev->bar0_virtual_buf_size)
> + return -EINVAL;
> +
> + switch (pos) {
> + case VIRTIO_PCI_QUEUE_NOTIFY:
> + if (count != sizeof(queue_notify))
> + return -EINVAL;
> + if (read) {
> + ret = vfio_pci_ioread16(core_device, true, &queue_notify,
> + virtvdev->notify_addr);
> + if (ret)
> + return ret;
> + if (copy_to_user(buf, &queue_notify,
> + sizeof(queue_notify)))
> + return -EFAULT;
> + break;
> + }
> +
> + if (copy_from_user(&queue_notify, buf, count))
> + return -EFAULT;
> +
> + ret = vfio_pci_iowrite16(core_device, true, queue_notify,
> + virtvdev->notify_addr);
> + break;
> + default:
> + ret = virtiovf_issue_lr_cmd(virtvdev, pos, buf, count, read);
> + }
> +
> + return ret ? ret : count;
> +}
> +
> +static bool range_contains_range(loff_t range1_start, size_t count1,
> + loff_t range2_start, size_t count2,
> + loff_t *start_offset)
> +{
> + if (range1_start <= range2_start &&
> + range1_start + count1 >= range2_start + count2) {
> + *start_offset = range2_start - range1_start;
> + return true;
> + }
> + return false;
> +}
> +
> +static ssize_t virtiovf_pci_read_config(struct vfio_device *core_vdev,
> + char __user *buf, size_t count,
> + loff_t *ppos)
> +{
> + struct virtiovf_pci_core_device *virtvdev = container_of(
> + core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> + loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> + loff_t copy_offset;
> + __le32 val32;
> + __le16 val16;
> + u8 val8;
> + int ret;
> +
> + ret = vfio_pci_core_read(core_vdev, buf, count, ppos);
> + if (ret < 0)
> + return ret;
> +
> + if (range_contains_range(pos, count, PCI_DEVICE_ID, sizeof(val16),
> + &copy_offset)) {
> + val16 = cpu_to_le16(0x1000);
> + if (copy_to_user(buf + copy_offset, &val16, sizeof(val16)))
> + return -EFAULT;
> + }
So we take a 0x1041 ("Virtio 1.0 network device") and turn it into a
0x1000 ("Virtio network device"). Are there no features implied by the
device ID? NB, a byte-wise access would read the real device ID.
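
To make the byte-wise point concrete, here is a small standalone sketch (plain userspace C; it copies the range helper from the patch purely to demonstrate its behaviour) showing that a 1-byte config read cannot satisfy the containment check and therefore falls through to the real device ID:

#include <stdbool.h>
#include <stddef.h>

typedef long long loff_t;       /* userspace stand-in for the kernel type */
#define PCI_DEVICE_ID 0x02      /* config space offset of the 16-bit ID   */

/* Copy of the helper from the patch, only to demonstrate its behaviour. */
static bool range_contains_range(loff_t range1_start, size_t count1,
                                 loff_t range2_start, size_t count2,
                                 loff_t *start_offset)
{
        if (range1_start <= range2_start &&
            range1_start + count1 >= range2_start + count2) {
                *start_offset = range2_start - range1_start;
                return true;
        }
        return false;
}

int main(void)
{
        loff_t off;

        /* 2-byte read of the device ID: the check matches, 0x1000 is emulated */
        bool word = range_contains_range(PCI_DEVICE_ID, 2, PCI_DEVICE_ID, 2, &off);
        /* 1-byte read at the same offset: the check fails, a byte of the real
         * 0x1041 ID comes back from vfio_pci_core_read() */
        bool byte = range_contains_range(PCI_DEVICE_ID, 1, PCI_DEVICE_ID, 2, &off);

        return (word && !byte) ? 0 : 1;
}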
> +
> + if (virtvdev->pci_cmd_io &&
> + range_contains_range(pos, count, PCI_COMMAND, sizeof(val16),
> + &copy_offset)) {
> + if (copy_from_user(&val16, buf, sizeof(val16)))
> + return -EFAULT;
> + val16 |= cpu_to_le16(PCI_COMMAND_IO);
> + if (copy_to_user(buf + copy_offset, &val16, sizeof(val16)))
> + return -EFAULT;
> + }
So we can't turn off I/O memory.
> +
> + if (range_contains_range(pos, count, PCI_REVISION_ID, sizeof(val8),
> + &copy_offset)) {
> + /* Transional needs to have revision 0 */
> + val8 = 0;
> + if (copy_to_user(buf + copy_offset, &val8, sizeof(val8)))
> + return -EFAULT;
> + }
Surely some driver cares about this, right? How is this supposed to
work in a world where libvirt parses modules.alias and automatically
loads this driver rather than vfio-pci for all 0x1041 devices? We'd
need to denylist this driver to ever see the device for what it is.
> +
> + if (range_contains_range(pos, count, PCI_BASE_ADDRESS_0, sizeof(val32),
> + &copy_offset)) {
> + val32 = cpu_to_le32(PCI_BASE_ADDRESS_SPACE_IO);
> + if (copy_to_user(buf + copy_offset, &val32, sizeof(val32)))
> + return -EFAULT;
> + }
Sloppy BAR emulation compared to the real BARs. QEMU obviously doesn't
care.
> +
> + if (range_contains_range(pos, count, PCI_SUBSYSTEM_ID, sizeof(val16),
> + &copy_offset)) {
> + /* Transitional devices use the PCI subsystem device id as
> + * virtio device id, same as legacy driver always did.
> + */
Non-networking multi-line comment style throughout please.
> + val16 = cpu_to_le16(VIRTIO_ID_NET);
> + if (copy_to_user(buf + copy_offset, &val16, sizeof(val16)))
> + return -EFAULT;
> + }
> +
> + return count;
> +}
> +
> +static ssize_t
> +virtiovf_pci_core_read(struct vfio_device *core_vdev, char __user *buf,
> + size_t count, loff_t *ppos)
> +{
> + struct virtiovf_pci_core_device *virtvdev = container_of(
> + core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> + struct pci_dev *pdev = virtvdev->core_device.pdev;
> + unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
> + loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> + int ret;
> +
> + if (!count)
> + return 0;
> +
> + if (index == VFIO_PCI_CONFIG_REGION_INDEX)
> + return virtiovf_pci_read_config(core_vdev, buf, count, ppos);
> +
> + if (index != VFIO_PCI_BAR0_REGION_INDEX)
> + return vfio_pci_core_read(core_vdev, buf, count, ppos);
> +
> + ret = pm_runtime_resume_and_get(&pdev->dev);
> + if (ret) {
> + pci_info_ratelimited(pdev, "runtime resume failed %d\n",
> + ret);
> + return -EIO;
> + }
> +
> + ret = translate_io_bar_to_mem_bar(virtvdev, pos, buf, count, true);
If the heart of this driver is simply pretending to have an I/O BAR
where I/O accesses into that BAR are translated to accesses in the MMIO
BAR, why can't this be done in the VMM, ie. QEMU? Could I/O to MMIO
translation in QEMU improve performance (ex. if the MMIO is mmap'd and
can be accessed without bouncing back into kernel code)?
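
For what it's worth, a rough userspace sketch of that idea (error handling trimmed; the BAR index and notify offset are assumptions for the example): the VMM maps the memory BAR that holds the notify area through the standard vfio region info and then services trapped I/O-port accesses with plain loads/stores.

#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/vfio.h>

/* Sketch only: maps the BAR containing the notify area and returns a
 * pointer to the 16-bit notify register, so a trapped guest write to
 * VIRTIO_PCI_QUEUE_NOTIFY can become a single store in the VMM.
 */
static volatile uint16_t *map_notify(int device_fd, unsigned int bar_index,
                                     uint64_t notify_offset)
{
        struct vfio_region_info info = {
                .argsz = sizeof(info),
                .index = bar_index,
        };
        uint8_t *bar;

        if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &info))
                return NULL;
        bar = mmap(NULL, info.size, PROT_READ | PROT_WRITE, MAP_SHARED,
                   device_fd, info.offset);
        if (bar == (uint8_t *)MAP_FAILED)
                return NULL;
        return (volatile uint16_t *)(bar + notify_offset);
}

/* Usage on a trapped guest outw to the queue notify register:
 *      *notify = queue_index;
 */

The control-plane registers would still need the admin-command relay, so only the datapath hot path avoids bouncing through the kernel.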
> + pm_runtime_put(&pdev->dev);
> + return ret;
> +}
> +
> +static ssize_t
> +virtiovf_pci_core_write(struct vfio_device *core_vdev, const char __user *buf,
> + size_t count, loff_t *ppos)
> +{
> + struct virtiovf_pci_core_device *virtvdev = container_of(
> + core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> + struct pci_dev *pdev = virtvdev->core_device.pdev;
> + unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
> + loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> + int ret;
> +
> + if (!count)
> + return 0;
> +
> + if (index == VFIO_PCI_CONFIG_REGION_INDEX) {
> + loff_t copy_offset;
> + u16 cmd;
> +
> + if (range_contains_range(pos, count, PCI_COMMAND, sizeof(cmd),
> + ©_offset)) {
> + if (copy_from_user(&cmd, buf + copy_offset, sizeof(cmd)))
> + return -EFAULT;
> + virtvdev->pci_cmd_io = (cmd & PCI_COMMAND_IO);
If we're tracking writes to PCI_COMMAND_IO, why did we statically
report I/O enabled in the read function previously?
> + }
> + }
> +
> + if (index != VFIO_PCI_BAR0_REGION_INDEX)
> + return vfio_pci_core_write(core_vdev, buf, count, ppos);
> +
> + ret = pm_runtime_resume_and_get(&pdev->dev);
> + if (ret) {
> + pci_info_ratelimited(pdev, "runtime resume failed %d\n", ret);
> + return -EIO;
> + }
> +
> + ret = translate_io_bar_to_mem_bar(virtvdev, pos, (char __user *)buf, count, false);
> + pm_runtime_put(&pdev->dev);
> + return ret;
> +}
> +
> +static int
> +virtiovf_pci_ioctl_get_region_info(struct vfio_device *core_vdev,
> + unsigned int cmd, unsigned long arg)
> +{
> + struct virtiovf_pci_core_device *virtvdev = container_of(
> + core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> + unsigned long minsz = offsetofend(struct vfio_region_info, offset);
> + void __user *uarg = (void __user *)arg;
> + struct vfio_region_info info = {};
> +
> + if (copy_from_user(&info, uarg, minsz))
> + return -EFAULT;
> +
> + if (info.argsz < minsz)
> + return -EINVAL;
> +
> + switch (info.index) {
> + case VFIO_PCI_BAR0_REGION_INDEX:
> + info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
> + info.size = virtvdev->bar0_virtual_buf_size;
> + info.flags = VFIO_REGION_INFO_FLAG_READ |
> + VFIO_REGION_INFO_FLAG_WRITE;
> + return copy_to_user(uarg, &info, minsz) ? -EFAULT : 0;
> + default:
> + return vfio_pci_core_ioctl(core_vdev, cmd, arg);
> + }
> +}
> +
> +static long
> +virtiovf_vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
> + unsigned long arg)
> +{
> + switch (cmd) {
> + case VFIO_DEVICE_GET_REGION_INFO:
> + return virtiovf_pci_ioctl_get_region_info(core_vdev, cmd, arg);
> + default:
> + return vfio_pci_core_ioctl(core_vdev, cmd, arg);
> + }
> +}
> +
> +static int
> +virtiovf_set_notify_addr(struct virtiovf_pci_core_device *virtvdev)
> +{
> + struct vfio_pci_core_device *core_device = &virtvdev->core_device;
> + int ret;
> +
> + /* Setup the BAR where the 'notify' exists to be used by vfio as well
> + * This will let us mmap it only once and use it when needed.
> + */
> + ret = vfio_pci_core_setup_barmap(core_device,
> + virtvdev->notify_bar);
> + if (ret)
> + return ret;
> +
> + virtvdev->notify_addr = core_device->barmap[virtvdev->notify_bar] +
> + virtvdev->notify_offset;
> + return 0;
> +}
> +
> +static int virtiovf_pci_open_device(struct vfio_device *core_vdev)
> +{
> + struct virtiovf_pci_core_device *virtvdev = container_of(
> + core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> + struct vfio_pci_core_device *vdev = &virtvdev->core_device;
> + int ret;
> +
> + ret = vfio_pci_core_enable(vdev);
> + if (ret)
> + return ret;
> +
> + if (virtvdev->bar0_virtual_buf) {
> + /* upon close_device() the vfio_pci_core_disable() is called
> + * and will close all the previous mmaps, so it seems that the
> + * valid life cycle for the 'notify' addr is per open/close.
> + */
> + ret = virtiovf_set_notify_addr(virtvdev);
> + if (ret) {
> + vfio_pci_core_disable(vdev);
> + return ret;
> + }
> + }
> +
> + vfio_pci_core_finish_enable(vdev);
> + return 0;
> +}
> +
> +static void virtiovf_pci_close_device(struct vfio_device *core_vdev)
> +{
> + vfio_pci_core_close_device(core_vdev);
> +}
Why does this function exist?
> +
> +static int virtiovf_get_device_config_size(unsigned short device)
> +{
> + switch (device) {
> + case 0x1041:
> + /* network card */
> + return offsetofend(struct virtio_net_config, status);
> + default:
> + return 0;
> + }
> +}
> +
> +static int virtiovf_read_notify_info(struct virtiovf_pci_core_device *virtvdev)
> +{
> + u64 offset;
> + int ret;
> + u8 bar;
> +
> + ret = virtiovf_cmd_lq_read_notify(virtvdev,
> + VIRTIO_ADMIN_CMD_NOTIFY_INFO_FLAGS_OWNER_MEM,
> + &bar, &offset);
> + if (ret)
> + return ret;
> +
> + virtvdev->notify_bar = bar;
> + virtvdev->notify_offset = offset;
> + return 0;
> +}
> +
> +static int virtiovf_pci_init_device(struct vfio_device *core_vdev)
> +{
> + struct virtiovf_pci_core_device *virtvdev = container_of(
> + core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> + struct pci_dev *pdev;
> + int ret;
> +
> + ret = vfio_pci_core_init_dev(core_vdev);
> + if (ret)
> + return ret;
> +
> + pdev = virtvdev->core_device.pdev;
> + virtvdev->vf_id = pci_iov_vf_id(pdev);
> + if (virtvdev->vf_id < 0)
> + return -EINVAL;
vf_id is never used.
> +
> + ret = virtiovf_read_notify_info(virtvdev);
> + if (ret)
> + return ret;
> +
> + virtvdev->bar0_virtual_buf_size = VIRTIO_LEGACY_IO_BAR_HEADER_LEN +
> + VIRTIO_LEGACY_IO_BAR_MSIX_HEADER_LEN +
> + virtiovf_get_device_config_size(pdev->device);
> + virtvdev->bar0_virtual_buf = kzalloc(virtvdev->bar0_virtual_buf_size,
> + GFP_KERNEL);
> + if (!virtvdev->bar0_virtual_buf)
> + return -ENOMEM;
> + mutex_init(&virtvdev->bar_mutex);
> + return 0;
> +}
> +
> +static void virtiovf_pci_core_release_dev(struct vfio_device *core_vdev)
> +{
> + struct virtiovf_pci_core_device *virtvdev = container_of(
> + core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> +
> + kfree(virtvdev->bar0_virtual_buf);
> + vfio_pci_core_release_dev(core_vdev);
> +}
> +
> +static const struct vfio_device_ops virtiovf_acc_vfio_pci_tran_ops = {
> + .name = "virtio-transitional-vfio-pci",
> + .init = virtiovf_pci_init_device,
> + .release = virtiovf_pci_core_release_dev,
> + .open_device = virtiovf_pci_open_device,
> + .close_device = virtiovf_pci_close_device,
> + .ioctl = virtiovf_vfio_pci_core_ioctl,
> + .read = virtiovf_pci_core_read,
> + .write = virtiovf_pci_core_write,
> + .mmap = vfio_pci_core_mmap,
> + .request = vfio_pci_core_request,
> + .match = vfio_pci_core_match,
> + .bind_iommufd = vfio_iommufd_physical_bind,
> + .unbind_iommufd = vfio_iommufd_physical_unbind,
> + .attach_ioas = vfio_iommufd_physical_attach_ioas,
> +};
> +
> +static const struct vfio_device_ops virtiovf_acc_vfio_pci_ops = {
> + .name = "virtio-acc-vfio-pci",
> + .init = vfio_pci_core_init_dev,
> + .release = vfio_pci_core_release_dev,
> + .open_device = virtiovf_pci_open_device,
> + .close_device = virtiovf_pci_close_device,
> + .ioctl = vfio_pci_core_ioctl,
> + .device_feature = vfio_pci_core_ioctl_feature,
> + .read = vfio_pci_core_read,
> + .write = vfio_pci_core_write,
> + .mmap = vfio_pci_core_mmap,
> + .request = vfio_pci_core_request,
> + .match = vfio_pci_core_match,
> + .bind_iommufd = vfio_iommufd_physical_bind,
> + .unbind_iommufd = vfio_iommufd_physical_unbind,
> + .attach_ioas = vfio_iommufd_physical_attach_ioas,
> +};
Why are we claiming devices that should just use vfio-pci instead?
> +
> +static bool virtiovf_bar0_exists(struct pci_dev *pdev)
> +{
> + struct resource *res = pdev->resource;
> +
> + return res->flags ? true : false;
> +}
> +
> +#define VIRTIOVF_USE_ADMIN_CMD_BITMAP \
> + (BIT_ULL(VIRTIO_ADMIN_CMD_LIST_QUERY) | \
> + BIT_ULL(VIRTIO_ADMIN_CMD_LIST_USE) | \
> + BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_WRITE) | \
> + BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_READ) | \
> + BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_WRITE) | \
> + BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_READ) | \
> + BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_NOTIFY_INFO))
> +
> +static bool virtiovf_support_legacy_access(struct pci_dev *pdev)
> +{
> + int buf_size = DIV_ROUND_UP(VIRTIO_ADMIN_MAX_CMD_OPCODE, 64) * 8;
> + u8 *buf;
> + int ret;
> +
> + /* Only virtio-net is supported/tested so far */
> + if (pdev->device != 0x1041)
> + return false;
Seems like the ID table should handle this, why are we preemptively
claiming all virtio devices... or actually all 0x1af4 devices, which
might not even be virtio, ex. the non-virtio ivshmem device is 0x1110.
> +
> + buf = kzalloc(buf_size, GFP_KERNEL);
> + if (!buf)
> + return false;
> +
> + ret = virtiovf_cmd_list_query(pdev, buf, buf_size);
> + if (ret)
> + goto end;
> +
> + if ((le64_to_cpup((__le64 *)buf) & VIRTIOVF_USE_ADMIN_CMD_BITMAP) !=
> + VIRTIOVF_USE_ADMIN_CMD_BITMAP) {
> + ret = -EOPNOTSUPP;
> + goto end;
> + }
> +
> + /* confirm the used commands */
> + memset(buf, 0, buf_size);
> + *(__le64 *)buf = cpu_to_le64(VIRTIOVF_USE_ADMIN_CMD_BITMAP);
> + ret = virtiovf_cmd_list_use(pdev, buf, buf_size);
> +
> +end:
> + kfree(buf);
> + return ret ? false : true;
> +}
> +
> +static int virtiovf_pci_probe(struct pci_dev *pdev,
> + const struct pci_device_id *id)
> +{
> + const struct vfio_device_ops *ops = &virtiovf_acc_vfio_pci_ops;
> + struct virtiovf_pci_core_device *virtvdev;
> + int ret;
> +
> + if (pdev->is_virtfn && virtiovf_support_legacy_access(pdev) &&
> + !virtiovf_bar0_exists(pdev) && pdev->msix_cap)
> + ops = &virtiovf_acc_vfio_pci_tran_ops;
> +
> + virtvdev = vfio_alloc_device(virtiovf_pci_core_device, core_device.vdev,
> + &pdev->dev, ops);
> + if (IS_ERR(virtvdev))
> + return PTR_ERR(virtvdev);
> +
> + dev_set_drvdata(&pdev->dev, &virtvdev->core_device);
> + ret = vfio_pci_core_register_device(&virtvdev->core_device);
> + if (ret)
> + goto out;
> + return 0;
> +out:
> + vfio_put_device(&virtvdev->core_device.vdev);
> + return ret;
> +}
> +
> +static void virtiovf_pci_remove(struct pci_dev *pdev)
> +{
> + struct virtiovf_pci_core_device *virtvdev = dev_get_drvdata(&pdev->dev);
> +
> + vfio_pci_core_unregister_device(&virtvdev->core_device);
> + vfio_put_device(&virtvdev->core_device.vdev);
> +}
> +
> +static const struct pci_device_id virtiovf_pci_table[] = {
> + { PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_REDHAT_QUMRANET, PCI_ANY_ID) },
libvirt will blindly use this driver for all devices matching this as
we've discussed how it should make use of modules.alias. I don't think
this driver should be squatting on devices where it doesn't add value
and it's not clear whether this is adding or subtracting value in all
cases for the one NIC that it modifies. How should libvirt choose when
and where to use this driver? What regressions are we going to see
with VMs that previously saw "modern" virtio-net devices and now see a
legacy compatible device? Thanks,
Alex
> + {}
> +};
> +
> +MODULE_DEVICE_TABLE(pci, virtiovf_pci_table);
> +
> +static struct pci_driver virtiovf_pci_driver = {
> + .name = KBUILD_MODNAME,
> + .id_table = virtiovf_pci_table,
> + .probe = virtiovf_pci_probe,
> + .remove = virtiovf_pci_remove,
> + .err_handler = &vfio_pci_core_err_handlers,
> + .driver_managed_dma = true,
> +};
> +
> +module_pci_driver(virtiovf_pci_driver);
> +
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Yishai Hadas <yishaih@nvidia.com>");
> +MODULE_DESCRIPTION(
> + "VIRTIO VFIO PCI - User Level meta-driver for VIRTIO device family");
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-21 19:58 ` Alex Williamson
@ 2023-09-21 20:01 ` Jason Gunthorpe
2023-09-21 20:20 ` Michael S. Tsirkin
2023-09-22 12:37 ` Jason Gunthorpe
2023-09-26 15:20 ` Yishai Hadas
2 siblings, 1 reply; 181+ messages in thread
From: Jason Gunthorpe @ 2023-09-21 20:01 UTC (permalink / raw)
To: Alex Williamson
Cc: Yishai Hadas, mst, jasowang, kvm, virtualization, parav, feliu,
jiri, kevin.tian, joao.m.martins, leonro, maorg
On Thu, Sep 21, 2023 at 01:58:32PM -0600, Alex Williamson wrote:
> > +static const struct pci_device_id virtiovf_pci_table[] = {
> > + { PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_REDHAT_QUMRANET, PCI_ANY_ID) },
>
> libvirt will blindly use this driver for all devices matching this as
> we've discussed how it should make use of modules.alias. I don't think
> this driver should be squatting on devices where it doesn't add value
> and it's not clear whether this is adding or subtracting value in all
> cases for the one NIC that it modifies. How should libvirt choose when
> and where to use this driver? What regressions are we going to see
> with VMs that previously saw "modern" virtio-net devices and now see a
> legacy compatible device? Thanks,
Maybe this approach needs to use a subsystem ID match?
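Something along these lines, i.e. keep the VFIO override semantics but only
match a specific subsystem ID rather than every 0x1af4 device (the subsystem
values below are purely illustrative, I don't know what the real device
reports):

	static const struct pci_device_id virtiovf_pci_table[] = {
		{ .vendor = PCI_VENDOR_ID_REDHAT_QUMRANET, .device = 0x1041,
		  /* made-up subsystem IDs, just to show the shape of the match */
		  .subvendor = PCI_VENDOR_ID_REDHAT_QUMRANET, .subdevice = 0x0040,
		  .override_only = PCI_ID_F_VFIO_DRIVER_OVERRIDE },
		{}
	};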
Jason
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-21 20:01 ` Jason Gunthorpe
@ 2023-09-21 20:20 ` Michael S. Tsirkin
2023-09-21 20:59 ` Alex Williamson
0 siblings, 1 reply; 181+ messages in thread
From: Michael S. Tsirkin @ 2023-09-21 20:20 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Alex Williamson, Yishai Hadas, jasowang, kvm, virtualization,
parav, feliu, jiri, kevin.tian, joao.m.martins, leonro, maorg
On Thu, Sep 21, 2023 at 05:01:21PM -0300, Jason Gunthorpe wrote:
> On Thu, Sep 21, 2023 at 01:58:32PM -0600, Alex Williamson wrote:
>
> > > +static const struct pci_device_id virtiovf_pci_table[] = {
> > > + { PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_REDHAT_QUMRANET, PCI_ANY_ID) },
> >
> > libvirt will blindly use this driver for all devices matching this as
> > we've discussed how it should make use of modules.alias. I don't think
> > this driver should be squatting on devices where it doesn't add value
> > and it's not clear whether this is adding or subtracting value in all
> > cases for the one NIC that it modifies. How should libvirt choose when
> > and where to use this driver? What regressions are we going to see
> > with VMs that previously saw "modern" virtio-net devices and now see a
> > legacy compatible device? Thanks,
>
> Maybe this approach needs to use a subsystem ID match?
>
> Jason
Maybe make users load it manually?
Please don't bind to virtio by default, you will break
all guests.
--
MST
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-21 20:20 ` Michael S. Tsirkin
@ 2023-09-21 20:59 ` Alex Williamson
0 siblings, 0 replies; 181+ messages in thread
From: Alex Williamson @ 2023-09-21 20:59 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Jason Gunthorpe, Yishai Hadas, jasowang, kvm, virtualization,
parav, feliu, jiri, kevin.tian, joao.m.martins, leonro, maorg
On Thu, 21 Sep 2023 16:20:59 -0400
"Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Thu, Sep 21, 2023 at 05:01:21PM -0300, Jason Gunthorpe wrote:
> > On Thu, Sep 21, 2023 at 01:58:32PM -0600, Alex Williamson wrote:
> >
> > > > +static const struct pci_device_id virtiovf_pci_table[] = {
> > > > + { PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_REDHAT_QUMRANET, PCI_ANY_ID) },
> > >
> > > libvirt will blindly use this driver for all devices matching this as
> > > we've discussed how it should make use of modules.alias. I don't think
> > > this driver should be squatting on devices where it doesn't add value
> > > and it's not clear whether this is adding or subtracting value in all
> > > cases for the one NIC that it modifies. How should libvirt choose when
> > > and where to use this driver? What regressions are we going to see
> > > with VMs that previously saw "modern" virtio-net devices and now see a
> > > legacy compatible device? Thanks,
> >
> > Maybe this approach needs to use a subsystem ID match?
> >
> > Jason
>
> Maybe make users load it manually?
>
> Please don't bind to virtio by default, you will break
> all guests.
This would never bind by default, it's only bound as a vfio override
driver, but if libvirt were trying to determine the correct driver to
use with vfio for a 0x1af4 device, it'd land on this one. Thanks,
Alex
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-21 19:58 ` Alex Williamson
2023-09-21 20:01 ` Jason Gunthorpe
@ 2023-09-22 12:37 ` Jason Gunthorpe
2023-09-22 12:59 ` Parav Pandit
2023-09-26 15:20 ` Yishai Hadas
2 siblings, 1 reply; 181+ messages in thread
From: Jason Gunthorpe @ 2023-09-22 12:37 UTC (permalink / raw)
To: Alex Williamson
Cc: Yishai Hadas, mst, jasowang, kvm, virtualization, parav, feliu,
jiri, kevin.tian, joao.m.martins, leonro, maorg
On Thu, Sep 21, 2023 at 01:58:32PM -0600, Alex Williamson wrote:
> If the heart of this driver is simply pretending to have an I/O BAR
> where I/O accesses into that BAR are translated to accesses in the MMIO
> BAR, why can't this be done in the VMM, ie. QEMU?
That isn't exactly what it does: the IO bar access is translated into
an admin queue command on the PF and executed by the PCI function.
So it would be difficult to do that in qemu without also somehow
wiring up qemu to access the PF's kernel driver's admin queue.
It would have been nice if it was a trivial 1:1 translation to the
MMIO bar, but it seems that didn't entirely work with existing VMs. So
OASIS standardized this approach.
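Roughly, a legacy I/O register write from the guest becomes something like
this in the VF driver (sketch only, reconstructed from memory of this
series' cmd.c -- not literal code; locals like 'pos'/'buf'/'count' are
illustrative and error handling is elided):

	struct virtio_device *pf_dev =
		virtio_pci_vf_get_pf_dev(virtvdev->core_device.pdev);
	struct virtio_admin_cmd_legacy_wr_data *in;
	struct virtio_admin_cmd cmd = {};
	struct scatterlist in_sg;

	in = kzalloc(sizeof(*in) + count, GFP_KERNEL);
	in->offset = pos;			/* register offset in the legacy layout */
	memcpy(in->registers, buf, count);	/* bytes the guest wrote to the I/O BAR */
	sg_init_one(&in_sg, in, sizeof(*in) + count);

	cmd.opcode = cpu_to_le16(VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_WRITE);
	cmd.group_type = cpu_to_le16(VIRTIO_ADMIN_GROUP_TYPE_SRIOV);
	cmd.group_member_id = cpu_to_le64(virtvdev->vf_id + 1);
	cmd.data_sg = &in_sg;
	/* the PF driver then executes this command on its admin virtqueue */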
The bigger picture is there is also a live migration standard & driver
in the works that will re-use all this admin queue infrastructure
anyhow, so the best course is to keep this in the kernel.
Jason
^ permalink raw reply [flat|nested] 181+ messages in thread
* RE: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-22 12:37 ` Jason Gunthorpe
@ 2023-09-22 12:59 ` Parav Pandit
0 siblings, 0 replies; 181+ messages in thread
From: Parav Pandit @ 2023-09-22 12:59 UTC (permalink / raw)
To: Jason Gunthorpe, Alex Williamson
Cc: Yishai Hadas, mst@redhat.com, jasowang@redhat.com,
kvm@vger.kernel.org, virtualization@lists.linux-foundation.org,
Feng Liu, Jiri Pirko, kevin.tian@intel.com,
joao.m.martins@oracle.com, Leon Romanovsky, Maor Gottlieb
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, September 22, 2023 6:07 PM
>
> On Thu, Sep 21, 2023 at 01:58:32PM -0600, Alex Williamson wrote:
>
> > If the heart of this driver is simply pretending to have an I/O BAR
> > where I/O accesses into that BAR are translated to accesses in the
> > MMIO BAR, why can't this be done in the VMM, ie. QEMU?
>
> That isn't exactly what it does: the IO bar access is translated into an admin
> queue command on the PF and executed by the PCI function.
>
> So it would be difficult to do that in qemu without also somehow wiring up
> qemu to access the PF's kernel driver's admin queue.
>
> It would have been nice if it was a trivial 1:1 translation to the MMIO bar, but it
> seems that didn't entirely work with existing VMs. So OASIS standardized this
> approach.
>
> The bigger picture is there is also a live migration standard & driver in the
> works that will re-use all this admin queue infrastructure anyhow, so the best
> course is to keep this in the kernel.
Additionally, in the future the AQ of the PF will also be used to provision the VFs (virtio OASIS calls them member devices); that framework also resides in the kernel, and such PFs are driven by the kernel driver.
+1 for keeping this framework in the kernel.
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-21 19:58 ` Alex Williamson
2023-09-21 20:01 ` Jason Gunthorpe
2023-09-22 12:37 ` Jason Gunthorpe
@ 2023-09-26 15:20 ` Yishai Hadas
2023-09-26 17:00 ` Michael S. Tsirkin
2 siblings, 1 reply; 181+ messages in thread
From: Yishai Hadas @ 2023-09-26 15:20 UTC (permalink / raw)
To: Alex Williamson
Cc: mst, jasowang, jgg, kvm, virtualization, parav, feliu, jiri,
kevin.tian, joao.m.martins, leonro, maorg
On 21/09/2023 22:58, Alex Williamson wrote:
> On Thu, 21 Sep 2023 15:40:40 +0300
> Yishai Hadas <yishaih@nvidia.com> wrote:
>
>> Introduce a vfio driver over virtio devices to support the legacy
>> interface functionality for VFs.
>>
>> Background, from the virtio spec [1].
>> --------------------------------------------------------------------
>> In some systems, there is a need to support a virtio legacy driver with
>> a device that does not directly support the legacy interface. In such
>> scenarios, a group owner device can provide the legacy interface
>> functionality for the group member devices. The driver of the owner
>> device can then access the legacy interface of a member device on behalf
>> of the legacy member device driver.
>>
>> For example, with the SR-IOV group type, group members (VFs) can not
>> present the legacy interface in an I/O BAR in BAR0 as expected by the
>> legacy pci driver. If the legacy driver is running inside a virtual
>> machine, the hypervisor executing the virtual machine can present a
>> virtual device with an I/O BAR in BAR0. The hypervisor intercepts the
>> legacy driver accesses to this I/O BAR and forwards them to the group
>> owner device (PF) using group administration commands.
>> --------------------------------------------------------------------
>>
>> Specifically, this driver adds support for a virtio-net VF to be exposed
>> as a transitional device to a guest driver and allows the legacy IO BAR
>> functionality on top.
>>
>> This allows a VM which uses a legacy virtio-net driver in the guest to
>> work transparently over a VF which its driver in the host is that new
>> driver.
>>
>> The driver can be extended easily to support some other types of virtio
>> devices (e.g virtio-blk), by adding in a few places the specific type
>> properties as was done for virtio-net.
>>
>> For now, only the virtio-net use case was tested and as such we introduce
>> the support only for such a device.
>>
>> Practically,
>> Upon probing a VF for a virtio-net device, in case its PF supports
>> legacy access over the virtio admin commands and the VF doesn't have BAR
>> 0, we set some specific 'vfio_device_ops' to be able to simulate in SW a
>> transitional device with I/O BAR in BAR 0.
>>
>> The existence of the simulated I/O bar is reported later on by
>> overwriting the VFIO_DEVICE_GET_REGION_INFO command and the device
>> exposes itself as a transitional device by overwriting some properties
>> upon reading its config space.
>>
>> Once we report the existence of I/O BAR as BAR 0 a legacy driver in the
>> guest may use it via read/write calls according to the virtio
>> specification.
>>
>> Any read/write towards the control parts of the BAR will be captured by
>> the new driver and will be translated into admin commands towards the
>> device.
>>
>> Any data path read/write access (i.e. virtio driver notifications) will
>> be forwarded to the physical BAR which its properties were supplied by
>> the command VIRTIO_PCI_QUEUE_NOTIFY upon the probing/init flow.
>>
>> With that code in place a legacy driver in the guest has the look and
>> feel as if having a transitional device with legacy support for both its
>> control and data path flows.
>>
>> [1]
>> https://github.com/oasis-tcs/virtio-spec/commit/03c2d32e5093ca9f2a17797242fbef88efe94b8c
>>
>> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
>> ---
>> MAINTAINERS | 6 +
>> drivers/vfio/pci/Kconfig | 2 +
>> drivers/vfio/pci/Makefile | 2 +
>> drivers/vfio/pci/virtio/Kconfig | 15 +
>> drivers/vfio/pci/virtio/Makefile | 4 +
>> drivers/vfio/pci/virtio/cmd.c | 4 +-
>> drivers/vfio/pci/virtio/cmd.h | 8 +
>> drivers/vfio/pci/virtio/main.c | 546 +++++++++++++++++++++++++++++++
>> 8 files changed, 585 insertions(+), 2 deletions(-)
>> create mode 100644 drivers/vfio/pci/virtio/Kconfig
>> create mode 100644 drivers/vfio/pci/virtio/Makefile
>> create mode 100644 drivers/vfio/pci/virtio/main.c
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index bf0f54c24f81..5098418c8389 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -22624,6 +22624,12 @@ L: kvm@vger.kernel.org
>> S: Maintained
>> F: drivers/vfio/pci/mlx5/
>>
>> +VFIO VIRTIO PCI DRIVER
>> +M: Yishai Hadas <yishaih@nvidia.com>
>> +L: kvm@vger.kernel.org
>> +S: Maintained
>> +F: drivers/vfio/pci/virtio
>> +
>> VFIO PCI DEVICE SPECIFIC DRIVERS
>> R: Jason Gunthorpe <jgg@nvidia.com>
>> R: Yishai Hadas <yishaih@nvidia.com>
>> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
>> index 8125e5f37832..18c397df566d 100644
>> --- a/drivers/vfio/pci/Kconfig
>> +++ b/drivers/vfio/pci/Kconfig
>> @@ -65,4 +65,6 @@ source "drivers/vfio/pci/hisilicon/Kconfig"
>>
>> source "drivers/vfio/pci/pds/Kconfig"
>>
>> +source "drivers/vfio/pci/virtio/Kconfig"
>> +
>> endmenu
>> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
>> index 45167be462d8..046139a4eca5 100644
>> --- a/drivers/vfio/pci/Makefile
>> +++ b/drivers/vfio/pci/Makefile
>> @@ -13,3 +13,5 @@ obj-$(CONFIG_MLX5_VFIO_PCI) += mlx5/
>> obj-$(CONFIG_HISI_ACC_VFIO_PCI) += hisilicon/
>>
>> obj-$(CONFIG_PDS_VFIO_PCI) += pds/
>> +
>> +obj-$(CONFIG_VIRTIO_VFIO_PCI) += virtio/
>> diff --git a/drivers/vfio/pci/virtio/Kconfig b/drivers/vfio/pci/virtio/Kconfig
>> new file mode 100644
>> index 000000000000..89eddce8b1bd
>> --- /dev/null
>> +++ b/drivers/vfio/pci/virtio/Kconfig
>> @@ -0,0 +1,15 @@
>> +# SPDX-License-Identifier: GPL-2.0-only
>> +config VIRTIO_VFIO_PCI
>> + tristate "VFIO support for VIRTIO PCI devices"
>> + depends on VIRTIO_PCI
>> + select VFIO_PCI_CORE
>> + help
>> + This provides support for exposing VIRTIO VF devices using the VFIO
>> + framework that can work with a legacy virtio driver in the guest.
>> + Based on PCIe spec, VFs do not support I/O Space; thus, VF BARs shall
>> + not indicate I/O Space.
>> + As of that this driver emulated I/O BAR in software to let a VF be
>> + seen as a transitional device in the guest and let it work with
>> + a legacy driver.
>> +
>> + If you don't know what to do here, say N.
>> diff --git a/drivers/vfio/pci/virtio/Makefile b/drivers/vfio/pci/virtio/Makefile
>> new file mode 100644
>> index 000000000000..584372648a03
>> --- /dev/null
>> +++ b/drivers/vfio/pci/virtio/Makefile
>> @@ -0,0 +1,4 @@
>> +# SPDX-License-Identifier: GPL-2.0-only
>> +obj-$(CONFIG_VIRTIO_VFIO_PCI) += virtio-vfio-pci.o
>> +virtio-vfio-pci-y := main.o cmd.o
>> +
>> diff --git a/drivers/vfio/pci/virtio/cmd.c b/drivers/vfio/pci/virtio/cmd.c
>> index f068239cdbb0..aea9d25fbf1d 100644
>> --- a/drivers/vfio/pci/virtio/cmd.c
>> +++ b/drivers/vfio/pci/virtio/cmd.c
>> @@ -44,7 +44,7 @@ int virtiovf_cmd_lr_write(struct virtiovf_pci_core_device *virtvdev, u16 opcode,
>> {
>> struct virtio_device *virtio_dev =
>> virtio_pci_vf_get_pf_dev(virtvdev->core_device.pdev);
>> - struct virtio_admin_cmd_data_lr_write *in;
>> + struct virtio_admin_cmd_legacy_wr_data *in;
>> struct scatterlist in_sg;
>> struct virtio_admin_cmd cmd = {};
>> int ret;
>> @@ -74,7 +74,7 @@ int virtiovf_cmd_lr_read(struct virtiovf_pci_core_device *virtvdev, u16 opcode,
>> {
>> struct virtio_device *virtio_dev =
>> virtio_pci_vf_get_pf_dev(virtvdev->core_device.pdev);
>> - struct virtio_admin_cmd_data_lr_read *in;
>> + struct virtio_admin_cmd_legacy_rd_data *in;
>> struct scatterlist in_sg, out_sg;
>> struct virtio_admin_cmd cmd = {};
>> int ret;
>> diff --git a/drivers/vfio/pci/virtio/cmd.h b/drivers/vfio/pci/virtio/cmd.h
>> index c2a3645f4b90..347b1dc85570 100644
>> --- a/drivers/vfio/pci/virtio/cmd.h
>> +++ b/drivers/vfio/pci/virtio/cmd.h
>> @@ -13,7 +13,15 @@
>>
>> struct virtiovf_pci_core_device {
>> struct vfio_pci_core_device core_device;
>> + u8 bar0_virtual_buf_size;
>> + u8 *bar0_virtual_buf;
>> + /* synchronize access to the virtual buf */
>> + struct mutex bar_mutex;
>> int vf_id;
>> + void __iomem *notify_addr;
>> + u32 notify_offset;
>> + u8 notify_bar;
>> + u8 pci_cmd_io :1;
>> };
>>
>> int virtiovf_cmd_list_query(struct pci_dev *pdev, u8 *buf, int buf_size);
>> diff --git a/drivers/vfio/pci/virtio/main.c b/drivers/vfio/pci/virtio/main.c
>> new file mode 100644
>> index 000000000000..2486991c49f3
>> --- /dev/null
>> +++ b/drivers/vfio/pci/virtio/main.c
>> @@ -0,0 +1,546 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/*
>> + * Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved
>> + */
>> +
>> +#include <linux/device.h>
>> +#include <linux/module.h>
>> +#include <linux/mutex.h>
>> +#include <linux/pci.h>
>> +#include <linux/pm_runtime.h>
>> +#include <linux/types.h>
>> +#include <linux/uaccess.h>
>> +#include <linux/vfio.h>
>> +#include <linux/vfio_pci_core.h>
>> +#include <linux/virtio_pci.h>
>> +#include <linux/virtio_net.h>
>> +#include <linux/virtio_pci_modern.h>
>> +
>> +#include "cmd.h"
>> +
>> +#define VIRTIO_LEGACY_IO_BAR_HEADER_LEN 20
>> +#define VIRTIO_LEGACY_IO_BAR_MSIX_HEADER_LEN 4
>> +
>> +static int virtiovf_issue_lr_cmd(struct virtiovf_pci_core_device *virtvdev,
>> + loff_t pos, char __user *buf,
>> + size_t count, bool read)
>> +{
>> + u8 *bar0_buf = virtvdev->bar0_virtual_buf;
>> + u16 opcode;
>> + int ret;
>> +
>> + mutex_lock(&virtvdev->bar_mutex);
>> + if (read) {
>> + opcode = (pos < VIRTIO_PCI_CONFIG_OFF(true)) ?
>> + VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_READ :
>> + VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_READ;
>> + ret = virtiovf_cmd_lr_read(virtvdev, opcode, pos,
>> + count, bar0_buf + pos);
>> + if (ret)
>> + goto out;
>> + if (copy_to_user(buf, bar0_buf + pos, count))
>> + ret = -EFAULT;
>> + goto out;
>> + }
>> +
>> + if (copy_from_user(bar0_buf + pos, buf, count)) {
>> + ret = -EFAULT;
>> + goto out;
>> + }
>> +
>> + opcode = (pos < VIRTIO_PCI_CONFIG_OFF(true)) ?
>> + VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_WRITE :
>> + VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_WRITE;
>> + ret = virtiovf_cmd_lr_write(virtvdev, opcode, pos, count,
>> + bar0_buf + pos);
>> +out:
>> + mutex_unlock(&virtvdev->bar_mutex);
>> + return ret;
>> +}
>> +
>> +static int
>> +translate_io_bar_to_mem_bar(struct virtiovf_pci_core_device *virtvdev,
>> + loff_t pos, char __user *buf,
>> + size_t count, bool read)
>> +{
>> + struct vfio_pci_core_device *core_device = &virtvdev->core_device;
>> + u16 queue_notify;
>> + int ret;
>> +
>> + if (pos + count > virtvdev->bar0_virtual_buf_size)
>> + return -EINVAL;
>> +
>> + switch (pos) {
>> + case VIRTIO_PCI_QUEUE_NOTIFY:
>> + if (count != sizeof(queue_notify))
>> + return -EINVAL;
>> + if (read) {
>> + ret = vfio_pci_ioread16(core_device, true, &queue_notify,
>> + virtvdev->notify_addr);
>> + if (ret)
>> + return ret;
>> + if (copy_to_user(buf, &queue_notify,
>> + sizeof(queue_notify)))
>> + return -EFAULT;
>> + break;
>> + }
>> +
>> + if (copy_from_user(&queue_notify, buf, count))
>> + return -EFAULT;
>> +
>> + ret = vfio_pci_iowrite16(core_device, true, queue_notify,
>> + virtvdev->notify_addr);
>> + break;
>> + default:
>> + ret = virtiovf_issue_lr_cmd(virtvdev, pos, buf, count, read);
>> + }
>> +
>> + return ret ? ret : count;
>> +}
>> +
>> +static bool range_contains_range(loff_t range1_start, size_t count1,
>> + loff_t range2_start, size_t count2,
>> + loff_t *start_offset)
>> +{
>> + if (range1_start <= range2_start &&
>> + range1_start + count1 >= range2_start + count2) {
>> + *start_offset = range2_start - range1_start;
>> + return true;
>> + }
>> + return false;
>> +}
>> +
>> +static ssize_t virtiovf_pci_read_config(struct vfio_device *core_vdev,
>> + char __user *buf, size_t count,
>> + loff_t *ppos)
>> +{
>> + struct virtiovf_pci_core_device *virtvdev = container_of(
>> + core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
>> + loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
>> + loff_t copy_offset;
>> + __le32 val32;
>> + __le16 val16;
>> + u8 val8;
>> + int ret;
>> +
>> + ret = vfio_pci_core_read(core_vdev, buf, count, ppos);
>> + if (ret < 0)
>> + return ret;
>> +
>> + if (range_contains_range(pos, count, PCI_DEVICE_ID, sizeof(val16),
>> + &copy_offset)) {
>> + val16 = cpu_to_le16(0x1000);
>> + if (copy_to_user(buf + copy_offset, &val16, sizeof(val16)))
>> + return -EFAULT;
>> + }
> So we take a 0x1041 ("Virtio 1.0 network device") and turn it into a
> 0x1000 ("Virtio network device"). Are there no features implied by the
> device ID? NB, a byte-wise access would read the real device ID.
From the spec POV, 0x1000 is a transitional device which covers the
functionality of the 0x1041 device and the legacy device, so we should be
fine here.
Re the byte-wise access, do we have such an access from QEMU? I
couldn't see a partial read of a config field.
Because of that I preferred to keep the code simple and not manage such a
partial flow.
However, if this is still a concern, I can allow that partial read as part
of V1.
What do you think?
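One way to allow it could be an overlap check instead of full containment,
something like the below (untested sketch, name and arguments are just a
suggestion):

	/*
	 * Untested sketch: report any overlap between the user buffer and the
	 * emulated register, so e.g. a byte-wise read of PCI_DEVICE_ID still
	 * gets the virtual value.
	 */
	static bool range_intersect_range(loff_t buf_start, size_t buf_count,
					  loff_t reg_start, size_t reg_count,
					  loff_t *buf_offset, size_t *intersect_count,
					  size_t *reg_offset)
	{
		if (buf_start <= reg_start &&
		    buf_start + buf_count > reg_start) {
			*buf_offset = reg_start - buf_start;
			*intersect_count = min_t(size_t, reg_count,
						 buf_start + buf_count - reg_start);
			*reg_offset = 0;
			return true;
		}

		if (buf_start > reg_start &&
		    buf_start < reg_start + reg_count) {
			*buf_offset = 0;
			*reg_offset = buf_start - reg_start;
			*intersect_count = min_t(size_t, buf_count,
						 reg_count - *reg_offset);
			return true;
		}

		return false;
	}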
>> +
>> + if (virtvdev->pci_cmd_io &&
>> + range_contains_range(pos, count, PCI_COMMAND, sizeof(val16),
>> + &copy_offset)) {
>> + if (copy_from_user(&val16, buf, sizeof(val16)))
>> + return -EFAULT;
>> + val16 |= cpu_to_le16(PCI_COMMAND_IO);
>> + if (copy_to_user(buf + copy_offset, &val16, sizeof(val16)))
>> + return -EFAULT;
>> + }
> So we can't turn off I/O memory.
See below in virtiovf_pci_core_write(): it can be turned off, and in that
case the next virtiovf_pci_read_config() won't turn it back on.
This is what the 'virtvdev->pci_cmd_io' field is used for.
>
>> +
>> + if (range_contains_range(pos, count, PCI_REVISION_ID, sizeof(val8),
>> + &copy_offset)) {
>> + /* Transional needs to have revision 0 */
>> + val8 = 0;
>> + if (copy_to_user(buf + copy_offset, &val8, sizeof(val8)))
>> + return -EFAULT;
>> + }
> Surely some driver cares about this, right? How is this supposed to
> work in a world where libvirt parses modules.alias and automatically
> loads this driver rather than vfio-pci for all 0x1041 devices? We'd
> need to denylist this driver to ever see the device for what it is.
This was needed so a guest driver can support both modern and legacy
access; it can still choose the modern one.
Please see below re libvirt.
>
>> +
>> + if (range_contains_range(pos, count, PCI_BASE_ADDRESS_0, sizeof(val32),
>> + &copy_offset)) {
>> + val32 = cpu_to_le32(PCI_BASE_ADDRESS_SPACE_IO);
>> + if (copy_to_user(buf + copy_offset, &val32, sizeof(val32)))
>> + return -EFAULT;
>> + }
> Sloppy BAR emulation compared to the real BARs. QEMU obviously doesn't
> care.
From what I could see, QEMU needs the 'PCI_BASE_ADDRESS_SPACE_IO' bit.
It doesn't really care about the address, as you wrote, which is why it
was just left as zero here.
Does that make sense to you?
>
>> +
>> + if (range_contains_range(pos, count, PCI_SUBSYSTEM_ID, sizeof(val16),
>> + &copy_offset)) {
>> + /* Transitional devices use the PCI subsystem device id as
>> + * virtio device id, same as legacy driver always did.
>> + */
> Non-networking multi-line comment style throughout please.
Sure, will handle as part of V1.
>
>> + val16 = cpu_to_le16(VIRTIO_ID_NET);
>> + if (copy_to_user(buf + copy_offset, &val16, sizeof(val16)))
>> + return -EFAULT;
>> + }
>> +
>> + return count;
>> +}
>> +
>> +static ssize_t
>> +virtiovf_pci_core_read(struct vfio_device *core_vdev, char __user *buf,
>> + size_t count, loff_t *ppos)
>> +{
>> + struct virtiovf_pci_core_device *virtvdev = container_of(
>> + core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
>> + struct pci_dev *pdev = virtvdev->core_device.pdev;
>> + unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
>> + loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
>> + int ret;
>> +
>> + if (!count)
>> + return 0;
>> +
>> + if (index == VFIO_PCI_CONFIG_REGION_INDEX)
>> + return virtiovf_pci_read_config(core_vdev, buf, count, ppos);
>> +
>> + if (index != VFIO_PCI_BAR0_REGION_INDEX)
>> + return vfio_pci_core_read(core_vdev, buf, count, ppos);
>> +
>> + ret = pm_runtime_resume_and_get(&pdev->dev);
>> + if (ret) {
>> + pci_info_ratelimited(pdev, "runtime resume failed %d\n",
>> + ret);
>> + return -EIO;
>> + }
>> +
>> + ret = translate_io_bar_to_mem_bar(virtvdev, pos, buf, count, true);
> If the heart of this driver is simply pretending to have an I/O BAR
> where I/O accesses into that BAR are translated to accesses in the MMIO
> BAR, why can't this be done in the VMM, ie. QEMU? Could I/O to MMIO
> translation in QEMU improve performance (ex. if the MMIO is mmap'd and
> can be accessed without bouncing back into kernel code)?
>
Access to the I/O BAR control registers is not converted to MMIO but into
admin commands.
These admin commands are transported over an admin queue owned by the
hypervisor driver.
In the future, the hypervisor driver may use the admin queue for other tasks
such as device MSI-X config, feature provisioning and device migration
commands (dirty page tracking, device state read/write), and possibly more.
Only the driver notification register (i.e. the kick/doorbell register) is
converted to MMIO.
Hence, the VFIO solution looks like the better approach to match the current
UAPI.
>> + pm_runtime_put(&pdev->dev);
>> + return ret;
>> +}
>> +
>> +static ssize_t
>> +virtiovf_pci_core_write(struct vfio_device *core_vdev, const char __user *buf,
>> + size_t count, loff_t *ppos)
>> +{
>> + struct virtiovf_pci_core_device *virtvdev = container_of(
>> + core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
>> + struct pci_dev *pdev = virtvdev->core_device.pdev;
>> + unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
>> + loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
>> + int ret;
>> +
>> + if (!count)
>> + return 0;
>> +
>> + if (index == VFIO_PCI_CONFIG_REGION_INDEX) {
>> + loff_t copy_offset;
>> + u16 cmd;
>> +
>> + if (range_contains_range(pos, count, PCI_COMMAND, sizeof(cmd),
>> + &copy_offset)) {
>> + if (copy_from_user(&cmd, buf + copy_offset, sizeof(cmd)))
>> + return -EFAULT;
>> + virtvdev->pci_cmd_io = (cmd & PCI_COMMAND_IO);
> If we're tracking writes to PCI_COMMAND_IO, why did we statically
> report I/O enabled in the read function previously?
If it is turned off here, we won't turn it back on upon the next read();
please see the note above in that area.
>> + }
>> + }
>> +
>> + if (index != VFIO_PCI_BAR0_REGION_INDEX)
>> + return vfio_pci_core_write(core_vdev, buf, count, ppos);
>> +
>> + ret = pm_runtime_resume_and_get(&pdev->dev);
>> + if (ret) {
>> + pci_info_ratelimited(pdev, "runtime resume failed %d\n", ret);
>> + return -EIO;
>> + }
>> +
>> + ret = translate_io_bar_to_mem_bar(virtvdev, pos, (char __user *)buf, count, false);
>> + pm_runtime_put(&pdev->dev);
>> + return ret;
>> +}
>> +
>> +static int
>> +virtiovf_pci_ioctl_get_region_info(struct vfio_device *core_vdev,
>> + unsigned int cmd, unsigned long arg)
>> +{
>> + struct virtiovf_pci_core_device *virtvdev = container_of(
>> + core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
>> + unsigned long minsz = offsetofend(struct vfio_region_info, offset);
>> + void __user *uarg = (void __user *)arg;
>> + struct vfio_region_info info = {};
>> +
>> + if (copy_from_user(&info, uarg, minsz))
>> + return -EFAULT;
>> +
>> + if (info.argsz < minsz)
>> + return -EINVAL;
>> +
>> + switch (info.index) {
>> + case VFIO_PCI_BAR0_REGION_INDEX:
>> + info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
>> + info.size = virtvdev->bar0_virtual_buf_size;
>> + info.flags = VFIO_REGION_INFO_FLAG_READ |
>> + VFIO_REGION_INFO_FLAG_WRITE;
>> + return copy_to_user(uarg, &info, minsz) ? -EFAULT : 0;
>> + default:
>> + return vfio_pci_core_ioctl(core_vdev, cmd, arg);
>> + }
>> +}
>> +
>> +static long
>> +virtiovf_vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
>> + unsigned long arg)
>> +{
>> + switch (cmd) {
>> + case VFIO_DEVICE_GET_REGION_INFO:
>> + return virtiovf_pci_ioctl_get_region_info(core_vdev, cmd, arg);
>> + default:
>> + return vfio_pci_core_ioctl(core_vdev, cmd, arg);
>> + }
>> +}
>> +
>> +static int
>> +virtiovf_set_notify_addr(struct virtiovf_pci_core_device *virtvdev)
>> +{
>> + struct vfio_pci_core_device *core_device = &virtvdev->core_device;
>> + int ret;
>> +
>> + /* Setup the BAR where the 'notify' exists to be used by vfio as well
>> + * This will let us mmap it only once and use it when needed.
>> + */
>> + ret = vfio_pci_core_setup_barmap(core_device,
>> + virtvdev->notify_bar);
>> + if (ret)
>> + return ret;
>> +
>> + virtvdev->notify_addr = core_device->barmap[virtvdev->notify_bar] +
>> + virtvdev->notify_offset;
>> + return 0;
>> +}
>> +
>> +static int virtiovf_pci_open_device(struct vfio_device *core_vdev)
>> +{
>> + struct virtiovf_pci_core_device *virtvdev = container_of(
>> + core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
>> + struct vfio_pci_core_device *vdev = &virtvdev->core_device;
>> + int ret;
>> +
>> + ret = vfio_pci_core_enable(vdev);
>> + if (ret)
>> + return ret;
>> +
>> + if (virtvdev->bar0_virtual_buf) {
>> + /* upon close_device() the vfio_pci_core_disable() is called
>> + * and will close all the previous mmaps, so it seems that the
>> + * valid life cycle for the 'notify' addr is per open/close.
>> + */
>> + ret = virtiovf_set_notify_addr(virtvdev);
>> + if (ret) {
>> + vfio_pci_core_disable(vdev);
>> + return ret;
>> + }
>> + }
>> +
>> + vfio_pci_core_finish_enable(vdev);
>> + return 0;
>> +}
>> +
>> +static void virtiovf_pci_close_device(struct vfio_device *core_vdev)
>> +{
>> + vfio_pci_core_close_device(core_vdev);
>> +}
> Why does this function exist?
For symmetry reasons: as we have virtiovf_pci_open_device() I added a
close() counterpart as well.
However, we can just set vfio_pci_core_close_device() directly in the ops
and drop this wrapper.
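I.e. the transitional ops would simply become (sketch of the planned change):

	static const struct vfio_device_ops virtiovf_acc_vfio_pci_tran_ops = {
		.name = "virtio-transitional-vfio-pci",
		.init = virtiovf_pci_init_device,
		.release = virtiovf_pci_core_release_dev,
		.open_device = virtiovf_pci_open_device,
		.close_device = vfio_pci_core_close_device,	/* no wrapper needed */
		.ioctl = virtiovf_vfio_pci_core_ioctl,
		.read = virtiovf_pci_core_read,
		.write = virtiovf_pci_core_write,
		.mmap = vfio_pci_core_mmap,
		.request = vfio_pci_core_request,
		.match = vfio_pci_core_match,
		.bind_iommufd = vfio_iommufd_physical_bind,
		.unbind_iommufd = vfio_iommufd_physical_unbind,
		.attach_ioas = vfio_iommufd_physical_attach_ioas,
	};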
>
>> +
>> +static int virtiovf_get_device_config_size(unsigned short device)
>> +{
>> + switch (device) {
>> + case 0x1041:
>> + /* network card */
>> + return offsetofend(struct virtio_net_config, status);
>> + default:
>> + return 0;
>> + }
>> +}
>> +
>> +static int virtiovf_read_notify_info(struct virtiovf_pci_core_device *virtvdev)
>> +{
>> + u64 offset;
>> + int ret;
>> + u8 bar;
>> +
>> + ret = virtiovf_cmd_lq_read_notify(virtvdev,
>> + VIRTIO_ADMIN_CMD_NOTIFY_INFO_FLAGS_OWNER_MEM,
>> + &bar, &offset);
>> + if (ret)
>> + return ret;
>> +
>> + virtvdev->notify_bar = bar;
>> + virtvdev->notify_offset = offset;
>> + return 0;
>> +}
>> +
>> +static int virtiovf_pci_init_device(struct vfio_device *core_vdev)
>> +{
>> + struct virtiovf_pci_core_device *virtvdev = container_of(
>> + core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
>> + struct pci_dev *pdev;
>> + int ret;
>> +
>> + ret = vfio_pci_core_init_dev(core_vdev);
>> + if (ret)
>> + return ret;
>> +
>> + pdev = virtvdev->core_device.pdev;
>> + virtvdev->vf_id = pci_iov_vf_id(pdev);
>> + if (virtvdev->vf_id < 0)
>> + return -EINVAL;
> vf_id is never used.
It's used as part of the virtio commands, see the previous preparation
patch.
>
>> +
>> + ret = virtiovf_read_notify_info(virtvdev);
>> + if (ret)
>> + return ret;
>> +
>> + virtvdev->bar0_virtual_buf_size = VIRTIO_LEGACY_IO_BAR_HEADER_LEN +
>> + VIRTIO_LEGACY_IO_BAR_MSIX_HEADER_LEN +
>> + virtiovf_get_device_config_size(pdev->device);
>> + virtvdev->bar0_virtual_buf = kzalloc(virtvdev->bar0_virtual_buf_size,
>> + GFP_KERNEL);
>> + if (!virtvdev->bar0_virtual_buf)
>> + return -ENOMEM;
>> + mutex_init(&virtvdev->bar_mutex);
>> + return 0;
>> +}
>> +
>> +static void virtiovf_pci_core_release_dev(struct vfio_device *core_vdev)
>> +{
>> + struct virtiovf_pci_core_device *virtvdev = container_of(
>> + core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
>> +
>> + kfree(virtvdev->bar0_virtual_buf);
>> + vfio_pci_core_release_dev(core_vdev);
>> +}
>> +
>> +static const struct vfio_device_ops virtiovf_acc_vfio_pci_tran_ops = {
>> + .name = "virtio-transitional-vfio-pci",
>> + .init = virtiovf_pci_init_device,
>> + .release = virtiovf_pci_core_release_dev,
>> + .open_device = virtiovf_pci_open_device,
>> + .close_device = virtiovf_pci_close_device,
>> + .ioctl = virtiovf_vfio_pci_core_ioctl,
>> + .read = virtiovf_pci_core_read,
>> + .write = virtiovf_pci_core_write,
>> + .mmap = vfio_pci_core_mmap,
>> + .request = vfio_pci_core_request,
>> + .match = vfio_pci_core_match,
>> + .bind_iommufd = vfio_iommufd_physical_bind,
>> + .unbind_iommufd = vfio_iommufd_physical_unbind,
>> + .attach_ioas = vfio_iommufd_physical_attach_ioas,
>> +};
>> +
>> +static const struct vfio_device_ops virtiovf_acc_vfio_pci_ops = {
>> + .name = "virtio-acc-vfio-pci",
>> + .init = vfio_pci_core_init_dev,
>> + .release = vfio_pci_core_release_dev,
>> + .open_device = virtiovf_pci_open_device,
>> + .close_device = virtiovf_pci_close_device,
>> + .ioctl = vfio_pci_core_ioctl,
>> + .device_feature = vfio_pci_core_ioctl_feature,
>> + .read = vfio_pci_core_read,
>> + .write = vfio_pci_core_write,
>> + .mmap = vfio_pci_core_mmap,
>> + .request = vfio_pci_core_request,
>> + .match = vfio_pci_core_match,
>> + .bind_iommufd = vfio_iommufd_physical_bind,
>> + .unbind_iommufd = vfio_iommufd_physical_unbind,
>> + .attach_ioas = vfio_iommufd_physical_attach_ioas,
>> +};
> Why are we claiming devices that should just use vfio-pci instead?
Upon probe we may choose to set those default vfio-pci ops in case the
device is not legacy capable.
This will eliminate any usage of the new driver functionality when it's
not applicable.
>
>> +
>> +static bool virtiovf_bar0_exists(struct pci_dev *pdev)
>> +{
>> + struct resource *res = pdev->resource;
>> +
>> + return res->flags ? true : false;
>> +}
>> +
>> +#define VIRTIOVF_USE_ADMIN_CMD_BITMAP \
>> + (BIT_ULL(VIRTIO_ADMIN_CMD_LIST_QUERY) | \
>> + BIT_ULL(VIRTIO_ADMIN_CMD_LIST_USE) | \
>> + BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_WRITE) | \
>> + BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_READ) | \
>> + BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_WRITE) | \
>> + BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_READ) | \
>> + BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_NOTIFY_INFO))
>> +
>> +static bool virtiovf_support_legacy_access(struct pci_dev *pdev)
>> +{
>> + int buf_size = DIV_ROUND_UP(VIRTIO_ADMIN_MAX_CMD_OPCODE, 64) * 8;
>> + u8 *buf;
>> + int ret;
>> +
>> + /* Only virtio-net is supported/tested so far */
>> + if (pdev->device != 0x1041)
>> + return false;
> Seems like the ID table should handle this, why are we preemptively
> claiming all virtio devices... or actually all 0x1af4 devices, which
> might not even be virtio, ex. the non-virtio ivshmem device is 0x1110.
Makes sense; I will change the ID table from PCI_ANY_ID to 0x1041 and
clean up that code.
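I.e. something like this (sketch of the planned change, not a final patch):

	static const struct pci_device_id virtiovf_pci_table[] = {
		/* Only virtio-net (modern device ID 0x1041) is supported/tested so far */
		{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1041) },
		{}
	};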
>> +
>> + buf = kzalloc(buf_size, GFP_KERNEL);
>> + if (!buf)
>> + return false;
>> +
>> + ret = virtiovf_cmd_list_query(pdev, buf, buf_size);
>> + if (ret)
>> + goto end;
>> +
>> + if ((le64_to_cpup((__le64 *)buf) & VIRTIOVF_USE_ADMIN_CMD_BITMAP) !=
>> + VIRTIOVF_USE_ADMIN_CMD_BITMAP) {
>> + ret = -EOPNOTSUPP;
>> + goto end;
>> + }
>> +
>> + /* confirm the used commands */
>> + memset(buf, 0, buf_size);
>> + *(__le64 *)buf = cpu_to_le64(VIRTIOVF_USE_ADMIN_CMD_BITMAP);
>> + ret = virtiovf_cmd_list_use(pdev, buf, buf_size);
>> +
>> +end:
>> + kfree(buf);
>> + return ret ? false : true;
>> +}
>> +
>> +static int virtiovf_pci_probe(struct pci_dev *pdev,
>> + const struct pci_device_id *id)
>> +{
>> + const struct vfio_device_ops *ops = &virtiovf_acc_vfio_pci_ops;
>> + struct virtiovf_pci_core_device *virtvdev;
>> + int ret;
>> +
>> + if (pdev->is_virtfn && virtiovf_support_legacy_access(pdev) &&
>> + !virtiovf_bar0_exists(pdev) && pdev->msix_cap)
>> + ops = &virtiovf_acc_vfio_pci_tran_ops;
>> +
>> + virtvdev = vfio_alloc_device(virtiovf_pci_core_device, core_device.vdev,
>> + &pdev->dev, ops);
>> + if (IS_ERR(virtvdev))
>> + return PTR_ERR(virtvdev);
>> +
>> + dev_set_drvdata(&pdev->dev, &virtvdev->core_device);
>> + ret = vfio_pci_core_register_device(&virtvdev->core_device);
>> + if (ret)
>> + goto out;
>> + return 0;
>> +out:
>> + vfio_put_device(&virtvdev->core_device.vdev);
>> + return ret;
>> +}
>> +
>> +static void virtiovf_pci_remove(struct pci_dev *pdev)
>> +{
>> + struct virtiovf_pci_core_device *virtvdev = dev_get_drvdata(&pdev->dev);
>> +
>> + vfio_pci_core_unregister_device(&virtvdev->core_device);
>> + vfio_put_device(&virtvdev->core_device.vdev);
>> +}
>> +
>> +static const struct pci_device_id virtiovf_pci_table[] = {
>> + { PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_REDHAT_QUMRANET, PCI_ANY_ID) },
> libvirt will blindly use this driver for all devices matching this as
> we've discussed how it should make use of modules.alias. I don't think
> this driver should be squatting on devices where it doesn't add value
> and it's not clear whether this is adding or subtracting value in all
> cases for the one NIC that it modifies.
When the device is not legacy capable, we choose the default vfio-pci ops
as noted above; otherwise we choose the new functionality to enable legacy
support in the guest.
> How should libvirt choose when
> and where to use this driver? What regressions are we going to see
> with VMs that previously saw "modern" virtio-net devices and now see a
> legacy compatible device? Thanks,
We don't expect a regression here; a modern driver in the guest will
continue using its direct access flow.
Do you see a real concern with enabling it by default, such that some
pre-configuration before the probe phase would be required to activate it?
If so, any specific suggestion on how to manage that?
Thanks,
Yishai
> Alex
>
>> + {}
>> +};
>> +
>> +MODULE_DEVICE_TABLE(pci, virtiovf_pci_table);
>> +
>> +static struct pci_driver virtiovf_pci_driver = {
>> + .name = KBUILD_MODNAME,
>> + .id_table = virtiovf_pci_table,
>> + .probe = virtiovf_pci_probe,
>> + .remove = virtiovf_pci_remove,
>> + .err_handler = &vfio_pci_core_err_handlers,
>> + .driver_managed_dma = true,
>> +};
>> +
>> +module_pci_driver(virtiovf_pci_driver);
>> +
>> +MODULE_LICENSE("GPL");
>> +MODULE_AUTHOR("Yishai Hadas <yishaih@nvidia.com>");
>> +MODULE_DESCRIPTION(
>> + "VIRTIO VFIO PCI - User Level meta-driver for VIRTIO device family");
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-26 15:20 ` Yishai Hadas
@ 2023-09-26 17:00 ` Michael S. Tsirkin
2023-10-02 4:38 ` Parav Pandit
0 siblings, 1 reply; 181+ messages in thread
From: Michael S. Tsirkin @ 2023-09-26 17:00 UTC (permalink / raw)
To: Yishai Hadas
Cc: Alex Williamson, jasowang, jgg, kvm, virtualization, parav, feliu,
jiri, kevin.tian, joao.m.martins, leonro, maorg
On Tue, Sep 26, 2023 at 06:20:45PM +0300, Yishai Hadas wrote:
> On 21/09/2023 22:58, Alex Williamson wrote:
> > On Thu, 21 Sep 2023 15:40:40 +0300
> > Yishai Hadas <yishaih@nvidia.com> wrote:
> >
> > > Introduce a vfio driver over virtio devices to support the legacy
> > > interface functionality for VFs.
> > >
> > > Background, from the virtio spec [1].
> > > --------------------------------------------------------------------
> > > In some systems, there is a need to support a virtio legacy driver with
> > > a device that does not directly support the legacy interface. In such
> > > scenarios, a group owner device can provide the legacy interface
> > > functionality for the group member devices. The driver of the owner
> > > device can then access the legacy interface of a member device on behalf
> > > of the legacy member device driver.
> > >
> > > For example, with the SR-IOV group type, group members (VFs) can not
> > > present the legacy interface in an I/O BAR in BAR0 as expected by the
> > > legacy pci driver. If the legacy driver is running inside a virtual
> > > machine, the hypervisor executing the virtual machine can present a
> > > virtual device with an I/O BAR in BAR0. The hypervisor intercepts the
> > > legacy driver accesses to this I/O BAR and forwards them to the group
> > > owner device (PF) using group administration commands.
> > > --------------------------------------------------------------------
> > >
> > > Specifically, this driver adds support for a virtio-net VF to be exposed
> > > as a transitional device to a guest driver and allows the legacy IO BAR
> > > functionality on top.
> > >
> > > This allows a VM which uses a legacy virtio-net driver in the guest to
> > > work transparently over a VF which its driver in the host is that new
> > > driver.
> > >
> > > The driver can be extended easily to support some other types of virtio
> > > devices (e.g virtio-blk), by adding in a few places the specific type
> > > properties as was done for virtio-net.
> > >
> > > For now, only the virtio-net use case was tested and as such we introduce
> > > the support only for such a device.
> > >
> > > Practically,
> > > Upon probing a VF for a virtio-net device, in case its PF supports
> > > legacy access over the virtio admin commands and the VF doesn't have BAR
> > > 0, we set some specific 'vfio_device_ops' to be able to simulate in SW a
> > > transitional device with I/O BAR in BAR 0.
> > >
> > > The existence of the simulated I/O bar is reported later on by
> > > overwriting the VFIO_DEVICE_GET_REGION_INFO command and the device
> > > exposes itself as a transitional device by overwriting some properties
> > > upon reading its config space.
> > >
> > > Once we report the existence of I/O BAR as BAR 0 a legacy driver in the
> > > guest may use it via read/write calls according to the virtio
> > > specification.
> > >
> > > Any read/write towards the control parts of the BAR will be captured by
> > > the new driver and will be translated into admin commands towards the
> > > device.
> > >
> > > Any data path read/write access (i.e. virtio driver notifications) will
> > > be forwarded to the physical BAR which its properties were supplied by
> > > the command VIRTIO_PCI_QUEUE_NOTIFY upon the probing/init flow.
> > >
> > > With that code in place a legacy driver in the guest has the look and
> > > feel as if having a transitional device with legacy support for both its
> > > control and data path flows.
> > >
> > > [1]
> > > https://github.com/oasis-tcs/virtio-spec/commit/03c2d32e5093ca9f2a17797242fbef88efe94b8c
> > >
> > > Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> > > ---
> > > MAINTAINERS | 6 +
> > > drivers/vfio/pci/Kconfig | 2 +
> > > drivers/vfio/pci/Makefile | 2 +
> > > drivers/vfio/pci/virtio/Kconfig | 15 +
> > > drivers/vfio/pci/virtio/Makefile | 4 +
> > > drivers/vfio/pci/virtio/cmd.c | 4 +-
> > > drivers/vfio/pci/virtio/cmd.h | 8 +
> > > drivers/vfio/pci/virtio/main.c | 546 +++++++++++++++++++++++++++++++
> > > 8 files changed, 585 insertions(+), 2 deletions(-)
> > > create mode 100644 drivers/vfio/pci/virtio/Kconfig
> > > create mode 100644 drivers/vfio/pci/virtio/Makefile
> > > create mode 100644 drivers/vfio/pci/virtio/main.c
> > >
> > > diff --git a/MAINTAINERS b/MAINTAINERS
> > > index bf0f54c24f81..5098418c8389 100644
> > > --- a/MAINTAINERS
> > > +++ b/MAINTAINERS
> > > @@ -22624,6 +22624,12 @@ L: kvm@vger.kernel.org
> > > S: Maintained
> > > F: drivers/vfio/pci/mlx5/
> > > +VFIO VIRTIO PCI DRIVER
> > > +M: Yishai Hadas <yishaih@nvidia.com>
> > > +L: kvm@vger.kernel.org
> > > +S: Maintained
> > > +F: drivers/vfio/pci/virtio
> > > +
> > > VFIO PCI DEVICE SPECIFIC DRIVERS
> > > R: Jason Gunthorpe <jgg@nvidia.com>
> > > R: Yishai Hadas <yishaih@nvidia.com>
> > > diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
> > > index 8125e5f37832..18c397df566d 100644
> > > --- a/drivers/vfio/pci/Kconfig
> > > +++ b/drivers/vfio/pci/Kconfig
> > > @@ -65,4 +65,6 @@ source "drivers/vfio/pci/hisilicon/Kconfig"
> > > source "drivers/vfio/pci/pds/Kconfig"
> > > +source "drivers/vfio/pci/virtio/Kconfig"
> > > +
> > > endmenu
> > > diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> > > index 45167be462d8..046139a4eca5 100644
> > > --- a/drivers/vfio/pci/Makefile
> > > +++ b/drivers/vfio/pci/Makefile
> > > @@ -13,3 +13,5 @@ obj-$(CONFIG_MLX5_VFIO_PCI) += mlx5/
> > > obj-$(CONFIG_HISI_ACC_VFIO_PCI) += hisilicon/
> > > obj-$(CONFIG_PDS_VFIO_PCI) += pds/
> > > +
> > > +obj-$(CONFIG_VIRTIO_VFIO_PCI) += virtio/
> > > diff --git a/drivers/vfio/pci/virtio/Kconfig b/drivers/vfio/pci/virtio/Kconfig
> > > new file mode 100644
> > > index 000000000000..89eddce8b1bd
> > > --- /dev/null
> > > +++ b/drivers/vfio/pci/virtio/Kconfig
> > > @@ -0,0 +1,15 @@
> > > +# SPDX-License-Identifier: GPL-2.0-only
> > > +config VIRTIO_VFIO_PCI
> > > + tristate "VFIO support for VIRTIO PCI devices"
> > > + depends on VIRTIO_PCI
> > > + select VFIO_PCI_CORE
> > > + help
> > > + This provides support for exposing VIRTIO VF devices using the VFIO
> > > + framework that can work with a legacy virtio driver in the guest.
> > > + Based on PCIe spec, VFs do not support I/O Space; thus, VF BARs shall
> > > + not indicate I/O Space.
> > > + As of that this driver emulated I/O BAR in software to let a VF be
> > > + seen as a transitional device in the guest and let it work with
> > > + a legacy driver.
> > > +
> > > + If you don't know what to do here, say N.
> > > diff --git a/drivers/vfio/pci/virtio/Makefile b/drivers/vfio/pci/virtio/Makefile
> > > new file mode 100644
> > > index 000000000000..584372648a03
> > > --- /dev/null
> > > +++ b/drivers/vfio/pci/virtio/Makefile
> > > @@ -0,0 +1,4 @@
> > > +# SPDX-License-Identifier: GPL-2.0-only
> > > +obj-$(CONFIG_VIRTIO_VFIO_PCI) += virtio-vfio-pci.o
> > > +virtio-vfio-pci-y := main.o cmd.o
> > > +
> > > diff --git a/drivers/vfio/pci/virtio/cmd.c b/drivers/vfio/pci/virtio/cmd.c
> > > index f068239cdbb0..aea9d25fbf1d 100644
> > > --- a/drivers/vfio/pci/virtio/cmd.c
> > > +++ b/drivers/vfio/pci/virtio/cmd.c
> > > @@ -44,7 +44,7 @@ int virtiovf_cmd_lr_write(struct virtiovf_pci_core_device *virtvdev, u16 opcode,
> > > {
> > > struct virtio_device *virtio_dev =
> > > virtio_pci_vf_get_pf_dev(virtvdev->core_device.pdev);
> > > - struct virtio_admin_cmd_data_lr_write *in;
> > > + struct virtio_admin_cmd_legacy_wr_data *in;
> > > struct scatterlist in_sg;
> > > struct virtio_admin_cmd cmd = {};
> > > int ret;
> > > @@ -74,7 +74,7 @@ int virtiovf_cmd_lr_read(struct virtiovf_pci_core_device *virtvdev, u16 opcode,
> > > {
> > > struct virtio_device *virtio_dev =
> > > virtio_pci_vf_get_pf_dev(virtvdev->core_device.pdev);
> > > - struct virtio_admin_cmd_data_lr_read *in;
> > > + struct virtio_admin_cmd_legacy_rd_data *in;
> > > struct scatterlist in_sg, out_sg;
> > > struct virtio_admin_cmd cmd = {};
> > > int ret;
> > > diff --git a/drivers/vfio/pci/virtio/cmd.h b/drivers/vfio/pci/virtio/cmd.h
> > > index c2a3645f4b90..347b1dc85570 100644
> > > --- a/drivers/vfio/pci/virtio/cmd.h
> > > +++ b/drivers/vfio/pci/virtio/cmd.h
> > > @@ -13,7 +13,15 @@
> > > struct virtiovf_pci_core_device {
> > > struct vfio_pci_core_device core_device;
> > > + u8 bar0_virtual_buf_size;
> > > + u8 *bar0_virtual_buf;
> > > + /* synchronize access to the virtual buf */
> > > + struct mutex bar_mutex;
> > > int vf_id;
> > > + void __iomem *notify_addr;
> > > + u32 notify_offset;
> > > + u8 notify_bar;
> > > + u8 pci_cmd_io :1;
> > > };
> > > int virtiovf_cmd_list_query(struct pci_dev *pdev, u8 *buf, int buf_size);
> > > diff --git a/drivers/vfio/pci/virtio/main.c b/drivers/vfio/pci/virtio/main.c
> > > new file mode 100644
> > > index 000000000000..2486991c49f3
> > > --- /dev/null
> > > +++ b/drivers/vfio/pci/virtio/main.c
> > > @@ -0,0 +1,546 @@
> > > +// SPDX-License-Identifier: GPL-2.0-only
> > > +/*
> > > + * Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved
> > > + */
> > > +
> > > +#include <linux/device.h>
> > > +#include <linux/module.h>
> > > +#include <linux/mutex.h>
> > > +#include <linux/pci.h>
> > > +#include <linux/pm_runtime.h>
> > > +#include <linux/types.h>
> > > +#include <linux/uaccess.h>
> > > +#include <linux/vfio.h>
> > > +#include <linux/vfio_pci_core.h>
> > > +#include <linux/virtio_pci.h>
> > > +#include <linux/virtio_net.h>
> > > +#include <linux/virtio_pci_modern.h>
> > > +
> > > +#include "cmd.h"
> > > +
> > > +#define VIRTIO_LEGACY_IO_BAR_HEADER_LEN 20
> > > +#define VIRTIO_LEGACY_IO_BAR_MSIX_HEADER_LEN 4
> > > +
> > > +static int virtiovf_issue_lr_cmd(struct virtiovf_pci_core_device *virtvdev,
> > > + loff_t pos, char __user *buf,
> > > + size_t count, bool read)
> > > +{
> > > + u8 *bar0_buf = virtvdev->bar0_virtual_buf;
> > > + u16 opcode;
> > > + int ret;
> > > +
> > > + mutex_lock(&virtvdev->bar_mutex);
> > > + if (read) {
> > > + opcode = (pos < VIRTIO_PCI_CONFIG_OFF(true)) ?
> > > + VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_READ :
> > > + VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_READ;
> > > + ret = virtiovf_cmd_lr_read(virtvdev, opcode, pos,
> > > + count, bar0_buf + pos);
> > > + if (ret)
> > > + goto out;
> > > + if (copy_to_user(buf, bar0_buf + pos, count))
> > > + ret = -EFAULT;
> > > + goto out;
> > > + }
> > > +
> > > + if (copy_from_user(bar0_buf + pos, buf, count)) {
> > > + ret = -EFAULT;
> > > + goto out;
> > > + }
> > > +
> > > + opcode = (pos < VIRTIO_PCI_CONFIG_OFF(true)) ?
> > > + VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_WRITE :
> > > + VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_WRITE;
> > > + ret = virtiovf_cmd_lr_write(virtvdev, opcode, pos, count,
> > > + bar0_buf + pos);
> > > +out:
> > > + mutex_unlock(&virtvdev->bar_mutex);
> > > + return ret;
> > > +}
> > > +
> > > +static int
> > > +translate_io_bar_to_mem_bar(struct virtiovf_pci_core_device *virtvdev,
> > > + loff_t pos, char __user *buf,
> > > + size_t count, bool read)
> > > +{
> > > + struct vfio_pci_core_device *core_device = &virtvdev->core_device;
> > > + u16 queue_notify;
> > > + int ret;
> > > +
> > > + if (pos + count > virtvdev->bar0_virtual_buf_size)
> > > + return -EINVAL;
> > > +
> > > + switch (pos) {
> > > + case VIRTIO_PCI_QUEUE_NOTIFY:
> > > + if (count != sizeof(queue_notify))
> > > + return -EINVAL;
> > > + if (read) {
> > > + ret = vfio_pci_ioread16(core_device, true, &queue_notify,
> > > + virtvdev->notify_addr);
> > > + if (ret)
> > > + return ret;
> > > + if (copy_to_user(buf, &queue_notify,
> > > + sizeof(queue_notify)))
> > > + return -EFAULT;
> > > + break;
> > > + }
> > > +
> > > + if (copy_from_user(&queue_notify, buf, count))
> > > + return -EFAULT;
> > > +
> > > + ret = vfio_pci_iowrite16(core_device, true, queue_notify,
> > > + virtvdev->notify_addr);
> > > + break;
> > > + default:
> > > + ret = virtiovf_issue_lr_cmd(virtvdev, pos, buf, count, read);
> > > + }
> > > +
> > > + return ret ? ret : count;
> > > +}
> > > +
> > > +static bool range_contains_range(loff_t range1_start, size_t count1,
> > > + loff_t range2_start, size_t count2,
> > > + loff_t *start_offset)
> > > +{
> > > + if (range1_start <= range2_start &&
> > > + range1_start + count1 >= range2_start + count2) {
> > > + *start_offset = range2_start - range1_start;
> > > + return true;
> > > + }
> > > + return false;
> > > +}
> > > +
> > > +static ssize_t virtiovf_pci_read_config(struct vfio_device *core_vdev,
> > > + char __user *buf, size_t count,
> > > + loff_t *ppos)
> > > +{
> > > + struct virtiovf_pci_core_device *virtvdev = container_of(
> > > + core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> > > + loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> > > + loff_t copy_offset;
> > > + __le32 val32;
> > > + __le16 val16;
> > > + u8 val8;
> > > + int ret;
> > > +
> > > + ret = vfio_pci_core_read(core_vdev, buf, count, ppos);
> > > + if (ret < 0)
> > > + return ret;
> > > +
> > > + if (range_contains_range(pos, count, PCI_DEVICE_ID, sizeof(val16),
> > > + &copy_offset)) {
> > > + val16 = cpu_to_le16(0x1000);
> > > + if (copy_to_user(buf + copy_offset, &val16, sizeof(val16)))
> > > + return -EFAULT;
> > > + }
> > So we take a 0x1041 ("Virtio 1.0 network device") and turn it into a
> > 0x1000 ("Virtio network device"). Are there no features implied by the
> > device ID? NB, a byte-wise access would read the real device ID.
>
> From the spec POV, 0x1000 is a transitional device which covers the
> functionality of the 0x1041 device and the legacy device, so we should be
> fine here.
>
> Re the byte-wise access, do we have such an access from QEMU? I couldn't
> see a partial read of a config field.
> Because of that I preferred to keep the code simple and not manage such a
> partial flow.
> However, if this is still a concern, I can allow that partial read as part
> of V1.
>
> What do you think?
>
> > > +
> > > + if (virtvdev->pci_cmd_io &&
> > > + range_contains_range(pos, count, PCI_COMMAND, sizeof(val16),
> > > + &copy_offset)) {
> > > + if (copy_from_user(&val16, buf, sizeof(val16)))
> > > + return -EFAULT;
> > > + val16 |= cpu_to_le16(PCI_COMMAND_IO);
> > > + if (copy_to_user(buf + copy_offset, &val16, sizeof(val16)))
> > > + return -EFAULT;
> > > + }
> > So we can't turn off I/O memory.
>
> See below in virtiovf_pci_core_write(): it can be turned off, and in that
> case the next virtiovf_pci_read_config() won't turn it back on.
>
> This is what the 'virtvdev->pci_cmd_io' field is used for.
>
> >
> > > +
> > > + if (range_contains_range(pos, count, PCI_REVISION_ID, sizeof(val8),
> > > + &copy_offset)) {
> > > + /* Transional needs to have revision 0 */
> > > + val8 = 0;
> > > + if (copy_to_user(buf + copy_offset, &val8, sizeof(val8)))
> > > + return -EFAULT;
> > > + }
> > Surely some driver cares about this, right? How is this supposed to
> > work in a world where libvirt parses modules.alias and automatically
> > loads this driver rather than vfio-pci for all 0x1041 devices? We'd
> > need to denylist this driver to ever see the device for what it is.
I think I'm missing something. What in this patch might make
libvirt load this driver automatically?
>
> This was needed by the guest driver to support both modern and legacy
> access; it can still choose the modern one.
>
> Please see below re libvirt.
>
> >
> > > +
> > > + if (range_contains_range(pos, count, PCI_BASE_ADDRESS_0, sizeof(val32),
> > > + ©_offset)) {
> > > + val32 = cpu_to_le32(PCI_BASE_ADDRESS_SPACE_IO);
> > > + if (copy_to_user(buf + copy_offset, &val32, sizeof(val32)))
> > > + return -EFAULT;
> > > + }
> > Sloppy BAR emulation compared to the real BARs. QEMU obviously doesn't
> > care.
>
> From what I could see, QEMU needs the 'PCI_BASE_ADDRESS_SPACE_IO' bit.
>
> It doesn't really care about the address, as you wrote, which is why it was
> just left as zero here.
> Does that make sense to you?
I mean, if all you care about is QEMU then you should just keep all this
code in QEMU. Once you have some behaviour in UAPI you can never take
it back; even if it's a bug, userspace will come to depend on it.
> >
> > > +
> > > + if (range_contains_range(pos, count, PCI_SUBSYSTEM_ID, sizeof(val16),
> > > + ©_offset)) {
> > > + /* Transitional devices use the PCI subsystem device id as
> > > + * virtio device id, same as legacy driver always did.
> > > + */
> > Non-networking multi-line comment style throughout please.
>
> Sure, will handle as part of V1.
>
> >
> > > + val16 = cpu_to_le16(VIRTIO_ID_NET);
> > > + if (copy_to_user(buf + copy_offset, &val16, sizeof(val16)))
> > > + return -EFAULT;
> > > + }
> > > +
> > > + return count;
> > > +}
> > > +
> > > +static ssize_t
> > > +virtiovf_pci_core_read(struct vfio_device *core_vdev, char __user *buf,
> > > + size_t count, loff_t *ppos)
> > > +{
> > > + struct virtiovf_pci_core_device *virtvdev = container_of(
> > > + core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> > > + struct pci_dev *pdev = virtvdev->core_device.pdev;
> > > + unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
> > > + loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> > > + int ret;
> > > +
> > > + if (!count)
> > > + return 0;
> > > +
> > > + if (index == VFIO_PCI_CONFIG_REGION_INDEX)
> > > + return virtiovf_pci_read_config(core_vdev, buf, count, ppos);
> > > +
> > > + if (index != VFIO_PCI_BAR0_REGION_INDEX)
> > > + return vfio_pci_core_read(core_vdev, buf, count, ppos);
> > > +
> > > + ret = pm_runtime_resume_and_get(&pdev->dev);
> > > + if (ret) {
> > > + pci_info_ratelimited(pdev, "runtime resume failed %d\n",
> > > + ret);
> > > + return -EIO;
> > > + }
> > > +
> > > + ret = translate_io_bar_to_mem_bar(virtvdev, pos, buf, count, true);
> > If the heart of this driver is simply pretending to have an I/O BAR
> > where I/O accesses into that BAR are translated to accesses in the MMIO
> > BAR, why can't this be done in the VMM, ie. QEMU? Could I/O to MMIO
> > translation in QEMU improve performance (ex. if the MMIO is mmap'd and
> > can be accessed without bouncing back into kernel code)?
Hmm. Not this patch, but Jason tells me there are devices which actually do have
it implemented like this (with an MMIO BAR). You have to convert writes
into MMIO write + MMIO read to make it robust.
> Access to the I/O BAR control registers is not converted to MMIO but into
> admin commands.
> Such admin commands are transported over an admin queue owned by the
> hypervisor driver.
> In the future, the hypervisor driver may use the admin queue for other tasks,
> such as device MSI-X config, feature provisioning and device migration
> commands (dirty page tracking, device state read/write), and maybe more.
> Only the driver notification register (i.e. the kick/doorbell register) is
> converted to MMIO.
> Hence, the VFIO solution looks like the better approach to match the current
> UAPI.
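To restate that split in a few lines (this is only condensed from
translate_io_bar_to_mem_bar() in the patch above, nothing new): only the
doorbell is a real device access, everything else turns into admin commands.

	switch (pos) {
	case VIRTIO_PCI_QUEUE_NOTIFY:
		/* data path: forward to the VF's modern notify area (MMIO) */
		if (read)
			ret = vfio_pci_ioread16(core_device, true, &queue_notify,
						virtvdev->notify_addr);
		else
			ret = vfio_pci_iowrite16(core_device, true, queue_notify,
						 virtvdev->notify_addr);
		break;
	default:
		/* control path: becomes a LEGACY_*_CFG_READ/WRITE admin command
		 * carried over the PF-owned admin queue */
		ret = virtiovf_issue_lr_cmd(virtvdev, pos, buf, count, read);
	}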
>
> > > + pm_runtime_put(&pdev->dev);
> > > + return ret;
> > > +}
> > > +
> > > +static ssize_t
> > > +virtiovf_pci_core_write(struct vfio_device *core_vdev, const char __user *buf,
> > > + size_t count, loff_t *ppos)
> > > +{
> > > + struct virtiovf_pci_core_device *virtvdev = container_of(
> > > + core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> > > + struct pci_dev *pdev = virtvdev->core_device.pdev;
> > > + unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
> > > + loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> > > + int ret;
> > > +
> > > + if (!count)
> > > + return 0;
> > > +
> > > + if (index == VFIO_PCI_CONFIG_REGION_INDEX) {
> > > + loff_t copy_offset;
> > > + u16 cmd;
> > > +
> > > + if (range_contains_range(pos, count, PCI_COMMAND, sizeof(cmd),
> > > + ©_offset)) {
> > > + if (copy_from_user(&cmd, buf + copy_offset, sizeof(cmd)))
> > > + return -EFAULT;
> > > + virtvdev->pci_cmd_io = (cmd & PCI_COMMAND_IO);
> > If we're tracking writes to PCI_COMMAND_IO, why did we statically
> > report I/O enabled in the read function previously?
>
> If it is turned off here, we won't turn it back on upon the
> read(); please see the note above in that area.
>
>
> > > + }
> > > + }
> > > +
> > > + if (index != VFIO_PCI_BAR0_REGION_INDEX)
> > > + return vfio_pci_core_write(core_vdev, buf, count, ppos);
> > > +
> > > + ret = pm_runtime_resume_and_get(&pdev->dev);
> > > + if (ret) {
> > > + pci_info_ratelimited(pdev, "runtime resume failed %d\n", ret);
> > > + return -EIO;
> > > + }
> > > +
> > > + ret = translate_io_bar_to_mem_bar(virtvdev, pos, (char __user *)buf, count, false);
> > > + pm_runtime_put(&pdev->dev);
> > > + return ret;
> > > +}
> > > +
> > > +static int
> > > +virtiovf_pci_ioctl_get_region_info(struct vfio_device *core_vdev,
> > > + unsigned int cmd, unsigned long arg)
> > > +{
> > > + struct virtiovf_pci_core_device *virtvdev = container_of(
> > > + core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> > > + unsigned long minsz = offsetofend(struct vfio_region_info, offset);
> > > + void __user *uarg = (void __user *)arg;
> > > + struct vfio_region_info info = {};
> > > +
> > > + if (copy_from_user(&info, uarg, minsz))
> > > + return -EFAULT;
> > > +
> > > + if (info.argsz < minsz)
> > > + return -EINVAL;
> > > +
> > > + switch (info.index) {
> > > + case VFIO_PCI_BAR0_REGION_INDEX:
> > > + info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
> > > + info.size = virtvdev->bar0_virtual_buf_size;
> > > + info.flags = VFIO_REGION_INFO_FLAG_READ |
> > > + VFIO_REGION_INFO_FLAG_WRITE;
> > > + return copy_to_user(uarg, &info, minsz) ? -EFAULT : 0;
> > > + default:
> > > + return vfio_pci_core_ioctl(core_vdev, cmd, arg);
> > > + }
> > > +}
> > > +
> > > +static long
> > > +virtiovf_vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
> > > + unsigned long arg)
> > > +{
> > > + switch (cmd) {
> > > + case VFIO_DEVICE_GET_REGION_INFO:
> > > + return virtiovf_pci_ioctl_get_region_info(core_vdev, cmd, arg);
> > > + default:
> > > + return vfio_pci_core_ioctl(core_vdev, cmd, arg);
> > > + }
> > > +}
> > > +
> > > +static int
> > > +virtiovf_set_notify_addr(struct virtiovf_pci_core_device *virtvdev)
> > > +{
> > > + struct vfio_pci_core_device *core_device = &virtvdev->core_device;
> > > + int ret;
> > > +
> > > + /* Setup the BAR where the 'notify' exists to be used by vfio as well
> > > + * This will let us mmap it only once and use it when needed.
> > > + */
> > > + ret = vfio_pci_core_setup_barmap(core_device,
> > > + virtvdev->notify_bar);
> > > + if (ret)
> > > + return ret;
> > > +
> > > + virtvdev->notify_addr = core_device->barmap[virtvdev->notify_bar] +
> > > + virtvdev->notify_offset;
> > > + return 0;
> > > +}
> > > +
> > > +static int virtiovf_pci_open_device(struct vfio_device *core_vdev)
> > > +{
> > > + struct virtiovf_pci_core_device *virtvdev = container_of(
> > > + core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> > > + struct vfio_pci_core_device *vdev = &virtvdev->core_device;
> > > + int ret;
> > > +
> > > + ret = vfio_pci_core_enable(vdev);
> > > + if (ret)
> > > + return ret;
> > > +
> > > + if (virtvdev->bar0_virtual_buf) {
> > > + /* upon close_device() the vfio_pci_core_disable() is called
> > > + * and will close all the previous mmaps, so it seems that the
> > > + * valid life cycle for the 'notify' addr is per open/close.
> > > + */
> > > + ret = virtiovf_set_notify_addr(virtvdev);
> > > + if (ret) {
> > > + vfio_pci_core_disable(vdev);
> > > + return ret;
> > > + }
> > > + }
> > > +
> > > + vfio_pci_core_finish_enable(vdev);
> > > + return 0;
> > > +}
> > > +
> > > +static void virtiovf_pci_close_device(struct vfio_device *core_vdev)
> > > +{
> > > + vfio_pci_core_close_device(core_vdev);
> > > +}
> > Why does this function exist?
>
> For symmetry reasons: as we have virtiovf_pci_open_device(), I also added
> the close() one.
> However, we can just set vfio_pci_core_close_device() directly in the ops and
> drop this wrapper.
> >
> > > +
> > > +static int virtiovf_get_device_config_size(unsigned short device)
> > > +{
> > > + switch (device) {
> > > + case 0x1041:
> > > + /* network card */
> > > + return offsetofend(struct virtio_net_config, status);
> > > + default:
> > > + return 0;
> > > + }
> > > +}
> > > +
> > > +static int virtiovf_read_notify_info(struct virtiovf_pci_core_device *virtvdev)
> > > +{
> > > + u64 offset;
> > > + int ret;
> > > + u8 bar;
> > > +
> > > + ret = virtiovf_cmd_lq_read_notify(virtvdev,
> > > + VIRTIO_ADMIN_CMD_NOTIFY_INFO_FLAGS_OWNER_MEM,
> > > + &bar, &offset);
> > > + if (ret)
> > > + return ret;
> > > +
> > > + virtvdev->notify_bar = bar;
> > > + virtvdev->notify_offset = offset;
> > > + return 0;
> > > +}
> > > +
> > > +static int virtiovf_pci_init_device(struct vfio_device *core_vdev)
> > > +{
> > > + struct virtiovf_pci_core_device *virtvdev = container_of(
> > > + core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> > > + struct pci_dev *pdev;
> > > + int ret;
> > > +
> > > + ret = vfio_pci_core_init_dev(core_vdev);
> > > + if (ret)
> > > + return ret;
> > > +
> > > + pdev = virtvdev->core_device.pdev;
> > > + virtvdev->vf_id = pci_iov_vf_id(pdev);
> > > + if (virtvdev->vf_id < 0)
> > > + return -EINVAL;
> > vf_id is never used.
>
> It's used as part of the virtio commands, see the previous preparation
> patch.
>
> >
> > > +
> > > + ret = virtiovf_read_notify_info(virtvdev);
> > > + if (ret)
> > > + return ret;
> > > +
> > > + virtvdev->bar0_virtual_buf_size = VIRTIO_LEGACY_IO_BAR_HEADER_LEN +
> > > + VIRTIO_LEGACY_IO_BAR_MSIX_HEADER_LEN +
> > > + virtiovf_get_device_config_size(pdev->device);
> > > + virtvdev->bar0_virtual_buf = kzalloc(virtvdev->bar0_virtual_buf_size,
> > > + GFP_KERNEL);
> > > + if (!virtvdev->bar0_virtual_buf)
> > > + return -ENOMEM;
> > > + mutex_init(&virtvdev->bar_mutex);
> > > + return 0;
> > > +}
> > > +
> > > +static void virtiovf_pci_core_release_dev(struct vfio_device *core_vdev)
> > > +{
> > > + struct virtiovf_pci_core_device *virtvdev = container_of(
> > > + core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> > > +
> > > + kfree(virtvdev->bar0_virtual_buf);
> > > + vfio_pci_core_release_dev(core_vdev);
> > > +}
> > > +
> > > +static const struct vfio_device_ops virtiovf_acc_vfio_pci_tran_ops = {
> > > + .name = "virtio-transitional-vfio-pci",
> > > + .init = virtiovf_pci_init_device,
> > > + .release = virtiovf_pci_core_release_dev,
> > > + .open_device = virtiovf_pci_open_device,
> > > + .close_device = virtiovf_pci_close_device,
> > > + .ioctl = virtiovf_vfio_pci_core_ioctl,
> > > + .read = virtiovf_pci_core_read,
> > > + .write = virtiovf_pci_core_write,
> > > + .mmap = vfio_pci_core_mmap,
> > > + .request = vfio_pci_core_request,
> > > + .match = vfio_pci_core_match,
> > > + .bind_iommufd = vfio_iommufd_physical_bind,
> > > + .unbind_iommufd = vfio_iommufd_physical_unbind,
> > > + .attach_ioas = vfio_iommufd_physical_attach_ioas,
> > > +};
> > > +
> > > +static const struct vfio_device_ops virtiovf_acc_vfio_pci_ops = {
> > > + .name = "virtio-acc-vfio-pci",
> > > + .init = vfio_pci_core_init_dev,
> > > + .release = vfio_pci_core_release_dev,
> > > + .open_device = virtiovf_pci_open_device,
> > > + .close_device = virtiovf_pci_close_device,
> > > + .ioctl = vfio_pci_core_ioctl,
> > > + .device_feature = vfio_pci_core_ioctl_feature,
> > > + .read = vfio_pci_core_read,
> > > + .write = vfio_pci_core_write,
> > > + .mmap = vfio_pci_core_mmap,
> > > + .request = vfio_pci_core_request,
> > > + .match = vfio_pci_core_match,
> > > + .bind_iommufd = vfio_iommufd_physical_bind,
> > > + .unbind_iommufd = vfio_iommufd_physical_unbind,
> > > + .attach_ioas = vfio_iommufd_physical_attach_ioas,
> > > +};
> > Why are we claiming devices that should just use vfio-pci instead?
>
>
> Upon probe we may choose to set those default vfio-pci ops in case the device
> is not legacy capable.
> This will eliminate any usage of the new driver functionality when it's not
> applicable.
>
> >
> > > +
> > > +static bool virtiovf_bar0_exists(struct pci_dev *pdev)
> > > +{
> > > + struct resource *res = pdev->resource;
> > > +
> > > + return res->flags ? true : false;
> > > +}
> > > +
> > > +#define VIRTIOVF_USE_ADMIN_CMD_BITMAP \
> > > + (BIT_ULL(VIRTIO_ADMIN_CMD_LIST_QUERY) | \
> > > + BIT_ULL(VIRTIO_ADMIN_CMD_LIST_USE) | \
> > > + BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_WRITE) | \
> > > + BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_READ) | \
> > > + BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_WRITE) | \
> > > + BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_READ) | \
> > > + BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_NOTIFY_INFO))
> > > +
> > > +static bool virtiovf_support_legacy_access(struct pci_dev *pdev)
> > > +{
> > > + int buf_size = DIV_ROUND_UP(VIRTIO_ADMIN_MAX_CMD_OPCODE, 64) * 8;
> > > + u8 *buf;
> > > + int ret;
> > > +
> > > + /* Only virtio-net is supported/tested so far */
> > > + if (pdev->device != 0x1041)
> > > + return false;
> > Seems like the ID table should handle this, why are we preemptively
> > claiming all virtio devices... or actually all 0x1af4 devices, which
> > might not even be virtio, e.g. the non-virtio ivshmem device is 0x1110.
>
> Makes sense, I will change the ID table from PCI_ANY_ID to 0x1041 and
> clean up that code.
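For reference, the resulting ID-table entry would then look something like:

static const struct pci_device_id virtiovf_pci_table[] = {
	/* only virtio-net (modern device ID 0x1041) is supported/tested so far */
	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1041) },
	{}
};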
>
> > > +
> > > + buf = kzalloc(buf_size, GFP_KERNEL);
> > > + if (!buf)
> > > + return false;
> > > +
> > > + ret = virtiovf_cmd_list_query(pdev, buf, buf_size);
> > > + if (ret)
> > > + goto end;
> > > +
> > > + if ((le64_to_cpup((__le64 *)buf) & VIRTIOVF_USE_ADMIN_CMD_BITMAP) !=
> > > + VIRTIOVF_USE_ADMIN_CMD_BITMAP) {
> > > + ret = -EOPNOTSUPP;
> > > + goto end;
> > > + }
> > > +
> > > + /* confirm the used commands */
> > > + memset(buf, 0, buf_size);
> > > + *(__le64 *)buf = cpu_to_le64(VIRTIOVF_USE_ADMIN_CMD_BITMAP);
> > > + ret = virtiovf_cmd_list_use(pdev, buf, buf_size);
> > > +
> > > +end:
> > > + kfree(buf);
> > > + return ret ? false : true;
> > > +}
> > > +
> > > +static int virtiovf_pci_probe(struct pci_dev *pdev,
> > > + const struct pci_device_id *id)
> > > +{
> > > + const struct vfio_device_ops *ops = &virtiovf_acc_vfio_pci_ops;
> > > + struct virtiovf_pci_core_device *virtvdev;
> > > + int ret;
> > > +
> > > + if (pdev->is_virtfn && virtiovf_support_legacy_access(pdev) &&
> > > + !virtiovf_bar0_exists(pdev) && pdev->msix_cap)
> > > + ops = &virtiovf_acc_vfio_pci_tran_ops;
> > > +
> > > + virtvdev = vfio_alloc_device(virtiovf_pci_core_device, core_device.vdev,
> > > + &pdev->dev, ops);
> > > + if (IS_ERR(virtvdev))
> > > + return PTR_ERR(virtvdev);
> > > +
> > > + dev_set_drvdata(&pdev->dev, &virtvdev->core_device);
> > > + ret = vfio_pci_core_register_device(&virtvdev->core_device);
> > > + if (ret)
> > > + goto out;
> > > + return 0;
> > > +out:
> > > + vfio_put_device(&virtvdev->core_device.vdev);
> > > + return ret;
> > > +}
> > > +
> > > +static void virtiovf_pci_remove(struct pci_dev *pdev)
> > > +{
> > > + struct virtiovf_pci_core_device *virtvdev = dev_get_drvdata(&pdev->dev);
> > > +
> > > + vfio_pci_core_unregister_device(&virtvdev->core_device);
> > > + vfio_put_device(&virtvdev->core_device.vdev);
> > > +}
> > > +
> > > +static const struct pci_device_id virtiovf_pci_table[] = {
> > > + { PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_REDHAT_QUMRANET, PCI_ANY_ID) },
> > libvirt will blindly use this driver for all devices matching this as
> > we've discussed how it should make use of modules.alias. I don't think
> > this driver should be squatting on devices where it doesn't add value
> > and it's not clear whether this is adding or subtracting value in all
> > cases for the one NIC that it modifies.
>
>
> When the device is not legacy capable, we choose the vfio-pci default ops as
> pointed out above; otherwise we may choose the new functionality to enable it
> in the guest.
>
> > How should libvirt choose when
> > and where to use this driver? What regressions are we going to see
> > with VMs that previously saw "modern" virtio-net devices and now see a
> > legacy compatible device? Thanks,
> We don't expect a regression here; a modern driver in the guest will
> continue using its direct access flow.
>
> Do you see a real concern with enabling it by default, such that we should
> require some pre-configuration before the probe phase to activate it?
> If so, any specific suggestion on how to manage that?
>
> Thanks,
> Yishai
I would not claim that it can't happen.
For example, a transitional device
can not, in theory, be safely passed through to guest userspace, because
the guest might then try to use it through the legacy BAR
without acknowledging ACCESS_PLATFORM.
Do any guests check this and fail? Hard to say.
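For context, the reason the legacy path cannot carry that acknowledgement is
structural (a sketch; offsets are per the legacy 0.9.5 register layout):

/* The legacy interface only exposes 32-bit feature registers, while
 * VIRTIO_F_ACCESS_PLATFORM is feature bit 33, so a guest driving the
 * device through the legacy I/O BAR can neither see nor acknowledge it. */
u32 host_features = inl(io_base + VIRTIO_PCI_HOST_FEATURES);	/* offset 0 */
outl(guest_features, io_base + VIRTIO_PCI_GUEST_FEATURES);	/* offset 4 */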
> > Alex
> >
> > > + {}
> > > +};
> > > +
> > > +MODULE_DEVICE_TABLE(pci, virtiovf_pci_table);
> > > +
> > > +static struct pci_driver virtiovf_pci_driver = {
> > > + .name = KBUILD_MODNAME,
> > > + .id_table = virtiovf_pci_table,
> > > + .probe = virtiovf_pci_probe,
> > > + .remove = virtiovf_pci_remove,
> > > + .err_handler = &vfio_pci_core_err_handlers,
> > > + .driver_managed_dma = true,
> > > +};
> > > +
> > > +module_pci_driver(virtiovf_pci_driver);
> > > +
> > > +MODULE_LICENSE("GPL");
> > > +MODULE_AUTHOR("Yishai Hadas <yishaih@nvidia.com>");
> > > +MODULE_DESCRIPTION(
> > > + "VIRTIO VFIO PCI - User Level meta-driver for VIRTIO device family");
>
^ permalink raw reply [flat|nested] 181+ messages in thread
* RE: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-26 17:00 ` Michael S. Tsirkin
@ 2023-10-02 4:38 ` Parav Pandit
0 siblings, 0 replies; 181+ messages in thread
From: Parav Pandit @ 2023-10-02 4:38 UTC (permalink / raw)
To: Michael S. Tsirkin, Yishai Hadas
Cc: Alex Williamson, jasowang@redhat.com, Jason Gunthorpe,
kvm@vger.kernel.org, virtualization@lists.linux-foundation.org,
Feng Liu, Jiri Pirko, kevin.tian@intel.com,
joao.m.martins@oracle.com, Leon Romanovsky, Maor Gottlieb
> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Tuesday, September 26, 2023 10:30 PM
> For example, a transitional device
> can not, in theory, be safely passed through to guest userspace, because the
> guest might then try to use it through the legacy BAR without acknowledging
> ACCESS_PLATFORM.
> Do any guests check this and fail? Hard to say.
>
ACCESS_PLATFORM is not offered on the legacy interface because legacy interface spec 0.9.5 didn't have it.
Whether the guest VM maps it to user space and uses GIOVA is completely unknown to the device.
And all of this is just fine, because the IOMMU, through vfio, takes care of the necessary translation whether or not the transitional device is mapped to guest user space.
Hence, it is not a compat problem.
Anyway, only users who care to expose a transitional device in the guest will bind a virtio device to the vfio-virtio driver.
I can see that in the future, when users want to make this optional, a devlink/sysfs knob will be added; at that point, one would need a disable_transitional flag.
So it may be worth enabling transitional support optionally, on user request, as Michael suggested.
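Purely to illustrate that kind of opt-in (the knob below is hypothetical; a
module parameter is used only to keep the sketch short, whereas a devlink/sysfs
knob is what is being suggested), the probe path could gate the transitional
ops on it:

static bool enable_transitional = true;	/* hypothetical knob, not in the patch */
module_param(enable_transitional, bool, 0444);
MODULE_PARM_DESC(enable_transitional,
		 "Expose legacy-capable virtio VFs as transitional devices");

	/* in virtiovf_pci_probe(): */
	if (enable_transitional && pdev->is_virtfn &&
	    virtiovf_support_legacy_access(pdev) &&
	    !virtiovf_bar0_exists(pdev) && pdev->msix_cap)
		ops = &virtiovf_acc_vfio_pci_tran_ops;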
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-21 12:40 ` [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices Yishai Hadas
` (3 preceding siblings ...)
2023-09-21 19:58 ` Alex Williamson
@ 2023-09-22 10:10 ` Michael S. Tsirkin
2023-09-22 15:53 ` Michael S. Tsirkin
5 siblings, 0 replies; 181+ messages in thread
From: Michael S. Tsirkin @ 2023-09-22 10:10 UTC (permalink / raw)
To: Yishai Hadas
Cc: alex.williamson, jasowang, jgg, kvm, virtualization, parav, feliu,
jiri, kevin.tian, joao.m.martins, leonro, maorg
On Thu, Sep 21, 2023 at 03:40:40PM +0300, Yishai Hadas wrote:
> Introduce a vfio driver over virtio devices to support the legacy
> interface functionality for VFs.
>
> Background, from the virtio spec [1].
> --------------------------------------------------------------------
> In some systems, there is a need to support a virtio legacy driver with
> a device that does not directly support the legacy interface. In such
> scenarios, a group owner device can provide the legacy interface
> functionality for the group member devices. The driver of the owner
> device can then access the legacy interface of a member device on behalf
> of the legacy member device driver.
>
> For example, with the SR-IOV group type, group members (VFs) can not
> present the legacy interface in an I/O BAR in BAR0 as expected by the
> legacy pci driver. If the legacy driver is running inside a virtual
> machine, the hypervisor executing the virtual machine can present a
> virtual device with an I/O BAR in BAR0. The hypervisor intercepts the
> legacy driver accesses to this I/O BAR and forwards them to the group
> owner device (PF) using group administration commands.
> --------------------------------------------------------------------
>
> Specifically, this driver adds support for a virtio-net VF to be exposed
> as a transitional device to a guest driver and allows the legacy IO BAR
> functionality on top.
>
> This allows a VM which uses a legacy virtio-net driver in the guest to
> work transparently over a VF whose host-side driver is this new
> driver.
>
> The driver can be extended easily to support some other types of virtio
> devices (e.g virtio-blk), by adding in a few places the specific type
> properties as was done for virtio-net.
>
> For now, only the virtio-net use case was tested and as such we introduce
> the support only for such a device.
>
> Practically,
> Upon probing a VF for a virtio-net device, in case its PF supports
> legacy access over the virtio admin commands and the VF doesn't have BAR
> 0, we set some specific 'vfio_device_ops' to be able to simulate in SW a
> transitional device with I/O BAR in BAR 0.
>
> The existence of the simulated I/O bar is reported later on by
> overwriting the VFIO_DEVICE_GET_REGION_INFO command and the device
> exposes itself as a transitional device by overwriting some properties
> upon reading its config space.
>
> Once we report the existence of I/O BAR as BAR 0 a legacy driver in the
> guest may use it via read/write calls according to the virtio
> specification.
>
> Any read/write towards the control parts of the BAR will be captured by
> the new driver and will be translated into admin commands towards the
> device.
>
> Any data path read/write access (i.e. virtio driver notifications) will
> be forwarded to the physical BAR whose properties were supplied by
> the VIRTIO_PCI_QUEUE_NOTIFY command upon the probing/init flow.
>
> With that code in place a legacy driver in the guest has the look and
> feel as if having a transitional device with legacy support for both its
> control and data path flows.
>
> [1]
> https://github.com/oasis-tcs/virtio-spec/commit/03c2d32e5093ca9f2a17797242fbef88efe94b8c
>
> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> ---
> MAINTAINERS | 6 +
> drivers/vfio/pci/Kconfig | 2 +
> drivers/vfio/pci/Makefile | 2 +
> drivers/vfio/pci/virtio/Kconfig | 15 +
> drivers/vfio/pci/virtio/Makefile | 4 +
> drivers/vfio/pci/virtio/cmd.c | 4 +-
> drivers/vfio/pci/virtio/cmd.h | 8 +
> drivers/vfio/pci/virtio/main.c | 546 +++++++++++++++++++++++++++++++
> 8 files changed, 585 insertions(+), 2 deletions(-)
> create mode 100644 drivers/vfio/pci/virtio/Kconfig
> create mode 100644 drivers/vfio/pci/virtio/Makefile
> create mode 100644 drivers/vfio/pci/virtio/main.c
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index bf0f54c24f81..5098418c8389 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -22624,6 +22624,12 @@ L: kvm@vger.kernel.org
> S: Maintained
> F: drivers/vfio/pci/mlx5/
>
> +VFIO VIRTIO PCI DRIVER
> +M: Yishai Hadas <yishaih@nvidia.com>
> +L: kvm@vger.kernel.org
> +S: Maintained
> +F: drivers/vfio/pci/virtio
> +
> VFIO PCI DEVICE SPECIFIC DRIVERS
> R: Jason Gunthorpe <jgg@nvidia.com>
> R: Yishai Hadas <yishaih@nvidia.com>
> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
> index 8125e5f37832..18c397df566d 100644
> --- a/drivers/vfio/pci/Kconfig
> +++ b/drivers/vfio/pci/Kconfig
> @@ -65,4 +65,6 @@ source "drivers/vfio/pci/hisilicon/Kconfig"
>
> source "drivers/vfio/pci/pds/Kconfig"
>
> +source "drivers/vfio/pci/virtio/Kconfig"
> +
> endmenu
> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> index 45167be462d8..046139a4eca5 100644
> --- a/drivers/vfio/pci/Makefile
> +++ b/drivers/vfio/pci/Makefile
> @@ -13,3 +13,5 @@ obj-$(CONFIG_MLX5_VFIO_PCI) += mlx5/
> obj-$(CONFIG_HISI_ACC_VFIO_PCI) += hisilicon/
>
> obj-$(CONFIG_PDS_VFIO_PCI) += pds/
> +
> +obj-$(CONFIG_VIRTIO_VFIO_PCI) += virtio/
> diff --git a/drivers/vfio/pci/virtio/Kconfig b/drivers/vfio/pci/virtio/Kconfig
> new file mode 100644
> index 000000000000..89eddce8b1bd
> --- /dev/null
> +++ b/drivers/vfio/pci/virtio/Kconfig
> @@ -0,0 +1,15 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +config VIRTIO_VFIO_PCI
> + tristate "VFIO support for VIRTIO PCI devices"
> + depends on VIRTIO_PCI
> + select VFIO_PCI_CORE
> + help
> + This provides support for exposing VIRTIO VF devices using the VFIO
> + framework that can work with a legacy virtio driver in the guest.
> + Based on PCIe spec, VFs do not support I/O Space; thus, VF BARs shall
> + not indicate I/O Space.
> + As of that this driver emulated I/O BAR in software to let a VF be
> + seen as a transitional device in the guest and let it work with
> + a legacy driver.
> +
> + If you don't know what to do here, say N.
> diff --git a/drivers/vfio/pci/virtio/Makefile b/drivers/vfio/pci/virtio/Makefile
> new file mode 100644
> index 000000000000..584372648a03
> --- /dev/null
> +++ b/drivers/vfio/pci/virtio/Makefile
> @@ -0,0 +1,4 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +obj-$(CONFIG_VIRTIO_VFIO_PCI) += virtio-vfio-pci.o
> +virtio-vfio-pci-y := main.o cmd.o
> +
> diff --git a/drivers/vfio/pci/virtio/cmd.c b/drivers/vfio/pci/virtio/cmd.c
> index f068239cdbb0..aea9d25fbf1d 100644
> --- a/drivers/vfio/pci/virtio/cmd.c
> +++ b/drivers/vfio/pci/virtio/cmd.c
> @@ -44,7 +44,7 @@ int virtiovf_cmd_lr_write(struct virtiovf_pci_core_device *virtvdev, u16 opcode,
> {
> struct virtio_device *virtio_dev =
> virtio_pci_vf_get_pf_dev(virtvdev->core_device.pdev);
> - struct virtio_admin_cmd_data_lr_write *in;
> + struct virtio_admin_cmd_legacy_wr_data *in;
> struct scatterlist in_sg;
> struct virtio_admin_cmd cmd = {};
> int ret;
> @@ -74,7 +74,7 @@ int virtiovf_cmd_lr_read(struct virtiovf_pci_core_device *virtvdev, u16 opcode,
> {
> struct virtio_device *virtio_dev =
> virtio_pci_vf_get_pf_dev(virtvdev->core_device.pdev);
> - struct virtio_admin_cmd_data_lr_read *in;
> + struct virtio_admin_cmd_legacy_rd_data *in;
> struct scatterlist in_sg, out_sg;
> struct virtio_admin_cmd cmd = {};
> int ret;
> diff --git a/drivers/vfio/pci/virtio/cmd.h b/drivers/vfio/pci/virtio/cmd.h
> index c2a3645f4b90..347b1dc85570 100644
> --- a/drivers/vfio/pci/virtio/cmd.h
> +++ b/drivers/vfio/pci/virtio/cmd.h
> @@ -13,7 +13,15 @@
>
> struct virtiovf_pci_core_device {
> struct vfio_pci_core_device core_device;
> + u8 bar0_virtual_buf_size;
> + u8 *bar0_virtual_buf;
> + /* synchronize access to the virtual buf */
> + struct mutex bar_mutex;
> int vf_id;
> + void __iomem *notify_addr;
> + u32 notify_offset;
> + u8 notify_bar;
> + u8 pci_cmd_io :1;
> };
>
> int virtiovf_cmd_list_query(struct pci_dev *pdev, u8 *buf, int buf_size);
> diff --git a/drivers/vfio/pci/virtio/main.c b/drivers/vfio/pci/virtio/main.c
> new file mode 100644
> index 000000000000..2486991c49f3
> --- /dev/null
> +++ b/drivers/vfio/pci/virtio/main.c
> @@ -0,0 +1,546 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved
> + */
> +
> +#include <linux/device.h>
> +#include <linux/module.h>
> +#include <linux/mutex.h>
> +#include <linux/pci.h>
> +#include <linux/pm_runtime.h>
> +#include <linux/types.h>
> +#include <linux/uaccess.h>
> +#include <linux/vfio.h>
> +#include <linux/vfio_pci_core.h>
> +#include <linux/virtio_pci.h>
> +#include <linux/virtio_net.h>
> +#include <linux/virtio_pci_modern.h>
> +
> +#include "cmd.h"
> +
> +#define VIRTIO_LEGACY_IO_BAR_HEADER_LEN 20
> +#define VIRTIO_LEGACY_IO_BAR_MSIX_HEADER_LEN 4
> +
> +static int virtiovf_issue_lr_cmd(struct virtiovf_pci_core_device *virtvdev,
> + loff_t pos, char __user *buf,
> + size_t count, bool read)
> +{
> + u8 *bar0_buf = virtvdev->bar0_virtual_buf;
> + u16 opcode;
> + int ret;
> +
> + mutex_lock(&virtvdev->bar_mutex);
> + if (read) {
> + opcode = (pos < VIRTIO_PCI_CONFIG_OFF(true)) ?
This "true" seems wrong. You need to know whether guest wants
msix enabled for the device or not.
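A rough sketch of what that would mean (not the patch itself; how the guest's
MSI-X state gets tracked is left open, and 'msix_enabled' below is a
hypothetical field):

	/* The legacy device-config area starts 20 or 24 bytes into BAR0
	 * depending on whether the guest has MSI-X enabled, so the split
	 * between COMMON_CFG and DEV_CFG commands has to follow the
	 * guest-visible state instead of a hard-coded 'true'. */
	bool msix_enabled = virtvdev->msix_enabled;	/* hypothetical */

	opcode = (pos < VIRTIO_PCI_CONFIG_OFF(msix_enabled)) ?	/* 24 : 20 */
		 VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_READ :
		 VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_READ;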
> + VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_READ :
> + VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_READ;
> + ret = virtiovf_cmd_lr_read(virtvdev, opcode, pos,
> + count, bar0_buf + pos);
> + if (ret)
> + goto out;
> + if (copy_to_user(buf, bar0_buf + pos, count))
> + ret = -EFAULT;
> + goto out;
> + }
> +
> + if (copy_from_user(bar0_buf + pos, buf, count)) {
> + ret = -EFAULT;
> + goto out;
> + }
> +
> + opcode = (pos < VIRTIO_PCI_CONFIG_OFF(true)) ?
> + VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_WRITE :
> + VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_WRITE;
same
> + ret = virtiovf_cmd_lr_write(virtvdev, opcode, pos, count,
> + bar0_buf + pos);
> +out:
> + mutex_unlock(&virtvdev->bar_mutex);
> + return ret;
> +}
> +
> +static int
> +translate_io_bar_to_mem_bar(struct virtiovf_pci_core_device *virtvdev,
> + loff_t pos, char __user *buf,
> + size_t count, bool read)
> +{
> + struct vfio_pci_core_device *core_device = &virtvdev->core_device;
> + u16 queue_notify;
> + int ret;
> +
> + if (pos + count > virtvdev->bar0_virtual_buf_size)
> + return -EINVAL;
> +
> + switch (pos) {
> + case VIRTIO_PCI_QUEUE_NOTIFY:
> + if (count != sizeof(queue_notify))
> + return -EINVAL;
> + if (read) {
> + ret = vfio_pci_ioread16(core_device, true, &queue_notify,
> + virtvdev->notify_addr);
> + if (ret)
> + return ret;
> + if (copy_to_user(buf, &queue_notify,
> + sizeof(queue_notify)))
> + return -EFAULT;
> + break;
> + }
> +
> + if (copy_from_user(&queue_notify, buf, count))
> + return -EFAULT;
> +
> + ret = vfio_pci_iowrite16(core_device, true, queue_notify,
> + virtvdev->notify_addr);
> + break;
> + default:
> + ret = virtiovf_issue_lr_cmd(virtvdev, pos, buf, count, read);
> + }
> +
> + return ret ? ret : count;
> +}
> +
> +static bool range_contains_range(loff_t range1_start, size_t count1,
> + loff_t range2_start, size_t count2,
> + loff_t *start_offset)
> +{
> + if (range1_start <= range2_start &&
> + range1_start + count1 >= range2_start + count2) {
> + *start_offset = range2_start - range1_start;
> + return true;
> + }
> + return false;
> +}
> +
> +static ssize_t virtiovf_pci_read_config(struct vfio_device *core_vdev,
> + char __user *buf, size_t count,
> + loff_t *ppos)
> +{
> + struct virtiovf_pci_core_device *virtvdev = container_of(
> + core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> + loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> + loff_t copy_offset;
> + __le32 val32;
> + __le16 val16;
> + u8 val8;
> + int ret;
> +
> + ret = vfio_pci_core_read(core_vdev, buf, count, ppos);
> + if (ret < 0)
> + return ret;
> +
> + if (range_contains_range(pos, count, PCI_DEVICE_ID, sizeof(val16),
> + ©_offset)) {
> + val16 = cpu_to_le16(0x1000);
> + if (copy_to_user(buf + copy_offset, &val16, sizeof(val16)))
> + return -EFAULT;
> + }
> +
> + if (virtvdev->pci_cmd_io &&
> + range_contains_range(pos, count, PCI_COMMAND, sizeof(val16),
> + ©_offset)) {
> + if (copy_from_user(&val16, buf, sizeof(val16)))
> + return -EFAULT;
> + val16 |= cpu_to_le16(PCI_COMMAND_IO);
> + if (copy_to_user(buf + copy_offset, &val16, sizeof(val16)))
> + return -EFAULT;
> + }
> +
> + if (range_contains_range(pos, count, PCI_REVISION_ID, sizeof(val8),
> + ©_offset)) {
> + /* Transitional needs to have revision 0 */
> + val8 = 0;
> + if (copy_to_user(buf + copy_offset, &val8, sizeof(val8)))
> + return -EFAULT;
> + }
> +
> + if (range_contains_range(pos, count, PCI_BASE_ADDRESS_0, sizeof(val32),
> + ©_offset)) {
> + val32 = cpu_to_le32(PCI_BASE_ADDRESS_SPACE_IO);
> + if (copy_to_user(buf + copy_offset, &val32, sizeof(val32)))
> + return -EFAULT;
> + }
> +
> + if (range_contains_range(pos, count, PCI_SUBSYSTEM_ID, sizeof(val16),
> + ©_offset)) {
> + /* Transitional devices use the PCI subsystem device id as
> + * virtio device id, same as legacy driver always did.
> + */
> + val16 = cpu_to_le16(VIRTIO_ID_NET);
> + if (copy_to_user(buf + copy_offset, &val16, sizeof(val16)))
> + return -EFAULT;
> + }
> +
> + return count;
> +}
> +
> +static ssize_t
> +virtiovf_pci_core_read(struct vfio_device *core_vdev, char __user *buf,
> + size_t count, loff_t *ppos)
> +{
> + struct virtiovf_pci_core_device *virtvdev = container_of(
> + core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> + struct pci_dev *pdev = virtvdev->core_device.pdev;
> + unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
> + loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> + int ret;
> +
> + if (!count)
> + return 0;
> +
> + if (index == VFIO_PCI_CONFIG_REGION_INDEX)
> + return virtiovf_pci_read_config(core_vdev, buf, count, ppos);
> +
> + if (index != VFIO_PCI_BAR0_REGION_INDEX)
> + return vfio_pci_core_read(core_vdev, buf, count, ppos);
> +
> + ret = pm_runtime_resume_and_get(&pdev->dev);
> + if (ret) {
> + pci_info_ratelimited(pdev, "runtime resume failed %d\n",
> + ret);
> + return -EIO;
> + }
> +
> + ret = translate_io_bar_to_mem_bar(virtvdev, pos, buf, count, true);
> + pm_runtime_put(&pdev->dev);
> + return ret;
> +}
> +
> +static ssize_t
> +virtiovf_pci_core_write(struct vfio_device *core_vdev, const char __user *buf,
> + size_t count, loff_t *ppos)
> +{
> + struct virtiovf_pci_core_device *virtvdev = container_of(
> + core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> + struct pci_dev *pdev = virtvdev->core_device.pdev;
> + unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
> + loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> + int ret;
> +
> + if (!count)
> + return 0;
> +
> + if (index == VFIO_PCI_CONFIG_REGION_INDEX) {
> + loff_t copy_offset;
> + u16 cmd;
> +
> + if (range_contains_range(pos, count, PCI_COMMAND, sizeof(cmd),
> + ©_offset)) {
> + if (copy_from_user(&cmd, buf + copy_offset, sizeof(cmd)))
> + return -EFAULT;
> + virtvdev->pci_cmd_io = (cmd & PCI_COMMAND_IO);
> + }
> + }
> +
> + if (index != VFIO_PCI_BAR0_REGION_INDEX)
> + return vfio_pci_core_write(core_vdev, buf, count, ppos);
> +
> + ret = pm_runtime_resume_and_get(&pdev->dev);
> + if (ret) {
> + pci_info_ratelimited(pdev, "runtime resume failed %d\n", ret);
> + return -EIO;
> + }
> +
> + ret = translate_io_bar_to_mem_bar(virtvdev, pos, (char __user *)buf, count, false);
> + pm_runtime_put(&pdev->dev);
> + return ret;
> +}
> +
> +static int
> +virtiovf_pci_ioctl_get_region_info(struct vfio_device *core_vdev,
> + unsigned int cmd, unsigned long arg)
> +{
> + struct virtiovf_pci_core_device *virtvdev = container_of(
> + core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> + unsigned long minsz = offsetofend(struct vfio_region_info, offset);
> + void __user *uarg = (void __user *)arg;
> + struct vfio_region_info info = {};
> +
> + if (copy_from_user(&info, uarg, minsz))
> + return -EFAULT;
> +
> + if (info.argsz < minsz)
> + return -EINVAL;
> +
> + switch (info.index) {
> + case VFIO_PCI_BAR0_REGION_INDEX:
> + info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
> + info.size = virtvdev->bar0_virtual_buf_size;
> + info.flags = VFIO_REGION_INFO_FLAG_READ |
> + VFIO_REGION_INFO_FLAG_WRITE;
> + return copy_to_user(uarg, &info, minsz) ? -EFAULT : 0;
> + default:
> + return vfio_pci_core_ioctl(core_vdev, cmd, arg);
> + }
> +}
> +
> +static long
> +virtiovf_vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
> + unsigned long arg)
> +{
> + switch (cmd) {
> + case VFIO_DEVICE_GET_REGION_INFO:
> + return virtiovf_pci_ioctl_get_region_info(core_vdev, cmd, arg);
> + default:
> + return vfio_pci_core_ioctl(core_vdev, cmd, arg);
> + }
> +}
> +
> +static int
> +virtiovf_set_notify_addr(struct virtiovf_pci_core_device *virtvdev)
> +{
> + struct vfio_pci_core_device *core_device = &virtvdev->core_device;
> + int ret;
> +
> + /* Setup the BAR where the 'notify' exists to be used by vfio as well
> + * This will let us mmap it only once and use it when needed.
> + */
> + ret = vfio_pci_core_setup_barmap(core_device,
> + virtvdev->notify_bar);
> + if (ret)
> + return ret;
> +
> + virtvdev->notify_addr = core_device->barmap[virtvdev->notify_bar] +
> + virtvdev->notify_offset;
> + return 0;
> +}
> +
> +static int virtiovf_pci_open_device(struct vfio_device *core_vdev)
> +{
> + struct virtiovf_pci_core_device *virtvdev = container_of(
> + core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> + struct vfio_pci_core_device *vdev = &virtvdev->core_device;
> + int ret;
> +
> + ret = vfio_pci_core_enable(vdev);
> + if (ret)
> + return ret;
> +
> + if (virtvdev->bar0_virtual_buf) {
> + /* upon close_device() the vfio_pci_core_disable() is called
> + * and will close all the previous mmaps, so it seems that the
> + * valid life cycle for the 'notify' addr is per open/close.
> + */
> + ret = virtiovf_set_notify_addr(virtvdev);
> + if (ret) {
> + vfio_pci_core_disable(vdev);
> + return ret;
> + }
> + }
> +
> + vfio_pci_core_finish_enable(vdev);
> + return 0;
> +}
> +
> +static void virtiovf_pci_close_device(struct vfio_device *core_vdev)
> +{
> + vfio_pci_core_close_device(core_vdev);
> +}
> +
> +static int virtiovf_get_device_config_size(unsigned short device)
> +{
> + switch (device) {
> + case 0x1041:
> + /* network card */
> + return offsetofend(struct virtio_net_config, status);
> + default:
> + return 0;
> + }
> +}
> +
> +static int virtiovf_read_notify_info(struct virtiovf_pci_core_device *virtvdev)
> +{
> + u64 offset;
> + int ret;
> + u8 bar;
> +
> + ret = virtiovf_cmd_lq_read_notify(virtvdev,
> + VIRTIO_ADMIN_CMD_NOTIFY_INFO_FLAGS_OWNER_MEM,
> + &bar, &offset);
> + if (ret)
> + return ret;
> +
> + virtvdev->notify_bar = bar;
> + virtvdev->notify_offset = offset;
> + return 0;
> +}
> +
> +static int virtiovf_pci_init_device(struct vfio_device *core_vdev)
> +{
> + struct virtiovf_pci_core_device *virtvdev = container_of(
> + core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> + struct pci_dev *pdev;
> + int ret;
> +
> + ret = vfio_pci_core_init_dev(core_vdev);
> + if (ret)
> + return ret;
> +
> + pdev = virtvdev->core_device.pdev;
> + virtvdev->vf_id = pci_iov_vf_id(pdev);
> + if (virtvdev->vf_id < 0)
> + return -EINVAL;
> +
> + ret = virtiovf_read_notify_info(virtvdev);
> + if (ret)
> + return ret;
> +
> + virtvdev->bar0_virtual_buf_size = VIRTIO_LEGACY_IO_BAR_HEADER_LEN +
> + VIRTIO_LEGACY_IO_BAR_MSIX_HEADER_LEN +
> + virtiovf_get_device_config_size(pdev->device);
> + virtvdev->bar0_virtual_buf = kzalloc(virtvdev->bar0_virtual_buf_size,
> + GFP_KERNEL);
> + if (!virtvdev->bar0_virtual_buf)
> + return -ENOMEM;
> + mutex_init(&virtvdev->bar_mutex);
> + return 0;
> +}
There is very little vfio-specific code above. I feel that with a bit of
refactoring the logic parts can be moved to virtio, with just the
vfio things under vfio (mostly, vfio lets the user disable memory or set a low
PM state, so it has to be careful not to access the device in such cases).
E.g.:
virtio_legacy_translate_offset(....)
copy_from_user(...)
vfio_pci_iowrite16(...)
and virtio_legacy_translate_offset would live under virtio.
Something similar for config space hacks.
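One possible shape of that split, expanding the sketch above with placeholder
names (none of these are existing virtio APIs):

/* Lives under virtio: knows the legacy register layout and which admin
 * command (COMMON_CFG vs DEV_CFG, read vs write) a given BAR0 offset
 * maps to.  No vfio types involved. */
int virtio_pci_legacy_translate_offset(loff_t pos, bool msix_enabled,
				       bool read, u16 *opcode);

/* Stays under vfio: copies to/from the user buffer and touches the
 * device, so it is the part that has to honour vfio's rules about
 * disabled memory and low power states. */
ssize_t virtiovf_bar0_rw(struct virtiovf_pci_core_device *virtvdev,
			 loff_t pos, char __user *buf, size_t count,
			 bool read);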
> +
> +static void virtiovf_pci_core_release_dev(struct vfio_device *core_vdev)
> +{
> + struct virtiovf_pci_core_device *virtvdev = container_of(
> + core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> +
> + kfree(virtvdev->bar0_virtual_buf);
> + vfio_pci_core_release_dev(core_vdev);
> +}
> +
> +static const struct vfio_device_ops virtiovf_acc_vfio_pci_tran_ops = {
> + .name = "virtio-transitional-vfio-pci",
> + .init = virtiovf_pci_init_device,
> + .release = virtiovf_pci_core_release_dev,
> + .open_device = virtiovf_pci_open_device,
> + .close_device = virtiovf_pci_close_device,
> + .ioctl = virtiovf_vfio_pci_core_ioctl,
> + .read = virtiovf_pci_core_read,
> + .write = virtiovf_pci_core_write,
> + .mmap = vfio_pci_core_mmap,
> + .request = vfio_pci_core_request,
> + .match = vfio_pci_core_match,
> + .bind_iommufd = vfio_iommufd_physical_bind,
> + .unbind_iommufd = vfio_iommufd_physical_unbind,
> + .attach_ioas = vfio_iommufd_physical_attach_ioas,
> +};
> +
> +static const struct vfio_device_ops virtiovf_acc_vfio_pci_ops = {
> + .name = "virtio-acc-vfio-pci",
> + .init = vfio_pci_core_init_dev,
> + .release = vfio_pci_core_release_dev,
> + .open_device = virtiovf_pci_open_device,
> + .close_device = virtiovf_pci_close_device,
> + .ioctl = vfio_pci_core_ioctl,
> + .device_feature = vfio_pci_core_ioctl_feature,
> + .read = vfio_pci_core_read,
> + .write = vfio_pci_core_write,
> + .mmap = vfio_pci_core_mmap,
> + .request = vfio_pci_core_request,
> + .match = vfio_pci_core_match,
> + .bind_iommufd = vfio_iommufd_physical_bind,
> + .unbind_iommufd = vfio_iommufd_physical_unbind,
> + .attach_ioas = vfio_iommufd_physical_attach_ioas,
> +};
> +
> +static bool virtiovf_bar0_exists(struct pci_dev *pdev)
> +{
> + struct resource *res = pdev->resource;
> +
> + return res->flags ? true : false;
> +}
> +
> +#define VIRTIOVF_USE_ADMIN_CMD_BITMAP \
> + (BIT_ULL(VIRTIO_ADMIN_CMD_LIST_QUERY) | \
> + BIT_ULL(VIRTIO_ADMIN_CMD_LIST_USE) | \
> + BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_WRITE) | \
> + BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_READ) | \
> + BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_WRITE) | \
> + BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_READ) | \
> + BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_NOTIFY_INFO))
> +
> +static bool virtiovf_support_legacy_access(struct pci_dev *pdev)
> +{
> + int buf_size = DIV_ROUND_UP(VIRTIO_ADMIN_MAX_CMD_OPCODE, 64) * 8;
> + u8 *buf;
> + int ret;
> +
> + /* Only virtio-net is supported/tested so far */
> + if (pdev->device != 0x1041)
> + return false;
> +
> + buf = kzalloc(buf_size, GFP_KERNEL);
> + if (!buf)
> + return false;
> +
> + ret = virtiovf_cmd_list_query(pdev, buf, buf_size);
> + if (ret)
> + goto end;
> +
> + if ((le64_to_cpup((__le64 *)buf) & VIRTIOVF_USE_ADMIN_CMD_BITMAP) !=
> + VIRTIOVF_USE_ADMIN_CMD_BITMAP) {
> + ret = -EOPNOTSUPP;
> + goto end;
> + }
> +
> + /* confirm the used commands */
> + memset(buf, 0, buf_size);
> + *(__le64 *)buf = cpu_to_le64(VIRTIOVF_USE_ADMIN_CMD_BITMAP);
> + ret = virtiovf_cmd_list_use(pdev, buf, buf_size);
> +
> +end:
> + kfree(buf);
> + return ret ? false : true;
> +}
This is virtio stuff too.
> +
> +static int virtiovf_pci_probe(struct pci_dev *pdev,
> + const struct pci_device_id *id)
> +{
> + const struct vfio_device_ops *ops = &virtiovf_acc_vfio_pci_ops;
> + struct virtiovf_pci_core_device *virtvdev;
> + int ret;
> +
> + if (pdev->is_virtfn && virtiovf_support_legacy_access(pdev) &&
> + !virtiovf_bar0_exists(pdev) && pdev->msix_cap)
> + ops = &virtiovf_acc_vfio_pci_tran_ops;
> +
> + virtvdev = vfio_alloc_device(virtiovf_pci_core_device, core_device.vdev,
> + &pdev->dev, ops);
> + if (IS_ERR(virtvdev))
> + return PTR_ERR(virtvdev);
> +
> + dev_set_drvdata(&pdev->dev, &virtvdev->core_device);
> + ret = vfio_pci_core_register_device(&virtvdev->core_device);
> + if (ret)
> + goto out;
> + return 0;
> +out:
> + vfio_put_device(&virtvdev->core_device.vdev);
> + return ret;
> +}
> +
> +static void virtiovf_pci_remove(struct pci_dev *pdev)
> +{
> + struct virtiovf_pci_core_device *virtvdev = dev_get_drvdata(&pdev->dev);
> +
> + vfio_pci_core_unregister_device(&virtvdev->core_device);
> + vfio_put_device(&virtvdev->core_device.vdev);
> +}
> +
> +static const struct pci_device_id virtiovf_pci_table[] = {
> + { PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_REDHAT_QUMRANET, PCI_ANY_ID) },
> + {}
> +};
> +
> +MODULE_DEVICE_TABLE(pci, virtiovf_pci_table);
> +
> +static struct pci_driver virtiovf_pci_driver = {
> + .name = KBUILD_MODNAME,
> + .id_table = virtiovf_pci_table,
> + .probe = virtiovf_pci_probe,
> + .remove = virtiovf_pci_remove,
> + .err_handler = &vfio_pci_core_err_handlers,
> + .driver_managed_dma = true,
> +};
> +
> +module_pci_driver(virtiovf_pci_driver);
> +
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Yishai Hadas <yishaih@nvidia.com>");
> +MODULE_DESCRIPTION(
> + "VIRTIO VFIO PCI - User Level meta-driver for VIRTIO device family");
> --
> 2.27.0
^ permalink raw reply [flat|nested] 181+ messages in thread
* Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-21 12:40 ` [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices Yishai Hadas
` (4 preceding siblings ...)
2023-09-22 10:10 ` Michael S. Tsirkin
@ 2023-09-22 15:53 ` Michael S. Tsirkin
2023-10-02 11:23 ` Parav Pandit
5 siblings, 1 reply; 181+ messages in thread
From: Michael S. Tsirkin @ 2023-09-22 15:53 UTC (permalink / raw)
To: Yishai Hadas
Cc: alex.williamson, jasowang, jgg, kvm, virtualization, parav, feliu,
jiri, kevin.tian, joao.m.martins, leonro, maorg
On Thu, Sep 21, 2023 at 03:40:40PM +0300, Yishai Hadas wrote:
> Introduce a vfio driver over virtio devices to support the legacy
> interface functionality for VFs.
>
> Background, from the virtio spec [1].
> --------------------------------------------------------------------
> In some systems, there is a need to support a virtio legacy driver with
> a device that does not directly support the legacy interface. In such
> scenarios, a group owner device can provide the legacy interface
> functionality for the group member devices. The driver of the owner
> device can then access the legacy interface of a member device on behalf
> of the legacy member device driver.
>
> For example, with the SR-IOV group type, group members (VFs) can not
> present the legacy interface in an I/O BAR in BAR0 as expected by the
> legacy pci driver. If the legacy driver is running inside a virtual
> machine, the hypervisor executing the virtual machine can present a
> virtual device with an I/O BAR in BAR0. The hypervisor intercepts the
> legacy driver accesses to this I/O BAR and forwards them to the group
> owner device (PF) using group administration commands.
> --------------------------------------------------------------------
>
> Specifically, this driver adds support for a virtio-net VF to be exposed
> as a transitional device to a guest driver and allows the legacy IO BAR
> functionality on top.
>
> This allows a VM which uses a legacy virtio-net driver in the guest to
> work transparently over a VF whose host-side driver is this new
> driver.
>
> The driver can be extended easily to support some other types of virtio
> devices (e.g virtio-blk), by adding in a few places the specific type
> properties as was done for virtio-net.
>
> For now, only the virtio-net use case was tested and as such we introduce
> the support only for such a device.
>
> Practically,
> Upon probing a VF for a virtio-net device, in case its PF supports
> legacy access over the virtio admin commands and the VF doesn't have BAR
> 0, we set some specific 'vfio_device_ops' to be able to simulate in SW a
> transitional device with I/O BAR in BAR 0.
>
> The existence of the simulated I/O bar is reported later on by
> overwriting the VFIO_DEVICE_GET_REGION_INFO command and the device
> exposes itself as a transitional device by overwriting some properties
> upon reading its config space.
>
> Once we report the existence of I/O BAR as BAR 0 a legacy driver in the
> guest may use it via read/write calls according to the virtio
> specification.
>
> Any read/write towards the control parts of the BAR will be captured by
> the new driver and will be translated into admin commands towards the
> device.
>
> Any data path read/write access (i.e. virtio driver notifications) will
> be forwarded to the physical BAR whose properties were supplied by
> the VIRTIO_PCI_QUEUE_NOTIFY command upon the probing/init flow.
>
> With that code in place a legacy driver in the guest has the look and
> feel as if having a transitional device with legacy support for both its
> control and data path flows.
>
> [1]
> https://github.com/oasis-tcs/virtio-spec/commit/03c2d32e5093ca9f2a17797242fbef88efe94b8c
>
> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> ---
> MAINTAINERS | 6 +
> drivers/vfio/pci/Kconfig | 2 +
> drivers/vfio/pci/Makefile | 2 +
> drivers/vfio/pci/virtio/Kconfig | 15 +
> drivers/vfio/pci/virtio/Makefile | 4 +
> drivers/vfio/pci/virtio/cmd.c | 4 +-
> drivers/vfio/pci/virtio/cmd.h | 8 +
> drivers/vfio/pci/virtio/main.c | 546 +++++++++++++++++++++++++++++++
> 8 files changed, 585 insertions(+), 2 deletions(-)
> create mode 100644 drivers/vfio/pci/virtio/Kconfig
> create mode 100644 drivers/vfio/pci/virtio/Makefile
> create mode 100644 drivers/vfio/pci/virtio/main.c
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index bf0f54c24f81..5098418c8389 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -22624,6 +22624,12 @@ L: kvm@vger.kernel.org
> S: Maintained
> F: drivers/vfio/pci/mlx5/
>
> +VFIO VIRTIO PCI DRIVER
> +M: Yishai Hadas <yishaih@nvidia.com>
> +L: kvm@vger.kernel.org
> +S: Maintained
> +F: drivers/vfio/pci/virtio
> +
> VFIO PCI DEVICE SPECIFIC DRIVERS
> R: Jason Gunthorpe <jgg@nvidia.com>
> R: Yishai Hadas <yishaih@nvidia.com>
> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
> index 8125e5f37832..18c397df566d 100644
> --- a/drivers/vfio/pci/Kconfig
> +++ b/drivers/vfio/pci/Kconfig
> @@ -65,4 +65,6 @@ source "drivers/vfio/pci/hisilicon/Kconfig"
>
> source "drivers/vfio/pci/pds/Kconfig"
>
> +source "drivers/vfio/pci/virtio/Kconfig"
> +
> endmenu
> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> index 45167be462d8..046139a4eca5 100644
> --- a/drivers/vfio/pci/Makefile
> +++ b/drivers/vfio/pci/Makefile
> @@ -13,3 +13,5 @@ obj-$(CONFIG_MLX5_VFIO_PCI) += mlx5/
> obj-$(CONFIG_HISI_ACC_VFIO_PCI) += hisilicon/
>
> obj-$(CONFIG_PDS_VFIO_PCI) += pds/
> +
> +obj-$(CONFIG_VIRTIO_VFIO_PCI) += virtio/
> diff --git a/drivers/vfio/pci/virtio/Kconfig b/drivers/vfio/pci/virtio/Kconfig
> new file mode 100644
> index 000000000000..89eddce8b1bd
> --- /dev/null
> +++ b/drivers/vfio/pci/virtio/Kconfig
> @@ -0,0 +1,15 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +config VIRTIO_VFIO_PCI
> + tristate "VFIO support for VIRTIO PCI devices"
> + depends on VIRTIO_PCI
> + select VFIO_PCI_CORE
> + help
> + This provides support for exposing VIRTIO VF devices using the VFIO
> + framework that can work with a legacy virtio driver in the guest.
> + Based on PCIe spec, VFs do not support I/O Space; thus, VF BARs shall
> + not indicate I/O Space.
> + As of that this driver emulated I/O BAR in software to let a VF be
> + seen as a transitional device in the guest and let it work with
> + a legacy driver.
> +
> + If you don't know what to do here, say N.
> diff --git a/drivers/vfio/pci/virtio/Makefile b/drivers/vfio/pci/virtio/Makefile
> new file mode 100644
> index 000000000000..584372648a03
> --- /dev/null
> +++ b/drivers/vfio/pci/virtio/Makefile
> @@ -0,0 +1,4 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +obj-$(CONFIG_VIRTIO_VFIO_PCI) += virtio-vfio-pci.o
> +virtio-vfio-pci-y := main.o cmd.o
> +
> diff --git a/drivers/vfio/pci/virtio/cmd.c b/drivers/vfio/pci/virtio/cmd.c
> index f068239cdbb0..aea9d25fbf1d 100644
> --- a/drivers/vfio/pci/virtio/cmd.c
> +++ b/drivers/vfio/pci/virtio/cmd.c
> @@ -44,7 +44,7 @@ int virtiovf_cmd_lr_write(struct virtiovf_pci_core_device *virtvdev, u16 opcode,
> {
> struct virtio_device *virtio_dev =
> virtio_pci_vf_get_pf_dev(virtvdev->core_device.pdev);
> - struct virtio_admin_cmd_data_lr_write *in;
> + struct virtio_admin_cmd_legacy_wr_data *in;
> struct scatterlist in_sg;
> struct virtio_admin_cmd cmd = {};
> int ret;
> @@ -74,7 +74,7 @@ int virtiovf_cmd_lr_read(struct virtiovf_pci_core_device *virtvdev, u16 opcode,
> {
> struct virtio_device *virtio_dev =
> virtio_pci_vf_get_pf_dev(virtvdev->core_device.pdev);
> - struct virtio_admin_cmd_data_lr_read *in;
> + struct virtio_admin_cmd_legacy_rd_data *in;
> struct scatterlist in_sg, out_sg;
> struct virtio_admin_cmd cmd = {};
> int ret;
> diff --git a/drivers/vfio/pci/virtio/cmd.h b/drivers/vfio/pci/virtio/cmd.h
> index c2a3645f4b90..347b1dc85570 100644
> --- a/drivers/vfio/pci/virtio/cmd.h
> +++ b/drivers/vfio/pci/virtio/cmd.h
> @@ -13,7 +13,15 @@
>
> struct virtiovf_pci_core_device {
> struct vfio_pci_core_device core_device;
> + u8 bar0_virtual_buf_size;
> + u8 *bar0_virtual_buf;
> + /* synchronize access to the virtual buf */
> + struct mutex bar_mutex;
> int vf_id;
> + void __iomem *notify_addr;
> + u32 notify_offset;
> + u8 notify_bar;
> + u8 pci_cmd_io :1;
> };
>
> int virtiovf_cmd_list_query(struct pci_dev *pdev, u8 *buf, int buf_size);
> diff --git a/drivers/vfio/pci/virtio/main.c b/drivers/vfio/pci/virtio/main.c
> new file mode 100644
> index 000000000000..2486991c49f3
> --- /dev/null
> +++ b/drivers/vfio/pci/virtio/main.c
> @@ -0,0 +1,546 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved
> + */
> +
> +#include <linux/device.h>
> +#include <linux/module.h>
> +#include <linux/mutex.h>
> +#include <linux/pci.h>
> +#include <linux/pm_runtime.h>
> +#include <linux/types.h>
> +#include <linux/uaccess.h>
> +#include <linux/vfio.h>
> +#include <linux/vfio_pci_core.h>
> +#include <linux/virtio_pci.h>
> +#include <linux/virtio_net.h>
> +#include <linux/virtio_pci_modern.h>
> +
> +#include "cmd.h"
> +
> +#define VIRTIO_LEGACY_IO_BAR_HEADER_LEN 20
> +#define VIRTIO_LEGACY_IO_BAR_MSIX_HEADER_LEN 4
> +
> +static int virtiovf_issue_lr_cmd(struct virtiovf_pci_core_device *virtvdev,
> + loff_t pos, char __user *buf,
> + size_t count, bool read)
> +{
> + u8 *bar0_buf = virtvdev->bar0_virtual_buf;
> + u16 opcode;
> + int ret;
> +
> + mutex_lock(&virtvdev->bar_mutex);
> + if (read) {
> + opcode = (pos < VIRTIO_PCI_CONFIG_OFF(true)) ?
> + VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_READ :
> + VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_READ;
> + ret = virtiovf_cmd_lr_read(virtvdev, opcode, pos,
> + count, bar0_buf + pos);
> + if (ret)
> + goto out;
> + if (copy_to_user(buf, bar0_buf + pos, count))
> + ret = -EFAULT;
> + goto out;
> + }
> +
> + if (copy_from_user(bar0_buf + pos, buf, count)) {
> + ret = -EFAULT;
> + goto out;
> + }
> +
> + opcode = (pos < VIRTIO_PCI_CONFIG_OFF(true)) ?
> + VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_WRITE :
> + VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_WRITE;
> + ret = virtiovf_cmd_lr_write(virtvdev, opcode, pos, count,
> + bar0_buf + pos);
> +out:
> + mutex_unlock(&virtvdev->bar_mutex);
> + return ret;
> +}
> +
> +static int
> +translate_io_bar_to_mem_bar(struct virtiovf_pci_core_device *virtvdev,
> + loff_t pos, char __user *buf,
> + size_t count, bool read)
> +{
> + struct vfio_pci_core_device *core_device = &virtvdev->core_device;
> + u16 queue_notify;
> + int ret;
> +
> + if (pos + count > virtvdev->bar0_virtual_buf_size)
> + return -EINVAL;
> +
> + switch (pos) {
> + case VIRTIO_PCI_QUEUE_NOTIFY:
> + if (count != sizeof(queue_notify))
> + return -EINVAL;
> + if (read) {
> + ret = vfio_pci_ioread16(core_device, true, &queue_notify,
> + virtvdev->notify_addr);
> + if (ret)
> + return ret;
> + if (copy_to_user(buf, &queue_notify,
> + sizeof(queue_notify)))
> + return -EFAULT;
> + break;
> + }
> +
> + if (copy_from_user(&queue_notify, buf, count))
> + return -EFAULT;
> +
> + ret = vfio_pci_iowrite16(core_device, true, queue_notify,
> + virtvdev->notify_addr);
> + break;
> + default:
> + ret = virtiovf_issue_lr_cmd(virtvdev, pos, buf, count, read);
> + }
> +
> + return ret ? ret : count;
> +}
> +
> +static bool range_contains_range(loff_t range1_start, size_t count1,
> + loff_t range2_start, size_t count2,
> + loff_t *start_offset)
> +{
> + if (range1_start <= range2_start &&
> + range1_start + count1 >= range2_start + count2) {
> + *start_offset = range2_start - range1_start;
> + return true;
> + }
> + return false;
> +}
> +
> +static ssize_t virtiovf_pci_read_config(struct vfio_device *core_vdev,
> + char __user *buf, size_t count,
> + loff_t *ppos)
> +{
> + struct virtiovf_pci_core_device *virtvdev = container_of(
> + core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> + loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> + loff_t copy_offset;
> + __le32 val32;
> + __le16 val16;
> + u8 val8;
> + int ret;
> +
> + ret = vfio_pci_core_read(core_vdev, buf, count, ppos);
> + if (ret < 0)
> + return ret;
> +
> + if (range_contains_range(pos, count, PCI_DEVICE_ID, sizeof(val16),
> + &copy_offset)) {
> + val16 = cpu_to_le16(0x1000);
> + if (copy_to_user(buf + copy_offset, &val16, sizeof(val16)))
> + return -EFAULT;
> + }
> +
> + if (virtvdev->pci_cmd_io &&
> + range_contains_range(pos, count, PCI_COMMAND, sizeof(val16),
> + &copy_offset)) {
> + if (copy_from_user(&val16, buf + copy_offset, sizeof(val16)))
> + return -EFAULT;
> + val16 |= cpu_to_le16(PCI_COMMAND_IO);
> + if (copy_to_user(buf + copy_offset, &val16, sizeof(val16)))
> + return -EFAULT;
> + }
> +
> + if (range_contains_range(pos, count, PCI_REVISION_ID, sizeof(val8),
> + &copy_offset)) {
> + /* Transitional devices need to have revision 0 */
> + val8 = 0;
> + if (copy_to_user(buf + copy_offset, &val8, sizeof(val8)))
> + return -EFAULT;
> + }
> +
> + if (range_contains_range(pos, count, PCI_BASE_ADDRESS_0, sizeof(val32),
> + &copy_offset)) {
> + val32 = cpu_to_le32(PCI_BASE_ADDRESS_SPACE_IO);
> + if (copy_to_user(buf + copy_offset, &val32, sizeof(val32)))
> + return -EFAULT;
> + }
> +
> + if (range_contains_range(pos, count, PCI_SUBSYSTEM_ID, sizeof(val16),
> + &copy_offset)) {
> + /* Transitional devices use the PCI subsystem device id as
> + * the virtio device id, the same as the legacy driver always did.
> + */
> + val16 = cpu_to_le16(VIRTIO_ID_NET);
> + if (copy_to_user(buf + copy_offset, &val16, sizeof(val16)))
> + return -EFAULT;
> + }
> +
> + return count;
> +}
> +
> +static ssize_t
> +virtiovf_pci_core_read(struct vfio_device *core_vdev, char __user *buf,
> + size_t count, loff_t *ppos)
> +{
> + struct virtiovf_pci_core_device *virtvdev = container_of(
> + core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> + struct pci_dev *pdev = virtvdev->core_device.pdev;
> + unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
> + loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> + int ret;
> +
> + if (!count)
> + return 0;
> +
> + if (index == VFIO_PCI_CONFIG_REGION_INDEX)
> + return virtiovf_pci_read_config(core_vdev, buf, count, ppos);
> +
> + if (index != VFIO_PCI_BAR0_REGION_INDEX)
> + return vfio_pci_core_read(core_vdev, buf, count, ppos);
> +
> + ret = pm_runtime_resume_and_get(&pdev->dev);
> + if (ret) {
> + pci_info_ratelimited(pdev, "runtime resume failed %d\n",
> + ret);
> + return -EIO;
> + }
> +
> + ret = translate_io_bar_to_mem_bar(virtvdev, pos, buf, count, true);
> + pm_runtime_put(&pdev->dev);
> + return ret;
> +}
> +
> +static ssize_t
> +virtiovf_pci_core_write(struct vfio_device *core_vdev, const char __user *buf,
> + size_t count, loff_t *ppos)
> +{
> + struct virtiovf_pci_core_device *virtvdev = container_of(
> + core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> + struct pci_dev *pdev = virtvdev->core_device.pdev;
> + unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
> + loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> + int ret;
> +
> + if (!count)
> + return 0;
> +
> + if (index == VFIO_PCI_CONFIG_REGION_INDEX) {
> + loff_t copy_offset;
> + u16 cmd;
> +
> + if (range_contains_range(pos, count, PCI_COMMAND, sizeof(cmd),
> + &copy_offset)) {
> + if (copy_from_user(&cmd, buf + copy_offset, sizeof(cmd)))
> + return -EFAULT;
> + virtvdev->pci_cmd_io = (cmd & PCI_COMMAND_IO);
> + }
> + }
> +
> + if (index != VFIO_PCI_BAR0_REGION_INDEX)
> + return vfio_pci_core_write(core_vdev, buf, count, ppos);
> +
> + ret = pm_runtime_resume_and_get(&pdev->dev);
> + if (ret) {
> + pci_info_ratelimited(pdev, "runtime resume failed %d\n", ret);
> + return -EIO;
> + }
> +
> + ret = translate_io_bar_to_mem_bar(virtvdev, pos, (char __user *)buf, count, false);
> + pm_runtime_put(&pdev->dev);
> + return ret;
> +}
> +
> +static int
> +virtiovf_pci_ioctl_get_region_info(struct vfio_device *core_vdev,
> + unsigned int cmd, unsigned long arg)
> +{
> + struct virtiovf_pci_core_device *virtvdev = container_of(
> + core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> + unsigned long minsz = offsetofend(struct vfio_region_info, offset);
> + void __user *uarg = (void __user *)arg;
> + struct vfio_region_info info = {};
> +
> + if (copy_from_user(&info, uarg, minsz))
> + return -EFAULT;
> +
> + if (info.argsz < minsz)
> + return -EINVAL;
> +
> + switch (info.index) {
> + case VFIO_PCI_BAR0_REGION_INDEX:
> + info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
> + info.size = virtvdev->bar0_virtual_buf_size;
> + info.flags = VFIO_REGION_INFO_FLAG_READ |
> + VFIO_REGION_INFO_FLAG_WRITE;
> + return copy_to_user(uarg, &info, minsz) ? -EFAULT : 0;
> + default:
> + return vfio_pci_core_ioctl(core_vdev, cmd, arg);
> + }
> +}
> +
> +static long
> +virtiovf_vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
> + unsigned long arg)
> +{
> + switch (cmd) {
> + case VFIO_DEVICE_GET_REGION_INFO:
> + return virtiovf_pci_ioctl_get_region_info(core_vdev, cmd, arg);
> + default:
> + return vfio_pci_core_ioctl(core_vdev, cmd, arg);
> + }
> +}
> +
> +static int
> +virtiovf_set_notify_addr(struct virtiovf_pci_core_device *virtvdev)
> +{
> + struct vfio_pci_core_device *core_device = &virtvdev->core_device;
> + int ret;
> +
> + /* Set up the BAR where the 'notify' region lives so that vfio can
> + * use it as well. This will let us mmap it only once and use it
> + * when needed.
> + */
> + ret = vfio_pci_core_setup_barmap(core_device,
> + virtvdev->notify_bar);
> + if (ret)
> + return ret;
> +
> + virtvdev->notify_addr = core_device->barmap[virtvdev->notify_bar] +
> + virtvdev->notify_offset;
> + return 0;
> +}
> +
> +static int virtiovf_pci_open_device(struct vfio_device *core_vdev)
> +{
> + struct virtiovf_pci_core_device *virtvdev = container_of(
> + core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> + struct vfio_pci_core_device *vdev = &virtvdev->core_device;
> + int ret;
> +
> + ret = vfio_pci_core_enable(vdev);
> + if (ret)
> + return ret;
> +
> + if (virtvdev->bar0_virtual_buf) {
> + /* Upon close_device(), vfio_pci_core_disable() is called and will
> + * tear down all previous mmaps, so the valid life cycle of the
> + * 'notify' address is per open/close.
> + */
> + ret = virtiovf_set_notify_addr(virtvdev);
> + if (ret) {
> + vfio_pci_core_disable(vdev);
> + return ret;
> + }
> + }
> +
> + vfio_pci_core_finish_enable(vdev);
> + return 0;
> +}
> +
> +static void virtiovf_pci_close_device(struct vfio_device *core_vdev)
> +{
> + vfio_pci_core_close_device(core_vdev);
> +}
> +
> +static int virtiovf_get_device_config_size(unsigned short device)
> +{
> + switch (device) {
> + case 0x1041:
> + /* network card */
> + return offsetofend(struct virtio_net_config, status);
> + default:
> + return 0;
> + }
> +}
> +
> +static int virtiovf_read_notify_info(struct virtiovf_pci_core_device *virtvdev)
> +{
> + u64 offset;
> + int ret;
> + u8 bar;
> +
> + ret = virtiovf_cmd_lq_read_notify(virtvdev,
> + VIRTIO_ADMIN_CMD_NOTIFY_INFO_FLAGS_OWNER_MEM,
> + &bar, &offset);
> + if (ret)
> + return ret;
> +
> + virtvdev->notify_bar = bar;
> + virtvdev->notify_offset = offset;
> + return 0;
> +}
> +
> +static int virtiovf_pci_init_device(struct vfio_device *core_vdev)
> +{
> + struct virtiovf_pci_core_device *virtvdev = container_of(
> + core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> + struct pci_dev *pdev;
> + int ret;
> +
> + ret = vfio_pci_core_init_dev(core_vdev);
> + if (ret)
> + return ret;
> +
> + pdev = virtvdev->core_device.pdev;
> + virtvdev->vf_id = pci_iov_vf_id(pdev);
> + if (virtvdev->vf_id < 0)
> + return -EINVAL;
> +
> + ret = virtiovf_read_notify_info(virtvdev);
> + if (ret)
> + return ret;
> +
> + virtvdev->bar0_virtual_buf_size = VIRTIO_LEGACY_IO_BAR_HEADER_LEN +
> + VIRTIO_LEGACY_IO_BAR_MSIX_HEADER_LEN +
> + virtiovf_get_device_config_size(pdev->device);
> + virtvdev->bar0_virtual_buf = kzalloc(virtvdev->bar0_virtual_buf_size,
> + GFP_KERNEL);
> + if (!virtvdev->bar0_virtual_buf)
> + return -ENOMEM;
> + mutex_init(&virtvdev->bar_mutex);
> + return 0;
> +}
> +
> +static void virtiovf_pci_core_release_dev(struct vfio_device *core_vdev)
> +{
> + struct virtiovf_pci_core_device *virtvdev = container_of(
> + core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> +
> + kfree(virtvdev->bar0_virtual_buf);
> + vfio_pci_core_release_dev(core_vdev);
> +}
> +
> +static const struct vfio_device_ops virtiovf_acc_vfio_pci_tran_ops = {
> + .name = "virtio-transitional-vfio-pci",
> + .init = virtiovf_pci_init_device,
> + .release = virtiovf_pci_core_release_dev,
> + .open_device = virtiovf_pci_open_device,
> + .close_device = virtiovf_pci_close_device,
> + .ioctl = virtiovf_vfio_pci_core_ioctl,
> + .read = virtiovf_pci_core_read,
> + .write = virtiovf_pci_core_write,
> + .mmap = vfio_pci_core_mmap,
> + .request = vfio_pci_core_request,
> + .match = vfio_pci_core_match,
> + .bind_iommufd = vfio_iommufd_physical_bind,
> + .unbind_iommufd = vfio_iommufd_physical_unbind,
> + .attach_ioas = vfio_iommufd_physical_attach_ioas,
> +};
> +
> +static const struct vfio_device_ops virtiovf_acc_vfio_pci_ops = {
> + .name = "virtio-acc-vfio-pci",
> + .init = vfio_pci_core_init_dev,
> + .release = vfio_pci_core_release_dev,
> + .open_device = virtiovf_pci_open_device,
> + .close_device = virtiovf_pci_close_device,
> + .ioctl = vfio_pci_core_ioctl,
> + .device_feature = vfio_pci_core_ioctl_feature,
> + .read = vfio_pci_core_read,
> + .write = vfio_pci_core_write,
> + .mmap = vfio_pci_core_mmap,
> + .request = vfio_pci_core_request,
> + .match = vfio_pci_core_match,
> + .bind_iommufd = vfio_iommufd_physical_bind,
> + .unbind_iommufd = vfio_iommufd_physical_unbind,
> + .attach_ioas = vfio_iommufd_physical_attach_ioas,
> +};
> +
> +static bool virtiovf_bar0_exists(struct pci_dev *pdev)
> +{
> + struct resource *res = pdev->resource;
> +
> + return res->flags ? true : false;
> +}
> +
> +#define VIRTIOVF_USE_ADMIN_CMD_BITMAP \
> + (BIT_ULL(VIRTIO_ADMIN_CMD_LIST_QUERY) | \
> + BIT_ULL(VIRTIO_ADMIN_CMD_LIST_USE) | \
> + BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_WRITE) | \
> + BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_READ) | \
> + BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_WRITE) | \
> + BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_READ) | \
> + BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_NOTIFY_INFO))
> +
> +static bool virtiovf_support_legacy_access(struct pci_dev *pdev)
> +{
> + int buf_size = DIV_ROUND_UP(VIRTIO_ADMIN_MAX_CMD_OPCODE, 64) * 8;
> + u8 *buf;
> + int ret;
> +
> + /* Only virtio-net is supported/tested so far */
> + if (pdev->device != 0x1041)
> + return false;
> +
> + buf = kzalloc(buf_size, GFP_KERNEL);
> + if (!buf)
> + return false;
> +
> + ret = virtiovf_cmd_list_query(pdev, buf, buf_size);
> + if (ret)
> + goto end;
> +
> + if ((le64_to_cpup((__le64 *)buf) & VIRTIOVF_USE_ADMIN_CMD_BITMAP) !=
> + VIRTIOVF_USE_ADMIN_CMD_BITMAP) {
> + ret = -EOPNOTSUPP;
> + goto end;
> + }
> +
> + /* confirm the used commands */
> + memset(buf, 0, buf_size);
> + *(__le64 *)buf = cpu_to_le64(VIRTIOVF_USE_ADMIN_CMD_BITMAP);
> + ret = virtiovf_cmd_list_use(pdev, buf, buf_size);
> +
> +end:
> + kfree(buf);
> + return ret ? false : true;
> +}
> +
> +static int virtiovf_pci_probe(struct pci_dev *pdev,
> + const struct pci_device_id *id)
> +{
> + const struct vfio_device_ops *ops = &virtiovf_acc_vfio_pci_ops;
> + struct virtiovf_pci_core_device *virtvdev;
> + int ret;
> +
> + if (pdev->is_virtfn && virtiovf_support_legacy_access(pdev) &&
> + !virtiovf_bar0_exists(pdev) && pdev->msix_cap)
I see this is the reason you set MSIX to true. But I think it's a
misunderstanding - that true means MSIX is enabled by the guest, not
just that the capability exists.
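To make the distinction concrete, here is a minimal sketch (not part of
the patch; it assumes only <linux/pci.h> and uses a hypothetical helper
name): pdev->msix_cap only says the capability exists, while the enable
bit lives in the MSI-X message control register.

	/* Sketch only: "capability exists" vs "currently enabled". */
	static bool sketch_msix_enabled(struct pci_dev *pdev)
	{
		u16 flags;

		if (!pdev->msix_cap)	/* capability not present at all */
			return false;
		if (pci_read_config_word(pdev, pdev->msix_cap + PCI_MSIX_FLAGS,
					 &flags))
			return false;
		return flags & PCI_MSIX_FLAGS_ENABLE;	/* enabled right now */
	}

In the vfio case the guest's enable state may need to be tracked
explicitly rather than read back from the physical function, so treat
the above as an illustration of the distinction only.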
> + ops = &virtiovf_acc_vfio_pci_tran_ops;
Actually, I remember there's a problem with just always going
transitional, and that is VIRTIO_F_ACCESS_PLATFORM - some configs just
break in weird ways because the device will go through an IOMMU. I
think it would be nicer if userspace had the last word on whether it
wants to enable legacy or not, even if the hardware supports it.
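One purely illustrative way to give the host that last word would be an
explicit opt-in knob; the parameter name below is hypothetical and not
something this patch defines:

	/* Hypothetical opt-in: only expose the transitional personality
	 * (emulated I/O BAR) when explicitly requested.
	 */
	static bool enable_legacy_io;
	module_param(enable_legacy_io, bool, 0444);
	MODULE_PARM_DESC(enable_legacy_io,
			 "Expose virtio VFs as transitional devices with an emulated I/O BAR");

	...
		if (enable_legacy_io && pdev->is_virtfn &&
		    virtiovf_support_legacy_access(pdev) &&
		    !virtiovf_bar0_exists(pdev) && pdev->msix_cap)
			ops = &virtiovf_acc_vfio_pci_tran_ops;

A per-device control would be finer grained, but the point is just that
the policy decision would sit outside the kernel's probe heuristic.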
> +
> + virtvdev = vfio_alloc_device(virtiovf_pci_core_device, core_device.vdev,
> + &pdev->dev, ops);
> + if (IS_ERR(virtvdev))
> + return PTR_ERR(virtvdev);
> +
> + dev_set_drvdata(&pdev->dev, &virtvdev->core_device);
> + ret = vfio_pci_core_register_device(&virtvdev->core_device);
> + if (ret)
> + goto out;
> + return 0;
> +out:
> + vfio_put_device(&virtvdev->core_device.vdev);
> + return ret;
> +}
> +
> +static void virtiovf_pci_remove(struct pci_dev *pdev)
> +{
> + struct virtiovf_pci_core_device *virtvdev = dev_get_drvdata(&pdev->dev);
> +
> + vfio_pci_core_unregister_device(&virtvdev->core_device);
> + vfio_put_device(&virtvdev->core_device.vdev);
> +}
> +
> +static const struct pci_device_id virtiovf_pci_table[] = {
> + { PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_REDHAT_QUMRANET, PCI_ANY_ID) },
> + {}
> +};
> +
> +MODULE_DEVICE_TABLE(pci, virtiovf_pci_table);
> +
> +static struct pci_driver virtiovf_pci_driver = {
> + .name = KBUILD_MODNAME,
> + .id_table = virtiovf_pci_table,
> + .probe = virtiovf_pci_probe,
> + .remove = virtiovf_pci_remove,
> + .err_handler = &vfio_pci_core_err_handlers,
> + .driver_managed_dma = true,
> +};
> +
> +module_pci_driver(virtiovf_pci_driver);
> +
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Yishai Hadas <yishaih@nvidia.com>");
> +MODULE_DESCRIPTION(
> + "VIRTIO VFIO PCI - User Level meta-driver for VIRTIO device family");
> --
> 2.27.0
^ permalink raw reply [flat|nested] 181+ messages in thread

* RE: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices
2023-09-22 15:53 ` Michael S. Tsirkin
@ 2023-10-02 11:23 ` Parav Pandit
0 siblings, 0 replies; 181+ messages in thread
From: Parav Pandit @ 2023-10-02 11:23 UTC (permalink / raw)
To: Michael S. Tsirkin, Yishai Hadas
Cc: alex.williamson@redhat.com, jasowang@redhat.com, Jason Gunthorpe,
kvm@vger.kernel.org, virtualization@lists.linux-foundation.org,
Feng Liu, Jiri Pirko, kevin.tian@intel.com,
joao.m.martins@oracle.com, Leon Romanovsky, Maor Gottlieb
> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Friday, September 22, 2023 9:23 PM
> > +static int virtiovf_pci_probe(struct pci_dev *pdev,
> > + const struct pci_device_id *id) {
> > + const struct vfio_device_ops *ops = &virtiovf_acc_vfio_pci_ops;
> > + struct virtiovf_pci_core_device *virtvdev;
> > + int ret;
> > +
> > + if (pdev->is_virtfn && virtiovf_support_legacy_access(pdev) &&
> > + !virtiovf_bar0_exists(pdev) && pdev->msix_cap)
>
> I see this is the reason you set MSIX to true. But I think it's a misunderstanding -
> that true means MSIX is enabled by the guest, not just that the capability exists.
The msix check here just looks like a sanity check to make sure that the guest can enable msix.
The msix enable check should live in the read()/write() calls to decide which AQ command to use,
i.e. whether to access the common config or the device config as written in the virtio spec.
Yishai, please fix the read()/write() calls to dynamically consider the offset of 24/20 based on the msix enabled state.
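Something along these lines, as a rough sketch only (untested;
"msix_enabled" stands for however the driver ends up tracking the
guest's MSI-X state):

	/* The legacy device config starts at offset 24 with MSI-X enabled
	 * and at 20 without, which is what VIRTIO_PCI_CONFIG_OFF() from
	 * <linux/virtio_pci.h> encodes.
	 */
	loff_t cfg_off = VIRTIO_PCI_CONFIG_OFF(msix_enabled);	/* 24 or 20 */

	if (pos < cfg_off)
		opcode = read ? VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_READ :
				VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_WRITE;
	else
		opcode = read ? VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_READ :
				VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_WRITE;

Depending on how cmd.c interprets the offset, the device-config offset
may also need to be rebased against cfg_off rather than a hardcoded
VIRTIO_PCI_CONFIG_OFF(true).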
^ permalink raw reply [flat|nested] 181+ messages in thread