Linux virtualization list
 help / color / mirror / Atom feed
* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Benjamin Herrenschmidt @ 2018-08-08 13:18 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: robh, srikar, Michael S. Tsirkin, mpe, Will Deacon, linux-kernel,
	linuxram, virtualization, jean-philippe.brucker, paulus,
	marc.zyngier, joe, robin.murphy, david, linuxppc-dev, elfring,
	haren, Anshuman Khandual
In-Reply-To: <20180808123036.GA2525@infradead.org>

On Wed, 2018-08-08 at 05:30 -0700, Christoph Hellwig wrote:
> On Wed, Aug 08, 2018 at 08:07:49PM +1000, Benjamin Herrenschmidt wrote:
> > Qemu virtio bypasses that iommu when the VIRTIO_F_IOMMU_PLATFORM flag
> > is not set (default) but there's nothing in the device-tree to tell the
> > guest about this since it's a violation of our pseries architecture, so
> > we just rely on Linux virtio "knowing" that it happens. It's a bit
> > yucky but that's now history...
> 
> That is ugly as hell, but it is how virtio works everywhere, so nothing
> special so far.

Yup.

> > Essentially pseries "architecturally" does not have the concept of not
> > having an iommu in the way and qemu violates that architecture today.
> > 
> > (Remember it comes from pHyp, our priorietary HV, which we are somewhat
> > mimmicing here).
> 
> It shouldnt be too hard to have a dt property that communicates this,
> should it?

We could invent something I suppose. The additional problem then (yeah
I know ... what a mess) is that qemu doesn't create the DT for PCI
devices, the firmware (SLOF) inside the guest does using normal PCI
probing.

That said, that FW could know about all the virtio vendor/device IDs,
check the VIRTIO_F_IOMMU_PLATFORM and set that property accordingly...
messy but doable. It's not a bus property (see my other reply below as
this could complicate things with your bus mask).

But we are drifting from the problem at hand :-) You propose we do set
VIRTIO_F_IOMMU_PLATFORM so we aren't in the above case, and the bypass
stuff works, so no need to touch it.

See my recap at the end of the email to make sure I understand fully
what you suggest.

> > So if we always set VIRTIO_F_IOMMU_PLATFORM, it *will* force all virtio
> > through that iommu and performance will suffer (esp vhost I suspect),
> > especially since adding/removing translations in the iommu is a
> > hypercall.

> Well, we'd nee to make sure that for this particular bus we skip the
> actualy iommu.

It's not a bus property. Qemu will happily mix up everything on the
same bus, that includes emulated devices that go through the emulated
iommu, real VFIO devices that go through an actual HW iommu and virtio
that bypasses everything.

This makes things tricky in general (not just in my powerpc secure VM
case) since, at least on powerpc but I suppose elsewhere too, iommu
related properties tend to be per "bus" while here, qemu will mix and
match.

But again, I think we are drifting away from the topic, see below

> > > It would not be the same effect.  The problem with that is that you must
> > > now assumes that your qemu knows that for example you might be passing
> > > a dma offset if the bus otherwise requires it. 
> > 
> > I would assume that arch_virtio_wants_dma_ops() only returns true when
> > no such offsets are involved, at least in our case that would be what
> > happens.
> 
> That would work, but we're really piling hacĸs ontop of hacks here.

Sort-of :-) At least none of what we are discussing now involves
touching the dma_ops themselves so we are not in the way of your big
cleanup operation here. But yeah, let's continue discussing your other
solution below.

> > >  Or in other words:
> > > you potentially break the contract between qemu and the guest of always
> > > passing down physical addresses.  If we explicitly change that contract
> > > through using a flag that says you pass bus address everything is fine.
> > 
> > For us a "bus address" is behind the iommu so that's what
> > VIRTIO_F_IOMMU_PLATFORM does already. We don't have the concept of a
> > bus address that is different. I suppose it's an ARMism to have DMA
> > offsets that are separate from iommus ? 
> 
> No, a lot of platforms support a bus address that has an offset from
> the physical address. including a lot of power platforms:

Ok, just talking past each other :-) For all the powerpc ones, these
*do* go through the iommu, which is what I meant. It's just a window of
the iommu that provides some kind of direct mapping of memory.

For pseries, there is no such thing however. What we do to avoid
constant map/unmap of iommu PTEs in pseries guests is that we use
hypercalls to create a 64-bit window and populate all its PTEs with an
identity mapping. But that's not as efficient as a real bypass.

There are good historical reasons for that, since pseries is a guest
platform, its memory is never really where the guest thinks it is, so
you always need an iommu to remap. Even for virtual devices, since for
most of them, in the "IBM" pHyp model, the "peer" is actually another
partition, so the virtual iommu handles translating accross the two
partitions.

Same goes with cell in HW, no real bypass, just the iommu being
confiured with very large pages and a fixed mapping.

powernv has a separate physical window that can be configured as a real
bypass though, so does the U4 DART. Not sure about the FSL one.

But yeah, your point stands, this is just implementation details.

> arch/powerpc/kernel/pci-common.c:       set_dma_offset(&dev->dev, PCI_DRAM_OFFSET);
> arch/powerpc/platforms/cell/iommu.c:            set_dma_offset(dev, cell_dma_nommu_offset);
> arch/powerpc/platforms/cell/iommu.c:            set_dma_offset(dev, addr);
> arch/powerpc/platforms/powernv/pci-ioda.c:      set_dma_offset(&pdev->dev, pe->tce_bypass_base);
> arch/powerpc/platforms/powernv/pci-ioda.c:                      set_dma_offset(&pdev->dev, (1ULL << 32));
> arch/powerpc/platforms/powernv/pci-ioda.c:              set_dma_offset(&dev->dev, pe->tce_bypass_base);
> arch/powerpc/platforms/pseries/iommu.c:                         set_dma_offset(dev, dma_offset);
> arch/powerpc/sysdev/dart_iommu.c:               set_dma_offset(&dev->dev, DART_U4_BYPASS_BASE);
> arch/powerpc/sysdev/fsl_pci.c:          set_dma_offset(dev, pci64_dma_offset);
> 
> to make things worse some platforms (at least on arm/arm64/mips/x86) can
> also require additional banking where it isn't even a single linear map
> but multiples windows.

Sure, but all of this is just the configuration of the iommu. But I
think we agree here, and your point remains valid, indeed my proposed
hack:

>       if ((flags & VIRTIO_F_IOMMU_PLATFORM) || arch_virtio_wants_dma_ops())

Will only work if the IOMMU and non-IOMMU path are completely equivalent.

We can provide that guarantee for our secure VM case, but not generally so if
we were to go down the route of a quirk in virtio, it might be better to
make it painfully obvious that it's specific to that one case with a different
kind of turd:

-	if (xen_domain())
+	if (xen_domain() || pseries_secure_vm())
		return true;

So to summarize, and make sure I'm not missing something, the two approaches
at hand are either:

 1- The above, which is a one liner and contained in the guest, so that's nice, but
also means another turd in virtio which isn't ...

 2- We force pseries to always set VIRTIO_F_IOMMU_PLATFORM, but with the current
architecture on our side that will force virtio to always go through an emulated
iommu, as pseries doesn't have the concept of a real bypass window, and thus will
impact performance for both secure and non-secure VMs.

 3- Invent a property that can be put in selected PCI device tree nodes that
indicates that for that device specifically, the iommu can be bypassed, along with
a hypercall to turn that bypass on/off. Virtio would then use VIRTIO_F_IOMMU_PLATFORM
but its DT nodes would also have that property and Linux would notice it and turn
bypass on.

The resulting properties of those options are:

1- Is what I want because it's the simplest, provides the best performance now,
   and works without code changes to qemu or non-secure Linux. However it does
   add a tiny turd to virtio which is annoying.

2- This works but it puts the iommu in the way always, thus reducing virtio performance
   accross the board for pseries unless we only do that for secure VMs but that is
   difficult (as discussed earlier).

3- This would recover the performance lost in -2-, however it requires qemu *and*
   guest changes. Specifically, existing guests (RHEL 7 etc...) would get the
   performance hit of -2- unless modified to call that 'enable bypass' call, which
   isn't great.

So imho we have to chose one of 3 not-great solutions here... Unless I missed
something in your ideas of course.

Cheers,
Ben.



_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Christoph Hellwig @ 2018-08-08 12:30 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: robh, srikar, Michael S. Tsirkin, mpe, Will Deacon, linux-kernel,
	linuxram, virtualization, Christoph Hellwig,
	jean-philippe.brucker, paulus, marc.zyngier, joe, robin.murphy,
	david, linuxppc-dev, elfring, haren, Anshuman Khandual
In-Reply-To: <4b596883892b5cb5560bef26fcd249e7107173ac.camel@kernel.crashing.org>

On Wed, Aug 08, 2018 at 08:07:49PM +1000, Benjamin Herrenschmidt wrote:
> Qemu virtio bypasses that iommu when the VIRTIO_F_IOMMU_PLATFORM flag
> is not set (default) but there's nothing in the device-tree to tell the
> guest about this since it's a violation of our pseries architecture, so
> we just rely on Linux virtio "knowing" that it happens. It's a bit
> yucky but that's now history...

That is ugly as hell, but it is how virtio works everywhere, so nothing
special so far.

> Essentially pseries "architecturally" does not have the concept of not
> having an iommu in the way and qemu violates that architecture today.
> 
> (Remember it comes from pHyp, our priorietary HV, which we are somewhat
> mimmicing here).

It shouldnt be too hard to have a dt property that communicates this,
should it?

> So if we always set VIRTIO_F_IOMMU_PLATFORM, it *will* force all virtio
> through that iommu and performance will suffer (esp vhost I suspect),
> especially since adding/removing translations in the iommu is a
> hypercall.

Well, we'd nee to make sure that for this particular bus we skip the
actualy iommu.

> > It would not be the same effect.  The problem with that is that you must
> > now assumes that your qemu knows that for example you might be passing
> > a dma offset if the bus otherwise requires it. 
> 
> I would assume that arch_virtio_wants_dma_ops() only returns true when
> no such offsets are involved, at least in our case that would be what
> happens.

That would work, but we're really piling hacĸs ontop of hacks here.

> >  Or in other words:
> > you potentially break the contract between qemu and the guest of always
> > passing down physical addresses.  If we explicitly change that contract
> > through using a flag that says you pass bus address everything is fine.
> 
> For us a "bus address" is behind the iommu so that's what
> VIRTIO_F_IOMMU_PLATFORM does already. We don't have the concept of a
> bus address that is different. I suppose it's an ARMism to have DMA
> offsets that are separate from iommus ? 

No, a lot of platforms support a bus address that has an offset from
the physical address. including a lot of power platforms:

arch/powerpc/kernel/pci-common.c:       set_dma_offset(&dev->dev, PCI_DRAM_OFFSET);
arch/powerpc/platforms/cell/iommu.c:            set_dma_offset(dev, cell_dma_nommu_offset);
arch/powerpc/platforms/cell/iommu.c:            set_dma_offset(dev, addr);
arch/powerpc/platforms/powernv/pci-ioda.c:      set_dma_offset(&pdev->dev, pe->tce_bypass_base);
arch/powerpc/platforms/powernv/pci-ioda.c:                      set_dma_offset(&pdev->dev, (1ULL << 32));
arch/powerpc/platforms/powernv/pci-ioda.c:              set_dma_offset(&dev->dev, pe->tce_bypass_base);
arch/powerpc/platforms/pseries/iommu.c:                         set_dma_offset(dev, dma_offset);
arch/powerpc/sysdev/dart_iommu.c:               set_dma_offset(&dev->dev, DART_U4_BYPASS_BASE);
arch/powerpc/sysdev/fsl_pci.c:          set_dma_offset(dev, pci64_dma_offset);

to make things worse some platforms (at least on arm/arm64/mips/x86) can
also require additional banking where it isn't even a single linear map
but multiples windows.
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* [PATCH v3] drm/cirrus: flip default from 24bpp to 16bpp
From: Gerd Hoffmann @ 2018-08-08 11:13 UTC (permalink / raw)
  To: dri-devel
  Cc: David Airlie, Dave Airlie, open list,
	open list:DRM DRIVER FOR QEMU'S CIRRUS DEVICE

The problem with 24bpp is that it is a rather unusual depth these days,
cirrus is pretty much the only relevant device still using that, and it
is a endless source of issues.  Wayland doesn't support it at all.  Bugs
in Xorg keep showing up.

Typically either 32bpp or 16bpp are used.  Using 32bpp would limit the
resolution to 800x600 due to hardware constrains.  So lets go with 16bpp.

Also use the default depth for the framebuffer console and
mode_info->preferred_depth.

Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
---
 drivers/gpu/drm/cirrus/cirrus_drv.c   | 4 ++--
 drivers/gpu/drm/cirrus/cirrus_fbdev.c | 3 +--
 drivers/gpu/drm/cirrus/cirrus_mode.c  | 2 +-
 3 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/cirrus/cirrus_drv.c b/drivers/gpu/drm/cirrus/cirrus_drv.c
index 69c4e352dd..b4f7be056c 100644
--- a/drivers/gpu/drm/cirrus/cirrus_drv.c
+++ b/drivers/gpu/drm/cirrus/cirrus_drv.c
@@ -16,11 +16,11 @@
 #include "cirrus_drv.h"
 
 int cirrus_modeset = -1;
-int cirrus_bpp = 24;
+int cirrus_bpp = 16;
 
 MODULE_PARM_DESC(modeset, "Disable/Enable modesetting");
 module_param_named(modeset, cirrus_modeset, int, 0400);
-MODULE_PARM_DESC(bpp, "Max bits-per-pixel (default:24)");
+MODULE_PARM_DESC(bpp, "Max bits-per-pixel (default:16)");
 module_param_named(bpp, cirrus_bpp, int, 0400);
 
 /*
diff --git a/drivers/gpu/drm/cirrus/cirrus_fbdev.c b/drivers/gpu/drm/cirrus/cirrus_fbdev.c
index 32fbfba2c6..b329a54290 100644
--- a/drivers/gpu/drm/cirrus/cirrus_fbdev.c
+++ b/drivers/gpu/drm/cirrus/cirrus_fbdev.c
@@ -271,7 +271,6 @@ int cirrus_fbdev_init(struct cirrus_device *cdev)
 {
 	struct cirrus_fbdev *gfbdev;
 	int ret;
-	int bpp_sel = 24;
 
 	/*bpp_sel = 8;*/
 	gfbdev = kzalloc(sizeof(struct cirrus_fbdev), GFP_KERNEL);
@@ -296,7 +295,7 @@ int cirrus_fbdev_init(struct cirrus_device *cdev)
 	/* disable all the possible outputs/crtcs before entering KMS mode */
 	drm_helper_disable_unused_functions(cdev->dev);
 
-	return drm_fb_helper_initial_config(&gfbdev->helper, bpp_sel);
+	return drm_fb_helper_initial_config(&gfbdev->helper, cirrus_bpp);
 }
 
 void cirrus_fbdev_fini(struct cirrus_device *cdev)
diff --git a/drivers/gpu/drm/cirrus/cirrus_mode.c b/drivers/gpu/drm/cirrus/cirrus_mode.c
index c91b9b054e..91128b1d8d 100644
--- a/drivers/gpu/drm/cirrus/cirrus_mode.c
+++ b/drivers/gpu/drm/cirrus/cirrus_mode.c
@@ -518,7 +518,7 @@ int cirrus_modeset_init(struct cirrus_device *cdev)
 	cdev->dev->mode_config.max_height = CIRRUS_MAX_FB_HEIGHT;
 
 	cdev->dev->mode_config.fb_base = cdev->mc.vram_base;
-	cdev->dev->mode_config.preferred_depth = 24;
+	cdev->dev->mode_config.preferred_depth = cirrus_bpp;
 	/* don't prefer a shadow on virt GPU */
 	cdev->dev->mode_config.prefer_shadow = 0;
 
-- 
2.9.3

^ permalink raw reply related

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Benjamin Herrenschmidt @ 2018-08-08 10:07 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: robh, srikar, Michael S. Tsirkin, mpe, Will Deacon, linux-kernel,
	linuxram, virtualization, jean-philippe.brucker, paulus,
	marc.zyngier, joe, robin.murphy, david, linuxppc-dev, elfring,
	haren, Anshuman Khandual
In-Reply-To: <20180808063158.GA2474@infradead.org>

On Tue, 2018-08-07 at 23:31 -0700, Christoph Hellwig wrote:
> 
> You don't need to set them the time you go secure.  You just need to
> set the flag from the beginning on any VM you might want to go secure.
> Or for simplicity just any VM - if the DT/ACPI tables exposed by
> qemu are good enough that will always exclude a iommu and not set a
> DMA offset, so nothing will change on the qemu side of he processing,
> and with the new direct calls for the direct dma ops performance in
> the guest won't change either.

So that's where I'm not sure things are "good enough" due to how
pseries works. (remember it's paravirtualized).

A pseries system starts with a default iommu on all devices, that uses
translation using 4k entires with a "pinhole" window (usually 2G with
qemu iirc). There's no "pass through" by default.

Qemu virtio bypasses that iommu when the VIRTIO_F_IOMMU_PLATFORM flag
is not set (default) but there's nothing in the device-tree to tell the
guest about this since it's a violation of our pseries architecture, so
we just rely on Linux virtio "knowing" that it happens. It's a bit
yucky but that's now history...

Essentially pseries "architecturally" does not have the concept of not
having an iommu in the way and qemu violates that architecture today.

(Remember it comes from pHyp, our priorietary HV, which we are somewhat
mimmicing here).

So if we always set VIRTIO_F_IOMMU_PLATFORM, it *will* force all virtio
through that iommu and performance will suffer (esp vhost I suspect),
especially since adding/removing translations in the iommu is a
hypercall.

Now, we do have HV APIs to create a second window that's "permanently
mapped" to the guest memory, thus avoiding dynamic map/unmaps, and
Linux can make use of this but I don't know if that works with qemu and
the performance impact with vhost.

So the situation isn't that great.... On the other hand, I think the
other approach works for us:

> > It's nicer if we have a way in the guest virtio driver to do something
> > along the lines of
> > 
> > 	if ((flags & VIRTIO_F_IOMMU_PLATFORM) || arch_virtio_wants_dma_ops())
> > 
> > Which would have the same effect and means the issue is entirely
> > contained in the guest.
> 
> It would not be the same effect.  The problem with that is that you must
> now assumes that your qemu knows that for example you might be passing
> a dma offset if the bus otherwise requires it. 

I would assume that arch_virtio_wants_dma_ops() only returns true when
no such offsets are involved, at least in our case that would be what
happens.

>  Or in other words:
> you potentially break the contract between qemu and the guest of always
> passing down physical addresses.  If we explicitly change that contract
> through using a flag that says you pass bus address everything is fine.

For us a "bus address" is behind the iommu so that's what
VIRTIO_F_IOMMU_PLATFORM does already. We don't have the concept of a
bus address that is different. I suppose it's an ARMism to have DMA
offsets that are separate from iommus ? 

> Note that in practice your scheme will probably just work for your
> initial prototype, but chances are it will get us in trouble later on.

Not on pseries, at least not in any way I can think of mind you... but
maybe other architectures would abuse it... We could add a WARN_ON if
that calls returns true on a bus with an offset I suppose.

Cheers,
Ben.

^ permalink raw reply

* [PATCH v2] drm/cirrus: flip default from 24bpp to 16bpp
From: Gerd Hoffmann @ 2018-08-08  9:06 UTC (permalink / raw)
  To: dri-devel
  Cc: David Airlie, Dave Airlie, open list,
	open list:DRM DRIVER FOR QEMU'S CIRRUS DEVICE

The problem with 24bpp is that it is a rather unusual depth these days,
cirrus is pretty much the only relevant device still using that, and it
is a endless source of issues.  Wayland doesn't support it at all.  Bugs
in Xorg keep showing up.

Typically either 32bpp or 16bpp are used.  Using 32bpp would limit the
resolution to 800x600 due to hardware constrains.  So lets go with 16bpp.

Also use the default depth for the framebuffer console.

Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
---
 drivers/gpu/drm/cirrus/cirrus_drv.c   | 4 ++--
 drivers/gpu/drm/cirrus/cirrus_fbdev.c | 3 +--
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/cirrus/cirrus_drv.c b/drivers/gpu/drm/cirrus/cirrus_drv.c
index 69c4e352dd..b4f7be056c 100644
--- a/drivers/gpu/drm/cirrus/cirrus_drv.c
+++ b/drivers/gpu/drm/cirrus/cirrus_drv.c
@@ -16,11 +16,11 @@
 #include "cirrus_drv.h"
 
 int cirrus_modeset = -1;
-int cirrus_bpp = 24;
+int cirrus_bpp = 16;
 
 MODULE_PARM_DESC(modeset, "Disable/Enable modesetting");
 module_param_named(modeset, cirrus_modeset, int, 0400);
-MODULE_PARM_DESC(bpp, "Max bits-per-pixel (default:24)");
+MODULE_PARM_DESC(bpp, "Max bits-per-pixel (default:16)");
 module_param_named(bpp, cirrus_bpp, int, 0400);
 
 /*
diff --git a/drivers/gpu/drm/cirrus/cirrus_fbdev.c b/drivers/gpu/drm/cirrus/cirrus_fbdev.c
index 32fbfba2c6..b329a54290 100644
--- a/drivers/gpu/drm/cirrus/cirrus_fbdev.c
+++ b/drivers/gpu/drm/cirrus/cirrus_fbdev.c
@@ -271,7 +271,6 @@ int cirrus_fbdev_init(struct cirrus_device *cdev)
 {
 	struct cirrus_fbdev *gfbdev;
 	int ret;
-	int bpp_sel = 24;
 
 	/*bpp_sel = 8;*/
 	gfbdev = kzalloc(sizeof(struct cirrus_fbdev), GFP_KERNEL);
@@ -296,7 +295,7 @@ int cirrus_fbdev_init(struct cirrus_device *cdev)
 	/* disable all the possible outputs/crtcs before entering KMS mode */
 	drm_helper_disable_unused_functions(cdev->dev);
 
-	return drm_fb_helper_initial_config(&gfbdev->helper, bpp_sel);
+	return drm_fb_helper_initial_config(&gfbdev->helper, cirrus_bpp);
 }
 
 void cirrus_fbdev_fini(struct cirrus_device *cdev)
-- 
2.9.3

^ permalink raw reply related

* Re: [PATCH] drm/cirrus: flip default to 32bpp
From: Gerd Hoffmann @ 2018-08-08  8:46 UTC (permalink / raw)
  To: Adam Jackson, dri-devel, David Airlie, Dave Airlie, open list,
	open list:DRM DRIVER FOR QEMU'S CIRRUS DEVICE
In-Reply-To: <20180709073924.GB3008@phenom.ffwll.local>

On Mon, Jul 09, 2018 at 09:39:24AM +0200, Daniel Vetter wrote:
> On Fri, Jul 06, 2018 at 02:35:07PM -0400, Adam Jackson wrote:
> > On Fri, 2018-07-06 at 11:12 +0200, Gerd Hoffmann wrote:
> > > cirrus can handle 1024x768 (and slightly higher) with 24bpp depth.
> > > cirrus can handle up to 800x600 with 32bpp.
> > 
> > 16bpp is maybe a better choice? Nobody's using cirrus because they care
> > about color fidelity and it'll use less CPU to update. There's
> > precedent here, mgag200 defaults to 16bpp on sufficiently memory-
> > impaired devices.
> 
> Yeah nouveau does the same fallback to 16bpp if there's not enough vram.
> So do a bunch of other drivers.

Makes sense.  But appearently 16bpp support bitrotted over time, when
just flipping the default the driver Oopses at init time.  I'll look
into it.

cheers,
  Gerd

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Christoph Hellwig @ 2018-08-08  6:31 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: robh, srikar, Michael S. Tsirkin, mpe, Will Deacon, linux-kernel,
	linuxram, virtualization, Christoph Hellwig,
	jean-philippe.brucker, paulus, marc.zyngier, joe, robin.murphy,
	david, linuxppc-dev, elfring, haren, Anshuman Khandual
In-Reply-To: <2103ecfe52d23cec03f185d08a87bfad9c9d82b5.camel@kernel.crashing.org>

On Wed, Aug 08, 2018 at 06:32:45AM +1000, Benjamin Herrenschmidt wrote:
> As for the flag itself, while we could set it from qemu when we get
> notified that the guest is going secure, both Michael and I think it's
> rather gross, it requires qemu to go iterate all virtio devices and
> "poke" something into them.

You don't need to set them the time you go secure.  You just need to
set the flag from the beginning on any VM you might want to go secure.
Or for simplicity just any VM - if the DT/ACPI tables exposed by
qemu are good enough that will always exclude a iommu and not set a
DMA offset, so nothing will change on the qemu side of he processing,
and with the new direct calls for the direct dma ops performance in
the guest won't change either.

> It's nicer if we have a way in the guest virtio driver to do something
> along the lines of
> 
> 	if ((flags & VIRTIO_F_IOMMU_PLATFORM) || arch_virtio_wants_dma_ops())
> 
> Which would have the same effect and means the issue is entirely
> contained in the guest.

It would not be the same effect.  The problem with that is that you must
now assumes that your qemu knows that for example you might be passing
a dma offset if the bus otherwise requires it.  Or in other words:
you potentially break the contract between qemu and the guest of always
passing down physical addresses.  If we explicitly change that contract
through using a flag that says you pass bus address everything is fine.

Note that in practice your scheme will probably just work for your
initial prototype, but chances are it will get us in trouble later on.

^ permalink raw reply

* [PATCH net] vhost: reset metadata cache when initializing new IOTLB
From: Jason Wang @ 2018-08-08  3:43 UTC (permalink / raw)
  To: mst, jasowang; +Cc: netdev, linux-kernel, kvm, virtualization

We need to reset metadata cache during new IOTLB initialization,
otherwise the stale pointers to previous IOTLB may be still accessed
which will lead a use after free.

Reported-by: syzbot+c51e6736a1bf614b3272@syzkaller.appspotmail.com
Fixes: f88949138058 ("vhost: introduce O(1) vq metadata cache")
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/vhost.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index a502f1a..ed31145 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1560,9 +1560,12 @@ int vhost_init_device_iotlb(struct vhost_dev *d, bool enabled)
 	d->iotlb = niotlb;
 
 	for (i = 0; i < d->nvqs; ++i) {
-		mutex_lock(&d->vqs[i]->mutex);
-		d->vqs[i]->iotlb = niotlb;
-		mutex_unlock(&d->vqs[i]->mutex);
+		struct vhost_virtqueue *vq = d->vqs[i];
+
+		mutex_lock(&vq->mutex);
+		vq->iotlb = niotlb;
+		__vhost_vq_meta_reset(vq);
+		mutex_unlock(&vq->mutex);
 	}
 
 	vhost_umem_clean(oiotlb);
-- 
2.7.4

^ permalink raw reply related

* Re: KASAN: use-after-free Read in iotlb_access_ok
From: Jason Wang @ 2018-08-08  2:52 UTC (permalink / raw)
  To: syzbot, kvm, linux-kernel, mst, netdev, syzkaller-bugs,
	virtualization
In-Reply-To: <000000000000eb92240572d68452@google.com>



On 2018年08月07日 19:16, syzbot wrote:
> Hello,
>
> syzbot found the following crash on:
>
> HEAD commit:    e30cb13c5a09 Merge 
> git://git.kernel.org/pub/scm/linux/kern..
> git tree:       upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=10a153e0400000
> kernel config: https://syzkaller.appspot.com/x/.config?x=2dc0cd7c2eefb46f
> dashboard link: 
> https://syzkaller.appspot.com/bug?extid=c51e6736a1bf614b3272
> compiler:       gcc (GCC) 8.0.1 20180413 (experimental)
>
> Unfortunately, I don't have any reproducer for this crash yet.
>
> IMPORTANT: if you fix the bug, please add the following tag to the 
> commit:
> Reported-by: syzbot+c51e6736a1bf614b3272@syzkaller.appspotmail.com
>
> ==================================================================
> BUG: KASAN: use-after-free in vhost_vq_meta_fetch 
> drivers/vhost/vhost.c:702 [inline]
> BUG: KASAN: use-after-free in iotlb_access_ok+0x5c9/0x600 
> drivers/vhost/vhost.c:1177
> Read of size 8 at addr ffff880197df2fc0 by task vhost-8938/8941
>
> CPU: 0 PID: 8941 Comm: vhost-8938 Not tainted 4.18.0-rc7+ #174
> Hardware name: Google Google Compute Engine/Google Compute Engine, 
> BIOS Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:77 [inline]
>  dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
>  print_address_description+0x6c/0x20b mm/kasan/report.c:256
>  kasan_report_error mm/kasan/report.c:354 [inline]
>  kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412
>  __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433
>  vhost_vq_meta_fetch drivers/vhost/vhost.c:702 [inline]
>  iotlb_access_ok+0x5c9/0x600 drivers/vhost/vhost.c:1177
>  vq_iotlb_prefetch+0x10e/0x230 drivers/vhost/vhost.c:1214
>  handle_rx+0x247/0x1f80 drivers/vhost/net.c:799
>  handle_rx_net+0x19/0x20 drivers/vhost/net.c:934
>  vhost_worker+0x283/0x490 drivers/vhost/vhost.c:360
>  kthread+0x345/0x410 kernel/kthread.c:246
>  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412
>
> Allocated by task 8938:
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:448
>  set_track mm/kasan/kasan.c:460 [inline]
>  kasan_kmalloc+0xc4/0xe0 mm/kasan/kasan.c:553
>  kmem_cache_alloc_trace+0x152/0x780 mm/slab.c:3620
>  kmalloc include/linux/slab.h:513 [inline]
>  vhost_new_umem_range+0xcb/0x7c0 drivers/vhost/vhost.c:911
>  vhost_process_iotlb_msg drivers/vhost/vhost.c:1000 [inline]
>  vhost_chr_write_iter+0xe53/0x1a00 drivers/vhost/vhost.c:1043
>  vhost_net_chr_write_iter+0x59/0x70 drivers/vhost/net.c:1399
>  call_write_iter include/linux/fs.h:1793 [inline]
>  new_sync_write fs/read_write.c:474 [inline]
>  __vfs_write+0x6c6/0x9f0 fs/read_write.c:487
>  vfs_write+0x1f8/0x560 fs/read_write.c:549
>  ksys_write+0x101/0x260 fs/read_write.c:598
>  __do_sys_write fs/read_write.c:610 [inline]
>  __se_sys_write fs/read_write.c:607 [inline]
>  __x64_sys_write+0x73/0xb0 fs/read_write.c:607
>  do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
> Freed by task 8950:
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:448
>  set_track mm/kasan/kasan.c:460 [inline]
>  __kasan_slab_free+0x11a/0x170 mm/kasan/kasan.c:521
>  kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
>  __cache_free mm/slab.c:3498 [inline]
>  kfree+0xd9/0x260 mm/slab.c:3813
>  vhost_umem_free+0x944/0x14d0 drivers/vhost/vhost.c:576
>  vhost_umem_clean+0x83/0xf0 drivers/vhost/vhost.c:588
>  vhost_init_device_iotlb+0x1d7/0x290 drivers/vhost/vhost.c:1568
>  vhost_net_set_features drivers/vhost/net.c:1292 [inline]
>  vhost_net_ioctl+0xff3/0x1a80 drivers/vhost/net.c:1357
>  vfs_ioctl fs/ioctl.c:46 [inline]
>  file_ioctl fs/ioctl.c:500 [inline]
>  do_vfs_ioctl+0x1de/0x1720 fs/ioctl.c:684
>  ksys_ioctl+0xa9/0xd0 fs/ioctl.c:701
>  __do_sys_ioctl fs/ioctl.c:708 [inline]
>  __se_sys_ioctl fs/ioctl.c:706 [inline]
>  __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:706
>  do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
> The buggy address belongs to the object at ffff880197df2f80
>  which belongs to the cache kmalloc-96 of size 96
> The buggy address is located 64 bytes inside of
>  96-byte region [ffff880197df2f80, ffff880197df2fe0)
> The buggy address belongs to the page:
> page:ffffea00065f7c80 count:1 mapcount:0 mapping:ffff8801dac004c0 
> index:0x0
> flags: 0x2fffc0000000100(slab)
> raw: 02fffc0000000100 ffffea0007530148 ffffea000663b888 ffff8801dac004c0
> raw: 0000000000000000 ffff880197df2000 0000000100000020 0000000000000000
> page dumped because: kasan: bad access detected
>
> Memory state around the buggy address:
>  ffff880197df2e80: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
>  ffff880197df2f00: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
>> ffff880197df2f80: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
>                                            ^
>  ffff880197df3000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>  ffff880197df3080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> ==================================================================
>
>
> ---
> This bug is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkaller@googlegroups.com.
>
> syzbot will keep track of this bug report. See:
> https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with 
> syzbot.

Looks like the metadata IOTLB needs to be reset during set_features.

Will post a fix.

Thanks

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Benjamin Herrenschmidt @ 2018-08-07 20:32 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: robh, srikar, Michael S. Tsirkin, mpe, Will Deacon, linux-kernel,
	linuxram, virtualization, jean-philippe.brucker, paulus,
	marc.zyngier, joe, robin.murphy, david, linuxppc-dev, elfring,
	haren, Anshuman Khandual
In-Reply-To: <20180807135505.GA29034@infradead.org>

On Tue, 2018-08-07 at 06:55 -0700, Christoph Hellwig wrote:
> On Tue, Aug 07, 2018 at 04:42:44PM +1000, Benjamin Herrenschmidt wrote:
> > Note that I can make it so that the same DMA ops (basically standard
> > swiotlb ops without arch hacks) work for both "direct virtio" and
> > "normal PCI" devices.
> > 
> > The trick is simply in the arch to setup the iommu to map the swiotlb
> > bounce buffer pool 1:1 in the iommu, so the iommu essentially can be
> > ignored without affecting the physical addresses.
> > 
> > If I do that, *all* I need is a way, from the guest itself (again, the
> > other side dosn't know anything about it), to force virtio to use the
> > DMA ops as if there was an iommu, that is, use whatever dma ops were
> > setup by the platform for the pci device.
> 
> In that case just setting VIRTIO_F_IOMMU_PLATFORM in the flags should
> do the work (even if that isn't strictly what the current definition
> of the flag actually means).  On the qemu side you'll need to make
> sure you have a way to set VIRTIO_F_IOMMU_PLATFORM without emulating
> an iommu, but with code to take dma offsets into account if your
> plaform has any (various power plaforms seem to have them, not sure
> if it affects your config).

Something like that yes. I prefer a slightly different way, see below,
any but in both cases, it should alleviate your concerns since it means
there would be no particular mucking around with DMA ops at all, virtio
would just use whatever "normal" ops we establish for all PCI devices
on that platform, which will be standard ones.

(swiotlb ones today and the new "integrates" ones you're cooking
tomorrow).

As for the flag itself, while we could set it from qemu when we get
notified that the guest is going secure, both Michael and I think it's
rather gross, it requires qemu to go iterate all virtio devices and
"poke" something into them.

It also means qemu will need some other internal nasty flag that says
"set that bit but don't do iommu".

It's nicer if we have a way in the guest virtio driver to do something
along the lines of

	if ((flags & VIRTIO_F_IOMMU_PLATFORM) || arch_virtio_wants_dma_ops())

Which would have the same effect and means the issue is entirely
contained in the guest.

Cheers,
Ben.

^ permalink raw reply

* Re: [PATCH net-next] net: allow to call netif_reset_xps_queues() under cpu_read_lock
From: David Miller @ 2018-08-07 20:15 UTC (permalink / raw)
  To: avagin; +Cc: amritha.nambiar, mst, netdev, virtualization, avagin
In-Reply-To: <20180807041454.18315-1-avagin@openvz.org>

From: Andrei Vagin <avagin@openvz.org>
Date: Mon,  6 Aug 2018 21:14:54 -0700

> This patch adds an ability to call __netif_set_xps_queue under
> cpu_read_lock().

Please don't add conditional locking using a boolean argument.

Simply wrap calls to __netif_set_xps_queue() with cpu read
lock held by the caller or similar.

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Christoph Hellwig @ 2018-08-07 13:55 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: robh, srikar, Michael S. Tsirkin, mpe, Will Deacon, linux-kernel,
	linuxram, virtualization, Christoph Hellwig,
	jean-philippe.brucker, paulus, marc.zyngier, joe, robin.murphy,
	david, linuxppc-dev, elfring, haren, Anshuman Khandual
In-Reply-To: <aa59c7f8556bd4b332394a1dcf2d4a8faf3dc4a2.camel@kernel.crashing.org>

On Tue, Aug 07, 2018 at 04:42:44PM +1000, Benjamin Herrenschmidt wrote:
> Note that I can make it so that the same DMA ops (basically standard
> swiotlb ops without arch hacks) work for both "direct virtio" and
> "normal PCI" devices.
> 
> The trick is simply in the arch to setup the iommu to map the swiotlb
> bounce buffer pool 1:1 in the iommu, so the iommu essentially can be
> ignored without affecting the physical addresses.
> 
> If I do that, *all* I need is a way, from the guest itself (again, the
> other side dosn't know anything about it), to force virtio to use the
> DMA ops as if there was an iommu, that is, use whatever dma ops were
> setup by the platform for the pci device.

In that case just setting VIRTIO_F_IOMMU_PLATFORM in the flags should
do the work (even if that isn't strictly what the current definition
of the flag actually means).  On the qemu side you'll need to make
sure you have a way to set VIRTIO_F_IOMMU_PLATFORM without emulating
an iommu, but with code to take dma offsets into account if your
plaform has any (various power plaforms seem to have them, not sure
if it affects your config).

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Benjamin Herrenschmidt @ 2018-08-07  6:44 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: robh, srikar, Michael S. Tsirkin, mpe, Will Deacon, linux-kernel,
	linuxram, virtualization, jean-philippe.brucker, paulus,
	marc.zyngier, joe, robin.murphy, david, linuxppc-dev, elfring,
	haren, Anshuman Khandual
In-Reply-To: <20180807062731.GA23159@infradead.org>

On Mon, 2018-08-06 at 23:27 -0700, Christoph Hellwig wrote:
> On Tue, Aug 07, 2018 at 08:13:56AM +1000, Benjamin Herrenschmidt wrote:
> > It would be indeed ideal if all we had to do was setup some kind of
> > bus_dma_mask on all PCI devices and have virtio automagically insert
> > swiotlb when necessary.
> 
> For 4.20 I plan to remove the swiotlb ops and instead do the bounce
> buffering in the common code, including a direct call to the direct
> ops to avoid retpoline overhead.  For that you still need a flag
> in virtio that instead of blindly working physical addresses it needs
> to be treated like a real device in terms of DMA.

But you will still call the swiotlb infrastructure, right ? IE, I sitll
need to control where/how the swiotlb "pool" is allocated.
> 
> And for powerpc to make use of that I need to get the dma series I
> posted last week reviewed and included, otherwise powerpc will have
> to be excepted (like arm, where rmk didn't like the way the code
> was factored, everything else has already been taken care of).
> 
> https://lists.linuxfoundation.org/pipermail/iommu/2018-July/028989.html

Yes, I saw your series. I'm just back from a week of travel, I plan to
start reviewing it this week if Michael doesn't beat me to it.

Cheers,
Ben.

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Benjamin Herrenschmidt @ 2018-08-07  6:42 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: robh, srikar, Michael S. Tsirkin, mpe, Will Deacon, linux-kernel,
	linuxram, virtualization, jean-philippe.brucker, paulus,
	marc.zyngier, joe, robin.murphy, david, linuxppc-dev, elfring,
	haren, Anshuman Khandual
In-Reply-To: <20180807062117.GD32709@infradead.org>

On Mon, 2018-08-06 at 23:21 -0700, Christoph Hellwig wrote:
> On Tue, Aug 07, 2018 at 05:52:12AM +1000, Benjamin Herrenschmidt wrote:
> > > It is your job to write a coherent interface specification that does
> > > not depend on the used components.  The hypervisor might be PAPR,
> > > Linux + qemu, VMware, Hyperv or something so secret that you'd have
> > > to shoot me if you had to tell me.  The guest might be Linux, FreeBSD,
> > > AIX, OS400 or a Hipster project of the day in Rust.  As long as we
> > > properly specify the interface it simplify does not matter.
> > 
> > That's the point Christoph. The interface is today's interface. It does
> > NOT change. That information is not part of the interface.
> > 
> > It's the VM itself that is stashing away its memory in a secret place,
> > and thus needs to do bounce buffering. There is no change to the virtio
> > interface per-se.
> 
> Any guest that doesn't know about your magic limited adressing is simply
> not going to work, so we need to communicate that fact.

The guest does. It's the guest itself that initiates it. That's my
point, it's not a factor of the hypervisor, which is unchanged in that
area.

It's the guest itself, that makes the decision early on, to stash it's
memory away in a secure place, and thus needs to establish some kind of
bouce buffering via a few left over "insecure" pages.

It's all done by the guest: initiated by the guest and controlled by
the guest.

That's why I don't see why this specifically needs to involve the
hypervisor side, and thus a VIRTIO feature bit.

Note that I can make it so that the same DMA ops (basically standard
swiotlb ops without arch hacks) work for both "direct virtio" and
"normal PCI" devices.

The trick is simply in the arch to setup the iommu to map the swiotlb
bounce buffer pool 1:1 in the iommu, so the iommu essentially can be
ignored without affecting the physical addresses.

If I do that, *all* I need is a way, from the guest itself (again, the
other side dosn't know anything about it), to force virtio to use the
DMA ops as if there was an iommu, that is, use whatever dma ops were
setup by the platform for the pci device.

Cheers,
Ben.

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Christoph Hellwig @ 2018-08-07  6:32 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: robh, srikar, Benjamin Herrenschmidt, Will Deacon, linux-kernel,
	linuxram, virtualization, Christoph Hellwig,
	jean-philippe.brucker, paulus, marc.zyngier, mpe, joe,
	robin.murphy, david, linuxppc-dev, elfring, haren,
	Anshuman Khandual
In-Reply-To: <20180807024503-mutt-send-email-mst@kernel.org>

On Tue, Aug 07, 2018 at 02:45:25AM +0300, Michael S. Tsirkin wrote:
> > > I think that's where Christoph might have specific ideas about it.
> > 
> > OK well, assuming Christoph can solve the direct case in a way that
> > also work for the virtio !iommu case, we still want some bit of logic
> > somewhere that will "switch" to swiotlb based ops if the DMA mask is
> > limited.
> > 
> > You mentioned an RFC for that ? Do you happen to have a link ?
> 
> No but Christoph did I think.

Do you mean the direct map retpoline mitigation?  It is here:

https://www.spinics.net/lists/netdev/msg495413.html

https://www.spinics.net/lists/netdev/msg495785.html

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Christoph Hellwig @ 2018-08-07  6:27 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: robh, srikar, Michael S. Tsirkin, mpe, Will Deacon, linux-kernel,
	linuxram, virtualization, Christoph Hellwig,
	jean-philippe.brucker, paulus, marc.zyngier, joe, robin.murphy,
	david, linuxppc-dev, elfring, haren, Anshuman Khandual
In-Reply-To: <93518075238a07e9f011774d89bdc652c083f1ba.camel@kernel.crashing.org>

On Tue, Aug 07, 2018 at 08:13:56AM +1000, Benjamin Herrenschmidt wrote:
> It would be indeed ideal if all we had to do was setup some kind of
> bus_dma_mask on all PCI devices and have virtio automagically insert
> swiotlb when necessary.

For 4.20 I plan to remove the swiotlb ops and instead do the bounce
buffering in the common code, including a direct call to the direct
ops to avoid retpoline overhead.  For that you still need a flag
in virtio that instead of blindly working physical addresses it needs
to be treated like a real device in terms of DMA.

And for powerpc to make use of that I need to get the dma series I
posted last week reviewed and included, otherwise powerpc will have
to be excepted (like arm, where rmk didn't like the way the code
was factored, everything else has already been taken care of).

https://lists.linuxfoundation.org/pipermail/iommu/2018-July/028989.html

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Christoph Hellwig @ 2018-08-07  6:21 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: robh, srikar, Michael S. Tsirkin, mpe, Will Deacon, linux-kernel,
	linuxram, virtualization, Christoph Hellwig,
	jean-philippe.brucker, paulus, marc.zyngier, joe, robin.murphy,
	david, linuxppc-dev, elfring, haren, Anshuman Khandual
In-Reply-To: <6c707d6d33ac25a42265c2e9b521c2416d72c739.camel@kernel.crashing.org>

On Tue, Aug 07, 2018 at 05:52:12AM +1000, Benjamin Herrenschmidt wrote:
> > It is your job to write a coherent interface specification that does
> > not depend on the used components.  The hypervisor might be PAPR,
> > Linux + qemu, VMware, Hyperv or something so secret that you'd have
> > to shoot me if you had to tell me.  The guest might be Linux, FreeBSD,
> > AIX, OS400 or a Hipster project of the day in Rust.  As long as we
> > properly specify the interface it simplify does not matter.
> 
> That's the point Christoph. The interface is today's interface. It does
> NOT change. That information is not part of the interface.
> 
> It's the VM itself that is stashing away its memory in a secret place,
> and thus needs to do bounce buffering. There is no change to the virtio
> interface per-se.

Any guest that doesn't know about your magic limited adressing is simply
not going to work, so we need to communicate that fact.

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Christoph Hellwig @ 2018-08-07  6:18 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: robh, srikar, Benjamin Herrenschmidt, Will Deacon, linux-kernel,
	linuxram, virtualization, Christoph Hellwig,
	jean-philippe.brucker, paulus, marc.zyngier, mpe, joe,
	robin.murphy, david, linuxppc-dev, elfring, haren,
	Anshuman Khandual
In-Reply-To: <20180807002857-mutt-send-email-mst@kernel.org>

On Tue, Aug 07, 2018 at 12:46:34AM +0300, Michael S. Tsirkin wrote:
> Well we have the RFC for that - the switch to using DMA ops unconditionally isn't
> problematic itself IMHO, for now that RFC is blocked
> by its perfromance overhead for now but Christoph says
> he's trying to remove that for direct mappings,
> so we should hopefully be able to get there in X weeks.

The direct calls to dma_direct_ops aren't going to help you with legacy
virtio, given that virtio is specified to deal with physical addresses,
while dma-direct is not in many cases.

It would however help with the case where qemu always sets the platform
dma flag, as we'd avoid the indirect calls for that.

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Christoph Hellwig @ 2018-08-07  6:16 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: robh, srikar, Michael S. Tsirkin, mpe, Will Deacon, linux-kernel,
	linuxram, virtualization, Christoph Hellwig,
	jean-philippe.brucker, paulus, marc.zyngier, joe, robin.murphy,
	david, linuxppc-dev, elfring, haren, Anshuman Khandual
In-Reply-To: <0967fc30001323e6e38ed12c8dba8ee3d1aa13f5.camel@kernel.crashing.org>

On Tue, Aug 07, 2018 at 07:26:35AM +1000, Benjamin Herrenschmidt wrote:
> > I think Christoph merely objects to the specific implementation.  If
> > instead you do something like tweak dev->bus_dma_mask for the virtio
> > device I think he won't object.
> 
> Well, we don't have "bus_dma_mask" yet ..or you mean dma_mask ?

It will be new in 4.19:

http://git.infradead.org/users/hch/dma-mapping.git/commitdiff/f07d141fe9430cdf9f8a65a87c41

> So, something like that would be a possibility, but the problem is that
> the current virtio (guest side) implementation doesn't honor this when
> not using dma ops and will not use dma ops if not using iommu, so back
> to square one.
> 
> Christoph seems to be wanting to use a flag in the interface to make
> the guest use dma_ops which is what I don't understand.

As-is virtio devices are very clearly and explcitly defined to use
physical addresses in the spec.  dma ops will often do platform
based translations (iommu, offsets), so we can't just use the plaform
default dma ops and will need to opt into them.

> What would be needed then would be something along the lines of virtio
> noticing that dma_mask isn't big enough to cover all of memory (which
> isn't something generic code can easily do here for various reasons I
> can elaborate if you want, but that specific test more/less has to be
> arch specific), and in that case, force itself to use DMA ops routed to
> swiotlb.
> 
> I'd rather have arch code do the bulk of that work, don't you think ?

There is nothing architecture specific about that.  I've been working
hard to remove all the bullshit architectures have done in their DMA
ops and consolidating them into common code based on rules.  The last
thing I want is another vector for weird underspecified arch
interfaction with DMA ops, which is exactly what your patch below
does.

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Christoph Hellwig @ 2018-08-07  6:12 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: robh, srikar, Benjamin Herrenschmidt, Will Deacon, linux-kernel,
	linuxram, virtualization, Christoph Hellwig,
	jean-philippe.brucker, paulus, marc.zyngier, mpe, joe,
	robin.murphy, david, linuxppc-dev, elfring, haren,
	Anshuman Khandual
In-Reply-To: <20180806233024-mutt-send-email-mst@kernel.org>

On Mon, Aug 06, 2018 at 11:35:39PM +0300, Michael S. Tsirkin wrote:
> > As I said replying to Christoph, we are "leaking" into the interface
> > something here that is really what's the VM is doing to itself, which
> > is to stash its memory away in an inaccessible place.
> > 
> > Cheers,
> > Ben.
> 
> I think Christoph merely objects to the specific implementation.  If
> instead you do something like tweak dev->bus_dma_mask for the virtio
> device I think he won't object.

As long as we also document how dev->bus_dma_mask is tweaked for this
particular virtual bus, yes.

^ permalink raw reply

* [PATCH net-next] net: allow to call netif_reset_xps_queues() under cpu_read_lock
From: Andrei Vagin @ 2018-08-07  4:14 UTC (permalink / raw)
  To: David S. Miller
  Cc: Nambiar, Amritha, Michael S. Tsirkin, netdev, virtualization,
	Andrei Vagin

From: Andrei Vagin <avagin@gmail.com>

The definition of static_key_slow_inc() has cpus_read_lock in place. In the
virtio_net driver, XPS queues are initialized after setting the queue:cpu
affinity in virtnet_set_affinity() which is already protected within
cpus_read_lock. Lockdep prints a warning when we are trying to acquire
cpus_read_lock when it is already held.

This patch adds an ability to call __netif_set_xps_queue under
cpu_read_lock().

============================================
WARNING: possible recursive locking detected
4.18.0-rc3-next-20180703+ #1 Not tainted
--------------------------------------------
swapper/0/1 is trying to acquire lock:
00000000cf973d46 (cpu_hotplug_lock.rw_sem){++++}, at: static_key_slow_inc+0xe/0x20

but task is already holding lock:
00000000cf973d46 (cpu_hotplug_lock.rw_sem){++++}, at: init_vqs+0x513/0x5a0

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(cpu_hotplug_lock.rw_sem);
  lock(cpu_hotplug_lock.rw_sem);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

3 locks held by swapper/0/1:
 #0: 00000000244bc7da (&dev->mutex){....}, at: __driver_attach+0x5a/0x110
 #1: 00000000cf973d46 (cpu_hotplug_lock.rw_sem){++++}, at: init_vqs+0x513/0x5a0
 #2: 000000005cd8463f (xps_map_mutex){+.+.}, at: __netif_set_xps_queue+0x8d/0xc60

Cc: "Nambiar, Amritha" <amritha.nambiar@intel.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Fixes: 8af2c06ff4b1 ("net-sysfs: Add interface for Rx queue(s) map per Tx queue")

Signed-off-by: Andrei Vagin <avagin@gmail.com>
---
 drivers/net/virtio_net.c  |  4 +++-
 include/linux/netdevice.h |  2 +-
 net/core/dev.c            | 23 +++++++++++++++++------
 net/core/net-sysfs.c      |  2 +-
 4 files changed, 22 insertions(+), 9 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 62311dde6e71..a4abcfcf26b2 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1903,9 +1903,11 @@ static void virtnet_set_affinity(struct virtnet_info *vi)
 
 	i = 0;
 	for_each_online_cpu(cpu) {
+		const unsigned long *mask = cpumask_bits(cpumask_of(cpu));
+
 		virtqueue_set_affinity(vi->rq[i].vq, cpu);
 		virtqueue_set_affinity(vi->sq[i].vq, cpu);
-		netif_set_xps_queue(vi->dev, cpumask_of(cpu), i);
+		__netif_set_xps_queue(vi->dev, mask, i, false, true);
 		i++;
 	}
 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 282e2e95ad5b..124f9a00ce71 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3320,7 +3320,7 @@ static inline void netif_wake_subqueue(struct net_device *dev, u16 queue_index)
 int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
 			u16 index);
 int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
-			  u16 index, bool is_rxqs_map);
+			  u16 index, bool is_rxqs_map, bool cpuslocked);
 
 /**
  *	netif_attr_test_mask - Test a CPU or Rx queue set in a mask
diff --git a/net/core/dev.c b/net/core/dev.c
index f68122f0ab02..d6a0f64ccdd9 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2176,6 +2176,7 @@ static void netif_reset_xps_queues(struct net_device *dev, u16 offset,
 	if (!static_key_false(&xps_needed))
 		return;
 
+	cpus_read_lock();
 	mutex_lock(&xps_map_mutex);
 
 	if (static_key_false(&xps_rxqs_needed)) {
@@ -2199,10 +2200,11 @@ static void netif_reset_xps_queues(struct net_device *dev, u16 offset,
 
 out_no_maps:
 	if (static_key_enabled(&xps_rxqs_needed))
-		static_key_slow_dec(&xps_rxqs_needed);
+		static_key_slow_dec_cpuslocked(&xps_rxqs_needed);
 
-	static_key_slow_dec(&xps_needed);
+	static_key_slow_dec_cpuslocked(&xps_needed);
 	mutex_unlock(&xps_map_mutex);
+	cpus_read_unlock();
 }
 
 static void netif_reset_xps_queues_gt(struct net_device *dev, u16 index)
@@ -2251,7 +2253,7 @@ static struct xps_map *expand_xps_map(struct xps_map *map, int attr_index,
 }
 
 int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
-			  u16 index, bool is_rxqs_map)
+			  u16 index, bool is_rxqs_map, bool cpuslocked)
 {
 	const unsigned long *online_mask = NULL, *possible_mask = NULL;
 	struct xps_dev_maps *dev_maps, *new_dev_maps = NULL;
@@ -2275,6 +2277,9 @@ int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
 			return -EINVAL;
 	}
 
+	if (!cpuslocked)
+		cpus_read_lock();
+
 	mutex_lock(&xps_map_mutex);
 	if (is_rxqs_map) {
 		maps_sz = XPS_RXQ_DEV_MAPS_SIZE(num_tc, dev->num_rx_queues);
@@ -2317,9 +2322,9 @@ int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
 	if (!new_dev_maps)
 		goto out_no_new_maps;
 
-	static_key_slow_inc(&xps_needed);
+	static_key_slow_inc_cpuslocked(&xps_needed);
 	if (is_rxqs_map)
-		static_key_slow_inc(&xps_rxqs_needed);
+		static_key_slow_inc_cpuslocked(&xps_rxqs_needed);
 
 	for (j = -1; j = netif_attrmask_next(j, possible_mask, nr_ids),
 	     j < nr_ids;) {
@@ -2427,6 +2432,8 @@ int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
 
 out_no_maps:
 	mutex_unlock(&xps_map_mutex);
+	if (!cpuslocked)
+		cpus_read_unlock();
 
 	return 0;
 error:
@@ -2444,15 +2451,19 @@ int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
 	}
 
 	mutex_unlock(&xps_map_mutex);
+	if (!cpuslocked)
+		cpus_read_unlock();
 
 	kfree(new_dev_maps);
 	return -ENOMEM;
 }
+EXPORT_SYMBOL_GPL(__netif_set_xps_queue);
 
 int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
 			u16 index)
 {
-	return __netif_set_xps_queue(dev, cpumask_bits(mask), index, false);
+	return __netif_set_xps_queue(dev, cpumask_bits(mask),
+						index, false, false);
 }
 EXPORT_SYMBOL(netif_set_xps_queue);
 
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 0a95bcf64cdc..06a141445d80 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1400,7 +1400,7 @@ static ssize_t xps_rxqs_store(struct netdev_queue *queue, const char *buf,
 		return err;
 	}
 
-	err = __netif_set_xps_queue(dev, mask, index, true);
+	err = __netif_set_xps_queue(dev, mask, index, true, false);
 	kfree(mask);
 	return err ? : len;
 }
-- 
2.17.1

^ permalink raw reply related

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Benjamin Herrenschmidt @ 2018-08-07  0:18 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: robh, srikar, mpe, Will Deacon, linux-kernel, linuxram,
	virtualization, Christoph Hellwig, jean-philippe.brucker, paulus,
	marc.zyngier, joe, robin.murphy, david, linuxppc-dev, elfring,
	haren, Anshuman Khandual
In-Reply-To: <20180807024503-mutt-send-email-mst@kernel.org>

On Tue, 2018-08-07 at 02:45 +0300, Michael S. Tsirkin wrote:
> > OK well, assuming Christoph can solve the direct case in a way that
> > also work for the virtio !iommu case, we still want some bit of logic
> > somewhere that will "switch" to swiotlb based ops if the DMA mask is
> > limited.
> > 
> > You mentioned an RFC for that ? Do you happen to have a link ?
> 
> No but Christoph did I think.

Ok I missed that, sorry, I'll dig it out. Thanks.

Cheers,
Ben.

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Michael S. Tsirkin @ 2018-08-06 23:45 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: robh, srikar, mpe, Will Deacon, linux-kernel, linuxram,
	virtualization, Christoph Hellwig, jean-philippe.brucker, paulus,
	marc.zyngier, joe, robin.murphy, david, linuxppc-dev, elfring,
	haren, Anshuman Khandual
In-Reply-To: <93518075238a07e9f011774d89bdc652c083f1ba.camel@kernel.crashing.org>

On Tue, Aug 07, 2018 at 08:13:56AM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2018-08-07 at 00:46 +0300, Michael S. Tsirkin wrote:
> > On Tue, Aug 07, 2018 at 07:26:35AM +1000, Benjamin Herrenschmidt wrote:
> > > On Mon, 2018-08-06 at 23:35 +0300, Michael S. Tsirkin wrote:
> > > > > As I said replying to Christoph, we are "leaking" into the interface
> > > > > something here that is really what's the VM is doing to itself, which
> > > > > is to stash its memory away in an inaccessible place.
> > > > > 
> > > > > Cheers,
> > > > > Ben.
> > > > 
> > > > I think Christoph merely objects to the specific implementation.  If
> > > > instead you do something like tweak dev->bus_dma_mask for the virtio
> > > > device I think he won't object.
> > > 
> > > Well, we don't have "bus_dma_mask" yet ..or you mean dma_mask ?
> > > 
> > > So, something like that would be a possibility, but the problem is that
> > > the current virtio (guest side) implementation doesn't honor this when
> > > not using dma ops and will not use dma ops if not using iommu, so back
> > > to square one.
> > 
> > Well we have the RFC for that - the switch to using DMA ops unconditionally isn't
> > problematic itself IMHO, for now that RFC is blocked
> > by its perfromance overhead for now but Christoph says
> > he's trying to remove that for direct mappings,
> > so we should hopefully be able to get there in X weeks.
> 
> That would be good yes.
> 
>  ../..
> 
> > > --- a/drivers/virtio/virtio_ring.c
> > > +++ b/drivers/virtio/virtio_ring.c
> > > @@ -155,7 +155,7 @@ static bool vring_use_dma_api(struct virtio_device
> > > *vdev)
> > >          * the DMA API if we're a Xen guest, which at least allows
> > >          * all of the sensible Xen configurations to work correctly.
> > >          */
> > > -       if (xen_domain())
> > > +       if (xen_domain() || arch_virtio_direct_dma_ops(&vdev->dev))
> > >                 return true;
> > >  
> > >         return false;
> > 
> > Right but can't we fix the retpoline overhead such that
> > vring_use_dma_api will not be called on data path any longer, making
> > this a setup time check?
> 
> Yes it needs to be a setup time check regardless actually !
> 
> The above is broken, sorry I was a bit quick here (too early in the
> morning... ugh). We don't want the arch to go override the dma ops
> every time that is callled.
> 
> But yes, if we can fix the overhead, it becomes just a matter of
> setting up the "right" ops automatically.
> 
> > > (Passing the dev allows the arch to know this is a virtio device in
> > > "direct" mode or whatever we want to call the !iommu case, and
> > > construct appropriate DMA ops for it, which aren't the same as the DMA
> > > ops of any other PCI device who *do* use the iommu).
> > 
> > I think that's where Christoph might have specific ideas about it.
> 
> OK well, assuming Christoph can solve the direct case in a way that
> also work for the virtio !iommu case, we still want some bit of logic
> somewhere that will "switch" to swiotlb based ops if the DMA mask is
> limited.
> 
> You mentioned an RFC for that ? Do you happen to have a link ?

No but Christoph did I think.

> It would be indeed ideal if all we had to do was setup some kind of
> bus_dma_mask on all PCI devices and have virtio automagically insert
> swiotlb when necessary.
> 
> Cheers,
> Ben.
> 

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Benjamin Herrenschmidt @ 2018-08-06 23:18 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: robh, srikar, mpe, Will Deacon, linux-kernel, linuxram,
	virtualization, Christoph Hellwig, jean-philippe.brucker, paulus,
	marc.zyngier, joe, robin.murphy, david, linuxppc-dev, elfring,
	haren, Anshuman Khandual
In-Reply-To: <20180806233024-mutt-send-email-mst@kernel.org>

On Mon, 2018-08-06 at 23:35 +0300, Michael S. Tsirkin wrote:
> On Tue, Aug 07, 2018 at 05:56:59AM +1000, Benjamin Herrenschmidt wrote:
> > On Mon, 2018-08-06 at 16:46 +0300, Michael S. Tsirkin wrote:
> > > 
> > > > Right, we'll need some quirk to disable balloons  in the guest I
> > > > suppose.
> > > > 
> > > > Passing something from libvirt is cumbersome because the end user may
> > > > not even need to know about secure VMs. There are use cases where the
> > > > security is a contract down to some special application running inside
> > > > the secure VM, the sysadmin knows nothing about.
> > > > 
> > > > Also there's repercussions all the way to admin tools, web UIs etc...
> > > > so it's fairly wide ranging.
> > > > 
> > > > So as long as we only need to quirk a couple of devices, it's much
> > > > better contained that way.
> > > 
> > > So just the balloon thing already means that yes management and all the
> > > way to the user tools must know this is going on. Otherwise
> > > user will try to inflate the balloon and wonder why this does not work.
> > 
> > There is *dozens* of management systems out there, not even all open
> > source, we won't ever be able to see the end of the tunnel if we need
> > to teach every single of them, including end users, about platform
> > specific new VM flags like that.
> > 
> > .../...
> 
> In the end I suspect you will find you have to.

Maybe... we'll tackle this if/when we have to.

For balloon I suspect it's not such a big deal because once secure, all
the guest memory goes into the secure memory which isn't visible or
accounted by the hypervisor, so there's nothing to steal but the guest
is also using no HV memory (other than the few "non-secure" pages used
for swiotlb and a couple of other kernel things).

Future versions of our secure architecture might allow to turn
arbitrary pages of memory secure/non-secure rather than relying on a
separate physical pool, in which case, the balloon will be able to work
normally.

Cheers,
Ben.

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Benjamin Herrenschmidt @ 2018-08-06 23:16 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: robh, srikar, mpe, Will Deacon, linux-kernel, linuxram,
	virtualization, Christoph Hellwig, jean-philippe.brucker, paulus,
	marc.zyngier, joe, robin.murphy, david, linuxppc-dev, elfring,
	haren, Anshuman Khandual
In-Reply-To: <93518075238a07e9f011774d89bdc652c083f1ba.camel@kernel.crashing.org>

On Tue, 2018-08-07 at 08:13 +1000, Benjamin Herrenschmidt wrote:
> 
> OK well, assuming Christoph can solve the direct case in a way that
> also work for the virtio !iommu case, we still want some bit of logic
> somewhere that will "switch" to swiotlb based ops if the DMA mask is
> limited.
> 
> You mentioned an RFC for that ? Do you happen to have a link ?
> 
> It would be indeed ideal if all we had to do was setup some kind of
> bus_dma_mask on all PCI devices and have virtio automagically insert
> swiotlb when necessary.

Actually... I can think of a simpler option (Anshuman, didn't you
prototype this earlier ?):

Since that limitaiton of requiring bounce buffering via swiotlb is true
of any device in a secure VM, whether it goes through the iommu or not,
the iommu remapping is essentially pointless.

Thus, we could ensure that the iommu maps 1:1 the swiotlb bounce buffer
(either that or we configure it as "disabled" which is equivalent in
this case).

That way, we can now use the basic swiotlb ops everywhere, the same
dma_ops (swiotlb) will work whether a device uses the iommu or not.

Which boils down now to only making virtio use dma ops, there is no
need to override the dma_ops.

Which means all we have to do is either make xen_domain() return true
(yuck) or replace that one test with arch_virtio_force_dma_api() which
resolves to xen_domain() on x86 and can do something else for us.

As to using a virtio feature flag for that, which is what Christoph
proposes, I'm not too fan of it because this means effectively exposing
this to the peer, ie the interface. I don't think it belong there. The
interface, from the hypervisor perspective, whether it's qemu, vmware,
hyperz etc... have no business knowing how the guest manages its dma
operations, and may not even be aware of the access limitations (in our
case they are somewhat guest self-imposed).

Now, if this flag really is what we have to do, then we'd probably need
a qemu hack which will go set that flag on all virtio devices when it
detects that the VM is going secure.

But I don't think that's where that information "need to use the dma
API even for direct mode" belongs.

Cheers,
Ben.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox