Linux virtualization list
 help / color / mirror / Atom feed
* Re: [PATCH RFC] fixup! virtio: convert to use DMA api
From: Andy Lutomirski @ 2016-04-19 18:01 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Wei Liu, Andy Lutomirski, qemu-block, Christian Borntraeger,
	qemu-devel@nongnu.org Developers, peterx,
	linux-kernel@vger.kernel.org, Amit Shah, Stefan Hajnoczi,
	kvm list, Paolo Bonzini, Linux Virtualization, David Woodhouse
In-Reply-To: <20160419204907-mutt-send-email-mst@redhat.com>

On Tue, Apr 19, 2016 at 10:49 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Tue, Apr 19, 2016 at 12:26:44PM -0400, David Woodhouse wrote:
>> On Tue, 2016-04-19 at 19:20 +0300, Michael S. Tsirkin wrote:
>> >
>> > > I thought that PLATFORM served that purpose.  Woudn't the host
>> > > advertise PLATFORM support and, if the guest doesn't ack it, the host
>> > > device would skip translation?  Or is that problematic for vfio?
>> >
>> > Exactly that's problematic for security.
>> > You can't allow guest driver to decide whether device skips security.
>>
>> Right. Because fundamentally, this *isn't* a property of the endpoint
>> device, and doesn't live in virtio itself.
>>
>> It's a property of the platform IOMMU, and lives there.
>
> It's a property of the hypervisor virtio implementation, and lives there.

It is now, but QEMU could, in principle, change the way it thinks
about it so that virtio devices would use the QEMU DMA API but ask
QEMU to pass everything through 1:1.  This would be entirely invisible
to guests but would make it be a property of the IOMMU implementation.
At that point, maybe QEMU could find a (platform dependent) way to
tell the guest what's going on.

FWIW, as far as I can tell, PPC and SPARC really could, in principle,
set up 1:1 mappings in the guest so that the virtio devices would work
regardless of whether QEMU is ignoring the IOMMU or not -- I think the
only obstacle is that the PPC and SPARC 1:1 mappings are currectly set
up with an offset.  I don't know too much about those platforms, but
presumably the layout could be changed so that 1:1 really was 1:1.

--Andy

^ permalink raw reply

* Re: [PATCH RFC] fixup! virtio: convert to use DMA api
From: Michael S. Tsirkin @ 2016-04-19 17:49 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Wei Liu, Andy Lutomirski, qemu-block, Christian Borntraeger,
	linux-kernel@vger.kernel.org, peterx,
	qemu-devel@nongnu.org Developers, Amit Shah, Stefan Hajnoczi,
	kvm list, Paolo Bonzini, Andy Lutomirski, Linux Virtualization
In-Reply-To: <1461083204.20056.8.camel@infradead.org>

On Tue, Apr 19, 2016 at 12:26:44PM -0400, David Woodhouse wrote:
> On Tue, 2016-04-19 at 19:20 +0300, Michael S. Tsirkin wrote:
> > 
> > > I thought that PLATFORM served that purpose.  Woudn't the host
> > > advertise PLATFORM support and, if the guest doesn't ack it, the host
> > > device would skip translation?  Or is that problematic for vfio?
> > 
> > Exactly that's problematic for security.
> > You can't allow guest driver to decide whether device skips security.
> 
> Right. Because fundamentally, this *isn't* a property of the endpoint
> device, and doesn't live in virtio itself.
> 
> It's a property of the platform IOMMU, and lives there.

It's a property of the hypervisor virtio implementation, and lives there.

-- 
MST

^ permalink raw reply

* Re: [PATCH RFC] fixup! virtio: convert to use DMA api
From: David Woodhouse @ 2016-04-19 16:26 UTC (permalink / raw)
  To: Michael S. Tsirkin, Andy Lutomirski
  Cc: Wei Liu, Andy Lutomirski, qemu-block, Christian Borntraeger,
	qemu-devel@nongnu.org Developers, peterx,
	linux-kernel@vger.kernel.org, Amit Shah, Stefan Hajnoczi,
	kvm list, Paolo Bonzini, Linux Virtualization
In-Reply-To: <20160419191914-mutt-send-email-mst@redhat.com>


[-- Attachment #1.1: Type: text/plain, Size: 594 bytes --]

On Tue, 2016-04-19 at 19:20 +0300, Michael S. Tsirkin wrote:
> 
> > I thought that PLATFORM served that purpose.  Woudn't the host
> > advertise PLATFORM support and, if the guest doesn't ack it, the host
> > device would skip translation?  Or is that problematic for vfio?
> 
> Exactly that's problematic for security.
> You can't allow guest driver to decide whether device skips security.

Right. Because fundamentally, this *isn't* a property of the endpoint
device, and doesn't live in virtio itself.

It's a property of the platform IOMMU, and lives there.

-- 
dwmw2


[-- Attachment #1.2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5691 bytes --]

[-- Attachment #2: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH RFC] fixup! virtio: convert to use DMA api
From: Michael S. Tsirkin @ 2016-04-19 16:20 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Wei Liu, Andy Lutomirski, qemu-block, Christian Borntraeger,
	qemu-devel@nongnu.org Developers, peterx,
	linux-kernel@vger.kernel.org, Amit Shah, Stefan Hajnoczi,
	kvm list, Paolo Bonzini, Linux Virtualization, David Woodhouse
In-Reply-To: <CALCETrWEo12k49kFmPA3NByXrLb8OYQEmDYoYgngQcxMZ=1Mzw@mail.gmail.com>

On Tue, Apr 19, 2016 at 09:12:03AM -0700, Andy Lutomirski wrote:
> On Tue, Apr 19, 2016 at 9:09 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Tue, Apr 19, 2016 at 09:02:14AM -0700, Andy Lutomirski wrote:
> >> On Tue, Apr 19, 2016 at 3:27 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> >> > On Mon, Apr 18, 2016 at 12:24:15PM -0700, Andy Lutomirski wrote:
> >> >> On Mon, Apr 18, 2016 at 11:29 AM, David Woodhouse <dwmw2@infradead.org> wrote:
> >> >> > For x86, you *can* enable virtio-behind-IOMMU if your DMAR tables tell
> >> >> > the truth, and even legacy kernels ought to cope with that.
> >> >> > FSVO 'ought to' where I suspect some of them will actually crash with a
> >> >> > NULL pointer dereference if there's no "catch-all" DMAR unit in the
> >> >> > tables, which puts it back into the same camp as ARM and Power.
> >> >>
> >> >> I think x86 may get a bit of a free pass here.  AFAIK the QEMU IOMMU
> >> >> implementation on x86 has always been "experimental", so it just might
> >> >> be okay to change it in a way that causes some older kernels to OOPS.
> >> >>
> >> >> --Andy
> >> >
> >> > Since it's experimental, it might be OK to change *guest kernels*
> >> > such that they oops on old QEMU.
> >> > But guest kernels were not experimental - so we need a QEMU mode that
> >> > makes them work fine. The more functionality is available in this QEMU
> >> > mode, the betterm because it's going to be the default for a while. For
> >> > the same reason, it is preferable to also have new kernels not crash in
> >> > this mode.
> >> >
> >>
> >> People add QEMU features that need new guest kernels all time time.
> >> If you enable virtio-scsi and try to boot a guest that's too old, it
> >> won't work.  So I don't see anything fundamentally wrong with saying
> >> that the non-experimental QEMU Q35 IOMMU mode won't boot if the guest
> >> kernel is too old.  It might be annoying, since old kernels do work on
> >> actual Q35 hardware, but it at least seems to be that it might be
> >> okay.
> >>
> >> --Andy
> >
> > Yes but we need a mode that makes both old and new kernels work, and
> > that should be the default for a while.  this is what the
> > IOMMU_PASSTHROUGH flag was about: old kernels ignore it and bypass DMA
> > API, new kernels go "oh compatibility mode" and bypass the IOMMU
> > within DMA API.
> 
> I thought that PLATFORM served that purpose.  Woudn't the host
> advertise PLATFORM support and, if the guest doesn't ack it, the host
> device would skip translation?  Or is that problematic for vfio?

Exactly that's problematic for security.
You can't allow guest driver to decide whether device skips security.

> >
> > --
> > MST
> 
> 
> 
> -- 
> Andy Lutomirski
> AMA Capital Management, LLC

^ permalink raw reply

* Re: [PATCH RFC] fixup! virtio: convert to use DMA api
From: Andy Lutomirski @ 2016-04-19 16:12 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Wei Liu, Andy Lutomirski, qemu-block, Christian Borntraeger,
	qemu-devel@nongnu.org Developers, peterx,
	linux-kernel@vger.kernel.org, Amit Shah, Stefan Hajnoczi,
	kvm list, Paolo Bonzini, Linux Virtualization, David Woodhouse
In-Reply-To: <20160419190520-mutt-send-email-mst@redhat.com>

On Tue, Apr 19, 2016 at 9:09 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Tue, Apr 19, 2016 at 09:02:14AM -0700, Andy Lutomirski wrote:
>> On Tue, Apr 19, 2016 at 3:27 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> > On Mon, Apr 18, 2016 at 12:24:15PM -0700, Andy Lutomirski wrote:
>> >> On Mon, Apr 18, 2016 at 11:29 AM, David Woodhouse <dwmw2@infradead.org> wrote:
>> >> > For x86, you *can* enable virtio-behind-IOMMU if your DMAR tables tell
>> >> > the truth, and even legacy kernels ought to cope with that.
>> >> > FSVO 'ought to' where I suspect some of them will actually crash with a
>> >> > NULL pointer dereference if there's no "catch-all" DMAR unit in the
>> >> > tables, which puts it back into the same camp as ARM and Power.
>> >>
>> >> I think x86 may get a bit of a free pass here.  AFAIK the QEMU IOMMU
>> >> implementation on x86 has always been "experimental", so it just might
>> >> be okay to change it in a way that causes some older kernels to OOPS.
>> >>
>> >> --Andy
>> >
>> > Since it's experimental, it might be OK to change *guest kernels*
>> > such that they oops on old QEMU.
>> > But guest kernels were not experimental - so we need a QEMU mode that
>> > makes them work fine. The more functionality is available in this QEMU
>> > mode, the betterm because it's going to be the default for a while. For
>> > the same reason, it is preferable to also have new kernels not crash in
>> > this mode.
>> >
>>
>> People add QEMU features that need new guest kernels all time time.
>> If you enable virtio-scsi and try to boot a guest that's too old, it
>> won't work.  So I don't see anything fundamentally wrong with saying
>> that the non-experimental QEMU Q35 IOMMU mode won't boot if the guest
>> kernel is too old.  It might be annoying, since old kernels do work on
>> actual Q35 hardware, but it at least seems to be that it might be
>> okay.
>>
>> --Andy
>
> Yes but we need a mode that makes both old and new kernels work, and
> that should be the default for a while.  this is what the
> IOMMU_PASSTHROUGH flag was about: old kernels ignore it and bypass DMA
> API, new kernels go "oh compatibility mode" and bypass the IOMMU
> within DMA API.

I thought that PLATFORM served that purpose.  Woudn't the host
advertise PLATFORM support and, if the guest doesn't ack it, the host
device would skip translation?  Or is that problematic for vfio?

>
> --
> MST



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply

* Re: [PATCH RFC] fixup! virtio: convert to use DMA api
From: Michael S. Tsirkin @ 2016-04-19 16:09 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Wei Liu, Andy Lutomirski, qemu-block, Christian Borntraeger,
	qemu-devel@nongnu.org Developers, peterx,
	linux-kernel@vger.kernel.org, Amit Shah, Stefan Hajnoczi,
	kvm list, Paolo Bonzini, Linux Virtualization, David Woodhouse
In-Reply-To: <CALCETrX72XV25B-0GdaMyJNkJRvvC8MbT4G=L5oGR=D5stpG6w@mail.gmail.com>

On Tue, Apr 19, 2016 at 09:02:14AM -0700, Andy Lutomirski wrote:
> On Tue, Apr 19, 2016 at 3:27 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Mon, Apr 18, 2016 at 12:24:15PM -0700, Andy Lutomirski wrote:
> >> On Mon, Apr 18, 2016 at 11:29 AM, David Woodhouse <dwmw2@infradead.org> wrote:
> >> > For x86, you *can* enable virtio-behind-IOMMU if your DMAR tables tell
> >> > the truth, and even legacy kernels ought to cope with that.
> >> > FSVO 'ought to' where I suspect some of them will actually crash with a
> >> > NULL pointer dereference if there's no "catch-all" DMAR unit in the
> >> > tables, which puts it back into the same camp as ARM and Power.
> >>
> >> I think x86 may get a bit of a free pass here.  AFAIK the QEMU IOMMU
> >> implementation on x86 has always been "experimental", so it just might
> >> be okay to change it in a way that causes some older kernels to OOPS.
> >>
> >> --Andy
> >
> > Since it's experimental, it might be OK to change *guest kernels*
> > such that they oops on old QEMU.
> > But guest kernels were not experimental - so we need a QEMU mode that
> > makes them work fine. The more functionality is available in this QEMU
> > mode, the betterm because it's going to be the default for a while. For
> > the same reason, it is preferable to also have new kernels not crash in
> > this mode.
> >
> 
> People add QEMU features that need new guest kernels all time time.
> If you enable virtio-scsi and try to boot a guest that's too old, it
> won't work.  So I don't see anything fundamentally wrong with saying
> that the non-experimental QEMU Q35 IOMMU mode won't boot if the guest
> kernel is too old.  It might be annoying, since old kernels do work on
> actual Q35 hardware, but it at least seems to be that it might be
> okay.
> 
> --Andy

Yes but we need a mode that makes both old and new kernels work, and
that should be the default for a while.  this is what the
IOMMU_PASSTHROUGH flag was about: old kernels ignore it and bypass DMA
API, new kernels go "oh compatibility mode" and bypass the IOMMU
within DMA API.

-- 
MST

^ permalink raw reply

* Re: [PATCH RFC] fixup! virtio: convert to use DMA api
From: Michael S. Tsirkin @ 2016-04-19 16:04 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Wei Liu, kvm list, qemu-block, Amit Shah,
	qemu-devel@nongnu.org Developers, peterx, Linux Virtualization,
	Christian Borntraeger, Stefan Hajnoczi, Paolo Bonzini,
	David Woodhouse, linux-kernel@vger.kernel.org
In-Reply-To: <CALCETrVBo-XBD53eStOCwJd53F7_VTnrMJ9rY7_cXwnt2Z6AOw@mail.gmail.com>

On Tue, Apr 19, 2016 at 09:00:27AM -0700, Andy Lutomirski wrote:
> On Apr 19, 2016 2:13 AM, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >
> >
> > I guess you are right in that we should split this part out.
> > What I wanted is really the combination
> > PASSTHROUGH && !PLATFORM so that we can say "ok we don't
> > need to guess, this device actually bypasses the IOMMU".
> 
> What happens when you use a device like this on Xen or with a similar
> software translation layer?

I think you don't use it on Xen since virtio doesn't bypass an IOMMU there.
If you do you have misconfigured your device.

-- 
MST

^ permalink raw reply

* Re: [PATCH RFC] fixup! virtio: convert to use DMA api
From: Andy Lutomirski @ 2016-04-19 16:02 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Wei Liu, Andy Lutomirski, qemu-block, Christian Borntraeger,
	qemu-devel@nongnu.org Developers, peterx,
	linux-kernel@vger.kernel.org, Amit Shah, Stefan Hajnoczi,
	kvm list, Paolo Bonzini, Linux Virtualization, David Woodhouse
In-Reply-To: <20160419130732-mutt-send-email-mst@redhat.com>

On Tue, Apr 19, 2016 at 3:27 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Mon, Apr 18, 2016 at 12:24:15PM -0700, Andy Lutomirski wrote:
>> On Mon, Apr 18, 2016 at 11:29 AM, David Woodhouse <dwmw2@infradead.org> wrote:
>> > For x86, you *can* enable virtio-behind-IOMMU if your DMAR tables tell
>> > the truth, and even legacy kernels ought to cope with that.
>> > FSVO 'ought to' where I suspect some of them will actually crash with a
>> > NULL pointer dereference if there's no "catch-all" DMAR unit in the
>> > tables, which puts it back into the same camp as ARM and Power.
>>
>> I think x86 may get a bit of a free pass here.  AFAIK the QEMU IOMMU
>> implementation on x86 has always been "experimental", so it just might
>> be okay to change it in a way that causes some older kernels to OOPS.
>>
>> --Andy
>
> Since it's experimental, it might be OK to change *guest kernels*
> such that they oops on old QEMU.
> But guest kernels were not experimental - so we need a QEMU mode that
> makes them work fine. The more functionality is available in this QEMU
> mode, the betterm because it's going to be the default for a while. For
> the same reason, it is preferable to also have new kernels not crash in
> this mode.
>

People add QEMU features that need new guest kernels all time time.
If you enable virtio-scsi and try to boot a guest that's too old, it
won't work.  So I don't see anything fundamentally wrong with saying
that the non-experimental QEMU Q35 IOMMU mode won't boot if the guest
kernel is too old.  It might be annoying, since old kernels do work on
actual Q35 hardware, but it at least seems to be that it might be
okay.

--Andy

^ permalink raw reply

* Re: [PATCH RFC] fixup! virtio: convert to use DMA api
From: Andy Lutomirski @ 2016-04-19 16:00 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Wei Liu, kvm list, qemu-block, Amit Shah,
	qemu-devel@nongnu.org Developers, peterx, Linux Virtualization,
	Christian Borntraeger, Stefan Hajnoczi, Paolo Bonzini,
	David Woodhouse, linux-kernel@vger.kernel.org
In-Reply-To: <20160419103815-mutt-send-email-mst@redhat.com>

On Apr 19, 2016 2:13 AM, "Michael S. Tsirkin" <mst@redhat.com> wrote:
>
>
> I guess you are right in that we should split this part out.
> What I wanted is really the combination
> PASSTHROUGH && !PLATFORM so that we can say "ok we don't
> need to guess, this device actually bypasses the IOMMU".

What happens when you use a device like this on Xen or with a similar
software translation layer?

I think that a "please bypass IOMMU" feature would be better in the
PCI, IOMMU, or platform code.  For Xen, virtio would still want to use
the DMA API, just without translating at the DMAR or hardware level.
Doing it in virtio is awkward, because virtio is involved at the
device level and the driver level, but the translation might be
entirely in between.

I think a nicer long-term approach would be to have a way to ask the
guest to set up a full 1:1 mapping for performance, but to still
handle the case where the guest refuses to do so or where there's more
than one translation layer involved.

But I agree that this part shouldn't delay the other part of your series.

--Andy

^ permalink raw reply

* Re: [PATCH RFC] fixup! virtio: convert to use DMA api
From: Alex Williamson @ 2016-04-19 14:44 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Wei Liu, Andy Lutomirski, qemu-block, Christian Borntraeger,
	qemu-devel, peterx, linux-kernel, Amit Shah, Stefan Hajnoczi, kvm,
	pbonzini, virtualization, David Woodhouse
In-Reply-To: <20160419103815-mutt-send-email-mst@redhat.com>

On Tue, 19 Apr 2016 12:13:29 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Mon, Apr 18, 2016 at 02:29:33PM -0400, David Woodhouse wrote:
> > On Mon, 2016-04-18 at 19:27 +0300, Michael S. Tsirkin wrote:  
> > > I balk at adding more hacks to a broken system. My goals are
> > > merely to
> > > - make things work correctly with an IOMMU and new guests,
> > >   so people can use userspace drivers with virtio devices
> > > - prevent security risks when guest kernel mistakenly thinks
> > >   it's protected by an IOMMU, but in fact isn't
> > > - avoid breaking any working configurations  
> > 
> > AFAICT the VIRTIO_F_IOMMU_PASSTHROUGH thing seems orthogonal to this.
> > That's just an optimisation, for telling an OS "you don't really need
> > to bother with the IOMMU, even though you it works".
> > 
> > There are two main reasons why an operating system might want to use
> > the IOMMU via the DMA API for native drivers: 
> >  - To protect against driver bugs triggering rogue DMA.
> >  - To protect against hardware (or firmware) bugs.
> > 
> > With virtio, the first reason still exists. But the second is moot
> > because the device is part of the hypervisor and if the hypervisor is
> > untrustworthy then you're screwed anyway... but then again, in SoC
> > devices you could replace 'hypervisor' with 'chip' and the same is
> > true, isn't it? Is there *really* anything virtio-specific here?
> >
> > Sure, I want my *external* network device on a PCIe card with software-
> > loadable firmware to be behind an IOMMU because I don't trust it as far
> > as I can throw it. But for on-SoC devices surely the situation is
> > *just* the same as devices provided by a hypervisor?  
> 
> Depends on how SoC is designed I guess.  At the moment specifically QEMU
> runs everything in a single memory space so an IOMMU table lookup does
> not offer any extra protection. That's not a must, one could come
> up with modular hypervisor designs - it's just what we have ATM.
> 
> 
> > And some people want that external network device to use passthrough
> > anyway, for performance reasons.  
> 
> That's a policy decision though.
> 
> > On the whole, there are *plenty* of reasons why we might want to have a
> > passthrough mapping on a per-device basis,  
> 
> That's true. And driver security also might differ, for example maybe I
> trust a distro-supplied driver more than an out of tree one.  Or maybe I
> trust a distro-supplied userspace driver more than a closed-source one.
> And maybe I trust devices from same vendor as my chip more than a 3rd
> party one.  So one can generalize this even further, think about device
> and driver security/trust level as an integer and platform protection as an
> integer.
> 
> If platform IOMMU offers you extra protection over trusting the device
> (trust < protection) it improves you security to use platform to limit
> the device. If trust >= protection it just adds overhead without
> increasing the security.
> 
> > and I really struggle to
> > find justification for having this 'hint' in a virtio-specific way.  
> 
> It's a way. No system seems to expose this information in a more generic
> way at the moment, and it's portable. Would you like to push for some
> kind of standartization of such a hint? I would be interested
> to hear about that.
> 
> 
> > And it's complicating the discussion of the *actual* fix we're looking
> > at.  
> 
> I guess you are right in that we should split this part out.
> What I wanted is really the combination
> PASSTHROUGH && !PLATFORM so that we can say "ok we don't
> need to guess, this device actually bypasses the IOMMU".
> 
> And I thought it's a nice idea to use PASSTHROUGH && PLATFORM
> as a hint since it seemed to be unused.
> But maybe the best thing to do for now is to say
> - hosts should not set PASSTHROUGH && PLATFORM
> - guests should ignore PASSTHROUGH if PLATFORM is set
> 
> and then we can come back to this optimization idea later
> if it's appropriate.
> 
> So yes I think we need the two bits but no we don't need to
> mix the hint discussion in here.
> 
> > > Looking at guest code, it looks like virtio was always
> > > bypassing the IOMMU even if configured, but no other
> > > guest driver did.
> > > 
> > > This makes me think the problem where guest drivers
> > > ignore the IOMMU is virtio specific
> > > and so a virtio specific solution seems cleaner.
> > > 
> > > The problem for assigned devices is IMHO different: they bypass
> > > the guest IOMMU too but no guest driver knows about this,
> > > so guests do not work. Seems cleaner to fix QEMU to make
> > > existing guests work.  
> > 
> > I certainly agree that it's better to fix QEMU. Whether devices are
> > behind an IOMMU or not, the DMAR tables we expose to a guest should
> > tell the truth.
> > 
> > Part of the issue here is virtio-specific; part isn't.
> > 
> > Basically, we have a conjunction of two separate bugs which happened to
> > work (for virtio) — the IOMMU support in QEMU wasn't working for virtio
> > (and assigned) devices even though it theoretically *should* have been,
> > and the virtio drivers weren't using the DMA API as they theoretically
> > should have been.
> > 
> > So there were corner cases like assigned PCI devices, and real hardware
> > implementations of virtio stuff (and perhaps virtio devices being
> > assigned to nested guests) which didn't work. But for the *common* use
> > case, one bug cancelled out the other.
> > 
> > Now we want to fix both bugs, and of course that involves carefully
> > coordinating both fixes.
> > 
> > I *like* your idea of a flag from the hypervisor which essentially says
> > "trust me, I'm telling the truth now".
> > 
> > But don't think that wants to be virtio-specific, because we actually
> > want it to cover *all* the corner cases, not just the common case which
> > *happened* to work before due to the alignment of the two previous
> > bugs.  
> 
> I guess we differ here. I care about fixing bugs and not breaking
> working setups but I see little value in working around
> existing bugs if they can be fixed at their source.
> 
> Building a generic mechanism to report which devices bypass the IOMMU
> isn't trivial because there's no simple generic way to address
> an arbitrary device from hypervisor. For example, DMAR tables
> commonly use bus numbers for that but these are guest (bios) assigned.
> So if we used bus numbers we'd have to ask bios to build a custom
> ACPI table and stick bus numbers there.

This is incorrect, the DMAR table specifically uses devices paths in
order to avoid the issue with guest assigned bus numbers.  The only
bus number used is the starting bus number, which is generally
provided by the platform anyway.  Excluding devices isn't necessarily
easy with DMAR though, we don't get to be lazy and use the
INCLUDE_PCI_ALL flag.  Hotplug is also an issue, we either need to
hot-add devices into slots where there's already the correct DMAR
coverage (or lack of coverage) to represent the inclusion or exclusion
or enable dynamic table support.  And really it seems like dynamic
tables are the only possible way DMAR could support replacing a
device that obeys the IOMMU with one that does not at the same address,
or vica versa.

For any sort of sane implementation, it probably comes down to fully
enumerating root bus devices in the DMAR and creating PCI sub-hierarchy
entries for certain subordinate buses, leaving others undefined.
Devices making use of the IOMMU could only be attached behind those
sub-hierarchies and devices not making use of the IOMMU would be
downstream of bridges not covered.  The management stack would need to
know where to place devices.
 
> > An updated guest OS can look for this flag (in its generic IOMMU code)
> > and can apply a heuristic of its own to work out which devices *aren't*
> > behind the IOMMU, if the flag isn't present. And it can get that right
> > even for assigned devices, so that new kernels can run happily even on
> > today's QEMU instances.  
> 
> With iommu enabled? Point is, I don't really care about that.
> At this point only a very small number of devices work with this
> IOMMU at all. I expect that we'll fix assigned devices very soon.
> 
> > And the virtio driver in new kernels should
> > just use the DMA API and expect it to work. Just as the various drivers
> > for assigned PCI devices do.  
> 
> Absolutely but that's a separate discussion.
> 
> > The other interesting case for compatibility is old kernels running in
> > a new QEMU. And for that case, things are likely to break if you
> > suddenly start putting the virtio devices behind an IOMMU. There's
> > nothing you can do on ARM and Power to stop that breakage, since they
> > don't *have* a way to tell legacy guests that certain devices aren't
> > translated. So I suspect you probably can't enable virtio-behind-IOMMU
> > in QEMU *ever* for those platforms as the default behaviour.
> > 
> > For x86, you *can* enable virtio-behind-IOMMU if your DMAR tables tell
> > the truth, and even legacy kernels ought to cope with that.  
> 
> I don't see how in that legacy kernels bypassed the DMA API.

It's a matter of excluding the device from being explicitly covered by
the DMAR AIUI.  This is theoretically possible, but I wonder if it
actually works for all kernels.

> To me it looks like we either use physical addresses that they give us
> or they don't work at all (at least without iommu=pt),
> since the VT-D spec says:
> 	DMA requests processed through root-entries with present field
> 	Clear result in translation-fault.
> 
> So I suspect the IOMMU_PLATFORM flag would have to stay off
> by default for a while.
> 
> 
> > FSVO 'ought to' where I suspect some of them will actually crash with a
> > NULL pointer dereference if there's no "catch-all" DMAR unit in the
> > tables, which puts it back into the same camp as ARM and Power.  
> 
> Right.  That would also be an issue.
> 
> >   
> > > True but I think we should fix QEMU to shadow IOMMU
> > > page tables for assigned devices. This seems rather
> > > possible with VT-D, and there are patches already on list.
> > > 
> > > It looks like this will fix all legacy guests which is
> > > much nicer than what you suggest which will only help new guests.  
> > 
> > Yes, we should do that. And in the short term we should at *least* fix
> > the DMAR tables to tell the truth.  
> 
> Right. However, the way timing happens to work, we are out of time to
> fix it in 2.6 and we are highly likely to have the proper VFIO fix in
> 2.7.  So I'm not sure there's space for a short term fix.

Note that vfio already works with IOMMUs on power, the issue I believe
we're talking about for assigned devices bypassing the guest IOMMU is
limited to the QEMU VT-d implementation failing to do the proper
notifies.  Legacy KVM device assignment of course has no idea about the
IOMMU because it piggy backs on KVM memory slot mapping instead of
operating within the QEMU Memory API like vfio does.

The issues I believe we're going to hit with vfio assigned devices and
QEMU VT-d are that 1) the vfio IOMMU interface is not designed for the
frequency of mapping that a DMA API managed guest device will generate,
2) we have accounting issues for locked pages since each device will
run in a separate IOMMU domain, accounted separately, and 3) we don't
have a way to expose host grouping to the VM so trying to assign
multiple devices from the same group is likely to fail.  We'd almost
need to put all of the devices within a group behind a conventional PCI
bridge in the VM to get them into the same address space, but I suspect
QEMU VT-d doesn't take that aliasing into account.

> > > > 
> > > > Furthermore, some platforms don't *have* a standard way for qemu to
> > > > 'tell the truth' to the guests, and that's where the real fun comes in.
> > > > But still, I'd like to see a generic solution for that lack instead of
> > > > a virtio-specific hack.  
> > > But the issue is not just these holes.  E.g. with VT-D it is only easy
> > > to emulate because there's a "caching mode" hook. It is fundamentally
> > > paravirtualization.  So a completely generic solution would be a
> > > paravirtualized IOMMU interface, replacing VT-D for VMs. It might be
> > > justified if many platforms have hard to emulate interfaces.  
> > 
> > Hm, I'm not sure I understand the point here.
> > 
> > Either there is a way for the hypervisor to expose an IOMMU to a guest
> > (be it full hardware virt, or paravirt). Or there isn't.
> > 
> > If there is, it doesn't matter *how* it's done.  
> 
> Well it does matter for people doing it :)
> 
> > And if there isn't, the
> > whole discussion is moot anyway.  
> 
> Point was that we can always build a paravirt interface
> if it does not exist, but it's easier to maintain
> if it's minimal, being as close to emulating hardware
> as we can.
> 
> > -- 
> > dwmw2
> > 
> >   
> 
> 

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH RFC] fixup! virtio: convert to use DMA api
From: Michael S. Tsirkin @ 2016-04-19 10:27 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Wei Liu, Andy Lutomirski, qemu-block, Christian Borntraeger,
	qemu-devel@nongnu.org Developers, peterx,
	linux-kernel@vger.kernel.org, Amit Shah, Stefan Hajnoczi,
	kvm list, Paolo Bonzini, Linux Virtualization, David Woodhouse
In-Reply-To: <CALCETrXGC9X6fndftzUKKQ76TXSARHotW8Kz3KcO6opQ-oLaqA@mail.gmail.com>

On Mon, Apr 18, 2016 at 12:24:15PM -0700, Andy Lutomirski wrote:
> On Mon, Apr 18, 2016 at 11:29 AM, David Woodhouse <dwmw2@infradead.org> wrote:
> > For x86, you *can* enable virtio-behind-IOMMU if your DMAR tables tell
> > the truth, and even legacy kernels ought to cope with that.
> > FSVO 'ought to' where I suspect some of them will actually crash with a
> > NULL pointer dereference if there's no "catch-all" DMAR unit in the
> > tables, which puts it back into the same camp as ARM and Power.
> 
> I think x86 may get a bit of a free pass here.  AFAIK the QEMU IOMMU
> implementation on x86 has always been "experimental", so it just might
> be okay to change it in a way that causes some older kernels to OOPS.
> 
> --Andy

Since it's experimental, it might be OK to change *guest kernels*
such that they oops on old QEMU.
But guest kernels were not experimental - so we need a QEMU mode that
makes them work fine. The more functionality is available in this QEMU
mode, the betterm because it's going to be the default for a while. For
the same reason, it is preferable to also have new kernels not crash in
this mode.

-- 
MST

^ permalink raw reply

* Re: [PATCH RFC 3/3] vfio: add virtio pci quirk
From: Michael S. Tsirkin @ 2016-04-19  9:58 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Wei Liu, kvm, Julia Lawall, qemu-devel, linux-kernel,
	virtualization, Christian Borntraeger, Andy Lutomirski,
	Paolo Bonzini, Feng Wu, David Woodhouse, Dan Carpenter
In-Reply-To: <20160418140006.5f68d516@t450s.home>

On Mon, Apr 18, 2016 at 02:00:06PM -0600, Alex Williamson wrote:
> On Mon, 18 Apr 2016 12:58:28 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > Modern virtio pci devices can set VIRTIO_F_IOMMU_PLATFORM
> > to signal they are safe to use with an IOMMU.
> > 
> > Without this bit, exposing the device to userspace is unsafe, so probe
> > and fail VFIO initialization unless noiommu is enabled.
> > 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > ---
> >  drivers/vfio/pci/vfio_pci_private.h |   1 +
> >  drivers/vfio/pci/vfio_pci.c         |  11 +++
> >  drivers/vfio/pci/vfio_pci_virtio.c  | 135 ++++++++++++++++++++++++++++++++++++
> >  drivers/vfio/pci/Makefile           |   1 +
> >  4 files changed, 148 insertions(+)
> >  create mode 100644 drivers/vfio/pci/vfio_pci_virtio.c
> > 
> > diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
> > index 8a7d546..604d445 100644
> > --- a/drivers/vfio/pci/vfio_pci_private.h
> > +++ b/drivers/vfio/pci/vfio_pci_private.h
> > @@ -130,4 +130,5 @@ static inline int vfio_pci_igd_init(struct vfio_pci_device *vdev)
> >  	return -ENODEV;
> >  }
> >  #endif
> > +extern int vfio_pci_virtio_quirk(struct vfio_pci_device *vdev, int noiommu);
> >  #endif /* VFIO_PCI_PRIVATE_H */
> > diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> > index d622a41..2bb8c76 100644
> > --- a/drivers/vfio/pci/vfio_pci.c
> > +++ b/drivers/vfio/pci/vfio_pci.c
> > @@ -1125,6 +1125,17 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> >  		return ret;
> >  	}
> >  
> > +	if (pdev->vendor == PCI_VENDOR_ID_REDHAT_QUMRANET &&
> 
> Virtio really owns this entire vendor ID block?  Apparently nobody told
> ivshmem: http://pci-ids.ucw.cz/read/PC/1af4/1110  Even the comment by
> virtio_pci_id_table[] suggests virtio is only a subset even if the code
> doesn't appear to honor that comment.  I don't know the history there,
> but that seems like really inefficient use of an entire, coveted vendor
> block.

True - virtio spec also says it's up to 0x107f.


> > +	    ((ret = vfio_pci_virtio_quirk(vdev, ret)))) {
> 
> Please don't set variables like this unless necessary.
> 
> if (vendor...) {
>    ret = vfio_pci_virtio_quir...
>    if (ret) {
>        ...
> 
> > +		dev_warn(&vdev->pdev->dev,
> > +			 "Failed to setup Virtio for VFIO\n");
> > +		vfio_del_group_dev(&pdev->dev);
> > +		vfio_iommu_group_put(group, &pdev->dev);
> > +		kfree(vdev);
> > +		return ret;
> > +	}
> > +
> > +
> >  	if (vfio_pci_is_vga(pdev)) {
> >  		vga_client_register(pdev, vdev, NULL, vfio_pci_set_vga_decode);
> >  		vga_set_legacy_decoding(pdev,
> > diff --git a/drivers/vfio/pci/vfio_pci_virtio.c b/drivers/vfio/pci/vfio_pci_virtio.c
> > new file mode 100644
> > index 0000000..1a32064
> > --- /dev/null
> > +++ b/drivers/vfio/pci/vfio_pci_virtio.c
> > @@ -0,0 +1,135 @@
> > +/*
> > + * VFIO PCI Intel Graphics support
> > + *
> > + * Copyright (C) 2016 Red Hat, Inc.  All rights reserved.
> > + *	Author: Alex Williamson <alex.williamson@redhat.com>
> > + *
> 
> Update
> 
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License version 2 as
> > + * published by the Free Software Foundation.
> > + *
> > + * Register a device specific region through which to provide read-only
> > + * access to the Intel IGD opregion.  The register defining the opregion
> > + * address is also virtualized to prevent user modification.
> 
> Update
> 
> > + */
> > +
> > +#include <linux/io.h>
> > +#include <linux/pci.h>
> > +#include <linux/uaccess.h>
> > +#include <linux/vfio.h>
> > +#include <linux/virtio_pci.h>
> > +#include <linux/virtio_config.h>
> 
> I don't see where io or uaccess are needed here.
> 
> > +
> > +#include "vfio_pci_private.h"
> > +
> > +/**
> > + * virtio_pci_find_capability - walk capabilities to find device info.
> > + * @dev: the pci device
> > + * @cfg_type: the VIRTIO_PCI_CAP_* value we seek
> > + *
> > + * Returns offset of the capability, or 0.
> > + */
> > +static inline int virtio_pci_find_capability(struct pci_dev *dev, u8 cfg_type)
> 
> This is called from probe code, why inline?  There's already a function
> with this exact same name in virtio code, can we come up with something
> unique to avoid confusion?
> 
> > +{
> > +	int pos;
> > +
> > +	for (pos = pci_find_capability(dev, PCI_CAP_ID_VNDR);
> > +	     pos > 0;
> > +	     pos = pci_find_next_capability(dev, pos, PCI_CAP_ID_VNDR)) {
> > +		u8 type;
> > +		pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
> > +							 cfg_type),
> > +				     &type);
> > +
> > +		if (type != cfg_type)
> > +			continue;
> > +
> > +		/* Ignore structures with reserved BAR values */
> > +		if (type != VIRTIO_PCI_CAP_PCI_CFG) {
> > +			u8 bar;
> > +
> > +			pci_read_config_byte(dev, pos +
> > +					     offsetof(struct virtio_pci_cap,
> > +						      bar),
> > +					     &bar);
> > +			if (bar > 0x5)
> > +				continue;
> > +		}
> > +
> > +		return pos;
> > +	}
> > +	return 0;
> > +}
> > +
> > +
> > +int vfio_pci_virtio_quirk(struct vfio_pci_device *vdev, int noiommu)
> > +{
> > +	struct pci_dev *dev = vdev->pdev;
> > +	int common, cfg;
> > +	u32 features;
> > +	u32 offset;
> > +	u8 bar;
> > +
> > +	/* Without an IOMMU, we don't care */
> > +	if (noiommu)
> > +		return 0;
> > +	/* Check whether device enforces the IOMMU correctly */
> > +
> > +	/*
> > +	 * All modern devices must have common and cfg capabilities. We use cfg
> > +	 * capability for access so that we don't need to worry about resource
> > +	 * availability. Slow but sure.
> > +	 * Note that all vendor-specific fields we access are little-endian
> > +	 * which matches what pci config accessors expect, so they do byteswap
> > +	 * for us if appropriate.
> > +	 */
> > +	common = virtio_pci_find_capability(dev, VIRTIO_PCI_CAP_COMMON_CFG);
> > +	cfg = virtio_pci_find_capability(dev, VIRTIO_PCI_CAP_PCI_CFG);
> > +	if (!cfg || !common) {
> > +                dev_warn(&dev->dev,
> > +                         "Virtio device lacks common or pci cfg.\n");
> 
> White space
> 
> > +		return -ENODEV;
> > +	}
> > +
> > +	pci_read_config_byte(dev, common + offsetof(struct virtio_pci_cap,
> > +						    bar),
> > +			     &bar);
> > +	pci_read_config_dword(dev, common + offsetof(struct virtio_pci_cap,
> > +						    offset),
> > +			     &offset);
> > +
> > +	/* Program cfg capability for dword access into common cfg. */
> > +	pci_write_config_byte(dev, cfg + offsetof(struct virtio_pci_cfg_cap,
> > +						  cap.bar),
> > +			      bar);
> > +	pci_write_config_dword(dev, cfg + offsetof(struct virtio_pci_cfg_cap,
> > +						   cap.length),
> > +			       0x4);
> > +
> > +	/* Select features dword that has VIRTIO_F_IOMMU_PLATFORM. */
> > +	pci_write_config_dword(dev, cfg + offsetof(struct virtio_pci_cfg_cap,
> > +						  cap.offset),
> > +			       offset + offsetof(struct virtio_pci_common_cfg,
> > +						 device_feature_select));
> > +	pci_write_config_dword(dev, cfg + offsetof(struct virtio_pci_cfg_cap,
> > +						  pci_cfg_data),
> > +			       VIRTIO_F_IOMMU_PLATFORM / 32);
> > +
> > +	/* Get the features dword. */
> > +	pci_write_config_dword(dev, cfg + offsetof(struct virtio_pci_cfg_cap,
> > +						  cap.offset),
> > +			       offset + offsetof(struct virtio_pci_common_cfg,
> > +						 device_feature));
> > +	pci_read_config_dword(dev, cfg + offsetof(struct virtio_pci_cfg_cap,
> > +						  pci_cfg_data),
> > +			      &features);
> > +
> > +	/* Does this device obey the platform's IOMMU? If not it's an error. */
> > +	if (!(features & (0x1 << (VIRTIO_F_IOMMU_PLATFORM % 32)))) {
> > +                dev_warn(&dev->dev,
> > +                         "Virtio device lacks VIRTIO_F_IOMMU_PLATFORM.\n");
> 
> White space
> 
> > +		return -ENODEV;
> > +	}
> > +
> > +	return 0;
> > +}
> > diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> > index 76d8ec0..e9b20e7 100644
> > --- a/drivers/vfio/pci/Makefile
> > +++ b/drivers/vfio/pci/Makefile
> > @@ -1,5 +1,6 @@
> >  
> >  vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
> > +vfio-pci-y += vfio_pci_virtio.o
> >  vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
> >  
> >  obj-$(CONFIG_VFIO_PCI) += vfio-pci.o

^ permalink raw reply

* Re: [PATCH RFC] fixup! virtio: convert to use DMA api
From: Michael S. Tsirkin @ 2016-04-19  9:13 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Wei Liu, Andy Lutomirski, qemu-block, Christian Borntraeger,
	qemu-devel, peterx, linux-kernel, Amit Shah, Stefan Hajnoczi, kvm,
	pbonzini, virtualization
In-Reply-To: <1461004173.3765.73.camel@infradead.org>

On Mon, Apr 18, 2016 at 02:29:33PM -0400, David Woodhouse wrote:
> On Mon, 2016-04-18 at 19:27 +0300, Michael S. Tsirkin wrote:
> > I balk at adding more hacks to a broken system. My goals are
> > merely to
> > - make things work correctly with an IOMMU and new guests,
> >   so people can use userspace drivers with virtio devices
> > - prevent security risks when guest kernel mistakenly thinks
> >   it's protected by an IOMMU, but in fact isn't
> > - avoid breaking any working configurations
> 
> AFAICT the VIRTIO_F_IOMMU_PASSTHROUGH thing seems orthogonal to this.
> That's just an optimisation, for telling an OS "you don't really need
> to bother with the IOMMU, even though you it works".
> 
> There are two main reasons why an operating system might want to use
> the IOMMU via the DMA API for native drivers: 
>  - To protect against driver bugs triggering rogue DMA.
>  - To protect against hardware (or firmware) bugs.
> 
> With virtio, the first reason still exists. But the second is moot
> because the device is part of the hypervisor and if the hypervisor is
> untrustworthy then you're screwed anyway... but then again, in SoC
> devices you could replace 'hypervisor' with 'chip' and the same is
> true, isn't it? Is there *really* anything virtio-specific here?
>
> Sure, I want my *external* network device on a PCIe card with software-
> loadable firmware to be behind an IOMMU because I don't trust it as far
> as I can throw it. But for on-SoC devices surely the situation is
> *just* the same as devices provided by a hypervisor?

Depends on how SoC is designed I guess.  At the moment specifically QEMU
runs everything in a single memory space so an IOMMU table lookup does
not offer any extra protection. That's not a must, one could come
up with modular hypervisor designs - it's just what we have ATM.


> And some people want that external network device to use passthrough
> anyway, for performance reasons.

That's a policy decision though.

> On the whole, there are *plenty* of reasons why we might want to have a
> passthrough mapping on a per-device basis,

That's true. And driver security also might differ, for example maybe I
trust a distro-supplied driver more than an out of tree one.  Or maybe I
trust a distro-supplied userspace driver more than a closed-source one.
And maybe I trust devices from same vendor as my chip more than a 3rd
party one.  So one can generalize this even further, think about device
and driver security/trust level as an integer and platform protection as an
integer.

If platform IOMMU offers you extra protection over trusting the device
(trust < protection) it improves you security to use platform to limit
the device. If trust >= protection it just adds overhead without
increasing the security.

> and I really struggle to
> find justification for having this 'hint' in a virtio-specific way.

It's a way. No system seems to expose this information in a more generic
way at the moment, and it's portable. Would you like to push for some
kind of standartization of such a hint? I would be interested
to hear about that.


> And it's complicating the discussion of the *actual* fix we're looking
> at.

I guess you are right in that we should split this part out.
What I wanted is really the combination
PASSTHROUGH && !PLATFORM so that we can say "ok we don't
need to guess, this device actually bypasses the IOMMU".

And I thought it's a nice idea to use PASSTHROUGH && PLATFORM
as a hint since it seemed to be unused.
But maybe the best thing to do for now is to say
- hosts should not set PASSTHROUGH && PLATFORM
- guests should ignore PASSTHROUGH if PLATFORM is set

and then we can come back to this optimization idea later
if it's appropriate.

So yes I think we need the two bits but no we don't need to
mix the hint discussion in here.

> > Looking at guest code, it looks like virtio was always
> > bypassing the IOMMU even if configured, but no other
> > guest driver did.
> > 
> > This makes me think the problem where guest drivers
> > ignore the IOMMU is virtio specific
> > and so a virtio specific solution seems cleaner.
> > 
> > The problem for assigned devices is IMHO different: they bypass
> > the guest IOMMU too but no guest driver knows about this,
> > so guests do not work. Seems cleaner to fix QEMU to make
> > existing guests work.
> 
> I certainly agree that it's better to fix QEMU. Whether devices are
> behind an IOMMU or not, the DMAR tables we expose to a guest should
> tell the truth.
> 
> Part of the issue here is virtio-specific; part isn't.
> 
> Basically, we have a conjunction of two separate bugs which happened to
> work (for virtio) — the IOMMU support in QEMU wasn't working for virtio
> (and assigned) devices even though it theoretically *should* have been,
> and the virtio drivers weren't using the DMA API as they theoretically
> should have been.
> 
> So there were corner cases like assigned PCI devices, and real hardware
> implementations of virtio stuff (and perhaps virtio devices being
> assigned to nested guests) which didn't work. But for the *common* use
> case, one bug cancelled out the other.
> 
> Now we want to fix both bugs, and of course that involves carefully
> coordinating both fixes.
> 
> I *like* your idea of a flag from the hypervisor which essentially says
> "trust me, I'm telling the truth now".
> 
> But don't think that wants to be virtio-specific, because we actually
> want it to cover *all* the corner cases, not just the common case which
> *happened* to work before due to the alignment of the two previous
> bugs.

I guess we differ here. I care about fixing bugs and not breaking
working setups but I see little value in working around
existing bugs if they can be fixed at their source.

Building a generic mechanism to report which devices bypass the IOMMU
isn't trivial because there's no simple generic way to address
an arbitrary device from hypervisor. For example, DMAR tables
commonly use bus numbers for that but these are guest (bios) assigned.
So if we used bus numbers we'd have to ask bios to build a custom
ACPI table and stick bus numbers there.

> An updated guest OS can look for this flag (in its generic IOMMU code)
> and can apply a heuristic of its own to work out which devices *aren't*
> behind the IOMMU, if the flag isn't present. And it can get that right
> even for assigned devices, so that new kernels can run happily even on
> today's QEMU instances.

With iommu enabled? Point is, I don't really care about that.
At this point only a very small number of devices work with this
IOMMU at all. I expect that we'll fix assigned devices very soon.

> And the virtio driver in new kernels should
> just use the DMA API and expect it to work. Just as the various drivers
> for assigned PCI devices do.

Absolutely but that's a separate discussion.

> The other interesting case for compatibility is old kernels running in
> a new QEMU. And for that case, things are likely to break if you
> suddenly start putting the virtio devices behind an IOMMU. There's
> nothing you can do on ARM and Power to stop that breakage, since they
> don't *have* a way to tell legacy guests that certain devices aren't
> translated. So I suspect you probably can't enable virtio-behind-IOMMU
> in QEMU *ever* for those platforms as the default behaviour.
> 
> For x86, you *can* enable virtio-behind-IOMMU if your DMAR tables tell
> the truth, and even legacy kernels ought to cope with that.

I don't see how in that legacy kernels bypassed the DMA API.
To me it looks like we either use physical addresses that they give us
or they don't work at all (at least without iommu=pt),
since the VT-D spec says:
	DMA requests processed through root-entries with present field
	Clear result in translation-fault.

So I suspect the IOMMU_PLATFORM flag would have to stay off
by default for a while.


> FSVO 'ought to' where I suspect some of them will actually crash with a
> NULL pointer dereference if there's no "catch-all" DMAR unit in the
> tables, which puts it back into the same camp as ARM and Power.

Right.  That would also be an issue.

> 
> > True but I think we should fix QEMU to shadow IOMMU
> > page tables for assigned devices. This seems rather
> > possible with VT-D, and there are patches already on list.
> > 
> > It looks like this will fix all legacy guests which is
> > much nicer than what you suggest which will only help new guests.
> 
> Yes, we should do that. And in the short term we should at *least* fix
> the DMAR tables to tell the truth.

Right. However, the way timing happens to work, we are out of time to
fix it in 2.6 and we are highly likely to have the proper VFIO fix in
2.7.  So I'm not sure there's space for a short term fix.

> > > 
> > > Furthermore, some platforms don't *have* a standard way for qemu to
> > > 'tell the truth' to the guests, and that's where the real fun comes in.
> > > But still, I'd like to see a generic solution for that lack instead of
> > > a virtio-specific hack.
> > But the issue is not just these holes.  E.g. with VT-D it is only easy
> > to emulate because there's a "caching mode" hook. It is fundamentally
> > paravirtualization.  So a completely generic solution would be a
> > paravirtualized IOMMU interface, replacing VT-D for VMs. It might be
> > justified if many platforms have hard to emulate interfaces.
> 
> Hm, I'm not sure I understand the point here.
> 
> Either there is a way for the hypervisor to expose an IOMMU to a guest
> (be it full hardware virt, or paravirt). Or there isn't.
> 
> If there is, it doesn't matter *how* it's done.

Well it does matter for people doing it :)

> And if there isn't, the
> whole discussion is moot anyway.

Point was that we can always build a paravirt interface
if it does not exist, but it's easier to maintain
if it's minimal, being as close to emulating hardware
as we can.

> -- 
> dwmw2
> 
> 


_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH v3 11/16] zsmalloc: separate free_zspage from putback_zspage
From: Sergey Senozhatsky @ 2016-04-19  7:53 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Rik van Riel, YiPing Xu, Sergey Senozhatsky, rknize,
	Sergey Senozhatsky, Chan Gyun Jeong, Hugh Dickins, linux-kernel,
	Al Viro, virtualization, bfields, linux-mm, Mel Gorman, Gioh Kim,
	koct9i, Sangseok Lee, Joonsoo Kim, Andrew Morton, jlayton,
	Vlastimil Babka, aquini
In-Reply-To: <20160419075118.GD18448@bbox>

Hello Minchan,

On (04/19/16 16:51), Minchan Kim wrote:
[..]
> 
> I guess it is remained thing after I rebased to catch any mistake.
> But I'm heavily chainging this part.
> Please review next version instead of this after a few days. :)

ah, got it. thanks!

	-ss

^ permalink raw reply

* Re: [PATCH v3 11/16] zsmalloc: separate free_zspage from putback_zspage
From: Minchan Kim @ 2016-04-19  7:51 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Rik van Riel, YiPing Xu, aquini, rknize, Sergey Senozhatsky,
	Chan Gyun Jeong, Hugh Dickins, linux-kernel, Al Viro,
	virtualization, bfields, linux-mm, Gioh Kim, koct9i, Sangseok Lee,
	Joonsoo Kim, Andrew Morton, jlayton, Vlastimil Babka, Mel Gorman
In-Reply-To: <20160418010408.GB5882@swordfish>

Hi Sergey,

On Mon, Apr 18, 2016 at 10:04:08AM +0900, Sergey Senozhatsky wrote:
> Hello Minchan,
> 
> On (03/30/16 16:12), Minchan Kim wrote:
> [..]
> > @@ -1835,23 +1827,31 @@ static void __zs_compact(struct zs_pool *pool, struct size_class *class)
> >  			if (!migrate_zspage(pool, class, &cc))
> >  				break;
> >  
> > -			putback_zspage(pool, class, dst_page);
> > +			VM_BUG_ON_PAGE(putback_zspage(pool, class,
> > +				dst_page) == ZS_EMPTY, dst_page);
> 
> can this VM_BUG_ON_PAGE() condition ever be true?

I guess it is remained thing after I rebased to catch any mistake.
But I'm heavily chainging this part.
Please review next version instead of this after a few days. :)

> 
> >  		}
> >  		/* Stop if we couldn't find slot */
> >  		if (dst_page == NULL)
> >  			break;
> > -		putback_zspage(pool, class, dst_page);
> > -		if (putback_zspage(pool, class, src_page) == ZS_EMPTY)
> > +		VM_BUG_ON_PAGE(putback_zspage(pool, class,
> > +				dst_page) == ZS_EMPTY, dst_page);
> 
> hm... this VM_BUG_ON_PAGE(dst_page) is sort of confusing. under what
> circumstances it can be true?
> 
> a minor nit, it took me some time (need some coffee I guess) to
> correctly parse this macro wrapper
> 
> 		VM_BUG_ON_PAGE(putback_zspage(pool, class,
> 			dst_page) == ZS_EMPTY, dst_page);
> 
> may be do it like:
> 		fullness = putback_zspage(pool, class, dst_page);
> 		VM_BUG_ON_PAGE(fullness == ZS_EMPTY, dst_page);
> 
> 
> well, if we want to VM_BUG_ON_PAGE() at all. there haven't been any
> problems with compaction, is there any specific reason these macros
> were added?
> 
> 
> 
> > +		if (putback_zspage(pool, class, src_page) == ZS_EMPTY) {
> >  			pool->stats.pages_compacted += class->pages_per_zspage;
> > -		spin_unlock(&class->lock);
> > +			spin_unlock(&class->lock);
> > +			free_zspage(pool, class, src_page);
> 
> do we really need to free_zspage() out of class->lock?
> wouldn't something like this
> 
> 		if (putback_zspage(pool, class, src_page) == ZS_EMPTY) {
> 			pool->stats.pages_compacted += class->pages_per_zspage;
> 			free_zspage(pool, class, src_page);
> 		}
> 		spin_unlock(&class->lock);
> 
> be simpler?

The reason I did out of class->lock is deadlock between page_lock
and class->lock with upcoming page migration.
However, as I said, I'm now heavily changing the part. :)

> 
> besides, free_zspage() now updates class stats out of class lock,
> not critical but still.
> 
> 	-ss
> 
> > +		} else {
> > +			spin_unlock(&class->lock);
> > +		}
> > +
> >  		cond_resched();
> >  		spin_lock(&class->lock);
> >  	}
> >  
> >  	if (src_page)
> > -		putback_zspage(pool, class, src_page);
> > +		VM_BUG_ON_PAGE(putback_zspage(pool, class,
> > +				src_page) == ZS_EMPTY, src_page);
> >  
> >  	spin_unlock(&class->lock);
> >  }

^ permalink raw reply

* Re: [PATCH v3 10/16] zsmalloc: factor page chain functionality out
From: Minchan Kim @ 2016-04-19  7:46 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Rik van Riel, YiPing Xu, aquini, rknize, Sergey Senozhatsky,
	Chan Gyun Jeong, Hugh Dickins, linux-kernel, Al Viro,
	virtualization, bfields, linux-mm, Gioh Kim, koct9i, Sangseok Lee,
	Joonsoo Kim, Andrew Morton, jlayton, Vlastimil Babka, Mel Gorman
In-Reply-To: <20160418003305.GA5882@swordfish>

On Mon, Apr 18, 2016 at 09:33:05AM +0900, Sergey Senozhatsky wrote:
> Hello,
> 
> On (03/30/16 16:12), Minchan Kim wrote:
> > @@ -1421,7 +1434,6 @@ static unsigned long obj_malloc(struct size_class *class,
> >  	unsigned long m_offset;
> >  	void *vaddr;
> >  
> > -	handle |= OBJ_ALLOCATED_TAG;
> 
> a nitpick, why did you replace this ALLOCATED_TAG assignment
> with 2 'handle | OBJ_ALLOCATED_TAG'?

I thought this handle variable in here is pure handle but OBJ_ALLOCATED_TAG
should live in head(i.e., link->handle), not pure handle, itself.

> 
> 	-ss
> 
> >  	obj = get_freeobj(first_page);
> >  	objidx_to_page_and_offset(class, first_page, obj,
> >  				&m_page, &m_offset);
> > @@ -1431,10 +1443,10 @@ static unsigned long obj_malloc(struct size_class *class,
> >  	set_freeobj(first_page, link->next >> OBJ_ALLOCATED_TAG);
> >  	if (!class->huge)
> >  		/* record handle in the header of allocated chunk */
> > -		link->handle = handle;
> > +		link->handle = handle | OBJ_ALLOCATED_TAG;
> >  	else
> >  		/* record handle in first_page->private */
> > -		set_page_private(first_page, handle);
> > +		set_page_private(first_page, handle | OBJ_ALLOCATED_TAG);
> >  	kunmap_atomic(vaddr);
> >  	mod_zspage_inuse(first_page, 1);
> >  	zs_stat_inc(class, OBJ_USED, 1);

^ permalink raw reply

* Re: [PATCH v3 08/16] zsmalloc: squeeze freelist into page->mapping
From: Minchan Kim @ 2016-04-19  7:42 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Rik van Riel, YiPing Xu, aquini, rknize, Chan Gyun Jeong,
	Hugh Dickins, linux-kernel, Al Viro, virtualization, bfields,
	linux-mm, Gioh Kim, koct9i, Sangseok Lee, Joonsoo Kim,
	Andrew Morton, jlayton, Vlastimil Babka, Mel Gorman
In-Reply-To: <20160417155621.GE575@swordfish>

On Mon, Apr 18, 2016 at 12:56:21AM +0900, Sergey Senozhatsky wrote:
> Hello,
> 
> On (03/30/16 16:12), Minchan Kim wrote:
> [..]
> > +static void objidx_to_page_and_offset(struct size_class *class,
> > +				struct page *first_page,
> > +				unsigned long obj_idx,
> > +				struct page **obj_page,
> > +				unsigned long *offset_in_page)
> >  {
> > -	unsigned long obj;
> > +	int i;
> > +	unsigned long offset;
> > +	struct page *cursor;
> > +	int nr_page;
> >  
> > -	if (!page) {
> > -		VM_BUG_ON(obj_idx);
> > -		return NULL;
> > -	}
> > +	offset = obj_idx * class->size;
> 
> so we already know the `offset' before we call objidx_to_page_and_offset(),
> thus we can drop `struct size_class *class' and `obj_idx', and pass
> `long obj_offset'  (which is `obj_idx * class->size') instead, right?
> 
> we also _may be_ can return `cursor' from the function.
> 
> static struct page *objidx_to_page_and_offset(struct page *first_page,
> 					unsigned long obj_offset,
> 					unsigned long *offset_in_page);
> 
> this can save ~20 instructions, which is not so terrible for a hot path
> like obj_malloc(). what do you think?
> 
> well, seems that `unsigned long *offset_in_page' can be calculated
> outside of this function too, it's basically
> 
> 	*offset_in_page = (obj_idx * class->size) & ~PAGE_MASK;
> 
> so we don't need to supply it to this function, nor modify it there.
> which can save ~40 instructions on my system. does this sound silly?

Sound smart. :)
At least, we will use it in hot path.

Thanks!

^ permalink raw reply

* Re: [PATCH v3 06/16] zsmalloc: squeeze inuse into page->mapping
From: Minchan Kim @ 2016-04-19  7:40 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Rik van Riel, YiPing Xu, aquini, rknize, Chan Gyun Jeong,
	Hugh Dickins, linux-kernel, Al Viro, virtualization, bfields,
	linux-mm, Gioh Kim, koct9i, Sangseok Lee, Joonsoo Kim,
	Andrew Morton, jlayton, Vlastimil Babka, Mel Gorman
In-Reply-To: <20160417150804.GA575@swordfish>

On Mon, Apr 18, 2016 at 12:08:04AM +0900, Sergey Senozhatsky wrote:
> Hello,
> 
> On (03/30/16 16:12), Minchan Kim wrote:
> [..]
> > +static int get_zspage_inuse(struct page *first_page)
> > +{
> > +	struct zs_meta *m;
> > +
> > +	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
> > +
> > +	m = (struct zs_meta *)&first_page->mapping;
> ..
> > +static void set_zspage_inuse(struct page *first_page, int val)
> > +{
> > +	struct zs_meta *m;
> > +
> > +	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
> > +
> > +	m = (struct zs_meta *)&first_page->mapping;
> ..
> > +static void mod_zspage_inuse(struct page *first_page, int val)
> > +{
> > +	struct zs_meta *m;
> > +
> > +	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
> > +
> > +	m = (struct zs_meta *)&first_page->mapping;
> ..
> >  static void get_zspage_mapping(struct page *first_page,
> >  				unsigned int *class_idx,
> >  				enum fullness_group *fullness)
> >  {
> > -	unsigned long m;
> > +	struct zs_meta *m;
> > +
> >  	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
> > +	m = (struct zs_meta *)&first_page->mapping;
> ..
> >  static void set_zspage_mapping(struct page *first_page,
> >  				unsigned int class_idx,
> >  				enum fullness_group fullness)
> >  {
> > +	struct zs_meta *m;
> > +
> >  	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
> >  
> > +	m = (struct zs_meta *)&first_page->mapping;
> > +	m->fullness = fullness;
> > +	m->class = class_idx;
> >  }
> 
> 
> a nitpick: this
> 
> 	struct zs_meta *m;
> 	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
> 	m = (struct zs_meta *)&first_page->mapping;
> 
> 
> seems to be common in several places, may be it makes sense to
> factor it out and turn into a macro or a static inline helper?
> 
> other than that, looks good to me

Yeb.

> 
> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>

Thanks for the review!

^ permalink raw reply

* [PATCH] VSOCK: Only check error on skb_recv_datagram when skb is NULL
From: Jorgen Hansen @ 2016-04-19  6:58 UTC (permalink / raw)
  To: netdev, linux-kernel, virtualization
  Cc: pv-drivers, gregkh, davem, Jorgen Hansen

If skb_recv_datagram returns an skb, we should ignore the err
value returned. Otherwise, datagram receives will return EAGAIN
when they have to wait for a datagram.

Acked-by: Adit Ranadive <aditr@vmware.com>
Signed-off-by: Jorgen Hansen <jhansen@vmware.com>
---
 net/vmw_vsock/vmci_transport.c |    7 ++-----
 1 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c
index 662bdd2..5621473 100644
--- a/net/vmw_vsock/vmci_transport.c
+++ b/net/vmw_vsock/vmci_transport.c
@@ -1735,11 +1735,8 @@ static int vmci_transport_dgram_dequeue(struct vsock_sock *vsk,
 	/* Retrieve the head sk_buff from the socket's receive queue. */
 	err = 0;
 	skb = skb_recv_datagram(&vsk->sk, flags, noblock, &err);
-	if (err)
-		return err;
-
 	if (!skb)
-		return -EAGAIN;
+		return err;
 
 	dg = (struct vmci_datagram *)skb->data;
 	if (!dg)
@@ -2154,7 +2151,7 @@ module_exit(vmci_transport_exit);
 
 MODULE_AUTHOR("VMware, Inc.");
 MODULE_DESCRIPTION("VMCI transport for Virtual Sockets");
-MODULE_VERSION("1.0.3.0-k");
+MODULE_VERSION("1.0.4.0-k");
 MODULE_LICENSE("GPL v2");
 MODULE_ALIAS("vmware_vsock");
 MODULE_ALIAS_NETPROTO(PF_VSOCK);
-- 
1.7.0

^ permalink raw reply related

* Re: [PATCH RFC 3/3] vfio: add virtio pci quirk
From: Alex Williamson @ 2016-04-18 20:00 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Wei Liu, kvm, Julia Lawall, qemu-devel, linux-kernel,
	virtualization, Christian Borntraeger, Andy Lutomirski,
	Paolo Bonzini, Feng Wu, David Woodhouse, Dan Carpenter
In-Reply-To: <1460973374-32719-4-git-send-email-mst@redhat.com>

On Mon, 18 Apr 2016 12:58:28 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> Modern virtio pci devices can set VIRTIO_F_IOMMU_PLATFORM
> to signal they are safe to use with an IOMMU.
> 
> Without this bit, exposing the device to userspace is unsafe, so probe
> and fail VFIO initialization unless noiommu is enabled.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>  drivers/vfio/pci/vfio_pci_private.h |   1 +
>  drivers/vfio/pci/vfio_pci.c         |  11 +++
>  drivers/vfio/pci/vfio_pci_virtio.c  | 135 ++++++++++++++++++++++++++++++++++++
>  drivers/vfio/pci/Makefile           |   1 +
>  4 files changed, 148 insertions(+)
>  create mode 100644 drivers/vfio/pci/vfio_pci_virtio.c
> 
> diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
> index 8a7d546..604d445 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -130,4 +130,5 @@ static inline int vfio_pci_igd_init(struct vfio_pci_device *vdev)
>  	return -ENODEV;
>  }
>  #endif
> +extern int vfio_pci_virtio_quirk(struct vfio_pci_device *vdev, int noiommu);
>  #endif /* VFIO_PCI_PRIVATE_H */
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index d622a41..2bb8c76 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -1125,6 +1125,17 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  		return ret;
>  	}
>  
> +	if (pdev->vendor == PCI_VENDOR_ID_REDHAT_QUMRANET &&

Virtio really owns this entire vendor ID block?  Apparently nobody told
ivshmem: http://pci-ids.ucw.cz/read/PC/1af4/1110  Even the comment by
virtio_pci_id_table[] suggests virtio is only a subset even if the code
doesn't appear to honor that comment.  I don't know the history there,
but that seems like really inefficient use of an entire, coveted vendor
block.

> +	    ((ret = vfio_pci_virtio_quirk(vdev, ret)))) {

Please don't set variables like this unless necessary.

if (vendor...) {
   ret = vfio_pci_virtio_quir...
   if (ret) {
       ...

> +		dev_warn(&vdev->pdev->dev,
> +			 "Failed to setup Virtio for VFIO\n");
> +		vfio_del_group_dev(&pdev->dev);
> +		vfio_iommu_group_put(group, &pdev->dev);
> +		kfree(vdev);
> +		return ret;
> +	}
> +
> +
>  	if (vfio_pci_is_vga(pdev)) {
>  		vga_client_register(pdev, vdev, NULL, vfio_pci_set_vga_decode);
>  		vga_set_legacy_decoding(pdev,
> diff --git a/drivers/vfio/pci/vfio_pci_virtio.c b/drivers/vfio/pci/vfio_pci_virtio.c
> new file mode 100644
> index 0000000..1a32064
> --- /dev/null
> +++ b/drivers/vfio/pci/vfio_pci_virtio.c
> @@ -0,0 +1,135 @@
> +/*
> + * VFIO PCI Intel Graphics support
> + *
> + * Copyright (C) 2016 Red Hat, Inc.  All rights reserved.
> + *	Author: Alex Williamson <alex.williamson@redhat.com>
> + *

Update

> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * Register a device specific region through which to provide read-only
> + * access to the Intel IGD opregion.  The register defining the opregion
> + * address is also virtualized to prevent user modification.

Update

> + */
> +
> +#include <linux/io.h>
> +#include <linux/pci.h>
> +#include <linux/uaccess.h>
> +#include <linux/vfio.h>
> +#include <linux/virtio_pci.h>
> +#include <linux/virtio_config.h>

I don't see where io or uaccess are needed here.

> +
> +#include "vfio_pci_private.h"
> +
> +/**
> + * virtio_pci_find_capability - walk capabilities to find device info.
> + * @dev: the pci device
> + * @cfg_type: the VIRTIO_PCI_CAP_* value we seek
> + *
> + * Returns offset of the capability, or 0.
> + */
> +static inline int virtio_pci_find_capability(struct pci_dev *dev, u8 cfg_type)

This is called from probe code, why inline?  There's already a function
with this exact same name in virtio code, can we come up with something
unique to avoid confusion?

> +{
> +	int pos;
> +
> +	for (pos = pci_find_capability(dev, PCI_CAP_ID_VNDR);
> +	     pos > 0;
> +	     pos = pci_find_next_capability(dev, pos, PCI_CAP_ID_VNDR)) {
> +		u8 type;
> +		pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
> +							 cfg_type),
> +				     &type);
> +
> +		if (type != cfg_type)
> +			continue;
> +
> +		/* Ignore structures with reserved BAR values */
> +		if (type != VIRTIO_PCI_CAP_PCI_CFG) {
> +			u8 bar;
> +
> +			pci_read_config_byte(dev, pos +
> +					     offsetof(struct virtio_pci_cap,
> +						      bar),
> +					     &bar);
> +			if (bar > 0x5)
> +				continue;
> +		}
> +
> +		return pos;
> +	}
> +	return 0;
> +}
> +
> +
> +int vfio_pci_virtio_quirk(struct vfio_pci_device *vdev, int noiommu)
> +{
> +	struct pci_dev *dev = vdev->pdev;
> +	int common, cfg;
> +	u32 features;
> +	u32 offset;
> +	u8 bar;
> +
> +	/* Without an IOMMU, we don't care */
> +	if (noiommu)
> +		return 0;
> +	/* Check whether device enforces the IOMMU correctly */
> +
> +	/*
> +	 * All modern devices must have common and cfg capabilities. We use cfg
> +	 * capability for access so that we don't need to worry about resource
> +	 * availability. Slow but sure.
> +	 * Note that all vendor-specific fields we access are little-endian
> +	 * which matches what pci config accessors expect, so they do byteswap
> +	 * for us if appropriate.
> +	 */
> +	common = virtio_pci_find_capability(dev, VIRTIO_PCI_CAP_COMMON_CFG);
> +	cfg = virtio_pci_find_capability(dev, VIRTIO_PCI_CAP_PCI_CFG);
> +	if (!cfg || !common) {
> +                dev_warn(&dev->dev,
> +                         "Virtio device lacks common or pci cfg.\n");

White space

> +		return -ENODEV;
> +	}
> +
> +	pci_read_config_byte(dev, common + offsetof(struct virtio_pci_cap,
> +						    bar),
> +			     &bar);
> +	pci_read_config_dword(dev, common + offsetof(struct virtio_pci_cap,
> +						    offset),
> +			     &offset);
> +
> +	/* Program cfg capability for dword access into common cfg. */
> +	pci_write_config_byte(dev, cfg + offsetof(struct virtio_pci_cfg_cap,
> +						  cap.bar),
> +			      bar);
> +	pci_write_config_dword(dev, cfg + offsetof(struct virtio_pci_cfg_cap,
> +						   cap.length),
> +			       0x4);
> +
> +	/* Select features dword that has VIRTIO_F_IOMMU_PLATFORM. */
> +	pci_write_config_dword(dev, cfg + offsetof(struct virtio_pci_cfg_cap,
> +						  cap.offset),
> +			       offset + offsetof(struct virtio_pci_common_cfg,
> +						 device_feature_select));
> +	pci_write_config_dword(dev, cfg + offsetof(struct virtio_pci_cfg_cap,
> +						  pci_cfg_data),
> +			       VIRTIO_F_IOMMU_PLATFORM / 32);
> +
> +	/* Get the features dword. */
> +	pci_write_config_dword(dev, cfg + offsetof(struct virtio_pci_cfg_cap,
> +						  cap.offset),
> +			       offset + offsetof(struct virtio_pci_common_cfg,
> +						 device_feature));
> +	pci_read_config_dword(dev, cfg + offsetof(struct virtio_pci_cfg_cap,
> +						  pci_cfg_data),
> +			      &features);
> +
> +	/* Does this device obey the platform's IOMMU? If not it's an error. */
> +	if (!(features & (0x1 << (VIRTIO_F_IOMMU_PLATFORM % 32)))) {
> +                dev_warn(&dev->dev,
> +                         "Virtio device lacks VIRTIO_F_IOMMU_PLATFORM.\n");

White space

> +		return -ENODEV;
> +	}
> +
> +	return 0;
> +}
> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> index 76d8ec0..e9b20e7 100644
> --- a/drivers/vfio/pci/Makefile
> +++ b/drivers/vfio/pci/Makefile
> @@ -1,5 +1,6 @@
>  
>  vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
> +vfio-pci-y += vfio_pci_virtio.o
>  vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
>  
>  obj-$(CONFIG_VFIO_PCI) += vfio-pci.o

^ permalink raw reply

* Re: [PATCH RFC] fixup! virtio: convert to use DMA api
From: Andy Lutomirski @ 2016-04-18 19:24 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Wei Liu, Andy Lutomirski, qemu-block, Michael S. Tsirkin,
	linux-kernel@vger.kernel.org, peterx,
	qemu-devel@nongnu.org Developers, Amit Shah, Stefan Hajnoczi,
	kvm list, Paolo Bonzini, Linux Virtualization,
	Christian Borntraeger
In-Reply-To: <1461004173.3765.73.camel@infradead.org>

On Mon, Apr 18, 2016 at 11:29 AM, David Woodhouse <dwmw2@infradead.org> wrote:
> For x86, you *can* enable virtio-behind-IOMMU if your DMAR tables tell
> the truth, and even legacy kernels ought to cope with that.
> FSVO 'ought to' where I suspect some of them will actually crash with a
> NULL pointer dereference if there's no "catch-all" DMAR unit in the
> tables, which puts it back into the same camp as ARM and Power.

I think x86 may get a bit of a free pass here.  AFAIK the QEMU IOMMU
implementation on x86 has always been "experimental", so it just might
be okay to change it in a way that causes some older kernels to OOPS.

--Andy

^ permalink raw reply

* Re: [PATCH RFC 2/3] vfio: report group noiommu status
From: Alex Williamson @ 2016-04-18 18:56 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Julia Lawall, Wei Liu, kvm, Jonathan Corbet, qemu-devel,
	linux-doc, linux-kernel, virtualization, Christian Borntraeger,
	Andy Lutomirski, Baptiste Reynal, David Woodhouse, Dan Carpenter
In-Reply-To: <1460973374-32719-3-git-send-email-mst@redhat.com>

On Mon, 18 Apr 2016 12:58:20 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> When using vfio, callers might want to know whether device is added to a
> regular group or an non-iommu group.
> 
> Report this status from vfio_add_group_dev.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---

What about making an interface to query this rather than playing games
with magic return values?

bool vfio_iommu_group_is_noiommu(struct iommu_group *group)
{
    return iommu_group_get_iommudata(group) == &noiommu;
}

>  drivers/vfio/pci/vfio_pci.c                  | 2 +-
>  drivers/vfio/platform/vfio_platform_common.c | 2 +-
>  drivers/vfio/vfio.c                          | 5 ++++-
>  Documentation/vfio.txt                       | 4 +++-
>  4 files changed, 9 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 712a849..d622a41 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -1119,7 +1119,7 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  	spin_lock_init(&vdev->irqlock);
>  
>  	ret = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
> -	if (ret) {
> +	if (ret < 0) {
>  		vfio_iommu_group_put(group, &pdev->dev);
>  		kfree(vdev);
>  		return ret;
> diff --git a/drivers/vfio/platform/vfio_platform_common.c b/drivers/vfio/platform/vfio_platform_common.c
> index e65b142..bf74e21 100644
> --- a/drivers/vfio/platform/vfio_platform_common.c
> +++ b/drivers/vfio/platform/vfio_platform_common.c
> @@ -568,7 +568,7 @@ int vfio_platform_probe_common(struct vfio_platform_device *vdev,
>  	}
>  
>  	ret = vfio_add_group_dev(dev, &vfio_platform_ops, vdev);
> -	if (ret) {
> +	if (ret < 0) {
>  		iommu_group_put(group);
>  		return ret;
>  	}
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index 6fd6fa5..67db231 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -756,6 +756,7 @@ int vfio_add_group_dev(struct device *dev,
>  	struct iommu_group *iommu_group;
>  	struct vfio_group *group;
>  	struct vfio_device *device;
> +	int noiommu;
>  
>  	iommu_group = iommu_group_get(dev);
>  	if (!iommu_group)
> @@ -791,6 +792,8 @@ int vfio_add_group_dev(struct device *dev,
>  		return PTR_ERR(device);
>  	}
>  
> +	noiommu = group->noiommu;
> +
>  	/*
>  	 * Drop all but the vfio_device reference.  The vfio_device holds
>  	 * a reference to the vfio_group, which holds a reference to the
> @@ -798,7 +801,7 @@ int vfio_add_group_dev(struct device *dev,
>  	 */
>  	vfio_group_put(group);
>  
> -	return 0;
> +	return noiommu;
>  }
>  EXPORT_SYMBOL_GPL(vfio_add_group_dev);
>  
> diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
> index 1dd3fdd..d76be0f 100644
> --- a/Documentation/vfio.txt
> +++ b/Documentation/vfio.txt
> @@ -259,7 +259,9 @@ extern void *vfio_del_group_dev(struct device *dev);
>  
>  vfio_add_group_dev() indicates to the core to begin tracking the
>  specified iommu_group and register the specified dev as owned by
> -a VFIO bus driver.  The driver provides an ops structure for callbacks
> +a VFIO bus driver.  A negative return value indicates failure.
> +A positive return value indicates that an unsafe noiommu mode
> +is in use.  The driver provides an ops structure for callbacks
>  similar to a file operations structure:
>  
>  struct vfio_device_ops {

^ permalink raw reply

* Re: [PATCH RFC] fixup! virtio: convert to use DMA api
From: David Woodhouse @ 2016-04-18 18:29 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Wei Liu, Andy Lutomirski, qemu-block, Christian Borntraeger,
	qemu-devel, peterx, linux-kernel, Amit Shah, Stefan Hajnoczi, kvm,
	pbonzini, virtualization
In-Reply-To: <20160418190203-mutt-send-email-mst@redhat.com>


[-- Attachment #1.1: Type: text/plain, Size: 6217 bytes --]

On Mon, 2016-04-18 at 19:27 +0300, Michael S. Tsirkin wrote:
> I balk at adding more hacks to a broken system. My goals are
> merely to
> - make things work correctly with an IOMMU and new guests,
>   so people can use userspace drivers with virtio devices
> - prevent security risks when guest kernel mistakenly thinks
>   it's protected by an IOMMU, but in fact isn't
> - avoid breaking any working configurations

AFAICT the VIRTIO_F_IOMMU_PASSTHROUGH thing seems orthogonal to this.
That's just an optimisation, for telling an OS "you don't really need
to bother with the IOMMU, even though you it works".

There are two main reasons why an operating system might want to use
the IOMMU via the DMA API for native drivers: 
 - To protect against driver bugs triggering rogue DMA.
 - To protect against hardware (or firmware) bugs.

With virtio, the first reason still exists. But the second is moot
because the device is part of the hypervisor and if the hypervisor is
untrustworthy then you're screwed anyway... but then again, in SoC
devices you could replace 'hypervisor' with 'chip' and the same is
true, isn't it? Is there *really* anything virtio-specific here?

Sure, I want my *external* network device on a PCIe card with software-
loadable firmware to be behind an IOMMU because I don't trust it as far
as I can throw it. But for on-SoC devices surely the situation is
*just* the same as devices provided by a hypervisor?

And some people want that external network device to use passthrough
anyway, for performance reasons.

On the whole, there are *plenty* of reasons why we might want to have a
passthrough mapping on a per-device basis, and I really struggle to
find justification for having this 'hint' in a virtio-specific way.

And it's complicating the discussion of the *actual* fix we're looking
at.

> Looking at guest code, it looks like virtio was always
> bypassing the IOMMU even if configured, but no other
> guest driver did.
> 
> This makes me think the problem where guest drivers
> ignore the IOMMU is virtio specific
> and so a virtio specific solution seems cleaner.
> 
> The problem for assigned devices is IMHO different: they bypass
> the guest IOMMU too but no guest driver knows about this,
> so guests do not work. Seems cleaner to fix QEMU to make
> existing guests work.

I certainly agree that it's better to fix QEMU. Whether devices are
behind an IOMMU or not, the DMAR tables we expose to a guest should
tell the truth.

Part of the issue here is virtio-specific; part isn't.

Basically, we have a conjunction of two separate bugs which happened to
work (for virtio) — the IOMMU support in QEMU wasn't working for virtio
(and assigned) devices even though it theoretically *should* have been,
and the virtio drivers weren't using the DMA API as they theoretically
should have been.

So there were corner cases like assigned PCI devices, and real hardware
implementations of virtio stuff (and perhaps virtio devices being
assigned to nested guests) which didn't work. But for the *common* use
case, one bug cancelled out the other.

Now we want to fix both bugs, and of course that involves carefully
coordinating both fixes.

I *like* your idea of a flag from the hypervisor which essentially says
"trust me, I'm telling the truth now".

But don't think that wants to be virtio-specific, because we actually
want it to cover *all* the corner cases, not just the common case which
*happened* to work before due to the alignment of the two previous
bugs.

An updated guest OS can look for this flag (in its generic IOMMU code)
and can apply a heuristic of its own to work out which devices *aren't*
behind the IOMMU, if the flag isn't present. And it can get that right
even for assigned devices, so that new kernels can run happily even on
today's QEMU instances. And the virtio driver in new kernels should
just use the DMA API and expect it to work. Just as the various drivers
for assigned PCI devices do.

The other interesting case for compatibility is old kernels running in
a new QEMU. And for that case, things are likely to break if you
suddenly start putting the virtio devices behind an IOMMU. There's
nothing you can do on ARM and Power to stop that breakage, since they
don't *have* a way to tell legacy guests that certain devices aren't
translated. So I suspect you probably can't enable virtio-behind-IOMMU
in QEMU *ever* for those platforms as the default behaviour.

For x86, you *can* enable virtio-behind-IOMMU if your DMAR tables tell
the truth, and even legacy kernels ought to cope with that.
FSVO 'ought to' where I suspect some of them will actually crash with a
NULL pointer dereference if there's no "catch-all" DMAR unit in the
tables, which puts it back into the same camp as ARM and Power.


> True but I think we should fix QEMU to shadow IOMMU
> page tables for assigned devices. This seems rather
> possible with VT-D, and there are patches already on list.
> 
> It looks like this will fix all legacy guests which is
> much nicer than what you suggest which will only help new guests.

Yes, we should do that. And in the short term we should at *least* fix
the DMAR tables to tell the truth.

> > 
> > Furthermore, some platforms don't *have* a standard way for qemu to
> > 'tell the truth' to the guests, and that's where the real fun comes in.
> > But still, I'd like to see a generic solution for that lack instead of
> > a virtio-specific hack.
> But the issue is not just these holes.  E.g. with VT-D it is only easy
> to emulate because there's a "caching mode" hook. It is fundamentally
> paravirtualization.  So a completely generic solution would be a
> paravirtualized IOMMU interface, replacing VT-D for VMs. It might be
> justified if many platforms have hard to emulate interfaces.

Hm, I'm not sure I understand the point here.

Either there is a way for the hypervisor to expose an IOMMU to a guest
(be it full hardware virt, or paravirt). Or there isn't.

If there is, it doesn't matter *how* it's done. And if there isn't, the
whole discussion is moot anyway.

-- 
dwmw2



[-- Attachment #1.2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5691 bytes --]

[-- Attachment #2: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: Potential hang in virtnet_send_command
From: Michael S. Tsirkin @ 2016-04-18 17:24 UTC (permalink / raw)
  To: Christian Ehrhardt; +Cc: Thomas Monjalon, virtualization
In-Reply-To: <CAATJJ0JF7-ApjMO+XtsVgD8rU6WTCK28vTjmG-Aww-ASHw4cBQ@mail.gmail.com>

On Mon, Apr 18, 2016 at 07:07:32PM +0200, Christian Ehrhardt wrote:
> Hi,
> I was debugging an issue caused by a bad usage of dpdk.
> That is fixed in the meantime and I'll backport that into our code as well.
> 
> But along debugging that, I found a potential hang in virtnet_send_command that
> my case ran into.
> The following code can become an infinite loop:
> /* Spin for a response, the kick causes an ioport write, trapping
>  * into the hypervisor, so the request should be handled immediately.
>  */   
> while (!virtqueue_get_buf(vi->cvq, &tmp) &&   
>        !virtqueue_is_broken(vi->cvq))   
>         cpu_relax();
> 
> In my case dpdk broke something - not exactly clear what - and due to that
> following calls through virtnet_send_command ran into this hang.
> Effectively it seems that the buffers didn't get refreshed at all anymore.
> 
> That said the dpdk issue to touch devices that belong to a kernel owned driver
> is fixed, so one could leave the code as is for now.
> Yet I wanted to make you aware in case you would vote for a time or retry based
> upper limit on that loop to avoid hangs - who knows what else might bring it in
> this broken state in a different case.
> 
> Kind Regards,
> Christian Ehrhardt
> 
> P.S.
> Steps to reproduce, backtraces and more data can be found in:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1570195
> A general setup howto for DPDK in KVM guests which is a prereq is at:
> https://help.ubuntu.com/16.04/serverguide/DPDK.html#dpdk-in-guest

True. It's not that clear how to handle this cleanly though.
Ideally we'd have to transmit the next command once the
previous one completes.

-- 
MST

^ permalink raw reply

* Potential hang in virtnet_send_command
From: Christian Ehrhardt @ 2016-04-18 17:07 UTC (permalink / raw)
  To: virtualization; +Cc: Thomas Monjalon, Michael S. Tsirkin


[-- Attachment #1.1: Type: text/plain, Size: 1373 bytes --]

Hi,
I was debugging an issue caused by a bad usage of dpdk.
That is fixed in the meantime and I'll backport that into our code as well.

But along debugging that, I found a potential hang in virtnet_send_command
that my case ran into.
The following code can become an infinite loop:
/* Spin for a response, the kick causes an ioport write, trapping
 * into the hypervisor, so the request should be handled immediately.
 */
while (!virtqueue_get_buf(vi->cvq, &tmp) &&
       !virtqueue_is_broken(vi->cvq))
        cpu_relax();

In my case dpdk broke something - not exactly clear what - and due to that
following calls through virtnet_send_command ran into this hang.
Effectively it seems that the buffers didn't get refreshed at all anymore.

That said the dpdk issue to touch devices that belong to a kernel owned
driver is fixed, so one could leave the code as is for now.
Yet I wanted to make you aware in case you would vote for a time or retry
based upper limit on that loop to avoid hangs - who knows what else might
bring it in this broken state in a different case.

Kind Regards,
Christian Ehrhardt

P.S.
Steps to reproduce, backtraces and more data can be found in:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1570195
A general setup howto for DPDK in KVM guests which is a prereq is at:
https://help.ubuntu.com/16.04/serverguide/DPDK.html#dpdk-in-guest

[-- Attachment #1.2: Type: text/html, Size: 1864 bytes --]

[-- Attachment #2: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox