* Re: [PATCH RFC] fixup! virtio: convert to use DMA api
From: Andy Lutomirski @ 2016-04-19 18:01 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Wei Liu, Andy Lutomirski, qemu-block, Christian Borntraeger,
qemu-devel@nongnu.org Developers, peterx,
linux-kernel@vger.kernel.org, Amit Shah, Stefan Hajnoczi,
kvm list, Paolo Bonzini, Linux Virtualization, David Woodhouse
In-Reply-To: <20160419204907-mutt-send-email-mst@redhat.com>
On Tue, Apr 19, 2016 at 10:49 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Tue, Apr 19, 2016 at 12:26:44PM -0400, David Woodhouse wrote:
>> On Tue, 2016-04-19 at 19:20 +0300, Michael S. Tsirkin wrote:
>> >
>> > > I thought that PLATFORM served that purpose. Woudn't the host
>> > > advertise PLATFORM support and, if the guest doesn't ack it, the host
>> > > device would skip translation? Or is that problematic for vfio?
>> >
>> > Exactly that's problematic for security.
>> > You can't allow guest driver to decide whether device skips security.
>>
>> Right. Because fundamentally, this *isn't* a property of the endpoint
>> device, and doesn't live in virtio itself.
>>
>> It's a property of the platform IOMMU, and lives there.
>
> It's a property of the hypervisor virtio implementation, and lives there.
It is now, but QEMU could, in principle, change the way it thinks
about it so that virtio devices would use the QEMU DMA API but ask
QEMU to pass everything through 1:1. This would be entirely invisible
to guests but would make it be a property of the IOMMU implementation.
At that point, maybe QEMU could find a (platform dependent) way to
tell the guest what's going on.
FWIW, as far as I can tell, PPC and SPARC really could, in principle,
set up 1:1 mappings in the guest so that the virtio devices would work
regardless of whether QEMU is ignoring the IOMMU or not -- I think the
only obstacle is that the PPC and SPARC 1:1 mappings are currectly set
up with an offset. I don't know too much about those platforms, but
presumably the layout could be changed so that 1:1 really was 1:1.
--Andy
^ permalink raw reply
* Re: [PATCH RFC] fixup! virtio: convert to use DMA api
From: Michael S. Tsirkin @ 2016-04-19 20:16 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Wei Liu, Andy Lutomirski, qemu-block, Christian Borntraeger,
qemu-devel@nongnu.org Developers, peterx,
linux-kernel@vger.kernel.org, Amit Shah, Stefan Hajnoczi,
kvm list, Paolo Bonzini, Linux Virtualization, David Woodhouse
In-Reply-To: <CALCETrXsbYj5c3j2em2-jrhBWp-VqRJd900s5WDdVUMCN2aiNg@mail.gmail.com>
On Tue, Apr 19, 2016 at 11:01:38AM -0700, Andy Lutomirski wrote:
> On Tue, Apr 19, 2016 at 10:49 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Tue, Apr 19, 2016 at 12:26:44PM -0400, David Woodhouse wrote:
> >> On Tue, 2016-04-19 at 19:20 +0300, Michael S. Tsirkin wrote:
> >> >
> >> > > I thought that PLATFORM served that purpose. Woudn't the host
> >> > > advertise PLATFORM support and, if the guest doesn't ack it, the host
> >> > > device would skip translation? Or is that problematic for vfio?
> >> >
> >> > Exactly that's problematic for security.
> >> > You can't allow guest driver to decide whether device skips security.
> >>
> >> Right. Because fundamentally, this *isn't* a property of the endpoint
> >> device, and doesn't live in virtio itself.
> >>
> >> It's a property of the platform IOMMU, and lives there.
> >
> > It's a property of the hypervisor virtio implementation, and lives there.
>
> It is now, but QEMU could, in principle, change the way it thinks
> about it so that virtio devices would use the QEMU DMA API but ask
> QEMU to pass everything through 1:1. This would be entirely invisible
> to guests but would make it be a property of the IOMMU implementation.
> At that point, maybe QEMU could find a (platform dependent) way to
> tell the guest what's going on.
>
> FWIW, as far as I can tell, PPC and SPARC really could, in principle,
> set up 1:1 mappings in the guest so that the virtio devices would work
> regardless of whether QEMU is ignoring the IOMMU or not -- I think the
> only obstacle is that the PPC and SPARC 1:1 mappings are currectly set
> up with an offset. I don't know too much about those platforms, but
> presumably the layout could be changed so that 1:1 really was 1:1.
>
> --Andy
Sure. Do you see any reason why the decision to do this can't be
keyed off the virtio feature bit?
--
MST
^ permalink raw reply
* Re: [PATCH RFC] fixup! virtio: convert to use DMA api
From: Andy Lutomirski @ 2016-04-19 20:27 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Wei Liu, Andy Lutomirski, qemu-block, Christian Borntraeger,
qemu-devel@nongnu.org Developers, peterx,
linux-kernel@vger.kernel.org, Amit Shah, Stefan Hajnoczi,
kvm list, Paolo Bonzini, Linux Virtualization, David Woodhouse
In-Reply-To: <20160419231437-mutt-send-email-mst@redhat.com>
On Tue, Apr 19, 2016 at 1:16 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Tue, Apr 19, 2016 at 11:01:38AM -0700, Andy Lutomirski wrote:
>> On Tue, Apr 19, 2016 at 10:49 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> > On Tue, Apr 19, 2016 at 12:26:44PM -0400, David Woodhouse wrote:
>> >> On Tue, 2016-04-19 at 19:20 +0300, Michael S. Tsirkin wrote:
>> >> >
>> >> > > I thought that PLATFORM served that purpose. Woudn't the host
>> >> > > advertise PLATFORM support and, if the guest doesn't ack it, the host
>> >> > > device would skip translation? Or is that problematic for vfio?
>> >> >
>> >> > Exactly that's problematic for security.
>> >> > You can't allow guest driver to decide whether device skips security.
>> >>
>> >> Right. Because fundamentally, this *isn't* a property of the endpoint
>> >> device, and doesn't live in virtio itself.
>> >>
>> >> It's a property of the platform IOMMU, and lives there.
>> >
>> > It's a property of the hypervisor virtio implementation, and lives there.
>>
>> It is now, but QEMU could, in principle, change the way it thinks
>> about it so that virtio devices would use the QEMU DMA API but ask
>> QEMU to pass everything through 1:1. This would be entirely invisible
>> to guests but would make it be a property of the IOMMU implementation.
>> At that point, maybe QEMU could find a (platform dependent) way to
>> tell the guest what's going on.
>>
>> FWIW, as far as I can tell, PPC and SPARC really could, in principle,
>> set up 1:1 mappings in the guest so that the virtio devices would work
>> regardless of whether QEMU is ignoring the IOMMU or not -- I think the
>> only obstacle is that the PPC and SPARC 1:1 mappings are currectly set
>> up with an offset. I don't know too much about those platforms, but
>> presumably the layout could be changed so that 1:1 really was 1:1.
>>
>> --Andy
>
> Sure. Do you see any reason why the decision to do this can't be
> keyed off the virtio feature bit?
I can think of three types of virtio host:
a) virtio always bypasses the IOMMU.
b) virtio never bypasses the IOMMU (unless DMAR tables or similar say
it does) -- i.e. virtio works like any other device.
c) virtio may bypass the IOMMU depending on what the guest asks it to do.
If this is keyed off a virtio feature bit and anyone tries to
implement (c), the vfio is going to have a problem. And, if it's
keyed off a virtio feature bit, then (a) won't work on Xen or similar
setups unless the Xen hypervisor adds a giant and probably unreliable
kludge to support it. Meanwhile, 4.6-rc works fine under Xen on a
default x86 QEMU configuration, and I'd really like to keep it that
way.
What could plausibly work using a virtio feature bit is for a device
to say "hey, I'm a new device and I support the platform-defined IOMMU
mechanism". This bit would be *set* on default IOMMU-less QEMU
configurations and on physical virtio PCI cards. The guest could
operate accordingly. I'm not sure I see a good way for feature
negotiation to work the other direction, though.
PPC and SPARC could only set this bit on emulated devices if they know
that new guest kernels are in use.
--Andy
^ permalink raw reply
* Re: [PATCH RFC] fixup! virtio: convert to use DMA api
From: Michael S. Tsirkin @ 2016-04-19 20:54 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Wei Liu, Andy Lutomirski, qemu-block, Christian Borntraeger,
qemu-devel@nongnu.org Developers, peterx,
linux-kernel@vger.kernel.org, Amit Shah, Stefan Hajnoczi,
kvm list, Paolo Bonzini, Linux Virtualization, David Woodhouse
In-Reply-To: <CALCETrXqrbwvStj9MhKizceze56SAFH_hUJwsQ83NsSRzWiq8Q@mail.gmail.com>
On Tue, Apr 19, 2016 at 01:27:29PM -0700, Andy Lutomirski wrote:
> On Tue, Apr 19, 2016 at 1:16 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Tue, Apr 19, 2016 at 11:01:38AM -0700, Andy Lutomirski wrote:
> >> On Tue, Apr 19, 2016 at 10:49 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> >> > On Tue, Apr 19, 2016 at 12:26:44PM -0400, David Woodhouse wrote:
> >> >> On Tue, 2016-04-19 at 19:20 +0300, Michael S. Tsirkin wrote:
> >> >> >
> >> >> > > I thought that PLATFORM served that purpose. Woudn't the host
> >> >> > > advertise PLATFORM support and, if the guest doesn't ack it, the host
> >> >> > > device would skip translation? Or is that problematic for vfio?
> >> >> >
> >> >> > Exactly that's problematic for security.
> >> >> > You can't allow guest driver to decide whether device skips security.
> >> >>
> >> >> Right. Because fundamentally, this *isn't* a property of the endpoint
> >> >> device, and doesn't live in virtio itself.
> >> >>
> >> >> It's a property of the platform IOMMU, and lives there.
> >> >
> >> > It's a property of the hypervisor virtio implementation, and lives there.
> >>
> >> It is now, but QEMU could, in principle, change the way it thinks
> >> about it so that virtio devices would use the QEMU DMA API but ask
> >> QEMU to pass everything through 1:1. This would be entirely invisible
> >> to guests but would make it be a property of the IOMMU implementation.
> >> At that point, maybe QEMU could find a (platform dependent) way to
> >> tell the guest what's going on.
> >>
> >> FWIW, as far as I can tell, PPC and SPARC really could, in principle,
> >> set up 1:1 mappings in the guest so that the virtio devices would work
> >> regardless of whether QEMU is ignoring the IOMMU or not -- I think the
> >> only obstacle is that the PPC and SPARC 1:1 mappings are currectly set
> >> up with an offset. I don't know too much about those platforms, but
> >> presumably the layout could be changed so that 1:1 really was 1:1.
> >>
> >> --Andy
> >
> > Sure. Do you see any reason why the decision to do this can't be
> > keyed off the virtio feature bit?
>
> I can think of three types of virtio host:
>
> a) virtio always bypasses the IOMMU.
>
> b) virtio never bypasses the IOMMU (unless DMAR tables or similar say
> it does) -- i.e. virtio works like any other device.
>
> c) virtio may bypass the IOMMU depending on what the guest asks it to do.
d) some virtio devices bypass the IOMMU and some don't,
e.g. it's harder to support IOMMU with vhost.
> If this is keyed off a virtio feature bit and anyone tries to
> implement (c), the vfio is going to have a problem. And, if it's
> keyed off a virtio feature bit, then (a) won't work on Xen or similar
> setups unless the Xen hypervisor adds a giant and probably unreliable
> kludge to support it. Meanwhile, 4.6-rc works fine under Xen on a
> default x86 QEMU configuration, and I'd really like to keep it that
> way.
>
> What could plausibly work using a virtio feature bit is for a device
> to say "hey, I'm a new device and I support the platform-defined IOMMU
> mechanism". This bit would be *set* on default IOMMU-less QEMU
> configurations and on physical virtio PCI cards.
And clear on xen.
> The guest could
> operate accordingly. I'm not sure I see a good way for feature
> negotiation to work the other direction, though.
I agree.
> PPC and SPARC could only set this bit on emulated devices if they know
> that new guest kernels are in use.
>
> --Andy
^ permalink raw reply
* Re: [PATCH RFC] fixup! virtio: convert to use DMA api
From: Andy Lutomirski @ 2016-04-19 21:07 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Wei Liu, Andy Lutomirski, qemu-block, Christian Borntraeger,
qemu-devel@nongnu.org Developers, peterx,
linux-kernel@vger.kernel.org, Amit Shah, Stefan Hajnoczi,
kvm list, Paolo Bonzini, Linux Virtualization, David Woodhouse
In-Reply-To: <20160419235212-mutt-send-email-mst@redhat.com>
On Tue, Apr 19, 2016 at 1:54 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Tue, Apr 19, 2016 at 01:27:29PM -0700, Andy Lutomirski wrote:
>> On Tue, Apr 19, 2016 at 1:16 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> > On Tue, Apr 19, 2016 at 11:01:38AM -0700, Andy Lutomirski wrote:
>> >> On Tue, Apr 19, 2016 at 10:49 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> >> > On Tue, Apr 19, 2016 at 12:26:44PM -0400, David Woodhouse wrote:
>> >> >> On Tue, 2016-04-19 at 19:20 +0300, Michael S. Tsirkin wrote:
>> >> >> >
>> >> >> > > I thought that PLATFORM served that purpose. Woudn't the host
>> >> >> > > advertise PLATFORM support and, if the guest doesn't ack it, the host
>> >> >> > > device would skip translation? Or is that problematic for vfio?
>> >> >> >
>> >> >> > Exactly that's problematic for security.
>> >> >> > You can't allow guest driver to decide whether device skips security.
>> >> >>
>> >> >> Right. Because fundamentally, this *isn't* a property of the endpoint
>> >> >> device, and doesn't live in virtio itself.
>> >> >>
>> >> >> It's a property of the platform IOMMU, and lives there.
>> >> >
>> >> > It's a property of the hypervisor virtio implementation, and lives there.
>> >>
>> >> It is now, but QEMU could, in principle, change the way it thinks
>> >> about it so that virtio devices would use the QEMU DMA API but ask
>> >> QEMU to pass everything through 1:1. This would be entirely invisible
>> >> to guests but would make it be a property of the IOMMU implementation.
>> >> At that point, maybe QEMU could find a (platform dependent) way to
>> >> tell the guest what's going on.
>> >>
>> >> FWIW, as far as I can tell, PPC and SPARC really could, in principle,
>> >> set up 1:1 mappings in the guest so that the virtio devices would work
>> >> regardless of whether QEMU is ignoring the IOMMU or not -- I think the
>> >> only obstacle is that the PPC and SPARC 1:1 mappings are currectly set
>> >> up with an offset. I don't know too much about those platforms, but
>> >> presumably the layout could be changed so that 1:1 really was 1:1.
>> >>
>> >> --Andy
>> >
>> > Sure. Do you see any reason why the decision to do this can't be
>> > keyed off the virtio feature bit?
>>
>> I can think of three types of virtio host:
>>
>> a) virtio always bypasses the IOMMU.
>>
>> b) virtio never bypasses the IOMMU (unless DMAR tables or similar say
>> it does) -- i.e. virtio works like any other device.
>>
>> c) virtio may bypass the IOMMU depending on what the guest asks it to do.
>
> d) some virtio devices bypass the IOMMU and some don't,
> e.g. it's harder to support IOMMU with vhost.
>
>
>> If this is keyed off a virtio feature bit and anyone tries to
>> implement (c), the vfio is going to have a problem. And, if it's
>> keyed off a virtio feature bit, then (a) won't work on Xen or similar
>> setups unless the Xen hypervisor adds a giant and probably unreliable
>> kludge to support it. Meanwhile, 4.6-rc works fine under Xen on a
>> default x86 QEMU configuration, and I'd really like to keep it that
>> way.
>>
>> What could plausibly work using a virtio feature bit is for a device
>> to say "hey, I'm a new device and I support the platform-defined IOMMU
>> mechanism". This bit would be *set* on default IOMMU-less QEMU
>> configurations and on physical virtio PCI cards.
>
> And clear on xen.
How? QEMU has no idea that the guest is running Xen.
^ permalink raw reply
* Re: [PATCH] VSOCK: Only check error on skb_recv_datagram when skb is NULL
From: David Miller @ 2016-04-20 0:42 UTC (permalink / raw)
To: jhansen; +Cc: pv-drivers, netdev, gregkh, linux-kernel, virtualization
In-Reply-To: <1461049132-59927-1-git-send-email-jhansen@vmware.com>
From: Jorgen Hansen <jhansen@vmware.com>
Date: Mon, 18 Apr 2016 23:58:52 -0700
> If skb_recv_datagram returns an skb, we should ignore the err
> value returned. Otherwise, datagram receives will return EAGAIN
> when they have to wait for a datagram.
>
> Acked-by: Adit Ranadive <aditr@vmware.com>
> Signed-off-by: Jorgen Hansen <jhansen@vmware.com>
Applied.
^ permalink raw reply
* Re: [PATCH RFC] fixup! virtio: convert to use DMA api
From: Michael S. Tsirkin @ 2016-04-20 13:14 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Wei Liu, Andy Lutomirski, qemu-block, Christian Borntraeger,
qemu-devel@nongnu.org Developers, peterx,
linux-kernel@vger.kernel.org, Amit Shah, Stefan Hajnoczi,
kvm list, Paolo Bonzini, Linux Virtualization, David Woodhouse
In-Reply-To: <CALCETrUif0iAFWFsz-i848yUO9WrXZ7q+Kx3-4wu-dQRiuCweg@mail.gmail.com>
On Tue, Apr 19, 2016 at 02:07:01PM -0700, Andy Lutomirski wrote:
> On Tue, Apr 19, 2016 at 1:54 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Tue, Apr 19, 2016 at 01:27:29PM -0700, Andy Lutomirski wrote:
> >> On Tue, Apr 19, 2016 at 1:16 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> >> > On Tue, Apr 19, 2016 at 11:01:38AM -0700, Andy Lutomirski wrote:
> >> >> On Tue, Apr 19, 2016 at 10:49 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> >> >> > On Tue, Apr 19, 2016 at 12:26:44PM -0400, David Woodhouse wrote:
> >> >> >> On Tue, 2016-04-19 at 19:20 +0300, Michael S. Tsirkin wrote:
> >> >> >> >
> >> >> >> > > I thought that PLATFORM served that purpose. Woudn't the host
> >> >> >> > > advertise PLATFORM support and, if the guest doesn't ack it, the host
> >> >> >> > > device would skip translation? Or is that problematic for vfio?
> >> >> >> >
> >> >> >> > Exactly that's problematic for security.
> >> >> >> > You can't allow guest driver to decide whether device skips security.
> >> >> >>
> >> >> >> Right. Because fundamentally, this *isn't* a property of the endpoint
> >> >> >> device, and doesn't live in virtio itself.
> >> >> >>
> >> >> >> It's a property of the platform IOMMU, and lives there.
> >> >> >
> >> >> > It's a property of the hypervisor virtio implementation, and lives there.
> >> >>
> >> >> It is now, but QEMU could, in principle, change the way it thinks
> >> >> about it so that virtio devices would use the QEMU DMA API but ask
> >> >> QEMU to pass everything through 1:1. This would be entirely invisible
> >> >> to guests but would make it be a property of the IOMMU implementation.
> >> >> At that point, maybe QEMU could find a (platform dependent) way to
> >> >> tell the guest what's going on.
> >> >>
> >> >> FWIW, as far as I can tell, PPC and SPARC really could, in principle,
> >> >> set up 1:1 mappings in the guest so that the virtio devices would work
> >> >> regardless of whether QEMU is ignoring the IOMMU or not -- I think the
> >> >> only obstacle is that the PPC and SPARC 1:1 mappings are currectly set
> >> >> up with an offset. I don't know too much about those platforms, but
> >> >> presumably the layout could be changed so that 1:1 really was 1:1.
> >> >>
> >> >> --Andy
> >> >
> >> > Sure. Do you see any reason why the decision to do this can't be
> >> > keyed off the virtio feature bit?
> >>
> >> I can think of three types of virtio host:
> >>
> >> a) virtio always bypasses the IOMMU.
> >>
> >> b) virtio never bypasses the IOMMU (unless DMAR tables or similar say
> >> it does) -- i.e. virtio works like any other device.
> >>
> >> c) virtio may bypass the IOMMU depending on what the guest asks it to do.
> >
> > d) some virtio devices bypass the IOMMU and some don't,
> > e.g. it's harder to support IOMMU with vhost.
> >
> >
> >> If this is keyed off a virtio feature bit and anyone tries to
> >> implement (c), the vfio is going to have a problem. And, if it's
> >> keyed off a virtio feature bit, then (a) won't work on Xen or similar
> >> setups unless the Xen hypervisor adds a giant and probably unreliable
> >> kludge to support it. Meanwhile, 4.6-rc works fine under Xen on a
> >> default x86 QEMU configuration, and I'd really like to keep it that
> >> way.
> >>
> >> What could plausibly work using a virtio feature bit is for a device
> >> to say "hey, I'm a new device and I support the platform-defined IOMMU
> >> mechanism". This bit would be *set* on default IOMMU-less QEMU
> >> configurations and on physical virtio PCI cards.
> >
> > And clear on xen.
>
> How? QEMU has no idea that the guest is running Xen.
I was under impression xen_enabled() is true in QEMU.
Am I wrong?
--
MST
^ permalink raw reply
* Re: [PATCH RFC] fixup! virtio: convert to use DMA api
From: Andy Lutomirski @ 2016-04-20 15:43 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Wei Liu, qemu-block, kvm list, Amit Shah,
qemu-devel@nongnu.org Developers, peterx, Linux Virtualization,
Christian Borntraeger, Stefan Hajnoczi, Paolo Bonzini,
David Woodhouse, linux-kernel@vger.kernel.org
In-Reply-To: <20160420161338-mutt-send-email-mst@redhat.com>
On Apr 20, 2016 6:14 AM, "Michael S. Tsirkin" <mst@redhat.com> wrote:
>
> On Tue, Apr 19, 2016 at 02:07:01PM -0700, Andy Lutomirski wrote:
> > On Tue, Apr 19, 2016 at 1:54 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > On Tue, Apr 19, 2016 at 01:27:29PM -0700, Andy Lutomirski wrote:
> > >> On Tue, Apr 19, 2016 at 1:16 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > >> > On Tue, Apr 19, 2016 at 11:01:38AM -0700, Andy Lutomirski wrote:
> > >> >> On Tue, Apr 19, 2016 at 10:49 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > >> >> > On Tue, Apr 19, 2016 at 12:26:44PM -0400, David Woodhouse wrote:
> > >> >> >> On Tue, 2016-04-19 at 19:20 +0300, Michael S. Tsirkin wrote:
> > >> >> >> >
> > >> >> >> > > I thought that PLATFORM served that purpose. Woudn't the host
> > >> >> >> > > advertise PLATFORM support and, if the guest doesn't ack it, the host
> > >> >> >> > > device would skip translation? Or is that problematic for vfio?
> > >> >> >> >
> > >> >> >> > Exactly that's problematic for security.
> > >> >> >> > You can't allow guest driver to decide whether device skips security.
> > >> >> >>
> > >> >> >> Right. Because fundamentally, this *isn't* a property of the endpoint
> > >> >> >> device, and doesn't live in virtio itself.
> > >> >> >>
> > >> >> >> It's a property of the platform IOMMU, and lives there.
> > >> >> >
> > >> >> > It's a property of the hypervisor virtio implementation, and lives there.
> > >> >>
> > >> >> It is now, but QEMU could, in principle, change the way it thinks
> > >> >> about it so that virtio devices would use the QEMU DMA API but ask
> > >> >> QEMU to pass everything through 1:1. This would be entirely invisible
> > >> >> to guests but would make it be a property of the IOMMU implementation.
> > >> >> At that point, maybe QEMU could find a (platform dependent) way to
> > >> >> tell the guest what's going on.
> > >> >>
> > >> >> FWIW, as far as I can tell, PPC and SPARC really could, in principle,
> > >> >> set up 1:1 mappings in the guest so that the virtio devices would work
> > >> >> regardless of whether QEMU is ignoring the IOMMU or not -- I think the
> > >> >> only obstacle is that the PPC and SPARC 1:1 mappings are currectly set
> > >> >> up with an offset. I don't know too much about those platforms, but
> > >> >> presumably the layout could be changed so that 1:1 really was 1:1.
> > >> >>
> > >> >> --Andy
> > >> >
> > >> > Sure. Do you see any reason why the decision to do this can't be
> > >> > keyed off the virtio feature bit?
> > >>
> > >> I can think of three types of virtio host:
> > >>
> > >> a) virtio always bypasses the IOMMU.
> > >>
> > >> b) virtio never bypasses the IOMMU (unless DMAR tables or similar say
> > >> it does) -- i.e. virtio works like any other device.
> > >>
> > >> c) virtio may bypass the IOMMU depending on what the guest asks it to do.
> > >
> > > d) some virtio devices bypass the IOMMU and some don't,
> > > e.g. it's harder to support IOMMU with vhost.
> > >
> > >
> > >> If this is keyed off a virtio feature bit and anyone tries to
> > >> implement (c), the vfio is going to have a problem. And, if it's
> > >> keyed off a virtio feature bit, then (a) won't work on Xen or similar
> > >> setups unless the Xen hypervisor adds a giant and probably unreliable
> > >> kludge to support it. Meanwhile, 4.6-rc works fine under Xen on a
> > >> default x86 QEMU configuration, and I'd really like to keep it that
> > >> way.
> > >>
> > >> What could plausibly work using a virtio feature bit is for a device
> > >> to say "hey, I'm a new device and I support the platform-defined IOMMU
> > >> mechanism". This bit would be *set* on default IOMMU-less QEMU
> > >> configurations and on physical virtio PCI cards.
> > >
> > > And clear on xen.
> >
> > How? QEMU has no idea that the guest is running Xen.
>
> I was under impression xen_enabled() is true in QEMU.
> Am I wrong?
I'd be rather surprised, given that QEMU would have to inspect the
guest kernel to figure it out. I'm talking about Xen under QEMU. For
example, if you feed QEMU a guest disk image that contains Fedora with
the xen packages installed, you can boot it and get a grub menu. If
you ask grub to boot Xen, you get Xen. If you ask grub to boot Linux
directly, you don't get Xen.
I assume xen_enabled is for QEMU under Xen, i.e. QEMU, running under
Xen, supplying emulated devices to a Xen domU guest. Since QEMU is
seeing the guest address space directly, this should be much the same
as QEMU !xen_enabled -- if you boot plain Linux, everything works, but
if you do Xen -> QEMU -> HVM guest running Xen PV -> Linux, then
virtio drivers in the Xen PV Linux guest need to translate addresses.
--Andy
>
> --
> MST
^ permalink raw reply
* Re: [PATCH v4 6/6] hwmon: use smp_call_on_cpu() for dell-smm i8k
From: Pali Rohár @ 2016-04-21 10:57 UTC (permalink / raw)
To: Guenter Roeck
Cc: Juergen Gross, x86, jeremy, jdelvare, peterz, akataria,
linux-kernel, virtualization, chrisw, mingo, david.vrabel,
Douglas_Warzecha, hpa, xen-devel, boris.ostrovsky, tglx
In-Reply-To: <201604052131.52765@pali>
On Tuesday 05 April 2016 21:31:52 Pali Rohár wrote:
> On Tuesday 05 April 2016 16:54:14 Guenter Roeck wrote:
> > On Tue, Apr 05, 2016 at 07:10:07AM +0200, Juergen Gross wrote:
> > > Use the smp_call_on_cpu() function to call system management
> > > mode on cpu 0.
> > > Make call secure by adding get_online_cpus() to avoid e.g. suspend
> > > resume cycles in between.
> > >
> > > Signed-off-by: Juergen Gross <jgross@suse.com>
> > > ---
> > > V4: add call to get_online_cpus()
> >
> > Pali, any chance to test this ?
>
> I can test it, but just on machine where (probably) smm calls can be
> send from any cpu... Need some time for testing and I believe I can do
> that at the end of the week.
Sorry I had absolutely no more free time last weekend :-( And same
prediction is for this weekend and also next one...
--
Pali Rohár
pali.rohar@gmail.com
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH v4 6/6] hwmon: use smp_call_on_cpu() for dell-smm i8k
From: Juergen Gross @ 2016-04-21 13:12 UTC (permalink / raw)
To: Pali Rohár, Guenter Roeck
Cc: x86, jeremy, jdelvare, peterz, akataria, linux-kernel,
virtualization, chrisw, mingo, david.vrabel, Douglas_Warzecha,
hpa, xen-devel, boris.ostrovsky, tglx
In-Reply-To: <20160421105724.GK29406@pali>
On 21/04/16 12:57, Pali Rohár wrote:
> On Tuesday 05 April 2016 21:31:52 Pali Rohár wrote:
>> On Tuesday 05 April 2016 16:54:14 Guenter Roeck wrote:
>>> On Tue, Apr 05, 2016 at 07:10:07AM +0200, Juergen Gross wrote:
>>>> Use the smp_call_on_cpu() function to call system management
>>>> mode on cpu 0.
>>>> Make call secure by adding get_online_cpus() to avoid e.g. suspend
>>>> resume cycles in between.
>>>>
>>>> Signed-off-by: Juergen Gross <jgross@suse.com>
>>>> ---
>>>> V4: add call to get_online_cpus()
>>>
>>> Pali, any chance to test this ?
>>
>> I can test it, but just on machine where (probably) smm calls can be
>> send from any cpu... Need some time for testing and I believe I can do
>> that at the end of the week.
>
> Sorry I had absolutely no more free time last weekend :-( And same
> prediction is for this weekend and also next one...
Pali, I've got a Dell laptop (Latitude E6440) here. Would this device be
okay for a test? What would you do for testing? In case you can give me
some hints how to do a sensible test I'd do it.
I've verified by adding a printk() to smp_call_on_cpu() that at least
one of the modified drivers has been used during system boot.
Juergen
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH v4 6/6] hwmon: use smp_call_on_cpu() for dell-smm i8k
From: Pali Rohár @ 2016-04-21 13:27 UTC (permalink / raw)
To: Juergen Gross
Cc: x86, jeremy, jdelvare, peterz, akataria, linux-kernel,
virtualization, chrisw, mingo, david.vrabel, Douglas_Warzecha,
hpa, xen-devel, boris.ostrovsky, tglx, Guenter Roeck
In-Reply-To: <5718D1D4.3040309@suse.com>
On Thursday 21 April 2016 15:12:52 Juergen Gross wrote:
> On 21/04/16 12:57, Pali Rohár wrote:
> > On Tuesday 05 April 2016 21:31:52 Pali Rohár wrote:
> >> On Tuesday 05 April 2016 16:54:14 Guenter Roeck wrote:
> >>> On Tue, Apr 05, 2016 at 07:10:07AM +0200, Juergen Gross wrote:
> >>>> Use the smp_call_on_cpu() function to call system management
> >>>> mode on cpu 0.
> >>>> Make call secure by adding get_online_cpus() to avoid e.g. suspend
> >>>> resume cycles in between.
> >>>>
> >>>> Signed-off-by: Juergen Gross <jgross@suse.com>
> >>>> ---
> >>>> V4: add call to get_online_cpus()
> >>>
> >>> Pali, any chance to test this ?
> >>
> >> I can test it, but just on machine where (probably) smm calls can be
> >> send from any cpu... Need some time for testing and I believe I can do
> >> that at the end of the week.
> >
> > Sorry I had absolutely no more free time last weekend :-( And same
> > prediction is for this weekend and also next one...
>
> Pali, I've got a Dell laptop (Latitude E6440) here. Would this device be
> okay for a test?
Hi!
Proper regression test should check if this patch does not break any
function or drivers dependent on dcdbas.ko. And should be done on both
notebook devices: which needs to issue that smm call on cpu 0 and also
on which it is not needed.
Some notebooks which needs smm call to issued from cpu 0 can be found in
git commit messages of i8k, dell-laptop or dcdbas kernel drivers.
> What would you do for testing? In case you can give me
> some hints how to do a sensible test I'd do it.
Test e.g. dell-laptop.ko driver. It provides /sys interface for changing
keyboard backlight or changing rfkill switches (bluetooth wifi).
Also test tools from libsmbios (userspace) package.
There must be no difference in output/functionality with or without your
patches.
> I've verified by adding a printk() to smp_call_on_cpu() that at least
> one of the modified drivers has been used during system boot.
Also you can patch i8k/dcdbas smm function to print cpu number on which
is code running (to verify that it was really called on cpu 0 as
needed).
--
Pali Rohár
pali.rohar@gmail.com
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* [PATCH V2 RFC] fixup! virtio: convert to use DMA api
From: Michael S. Tsirkin @ 2016-04-21 13:43 UTC (permalink / raw)
To: qemu-devel, linux-kernel
Cc: Wei Liu, Andy Lutomirski, qemu-block, Christian Borntraeger,
peterx, virtualization, Amit Shah, Stefan Hajnoczi, kvm, pbonzini,
David Woodhouse
This adds a flag to enable/disable bypassing the IOMMU by
virtio devices.
This is on top of patch
http://article.gmane.org/gmane.comp.emulators.qemu/403467
virtio: convert to use DMA api
Tested with patchset
http://article.gmane.org/gmane.linux.kernel.virtualization/27545
virtio-pci: iommu support (note: bit number has been kept at 34
intentionally to match posted guest code. a non-RFC version will
renumber bits to be contigious).
changes from v1:
drop PASSTHROUGH flag
The interaction between virtio and DMA API is messy.
On most systems with virtio, physical addresses match bus addresses,
and it doesn't particularly matter whether we use the DMA API.
On some systems, including Xen and any system with a physical device
that speaks virtio behind a physical IOMMU, we must use the DMA API
for virtio DMA to work at all.
Add a feature bit to detect that: VIRTIO_F_IOMMU_PLATFORM.
If not there, we preserve historic behavior and bypass the DMA
API unless within Xen guest. This is actually required for
systems, including SPARC and PPC64, where virtio-pci devices are
enumerated as though they are behind an IOMMU, but the virtio host
ignores the IOMMU, so we must either pretend that the IOMMU isn't
there or somehow map everything as the identity.
Re: non-virtio devices.
It turns out that on old QEMU hosts, only emulated devices which were
part of QEMU use the IOMMU. Should we want to bypass the IOMMU for such
devices *only*, it would be rather easy to detect them by looking at
subsystem vendor and device ID. Thus, no new interfaces are required
except for virtio which always uses the same subsystem vendor and device ID.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
include/hw/virtio/virtio-access.h | 3 ++-
include/hw/virtio/virtio.h | 4 +++-
include/standard-headers/linux/virtio_config.h | 2 ++
3 files changed, 7 insertions(+), 2 deletions(-)
diff --git a/include/hw/virtio/virtio-access.h b/include/hw/virtio/virtio-access.h
index 967cc75..bb6f34e 100644
--- a/include/hw/virtio/virtio-access.h
+++ b/include/hw/virtio/virtio-access.h
@@ -23,7 +23,8 @@ static inline AddressSpace *virtio_get_dma_as(VirtIODevice *vdev)
BusState *qbus = qdev_get_parent_bus(DEVICE(vdev));
VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(qbus);
- if (k->get_dma_as) {
+ if (virtio_host_has_feature(vdev, VIRTIO_F_IOMMU_PLATFORM) &&
+ k->get_dma_as) {
return k->get_dma_as(qbus->parent);
}
return &address_space_memory;
diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index b12faa9..44f3788 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -228,7 +228,9 @@ typedef struct VirtIORNGConf VirtIORNGConf;
DEFINE_PROP_BIT64("notify_on_empty", _state, _field, \
VIRTIO_F_NOTIFY_ON_EMPTY, true), \
DEFINE_PROP_BIT64("any_layout", _state, _field, \
- VIRTIO_F_ANY_LAYOUT, true)
+ VIRTIO_F_ANY_LAYOUT, true), \
+ DEFINE_PROP_BIT64("iommu_platform", _state, _field, \
+ VIRTIO_F_IOMMU_PLATFORM, false)
hwaddr virtio_queue_get_desc_addr(VirtIODevice *vdev, int n);
hwaddr virtio_queue_get_avail_addr(VirtIODevice *vdev, int n);
diff --git a/include/standard-headers/linux/virtio_config.h b/include/standard-headers/linux/virtio_config.h
index bcc445b..3fcfbb1 100644
--- a/include/standard-headers/linux/virtio_config.h
+++ b/include/standard-headers/linux/virtio_config.h
@@ -61,4 +61,6 @@
/* v1.0 compliant. */
#define VIRTIO_F_VERSION_1 32
+/* Do not bypass the IOMMU (if configured) */
+#define VIRTIO_F_IOMMU_PLATFORM 34
#endif /* _LINUX_VIRTIO_CONFIG_H */
--
MST
^ permalink raw reply related
* Re: [PATCH V2 RFC] fixup! virtio: convert to use DMA api
From: Wei Liu @ 2016-04-21 13:54 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Anthony PERARD, Wei Liu, Andy Lutomirski, qemu-block,
Christian Borntraeger, Stefano Stabellini, qemu-devel, peterx,
linux-kernel, Amit Shah, Stefan Hajnoczi, kvm, pbonzini,
virtualization, David Woodhouse
In-Reply-To: <1461245745-6710-1-git-send-email-mst@redhat.com>
Add Stefano and Anthony
On Thu, Apr 21, 2016 at 04:43:45PM +0300, Michael S. Tsirkin wrote:
> This adds a flag to enable/disable bypassing the IOMMU by
> virtio devices.
>
> This is on top of patch
> http://article.gmane.org/gmane.comp.emulators.qemu/403467
> virtio: convert to use DMA api
>
> Tested with patchset
> http://article.gmane.org/gmane.linux.kernel.virtualization/27545
> virtio-pci: iommu support (note: bit number has been kept at 34
> intentionally to match posted guest code. a non-RFC version will
> renumber bits to be contigious).
>
> changes from v1:
> drop PASSTHROUGH flag
>
> The interaction between virtio and DMA API is messy.
>
> On most systems with virtio, physical addresses match bus addresses,
> and it doesn't particularly matter whether we use the DMA API.
>
> On some systems, including Xen and any system with a physical device
> that speaks virtio behind a physical IOMMU, we must use the DMA API
> for virtio DMA to work at all.
>
> Add a feature bit to detect that: VIRTIO_F_IOMMU_PLATFORM.
>
> If not there, we preserve historic behavior and bypass the DMA
> API unless within Xen guest. This is actually required for
> systems, including SPARC and PPC64, where virtio-pci devices are
> enumerated as though they are behind an IOMMU, but the virtio host
> ignores the IOMMU, so we must either pretend that the IOMMU isn't
> there or somehow map everything as the identity.
>
> Re: non-virtio devices.
>
> It turns out that on old QEMU hosts, only emulated devices which were
> part of QEMU use the IOMMU. Should we want to bypass the IOMMU for such
> devices *only*, it would be rather easy to detect them by looking at
> subsystem vendor and device ID. Thus, no new interfaces are required
> except for virtio which always uses the same subsystem vendor and device ID.
>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
> include/hw/virtio/virtio-access.h | 3 ++-
> include/hw/virtio/virtio.h | 4 +++-
> include/standard-headers/linux/virtio_config.h | 2 ++
> 3 files changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/include/hw/virtio/virtio-access.h b/include/hw/virtio/virtio-access.h
> index 967cc75..bb6f34e 100644
> --- a/include/hw/virtio/virtio-access.h
> +++ b/include/hw/virtio/virtio-access.h
> @@ -23,7 +23,8 @@ static inline AddressSpace *virtio_get_dma_as(VirtIODevice *vdev)
> BusState *qbus = qdev_get_parent_bus(DEVICE(vdev));
> VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(qbus);
>
> - if (k->get_dma_as) {
> + if (virtio_host_has_feature(vdev, VIRTIO_F_IOMMU_PLATFORM) &&
> + k->get_dma_as) {
> return k->get_dma_as(qbus->parent);
> }
> return &address_space_memory;
> diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> index b12faa9..44f3788 100644
> --- a/include/hw/virtio/virtio.h
> +++ b/include/hw/virtio/virtio.h
> @@ -228,7 +228,9 @@ typedef struct VirtIORNGConf VirtIORNGConf;
> DEFINE_PROP_BIT64("notify_on_empty", _state, _field, \
> VIRTIO_F_NOTIFY_ON_EMPTY, true), \
> DEFINE_PROP_BIT64("any_layout", _state, _field, \
> - VIRTIO_F_ANY_LAYOUT, true)
> + VIRTIO_F_ANY_LAYOUT, true), \
> + DEFINE_PROP_BIT64("iommu_platform", _state, _field, \
> + VIRTIO_F_IOMMU_PLATFORM, false)
>
> hwaddr virtio_queue_get_desc_addr(VirtIODevice *vdev, int n);
> hwaddr virtio_queue_get_avail_addr(VirtIODevice *vdev, int n);
> diff --git a/include/standard-headers/linux/virtio_config.h b/include/standard-headers/linux/virtio_config.h
> index bcc445b..3fcfbb1 100644
> --- a/include/standard-headers/linux/virtio_config.h
> +++ b/include/standard-headers/linux/virtio_config.h
> @@ -61,4 +61,6 @@
> /* v1.0 compliant. */
> #define VIRTIO_F_VERSION_1 32
>
> +/* Do not bypass the IOMMU (if configured) */
> +#define VIRTIO_F_IOMMU_PLATFORM 34
> #endif /* _LINUX_VIRTIO_CONFIG_H */
> --
> MST
^ permalink raw reply
* Re: [PATCH V2 RFC] fixup! virtio: convert to use DMA api
From: Stefan Hajnoczi @ 2016-04-21 14:56 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Wei Liu, kvm, qemu-block, Christian Borntraeger, qemu-devel,
peterx, linux-kernel, Amit Shah, Andy Lutomirski, pbonzini,
virtualization, David Woodhouse
In-Reply-To: <1461245745-6710-1-git-send-email-mst@redhat.com>
[-- Attachment #1.1: Type: text/plain, Size: 3778 bytes --]
On Thu, Apr 21, 2016 at 04:43:45PM +0300, Michael S. Tsirkin wrote:
> This adds a flag to enable/disable bypassing the IOMMU by
> virtio devices.
>
> This is on top of patch
> http://article.gmane.org/gmane.comp.emulators.qemu/403467
> virtio: convert to use DMA api
>
> Tested with patchset
> http://article.gmane.org/gmane.linux.kernel.virtualization/27545
> virtio-pci: iommu support (note: bit number has been kept at 34
> intentionally to match posted guest code. a non-RFC version will
> renumber bits to be contigious).
>
> changes from v1:
> drop PASSTHROUGH flag
>
> The interaction between virtio and DMA API is messy.
>
> On most systems with virtio, physical addresses match bus addresses,
> and it doesn't particularly matter whether we use the DMA API.
>
> On some systems, including Xen and any system with a physical device
> that speaks virtio behind a physical IOMMU, we must use the DMA API
> for virtio DMA to work at all.
>
> Add a feature bit to detect that: VIRTIO_F_IOMMU_PLATFORM.
>
> If not there, we preserve historic behavior and bypass the DMA
> API unless within Xen guest. This is actually required for
> systems, including SPARC and PPC64, where virtio-pci devices are
> enumerated as though they are behind an IOMMU, but the virtio host
> ignores the IOMMU, so we must either pretend that the IOMMU isn't
> there or somehow map everything as the identity.
>
> Re: non-virtio devices.
>
> It turns out that on old QEMU hosts, only emulated devices which were
> part of QEMU use the IOMMU. Should we want to bypass the IOMMU for such
> devices *only*, it would be rather easy to detect them by looking at
> subsystem vendor and device ID. Thus, no new interfaces are required
> except for virtio which always uses the same subsystem vendor and device ID.
>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
> include/hw/virtio/virtio-access.h | 3 ++-
> include/hw/virtio/virtio.h | 4 +++-
> include/standard-headers/linux/virtio_config.h | 2 ++
> 3 files changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/include/hw/virtio/virtio-access.h b/include/hw/virtio/virtio-access.h
> index 967cc75..bb6f34e 100644
> --- a/include/hw/virtio/virtio-access.h
> +++ b/include/hw/virtio/virtio-access.h
> @@ -23,7 +23,8 @@ static inline AddressSpace *virtio_get_dma_as(VirtIODevice *vdev)
> BusState *qbus = qdev_get_parent_bus(DEVICE(vdev));
> VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(qbus);
>
> - if (k->get_dma_as) {
> + if (virtio_host_has_feature(vdev, VIRTIO_F_IOMMU_PLATFORM) &&
> + k->get_dma_as) {
> return k->get_dma_as(qbus->parent);
> }
> return &address_space_memory;
> diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> index b12faa9..44f3788 100644
> --- a/include/hw/virtio/virtio.h
> +++ b/include/hw/virtio/virtio.h
> @@ -228,7 +228,9 @@ typedef struct VirtIORNGConf VirtIORNGConf;
> DEFINE_PROP_BIT64("notify_on_empty", _state, _field, \
> VIRTIO_F_NOTIFY_ON_EMPTY, true), \
> DEFINE_PROP_BIT64("any_layout", _state, _field, \
> - VIRTIO_F_ANY_LAYOUT, true)
> + VIRTIO_F_ANY_LAYOUT, true), \
> + DEFINE_PROP_BIT64("iommu_platform", _state, _field, \
> + VIRTIO_F_IOMMU_PLATFORM, false)
Looks like the impact of this patch is that users who relied on
k->get_dma_as today may now have to explicitly add iommu_platform=on.
Are there any such users (e.g. Xen)?
Instead of breaking the command-line for these users you could invert
the flag's meaning ("iommu_bypass=on") and set it in the SPARC/PPC
machine types.
Stefan
[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]
[-- Attachment #2: Type: text/plain, Size: 183 bytes --]
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH V2 RFC] fixup! virtio: convert to use DMA api
From: Michael S. Tsirkin @ 2016-04-21 15:11 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: Wei Liu, kvm, qemu-block, Christian Borntraeger, qemu-devel,
peterx, linux-kernel, Amit Shah, Andy Lutomirski, pbonzini,
virtualization, David Woodhouse
In-Reply-To: <20160421145653.GB11814@stefanha-x1.localdomain>
On Thu, Apr 21, 2016 at 03:56:53PM +0100, Stefan Hajnoczi wrote:
> On Thu, Apr 21, 2016 at 04:43:45PM +0300, Michael S. Tsirkin wrote:
> > This adds a flag to enable/disable bypassing the IOMMU by
> > virtio devices.
> >
> > This is on top of patch
> > http://article.gmane.org/gmane.comp.emulators.qemu/403467
> > virtio: convert to use DMA api
> >
> > Tested with patchset
> > http://article.gmane.org/gmane.linux.kernel.virtualization/27545
> > virtio-pci: iommu support (note: bit number has been kept at 34
> > intentionally to match posted guest code. a non-RFC version will
> > renumber bits to be contigious).
> >
> > changes from v1:
> > drop PASSTHROUGH flag
> >
> > The interaction between virtio and DMA API is messy.
> >
> > On most systems with virtio, physical addresses match bus addresses,
> > and it doesn't particularly matter whether we use the DMA API.
> >
> > On some systems, including Xen and any system with a physical device
> > that speaks virtio behind a physical IOMMU, we must use the DMA API
> > for virtio DMA to work at all.
> >
> > Add a feature bit to detect that: VIRTIO_F_IOMMU_PLATFORM.
> >
> > If not there, we preserve historic behavior and bypass the DMA
> > API unless within Xen guest. This is actually required for
> > systems, including SPARC and PPC64, where virtio-pci devices are
> > enumerated as though they are behind an IOMMU, but the virtio host
> > ignores the IOMMU, so we must either pretend that the IOMMU isn't
> > there or somehow map everything as the identity.
> >
> > Re: non-virtio devices.
> >
> > It turns out that on old QEMU hosts, only emulated devices which were
> > part of QEMU use the IOMMU. Should we want to bypass the IOMMU for such
> > devices *only*, it would be rather easy to detect them by looking at
> > subsystem vendor and device ID. Thus, no new interfaces are required
> > except for virtio which always uses the same subsystem vendor and device ID.
> >
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > ---
> > include/hw/virtio/virtio-access.h | 3 ++-
> > include/hw/virtio/virtio.h | 4 +++-
> > include/standard-headers/linux/virtio_config.h | 2 ++
> > 3 files changed, 7 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/hw/virtio/virtio-access.h b/include/hw/virtio/virtio-access.h
> > index 967cc75..bb6f34e 100644
> > --- a/include/hw/virtio/virtio-access.h
> > +++ b/include/hw/virtio/virtio-access.h
> > @@ -23,7 +23,8 @@ static inline AddressSpace *virtio_get_dma_as(VirtIODevice *vdev)
> > BusState *qbus = qdev_get_parent_bus(DEVICE(vdev));
> > VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(qbus);
> >
> > - if (k->get_dma_as) {
> > + if (virtio_host_has_feature(vdev, VIRTIO_F_IOMMU_PLATFORM) &&
> > + k->get_dma_as) {
> > return k->get_dma_as(qbus->parent);
> > }
> > return &address_space_memory;
> > diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> > index b12faa9..44f3788 100644
> > --- a/include/hw/virtio/virtio.h
> > +++ b/include/hw/virtio/virtio.h
> > @@ -228,7 +228,9 @@ typedef struct VirtIORNGConf VirtIORNGConf;
> > DEFINE_PROP_BIT64("notify_on_empty", _state, _field, \
> > VIRTIO_F_NOTIFY_ON_EMPTY, true), \
> > DEFINE_PROP_BIT64("any_layout", _state, _field, \
> > - VIRTIO_F_ANY_LAYOUT, true)
> > + VIRTIO_F_ANY_LAYOUT, true), \
> > + DEFINE_PROP_BIT64("iommu_platform", _state, _field, \
> > + VIRTIO_F_IOMMU_PLATFORM, false)
>
> Looks like the impact of this patch is that users who relied on
> k->get_dma_as today may now have to explicitly add iommu_platform=on.
> Are there any such users (e.g. Xen)?
No because upstream this is ignored. This is an incremental patch
on top of Jason's one.
> Instead of breaking the command-line for these users you could invert
> the flag's meaning ("iommu_bypass=on") and set it in the SPARC/PPC
> machine types.
>
> Stefan
I hope I made it clear that there are no such users.
^ permalink raw reply
* Re: [PATCH V2 RFC] fixup! virtio: convert to use DMA api
From: Stefan Hajnoczi @ 2016-04-22 9:33 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Wei Liu, kvm, qemu-block, David Woodhouse, qemu-devel, peterx,
linux-kernel, Christian Borntraeger, Stefan Hajnoczi, Amit Shah,
pbonzini, virtualization, Andy Lutomirski
In-Reply-To: <20160421181102-mutt-send-email-mst@redhat.com>
[-- Attachment #1.1: Type: text/plain, Size: 4487 bytes --]
On Thu, Apr 21, 2016 at 06:11:40PM +0300, Michael S. Tsirkin wrote:
> On Thu, Apr 21, 2016 at 03:56:53PM +0100, Stefan Hajnoczi wrote:
> > On Thu, Apr 21, 2016 at 04:43:45PM +0300, Michael S. Tsirkin wrote:
> > > This adds a flag to enable/disable bypassing the IOMMU by
> > > virtio devices.
> > >
> > > This is on top of patch
> > > http://article.gmane.org/gmane.comp.emulators.qemu/403467
> > > virtio: convert to use DMA api
> > >
> > > Tested with patchset
> > > http://article.gmane.org/gmane.linux.kernel.virtualization/27545
> > > virtio-pci: iommu support (note: bit number has been kept at 34
> > > intentionally to match posted guest code. a non-RFC version will
> > > renumber bits to be contigious).
> > >
> > > changes from v1:
> > > drop PASSTHROUGH flag
> > >
> > > The interaction between virtio and DMA API is messy.
> > >
> > > On most systems with virtio, physical addresses match bus addresses,
> > > and it doesn't particularly matter whether we use the DMA API.
> > >
> > > On some systems, including Xen and any system with a physical device
> > > that speaks virtio behind a physical IOMMU, we must use the DMA API
> > > for virtio DMA to work at all.
> > >
> > > Add a feature bit to detect that: VIRTIO_F_IOMMU_PLATFORM.
> > >
> > > If not there, we preserve historic behavior and bypass the DMA
> > > API unless within Xen guest. This is actually required for
> > > systems, including SPARC and PPC64, where virtio-pci devices are
> > > enumerated as though they are behind an IOMMU, but the virtio host
> > > ignores the IOMMU, so we must either pretend that the IOMMU isn't
> > > there or somehow map everything as the identity.
> > >
> > > Re: non-virtio devices.
> > >
> > > It turns out that on old QEMU hosts, only emulated devices which were
> > > part of QEMU use the IOMMU. Should we want to bypass the IOMMU for such
> > > devices *only*, it would be rather easy to detect them by looking at
> > > subsystem vendor and device ID. Thus, no new interfaces are required
> > > except for virtio which always uses the same subsystem vendor and device ID.
> > >
> > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > ---
> > > include/hw/virtio/virtio-access.h | 3 ++-
> > > include/hw/virtio/virtio.h | 4 +++-
> > > include/standard-headers/linux/virtio_config.h | 2 ++
> > > 3 files changed, 7 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/include/hw/virtio/virtio-access.h b/include/hw/virtio/virtio-access.h
> > > index 967cc75..bb6f34e 100644
> > > --- a/include/hw/virtio/virtio-access.h
> > > +++ b/include/hw/virtio/virtio-access.h
> > > @@ -23,7 +23,8 @@ static inline AddressSpace *virtio_get_dma_as(VirtIODevice *vdev)
> > > BusState *qbus = qdev_get_parent_bus(DEVICE(vdev));
> > > VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(qbus);
> > >
> > > - if (k->get_dma_as) {
> > > + if (virtio_host_has_feature(vdev, VIRTIO_F_IOMMU_PLATFORM) &&
> > > + k->get_dma_as) {
> > > return k->get_dma_as(qbus->parent);
> > > }
> > > return &address_space_memory;
> > > diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> > > index b12faa9..44f3788 100644
> > > --- a/include/hw/virtio/virtio.h
> > > +++ b/include/hw/virtio/virtio.h
> > > @@ -228,7 +228,9 @@ typedef struct VirtIORNGConf VirtIORNGConf;
> > > DEFINE_PROP_BIT64("notify_on_empty", _state, _field, \
> > > VIRTIO_F_NOTIFY_ON_EMPTY, true), \
> > > DEFINE_PROP_BIT64("any_layout", _state, _field, \
> > > - VIRTIO_F_ANY_LAYOUT, true)
> > > + VIRTIO_F_ANY_LAYOUT, true), \
> > > + DEFINE_PROP_BIT64("iommu_platform", _state, _field, \
> > > + VIRTIO_F_IOMMU_PLATFORM, false)
> >
> > Looks like the impact of this patch is that users who relied on
> > k->get_dma_as today may now have to explicitly add iommu_platform=on.
> > Are there any such users (e.g. Xen)?
>
> No because upstream this is ignored. This is an incremental patch
> on top of Jason's one.
>
> > Instead of breaking the command-line for these users you could invert
> > the flag's meaning ("iommu_bypass=on") and set it in the SPARC/PPC
> > machine types.
> >
> > Stefan
>
> I hope I made it clear that there are no such users.
Okay, thanks!
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]
[-- Attachment #2: Type: text/plain, Size: 183 bytes --]
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* [PATCH] virtio: queue variable should be initialized in vring_create_virtqueue
From: Xin Long @ 2016-04-24 16:50 UTC (permalink / raw)
To: linux-kernel, virtualization; +Cc: Andy Lutomirski, Michael S. Tsirkin
'queue' may be used uninitialized in vring_create_virtqueue, actually
when we compile kernel, gcc can also give this warning.
Fixes: 2a2d1382fe9d ("virtio: Add improved queue allocation API")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
drivers/virtio/virtio_ring.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 5c802d4..ca6bfdd 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -1006,7 +1006,7 @@ struct virtqueue *vring_create_virtqueue(
const char *name)
{
struct virtqueue *vq;
- void *queue;
+ void *queue = NULL;
dma_addr_t dma_addr;
size_t queue_size_in_bytes;
struct vring vring;
--
2.1.0
^ permalink raw reply related
* [PATCH 1/2] vhost: simplify work flushing
From: Jason Wang @ 2016-04-26 2:14 UTC (permalink / raw)
To: mst; +Cc: netdev, linux-kernel, kvm, virtualization
We used to implement the work flushing through tracking queued seq,
done seq, and the number of flushing. This patch simplify this by just
implement work flushing through another kind of vhost work with
completion. This will be used by lockless enqueuing patch.
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
drivers/vhost/vhost.c | 53 ++++++++++++++++++++-------------------------------
1 file changed, 21 insertions(+), 32 deletions(-)
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 669fef1..73dd16d 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -131,6 +131,19 @@ static void vhost_reset_is_le(struct vhost_virtqueue *vq)
vq->is_le = virtio_legacy_is_little_endian();
}
+struct vhost_flush_struct {
+ struct vhost_work work;
+ struct completion wait_event;
+};
+
+static void vhost_flush_work(struct vhost_work *work)
+{
+ struct vhost_flush_struct *s;
+
+ s = container_of(work, struct vhost_flush_struct, work);
+ complete(&s->wait_event);
+}
+
static void vhost_poll_func(struct file *file, wait_queue_head_t *wqh,
poll_table *pt)
{
@@ -158,8 +171,6 @@ void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn)
INIT_LIST_HEAD(&work->node);
work->fn = fn;
init_waitqueue_head(&work->done);
- work->flushing = 0;
- work->queue_seq = work->done_seq = 0;
}
EXPORT_SYMBOL_GPL(vhost_work_init);
@@ -211,31 +222,17 @@ void vhost_poll_stop(struct vhost_poll *poll)
}
EXPORT_SYMBOL_GPL(vhost_poll_stop);
-static bool vhost_work_seq_done(struct vhost_dev *dev, struct vhost_work *work,
- unsigned seq)
-{
- int left;
-
- spin_lock_irq(&dev->work_lock);
- left = seq - work->done_seq;
- spin_unlock_irq(&dev->work_lock);
- return left <= 0;
-}
-
void vhost_work_flush(struct vhost_dev *dev, struct vhost_work *work)
{
- unsigned seq;
- int flushing;
+ struct vhost_flush_struct flush;
+
+ if (dev->worker) {
+ init_completion(&flush.wait_event);
+ vhost_work_init(&flush.work, vhost_flush_work);
- spin_lock_irq(&dev->work_lock);
- seq = work->queue_seq;
- work->flushing++;
- spin_unlock_irq(&dev->work_lock);
- wait_event(work->done, vhost_work_seq_done(dev, work, seq));
- spin_lock_irq(&dev->work_lock);
- flushing = --work->flushing;
- spin_unlock_irq(&dev->work_lock);
- BUG_ON(flushing < 0);
+ vhost_work_queue(dev, &flush.work);
+ wait_for_completion(&flush.wait_event);
+ }
}
EXPORT_SYMBOL_GPL(vhost_work_flush);
@@ -254,7 +251,6 @@ void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
spin_lock_irqsave(&dev->work_lock, flags);
if (list_empty(&work->node)) {
list_add_tail(&work->node, &dev->work_list);
- work->queue_seq++;
spin_unlock_irqrestore(&dev->work_lock, flags);
wake_up_process(dev->worker);
} else {
@@ -310,7 +306,6 @@ static int vhost_worker(void *data)
{
struct vhost_dev *dev = data;
struct vhost_work *work = NULL;
- unsigned uninitialized_var(seq);
mm_segment_t oldfs = get_fs();
set_fs(USER_DS);
@@ -321,11 +316,6 @@ static int vhost_worker(void *data)
set_current_state(TASK_INTERRUPTIBLE);
spin_lock_irq(&dev->work_lock);
- if (work) {
- work->done_seq = seq;
- if (work->flushing)
- wake_up_all(&work->done);
- }
if (kthread_should_stop()) {
spin_unlock_irq(&dev->work_lock);
@@ -336,7 +326,6 @@ static int vhost_worker(void *data)
work = list_first_entry(&dev->work_list,
struct vhost_work, node);
list_del_init(&work->node);
- seq = work->queue_seq;
} else
work = NULL;
spin_unlock_irq(&dev->work_lock);
--
1.8.3.1
^ permalink raw reply related
* [PATCH 2/2] vhost: lockless enqueuing
From: Jason Wang @ 2016-04-26 2:14 UTC (permalink / raw)
To: mst; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <1461636873-45335-1-git-send-email-jasowang@redhat.com>
We use spinlock to synchronize the work list now which may cause
unnecessary contentions. So this patch switch to use llist to remove
this contention. Pktgen tests shows about 5% improvement:
Before:
~1300000 pps
After:
~1370000 pps
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
drivers/vhost/vhost.c | 52 +++++++++++++++++++++++++--------------------------
drivers/vhost/vhost.h | 7 ++++---
2 files changed, 29 insertions(+), 30 deletions(-)
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 73dd16d..0061a7b 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -168,7 +168,7 @@ static int vhost_poll_wakeup(wait_queue_t *wait, unsigned mode, int sync,
void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn)
{
- INIT_LIST_HEAD(&work->node);
+ clear_bit(VHOST_WORK_QUEUED, &work->flags);
work->fn = fn;
init_waitqueue_head(&work->done);
}
@@ -246,15 +246,16 @@ EXPORT_SYMBOL_GPL(vhost_poll_flush);
void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
{
- unsigned long flags;
+ if (!dev->worker)
+ return;
- spin_lock_irqsave(&dev->work_lock, flags);
- if (list_empty(&work->node)) {
- list_add_tail(&work->node, &dev->work_list);
- spin_unlock_irqrestore(&dev->work_lock, flags);
+ if (!test_and_set_bit(VHOST_WORK_QUEUED, &work->flags)) {
+ /* We can only add the work to the list after we're
+ * sure it was not in the list.
+ */
+ smp_mb();
+ llist_add(&work->node, &dev->work_list);
wake_up_process(dev->worker);
- } else {
- spin_unlock_irqrestore(&dev->work_lock, flags);
}
}
EXPORT_SYMBOL_GPL(vhost_work_queue);
@@ -262,7 +263,7 @@ EXPORT_SYMBOL_GPL(vhost_work_queue);
/* A lockless hint for busy polling code to exit the loop */
bool vhost_has_work(struct vhost_dev *dev)
{
- return !list_empty(&dev->work_list);
+ return !llist_empty(&dev->work_list);
}
EXPORT_SYMBOL_GPL(vhost_has_work);
@@ -305,7 +306,8 @@ static void vhost_vq_reset(struct vhost_dev *dev,
static int vhost_worker(void *data)
{
struct vhost_dev *dev = data;
- struct vhost_work *work = NULL;
+ struct vhost_work *work, *work_next;
+ struct llist_node *node;
mm_segment_t oldfs = get_fs();
set_fs(USER_DS);
@@ -315,29 +317,25 @@ static int vhost_worker(void *data)
/* mb paired w/ kthread_stop */
set_current_state(TASK_INTERRUPTIBLE);
- spin_lock_irq(&dev->work_lock);
-
if (kthread_should_stop()) {
- spin_unlock_irq(&dev->work_lock);
__set_current_state(TASK_RUNNING);
break;
}
- if (!list_empty(&dev->work_list)) {
- work = list_first_entry(&dev->work_list,
- struct vhost_work, node);
- list_del_init(&work->node);
- } else
- work = NULL;
- spin_unlock_irq(&dev->work_lock);
- if (work) {
+ node = llist_del_all(&dev->work_list);
+ if (!node)
+ schedule();
+
+ node = llist_reverse_order(node);
+ /* make sure flag is seen after deletion */
+ smp_wmb();
+ llist_for_each_entry_safe(work, work_next, node, node) {
+ clear_bit(VHOST_WORK_QUEUED, &work->flags);
__set_current_state(TASK_RUNNING);
work->fn(work);
if (need_resched())
schedule();
- } else
- schedule();
-
+ }
}
unuse_mm(dev->mm);
set_fs(oldfs);
@@ -398,9 +396,9 @@ void vhost_dev_init(struct vhost_dev *dev,
dev->log_file = NULL;
dev->memory = NULL;
dev->mm = NULL;
- spin_lock_init(&dev->work_lock);
- INIT_LIST_HEAD(&dev->work_list);
dev->worker = NULL;
+ init_llist_head(&dev->work_list);
+
for (i = 0; i < dev->nvqs; ++i) {
vq = dev->vqs[i];
@@ -566,7 +564,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
/* No one will access memory at this point */
kvfree(dev->memory);
dev->memory = NULL;
- WARN_ON(!list_empty(&dev->work_list));
+ WARN_ON(!llist_empty(&dev->work_list));
if (dev->worker) {
kthread_stop(dev->worker);
dev->worker = NULL;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index d36d8be..6690e64 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -15,13 +15,15 @@
struct vhost_work;
typedef void (*vhost_work_fn_t)(struct vhost_work *work);
+#define VHOST_WORK_QUEUED 1
struct vhost_work {
- struct list_head node;
+ struct llist_node node;
vhost_work_fn_t fn;
wait_queue_head_t done;
int flushing;
unsigned queue_seq;
unsigned done_seq;
+ unsigned long flags;
};
/* Poll a file (eventfd or socket) */
@@ -126,8 +128,7 @@ struct vhost_dev {
int nvqs;
struct file *log_file;
struct eventfd_ctx *log_ctx;
- spinlock_t work_lock;
- struct list_head work_list;
+ struct llist_head work_list;
struct task_struct *worker;
};
--
1.8.3.1
^ permalink raw reply related
* Re: [PATCH 2/2] vhost: lockless enqueuing
From: Pankaj Gupta @ 2016-04-26 6:24 UTC (permalink / raw)
To: Jason Wang; +Cc: netdev, virtualization, linux-kernel, kvm, mst
In-Reply-To: <1461636873-45335-2-git-send-email-jasowang@redhat.com>
Hi Jason,
Overall patches look good. Just one doubt I have is below:
>
> We use spinlock to synchronize the work list now which may cause
> unnecessary contentions. So this patch switch to use llist to remove
> this contention. Pktgen tests shows about 5% improvement:
>
> Before:
> ~1300000 pps
> After:
> ~1370000 pps
>
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
> drivers/vhost/vhost.c | 52
> +++++++++++++++++++++++++--------------------------
> drivers/vhost/vhost.h | 7 ++++---
> 2 files changed, 29 insertions(+), 30 deletions(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 73dd16d..0061a7b 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -168,7 +168,7 @@ static int vhost_poll_wakeup(wait_queue_t *wait, unsigned
> mode, int sync,
>
> void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn)
> {
> - INIT_LIST_HEAD(&work->node);
> + clear_bit(VHOST_WORK_QUEUED, &work->flags);
> work->fn = fn;
> init_waitqueue_head(&work->done);
> }
> @@ -246,15 +246,16 @@ EXPORT_SYMBOL_GPL(vhost_poll_flush);
>
> void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
> {
> - unsigned long flags;
> + if (!dev->worker)
> + return;
>
> - spin_lock_irqsave(&dev->work_lock, flags);
> - if (list_empty(&work->node)) {
> - list_add_tail(&work->node, &dev->work_list);
> - spin_unlock_irqrestore(&dev->work_lock, flags);
> + if (!test_and_set_bit(VHOST_WORK_QUEUED, &work->flags)) {
> + /* We can only add the work to the list after we're
> + * sure it was not in the list.
> + */
> + smp_mb();
> + llist_add(&work->node, &dev->work_list);
> wake_up_process(dev->worker);
> - } else {
> - spin_unlock_irqrestore(&dev->work_lock, flags);
> }
> }
> EXPORT_SYMBOL_GPL(vhost_work_queue);
> @@ -262,7 +263,7 @@ EXPORT_SYMBOL_GPL(vhost_work_queue);
> /* A lockless hint for busy polling code to exit the loop */
> bool vhost_has_work(struct vhost_dev *dev)
> {
> - return !list_empty(&dev->work_list);
> + return !llist_empty(&dev->work_list);
> }
> EXPORT_SYMBOL_GPL(vhost_has_work);
>
> @@ -305,7 +306,8 @@ static void vhost_vq_reset(struct vhost_dev *dev,
> static int vhost_worker(void *data)
> {
> struct vhost_dev *dev = data;
> - struct vhost_work *work = NULL;
> + struct vhost_work *work, *work_next;
> + struct llist_node *node;
> mm_segment_t oldfs = get_fs();
>
> set_fs(USER_DS);
> @@ -315,29 +317,25 @@ static int vhost_worker(void *data)
> /* mb paired w/ kthread_stop */
> set_current_state(TASK_INTERRUPTIBLE);
>
> - spin_lock_irq(&dev->work_lock);
> -
> if (kthread_should_stop()) {
> - spin_unlock_irq(&dev->work_lock);
> __set_current_state(TASK_RUNNING);
> break;
> }
> - if (!list_empty(&dev->work_list)) {
> - work = list_first_entry(&dev->work_list,
> - struct vhost_work, node);
> - list_del_init(&work->node);
> - } else
> - work = NULL;
> - spin_unlock_irq(&dev->work_lock);
>
> - if (work) {
> + node = llist_del_all(&dev->work_list);
> + if (!node)
> + schedule();
> +
> + node = llist_reverse_order(node);
Can we avoid llist reverse here?
> + /* make sure flag is seen after deletion */
> + smp_wmb();
> + llist_for_each_entry_safe(work, work_next, node, node) {
> + clear_bit(VHOST_WORK_QUEUED, &work->flags);
> __set_current_state(TASK_RUNNING);
> work->fn(work);
> if (need_resched())
> schedule();
> - } else
> - schedule();
> -
> + }
> }
> unuse_mm(dev->mm);
> set_fs(oldfs);
> @@ -398,9 +396,9 @@ void vhost_dev_init(struct vhost_dev *dev,
> dev->log_file = NULL;
> dev->memory = NULL;
> dev->mm = NULL;
> - spin_lock_init(&dev->work_lock);
> - INIT_LIST_HEAD(&dev->work_list);
> dev->worker = NULL;
> + init_llist_head(&dev->work_list);
> +
>
> for (i = 0; i < dev->nvqs; ++i) {
> vq = dev->vqs[i];
> @@ -566,7 +564,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev, bool
> locked)
> /* No one will access memory at this point */
> kvfree(dev->memory);
> dev->memory = NULL;
> - WARN_ON(!list_empty(&dev->work_list));
> + WARN_ON(!llist_empty(&dev->work_list));
> if (dev->worker) {
> kthread_stop(dev->worker);
> dev->worker = NULL;
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index d36d8be..6690e64 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -15,13 +15,15 @@
> struct vhost_work;
> typedef void (*vhost_work_fn_t)(struct vhost_work *work);
>
> +#define VHOST_WORK_QUEUED 1
> struct vhost_work {
> - struct list_head node;
> + struct llist_node node;
> vhost_work_fn_t fn;
> wait_queue_head_t done;
> int flushing;
> unsigned queue_seq;
> unsigned done_seq;
> + unsigned long flags;
> };
>
> /* Poll a file (eventfd or socket) */
> @@ -126,8 +128,7 @@ struct vhost_dev {
> int nvqs;
> struct file *log_file;
> struct eventfd_ctx *log_ctx;
> - spinlock_t work_lock;
> - struct list_head work_list;
> + struct llist_head work_list;
> struct task_struct *worker;
> };
>
> --
> 1.8.3.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
^ permalink raw reply
* Re: [PATCH 2/2] vhost: lockless enqueuing
From: Jason Wang @ 2016-04-26 7:05 UTC (permalink / raw)
To: Pankaj Gupta; +Cc: netdev, virtualization, linux-kernel, kvm, mst
In-Reply-To: <2033086948.46236145.1461651858100.JavaMail.zimbra@redhat.com>
On 04/26/2016 02:24 PM, Pankaj Gupta wrote:
> Hi Jason,
>
> Overall patches look good. Just one doubt I have is below:
>> We use spinlock to synchronize the work list now which may cause
>> unnecessary contentions. So this patch switch to use llist to remove
>> this contention. Pktgen tests shows about 5% improvement:
>>
>> Before:
>> ~1300000 pps
>> After:
>> ~1370000 pps
>>
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>> ---
>> drivers/vhost/vhost.c | 52
>> +++++++++++++++++++++++++--------------------------
>> drivers/vhost/vhost.h | 7 ++++---
>> 2 files changed, 29 insertions(+), 30 deletions(-)
[...]
>> - if (work) {
>> + node = llist_del_all(&dev->work_list);
>> + if (!node)
>> + schedule();
>> +
>> + node = llist_reverse_order(node);
> Can we avoid llist reverse here?
>
Probably not, this is because:
- we should process the work exactly the same order as they were queued,
otherwise flush won't work
- llist can only add a node to the head of list.
Thanks
^ permalink raw reply
* Re: [PATCH 2/2] vhost: lockless enqueuing
From: Pankaj Gupta @ 2016-04-26 7:57 UTC (permalink / raw)
To: Jason Wang; +Cc: netdev, virtualization, linux-kernel, kvm, mst
In-Reply-To: <571F1330.7030504@redhat.com>
>
>
>
> On 04/26/2016 02:24 PM, Pankaj Gupta wrote:
> > Hi Jason,
> >
> > Overall patches look good. Just one doubt I have is below:
> >> We use spinlock to synchronize the work list now which may cause
> >> unnecessary contentions. So this patch switch to use llist to remove
> >> this contention. Pktgen tests shows about 5% improvement:
> >>
> >> Before:
> >> ~1300000 pps
> >> After:
> >> ~1370000 pps
> >>
> >> Signed-off-by: Jason Wang <jasowang@redhat.com>
> >> ---
> >> drivers/vhost/vhost.c | 52
> >> +++++++++++++++++++++++++--------------------------
> >> drivers/vhost/vhost.h | 7 ++++---
> >> 2 files changed, 29 insertions(+), 30 deletions(-)
>
> [...]
>
> >> - if (work) {
> >> + node = llist_del_all(&dev->work_list);
> >> + if (!node)
> >> + schedule();
> >> +
> >> + node = llist_reverse_order(node);
> > Can we avoid llist reverse here?
> >
>
> Probably not, this is because:
>
> - we should process the work exactly the same order as they were queued,
> otherwise flush won't work
> - llist can only add a node to the head of list.
Got it.
Thanks,
>
> Thanks
>
^ permalink raw reply
* Re: [Qemu-devel] KVM Forum 2016: Call For Participation
From: Peter Maydell @ 2016-04-26 17:58 UTC (permalink / raw)
To: Paolo Bonzini
Cc: KVM list, Libvirt, edk2-devel@ml01.01.org, seabios@seabios.org,
qemu-devel, Linux Virtualization
In-Reply-To: <56E1B83D.7070601@redhat.com>
On 10 March 2016 at 18:09, Paolo Bonzini <pbonzini@redhat.com> wrote:
> =================================================================
> KVM Forum 2016: Call For Participation
> August 24-26, 2016 - Westin Harbor Castle - Toronto, Canada
>
> (All submissions must be received before midnight May 1, 2016)
> =================================================================
>
> KVM Forum is an annual event that presents a rare opportunity
> for developers and users to meet, discuss the state of Linux
> virtualization technology, and plan for the challenges ahead.
> We invite you to lead part of the discussion by submitting a speaking
> proposal for KVM Forum 2016.
Hi; this is just a followup to remind people that the KVM Forum
CFP deadline of May 1st is rapidly approaching, so if you were
thinking of submitting a proposal now is the time.
All the CFP information is here:
http://events.linuxfoundation.org/events/kvm-forum/program/cfp
thanks
-- PMM (on behalf of the KVM Forum 2016 Program Committee)
^ permalink raw reply
* [PATCH v4 00/13] Support non-lru page migration
From: Minchan Kim @ 2016-04-27 7:48 UTC (permalink / raw)
To: Andrew Morton
Cc: Rik van Riel, Sergey Senozhatsky, Rafael Aquini, Minchan Kim,
Chan Gyun Jeong, Jonathan Corbet, Hugh Dickins, linux-kernel,
dri-devel, virtualization, John Einar Reitan, linux-mm,
Chulmin Kim, Gioh Kim, Konstantin Khlebnikov, Sangseok Lee,
Joonsoo Kim, Kyeongdon Kim, Naoya Horiguchi, Vlastimil Babka,
Mel Gorman
Recently, I got many reports about perfermance degradation in embedded
system(Android mobile phone, webOS TV and so on) and easy fork fail.
The problem was fragmentation caused by zram and GPU driver mainly.
With memory pressure, their pages were spread out all of pageblock and
it cannot be migrated with current compaction algorithm which supports
only LRU pages. In the end, compaction cannot work well so reclaimer
shrinks all of working set pages. It made system very slow and even to
fail to fork easily which requires order-[2 or 3] allocations.
Other pain point is that they cannot use CMA memory space so when OOM
kill happens, I can see many free pages in CMA area, which is not
memory efficient. In our product which has big CMA memory, it reclaims
zones too exccessively to allocate GPU and zram page although there are
lots of free space in CMA so system becomes very slow easily.
To solve these problem, this patch tries to add facility to migrate
non-lru pages via introducing new functions and page flags to help
migration.
struct address_space_operations {
..
..
bool (*isolate_page)(struct page *, isolate_mode_t);
void (*putback_page)(struct page *);
..
}
new page flags
PG_movable
PG_isolated
For details, please read description in "mm: migrate: support non-lru
movable page migration".
Originally, Gioh Kim had tried to support this feature but he moved so
I took over the work. I took many code from his work and changed a little
bit and Konstantin Khlebnikov helped Gioh a lot so he should deserve to have
many credit, too.
Thanks, Gioh and Konstantin!
This patchset consists of five parts.
1. clean up migration
mm: use put_page to free page instead of putback_lru_page
2. add non-lru page migration feature
mm: migrate: support non-lru movable page migration
3. rework KVM memory-ballooning
mm: balloon: use general non-lru movable page feature
4. zsmalloc refactoring for preparing page migration
zsmalloc: keep max_object in size_class
zsmalloc: use bit_spin_lock
zsmalloc: use accessor
zsmalloc: factor page chain functionality out
zsmalloc: introduce zspage structure
zsmalloc: separate free_zspage from putback_zspage
zsmalloc: use freeobj for index
5. zsmalloc page migration
zsmalloc: page migration support
zram: use __GFP_MOVABLE for memory allocation
* From v3
* rebase on mmotm-2016-04-06-20-40
* fix swap_info deadlock - Chulmin
* race without page_lock - Vlastimil
* no use page._mapcount for potential user-mapped page driver - Vlastimil
* fix and enhance doc/description - Vlastimil
* use page->mapping lower bits to represent PG_movable
* make driver side's rule simple.
* From v2
* rebase on mmotm-2016-03-29-15-54-16
* check PageMovable before lock_page - Joonsoo
* check PageMovable before PageIsolated checking - Joonsoo
* add more description about rule
* From v1
* rebase on v4.5-mmotm-2016-03-17-15-04
* reordering patches to merge clean-up patches first
* add Acked-by/Reviewed-by from Vlastimil and Sergey
* use each own mount model instead of reusing anon_inode_fs - Al Viro
* small changes - YiPing, Gioh
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: dri-devel@lists.freedesktop.org
Cc: Hugh Dickins <hughd@google.com>
Cc: John Einar Reitan <john.reitan@foss.arm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: virtualization@lists.linux-foundation.org
Cc: Gioh Kim <gi-oh.kim@profitbricks.com>
Cc: Chan Gyun Jeong <chan.jeong@lge.com>
Cc: Sangseok Lee <sangseok.lee@lge.com>
Cc: Kyeongdon Kim <kyeongdon.kim@lge.com>
Cc: Chulmin Kim <cmlaika.kim@samsung.com>
Minchan Kim (12):
mm: use put_page to free page instead of putback_lru_page
mm: migrate: support non-lru movable page migration
mm: balloon: use general non-lru movable page feature
zsmalloc: keep max_object in size_class
zsmalloc: use bit_spin_lock
zsmalloc: use accessor
zsmalloc: factor page chain functionality out
zsmalloc: introduce zspage structure
zsmalloc: separate free_zspage from putback_zspage
zsmalloc: use freeobj for index
zsmalloc: page migration support
zram: use __GFP_MOVABLE for memory allocation
Documentation/filesystems/Locking | 4 +
Documentation/filesystems/vfs.txt | 11 +
Documentation/vm/page_migration | 107 +++-
drivers/block/zram/zram_drv.c | 3 +-
drivers/virtio/virtio_balloon.c | 52 +-
include/linux/balloon_compaction.h | 51 +-
include/linux/fs.h | 2 +
include/linux/ksm.h | 3 +-
include/linux/migrate.h | 5 +
include/linux/mm.h | 1 +
include/linux/page-flags.h | 23 +-
include/uapi/linux/magic.h | 2 +
mm/balloon_compaction.c | 94 +--
mm/compaction.c | 40 +-
mm/ksm.c | 4 +-
mm/migrate.c | 287 +++++++--
mm/page_alloc.c | 4 +-
mm/util.c | 6 +-
mm/vmscan.c | 2 +-
mm/zsmalloc.c | 1147 ++++++++++++++++++++++++------------
20 files changed, 1283 insertions(+), 565 deletions(-)
--
1.9.1
^ permalink raw reply
* [PATCH v4 02/12] mm: migrate: support non-lru movable page migration
From: Minchan Kim @ 2016-04-27 7:48 UTC (permalink / raw)
To: Andrew Morton
Cc: Rik van Riel, Sergey Senozhatsky, Rafael Aquini, Minchan Kim,
Jonathan Corbet, Hugh Dickins, linux-kernel, dri-devel,
virtualization, John Einar Reitan, linux-mm, Gioh Kim, Mel Gorman,
Joonsoo Kim, Vlastimil Babka
In-Reply-To: <1461743305-19970-1-git-send-email-minchan@kernel.org>
We have allowed migration for only LRU pages until now and it was
enough to make high-order pages. But recently, embedded system(e.g.,
webOS, android) uses lots of non-movable pages(e.g., zram, GPU memory)
so we have seen several reports about troubles of small high-order
allocation. For fixing the problem, there were several efforts
(e,g,. enhance compaction algorithm, SLUB fallback to 0-order page,
reserved memory, vmalloc and so on) but if there are lots of
non-movable pages in system, their solutions are void in the long run.
So, this patch is to support facility to change non-movable pages
with movable. For the feature, this patch introduces functions related
to migration to address_space_operations as well as some page flags.
If a driver want to make own pages movable, it should define three functions
which are function pointers of struct address_space_operations.
1. bool (*isolate_page) (struct page *page, isolate_mode_t mode);
What VM expects on isolate_page function of driver is to return *true*
if driver isolates page successfully. On returing true, VM marks the page
as PG_isolated so concurrent isolation in several CPUs skip the page
for isolation. If a driver cannot isolate the page, it should return *false*.
Once page is successfully isolated, VM uses page.lru fields so driver
shouldn't expect to preserve values in that fields.
2. int (*migratepage) (struct address_space *mapping,
struct page *newpage, struct page *oldpage, enum migrate_mode);
After isolation, VM calls migratepage of driver with isolated page.
The function of migratepage is to move content of the old page to new page
and set up fields of struct page newpage. Keep in mind that you should
clear PG_movable of oldpage via __ClearPageMovable under page_lock if you
migrated the oldpage successfully and returns MIGRATEPAGE_SUCCESS.
If driver cannot migrate the page at the moment, driver can return -EAGAIN.
On -EAGAIN, VM will retry page migration in a short time because VM interprets
-EAGAIN as "temporal migration failure". On returning any error except -EAGAIN,
VM will give up the page migration without retrying in this time.
Driver shouldn't touch page.lru field VM using in the functions.
3. void (*putback_page)(struct page *);
If migration fails on isolated page, VM should return the isolated page
to the driver so VM calls driver's putback_page with migration failed page.
In this function, driver should put the isolated page back to the own data
structure.
4. non-lru movable page flags
There are two page flags for supporting non-lru movable page.
* PG_movable
Driver should use the below function to make page movable under page_lock.
void __SetPageMovable(struct page *page, struct address_space *mapping)
It needs argument of address_space for registering migration family functions
which will be called by VM. Exactly speaking, PG_movable is not a real flag of
struct page. Rather than, VM reuses page->mapping's lower bits to represent it.
#define PAGE_MAPPING_MOVABLE 0x2
page->mapping = page->mapping | PAGE_MAPPING_MOVABLE;
so driver shouldn't access page->mapping directly. Instead, driver should
use page_mapping which mask off the low two bits of page->mapping so it can get
right struct address_space.
For testing of non-lru movable page, VM supports __PageMovable function.
However, it doesn't guarantee to identify non-lru movable page because
page->mapping field is unified with other variables in struct page.
As well, if driver releases the page after isolation by VM, page->mapping
doesn't have stable value although it has PAGE_MAPPING_MOVABLE
(Look at __ClearPageMovable). But __PageMovable is cheap to catch whether
page is LRU or non-lru movable once the page has been isolated. Because
LRU pages never can have PAGE_MAPPING_MOVABLE in page->mapping. It is also
good for just peeking to test non-lru movable pages before more expensive
checking with lock_page in pfn scanning to select victim.
For guaranteeing non-lru movable page, VM provides PageMovable function.
Unlike __PageMovable, PageMovable functions validates page->mapping and
mapping->a_ops->isolate_page under lock_page. The lock_page prevents sudden
destroying of page->mapping.
Driver using __SetPageMovable should clear the flag via __ClearMovablePage
under page_lock before the releasing the page.
* PG_isolated
To prevent concurrent isolation among several CPUs, VM marks isolated page
as PG_isolated under lock_page. So if a CPU encounters PG_isolated non-lru
movable page, it can skip it. Driver doesn't need to manipulate the flag
because VM will set/clear it automatically. Keep in mind that if driver
sees PG_isolated page, it means the page have been isolated by VM so it
shouldn't touch page.lru field.
PG_isolated is alias with PG_reclaim flag so driver shouldn't use the flag
for own purpose.
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: virtualization@lists.linux-foundation.org
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: John Einar Reitan <john.reitan@foss.arm.com>
Cc: dri-devel@lists.freedesktop.org
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Gioh Kim <gi-oh.kim@profitbricks.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
Documentation/filesystems/Locking | 4 +
Documentation/filesystems/vfs.txt | 11 ++
Documentation/vm/page_migration | 107 +++++++++++++++++-
include/linux/fs.h | 2 +
include/linux/ksm.h | 3 +-
include/linux/migrate.h | 5 +
include/linux/mm.h | 1 +
include/linux/page-flags.h | 23 +++-
mm/compaction.c | 47 +++++---
mm/ksm.c | 4 +-
mm/migrate.c | 222 ++++++++++++++++++++++++++++++++++----
mm/page_alloc.c | 4 +-
mm/util.c | 6 +-
13 files changed, 390 insertions(+), 49 deletions(-)
diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 619af9bfdcb3..0bb79560abb3 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -195,7 +195,9 @@ unlocks and drops the reference.
int (*releasepage) (struct page *, int);
void (*freepage)(struct page *);
int (*direct_IO)(struct kiocb *, struct iov_iter *iter, loff_t offset);
+ bool (*isolate_page) (struct page *, isolate_mode_t);
int (*migratepage)(struct address_space *, struct page *, struct page *);
+ void (*putback_page) (struct page *);
int (*launder_page)(struct page *);
int (*is_partially_uptodate)(struct page *, unsigned long, unsigned long);
int (*error_remove_page)(struct address_space *, struct page *);
@@ -219,7 +221,9 @@ invalidatepage: yes
releasepage: yes
freepage: yes
direct_IO:
+isolate_page: yes
migratepage: yes (both)
+putback_page: yes
launder_page: yes
is_partially_uptodate: yes
error_remove_page: yes
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 4164bd6397a2..4fa8f54d94c8 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -592,9 +592,14 @@ struct address_space_operations {
int (*releasepage) (struct page *, int);
void (*freepage)(struct page *);
ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter, loff_t offset);
+ /* isolate a page for migration */
+ bool (*isolate_page) (struct page *, isolate_mode_t);
/* migrate the contents of a page to the specified target */
int (*migratepage) (struct page *, struct page *);
+ /* put migration-failed page back to right list */
+ void (*putback_page) (struct page *);
int (*launder_page) (struct page *);
+
int (*is_partially_uptodate) (struct page *, unsigned long,
unsigned long);
void (*is_dirty_writeback) (struct page *, bool *, bool *);
@@ -747,6 +752,10 @@ struct address_space_operations {
and transfer data directly between the storage and the
application's address space.
+ isolate_page: Called by the VM when isolating a movable non-lru page.
+ If page is successfully isolated, VM marks the page as PG_isolated
+ via __SetPageIsolated.
+
migrate_page: This is used to compact the physical memory usage.
If the VM wants to relocate a page (maybe off a memory card
that is signalling imminent failure) it will pass a new page
@@ -754,6 +763,8 @@ struct address_space_operations {
transfer any private data across and update any references
that it has to the page.
+ putback_page: Called by the VM when isolated page's migration fails.
+
launder_page: Called before freeing a page - it writes back the dirty page. To
prevent redirtying the page, it is kept locked during the whole
operation.
diff --git a/Documentation/vm/page_migration b/Documentation/vm/page_migration
index fea5c0864170..b89d1de026df 100644
--- a/Documentation/vm/page_migration
+++ b/Documentation/vm/page_migration
@@ -142,5 +142,110 @@ is increased so that the page cannot be freed while page migration occurs.
20. The new page is moved to the LRU and can be scanned by the swapper
etc again.
-Christoph Lameter, May 8, 2006.
+C. Non-LRU page migration
+-------------------------
+
+Although original migration aimed for reducing the latency of memory access
+for NUMA, compaction who want to create high-order page is also main customer.
+
+Current problem of the implementation is that it is designed to migrate only
+*LRU* pages. However, there are potential non-lru pages which can be migrated
+in drivers, for example, zsmalloc, virtio-balloon pages.
+
+For virtio-balloon pages, some parts of migration code path have been hooked
+up and added virtio-balloon specific functions to intercept migration logics.
+It's too specific to a driver so other drivers who want to make their pages
+movable would have to add own specific hooks in migration path.
+
+To overclome the problem, VM supports non-LRU page migration which provides
+generic functions for non-LRU movable pages without driver specific hooks
+migration path.
+
+If a driver want to make own pages movable, it should define three functions
+which are function pointers of struct address_space_operations.
+
+1. bool (*isolate_page) (struct page *page, isolate_mode_t mode);
+
+What VM expects on isolate_page function of driver is to return *true*
+if driver isolates page successfully. On returing true, VM marks the page
+as PG_isolated so concurrent isolation in several CPUs skip the page
+for isolation. If a driver cannot isolate the page, it should return *false*.
+
+Once page is successfully isolated, VM uses page.lru fields so driver
+shouldn't expect to preserve values in that fields.
+
+2. int (*migratepage) (struct address_space *mapping,
+ struct page *newpage, struct page *oldpage, enum migrate_mode);
+
+After isolation, VM calls migratepage of driver with isolated page.
+The function of migratepage is to move content of the old page to new page
+and set up fields of struct page newpage. Keep in mind that you should
+clear PG_movable of oldpage via __ClearPageMovable under page_lock if you
+migrated the oldpage successfully and returns MIGRATEPAGE_SUCCESS.
+If driver cannot migrate the page at the moment, driver can return -EAGAIN.
+On -EAGAIN, VM will retry page migration in a short time because VM interprets
+-EAGAIN as "temporal migration failure". On returning any error except -EAGAIN,
+VM will give up the page migration without retrying in this time.
+
+Driver shouldn't touch page.lru field VM using in the functions.
+
+3. void (*putback_page)(struct page *);
+
+If migration fails on isolated page, VM should return the isolated page
+to the driver so VM calls driver's putback_page with migration failed page.
+In this function, driver should put the isolated page back to the own data
+structure.
+4. non-lru movable page flags
+
+There are two page flags for supporting non-lru movable page.
+
+* PG_movable
+
+Driver should use the below function to make page movable under page_lock.
+
+ void __SetPageMovable(struct page *page, struct address_space *mapping)
+
+It needs argument of address_space for registering migration family functions
+which will be called by VM. Exactly speaking, PG_movable is not a real flag of
+struct page. Rather than, VM reuses page->mapping's lower bits to represent it.
+
+ #define PAGE_MAPPING_MOVABLE 0x2
+ page->mapping = page->mapping | PAGE_MAPPING_MOVABLE;
+
+so driver shouldn't access page->mapping directly. Instead, driver should
+use page_mapping which mask off the low two bits of page->mapping so it can get
+right struct address_space.
+
+For testing of non-lru movable page, VM supports __PageMovable function.
+However, it doesn't guarantee to identify non-lru movable page because
+page->mapping field is unified with other variables in struct page.
+As well, if driver releases the page after isolation by VM, page->mapping
+doesn't have stable value although it has PAGE_MAPPING_MOVABLE
+(Look at __ClearPageMovable). But __PageMovable is cheap to catch whether
+page is LRU or non-lru movable once the page has been isolated. Because
+LRU pages never can have PAGE_MAPPING_MOVABLE in page->mapping. It is also
+good for just peeking to test non-lru movable pages before more expensive
+checking with lock_page in pfn scanning to select victim.
+
+For guaranteeing non-lru movable page, VM provides PageMovable function.
+Unlike __PageMovable, PageMovable functions validates page->mapping and
+mapping->a_ops->isolate_page under lock_page. The lock_page prevents sudden
+destroying of page->mapping.
+
+Driver using __SetPageMovable should clear the flag via __ClearMovablePage
+under page_lock before the releasing the page.
+
+* PG_isolated
+
+To prevent concurrent isolation among several CPUs, VM marks isolated page
+as PG_isolated under lock_page. So if a CPU encounters PG_isolated non-lru
+movable page, it can skip it. Driver doesn't need to manipulate the flag
+because VM will set/clear it automatically. Keep in mind that if driver
+sees PG_isolated page, it means the page have been isolated by VM so it
+shouldn't touch page.lru field.
+PG_isolated is alias with PG_reclaim flag so driver shouldn't use the flag
+for own purpose.
+
+Christoph Lameter, May 8, 2006.
+Minchan Kim, Mar 28, 2016.
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e3c0b7e54d47..e520d365adf6 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -401,6 +401,8 @@ struct address_space_operations {
*/
int (*migratepage) (struct address_space *,
struct page *, struct page *, enum migrate_mode);
+ bool (*isolate_page)(struct page *, isolate_mode_t);
+ void (*putback_page)(struct page *);
int (*launder_page) (struct page *);
int (*is_partially_uptodate) (struct page *, unsigned long,
unsigned long);
diff --git a/include/linux/ksm.h b/include/linux/ksm.h
index 7ae216a39c9e..481c8c4627ca 100644
--- a/include/linux/ksm.h
+++ b/include/linux/ksm.h
@@ -43,8 +43,7 @@ static inline struct stable_node *page_stable_node(struct page *page)
static inline void set_page_stable_node(struct page *page,
struct stable_node *stable_node)
{
- page->mapping = (void *)stable_node +
- (PAGE_MAPPING_ANON | PAGE_MAPPING_KSM);
+ page->mapping = (void *)((unsigned long)stable_node | PAGE_MAPPING_KSM);
}
/*
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index f087653baf94..9365858db978 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -34,11 +34,16 @@ extern char *migrate_reason_names[MR_TYPES];
#ifdef CONFIG_MIGRATION
+extern int PageMovable(struct page *page);
+extern void __SetPageMovable(struct page *page, struct address_space *mapping);
+extern void __ClearPageMovable(struct page *page);
extern void putback_movable_pages(struct list_head *l);
extern int migrate_page(struct address_space *,
struct page *, struct page *, enum migrate_mode);
extern int migrate_pages(struct list_head *l, new_page_t new, free_page_t free,
unsigned long private, enum migrate_mode mode, int reason);
+extern bool isolate_movable_page(struct page *page, isolate_mode_t mode);
+extern void putback_movable_page(struct page *page);
extern int migrate_prep(void);
extern int migrate_prep_local(void);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ebc23193f556..d1095c8ea851 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1021,6 +1021,7 @@ static inline pgoff_t page_file_index(struct page *page)
}
bool page_mapped(struct page *page);
+struct address_space *page_mapping(struct page *page);
/*
* Return true only if the page has been allocated with
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 36e3a3fa41b2..079a90ceae75 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -132,6 +132,9 @@ enum pageflags {
/* Compound pages. Stored in first tail page's flags */
PG_double_map = PG_private_2,
+
+ /* non-lru isolated movable page */
+ PG_isolated = PG_reclaim,
};
#ifndef __GENERATING_BOUNDS_H
@@ -367,15 +370,17 @@ PAGEFLAG(Idle, idle, PF_ANY)
* and then page->mapping points, not to an anon_vma, but to a private
* structure which KSM associates with that merged page. See ksm.h.
*
- * PAGE_MAPPING_KSM without PAGE_MAPPING_ANON is currently never used.
+ * PAGE_MAPPING_KSM without PAGE_MAPPING_ANON is used for non-lru movable
+ * page and then page->mapping points a struct address_space.
*
* Please note that, confusingly, "page_mapping" refers to the inode
* address_space which maps the page from disk; whereas "page_mapped"
* refers to user virtual address space into which the page is mapped.
*/
-#define PAGE_MAPPING_ANON 1
-#define PAGE_MAPPING_KSM 2
-#define PAGE_MAPPING_FLAGS (PAGE_MAPPING_ANON | PAGE_MAPPING_KSM)
+#define PAGE_MAPPING_ANON 0x1
+#define PAGE_MAPPING_MOVABLE 0x2
+#define PAGE_MAPPING_KSM (PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)
+#define PAGE_MAPPING_FLAGS (PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)
static __always_inline int PageAnon(struct page *page)
{
@@ -383,6 +388,12 @@ static __always_inline int PageAnon(struct page *page)
return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;
}
+static __always_inline int __PageMovable(struct page *page)
+{
+ return ((unsigned long)page->mapping & PAGE_MAPPING_FLAGS) ==
+ PAGE_MAPPING_MOVABLE;
+}
+
#ifdef CONFIG_KSM
/*
* A KSM page is one of those write-protected "shared pages" or "merged pages"
@@ -394,7 +405,7 @@ static __always_inline int PageKsm(struct page *page)
{
page = compound_head(page);
return ((unsigned long)page->mapping & PAGE_MAPPING_FLAGS) ==
- (PAGE_MAPPING_ANON | PAGE_MAPPING_KSM);
+ PAGE_MAPPING_KSM;
}
#else
TESTPAGEFLAG_FALSE(Ksm)
@@ -624,6 +635,8 @@ static inline void __ClearPageBalloon(struct page *page)
atomic_set(&page->_mapcount, -1);
}
+__PAGEFLAG(Isolated, isolated, PF_ANY);
+
/*
* If network-based swap is enabled, sl*b must keep track of whether pages
* were allocated from pfmemalloc reserves.
diff --git a/mm/compaction.c b/mm/compaction.c
index 2f31c3769a8e..01789ebc4b10 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -740,21 +740,6 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
}
/*
- * Check may be lockless but that's ok as we recheck later.
- * It's possible to migrate LRU pages and balloon pages
- * Skip any other type of page
- */
- is_lru = PageLRU(page);
- if (!is_lru) {
- if (unlikely(balloon_page_movable(page))) {
- if (balloon_page_isolate(page)) {
- /* Successfully isolated */
- goto isolate_success;
- }
- }
- }
-
- /*
* Regardless of being on LRU, compound pages such as THP and
* hugetlbfs are not to be compacted. We can potentially save
* a lot of iterations if we skip them at once. The check is
@@ -770,8 +755,38 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
goto isolate_fail;
}
- if (!is_lru)
+ /*
+ * Check may be lockless but that's ok as we recheck later.
+ * It's possible to migrate LRU and non-lru movable pages.
+ * Skip any other type of page
+ */
+ is_lru = PageLRU(page);
+ if (!is_lru) {
+ if (unlikely(balloon_page_movable(page))) {
+ if (balloon_page_isolate(page)) {
+ /* Successfully isolated */
+ goto isolate_success;
+ }
+ }
+
+ /*
+ * __PageMovable can return false positive so we need
+ * to verify it under page_lock.
+ */
+ if (unlikely(__PageMovable(page)) &&
+ !PageIsolated(page)) {
+ if (locked) {
+ spin_unlock_irqrestore(&zone->lru_lock,
+ flags);
+ locked = false;
+ }
+
+ if (isolate_movable_page(page, isolate_mode))
+ goto isolate_success;
+ }
+
goto isolate_fail;
+ }
/*
* Migration will fail if an anonymous page is pinned in memory,
diff --git a/mm/ksm.c b/mm/ksm.c
index 3dee82e3f59a..647c3a446d35 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -658,8 +658,8 @@ static struct page *get_ksm_page(struct stable_node *stable_node, bool lock_it)
void *expected_mapping;
unsigned long kpfn;
- expected_mapping = (void *)stable_node +
- (PAGE_MAPPING_ANON | PAGE_MAPPING_KSM);
+ expected_mapping = (void *)((unsigned long)stable_node |
+ PAGE_MAPPING_KSM);
again:
kpfn = READ_ONCE(stable_node->kpfn);
page = pfn_to_page(kpfn);
diff --git a/mm/migrate.c b/mm/migrate.c
index 7880f30d1d3d..b2ababa21440 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -48,6 +48,38 @@
#include "internal.h"
+int PageMovable(struct page *page)
+{
+ struct address_space *mapping;
+
+ WARN_ON(!PageLocked(page));
+ if (!__PageMovable(page))
+ goto out;
+
+ mapping = page_mapping(page);
+ if (mapping && mapping->a_ops && mapping->a_ops->isolate_page)
+ return 1;
+out:
+ return 0;
+}
+
+void __SetPageMovable(struct page *page, struct address_space *mapping)
+{
+ VM_BUG_ON_PAGE(!PageLocked(page), page);
+ VM_BUG_ON_PAGE((unsigned long)mapping & PAGE_MAPPING_MOVABLE, page);
+ page->mapping = (void *)((unsigned long)mapping | PAGE_MAPPING_MOVABLE);
+}
+
+void __ClearPageMovable(struct page *page)
+{
+ VM_BUG_ON_PAGE(!PageLocked(page), page);
+ VM_BUG_ON_PAGE(!PageMovable(page), page);
+ VM_BUG_ON_PAGE(!((unsigned long)page->mapping & PAGE_MAPPING_MOVABLE),
+ page);
+ page->mapping = (void *)((unsigned long)page->mapping &
+ PAGE_MAPPING_MOVABLE);
+}
+
/*
* migrate_prep() needs to be called before we start compiling a list of pages
* to be migrated using isolate_lru_page(). If scheduling work on other CPUs is
@@ -74,6 +106,79 @@ int migrate_prep_local(void)
return 0;
}
+bool isolate_movable_page(struct page *page, isolate_mode_t mode)
+{
+ struct address_space *mapping;
+
+ /*
+ * Avoid burning cycles with pages that are yet under __free_pages(),
+ * or just got freed under us.
+ *
+ * In case we 'win' a race for a movable page being freed under us and
+ * raise its refcount preventing __free_pages() from doing its job
+ * the put_page() at the end of this block will take care of
+ * release this page, thus avoiding a nasty leakage.
+ */
+ if (unlikely(!get_page_unless_zero(page)))
+ goto out;
+
+ /*
+ * Check PageMovable before holding a PG_lock because page's owner
+ * assumes anybody doesn't touch PG_lock of newly allocated page
+ * so unconditionally grapping the lock ruins page's owner side.
+ */
+ if (unlikely(!__PageMovable(page)))
+ goto out_putpage;
+ /*
+ * As movable pages are not isolated from LRU lists, concurrent
+ * compaction threads can race against page migration functions
+ * as well as race against the releasing a page.
+ *
+ * In order to avoid having an already isolated movable page
+ * being (wrongly) re-isolated while it is under migration,
+ * or to avoid attempting to isolate pages being released,
+ * lets be sure we have the page lock
+ * before proceeding with the movable page isolation steps.
+ */
+ if (unlikely(!trylock_page(page)))
+ goto out_putpage;
+
+ if (!PageMovable(page) || PageIsolated(page))
+ goto out_no_isolated;
+
+ mapping = page_mapping(page);
+ if (!mapping->a_ops->isolate_page(page, mode))
+ goto out_no_isolated;
+
+ /* Driver shouldn't use PG_isolated bit of page->flags */
+ WARN_ON_ONCE(PageIsolated(page));
+ __SetPageIsolated(page);
+ unlock_page(page);
+
+ return true;
+
+out_no_isolated:
+ unlock_page(page);
+out_putpage:
+ put_page(page);
+out:
+ return false;
+}
+
+/* It should be called on page which is PG_movable */
+void putback_movable_page(struct page *page)
+{
+ struct address_space *mapping;
+
+ VM_BUG_ON_PAGE(!PageLocked(page), page);
+ VM_BUG_ON_PAGE(!PageMovable(page), page);
+ VM_BUG_ON_PAGE(!PageIsolated(page), page);
+
+ mapping = page_mapping(page);
+ mapping->a_ops->putback_page(page);
+ __ClearPageIsolated(page);
+}
+
/*
* Put previously isolated pages back onto the appropriate lists
* from where they were once taken off for compaction/migration.
@@ -95,10 +200,25 @@ void putback_movable_pages(struct list_head *l)
list_del(&page->lru);
dec_zone_page_state(page, NR_ISOLATED_ANON +
page_is_file_cache(page));
- if (unlikely(isolated_balloon_page(page)))
+ if (unlikely(isolated_balloon_page(page))) {
balloon_page_putback(page);
- else
+ /*
+ * We isolated non-lru movable page so here we can use
+ * __PageMovable because LRU page's mapping cannot have
+ * PAGE_MAPPING_MOVABLE.
+ */
+ } else if (unlikely(__PageMovable(page))) {
+ VM_BUG_ON_PAGE(!PageIsolated(page), page);
+ lock_page(page);
+ if (PageMovable(page))
+ putback_movable_page(page);
+ else
+ __ClearPageIsolated(page);
+ unlock_page(page);
+ put_page(page);
+ } else {
putback_lru_page(page);
+ }
}
}
@@ -601,7 +721,7 @@ void migrate_page_copy(struct page *newpage, struct page *page)
***********************************************************/
/*
- * Common logic to directly migrate a single page suitable for
+ * Common logic to directly migrate a single LRU page suitable for
* pages that do not use PagePrivate/PagePrivate2.
*
* Pages are locked upon entry and exit.
@@ -764,33 +884,69 @@ static int move_to_new_page(struct page *newpage, struct page *page,
enum migrate_mode mode)
{
struct address_space *mapping;
- int rc;
+ int rc = -EAGAIN;
+ bool is_lru = !__PageMovable(page);
VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);
mapping = page_mapping(page);
- if (!mapping)
- rc = migrate_page(mapping, newpage, page, mode);
- else if (mapping->a_ops->migratepage)
- /*
- * Most pages have a mapping and most filesystems provide a
- * migratepage callback. Anonymous pages are part of swap
- * space which also has its own migratepage callback. This
- * is the most common path for page migration.
- */
- rc = mapping->a_ops->migratepage(mapping, newpage, page, mode);
- else
- rc = fallback_migrate_page(mapping, newpage, page, mode);
+ /*
+ * In case of non-lru page, it could be released after
+ * isolation step. In that case, we shouldn't try
+ * fallback migration which is designed for LRU pages.
+ */
+ if (unlikely(!is_lru)) {
+ VM_BUG_ON_PAGE(!PageIsolated(page), page);
+ if (!PageMovable(page)) {
+ rc = MIGRATEPAGE_SUCCESS;
+ __ClearPageIsolated(page);
+ goto out;
+ }
+ }
+
+ if (likely(is_lru)) {
+ if (!mapping)
+ rc = migrate_page(mapping, newpage, page, mode);
+ else if (mapping->a_ops->migratepage)
+ /*
+ * Most pages have a mapping and most filesystems
+ * provide a migratepage callback. Anonymous pages
+ * are part of swap space which also has its own
+ * migratepage callback. This is the most common path
+ * for page migration.
+ */
+ rc = mapping->a_ops->migratepage(mapping, newpage,
+ page, mode);
+ else
+ rc = fallback_migrate_page(mapping, newpage,
+ page, mode);
+ } else {
+ rc = mapping->a_ops->migratepage(mapping, newpage,
+ page, mode);
+ WARN_ON_ONCE(rc == MIGRATEPAGE_SUCCESS &&
+ !PageIsolated(page));
+ }
/*
* When successful, old pagecache page->mapping must be cleared before
* page is freed; but stats require that PageAnon be left as PageAnon.
*/
if (rc == MIGRATEPAGE_SUCCESS) {
- if (!PageAnon(page))
+ if (__PageMovable(page)) {
+ VM_BUG_ON_PAGE(!PageIsolated(page), page);
+
+ /*
+ * We clear PG_movable under page_lock so any compactor
+ * cannot try to migrate this page.
+ */
+ __ClearPageIsolated(page);
+ }
+
+ if (!((unsigned long)page->mapping & PAGE_MAPPING_FLAGS))
page->mapping = NULL;
}
+out:
return rc;
}
@@ -800,6 +956,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
int rc = -EAGAIN;
int page_was_mapped = 0;
struct anon_vma *anon_vma = NULL;
+ bool is_lru = !__PageMovable(page);
if (!trylock_page(page)) {
if (!force || mode == MIGRATE_ASYNC)
@@ -891,6 +1048,11 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
goto out_unlock_both;
}
+ if (unlikely(!is_lru)) {
+ rc = move_to_new_page(newpage, page, mode);
+ goto out_unlock_both;
+ }
+
/*
* Corner case handling:
* 1. When a new swap-cache page is read into, it is added to the LRU
@@ -940,7 +1102,8 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
* list in here.
*/
if (rc == MIGRATEPAGE_SUCCESS) {
- if (unlikely(__is_movable_balloon_page(newpage)))
+ if (unlikely(__is_movable_balloon_page(newpage) ||
+ __PageMovable(newpage)))
put_page(newpage);
else
putback_lru_page(newpage);
@@ -986,6 +1149,12 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
/* page was freed from under us. So we are done. */
ClearPageActive(page);
ClearPageUnevictable(page);
+ if (unlikely(__PageMovable(page))) {
+ lock_page(page);
+ if (!PageMovable(page))
+ __ClearPageIsolated(page);
+ unlock_page(page);
+ }
if (put_new_page)
put_new_page(newpage, private);
else
@@ -1035,8 +1204,21 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
num_poisoned_pages_inc();
}
} else {
- if (rc != -EAGAIN)
- putback_lru_page(page);
+ if (rc != -EAGAIN) {
+ if (likely(!__PageMovable(page))) {
+ putback_lru_page(page);
+ goto put_new;
+ }
+
+ lock_page(page);
+ if (PageMovable(page))
+ putback_movable_page(page);
+ else
+ __ClearPageIsolated(page);
+ unlock_page(page);
+ put_page(page);
+ }
+put_new:
if (put_new_page)
put_new_page(newpage, private);
else
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 226db261215a..4af78fa1e665 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1034,8 +1034,10 @@ static bool free_pages_prepare(struct page *page, unsigned int order)
kmemcheck_free_shadow(page, order);
kasan_poison_free_pages(page, order);
- if (PageAnon(page))
+ /* If PageAnon or __PageMovable is true, clear page->mapping */
+ if ((unsigned long)page->mapping & PAGE_MAPPING_FLAGS)
page->mapping = NULL;
+
bad += free_pages_check(page);
for (i = 1; i < (1 << order); i++) {
if (compound)
diff --git a/mm/util.c b/mm/util.c
index 23025c8ad31f..732cf3c74646 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -398,10 +398,12 @@ struct address_space *page_mapping(struct page *page)
}
mapping = page->mapping;
- if ((unsigned long)mapping & PAGE_MAPPING_FLAGS)
+ if ((unsigned long)mapping & PAGE_MAPPING_ANON)
return NULL;
- return mapping;
+
+ return (void *)((unsigned long)mapping & ~PAGE_MAPPING_FLAGS);
}
+EXPORT_SYMBOL(page_mapping);
/* Slow path of page_mapcount() for compound pages */
int __page_mapcount(struct page *page)
--
1.9.1
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox