Linux virtualization list

Linux virtualization list
 help / color / mirror / Atom feed

* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Michael S. Tsirkin @ 2018-06-26 17:42 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, konrad.wilk,
	Jakub Kicinski, Samudrala, Sridhar, qemu-devel, virtualization,
	Siwei Liu, Venu Busireddy, Netdev, boris.ostrovsky, aaron.f.brown,
	Joao Martins
In-Reply-To: <20180626180316.3723422f.cohuck@redhat.com>

On Tue, Jun 26, 2018 at 06:03:16PM +0200, Cornelia Huck wrote:
> Ok, that makes me conclude that we definitely need to involve the
> libvirt folks before we proceed further with defining QEMU interfaces.

That's always a wise thing to do.

-- 
MST

^ permalink raw reply

* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Cornelia Huck @ 2018-06-26 16:03 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, konrad.wilk,
	Jakub Kicinski, Samudrala, Sridhar, qemu-devel, virtualization,
	Siwei Liu, Venu Busireddy, Netdev, boris.ostrovsky, aaron.f.brown,
	Joao Martins
In-Reply-To: <20180626183706-mutt-send-email-mst@kernel.org>

On Tue, 26 Jun 2018 18:38:51 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Tue, Jun 26, 2018 at 05:17:32PM +0200, Cornelia Huck wrote:
> > On Tue, 26 Jun 2018 04:50:25 +0300
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >   
> > > On Mon, Jun 25, 2018 at 10:54:09AM -0700, Samudrala, Sridhar wrote:  
> > > > > > > > Might not neccessarily be something wrong, but it's very limited to
> > > > > > > > prohibit the MAC of VF from changing when enslaved by failover.    
> > > > > > > You mean guest changing MAC? I'm not sure why we prohibit that.    
> > > > > > I think Sridhar and Jiri might be better person to answer it. My
> > > > > > impression was that sync'ing the MAC address change between all 3
> > > > > > devices is challenging, as the failover driver uses MAC address to
> > > > > > match net_device internally.    
> > > > 
> > > > Yes. The MAC address is assigned by the hypervisor and it needs to manage the movement
> > > > of the MAC between the PF and VF.  Allowing the guest to change the MAC will require
> > > > synchronization between the hypervisor and the PF/VF drivers. Most of the VF drivers
> > > > don't allow changing guest MAC unless it is a trusted VF.    
> > > 
> > > OK but it's a policy thing. Maybe it's a trusted VF. Who knows?
> > > For example I can see host just
> > > failing VIRTIO_NET_CTRL_MAC_ADDR_SET if it wants to block it.
> > > I'm not sure why VIRTIO_NET_F_STANDBY has to block it in the guest.
> > >   
> > 
> > So, what I get from this is that QEMU needs to be able to control all
> > of standby, uuid, and mac to accommodate the different setups
> > (respectively have libvirt/management software set it up). Is the host
> > able to find out respectively define whether a VF is trusted?  
> 
> You do it with ip link I think but QEMU doesn't normally do this,
> it relies on libvirt to poke at host kernel and supply the info.
> 

Ok, that makes me conclude that we definitely need to involve the
libvirt folks before we proceed further with defining QEMU interfaces.
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Michael S. Tsirkin @ 2018-06-26 15:38 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, konrad.wilk,
	Jakub Kicinski, Samudrala, Sridhar, qemu-devel, virtualization,
	Siwei Liu, Venu Busireddy, Netdev, boris.ostrovsky, aaron.f.brown,
	Joao Martins
In-Reply-To: <20180626171732.5038f53f.cohuck@redhat.com>

On Tue, Jun 26, 2018 at 05:17:32PM +0200, Cornelia Huck wrote:
> On Tue, 26 Jun 2018 04:50:25 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Mon, Jun 25, 2018 at 10:54:09AM -0700, Samudrala, Sridhar wrote:
> > > > > > > Might not neccessarily be something wrong, but it's very limited to
> > > > > > > prohibit the MAC of VF from changing when enslaved by failover.  
> > > > > > You mean guest changing MAC? I'm not sure why we prohibit that.  
> > > > > I think Sridhar and Jiri might be better person to answer it. My
> > > > > impression was that sync'ing the MAC address change between all 3
> > > > > devices is challenging, as the failover driver uses MAC address to
> > > > > match net_device internally.  
> > > 
> > > Yes. The MAC address is assigned by the hypervisor and it needs to manage the movement
> > > of the MAC between the PF and VF.  Allowing the guest to change the MAC will require
> > > synchronization between the hypervisor and the PF/VF drivers. Most of the VF drivers
> > > don't allow changing guest MAC unless it is a trusted VF.  
> > 
> > OK but it's a policy thing. Maybe it's a trusted VF. Who knows?
> > For example I can see host just
> > failing VIRTIO_NET_CTRL_MAC_ADDR_SET if it wants to block it.
> > I'm not sure why VIRTIO_NET_F_STANDBY has to block it in the guest.
> > 
> 
> So, what I get from this is that QEMU needs to be able to control all
> of standby, uuid, and mac to accommodate the different setups
> (respectively have libvirt/management software set it up). Is the host
> able to find out respectively define whether a VF is trusted?

You do it with ip link I think but QEMU doesn't normally do this,
it relies on libvirt to poke at host kernel and supply the info.

-- 
MST

^ permalink raw reply

* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Cornelia Huck @ 2018-06-26 15:17 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, konrad.wilk,
	Jakub Kicinski, Samudrala, Sridhar, qemu-devel, virtualization,
	Siwei Liu, Venu Busireddy, Netdev, boris.ostrovsky, aaron.f.brown,
	Joao Martins
In-Reply-To: <20180626044650-mutt-send-email-mst@kernel.org>

On Tue, 26 Jun 2018 04:50:25 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Mon, Jun 25, 2018 at 10:54:09AM -0700, Samudrala, Sridhar wrote:
> > > > > > Might not neccessarily be something wrong, but it's very limited to
> > > > > > prohibit the MAC of VF from changing when enslaved by failover.  
> > > > > You mean guest changing MAC? I'm not sure why we prohibit that.  
> > > > I think Sridhar and Jiri might be better person to answer it. My
> > > > impression was that sync'ing the MAC address change between all 3
> > > > devices is challenging, as the failover driver uses MAC address to
> > > > match net_device internally.  
> > 
> > Yes. The MAC address is assigned by the hypervisor and it needs to manage the movement
> > of the MAC between the PF and VF.  Allowing the guest to change the MAC will require
> > synchronization between the hypervisor and the PF/VF drivers. Most of the VF drivers
> > don't allow changing guest MAC unless it is a trusted VF.  
> 
> OK but it's a policy thing. Maybe it's a trusted VF. Who knows?
> For example I can see host just
> failing VIRTIO_NET_CTRL_MAC_ADDR_SET if it wants to block it.
> I'm not sure why VIRTIO_NET_F_STANDBY has to block it in the guest.
> 

So, what I get from this is that QEMU needs to be able to control all
of standby, uuid, and mac to accommodate the different setups
(respectively have libvirt/management software set it up). Is the host
able to find out respectively define whether a VF is trusted?

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Cornelia Huck @ 2018-06-26 15:08 UTC (permalink / raw)
  To: Siwei Liu
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Michael S. Tsirkin,
	Jakub Kicinski, Samudrala, Sridhar, konrad.wilk, qemu-devel,
	virtualization, Venu Busireddy, Netdev, boris.ostrovsky,
	aaron.f.brown, Joao Martins
In-Reply-To: <CADGSJ21HNd4VYNcCt4H0gJ_CCx1GUFpHrDof2N=4WqhD24Zc2A@mail.gmail.com>

On Fri, 22 Jun 2018 17:05:04 -0700
Siwei Liu <loseweigh@gmail.com> wrote:

> On Fri, Jun 22, 2018 at 3:33 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > I suspect the diveregence will be lost on most users though
> > simply because they don't even care about vfio. They just
> > want things to go fast.  
> 
> Like Jason said, VF isn't faster than virtio-net in all cases. It
> depends on the workload and performance metrics: throughput, latency,
> or packet per second.

So, will it be guest/admin-controllable then where the traffic flows
through? Just because we do have a vf available after negotiation of
the feature bit, it does not necessarily mean we want to use it? Do we
(the guest) even want to make it visible in that case?

^ permalink raw reply

* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Michael S. Tsirkin @ 2018-06-26 13:54 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, konrad.wilk,
	Jakub Kicinski, Samudrala, Sridhar, qemu-devel, virtualization,
	Siwei Liu, Venu Busireddy, Netdev, boris.ostrovsky, aaron.f.brown,
	Joao Martins
In-Reply-To: <20180626135509.23154723.cohuck@redhat.com>

On Tue, Jun 26, 2018 at 01:55:09PM +0200, Cornelia Huck wrote:
> On Tue, 26 Jun 2018 04:46:03 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Mon, Jun 25, 2018 at 11:55:12AM +0200, Cornelia Huck wrote:
> > > On Fri, 22 Jun 2018 22:05:50 +0300
> > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > >   
> > > > On Fri, Jun 22, 2018 at 05:09:55PM +0200, Cornelia Huck wrote:  
> > > > > On Thu, 21 Jun 2018 21:20:13 +0300
> > > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > >     
> > > > > > On Thu, Jun 21, 2018 at 04:59:13PM +0200, Cornelia Huck wrote:    
> > > > > > > OK, so what about the following:
> > > > > > > 
> > > > > > > - introduce a new feature bit, VIRTIO_NET_F_STANDBY_UUID that indicates
> > > > > > >   that we have a new uuid field in the virtio-net config space
> > > > > > > - in QEMU, add a property for virtio-net that allows to specify a uuid,
> > > > > > >   offer VIRTIO_NET_F_STANDBY_UUID if set
> > > > > > > - when configuring, set the property to the group UUID of the vfio-pci
> > > > > > >   device
> > > > > > > - in the guest, use the uuid from the virtio-net device's config space
> > > > > > >   if applicable; else, fall back to matching by MAC as done today
> > > > > > > 
> > > > > > > That should work for all virtio transports.      
> > > > > > 
> > > > > > True. I'm a bit unhappy that it's virtio net specific though
> > > > > > since down the road I expect we'll have a very similar feature
> > > > > > for scsi (and maybe others).
> > > > > > 
> > > > > > But we do not have a way to have fields that are portable
> > > > > > both across devices and transports, and I think it would
> > > > > > be a useful addition. How would this work though? Any idea?    
> > > > > 
> > > > > Can we introduce some kind of device-independent config space area?
> > > > > Pushing back the device-specific config space by a certain value if the
> > > > > appropriate feature is negotiated and use that for things like the uuid?    
> > > > 
> > > > So config moves back and forth?
> > > > Reminds me of the msi vector mess we had with pci.  
> > > 
> > > Yes, that would be a bit unfortunate.
> > >   
> > > > I'd rather have every transport add a new config.  
> > > 
> > > You mean via different mechanisms?  
> > 
> > I guess so.
> 
> Is there an alternate mechanism for pci to use? (Not so familiar with
> it.)

We have a device and transport config capability.
We could add a generic config capability too.

> For ccw, this needs more thought. We already introduced two commands
> for reading/writing the config space (a concept that does not really
> exist on s390). There's the generic read configuration data command,
> but the data returned by it is not really generic enough. So we would
> need one new command (or two, if we need to write as well). I'm not
> sure about that yet.
> 
> > 
> > > >   
> > > > > But regardless of that, I'm not sure whether extending this approach to
> > > > > other device types is the way to go. Tying together two different
> > > > > devices is creating complicated situations at least in the hypervisor
> > > > > (even if it's fairly straightforward in the guest). [I have not come
> > > > > around again to look at the "how to handle visibility in QEMU"
> > > > > questions due to lack of cycles, sorry about that.]
> > > > > 
> > > > > So, what's the goal of this approach? Only to allow migration with
> > > > > vfio-pci, or also to plug in a faster device and use it instead of an
> > > > > already attached paravirtualized device?    
> > > > 
> > > > These are two sides of the same coin, I think the second approach
> > > > is closer to what we are doing here.  
> > > 
> > > Thinking about it, do we need any knob to keep the vfio device
> > > invisible if the virtio device is not present? IOW, how does the
> > > hypervisor know that the vfio device is supposed to be paired with a
> > > virtio device? It seems we need an explicit tie-in.  
> > 
> > If we are going the way of the bridge, both bridge and
> > virtio would have some kind of id.
> 
> So the presence of the id would indicate "this is one part of a pair"?

I guess so, yes.

> > 
> > When pairing using mac, I'm less sure. PAss vfio device mac to qemu
> > as a property?
> 
> That feels a bit odd. "This is the vfio device's mac, use this instead
> of your usual mac property"? As we have not designed the QEMU interface
> yet, just go with the id in any case? The guest can still match by mac.

OK

> > > > > What about migration of vfio devices that are not easily replaced by a
> > > > > paravirtualized device? I'm thinking of vfio-ccw, where our main (and
> > > > > currently only) supported device is dasd (disks) -- which can do a lot
> > > > > of specialized things that virtio-blk does not support (and should not
> > > > > or even cannot support).    
> > > > 
> > > > But maybe virtio-scsi can?  
> > > 
> > > I don't think so. Dasds have some channel commands that don't map
> > > easily to scsi commands.  
> > 
> > There's always a choice of adding these to the spec.
> > E.g. FC extensions were proposed, I don't remember why they
> > are still stuck.
> 
> FC extensions are a completely different kind of enhancements, though.
> For a start, they are not unique to a certain transport.
> 
> Also, we have a whole list of special dasd issues. Weird disk layout
> for eckd, low-level disk formatting, etc. (See the list of commands in
> drivers/s390/block/dasd_eckd.h for an idea. There's also no public
> documentation AFAICS; https://en.wikipedia.org/wiki/ECKD does not link
> to anything interesting.) I don't think we want to cram stuff like this
> into a completely different framework.

^ permalink raw reply

* Re: [PATCH v34 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
From: Michael S. Tsirkin @ 2018-06-26 13:34 UTC (permalink / raw)
  To: Wei Wang
  Cc: yang.zhang.wz, virtio-dev, riel, quan.xu0, kvm, nilal,
	liliang.opensource, linux-kernel, mhocko, linux-mm, pbonzini,
	akpm, virtualization, torvalds
In-Reply-To: <5B323140.1000306@intel.com>

On Tue, Jun 26, 2018 at 08:27:44PM +0800, Wei Wang wrote:
> On 06/26/2018 11:56 AM, Michael S. Tsirkin wrote:
> > On Tue, Jun 26, 2018 at 11:46:35AM +0800, Wei Wang wrote:
> > 
> 
> > > 
> > > > 
> > > > > +	if (!arrays)
> > > > > +		return NULL;
> > > > > +
> > > > > +	for (i = 0; i < max_array_num; i++) {
> > > > So we are getting a ton of memory here just to free it up a bit later.
> > > > Why doesn't get_from_free_page_list get the pages from free list for us?
> > > > We could also avoid the 1st allocation then - just build a list
> > > > of these.
> > > That wouldn't be a good choice for us. If we check how the regular
> > > allocation works, there are many many things we need to consider when pages
> > > are allocated to users.
> > > For example, we need to take care of the nr_free
> > > counter, we need to check the watermark and perform the related actions.
> > > Also the folks working on arch_alloc_page to monitor page allocation
> > > activities would get a surprise..if page allocation is allowed to work in
> > > this way.
> > > 
> > mm/ code is well positioned to handle all this correctly.
> 
> I'm afraid that would be a re-implementation of the alloc functions,

A re-factoring - you can share code. The main difference is locking.

> and
> that would be much more complex than what we have. I think your idea of
> passing a list of pages is better.
> 
> Best,
> Wei

How much memory is this allocating anyway?

-- 
MST

^ permalink raw reply

* Re: [PATCH v34 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
From: Wei Wang @ 2018-06-26 12:27 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: yang.zhang.wz, virtio-dev, riel, quan.xu0, kvm, nilal,
	liliang.opensource, linux-kernel, mhocko, linux-mm, pbonzini,
	akpm, virtualization, torvalds
In-Reply-To: <20180626064338-mutt-send-email-mst@kernel.org>

On 06/26/2018 11:56 AM, Michael S. Tsirkin wrote:
> On Tue, Jun 26, 2018 at 11:46:35AM +0800, Wei Wang wrote:
>

>>
>>>
>>>> +	if (!arrays)
>>>> +		return NULL;
>>>> +
>>>> +	for (i = 0; i < max_array_num; i++) {
>>> So we are getting a ton of memory here just to free it up a bit later.
>>> Why doesn't get_from_free_page_list get the pages from free list for us?
>>> We could also avoid the 1st allocation then - just build a list
>>> of these.
>> That wouldn't be a good choice for us. If we check how the regular
>> allocation works, there are many many things we need to consider when pages
>> are allocated to users.
>> For example, we need to take care of the nr_free
>> counter, we need to check the watermark and perform the related actions.
>> Also the folks working on arch_alloc_page to monitor page allocation
>> activities would get a surprise..if page allocation is allowed to work in
>> this way.
>>
> mm/ code is well positioned to handle all this correctly.

I'm afraid that would be a re-implementation of the alloc functions, and 
that would be much more complex than what we have. I think your idea of 
passing a list of pages is better.

Best,
Wei

^ permalink raw reply

* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Cornelia Huck @ 2018-06-26 11:55 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, konrad.wilk,
	Jakub Kicinski, Samudrala, Sridhar, qemu-devel, virtualization,
	Siwei Liu, Venu Busireddy, Netdev, boris.ostrovsky, aaron.f.brown,
	Joao Martins
In-Reply-To: <20180626043839-mutt-send-email-mst@kernel.org>

On Tue, 26 Jun 2018 04:46:03 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Mon, Jun 25, 2018 at 11:55:12AM +0200, Cornelia Huck wrote:
> > On Fri, 22 Jun 2018 22:05:50 +0300
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >   
> > > On Fri, Jun 22, 2018 at 05:09:55PM +0200, Cornelia Huck wrote:  
> > > > On Thu, 21 Jun 2018 21:20:13 +0300
> > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > >     
> > > > > On Thu, Jun 21, 2018 at 04:59:13PM +0200, Cornelia Huck wrote:    
> > > > > > OK, so what about the following:
> > > > > > 
> > > > > > - introduce a new feature bit, VIRTIO_NET_F_STANDBY_UUID that indicates
> > > > > >   that we have a new uuid field in the virtio-net config space
> > > > > > - in QEMU, add a property for virtio-net that allows to specify a uuid,
> > > > > >   offer VIRTIO_NET_F_STANDBY_UUID if set
> > > > > > - when configuring, set the property to the group UUID of the vfio-pci
> > > > > >   device
> > > > > > - in the guest, use the uuid from the virtio-net device's config space
> > > > > >   if applicable; else, fall back to matching by MAC as done today
> > > > > > 
> > > > > > That should work for all virtio transports.      
> > > > > 
> > > > > True. I'm a bit unhappy that it's virtio net specific though
> > > > > since down the road I expect we'll have a very similar feature
> > > > > for scsi (and maybe others).
> > > > > 
> > > > > But we do not have a way to have fields that are portable
> > > > > both across devices and transports, and I think it would
> > > > > be a useful addition. How would this work though? Any idea?    
> > > > 
> > > > Can we introduce some kind of device-independent config space area?
> > > > Pushing back the device-specific config space by a certain value if the
> > > > appropriate feature is negotiated and use that for things like the uuid?    
> > > 
> > > So config moves back and forth?
> > > Reminds me of the msi vector mess we had with pci.  
> > 
> > Yes, that would be a bit unfortunate.
> >   
> > > I'd rather have every transport add a new config.  
> > 
> > You mean via different mechanisms?  
> 
> I guess so.

Is there an alternate mechanism for pci to use? (Not so familiar with
it.)

For ccw, this needs more thought. We already introduced two commands
for reading/writing the config space (a concept that does not really
exist on s390). There's the generic read configuration data command,
but the data returned by it is not really generic enough. So we would
need one new command (or two, if we need to write as well). I'm not
sure about that yet.

> 
> > >   
> > > > But regardless of that, I'm not sure whether extending this approach to
> > > > other device types is the way to go. Tying together two different
> > > > devices is creating complicated situations at least in the hypervisor
> > > > (even if it's fairly straightforward in the guest). [I have not come
> > > > around again to look at the "how to handle visibility in QEMU"
> > > > questions due to lack of cycles, sorry about that.]
> > > > 
> > > > So, what's the goal of this approach? Only to allow migration with
> > > > vfio-pci, or also to plug in a faster device and use it instead of an
> > > > already attached paravirtualized device?    
> > > 
> > > These are two sides of the same coin, I think the second approach
> > > is closer to what we are doing here.  
> > 
> > Thinking about it, do we need any knob to keep the vfio device
> > invisible if the virtio device is not present? IOW, how does the
> > hypervisor know that the vfio device is supposed to be paired with a
> > virtio device? It seems we need an explicit tie-in.  
> 
> If we are going the way of the bridge, both bridge and
> virtio would have some kind of id.

So the presence of the id would indicate "this is one part of a pair"?

> 
> When pairing using mac, I'm less sure. PAss vfio device mac to qemu
> as a property?

That feels a bit odd. "This is the vfio device's mac, use this instead
of your usual mac property"? As we have not designed the QEMU interface
yet, just go with the id in any case? The guest can still match by mac.

> > > > What about migration of vfio devices that are not easily replaced by a
> > > > paravirtualized device? I'm thinking of vfio-ccw, where our main (and
> > > > currently only) supported device is dasd (disks) -- which can do a lot
> > > > of specialized things that virtio-blk does not support (and should not
> > > > or even cannot support).    
> > > 
> > > But maybe virtio-scsi can?  
> > 
> > I don't think so. Dasds have some channel commands that don't map
> > easily to scsi commands.  
> 
> There's always a choice of adding these to the spec.
> E.g. FC extensions were proposed, I don't remember why they
> are still stuck.

FC extensions are a completely different kind of enhancements, though.
For a start, they are not unique to a certain transport.

Also, we have a whole list of special dasd issues. Weird disk layout
for eckd, low-level disk formatting, etc. (See the list of commands in
drivers/s390/block/dasd_eckd.h for an idea. There's also no public
documentation AFAICS; https://en.wikipedia.org/wiki/ECKD does not link
to anything interesting.) I don't think we want to cram stuff like this
into a completely different framework.

^ permalink raw reply

* Re: [PATCH v6 3/3] x86: paravirt: make native_save_fl extern inline
From: Ingo Molnar @ 2018-06-26  7:13 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: Kate Stewart, andrea.parri, linux-efi, brijesh.singh, J. Kiszka,
	Josh Poimboeuf, Will Deacon, jarkko.sakkinen, virtualization,
	Masahiro Yamada, Manoj Gupta, hpa, boris.ostrovsky,
	Thiebaud Weksteen, mawilcox, x86, akataria, Greg Hackmann, mingo,
	Alistair Strachan, David Rientjes, geert, thomas.lendacky,
	Arnd Bergmann, Linux Kbuild mailing list, Philippe Ombredanne,
	rostedt
In-Reply-To: <CAKwvOdn5gnB5cK=cy6weKyxAn4SNEdZTT_MtT_D_nHRG5EDZ4g@mail.gmail.com>


* Nick Desaulniers <ndesaulniers@google.com> wrote:

> On Thu, Jun 21, 2018 at 7:24 PM Ingo Molnar <mingo@kernel.org> wrote:
> > * Nick Desaulniers <ndesaulniers@google.com> wrote:
> >
> > > native_save_fl() is marked static inline, but by using it as
> > > a function pointer in arch/x86/kernel/paravirt.c, it MUST be outlined.
> >
> > > --- a/arch/x86/include/asm/irqflags.h
> > > +++ b/arch/x86/include/asm/irqflags.h
> > > @@ -13,7 +13,7 @@
> > >   * Interrupt control:
> > >   */
> > >
> > > -static inline unsigned long native_save_fl(void)
> > > +extern inline unsigned long native_save_fl(void)
> > >  {
> > >       unsigned long flags;
> > >
> >
> > What's the code generation effect of this on say a defconfig kernel vmlinux with
> > paravirt enabled?
> 
> Starting with this patch set applied:
> $ make CC=gcc-8 -j46
> $ objdump -d vmlinux | grep native_save_fl --context=3
> ffffffff81059140 <native_save_fl>:
> ffffffff81059140: 9c                    pushfq
> ffffffff81059141: 58                    pop    %rax
> ffffffff81059142: c3                    retq
> $ git checkout HEAD~3
> $ make CC=gcc-8 -j46
> $ objdump -d vmlinux | grep native_save_fl --context=3
> ffffffff81079410 <native_save_fl>:
> ffffffff81079410: 9c                    pushfq
> ffffffff81079411: 58                    pop    %rax
> ffffffff81079412: c3                    retq
> 
> Mainly, this is to prevent the compiler from adding a stack protector
> to the outlined version, as the stack protector clobbers %rcx, but
> paravirt expects %rcx to be preserved. More info can be found:
> https://lkml.org/lkml/2018/5/24/1242--

Ok!

Acked-by: Ingo Molnar <mingo@kernel.org>

What's the planned upstreaming route for these patches/fixes?

Thanks,

	Ingo

^ permalink raw reply

* [PATCH net-next v2] net: vhost: improve performance when enable busyloop
From: xiangxia.m.yue @ 2018-06-26  5:17 UTC (permalink / raw)
  To: jasowang; +Cc: netdev, Tonghao Zhang, virtualization

From: Tonghao Zhang <xiangxia.m.yue@gmail.com>

This patch improves the guest receive performance from
host. On the handle_tx side, we poll the sock receive
queue at the same time. handle_rx do that in the same way.

For avoiding deadlock, change the code to lock the vq one
by one and use the VHOST_NET_VQ_XX as a subclass for
mutex_lock_nested. With the patch, qemu can set differently
the busyloop_timeout for rx or tx queue.

We set the poll-us=100us and use the iperf3 to test
its throughput. The iperf3 command is shown as below.

on the guest:
iperf3  -s -D

on the host:
iperf3  -c 192.168.1.100 -i 1 -P 10 -t 10 -M 1400

* With the patch:     23.1 Gbits/sec
* Without the patch:  12.7 Gbits/sec

Signed-off-by: Tonghao Zhang <zhangtonghao@didichuxing.com>
---
 drivers/vhost/net.c   | 106 +++++++++++++++++++++++++++-----------------------
 drivers/vhost/vhost.c |  24 ++++--------
 2 files changed, 66 insertions(+), 64 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index e7cf7d2..38e9adb 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -429,22 +429,62 @@ static int vhost_net_enable_vq(struct vhost_net *n,
 	return vhost_poll_start(poll, sock->file);
 }
 
+static int sk_has_rx_data(struct sock *sk)
+{
+	struct socket *sock = sk->sk_socket;
+
+	if (sock->ops->peek_len)
+		return sock->ops->peek_len(sock);
+
+	return skb_queue_empty(&sk->sk_receive_queue);
+}
+
+static void vhost_net_busy_poll(struct vhost_net *net,
+				struct vhost_virtqueue *rvq,
+				struct vhost_virtqueue *tvq,
+				bool rx)
+{
+	unsigned long uninitialized_var(endtime);
+	struct socket *sock = rvq->private_data;
+	struct vhost_virtqueue *vq = rx ? tvq : rvq;
+	unsigned long busyloop_timeout = rx ? rvq->busyloop_timeout :
+					      tvq->busyloop_timeout;
+
+	mutex_lock_nested(&vq->mutex, rx ? VHOST_NET_VQ_TX: VHOST_NET_VQ_RX);
+	vhost_disable_notify(&net->dev, vq);
+
+	preempt_disable();
+	endtime = busy_clock() + busyloop_timeout;
+	while (vhost_can_busy_poll(tvq->dev, endtime) &&
+	       !(sock && sk_has_rx_data(sock->sk)) &&
+	       vhost_vq_avail_empty(tvq->dev, tvq))
+		cpu_relax();
+	preempt_enable();
+
+	if ((rx && !vhost_vq_avail_empty(&net->dev, vq)) ||
+	    (!rx && (sock && sk_has_rx_data(sock->sk)))) {
+		vhost_poll_queue(&vq->poll);
+	} else if (unlikely(vhost_enable_notify(&net->dev, vq))) {
+		vhost_disable_notify(&net->dev, vq);
+		vhost_poll_queue(&vq->poll);
+	}
+
+	mutex_unlock(&vq->mutex);
+}
+
 static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
 				    struct vhost_virtqueue *vq,
 				    struct iovec iov[], unsigned int iov_size,
 				    unsigned int *out_num, unsigned int *in_num)
 {
-	unsigned long uninitialized_var(endtime);
+	struct vhost_net_virtqueue *nvq_rx = &net->vqs[VHOST_NET_VQ_RX];
+
 	int r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
 				  out_num, in_num, NULL, NULL);
 
 	if (r == vq->num && vq->busyloop_timeout) {
-		preempt_disable();
-		endtime = busy_clock() + vq->busyloop_timeout;
-		while (vhost_can_busy_poll(vq->dev, endtime) &&
-		       vhost_vq_avail_empty(vq->dev, vq))
-			cpu_relax();
-		preempt_enable();
+		vhost_net_busy_poll(net, &nvq_rx->vq, vq, false);
+
 		r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
 				      out_num, in_num, NULL, NULL);
 	}
@@ -484,7 +524,7 @@ static void handle_tx(struct vhost_net *net)
 	bool zcopy, zcopy_used;
 	int sent_pkts = 0;
 
-	mutex_lock(&vq->mutex);
+	mutex_lock_nested(&vq->mutex, VHOST_NET_VQ_TX);
 	sock = vq->private_data;
 	if (!sock)
 		goto out;
@@ -621,16 +661,6 @@ static int peek_head_len(struct vhost_net_virtqueue *rvq, struct sock *sk)
 	return len;
 }
 
-static int sk_has_rx_data(struct sock *sk)
-{
-	struct socket *sock = sk->sk_socket;
-
-	if (sock->ops->peek_len)
-		return sock->ops->peek_len(sock);
-
-	return skb_queue_empty(&sk->sk_receive_queue);
-}
-
 static void vhost_rx_signal_used(struct vhost_net_virtqueue *nvq)
 {
 	struct vhost_virtqueue *vq = &nvq->vq;
@@ -645,39 +675,19 @@ static void vhost_rx_signal_used(struct vhost_net_virtqueue *nvq)
 
 static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk)
 {
-	struct vhost_net_virtqueue *rvq = &net->vqs[VHOST_NET_VQ_RX];
-	struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
-	struct vhost_virtqueue *vq = &nvq->vq;
-	unsigned long uninitialized_var(endtime);
-	int len = peek_head_len(rvq, sk);
-
-	if (!len && vq->busyloop_timeout) {
-		/* Flush batched heads first */
-		vhost_rx_signal_used(rvq);
-		/* Both tx vq and rx socket were polled here */
-		mutex_lock_nested(&vq->mutex, 1);
-		vhost_disable_notify(&net->dev, vq);
-
-		preempt_disable();
-		endtime = busy_clock() + vq->busyloop_timeout;
-
-		while (vhost_can_busy_poll(&net->dev, endtime) &&
-		       !sk_has_rx_data(sk) &&
-		       vhost_vq_avail_empty(&net->dev, vq))
-			cpu_relax();
+	struct vhost_net_virtqueue *nvq_rx = &net->vqs[VHOST_NET_VQ_RX];
+	struct vhost_net_virtqueue *nvq_tx = &net->vqs[VHOST_NET_VQ_TX];
 
-		preempt_enable();
+	int len = peek_head_len(nvq_rx, sk);
 
-		if (!vhost_vq_avail_empty(&net->dev, vq))
-			vhost_poll_queue(&vq->poll);
-		else if (unlikely(vhost_enable_notify(&net->dev, vq))) {
-			vhost_disable_notify(&net->dev, vq);
-			vhost_poll_queue(&vq->poll);
-		}
+	if (!len && nvq_rx->vq.busyloop_timeout) {
+		/* Flush batched heads first */
+		vhost_rx_signal_used(nvq_rx);
 
-		mutex_unlock(&vq->mutex);
+		/* Both tx vq and rx socket were polled here */
+		vhost_net_busy_poll(net, &nvq_rx->vq, &nvq_tx->vq, true);
 
-		len = peek_head_len(rvq, sk);
+		len = peek_head_len(nvq_rx, sk);
 	}
 
 	return len;
@@ -789,7 +799,7 @@ static void handle_rx(struct vhost_net *net)
 	__virtio16 num_buffers;
 	int recv_pkts = 0;
 
-	mutex_lock_nested(&vq->mutex, 0);
+	mutex_lock_nested(&vq->mutex, VHOST_NET_VQ_RX);
 	sock = vq->private_data;
 	if (!sock)
 		goto out;
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 895eaa2..1716b10 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -294,8 +294,11 @@ static void vhost_vq_meta_reset(struct vhost_dev *d)
 {
 	int i;
 
-	for (i = 0; i < d->nvqs; ++i)
+	for (i = 0; i < d->nvqs; ++i) {
+		mutex_lock(&d->vqs[i]->mutex);
 		__vhost_vq_meta_reset(d->vqs[i]);
+		mutex_unlock(&d->vqs[i]->mutex);
+	}
 }
 
 static void vhost_vq_reset(struct vhost_dev *dev,
@@ -887,19 +890,6 @@ static inline void __user *__vhost_get_user(struct vhost_virtqueue *vq,
 #define vhost_get_used(vq, x, ptr) \
 	vhost_get_user(vq, x, ptr, VHOST_ADDR_USED)
 
-static void vhost_dev_lock_vqs(struct vhost_dev *d)
-{
-	int i = 0;
-	for (i = 0; i < d->nvqs; ++i)
-		mutex_lock_nested(&d->vqs[i]->mutex, i);
-}
-
-static void vhost_dev_unlock_vqs(struct vhost_dev *d)
-{
-	int i = 0;
-	for (i = 0; i < d->nvqs; ++i)
-		mutex_unlock(&d->vqs[i]->mutex);
-}
 
 static int vhost_new_umem_range(struct vhost_umem *umem,
 				u64 start, u64 size, u64 end,
@@ -950,7 +940,11 @@ static void vhost_iotlb_notify_vq(struct vhost_dev *d,
 		if (msg->iova <= vq_msg->iova &&
 		    msg->iova + msg->size - 1 > vq_msg->iova &&
 		    vq_msg->type == VHOST_IOTLB_MISS) {
+
+			mutex_lock(&node->vq->mutex);
 			vhost_poll_queue(&node->vq->poll);
+			mutex_unlock(&node->vq->mutex);
+
 			list_del(&node->node);
 			kfree(node);
 		}
@@ -982,7 +976,6 @@ static int vhost_process_iotlb_msg(struct vhost_dev *dev,
 	int ret = 0;
 
 	mutex_lock(&dev->mutex);
-	vhost_dev_lock_vqs(dev);
 	switch (msg->type) {
 	case VHOST_IOTLB_UPDATE:
 		if (!dev->iotlb) {
@@ -1016,7 +1009,6 @@ static int vhost_process_iotlb_msg(struct vhost_dev *dev,
 		break;
 	}
 
-	vhost_dev_unlock_vqs(dev);
 	mutex_unlock(&dev->mutex);
 
 	return ret;
-- 
1.8.3.1

^ permalink raw reply related

* Re: [PATCH v34 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
From: Michael S. Tsirkin @ 2018-06-26  3:56 UTC (permalink / raw)
  To: Wei Wang
  Cc: yang.zhang.wz, virtio-dev, riel, quan.xu0, kvm, nilal,
	liliang.opensource, linux-kernel, mhocko, linux-mm, pbonzini,
	akpm, virtualization, torvalds
In-Reply-To: <5B31B71B.6080709@intel.com>

On Tue, Jun 26, 2018 at 11:46:35AM +0800, Wei Wang wrote:
> On 06/26/2018 09:37 AM, Michael S. Tsirkin wrote:
> > On Mon, Jun 25, 2018 at 08:05:10PM +0800, Wei Wang wrote:
> > 
> > > @@ -326,17 +353,6 @@ static void stats_handle_request(struct virtio_balloon *vb)
> > >   	virtqueue_kick(vq);
> > >   }
> > > -static void virtballoon_changed(struct virtio_device *vdev)
> > > -{
> > > -	struct virtio_balloon *vb = vdev->priv;
> > > -	unsigned long flags;
> > > -
> > > -	spin_lock_irqsave(&vb->stop_update_lock, flags);
> > > -	if (!vb->stop_update)
> > > -		queue_work(system_freezable_wq, &vb->update_balloon_size_work);
> > > -	spin_unlock_irqrestore(&vb->stop_update_lock, flags);
> > > -}
> > > -
> > >   static inline s64 towards_target(struct virtio_balloon *vb)
> > >   {
> > >   	s64 target;
> > > @@ -353,6 +369,35 @@ static inline s64 towards_target(struct virtio_balloon *vb)
> > >   	return target - vb->num_pages;
> > >   }
> > > +static void virtballoon_changed(struct virtio_device *vdev)
> > > +{
> > > +	struct virtio_balloon *vb = vdev->priv;
> > > +	unsigned long flags;
> > > +	s64 diff = towards_target(vb);
> > > +
> > > +	if (diff) {
> > > +		spin_lock_irqsave(&vb->stop_update_lock, flags);
> > > +		if (!vb->stop_update)
> > > +			queue_work(system_freezable_wq,
> > > +				   &vb->update_balloon_size_work);
> > > +		spin_unlock_irqrestore(&vb->stop_update_lock, flags);
> > > +	}
> > > +
> > > +	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
> > > +		virtio_cread(vdev, struct virtio_balloon_config,
> > > +			     free_page_report_cmd_id, &vb->cmd_id_received);
> > > +		if (vb->cmd_id_received !=
> > > +		    VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID &&
> > > +		    vb->cmd_id_received != vb->cmd_id_active) {
> > > +			spin_lock_irqsave(&vb->stop_update_lock, flags);
> > > +			if (!vb->stop_update)
> > > +				queue_work(vb->balloon_wq,
> > > +					   &vb->report_free_page_work);
> > > +			spin_unlock_irqrestore(&vb->stop_update_lock, flags);
> > > +		}
> > > +	}
> > > +}
> > > +
> > >   static void update_balloon_size(struct virtio_balloon *vb)
> > >   {
> > >   	u32 actual = vb->num_pages;
> > > @@ -425,44 +470,253 @@ static void update_balloon_size_func(struct work_struct *work)
> > >   		queue_work(system_freezable_wq, work);
> > >   }
> > > +static void free_page_vq_cb(struct virtqueue *vq)
> > > +{
> > > +	unsigned int len;
> > > +	void *buf;
> > > +	struct virtio_balloon *vb = vq->vdev->priv;
> > > +
> > > +	while (1) {
> > > +		buf = virtqueue_get_buf(vq, &len);
> > > +
> > > +		if (!buf || buf == &vb->cmd_start || buf == &vb->cmd_stop)
> > > +			break;
> > If there's any buffer after this one we might never get another
> > callback.
> 
> I think every used buffer can get the callback, because host takes from the
> arrays one by one, and puts back each with a vq notify.

It's probabky racy even in this case. Besides, host is free to do it in
any way that's legal in spec.

> 
> 
> > > +		free_pages((unsigned long)buf, ARRAY_ALLOC_ORDER);
> > > +	}
> > > +}
> > > +
> > >   static int init_vqs(struct virtio_balloon *vb)
> > >   {
> > > -	struct virtqueue *vqs[3];
> > > -	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
> > > -	static const char * const names[] = { "inflate", "deflate", "stats" };
> > > -	int err, nvqs;
> > > +	struct virtqueue *vqs[VIRTIO_BALLOON_VQ_MAX];
> > > +	vq_callback_t *callbacks[VIRTIO_BALLOON_VQ_MAX];
> > > +	const char *names[VIRTIO_BALLOON_VQ_MAX];
> > > +	struct scatterlist sg;
> > > +	int ret;
> > >   	/*
> > > -	 * We expect two virtqueues: inflate and deflate, and
> > > -	 * optionally stat.
> > > +	 * Inflateq and deflateq are used unconditionally. The names[]
> > > +	 * will be NULL if the related feature is not enabled, which will
> > > +	 * cause no allocation for the corresponding virtqueue in find_vqs.
> > >   	 */
> > > -	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
> > > -	err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL);
> > > -	if (err)
> > > -		return err;
> > > +	callbacks[VIRTIO_BALLOON_VQ_INFLATE] = balloon_ack;
> > > +	names[VIRTIO_BALLOON_VQ_INFLATE] = "inflate";
> > > +	callbacks[VIRTIO_BALLOON_VQ_DEFLATE] = balloon_ack;
> > > +	names[VIRTIO_BALLOON_VQ_DEFLATE] = "deflate";
> > > +	names[VIRTIO_BALLOON_VQ_STATS] = NULL;
> > > +	names[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
> > > -	vb->inflate_vq = vqs[0];
> > > -	vb->deflate_vq = vqs[1];
> > >   	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> > > -		struct scatterlist sg;
> > > -		unsigned int num_stats;
> > > -		vb->stats_vq = vqs[2];
> > > +		names[VIRTIO_BALLOON_VQ_STATS] = "stats";
> > > +		callbacks[VIRTIO_BALLOON_VQ_STATS] = stats_request;
> > > +	}
> > > +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
> > > +		names[VIRTIO_BALLOON_VQ_FREE_PAGE] = "free_page_vq";
> > > +		callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = free_page_vq_cb;
> > > +	}
> > > +
> > > +	ret = vb->vdev->config->find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX,
> > > +					 vqs, callbacks, names, NULL, NULL);
> > > +	if (ret)
> > > +		return ret;
> > > +
> > > +	vb->inflate_vq = vqs[VIRTIO_BALLOON_VQ_INFLATE];
> > > +	vb->deflate_vq = vqs[VIRTIO_BALLOON_VQ_DEFLATE];
> > > +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> > > +		vb->stats_vq = vqs[VIRTIO_BALLOON_VQ_STATS];
> > >   		/*
> > >   		 * Prime this virtqueue with one buffer so the hypervisor can
> > >   		 * use it to signal us later (it can't be broken yet!).
> > >   		 */
> > > -		num_stats = update_balloon_stats(vb);
> > > -
> > > -		sg_init_one(&sg, vb->stats, sizeof(vb->stats[0]) * num_stats);
> > > -		if (virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb, GFP_KERNEL)
> > > -		    < 0)
> > > -			BUG();
> > > +		sg_init_one(&sg, vb->stats, sizeof(vb->stats));
> > > +		ret = virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb,
> > > +					   GFP_KERNEL);
> > > +		if (ret) {
> > > +			dev_warn(&vb->vdev->dev, "%s: add stat_vq failed\n",
> > > +				 __func__);
> > > +			return ret;
> > > +		}
> > Why the change? Is it more likely to happen now?
> 
> Actually this part remains the same as the previous versions (e.g. v32). It
> is changed because we agreed that using BUG() isn't necessary here, and
> better to bail out nicely.

Why is this part of the hinting patch though? I'd rather have
a separate one.

> 
> 
> > 
> > +/*
> > + * virtio_balloon_send_hints - send arrays of hints to host
> > + * @vb: the virtio_balloon struct
> > + * @arrays: the arrays of hints
> > + * @array_num: the number of arrays give by the caller
> > + * @last_array_hints: the number of hints in the last array
> > + *
> > + * Send hints to host array by array. This begins by sending a start cmd,
> > + * which contains a cmd id received from host and the free page block size in
> > + * bytes of each hint. At the end, a stop cmd is sent to host to indicate the
> > + * end of this reporting. If host actively requests to stop the reporting, free
> > + * the arrays that have not been sent.
> > + */
> > +static void virtio_balloon_send_hints(struct virtio_balloon *vb,
> > +				      __le64 **arrays,
> > +				      uint32_t array_num,
> > +				      uint32_t last_array_hints)
> > +{
> > +	int err, i = 0;
> > +	struct scatterlist sg;
> > +	struct virtqueue *vq = vb->free_page_vq;
> > +
> > +	/* Start by sending the received cmd id to host with an outbuf. */
> > +	err = send_start_cmd_id(vb);
> > +	if (unlikely(err))
> > +		goto out_err;
> > +	/* Kick host to start taking entries from the vq. */
> > +	virtqueue_kick(vq);
> > +
> > +	for (i = 0; i < array_num; i++) {
> > +		/*
> > +		 * If a stop id or a new cmd id was just received from host,
> > +		 * stop the reporting, and free the remaining arrays that
> > +		 * haven't been sent to host.
> > +		 */
> > +		if (vb->cmd_id_received != vb->cmd_id_active)
> > +			goto out_free;
> > +
> > +		if (i + 1 == array_num)
> > +			sg_init_one(&sg, (void *)arrays[i],
> > +				    last_array_hints * sizeof(__le64));
> > +		else
> > +			sg_init_one(&sg, (void *)arrays[i], ARRAY_ALLOC_SIZE);
> > +		err = virtqueue_add_inbuf(vq, &sg, 1, (void *)arrays[i],
> > +					  GFP_KERNEL);
> > +		if (unlikely(err))
> > +			goto out_err;
> > +	}
> > +
> > +	/* End by sending a stop id to host with an outbuf. */
> > +	err = send_stop_cmd_id(vb);
> > +	if (unlikely(err))
> > +		goto out_err;
> > Don't we need to kick here?
> 
> I think not needed, because we have kicked host about starting the report,
> and the host side optimization won't exit unless receiving this stop sign or
> the migration thread asks to exit.

You can't assume that. Host might want to sleep.
If it doesn't then it will disable notifications and kick will be free.

> > 
> > > +	int i;
> > > +
> > > +	max_entries = max_free_page_blocks(ARRAY_ALLOC_ORDER);
> > > +	entries_per_page = PAGE_SIZE / sizeof(__le64);
> > > +	entries_per_array = entries_per_page * (1 << ARRAY_ALLOC_ORDER);
> > > +	max_array_num = max_entries / entries_per_array +
> > > +			!!(max_entries % entries_per_array);
> > > +	arrays = kmalloc_array(max_array_num, sizeof(__le64 *), GFP_KERNEL);
> > Instead of all this mess, how about get_free_pages here as well?
> 
> Sounds good, will replace kmalloc_array with __get_free_pages(),

Or alloc_pages, __ APIs are better avoided if possible.

> but still
> need the above calculation to get max_array_num.

Maybe alloc_pages?

> > 
> > Also why do we need GFP_KERNEL for this?
> 
> I guess it is better to use "__GFP_ATOMIC | __GFP_NOMEMALLOC", thanks.
> 
> > 
> > 
> > > +	if (!arrays)
> > > +		return NULL;
> > > +
> > > +	for (i = 0; i < max_array_num; i++) {
> > So we are getting a ton of memory here just to free it up a bit later.
> > Why doesn't get_from_free_page_list get the pages from free list for us?
> > We could also avoid the 1st allocation then - just build a list
> > of these.
> 
> That wouldn't be a good choice for us. If we check how the regular
> allocation works, there are many many things we need to consider when pages
> are allocated to users.
> For example, we need to take care of the nr_free
> counter, we need to check the watermark and perform the related actions.
> Also the folks working on arch_alloc_page to monitor page allocation
> activities would get a surprise..if page allocation is allowed to work in
> this way.
> 

mm/ code is well positioned to handle all this correctly.


> 
> 
> 
> > 
> > > +		arrays[i] =
> > > +		(__le64 *)__get_free_pages(__GFP_ATOMIC | __GFP_NOMEMALLOC,
> > > +					   ARRAY_ALLOC_ORDER);
> > Coding style says:
> > 
> > Descendants are always substantially shorter than the parent and
> > are placed substantially to the right.
> 
> Thanks, will rearrange it:
> 
> arrays[i] = (__le64 *)__get_free_pages(__GFP_ATOMIC |
> 				__GFP_NOMEMALLOC, ARRAY_ALLOC_ORDER);
> 
> 
> 
> > 
> > > +		if (!arrays[i]) {
> > Also if it does fail (small guest), shall we try with less arrays?
> 
> I think it's not needed. If the free list is empty, no matter it is a huge
> guest or a small guest, get_from_free_page_list() will load nothing even we
> pass a small array to it.
> 
> 
> Best,
> Wei

Yes but the reason it's empty is maybe because we used a ton of
memory for all of the arrays. Why allocate a top level array at all?
Can't we pass in a list?

-- 
MST

^ permalink raw reply

* Re: [PATCH v34 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
From: Wei Wang @ 2018-06-26  3:46 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: yang.zhang.wz, virtio-dev, riel, quan.xu0, kvm, nilal,
	liliang.opensource, linux-kernel, mhocko, linux-mm, pbonzini,
	akpm, virtualization, torvalds
In-Reply-To: <20180626002822-mutt-send-email-mst@kernel.org>

On 06/26/2018 09:37 AM, Michael S. Tsirkin wrote:
> On Mon, Jun 25, 2018 at 08:05:10PM +0800, Wei Wang wrote:
>
>> @@ -326,17 +353,6 @@ static void stats_handle_request(struct virtio_balloon *vb)
>>   	virtqueue_kick(vq);
>>   }
>>   
>> -static void virtballoon_changed(struct virtio_device *vdev)
>> -{
>> -	struct virtio_balloon *vb = vdev->priv;
>> -	unsigned long flags;
>> -
>> -	spin_lock_irqsave(&vb->stop_update_lock, flags);
>> -	if (!vb->stop_update)
>> -		queue_work(system_freezable_wq, &vb->update_balloon_size_work);
>> -	spin_unlock_irqrestore(&vb->stop_update_lock, flags);
>> -}
>> -
>>   static inline s64 towards_target(struct virtio_balloon *vb)
>>   {
>>   	s64 target;
>> @@ -353,6 +369,35 @@ static inline s64 towards_target(struct virtio_balloon *vb)
>>   	return target - vb->num_pages;
>>   }
>>   
>> +static void virtballoon_changed(struct virtio_device *vdev)
>> +{
>> +	struct virtio_balloon *vb = vdev->priv;
>> +	unsigned long flags;
>> +	s64 diff = towards_target(vb);
>> +
>> +	if (diff) {
>> +		spin_lock_irqsave(&vb->stop_update_lock, flags);
>> +		if (!vb->stop_update)
>> +			queue_work(system_freezable_wq,
>> +				   &vb->update_balloon_size_work);
>> +		spin_unlock_irqrestore(&vb->stop_update_lock, flags);
>> +	}
>> +
>> +	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
>> +		virtio_cread(vdev, struct virtio_balloon_config,
>> +			     free_page_report_cmd_id, &vb->cmd_id_received);
>> +		if (vb->cmd_id_received !=
>> +		    VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID &&
>> +		    vb->cmd_id_received != vb->cmd_id_active) {
>> +			spin_lock_irqsave(&vb->stop_update_lock, flags);
>> +			if (!vb->stop_update)
>> +				queue_work(vb->balloon_wq,
>> +					   &vb->report_free_page_work);
>> +			spin_unlock_irqrestore(&vb->stop_update_lock, flags);
>> +		}
>> +	}
>> +}
>> +
>>   static void update_balloon_size(struct virtio_balloon *vb)
>>   {
>>   	u32 actual = vb->num_pages;
>> @@ -425,44 +470,253 @@ static void update_balloon_size_func(struct work_struct *work)
>>   		queue_work(system_freezable_wq, work);
>>   }
>>   
>> +static void free_page_vq_cb(struct virtqueue *vq)
>> +{
>> +	unsigned int len;
>> +	void *buf;
>> +	struct virtio_balloon *vb = vq->vdev->priv;
>> +
>> +	while (1) {
>> +		buf = virtqueue_get_buf(vq, &len);
>> +
>> +		if (!buf || buf == &vb->cmd_start || buf == &vb->cmd_stop)
>> +			break;
> If there's any buffer after this one we might never get another
> callback.

I think every used buffer can get the callback, because host takes from 
the arrays one by one, and puts back each with a vq notify.



>> +		free_pages((unsigned long)buf, ARRAY_ALLOC_ORDER);
>> +	}
>> +}
>> +
>>   static int init_vqs(struct virtio_balloon *vb)
>>   {
>> -	struct virtqueue *vqs[3];
>> -	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
>> -	static const char * const names[] = { "inflate", "deflate", "stats" };
>> -	int err, nvqs;
>> +	struct virtqueue *vqs[VIRTIO_BALLOON_VQ_MAX];
>> +	vq_callback_t *callbacks[VIRTIO_BALLOON_VQ_MAX];
>> +	const char *names[VIRTIO_BALLOON_VQ_MAX];
>> +	struct scatterlist sg;
>> +	int ret;
>>   
>>   	/*
>> -	 * We expect two virtqueues: inflate and deflate, and
>> -	 * optionally stat.
>> +	 * Inflateq and deflateq are used unconditionally. The names[]
>> +	 * will be NULL if the related feature is not enabled, which will
>> +	 * cause no allocation for the corresponding virtqueue in find_vqs.
>>   	 */
>> -	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
>> -	err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL);
>> -	if (err)
>> -		return err;
>> +	callbacks[VIRTIO_BALLOON_VQ_INFLATE] = balloon_ack;
>> +	names[VIRTIO_BALLOON_VQ_INFLATE] = "inflate";
>> +	callbacks[VIRTIO_BALLOON_VQ_DEFLATE] = balloon_ack;
>> +	names[VIRTIO_BALLOON_VQ_DEFLATE] = "deflate";
>> +	names[VIRTIO_BALLOON_VQ_STATS] = NULL;
>> +	names[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
>>   
>> -	vb->inflate_vq = vqs[0];
>> -	vb->deflate_vq = vqs[1];
>>   	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
>> -		struct scatterlist sg;
>> -		unsigned int num_stats;
>> -		vb->stats_vq = vqs[2];
>> +		names[VIRTIO_BALLOON_VQ_STATS] = "stats";
>> +		callbacks[VIRTIO_BALLOON_VQ_STATS] = stats_request;
>> +	}
>>   
>> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
>> +		names[VIRTIO_BALLOON_VQ_FREE_PAGE] = "free_page_vq";
>> +		callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = free_page_vq_cb;
>> +	}
>> +
>> +	ret = vb->vdev->config->find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX,
>> +					 vqs, callbacks, names, NULL, NULL);
>> +	if (ret)
>> +		return ret;
>> +
>> +	vb->inflate_vq = vqs[VIRTIO_BALLOON_VQ_INFLATE];
>> +	vb->deflate_vq = vqs[VIRTIO_BALLOON_VQ_DEFLATE];
>> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
>> +		vb->stats_vq = vqs[VIRTIO_BALLOON_VQ_STATS];
>>   		/*
>>   		 * Prime this virtqueue with one buffer so the hypervisor can
>>   		 * use it to signal us later (it can't be broken yet!).
>>   		 */
>> -		num_stats = update_balloon_stats(vb);
>> -
>> -		sg_init_one(&sg, vb->stats, sizeof(vb->stats[0]) * num_stats);
>> -		if (virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb, GFP_KERNEL)
>> -		    < 0)
>> -			BUG();
>> +		sg_init_one(&sg, vb->stats, sizeof(vb->stats));
>> +		ret = virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb,
>> +					   GFP_KERNEL);
>> +		if (ret) {
>> +			dev_warn(&vb->vdev->dev, "%s: add stat_vq failed\n",
>> +				 __func__);
>> +			return ret;
>> +		}
> Why the change? Is it more likely to happen now?

Actually this part remains the same as the previous versions (e.g. v32). 
It is changed because we agreed that using BUG() isn't necessary here, 
and better to bail out nicely.



>
> +/*
> + * virtio_balloon_send_hints - send arrays of hints to host
> + * @vb: the virtio_balloon struct
> + * @arrays: the arrays of hints
> + * @array_num: the number of arrays give by the caller
> + * @last_array_hints: the number of hints in the last array
> + *
> + * Send hints to host array by array. This begins by sending a start cmd,
> + * which contains a cmd id received from host and the free page block size in
> + * bytes of each hint. At the end, a stop cmd is sent to host to indicate the
> + * end of this reporting. If host actively requests to stop the reporting, free
> + * the arrays that have not been sent.
> + */
> +static void virtio_balloon_send_hints(struct virtio_balloon *vb,
> +				      __le64 **arrays,
> +				      uint32_t array_num,
> +				      uint32_t last_array_hints)
> +{
> +	int err, i = 0;
> +	struct scatterlist sg;
> +	struct virtqueue *vq = vb->free_page_vq;
> +
> +	/* Start by sending the received cmd id to host with an outbuf. */
> +	err = send_start_cmd_id(vb);
> +	if (unlikely(err))
> +		goto out_err;
> +	/* Kick host to start taking entries from the vq. */
> +	virtqueue_kick(vq);
> +
> +	for (i = 0; i < array_num; i++) {
> +		/*
> +		 * If a stop id or a new cmd id was just received from host,
> +		 * stop the reporting, and free the remaining arrays that
> +		 * haven't been sent to host.
> +		 */
> +		if (vb->cmd_id_received != vb->cmd_id_active)
> +			goto out_free;
> +
> +		if (i + 1 == array_num)
> +			sg_init_one(&sg, (void *)arrays[i],
> +				    last_array_hints * sizeof(__le64));
> +		else
> +			sg_init_one(&sg, (void *)arrays[i], ARRAY_ALLOC_SIZE);
> +		err = virtqueue_add_inbuf(vq, &sg, 1, (void *)arrays[i],
> +					  GFP_KERNEL);
> +		if (unlikely(err))
> +			goto out_err;
> +	}
> +
> +	/* End by sending a stop id to host with an outbuf. */
> +	err = send_stop_cmd_id(vb);
> +	if (unlikely(err))
> +		goto out_err;
> Don't we need to kick here?

I think not needed, because we have kicked host about starting the 
report, and the host side optimization won't exit unless receiving this 
stop sign or the migration thread asks to exit.

>
>> +	int i;
>> +
>> +	max_entries = max_free_page_blocks(ARRAY_ALLOC_ORDER);
>> +	entries_per_page = PAGE_SIZE / sizeof(__le64);
>> +	entries_per_array = entries_per_page * (1 << ARRAY_ALLOC_ORDER);
>> +	max_array_num = max_entries / entries_per_array +
>> +			!!(max_entries % entries_per_array);
>> +	arrays = kmalloc_array(max_array_num, sizeof(__le64 *), GFP_KERNEL);
> Instead of all this mess, how about get_free_pages here as well?

Sounds good, will replace kmalloc_array with __get_free_pages(), but 
still need the above calculation to get max_array_num.

>
> Also why do we need GFP_KERNEL for this?

I guess it is better to use "__GFP_ATOMIC | __GFP_NOMEMALLOC", thanks.

>
>
>> +	if (!arrays)
>> +		return NULL;
>> +
>> +	for (i = 0; i < max_array_num; i++) {
> So we are getting a ton of memory here just to free it up a bit later.
> Why doesn't get_from_free_page_list get the pages from free list for us?
> We could also avoid the 1st allocation then - just build a list
> of these.

That wouldn't be a good choice for us. If we check how the regular 
allocation works, there are many many things we need to consider when 
pages are allocated to users. For example, we need to take care of the 
nr_free counter, we need to check the watermark and perform the related 
actions. Also the folks working on arch_alloc_page to monitor page 
allocation activities would get a surprise..if page allocation is 
allowed to work in this way.





>
>> +		arrays[i] =
>> +		(__le64 *)__get_free_pages(__GFP_ATOMIC | __GFP_NOMEMALLOC,
>> +					   ARRAY_ALLOC_ORDER);
> Coding style says:
>
> Descendants are always substantially shorter than the parent and
> are placed substantially to the right.

Thanks, will rearrange it:

arrays[i] = (__le64 *)__get_free_pages(__GFP_ATOMIC |
				__GFP_NOMEMALLOC, ARRAY_ALLOC_ORDER);



>
>> +		if (!arrays[i]) {
> Also if it does fail (small guest), shall we try with less arrays?

I think it's not needed. If the free list is empty, no matter it is a 
huge guest or a small guest, get_from_free_page_list() will load nothing 
even we pass a small array to it.


Best,
Wei

^ permalink raw reply

* Re: [PATCH v33 1/4] mm: add a function to get free page blocks
From: Michael S. Tsirkin @ 2018-06-26  1:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: yang.zhang.wz, virtio-dev, Rik van Riel, quan.xu0, KVM list,
	nilal, liliang.opensource, Linux Kernel Mailing List,
	virtualization, linux-mm, Paolo Bonzini, Andrew Morton,
	Michal Hocko
In-Reply-To: <CA+55aFzhuGKinEq5udPsk_uYHShkQxJYqcPO=tLCkT-oxpsgPg@mail.gmail.com>

On Sat, Jun 16, 2018 at 08:08:53AM +0900, Linus Torvalds wrote:
> On Fri, Jun 15, 2018 at 2:08 PM Wei Wang <wei.w.wang@intel.com> wrote:
> >
> > This patch adds a function to get free pages blocks from a free page
> > list. The obtained free page blocks are hints about free pages, because
> > there is no guarantee that they are still on the free page list after
> > the function returns.

...

> > +uint32_t get_from_free_page_list(int order, __le64 buf[], uint32_t size)

...

> 
> Ack. This is the kind of simple interface where I don't need to worry
> about the MM code calling out to random drivers or subsystems.
> 
> I think that "order" should be checked for validity, but from a MM
> standpoint I think this is fine.
> 
>                 Linus

The only issue seems to be getting hold of buf that's large enough -
and we don't really know what the size is, or whether one
buf would be enough.

Linus, do you think it would be ok to have get_from_free_page_list
actually pop entries from the free list and use them as the buffer
to store PAs?

Caller would be responsible for freeing the returned entries.

-- 
MST

^ permalink raw reply

* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Michael S. Tsirkin @ 2018-06-26  1:50 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, konrad.wilk,
	Jakub Kicinski, Netdev, Cornelia Huck, qemu-devel, virtualization,
	Siwei Liu, Venu Busireddy, boris.ostrovsky, aaron.f.brown,
	Joao Martins
In-Reply-To: <da144f8f-65cd-45d9-3c81-6c8ca1fb25fe@intel.com>

On Mon, Jun 25, 2018 at 10:54:09AM -0700, Samudrala, Sridhar wrote:
> > > > > Might not neccessarily be something wrong, but it's very limited to
> > > > > prohibit the MAC of VF from changing when enslaved by failover.
> > > > You mean guest changing MAC? I'm not sure why we prohibit that.
> > > I think Sridhar and Jiri might be better person to answer it. My
> > > impression was that sync'ing the MAC address change between all 3
> > > devices is challenging, as the failover driver uses MAC address to
> > > match net_device internally.
> 
> Yes. The MAC address is assigned by the hypervisor and it needs to manage the movement
> of the MAC between the PF and VF.  Allowing the guest to change the MAC will require
> synchronization between the hypervisor and the PF/VF drivers. Most of the VF drivers
> don't allow changing guest MAC unless it is a trusted VF.

OK but it's a policy thing. Maybe it's a trusted VF. Who knows?
For example I can see host just
failing VIRTIO_NET_CTRL_MAC_ADDR_SET if it wants to block it.
I'm not sure why VIRTIO_NET_F_STANDBY has to block it in the guest.

-- 
MST

^ permalink raw reply

* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Michael S. Tsirkin @ 2018-06-26  1:46 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, konrad.wilk,
	Jakub Kicinski, Samudrala, Sridhar, qemu-devel, virtualization,
	Siwei Liu, Venu Busireddy, Netdev, boris.ostrovsky, aaron.f.brown,
	Joao Martins
In-Reply-To: <20180625115512.2f29844e.cohuck@redhat.com>

On Mon, Jun 25, 2018 at 11:55:12AM +0200, Cornelia Huck wrote:
> On Fri, 22 Jun 2018 22:05:50 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Fri, Jun 22, 2018 at 05:09:55PM +0200, Cornelia Huck wrote:
> > > On Thu, 21 Jun 2018 21:20:13 +0300
> > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > >   
> > > > On Thu, Jun 21, 2018 at 04:59:13PM +0200, Cornelia Huck wrote:  
> > > > > OK, so what about the following:
> > > > > 
> > > > > - introduce a new feature bit, VIRTIO_NET_F_STANDBY_UUID that indicates
> > > > >   that we have a new uuid field in the virtio-net config space
> > > > > - in QEMU, add a property for virtio-net that allows to specify a uuid,
> > > > >   offer VIRTIO_NET_F_STANDBY_UUID if set
> > > > > - when configuring, set the property to the group UUID of the vfio-pci
> > > > >   device
> > > > > - in the guest, use the uuid from the virtio-net device's config space
> > > > >   if applicable; else, fall back to matching by MAC as done today
> > > > > 
> > > > > That should work for all virtio transports.    
> > > > 
> > > > True. I'm a bit unhappy that it's virtio net specific though
> > > > since down the road I expect we'll have a very similar feature
> > > > for scsi (and maybe others).
> > > > 
> > > > But we do not have a way to have fields that are portable
> > > > both across devices and transports, and I think it would
> > > > be a useful addition. How would this work though? Any idea?  
> > > 
> > > Can we introduce some kind of device-independent config space area?
> > > Pushing back the device-specific config space by a certain value if the
> > > appropriate feature is negotiated and use that for things like the uuid?  
> > 
> > So config moves back and forth?
> > Reminds me of the msi vector mess we had with pci.
> 
> Yes, that would be a bit unfortunate.
> 
> > I'd rather have every transport add a new config.
> 
> You mean via different mechanisms?

I guess so.

> > 
> > > But regardless of that, I'm not sure whether extending this approach to
> > > other device types is the way to go. Tying together two different
> > > devices is creating complicated situations at least in the hypervisor
> > > (even if it's fairly straightforward in the guest). [I have not come
> > > around again to look at the "how to handle visibility in QEMU"
> > > questions due to lack of cycles, sorry about that.]
> > > 
> > > So, what's the goal of this approach? Only to allow migration with
> > > vfio-pci, or also to plug in a faster device and use it instead of an
> > > already attached paravirtualized device?  
> > 
> > These are two sides of the same coin, I think the second approach
> > is closer to what we are doing here.
> 
> Thinking about it, do we need any knob to keep the vfio device
> invisible if the virtio device is not present? IOW, how does the
> hypervisor know that the vfio device is supposed to be paired with a
> virtio device? It seems we need an explicit tie-in.

If we are going the way of the bridge, both bridge and
virtio would have some kind of id.

When pairing using mac, I'm less sure. PAss vfio device mac to qemu
as a property?


> > 
> > > What about migration of vfio devices that are not easily replaced by a
> > > paravirtualized device? I'm thinking of vfio-ccw, where our main (and
> > > currently only) supported device is dasd (disks) -- which can do a lot
> > > of specialized things that virtio-blk does not support (and should not
> > > or even cannot support).  
> > 
> > But maybe virtio-scsi can?
> 
> I don't think so. Dasds have some channel commands that don't map
> easily to scsi commands.

There's always a choice of adding these to the spec.
E.g. FC extensions were proposed, I don't remember why they
are still stuck.

> > 
> > > Would it be more helpful to focus on generic
> > > migration support for vfio instead of going about it device by device?
> > > 
> > > This network device approach already seems far along, so it makes sense
> > > to continue with it. But I'm not sure whether we want to spend time and
> > > energy on that for other device types rather than working on a general
> > > solution for vfio migration.  
> > 
> > I'm inclined to say finalizing this feature would be a good first step
> > and will teach us how we can move forward.
> 
> I'm not opposed to figuring out this one, but I'm not sure whether we
> want to extend it to more device types.
> 
> Are people looking into generic migration support? I have it on my
> things-to-look-at list (figuring out what needs to be device specific
> and what can be generic, figuring out how we can support vfio-ccw,
> etc.).

I expect to see more of it if SPDK makes progress.

-- 
MST

^ permalink raw reply

* Re: [PATCH v34 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
From: Michael S. Tsirkin @ 2018-06-26  1:37 UTC (permalink / raw)
  To: Wei Wang
  Cc: yang.zhang.wz, virtio-dev, riel, quan.xu0, kvm, nilal,
	liliang.opensource, linux-kernel, mhocko, linux-mm, pbonzini,
	akpm, virtualization, torvalds
In-Reply-To: <1529928312-30500-3-git-send-email-wei.w.wang@intel.com>

On Mon, Jun 25, 2018 at 08:05:10PM +0800, Wei Wang wrote:
> Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_HINT feature indicates the
> support of reporting hints of guest free pages to host via virtio-balloon.
> 
> Host requests the guest to report free page hints by sending a new cmd id
> to the guest via the free_page_report_cmd_id configuration register.
> 
> As the first step here, virtio-balloon only reports free page hints from
> the max order (i.e. 10) free page list to host. This has generated similar
> good results as reporting all free page hints during our tests.
> 
> When the guest starts to report, it first sends a start cmd to host via
> the free page vq, which acks to host the cmd id received, and tells it the
> hint size (e.g. 4MB each on x86). When the guest finishes the reporting,
> a stop cmd is sent to host via the vq.
> 
> TODO:
> - support reporting free page hints from smaller order free page lists
>   when there is a need/request from users.
> 
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> ---
>  drivers/virtio/virtio_balloon.c     | 347 ++++++++++++++++++++++++++++++++----
>  include/uapi/linux/virtio_balloon.h |  11 ++
>  2 files changed, 322 insertions(+), 36 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 6b237e3..d05f0ba 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -43,6 +43,11 @@
>  #define OOM_VBALLOON_DEFAULT_PAGES 256
>  #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
>  
> +/* The order used to allocate an array to load free page hints */
> +#define ARRAY_ALLOC_ORDER (MAX_ORDER - 1)
> +/* The size of an array in bytes */
> +#define ARRAY_ALLOC_SIZE ((1 << ARRAY_ALLOC_ORDER) << PAGE_SHIFT)
> +


Pls prefix macros so we can figure out they are local ones.

>  static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
>  module_param(oom_pages, int, S_IRUSR | S_IWUSR);
>  MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
> @@ -51,9 +56,22 @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
>  static struct vfsmount *balloon_mnt;
>  #endif
>  
> +enum virtio_balloon_vq {
> +	VIRTIO_BALLOON_VQ_INFLATE,
> +	VIRTIO_BALLOON_VQ_DEFLATE,
> +	VIRTIO_BALLOON_VQ_STATS,
> +	VIRTIO_BALLOON_VQ_FREE_PAGE,
> +	VIRTIO_BALLOON_VQ_MAX
> +};
> +
>  struct virtio_balloon {
>  	struct virtio_device *vdev;
> -	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
> +	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
> +
> +	/* Balloon's own wq for cpu-intensive work items */
> +	struct workqueue_struct *balloon_wq;
> +	/* The free page reporting work item submitted to the balloon wq */
> +	struct work_struct report_free_page_work;
>  
>  	/* The balloon servicing is delegated to a freezable workqueue. */
>  	struct work_struct update_balloon_stats_work;
> @@ -63,6 +81,15 @@ struct virtio_balloon {
>  	spinlock_t stop_update_lock;
>  	bool stop_update;
>  
> +	/* Command buffers to start and stop the reporting of hints to host */
> +	struct virtio_balloon_free_page_hints_cmd cmd_start;
> +	struct virtio_balloon_free_page_hints_cmd cmd_stop;
> +
> +	/* The cmd id received from host */
> +	uint32_t cmd_id_received;
> +	/* The cmd id that is actively in use */
> +	uint32_t cmd_id_active;
> +
>  	/* Waiting for host to ack the pages we released. */
>  	wait_queue_head_t acked;
>  

You want u32 types.

> @@ -326,17 +353,6 @@ static void stats_handle_request(struct virtio_balloon *vb)
>  	virtqueue_kick(vq);
>  }
>  
> -static void virtballoon_changed(struct virtio_device *vdev)
> -{
> -	struct virtio_balloon *vb = vdev->priv;
> -	unsigned long flags;
> -
> -	spin_lock_irqsave(&vb->stop_update_lock, flags);
> -	if (!vb->stop_update)
> -		queue_work(system_freezable_wq, &vb->update_balloon_size_work);
> -	spin_unlock_irqrestore(&vb->stop_update_lock, flags);
> -}
> -
>  static inline s64 towards_target(struct virtio_balloon *vb)
>  {
>  	s64 target;
> @@ -353,6 +369,35 @@ static inline s64 towards_target(struct virtio_balloon *vb)
>  	return target - vb->num_pages;
>  }
>  
> +static void virtballoon_changed(struct virtio_device *vdev)
> +{
> +	struct virtio_balloon *vb = vdev->priv;
> +	unsigned long flags;
> +	s64 diff = towards_target(vb);
> +
> +	if (diff) {
> +		spin_lock_irqsave(&vb->stop_update_lock, flags);
> +		if (!vb->stop_update)
> +			queue_work(system_freezable_wq,
> +				   &vb->update_balloon_size_work);
> +		spin_unlock_irqrestore(&vb->stop_update_lock, flags);
> +	}
> +
> +	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
> +		virtio_cread(vdev, struct virtio_balloon_config,
> +			     free_page_report_cmd_id, &vb->cmd_id_received);
> +		if (vb->cmd_id_received !=
> +		    VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID &&
> +		    vb->cmd_id_received != vb->cmd_id_active) {
> +			spin_lock_irqsave(&vb->stop_update_lock, flags);
> +			if (!vb->stop_update)
> +				queue_work(vb->balloon_wq,
> +					   &vb->report_free_page_work);
> +			spin_unlock_irqrestore(&vb->stop_update_lock, flags);
> +		}
> +	}
> +}
> +
>  static void update_balloon_size(struct virtio_balloon *vb)
>  {
>  	u32 actual = vb->num_pages;
> @@ -425,44 +470,253 @@ static void update_balloon_size_func(struct work_struct *work)
>  		queue_work(system_freezable_wq, work);
>  }
>  
> +static void free_page_vq_cb(struct virtqueue *vq)
> +{
> +	unsigned int len;
> +	void *buf;
> +	struct virtio_balloon *vb = vq->vdev->priv;
> +
> +	while (1) {
> +		buf = virtqueue_get_buf(vq, &len);
> +
> +		if (!buf || buf == &vb->cmd_start || buf == &vb->cmd_stop)
> +			break;

If there's any buffer after this one we might never get another
callback.

> +		free_pages((unsigned long)buf, ARRAY_ALLOC_ORDER);
> +	}
> +}
> +
>  static int init_vqs(struct virtio_balloon *vb)
>  {
> -	struct virtqueue *vqs[3];
> -	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
> -	static const char * const names[] = { "inflate", "deflate", "stats" };
> -	int err, nvqs;
> +	struct virtqueue *vqs[VIRTIO_BALLOON_VQ_MAX];
> +	vq_callback_t *callbacks[VIRTIO_BALLOON_VQ_MAX];
> +	const char *names[VIRTIO_BALLOON_VQ_MAX];
> +	struct scatterlist sg;
> +	int ret;
>  
>  	/*
> -	 * We expect two virtqueues: inflate and deflate, and
> -	 * optionally stat.
> +	 * Inflateq and deflateq are used unconditionally. The names[]
> +	 * will be NULL if the related feature is not enabled, which will
> +	 * cause no allocation for the corresponding virtqueue in find_vqs.
>  	 */
> -	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
> -	err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL);
> -	if (err)
> -		return err;
> +	callbacks[VIRTIO_BALLOON_VQ_INFLATE] = balloon_ack;
> +	names[VIRTIO_BALLOON_VQ_INFLATE] = "inflate";
> +	callbacks[VIRTIO_BALLOON_VQ_DEFLATE] = balloon_ack;
> +	names[VIRTIO_BALLOON_VQ_DEFLATE] = "deflate";
> +	names[VIRTIO_BALLOON_VQ_STATS] = NULL;
> +	names[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
>  
> -	vb->inflate_vq = vqs[0];
> -	vb->deflate_vq = vqs[1];
>  	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> -		struct scatterlist sg;
> -		unsigned int num_stats;
> -		vb->stats_vq = vqs[2];
> +		names[VIRTIO_BALLOON_VQ_STATS] = "stats";
> +		callbacks[VIRTIO_BALLOON_VQ_STATS] = stats_request;
> +	}
>  
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
> +		names[VIRTIO_BALLOON_VQ_FREE_PAGE] = "free_page_vq";
> +		callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = free_page_vq_cb;
> +	}
> +
> +	ret = vb->vdev->config->find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX,
> +					 vqs, callbacks, names, NULL, NULL);
> +	if (ret)
> +		return ret;
> +
> +	vb->inflate_vq = vqs[VIRTIO_BALLOON_VQ_INFLATE];
> +	vb->deflate_vq = vqs[VIRTIO_BALLOON_VQ_DEFLATE];
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> +		vb->stats_vq = vqs[VIRTIO_BALLOON_VQ_STATS];
>  		/*
>  		 * Prime this virtqueue with one buffer so the hypervisor can
>  		 * use it to signal us later (it can't be broken yet!).
>  		 */
> -		num_stats = update_balloon_stats(vb);
> -
> -		sg_init_one(&sg, vb->stats, sizeof(vb->stats[0]) * num_stats);
> -		if (virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb, GFP_KERNEL)
> -		    < 0)
> -			BUG();
> +		sg_init_one(&sg, vb->stats, sizeof(vb->stats));
> +		ret = virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb,
> +					   GFP_KERNEL);
> +		if (ret) {
> +			dev_warn(&vb->vdev->dev, "%s: add stat_vq failed\n",
> +				 __func__);
> +			return ret;
> +		}

Why the change? Is it more likely to happen now?

>  		virtqueue_kick(vb->stats_vq);
>  	}
> +
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT))
> +		vb->free_page_vq = vqs[VIRTIO_BALLOON_VQ_FREE_PAGE];
> +
>  	return 0;
>  }
>  
> +static int send_start_cmd_id(struct virtio_balloon *vb)
> +{
> +	struct scatterlist sg;
> +	struct virtqueue *vq = vb->free_page_vq;
> +
> +	vb->cmd_start.id = cpu_to_virtio32(vb->vdev, vb->cmd_id_active);
> +	vb->cmd_start.size = cpu_to_virtio32(vb->vdev,
> +					     MAX_ORDER_NR_PAGES * PAGE_SIZE);
> +	sg_init_one(&sg, &vb->cmd_start,
> +		    sizeof(struct virtio_balloon_free_page_hints_cmd));
> +	return virtqueue_add_outbuf(vq, &sg, 1, &vb->cmd_start, GFP_KERNEL);
> +}
> +
> +static int send_stop_cmd_id(struct virtio_balloon *vb)
> +{
> +	struct scatterlist sg;
> +	struct virtqueue *vq = vb->free_page_vq;
> +
> +	vb->cmd_stop.id = cpu_to_virtio32(vb->vdev,
> +				VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID);
> +	vb->cmd_stop.size = 0;
> +	sg_init_one(&sg, &vb->cmd_stop,
> +		    sizeof(struct virtio_balloon_free_page_hints_cmd));
> +	return virtqueue_add_outbuf(vq, &sg, 1, &vb->cmd_stop, GFP_KERNEL);
> +}
> +
> +/*
> + * virtio_balloon_send_hints - send arrays of hints to host
> + * @vb: the virtio_balloon struct
> + * @arrays: the arrays of hints
> + * @array_num: the number of arrays give by the caller
> + * @last_array_hints: the number of hints in the last array
> + *
> + * Send hints to host array by array. This begins by sending a start cmd,
> + * which contains a cmd id received from host and the free page block size in
> + * bytes of each hint. At the end, a stop cmd is sent to host to indicate the
> + * end of this reporting. If host actively requests to stop the reporting, free
> + * the arrays that have not been sent.
> + */
> +static void virtio_balloon_send_hints(struct virtio_balloon *vb,
> +				      __le64 **arrays,
> +				      uint32_t array_num,
> +				      uint32_t last_array_hints)
> +{
> +	int err, i = 0;
> +	struct scatterlist sg;
> +	struct virtqueue *vq = vb->free_page_vq;
> +
> +	/* Start by sending the received cmd id to host with an outbuf. */
> +	err = send_start_cmd_id(vb);
> +	if (unlikely(err))
> +		goto out_err;
> +	/* Kick host to start taking entries from the vq. */
> +	virtqueue_kick(vq);
> +
> +	for (i = 0; i < array_num; i++) {
> +		/*
> +		 * If a stop id or a new cmd id was just received from host,
> +		 * stop the reporting, and free the remaining arrays that
> +		 * haven't been sent to host.
> +		 */
> +		if (vb->cmd_id_received != vb->cmd_id_active)
> +			goto out_free;
> +
> +		if (i + 1 == array_num)
> +			sg_init_one(&sg, (void *)arrays[i],
> +				    last_array_hints * sizeof(__le64));
> +		else
> +			sg_init_one(&sg, (void *)arrays[i], ARRAY_ALLOC_SIZE);
> +		err = virtqueue_add_inbuf(vq, &sg, 1, (void *)arrays[i],
> +					  GFP_KERNEL);
> +		if (unlikely(err))
> +			goto out_err;
> +	}
> +
> +	/* End by sending a stop id to host with an outbuf. */
> +	err = send_stop_cmd_id(vb);
> +	if (unlikely(err))
> +		goto out_err;

Don't we need to kick here?

> +	return;
> +
> +out_err:
> +	dev_err(&vb->vdev->dev, "%s: err = %d\n", __func__, err);
> +out_free:
> +	while (i < array_num)
> +		free_pages((unsigned long)arrays[i++], ARRAY_ALLOC_ORDER);
> +}
> +
> +/*
> + * virtio_balloon_load_hints - load free page hints into arrays
> + * @vb: the virtio_balloon struct
> + * @array_num: the number of arrays allocated
> + * @last_array_hints: the number of hints loaded into the last array
> + *
> + * Only free pages blocks of MAX_ORDER - 1 are loaded into the arrays.
> + * Each array size is MAX_ORDER_NR_PAGES * PAGE_SIZE (e.g. 4MB on x86). Failing
> + * to allocate such an array essentially implies that no such free page blocks
> + * could be reported. Alloacte the number of arrays according to the free page
> + * blocks of MAX_ORDER - 1 that the system may have, and free the unused ones
> + * after loading the free page hints. The last array may be partially loaded,
> + * and @last_array_hints tells the caller about the number of hints there.
> + *
> + * Return the pointer to the memory that holds the addresses of the allocated
> + * arrays, or NULL if no arrays are allocated.
> + */
> +static  __le64 **virtio_balloon_load_hints(struct virtio_balloon *vb,
> +					   uint32_t *array_num,
> +					   uint32_t *last_array_hints)
> +{
> +	__le64 **arrays;
> +	uint32_t max_entries, entries_per_page, entries_per_array,
> +		 max_array_num, loaded_hints;

All above likely should be int.

> +	int i;
> +
> +	max_entries = max_free_page_blocks(ARRAY_ALLOC_ORDER);
> +	entries_per_page = PAGE_SIZE / sizeof(__le64);
> +	entries_per_array = entries_per_page * (1 << ARRAY_ALLOC_ORDER);
> +	max_array_num = max_entries / entries_per_array +
> +			!!(max_entries % entries_per_array);
> +	arrays = kmalloc_array(max_array_num, sizeof(__le64 *), GFP_KERNEL);

Instead of all this mess, how about get_free_pages here as well?

Also why do we need GFP_KERNEL for this?


> +	if (!arrays)
> +		return NULL;
> +
> +	for (i = 0; i < max_array_num; i++) {

So we are getting a ton of memory here just to free it up a bit later.
Why doesn't get_from_free_page_list get the pages from free list for us?
We could also avoid the 1st allocation then - just build a list
of these.


> +		arrays[i] =
> +		(__le64 *)__get_free_pages(__GFP_ATOMIC | __GFP_NOMEMALLOC,
> +					   ARRAY_ALLOC_ORDER);

Coding style says:

Descendants are always substantially shorter than the parent and
are placed substantially to the right. 

> +		if (!arrays[i]) {
Also if it does fail (small guest), shall we try with less arrays?
> +			/*
> +			 * If any one of the arrays fails to be allocated, it
> +			 * implies that the free list that we are interested
> +			 * in is empty, and there is no need to continue the
> +			 * reporting. So just free what's allocated and return
> +			 * NULL.
> +			 */
> +			while (i > 0)
> +				free_pages((unsigned long)arrays[i--],
> +					   ARRAY_ALLOC_ORDER);
> +			kfree(arrays);
> +			return NULL;
> +		}
> +	}
> +	loaded_hints = get_from_free_page_list(ARRAY_ALLOC_ORDER,
> +					       max_array_num, arrays,
> +					       entries_per_array);
> +	*array_num = loaded_hints / entries_per_array +
> +		     !!(max_entries % entries_per_array);
> +	*last_array_hints = loaded_hints -
> +			    (*array_num - 1) * entries_per_array;
> +	for (i = *array_num; i < max_array_num; i++)
> +		free_pages((unsigned long)arrays[i], ARRAY_ALLOC_ORDER);
> +
> +	return arrays;
> +}
> +
> +static void report_free_page_func(struct work_struct *work)
> +{
> +	struct virtio_balloon *vb;
> +	uint32_t array_num = 0, last_array_hints = 0;
> +	__le64 **arrays;
> +
> +	vb = container_of(work, struct virtio_balloon, report_free_page_work);
> +	vb->cmd_id_active = vb->cmd_id_received;
> +
> +	arrays = virtio_balloon_load_hints(vb, &array_num, &last_array_hints);
> +	if (arrays) {
> +		virtio_balloon_send_hints(vb, arrays, array_num,
> +					  last_array_hints);
> +		kfree(arrays);
> +	}
> +}
> +
>  #ifdef CONFIG_BALLOON_COMPACTION
>  /*
>   * virtballoon_migratepage - perform the balloon page migration on behalf of
> @@ -576,18 +830,30 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  	if (err)
>  		goto out_free_vb;
>  
> +	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
> +		vb->balloon_wq = alloc_workqueue("balloon-wq",
> +					WQ_FREEZABLE | WQ_CPU_INTENSIVE, 0);
> +		if (!vb->balloon_wq) {
> +			err = -ENOMEM;
> +			goto out_del_vqs;
> +		}
> +		INIT_WORK(&vb->report_free_page_work, report_free_page_func);
> +		vb->cmd_id_received = VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID;
> +		vb->cmd_id_active = VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID;
> +	}
> +
>  	vb->nb.notifier_call = virtballoon_oom_notify;
>  	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
>  	err = register_oom_notifier(&vb->nb);
>  	if (err < 0)
> -		goto out_del_vqs;
> +		goto out_del_balloon_wq;
>  
>  #ifdef CONFIG_BALLOON_COMPACTION
>  	balloon_mnt = kern_mount(&balloon_fs);
>  	if (IS_ERR(balloon_mnt)) {
>  		err = PTR_ERR(balloon_mnt);
>  		unregister_oom_notifier(&vb->nb);
> -		goto out_del_vqs;
> +		goto out_del_balloon_wq;
>  	}
>  
>  	vb->vb_dev_info.migratepage = virtballoon_migratepage;
> @@ -597,7 +863,7 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  		kern_unmount(balloon_mnt);
>  		unregister_oom_notifier(&vb->nb);
>  		vb->vb_dev_info.inode = NULL;
> -		goto out_del_vqs;
> +		goto out_del_balloon_wq;
>  	}
>  	vb->vb_dev_info.inode->i_mapping->a_ops = &balloon_aops;
>  #endif
> @@ -608,6 +874,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  		virtballoon_changed(vdev);
>  	return 0;
>  
> +out_del_balloon_wq:
> +	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT))
> +		destroy_workqueue(vb->balloon_wq);
>  out_del_vqs:
>  	vdev->config->del_vqs(vdev);
>  out_free_vb:
> @@ -641,6 +910,11 @@ static void virtballoon_remove(struct virtio_device *vdev)
>  	cancel_work_sync(&vb->update_balloon_size_work);
>  	cancel_work_sync(&vb->update_balloon_stats_work);
>  
> +	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
> +		cancel_work_sync(&vb->report_free_page_work);
> +		destroy_workqueue(vb->balloon_wq);
> +	}
> +
>  	remove_common(vb);
>  #ifdef CONFIG_BALLOON_COMPACTION
>  	if (vb->vb_dev_info.inode)
> @@ -692,6 +966,7 @@ static unsigned int features[] = {
>  	VIRTIO_BALLOON_F_MUST_TELL_HOST,
>  	VIRTIO_BALLOON_F_STATS_VQ,
>  	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
> +	VIRTIO_BALLOON_F_FREE_PAGE_HINT,
>  };
>  
>  static struct virtio_driver virtio_balloon_driver = {
> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> index 13b8cb5..860456f 100644
> --- a/include/uapi/linux/virtio_balloon.h
> +++ b/include/uapi/linux/virtio_balloon.h
> @@ -34,15 +34,26 @@
>  #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
>  #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
> +#define VIRTIO_BALLOON_F_FREE_PAGE_HINT	3 /* VQ to report free pages */
>  
>  /* Size of a PFN in the balloon interface. */
>  #define VIRTIO_BALLOON_PFN_SHIFT 12
>  
> +#define VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID	0
>  struct virtio_balloon_config {
>  	/* Number of pages host wants Guest to give up. */
>  	__u32 num_pages;
>  	/* Number of pages we've actually got in balloon. */
>  	__u32 actual;
> +	/* Free page report command id, readonly by guest */
> +	__u32 free_page_report_cmd_id;
> +};
> +
> +struct virtio_balloon_free_page_hints_cmd {
> +	/* The command id received from host */
> +	__le32 id;
> +	/* The free page block size in bytes */
> +	__le32 size;
>  };
>  
>  #define VIRTIO_BALLOON_S_SWAP_IN  0   /* Amount of memory swapped in */
> -- 
> 2.7.4

^ permalink raw reply

* [PATCH v5 12/27] x86/paravirt: Adapt assembly for PIE support
From: Thomas Garnier via Virtualization @ 2018-06-25 22:39 UTC (permalink / raw)
  To: kernel-hardening
  Cc: Juergen Gross, x86, linux-kernel, Thomas Garnier, Ingo Molnar,
	H. Peter Anvin, Alok Kataria, virtualization, Thomas Gleixner
In-Reply-To: <20180625224014.134829-1-thgarnie@google.com>

if PIE is enabled, switch the paravirt assembly constraints to be
compatible. The %c/i constrains generate smaller code so is kept by
default.

Position Independent Executable (PIE) support will allow to extend the
KASLR randomization range 0xffffffff80000000.

Signed-off-by: Thomas Garnier <thgarnie@google.com>
---
 arch/x86/include/asm/paravirt_types.h | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 180bc0bff0fb..140747a98d94 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -337,9 +337,17 @@ extern struct pv_lock_ops pv_lock_ops;
 #define PARAVIRT_PATCH(x)					\
 	(offsetof(struct paravirt_patch_template, x) / sizeof(void *))
 
+#ifdef CONFIG_X86_PIE
+#define paravirt_opptr_call "a"
+#define paravirt_opptr_type "p"
+#else
+#define paravirt_opptr_call "c"
+#define paravirt_opptr_type "i"
+#endif
+
 #define paravirt_type(op)				\
 	[paravirt_typenum] "i" (PARAVIRT_PATCH(op)),	\
-	[paravirt_opptr] "i" (&(op))
+	[paravirt_opptr] paravirt_opptr_type (&(op))
 #define paravirt_clobber(clobber)		\
 	[paravirt_clobber] "i" (clobber)
 
@@ -395,7 +403,7 @@ int paravirt_disable_iospace(void);
  */
 #define PARAVIRT_CALL					\
 	ANNOTATE_RETPOLINE_SAFE				\
-	"call *%c[paravirt_opptr];"
+	"call *%" paravirt_opptr_call "[paravirt_opptr];"
 
 /*
  * These macros are intended to wrap calls through one of the paravirt
-- 
2.18.0.rc2.346.g013aa6912e-goog

^ permalink raw reply related

* [PATCH v5 00/27] x86: PIE support and option to extend KASLR randomization
From: Thomas Garnier via Virtualization @ 2018-06-25 22:38 UTC (permalink / raw)
  To: kernel-hardening
  Cc: Kate Stewart, Sergey Senozhatsky, Petr Mladek,
	Radim Krčmář, Peter Zijlstra, Jan Kiszka,
	Christopher Li, Dave Hansen, linux-kernel, Masahiro Yamada,
	Pavel Machek, H. Peter Anvin, Christoph Lameter, Alok Kataria,
	linux-arch, Herbert Xu, Baoquan He, Jonathan Corbet, Joerg Roedel,
	x86, Daniel Micay, linux-doc, xen-devel, linux-sparse

Changes:
 - patch v5:
   - Adapt new crypto modules for PIE.
   - Improve per-cpu commit message.
   - Fix xen 32-bit build error with .quad.
   - Remove extra code for ftrace.
 - patch v4:
   - Simplify early boot by removing global variables.
   - Modify the mcount location script for __mcount_loc intead of the address
     read in the ftrace implementation.
   - Edit commit description to explain better where the kernel can be located.
   - Streamlined the testing done on each patch proposal. Always testing
     hibernation, suspend, ftrace and kprobe to ensure no regressions.
 - patch v3:
   - Update on message to describe longer term PIE goal.
   - Minor change on ftrace if condition.
   - Changed code using xchgq.
 - patch v2:
   - Adapt patch to work post KPTI and compiler changes
   - Redo all performance testing with latest configs and compilers
   - Simplify mov macro on PIE (MOVABS now)
   - Reduce GOT footprint
 - patch v1:
   - Simplify ftrace implementation.
   - Use gcc mstack-protector-guard-reg=%gs with PIE when possible.
 - rfc v3:
   - Use --emit-relocs instead of -pie to reduce dynamic relocation space on
     mapped memory. It also simplifies the relocation process.
   - Move the start the module section next to the kernel. Remove the need for
     -mcmodel=large on modules. Extends module space from 1 to 2G maximum.
   - Support for XEN PVH as 32-bit relocations can be ignored with
     --emit-relocs.
   - Support for GOT relocations previously done automatically with -pie.
   - Remove need for dynamic PLT in modules.
   - Support dymamic GOT for modules.
 - rfc v2:
   - Add support for global stack cookie while compiler default to fs without
     mcmodel=kernel
   - Change patch 7 to correctly jump out of the identity mapping on kexec load
     preserve.

These patches make the changes necessary to build the kernel as Position
Independent Executable (PIE) on x86_64. A PIE kernel can be relocated below
the top 2G of the virtual address space. It allows to optionally extend the
KASLR randomization range from 1G to 3G. The chosen range is the one currently
available, future changes will allow the kernel module to have a wider
randomization range.

Thanks a lot to Ard Biesheuvel & Kees Cook on their feedback on compiler
changes, PIE support and KASLR in general. Thanks to Roland McGrath on his
feedback for using -pie versus --emit-relocs and details on compiler code
generation.

The patches:
 - 1-3, 5-13, 18-19: Change in assembly code to be PIE compliant.
 - 4: Add a new _ASM_MOVABS macro to fetch a symbol address generically.
 - 14: Adapt percpu design to work correctly when PIE is enabled.
 - 15: Provide an option to default visibility to hidden except for key symbols.
       It removes errors between compilation units.
 - 16: Add PROVIDE_HIDDEN replacement on the linker script for weak symbols to
       reduce GOT footprint.
 - 17: Adapt relocation tool to handle PIE binary correctly.
 - 20: Add support for global cookie.
 - 21: Support ftrace with PIE (used on Ubuntu config).
 - 22: Add option to move the module section just after the kernel.
 - 23: Adapt module loading to support PIE with dynamic GOT.
 - 24: Make the GOT read-only.
 - 25: Add the CONFIG_X86_PIE option (off by default).
 - 26: Adapt relocation tool to generate a 64-bit relocation table.
 - 27: Add the CONFIG_RANDOMIZE_BASE_LARGE option to increase relocation range
       from 1G to 3G (off by default).

Performance/Size impact:

Size of vmlinux (Default configuration):
 File size:
 - PIE disabled: +0.18%
 - PIE enabled: -1.977% (less relocations)
 .text section:
 - PIE disabled: same
 - PIE enabled: same

Size of vmlinux (Ubuntu configuration):
 File size:
 - PIE disabled: +0.21%
 - PIE enabled: +10%
 .text section:
 - PIE disabled: same
 - PIE enabled: +0.001%

The size increase is mainly due to not having access to the 32-bit signed
relocation that can be used with mcmodel=kernel. A small part is due to reduced
optimization for PIE code. This bug [1] was opened with gcc to provide a better
code generation for kernel PIE.

Hackbench (50% and 1600% on thread/process for pipe/sockets):
 - PIE disabled: no significant change (avg -/+ 0.5% on latest test).
 - PIE enabled: between -1% to +1% in average (default and Ubuntu config).

Kernbench (average of 10 Half and Optimal runs):
 Elapsed Time:
 - PIE disabled: no significant change (avg -0.5%)
 - PIE enabled: average -0.5% to +0.5%
 System Time:
 - PIE disabled: no significant change (avg -0.1%)
 - PIE enabled: average -0.4% to +0.4%.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82303

diffstat:
 Documentation/x86/x86_64/mm.txt              |    3 v
 arch/x86/Kconfig                             |   45 ++++++
 arch/x86/Makefile                            |   58 ++++++++
 arch/x86/boot/boot.h                         |    2 
 arch/x86/boot/compressed/Makefile            |    5 
 arch/x86/boot/compressed/misc.c              |   10 +
 arch/x86/crypto/aegis128-aesni-asm.S         |    6 
 arch/x86/crypto/aegis128l-aesni-asm.S        |    8 -
 arch/x86/crypto/aegis256-aesni-asm.S         |    6 
 arch/x86/crypto/aes-x86_64-asm_64.S          |   45 ++++--
 arch/x86/crypto/aesni-intel_asm.S            |    8 -
 arch/x86/crypto/aesni-intel_avx-x86_64.S     |    6 
 arch/x86/crypto/camellia-aesni-avx-asm_64.S  |   42 +++---
 arch/x86/crypto/camellia-aesni-avx2-asm_64.S |   44 +++---
 arch/x86/crypto/camellia-x86_64-asm_64.S     |    8 -
 arch/x86/crypto/cast5-avx-x86_64-asm_64.S    |   50 ++++---
 arch/x86/crypto/cast6-avx-x86_64-asm_64.S    |   44 +++---
 arch/x86/crypto/des3_ede-asm_64.S            |   96 +++++++++-----
 arch/x86/crypto/ghash-clmulni-intel_asm.S    |    4 
 arch/x86/crypto/glue_helper-asm-avx.S        |    4 
 arch/x86/crypto/glue_helper-asm-avx2.S       |    6 
 arch/x86/crypto/morus1280-avx2-asm.S         |    4 
 arch/x86/crypto/morus1280-sse2-asm.S         |    8 -
 arch/x86/crypto/morus640-sse2-asm.S          |    6 
 arch/x86/crypto/sha256-avx2-asm.S            |   23 ++-
 arch/x86/entry/calling.h                     |    2 
 arch/x86/entry/entry_32.S                    |    3 
 arch/x86/entry/entry_64.S                    |   25 ++-
 arch/x86/include/asm/asm.h                   |    1 
 arch/x86/include/asm/bug.h                   |    2 
 arch/x86/include/asm/jump_label.h            |    8 -
 arch/x86/include/asm/kvm_host.h              |    8 -
 arch/x86/include/asm/module.h                |   11 +
 arch/x86/include/asm/page_64_types.h         |    9 +
 arch/x86/include/asm/paravirt_types.h        |   12 +
 arch/x86/include/asm/percpu.h                |   25 ++-
 arch/x86/include/asm/pgtable_64_types.h      |    6 
 arch/x86/include/asm/pm-trace.h              |    2 
 arch/x86/include/asm/processor.h             |   16 +-
 arch/x86/include/asm/sections.h              |    4 
 arch/x86/include/asm/setup.h                 |    2 
 arch/x86/include/asm/stackprotector.h        |   19 ++
 arch/x86/kernel/Makefile                     |    6 
 arch/x86/kernel/acpi/wakeup_64.S             |   31 ++--
 arch/x86/kernel/asm-offsets.c                |    3 
 arch/x86/kernel/asm-offsets_32.c             |    3 
 arch/x86/kernel/asm-offsets_64.c             |    3 
 arch/x86/kernel/cpu/common.c                 |    3 
 arch/x86/kernel/cpu/microcode/core.c         |    4 
 arch/x86/kernel/ftrace.c                     |   51 +++++++
 arch/x86/kernel/head64.c                     |   23 ++-
 arch/x86/kernel/head_32.S                    |    3 
 arch/x86/kernel/head_64.S                    |   31 +++-
 arch/x86/kernel/kvm.c                        |    6 
 arch/x86/kernel/module.c                     |  181 ++++++++++++++++++++++++++-
 arch/x86/kernel/module.lds                   |    3 
 arch/x86/kernel/process.c                    |    5 
 arch/x86/kernel/relocate_kernel_64.S         |   16 +-
 arch/x86/kernel/setup_percpu.c               |    5 
 arch/x86/kernel/vmlinux.lds.S                |   13 +
 arch/x86/kvm/svm.c                           |    4 
 arch/x86/lib/cmpxchg16b_emu.S                |    8 -
 arch/x86/mm/dump_pagetables.c                |    3 
 arch/x86/power/hibernate_asm_64.S            |    4 
 arch/x86/tools/relocs.c                      |  169 +++++++++++++++++++++++--
 arch/x86/tools/relocs.h                      |    4 
 arch/x86/tools/relocs_common.c               |   15 +-
 arch/x86/xen/xen-asm.S                       |   12 -
 arch/x86/xen/xen-head.S                      |   11 -
 arch/x86/xen/xen-pvh.S                       |   14 +-
 drivers/base/firmware_loader/main.c          |    4 
 include/asm-generic/sections.h               |    6 
 include/asm-generic/vmlinux.lds.h            |   12 +
 include/linux/compiler.h                     |    7 +
 init/Kconfig                                 |   16 ++
 kernel/kallsyms.c                            |   16 +-
 kernel/trace/trace.h                         |    4 
 lib/dynamic_debug.c                          |    4 
 scripts/link-vmlinux.sh                      |   14 ++
 scripts/recordmcount.c                       |   79 +++++++----
 80 files changed, 1134 insertions(+), 358 deletions(-)

^ permalink raw reply

* Re: [PATCH v2 1/5] dt-bindings: virtio: Specify #iommu-cells value for a virtio-iommu
From: Rob Herring @ 2018-06-25 19:27 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: mark.rutland, devicetree, jayachandran.nair, lorenzo.pieralisi,
	tnowicki, kvm, virtio-dev, joro, mst, will.deacon, virtualization,
	marc.zyngier, iommu, eric.auger, robin.murphy, kvmarm
In-Reply-To: <20180621190655.56391-2-jean-philippe.brucker@arm.com>

On Thu, Jun 21, 2018 at 08:06:51PM +0100, Jean-Philippe Brucker wrote:
> A virtio-mmio node may represent a virtio-iommu device. This is discovered
> by the virtio driver at probe time, but the DMA topology isn't
> discoverable and must be described by firmware. For DT the standard IOMMU
> description is used, as specified in bindings/iommu/iommu.txt and
> bindings/pci/pci-iommu.txt. Like many other IOMMUs, virtio-iommu
> distinguishes masters by their endpoint IDs, which requires one IOMMU cell
> in the "iommus" property.
> 
> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> ---
>  Documentation/devicetree/bindings/virtio/mmio.txt | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/Documentation/devicetree/bindings/virtio/mmio.txt b/Documentation/devicetree/bindings/virtio/mmio.txt
> index 5069c1b8e193..337da0e3a87f 100644
> --- a/Documentation/devicetree/bindings/virtio/mmio.txt
> +++ b/Documentation/devicetree/bindings/virtio/mmio.txt
> @@ -8,6 +8,14 @@ Required properties:
>  - reg:		control registers base address and size including configuration space
>  - interrupts:	interrupt generated by the device
>  
> +Required properties for virtio-iommu:
> +
> +- #iommu-cells:	When the node describes a virtio-iommu device, it is
> +		linked to DMA masters using the "iommus" property as
> +		described in devicetree/bindings/iommu/iommu.txt. For
> +		virtio-iommu #iommu-cells must be 1, each cell describing
> +		a single endpoint ID.

The iommus property should also be documented for the client side.

Rob

^ permalink raw reply

* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Samudrala, Sridhar @ 2018-06-25 17:54 UTC (permalink / raw)
  To: Siwei Liu, Michael S. Tsirkin
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, konrad.wilk,
	Jakub Kicinski, Netdev, Cornelia Huck, qemu-devel, virtualization,
	Venu Busireddy, boris.ostrovsky, aaron.f.brown, Joao Martins
In-Reply-To: <CADGSJ22DX8=rNPY+gNS_q=LCYLR5ieoE7oD8p4+8AnpzQsWSCg@mail.gmail.com>

On 6/22/2018 5:17 PM, Siwei Liu wrote:
> On Fri, Jun 22, 2018 at 4:40 PM, Siwei Liu <loseweigh@gmail.com> wrote:
>> On Fri, Jun 22, 2018 at 3:25 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>>> On Fri, Jun 22, 2018 at 02:51:11PM -0700, Siwei Liu wrote:
>>>> On Fri, Jun 22, 2018 at 2:29 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>>>>> On Fri, Jun 22, 2018 at 01:03:04PM -0700, Siwei Liu wrote:
>>>>>> On Fri, Jun 22, 2018 at 1:00 PM, Siwei Liu <loseweigh@gmail.com> wrote:
>>>>>>> On Thu, Jun 21, 2018 at 7:32 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>>>>>>>> On Thu, Jun 21, 2018 at 06:21:55PM -0700, Siwei Liu wrote:
>>>>>>>>> On Thu, Jun 21, 2018 at 7:59 AM, Cornelia Huck <cohuck@redhat.com> wrote:
>>>>>>>>>> On Wed, 20 Jun 2018 22:48:58 +0300
>>>>>>>>>> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> On Wed, Jun 20, 2018 at 06:06:19PM +0200, Cornelia Huck wrote:
>>>>>>>>>>>> In any case, I'm not sure anymore why we'd want the extra uuid.
>>>>>>>>>>> It's mostly so we can have e.g. multiple devices with same MAC
>>>>>>>>>>> (which some people seem to want in order to then use
>>>>>>>>>>> then with different containers).
>>>>>>>>>>>
>>>>>>>>>>> But it is also handy for when you assign a PF, since then you
>>>>>>>>>>> can't set the MAC.
>>>>>>>>>>>
>>>>>>>>>> OK, so what about the following:
>>>>>>>>>>
>>>>>>>>>> - introduce a new feature bit, VIRTIO_NET_F_STANDBY_UUID that indicates
>>>>>>>>>>    that we have a new uuid field in the virtio-net config space
>>>>>>>>>> - in QEMU, add a property for virtio-net that allows to specify a uuid,
>>>>>>>>>>    offer VIRTIO_NET_F_STANDBY_UUID if set
>>>>>>>>>> - when configuring, set the property to the group UUID of the vfio-pci
>>>>>>>>>>    device
>>>>>>>>> If feature negotiation fails on VIRTIO_NET_F_STANDBY_UUID, is it safe
>>>>>>>>> to still expose UUID in the config space on virtio-pci?
>>>>>>>>
>>>>>>>> Yes but guest is not supposed to read it.
>>>>>>>>
>>>>>>>>> I'm not even sure if it's sane to expose group UUID on the PCI bridge
>>>>>>>>> where the corresponding vfio-pci device attached to for a guest which
>>>>>>>>> doesn't support the feature (legacy).
>>>>>>>>>
>>>>>>>>> -Siwei
>>>>>>>> Yes but you won't add the primary behind such a bridge.
>>>>>>> I assume the UUID feature is a new one besides the exiting
>>>>>>> VIRTIO_NET_F_STANDBY feature, where I think the current proposal is,
>>>>>>> if UUID feature is present and supported by guest, we'll do pairing
>>>>>>> based on UUID; if UUID feature present
>>>>>>                                   ^^^^^^^  is NOT present
>>>>> Let's go back for a bit.
>>>>>
>>>>> What is wrong with matching by mac?
>>>>>
>>>>> 1. Does not allow multiple NICs with same mac
>>>>> 2. Requires MAC to be programmed by host in the PT device
>>>>>     (which is often possible with VFs but isn't possible with all PT
>>>>>     devices)
>>>> Might not neccessarily be something wrong, but it's very limited to
>>>> prohibit the MAC of VF from changing when enslaved by failover.
>>> You mean guest changing MAC? I'm not sure why we prohibit that.
>> I think Sridhar and Jiri might be better person to answer it. My
>> impression was that sync'ing the MAC address change between all 3
>> devices is challenging, as the failover driver uses MAC address to
>> match net_device internally.

Yes. The MAC address is assigned by the hypervisor and it needs to manage the movement
of the MAC between the PF and VF.  Allowing the guest to change the MAC will require
synchronization between the hypervisor and the PF/VF drivers. Most of the VF drivers
don't allow changing guest MAC unless it is a trusted VF.


_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH v2 3/5] iommu/virtio: Add probe request
From: Jean-Philippe Brucker @ 2018-06-25 17:52 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Mark Rutland, devicetree@vger.kernel.org,
	jayachandran.nair@cavium.com, Lorenzo Pieralisi,
	tnowicki@caviumnetworks.com, kvm@vger.kernel.org,
	virtio-dev@lists.oasis-open.org, joro@8bytes.org, Will Deacon,
	virtualization@lists.linux-foundation.org, Marc Zyngier,
	iommu@lists.linux-foundation.org, robh+dt@kernel.org,
	eric.auger@redhat.com, Robin Murphy, kvmarm@lists.cs.columbia.edu
In-Reply-To: <20180622035442-mutt-send-email-mst@kernel.org>

On 22/06/18 01:55, Michael S. Tsirkin wrote:
>> +#define VIRTIO_IOMMU_RESV_MEM_T_RESERVED	0
>> +#define VIRTIO_IOMMU_RESV_MEM_T_MSI		1
>> +
>> +struct virtio_iommu_probe_resv_mem {
>> +	__u8					subtype;
>> +	__u8					reserved[3];
>> +	__le64					start;
>> +	__le64					end;
>> +} __packed;
> 
> 
> start/end are not aligned you need to pad more.

The complete structure is actually the 32-bit header
virtio_iommu_probe_property, followed by the content
virtio_iommu_probe_resv_mem:

struct {
	le16	type
	le16	length
	u8	subtype
	u8	reserved[3]
	le64	start
	le64	end
};

I'll redefine virtio_iommu_probe_resv_mem to include this header.
Otherwise, without __packed, a compiler introduces padding when
concatenating header and content.

Thanks,
Jean

^ permalink raw reply

* Re: [PATCH v2 2/5] iommu: Add virtio-iommu driver
From: Jean-Philippe Brucker @ 2018-06-25 17:52 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Mark Rutland, devicetree@vger.kernel.org,
	jayachandran.nair@cavium.com, Lorenzo Pieralisi,
	tnowicki@caviumnetworks.com, kvm@vger.kernel.org,
	virtio-dev@lists.oasis-open.org, joro@8bytes.org, Will Deacon,
	virtualization@lists.linux-foundation.org, Marc Zyngier,
	iommu@lists.linux-foundation.org, robh+dt@kernel.org,
	eric.auger@redhat.com, Robin Murphy, kvmarm@lists.cs.columbia.edu
In-Reply-To: <20180622033720-mutt-send-email-mst@kernel.org>

On 22/06/18 01:51, Michael S. Tsirkin wrote:
>> +VIRTIO IOMMU DRIVER
>> +M:	Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
>> +S:	Maintained
>> +F:	drivers/iommu/virtio-iommu.c
>> +F:	include/uapi/linux/virtio_iommu.h
>> +
>>  VIRTUAL BOX GUEST DEVICE DRIVER
>>  M:	Hans de Goede <hdegoede@redhat.com>
>>  M:	Arnd Bergmann <arnd@arndb.de>
> 
> Please add the virtualization mailing list.

Ok. For the driver get_maintainer.pl now outputs the iommu and
virtualizations lists, as well as Joerg Roedel. For the header it
outputs the virtualization list, Jason and you.

>> +#define VIOMMU_REQUEST_TIMEOUT		10000 /* 10s */
> 
> Where does this come from? Do you absolutely have to have
> an arbitrary timeout?

It isn't necessary but very useful for development, and I wonder if
we could keep it behind an #ifdef DEBUG. Some requests are sent from
hardirq-safe context (because device driver are allowed to call
iommu_map/unmap from an IRQ handler), where they would lock the CPU if
the IOMMU device misbehaves. The timeout is here to provide a little
more info to developers, instead of a generic RCU lockup message.

On the other hand allowing unmap requests to timeout might pose a
security risk (freeing pages that are still used by endpoints). So I
think moving this behind a DEBUG would be better.

>> +static int viommu_probe(struct virtio_device *vdev)
>> +{
>> +	struct device *parent_dev = vdev->dev.parent;
>> +	struct viommu_dev *viommu = NULL;
>> +	struct device *dev = &vdev->dev;
>> +	u64 input_start = 0;
>> +	u64 input_end = -1UL;
>> +	int ret;
> 
> Please validate that device has VIRTIO_F_VERSION_1 -
> we don't ever want new devices to grow legacy
> interfaces.

Agreed. In this case I think I can also remove from the device
specification all legacy paragraphs, that cater for missing FEATURES_OK
status and native endianess.

For the record, all kvmtool devices on my branch now default to virtio
version 1, since they need the VIRTIO_F_IOMMU_PLATFORM bit.

>> +struct virtio_iommu_config {
>> +	/* Supported page sizes */
>> +	__u64					page_size_mask;
>> +	/* Supported IOVA range */
>> +	struct virtio_iommu_range {
>> +		__u64				start;
>> +		__u64				end;
>> +	} input_range;
>> +	/* Max domain ID size */
>> +	__u8					domain_bits;
>> +} __packed;
> 
> Please pad structs so each field and all of it are size aligned and avoid __packed.
> 
> Applies below too.

Ok, I'll clean this up (as well as the structs in the device spec). I
think the only struct that needs more padding is virtio_iommu_fault. The
total size of the attach request is 20 bytes, not aligned on 64 bits,
but it probably doesn't need to change. virtio_blk_req for example has
an odd size and I don't think that caused any problem (?).

>> +union virtio_iommu_req {
>> +	struct virtio_iommu_req_head		head;
>> +
>> +	struct virtio_iommu_req_attach		attach;
>> +	struct virtio_iommu_req_detach		detach;
>> +	struct virtio_iommu_req_map		map;
>> +	struct virtio_iommu_req_unmap		unmap;
>> +};
> 
> Such unions are problematic: if you add an option of a larger structure
> down the road, do all requests grow automatically?

Yes, requests don't have a fixed size, and the next extension adds two
slightly larger requests. I'm also in favor of removing the union, and
letting implementations redefine it if needed. At the moment this union
is only used by kvmtool for parsing requests, not by QEMU or the Linux
driver.

Thanks a lot for the comments

Jean

^ permalink raw reply

* [PATCH v34 4/4] virtio-balloon: VIRTIO_BALLOON_F_PAGE_POISON
From: Wei Wang @ 2018-06-25 12:05 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, virtualization, kvm, linux-mm, mst,
	mhocko, akpm
  Cc: yang.zhang.wz, riel, quan.xu0, liliang.opensource, pbonzini,
	nilal, torvalds
In-Reply-To: <1529928312-30500-1-git-send-email-wei.w.wang@intel.com>

The VIRTIO_BALLOON_F_PAGE_POISON feature bit is used to indicate if the
guest is using page poisoning. Guest writes to the poison_val config
field to tell host about the page poisoning value that is in use.

Suggested-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 drivers/virtio/virtio_balloon.c     | 10 ++++++++++
 include/uapi/linux/virtio_balloon.h |  3 +++
 2 files changed, 13 insertions(+)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index d05f0ba..c834ef1 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -801,6 +801,7 @@ static struct file_system_type balloon_fs = {
 static int virtballoon_probe(struct virtio_device *vdev)
 {
 	struct virtio_balloon *vb;
+	__u32 poison_val;
 	int err;
 
 	if (!vdev->config->get) {
@@ -840,6 +841,11 @@ static int virtballoon_probe(struct virtio_device *vdev)
 		INIT_WORK(&vb->report_free_page_work, report_free_page_func);
 		vb->cmd_id_received = VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID;
 		vb->cmd_id_active = VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID;
+		if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_POISON)) {
+			memset(&poison_val, PAGE_POISON, sizeof(poison_val));
+			virtio_cwrite(vb->vdev, struct virtio_balloon_config,
+				      poison_val, &poison_val);
+		}
 	}
 
 	vb->nb.notifier_call = virtballoon_oom_notify;
@@ -958,6 +964,9 @@ static int virtballoon_restore(struct virtio_device *vdev)
 
 static int virtballoon_validate(struct virtio_device *vdev)
 {
+	if (!page_poisoning_enabled())
+		__virtio_clear_bit(vdev, VIRTIO_BALLOON_F_PAGE_POISON);
+
 	__virtio_clear_bit(vdev, VIRTIO_F_IOMMU_PLATFORM);
 	return 0;
 }
@@ -967,6 +976,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
 	VIRTIO_BALLOON_F_FREE_PAGE_HINT,
+	VIRTIO_BALLOON_F_PAGE_POISON,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 860456f..12b5a4f 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -35,6 +35,7 @@
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_FREE_PAGE_HINT	3 /* VQ to report free pages */
+#define VIRTIO_BALLOON_F_PAGE_POISON	4 /* Guest is using page poisoning */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -47,6 +48,8 @@ struct virtio_balloon_config {
 	__u32 actual;
 	/* Free page report command id, readonly by guest */
 	__u32 free_page_report_cmd_id;
+	/* Stores PAGE_POISON if page poisoning is in use */
+	__u32 poison_val;
 };
 
 struct virtio_balloon_free_page_hints_cmd {
-- 
2.7.4

^ permalink raw reply related

* [PATCH v34 3/4] mm/page_poison: expose page_poisoning_enabled to kernel modules
From: Wei Wang @ 2018-06-25 12:05 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, virtualization, kvm, linux-mm, mst,
	mhocko, akpm
  Cc: yang.zhang.wz, riel, quan.xu0, liliang.opensource, pbonzini,
	nilal, torvalds
In-Reply-To: <1529928312-30500-1-git-send-email-wei.w.wang@intel.com>

In some usages, e.g. virtio-balloon, a kernel module needs to know if
page poisoning is in use. This patch exposes the page_poisoning_enabled
function to kernel modules.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
---
 mm/page_poison.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/mm/page_poison.c b/mm/page_poison.c
index aa2b3d3..830f604 100644
--- a/mm/page_poison.c
+++ b/mm/page_poison.c
@@ -17,6 +17,11 @@ static int __init early_page_poison_param(char *buf)
 }
 early_param("page_poison", early_page_poison_param);
 
+/**
+ * page_poisoning_enabled - check if page poisoning is enabled
+ *
+ * Return true if page poisoning is enabled, or false if not.
+ */
 bool page_poisoning_enabled(void)
 {
 	/*
@@ -29,6 +34,7 @@ bool page_poisoning_enabled(void)
 		(!IS_ENABLED(CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC) &&
 		debug_pagealloc_enabled()));
 }
+EXPORT_SYMBOL_GPL(page_poisoning_enabled);
 
 static void poison_page(struct page *page)
 {
-- 
2.7.4

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox