Linux virtualization list
 help / color / mirror / Atom feed
* Re: [PATCH net] vhost_net: remove VHOST_NET_F_VIRTIO_NET_HDR support
From: Jason Wang @ 2018-06-11  2:29 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20180611050741-mutt-send-email-mst@kernel.org>



On 2018年06月11日 10:12, Michael S. Tsirkin wrote:
> On Fri, Jun 08, 2018 at 01:07:09PM +0800, Jason Wang wrote:
>>
>> On 2018年06月08日 12:46, Michael S. Tsirkin wrote:
>>> On Fri, Jun 08, 2018 at 11:50:42AM +0800, Jason Wang wrote:
>>>> This feature bit is duplicated with VIRTIO_F_ANY_LAYOUT, this means if
>>>> a userpsace want to enable VRITIO_F_ANY_LAYOUT,
>>>> VHOST_NET_F_VIRTIO_NET_HDR will be implied too. This is wrong and will
>>>> break networking.
>>> What breaks networking exactly? VHOST_NET supported ANY_LAYOUT
>>> from day one. For this reason it does not need to know about
>>> VRITIO_F_ANY_LAYOUT and we reused the bit for other purposes.
>> It's the knowledge of vhost_net code it self but not userspace. For
>> userspace, it should depends on the value of returned by VHOST_GET_FEATURES.
>> So when userspace can set_features with ANY_LAYOUT, vhost may think it wants
>> VHOST_NET_F_VIRTIO_NET_HDR.
> Yes but that's the admittedly ugly API that we have now.
> userspace is supposed to know VRITIO_F_ANY_LAYOUT does
> not make sense for vhost.

Ok.

>
>
>
>>>
>>>
>>>> Fixing this by safely removing
>>>> VHOST_NET_F_VIRTIO_NET_HDR support. There should be very few or even
>>>> no userspace can use this.
>>> Quite possibly, but it is hard to be sure. It seems safer to
>>> maintain it unless there's an actual reason something's broken.
>> I think not since the feature is negotiated not mandatory?
> That doesn't mean much.
>
>>>> Further cleanups could be done for
>>>> -net-next for safety.
>>>>
>>>> In the future, we need a vhost dedicated feature set/get ioctl()
>>>> instead of reusing virtio ones.
>>> Not just in the future, we might want to switch iommu
>>> to a sane structure without the 64 bit padding bug
>>> right now.
>> Yes, I hit this bug when introducing V2 of msg IOTLB message.
> Sounds good, so if you like, reserve a bit for
> VHOST_NET_F_VIRTIO_NET_HDR in the new ioctl mask and
> do not enable it there.

Ok, and maybe VHOST_F_LOG_ALL.

>
>>>> Fixes: 4e9fa50c6ccbe ("vhost: move features to core")
>>> This tag makes no sense here IMHO. Looks like people are using some tool
>>> that just looks at the earliest version where patch won't apply. The
>>> commit in question just moved some code around.
>> Looks not, before this commit, vhost_net won't return ANY_LAYOUT.
>>
>> Thanks
> Well ANY_LAYOUT just happens to be same as VHOST_NET_F_VIRTIO_NET_HDR
> and that has been set since forever.

So do you still want this patch? If not we need to document that 
ANY_LAYOUT could not be passed through SET_FEATURES somewhere.

Thanks

>
>>>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>>>> ---
>>>>    drivers/vhost/net.c | 15 +++++----------
>>>>    1 file changed, 5 insertions(+), 10 deletions(-)
>>>>
>>>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>>>> index 986058a..83eef52 100644
>>>> --- a/drivers/vhost/net.c
>>>> +++ b/drivers/vhost/net.c
>>>> @@ -69,7 +69,6 @@ MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero Copy TX;"
>>>>    enum {
>>>>    	VHOST_NET_FEATURES = VHOST_FEATURES |
>>>> -			 (1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
>>>>    			 (1ULL << VIRTIO_NET_F_MRG_RXBUF) |
>>>>    			 (1ULL << VIRTIO_F_IOMMU_PLATFORM)
>>>>    };
>>>> @@ -1255,15 +1254,11 @@ static int vhost_net_set_features(struct vhost_net *n, u64 features)
>>>>    			       (1ULL << VIRTIO_F_VERSION_1))) ?
>>>>    			sizeof(struct virtio_net_hdr_mrg_rxbuf) :
>>>>    			sizeof(struct virtio_net_hdr);
>>>> -	if (features & (1 << VHOST_NET_F_VIRTIO_NET_HDR)) {
>>>> -		/* vhost provides vnet_hdr */
>>>> -		vhost_hlen = hdr_len;
>>>> -		sock_hlen = 0;
>>>> -	} else {
>>>> -		/* socket provides vnet_hdr */
>>>> -		vhost_hlen = 0;
>>>> -		sock_hlen = hdr_len;
>>>> -	}
>>>> +
>>>> +        /* socket provides vnet_hdr */
>>>> +	vhost_hlen = 0;
>>>> +	sock_hlen = hdr_len;
>>>> +
>>>>    	mutex_lock(&n->dev.mutex);
>>>>    	if ((features & (1 << VHOST_F_LOG_ALL)) &&
>>>>    	    !vhost_log_access_ok(&n->dev))
>>>> -- 
>>>> 2.7.4

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH net] vhost_net: remove VHOST_NET_F_VIRTIO_NET_HDR support
From: Michael S. Tsirkin @ 2018-06-11  3:08 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <90e9b3ea-e9d2-d60d-8355-42d447c67452@redhat.com>

On Mon, Jun 11, 2018 at 10:29:36AM +0800, Jason Wang wrote:
> 
> 
> On 2018年06月11日 10:12, Michael S. Tsirkin wrote:
> > On Fri, Jun 08, 2018 at 01:07:09PM +0800, Jason Wang wrote:
> > > 
> > > On 2018年06月08日 12:46, Michael S. Tsirkin wrote:
> > > > On Fri, Jun 08, 2018 at 11:50:42AM +0800, Jason Wang wrote:
> > > > > This feature bit is duplicated with VIRTIO_F_ANY_LAYOUT, this means if
> > > > > a userpsace want to enable VRITIO_F_ANY_LAYOUT,
> > > > > VHOST_NET_F_VIRTIO_NET_HDR will be implied too. This is wrong and will
> > > > > break networking.
> > > > What breaks networking exactly? VHOST_NET supported ANY_LAYOUT
> > > > from day one. For this reason it does not need to know about
> > > > VRITIO_F_ANY_LAYOUT and we reused the bit for other purposes.
> > > It's the knowledge of vhost_net code it self but not userspace. For
> > > userspace, it should depends on the value of returned by VHOST_GET_FEATURES.
> > > So when userspace can set_features with ANY_LAYOUT, vhost may think it wants
> > > VHOST_NET_F_VIRTIO_NET_HDR.
> > Yes but that's the admittedly ugly API that we have now.
> > userspace is supposed to know VRITIO_F_ANY_LAYOUT does
> > not make sense for vhost.
> 
> Ok.
> 
> > 
> > 
> > 
> > > > 
> > > > 
> > > > > Fixing this by safely removing
> > > > > VHOST_NET_F_VIRTIO_NET_HDR support. There should be very few or even
> > > > > no userspace can use this.
> > > > Quite possibly, but it is hard to be sure. It seems safer to
> > > > maintain it unless there's an actual reason something's broken.
> > > I think not since the feature is negotiated not mandatory?
> > That doesn't mean much.
> > 
> > > > > Further cleanups could be done for
> > > > > -net-next for safety.
> > > > > 
> > > > > In the future, we need a vhost dedicated feature set/get ioctl()
> > > > > instead of reusing virtio ones.
> > > > Not just in the future, we might want to switch iommu
> > > > to a sane structure without the 64 bit padding bug
> > > > right now.
> > > Yes, I hit this bug when introducing V2 of msg IOTLB message.
> > Sounds good, so if you like, reserve a bit for
> > VHOST_NET_F_VIRTIO_NET_HDR in the new ioctl mask and
> > do not enable it there.
> 
> Ok, and maybe VHOST_F_LOG_ALL.
> 
> > 
> > > > > Fixes: 4e9fa50c6ccbe ("vhost: move features to core")
> > > > This tag makes no sense here IMHO. Looks like people are using some tool
> > > > that just looks at the earliest version where patch won't apply. The
> > > > commit in question just moved some code around.
> > > Looks not, before this commit, vhost_net won't return ANY_LAYOUT.
> > > 
> > > Thanks
> > Well ANY_LAYOUT just happens to be same as VHOST_NET_F_VIRTIO_NET_HDR
> > and that has been set since forever.
> 
> So do you still want this patch? If not we need to document that ANY_LAYOUT
> could not be passed through SET_FEATURES somewhere.
> 
> Thanks

Let's document this, yes. I don't think we should drop
VHOST_NET_F_VIRTIO_NET_HDR now.

> > 
> > > > > Signed-off-by: Jason Wang <jasowang@redhat.com>
> > > > > ---
> > > > >    drivers/vhost/net.c | 15 +++++----------
> > > > >    1 file changed, 5 insertions(+), 10 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > > > > index 986058a..83eef52 100644
> > > > > --- a/drivers/vhost/net.c
> > > > > +++ b/drivers/vhost/net.c
> > > > > @@ -69,7 +69,6 @@ MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero Copy TX;"
> > > > >    enum {
> > > > >    	VHOST_NET_FEATURES = VHOST_FEATURES |
> > > > > -			 (1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
> > > > >    			 (1ULL << VIRTIO_NET_F_MRG_RXBUF) |
> > > > >    			 (1ULL << VIRTIO_F_IOMMU_PLATFORM)
> > > > >    };
> > > > > @@ -1255,15 +1254,11 @@ static int vhost_net_set_features(struct vhost_net *n, u64 features)
> > > > >    			       (1ULL << VIRTIO_F_VERSION_1))) ?
> > > > >    			sizeof(struct virtio_net_hdr_mrg_rxbuf) :
> > > > >    			sizeof(struct virtio_net_hdr);
> > > > > -	if (features & (1 << VHOST_NET_F_VIRTIO_NET_HDR)) {
> > > > > -		/* vhost provides vnet_hdr */
> > > > > -		vhost_hlen = hdr_len;
> > > > > -		sock_hlen = 0;
> > > > > -	} else {
> > > > > -		/* socket provides vnet_hdr */
> > > > > -		vhost_hlen = 0;
> > > > > -		sock_hlen = hdr_len;
> > > > > -	}
> > > > > +
> > > > > +        /* socket provides vnet_hdr */
> > > > > +	vhost_hlen = 0;
> > > > > +	sock_hlen = hdr_len;
> > > > > +
> > > > >    	mutex_lock(&n->dev.mutex);
> > > > >    	if ((features & (1 << VHOST_F_LOG_ALL)) &&
> > > > >    	    !vhost_log_access_ok(&n->dev))
> > > > > -- 
> > > > > 2.7.4
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [RFC V2] virtio: Add platform specific DMA API translation for virito devices
From: Michael S. Tsirkin @ 2018-06-11  3:28 UTC (permalink / raw)
  To: Ram Pai
  Cc: robh, pawel.moll, Tom Lendacky, benh, cohuck, linux-kernel,
	virtualization, Christoph Hellwig, joe, Rustad, Mark D,
	Anshuman Khandual, linuxppc-dev, elfring, david
In-Reply-To: <20180611023909.GA5726@ram.oc3035372033.ibm.com>

On Sun, Jun 10, 2018 at 07:39:09PM -0700, Ram Pai wrote:
> On Thu, Jun 07, 2018 at 07:28:35PM +0300, Michael S. Tsirkin wrote:
> > On Wed, Jun 06, 2018 at 10:23:06PM -0700, Christoph Hellwig wrote:
> > > On Thu, May 31, 2018 at 08:43:58PM +0300, Michael S. Tsirkin wrote:
> > > > Pls work on a long term solution. Short term needs can be served by
> > > > enabling the iommu platform in qemu.
> > > 
> > > So, I spent some time looking at converting virtio to dma ops overrides,
> > > and the current virtio spec, and the sad through I have to tell is that
> > > both the spec and the Linux implementation are complete and utterly fucked
> > > up.
> > 
> > Let me restate it: DMA API has support for a wide range of hardware, and
> > hardware based virtio implementations likely won't benefit from all of
> > it.
> > 
> > And given virtio right now is optimized for specific workloads, improving
> > portability without regressing performance isn't easy.
> > 
> > I think it's unsurprising since it started a strictly a guest/host
> > mechanism.  People did implement offloads on specific platforms though,
> > and they are known to work. To improve portability even further,
> > we might need to make spec and code changes.
> > 
> > I'm not really sympathetic to people complaining that they can't even
> > set a flag in qemu though. If that's the case the stack in question is
> > way too inflexible.
> 
> We did consider your suggestion. But can't see how it will work.
> Maybe you can guide us here. 
> 
> In our case qemu has absolutely no idea if the VM will switch itself to
> secure mode or not.  Its a dynamic decision made entirely by the VM
> through direct interaction with the hardware/firmware; no
> qemu/hypervisor involved.
> 
> If the administrator, who invokes qemu, enables the flag, the DMA ops
> associated with the virito devices will be called, and hence will be
> able to do the right things. Yes we might incur performance hit due to
> the IOMMU translations, but lets ignore that for now; the functionality
> will work. Good till now.
> 
> However if the administrator
> ignores/forgets/deliberatey-decides/is-constrained to NOT enable the
> flag, virtio will not be able to pass control to the DMA ops associated
> with the virtio devices. Which means, we have no opportunity to share
> the I/O buffers with the hypervisor/qemu.
> 
> How do you suggest, we handle this case?

As step 1, ignore it as a user error.
Further you can for example add per-device quirks in virtio so it can be
switched to dma api. make extra decisions in platform code then.

> > 
> > 
> > 
> > > Both in the flag naming and the implementation there is an implication
> > > of DMA API == IOMMU, which is fundamentally wrong.
> > 
> > Maybe we need to extend the meaning of PLATFORM_IOMMU or rename it.
> > 
> > It's possible that some setups will benefit from a more
> > fine-grained approach where some aspects of the DMA
> > API are bypassed, others aren't.
> > 
> > This seems to be what was being asked for in this thread,
> > with comments claiming IOMMU flag adds too much overhead.
> > 
> > 
> > > The DMA API does a few different things:
> > > 
> > >  a) address translation
> > > 
> > > 	This does include IOMMUs.  But it also includes random offsets
> > > 	between PCI bars and system memory that we see on various
> > > 	platforms.
> > 
> > I don't think you mean bars. That's unrelated to DMA.
> > 
> > >  Worse so some of these offsets might be based on
> > > 	banks, e.g. on the broadcom bmips platform.  It also deals
> > > 	with bitmask in physical addresses related to memory encryption
> > > 	like AMD SEV.  I'd be really curious how for example the
> > > 	Intel virtio based NIC is going to work on any of those
> > > 	plaforms.
> > 
> > SEV guys report that they just set the iommu flag and then it all works.
> 
> This is one of the fundamental difference between SEV architecture and
> the ultravisor architecture. In SEV, qemu is aware of SEV.  In
> ultravisor architecture, only the VM that runs within qemu is aware of
> ultravisor;  hypervisor/qemu/administrator are untrusted entities.

Spo one option is to teach qemu that it's on a platform with an
ultravisor, this might have more advantages.

> I hope, we can make virtio subsystem flexibe enough to support various
> security paradigms.

So if you are worried about qemu attacking guests, I see
more problems than just passing an incorrect iommu
flag.


> Apart from the above reason, Christoph and Ben point to so many other
> reasons to make it flexibe. So why not, make it happen?
> 

I don't see a flexibility argument.  I just don't think new platforms
should use workarounds that we put in place for old ones.


> > I guess if there's translation we can think of this as a kind of iommu.
> > Maybe we should rename PLATFORM_IOMMU to PLARTFORM_TRANSLATION?
> > 
> > And apparently some people complain that just setting that flag makes
> > qemu check translation on each access with an unacceptable performance
> > overhead.  Forcing same behaviour for everyone on general principles
> > even without the flag is unlikely to make them happy.
> > 
> > >   b) coherency
> > > 
> > > 	On many architectures DMA is not cache coherent, and we need
> > > 	to invalidate and/or write back cache lines before doing
> > > 	DMA.  Again, I wonder how this is every going to work with
> > > 	hardware based virtio implementations.
> > 
> > 
> > You mean dma_Xmb and friends?
> > There's a new feature VIRTIO_F_IO_BARRIER that's being proposed
> > for that.
> > 
> > 
> > >  Even worse I think this
> > > 	is actually broken at least for VIVT event for virtualized
> > > 	implementations.  E.g. a KVM guest is going to access memory
> > > 	using different virtual addresses than qemu, vhost might throw
> > > 	in another different address space.
> > 
> > I don't really know what VIVT is. Could you help me please?
> > 
> > >   c) bounce buffering
> > > 
> > > 	Many DMA implementations can not address all physical memory
> > > 	due to addressing limitations.  In such cases we copy the
> > > 	DMA memory into a known addressable bounc buffer and DMA
> > > 	from there.
> > 
> > Don't do it then?
> > 
> > 
> > >   d) flushing write combining buffers or similar
> > > 
> > > 	On some hardware platforms we need workarounds to e.g. read
> > > 	from a certain mmio address to make sure DMA can actually
> > > 	see memory written by the host.
> > 
> > I guess it isn't an issue as long as WC isn't actually used.
> > It will become an issue when virtio spec adds some WC capability -
> > I suspect we can ignore this for now.
> > 
> > > 
> > > All of this is bypassed by virtio by default despite generally being
> > > platform issues, not particular to a given device.
> > 
> > It's both a device and a platform issue. A PV device is often more like
> > another CPU than like a PCI device.
> > 
> > 
> > 
> > -- 
> > MST
> 
> -- 
> Ram Pai

^ permalink raw reply

* Re: [RFC V2] virtio: Add platform specific DMA API translation for virito devices
From: Benjamin Herrenschmidt @ 2018-06-11  3:29 UTC (permalink / raw)
  To: Ram Pai, Michael S. Tsirkin
  Cc: robh, pawel.moll, Tom Lendacky, cohuck, linux-kernel,
	virtualization, Christoph Hellwig, joe, Rustad, Mark D,
	Anshuman Khandual, linuxppc-dev, elfring, david
In-Reply-To: <20180611023909.GA5726@ram.oc3035372033.ibm.com>

On Sun, 2018-06-10 at 19:39 -0700, Ram Pai wrote:
> 
> However if the administrator
> ignores/forgets/deliberatey-decides/is-constrained to NOT enable the
> flag, virtio will not be able to pass control to the DMA ops associated
> with the virtio devices. Which means, we have no opportunity to share
> the I/O buffers with the hypervisor/qemu.
> 
> How do you suggest, we handle this case?

At the risk of repeating myself, let's just do the first pass which is
to switch virtio over to always using the DMA API in the actual data
flow code, with a hook at initialization time that replaces the DMA ops
with some home cooked "direct" ops in the case where the IOMMU flag
isn't set.

This will be equivalent to what we have today but avoids having 2
separate code path all over the driver.

Then a second stage, I think, is to replace this "hook" so that the
architecture gets a say in the matter.

Basically, something like:

	arch_virtio_update_dma_ops(pci_dev, qemu_direct_mode).

IE, virtio would tell the arch whether the "other side" is in fact QEMU
in a mode that bypasses the IOMMU and is cache coherent with the guest.
This is our legacy "qemu special" mode. If the performance is
sufficient we may want to deprecate it over time and have qemu enable
the iommu by default but we still need it.

A weak implementation of the above will be provied that just puts in
the direct ops when qemu_direct_mode is set, and thus provides today's
behaviour on any arch that doesn't override it. If the flag is not set,
the ops are left to whatever the arch PCI layer already set.

This will provide the opportunity for architectures that want to do
something special, such as in our case, when we want to force even the
"qemu_direct_mode" to go via bounce buffers, to put our own ops in
there, while retaining the legacy behaviour otherwise.

It also means that the "gunk" is entirely localized in that one
function, the rest of virtio just uses the DMA API normally.

Christoph, are you actually hacking "stage 1" above already or should
we produce patches ?

Cheers,
Ben.

^ permalink raw reply

* Re: [RFC V2] virtio: Add platform specific DMA API translation for virito devices
From: Benjamin Herrenschmidt @ 2018-06-11  3:34 UTC (permalink / raw)
  To: Michael S. Tsirkin, Ram Pai
  Cc: robh, pawel.moll, Tom Lendacky, cohuck, linux-kernel,
	virtualization, Christoph Hellwig, joe, Rustad, Mark D,
	Anshuman Khandual, linuxppc-dev, elfring, david
In-Reply-To: <20180611060949-mutt-send-email-mst@kernel.org>

On Mon, 2018-06-11 at 06:28 +0300, Michael S. Tsirkin wrote:
> 
> > However if the administrator
> > ignores/forgets/deliberatey-decides/is-constrained to NOT enable the
> > flag, virtio will not be able to pass control to the DMA ops associated
> > with the virtio devices. Which means, we have no opportunity to share
> > the I/O buffers with the hypervisor/qemu.
> > 
> > How do you suggest, we handle this case?
> 
> As step 1, ignore it as a user error.

Ugh ... not again. Ram, don't bring that subject back we ALREADY
addressed it, and requiring the *user* to do special things is just
utterly and completely wrong.

The *user* has no bloody idea what that stuff is, will never know to
set whatver magic qemu flag etc... The user will just start a a VM
normally and expect things to work. Requiring the *user* to know things
like that iommu virtio flag is complete nonsense.

If by "user" you mean libvirt, then you are now requesting about 4 or 5
different projects to be patched to add speical cases for something
they know nothing about and is completely irrelevant, while it can be
entirely addressed with a 1-liner in virtio kernel side to allow the
arch to plumb alternate DMA ops.

So for some reason you seem to be dead set on a path that leads to
mountain of user pain, changes to many different projects and overall
havok while there is a much much simpler and elegant solution at hand
which I described (again) in the response to Ram I sent about 5mn ago.

> Further you can for example add per-device quirks in virtio so it can be
> switched to dma api. make extra decisions in platform code then.
> 
> > > 
> > > 
> > > 
> > > > Both in the flag naming and the implementation there is an implication
> > > > of DMA API == IOMMU, which is fundamentally wrong.
> > > 
> > > Maybe we need to extend the meaning of PLATFORM_IOMMU or rename it.
> > > 
> > > It's possible that some setups will benefit from a more
> > > fine-grained approach where some aspects of the DMA
> > > API are bypassed, others aren't.
> > > 
> > > This seems to be what was being asked for in this thread,
> > > with comments claiming IOMMU flag adds too much overhead.
> > > 
> > > 
> > > > The DMA API does a few different things:
> > > > 
> > > >  a) address translation
> > > > 
> > > > 	This does include IOMMUs.  But it also includes random offsets
> > > > 	between PCI bars and system memory that we see on various
> > > > 	platforms.
> > > 
> > > I don't think you mean bars. That's unrelated to DMA.
> > > 
> > > >  Worse so some of these offsets might be based on
> > > > 	banks, e.g. on the broadcom bmips platform.  It also deals
> > > > 	with bitmask in physical addresses related to memory encryption
> > > > 	like AMD SEV.  I'd be really curious how for example the
> > > > 	Intel virtio based NIC is going to work on any of those
> > > > 	plaforms.
> > > 
> > > SEV guys report that they just set the iommu flag and then it all works.
> > 
> > This is one of the fundamental difference between SEV architecture and
> > the ultravisor architecture. In SEV, qemu is aware of SEV.  In
> > ultravisor architecture, only the VM that runs within qemu is aware of
> > ultravisor;  hypervisor/qemu/administrator are untrusted entities.
> 
> Spo one option is to teach qemu that it's on a platform with an
> ultravisor, this might have more advantages.
> 
> > I hope, we can make virtio subsystem flexibe enough to support various
> > security paradigms.
> 
> So if you are worried about qemu attacking guests, I see
> more problems than just passing an incorrect iommu
> flag.
> 
> 
> > Apart from the above reason, Christoph and Ben point to so many other
> > reasons to make it flexibe. So why not, make it happen?
> > 
> 
> I don't see a flexibility argument.  I just don't think new platforms
> should use workarounds that we put in place for old ones.
> 
> 
> > > I guess if there's translation we can think of this as a kind of iommu.
> > > Maybe we should rename PLATFORM_IOMMU to PLARTFORM_TRANSLATION?
> > > 
> > > And apparently some people complain that just setting that flag makes
> > > qemu check translation on each access with an unacceptable performance
> > > overhead.  Forcing same behaviour for everyone on general principles
> > > even without the flag is unlikely to make them happy.
> > > 
> > > >   b) coherency
> > > > 
> > > > 	On many architectures DMA is not cache coherent, and we need
> > > > 	to invalidate and/or write back cache lines before doing
> > > > 	DMA.  Again, I wonder how this is every going to work with
> > > > 	hardware based virtio implementations.
> > > 
> > > 
> > > You mean dma_Xmb and friends?
> > > There's a new feature VIRTIO_F_IO_BARRIER that's being proposed
> > > for that.
> > > 
> > > 
> > > >  Even worse I think this
> > > > 	is actually broken at least for VIVT event for virtualized
> > > > 	implementations.  E.g. a KVM guest is going to access memory
> > > > 	using different virtual addresses than qemu, vhost might throw
> > > > 	in another different address space.
> > > 
> > > I don't really know what VIVT is. Could you help me please?
> > > 
> > > >   c) bounce buffering
> > > > 
> > > > 	Many DMA implementations can not address all physical memory
> > > > 	due to addressing limitations.  In such cases we copy the
> > > > 	DMA memory into a known addressable bounc buffer and DMA
> > > > 	from there.
> > > 
> > > Don't do it then?
> > > 
> > > 
> > > >   d) flushing write combining buffers or similar
> > > > 
> > > > 	On some hardware platforms we need workarounds to e.g. read
> > > > 	from a certain mmio address to make sure DMA can actually
> > > > 	see memory written by the host.
> > > 
> > > I guess it isn't an issue as long as WC isn't actually used.
> > > It will become an issue when virtio spec adds some WC capability -
> > > I suspect we can ignore this for now.
> > > 
> > > > 
> > > > All of this is bypassed by virtio by default despite generally being
> > > > platform issues, not particular to a given device.
> > > 
> > > It's both a device and a platform issue. A PV device is often more like
> > > another CPU than like a PCI device.
> > > 
> > > 
> > > 
> > > -- 
> > > MST
> > 
> > -- 
> > Ram Pai

^ permalink raw reply

* RE: [PATCH v6] virtio_blk: add DISCARD and WRIET ZEROES commands support
From: Liu, Changpeng @ 2018-06-11  3:37 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: pbonzini@redhat.com, cavery@redhat.com,
	virtualization@lists.linux-foundation.org
In-Reply-To: <20180608102027.GB31164@stefanha-x1.localdomain>



> -----Original Message-----
> From: Stefan Hajnoczi [mailto:stefanha@redhat.com]
> Sent: Friday, June 8, 2018 6:20 PM
> To: Liu, Changpeng <changpeng.liu@intel.com>
> Cc: virtualization@lists.linux-foundation.org; cavery@redhat.com;
> jasowang@redhat.com; pbonzini@redhat.com; Wang, Wei W
> <wei.w.wang@intel.com>
> Subject: Re: [PATCH v6] virtio_blk: add DISCARD and WRIET ZEROES commands
> support
> 
> On Thu, Jun 07, 2018 at 11:07:06PM +0000, Liu, Changpeng wrote:
> >
> >
> > > -----Original Message-----
> > > From: Stefan Hajnoczi [mailto:stefanha@redhat.com]
> > > Sent: Thursday, June 7, 2018 9:10 PM
> > > To: Liu, Changpeng <changpeng.liu@intel.com>
> > > Cc: virtualization@lists.linux-foundation.org; cavery@redhat.com;
> > > jasowang@redhat.com; pbonzini@redhat.com; Wang, Wei W
> > > <wei.w.wang@intel.com>
> > > Subject: Re: [PATCH v6] virtio_blk: add DISCARD and WRIET ZEROES commands
> > > support
> > >
> > > On Wed, Jun 06, 2018 at 12:19:00PM +0800, Changpeng Liu wrote:
> > > > Existing virtio-blk protocol doesn't have DISCARD/WRITE ZEROES commands
> > > > support, this will impact the performance when using SSD backend over
> > > > file systems.
> > > >
> > > > Commit 88c85538 "virtio-blk: add discard and write zeroes features to
> > > > specification"(see https://github.com/oasis-tcs/virtio-spec) extended
> > > > existing virtio-blk protocol, adding extra DISCARD and WRITE ZEROES
> > > > commands support.
> > > >
> > > > While here, using 16 bytes descriptor to describe one segment of DISCARD
> > > > or WRITE ZEROES commands, each command may contain one or more
> > > decriptors.
> > > >
> > > > The following data structure shows the definition of one descriptor:
> > > >
> > > > struct virtio_blk_discard_write_zeroes {
> > > >         le64 sector;
> > > >         le32 num_sectors;
> > > >         le32 unmap;
> > > > };
> > > >
> > > > Field 'sector' means the start sector for DISCARD and WRITE ZEROES,
> > > > filed 'num_sectors' means the number of sectors for DISCARD and WRITE
> > > > ZEROES, if both DISCARD and WRITE ZEROES are supported, field 'unmap'
> > > > maybe used for WRITE ZEROES command with DISCARD enabled.
> > > >
> > > > We also extended the virtio-blk configuration space to let backend
> > > > device put DISCARD and WRITE ZEROES configuration parameters.
> > > >
> > > > struct virtio_blk_config {
> > > >         [...]
> > > >
> > > >         le32 max_discard_sectors;
> > > >         le32 max_discard_seg;
> > > >         le32 discard_sector_alignment;
> > > >         le32 max_write_zeroes_sectors;
> > > >         le32 max_write_zeroes_seg;
> > > >         u8 write_zeroes_may_unmap;
> > > > }
> > > >
> > > > New feature bit [VIRTIO_BLK_F_DISCARD (13)]: Device can support discard
> > > > command, maximum discard sectors size in field 'max_discard_sectors' and
> > > > maximum discard segment number in field 'max_discard_seg'.
> > > >
> > > > New feature [VIRTIO_BLK_F_WRITE_ZEROES (14)]: Device can support write
> > > > zeroes command, maximum write zeroes sectors size in field
> > > > 'max_write_zeroes_sectors' and maximum write zeroes segment number in
> > > > field 'max_write_zeroes_seg'.
> > > >
> > > > The parameters in the configuration space of the device field
> > > > 'max_discard_sectors' and field 'discard_sector_alignment' are expressed in
> > > > 512-byte units if the VIRTIO_BLK_F_DISCARD feature bit is negotiated. The
> > > > field 'max_write_zeroes_sectors' is expressed in 512-byte units if the
> > > > VIRTIO_BLK_F_WRITE_ZEROES feature bit is negotiated.
> > > >
> > > > Signed-off-by: Changpeng Liu <changpeng.liu@intel.com>
> > > > ---
> > > > CHANGELOG:
> > > > v6: don't set T_OUT bit to discard and write zeroes commands.
> > >
> > > I don't see this in the patch...
> > Yeah, do noting with DISCARD/WRITE ZEROES means no need to OR BLK_T_OUT
> again.
> > >
> > > > @@ -225,6 +260,7 @@ static blk_status_t virtio_queue_rq(struct
> > > blk_mq_hw_ctx *hctx,
> > > >  	int qid = hctx->queue_num;
> > > >  	int err;
> > > >  	bool notify = false;
> > > > +	bool unmap = false;
> > > >  	u32 type;
> > > >
> > > >  	BUG_ON(req->nr_phys_segments + 2 > vblk->sg_elems);
> > > > @@ -237,6 +273,13 @@ static blk_status_t virtio_queue_rq(struct
> > > blk_mq_hw_ctx *hctx,
> > > >  	case REQ_OP_FLUSH:
> > > >  		type = VIRTIO_BLK_T_FLUSH;
> > > >  		break;
> > > > +	case REQ_OP_DISCARD:
> > > > +		type = VIRTIO_BLK_T_DISCARD;
> > > > +		break;
> > > > +	case REQ_OP_WRITE_ZEROES:
> > > > +		type = VIRTIO_BLK_T_WRITE_ZEROES;
> > > > +		unmap = !(req->cmd_flags & REQ_NOUNMAP);
> > > > +		break;
> > > >  	case REQ_OP_SCSI_IN:
> > > >  	case REQ_OP_SCSI_OUT:
> > > >  		type = VIRTIO_BLK_T_SCSI_CMD;
> > > > @@ -256,6 +299,12 @@ static blk_status_t virtio_queue_rq(struct
> > > blk_mq_hw_ctx *hctx,
> > > >
> > > >  	blk_mq_start_request(req);
> > > >
> > > > +	if (type == VIRTIO_BLK_T_DISCARD || type ==
> > > VIRTIO_BLK_T_WRITE_ZEROES) {
> > > > +		err = virtblk_setup_discard_write_zeroes(req, unmap);
> > > > +		if (err)
> > > > +			return BLK_STS_RESOURCE;
> > > > +	}
> > > > +
> > > >  	num = blk_rq_map_sg(hctx->queue, req, vbr->sg);
> > > >  	if (num) {
> > > >  		if (rq_data_dir(req) == WRITE)
> > >
> > > ...since we still do blk_rq_map_sg() here and num should be != 0.
> > No, while here, we should keep the original logic for READ/WRITE commands.
> 
> My question is: why does the changelog say "don't set T_OUT" but the
> code *will* set it because blk_rq_map_sg() returns != 0 and
> rq_data_dir(req) == WRITE?
Since the last bit of DISCARD/WRITE ZEROES commands are already 1, so I said we don't need to set
T_OUT bit to DISCARD/WRITE ZEROES commands again. But the original logic for WRITE, T_OUT is still
needed, so just keep the original code here is fine.
> 
> Maybe I'm misreading the code, but it looks to me like this patch
> does the opposite of what the changelog says.
> 
> Stefan

^ permalink raw reply

* Re: [PATCH v3 6/9] x86: prevent inline distortion by paravirt ops
From: Peter Zijlstra @ 2018-06-11  7:45 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Juergen Gross, x86, linux-kernel, virtualization, Ingo Molnar,
	H. Peter Anvin, Alok Kataria, Thomas Gleixner
In-Reply-To: <20180610141911.52948-7-namit@vmware.com>

On Sun, Jun 10, 2018 at 07:19:08AM -0700, Nadav Amit wrote:
> +/*
> + * This generates an indirect call based on the operation type number.
> + * The type number, computed in PARAVIRT_PATCH, is derived from the
> + * offset into the paravirt_patch_template structure, and can therefore be
> + * freely converted back into a structure offset.
> + */
> +.macro PARAVIRT_ALT type:req clobber:req pv_opptr:req

Unlike the marcro maze you replaced, this has the CALL hardcoded in. So
maybe name this PARAVIRT_CALL instead of PARAVIRT_ALT ?

> +771:	ANNOTATE_RETPOLINE_SAFE
> +	call *\pv_opptr
> +772:	.pushsection .parainstructions,"a"
> +	_ASM_ALIGN
> +	_ASM_PTR 771b
> +	.byte \type
> +	.byte 772b-771b
> +	.short \clobber
> +	.popsection
> +.endm

^ permalink raw reply

* [PULL] vhost: cleanups and fixes
From: Michael S. Tsirkin @ 2018-06-11 16:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: ohad, kevin, kvm, mst, netdev, liang.z.li, linux-remoteproc,
	linux-kernel, stable, bjorn.andersson, mhocko, mhocko,
	syzbot+87cfa083e727a224754b, akpm, virtualization

The following changes since commit 29dcea88779c856c7dc92040a0c01233263101d4:

  Linux 4.17 (2018-06-03 14:15:21 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus

for you to fetch changes up to aa15783ee62d57d69433101ede3e3ed11e48161d:

  virtio: update the comments for transport features (2018-06-07 22:17:40 +0300)

----------------------------------------------------------------
virtio, vhost: features, fixes

VF support for virtio.
Free page hint request support for VM migration.
DMA barriers for virtio strong barriers.
Bugfixes.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

----------------------------------------------------------------
Michael S. Tsirkin (2):
      virtio_ring: switch to dma_XX barriers for rpmsg
      vhost: fix info leak due to uninitialized memory

Tiwei Bie (2):
      virtio_pci: support enabling VFs
      virtio: update the comments for transport features

Wei Wang (4):
      mm: support reporting free page blocks
      virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
      mm/page_poison: expose page_poisoning_enabled to kernel modules
      virtio-balloon: VIRTIO_BALLOON_F_PAGE_POISON

 drivers/vhost/vhost.c               |   3 +
 drivers/virtio/virtio_balloon.c     | 298 +++++++++++++++++++++++++++++++-----
 drivers/virtio/virtio_pci_common.c  |  30 ++++
 drivers/virtio/virtio_pci_modern.c  |  14 ++
 include/linux/mm.h                  |   6 +
 include/linux/virtio_ring.h         |   4 +-
 include/uapi/linux/virtio_balloon.h |   7 +
 include/uapi/linux/virtio_config.h  |  16 +-
 mm/page_alloc.c                     |  97 ++++++++++++
 mm/page_poison.c                    |   6 +
 10 files changed, 439 insertions(+), 42 deletions(-)

^ permalink raw reply

* Re: [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Michael S. Tsirkin @ 2018-06-11 17:26 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: alexander.h.duyck, virtio-dev, qemu-devel, jiri, kubakici, netdev,
	virtualization, loseweigh, aaron.f.brown
In-Reply-To: <1525734594-11134-1-git-send-email-sridhar.samudrala@intel.com>

On Mon, May 07, 2018 at 04:09:54PM -0700, Sridhar Samudrala wrote:
> This feature bit can be used by hypervisor to indicate virtio_net device to
> act as a standby for another device with the same MAC address.
> 
> I tested this with a small change to the patch to mark the STANDBY feature 'true'
> by default as i am using libvirt to start the VMs.
> Is there a way to pass the newly added feature bit 'standby' to qemu via libvirt
> XML file?
> 
> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>

So I do not think we can commit to this interface: we
really need to control visibility of the primary device.

However just for testing purposes, we could add a non-stable
interface "x-standby" with the understanding that as any
x- prefix it's unstable and will be changed down the road,
likely in the next release.


> ---
>  hw/net/virtio-net.c                         | 2 ++
>  include/standard-headers/linux/virtio_net.h | 3 +++
>  2 files changed, 5 insertions(+)
> 
> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> index 90502fca7c..38b3140670 100644
> --- a/hw/net/virtio-net.c
> +++ b/hw/net/virtio-net.c
> @@ -2198,6 +2198,8 @@ static Property virtio_net_properties[] = {
>                       true),
>      DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed, SPEED_UNKNOWN),
>      DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
> +    DEFINE_PROP_BIT64("standby", VirtIONet, host_features, VIRTIO_NET_F_STANDBY,
> +                      false),
>      DEFINE_PROP_END_OF_LIST(),
>  };
>  
> diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
> index e9f255ea3f..01ec09684c 100644
> --- a/include/standard-headers/linux/virtio_net.h
> +++ b/include/standard-headers/linux/virtio_net.h
> @@ -57,6 +57,9 @@
>  					 * Steering */
>  #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
>  
> +#define VIRTIO_NET_F_STANDBY      62    /* Act as standby for another device
> +                                         * with the same MAC.
> +                                         */
>  #define VIRTIO_NET_F_SPEED_DUPLEX 63	/* Device set linkspeed and duplex */
>  
>  #ifndef VIRTIO_NET_NO_LEGACY
> -- 
> 2.14.3

^ permalink raw reply

* Re: [PULL] vhost: cleanups and fixes
From: Linus Torvalds @ 2018-06-11 18:32 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: KVM list, Network Development, Linux Kernel Mailing List,
	Bjorn Andersson, Andrew Morton, virtualization
In-Reply-To: <20180611192353-mutt-send-email-mst@kernel.org>

On Mon, Jun 11, 2018 at 9:24 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>
>       virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

Is this really a good idea?

Plus it seems entirely broken.

The report_pfn_range() callback is done under the zone lock, but
virtio_balloon_send_free_pages() (which is the only callback used that
I can find) does add_one_sg(), which does virtqueue_add_inbuf(vq, &sg,
1, vq, GFP_KERNEL);

So now we apparently do a GFP_KERNEL allocation insider the mm zone
lock, which is broken on just _so_ many levels.

Pulled and then unpulled again.

Either somebody needs to explain why I'm wrong and you can re-submit
this, or this kind of garbage needs to go away.

I do *not* want to be in the situation where I pull stuff from the
virtio people that adds completely broken core VM functionality.

Because if I'm in that situation, I will stop pulling from you guys.
Seriously. You have *no* place sending me broken shit that is outside
the virtio layer.

Subsystems that break code MM will get shunned. You just aren't
important enough to allow you breaking code VM.

                Linus

^ permalink raw reply

* Re: [PULL] vhost: cleanups and fixes
From: Linus Torvalds @ 2018-06-11 18:44 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: KVM list, Network Development, Linux Kernel Mailing List,
	Bjorn Andersson, Andrew Morton, virtualization
In-Reply-To: <CA+55aFzrPgnd7hRPrkeV+jX-MSwOZf7T4wKxz66Lk4oub3PZsw@mail.gmail.com>

On Mon, Jun 11, 2018 at 11:32 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So now we apparently do a GFP_KERNEL allocation insider the mm zone
> lock, which is broken on just _so_ many levels.

Oh, I see the comment about how it doesn't actually do an allocation
at all because it's a single-entry.

Still too damn ugly to live, and much too fragile. No way in hell do
we even _hint_ at a GFP_KERNEL when we're inside a context that can't
do any memory allocation at all.

Plus I'm not convinced it's a "no allocation" path even despite that
comment, because it also does a "dma_map_page()" etc, which can cause
allocations to do the dma mapping thing afaik. No?

Maybe there's some reason why that doesn't happen either, but
basically this whole callchain looks *way* to complicated to be used
under a core VM spinlock.

                Linus

^ permalink raw reply

* Re: [PATCH 1/2] Convert target drivers to use sbitmap
From: Jens Axboe @ 2018-06-12  1:18 UTC (permalink / raw)
  To: Matthew Wilcox, linux-kernel, linux-scsi, target-devel,
	linux1394-devel, linux-usb, kvm, virtualization, netdev,
	Juergen Gross, qla2xxx-upstream, Kent Overstreet
  Cc: Matthew Wilcox
In-Reply-To: <3a56027b-47bc-dcb8-a465-3670031572f1@kernel.dk>

On 5/15/18 10:11 AM, Jens Axboe wrote:
> On 5/15/18 10:00 AM, Matthew Wilcox wrote:
>> From: Matthew Wilcox <mawilcox@microsoft.com>
>>
>> The sbitmap and the percpu_ida perform essentially the same task,
>> allocating tags for commands.  Since the sbitmap is more used than
>> the percpu_ida, convert the percpu_ida users to the sbitmap API.
> 
> It should also be the same performance as percpu_ida in light use, and
> performs much better at > 50% utilization of the tag space. I think
> that's better justification than "more used than".
> 
>> diff --git a/drivers/target/iscsi/iscsi_target_util.c b/drivers/target/iscsi/iscsi_target_util.c
>> index 4435bf374d2d..28bcffae609f 100644
>> --- a/drivers/target/iscsi/iscsi_target_util.c
>> +++ b/drivers/target/iscsi/iscsi_target_util.c
>> @@ -17,7 +17,7 @@
>>   ******************************************************************************/
>>  
>>  #include <linux/list.h>
>> -#include <linux/percpu_ida.h>
>> +#include <linux/sched/signal.h>
>>  #include <net/ipv6.h>         /* ipv6_addr_equal() */
>>  #include <scsi/scsi_tcq.h>
>>  #include <scsi/iscsi_proto.h>
>> @@ -147,6 +147,28 @@ void iscsit_free_r2ts_from_list(struct iscsi_cmd *cmd)
>>  	spin_unlock_bh(&cmd->r2t_lock);
>>  }
>>  
>> +int iscsit_wait_for_tag(struct se_session *se_sess, int state, int *cpup)
>> +{
>> +	int tag = -1;
>> +	DEFINE_WAIT(wait);
>> +	struct sbq_wait_state *ws;
>> +
>> +	if (state == TASK_RUNNING)
>> +		return tag;
>> +
>> +	ws = &se_sess->sess_tag_pool.ws[0];
>> +	for (;;) {
>> +		prepare_to_wait_exclusive(&ws->wait, &wait, state);
>> +		if (signal_pending_state(state, current))
>> +			break;
>> +		schedule();
>> +		tag = sbitmap_queue_get(&se_sess->sess_tag_pool, cpup);
>> +	}
>> +
>> +	finish_wait(&ws->wait, &wait);
>> +	return tag;
>> +}
> 
> Seems like that should be:
> 
> 
> 	ws = &se_sess->sess_tag_pool.ws[0];
> 	for (;;) {
> 		prepare_to_wait_exclusive(&ws->wait, &wait, state);
> 		if (signal_pending_state(state, current))
> 			break;
> 		tag = sbitmap_queue_get(&se_sess->sess_tag_pool, cpup);
> 		if (tag != -1)
> 			break;
> 		schedule();
> 	}
> 
> 	finish_wait(&ws->wait, &wait);
> 	return tag;
> 
>>  /*
>>   * May be called from software interrupt (timer) context for allocating
>>   * iSCSI NopINs.
>> @@ -155,9 +177,11 @@ struct iscsi_cmd *iscsit_allocate_cmd(struct iscsi_conn *conn, int state)
>>  {
>>  	struct iscsi_cmd *cmd;
>>  	struct se_session *se_sess = conn->sess->se_sess;
>> -	int size, tag;
>> +	int size, tag, cpu;
>>  
>> -	tag = percpu_ida_alloc(&se_sess->sess_tag_pool, state);
>> +	tag = sbitmap_queue_get(&se_sess->sess_tag_pool, &cpu);
>> +	if (tag < 0)
>> +		tag = iscsit_wait_for_tag(se_sess, state, &cpu);
>>  	if (tag < 0)
>>  		return NULL;
> 
> Might make sense to just roll the whole thing into iscsi_get_tag(), that
> would be cleaner.
> 
> sbitmap should provide a helper for that, but we can do that cleanup
> later. That would encapsulate things like the per-cpu caching hint too,
> for instance.
> 
> Rest looks fine to me.

Are you going to push this further? I really think we should.

-- 
Jens Axboe

^ permalink raw reply

* Re: [PULL] vhost: cleanups and fixes
From: Michael S. Tsirkin @ 2018-06-12  1:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: KVM list, Network Development, Linux Kernel Mailing List,
	Bjorn Andersson, Andrew Morton, virtualization
In-Reply-To: <CA+55aFz4uBHaWnBarVUEG90s2ucyVtoLmwYpYVwDV+XQESNRqw@mail.gmail.com>

On Mon, Jun 11, 2018 at 11:44:05AM -0700, Linus Torvalds wrote:
> On Mon, Jun 11, 2018 at 11:32 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > So now we apparently do a GFP_KERNEL allocation insider the mm zone
> > lock, which is broken on just _so_ many levels.
> 
> Oh, I see the comment about how it doesn't actually do an allocation
> at all because it's a single-entry.
> 
> Still too damn ugly to live, and much too fragile. No way in hell do
> we even _hint_ at a GFP_KERNEL when we're inside a context that can't
> do any memory allocation at all.
> 
> Plus I'm not convinced it's a "no allocation" path even despite that
> comment, because it also does a "dma_map_page()" etc, which can cause
> allocations to do the dma mapping thing afaik. No?

Well no because DMA is triggered by the IOMMU flag and
that is always off for the balloon. But I hear what you
are saying about it being fragile.

> Maybe there's some reason why that doesn't happen either, but
> basically this whole callchain looks *way* to complicated to be used
> under a core VM spinlock.
> 
>                 Linus

Maybe it will help to have GFP_NONE which will make any allocation
fail if attempted. Linus, would this address your comment?
-- 
MST

^ permalink raw reply

* Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Jason Wang @ 2018-06-12  1:54 UTC (permalink / raw)
  To: Michael S. Tsirkin, Sridhar Samudrala
  Cc: alexander.h.duyck, virtio-dev, aaron.f.brown, jiri, kubakici,
	netdev, qemu-devel, loseweigh, virtualization
In-Reply-To: <20180611202207-mutt-send-email-mst@kernel.org>



On 2018年06月12日 01:26, Michael S. Tsirkin wrote:
> On Mon, May 07, 2018 at 04:09:54PM -0700, Sridhar Samudrala wrote:
>> This feature bit can be used by hypervisor to indicate virtio_net device to
>> act as a standby for another device with the same MAC address.
>>
>> I tested this with a small change to the patch to mark the STANDBY feature 'true'
>> by default as i am using libvirt to start the VMs.
>> Is there a way to pass the newly added feature bit 'standby' to qemu via libvirt
>> XML file?
>>
>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> So I do not think we can commit to this interface: we
> really need to control visibility of the primary device.

The problem is legacy guest won't use primary device at all if we do this.

How about control the visibility of standby device?

Thanks

> However just for testing purposes, we could add a non-stable
> interface "x-standby" with the understanding that as any
> x- prefix it's unstable and will be changed down the road,
> likely in the next release.
>
>
>> ---
>>   hw/net/virtio-net.c                         | 2 ++
>>   include/standard-headers/linux/virtio_net.h | 3 +++
>>   2 files changed, 5 insertions(+)
>>
>> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
>> index 90502fca7c..38b3140670 100644
>> --- a/hw/net/virtio-net.c
>> +++ b/hw/net/virtio-net.c
>> @@ -2198,6 +2198,8 @@ static Property virtio_net_properties[] = {
>>                        true),
>>       DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed, SPEED_UNKNOWN),
>>       DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
>> +    DEFINE_PROP_BIT64("standby", VirtIONet, host_features, VIRTIO_NET_F_STANDBY,
>> +                      false),
>>       DEFINE_PROP_END_OF_LIST(),
>>   };
>>   
>> diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
>> index e9f255ea3f..01ec09684c 100644
>> --- a/include/standard-headers/linux/virtio_net.h
>> +++ b/include/standard-headers/linux/virtio_net.h
>> @@ -57,6 +57,9 @@
>>   					 * Steering */
>>   #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
>>   
>> +#define VIRTIO_NET_F_STANDBY      62    /* Act as standby for another device
>> +                                         * with the same MAC.
>> +                                         */
>>   #define VIRTIO_NET_F_SPEED_DUPLEX 63	/* Device set linkspeed and duplex */
>>   
>>   #ifndef VIRTIO_NET_NO_LEGACY
>> -- 
>> 2.14.3

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PULL] vhost: cleanups and fixes
From: Michael S. Tsirkin @ 2018-06-12  1:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: nilal, KVM list, Network Development, Linux Kernel Mailing List,
	Bjorn Andersson, Andrew Morton, virtualization
In-Reply-To: <CA+55aFzrPgnd7hRPrkeV+jX-MSwOZf7T4wKxz66Lk4oub3PZsw@mail.gmail.com>

On Mon, Jun 11, 2018 at 11:32:41AM -0700, Linus Torvalds wrote:
> On Mon, Jun 11, 2018 at 9:24 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> >       virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
> 
> Is this really a good idea?

Well knowing which pages are unused does seem to be useful.  Do you
refer to this generally or to the idea of scanning the free list,
or to the specific implementation issues you listed below?


> Plus it seems entirely broken.
> 
> The report_pfn_range() callback is done under the zone lock, but
> virtio_balloon_send_free_pages() (which is the only callback used that
> I can find) does add_one_sg(), which does virtqueue_add_inbuf(vq, &sg,
> 1, vq, GFP_KERNEL);
> 
> So now we apparently do a GFP_KERNEL allocation insider the mm zone
> lock, which is broken on just _so_ many levels.
> 
> Pulled and then unpulled again.
> 
> Either somebody needs to explain why I'm wrong and you can re-submit
> this, or this kind of garbage needs to go away.
> 
> I do *not* want to be in the situation where I pull stuff from the
> virtio people that adds completely broken core VM functionality.
> 
> Because if I'm in that situation, I will stop pulling from you guys.
> Seriously. You have *no* place sending me broken shit that is outside
> the virtio layer.
> 
> Subsystems that break code MM will get shunned. You just aren't
> important enough to allow you breaking code VM.
> 
>                 Linus

So no, it doesn't do any allocations - GFP_KERNEL or otherwise, but I
agree how it might be confusing.

So I'll drop this for now, but it would be nice if there is a bit more
direction on this feature since I know several people are trying to work
on this.


-- 
MST

^ permalink raw reply

* Re: [PULL] vhost: cleanups and fixes
From: Linus Torvalds @ 2018-06-12  1:59 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: KVM list, Network Development, Linux Kernel Mailing List,
	Bjorn Andersson, Andrew Morton, virtualization
In-Reply-To: <20180612042600-mutt-send-email-mst@kernel.org>

On Mon, Jun 11, 2018 at 6:36 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> Maybe it will help to have GFP_NONE which will make any allocation
> fail if attempted. Linus, would this address your comment?

It would definitely have helped me initially overlook that call chain.

But then when I started looking at the whole dma_map_page() thing, it
just raised my hackles again.

I would seriously suggest having a much simpler version for the "no
allocation, no dma mapping" case, so that it's *obvious* that that
never happens.

So instead of having virtio_balloon_send_free_pages() call a really
generic complex chain of functions that in _some_ cases can do memory
allocation, why isn't there a short-circuited "vitruque_add_datum()"
that is guaranteed to never do anything like that?

Honestly, I look at "add_one_sg()" and it really doesn't make me
happy. It looks hacky as hell. If I read the code right, you're really
trying to just queue up a simple tuple of <pfn,len>, except you encode
it as a page pointer in order to play games with the SG logic, and
then you hmap that to the ring, except in this case it's all a fake
ring that just adds the cpu-physical address instead.

And to figuer that out, it's like five layers of indirection through
different helper functions that *can* do more generic things but in
this case don't.

And you do all of this from a core VM callback function with some
_really_ core VM locks held.

That makes no sense to me.

How about this:

 - get rid of all that code

 - make the core VM callback save the "these are the free memory
regions" in a fixed and limited array. One that DOES JUST THAT. No
crazy "SG IO dma-mapping function crap". Just a plain array of a fixed
size, pre-allocated for that virtio instance.

 - make it obvious that what you do in that sequence is ten
instructions and no allocations ("Look ma, I wrote a value to an array
and incremented the array idex, and I'M DONE")

 - then in that workqueue entry that you start *anyway*, you empty the
array and do all the crazy virtio stuff.

In fact, while at it, just simplify the VM interface too. Instead of
traversing a random number of buddy lists, just trraverse *one* - the
top-level one. Are you seriously ever going to shrink or mark
read-only anythin *but* something big enough to be in the maximum
order?

MAX_ORDER is what, 11? So we're talking 8MB blocks. Do you *really*
want the balloon code to work on smaller things, particularly since
the whole interface is fundamentally racy and opportunistic to begin
with?

The whole sequence of events really looks "this is too much
complexity, and way too fragile" to me at so many levels.

                 Linus

^ permalink raw reply

* Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Michael S. Tsirkin @ 2018-06-12  2:17 UTC (permalink / raw)
  To: Jason Wang
  Cc: alexander.h.duyck, virtio-dev, aaron.f.brown, jiri, kubakici,
	Sridhar Samudrala, qemu-devel, loseweigh, netdev, virtualization
In-Reply-To: <dc8b1669-b8b6-fb59-94e2-6e4de6b96572@redhat.com>

On Tue, Jun 12, 2018 at 09:54:44AM +0800, Jason Wang wrote:
> 
> 
> On 2018年06月12日 01:26, Michael S. Tsirkin wrote:
> > On Mon, May 07, 2018 at 04:09:54PM -0700, Sridhar Samudrala wrote:
> > > This feature bit can be used by hypervisor to indicate virtio_net device to
> > > act as a standby for another device with the same MAC address.
> > > 
> > > I tested this with a small change to the patch to mark the STANDBY feature 'true'
> > > by default as i am using libvirt to start the VMs.
> > > Is there a way to pass the newly added feature bit 'standby' to qemu via libvirt
> > > XML file?
> > > 
> > > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> > So I do not think we can commit to this interface: we
> > really need to control visibility of the primary device.
> 
> The problem is legacy guest won't use primary device at all if we do this.

And that's by design - I think it's the only way to ensure the
legacy guest isn't confused.

> How about control the visibility of standby device?
> 
> Thanks

standy the always there to guarantee no downtime.

> > However just for testing purposes, we could add a non-stable
> > interface "x-standby" with the understanding that as any
> > x- prefix it's unstable and will be changed down the road,
> > likely in the next release.
> > 
> > 
> > > ---
> > >   hw/net/virtio-net.c                         | 2 ++
> > >   include/standard-headers/linux/virtio_net.h | 3 +++
> > >   2 files changed, 5 insertions(+)
> > > 
> > > diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> > > index 90502fca7c..38b3140670 100644
> > > --- a/hw/net/virtio-net.c
> > > +++ b/hw/net/virtio-net.c
> > > @@ -2198,6 +2198,8 @@ static Property virtio_net_properties[] = {
> > >                        true),
> > >       DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed, SPEED_UNKNOWN),
> > >       DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
> > > +    DEFINE_PROP_BIT64("standby", VirtIONet, host_features, VIRTIO_NET_F_STANDBY,
> > > +                      false),
> > >       DEFINE_PROP_END_OF_LIST(),
> > >   };
> > > diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
> > > index e9f255ea3f..01ec09684c 100644
> > > --- a/include/standard-headers/linux/virtio_net.h
> > > +++ b/include/standard-headers/linux/virtio_net.h
> > > @@ -57,6 +57,9 @@
> > >   					 * Steering */
> > >   #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
> > > +#define VIRTIO_NET_F_STANDBY      62    /* Act as standby for another device
> > > +                                         * with the same MAC.
> > > +                                         */
> > >   #define VIRTIO_NET_F_SPEED_DUPLEX 63	/* Device set linkspeed and duplex */
> > >   #ifndef VIRTIO_NET_NO_LEGACY
> > > -- 
> > > 2.14.3
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH 1/2] Convert target drivers to use sbitmap
From: Matthew Wilcox @ 2018-06-12  3:06 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Juergen Gross, kvm, linux-scsi, Matthew Wilcox, netdev, linux-usb,
	linux-kernel, virtualization, target-devel, qla2xxx-upstream,
	linux1394-devel, Kent Overstreet
In-Reply-To: <02326395-3241-c94b-ad70-3de27a6f5a8c@kernel.dk>

On Mon, Jun 11, 2018 at 07:18:55PM -0600, Jens Axboe wrote:
> Are you going to push this further? I really think we should.

Yes, I'll resubmit it tomorrow.  Sorry for dropping the ball on this one.

^ permalink raw reply

* Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Samudrala, Sridhar @ 2018-06-12  5:02 UTC (permalink / raw)
  To: Michael S. Tsirkin, Jason Wang
  Cc: alexander.h.duyck, virtio-dev, aaron.f.brown, jiri, kubakici,
	netdev, qemu-devel, loseweigh, virtualization
In-Reply-To: <20180612051432-mutt-send-email-mst@kernel.org>

On 6/11/2018 7:17 PM, Michael S. Tsirkin wrote:
> On Tue, Jun 12, 2018 at 09:54:44AM +0800, Jason Wang wrote:
>>
>> On 2018年06月12日 01:26, Michael S. Tsirkin wrote:
>>> On Mon, May 07, 2018 at 04:09:54PM -0700, Sridhar Samudrala wrote:
>>>> This feature bit can be used by hypervisor to indicate virtio_net device to
>>>> act as a standby for another device with the same MAC address.
>>>>
>>>> I tested this with a small change to the patch to mark the STANDBY feature 'true'
>>>> by default as i am using libvirt to start the VMs.
>>>> Is there a way to pass the newly added feature bit 'standby' to qemu via libvirt
>>>> XML file?
>>>>
>>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>> So I do not think we can commit to this interface: we
>>> really need to control visibility of the primary device.
>> The problem is legacy guest won't use primary device at all if we do this.
> And that's by design - I think it's the only way to ensure the
> legacy guest isn't confused.

Yes. I think so. But i am not sure if Qemu is the right place to control the visibility
of the primary device. The primary device may not be specified as an argument to Qemu. It
may be plugged in later.
The cloud service provider is providing a feature that enables low latency datapath and live
migration capability.
A tenant can use this feature only if he is running a VM that has virtio-net with failover support.

I think Qemu should check if guest virtio-net supports this feature and provide a mechanism for
an upper layer indicating if the STANDBY feature is successfully negotiated or not.
The upper layer can then decide if it should hot plug a VF with the same MAC and manage the 2 links.
If VF is successfully hot plugged, virtio-net link should be disabled.


>
>> How about control the visibility of standby device?
>>
>> Thanks
> standy the always there to guarantee no downtime.
>
>>> However just for testing purposes, we could add a non-stable
>>> interface "x-standby" with the understanding that as any
>>> x- prefix it's unstable and will be changed down the road,
>>> likely in the next release.
>>>
>>>
>>>> ---
>>>>    hw/net/virtio-net.c                         | 2 ++
>>>>    include/standard-headers/linux/virtio_net.h | 3 +++
>>>>    2 files changed, 5 insertions(+)
>>>>
>>>> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
>>>> index 90502fca7c..38b3140670 100644
>>>> --- a/hw/net/virtio-net.c
>>>> +++ b/hw/net/virtio-net.c
>>>> @@ -2198,6 +2198,8 @@ static Property virtio_net_properties[] = {
>>>>                         true),
>>>>        DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed, SPEED_UNKNOWN),
>>>>        DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
>>>> +    DEFINE_PROP_BIT64("standby", VirtIONet, host_features, VIRTIO_NET_F_STANDBY,
>>>> +                      false),
>>>>        DEFINE_PROP_END_OF_LIST(),
>>>>    };
>>>> diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
>>>> index e9f255ea3f..01ec09684c 100644
>>>> --- a/include/standard-headers/linux/virtio_net.h
>>>> +++ b/include/standard-headers/linux/virtio_net.h
>>>> @@ -57,6 +57,9 @@
>>>>    					 * Steering */
>>>>    #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
>>>> +#define VIRTIO_NET_F_STANDBY      62    /* Act as standby for another device
>>>> +                                         * with the same MAC.
>>>> +                                         */
>>>>    #define VIRTIO_NET_F_SPEED_DUPLEX 63	/* Device set linkspeed and duplex */
>>>>    #ifndef VIRTIO_NET_NO_LEGACY
>>>> -- 
>>>> 2.14.3

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PULL] vhost: cleanups and fixes
From: Wei Wang @ 2018-06-12 11:05 UTC (permalink / raw)
  To: Linus Torvalds, Michael S. Tsirkin
  Cc: KVM list, Network Development, Linux Kernel Mailing List,
	Bjorn Andersson, Andrew Morton, virtualization
In-Reply-To: <CA+55aFyNhEzzufw0XP9DcqZNS1CH+jDGdN4CVnazb3ssFxFbzQ@mail.gmail.com>

On 06/12/2018 09:59 AM, Linus Torvalds wrote:
> On Mon, Jun 11, 2018 at 6:36 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>> Maybe it will help to have GFP_NONE which will make any allocation
>> fail if attempted. Linus, would this address your comment?
> It would definitely have helped me initially overlook that call chain.
>
> But then when I started looking at the whole dma_map_page() thing, it
> just raised my hackles again.
>
> I would seriously suggest having a much simpler version for the "no
> allocation, no dma mapping" case, so that it's *obvious* that that
> never happens.
>
> So instead of having virtio_balloon_send_free_pages() call a really
> generic complex chain of functions that in _some_ cases can do memory
> allocation, why isn't there a short-circuited "vitruque_add_datum()"
> that is guaranteed to never do anything like that?
>
> Honestly, I look at "add_one_sg()" and it really doesn't make me
> happy. It looks hacky as hell. If I read the code right, you're really
> trying to just queue up a simple tuple of <pfn,len>, except you encode
> it as a page pointer in order to play games with the SG logic, and
> then you hmap that to the ring, except in this case it's all a fake
> ring that just adds the cpu-physical address instead.
>
> And to figuer that out, it's like five layers of indirection through
> different helper functions that *can* do more generic things but in
> this case don't.
>
> And you do all of this from a core VM callback function with some
> _really_ core VM locks held.
>
> That makes no sense to me.
>
> How about this:
>
>   - get rid of all that code
>
>   - make the core VM callback save the "these are the free memory
> regions" in a fixed and limited array. One that DOES JUST THAT. No
> crazy "SG IO dma-mapping function crap". Just a plain array of a fixed
> size, pre-allocated for that virtio instance.
>
>   - make it obvious that what you do in that sequence is ten
> instructions and no allocations ("Look ma, I wrote a value to an array
> and incremented the array idex, and I'M DONE")
>
>   - then in that workqueue entry that you start *anyway*, you empty the
> array and do all the crazy virtio stuff.
>
> In fact, while at it, just simplify the VM interface too. Instead of
> traversing a random number of buddy lists, just trraverse *one* - the
> top-level one. Are you seriously ever going to shrink or mark
> read-only anythin *but* something big enough to be in the maximum
> order?
>
> MAX_ORDER is what, 11? So we're talking 8MB blocks. Do you *really*
> want the balloon code to work on smaller things, particularly since
> the whole interface is fundamentally racy and opportunistic to begin
> with?

OK, I will implement a new version based on the suggestions. Thanks.

Best,
Wei

^ permalink raw reply

* Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Michael S. Tsirkin @ 2018-06-12 11:34 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: alexander.h.duyck, virtio-dev, aaron.f.brown, jiri, kubakici,
	netdev, qemu-devel, loseweigh, virtualization
In-Reply-To: <23fc4aa4-ec41-d6e2-3354-10cbfc13b7ec@intel.com>

On Mon, Jun 11, 2018 at 10:02:45PM -0700, Samudrala, Sridhar wrote:
> On 6/11/2018 7:17 PM, Michael S. Tsirkin wrote:
> > On Tue, Jun 12, 2018 at 09:54:44AM +0800, Jason Wang wrote:
> > > 
> > > On 2018年06月12日 01:26, Michael S. Tsirkin wrote:
> > > > On Mon, May 07, 2018 at 04:09:54PM -0700, Sridhar Samudrala wrote:
> > > > > This feature bit can be used by hypervisor to indicate virtio_net device to
> > > > > act as a standby for another device with the same MAC address.
> > > > > 
> > > > > I tested this with a small change to the patch to mark the STANDBY feature 'true'
> > > > > by default as i am using libvirt to start the VMs.
> > > > > Is there a way to pass the newly added feature bit 'standby' to qemu via libvirt
> > > > > XML file?
> > > > > 
> > > > > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> > > > So I do not think we can commit to this interface: we
> > > > really need to control visibility of the primary device.
> > > The problem is legacy guest won't use primary device at all if we do this.
> > And that's by design - I think it's the only way to ensure the
> > legacy guest isn't confused.
> 
> Yes. I think so. But i am not sure if Qemu is the right place to control the visibility
> of the primary device. The primary device may not be specified as an argument to Qemu. It
> may be plugged in later.
> The cloud service provider is providing a feature that enables low latency datapath and live
> migration capability.
> A tenant can use this feature only if he is running a VM that has virtio-net with failover support.

Well live migration is there already. The new feature is low latency
data path.

And it's the guest that needs failover support not the VM.


> I think Qemu should check if guest virtio-net supports this feature and provide a mechanism for
> an upper layer indicating if the STANDBY feature is successfully negotiated or not.
> The upper layer can then decide if it should hot plug a VF with the same MAC and manage the 2 links.
> If VF is successfully hot plugged, virtio-net link should be disabled.

Did you even talk to upper layer management about it?
Just list the steps they need to do and you will see
that's a lot of machinery to manage by the upper layer.

What do we gain in flexibility? As far as I can see the
only gain is some resources saved for legacy VMs.

That's not a lot as tenant of the upper layer probably already has
at least a hunch that it's a new guest otherwise
why bother specifying the feature at all - you
save even more resources without it.




> 
> > 
> > > How about control the visibility of standby device?
> > > 
> > > Thanks
> > standy the always there to guarantee no downtime.
> > 
> > > > However just for testing purposes, we could add a non-stable
> > > > interface "x-standby" with the understanding that as any
> > > > x- prefix it's unstable and will be changed down the road,
> > > > likely in the next release.
> > > > 
> > > > 
> > > > > ---
> > > > >    hw/net/virtio-net.c                         | 2 ++
> > > > >    include/standard-headers/linux/virtio_net.h | 3 +++
> > > > >    2 files changed, 5 insertions(+)
> > > > > 
> > > > > diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> > > > > index 90502fca7c..38b3140670 100644
> > > > > --- a/hw/net/virtio-net.c
> > > > > +++ b/hw/net/virtio-net.c
> > > > > @@ -2198,6 +2198,8 @@ static Property virtio_net_properties[] = {
> > > > >                         true),
> > > > >        DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed, SPEED_UNKNOWN),
> > > > >        DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
> > > > > +    DEFINE_PROP_BIT64("standby", VirtIONet, host_features, VIRTIO_NET_F_STANDBY,
> > > > > +                      false),
> > > > >        DEFINE_PROP_END_OF_LIST(),
> > > > >    };
> > > > > diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
> > > > > index e9f255ea3f..01ec09684c 100644
> > > > > --- a/include/standard-headers/linux/virtio_net.h
> > > > > +++ b/include/standard-headers/linux/virtio_net.h
> > > > > @@ -57,6 +57,9 @@
> > > > >    					 * Steering */
> > > > >    #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
> > > > > +#define VIRTIO_NET_F_STANDBY      62    /* Act as standby for another device
> > > > > +                                         * with the same MAC.
> > > > > +                                         */
> > > > >    #define VIRTIO_NET_F_SPEED_DUPLEX 63	/* Device set linkspeed and duplex */
> > > > >    #ifndef VIRTIO_NET_NO_LEGACY
> > > > > -- 
> > > > > 2.14.3
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [virtio-dev] Re: [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Michael S. Tsirkin @ 2018-06-12 11:47 UTC (permalink / raw)
  To: Siwei Liu
  Cc: Alexander Duyck, virtio-dev, qemu-devel, Samudrala, Sridhar,
	virtualization
In-Reply-To: <CADGSJ21AMp02vV+qxa+DuyvAUCCsXUFwV7wjzQ1Vy1jKV=n1HA@mail.gmail.com>

On Tue, Jun 05, 2018 at 03:09:26PM -0700, Siwei Liu wrote:
> The thing is cloud service provider might prefer sticking to the same
> level of service agreement (SLA) of keeping VF over migration,

That requirement is trivially satisfied by just a single VF :) If your
SLA does not require live migration, you should do just that.

-- 
MST

^ permalink raw reply

* Re: [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Michael S. Tsirkin @ 2018-06-12 11:54 UTC (permalink / raw)
  To: Jason Wang
  Cc: alexander.h.duyck, virtio-dev, Samudrala, Sridhar, qemu-devel,
	virtualization
In-Reply-To: <a45625c9-d2cd-f8f7-ce31-37c558e70b88@redhat.com>

On Wed, Jun 06, 2018 at 10:29:03AM +0800, Jason Wang wrote:
> 
> 
> On 2018年06月05日 20:33, Michael S. Tsirkin wrote:
> > I don't think this is sufficient.
> > 
> > If both primary and standby devices are present, a legacy guest without
> > support for the feature might see two devices with same mac and get
> > confused.
> > 
> > I think that we should only make primary visible after guest acked the
> > backup feature bit.
> 
> I think we want exactly the reverse? E.g fail the negotiation when guest
> does not ack backup feature.
> 
> Otherwise legacy guest won't even have the chance to see primary device in
> the guest.

That's by design.

> > 
> > And on reset or when backup is cleared in some other way, unplug the
> > primary.
> 
> What if guest does not support hotplug?

It shouldn't ack the backup feature then and it's a good point.
We should both document this and check kernel config has
hotplug support. Sridhar could you take a look pls?

> > 
> > Something like the below will do the job:
> > 
> > Primary device is added with a special "primary-failover" flag.
> > A virtual machine is then initialized with just a standby virtio
> > device. Primary is not yet added.
> 
> A question is how to do the matching? Qemu knows nothing about e.g mac
> address of a pass-through device I believe?

Supply a flag to the VFIO when it's added, this way QEMU will know.

> > 
> > Later QEMU detects that guest driver device set DRIVER_OK.
> > It then exposes the primary device to the guest, and triggers
> > a device addition event (hot-plug event) for it.
> 
> Do you mean we won't have primary device in the initial qemu cli?

No, that's not what I mean.

I mean management will supply a flag to VFIO and then


- VFIO defers exposing
primary to guest until guest acks the feature bit.
- When we see guest ack, initiate hotplug.
- On reboot, hide it again.
- On reset without reboot, request hot-unplug and on eject hide it again.


> > 
> > If QEMU detects guest driver removal, it initiates a hot-unplug sequence
> > to remove the primary driver.  In particular, if QEMU detects guest
> > re-initialization (e.g. by detecting guest reset) it immediately removes
> > the primary device.
> 
> I believe guest failover module should handle this gracefully?

It can't control enumeration order, if primary is enumerated before
standby then guest will load its driver and it's too late
when failover driver is loaded.

> Thanks
> 
> > 
> > We can move some of this code to management as well, architecturally it
> > does not make too much sense but it might be easier implementation-wise.
> > 
> > HTH
> > 
> > On Mon, Jun 04, 2018 at 06:41:48PM -0700, Samudrala, Sridhar wrote:
> > > Ping on this patch now that the kernel patches are accepted into davem's net-next tree.
> > > https://patchwork.ozlabs.org/cover/920005/
> > > 
> > > 
> > > On 5/7/2018 4:09 PM, Sridhar Samudrala wrote:
> > > > This feature bit can be used by hypervisor to indicate virtio_net device to
> > > > act as a standby for another device with the same MAC address.
> > > > 
> > > > I tested this with a small change to the patch to mark the STANDBY feature 'true'
> > > > by default as i am using libvirt to start the VMs.
> > > > Is there a way to pass the newly added feature bit 'standby' to qemu via libvirt
> > > > XML file?
> > > > 
> > > > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> > > > ---
> > > >    hw/net/virtio-net.c                         | 2 ++
> > > >    include/standard-headers/linux/virtio_net.h | 3 +++
> > > >    2 files changed, 5 insertions(+)
> > > > 
> > > > diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> > > > index 90502fca7c..38b3140670 100644
> > > > --- a/hw/net/virtio-net.c
> > > > +++ b/hw/net/virtio-net.c
> > > > @@ -2198,6 +2198,8 @@ static Property virtio_net_properties[] = {
> > > >                         true),
> > > >        DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed, SPEED_UNKNOWN),
> > > >        DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
> > > > +    DEFINE_PROP_BIT64("standby", VirtIONet, host_features, VIRTIO_NET_F_STANDBY,
> > > > +                      false),
> > > >        DEFINE_PROP_END_OF_LIST(),
> > > >    };
> > > > diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
> > > > index e9f255ea3f..01ec09684c 100644
> > > > --- a/include/standard-headers/linux/virtio_net.h
> > > > +++ b/include/standard-headers/linux/virtio_net.h
> > > > @@ -57,6 +57,9 @@
> > > >    					 * Steering */
> > > >    #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
> > > > +#define VIRTIO_NET_F_STANDBY      62    /* Act as standby for another device
> > > > +                                         * with the same MAC.
> > > > +                                         */
> > > >    #define VIRTIO_NET_F_SPEED_DUPLEX 63	/* Device set linkspeed and duplex */
> > > >    #ifndef VIRTIO_NET_NO_LEGACY
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH 1/2] Convert target drivers to use sbitmap
From: Bart Van Assche @ 2018-06-12 15:22 UTC (permalink / raw)
  To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-usb@vger.kernel.org, willy@infradead.org,
	virtualization@lists.linux-foundation.org,
	kent.overstreet@gmail.com, linux1394-devel@lists.sourceforge.net,
	jgross@suse.com, axboe@kernel.dk, linux-scsi@vger.kernel.org,
	qla2xxx-upstream@qlogic.com, target-devel@vger.kernel.org,
	netdev@vger.kernel.org
  Cc: mawilcox@microsoft.com
In-Reply-To: <20180515160043.27044-2-willy@infradead.org>

On Tue, 2018-05-15 at 09:00 -0700, Matthew Wilcox wrote:
> diff --git a/drivers/scsi/qla2xxx/qla_target.c b/drivers/scsi/qla2xxx/qla_target.c
> index 025dc2d3f3de..cdf671c2af61 100644
> --- a/drivers/scsi/qla2xxx/qla_target.c
> +++ b/drivers/scsi/qla2xxx/qla_target.c
> @@ -3719,7 +3719,8 @@ void qlt_free_cmd(struct qla_tgt_cmd *cmd)
>  		return;
>  	}
>  	cmd->jiffies_at_free = get_jiffies_64();
> -	percpu_ida_free(&sess->se_sess->sess_tag_pool, cmd->se_cmd.map_tag);
> +	sbitmap_queue_clear(&sess->se_sess->sess_tag_pool, cmd->se_cmd.map_tag,
> +			cmd->se_cmd.map_cpu);
>  }
>  EXPORT_SYMBOL(qlt_free_cmd);

Please introduce functions in the target core for allocating and freeing a tag
instead of spreading the knowledge of how to allocate and free tags over all
target drivers.
 
> +int iscsit_wait_for_tag(struct se_session *se_sess, int state, int *cpup)
> +{
> +	int tag = -1;
> +	DEFINE_WAIT(wait);
> +	struct sbq_wait_state *ws;
> +
> +	if (state == TASK_RUNNING)
> +		return tag;
> +
> +	ws = &se_sess->sess_tag_pool.ws[0];
> +	for (;;) {
> +		prepare_to_wait_exclusive(&ws->wait, &wait, state);
> +		if (signal_pending_state(state, current))
> +			break;

This looks weird to me. Shouldn't target code ignore signals instead of causing
tag allocation to fail if a signal is received?

> +		schedule();
> +		tag = sbitmap_queue_get(&se_sess->sess_tag_pool, cpup);
> +	}
> +
> +	finish_wait(&ws->wait, &wait);
> +	return tag;
> +}

Thanks,

Bart.

^ permalink raw reply

* Re: [PATCH v6] virtio_blk: add DISCARD and WRIET ZEROES commands support
From: Stefan Hajnoczi @ 2018-06-12 16:05 UTC (permalink / raw)
  To: Liu, Changpeng
  Cc: pbonzini@redhat.com, cavery@redhat.com,
	virtualization@lists.linux-foundation.org
In-Reply-To: <FF7FC980937D6342B9D289F5F3C7C2625B6DF62C@SHSMSX103.ccr.corp.intel.com>


[-- Attachment #1.1: Type: text/plain, Size: 6316 bytes --]

On Mon, Jun 11, 2018 at 03:37:00AM +0000, Liu, Changpeng wrote:
> 
> 
> > -----Original Message-----
> > From: Stefan Hajnoczi [mailto:stefanha@redhat.com]
> > Sent: Friday, June 8, 2018 6:20 PM
> > To: Liu, Changpeng <changpeng.liu@intel.com>
> > Cc: virtualization@lists.linux-foundation.org; cavery@redhat.com;
> > jasowang@redhat.com; pbonzini@redhat.com; Wang, Wei W
> > <wei.w.wang@intel.com>
> > Subject: Re: [PATCH v6] virtio_blk: add DISCARD and WRIET ZEROES commands
> > support
> > 
> > On Thu, Jun 07, 2018 at 11:07:06PM +0000, Liu, Changpeng wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: Stefan Hajnoczi [mailto:stefanha@redhat.com]
> > > > Sent: Thursday, June 7, 2018 9:10 PM
> > > > To: Liu, Changpeng <changpeng.liu@intel.com>
> > > > Cc: virtualization@lists.linux-foundation.org; cavery@redhat.com;
> > > > jasowang@redhat.com; pbonzini@redhat.com; Wang, Wei W
> > > > <wei.w.wang@intel.com>
> > > > Subject: Re: [PATCH v6] virtio_blk: add DISCARD and WRIET ZEROES commands
> > > > support
> > > >
> > > > On Wed, Jun 06, 2018 at 12:19:00PM +0800, Changpeng Liu wrote:
> > > > > Existing virtio-blk protocol doesn't have DISCARD/WRITE ZEROES commands
> > > > > support, this will impact the performance when using SSD backend over
> > > > > file systems.
> > > > >
> > > > > Commit 88c85538 "virtio-blk: add discard and write zeroes features to
> > > > > specification"(see https://github.com/oasis-tcs/virtio-spec) extended
> > > > > existing virtio-blk protocol, adding extra DISCARD and WRITE ZEROES
> > > > > commands support.
> > > > >
> > > > > While here, using 16 bytes descriptor to describe one segment of DISCARD
> > > > > or WRITE ZEROES commands, each command may contain one or more
> > > > decriptors.
> > > > >
> > > > > The following data structure shows the definition of one descriptor:
> > > > >
> > > > > struct virtio_blk_discard_write_zeroes {
> > > > >         le64 sector;
> > > > >         le32 num_sectors;
> > > > >         le32 unmap;
> > > > > };
> > > > >
> > > > > Field 'sector' means the start sector for DISCARD and WRITE ZEROES,
> > > > > filed 'num_sectors' means the number of sectors for DISCARD and WRITE
> > > > > ZEROES, if both DISCARD and WRITE ZEROES are supported, field 'unmap'
> > > > > maybe used for WRITE ZEROES command with DISCARD enabled.
> > > > >
> > > > > We also extended the virtio-blk configuration space to let backend
> > > > > device put DISCARD and WRITE ZEROES configuration parameters.
> > > > >
> > > > > struct virtio_blk_config {
> > > > >         [...]
> > > > >
> > > > >         le32 max_discard_sectors;
> > > > >         le32 max_discard_seg;
> > > > >         le32 discard_sector_alignment;
> > > > >         le32 max_write_zeroes_sectors;
> > > > >         le32 max_write_zeroes_seg;
> > > > >         u8 write_zeroes_may_unmap;
> > > > > }
> > > > >
> > > > > New feature bit [VIRTIO_BLK_F_DISCARD (13)]: Device can support discard
> > > > > command, maximum discard sectors size in field 'max_discard_sectors' and
> > > > > maximum discard segment number in field 'max_discard_seg'.
> > > > >
> > > > > New feature [VIRTIO_BLK_F_WRITE_ZEROES (14)]: Device can support write
> > > > > zeroes command, maximum write zeroes sectors size in field
> > > > > 'max_write_zeroes_sectors' and maximum write zeroes segment number in
> > > > > field 'max_write_zeroes_seg'.
> > > > >
> > > > > The parameters in the configuration space of the device field
> > > > > 'max_discard_sectors' and field 'discard_sector_alignment' are expressed in
> > > > > 512-byte units if the VIRTIO_BLK_F_DISCARD feature bit is negotiated. The
> > > > > field 'max_write_zeroes_sectors' is expressed in 512-byte units if the
> > > > > VIRTIO_BLK_F_WRITE_ZEROES feature bit is negotiated.
> > > > >
> > > > > Signed-off-by: Changpeng Liu <changpeng.liu@intel.com>
> > > > > ---
> > > > > CHANGELOG:
> > > > > v6: don't set T_OUT bit to discard and write zeroes commands.
> > > >
> > > > I don't see this in the patch...
> > > Yeah, do noting with DISCARD/WRITE ZEROES means no need to OR BLK_T_OUT
> > again.
> > > >
> > > > > @@ -225,6 +260,7 @@ static blk_status_t virtio_queue_rq(struct
> > > > blk_mq_hw_ctx *hctx,
> > > > >  	int qid = hctx->queue_num;
> > > > >  	int err;
> > > > >  	bool notify = false;
> > > > > +	bool unmap = false;
> > > > >  	u32 type;
> > > > >
> > > > >  	BUG_ON(req->nr_phys_segments + 2 > vblk->sg_elems);
> > > > > @@ -237,6 +273,13 @@ static blk_status_t virtio_queue_rq(struct
> > > > blk_mq_hw_ctx *hctx,
> > > > >  	case REQ_OP_FLUSH:
> > > > >  		type = VIRTIO_BLK_T_FLUSH;
> > > > >  		break;
> > > > > +	case REQ_OP_DISCARD:
> > > > > +		type = VIRTIO_BLK_T_DISCARD;
> > > > > +		break;
> > > > > +	case REQ_OP_WRITE_ZEROES:
> > > > > +		type = VIRTIO_BLK_T_WRITE_ZEROES;
> > > > > +		unmap = !(req->cmd_flags & REQ_NOUNMAP);
> > > > > +		break;
> > > > >  	case REQ_OP_SCSI_IN:
> > > > >  	case REQ_OP_SCSI_OUT:
> > > > >  		type = VIRTIO_BLK_T_SCSI_CMD;
> > > > > @@ -256,6 +299,12 @@ static blk_status_t virtio_queue_rq(struct
> > > > blk_mq_hw_ctx *hctx,
> > > > >
> > > > >  	blk_mq_start_request(req);
> > > > >
> > > > > +	if (type == VIRTIO_BLK_T_DISCARD || type ==
> > > > VIRTIO_BLK_T_WRITE_ZEROES) {
> > > > > +		err = virtblk_setup_discard_write_zeroes(req, unmap);
> > > > > +		if (err)
> > > > > +			return BLK_STS_RESOURCE;
> > > > > +	}
> > > > > +
> > > > >  	num = blk_rq_map_sg(hctx->queue, req, vbr->sg);
> > > > >  	if (num) {
> > > > >  		if (rq_data_dir(req) == WRITE)
> > > >
> > > > ...since we still do blk_rq_map_sg() here and num should be != 0.
> > > No, while here, we should keep the original logic for READ/WRITE commands.
> > 
> > My question is: why does the changelog say "don't set T_OUT" but the
> > code *will* set it because blk_rq_map_sg() returns != 0 and
> > rq_data_dir(req) == WRITE?
> Since the last bit of DISCARD/WRITE ZEROES commands are already 1, so I said we don't need to set
> T_OUT bit to DISCARD/WRITE ZEROES commands again. But the original logic for WRITE, T_OUT is still
> needed, so just keep the original code here is fine.

Okay, I understand what you meant now.  Thanks!

Stefan

[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

[-- Attachment #2: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox