* Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Samudrala, Sridhar @ 2018-06-12 5:02 UTC (permalink / raw)
To: Michael S. Tsirkin, Jason Wang
Cc: alexander.h.duyck, virtio-dev, aaron.f.brown, jiri, kubakici,
netdev, qemu-devel, loseweigh, virtualization
In-Reply-To: <20180612051432-mutt-send-email-mst@kernel.org>
On 6/11/2018 7:17 PM, Michael S. Tsirkin wrote:
> On Tue, Jun 12, 2018 at 09:54:44AM +0800, Jason Wang wrote:
>>
>> On 2018年06月12日 01:26, Michael S. Tsirkin wrote:
>>> On Mon, May 07, 2018 at 04:09:54PM -0700, Sridhar Samudrala wrote:
>>>> This feature bit can be used by hypervisor to indicate virtio_net device to
>>>> act as a standby for another device with the same MAC address.
>>>>
>>>> I tested this with a small change to the patch to mark the STANDBY feature 'true'
>>>> by default as i am using libvirt to start the VMs.
>>>> Is there a way to pass the newly added feature bit 'standby' to qemu via libvirt
>>>> XML file?
>>>>
>>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>> So I do not think we can commit to this interface: we
>>> really need to control visibility of the primary device.
>> The problem is legacy guest won't use primary device at all if we do this.
> And that's by design - I think it's the only way to ensure the
> legacy guest isn't confused.
Yes. I think so. But i am not sure if Qemu is the right place to control the visibility
of the primary device. The primary device may not be specified as an argument to Qemu. It
may be plugged in later.
The cloud service provider is providing a feature that enables low latency datapath and live
migration capability.
A tenant can use this feature only if he is running a VM that has virtio-net with failover support.
I think Qemu should check if guest virtio-net supports this feature and provide a mechanism for
an upper layer indicating if the STANDBY feature is successfully negotiated or not.
The upper layer can then decide if it should hot plug a VF with the same MAC and manage the 2 links.
If VF is successfully hot plugged, virtio-net link should be disabled.
>
>> How about control the visibility of standby device?
>>
>> Thanks
> standy the always there to guarantee no downtime.
>
>>> However just for testing purposes, we could add a non-stable
>>> interface "x-standby" with the understanding that as any
>>> x- prefix it's unstable and will be changed down the road,
>>> likely in the next release.
>>>
>>>
>>>> ---
>>>> hw/net/virtio-net.c | 2 ++
>>>> include/standard-headers/linux/virtio_net.h | 3 +++
>>>> 2 files changed, 5 insertions(+)
>>>>
>>>> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
>>>> index 90502fca7c..38b3140670 100644
>>>> --- a/hw/net/virtio-net.c
>>>> +++ b/hw/net/virtio-net.c
>>>> @@ -2198,6 +2198,8 @@ static Property virtio_net_properties[] = {
>>>> true),
>>>> DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed, SPEED_UNKNOWN),
>>>> DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
>>>> + DEFINE_PROP_BIT64("standby", VirtIONet, host_features, VIRTIO_NET_F_STANDBY,
>>>> + false),
>>>> DEFINE_PROP_END_OF_LIST(),
>>>> };
>>>> diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
>>>> index e9f255ea3f..01ec09684c 100644
>>>> --- a/include/standard-headers/linux/virtio_net.h
>>>> +++ b/include/standard-headers/linux/virtio_net.h
>>>> @@ -57,6 +57,9 @@
>>>> * Steering */
>>>> #define VIRTIO_NET_F_CTRL_MAC_ADDR 23 /* Set MAC address */
>>>> +#define VIRTIO_NET_F_STANDBY 62 /* Act as standby for another device
>>>> + * with the same MAC.
>>>> + */
>>>> #define VIRTIO_NET_F_SPEED_DUPLEX 63 /* Device set linkspeed and duplex */
>>>> #ifndef VIRTIO_NET_NO_LEGACY
>>>> --
>>>> 2.14.3
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH 1/2] Convert target drivers to use sbitmap
From: Matthew Wilcox @ 2018-06-12 3:06 UTC (permalink / raw)
To: Jens Axboe
Cc: Juergen Gross, kvm, linux-scsi, Matthew Wilcox, netdev, linux-usb,
linux-kernel, virtualization, target-devel, qla2xxx-upstream,
linux1394-devel, Kent Overstreet
In-Reply-To: <02326395-3241-c94b-ad70-3de27a6f5a8c@kernel.dk>
On Mon, Jun 11, 2018 at 07:18:55PM -0600, Jens Axboe wrote:
> Are you going to push this further? I really think we should.
Yes, I'll resubmit it tomorrow. Sorry for dropping the ball on this one.
^ permalink raw reply
* Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Michael S. Tsirkin @ 2018-06-12 2:17 UTC (permalink / raw)
To: Jason Wang
Cc: alexander.h.duyck, virtio-dev, aaron.f.brown, jiri, kubakici,
Sridhar Samudrala, qemu-devel, loseweigh, netdev, virtualization
In-Reply-To: <dc8b1669-b8b6-fb59-94e2-6e4de6b96572@redhat.com>
On Tue, Jun 12, 2018 at 09:54:44AM +0800, Jason Wang wrote:
>
>
> On 2018年06月12日 01:26, Michael S. Tsirkin wrote:
> > On Mon, May 07, 2018 at 04:09:54PM -0700, Sridhar Samudrala wrote:
> > > This feature bit can be used by hypervisor to indicate virtio_net device to
> > > act as a standby for another device with the same MAC address.
> > >
> > > I tested this with a small change to the patch to mark the STANDBY feature 'true'
> > > by default as i am using libvirt to start the VMs.
> > > Is there a way to pass the newly added feature bit 'standby' to qemu via libvirt
> > > XML file?
> > >
> > > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> > So I do not think we can commit to this interface: we
> > really need to control visibility of the primary device.
>
> The problem is legacy guest won't use primary device at all if we do this.
And that's by design - I think it's the only way to ensure the
legacy guest isn't confused.
> How about control the visibility of standby device?
>
> Thanks
standy the always there to guarantee no downtime.
> > However just for testing purposes, we could add a non-stable
> > interface "x-standby" with the understanding that as any
> > x- prefix it's unstable and will be changed down the road,
> > likely in the next release.
> >
> >
> > > ---
> > > hw/net/virtio-net.c | 2 ++
> > > include/standard-headers/linux/virtio_net.h | 3 +++
> > > 2 files changed, 5 insertions(+)
> > >
> > > diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> > > index 90502fca7c..38b3140670 100644
> > > --- a/hw/net/virtio-net.c
> > > +++ b/hw/net/virtio-net.c
> > > @@ -2198,6 +2198,8 @@ static Property virtio_net_properties[] = {
> > > true),
> > > DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed, SPEED_UNKNOWN),
> > > DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
> > > + DEFINE_PROP_BIT64("standby", VirtIONet, host_features, VIRTIO_NET_F_STANDBY,
> > > + false),
> > > DEFINE_PROP_END_OF_LIST(),
> > > };
> > > diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
> > > index e9f255ea3f..01ec09684c 100644
> > > --- a/include/standard-headers/linux/virtio_net.h
> > > +++ b/include/standard-headers/linux/virtio_net.h
> > > @@ -57,6 +57,9 @@
> > > * Steering */
> > > #define VIRTIO_NET_F_CTRL_MAC_ADDR 23 /* Set MAC address */
> > > +#define VIRTIO_NET_F_STANDBY 62 /* Act as standby for another device
> > > + * with the same MAC.
> > > + */
> > > #define VIRTIO_NET_F_SPEED_DUPLEX 63 /* Device set linkspeed and duplex */
> > > #ifndef VIRTIO_NET_NO_LEGACY
> > > --
> > > 2.14.3
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PULL] vhost: cleanups and fixes
From: Linus Torvalds @ 2018-06-12 1:59 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: KVM list, Network Development, Linux Kernel Mailing List,
Bjorn Andersson, Andrew Morton, virtualization
In-Reply-To: <20180612042600-mutt-send-email-mst@kernel.org>
On Mon, Jun 11, 2018 at 6:36 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> Maybe it will help to have GFP_NONE which will make any allocation
> fail if attempted. Linus, would this address your comment?
It would definitely have helped me initially overlook that call chain.
But then when I started looking at the whole dma_map_page() thing, it
just raised my hackles again.
I would seriously suggest having a much simpler version for the "no
allocation, no dma mapping" case, so that it's *obvious* that that
never happens.
So instead of having virtio_balloon_send_free_pages() call a really
generic complex chain of functions that in _some_ cases can do memory
allocation, why isn't there a short-circuited "vitruque_add_datum()"
that is guaranteed to never do anything like that?
Honestly, I look at "add_one_sg()" and it really doesn't make me
happy. It looks hacky as hell. If I read the code right, you're really
trying to just queue up a simple tuple of <pfn,len>, except you encode
it as a page pointer in order to play games with the SG logic, and
then you hmap that to the ring, except in this case it's all a fake
ring that just adds the cpu-physical address instead.
And to figuer that out, it's like five layers of indirection through
different helper functions that *can* do more generic things but in
this case don't.
And you do all of this from a core VM callback function with some
_really_ core VM locks held.
That makes no sense to me.
How about this:
- get rid of all that code
- make the core VM callback save the "these are the free memory
regions" in a fixed and limited array. One that DOES JUST THAT. No
crazy "SG IO dma-mapping function crap". Just a plain array of a fixed
size, pre-allocated for that virtio instance.
- make it obvious that what you do in that sequence is ten
instructions and no allocations ("Look ma, I wrote a value to an array
and incremented the array idex, and I'M DONE")
- then in that workqueue entry that you start *anyway*, you empty the
array and do all the crazy virtio stuff.
In fact, while at it, just simplify the VM interface too. Instead of
traversing a random number of buddy lists, just trraverse *one* - the
top-level one. Are you seriously ever going to shrink or mark
read-only anythin *but* something big enough to be in the maximum
order?
MAX_ORDER is what, 11? So we're talking 8MB blocks. Do you *really*
want the balloon code to work on smaller things, particularly since
the whole interface is fundamentally racy and opportunistic to begin
with?
The whole sequence of events really looks "this is too much
complexity, and way too fragile" to me at so many levels.
Linus
^ permalink raw reply
* Re: [PULL] vhost: cleanups and fixes
From: Michael S. Tsirkin @ 2018-06-12 1:57 UTC (permalink / raw)
To: Linus Torvalds
Cc: nilal, KVM list, Network Development, Linux Kernel Mailing List,
Bjorn Andersson, Andrew Morton, virtualization
In-Reply-To: <CA+55aFzrPgnd7hRPrkeV+jX-MSwOZf7T4wKxz66Lk4oub3PZsw@mail.gmail.com>
On Mon, Jun 11, 2018 at 11:32:41AM -0700, Linus Torvalds wrote:
> On Mon, Jun 11, 2018 at 9:24 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
>
> Is this really a good idea?
Well knowing which pages are unused does seem to be useful. Do you
refer to this generally or to the idea of scanning the free list,
or to the specific implementation issues you listed below?
> Plus it seems entirely broken.
>
> The report_pfn_range() callback is done under the zone lock, but
> virtio_balloon_send_free_pages() (which is the only callback used that
> I can find) does add_one_sg(), which does virtqueue_add_inbuf(vq, &sg,
> 1, vq, GFP_KERNEL);
>
> So now we apparently do a GFP_KERNEL allocation insider the mm zone
> lock, which is broken on just _so_ many levels.
>
> Pulled and then unpulled again.
>
> Either somebody needs to explain why I'm wrong and you can re-submit
> this, or this kind of garbage needs to go away.
>
> I do *not* want to be in the situation where I pull stuff from the
> virtio people that adds completely broken core VM functionality.
>
> Because if I'm in that situation, I will stop pulling from you guys.
> Seriously. You have *no* place sending me broken shit that is outside
> the virtio layer.
>
> Subsystems that break code MM will get shunned. You just aren't
> important enough to allow you breaking code VM.
>
> Linus
So no, it doesn't do any allocations - GFP_KERNEL or otherwise, but I
agree how it might be confusing.
So I'll drop this for now, but it would be nice if there is a bit more
direction on this feature since I know several people are trying to work
on this.
--
MST
^ permalink raw reply
* Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Jason Wang @ 2018-06-12 1:54 UTC (permalink / raw)
To: Michael S. Tsirkin, Sridhar Samudrala
Cc: alexander.h.duyck, virtio-dev, aaron.f.brown, jiri, kubakici,
netdev, qemu-devel, loseweigh, virtualization
In-Reply-To: <20180611202207-mutt-send-email-mst@kernel.org>
On 2018年06月12日 01:26, Michael S. Tsirkin wrote:
> On Mon, May 07, 2018 at 04:09:54PM -0700, Sridhar Samudrala wrote:
>> This feature bit can be used by hypervisor to indicate virtio_net device to
>> act as a standby for another device with the same MAC address.
>>
>> I tested this with a small change to the patch to mark the STANDBY feature 'true'
>> by default as i am using libvirt to start the VMs.
>> Is there a way to pass the newly added feature bit 'standby' to qemu via libvirt
>> XML file?
>>
>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> So I do not think we can commit to this interface: we
> really need to control visibility of the primary device.
The problem is legacy guest won't use primary device at all if we do this.
How about control the visibility of standby device?
Thanks
> However just for testing purposes, we could add a non-stable
> interface "x-standby" with the understanding that as any
> x- prefix it's unstable and will be changed down the road,
> likely in the next release.
>
>
>> ---
>> hw/net/virtio-net.c | 2 ++
>> include/standard-headers/linux/virtio_net.h | 3 +++
>> 2 files changed, 5 insertions(+)
>>
>> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
>> index 90502fca7c..38b3140670 100644
>> --- a/hw/net/virtio-net.c
>> +++ b/hw/net/virtio-net.c
>> @@ -2198,6 +2198,8 @@ static Property virtio_net_properties[] = {
>> true),
>> DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed, SPEED_UNKNOWN),
>> DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
>> + DEFINE_PROP_BIT64("standby", VirtIONet, host_features, VIRTIO_NET_F_STANDBY,
>> + false),
>> DEFINE_PROP_END_OF_LIST(),
>> };
>>
>> diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
>> index e9f255ea3f..01ec09684c 100644
>> --- a/include/standard-headers/linux/virtio_net.h
>> +++ b/include/standard-headers/linux/virtio_net.h
>> @@ -57,6 +57,9 @@
>> * Steering */
>> #define VIRTIO_NET_F_CTRL_MAC_ADDR 23 /* Set MAC address */
>>
>> +#define VIRTIO_NET_F_STANDBY 62 /* Act as standby for another device
>> + * with the same MAC.
>> + */
>> #define VIRTIO_NET_F_SPEED_DUPLEX 63 /* Device set linkspeed and duplex */
>>
>> #ifndef VIRTIO_NET_NO_LEGACY
>> --
>> 2.14.3
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PULL] vhost: cleanups and fixes
From: Michael S. Tsirkin @ 2018-06-12 1:36 UTC (permalink / raw)
To: Linus Torvalds
Cc: KVM list, Network Development, Linux Kernel Mailing List,
Bjorn Andersson, Andrew Morton, virtualization
In-Reply-To: <CA+55aFz4uBHaWnBarVUEG90s2ucyVtoLmwYpYVwDV+XQESNRqw@mail.gmail.com>
On Mon, Jun 11, 2018 at 11:44:05AM -0700, Linus Torvalds wrote:
> On Mon, Jun 11, 2018 at 11:32 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > So now we apparently do a GFP_KERNEL allocation insider the mm zone
> > lock, which is broken on just _so_ many levels.
>
> Oh, I see the comment about how it doesn't actually do an allocation
> at all because it's a single-entry.
>
> Still too damn ugly to live, and much too fragile. No way in hell do
> we even _hint_ at a GFP_KERNEL when we're inside a context that can't
> do any memory allocation at all.
>
> Plus I'm not convinced it's a "no allocation" path even despite that
> comment, because it also does a "dma_map_page()" etc, which can cause
> allocations to do the dma mapping thing afaik. No?
Well no because DMA is triggered by the IOMMU flag and
that is always off for the balloon. But I hear what you
are saying about it being fragile.
> Maybe there's some reason why that doesn't happen either, but
> basically this whole callchain looks *way* to complicated to be used
> under a core VM spinlock.
>
> Linus
Maybe it will help to have GFP_NONE which will make any allocation
fail if attempted. Linus, would this address your comment?
--
MST
^ permalink raw reply
* Re: [PATCH 1/2] Convert target drivers to use sbitmap
From: Jens Axboe @ 2018-06-12 1:18 UTC (permalink / raw)
To: Matthew Wilcox, linux-kernel, linux-scsi, target-devel,
linux1394-devel, linux-usb, kvm, virtualization, netdev,
Juergen Gross, qla2xxx-upstream, Kent Overstreet
Cc: Matthew Wilcox
In-Reply-To: <3a56027b-47bc-dcb8-a465-3670031572f1@kernel.dk>
On 5/15/18 10:11 AM, Jens Axboe wrote:
> On 5/15/18 10:00 AM, Matthew Wilcox wrote:
>> From: Matthew Wilcox <mawilcox@microsoft.com>
>>
>> The sbitmap and the percpu_ida perform essentially the same task,
>> allocating tags for commands. Since the sbitmap is more used than
>> the percpu_ida, convert the percpu_ida users to the sbitmap API.
>
> It should also be the same performance as percpu_ida in light use, and
> performs much better at > 50% utilization of the tag space. I think
> that's better justification than "more used than".
>
>> diff --git a/drivers/target/iscsi/iscsi_target_util.c b/drivers/target/iscsi/iscsi_target_util.c
>> index 4435bf374d2d..28bcffae609f 100644
>> --- a/drivers/target/iscsi/iscsi_target_util.c
>> +++ b/drivers/target/iscsi/iscsi_target_util.c
>> @@ -17,7 +17,7 @@
>> ******************************************************************************/
>>
>> #include <linux/list.h>
>> -#include <linux/percpu_ida.h>
>> +#include <linux/sched/signal.h>
>> #include <net/ipv6.h> /* ipv6_addr_equal() */
>> #include <scsi/scsi_tcq.h>
>> #include <scsi/iscsi_proto.h>
>> @@ -147,6 +147,28 @@ void iscsit_free_r2ts_from_list(struct iscsi_cmd *cmd)
>> spin_unlock_bh(&cmd->r2t_lock);
>> }
>>
>> +int iscsit_wait_for_tag(struct se_session *se_sess, int state, int *cpup)
>> +{
>> + int tag = -1;
>> + DEFINE_WAIT(wait);
>> + struct sbq_wait_state *ws;
>> +
>> + if (state == TASK_RUNNING)
>> + return tag;
>> +
>> + ws = &se_sess->sess_tag_pool.ws[0];
>> + for (;;) {
>> + prepare_to_wait_exclusive(&ws->wait, &wait, state);
>> + if (signal_pending_state(state, current))
>> + break;
>> + schedule();
>> + tag = sbitmap_queue_get(&se_sess->sess_tag_pool, cpup);
>> + }
>> +
>> + finish_wait(&ws->wait, &wait);
>> + return tag;
>> +}
>
> Seems like that should be:
>
>
> ws = &se_sess->sess_tag_pool.ws[0];
> for (;;) {
> prepare_to_wait_exclusive(&ws->wait, &wait, state);
> if (signal_pending_state(state, current))
> break;
> tag = sbitmap_queue_get(&se_sess->sess_tag_pool, cpup);
> if (tag != -1)
> break;
> schedule();
> }
>
> finish_wait(&ws->wait, &wait);
> return tag;
>
>> /*
>> * May be called from software interrupt (timer) context for allocating
>> * iSCSI NopINs.
>> @@ -155,9 +177,11 @@ struct iscsi_cmd *iscsit_allocate_cmd(struct iscsi_conn *conn, int state)
>> {
>> struct iscsi_cmd *cmd;
>> struct se_session *se_sess = conn->sess->se_sess;
>> - int size, tag;
>> + int size, tag, cpu;
>>
>> - tag = percpu_ida_alloc(&se_sess->sess_tag_pool, state);
>> + tag = sbitmap_queue_get(&se_sess->sess_tag_pool, &cpu);
>> + if (tag < 0)
>> + tag = iscsit_wait_for_tag(se_sess, state, &cpu);
>> if (tag < 0)
>> return NULL;
>
> Might make sense to just roll the whole thing into iscsi_get_tag(), that
> would be cleaner.
>
> sbitmap should provide a helper for that, but we can do that cleanup
> later. That would encapsulate things like the per-cpu caching hint too,
> for instance.
>
> Rest looks fine to me.
Are you going to push this further? I really think we should.
--
Jens Axboe
^ permalink raw reply
* Re: [PULL] vhost: cleanups and fixes
From: Linus Torvalds @ 2018-06-11 18:44 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: KVM list, Network Development, Linux Kernel Mailing List,
Bjorn Andersson, Andrew Morton, virtualization
In-Reply-To: <CA+55aFzrPgnd7hRPrkeV+jX-MSwOZf7T4wKxz66Lk4oub3PZsw@mail.gmail.com>
On Mon, Jun 11, 2018 at 11:32 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So now we apparently do a GFP_KERNEL allocation insider the mm zone
> lock, which is broken on just _so_ many levels.
Oh, I see the comment about how it doesn't actually do an allocation
at all because it's a single-entry.
Still too damn ugly to live, and much too fragile. No way in hell do
we even _hint_ at a GFP_KERNEL when we're inside a context that can't
do any memory allocation at all.
Plus I'm not convinced it's a "no allocation" path even despite that
comment, because it also does a "dma_map_page()" etc, which can cause
allocations to do the dma mapping thing afaik. No?
Maybe there's some reason why that doesn't happen either, but
basically this whole callchain looks *way* to complicated to be used
under a core VM spinlock.
Linus
^ permalink raw reply
* Re: [PULL] vhost: cleanups and fixes
From: Linus Torvalds @ 2018-06-11 18:32 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: KVM list, Network Development, Linux Kernel Mailing List,
Bjorn Andersson, Andrew Morton, virtualization
In-Reply-To: <20180611192353-mutt-send-email-mst@kernel.org>
On Mon, Jun 11, 2018 at 9:24 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
Is this really a good idea?
Plus it seems entirely broken.
The report_pfn_range() callback is done under the zone lock, but
virtio_balloon_send_free_pages() (which is the only callback used that
I can find) does add_one_sg(), which does virtqueue_add_inbuf(vq, &sg,
1, vq, GFP_KERNEL);
So now we apparently do a GFP_KERNEL allocation insider the mm zone
lock, which is broken on just _so_ many levels.
Pulled and then unpulled again.
Either somebody needs to explain why I'm wrong and you can re-submit
this, or this kind of garbage needs to go away.
I do *not* want to be in the situation where I pull stuff from the
virtio people that adds completely broken core VM functionality.
Because if I'm in that situation, I will stop pulling from you guys.
Seriously. You have *no* place sending me broken shit that is outside
the virtio layer.
Subsystems that break code MM will get shunned. You just aren't
important enough to allow you breaking code VM.
Linus
^ permalink raw reply
* Re: [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Michael S. Tsirkin @ 2018-06-11 17:26 UTC (permalink / raw)
To: Sridhar Samudrala
Cc: alexander.h.duyck, virtio-dev, qemu-devel, jiri, kubakici, netdev,
virtualization, loseweigh, aaron.f.brown
In-Reply-To: <1525734594-11134-1-git-send-email-sridhar.samudrala@intel.com>
On Mon, May 07, 2018 at 04:09:54PM -0700, Sridhar Samudrala wrote:
> This feature bit can be used by hypervisor to indicate virtio_net device to
> act as a standby for another device with the same MAC address.
>
> I tested this with a small change to the patch to mark the STANDBY feature 'true'
> by default as i am using libvirt to start the VMs.
> Is there a way to pass the newly added feature bit 'standby' to qemu via libvirt
> XML file?
>
> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
So I do not think we can commit to this interface: we
really need to control visibility of the primary device.
However just for testing purposes, we could add a non-stable
interface "x-standby" with the understanding that as any
x- prefix it's unstable and will be changed down the road,
likely in the next release.
> ---
> hw/net/virtio-net.c | 2 ++
> include/standard-headers/linux/virtio_net.h | 3 +++
> 2 files changed, 5 insertions(+)
>
> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> index 90502fca7c..38b3140670 100644
> --- a/hw/net/virtio-net.c
> +++ b/hw/net/virtio-net.c
> @@ -2198,6 +2198,8 @@ static Property virtio_net_properties[] = {
> true),
> DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed, SPEED_UNKNOWN),
> DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
> + DEFINE_PROP_BIT64("standby", VirtIONet, host_features, VIRTIO_NET_F_STANDBY,
> + false),
> DEFINE_PROP_END_OF_LIST(),
> };
>
> diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
> index e9f255ea3f..01ec09684c 100644
> --- a/include/standard-headers/linux/virtio_net.h
> +++ b/include/standard-headers/linux/virtio_net.h
> @@ -57,6 +57,9 @@
> * Steering */
> #define VIRTIO_NET_F_CTRL_MAC_ADDR 23 /* Set MAC address */
>
> +#define VIRTIO_NET_F_STANDBY 62 /* Act as standby for another device
> + * with the same MAC.
> + */
> #define VIRTIO_NET_F_SPEED_DUPLEX 63 /* Device set linkspeed and duplex */
>
> #ifndef VIRTIO_NET_NO_LEGACY
> --
> 2.14.3
^ permalink raw reply
* [PULL] vhost: cleanups and fixes
From: Michael S. Tsirkin @ 2018-06-11 16:23 UTC (permalink / raw)
To: Linus Torvalds
Cc: ohad, kevin, kvm, mst, netdev, liang.z.li, linux-remoteproc,
linux-kernel, stable, bjorn.andersson, mhocko, mhocko,
syzbot+87cfa083e727a224754b, akpm, virtualization
The following changes since commit 29dcea88779c856c7dc92040a0c01233263101d4:
Linux 4.17 (2018-06-03 14:15:21 -0700)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus
for you to fetch changes up to aa15783ee62d57d69433101ede3e3ed11e48161d:
virtio: update the comments for transport features (2018-06-07 22:17:40 +0300)
----------------------------------------------------------------
virtio, vhost: features, fixes
VF support for virtio.
Free page hint request support for VM migration.
DMA barriers for virtio strong barriers.
Bugfixes.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
----------------------------------------------------------------
Michael S. Tsirkin (2):
virtio_ring: switch to dma_XX barriers for rpmsg
vhost: fix info leak due to uninitialized memory
Tiwei Bie (2):
virtio_pci: support enabling VFs
virtio: update the comments for transport features
Wei Wang (4):
mm: support reporting free page blocks
virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
mm/page_poison: expose page_poisoning_enabled to kernel modules
virtio-balloon: VIRTIO_BALLOON_F_PAGE_POISON
drivers/vhost/vhost.c | 3 +
drivers/virtio/virtio_balloon.c | 298 +++++++++++++++++++++++++++++++-----
drivers/virtio/virtio_pci_common.c | 30 ++++
drivers/virtio/virtio_pci_modern.c | 14 ++
include/linux/mm.h | 6 +
include/linux/virtio_ring.h | 4 +-
include/uapi/linux/virtio_balloon.h | 7 +
include/uapi/linux/virtio_config.h | 16 +-
mm/page_alloc.c | 97 ++++++++++++
mm/page_poison.c | 6 +
10 files changed, 439 insertions(+), 42 deletions(-)
^ permalink raw reply
* Re: [PATCH v3 6/9] x86: prevent inline distortion by paravirt ops
From: Peter Zijlstra @ 2018-06-11 7:45 UTC (permalink / raw)
To: Nadav Amit
Cc: Juergen Gross, x86, linux-kernel, virtualization, Ingo Molnar,
H. Peter Anvin, Alok Kataria, Thomas Gleixner
In-Reply-To: <20180610141911.52948-7-namit@vmware.com>
On Sun, Jun 10, 2018 at 07:19:08AM -0700, Nadav Amit wrote:
> +/*
> + * This generates an indirect call based on the operation type number.
> + * The type number, computed in PARAVIRT_PATCH, is derived from the
> + * offset into the paravirt_patch_template structure, and can therefore be
> + * freely converted back into a structure offset.
> + */
> +.macro PARAVIRT_ALT type:req clobber:req pv_opptr:req
Unlike the marcro maze you replaced, this has the CALL hardcoded in. So
maybe name this PARAVIRT_CALL instead of PARAVIRT_ALT ?
> +771: ANNOTATE_RETPOLINE_SAFE
> + call *\pv_opptr
> +772: .pushsection .parainstructions,"a"
> + _ASM_ALIGN
> + _ASM_PTR 771b
> + .byte \type
> + .byte 772b-771b
> + .short \clobber
> + .popsection
> +.endm
^ permalink raw reply
* RE: [PATCH v6] virtio_blk: add DISCARD and WRIET ZEROES commands support
From: Liu, Changpeng @ 2018-06-11 3:37 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: pbonzini@redhat.com, cavery@redhat.com,
virtualization@lists.linux-foundation.org
In-Reply-To: <20180608102027.GB31164@stefanha-x1.localdomain>
> -----Original Message-----
> From: Stefan Hajnoczi [mailto:stefanha@redhat.com]
> Sent: Friday, June 8, 2018 6:20 PM
> To: Liu, Changpeng <changpeng.liu@intel.com>
> Cc: virtualization@lists.linux-foundation.org; cavery@redhat.com;
> jasowang@redhat.com; pbonzini@redhat.com; Wang, Wei W
> <wei.w.wang@intel.com>
> Subject: Re: [PATCH v6] virtio_blk: add DISCARD and WRIET ZEROES commands
> support
>
> On Thu, Jun 07, 2018 at 11:07:06PM +0000, Liu, Changpeng wrote:
> >
> >
> > > -----Original Message-----
> > > From: Stefan Hajnoczi [mailto:stefanha@redhat.com]
> > > Sent: Thursday, June 7, 2018 9:10 PM
> > > To: Liu, Changpeng <changpeng.liu@intel.com>
> > > Cc: virtualization@lists.linux-foundation.org; cavery@redhat.com;
> > > jasowang@redhat.com; pbonzini@redhat.com; Wang, Wei W
> > > <wei.w.wang@intel.com>
> > > Subject: Re: [PATCH v6] virtio_blk: add DISCARD and WRIET ZEROES commands
> > > support
> > >
> > > On Wed, Jun 06, 2018 at 12:19:00PM +0800, Changpeng Liu wrote:
> > > > Existing virtio-blk protocol doesn't have DISCARD/WRITE ZEROES commands
> > > > support, this will impact the performance when using SSD backend over
> > > > file systems.
> > > >
> > > > Commit 88c85538 "virtio-blk: add discard and write zeroes features to
> > > > specification"(see https://github.com/oasis-tcs/virtio-spec) extended
> > > > existing virtio-blk protocol, adding extra DISCARD and WRITE ZEROES
> > > > commands support.
> > > >
> > > > While here, using 16 bytes descriptor to describe one segment of DISCARD
> > > > or WRITE ZEROES commands, each command may contain one or more
> > > decriptors.
> > > >
> > > > The following data structure shows the definition of one descriptor:
> > > >
> > > > struct virtio_blk_discard_write_zeroes {
> > > > le64 sector;
> > > > le32 num_sectors;
> > > > le32 unmap;
> > > > };
> > > >
> > > > Field 'sector' means the start sector for DISCARD and WRITE ZEROES,
> > > > filed 'num_sectors' means the number of sectors for DISCARD and WRITE
> > > > ZEROES, if both DISCARD and WRITE ZEROES are supported, field 'unmap'
> > > > maybe used for WRITE ZEROES command with DISCARD enabled.
> > > >
> > > > We also extended the virtio-blk configuration space to let backend
> > > > device put DISCARD and WRITE ZEROES configuration parameters.
> > > >
> > > > struct virtio_blk_config {
> > > > [...]
> > > >
> > > > le32 max_discard_sectors;
> > > > le32 max_discard_seg;
> > > > le32 discard_sector_alignment;
> > > > le32 max_write_zeroes_sectors;
> > > > le32 max_write_zeroes_seg;
> > > > u8 write_zeroes_may_unmap;
> > > > }
> > > >
> > > > New feature bit [VIRTIO_BLK_F_DISCARD (13)]: Device can support discard
> > > > command, maximum discard sectors size in field 'max_discard_sectors' and
> > > > maximum discard segment number in field 'max_discard_seg'.
> > > >
> > > > New feature [VIRTIO_BLK_F_WRITE_ZEROES (14)]: Device can support write
> > > > zeroes command, maximum write zeroes sectors size in field
> > > > 'max_write_zeroes_sectors' and maximum write zeroes segment number in
> > > > field 'max_write_zeroes_seg'.
> > > >
> > > > The parameters in the configuration space of the device field
> > > > 'max_discard_sectors' and field 'discard_sector_alignment' are expressed in
> > > > 512-byte units if the VIRTIO_BLK_F_DISCARD feature bit is negotiated. The
> > > > field 'max_write_zeroes_sectors' is expressed in 512-byte units if the
> > > > VIRTIO_BLK_F_WRITE_ZEROES feature bit is negotiated.
> > > >
> > > > Signed-off-by: Changpeng Liu <changpeng.liu@intel.com>
> > > > ---
> > > > CHANGELOG:
> > > > v6: don't set T_OUT bit to discard and write zeroes commands.
> > >
> > > I don't see this in the patch...
> > Yeah, do noting with DISCARD/WRITE ZEROES means no need to OR BLK_T_OUT
> again.
> > >
> > > > @@ -225,6 +260,7 @@ static blk_status_t virtio_queue_rq(struct
> > > blk_mq_hw_ctx *hctx,
> > > > int qid = hctx->queue_num;
> > > > int err;
> > > > bool notify = false;
> > > > + bool unmap = false;
> > > > u32 type;
> > > >
> > > > BUG_ON(req->nr_phys_segments + 2 > vblk->sg_elems);
> > > > @@ -237,6 +273,13 @@ static blk_status_t virtio_queue_rq(struct
> > > blk_mq_hw_ctx *hctx,
> > > > case REQ_OP_FLUSH:
> > > > type = VIRTIO_BLK_T_FLUSH;
> > > > break;
> > > > + case REQ_OP_DISCARD:
> > > > + type = VIRTIO_BLK_T_DISCARD;
> > > > + break;
> > > > + case REQ_OP_WRITE_ZEROES:
> > > > + type = VIRTIO_BLK_T_WRITE_ZEROES;
> > > > + unmap = !(req->cmd_flags & REQ_NOUNMAP);
> > > > + break;
> > > > case REQ_OP_SCSI_IN:
> > > > case REQ_OP_SCSI_OUT:
> > > > type = VIRTIO_BLK_T_SCSI_CMD;
> > > > @@ -256,6 +299,12 @@ static blk_status_t virtio_queue_rq(struct
> > > blk_mq_hw_ctx *hctx,
> > > >
> > > > blk_mq_start_request(req);
> > > >
> > > > + if (type == VIRTIO_BLK_T_DISCARD || type ==
> > > VIRTIO_BLK_T_WRITE_ZEROES) {
> > > > + err = virtblk_setup_discard_write_zeroes(req, unmap);
> > > > + if (err)
> > > > + return BLK_STS_RESOURCE;
> > > > + }
> > > > +
> > > > num = blk_rq_map_sg(hctx->queue, req, vbr->sg);
> > > > if (num) {
> > > > if (rq_data_dir(req) == WRITE)
> > >
> > > ...since we still do blk_rq_map_sg() here and num should be != 0.
> > No, while here, we should keep the original logic for READ/WRITE commands.
>
> My question is: why does the changelog say "don't set T_OUT" but the
> code *will* set it because blk_rq_map_sg() returns != 0 and
> rq_data_dir(req) == WRITE?
Since the last bit of DISCARD/WRITE ZEROES commands are already 1, so I said we don't need to set
T_OUT bit to DISCARD/WRITE ZEROES commands again. But the original logic for WRITE, T_OUT is still
needed, so just keep the original code here is fine.
>
> Maybe I'm misreading the code, but it looks to me like this patch
> does the opposite of what the changelog says.
>
> Stefan
^ permalink raw reply
* Re: [RFC V2] virtio: Add platform specific DMA API translation for virito devices
From: Benjamin Herrenschmidt @ 2018-06-11 3:34 UTC (permalink / raw)
To: Michael S. Tsirkin, Ram Pai
Cc: robh, pawel.moll, Tom Lendacky, cohuck, linux-kernel,
virtualization, Christoph Hellwig, joe, Rustad, Mark D,
Anshuman Khandual, linuxppc-dev, elfring, david
In-Reply-To: <20180611060949-mutt-send-email-mst@kernel.org>
On Mon, 2018-06-11 at 06:28 +0300, Michael S. Tsirkin wrote:
>
> > However if the administrator
> > ignores/forgets/deliberatey-decides/is-constrained to NOT enable the
> > flag, virtio will not be able to pass control to the DMA ops associated
> > with the virtio devices. Which means, we have no opportunity to share
> > the I/O buffers with the hypervisor/qemu.
> >
> > How do you suggest, we handle this case?
>
> As step 1, ignore it as a user error.
Ugh ... not again. Ram, don't bring that subject back we ALREADY
addressed it, and requiring the *user* to do special things is just
utterly and completely wrong.
The *user* has no bloody idea what that stuff is, will never know to
set whatver magic qemu flag etc... The user will just start a a VM
normally and expect things to work. Requiring the *user* to know things
like that iommu virtio flag is complete nonsense.
If by "user" you mean libvirt, then you are now requesting about 4 or 5
different projects to be patched to add speical cases for something
they know nothing about and is completely irrelevant, while it can be
entirely addressed with a 1-liner in virtio kernel side to allow the
arch to plumb alternate DMA ops.
So for some reason you seem to be dead set on a path that leads to
mountain of user pain, changes to many different projects and overall
havok while there is a much much simpler and elegant solution at hand
which I described (again) in the response to Ram I sent about 5mn ago.
> Further you can for example add per-device quirks in virtio so it can be
> switched to dma api. make extra decisions in platform code then.
>
> > >
> > >
> > >
> > > > Both in the flag naming and the implementation there is an implication
> > > > of DMA API == IOMMU, which is fundamentally wrong.
> > >
> > > Maybe we need to extend the meaning of PLATFORM_IOMMU or rename it.
> > >
> > > It's possible that some setups will benefit from a more
> > > fine-grained approach where some aspects of the DMA
> > > API are bypassed, others aren't.
> > >
> > > This seems to be what was being asked for in this thread,
> > > with comments claiming IOMMU flag adds too much overhead.
> > >
> > >
> > > > The DMA API does a few different things:
> > > >
> > > > a) address translation
> > > >
> > > > This does include IOMMUs. But it also includes random offsets
> > > > between PCI bars and system memory that we see on various
> > > > platforms.
> > >
> > > I don't think you mean bars. That's unrelated to DMA.
> > >
> > > > Worse so some of these offsets might be based on
> > > > banks, e.g. on the broadcom bmips platform. It also deals
> > > > with bitmask in physical addresses related to memory encryption
> > > > like AMD SEV. I'd be really curious how for example the
> > > > Intel virtio based NIC is going to work on any of those
> > > > plaforms.
> > >
> > > SEV guys report that they just set the iommu flag and then it all works.
> >
> > This is one of the fundamental difference between SEV architecture and
> > the ultravisor architecture. In SEV, qemu is aware of SEV. In
> > ultravisor architecture, only the VM that runs within qemu is aware of
> > ultravisor; hypervisor/qemu/administrator are untrusted entities.
>
> Spo one option is to teach qemu that it's on a platform with an
> ultravisor, this might have more advantages.
>
> > I hope, we can make virtio subsystem flexibe enough to support various
> > security paradigms.
>
> So if you are worried about qemu attacking guests, I see
> more problems than just passing an incorrect iommu
> flag.
>
>
> > Apart from the above reason, Christoph and Ben point to so many other
> > reasons to make it flexibe. So why not, make it happen?
> >
>
> I don't see a flexibility argument. I just don't think new platforms
> should use workarounds that we put in place for old ones.
>
>
> > > I guess if there's translation we can think of this as a kind of iommu.
> > > Maybe we should rename PLATFORM_IOMMU to PLARTFORM_TRANSLATION?
> > >
> > > And apparently some people complain that just setting that flag makes
> > > qemu check translation on each access with an unacceptable performance
> > > overhead. Forcing same behaviour for everyone on general principles
> > > even without the flag is unlikely to make them happy.
> > >
> > > > b) coherency
> > > >
> > > > On many architectures DMA is not cache coherent, and we need
> > > > to invalidate and/or write back cache lines before doing
> > > > DMA. Again, I wonder how this is every going to work with
> > > > hardware based virtio implementations.
> > >
> > >
> > > You mean dma_Xmb and friends?
> > > There's a new feature VIRTIO_F_IO_BARRIER that's being proposed
> > > for that.
> > >
> > >
> > > > Even worse I think this
> > > > is actually broken at least for VIVT event for virtualized
> > > > implementations. E.g. a KVM guest is going to access memory
> > > > using different virtual addresses than qemu, vhost might throw
> > > > in another different address space.
> > >
> > > I don't really know what VIVT is. Could you help me please?
> > >
> > > > c) bounce buffering
> > > >
> > > > Many DMA implementations can not address all physical memory
> > > > due to addressing limitations. In such cases we copy the
> > > > DMA memory into a known addressable bounc buffer and DMA
> > > > from there.
> > >
> > > Don't do it then?
> > >
> > >
> > > > d) flushing write combining buffers or similar
> > > >
> > > > On some hardware platforms we need workarounds to e.g. read
> > > > from a certain mmio address to make sure DMA can actually
> > > > see memory written by the host.
> > >
> > > I guess it isn't an issue as long as WC isn't actually used.
> > > It will become an issue when virtio spec adds some WC capability -
> > > I suspect we can ignore this for now.
> > >
> > > >
> > > > All of this is bypassed by virtio by default despite generally being
> > > > platform issues, not particular to a given device.
> > >
> > > It's both a device and a platform issue. A PV device is often more like
> > > another CPU than like a PCI device.
> > >
> > >
> > >
> > > --
> > > MST
> >
> > --
> > Ram Pai
^ permalink raw reply
* Re: [RFC V2] virtio: Add platform specific DMA API translation for virito devices
From: Benjamin Herrenschmidt @ 2018-06-11 3:29 UTC (permalink / raw)
To: Ram Pai, Michael S. Tsirkin
Cc: robh, pawel.moll, Tom Lendacky, cohuck, linux-kernel,
virtualization, Christoph Hellwig, joe, Rustad, Mark D,
Anshuman Khandual, linuxppc-dev, elfring, david
In-Reply-To: <20180611023909.GA5726@ram.oc3035372033.ibm.com>
On Sun, 2018-06-10 at 19:39 -0700, Ram Pai wrote:
>
> However if the administrator
> ignores/forgets/deliberatey-decides/is-constrained to NOT enable the
> flag, virtio will not be able to pass control to the DMA ops associated
> with the virtio devices. Which means, we have no opportunity to share
> the I/O buffers with the hypervisor/qemu.
>
> How do you suggest, we handle this case?
At the risk of repeating myself, let's just do the first pass which is
to switch virtio over to always using the DMA API in the actual data
flow code, with a hook at initialization time that replaces the DMA ops
with some home cooked "direct" ops in the case where the IOMMU flag
isn't set.
This will be equivalent to what we have today but avoids having 2
separate code path all over the driver.
Then a second stage, I think, is to replace this "hook" so that the
architecture gets a say in the matter.
Basically, something like:
arch_virtio_update_dma_ops(pci_dev, qemu_direct_mode).
IE, virtio would tell the arch whether the "other side" is in fact QEMU
in a mode that bypasses the IOMMU and is cache coherent with the guest.
This is our legacy "qemu special" mode. If the performance is
sufficient we may want to deprecate it over time and have qemu enable
the iommu by default but we still need it.
A weak implementation of the above will be provied that just puts in
the direct ops when qemu_direct_mode is set, and thus provides today's
behaviour on any arch that doesn't override it. If the flag is not set,
the ops are left to whatever the arch PCI layer already set.
This will provide the opportunity for architectures that want to do
something special, such as in our case, when we want to force even the
"qemu_direct_mode" to go via bounce buffers, to put our own ops in
there, while retaining the legacy behaviour otherwise.
It also means that the "gunk" is entirely localized in that one
function, the rest of virtio just uses the DMA API normally.
Christoph, are you actually hacking "stage 1" above already or should
we produce patches ?
Cheers,
Ben.
^ permalink raw reply
* Re: [RFC V2] virtio: Add platform specific DMA API translation for virito devices
From: Michael S. Tsirkin @ 2018-06-11 3:28 UTC (permalink / raw)
To: Ram Pai
Cc: robh, pawel.moll, Tom Lendacky, benh, cohuck, linux-kernel,
virtualization, Christoph Hellwig, joe, Rustad, Mark D,
Anshuman Khandual, linuxppc-dev, elfring, david
In-Reply-To: <20180611023909.GA5726@ram.oc3035372033.ibm.com>
On Sun, Jun 10, 2018 at 07:39:09PM -0700, Ram Pai wrote:
> On Thu, Jun 07, 2018 at 07:28:35PM +0300, Michael S. Tsirkin wrote:
> > On Wed, Jun 06, 2018 at 10:23:06PM -0700, Christoph Hellwig wrote:
> > > On Thu, May 31, 2018 at 08:43:58PM +0300, Michael S. Tsirkin wrote:
> > > > Pls work on a long term solution. Short term needs can be served by
> > > > enabling the iommu platform in qemu.
> > >
> > > So, I spent some time looking at converting virtio to dma ops overrides,
> > > and the current virtio spec, and the sad through I have to tell is that
> > > both the spec and the Linux implementation are complete and utterly fucked
> > > up.
> >
> > Let me restate it: DMA API has support for a wide range of hardware, and
> > hardware based virtio implementations likely won't benefit from all of
> > it.
> >
> > And given virtio right now is optimized for specific workloads, improving
> > portability without regressing performance isn't easy.
> >
> > I think it's unsurprising since it started a strictly a guest/host
> > mechanism. People did implement offloads on specific platforms though,
> > and they are known to work. To improve portability even further,
> > we might need to make spec and code changes.
> >
> > I'm not really sympathetic to people complaining that they can't even
> > set a flag in qemu though. If that's the case the stack in question is
> > way too inflexible.
>
> We did consider your suggestion. But can't see how it will work.
> Maybe you can guide us here.
>
> In our case qemu has absolutely no idea if the VM will switch itself to
> secure mode or not. Its a dynamic decision made entirely by the VM
> through direct interaction with the hardware/firmware; no
> qemu/hypervisor involved.
>
> If the administrator, who invokes qemu, enables the flag, the DMA ops
> associated with the virito devices will be called, and hence will be
> able to do the right things. Yes we might incur performance hit due to
> the IOMMU translations, but lets ignore that for now; the functionality
> will work. Good till now.
>
> However if the administrator
> ignores/forgets/deliberatey-decides/is-constrained to NOT enable the
> flag, virtio will not be able to pass control to the DMA ops associated
> with the virtio devices. Which means, we have no opportunity to share
> the I/O buffers with the hypervisor/qemu.
>
> How do you suggest, we handle this case?
As step 1, ignore it as a user error.
Further you can for example add per-device quirks in virtio so it can be
switched to dma api. make extra decisions in platform code then.
> >
> >
> >
> > > Both in the flag naming and the implementation there is an implication
> > > of DMA API == IOMMU, which is fundamentally wrong.
> >
> > Maybe we need to extend the meaning of PLATFORM_IOMMU or rename it.
> >
> > It's possible that some setups will benefit from a more
> > fine-grained approach where some aspects of the DMA
> > API are bypassed, others aren't.
> >
> > This seems to be what was being asked for in this thread,
> > with comments claiming IOMMU flag adds too much overhead.
> >
> >
> > > The DMA API does a few different things:
> > >
> > > a) address translation
> > >
> > > This does include IOMMUs. But it also includes random offsets
> > > between PCI bars and system memory that we see on various
> > > platforms.
> >
> > I don't think you mean bars. That's unrelated to DMA.
> >
> > > Worse so some of these offsets might be based on
> > > banks, e.g. on the broadcom bmips platform. It also deals
> > > with bitmask in physical addresses related to memory encryption
> > > like AMD SEV. I'd be really curious how for example the
> > > Intel virtio based NIC is going to work on any of those
> > > plaforms.
> >
> > SEV guys report that they just set the iommu flag and then it all works.
>
> This is one of the fundamental difference between SEV architecture and
> the ultravisor architecture. In SEV, qemu is aware of SEV. In
> ultravisor architecture, only the VM that runs within qemu is aware of
> ultravisor; hypervisor/qemu/administrator are untrusted entities.
Spo one option is to teach qemu that it's on a platform with an
ultravisor, this might have more advantages.
> I hope, we can make virtio subsystem flexibe enough to support various
> security paradigms.
So if you are worried about qemu attacking guests, I see
more problems than just passing an incorrect iommu
flag.
> Apart from the above reason, Christoph and Ben point to so many other
> reasons to make it flexibe. So why not, make it happen?
>
I don't see a flexibility argument. I just don't think new platforms
should use workarounds that we put in place for old ones.
> > I guess if there's translation we can think of this as a kind of iommu.
> > Maybe we should rename PLATFORM_IOMMU to PLARTFORM_TRANSLATION?
> >
> > And apparently some people complain that just setting that flag makes
> > qemu check translation on each access with an unacceptable performance
> > overhead. Forcing same behaviour for everyone on general principles
> > even without the flag is unlikely to make them happy.
> >
> > > b) coherency
> > >
> > > On many architectures DMA is not cache coherent, and we need
> > > to invalidate and/or write back cache lines before doing
> > > DMA. Again, I wonder how this is every going to work with
> > > hardware based virtio implementations.
> >
> >
> > You mean dma_Xmb and friends?
> > There's a new feature VIRTIO_F_IO_BARRIER that's being proposed
> > for that.
> >
> >
> > > Even worse I think this
> > > is actually broken at least for VIVT event for virtualized
> > > implementations. E.g. a KVM guest is going to access memory
> > > using different virtual addresses than qemu, vhost might throw
> > > in another different address space.
> >
> > I don't really know what VIVT is. Could you help me please?
> >
> > > c) bounce buffering
> > >
> > > Many DMA implementations can not address all physical memory
> > > due to addressing limitations. In such cases we copy the
> > > DMA memory into a known addressable bounc buffer and DMA
> > > from there.
> >
> > Don't do it then?
> >
> >
> > > d) flushing write combining buffers or similar
> > >
> > > On some hardware platforms we need workarounds to e.g. read
> > > from a certain mmio address to make sure DMA can actually
> > > see memory written by the host.
> >
> > I guess it isn't an issue as long as WC isn't actually used.
> > It will become an issue when virtio spec adds some WC capability -
> > I suspect we can ignore this for now.
> >
> > >
> > > All of this is bypassed by virtio by default despite generally being
> > > platform issues, not particular to a given device.
> >
> > It's both a device and a platform issue. A PV device is often more like
> > another CPU than like a PCI device.
> >
> >
> >
> > --
> > MST
>
> --
> Ram Pai
^ permalink raw reply
* Re: [PATCH net] vhost_net: remove VHOST_NET_F_VIRTIO_NET_HDR support
From: Michael S. Tsirkin @ 2018-06-11 3:08 UTC (permalink / raw)
To: Jason Wang; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <90e9b3ea-e9d2-d60d-8355-42d447c67452@redhat.com>
On Mon, Jun 11, 2018 at 10:29:36AM +0800, Jason Wang wrote:
>
>
> On 2018年06月11日 10:12, Michael S. Tsirkin wrote:
> > On Fri, Jun 08, 2018 at 01:07:09PM +0800, Jason Wang wrote:
> > >
> > > On 2018年06月08日 12:46, Michael S. Tsirkin wrote:
> > > > On Fri, Jun 08, 2018 at 11:50:42AM +0800, Jason Wang wrote:
> > > > > This feature bit is duplicated with VIRTIO_F_ANY_LAYOUT, this means if
> > > > > a userpsace want to enable VRITIO_F_ANY_LAYOUT,
> > > > > VHOST_NET_F_VIRTIO_NET_HDR will be implied too. This is wrong and will
> > > > > break networking.
> > > > What breaks networking exactly? VHOST_NET supported ANY_LAYOUT
> > > > from day one. For this reason it does not need to know about
> > > > VRITIO_F_ANY_LAYOUT and we reused the bit for other purposes.
> > > It's the knowledge of vhost_net code it self but not userspace. For
> > > userspace, it should depends on the value of returned by VHOST_GET_FEATURES.
> > > So when userspace can set_features with ANY_LAYOUT, vhost may think it wants
> > > VHOST_NET_F_VIRTIO_NET_HDR.
> > Yes but that's the admittedly ugly API that we have now.
> > userspace is supposed to know VRITIO_F_ANY_LAYOUT does
> > not make sense for vhost.
>
> Ok.
>
> >
> >
> >
> > > >
> > > >
> > > > > Fixing this by safely removing
> > > > > VHOST_NET_F_VIRTIO_NET_HDR support. There should be very few or even
> > > > > no userspace can use this.
> > > > Quite possibly, but it is hard to be sure. It seems safer to
> > > > maintain it unless there's an actual reason something's broken.
> > > I think not since the feature is negotiated not mandatory?
> > That doesn't mean much.
> >
> > > > > Further cleanups could be done for
> > > > > -net-next for safety.
> > > > >
> > > > > In the future, we need a vhost dedicated feature set/get ioctl()
> > > > > instead of reusing virtio ones.
> > > > Not just in the future, we might want to switch iommu
> > > > to a sane structure without the 64 bit padding bug
> > > > right now.
> > > Yes, I hit this bug when introducing V2 of msg IOTLB message.
> > Sounds good, so if you like, reserve a bit for
> > VHOST_NET_F_VIRTIO_NET_HDR in the new ioctl mask and
> > do not enable it there.
>
> Ok, and maybe VHOST_F_LOG_ALL.
>
> >
> > > > > Fixes: 4e9fa50c6ccbe ("vhost: move features to core")
> > > > This tag makes no sense here IMHO. Looks like people are using some tool
> > > > that just looks at the earliest version where patch won't apply. The
> > > > commit in question just moved some code around.
> > > Looks not, before this commit, vhost_net won't return ANY_LAYOUT.
> > >
> > > Thanks
> > Well ANY_LAYOUT just happens to be same as VHOST_NET_F_VIRTIO_NET_HDR
> > and that has been set since forever.
>
> So do you still want this patch? If not we need to document that ANY_LAYOUT
> could not be passed through SET_FEATURES somewhere.
>
> Thanks
Let's document this, yes. I don't think we should drop
VHOST_NET_F_VIRTIO_NET_HDR now.
> >
> > > > > Signed-off-by: Jason Wang <jasowang@redhat.com>
> > > > > ---
> > > > > drivers/vhost/net.c | 15 +++++----------
> > > > > 1 file changed, 5 insertions(+), 10 deletions(-)
> > > > >
> > > > > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > > > > index 986058a..83eef52 100644
> > > > > --- a/drivers/vhost/net.c
> > > > > +++ b/drivers/vhost/net.c
> > > > > @@ -69,7 +69,6 @@ MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero Copy TX;"
> > > > > enum {
> > > > > VHOST_NET_FEATURES = VHOST_FEATURES |
> > > > > - (1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
> > > > > (1ULL << VIRTIO_NET_F_MRG_RXBUF) |
> > > > > (1ULL << VIRTIO_F_IOMMU_PLATFORM)
> > > > > };
> > > > > @@ -1255,15 +1254,11 @@ static int vhost_net_set_features(struct vhost_net *n, u64 features)
> > > > > (1ULL << VIRTIO_F_VERSION_1))) ?
> > > > > sizeof(struct virtio_net_hdr_mrg_rxbuf) :
> > > > > sizeof(struct virtio_net_hdr);
> > > > > - if (features & (1 << VHOST_NET_F_VIRTIO_NET_HDR)) {
> > > > > - /* vhost provides vnet_hdr */
> > > > > - vhost_hlen = hdr_len;
> > > > > - sock_hlen = 0;
> > > > > - } else {
> > > > > - /* socket provides vnet_hdr */
> > > > > - vhost_hlen = 0;
> > > > > - sock_hlen = hdr_len;
> > > > > - }
> > > > > +
> > > > > + /* socket provides vnet_hdr */
> > > > > + vhost_hlen = 0;
> > > > > + sock_hlen = hdr_len;
> > > > > +
> > > > > mutex_lock(&n->dev.mutex);
> > > > > if ((features & (1 << VHOST_F_LOG_ALL)) &&
> > > > > !vhost_log_access_ok(&n->dev))
> > > > > --
> > > > > 2.7.4
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH net] vhost_net: remove VHOST_NET_F_VIRTIO_NET_HDR support
From: Jason Wang @ 2018-06-11 2:29 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20180611050741-mutt-send-email-mst@kernel.org>
On 2018年06月11日 10:12, Michael S. Tsirkin wrote:
> On Fri, Jun 08, 2018 at 01:07:09PM +0800, Jason Wang wrote:
>>
>> On 2018年06月08日 12:46, Michael S. Tsirkin wrote:
>>> On Fri, Jun 08, 2018 at 11:50:42AM +0800, Jason Wang wrote:
>>>> This feature bit is duplicated with VIRTIO_F_ANY_LAYOUT, this means if
>>>> a userpsace want to enable VRITIO_F_ANY_LAYOUT,
>>>> VHOST_NET_F_VIRTIO_NET_HDR will be implied too. This is wrong and will
>>>> break networking.
>>> What breaks networking exactly? VHOST_NET supported ANY_LAYOUT
>>> from day one. For this reason it does not need to know about
>>> VRITIO_F_ANY_LAYOUT and we reused the bit for other purposes.
>> It's the knowledge of vhost_net code it self but not userspace. For
>> userspace, it should depends on the value of returned by VHOST_GET_FEATURES.
>> So when userspace can set_features with ANY_LAYOUT, vhost may think it wants
>> VHOST_NET_F_VIRTIO_NET_HDR.
> Yes but that's the admittedly ugly API that we have now.
> userspace is supposed to know VRITIO_F_ANY_LAYOUT does
> not make sense for vhost.
Ok.
>
>
>
>>>
>>>
>>>> Fixing this by safely removing
>>>> VHOST_NET_F_VIRTIO_NET_HDR support. There should be very few or even
>>>> no userspace can use this.
>>> Quite possibly, but it is hard to be sure. It seems safer to
>>> maintain it unless there's an actual reason something's broken.
>> I think not since the feature is negotiated not mandatory?
> That doesn't mean much.
>
>>>> Further cleanups could be done for
>>>> -net-next for safety.
>>>>
>>>> In the future, we need a vhost dedicated feature set/get ioctl()
>>>> instead of reusing virtio ones.
>>> Not just in the future, we might want to switch iommu
>>> to a sane structure without the 64 bit padding bug
>>> right now.
>> Yes, I hit this bug when introducing V2 of msg IOTLB message.
> Sounds good, so if you like, reserve a bit for
> VHOST_NET_F_VIRTIO_NET_HDR in the new ioctl mask and
> do not enable it there.
Ok, and maybe VHOST_F_LOG_ALL.
>
>>>> Fixes: 4e9fa50c6ccbe ("vhost: move features to core")
>>> This tag makes no sense here IMHO. Looks like people are using some tool
>>> that just looks at the earliest version where patch won't apply. The
>>> commit in question just moved some code around.
>> Looks not, before this commit, vhost_net won't return ANY_LAYOUT.
>>
>> Thanks
> Well ANY_LAYOUT just happens to be same as VHOST_NET_F_VIRTIO_NET_HDR
> and that has been set since forever.
So do you still want this patch? If not we need to document that
ANY_LAYOUT could not be passed through SET_FEATURES somewhere.
Thanks
>
>>>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>>>> ---
>>>> drivers/vhost/net.c | 15 +++++----------
>>>> 1 file changed, 5 insertions(+), 10 deletions(-)
>>>>
>>>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>>>> index 986058a..83eef52 100644
>>>> --- a/drivers/vhost/net.c
>>>> +++ b/drivers/vhost/net.c
>>>> @@ -69,7 +69,6 @@ MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero Copy TX;"
>>>> enum {
>>>> VHOST_NET_FEATURES = VHOST_FEATURES |
>>>> - (1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
>>>> (1ULL << VIRTIO_NET_F_MRG_RXBUF) |
>>>> (1ULL << VIRTIO_F_IOMMU_PLATFORM)
>>>> };
>>>> @@ -1255,15 +1254,11 @@ static int vhost_net_set_features(struct vhost_net *n, u64 features)
>>>> (1ULL << VIRTIO_F_VERSION_1))) ?
>>>> sizeof(struct virtio_net_hdr_mrg_rxbuf) :
>>>> sizeof(struct virtio_net_hdr);
>>>> - if (features & (1 << VHOST_NET_F_VIRTIO_NET_HDR)) {
>>>> - /* vhost provides vnet_hdr */
>>>> - vhost_hlen = hdr_len;
>>>> - sock_hlen = 0;
>>>> - } else {
>>>> - /* socket provides vnet_hdr */
>>>> - vhost_hlen = 0;
>>>> - sock_hlen = hdr_len;
>>>> - }
>>>> +
>>>> + /* socket provides vnet_hdr */
>>>> + vhost_hlen = 0;
>>>> + sock_hlen = hdr_len;
>>>> +
>>>> mutex_lock(&n->dev.mutex);
>>>> if ((features & (1 << VHOST_F_LOG_ALL)) &&
>>>> !vhost_log_access_ok(&n->dev))
>>>> --
>>>> 2.7.4
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH net] vhost_net: remove VHOST_NET_F_VIRTIO_NET_HDR support
From: Michael S. Tsirkin @ 2018-06-11 2:12 UTC (permalink / raw)
To: Jason Wang; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <23efe110-f61a-5aee-c0b4-bd3dc5426438@redhat.com>
On Fri, Jun 08, 2018 at 01:07:09PM +0800, Jason Wang wrote:
>
>
> On 2018年06月08日 12:46, Michael S. Tsirkin wrote:
> > On Fri, Jun 08, 2018 at 11:50:42AM +0800, Jason Wang wrote:
> > > This feature bit is duplicated with VIRTIO_F_ANY_LAYOUT, this means if
> > > a userpsace want to enable VRITIO_F_ANY_LAYOUT,
> > > VHOST_NET_F_VIRTIO_NET_HDR will be implied too. This is wrong and will
> > > break networking.
> > What breaks networking exactly? VHOST_NET supported ANY_LAYOUT
> > from day one. For this reason it does not need to know about
> > VRITIO_F_ANY_LAYOUT and we reused the bit for other purposes.
>
> It's the knowledge of vhost_net code it self but not userspace. For
> userspace, it should depends on the value of returned by VHOST_GET_FEATURES.
> So when userspace can set_features with ANY_LAYOUT, vhost may think it wants
> VHOST_NET_F_VIRTIO_NET_HDR.
Yes but that's the admittedly ugly API that we have now.
userspace is supposed to know VRITIO_F_ANY_LAYOUT does
not make sense for vhost.
> >
> >
> >
> > > Fixing this by safely removing
> > > VHOST_NET_F_VIRTIO_NET_HDR support. There should be very few or even
> > > no userspace can use this.
> > Quite possibly, but it is hard to be sure. It seems safer to
> > maintain it unless there's an actual reason something's broken.
>
> I think not since the feature is negotiated not mandatory?
That doesn't mean much.
> >
> > > Further cleanups could be done for
> > > -net-next for safety.
> > >
> > > In the future, we need a vhost dedicated feature set/get ioctl()
> > > instead of reusing virtio ones.
> > Not just in the future, we might want to switch iommu
> > to a sane structure without the 64 bit padding bug
> > right now.
>
> Yes, I hit this bug when introducing V2 of msg IOTLB message.
Sounds good, so if you like, reserve a bit for
VHOST_NET_F_VIRTIO_NET_HDR in the new ioctl mask and
do not enable it there.
> >
> > > Fixes: 4e9fa50c6ccbe ("vhost: move features to core")
> > This tag makes no sense here IMHO. Looks like people are using some tool
> > that just looks at the earliest version where patch won't apply. The
> > commit in question just moved some code around.
>
> Looks not, before this commit, vhost_net won't return ANY_LAYOUT.
>
> Thanks
Well ANY_LAYOUT just happens to be same as VHOST_NET_F_VIRTIO_NET_HDR
and that has been set since forever.
> >
> > > Signed-off-by: Jason Wang <jasowang@redhat.com>
> > > ---
> > > drivers/vhost/net.c | 15 +++++----------
> > > 1 file changed, 5 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > > index 986058a..83eef52 100644
> > > --- a/drivers/vhost/net.c
> > > +++ b/drivers/vhost/net.c
> > > @@ -69,7 +69,6 @@ MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero Copy TX;"
> > > enum {
> > > VHOST_NET_FEATURES = VHOST_FEATURES |
> > > - (1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
> > > (1ULL << VIRTIO_NET_F_MRG_RXBUF) |
> > > (1ULL << VIRTIO_F_IOMMU_PLATFORM)
> > > };
> > > @@ -1255,15 +1254,11 @@ static int vhost_net_set_features(struct vhost_net *n, u64 features)
> > > (1ULL << VIRTIO_F_VERSION_1))) ?
> > > sizeof(struct virtio_net_hdr_mrg_rxbuf) :
> > > sizeof(struct virtio_net_hdr);
> > > - if (features & (1 << VHOST_NET_F_VIRTIO_NET_HDR)) {
> > > - /* vhost provides vnet_hdr */
> > > - vhost_hlen = hdr_len;
> > > - sock_hlen = 0;
> > > - } else {
> > > - /* socket provides vnet_hdr */
> > > - vhost_hlen = 0;
> > > - sock_hlen = hdr_len;
> > > - }
> > > +
> > > + /* socket provides vnet_hdr */
> > > + vhost_hlen = 0;
> > > + sock_hlen = hdr_len;
> > > +
> > > mutex_lock(&n->dev.mutex);
> > > if ((features & (1 << VHOST_F_LOG_ALL)) &&
> > > !vhost_log_access_ok(&n->dev))
> > > --
> > > 2.7.4
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH net] vhost_net: remove VHOST_NET_F_VIRTIO_NET_HDR support
From: David Miller @ 2018-06-10 19:27 UTC (permalink / raw)
To: jasowang; +Cc: netdev, virtualization, linux-kernel, kvm, mst
In-Reply-To: <1528429842-22835-1-git-send-email-jasowang@redhat.com>
From: Jason Wang <jasowang@redhat.com>
Date: Fri, 8 Jun 2018 11:50:42 +0800
> This feature bit is duplicated with VIRTIO_F_ANY_LAYOUT, this means if
> a userpsace want to enable VRITIO_F_ANY_LAYOUT,
> VHOST_NET_F_VIRTIO_NET_HDR will be implied too. This is wrong and will
> break networking. Fixing this by safely removing
> VHOST_NET_F_VIRTIO_NET_HDR support. There should be very few or even
> no userspace can use this. Further cleanups could be done for
> -net-next for safety.
>
> In the future, we need a vhost dedicated feature set/get ioctl()
> instead of reusing virtio ones.
>
> Fixes: 4e9fa50c6ccbe ("vhost: move features to core")
> Signed-off-by: Jason Wang <jasowang@redhat.com>
I don't see this discussion as resolved yet so I'll mark this
as Deferred in patchwork.
Thanks.
^ permalink raw reply
* Re: [PATCH v6] virtio_blk: add DISCARD and WRIET ZEROES commands support
From: Stefan Hajnoczi @ 2018-06-08 10:20 UTC (permalink / raw)
To: Liu, Changpeng
Cc: pbonzini@redhat.com, cavery@redhat.com,
virtualization@lists.linux-foundation.org
In-Reply-To: <FF7FC980937D6342B9D289F5F3C7C2625B6DED4D@SHSMSX103.ccr.corp.intel.com>
[-- Attachment #1.1: Type: text/plain, Size: 5124 bytes --]
On Thu, Jun 07, 2018 at 11:07:06PM +0000, Liu, Changpeng wrote:
>
>
> > -----Original Message-----
> > From: Stefan Hajnoczi [mailto:stefanha@redhat.com]
> > Sent: Thursday, June 7, 2018 9:10 PM
> > To: Liu, Changpeng <changpeng.liu@intel.com>
> > Cc: virtualization@lists.linux-foundation.org; cavery@redhat.com;
> > jasowang@redhat.com; pbonzini@redhat.com; Wang, Wei W
> > <wei.w.wang@intel.com>
> > Subject: Re: [PATCH v6] virtio_blk: add DISCARD and WRIET ZEROES commands
> > support
> >
> > On Wed, Jun 06, 2018 at 12:19:00PM +0800, Changpeng Liu wrote:
> > > Existing virtio-blk protocol doesn't have DISCARD/WRITE ZEROES commands
> > > support, this will impact the performance when using SSD backend over
> > > file systems.
> > >
> > > Commit 88c85538 "virtio-blk: add discard and write zeroes features to
> > > specification"(see https://github.com/oasis-tcs/virtio-spec) extended
> > > existing virtio-blk protocol, adding extra DISCARD and WRITE ZEROES
> > > commands support.
> > >
> > > While here, using 16 bytes descriptor to describe one segment of DISCARD
> > > or WRITE ZEROES commands, each command may contain one or more
> > decriptors.
> > >
> > > The following data structure shows the definition of one descriptor:
> > >
> > > struct virtio_blk_discard_write_zeroes {
> > > le64 sector;
> > > le32 num_sectors;
> > > le32 unmap;
> > > };
> > >
> > > Field 'sector' means the start sector for DISCARD and WRITE ZEROES,
> > > filed 'num_sectors' means the number of sectors for DISCARD and WRITE
> > > ZEROES, if both DISCARD and WRITE ZEROES are supported, field 'unmap'
> > > maybe used for WRITE ZEROES command with DISCARD enabled.
> > >
> > > We also extended the virtio-blk configuration space to let backend
> > > device put DISCARD and WRITE ZEROES configuration parameters.
> > >
> > > struct virtio_blk_config {
> > > [...]
> > >
> > > le32 max_discard_sectors;
> > > le32 max_discard_seg;
> > > le32 discard_sector_alignment;
> > > le32 max_write_zeroes_sectors;
> > > le32 max_write_zeroes_seg;
> > > u8 write_zeroes_may_unmap;
> > > }
> > >
> > > New feature bit [VIRTIO_BLK_F_DISCARD (13)]: Device can support discard
> > > command, maximum discard sectors size in field 'max_discard_sectors' and
> > > maximum discard segment number in field 'max_discard_seg'.
> > >
> > > New feature [VIRTIO_BLK_F_WRITE_ZEROES (14)]: Device can support write
> > > zeroes command, maximum write zeroes sectors size in field
> > > 'max_write_zeroes_sectors' and maximum write zeroes segment number in
> > > field 'max_write_zeroes_seg'.
> > >
> > > The parameters in the configuration space of the device field
> > > 'max_discard_sectors' and field 'discard_sector_alignment' are expressed in
> > > 512-byte units if the VIRTIO_BLK_F_DISCARD feature bit is negotiated. The
> > > field 'max_write_zeroes_sectors' is expressed in 512-byte units if the
> > > VIRTIO_BLK_F_WRITE_ZEROES feature bit is negotiated.
> > >
> > > Signed-off-by: Changpeng Liu <changpeng.liu@intel.com>
> > > ---
> > > CHANGELOG:
> > > v6: don't set T_OUT bit to discard and write zeroes commands.
> >
> > I don't see this in the patch...
> Yeah, do noting with DISCARD/WRITE ZEROES means no need to OR BLK_T_OUT again.
> >
> > > @@ -225,6 +260,7 @@ static blk_status_t virtio_queue_rq(struct
> > blk_mq_hw_ctx *hctx,
> > > int qid = hctx->queue_num;
> > > int err;
> > > bool notify = false;
> > > + bool unmap = false;
> > > u32 type;
> > >
> > > BUG_ON(req->nr_phys_segments + 2 > vblk->sg_elems);
> > > @@ -237,6 +273,13 @@ static blk_status_t virtio_queue_rq(struct
> > blk_mq_hw_ctx *hctx,
> > > case REQ_OP_FLUSH:
> > > type = VIRTIO_BLK_T_FLUSH;
> > > break;
> > > + case REQ_OP_DISCARD:
> > > + type = VIRTIO_BLK_T_DISCARD;
> > > + break;
> > > + case REQ_OP_WRITE_ZEROES:
> > > + type = VIRTIO_BLK_T_WRITE_ZEROES;
> > > + unmap = !(req->cmd_flags & REQ_NOUNMAP);
> > > + break;
> > > case REQ_OP_SCSI_IN:
> > > case REQ_OP_SCSI_OUT:
> > > type = VIRTIO_BLK_T_SCSI_CMD;
> > > @@ -256,6 +299,12 @@ static blk_status_t virtio_queue_rq(struct
> > blk_mq_hw_ctx *hctx,
> > >
> > > blk_mq_start_request(req);
> > >
> > > + if (type == VIRTIO_BLK_T_DISCARD || type ==
> > VIRTIO_BLK_T_WRITE_ZEROES) {
> > > + err = virtblk_setup_discard_write_zeroes(req, unmap);
> > > + if (err)
> > > + return BLK_STS_RESOURCE;
> > > + }
> > > +
> > > num = blk_rq_map_sg(hctx->queue, req, vbr->sg);
> > > if (num) {
> > > if (rq_data_dir(req) == WRITE)
> >
> > ...since we still do blk_rq_map_sg() here and num should be != 0.
> No, while here, we should keep the original logic for READ/WRITE commands.
My question is: why does the changelog say "don't set T_OUT" but the
code *will* set it because blk_rq_map_sg() returns != 0 and
rq_data_dir(req) == WRITE?
Maybe I'm misreading the code, but it looks to me like this patch
does the opposite of what the changelog says.
Stefan
[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]
[-- Attachment #2: Type: text/plain, Size: 183 bytes --]
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [RFC v6 4/5] virtio_ring: add event idx support in packed ring
From: Tiwei Bie @ 2018-06-08 8:32 UTC (permalink / raw)
To: Jason Wang; +Cc: mst, netdev, linux-kernel, virtualization, wexu
In-Reply-To: <c0621b16-3143-0bf7-f44f-3d1173577734@redhat.com>
On Thu, Jun 07, 2018 at 05:50:32PM +0800, Jason Wang wrote:
> On 2018年06月05日 15:40, Tiwei Bie wrote:
> > static bool virtqueue_enable_cb_delayed_packed(struct virtqueue *_vq)
> > {
> > struct vring_virtqueue *vq = to_vvq(_vq);
> > + u16 bufs, used_idx, wrap_counter;
> > START_USE(vq);
> > /* We optimistically turn back on interrupts, then check if there was
> > * more to do. */
> > + /* Depending on the VIRTIO_RING_F_EVENT_IDX feature, we need to
> > + * either clear the flags bit or point the event index at the next
> > + * entry. Always update the event index to keep code simple. */
> > +
>
> Maybe for packed ring, it's time to treat event index separately to avoid a
> virtio_wmb() for event idx is off.
Okay. I'll do it.
>
> > + /* TODO: tune this threshold */
> > + if (vq->next_avail_idx < vq->last_used_idx)
> > + bufs = (vq->vring_packed.num + vq->next_avail_idx -
> > + vq->last_used_idx) * 3 / 4;
> > + else
> > + bufs = (vq->next_avail_idx - vq->last_used_idx) * 3 / 4;
>
> vq->next_avail-idx could be equal to vq->last_usd_idx when the ring is full.
> Though virito-net is the only user now and it can guarantee this won't
> happen. But consider this is a core API, we should make sure it can work for
> any cases.
>
> It looks to me that bufs is just vq->vring_packed.num - vq->num_free?
I'll try it! Thanks for the suggestion! :)
>
> > +
> > + wrap_counter = vq->used_wrap_counter;
> > +
> > + used_idx = vq->last_used_idx + bufs;
> > + if (used_idx >= vq->vring_packed.num) {
> > + used_idx -= vq->vring_packed.num;
> > + wrap_counter ^= 1;
> > + }
> > +
> > + vq->vring_packed.driver->off_wrap = cpu_to_virtio16(_vq->vdev,
> > + used_idx | (wrap_counter << 15));
> > if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
> > - vq->event_flags_shadow = VRING_EVENT_F_ENABLE;
> > + /* We need to update event offset and event wrap
> > + * counter first before updating event flags. */
> > + virtio_wmb(vq->weak_barriers);
> > + vq->event_flags_shadow = vq->event ? VRING_EVENT_F_DESC :
> > + VRING_EVENT_F_ENABLE;
> > vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
> > vq->event_flags_shadow);
> > - /* We need to enable interrupts first before re-checking
> > - * for more used buffers. */
> > - virtio_mb(vq->weak_barriers);
> > }
> > + /* We need to update event suppression structure first
> > + * before re-checking for more used buffers. */
> > + virtio_mb(vq->weak_barriers);
> > +
> > if (more_used_packed(vq)) {
> > END_USE(vq);
> > return false;
>
> I think what we need to to make sure the descriptor used_idx is used?
> Otherwise we may stop and restart qdisc too frequently?
Good catch! Yeah. It's not the best choice to check the
descriptor at last_used_idx. I'll try to fix this!
Best regards,
Tiwei Bie
>
> Thanks
>
> > --
>
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH v4 1/3] compiler-gcc.h: add gnu_inline to all inline declarations
From: Arnd Bergmann @ 2018-06-08 7:59 UTC (permalink / raw)
To: Nick Desaulniers
Cc: Kate Stewart, linux-efi, Brijesh Singh, Jan Kiszka,
Josh Poimboeuf, Will Deacon, Jarkko Sakkinen, virtualization,
Masahiro Yamada, Manoj Gupta, H. Peter Anvin, Boris Ostrovsky,
Thiebaud Weksteen, mawilcox, the arch/x86 maintainers, akataria,
Greg Hackmann, Ingo Molnar, astrachan, David Rientjes,
Geert Uytterhoeven, Lendacky, Thomas, Linux Kbuild mailing list
In-Reply-To: <20180607204927.219329-2-ndesaulniers@google.com>
On Thu, Jun 7, 2018 at 10:49 PM, Nick Desaulniers
<ndesaulniers@google.com> wrote:
> Functions marked extern inline do not emit an externally visible
> function when the gnu89 C standard is used. Some KBUILD Makefiles
> overwrite KBUILD_CFLAGS. This is an issue for GCC 5.1+ users as without
> an explicit C standard specified, the default is gnu11. Since c99, the
> semantics of extern inline have changed such that an externally visible
> function is always emitted. This can lead to multiple definition errors
> of extern inline functions at link time of compilation units whose build
> files have removed an explicit C standard compiler flag for users of GCC
> 5.1+ or Clang.
>
> Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
> Suggested-by: H. Peter Anvin <hpa@zytor.com>
> Suggested-by: Joe Perches <joe@perches.com>
I suspect this will break Geert's gcc-4.1.2, which I think doesn't have that
attribute yet (4.1.3 or higher have it according to the documentation.
It wouldn't be hard to work around that if we want to keep that version
working, or we could decide that it's time to officially stop supporting
that version, but we should probably decide on one or the other.
Arnd
^ permalink raw reply
* Re: [RFC V2] virtio: Add platform specific DMA API translation for virito devices
From: Christoph Hellwig @ 2018-06-08 6:36 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: robh, pawel.moll, Tom Lendacky, cohuck, Ram Pai, linux-kernel,
virtualization, Christoph Hellwig, joe, Rustad, Mark D, david,
linuxppc-dev, elfring, Anshuman Khandual
In-Reply-To: <20180607185234-mutt-send-email-mst@kernel.org>
On Thu, Jun 07, 2018 at 07:28:35PM +0300, Michael S. Tsirkin wrote:
> Let me restate it: DMA API has support for a wide range of hardware, and
> hardware based virtio implementations likely won't benefit from all of
> it.
That is completely wrong. All aspects of the DMA API are about the
system they are used in. The CPU, the PCIe root complex, interconnects.
All the issues I mentioned in my previous mail exist in realy life
systems that you can plug virtio PCI or PCIe cards into.
> I'm not really sympathetic to people complaining that they can't even
> set a flag in qemu though. If that's the case the stack in question is
> way too inflexible.
The flag as defined in the spec is the wrong thing to set, because they
are not using an iommu. They probably don't even do any address
translation.
> > Both in the flag naming and the implementation there is an implication
> > of DMA API == IOMMU, which is fundamentally wrong.
>
> Maybe we need to extend the meaning of PLATFORM_IOMMU or rename it.
And the explanation.
>
> It's possible that some setups will benefit from a more
> fine-grained approach where some aspects of the DMA
> API are bypassed, others aren't.
Hell no. DMA API = abstraction for any possble platform wart.
We are not going to make this any more fine grained. It is bad
enough that virtio already has a mode bypassing any of this,
we are not going to make even more of a mess of it.
> This seems to be what was being asked for in this thread,
> with comments claiming IOMMU flag adds too much overhead.
Right now it means implementing a virtual iommu, which I agree is
way too much overhead.
>
> > The DMA API does a few different things:
> >
> > a) address translation
> >
> > This does include IOMMUs. But it also includes random offsets
> > between PCI bars and system memory that we see on various
> > platforms.
>
> I don't think you mean bars. That's unrelated to DMA.
Of course it matters. If the device always needs an offset in
the DMA addresses it is completely related to DMA.
For some examples take a look at:
arch/x86/pci/sta2x11-fixup.c
arch/mips/include/asm/mach-ath25/dma-coherence.h
or anything setting dma_pfn_offset.
> > Worse so some of these offsets might be based on
> > banks, e.g. on the broadcom bmips platform. It also deals
> > with bitmask in physical addresses related to memory encryption
> > like AMD SEV. I'd be really curious how for example the
> > Intel virtio based NIC is going to work on any of those
> > plaforms.
>
> SEV guys report that they just set the iommu flag and then it all works.
> I guess if there's translation we can think of this as a kind of iommu.
> Maybe we should rename PLATFORM_IOMMU to PLARTFORM_TRANSLATION?
VIRTIO_F_BEHAVES_LIKE_A_REAL_PCI_DEVICE_DONT_TRY_TO_OUTSMART_ME
as said it's not just translations, it is cache coherence as well.
> And apparently some people complain that just setting that flag makes
> qemu check translation on each access with an unacceptable performance
> overhead. Forcing same behaviour for everyone on general principles
> even without the flag is unlikely to make them happy.
That sounds like a qemu implementation bug. If qemu knowns that
guest physiscall == guest dma space there is no need to check.
> > b) coherency
> >
> > On many architectures DMA is not cache coherent, and we need
> > to invalidate and/or write back cache lines before doing
> > DMA. Again, I wonder how this is every going to work with
> > hardware based virtio implementations.
>
>
> You mean dma_Xmb and friends?
> There's a new feature VIRTIO_F_IO_BARRIER that's being proposed
> for that.
No. I mean the fact that PCI(e) devices often are not coherent with
the cache. So you need to writeback the cpu cache before transferring
data to the device, and invalidate the cpu cache before transferring
data from the device. Plus additional workarounds for speculation.
Looks at the implementations and comments around the dma_sync_*
calls.
>
> > Even worse I think this
> > is actually broken at least for VIVT event for virtualized
> > implementations. E.g. a KVM guest is going to access memory
> > using different virtual addresses than qemu, vhost might throw
> > in another different address space.
>
> I don't really know what VIVT is. Could you help me please?
Virtually indexed, virtually tagged. In short you must do cache
maintainance based on the virtual address used to fill the cache.
>
> > c) bounce buffering
> >
> > Many DMA implementations can not address all physical memory
> > due to addressing limitations. In such cases we copy the
> > DMA memory into a known addressable bounc buffer and DMA
> > from there.
>
> Don't do it then?
Because for example your PCIe root complex only supports 32-bit
addressing, but the memory buffer is outside the addressing range.
> > d) flushing write combining buffers or similar
> >
> > On some hardware platforms we need workarounds to e.g. read
> > from a certain mmio address to make sure DMA can actually
> > see memory written by the host.
>
> I guess it isn't an issue as long as WC isn't actually used.
> It will become an issue when virtio spec adds some WC capability -
> I suspect we can ignore this for now.
This is write combibining in the SOC with the root complex. Nothing
your can work around in the device or device driver.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox