Linux virtualization list

Linux virtualization list
 help / color / mirror / Atom feed

* Re: [PATCH v2 0/2] vhost: fix vhost_vq_access_ok() log check
From: Stefan Hajnoczi @ 2018-04-11  1:21 UTC (permalink / raw)
  To: David Miller
  Cc: kvm, mst, netdev, syzkaller-bugs, linux-kernel, virtualization,
	torvalds
In-Reply-To: <20180410.105043.1798364865868187298.davem@davemloft.net>


[-- Attachment #1.1: Type: text/plain, Size: 1182 bytes --]

On Tue, Apr 10, 2018 at 10:50:43AM -0400, David Miller wrote:
> From: Jason Wang <jasowang@redhat.com>
> Date: Tue, 10 Apr 2018 14:40:10 +0800
> 
> > On 2018年04月10日 13:26, Stefan Hajnoczi wrote:
> >> v2:
> >>   * Rewrote the conditional to make the vq access check clearer [Linus]
> >>   * Added Patch 2 to make the return type consistent and harder to misuse
> >>   * [Linus]
> >>
> >> The first patch fixes the vhost virtqueue access check which was
> >> recently
> >> broken.  The second patch replaces the int return type with bool to
> >> prevent
> >> future bugs.
> >>
> >> Stefan Hajnoczi (2):
> >>    vhost: fix vhost_vq_access_ok() log check
> >>    vhost: return bool from *_access_ok() functions
> >>
> >>   drivers/vhost/vhost.h |  4 +--
> >>   drivers/vhost/vhost.c | 70
> >>   ++++++++++++++++++++++++++-------------------------
> >>   2 files changed, 38 insertions(+), 36 deletions(-)
> >>
> > 
> > Acked-by: Jason Wang <jasowang@redhat.com>
> 
> This sereis doesn't apply cleanly to the net tree, please respin
> and add the appropriate Acks and Fixes: tags that Michael and Jason
> have provided.

Sorry, will fix.

Stefan

[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

[-- Attachment #2: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework
From: Michael S. Tsirkin @ 2018-04-11  1:21 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: alexander.h.duyck, virtio-dev, jiri, kubakici, Sridhar Samudrala,
	virtualization, loseweigh, netdev, davem
In-Reply-To: <20180410142608.50f15b45@xeon-e3>

On Tue, Apr 10, 2018 at 02:26:08PM -0700, Stephen Hemminger wrote:
> On Tue, 10 Apr 2018 11:59:50 -0700
> Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:
> 
> > Use the registration/notification framework supported by the generic
> > bypass infrastructure.
> > 
> > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> > ---
> 
> Thanks for doing this.  Your current version has couple show stopper
> issues.
> 
> First, the slave device is instantly taking over the slave.
> This doesn't allow udev/systemd to do its device rename of the slave
> device. Netvsc uses a delayed work to workaround this.
> 
> Secondly, the select queue needs to call queue selection in VF.
> The bonding/teaming logic doesn't work well for UDP flows.
> Commit b3bf5666a510 ("hv_netvsc: defer queue selection to VF")
> fixed this performance problem.
> 
> Lastly, more indirection is bad in current climate.

Well right now netvsc does an indirect call to the PT device,
does it not? If you really want max performance when PT
is in use you need to do the reverse and have PT forward to netvsc.

> I am not completely adverse to this but it needs to be fast, simple
> and completely transparent.

^ permalink raw reply

* Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework
From: Stephen Hemminger @ 2018-04-10 23:59 UTC (permalink / raw)
  To: Siwei Liu
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Michael S. Tsirkin,
	Jakub Kicinski, Sridhar Samudrala, virtualization, Netdev,
	David Miller
In-Reply-To: <CADGSJ22rVsC0TDTd6OKVnwbx0ExoQ8xWXBMumKB-OFH4sX=yaQ@mail.gmail.com>

On Tue, 10 Apr 2018 16:44:47 -0700
Siwei Liu <loseweigh@gmail.com> wrote:

> On Tue, Apr 10, 2018 at 4:28 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Tue, Apr 10, 2018 at 02:26:08PM -0700, Stephen Hemminger wrote:  
> >> On Tue, 10 Apr 2018 11:59:50 -0700
> >> Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:
> >>  
> >> > Use the registration/notification framework supported by the generic
> >> > bypass infrastructure.
> >> >
> >> > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> >> > ---  
> >>
> >> Thanks for doing this.  Your current version has couple show stopper
> >> issues.
> >>
> >> First, the slave device is instantly taking over the slave.
> >> This doesn't allow udev/systemd to do its device rename of the slave
> >> device. Netvsc uses a delayed work to workaround this.  
> >
> > Interesting. Does this mean udev must act within a specific time window
> > then?  
> 
> Sighs, lots of hacks. Why propgating this from driver to a common
> module. We really need a clean solution.
> 

I had a patch to wait for udev to do the rename and go from there
but davem rejected it.

^ permalink raw reply

* Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework
From: Siwei Liu @ 2018-04-10 23:44 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Jakub Kicinski,
	Sridhar Samudrala, virtualization, Netdev, David Miller
In-Reply-To: <20180411022807-mutt-send-email-mst@kernel.org>

On Tue, Apr 10, 2018 at 4:28 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Tue, Apr 10, 2018 at 02:26:08PM -0700, Stephen Hemminger wrote:
>> On Tue, 10 Apr 2018 11:59:50 -0700
>> Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:
>>
>> > Use the registration/notification framework supported by the generic
>> > bypass infrastructure.
>> >
>> > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> > ---
>>
>> Thanks for doing this.  Your current version has couple show stopper
>> issues.
>>
>> First, the slave device is instantly taking over the slave.
>> This doesn't allow udev/systemd to do its device rename of the slave
>> device. Netvsc uses a delayed work to workaround this.
>
> Interesting. Does this mean udev must act within a specific time window
> then?

Sighs, lots of hacks. Why propgating this from driver to a common
module. We really need a clean solution.

-Siwei


>
>> Secondly, the select queue needs to call queue selection in VF.
>> The bonding/teaming logic doesn't work well for UDP flows.
>> Commit b3bf5666a510 ("hv_netvsc: defer queue selection to VF")
>> fixed this performance problem.
>>
>> Lastly, more indirection is bad in current climate.
>>
>> I am not completely adverse to this but it needs to be fast, simple
>> and completely transparent.

^ permalink raw reply

* Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework
From: Michael S. Tsirkin @ 2018-04-10 23:28 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: alexander.h.duyck, virtio-dev, jiri, kubakici, Sridhar Samudrala,
	virtualization, loseweigh, netdev, davem
In-Reply-To: <20180410142608.50f15b45@xeon-e3>

On Tue, Apr 10, 2018 at 02:26:08PM -0700, Stephen Hemminger wrote:
> On Tue, 10 Apr 2018 11:59:50 -0700
> Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:
> 
> > Use the registration/notification framework supported by the generic
> > bypass infrastructure.
> > 
> > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> > ---
> 
> Thanks for doing this.  Your current version has couple show stopper
> issues.
> 
> First, the slave device is instantly taking over the slave.
> This doesn't allow udev/systemd to do its device rename of the slave
> device. Netvsc uses a delayed work to workaround this.

Interesting. Does this mean udev must act within a specific time window
then?

> Secondly, the select queue needs to call queue selection in VF.
> The bonding/teaming logic doesn't work well for UDP flows.
> Commit b3bf5666a510 ("hv_netvsc: defer queue selection to VF")
> fixed this performance problem.
> 
> Lastly, more indirection is bad in current climate.
> 
> I am not completely adverse to this but it needs to be fast, simple
> and completely transparent.

^ permalink raw reply

* Re: [PATCH v29 1/4] mm: support reporting free page blocks
From: Michael S. Tsirkin @ 2018-04-10 23:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: yang.zhang.wz, virtio-dev, riel, quan.xu0, kvm, nilal,
	liliang.opensource, linux-kernel, virtualization, linux-mm,
	huangzhichao, pbonzini, mhocko
In-Reply-To: <20180410135429.d1aeeb91d7f2754ffe7fb80e@linux-foundation.org>

On Tue, Apr 10, 2018 at 01:54:29PM -0700, Andrew Morton wrote:
> On Tue, 10 Apr 2018 21:19:31 +0300 "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > 
> > Andrew, were your questions answered? If yes could I bother you for an ack on this?
> > 
> 
> Still not very happy that readers are told that "this function may
> sleep" when it clearly doesn't do so.  If we wish to be able to change
> it to sleep in the future then that should be mentioned.  And even put a
> might_sleep() in there, to catch people who didn't read the comments...
> 
> Otherwise it looks OK.

Oh, might_sleep with a comment explaining it's for the future sounds
good to me. I queued this - Wei, could you post a patch on top pls?

-- 
MST

^ permalink raw reply

* Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework
From: Samudrala, Sridhar @ 2018-04-10 22:56 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: alexander.h.duyck, virtio-dev, jiri, mst, kubakici, netdev,
	virtualization, loseweigh, davem
In-Reply-To: <20180410142608.50f15b45@xeon-e3>

On 4/10/2018 2:26 PM, Stephen Hemminger wrote:
> On Tue, 10 Apr 2018 11:59:50 -0700
> Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:
>
>> Use the registration/notification framework supported by the generic
>> bypass infrastructure.
>>
>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> ---
> Thanks for doing this.  Your current version has couple show stopper
> issues.
>
> First, the slave device is instantly taking over the slave.
> This doesn't allow udev/systemd to do its device rename of the slave
> device. Netvsc uses a delayed work to workaround this.

OK. I guess you are referring to the dev_set_mtu() and dev_open() calls that are
made in bypass_slave_register() and you want to defer them to be done after
a delay.  I could avoid these calls in case of netvsc based on bypass_ops.

>
> Secondly, the select queue needs to call queue selection in VF.
> The bonding/teaming logic doesn't work well for UDP flows.
> Commit b3bf5666a510 ("hv_netvsc: defer queue selection to VF")
> fixed this performance problem.

netvsc should not be using bypass_select_queue() as  that ndo op gets used
only with 3-netdev model.
Anyway, will look into updating bypass_select_queue() based on your fix.

>
> Lastly, more indirection is bad in current climate.
>
> I am not completely adverse to this but it needs to be fast, simple
> and completely transparent.

Not sure we can avoid this indirection if we want to commonize the code,  but use
different models for virtio-net and netvsc.

On the other hand, these patches avoid calls to get_netvsc_bymac() and
get_netvsc_by_ref() that go through all the devices for all the netdev events.
netvsc lookups should be much faster.

Thanks
Sridhar

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework
From: Stephen Hemminger @ 2018-04-10 21:26 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: alexander.h.duyck, virtio-dev, jiri, mst, kubakici, netdev,
	virtualization, loseweigh, davem
In-Reply-To: <1523386790-12396-5-git-send-email-sridhar.samudrala@intel.com>

On Tue, 10 Apr 2018 11:59:50 -0700
Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:

> Use the registration/notification framework supported by the generic
> bypass infrastructure.
> 
> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> ---

Thanks for doing this.  Your current version has couple show stopper
issues.

First, the slave device is instantly taking over the slave.
This doesn't allow udev/systemd to do its device rename of the slave
device. Netvsc uses a delayed work to workaround this.

Secondly, the select queue needs to call queue selection in VF.
The bonding/teaming logic doesn't work well for UDP flows.
Commit b3bf5666a510 ("hv_netvsc: defer queue selection to VF")
fixed this performance problem.

Lastly, more indirection is bad in current climate.

I am not completely adverse to this but it needs to be fast, simple
and completely transparent.

^ permalink raw reply

* Re: [PATCH v29 1/4] mm: support reporting free page blocks
From: Andrew Morton @ 2018-04-10 20:54 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: yang.zhang.wz, virtio-dev, riel, quan.xu0, kvm, nilal,
	liliang.opensource, linux-kernel, virtualization, linux-mm,
	huangzhichao, pbonzini, mhocko
In-Reply-To: <20180410211719-mutt-send-email-mst@kernel.org>

On Tue, 10 Apr 2018 21:19:31 +0300 "Michael S. Tsirkin" <mst@redhat.com> wrote:

> 
> Andrew, were your questions answered? If yes could I bother you for an ack on this?
> 

Still not very happy that readers are told that "this function may
sleep" when it clearly doesn't do so.  If we wish to be able to change
it to sleep in the future then that should be mentioned.  And even put a
might_sleep() in there, to catch people who didn't read the comments...

Otherwise it looks OK.

^ permalink raw reply

* Re: [PATCH v29 1/4] mm: support reporting free page blocks
From: Michael S. Tsirkin @ 2018-04-10 18:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: yang.zhang.wz, virtio-dev, riel, quan.xu0, kvm, nilal,
	liliang.opensource, linux-kernel, virtualization, linux-mm,
	huangzhichao, pbonzini, mhocko
In-Reply-To: <20180326142254.c4129c3a54ade686ee2a5e21@linux-foundation.org>

On Mon, Mar 26, 2018 at 02:22:54PM -0700, Andrew Morton wrote:
> On Mon, 26 Mar 2018 10:39:51 +0800 Wei Wang <wei.w.wang@intel.com> wrote:
> 
> > This patch adds support to walk through the free page blocks in the
> > system and report them via a callback function. Some page blocks may
> > leave the free list after zone->lock is released, so it is the caller's
> > responsibility to either detect or prevent the use of such pages.
> > 
> > One use example of this patch is to accelerate live migration by skipping
> > the transfer of free pages reported from the guest. A popular method used
> > by the hypervisor to track which part of memory is written during live
> > migration is to write-protect all the guest memory. So, those pages that
> > are reported as free pages but are written after the report function
> > returns will be captured by the hypervisor, and they will be added to the
> > next round of memory transfer.
> > 
> > ...
> >
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4912,6 +4912,102 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
> >  	show_swap_cache_info();
> >  }
> >  
> > +/*
> > + * Walk through a free page list and report the found pfn range via the
> > + * callback.
> > + *
> > + * Return 0 if it completes the reporting. Otherwise, return the non-zero
> > + * value returned from the callback.
> > + */
> > +static int walk_free_page_list(void *opaque,
> > +			       struct zone *zone,
> > +			       int order,
> > +			       enum migratetype mt,
> > +			       int (*report_pfn_range)(void *,
> > +						       unsigned long,
> > +						       unsigned long))
> > +{
> > +	struct page *page;
> > +	struct list_head *list;
> > +	unsigned long pfn, flags;
> > +	int ret = 0;
> > +
> > +	spin_lock_irqsave(&zone->lock, flags);
> > +	list = &zone->free_area[order].free_list[mt];
> > +	list_for_each_entry(page, list, lru) {
> > +		pfn = page_to_pfn(page);
> > +		ret = report_pfn_range(opaque, pfn, 1 << order);
> > +		if (ret)
> > +			break;
> > +	}
> > +	spin_unlock_irqrestore(&zone->lock, flags);
> > +
> > +	return ret;
> > +}
> > +
> > +/**
> > + * walk_free_mem_block - Walk through the free page blocks in the system
> > + * @opaque: the context passed from the caller
> > + * @min_order: the minimum order of free lists to check
> > + * @report_pfn_range: the callback to report the pfn range of the free pages
> > + *
> > + * If the callback returns a non-zero value, stop iterating the list of free
> > + * page blocks. Otherwise, continue to report.
> > + *
> > + * Please note that there are no locking guarantees for the callback and
> > + * that the reported pfn range might be freed or disappear after the
> > + * callback returns so the caller has to be very careful how it is used.
> > + *
> > + * The callback itself must not sleep or perform any operations which would
> > + * require any memory allocations directly (not even GFP_NOWAIT/GFP_ATOMIC)
> > + * or via any lock dependency. It is generally advisable to implement
> > + * the callback as simple as possible and defer any heavy lifting to a
> > + * different context.
> > + *
> > + * There is no guarantee that each free range will be reported only once
> > + * during one walk_free_mem_block invocation.
> > + *
> > + * pfn_to_page on the given range is strongly discouraged and if there is
> > + * an absolute need for that make sure to contact MM people to discuss
> > + * potential problems.
> > + *
> > + * The function itself might sleep so it cannot be called from atomic
> > + * contexts.
> 
> I don't see how walk_free_mem_block() can sleep.
> 
> > + * In general low orders tend to be very volatile and so it makes more
> > + * sense to query larger ones first for various optimizations which like
> > + * ballooning etc... This will reduce the overhead as well.
> > + *
> > + * Return 0 if it completes the reporting. Otherwise, return the non-zero
> > + * value returned from the callback.
> > + */
> > +int walk_free_mem_block(void *opaque,
> > +			int min_order,
> > +			int (*report_pfn_range)(void *opaque,
> > +			unsigned long pfn,
> > +			unsigned long num))
> > +{
> > +	struct zone *zone;
> > +	int order;
> > +	enum migratetype mt;
> > +	int ret;
> > +
> > +	for_each_populated_zone(zone) {
> > +		for (order = MAX_ORDER - 1; order >= min_order; order--) {
> > +			for (mt = 0; mt < MIGRATE_TYPES; mt++) {
> > +				ret = walk_free_page_list(opaque, zone,
> > +							  order, mt,
> > +							  report_pfn_range);
> > +				if (ret)
> > +					return ret;
> > +			}
> > +		}
> > +	}
> > +
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(walk_free_mem_block);
> 
> This looks like it could take a long time.  Will we end up needing to
> add cond_resched() in there somewhere?

Andrew, were your questions answered? If yes could I bother you for an ack on this?


> >  static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
> >  {
> >  	zoneref->zone = zone;
> > -- 
> > 2.7.4

^ permalink raw reply

* Re: [virtio-dev] [RFC] virtio-iommu version 0.6
From: Jean-Philippe Brucker @ 2018-04-10 16:18 UTC (permalink / raw)
  To: Tian, Kevin, virtio-dev@lists.oasis-open.org,
	virtualization@lists.linux-foundation.org
  Cc: eric.auger@redhat.com, eric.auger.pro@gmail.com
In-Reply-To: <AADFC41AFE54684AB9EE6CBC0274A5D19108DBDD@SHSMSX101.ccr.corp.intel.com>

On 22/03/18 09:44, Tian, Kevin wrote:
>> From: Jean-Philippe Brucker [mailto:jean-philippe.brucker@arm.com]
>> Sent: Wednesday, March 21, 2018 9:14 PM
>>
>> Hi Kevin,
>>
>> Thanks for the comments
>>
>> On 19/03/18 10:03, Tian, Kevin wrote:
>>> BYPASS feature bit is not covered in "2.3.1/2.3.2/2.3.3"". Is it
>>> intended?
>>
>> In my opinion BYPASS is a bit different from the other features: while the
>> others are needed for correctness, this one is optional and even if the
>> guest supports BYPASS, it should be allowed not to accept it. For security
>> reasons it may not want to let endpoints access the whole guest-physical
>> address space.
> 
> ok, possibly because I'm not familiar with virtio spec convention.
> My original feeling is that each feature bit will have a behavior
> description regarding to whether device reports and whether 
> driver accepts. If no need to cover optional feature, then it's fine
> to me. :-)

I think the virtio spec allows the reader to choose how to implement any
behavior that isn't explicitly described. Whenever a SHOULD or MUST
isn't stated in a requirement section, the spec implies "feature X MAY
be enabled if offered".

Looking at the virtio-net specification, optional features such as
VIRTIO_NET_F_VLAN aren't mentioned in driver requirements. On the other
hand, features that the device needs in order to function properly are
explicitly stated, for example VIRTIO_NET_F_MAC.

>> [...]
>>> Then comes a question for VIRTIO_IOMMU_RESV_MEM_T_MSI.
>>> I know there were quite some discussion around this flag before,
>>> but my mental picture now still is a bit difficult to understand its
>>> usage based on examples in implementation notes:
>>>
>>> 	- for x86, you describe it as indicating MSI bypass for
>>> oxfeexxxxx. However guest doesn't need to know this fact. Only
>>> requirement is to treat it as reserved range (as on bare metal)
>>> then T_RESERVED is sufficient for this purpose>
>>> 	- for ARM, either let guest or let host to choose a virtual
>>> address for mapping to MSI doorbell. the former doesn't require
>>> a reserved range. for the latter also T_RESERVED is enough as
>>> the example in hardware device assignment section.
>>
>> It might be nicer to have the host decide it, but when the physical IOMMU
>> is ARM SMMU, nesting translation complicates things, because the guest
>> *has* to create a mapping:
> 
> confirm one thing first. v0.6 doesn't support binding to guest IOVA
> page table yet. So based on current map/unmap interface, there is 
> no such complicity right? stage-1 is just a shadowed translation (IOVA
> ->HPA) to guest side (IOVA->GPA) with nesting disabled. In that case
> the default behavior is host-reserved style.

Yes, with v0.6 the host chooses whether to translate or bypass the MSIs.

> Then comes nested scenario:
> 
>>
>> * The guest is in charge of stage-1. It creates IOVA->GPA mapping for the
>>   MSI doorbell. The GPA is mapped to the physical MSI doorbell at
>>   stage-2 by the host.
> 
> Is it a must that above GPA is mapped to physical MSI doorbell?
> 
> Ideally:
> 
> 1) Host reserves some IOVA range for mapping MSI doorbell
> 2) the range is reported to user space
> 3) Qemu reported the range as reserved on endpoints, marked
> as T_IDENTITY (a new type to be introduced), meaning guest
> needs setup identity mapping in stage-1

I think it works, and is saner than my idea. I had one concern (thought
I had more but can't find any in my notes; they might resurface when
prototyping): the MSI doorbell has to be mapped with device attributes
(MMIO) to ensure proper interrupt delivery. I think it's fine because
the host maps the doorbell with the right attributes at stage-2, which
will take precedence over what the guest uses for stage-1 according to
the Arm architecture.

So I don't think T_IDENTITY requires arch-specific attributes for now
but there should definitely be space for extension. I also noticed that
Intel and AMD drivers in Linux don't add any special attribute or
protection to identity mappings (RESV_DIRECT in Linux), apart from
READ/WRITE.

For reference the previous discussion about T_IDENTITY is here:
https://www.spinics.net/lists/kvm/msg155240.html

One problem with my idea was that if the device is behind multiple IRQ
chips, the host cannot know which IRQ chip the IOVA corresponds to, when
reading the MSI-X tables. Then again I don't know why multiple IRQ chips
per device would be desirable, but it's allowed.

> 4) Then device should be able to ring physical MSI doorbell
> 5) I'm not sure whether guest still needs to allocate its own
> IOVA and mapping to vGIC doorbell in such case...

I don't think it matters, because the host MSI infrastructure is
decoupled from the guest's. The host programs the physical MSI-X tables
and only needs the guest to create a stage-1 mappings for those, which
is done with T_IDENTITY. The guest doesn't know what the T_IDENTITY
mapping is for.

Then QEMU can choose whether to map or bypass the vGIC MSI doorbell.
With MSI bypass, guest writes a GPA into the virtual MSI-X table,
and QEMU sets up the IRQ routing.

With mapped MSI
- The guest creates a dangling stage-1 mapping for the vGIC doorbell.
  The vGIC doorbell isn't mapped at stage-2, so accessing it from the
  IOVA would fault. It doesn't matter because the host programs a valid
  value into the physical MSI-X table.
- The guest writes an IOVA into the virtual MSI-X table. The host can't
  know which doorbell it corresponds to, because it doesn't walk pIOMMU
  page tables. It has to guess.

So I advise to use bypass. Mapped MSI prevents from having multiple IRQ
chips for one device. Although we probably don't care about this, we
should aim to avoid any guessing in the host.

>> [...]
>> Another way of choosing would be with #ifdef CONFIG_ARM64,
>> CONFIG_X86 etc,
>> but I find it nasty, and I personally prefer using MSI bypass for ARM when
>> possible.
> 
> however from current v0.6 examples, BYPASS is only listed for x86
> case. ARM usage is the missing piece making me confused. 

Are you referring to "3.3 - Hardware device assignment" or "3.2 Message
Signaled Interrupts"?

3.2.2 gives "ARM-based platforms" as an example for MSI address
translation because all physical Arm platforms (except one that has to
work around an erratum) use translated MSIs, and I don't think others do
this. Bypass isn't possible because the GIC interface can be anywhere in
the physical address space, and therefore the PCI RC or the SMMU cannot
distinguish MSIs from other memory accesses.

The virtual platform is a bit different, in that it doesn't need to
perform DMA writes for MSIs. Instead it can create eventfd routes
beforehand, maybe making it easier to support bypass. That is for vhost
and vfio, but I think for userspace devices QEMU performs a normal DMA
write that goes through the IOMMU followed by the GIC.

Thanks,
Jean

^ permalink raw reply

* Re: [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available
From: Samudrala, Sridhar @ 2018-04-10 15:59 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: alexander.h.duyck, virtio-dev, mst, kubakici, netdev,
	virtualization, loseweigh, davem
In-Reply-To: <20180410154304.GG2028@nanopsycho>

On 4/10/2018 8:43 AM, Jiri Pirko wrote:
> Tue, Apr 10, 2018 at 05:27:48PM CEST, sridhar.samudrala@intel.com wrote:
>> On 4/10/2018 8:22 AM, Jiri Pirko wrote:
>>> Tue, Apr 10, 2018 at 05:13:40PM CEST, sridhar.samudrala@intel.com wrote:
>>>> On 4/10/2018 3:55 AM, Jiri Pirko wrote:
>>>>> Mon, Apr 09, 2018 at 08:47:06PM CEST, sridhar.samudrala@intel.com wrote:
>>>>>> On 4/9/2018 1:07 AM, Jiri Pirko wrote:
>>>>>>> Sat, Apr 07, 2018 at 12:59:14AM CEST, sridhar.samudrala@intel.com wrote:
>>>>>>>> On 4/6/2018 5:48 AM, Jiri Pirko wrote:
>>>>>>>>> Thu, Apr 05, 2018 at 11:08:22PM CEST, sridhar.samudrala@intel.com wrote:
>>>>>>> [...]
>>>>>>>
>>>>>>>>>> +static int virtnet_bypass_join_child(struct net_device *bypass_netdev,
>>>>>>>>>> +				     struct net_device *child_netdev)
>>>>>>>>>> +{
>>>>>>>>>> +	struct virtnet_bypass_info *vbi;
>>>>>>>>>> +	bool backup;
>>>>>>>>>> +
>>>>>>>>>> +	vbi = netdev_priv(bypass_netdev);
>>>>>>>>>> +	backup = (child_netdev->dev.parent == bypass_netdev->dev.parent);
>>>>>>>>>> +	if (backup ? rtnl_dereference(vbi->backup_netdev) :
>>>>>>>>>> +			rtnl_dereference(vbi->active_netdev)) {
>>>>>>>>>> +		netdev_info(bypass_netdev,
>>>>>>>>>> +			    "%s attempting to join bypass dev when %s already present\n",
>>>>>>>>>> +			    child_netdev->name, backup ? "backup" : "active");
>>>>>>>>> Bypass module should check if there is already some other netdev
>>>>>>>>> enslaved and refuse right there.
>>>>>>>> This will work for virtio-net with 3 netdev model, but this check has to be done by netvsc
>>>>>>>> as its bypass_netdev is same as the backup_netdev.
>>>>>>>> Will add a flag while registering with the bypass module to indicate if the driver is doing
>>>>>>>> a 2 netdev or 3 netdev model and based on that flag this check can be done in bypass module
>>>>>>>> for 3 netdev scenario.
>>>>>>> Just let me undestand it clearly. What I expect the difference would be
>>>>>>> between 2netdev and3 netdev model is this:
>>>>>>> 2netdev:
>>>>>>>       bypass_master
>>>>>>>          /
>>>>>>>         /
>>>>>>> VF_slave
>>>>>>>
>>>>>>> 3netdev:
>>>>>>>       bypass_master
>>>>>>>          /     \
>>>>>>>         /       \
>>>>>>> VF_slave   backup_slave
>>>>>>>
>>>>>>> Is that correct? If not, how does it look like?
>>>>>>>
>>>>>>>
>>>>>> Looks correct.
>>>>>> VF_slave and backup_slave are the original netdevs and are present in both the models.
>>>>>> In the 3 netdev model, bypass_master netdev is created and VF_slave and backup_slave are
>>>>>> marked as the 2 slaves of this new netdev.
>>>>> You say it looks correct and in another sentence you provide completely
>>>>> different description. Could you please look again?
>>>>>
>>>> To be exact, 2 netdev model with netvsc looks like this.
>>>>
>>>>      netvsc_netdev
>>>>        /
>>>>       /
>>>> VF_slave
>>>>
>>>> With virtio_net, 3 netdev model
>>>>
>>>>    bypass_netdev
>>>>        /     \
>>>>       /       \
>>>> VF_slave   virtio_net netdev
>>> Could you also mark the original netdev which is there now? is it
>>> bypass_netdev or virtio_net_netdev ?
>> bypass_netdev
>>      /     \
>>     /       \
>> VF_slave   virtio_net netdev (original)
> That does not make sense.
> 1) You diverge from the behaviour of the netvsc, where the original
>     netdev is a master of the VF
> 2) If the original netdev is a slave, you cannot have any IP address
>     configured on it (well you could, but the rx_handler would eat every
>     incoming packet). So you will break the user bacause he would have to
>     move the configuration to the new master device.
> This only makes sense that the original netdev becomes the master for both
> netvsc and virtio_net.
Forgot to mention that bypass_netdev takes over the name of the original 
netdev and
virtio_net netdev will get the backup name.
So the userspace network configuration doesn't need to change.

^ permalink raw reply

* Re: [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available
From: Siwei Liu @ 2018-04-10 15:55 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Alexander Duyck, virtio-dev, Michael S. Tsirkin, Jakub Kicinski,
	Samudrala, Sridhar, virtualization, Netdev, David Miller
In-Reply-To: <20180410154304.GG2028@nanopsycho>

On Tue, Apr 10, 2018 at 8:43 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> Tue, Apr 10, 2018 at 05:27:48PM CEST, sridhar.samudrala@intel.com wrote:
>>On 4/10/2018 8:22 AM, Jiri Pirko wrote:
>>> Tue, Apr 10, 2018 at 05:13:40PM CEST, sridhar.samudrala@intel.com wrote:
>>> > On 4/10/2018 3:55 AM, Jiri Pirko wrote:
>>> > > Mon, Apr 09, 2018 at 08:47:06PM CEST, sridhar.samudrala@intel.com wrote:
>>> > > > On 4/9/2018 1:07 AM, Jiri Pirko wrote:
>>> > > > > Sat, Apr 07, 2018 at 12:59:14AM CEST, sridhar.samudrala@intel.com wrote:
>>> > > > > > On 4/6/2018 5:48 AM, Jiri Pirko wrote:
>>> > > > > > > Thu, Apr 05, 2018 at 11:08:22PM CEST, sridhar.samudrala@intel.com wrote:
>>> > > > > [...]
>>> > > > >
>>> > > > > > > > +static int virtnet_bypass_join_child(struct net_device *bypass_netdev,
>>> > > > > > > > +                                   struct net_device *child_netdev)
>>> > > > > > > > +{
>>> > > > > > > > +      struct virtnet_bypass_info *vbi;
>>> > > > > > > > +      bool backup;
>>> > > > > > > > +
>>> > > > > > > > +      vbi = netdev_priv(bypass_netdev);
>>> > > > > > > > +      backup = (child_netdev->dev.parent == bypass_netdev->dev.parent);
>>> > > > > > > > +      if (backup ? rtnl_dereference(vbi->backup_netdev) :
>>> > > > > > > > +                      rtnl_dereference(vbi->active_netdev)) {
>>> > > > > > > > +              netdev_info(bypass_netdev,
>>> > > > > > > > +                          "%s attempting to join bypass dev when %s already present\n",
>>> > > > > > > > +                          child_netdev->name, backup ? "backup" : "active");
>>> > > > > > > Bypass module should check if there is already some other netdev
>>> > > > > > > enslaved and refuse right there.
>>> > > > > > This will work for virtio-net with 3 netdev model, but this check has to be done by netvsc
>>> > > > > > as its bypass_netdev is same as the backup_netdev.
>>> > > > > > Will add a flag while registering with the bypass module to indicate if the driver is doing
>>> > > > > > a 2 netdev or 3 netdev model and based on that flag this check can be done in bypass module
>>> > > > > > for 3 netdev scenario.
>>> > > > > Just let me undestand it clearly. What I expect the difference would be
>>> > > > > between 2netdev and3 netdev model is this:
>>> > > > > 2netdev:
>>> > > > >      bypass_master
>>> > > > >         /
>>> > > > >        /
>>> > > > > VF_slave
>>> > > > >
>>> > > > > 3netdev:
>>> > > > >      bypass_master
>>> > > > >         /     \
>>> > > > >        /       \
>>> > > > > VF_slave   backup_slave
>>> > > > >
>>> > > > > Is that correct? If not, how does it look like?
>>> > > > >
>>> > > > >
>>> > > > Looks correct.
>>> > > > VF_slave and backup_slave are the original netdevs and are present in both the models.
>>> > > > In the 3 netdev model, bypass_master netdev is created and VF_slave and backup_slave are
>>> > > > marked as the 2 slaves of this new netdev.
>>> > > You say it looks correct and in another sentence you provide completely
>>> > > different description. Could you please look again?
>>> > >
>>> > To be exact, 2 netdev model with netvsc looks like this.
>>> >
>>> >     netvsc_netdev
>>> >       /
>>> >      /
>>> > VF_slave
>>> >
>>> > With virtio_net, 3 netdev model
>>> >
>>> >   bypass_netdev
>>> >       /     \
>>> >      /       \
>>> > VF_slave   virtio_net netdev
>>> Could you also mark the original netdev which is there now? is it
>>> bypass_netdev or virtio_net_netdev ?
>>
>> bypass_netdev
>>     /     \
>>    /       \
>>VF_slave   virtio_net netdev (original)
>
> That does not make sense.
> 1) You diverge from the behaviour of the netvsc, where the original
>    netdev is a master of the VF
> 2) If the original netdev is a slave, you cannot have any IP address
>    configured on it (well you could, but the rx_handler would eat every
>    incoming packet). So you will break the user bacause he would have to
>    move the configuration to the new master device.

That's exactly the point why I need to hide the lower netdev slaves
and trying the align the naming of the bypass with where IP was
configured on the original netdev. The current 3-netdev model is not
"transparent" by any means.

-Siwei

> This only makes sense that the original netdev becomes the master for both
> netvsc and virtio_net.

^ permalink raw reply

* Re: [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available
From: Jiri Pirko @ 2018-04-10 15:43 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: alexander.h.duyck, virtio-dev, mst, kubakici, netdev,
	virtualization, loseweigh, davem
In-Reply-To: <d08ed1e6-678a-7b34-cd2e-52788ceec919@intel.com>

Tue, Apr 10, 2018 at 05:27:48PM CEST, sridhar.samudrala@intel.com wrote:
>On 4/10/2018 8:22 AM, Jiri Pirko wrote:
>> Tue, Apr 10, 2018 at 05:13:40PM CEST, sridhar.samudrala@intel.com wrote:
>> > On 4/10/2018 3:55 AM, Jiri Pirko wrote:
>> > > Mon, Apr 09, 2018 at 08:47:06PM CEST, sridhar.samudrala@intel.com wrote:
>> > > > On 4/9/2018 1:07 AM, Jiri Pirko wrote:
>> > > > > Sat, Apr 07, 2018 at 12:59:14AM CEST, sridhar.samudrala@intel.com wrote:
>> > > > > > On 4/6/2018 5:48 AM, Jiri Pirko wrote:
>> > > > > > > Thu, Apr 05, 2018 at 11:08:22PM CEST, sridhar.samudrala@intel.com wrote:
>> > > > > [...]
>> > > > > 
>> > > > > > > > +static int virtnet_bypass_join_child(struct net_device *bypass_netdev,
>> > > > > > > > +				     struct net_device *child_netdev)
>> > > > > > > > +{
>> > > > > > > > +	struct virtnet_bypass_info *vbi;
>> > > > > > > > +	bool backup;
>> > > > > > > > +
>> > > > > > > > +	vbi = netdev_priv(bypass_netdev);
>> > > > > > > > +	backup = (child_netdev->dev.parent == bypass_netdev->dev.parent);
>> > > > > > > > +	if (backup ? rtnl_dereference(vbi->backup_netdev) :
>> > > > > > > > +			rtnl_dereference(vbi->active_netdev)) {
>> > > > > > > > +		netdev_info(bypass_netdev,
>> > > > > > > > +			    "%s attempting to join bypass dev when %s already present\n",
>> > > > > > > > +			    child_netdev->name, backup ? "backup" : "active");
>> > > > > > > Bypass module should check if there is already some other netdev
>> > > > > > > enslaved and refuse right there.
>> > > > > > This will work for virtio-net with 3 netdev model, but this check has to be done by netvsc
>> > > > > > as its bypass_netdev is same as the backup_netdev.
>> > > > > > Will add a flag while registering with the bypass module to indicate if the driver is doing
>> > > > > > a 2 netdev or 3 netdev model and based on that flag this check can be done in bypass module
>> > > > > > for 3 netdev scenario.
>> > > > > Just let me undestand it clearly. What I expect the difference would be
>> > > > > between 2netdev and3 netdev model is this:
>> > > > > 2netdev:
>> > > > >      bypass_master
>> > > > >         /
>> > > > >        /
>> > > > > VF_slave
>> > > > > 
>> > > > > 3netdev:
>> > > > >      bypass_master
>> > > > >         /     \
>> > > > >        /       \
>> > > > > VF_slave   backup_slave
>> > > > > 
>> > > > > Is that correct? If not, how does it look like?
>> > > > > 
>> > > > > 
>> > > > Looks correct.
>> > > > VF_slave and backup_slave are the original netdevs and are present in both the models.
>> > > > In the 3 netdev model, bypass_master netdev is created and VF_slave and backup_slave are
>> > > > marked as the 2 slaves of this new netdev.
>> > > You say it looks correct and in another sentence you provide completely
>> > > different description. Could you please look again?
>> > > 
>> > To be exact, 2 netdev model with netvsc looks like this.
>> > 
>> >     netvsc_netdev
>> >       /
>> >      /
>> > VF_slave
>> > 
>> > With virtio_net, 3 netdev model
>> > 
>> >   bypass_netdev
>> >       /     \
>> >      /       \
>> > VF_slave   virtio_net netdev
>> Could you also mark the original netdev which is there now? is it
>> bypass_netdev or virtio_net_netdev ?
>
> bypass_netdev
>     /     \
>    /       \
>VF_slave   virtio_net netdev (original)

That does not make sense.
1) You diverge from the behaviour of the netvsc, where the original
   netdev is a master of the VF
2) If the original netdev is a slave, you cannot have any IP address
   configured on it (well you could, but the rx_handler would eat every
   incoming packet). So you will break the user bacause he would have to
   move the configuration to the new master device.
This only makes sense that the original netdev becomes the master for both
netvsc and virtio_net.

^ permalink raw reply

* Re: [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available
From: Samudrala, Sridhar @ 2018-04-10 15:27 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: alexander.h.duyck, virtio-dev, mst, kubakici, netdev,
	virtualization, loseweigh, davem
In-Reply-To: <20180410152205.GF2028@nanopsycho>

On 4/10/2018 8:22 AM, Jiri Pirko wrote:
> Tue, Apr 10, 2018 at 05:13:40PM CEST, sridhar.samudrala@intel.com wrote:
>> On 4/10/2018 3:55 AM, Jiri Pirko wrote:
>>> Mon, Apr 09, 2018 at 08:47:06PM CEST, sridhar.samudrala@intel.com wrote:
>>>> On 4/9/2018 1:07 AM, Jiri Pirko wrote:
>>>>> Sat, Apr 07, 2018 at 12:59:14AM CEST, sridhar.samudrala@intel.com wrote:
>>>>>> On 4/6/2018 5:48 AM, Jiri Pirko wrote:
>>>>>>> Thu, Apr 05, 2018 at 11:08:22PM CEST, sridhar.samudrala@intel.com wrote:
>>>>> [...]
>>>>>
>>>>>>>> +static int virtnet_bypass_join_child(struct net_device *bypass_netdev,
>>>>>>>> +				     struct net_device *child_netdev)
>>>>>>>> +{
>>>>>>>> +	struct virtnet_bypass_info *vbi;
>>>>>>>> +	bool backup;
>>>>>>>> +
>>>>>>>> +	vbi = netdev_priv(bypass_netdev);
>>>>>>>> +	backup = (child_netdev->dev.parent == bypass_netdev->dev.parent);
>>>>>>>> +	if (backup ? rtnl_dereference(vbi->backup_netdev) :
>>>>>>>> +			rtnl_dereference(vbi->active_netdev)) {
>>>>>>>> +		netdev_info(bypass_netdev,
>>>>>>>> +			    "%s attempting to join bypass dev when %s already present\n",
>>>>>>>> +			    child_netdev->name, backup ? "backup" : "active");
>>>>>>> Bypass module should check if there is already some other netdev
>>>>>>> enslaved and refuse right there.
>>>>>> This will work for virtio-net with 3 netdev model, but this check has to be done by netvsc
>>>>>> as its bypass_netdev is same as the backup_netdev.
>>>>>> Will add a flag while registering with the bypass module to indicate if the driver is doing
>>>>>> a 2 netdev or 3 netdev model and based on that flag this check can be done in bypass module
>>>>>> for 3 netdev scenario.
>>>>> Just let me undestand it clearly. What I expect the difference would be
>>>>> between 2netdev and3 netdev model is this:
>>>>> 2netdev:
>>>>>      bypass_master
>>>>>         /
>>>>>        /
>>>>> VF_slave
>>>>>
>>>>> 3netdev:
>>>>>      bypass_master
>>>>>         /     \
>>>>>        /       \
>>>>> VF_slave   backup_slave
>>>>>
>>>>> Is that correct? If not, how does it look like?
>>>>>
>>>>>
>>>> Looks correct.
>>>> VF_slave and backup_slave are the original netdevs and are present in both the models.
>>>> In the 3 netdev model, bypass_master netdev is created and VF_slave and backup_slave are
>>>> marked as the 2 slaves of this new netdev.
>>> You say it looks correct and in another sentence you provide completely
>>> different description. Could you please look again?
>>>
>> To be exact, 2 netdev model with netvsc looks like this.
>>
>>     netvsc_netdev
>>       /
>>      /
>> VF_slave
>>
>> With virtio_net, 3 netdev model
>>
>>   bypass_netdev
>>       /     \
>>      /       \
>> VF_slave   virtio_net netdev
> Could you also mark the original netdev which is there now? is it
> bypass_netdev or virtio_net_netdev ?

  bypass_netdev
      /     \
     /       \
VF_slave   virtio_net netdev (original)

^ permalink raw reply

* Re: [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available
From: Samudrala, Sridhar @ 2018-04-10 15:13 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: alexander.h.duyck, virtio-dev, mst, kubakici, netdev,
	virtualization, loseweigh, davem
In-Reply-To: <20180410105504.GA2028@nanopsycho>

On 4/10/2018 3:55 AM, Jiri Pirko wrote:
> Mon, Apr 09, 2018 at 08:47:06PM CEST, sridhar.samudrala@intel.com wrote:
>> On 4/9/2018 1:07 AM, Jiri Pirko wrote:
>>> Sat, Apr 07, 2018 at 12:59:14AM CEST, sridhar.samudrala@intel.com wrote:
>>>> On 4/6/2018 5:48 AM, Jiri Pirko wrote:
>>>>> Thu, Apr 05, 2018 at 11:08:22PM CEST, sridhar.samudrala@intel.com wrote:
>>> [...]
>>>
>>>>>> +static int virtnet_bypass_join_child(struct net_device *bypass_netdev,
>>>>>> +				     struct net_device *child_netdev)
>>>>>> +{
>>>>>> +	struct virtnet_bypass_info *vbi;
>>>>>> +	bool backup;
>>>>>> +
>>>>>> +	vbi = netdev_priv(bypass_netdev);
>>>>>> +	backup = (child_netdev->dev.parent == bypass_netdev->dev.parent);
>>>>>> +	if (backup ? rtnl_dereference(vbi->backup_netdev) :
>>>>>> +			rtnl_dereference(vbi->active_netdev)) {
>>>>>> +		netdev_info(bypass_netdev,
>>>>>> +			    "%s attempting to join bypass dev when %s already present\n",
>>>>>> +			    child_netdev->name, backup ? "backup" : "active");
>>>>> Bypass module should check if there is already some other netdev
>>>>> enslaved and refuse right there.
>>>> This will work for virtio-net with 3 netdev model, but this check has to be done by netvsc
>>>> as its bypass_netdev is same as the backup_netdev.
>>>> Will add a flag while registering with the bypass module to indicate if the driver is doing
>>>> a 2 netdev or 3 netdev model and based on that flag this check can be done in bypass module
>>>> for 3 netdev scenario.
>>> Just let me undestand it clearly. What I expect the difference would be
>>> between 2netdev and3 netdev model is this:
>>> 2netdev:
>>>     bypass_master
>>>        /
>>>       /
>>> VF_slave
>>>
>>> 3netdev:
>>>     bypass_master
>>>        /     \
>>>       /       \
>>> VF_slave   backup_slave
>>>
>>> Is that correct? If not, how does it look like?
>>>
>>>
>> Looks correct.
>> VF_slave and backup_slave are the original netdevs and are present in both the models.
>> In the 3 netdev model, bypass_master netdev is created and VF_slave and backup_slave are
>> marked as the 2 slaves of this new netdev.
> You say it looks correct and in another sentence you provide completely
> different description. Could you please look again?
>
To be exact, 2 netdev model with netvsc looks like this.

     netvsc_netdev
       /
      /
  VF_slave

With virtio_net, 3 netdev model

   bypass_netdev
       /     \
      /       \
VF_slave   virtio_net netdev

^ permalink raw reply

* Re: [PATCH v2 0/2] vhost: fix vhost_vq_access_ok() log check
From: David Miller @ 2018-04-10 14:50 UTC (permalink / raw)
  To: jasowang
  Cc: kvm, mst, netdev, syzkaller-bugs, linux-kernel, virtualization,
	stefanha, torvalds
In-Reply-To: <8642c057-e49d-d95d-19e5-6d304f20064e@redhat.com>

From: Jason Wang <jasowang@redhat.com>
Date: Tue, 10 Apr 2018 14:40:10 +0800

> On 2018年04月10日 13:26, Stefan Hajnoczi wrote:
>> v2:
>>   * Rewrote the conditional to make the vq access check clearer [Linus]
>>   * Added Patch 2 to make the return type consistent and harder to misuse
>>   * [Linus]
>>
>> The first patch fixes the vhost virtqueue access check which was
>> recently
>> broken.  The second patch replaces the int return type with bool to
>> prevent
>> future bugs.
>>
>> Stefan Hajnoczi (2):
>>    vhost: fix vhost_vq_access_ok() log check
>>    vhost: return bool from *_access_ok() functions
>>
>>   drivers/vhost/vhost.h |  4 +--
>>   drivers/vhost/vhost.c | 70
>>   ++++++++++++++++++++++++++-------------------------
>>   2 files changed, 38 insertions(+), 36 deletions(-)
>>
> 
> Acked-by: Jason Wang <jasowang@redhat.com>

This sereis doesn't apply cleanly to the net tree, please respin
and add the appropriate Acks and Fixes: tags that Michael and Jason
have provided.

Thank you.

^ permalink raw reply

* Re: [virtio-dev] Re: [RFC] vhost: introduce mdev based hardware vhost backend
From: Michael S. Tsirkin @ 2018-04-10 13:36 UTC (permalink / raw)
  To: Liang, Cunming
  Cc: Duyck, Alexander H, virtio-dev@lists.oasis-open.org,
	kvm@vger.kernel.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org, Wang, Xiao W,
	ddutile@redhat.com, Tan, Jianfeng, Wang, Zhihong, Paolo Bonzini
In-Reply-To: <D0158A423229094DA7ABF71CF2FA0DA34E9222F4@SHSMSX104.ccr.corp.intel.com>

On Tue, Apr 10, 2018 at 09:23:53AM +0000, Liang, Cunming wrote:
> 
> 
> > -----Original Message-----
> > From: Paolo Bonzini [mailto:pbonzini@redhat.com]
> > Sent: Tuesday, April 10, 2018 3:52 PM
> > To: Bie, Tiwei <tiwei.bie@intel.com>; Jason Wang <jasowang@redhat.com>
> > Cc: mst@redhat.com; alex.williamson@redhat.com; ddutile@redhat.com;
> > Duyck, Alexander H <alexander.h.duyck@intel.com>; virtio-dev@lists.oasis-
> > open.org; linux-kernel@vger.kernel.org; kvm@vger.kernel.org;
> > virtualization@lists.linux-foundation.org; netdev@vger.kernel.org; Daly, Dan
> > <dan.daly@intel.com>; Liang, Cunming <cunming.liang@intel.com>; Wang,
> > Zhihong <zhihong.wang@intel.com>; Tan, Jianfeng <jianfeng.tan@intel.com>;
> > Wang, Xiao W <xiao.w.wang@intel.com>
> > Subject: Re: [virtio-dev] Re: [RFC] vhost: introduce mdev based hardware
> > vhost backend
> > 
> > On 10/04/2018 06:57, Tiwei Bie wrote:
> > >> So you just move the abstraction layer from qemu to kernel, and you
> > >> still need different drivers in kernel for different device
> > >> interfaces of accelerators. This looks even more complex than leaving
> > >> it in qemu. As you said, another idea is to implement userspace vhost
> > >> backend for accelerators which seems easier and could co-work with
> > >> other parts of qemu without inventing new type of messages.
> > >
> > > I'm not quite sure. Do you think it's acceptable to add various vendor
> > > specific hardware drivers in QEMU?
> > 
> > I think so.  We have vendor-specific quirks, and at some point there was an
> > idea of using quirks to implement (vendor-specific) live migration support for
> > assigned devices.
> 
> Vendor-specific quirks of accessing VGA is a small portion. Other major portions are still handled by guest driver.
> 
> While in this case, when saying various vendor specific drivers in QEMU, it says QEMU takes over and provides the entire user space device drivers. Some parts are even not relevant to vhost, they're basic device function enabling. Moreover, it could be different kinds of devices(network/block/...) under vhost. No matter # of vendors or # of types, total LOC is not small.
> 
> The idea is to avoid introducing these extra complexity out of QEMU, keeping vhost adapter simple. As vhost protocol is de factor standard, it leverages kernel device driver to provide the diversity. Changing once in QEMU, then it supports multi-vendor devices whose drivers naturally providing kernel driver there.
> 
> If QEMU is going to build a user space driver framework there, we're open mind on that, even leveraging DPDK as the underlay library. Looking forward to more others' comments from community.
> 
> Steve

Dependency on a kernel driver is fine IMHO. It's the dependency on a
DPDK backend that makes people unhappy, since the functionality
in question is setup-time only.

As others said, we do not need to go overeboard. A couple of small
vendor-specific quirks in qemu isn't a big deal.

> > 
> > Paolo

^ permalink raw reply

* Re: [PATCH v2 2/2] vhost: return bool from *_access_ok() functions
From: Michael S. Tsirkin @ 2018-04-10 13:29 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: kvm, netdev, syzkaller-bugs, linux-kernel, virtualization,
	Linus Torvalds
In-Reply-To: <20180410052630.11270-3-stefanha@redhat.com>

On Tue, Apr 10, 2018 at 01:26:30PM +0800, Stefan Hajnoczi wrote:
> Currently vhost *_access_ok() functions return int.  This is error-prone
> because there are two popular conventions:
> 
> 1. 0 means failure, 1 means success
> 2. -errno means failure, 0 means success
> 
> Although vhost mostly uses #1, it does not do so consistently.
> umem_access_ok() uses #2.
> 
> This patch changes the return type from int to bool so that false means
> failure and true means success.  This eliminates a potential source of
> errors.
> 
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

Acked-by: Michael S. Tsirkin <mst@redhat.com>

> ---
>  drivers/vhost/vhost.h |  4 ++--
>  drivers/vhost/vhost.c | 66 +++++++++++++++++++++++++--------------------------
>  2 files changed, 35 insertions(+), 35 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index ac4b6056f19a..6e00fa57af09 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -178,8 +178,8 @@ void vhost_dev_cleanup(struct vhost_dev *);
>  void vhost_dev_stop(struct vhost_dev *);
>  long vhost_dev_ioctl(struct vhost_dev *, unsigned int ioctl, void __user *argp);
>  long vhost_vring_ioctl(struct vhost_dev *d, int ioctl, void __user *argp);
> -int vhost_vq_access_ok(struct vhost_virtqueue *vq);
> -int vhost_log_access_ok(struct vhost_dev *);
> +bool vhost_vq_access_ok(struct vhost_virtqueue *vq);
> +bool vhost_log_access_ok(struct vhost_dev *);
>  
>  int vhost_get_vq_desc(struct vhost_virtqueue *,
>  		      struct iovec iov[], unsigned int iov_count,
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 93fd0c75b0d8..b6a082ef33dd 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -641,14 +641,14 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
>  }
>  EXPORT_SYMBOL_GPL(vhost_dev_cleanup);
>  
> -static int log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
> +static bool log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
>  {
>  	u64 a = addr / VHOST_PAGE_SIZE / 8;
>  
>  	/* Make sure 64 bit math will not overflow. */
>  	if (a > ULONG_MAX - (unsigned long)log_base ||
>  	    a + (unsigned long)log_base > ULONG_MAX)
> -		return 0;
> +		return false;
>  
>  	return access_ok(VERIFY_WRITE, log_base + a,
>  			 (sz + VHOST_PAGE_SIZE * 8 - 1) / VHOST_PAGE_SIZE / 8);
> @@ -661,30 +661,30 @@ static bool vhost_overflow(u64 uaddr, u64 size)
>  }
>  
>  /* Caller should have vq mutex and device mutex. */
> -static int vq_memory_access_ok(void __user *log_base, struct vhost_umem *umem,
> -			       int log_all)
> +static bool vq_memory_access_ok(void __user *log_base, struct vhost_umem *umem,
> +				int log_all)
>  {
>  	struct vhost_umem_node *node;
>  
>  	if (!umem)
> -		return 0;
> +		return false;
>  
>  	list_for_each_entry(node, &umem->umem_list, link) {
>  		unsigned long a = node->userspace_addr;
>  
>  		if (vhost_overflow(node->userspace_addr, node->size))
> -			return 0;
> +			return false;
>  
>  
>  		if (!access_ok(VERIFY_WRITE, (void __user *)a,
>  				    node->size))
> -			return 0;
> +			return false;
>  		else if (log_all && !log_access_ok(log_base,
>  						   node->start,
>  						   node->size))
> -			return 0;
> +			return false;
>  	}
> -	return 1;
> +	return true;
>  }
>  
>  static inline void __user *vhost_vq_meta_fetch(struct vhost_virtqueue *vq,
> @@ -701,13 +701,13 @@ static inline void __user *vhost_vq_meta_fetch(struct vhost_virtqueue *vq,
>  
>  /* Can we switch to this memory table? */
>  /* Caller should have device mutex but not vq mutex */
> -static int memory_access_ok(struct vhost_dev *d, struct vhost_umem *umem,
> -			    int log_all)
> +static bool memory_access_ok(struct vhost_dev *d, struct vhost_umem *umem,
> +			     int log_all)
>  {
>  	int i;
>  
>  	for (i = 0; i < d->nvqs; ++i) {
> -		int ok;
> +		bool ok;
>  		bool log;
>  
>  		mutex_lock(&d->vqs[i]->mutex);
> @@ -717,12 +717,12 @@ static int memory_access_ok(struct vhost_dev *d, struct vhost_umem *umem,
>  			ok = vq_memory_access_ok(d->vqs[i]->log_base,
>  						 umem, log);
>  		else
> -			ok = 1;
> +			ok = true;
>  		mutex_unlock(&d->vqs[i]->mutex);
>  		if (!ok)
> -			return 0;
> +			return false;
>  	}
> -	return 1;
> +	return true;
>  }
>  
>  static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
> @@ -959,21 +959,21 @@ static void vhost_iotlb_notify_vq(struct vhost_dev *d,
>  	spin_unlock(&d->iotlb_lock);
>  }
>  
> -static int umem_access_ok(u64 uaddr, u64 size, int access)
> +static bool umem_access_ok(u64 uaddr, u64 size, int access)
>  {
>  	unsigned long a = uaddr;
>  
>  	/* Make sure 64 bit math will not overflow. */
>  	if (vhost_overflow(uaddr, size))
> -		return -EFAULT;
> +		return false;
>  
>  	if ((access & VHOST_ACCESS_RO) &&
>  	    !access_ok(VERIFY_READ, (void __user *)a, size))
> -		return -EFAULT;
> +		return false;
>  	if ((access & VHOST_ACCESS_WO) &&
>  	    !access_ok(VERIFY_WRITE, (void __user *)a, size))
> -		return -EFAULT;
> -	return 0;
> +		return false;
> +	return true;
>  }
>  
>  static int vhost_process_iotlb_msg(struct vhost_dev *dev,
> @@ -988,7 +988,7 @@ static int vhost_process_iotlb_msg(struct vhost_dev *dev,
>  			ret = -EFAULT;
>  			break;
>  		}
> -		if (umem_access_ok(msg->uaddr, msg->size, msg->perm)) {
> +		if (!umem_access_ok(msg->uaddr, msg->size, msg->perm)) {
>  			ret = -EFAULT;
>  			break;
>  		}
> @@ -1135,10 +1135,10 @@ static int vhost_iotlb_miss(struct vhost_virtqueue *vq, u64 iova, int access)
>  	return 0;
>  }
>  
> -static int vq_access_ok(struct vhost_virtqueue *vq, unsigned int num,
> -			struct vring_desc __user *desc,
> -			struct vring_avail __user *avail,
> -			struct vring_used __user *used)
> +static bool vq_access_ok(struct vhost_virtqueue *vq, unsigned int num,
> +			 struct vring_desc __user *desc,
> +			 struct vring_avail __user *avail,
> +			 struct vring_used __user *used)
>  
>  {
>  	size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
> @@ -1161,8 +1161,8 @@ static void vhost_vq_meta_update(struct vhost_virtqueue *vq,
>  		vq->meta_iotlb[type] = node;
>  }
>  
> -static int iotlb_access_ok(struct vhost_virtqueue *vq,
> -			   int access, u64 addr, u64 len, int type)
> +static bool iotlb_access_ok(struct vhost_virtqueue *vq,
> +			    int access, u64 addr, u64 len, int type)
>  {
>  	const struct vhost_umem_node *node;
>  	struct vhost_umem *umem = vq->iotlb;
> @@ -1220,7 +1220,7 @@ EXPORT_SYMBOL_GPL(vq_iotlb_prefetch);
>  
>  /* Can we log writes? */
>  /* Caller should have device mutex but not vq mutex */
> -int vhost_log_access_ok(struct vhost_dev *dev)
> +bool vhost_log_access_ok(struct vhost_dev *dev)
>  {
>  	return memory_access_ok(dev, dev->umem, 1);
>  }
> @@ -1228,8 +1228,8 @@ EXPORT_SYMBOL_GPL(vhost_log_access_ok);
>  
>  /* Verify access for write logging. */
>  /* Caller should have vq mutex and device mutex */
> -static int vq_log_access_ok(struct vhost_virtqueue *vq,
> -			    void __user *log_base)
> +static bool vq_log_access_ok(struct vhost_virtqueue *vq,
> +			     void __user *log_base)
>  {
>  	size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
>  
> @@ -1242,14 +1242,14 @@ static int vq_log_access_ok(struct vhost_virtqueue *vq,
>  
>  /* Can we start vq? */
>  /* Caller should have vq mutex and device mutex */
> -int vhost_vq_access_ok(struct vhost_virtqueue *vq)
> +bool vhost_vq_access_ok(struct vhost_virtqueue *vq)
>  {
>  	if (!vq_log_access_ok(vq, vq->log_base))
> -		return 0;
> +		return false;
>  
>  	/* Access validation occurs at prefetch time with IOTLB */
>  	if (vq->iotlb)
> -		return 1;
> +		return true;
>  
>  	return vq_access_ok(vq, vq->num, vq->desc, vq->avail, vq->used);
>  }
> -- 
> 2.14.3

^ permalink raw reply

* Re: [PATCH v2 1/2] vhost: fix vhost_vq_access_ok() log check
From: Michael S. Tsirkin @ 2018-04-10 13:27 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: kvm, netdev, syzkaller-bugs, linux-kernel, virtualization,
	Linus Torvalds
In-Reply-To: <20180410161905-mutt-send-email-mst@kernel.org>

On Tue, Apr 10, 2018 at 04:22:35PM +0300, Michael S. Tsirkin wrote:
> On Tue, Apr 10, 2018 at 01:26:29PM +0800, Stefan Hajnoczi wrote:
> > Commit d65026c6c62e7d9616c8ceb5a53b68bcdc050525 ("vhost: validate log
> > when IOTLB is enabled") introduced a regression.  The logic was
> > originally:
> > 
> >   if (vq->iotlb)
> >       return 1;
> >   return A && B;
> > 
> > After the patch the short-circuit logic for A was inverted:
> > 
> >   if (A || vq->iotlb)
> >       return A;
> >   return B;
> > 
> > This patch fixes the regression by rewriting the checks in the obvious
> > way, no longer returning A when vq->iotlb is non-NULL (which is hard to
> > understand).
> > 
> > Reported-by: syzbot+65a84dde0214b0387ccd@syzkaller.appspotmail.com
> > Cc: Jason Wang <jasowang@redhat.com>
> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> 
> This patch only makes sense after patch 2/2 is applied.
> Otherwise the logic seems reversed below.
> 
> Can you pls squash these two?

Oh, in fact the patch is ok. This just goes to show
that patch 2/2 is useful. But squashing is not required.
Sorry about the noise.

Acked-by: Michael S. Tsirkin <mst@redhat.com>
Fixes: d65026c6c ("vhost: validate log when IOTLB is enabled")

probably stable material since patch it fixes was queued to stable?


> > ---
> >  drivers/vhost/vhost.c | 8 +++++---
> >  1 file changed, 5 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index 5320039671b7..93fd0c75b0d8 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -1244,10 +1244,12 @@ static int vq_log_access_ok(struct vhost_virtqueue *vq,
> >  /* Caller should have vq mutex and device mutex */
> >  int vhost_vq_access_ok(struct vhost_virtqueue *vq)
> >  {
> > -	int ret = vq_log_access_ok(vq, vq->log_base);
> > +	if (!vq_log_access_ok(vq, vq->log_base))
> > +		return 0;
> >  
> > -	if (ret || vq->iotlb)
> > -		return ret;
> > +	/* Access validation occurs at prefetch time with IOTLB */
> > +	if (vq->iotlb)
> > +		return 1;
> >  
> >  	return vq_access_ok(vq, vq->num, vq->desc, vq->avail, vq->used);
> >  }
> > -- 
> > 2.14.3

^ permalink raw reply

* Re: [PATCH v2 1/2] vhost: fix vhost_vq_access_ok() log check
From: Michael S. Tsirkin @ 2018-04-10 13:22 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: kvm, netdev, syzkaller-bugs, linux-kernel, virtualization,
	Linus Torvalds
In-Reply-To: <20180410052630.11270-2-stefanha@redhat.com>

On Tue, Apr 10, 2018 at 01:26:29PM +0800, Stefan Hajnoczi wrote:
> Commit d65026c6c62e7d9616c8ceb5a53b68bcdc050525 ("vhost: validate log
> when IOTLB is enabled") introduced a regression.  The logic was
> originally:
> 
>   if (vq->iotlb)
>       return 1;
>   return A && B;
> 
> After the patch the short-circuit logic for A was inverted:
> 
>   if (A || vq->iotlb)
>       return A;
>   return B;
> 
> This patch fixes the regression by rewriting the checks in the obvious
> way, no longer returning A when vq->iotlb is non-NULL (which is hard to
> understand).
> 
> Reported-by: syzbot+65a84dde0214b0387ccd@syzkaller.appspotmail.com
> Cc: Jason Wang <jasowang@redhat.com>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

This patch only makes sense after patch 2/2 is applied.
Otherwise the logic seems reversed below.

Can you pls squash these two?

> ---
>  drivers/vhost/vhost.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 5320039671b7..93fd0c75b0d8 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -1244,10 +1244,12 @@ static int vq_log_access_ok(struct vhost_virtqueue *vq,
>  /* Caller should have vq mutex and device mutex */
>  int vhost_vq_access_ok(struct vhost_virtqueue *vq)
>  {
> -	int ret = vq_log_access_ok(vq, vq->log_base);
> +	if (!vq_log_access_ok(vq, vq->log_base))
> +		return 0;
>  
> -	if (ret || vq->iotlb)
> -		return ret;
> +	/* Access validation occurs at prefetch time with IOTLB */
> +	if (vq->iotlb)
> +		return 1;
>  
>  	return vq_access_ok(vq, vq->num, vq->desc, vq->avail, vq->used);
>  }
> -- 
> 2.14.3

^ permalink raw reply

* Re: [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available
From: Jiri Pirko @ 2018-04-10 10:55 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: alexander.h.duyck, virtio-dev, mst, kubakici, netdev,
	virtualization, loseweigh, davem
In-Reply-To: <16b2e531-7bfa-7f25-2702-f3f8069663ee@intel.com>

Mon, Apr 09, 2018 at 08:47:06PM CEST, sridhar.samudrala@intel.com wrote:
>On 4/9/2018 1:07 AM, Jiri Pirko wrote:
>> Sat, Apr 07, 2018 at 12:59:14AM CEST, sridhar.samudrala@intel.com wrote:
>> > On 4/6/2018 5:48 AM, Jiri Pirko wrote:
>> > > Thu, Apr 05, 2018 at 11:08:22PM CEST, sridhar.samudrala@intel.com wrote:
>> [...]
>> 
>> > > > +static int virtnet_bypass_join_child(struct net_device *bypass_netdev,
>> > > > +				     struct net_device *child_netdev)
>> > > > +{
>> > > > +	struct virtnet_bypass_info *vbi;
>> > > > +	bool backup;
>> > > > +
>> > > > +	vbi = netdev_priv(bypass_netdev);
>> > > > +	backup = (child_netdev->dev.parent == bypass_netdev->dev.parent);
>> > > > +	if (backup ? rtnl_dereference(vbi->backup_netdev) :
>> > > > +			rtnl_dereference(vbi->active_netdev)) {
>> > > > +		netdev_info(bypass_netdev,
>> > > > +			    "%s attempting to join bypass dev when %s already present\n",
>> > > > +			    child_netdev->name, backup ? "backup" : "active");
>> > > Bypass module should check if there is already some other netdev
>> > > enslaved and refuse right there.
>> > This will work for virtio-net with 3 netdev model, but this check has to be done by netvsc
>> > as its bypass_netdev is same as the backup_netdev.
>> > Will add a flag while registering with the bypass module to indicate if the driver is doing
>> > a 2 netdev or 3 netdev model and based on that flag this check can be done in bypass module
>> > for 3 netdev scenario.
>> Just let me undestand it clearly. What I expect the difference would be
>> between 2netdev and3 netdev model is this:
>> 2netdev:
>>    bypass_master
>>       /
>>      /
>> VF_slave
>> 
>> 3netdev:
>>    bypass_master
>>       /     \
>>      /       \
>> VF_slave   backup_slave
>> 
>> Is that correct? If not, how does it look like?
>> 
>> 
>Looks correct.
>VF_slave and backup_slave are the original netdevs and are present in both the models.
>In the 3 netdev model, bypass_master netdev is created and VF_slave and backup_slave are
>marked as the 2 slaves of this new netdev.

You say it looks correct and in another sentence you provide completely
different description. Could you please look again?

[...]

^ permalink raw reply

* RE: [virtio-dev] Re: [RFC] vhost: introduce mdev based hardware vhost backend
From: Liang, Cunming @ 2018-04-10  9:23 UTC (permalink / raw)
  To: Paolo Bonzini, Bie, Tiwei, Jason Wang
  Cc: Duyck, Alexander H, virtio-dev@lists.oasis-open.org,
	kvm@vger.kernel.org, mst@redhat.com, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org, Wang, Xiao W,
	ddutile@redhat.com, Tan, Jianfeng, Wang, Zhihong
In-Reply-To: <7ee31a12-a370-fc43-82a6-2235f598e970@redhat.com>

> -----Original Message-----
> From: Paolo Bonzini [mailto:pbonzini@redhat.com]
> Sent: Tuesday, April 10, 2018 3:52 PM
> To: Bie, Tiwei <tiwei.bie@intel.com>; Jason Wang <jasowang@redhat.com>
> Cc: mst@redhat.com; alex.williamson@redhat.com; ddutile@redhat.com;
> Duyck, Alexander H <alexander.h.duyck@intel.com>; virtio-dev@lists.oasis-
> open.org; linux-kernel@vger.kernel.org; kvm@vger.kernel.org;
> virtualization@lists.linux-foundation.org; netdev@vger.kernel.org; Daly, Dan
> <dan.daly@intel.com>; Liang, Cunming <cunming.liang@intel.com>; Wang,
> Zhihong <zhihong.wang@intel.com>; Tan, Jianfeng <jianfeng.tan@intel.com>;
> Wang, Xiao W <xiao.w.wang@intel.com>
> Subject: Re: [virtio-dev] Re: [RFC] vhost: introduce mdev based hardware
> vhost backend
> 
> On 10/04/2018 06:57, Tiwei Bie wrote:
> >> So you just move the abstraction layer from qemu to kernel, and you
> >> still need different drivers in kernel for different device
> >> interfaces of accelerators. This looks even more complex than leaving
> >> it in qemu. As you said, another idea is to implement userspace vhost
> >> backend for accelerators which seems easier and could co-work with
> >> other parts of qemu without inventing new type of messages.
> >
> > I'm not quite sure. Do you think it's acceptable to add various vendor
> > specific hardware drivers in QEMU?
> 
> I think so.  We have vendor-specific quirks, and at some point there was an
> idea of using quirks to implement (vendor-specific) live migration support for
> assigned devices.

Vendor-specific quirks of accessing VGA is a small portion. Other major portions are still handled by guest driver.

While in this case, when saying various vendor specific drivers in QEMU, it says QEMU takes over and provides the entire user space device drivers. Some parts are even not relevant to vhost, they're basic device function enabling. Moreover, it could be different kinds of devices(network/block/...) under vhost. No matter # of vendors or # of types, total LOC is not small.

The idea is to avoid introducing these extra complexity out of QEMU, keeping vhost adapter simple. As vhost protocol is de factor standard, it leverages kernel device driver to provide the diversity. Changing once in QEMU, then it supports multi-vendor devices whose drivers naturally providing kernel driver there.

If QEMU is going to build a user space driver framework there, we're open mind on that, even leveraging DPDK as the underlay library. Looking forward to more others' comments from community.

Steve

> 
> Paolo

^ permalink raw reply

* Re: [virtio-dev] Re: [RFC] vhost: introduce mdev based hardware vhost backend
From: Paolo Bonzini @ 2018-04-10  7:51 UTC (permalink / raw)
  To: Tiwei Bie, Jason Wang
  Cc: alexander.h.duyck, virtio-dev, kvm, mst, netdev, linux-kernel,
	virtualization, xiao.w.wang, ddutile, jianfeng.tan, zhihong.wang
In-Reply-To: <20180410045723.rftsb7l4l3ip2ioi@debian>

On 10/04/2018 06:57, Tiwei Bie wrote:
>> So you just move the abstraction layer from qemu to kernel, and you still
>> need different drivers in kernel for different device interfaces of
>> accelerators. This looks even more complex than leaving it in qemu. As you
>> said, another idea is to implement userspace vhost backend for accelerators
>> which seems easier and could co-work with other parts of qemu without
>> inventing new type of messages.
> 
> I'm not quite sure. Do you think it's acceptable to
> add various vendor specific hardware drivers in QEMU?

I think so.  We have vendor-specific quirks, and at some point there was
an idea of using quirks to implement (vendor-specific) live migration
support for assigned devices.

Paolo

^ permalink raw reply

* Re: [RFC] vhost: introduce mdev based hardware vhost backend
From: Jason Wang @ 2018-04-10  7:25 UTC (permalink / raw)
  To: Tiwei Bie
  Cc: alexander.h.duyck, virtio-dev, kvm, mst, netdev, linux-kernel,
	virtualization, xiao.w.wang, ddutile, jianfeng.tan, zhihong.wang
In-Reply-To: <20180410045723.rftsb7l4l3ip2ioi@debian>



On 2018年04月10日 12:57, Tiwei Bie wrote:
> On Tue, Apr 10, 2018 at 10:52:52AM +0800, Jason Wang wrote:
>> On 2018年04月02日 23:23, Tiwei Bie wrote:
>>> This patch introduces a mdev (mediated device) based hardware
>>> vhost backend. This backend is an abstraction of the various
>>> hardware vhost accelerators (potentially any device that uses
>>> virtio ring can be used as a vhost accelerator). Some generic
>>> mdev parent ops are provided for accelerator drivers to support
>>> generating mdev instances.
>>>
>>> What's this
>>> ===========
>>>
>>> The idea is that we can setup a virtio ring compatible device
>>> with the messages available at the vhost-backend. Originally,
>>> these messages are used to implement a software vhost backend,
>>> but now we will use these messages to setup a virtio ring
>>> compatible hardware device. Then the hardware device will be
>>> able to work with the guest virtio driver in the VM just like
>>> what the software backend does. That is to say, we can implement
>>> a hardware based vhost backend in QEMU, and any virtio ring
>>> compatible devices potentially can be used with this backend.
>>> (We also call it vDPA -- vhost Data Path Acceleration).
>>>
>>> One problem is that, different virtio ring compatible devices
>>> may have different device interfaces. That is to say, we will
>>> need different drivers in QEMU. It could be troublesome. And
>>> that's what this patch trying to fix. The idea behind this
>>> patch is very simple: mdev is a standard way to emulate device
>>> in kernel.
>> So you just move the abstraction layer from qemu to kernel, and you still
>> need different drivers in kernel for different device interfaces of
>> accelerators. This looks even more complex than leaving it in qemu. As you
>> said, another idea is to implement userspace vhost backend for accelerators
>> which seems easier and could co-work with other parts of qemu without
>> inventing new type of messages.
> I'm not quite sure. Do you think it's acceptable to
> add various vendor specific hardware drivers in QEMU?
>

I don't object but we need to figure out the advantages of doing it in 
qemu too.

Thanks
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox