Linux virtualization list

Linux virtualization list
 help / color / mirror / Atom feed

* Re: [PATCH 2/6] pci: Scan all functions when probing while running over Jailhouse
From: Jan Kiszka @ 2018-02-27  7:25 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: jailhouse-dev, Benedikt Spranger, linux-pci, x86,
	Linux Kernel Mailing List, virtualization, Ingo Molnar,
	H . Peter Anvin, Bjorn Helgaas, Thomas Gleixner
In-Reply-To: <20180222205753.GE16519@bhelgaas-glaptop.roam.corp.google.com>

On 2018-02-22 21:57, Bjorn Helgaas wrote:
> On Mon, Jan 22, 2018 at 07:12:46AM +0100, Jan Kiszka wrote:
>> From: Jan Kiszka <jan.kiszka@siemens.com>
>>
>> PCI and PCIBIOS probing only scans devices at function number 0/8/16/...
>> Subdevices (e.g. multiqueue) have function numbers which are not a
>> multiple of 8.
> 
> Suggested text:
> 
>   Per PCIe r4.0, sec 7.5.1.1.9, multi-function devices are required to
>   have a function 0.  Therefore, Linux scans for devices at function 0
>   (devfn 0/8/16/...) and only scans for other functions if function 0
>   has its Multi-Function Device bit set or ARI or SR-IOV indicate
>   there are more functions.
>   
>   The Jailhouse hypervisor may pass individual functions of a
>   multi-function device to a guest without passing function 0, which
>   means a Linux guest won't find them.
> 
>   Change Linux PCI probing so it scans all function numbers when
>   running as a guest over Jailhouse.
>   
>   This is technically prohibited by the spec, so it is possible that
>   PCI devices without the Multi-Function Device bit set may have
>   unexpected behavior in response to this probe.
> 
>> The simple hypervisor Jailhouse passes subdevices directly w/o providing
>> a virtual PCI topology like KVM. As a consequence a PCI passthrough from
>> Jailhouse to a guest will not be detected by Linux.
>>
>> Based on patch by Benedikt Spranger, adding Jailhouse probing to avoid
>> changing the behavior in the absence of the hypervisor.
>>
>> CC: Benedikt Spranger <b.spranger@linutronix.de>
>> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
> 
> With subject change to:
> 
>   PCI: Scan all functions when running over Jailhouse
> 
> Acked-by: Bjorn Helgaas <bhelgaas@google.com>
> 

Thanks, all suggestions picked up for next round.

Jan

-- 
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux

^ permalink raw reply

* Re: [v2 1/1] xen, mm: Allow deferred page initialization for xen pv domains
From: Juergen Gross @ 2018-02-27  7:44 UTC (permalink / raw)
  To: Pavel Tatashin, steven.sistare, daniel.m.jordan, akataria, tglx,
	mingo, hpa, x86, boris.ostrovsky, akpm, mhocko, vbabka, luto,
	labbott, kirill.shutemov, bp, minipli, jinb.park7, dan.j.williams,
	bhe, zhang.jia, mgorman, hannes, virtualization, linux-kernel,
	xen-devel, linux-mm
In-Reply-To: <20180226160112.24724-2-pasha.tatashin@oracle.com>

On 26/02/18 17:01, Pavel Tatashin wrote:
> Juergen Gross noticed that commit
> f7f99100d8d ("mm: stop zeroing memory during allocation in vmemmap")
> broke XEN PV domains when deferred struct page initialization is enabled.
> 
> This is because the xen's PagePinned() flag is getting erased from struct
> pages when they are initialized later in boot.
> 
> Juergen fixed this problem by disabling deferred pages on xen pv domains.
> It is desirable, however, to have this feature available as it reduces boot
> time. This fix re-enables the feature for pv-dmains, and fixes the problem
> the following way:
> 
> The fix is to delay setting PagePinned flag until struct pages for all
> allocated memory are initialized, i.e. until after free_all_bootmem().
> 
> A new x86_init.hyper op init_after_bootmem() is called to let xen know
> that boot allocator is done, and hence struct pages for all the allocated
> memory are now initialized. If deferred page initialization is enabled, the
> rest of struct pages are going to be initialized later in boot once
> page_alloc_init_late() is called.
> 
> xen_after_bootmem() walks page table's pages and marks them pinned.
> 
> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>

Verified to work on a system where the original issue caused a crash.

Reviewed-by: Juergen Gross <jgross@suse.com>
Tested-by: Juergen Gross <jgross@suse.com>


Juergen

^ permalink raw reply

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
From: Jiri Pirko @ 2018-02-27  8:27 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Duyck, Alexander H, virtio-dev, Jakub Kicinski, Netdev,
	Alexander Duyck, virtualization, Siwei Liu, Samudrala, Sridhar,
	David Miller
In-Reply-To: <20180227030424-mutt-send-email-mst@kernel.org>

Tue, Feb 27, 2018 at 02:18:12AM CET, mst@redhat.com wrote:
>On Mon, Feb 26, 2018 at 05:02:18PM -0800, Stephen Hemminger wrote:
>> On Mon, 26 Feb 2018 08:19:24 +0100
>> Jiri Pirko <jiri@resnulli.us> wrote:
>> 
>> > Sat, Feb 24, 2018 at 12:59:04AM CET, stephen@networkplumber.org wrote:
>> > >On Thu, 22 Feb 2018 13:30:12 -0800
>> > >Alexander Duyck <alexander.duyck@gmail.com> wrote:
>> > >  
>> > >> > Again, I undertand your motivation. Yet I don't like your solution.
>> > >> > But if the decision is made to do this in-driver bonding. I would like
>> > >> > to see it baing done some generic way:
>> > >> > 1) share the same "in-driver bonding core" code with netvsc
>> > >> >    put to net/core.
>> > >> > 2) the "in-driver bonding core" will strictly limit the functionality,
>> > >> >    like active-backup mode only, one vf, one backup, vf netdev type
>> > >> >    check (so noone could enslave a tap or anything else)
>> > >> > If user would need something more, he should employ team/bond.    
>> > >
>> > >Sharing would be good, but netvsc world would really like to only have
>> > >one visible network device.  
>> > 
>> > Why do you mind? All would be the same, there would be just another
>> > netdevice unused by the vm user (same as the vf netdev).
>> > 
>> 
>> I mind because our requirement is no changes to userspace.
>> No special udev rules, no bonding script, no setup.
>
>Agreed. It is mostly fine from this point of view, except that you need
>to know to skip the slaves.  Maybe we could look at some kind of
>trick e.g. pretending link is down for slaves?

:O Another hack. Please, don't.


>
>> Things like cloudinit running on current distro's expect to see a single
>> eth0.  The VF device show up can also be an issue because distro's have
>> stupid rules like Network Manager trying to start DHCP on every interface.
>> We deal with that now by doing stuff like udev rules to get it to stop
>> but that is still causing user errors.

So that means that with an extra netdev for "virtio_net bypass" you will
face exactly the same problems. Should not be an issue for you then.


>
>So the ideal of a single net device isn't achieved by netvsc.
>
>Since you have scripts to skip the PT device, can't they
>hind the PV slave too? How do they identify the device to skip?
>
>I agree it would be nice to have a way to hide the extra netdev
>from userspace.

"A hidden netdevice", hmm. I believe that instead of doing hacks like
this, we should fix userspace to treat particular netdevices correctly.


>
>The benefit of the separation is that each slave device can
>be configured with e.g. its own native ethtool commands for
>optimum performance.
>
>-- 
>MST

^ permalink raw reply

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
From: Jiri Pirko @ 2018-02-27  8:49 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Duyck, Alexander H, virtio-dev, Michael S. Tsirkin,
	Jakub Kicinski, Sridhar Samudrala, virtualization, Siwei Liu,
	Netdev, David Miller
In-Reply-To: <CAKgT0UdU8PDXduzxp4kKfur-DLeFQSJj7-fhW_eTgVzd+AcViw@mail.gmail.com>

Tue, Feb 20, 2018 at 05:04:29PM CET, alexander.duyck@gmail.com wrote:
>On Tue, Feb 20, 2018 at 2:42 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>> Fri, Feb 16, 2018 at 07:11:19PM CET, sridhar.samudrala@intel.com wrote:
>>>Patch 1 introduces a new feature bit VIRTIO_NET_F_BACKUP that can be
>>>used by hypervisor to indicate that virtio_net interface should act as
>>>a backup for another device with the same MAC address.
>>>
>>>Ppatch 2 is in response to the community request for a 3 netdev
>>>solution.  However, it creates some issues we'll get into in a moment.
>>>It extends virtio_net to use alternate datapath when available and
>>>registered. When BACKUP feature is enabled, virtio_net driver creates
>>>an additional 'bypass' netdev that acts as a master device and controls
>>>2 slave devices.  The original virtio_net netdev is registered as
>>>'backup' netdev and a passthru/vf device with the same MAC gets
>>>registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
>>>associated with the same 'pci' device.  The user accesses the network
>>>interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev
>>>as default for transmits when it is available with link up and running.
>>
>> Sorry, but this is ridiculous. You are apparently re-implemeting part
>> of bonding driver as a part of NIC driver. Bond and team drivers
>> are mature solutions, well tested, broadly used, with lots of issues
>> resolved in the past. What you try to introduce is a weird shortcut
>> that already has couple of issues as you mentioned and will certanly
>> have many more. Also, I'm pretty sure that in future, someone comes up
>> with ideas like multiple VFs, LACP and similar bonding things.
>
>The problem with the bond and team drivers is they are too large and
>have too many interfaces available for configuration so as a result
>they can really screw this interface up.
>
>Essentially this is meant to be a bond that is more-or-less managed by
>the host, not the guest. We want the host to be able to configure it
>and have it automatically kick in on the guest. For now we want to
>avoid adding too much complexity as this is meant to be just the first
>step. Trying to go in and implement the whole solution right from the
>start based on existing drivers is going to be a massive time sink and
>will likely never get completed due to the fact that there is always
>going to be some other thing that will interfere.
>
>My personal hope is that we can look at doing a virtio-bond sort of
>device that will handle all this as well as providing a communication
>channel, but that is much further down the road. For now we only have
>a single bit so the goal for now is trying to keep this as simple as
>possible.

I have another usecase that would require the solution to be different
then what you suggest. Consider following scenario:
- baremetal has 2 sr-iov nics
- there is a vm, has 1 VF from each nics: vf0, vf1. No virtio_net
- baremetal would like to somehow tell the VM to bond vf0 and vf1
  together and how this bonding should be configured, according to how
  the VF representors are configured on the baremetal (LACP for example)

The baremetal could decide to remove any VF during the VM runtime, it
can add another VF there. For migration, it can add virtio_net. The VM
should be inctructed to bond all interfaces together according to how
baremetal decided - as it knows better.

For this we need a separate communication channel from baremetal to VM
(perhaps something re-usable already exists), we need something to
listen to the events coming from this channel (kernel/userspace) and to
react accordingly (create bond/team, enslave, etc).

Now the question is: is it possible to merge the demands you have and
the generic needs I described into a single solution? From what I see,
that would be quite hard/impossible. So at the end, I think that we have
to end-up with 2 solutions:
1) virtio_net, netvsc in-driver bonding - very limited, stupid, 0config
   solution that works for all (no matter what OS you use in VM)
2) team/bond solution with assistance of preferably userspace daemon
   getting info from baremetal. This is not 0config, but minimal config
   - user just have to define this "magic bonding" should be on.
   This covers all possible usecases, including multiple VFs, RDMA, etc.

Thoughts?

^ permalink raw reply

* Re: [PATCH RFC 1/2] virtio: introduce packed ring defines
From: Jens Freimann @ 2018-02-27  8:54 UTC (permalink / raw)
  To: Tiwei Bie; +Cc: mst, netdev, linux-kernel, virtualization, wexu
In-Reply-To: <20180223111801.15240-2-tiwei.bie@intel.com>

On Fri, Feb 23, 2018 at 07:18:00PM +0800, Tiwei Bie wrote:
>Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
>---
> include/uapi/linux/virtio_config.h | 18 +++++++++-
> include/uapi/linux/virtio_ring.h   | 68 ++++++++++++++++++++++++++++++++++++++
> 2 files changed, 85 insertions(+), 1 deletion(-)
>
>diff --git a/include/uapi/linux/virtio_config.h b/include/uapi/linux/virtio_config.h
>index 308e2096291f..e3d077ef5207 100644
>--- a/include/uapi/linux/virtio_config.h
>+++ b/include/uapi/linux/virtio_config.h
>@@ -49,7 +49,7 @@
>  * transport being used (eg. virtio_ring), the rest are per-device feature
>  * bits. */
> #define VIRTIO_TRANSPORT_F_START	28
>-#define VIRTIO_TRANSPORT_F_END		34
>+#define VIRTIO_TRANSPORT_F_END		37
>
> #ifndef VIRTIO_CONFIG_NO_LEGACY
> /* Do we get callbacks when the ring is completely used, even if we've
>@@ -71,4 +71,20 @@
>  * this is for compatibility with legacy systems.
>  */
> #define VIRTIO_F_IOMMU_PLATFORM		33
>+
>+/* This feature indicates support for the packed virtqueue layout. */
>+#define VIRTIO_F_RING_PACKED		34

Spec says VIRTIO_F_PACKED_RING not RING_PACKED
>+
>+/*
>+ * This feature indicates that all buffers are used by the device
>+ * in the same order in which they have been made available.
>+ */
>+#define VIRTIO_F_IN_ORDER		35
>+
>+/*
>+ * This feature indicates that drivers pass extra data (besides
>+ * identifying the Virtqueue) in their device notifications.
>+ */
>+#define VIRTIO_F_NOTIFICATION_DATA	36
>+
> #endif /* _UAPI_LINUX_VIRTIO_CONFIG_H */
>diff --git a/include/uapi/linux/virtio_ring.h b/include/uapi/linux/virtio_ring.h
>index 6d5d5faa989b..77b1d4aeef72 100644
>--- a/include/uapi/linux/virtio_ring.h
>+++ b/include/uapi/linux/virtio_ring.h
>@@ -44,6 +44,9 @@
> /* This means the buffer contains a list of buffer descriptors. */
> #define VRING_DESC_F_INDIRECT	4
>
>+#define VRING_DESC_F_AVAIL(b)	((b) << 7)
>+#define VRING_DESC_F_USED(b)	((b) << 15)
>+
> /* The Host uses this in used->flags to advise the Guest: don't kick me when
>  * you add a buffer.  It's unreliable, so it's simply an optimization.  Guest
>  * will still kick if it's out of buffers. */
>@@ -104,6 +107,36 @@ struct vring {
> 	struct vring_used *used;
> };
>
>+struct vring_packed_desc_event {
>+	/* Descriptor Event Offset */
>+	__virtio16 desc_event_off   : 15,
>+	/* Descriptor Event Wrap Counter */
>+		   desc_event_wrap  : 1;
>+	/* Descriptor Event Flags */
>+	__virtio16 desc_event_flags : 2;
>+};

Where would the virtqueue number go in driver notifications?

regards,
Jens 

^ permalink raw reply

* Re: [PATCH RFC 2/2] vhost: packed ring support
From: Tiwei Bie @ 2018-02-27  9:03 UTC (permalink / raw)
  To: Jason Wang; +Cc: mst, netdev, linux-kernel, virtualization, wexu
In-Reply-To: <1518575829-1431-3-git-send-email-jasowang@redhat.com>

On Wed, Feb 14, 2018 at 10:37:09AM +0800, Jason Wang wrote:
[...]
> +static void set_desc_used(struct vhost_virtqueue *vq,
> +			  struct vring_desc_packed *desc, bool wrap_counter)
> +{
> +	__virtio16 flags = desc->flags;
> +
> +	if (wrap_counter) {
> +		desc->flags |= cpu_to_vhost16(vq, DESC_AVAIL);
> +		desc->flags |= cpu_to_vhost16(vq, DESC_USED);
> +	} else {
> +		desc->flags &= ~cpu_to_vhost16(vq, DESC_AVAIL);
> +		desc->flags &= ~cpu_to_vhost16(vq, DESC_USED);
> +	}
> +
> +	desc->flags = flags;

The `desc->flags` is restored after the change.

> +}
> +
> +static int vhost_get_vq_desc_packed(struct vhost_virtqueue *vq,
> +				    struct iovec iov[], unsigned int iov_size,
> +				    unsigned int *out_num, unsigned int *in_num,
> +				    struct vhost_log *log,
> +				    unsigned int *log_num)
> +{
> +	struct vring_desc_packed desc;
> +	int ret, access, i;
> +	u16 avail_idx = vq->last_avail_idx;
> +
> +	/* When we start there are none of either input nor output. */
> +	*out_num = *in_num = 0;
> +	if (unlikely(log))
> +		*log_num = 0;
> +
> +	do {
> +		unsigned int iov_count = *in_num + *out_num;
> +
> +		i = vq->last_avail_idx & (vq->num - 1);

The queue size may not be a power of 2 in packed ring.

Best regards,
Tiwei Bie

^ permalink raw reply

* Re: [PATCH RFC 1/2] virtio: introduce packed ring defines
From: Jens Freimann @ 2018-02-27  9:18 UTC (permalink / raw)
  To: Tiwei Bie; +Cc: mst, netdev, linux-kernel, virtualization, wexu
In-Reply-To: <20180227085458.h5dytkwaeczrrijr@dhcp-192-241.str.redhat.com>

On Tue, Feb 27, 2018 at 09:54:58AM +0100, Jens Freimann wrote:
>On Fri, Feb 23, 2018 at 07:18:00PM +0800, Tiwei Bie wrote:
>>Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
>>---
>>include/uapi/linux/virtio_config.h | 18 +++++++++-
>>include/uapi/linux/virtio_ring.h   | 68 ++++++++++++++++++++++++++++++++++++++
>>2 files changed, 85 insertions(+), 1 deletion(-)
>>
>>diff --git a/include/uapi/linux/virtio_config.h b/include/uapi/linux/virtio_config.h
>>index 308e2096291f..e3d077ef5207 100644
>>--- a/include/uapi/linux/virtio_config.h
>>+++ b/include/uapi/linux/virtio_config.h
>>@@ -49,7 +49,7 @@
>> * transport being used (eg. virtio_ring), the rest are per-device feature
>> * bits. */
>>#define VIRTIO_TRANSPORT_F_START	28
>>-#define VIRTIO_TRANSPORT_F_END		34
>>+#define VIRTIO_TRANSPORT_F_END		37
>>
>>#ifndef VIRTIO_CONFIG_NO_LEGACY
>>/* Do we get callbacks when the ring is completely used, even if we've
>>@@ -71,4 +71,20 @@
>> * this is for compatibility with legacy systems.
>> */
>>#define VIRTIO_F_IOMMU_PLATFORM		33
>>+
>>+/* This feature indicates support for the packed virtqueue layout. */
>>+#define VIRTIO_F_RING_PACKED		34
>
>Spec says VIRTIO_F_PACKED_RING not RING_PACKED

Ignore this. Seems to have changed.

regards,
Jens 

^ permalink raw reply

* RE: [PATCH RFC 1/2] virtio: introduce packed ring defines
From: David Laight @ 2018-02-27  9:26 UTC (permalink / raw)
  To: 'Tiwei Bie', mst@redhat.com,
	virtualization@lists.linux-foundation.org,
	linux-kernel@vger.kernel.org, netdev@vger.kernel.org
  Cc: wexu@redhat.com
In-Reply-To: <20180223111801.15240-2-tiwei.bie@intel.com>

From: Tiwei Bie
> Sent: 23 February 2018 11:18
...
> +struct vring_packed_desc_event {
> +	/* Descriptor Event Offset */
> +	__virtio16 desc_event_off   : 15,
> +	/* Descriptor Event Wrap Counter */
> +		   desc_event_wrap  : 1;
> +	/* Descriptor Event Flags */
> +	__virtio16 desc_event_flags : 2;
> +};

This looks like you are assuming that a bit-field has a defined
layout and can be used to map a 'hardware' structure.
The don't, don't use them like that.

	David

^ permalink raw reply

* Re: [PATCH RFC 1/2] virtio: introduce packed ring defines
From: Tiwei Bie @ 2018-02-27 11:31 UTC (permalink / raw)
  To: David Laight
  Cc: mst@redhat.com, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org, wexu@redhat.com
In-Reply-To: <e7a7f045b87a420a8863030b9378cb38@AcuMS.aculab.com>

On Tue, Feb 27, 2018 at 09:26:27AM +0000, David Laight wrote:
> From: Tiwei Bie
> > Sent: 23 February 2018 11:18
> ...
> > +struct vring_packed_desc_event {
> > +	/* Descriptor Event Offset */
> > +	__virtio16 desc_event_off   : 15,
> > +	/* Descriptor Event Wrap Counter */
> > +		   desc_event_wrap  : 1;
> > +	/* Descriptor Event Flags */
> > +	__virtio16 desc_event_flags : 2;
> > +};
> 
> This looks like you are assuming that a bit-field has a defined
> layout and can be used to map a 'hardware' structure.
> The don't, don't use them like that.
> 
> 	David
> 

Thanks for the comments! Above definition isn't used in
this RFC, and the corresponding parts (event suppression)
haven't been implemented yet. It's more like some pseudo
code (I should add some comments about this in the code).

I planned to change it to something like this in the next
version:

struct vring_packed_desc_event {
	__virtio16 off_wrap;
	__virtio16 flags;  // XXX maybe not a good name for future
};                         // extension. Only 2bits are used now.

But it seems that I had a misunderstanding about the spec
on this previously:

https://lists.oasis-open.org/archives/virtio-dev/201802/msg00173.html

Anyway, it will be addressed. Thank you very much! ;-)

Best regards,
Tiwei Bie

^ permalink raw reply

* Re: [PATCH RFC 1/2] virtio: introduce packed ring defines
From: Tiwei Bie @ 2018-02-27 12:01 UTC (permalink / raw)
  To: Jens Freimann; +Cc: mst, netdev, linux-kernel, virtualization, wexu
In-Reply-To: <20180227085458.h5dytkwaeczrrijr@dhcp-192-241.str.redhat.com>

On Tue, Feb 27, 2018 at 09:54:58AM +0100, Jens Freimann wrote:
> On Fri, Feb 23, 2018 at 07:18:00PM +0800, Tiwei Bie wrote:
[...]
> > 
> > +struct vring_packed_desc_event {
> > +	/* Descriptor Event Offset */
> > +	__virtio16 desc_event_off   : 15,
> > +	/* Descriptor Event Wrap Counter */
> > +		   desc_event_wrap  : 1;
> > +	/* Descriptor Event Flags */
> > +	__virtio16 desc_event_flags : 2;
> > +};
> 
> Where would the virtqueue number go in driver notifications?

This structure is for event suppression instead of notification.

You could refer to the "Event Suppression Structure Format"
section of the spec for more details:

https://github.com/oasis-tcs/virtio-docs/blob/master/virtio-v1.1-packed-wd08.pdf

Best regards,
Tiwei Bie

^ permalink raw reply

* Re: v4.16-rc2: virtio-block + ext4 lockdep splats / sleeping from invalid context
From: Mark Rutland @ 2018-02-27 12:03 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, Theodore Ts'o, Michael S. Tsirkin, linux-kernel,
	virtualization, linux-block, Jan Kara, linux-ext4
In-Reply-To: <20180226124455.rkp3xfn3cqplzh4p@quack2.suse.cz>

On Mon, Feb 26, 2018 at 01:44:55PM +0100, Jan Kara wrote:
> On Mon 26-02-18 11:38:19, Mark Rutland wrote:
> > That seems to be it!
> > 
> > With the below patch applied, I can't trigger the bug after ~10 minutes,
> > whereas prior to the patch I can trigger it in ~10 seconds. I'll leave
> > that running for a while just in case there's another part to the
> > problem, but FWIW:
> > 
> > Tested-by: Mark Rutland <mark.rutland@arm.com>
> 
> Thanks for testing! Sent the patch to Jens for inclusion.

Cheers!

FWIW, I left my test case running for a day with no issue, so this looks
rock solid.

Mark.

^ permalink raw reply

* Re: [PATCH 1/4] iommu: Add virtio-iommu driver
From: Michael S. Tsirkin @ 2018-02-27 14:47 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: kvm@vger.kernel.org, Will Deacon,
	virtualization@lists.linux-foundation.org,
	tnowicki@caviumnetworks.com, jintack@cs.columbia.edu,
	eric.auger.pro@gmail.com, virtio-dev@lists.oasis-open.org,
	jayachandran.nair@cavium.com, Lorenzo Pieralisi,
	kbuild test robot, joro@8bytes.org, kvmarm@lists.cs.columbia.edu,
	Marc Zyngier, eric.auger@redhat.com,
	iommu@lists.linux-foundation.org, kbuild-all@01.org, Robin Murphy
In-Reply-To: <e5ffc52f-4510-f757-aa83-2a99af3ae06b@arm.com>

On Thu, Feb 22, 2018 at 11:04:30AM +0000, Jean-Philippe Brucker wrote:
> On 21/02/18 20:12, kbuild test robot wrote:
> [...]
> > config: arm64-allmodconfig (attached as .config)
> [...]
> >    aarch64-linux-gnu-ld: arch/arm64/kernel/head.o: relocation R_AARCH64_ABS32 against `_kernel_offset_le_lo32' can not be used when making a shared object
> >    arch/arm64/kernel/head.o: In function `kimage_vaddr':
> >    (.idmap.text+0x0): dangerous relocation: unsupported relocation
> 
> Is this related?
> 
> >    arch/arm64/kernel/head.o: In function `__primary_switch':
> >    (.idmap.text+0x340): dangerous relocation: unsupported relocation
> >    (.idmap.text+0x348): dangerous relocation: unsupported relocation
> >    drivers/iommu/virtio-iommu.o: In function `viommu_probe':
> >    virtio-iommu.c:(.text+0xbdc): undefined reference to `virtio_check_driver_offered_feature'
> >    virtio-iommu.c:(.text+0xcfc): undefined reference to `virtio_check_driver_offered_feature'
> >    virtio-iommu.c:(.text+0xe10): undefined reference to `virtio_check_driver_offered_feature'
> >    drivers/iommu/virtio-iommu.o: In function `viommu_send_reqs_sync':
> >    virtio-iommu.c:(.text+0x130c): undefined reference to `virtqueue_add_sgs'
> >    virtio-iommu.c:(.text+0x1398): undefined reference to `virtqueue_kick'
> >    virtio-iommu.c:(.text+0x14d4): undefined reference to `virtqueue_get_buf'
> >    drivers/iommu/virtio-iommu.o: In function `virtio_iommu_drv_init':
> >    virtio-iommu.c:(.init.text+0x1c): undefined reference to `register_virtio_driver'
> >    drivers/iommu/virtio-iommu.o: In function `virtio_iommu_drv_exit':
> >>> virtio-iommu.c:(.exit.text+0x14): undefined reference to `unregister_virtio_driver'
> 
> Right. At the moment CONFIG_VIRTIO_IOMMU is a bool instead of tristate,
> because the IOMMU subsystem isn't entirely ready to have IOMMU drivers
> built as modules. In addition to exporting symbols it would also needs to
> hold off probing endpoints behind the IOMMU until the IOMMU driver is
> loaded. At the moment (I think) it gives up once userspace is reached (see
> of_iommu_driver_present).
> 
> The above report is due to VIRTIO=m and VIRTIO_IOMMU=y. To solve it we could:
> 
> a) Allow VIRTIO_IOMMU to be built as module by exporting a dozen IOMMU
> symbols. It would be a lie. The driver wouldn't be usable because of the
> probe issue discussed above, but it would build.
> 
> b) Actually make any IOMMU driver work as module. Whilst it would
> certainly be a nice feature, it's a bigger problem and I don't think it
> has its place in this series.
> 
> c) Make VIRTIO_IOMMU depend on VIRTIO_MMIO=y instead of VIRTIO_MMIO, which
> I think is the sanest for now (and does work), even though distro kernels
> probably all have VIRTIO=m.
> 
> I prefer doing c) now and experiment with b) once I got some time.
> 
> Thanks,
> Jean

It kind of means it's a toy for now though. So fine as long
as it's out of tree.

-- 
MST

^ permalink raw reply

* Re: [PATCH] virtio_balloon: export huge page allocation statistics
From: Michael S. Tsirkin @ 2018-02-27 15:20 UTC (permalink / raw)
  To: Jonathan Helman; +Cc: linux-kernel, virtualization
In-Reply-To: <1518846272-5216-1-git-send-email-jonathan.helman@oracle.com>

On Fri, Feb 16, 2018 at 09:44:32PM -0800, Jonathan Helman wrote:
> Export statistics for successful and failed huge page allocations
> from the virtio balloon driver. These 2 stats come directly from
> the vm_events HTLB_BUDDY_PGALLOC and HTLB_BUDDY_PGALLOC_FAIL.
> Signed-off-by: Jonathan Helman <jonathan.helman@oracle.com>

Any host/guest intergace changes need to be copied to
the virtio TC mailing list (subscriber-only, sorry about that):
virtio-dev@lists.oasis-open.org

> ---
>  drivers/virtio/virtio_balloon.c     | 6 ++++++
>  include/uapi/linux/virtio_balloon.h | 4 +++-
>  2 files changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index dfe5684..6b237e3 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -272,6 +272,12 @@ static unsigned int update_balloon_stats(struct virtio_balloon *vb)
>  				pages_to_bytes(events[PSWPOUT]));
>  	update_stat(vb, idx++, VIRTIO_BALLOON_S_MAJFLT, events[PGMAJFAULT]);
>  	update_stat(vb, idx++, VIRTIO_BALLOON_S_MINFLT, events[PGFAULT]);
> +#ifdef CONFIG_HUGETLB_PAGE
> +	update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGALLOC,
> +		    events[HTLB_BUDDY_PGALLOC]);
> +	update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGFAIL,
> +		    events[HTLB_BUDDY_PGALLOC_FAIL]);
> +#endif
>  #endif
>  	update_stat(vb, idx++, VIRTIO_BALLOON_S_MEMFREE,
>  				pages_to_bytes(i.freeram));
> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> index 4e8b830..e3e8071 100644
> --- a/include/uapi/linux/virtio_balloon.h
> +++ b/include/uapi/linux/virtio_balloon.h
> @@ -53,7 +53,9 @@ struct virtio_balloon_config {
>  #define VIRTIO_BALLOON_S_MEMTOT   5   /* Total amount of memory */
>  #define VIRTIO_BALLOON_S_AVAIL    6   /* Available memory as in /proc */
>  #define VIRTIO_BALLOON_S_CACHES   7   /* Disk caches */
> -#define VIRTIO_BALLOON_S_NR       8
> +#define VIRTIO_BALLOON_S_HTLB_PGALLOC  8  /* Number of htlb pgalloc successes */
> +#define VIRTIO_BALLOON_S_HTLB_PGFAIL   9  /* Number of htlb pgalloc failures */
> +#define VIRTIO_BALLOON_S_NR       10

Can you clarify the comments pls? Eschew abbreviation.

>  
>  /*
>   * Memory statistics structure.
> -- 
> 1.8.3.1

^ permalink raw reply

* Re: [PATCH 4/6] x86: Consolidate PCI_MMCONFIG configs
From: Andy Shevchenko @ 2018-02-27 15:47 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: jailhouse-dev, linux-pci, x86, Linux Kernel Mailing List,
	virtualization, Ingo Molnar, H . Peter Anvin, Bjorn Helgaas,
	Thomas Gleixner
In-Reply-To: <45f0d934-c4cf-fdc7-d7ca-17373c01e733@siemens.com>

On Tue, Feb 27, 2018 at 9:19 AM, Jan Kiszka <jan.kiszka@siemens.com> wrote:
> On 2018-01-28 18:26, Andy Shevchenko wrote:
>> On Mon, Jan 22, 2018 at 8:12 AM, Jan Kiszka <jan.kiszka@siemens.com> wrote:
>>> From: Jan Kiszka <jan.kiszka@siemens.com>
>>>
>>> Not sure if those two worked by design or just by chance so far. In any
>>> case, it's at least cleaner and clearer to express this in a single
>>> config statement.
>>
>> Congrats! You found by the way a bug in
>>
>> commit e279b6c1d329e50b766bce96aacc197eae8a053b
>> Author: Sam Ravnborg <sam@ravnborg.org>
>> Date:   Tue Nov 6 20:41:05 2007 +0100
>>
>>    x86: start unification of arch/x86/Kconfig.*
>>
>> ...and proper fix seems to split PCI stuff to common + X86_32 only + X86_64 only
>>
>
> Hmm, is that a change request on this patch?

From my side it's a suggestion.
Better wait for the answer from x86 maintainers.

-- 
With Best Regards,
Andy Shevchenko

^ permalink raw reply

* Re: [PATCH 2/6] pci: Scan all functions when probing while running over Jailhouse
From: Andy Shevchenko @ 2018-02-27 15:48 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: jailhouse-dev, Benedikt Spranger, linux-pci, x86,
	Linux Kernel Mailing List, virtualization, Ingo Molnar,
	H . Peter Anvin, Bjorn Helgaas, Thomas Gleixner
In-Reply-To: <17710852-a221-c707-0438-44031666028e@siemens.com>

On Tue, Feb 27, 2018 at 9:22 AM, Jan Kiszka <jan.kiszka@siemens.com> wrote:
> On 2018-02-23 14:23, Andy Shevchenko wrote:
>> On Mon, Jan 22, 2018 at 8:12 AM, Jan Kiszka <jan.kiszka@siemens.com> wrote:

>>>  #include <linux/acpi.h>
>>>  #include <linux/irqdomain.h>
>>>  #include <linux/pm_runtime.h>
>>> +#include <linux/hypervisor.h>
>>
>> Ditto.
>>
>
> Despite the context suggesting it, this file has no ordering.

At least you might not increase disordering by putting the line after acpi.h.

-- 
With Best Regards,
Andy Shevchenko

^ permalink raw reply

* CFP PECCS 2018 - 8th Int.l Joint Conf. on Pervasive and Embedded Computing and Communication Systems (Porto/Portugal)
From: peccs @ 2018-02-27 16:10 UTC (permalink / raw)
  To: virtualization

SUBMISSION DEADLINE 

8th International Joint Conference on Pervasive and Embedded Computing and Communication Systems

Submission Deadline: March 13, 2018

http://www.peccs.org/

July 29 - 30, 2018
Porto, Portugal.

In Cooperation with EUROMICRO.

A short list of presented papers will be selected so that revised and extended versions of these papers will be published by Springer.

All papers presented at the congress venue will also be available at the SCITEPRESS Digital Library (http://www.scitepress.org/DigitalLibrary/).

Should you have any question please don’t hesitate contacting me.

Kind regards,
PECCS Secretariat

Address: Av. D. Manuel I, 27A, 2º esq.
2910-595 Setubal, Portugal
Tel: +351 265 520 184
Fax: +351 265 520 186
Web: http://www.peccs.org/
e-mail: peccs.secretariat@insticc.org

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH RFC 1/2] virtio: introduce packed ring defines
From: Michael S. Tsirkin @ 2018-02-27 20:28 UTC (permalink / raw)
  To: Jens Freimann; +Cc: netdev, linux-kernel, virtualization, wexu
In-Reply-To: <20180227085458.h5dytkwaeczrrijr@dhcp-192-241.str.redhat.com>

On Tue, Feb 27, 2018 at 09:54:58AM +0100, Jens Freimann wrote:
> On Fri, Feb 23, 2018 at 07:18:00PM +0800, Tiwei Bie wrote:
> > Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
> > ---
> > include/uapi/linux/virtio_config.h | 18 +++++++++-
> > include/uapi/linux/virtio_ring.h   | 68 ++++++++++++++++++++++++++++++++++++++
> > 2 files changed, 85 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/uapi/linux/virtio_config.h b/include/uapi/linux/virtio_config.h
> > index 308e2096291f..e3d077ef5207 100644
> > --- a/include/uapi/linux/virtio_config.h
> > +++ b/include/uapi/linux/virtio_config.h
> > @@ -49,7 +49,7 @@
> >  * transport being used (eg. virtio_ring), the rest are per-device feature
> >  * bits. */
> > #define VIRTIO_TRANSPORT_F_START	28
> > -#define VIRTIO_TRANSPORT_F_END		34
> > +#define VIRTIO_TRANSPORT_F_END		37
> > 
> > #ifndef VIRTIO_CONFIG_NO_LEGACY
> > /* Do we get callbacks when the ring is completely used, even if we've
> > @@ -71,4 +71,20 @@
> >  * this is for compatibility with legacy systems.
> >  */
> > #define VIRTIO_F_IOMMU_PLATFORM		33
> > +
> > +/* This feature indicates support for the packed virtqueue layout. */
> > +#define VIRTIO_F_RING_PACKED		34
> 
> Spec says VIRTIO_F_PACKED_RING not RING_PACKED

I changed that now :)

> > +
> > +/*
> > + * This feature indicates that all buffers are used by the device
> > + * in the same order in which they have been made available.
> > + */
> > +#define VIRTIO_F_IN_ORDER		35
> > +
> > +/*
> > + * This feature indicates that drivers pass extra data (besides
> > + * identifying the Virtqueue) in their device notifications.
> > + */
> > +#define VIRTIO_F_NOTIFICATION_DATA	36
> > +
> > #endif /* _UAPI_LINUX_VIRTIO_CONFIG_H */
> > diff --git a/include/uapi/linux/virtio_ring.h b/include/uapi/linux/virtio_ring.h
> > index 6d5d5faa989b..77b1d4aeef72 100644
> > --- a/include/uapi/linux/virtio_ring.h
> > +++ b/include/uapi/linux/virtio_ring.h
> > @@ -44,6 +44,9 @@
> > /* This means the buffer contains a list of buffer descriptors. */
> > #define VRING_DESC_F_INDIRECT	4
> > 
> > +#define VRING_DESC_F_AVAIL(b)	((b) << 7)
> > +#define VRING_DESC_F_USED(b)	((b) << 15)
> > +
> > /* The Host uses this in used->flags to advise the Guest: don't kick me when
> >  * you add a buffer.  It's unreliable, so it's simply an optimization.  Guest
> >  * will still kick if it's out of buffers. */
> > @@ -104,6 +107,36 @@ struct vring {
> > 	struct vring_used *used;
> > };
> > 
> > +struct vring_packed_desc_event {
> > +	/* Descriptor Event Offset */
> > +	__virtio16 desc_event_off   : 15,
> > +	/* Descriptor Event Wrap Counter */
> > +		   desc_event_wrap  : 1;
> > +	/* Descriptor Event Flags */
> > +	__virtio16 desc_event_flags : 2;
> > +};
> 
> Where would the virtqueue number go in driver notifications?
> 
> regards,
> Jens

^ permalink raw reply

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
From: Alexander Duyck @ 2018-02-27 21:16 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Duyck, Alexander H, virtio-dev, Michael S. Tsirkin,
	Jakub Kicinski, Sridhar Samudrala, virtualization, Siwei Liu,
	Netdev, David Miller
In-Reply-To: <20180227084959.GB2005@nanopsycho>

On Tue, Feb 27, 2018 at 12:49 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> Tue, Feb 20, 2018 at 05:04:29PM CET, alexander.duyck@gmail.com wrote:
>>On Tue, Feb 20, 2018 at 2:42 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>> Fri, Feb 16, 2018 at 07:11:19PM CET, sridhar.samudrala@intel.com wrote:
>>>>Patch 1 introduces a new feature bit VIRTIO_NET_F_BACKUP that can be
>>>>used by hypervisor to indicate that virtio_net interface should act as
>>>>a backup for another device with the same MAC address.
>>>>
>>>>Ppatch 2 is in response to the community request for a 3 netdev
>>>>solution.  However, it creates some issues we'll get into in a moment.
>>>>It extends virtio_net to use alternate datapath when available and
>>>>registered. When BACKUP feature is enabled, virtio_net driver creates
>>>>an additional 'bypass' netdev that acts as a master device and controls
>>>>2 slave devices.  The original virtio_net netdev is registered as
>>>>'backup' netdev and a passthru/vf device with the same MAC gets
>>>>registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
>>>>associated with the same 'pci' device.  The user accesses the network
>>>>interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev
>>>>as default for transmits when it is available with link up and running.
>>>
>>> Sorry, but this is ridiculous. You are apparently re-implemeting part
>>> of bonding driver as a part of NIC driver. Bond and team drivers
>>> are mature solutions, well tested, broadly used, with lots of issues
>>> resolved in the past. What you try to introduce is a weird shortcut
>>> that already has couple of issues as you mentioned and will certanly
>>> have many more. Also, I'm pretty sure that in future, someone comes up
>>> with ideas like multiple VFs, LACP and similar bonding things.
>>
>>The problem with the bond and team drivers is they are too large and
>>have too many interfaces available for configuration so as a result
>>they can really screw this interface up.
>>
>>Essentially this is meant to be a bond that is more-or-less managed by
>>the host, not the guest. We want the host to be able to configure it
>>and have it automatically kick in on the guest. For now we want to
>>avoid adding too much complexity as this is meant to be just the first
>>step. Trying to go in and implement the whole solution right from the
>>start based on existing drivers is going to be a massive time sink and
>>will likely never get completed due to the fact that there is always
>>going to be some other thing that will interfere.
>>
>>My personal hope is that we can look at doing a virtio-bond sort of
>>device that will handle all this as well as providing a communication
>>channel, but that is much further down the road. For now we only have
>>a single bit so the goal for now is trying to keep this as simple as
>>possible.
>
> I have another usecase that would require the solution to be different
> then what you suggest. Consider following scenario:
> - baremetal has 2 sr-iov nics
> - there is a vm, has 1 VF from each nics: vf0, vf1. No virtio_net
> - baremetal would like to somehow tell the VM to bond vf0 and vf1
>   together and how this bonding should be configured, according to how
>   the VF representors are configured on the baremetal (LACP for example)
>
> The baremetal could decide to remove any VF during the VM runtime, it
> can add another VF there. For migration, it can add virtio_net. The VM
> should be inctructed to bond all interfaces together according to how
> baremetal decided - as it knows better.
>
> For this we need a separate communication channel from baremetal to VM
> (perhaps something re-usable already exists), we need something to
> listen to the events coming from this channel (kernel/userspace) and to
> react accordingly (create bond/team, enslave, etc).
>
> Now the question is: is it possible to merge the demands you have and
> the generic needs I described into a single solution? From what I see,
> that would be quite hard/impossible. So at the end, I think that we have
> to end-up with 2 solutions:
> 1) virtio_net, netvsc in-driver bonding - very limited, stupid, 0config
>    solution that works for all (no matter what OS you use in VM)
> 2) team/bond solution with assistance of preferably userspace daemon
>    getting info from baremetal. This is not 0config, but minimal config
>    - user just have to define this "magic bonding" should be on.
>    This covers all possible usecases, including multiple VFs, RDMA, etc.
>
> Thoughts?

So that is about what I had in mind. We end up having to do something
completely different to support this more complex solution. I think we
might have referred to it as v2/v3 in a different thread, and
virt-bond in this thread.

Basically we need some sort of PCI or PCIe topology mapping for the
devices that can be translated into something we can communicate over
the communication channel. After that we also have the added
complexity of how do we figure out which Tx path we want to choose.
This is one of the reasons why I was thinking of something like a eBPF
blob that is handed up from the host side and into the guest to select
the Tx queue. That way when we add some new approach such as a
NUMA/cpu based netdev selection then we just provide an eBPF blob that
does that. Most of this is just theoretical at this point though since
I haven't had a chance to look into it too deeply yet. If you want to
take something like this on the help would always be welcome.. :)

The other thing I am looking at is trying to find a good way to do
dirty page tracking in the hypervisor using something like a
para-virtual IOMMU. However I don't have any ETA on that as I am just
starting out and have limited development time. If we get that in
place we can leave the VF in the guest until the very last moments
instead of having to remove it before we start the live migration.

- Alex

^ permalink raw reply

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
From: Michael S. Tsirkin @ 2018-02-27 21:23 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Duyck, Alexander H, virtio-dev, Jiri Pirko, Jakub Kicinski,
	Sridhar Samudrala, virtualization, Siwei Liu, Netdev,
	David Miller
In-Reply-To: <CAKgT0Ucp8WhkORpVVy2VWwuvxZHZjXKn6YfHfVaLAVVybf3t5w@mail.gmail.com>

On Tue, Feb 27, 2018 at 01:16:21PM -0800, Alexander Duyck wrote:
> The other thing I am looking at is trying to find a good way to do
> dirty page tracking in the hypervisor using something like a
> para-virtual IOMMU. However I don't have any ETA on that as I am just
> starting out and have limited development time. If we get that in
> place we can leave the VF in the guest until the very last moments
> instead of having to remove it before we start the live migration.
> 
> - Alex

I actually think your old RFC would be a good starting point:
https://lkml.org/lkml/2016/1/5/104

What is missing is I think enabling/disabling dynamically.

Seems to be easier than tracking by the hypervisor.

-- 
MST

^ permalink raw reply

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
From: Michael S. Tsirkin @ 2018-02-27 21:30 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Duyck, Alexander H, virtio-dev, Jakub Kicinski, Sridhar Samudrala,
	Alexander Duyck, virtualization, Siwei Liu, Netdev, David Miller
In-Reply-To: <20180227084959.GB2005@nanopsycho>

On Tue, Feb 27, 2018 at 09:49:59AM +0100, Jiri Pirko wrote:
> Now the question is: is it possible to merge the demands you have and
> the generic needs I described into a single solution? From what I see,
> that would be quite hard/impossible. So at the end, I think that we have
> to end-up with 2 solutions:
> 1) virtio_net, netvsc in-driver bonding - very limited, stupid, 0config
>    solution that works for all (no matter what OS you use in VM)
> 2) team/bond solution with assistance of preferably userspace daemon
>    getting info from baremetal. This is not 0config, but minimal config
>    - user just have to define this "magic bonding" should be on.
>    This covers all possible usecases, including multiple VFs, RDMA, etc.
> 
> Thoughts?

I think I agree. This RFC is trying to do 1 above.  Looks like we now
all agree 1 and 2 are not exclusive, both have place in the kernel. Is
that right?

-- 
MST

^ permalink raw reply

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
From: Jakub Kicinski @ 2018-02-27 21:41 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Duyck, Alexander H, virtio-dev, Jiri Pirko, Michael S. Tsirkin,
	Sridhar Samudrala, virtualization, Siwei Liu, Netdev,
	David Miller
In-Reply-To: <CAKgT0Ucp8WhkORpVVy2VWwuvxZHZjXKn6YfHfVaLAVVybf3t5w@mail.gmail.com>

On Tue, 27 Feb 2018 13:16:21 -0800, Alexander Duyck wrote:
> Basically we need some sort of PCI or PCIe topology mapping for the
> devices that can be translated into something we can communicate over
> the communication channel. 

Hm.  This is probably a completely stupid idea, but if we need to
start marshalling configuration requests/hints maybe the entire problem
could be solved by opening a netlink socket from hypervisor?  Even make
teamd run on the hypervisor side...

^ permalink raw reply

* Re: [PATCH RFC 2/2] vhost: packed ring support
From: Jason Wang @ 2018-02-28  2:38 UTC (permalink / raw)
  To: Tiwei Bie; +Cc: mst, netdev, linux-kernel, virtualization, wexu
In-Reply-To: <20180227090319.h7hxmxp5rpozrb5s@debian>



On 2018年02月27日 17:03, Tiwei Bie wrote:
> On Wed, Feb 14, 2018 at 10:37:09AM +0800, Jason Wang wrote:
> [...]
>> +static void set_desc_used(struct vhost_virtqueue *vq,
>> +			  struct vring_desc_packed *desc, bool wrap_counter)
>> +{
>> +	__virtio16 flags = desc->flags;
>> +
>> +	if (wrap_counter) {
>> +		desc->flags |= cpu_to_vhost16(vq, DESC_AVAIL);
>> +		desc->flags |= cpu_to_vhost16(vq, DESC_USED);
>> +	} else {
>> +		desc->flags &= ~cpu_to_vhost16(vq, DESC_AVAIL);
>> +		desc->flags &= ~cpu_to_vhost16(vq, DESC_USED);
>> +	}
>> +
>> +	desc->flags = flags;
> The `desc->flags` is restored after the change.

Right, will fix.

>> +}
>> +
>> +static int vhost_get_vq_desc_packed(struct vhost_virtqueue *vq,
>> +				    struct iovec iov[], unsigned int iov_size,
>> +				    unsigned int *out_num, unsigned int *in_num,
>> +				    struct vhost_log *log,
>> +				    unsigned int *log_num)
>> +{
>> +	struct vring_desc_packed desc;
>> +	int ret, access, i;
>> +	u16 avail_idx = vq->last_avail_idx;
>> +
>> +	/* When we start there are none of either input nor output. */
>> +	*out_num = *in_num = 0;
>> +	if (unlikely(log))
>> +		*log_num = 0;
>> +
>> +	do {
>> +		unsigned int iov_count = *in_num + *out_num;
>> +
>> +		i = vq->last_avail_idx & (vq->num - 1);
> The queue size may not be a power of 2 in packed ring.

Will fix this too.

Thanks

> Best regards,
> Tiwei Bie

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* [PATCH v2 0/6] jailhouse: Enhance secondary Jailhouse guest support /wrt PCI
From: Jan Kiszka @ 2018-02-28  6:34 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, Bjorn Helgaas
  Cc: jailhouse-dev, Mark Rutland, Benedikt Spranger, linux-pci, x86,
	Linux Kernel Mailing List, virtualization, Rob Herring,
	Otavio Pontes

Basic x86 support [1] for running Linux as secondary Jailhouse [2] guest
is currently pending in the tip tree. This builds on top and enhances
the PCI support for x86 and also ARM guests (ARM[64] does not require
platform patches and works already).

Key elements of this series are:
 - detection of Jailhouse via device tree hypervisor node
 - function-level PCI scan if Jailhouse is detected
 - MMCONFIG support for x86 guests

As most changes affect x86, I would suggest to route the series also via
tip after the necessary acks are collected.

Changes in v2:
 - adjusted commit log and include ordering in patch 2
 - rebased over Linus master

Jan

[1] https://lkml.org/lkml/2017/11/27/125
[2] http://jailhouse-project.org

CC: Benedikt Spranger <b.spranger@linutronix.de>
CC: Mark Rutland <mark.rutland@arm.com>
CC: Otavio Pontes <otavio.pontes@intel.com>
CC: Rob Herring <robh+dt@kernel.org>

Jan Kiszka (5):
  jailhouse: Provide detection for non-x86 systems
  PCI: Scan all functions when running over Jailhouse
  x86: Consolidate PCI_MMCONFIG configs
  x86/jailhouse: Allow to use PCI_MMCONFIG without ACPI
  MAINTAINERS: Add entry for Jailhouse

Otavio Pontes (1):
  x86/jailhouse: Enable PCI mmconfig access in inmates

 Documentation/devicetree/bindings/jailhouse.txt |  8 ++++++++
 MAINTAINERS                                     |  7 +++++++
 arch/x86/Kconfig                                | 11 ++++++-----
 arch/x86/include/asm/jailhouse_para.h           |  2 +-
 arch/x86/include/asm/pci_x86.h                  |  2 ++
 arch/x86/kernel/Makefile                        |  2 +-
 arch/x86/kernel/cpu/amd.c                       |  2 +-
 arch/x86/kernel/jailhouse.c                     |  7 +++++++
 arch/x86/pci/legacy.c                           |  4 +++-
 arch/x86/pci/mmconfig-shared.c                  |  4 ++--
 drivers/pci/probe.c                             |  4 +++-
 include/linux/hypervisor.h                      | 17 +++++++++++++++--
 12 files changed, 56 insertions(+), 14 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/jailhouse.txt

-- 
2.13.6

^ permalink raw reply

* [PATCH v2 1/6] jailhouse: Provide detection for non-x86 systems
From: Jan Kiszka @ 2018-02-28  6:34 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, Bjorn Helgaas
  Cc: jailhouse-dev, Mark Rutland, linux-pci, x86,
	Linux Kernel Mailing List, virtualization, Rob Herring
In-Reply-To: <cover.1519799691.git.jan.kiszka@siemens.com>

From: Jan Kiszka <jan.kiszka@siemens.com>

Implement jailhouse_paravirt() via device tree probing on architectures
!= x86. Will be used by the PCI core.

CC: Rob Herring <robh+dt@kernel.org>
CC: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
 Documentation/devicetree/bindings/jailhouse.txt |  8 ++++++++
 arch/x86/include/asm/jailhouse_para.h           |  2 +-
 include/linux/hypervisor.h                      | 17 +++++++++++++++--
 3 files changed, 24 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/jailhouse.txt

diff --git a/Documentation/devicetree/bindings/jailhouse.txt b/Documentation/devicetree/bindings/jailhouse.txt
new file mode 100644
index 000000000000..2901c25ff340
--- /dev/null
+++ b/Documentation/devicetree/bindings/jailhouse.txt
@@ -0,0 +1,8 @@
+Jailhouse non-root cell device tree bindings
+--------------------------------------------
+
+When running in a non-root Jailhouse cell (partition), the device tree of this
+platform shall have a top-level "hypervisor" node with the following
+properties:
+
+- compatible = "jailhouse,cell"
diff --git a/arch/x86/include/asm/jailhouse_para.h b/arch/x86/include/asm/jailhouse_para.h
index 875b54376689..b885a961a150 100644
--- a/arch/x86/include/asm/jailhouse_para.h
+++ b/arch/x86/include/asm/jailhouse_para.h
@@ -1,7 +1,7 @@
 /* SPDX-License-Identifier: GPL2.0 */
 
 /*
- * Jailhouse paravirt_ops implementation
+ * Jailhouse paravirt detection
  *
  * Copyright (c) Siemens AG, 2015-2017
  *
diff --git a/include/linux/hypervisor.h b/include/linux/hypervisor.h
index b19563f9a8eb..fc08b433c856 100644
--- a/include/linux/hypervisor.h
+++ b/include/linux/hypervisor.h
@@ -8,15 +8,28 @@
  */
 
 #ifdef CONFIG_X86
+
+#include <asm/jailhouse_para.h>
 #include <asm/x86_init.h>
+
 static inline void hypervisor_pin_vcpu(int cpu)
 {
 	x86_platform.hyper.pin_vcpu(cpu);
 }
-#else
+
+#else /* !CONFIG_X86 */
+
+#include <linux/of.h>
+
 static inline void hypervisor_pin_vcpu(int cpu)
 {
 }
-#endif
+
+static inline bool jailhouse_paravirt(void)
+{
+	return of_find_compatible_node(NULL, NULL, "jailhouse,cell");
+}
+
+#endif /* !CONFIG_X86 */
 
 #endif /* __LINUX_HYPEVISOR_H */
-- 
2.13.6

^ permalink raw reply related

* [PATCH v2 2/6] PCI: Scan all functions when running over Jailhouse
From: Jan Kiszka @ 2018-02-28  6:34 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, Bjorn Helgaas
  Cc: jailhouse-dev, Benedikt Spranger, linux-pci, x86,
	Linux Kernel Mailing List, virtualization
In-Reply-To: <cover.1519799691.git.jan.kiszka@siemens.com>

From: Jan Kiszka <jan.kiszka@siemens.com>

Per PCIe r4.0, sec 7.5.1.1.9, multi-function devices are required to
have a function 0.  Therefore, Linux scans for devices at function 0
(devfn 0/8/16/...) and only scans for other functions if function 0
has its Multi-Function Device bit set or ARI or SR-IOV indicate
there are more functions.

The Jailhouse hypervisor may pass individual functions of a
multi-function device to a guest without passing function 0, which
means a Linux guest won't find them.

Change Linux PCI probing so it scans all function numbers when
running as a guest over Jailhouse.

This is technically prohibited by the spec, so it is possible that
PCI devices without the Multi-Function Device bit set may have
unexpected behavior in response to this probe.

Based on patch by Benedikt Spranger, adding Jailhouse probing to avoid
changing the behavior in the absence of the hypervisor.

CC: Benedikt Spranger <b.spranger@linutronix.de>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
---
 arch/x86/pci/legacy.c | 4 +++-
 drivers/pci/probe.c   | 4 +++-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/pci/legacy.c b/arch/x86/pci/legacy.c
index 1cb01abcb1be..dfbe6ac38830 100644
--- a/arch/x86/pci/legacy.c
+++ b/arch/x86/pci/legacy.c
@@ -4,6 +4,7 @@
 #include <linux/init.h>
 #include <linux/export.h>
 #include <linux/pci.h>
+#include <asm/jailhouse_para.h>
 #include <asm/pci_x86.h>
 
 /*
@@ -34,13 +35,14 @@ int __init pci_legacy_init(void)
 
 void pcibios_scan_specific_bus(int busn)
 {
+	int stride = jailhouse_paravirt() ? 1 : 8;
 	int devfn;
 	u32 l;
 
 	if (pci_find_bus(0, busn))
 		return;
 
-	for (devfn = 0; devfn < 256; devfn += 8) {
+	for (devfn = 0; devfn < 256; devfn += stride) {
 		if (!raw_pci_read(0, busn, devfn, PCI_VENDOR_ID, 2, &l) &&
 		    l != 0x0000 && l != 0xffff) {
 			DBG("Found device at %02x:%02x [%04x]\n", busn, devfn, l);
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index ef5377438a1e..ce728251ae36 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -16,6 +16,7 @@
 #include <linux/pci-aspm.h>
 #include <linux/aer.h>
 #include <linux/acpi.h>
+#include <linux/hypervisor.h>
 #include <linux/irqdomain.h>
 #include <linux/pm_runtime.h>
 #include "pci.h"
@@ -2517,6 +2518,7 @@ static unsigned int pci_scan_child_bus_extend(struct pci_bus *bus,
 					      unsigned int available_buses)
 {
 	unsigned int used_buses, normal_bridges = 0, hotplug_bridges = 0;
+	unsigned int stride = jailhouse_paravirt() ? 1 : 8;
 	unsigned int start = bus->busn_res.start;
 	unsigned int devfn, cmax, max = start;
 	struct pci_dev *dev;
@@ -2524,7 +2526,7 @@ static unsigned int pci_scan_child_bus_extend(struct pci_bus *bus,
 	dev_dbg(&bus->dev, "scanning bus\n");
 
 	/* Go find them, Rover! */
-	for (devfn = 0; devfn < 0x100; devfn += 8)
+	for (devfn = 0; devfn < 0x100; devfn += stride)
 		pci_scan_slot(bus, devfn);
 
 	/* Reserve buses for SR-IOV capability */
-- 
2.13.6

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox