Linux virtualization list

Linux virtualization list
 help / color / mirror / Atom feed

* Re: [PATCH net-next 0/6] virtio_net: Add ethtool stat items
From: Tariq Toukan @ 2018-08-05 14:22 UTC (permalink / raw)
  To: David Miller, mst, toshiaki.makita1, makita.toshiaki
  Cc: ranro, netdev, Eran Ben Elisha, virtualization, Maor Gottlieb
In-Reply-To: <8a21b956-af42-3598-b752-d1e3fa9e63d3@mellanox.com>



On 05/08/2018 4:45 PM, Tariq Toukan wrote:
> 
> 
> On 05/08/2018 4:15 PM, Tariq Toukan wrote:
>>
>>
>> On 25/07/2018 10:59 PM, David Miller wrote:
>>> From: "Michael S. Tsirkin" <mst@redhat.com>
>>> Date: Wed, 25 Jul 2018 12:40:12 +0300
>>>
>>>> On Mon, Jul 23, 2018 at 11:36:03PM +0900, Toshiaki Makita wrote:
>>>>> From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
>>>>>
>>>>> Add some ethtool stat items useful for performance analysis.
>>>>>
>>>>> Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
>>>>
>>>> Series:
>>>>
>>>> Acked-by: Michael S. Tsirkin <mst@redhat.com>
>>>
>>> Series applied.
>>>
>>>> Patch 1 seems appropriate for stable, even though it's minor.
>>>
>>> Ok, I'll put patch #1 also into 'net' and queue it up for -stable.
>>>
>>> Thanks.
>>>
>>
>> Hi,
>> Our verification team encountered the following degradation, 
>> introduced by this series. Please check KASAN bug report below.
>>
>> We tested before and after the series, but didn't bisect within it. We 
>> can do if necessary.
>>
>> Thanks,
>> Tariq
>>
> 
> I see this might already be fixed, here:
> https://marc.info/?l=linux-netdev&m=153335713407532&w=2
> 
> Verifying...
> 

No repro when applying this fix.

Thanks,
Tariq

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Benjamin Herrenschmidt @ 2018-08-05 21:16 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: robh, srikar, Michael S. Tsirkin, mpe, Will Deacon, linux-kernel,
	linuxram, virtualization, paulus, marc.zyngier, joe, robin.murphy,
	david, linuxppc-dev, elfring, haren, Anshuman Khandual
In-Reply-To: <20180805072930.GB23288@infradead.org>

On Sun, 2018-08-05 at 00:29 -0700, Christoph Hellwig wrote:
> On Sun, Aug 05, 2018 at 11:10:15AM +1000, Benjamin Herrenschmidt wrote:
> >  - One you have rejected, which is to have a way for "no-iommu" virtio
> > (which still doesn't use an iommu on the qemu side and doesn't need
> > to), to be forced to use some custom DMA ops on the VM side.
> > 
> >  - One, which sadly has more overhead and will require modifying more
> > pieces of the puzzle, which is to make qemu uses an emulated iommu.
> > Once we make qemu do that, we can then layer swiotlb on top of the
> > emulated iommu on the guest side, and pass that as dma_ops to virtio.
> 
> Or number three:  have a a virtio feature bit that tells the VM
> to use whatever dma ops the platform thinks are appropinquate for
> the bus it pretends to be on.  Then set a dma-range that is limited
> to your secure memory range (if you really need it to be runtime
> enabled only after a device reset that rescans) and use the normal
> dma mapping code to bounce buffer.

Who would set this bit ? qemu ? Under what circumstances ?

What would be the effect of this bit while VIRTIO_F_IOMMU is NOT set,
ie, what would qemu do and what would Linux do ? I'm not sure I fully
understand your idea.

I'm trying to understand because the limitation is not a device side
limitation, it's not a qemu limitation, it's actually more of a VM
limitation. It has most of its memory pages made inaccessible for
security reasons. The platform from a qemu/KVM perspective is almost
entirely normal.

So I don't understand when would qemu set this bit, or should it be set
by the VM at runtime ?

Cheers,
Ben.

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Benjamin Herrenschmidt @ 2018-08-05 21:30 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: robh, srikar, Michael S. Tsirkin, mpe, Will Deacon, linux-kernel,
	linuxram, virtualization, paulus, marc.zyngier, joe, robin.murphy,
	david, linuxppc-dev, elfring, haren, Anshuman Khandual
In-Reply-To: <b7e8294e3e70d24072883a7e8e5375719d5af870.camel@kernel.crashing.org>

On Mon, 2018-08-06 at 07:16 +1000, Benjamin Herrenschmidt wrote:
> I'm trying to understand because the limitation is not a device side
> limitation, it's not a qemu limitation, it's actually more of a VM
> limitation. It has most of its memory pages made inaccessible for
> security reasons. The platform from a qemu/KVM perspective is almost
> entirely normal.

In fact this is probably the best image of what's going on:

It's a normal VM from a KVM/qemu perspective (and thus virtio). It
boots normally, can run firmware, linux, etc... normally, it's not
created with any different XML or qemu command line definition etc...

It just that once it reaches the kernel with the secure stuff enabled
(could be via kexec from a normal kernel), that kernel will "stash
away" most of the VM's memory into some secure space that nothing else
(not even the hypervisor) can access.

It can keep around a pool or two of normal memory for bounce buferring
IOs but that's about it.

I think that's the clearest way I could find to explain what's going
on, and why I'm so resistant on adding things on qemu side.

That said, we *can* (and will) notify KVM and qemu of the transition,
and we can/will do so after virtio has been instanciated and used by
the bootloader, but before it will be used (or even probed) by the
secure VM itself, so there's an opportunity to poke at things, either
from the VM itself (a quirk poking at virtio config space for example)
or from qemu (though I find the idea of iterating all virtio devices
from qemu to change a setting rather gross).

Cheers,
Ben.

^ permalink raw reply

* Re: [PATCH net-next] virtio-net: mark expected switch fall-throughs
From: David Miller @ 2018-08-06  0:27 UTC (permalink / raw)
  To: gustavo; +Cc: netdev, virtualization, linux-kernel, mst
In-Reply-To: <20180805024205.GA14300@embeddedor.com>

From: "Gustavo A. R. Silva" <gustavo@embeddedor.com>
Date: Sat, 4 Aug 2018 21:42:05 -0500

> In preparation to enabling -Wimplicit-fallthrough, mark switch cases
> where we are expecting to fall through.
> 
> Addresses-Coverity-ID: 1402059 ("Missing break in switch")
> Addresses-Coverity-ID: 1402060 ("Missing break in switch")
> Addresses-Coverity-ID: 1402061 ("Missing break in switch")
> Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>

Applied, thank you.

^ permalink raw reply

* Re: [PATCH net-next] vhost: switch to use new message format
From: Jason Wang @ 2018-08-06  3:08 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, virtualization, linux-kernel, kvm, mst
In-Reply-To: <20180804.132110.184608716461100739.davem@davemloft.net>



On 2018年08月05日 04:21, David Miller wrote:
> From: Jason Wang <jasowang@redhat.com>
> Date: Fri,  3 Aug 2018 15:04:51 +0800
>
>> So fixing this by introducing a new message type with an explicit
>> 32bit reserved field after type like:
>>
>> struct vhost_msg_v2 {
>> 	int type;
>> 	__u32 reserved;
> Please use fixed sized types consistently.  Use 's32' instead of 'int'
> here.
>
> Thanks!

Ok, V2 will be posted soon.

And it looks to me u32 is sufficient.

Thanks
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* [PATCH net-next V2] vhost: switch to use new message format
From: Jason Wang @ 2018-08-06  3:17 UTC (permalink / raw)
  To: mst, jasowang; +Cc: netdev, linux-kernel, kvm, virtualization

We use to have message like:

struct vhost_msg {
	int type;
	union {
		struct vhost_iotlb_msg iotlb;
		__u8 padding[64];
	};
};

Unfortunately, there will be a hole of 32bit in 64bit machine because
of the alignment. This leads a different formats between 32bit API and
64bit API. What's more it will break 32bit program running on 64bit
machine.

So fixing this by introducing a new message type with an explicit
32bit reserved field after type like:

struct vhost_msg_v2 {
	__u32 type;
	__u32 reserved;
	union {
		struct vhost_iotlb_msg iotlb;
		__u8 padding[64];
	};
};

We will have a consistent ABI after switching to use this. To enable
this capability, introduce a new ioctl (VHOST_SET_BAKCEND_FEATURE) for
userspace to enable this feature (VHOST_BACKEND_F_IOTLB_V2).

Fixes: 6b1e6cc7855b ("vhost: new device IOTLB API")
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
Changes from V1:
- use __u32 instead of int for type
---
 drivers/vhost/net.c        | 30 ++++++++++++++++++++
 drivers/vhost/vhost.c      | 71 ++++++++++++++++++++++++++++++++++------------
 drivers/vhost/vhost.h      | 11 ++++++-
 include/uapi/linux/vhost.h | 18 ++++++++++++
 4 files changed, 111 insertions(+), 19 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 367d802..4e656f8 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -78,6 +78,10 @@ enum {
 };
 
 enum {
+	VHOST_NET_BACKEND_FEATURES = (1ULL << VHOST_BACKEND_F_IOTLB_MSG_V2)
+};
+
+enum {
 	VHOST_NET_VQ_RX = 0,
 	VHOST_NET_VQ_TX = 1,
 	VHOST_NET_VQ_MAX = 2,
@@ -1399,6 +1403,21 @@ static long vhost_net_reset_owner(struct vhost_net *n)
 	return err;
 }
 
+static int vhost_net_set_backend_features(struct vhost_net *n, u64 features)
+{
+	int i;
+
+	mutex_lock(&n->dev.mutex);
+	for (i = 0; i < VHOST_NET_VQ_MAX; ++i) {
+		mutex_lock(&n->vqs[i].vq.mutex);
+		n->vqs[i].vq.acked_backend_features = features;
+		mutex_unlock(&n->vqs[i].vq.mutex);
+	}
+	mutex_unlock(&n->dev.mutex);
+
+	return 0;
+}
+
 static int vhost_net_set_features(struct vhost_net *n, u64 features)
 {
 	size_t vhost_hlen, sock_hlen, hdr_len;
@@ -1489,6 +1508,17 @@ static long vhost_net_ioctl(struct file *f, unsigned int ioctl,
 		if (features & ~VHOST_NET_FEATURES)
 			return -EOPNOTSUPP;
 		return vhost_net_set_features(n, features);
+	case VHOST_GET_BACKEND_FEATURES:
+		features = VHOST_NET_BACKEND_FEATURES;
+		if (copy_to_user(featurep, &features, sizeof(features)))
+			return -EFAULT;
+		return 0;
+	case VHOST_SET_BACKEND_FEATURES:
+		if (copy_from_user(&features, featurep, sizeof(features)))
+			return -EFAULT;
+		if (features & ~VHOST_NET_BACKEND_FEATURES)
+			return -EOPNOTSUPP;
+		return vhost_net_set_backend_features(n, features);
 	case VHOST_RESET_OWNER:
 		return vhost_net_reset_owner(n);
 	case VHOST_SET_OWNER:
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index a502f1a..6f6c42d 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -315,6 +315,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 	vq->log_addr = -1ull;
 	vq->private_data = NULL;
 	vq->acked_features = 0;
+	vq->acked_backend_features = 0;
 	vq->log_base = NULL;
 	vq->error_ctx = NULL;
 	vq->kick = NULL;
@@ -1027,28 +1028,40 @@ static int vhost_process_iotlb_msg(struct vhost_dev *dev,
 ssize_t vhost_chr_write_iter(struct vhost_dev *dev,
 			     struct iov_iter *from)
 {
-	struct vhost_msg_node node;
-	unsigned size = sizeof(struct vhost_msg);
-	size_t ret;
-	int err;
+	struct vhost_iotlb_msg msg;
+	size_t offset;
+	int type, ret;
 
-	if (iov_iter_count(from) < size)
-		return 0;
-	ret = copy_from_iter(&node.msg, size, from);
-	if (ret != size)
+	ret = copy_from_iter(&type, sizeof(type), from);
+	if (ret != sizeof(type))
 		goto done;
 
-	switch (node.msg.type) {
+	switch (type) {
 	case VHOST_IOTLB_MSG:
-		err = vhost_process_iotlb_msg(dev, &node.msg.iotlb);
-		if (err)
-			ret = err;
+		/* There maybe a hole after type for V1 message type,
+		 * so skip it here.
+		 */
+		offset = offsetof(struct vhost_msg, iotlb) - sizeof(int);
+		break;
+	case VHOST_IOTLB_MSG_V2:
+		offset = sizeof(__u32);
 		break;
 	default:
 		ret = -EINVAL;
-		break;
+		goto done;
+	}
+
+	iov_iter_advance(from, offset);
+	ret = copy_from_iter(&msg, sizeof(msg), from);
+	if (ret != sizeof(msg))
+		goto done;
+	if (vhost_process_iotlb_msg(dev, &msg)) {
+		ret = -EFAULT;
+		goto done;
 	}
 
+	ret = (type == VHOST_IOTLB_MSG) ? sizeof(struct vhost_msg) :
+	      sizeof(struct vhost_msg_v2);
 done:
 	return ret;
 }
@@ -1107,13 +1120,28 @@ ssize_t vhost_chr_read_iter(struct vhost_dev *dev, struct iov_iter *to,
 		finish_wait(&dev->wait, &wait);
 
 	if (node) {
-		ret = copy_to_iter(&node->msg, size, to);
+		struct vhost_iotlb_msg *msg;
+		void *start = &node->msg;
+
+		switch (node->msg.type) {
+		case VHOST_IOTLB_MSG:
+			size = sizeof(node->msg);
+			msg = &node->msg.iotlb;
+			break;
+		case VHOST_IOTLB_MSG_V2:
+			size = sizeof(node->msg_v2);
+			msg = &node->msg_v2.iotlb;
+			break;
+		default:
+			BUG();
+			break;
+		}
 
-		if (ret != size || node->msg.type != VHOST_IOTLB_MISS) {
+		ret = copy_to_iter(start, size, to);
+		if (ret != size || msg->type != VHOST_IOTLB_MISS) {
 			kfree(node);
 			return ret;
 		}
-
 		vhost_enqueue_msg(dev, &dev->pending_list, node);
 	}
 
@@ -1126,12 +1154,19 @@ static int vhost_iotlb_miss(struct vhost_virtqueue *vq, u64 iova, int access)
 	struct vhost_dev *dev = vq->dev;
 	struct vhost_msg_node *node;
 	struct vhost_iotlb_msg *msg;
+	bool v2 = vhost_backend_has_feature(vq, VHOST_BACKEND_F_IOTLB_MSG_V2);
 
-	node = vhost_new_msg(vq, VHOST_IOTLB_MISS);
+	node = vhost_new_msg(vq, v2 ? VHOST_IOTLB_MSG_V2 : VHOST_IOTLB_MSG);
 	if (!node)
 		return -ENOMEM;
 
-	msg = &node->msg.iotlb;
+	if (v2) {
+		node->msg_v2.type = VHOST_IOTLB_MSG_V2;
+		msg = &node->msg_v2.iotlb;
+	} else {
+		msg = &node->msg.iotlb;
+	}
+
 	msg->type = VHOST_IOTLB_MISS;
 	msg->iova = iova;
 	msg->perm = access;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 6c844b9..466ef75 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -132,6 +132,7 @@ struct vhost_virtqueue {
 	struct vhost_umem *iotlb;
 	void *private_data;
 	u64 acked_features;
+	u64 acked_backend_features;
 	/* Log write descriptors */
 	void __user *log_base;
 	struct vhost_log *log;
@@ -147,7 +148,10 @@ struct vhost_virtqueue {
 };
 
 struct vhost_msg_node {
-  struct vhost_msg msg;
+  union {
+	  struct vhost_msg msg;
+	  struct vhost_msg_v2 msg_v2;
+  };
   struct vhost_virtqueue *vq;
   struct list_head node;
 };
@@ -238,6 +242,11 @@ static inline bool vhost_has_feature(struct vhost_virtqueue *vq, int bit)
 	return vq->acked_features & (1ULL << bit);
 }
 
+static inline bool vhost_backend_has_feature(struct vhost_virtqueue *vq, int bit)
+{
+	return vq->acked_backend_features & (1ULL << bit);
+}
+
 #ifdef CONFIG_VHOST_CROSS_ENDIAN_LEGACY
 static inline bool vhost_is_little_endian(struct vhost_virtqueue *vq)
 {
diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
index c51f8e5..b1e22c4 100644
--- a/include/uapi/linux/vhost.h
+++ b/include/uapi/linux/vhost.h
@@ -65,6 +65,7 @@ struct vhost_iotlb_msg {
 };
 
 #define VHOST_IOTLB_MSG 0x1
+#define VHOST_IOTLB_MSG_V2 0x2
 
 struct vhost_msg {
 	int type;
@@ -74,6 +75,15 @@ struct vhost_msg {
 	};
 };
 
+struct vhost_msg_v2 {
+	__u32 type;
+	__u32 reserved;
+	union {
+		struct vhost_iotlb_msg iotlb;
+		__u8 padding[64];
+	};
+};
+
 struct vhost_memory_region {
 	__u64 guest_phys_addr;
 	__u64 memory_size; /* bytes */
@@ -160,6 +170,14 @@ struct vhost_memory {
 #define VHOST_GET_VRING_BUSYLOOP_TIMEOUT _IOW(VHOST_VIRTIO, 0x24,	\
 					 struct vhost_vring_state)
 
+/* Set or get vhost backend capability */
+
+/* Use message type V2 */
+#define VHOST_BACKEND_F_IOTLB_MSG_V2 0x1
+
+#define VHOST_SET_BACKEND_FEATURES _IOW(VHOST_VIRTIO, 0x25, __u64)
+#define VHOST_GET_BACKEND_FEATURES _IOW(VHOST_VIRTIO, 0x26, __u64)
+
 /* VHOST_NET specific defines */
 
 /* Attach virtio net ring to a raw socket, or tap device.
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH v3 2/2] virtio_balloon: replace oom notifier with shrinker
From: Wei Wang @ 2018-08-06  3:29 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, penguin-kernel, linux-kernel, mhocko, linux-mm, akpm,
	virtualization
In-Reply-To: <20180803221423-mutt-send-email-mst@kernel.org>

On 08/04/2018 03:15 AM, Michael S. Tsirkin wrote:
> On Fri, Aug 03, 2018 at 04:32:26PM +0800, Wei Wang wrote:
>> The OOM notifier is getting deprecated to use for the reasons:
>> - As a callout from the oom context, it is too subtle and easy to
>>    generate bugs and corner cases which are hard to track;
>> - It is called too late (after the reclaiming has been performed).
>>    Drivers with large amuont of reclaimable memory is expected to
>>    release them at an early stage of memory pressure;
>> - The notifier callback isn't aware of oom contrains;
>> Link: https://lkml.org/lkml/2018/7/12/314
>>
>> This patch replaces the virtio-balloon oom notifier with a shrinker
>> to release balloon pages on memory pressure. The balloon pages are
>> given back to mm adaptively by returning the number of pages that the
>> reclaimer is asking for (i.e. sc->nr_to_scan).
>>
>> Currently the max possible value of sc->nr_to_scan passed to the balloon
>> shrinker is SHRINK_BATCH, which is 128. This is smaller than the
>> limitation that only VIRTIO_BALLOON_ARRAY_PFNS_MAX (256) pages can be
>> returned via one invocation of leak_balloon. But this patch still
>> considers the case that SHRINK_BATCH or shrinker->batch could be changed
>> to a value larger than VIRTIO_BALLOON_ARRAY_PFNS_MAX, which will need to
>> do multiple invocations of leak_balloon.
>>
>> Historically, the feature VIRTIO_BALLOON_F_DEFLATE_ON_OOM has been used
>> to release balloon pages on OOM. We continue to use this feature bit for
>> the shrinker, so the shrinker is only registered when this feature bit
>> has been negotiated with host.
>>
>> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
>> Cc: Michael S. Tsirkin <mst@redhat.com>
>> Cc: Michal Hocko <mhocko@kernel.org>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>
> Could you add data at how was this tested and how did guest
> behaviour change. Which configurations see an improvement?
>

Yes. Please see the differences from the "*1" and "*2" cases below.

Taking this chance, I use "*2" and "*3" to show Michal etc the 
differences of applying and not applying the shrinker fix patch here: 
https://lkml.org/lkml/2018/8/3/384


*1. V3 patches
1)After inflating some amount of memory, actual=1000001536 Bytes
free -m
               total        used        free      shared buff/cache   
available
Mem:           7975        7289         514          10 171         447
Swap:         10236           0       10236

2) dd if=478MB_file of=/dev/null, actual=1058721792 Bytes
free -m
               total        used        free      shared buff/cache   
available
Mem:           7975        7233         102          10 639         475
Swap:         10236           0       10236

The advantage is that the inflated pages are given back to mm based on 
the number, i.e. ~56MB(diff "actual" above) of the reclaimer is asking 
for. This is more adaptive.



*2. V2 paches, balloon_pages_to_shrink=1000000 pages (around 4GB), with 
the shrinker fix patches applied.
1)After inflating some amount of memory, actual=1000001536 Bytes
free -m
               total        used        free      shared buff/cache   
available
Mem:           7975        7288         530          10 157         455
Swap:         10236           0       10236

2)dd if=478MB_file of=/dev/null, actual=5096001536 Bytes
free -m
               total        used        free      shared buff/cache   
available
Mem:           7975        3381        3953          10 640        4327
Swap:         10236           0       10236

In the above example, we set 4GB to shrink to make the difference 
obvious. Though the claimer only needs to reclaim ~56MB memory, 4GB 
inflated pages are given back to mm, which is unnecessary. From the 
user's perspective, it has no idea of how many pages to given back at 
the time of setting the module parameter (balloon_pages_to_shrink). So I 
think the above "*1" is better.



*3.  V2 paches, balloon_pages_to_shrink=1000000 pages (around 4GB), 
without the shrinker fix patches applied.
1) After inflating some amount of memory, actual=1000001536 Bytes
free -m
                total        used        free      shared buff/cache   
available
Mem:           7975        7292         524          10 158         450
Swap:         10236           0       10236

2) dd if=478MB_file of=/dev/null, actual=8589934592 Bytes
free -m
              total        used        free      shared  buff/cache 
available
Mem:           7975          53        7281          10 640        7656
Swap:         10236           0       10236

Compared to *2, all the balloon pages are shrunk, but users expect 4GB 
to shrink. The reason is that do_slab_shrink has a mistake in 
calculating schrinkctl->nr_scanned, which should be the actual number of 
pages that the shrinker has freed, but do slab_shrink still treat that 
value as 128 (but 4GB has actually been freed).


Best,
Wei

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Anshuman Khandual @ 2018-08-06  9:02 UTC (permalink / raw)
  To: Michael S. Tsirkin, Benjamin Herrenschmidt
  Cc: robh, srikar, linuxram, linux-kernel, virtualization, hch, paulus,
	joe, linuxppc-dev, elfring, haren, david
In-Reply-To: <20180805032355-mutt-send-email-mst@kernel.org>

On 08/05/2018 05:54 AM, Michael S. Tsirkin wrote:
> On Fri, Aug 03, 2018 at 08:21:26PM -0500, Benjamin Herrenschmidt wrote:
>> On Fri, 2018-08-03 at 22:08 +0300, Michael S. Tsirkin wrote:
>>>>>> Please go through these patches and review whether this approach broadly
>>>>>> makes sense. I will appreciate suggestions, inputs, comments regarding
>>>>>> the patches or the approach in general. Thank you.
>>>>>
>>>>> Jason did some work on profiling this. Unfortunately he reports
>>>>> about 4% extra overhead from this switch on x86 with no vIOMMU.
>>>>
>>>> The test is rather simple, just run pktgen (pktgen_sample01_simple.sh) in
>>>> guest and measure PPS on tap on host.
>>>>
>>>> Thanks
>>>
>>> Could you supply host configuration involved please?
>>
>> I wonder how much of that could be caused by Spectre mitigations
>> blowing up indirect function calls...
>>
>> Cheers,
>> Ben.
> 
> I won't be surprised. If yes I suggested a way to mitigate the overhead.

Did we get better results (lower regression due to indirect calls) with
the suggested mitigation ? Just curious.

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Christoph Hellwig @ 2018-08-06  9:42 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: robh, srikar, Michael S. Tsirkin, mpe, Will Deacon, linux-kernel,
	linuxram, virtualization, Christoph Hellwig, paulus, marc.zyngier,
	joe, robin.murphy, david, linuxppc-dev, elfring, haren,
	Anshuman Khandual
In-Reply-To: <b7e8294e3e70d24072883a7e8e5375719d5af870.camel@kernel.crashing.org>

On Mon, Aug 06, 2018 at 07:16:47AM +1000, Benjamin Herrenschmidt wrote:
> Who would set this bit ? qemu ? Under what circumstances ?

I don't really care who sets what.  The implementation might not even
involved qemu.

It is your job to write a coherent interface specification that does
not depend on the used components.  The hypervisor might be PAPR,
Linux + qemu, VMware, Hyperv or something so secret that you'd have
to shoot me if you had to tell me.  The guest might be Linux, FreeBSD,
AIX, OS400 or a Hipster project of the day in Rust.  As long as we
properly specify the interface it simplify does not matter.

> What would be the effect of this bit while VIRTIO_F_IOMMU is NOT set,
> ie, what would qemu do and what would Linux do ? I'm not sure I fully
> understand your idea.

In a perfect would we'd just reuse VIRTIO_F_IOMMU and clarify the
description which currently is rather vague but basically captures
the use case.  Currently is is:

VIRTIO_F_IOMMU_PLATFORM(33)
    This feature indicates that the device is behind an IOMMU that
    translates bus addresses from the device into physical addresses in
    memory. If this feature bit is set to 0, then the device emits
    physical addresses which are not translated further, even though an
    IOMMU may be present.

And I'd change it to something like:

VIRTIO_F_PLATFORM_DMA(33)
    This feature indicates that the device emits platform specific
    bus addresses that might not be identical to physical address.
    The translation of physical to bus address is platform speific
    and defined by the plaform specification for the bus that the virtio
    device is attached to.
    If this feature bit is set to 0, then the device emits
    physical addresses which are not translated further, even if
    the platform would normally require translations for the bus that
    the virtio device is attached to.

If we can't change the defintion any more we should deprecate the
old VIRTIO_F_IOMMU_PLATFORM bit, and require the VIRTIO_F_IOMMU_PLATFORM
and VIRTIO_F_PLATFORM_DMA to be not set at the same time.

> I'm trying to understand because the limitation is not a device side
> limitation, it's not a qemu limitation, it's actually more of a VM
> limitation. It has most of its memory pages made inaccessible for
> security reasons. The platform from a qemu/KVM perspective is almost
> entirely normal.

Well, find a way to describe this either in the qemu specification using
new feature bits, or by using something like the above.

^ permalink raw reply

* Re: [PATCH v3 2/2] virtio_balloon: replace oom notifier with shrinker
From: Wei Wang @ 2018-08-06  9:56 UTC (permalink / raw)
  To: Tetsuo Handa, virtio-dev, linux-kernel, virtualization, linux-mm,
	mst, mhocko, akpm
In-Reply-To: <16c56ee5-eef7-dd5f-f2b6-e3c11df2765c@i-love.sakura.ne.jp>

On 08/03/2018 08:11 PM, Tetsuo Handa wrote:
> On 2018/08/03 17:32, Wei Wang wrote:
>> +static int virtio_balloon_register_shrinker(struct virtio_balloon *vb)
>> +{
>> +	vb->shrinker.scan_objects = virtio_balloon_shrinker_scan;
>> +	vb->shrinker.count_objects = virtio_balloon_shrinker_count;
>> +	vb->shrinker.batch = 0;
>> +	vb->shrinker.seeks = DEFAULT_SEEKS;
> Why flags field is not set? If vb is allocated by kmalloc(GFP_KERNEL)
> and is nowhere zero-cleared, KASAN would complain it.

Could you point where in the code that would complain it?
I only see two shrinker flags (NUMA_AWARE and MEMCG_AWARE), and they 
seem not related to that.


Best,
Wei

^ permalink raw reply

* Re: [PATCH v3 2/2] virtio_balloon: replace oom notifier with shrinker
From: Tetsuo Handa @ 2018-08-06 10:29 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, mst, linux-kernel, mhocko, linux-mm, akpm,
	virtualization
In-Reply-To: <5B681B41.6070205@intel.com>

On 2018/08/06 18:56, Wei Wang wrote:
> On 08/03/2018 08:11 PM, Tetsuo Handa wrote:
>> On 2018/08/03 17:32, Wei Wang wrote:
>>> +static int virtio_balloon_register_shrinker(struct virtio_balloon *vb)
>>> +{
>>> +    vb->shrinker.scan_objects = virtio_balloon_shrinker_scan;
>>> +    vb->shrinker.count_objects = virtio_balloon_shrinker_count;
>>> +    vb->shrinker.batch = 0;
>>> +    vb->shrinker.seeks = DEFAULT_SEEKS;
>> Why flags field is not set? If vb is allocated by kmalloc(GFP_KERNEL)
>> and is nowhere zero-cleared, KASAN would complain it.
> 
> Could you point where in the code that would complain it?
> I only see two shrinker flags (NUMA_AWARE and MEMCG_AWARE), and they seem not related to that.

Where is vb->shrinker.flags initialized?
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* RE: [PATCH v3 2/2] virtio_balloon: replace oom notifier with shrinker
From: Wang, Wei W @ 2018-08-06 12:44 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: virtio-dev@lists.oasis-open.org, mst@redhat.com,
	linux-kernel@vger.kernel.org, mhocko@kernel.org,
	linux-mm@kvack.org, akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org
In-Reply-To: <c8d25019-1990-f0dd-c83d-e4def5b5f7fe@i-love.sakura.ne.jp>

On Monday, August 6, 2018 6:29 PM, Tetsuo Handa wrote:
> On 2018/08/06 18:56, Wei Wang wrote:
> > On 08/03/2018 08:11 PM, Tetsuo Handa wrote:
> >> On 2018/08/03 17:32, Wei Wang wrote:
> >>> +static int virtio_balloon_register_shrinker(struct virtio_balloon
> >>> +*vb) {
> >>> +    vb->shrinker.scan_objects = virtio_balloon_shrinker_scan;
> >>> +    vb->shrinker.count_objects = virtio_balloon_shrinker_count;
> >>> +    vb->shrinker.batch = 0;
> >>> +    vb->shrinker.seeks = DEFAULT_SEEKS;
> >> Why flags field is not set? If vb is allocated by kmalloc(GFP_KERNEL)
> >> and is nowhere zero-cleared, KASAN would complain it.
> >
> > Could you point where in the code that would complain it?
> > I only see two shrinker flags (NUMA_AWARE and MEMCG_AWARE), and
> they seem not related to that.
> 
> Where is vb->shrinker.flags initialized?

Is that mandatory to be initialized? I find it's not initialized in most shrinkers (e.g. zs_register_shrinker, huge_zero_page_shrinker).

Best,
Wei
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH v3 2/2] virtio_balloon: replace oom notifier with shrinker
From: Tetsuo Handa @ 2018-08-06 13:28 UTC (permalink / raw)
  To: Wang, Wei W
  Cc: virtio-dev@lists.oasis-open.org, mst@redhat.com,
	linux-kernel@vger.kernel.org, mhocko@kernel.org,
	linux-mm@kvack.org, akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org
In-Reply-To: <286AC319A985734F985F78AFA26841F7397222E8@SHSMSX101.ccr.corp.intel.com>

On 2018/08/06 21:44, Wang, Wei W wrote:
> On Monday, August 6, 2018 6:29 PM, Tetsuo Handa wrote:
>> On 2018/08/06 18:56, Wei Wang wrote:
>>> On 08/03/2018 08:11 PM, Tetsuo Handa wrote:
>>>> On 2018/08/03 17:32, Wei Wang wrote:
>>>>> +static int virtio_balloon_register_shrinker(struct virtio_balloon
>>>>> +*vb) {
>>>>> +    vb->shrinker.scan_objects = virtio_balloon_shrinker_scan;
>>>>> +    vb->shrinker.count_objects = virtio_balloon_shrinker_count;
>>>>> +    vb->shrinker.batch = 0;
>>>>> +    vb->shrinker.seeks = DEFAULT_SEEKS;
>>>> Why flags field is not set? If vb is allocated by kmalloc(GFP_KERNEL)
>>>> and is nowhere zero-cleared, KASAN would complain it.
>>>
>>> Could you point where in the code that would complain it?
>>> I only see two shrinker flags (NUMA_AWARE and MEMCG_AWARE), and
>> they seem not related to that.
>>
>> Where is vb->shrinker.flags initialized?
> 
> Is that mandatory to be initialized?

Of course. ;-)

> I find it's not initialized in most shrinkers (e.g. zs_register_shrinker, huge_zero_page_shrinker).

Because most shrinkers are "statically initialized (which means that unspecified fields are
implicitly zero-cleared)" or "dynamically allocated with __GFP_ZERO or zero-cleared using
memset() (which means that all fields are once zero-cleared)".

And if you once zero-clear vb at allocation time, you will get a bonus that
calling unregister_shrinker() without corresponding register_shrinker() is safe
(which will simplify initialization failure path).
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Michael S. Tsirkin @ 2018-08-06 13:36 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: robh, srikar, Benjamin Herrenschmidt, linuxram, linux-kernel,
	virtualization, hch, paulus, joe, linuxppc-dev, elfring, haren,
	david
In-Reply-To: <74a1e1b8-81e0-84db-6d0d-d8bd9caebb4a@linux.vnet.ibm.com>

On Mon, Aug 06, 2018 at 02:32:28PM +0530, Anshuman Khandual wrote:
> On 08/05/2018 05:54 AM, Michael S. Tsirkin wrote:
> > On Fri, Aug 03, 2018 at 08:21:26PM -0500, Benjamin Herrenschmidt wrote:
> >> On Fri, 2018-08-03 at 22:08 +0300, Michael S. Tsirkin wrote:
> >>>>>> Please go through these patches and review whether this approach broadly
> >>>>>> makes sense. I will appreciate suggestions, inputs, comments regarding
> >>>>>> the patches or the approach in general. Thank you.
> >>>>>
> >>>>> Jason did some work on profiling this. Unfortunately he reports
> >>>>> about 4% extra overhead from this switch on x86 with no vIOMMU.
> >>>>
> >>>> The test is rather simple, just run pktgen (pktgen_sample01_simple.sh) in
> >>>> guest and measure PPS on tap on host.
> >>>>
> >>>> Thanks
> >>>
> >>> Could you supply host configuration involved please?
> >>
> >> I wonder how much of that could be caused by Spectre mitigations
> >> blowing up indirect function calls...
> >>
> >> Cheers,
> >> Ben.
> > 
> > I won't be surprised. If yes I suggested a way to mitigate the overhead.
> 
> Did we get better results (lower regression due to indirect calls) with
> the suggested mitigation ? Just curious.

I'm referring to this:
	I wonder whether we can support map_sg and friends being NULL, then use
	that when mapping is an identity. A conditional branch there is likely
	very cheap.

I don't think anyone tried implementing this yes.

-- 
MST

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Michael S. Tsirkin @ 2018-08-06 13:46 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: robh, srikar, mpe, Will Deacon, linux-kernel, linuxram,
	virtualization, Christoph Hellwig, jean-philippe.brucker, paulus,
	marc.zyngier, joe, robin.murphy, david, linuxppc-dev, elfring,
	haren, Anshuman Khandual
In-Reply-To: <fd8fee94cf42e436878f179c7895de3a4dab3355.camel@kernel.crashing.org>

On Sun, Aug 05, 2018 at 02:52:54PM +1000, Benjamin Herrenschmidt wrote:
> On Sun, 2018-08-05 at 03:22 +0300, Michael S. Tsirkin wrote:
> > I see the allure of this, but I think down the road you will
> > discover passing a flag in libvirt XML saying
> > "please use a secure mode" or whatever is a good idea.
> > 
> > Even thought it is probably not required to address this
> > specific issue.
> > 
> > For example, I don't think ballooning works in secure mode,
> > you will be able to teach libvirt not to try to add a
> > balloon to the guest.
> 
> Right, we'll need some quirk to disable balloons  in the guest I
> suppose.
> 
> Passing something from libvirt is cumbersome because the end user may
> not even need to know about secure VMs. There are use cases where the
> security is a contract down to some special application running inside
> the secure VM, the sysadmin knows nothing about.
> 
> Also there's repercussions all the way to admin tools, web UIs etc...
> so it's fairly wide ranging.
> 
> So as long as we only need to quirk a couple of devices, it's much
> better contained that way.

So just the balloon thing already means that yes management and all the
way to the user tools must know this is going on. Otherwise
user will try to inflate the balloon and wonder why this does not work.

> > > Later on, (we may have even already run Linux at that point,
> > > unsecurely, as we can use Linux as a bootloader under some
> > > circumstances), we start a "secure image".
> > > 
> > > This is a kernel zImage that includes a "ticket" that has the
> > > appropriate signature etc... so that when that kernel starts, it can
> > > authenticate with the ultravisor, be verified (along with its ramdisk)
> > > etc... and copied (by the UV) into secure memory & run from there.
> > > 
> > > At that point, the hypervisor is informed that the VM has become
> > > secure.
> > > 
> > > So at that point, we could exit to qemu to inform it of the change,
> > 
> > That's probably a good idea too.
> 
> We probably will have to tell qemu eventually for migration, as we'll
> need some kind of key exchange phase etc... to deal with the crypto
> aspects (the actual page copy is sorted via encrypting the secure pages
> back to normal pages in qemu, but we'll need extra metadata).
> 
> > > and
> > > have it walk the qtree and "Switch" all the virtio devices to use the
> > > IOMMU I suppose, but it feels a lot grosser to me.
> > 
> > That part feels gross, yes.
> > 
> > > That's the only other option I can think of.
> > > 
> > > > However in this specific case, the flag does not need to come from the
> > > > hypervisor, it can be set by arch boot code I think.
> > > > Christoph do you see a problem with that?
> > > 
> > > The above could do that yes. Another approach would be to do it from a
> > > small virtio "quirk" that pokes a bit in the device to force it to
> > > iommu mode when it detects that we are running in a secure VM. That's a
> > > bit warty on the virito side but probably not as much as having a qemu
> > > one that walks of the virtio devices to change how they behave.
> > > 
> > > What do you reckon ?
> > 
> > I think you are right that for the dma limit the hypervisor doesn't seem
> > to need to know.
> 
> It's not just a limit mind you. It's a range, at least if we allocate
> just a single pool of insecure pages. swiotlb feels like a better
> option for us.
> 
> > > What we want to avoid is to expose any of this to the *end user* or
> > > libvirt or any other higher level of the management stack. We really
> > > want that stuff to remain contained between the VM itself, KVM and
> > > maybe qemu.
> > > 
> > > We will need some other qemu changes for migration so that's ok. But
> > > the minute you start touching libvirt and the higher levels it becomes
> > > a nightmare.
> > > 
> > > Cheers,
> > > Ben.
> > 
> > I don't believe you'll be able to avoid that entirely. The split between
> > libvirt and qemu is more about community than about code, random bits of
> > functionality tend to land on random sides of that fence.  Better add a
> > tag in domain XML early is my advice. Having said that, it's your
> > hypervisor. I'm just suggesting that when hypervisor does somehow need
> > to care then I suspect most people won't be receptive to the argument
> > that changing libvirt is a nightmare.
> 
> It only needs to care at runtime. The problem isn't changing libvirt
> per-se, I don't have a problem with that. The problem is that it means
> creating two categories of machines "secure" and "non-secure", which is
> end-user visible, and thus has to be escalated to all the various
> management stacks, UIs, etc... out there.
> 
> In addition, there are some cases where the individual creating the VMs
> may not have any idea that they are secure.
> 
> But yes, if we have to, we'll do it. However, so far, we don't think
> it's a great idea.
> 
> Cheers,
> Ben.

Here's another example: you can't migrate a secure vm to hypervisor
which doesn't support this feature. Again management tools above libvirt
need to know otherwise they will try.

> > > > > >   To get swiotlb you'll need to then use the DT/ACPI
> > > > > > dma-range property to limit the addressable range, and a swiotlb
> > > > > > capable plaform will use swiotlb automatically.
> > > > > 
> > > > > This cannot be done as you describe it.
> > > > > 
> > > > > The VM is created as a *normal* VM. The DT stuff is generated by qemu
> > > > > at a point where it has *no idea* that the VM will later become secure
> > > > > and thus will have to restrict which pages can be used for "DMA".
> > > > > 
> > > > > The VM will *at runtime* turn itself into a secure VM via interactions
> > > > > with the security HW and the Ultravisor layer (which sits below the
> > > > > HV). This happens way after the DT has been created and consumed, the
> > > > > qemu devices instanciated etc...
> > > > > 
> > > > > Only the guest kernel knows because it initates the transition. When
> > > > > that happens, the virtio devices have already been used by the guest
> > > > > firmware, bootloader, possibly another kernel that kexeced the "secure"
> > > > > one, etc... 
> > > > > 
> > > > > So instead of running around saying NAK NAK NAK, please explain how we
> > > > > can solve that differently.
> > > > > 
> > > > > Ben.

^ permalink raw reply

* RE: [PATCH v3 2/2] virtio_balloon: replace oom notifier with shrinker
From: Wang, Wei W @ 2018-08-06 14:02 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: virtio-dev@lists.oasis-open.org, mst@redhat.com,
	linux-kernel@vger.kernel.org, mhocko@kernel.org,
	linux-mm@kvack.org, akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org
In-Reply-To: <109ff5ec-692d-67fe-4c5a-2de8b48e8300@i-love.sakura.ne.jp>

On Monday, August 6, 2018 9:29 PM, Tetsuo Handa wrote:
> On 2018/08/06 21:44, Wang, Wei W wrote:
> > On Monday, August 6, 2018 6:29 PM, Tetsuo Handa wrote:
> >> On 2018/08/06 18:56, Wei Wang wrote:
> >>> On 08/03/2018 08:11 PM, Tetsuo Handa wrote:
> >>>> On 2018/08/03 17:32, Wei Wang wrote:
> >>>>> +static int virtio_balloon_register_shrinker(struct virtio_balloon
> >>>>> +*vb) {
> >>>>> +    vb->shrinker.scan_objects = virtio_balloon_shrinker_scan;
> >>>>> +    vb->shrinker.count_objects = virtio_balloon_shrinker_count;
> >>>>> +    vb->shrinker.batch = 0;
> >>>>> +    vb->shrinker.seeks = DEFAULT_SEEKS;
> >>>> Why flags field is not set? If vb is allocated by
> >>>> kmalloc(GFP_KERNEL) and is nowhere zero-cleared, KASAN would
> complain it.
> >>>
> >>> Could you point where in the code that would complain it?
> >>> I only see two shrinker flags (NUMA_AWARE and MEMCG_AWARE), and
> >> they seem not related to that.
> >>
> >> Where is vb->shrinker.flags initialized?
> >
> > Is that mandatory to be initialized?
> 
> Of course. ;-)
> 
> > I find it's not initialized in most shrinkers (e.g. zs_register_shrinker,
> huge_zero_page_shrinker).
> 
> Because most shrinkers are "statically initialized (which means that
> unspecified fields are implicitly zero-cleared)" or "dynamically allocated with
> __GFP_ZERO or zero-cleared using
> memset() (which means that all fields are once zero-cleared)".
> 
> And if you once zero-clear vb at allocation time, you will get a bonus that
> calling unregister_shrinker() without corresponding register_shrinker() is safe
> (which will simplify initialization failure path).

Oh, I see, thanks. So it sounds better to directly kzalloc vb.

Best,
Wei
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Will Deacon @ 2018-08-06 14:05 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: robh, srikar, Benjamin Herrenschmidt, linuxram, linux-kernel,
	virtualization, Christoph Hellwig, jean-philippe.brucker, paulus,
	marc.zyngier, mpe, joe, robin.murphy, david, linuxppc-dev,
	elfring, haren, Anshuman Khandual
In-Reply-To: <20180805032504-mutt-send-email-mst@kernel.org>

Hi Michael,

On Sun, Aug 05, 2018 at 03:27:42AM +0300, Michael S. Tsirkin wrote:
> On Wed, Aug 01, 2018 at 09:16:38AM +0100, Will Deacon wrote:
> > On Tue, Jul 31, 2018 at 03:36:22PM -0500, Benjamin Herrenschmidt wrote:
> > > On Tue, 2018-07-31 at 10:30 -0700, Christoph Hellwig wrote:
> > > > > However the question people raise is that DMA API is already full of
> > > > > arch-specific tricks the likes of which are outlined in your post linked
> > > > > above. How is this one much worse?
> > > > 
> > > > None of these warts is visible to the driver, they are all handled in
> > > > the architecture (possibly on a per-bus basis).
> > > > 
> > > > So for virtio we really need to decide if it has one set of behavior
> > > > as specified in the virtio spec, or if it behaves exactly as if it
> > > > was on a PCI bus, or in fact probably both as you lined up.  But no
> > > > magic arch specific behavior inbetween.
> > > 
> > > The only arch specific behaviour is needed in the case where it doesn't
> > > behave like PCI. In this case, the PCI DMA ops are not suitable, but in
> > > our secure VMs, we still need to make it use swiotlb in order to bounce
> > > through non-secure pages.
> > 
> > On arm/arm64, the problem we have is that legacy virtio devices on the MMIO
> > transport (so definitely not PCI) have historically been advertised by qemu
> > as not being cache coherent, but because the virtio core has bypassed DMA
> > ops then everything has happened to work. If we blindly enable the arch DMA
> > ops, we'll plumb in the non-coherent ops and start getting data corruption,
> > so we do need a way to quirk virtio as being "always coherent" if we want to
> > use the DMA ops (which we do, because our emulation platforms have an IOMMU
> > for all virtio devices).
> > 
> > Will
> 
> Right that's not very different from placing the device within the IOMMU
> domain but in fact bypassing the IOMMU

Hmm, I'm not sure I follow you here -- the IOMMU bypassing is handled
inside the IOMMU driver, so we'd still end up with non-coherent DMA ops
for the guest accesses. The presence of an IOMMU doesn't imply coherency for
us. Or am I missing your point here?

> I wonder whether anyone ever needs a non coherent virtio-mmio. If yes we
> can extend PLATFORM_IOMMU to cover that or add another bit.

I think that's probably the right way around: assume that legacy virtio-mmio
devices are coherent by default.

> What exactly do the non-coherent ops do that causes the corruption?

The non-coherent ops mean that the guest ends up allocating the vring queues
using non-cacheable mappings, whereas qemu/hypervisor uses a cacheable
mapping despite not advertising the devices as being cache-coherent.

This hits something in the architecture known as "mismatched aliases", which
means that coherency is lost between the guest and the hypervisor,
consequently resulting in data not being visible and ordering not being
guaranteed. The usual symptom is that the device appears to lock up iirc,
because the guest and the hypervisor are unable to communicate with each
other.

Does that help to clarify things?

Thanks,

Will

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Christoph Hellwig @ 2018-08-06 15:24 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: robh, srikar, Benjamin Herrenschmidt, linuxram, linux-kernel,
	virtualization, hch, paulus, joe, david, linuxppc-dev, elfring,
	haren, Anshuman Khandual
In-Reply-To: <20180806163440-mutt-send-email-mst@kernel.org>

On Mon, Aug 06, 2018 at 04:36:43PM +0300, Michael S. Tsirkin wrote:
> On Mon, Aug 06, 2018 at 02:32:28PM +0530, Anshuman Khandual wrote:
> > On 08/05/2018 05:54 AM, Michael S. Tsirkin wrote:
> > > On Fri, Aug 03, 2018 at 08:21:26PM -0500, Benjamin Herrenschmidt wrote:
> > >> On Fri, 2018-08-03 at 22:08 +0300, Michael S. Tsirkin wrote:
> > >>>>>> Please go through these patches and review whether this approach broadly
> > >>>>>> makes sense. I will appreciate suggestions, inputs, comments regarding
> > >>>>>> the patches or the approach in general. Thank you.
> > >>>>>
> > >>>>> Jason did some work on profiling this. Unfortunately he reports
> > >>>>> about 4% extra overhead from this switch on x86 with no vIOMMU.
> > >>>>
> > >>>> The test is rather simple, just run pktgen (pktgen_sample01_simple.sh) in
> > >>>> guest and measure PPS on tap on host.
> > >>>>
> > >>>> Thanks
> > >>>
> > >>> Could you supply host configuration involved please?
> > >>
> > >> I wonder how much of that could be caused by Spectre mitigations
> > >> blowing up indirect function calls...
> > >>
> > >> Cheers,
> > >> Ben.
> > > 
> > > I won't be surprised. If yes I suggested a way to mitigate the overhead.
> > 
> > Did we get better results (lower regression due to indirect calls) with
> > the suggested mitigation ? Just curious.
> 
> I'm referring to this:
> 	I wonder whether we can support map_sg and friends being NULL, then use
> 	that when mapping is an identity. A conditional branch there is likely
> 	very cheap.
> 
> I don't think anyone tried implementing this yes.

I've done something very similar in the thread I posted a few years
ago. I plan to get a version of that upstream for 4.20, but it won't
cover the virtio case, just the real direct mapping.

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Michael S. Tsirkin @ 2018-08-06 16:06 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: robh, srikar, Benjamin Herrenschmidt, linuxram, linux-kernel,
	virtualization, paulus, joe, david, linuxppc-dev, elfring, haren,
	Anshuman Khandual
In-Reply-To: <20180806152406.GA29020@infradead.org>

On Mon, Aug 06, 2018 at 08:24:06AM -0700, Christoph Hellwig wrote:
> On Mon, Aug 06, 2018 at 04:36:43PM +0300, Michael S. Tsirkin wrote:
> > On Mon, Aug 06, 2018 at 02:32:28PM +0530, Anshuman Khandual wrote:
> > > On 08/05/2018 05:54 AM, Michael S. Tsirkin wrote:
> > > > On Fri, Aug 03, 2018 at 08:21:26PM -0500, Benjamin Herrenschmidt wrote:
> > > >> On Fri, 2018-08-03 at 22:08 +0300, Michael S. Tsirkin wrote:
> > > >>>>>> Please go through these patches and review whether this approach broadly
> > > >>>>>> makes sense. I will appreciate suggestions, inputs, comments regarding
> > > >>>>>> the patches or the approach in general. Thank you.
> > > >>>>>
> > > >>>>> Jason did some work on profiling this. Unfortunately he reports
> > > >>>>> about 4% extra overhead from this switch on x86 with no vIOMMU.
> > > >>>>
> > > >>>> The test is rather simple, just run pktgen (pktgen_sample01_simple.sh) in
> > > >>>> guest and measure PPS on tap on host.
> > > >>>>
> > > >>>> Thanks
> > > >>>
> > > >>> Could you supply host configuration involved please?
> > > >>
> > > >> I wonder how much of that could be caused by Spectre mitigations
> > > >> blowing up indirect function calls...
> > > >>
> > > >> Cheers,
> > > >> Ben.
> > > > 
> > > > I won't be surprised. If yes I suggested a way to mitigate the overhead.
> > > 
> > > Did we get better results (lower regression due to indirect calls) with
> > > the suggested mitigation ? Just curious.
> > 
> > I'm referring to this:
> > 	I wonder whether we can support map_sg and friends being NULL, then use
> > 	that when mapping is an identity. A conditional branch there is likely
> > 	very cheap.
> > 
> > I don't think anyone tried implementing this yes.
> 
> I've done something very similar in the thread I posted a few years
> ago.

Right so that was before spectre where a virtual call was cheaper :(

> I plan to get a version of that upstream for 4.20, but it won't
> cover the virtio case, just the real direct mapping.

I guess this RFC will have to be reworked on top and performance retested.

Thanks,

-- 
MST

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Christoph Hellwig @ 2018-08-06 16:10 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: robh, srikar, Benjamin Herrenschmidt, linuxram, linux-kernel,
	virtualization, Christoph Hellwig, paulus, joe, david,
	linuxppc-dev, elfring, haren, Anshuman Khandual
In-Reply-To: <20180806190321-mutt-send-email-mst@kernel.org>

On Mon, Aug 06, 2018 at 07:06:05PM +0300, Michael S. Tsirkin wrote:
> > I've done something very similar in the thread I posted a few years
> > ago.
> 
> Right so that was before spectre where a virtual call was cheaper :(

Sorry, I meant days, not years.  The whole point of the thread was the
slowdowns due to retpolines, which are the software spectre mitigation.

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Michael S. Tsirkin @ 2018-08-06 16:13 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: robh, srikar, Benjamin Herrenschmidt, linuxram, linux-kernel,
	virtualization, paulus, joe, david, linuxppc-dev, elfring, haren,
	Anshuman Khandual
In-Reply-To: <20180806161040.GA4675@infradead.org>

On Mon, Aug 06, 2018 at 09:10:40AM -0700, Christoph Hellwig wrote:
> On Mon, Aug 06, 2018 at 07:06:05PM +0300, Michael S. Tsirkin wrote:
> > > I've done something very similar in the thread I posted a few years
> > > ago.
> > 
> > Right so that was before spectre where a virtual call was cheaper :(
> 
> Sorry, I meant days, not years.  The whole point of the thread was the
> slowdowns due to retpolines, which are the software spectre mitigation.

Oh that makes sense then. Could you post a pointer pls so
this patchset is rebased on top (there are things to
change about 4/4 but 1-3 could go in if they don't add
overhead)?

-- 
MST

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Christoph Hellwig @ 2018-08-06 16:34 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: robh, srikar, Benjamin Herrenschmidt, linuxram, linux-kernel,
	virtualization, Christoph Hellwig, paulus, joe, david,
	linuxppc-dev, elfring, haren, Anshuman Khandual
In-Reply-To: <20180806191111-mutt-send-email-mst@kernel.org>

On Mon, Aug 06, 2018 at 07:13:32PM +0300, Michael S. Tsirkin wrote:
> Oh that makes sense then. Could you post a pointer pls so
> this patchset is rebased on top (there are things to
> change about 4/4 but 1-3 could go in if they don't add
> overhead)?

The dma mapping direct calls will need a major work vs what I posted.
I plan to start that work in about two weeks once returning from my
vacation.

^ permalink raw reply

* Re: [PATCH net-next V2] vhost: switch to use new message format
From: David Miller @ 2018-08-06 17:41 UTC (permalink / raw)
  To: jasowang; +Cc: netdev, virtualization, linux-kernel, kvm, mst
In-Reply-To: <1533525467-17787-1-git-send-email-jasowang@redhat.com>

From: Jason Wang <jasowang@redhat.com>
Date: Mon,  6 Aug 2018 11:17:47 +0800

> We use to have message like:
> 
> struct vhost_msg {
> 	int type;
> 	union {
> 		struct vhost_iotlb_msg iotlb;
> 		__u8 padding[64];
> 	};
> };
> 
> Unfortunately, there will be a hole of 32bit in 64bit machine because
> of the alignment. This leads a different formats between 32bit API and
> 64bit API. What's more it will break 32bit program running on 64bit
> machine.
> 
> So fixing this by introducing a new message type with an explicit
> 32bit reserved field after type like:
> 
> struct vhost_msg_v2 {
> 	__u32 type;
> 	__u32 reserved;
> 	union {
> 		struct vhost_iotlb_msg iotlb;
> 		__u8 padding[64];
> 	};
> };
> 
> We will have a consistent ABI after switching to use this. To enable
> this capability, introduce a new ioctl (VHOST_SET_BAKCEND_FEATURE) for
> userspace to enable this feature (VHOST_BACKEND_F_IOTLB_V2).
> 
> Fixes: 6b1e6cc7855b ("vhost: new device IOTLB API")
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
> Changes from V1:
> - use __u32 instead of int for type

Applied, thanks Jason.

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Benjamin Herrenschmidt @ 2018-08-06 19:52 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: robh, srikar, Michael S. Tsirkin, mpe, Will Deacon, linux-kernel,
	linuxram, virtualization, jean-philippe.brucker, paulus,
	marc.zyngier, joe, robin.murphy, david, linuxppc-dev, elfring,
	haren, Anshuman Khandual
In-Reply-To: <20180806094243.GA16032@infradead.org>

On Mon, 2018-08-06 at 02:42 -0700, Christoph Hellwig wrote:
> On Mon, Aug 06, 2018 at 07:16:47AM +1000, Benjamin Herrenschmidt wrote:
> > Who would set this bit ? qemu ? Under what circumstances ?
> 
> I don't really care who sets what.  The implementation might not even
> involved qemu.
> 
> It is your job to write a coherent interface specification that does
> not depend on the used components.  The hypervisor might be PAPR,
> Linux + qemu, VMware, Hyperv or something so secret that you'd have
> to shoot me if you had to tell me.  The guest might be Linux, FreeBSD,
> AIX, OS400 or a Hipster project of the day in Rust.  As long as we
> properly specify the interface it simplify does not matter.

That's the point Christoph. The interface is today's interface. It does
NOT change. That information is not part of the interface.

It's the VM itself that is stashing away its memory in a secret place,
and thus needs to do bounce buffering. There is no change to the virtio
interface per-se.

> > What would be the effect of this bit while VIRTIO_F_IOMMU is NOT set,
> > ie, what would qemu do and what would Linux do ? I'm not sure I fully
> > understand your idea.
> 
> In a perfect would we'd just reuse VIRTIO_F_IOMMU and clarify the
> description which currently is rather vague but basically captures
> the use case.  Currently is is:
> 
> VIRTIO_F_IOMMU_PLATFORM(33)
>     This feature indicates that the device is behind an IOMMU that
>     translates bus addresses from the device into physical addresses in
>     memory. If this feature bit is set to 0, then the device emits
>     physical addresses which are not translated further, even though an
>     IOMMU may be present.
> 
> And I'd change it to something like:
> 
> VIRTIO_F_PLATFORM_DMA(33)
>     This feature indicates that the device emits platform specific
>     bus addresses that might not be identical to physical address.
>     The translation of physical to bus address is platform speific
>     and defined by the plaform specification for the bus that the virtio
>     device is attached to.
>     If this feature bit is set to 0, then the device emits
>     physical addresses which are not translated further, even if
>     the platform would normally require translations for the bus that
>     the virtio device is attached to.
> 
> If we can't change the defintion any more we should deprecate the
> old VIRTIO_F_IOMMU_PLATFORM bit, and require the VIRTIO_F_IOMMU_PLATFORM
> and VIRTIO_F_PLATFORM_DMA to be not set at the same time.

But this doesn't really change our problem does it ?

None of what happens in our case is part of the "interface". The
suggestion to force the iommu ON was simply that it was a "workaround"
as by doing so, we get to override the DMA ops, but that's just a
trick.

Fundamentally, what we need to solve is pretty much entirely a guest
problem.

> > I'm trying to understand because the limitation is not a device side
> > limitation, it's not a qemu limitation, it's actually more of a VM
> > limitation. It has most of its memory pages made inaccessible for
> > security reasons. The platform from a qemu/KVM perspective is almost
> > entirely normal.
> 
> Well, find a way to describe this either in the qemu specification using
> new feature bits, or by using something like the above.

But again, why do you want to involve the interface, and thus the
hypervisor for something that is essentially what the guest is doign to
itself ?

It really is something we need to solve locally to the guest, it's not
part of the interface.

Cheers,
Ben.

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Benjamin Herrenschmidt @ 2018-08-06 19:56 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: robh, srikar, mpe, Will Deacon, linux-kernel, linuxram,
	virtualization, Christoph Hellwig, jean-philippe.brucker, paulus,
	marc.zyngier, joe, robin.murphy, david, linuxppc-dev, elfring,
	haren, Anshuman Khandual
In-Reply-To: <20180806164106-mutt-send-email-mst@kernel.org>

On Mon, 2018-08-06 at 16:46 +0300, Michael S. Tsirkin wrote:
> 
> > Right, we'll need some quirk to disable balloons  in the guest I
> > suppose.
> > 
> > Passing something from libvirt is cumbersome because the end user may
> > not even need to know about secure VMs. There are use cases where the
> > security is a contract down to some special application running inside
> > the secure VM, the sysadmin knows nothing about.
> > 
> > Also there's repercussions all the way to admin tools, web UIs etc...
> > so it's fairly wide ranging.
> > 
> > So as long as we only need to quirk a couple of devices, it's much
> > better contained that way.
> 
> So just the balloon thing already means that yes management and all the
> way to the user tools must know this is going on. Otherwise
> user will try to inflate the balloon and wonder why this does not work.

There is *dozens* of management systems out there, not even all open
source, we won't ever be able to see the end of the tunnel if we need
to teach every single of them, including end users, about platform
specific new VM flags like that.

.../...

> Here's another example: you can't migrate a secure vm to hypervisor
> which doesn't support this feature. Again management tools above libvirt
> need to know otherwise they will try.

There will have to be a new machine type for that I suppose, yes,
though it's not just the hypervisor that needs to know about the
modified migration stream, it's also the need to have a compatible
ultravisor with the right keys on the other side.

So migration is going to be special and require extra admin work in all
cases yes. But not all secure VMs are meant to be migratable.

In any case, back to the problem at hand. What a qemu flag gives us is
just a way to force iommu at VM creation time.

This is rather sub-optimal, we don't really want the iommu in the way,
so it's at best a "workaround", and it's not really solving the real
problem.

As I said replying to Christoph, we are "leaking" into the interface
something here that is really what's the VM is doing to itself, which
is to stash its memory away in an inaccessible place.

Cheers,
Ben.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox