* Re: [PATCH 0/1] vhost: add vhost_blk driver
From: Christian Borntraeger @ 2018-11-05 15:48 UTC (permalink / raw)
To: Jason Wang, Vitaly Mayatskikh, Michael S . Tsirkin
Cc: Paolo Bonzini, virtualization, linux-kernel, Stefan Hajnoczi, kvm
In-Reply-To: <6a7f1668-bf2d-0caa-2efd-c8fab5f1db24@redhat.com>
On 11/05/2018 04:00 AM, Jason Wang wrote:
>
> On 2018/11/3 上午2:21, Vitaly Mayatskikh wrote:
>> vhost_blk is a host-side kernel mode accelerator for virtio-blk. The
>> driver allows VM to reach a near bare-metal disk performance. See IOPS
>> numbers below (fio --rw=randread --bs=4k).
>>
>> This implementation uses kiocb interface. It is slightly slower than
>> going directly through bio, but is simpler and also works with disk
>> images placed on a file system.
>>
>> # fio num-jobs
>> # A: bare metal over block
>> # B: bare metal over file
>> # C: virtio-blk over block
>> # D: virtio-blk over file
>> # E: vhost-blk bio over block
>> # F: vhost-blk kiocb over block
>> # G: vhost-blk kiocb over file
>> #
>> # A B C D E F G
>>
>> 1 171k 151k 148k 151k 195k 187k 175k
>> 2 328k 302k 249k 241k 349k 334k 296k
>> 3 479k 437k 179k 174k 501k 464k 404k
>> 4 622k 568k 143k 183k 620k 580k 492k
>> 5 755k 697k 136k 128k 737k 693k 579k
>> 6 887k 808k 131k 120k 830k 782k 640k
>> 7 1004k 926k 126k 131k 926k 863k 693k
>> 8 1099k 1015k 117k 115k 1001k 931k 712k
>> 9 1194k 1119k 115k 111k 1055k 991k 711k
>> 10 1278k 1207k 109k 114k 1130k 1046k 695k
>> 11 1345k 1280k 110k 108k 1119k 1091k 663k
>> 12 1411k 1356k 104k 106k 1201k 1142k 629k
>> 13 1466k 1423k 106k 106k 1260k 1170k 607k
>> 14 1517k 1486k 103k 106k 1296k 1179k 589k
>> 15 1552k 1543k 102k 102k 1322k 1191k 571k
>> 16 1480k 1506k 101k 102k 1346k 1202k 566k
>>
>> Vitaly Mayatskikh (1):
>> Add vhost_blk driver
>>
>> drivers/vhost/Kconfig | 13 ++
>> drivers/vhost/Makefile | 3 +
>> drivers/vhost/blk.c | 510 +++++++++++++++++++++++++++++++++++++++++
>> 3 files changed, 526 insertions(+)
>> create mode 100644 drivers/vhost/blk.c
>>
>
> Hi:
>
> Thanks for the patches.
>
> This is not the first attempt for having vhost-blk:
>
> - Badari's version: https://lwn.net/Articles/379864/
>
> - Asias' version: https://lwn.net/Articles/519880/
>
> It's better to describe the differences (kiocb vs bio? performance?). E.g if my memory is correct, Asias said it doesn't give much improvement compared with userspace qemu.
>
> And what's more important, I believe we tend to use virtio-scsi nowdays. So what's the advantages of vhost-blk over vhost-scsi?
For the record, we still do use virtio-blk a lot. As we see new things like discard/write zero
support it seems that others do as well.
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH 0/1] vhost: add vhost_blk driver
From: Christian Borntraeger @ 2018-11-05 14:02 UTC (permalink / raw)
To: Michael S. Tsirkin, Vitaly Mayatskikh
Cc: Paolo Bonzini, linux-kernel, kvm, virtualization
In-Reply-To: <20181102142446-mutt-send-email-mst@kernel.org>
On 11/02/2018 07:26 PM, Michael S. Tsirkin wrote:
> On Fri, Nov 02, 2018 at 06:21:22PM +0000, Vitaly Mayatskikh wrote:
>> vhost_blk is a host-side kernel mode accelerator for virtio-blk. The
>> driver allows VM to reach a near bare-metal disk performance. See IOPS
>> numbers below (fio --rw=randread --bs=4k).
>>
>> This implementation uses kiocb interface. It is slightly slower than
>> going directly through bio, but is simpler and also works with disk
>> images placed on a file system.
This should also work with other transports like virtio-ccw (instead of virtio-pci).
Correct?
>>
>> # fio num-jobs
>> # A: bare metal over block
>> # B: bare metal over file
>> # C: virtio-blk over block
>> # D: virtio-blk over file
>> # E: vhost-blk bio over block
>> # F: vhost-blk kiocb over block
>> # G: vhost-blk kiocb over file
>> #
>> # A B C D E F G
>>
>> 1 171k 151k 148k 151k 195k 187k 175k
>> 2 328k 302k 249k 241k 349k 334k 296k
>> 3 479k 437k 179k 174k 501k 464k 404k
>> 4 622k 568k 143k 183k 620k 580k 492k
>> 5 755k 697k 136k 128k 737k 693k 579k
>> 6 887k 808k 131k 120k 830k 782k 640k
>> 7 1004k 926k 126k 131k 926k 863k 693k
>> 8 1099k 1015k 117k 115k 1001k 931k 712k
>> 9 1194k 1119k 115k 111k 1055k 991k 711k
>> 10 1278k 1207k 109k 114k 1130k 1046k 695k
>> 11 1345k 1280k 110k 108k 1119k 1091k 663k
>> 12 1411k 1356k 104k 106k 1201k 1142k 629k
>> 13 1466k 1423k 106k 106k 1260k 1170k 607k
>> 14 1517k 1486k 103k 106k 1296k 1179k 589k
>> 15 1552k 1543k 102k 102k 1322k 1191k 571k
>> 16 1480k 1506k 101k 102k 1346k 1202k 566k
>>
>> Vitaly Mayatskikh (1):
>> Add vhost_blk driver
>
>
> Thanks!
> Before merging this, I'd like to get some acks from userspace that it's
> actually going to be used - e.g. QEMU block maintainers.
>
>> drivers/vhost/Kconfig | 13 ++
>> drivers/vhost/Makefile | 3 +
>> drivers/vhost/blk.c | 510 +++++++++++++++++++++++++++++++++++++++++
>> 3 files changed, 526 insertions(+)
>> create mode 100644 drivers/vhost/blk.c
>>
>> --
>> 2.17.1
> _______________________________________________
> Virtualization mailing list
> Virtualization@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/virtualization
>
^ permalink raw reply
* Re: [PATCH 0/5] VSOCK: support mergeable rx buffer in vhost-vsock
From: Jason Wang @ 2018-11-05 9:21 UTC (permalink / raw)
To: jiangyiwen, stefanha; +Cc: netdev, kvm, virtualization
In-Reply-To: <5BDFF49C.3040603@huawei.com>
On 2018/11/5 下午3:43, jiangyiwen wrote:
> Now vsock only support send/receive small packet, it can't achieve
> high performance. As previous discussed with Jason Wang, I revisit the
> idea of vhost-net about mergeable rx buffer and implement the mergeable
> rx buffer in vhost-vsock, it can allow big packet to be scattered in
> into different buffers and improve performance obviously.
>
> I write a tool to test the vhost-vsock performance, mainly send big
> packet(64K) included guest->Host and Host->Guest. The result as
> follows:
>
> Before performance:
> Single socket Multiple sockets(Max Bandwidth)
> Guest->Host ~400MB/s ~480MB/s
> Host->Guest ~1450MB/s ~1600MB/s
>
> After performance:
> Single socket Multiple sockets(Max Bandwidth)
> Guest->Host ~1700MB/s ~2900MB/s
> Host->Guest ~1700MB/s ~2900MB/s
>
> From the test results, the performance is improved obviously, and guest
> memory will not be wasted.
Hi:
Thanks for the patches and the numbers are really impressive.
But instead of duplicating codes between sock and net. I was considering
to use virtio-net as a transport of vsock. Then we may have all existed
features likes batching, mergeable rx buffers and multiqueue. Want to
consider this idea? Thoughts?
>
> ---
>
> Yiwen Jiang (5):
> VSOCK: support fill mergeable rx buffer in guest
> VSOCK: support fill data to mergeable rx buffer in host
> VSOCK: support receive mergeable rx buffer in guest
> VSOCK: modify default rx buf size to improve performance
> VSOCK: batch sending rx buffer to increase bandwidth
>
> drivers/vhost/vsock.c | 135 +++++++++++++++++++++++------
> include/linux/virtio_vsock.h | 15 +++-
> include/uapi/linux/virtio_vsock.h | 5 ++
> net/vmw_vsock/virtio_transport.c | 147 ++++++++++++++++++++++++++------
> net/vmw_vsock/virtio_transport_common.c | 59 +++++++++++--
> 5 files changed, 300 insertions(+), 61 deletions(-)
>
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* [PATCH 5/5] VSOCK: batch sending rx buffer to increase bandwidth
From: jiangyiwen @ 2018-11-05 7:48 UTC (permalink / raw)
To: stefanha, Jason Wang; +Cc: netdev, kvm, virtualization
Batch sending rx buffer can improve total bandwidth.
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
---
drivers/vhost/vsock.c | 24 +++++++++++++++++-------
1 file changed, 17 insertions(+), 7 deletions(-)
diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 648be39..a587ddc 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -148,10 +148,12 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
struct vhost_virtqueue *vq)
{
struct vhost_virtqueue *tx_vq = &vsock->vqs[VSOCK_VQ_TX];
- bool added = false;
bool restart_tx = false;
int mergeable;
size_t vsock_hlen;
+ int batch_count = 0;
+
+#define VHOST_VSOCK_BATCH 16
mutex_lock(&vq->mutex);
@@ -191,8 +193,9 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
list_del_init(&pkt->list);
spin_unlock_bh(&vsock->send_pkt_list_lock);
- headcount = get_rx_bufs(vq, vq->heads, vsock_hlen + pkt->len,
- &in, likely(mergeable) ? UIO_MAXIOV : 1);
+ headcount = get_rx_bufs(vq, vq->heads + batch_count,
+ vsock_hlen + pkt->len, &in,
+ likely(mergeable) ? UIO_MAXIOV : 1);
if (headcount <= 0) {
spin_lock_bh(&vsock->send_pkt_list_lock);
list_add(&pkt->list, &vsock->send_pkt_list);
@@ -238,8 +241,12 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
break;
}
- vhost_add_used_n(vq, vq->heads, headcount);
- added = true;
+ batch_count += headcount;
+ if (batch_count > VHOST_VSOCK_BATCH) {
+ vhost_add_used_and_signal_n(&vsock->dev, vq,
+ vq->heads, batch_count);
+ batch_count = 0;
+ }
if (pkt->reply) {
int val;
@@ -258,8 +265,11 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
virtio_transport_free_pkt(pkt);
}
- if (added)
- vhost_signal(&vsock->dev, vq);
+
+ if (batch_count) {
+ vhost_add_used_and_signal_n(&vsock->dev, vq,
+ vq->heads, batch_count);
+ }
out:
mutex_unlock(&vq->mutex);
--
1.8.3.1
^ permalink raw reply related
* [PATCH 4/5] VSOCK: modify default rx buf size to improve performance
From: jiangyiwen @ 2018-11-05 7:47 UTC (permalink / raw)
To: stefanha, Jason Wang; +Cc: netdev, kvm, virtualization
Since VSOCK already support mergeable rx buffer, so it can
implement the balance with performance and guest memory,
we can increase the default rx buffer size to improve
performance.
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
---
include/linux/virtio_vsock.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index 6be3cd7..594e720 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -10,7 +10,7 @@
#define VIRTIO_VSOCK_DEFAULT_MIN_BUF_SIZE 128
#define VIRTIO_VSOCK_DEFAULT_BUF_SIZE (1024 * 256)
#define VIRTIO_VSOCK_DEFAULT_MAX_BUF_SIZE (1024 * 256)
-#define VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE (1024 * 4)
+#define VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE (1024 * 64)
#define VIRTIO_VSOCK_MAX_BUF_SIZE 0xFFFFFFFFUL
#define VIRTIO_VSOCK_MAX_PKT_BUF_SIZE (1024 * 64)
/* virtio_vsock_pkt + max_pkt_len(default MAX_PKT_BUF_SIZE) */
--
1.8.3.1
^ permalink raw reply related
* [PATCH 3/5] VSOCK: support receive mergeable rx buffer in guest
From: jiangyiwen @ 2018-11-05 7:47 UTC (permalink / raw)
To: stefanha, Jason Wang; +Cc: netdev, kvm, virtualization
Guest receive mergeable rx buffer, it can merge
scatter rx buffer into a big buffer and then copy
to user space.
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
---
include/linux/virtio_vsock.h | 9 ++++
net/vmw_vsock/virtio_transport.c | 75 +++++++++++++++++++++++++++++----
net/vmw_vsock/virtio_transport_common.c | 59 ++++++++++++++++++++++----
3 files changed, 127 insertions(+), 16 deletions(-)
diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index da9e1fe..6be3cd7 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -13,6 +13,8 @@
#define VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE (1024 * 4)
#define VIRTIO_VSOCK_MAX_BUF_SIZE 0xFFFFFFFFUL
#define VIRTIO_VSOCK_MAX_PKT_BUF_SIZE (1024 * 64)
+/* virtio_vsock_pkt + max_pkt_len(default MAX_PKT_BUF_SIZE) */
+#define VIRTIO_VSOCK_MAX_MRG_BUF_NUM ((VIRTIO_VSOCK_MAX_PKT_BUF_SIZE / PAGE_SIZE) + 1)
/* Virtio-vsock feature */
#define VIRTIO_VSOCK_F_MRG_RXBUF 0 /* Host can merge receive buffers. */
@@ -48,6 +50,11 @@ struct virtio_vsock_sock {
struct list_head rx_queue;
};
+struct virtio_vsock_mrg_rxbuf {
+ void *buf;
+ u32 len;
+};
+
struct virtio_vsock_pkt {
struct virtio_vsock_hdr hdr;
struct virtio_vsock_mrg_rxbuf_hdr mrg_rxbuf_hdr;
@@ -59,6 +66,8 @@ struct virtio_vsock_pkt {
u32 len;
u32 off;
bool reply;
+ bool mergeable;
+ struct virtio_vsock_mrg_rxbuf mrg_rxbuf[VIRTIO_VSOCK_MAX_MRG_BUF_NUM];
};
struct virtio_vsock_pkt_info {
diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index 2040a9e..3557ad3 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -359,11 +359,62 @@ static bool virtio_transport_more_replies(struct virtio_vsock *vsock)
return val < virtqueue_get_vring_size(vq);
}
+static struct virtio_vsock_pkt *receive_mergeable(struct virtqueue *vq,
+ struct virtio_vsock *vsock, unsigned int *total_len)
+{
+ struct virtio_vsock_pkt *pkt;
+ u16 num_buf;
+ void *page;
+ unsigned int len;
+ int i = 0;
+
+ page = virtqueue_get_buf(vq, &len);
+ if (!page)
+ return NULL;
+
+ *total_len = len;
+ vsock->rx_buf_nr--;
+
+ pkt = page;
+ num_buf = le16_to_cpu(pkt->mrg_rxbuf_hdr.num_buffers);
+ if (!num_buf || num_buf > VIRTIO_VSOCK_MAX_MRG_BUF_NUM)
+ goto err;
+
+ pkt->mergeable = true;
+ if (!le32_to_cpu(pkt->hdr.len))
+ return pkt;
+
+ len -= sizeof(struct virtio_vsock_pkt);
+ pkt->mrg_rxbuf[i].buf = page + sizeof(struct virtio_vsock_pkt);
+ pkt->mrg_rxbuf[i].len = len;
+ i++;
+
+ while (--num_buf) {
+ page = virtqueue_get_buf(vq, &len);
+ if (!page)
+ goto err;
+
+ *total_len += len;
+ vsock->rx_buf_nr--;
+
+ pkt->mrg_rxbuf[i].buf = page;
+ pkt->mrg_rxbuf[i].len = len;
+ i++;
+ }
+
+ return pkt;
+err:
+ virtio_transport_free_pkt(pkt);
+ return NULL;
+}
+
static void virtio_transport_rx_work(struct work_struct *work)
{
struct virtio_vsock *vsock =
container_of(work, struct virtio_vsock, rx_work);
struct virtqueue *vq;
+ size_t vsock_hlen = vsock->mergeable ? sizeof(struct virtio_vsock_pkt) :
+ sizeof(struct virtio_vsock_hdr);
vq = vsock->vqs[VSOCK_VQ_RX];
@@ -383,21 +434,29 @@ static void virtio_transport_rx_work(struct work_struct *work)
goto out;
}
- pkt = virtqueue_get_buf(vq, &len);
- if (!pkt) {
- break;
- }
+ if (likely(vsock->mergeable)) {
+ pkt = receive_mergeable(vq, vsock, &len);
+ if (!pkt)
+ break;
- vsock->rx_buf_nr--;
+ pkt->len = le32_to_cpu(pkt->hdr.len);
+ } else {
+ pkt = virtqueue_get_buf(vq, &len);
+ if (!pkt) {
+ break;
+ }
+
+ vsock->rx_buf_nr--;
+ }
/* Drop short/long packets */
- if (unlikely(len < sizeof(pkt->hdr) ||
- len > sizeof(pkt->hdr) + pkt->len)) {
+ if (unlikely(len < vsock_hlen ||
+ len > vsock_hlen + pkt->len)) {
virtio_transport_free_pkt(pkt);
continue;
}
- pkt->len = len - sizeof(pkt->hdr);
+ pkt->len = len - vsock_hlen;
virtio_transport_deliver_tap_pkt(pkt);
virtio_transport_recv_pkt(pkt);
}
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index 3ae3a33..7bef1d5 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -272,14 +272,49 @@ static int virtio_transport_send_credit_update(struct vsock_sock *vsk,
*/
spin_unlock_bh(&vvs->rx_lock);
- err = memcpy_to_msg(msg, pkt->buf + pkt->off, bytes);
- if (err)
- goto out;
+ if (pkt->mergeable) {
+ struct virtio_vsock_mrg_rxbuf *buf = pkt->mrg_rxbuf;
+ size_t mrg_copy_bytes, last_buf_total = 0, rxbuf_off;
+ size_t tmp_bytes = bytes;
+ int i;
+
+ for (i = 0; i < le16_to_cpu(pkt->mrg_rxbuf_hdr.num_buffers); i++) {
+ if (pkt->off > last_buf_total + buf[i].len) {
+ last_buf_total += buf[i].len;
+ continue;
+ }
+
+ rxbuf_off = pkt->off - last_buf_total;
+ mrg_copy_bytes = min(buf[i].len - rxbuf_off, tmp_bytes);
+ err = memcpy_to_msg(msg, buf[i].buf + rxbuf_off, mrg_copy_bytes);
+ if (err)
+ goto out;
+
+ tmp_bytes -= mrg_copy_bytes;
+ pkt->off += mrg_copy_bytes;
+ last_buf_total += buf[i].len;
+ if (!tmp_bytes)
+ break;
+ }
+
+ if (tmp_bytes) {
+ printk(KERN_WARNING "WARNING! bytes = %llu, "
+ "bytes = %llu\n",
+ (unsigned long long)bytes,
+ (unsigned long long)tmp_bytes);
+ }
+
+ total += (bytes - tmp_bytes);
+ } else {
+ err = memcpy_to_msg(msg, pkt->buf + pkt->off, bytes);
+ if (err)
+ goto out;
+
+ total += bytes;
+ pkt->off += bytes;
+ }
spin_lock_bh(&vvs->rx_lock);
-
- total += bytes;
- pkt->off += bytes;
if (pkt->off == pkt->len) {
virtio_transport_dec_rx_pkt(vvs, pkt);
list_del(&pkt->list);
@@ -1050,8 +1085,16 @@ void virtio_transport_recv_pkt(struct virtio_vsock_pkt *pkt)
void virtio_transport_free_pkt(struct virtio_vsock_pkt *pkt)
{
- kfree(pkt->buf);
- kfree(pkt);
+ int i;
+
+ if (pkt->mergeable) {
+ for (i = 1; i < le16_to_cpu(pkt->mrg_rxbuf_hdr.num_buffers); i++)
+ free_page((unsigned long)pkt->mrg_rxbuf[i].buf);
+ free_page((unsigned long)(void *)pkt);
+ } else {
+ kfree(pkt->buf);
+ kfree(pkt);
+ }
}
EXPORT_SYMBOL_GPL(virtio_transport_free_pkt);
--
1.8.3.1
^ permalink raw reply related
* [PATCH 2/5] VSOCK: support fill data to mergeable rx buffer in host
From: jiangyiwen @ 2018-11-05 7:45 UTC (permalink / raw)
To: stefanha, Jason Wang; +Cc: netdev, kvm, virtualization
When vhost support VIRTIO_VSOCK_F_MRG_RXBUF feature,
it will merge big packet into rx vq.
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
---
drivers/vhost/vsock.c | 117 +++++++++++++++++++++++++++++++-------
include/linux/virtio_vsock.h | 1 +
include/uapi/linux/virtio_vsock.h | 5 ++
3 files changed, 102 insertions(+), 21 deletions(-)
diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 34bc3ab..648be39 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -22,7 +22,8 @@
#define VHOST_VSOCK_DEFAULT_HOST_CID 2
enum {
- VHOST_VSOCK_FEATURES = VHOST_FEATURES,
+ VHOST_VSOCK_FEATURES = VHOST_FEATURES |
+ (1ULL << VIRTIO_VSOCK_F_MRG_RXBUF),
};
/* Used to track all the vhost_vsock instances on the system. */
@@ -80,6 +81,68 @@ static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
return vsock;
}
+static int get_rx_bufs(struct vhost_virtqueue *vq,
+ struct vring_used_elem *heads, int datalen,
+ unsigned *iovcount, unsigned int quota)
+{
+ unsigned int out, in;
+ int seg = 0;
+ int headcount = 0;
+ unsigned d;
+ int ret;
+ /*
+ * len is always initialized before use since we are always called with
+ * datalen > 0.
+ */
+ u32 uninitialized_var(len);
+
+ while (datalen > 0 && headcount < quota) {
+ if (unlikely(seg >= UIO_MAXIOV)) {
+ ret = -ENOBUFS;
+ goto err;
+ }
+
+ ret = vhost_get_vq_desc(vq, vq->iov + seg,
+ ARRAY_SIZE(vq->iov) - seg, &out,
+ &in, NULL, NULL);
+ if (unlikely(ret < 0))
+ goto err;
+
+ d = ret;
+ if (d == vq->num) {
+ ret = 0;
+ goto err;
+ }
+
+ if (unlikely(out || in <= 0)) {
+ vq_err(vq, "unexpected descriptor format for RX: "
+ "out %d, in %d\n", out, in);
+ ret = -EINVAL;
+ goto err;
+ }
+
+ heads[headcount].id = cpu_to_vhost32(vq, d);
+ len = iov_length(vq->iov + seg, in);
+ heads[headcount].len = cpu_to_vhost32(vq, len);
+ datalen -= len;
+ ++headcount;
+ seg += in;
+ }
+
+ heads[headcount - 1].len = cpu_to_vhost32(vq, len + datalen);
+ *iovcount = seg;
+
+ /* Detect overrun */
+ if (unlikely(datalen > 0)) {
+ ret = UIO_MAXIOV + 1;
+ goto err;
+ }
+ return headcount;
+err:
+ vhost_discard_vq_desc(vq, headcount);
+ return ret;
+}
+
static void
vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
struct vhost_virtqueue *vq)
@@ -87,22 +150,34 @@ static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
struct vhost_virtqueue *tx_vq = &vsock->vqs[VSOCK_VQ_TX];
bool added = false;
bool restart_tx = false;
+ int mergeable;
+ size_t vsock_hlen;
mutex_lock(&vq->mutex);
if (!vq->private_data)
goto out;
+ mergeable = vhost_has_feature(vq, VIRTIO_VSOCK_F_MRG_RXBUF);
+ /*
+ * Guest fill page for rx vq in mergeable case, so it will not
+ * allocate pkt structure, we should reserve size of pkt in advance.
+ */
+ if (likely(mergeable))
+ vsock_hlen = sizeof(struct virtio_vsock_pkt);
+ else
+ vsock_hlen = sizeof(struct virtio_vsock_hdr);
+
/* Avoid further vmexits, we're already processing the virtqueue */
vhost_disable_notify(&vsock->dev, vq);
for (;;) {
struct virtio_vsock_pkt *pkt;
struct iov_iter iov_iter;
- unsigned out, in;
+ unsigned out = 0, in = 0;
size_t nbytes;
size_t len;
- int head;
+ s16 headcount;
spin_lock_bh(&vsock->send_pkt_list_lock);
if (list_empty(&vsock->send_pkt_list)) {
@@ -116,16 +191,9 @@ static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
list_del_init(&pkt->list);
spin_unlock_bh(&vsock->send_pkt_list_lock);
- head = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
- &out, &in, NULL, NULL);
- if (head < 0) {
- spin_lock_bh(&vsock->send_pkt_list_lock);
- list_add(&pkt->list, &vsock->send_pkt_list);
- spin_unlock_bh(&vsock->send_pkt_list_lock);
- break;
- }
-
- if (head == vq->num) {
+ headcount = get_rx_bufs(vq, vq->heads, vsock_hlen + pkt->len,
+ &in, likely(mergeable) ? UIO_MAXIOV : 1);
+ if (headcount <= 0) {
spin_lock_bh(&vsock->send_pkt_list_lock);
list_add(&pkt->list, &vsock->send_pkt_list);
spin_unlock_bh(&vsock->send_pkt_list_lock);
@@ -133,19 +201,13 @@ static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
/* We cannot finish yet if more buffers snuck in while
* re-enabling notify.
*/
- if (unlikely(vhost_enable_notify(&vsock->dev, vq))) {
+ if (!headcount && unlikely(vhost_enable_notify(&vsock->dev, vq))) {
vhost_disable_notify(&vsock->dev, vq);
continue;
}
break;
}
- if (out) {
- virtio_transport_free_pkt(pkt);
- vq_err(vq, "Expected 0 output buffers, got %u\n", out);
- break;
- }
-
len = iov_length(&vq->iov[out], in);
iov_iter_init(&iov_iter, READ, &vq->iov[out], in, len);
@@ -156,6 +218,19 @@ static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
break;
}
+ if (likely(mergeable)) {
+ pkt->mrg_rxbuf_hdr.num_buffers = cpu_to_le16(headcount);
+ nbytes = copy_to_iter(&pkt->mrg_rxbuf_hdr,
+ sizeof(pkt->mrg_rxbuf_hdr), &iov_iter);
+ if (nbytes != sizeof(pkt->mrg_rxbuf_hdr)) {
+ virtio_transport_free_pkt(pkt);
+ vq_err(vq, "Faulted on copying rxbuf hdr\n");
+ break;
+ }
+ iov_iter_advance(&iov_iter, (vsock_hlen -
+ sizeof(pkt->mrg_rxbuf_hdr) - sizeof(pkt->hdr)));
+ }
+
nbytes = copy_to_iter(pkt->buf, pkt->len, &iov_iter);
if (nbytes != pkt->len) {
virtio_transport_free_pkt(pkt);
@@ -163,7 +238,7 @@ static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
break;
}
- vhost_add_used(vq, head, sizeof(pkt->hdr) + pkt->len);
+ vhost_add_used_n(vq, vq->heads, headcount);
added = true;
if (pkt->reply) {
diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index bf84418..da9e1fe 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -50,6 +50,7 @@ struct virtio_vsock_sock {
struct virtio_vsock_pkt {
struct virtio_vsock_hdr hdr;
+ struct virtio_vsock_mrg_rxbuf_hdr mrg_rxbuf_hdr;
struct work_struct work;
struct list_head list;
/* socket refcnt not held, only use for cancellation */
diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
index 1d57ed3..2292f30 100644
--- a/include/uapi/linux/virtio_vsock.h
+++ b/include/uapi/linux/virtio_vsock.h
@@ -63,6 +63,11 @@ struct virtio_vsock_hdr {
__le32 fwd_cnt;
} __attribute__((packed));
+/* It add mergeable rx buffers feature */
+struct virtio_vsock_mrg_rxbuf_hdr {
+ __le16 num_buffers; /* number of mergeable rx buffers */
+} __attribute__((packed));
+
enum virtio_vsock_type {
VIRTIO_VSOCK_TYPE_STREAM = 1,
};
--
1.8.3.1
^ permalink raw reply related
* [PATCH 1/5] VSOCK: support fill mergeable rx buffer in guest
From: jiangyiwen @ 2018-11-05 7:45 UTC (permalink / raw)
To: stefanha, Jason Wang; +Cc: netdev, kvm, virtualization
In driver probing, if virtio has VIRTIO_VSOCK_F_MRG_RXBUF feature,
it will fill mergeable rx buffer, support for host send mergeable
rx buffer. It will fill a page everytime to compact with small
packet and big packet.
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
---
include/linux/virtio_vsock.h | 3 ++
net/vmw_vsock/virtio_transport.c | 72 +++++++++++++++++++++++++++++-----------
2 files changed, 56 insertions(+), 19 deletions(-)
diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index e223e26..bf84418 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -14,6 +14,9 @@
#define VIRTIO_VSOCK_MAX_BUF_SIZE 0xFFFFFFFFUL
#define VIRTIO_VSOCK_MAX_PKT_BUF_SIZE (1024 * 64)
+/* Virtio-vsock feature */
+#define VIRTIO_VSOCK_F_MRG_RXBUF 0 /* Host can merge receive buffers. */
+
enum {
VSOCK_VQ_RX = 0, /* for host to guest data */
VSOCK_VQ_TX = 1, /* for guest to host data */
diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index 5d3cce9..2040a9e 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -64,6 +64,7 @@ struct virtio_vsock {
struct virtio_vsock_event event_list[8];
u32 guest_cid;
+ bool mergeable;
};
static struct virtio_vsock *virtio_vsock_get(void)
@@ -256,6 +257,25 @@ static int virtio_transport_send_pkt_loopback(struct virtio_vsock *vsock,
return 0;
}
+static int fill_mergeable_rx_buff(struct virtqueue *vq)
+{
+ void *page = NULL;
+ struct scatterlist sg;
+ int err;
+
+ page = (void *)get_zeroed_page(GFP_KERNEL);
+ if (!page)
+ return -ENOMEM;
+
+ sg_init_one(&sg, page, PAGE_SIZE);
+
+ err = virtqueue_add_inbuf(vq, &sg, 1, page, GFP_KERNEL);
+ if (err < 0)
+ free_page((unsigned long) page);
+
+ return err;
+}
+
static void virtio_vsock_rx_fill(struct virtio_vsock *vsock)
{
int buf_len = VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE;
@@ -267,27 +287,33 @@ static void virtio_vsock_rx_fill(struct virtio_vsock *vsock)
vq = vsock->vqs[VSOCK_VQ_RX];
do {
- pkt = kzalloc(sizeof(*pkt), GFP_KERNEL);
- if (!pkt)
- break;
+ if (vsock->mergeable) {
+ ret = fill_mergeable_rx_buff(vq);
+ if (ret)
+ break;
+ } else {
+ pkt = kzalloc(sizeof(*pkt), GFP_KERNEL);
+ if (!pkt)
+ break;
- pkt->buf = kmalloc(buf_len, GFP_KERNEL);
- if (!pkt->buf) {
- virtio_transport_free_pkt(pkt);
- break;
- }
+ pkt->buf = kmalloc(buf_len, GFP_KERNEL);
+ if (!pkt->buf) {
+ virtio_transport_free_pkt(pkt);
+ break;
+ }
- pkt->len = buf_len;
+ pkt->len = buf_len;
- sg_init_one(&hdr, &pkt->hdr, sizeof(pkt->hdr));
- sgs[0] = &hdr;
+ sg_init_one(&hdr, &pkt->hdr, sizeof(pkt->hdr));
+ sgs[0] = &hdr;
- sg_init_one(&buf, pkt->buf, buf_len);
- sgs[1] = &buf;
- ret = virtqueue_add_sgs(vq, sgs, 0, 2, pkt, GFP_KERNEL);
- if (ret) {
- virtio_transport_free_pkt(pkt);
- break;
+ sg_init_one(&buf, pkt->buf, buf_len);
+ sgs[1] = &buf;
+ ret = virtqueue_add_sgs(vq, sgs, 0, 2, pkt, GFP_KERNEL);
+ if (ret) {
+ virtio_transport_free_pkt(pkt);
+ break;
+ }
}
vsock->rx_buf_nr++;
} while (vq->num_free);
@@ -588,6 +614,9 @@ static int virtio_vsock_probe(struct virtio_device *vdev)
if (ret < 0)
goto out_vqs;
+ if (virtio_has_feature(vdev, VIRTIO_VSOCK_F_MRG_RXBUF))
+ vsock->mergeable = true;
+
vsock->rx_buf_nr = 0;
vsock->rx_buf_max_nr = 0;
atomic_set(&vsock->queued_replies, 0);
@@ -640,8 +669,12 @@ static void virtio_vsock_remove(struct virtio_device *vdev)
vdev->config->reset(vdev);
mutex_lock(&vsock->rx_lock);
- while ((pkt = virtqueue_detach_unused_buf(vsock->vqs[VSOCK_VQ_RX])))
- virtio_transport_free_pkt(pkt);
+ while ((pkt = virtqueue_detach_unused_buf(vsock->vqs[VSOCK_VQ_RX]))) {
+ if (vsock->mergeable)
+ free_page((unsigned long)(void *)pkt);
+ else
+ virtio_transport_free_pkt(pkt);
+ }
mutex_unlock(&vsock->rx_lock);
mutex_lock(&vsock->tx_lock);
@@ -683,6 +716,7 @@ static void virtio_vsock_remove(struct virtio_device *vdev)
};
static unsigned int features[] = {
+ VIRTIO_VSOCK_F_MRG_RXBUF,
};
static struct virtio_driver virtio_vsock_driver = {
--
1.8.3.1
^ permalink raw reply related
* [PATCH 0/5] VSOCK: support mergeable rx buffer in vhost-vsock
From: jiangyiwen @ 2018-11-05 7:43 UTC (permalink / raw)
To: stefanha, Jason Wang; +Cc: netdev, kvm, virtualization
Now vsock only support send/receive small packet, it can't achieve
high performance. As previous discussed with Jason Wang, I revisit the
idea of vhost-net about mergeable rx buffer and implement the mergeable
rx buffer in vhost-vsock, it can allow big packet to be scattered in
into different buffers and improve performance obviously.
I write a tool to test the vhost-vsock performance, mainly send big
packet(64K) included guest->Host and Host->Guest. The result as
follows:
Before performance:
Single socket Multiple sockets(Max Bandwidth)
Guest->Host ~400MB/s ~480MB/s
Host->Guest ~1450MB/s ~1600MB/s
After performance:
Single socket Multiple sockets(Max Bandwidth)
Guest->Host ~1700MB/s ~2900MB/s
Host->Guest ~1700MB/s ~2900MB/s
From the test results, the performance is improved obviously, and guest
memory will not be wasted.
---
Yiwen Jiang (5):
VSOCK: support fill mergeable rx buffer in guest
VSOCK: support fill data to mergeable rx buffer in host
VSOCK: support receive mergeable rx buffer in guest
VSOCK: modify default rx buf size to improve performance
VSOCK: batch sending rx buffer to increase bandwidth
drivers/vhost/vsock.c | 135 +++++++++++++++++++++++------
include/linux/virtio_vsock.h | 15 +++-
include/uapi/linux/virtio_vsock.h | 5 ++
net/vmw_vsock/virtio_transport.c | 147 ++++++++++++++++++++++++++------
net/vmw_vsock/virtio_transport_common.c | 59 +++++++++++--
5 files changed, 300 insertions(+), 61 deletions(-)
--
1.8.3.1
^ permalink raw reply
* Re: [PATCH 1/5] drm/virtio: add virtio_gpu_alloc_fence()
From: Gerd Hoffmann @ 2018-11-05 6:38 UTC (permalink / raw)
To: Robert Foss
Cc: David Airlie, Emil Velikov, Linux-Kernel@Vger. Kernel. Org,
ML dri-devel, open list:VIRTIO GPU DRIVER, Gustavo Padovan,
Emil Velikov
In-Reply-To: <1c92b6dc-4152-c81b-5180-2f48799b959f@collabora.com>
> > On Thu, 25 Oct 2018 at 19:38, Robert Foss <robert.foss@collabora.com> wrote:
> > >
> > > From: Gustavo Padovan <gustavo.padovan@collabora.com>
> > >
> > > Refactor fence creation to remove the potential allocation failure from
> > > the cmd_submit and atomic_commit paths. Now the fence should be allocated
> > > first and just after we should proceed with the rest of the execution.
> > >
> >
> > Commit does a bit more that what the above says:
> > - dummy, factor out fence creation/destruction
> > - use per virtio_gpu_framebuffer fence
> >
> > Personally I'd keep the two separate patches and elaborate on the latter.
> > Obviously in that case, one will need to add 3 lines worth of
> > virtio_gpu_fence_alloc() in virtio_gpu_cursor_plane_update which will be nuked
> > with the next patch.
> >
> > Not a big deal, but it's up-to the maintainer to make the final call if it's
> > worth splitting or not.
>
> Agreed, I'll hold off with this change until then.
No need to split this, but a bit more verbose commit message would be
good.
cheers,
Gerd
^ permalink raw reply
* Re: [PATCH 0/1] vhost: add vhost_blk driver
From: Jason Wang @ 2018-11-05 3:00 UTC (permalink / raw)
To: Vitaly Mayatskikh, Michael S . Tsirkin
Cc: Paolo Bonzini, Stefan Hajnoczi, linux-kernel, kvm, virtualization
In-Reply-To: <20181102182123.29420-1-v.mayatskih@gmail.com>
On 2018/11/3 上午2:21, Vitaly Mayatskikh wrote:
> vhost_blk is a host-side kernel mode accelerator for virtio-blk. The
> driver allows VM to reach a near bare-metal disk performance. See IOPS
> numbers below (fio --rw=randread --bs=4k).
>
> This implementation uses kiocb interface. It is slightly slower than
> going directly through bio, but is simpler and also works with disk
> images placed on a file system.
>
> # fio num-jobs
> # A: bare metal over block
> # B: bare metal over file
> # C: virtio-blk over block
> # D: virtio-blk over file
> # E: vhost-blk bio over block
> # F: vhost-blk kiocb over block
> # G: vhost-blk kiocb over file
> #
> # A B C D E F G
>
> 1 171k 151k 148k 151k 195k 187k 175k
> 2 328k 302k 249k 241k 349k 334k 296k
> 3 479k 437k 179k 174k 501k 464k 404k
> 4 622k 568k 143k 183k 620k 580k 492k
> 5 755k 697k 136k 128k 737k 693k 579k
> 6 887k 808k 131k 120k 830k 782k 640k
> 7 1004k 926k 126k 131k 926k 863k 693k
> 8 1099k 1015k 117k 115k 1001k 931k 712k
> 9 1194k 1119k 115k 111k 1055k 991k 711k
> 10 1278k 1207k 109k 114k 1130k 1046k 695k
> 11 1345k 1280k 110k 108k 1119k 1091k 663k
> 12 1411k 1356k 104k 106k 1201k 1142k 629k
> 13 1466k 1423k 106k 106k 1260k 1170k 607k
> 14 1517k 1486k 103k 106k 1296k 1179k 589k
> 15 1552k 1543k 102k 102k 1322k 1191k 571k
> 16 1480k 1506k 101k 102k 1346k 1202k 566k
>
> Vitaly Mayatskikh (1):
> Add vhost_blk driver
>
> drivers/vhost/Kconfig | 13 ++
> drivers/vhost/Makefile | 3 +
> drivers/vhost/blk.c | 510 +++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 526 insertions(+)
> create mode 100644 drivers/vhost/blk.c
>
Hi:
Thanks for the patches.
This is not the first attempt for having vhost-blk:
- Badari's version: https://lwn.net/Articles/379864/
- Asias' version: https://lwn.net/Articles/519880/
It's better to describe the differences (kiocb vs bio? performance?).
E.g if my memory is correct, Asias said it doesn't give much improvement
compared with userspace qemu.
And what's more important, I believe we tend to use virtio-scsi nowdays.
So what's the advantages of vhost-blk over vhost-scsi?
Thanks
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH 1/1] vhost: add per-vq worker thread
From: Jason Wang @ 2018-11-05 2:53 UTC (permalink / raw)
To: Vitaly Mayatskikh, Michael S . Tsirkin
Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20181102160710.3741-2-v.mayatskih@gmail.com>
On 2018/11/3 上午12:07, Vitaly Mayatskikh wrote:
> +
> +static int vhost_vq_poll_start(struct vhost_virtqueue *vq)
> +{
> + if (!vq->worker) {
> + vq->worker = kthread_create(vhost_vq_worker, vq, "vhost-%d/%i",
> + vq->dev->pid, vq->index);
> + if (IS_ERR(vq->worker)) {
> + int ret = PTR_ERR(vq->worker);
> +
> + pr_err("%s: can't create vq worker: %d\n", __func__,
> + ret);
> + vq->worker = NULL;
> + return ret;
> + }
> + }
> + vhost_work_init(&vq->work, vhost_vq_poll_start_work);
> + vhost_vq_work_queue(vq, &vq->work);
> + return 0;
> +}
> +
I wonder whether or not it's better to allow the device to specific the
worker here instead of forcing a per vq worker model. Then we can keep
the behavior of exist implementation and do optimization on top?
Thanks
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH 0/1] vhost: parallel virtqueue handling
From: Jason Wang @ 2018-11-05 2:51 UTC (permalink / raw)
To: Vitaly Mayatskikh, Michael S . Tsirkin
Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20181102160710.3741-1-v.mayatskih@gmail.com>
On 2018/11/3 上午12:07, Vitaly Mayatskikh wrote:
> Hi,
>
> I stumbled across poor performance of virtio-blk while working on a
> high-performance network storage protocol. Moving virtio-blk's host
> side to kernel did increase single queue IOPS, but multiqueue disk
> still was not scaling well. It turned out that vhost handles events
> from all virtio queues in one helper thread, and that's pretty much a
> big serialization point.
>
> The following patch enables events handling in per-queue thread and
> increases IO concurrency, see IOPS numbers:
Thanks a lot for the patches. Here's some thoughts:
- This is not the first attempt that tries to parallelize vhost workers.
So we need a comparing among them.
1) Multiple vhost workers from Anthony,
https://www.spinics.net/lists/netdev/msg189432.html
2) ELVIS from IBM, http://www.mulix.org/pubs/eli/elvis-h319.pdf
3) CMWQ from Bandan,
http://www.linux-kvm.org/images/5/52/02x08-Aspen-Bandan_Das-vhost-sharing_is_better.pdf
- vhost-net use a different multiqueue model. Each vhost device on host
is only dealing with a specific queue pair instead of a whole device.
This allow great flexibility and multiqueue could be implemented without
touching vhost codes.
- current vhost-net implementation depends heavily on the assumption of
single thread model especially its busy polling code. It would be broken
by this attempt. If we decide to go this way, this needs to be fixed.
And we do need performance result of networking.
- Having more threads is not necessarily a win, at least we need a
module parameter to other stuffs to control the number of threads I
believe.
Thanks
>
> # num-queues
> # bare metal
> # virtio-blk
> # vhost-blk
>
> 1 171k 148k 195k
> 2 328k 249k 349k
> 3 479k 179k 501k
> 4 622k 143k 620k
> 5 755k 136k 737k
> 6 887k 131k 830k
> 7 1004k 126k 926k
> 8 1099k 117k 1001k
> 9 1194k 115k 1055k
> 10 1278k 109k 1130k
> 11 1345k 110k 1119k
> 12 1411k 104k 1201k
> 13 1466k 106k 1260k
> 14 1517k 103k 1296k
> 15 1552k 102k 1322k
> 16 1480k 101k 1346k
>
> Vitaly Mayatskikh (1):
> vhost: add per-vq worker thread
>
> drivers/vhost/vhost.c | 123 +++++++++++++++++++++++++++++++-----------
> drivers/vhost/vhost.h | 11 +++-
> 2 files changed, 100 insertions(+), 34 deletions(-)
>
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* [tip:x86/urgent] x86/hyper-v: Enable PIT shutdown quirk
From: tip-bot for Michael Kelley @ 2018-11-04 10:10 UTC (permalink / raw)
To: linux-tip-commits
Cc: jgross, olaf, gregkh, daniel.lezcano, akataria, linux-kernel,
virtualization, marcelo.cerri, tglx, hpa, apw, devel, mikelley,
mingo
In-Reply-To: <1541303219-11142-3-git-send-email-mikelley@microsoft.com>
Commit-ID: 1de72c706488b7be664a601cf3843bd01e327e58
Gitweb: https://git.kernel.org/tip/1de72c706488b7be664a601cf3843bd01e327e58
Author: Michael Kelley <mikelley@microsoft.com>
AuthorDate: Sun, 4 Nov 2018 03:48:57 +0000
Committer: Thomas Gleixner <tglx@linutronix.de>
CommitDate: Sun, 4 Nov 2018 11:04:46 +0100
x86/hyper-v: Enable PIT shutdown quirk
Hyper-V emulation of the PIT has a quirk such that the normal PIT shutdown
path doesn't work, because clearing the counter register restarts the
timer.
Disable the counter clearing on PIT shutdown.
Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "gregkh@linuxfoundation.org" <gregkh@linuxfoundation.org>
Cc: "devel@linuxdriverproject.org" <devel@linuxdriverproject.org>
Cc: "daniel.lezcano@linaro.org" <daniel.lezcano@linaro.org>
Cc: "virtualization@lists.linux-foundation.org" <virtualization@lists.linux-foundation.org>
Cc: "jgross@suse.com" <jgross@suse.com>
Cc: "akataria@vmware.com" <akataria@vmware.com>
Cc: "olaf@aepfle.de" <olaf@aepfle.de>
Cc: "apw@canonical.com" <apw@canonical.com>
Cc: vkuznets <vkuznets@redhat.com>
Cc: "jasowang@redhat.com" <jasowang@redhat.com>
Cc: "marcelo.cerri@canonical.com" <marcelo.cerri@canonical.com>
Cc: KY Srinivasan <kys@microsoft.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1541303219-11142-3-git-send-email-mikelley@microsoft.com
---
arch/x86/kernel/cpu/mshyperv.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 1c72f3819eb1..e81a2db42df7 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -20,6 +20,7 @@
#include <linux/interrupt.h>
#include <linux/irq.h>
#include <linux/kexec.h>
+#include <linux/i8253.h>
#include <asm/processor.h>
#include <asm/hypervisor.h>
#include <asm/hyperv-tlfs.h>
@@ -295,6 +296,16 @@ static void __init ms_hyperv_init_platform(void)
if (efi_enabled(EFI_BOOT))
x86_platform.get_nmi_reason = hv_get_nmi_reason;
+ /*
+ * Hyper-V VMs have a PIT emulation quirk such that zeroing the
+ * counter register during PIT shutdown restarts the PIT. So it
+ * continues to interrupt @18.2 HZ. Setting i8253_clear_counter
+ * to false tells pit_shutdown() not to zero the counter so that
+ * the PIT really is shutdown. Generation 2 VMs don't have a PIT,
+ * and setting this value has no effect.
+ */
+ i8253_clear_counter_on_shutdown = false;
+
#if IS_ENABLED(CONFIG_HYPERV)
/*
* Setup the hook to get control post apic initialization.
^ permalink raw reply related
* [tip:x86/urgent] clockevents/drivers/i8253: Add support for PIT shutdown quirk
From: tip-bot for Michael Kelley @ 2018-11-04 10:09 UTC (permalink / raw)
To: linux-tip-commits
Cc: jgross, olaf, gregkh, daniel.lezcano, akataria, linux-kernel,
mikelley, apw, hpa, marcelo.cerri, devel, tglx, virtualization,
mingo
In-Reply-To: <1541303219-11142-2-git-send-email-mikelley@microsoft.com>
Commit-ID: 35b69a420bfb56b7b74cb635ea903db05e357bec
Gitweb: https://git.kernel.org/tip/35b69a420bfb56b7b74cb635ea903db05e357bec
Author: Michael Kelley <mikelley@microsoft.com>
AuthorDate: Sun, 4 Nov 2018 03:48:54 +0000
Committer: Thomas Gleixner <tglx@linutronix.de>
CommitDate: Sun, 4 Nov 2018 11:04:46 +0100
clockevents/drivers/i8253: Add support for PIT shutdown quirk
Add support for platforms where pit_shutdown() doesn't work because of a
quirk in the PIT emulation. On these platforms setting the counter register
to zero causes the PIT to start running again, negating the shutdown.
Provide a global variable that controls whether the counter register is
zero'ed, which platform specific code can override.
Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "gregkh@linuxfoundation.org" <gregkh@linuxfoundation.org>
Cc: "devel@linuxdriverproject.org" <devel@linuxdriverproject.org>
Cc: "daniel.lezcano@linaro.org" <daniel.lezcano@linaro.org>
Cc: "virtualization@lists.linux-foundation.org" <virtualization@lists.linux-foundation.org>
Cc: "jgross@suse.com" <jgross@suse.com>
Cc: "akataria@vmware.com" <akataria@vmware.com>
Cc: "olaf@aepfle.de" <olaf@aepfle.de>
Cc: "apw@canonical.com" <apw@canonical.com>
Cc: vkuznets <vkuznets@redhat.com>
Cc: "jasowang@redhat.com" <jasowang@redhat.com>
Cc: "marcelo.cerri@canonical.com" <marcelo.cerri@canonical.com>
Cc: KY Srinivasan <kys@microsoft.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1541303219-11142-2-git-send-email-mikelley@microsoft.com
---
drivers/clocksource/i8253.c | 14 ++++++++++++--
include/linux/i8253.h | 1 +
2 files changed, 13 insertions(+), 2 deletions(-)
diff --git a/drivers/clocksource/i8253.c b/drivers/clocksource/i8253.c
index 9c38895542f4..d4350bb10b83 100644
--- a/drivers/clocksource/i8253.c
+++ b/drivers/clocksource/i8253.c
@@ -20,6 +20,13 @@
DEFINE_RAW_SPINLOCK(i8253_lock);
EXPORT_SYMBOL(i8253_lock);
+/*
+ * Handle PIT quirk in pit_shutdown() where zeroing the counter register
+ * restarts the PIT, negating the shutdown. On platforms with the quirk,
+ * platform specific code can set this to false.
+ */
+bool i8253_clear_counter_on_shutdown __ro_after_init = true;
+
#ifdef CONFIG_CLKSRC_I8253
/*
* Since the PIT overflows every tick, its not very useful
@@ -109,8 +116,11 @@ static int pit_shutdown(struct clock_event_device *evt)
raw_spin_lock(&i8253_lock);
outb_p(0x30, PIT_MODE);
- outb_p(0, PIT_CH0);
- outb_p(0, PIT_CH0);
+
+ if (i8253_clear_counter_on_shutdown) {
+ outb_p(0, PIT_CH0);
+ outb_p(0, PIT_CH0);
+ }
raw_spin_unlock(&i8253_lock);
return 0;
diff --git a/include/linux/i8253.h b/include/linux/i8253.h
index e6bb36a97519..8336b2f6f834 100644
--- a/include/linux/i8253.h
+++ b/include/linux/i8253.h
@@ -21,6 +21,7 @@
#define PIT_LATCH ((PIT_TICK_RATE + HZ/2) / HZ)
extern raw_spinlock_t i8253_lock;
+extern bool i8253_clear_counter_on_shutdown;
extern struct clock_event_device i8253_clockevent;
extern void clockevent_i8253_init(bool oneshot);
^ permalink raw reply related
* Re: [PATCH 0/2] i8253: Fix PIT shutdown quirk on Hyper-V
From: Thomas Gleixner @ 2018-11-03 17:22 UTC (permalink / raw)
To: Juergen Gross
Cc: olaf@aepfle.de, daniel.lezcano@linaro.org,
linux-kernel@vger.kernel.org,
virtualization@lists.linux-foundation.org,
marcelo.cerri@canonical.com, apw@canonical.com,
akataria@vmware.com, Michael Kelley
In-Reply-To: <11f82d95-539c-b680-183b-b6c5c60468a8@suse.com>
On Fri, 2 Nov 2018, Juergen Gross wrote:
> On 01/11/2018 18:30, Michael Kelley wrote:
> > pit_shutdown() doesn't work on Hyper-V because of a quirk in the
> > PIT emulation. This problem exists in all versions of Hyper-V and
> > had not been noticed previously. When the counter register is set
> > to zero, the emulated PIT continues to interrupt @18.2 HZ.
> >
> > So add a test for running on Hyper-V, and use that test to skip
> > setting the counter register when running on Hyper-V.
> >
> > This patch replaces a previously proposed patch with a different
> > approach. This new approach follows comments from Thomas Gleixner.
>
> Did you consider using a static_key instead? You could set it in
> ms_hyperv_init_platform(). This would enable you to support future
> Hyper-V versions which don't require avoiding to set the count to zero.
Duh. Now that you say it it's more than obvious. Instead of checking for
running on hyperv have a quirk check in that function and set it from
hyperv. Not necessarily a static key required as this is not a fast path,
so a simple ro_after_init marked variable is good enough
Michael, sorry for guiding you down the wrong road. Juergens idea is much
better.
If you redo that, could you please make sure, that your mail series is
properly threaded? i.e. the 1..n/n mails contain a
Reference: <message_id_of_0/n>
Thanks,
tglx
^ permalink raw reply
* Re: [PATCH 1/1] Add vhost_blk driver
From: kbuild test robot @ 2018-11-03 2:50 UTC (permalink / raw)
Cc: Vitaly Mayatskikh, kvm, Michael S . Tsirkin, linux-kernel,
virtualization, kbuild-all, Paolo Bonzini
In-Reply-To: <20181102182123.29420-2-v.mayatskih@gmail.com>
[-- Attachment #1: Type: text/plain, Size: 5345 bytes --]
Hi Vitaly,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on vhost/linux-next]
[also build test ERROR on v4.19 next-20181102]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Vitaly-Mayatskikh/vhost-add-vhost_blk-driver/20181103-084141
base: https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git linux-next
config: mips-allyesconfig (attached as .config)
compiler: mips-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=7.2.0 make.cross ARCH=mips
All error/warnings (new ones prefixed by >>):
drivers/vhost/blk.c: In function 'vhost_blk_iocb_complete':
>> drivers/vhost/blk.c:129:2: error: implicit declaration of function 'vhost_vq_work_queue'; did you mean 'vhost_work_queue'? [-Werror=implicit-function-declaration]
vhost_vq_work_queue(&req->q->vq, &req->q->w);
^~~~~~~~~~~~~~~~~~~
vhost_work_queue
In file included from include/linux/kernel.h:14:0,
from include/linux/list.h:9,
from include/linux/module.h:9,
from drivers/vhost/blk.c:11:
drivers/vhost/blk.c: In function 'vhost_blk_req_handle':
>> drivers/vhost/blk.c:153:12: warning: format '%ld' expects argument of type 'long int', but argument 8 has type 'ssize_t {aka int}' [-Wformat=]
pr_debug("%s: [pid:%d %s] %s sector %lld, len %ld\n",
^
include/linux/printk.h:292:21: note: in definition of macro 'pr_fmt'
#define pr_fmt(fmt) fmt
^~~
include/linux/printk.h:340:2: note: in expansion of macro 'dynamic_pr_debug'
dynamic_pr_debug(fmt, ##__VA_ARGS__)
^~~~~~~~~~~~~~~~
>> drivers/vhost/blk.c:153:3: note: in expansion of macro 'pr_debug'
pr_debug("%s: [pid:%d %s] %s sector %lld, len %ld\n",
^~~~~~~~
cc1: some warnings being treated as errors
vim +129 drivers/vhost/blk.c
118
119 static void vhost_blk_iocb_complete(struct kiocb *iocb, long ret, long ret2)
120 {
121 struct vhost_blk_req *req = container_of(iocb, struct vhost_blk_req,
122 iocb);
123
124 pr_debug("%s vq[%d] req->index %d ret %ld ret2 %ld\n", __func__,
125 req->q->index, req->index, ret, ret2);
126
127 req->res = (ret == req->len) ? VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR;
128 llist_add(&req->list, &req->q->wl);
> 129 vhost_vq_work_queue(&req->q->vq, &req->q->w);
130 }
131
132 static int vhost_blk_req_handle(struct vhost_blk_req *req)
133 {
134 struct vhost_blk *blk = req->q->blk;
135 struct vhost_virtqueue *vq = &req->q->vq;
136 int type = le32_to_cpu(req->hdr.type);
137 int ret;
138 u8 status;
139
140 if ((type == VIRTIO_BLK_T_IN) || (type == VIRTIO_BLK_T_OUT)) {
141 bool write = (type == VIRTIO_BLK_T_OUT);
142 int nr_seg = (write ? req->out_num : req->in_num) - 1;
143 unsigned long sector = le64_to_cpu(req->hdr.sector);
144 ssize_t len, rem_len;
145
146 if (!req->q->blk->backend) {
147 vq_err(vq, "blk %p no backend!\n", req->q->blk);
148 ret = -EINVAL;
149 goto out_err;
150 }
151
152 len = iov_length(&vq->iov[1], nr_seg);
> 153 pr_debug("%s: [pid:%d %s] %s sector %lld, len %ld\n",
154 __func__, current->pid, current->comm,
155 write ? "WRITE" : "READ", req->hdr.sector, len);
156
157 req->len = len;
158 rem_len = len;
159 iov_iter_init(&req->i, (write ? WRITE : READ),
160 write ? &req->out_iov[0] : &req->in_iov[0],
161 nr_seg, len);
162
163 req->iocb.ki_pos = sector << 9;
164 req->iocb.ki_filp = blk->backend;
165 req->iocb.ki_complete = vhost_blk_iocb_complete;
166 req->iocb.ki_flags = IOCB_DIRECT;
167
168 if (write)
169 ret = call_write_iter(blk->backend, &req->iocb,
170 &req->i);
171 else
172 ret = call_read_iter(blk->backend, &req->iocb,
173 &req->i);
174
175 if (ret != -EIOCBQUEUED)
176 vhost_blk_iocb_complete(&req->iocb, ret, 0);
177
178 ret = 0;
179 goto out;
180 }
181
182 if (type == VIRTIO_BLK_T_GET_ID) {
183 char s[] = "vhost_blk";
184 size_t len = min_t(size_t, req->in_iov[0].iov_len,
185 strlen(s));
186
187 ret = copy_to_user(req->in_iov[0].iov_base, s, len);
188 status = ret ? VIRTIO_BLK_S_IOERR : VIRTIO_BLK_S_OK;
189 if (put_user(status, (unsigned char __user *)req->status)) {
190 ret = -EFAULT;
191 goto out_err;
192 }
193 vhost_add_used_and_signal(&blk->dev, vq, req->index, 1);
194 ret = 0;
195 goto out;
196 } else {
197 pr_warn("Unsupported request type %d\n", type);
198 vhost_discard_vq_desc(vq, 1);
199 ret = -EINVAL;
200 return ret;
201 }
202 out_err:
203 vhost_discard_vq_desc(vq, 1);
204 out:
205 return ret;
206 }
207
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 57563 bytes --]
[-- Attachment #3: Type: text/plain, Size: 183 bytes --]
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH 1/1] Add vhost_blk driver
From: Michael S. Tsirkin @ 2018-11-02 20:38 UTC (permalink / raw)
To: Vitaly Mayatskih; +Cc: Paolo Bonzini, linux-kernel, kvm, virtualization
In-Reply-To: <CAGF4SLj0TtW_kP935UdqswjT9qaznh_BR7sxW2ALozy7yTJ8vg@mail.gmail.com>
On Fri, Nov 02, 2018 at 03:24:13PM -0400, Vitaly Mayatskih wrote:
> On Fri, Nov 2, 2018 at 2:36 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> > > + if (type == VIRTIO_BLK_T_GET_ID) {
> > > + char s[] = "vhost_blk";
> >
> > Isn't this supposed to return the serial #?
>
> Yes, that gets a bit tricky here... Disk serial number is specified in
> QEMU command line parameters, so it needs to be passed to vhost_blk
> when qemu attaches the disk backend. That can be done (now? in a
> following incremental patch?).
I'add an ioctl now.
> Also other vhost work does the same with GET_ID, if that's matter.
>
> I'll fix the rest, thanks for review!
>
> --
> wbr, Vitaly
^ permalink raw reply
* Re: [PULL] vhost: cleanups and fixes
From: Al Viro @ 2018-11-02 19:01 UTC (permalink / raw)
To: Linus Torvalds
Cc: mark.rutland, lenaic, mhocko, Kees Cook, kvm, mst, netdev,
liang.z.li, Linux Kernel Mailing List, virtualization, stefanha,
joe, Andrew Morton, mhocko, bijan.mottahedeh
In-Reply-To: <CAHk-=wj5ZnqmpDuXBCcND8nMNitNyRf_4KQSzSUfQvX2-wOYsg@mail.gmail.com>
On Fri, Nov 02, 2018 at 10:15:56AM -0700, Linus Torvalds wrote:
> On Fri, Nov 2, 2018 at 10:10 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Don't you take over the VM with "use_mm()" when you do the copies? So
> > yes, it's a kernel thread, but it has a user VM, and though that
> > should have the user limits.
>
> Oooh. *Just* as I sent this, I realized that "use_mm()" doesn't update
> the thread addr_limit.
>
> That actually looks like a bug to me - although one that you've
> apparently been aware of and worked around.
>
> Wouldn't it be nicer to just make "use_mm()" do
>
> set_fs(USER_DS);
>
> instead? And undo it on unuse_mm()?
>
> And, in fact, maybe we should default kernel threads to have a zero
> address limit, so that they can't do any user accesses at all without
> doing this?
Try it and watch it fail to set initramfs up, let alone exec the init...
> Adding Al to the cc, because I think he's been looking at set_fs() in general.
It would be the right thing (with return to KERNEL_DS), but I'm not certain
if GPU users will survive - these two
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h:157: use_mm(mmptr); \
drivers/gpu/drm/i915/gvt/kvmgt.c:1799: use_mm(kvm->mm);
I don't understand the call chains there (especially for the first one) well
enough to tell.
^ permalink raw reply
* Re: [PATCH 1/1] Add vhost_blk driver
From: Michael S. Tsirkin @ 2018-11-02 18:36 UTC (permalink / raw)
To: Vitaly Mayatskikh; +Cc: Paolo Bonzini, linux-kernel, kvm, virtualization
In-Reply-To: <20181102182123.29420-2-v.mayatskih@gmail.com>
On Fri, Nov 02, 2018 at 06:21:23PM +0000, Vitaly Mayatskikh wrote:
> This driver accelerates host side of virtio-blk.
>
> Signed-off-by: Vitaly Mayatskikh <v.mayatskih@gmail.com>
> ---
> drivers/vhost/Kconfig | 13 ++
> drivers/vhost/Makefile | 3 +
> drivers/vhost/blk.c | 510 +++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 526 insertions(+)
> create mode 100644 drivers/vhost/blk.c
>
> diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
> index b580885243f7..c4980d6af0ea 100644
> --- a/drivers/vhost/Kconfig
> +++ b/drivers/vhost/Kconfig
> @@ -53,3 +53,16 @@ config VHOST_CROSS_ENDIAN_LEGACY
> adds some overhead, it is disabled by default.
>
> If unsure, say "N".
> +
> +config VHOST_BLK
> + tristate "Host kernel accelerator for virtio blk (EXPERIMENTAL)"
> + depends on BLOCK && EVENTFD
> + select VHOST
> + default n
> + help
> + This kernel module can be loaded in host kernel to accelerate
> + guest block with virtio_blk. Not to be confused with virtio_blk
> + module itself which needs to be loaded in guest kernel.
> +
> + To compile this driver as a module, choose M here: the module will
> + be called vhost_blk.
> diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
> index 6c6df24f770c..c8be36cd9214 100644
> --- a/drivers/vhost/Makefile
> +++ b/drivers/vhost/Makefile
> @@ -8,6 +8,9 @@ vhost_scsi-y := scsi.o
> obj-$(CONFIG_VHOST_VSOCK) += vhost_vsock.o
> vhost_vsock-y := vsock.o
>
> +obj-$(CONFIG_VHOST_BLK) += vhost_blk.o
> +vhost_blk-y := blk.o
> +
> obj-$(CONFIG_VHOST_RING) += vringh.o
>
> obj-$(CONFIG_VHOST) += vhost.o
> diff --git a/drivers/vhost/blk.c b/drivers/vhost/blk.c
> new file mode 100644
> index 000000000000..aefb9a61fa0f
> --- /dev/null
> +++ b/drivers/vhost/blk.c
> @@ -0,0 +1,510 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2018 IBM Corporation
> + * Author: Vitaly Mayatskikh <v.mayatskih@gmail.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.
> + *
> + * virtio-blk server in host kernel.
> + */
> +
> +#include <linux/module.h>
> +#include <linux/miscdevice.h>
> +#include <linux/virtio_blk.h>
> +#include <linux/vhost.h>
> +#include <linux/fs.h>
> +#include "vhost.h"
> +
> +enum {
> + VHOST_BLK_FEATURES =
> + VHOST_FEATURES |
> + (1ULL << VIRTIO_F_IOMMU_PLATFORM) |
> + (1ULL << VIRTIO_RING_F_INDIRECT_DESC) |
> + (1ULL << VIRTIO_RING_F_EVENT_IDX) |
> + (1ULL << VIRTIO_BLK_F_MQ)
> +};
> +
> +#define VHOST_BLK_SET_BACKEND _IOW(VHOST_VIRTIO, 0x50, int)
> +
> +enum {
> + VHOST_BLK_VQ_MAX = 16,
> + VHOST_BLK_VQ_MAX_REQS = 128,
> +};
> +
> +struct vhost_blk_req {
> + struct llist_node list;
> + int index;
> + struct vhost_blk_queue *q;
> + struct virtio_blk_outhdr hdr;
> + struct iovec *out_iov;
> + struct iovec *in_iov;
> + u8 out_num;
> + u8 in_num;
> + long len;
> + struct kiocb iocb;
> + struct iov_iter i;
> + int res;
> + void __user *status;
> +};
> +
> +struct vhost_blk_queue {
> + int index;
> + struct vhost_blk *blk;
> + struct vhost_virtqueue vq;
> + struct vhost_work w;
> + struct llist_head wl;
> + struct vhost_blk_req req[VHOST_BLK_VQ_MAX_REQS];
> +};
> +
> +struct vhost_blk {
> + struct vhost_dev dev;
> + struct file *backend;
> + int num_queues;
> + struct vhost_virtqueue *vqs[VHOST_BLK_VQ_MAX];
> + struct vhost_blk_queue queue[VHOST_BLK_VQ_MAX];
> +};
> +
> +static void vhost_blk_flush(struct vhost_blk *blk)
> +{
> + int i;
> +
> + for (i = 0; i < blk->num_queues; i++)
> + vhost_poll_flush(&blk->queue[i].vq.poll);
> +}
> +
> +
> +static void vhost_blk_stop(struct vhost_blk *blk)
> +{
> + struct vhost_virtqueue *vq;
> + int i;
> +
> + for (i = 0; i < blk->num_queues; i++) {
> + vq = &blk->queue[i].vq;
> + mutex_lock(&vq->mutex);
> + rcu_assign_pointer(vq->private_data, NULL);
> + mutex_unlock(&vq->mutex);
> + }
> +}
> +
> +static int vhost_blk_req_done(struct vhost_blk_req *req, unsigned char status)
> +{
> + int ret;
> + int len = req->len;
> +
> + pr_debug("%s vq[%d] req->index %d status %d len %d\n", __func__,
> + req->q->index, req->index, status, len);
> + ret = put_user(status, (unsigned char __user *)req->status);
I'd make it u8 and not unsigned char. Also why don't you change
req->status type so you don't need a cast?
> +
> + WARN(ret, "%s: vq[%d] req->index %d failed to write status\n", __func__,
> + req->q->index, req->index);
kernel warnings and debug messages that are guest-triggerable lead to
disk full errors on the host. applies elsewhere too. you want traces
instead.
> +
> + vhost_add_used(&req->q->vq, req->index, len);
this can fail too.
> +
> + return ret;
> +}
> +
> +static void vhost_blk_io_done_work(struct vhost_work *w)
> +{
> + struct vhost_blk_queue *q = container_of(w, struct vhost_blk_queue, w);
> + struct llist_node *node;
> + struct vhost_blk_req *req, *tmp;
> +
> + node = llist_del_all(&q->wl);
> + llist_for_each_entry_safe(req, tmp, node, list) {
> + vhost_blk_req_done(req, req->res);
> + }
> + vhost_signal(&q->blk->dev, &q->vq);
> +}
> +
> +static void vhost_blk_iocb_complete(struct kiocb *iocb, long ret, long ret2)
> +{
> + struct vhost_blk_req *req = container_of(iocb, struct vhost_blk_req,
> + iocb);
> +
> + pr_debug("%s vq[%d] req->index %d ret %ld ret2 %ld\n", __func__,
> + req->q->index, req->index, ret, ret2);
> +
> + req->res = (ret == req->len) ? VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR;
> + llist_add(&req->list, &req->q->wl);
> + vhost_vq_work_queue(&req->q->vq, &req->q->w);
> +}
> +
> +static int vhost_blk_req_handle(struct vhost_blk_req *req)
> +{
> + struct vhost_blk *blk = req->q->blk;
> + struct vhost_virtqueue *vq = &req->q->vq;
> + int type = le32_to_cpu(req->hdr.type);
> + int ret;
> + u8 status;
> +
> + if ((type == VIRTIO_BLK_T_IN) || (type == VIRTIO_BLK_T_OUT)) {
> + bool write = (type == VIRTIO_BLK_T_OUT);
> + int nr_seg = (write ? req->out_num : req->in_num) - 1;
> + unsigned long sector = le64_to_cpu(req->hdr.sector);
> + ssize_t len, rem_len;
> +
> + if (!req->q->blk->backend) {
> + vq_err(vq, "blk %p no backend!\n", req->q->blk);
> + ret = -EINVAL;
> + goto out_err;
> + }
> +
> + len = iov_length(&vq->iov[1], nr_seg);
> + pr_debug("%s: [pid:%d %s] %s sector %lld, len %ld\n",
> + __func__, current->pid, current->comm,
> + write ? "WRITE" : "READ", req->hdr.sector, len);
> +
> + req->len = len;
> + rem_len = len;
> + iov_iter_init(&req->i, (write ? WRITE : READ),
> + write ? &req->out_iov[0] : &req->in_iov[0],
> + nr_seg, len);
> +
> + req->iocb.ki_pos = sector << 9;
> + req->iocb.ki_filp = blk->backend;
> + req->iocb.ki_complete = vhost_blk_iocb_complete;
> + req->iocb.ki_flags = IOCB_DIRECT;
> +
> + if (write)
> + ret = call_write_iter(blk->backend, &req->iocb,
> + &req->i);
> + else
> + ret = call_read_iter(blk->backend, &req->iocb,
> + &req->i);
> +
> + if (ret != -EIOCBQUEUED)
> + vhost_blk_iocb_complete(&req->iocb, ret, 0);
> +
> + ret = 0;
> + goto out;
> + }
> +
> + if (type == VIRTIO_BLK_T_GET_ID) {
> + char s[] = "vhost_blk";
Isn't this supposed to return the serial #?
> + size_t len = min_t(size_t, req->in_iov[0].iov_len,
> + strlen(s));
> +
> + ret = copy_to_user(req->in_iov[0].iov_base, s, len);
I don't think we should assume there's no scatter list here.
> + status = ret ? VIRTIO_BLK_S_IOERR : VIRTIO_BLK_S_OK;
> + if (put_user(status, (unsigned char __user *)req->status)) {
> + ret = -EFAULT;
> + goto out_err;
> + }
> + vhost_add_used_and_signal(&blk->dev, vq, req->index, 1);
> + ret = 0;
> + goto out;
> + } else {
> + pr_warn("Unsupported request type %d\n", type);
> + vhost_discard_vq_desc(vq, 1);
> + ret = -EINVAL;
> + return ret;
> + }
> +out_err:
> + vhost_discard_vq_desc(vq, 1);
> +out:
> + return ret;
> +}
> +
> +static void vhost_blk_handle_guest_kick(struct vhost_work *work)
> +{
> + struct vhost_virtqueue *vq;
> + struct vhost_blk_queue *q;
> + struct vhost_blk *blk;
> + struct vhost_blk_req *req;
> + int in, out;
> + int head;
> +
> + vq = container_of(work, struct vhost_virtqueue, poll.work);
> + q = container_of(vq, struct vhost_blk_queue, vq);
> + blk = container_of(vq->dev, struct vhost_blk, dev);
> +
> + vhost_disable_notify(&blk->dev, vq);
> + for (;;) {
> + in = out = -1;
> +
> + head = vhost_get_vq_desc(vq, vq->iov,
> + ARRAY_SIZE(vq->iov),
> + &out, &in, NULL, NULL);
> +
> + if (head < 0)
> + break;
> +
> + if (head == vq->num) {
> + if (vhost_enable_notify(&blk->dev, vq)) {
> + vhost_disable_notify(&blk->dev, vq);
> + continue;
> + }
> + break;
> + }
> +
> + req = &q->req[head];
> + req->index = head;
> + req->out_num = out;
> + req->in_num = in;
> + req->out_iov = &vq->iov[1];
> + req->in_iov = &vq->iov[out];
> + req->status = vq->iov[out + in - 1].iov_base;
Shouldn't we validate that there's actually an in?
> +
> + if (copy_from_user(&req->hdr, vq->iov[0].iov_base,
> + sizeof(req->hdr))) {
> + vq_err(vq, "Failed to get block header!\n");
> + vhost_discard_vq_desc(vq, 1);
> + continue;
> + }
It's better to avoid assuming that header is in a single iov entry,
use an iterator.
> + if (vhost_blk_req_handle(req) < 0)
> + break;
> + }
> +}
> +
> +static int vhost_blk_open(struct inode *inode, struct file *file)
> +{
> + struct vhost_blk *blk;
> + struct vhost_blk_queue *q;
> + int i, j;
> +
> + blk = kvzalloc(sizeof(*blk), GFP_KERNEL);
> + if (!blk)
> + return -ENOMEM;
> +
> + for (i = 0; i < VHOST_BLK_VQ_MAX; i++) {
> + q = &blk->queue[i];
> + q->index = i;
> + q->blk = blk;
> + q->vq.handle_kick = vhost_blk_handle_guest_kick;
> + vhost_work_init(&q->w, vhost_blk_io_done_work);
> + blk->vqs[i] = &q->vq;
> + for (j = 0; j < VHOST_BLK_VQ_MAX_REQS; j++) {
> + q->req[j].index = j;
> + q->req[j].q = q;
> + }
> + }
> + vhost_dev_init(&blk->dev, (struct vhost_virtqueue **)&blk->vqs,
> + VHOST_BLK_VQ_MAX);
> + file->private_data = blk;
> +
> + return 0;
> +}
> +
> +static int vhost_blk_release(struct inode *inode, struct file *f)
> +{
> + struct vhost_blk *blk = f->private_data;
> +
> + vhost_blk_stop(blk);
> + mutex_lock(&blk->dev.mutex);
> + vhost_blk_flush(blk);
> + vhost_dev_stop(&blk->dev);
> + vhost_dev_cleanup(&blk->dev);
> + vhost_blk_flush(blk);
> +
> + if (blk->backend) {
> + fput(blk->backend);
> + blk->backend = NULL;
> + }
> +
> + mutex_unlock(&blk->dev.mutex);
> + kvfree(blk);
> +
> + return 0;
> +}
> +
> +static int vhost_blk_set_features(struct vhost_blk *blk, u64 features)
> +{
> + int i;
> + int ret = -EFAULT;
> +
> + mutex_lock(&blk->dev.mutex);
> + if ((features & (1 << VHOST_F_LOG_ALL)) &&
> + !vhost_log_access_ok(&blk->dev))
> + goto out_unlock;
> +
> + if ((features & (1ULL << VIRTIO_F_IOMMU_PLATFORM))) {
> + if (vhost_init_device_iotlb(&blk->dev, true))
> + goto out_unlock;
> + }
> +
> + for (i = 0; i < VHOST_BLK_VQ_MAX; ++i) {
> + struct vhost_virtqueue *vq = blk->vqs[i];
> +
> + mutex_lock(&vq->mutex);
> + vq->acked_features = features & VHOST_BLK_FEATURES;
> + mutex_unlock(&vq->mutex);
> + }
> + ret = 0;
> +out_unlock:
> + mutex_unlock(&blk->dev.mutex);
> +
> + return ret;
> +}
> +
> +static long vhost_blk_reset_owner(struct vhost_blk *blk)
> +{
> + long err;
> + struct vhost_umem *umem;
> +
> + mutex_lock(&blk->dev.mutex);
> + err = vhost_dev_check_owner(&blk->dev);
> + if (err)
> + goto done;
> + umem = vhost_dev_reset_owner_prepare();
> + if (!umem) {
> + err = -ENOMEM;
> + goto done;
> + }
> + vhost_blk_stop(blk);
> + vhost_blk_flush(blk);
> + vhost_dev_reset_owner(&blk->dev, umem);
> +done:
> + mutex_unlock(&blk->dev.mutex);
> + return err;
> +}
> +
> +static long vhost_blk_set_backend(struct vhost_blk *blk, int fd)
> +{
> + struct file *backend;
> + int ret, i;
> + struct vhost_virtqueue *vq;
> +
> + mutex_lock(&blk->dev.mutex);
> + ret = vhost_dev_check_owner(&blk->dev);
> + if (ret)
> + goto out_dev;
> +
> + backend = fget(fd);
> + if (IS_ERR(backend)) {
> + ret = PTR_ERR(backend);
> + goto out_dev;
> + }
> +
> + if (backend == blk->backend) {
> + ret = 0;
> + goto out_file;
> + }
> +
> + if (blk->backend)
> + fput(blk->backend);
> + blk->backend = backend;
> + for (i = 0; i < blk->num_queues; i++) {
> + vq = &blk->queue[i].vq;
> + if (!vhost_vq_access_ok(vq)) {
> + ret = -EFAULT;
> + goto out_file;
> + }
> + mutex_lock(&vq->mutex);
> + rcu_assign_pointer(vq->private_data, backend);
> + ret = vhost_vq_init_access(vq);
> + mutex_unlock(&vq->mutex);
> + if (ret) {
> + pr_err("vhost_vq_init_access failed: %d\n", ret);
> + goto out_file;
> + }
> +
> + }
> + ret = 0;
> + goto out_dev;
> +out_file:
> + fput(backend);
> + blk->backend = NULL;
> +out_dev:
> + mutex_unlock(&blk->dev.mutex);
> + vhost_blk_flush(blk);
> + return ret;
> +}
> +
> +static long vhost_blk_pass_ioctl(struct vhost_blk *blk, unsigned int ioctl,
> + void __user *argp)
> +{
> + long ret;
> +
> + mutex_lock(&blk->dev.mutex);
> + ret = vhost_dev_ioctl(&blk->dev, ioctl, argp);
> + if (ret == -ENOIOCTLCMD)
> + ret = vhost_vring_ioctl(&blk->dev, ioctl, argp);
> + else
> + vhost_blk_flush(blk);
> + mutex_unlock(&blk->dev.mutex);
> + return ret;
> +}
> +
> +static long vhost_blk_ioctl(struct file *f, unsigned int ioctl,
> + unsigned long arg)
> +{
> + struct vhost_blk *blk = f->private_data;
> + void __user *argp = (void __user *)arg;
> + int fd;
> + u64 __user *featurep = argp;
> + u64 features;
> + long ret;
> + struct vhost_vring_state s;
> +
> + switch (ioctl) {
> + case VHOST_SET_MEM_TABLE:
> + vhost_blk_stop(blk);
> + ret = vhost_blk_pass_ioctl(blk, ioctl, argp);
> + break;
> + case VHOST_SET_VRING_NUM:
> + if (copy_from_user(&s, argp, sizeof(s)))
> + return -EFAULT;
> + ret = vhost_blk_pass_ioctl(blk, ioctl, argp);
> + if (!ret)
> + blk->num_queues = s.index + 1;
> + break;
> + case VHOST_BLK_SET_BACKEND:
> + if (copy_from_user(&fd, argp, sizeof(fd)))
> + return -EFAULT;
> + ret = vhost_blk_set_backend(blk, fd);
> + break;
> + case VHOST_GET_FEATURES:
> + features = VHOST_BLK_FEATURES;
> + if (copy_to_user(featurep, &features, sizeof(features)))
> + return -EFAULT;
> + ret = 0;
> + break;
> + case VHOST_SET_FEATURES:
> + if (copy_from_user(&features, featurep, sizeof(features)))
> + return -EFAULT;
> + if (features & ~VHOST_BLK_FEATURES)
> + return -EOPNOTSUPP;
> + ret = vhost_blk_set_features(blk, features);
> + break;
> + case VHOST_RESET_OWNER:
> + ret = vhost_blk_reset_owner(blk);
> + break;
> + default:
> + ret = vhost_blk_pass_ioctl(blk, ioctl, argp);
> + break;
> + }
> + return ret;
> +}
> +
> +static const struct file_operations vhost_blk_fops = {
> + .owner = THIS_MODULE,
> + .open = vhost_blk_open,
> + .release = vhost_blk_release,
> + .llseek = noop_llseek,
> + .unlocked_ioctl = vhost_blk_ioctl,
> +};
> +
> +static struct miscdevice vhost_blk_misc = {
> + MISC_DYNAMIC_MINOR,
> + "vhost-blk",
> + &vhost_blk_fops,
> +};
> +
> +static int vhost_blk_init(void)
> +{
> + return misc_register(&vhost_blk_misc);
> +}
> +module_init(vhost_blk_init);
> +
> +static void vhost_blk_exit(void)
> +{
> + misc_deregister(&vhost_blk_misc);
> +}
> +
> +module_exit(vhost_blk_exit);
> +
> +MODULE_VERSION("1.0");
> +MODULE_LICENSE("GPL v2");
> +MODULE_AUTHOR("Vitaly Mayatskikh");
> +MODULE_DESCRIPTION("Host kernel accelerator for virtio blk");
> +MODULE_ALIAS("devname:vhost-blk");
> --
> 2.17.1
^ permalink raw reply
* [PATCH 4.14 091/143] x86/paravirt: Fix some warning messages
From: Greg Kroah-Hartman @ 2018-11-02 18:34 UTC (permalink / raw)
To: linux-kernel
Cc: Juergen Gross, Sasha Levin, Peter Zijlstra, Greg Kroah-Hartman,
kernel-janitors, stable, virtualization, H. Peter Anvin,
Thomas Gleixner, Alok Kataria, Dan Carpenter
In-Reply-To: <20181102182857.064326086@linuxfoundation.org>
4.14-stable review patch. If anyone has any objections, please let me know.
------------------
[ Upstream commit 571d0563c8881595f4ab027aef9ed1c55e3e7b7c ]
The first argument to WARN_ONCE() is a condition.
Fixes: 5800dc5c19f3 ("x86/paravirt: Fix spectre-v2 mitigations for paravirt guests")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alok Kataria <akataria@vmware.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: virtualization@lists.linux-foundation.org
Cc: kernel-janitors@vger.kernel.org
Link: https://lkml.kernel.org/r/20180919103553.GD9238@mwanda
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
arch/x86/kernel/paravirt.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index f3559b84cd75..04da826381c9 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -90,7 +90,7 @@ unsigned paravirt_patch_call(void *insnbuf,
if (len < 5) {
#ifdef CONFIG_RETPOLINE
- WARN_ONCE("Failing to patch indirect CALL in %ps\n", (void *)addr);
+ WARN_ONCE(1, "Failing to patch indirect CALL in %ps\n", (void *)addr);
#endif
return len; /* call too long for patch site */
}
@@ -110,7 +110,7 @@ unsigned paravirt_patch_jmp(void *insnbuf, const void *target,
if (len < 5) {
#ifdef CONFIG_RETPOLINE
- WARN_ONCE("Failing to patch indirect JMP in %ps\n", (void *)addr);
+ WARN_ONCE(1, "Failing to patch indirect JMP in %ps\n", (void *)addr);
#endif
return len; /* call too long for patch site */
}
--
2.17.1
^ permalink raw reply related
* [PATCH 4.18 073/150] x86/paravirt: Fix some warning messages
From: Greg Kroah-Hartman @ 2018-11-02 18:33 UTC (permalink / raw)
To: linux-kernel
Cc: Juergen Gross, Sasha Levin, Peter Zijlstra, Greg Kroah-Hartman,
kernel-janitors, stable, virtualization, H. Peter Anvin,
Thomas Gleixner, Alok Kataria, Dan Carpenter
In-Reply-To: <20181102182902.250560510@linuxfoundation.org>
4.18-stable review patch. If anyone has any objections, please let me know.
------------------
[ Upstream commit 571d0563c8881595f4ab027aef9ed1c55e3e7b7c ]
The first argument to WARN_ONCE() is a condition.
Fixes: 5800dc5c19f3 ("x86/paravirt: Fix spectre-v2 mitigations for paravirt guests")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alok Kataria <akataria@vmware.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: virtualization@lists.linux-foundation.org
Cc: kernel-janitors@vger.kernel.org
Link: https://lkml.kernel.org/r/20180919103553.GD9238@mwanda
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
arch/x86/kernel/paravirt.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 930c88341e4e..1fbf38dde84c 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -90,7 +90,7 @@ unsigned paravirt_patch_call(void *insnbuf,
if (len < 5) {
#ifdef CONFIG_RETPOLINE
- WARN_ONCE("Failing to patch indirect CALL in %ps\n", (void *)addr);
+ WARN_ONCE(1, "Failing to patch indirect CALL in %ps\n", (void *)addr);
#endif
return len; /* call too long for patch site */
}
@@ -110,7 +110,7 @@ unsigned paravirt_patch_jmp(void *insnbuf, const void *target,
if (len < 5) {
#ifdef CONFIG_RETPOLINE
- WARN_ONCE("Failing to patch indirect JMP in %ps\n", (void *)addr);
+ WARN_ONCE(1, "Failing to patch indirect JMP in %ps\n", (void *)addr);
#endif
return len; /* call too long for patch site */
}
--
2.17.1
^ permalink raw reply related
* Re: [PATCH 0/1] vhost: add vhost_blk driver
From: Michael S. Tsirkin @ 2018-11-02 18:26 UTC (permalink / raw)
To: Vitaly Mayatskikh; +Cc: Paolo Bonzini, linux-kernel, kvm, virtualization
In-Reply-To: <20181102182123.29420-1-v.mayatskih@gmail.com>
On Fri, Nov 02, 2018 at 06:21:22PM +0000, Vitaly Mayatskikh wrote:
> vhost_blk is a host-side kernel mode accelerator for virtio-blk. The
> driver allows VM to reach a near bare-metal disk performance. See IOPS
> numbers below (fio --rw=randread --bs=4k).
>
> This implementation uses kiocb interface. It is slightly slower than
> going directly through bio, but is simpler and also works with disk
> images placed on a file system.
>
> # fio num-jobs
> # A: bare metal over block
> # B: bare metal over file
> # C: virtio-blk over block
> # D: virtio-blk over file
> # E: vhost-blk bio over block
> # F: vhost-blk kiocb over block
> # G: vhost-blk kiocb over file
> #
> # A B C D E F G
>
> 1 171k 151k 148k 151k 195k 187k 175k
> 2 328k 302k 249k 241k 349k 334k 296k
> 3 479k 437k 179k 174k 501k 464k 404k
> 4 622k 568k 143k 183k 620k 580k 492k
> 5 755k 697k 136k 128k 737k 693k 579k
> 6 887k 808k 131k 120k 830k 782k 640k
> 7 1004k 926k 126k 131k 926k 863k 693k
> 8 1099k 1015k 117k 115k 1001k 931k 712k
> 9 1194k 1119k 115k 111k 1055k 991k 711k
> 10 1278k 1207k 109k 114k 1130k 1046k 695k
> 11 1345k 1280k 110k 108k 1119k 1091k 663k
> 12 1411k 1356k 104k 106k 1201k 1142k 629k
> 13 1466k 1423k 106k 106k 1260k 1170k 607k
> 14 1517k 1486k 103k 106k 1296k 1179k 589k
> 15 1552k 1543k 102k 102k 1322k 1191k 571k
> 16 1480k 1506k 101k 102k 1346k 1202k 566k
>
> Vitaly Mayatskikh (1):
> Add vhost_blk driver
Thanks!
Before merging this, I'd like to get some acks from userspace that it's
actually going to be used - e.g. QEMU block maintainers.
> drivers/vhost/Kconfig | 13 ++
> drivers/vhost/Makefile | 3 +
> drivers/vhost/blk.c | 510 +++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 526 insertions(+)
> create mode 100644 drivers/vhost/blk.c
>
> --
> 2.17.1
^ permalink raw reply
* Re: [PATCH v9] virtio_blk: add discard and write zeroes support
From: Daniel Verkamp @ 2018-11-02 18:25 UTC (permalink / raw)
To: dongli.zhang
Cc: axboe, hch, mst, virtualization, linux-block, stefanha, pbonzini,
Changpeng Liu
In-Reply-To: <378387af-25ec-f053-5204-a5a8fa65acf9@oracle.com>
Hi Dongli,
Unfortunately, I am not aware of any in-progress implementation of
this feature for qemu. It hopefully should not be too difficult to
wire up in the qemu virtio-blk model, but I haven't looked into it in
detail.
Thanks,
-- Daniel
On Thu, Nov 1, 2018 at 4:42 PM Dongli Zhang <dongli.zhang@oracle.com> wrote:
>
> Hi Daniel,
>
> Other than crosvm, is there any version of qemu (e.g., repositories developed in
> progress on github) where I can try with this feature?
>
> Thank you very much!
>
> Dongli Zhang
>
> On 11/02/2018 06:40 AM, Daniel Verkamp wrote:
> > From: Changpeng Liu <changpeng.liu@intel.com>
> >
> > In commit 88c85538, "virtio-blk: add discard and write zeroes features
> > to specification" (https://github.com/oasis-tcs/virtio-spec), the virtio
> > block specification has been extended to add VIRTIO_BLK_T_DISCARD and
> > VIRTIO_BLK_T_WRITE_ZEROES commands. This patch enables support for
> > discard and write zeroes in the virtio-blk driver when the device
> > advertises the corresponding features, VIRTIO_BLK_F_DISCARD and
> > VIRTIO_BLK_F_WRITE_ZEROES.
> >
> > Signed-off-by: Changpeng Liu <changpeng.liu@intel.com>
> > Signed-off-by: Daniel Verkamp <dverkamp@chromium.org>
> > ---
> > dverkamp: I've picked up this patch and made a few minor changes (as
> > listed below); most notably, I changed the kmalloc back to GFP_ATOMIC,
> > since it can be called from a context where sleeping is not allowed.
> > To prevent large allocations, I've also clamped the maximum number of
> > discard segments to 256; this results in a 4K allocation and should be
> > plenty of descriptors for most use cases.
> >
> > I also removed most of the description from the commit message, since it
> > was duplicating the comments from virtio_blk.h and quoting parts of the
> > spec without adding any extra information. I have tested this iteration
> > of the patch using crosvm with modifications to enable the new features:
> > https://chromium.googlesource.com/chromiumos/platform/crosvm/
> >
> > v9 fixes a number of review issues; I didn't attempt to optimize the
> > single-element write zeroes case, so it still does an allocation per
> > request (I did not see any easy place to put the payload that would
> > avoid the allocation).
> >
> > CHANGELOG:
> > v9: [dverkamp] fix LE types in discard struct; cleanups from Ming Lei
> > v8: [dverkamp] replace shifts by 9 with SECTOR_SHIFT constant
> > v7: [dverkamp] use GFP_ATOMIC for allocation that may not sleep; clarify
> > descriptor flags field; comment wording cleanups.
> > v6: don't set T_OUT bit to discard and write zeroes commands.
> > v5: use new block layer API: blk_queue_flag_set.
> > v4: several optimizations based on MST's comments, remove bit field
> > usage for command descriptor.
> > v3: define the virtio-blk protocol to add discard and write zeroes
> > support, first version implementation based on proposed specification.
> > v2: add write zeroes command support.
> > v1: initial proposal implementation for discard command.
> > ---
> > drivers/block/virtio_blk.c | 83 ++++++++++++++++++++++++++++++++-
> > include/uapi/linux/virtio_blk.h | 54 +++++++++++++++++++++
> > 2 files changed, 135 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> > index 086c6bb12baa..0f39efb4b3aa 100644
> > --- a/drivers/block/virtio_blk.c
> > +++ b/drivers/block/virtio_blk.c
> > @@ -18,6 +18,7 @@
> >
> > #define PART_BITS 4
> > #define VQ_NAME_LEN 16
> > +#define MAX_DISCARD_SEGMENTS 256u
> >
> > static int major;
> > static DEFINE_IDA(vd_index_ida);
> > @@ -172,10 +173,48 @@ static int virtblk_add_req(struct virtqueue *vq, struct virtblk_req *vbr,
> > return virtqueue_add_sgs(vq, sgs, num_out, num_in, vbr, GFP_ATOMIC);
> > }
> >
> > +static int virtblk_setup_discard_write_zeroes(struct request *req, bool unmap)
> > +{
> > + unsigned short segments = blk_rq_nr_discard_segments(req);
> > + unsigned short n = 0;
> > + struct virtio_blk_discard_write_zeroes *range;
> > + struct bio *bio;
> > + u32 flags = 0;
> > +
> > + if (unmap)
> > + flags |= VIRTIO_BLK_WRITE_ZEROES_FLAG_UNMAP;
> > +
> > + range = kmalloc_array(segments, sizeof(*range), GFP_ATOMIC);
> > + if (!range)
> > + return -ENOMEM;
> > +
> > + __rq_for_each_bio(bio, req) {
> > + u64 sector = bio->bi_iter.bi_sector;
> > + u32 num_sectors = bio->bi_iter.bi_size >> SECTOR_SHIFT;
> > +
> > + range[n].flags = cpu_to_le32(flags);
> > + range[n].num_sectors = cpu_to_le32(num_sectors);
> > + range[n].sector = cpu_to_le64(sector);
> > + n++;
> > + }
> > +
> > + req->special_vec.bv_page = virt_to_page(range);
> > + req->special_vec.bv_offset = offset_in_page(range);
> > + req->special_vec.bv_len = sizeof(*range) * segments;
> > + req->rq_flags |= RQF_SPECIAL_PAYLOAD;
> > +
> > + return 0;
> > +}
> > +
> > static inline void virtblk_request_done(struct request *req)
> > {
> > struct virtblk_req *vbr = blk_mq_rq_to_pdu(req);
> >
> > + if (req->rq_flags & RQF_SPECIAL_PAYLOAD) {
> > + kfree(page_address(req->special_vec.bv_page) +
> > + req->special_vec.bv_offset);
> > + }
> > +
> > switch (req_op(req)) {
> > case REQ_OP_SCSI_IN:
> > case REQ_OP_SCSI_OUT:
> > @@ -225,6 +264,7 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
> > int qid = hctx->queue_num;
> > int err;
> > bool notify = false;
> > + bool unmap = false;
> > u32 type;
> >
> > BUG_ON(req->nr_phys_segments + 2 > vblk->sg_elems);
> > @@ -237,6 +277,13 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
> > case REQ_OP_FLUSH:
> > type = VIRTIO_BLK_T_FLUSH;
> > break;
> > + case REQ_OP_DISCARD:
> > + type = VIRTIO_BLK_T_DISCARD;
> > + break;
> > + case REQ_OP_WRITE_ZEROES:
> > + type = VIRTIO_BLK_T_WRITE_ZEROES;
> > + unmap = !(req->cmd_flags & REQ_NOUNMAP);
> > + break;
> > case REQ_OP_SCSI_IN:
> > case REQ_OP_SCSI_OUT:
> > type = VIRTIO_BLK_T_SCSI_CMD;
> > @@ -256,6 +303,12 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
> >
> > blk_mq_start_request(req);
> >
> > + if (type == VIRTIO_BLK_T_DISCARD || type == VIRTIO_BLK_T_WRITE_ZEROES) {
> > + err = virtblk_setup_discard_write_zeroes(req, unmap);
> > + if (err)
> > + return BLK_STS_RESOURCE;
> > + }
> > +
> > num = blk_rq_map_sg(hctx->queue, req, vbr->sg);
> > if (num) {
> > if (rq_data_dir(req) == WRITE)
> > @@ -802,6 +855,32 @@ static int virtblk_probe(struct virtio_device *vdev)
> > if (!err && opt_io_size)
> > blk_queue_io_opt(q, blk_size * opt_io_size);
> >
> > + if (virtio_has_feature(vdev, VIRTIO_BLK_F_DISCARD)) {
> > + q->limits.discard_granularity = blk_size;
> > +
> > + virtio_cread(vdev, struct virtio_blk_config,
> > + discard_sector_alignment, &v);
> > + q->limits.discard_alignment = v ? v << SECTOR_SHIFT : 0;
> > +
> > + virtio_cread(vdev, struct virtio_blk_config,
> > + max_discard_sectors, &v);
> > + blk_queue_max_discard_sectors(q, v ? v : UINT_MAX);
> > +
> > + virtio_cread(vdev, struct virtio_blk_config, max_discard_seg,
> > + &v);
> > + blk_queue_max_discard_segments(q,
> > + min_not_zero(v,
> > + MAX_DISCARD_SEGMENTS));
> > +
> > + blk_queue_flag_set(QUEUE_FLAG_DISCARD, q);
> > + }
> > +
> > + if (virtio_has_feature(vdev, VIRTIO_BLK_F_WRITE_ZEROES)) {
> > + virtio_cread(vdev, struct virtio_blk_config,
> > + max_write_zeroes_sectors, &v);
> > + blk_queue_max_write_zeroes_sectors(q, v ? v : UINT_MAX);
> > + }
> > +
> > virtblk_update_capacity(vblk, false);
> > virtio_device_ready(vdev);
> >
> > @@ -895,14 +974,14 @@ static unsigned int features_legacy[] = {
> > VIRTIO_BLK_F_SCSI,
> > #endif
> > VIRTIO_BLK_F_FLUSH, VIRTIO_BLK_F_TOPOLOGY, VIRTIO_BLK_F_CONFIG_WCE,
> > - VIRTIO_BLK_F_MQ,
> > + VIRTIO_BLK_F_MQ, VIRTIO_BLK_F_DISCARD, VIRTIO_BLK_F_WRITE_ZEROES,
> > }
> > ;
> > static unsigned int features[] = {
> > VIRTIO_BLK_F_SEG_MAX, VIRTIO_BLK_F_SIZE_MAX, VIRTIO_BLK_F_GEOMETRY,
> > VIRTIO_BLK_F_RO, VIRTIO_BLK_F_BLK_SIZE,
> > VIRTIO_BLK_F_FLUSH, VIRTIO_BLK_F_TOPOLOGY, VIRTIO_BLK_F_CONFIG_WCE,
> > - VIRTIO_BLK_F_MQ,
> > + VIRTIO_BLK_F_MQ, VIRTIO_BLK_F_DISCARD, VIRTIO_BLK_F_WRITE_ZEROES,
> > };
> >
> > static struct virtio_driver virtio_blk = {
> > diff --git a/include/uapi/linux/virtio_blk.h b/include/uapi/linux/virtio_blk.h
> > index 9ebe4d968dd5..0f99d7b49ede 100644
> > --- a/include/uapi/linux/virtio_blk.h
> > +++ b/include/uapi/linux/virtio_blk.h
> > @@ -38,6 +38,8 @@
> > #define VIRTIO_BLK_F_BLK_SIZE 6 /* Block size of disk is available*/
> > #define VIRTIO_BLK_F_TOPOLOGY 10 /* Topology information is available */
> > #define VIRTIO_BLK_F_MQ 12 /* support more than one vq */
> > +#define VIRTIO_BLK_F_DISCARD 13 /* DISCARD is supported */
> > +#define VIRTIO_BLK_F_WRITE_ZEROES 14 /* WRITE ZEROES is supported */
> >
> > /* Legacy feature bits */
> > #ifndef VIRTIO_BLK_NO_LEGACY
> > @@ -86,6 +88,39 @@ struct virtio_blk_config {
> >
> > /* number of vqs, only available when VIRTIO_BLK_F_MQ is set */
> > __u16 num_queues;
> > +
> > + /* the next 3 entries are guarded by VIRTIO_BLK_F_DISCARD */
> > + /*
> > + * The maximum discard sectors (in 512-byte sectors) for
> > + * one segment.
> > + */
> > + __u32 max_discard_sectors;
> > + /*
> > + * The maximum number of discard segments in a
> > + * discard command.
> > + */
> > + __u32 max_discard_seg;
> > + /* Discard commands must be aligned to this number of sectors. */
> > + __u32 discard_sector_alignment;
> > +
> > + /* the next 3 entries are guarded by VIRTIO_BLK_F_WRITE_ZEROES */
> > + /*
> > + * The maximum number of write zeroes sectors (in 512-byte sectors) in
> > + * one segment.
> > + */
> > + __u32 max_write_zeroes_sectors;
> > + /*
> > + * The maximum number of segments in a write zeroes
> > + * command.
> > + */
> > + __u32 max_write_zeroes_seg;
> > + /*
> > + * Set if a VIRTIO_BLK_T_WRITE_ZEROES request may result in the
> > + * deallocation of one or more of the sectors.
> > + */
> > + __u8 write_zeroes_may_unmap;
> > +
> > + __u8 unused1[3];
> > } __attribute__((packed));
> >
> > /*
> > @@ -114,6 +149,12 @@ struct virtio_blk_config {
> > /* Get device ID command */
> > #define VIRTIO_BLK_T_GET_ID 8
> >
> > +/* Discard command */
> > +#define VIRTIO_BLK_T_DISCARD 11
> > +
> > +/* Write zeroes command */
> > +#define VIRTIO_BLK_T_WRITE_ZEROES 13
> > +
> > #ifndef VIRTIO_BLK_NO_LEGACY
> > /* Barrier before this op. */
> > #define VIRTIO_BLK_T_BARRIER 0x80000000
> > @@ -133,6 +174,19 @@ struct virtio_blk_outhdr {
> > __virtio64 sector;
> > };
> >
> > +/* Unmap this range (only valid for write zeroes command) */
> > +#define VIRTIO_BLK_WRITE_ZEROES_FLAG_UNMAP 0x00000001
> > +
> > +/* Discard/write zeroes range for each request. */
> > +struct virtio_blk_discard_write_zeroes {
> > + /* discard/write zeroes start sector */
> > + __le64 sector;
> > + /* number of discard/write zeroes sectors */
> > + __le32 num_sectors;
> > + /* flags for this range */
> > + __le32 flags;
> > +};
> > +
> > #ifndef VIRTIO_BLK_NO_LEGACY
> > struct virtio_scsi_inhdr {
> > __virtio32 errors;
> >
^ permalink raw reply
* Re: [PULL] vhost: cleanups and fixes
From: Michael S. Tsirkin @ 2018-11-02 18:12 UTC (permalink / raw)
To: Linus Torvalds
Cc: mark.rutland, lenaic, mhocko, Kees Cook, kvm, netdev, liang.z.li,
Linux Kernel Mailing List, virtualization, stefanha, joe,
Andrew Morton, mhocko, bijan.mottahedeh
In-Reply-To: <CAHk-=wjAvx_rWKct5RtqRrVEgNxrFF=1qQwkz+AegDvd4yXW=g@mail.gmail.com>
On Fri, Nov 02, 2018 at 11:02:35AM -0700, Linus Torvalds wrote:
> On Fri, Nov 2, 2018 at 10:21 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > it seems that it depends on current not on the active mm.
>
> Yes, see my other suggestion to just fix "use_mm()" to update the address range.
>
> Would you mind testing that?
Sure, I'll test it.
> Because that would seem to be the *much* less error-prone model..
>
> Linus
I agree, it's always been bothering me.
--
MST
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox