* [PATCH net-next 3/5] sctp: Build sctp offload support into the base kernel
From: Vladislav Yasevich @ 2018-04-02 13:40 UTC (permalink / raw)
To: netdev; +Cc: nhorman, mst, virtualization, linux-sctp
In-Reply-To: <20180402134006.10111-1-vyasevic@redhat.com>
We need to take the sctp offload out of the sctp module
and add it to the base kernel to support guests that can
support it. This is similar to how IPv6 offloads are done
and works around kernels that exclude sctp protocol support.
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
---
include/net/sctp/sctp.h | 5 -----
net/Kconfig | 1 +
net/sctp/Kconfig | 1 -
net/sctp/Makefile | 3 ++-
net/sctp/offload.c | 4 +++-
net/sctp/protocol.c | 3 ---
6 files changed, 6 insertions(+), 11 deletions(-)
diff --git a/include/net/sctp/sctp.h b/include/net/sctp/sctp.h
index 72c5b8f..625b45f 100644
--- a/include/net/sctp/sctp.h
+++ b/include/net/sctp/sctp.h
@@ -183,11 +183,6 @@ struct sctp_transport *sctp_epaddr_lookup_transport(
int __net_init sctp_proc_init(struct net *net);
/*
- * sctp/offload.c
- */
-int sctp_offload_init(void);
-
-/*
* sctp/stream_sched.c
*/
void sctp_sched_ops_init(void);
diff --git a/net/Kconfig b/net/Kconfig
index 0428f12..2773f98 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -64,6 +64,7 @@ config INET
bool "TCP/IP networking"
select CRYPTO
select CRYPTO_AES
+ select LIBCRC32C
---help---
These are the protocols used on the Internet and on most local
Ethernets. It is highly recommended to say Y here (this will enlarge
diff --git a/net/sctp/Kconfig b/net/sctp/Kconfig
index c740b18..d07477a 100644
--- a/net/sctp/Kconfig
+++ b/net/sctp/Kconfig
@@ -9,7 +9,6 @@ menuconfig IP_SCTP
select CRYPTO
select CRYPTO_HMAC
select CRYPTO_SHA1
- select LIBCRC32C
---help---
Stream Control Transmission Protocol
diff --git a/net/sctp/Makefile b/net/sctp/Makefile
index e845e45..ee206ca 100644
--- a/net/sctp/Makefile
+++ b/net/sctp/Makefile
@@ -5,6 +5,7 @@
obj-$(CONFIG_IP_SCTP) += sctp.o
obj-$(CONFIG_INET_SCTP_DIAG) += sctp_diag.o
+obj-$(CONFIG_INET) += offload.o
sctp-y := sm_statetable.o sm_statefuns.o sm_sideeffect.o \
protocol.o endpointola.o associola.o \
@@ -12,7 +13,7 @@ sctp-y := sm_statetable.o sm_statefuns.o sm_sideeffect.o \
inqueue.o outqueue.o ulpqueue.o \
tsnmap.o bind_addr.o socket.o primitive.o \
output.o input.o debug.o stream.o auth.o \
- offload.o stream_sched.o stream_sched_prio.o \
+ stream_sched.o stream_sched_prio.o \
stream_sched_rr.o stream_interleave.o
sctp_diag-y := diag.o
diff --git a/net/sctp/offload.c b/net/sctp/offload.c
index 123e9f2..c61cbde 100644
--- a/net/sctp/offload.c
+++ b/net/sctp/offload.c
@@ -107,7 +107,7 @@ static const struct skb_checksum_ops crc32c_csum_ops = {
.combine = sctp_csum_combine,
};
-int __init sctp_offload_init(void)
+static int __init sctp_offload_init(void)
{
int ret;
@@ -127,3 +127,5 @@ int __init sctp_offload_init(void)
out:
return ret;
}
+
+fs_initcall(sctp_offload_init);
diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
index a24cde2..46d2b63 100644
--- a/net/sctp/protocol.c
+++ b/net/sctp/protocol.c
@@ -1479,9 +1479,6 @@ static __init int sctp_init(void)
if (status)
goto err_v6_add_protocol;
- if (sctp_offload_init() < 0)
- pr_crit("%s: Cannot add SCTP protocol offload\n", __func__);
-
out:
return status;
err_v6_add_protocol:
--
2.9.5
^ permalink raw reply related
* [PATCH net-next 2/5] sctp: Handle sctp packets with CHECKSUM_PARTIAL
From: Vladislav Yasevich @ 2018-04-02 13:40 UTC (permalink / raw)
To: netdev; +Cc: nhorman, mst, virtualization, linux-sctp
In-Reply-To: <20180402134006.10111-1-vyasevic@redhat.com>
With SCTP checksum offload available in virtio, it is now
possible for virtio to receive a sctp packet with CHECKSUM_PARTIAL
set (guest-to-guest traffic). SCTP doesn't really have a
partial checksum like TCP does because CRC32c can't do partial
additive checksumming. It's all or nothing. So an SCTP packet
with CHECKSUM_PARTIAL will have checksum set to 0. Let SCTP
treat this as a valid checksum if CHECKSUM_PARTIAL is set.
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
---
net/sctp/input.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/net/sctp/input.c b/net/sctp/input.c
index ba8a6e6..055b8ffa 100644
--- a/net/sctp/input.c
+++ b/net/sctp/input.c
@@ -80,8 +80,17 @@ static inline int sctp_rcv_checksum(struct net *net, struct sk_buff *skb)
{
struct sctphdr *sh = sctp_hdr(skb);
__le32 cmp = sh->checksum;
- __le32 val = sctp_compute_cksum(skb, 0);
+ __le32 val = 0;
+ /* In sctp PARTIAL checksum is always 0. This is a case of
+ * a packet received from guest that supports checksum offload.
+ * Assume it's correct as there is really no way to verify,
+ * and we want to avaoid computing it unnecesarily.
+ */
+ if (skb->ip_summed == CHECKSUM_PARTIAL)
+ return 0;
+
+ val = sctp_compute_cksum(skb, 0);
if (val != cmp) {
/* CRC failure, dump it. */
__SCTP_INC_STATS(net, SCTP_MIB_CHECKSUMERRORS);
--
2.9.5
^ permalink raw reply related
* [PATCH net-next 1/5] virtio: Add support for SCTP checksum offloading
From: Vladislav Yasevich @ 2018-04-02 13:40 UTC (permalink / raw)
To: netdev; +Cc: nhorman, mst, virtualization, linux-sctp
In-Reply-To: <20180402134006.10111-1-vyasevic@redhat.com>
To support SCTP checksum offloading, we need to add a new feature
to virtio_net, so we can negotiate support between the hypervisor
and the guest.
The signalling to the guest that an alternate checksum needs to
be used is done via a new flag in the virtio_net_hdr. If the
flag is set, the host will know to perform an alternate checksum
calculation, which right now is only CRC32c.
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
---
drivers/net/virtio_net.c | 11 ++++++++---
include/linux/virtio_net.h | 6 ++++++
include/uapi/linux/virtio_net.h | 2 ++
3 files changed, 16 insertions(+), 3 deletions(-)
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 7b187ec..b601294 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2724,9 +2724,14 @@ static int virtnet_probe(struct virtio_device *vdev)
/* Do we support "hardware" checksums? */
if (virtio_has_feature(vdev, VIRTIO_NET_F_CSUM)) {
/* This opens up the world of extra features. */
- dev->hw_features |= NETIF_F_HW_CSUM | NETIF_F_SG;
+ netdev_features_t sctp = 0;
+
+ if (virtio_has_feature(vdev, VIRTIO_NET_F_SCTP_CSUM))
+ sctp |= NETIF_F_SCTP_CRC;
+
+ dev->hw_features |= NETIF_F_HW_CSUM | NETIF_F_SG | sctp;
if (csum)
- dev->features |= NETIF_F_HW_CSUM | NETIF_F_SG;
+ dev->features |= NETIF_F_HW_CSUM | NETIF_F_SG | sctp;
if (virtio_has_feature(vdev, VIRTIO_NET_F_GSO)) {
dev->hw_features |= NETIF_F_TSO
@@ -2952,7 +2957,7 @@ static struct virtio_device_id id_table[] = {
};
#define VIRTNET_FEATURES \
- VIRTIO_NET_F_CSUM, VIRTIO_NET_F_GUEST_CSUM, \
+ VIRTIO_NET_F_CSUM, VIRTIO_NET_F_GUEST_CSUM, VIRTIO_NET_F_SCTP_CSUM, \
VIRTIO_NET_F_MAC, \
VIRTIO_NET_F_HOST_TSO4, VIRTIO_NET_F_HOST_UFO, VIRTIO_NET_F_HOST_TSO6, \
VIRTIO_NET_F_HOST_ECN, VIRTIO_NET_F_GUEST_TSO4, VIRTIO_NET_F_GUEST_TSO6, \
diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
index f144216..2e7a64a 100644
--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -39,6 +39,9 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff *skb,
if (!skb_partial_csum_set(skb, start, off))
return -EINVAL;
+
+ if (hdr->flags & VIRTIO_NET_HDR_F_CSUM_NOT_INET)
+ skb->csum_not_inet = 1;
}
if (hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
@@ -96,6 +99,9 @@ static inline int virtio_net_hdr_from_skb(const struct sk_buff *skb,
hdr->flags = VIRTIO_NET_HDR_F_DATA_VALID;
} /* else everything is zero */
+ if (skb->csum_not_inet)
+ hdr->flags &= VIRTIO_NET_HDR_F_CSUM_NOT_INET;
+
return 0;
}
diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
index 5de6ed3..3f279c8 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -36,6 +36,7 @@
#define VIRTIO_NET_F_GUEST_CSUM 1 /* Guest handles pkts w/ partial csum */
#define VIRTIO_NET_F_CTRL_GUEST_OFFLOADS 2 /* Dynamic offload configuration. */
#define VIRTIO_NET_F_MTU 3 /* Initial MTU advice */
+#define VIRTIO_NET_F_SCTP_CSUM 4 /* SCTP checksum offload support */
#define VIRTIO_NET_F_MAC 5 /* Host has given MAC address. */
#define VIRTIO_NET_F_GUEST_TSO4 7 /* Guest can handle TSOv4 in. */
#define VIRTIO_NET_F_GUEST_TSO6 8 /* Guest can handle TSOv6 in. */
@@ -101,6 +102,7 @@ struct virtio_net_config {
struct virtio_net_hdr_v1 {
#define VIRTIO_NET_HDR_F_NEEDS_CSUM 1 /* Use csum_start, csum_offset */
#define VIRTIO_NET_HDR_F_DATA_VALID 2 /* Csum is valid */
+#define VIRTIO_NET_HDR_F_CSUM_NOT_INET 4 /* Checksum is not inet */
__u8 flags;
#define VIRTIO_NET_HDR_GSO_NONE 0 /* Not a GSO frame */
#define VIRTIO_NET_HDR_GSO_TCPV4 1 /* GSO frame, IPv4 TCP (TSO) */
--
2.9.5
^ permalink raw reply related
* [PATCH net-next 0/5] virtio-net: Add SCTP checksum offload support
From: Vladislav Yasevich @ 2018-04-02 13:40 UTC (permalink / raw)
To: netdev; +Cc: nhorman, mst, virtualization, linux-sctp
Now that we have SCTP offload capabilities in the kernel, we can add
them to virtio as well. First step is SCTP checksum.
We need a new freature in virtio to negotiate this support since
SCTP is excluded with the stardard checksum and requires a little
bit extra. This series proposes VIRTIO_NET_F_SCTP_CSUM feature bit.
As the "little bit extra", the kernel uses a new bit in the skb
(skb->csum_not_inet) to determine whether to use standard inet checksum
or the SCTP CRC32c checksum. This bit has to be communicated between
the host and the guest. This bit is carried in the vnet header.
Tap and macvtap support is added through an extra feature for the
TUNSETOFFLOAD ioctl. Additionally macvtap will no correctly
do sctp checksumming if the receive doesn't support SCTP offload.
This also turns on sctp offloading for macvlan devices.
As for the perf numbers, I am seeing about a 5% increase in vm-to-vm
and vm-to-hos throughput which is the same as manually disabling
sctp checksumming,since this is exactly what we are emulatting.
Sending outside the host, the increase about 2.5-3%.
As for GSO, the way sctp GSO is currently implemented buys us nothing
in added support to virtio. To add true GSO, would require a lot of
re-work inside of SCTP and would require extensions to the virtio
net header to carry extra sctp data.
Vladislav Yasevich (5):
virtio: Add support for SCTP checksum offloading
sctp: Handle sctp packets with CHECKSUM_PARTIAL
sctp: Build sctp offload support into the base kernel
tun: Add support for SCTP checksum offload
macvlan/macvtap: Add support for SCTP checksum offload.
drivers/net/macvlan.c | 5 +++--
drivers/net/tap.c | 8 +++++---
drivers/net/tun.c | 5 +++++
drivers/net/virtio_net.c | 10 +++++++---
include/linux/virtio_net.h | 6 ++++++
include/net/sctp/sctp.h | 5 -----
include/uapi/linux/if_tun.h | 1 +
include/uapi/linux/virtio_net.h | 2 ++
net/Kconfig | 1 +
net/sctp/Kconfig | 1 -
net/sctp/Makefile | 3 ++-
net/sctp/input.c | 11 ++++++++++-
net/sctp/offload.c | 4 +++-
net/sctp/protocol.c | 3 ---
14 files changed, 45 insertions(+), 20 deletions(-)
--
2.9.5
^ permalink raw reply
* [RFC v2] virtio: support packed ring
From: Tiwei Bie @ 2018-04-01 14:12 UTC (permalink / raw)
To: mst, jasowang, wexu, virtualization, linux-kernel, netdev
Hello everyone,
This RFC implements packed ring support for virtio driver.
The code was tested with DPDK vhost (testpmd/vhost-PMD) implemented
by Jens at http://dpdk.org/ml/archives/dev/2018-January/089417.html
Minor changes are needed for the vhost code, e.g. to kick the guest.
TODO:
- Refinements and bug fixes;
- Split into small patches;
- Test indirect descriptor support;
- Test/fix event suppression support;
- Test devices other than net;
RFC v1 -> RFC v2:
- Add indirect descriptor support - compile test only;
- Add event suppression supprt - compile test only;
- Move vring_packed_init() out of uapi (Jason, MST);
- Merge two loops into one in virtqueue_add_packed() (Jason);
- Split vring_unmap_one() for packed ring and split ring (Jason);
- Avoid using '%' operator (Jason);
- Rename free_head -> next_avail_idx (Jason);
- Add comments for virtio_wmb() in virtqueue_add_packed() (Jason);
- Some other refinements and bug fixes;
Thanks!
Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
---
drivers/virtio/virtio_ring.c | 1094 +++++++++++++++++++++++++++++-------
include/linux/virtio_ring.h | 8 +-
include/uapi/linux/virtio_config.h | 12 +-
include/uapi/linux/virtio_ring.h | 61 ++
4 files changed, 980 insertions(+), 195 deletions(-)
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 71458f493cf8..0515dca34d77 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -58,14 +58,15 @@
struct vring_desc_state {
void *data; /* Data for callback. */
- struct vring_desc *indir_desc; /* Indirect descriptor, if any. */
+ void *indir_desc; /* Indirect descriptor, if any. */
+ int num; /* Descriptor list length. */
};
struct vring_virtqueue {
struct virtqueue vq;
- /* Actual memory layout for this queue */
- struct vring vring;
+ /* Is this a packed ring? */
+ bool packed;
/* Can we use weak barriers? */
bool weak_barriers;
@@ -79,19 +80,45 @@ struct vring_virtqueue {
/* Host publishes avail event idx */
bool event;
- /* Head of free buffer list. */
- unsigned int free_head;
/* Number we've added since last sync. */
unsigned int num_added;
/* Last used index we've seen. */
u16 last_used_idx;
- /* Last written value to avail->flags */
- u16 avail_flags_shadow;
+ union {
+ /* Available for split ring */
+ struct {
+ /* Actual memory layout for this queue. */
+ struct vring vring;
- /* Last written value to avail->idx in guest byte order */
- u16 avail_idx_shadow;
+ /* Head of free buffer list. */
+ unsigned int free_head;
+
+ /* Last written value to avail->flags */
+ u16 avail_flags_shadow;
+
+ /* Last written value to avail->idx in
+ * guest byte order. */
+ u16 avail_idx_shadow;
+ };
+
+ /* Available for packed ring */
+ struct {
+ /* Actual memory layout for this queue. */
+ struct vring_packed vring_packed;
+
+ /* Driver ring wrap counter. */
+ u8 wrap_counter;
+
+ /* Index of the next avail descriptor. */
+ unsigned int next_avail_idx;
+
+ /* Last written value to driver->flags in
+ * guest byte order. */
+ u16 event_flags_shadow;
+ };
+ };
/* How to notify other side. FIXME: commonalize hcalls! */
bool (*notify)(struct virtqueue *vq);
@@ -201,8 +228,33 @@ static dma_addr_t vring_map_single(const struct vring_virtqueue *vq,
cpu_addr, size, direction);
}
-static void vring_unmap_one(const struct vring_virtqueue *vq,
- struct vring_desc *desc)
+static void vring_unmap_one_split(const struct vring_virtqueue *vq,
+ struct vring_desc *desc)
+{
+ u16 flags;
+
+ if (!vring_use_dma_api(vq->vq.vdev))
+ return;
+
+ flags = virtio16_to_cpu(vq->vq.vdev, desc->flags);
+
+ if (flags & VRING_DESC_F_INDIRECT) {
+ dma_unmap_single(vring_dma_dev(vq),
+ virtio64_to_cpu(vq->vq.vdev, desc->addr),
+ virtio32_to_cpu(vq->vq.vdev, desc->len),
+ (flags & VRING_DESC_F_WRITE) ?
+ DMA_FROM_DEVICE : DMA_TO_DEVICE);
+ } else {
+ dma_unmap_page(vring_dma_dev(vq),
+ virtio64_to_cpu(vq->vq.vdev, desc->addr),
+ virtio32_to_cpu(vq->vq.vdev, desc->len),
+ (flags & VRING_DESC_F_WRITE) ?
+ DMA_FROM_DEVICE : DMA_TO_DEVICE);
+ }
+}
+
+static void vring_unmap_one_packed(const struct vring_virtqueue *vq,
+ struct vring_packed_desc *desc)
{
u16 flags;
@@ -235,8 +287,9 @@ static int vring_mapping_error(const struct vring_virtqueue *vq,
return dma_mapping_error(vring_dma_dev(vq), addr);
}
-static struct vring_desc *alloc_indirect(struct virtqueue *_vq,
- unsigned int total_sg, gfp_t gfp)
+static struct vring_desc *alloc_indirect_split(struct virtqueue *_vq,
+ unsigned int total_sg,
+ gfp_t gfp)
{
struct vring_desc *desc;
unsigned int i;
@@ -257,14 +310,32 @@ static struct vring_desc *alloc_indirect(struct virtqueue *_vq,
return desc;
}
-static inline int virtqueue_add(struct virtqueue *_vq,
- struct scatterlist *sgs[],
- unsigned int total_sg,
- unsigned int out_sgs,
- unsigned int in_sgs,
- void *data,
- void *ctx,
- gfp_t gfp)
+static struct vring_packed_desc *alloc_indirect_packed(struct virtqueue *_vq,
+ unsigned int total_sg,
+ gfp_t gfp)
+{
+ struct vring_packed_desc *desc;
+
+ /*
+ * We require lowmem mappings for the descriptors because
+ * otherwise virt_to_phys will give us bogus addresses in the
+ * virtqueue.
+ */
+ gfp &= ~__GFP_HIGHMEM;
+
+ desc = kmalloc(total_sg * sizeof(struct vring_packed_desc), gfp);
+
+ return desc;
+}
+
+static inline int virtqueue_add_split(struct virtqueue *_vq,
+ struct scatterlist *sgs[],
+ unsigned int total_sg,
+ unsigned int out_sgs,
+ unsigned int in_sgs,
+ void *data,
+ void *ctx,
+ gfp_t gfp)
{
struct vring_virtqueue *vq = to_vvq(_vq);
struct scatterlist *sg;
@@ -303,7 +374,7 @@ static inline int virtqueue_add(struct virtqueue *_vq,
/* If the host supports indirect descriptor tables, and we have multiple
* buffers, then go indirect. FIXME: tune this threshold */
if (vq->indirect && total_sg > 1 && vq->vq.num_free)
- desc = alloc_indirect(_vq, total_sg, gfp);
+ desc = alloc_indirect_split(_vq, total_sg, gfp);
else {
desc = NULL;
WARN_ON_ONCE(total_sg > vq->vring.num && !vq->indirect);
@@ -424,7 +495,7 @@ static inline int virtqueue_add(struct virtqueue *_vq,
for (n = 0; n < total_sg; n++) {
if (i == err_idx)
break;
- vring_unmap_one(vq, &desc[i]);
+ vring_unmap_one_split(vq, &desc[i]);
i = virtio16_to_cpu(_vq->vdev, vq->vring.desc[i].next);
}
@@ -435,6 +506,210 @@ static inline int virtqueue_add(struct virtqueue *_vq,
return -EIO;
}
+static inline int virtqueue_add_packed(struct virtqueue *_vq,
+ struct scatterlist *sgs[],
+ unsigned int total_sg,
+ unsigned int out_sgs,
+ unsigned int in_sgs,
+ void *data,
+ void *ctx,
+ gfp_t gfp)
+{
+ struct vring_virtqueue *vq = to_vvq(_vq);
+ struct vring_packed_desc *desc;
+ struct scatterlist *sg;
+ unsigned int i, n, descs_used, uninitialized_var(prev), err_idx;
+ __virtio16 uninitialized_var(head_flags), flags;
+ int head, wrap_counter;
+ bool indirect;
+
+ START_USE(vq);
+
+ BUG_ON(data == NULL);
+ BUG_ON(ctx && vq->indirect);
+
+ if (unlikely(vq->broken)) {
+ END_USE(vq);
+ return -EIO;
+ }
+
+#ifdef DEBUG
+ {
+ ktime_t now = ktime_get();
+
+ /* No kick or get, with .1 second between? Warn. */
+ if (vq->last_add_time_valid)
+ WARN_ON(ktime_to_ms(ktime_sub(now, vq->last_add_time))
+ > 100);
+ vq->last_add_time = now;
+ vq->last_add_time_valid = true;
+ }
+#endif
+
+ BUG_ON(total_sg == 0);
+
+ head = vq->next_avail_idx;
+ wrap_counter = vq->wrap_counter;
+
+ /* If the host supports indirect descriptor tables, and we have multiple
+ * buffers, then go indirect. FIXME: tune this threshold */
+ if (vq->indirect && total_sg > 1 && vq->vq.num_free)
+ desc = alloc_indirect_packed(_vq, total_sg, gfp);
+ else {
+ desc = NULL;
+ WARN_ON_ONCE(total_sg > vq->vring_packed.num && !vq->indirect);
+ }
+
+ if (desc) {
+ /* Use a single buffer which doesn't continue */
+ indirect = true;
+ /* Set up rest to use this indirect table. */
+ i = 0;
+ descs_used = 1;
+ } else {
+ indirect = false;
+ desc = vq->vring_packed.desc;
+ i = head;
+ descs_used = total_sg;
+ }
+
+ if (vq->vq.num_free < descs_used) {
+ pr_debug("Can't add buf len %i - avail = %i\n",
+ descs_used, vq->vq.num_free);
+ /* FIXME: for historical reasons, we force a notify here if
+ * there are outgoing parts to the buffer. Presumably the
+ * host should service the ring ASAP. */
+ if (out_sgs)
+ vq->notify(&vq->vq);
+ if (indirect)
+ kfree(desc);
+ END_USE(vq);
+ return -ENOSPC;
+ }
+
+ for (n = 0; n < out_sgs + in_sgs; n++) {
+ for (sg = sgs[n]; sg; sg = sg_next(sg)) {
+ dma_addr_t addr = vring_map_one_sg(vq, sg, n < out_sgs ?
+ DMA_TO_DEVICE : DMA_FROM_DEVICE);
+ if (vring_mapping_error(vq, addr))
+ goto unmap_release;
+
+ flags = cpu_to_virtio16(_vq->vdev, VRING_DESC_F_NEXT |
+ (n < out_sgs ? 0 : VRING_DESC_F_WRITE) |
+ VRING_DESC_F_AVAIL(vq->wrap_counter) |
+ VRING_DESC_F_USED(!vq->wrap_counter));
+ if (!indirect && i == head)
+ head_flags = flags;
+ else
+ desc[i].flags = flags;
+
+ desc[i].addr = cpu_to_virtio64(_vq->vdev, addr);
+ desc[i].len = cpu_to_virtio32(_vq->vdev, sg->length);
+ desc[i].id = cpu_to_virtio32(_vq->vdev, head);
+ prev = i;
+ i++;
+ if (!indirect && i >= vq->vring_packed.num) {
+ i = 0;
+ vq->wrap_counter ^= 1;
+ }
+ }
+ }
+ /* Last one doesn't continue. */
+ if (total_sg == 1)
+ head_flags &= cpu_to_virtio16(_vq->vdev, ~VRING_DESC_F_NEXT);
+ else
+ desc[prev].flags &= cpu_to_virtio16(_vq->vdev, ~VRING_DESC_F_NEXT);
+
+ if (indirect) {
+ /* Now that the indirect table is filled in, map it. */
+ dma_addr_t addr = vring_map_single(
+ vq, desc, total_sg * sizeof(struct vring_packed_desc),
+ DMA_TO_DEVICE);
+ if (vring_mapping_error(vq, addr))
+ goto unmap_release;
+
+ head_flags = cpu_to_virtio16(_vq->vdev, VRING_DESC_F_INDIRECT |
+ VRING_DESC_F_AVAIL(wrap_counter) |
+ VRING_DESC_F_USED(!wrap_counter));
+ vq->vring_packed.desc[head].addr = cpu_to_virtio64(_vq->vdev, addr);
+ vq->vring_packed.desc[head].len = cpu_to_virtio32(_vq->vdev,
+ total_sg * sizeof(struct vring_packed_desc));
+ vq->vring_packed.desc[head].id = cpu_to_virtio32(_vq->vdev, head);
+ }
+
+ /* We're using some buffers from the free list. */
+ vq->vq.num_free -= descs_used;
+
+ /* Update free pointer */
+ if (indirect) {
+ n = head + 1;
+ if (n >= vq->vring_packed.num) {
+ n = 0;
+ vq->wrap_counter ^= 1;
+ }
+ vq->next_avail_idx = n;
+ } else
+ vq->next_avail_idx = i;
+
+ /* Store token and indirect buffer state. */
+ vq->desc_state[head].num = descs_used;
+ vq->desc_state[head].data = data;
+ if (indirect)
+ vq->desc_state[head].indir_desc = desc;
+ else
+ vq->desc_state[head].indir_desc = ctx;
+
+ /* A driver MUST NOT make the first descriptor in the list
+ * available before all subsequent descriptors comprising
+ * the list are made available. */
+ virtio_wmb(vq->weak_barriers);
+ vq->vring_packed.desc[head].flags = head_flags;
+ vq->num_added++;
+
+ pr_debug("Added buffer head %i to %p\n", head, vq);
+ END_USE(vq);
+
+ return 0;
+
+unmap_release:
+ err_idx = i;
+ i = head;
+
+ for (n = 0; n < total_sg; n++) {
+ if (i == err_idx)
+ break;
+ vring_unmap_one_packed(vq, &desc[i]);
+ i++;
+ if (!indirect && i >= vq->vring_packed.num)
+ i = 0;
+ }
+
+ vq->wrap_counter = wrap_counter;
+
+ if (indirect)
+ kfree(desc);
+
+ END_USE(vq);
+ return -EIO;
+}
+
+static inline int virtqueue_add(struct virtqueue *_vq,
+ struct scatterlist *sgs[],
+ unsigned int total_sg,
+ unsigned int out_sgs,
+ unsigned int in_sgs,
+ void *data,
+ void *ctx,
+ gfp_t gfp)
+{
+ struct vring_virtqueue *vq = to_vvq(_vq);
+
+ return vq->packed ? virtqueue_add_packed(_vq, sgs, total_sg, out_sgs,
+ in_sgs, data, ctx, gfp) :
+ virtqueue_add_split(_vq, sgs, total_sg, out_sgs,
+ in_sgs, data, ctx, gfp);
+}
+
/**
* virtqueue_add_sgs - expose buffers to other end
* @vq: the struct virtqueue we're talking about.
@@ -537,18 +812,7 @@ int virtqueue_add_inbuf_ctx(struct virtqueue *vq,
}
EXPORT_SYMBOL_GPL(virtqueue_add_inbuf_ctx);
-/**
- * virtqueue_kick_prepare - first half of split virtqueue_kick call.
- * @vq: the struct virtqueue
- *
- * Instead of virtqueue_kick(), you can do:
- * if (virtqueue_kick_prepare(vq))
- * virtqueue_notify(vq);
- *
- * This is sometimes useful because the virtqueue_kick_prepare() needs
- * to be serialized, but the actual virtqueue_notify() call does not.
- */
-bool virtqueue_kick_prepare(struct virtqueue *_vq)
+static bool virtqueue_kick_prepare_split(struct virtqueue *_vq)
{
struct vring_virtqueue *vq = to_vvq(_vq);
u16 new, old;
@@ -580,6 +844,62 @@ bool virtqueue_kick_prepare(struct virtqueue *_vq)
END_USE(vq);
return needs_kick;
}
+
+static bool virtqueue_kick_prepare_packed(struct virtqueue *_vq)
+{
+ struct vring_virtqueue *vq = to_vvq(_vq);
+ u16 new, old, off_wrap;
+ bool needs_kick;
+
+ START_USE(vq);
+ /* We need to expose the new flags value before checking notification
+ * suppressions. */
+ virtio_mb(vq->weak_barriers);
+
+ old = vq->next_avail_idx - vq->num_added;
+ new = vq->next_avail_idx;
+ vq->num_added = 0;
+
+#ifdef DEBUG
+ if (vq->last_add_time_valid) {
+ WARN_ON(ktime_to_ms(ktime_sub(ktime_get(),
+ vq->last_add_time)) > 100);
+ }
+ vq->last_add_time_valid = false;
+#endif
+
+ off_wrap = virtio16_to_cpu(_vq->vdev, vq->vring_packed.device->off_wrap);
+
+ if (vq->event) {
+ // FIXME: fix this!
+ needs_kick = ((off_wrap >> 15) == vq->wrap_counter) &&
+ vring_need_event(off_wrap & ~(1<<15), new, old);
+ } else {
+ needs_kick = (vq->vring_packed.device->flags !=
+ cpu_to_virtio16(_vq->vdev, VRING_EVENT_F_DISABLE));
+ }
+ END_USE(vq);
+ return needs_kick;
+}
+
+/**
+ * virtqueue_kick_prepare - first half of split virtqueue_kick call.
+ * @vq: the struct virtqueue
+ *
+ * Instead of virtqueue_kick(), you can do:
+ * if (virtqueue_kick_prepare(vq))
+ * virtqueue_notify(vq);
+ *
+ * This is sometimes useful because the virtqueue_kick_prepare() needs
+ * to be serialized, but the actual virtqueue_notify() call does not.
+ */
+bool virtqueue_kick_prepare(struct virtqueue *_vq)
+{
+ struct vring_virtqueue *vq = to_vvq(_vq);
+
+ return vq->packed ? virtqueue_kick_prepare_packed(_vq) :
+ virtqueue_kick_prepare_split(_vq);
+}
EXPORT_SYMBOL_GPL(virtqueue_kick_prepare);
/**
@@ -626,8 +946,8 @@ bool virtqueue_kick(struct virtqueue *vq)
}
EXPORT_SYMBOL_GPL(virtqueue_kick);
-static void detach_buf(struct vring_virtqueue *vq, unsigned int head,
- void **ctx)
+static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head,
+ void **ctx)
{
unsigned int i, j;
__virtio16 nextflag = cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_NEXT);
@@ -639,12 +959,12 @@ static void detach_buf(struct vring_virtqueue *vq, unsigned int head,
i = head;
while (vq->vring.desc[i].flags & nextflag) {
- vring_unmap_one(vq, &vq->vring.desc[i]);
+ vring_unmap_one_split(vq, &vq->vring.desc[i]);
i = virtio16_to_cpu(vq->vq.vdev, vq->vring.desc[i].next);
vq->vq.num_free++;
}
- vring_unmap_one(vq, &vq->vring.desc[i]);
+ vring_unmap_one_split(vq, &vq->vring.desc[i]);
vq->vring.desc[i].next = cpu_to_virtio16(vq->vq.vdev, vq->free_head);
vq->free_head = head;
@@ -666,7 +986,7 @@ static void detach_buf(struct vring_virtqueue *vq, unsigned int head,
BUG_ON(len == 0 || len % sizeof(struct vring_desc));
for (j = 0; j < len / sizeof(struct vring_desc); j++)
- vring_unmap_one(vq, &indir_desc[j]);
+ vring_unmap_one_split(vq, &indir_desc[j]);
kfree(indir_desc);
vq->desc_state[head].indir_desc = NULL;
@@ -675,11 +995,207 @@ static void detach_buf(struct vring_virtqueue *vq, unsigned int head,
}
}
-static inline bool more_used(const struct vring_virtqueue *vq)
+static int detach_buf_packed(struct vring_virtqueue *vq, unsigned int head,
+ void **ctx)
+{
+ struct vring_packed_desc *desc;
+ unsigned int i, j;
+
+ /* Clear data ptr. */
+ vq->desc_state[head].data = NULL;
+
+ i = head;
+
+ for (j = 0; j < vq->desc_state[head].num; j++) {
+ desc = &vq->vring_packed.desc[i];
+ vring_unmap_one_packed(vq, desc);
+ desc->flags = 0x0;
+ i++;
+ if (i >= vq->vring_packed.num)
+ i = 0;
+ }
+
+ vq->vq.num_free += vq->desc_state[head].num;
+
+ if (vq->indirect) {
+ u32 len;
+
+ desc = vq->desc_state[head].indir_desc;
+ /* Free the indirect table, if any, now that it's unmapped. */
+ if (!desc)
+ goto out;
+
+ len = virtio32_to_cpu(vq->vq.vdev,
+ vq->vring_packed.desc[head].len);
+
+ BUG_ON(!(vq->vring_packed.desc[head].flags &
+ cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_INDIRECT)));
+ BUG_ON(len == 0 || len % sizeof(struct vring_packed_desc));
+
+ for (j = 0; j < len / sizeof(struct vring_packed_desc); j++)
+ vring_unmap_one_packed(vq, &desc[j]);
+
+ kfree(desc);
+ vq->desc_state[head].indir_desc = NULL;
+ } else if (ctx) {
+ *ctx = vq->desc_state[head].indir_desc;
+ }
+
+out:
+ return vq->desc_state[head].num;
+}
+
+static inline bool more_used_split(const struct vring_virtqueue *vq)
{
return vq->last_used_idx != virtio16_to_cpu(vq->vq.vdev, vq->vring.used->idx);
}
+static inline bool more_used_packed(const struct vring_virtqueue *vq)
+{
+ u16 last_used, flags;
+ bool avail, used;
+
+ if (vq->vq.num_free == vq->vring_packed.num)
+ return false;
+
+ last_used = vq->last_used_idx;
+ flags = virtio16_to_cpu(vq->vq.vdev,
+ vq->vring_packed.desc[last_used].flags);
+ avail = flags & VRING_DESC_F_AVAIL(1);
+ used = flags & VRING_DESC_F_USED(1);
+
+ return avail == used;
+}
+
+static inline bool more_used(const struct vring_virtqueue *vq)
+{
+ return vq->packed ? more_used_packed(vq) : more_used_split(vq);
+}
+
+void *virtqueue_get_buf_ctx_split(struct virtqueue *_vq, unsigned int *len,
+ void **ctx)
+{
+ struct vring_virtqueue *vq = to_vvq(_vq);
+ void *ret;
+ unsigned int i;
+ u16 last_used;
+
+ START_USE(vq);
+
+ if (unlikely(vq->broken)) {
+ END_USE(vq);
+ return NULL;
+ }
+
+ if (!more_used(vq)) {
+ pr_debug("No more buffers in queue\n");
+ END_USE(vq);
+ return NULL;
+ }
+
+ /* Only get used array entries after they have been exposed by host. */
+ virtio_rmb(vq->weak_barriers);
+
+ last_used = (vq->last_used_idx & (vq->vring.num - 1));
+ i = virtio32_to_cpu(_vq->vdev, vq->vring.used->ring[last_used].id);
+ *len = virtio32_to_cpu(_vq->vdev, vq->vring.used->ring[last_used].len);
+
+ if (unlikely(i >= vq->vring.num)) {
+ BAD_RING(vq, "id %u out of range\n", i);
+ return NULL;
+ }
+ if (unlikely(!vq->desc_state[i].data)) {
+ BAD_RING(vq, "id %u is not a head!\n", i);
+ return NULL;
+ }
+
+ /* detach_buf_split clears data, so grab it now. */
+ ret = vq->desc_state[i].data;
+ detach_buf_split(vq, i, ctx);
+ vq->last_used_idx++;
+ /* If we expect an interrupt for the next entry, tell host
+ * by writing event index and flush out the write before
+ * the read in the next get_buf call. */
+ if (!(vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT))
+ virtio_store_mb(vq->weak_barriers,
+ &vring_used_event(&vq->vring),
+ cpu_to_virtio16(_vq->vdev, vq->last_used_idx));
+
+#ifdef DEBUG
+ vq->last_add_time_valid = false;
+#endif
+
+ END_USE(vq);
+ return ret;
+}
+
+void *virtqueue_get_buf_ctx_packed(struct virtqueue *_vq, unsigned int *len,
+ void **ctx)
+{
+ struct vring_virtqueue *vq = to_vvq(_vq);
+ uint16_t wrap_counter;
+ void *ret;
+ unsigned int i;
+ u16 last_used;
+
+ START_USE(vq);
+
+ if (unlikely(vq->broken)) {
+ END_USE(vq);
+ return NULL;
+ }
+
+ if (!more_used(vq)) {
+ pr_debug("No more buffers in queue\n");
+ END_USE(vq);
+ return NULL;
+ }
+
+ /* Only get used elements after they have been exposed by host. */
+ virtio_rmb(vq->weak_barriers);
+
+ last_used = vq->last_used_idx;
+ i = virtio32_to_cpu(_vq->vdev, vq->vring_packed.desc[last_used].id);
+ *len = virtio32_to_cpu(_vq->vdev, vq->vring_packed.desc[last_used].len);
+
+ if (unlikely(i >= vq->vring_packed.num)) {
+ BAD_RING(vq, "id %u out of range\n", i);
+ return NULL;
+ }
+ if (unlikely(!vq->desc_state[i].data)) {
+ BAD_RING(vq, "id %u is not a head!\n", i);
+ return NULL;
+ }
+
+ /* detach_buf_packed clears data, so grab it now. */
+ ret = vq->desc_state[i].data;
+ detach_buf_packed(vq, i, ctx);
+
+ vq->last_used_idx += vq->desc_state[i].num;
+ if (vq->last_used_idx >= vq->vring_packed.num)
+ vq->last_used_idx -= vq->vring_packed.num;
+
+ wrap_counter = vq->wrap_counter;
+ if (vq->last_used_idx > vq->next_avail_idx)
+ wrap_counter ^= 1;
+
+ /* If we expect an interrupt for the next entry, tell host
+ * by writing event index and flush out the write before
+ * the read in the next get_buf call. */
+ if (vq->event_flags_shadow == VRING_EVENT_F_DESC)
+ virtio_store_mb(vq->weak_barriers,
+ &vq->vring_packed.driver->off_wrap,
+ cpu_to_virtio16(_vq->vdev, vq->last_used_idx |
+ wrap_counter << 15));
+
+#ifdef DEBUG
+ vq->last_add_time_valid = false;
+#endif
+
+ END_USE(vq);
+ return ret;
+}
+
/**
* virtqueue_get_buf - get the next used buffer
* @vq: the struct virtqueue we're talking about.
@@ -700,57 +1216,9 @@ void *virtqueue_get_buf_ctx(struct virtqueue *_vq, unsigned int *len,
void **ctx)
{
struct vring_virtqueue *vq = to_vvq(_vq);
- void *ret;
- unsigned int i;
- u16 last_used;
- START_USE(vq);
-
- if (unlikely(vq->broken)) {
- END_USE(vq);
- return NULL;
- }
-
- if (!more_used(vq)) {
- pr_debug("No more buffers in queue\n");
- END_USE(vq);
- return NULL;
- }
-
- /* Only get used array entries after they have been exposed by host. */
- virtio_rmb(vq->weak_barriers);
-
- last_used = (vq->last_used_idx & (vq->vring.num - 1));
- i = virtio32_to_cpu(_vq->vdev, vq->vring.used->ring[last_used].id);
- *len = virtio32_to_cpu(_vq->vdev, vq->vring.used->ring[last_used].len);
-
- if (unlikely(i >= vq->vring.num)) {
- BAD_RING(vq, "id %u out of range\n", i);
- return NULL;
- }
- if (unlikely(!vq->desc_state[i].data)) {
- BAD_RING(vq, "id %u is not a head!\n", i);
- return NULL;
- }
-
- /* detach_buf clears data, so grab it now. */
- ret = vq->desc_state[i].data;
- detach_buf(vq, i, ctx);
- vq->last_used_idx++;
- /* If we expect an interrupt for the next entry, tell host
- * by writing event index and flush out the write before
- * the read in the next get_buf call. */
- if (!(vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT))
- virtio_store_mb(vq->weak_barriers,
- &vring_used_event(&vq->vring),
- cpu_to_virtio16(_vq->vdev, vq->last_used_idx));
-
-#ifdef DEBUG
- vq->last_add_time_valid = false;
-#endif
-
- END_USE(vq);
- return ret;
+ return vq->packed ? virtqueue_get_buf_ctx_packed(_vq, len, ctx) :
+ virtqueue_get_buf_ctx_split(_vq, len, ctx);
}
EXPORT_SYMBOL_GPL(virtqueue_get_buf_ctx);
@@ -759,6 +1227,29 @@ void *virtqueue_get_buf(struct virtqueue *_vq, unsigned int *len)
return virtqueue_get_buf_ctx(_vq, len, NULL);
}
EXPORT_SYMBOL_GPL(virtqueue_get_buf);
+
+static void virtqueue_disable_cb_split(struct virtqueue *_vq)
+{
+ struct vring_virtqueue *vq = to_vvq(_vq);
+
+ if (!(vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT)) {
+ vq->avail_flags_shadow |= VRING_AVAIL_F_NO_INTERRUPT;
+ if (!vq->event)
+ vq->vring.avail->flags = cpu_to_virtio16(_vq->vdev, vq->avail_flags_shadow);
+ }
+}
+
+static void virtqueue_disable_cb_packed(struct virtqueue *_vq)
+{
+ struct vring_virtqueue *vq = to_vvq(_vq);
+
+ if (vq->event_flags_shadow != VRING_EVENT_F_DISABLE) {
+ vq->event_flags_shadow = VRING_EVENT_F_DISABLE;
+ vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
+ vq->event_flags_shadow);
+ }
+}
+
/**
* virtqueue_disable_cb - disable callbacks
* @vq: the struct virtqueue we're talking about.
@@ -772,15 +1263,66 @@ void virtqueue_disable_cb(struct virtqueue *_vq)
{
struct vring_virtqueue *vq = to_vvq(_vq);
- if (!(vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT)) {
- vq->avail_flags_shadow |= VRING_AVAIL_F_NO_INTERRUPT;
- if (!vq->event)
- vq->vring.avail->flags = cpu_to_virtio16(_vq->vdev, vq->avail_flags_shadow);
- }
-
+ if (vq->packed)
+ virtqueue_disable_cb_packed(_vq);
+ else
+ virtqueue_disable_cb_split(_vq);
}
EXPORT_SYMBOL_GPL(virtqueue_disable_cb);
+static unsigned virtqueue_enable_cb_prepare_split(struct virtqueue *_vq)
+{
+ struct vring_virtqueue *vq = to_vvq(_vq);
+ u16 last_used_idx;
+
+ START_USE(vq);
+
+ /* We optimistically turn back on interrupts, then check if there was
+ * more to do. */
+ /* Depending on the VIRTIO_RING_F_EVENT_IDX feature, we need to
+ * either clear the flags bit or point the event index at the next
+ * entry. Always do both to keep code simple. */
+ if (vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT) {
+ vq->avail_flags_shadow &= ~VRING_AVAIL_F_NO_INTERRUPT;
+ if (!vq->event)
+ vq->vring.avail->flags = cpu_to_virtio16(_vq->vdev, vq->avail_flags_shadow);
+ }
+ vring_used_event(&vq->vring) = cpu_to_virtio16(_vq->vdev, last_used_idx = vq->last_used_idx);
+ END_USE(vq);
+ return last_used_idx;
+}
+
+static unsigned virtqueue_enable_cb_prepare_packed(struct virtqueue *_vq)
+{
+ struct vring_virtqueue *vq = to_vvq(_vq);
+ u16 last_used_idx, wrap_counter, off_wrap;
+
+ START_USE(vq);
+
+ last_used_idx = vq->last_used_idx;
+ wrap_counter = vq->wrap_counter;
+
+ if (last_used_idx > vq->next_avail_idx)
+ wrap_counter ^= 1;
+
+ off_wrap = last_used_idx | (wrap_counter << 15);
+
+ /* We optimistically turn back on interrupts, then check if there was
+ * more to do. */
+ /* Depending on the VIRTIO_RING_F_EVENT_IDX feature, we need to
+ * either clear the flags bit or point the event index at the next
+ * entry. Always do both to keep code simple. */
+ if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
+ vq->event_flags_shadow = vq->event ? VRING_EVENT_F_DESC:
+ VRING_EVENT_F_ENABLE;
+ vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
+ vq->event_flags_shadow);
+ }
+ vq->vring_packed.driver->off_wrap = cpu_to_virtio16(_vq->vdev, off_wrap);
+ END_USE(vq);
+ return last_used_idx;
+}
+
/**
* virtqueue_enable_cb_prepare - restart callbacks after disable_cb
* @vq: the struct virtqueue we're talking about.
@@ -796,26 +1338,34 @@ EXPORT_SYMBOL_GPL(virtqueue_disable_cb);
unsigned virtqueue_enable_cb_prepare(struct virtqueue *_vq)
{
struct vring_virtqueue *vq = to_vvq(_vq);
- u16 last_used_idx;
- START_USE(vq);
-
- /* We optimistically turn back on interrupts, then check if there was
- * more to do. */
- /* Depending on the VIRTIO_RING_F_EVENT_IDX feature, we need to
- * either clear the flags bit or point the event index at the next
- * entry. Always do both to keep code simple. */
- if (vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT) {
- vq->avail_flags_shadow &= ~VRING_AVAIL_F_NO_INTERRUPT;
- if (!vq->event)
- vq->vring.avail->flags = cpu_to_virtio16(_vq->vdev, vq->avail_flags_shadow);
- }
- vring_used_event(&vq->vring) = cpu_to_virtio16(_vq->vdev, last_used_idx = vq->last_used_idx);
- END_USE(vq);
- return last_used_idx;
+ return vq->packed ? virtqueue_enable_cb_prepare_packed(_vq) :
+ virtqueue_enable_cb_prepare_split(_vq);
}
EXPORT_SYMBOL_GPL(virtqueue_enable_cb_prepare);
+static bool virtqueue_poll_split(struct virtqueue *_vq, unsigned last_used_idx)
+{
+ struct vring_virtqueue *vq = to_vvq(_vq);
+
+ virtio_mb(vq->weak_barriers);
+ return (u16)last_used_idx != virtio16_to_cpu(_vq->vdev, vq->vring.used->idx);
+}
+
+static bool virtqueue_poll_packed(struct virtqueue *_vq, unsigned last_used_idx)
+{
+ struct vring_virtqueue *vq = to_vvq(_vq);
+ bool avail, used;
+ u16 flags;
+
+ virtio_mb(vq->weak_barriers);
+ flags = virtio16_to_cpu(vq->vq.vdev,
+ vq->vring_packed.desc[last_used_idx].flags);
+ avail = flags & VRING_DESC_F_AVAIL(1);
+ used = flags & VRING_DESC_F_USED(1);
+ return avail == used;
+}
+
/**
* virtqueue_poll - query pending used buffers
* @vq: the struct virtqueue we're talking about.
@@ -829,8 +1379,8 @@ bool virtqueue_poll(struct virtqueue *_vq, unsigned last_used_idx)
{
struct vring_virtqueue *vq = to_vvq(_vq);
- virtio_mb(vq->weak_barriers);
- return (u16)last_used_idx != virtio16_to_cpu(_vq->vdev, vq->vring.used->idx);
+ return vq->packed ? virtqueue_poll_packed(_vq, last_used_idx) :
+ virtqueue_poll_split(_vq, last_used_idx);
}
EXPORT_SYMBOL_GPL(virtqueue_poll);
@@ -852,6 +1402,83 @@ bool virtqueue_enable_cb(struct virtqueue *_vq)
}
EXPORT_SYMBOL_GPL(virtqueue_enable_cb);
+static bool virtqueue_enable_cb_delayed_split(struct virtqueue *_vq)
+{
+ struct vring_virtqueue *vq = to_vvq(_vq);
+ u16 bufs;
+
+ START_USE(vq);
+
+ /* We optimistically turn back on interrupts, then check if there was
+ * more to do. */
+ /* Depending on the VIRTIO_RING_F_USED_EVENT_IDX feature, we need to
+ * either clear the flags bit or point the event index at the next
+ * entry. Always update the event index to keep code simple. */
+ if (vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT) {
+ vq->avail_flags_shadow &= ~VRING_AVAIL_F_NO_INTERRUPT;
+ if (!vq->event)
+ vq->vring.avail->flags = cpu_to_virtio16(_vq->vdev, vq->avail_flags_shadow);
+ }
+ /* TODO: tune this threshold */
+ bufs = (u16)(vq->avail_idx_shadow - vq->last_used_idx) * 3 / 4;
+
+ virtio_store_mb(vq->weak_barriers,
+ &vring_used_event(&vq->vring),
+ cpu_to_virtio16(_vq->vdev, vq->last_used_idx + bufs));
+
+ if (unlikely((u16)(virtio16_to_cpu(_vq->vdev, vq->vring.used->idx) - vq->last_used_idx) > bufs)) {
+ END_USE(vq);
+ return false;
+ }
+
+ END_USE(vq);
+ return true;
+}
+
+static bool virtqueue_enable_cb_delayed_packed(struct virtqueue *_vq)
+{
+ struct vring_virtqueue *vq = to_vvq(_vq);
+ u16 bufs, off_wrap, used_idx, wrap_counter;
+
+ START_USE(vq);
+
+ /* We optimistically turn back on interrupts, then check if there was
+ * more to do. */
+ /* Depending on the VIRTIO_RING_F_USED_EVENT_IDX feature, we need to
+ * either clear the flags bit or point the event index at the next
+ * entry. Always update the event index to keep code simple. */
+ if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
+ vq->event_flags_shadow = vq->event ? VRING_EVENT_F_DESC:
+ VRING_EVENT_F_ENABLE;
+ vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
+ vq->event_flags_shadow);
+ }
+
+ /* TODO: tune this threshold */
+ bufs = (u16)(vq->next_avail_idx - vq->last_used_idx) * 3 / 4;
+
+ used_idx = vq->last_used_idx + bufs;
+ if (used_idx >= vq->vring_packed.num)
+ used_idx -= vq->vring_packed.num;
+
+ wrap_counter = vq->wrap_counter;
+ if (used_idx > vq->next_avail_idx)
+ wrap_counter ^= 1;
+
+ off_wrap = used_idx | (wrap_counter << 15);
+
+ virtio_store_mb(vq->weak_barriers, &vq->vring_packed.driver->off_wrap,
+ cpu_to_virtio16(_vq->vdev, off_wrap));
+
+ if (more_used_packed(vq)) {
+ END_USE(vq);
+ return false;
+ }
+
+ END_USE(vq);
+ return true;
+}
+
/**
* virtqueue_enable_cb_delayed - restart callbacks after disable_cb.
* @vq: the struct virtqueue we're talking about.
@@ -868,37 +1495,69 @@ EXPORT_SYMBOL_GPL(virtqueue_enable_cb);
bool virtqueue_enable_cb_delayed(struct virtqueue *_vq)
{
struct vring_virtqueue *vq = to_vvq(_vq);
- u16 bufs;
- START_USE(vq);
-
- /* We optimistically turn back on interrupts, then check if there was
- * more to do. */
- /* Depending on the VIRTIO_RING_F_USED_EVENT_IDX feature, we need to
- * either clear the flags bit or point the event index at the next
- * entry. Always update the event index to keep code simple. */
- if (vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT) {
- vq->avail_flags_shadow &= ~VRING_AVAIL_F_NO_INTERRUPT;
- if (!vq->event)
- vq->vring.avail->flags = cpu_to_virtio16(_vq->vdev, vq->avail_flags_shadow);
- }
- /* TODO: tune this threshold */
- bufs = (u16)(vq->avail_idx_shadow - vq->last_used_idx) * 3 / 4;
-
- virtio_store_mb(vq->weak_barriers,
- &vring_used_event(&vq->vring),
- cpu_to_virtio16(_vq->vdev, vq->last_used_idx + bufs));
-
- if (unlikely((u16)(virtio16_to_cpu(_vq->vdev, vq->vring.used->idx) - vq->last_used_idx) > bufs)) {
- END_USE(vq);
- return false;
- }
-
- END_USE(vq);
- return true;
+ return vq->packed ? virtqueue_enable_cb_delayed_packed(_vq) :
+ virtqueue_enable_cb_delayed_split(_vq);
}
EXPORT_SYMBOL_GPL(virtqueue_enable_cb_delayed);
+static void *virtqueue_detach_unused_buf_split(struct virtqueue *_vq)
+{
+ struct vring_virtqueue *vq = to_vvq(_vq);
+ unsigned int i;
+ void *buf;
+
+ START_USE(vq);
+
+ for (i = 0; i < vq->vring.num; i++) {
+ if (!vq->desc_state[i].data)
+ continue;
+ /* detach_buf clears data, so grab it now. */
+ buf = vq->desc_state[i].data;
+ detach_buf_split(vq, i, NULL);
+ vq->avail_idx_shadow--;
+ vq->vring.avail->idx = cpu_to_virtio16(_vq->vdev, vq->avail_idx_shadow);
+ END_USE(vq);
+ return buf;
+ }
+ /* That should have freed everything. */
+ BUG_ON(vq->vq.num_free != vq->vring.num);
+
+ END_USE(vq);
+ return NULL;
+}
+
+static void *virtqueue_detach_unused_buf_packed(struct virtqueue *_vq)
+{
+ struct vring_virtqueue *vq = to_vvq(_vq);
+ unsigned int i, num;
+ void *buf;
+
+ START_USE(vq);
+
+ for (i = 0; i < vq->vring_packed.num; i++) {
+ if (!vq->desc_state[i].data)
+ continue;
+ /* detach_buf clears data, so grab it now. */
+ buf = vq->desc_state[i].data;
+ num = detach_buf_packed(vq, i, NULL);
+ if (vq->next_avail_idx < num) {
+ vq->next_avail_idx = vq->vring_packed.num -
+ (num - vq->next_avail_idx);
+ vq->wrap_counter ^= 1;
+ } else {
+ vq->next_avail_idx -= num;
+ }
+ END_USE(vq);
+ return buf;
+ }
+ /* That should have freed everything. */
+ BUG_ON(vq->vq.num_free != vq->vring_packed.num);
+
+ END_USE(vq);
+ return NULL;
+}
+
/**
* virtqueue_detach_unused_buf - detach first unused buffer
* @vq: the struct virtqueue we're talking about.
@@ -910,27 +1569,9 @@ EXPORT_SYMBOL_GPL(virtqueue_enable_cb_delayed);
void *virtqueue_detach_unused_buf(struct virtqueue *_vq)
{
struct vring_virtqueue *vq = to_vvq(_vq);
- unsigned int i;
- void *buf;
- START_USE(vq);
-
- for (i = 0; i < vq->vring.num; i++) {
- if (!vq->desc_state[i].data)
- continue;
- /* detach_buf clears data, so grab it now. */
- buf = vq->desc_state[i].data;
- detach_buf(vq, i, NULL);
- vq->avail_idx_shadow--;
- vq->vring.avail->idx = cpu_to_virtio16(_vq->vdev, vq->avail_idx_shadow);
- END_USE(vq);
- return buf;
- }
- /* That should have freed everything. */
- BUG_ON(vq->vq.num_free != vq->vring.num);
-
- END_USE(vq);
- return NULL;
+ return vq->packed ? virtqueue_detach_unused_buf_packed(_vq) :
+ virtqueue_detach_unused_buf_split(_vq);
}
EXPORT_SYMBOL_GPL(virtqueue_detach_unused_buf);
@@ -955,7 +1596,8 @@ irqreturn_t vring_interrupt(int irq, void *_vq)
EXPORT_SYMBOL_GPL(vring_interrupt);
struct virtqueue *__vring_new_virtqueue(unsigned int index,
- struct vring vring,
+ union vring_union vring,
+ bool packed,
struct virtio_device *vdev,
bool weak_barriers,
bool context,
@@ -963,19 +1605,20 @@ struct virtqueue *__vring_new_virtqueue(unsigned int index,
void (*callback)(struct virtqueue *),
const char *name)
{
- unsigned int i;
+ unsigned int num, i;
struct vring_virtqueue *vq;
- vq = kmalloc(sizeof(*vq) + vring.num * sizeof(struct vring_desc_state),
+ num = packed ? vring.vring_packed.num : vring.vring_split.num;
+
+ vq = kmalloc(sizeof(*vq) + num * sizeof(struct vring_desc_state),
GFP_KERNEL);
if (!vq)
return NULL;
- vq->vring = vring;
vq->vq.callback = callback;
vq->vq.vdev = vdev;
vq->vq.name = name;
- vq->vq.num_free = vring.num;
+ vq->vq.num_free = num;
vq->vq.index = index;
vq->we_own_ring = false;
vq->queue_dma_addr = 0;
@@ -984,9 +1627,8 @@ struct virtqueue *__vring_new_virtqueue(unsigned int index,
vq->weak_barriers = weak_barriers;
vq->broken = false;
vq->last_used_idx = 0;
- vq->avail_flags_shadow = 0;
- vq->avail_idx_shadow = 0;
vq->num_added = 0;
+ vq->packed = packed;
list_add_tail(&vq->vq.list, &vdev->vqs);
#ifdef DEBUG
vq->in_use = false;
@@ -997,18 +1639,37 @@ struct virtqueue *__vring_new_virtqueue(unsigned int index,
!context;
vq->event = virtio_has_feature(vdev, VIRTIO_RING_F_EVENT_IDX);
+ if (vq->packed) {
+ vq->vring_packed = vring.vring_packed;
+ vq->next_avail_idx = 0;
+ vq->wrap_counter = 1;
+ vq->event_flags_shadow = 0;
+ } else {
+ vq->vring = vring.vring_split;
+ vq->avail_flags_shadow = 0;
+ vq->avail_idx_shadow = 0;
+
+ /* Put everything in free lists. */
+ vq->free_head = 0;
+ for (i = 0; i < num-1; i++)
+ vq->vring.desc[i].next = cpu_to_virtio16(vdev, i + 1);
+ }
+
/* No callback? Tell other side not to bother us. */
if (!callback) {
- vq->avail_flags_shadow |= VRING_AVAIL_F_NO_INTERRUPT;
- if (!vq->event)
- vq->vring.avail->flags = cpu_to_virtio16(vdev, vq->avail_flags_shadow);
+ if (packed) {
+ vq->event_flags_shadow = VRING_EVENT_F_DISABLE;
+ vq->vring_packed.driver->flags = cpu_to_virtio16(vdev,
+ vq->event_flags_shadow);
+ } else {
+ vq->avail_flags_shadow |= VRING_AVAIL_F_NO_INTERRUPT;
+ if (!vq->event)
+ vq->vring.avail->flags = cpu_to_virtio16(vdev,
+ vq->avail_flags_shadow);
+ }
}
- /* Put everything in free lists. */
- vq->free_head = 0;
- for (i = 0; i < vring.num-1; i++)
- vq->vring.desc[i].next = cpu_to_virtio16(vdev, i + 1);
- memset(vq->desc_state, 0, vring.num * sizeof(struct vring_desc_state));
+ memset(vq->desc_state, 0, num * sizeof(struct vring_desc_state));
return &vq->vq;
}
@@ -1056,6 +1717,22 @@ static void vring_free_queue(struct virtio_device *vdev, size_t size,
}
}
+static inline int
+__vring_size(unsigned int num, unsigned long align, bool packed)
+{
+ return packed ? vring_packed_size(num, align) : vring_size(num, align);
+}
+
+static inline void vring_packed_init(struct vring_packed *vr, unsigned int num,
+ void *p, unsigned long align)
+{
+ vr->num = num;
+ vr->desc = p;
+ vr->driver = (void *)(((uintptr_t)p + sizeof(struct vring_packed_desc)
+ * num + align - 1) & ~(align - 1));
+ vr->device = vr->driver + 1;
+}
+
struct virtqueue *vring_create_virtqueue(
unsigned int index,
unsigned int num,
@@ -1072,7 +1749,8 @@ struct virtqueue *vring_create_virtqueue(
void *queue = NULL;
dma_addr_t dma_addr;
size_t queue_size_in_bytes;
- struct vring vring;
+ union vring_union vring;
+ bool packed;
/* We assume num is a power of 2. */
if (num & (num - 1)) {
@@ -1080,9 +1758,13 @@ struct virtqueue *vring_create_virtqueue(
return NULL;
}
+ packed = virtio_has_feature(vdev, VIRTIO_F_RING_PACKED);
+
/* TODO: allocate each queue chunk individually */
- for (; num && vring_size(num, vring_align) > PAGE_SIZE; num /= 2) {
- queue = vring_alloc_queue(vdev, vring_size(num, vring_align),
+ for (; num && __vring_size(num, vring_align, packed) > PAGE_SIZE;
+ num /= 2) {
+ queue = vring_alloc_queue(vdev, __vring_size(num, vring_align,
+ packed),
&dma_addr,
GFP_KERNEL|__GFP_NOWARN|__GFP_ZERO);
if (queue)
@@ -1094,17 +1776,21 @@ struct virtqueue *vring_create_virtqueue(
if (!queue) {
/* Try to get a single page. You are my only hope! */
- queue = vring_alloc_queue(vdev, vring_size(num, vring_align),
+ queue = vring_alloc_queue(vdev, __vring_size(num, vring_align,
+ packed),
&dma_addr, GFP_KERNEL|__GFP_ZERO);
}
if (!queue)
return NULL;
- queue_size_in_bytes = vring_size(num, vring_align);
- vring_init(&vring, num, queue, vring_align);
+ queue_size_in_bytes = __vring_size(num, vring_align, packed);
+ if (packed)
+ vring_packed_init(&vring.vring_packed, num, queue, vring_align);
+ else
+ vring_init(&vring.vring_split, num, queue, vring_align);
- vq = __vring_new_virtqueue(index, vring, vdev, weak_barriers, context,
- notify, callback, name);
+ vq = __vring_new_virtqueue(index, vring, packed, vdev, weak_barriers,
+ context, notify, callback, name);
if (!vq) {
vring_free_queue(vdev, queue_size_in_bytes, queue,
dma_addr);
@@ -1130,10 +1816,17 @@ struct virtqueue *vring_new_virtqueue(unsigned int index,
void (*callback)(struct virtqueue *vq),
const char *name)
{
- struct vring vring;
- vring_init(&vring, num, pages, vring_align);
- return __vring_new_virtqueue(index, vring, vdev, weak_barriers, context,
- notify, callback, name);
+ union vring_union vring;
+ bool packed;
+
+ packed = virtio_has_feature(vdev, VIRTIO_F_RING_PACKED);
+ if (packed)
+ vring_packed_init(&vring.vring_packed, num, pages, vring_align);
+ else
+ vring_init(&vring.vring_split, num, pages, vring_align);
+
+ return __vring_new_virtqueue(index, vring, packed, vdev, weak_barriers,
+ context, notify, callback, name);
}
EXPORT_SYMBOL_GPL(vring_new_virtqueue);
@@ -1143,7 +1836,9 @@ void vring_del_virtqueue(struct virtqueue *_vq)
if (vq->we_own_ring) {
vring_free_queue(vq->vq.vdev, vq->queue_size_in_bytes,
- vq->vring.desc, vq->queue_dma_addr);
+ vq->packed ? (void *)vq->vring_packed.desc :
+ (void *)vq->vring.desc,
+ vq->queue_dma_addr);
}
list_del(&_vq->list);
kfree(vq);
@@ -1157,14 +1852,18 @@ void vring_transport_features(struct virtio_device *vdev)
for (i = VIRTIO_TRANSPORT_F_START; i < VIRTIO_TRANSPORT_F_END; i++) {
switch (i) {
- case VIRTIO_RING_F_INDIRECT_DESC:
+#if 0
+ case VIRTIO_RING_F_INDIRECT_DESC: // FIXME not tested yet.
break;
- case VIRTIO_RING_F_EVENT_IDX:
+ case VIRTIO_RING_F_EVENT_IDX: // FIXME probably not work.
break;
+#endif
case VIRTIO_F_VERSION_1:
break;
case VIRTIO_F_IOMMU_PLATFORM:
break;
+ case VIRTIO_F_RING_PACKED:
+ break;
default:
/* We don't understand this bit. */
__virtio_clear_bit(vdev, i);
@@ -1185,7 +1884,7 @@ unsigned int virtqueue_get_vring_size(struct virtqueue *_vq)
struct vring_virtqueue *vq = to_vvq(_vq);
- return vq->vring.num;
+ return vq->packed ? vq->vring_packed.num : vq->vring.num;
}
EXPORT_SYMBOL_GPL(virtqueue_get_vring_size);
@@ -1228,6 +1927,10 @@ dma_addr_t virtqueue_get_avail_addr(struct virtqueue *_vq)
BUG_ON(!vq->we_own_ring);
+ if (vq->packed)
+ return vq->queue_dma_addr + ((char *)vq->vring_packed.driver -
+ (char *)vq->vring_packed.desc);
+
return vq->queue_dma_addr +
((char *)vq->vring.avail - (char *)vq->vring.desc);
}
@@ -1239,11 +1942,16 @@ dma_addr_t virtqueue_get_used_addr(struct virtqueue *_vq)
BUG_ON(!vq->we_own_ring);
+ if (vq->packed)
+ return vq->queue_dma_addr + ((char *)vq->vring_packed.device -
+ (char *)vq->vring_packed.desc);
+
return vq->queue_dma_addr +
((char *)vq->vring.used - (char *)vq->vring.desc);
}
EXPORT_SYMBOL_GPL(virtqueue_get_used_addr);
+/* Only available for split ring */
const struct vring *virtqueue_get_vring(struct virtqueue *vq)
{
return &to_vvq(vq)->vring;
diff --git a/include/linux/virtio_ring.h b/include/linux/virtio_ring.h
index bbf32524ab27..a0075894ad16 100644
--- a/include/linux/virtio_ring.h
+++ b/include/linux/virtio_ring.h
@@ -60,6 +60,11 @@ static inline void virtio_store_mb(bool weak_barriers,
struct virtio_device;
struct virtqueue;
+union vring_union {
+ struct vring vring_split;
+ struct vring_packed vring_packed;
+};
+
/*
* Creates a virtqueue and allocates the descriptor ring. If
* may_reduce_num is set, then this may allocate a smaller ring than
@@ -79,7 +84,8 @@ struct virtqueue *vring_create_virtqueue(unsigned int index,
/* Creates a virtqueue with a custom layout. */
struct virtqueue *__vring_new_virtqueue(unsigned int index,
- struct vring vring,
+ union vring_union vring,
+ bool packed,
struct virtio_device *vdev,
bool weak_barriers,
bool ctx,
diff --git a/include/uapi/linux/virtio_config.h b/include/uapi/linux/virtio_config.h
index 308e2096291f..a6e392325e3a 100644
--- a/include/uapi/linux/virtio_config.h
+++ b/include/uapi/linux/virtio_config.h
@@ -49,7 +49,7 @@
* transport being used (eg. virtio_ring), the rest are per-device feature
* bits. */
#define VIRTIO_TRANSPORT_F_START 28
-#define VIRTIO_TRANSPORT_F_END 34
+#define VIRTIO_TRANSPORT_F_END 36
#ifndef VIRTIO_CONFIG_NO_LEGACY
/* Do we get callbacks when the ring is completely used, even if we've
@@ -71,4 +71,14 @@
* this is for compatibility with legacy systems.
*/
#define VIRTIO_F_IOMMU_PLATFORM 33
+
+/* This feature indicates support for the packed virtqueue layout. */
+#define VIRTIO_F_RING_PACKED 34
+
+/*
+ * This feature indicates that all buffers are used by the device
+ * in the same order in which they have been made available.
+ */
+#define VIRTIO_F_IN_ORDER 35
+
#endif /* _UAPI_LINUX_VIRTIO_CONFIG_H */
diff --git a/include/uapi/linux/virtio_ring.h b/include/uapi/linux/virtio_ring.h
index 6d5d5faa989b..735d4207c988 100644
--- a/include/uapi/linux/virtio_ring.h
+++ b/include/uapi/linux/virtio_ring.h
@@ -44,6 +44,9 @@
/* This means the buffer contains a list of buffer descriptors. */
#define VRING_DESC_F_INDIRECT 4
+#define VRING_DESC_F_AVAIL(b) ((b) << 7)
+#define VRING_DESC_F_USED(b) ((b) << 15)
+
/* The Host uses this in used->flags to advise the Guest: don't kick me when
* you add a buffer. It's unreliable, so it's simply an optimization. Guest
* will still kick if it's out of buffers. */
@@ -53,6 +56,10 @@
* optimization. */
#define VRING_AVAIL_F_NO_INTERRUPT 1
+#define VRING_EVENT_F_ENABLE 0x0
+#define VRING_EVENT_F_DISABLE 0x1
+#define VRING_EVENT_F_DESC 0x2
+
/* We support indirect buffer descriptors */
#define VIRTIO_RING_F_INDIRECT_DESC 28
@@ -171,4 +178,58 @@ static inline int vring_need_event(__u16 event_idx, __u16 new_idx, __u16 old)
return (__u16)(new_idx - event_idx - 1) < (__u16)(new_idx - old);
}
+struct vring_packed_desc_event {
+ /* __virtio16 off : 15; // Descriptor Event Offset
+ * __virtio16 wrap : 1; // Descriptor Event Wrap Counter */
+ __virtio16 off_wrap;
+ /* __virtio16 flags : 2; // Descriptor Event Flags */
+ __virtio16 flags;
+};
+
+struct vring_packed_desc {
+ /* Buffer Address. */
+ __virtio64 addr;
+ /* Buffer Length. */
+ __virtio32 len;
+ /* Buffer ID. */
+ __virtio16 id;
+ /* The flags depending on descriptor type. */
+ __virtio16 flags;
+};
+
+struct vring_packed {
+ unsigned int num;
+
+ struct vring_packed_desc *desc;
+
+ struct vring_packed_desc_event *driver;
+
+ struct vring_packed_desc_event *device;
+};
+
+/* The standard layout for the packed ring is a continuous chunk of memory
+ * which looks like this.
+ *
+ * struct vring_packed
+ * {
+ * // The actual descriptors (16 bytes each)
+ * struct vring_packed_desc desc[num];
+ *
+ * // Padding to the next align boundary.
+ * char pad[];
+ *
+ * // Driver Event Suppression
+ * struct vring_packed_desc_event driver;
+ *
+ * // Device Event Suppression
+ * struct vring_packed_desc_event device;
+ * };
+ */
+
+static inline unsigned vring_packed_size(unsigned int num, unsigned long align)
+{
+ return ((sizeof(struct vring_packed_desc) * num + align - 1)
+ & ~(align - 1)) + sizeof(struct vring_packed_desc_event) * 2;
+}
+
#endif /* _UAPI_LINUX_VIRTIO_RING_H */
--
2.11.0
^ permalink raw reply related
* Re: [PATCH 00/24] drm_framebuffer boilerplate removal
From: Alex Deucher @ 2018-03-30 14:47 UTC (permalink / raw)
To: Daniel Stone
Cc: Heiko Stübner, Joonas Lahtinen, dri-devel,
open list:VIRTIO CORE, NET..., Thierry Reding,
amd-gfx mailing list, David (ChunMing) Zhou, Joonyoung Shim,
Russell King, Patrik Jakobsson, Tomi Valkeinen, CK Hu,
Dave Airlie, Rob Clark, David Lechner, linux-arm-msm, intel-gfx,
Jani Nikula, Inki Dae, Rodrigo Vivi
In-Reply-To: <CAPj87rPKDwBK7qoOoY2zXoS6T02toATPRsk=39m_OBr=s1kXTw@mail.gmail.com>
On Fri, Mar 30, 2018 at 10:11 AM, Daniel Stone <daniels@collabora.com> wrote:
> Hi,
> I've been working on a getfb2[0] ioctl, which amongst other things
> supports multi-planar framebuffers as well as modifiers. getfb
> currently calls the framebuffer's handle_create hook, which doesn't
> support multiple planes.
>
> Thanks to Noralf's recent work, drivers can just store GEM objects
> directly in drm_framebuffer. I use this directly in getfb2: we need
> direct access to the GEM objects and not a vfunc in order to not hand
> out duplicate GEM names for the same object.
>
> This series converts all drivers except for nouveau, which was a
> little too non-trivial for my comfort, to storing GEM objects directly
> in drm_framebuffer. For those drivers whose driver_framebuffer struct
> was nothing but drm_framebuffer + BO, it deletes the driver-specific
> struct. It also makes use of Noralf's generic framebuffer helpers for
> create_handle and destroy where possible.
>
> I don't have the hardware for most of these drivers, so have had to
> settle for just staring really hard at the diff.
>
> I intend to remove create_handle when all drivers are converted over
> to placing BOs directly inside drm_framebuffer. For most drivers
> there's a relatively easy conversion to using the helpers for
> basically all framebuffer handling and fbdev emulation as well, though
> that's a bit further than I was willing to go without hardware to test
> on ...
Series is:
Acked-by: Alex Deucher <alexander.deucher@amd.com>
>
> Cheers,
> Daniel
>
> [0]: https://lists.freedesktop.org/archives/dri-devel/2018-March/170512.html
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
^ permalink raw reply
* Re: [RFC PATCH V2 8/8] vhost: event suppression for packed ring
From: Jason Wang @ 2018-03-30 2:46 UTC (permalink / raw)
To: Tiwei Bie; +Cc: netdev, virtualization, linux-kernel, kvm, mst
In-Reply-To: <20180330020518.tja3lxkfz7mdwasj@debian>
On 2018年03月30日 10:05, Tiwei Bie wrote:
> On Mon, Mar 26, 2018 at 11:38:53AM +0800, Jason Wang wrote:
>> This patch introduces basic support for event suppression aka driver
>> and device area. Compile tested only.
>>
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>> ---
> [...]
>> +
>> +static bool vhost_notify_packed(struct vhost_dev *dev,
>> + struct vhost_virtqueue *vq)
>> +{
>> + __virtio16 event_off_wrap, event_flags;
>> + __u16 old, new;
>> + bool v, wrap;
>> + int off;
>> +
>> + /* Flush out used descriptors updates. This is paired
>> + * with the barrier that the Guest executes when enabling
>> + * interrupts.
>> + */
>> + smp_mb();
>> +
>> + if (vhost_get_avail(vq, event_flags,
>> + &vq->driver_event->desc_event_flags) < 0) {
>> + vq_err(vq, "Failed to get driver desc_event_flags");
>> + return true;
>> + }
>> +
>> + if (!(event_flags & cpu_to_vhost16(vq, RING_EVENT_FLAGS_DESC)))
>> + return event_flags ==
>> + cpu_to_vhost16(vq, RING_EVENT_FLAGS_ENABLE);
> Maybe it would be better to not use '&' here. Because these flags
> are not defined as bits which can be ORed or ANDed. Instead, they
> are defined as values:
>
> 0x0 enable
> 0x1 disable
> 0x2 desc
> 0x3 reserved
Yes the code seems tricky. Let me fix it in next version.
>
>> +
>> + /* Read desc event flags before event_off and event_wrap */
>> + smp_rmb();
>> +
>> + if (vhost_get_avail(vq, event_off_wrap,
>> + &vq->driver_event->desc_event_off_warp) < 0) {
>> + vq_err(vq, "Failed to get driver desc_event_off/wrap");
>> + return true;
>> + }
>> +
>> + off = vhost16_to_cpu(vq, event_off_wrap);
>> +
>> + wrap = off & 0x1;
>> + off >>= 1;
> Based on the below definitions in spec, wrap counter is
> the most significant bit.
>
> struct pvirtq_event_suppress {
> le16 {
> desc_event_off : 15; /* Descriptor Ring Change Event Offset */
> desc_event_wrap : 1; /* Descriptor Ring Change Event Wrap Counter */
> } desc; /* If desc_event_flags set to RING_EVENT_FLAGS_DESC */
> le16 {
> desc_event_flags : 2, /* Descriptor Ring Change Event Flags */
> reserved : 14; /* Reserved, set to 0 */
> } flags;
> };
Will fix this in next version.
>
>> +
>> +
>> + old = vq->signalled_used;
>> + v = vq->signalled_used_valid;
>> + new = vq->signalled_used = vq->last_used_idx;
>> + vq->signalled_used_valid = true;
>> +
>> + if (unlikely(!v))
>> + return true;
>> +
>> + return vhost_vring_packed_need_event(vq, new, old, off) &&
>> + wrap == vq->used_wrap_counter;
>> +}
>> +
>> +static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
>> +{
>> + if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
>> + return vhost_notify_packed(dev, vq);
>> + else
>> + return vhost_notify_split(dev, vq);
>> +}
>> +
>> /* This actually signals the guest, using eventfd. */
>> void vhost_signal(struct vhost_dev *dev, struct vhost_virtqueue *vq)
>> {
>> @@ -2789,7 +2911,17 @@ static bool vhost_enable_notify_packed(struct vhost_dev *dev,
>> __virtio16 flags;
>> int ret;
>>
>> - /* FIXME: disable notification through device area */
>> + if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY))
>> + return false;
>> + vq->used_flags &= ~VRING_USED_F_NO_NOTIFY;
>> +
>> + flags = cpu_to_vhost16(vq, RING_EVENT_FLAGS_ENABLE);
>> + ret = vhost_update_device_flags(vq, flags);
>> + if (ret) {
>> + vq_err(vq, "Failed to enable notification at %p: %d\n",
>> + &vq->device_event->desc_event_flags, ret);
>> + return false;
>> + }
>>
>> /* They could have slipped one in as we were doing that: make
>> * sure it's written, then check again. */
>> @@ -2855,7 +2987,18 @@ EXPORT_SYMBOL_GPL(vhost_enable_notify);
>> static void vhost_disable_notify_packed(struct vhost_dev *dev,
>> struct vhost_virtqueue *vq)
>> {
>> - /* FIXME: disable notification through device area */
>> + __virtio16 flags;
>> + int r;
>> +
>> + if (vq->used_flags & VRING_USED_F_NO_NOTIFY)
>> + return;
>> + vq->used_flags |= VRING_USED_F_NO_NOTIFY;
>> +
>> + flags = cpu_to_vhost16(vq, RING_EVENT_FLAGS_DISABLE);
>> + r = vhost_update_device_flags(vq, flags);
>> + if (r)
>> + vq_err(vq, "Failed to enable notification at %p: %d\n",
>> + &vq->device_event->desc_event_flags, r);
>> }
>>
>> static void vhost_disable_notify_split(struct vhost_dev *dev,
>> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
>> index 8a9df4f..02d7a36 100644
>> --- a/drivers/vhost/vhost.h
>> +++ b/drivers/vhost/vhost.h
>> @@ -96,8 +96,14 @@ struct vhost_virtqueue {
>> struct vring_desc __user *desc;
>> struct vring_desc_packed __user *desc_packed;
> Do you think it'd be better to name the desc type as
> struct vring_packed_desc?
Ok. Will do.
> And it will be consistent
> with other names, like:
>
> struct vring_packed;
> struct vring_packed_desc_event;
>
>> };
>> - struct vring_avail __user *avail;
>> - struct vring_used __user *used;
>> + union {
>> + struct vring_avail __user *avail;
>> + struct vring_packed_desc_event __user *driver_event;
>> + };
>> + union {
>> + struct vring_used __user *used;
>> + struct vring_packed_desc_event __user *device_event;
>> + };
>> const struct vhost_umem_node *meta_iotlb[VHOST_NUM_ADDRS];
>> struct file *kick;
>> struct eventfd_ctx *call_ctx;
>> diff --git a/include/uapi/linux/virtio_ring.h b/include/uapi/linux/virtio_ring.h
>> index e297580..7cdbf06 100644
>> --- a/include/uapi/linux/virtio_ring.h
>> +++ b/include/uapi/linux/virtio_ring.h
>> @@ -75,6 +75,25 @@ struct vring_desc_packed {
>> __virtio16 flags;
>> };
>>
>> +/* Enable events */
>> +#define RING_EVENT_FLAGS_ENABLE 0x0
>> +/* Disable events */
>> +#define RING_EVENT_FLAGS_DISABLE 0x1
>> +/*
>> + * Enable events for a specific descriptor
>> + * (as specified by Descriptor Ring Change Event Offset/Wrap Counter).
>> + * Only valid if VIRTIO_F_RING_EVENT_IDX has been negotiated.
>> + */
>> +#define RING_EVENT_FLAGS_DESC 0x2
>> +/* The value 0x3 is reserved */
>> +
>> +struct vring_packed_desc_event {
>> + /* Descriptor Ring Change Event Offset and Wrap Counter */
>> + __virtio16 desc_event_off_warp;
>> + /* Descriptor Ring Change Event Flags */
>> + __virtio16 desc_event_flags;
> Do you think it'd be better to remove the prefix (desc_event_) for
> the fields. And it will be consistent with other definitions, e.g.:
>
> struct vring_packed_desc {
> /* Buffer Address. */
> __virtio64 addr;
> /* Buffer Length. */
> __virtio32 len;
> /* Buffer ID. */
> __virtio16 id;
> /* The flags depending on descriptor type. */
> __virtio16 flags;
> };
Yes. Let me do it in next version.
Thanks for the review!
>> +};
>> +
>> /* Virtio ring descriptors: 16 bytes. These can chain together via "next". */
>> struct vring_desc {
>> /* Address (guest-physical). */
>> --
>> 2.7.4
>>
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: virtio over SW-defined/CPU-driven PCIe endpoint
From: Jason Wang @ 2018-03-30 2:44 UTC (permalink / raw)
To: Stephen Warren, Michael S. Tsirkin; +Cc: virtualization
In-Reply-To: <7cbb8e1c-a34b-2cef-9de3-f034f6afcdfd@wwwdotorg.org>
On 2018年03月30日 05:22, Stephen Warren wrote:
> I've been investigating how to implement a virtio device (as opposed
> to a virtio driver) on a regular computer system with a PCIe
> controller that can operate in endpoint mode, as opposed to an
> endpoint that's implemented by a hypervisor that can preempt execution
> of a VM, or an endpoint that's implemented purely in hardware by logic
> gates. In my case (and I assume likely in most CPU-driven PCIe
> endpoint cases), the endpoint controller has the following capabilities:
>
> - Host-initiated accesses to the endpoint's BARs can read/write normal
> memory, but not hardware registers within the endpoint system.
>
> - Accesses to memory exposed by BARs can't be synchronously handled by
> the endpoint's local CPU. The local CPU can't be notified when the
> host writes memory in order to synchronously update other memory
> locations. The local CPU can't synchronously generate the result of a
> host read transaction, but rather the data must be present in memory
> ahead of time.
>
> - Accesses to a small region of address space can be used to generate
> interrupts to the endpoint's local CPU. This region can be exposed
> through a PCI BAR (or perhaps as part of a BAR; not sure on details
> yet). This region of memory has a fixed format and is separate from
> true RAM, and so can't be used to hold PCI-virtio's
> discovery/capability data.
>
> - The endpoint can emit PCI interrupts (e.g. MSI) to the attached host.
>
> The model described in the virtio spec's "Virtio PCI Bus" section
> doesn't seem to work in this case, since it assumes:
>
> - Writes to some fields in the PCI configuration space (e.g.
> {queue,device_feature,driver_feature}_select) synchronously update
> other fields (e.g. queue_size), which can immediately be accessed by
> the host. This isn't possible when the memory content is created by a
> CPU that isn't a synchronous part of PCI accesses.
>
> - Writing to some fields in the PCI configuration space (e.g.
> device_status) are supposed to trigger a response by the device,
> without the need to explicitly notify the device of the memory write
> by some other means. This isn't possible when the endpoint's local CPU
> has no mechanism to be notified of such writes.
>
> - The device_status field is asynchronously read-write by both the
> device and driver, yet the spec requires that the driver must always
> read-modify-write this field, and additionally that the driver must
> never clear any device status bit. These requirements seem impossible
> to satisfy in any case at all, let alone the current case.
>
> I can see some possible solutions here:
>
> 1) Just implement virtqueues but not all of the standardized PCI
> discovery protocol. virtqueues don't have the problems described above
> and should work fine between systems where there are asynchronous CPUs
> on both ends. virtqueus solely rely on normal memory access without
> side-effects and explicit notification. This would require
> implementing some custom/device-specific discovery protocol. I believe
> that remoteproc/rpmsg take this approach.
Yes, kernel has already supported virtio over remoteproc.
Maybe the first step is to document it in the spec.
Thanks
>
> 2) Define a new standardized virtio PCI discovery protocol that is
> better suited to the device being an asynchronous CPU. For example,
> eliminate the need for the device to somehow notice memory accesses
> and rely on explicitl notification instead. Separate device-written
> and driver-written data into different cache-lines or pages.
>
> 3) Use something other than virtio/virtqueues instead.
>
> As an aside, I noticed that the memory allocation for virtqueues is
> very lopsided; the driver always allocates the storage. This means
> that the device must perform PCI reads to transfer data from the
> driver to the device. PCI reads are typically slower than PCI writes
> since reads require a round-trip transfer, whereas writes can be
> posted. I wonder if any thought has been put into having the device
> optionally allocate virtuqueue buffers so that the protocol can rely
> primarily on PCI writes? Perhaps there's some alternative protocol
> that's more optimized for true PCI-based communication rather than
> paravirtualized PCI-based communication?
>
> Thanks for any thoughts on the best approach, or pointers to
> pre-existing work in this area.
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH net] vhost: validate log when IOTLB is enabled
From: Jason Wang @ 2018-03-30 2:27 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20180329173438-mutt-send-email-mst@kernel.org>
On 2018年03月29日 22:44, Michael S. Tsirkin wrote:
> On Thu, Mar 29, 2018 at 04:00:04PM +0800, Jason Wang wrote:
>> Vq log_base is the userspace address of bitmap which has nothing to do
>> with IOTLB. So it needs to be validated unconditionally otherwise we
>> may try use 0 as log_base which may lead to pin pages that will lead
>> unexpected result (e.g trigger BUG_ON() in set_bit_to_user()).
>>
>> Fixes: 6b1e6cc7855b0 ("vhost: new device IOTLB API")
>> Reported-by:syzbot+6304bf97ef436580fede@syzkaller.appspotmail.com
>> Signed-off-by: Jason Wang<jasowang@redhat.com>
> One follow-up question:
>
> We still observe that get user pages returns 0 sometimes. While I agree
> we should not pass in unvalidated addresses, isn't this worth
> documenting?
>
>
Looking at get_user_pages_fast(), it has:
if (unlikely(!access_ok(write ? VERIFY_WRITE : VERIFY_READ,
(void __user *)start, len)))
return 0;
So this is expected I think.
Thanks
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [RFC PATCH V2 8/8] vhost: event suppression for packed ring
From: Tiwei Bie @ 2018-03-30 2:05 UTC (permalink / raw)
To: Jason Wang; +Cc: netdev, virtualization, linux-kernel, kvm, mst
In-Reply-To: <1522035533-11786-9-git-send-email-jasowang@redhat.com>
On Mon, Mar 26, 2018 at 11:38:53AM +0800, Jason Wang wrote:
> This patch introduces basic support for event suppression aka driver
> and device area. Compile tested only.
>
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
[...]
> +
> +static bool vhost_notify_packed(struct vhost_dev *dev,
> + struct vhost_virtqueue *vq)
> +{
> + __virtio16 event_off_wrap, event_flags;
> + __u16 old, new;
> + bool v, wrap;
> + int off;
> +
> + /* Flush out used descriptors updates. This is paired
> + * with the barrier that the Guest executes when enabling
> + * interrupts.
> + */
> + smp_mb();
> +
> + if (vhost_get_avail(vq, event_flags,
> + &vq->driver_event->desc_event_flags) < 0) {
> + vq_err(vq, "Failed to get driver desc_event_flags");
> + return true;
> + }
> +
> + if (!(event_flags & cpu_to_vhost16(vq, RING_EVENT_FLAGS_DESC)))
> + return event_flags ==
> + cpu_to_vhost16(vq, RING_EVENT_FLAGS_ENABLE);
Maybe it would be better to not use '&' here. Because these flags
are not defined as bits which can be ORed or ANDed. Instead, they
are defined as values:
0x0 enable
0x1 disable
0x2 desc
0x3 reserved
> +
> + /* Read desc event flags before event_off and event_wrap */
> + smp_rmb();
> +
> + if (vhost_get_avail(vq, event_off_wrap,
> + &vq->driver_event->desc_event_off_warp) < 0) {
> + vq_err(vq, "Failed to get driver desc_event_off/wrap");
> + return true;
> + }
> +
> + off = vhost16_to_cpu(vq, event_off_wrap);
> +
> + wrap = off & 0x1;
> + off >>= 1;
Based on the below definitions in spec, wrap counter is
the most significant bit.
struct pvirtq_event_suppress {
le16 {
desc_event_off : 15; /* Descriptor Ring Change Event Offset */
desc_event_wrap : 1; /* Descriptor Ring Change Event Wrap Counter */
} desc; /* If desc_event_flags set to RING_EVENT_FLAGS_DESC */
le16 {
desc_event_flags : 2, /* Descriptor Ring Change Event Flags */
reserved : 14; /* Reserved, set to 0 */
} flags;
};
> +
> +
> + old = vq->signalled_used;
> + v = vq->signalled_used_valid;
> + new = vq->signalled_used = vq->last_used_idx;
> + vq->signalled_used_valid = true;
> +
> + if (unlikely(!v))
> + return true;
> +
> + return vhost_vring_packed_need_event(vq, new, old, off) &&
> + wrap == vq->used_wrap_counter;
> +}
> +
> +static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
> +{
> + if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
> + return vhost_notify_packed(dev, vq);
> + else
> + return vhost_notify_split(dev, vq);
> +}
> +
> /* This actually signals the guest, using eventfd. */
> void vhost_signal(struct vhost_dev *dev, struct vhost_virtqueue *vq)
> {
> @@ -2789,7 +2911,17 @@ static bool vhost_enable_notify_packed(struct vhost_dev *dev,
> __virtio16 flags;
> int ret;
>
> - /* FIXME: disable notification through device area */
> + if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY))
> + return false;
> + vq->used_flags &= ~VRING_USED_F_NO_NOTIFY;
> +
> + flags = cpu_to_vhost16(vq, RING_EVENT_FLAGS_ENABLE);
> + ret = vhost_update_device_flags(vq, flags);
> + if (ret) {
> + vq_err(vq, "Failed to enable notification at %p: %d\n",
> + &vq->device_event->desc_event_flags, ret);
> + return false;
> + }
>
> /* They could have slipped one in as we were doing that: make
> * sure it's written, then check again. */
> @@ -2855,7 +2987,18 @@ EXPORT_SYMBOL_GPL(vhost_enable_notify);
> static void vhost_disable_notify_packed(struct vhost_dev *dev,
> struct vhost_virtqueue *vq)
> {
> - /* FIXME: disable notification through device area */
> + __virtio16 flags;
> + int r;
> +
> + if (vq->used_flags & VRING_USED_F_NO_NOTIFY)
> + return;
> + vq->used_flags |= VRING_USED_F_NO_NOTIFY;
> +
> + flags = cpu_to_vhost16(vq, RING_EVENT_FLAGS_DISABLE);
> + r = vhost_update_device_flags(vq, flags);
> + if (r)
> + vq_err(vq, "Failed to enable notification at %p: %d\n",
> + &vq->device_event->desc_event_flags, r);
> }
>
> static void vhost_disable_notify_split(struct vhost_dev *dev,
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 8a9df4f..02d7a36 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -96,8 +96,14 @@ struct vhost_virtqueue {
> struct vring_desc __user *desc;
> struct vring_desc_packed __user *desc_packed;
Do you think it'd be better to name the desc type as
struct vring_packed_desc? And it will be consistent
with other names, like:
struct vring_packed;
struct vring_packed_desc_event;
> };
> - struct vring_avail __user *avail;
> - struct vring_used __user *used;
> + union {
> + struct vring_avail __user *avail;
> + struct vring_packed_desc_event __user *driver_event;
> + };
> + union {
> + struct vring_used __user *used;
> + struct vring_packed_desc_event __user *device_event;
> + };
> const struct vhost_umem_node *meta_iotlb[VHOST_NUM_ADDRS];
> struct file *kick;
> struct eventfd_ctx *call_ctx;
> diff --git a/include/uapi/linux/virtio_ring.h b/include/uapi/linux/virtio_ring.h
> index e297580..7cdbf06 100644
> --- a/include/uapi/linux/virtio_ring.h
> +++ b/include/uapi/linux/virtio_ring.h
> @@ -75,6 +75,25 @@ struct vring_desc_packed {
> __virtio16 flags;
> };
>
> +/* Enable events */
> +#define RING_EVENT_FLAGS_ENABLE 0x0
> +/* Disable events */
> +#define RING_EVENT_FLAGS_DISABLE 0x1
> +/*
> + * Enable events for a specific descriptor
> + * (as specified by Descriptor Ring Change Event Offset/Wrap Counter).
> + * Only valid if VIRTIO_F_RING_EVENT_IDX has been negotiated.
> + */
> +#define RING_EVENT_FLAGS_DESC 0x2
> +/* The value 0x3 is reserved */
> +
> +struct vring_packed_desc_event {
> + /* Descriptor Ring Change Event Offset and Wrap Counter */
> + __virtio16 desc_event_off_warp;
> + /* Descriptor Ring Change Event Flags */
> + __virtio16 desc_event_flags;
Do you think it'd be better to remove the prefix (desc_event_) for
the fields. And it will be consistent with other definitions, e.g.:
struct vring_packed_desc {
/* Buffer Address. */
__virtio64 addr;
/* Buffer Length. */
__virtio32 len;
/* Buffer ID. */
__virtio16 id;
/* The flags depending on descriptor type. */
__virtio16 flags;
};
> +};
> +
> /* Virtio ring descriptors: 16 bytes. These can chain together via "next". */
> struct vring_desc {
> /* Address (guest-physical). */
> --
> 2.7.4
>
^ permalink raw reply
* Re: [PATCH v3] virtio_blk: add DISCARD and WRIET ZEROES command support
From: Michael S. Tsirkin @ 2018-03-30 1:07 UTC (permalink / raw)
To: Changpeng Liu
Cc: james.r.harris, pawelx.wodkowski, fabio.miranda.martins,
virtualization, stefanha, pbonzini, cavery
In-Reply-To: <1522370974-28259-1-git-send-email-changpeng.liu@intel.com>
On Fri, Mar 30, 2018 at 08:49:34AM +0800, Changpeng Liu wrote:
> Existing virtio-blk protocol doesn't have DISCARD/WRITE ZEROES
> command support, this will impact the performance when using SSD
> backend over file systems.
>
> The idea here is using 16 Bytes payload as one descriptor for
> DISCARD/WRITE ZEROES command, users can put several ranges into
> one command, for the purpose to support such feature, two feature
> flags VIRTIO_BLK_F_DISCARD/VIRTIO_BLK_F_WRITE_ZEROES and two
> commands VIRTIO_BLK_T_DISCARD/VIRTIO_BLK_T_WRITE_ZEROES are
> introduced, and some parameters are added to the configuration
> space to tell the OS the granularity of DISCARD/WRITE ZEROES
> commands.
Pls fix grammar in this comment, I am not sure what are you
trying to say.
>
> The specification change list here:
> https://github.com/oasis-tcs/virtio-spec
Which commit?
> CHANGELOG:
> v3: finalized the specification change.
Changelog belongs after -- and should be complete
including changes v1 to v2.
> Signed-off-by: Changpeng Liu <changpeng.liu@intel.com>
> ---
> drivers/block/virtio_blk.c | 96 +++++++++++++++++++++++++++++++++++++++--
> include/uapi/linux/virtio_blk.h | 39 +++++++++++++++++
> 2 files changed, 132 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index 4a07593c..1943adb 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -172,10 +172,53 @@ static int virtblk_add_req(struct virtqueue *vq, struct virtblk_req *vbr,
> return virtqueue_add_sgs(vq, sgs, num_out, num_in, vbr, GFP_ATOMIC);
> }
>
> +static inline int virtblk_setup_discard_write_zeroes(struct request *req, bool unmap)
> +{
> + unsigned short segments = blk_rq_nr_discard_segments(req), n = 0;
Split on two lines pls:
unsigned short n = 0;
> + u32 block_size = queue_logical_block_size(req->q);
Seems to be unused except for the sanity check below. Why?
> + struct virtio_blk_discard_write_zeroes *range;
> + struct bio *bio;
> +
> + if (block_size < 512 || !block_size)
Why 512? And when is it 0?
> + return -1;
-1 isn't a normal errno code.
> +
> + range = kmalloc_array(segments, sizeof(*range), GFP_ATOMIC);
This might be pretty large: with 64K segments it looks like you are
trying to allocate ~1Mbyte with GFP_ATOMIC which is unlikely to succeed.
Can we split this up in chunks?
> + if (!range)
> + return -1;
> +
> + __rq_for_each_bio(bio, req) {
> + u64 sector = bio->bi_iter.bi_sector;
> + u32 num_sectors = bio->bi_iter.bi_size >> 9;
why 9?
> +
> + range[n].flags.unmap = cpu_to_le32(unmap);
> + range[n].flags.reserved = cpu_to_le32(0);
> + range[n].num_sectors = cpu_to_le32(num_sectors);
> + range[n].sector = cpu_to_le64(sector);
Isn't this causing sparse warnings?
> + n++;
> + }
> +
> + if (WARN_ON_ONCE(n != segments)) {
and when does this happen?
> + kfree(range);
> + return -1;
> + }
> +
> + req->special_vec.bv_page = virt_to_page(range);
> + req->special_vec.bv_offset = offset_in_page(range);
> + req->special_vec.bv_len = sizeof(*range) * segments;
> + req->rq_flags |= RQF_SPECIAL_PAYLOAD;
> +
> + return 0;
> +}
> +
> static inline void virtblk_request_done(struct request *req)
> {
> struct virtblk_req *vbr = blk_mq_rq_to_pdu(req);
>
> + if (req->rq_flags & RQF_SPECIAL_PAYLOAD) {
> + kfree(page_address(req->special_vec.bv_page) +
> + req->special_vec.bv_offset);
> + }
> +
> switch (req_op(req)) {
> case REQ_OP_SCSI_IN:
> case REQ_OP_SCSI_OUT:
> @@ -225,6 +268,7 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
> int qid = hctx->queue_num;
> int err;
> bool notify = false;
> + bool unmap = false;
> u32 type;
>
> BUG_ON(req->nr_phys_segments + 2 > vblk->sg_elems);
> @@ -237,6 +281,13 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
> case REQ_OP_FLUSH:
> type = VIRTIO_BLK_T_FLUSH;
> break;
> + case REQ_OP_DISCARD:
> + type = VIRTIO_BLK_T_DISCARD;
> + break;
> + case REQ_OP_WRITE_ZEROES:
> + type = VIRTIO_BLK_T_WRITE_ZEROES;
> + unmap = !(req->cmd_flags & REQ_NOUNMAP);
> + break;
> case REQ_OP_SCSI_IN:
> case REQ_OP_SCSI_OUT:
> type = VIRTIO_BLK_T_SCSI_CMD;
> @@ -256,9 +307,16 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
>
> blk_mq_start_request(req);
>
> + if (type == VIRTIO_BLK_T_DISCARD || type == VIRTIO_BLK_T_WRITE_ZEROES) {
> + err = virtblk_setup_discard_write_zeroes(req, unmap);
> + if (err)
> + return BLK_STS_IOERR;
Does a failure actually indicate an IO error?
> + }
> +
> num = blk_rq_map_sg(hctx->queue, req, vbr->sg);
> if (num) {
> - if (rq_data_dir(req) == WRITE)
> + if (rq_data_dir(req) == WRITE || type == VIRTIO_BLK_T_DISCARD ||
> + type == VIRTIO_BLK_T_WRITE_ZEROES)
> vbr->out_hdr.type |= cpu_to_virtio32(vblk->vdev, VIRTIO_BLK_T_OUT);
> else
> vbr->out_hdr.type |= cpu_to_virtio32(vblk->vdev, VIRTIO_BLK_T_IN);
> @@ -777,6 +835,38 @@ static int virtblk_probe(struct virtio_device *vdev)
> if (!err && opt_io_size)
> blk_queue_io_opt(q, blk_size * opt_io_size);
>
> + if (virtio_has_feature(vdev, VIRTIO_BLK_F_DISCARD)) {
> + q->limits.discard_granularity = blk_size;
> +
> + virtio_cread(vdev, struct virtio_blk_config, discard_sector_alignment, &v);
> + if (v)
> + q->limits.discard_alignment = v << 9;
> + else
> + q->limits.discard_alignment = 0;
> +
> + virtio_cread(vdev, struct virtio_blk_config, max_discard_sectors, &v);
> + if (v)
> + blk_queue_max_discard_sectors(q, v);
> + else
> + blk_queue_max_discard_sectors(q, UINT_MAX);
> +
> + virtio_cread(vdev, struct virtio_blk_config, max_discard_seg, &v);
> + if (v)
> + blk_queue_max_discard_segments(q, v);
> + else
> + blk_queue_max_discard_segments(q, USHRT_MAX);
> +
> + queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
> + }
> +
> + if (virtio_has_feature(vdev, VIRTIO_BLK_F_WRITE_ZEROES)) {
> + virtio_cread(vdev, struct virtio_blk_config, max_write_zeroes_sectors, &v);
> + if (v)
> + blk_queue_max_write_zeroes_sectors(q, v);
> + else
> + blk_queue_max_write_zeroes_sectors(q, UINT_MAX);
> + }
> +
> virtblk_update_capacity(vblk, false);
> virtio_device_ready(vdev);
>
> @@ -885,14 +975,14 @@ static int virtblk_restore(struct virtio_device *vdev)
> VIRTIO_BLK_F_SCSI,
> #endif
> VIRTIO_BLK_F_FLUSH, VIRTIO_BLK_F_TOPOLOGY, VIRTIO_BLK_F_CONFIG_WCE,
> - VIRTIO_BLK_F_MQ,
> + VIRTIO_BLK_F_MQ, VIRTIO_BLK_F_DISCARD, VIRTIO_BLK_F_WRITE_ZEROES,
> }
> ;
> static unsigned int features[] = {
> VIRTIO_BLK_F_SEG_MAX, VIRTIO_BLK_F_SIZE_MAX, VIRTIO_BLK_F_GEOMETRY,
> VIRTIO_BLK_F_RO, VIRTIO_BLK_F_BLK_SIZE,
> VIRTIO_BLK_F_FLUSH, VIRTIO_BLK_F_TOPOLOGY, VIRTIO_BLK_F_CONFIG_WCE,
> - VIRTIO_BLK_F_MQ,
> + VIRTIO_BLK_F_MQ, VIRTIO_BLK_F_DISCARD, VIRTIO_BLK_F_WRITE_ZEROES,
> };
>
> static struct virtio_driver virtio_blk = {
> diff --git a/include/uapi/linux/virtio_blk.h b/include/uapi/linux/virtio_blk.h
> index 9ebe4d9..78a60e6 100644
> --- a/include/uapi/linux/virtio_blk.h
> +++ b/include/uapi/linux/virtio_blk.h
> @@ -38,6 +38,8 @@
> #define VIRTIO_BLK_F_BLK_SIZE 6 /* Block size of disk is available*/
> #define VIRTIO_BLK_F_TOPOLOGY 10 /* Topology information is available */
> #define VIRTIO_BLK_F_MQ 12 /* support more than one vq */
> +#define VIRTIO_BLK_F_DISCARD 13 /* DISCARD command is supported */
> +#define VIRTIO_BLK_F_WRITE_ZEROES 14 /* WRITE ZEROES command is supported */
>
> /* Legacy feature bits */
> #ifndef VIRTIO_BLK_NO_LEGACY
> @@ -86,6 +88,22 @@ struct virtio_blk_config {
>
> /* number of vqs, only available when VIRTIO_BLK_F_MQ is set */
> __u16 num_queues;
> + /* The maximum discard segment size (if VIRTIO_BLK_F_DISCARD) */
> + __u32 max_discard_sectors;
> + /* The maximum number of discard segments (if VIRTIO_BLK_F_DISCARD) */
> + __u32 max_discard_seg;
> + /* The sector alignment for discard (if VIRTIO_BLK_F_DISCARD) */
> + __u32 discard_sector_alignment;
> + /* The maximum number of write zeroes sectors (if VIRTIO_BLK_F_WRITE_ZEROES) */
> + __u32 max_write_zeroes_sectors;
> + /* The maximum number of write zeroes segments (if VIRTIO_BLK_F_WRITE_ZEROES) */
> + __u32 max_write_zeroes_seg;
> + /* Device clear this bit when write zeroes command cannot result in
> + * deallocating one or more sectors
> + * (if VIRTIO_BLK_F_WRITE_ZEROES with unmap bit)
Pls fix grammar in this comment, I am not sure what are you trying to
say.
> + */
> + __u8 write_zeroes_may_unmap;
> + __u8 unused1[3];
> } __attribute__((packed));
>
> /*
> @@ -114,6 +132,12 @@ struct virtio_blk_config {
> /* Get device ID command */
> #define VIRTIO_BLK_T_GET_ID 8
>
> +/* Discard command */
> +#define VIRTIO_BLK_T_DISCARD 11
> +
> +/* Write zeroes command */
> +#define VIRTIO_BLK_T_WRITE_ZEROES 13
> +
> #ifndef VIRTIO_BLK_NO_LEGACY
> /* Barrier before this op. */
> #define VIRTIO_BLK_T_BARRIER 0x80000000
> @@ -133,6 +157,21 @@ struct virtio_blk_outhdr {
> __virtio64 sector;
> };
>
> +/*
> + * discard/write zeroes range for each request.
> + */
> +struct virtio_blk_discard_write_zeroes {
> + /* discard/write zeroes start sector */
> + __virtio64 sector;
> + /* number of discard/write zeroes sectors */
> + __virtio32 num_sectors;
> + /* valid for write zeroes command */
> + struct {
> + __virtio32 unmap:1;
> + __virtio32 reserved:31;
> + } flags;
> +};
> +
You can't use bitmaps in portable code.
The format differs between architectures.
> #ifndef VIRTIO_BLK_NO_LEGACY
> struct virtio_scsi_inhdr {
> __virtio32 errors;
> --
> 1.9.3
^ permalink raw reply
* [PATCH v3] virtio_blk: add DISCARD and WRIET ZEROES command support
From: Changpeng Liu @ 2018-03-30 0:49 UTC (permalink / raw)
To: virtualization
Cc: james.r.harris, pawelx.wodkowski, mst, fabio.miranda.martins,
stefanha, pbonzini, cavery, changpeng.liu
Existing virtio-blk protocol doesn't have DISCARD/WRITE ZEROES
command support, this will impact the performance when using SSD
backend over file systems.
The idea here is using 16 Bytes payload as one descriptor for
DISCARD/WRITE ZEROES command, users can put several ranges into
one command, for the purpose to support such feature, two feature
flags VIRTIO_BLK_F_DISCARD/VIRTIO_BLK_F_WRITE_ZEROES and two
commands VIRTIO_BLK_T_DISCARD/VIRTIO_BLK_T_WRITE_ZEROES are
introduced, and some parameters are added to the configuration
space to tell the OS the granularity of DISCARD/WRITE ZEROES
commands.
The specification change list here:
https://github.com/oasis-tcs/virtio-spec
CHANGELOG:
v3: finalized the specification change.
Signed-off-by: Changpeng Liu <changpeng.liu@intel.com>
---
drivers/block/virtio_blk.c | 96 +++++++++++++++++++++++++++++++++++++++--
include/uapi/linux/virtio_blk.h | 39 +++++++++++++++++
2 files changed, 132 insertions(+), 3 deletions(-)
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 4a07593c..1943adb 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -172,10 +172,53 @@ static int virtblk_add_req(struct virtqueue *vq, struct virtblk_req *vbr,
return virtqueue_add_sgs(vq, sgs, num_out, num_in, vbr, GFP_ATOMIC);
}
+static inline int virtblk_setup_discard_write_zeroes(struct request *req, bool unmap)
+{
+ unsigned short segments = blk_rq_nr_discard_segments(req), n = 0;
+ u32 block_size = queue_logical_block_size(req->q);
+ struct virtio_blk_discard_write_zeroes *range;
+ struct bio *bio;
+
+ if (block_size < 512 || !block_size)
+ return -1;
+
+ range = kmalloc_array(segments, sizeof(*range), GFP_ATOMIC);
+ if (!range)
+ return -1;
+
+ __rq_for_each_bio(bio, req) {
+ u64 sector = bio->bi_iter.bi_sector;
+ u32 num_sectors = bio->bi_iter.bi_size >> 9;
+
+ range[n].flags.unmap = cpu_to_le32(unmap);
+ range[n].flags.reserved = cpu_to_le32(0);
+ range[n].num_sectors = cpu_to_le32(num_sectors);
+ range[n].sector = cpu_to_le64(sector);
+ n++;
+ }
+
+ if (WARN_ON_ONCE(n != segments)) {
+ kfree(range);
+ return -1;
+ }
+
+ req->special_vec.bv_page = virt_to_page(range);
+ req->special_vec.bv_offset = offset_in_page(range);
+ req->special_vec.bv_len = sizeof(*range) * segments;
+ req->rq_flags |= RQF_SPECIAL_PAYLOAD;
+
+ return 0;
+}
+
static inline void virtblk_request_done(struct request *req)
{
struct virtblk_req *vbr = blk_mq_rq_to_pdu(req);
+ if (req->rq_flags & RQF_SPECIAL_PAYLOAD) {
+ kfree(page_address(req->special_vec.bv_page) +
+ req->special_vec.bv_offset);
+ }
+
switch (req_op(req)) {
case REQ_OP_SCSI_IN:
case REQ_OP_SCSI_OUT:
@@ -225,6 +268,7 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
int qid = hctx->queue_num;
int err;
bool notify = false;
+ bool unmap = false;
u32 type;
BUG_ON(req->nr_phys_segments + 2 > vblk->sg_elems);
@@ -237,6 +281,13 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
case REQ_OP_FLUSH:
type = VIRTIO_BLK_T_FLUSH;
break;
+ case REQ_OP_DISCARD:
+ type = VIRTIO_BLK_T_DISCARD;
+ break;
+ case REQ_OP_WRITE_ZEROES:
+ type = VIRTIO_BLK_T_WRITE_ZEROES;
+ unmap = !(req->cmd_flags & REQ_NOUNMAP);
+ break;
case REQ_OP_SCSI_IN:
case REQ_OP_SCSI_OUT:
type = VIRTIO_BLK_T_SCSI_CMD;
@@ -256,9 +307,16 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
blk_mq_start_request(req);
+ if (type == VIRTIO_BLK_T_DISCARD || type == VIRTIO_BLK_T_WRITE_ZEROES) {
+ err = virtblk_setup_discard_write_zeroes(req, unmap);
+ if (err)
+ return BLK_STS_IOERR;
+ }
+
num = blk_rq_map_sg(hctx->queue, req, vbr->sg);
if (num) {
- if (rq_data_dir(req) == WRITE)
+ if (rq_data_dir(req) == WRITE || type == VIRTIO_BLK_T_DISCARD ||
+ type == VIRTIO_BLK_T_WRITE_ZEROES)
vbr->out_hdr.type |= cpu_to_virtio32(vblk->vdev, VIRTIO_BLK_T_OUT);
else
vbr->out_hdr.type |= cpu_to_virtio32(vblk->vdev, VIRTIO_BLK_T_IN);
@@ -777,6 +835,38 @@ static int virtblk_probe(struct virtio_device *vdev)
if (!err && opt_io_size)
blk_queue_io_opt(q, blk_size * opt_io_size);
+ if (virtio_has_feature(vdev, VIRTIO_BLK_F_DISCARD)) {
+ q->limits.discard_granularity = blk_size;
+
+ virtio_cread(vdev, struct virtio_blk_config, discard_sector_alignment, &v);
+ if (v)
+ q->limits.discard_alignment = v << 9;
+ else
+ q->limits.discard_alignment = 0;
+
+ virtio_cread(vdev, struct virtio_blk_config, max_discard_sectors, &v);
+ if (v)
+ blk_queue_max_discard_sectors(q, v);
+ else
+ blk_queue_max_discard_sectors(q, UINT_MAX);
+
+ virtio_cread(vdev, struct virtio_blk_config, max_discard_seg, &v);
+ if (v)
+ blk_queue_max_discard_segments(q, v);
+ else
+ blk_queue_max_discard_segments(q, USHRT_MAX);
+
+ queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
+ }
+
+ if (virtio_has_feature(vdev, VIRTIO_BLK_F_WRITE_ZEROES)) {
+ virtio_cread(vdev, struct virtio_blk_config, max_write_zeroes_sectors, &v);
+ if (v)
+ blk_queue_max_write_zeroes_sectors(q, v);
+ else
+ blk_queue_max_write_zeroes_sectors(q, UINT_MAX);
+ }
+
virtblk_update_capacity(vblk, false);
virtio_device_ready(vdev);
@@ -885,14 +975,14 @@ static int virtblk_restore(struct virtio_device *vdev)
VIRTIO_BLK_F_SCSI,
#endif
VIRTIO_BLK_F_FLUSH, VIRTIO_BLK_F_TOPOLOGY, VIRTIO_BLK_F_CONFIG_WCE,
- VIRTIO_BLK_F_MQ,
+ VIRTIO_BLK_F_MQ, VIRTIO_BLK_F_DISCARD, VIRTIO_BLK_F_WRITE_ZEROES,
}
;
static unsigned int features[] = {
VIRTIO_BLK_F_SEG_MAX, VIRTIO_BLK_F_SIZE_MAX, VIRTIO_BLK_F_GEOMETRY,
VIRTIO_BLK_F_RO, VIRTIO_BLK_F_BLK_SIZE,
VIRTIO_BLK_F_FLUSH, VIRTIO_BLK_F_TOPOLOGY, VIRTIO_BLK_F_CONFIG_WCE,
- VIRTIO_BLK_F_MQ,
+ VIRTIO_BLK_F_MQ, VIRTIO_BLK_F_DISCARD, VIRTIO_BLK_F_WRITE_ZEROES,
};
static struct virtio_driver virtio_blk = {
diff --git a/include/uapi/linux/virtio_blk.h b/include/uapi/linux/virtio_blk.h
index 9ebe4d9..78a60e6 100644
--- a/include/uapi/linux/virtio_blk.h
+++ b/include/uapi/linux/virtio_blk.h
@@ -38,6 +38,8 @@
#define VIRTIO_BLK_F_BLK_SIZE 6 /* Block size of disk is available*/
#define VIRTIO_BLK_F_TOPOLOGY 10 /* Topology information is available */
#define VIRTIO_BLK_F_MQ 12 /* support more than one vq */
+#define VIRTIO_BLK_F_DISCARD 13 /* DISCARD command is supported */
+#define VIRTIO_BLK_F_WRITE_ZEROES 14 /* WRITE ZEROES command is supported */
/* Legacy feature bits */
#ifndef VIRTIO_BLK_NO_LEGACY
@@ -86,6 +88,22 @@ struct virtio_blk_config {
/* number of vqs, only available when VIRTIO_BLK_F_MQ is set */
__u16 num_queues;
+ /* The maximum discard segment size (if VIRTIO_BLK_F_DISCARD) */
+ __u32 max_discard_sectors;
+ /* The maximum number of discard segments (if VIRTIO_BLK_F_DISCARD) */
+ __u32 max_discard_seg;
+ /* The sector alignment for discard (if VIRTIO_BLK_F_DISCARD) */
+ __u32 discard_sector_alignment;
+ /* The maximum number of write zeroes sectors (if VIRTIO_BLK_F_WRITE_ZEROES) */
+ __u32 max_write_zeroes_sectors;
+ /* The maximum number of write zeroes segments (if VIRTIO_BLK_F_WRITE_ZEROES) */
+ __u32 max_write_zeroes_seg;
+ /* Device clear this bit when write zeroes command cannot result in
+ * deallocating one or more sectors
+ * (if VIRTIO_BLK_F_WRITE_ZEROES with unmap bit)
+ */
+ __u8 write_zeroes_may_unmap;
+ __u8 unused1[3];
} __attribute__((packed));
/*
@@ -114,6 +132,12 @@ struct virtio_blk_config {
/* Get device ID command */
#define VIRTIO_BLK_T_GET_ID 8
+/* Discard command */
+#define VIRTIO_BLK_T_DISCARD 11
+
+/* Write zeroes command */
+#define VIRTIO_BLK_T_WRITE_ZEROES 13
+
#ifndef VIRTIO_BLK_NO_LEGACY
/* Barrier before this op. */
#define VIRTIO_BLK_T_BARRIER 0x80000000
@@ -133,6 +157,21 @@ struct virtio_blk_outhdr {
__virtio64 sector;
};
+/*
+ * discard/write zeroes range for each request.
+ */
+struct virtio_blk_discard_write_zeroes {
+ /* discard/write zeroes start sector */
+ __virtio64 sector;
+ /* number of discard/write zeroes sectors */
+ __virtio32 num_sectors;
+ /* valid for write zeroes command */
+ struct {
+ __virtio32 unmap:1;
+ __virtio32 reserved:31;
+ } flags;
+};
+
#ifndef VIRTIO_BLK_NO_LEGACY
struct virtio_scsi_inhdr {
__virtio32 errors;
--
1.9.3
^ permalink raw reply related
* Re: virtio over SW-defined/CPU-driven PCIe endpoint
From: Michael S. Tsirkin @ 2018-03-29 22:38 UTC (permalink / raw)
To: Stephen Warren; +Cc: virtualization
In-Reply-To: <7cbb8e1c-a34b-2cef-9de3-f034f6afcdfd@wwwdotorg.org>
On Thu, Mar 29, 2018 at 03:22:29PM -0600, Stephen Warren wrote:
> I've been investigating how to implement a virtio device (as opposed to a
> virtio driver) on a regular computer system with a PCIe controller that can
> operate in endpoint mode, as opposed to an endpoint that's implemented by a
> hypervisor that can preempt execution of a VM, or an endpoint that's
> implemented purely in hardware by logic gates. In my case (and I assume
> likely in most CPU-driven PCIe endpoint cases), the endpoint controller has
> the following capabilities:
>
> - Host-initiated accesses to the endpoint's BARs can read/write normal
> memory, but not hardware registers within the endpoint system.
>
> - Accesses to memory exposed by BARs can't be synchronously handled by the
> endpoint's local CPU. The local CPU can't be notified when the host writes
> memory in order to synchronously update other memory locations. The local
> CPU can't synchronously generate the result of a host read transaction, but
> rather the data must be present in memory ahead of time.
>
> - Accesses to a small region of address space can be used to generate
> interrupts to the endpoint's local CPU. This region can be exposed through a
> PCI BAR (or perhaps as part of a BAR; not sure on details yet). This region
> of memory has a fixed format and is separate from true RAM, and so can't be
> used to hold PCI-virtio's discovery/capability data.
>
> - The endpoint can emit PCI interrupts (e.g. MSI) to the attached host.
>
> The model described in the virtio spec's "Virtio PCI Bus" section doesn't
> seem to work in this case, since it assumes:
>
> - Writes to some fields in the PCI configuration space (e.g.
> {queue,device_feature,driver_feature}_select) synchronously update other
> fields (e.g. queue_size), which can immediately be accessed by the host.
> This isn't possible when the memory content is created by a CPU that isn't a
> synchronous part of PCI accesses.
>
> - Writing to some fields in the PCI configuration space (e.g. device_status)
> are supposed to trigger a response by the device, without the need to
> explicitly notify the device of the memory write by some other means. This
> isn't possible when the endpoint's local CPU has no mechanism to be notified
> of such writes.
>
> - The device_status field is asynchronously read-write by both the device
> and driver, yet the spec requires that the driver must always
> read-modify-write this field, and additionally that the driver must never
> clear any device status bit. These requirements seem impossible to satisfy
> in any case at all, let alone the current case.
>
> I can see some possible solutions here:
>
> 1) Just implement virtqueues but not all of the standardized PCI discovery
> protocol. virtqueues don't have the problems described above and should work
> fine between systems where there are asynchronous CPUs on both ends.
> virtqueus solely rely on normal memory access without side-effects and
> explicit notification. This would require implementing some
> custom/device-specific discovery protocol. I believe that remoteproc/rpmsg
> take this approach.
>
> 2) Define a new standardized virtio PCI discovery protocol that is better
> suited to the device being an asynchronous CPU. For example, eliminate the
> need for the device to somehow notice memory accesses and rely on explicitl
> notification instead. Separate device-written and driver-written data into
> different cache-lines or pages.
>
> 3) Use something other than virtio/virtqueues instead.
>
> As an aside, I noticed that the memory allocation for virtqueues is very
> lopsided; the driver always allocates the storage. This means that the
> device must perform PCI reads to transfer data from the driver to the
> device. PCI reads are typically slower than PCI writes since reads require a
> round-trip transfer, whereas writes can be posted. I wonder if any thought
> has been put into having the device optionally allocate virtuqueue buffers
> so that the protocol can rely primarily on PCI writes? Perhaps there's some
> alternative protocol that's more optimized for true PCI-based communication
> rather than paravirtualized PCI-based communication?
>
> Thanks for any thoughts on the best approach, or pointers to pre-existing
> work in this area.
I think Jan Kiszka wanted to add ability to put some data in the PCI
BAR. This was never formally proposed.
Your first step IMHO should be to send these thoughts to one of the
virtio TC mailing lists, that's where discussion about virtio interface
extensions takes place. virtualization@lists.linux-foundation.org is
mostly for Linux virtio drivers.
--
MST
^ permalink raw reply
* virtio over SW-defined/CPU-driven PCIe endpoint
From: Stephen Warren @ 2018-03-29 21:22 UTC (permalink / raw)
To: Michael S. Tsirkin, Jason Wang; +Cc: virtualization
I've been investigating how to implement a virtio device (as opposed to
a virtio driver) on a regular computer system with a PCIe controller
that can operate in endpoint mode, as opposed to an endpoint that's
implemented by a hypervisor that can preempt execution of a VM, or an
endpoint that's implemented purely in hardware by logic gates. In my
case (and I assume likely in most CPU-driven PCIe endpoint cases), the
endpoint controller has the following capabilities:
- Host-initiated accesses to the endpoint's BARs can read/write normal
memory, but not hardware registers within the endpoint system.
- Accesses to memory exposed by BARs can't be synchronously handled by
the endpoint's local CPU. The local CPU can't be notified when the host
writes memory in order to synchronously update other memory locations.
The local CPU can't synchronously generate the result of a host read
transaction, but rather the data must be present in memory ahead of time.
- Accesses to a small region of address space can be used to generate
interrupts to the endpoint's local CPU. This region can be exposed
through a PCI BAR (or perhaps as part of a BAR; not sure on details
yet). This region of memory has a fixed format and is separate from true
RAM, and so can't be used to hold PCI-virtio's discovery/capability data.
- The endpoint can emit PCI interrupts (e.g. MSI) to the attached host.
The model described in the virtio spec's "Virtio PCI Bus" section
doesn't seem to work in this case, since it assumes:
- Writes to some fields in the PCI configuration space (e.g.
{queue,device_feature,driver_feature}_select) synchronously update other
fields (e.g. queue_size), which can immediately be accessed by the host.
This isn't possible when the memory content is created by a CPU that
isn't a synchronous part of PCI accesses.
- Writing to some fields in the PCI configuration space (e.g.
device_status) are supposed to trigger a response by the device, without
the need to explicitly notify the device of the memory write by some
other means. This isn't possible when the endpoint's local CPU has no
mechanism to be notified of such writes.
- The device_status field is asynchronously read-write by both the
device and driver, yet the spec requires that the driver must always
read-modify-write this field, and additionally that the driver must
never clear any device status bit. These requirements seem impossible to
satisfy in any case at all, let alone the current case.
I can see some possible solutions here:
1) Just implement virtqueues but not all of the standardized PCI
discovery protocol. virtqueues don't have the problems described above
and should work fine between systems where there are asynchronous CPUs
on both ends. virtqueus solely rely on normal memory access without
side-effects and explicit notification. This would require implementing
some custom/device-specific discovery protocol. I believe that
remoteproc/rpmsg take this approach.
2) Define a new standardized virtio PCI discovery protocol that is
better suited to the device being an asynchronous CPU. For example,
eliminate the need for the device to somehow notice memory accesses and
rely on explicitl notification instead. Separate device-written and
driver-written data into different cache-lines or pages.
3) Use something other than virtio/virtqueues instead.
As an aside, I noticed that the memory allocation for virtqueues is very
lopsided; the driver always allocates the storage. This means that the
device must perform PCI reads to transfer data from the driver to the
device. PCI reads are typically slower than PCI writes since reads
require a round-trip transfer, whereas writes can be posted. I wonder if
any thought has been put into having the device optionally allocate
virtuqueue buffers so that the protocol can rely primarily on PCI
writes? Perhaps there's some alternative protocol that's more optimized
for true PCI-based communication rather than paravirtualized PCI-based
communication?
Thanks for any thoughts on the best approach, or pointers to
pre-existing work in this area.
^ permalink raw reply
* Re: [PATCH net] vhost: validate log when IOTLB is enabled
From: David Miller @ 2018-03-29 20:23 UTC (permalink / raw)
To: mst; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20180329154803-mutt-send-email-mst@kernel.org>
From: "Michael S. Tsirkin" <mst@redhat.com>
Date: Thu, 29 Mar 2018 15:48:22 +0300
> On Thu, Mar 29, 2018 at 04:00:04PM +0800, Jason Wang wrote:
>> Vq log_base is the userspace address of bitmap which has nothing to do
>> with IOTLB. So it needs to be validated unconditionally otherwise we
>> may try use 0 as log_base which may lead to pin pages that will lead
>> unexpected result (e.g trigger BUG_ON() in set_bit_to_user()).
>>
>> Fixes: 6b1e6cc7855b0 ("vhost: new device IOTLB API")
>> Reported-by: syzbot+6304bf97ef436580fede@syzkaller.appspotmail.com
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>
> Acked-by: Michael S. Tsirkin <mst@redhat.com>
>
> stable material I guess.
Applied and queued up for -stable.
^ permalink raw reply
* Re: [PATCH net] vhost: validate log when IOTLB is enabled
From: Michael S. Tsirkin @ 2018-03-29 14:44 UTC (permalink / raw)
To: Jason Wang; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <1522310404-8486-1-git-send-email-jasowang@redhat.com>
On Thu, Mar 29, 2018 at 04:00:04PM +0800, Jason Wang wrote:
> Vq log_base is the userspace address of bitmap which has nothing to do
> with IOTLB. So it needs to be validated unconditionally otherwise we
> may try use 0 as log_base which may lead to pin pages that will lead
> unexpected result (e.g trigger BUG_ON() in set_bit_to_user()).
>
> Fixes: 6b1e6cc7855b0 ("vhost: new device IOTLB API")
> Reported-by: syzbot+6304bf97ef436580fede@syzkaller.appspotmail.com
> Signed-off-by: Jason Wang <jasowang@redhat.com>
One follow-up question:
We still observe that get user pages returns 0 sometimes. While I agree
we should not pass in unvalidated addresses, isn't this worth
documenting?
> ---
> drivers/vhost/vhost.c | 14 ++++++--------
> 1 file changed, 6 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 5d5a9d9..5320039 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -1244,14 +1244,12 @@ static int vq_log_access_ok(struct vhost_virtqueue *vq,
> /* Caller should have vq mutex and device mutex */
> int vhost_vq_access_ok(struct vhost_virtqueue *vq)
> {
> - if (vq->iotlb) {
> - /* When device IOTLB was used, the access validation
> - * will be validated during prefetching.
> - */
> - return 1;
> - }
> - return vq_access_ok(vq, vq->num, vq->desc, vq->avail, vq->used) &&
> - vq_log_access_ok(vq, vq->log_base);
> + int ret = vq_log_access_ok(vq, vq->log_base);
> +
> + if (ret || vq->iotlb)
> + return ret;
> +
> + return vq_access_ok(vq, vq->num, vq->desc, vq->avail, vq->used);
> }
> EXPORT_SYMBOL_GPL(vhost_vq_access_ok);
>
> --
> 2.7.4
^ permalink raw reply
* Re: [PATCH net] vhost: validate log when IOTLB is enabled
From: Michael S. Tsirkin @ 2018-03-29 12:48 UTC (permalink / raw)
To: Jason Wang; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <1522310404-8486-1-git-send-email-jasowang@redhat.com>
On Thu, Mar 29, 2018 at 04:00:04PM +0800, Jason Wang wrote:
> Vq log_base is the userspace address of bitmap which has nothing to do
> with IOTLB. So it needs to be validated unconditionally otherwise we
> may try use 0 as log_base which may lead to pin pages that will lead
> unexpected result (e.g trigger BUG_ON() in set_bit_to_user()).
>
> Fixes: 6b1e6cc7855b0 ("vhost: new device IOTLB API")
> Reported-by: syzbot+6304bf97ef436580fede@syzkaller.appspotmail.com
> Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
stable material I guess.
> ---
> drivers/vhost/vhost.c | 14 ++++++--------
> 1 file changed, 6 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 5d5a9d9..5320039 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -1244,14 +1244,12 @@ static int vq_log_access_ok(struct vhost_virtqueue *vq,
> /* Caller should have vq mutex and device mutex */
> int vhost_vq_access_ok(struct vhost_virtqueue *vq)
> {
> - if (vq->iotlb) {
> - /* When device IOTLB was used, the access validation
> - * will be validated during prefetching.
> - */
> - return 1;
> - }
> - return vq_access_ok(vq, vq->num, vq->desc, vq->avail, vq->used) &&
> - vq_log_access_ok(vq, vq->log_base);
> + int ret = vq_log_access_ok(vq, vq->log_base);
> +
> + if (ret || vq->iotlb)
> + return ret;
> +
> + return vq_access_ok(vq, vq->num, vq->desc, vq->avail, vq->used);
> }
> EXPORT_SYMBOL_GPL(vhost_vq_access_ok);
>
> --
> 2.7.4
^ permalink raw reply
* Re: [PATCH net V2] vhost: correctly remove wait queue during poll failure
From: Jason Wang @ 2018-03-29 8:01 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: netdev, Darren Kenny, linux-kernel, kvm, virtualization
In-Reply-To: <20180329071801-mutt-send-email-mst@kernel.org>
On 2018年03月29日 12:20, Michael S. Tsirkin wrote:
> On Tue, Mar 27, 2018 at 08:50:52PM +0800, Jason Wang wrote:
>> We tried to remove vq poll from wait queue, but do not check whether
>> or not it was in a list before. This will lead double free. Fixing
>> this by switching to use vhost_poll_stop() which zeros poll->wqh after
>> removing poll from waitqueue to make sure it won't be freed twice.
>>
>> Cc: Darren Kenny<darren.kenny@oracle.com>
>> Reported-by:syzbot+c0272972b01b872e604a@syzkaller.appspotmail.com
>> Fixes: 2b8b328b61c79 ("vhost_net: handle polling errors when setting backend")
>> Signed-off-by: Jason Wang<jasowang@redhat.com>
> OK with this the only bug we have is where get user pages returns 0
> (Reported-by:syzbot+6304bf97ef436580fede@syzkaller.appspotmail.com)
>
>
>
Thanks for the reminder.
I post a patch to avoid this.
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* [PATCH net] vhost: validate log when IOTLB is enabled
From: Jason Wang @ 2018-03-29 8:00 UTC (permalink / raw)
To: mst, jasowang; +Cc: netdev, linux-kernel, kvm, virtualization
Vq log_base is the userspace address of bitmap which has nothing to do
with IOTLB. So it needs to be validated unconditionally otherwise we
may try use 0 as log_base which may lead to pin pages that will lead
unexpected result (e.g trigger BUG_ON() in set_bit_to_user()).
Fixes: 6b1e6cc7855b0 ("vhost: new device IOTLB API")
Reported-by: syzbot+6304bf97ef436580fede@syzkaller.appspotmail.com
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
drivers/vhost/vhost.c | 14 ++++++--------
1 file changed, 6 insertions(+), 8 deletions(-)
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 5d5a9d9..5320039 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1244,14 +1244,12 @@ static int vq_log_access_ok(struct vhost_virtqueue *vq,
/* Caller should have vq mutex and device mutex */
int vhost_vq_access_ok(struct vhost_virtqueue *vq)
{
- if (vq->iotlb) {
- /* When device IOTLB was used, the access validation
- * will be validated during prefetching.
- */
- return 1;
- }
- return vq_access_ok(vq, vq->num, vq->desc, vq->avail, vq->used) &&
- vq_log_access_ok(vq, vq->log_base);
+ int ret = vq_log_access_ok(vq, vq->log_base);
+
+ if (ret || vq->iotlb)
+ return ret;
+
+ return vq_access_ok(vq, vq->num, vq->desc, vq->avail, vq->used);
}
EXPORT_SYMBOL_GPL(vhost_vq_access_ok);
--
2.7.4
^ permalink raw reply related
* Re: [PATCH net V2] vhost: correctly remove wait queue during poll failure
From: Michael S. Tsirkin @ 2018-03-29 4:20 UTC (permalink / raw)
To: Jason Wang; +Cc: netdev, Darren Kenny, linux-kernel, kvm, virtualization
In-Reply-To: <1522155052-13347-1-git-send-email-jasowang@redhat.com>
On Tue, Mar 27, 2018 at 08:50:52PM +0800, Jason Wang wrote:
> We tried to remove vq poll from wait queue, but do not check whether
> or not it was in a list before. This will lead double free. Fixing
> this by switching to use vhost_poll_stop() which zeros poll->wqh after
> removing poll from waitqueue to make sure it won't be freed twice.
>
> Cc: Darren Kenny <darren.kenny@oracle.com>
> Reported-by: syzbot+c0272972b01b872e604a@syzkaller.appspotmail.com
> Fixes: 2b8b328b61c79 ("vhost_net: handle polling errors when setting backend")
> Signed-off-by: Jason Wang <jasowang@redhat.com>
OK with this the only bug we have is where get user pages returns 0
(Reported-by: syzbot+6304bf97ef436580fede@syzkaller.appspotmail.com)
> ---
> Changes from V1:
> - tweak the commit log for to match the code
> ---
> drivers/vhost/vhost.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 1b3e8d2d..5d5a9d9 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -212,8 +212,7 @@ int vhost_poll_start(struct vhost_poll *poll, struct file *file)
> if (mask)
> vhost_poll_wakeup(&poll->wait, 0, 0, poll_to_key(mask));
> if (mask & EPOLLERR) {
> - if (poll->wqh)
> - remove_wait_queue(poll->wqh, &poll->wait);
> + vhost_poll_stop(poll);
> ret = -EINVAL;
> }
>
> --
> 2.7.4
^ permalink raw reply
* Re: [PATCH] vhost-net: add time limitation for tx polling(Internet mail)
From: Jason Wang @ 2018-03-29 2:00 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: kvm@vger.kernel.org, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org,
virtualization@lists.linux-foundation.org,
haibinzhang(张海斌),
yunfangtai(台运方),
lidongchen(陈立东)
In-Reply-To: <20180328182813-mutt-send-email-mst@kernel.org>
On 2018年03月28日 23:31, Michael S. Tsirkin wrote:
> On Wed, Mar 28, 2018 at 02:37:04PM +0800, Jason Wang wrote:
>>
>> On 2018年03月28日 12:01, haibinzhang(张海斌) wrote:
>>> On 2018年03月27日 19:26, Jason wrote
>>> On 2018年03月27日 17:12, haibinzhang wrote:
>>>>> handle_tx() will delay rx for a long time when busy tx polling udp packets
>>>>> with short length(ie: 1byte udp payload), because setting VHOST_NET_WEIGHT
>>>>> takes into account only sent-bytes but no time.
>>>> Interesting.
>>>>
>>>> Looking at vhost_can_busy_poll() it tries to poke pending vhost work and
>>>> exit the busy loop if it found one. So I believe something block the
>>>> work queuing. E.g did reverting 8241a1e466cd56e6c10472cac9c1ad4e54bc65db
>>>> fix the issue?
>>> "busy tx polling" means using netperf send udp packets with 1 bytes payload(total 47bytes frame lenght),
>>> and handle_tx() will be busy sending packets continuously.
>>>
>>>>> It's not fair for handle_rx(),
>>>>> so needs to limit max time of tx polling.
>>>>>
>>>>> ---
>>>>> drivers/vhost/net.c | 3 ++-
>>>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>>>>> index 8139bc70ad7d..dc9218a3a75b 100644
>>>>> --- a/drivers/vhost/net.c
>>>>> +++ b/drivers/vhost/net.c
>>>>> @@ -473,6 +473,7 @@ static void handle_tx(struct vhost_net *net)
>>>>> struct socket *sock;
>>>>> struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
>>>>> bool zcopy, zcopy_used;
>>>>> + unsigned long start = jiffies;
>>>> Checking jiffies is tricky, need to convert it to ms or whatever others.
>>>>
>>>>> mutex_lock(&vq->mutex);
>>>>> sock = vq->private_data;
>>>>> @@ -580,7 +581,7 @@ static void handle_tx(struct vhost_net *net)
>>>>> else
>>>>> vhost_zerocopy_signal_used(net, vq);
>>>>> vhost_net_tx_packet(net);
>>>>> - if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
>>>>> + if (unlikely(total_len >= VHOST_NET_WEIGHT) || unlikely(jiffies - start >= 1)) {
>>>> How value 1 is determined here? And we need a complete test to make sure
>>>> this won't affect other use cases.
>>> We just want <1ms ping latency, but actually we are not sure what value is reasonable.
>>> We have some test results using netperf before this patch as follow,
>>>
>>> Udp payload 1byte 100bytes 1000bytes 1400bytes
>>> Ping avg latency 25ms 10ms 2ms 1.5ms
>>>
>>> What is other testcases?
>> Something like https://patchwork.kernel.org/patch/10151645/.
>>
>> Btw, you need use time_before() to properly handle jiffies overflow and I
>> would also suggest you to try something like #packets limit (e.g 64).
> Maybe a ring size?
Yes or a factor of ring size.
>
>> For long term, we definitely need more worker threads.
>>
>> Thanks
> Only helps when you have spare CPUs.
Right.
Thanks
>>>> Another thought is introduce another limit of #packets, but this need
>>>> benchmark too.
>>>>
>>>> Thanks
>>>>
>>>>> vhost_poll_queue(&vq->poll);
>>>>> break;
>>>>> }
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH] vhost-net: add time limitation for tx polling(Internet mail)
From: Michael S. Tsirkin @ 2018-03-28 15:31 UTC (permalink / raw)
To: Jason Wang
Cc: kvm@vger.kernel.org, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org,
virtualization@lists.linux-foundation.org,
haibinzhang(张海斌),
yunfangtai(台运方),
lidongchen(陈立东)
In-Reply-To: <0b1846d3-dfcf-df0c-f94c-65d414331d88@redhat.com>
On Wed, Mar 28, 2018 at 02:37:04PM +0800, Jason Wang wrote:
>
>
> On 2018年03月28日 12:01, haibinzhang(张海斌) wrote:
> > On 2018年03月27日 19:26, Jason wrote
> > On 2018年03月27日 17:12, haibinzhang wrote:
> > > > handle_tx() will delay rx for a long time when busy tx polling udp packets
> > > > with short length(ie: 1byte udp payload), because setting VHOST_NET_WEIGHT
> > > > takes into account only sent-bytes but no time.
> > > Interesting.
> > >
> > > Looking at vhost_can_busy_poll() it tries to poke pending vhost work and
> > > exit the busy loop if it found one. So I believe something block the
> > > work queuing. E.g did reverting 8241a1e466cd56e6c10472cac9c1ad4e54bc65db
> > > fix the issue?
> > "busy tx polling" means using netperf send udp packets with 1 bytes payload(total 47bytes frame lenght),
> > and handle_tx() will be busy sending packets continuously.
> >
> > > > It's not fair for handle_rx(),
> > > > so needs to limit max time of tx polling.
> > > >
> > > > ---
> > > > drivers/vhost/net.c | 3 ++-
> > > > 1 file changed, 2 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > > > index 8139bc70ad7d..dc9218a3a75b 100644
> > > > --- a/drivers/vhost/net.c
> > > > +++ b/drivers/vhost/net.c
> > > > @@ -473,6 +473,7 @@ static void handle_tx(struct vhost_net *net)
> > > > struct socket *sock;
> > > > struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
> > > > bool zcopy, zcopy_used;
> > > > + unsigned long start = jiffies;
> > > Checking jiffies is tricky, need to convert it to ms or whatever others.
> > >
> > > > mutex_lock(&vq->mutex);
> > > > sock = vq->private_data;
> > > > @@ -580,7 +581,7 @@ static void handle_tx(struct vhost_net *net)
> > > > else
> > > > vhost_zerocopy_signal_used(net, vq);
> > > > vhost_net_tx_packet(net);
> > > > - if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
> > > > + if (unlikely(total_len >= VHOST_NET_WEIGHT) || unlikely(jiffies - start >= 1)) {
> > > How value 1 is determined here? And we need a complete test to make sure
> > > this won't affect other use cases.
> > We just want <1ms ping latency, but actually we are not sure what value is reasonable.
> > We have some test results using netperf before this patch as follow,
> >
> > Udp payload 1byte 100bytes 1000bytes 1400bytes
> > Ping avg latency 25ms 10ms 2ms 1.5ms
> >
> > What is other testcases?
>
> Something like https://patchwork.kernel.org/patch/10151645/.
>
> Btw, you need use time_before() to properly handle jiffies overflow and I
> would also suggest you to try something like #packets limit (e.g 64).
Maybe a ring size?
> For long term, we definitely need more worker threads.
>
> Thanks
Only helps when you have spare CPUs.
> >
> > > Another thought is introduce another limit of #packets, but this need
> > > benchmark too.
> > >
> > > Thanks
> > >
> > > > vhost_poll_queue(&vq->poll);
> > > > break;
> > > > }
> > >
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH v29 1/4] mm: support reporting free page blocks
From: Michal Hocko @ 2018-03-28 7:01 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: yang.zhang.wz, virtio-dev, riel, quan.xu0, kvm,
liliang.opensource, linux-kernel, virtualization, linux-mm,
huangzhichao, pbonzini, Andrew Morton, nilal
In-Reply-To: <20180327190635-mutt-send-email-mst@kernel.org>
On Tue 27-03-18 19:07:22, Michael S. Tsirkin wrote:
> On Tue, Mar 27, 2018 at 08:33:22AM +0200, Michal Hocko wrote:
> > > > > + * The function itself might sleep so it cannot be called from atomic
> > > > > + * contexts.
> > > > I don't see how walk_free_mem_block() can sleep.
> > >
> > > OK, it would be better to remove this sentence for the current version. But
> > > I think we could probably keep it if we decide to add cond_resched() below.
> >
> > The point of this sentence was to make any user aware that the function
> > might sleep from the very begining rather than chase existing callers
> > when we need to add cond_resched or sleep for any other reason. So I
> > would rather keep it.
>
> Let's say what it is then - "will be changed to sleep in the future".
Do we really want to describe the precise implementation in the
documentation? I thought the main purpose of the documentation is to
describe the _contract_. If I am curious about the implementation I can
look at the code. As I've said earlier in this patchset lifetime. This
interface is rather dangerous because we are exposing guts of our
internal data structures. So we better set expectations of what can and
cannot be done right from the beginning. I definitely do not want
somebody to simply look at the code and see that the interface is
sleepable and abuse that fact.
--
Michal Hocko
SUSE Labs
^ permalink raw reply
* Re: [PATCH] vhost-net: add time limitation for tx polling(Internet mail)
From: Jason Wang @ 2018-03-28 6:37 UTC (permalink / raw)
To: haibinzhang(张海斌), mst@redhat.com,
kvm@vger.kernel.org, virtualization@lists.linux-foundation.org,
netdev@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: yunfangtai(台运方),
lidongchen(陈立东)
In-Reply-To: <88D661ADF6AFBF42B2AB88D8E7682B0901FC412A@EXMBX-SZMAIL011.tencent.com>
On 2018年03月28日 12:01, haibinzhang(张海斌) wrote:
> On 2018年03月27日 19:26, Jason wrote
> On 2018年03月27日 17:12, haibinzhang wrote:
>>> handle_tx() will delay rx for a long time when busy tx polling udp packets
>>> with short length(ie: 1byte udp payload), because setting VHOST_NET_WEIGHT
>>> takes into account only sent-bytes but no time.
>> Interesting.
>>
>> Looking at vhost_can_busy_poll() it tries to poke pending vhost work and
>> exit the busy loop if it found one. So I believe something block the
>> work queuing. E.g did reverting 8241a1e466cd56e6c10472cac9c1ad4e54bc65db
>> fix the issue?
> "busy tx polling" means using netperf send udp packets with 1 bytes payload(total 47bytes frame lenght),
> and handle_tx() will be busy sending packets continuously.
>
>>> It's not fair for handle_rx(),
>>> so needs to limit max time of tx polling.
>>>
>>> ---
>>> drivers/vhost/net.c | 3 ++-
>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>>> index 8139bc70ad7d..dc9218a3a75b 100644
>>> --- a/drivers/vhost/net.c
>>> +++ b/drivers/vhost/net.c
>>> @@ -473,6 +473,7 @@ static void handle_tx(struct vhost_net *net)
>>> struct socket *sock;
>>> struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
>>> bool zcopy, zcopy_used;
>>> + unsigned long start = jiffies;
>> Checking jiffies is tricky, need to convert it to ms or whatever others.
>>
>>>
>>> mutex_lock(&vq->mutex);
>>> sock = vq->private_data;
>>> @@ -580,7 +581,7 @@ static void handle_tx(struct vhost_net *net)
>>> else
>>> vhost_zerocopy_signal_used(net, vq);
>>> vhost_net_tx_packet(net);
>>> - if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
>>> + if (unlikely(total_len >= VHOST_NET_WEIGHT) || unlikely(jiffies - start >= 1)) {
>> How value 1 is determined here? And we need a complete test to make sure
>> this won't affect other use cases.
> We just want <1ms ping latency, but actually we are not sure what value is reasonable.
> We have some test results using netperf before this patch as follow,
>
> Udp payload 1byte 100bytes 1000bytes 1400bytes
> Ping avg latency 25ms 10ms 2ms 1.5ms
>
> What is other testcases?
Something like https://patchwork.kernel.org/patch/10151645/.
Btw, you need use time_before() to properly handle jiffies overflow and
I would also suggest you to try something like #packets limit (e.g 64).
For long term, we definitely need more worker threads.
Thanks
>
>> Another thought is introduce another limit of #packets, but this need
>> benchmark too.
>>
>> Thanks
>>
>>> vhost_poll_queue(&vq->poll);
>>> break;
>>> }
>>
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH net V2] vhost: correctly remove wait queue during poll failure
From: David Miller @ 2018-03-27 17:04 UTC (permalink / raw)
To: jasowang; +Cc: kvm, mst, netdev, linux-kernel, virtualization, darren.kenny
In-Reply-To: <1522155052-13347-1-git-send-email-jasowang@redhat.com>
From: Jason Wang <jasowang@redhat.com>
Date: Tue, 27 Mar 2018 20:50:52 +0800
> We tried to remove vq poll from wait queue, but do not check whether
> or not it was in a list before. This will lead double free. Fixing
> this by switching to use vhost_poll_stop() which zeros poll->wqh after
> removing poll from waitqueue to make sure it won't be freed twice.
>
> Cc: Darren Kenny <darren.kenny@oracle.com>
> Reported-by: syzbot+c0272972b01b872e604a@syzkaller.appspotmail.com
> Fixes: 2b8b328b61c79 ("vhost_net: handle polling errors when setting backend")
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
> Changes from V1:
> - tweak the commit log for to match the code
Applied and queued up for -stable, thank you.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox