* Re: [RFC V2] virtio: Add platform specific DMA API translation for virito devices
From: Benjamin Herrenschmidt @ 2018-05-28 23:56 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: robh, mpe, linux-kernel, virtualization, hch, joe, david,
linuxppc-dev, elfring, Anshuman Khandual
In-Reply-To: <6fff9f5d67361653e6072570a857cf0d1009a123.camel@kernel.crashing.org>
On Tue, 2018-05-29 at 09:48 +1000, Benjamin Herrenschmidt wrote:
> > Well it's not supposed to be much slower for the static case.
> >
> > vhost has a cache so should be fine.
> >
> > A while ago Paolo implemented a translation cache which should be
> > perfect for this case - most of the code got merged but
> > never enabled because of stability issues.
> >
> > If all else fails, we could teach QEMU to handle the no-iommu case
> > as if VIRTIO_F_IOMMU_PLATFORM was off.
>
> Any serious reason why not just getting that 2 line patch allowing our
> arch code to force virtio to use the DMA API ?
>
> It's not particularly invasive and solves our problem rather nicely
> without adding overhead or additional knowledge to qemu/libvirt/mgmnt
> tools etc... that it doesn't need etc....
>
> The guest knows it's going secure so the guest arch code can do the
> right thing rather trivially.
>
> Long term we should probably make virtio always use the DMA API anyway,
> and interpose "1:1" dma_ops for the traditional virtio case, that would
> reduce code clutter significantly. In that case, it would become just a
> matter of having a platform hook to override the dma_ops used.
To elaborate a bit ....
What we are trying to solve here is entirely a guest problem, I don't
think involving qemu in the solution is the right thing to do.
The guest can only allow external parties (qemu, potentially PCI
devices, etc...) access to some restricted portions of memory (insecure
memory). Thus the guest need to do some bounce buffering/swiotlb type
tricks.
This is completely orthogonal to whether there is an actual iommu
between the guest and the device (or emulated device/virtio).
This is why I think the solution should reside in the guest kernel, by
proper manipulation (by the arch code) of the dma ops.
I don't think forcing the addition of an emulated iommu in the middle
just to work around the fact that virtio "cheats" and doesn't use the
dma API unless there is one, is the right "fix".
The right long term fix is to always use the DMA API, reducing code
path etc... and just have a single point where virtio can "chose"
alternate DMA ops (via an arch hook to deal with our case).
In the meantime, having the hook we propose gets us going, but if you
agree with the approach, we should also work on the long term approach.
Cheers,
Ben.
^ permalink raw reply
* [PATCH v4] virtio_blk: add DISCARD and WRIET ZEROES commands support
From: Changpeng Liu @ 2018-05-29 1:42 UTC (permalink / raw)
To: virtualization, changpeng.liu; +Cc: pbonzini, cavery, stefanha
Existing virtio-blk protocol doesn't have DISCARD/WRITE ZEROES commands
support, this will impact the performance when using SSD backend over
file systems.
Commit 88c85538 "virtio-blk: add discard and write zeroes features to
specification"(see https://github.com/oasis-tcs/virtio-spec) extended
existing virtio-blk protocol, adding extra DISCARD and WRITE ZEROES
commands support.
While here, using 16 bytes descriptor to describe one segment of DISCARD
or WRITE ZEROES commands, each command may contain one or more decriptors.
The following data structure shows the definition of one descriptor:
struct virtio_blk_discard_write_zeroes {
le64 sector;
le32 num_sectors;
le32 unmap;
};
Field 'sector' means the start sector for DISCARD and WRITE ZEROES,
filed 'num_sectors' means the number of sectors for DISCARD and WRITE
ZEROES, if both DISCARD and WRITE ZEROES are supported, field 'unmap'
maybe used for WRITE ZEROES command with DISCARD enabled.
We also extended the virtio-blk configuration space to let backend
device put DISCARD and WRITE ZEROES configuration parameters.
struct virtio_blk_config {
[...]
le32 max_discard_sectors;
le32 max_discard_seg;
le32 discard_sector_alignment;
le32 max_write_zeroes_sectors;
le32 max_write_zeroes_seg;
u8 write_zeroes_may_unmap;
}
New feature bit [VIRTIO_BLK_F_DISCARD (13)]: Device can support discard
command, maximum discard sectors size in field 'max_discard_sectors' and
maximum discard segment number in field 'max_discard_seg'.
New feature [VIRTIO_BLK_F_WRITE_ZEROES (14)]: Device can support write
zeroes command, maximum write zeroes sectors size in field
'max_write_zeroes_sectors' and maximum write zeroes segment number in
field 'max_write_zeroes_seg'.
The parameters in the configuration space of the device field
'max_discard_sectors' and field 'discard_sector_alignment' are expressed in
512-byte units if the VIRTIO_BLK_F_DISCARD feature bit is negotiated. The
field 'max_write_zeroes_sectors' is expressed in 512-byte units if the
VIRTIO_BLK_F_WRITE_ZEROES feature bit is negotiated.
Signed-off-by: Changpeng Liu <changpeng.liu@intel.com>
---
CHANGELOG:
v4: several optimizations based on MST's comments, remove bit field usage for
command descriptor.
v3: define the virtio-blk protocol to add discard and write zeroes support,
first version implementation based on proposed specification.
v2: add write zeroes command support.
v1: initial proposal implementation for discard command.
---
drivers/block/virtio_blk.c | 92 +++++++++++++++++++++++++++++++++++++++--
include/uapi/linux/virtio_blk.h | 43 +++++++++++++++++++
2 files changed, 132 insertions(+), 3 deletions(-)
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 4a07593c..a250fcc 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -172,10 +172,45 @@ static int virtblk_add_req(struct virtqueue *vq, struct virtblk_req *vbr,
return virtqueue_add_sgs(vq, sgs, num_out, num_in, vbr, GFP_ATOMIC);
}
+static inline int virtblk_setup_discard_write_zeroes(struct request *req,
+ bool unmap)
+{
+ unsigned short segments = blk_rq_nr_discard_segments(req);
+ unsigned short n = 0;
+ struct virtio_blk_discard_write_zeroes *range;
+ struct bio *bio;
+
+ range = kmalloc_array(segments, sizeof(*range), GFP_KERNEL);
+ if (!range)
+ return -ENOMEM;
+
+ __rq_for_each_bio(bio, req) {
+ u64 sector = bio->bi_iter.bi_sector;
+ u32 num_sectors = bio->bi_iter.bi_size >> 9;
+
+ range[n].unmap = cpu_to_le32(unmap);
+ range[n].num_sectors = cpu_to_le32(num_sectors);
+ range[n].sector = cpu_to_le64(sector);
+ n++;
+ }
+
+ req->special_vec.bv_page = virt_to_page(range);
+ req->special_vec.bv_offset = offset_in_page(range);
+ req->special_vec.bv_len = sizeof(*range) * segments;
+ req->rq_flags |= RQF_SPECIAL_PAYLOAD;
+
+ return 0;
+}
+
static inline void virtblk_request_done(struct request *req)
{
struct virtblk_req *vbr = blk_mq_rq_to_pdu(req);
+ if (req->rq_flags & RQF_SPECIAL_PAYLOAD) {
+ kfree(page_address(req->special_vec.bv_page) +
+ req->special_vec.bv_offset);
+ }
+
switch (req_op(req)) {
case REQ_OP_SCSI_IN:
case REQ_OP_SCSI_OUT:
@@ -225,6 +260,7 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
int qid = hctx->queue_num;
int err;
bool notify = false;
+ bool unmap = false;
u32 type;
BUG_ON(req->nr_phys_segments + 2 > vblk->sg_elems);
@@ -237,6 +273,13 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
case REQ_OP_FLUSH:
type = VIRTIO_BLK_T_FLUSH;
break;
+ case REQ_OP_DISCARD:
+ type = VIRTIO_BLK_T_DISCARD;
+ break;
+ case REQ_OP_WRITE_ZEROES:
+ type = VIRTIO_BLK_T_WRITE_ZEROES;
+ unmap = !(req->cmd_flags & REQ_NOUNMAP);
+ break;
case REQ_OP_SCSI_IN:
case REQ_OP_SCSI_OUT:
type = VIRTIO_BLK_T_SCSI_CMD;
@@ -256,9 +299,16 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
blk_mq_start_request(req);
+ if (type == VIRTIO_BLK_T_DISCARD || type == VIRTIO_BLK_T_WRITE_ZEROES) {
+ err = virtblk_setup_discard_write_zeroes(req, unmap);
+ if (err)
+ return BLK_STS_RESOURCE;
+ }
+
num = blk_rq_map_sg(hctx->queue, req, vbr->sg);
if (num) {
- if (rq_data_dir(req) == WRITE)
+ if (rq_data_dir(req) == WRITE || type == VIRTIO_BLK_T_DISCARD ||
+ type == VIRTIO_BLK_T_WRITE_ZEROES)
vbr->out_hdr.type |= cpu_to_virtio32(vblk->vdev, VIRTIO_BLK_T_OUT);
else
vbr->out_hdr.type |= cpu_to_virtio32(vblk->vdev, VIRTIO_BLK_T_IN);
@@ -777,6 +827,42 @@ static int virtblk_probe(struct virtio_device *vdev)
if (!err && opt_io_size)
blk_queue_io_opt(q, blk_size * opt_io_size);
+ if (virtio_has_feature(vdev, VIRTIO_BLK_F_DISCARD)) {
+ q->limits.discard_granularity = blk_size;
+
+ virtio_cread(vdev, struct virtio_blk_config,
+ discard_sector_alignment, &v);
+ if (v)
+ q->limits.discard_alignment = v << 9;
+ else
+ q->limits.discard_alignment = 0;
+
+ virtio_cread(vdev, struct virtio_blk_config,
+ max_discard_sectors, &v);
+ if (v)
+ blk_queue_max_discard_sectors(q, v);
+ else
+ blk_queue_max_discard_sectors(q, UINT_MAX);
+
+ virtio_cread(vdev, struct virtio_blk_config, max_discard_seg,
+ &v);
+ if (v)
+ blk_queue_max_discard_segments(q, v);
+ else
+ blk_queue_max_discard_segments(q, USHRT_MAX);
+
+ queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
+ }
+
+ if (virtio_has_feature(vdev, VIRTIO_BLK_F_WRITE_ZEROES)) {
+ virtio_cread(vdev, struct virtio_blk_config,
+ max_write_zeroes_sectors, &v);
+ if (v)
+ blk_queue_max_write_zeroes_sectors(q, v);
+ else
+ blk_queue_max_write_zeroes_sectors(q, UINT_MAX);
+ }
+
virtblk_update_capacity(vblk, false);
virtio_device_ready(vdev);
@@ -885,14 +971,14 @@ static int virtblk_restore(struct virtio_device *vdev)
VIRTIO_BLK_F_SCSI,
#endif
VIRTIO_BLK_F_FLUSH, VIRTIO_BLK_F_TOPOLOGY, VIRTIO_BLK_F_CONFIG_WCE,
- VIRTIO_BLK_F_MQ,
+ VIRTIO_BLK_F_MQ, VIRTIO_BLK_F_DISCARD, VIRTIO_BLK_F_WRITE_ZEROES,
}
;
static unsigned int features[] = {
VIRTIO_BLK_F_SEG_MAX, VIRTIO_BLK_F_SIZE_MAX, VIRTIO_BLK_F_GEOMETRY,
VIRTIO_BLK_F_RO, VIRTIO_BLK_F_BLK_SIZE,
VIRTIO_BLK_F_FLUSH, VIRTIO_BLK_F_TOPOLOGY, VIRTIO_BLK_F_CONFIG_WCE,
- VIRTIO_BLK_F_MQ,
+ VIRTIO_BLK_F_MQ, VIRTIO_BLK_F_DISCARD, VIRTIO_BLK_F_WRITE_ZEROES,
};
static struct virtio_driver virtio_blk = {
diff --git a/include/uapi/linux/virtio_blk.h b/include/uapi/linux/virtio_blk.h
index 9ebe4d9..8e7a015 100644
--- a/include/uapi/linux/virtio_blk.h
+++ b/include/uapi/linux/virtio_blk.h
@@ -38,6 +38,8 @@
#define VIRTIO_BLK_F_BLK_SIZE 6 /* Block size of disk is available*/
#define VIRTIO_BLK_F_TOPOLOGY 10 /* Topology information is available */
#define VIRTIO_BLK_F_MQ 12 /* support more than one vq */
+#define VIRTIO_BLK_F_DISCARD 13 /* DISCARD is supported */
+#define VIRTIO_BLK_F_WRITE_ZEROES 14 /* WRITE ZEROES is supported */
/* Legacy feature bits */
#ifndef VIRTIO_BLK_NO_LEGACY
@@ -86,6 +88,29 @@ struct virtio_blk_config {
/* number of vqs, only available when VIRTIO_BLK_F_MQ is set */
__u16 num_queues;
+ /* The maximum discard sectors (in 512-byte sectors) for
+ * one segment (if VIRTIO_BLK_F_DISCARD)
+ */
+ __u32 max_discard_sectors;
+ /* The maximum number of discard segments in a
+ * discard command (if VIRTIO_BLK_F_DISCARD)
+ */
+ __u32 max_discard_seg;
+ /* The alignment sectors for discard (if VIRTIO_BLK_F_DISCARD) */
+ __u32 discard_sector_alignment;
+ /* The maximum number of write zeroes sectors (in 512-byte sectors) in
+ * one segment (if VIRTIO_BLK_F_WRITE_ZEROES)
+ */
+ __u32 max_write_zeroes_sectors;
+ /* The maximum number of segments in a write zeroes
+ * command (if VIRTIO_BLK_F_WRITE_ZEROES)
+ */
+ __u32 max_write_zeroes_seg;
+ /* Device clear this bit when write zeroes command can't result in
+ * unmapping sectors (if VIRITO_BLK_F_WRITE_ZEROES and with unmap)
+ */
+ __u8 write_zeroes_may_unmap;
+ __u8 unused1[3];
} __attribute__((packed));
/*
@@ -114,6 +139,12 @@ struct virtio_blk_config {
/* Get device ID command */
#define VIRTIO_BLK_T_GET_ID 8
+/* Discard command */
+#define VIRTIO_BLK_T_DISCARD 11
+
+/* Write zeroes command */
+#define VIRTIO_BLK_T_WRITE_ZEROES 13
+
#ifndef VIRTIO_BLK_NO_LEGACY
/* Barrier before this op. */
#define VIRTIO_BLK_T_BARRIER 0x80000000
@@ -133,6 +164,18 @@ struct virtio_blk_outhdr {
__virtio64 sector;
};
+/*
+ * discard/write zeroes range for each request.
+ */
+struct virtio_blk_discard_write_zeroes {
+ /* discard/write zeroes start sector */
+ __virtio64 sector;
+ /* number of discard/write zeroes sectors */
+ __virtio32 num_sectors;
+ /* valid for write zeroes command */
+ __virtio32 unmap;
+};
+
#ifndef VIRTIO_BLK_NO_LEGACY
struct virtio_scsi_inhdr {
__virtio32 errors;
--
1.9.3
^ permalink raw reply related
* [RFC V5 PATCH 0/8] Packed ring layout for vhost
From: Jason Wang @ 2018-05-29 2:10 UTC (permalink / raw)
To: mst, jasowang; +Cc: kvm, netdev, linux-kernel, virtualization, wexu
Hi all:
This RFC implement packed ring layout. The code were tested with
Tiwei's RFC V5 at https://lkml.org/lkml/2018/5/22/138. Some fixups and
tweaks were needed on top of Tiwei's code to make it run for event
index.
Pktgen reports about 20% improvement on TX PPS when doing pktgen from
guest to host. No ovbious improvement on RX PPS. We can do lots of
optimizations on top but for simple and for correceness first, this
version does not do much.
This version were tested with:
- Zerocopy (Out of Order) support
- vIOMMU support
- mergeable buffer on/off
- busy polling on/off
Notes for tester:
- Start from this version, vhost need qemu co-operation to work
correctly. Or you can comment out the packed specific code for
GET/SET_VRING_BASE.
- Changes from V4:
- fix signalled_used index recording
- track avail index correctly
- various minor fixes
Changes from V3:
- Fix math on event idx checking
- Sync last avail wrap counter through GET/SET_VRING_BASE
- remove desc_event prefix in the driver/device structure
Changes from V2:
- do not use & in checking desc_event_flags
- off should be most significant bit
- remove the workaround of mergeable buffer for dpdk prototype
- id should be in the last descriptor in the chain
- keep _F_WRITE for write descriptor when adding used
- device flags updating should use ADDR_USED type
- return error on unexpected unavail descriptor in a chain
- return false in vhost_ve_avail_empty is descriptor is available
- track last seen avail_wrap_counter
- correctly examine available descriptor in get_indirect_packed()
- vhost_idx_diff should return u16 instead of bool
Changes from V1:
- Refactor vhost used elem code to avoid open coding on used elem
- Event suppression support (compile test only).
- Indirect descriptor support (compile test only).
- Zerocopy support.
- vIOMMU support.
- SCSI/VSOCK support (compile test only).
- Fix several bugs
Jason Wang (8):
vhost: move get_rx_bufs to vhost.c
vhost: hide used ring layout from device
vhost: do not use vring_used_elem
vhost_net: do not explicitly manipulate vhost_used_elem
vhost: vhost_put_user() can accept metadata type
virtio: introduce packed ring defines
vhost: packed ring support
vhost: event suppression for packed ring
drivers/vhost/net.c | 144 ++----
drivers/vhost/scsi.c | 62 +--
drivers/vhost/vhost.c | 926 +++++++++++++++++++++++++++++++++----
drivers/vhost/vhost.h | 52 ++-
drivers/vhost/vsock.c | 42 +-
include/uapi/linux/virtio_config.h | 9 +
include/uapi/linux/virtio_ring.h | 32 ++
7 files changed, 1000 insertions(+), 267 deletions(-)
--
2.7.4
^ permalink raw reply
* [RFC V5 PATCH 1/8] vhost: move get_rx_bufs to vhost.c
From: Jason Wang @ 2018-05-29 2:10 UTC (permalink / raw)
To: mst, jasowang; +Cc: kvm, netdev, linux-kernel, virtualization, wexu
In-Reply-To: <1527559830-8133-1-git-send-email-jasowang@redhat.com>
Move get_rx_bufs() to vhost.c and rename it to
vhost_get_bufs(). This helps to hide vring internal layout from
specific device implementation. Packed ring implementation will
benefit from this.
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
drivers/vhost/net.c | 83 ++-------------------------------------------------
drivers/vhost/vhost.c | 78 +++++++++++++++++++++++++++++++++++++++++++++++
drivers/vhost/vhost.h | 7 +++++
3 files changed, 88 insertions(+), 80 deletions(-)
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 986058a..762aa81 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -664,83 +664,6 @@ static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk)
return len;
}
-/* This is a multi-buffer version of vhost_get_desc, that works if
- * vq has read descriptors only.
- * @vq - the relevant virtqueue
- * @datalen - data length we'll be reading
- * @iovcount - returned count of io vectors we fill
- * @log - vhost log
- * @log_num - log offset
- * @quota - headcount quota, 1 for big buffer
- * returns number of buffer heads allocated, negative on error
- */
-static int get_rx_bufs(struct vhost_virtqueue *vq,
- struct vring_used_elem *heads,
- int datalen,
- unsigned *iovcount,
- struct vhost_log *log,
- unsigned *log_num,
- unsigned int quota)
-{
- unsigned int out, in;
- int seg = 0;
- int headcount = 0;
- unsigned d;
- int r, nlogs = 0;
- /* len is always initialized before use since we are always called with
- * datalen > 0.
- */
- u32 uninitialized_var(len);
-
- while (datalen > 0 && headcount < quota) {
- if (unlikely(seg >= UIO_MAXIOV)) {
- r = -ENOBUFS;
- goto err;
- }
- r = vhost_get_vq_desc(vq, vq->iov + seg,
- ARRAY_SIZE(vq->iov) - seg, &out,
- &in, log, log_num);
- if (unlikely(r < 0))
- goto err;
-
- d = r;
- if (d == vq->num) {
- r = 0;
- goto err;
- }
- if (unlikely(out || in <= 0)) {
- vq_err(vq, "unexpected descriptor format for RX: "
- "out %d, in %d\n", out, in);
- r = -EINVAL;
- goto err;
- }
- if (unlikely(log)) {
- nlogs += *log_num;
- log += *log_num;
- }
- heads[headcount].id = cpu_to_vhost32(vq, d);
- len = iov_length(vq->iov + seg, in);
- heads[headcount].len = cpu_to_vhost32(vq, len);
- datalen -= len;
- ++headcount;
- seg += in;
- }
- heads[headcount - 1].len = cpu_to_vhost32(vq, len + datalen);
- *iovcount = seg;
- if (unlikely(log))
- *log_num = nlogs;
-
- /* Detect overrun */
- if (unlikely(datalen > 0)) {
- r = UIO_MAXIOV + 1;
- goto err;
- }
- return headcount;
-err:
- vhost_discard_vq_desc(vq, headcount);
- return r;
-}
-
/* Expects to be always run from workqueue - which acts as
* read-size critical section for our kind of RCU. */
static void handle_rx(struct vhost_net *net)
@@ -790,9 +713,9 @@ static void handle_rx(struct vhost_net *net)
while ((sock_len = vhost_net_rx_peek_head_len(net, sock->sk))) {
sock_len += sock_hlen;
vhost_len = sock_len + vhost_hlen;
- headcount = get_rx_bufs(vq, vq->heads + nheads, vhost_len,
- &in, vq_log, &log,
- likely(mergeable) ? UIO_MAXIOV : 1);
+ headcount = vhost_get_bufs(vq, vq->heads + nheads, vhost_len,
+ &in, vq_log, &log,
+ likely(mergeable) ? UIO_MAXIOV : 1);
/* On error, stop handling until the next kick. */
if (unlikely(headcount < 0))
goto out;
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index f0be5f3..096a688 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2100,6 +2100,84 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
}
EXPORT_SYMBOL_GPL(vhost_get_vq_desc);
+/* This is a multi-buffer version of vhost_get_desc, that works if
+ * vq has read descriptors only.
+ * @vq - the relevant virtqueue
+ * @datalen - data length we'll be reading
+ * @iovcount - returned count of io vectors we fill
+ * @log - vhost log
+ * @log_num - log offset
+ * @quota - headcount quota, 1 for big buffer
+ * returns number of buffer heads allocated, negative on error
+ */
+int vhost_get_bufs(struct vhost_virtqueue *vq,
+ struct vring_used_elem *heads,
+ int datalen,
+ unsigned *iovcount,
+ struct vhost_log *log,
+ unsigned *log_num,
+ unsigned int quota)
+{
+ unsigned int out, in;
+ int seg = 0;
+ int headcount = 0;
+ unsigned d;
+ int r, nlogs = 0;
+ /* len is always initialized before use since we are always called with
+ * datalen > 0.
+ */
+ u32 uninitialized_var(len);
+
+ while (datalen > 0 && headcount < quota) {
+ if (unlikely(seg >= UIO_MAXIOV)) {
+ r = -ENOBUFS;
+ goto err;
+ }
+ r = vhost_get_vq_desc(vq, vq->iov + seg,
+ ARRAY_SIZE(vq->iov) - seg, &out,
+ &in, log, log_num);
+ if (unlikely(r < 0))
+ goto err;
+
+ d = r;
+ if (d == vq->num) {
+ r = 0;
+ goto err;
+ }
+ if (unlikely(out || in <= 0)) {
+ vq_err(vq, "unexpected descriptor format for RX: "
+ "out %d, in %d\n", out, in);
+ r = -EINVAL;
+ goto err;
+ }
+ if (unlikely(log)) {
+ nlogs += *log_num;
+ log += *log_num;
+ }
+ heads[headcount].id = cpu_to_vhost32(vq, d);
+ len = iov_length(vq->iov + seg, in);
+ heads[headcount].len = cpu_to_vhost32(vq, len);
+ datalen -= len;
+ ++headcount;
+ seg += in;
+ }
+ heads[headcount - 1].len = cpu_to_vhost32(vq, len + datalen);
+ *iovcount = seg;
+ if (unlikely(log))
+ *log_num = nlogs;
+
+ /* Detect overrun */
+ if (unlikely(datalen > 0)) {
+ r = UIO_MAXIOV + 1;
+ goto err;
+ }
+ return headcount;
+err:
+ vhost_discard_vq_desc(vq, headcount);
+ return r;
+}
+EXPORT_SYMBOL_GPL(vhost_get_bufs);
+
/* Reverse the effect of vhost_get_vq_desc. Useful for error handling. */
void vhost_discard_vq_desc(struct vhost_virtqueue *vq, int n)
{
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 6c844b9..52edd242 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -185,6 +185,13 @@ int vhost_get_vq_desc(struct vhost_virtqueue *,
struct iovec iov[], unsigned int iov_count,
unsigned int *out_num, unsigned int *in_num,
struct vhost_log *log, unsigned int *log_num);
+int vhost_get_bufs(struct vhost_virtqueue *vq,
+ struct vring_used_elem *heads,
+ int datalen,
+ unsigned *iovcount,
+ struct vhost_log *log,
+ unsigned *log_num,
+ unsigned int quota);
void vhost_discard_vq_desc(struct vhost_virtqueue *, int n);
int vhost_vq_init_access(struct vhost_virtqueue *);
--
2.7.4
^ permalink raw reply related
* [RFC V5 PATCH 2/8] vhost: hide used ring layout from device
From: Jason Wang @ 2018-05-29 2:10 UTC (permalink / raw)
To: mst, jasowang; +Cc: kvm, netdev, linux-kernel, virtualization, wexu
In-Reply-To: <1527559830-8133-1-git-send-email-jasowang@redhat.com>
We used to return descriptor head by vhost_get_vq_desc() to device and
pass it back to vhost_add_used() and its friends. This exposes the
internal used ring layout to device which makes it hard to be extended for
e.g packed ring layout.
So this patch tries to hide the used ring layout by
- letting vhost_get_vq_desc() return pointer to struct vring_used_elem
- accepting pointer to struct vring_used_elem in vhost_add_used() and
vhost_add_used_and_signal()
This could help to hide used ring layout and make it easier to
implement packed ring on top.
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
drivers/vhost/net.c | 46 +++++++++++++++++++++-----------------
drivers/vhost/scsi.c | 62 +++++++++++++++++++++++++++------------------------
drivers/vhost/vhost.c | 52 +++++++++++++++++++++---------------------
drivers/vhost/vhost.h | 9 +++++---
drivers/vhost/vsock.c | 42 +++++++++++++++++-----------------
5 files changed, 112 insertions(+), 99 deletions(-)
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 762aa81..826489c 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -426,22 +426,24 @@ static int vhost_net_enable_vq(struct vhost_net *n,
static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
struct vhost_virtqueue *vq,
+ struct vring_used_elem *used_elem,
struct iovec iov[], unsigned int iov_size,
unsigned int *out_num, unsigned int *in_num)
{
unsigned long uninitialized_var(endtime);
- int r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
+ int r = vhost_get_vq_desc(vq, used_elem, vq->iov, ARRAY_SIZE(vq->iov),
out_num, in_num, NULL, NULL);
- if (r == vq->num && vq->busyloop_timeout) {
+ if (r == -ENOSPC && vq->busyloop_timeout) {
preempt_disable();
endtime = busy_clock() + vq->busyloop_timeout;
while (vhost_can_busy_poll(vq->dev, endtime) &&
vhost_vq_avail_empty(vq->dev, vq))
cpu_relax();
preempt_enable();
- r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
- out_num, in_num, NULL, NULL);
+ r = vhost_get_vq_desc(vq, used_elem, vq->iov,
+ ARRAY_SIZE(vq->iov), out_num, in_num,
+ NULL, NULL);
}
return r;
@@ -463,7 +465,6 @@ static void handle_tx(struct vhost_net *net)
struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
struct vhost_virtqueue *vq = &nvq->vq;
unsigned out, in;
- int head;
struct msghdr msg = {
.msg_name = NULL,
.msg_namelen = 0,
@@ -476,6 +477,7 @@ static void handle_tx(struct vhost_net *net)
size_t hdr_size;
struct socket *sock;
struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
+ struct vring_used_elem used;
bool zcopy, zcopy_used;
int sent_pkts = 0;
@@ -499,20 +501,20 @@ static void handle_tx(struct vhost_net *net)
vhost_zerocopy_signal_used(net, vq);
- head = vhost_net_tx_get_vq_desc(net, vq, vq->iov,
- ARRAY_SIZE(vq->iov),
- &out, &in);
- /* On error, stop handling until the next kick. */
- if (unlikely(head < 0))
- break;
+ err = vhost_net_tx_get_vq_desc(net, vq, &used, vq->iov,
+ ARRAY_SIZE(vq->iov),
+ &out, &in);
/* Nothing new? Wait for eventfd to tell us they refilled. */
- if (head == vq->num) {
+ if (err == -ENOSPC) {
if (unlikely(vhost_enable_notify(&net->dev, vq))) {
vhost_disable_notify(&net->dev, vq);
continue;
}
break;
}
+ /* On error, stop handling until the next kick. */
+ if (unlikely(err < 0))
+ break;
if (in) {
vq_err(vq, "Unexpected descriptor format for TX: "
"out %d, int %d\n", out, in);
@@ -540,7 +542,8 @@ static void handle_tx(struct vhost_net *net)
struct ubuf_info *ubuf;
ubuf = nvq->ubuf_info + nvq->upend_idx;
- vq->heads[nvq->upend_idx].id = cpu_to_vhost32(vq, head);
+ vq->heads[nvq->upend_idx].id =
+ cpu_to_vhost32(vq, used.id);
vq->heads[nvq->upend_idx].len = VHOST_DMA_IN_PROGRESS;
ubuf->callback = vhost_zerocopy_callback;
ubuf->ctx = nvq->ubufs;
@@ -581,7 +584,7 @@ static void handle_tx(struct vhost_net *net)
pr_debug("Truncated TX packet: "
" len %d != %zd\n", err, len);
if (!zcopy_used)
- vhost_add_used_and_signal(&net->dev, vq, head, 0);
+ vhost_add_used_and_signal(&net->dev, vq, &used, 0);
else
vhost_zerocopy_signal_used(net, vq);
vhost_net_tx_packet(net);
@@ -713,14 +716,12 @@ static void handle_rx(struct vhost_net *net)
while ((sock_len = vhost_net_rx_peek_head_len(net, sock->sk))) {
sock_len += sock_hlen;
vhost_len = sock_len + vhost_hlen;
- headcount = vhost_get_bufs(vq, vq->heads + nheads, vhost_len,
- &in, vq_log, &log,
- likely(mergeable) ? UIO_MAXIOV : 1);
- /* On error, stop handling until the next kick. */
- if (unlikely(headcount < 0))
- goto out;
+ err = vhost_get_bufs(vq, vq->heads + nheads, vhost_len,
+ &in, vq_log, &log,
+ likely(mergeable) ? UIO_MAXIOV : 1,
+ &headcount);
/* OK, now we need to know about added descriptors. */
- if (!headcount) {
+ if (err == -ENOSPC) {
if (unlikely(vhost_enable_notify(&net->dev, vq))) {
/* They have slipped one in as we were
* doing that: check again. */
@@ -731,6 +732,9 @@ static void handle_rx(struct vhost_net *net)
* they refilled. */
goto out;
}
+ /* On error, stop handling until the next kick. */
+ if (unlikely(err < 0))
+ goto out;
if (nvq->rx_ring)
msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
/* On overrun, truncate and discard */
diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 7ad5709..654c71f 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -67,7 +67,7 @@ struct vhost_scsi_inflight {
struct vhost_scsi_cmd {
/* Descriptor from vhost_get_vq_desc() for virt_queue segment */
- int tvc_vq_desc;
+ struct vring_used_elem tvc_vq_used;
/* virtio-scsi initiator task attribute */
int tvc_task_attr;
/* virtio-scsi response incoming iovecs */
@@ -441,8 +441,9 @@ vhost_scsi_do_evt_work(struct vhost_scsi *vs, struct vhost_scsi_evt *evt)
struct vhost_virtqueue *vq = &vs->vqs[VHOST_SCSI_VQ_EVT].vq;
struct virtio_scsi_event *event = &evt->event;
struct virtio_scsi_event __user *eventp;
+ struct vring_used_elem used;
unsigned out, in;
- int head, ret;
+ int ret;
if (!vq->private_data) {
vs->vs_events_missed = true;
@@ -451,16 +452,16 @@ vhost_scsi_do_evt_work(struct vhost_scsi *vs, struct vhost_scsi_evt *evt)
again:
vhost_disable_notify(&vs->dev, vq);
- head = vhost_get_vq_desc(vq, vq->iov,
+ ret = vhost_get_vq_desc(vq, &used, vq->iov,
ARRAY_SIZE(vq->iov), &out, &in,
NULL, NULL);
- if (head < 0) {
+ if (ret == -ENOSPC) {
+ if (vhost_enable_notify(&vs->dev, vq))
+ goto again;
vs->vs_events_missed = true;
return;
}
- if (head == vq->num) {
- if (vhost_enable_notify(&vs->dev, vq))
- goto again;
+ if (ret < 0) {
vs->vs_events_missed = true;
return;
}
@@ -480,7 +481,7 @@ vhost_scsi_do_evt_work(struct vhost_scsi *vs, struct vhost_scsi_evt *evt)
eventp = vq->iov[out].iov_base;
ret = __copy_to_user(eventp, event, sizeof(*event));
if (!ret)
- vhost_add_used_and_signal(&vs->dev, vq, head, 0);
+ vhost_add_used_and_signal(&vs->dev, vq, &used, 0);
else
vq_err(vq, "Faulted on vhost_scsi_send_event\n");
}
@@ -541,7 +542,7 @@ static void vhost_scsi_complete_cmd_work(struct vhost_work *work)
ret = copy_to_iter(&v_rsp, sizeof(v_rsp), &iov_iter);
if (likely(ret == sizeof(v_rsp))) {
struct vhost_scsi_virtqueue *q;
- vhost_add_used(cmd->tvc_vq, cmd->tvc_vq_desc, 0);
+ vhost_add_used(cmd->tvc_vq, &cmd->tvc_vq_used, 0);
q = container_of(cmd->tvc_vq, struct vhost_scsi_virtqueue, vq);
vq = q - vs->vqs;
__set_bit(vq, signal);
@@ -784,7 +785,7 @@ static void vhost_scsi_submission_work(struct work_struct *work)
static void
vhost_scsi_send_bad_target(struct vhost_scsi *vs,
struct vhost_virtqueue *vq,
- int head, unsigned out)
+ struct vring_used_elem *used, unsigned out)
{
struct virtio_scsi_cmd_resp __user *resp;
struct virtio_scsi_cmd_resp rsp;
@@ -795,7 +796,7 @@ vhost_scsi_send_bad_target(struct vhost_scsi *vs,
resp = vq->iov[out].iov_base;
ret = __copy_to_user(resp, &rsp, sizeof(rsp));
if (!ret)
- vhost_add_used_and_signal(&vs->dev, vq, head, 0);
+ vhost_add_used_and_signal(&vs->dev, vq, used, 0);
else
pr_err("Faulted on virtio_scsi_cmd_resp\n");
}
@@ -807,11 +808,12 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq)
struct virtio_scsi_cmd_req v_req;
struct virtio_scsi_cmd_req_pi v_req_pi;
struct vhost_scsi_cmd *cmd;
+ struct vring_used_elem used;
struct iov_iter out_iter, in_iter, prot_iter, data_iter;
u64 tag;
u32 exp_data_len, data_direction;
unsigned int out = 0, in = 0;
- int head, ret, prot_bytes;
+ int ret, prot_bytes;
size_t req_size, rsp_size = sizeof(struct virtio_scsi_cmd_resp);
size_t out_size, in_size;
u16 lun;
@@ -831,22 +833,22 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq)
vhost_disable_notify(&vs->dev, vq);
for (;;) {
- head = vhost_get_vq_desc(vq, vq->iov,
- ARRAY_SIZE(vq->iov), &out, &in,
- NULL, NULL);
+ ret = vhost_get_vq_desc(vq, &used, vq->iov,
+ ARRAY_SIZE(vq->iov), &out, &in,
+ NULL, NULL);
pr_debug("vhost_get_vq_desc: head: %d, out: %u in: %u\n",
- head, out, in);
- /* On error, stop handling until the next kick. */
- if (unlikely(head < 0))
- break;
+ used.id, out, in);
/* Nothing new? Wait for eventfd to tell us they refilled. */
- if (head == vq->num) {
+ if (ret == -ENOSPC) {
if (unlikely(vhost_enable_notify(&vs->dev, vq))) {
vhost_disable_notify(&vs->dev, vq);
continue;
}
break;
}
+ /* On error, stop handling until the next kick. */
+ if (unlikely(ret < 0))
+ break;
/*
* Check for a sane response buffer so we can report early
* errors back to the guest.
@@ -891,20 +893,20 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq)
if (unlikely(!copy_from_iter_full(req, req_size, &out_iter))) {
vq_err(vq, "Faulted on copy_from_iter\n");
- vhost_scsi_send_bad_target(vs, vq, head, out);
+ vhost_scsi_send_bad_target(vs, vq, &used, out);
continue;
}
/* virtio-scsi spec requires byte 0 of the lun to be 1 */
if (unlikely(*lunp != 1)) {
vq_err(vq, "Illegal virtio-scsi lun: %u\n", *lunp);
- vhost_scsi_send_bad_target(vs, vq, head, out);
+ vhost_scsi_send_bad_target(vs, vq, &used, out);
continue;
}
tpg = READ_ONCE(vs_tpg[*target]);
if (unlikely(!tpg)) {
/* Target does not exist, fail the request */
- vhost_scsi_send_bad_target(vs, vq, head, out);
+ vhost_scsi_send_bad_target(vs, vq, &used, out);
continue;
}
/*
@@ -950,7 +952,8 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq)
if (data_direction != DMA_TO_DEVICE) {
vq_err(vq, "Received non zero pi_bytesout,"
" but wrong data_direction\n");
- vhost_scsi_send_bad_target(vs, vq, head, out);
+ vhost_scsi_send_bad_target(vs, vq,
+ &used, out);
continue;
}
prot_bytes = vhost32_to_cpu(vq, v_req_pi.pi_bytesout);
@@ -958,7 +961,8 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq)
if (data_direction != DMA_FROM_DEVICE) {
vq_err(vq, "Received non zero pi_bytesin,"
" but wrong data_direction\n");
- vhost_scsi_send_bad_target(vs, vq, head, out);
+ vhost_scsi_send_bad_target(vs, vq,
+ &used, out);
continue;
}
prot_bytes = vhost32_to_cpu(vq, v_req_pi.pi_bytesin);
@@ -996,7 +1000,7 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq)
vq_err(vq, "Received SCSI CDB with command_size: %d that"
" exceeds SCSI_MAX_VARLEN_CDB_SIZE: %d\n",
scsi_command_size(cdb), VHOST_SCSI_MAX_CDB_SIZE);
- vhost_scsi_send_bad_target(vs, vq, head, out);
+ vhost_scsi_send_bad_target(vs, vq, &used, out);
continue;
}
cmd = vhost_scsi_get_tag(vq, tpg, cdb, tag, lun, task_attr,
@@ -1005,7 +1009,7 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq)
if (IS_ERR(cmd)) {
vq_err(vq, "vhost_scsi_get_tag failed %ld\n",
PTR_ERR(cmd));
- vhost_scsi_send_bad_target(vs, vq, head, out);
+ vhost_scsi_send_bad_target(vs, vq, &used, out);
continue;
}
cmd->tvc_vhost = vs;
@@ -1025,7 +1029,7 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq)
if (unlikely(ret)) {
vq_err(vq, "Failed to map iov to sgl\n");
vhost_scsi_release_cmd(&cmd->tvc_se_cmd);
- vhost_scsi_send_bad_target(vs, vq, head, out);
+ vhost_scsi_send_bad_target(vs, vq, &used, out);
continue;
}
}
@@ -1034,7 +1038,7 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq)
* complete the virtio-scsi request in TCM callback context via
* vhost_scsi_queue_data_in() and vhost_scsi_queue_status()
*/
- cmd->tvc_vq_desc = head;
+ cmd->tvc_vq_used = used;
/*
* Dispatch cmd descriptor for cmwq execution in process
* context provided by vhost_scsi_workqueue. This also ensures
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 096a688..296bd5e 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1958,6 +1958,7 @@ static int get_indirect(struct vhost_virtqueue *vq,
* never a valid descriptor number) if none was found. A negative code is
* returned on error. */
int vhost_get_vq_desc(struct vhost_virtqueue *vq,
+ struct vring_used_elem *used,
struct iovec iov[], unsigned int iov_size,
unsigned int *out_num, unsigned int *in_num,
struct vhost_log *log, unsigned int *log_num)
@@ -1990,7 +1991,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
* invalid.
*/
if (vq->avail_idx == last_avail_idx)
- return vq->num;
+ return -ENOSPC;
/* Only get avail ring entries after they have been
* exposed by guest.
@@ -2008,6 +2009,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
return -EFAULT;
}
+ used->id = ring_head;
head = vhost16_to_cpu(vq, ring_head);
/* If their number is silly, that's an error. */
@@ -2096,10 +2098,16 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
/* Assume notifications from guest are disabled at this point,
* if they aren't we would need to update avail_event index. */
BUG_ON(!(vq->used_flags & VRING_USED_F_NO_NOTIFY));
- return head;
+ return 0;
}
EXPORT_SYMBOL_GPL(vhost_get_vq_desc);
+static void vhost_set_used_len(struct vhost_virtqueue *vq,
+ struct vring_used_elem *used, int len)
+{
+ used->len = cpu_to_vhost32(vq, len);
+}
+
/* This is a multi-buffer version of vhost_get_desc, that works if
* vq has read descriptors only.
* @vq - the relevant virtqueue
@@ -2116,13 +2124,13 @@ int vhost_get_bufs(struct vhost_virtqueue *vq,
unsigned *iovcount,
struct vhost_log *log,
unsigned *log_num,
- unsigned int quota)
+ unsigned int quota,
+ s16 *count)
{
unsigned int out, in;
int seg = 0;
int headcount = 0;
- unsigned d;
- int r, nlogs = 0;
+ int r = 0, nlogs = 0;
/* len is always initialized before use since we are always called with
* datalen > 0.
*/
@@ -2133,17 +2141,12 @@ int vhost_get_bufs(struct vhost_virtqueue *vq,
r = -ENOBUFS;
goto err;
}
- r = vhost_get_vq_desc(vq, vq->iov + seg,
+ r = vhost_get_vq_desc(vq, &heads[headcount], vq->iov + seg,
ARRAY_SIZE(vq->iov) - seg, &out,
&in, log, log_num);
if (unlikely(r < 0))
goto err;
- d = r;
- if (d == vq->num) {
- r = 0;
- goto err;
- }
if (unlikely(out || in <= 0)) {
vq_err(vq, "unexpected descriptor format for RX: "
"out %d, in %d\n", out, in);
@@ -2154,24 +2157,26 @@ int vhost_get_bufs(struct vhost_virtqueue *vq,
nlogs += *log_num;
log += *log_num;
}
- heads[headcount].id = cpu_to_vhost32(vq, d);
+
len = iov_length(vq->iov + seg, in);
- heads[headcount].len = cpu_to_vhost32(vq, len);
+ vhost_set_used_len(vq, &heads[headcount], len);
datalen -= len;
++headcount;
seg += in;
}
- heads[headcount - 1].len = cpu_to_vhost32(vq, len + datalen);
+ vhost_set_used_len(vq, &heads[headcount - 1], len + datalen);
*iovcount = seg;
if (unlikely(log))
*log_num = nlogs;
/* Detect overrun */
if (unlikely(datalen > 0)) {
- r = UIO_MAXIOV + 1;
+ headcount = UIO_MAXIOV + 1;
goto err;
}
- return headcount;
+
+ *count = headcount;
+ return 0;
err:
vhost_discard_vq_desc(vq, headcount);
return r;
@@ -2187,14 +2192,11 @@ EXPORT_SYMBOL_GPL(vhost_discard_vq_desc);
/* After we've used one of their buffers, we tell them about it. We'll then
* want to notify the guest, using eventfd. */
-int vhost_add_used(struct vhost_virtqueue *vq, unsigned int head, int len)
+int vhost_add_used(struct vhost_virtqueue *vq, struct vring_used_elem *used,
+ int len)
{
- struct vring_used_elem heads = {
- cpu_to_vhost32(vq, head),
- cpu_to_vhost32(vq, len)
- };
-
- return vhost_add_used_n(vq, &heads, 1);
+ vhost_set_used_len(vq, used, len);
+ return vhost_add_used_n(vq, used, 1);
}
EXPORT_SYMBOL_GPL(vhost_add_used);
@@ -2327,9 +2329,9 @@ EXPORT_SYMBOL_GPL(vhost_signal);
/* And here's the combo meal deal. Supersize me! */
void vhost_add_used_and_signal(struct vhost_dev *dev,
struct vhost_virtqueue *vq,
- unsigned int head, int len)
+ struct vring_used_elem *used, int len)
{
- vhost_add_used(vq, head, len);
+ vhost_add_used(vq, used, len);
vhost_signal(dev, vq);
}
EXPORT_SYMBOL_GPL(vhost_add_used_and_signal);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 52edd242..a7cc7e7 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -182,6 +182,7 @@ bool vhost_vq_access_ok(struct vhost_virtqueue *vq);
bool vhost_log_access_ok(struct vhost_dev *);
int vhost_get_vq_desc(struct vhost_virtqueue *,
+ struct vring_used_elem *used_elem,
struct iovec iov[], unsigned int iov_count,
unsigned int *out_num, unsigned int *in_num,
struct vhost_log *log, unsigned int *log_num);
@@ -191,15 +192,17 @@ int vhost_get_bufs(struct vhost_virtqueue *vq,
unsigned *iovcount,
struct vhost_log *log,
unsigned *log_num,
- unsigned int quota);
+ unsigned int quota,
+ s16 *count);
void vhost_discard_vq_desc(struct vhost_virtqueue *, int n);
int vhost_vq_init_access(struct vhost_virtqueue *);
-int vhost_add_used(struct vhost_virtqueue *, unsigned int head, int len);
+int vhost_add_used(struct vhost_virtqueue *vq,
+ struct vring_used_elem *elem, int len);
int vhost_add_used_n(struct vhost_virtqueue *, struct vring_used_elem *heads,
unsigned count);
void vhost_add_used_and_signal(struct vhost_dev *, struct vhost_virtqueue *,
- unsigned int id, int len);
+ struct vring_used_elem *, int len);
void vhost_add_used_and_signal_n(struct vhost_dev *, struct vhost_virtqueue *,
struct vring_used_elem *heads, unsigned count);
void vhost_signal(struct vhost_dev *, struct vhost_virtqueue *);
diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 34bc3ab..59a01cd 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -98,11 +98,12 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
for (;;) {
struct virtio_vsock_pkt *pkt;
+ struct vring_used_elem used;
struct iov_iter iov_iter;
unsigned out, in;
size_t nbytes;
size_t len;
- int head;
+ int ret;
spin_lock_bh(&vsock->send_pkt_list_lock);
if (list_empty(&vsock->send_pkt_list)) {
@@ -116,16 +117,9 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
list_del_init(&pkt->list);
spin_unlock_bh(&vsock->send_pkt_list_lock);
- head = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
- &out, &in, NULL, NULL);
- if (head < 0) {
- spin_lock_bh(&vsock->send_pkt_list_lock);
- list_add(&pkt->list, &vsock->send_pkt_list);
- spin_unlock_bh(&vsock->send_pkt_list_lock);
- break;
- }
-
- if (head == vq->num) {
+ ret = vhost_get_vq_desc(vq, &used, vq->iov, ARRAY_SIZE(vq->iov),
+ &out, &in, NULL, NULL);
+ if (ret == -ENOSPC) {
spin_lock_bh(&vsock->send_pkt_list_lock);
list_add(&pkt->list, &vsock->send_pkt_list);
spin_unlock_bh(&vsock->send_pkt_list_lock);
@@ -139,6 +133,12 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
}
break;
}
+ if (ret < 0) {
+ spin_lock_bh(&vsock->send_pkt_list_lock);
+ list_add(&pkt->list, &vsock->send_pkt_list);
+ spin_unlock_bh(&vsock->send_pkt_list_lock);
+ break;
+ }
if (out) {
virtio_transport_free_pkt(pkt);
@@ -146,7 +146,7 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
break;
}
- len = iov_length(&vq->iov[out], in);
+ len = vhost32_to_cpu(vq, used.len);
iov_iter_init(&iov_iter, READ, &vq->iov[out], in, len);
nbytes = copy_to_iter(&pkt->hdr, sizeof(pkt->hdr), &iov_iter);
@@ -163,7 +163,7 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
break;
}
- vhost_add_used(vq, head, sizeof(pkt->hdr) + pkt->len);
+ vhost_add_used(vq, &used, sizeof(pkt->hdr) + pkt->len);
added = true;
if (pkt->reply) {
@@ -346,7 +346,8 @@ static void vhost_vsock_handle_tx_kick(struct vhost_work *work)
struct vhost_vsock *vsock = container_of(vq->dev, struct vhost_vsock,
dev);
struct virtio_vsock_pkt *pkt;
- int head;
+ struct vring_used_elem used;
+ int ret;
unsigned int out, in;
bool added = false;
@@ -367,18 +368,17 @@ static void vhost_vsock_handle_tx_kick(struct vhost_work *work)
goto no_more_replies;
}
- head = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
- &out, &in, NULL, NULL);
- if (head < 0)
- break;
-
- if (head == vq->num) {
+ ret = vhost_get_vq_desc(vq, &used, vq->iov, ARRAY_SIZE(vq->iov),
+ &out, &in, NULL, NULL);
+ if (ret == -ENOSPC) {
if (unlikely(vhost_enable_notify(&vsock->dev, vq))) {
vhost_disable_notify(&vsock->dev, vq);
continue;
}
break;
}
+ if (ret < 0)
+ break;
pkt = vhost_vsock_alloc_pkt(vq, out, in);
if (!pkt) {
@@ -397,7 +397,7 @@ static void vhost_vsock_handle_tx_kick(struct vhost_work *work)
else
virtio_transport_free_pkt(pkt);
- vhost_add_used(vq, head, sizeof(pkt->hdr) + len);
+ vhost_add_used(vq, &used, sizeof(pkt->hdr) + len);
added = true;
}
--
2.7.4
^ permalink raw reply related
* [RFC V5 PATCH 3/8] vhost: do not use vring_used_elem
From: Jason Wang @ 2018-05-29 2:10 UTC (permalink / raw)
To: mst, jasowang; +Cc: kvm, netdev, linux-kernel, virtualization, wexu
In-Reply-To: <1527559830-8133-1-git-send-email-jasowang@redhat.com>
Instead of depending on the exported vring_used_elem, this patch
switches to use a new internal structure vhost_used_elem which embed
vring_used_elem in itself. This could be used to let vhost to record
extra metadata for the incoming packed ring layout.
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
drivers/vhost/net.c | 19 +++++++-------
drivers/vhost/scsi.c | 10 ++++----
drivers/vhost/vhost.c | 68 ++++++++++++---------------------------------------
drivers/vhost/vhost.h | 18 ++++++++------
drivers/vhost/vsock.c | 6 ++---
5 files changed, 45 insertions(+), 76 deletions(-)
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 826489c..3826f1f 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -341,10 +341,10 @@ static void vhost_zerocopy_signal_used(struct vhost_net *net,
int j = 0;
for (i = nvq->done_idx; i != nvq->upend_idx; i = (i + 1) % UIO_MAXIOV) {
- if (vq->heads[i].len == VHOST_DMA_FAILED_LEN)
+ if (vq->heads[i].elem.len == VHOST_DMA_FAILED_LEN)
vhost_net_tx_err(net);
- if (VHOST_DMA_IS_DONE(vq->heads[i].len)) {
- vq->heads[i].len = VHOST_DMA_CLEAR_LEN;
+ if (VHOST_DMA_IS_DONE(vq->heads[i].elem.len)) {
+ vq->heads[i].elem.len = VHOST_DMA_CLEAR_LEN;
++j;
} else
break;
@@ -367,7 +367,7 @@ static void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool success)
rcu_read_lock_bh();
/* set len to mark this desc buffers done DMA */
- vq->heads[ubuf->desc].len = success ?
+ vq->heads[ubuf->desc].elem.len = success ?
VHOST_DMA_DONE_LEN : VHOST_DMA_FAILED_LEN;
cnt = vhost_net_ubuf_put(ubufs);
@@ -426,7 +426,7 @@ static int vhost_net_enable_vq(struct vhost_net *n,
static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
struct vhost_virtqueue *vq,
- struct vring_used_elem *used_elem,
+ struct vhost_used_elem *used_elem,
struct iovec iov[], unsigned int iov_size,
unsigned int *out_num, unsigned int *in_num)
{
@@ -477,7 +477,7 @@ static void handle_tx(struct vhost_net *net)
size_t hdr_size;
struct socket *sock;
struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
- struct vring_used_elem used;
+ struct vhost_used_elem used;
bool zcopy, zcopy_used;
int sent_pkts = 0;
@@ -542,9 +542,10 @@ static void handle_tx(struct vhost_net *net)
struct ubuf_info *ubuf;
ubuf = nvq->ubuf_info + nvq->upend_idx;
- vq->heads[nvq->upend_idx].id =
- cpu_to_vhost32(vq, used.id);
- vq->heads[nvq->upend_idx].len = VHOST_DMA_IN_PROGRESS;
+ vq->heads[nvq->upend_idx].elem.id =
+ cpu_to_vhost32(vq, used.elem.id);
+ vq->heads[nvq->upend_idx].elem.len =
+ VHOST_DMA_IN_PROGRESS;
ubuf->callback = vhost_zerocopy_callback;
ubuf->ctx = nvq->ubufs;
ubuf->desc = nvq->upend_idx;
diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 654c71f..ac11412 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -67,7 +67,7 @@ struct vhost_scsi_inflight {
struct vhost_scsi_cmd {
/* Descriptor from vhost_get_vq_desc() for virt_queue segment */
- struct vring_used_elem tvc_vq_used;
+ struct vhost_used_elem tvc_vq_used;
/* virtio-scsi initiator task attribute */
int tvc_task_attr;
/* virtio-scsi response incoming iovecs */
@@ -441,7 +441,7 @@ vhost_scsi_do_evt_work(struct vhost_scsi *vs, struct vhost_scsi_evt *evt)
struct vhost_virtqueue *vq = &vs->vqs[VHOST_SCSI_VQ_EVT].vq;
struct virtio_scsi_event *event = &evt->event;
struct virtio_scsi_event __user *eventp;
- struct vring_used_elem used;
+ struct vhost_used_elem used;
unsigned out, in;
int ret;
@@ -785,7 +785,7 @@ static void vhost_scsi_submission_work(struct work_struct *work)
static void
vhost_scsi_send_bad_target(struct vhost_scsi *vs,
struct vhost_virtqueue *vq,
- struct vring_used_elem *used, unsigned out)
+ struct vhost_used_elem *used, unsigned out)
{
struct virtio_scsi_cmd_resp __user *resp;
struct virtio_scsi_cmd_resp rsp;
@@ -808,7 +808,7 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq)
struct virtio_scsi_cmd_req v_req;
struct virtio_scsi_cmd_req_pi v_req_pi;
struct vhost_scsi_cmd *cmd;
- struct vring_used_elem used;
+ struct vhost_used_elem used;
struct iov_iter out_iter, in_iter, prot_iter, data_iter;
u64 tag;
u32 exp_data_len, data_direction;
@@ -837,7 +837,7 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq)
ARRAY_SIZE(vq->iov), &out, &in,
NULL, NULL);
pr_debug("vhost_get_vq_desc: head: %d, out: %u in: %u\n",
- used.id, out, in);
+ used.elem.id, out, in);
/* Nothing new? Wait for eventfd to tell us they refilled. */
if (ret == -ENOSPC) {
if (unlikely(vhost_enable_notify(&vs->dev, vq))) {
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 296bd5e..2b2a776 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -728,41 +728,6 @@ static bool memory_access_ok(struct vhost_dev *d, struct vhost_umem *umem,
static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
struct iovec iov[], int iov_size, int access);
-static int vhost_copy_to_user(struct vhost_virtqueue *vq, void __user *to,
- const void *from, unsigned size)
-{
- int ret;
-
- if (!vq->iotlb)
- return __copy_to_user(to, from, size);
- else {
- /* This function should be called after iotlb
- * prefetch, which means we're sure that all vq
- * could be access through iotlb. So -EAGAIN should
- * not happen in this case.
- */
- struct iov_iter t;
- void __user *uaddr = vhost_vq_meta_fetch(vq,
- (u64)(uintptr_t)to, size,
- VHOST_ADDR_USED);
-
- if (uaddr)
- return __copy_to_user(uaddr, from, size);
-
- ret = translate_desc(vq, (u64)(uintptr_t)to, size, vq->iotlb_iov,
- ARRAY_SIZE(vq->iotlb_iov),
- VHOST_ACCESS_WO);
- if (ret < 0)
- goto out;
- iov_iter_init(&t, WRITE, vq->iotlb_iov, ret, size);
- ret = copy_to_iter(from, size, &t);
- if (ret == size)
- ret = 0;
- }
-out:
- return ret;
-}
-
static int vhost_copy_from_user(struct vhost_virtqueue *vq, void *to,
void __user *from, unsigned size)
{
@@ -1958,7 +1923,7 @@ static int get_indirect(struct vhost_virtqueue *vq,
* never a valid descriptor number) if none was found. A negative code is
* returned on error. */
int vhost_get_vq_desc(struct vhost_virtqueue *vq,
- struct vring_used_elem *used,
+ struct vhost_used_elem *used,
struct iovec iov[], unsigned int iov_size,
unsigned int *out_num, unsigned int *in_num,
struct vhost_log *log, unsigned int *log_num)
@@ -2009,7 +1974,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
return -EFAULT;
}
- used->id = ring_head;
+ used->elem.id = ring_head;
head = vhost16_to_cpu(vq, ring_head);
/* If their number is silly, that's an error. */
@@ -2103,9 +2068,9 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
EXPORT_SYMBOL_GPL(vhost_get_vq_desc);
static void vhost_set_used_len(struct vhost_virtqueue *vq,
- struct vring_used_elem *used, int len)
+ struct vhost_used_elem *used, int len)
{
- used->len = cpu_to_vhost32(vq, len);
+ used->elem.len = cpu_to_vhost32(vq, len);
}
/* This is a multi-buffer version of vhost_get_desc, that works if
@@ -2119,7 +2084,7 @@ static void vhost_set_used_len(struct vhost_virtqueue *vq,
* returns number of buffer heads allocated, negative on error
*/
int vhost_get_bufs(struct vhost_virtqueue *vq,
- struct vring_used_elem *heads,
+ struct vhost_used_elem *heads,
int datalen,
unsigned *iovcount,
struct vhost_log *log,
@@ -2192,7 +2157,7 @@ EXPORT_SYMBOL_GPL(vhost_discard_vq_desc);
/* After we've used one of their buffers, we tell them about it. We'll then
* want to notify the guest, using eventfd. */
-int vhost_add_used(struct vhost_virtqueue *vq, struct vring_used_elem *used,
+int vhost_add_used(struct vhost_virtqueue *vq, struct vhost_used_elem *used,
int len)
{
vhost_set_used_len(vq, used, len);
@@ -2201,27 +2166,26 @@ int vhost_add_used(struct vhost_virtqueue *vq, struct vring_used_elem *used,
EXPORT_SYMBOL_GPL(vhost_add_used);
static int __vhost_add_used_n(struct vhost_virtqueue *vq,
- struct vring_used_elem *heads,
+ struct vhost_used_elem *heads,
unsigned count)
{
struct vring_used_elem __user *used;
u16 old, new;
- int start;
+ int start, i;
start = vq->last_used_idx & (vq->num - 1);
used = vq->used->ring + start;
- if (count == 1) {
- if (vhost_put_user(vq, heads[0].id, &used->id)) {
+ for (i = 0; i < count; i++) {
+ if (unlikely(vhost_put_user(vq, heads[i].elem.id,
+ &used[i].id))) {
vq_err(vq, "Failed to write used id");
return -EFAULT;
}
- if (vhost_put_user(vq, heads[0].len, &used->len)) {
+ if (unlikely(vhost_put_user(vq, heads[i].elem.len,
+ &used[i].len))) {
vq_err(vq, "Failed to write used len");
return -EFAULT;
}
- } else if (vhost_copy_to_user(vq, used, heads, count * sizeof *used)) {
- vq_err(vq, "Failed to write used");
- return -EFAULT;
}
if (unlikely(vq->log_used)) {
/* Make sure data is seen before log. */
@@ -2245,7 +2209,7 @@ static int __vhost_add_used_n(struct vhost_virtqueue *vq,
/* After we've used one of their buffers, we tell them about it. We'll then
* want to notify the guest, using eventfd. */
-int vhost_add_used_n(struct vhost_virtqueue *vq, struct vring_used_elem *heads,
+int vhost_add_used_n(struct vhost_virtqueue *vq, struct vhost_used_elem *heads,
unsigned count)
{
int start, n, r;
@@ -2329,7 +2293,7 @@ EXPORT_SYMBOL_GPL(vhost_signal);
/* And here's the combo meal deal. Supersize me! */
void vhost_add_used_and_signal(struct vhost_dev *dev,
struct vhost_virtqueue *vq,
- struct vring_used_elem *used, int len)
+ struct vhost_used_elem *used, int len)
{
vhost_add_used(vq, used, len);
vhost_signal(dev, vq);
@@ -2339,7 +2303,7 @@ EXPORT_SYMBOL_GPL(vhost_add_used_and_signal);
/* multi-buffer version of vhost_add_used_and_signal */
void vhost_add_used_and_signal_n(struct vhost_dev *dev,
struct vhost_virtqueue *vq,
- struct vring_used_elem *heads, unsigned count)
+ struct vhost_used_elem *heads, unsigned count)
{
vhost_add_used_n(vq, heads, count);
vhost_signal(dev, vq);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index a7cc7e7..8dea44b 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -34,6 +34,10 @@ struct vhost_poll {
struct vhost_dev *dev;
};
+struct vhost_used_elem {
+ struct vring_used_elem elem;
+};
+
void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work);
bool vhost_has_work(struct vhost_dev *dev);
@@ -126,7 +130,7 @@ struct vhost_virtqueue {
struct iovec iov[UIO_MAXIOV];
struct iovec iotlb_iov[64];
struct iovec *indirect;
- struct vring_used_elem *heads;
+ struct vhost_used_elem *heads;
/* Protected by virtqueue mutex. */
struct vhost_umem *umem;
struct vhost_umem *iotlb;
@@ -182,12 +186,12 @@ bool vhost_vq_access_ok(struct vhost_virtqueue *vq);
bool vhost_log_access_ok(struct vhost_dev *);
int vhost_get_vq_desc(struct vhost_virtqueue *,
- struct vring_used_elem *used_elem,
+ struct vhost_used_elem *used_elem,
struct iovec iov[], unsigned int iov_count,
unsigned int *out_num, unsigned int *in_num,
struct vhost_log *log, unsigned int *log_num);
int vhost_get_bufs(struct vhost_virtqueue *vq,
- struct vring_used_elem *heads,
+ struct vhost_used_elem *heads,
int datalen,
unsigned *iovcount,
struct vhost_log *log,
@@ -198,13 +202,13 @@ void vhost_discard_vq_desc(struct vhost_virtqueue *, int n);
int vhost_vq_init_access(struct vhost_virtqueue *);
int vhost_add_used(struct vhost_virtqueue *vq,
- struct vring_used_elem *elem, int len);
-int vhost_add_used_n(struct vhost_virtqueue *, struct vring_used_elem *heads,
+ struct vhost_used_elem *elem, int len);
+int vhost_add_used_n(struct vhost_virtqueue *vq, struct vhost_used_elem *heads,
unsigned count);
void vhost_add_used_and_signal(struct vhost_dev *, struct vhost_virtqueue *,
- struct vring_used_elem *, int len);
+ struct vhost_used_elem *, int len);
void vhost_add_used_and_signal_n(struct vhost_dev *, struct vhost_virtqueue *,
- struct vring_used_elem *heads, unsigned count);
+ struct vhost_used_elem *heads, unsigned count);
void vhost_signal(struct vhost_dev *, struct vhost_virtqueue *);
void vhost_disable_notify(struct vhost_dev *, struct vhost_virtqueue *);
bool vhost_vq_avail_empty(struct vhost_dev *, struct vhost_virtqueue *);
diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 59a01cd..695694f 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -98,7 +98,7 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
for (;;) {
struct virtio_vsock_pkt *pkt;
- struct vring_used_elem used;
+ struct vhost_used_elem used;
struct iov_iter iov_iter;
unsigned out, in;
size_t nbytes;
@@ -146,7 +146,7 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
break;
}
- len = vhost32_to_cpu(vq, used.len);
+ len = vhost32_to_cpu(vq, used.elem.len);
iov_iter_init(&iov_iter, READ, &vq->iov[out], in, len);
nbytes = copy_to_iter(&pkt->hdr, sizeof(pkt->hdr), &iov_iter);
@@ -346,7 +346,7 @@ static void vhost_vsock_handle_tx_kick(struct vhost_work *work)
struct vhost_vsock *vsock = container_of(vq->dev, struct vhost_vsock,
dev);
struct virtio_vsock_pkt *pkt;
- struct vring_used_elem used;
+ struct vhost_used_elem used;
int ret;
unsigned int out, in;
bool added = false;
--
2.7.4
^ permalink raw reply related
* [RFC V5 PATCH 4/8] vhost_net: do not explicitly manipulate vhost_used_elem
From: Jason Wang @ 2018-05-29 2:10 UTC (permalink / raw)
To: mst, jasowang; +Cc: kvm, netdev, linux-kernel, virtualization, wexu
In-Reply-To: <1527559830-8133-1-git-send-email-jasowang@redhat.com>
Two helpers of setting/getting used len were introduced to avoid
explicitly manipulating vhost_used_elem in zerocopy code. This will be
used to hide used_elem internals and simplify packed ring
implementation.
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
drivers/vhost/net.c | 11 +++++------
drivers/vhost/vhost.c | 12 ++++++++++--
drivers/vhost/vhost.h | 5 +++++
3 files changed, 20 insertions(+), 8 deletions(-)
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 3826f1f..30273ad 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -341,9 +341,10 @@ static void vhost_zerocopy_signal_used(struct vhost_net *net,
int j = 0;
for (i = nvq->done_idx; i != nvq->upend_idx; i = (i + 1) % UIO_MAXIOV) {
- if (vq->heads[i].elem.len == VHOST_DMA_FAILED_LEN)
+ if (vhost_get_used_len(vq, &vq->heads[i]) ==
+ VHOST_DMA_FAILED_LEN)
vhost_net_tx_err(net);
- if (VHOST_DMA_IS_DONE(vq->heads[i].elem.len)) {
+ if (VHOST_DMA_IS_DONE(vhost_get_used_len(vq, &vq->heads[i]))) {
vq->heads[i].elem.len = VHOST_DMA_CLEAR_LEN;
++j;
} else
@@ -542,10 +543,8 @@ static void handle_tx(struct vhost_net *net)
struct ubuf_info *ubuf;
ubuf = nvq->ubuf_info + nvq->upend_idx;
- vq->heads[nvq->upend_idx].elem.id =
- cpu_to_vhost32(vq, used.elem.id);
- vq->heads[nvq->upend_idx].elem.len =
- VHOST_DMA_IN_PROGRESS;
+ vhost_set_used_len(vq, &used, VHOST_DMA_IN_PROGRESS);
+ vq->heads[nvq->upend_idx] = used;
ubuf->callback = vhost_zerocopy_callback;
ubuf->ctx = nvq->ubufs;
ubuf->desc = nvq->upend_idx;
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 2b2a776..e0fcfec 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2067,11 +2067,19 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
}
EXPORT_SYMBOL_GPL(vhost_get_vq_desc);
-static void vhost_set_used_len(struct vhost_virtqueue *vq,
- struct vhost_used_elem *used, int len)
+void vhost_set_used_len(struct vhost_virtqueue *vq,
+ struct vhost_used_elem *used, int len)
{
used->elem.len = cpu_to_vhost32(vq, len);
}
+EXPORT_SYMBOL_GPL(vhost_set_used_len);
+
+int vhost_get_used_len(struct vhost_virtqueue *vq,
+ struct vhost_used_elem *used)
+{
+ return vhost32_to_cpu(vq, used->elem.len);
+}
+EXPORT_SYMBOL_GPL(vhost_get_used_len);
/* This is a multi-buffer version of vhost_get_desc, that works if
* vq has read descriptors only.
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 8dea44b..604821b 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -198,6 +198,11 @@ int vhost_get_bufs(struct vhost_virtqueue *vq,
unsigned *log_num,
unsigned int quota,
s16 *count);
+void vhost_set_used_len(struct vhost_virtqueue *vq,
+ struct vhost_used_elem *used,
+ int len);
+int vhost_get_used_len(struct vhost_virtqueue *vq,
+ struct vhost_used_elem *used);
void vhost_discard_vq_desc(struct vhost_virtqueue *, int n);
int vhost_vq_init_access(struct vhost_virtqueue *);
--
2.7.4
^ permalink raw reply related
* [RFC V5 PATCH 5/8] vhost: vhost_put_user() can accept metadata type
From: Jason Wang @ 2018-05-29 2:10 UTC (permalink / raw)
To: mst, jasowang; +Cc: kvm, netdev, linux-kernel, virtualization, wexu
In-Reply-To: <1527559830-8133-1-git-send-email-jasowang@redhat.com>
We assumes used ring update is the only user for vhost_put_user() in
the past. This may not be the case for the incoming packed ring which
may update the descriptor ring for used. So introduce a new type
parameter.
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
drivers/vhost/vhost.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index e0fcfec..4031a8f 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -811,7 +811,7 @@ static inline void __user *__vhost_get_user(struct vhost_virtqueue *vq,
return __vhost_get_user_slow(vq, addr, size, type);
}
-#define vhost_put_user(vq, x, ptr) \
+#define vhost_put_user(vq, x, ptr, type) \
({ \
int ret = -EFAULT; \
if (!vq->iotlb) { \
@@ -819,7 +819,7 @@ static inline void __user *__vhost_get_user(struct vhost_virtqueue *vq,
} else { \
__typeof__(ptr) to = \
(__typeof__(ptr)) __vhost_get_user(vq, ptr, \
- sizeof(*ptr), VHOST_ADDR_USED); \
+ sizeof(*ptr), type); \
if (to != NULL) \
ret = __put_user(x, to); \
else \
@@ -1683,7 +1683,7 @@ static int vhost_update_used_flags(struct vhost_virtqueue *vq)
{
void __user *used;
if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->used_flags),
- &vq->used->flags) < 0)
+ &vq->used->flags, VHOST_ADDR_USED) < 0)
return -EFAULT;
if (unlikely(vq->log_used)) {
/* Make sure the flag is seen before log. */
@@ -1702,7 +1702,7 @@ static int vhost_update_used_flags(struct vhost_virtqueue *vq)
static int vhost_update_avail_event(struct vhost_virtqueue *vq, u16 avail_event)
{
if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->avail_idx),
- vhost_avail_event(vq)))
+ vhost_avail_event(vq), VHOST_ADDR_USED))
return -EFAULT;
if (unlikely(vq->log_used)) {
void __user *used;
@@ -2185,12 +2185,12 @@ static int __vhost_add_used_n(struct vhost_virtqueue *vq,
used = vq->used->ring + start;
for (i = 0; i < count; i++) {
if (unlikely(vhost_put_user(vq, heads[i].elem.id,
- &used[i].id))) {
+ &used[i].id, VHOST_ADDR_USED))) {
vq_err(vq, "Failed to write used id");
return -EFAULT;
}
if (unlikely(vhost_put_user(vq, heads[i].elem.len,
- &used[i].len))) {
+ &used[i].len, VHOST_ADDR_USED))) {
vq_err(vq, "Failed to write used len");
return -EFAULT;
}
@@ -2236,7 +2236,7 @@ int vhost_add_used_n(struct vhost_virtqueue *vq, struct vhost_used_elem *heads,
/* Make sure buffer is written before we update index. */
smp_wmb();
if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->last_used_idx),
- &vq->used->idx)) {
+ &vq->used->idx, VHOST_ADDR_USED)) {
vq_err(vq, "Failed to increment used idx");
return -EFAULT;
}
--
2.7.4
^ permalink raw reply related
* [RFC V5 PATCH 6/8] virtio: introduce packed ring defines
From: Jason Wang @ 2018-05-29 2:10 UTC (permalink / raw)
To: mst, jasowang; +Cc: kvm, netdev, linux-kernel, virtualization, wexu
In-Reply-To: <1527559830-8133-1-git-send-email-jasowang@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
include/uapi/linux/virtio_config.h | 9 +++++++++
include/uapi/linux/virtio_ring.h | 13 +++++++++++++
2 files changed, 22 insertions(+)
diff --git a/include/uapi/linux/virtio_config.h b/include/uapi/linux/virtio_config.h
index 308e209..5903d51 100644
--- a/include/uapi/linux/virtio_config.h
+++ b/include/uapi/linux/virtio_config.h
@@ -71,4 +71,13 @@
* this is for compatibility with legacy systems.
*/
#define VIRTIO_F_IOMMU_PLATFORM 33
+
+#define VIRTIO_F_RING_PACKED 34
+
+/*
+ * This feature indicates that all buffers are used by the device in
+ * the same order in which they have been made available.
+ */
+#define VIRTIO_F_IN_ORDER 35
+
#endif /* _UAPI_LINUX_VIRTIO_CONFIG_H */
diff --git a/include/uapi/linux/virtio_ring.h b/include/uapi/linux/virtio_ring.h
index 6d5d5fa..e297580 100644
--- a/include/uapi/linux/virtio_ring.h
+++ b/include/uapi/linux/virtio_ring.h
@@ -43,6 +43,8 @@
#define VRING_DESC_F_WRITE 2
/* This means the buffer contains a list of buffer descriptors. */
#define VRING_DESC_F_INDIRECT 4
+#define VRING_DESC_F_AVAIL 7
+#define VRING_DESC_F_USED 15
/* The Host uses this in used->flags to advise the Guest: don't kick me when
* you add a buffer. It's unreliable, so it's simply an optimization. Guest
@@ -62,6 +64,17 @@
* at the end of the used ring. Guest should ignore the used->flags field. */
#define VIRTIO_RING_F_EVENT_IDX 29
+struct vring_desc_packed {
+ /* Buffer Address. */
+ __virtio64 addr;
+ /* Buffer Length. */
+ __virtio32 len;
+ /* Buffer ID. */
+ __virtio16 id;
+ /* The flags depending on descriptor type. */
+ __virtio16 flags;
+};
+
/* Virtio ring descriptors: 16 bytes. These can chain together via "next". */
struct vring_desc {
/* Address (guest-physical). */
--
2.7.4
^ permalink raw reply related
* [RFC V5 PATCH 7/8] vhost: packed ring support
From: Jason Wang @ 2018-05-29 2:10 UTC (permalink / raw)
To: mst, jasowang; +Cc: kvm, netdev, linux-kernel, virtualization, wexu
In-Reply-To: <1527559830-8133-1-git-send-email-jasowang@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
drivers/vhost/net.c | 13 +-
drivers/vhost/vhost.c | 585 ++++++++++++++++++++++++++++++++++++++++++++++----
drivers/vhost/vhost.h | 13 +-
3 files changed, 566 insertions(+), 45 deletions(-)
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 30273ad..4991aa4 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -71,7 +71,8 @@ enum {
VHOST_NET_FEATURES = VHOST_FEATURES |
(1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
(1ULL << VIRTIO_NET_F_MRG_RXBUF) |
- (1ULL << VIRTIO_F_IOMMU_PLATFORM)
+ (1ULL << VIRTIO_F_IOMMU_PLATFORM) |
+ (1ULL << VIRTIO_F_RING_PACKED)
};
enum {
@@ -576,7 +577,7 @@ static void handle_tx(struct vhost_net *net)
nvq->upend_idx = ((unsigned)nvq->upend_idx - 1)
% UIO_MAXIOV;
}
- vhost_discard_vq_desc(vq, 1);
+ vhost_discard_vq_desc(vq, &used, 1);
vhost_net_enable_vq(net, vq);
break;
}
@@ -714,9 +715,11 @@ static void handle_rx(struct vhost_net *net)
mergeable = vhost_has_feature(vq, VIRTIO_NET_F_MRG_RXBUF);
while ((sock_len = vhost_net_rx_peek_head_len(net, sock->sk))) {
+ struct vhost_used_elem *used = vq->heads + nheads;
+
sock_len += sock_hlen;
vhost_len = sock_len + vhost_hlen;
- err = vhost_get_bufs(vq, vq->heads + nheads, vhost_len,
+ err = vhost_get_bufs(vq, used, vhost_len,
&in, vq_log, &log,
likely(mergeable) ? UIO_MAXIOV : 1,
&headcount);
@@ -762,7 +765,7 @@ static void handle_rx(struct vhost_net *net)
if (unlikely(err != sock_len)) {
pr_debug("Discarded rx packet: "
" len %d, expected %zd\n", err, sock_len);
- vhost_discard_vq_desc(vq, headcount);
+ vhost_discard_vq_desc(vq, used, 1);
continue;
}
/* Supply virtio_net_hdr if VHOST_NET_F_VIRTIO_NET_HDR */
@@ -786,7 +789,7 @@ static void handle_rx(struct vhost_net *net)
copy_to_iter(&num_buffers, sizeof num_buffers,
&fixup) != sizeof num_buffers) {
vq_err(vq, "Failed num_buffers write");
- vhost_discard_vq_desc(vq, headcount);
+ vhost_discard_vq_desc(vq, used, 1);
goto out;
}
nheads += headcount;
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 4031a8f..a36e5ad2 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -323,6 +323,9 @@ static void vhost_vq_reset(struct vhost_dev *dev,
vhost_reset_is_le(vq);
vhost_disable_cross_endian(vq);
vq->busyloop_timeout = 0;
+ vq->used_wrap_counter = true;
+ vq->last_avail_wrap_counter = true;
+ vq->avail_wrap_counter = true;
vq->umem = NULL;
vq->iotlb = NULL;
__vhost_vq_meta_reset(vq);
@@ -1103,11 +1106,22 @@ static int vhost_iotlb_miss(struct vhost_virtqueue *vq, u64 iova, int access)
return 0;
}
-static bool vq_access_ok(struct vhost_virtqueue *vq, unsigned int num,
- struct vring_desc __user *desc,
- struct vring_avail __user *avail,
- struct vring_used __user *used)
+static int vq_access_ok_packed(struct vhost_virtqueue *vq, unsigned int num,
+ struct vring_desc __user *desc,
+ struct vring_avail __user *avail,
+ struct vring_used __user *used)
+{
+ struct vring_desc_packed *packed = (struct vring_desc_packed *)desc;
+
+ /* FIXME: check device area and driver area */
+ return access_ok(VERIFY_READ, packed, num * sizeof(*packed)) &&
+ access_ok(VERIFY_WRITE, packed, num * sizeof(*packed));
+}
+static int vq_access_ok_split(struct vhost_virtqueue *vq, unsigned int num,
+ struct vring_desc __user *desc,
+ struct vring_avail __user *avail,
+ struct vring_used __user *used)
{
size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
@@ -1118,6 +1132,17 @@ static bool vq_access_ok(struct vhost_virtqueue *vq, unsigned int num,
sizeof *used + num * sizeof *used->ring + s);
}
+static int vq_access_ok(struct vhost_virtqueue *vq, unsigned int num,
+ struct vring_desc __user *desc,
+ struct vring_avail __user *avail,
+ struct vring_used __user *used)
+{
+ if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
+ return vq_access_ok_packed(vq, num, desc, avail, used);
+ else
+ return vq_access_ok_split(vq, num, desc, avail, used);
+}
+
static void vhost_vq_meta_update(struct vhost_virtqueue *vq,
const struct vhost_umem_node *node,
int type)
@@ -1361,6 +1386,10 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
break;
}
vq->last_avail_idx = s.num;
+ if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED)) {
+ vq->last_avail_wrap_counter = s.num >> 31;
+ vq->avail_wrap_counter = vq->last_avail_wrap_counter;
+ }
/* Forget the cached index value. */
vq->avail_idx = vq->last_avail_idx;
break;
@@ -1369,6 +1398,8 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
s.num = vq->last_avail_idx;
if (copy_to_user(argp, &s, sizeof s))
r = -EFAULT;
+ if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
+ s.num |= vq->last_avail_wrap_counter << 31;
break;
case VHOST_SET_VRING_ADDR:
if (copy_from_user(&a, argp, sizeof a)) {
@@ -1730,6 +1761,9 @@ int vhost_vq_init_access(struct vhost_virtqueue *vq)
vhost_init_is_le(vq);
+ if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
+ return 0;
+
r = vhost_update_used_flags(vq);
if (r)
goto err;
@@ -1803,7 +1837,8 @@ static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
/* Each buffer in the virtqueues is actually a chain of descriptors. This
* function returns the next descriptor in the chain,
* or -1U if we're at the end. */
-static unsigned next_desc(struct vhost_virtqueue *vq, struct vring_desc *desc)
+static unsigned next_desc_split(struct vhost_virtqueue *vq,
+ struct vring_desc *desc)
{
unsigned int next;
@@ -1816,11 +1851,17 @@ static unsigned next_desc(struct vhost_virtqueue *vq, struct vring_desc *desc)
return next;
}
-static int get_indirect(struct vhost_virtqueue *vq,
- struct iovec iov[], unsigned int iov_size,
- unsigned int *out_num, unsigned int *in_num,
- struct vhost_log *log, unsigned int *log_num,
- struct vring_desc *indirect)
+static unsigned next_desc_packed(struct vhost_virtqueue *vq,
+ struct vring_desc_packed *desc)
+{
+ return desc->flags & cpu_to_vhost16(vq, VRING_DESC_F_NEXT);
+}
+
+static int get_indirect_split(struct vhost_virtqueue *vq,
+ struct iovec iov[], unsigned int iov_size,
+ unsigned int *out_num, unsigned int *in_num,
+ struct vhost_log *log, unsigned int *log_num,
+ struct vring_desc *indirect)
{
struct vring_desc desc;
unsigned int i = 0, count, found = 0;
@@ -1910,23 +1951,301 @@ static int get_indirect(struct vhost_virtqueue *vq,
}
*out_num += ret;
}
- } while ((i = next_desc(vq, &desc)) != -1);
+ } while ((i = next_desc_split(vq, &desc)) != -1);
return 0;
}
-/* This looks in the virtqueue and for the first available buffer, and converts
- * it to an iovec for convenient access. Since descriptors consist of some
- * number of output then some number of input descriptors, it's actually two
- * iovecs, but we pack them into one and note how many of each there were.
- *
- * This function returns the descriptor number found, or vq->num (which is
- * never a valid descriptor number) if none was found. A negative code is
- * returned on error. */
-int vhost_get_vq_desc(struct vhost_virtqueue *vq,
- struct vhost_used_elem *used,
- struct iovec iov[], unsigned int iov_size,
- unsigned int *out_num, unsigned int *in_num,
- struct vhost_log *log, unsigned int *log_num)
+static int get_indirect_packed(struct vhost_virtqueue *vq,
+ struct iovec iov[], unsigned int iov_size,
+ unsigned int *out_num, unsigned int *in_num,
+ struct vhost_log *log, unsigned int *log_num,
+ struct vring_desc_packed *indirect)
+{
+ struct vring_desc_packed desc;
+ unsigned int i = 0, count, found = 0;
+ u32 len = vhost32_to_cpu(vq, indirect->len);
+ struct iov_iter from;
+ int ret, access;
+
+ /* Sanity check */
+ if (unlikely(len % sizeof(desc))) {
+ vq_err(vq, "Invalid length in indirect descriptor: "
+ "len 0x%llx not multiple of 0x%zx\n",
+ (unsigned long long)len,
+ sizeof desc);
+ return -EINVAL;
+ }
+
+ ret = translate_desc(vq, vhost64_to_cpu(vq, indirect->addr),
+ len, vq->indirect,
+ UIO_MAXIOV, VHOST_ACCESS_RO);
+ if (unlikely(ret < 0)) {
+ if (ret != -EAGAIN)
+ vq_err(vq, "Translation failure %d in indirect.\n",
+ ret);
+ return ret;
+ }
+ iov_iter_init(&from, READ, vq->indirect, ret, len);
+
+ /* We will use the result as an address to read from, so most
+ * architectures only need a compiler barrier here. */
+ read_barrier_depends();
+
+ count = len / sizeof desc;
+ /* Buffers are chained via a 16 bit next field, so
+ * we can have at most 2^16 of these. */
+ if (unlikely(count > USHRT_MAX + 1)) {
+ vq_err(vq, "Indirect buffer length too big: %d\n",
+ indirect->len);
+ return -E2BIG;
+ }
+
+ do {
+ unsigned iov_count = *in_num + *out_num;
+ if (unlikely(++found > count)) {
+ vq_err(vq, "Loop detected: last one at %u "
+ "indirect size %u\n",
+ i, count);
+ return -EINVAL;
+ }
+ if (unlikely(!copy_from_iter_full(&desc, sizeof(desc),
+ &from))) {
+ vq_err(vq, "Failed indirect descriptor: idx %d, %zx\n",
+ i, (size_t)vhost64_to_cpu(vq, indirect->addr)
+ + i * sizeof desc);
+ return -EINVAL;
+ }
+ if (unlikely(desc.flags &
+ cpu_to_vhost16(vq, VRING_DESC_F_INDIRECT))) {
+ vq_err(vq, "Nested indirect descriptor: idx %d, %zx\n",
+ i, (size_t)vhost64_to_cpu(vq, indirect->addr)
+ + i * sizeof desc);
+ return -EINVAL;
+ }
+
+ if (desc.flags & cpu_to_vhost16(vq, VRING_DESC_F_WRITE))
+ access = VHOST_ACCESS_WO;
+ else
+ access = VHOST_ACCESS_RO;
+
+ ret = translate_desc(vq, vhost64_to_cpu(vq, desc.addr),
+ vhost32_to_cpu(vq, desc.len),
+ iov + iov_count,
+ iov_size - iov_count, access);
+ if (unlikely(ret < 0)) {
+ if (ret != -EAGAIN)
+ vq_err(vq, "Translation failure %d "
+ "indirect idx %d\n",
+ ret, i);
+ return ret;
+ }
+ /* If this is an input descriptor, increment that count. */
+ if (access == VHOST_ACCESS_WO) {
+ *in_num += ret;
+ if (unlikely(log)) {
+ log[*log_num].addr =
+ vhost64_to_cpu(vq, desc.addr);
+ log[*log_num].len =
+ vhost32_to_cpu(vq, desc.len);
+ ++*log_num;
+ }
+ } else {
+ /* If it's an output descriptor, they're all supposed
+ * to come before any input descriptors. */
+ if (unlikely(*in_num)) {
+ vq_err(vq, "Indirect descriptor "
+ "has out after in: idx %d\n", i);
+ return -EINVAL;
+ }
+ *out_num += ret;
+ }
+ i++;
+ } while (next_desc_packed(vq, &desc));
+ return 0;
+}
+
+#define DESC_AVAIL (1 << VRING_DESC_F_AVAIL)
+#define DESC_USED (1 << VRING_DESC_F_USED)
+static bool desc_is_avail(struct vhost_virtqueue *vq, bool wrap_counter,
+ __virtio16 flags)
+{
+ bool avail = flags & cpu_to_vhost16(vq, DESC_AVAIL);
+
+ return avail == wrap_counter;
+}
+
+static __virtio16 get_desc_flags(struct vhost_virtqueue *vq, bool write)
+{
+ __virtio16 flags = 0;
+
+ if (vq->used_wrap_counter) {
+ flags |= cpu_to_vhost16(vq, DESC_AVAIL);
+ flags |= cpu_to_vhost16(vq, DESC_USED);
+ } else {
+ flags &= ~cpu_to_vhost16(vq, DESC_AVAIL);
+ flags &= ~cpu_to_vhost16(vq, DESC_USED);
+ }
+
+ if (write)
+ flags |= cpu_to_vhost16(vq, VRING_DESC_F_WRITE);
+
+ return flags;
+}
+
+static bool vhost_vring_packed_need_event(struct vhost_virtqueue *vq,
+ bool wrap, __u16 off_wrap, __u16 new,
+ __u16 old)
+{
+ int off = off_wrap & ~(1 << 15);
+
+ if (new < old) {
+ new += vq->num;
+ wrap ^= 1;
+ }
+
+ if (wrap != off_wrap >> 15)
+ off += vq->num;
+
+ return vring_need_event(off, new, old);
+}
+
+static int vhost_get_vq_desc_packed(struct vhost_virtqueue *vq,
+ struct vhost_used_elem *used,
+ struct iovec iov[], unsigned int iov_size,
+ unsigned int *out_num, unsigned int *in_num,
+ struct vhost_log *log,
+ unsigned int *log_num)
+{
+ struct vring_desc_packed desc;
+ int ret, access, i;
+ u16 last_avail_idx = vq->last_avail_idx;
+ u16 off_wrap = vq->avail_idx | (vq->avail_wrap_counter << 15);
+
+ /* When we start there are none of either input nor output. */
+ *out_num = *in_num = 0;
+ if (unlikely(log))
+ *log_num = 0;
+
+ used->count = 0;
+
+ do {
+ struct vring_desc_packed *d = vq->desc_packed +
+ vq->last_avail_idx;
+ unsigned int iov_count = *in_num + *out_num;
+
+ ret = vhost_get_user(vq, desc.flags, &d->flags,
+ VHOST_ADDR_DESC);
+ if (unlikely(ret)) {
+ vq_err(vq, "Failed to get flags: idx %d addr %p\n",
+ vq->last_avail_idx, &d->flags);
+ return -EFAULT;
+ }
+
+ if (!desc_is_avail(vq, vq->last_avail_wrap_counter, desc.flags)) {
+ /* If there's nothing new since last we looked, return
+ * invalid.
+ */
+ if (!used->count)
+ return -ENOSPC;
+ vq_err(vq, "Unexpected unavail descriptor: idx %d\n",
+ vq->last_avail_idx);
+ return -EFAULT;
+ }
+
+ /* Read desc content after we're sure it was available. */
+ smp_rmb();
+
+ ret = vhost_copy_from_user(vq, &desc, d, sizeof(desc));
+ if (unlikely(ret)) {
+ vq_err(vq, "Failed to get descriptor: idx %d addr %p\n",
+ vq->last_avail_idx, d);
+ return -EFAULT;
+ }
+
+ used->elem.id = desc.id;
+
+ if (desc.flags & cpu_to_vhost16(vq, VRING_DESC_F_INDIRECT)) {
+ ret = get_indirect_packed(vq, iov, iov_size,
+ out_num, in_num, log,
+ log_num, &desc);
+ if (unlikely(ret < 0)) {
+ if (ret != -EAGAIN)
+ vq_err(vq, "Failure detected "
+ "in indirect descriptor "
+ "at idx %d\n", i);
+ return ret;
+ }
+ goto next;
+ }
+
+ if (desc.flags & cpu_to_vhost16(vq, VRING_DESC_F_WRITE))
+ access = VHOST_ACCESS_WO;
+ else
+ access = VHOST_ACCESS_RO;
+ ret = translate_desc(vq, vhost64_to_cpu(vq, desc.addr),
+ vhost32_to_cpu(vq, desc.len),
+ iov + iov_count, iov_size - iov_count,
+ access);
+ if (unlikely(ret < 0)) {
+ if (ret != -EAGAIN)
+ vq_err(vq, "Translation failure %d idx %d\n",
+ ret, i);
+ return ret;
+ }
+
+ if (access == VHOST_ACCESS_WO) {
+ /* If this is an input descriptor,
+ * increment that count.
+ */
+ *in_num += ret;
+ if (unlikely(log)) {
+ log[*log_num].addr =
+ vhost64_to_cpu(vq, desc.addr);
+ log[*log_num].len =
+ vhost32_to_cpu(vq, desc.len);
+ ++*log_num;
+ }
+ } else {
+ /* If it's an output descriptor, they're all supposed
+ * to come before any input descriptors.
+ */
+ if (unlikely(*in_num)) {
+ vq_err(vq, "Desc out after in: idx %d\n",
+ i);
+ return -EINVAL;
+ }
+ *out_num += ret;
+ }
+
+next:
+ if (unlikely(++used->count > vq->num)) {
+ vq_err(vq, "Loop detected: last one at %u "
+ "vq size %u head %u\n",
+ i, vq->num, used->elem.id);
+ return -EINVAL;
+ }
+ if (++vq->last_avail_idx >= vq->num) {
+ vq->last_avail_idx = 0;
+ vq->last_avail_wrap_counter ^= 1;
+ }
+ /* If this descriptor says it doesn't chain, we're done. */
+ } while (next_desc_packed(vq, &desc));
+
+ if (vhost_vring_packed_need_event(vq, vq->last_avail_wrap_counter,
+ off_wrap, vq->last_avail_idx,
+ last_avail_idx)) {
+ vq->avail_idx = vq->last_avail_idx;
+ vq->avail_wrap_counter = vq->last_avail_wrap_counter;
+ }
+
+ return 0;
+}
+
+static int vhost_get_vq_desc_split(struct vhost_virtqueue *vq,
+ struct vhost_used_elem *used,
+ struct iovec iov[], unsigned int iov_size,
+ unsigned int *out_num, unsigned int *in_num,
+ struct vhost_log *log, unsigned int *log_num)
{
struct vring_desc desc;
unsigned int i, head, found = 0;
@@ -2011,9 +2330,9 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
return -EFAULT;
}
if (desc.flags & cpu_to_vhost16(vq, VRING_DESC_F_INDIRECT)) {
- ret = get_indirect(vq, iov, iov_size,
- out_num, in_num,
- log, log_num, &desc);
+ ret = get_indirect_split(vq, iov, iov_size,
+ out_num, in_num,
+ log, log_num, &desc);
if (unlikely(ret < 0)) {
if (ret != -EAGAIN)
vq_err(vq, "Failure detected "
@@ -2055,7 +2374,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
}
*out_num += ret;
}
- } while ((i = next_desc(vq, &desc)) != -1);
+ } while ((i = next_desc_split(vq, &desc)) != -1);
/* On success, increment avail index. */
vq->last_avail_idx++;
@@ -2065,6 +2384,31 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
BUG_ON(!(vq->used_flags & VRING_USED_F_NO_NOTIFY));
return 0;
}
+
+/* This looks in the virtqueue and for the first available buffer, and converts
+ * it to an iovec for convenient access. Since descriptors consist of some
+ * number of output then some number of input descriptors, it's actually two
+ * iovecs, but we pack them into one and note how many of each there were.
+ *
+ * This function returns the descriptor number found, or vq->num (which is
+ * never a valid descriptor number) if none was found. A negative code is
+ * returned on error.
+ */
+int vhost_get_vq_desc(struct vhost_virtqueue *vq,
+ struct vhost_used_elem *used,
+ struct iovec iov[], unsigned int iov_size,
+ unsigned int *out_num, unsigned int *in_num,
+ struct vhost_log *log, unsigned int *log_num)
+{
+ if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
+ return vhost_get_vq_desc_packed(vq, used, iov, iov_size,
+ out_num, in_num,
+ log, log_num);
+ else
+ return vhost_get_vq_desc_split(vq, used, iov, iov_size,
+ out_num, in_num,
+ log, log_num);
+}
EXPORT_SYMBOL_GPL(vhost_get_vq_desc);
void vhost_set_used_len(struct vhost_virtqueue *vq,
@@ -2151,15 +2495,30 @@ int vhost_get_bufs(struct vhost_virtqueue *vq,
*count = headcount;
return 0;
err:
- vhost_discard_vq_desc(vq, headcount);
+ vhost_discard_vq_desc(vq, heads, headcount);
return r;
}
EXPORT_SYMBOL_GPL(vhost_get_bufs);
/* Reverse the effect of vhost_get_vq_desc. Useful for error handling. */
-void vhost_discard_vq_desc(struct vhost_virtqueue *vq, int n)
+void vhost_discard_vq_desc(struct vhost_virtqueue *vq,
+ struct vhost_used_elem *heads,
+ int headcount)
{
- vq->last_avail_idx -= n;
+ int i;
+
+ if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED)) {
+ for (i = 0; i < headcount; i++) {
+ vq->last_avail_idx -= heads[i].count;
+ if (vq->last_avail_idx >= vq->num) {
+ vq->last_avail_wrap_counter ^= 1;
+ vq->last_avail_idx += vq->num;
+ }
+ }
+ } else {
+ vq->last_avail_idx -= headcount;
+ }
+
}
EXPORT_SYMBOL_GPL(vhost_discard_vq_desc);
@@ -2215,10 +2574,69 @@ static int __vhost_add_used_n(struct vhost_virtqueue *vq,
return 0;
}
+static int vhost_add_used_n_packed(struct vhost_virtqueue *vq,
+ struct vhost_used_elem *heads,
+ unsigned int count)
+{
+ struct vring_desc_packed __user *desc;
+ int i, ret;
+
+ for (i = 0; i < count; i++) {
+ desc = vq->desc_packed + vq->last_used_idx;
+
+ ret = vhost_put_user(vq, heads[i].elem.id, &desc->id,
+ VHOST_ADDR_DESC);
+ if (unlikely(ret)) {
+ vq_err(vq, "Failed to update id: idx %d addr %p\n",
+ vq->last_used_idx, desc);
+ return -EFAULT;
+ }
+ ret = vhost_put_user(vq, heads[i].elem.len, &desc->len,
+ VHOST_ADDR_DESC);
+ if (unlikely(ret)) {
+ vq_err(vq, "Failed to update len: idx %d addr %p\n",
+ vq->last_used_idx, desc);
+ return -EFAULT;
+ }
+
+ /* Update flags after descriptor id and len is wrote,
+ * TODO: Update head flags at last for saving barriers */
+ smp_wmb();
+
+ ret = vhost_put_user(vq, get_desc_flags(vq, heads[i].elem.len),
+ &desc->flags, VHOST_ADDR_DESC);
+ if (unlikely(ret)) {
+ vq_err(vq, "Failed to update flags: idx %d addr %p\n",
+ vq->last_used_idx, desc);
+ return -EFAULT;
+ }
+
+ if (unlikely(vq->log_used)) {
+ /* Make sure desc is written before update log. */
+ smp_wmb();
+ log_write(vq->log_base, vq->log_addr +
+ vq->last_used_idx * sizeof(*desc),
+ sizeof(*desc));
+ if (vq->log_ctx)
+ eventfd_signal(vq->log_ctx, 1);
+ }
+
+ vq->last_used_idx += heads[i].count;
+ if (vq->last_used_idx >= vq->num) {
+ vq->used_wrap_counter ^= 1;
+ vq->last_used_idx -= vq->num;
+ }
+ }
+
+ return 0;
+}
+
/* After we've used one of their buffers, we tell them about it. We'll then
* want to notify the guest, using eventfd. */
-int vhost_add_used_n(struct vhost_virtqueue *vq, struct vhost_used_elem *heads,
- unsigned count)
+static int vhost_add_used_n_split(struct vhost_virtqueue *vq,
+ struct vhost_used_elem *heads,
+ unsigned count)
+
{
int start, n, r;
@@ -2250,6 +2668,19 @@ int vhost_add_used_n(struct vhost_virtqueue *vq, struct vhost_used_elem *heads,
}
return r;
}
+
+/* After we've used one of their buffers, we tell them about it. We'll then
+ * want to notify the guest, using eventfd.
+ */
+int vhost_add_used_n(struct vhost_virtqueue *vq,
+ struct vhost_used_elem *heads,
+ unsigned int count)
+{
+ if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
+ return vhost_add_used_n_packed(vq, heads, count);
+ else
+ return vhost_add_used_n_split(vq, heads, count);
+}
EXPORT_SYMBOL_GPL(vhost_add_used_n);
static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
@@ -2257,6 +2688,11 @@ static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
__u16 old, new;
__virtio16 event;
bool v;
+
+ /* FIXME: check driver area */
+ if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
+ return true;
+
/* Flush out used index updates. This is paired
* with the barrier that the Guest executes when enabling
* interrupts. */
@@ -2319,7 +2755,8 @@ void vhost_add_used_and_signal_n(struct vhost_dev *dev,
EXPORT_SYMBOL_GPL(vhost_add_used_and_signal_n);
/* return true if we're sure that avaiable ring is empty */
-bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
+static bool vhost_vq_avail_empty_split(struct vhost_dev *dev,
+ struct vhost_virtqueue *vq)
{
__virtio16 avail_idx;
int r;
@@ -2334,10 +2771,58 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
return vq->avail_idx == vq->last_avail_idx;
}
+
+static bool vhost_vq_avail_empty_packed(struct vhost_dev *dev,
+ struct vhost_virtqueue *vq)
+{
+ struct vring_desc_packed *d = vq->desc_packed + vq->avail_idx;
+ __virtio16 flags;
+ int ret;
+
+ ret = vhost_get_user(vq, flags, &d->flags, VHOST_ADDR_DESC);
+ if (unlikely(ret)) {
+ vq_err(vq, "Failed to get flags: idx %d addr %p\n",
+ vq->last_avail_idx, d);
+ return -EFAULT;
+ }
+
+ return !desc_is_avail(vq, vq->avail_wrap_counter, flags);
+}
+
+bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
+{
+ if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
+ return vhost_vq_avail_empty_packed(dev, vq);
+ else
+ return vhost_vq_avail_empty_split(dev, vq);
+}
EXPORT_SYMBOL_GPL(vhost_vq_avail_empty);
-/* OK, now we need to know about added descriptors. */
-bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
+static bool vhost_enable_notify_packed(struct vhost_dev *dev,
+ struct vhost_virtqueue *vq)
+{
+ struct vring_desc_packed *d = vq->desc_packed + vq->avail_idx;
+ __virtio16 flags;
+ int ret;
+
+ /* FIXME: disable notification through device area */
+
+ /* They could have slipped one in as we were doing that: make
+ * sure it's written, then check again. */
+ smp_mb();
+
+ ret = vhost_get_user(vq, flags, &d->flags, VHOST_ADDR_DESC);
+ if (unlikely(ret)) {
+ vq_err(vq, "Failed to get descriptor: idx %d addr %p\n",
+ vq->last_avail_idx, &d->flags);
+ return -EFAULT;
+ }
+
+ return desc_is_avail(vq, vq->avail_wrap_counter, flags);
+}
+
+static bool vhost_enable_notify_split(struct vhost_dev *dev,
+ struct vhost_virtqueue *vq)
{
__virtio16 avail_idx;
int r;
@@ -2372,10 +2857,25 @@ bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
return vhost16_to_cpu(vq, avail_idx) != vq->avail_idx;
}
+
+/* OK, now we need to know about added descriptors. */
+bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
+{
+ if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
+ return vhost_enable_notify_packed(dev, vq);
+ else
+ return vhost_enable_notify_split(dev, vq);
+}
EXPORT_SYMBOL_GPL(vhost_enable_notify);
-/* We don't need to be notified again. */
-void vhost_disable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
+static void vhost_disable_notify_packed(struct vhost_dev *dev,
+ struct vhost_virtqueue *vq)
+{
+ /* FIXME: disable notification through device area */
+}
+
+static void vhost_disable_notify_split(struct vhost_dev *dev,
+ struct vhost_virtqueue *vq)
{
int r;
@@ -2389,6 +2889,15 @@ void vhost_disable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
&vq->used->flags, r);
}
}
+
+/* We don't need to be notified again. */
+void vhost_disable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
+{
+ if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
+ return vhost_disable_notify_packed(dev, vq);
+ else
+ return vhost_disable_notify_split(dev, vq);
+}
EXPORT_SYMBOL_GPL(vhost_disable_notify);
/* Create a new message. */
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 604821b..7543a46 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -36,6 +36,7 @@ struct vhost_poll {
struct vhost_used_elem {
struct vring_used_elem elem;
+ int count;
};
void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
@@ -91,7 +92,10 @@ struct vhost_virtqueue {
/* The actual ring of buffers. */
struct mutex mutex;
unsigned int num;
- struct vring_desc __user *desc;
+ union {
+ struct vring_desc __user *desc;
+ struct vring_desc_packed __user *desc_packed;
+ };
struct vring_avail __user *avail;
struct vring_used __user *used;
const struct vhost_umem_node *meta_iotlb[VHOST_NUM_ADDRS];
@@ -148,6 +152,9 @@ struct vhost_virtqueue {
bool user_be;
#endif
u32 busyloop_timeout;
+ bool used_wrap_counter;
+ bool avail_wrap_counter;
+ bool last_avail_wrap_counter;
};
struct vhost_msg_node {
@@ -203,7 +210,9 @@ void vhost_set_used_len(struct vhost_virtqueue *vq,
int len);
int vhost_get_used_len(struct vhost_virtqueue *vq,
struct vhost_used_elem *used);
-void vhost_discard_vq_desc(struct vhost_virtqueue *, int n);
+void vhost_discard_vq_desc(struct vhost_virtqueue *,
+ struct vhost_used_elem *,
+ int n);
int vhost_vq_init_access(struct vhost_virtqueue *);
int vhost_add_used(struct vhost_virtqueue *vq,
--
2.7.4
^ permalink raw reply related
* [RFC V5 PATCH 8/8] vhost: event suppression for packed ring
From: Jason Wang @ 2018-05-29 2:10 UTC (permalink / raw)
To: mst, jasowang; +Cc: kvm, netdev, linux-kernel, virtualization, wexu
In-Reply-To: <1527559830-8133-1-git-send-email-jasowang@redhat.com>
This patch introduces basic support for event suppression aka driver
and device area.
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
drivers/vhost/vhost.c | 191 ++++++++++++++++++++++++++++++++++++---
drivers/vhost/vhost.h | 10 +-
include/uapi/linux/virtio_ring.h | 19 ++++
3 files changed, 204 insertions(+), 16 deletions(-)
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index a36e5ad2..112f680 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1112,10 +1112,15 @@ static int vq_access_ok_packed(struct vhost_virtqueue *vq, unsigned int num,
struct vring_used __user *used)
{
struct vring_desc_packed *packed = (struct vring_desc_packed *)desc;
+ struct vring_packed_desc_event *driver_event =
+ (struct vring_packed_desc_event *)avail;
+ struct vring_packed_desc_event *device_event =
+ (struct vring_packed_desc_event *)used;
- /* FIXME: check device area and driver area */
return access_ok(VERIFY_READ, packed, num * sizeof(*packed)) &&
- access_ok(VERIFY_WRITE, packed, num * sizeof(*packed));
+ access_ok(VERIFY_WRITE, packed, num * sizeof(*packed)) &&
+ access_ok(VERIFY_READ, driver_event, sizeof(*driver_event)) &&
+ access_ok(VERIFY_WRITE, device_event, sizeof(*device_event));
}
static int vq_access_ok_split(struct vhost_virtqueue *vq, unsigned int num,
@@ -1190,14 +1195,27 @@ static bool iotlb_access_ok(struct vhost_virtqueue *vq,
return true;
}
-int vq_iotlb_prefetch(struct vhost_virtqueue *vq)
+int vq_iotlb_prefetch_packed(struct vhost_virtqueue *vq)
+{
+ int num = vq->num;
+
+ return iotlb_access_ok(vq, VHOST_ACCESS_RO, (u64)(uintptr_t)vq->desc,
+ num * sizeof(*vq->desc), VHOST_ADDR_DESC) &&
+ iotlb_access_ok(vq, VHOST_ACCESS_WO, (u64)(uintptr_t)vq->desc,
+ num * sizeof(*vq->desc), VHOST_ADDR_DESC) &&
+ iotlb_access_ok(vq, VHOST_ACCESS_RO,
+ (u64)(uintptr_t)vq->driver_event,
+ sizeof(*vq->driver_event), VHOST_ADDR_AVAIL) &&
+ iotlb_access_ok(vq, VHOST_ACCESS_WO,
+ (u64)(uintptr_t)vq->device_event,
+ sizeof(*vq->device_event), VHOST_ADDR_USED);
+}
+
+int vq_iotlb_prefetch_split(struct vhost_virtqueue *vq)
{
size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
unsigned int num = vq->num;
- if (!vq->iotlb)
- return 1;
-
return iotlb_access_ok(vq, VHOST_ACCESS_RO, (u64)(uintptr_t)vq->desc,
num * sizeof(*vq->desc), VHOST_ADDR_DESC) &&
iotlb_access_ok(vq, VHOST_ACCESS_RO, (u64)(uintptr_t)vq->avail,
@@ -1209,6 +1227,17 @@ int vq_iotlb_prefetch(struct vhost_virtqueue *vq)
num * sizeof(*vq->used->ring) + s,
VHOST_ADDR_USED);
}
+
+int vq_iotlb_prefetch(struct vhost_virtqueue *vq)
+{
+ if (!vq->iotlb)
+ return 1;
+
+ if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
+ return vq_iotlb_prefetch_packed(vq);
+ else
+ return vq_iotlb_prefetch_split(vq);
+}
EXPORT_SYMBOL_GPL(vq_iotlb_prefetch);
/* Can we log writes? */
@@ -1730,6 +1759,50 @@ static int vhost_update_used_flags(struct vhost_virtqueue *vq)
return 0;
}
+static int vhost_update_device_flags(struct vhost_virtqueue *vq,
+ __virtio16 device_flags)
+{
+ void __user *flags;
+
+ if (vhost_put_user(vq, device_flags, &vq->device_event->flags,
+ VHOST_ADDR_USED) < 0)
+ return -EFAULT;
+ if (unlikely(vq->log_used)) {
+ /* Make sure the flag is seen before log. */
+ smp_wmb();
+ /* Log used flag write. */
+ flags = &vq->device_event->flags;
+ log_write(vq->log_base, vq->log_addr +
+ (flags - (void __user *)vq->device_event),
+ sizeof(vq->device_event->flags));
+ if (vq->log_ctx)
+ eventfd_signal(vq->log_ctx, 1);
+ }
+ return 0;
+}
+
+static int vhost_update_device_off_wrap(struct vhost_virtqueue *vq,
+ __virtio16 device_off_wrap)
+{
+ void __user *off_wrap;
+
+ if (vhost_put_user(vq, device_off_wrap, &vq->device_event->off_wrap,
+ VHOST_ADDR_USED) < 0)
+ return -EFAULT;
+ if (unlikely(vq->log_used)) {
+ /* Make sure the flag is seen before log. */
+ smp_wmb();
+ /* Log used flag write. */
+ off_wrap = &vq->device_event->off_wrap;
+ log_write(vq->log_base, vq->log_addr +
+ (off_wrap - (void __user *)vq->device_event),
+ sizeof(vq->device_event->off_wrap));
+ if (vq->log_ctx)
+ eventfd_signal(vq->log_ctx, 1);
+ }
+ return 0;
+}
+
static int vhost_update_avail_event(struct vhost_virtqueue *vq, u16 avail_event)
{
if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->avail_idx),
@@ -2683,16 +2756,13 @@ int vhost_add_used_n(struct vhost_virtqueue *vq,
}
EXPORT_SYMBOL_GPL(vhost_add_used_n);
-static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
+static bool vhost_notify_split(struct vhost_dev *dev,
+ struct vhost_virtqueue *vq)
{
__u16 old, new;
__virtio16 event;
bool v;
- /* FIXME: check driver area */
- if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
- return true;
-
/* Flush out used index updates. This is paired
* with the barrier that the Guest executes when enabling
* interrupts. */
@@ -2725,6 +2795,64 @@ static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
return vring_need_event(vhost16_to_cpu(vq, event), new, old);
}
+static bool vhost_notify_packed(struct vhost_dev *dev,
+ struct vhost_virtqueue *vq)
+{
+ __virtio16 event_off_wrap, event_flags;
+ __u16 old, new, off_wrap;
+ bool v;
+
+ /* Flush out used descriptors updates. This is paired
+ * with the barrier that the Guest executes when enabling
+ * interrupts.
+ */
+ smp_mb();
+
+ if (vhost_get_avail(vq, event_flags,
+ &vq->driver_event->flags) < 0) {
+ vq_err(vq, "Failed to get driver desc_event_flags");
+ return true;
+ }
+
+ if (!vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX))
+ return event_flags !=
+ cpu_to_vhost16(vq, RING_EVENT_FLAGS_DISABLE);
+
+ old = vq->signalled_used;
+ v = vq->signalled_used_valid;
+ new = vq->signalled_used = vq->last_used_idx;
+ vq->signalled_used_valid = true;
+
+ if (event_flags != cpu_to_vhost16(vq, RING_EVENT_FLAGS_DESC))
+ return event_flags !=
+ cpu_to_vhost16(vq, RING_EVENT_FLAGS_DISABLE);
+
+ /* Read desc event flags before event_off and event_wrap */
+ smp_rmb();
+
+ if (vhost_get_avail(vq, event_off_wrap,
+ &vq->driver_event->off_wrap) < 0) {
+ vq_err(vq, "Failed to get driver desc_event_off/wrap");
+ return true;
+ }
+
+ off_wrap = vhost16_to_cpu(vq, event_off_wrap);
+
+ if (unlikely(!v))
+ return true;
+
+ return vhost_vring_packed_need_event(vq, vq->used_wrap_counter,
+ off_wrap, new, old);
+}
+
+static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
+{
+ if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
+ return vhost_notify_packed(dev, vq);
+ else
+ return vhost_notify_split(dev, vq);
+}
+
/* This actually signals the guest, using eventfd. */
void vhost_signal(struct vhost_dev *dev, struct vhost_virtqueue *vq)
{
@@ -2802,10 +2930,34 @@ static bool vhost_enable_notify_packed(struct vhost_dev *dev,
struct vhost_virtqueue *vq)
{
struct vring_desc_packed *d = vq->desc_packed + vq->avail_idx;
- __virtio16 flags;
+ __virtio16 flags = RING_EVENT_FLAGS_ENABLE;
int ret;
- /* FIXME: disable notification through device area */
+ if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY))
+ return false;
+ vq->used_flags &= ~VRING_USED_F_NO_NOTIFY;
+
+ if (vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX)) {
+ __virtio16 off_wrap = cpu_to_vhost16(vq, vq->avail_idx |
+ vq->avail_wrap_counter << 15);
+
+ ret = vhost_update_device_off_wrap(vq, off_wrap);
+ if (ret) {
+ vq_err(vq, "Failed to write to off warp at %p: %d\n",
+ &vq->device_event->off_wrap, ret);
+ return false;
+ }
+ /* Make sure off_wrap is wrote before flags */
+ smp_wmb();
+ flags = RING_EVENT_FLAGS_DESC;
+ }
+
+ ret = vhost_update_device_flags(vq, flags);
+ if (ret) {
+ vq_err(vq, "Failed to enable notification at %p: %d\n",
+ &vq->device_event->flags, ret);
+ return false;
+ }
/* They could have slipped one in as we were doing that: make
* sure it's written, then check again. */
@@ -2871,7 +3023,18 @@ EXPORT_SYMBOL_GPL(vhost_enable_notify);
static void vhost_disable_notify_packed(struct vhost_dev *dev,
struct vhost_virtqueue *vq)
{
- /* FIXME: disable notification through device area */
+ __virtio16 flags;
+ int r;
+
+ if (vq->used_flags & VRING_USED_F_NO_NOTIFY)
+ return;
+ vq->used_flags |= VRING_USED_F_NO_NOTIFY;
+
+ flags = cpu_to_vhost16(vq, RING_EVENT_FLAGS_DISABLE);
+ r = vhost_update_device_flags(vq, flags);
+ if (r)
+ vq_err(vq, "Failed to enable notification at %p: %d\n",
+ &vq->device_event->flags, r);
}
static void vhost_disable_notify_split(struct vhost_dev *dev,
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 7543a46..b920582 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -96,8 +96,14 @@ struct vhost_virtqueue {
struct vring_desc __user *desc;
struct vring_desc_packed __user *desc_packed;
};
- struct vring_avail __user *avail;
- struct vring_used __user *used;
+ union {
+ struct vring_avail __user *avail;
+ struct vring_packed_desc_event __user *driver_event;
+ };
+ union {
+ struct vring_used __user *used;
+ struct vring_packed_desc_event __user *device_event;
+ };
const struct vhost_umem_node *meta_iotlb[VHOST_NUM_ADDRS];
struct file *kick;
struct eventfd_ctx *call_ctx;
diff --git a/include/uapi/linux/virtio_ring.h b/include/uapi/linux/virtio_ring.h
index e297580..71c7a46 100644
--- a/include/uapi/linux/virtio_ring.h
+++ b/include/uapi/linux/virtio_ring.h
@@ -75,6 +75,25 @@ struct vring_desc_packed {
__virtio16 flags;
};
+/* Enable events */
+#define RING_EVENT_FLAGS_ENABLE 0x0
+/* Disable events */
+#define RING_EVENT_FLAGS_DISABLE 0x1
+/*
+ * Enable events for a specific descriptor
+ * (as specified by Descriptor Ring Change Event Offset/Wrap Counter).
+ * Only valid if VIRTIO_F_RING_EVENT_IDX has been negotiated.
+ */
+#define RING_EVENT_FLAGS_DESC 0x2
+/* The value 0x3 is reserved */
+
+struct vring_packed_desc_event {
+ /* Descriptor Ring Change Event Offset and Wrap Counter */
+ __virtio16 off_wrap;
+ /* Descriptor Ring Change Event Flags */
+ __virtio16 flags;
+};
+
/* Virtio ring descriptors: 16 bytes. These can chain together via "next". */
struct vring_desc {
/* Address (guest-physical). */
--
2.7.4
^ permalink raw reply related
* Re: [RFC v5 2/5] virtio_ring: support creating packed ring
From: Jason Wang @ 2018-05-29 2:49 UTC (permalink / raw)
To: Tiwei Bie, mst, virtualization, linux-kernel, netdev; +Cc: wexu, Cornelia Huck
In-Reply-To: <20180522081648.14768-3-tiwei.bie@intel.com>
On 2018年05月22日 16:16, Tiwei Bie wrote:
> This commit introduces the support for creating packed ring.
> All split ring specific functions are added _split suffix.
> Some necessary stubs for packed ring are also added.
>
> Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
> ---
> drivers/virtio/virtio_ring.c | 801 +++++++++++++++++++++++------------
> include/linux/virtio_ring.h | 8 +-
> 2 files changed, 546 insertions(+), 263 deletions(-)
>
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 71458f493cf8..f5ef5f42a7cf 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -61,11 +61,15 @@ struct vring_desc_state {
> struct vring_desc *indir_desc; /* Indirect descriptor, if any. */
> };
>
> +struct vring_desc_state_packed {
> + int next; /* The next desc state. */
> +};
> +
> struct vring_virtqueue {
> struct virtqueue vq;
>
> - /* Actual memory layout for this queue */
> - struct vring vring;
> + /* Is this a packed ring? */
> + bool packed;
>
> /* Can we use weak barriers? */
> bool weak_barriers;
> @@ -87,11 +91,39 @@ struct vring_virtqueue {
> /* Last used index we've seen. */
> u16 last_used_idx;
>
> - /* Last written value to avail->flags */
> - u16 avail_flags_shadow;
> + union {
> + /* Available for split ring */
> + struct {
> + /* Actual memory layout for this queue. */
> + struct vring vring;
>
> - /* Last written value to avail->idx in guest byte order */
> - u16 avail_idx_shadow;
> + /* Last written value to avail->flags */
> + u16 avail_flags_shadow;
> +
> + /* Last written value to avail->idx in
> + * guest byte order. */
> + u16 avail_idx_shadow;
> + };
> +
> + /* Available for packed ring */
> + struct {
> + /* Actual memory layout for this queue. */
> + struct vring_packed vring_packed;
> +
> + /* Driver ring wrap counter. */
> + u8 avail_wrap_counter;
> +
> + /* Device ring wrap counter. */
> + u8 used_wrap_counter;
How about just use boolean?
> +
> + /* Index of the next avail descriptor. */
> + u16 next_avail_idx;
> +
> + /* Last written value to driver->flags in
> + * guest byte order. */
> + u16 event_flags_shadow;
> + };
> + };
>
> /* How to notify other side. FIXME: commonalize hcalls! */
> bool (*notify)(struct virtqueue *vq);
> @@ -111,11 +143,24 @@ struct vring_virtqueue {
> #endif
>
> /* Per-descriptor state. */
> - struct vring_desc_state desc_state[];
> + union {
> + struct vring_desc_state desc_state[1];
> + struct vring_desc_state_packed desc_state_packed[1];
> + };
> };
>
> #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)
>
> +static inline bool virtqueue_use_indirect(struct virtqueue *_vq,
> + unsigned int total_sg)
> +{
> + struct vring_virtqueue *vq = to_vvq(_vq);
> +
> + /* If the host supports indirect descriptor tables, and we have multiple
> + * buffers, then go indirect. FIXME: tune this threshold */
> + return (vq->indirect && total_sg > 1 && vq->vq.num_free);
> +}
> +
> /*
> * Modern virtio devices have feature bits to specify whether they need a
> * quirk and bypass the IOMMU. If not there, just use the DMA API.
> @@ -201,8 +246,17 @@ static dma_addr_t vring_map_single(const struct vring_virtqueue *vq,
> cpu_addr, size, direction);
> }
>
> -static void vring_unmap_one(const struct vring_virtqueue *vq,
> - struct vring_desc *desc)
> +static int vring_mapping_error(const struct vring_virtqueue *vq,
> + dma_addr_t addr)
> +{
> + if (!vring_use_dma_api(vq->vq.vdev))
> + return 0;
> +
> + return dma_mapping_error(vring_dma_dev(vq), addr);
> +}
> +
> +static void vring_unmap_one_split(const struct vring_virtqueue *vq,
> + struct vring_desc *desc)
> {
> u16 flags;
>
> @@ -226,17 +280,9 @@ static void vring_unmap_one(const struct vring_virtqueue *vq,
> }
> }
>
> -static int vring_mapping_error(const struct vring_virtqueue *vq,
> - dma_addr_t addr)
> -{
> - if (!vring_use_dma_api(vq->vq.vdev))
> - return 0;
> -
> - return dma_mapping_error(vring_dma_dev(vq), addr);
> -}
It looks to me if you keep vring_mapping_error behind
vring_unmap_one_split(), lots of changes were unncessary.
> -
> -static struct vring_desc *alloc_indirect(struct virtqueue *_vq,
> - unsigned int total_sg, gfp_t gfp)
> +static struct vring_desc *alloc_indirect_split(struct virtqueue *_vq,
> + unsigned int total_sg,
> + gfp_t gfp)
> {
> struct vring_desc *desc;
> unsigned int i;
> @@ -257,14 +303,14 @@ static struct vring_desc *alloc_indirect(struct virtqueue *_vq,
> return desc;
> }
>
> -static inline int virtqueue_add(struct virtqueue *_vq,
> - struct scatterlist *sgs[],
> - unsigned int total_sg,
> - unsigned int out_sgs,
> - unsigned int in_sgs,
> - void *data,
> - void *ctx,
> - gfp_t gfp)
> +static inline int virtqueue_add_split(struct virtqueue *_vq,
> + struct scatterlist *sgs[],
> + unsigned int total_sg,
> + unsigned int out_sgs,
> + unsigned int in_sgs,
> + void *data,
> + void *ctx,
> + gfp_t gfp)
> {
> struct vring_virtqueue *vq = to_vvq(_vq);
> struct scatterlist *sg;
> @@ -300,10 +346,8 @@ static inline int virtqueue_add(struct virtqueue *_vq,
>
> head = vq->free_head;
>
> - /* If the host supports indirect descriptor tables, and we have multiple
> - * buffers, then go indirect. FIXME: tune this threshold */
> - if (vq->indirect && total_sg > 1 && vq->vq.num_free)
> - desc = alloc_indirect(_vq, total_sg, gfp);
> + if (virtqueue_use_indirect(_vq, total_sg))
> + desc = alloc_indirect_split(_vq, total_sg, gfp);
> else {
> desc = NULL;
> WARN_ON_ONCE(total_sg > vq->vring.num && !vq->indirect);
> @@ -424,7 +468,7 @@ static inline int virtqueue_add(struct virtqueue *_vq,
> for (n = 0; n < total_sg; n++) {
> if (i == err_idx)
> break;
> - vring_unmap_one(vq, &desc[i]);
> + vring_unmap_one_split(vq, &desc[i]);
> i = virtio16_to_cpu(_vq->vdev, vq->vring.desc[i].next);
> }
>
> @@ -435,6 +479,355 @@ static inline int virtqueue_add(struct virtqueue *_vq,
> return -EIO;
> }
>
> +static bool virtqueue_kick_prepare_split(struct virtqueue *_vq)
> +{
> + struct vring_virtqueue *vq = to_vvq(_vq);
> + u16 new, old;
> + bool needs_kick;
> +
> + START_USE(vq);
> + /* We need to expose available array entries before checking avail
> + * event. */
> + virtio_mb(vq->weak_barriers);
> +
> + old = vq->avail_idx_shadow - vq->num_added;
> + new = vq->avail_idx_shadow;
> + vq->num_added = 0;
> +
> +#ifdef DEBUG
> + if (vq->last_add_time_valid) {
> + WARN_ON(ktime_to_ms(ktime_sub(ktime_get(),
> + vq->last_add_time)) > 100);
> + }
> + vq->last_add_time_valid = false;
> +#endif
> +
> + if (vq->event) {
> + needs_kick = vring_need_event(virtio16_to_cpu(_vq->vdev, vring_avail_event(&vq->vring)),
> + new, old);
> + } else {
> + needs_kick = !(vq->vring.used->flags & cpu_to_virtio16(_vq->vdev, VRING_USED_F_NO_NOTIFY));
> + }
> + END_USE(vq);
> + return needs_kick;
> +}
> +
> +static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head,
> + void **ctx)
> +{
> + unsigned int i, j;
> + __virtio16 nextflag = cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_NEXT);
> +
> + /* Clear data ptr. */
> + vq->desc_state[head].data = NULL;
> +
> + /* Put back on free list: unmap first-level descriptors and find end */
> + i = head;
> +
> + while (vq->vring.desc[i].flags & nextflag) {
> + vring_unmap_one_split(vq, &vq->vring.desc[i]);
> + i = virtio16_to_cpu(vq->vq.vdev, vq->vring.desc[i].next);
> + vq->vq.num_free++;
> + }
> +
> + vring_unmap_one_split(vq, &vq->vring.desc[i]);
> + vq->vring.desc[i].next = cpu_to_virtio16(vq->vq.vdev, vq->free_head);
> + vq->free_head = head;
> +
> + /* Plus final descriptor */
> + vq->vq.num_free++;
> +
> + if (vq->indirect) {
> + struct vring_desc *indir_desc = vq->desc_state[head].indir_desc;
> + u32 len;
> +
> + /* Free the indirect table, if any, now that it's unmapped. */
> + if (!indir_desc)
> + return;
> +
> + len = virtio32_to_cpu(vq->vq.vdev, vq->vring.desc[head].len);
> +
> + BUG_ON(!(vq->vring.desc[head].flags &
> + cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_INDIRECT)));
> + BUG_ON(len == 0 || len % sizeof(struct vring_desc));
> +
> + for (j = 0; j < len / sizeof(struct vring_desc); j++)
> + vring_unmap_one_split(vq, &indir_desc[j]);
> +
> + kfree(indir_desc);
> + vq->desc_state[head].indir_desc = NULL;
> + } else if (ctx) {
> + *ctx = vq->desc_state[head].indir_desc;
> + }
> +}
> +
> +static inline bool more_used_split(const struct vring_virtqueue *vq)
> +{
> + return vq->last_used_idx != virtio16_to_cpu(vq->vq.vdev, vq->vring.used->idx);
> +}
> +
> +static void *virtqueue_get_buf_ctx_split(struct virtqueue *_vq,
> + unsigned int *len,
> + void **ctx)
> +{
> + struct vring_virtqueue *vq = to_vvq(_vq);
> + void *ret;
> + unsigned int i;
> + u16 last_used;
> +
> + START_USE(vq);
> +
> + if (unlikely(vq->broken)) {
> + END_USE(vq);
> + return NULL;
> + }
> +
> + if (!more_used_split(vq)) {
> + pr_debug("No more buffers in queue\n");
> + END_USE(vq);
> + return NULL;
> + }
> +
> + /* Only get used array entries after they have been exposed by host. */
> + virtio_rmb(vq->weak_barriers);
> +
> + last_used = (vq->last_used_idx & (vq->vring.num - 1));
> + i = virtio32_to_cpu(_vq->vdev, vq->vring.used->ring[last_used].id);
> + *len = virtio32_to_cpu(_vq->vdev, vq->vring.used->ring[last_used].len);
> +
> + if (unlikely(i >= vq->vring.num)) {
> + BAD_RING(vq, "id %u out of range\n", i);
> + return NULL;
> + }
> + if (unlikely(!vq->desc_state[i].data)) {
> + BAD_RING(vq, "id %u is not a head!\n", i);
> + return NULL;
> + }
> +
> + /* detach_buf_split clears data, so grab it now. */
> + ret = vq->desc_state[i].data;
> + detach_buf_split(vq, i, ctx);
> + vq->last_used_idx++;
> + /* If we expect an interrupt for the next entry, tell host
> + * by writing event index and flush out the write before
> + * the read in the next get_buf call. */
> + if (!(vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT))
> + virtio_store_mb(vq->weak_barriers,
> + &vring_used_event(&vq->vring),
> + cpu_to_virtio16(_vq->vdev, vq->last_used_idx));
> +
> +#ifdef DEBUG
> + vq->last_add_time_valid = false;
> +#endif
> +
> + END_USE(vq);
> + return ret;
> +}
> +
> +static void virtqueue_disable_cb_split(struct virtqueue *_vq)
> +{
> + struct vring_virtqueue *vq = to_vvq(_vq);
> +
> + if (!(vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT)) {
> + vq->avail_flags_shadow |= VRING_AVAIL_F_NO_INTERRUPT;
> + if (!vq->event)
> + vq->vring.avail->flags = cpu_to_virtio16(_vq->vdev, vq->avail_flags_shadow);
> + }
> +}
> +
> +static unsigned virtqueue_enable_cb_prepare_split(struct virtqueue *_vq)
> +{
> + struct vring_virtqueue *vq = to_vvq(_vq);
> + u16 last_used_idx;
> +
> + START_USE(vq);
> +
> + /* We optimistically turn back on interrupts, then check if there was
> + * more to do. */
> + /* Depending on the VIRTIO_RING_F_EVENT_IDX feature, we need to
> + * either clear the flags bit or point the event index at the next
> + * entry. Always do both to keep code simple. */
> + if (vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT) {
> + vq->avail_flags_shadow &= ~VRING_AVAIL_F_NO_INTERRUPT;
> + if (!vq->event)
> + vq->vring.avail->flags = cpu_to_virtio16(_vq->vdev, vq->avail_flags_shadow);
> + }
> + vring_used_event(&vq->vring) = cpu_to_virtio16(_vq->vdev, last_used_idx = vq->last_used_idx);
> + END_USE(vq);
> + return last_used_idx;
> +}
> +
> +static bool virtqueue_poll_split(struct virtqueue *_vq, unsigned last_used_idx)
> +{
> + struct vring_virtqueue *vq = to_vvq(_vq);
> +
> + virtio_mb(vq->weak_barriers);
> + return (u16)last_used_idx != virtio16_to_cpu(_vq->vdev, vq->vring.used->idx);
> +}
> +
> +static bool virtqueue_enable_cb_delayed_split(struct virtqueue *_vq)
> +{
> + struct vring_virtqueue *vq = to_vvq(_vq);
> + u16 bufs;
> +
> + START_USE(vq);
> +
> + /* We optimistically turn back on interrupts, then check if there was
> + * more to do. */
> + /* Depending on the VIRTIO_RING_F_USED_EVENT_IDX feature, we need to
> + * either clear the flags bit or point the event index at the next
> + * entry. Always update the event index to keep code simple. */
> + if (vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT) {
> + vq->avail_flags_shadow &= ~VRING_AVAIL_F_NO_INTERRUPT;
> + if (!vq->event)
> + vq->vring.avail->flags = cpu_to_virtio16(_vq->vdev, vq->avail_flags_shadow);
> + }
> + /* TODO: tune this threshold */
> + bufs = (u16)(vq->avail_idx_shadow - vq->last_used_idx) * 3 / 4;
> +
> + virtio_store_mb(vq->weak_barriers,
> + &vring_used_event(&vq->vring),
> + cpu_to_virtio16(_vq->vdev, vq->last_used_idx + bufs));
> +
> + if (unlikely((u16)(virtio16_to_cpu(_vq->vdev, vq->vring.used->idx) - vq->last_used_idx) > bufs)) {
> + END_USE(vq);
> + return false;
> + }
> +
> + END_USE(vq);
> + return true;
> +}
> +
> +static void *virtqueue_detach_unused_buf_split(struct virtqueue *_vq)
> +{
> + struct vring_virtqueue *vq = to_vvq(_vq);
> + unsigned int i;
> + void *buf;
> +
> + START_USE(vq);
> +
> + for (i = 0; i < vq->vring.num; i++) {
> + if (!vq->desc_state[i].data)
> + continue;
> + /* detach_buf clears data, so grab it now. */
> + buf = vq->desc_state[i].data;
> + detach_buf_split(vq, i, NULL);
> + vq->avail_idx_shadow--;
> + vq->vring.avail->idx = cpu_to_virtio16(_vq->vdev, vq->avail_idx_shadow);
> + END_USE(vq);
> + return buf;
> + }
> + /* That should have freed everything. */
> + BUG_ON(vq->vq.num_free != vq->vring.num);
> +
> + END_USE(vq);
> + return NULL;
> +}
I think the those copy-and-paste hunks could be avoided and the diff
should only contains renaming of the function. If yes, it would be very
welcomed since it requires to compare the changes verbatim otherwise.
> +
> +/*
> + * The layout for the packed ring is a continuous chunk of memory
> + * which looks like this.
> + *
> + * struct vring_packed {
> + * // The actual descriptors (16 bytes each)
> + * struct vring_packed_desc desc[num];
> + *
> + * // Padding to the next align boundary.
> + * char pad[];
> + *
> + * // Driver Event Suppression
> + * struct vring_packed_desc_event driver;
> + *
> + * // Device Event Suppression
> + * struct vring_packed_desc_event device;
> + * };
> + */
> +static inline void vring_init_packed(struct vring_packed *vr, unsigned int num,
> + void *p, unsigned long align)
> +{
> + vr->num = num;
> + vr->desc = p;
> + vr->driver = (void *)(((uintptr_t)p + sizeof(struct vring_packed_desc)
> + * num + align - 1) & ~(align - 1));
If we choose not to go uapi, maybe we can use ALIGN() macro here?
> + vr->device = vr->driver + 1;
> +}
> +
> +static inline unsigned vring_size_packed(unsigned int num, unsigned long align)
> +{
> + return ((sizeof(struct vring_packed_desc) * num + align - 1)
> + & ~(align - 1)) + sizeof(struct vring_packed_desc_event) * 2;
> +}
> +
> +static inline int virtqueue_add_packed(struct virtqueue *_vq,
> + struct scatterlist *sgs[],
> + unsigned int total_sg,
> + unsigned int out_sgs,
> + unsigned int in_sgs,
> + void *data,
> + void *ctx,
> + gfp_t gfp)
> +{
> + return -EIO;
> +}
> +
> +static bool virtqueue_kick_prepare_packed(struct virtqueue *_vq)
> +{
> + return false;
> +}
> +
> +static inline bool more_used_packed(const struct vring_virtqueue *vq)
> +{
> + return false;
> +}
> +
> +static void *virtqueue_get_buf_ctx_packed(struct virtqueue *_vq,
> + unsigned int *len,
> + void **ctx)
> +{
> + return NULL;
> +}
> +
> +static void virtqueue_disable_cb_packed(struct virtqueue *_vq)
> +{
> +}
> +
> +static unsigned virtqueue_enable_cb_prepare_packed(struct virtqueue *_vq)
> +{
> + return 0;
> +}
> +
> +static bool virtqueue_poll_packed(struct virtqueue *_vq, unsigned last_used_idx)
> +{
> + return false;
> +}
> +
> +static bool virtqueue_enable_cb_delayed_packed(struct virtqueue *_vq)
> +{
> + return false;
> +}
> +
> +static void *virtqueue_detach_unused_buf_packed(struct virtqueue *_vq)
> +{
> + return NULL;
> +}
> +
> +static inline int virtqueue_add(struct virtqueue *_vq,
> + struct scatterlist *sgs[],
> + unsigned int total_sg,
> + unsigned int out_sgs,
> + unsigned int in_sgs,
> + void *data,
> + void *ctx,
> + gfp_t gfp)
> +{
> + struct vring_virtqueue *vq = to_vvq(_vq);
> +
> + return vq->packed ? virtqueue_add_packed(_vq, sgs, total_sg, out_sgs,
> + in_sgs, data, ctx, gfp) :
> + virtqueue_add_split(_vq, sgs, total_sg, out_sgs,
> + in_sgs, data, ctx, gfp);
> +}
> +
> /**
> * virtqueue_add_sgs - expose buffers to other end
> * @vq: the struct virtqueue we're talking about.
> @@ -551,34 +944,9 @@ EXPORT_SYMBOL_GPL(virtqueue_add_inbuf_ctx);
> bool virtqueue_kick_prepare(struct virtqueue *_vq)
> {
> struct vring_virtqueue *vq = to_vvq(_vq);
> - u16 new, old;
> - bool needs_kick;
>
> - START_USE(vq);
> - /* We need to expose available array entries before checking avail
> - * event. */
> - virtio_mb(vq->weak_barriers);
> -
> - old = vq->avail_idx_shadow - vq->num_added;
> - new = vq->avail_idx_shadow;
> - vq->num_added = 0;
> -
> -#ifdef DEBUG
> - if (vq->last_add_time_valid) {
> - WARN_ON(ktime_to_ms(ktime_sub(ktime_get(),
> - vq->last_add_time)) > 100);
> - }
> - vq->last_add_time_valid = false;
> -#endif
> -
> - if (vq->event) {
> - needs_kick = vring_need_event(virtio16_to_cpu(_vq->vdev, vring_avail_event(&vq->vring)),
> - new, old);
> - } else {
> - needs_kick = !(vq->vring.used->flags & cpu_to_virtio16(_vq->vdev, VRING_USED_F_NO_NOTIFY));
> - }
> - END_USE(vq);
> - return needs_kick;
> + return vq->packed ? virtqueue_kick_prepare_packed(_vq) :
> + virtqueue_kick_prepare_split(_vq);
> }
> EXPORT_SYMBOL_GPL(virtqueue_kick_prepare);
>
> @@ -626,58 +994,9 @@ bool virtqueue_kick(struct virtqueue *vq)
> }
> EXPORT_SYMBOL_GPL(virtqueue_kick);
>
> -static void detach_buf(struct vring_virtqueue *vq, unsigned int head,
> - void **ctx)
> -{
> - unsigned int i, j;
> - __virtio16 nextflag = cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_NEXT);
> -
> - /* Clear data ptr. */
> - vq->desc_state[head].data = NULL;
> -
> - /* Put back on free list: unmap first-level descriptors and find end */
> - i = head;
> -
> - while (vq->vring.desc[i].flags & nextflag) {
> - vring_unmap_one(vq, &vq->vring.desc[i]);
> - i = virtio16_to_cpu(vq->vq.vdev, vq->vring.desc[i].next);
> - vq->vq.num_free++;
> - }
> -
> - vring_unmap_one(vq, &vq->vring.desc[i]);
> - vq->vring.desc[i].next = cpu_to_virtio16(vq->vq.vdev, vq->free_head);
> - vq->free_head = head;
> -
> - /* Plus final descriptor */
> - vq->vq.num_free++;
> -
> - if (vq->indirect) {
> - struct vring_desc *indir_desc = vq->desc_state[head].indir_desc;
> - u32 len;
> -
> - /* Free the indirect table, if any, now that it's unmapped. */
> - if (!indir_desc)
> - return;
> -
> - len = virtio32_to_cpu(vq->vq.vdev, vq->vring.desc[head].len);
> -
> - BUG_ON(!(vq->vring.desc[head].flags &
> - cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_INDIRECT)));
> - BUG_ON(len == 0 || len % sizeof(struct vring_desc));
> -
> - for (j = 0; j < len / sizeof(struct vring_desc); j++)
> - vring_unmap_one(vq, &indir_desc[j]);
> -
> - kfree(indir_desc);
> - vq->desc_state[head].indir_desc = NULL;
> - } else if (ctx) {
> - *ctx = vq->desc_state[head].indir_desc;
> - }
> -}
> -
> static inline bool more_used(const struct vring_virtqueue *vq)
> {
> - return vq->last_used_idx != virtio16_to_cpu(vq->vq.vdev, vq->vring.used->idx);
> + return vq->packed ? more_used_packed(vq) : more_used_split(vq);
> }
>
> /**
> @@ -700,57 +1019,9 @@ void *virtqueue_get_buf_ctx(struct virtqueue *_vq, unsigned int *len,
> void **ctx)
> {
> struct vring_virtqueue *vq = to_vvq(_vq);
> - void *ret;
> - unsigned int i;
> - u16 last_used;
>
> - START_USE(vq);
> -
> - if (unlikely(vq->broken)) {
> - END_USE(vq);
> - return NULL;
> - }
> -
> - if (!more_used(vq)) {
> - pr_debug("No more buffers in queue\n");
> - END_USE(vq);
> - return NULL;
> - }
> -
> - /* Only get used array entries after they have been exposed by host. */
> - virtio_rmb(vq->weak_barriers);
> -
> - last_used = (vq->last_used_idx & (vq->vring.num - 1));
> - i = virtio32_to_cpu(_vq->vdev, vq->vring.used->ring[last_used].id);
> - *len = virtio32_to_cpu(_vq->vdev, vq->vring.used->ring[last_used].len);
> -
> - if (unlikely(i >= vq->vring.num)) {
> - BAD_RING(vq, "id %u out of range\n", i);
> - return NULL;
> - }
> - if (unlikely(!vq->desc_state[i].data)) {
> - BAD_RING(vq, "id %u is not a head!\n", i);
> - return NULL;
> - }
> -
> - /* detach_buf clears data, so grab it now. */
> - ret = vq->desc_state[i].data;
> - detach_buf(vq, i, ctx);
> - vq->last_used_idx++;
> - /* If we expect an interrupt for the next entry, tell host
> - * by writing event index and flush out the write before
> - * the read in the next get_buf call. */
> - if (!(vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT))
> - virtio_store_mb(vq->weak_barriers,
> - &vring_used_event(&vq->vring),
> - cpu_to_virtio16(_vq->vdev, vq->last_used_idx));
> -
> -#ifdef DEBUG
> - vq->last_add_time_valid = false;
> -#endif
> -
> - END_USE(vq);
> - return ret;
> + return vq->packed ? virtqueue_get_buf_ctx_packed(_vq, len, ctx) :
> + virtqueue_get_buf_ctx_split(_vq, len, ctx);
> }
> EXPORT_SYMBOL_GPL(virtqueue_get_buf_ctx);
>
> @@ -772,12 +1043,10 @@ void virtqueue_disable_cb(struct virtqueue *_vq)
> {
> struct vring_virtqueue *vq = to_vvq(_vq);
>
> - if (!(vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT)) {
> - vq->avail_flags_shadow |= VRING_AVAIL_F_NO_INTERRUPT;
> - if (!vq->event)
> - vq->vring.avail->flags = cpu_to_virtio16(_vq->vdev, vq->avail_flags_shadow);
> - }
> -
> + if (vq->packed)
> + virtqueue_disable_cb_packed(_vq);
> + else
> + virtqueue_disable_cb_split(_vq);
> }
> EXPORT_SYMBOL_GPL(virtqueue_disable_cb);
>
> @@ -796,23 +1065,9 @@ EXPORT_SYMBOL_GPL(virtqueue_disable_cb);
> unsigned virtqueue_enable_cb_prepare(struct virtqueue *_vq)
> {
> struct vring_virtqueue *vq = to_vvq(_vq);
> - u16 last_used_idx;
>
> - START_USE(vq);
> -
> - /* We optimistically turn back on interrupts, then check if there was
> - * more to do. */
> - /* Depending on the VIRTIO_RING_F_EVENT_IDX feature, we need to
> - * either clear the flags bit or point the event index at the next
> - * entry. Always do both to keep code simple. */
> - if (vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT) {
> - vq->avail_flags_shadow &= ~VRING_AVAIL_F_NO_INTERRUPT;
> - if (!vq->event)
> - vq->vring.avail->flags = cpu_to_virtio16(_vq->vdev, vq->avail_flags_shadow);
> - }
> - vring_used_event(&vq->vring) = cpu_to_virtio16(_vq->vdev, last_used_idx = vq->last_used_idx);
> - END_USE(vq);
> - return last_used_idx;
> + return vq->packed ? virtqueue_enable_cb_prepare_packed(_vq) :
> + virtqueue_enable_cb_prepare_split(_vq);
> }
> EXPORT_SYMBOL_GPL(virtqueue_enable_cb_prepare);
>
> @@ -829,8 +1084,8 @@ bool virtqueue_poll(struct virtqueue *_vq, unsigned last_used_idx)
> {
> struct vring_virtqueue *vq = to_vvq(_vq);
>
> - virtio_mb(vq->weak_barriers);
> - return (u16)last_used_idx != virtio16_to_cpu(_vq->vdev, vq->vring.used->idx);
> + return vq->packed ? virtqueue_poll_packed(_vq, last_used_idx) :
> + virtqueue_poll_split(_vq, last_used_idx);
> }
> EXPORT_SYMBOL_GPL(virtqueue_poll);
>
> @@ -868,34 +1123,9 @@ EXPORT_SYMBOL_GPL(virtqueue_enable_cb);
> bool virtqueue_enable_cb_delayed(struct virtqueue *_vq)
> {
> struct vring_virtqueue *vq = to_vvq(_vq);
> - u16 bufs;
>
> - START_USE(vq);
> -
> - /* We optimistically turn back on interrupts, then check if there was
> - * more to do. */
> - /* Depending on the VIRTIO_RING_F_USED_EVENT_IDX feature, we need to
> - * either clear the flags bit or point the event index at the next
> - * entry. Always update the event index to keep code simple. */
> - if (vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT) {
> - vq->avail_flags_shadow &= ~VRING_AVAIL_F_NO_INTERRUPT;
> - if (!vq->event)
> - vq->vring.avail->flags = cpu_to_virtio16(_vq->vdev, vq->avail_flags_shadow);
> - }
> - /* TODO: tune this threshold */
> - bufs = (u16)(vq->avail_idx_shadow - vq->last_used_idx) * 3 / 4;
> -
> - virtio_store_mb(vq->weak_barriers,
> - &vring_used_event(&vq->vring),
> - cpu_to_virtio16(_vq->vdev, vq->last_used_idx + bufs));
> -
> - if (unlikely((u16)(virtio16_to_cpu(_vq->vdev, vq->vring.used->idx) - vq->last_used_idx) > bufs)) {
> - END_USE(vq);
> - return false;
> - }
> -
> - END_USE(vq);
> - return true;
> + return vq->packed ? virtqueue_enable_cb_delayed_packed(_vq) :
> + virtqueue_enable_cb_delayed_split(_vq);
> }
> EXPORT_SYMBOL_GPL(virtqueue_enable_cb_delayed);
>
> @@ -910,27 +1140,9 @@ EXPORT_SYMBOL_GPL(virtqueue_enable_cb_delayed);
> void *virtqueue_detach_unused_buf(struct virtqueue *_vq)
> {
> struct vring_virtqueue *vq = to_vvq(_vq);
> - unsigned int i;
> - void *buf;
>
> - START_USE(vq);
> -
> - for (i = 0; i < vq->vring.num; i++) {
> - if (!vq->desc_state[i].data)
> - continue;
> - /* detach_buf clears data, so grab it now. */
> - buf = vq->desc_state[i].data;
> - detach_buf(vq, i, NULL);
> - vq->avail_idx_shadow--;
> - vq->vring.avail->idx = cpu_to_virtio16(_vq->vdev, vq->avail_idx_shadow);
> - END_USE(vq);
> - return buf;
> - }
> - /* That should have freed everything. */
> - BUG_ON(vq->vq.num_free != vq->vring.num);
> -
> - END_USE(vq);
> - return NULL;
> + return vq->packed ? virtqueue_detach_unused_buf_packed(_vq) :
> + virtqueue_detach_unused_buf_split(_vq);
> }
> EXPORT_SYMBOL_GPL(virtqueue_detach_unused_buf);
>
> @@ -955,7 +1167,8 @@ irqreturn_t vring_interrupt(int irq, void *_vq)
> EXPORT_SYMBOL_GPL(vring_interrupt);
>
> struct virtqueue *__vring_new_virtqueue(unsigned int index,
> - struct vring vring,
> + union vring_union vring,
> + bool packed,
> struct virtio_device *vdev,
> bool weak_barriers,
> bool context,
> @@ -963,19 +1176,22 @@ struct virtqueue *__vring_new_virtqueue(unsigned int index,
> void (*callback)(struct virtqueue *),
> const char *name)
> {
> - unsigned int i;
> struct vring_virtqueue *vq;
> + unsigned int num, i;
> + size_t size;
>
> - vq = kmalloc(sizeof(*vq) + vring.num * sizeof(struct vring_desc_state),
> - GFP_KERNEL);
> + num = packed ? vring.vring_packed.num : vring.vring_split.num;
> + size = packed ? num * sizeof(struct vring_desc_state_packed) :
> + num * sizeof(struct vring_desc_state);
> +
> + vq = kmalloc(sizeof(*vq) + size, GFP_KERNEL);
> if (!vq)
> return NULL;
>
> - vq->vring = vring;
> vq->vq.callback = callback;
> vq->vq.vdev = vdev;
> vq->vq.name = name;
> - vq->vq.num_free = vring.num;
> + vq->vq.num_free = num;
> vq->vq.index = index;
> vq->we_own_ring = false;
> vq->queue_dma_addr = 0;
> @@ -984,9 +1200,8 @@ struct virtqueue *__vring_new_virtqueue(unsigned int index,
> vq->weak_barriers = weak_barriers;
> vq->broken = false;
> vq->last_used_idx = 0;
> - vq->avail_flags_shadow = 0;
> - vq->avail_idx_shadow = 0;
> vq->num_added = 0;
> + vq->packed = packed;
> list_add_tail(&vq->vq.list, &vdev->vqs);
> #ifdef DEBUG
> vq->in_use = false;
> @@ -997,19 +1212,48 @@ struct virtqueue *__vring_new_virtqueue(unsigned int index,
> !context;
> vq->event = virtio_has_feature(vdev, VIRTIO_RING_F_EVENT_IDX);
>
> + if (vq->packed) {
> + vq->vring_packed = vring.vring_packed;
> + vq->next_avail_idx = 0;
> + vq->avail_wrap_counter = 1;
> + vq->used_wrap_counter = 1;
> + vq->event_flags_shadow = 0;
> +
> + memset(vq->desc_state_packed, 0,
> + num * sizeof(struct vring_desc_state_packed));
> +
> x
> + vq->free_head = 0;
> + for (i = 0; i < num-1; i++)
> + vq->desc_state_packed[i].next = i + 1;
> + } else {
> + vq->vring = vring.vring_split;
> + vq->avail_flags_shadow = 0;
> + vq->avail_idx_shadow = 0;
> +
> + /* Put everything in free lists. */
> + vq->free_head = 0;
> + for (i = 0; i < num-1; i++)
> + vq->vring.desc[i].next = cpu_to_virtio16(vdev, i + 1);
> +
> + memset(vq->desc_state, 0,
> + num * sizeof(struct vring_desc_state));
> + }
> +
> /* No callback? Tell other side not to bother us. */
> if (!callback) {
> - vq->avail_flags_shadow |= VRING_AVAIL_F_NO_INTERRUPT;
> - if (!vq->event)
> - vq->vring.avail->flags = cpu_to_virtio16(vdev, vq->avail_flags_shadow);
> + if (packed) {
> + vq->event_flags_shadow = VRING_EVENT_F_DISABLE;
> + vq->vring_packed.driver->flags = cpu_to_virtio16(vdev,
> + vq->event_flags_shadow);
> + } else {
> + vq->avail_flags_shadow |= VRING_AVAIL_F_NO_INTERRUPT;
> + if (!vq->event)
> + vq->vring.avail->flags = cpu_to_virtio16(vdev,
> + vq->avail_flags_shadow);
> + }
> }
>
> - /* Put everything in free lists. */
> - vq->free_head = 0;
> - for (i = 0; i < vring.num-1; i++)
> - vq->vring.desc[i].next = cpu_to_virtio16(vdev, i + 1);
> - memset(vq->desc_state, 0, vring.num * sizeof(struct vring_desc_state));
> -
> return &vq->vq;
> }
> EXPORT_SYMBOL_GPL(__vring_new_virtqueue);
> @@ -1056,6 +1300,12 @@ static void vring_free_queue(struct virtio_device *vdev, size_t size,
> }
> }
>
> +static inline int
> +__vring_size(unsigned int num, unsigned long align, bool packed)
> +{
> + return packed ? vring_size_packed(num, align) : vring_size(num, align);
> +}
> +
> struct virtqueue *vring_create_virtqueue(
> unsigned int index,
> unsigned int num,
> @@ -1072,7 +1322,8 @@ struct virtqueue *vring_create_virtqueue(
> void *queue = NULL;
> dma_addr_t dma_addr;
> size_t queue_size_in_bytes;
> - struct vring vring;
> + union vring_union vring;
> + bool packed;
>
> /* We assume num is a power of 2. */
> if (num & (num - 1)) {
> @@ -1080,9 +1331,13 @@ struct virtqueue *vring_create_virtqueue(
> return NULL;
> }
>
> + packed = virtio_has_feature(vdev, VIRTIO_F_RING_PACKED);
> +
> /* TODO: allocate each queue chunk individually */
> - for (; num && vring_size(num, vring_align) > PAGE_SIZE; num /= 2) {
> - queue = vring_alloc_queue(vdev, vring_size(num, vring_align),
> + for (; num && __vring_size(num, vring_align, packed) > PAGE_SIZE;
> + num /= 2) {
> + queue = vring_alloc_queue(vdev, __vring_size(num, vring_align,
> + packed),
> &dma_addr,
> GFP_KERNEL|__GFP_NOWARN|__GFP_ZERO);
> if (queue)
> @@ -1094,17 +1349,21 @@ struct virtqueue *vring_create_virtqueue(
>
> if (!queue) {
> /* Try to get a single page. You are my only hope! */
> - queue = vring_alloc_queue(vdev, vring_size(num, vring_align),
> + queue = vring_alloc_queue(vdev, __vring_size(num, vring_align,
> + packed),
> &dma_addr, GFP_KERNEL|__GFP_ZERO);
> }
> if (!queue)
> return NULL;
>
> - queue_size_in_bytes = vring_size(num, vring_align);
> - vring_init(&vring, num, queue, vring_align);
> + queue_size_in_bytes = __vring_size(num, vring_align, packed);
> + if (packed)
> + vring_init_packed(&vring.vring_packed, num, queue, vring_align);
> + else
> + vring_init(&vring.vring_split, num, queue, vring_align);
>
> - vq = __vring_new_virtqueue(index, vring, vdev, weak_barriers, context,
> - notify, callback, name);
> + vq = __vring_new_virtqueue(index, vring, packed, vdev, weak_barriers,
> + context, notify, callback, name);
> if (!vq) {
> vring_free_queue(vdev, queue_size_in_bytes, queue,
> dma_addr);
> @@ -1130,10 +1389,17 @@ struct virtqueue *vring_new_virtqueue(unsigned int index,
> void (*callback)(struct virtqueue *vq),
> const char *name)
> {
> - struct vring vring;
> - vring_init(&vring, num, pages, vring_align);
> - return __vring_new_virtqueue(index, vring, vdev, weak_barriers, context,
> - notify, callback, name);
> + union vring_union vring;
> + bool packed;
> +
> + packed = virtio_has_feature(vdev, VIRTIO_F_RING_PACKED);
> + if (packed)
> + vring_init_packed(&vring.vring_packed, num, pages, vring_align);
> + else
> + vring_init(&vring.vring_split, num, pages, vring_align);
> +
> + return __vring_new_virtqueue(index, vring, packed, vdev, weak_barriers,
> + context, notify, callback, name);
> }
> EXPORT_SYMBOL_GPL(vring_new_virtqueue);
>
> @@ -1143,7 +1409,9 @@ void vring_del_virtqueue(struct virtqueue *_vq)
>
> if (vq->we_own_ring) {
> vring_free_queue(vq->vq.vdev, vq->queue_size_in_bytes,
> - vq->vring.desc, vq->queue_dma_addr);
> + vq->packed ? (void *)vq->vring_packed.desc :
> + (void *)vq->vring.desc,
> + vq->queue_dma_addr);
> }
> list_del(&_vq->list);
> kfree(vq);
> @@ -1185,7 +1453,7 @@ unsigned int virtqueue_get_vring_size(struct virtqueue *_vq)
>
> struct vring_virtqueue *vq = to_vvq(_vq);
>
> - return vq->vring.num;
> + return vq->packed ? vq->vring_packed.num : vq->vring.num;
> }
> EXPORT_SYMBOL_GPL(virtqueue_get_vring_size);
>
> @@ -1228,6 +1496,10 @@ dma_addr_t virtqueue_get_avail_addr(struct virtqueue *_vq)
>
> BUG_ON(!vq->we_own_ring);
>
> + if (vq->packed)
> + return vq->queue_dma_addr + ((char *)vq->vring_packed.driver -
> + (char *)vq->vring_packed.desc);
> +
> return vq->queue_dma_addr +
> ((char *)vq->vring.avail - (char *)vq->vring.desc);
> }
> @@ -1239,11 +1511,16 @@ dma_addr_t virtqueue_get_used_addr(struct virtqueue *_vq)
>
> BUG_ON(!vq->we_own_ring);
>
> + if (vq->packed)
> + return vq->queue_dma_addr + ((char *)vq->vring_packed.device -
> + (char *)vq->vring_packed.desc);
> +
> return vq->queue_dma_addr +
> ((char *)vq->vring.used - (char *)vq->vring.desc);
> }
> EXPORT_SYMBOL_GPL(virtqueue_get_used_addr);
>
> +/* Only available for split ring */
> const struct vring *virtqueue_get_vring(struct virtqueue *vq)
> {
A possible issue with this is:
After commit d4674240f31f8c4289abba07d64291c6ddce51bc ("KVM: s390:
virtio-ccw revision 1 SET_VQ"). CCW tries to use
virtqueue_get_avail()/virtqueue_get_used(). Looks like a bug either here
or ccw code.
Thanks
> return &to_vvq(vq)->vring;
> diff --git a/include/linux/virtio_ring.h b/include/linux/virtio_ring.h
> index bbf32524ab27..a0075894ad16 100644
> --- a/include/linux/virtio_ring.h
> +++ b/include/linux/virtio_ring.h
> @@ -60,6 +60,11 @@ static inline void virtio_store_mb(bool weak_barriers,
> struct virtio_device;
> struct virtqueue;
>
> +union vring_union {
> + struct vring vring_split;
> + struct vring_packed vring_packed;
> +};
> +
> /*
> * Creates a virtqueue and allocates the descriptor ring. If
> * may_reduce_num is set, then this may allocate a smaller ring than
> @@ -79,7 +84,8 @@ struct virtqueue *vring_create_virtqueue(unsigned int index,
>
> /* Creates a virtqueue with a custom layout. */
> struct virtqueue *__vring_new_virtqueue(unsigned int index,
> - struct vring vring,
> + union vring_union vring,
> + bool packed,
> struct virtio_device *vdev,
> bool weak_barriers,
> bool ctx,
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH net-next v12 0/5] Enable virtio_net to act as a standby for a passthru device
From: David Miller @ 2018-05-29 3:00 UTC (permalink / raw)
To: sridhar.samudrala
Cc: alexander.h.duyck, virtio-dev, jiri, mst, kubakici, netdev,
virtualization, loseweigh, anjali.singhai, aaron.f.brown
In-Reply-To: <1527180917-39737-1-git-send-email-sridhar.samudrala@intel.com>
From: Sridhar Samudrala <sridhar.samudrala@intel.com>
Date: Thu, 24 May 2018 09:55:12 -0700
> The main motivation for this patch is to enable cloud service providers
> to provide an accelerated datapath to virtio-net enabled VMs in a
> transparent manner with no/minimal guest userspace changes. This also
> enables hypervisor controlled live migration to be supported with VMs that
> have direct attached SR-IOV VF devices.
...
Series applied, thank you.
^ permalink raw reply
* Re: [RFC v5 3/5] virtio_ring: add packed ring support
From: Jason Wang @ 2018-05-29 3:18 UTC (permalink / raw)
To: Tiwei Bie, mst, virtualization, linux-kernel, netdev; +Cc: wexu
In-Reply-To: <20180522081648.14768-4-tiwei.bie@intel.com>
On 2018年05月22日 16:16, Tiwei Bie wrote:
> This commit introduces the support (without EVENT_IDX) for
> packed ring.
>
> Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
> ---
> drivers/virtio/virtio_ring.c | 474 ++++++++++++++++++++++++++++++++++-
> 1 file changed, 468 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index f5ef5f42a7cf..eb9fd5207a68 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -62,6 +62,12 @@ struct vring_desc_state {
> };
>
> struct vring_desc_state_packed {
> + void *data; /* Data for callback. */
> + struct vring_packed_desc *indir_desc; /* Indirect descriptor, if any. */
> + int num; /* Descriptor list length. */
> + dma_addr_t addr; /* Buffer DMA addr. */
> + u32 len; /* Buffer length. */
> + u16 flags; /* Descriptor flags. */
> int next; /* The next desc state. */
> };
>
> @@ -758,6 +764,72 @@ static inline unsigned vring_size_packed(unsigned int num, unsigned long align)
> & ~(align - 1)) + sizeof(struct vring_packed_desc_event) * 2;
> }
>
> +static void vring_unmap_state_packed(const struct vring_virtqueue *vq,
> + struct vring_desc_state_packed *state)
> +{
> + u16 flags;
> +
> + if (!vring_use_dma_api(vq->vq.vdev))
> + return;
> +
> + flags = state->flags;
> +
> + if (flags & VRING_DESC_F_INDIRECT) {
> + dma_unmap_single(vring_dma_dev(vq),
> + state->addr, state->len,
> + (flags & VRING_DESC_F_WRITE) ?
> + DMA_FROM_DEVICE : DMA_TO_DEVICE);
> + } else {
> + dma_unmap_page(vring_dma_dev(vq),
> + state->addr, state->len,
> + (flags & VRING_DESC_F_WRITE) ?
> + DMA_FROM_DEVICE : DMA_TO_DEVICE);
> + }
> +}
> +
> +static void vring_unmap_desc_packed(const struct vring_virtqueue *vq,
> + struct vring_packed_desc *desc)
> +{
> + u16 flags;
> +
> + if (!vring_use_dma_api(vq->vq.vdev))
> + return;
> +
> + flags = virtio16_to_cpu(vq->vq.vdev, desc->flags);
> +
> + if (flags & VRING_DESC_F_INDIRECT) {
> + dma_unmap_single(vring_dma_dev(vq),
> + virtio64_to_cpu(vq->vq.vdev, desc->addr),
> + virtio32_to_cpu(vq->vq.vdev, desc->len),
> + (flags & VRING_DESC_F_WRITE) ?
> + DMA_FROM_DEVICE : DMA_TO_DEVICE);
> + } else {
> + dma_unmap_page(vring_dma_dev(vq),
> + virtio64_to_cpu(vq->vq.vdev, desc->addr),
> + virtio32_to_cpu(vq->vq.vdev, desc->len),
> + (flags & VRING_DESC_F_WRITE) ?
> + DMA_FROM_DEVICE : DMA_TO_DEVICE);
> + }
> +}
> +
> +static struct vring_packed_desc *alloc_indirect_packed(struct virtqueue *_vq,
> + unsigned int total_sg,
> + gfp_t gfp)
> +{
> + struct vring_packed_desc *desc;
> +
> + /*
> + * We require lowmem mappings for the descriptors because
> + * otherwise virt_to_phys will give us bogus addresses in the
> + * virtqueue.
> + */
> + gfp &= ~__GFP_HIGHMEM;
> +
> + desc = kmalloc(total_sg * sizeof(struct vring_packed_desc), gfp);
> +
> + return desc;
> +}
> +
> static inline int virtqueue_add_packed(struct virtqueue *_vq,
> struct scatterlist *sgs[],
> unsigned int total_sg,
> @@ -767,47 +839,437 @@ static inline int virtqueue_add_packed(struct virtqueue *_vq,
> void *ctx,
> gfp_t gfp)
> {
> + struct vring_virtqueue *vq = to_vvq(_vq);
> + struct vring_packed_desc *desc;
> + struct scatterlist *sg;
> + unsigned int i, n, descs_used, uninitialized_var(prev), err_idx;
> + __virtio16 uninitialized_var(head_flags), flags;
> + u16 head, avail_wrap_counter, id, cur;
> + bool indirect;
> +
> + START_USE(vq);
> +
> + BUG_ON(data == NULL);
> + BUG_ON(ctx && vq->indirect);
> +
> + if (unlikely(vq->broken)) {
> + END_USE(vq);
> + return -EIO;
> + }
> +
> +#ifdef DEBUG
> + {
> + ktime_t now = ktime_get();
> +
> + /* No kick or get, with .1 second between? Warn. */
> + if (vq->last_add_time_valid)
> + WARN_ON(ktime_to_ms(ktime_sub(now, vq->last_add_time))
> + > 100);
> + vq->last_add_time = now;
> + vq->last_add_time_valid = true;
> + }
> +#endif
> +
> + BUG_ON(total_sg == 0);
> +
> + head = vq->next_avail_idx;
> + avail_wrap_counter = vq->avail_wrap_counter;
> +
> + if (virtqueue_use_indirect(_vq, total_sg))
> + desc = alloc_indirect_packed(_vq, total_sg, gfp);
> + else {
> + desc = NULL;
> + WARN_ON_ONCE(total_sg > vq->vring_packed.num && !vq->indirect);
> + }
> +
> + if (desc) {
> + /* Use a single buffer which doesn't continue */
> + indirect = true;
> + /* Set up rest to use this indirect table. */
> + i = 0;
> + descs_used = 1;
> + } else {
> + indirect = false;
> + desc = vq->vring_packed.desc;
> + i = head;
> + descs_used = total_sg;
> + }
> +
> + if (vq->vq.num_free < descs_used) {
> + pr_debug("Can't add buf len %i - avail = %i\n",
> + descs_used, vq->vq.num_free);
> + /* FIXME: for historical reasons, we force a notify here if
> + * there are outgoing parts to the buffer. Presumably the
> + * host should service the ring ASAP. */
> + if (out_sgs)
> + vq->notify(&vq->vq);
> + if (indirect)
> + kfree(desc);
> + END_USE(vq);
> + return -ENOSPC;
> + }
> +
> + id = vq->free_head;
> + BUG_ON(id == vq->vring_packed.num);
> +
> + cur = id;
> + for (n = 0; n < out_sgs + in_sgs; n++) {
> + for (sg = sgs[n]; sg; sg = sg_next(sg)) {
> + dma_addr_t addr = vring_map_one_sg(vq, sg, n < out_sgs ?
> + DMA_TO_DEVICE : DMA_FROM_DEVICE);
> + if (vring_mapping_error(vq, addr))
> + goto unmap_release;
> +
> + flags = cpu_to_virtio16(_vq->vdev, VRING_DESC_F_NEXT |
> + (n < out_sgs ? 0 : VRING_DESC_F_WRITE) |
> + VRING_DESC_F_AVAIL(vq->avail_wrap_counter) |
> + VRING_DESC_F_USED(!vq->avail_wrap_counter));
> + if (!indirect && i == head)
> + head_flags = flags;
> + else
> + desc[i].flags = flags;
> +
> + desc[i].addr = cpu_to_virtio64(_vq->vdev, addr);
> + desc[i].len = cpu_to_virtio32(_vq->vdev, sg->length);
> + i++;
> + if (!indirect) {
> + vq->desc_state_packed[cur].addr = addr;
> + vq->desc_state_packed[cur].len = sg->length;
> + vq->desc_state_packed[cur].flags =
> + virtio16_to_cpu(_vq->vdev, flags);
> + cur = vq->desc_state_packed[cur].next;
> +
> + if (i >= vq->vring_packed.num) {
> + i = 0;
> + vq->avail_wrap_counter ^= 1;
> + }
> + }
> + }
> + }
> +
> + prev = (i > 0 ? i : vq->vring_packed.num) - 1;
> + desc[prev].id = cpu_to_virtio16(_vq->vdev, id);
> +
> + /* Last one doesn't continue. */
> + if (total_sg == 1)
> + head_flags &= cpu_to_virtio16(_vq->vdev, ~VRING_DESC_F_NEXT);
> + else
> + desc[prev].flags &= cpu_to_virtio16(_vq->vdev,
> + ~VRING_DESC_F_NEXT);
> +
> + if (indirect) {
> + /* Now that the indirect table is filled in, map it. */
> + dma_addr_t addr = vring_map_single(
> + vq, desc, total_sg * sizeof(struct vring_packed_desc),
> + DMA_TO_DEVICE);
> + if (vring_mapping_error(vq, addr))
> + goto unmap_release;
> +
> + head_flags = cpu_to_virtio16(_vq->vdev, VRING_DESC_F_INDIRECT |
> + VRING_DESC_F_AVAIL(avail_wrap_counter) |
> + VRING_DESC_F_USED(!avail_wrap_counter));
> + vq->vring_packed.desc[head].addr = cpu_to_virtio64(_vq->vdev,
> + addr);
> + vq->vring_packed.desc[head].len = cpu_to_virtio32(_vq->vdev,
> + total_sg * sizeof(struct vring_packed_desc));
> + vq->vring_packed.desc[head].id = cpu_to_virtio16(_vq->vdev, id);
> +
> + vq->desc_state_packed[id].addr = addr;
> + vq->desc_state_packed[id].len = total_sg *
> + sizeof(struct vring_packed_desc);
> + vq->desc_state_packed[id].flags =
> + virtio16_to_cpu(_vq->vdev, head_flags);
> + }
> +
> + /* We're using some buffers from the free list. */
> + vq->vq.num_free -= descs_used;
> +
> + /* Update free pointer */
> + if (indirect) {
> + n = head + 1;
> + if (n >= vq->vring_packed.num) {
> + n = 0;
> + vq->avail_wrap_counter ^= 1;
> + }
> + vq->next_avail_idx = n;
> + vq->free_head = vq->desc_state_packed[id].next;
> + } else {
> + vq->next_avail_idx = i;
> + vq->free_head = cur;
> + }
> +
> + /* Store token and indirect buffer state. */
> + vq->desc_state_packed[id].num = descs_used;
> + vq->desc_state_packed[id].data = data;
> + if (indirect)
> + vq->desc_state_packed[id].indir_desc = desc;
> + else
> + vq->desc_state_packed[id].indir_desc = ctx;
> +
> + /* A driver MUST NOT make the first descriptor in the list
> + * available before all subsequent descriptors comprising
> + * the list are made available. */
> + virtio_wmb(vq->weak_barriers);
> + vq->vring_packed.desc[head].flags = head_flags;
> + vq->num_added += descs_used;
> +
> + pr_debug("Added buffer head %i to %p\n", head, vq);
> + END_USE(vq);
> +
> + return 0;
> +
> +unmap_release:
> + err_idx = i;
> + i = head;
> +
> + for (n = 0; n < total_sg; n++) {
> + if (i == err_idx)
> + break;
> + vring_unmap_desc_packed(vq, &desc[i]);
> + i++;
> + if (!indirect && i >= vq->vring_packed.num)
> + i = 0;
> + }
> +
> + vq->avail_wrap_counter = avail_wrap_counter;
> +
> + if (indirect)
> + kfree(desc);
> +
> + END_USE(vq);
> return -EIO;
> }
>
> static bool virtqueue_kick_prepare_packed(struct virtqueue *_vq)
> {
> - return false;
> + struct vring_virtqueue *vq = to_vvq(_vq);
> + u16 flags;
> + bool needs_kick;
> + u32 snapshot;
> +
> + START_USE(vq);
> + /* We need to expose the new flags value before checking notification
> + * suppressions. */
> + virtio_mb(vq->weak_barriers);
> +
> + snapshot = *(u32 *)vq->vring_packed.device;
> + flags = virtio16_to_cpu(_vq->vdev, (__virtio16)(snapshot >> 16)) & 0x3;
> +
> +#ifdef DEBUG
> + if (vq->last_add_time_valid) {
> + WARN_ON(ktime_to_ms(ktime_sub(ktime_get(),
> + vq->last_add_time)) > 100);
> + }
> + vq->last_add_time_valid = false;
> +#endif
> +
> + needs_kick = (flags != VRING_EVENT_F_DISABLE);
> + END_USE(vq);
> + return needs_kick;
> +}
> +
> +static void detach_buf_packed(struct vring_virtqueue *vq,
> + unsigned int id, void **ctx)
> +{
> + struct vring_desc_state_packed *state;
> + struct vring_packed_desc *desc;
> + unsigned int i;
> + int *next;
> +
> + /* Clear data ptr. */
> + vq->desc_state_packed[id].data = NULL;
> +
> + next = &id;
> + for (i = 0; i < vq->desc_state_packed[id].num; i++) {
> + state = &vq->desc_state_packed[*next];
> + vring_unmap_state_packed(vq, state);
> + next = &vq->desc_state_packed[*next].next;
Have a short discussion with Michael. It looks like the only thing we
care is DMA address and length here.
So a thought is to avoid using desc_state_packed() is
vring_use_dma_api() is false? Probably need another id allocator or just
use desc_state_packed for free id list.
> + }
> +
> + vq->vq.num_free += vq->desc_state_packed[id].num;
> + *next = vq->free_head;
Using pointer seems not elegant and unnecessary here. You can just track
next in integer I think?
> + vq->free_head = id;
> +
> + if (vq->indirect) {
> + u32 len;
> +
> + /* Free the indirect table, if any, now that it's unmapped. */
> + desc = vq->desc_state_packed[id].indir_desc;
> + if (!desc)
> + return;
> +
> + len = vq->desc_state_packed[id].len;
> + for (i = 0; i < len / sizeof(struct vring_packed_desc); i++)
> + vring_unmap_desc_packed(vq, &desc[i]);
> +
> + kfree(desc);
> + vq->desc_state_packed[id].indir_desc = NULL;
> + } else if (ctx) {
> + *ctx = vq->desc_state_packed[id].indir_desc;
> + }
> }
>
> static inline bool more_used_packed(const struct vring_virtqueue *vq)
> {
> - return false;
> + u16 last_used, flags;
> + u8 avail, used;
> +
> + last_used = vq->last_used_idx;
> + flags = virtio16_to_cpu(vq->vq.vdev,
> + vq->vring_packed.desc[last_used].flags);
> + avail = !!(flags & VRING_DESC_F_AVAIL(1));
> + used = !!(flags & VRING_DESC_F_USED(1));
> +
> + return avail == used && used == vq->used_wrap_counter;
Spec does not check avail == used? So there's probably a bug in either
of the two places.
In what condition that avail != used but used == vq_used_wrap_counter? A
buggy device or device set the two bits in two writes?
> }
>
> static void *virtqueue_get_buf_ctx_packed(struct virtqueue *_vq,
> unsigned int *len,
> void **ctx)
> {
> - return NULL;
> + struct vring_virtqueue *vq = to_vvq(_vq);
> + u16 last_used, id;
> + void *ret;
> +
> + START_USE(vq);
> +
> + if (unlikely(vq->broken)) {
> + END_USE(vq);
> + return NULL;
> + }
> +
> + if (!more_used_packed(vq)) {
> + pr_debug("No more buffers in queue\n");
> + END_USE(vq);
> + return NULL;
> + }
> +
> + /* Only get used elements after they have been exposed by host. */
> + virtio_rmb(vq->weak_barriers);
> +
> + last_used = vq->last_used_idx;
> + id = virtio16_to_cpu(_vq->vdev, vq->vring_packed.desc[last_used].id);
> + *len = virtio32_to_cpu(_vq->vdev, vq->vring_packed.desc[last_used].len);
> +
> + if (unlikely(id >= vq->vring_packed.num)) {
> + BAD_RING(vq, "id %u out of range\n", id);
> + return NULL;
> + }
> + if (unlikely(!vq->desc_state_packed[id].data)) {
> + BAD_RING(vq, "id %u is not a head!\n", id);
> + return NULL;
> + }
> +
> + vq->last_used_idx += vq->desc_state_packed[id].num;
> + if (vq->last_used_idx >= vq->vring_packed.num) {
> + vq->last_used_idx -= vq->vring_packed.num;
> + vq->used_wrap_counter ^= 1;
> + }
> +
> + /* detach_buf_packed clears data, so grab it now. */
> + ret = vq->desc_state_packed[id].data;
> + detach_buf_packed(vq, id, ctx);
> +
> +#ifdef DEBUG
> + vq->last_add_time_valid = false;
> +#endif
> +
> + END_USE(vq);
> + return ret;
> }
>
> static void virtqueue_disable_cb_packed(struct virtqueue *_vq)
> {
> + struct vring_virtqueue *vq = to_vvq(_vq);
> +
> + if (vq->event_flags_shadow != VRING_EVENT_F_DISABLE) {
> + vq->event_flags_shadow = VRING_EVENT_F_DISABLE;
> + vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
> + vq->event_flags_shadow);
> + }
> }
>
> static unsigned virtqueue_enable_cb_prepare_packed(struct virtqueue *_vq)
> {
> - return 0;
> + struct vring_virtqueue *vq = to_vvq(_vq);
> +
> + START_USE(vq);
> +
> + /* We optimistically turn back on interrupts, then check if there was
> + * more to do. */
> +
> + if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
> + virtio_wmb(vq->weak_barriers);
Missing comments for the barrier.
> + vq->event_flags_shadow = VRING_EVENT_F_ENABLE;
> + vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
> + vq->event_flags_shadow);
> + }
> +
> + END_USE(vq);
> + return vq->last_used_idx;
> }
>
> static bool virtqueue_poll_packed(struct virtqueue *_vq, unsigned last_used_idx)
> {
> - return false;
> + struct vring_virtqueue *vq = to_vvq(_vq);
> + u8 avail, used;
> + u16 flags;
> +
> + virtio_mb(vq->weak_barriers);
> + flags = virtio16_to_cpu(vq->vq.vdev,
> + vq->vring_packed.desc[last_used_idx].flags);
> + avail = !!(flags & VRING_DESC_F_AVAIL(1));
> + used = !!(flags & VRING_DESC_F_USED(1));
> +
> + return avail == used && used == vq->used_wrap_counter;
> }
>
> static bool virtqueue_enable_cb_delayed_packed(struct virtqueue *_vq)
> {
> - return false;
> + struct vring_virtqueue *vq = to_vvq(_vq);
> +
> + START_USE(vq);
> +
> + /* We optimistically turn back on interrupts, then check if there was
> + * more to do. */
> +
> + if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
> + virtio_wmb(vq->weak_barriers);
Need comments for the barrier.
> + vq->event_flags_shadow = VRING_EVENT_F_ENABLE;
> + vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
> + vq->event_flags_shadow);
> + }
> +
> + if (more_used_packed(vq)) {
> + END_USE(vq);
> + return false;
> + }
> +
> + END_USE(vq);
> + return true;
> }
>
> static void *virtqueue_detach_unused_buf_packed(struct virtqueue *_vq)
> {
> + struct vring_virtqueue *vq = to_vvq(_vq);
> + unsigned int i;
> + void *buf;
> +
> + START_USE(vq);
> +
> + for (i = 0; i < vq->vring_packed.num; i++) {
> + if (!vq->desc_state_packed[i].data)
> + continue;
> + /* detach_buf clears data, so grab it now. */
> + buf = vq->desc_state_packed[i].data;
> + detach_buf_packed(vq, i, NULL);
> + END_USE(vq);
> + return buf;
> + }
> + /* That should have freed everything. */
> + BUG_ON(vq->vq.num_free != vq->vring_packed.num);
> +
> + END_USE(vq);
> return NULL;
> }
>
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* [PATCH v5] virtio_blk: add DISCARD and WRIET ZEROES commands support
From: Changpeng Liu @ 2018-05-29 4:23 UTC (permalink / raw)
To: virtualization, changpeng.liu; +Cc: pbonzini, cavery, stefanha
Existing virtio-blk protocol doesn't have DISCARD/WRITE ZEROES commands
support, this will impact the performance when using SSD backend over
file systems.
Commit 88c85538 "virtio-blk: add discard and write zeroes features to
specification"(see https://github.com/oasis-tcs/virtio-spec) extended
existing virtio-blk protocol, adding extra DISCARD and WRITE ZEROES
commands support.
While here, using 16 bytes descriptor to describe one segment of DISCARD
or WRITE ZEROES commands, each command may contain one or more decriptors.
The following data structure shows the definition of one descriptor:
struct virtio_blk_discard_write_zeroes {
le64 sector;
le32 num_sectors;
le32 unmap;
};
Field 'sector' means the start sector for DISCARD and WRITE ZEROES,
filed 'num_sectors' means the number of sectors for DISCARD and WRITE
ZEROES, if both DISCARD and WRITE ZEROES are supported, field 'unmap'
maybe used for WRITE ZEROES command with DISCARD enabled.
We also extended the virtio-blk configuration space to let backend
device put DISCARD and WRITE ZEROES configuration parameters.
struct virtio_blk_config {
[...]
le32 max_discard_sectors;
le32 max_discard_seg;
le32 discard_sector_alignment;
le32 max_write_zeroes_sectors;
le32 max_write_zeroes_seg;
u8 write_zeroes_may_unmap;
}
New feature bit [VIRTIO_BLK_F_DISCARD (13)]: Device can support discard
command, maximum discard sectors size in field 'max_discard_sectors' and
maximum discard segment number in field 'max_discard_seg'.
New feature [VIRTIO_BLK_F_WRITE_ZEROES (14)]: Device can support write
zeroes command, maximum write zeroes sectors size in field
'max_write_zeroes_sectors' and maximum write zeroes segment number in
field 'max_write_zeroes_seg'.
The parameters in the configuration space of the device field
'max_discard_sectors' and field 'discard_sector_alignment' are expressed in
512-byte units if the VIRTIO_BLK_F_DISCARD feature bit is negotiated. The
field 'max_write_zeroes_sectors' is expressed in 512-byte units if the
VIRTIO_BLK_F_WRITE_ZEROES feature bit is negotiated.
Signed-off-by: Changpeng Liu <changpeng.liu@intel.com>
---
CHANGELOG:
v5: use new block layer API: blk_queue_flag_set.
v4: several optimizations based on MST's comments, remove bit field usage for
command descriptor.
v3: define the virtio-blk protocol to add discard and write zeroes support,
first version implementation based on proposed specification.
v2: add write zeroes command support.
v1: initial proposal implementation for discard command.
---
drivers/block/virtio_blk.c | 92 +++++++++++++++++++++++++++++++++++++++--
include/uapi/linux/virtio_blk.h | 43 +++++++++++++++++++
2 files changed, 132 insertions(+), 3 deletions(-)
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 4a07593c..a250fcc 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -172,10 +172,45 @@ static int virtblk_add_req(struct virtqueue *vq, struct virtblk_req *vbr,
return virtqueue_add_sgs(vq, sgs, num_out, num_in, vbr, GFP_ATOMIC);
}
+static inline int virtblk_setup_discard_write_zeroes(struct request *req,
+ bool unmap)
+{
+ unsigned short segments = blk_rq_nr_discard_segments(req);
+ unsigned short n = 0;
+ struct virtio_blk_discard_write_zeroes *range;
+ struct bio *bio;
+
+ range = kmalloc_array(segments, sizeof(*range), GFP_KERNEL);
+ if (!range)
+ return -ENOMEM;
+
+ __rq_for_each_bio(bio, req) {
+ u64 sector = bio->bi_iter.bi_sector;
+ u32 num_sectors = bio->bi_iter.bi_size >> 9;
+
+ range[n].unmap = cpu_to_le32(unmap);
+ range[n].num_sectors = cpu_to_le32(num_sectors);
+ range[n].sector = cpu_to_le64(sector);
+ n++;
+ }
+
+ req->special_vec.bv_page = virt_to_page(range);
+ req->special_vec.bv_offset = offset_in_page(range);
+ req->special_vec.bv_len = sizeof(*range) * segments;
+ req->rq_flags |= RQF_SPECIAL_PAYLOAD;
+
+ return 0;
+}
+
static inline void virtblk_request_done(struct request *req)
{
struct virtblk_req *vbr = blk_mq_rq_to_pdu(req);
+ if (req->rq_flags & RQF_SPECIAL_PAYLOAD) {
+ kfree(page_address(req->special_vec.bv_page) +
+ req->special_vec.bv_offset);
+ }
+
switch (req_op(req)) {
case REQ_OP_SCSI_IN:
case REQ_OP_SCSI_OUT:
@@ -225,6 +260,7 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
int qid = hctx->queue_num;
int err;
bool notify = false;
+ bool unmap = false;
u32 type;
BUG_ON(req->nr_phys_segments + 2 > vblk->sg_elems);
@@ -237,6 +273,13 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
case REQ_OP_FLUSH:
type = VIRTIO_BLK_T_FLUSH;
break;
+ case REQ_OP_DISCARD:
+ type = VIRTIO_BLK_T_DISCARD;
+ break;
+ case REQ_OP_WRITE_ZEROES:
+ type = VIRTIO_BLK_T_WRITE_ZEROES;
+ unmap = !(req->cmd_flags & REQ_NOUNMAP);
+ break;
case REQ_OP_SCSI_IN:
case REQ_OP_SCSI_OUT:
type = VIRTIO_BLK_T_SCSI_CMD;
@@ -256,9 +299,16 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
blk_mq_start_request(req);
+ if (type == VIRTIO_BLK_T_DISCARD || type == VIRTIO_BLK_T_WRITE_ZEROES) {
+ err = virtblk_setup_discard_write_zeroes(req, unmap);
+ if (err)
+ return BLK_STS_RESOURCE;
+ }
+
num = blk_rq_map_sg(hctx->queue, req, vbr->sg);
if (num) {
- if (rq_data_dir(req) == WRITE)
+ if (rq_data_dir(req) == WRITE || type == VIRTIO_BLK_T_DISCARD ||
+ type == VIRTIO_BLK_T_WRITE_ZEROES)
vbr->out_hdr.type |= cpu_to_virtio32(vblk->vdev, VIRTIO_BLK_T_OUT);
else
vbr->out_hdr.type |= cpu_to_virtio32(vblk->vdev, VIRTIO_BLK_T_IN);
@@ -777,6 +827,42 @@ static int virtblk_probe(struct virtio_device *vdev)
if (!err && opt_io_size)
blk_queue_io_opt(q, blk_size * opt_io_size);
+ if (virtio_has_feature(vdev, VIRTIO_BLK_F_DISCARD)) {
+ q->limits.discard_granularity = blk_size;
+
+ virtio_cread(vdev, struct virtio_blk_config,
+ discard_sector_alignment, &v);
+ if (v)
+ q->limits.discard_alignment = v << 9;
+ else
+ q->limits.discard_alignment = 0;
+
+ virtio_cread(vdev, struct virtio_blk_config,
+ max_discard_sectors, &v);
+ if (v)
+ blk_queue_max_discard_sectors(q, v);
+ else
+ blk_queue_max_discard_sectors(q, UINT_MAX);
+
+ virtio_cread(vdev, struct virtio_blk_config, max_discard_seg,
+ &v);
+ if (v)
+ blk_queue_max_discard_segments(q, v);
+ else
+ blk_queue_max_discard_segments(q, USHRT_MAX);
+
+ queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
+ }
+
+ if (virtio_has_feature(vdev, VIRTIO_BLK_F_WRITE_ZEROES)) {
+ virtio_cread(vdev, struct virtio_blk_config,
+ max_write_zeroes_sectors, &v);
+ if (v)
+ blk_queue_max_write_zeroes_sectors(q, v);
+ else
+ blk_queue_max_write_zeroes_sectors(q, UINT_MAX);
+ }
+
virtblk_update_capacity(vblk, false);
virtio_device_ready(vdev);
@@ -885,14 +971,14 @@ static int virtblk_restore(struct virtio_device *vdev)
VIRTIO_BLK_F_SCSI,
#endif
VIRTIO_BLK_F_FLUSH, VIRTIO_BLK_F_TOPOLOGY, VIRTIO_BLK_F_CONFIG_WCE,
- VIRTIO_BLK_F_MQ,
+ VIRTIO_BLK_F_MQ, VIRTIO_BLK_F_DISCARD, VIRTIO_BLK_F_WRITE_ZEROES,
}
;
static unsigned int features[] = {
VIRTIO_BLK_F_SEG_MAX, VIRTIO_BLK_F_SIZE_MAX, VIRTIO_BLK_F_GEOMETRY,
VIRTIO_BLK_F_RO, VIRTIO_BLK_F_BLK_SIZE,
VIRTIO_BLK_F_FLUSH, VIRTIO_BLK_F_TOPOLOGY, VIRTIO_BLK_F_CONFIG_WCE,
- VIRTIO_BLK_F_MQ,
+ VIRTIO_BLK_F_MQ, VIRTIO_BLK_F_DISCARD, VIRTIO_BLK_F_WRITE_ZEROES,
};
static struct virtio_driver virtio_blk = {
diff --git a/include/uapi/linux/virtio_blk.h b/include/uapi/linux/virtio_blk.h
index 9ebe4d9..8e7a015 100644
--- a/include/uapi/linux/virtio_blk.h
+++ b/include/uapi/linux/virtio_blk.h
@@ -38,6 +38,8 @@
#define VIRTIO_BLK_F_BLK_SIZE 6 /* Block size of disk is available*/
#define VIRTIO_BLK_F_TOPOLOGY 10 /* Topology information is available */
#define VIRTIO_BLK_F_MQ 12 /* support more than one vq */
+#define VIRTIO_BLK_F_DISCARD 13 /* DISCARD is supported */
+#define VIRTIO_BLK_F_WRITE_ZEROES 14 /* WRITE ZEROES is supported */
/* Legacy feature bits */
#ifndef VIRTIO_BLK_NO_LEGACY
@@ -86,6 +88,29 @@ struct virtio_blk_config {
/* number of vqs, only available when VIRTIO_BLK_F_MQ is set */
__u16 num_queues;
+ /* The maximum discard sectors (in 512-byte sectors) for
+ * one segment (if VIRTIO_BLK_F_DISCARD)
+ */
+ __u32 max_discard_sectors;
+ /* The maximum number of discard segments in a
+ * discard command (if VIRTIO_BLK_F_DISCARD)
+ */
+ __u32 max_discard_seg;
+ /* The alignment sectors for discard (if VIRTIO_BLK_F_DISCARD) */
+ __u32 discard_sector_alignment;
+ /* The maximum number of write zeroes sectors (in 512-byte sectors) in
+ * one segment (if VIRTIO_BLK_F_WRITE_ZEROES)
+ */
+ __u32 max_write_zeroes_sectors;
+ /* The maximum number of segments in a write zeroes
+ * command (if VIRTIO_BLK_F_WRITE_ZEROES)
+ */
+ __u32 max_write_zeroes_seg;
+ /* Device clear this bit when write zeroes command can't result in
+ * unmapping sectors (if VIRITO_BLK_F_WRITE_ZEROES and with unmap)
+ */
+ __u8 write_zeroes_may_unmap;
+ __u8 unused1[3];
} __attribute__((packed));
/*
@@ -114,6 +139,12 @@ struct virtio_blk_config {
/* Get device ID command */
#define VIRTIO_BLK_T_GET_ID 8
+/* Discard command */
+#define VIRTIO_BLK_T_DISCARD 11
+
+/* Write zeroes command */
+#define VIRTIO_BLK_T_WRITE_ZEROES 13
+
#ifndef VIRTIO_BLK_NO_LEGACY
/* Barrier before this op. */
#define VIRTIO_BLK_T_BARRIER 0x80000000
@@ -133,6 +164,18 @@ struct virtio_blk_outhdr {
__virtio64 sector;
};
+/*
+ * discard/write zeroes range for each request.
+ */
+struct virtio_blk_discard_write_zeroes {
+ /* discard/write zeroes start sector */
+ __virtio64 sector;
+ /* number of discard/write zeroes sectors */
+ __virtio32 num_sectors;
+ /* valid for write zeroes command */
+ __virtio32 unmap;
+};
+
#ifndef VIRTIO_BLK_NO_LEGACY
struct virtio_scsi_inhdr {
__virtio32 errors;
--
1.9.3
^ permalink raw reply related
* [PATCH v5] virtio_blk: add DISCARD and WRIET ZEROES commands support
From: Changpeng Liu @ 2018-05-29 4:29 UTC (permalink / raw)
To: virtualization, changpeng.liu; +Cc: pbonzini, cavery, stefanha
Existing virtio-blk protocol doesn't have DISCARD/WRITE ZEROES commands
support, this will impact the performance when using SSD backend over
file systems.
Commit 88c85538 "virtio-blk: add discard and write zeroes features to
specification"(see https://github.com/oasis-tcs/virtio-spec) extended
existing virtio-blk protocol, adding extra DISCARD and WRITE ZEROES
commands support.
While here, using 16 bytes descriptor to describe one segment of DISCARD
or WRITE ZEROES commands, each command may contain one or more decriptors.
The following data structure shows the definition of one descriptor:
struct virtio_blk_discard_write_zeroes {
le64 sector;
le32 num_sectors;
le32 unmap;
};
Field 'sector' means the start sector for DISCARD and WRITE ZEROES,
filed 'num_sectors' means the number of sectors for DISCARD and WRITE
ZEROES, if both DISCARD and WRITE ZEROES are supported, field 'unmap'
maybe used for WRITE ZEROES command with DISCARD enabled.
We also extended the virtio-blk configuration space to let backend
device put DISCARD and WRITE ZEROES configuration parameters.
struct virtio_blk_config {
[...]
le32 max_discard_sectors;
le32 max_discard_seg;
le32 discard_sector_alignment;
le32 max_write_zeroes_sectors;
le32 max_write_zeroes_seg;
u8 write_zeroes_may_unmap;
}
New feature bit [VIRTIO_BLK_F_DISCARD (13)]: Device can support discard
command, maximum discard sectors size in field 'max_discard_sectors' and
maximum discard segment number in field 'max_discard_seg'.
New feature [VIRTIO_BLK_F_WRITE_ZEROES (14)]: Device can support write
zeroes command, maximum write zeroes sectors size in field
'max_write_zeroes_sectors' and maximum write zeroes segment number in
field 'max_write_zeroes_seg'.
The parameters in the configuration space of the device field
'max_discard_sectors' and field 'discard_sector_alignment' are expressed in
512-byte units if the VIRTIO_BLK_F_DISCARD feature bit is negotiated. The
field 'max_write_zeroes_sectors' is expressed in 512-byte units if the
VIRTIO_BLK_F_WRITE_ZEROES feature bit is negotiated.
Signed-off-by: Changpeng Liu <changpeng.liu@intel.com>
---
CHANGELOG:
v5: use new block layer API: blk_queue_flag_set.
v4: several optimizations based on MST's comments, remove bit field usage for
command descriptor.
v3: define the virtio-blk protocol to add discard and write zeroes support,
first version implementation based on proposed specification.
v2: add write zeroes command support.
v1: initial proposal implementation for discard command.
---
drivers/block/virtio_blk.c | 92 +++++++++++++++++++++++++++++++++++++++--
include/uapi/linux/virtio_blk.h | 43 +++++++++++++++++++
2 files changed, 132 insertions(+), 3 deletions(-)
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 4a07593c..fabab2a 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -172,10 +172,45 @@ static int virtblk_add_req(struct virtqueue *vq, struct virtblk_req *vbr,
return virtqueue_add_sgs(vq, sgs, num_out, num_in, vbr, GFP_ATOMIC);
}
+static inline int virtblk_setup_discard_write_zeroes(struct request *req,
+ bool unmap)
+{
+ unsigned short segments = blk_rq_nr_discard_segments(req);
+ unsigned short n = 0;
+ struct virtio_blk_discard_write_zeroes *range;
+ struct bio *bio;
+
+ range = kmalloc_array(segments, sizeof(*range), GFP_KERNEL);
+ if (!range)
+ return -ENOMEM;
+
+ __rq_for_each_bio(bio, req) {
+ u64 sector = bio->bi_iter.bi_sector;
+ u32 num_sectors = bio->bi_iter.bi_size >> 9;
+
+ range[n].unmap = cpu_to_le32(unmap);
+ range[n].num_sectors = cpu_to_le32(num_sectors);
+ range[n].sector = cpu_to_le64(sector);
+ n++;
+ }
+
+ req->special_vec.bv_page = virt_to_page(range);
+ req->special_vec.bv_offset = offset_in_page(range);
+ req->special_vec.bv_len = sizeof(*range) * segments;
+ req->rq_flags |= RQF_SPECIAL_PAYLOAD;
+
+ return 0;
+}
+
static inline void virtblk_request_done(struct request *req)
{
struct virtblk_req *vbr = blk_mq_rq_to_pdu(req);
+ if (req->rq_flags & RQF_SPECIAL_PAYLOAD) {
+ kfree(page_address(req->special_vec.bv_page) +
+ req->special_vec.bv_offset);
+ }
+
switch (req_op(req)) {
case REQ_OP_SCSI_IN:
case REQ_OP_SCSI_OUT:
@@ -225,6 +260,7 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
int qid = hctx->queue_num;
int err;
bool notify = false;
+ bool unmap = false;
u32 type;
BUG_ON(req->nr_phys_segments + 2 > vblk->sg_elems);
@@ -237,6 +273,13 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
case REQ_OP_FLUSH:
type = VIRTIO_BLK_T_FLUSH;
break;
+ case REQ_OP_DISCARD:
+ type = VIRTIO_BLK_T_DISCARD;
+ break;
+ case REQ_OP_WRITE_ZEROES:
+ type = VIRTIO_BLK_T_WRITE_ZEROES;
+ unmap = !(req->cmd_flags & REQ_NOUNMAP);
+ break;
case REQ_OP_SCSI_IN:
case REQ_OP_SCSI_OUT:
type = VIRTIO_BLK_T_SCSI_CMD;
@@ -256,9 +299,16 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
blk_mq_start_request(req);
+ if (type == VIRTIO_BLK_T_DISCARD || type == VIRTIO_BLK_T_WRITE_ZEROES) {
+ err = virtblk_setup_discard_write_zeroes(req, unmap);
+ if (err)
+ return BLK_STS_RESOURCE;
+ }
+
num = blk_rq_map_sg(hctx->queue, req, vbr->sg);
if (num) {
- if (rq_data_dir(req) == WRITE)
+ if (rq_data_dir(req) == WRITE || type == VIRTIO_BLK_T_DISCARD ||
+ type == VIRTIO_BLK_T_WRITE_ZEROES)
vbr->out_hdr.type |= cpu_to_virtio32(vblk->vdev, VIRTIO_BLK_T_OUT);
else
vbr->out_hdr.type |= cpu_to_virtio32(vblk->vdev, VIRTIO_BLK_T_IN);
@@ -777,6 +827,42 @@ static int virtblk_probe(struct virtio_device *vdev)
if (!err && opt_io_size)
blk_queue_io_opt(q, blk_size * opt_io_size);
+ if (virtio_has_feature(vdev, VIRTIO_BLK_F_DISCARD)) {
+ q->limits.discard_granularity = blk_size;
+
+ virtio_cread(vdev, struct virtio_blk_config,
+ discard_sector_alignment, &v);
+ if (v)
+ q->limits.discard_alignment = v << 9;
+ else
+ q->limits.discard_alignment = 0;
+
+ virtio_cread(vdev, struct virtio_blk_config,
+ max_discard_sectors, &v);
+ if (v)
+ blk_queue_max_discard_sectors(q, v);
+ else
+ blk_queue_max_discard_sectors(q, UINT_MAX);
+
+ virtio_cread(vdev, struct virtio_blk_config, max_discard_seg,
+ &v);
+ if (v)
+ blk_queue_max_discard_segments(q, v);
+ else
+ blk_queue_max_discard_segments(q, USHRT_MAX);
+
+ blk_queue_flag_set(QUEUE_FLAG_DISCARD, q);
+ }
+
+ if (virtio_has_feature(vdev, VIRTIO_BLK_F_WRITE_ZEROES)) {
+ virtio_cread(vdev, struct virtio_blk_config,
+ max_write_zeroes_sectors, &v);
+ if (v)
+ blk_queue_max_write_zeroes_sectors(q, v);
+ else
+ blk_queue_max_write_zeroes_sectors(q, UINT_MAX);
+ }
+
virtblk_update_capacity(vblk, false);
virtio_device_ready(vdev);
@@ -885,14 +971,14 @@ static int virtblk_restore(struct virtio_device *vdev)
VIRTIO_BLK_F_SCSI,
#endif
VIRTIO_BLK_F_FLUSH, VIRTIO_BLK_F_TOPOLOGY, VIRTIO_BLK_F_CONFIG_WCE,
- VIRTIO_BLK_F_MQ,
+ VIRTIO_BLK_F_MQ, VIRTIO_BLK_F_DISCARD, VIRTIO_BLK_F_WRITE_ZEROES,
}
;
static unsigned int features[] = {
VIRTIO_BLK_F_SEG_MAX, VIRTIO_BLK_F_SIZE_MAX, VIRTIO_BLK_F_GEOMETRY,
VIRTIO_BLK_F_RO, VIRTIO_BLK_F_BLK_SIZE,
VIRTIO_BLK_F_FLUSH, VIRTIO_BLK_F_TOPOLOGY, VIRTIO_BLK_F_CONFIG_WCE,
- VIRTIO_BLK_F_MQ,
+ VIRTIO_BLK_F_MQ, VIRTIO_BLK_F_DISCARD, VIRTIO_BLK_F_WRITE_ZEROES,
};
static struct virtio_driver virtio_blk = {
diff --git a/include/uapi/linux/virtio_blk.h b/include/uapi/linux/virtio_blk.h
index 9ebe4d9..8e7a015 100644
--- a/include/uapi/linux/virtio_blk.h
+++ b/include/uapi/linux/virtio_blk.h
@@ -38,6 +38,8 @@
#define VIRTIO_BLK_F_BLK_SIZE 6 /* Block size of disk is available*/
#define VIRTIO_BLK_F_TOPOLOGY 10 /* Topology information is available */
#define VIRTIO_BLK_F_MQ 12 /* support more than one vq */
+#define VIRTIO_BLK_F_DISCARD 13 /* DISCARD is supported */
+#define VIRTIO_BLK_F_WRITE_ZEROES 14 /* WRITE ZEROES is supported */
/* Legacy feature bits */
#ifndef VIRTIO_BLK_NO_LEGACY
@@ -86,6 +88,29 @@ struct virtio_blk_config {
/* number of vqs, only available when VIRTIO_BLK_F_MQ is set */
__u16 num_queues;
+ /* The maximum discard sectors (in 512-byte sectors) for
+ * one segment (if VIRTIO_BLK_F_DISCARD)
+ */
+ __u32 max_discard_sectors;
+ /* The maximum number of discard segments in a
+ * discard command (if VIRTIO_BLK_F_DISCARD)
+ */
+ __u32 max_discard_seg;
+ /* The alignment sectors for discard (if VIRTIO_BLK_F_DISCARD) */
+ __u32 discard_sector_alignment;
+ /* The maximum number of write zeroes sectors (in 512-byte sectors) in
+ * one segment (if VIRTIO_BLK_F_WRITE_ZEROES)
+ */
+ __u32 max_write_zeroes_sectors;
+ /* The maximum number of segments in a write zeroes
+ * command (if VIRTIO_BLK_F_WRITE_ZEROES)
+ */
+ __u32 max_write_zeroes_seg;
+ /* Device clear this bit when write zeroes command can't result in
+ * unmapping sectors (if VIRITO_BLK_F_WRITE_ZEROES and with unmap)
+ */
+ __u8 write_zeroes_may_unmap;
+ __u8 unused1[3];
} __attribute__((packed));
/*
@@ -114,6 +139,12 @@ struct virtio_blk_config {
/* Get device ID command */
#define VIRTIO_BLK_T_GET_ID 8
+/* Discard command */
+#define VIRTIO_BLK_T_DISCARD 11
+
+/* Write zeroes command */
+#define VIRTIO_BLK_T_WRITE_ZEROES 13
+
#ifndef VIRTIO_BLK_NO_LEGACY
/* Barrier before this op. */
#define VIRTIO_BLK_T_BARRIER 0x80000000
@@ -133,6 +164,18 @@ struct virtio_blk_outhdr {
__virtio64 sector;
};
+/*
+ * discard/write zeroes range for each request.
+ */
+struct virtio_blk_discard_write_zeroes {
+ /* discard/write zeroes start sector */
+ __virtio64 sector;
+ /* number of discard/write zeroes sectors */
+ __virtio32 num_sectors;
+ /* valid for write zeroes command */
+ __virtio32 unmap;
+};
+
#ifndef VIRTIO_BLK_NO_LEGACY
struct virtio_scsi_inhdr {
__virtio32 errors;
--
1.9.3
^ permalink raw reply related
* Re: [RFC v5 3/5] virtio_ring: add packed ring support
From: Tiwei Bie @ 2018-05-29 5:11 UTC (permalink / raw)
To: Jason Wang; +Cc: mst, netdev, linux-kernel, virtualization, wexu
In-Reply-To: <30bd7505-0ded-8584-e9ab-152756a667d6@redhat.com>
On Tue, May 29, 2018 at 11:18:57AM +0800, Jason Wang wrote:
> On 2018年05月22日 16:16, Tiwei Bie wrote:
[...]
> > +static void detach_buf_packed(struct vring_virtqueue *vq,
> > + unsigned int id, void **ctx)
> > +{
> > + struct vring_desc_state_packed *state;
> > + struct vring_packed_desc *desc;
> > + unsigned int i;
> > + int *next;
> > +
> > + /* Clear data ptr. */
> > + vq->desc_state_packed[id].data = NULL;
> > +
> > + next = &id;
> > + for (i = 0; i < vq->desc_state_packed[id].num; i++) {
> > + state = &vq->desc_state_packed[*next];
> > + vring_unmap_state_packed(vq, state);
> > + next = &vq->desc_state_packed[*next].next;
>
> Have a short discussion with Michael. It looks like the only thing we care
> is DMA address and length here.
The length tracked by desc_state_packed[] is also necessary
when INDIRECT is used and the buffers are writable.
>
> So a thought is to avoid using desc_state_packed() is vring_use_dma_api() is
> false? Probably need another id allocator or just use desc_state_packed for
> free id list.
I think it's a good suggestion. Thanks!
I won't track the addr/len/flags for non-indirect descs
when vring_use_dma_api() returns false.
>
> > + }
> > +
> > + vq->vq.num_free += vq->desc_state_packed[id].num;
> > + *next = vq->free_head;
>
> Using pointer seems not elegant and unnecessary here. You can just track
> next in integer I think?
It's possible to use integer, but will need one more var
to track `prev` (desc_state_packed[prev].next == next in
this case).
>
> > + vq->free_head = id;
> > +
> > + if (vq->indirect) {
> > + u32 len;
> > +
> > + /* Free the indirect table, if any, now that it's unmapped. */
> > + desc = vq->desc_state_packed[id].indir_desc;
> > + if (!desc)
> > + return;
> > +
> > + len = vq->desc_state_packed[id].len;
> > + for (i = 0; i < len / sizeof(struct vring_packed_desc); i++)
> > + vring_unmap_desc_packed(vq, &desc[i]);
> > +
> > + kfree(desc);
> > + vq->desc_state_packed[id].indir_desc = NULL;
> > + } else if (ctx) {
> > + *ctx = vq->desc_state_packed[id].indir_desc;
> > + }
> > }
> > static inline bool more_used_packed(const struct vring_virtqueue *vq)
> > {
> > - return false;
> > + u16 last_used, flags;
> > + u8 avail, used;
> > +
> > + last_used = vq->last_used_idx;
> > + flags = virtio16_to_cpu(vq->vq.vdev,
> > + vq->vring_packed.desc[last_used].flags);
> > + avail = !!(flags & VRING_DESC_F_AVAIL(1));
> > + used = !!(flags & VRING_DESC_F_USED(1));
> > +
> > + return avail == used && used == vq->used_wrap_counter;
>
> Spec does not check avail == used? So there's probably a bug in either of
> the two places.
>
> In what condition that avail != used but used == vq_used_wrap_counter? A
> buggy device or device set the two bits in two writes?
Currently, spec doesn't forbid devices to set the status
bits as: avail != used but used == vq_used_wrap_counter.
So I think driver cannot assume this won't happen.
>
> > }
[...]
> > static unsigned virtqueue_enable_cb_prepare_packed(struct virtqueue *_vq)
> > {
> > - return 0;
> > + struct vring_virtqueue *vq = to_vvq(_vq);
> > +
> > + START_USE(vq);
> > +
> > + /* We optimistically turn back on interrupts, then check if there was
> > + * more to do. */
> > +
> > + if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
> > + virtio_wmb(vq->weak_barriers);
>
> Missing comments for the barrier.
Will add some comments.
>
> > + vq->event_flags_shadow = VRING_EVENT_F_ENABLE;
> > + vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
> > + vq->event_flags_shadow);
> > + }
> > +
> > + END_USE(vq);
> > + return vq->last_used_idx;
> > }
> > static bool virtqueue_poll_packed(struct virtqueue *_vq, unsigned last_used_idx)
> > {
> > - return false;
> > + struct vring_virtqueue *vq = to_vvq(_vq);
> > + u8 avail, used;
> > + u16 flags;
> > +
> > + virtio_mb(vq->weak_barriers);
> > + flags = virtio16_to_cpu(vq->vq.vdev,
> > + vq->vring_packed.desc[last_used_idx].flags);
> > + avail = !!(flags & VRING_DESC_F_AVAIL(1));
> > + used = !!(flags & VRING_DESC_F_USED(1));
> > +
> > + return avail == used && used == vq->used_wrap_counter;
> > }
> > static bool virtqueue_enable_cb_delayed_packed(struct virtqueue *_vq)
> > {
> > - return false;
> > + struct vring_virtqueue *vq = to_vvq(_vq);
> > +
> > + START_USE(vq);
> > +
> > + /* We optimistically turn back on interrupts, then check if there was
> > + * more to do. */
> > +
> > + if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
> > + virtio_wmb(vq->weak_barriers);
>
> Need comments for the barrier.
Will add some comments.
Best regards,
Tiwei Bie
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [RFC v5 2/5] virtio_ring: support creating packed ring
From: Tiwei Bie @ 2018-05-29 5:24 UTC (permalink / raw)
To: Jason Wang; +Cc: mst, netdev, Cornelia Huck, linux-kernel, virtualization, wexu
In-Reply-To: <654feee2-3414-e304-0aec-2bb19a7e0c87@redhat.com>
On Tue, May 29, 2018 at 10:49:11AM +0800, Jason Wang wrote:
> On 2018年05月22日 16:16, Tiwei Bie wrote:
> > This commit introduces the support for creating packed ring.
> > All split ring specific functions are added _split suffix.
> > Some necessary stubs for packed ring are also added.
> >
> > Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
> > ---
> > drivers/virtio/virtio_ring.c | 801 +++++++++++++++++++++++------------
> > include/linux/virtio_ring.h | 8 +-
> > 2 files changed, 546 insertions(+), 263 deletions(-)
> >
> > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > index 71458f493cf8..f5ef5f42a7cf 100644
> > --- a/drivers/virtio/virtio_ring.c
> > +++ b/drivers/virtio/virtio_ring.c
> > @@ -61,11 +61,15 @@ struct vring_desc_state {
> > struct vring_desc *indir_desc; /* Indirect descriptor, if any. */
> > };
> > +struct vring_desc_state_packed {
> > + int next; /* The next desc state. */
> > +};
> > +
> > struct vring_virtqueue {
> > struct virtqueue vq;
> > - /* Actual memory layout for this queue */
> > - struct vring vring;
> > + /* Is this a packed ring? */
> > + bool packed;
> > /* Can we use weak barriers? */
> > bool weak_barriers;
> > @@ -87,11 +91,39 @@ struct vring_virtqueue {
> > /* Last used index we've seen. */
> > u16 last_used_idx;
> > - /* Last written value to avail->flags */
> > - u16 avail_flags_shadow;
> > + union {
> > + /* Available for split ring */
> > + struct {
> > + /* Actual memory layout for this queue. */
> > + struct vring vring;
> > - /* Last written value to avail->idx in guest byte order */
> > - u16 avail_idx_shadow;
> > + /* Last written value to avail->flags */
> > + u16 avail_flags_shadow;
> > +
> > + /* Last written value to avail->idx in
> > + * guest byte order. */
> > + u16 avail_idx_shadow;
> > + };
> > +
> > + /* Available for packed ring */
> > + struct {
> > + /* Actual memory layout for this queue. */
> > + struct vring_packed vring_packed;
> > +
> > + /* Driver ring wrap counter. */
> > + u8 avail_wrap_counter;
> > +
> > + /* Device ring wrap counter. */
> > + u8 used_wrap_counter;
>
> How about just use boolean?
I can change it to bool if you want.
>
[...]
> > -static int vring_mapping_error(const struct vring_virtqueue *vq,
> > - dma_addr_t addr)
> > -{
> > - if (!vring_use_dma_api(vq->vq.vdev))
> > - return 0;
> > -
> > - return dma_mapping_error(vring_dma_dev(vq), addr);
> > -}
>
> It looks to me if you keep vring_mapping_error behind
> vring_unmap_one_split(), lots of changes were unncessary.
>
[...]
> > + }
> > + /* That should have freed everything. */
> > + BUG_ON(vq->vq.num_free != vq->vring.num);
> > +
> > + END_USE(vq);
> > + return NULL;
> > +}
>
> I think the those copy-and-paste hunks could be avoided and the diff should
> only contains renaming of the function. If yes, it would be very welcomed
> since it requires to compare the changes verbatim otherwise.
Michael suggested to lay out the code as:
XXX_split
XXX_packed
XXX wrappers
https://lkml.org/lkml/2018/4/13/410
That's why I moved some code.
>
> > +
> > +/*
> > + * The layout for the packed ring is a continuous chunk of memory
> > + * which looks like this.
> > + *
> > + * struct vring_packed {
> > + * // The actual descriptors (16 bytes each)
> > + * struct vring_packed_desc desc[num];
> > + *
> > + * // Padding to the next align boundary.
> > + * char pad[];
> > + *
> > + * // Driver Event Suppression
> > + * struct vring_packed_desc_event driver;
> > + *
> > + * // Device Event Suppression
> > + * struct vring_packed_desc_event device;
> > + * };
> > + */
> > +static inline void vring_init_packed(struct vring_packed *vr, unsigned int num,
> > + void *p, unsigned long align)
> > +{
> > + vr->num = num;
> > + vr->desc = p;
> > + vr->driver = (void *)(((uintptr_t)p + sizeof(struct vring_packed_desc)
> > + * num + align - 1) & ~(align - 1));
>
> If we choose not to go uapi, maybe we can use ALIGN() macro here?
Okay.
>
> > + vr->device = vr->driver + 1;
> > +}
[...]
> > +/* Only available for split ring */
> > const struct vring *virtqueue_get_vring(struct virtqueue *vq)
> > {
>
> A possible issue with this is:
>
> After commit d4674240f31f8c4289abba07d64291c6ddce51bc ("KVM: s390:
> virtio-ccw revision 1 SET_VQ"). CCW tries to use
> virtqueue_get_avail()/virtqueue_get_used(). Looks like a bug either here or
> ccw code.
Do we still need to support:
include/linux/virtio.h
/*
* Legacy accessors -- in almost all cases, these are the wrong functions
* to use.
*/
static inline void *virtqueue_get_desc(struct virtqueue *vq)
{
return virtqueue_get_vring(vq)->desc;
}
static inline void *virtqueue_get_avail(struct virtqueue *vq)
{
return virtqueue_get_vring(vq)->avail;
}
static inline void *virtqueue_get_used(struct virtqueue *vq)
{
return virtqueue_get_vring(vq)->used;
}
in packed ring?
If so, I think maybe it's better to expose them as
symbols and implement them in virtio_ring.c.
Best regards,
Tiwei Bie
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [RFC v5 2/5] virtio_ring: support creating packed ring
From: Jason Wang @ 2018-05-29 6:13 UTC (permalink / raw)
To: Tiwei Bie; +Cc: mst, netdev, Cornelia Huck, linux-kernel, virtualization, wexu
In-Reply-To: <20180529052418.GB2976@debian>
On 2018年05月29日 13:24, Tiwei Bie wrote:
> On Tue, May 29, 2018 at 10:49:11AM +0800, Jason Wang wrote:
>> On 2018年05月22日 16:16, Tiwei Bie wrote:
>>> This commit introduces the support for creating packed ring.
>>> All split ring specific functions are added _split suffix.
>>> Some necessary stubs for packed ring are also added.
>>>
>>> Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
>>> ---
>>> drivers/virtio/virtio_ring.c | 801 +++++++++++++++++++++++------------
>>> include/linux/virtio_ring.h | 8 +-
>>> 2 files changed, 546 insertions(+), 263 deletions(-)
>>>
>>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
>>> index 71458f493cf8..f5ef5f42a7cf 100644
>>> --- a/drivers/virtio/virtio_ring.c
>>> +++ b/drivers/virtio/virtio_ring.c
>>> @@ -61,11 +61,15 @@ struct vring_desc_state {
>>> struct vring_desc *indir_desc; /* Indirect descriptor, if any. */
>>> };
>>> +struct vring_desc_state_packed {
>>> + int next; /* The next desc state. */
>>> +};
>>> +
>>> struct vring_virtqueue {
>>> struct virtqueue vq;
>>> - /* Actual memory layout for this queue */
>>> - struct vring vring;
>>> + /* Is this a packed ring? */
>>> + bool packed;
>>> /* Can we use weak barriers? */
>>> bool weak_barriers;
>>> @@ -87,11 +91,39 @@ struct vring_virtqueue {
>>> /* Last used index we've seen. */
>>> u16 last_used_idx;
>>> - /* Last written value to avail->flags */
>>> - u16 avail_flags_shadow;
>>> + union {
>>> + /* Available for split ring */
>>> + struct {
>>> + /* Actual memory layout for this queue. */
>>> + struct vring vring;
>>> - /* Last written value to avail->idx in guest byte order */
>>> - u16 avail_idx_shadow;
>>> + /* Last written value to avail->flags */
>>> + u16 avail_flags_shadow;
>>> +
>>> + /* Last written value to avail->idx in
>>> + * guest byte order. */
>>> + u16 avail_idx_shadow;
>>> + };
>>> +
>>> + /* Available for packed ring */
>>> + struct {
>>> + /* Actual memory layout for this queue. */
>>> + struct vring_packed vring_packed;
>>> +
>>> + /* Driver ring wrap counter. */
>>> + u8 avail_wrap_counter;
>>> +
>>> + /* Device ring wrap counter. */
>>> + u8 used_wrap_counter;
>> How about just use boolean?
> I can change it to bool if you want.
Yes, please.
>
> [...]
>>> -static int vring_mapping_error(const struct vring_virtqueue *vq,
>>> - dma_addr_t addr)
>>> -{
>>> - if (!vring_use_dma_api(vq->vq.vdev))
>>> - return 0;
>>> -
>>> - return dma_mapping_error(vring_dma_dev(vq), addr);
>>> -}
>> It looks to me if you keep vring_mapping_error behind
>> vring_unmap_one_split(), lots of changes were unncessary.
>>
> [...]
>>> + }
>>> + /* That should have freed everything. */
>>> + BUG_ON(vq->vq.num_free != vq->vring.num);
>>> +
>>> + END_USE(vq);
>>> + return NULL;
>>> +}
>> I think the those copy-and-paste hunks could be avoided and the diff should
>> only contains renaming of the function. If yes, it would be very welcomed
>> since it requires to compare the changes verbatim otherwise.
> Michael suggested to lay out the code as:
>
> XXX_split
>
> XXX_packed
>
> XXX wrappers
>
> https://lkml.org/lkml/2018/4/13/410
>
> That's why I moved some code.
I see, then no need to change but it still looks unnecessary.
>
>>> +
>>> +/*
>>> + * The layout for the packed ring is a continuous chunk of memory
>>> + * which looks like this.
>>> + *
>>> + * struct vring_packed {
>>> + * // The actual descriptors (16 bytes each)
>>> + * struct vring_packed_desc desc[num];
>>> + *
>>> + * // Padding to the next align boundary.
>>> + * char pad[];
>>> + *
>>> + * // Driver Event Suppression
>>> + * struct vring_packed_desc_event driver;
>>> + *
>>> + * // Device Event Suppression
>>> + * struct vring_packed_desc_event device;
>>> + * };
>>> + */
>>> +static inline void vring_init_packed(struct vring_packed *vr, unsigned int num,
>>> + void *p, unsigned long align)
>>> +{
>>> + vr->num = num;
>>> + vr->desc = p;
>>> + vr->driver = (void *)(((uintptr_t)p + sizeof(struct vring_packed_desc)
>>> + * num + align - 1) & ~(align - 1));
>> If we choose not to go uapi, maybe we can use ALIGN() macro here?
> Okay.
>
>>> + vr->device = vr->driver + 1;
>>> +}
> [...]
>>> +/* Only available for split ring */
>>> const struct vring *virtqueue_get_vring(struct virtqueue *vq)
>>> {
>> A possible issue with this is:
>>
>> After commit d4674240f31f8c4289abba07d64291c6ddce51bc ("KVM: s390:
>> virtio-ccw revision 1 SET_VQ"). CCW tries to use
>> virtqueue_get_avail()/virtqueue_get_used(). Looks like a bug either here or
>> ccw code.
> Do we still need to support:
>
> include/linux/virtio.h
> /*
> * Legacy accessors -- in almost all cases, these are the wrong functions
> * to use.
> */
> static inline void *virtqueue_get_desc(struct virtqueue *vq)
> {
> return virtqueue_get_vring(vq)->desc;
> }
> static inline void *virtqueue_get_avail(struct virtqueue *vq)
> {
> return virtqueue_get_vring(vq)->avail;
> }
> static inline void *virtqueue_get_used(struct virtqueue *vq)
> {
> return virtqueue_get_vring(vq)->used;
> }
>
> in packed ring?
I think it was probably a bug in ccw, they should use e.g
virtqueue_get_desc_addr() instead.
Thanks
>
> If so, I think maybe it's better to expose them as
> symbols and implement them in virtio_ring.c.
>
> Best regards,
> Tiwei Bie
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [RFC v5 3/5] virtio_ring: add packed ring support
From: Jason Wang @ 2018-05-29 6:16 UTC (permalink / raw)
To: Tiwei Bie; +Cc: mst, netdev, linux-kernel, virtualization, wexu
In-Reply-To: <20180529051125.GA2976@debian>
On 2018年05月29日 13:11, Tiwei Bie wrote:
> On Tue, May 29, 2018 at 11:18:57AM +0800, Jason Wang wrote:
>> On 2018年05月22日 16:16, Tiwei Bie wrote:
> [...]
>>> +static void detach_buf_packed(struct vring_virtqueue *vq,
>>> + unsigned int id, void **ctx)
>>> +{
>>> + struct vring_desc_state_packed *state;
>>> + struct vring_packed_desc *desc;
>>> + unsigned int i;
>>> + int *next;
>>> +
>>> + /* Clear data ptr. */
>>> + vq->desc_state_packed[id].data = NULL;
>>> +
>>> + next = &id;
>>> + for (i = 0; i < vq->desc_state_packed[id].num; i++) {
>>> + state = &vq->desc_state_packed[*next];
>>> + vring_unmap_state_packed(vq, state);
>>> + next = &vq->desc_state_packed[*next].next;
>> Have a short discussion with Michael. It looks like the only thing we care
>> is DMA address and length here.
> The length tracked by desc_state_packed[] is also necessary
> when INDIRECT is used and the buffers are writable.
>
>> So a thought is to avoid using desc_state_packed() is vring_use_dma_api() is
>> false? Probably need another id allocator or just use desc_state_packed for
>> free id list.
> I think it's a good suggestion. Thanks!
>
> I won't track the addr/len/flags for non-indirect descs
> when vring_use_dma_api() returns false.
>
>>> + }
>>> +
>>> + vq->vq.num_free += vq->desc_state_packed[id].num;
>>> + *next = vq->free_head;
>> Using pointer seems not elegant and unnecessary here. You can just track
>> next in integer I think?
> It's possible to use integer, but will need one more var
> to track `prev` (desc_state_packed[prev].next == next in
> this case).
Yes, please do this.
>
>>> + vq->free_head = id;
>>> +
>>> + if (vq->indirect) {
>>> + u32 len;
>>> +
>>> + /* Free the indirect table, if any, now that it's unmapped. */
>>> + desc = vq->desc_state_packed[id].indir_desc;
>>> + if (!desc)
>>> + return;
>>> +
>>> + len = vq->desc_state_packed[id].len;
>>> + for (i = 0; i < len / sizeof(struct vring_packed_desc); i++)
>>> + vring_unmap_desc_packed(vq, &desc[i]);
>>> +
>>> + kfree(desc);
>>> + vq->desc_state_packed[id].indir_desc = NULL;
>>> + } else if (ctx) {
>>> + *ctx = vq->desc_state_packed[id].indir_desc;
>>> + }
>>> }
>>> static inline bool more_used_packed(const struct vring_virtqueue *vq)
>>> {
>>> - return false;
>>> + u16 last_used, flags;
>>> + u8 avail, used;
>>> +
>>> + last_used = vq->last_used_idx;
>>> + flags = virtio16_to_cpu(vq->vq.vdev,
>>> + vq->vring_packed.desc[last_used].flags);
>>> + avail = !!(flags & VRING_DESC_F_AVAIL(1));
>>> + used = !!(flags & VRING_DESC_F_USED(1));
>>> +
>>> + return avail == used && used == vq->used_wrap_counter;
>> Spec does not check avail == used? So there's probably a bug in either of
>> the two places.
>>
>> In what condition that avail != used but used == vq_used_wrap_counter? A
>> buggy device or device set the two bits in two writes?
> Currently, spec doesn't forbid devices to set the status
> bits as: avail != used but used == vq_used_wrap_counter.
> So I think driver cannot assume this won't happen.
Right, but I could not find a situation that device need to something
like this. We should either forbid this in the spec or change the
example code in the spec.
Let's see how Michael think about this.
Thanks
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* [PATCH net] vhost_net: flush batched heads before trying to busy polling
From: Jason Wang @ 2018-05-29 6:18 UTC (permalink / raw)
To: mst, jasowang; +Cc: netdev, linux-kernel, kvm, virtualization
After commit e2b3b35eb989 ("vhost_net: batch used ring update in rx"),
we tend to batch updating used heads. But it doesn't flush batched
heads before trying to do busy polling, this will cause vhost to wait
for guest TX which waits for the used RX. Fixing by flush batched
heads before busy loop.
1 byte TCP_RR performance recovers from 13107.83 to 50402.65.
Fixes: e2b3b35eb989 ("vhost_net: batch used ring update in rx")
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
drivers/vhost/net.c | 37 ++++++++++++++++++++++++-------------
1 file changed, 24 insertions(+), 13 deletions(-)
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 986058a..eeaf673 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -105,7 +105,9 @@ struct vhost_net_virtqueue {
/* vhost zerocopy support fields below: */
/* last used idx for outstanding DMA zerocopy buffers */
int upend_idx;
- /* first used idx for DMA done zerocopy buffers */
+ /* For TX, first used idx for DMA done zerocopy buffers
+ * For RX, number of batched heads
+ */
int done_idx;
/* an array of userspace buffers info */
struct ubuf_info *ubuf_info;
@@ -626,6 +628,18 @@ static int sk_has_rx_data(struct sock *sk)
return skb_queue_empty(&sk->sk_receive_queue);
}
+static void vhost_rx_signal_used(struct vhost_net_virtqueue *nvq)
+{
+ struct vhost_virtqueue *vq = &nvq->vq;
+ struct vhost_dev *dev = vq->dev;
+
+ if (!nvq->done_idx)
+ return;
+
+ vhost_add_used_and_signal_n(dev, vq, vq->heads, nvq->done_idx);
+ nvq->done_idx = 0;
+}
+
static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk)
{
struct vhost_net_virtqueue *rvq = &net->vqs[VHOST_NET_VQ_RX];
@@ -635,6 +649,8 @@ static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk)
int len = peek_head_len(rvq, sk);
if (!len && vq->busyloop_timeout) {
+ /* Flush batched heads first */
+ vhost_rx_signal_used(rvq);
/* Both tx vq and rx socket were polled here */
mutex_lock_nested(&vq->mutex, 1);
vhost_disable_notify(&net->dev, vq);
@@ -762,7 +778,7 @@ static void handle_rx(struct vhost_net *net)
};
size_t total_len = 0;
int err, mergeable;
- s16 headcount, nheads = 0;
+ s16 headcount;
size_t vhost_hlen, sock_hlen;
size_t vhost_len, sock_len;
struct socket *sock;
@@ -790,8 +806,8 @@ static void handle_rx(struct vhost_net *net)
while ((sock_len = vhost_net_rx_peek_head_len(net, sock->sk))) {
sock_len += sock_hlen;
vhost_len = sock_len + vhost_hlen;
- headcount = get_rx_bufs(vq, vq->heads + nheads, vhost_len,
- &in, vq_log, &log,
+ headcount = get_rx_bufs(vq, vq->heads + nvq->done_idx,
+ vhost_len, &in, vq_log, &log,
likely(mergeable) ? UIO_MAXIOV : 1);
/* On error, stop handling until the next kick. */
if (unlikely(headcount < 0))
@@ -862,12 +878,9 @@ static void handle_rx(struct vhost_net *net)
vhost_discard_vq_desc(vq, headcount);
goto out;
}
- nheads += headcount;
- if (nheads > VHOST_RX_BATCH) {
- vhost_add_used_and_signal_n(&net->dev, vq, vq->heads,
- nheads);
- nheads = 0;
- }
+ nvq->done_idx += headcount;
+ if (nvq->done_idx > VHOST_RX_BATCH)
+ vhost_rx_signal_used(nvq);
if (unlikely(vq_log))
vhost_log_write(vq, vq_log, log, vhost_len);
total_len += vhost_len;
@@ -878,9 +891,7 @@ static void handle_rx(struct vhost_net *net)
}
vhost_net_enable_vq(net, vq);
out:
- if (nheads)
- vhost_add_used_and_signal_n(&net->dev, vq, vq->heads,
- nheads);
+ vhost_rx_signal_used(nvq);
mutex_unlock(&vq->mutex);
}
--
2.7.4
^ permalink raw reply related
* Re: [PATCH v3 09/27] x86/acpi: Adapt assembly for PIE support
From: Pavel Machek @ 2018-05-29 12:31 UTC (permalink / raw)
To: Thomas Garnier
Cc: Kate Stewart, Nicolas Pitre, the arch/x86 maintainers,
Sergey Senozhatsky, Petr Mladek, Len Brown, Peter Zijlstra,
Yonghong Song, Christopher Li, Dave Hansen, Dominik Brodowski,
LKML, Masahiro Yamada, Jan Beulich, H . Peter Anvin,
Kernel Hardening, Christoph Lameter, Alok Kataria,
Linux Doc Mailing List, linux-arch, Jonathan Corbet, Herbert Xu
In-Reply-To: <CAJcbSZH17D01Stk4vRKKzjW6dxvK8x+S9sWL6vUopSP9=-x7Nw@mail.gmail.com>
[-- Attachment #1.1: Type: text/plain, Size: 1182 bytes --]
On Fri 2018-05-25 10:00:04, Thomas Garnier wrote:
> On Fri, May 25, 2018 at 2:14 AM Pavel Machek <pavel@ucw.cz> wrote:
>
> > On Thu 2018-05-24 09:35:42, Thomas Garnier wrote:
> > > On Thu, May 24, 2018 at 4:03 AM Pavel Machek <pavel@ucw.cz> wrote:
> > >
> > > > On Wed 2018-05-23 12:54:03, Thomas Garnier wrote:
> > > > > Change the assembly code to use only relative references of symbols
> for
> > > the
> > > > > kernel to be PIE compatible.
> > > > >
> > > > > Position Independent Executable (PIE) support will allow to
> extended the
> > > > > KASLR randomization range below the -2G memory limit.
> > >
> > > > What testing did this get?
> > >
> > > Tested boot, hibernation and performance on qemu and dedicated machine.
>
> > Well, this is suspend, not hibernation code.
>
> > So "sudo pm-suspend" or "echo mem > /sys/power/state" would be good
> > way to test this.
>
> Thanks, it worked. I added this to the testsuite I use for KASLR.
Thanks!
You can add my Acked-by:.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
[-- Attachment #1.2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]
[-- Attachment #2: Type: text/plain, Size: 183 bytes --]
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [RFC V2] virtio: Add platform specific DMA API translation for virito devices
From: Christoph Hellwig @ 2018-05-29 14:03 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: robh, Michael S. Tsirkin, mpe, linux-kernel, virtualization, hch,
luto, joe, david, linuxppc-dev, elfring, Anshuman Khandual
In-Reply-To: <2f1d48cf029c1f0903f3cffea946ae5b85f60ec0.camel@kernel.crashing.org>
On Tue, May 29, 2018 at 09:56:24AM +1000, Benjamin Herrenschmidt wrote:
> I don't think forcing the addition of an emulated iommu in the middle
> just to work around the fact that virtio "cheats" and doesn't use the
> dma API unless there is one, is the right "fix".
Agreed.
> The right long term fix is to always use the DMA API, reducing code
> path etc... and just have a single point where virtio can "chose"
> alternate DMA ops (via an arch hook to deal with our case).
Also agreed.
When Andi added vring_use_dma_api it was marked as temporary.
So I'd much rather move to blacklisting platforms that needs this
hack now than adding another exception.
And then once we have the blacklist move it to a quirk in the arch
code that just forces dma_direct_ops as the per-device dma ops.
I don't really think this is crazy long term, but something we could
do relatively quickly. Interestingly enough the original commit
mentions PPC64 as a case where this quirk is needed.
^ permalink raw reply
* Re: [PATCH net] vhost_net: flush batched heads before trying to busy polling
From: Michael S. Tsirkin @ 2018-05-29 15:27 UTC (permalink / raw)
To: Jason Wang; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <1527574699-13047-1-git-send-email-jasowang@redhat.com>
On Tue, May 29, 2018 at 02:18:19PM +0800, Jason Wang wrote:
> After commit e2b3b35eb989 ("vhost_net: batch used ring update in rx"),
> we tend to batch updating used heads. But it doesn't flush batched
> heads before trying to do busy polling, this will cause vhost to wait
> for guest TX which waits for the used RX. Fixing by flush batched
> heads before busy loop.
>
> 1 byte TCP_RR performance recovers from 13107.83 to 50402.65.
>
> Fixes: e2b3b35eb989 ("vhost_net: batch used ring update in rx")
> Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
> ---
> drivers/vhost/net.c | 37 ++++++++++++++++++++++++-------------
> 1 file changed, 24 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 986058a..eeaf673 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -105,7 +105,9 @@ struct vhost_net_virtqueue {
> /* vhost zerocopy support fields below: */
> /* last used idx for outstanding DMA zerocopy buffers */
> int upend_idx;
> - /* first used idx for DMA done zerocopy buffers */
> + /* For TX, first used idx for DMA done zerocopy buffers
> + * For RX, number of batched heads
> + */
> int done_idx;
> /* an array of userspace buffers info */
> struct ubuf_info *ubuf_info;
> @@ -626,6 +628,18 @@ static int sk_has_rx_data(struct sock *sk)
> return skb_queue_empty(&sk->sk_receive_queue);
> }
>
> +static void vhost_rx_signal_used(struct vhost_net_virtqueue *nvq)
> +{
> + struct vhost_virtqueue *vq = &nvq->vq;
> + struct vhost_dev *dev = vq->dev;
> +
> + if (!nvq->done_idx)
> + return;
> +
> + vhost_add_used_and_signal_n(dev, vq, vq->heads, nvq->done_idx);
> + nvq->done_idx = 0;
> +}
> +
> static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk)
> {
> struct vhost_net_virtqueue *rvq = &net->vqs[VHOST_NET_VQ_RX];
> @@ -635,6 +649,8 @@ static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk)
> int len = peek_head_len(rvq, sk);
>
> if (!len && vq->busyloop_timeout) {
> + /* Flush batched heads first */
> + vhost_rx_signal_used(rvq);
> /* Both tx vq and rx socket were polled here */
> mutex_lock_nested(&vq->mutex, 1);
> vhost_disable_notify(&net->dev, vq);
> @@ -762,7 +778,7 @@ static void handle_rx(struct vhost_net *net)
> };
> size_t total_len = 0;
> int err, mergeable;
> - s16 headcount, nheads = 0;
> + s16 headcount;
> size_t vhost_hlen, sock_hlen;
> size_t vhost_len, sock_len;
> struct socket *sock;
> @@ -790,8 +806,8 @@ static void handle_rx(struct vhost_net *net)
> while ((sock_len = vhost_net_rx_peek_head_len(net, sock->sk))) {
> sock_len += sock_hlen;
> vhost_len = sock_len + vhost_hlen;
> - headcount = get_rx_bufs(vq, vq->heads + nheads, vhost_len,
> - &in, vq_log, &log,
> + headcount = get_rx_bufs(vq, vq->heads + nvq->done_idx,
> + vhost_len, &in, vq_log, &log,
> likely(mergeable) ? UIO_MAXIOV : 1);
> /* On error, stop handling until the next kick. */
> if (unlikely(headcount < 0))
> @@ -862,12 +878,9 @@ static void handle_rx(struct vhost_net *net)
> vhost_discard_vq_desc(vq, headcount);
> goto out;
> }
> - nheads += headcount;
> - if (nheads > VHOST_RX_BATCH) {
> - vhost_add_used_and_signal_n(&net->dev, vq, vq->heads,
> - nheads);
> - nheads = 0;
> - }
> + nvq->done_idx += headcount;
> + if (nvq->done_idx > VHOST_RX_BATCH)
> + vhost_rx_signal_used(nvq);
> if (unlikely(vq_log))
> vhost_log_write(vq, vq_log, log, vhost_len);
> total_len += vhost_len;
> @@ -878,9 +891,7 @@ static void handle_rx(struct vhost_net *net)
> }
> vhost_net_enable_vq(net, vq);
> out:
> - if (nheads)
> - vhost_add_used_and_signal_n(&net->dev, vq, vq->heads,
> - nheads);
> + vhost_rx_signal_used(nvq);
> mutex_unlock(&vq->mutex);
> }
>
> --
> 2.7.4
^ permalink raw reply
* Re: [PATCH v3 09/27] x86/acpi: Adapt assembly for PIE support
From: Thomas Garnier via Virtualization @ 2018-05-29 15:55 UTC (permalink / raw)
To: Pavel Machek
Cc: Kate Stewart, Nicolas Pitre, the arch/x86 maintainers,
Sergey Senozhatsky, Petr Mladek, Len Brown, Peter Zijlstra,
Yonghong Song, Christopher Li, Dave Hansen, Dominik Brodowski,
LKML, Masahiro Yamada, Jan Beulich, H . Peter Anvin,
Kernel Hardening, Christoph Lameter, Alok Kataria,
Linux Doc Mailing List, linux-arch, Jonathan Corbet, Herbert Xu
In-Reply-To: <20180529123114.GB21399@amd>
On Tue, May 29, 2018 at 5:31 AM Pavel Machek <pavel@ucw.cz> wrote:
> On Fri 2018-05-25 10:00:04, Thomas Garnier wrote:
> > On Fri, May 25, 2018 at 2:14 AM Pavel Machek <pavel@ucw.cz> wrote:
> >
> > > On Thu 2018-05-24 09:35:42, Thomas Garnier wrote:
> > > > On Thu, May 24, 2018 at 4:03 AM Pavel Machek <pavel@ucw.cz> wrote:
> > > >
> > > > > On Wed 2018-05-23 12:54:03, Thomas Garnier wrote:
> > > > > > Change the assembly code to use only relative references of
symbols
> > for
> > > > the
> > > > > > kernel to be PIE compatible.
> > > > > >
> > > > > > Position Independent Executable (PIE) support will allow to
> > extended the
> > > > > > KASLR randomization range below the -2G memory limit.
> > > >
> > > > > What testing did this get?
> > > >
> > > > Tested boot, hibernation and performance on qemu and dedicated
machine.
> >
> > > Well, this is suspend, not hibernation code.
> >
> > > So "sudo pm-suspend" or "echo mem > /sys/power/state" would be good
> > > way to test this.
> >
> > Thanks, it worked. I added this to the testsuite I use for KASLR.
> Thanks!
> You can add my Acked-by:.
Will do. Thanks for the review.
> Pavel
> --
> (english) http://www.livejournal.com/~pavelmachek
> (cesky, pictures)
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--
Thomas
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox