* Re: [PATCH net V2] vhost: correctly remove wait queue during poll failure
From: Michael S. Tsirkin @ 2018-03-29 4:20 UTC (permalink / raw)
To: Jason Wang; +Cc: netdev, Darren Kenny, linux-kernel, kvm, virtualization
In-Reply-To: <1522155052-13347-1-git-send-email-jasowang@redhat.com>
On Tue, Mar 27, 2018 at 08:50:52PM +0800, Jason Wang wrote:
> We tried to remove vq poll from wait queue, but do not check whether
> or not it was in a list before. This will lead double free. Fixing
> this by switching to use vhost_poll_stop() which zeros poll->wqh after
> removing poll from waitqueue to make sure it won't be freed twice.
>
> Cc: Darren Kenny <darren.kenny@oracle.com>
> Reported-by: syzbot+c0272972b01b872e604a@syzkaller.appspotmail.com
> Fixes: 2b8b328b61c79 ("vhost_net: handle polling errors when setting backend")
> Signed-off-by: Jason Wang <jasowang@redhat.com>
OK with this the only bug we have is where get user pages returns 0
(Reported-by: syzbot+6304bf97ef436580fede@syzkaller.appspotmail.com)
> ---
> Changes from V1:
> - tweak the commit log for to match the code
> ---
> drivers/vhost/vhost.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 1b3e8d2d..5d5a9d9 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -212,8 +212,7 @@ int vhost_poll_start(struct vhost_poll *poll, struct file *file)
> if (mask)
> vhost_poll_wakeup(&poll->wait, 0, 0, poll_to_key(mask));
> if (mask & EPOLLERR) {
> - if (poll->wqh)
> - remove_wait_queue(poll->wqh, &poll->wait);
> + vhost_poll_stop(poll);
> ret = -EINVAL;
> }
>
> --
> 2.7.4
^ permalink raw reply
* [PATCH net] vhost: validate log when IOTLB is enabled
From: Jason Wang @ 2018-03-29 8:00 UTC (permalink / raw)
To: mst, jasowang; +Cc: netdev, linux-kernel, kvm, virtualization
Vq log_base is the userspace address of bitmap which has nothing to do
with IOTLB. So it needs to be validated unconditionally otherwise we
may try use 0 as log_base which may lead to pin pages that will lead
unexpected result (e.g trigger BUG_ON() in set_bit_to_user()).
Fixes: 6b1e6cc7855b0 ("vhost: new device IOTLB API")
Reported-by: syzbot+6304bf97ef436580fede@syzkaller.appspotmail.com
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
drivers/vhost/vhost.c | 14 ++++++--------
1 file changed, 6 insertions(+), 8 deletions(-)
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 5d5a9d9..5320039 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1244,14 +1244,12 @@ static int vq_log_access_ok(struct vhost_virtqueue *vq,
/* Caller should have vq mutex and device mutex */
int vhost_vq_access_ok(struct vhost_virtqueue *vq)
{
- if (vq->iotlb) {
- /* When device IOTLB was used, the access validation
- * will be validated during prefetching.
- */
- return 1;
- }
- return vq_access_ok(vq, vq->num, vq->desc, vq->avail, vq->used) &&
- vq_log_access_ok(vq, vq->log_base);
+ int ret = vq_log_access_ok(vq, vq->log_base);
+
+ if (ret || vq->iotlb)
+ return ret;
+
+ return vq_access_ok(vq, vq->num, vq->desc, vq->avail, vq->used);
}
EXPORT_SYMBOL_GPL(vhost_vq_access_ok);
--
2.7.4
^ permalink raw reply related
* Re: [PATCH net V2] vhost: correctly remove wait queue during poll failure
From: Jason Wang @ 2018-03-29 8:01 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: netdev, Darren Kenny, linux-kernel, kvm, virtualization
In-Reply-To: <20180329071801-mutt-send-email-mst@kernel.org>
On 2018年03月29日 12:20, Michael S. Tsirkin wrote:
> On Tue, Mar 27, 2018 at 08:50:52PM +0800, Jason Wang wrote:
>> We tried to remove vq poll from wait queue, but do not check whether
>> or not it was in a list before. This will lead double free. Fixing
>> this by switching to use vhost_poll_stop() which zeros poll->wqh after
>> removing poll from waitqueue to make sure it won't be freed twice.
>>
>> Cc: Darren Kenny<darren.kenny@oracle.com>
>> Reported-by:syzbot+c0272972b01b872e604a@syzkaller.appspotmail.com
>> Fixes: 2b8b328b61c79 ("vhost_net: handle polling errors when setting backend")
>> Signed-off-by: Jason Wang<jasowang@redhat.com>
> OK with this the only bug we have is where get user pages returns 0
> (Reported-by:syzbot+6304bf97ef436580fede@syzkaller.appspotmail.com)
>
>
>
Thanks for the reminder.
I post a patch to avoid this.
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH net] vhost: validate log when IOTLB is enabled
From: Michael S. Tsirkin @ 2018-03-29 12:48 UTC (permalink / raw)
To: Jason Wang; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <1522310404-8486-1-git-send-email-jasowang@redhat.com>
On Thu, Mar 29, 2018 at 04:00:04PM +0800, Jason Wang wrote:
> Vq log_base is the userspace address of bitmap which has nothing to do
> with IOTLB. So it needs to be validated unconditionally otherwise we
> may try use 0 as log_base which may lead to pin pages that will lead
> unexpected result (e.g trigger BUG_ON() in set_bit_to_user()).
>
> Fixes: 6b1e6cc7855b0 ("vhost: new device IOTLB API")
> Reported-by: syzbot+6304bf97ef436580fede@syzkaller.appspotmail.com
> Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
stable material I guess.
> ---
> drivers/vhost/vhost.c | 14 ++++++--------
> 1 file changed, 6 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 5d5a9d9..5320039 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -1244,14 +1244,12 @@ static int vq_log_access_ok(struct vhost_virtqueue *vq,
> /* Caller should have vq mutex and device mutex */
> int vhost_vq_access_ok(struct vhost_virtqueue *vq)
> {
> - if (vq->iotlb) {
> - /* When device IOTLB was used, the access validation
> - * will be validated during prefetching.
> - */
> - return 1;
> - }
> - return vq_access_ok(vq, vq->num, vq->desc, vq->avail, vq->used) &&
> - vq_log_access_ok(vq, vq->log_base);
> + int ret = vq_log_access_ok(vq, vq->log_base);
> +
> + if (ret || vq->iotlb)
> + return ret;
> +
> + return vq_access_ok(vq, vq->num, vq->desc, vq->avail, vq->used);
> }
> EXPORT_SYMBOL_GPL(vhost_vq_access_ok);
>
> --
> 2.7.4
^ permalink raw reply
* Re: [PATCH net] vhost: validate log when IOTLB is enabled
From: Michael S. Tsirkin @ 2018-03-29 14:44 UTC (permalink / raw)
To: Jason Wang; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <1522310404-8486-1-git-send-email-jasowang@redhat.com>
On Thu, Mar 29, 2018 at 04:00:04PM +0800, Jason Wang wrote:
> Vq log_base is the userspace address of bitmap which has nothing to do
> with IOTLB. So it needs to be validated unconditionally otherwise we
> may try use 0 as log_base which may lead to pin pages that will lead
> unexpected result (e.g trigger BUG_ON() in set_bit_to_user()).
>
> Fixes: 6b1e6cc7855b0 ("vhost: new device IOTLB API")
> Reported-by: syzbot+6304bf97ef436580fede@syzkaller.appspotmail.com
> Signed-off-by: Jason Wang <jasowang@redhat.com>
One follow-up question:
We still observe that get user pages returns 0 sometimes. While I agree
we should not pass in unvalidated addresses, isn't this worth
documenting?
> ---
> drivers/vhost/vhost.c | 14 ++++++--------
> 1 file changed, 6 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 5d5a9d9..5320039 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -1244,14 +1244,12 @@ static int vq_log_access_ok(struct vhost_virtqueue *vq,
> /* Caller should have vq mutex and device mutex */
> int vhost_vq_access_ok(struct vhost_virtqueue *vq)
> {
> - if (vq->iotlb) {
> - /* When device IOTLB was used, the access validation
> - * will be validated during prefetching.
> - */
> - return 1;
> - }
> - return vq_access_ok(vq, vq->num, vq->desc, vq->avail, vq->used) &&
> - vq_log_access_ok(vq, vq->log_base);
> + int ret = vq_log_access_ok(vq, vq->log_base);
> +
> + if (ret || vq->iotlb)
> + return ret;
> +
> + return vq_access_ok(vq, vq->num, vq->desc, vq->avail, vq->used);
> }
> EXPORT_SYMBOL_GPL(vhost_vq_access_ok);
>
> --
> 2.7.4
^ permalink raw reply
* Re: [PATCH net] vhost: validate log when IOTLB is enabled
From: David Miller @ 2018-03-29 20:23 UTC (permalink / raw)
To: mst; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20180329154803-mutt-send-email-mst@kernel.org>
From: "Michael S. Tsirkin" <mst@redhat.com>
Date: Thu, 29 Mar 2018 15:48:22 +0300
> On Thu, Mar 29, 2018 at 04:00:04PM +0800, Jason Wang wrote:
>> Vq log_base is the userspace address of bitmap which has nothing to do
>> with IOTLB. So it needs to be validated unconditionally otherwise we
>> may try use 0 as log_base which may lead to pin pages that will lead
>> unexpected result (e.g trigger BUG_ON() in set_bit_to_user()).
>>
>> Fixes: 6b1e6cc7855b0 ("vhost: new device IOTLB API")
>> Reported-by: syzbot+6304bf97ef436580fede@syzkaller.appspotmail.com
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>
> Acked-by: Michael S. Tsirkin <mst@redhat.com>
>
> stable material I guess.
Applied and queued up for -stable.
^ permalink raw reply
* virtio over SW-defined/CPU-driven PCIe endpoint
From: Stephen Warren @ 2018-03-29 21:22 UTC (permalink / raw)
To: Michael S. Tsirkin, Jason Wang; +Cc: virtualization
I've been investigating how to implement a virtio device (as opposed to
a virtio driver) on a regular computer system with a PCIe controller
that can operate in endpoint mode, as opposed to an endpoint that's
implemented by a hypervisor that can preempt execution of a VM, or an
endpoint that's implemented purely in hardware by logic gates. In my
case (and I assume likely in most CPU-driven PCIe endpoint cases), the
endpoint controller has the following capabilities:
- Host-initiated accesses to the endpoint's BARs can read/write normal
memory, but not hardware registers within the endpoint system.
- Accesses to memory exposed by BARs can't be synchronously handled by
the endpoint's local CPU. The local CPU can't be notified when the host
writes memory in order to synchronously update other memory locations.
The local CPU can't synchronously generate the result of a host read
transaction, but rather the data must be present in memory ahead of time.
- Accesses to a small region of address space can be used to generate
interrupts to the endpoint's local CPU. This region can be exposed
through a PCI BAR (or perhaps as part of a BAR; not sure on details
yet). This region of memory has a fixed format and is separate from true
RAM, and so can't be used to hold PCI-virtio's discovery/capability data.
- The endpoint can emit PCI interrupts (e.g. MSI) to the attached host.
The model described in the virtio spec's "Virtio PCI Bus" section
doesn't seem to work in this case, since it assumes:
- Writes to some fields in the PCI configuration space (e.g.
{queue,device_feature,driver_feature}_select) synchronously update other
fields (e.g. queue_size), which can immediately be accessed by the host.
This isn't possible when the memory content is created by a CPU that
isn't a synchronous part of PCI accesses.
- Writing to some fields in the PCI configuration space (e.g.
device_status) are supposed to trigger a response by the device, without
the need to explicitly notify the device of the memory write by some
other means. This isn't possible when the endpoint's local CPU has no
mechanism to be notified of such writes.
- The device_status field is asynchronously read-write by both the
device and driver, yet the spec requires that the driver must always
read-modify-write this field, and additionally that the driver must
never clear any device status bit. These requirements seem impossible to
satisfy in any case at all, let alone the current case.
I can see some possible solutions here:
1) Just implement virtqueues but not all of the standardized PCI
discovery protocol. virtqueues don't have the problems described above
and should work fine between systems where there are asynchronous CPUs
on both ends. virtqueus solely rely on normal memory access without
side-effects and explicit notification. This would require implementing
some custom/device-specific discovery protocol. I believe that
remoteproc/rpmsg take this approach.
2) Define a new standardized virtio PCI discovery protocol that is
better suited to the device being an asynchronous CPU. For example,
eliminate the need for the device to somehow notice memory accesses and
rely on explicitl notification instead. Separate device-written and
driver-written data into different cache-lines or pages.
3) Use something other than virtio/virtqueues instead.
As an aside, I noticed that the memory allocation for virtqueues is very
lopsided; the driver always allocates the storage. This means that the
device must perform PCI reads to transfer data from the driver to the
device. PCI reads are typically slower than PCI writes since reads
require a round-trip transfer, whereas writes can be posted. I wonder if
any thought has been put into having the device optionally allocate
virtuqueue buffers so that the protocol can rely primarily on PCI
writes? Perhaps there's some alternative protocol that's more optimized
for true PCI-based communication rather than paravirtualized PCI-based
communication?
Thanks for any thoughts on the best approach, or pointers to
pre-existing work in this area.
^ permalink raw reply
* Re: virtio over SW-defined/CPU-driven PCIe endpoint
From: Michael S. Tsirkin @ 2018-03-29 22:38 UTC (permalink / raw)
To: Stephen Warren; +Cc: virtualization
In-Reply-To: <7cbb8e1c-a34b-2cef-9de3-f034f6afcdfd@wwwdotorg.org>
On Thu, Mar 29, 2018 at 03:22:29PM -0600, Stephen Warren wrote:
> I've been investigating how to implement a virtio device (as opposed to a
> virtio driver) on a regular computer system with a PCIe controller that can
> operate in endpoint mode, as opposed to an endpoint that's implemented by a
> hypervisor that can preempt execution of a VM, or an endpoint that's
> implemented purely in hardware by logic gates. In my case (and I assume
> likely in most CPU-driven PCIe endpoint cases), the endpoint controller has
> the following capabilities:
>
> - Host-initiated accesses to the endpoint's BARs can read/write normal
> memory, but not hardware registers within the endpoint system.
>
> - Accesses to memory exposed by BARs can't be synchronously handled by the
> endpoint's local CPU. The local CPU can't be notified when the host writes
> memory in order to synchronously update other memory locations. The local
> CPU can't synchronously generate the result of a host read transaction, but
> rather the data must be present in memory ahead of time.
>
> - Accesses to a small region of address space can be used to generate
> interrupts to the endpoint's local CPU. This region can be exposed through a
> PCI BAR (or perhaps as part of a BAR; not sure on details yet). This region
> of memory has a fixed format and is separate from true RAM, and so can't be
> used to hold PCI-virtio's discovery/capability data.
>
> - The endpoint can emit PCI interrupts (e.g. MSI) to the attached host.
>
> The model described in the virtio spec's "Virtio PCI Bus" section doesn't
> seem to work in this case, since it assumes:
>
> - Writes to some fields in the PCI configuration space (e.g.
> {queue,device_feature,driver_feature}_select) synchronously update other
> fields (e.g. queue_size), which can immediately be accessed by the host.
> This isn't possible when the memory content is created by a CPU that isn't a
> synchronous part of PCI accesses.
>
> - Writing to some fields in the PCI configuration space (e.g. device_status)
> are supposed to trigger a response by the device, without the need to
> explicitly notify the device of the memory write by some other means. This
> isn't possible when the endpoint's local CPU has no mechanism to be notified
> of such writes.
>
> - The device_status field is asynchronously read-write by both the device
> and driver, yet the spec requires that the driver must always
> read-modify-write this field, and additionally that the driver must never
> clear any device status bit. These requirements seem impossible to satisfy
> in any case at all, let alone the current case.
>
> I can see some possible solutions here:
>
> 1) Just implement virtqueues but not all of the standardized PCI discovery
> protocol. virtqueues don't have the problems described above and should work
> fine between systems where there are asynchronous CPUs on both ends.
> virtqueus solely rely on normal memory access without side-effects and
> explicit notification. This would require implementing some
> custom/device-specific discovery protocol. I believe that remoteproc/rpmsg
> take this approach.
>
> 2) Define a new standardized virtio PCI discovery protocol that is better
> suited to the device being an asynchronous CPU. For example, eliminate the
> need for the device to somehow notice memory accesses and rely on explicitl
> notification instead. Separate device-written and driver-written data into
> different cache-lines or pages.
>
> 3) Use something other than virtio/virtqueues instead.
>
> As an aside, I noticed that the memory allocation for virtqueues is very
> lopsided; the driver always allocates the storage. This means that the
> device must perform PCI reads to transfer data from the driver to the
> device. PCI reads are typically slower than PCI writes since reads require a
> round-trip transfer, whereas writes can be posted. I wonder if any thought
> has been put into having the device optionally allocate virtuqueue buffers
> so that the protocol can rely primarily on PCI writes? Perhaps there's some
> alternative protocol that's more optimized for true PCI-based communication
> rather than paravirtualized PCI-based communication?
>
> Thanks for any thoughts on the best approach, or pointers to pre-existing
> work in this area.
I think Jan Kiszka wanted to add ability to put some data in the PCI
BAR. This was never formally proposed.
Your first step IMHO should be to send these thoughts to one of the
virtio TC mailing lists, that's where discussion about virtio interface
extensions takes place. virtualization@lists.linux-foundation.org is
mostly for Linux virtio drivers.
--
MST
^ permalink raw reply
* [PATCH v3] virtio_blk: add DISCARD and WRIET ZEROES command support
From: Changpeng Liu @ 2018-03-30 0:49 UTC (permalink / raw)
To: virtualization
Cc: james.r.harris, pawelx.wodkowski, mst, fabio.miranda.martins,
stefanha, pbonzini, cavery, changpeng.liu
Existing virtio-blk protocol doesn't have DISCARD/WRITE ZEROES
command support, this will impact the performance when using SSD
backend over file systems.
The idea here is using 16 Bytes payload as one descriptor for
DISCARD/WRITE ZEROES command, users can put several ranges into
one command, for the purpose to support such feature, two feature
flags VIRTIO_BLK_F_DISCARD/VIRTIO_BLK_F_WRITE_ZEROES and two
commands VIRTIO_BLK_T_DISCARD/VIRTIO_BLK_T_WRITE_ZEROES are
introduced, and some parameters are added to the configuration
space to tell the OS the granularity of DISCARD/WRITE ZEROES
commands.
The specification change list here:
https://github.com/oasis-tcs/virtio-spec
CHANGELOG:
v3: finalized the specification change.
Signed-off-by: Changpeng Liu <changpeng.liu@intel.com>
---
drivers/block/virtio_blk.c | 96 +++++++++++++++++++++++++++++++++++++++--
include/uapi/linux/virtio_blk.h | 39 +++++++++++++++++
2 files changed, 132 insertions(+), 3 deletions(-)
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 4a07593c..1943adb 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -172,10 +172,53 @@ static int virtblk_add_req(struct virtqueue *vq, struct virtblk_req *vbr,
return virtqueue_add_sgs(vq, sgs, num_out, num_in, vbr, GFP_ATOMIC);
}
+static inline int virtblk_setup_discard_write_zeroes(struct request *req, bool unmap)
+{
+ unsigned short segments = blk_rq_nr_discard_segments(req), n = 0;
+ u32 block_size = queue_logical_block_size(req->q);
+ struct virtio_blk_discard_write_zeroes *range;
+ struct bio *bio;
+
+ if (block_size < 512 || !block_size)
+ return -1;
+
+ range = kmalloc_array(segments, sizeof(*range), GFP_ATOMIC);
+ if (!range)
+ return -1;
+
+ __rq_for_each_bio(bio, req) {
+ u64 sector = bio->bi_iter.bi_sector;
+ u32 num_sectors = bio->bi_iter.bi_size >> 9;
+
+ range[n].flags.unmap = cpu_to_le32(unmap);
+ range[n].flags.reserved = cpu_to_le32(0);
+ range[n].num_sectors = cpu_to_le32(num_sectors);
+ range[n].sector = cpu_to_le64(sector);
+ n++;
+ }
+
+ if (WARN_ON_ONCE(n != segments)) {
+ kfree(range);
+ return -1;
+ }
+
+ req->special_vec.bv_page = virt_to_page(range);
+ req->special_vec.bv_offset = offset_in_page(range);
+ req->special_vec.bv_len = sizeof(*range) * segments;
+ req->rq_flags |= RQF_SPECIAL_PAYLOAD;
+
+ return 0;
+}
+
static inline void virtblk_request_done(struct request *req)
{
struct virtblk_req *vbr = blk_mq_rq_to_pdu(req);
+ if (req->rq_flags & RQF_SPECIAL_PAYLOAD) {
+ kfree(page_address(req->special_vec.bv_page) +
+ req->special_vec.bv_offset);
+ }
+
switch (req_op(req)) {
case REQ_OP_SCSI_IN:
case REQ_OP_SCSI_OUT:
@@ -225,6 +268,7 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
int qid = hctx->queue_num;
int err;
bool notify = false;
+ bool unmap = false;
u32 type;
BUG_ON(req->nr_phys_segments + 2 > vblk->sg_elems);
@@ -237,6 +281,13 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
case REQ_OP_FLUSH:
type = VIRTIO_BLK_T_FLUSH;
break;
+ case REQ_OP_DISCARD:
+ type = VIRTIO_BLK_T_DISCARD;
+ break;
+ case REQ_OP_WRITE_ZEROES:
+ type = VIRTIO_BLK_T_WRITE_ZEROES;
+ unmap = !(req->cmd_flags & REQ_NOUNMAP);
+ break;
case REQ_OP_SCSI_IN:
case REQ_OP_SCSI_OUT:
type = VIRTIO_BLK_T_SCSI_CMD;
@@ -256,9 +307,16 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
blk_mq_start_request(req);
+ if (type == VIRTIO_BLK_T_DISCARD || type == VIRTIO_BLK_T_WRITE_ZEROES) {
+ err = virtblk_setup_discard_write_zeroes(req, unmap);
+ if (err)
+ return BLK_STS_IOERR;
+ }
+
num = blk_rq_map_sg(hctx->queue, req, vbr->sg);
if (num) {
- if (rq_data_dir(req) == WRITE)
+ if (rq_data_dir(req) == WRITE || type == VIRTIO_BLK_T_DISCARD ||
+ type == VIRTIO_BLK_T_WRITE_ZEROES)
vbr->out_hdr.type |= cpu_to_virtio32(vblk->vdev, VIRTIO_BLK_T_OUT);
else
vbr->out_hdr.type |= cpu_to_virtio32(vblk->vdev, VIRTIO_BLK_T_IN);
@@ -777,6 +835,38 @@ static int virtblk_probe(struct virtio_device *vdev)
if (!err && opt_io_size)
blk_queue_io_opt(q, blk_size * opt_io_size);
+ if (virtio_has_feature(vdev, VIRTIO_BLK_F_DISCARD)) {
+ q->limits.discard_granularity = blk_size;
+
+ virtio_cread(vdev, struct virtio_blk_config, discard_sector_alignment, &v);
+ if (v)
+ q->limits.discard_alignment = v << 9;
+ else
+ q->limits.discard_alignment = 0;
+
+ virtio_cread(vdev, struct virtio_blk_config, max_discard_sectors, &v);
+ if (v)
+ blk_queue_max_discard_sectors(q, v);
+ else
+ blk_queue_max_discard_sectors(q, UINT_MAX);
+
+ virtio_cread(vdev, struct virtio_blk_config, max_discard_seg, &v);
+ if (v)
+ blk_queue_max_discard_segments(q, v);
+ else
+ blk_queue_max_discard_segments(q, USHRT_MAX);
+
+ queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
+ }
+
+ if (virtio_has_feature(vdev, VIRTIO_BLK_F_WRITE_ZEROES)) {
+ virtio_cread(vdev, struct virtio_blk_config, max_write_zeroes_sectors, &v);
+ if (v)
+ blk_queue_max_write_zeroes_sectors(q, v);
+ else
+ blk_queue_max_write_zeroes_sectors(q, UINT_MAX);
+ }
+
virtblk_update_capacity(vblk, false);
virtio_device_ready(vdev);
@@ -885,14 +975,14 @@ static int virtblk_restore(struct virtio_device *vdev)
VIRTIO_BLK_F_SCSI,
#endif
VIRTIO_BLK_F_FLUSH, VIRTIO_BLK_F_TOPOLOGY, VIRTIO_BLK_F_CONFIG_WCE,
- VIRTIO_BLK_F_MQ,
+ VIRTIO_BLK_F_MQ, VIRTIO_BLK_F_DISCARD, VIRTIO_BLK_F_WRITE_ZEROES,
}
;
static unsigned int features[] = {
VIRTIO_BLK_F_SEG_MAX, VIRTIO_BLK_F_SIZE_MAX, VIRTIO_BLK_F_GEOMETRY,
VIRTIO_BLK_F_RO, VIRTIO_BLK_F_BLK_SIZE,
VIRTIO_BLK_F_FLUSH, VIRTIO_BLK_F_TOPOLOGY, VIRTIO_BLK_F_CONFIG_WCE,
- VIRTIO_BLK_F_MQ,
+ VIRTIO_BLK_F_MQ, VIRTIO_BLK_F_DISCARD, VIRTIO_BLK_F_WRITE_ZEROES,
};
static struct virtio_driver virtio_blk = {
diff --git a/include/uapi/linux/virtio_blk.h b/include/uapi/linux/virtio_blk.h
index 9ebe4d9..78a60e6 100644
--- a/include/uapi/linux/virtio_blk.h
+++ b/include/uapi/linux/virtio_blk.h
@@ -38,6 +38,8 @@
#define VIRTIO_BLK_F_BLK_SIZE 6 /* Block size of disk is available*/
#define VIRTIO_BLK_F_TOPOLOGY 10 /* Topology information is available */
#define VIRTIO_BLK_F_MQ 12 /* support more than one vq */
+#define VIRTIO_BLK_F_DISCARD 13 /* DISCARD command is supported */
+#define VIRTIO_BLK_F_WRITE_ZEROES 14 /* WRITE ZEROES command is supported */
/* Legacy feature bits */
#ifndef VIRTIO_BLK_NO_LEGACY
@@ -86,6 +88,22 @@ struct virtio_blk_config {
/* number of vqs, only available when VIRTIO_BLK_F_MQ is set */
__u16 num_queues;
+ /* The maximum discard segment size (if VIRTIO_BLK_F_DISCARD) */
+ __u32 max_discard_sectors;
+ /* The maximum number of discard segments (if VIRTIO_BLK_F_DISCARD) */
+ __u32 max_discard_seg;
+ /* The sector alignment for discard (if VIRTIO_BLK_F_DISCARD) */
+ __u32 discard_sector_alignment;
+ /* The maximum number of write zeroes sectors (if VIRTIO_BLK_F_WRITE_ZEROES) */
+ __u32 max_write_zeroes_sectors;
+ /* The maximum number of write zeroes segments (if VIRTIO_BLK_F_WRITE_ZEROES) */
+ __u32 max_write_zeroes_seg;
+ /* Device clear this bit when write zeroes command cannot result in
+ * deallocating one or more sectors
+ * (if VIRTIO_BLK_F_WRITE_ZEROES with unmap bit)
+ */
+ __u8 write_zeroes_may_unmap;
+ __u8 unused1[3];
} __attribute__((packed));
/*
@@ -114,6 +132,12 @@ struct virtio_blk_config {
/* Get device ID command */
#define VIRTIO_BLK_T_GET_ID 8
+/* Discard command */
+#define VIRTIO_BLK_T_DISCARD 11
+
+/* Write zeroes command */
+#define VIRTIO_BLK_T_WRITE_ZEROES 13
+
#ifndef VIRTIO_BLK_NO_LEGACY
/* Barrier before this op. */
#define VIRTIO_BLK_T_BARRIER 0x80000000
@@ -133,6 +157,21 @@ struct virtio_blk_outhdr {
__virtio64 sector;
};
+/*
+ * discard/write zeroes range for each request.
+ */
+struct virtio_blk_discard_write_zeroes {
+ /* discard/write zeroes start sector */
+ __virtio64 sector;
+ /* number of discard/write zeroes sectors */
+ __virtio32 num_sectors;
+ /* valid for write zeroes command */
+ struct {
+ __virtio32 unmap:1;
+ __virtio32 reserved:31;
+ } flags;
+};
+
#ifndef VIRTIO_BLK_NO_LEGACY
struct virtio_scsi_inhdr {
__virtio32 errors;
--
1.9.3
^ permalink raw reply related
* Re: [PATCH v3] virtio_blk: add DISCARD and WRIET ZEROES command support
From: Michael S. Tsirkin @ 2018-03-30 1:07 UTC (permalink / raw)
To: Changpeng Liu
Cc: james.r.harris, pawelx.wodkowski, fabio.miranda.martins,
virtualization, stefanha, pbonzini, cavery
In-Reply-To: <1522370974-28259-1-git-send-email-changpeng.liu@intel.com>
On Fri, Mar 30, 2018 at 08:49:34AM +0800, Changpeng Liu wrote:
> Existing virtio-blk protocol doesn't have DISCARD/WRITE ZEROES
> command support, this will impact the performance when using SSD
> backend over file systems.
>
> The idea here is using 16 Bytes payload as one descriptor for
> DISCARD/WRITE ZEROES command, users can put several ranges into
> one command, for the purpose to support such feature, two feature
> flags VIRTIO_BLK_F_DISCARD/VIRTIO_BLK_F_WRITE_ZEROES and two
> commands VIRTIO_BLK_T_DISCARD/VIRTIO_BLK_T_WRITE_ZEROES are
> introduced, and some parameters are added to the configuration
> space to tell the OS the granularity of DISCARD/WRITE ZEROES
> commands.
Pls fix grammar in this comment, I am not sure what are you
trying to say.
>
> The specification change list here:
> https://github.com/oasis-tcs/virtio-spec
Which commit?
> CHANGELOG:
> v3: finalized the specification change.
Changelog belongs after -- and should be complete
including changes v1 to v2.
> Signed-off-by: Changpeng Liu <changpeng.liu@intel.com>
> ---
> drivers/block/virtio_blk.c | 96 +++++++++++++++++++++++++++++++++++++++--
> include/uapi/linux/virtio_blk.h | 39 +++++++++++++++++
> 2 files changed, 132 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index 4a07593c..1943adb 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -172,10 +172,53 @@ static int virtblk_add_req(struct virtqueue *vq, struct virtblk_req *vbr,
> return virtqueue_add_sgs(vq, sgs, num_out, num_in, vbr, GFP_ATOMIC);
> }
>
> +static inline int virtblk_setup_discard_write_zeroes(struct request *req, bool unmap)
> +{
> + unsigned short segments = blk_rq_nr_discard_segments(req), n = 0;
Split on two lines pls:
unsigned short n = 0;
> + u32 block_size = queue_logical_block_size(req->q);
Seems to be unused except for the sanity check below. Why?
> + struct virtio_blk_discard_write_zeroes *range;
> + struct bio *bio;
> +
> + if (block_size < 512 || !block_size)
Why 512? And when is it 0?
> + return -1;
-1 isn't a normal errno code.
> +
> + range = kmalloc_array(segments, sizeof(*range), GFP_ATOMIC);
This might be pretty large: with 64K segments it looks like you are
trying to allocate ~1Mbyte with GFP_ATOMIC which is unlikely to succeed.
Can we split this up in chunks?
> + if (!range)
> + return -1;
> +
> + __rq_for_each_bio(bio, req) {
> + u64 sector = bio->bi_iter.bi_sector;
> + u32 num_sectors = bio->bi_iter.bi_size >> 9;
why 9?
> +
> + range[n].flags.unmap = cpu_to_le32(unmap);
> + range[n].flags.reserved = cpu_to_le32(0);
> + range[n].num_sectors = cpu_to_le32(num_sectors);
> + range[n].sector = cpu_to_le64(sector);
Isn't this causing sparse warnings?
> + n++;
> + }
> +
> + if (WARN_ON_ONCE(n != segments)) {
and when does this happen?
> + kfree(range);
> + return -1;
> + }
> +
> + req->special_vec.bv_page = virt_to_page(range);
> + req->special_vec.bv_offset = offset_in_page(range);
> + req->special_vec.bv_len = sizeof(*range) * segments;
> + req->rq_flags |= RQF_SPECIAL_PAYLOAD;
> +
> + return 0;
> +}
> +
> static inline void virtblk_request_done(struct request *req)
> {
> struct virtblk_req *vbr = blk_mq_rq_to_pdu(req);
>
> + if (req->rq_flags & RQF_SPECIAL_PAYLOAD) {
> + kfree(page_address(req->special_vec.bv_page) +
> + req->special_vec.bv_offset);
> + }
> +
> switch (req_op(req)) {
> case REQ_OP_SCSI_IN:
> case REQ_OP_SCSI_OUT:
> @@ -225,6 +268,7 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
> int qid = hctx->queue_num;
> int err;
> bool notify = false;
> + bool unmap = false;
> u32 type;
>
> BUG_ON(req->nr_phys_segments + 2 > vblk->sg_elems);
> @@ -237,6 +281,13 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
> case REQ_OP_FLUSH:
> type = VIRTIO_BLK_T_FLUSH;
> break;
> + case REQ_OP_DISCARD:
> + type = VIRTIO_BLK_T_DISCARD;
> + break;
> + case REQ_OP_WRITE_ZEROES:
> + type = VIRTIO_BLK_T_WRITE_ZEROES;
> + unmap = !(req->cmd_flags & REQ_NOUNMAP);
> + break;
> case REQ_OP_SCSI_IN:
> case REQ_OP_SCSI_OUT:
> type = VIRTIO_BLK_T_SCSI_CMD;
> @@ -256,9 +307,16 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
>
> blk_mq_start_request(req);
>
> + if (type == VIRTIO_BLK_T_DISCARD || type == VIRTIO_BLK_T_WRITE_ZEROES) {
> + err = virtblk_setup_discard_write_zeroes(req, unmap);
> + if (err)
> + return BLK_STS_IOERR;
Does a failure actually indicate an IO error?
> + }
> +
> num = blk_rq_map_sg(hctx->queue, req, vbr->sg);
> if (num) {
> - if (rq_data_dir(req) == WRITE)
> + if (rq_data_dir(req) == WRITE || type == VIRTIO_BLK_T_DISCARD ||
> + type == VIRTIO_BLK_T_WRITE_ZEROES)
> vbr->out_hdr.type |= cpu_to_virtio32(vblk->vdev, VIRTIO_BLK_T_OUT);
> else
> vbr->out_hdr.type |= cpu_to_virtio32(vblk->vdev, VIRTIO_BLK_T_IN);
> @@ -777,6 +835,38 @@ static int virtblk_probe(struct virtio_device *vdev)
> if (!err && opt_io_size)
> blk_queue_io_opt(q, blk_size * opt_io_size);
>
> + if (virtio_has_feature(vdev, VIRTIO_BLK_F_DISCARD)) {
> + q->limits.discard_granularity = blk_size;
> +
> + virtio_cread(vdev, struct virtio_blk_config, discard_sector_alignment, &v);
> + if (v)
> + q->limits.discard_alignment = v << 9;
> + else
> + q->limits.discard_alignment = 0;
> +
> + virtio_cread(vdev, struct virtio_blk_config, max_discard_sectors, &v);
> + if (v)
> + blk_queue_max_discard_sectors(q, v);
> + else
> + blk_queue_max_discard_sectors(q, UINT_MAX);
> +
> + virtio_cread(vdev, struct virtio_blk_config, max_discard_seg, &v);
> + if (v)
> + blk_queue_max_discard_segments(q, v);
> + else
> + blk_queue_max_discard_segments(q, USHRT_MAX);
> +
> + queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
> + }
> +
> + if (virtio_has_feature(vdev, VIRTIO_BLK_F_WRITE_ZEROES)) {
> + virtio_cread(vdev, struct virtio_blk_config, max_write_zeroes_sectors, &v);
> + if (v)
> + blk_queue_max_write_zeroes_sectors(q, v);
> + else
> + blk_queue_max_write_zeroes_sectors(q, UINT_MAX);
> + }
> +
> virtblk_update_capacity(vblk, false);
> virtio_device_ready(vdev);
>
> @@ -885,14 +975,14 @@ static int virtblk_restore(struct virtio_device *vdev)
> VIRTIO_BLK_F_SCSI,
> #endif
> VIRTIO_BLK_F_FLUSH, VIRTIO_BLK_F_TOPOLOGY, VIRTIO_BLK_F_CONFIG_WCE,
> - VIRTIO_BLK_F_MQ,
> + VIRTIO_BLK_F_MQ, VIRTIO_BLK_F_DISCARD, VIRTIO_BLK_F_WRITE_ZEROES,
> }
> ;
> static unsigned int features[] = {
> VIRTIO_BLK_F_SEG_MAX, VIRTIO_BLK_F_SIZE_MAX, VIRTIO_BLK_F_GEOMETRY,
> VIRTIO_BLK_F_RO, VIRTIO_BLK_F_BLK_SIZE,
> VIRTIO_BLK_F_FLUSH, VIRTIO_BLK_F_TOPOLOGY, VIRTIO_BLK_F_CONFIG_WCE,
> - VIRTIO_BLK_F_MQ,
> + VIRTIO_BLK_F_MQ, VIRTIO_BLK_F_DISCARD, VIRTIO_BLK_F_WRITE_ZEROES,
> };
>
> static struct virtio_driver virtio_blk = {
> diff --git a/include/uapi/linux/virtio_blk.h b/include/uapi/linux/virtio_blk.h
> index 9ebe4d9..78a60e6 100644
> --- a/include/uapi/linux/virtio_blk.h
> +++ b/include/uapi/linux/virtio_blk.h
> @@ -38,6 +38,8 @@
> #define VIRTIO_BLK_F_BLK_SIZE 6 /* Block size of disk is available*/
> #define VIRTIO_BLK_F_TOPOLOGY 10 /* Topology information is available */
> #define VIRTIO_BLK_F_MQ 12 /* support more than one vq */
> +#define VIRTIO_BLK_F_DISCARD 13 /* DISCARD command is supported */
> +#define VIRTIO_BLK_F_WRITE_ZEROES 14 /* WRITE ZEROES command is supported */
>
> /* Legacy feature bits */
> #ifndef VIRTIO_BLK_NO_LEGACY
> @@ -86,6 +88,22 @@ struct virtio_blk_config {
>
> /* number of vqs, only available when VIRTIO_BLK_F_MQ is set */
> __u16 num_queues;
> + /* The maximum discard segment size (if VIRTIO_BLK_F_DISCARD) */
> + __u32 max_discard_sectors;
> + /* The maximum number of discard segments (if VIRTIO_BLK_F_DISCARD) */
> + __u32 max_discard_seg;
> + /* The sector alignment for discard (if VIRTIO_BLK_F_DISCARD) */
> + __u32 discard_sector_alignment;
> + /* The maximum number of write zeroes sectors (if VIRTIO_BLK_F_WRITE_ZEROES) */
> + __u32 max_write_zeroes_sectors;
> + /* The maximum number of write zeroes segments (if VIRTIO_BLK_F_WRITE_ZEROES) */
> + __u32 max_write_zeroes_seg;
> + /* Device clear this bit when write zeroes command cannot result in
> + * deallocating one or more sectors
> + * (if VIRTIO_BLK_F_WRITE_ZEROES with unmap bit)
Pls fix grammar in this comment, I am not sure what are you trying to
say.
> + */
> + __u8 write_zeroes_may_unmap;
> + __u8 unused1[3];
> } __attribute__((packed));
>
> /*
> @@ -114,6 +132,12 @@ struct virtio_blk_config {
> /* Get device ID command */
> #define VIRTIO_BLK_T_GET_ID 8
>
> +/* Discard command */
> +#define VIRTIO_BLK_T_DISCARD 11
> +
> +/* Write zeroes command */
> +#define VIRTIO_BLK_T_WRITE_ZEROES 13
> +
> #ifndef VIRTIO_BLK_NO_LEGACY
> /* Barrier before this op. */
> #define VIRTIO_BLK_T_BARRIER 0x80000000
> @@ -133,6 +157,21 @@ struct virtio_blk_outhdr {
> __virtio64 sector;
> };
>
> +/*
> + * discard/write zeroes range for each request.
> + */
> +struct virtio_blk_discard_write_zeroes {
> + /* discard/write zeroes start sector */
> + __virtio64 sector;
> + /* number of discard/write zeroes sectors */
> + __virtio32 num_sectors;
> + /* valid for write zeroes command */
> + struct {
> + __virtio32 unmap:1;
> + __virtio32 reserved:31;
> + } flags;
> +};
> +
You can't use bitmaps in portable code.
The format differs between architectures.
> #ifndef VIRTIO_BLK_NO_LEGACY
> struct virtio_scsi_inhdr {
> __virtio32 errors;
> --
> 1.9.3
^ permalink raw reply
* Re: [RFC PATCH V2 8/8] vhost: event suppression for packed ring
From: Tiwei Bie @ 2018-03-30 2:05 UTC (permalink / raw)
To: Jason Wang; +Cc: netdev, virtualization, linux-kernel, kvm, mst
In-Reply-To: <1522035533-11786-9-git-send-email-jasowang@redhat.com>
On Mon, Mar 26, 2018 at 11:38:53AM +0800, Jason Wang wrote:
> This patch introduces basic support for event suppression aka driver
> and device area. Compile tested only.
>
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
[...]
> +
> +static bool vhost_notify_packed(struct vhost_dev *dev,
> + struct vhost_virtqueue *vq)
> +{
> + __virtio16 event_off_wrap, event_flags;
> + __u16 old, new;
> + bool v, wrap;
> + int off;
> +
> + /* Flush out used descriptors updates. This is paired
> + * with the barrier that the Guest executes when enabling
> + * interrupts.
> + */
> + smp_mb();
> +
> + if (vhost_get_avail(vq, event_flags,
> + &vq->driver_event->desc_event_flags) < 0) {
> + vq_err(vq, "Failed to get driver desc_event_flags");
> + return true;
> + }
> +
> + if (!(event_flags & cpu_to_vhost16(vq, RING_EVENT_FLAGS_DESC)))
> + return event_flags ==
> + cpu_to_vhost16(vq, RING_EVENT_FLAGS_ENABLE);
Maybe it would be better to not use '&' here. Because these flags
are not defined as bits which can be ORed or ANDed. Instead, they
are defined as values:
0x0 enable
0x1 disable
0x2 desc
0x3 reserved
> +
> + /* Read desc event flags before event_off and event_wrap */
> + smp_rmb();
> +
> + if (vhost_get_avail(vq, event_off_wrap,
> + &vq->driver_event->desc_event_off_warp) < 0) {
> + vq_err(vq, "Failed to get driver desc_event_off/wrap");
> + return true;
> + }
> +
> + off = vhost16_to_cpu(vq, event_off_wrap);
> +
> + wrap = off & 0x1;
> + off >>= 1;
Based on the below definitions in spec, wrap counter is
the most significant bit.
struct pvirtq_event_suppress {
le16 {
desc_event_off : 15; /* Descriptor Ring Change Event Offset */
desc_event_wrap : 1; /* Descriptor Ring Change Event Wrap Counter */
} desc; /* If desc_event_flags set to RING_EVENT_FLAGS_DESC */
le16 {
desc_event_flags : 2, /* Descriptor Ring Change Event Flags */
reserved : 14; /* Reserved, set to 0 */
} flags;
};
> +
> +
> + old = vq->signalled_used;
> + v = vq->signalled_used_valid;
> + new = vq->signalled_used = vq->last_used_idx;
> + vq->signalled_used_valid = true;
> +
> + if (unlikely(!v))
> + return true;
> +
> + return vhost_vring_packed_need_event(vq, new, old, off) &&
> + wrap == vq->used_wrap_counter;
> +}
> +
> +static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
> +{
> + if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
> + return vhost_notify_packed(dev, vq);
> + else
> + return vhost_notify_split(dev, vq);
> +}
> +
> /* This actually signals the guest, using eventfd. */
> void vhost_signal(struct vhost_dev *dev, struct vhost_virtqueue *vq)
> {
> @@ -2789,7 +2911,17 @@ static bool vhost_enable_notify_packed(struct vhost_dev *dev,
> __virtio16 flags;
> int ret;
>
> - /* FIXME: disable notification through device area */
> + if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY))
> + return false;
> + vq->used_flags &= ~VRING_USED_F_NO_NOTIFY;
> +
> + flags = cpu_to_vhost16(vq, RING_EVENT_FLAGS_ENABLE);
> + ret = vhost_update_device_flags(vq, flags);
> + if (ret) {
> + vq_err(vq, "Failed to enable notification at %p: %d\n",
> + &vq->device_event->desc_event_flags, ret);
> + return false;
> + }
>
> /* They could have slipped one in as we were doing that: make
> * sure it's written, then check again. */
> @@ -2855,7 +2987,18 @@ EXPORT_SYMBOL_GPL(vhost_enable_notify);
> static void vhost_disable_notify_packed(struct vhost_dev *dev,
> struct vhost_virtqueue *vq)
> {
> - /* FIXME: disable notification through device area */
> + __virtio16 flags;
> + int r;
> +
> + if (vq->used_flags & VRING_USED_F_NO_NOTIFY)
> + return;
> + vq->used_flags |= VRING_USED_F_NO_NOTIFY;
> +
> + flags = cpu_to_vhost16(vq, RING_EVENT_FLAGS_DISABLE);
> + r = vhost_update_device_flags(vq, flags);
> + if (r)
> + vq_err(vq, "Failed to enable notification at %p: %d\n",
> + &vq->device_event->desc_event_flags, r);
> }
>
> static void vhost_disable_notify_split(struct vhost_dev *dev,
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 8a9df4f..02d7a36 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -96,8 +96,14 @@ struct vhost_virtqueue {
> struct vring_desc __user *desc;
> struct vring_desc_packed __user *desc_packed;
Do you think it'd be better to name the desc type as
struct vring_packed_desc? And it will be consistent
with other names, like:
struct vring_packed;
struct vring_packed_desc_event;
> };
> - struct vring_avail __user *avail;
> - struct vring_used __user *used;
> + union {
> + struct vring_avail __user *avail;
> + struct vring_packed_desc_event __user *driver_event;
> + };
> + union {
> + struct vring_used __user *used;
> + struct vring_packed_desc_event __user *device_event;
> + };
> const struct vhost_umem_node *meta_iotlb[VHOST_NUM_ADDRS];
> struct file *kick;
> struct eventfd_ctx *call_ctx;
> diff --git a/include/uapi/linux/virtio_ring.h b/include/uapi/linux/virtio_ring.h
> index e297580..7cdbf06 100644
> --- a/include/uapi/linux/virtio_ring.h
> +++ b/include/uapi/linux/virtio_ring.h
> @@ -75,6 +75,25 @@ struct vring_desc_packed {
> __virtio16 flags;
> };
>
> +/* Enable events */
> +#define RING_EVENT_FLAGS_ENABLE 0x0
> +/* Disable events */
> +#define RING_EVENT_FLAGS_DISABLE 0x1
> +/*
> + * Enable events for a specific descriptor
> + * (as specified by Descriptor Ring Change Event Offset/Wrap Counter).
> + * Only valid if VIRTIO_F_RING_EVENT_IDX has been negotiated.
> + */
> +#define RING_EVENT_FLAGS_DESC 0x2
> +/* The value 0x3 is reserved */
> +
> +struct vring_packed_desc_event {
> + /* Descriptor Ring Change Event Offset and Wrap Counter */
> + __virtio16 desc_event_off_warp;
> + /* Descriptor Ring Change Event Flags */
> + __virtio16 desc_event_flags;
Do you think it'd be better to remove the prefix (desc_event_) for
the fields. And it will be consistent with other definitions, e.g.:
struct vring_packed_desc {
/* Buffer Address. */
__virtio64 addr;
/* Buffer Length. */
__virtio32 len;
/* Buffer ID. */
__virtio16 id;
/* The flags depending on descriptor type. */
__virtio16 flags;
};
> +};
> +
> /* Virtio ring descriptors: 16 bytes. These can chain together via "next". */
> struct vring_desc {
> /* Address (guest-physical). */
> --
> 2.7.4
>
^ permalink raw reply
* Re: [PATCH net] vhost: validate log when IOTLB is enabled
From: Jason Wang @ 2018-03-30 2:27 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20180329173438-mutt-send-email-mst@kernel.org>
On 2018年03月29日 22:44, Michael S. Tsirkin wrote:
> On Thu, Mar 29, 2018 at 04:00:04PM +0800, Jason Wang wrote:
>> Vq log_base is the userspace address of bitmap which has nothing to do
>> with IOTLB. So it needs to be validated unconditionally otherwise we
>> may try use 0 as log_base which may lead to pin pages that will lead
>> unexpected result (e.g trigger BUG_ON() in set_bit_to_user()).
>>
>> Fixes: 6b1e6cc7855b0 ("vhost: new device IOTLB API")
>> Reported-by:syzbot+6304bf97ef436580fede@syzkaller.appspotmail.com
>> Signed-off-by: Jason Wang<jasowang@redhat.com>
> One follow-up question:
>
> We still observe that get user pages returns 0 sometimes. While I agree
> we should not pass in unvalidated addresses, isn't this worth
> documenting?
>
>
Looking at get_user_pages_fast(), it has:
if (unlikely(!access_ok(write ? VERIFY_WRITE : VERIFY_READ,
(void __user *)start, len)))
return 0;
So this is expected I think.
Thanks
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: virtio over SW-defined/CPU-driven PCIe endpoint
From: Jason Wang @ 2018-03-30 2:44 UTC (permalink / raw)
To: Stephen Warren, Michael S. Tsirkin; +Cc: virtualization
In-Reply-To: <7cbb8e1c-a34b-2cef-9de3-f034f6afcdfd@wwwdotorg.org>
On 2018年03月30日 05:22, Stephen Warren wrote:
> I've been investigating how to implement a virtio device (as opposed
> to a virtio driver) on a regular computer system with a PCIe
> controller that can operate in endpoint mode, as opposed to an
> endpoint that's implemented by a hypervisor that can preempt execution
> of a VM, or an endpoint that's implemented purely in hardware by logic
> gates. In my case (and I assume likely in most CPU-driven PCIe
> endpoint cases), the endpoint controller has the following capabilities:
>
> - Host-initiated accesses to the endpoint's BARs can read/write normal
> memory, but not hardware registers within the endpoint system.
>
> - Accesses to memory exposed by BARs can't be synchronously handled by
> the endpoint's local CPU. The local CPU can't be notified when the
> host writes memory in order to synchronously update other memory
> locations. The local CPU can't synchronously generate the result of a
> host read transaction, but rather the data must be present in memory
> ahead of time.
>
> - Accesses to a small region of address space can be used to generate
> interrupts to the endpoint's local CPU. This region can be exposed
> through a PCI BAR (or perhaps as part of a BAR; not sure on details
> yet). This region of memory has a fixed format and is separate from
> true RAM, and so can't be used to hold PCI-virtio's
> discovery/capability data.
>
> - The endpoint can emit PCI interrupts (e.g. MSI) to the attached host.
>
> The model described in the virtio spec's "Virtio PCI Bus" section
> doesn't seem to work in this case, since it assumes:
>
> - Writes to some fields in the PCI configuration space (e.g.
> {queue,device_feature,driver_feature}_select) synchronously update
> other fields (e.g. queue_size), which can immediately be accessed by
> the host. This isn't possible when the memory content is created by a
> CPU that isn't a synchronous part of PCI accesses.
>
> - Writing to some fields in the PCI configuration space (e.g.
> device_status) are supposed to trigger a response by the device,
> without the need to explicitly notify the device of the memory write
> by some other means. This isn't possible when the endpoint's local CPU
> has no mechanism to be notified of such writes.
>
> - The device_status field is asynchronously read-write by both the
> device and driver, yet the spec requires that the driver must always
> read-modify-write this field, and additionally that the driver must
> never clear any device status bit. These requirements seem impossible
> to satisfy in any case at all, let alone the current case.
>
> I can see some possible solutions here:
>
> 1) Just implement virtqueues but not all of the standardized PCI
> discovery protocol. virtqueues don't have the problems described above
> and should work fine between systems where there are asynchronous CPUs
> on both ends. virtqueus solely rely on normal memory access without
> side-effects and explicit notification. This would require
> implementing some custom/device-specific discovery protocol. I believe
> that remoteproc/rpmsg take this approach.
Yes, kernel has already supported virtio over remoteproc.
Maybe the first step is to document it in the spec.
Thanks
>
> 2) Define a new standardized virtio PCI discovery protocol that is
> better suited to the device being an asynchronous CPU. For example,
> eliminate the need for the device to somehow notice memory accesses
> and rely on explicitl notification instead. Separate device-written
> and driver-written data into different cache-lines or pages.
>
> 3) Use something other than virtio/virtqueues instead.
>
> As an aside, I noticed that the memory allocation for virtqueues is
> very lopsided; the driver always allocates the storage. This means
> that the device must perform PCI reads to transfer data from the
> driver to the device. PCI reads are typically slower than PCI writes
> since reads require a round-trip transfer, whereas writes can be
> posted. I wonder if any thought has been put into having the device
> optionally allocate virtuqueue buffers so that the protocol can rely
> primarily on PCI writes? Perhaps there's some alternative protocol
> that's more optimized for true PCI-based communication rather than
> paravirtualized PCI-based communication?
>
> Thanks for any thoughts on the best approach, or pointers to
> pre-existing work in this area.
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [RFC PATCH V2 8/8] vhost: event suppression for packed ring
From: Jason Wang @ 2018-03-30 2:46 UTC (permalink / raw)
To: Tiwei Bie; +Cc: netdev, virtualization, linux-kernel, kvm, mst
In-Reply-To: <20180330020518.tja3lxkfz7mdwasj@debian>
On 2018年03月30日 10:05, Tiwei Bie wrote:
> On Mon, Mar 26, 2018 at 11:38:53AM +0800, Jason Wang wrote:
>> This patch introduces basic support for event suppression aka driver
>> and device area. Compile tested only.
>>
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>> ---
> [...]
>> +
>> +static bool vhost_notify_packed(struct vhost_dev *dev,
>> + struct vhost_virtqueue *vq)
>> +{
>> + __virtio16 event_off_wrap, event_flags;
>> + __u16 old, new;
>> + bool v, wrap;
>> + int off;
>> +
>> + /* Flush out used descriptors updates. This is paired
>> + * with the barrier that the Guest executes when enabling
>> + * interrupts.
>> + */
>> + smp_mb();
>> +
>> + if (vhost_get_avail(vq, event_flags,
>> + &vq->driver_event->desc_event_flags) < 0) {
>> + vq_err(vq, "Failed to get driver desc_event_flags");
>> + return true;
>> + }
>> +
>> + if (!(event_flags & cpu_to_vhost16(vq, RING_EVENT_FLAGS_DESC)))
>> + return event_flags ==
>> + cpu_to_vhost16(vq, RING_EVENT_FLAGS_ENABLE);
> Maybe it would be better to not use '&' here. Because these flags
> are not defined as bits which can be ORed or ANDed. Instead, they
> are defined as values:
>
> 0x0 enable
> 0x1 disable
> 0x2 desc
> 0x3 reserved
Yes the code seems tricky. Let me fix it in next version.
>
>> +
>> + /* Read desc event flags before event_off and event_wrap */
>> + smp_rmb();
>> +
>> + if (vhost_get_avail(vq, event_off_wrap,
>> + &vq->driver_event->desc_event_off_warp) < 0) {
>> + vq_err(vq, "Failed to get driver desc_event_off/wrap");
>> + return true;
>> + }
>> +
>> + off = vhost16_to_cpu(vq, event_off_wrap);
>> +
>> + wrap = off & 0x1;
>> + off >>= 1;
> Based on the below definitions in spec, wrap counter is
> the most significant bit.
>
> struct pvirtq_event_suppress {
> le16 {
> desc_event_off : 15; /* Descriptor Ring Change Event Offset */
> desc_event_wrap : 1; /* Descriptor Ring Change Event Wrap Counter */
> } desc; /* If desc_event_flags set to RING_EVENT_FLAGS_DESC */
> le16 {
> desc_event_flags : 2, /* Descriptor Ring Change Event Flags */
> reserved : 14; /* Reserved, set to 0 */
> } flags;
> };
Will fix this in next version.
>
>> +
>> +
>> + old = vq->signalled_used;
>> + v = vq->signalled_used_valid;
>> + new = vq->signalled_used = vq->last_used_idx;
>> + vq->signalled_used_valid = true;
>> +
>> + if (unlikely(!v))
>> + return true;
>> +
>> + return vhost_vring_packed_need_event(vq, new, old, off) &&
>> + wrap == vq->used_wrap_counter;
>> +}
>> +
>> +static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
>> +{
>> + if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
>> + return vhost_notify_packed(dev, vq);
>> + else
>> + return vhost_notify_split(dev, vq);
>> +}
>> +
>> /* This actually signals the guest, using eventfd. */
>> void vhost_signal(struct vhost_dev *dev, struct vhost_virtqueue *vq)
>> {
>> @@ -2789,7 +2911,17 @@ static bool vhost_enable_notify_packed(struct vhost_dev *dev,
>> __virtio16 flags;
>> int ret;
>>
>> - /* FIXME: disable notification through device area */
>> + if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY))
>> + return false;
>> + vq->used_flags &= ~VRING_USED_F_NO_NOTIFY;
>> +
>> + flags = cpu_to_vhost16(vq, RING_EVENT_FLAGS_ENABLE);
>> + ret = vhost_update_device_flags(vq, flags);
>> + if (ret) {
>> + vq_err(vq, "Failed to enable notification at %p: %d\n",
>> + &vq->device_event->desc_event_flags, ret);
>> + return false;
>> + }
>>
>> /* They could have slipped one in as we were doing that: make
>> * sure it's written, then check again. */
>> @@ -2855,7 +2987,18 @@ EXPORT_SYMBOL_GPL(vhost_enable_notify);
>> static void vhost_disable_notify_packed(struct vhost_dev *dev,
>> struct vhost_virtqueue *vq)
>> {
>> - /* FIXME: disable notification through device area */
>> + __virtio16 flags;
>> + int r;
>> +
>> + if (vq->used_flags & VRING_USED_F_NO_NOTIFY)
>> + return;
>> + vq->used_flags |= VRING_USED_F_NO_NOTIFY;
>> +
>> + flags = cpu_to_vhost16(vq, RING_EVENT_FLAGS_DISABLE);
>> + r = vhost_update_device_flags(vq, flags);
>> + if (r)
>> + vq_err(vq, "Failed to enable notification at %p: %d\n",
>> + &vq->device_event->desc_event_flags, r);
>> }
>>
>> static void vhost_disable_notify_split(struct vhost_dev *dev,
>> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
>> index 8a9df4f..02d7a36 100644
>> --- a/drivers/vhost/vhost.h
>> +++ b/drivers/vhost/vhost.h
>> @@ -96,8 +96,14 @@ struct vhost_virtqueue {
>> struct vring_desc __user *desc;
>> struct vring_desc_packed __user *desc_packed;
> Do you think it'd be better to name the desc type as
> struct vring_packed_desc?
Ok. Will do.
> And it will be consistent
> with other names, like:
>
> struct vring_packed;
> struct vring_packed_desc_event;
>
>> };
>> - struct vring_avail __user *avail;
>> - struct vring_used __user *used;
>> + union {
>> + struct vring_avail __user *avail;
>> + struct vring_packed_desc_event __user *driver_event;
>> + };
>> + union {
>> + struct vring_used __user *used;
>> + struct vring_packed_desc_event __user *device_event;
>> + };
>> const struct vhost_umem_node *meta_iotlb[VHOST_NUM_ADDRS];
>> struct file *kick;
>> struct eventfd_ctx *call_ctx;
>> diff --git a/include/uapi/linux/virtio_ring.h b/include/uapi/linux/virtio_ring.h
>> index e297580..7cdbf06 100644
>> --- a/include/uapi/linux/virtio_ring.h
>> +++ b/include/uapi/linux/virtio_ring.h
>> @@ -75,6 +75,25 @@ struct vring_desc_packed {
>> __virtio16 flags;
>> };
>>
>> +/* Enable events */
>> +#define RING_EVENT_FLAGS_ENABLE 0x0
>> +/* Disable events */
>> +#define RING_EVENT_FLAGS_DISABLE 0x1
>> +/*
>> + * Enable events for a specific descriptor
>> + * (as specified by Descriptor Ring Change Event Offset/Wrap Counter).
>> + * Only valid if VIRTIO_F_RING_EVENT_IDX has been negotiated.
>> + */
>> +#define RING_EVENT_FLAGS_DESC 0x2
>> +/* The value 0x3 is reserved */
>> +
>> +struct vring_packed_desc_event {
>> + /* Descriptor Ring Change Event Offset and Wrap Counter */
>> + __virtio16 desc_event_off_warp;
>> + /* Descriptor Ring Change Event Flags */
>> + __virtio16 desc_event_flags;
> Do you think it'd be better to remove the prefix (desc_event_) for
> the fields. And it will be consistent with other definitions, e.g.:
>
> struct vring_packed_desc {
> /* Buffer Address. */
> __virtio64 addr;
> /* Buffer Length. */
> __virtio32 len;
> /* Buffer ID. */
> __virtio16 id;
> /* The flags depending on descriptor type. */
> __virtio16 flags;
> };
Yes. Let me do it in next version.
Thanks for the review!
>> +};
>> +
>> /* Virtio ring descriptors: 16 bytes. These can chain together via "next". */
>> struct vring_desc {
>> /* Address (guest-physical). */
>> --
>> 2.7.4
>>
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH 00/24] drm_framebuffer boilerplate removal
From: Alex Deucher @ 2018-03-30 14:47 UTC (permalink / raw)
To: Daniel Stone
Cc: Heiko Stübner, Joonas Lahtinen, dri-devel,
open list:VIRTIO CORE, NET..., Thierry Reding,
amd-gfx mailing list, David (ChunMing) Zhou, Joonyoung Shim,
Russell King, Patrik Jakobsson, Tomi Valkeinen, CK Hu,
Dave Airlie, Rob Clark, David Lechner, linux-arm-msm, intel-gfx,
Jani Nikula, Inki Dae, Rodrigo Vivi
In-Reply-To: <CAPj87rPKDwBK7qoOoY2zXoS6T02toATPRsk=39m_OBr=s1kXTw@mail.gmail.com>
On Fri, Mar 30, 2018 at 10:11 AM, Daniel Stone <daniels@collabora.com> wrote:
> Hi,
> I've been working on a getfb2[0] ioctl, which amongst other things
> supports multi-planar framebuffers as well as modifiers. getfb
> currently calls the framebuffer's handle_create hook, which doesn't
> support multiple planes.
>
> Thanks to Noralf's recent work, drivers can just store GEM objects
> directly in drm_framebuffer. I use this directly in getfb2: we need
> direct access to the GEM objects and not a vfunc in order to not hand
> out duplicate GEM names for the same object.
>
> This series converts all drivers except for nouveau, which was a
> little too non-trivial for my comfort, to storing GEM objects directly
> in drm_framebuffer. For those drivers whose driver_framebuffer struct
> was nothing but drm_framebuffer + BO, it deletes the driver-specific
> struct. It also makes use of Noralf's generic framebuffer helpers for
> create_handle and destroy where possible.
>
> I don't have the hardware for most of these drivers, so have had to
> settle for just staring really hard at the diff.
>
> I intend to remove create_handle when all drivers are converted over
> to placing BOs directly inside drm_framebuffer. For most drivers
> there's a relatively easy conversion to using the helpers for
> basically all framebuffer handling and fbdev emulation as well, though
> that's a bit further than I was willing to go without hardware to test
> on ...
Series is:
Acked-by: Alex Deucher <alexander.deucher@amd.com>
>
> Cheers,
> Daniel
>
> [0]: https://lists.freedesktop.org/archives/dri-devel/2018-March/170512.html
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
^ permalink raw reply
* [RFC v2] virtio: support packed ring
From: Tiwei Bie @ 2018-04-01 14:12 UTC (permalink / raw)
To: mst, jasowang, wexu, virtualization, linux-kernel, netdev
Hello everyone,
This RFC implements packed ring support for virtio driver.
The code was tested with DPDK vhost (testpmd/vhost-PMD) implemented
by Jens at http://dpdk.org/ml/archives/dev/2018-January/089417.html
Minor changes are needed for the vhost code, e.g. to kick the guest.
TODO:
- Refinements and bug fixes;
- Split into small patches;
- Test indirect descriptor support;
- Test/fix event suppression support;
- Test devices other than net;
RFC v1 -> RFC v2:
- Add indirect descriptor support - compile test only;
- Add event suppression supprt - compile test only;
- Move vring_packed_init() out of uapi (Jason, MST);
- Merge two loops into one in virtqueue_add_packed() (Jason);
- Split vring_unmap_one() for packed ring and split ring (Jason);
- Avoid using '%' operator (Jason);
- Rename free_head -> next_avail_idx (Jason);
- Add comments for virtio_wmb() in virtqueue_add_packed() (Jason);
- Some other refinements and bug fixes;
Thanks!
Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
---
drivers/virtio/virtio_ring.c | 1094 +++++++++++++++++++++++++++++-------
include/linux/virtio_ring.h | 8 +-
include/uapi/linux/virtio_config.h | 12 +-
include/uapi/linux/virtio_ring.h | 61 ++
4 files changed, 980 insertions(+), 195 deletions(-)
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 71458f493cf8..0515dca34d77 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -58,14 +58,15 @@
struct vring_desc_state {
void *data; /* Data for callback. */
- struct vring_desc *indir_desc; /* Indirect descriptor, if any. */
+ void *indir_desc; /* Indirect descriptor, if any. */
+ int num; /* Descriptor list length. */
};
struct vring_virtqueue {
struct virtqueue vq;
- /* Actual memory layout for this queue */
- struct vring vring;
+ /* Is this a packed ring? */
+ bool packed;
/* Can we use weak barriers? */
bool weak_barriers;
@@ -79,19 +80,45 @@ struct vring_virtqueue {
/* Host publishes avail event idx */
bool event;
- /* Head of free buffer list. */
- unsigned int free_head;
/* Number we've added since last sync. */
unsigned int num_added;
/* Last used index we've seen. */
u16 last_used_idx;
- /* Last written value to avail->flags */
- u16 avail_flags_shadow;
+ union {
+ /* Available for split ring */
+ struct {
+ /* Actual memory layout for this queue. */
+ struct vring vring;
- /* Last written value to avail->idx in guest byte order */
- u16 avail_idx_shadow;
+ /* Head of free buffer list. */
+ unsigned int free_head;
+
+ /* Last written value to avail->flags */
+ u16 avail_flags_shadow;
+
+ /* Last written value to avail->idx in
+ * guest byte order. */
+ u16 avail_idx_shadow;
+ };
+
+ /* Available for packed ring */
+ struct {
+ /* Actual memory layout for this queue. */
+ struct vring_packed vring_packed;
+
+ /* Driver ring wrap counter. */
+ u8 wrap_counter;
+
+ /* Index of the next avail descriptor. */
+ unsigned int next_avail_idx;
+
+ /* Last written value to driver->flags in
+ * guest byte order. */
+ u16 event_flags_shadow;
+ };
+ };
/* How to notify other side. FIXME: commonalize hcalls! */
bool (*notify)(struct virtqueue *vq);
@@ -201,8 +228,33 @@ static dma_addr_t vring_map_single(const struct vring_virtqueue *vq,
cpu_addr, size, direction);
}
-static void vring_unmap_one(const struct vring_virtqueue *vq,
- struct vring_desc *desc)
+static void vring_unmap_one_split(const struct vring_virtqueue *vq,
+ struct vring_desc *desc)
+{
+ u16 flags;
+
+ if (!vring_use_dma_api(vq->vq.vdev))
+ return;
+
+ flags = virtio16_to_cpu(vq->vq.vdev, desc->flags);
+
+ if (flags & VRING_DESC_F_INDIRECT) {
+ dma_unmap_single(vring_dma_dev(vq),
+ virtio64_to_cpu(vq->vq.vdev, desc->addr),
+ virtio32_to_cpu(vq->vq.vdev, desc->len),
+ (flags & VRING_DESC_F_WRITE) ?
+ DMA_FROM_DEVICE : DMA_TO_DEVICE);
+ } else {
+ dma_unmap_page(vring_dma_dev(vq),
+ virtio64_to_cpu(vq->vq.vdev, desc->addr),
+ virtio32_to_cpu(vq->vq.vdev, desc->len),
+ (flags & VRING_DESC_F_WRITE) ?
+ DMA_FROM_DEVICE : DMA_TO_DEVICE);
+ }
+}
+
+static void vring_unmap_one_packed(const struct vring_virtqueue *vq,
+ struct vring_packed_desc *desc)
{
u16 flags;
@@ -235,8 +287,9 @@ static int vring_mapping_error(const struct vring_virtqueue *vq,
return dma_mapping_error(vring_dma_dev(vq), addr);
}
-static struct vring_desc *alloc_indirect(struct virtqueue *_vq,
- unsigned int total_sg, gfp_t gfp)
+static struct vring_desc *alloc_indirect_split(struct virtqueue *_vq,
+ unsigned int total_sg,
+ gfp_t gfp)
{
struct vring_desc *desc;
unsigned int i;
@@ -257,14 +310,32 @@ static struct vring_desc *alloc_indirect(struct virtqueue *_vq,
return desc;
}
-static inline int virtqueue_add(struct virtqueue *_vq,
- struct scatterlist *sgs[],
- unsigned int total_sg,
- unsigned int out_sgs,
- unsigned int in_sgs,
- void *data,
- void *ctx,
- gfp_t gfp)
+static struct vring_packed_desc *alloc_indirect_packed(struct virtqueue *_vq,
+ unsigned int total_sg,
+ gfp_t gfp)
+{
+ struct vring_packed_desc *desc;
+
+ /*
+ * We require lowmem mappings for the descriptors because
+ * otherwise virt_to_phys will give us bogus addresses in the
+ * virtqueue.
+ */
+ gfp &= ~__GFP_HIGHMEM;
+
+ desc = kmalloc(total_sg * sizeof(struct vring_packed_desc), gfp);
+
+ return desc;
+}
+
+static inline int virtqueue_add_split(struct virtqueue *_vq,
+ struct scatterlist *sgs[],
+ unsigned int total_sg,
+ unsigned int out_sgs,
+ unsigned int in_sgs,
+ void *data,
+ void *ctx,
+ gfp_t gfp)
{
struct vring_virtqueue *vq = to_vvq(_vq);
struct scatterlist *sg;
@@ -303,7 +374,7 @@ static inline int virtqueue_add(struct virtqueue *_vq,
/* If the host supports indirect descriptor tables, and we have multiple
* buffers, then go indirect. FIXME: tune this threshold */
if (vq->indirect && total_sg > 1 && vq->vq.num_free)
- desc = alloc_indirect(_vq, total_sg, gfp);
+ desc = alloc_indirect_split(_vq, total_sg, gfp);
else {
desc = NULL;
WARN_ON_ONCE(total_sg > vq->vring.num && !vq->indirect);
@@ -424,7 +495,7 @@ static inline int virtqueue_add(struct virtqueue *_vq,
for (n = 0; n < total_sg; n++) {
if (i == err_idx)
break;
- vring_unmap_one(vq, &desc[i]);
+ vring_unmap_one_split(vq, &desc[i]);
i = virtio16_to_cpu(_vq->vdev, vq->vring.desc[i].next);
}
@@ -435,6 +506,210 @@ static inline int virtqueue_add(struct virtqueue *_vq,
return -EIO;
}
+static inline int virtqueue_add_packed(struct virtqueue *_vq,
+ struct scatterlist *sgs[],
+ unsigned int total_sg,
+ unsigned int out_sgs,
+ unsigned int in_sgs,
+ void *data,
+ void *ctx,
+ gfp_t gfp)
+{
+ struct vring_virtqueue *vq = to_vvq(_vq);
+ struct vring_packed_desc *desc;
+ struct scatterlist *sg;
+ unsigned int i, n, descs_used, uninitialized_var(prev), err_idx;
+ __virtio16 uninitialized_var(head_flags), flags;
+ int head, wrap_counter;
+ bool indirect;
+
+ START_USE(vq);
+
+ BUG_ON(data == NULL);
+ BUG_ON(ctx && vq->indirect);
+
+ if (unlikely(vq->broken)) {
+ END_USE(vq);
+ return -EIO;
+ }
+
+#ifdef DEBUG
+ {
+ ktime_t now = ktime_get();
+
+ /* No kick or get, with .1 second between? Warn. */
+ if (vq->last_add_time_valid)
+ WARN_ON(ktime_to_ms(ktime_sub(now, vq->last_add_time))
+ > 100);
+ vq->last_add_time = now;
+ vq->last_add_time_valid = true;
+ }
+#endif
+
+ BUG_ON(total_sg == 0);
+
+ head = vq->next_avail_idx;
+ wrap_counter = vq->wrap_counter;
+
+ /* If the host supports indirect descriptor tables, and we have multiple
+ * buffers, then go indirect. FIXME: tune this threshold */
+ if (vq->indirect && total_sg > 1 && vq->vq.num_free)
+ desc = alloc_indirect_packed(_vq, total_sg, gfp);
+ else {
+ desc = NULL;
+ WARN_ON_ONCE(total_sg > vq->vring_packed.num && !vq->indirect);
+ }
+
+ if (desc) {
+ /* Use a single buffer which doesn't continue */
+ indirect = true;
+ /* Set up rest to use this indirect table. */
+ i = 0;
+ descs_used = 1;
+ } else {
+ indirect = false;
+ desc = vq->vring_packed.desc;
+ i = head;
+ descs_used = total_sg;
+ }
+
+ if (vq->vq.num_free < descs_used) {
+ pr_debug("Can't add buf len %i - avail = %i\n",
+ descs_used, vq->vq.num_free);
+ /* FIXME: for historical reasons, we force a notify here if
+ * there are outgoing parts to the buffer. Presumably the
+ * host should service the ring ASAP. */
+ if (out_sgs)
+ vq->notify(&vq->vq);
+ if (indirect)
+ kfree(desc);
+ END_USE(vq);
+ return -ENOSPC;
+ }
+
+ for (n = 0; n < out_sgs + in_sgs; n++) {
+ for (sg = sgs[n]; sg; sg = sg_next(sg)) {
+ dma_addr_t addr = vring_map_one_sg(vq, sg, n < out_sgs ?
+ DMA_TO_DEVICE : DMA_FROM_DEVICE);
+ if (vring_mapping_error(vq, addr))
+ goto unmap_release;
+
+ flags = cpu_to_virtio16(_vq->vdev, VRING_DESC_F_NEXT |
+ (n < out_sgs ? 0 : VRING_DESC_F_WRITE) |
+ VRING_DESC_F_AVAIL(vq->wrap_counter) |
+ VRING_DESC_F_USED(!vq->wrap_counter));
+ if (!indirect && i == head)
+ head_flags = flags;
+ else
+ desc[i].flags = flags;
+
+ desc[i].addr = cpu_to_virtio64(_vq->vdev, addr);
+ desc[i].len = cpu_to_virtio32(_vq->vdev, sg->length);
+ desc[i].id = cpu_to_virtio32(_vq->vdev, head);
+ prev = i;
+ i++;
+ if (!indirect && i >= vq->vring_packed.num) {
+ i = 0;
+ vq->wrap_counter ^= 1;
+ }
+ }
+ }
+ /* Last one doesn't continue. */
+ if (total_sg == 1)
+ head_flags &= cpu_to_virtio16(_vq->vdev, ~VRING_DESC_F_NEXT);
+ else
+ desc[prev].flags &= cpu_to_virtio16(_vq->vdev, ~VRING_DESC_F_NEXT);
+
+ if (indirect) {
+ /* Now that the indirect table is filled in, map it. */
+ dma_addr_t addr = vring_map_single(
+ vq, desc, total_sg * sizeof(struct vring_packed_desc),
+ DMA_TO_DEVICE);
+ if (vring_mapping_error(vq, addr))
+ goto unmap_release;
+
+ head_flags = cpu_to_virtio16(_vq->vdev, VRING_DESC_F_INDIRECT |
+ VRING_DESC_F_AVAIL(wrap_counter) |
+ VRING_DESC_F_USED(!wrap_counter));
+ vq->vring_packed.desc[head].addr = cpu_to_virtio64(_vq->vdev, addr);
+ vq->vring_packed.desc[head].len = cpu_to_virtio32(_vq->vdev,
+ total_sg * sizeof(struct vring_packed_desc));
+ vq->vring_packed.desc[head].id = cpu_to_virtio32(_vq->vdev, head);
+ }
+
+ /* We're using some buffers from the free list. */
+ vq->vq.num_free -= descs_used;
+
+ /* Update free pointer */
+ if (indirect) {
+ n = head + 1;
+ if (n >= vq->vring_packed.num) {
+ n = 0;
+ vq->wrap_counter ^= 1;
+ }
+ vq->next_avail_idx = n;
+ } else
+ vq->next_avail_idx = i;
+
+ /* Store token and indirect buffer state. */
+ vq->desc_state[head].num = descs_used;
+ vq->desc_state[head].data = data;
+ if (indirect)
+ vq->desc_state[head].indir_desc = desc;
+ else
+ vq->desc_state[head].indir_desc = ctx;
+
+ /* A driver MUST NOT make the first descriptor in the list
+ * available before all subsequent descriptors comprising
+ * the list are made available. */
+ virtio_wmb(vq->weak_barriers);
+ vq->vring_packed.desc[head].flags = head_flags;
+ vq->num_added++;
+
+ pr_debug("Added buffer head %i to %p\n", head, vq);
+ END_USE(vq);
+
+ return 0;
+
+unmap_release:
+ err_idx = i;
+ i = head;
+
+ for (n = 0; n < total_sg; n++) {
+ if (i == err_idx)
+ break;
+ vring_unmap_one_packed(vq, &desc[i]);
+ i++;
+ if (!indirect && i >= vq->vring_packed.num)
+ i = 0;
+ }
+
+ vq->wrap_counter = wrap_counter;
+
+ if (indirect)
+ kfree(desc);
+
+ END_USE(vq);
+ return -EIO;
+}
+
+static inline int virtqueue_add(struct virtqueue *_vq,
+ struct scatterlist *sgs[],
+ unsigned int total_sg,
+ unsigned int out_sgs,
+ unsigned int in_sgs,
+ void *data,
+ void *ctx,
+ gfp_t gfp)
+{
+ struct vring_virtqueue *vq = to_vvq(_vq);
+
+ return vq->packed ? virtqueue_add_packed(_vq, sgs, total_sg, out_sgs,
+ in_sgs, data, ctx, gfp) :
+ virtqueue_add_split(_vq, sgs, total_sg, out_sgs,
+ in_sgs, data, ctx, gfp);
+}
+
/**
* virtqueue_add_sgs - expose buffers to other end
* @vq: the struct virtqueue we're talking about.
@@ -537,18 +812,7 @@ int virtqueue_add_inbuf_ctx(struct virtqueue *vq,
}
EXPORT_SYMBOL_GPL(virtqueue_add_inbuf_ctx);
-/**
- * virtqueue_kick_prepare - first half of split virtqueue_kick call.
- * @vq: the struct virtqueue
- *
- * Instead of virtqueue_kick(), you can do:
- * if (virtqueue_kick_prepare(vq))
- * virtqueue_notify(vq);
- *
- * This is sometimes useful because the virtqueue_kick_prepare() needs
- * to be serialized, but the actual virtqueue_notify() call does not.
- */
-bool virtqueue_kick_prepare(struct virtqueue *_vq)
+static bool virtqueue_kick_prepare_split(struct virtqueue *_vq)
{
struct vring_virtqueue *vq = to_vvq(_vq);
u16 new, old;
@@ -580,6 +844,62 @@ bool virtqueue_kick_prepare(struct virtqueue *_vq)
END_USE(vq);
return needs_kick;
}
+
+static bool virtqueue_kick_prepare_packed(struct virtqueue *_vq)
+{
+ struct vring_virtqueue *vq = to_vvq(_vq);
+ u16 new, old, off_wrap;
+ bool needs_kick;
+
+ START_USE(vq);
+ /* We need to expose the new flags value before checking notification
+ * suppressions. */
+ virtio_mb(vq->weak_barriers);
+
+ old = vq->next_avail_idx - vq->num_added;
+ new = vq->next_avail_idx;
+ vq->num_added = 0;
+
+#ifdef DEBUG
+ if (vq->last_add_time_valid) {
+ WARN_ON(ktime_to_ms(ktime_sub(ktime_get(),
+ vq->last_add_time)) > 100);
+ }
+ vq->last_add_time_valid = false;
+#endif
+
+ off_wrap = virtio16_to_cpu(_vq->vdev, vq->vring_packed.device->off_wrap);
+
+ if (vq->event) {
+ // FIXME: fix this!
+ needs_kick = ((off_wrap >> 15) == vq->wrap_counter) &&
+ vring_need_event(off_wrap & ~(1<<15), new, old);
+ } else {
+ needs_kick = (vq->vring_packed.device->flags !=
+ cpu_to_virtio16(_vq->vdev, VRING_EVENT_F_DISABLE));
+ }
+ END_USE(vq);
+ return needs_kick;
+}
+
+/**
+ * virtqueue_kick_prepare - first half of split virtqueue_kick call.
+ * @vq: the struct virtqueue
+ *
+ * Instead of virtqueue_kick(), you can do:
+ * if (virtqueue_kick_prepare(vq))
+ * virtqueue_notify(vq);
+ *
+ * This is sometimes useful because the virtqueue_kick_prepare() needs
+ * to be serialized, but the actual virtqueue_notify() call does not.
+ */
+bool virtqueue_kick_prepare(struct virtqueue *_vq)
+{
+ struct vring_virtqueue *vq = to_vvq(_vq);
+
+ return vq->packed ? virtqueue_kick_prepare_packed(_vq) :
+ virtqueue_kick_prepare_split(_vq);
+}
EXPORT_SYMBOL_GPL(virtqueue_kick_prepare);
/**
@@ -626,8 +946,8 @@ bool virtqueue_kick(struct virtqueue *vq)
}
EXPORT_SYMBOL_GPL(virtqueue_kick);
-static void detach_buf(struct vring_virtqueue *vq, unsigned int head,
- void **ctx)
+static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head,
+ void **ctx)
{
unsigned int i, j;
__virtio16 nextflag = cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_NEXT);
@@ -639,12 +959,12 @@ static void detach_buf(struct vring_virtqueue *vq, unsigned int head,
i = head;
while (vq->vring.desc[i].flags & nextflag) {
- vring_unmap_one(vq, &vq->vring.desc[i]);
+ vring_unmap_one_split(vq, &vq->vring.desc[i]);
i = virtio16_to_cpu(vq->vq.vdev, vq->vring.desc[i].next);
vq->vq.num_free++;
}
- vring_unmap_one(vq, &vq->vring.desc[i]);
+ vring_unmap_one_split(vq, &vq->vring.desc[i]);
vq->vring.desc[i].next = cpu_to_virtio16(vq->vq.vdev, vq->free_head);
vq->free_head = head;
@@ -666,7 +986,7 @@ static void detach_buf(struct vring_virtqueue *vq, unsigned int head,
BUG_ON(len == 0 || len % sizeof(struct vring_desc));
for (j = 0; j < len / sizeof(struct vring_desc); j++)
- vring_unmap_one(vq, &indir_desc[j]);
+ vring_unmap_one_split(vq, &indir_desc[j]);
kfree(indir_desc);
vq->desc_state[head].indir_desc = NULL;
@@ -675,11 +995,207 @@ static void detach_buf(struct vring_virtqueue *vq, unsigned int head,
}
}
-static inline bool more_used(const struct vring_virtqueue *vq)
+static int detach_buf_packed(struct vring_virtqueue *vq, unsigned int head,
+ void **ctx)
+{
+ struct vring_packed_desc *desc;
+ unsigned int i, j;
+
+ /* Clear data ptr. */
+ vq->desc_state[head].data = NULL;
+
+ i = head;
+
+ for (j = 0; j < vq->desc_state[head].num; j++) {
+ desc = &vq->vring_packed.desc[i];
+ vring_unmap_one_packed(vq, desc);
+ desc->flags = 0x0;
+ i++;
+ if (i >= vq->vring_packed.num)
+ i = 0;
+ }
+
+ vq->vq.num_free += vq->desc_state[head].num;
+
+ if (vq->indirect) {
+ u32 len;
+
+ desc = vq->desc_state[head].indir_desc;
+ /* Free the indirect table, if any, now that it's unmapped. */
+ if (!desc)
+ goto out;
+
+ len = virtio32_to_cpu(vq->vq.vdev,
+ vq->vring_packed.desc[head].len);
+
+ BUG_ON(!(vq->vring_packed.desc[head].flags &
+ cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_INDIRECT)));
+ BUG_ON(len == 0 || len % sizeof(struct vring_packed_desc));
+
+ for (j = 0; j < len / sizeof(struct vring_packed_desc); j++)
+ vring_unmap_one_packed(vq, &desc[j]);
+
+ kfree(desc);
+ vq->desc_state[head].indir_desc = NULL;
+ } else if (ctx) {
+ *ctx = vq->desc_state[head].indir_desc;
+ }
+
+out:
+ return vq->desc_state[head].num;
+}
+
+static inline bool more_used_split(const struct vring_virtqueue *vq)
{
return vq->last_used_idx != virtio16_to_cpu(vq->vq.vdev, vq->vring.used->idx);
}
+static inline bool more_used_packed(const struct vring_virtqueue *vq)
+{
+ u16 last_used, flags;
+ bool avail, used;
+
+ if (vq->vq.num_free == vq->vring_packed.num)
+ return false;
+
+ last_used = vq->last_used_idx;
+ flags = virtio16_to_cpu(vq->vq.vdev,
+ vq->vring_packed.desc[last_used].flags);
+ avail = flags & VRING_DESC_F_AVAIL(1);
+ used = flags & VRING_DESC_F_USED(1);
+
+ return avail == used;
+}
+
+static inline bool more_used(const struct vring_virtqueue *vq)
+{
+ return vq->packed ? more_used_packed(vq) : more_used_split(vq);
+}
+
+void *virtqueue_get_buf_ctx_split(struct virtqueue *_vq, unsigned int *len,
+ void **ctx)
+{
+ struct vring_virtqueue *vq = to_vvq(_vq);
+ void *ret;
+ unsigned int i;
+ u16 last_used;
+
+ START_USE(vq);
+
+ if (unlikely(vq->broken)) {
+ END_USE(vq);
+ return NULL;
+ }
+
+ if (!more_used(vq)) {
+ pr_debug("No more buffers in queue\n");
+ END_USE(vq);
+ return NULL;
+ }
+
+ /* Only get used array entries after they have been exposed by host. */
+ virtio_rmb(vq->weak_barriers);
+
+ last_used = (vq->last_used_idx & (vq->vring.num - 1));
+ i = virtio32_to_cpu(_vq->vdev, vq->vring.used->ring[last_used].id);
+ *len = virtio32_to_cpu(_vq->vdev, vq->vring.used->ring[last_used].len);
+
+ if (unlikely(i >= vq->vring.num)) {
+ BAD_RING(vq, "id %u out of range\n", i);
+ return NULL;
+ }
+ if (unlikely(!vq->desc_state[i].data)) {
+ BAD_RING(vq, "id %u is not a head!\n", i);
+ return NULL;
+ }
+
+ /* detach_buf_split clears data, so grab it now. */
+ ret = vq->desc_state[i].data;
+ detach_buf_split(vq, i, ctx);
+ vq->last_used_idx++;
+ /* If we expect an interrupt for the next entry, tell host
+ * by writing event index and flush out the write before
+ * the read in the next get_buf call. */
+ if (!(vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT))
+ virtio_store_mb(vq->weak_barriers,
+ &vring_used_event(&vq->vring),
+ cpu_to_virtio16(_vq->vdev, vq->last_used_idx));
+
+#ifdef DEBUG
+ vq->last_add_time_valid = false;
+#endif
+
+ END_USE(vq);
+ return ret;
+}
+
+void *virtqueue_get_buf_ctx_packed(struct virtqueue *_vq, unsigned int *len,
+ void **ctx)
+{
+ struct vring_virtqueue *vq = to_vvq(_vq);
+ uint16_t wrap_counter;
+ void *ret;
+ unsigned int i;
+ u16 last_used;
+
+ START_USE(vq);
+
+ if (unlikely(vq->broken)) {
+ END_USE(vq);
+ return NULL;
+ }
+
+ if (!more_used(vq)) {
+ pr_debug("No more buffers in queue\n");
+ END_USE(vq);
+ return NULL;
+ }
+
+ /* Only get used elements after they have been exposed by host. */
+ virtio_rmb(vq->weak_barriers);
+
+ last_used = vq->last_used_idx;
+ i = virtio32_to_cpu(_vq->vdev, vq->vring_packed.desc[last_used].id);
+ *len = virtio32_to_cpu(_vq->vdev, vq->vring_packed.desc[last_used].len);
+
+ if (unlikely(i >= vq->vring_packed.num)) {
+ BAD_RING(vq, "id %u out of range\n", i);
+ return NULL;
+ }
+ if (unlikely(!vq->desc_state[i].data)) {
+ BAD_RING(vq, "id %u is not a head!\n", i);
+ return NULL;
+ }
+
+ /* detach_buf_packed clears data, so grab it now. */
+ ret = vq->desc_state[i].data;
+ detach_buf_packed(vq, i, ctx);
+
+ vq->last_used_idx += vq->desc_state[i].num;
+ if (vq->last_used_idx >= vq->vring_packed.num)
+ vq->last_used_idx -= vq->vring_packed.num;
+
+ wrap_counter = vq->wrap_counter;
+ if (vq->last_used_idx > vq->next_avail_idx)
+ wrap_counter ^= 1;
+
+ /* If we expect an interrupt for the next entry, tell host
+ * by writing event index and flush out the write before
+ * the read in the next get_buf call. */
+ if (vq->event_flags_shadow == VRING_EVENT_F_DESC)
+ virtio_store_mb(vq->weak_barriers,
+ &vq->vring_packed.driver->off_wrap,
+ cpu_to_virtio16(_vq->vdev, vq->last_used_idx |
+ wrap_counter << 15));
+
+#ifdef DEBUG
+ vq->last_add_time_valid = false;
+#endif
+
+ END_USE(vq);
+ return ret;
+}
+
/**
* virtqueue_get_buf - get the next used buffer
* @vq: the struct virtqueue we're talking about.
@@ -700,57 +1216,9 @@ void *virtqueue_get_buf_ctx(struct virtqueue *_vq, unsigned int *len,
void **ctx)
{
struct vring_virtqueue *vq = to_vvq(_vq);
- void *ret;
- unsigned int i;
- u16 last_used;
- START_USE(vq);
-
- if (unlikely(vq->broken)) {
- END_USE(vq);
- return NULL;
- }
-
- if (!more_used(vq)) {
- pr_debug("No more buffers in queue\n");
- END_USE(vq);
- return NULL;
- }
-
- /* Only get used array entries after they have been exposed by host. */
- virtio_rmb(vq->weak_barriers);
-
- last_used = (vq->last_used_idx & (vq->vring.num - 1));
- i = virtio32_to_cpu(_vq->vdev, vq->vring.used->ring[last_used].id);
- *len = virtio32_to_cpu(_vq->vdev, vq->vring.used->ring[last_used].len);
-
- if (unlikely(i >= vq->vring.num)) {
- BAD_RING(vq, "id %u out of range\n", i);
- return NULL;
- }
- if (unlikely(!vq->desc_state[i].data)) {
- BAD_RING(vq, "id %u is not a head!\n", i);
- return NULL;
- }
-
- /* detach_buf clears data, so grab it now. */
- ret = vq->desc_state[i].data;
- detach_buf(vq, i, ctx);
- vq->last_used_idx++;
- /* If we expect an interrupt for the next entry, tell host
- * by writing event index and flush out the write before
- * the read in the next get_buf call. */
- if (!(vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT))
- virtio_store_mb(vq->weak_barriers,
- &vring_used_event(&vq->vring),
- cpu_to_virtio16(_vq->vdev, vq->last_used_idx));
-
-#ifdef DEBUG
- vq->last_add_time_valid = false;
-#endif
-
- END_USE(vq);
- return ret;
+ return vq->packed ? virtqueue_get_buf_ctx_packed(_vq, len, ctx) :
+ virtqueue_get_buf_ctx_split(_vq, len, ctx);
}
EXPORT_SYMBOL_GPL(virtqueue_get_buf_ctx);
@@ -759,6 +1227,29 @@ void *virtqueue_get_buf(struct virtqueue *_vq, unsigned int *len)
return virtqueue_get_buf_ctx(_vq, len, NULL);
}
EXPORT_SYMBOL_GPL(virtqueue_get_buf);
+
+static void virtqueue_disable_cb_split(struct virtqueue *_vq)
+{
+ struct vring_virtqueue *vq = to_vvq(_vq);
+
+ if (!(vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT)) {
+ vq->avail_flags_shadow |= VRING_AVAIL_F_NO_INTERRUPT;
+ if (!vq->event)
+ vq->vring.avail->flags = cpu_to_virtio16(_vq->vdev, vq->avail_flags_shadow);
+ }
+}
+
+static void virtqueue_disable_cb_packed(struct virtqueue *_vq)
+{
+ struct vring_virtqueue *vq = to_vvq(_vq);
+
+ if (vq->event_flags_shadow != VRING_EVENT_F_DISABLE) {
+ vq->event_flags_shadow = VRING_EVENT_F_DISABLE;
+ vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
+ vq->event_flags_shadow);
+ }
+}
+
/**
* virtqueue_disable_cb - disable callbacks
* @vq: the struct virtqueue we're talking about.
@@ -772,15 +1263,66 @@ void virtqueue_disable_cb(struct virtqueue *_vq)
{
struct vring_virtqueue *vq = to_vvq(_vq);
- if (!(vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT)) {
- vq->avail_flags_shadow |= VRING_AVAIL_F_NO_INTERRUPT;
- if (!vq->event)
- vq->vring.avail->flags = cpu_to_virtio16(_vq->vdev, vq->avail_flags_shadow);
- }
-
+ if (vq->packed)
+ virtqueue_disable_cb_packed(_vq);
+ else
+ virtqueue_disable_cb_split(_vq);
}
EXPORT_SYMBOL_GPL(virtqueue_disable_cb);
+static unsigned virtqueue_enable_cb_prepare_split(struct virtqueue *_vq)
+{
+ struct vring_virtqueue *vq = to_vvq(_vq);
+ u16 last_used_idx;
+
+ START_USE(vq);
+
+ /* We optimistically turn back on interrupts, then check if there was
+ * more to do. */
+ /* Depending on the VIRTIO_RING_F_EVENT_IDX feature, we need to
+ * either clear the flags bit or point the event index at the next
+ * entry. Always do both to keep code simple. */
+ if (vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT) {
+ vq->avail_flags_shadow &= ~VRING_AVAIL_F_NO_INTERRUPT;
+ if (!vq->event)
+ vq->vring.avail->flags = cpu_to_virtio16(_vq->vdev, vq->avail_flags_shadow);
+ }
+ vring_used_event(&vq->vring) = cpu_to_virtio16(_vq->vdev, last_used_idx = vq->last_used_idx);
+ END_USE(vq);
+ return last_used_idx;
+}
+
+static unsigned virtqueue_enable_cb_prepare_packed(struct virtqueue *_vq)
+{
+ struct vring_virtqueue *vq = to_vvq(_vq);
+ u16 last_used_idx, wrap_counter, off_wrap;
+
+ START_USE(vq);
+
+ last_used_idx = vq->last_used_idx;
+ wrap_counter = vq->wrap_counter;
+
+ if (last_used_idx > vq->next_avail_idx)
+ wrap_counter ^= 1;
+
+ off_wrap = last_used_idx | (wrap_counter << 15);
+
+ /* We optimistically turn back on interrupts, then check if there was
+ * more to do. */
+ /* Depending on the VIRTIO_RING_F_EVENT_IDX feature, we need to
+ * either clear the flags bit or point the event index at the next
+ * entry. Always do both to keep code simple. */
+ if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
+ vq->event_flags_shadow = vq->event ? VRING_EVENT_F_DESC:
+ VRING_EVENT_F_ENABLE;
+ vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
+ vq->event_flags_shadow);
+ }
+ vq->vring_packed.driver->off_wrap = cpu_to_virtio16(_vq->vdev, off_wrap);
+ END_USE(vq);
+ return last_used_idx;
+}
+
/**
* virtqueue_enable_cb_prepare - restart callbacks after disable_cb
* @vq: the struct virtqueue we're talking about.
@@ -796,26 +1338,34 @@ EXPORT_SYMBOL_GPL(virtqueue_disable_cb);
unsigned virtqueue_enable_cb_prepare(struct virtqueue *_vq)
{
struct vring_virtqueue *vq = to_vvq(_vq);
- u16 last_used_idx;
- START_USE(vq);
-
- /* We optimistically turn back on interrupts, then check if there was
- * more to do. */
- /* Depending on the VIRTIO_RING_F_EVENT_IDX feature, we need to
- * either clear the flags bit or point the event index at the next
- * entry. Always do both to keep code simple. */
- if (vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT) {
- vq->avail_flags_shadow &= ~VRING_AVAIL_F_NO_INTERRUPT;
- if (!vq->event)
- vq->vring.avail->flags = cpu_to_virtio16(_vq->vdev, vq->avail_flags_shadow);
- }
- vring_used_event(&vq->vring) = cpu_to_virtio16(_vq->vdev, last_used_idx = vq->last_used_idx);
- END_USE(vq);
- return last_used_idx;
+ return vq->packed ? virtqueue_enable_cb_prepare_packed(_vq) :
+ virtqueue_enable_cb_prepare_split(_vq);
}
EXPORT_SYMBOL_GPL(virtqueue_enable_cb_prepare);
+static bool virtqueue_poll_split(struct virtqueue *_vq, unsigned last_used_idx)
+{
+ struct vring_virtqueue *vq = to_vvq(_vq);
+
+ virtio_mb(vq->weak_barriers);
+ return (u16)last_used_idx != virtio16_to_cpu(_vq->vdev, vq->vring.used->idx);
+}
+
+static bool virtqueue_poll_packed(struct virtqueue *_vq, unsigned last_used_idx)
+{
+ struct vring_virtqueue *vq = to_vvq(_vq);
+ bool avail, used;
+ u16 flags;
+
+ virtio_mb(vq->weak_barriers);
+ flags = virtio16_to_cpu(vq->vq.vdev,
+ vq->vring_packed.desc[last_used_idx].flags);
+ avail = flags & VRING_DESC_F_AVAIL(1);
+ used = flags & VRING_DESC_F_USED(1);
+ return avail == used;
+}
+
/**
* virtqueue_poll - query pending used buffers
* @vq: the struct virtqueue we're talking about.
@@ -829,8 +1379,8 @@ bool virtqueue_poll(struct virtqueue *_vq, unsigned last_used_idx)
{
struct vring_virtqueue *vq = to_vvq(_vq);
- virtio_mb(vq->weak_barriers);
- return (u16)last_used_idx != virtio16_to_cpu(_vq->vdev, vq->vring.used->idx);
+ return vq->packed ? virtqueue_poll_packed(_vq, last_used_idx) :
+ virtqueue_poll_split(_vq, last_used_idx);
}
EXPORT_SYMBOL_GPL(virtqueue_poll);
@@ -852,6 +1402,83 @@ bool virtqueue_enable_cb(struct virtqueue *_vq)
}
EXPORT_SYMBOL_GPL(virtqueue_enable_cb);
+static bool virtqueue_enable_cb_delayed_split(struct virtqueue *_vq)
+{
+ struct vring_virtqueue *vq = to_vvq(_vq);
+ u16 bufs;
+
+ START_USE(vq);
+
+ /* We optimistically turn back on interrupts, then check if there was
+ * more to do. */
+ /* Depending on the VIRTIO_RING_F_USED_EVENT_IDX feature, we need to
+ * either clear the flags bit or point the event index at the next
+ * entry. Always update the event index to keep code simple. */
+ if (vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT) {
+ vq->avail_flags_shadow &= ~VRING_AVAIL_F_NO_INTERRUPT;
+ if (!vq->event)
+ vq->vring.avail->flags = cpu_to_virtio16(_vq->vdev, vq->avail_flags_shadow);
+ }
+ /* TODO: tune this threshold */
+ bufs = (u16)(vq->avail_idx_shadow - vq->last_used_idx) * 3 / 4;
+
+ virtio_store_mb(vq->weak_barriers,
+ &vring_used_event(&vq->vring),
+ cpu_to_virtio16(_vq->vdev, vq->last_used_idx + bufs));
+
+ if (unlikely((u16)(virtio16_to_cpu(_vq->vdev, vq->vring.used->idx) - vq->last_used_idx) > bufs)) {
+ END_USE(vq);
+ return false;
+ }
+
+ END_USE(vq);
+ return true;
+}
+
+static bool virtqueue_enable_cb_delayed_packed(struct virtqueue *_vq)
+{
+ struct vring_virtqueue *vq = to_vvq(_vq);
+ u16 bufs, off_wrap, used_idx, wrap_counter;
+
+ START_USE(vq);
+
+ /* We optimistically turn back on interrupts, then check if there was
+ * more to do. */
+ /* Depending on the VIRTIO_RING_F_USED_EVENT_IDX feature, we need to
+ * either clear the flags bit or point the event index at the next
+ * entry. Always update the event index to keep code simple. */
+ if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
+ vq->event_flags_shadow = vq->event ? VRING_EVENT_F_DESC:
+ VRING_EVENT_F_ENABLE;
+ vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
+ vq->event_flags_shadow);
+ }
+
+ /* TODO: tune this threshold */
+ bufs = (u16)(vq->next_avail_idx - vq->last_used_idx) * 3 / 4;
+
+ used_idx = vq->last_used_idx + bufs;
+ if (used_idx >= vq->vring_packed.num)
+ used_idx -= vq->vring_packed.num;
+
+ wrap_counter = vq->wrap_counter;
+ if (used_idx > vq->next_avail_idx)
+ wrap_counter ^= 1;
+
+ off_wrap = used_idx | (wrap_counter << 15);
+
+ virtio_store_mb(vq->weak_barriers, &vq->vring_packed.driver->off_wrap,
+ cpu_to_virtio16(_vq->vdev, off_wrap));
+
+ if (more_used_packed(vq)) {
+ END_USE(vq);
+ return false;
+ }
+
+ END_USE(vq);
+ return true;
+}
+
/**
* virtqueue_enable_cb_delayed - restart callbacks after disable_cb.
* @vq: the struct virtqueue we're talking about.
@@ -868,37 +1495,69 @@ EXPORT_SYMBOL_GPL(virtqueue_enable_cb);
bool virtqueue_enable_cb_delayed(struct virtqueue *_vq)
{
struct vring_virtqueue *vq = to_vvq(_vq);
- u16 bufs;
- START_USE(vq);
-
- /* We optimistically turn back on interrupts, then check if there was
- * more to do. */
- /* Depending on the VIRTIO_RING_F_USED_EVENT_IDX feature, we need to
- * either clear the flags bit or point the event index at the next
- * entry. Always update the event index to keep code simple. */
- if (vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT) {
- vq->avail_flags_shadow &= ~VRING_AVAIL_F_NO_INTERRUPT;
- if (!vq->event)
- vq->vring.avail->flags = cpu_to_virtio16(_vq->vdev, vq->avail_flags_shadow);
- }
- /* TODO: tune this threshold */
- bufs = (u16)(vq->avail_idx_shadow - vq->last_used_idx) * 3 / 4;
-
- virtio_store_mb(vq->weak_barriers,
- &vring_used_event(&vq->vring),
- cpu_to_virtio16(_vq->vdev, vq->last_used_idx + bufs));
-
- if (unlikely((u16)(virtio16_to_cpu(_vq->vdev, vq->vring.used->idx) - vq->last_used_idx) > bufs)) {
- END_USE(vq);
- return false;
- }
-
- END_USE(vq);
- return true;
+ return vq->packed ? virtqueue_enable_cb_delayed_packed(_vq) :
+ virtqueue_enable_cb_delayed_split(_vq);
}
EXPORT_SYMBOL_GPL(virtqueue_enable_cb_delayed);
+static void *virtqueue_detach_unused_buf_split(struct virtqueue *_vq)
+{
+ struct vring_virtqueue *vq = to_vvq(_vq);
+ unsigned int i;
+ void *buf;
+
+ START_USE(vq);
+
+ for (i = 0; i < vq->vring.num; i++) {
+ if (!vq->desc_state[i].data)
+ continue;
+ /* detach_buf clears data, so grab it now. */
+ buf = vq->desc_state[i].data;
+ detach_buf_split(vq, i, NULL);
+ vq->avail_idx_shadow--;
+ vq->vring.avail->idx = cpu_to_virtio16(_vq->vdev, vq->avail_idx_shadow);
+ END_USE(vq);
+ return buf;
+ }
+ /* That should have freed everything. */
+ BUG_ON(vq->vq.num_free != vq->vring.num);
+
+ END_USE(vq);
+ return NULL;
+}
+
+static void *virtqueue_detach_unused_buf_packed(struct virtqueue *_vq)
+{
+ struct vring_virtqueue *vq = to_vvq(_vq);
+ unsigned int i, num;
+ void *buf;
+
+ START_USE(vq);
+
+ for (i = 0; i < vq->vring_packed.num; i++) {
+ if (!vq->desc_state[i].data)
+ continue;
+ /* detach_buf clears data, so grab it now. */
+ buf = vq->desc_state[i].data;
+ num = detach_buf_packed(vq, i, NULL);
+ if (vq->next_avail_idx < num) {
+ vq->next_avail_idx = vq->vring_packed.num -
+ (num - vq->next_avail_idx);
+ vq->wrap_counter ^= 1;
+ } else {
+ vq->next_avail_idx -= num;
+ }
+ END_USE(vq);
+ return buf;
+ }
+ /* That should have freed everything. */
+ BUG_ON(vq->vq.num_free != vq->vring_packed.num);
+
+ END_USE(vq);
+ return NULL;
+}
+
/**
* virtqueue_detach_unused_buf - detach first unused buffer
* @vq: the struct virtqueue we're talking about.
@@ -910,27 +1569,9 @@ EXPORT_SYMBOL_GPL(virtqueue_enable_cb_delayed);
void *virtqueue_detach_unused_buf(struct virtqueue *_vq)
{
struct vring_virtqueue *vq = to_vvq(_vq);
- unsigned int i;
- void *buf;
- START_USE(vq);
-
- for (i = 0; i < vq->vring.num; i++) {
- if (!vq->desc_state[i].data)
- continue;
- /* detach_buf clears data, so grab it now. */
- buf = vq->desc_state[i].data;
- detach_buf(vq, i, NULL);
- vq->avail_idx_shadow--;
- vq->vring.avail->idx = cpu_to_virtio16(_vq->vdev, vq->avail_idx_shadow);
- END_USE(vq);
- return buf;
- }
- /* That should have freed everything. */
- BUG_ON(vq->vq.num_free != vq->vring.num);
-
- END_USE(vq);
- return NULL;
+ return vq->packed ? virtqueue_detach_unused_buf_packed(_vq) :
+ virtqueue_detach_unused_buf_split(_vq);
}
EXPORT_SYMBOL_GPL(virtqueue_detach_unused_buf);
@@ -955,7 +1596,8 @@ irqreturn_t vring_interrupt(int irq, void *_vq)
EXPORT_SYMBOL_GPL(vring_interrupt);
struct virtqueue *__vring_new_virtqueue(unsigned int index,
- struct vring vring,
+ union vring_union vring,
+ bool packed,
struct virtio_device *vdev,
bool weak_barriers,
bool context,
@@ -963,19 +1605,20 @@ struct virtqueue *__vring_new_virtqueue(unsigned int index,
void (*callback)(struct virtqueue *),
const char *name)
{
- unsigned int i;
+ unsigned int num, i;
struct vring_virtqueue *vq;
- vq = kmalloc(sizeof(*vq) + vring.num * sizeof(struct vring_desc_state),
+ num = packed ? vring.vring_packed.num : vring.vring_split.num;
+
+ vq = kmalloc(sizeof(*vq) + num * sizeof(struct vring_desc_state),
GFP_KERNEL);
if (!vq)
return NULL;
- vq->vring = vring;
vq->vq.callback = callback;
vq->vq.vdev = vdev;
vq->vq.name = name;
- vq->vq.num_free = vring.num;
+ vq->vq.num_free = num;
vq->vq.index = index;
vq->we_own_ring = false;
vq->queue_dma_addr = 0;
@@ -984,9 +1627,8 @@ struct virtqueue *__vring_new_virtqueue(unsigned int index,
vq->weak_barriers = weak_barriers;
vq->broken = false;
vq->last_used_idx = 0;
- vq->avail_flags_shadow = 0;
- vq->avail_idx_shadow = 0;
vq->num_added = 0;
+ vq->packed = packed;
list_add_tail(&vq->vq.list, &vdev->vqs);
#ifdef DEBUG
vq->in_use = false;
@@ -997,18 +1639,37 @@ struct virtqueue *__vring_new_virtqueue(unsigned int index,
!context;
vq->event = virtio_has_feature(vdev, VIRTIO_RING_F_EVENT_IDX);
+ if (vq->packed) {
+ vq->vring_packed = vring.vring_packed;
+ vq->next_avail_idx = 0;
+ vq->wrap_counter = 1;
+ vq->event_flags_shadow = 0;
+ } else {
+ vq->vring = vring.vring_split;
+ vq->avail_flags_shadow = 0;
+ vq->avail_idx_shadow = 0;
+
+ /* Put everything in free lists. */
+ vq->free_head = 0;
+ for (i = 0; i < num-1; i++)
+ vq->vring.desc[i].next = cpu_to_virtio16(vdev, i + 1);
+ }
+
/* No callback? Tell other side not to bother us. */
if (!callback) {
- vq->avail_flags_shadow |= VRING_AVAIL_F_NO_INTERRUPT;
- if (!vq->event)
- vq->vring.avail->flags = cpu_to_virtio16(vdev, vq->avail_flags_shadow);
+ if (packed) {
+ vq->event_flags_shadow = VRING_EVENT_F_DISABLE;
+ vq->vring_packed.driver->flags = cpu_to_virtio16(vdev,
+ vq->event_flags_shadow);
+ } else {
+ vq->avail_flags_shadow |= VRING_AVAIL_F_NO_INTERRUPT;
+ if (!vq->event)
+ vq->vring.avail->flags = cpu_to_virtio16(vdev,
+ vq->avail_flags_shadow);
+ }
}
- /* Put everything in free lists. */
- vq->free_head = 0;
- for (i = 0; i < vring.num-1; i++)
- vq->vring.desc[i].next = cpu_to_virtio16(vdev, i + 1);
- memset(vq->desc_state, 0, vring.num * sizeof(struct vring_desc_state));
+ memset(vq->desc_state, 0, num * sizeof(struct vring_desc_state));
return &vq->vq;
}
@@ -1056,6 +1717,22 @@ static void vring_free_queue(struct virtio_device *vdev, size_t size,
}
}
+static inline int
+__vring_size(unsigned int num, unsigned long align, bool packed)
+{
+ return packed ? vring_packed_size(num, align) : vring_size(num, align);
+}
+
+static inline void vring_packed_init(struct vring_packed *vr, unsigned int num,
+ void *p, unsigned long align)
+{
+ vr->num = num;
+ vr->desc = p;
+ vr->driver = (void *)(((uintptr_t)p + sizeof(struct vring_packed_desc)
+ * num + align - 1) & ~(align - 1));
+ vr->device = vr->driver + 1;
+}
+
struct virtqueue *vring_create_virtqueue(
unsigned int index,
unsigned int num,
@@ -1072,7 +1749,8 @@ struct virtqueue *vring_create_virtqueue(
void *queue = NULL;
dma_addr_t dma_addr;
size_t queue_size_in_bytes;
- struct vring vring;
+ union vring_union vring;
+ bool packed;
/* We assume num is a power of 2. */
if (num & (num - 1)) {
@@ -1080,9 +1758,13 @@ struct virtqueue *vring_create_virtqueue(
return NULL;
}
+ packed = virtio_has_feature(vdev, VIRTIO_F_RING_PACKED);
+
/* TODO: allocate each queue chunk individually */
- for (; num && vring_size(num, vring_align) > PAGE_SIZE; num /= 2) {
- queue = vring_alloc_queue(vdev, vring_size(num, vring_align),
+ for (; num && __vring_size(num, vring_align, packed) > PAGE_SIZE;
+ num /= 2) {
+ queue = vring_alloc_queue(vdev, __vring_size(num, vring_align,
+ packed),
&dma_addr,
GFP_KERNEL|__GFP_NOWARN|__GFP_ZERO);
if (queue)
@@ -1094,17 +1776,21 @@ struct virtqueue *vring_create_virtqueue(
if (!queue) {
/* Try to get a single page. You are my only hope! */
- queue = vring_alloc_queue(vdev, vring_size(num, vring_align),
+ queue = vring_alloc_queue(vdev, __vring_size(num, vring_align,
+ packed),
&dma_addr, GFP_KERNEL|__GFP_ZERO);
}
if (!queue)
return NULL;
- queue_size_in_bytes = vring_size(num, vring_align);
- vring_init(&vring, num, queue, vring_align);
+ queue_size_in_bytes = __vring_size(num, vring_align, packed);
+ if (packed)
+ vring_packed_init(&vring.vring_packed, num, queue, vring_align);
+ else
+ vring_init(&vring.vring_split, num, queue, vring_align);
- vq = __vring_new_virtqueue(index, vring, vdev, weak_barriers, context,
- notify, callback, name);
+ vq = __vring_new_virtqueue(index, vring, packed, vdev, weak_barriers,
+ context, notify, callback, name);
if (!vq) {
vring_free_queue(vdev, queue_size_in_bytes, queue,
dma_addr);
@@ -1130,10 +1816,17 @@ struct virtqueue *vring_new_virtqueue(unsigned int index,
void (*callback)(struct virtqueue *vq),
const char *name)
{
- struct vring vring;
- vring_init(&vring, num, pages, vring_align);
- return __vring_new_virtqueue(index, vring, vdev, weak_barriers, context,
- notify, callback, name);
+ union vring_union vring;
+ bool packed;
+
+ packed = virtio_has_feature(vdev, VIRTIO_F_RING_PACKED);
+ if (packed)
+ vring_packed_init(&vring.vring_packed, num, pages, vring_align);
+ else
+ vring_init(&vring.vring_split, num, pages, vring_align);
+
+ return __vring_new_virtqueue(index, vring, packed, vdev, weak_barriers,
+ context, notify, callback, name);
}
EXPORT_SYMBOL_GPL(vring_new_virtqueue);
@@ -1143,7 +1836,9 @@ void vring_del_virtqueue(struct virtqueue *_vq)
if (vq->we_own_ring) {
vring_free_queue(vq->vq.vdev, vq->queue_size_in_bytes,
- vq->vring.desc, vq->queue_dma_addr);
+ vq->packed ? (void *)vq->vring_packed.desc :
+ (void *)vq->vring.desc,
+ vq->queue_dma_addr);
}
list_del(&_vq->list);
kfree(vq);
@@ -1157,14 +1852,18 @@ void vring_transport_features(struct virtio_device *vdev)
for (i = VIRTIO_TRANSPORT_F_START; i < VIRTIO_TRANSPORT_F_END; i++) {
switch (i) {
- case VIRTIO_RING_F_INDIRECT_DESC:
+#if 0
+ case VIRTIO_RING_F_INDIRECT_DESC: // FIXME not tested yet.
break;
- case VIRTIO_RING_F_EVENT_IDX:
+ case VIRTIO_RING_F_EVENT_IDX: // FIXME probably not work.
break;
+#endif
case VIRTIO_F_VERSION_1:
break;
case VIRTIO_F_IOMMU_PLATFORM:
break;
+ case VIRTIO_F_RING_PACKED:
+ break;
default:
/* We don't understand this bit. */
__virtio_clear_bit(vdev, i);
@@ -1185,7 +1884,7 @@ unsigned int virtqueue_get_vring_size(struct virtqueue *_vq)
struct vring_virtqueue *vq = to_vvq(_vq);
- return vq->vring.num;
+ return vq->packed ? vq->vring_packed.num : vq->vring.num;
}
EXPORT_SYMBOL_GPL(virtqueue_get_vring_size);
@@ -1228,6 +1927,10 @@ dma_addr_t virtqueue_get_avail_addr(struct virtqueue *_vq)
BUG_ON(!vq->we_own_ring);
+ if (vq->packed)
+ return vq->queue_dma_addr + ((char *)vq->vring_packed.driver -
+ (char *)vq->vring_packed.desc);
+
return vq->queue_dma_addr +
((char *)vq->vring.avail - (char *)vq->vring.desc);
}
@@ -1239,11 +1942,16 @@ dma_addr_t virtqueue_get_used_addr(struct virtqueue *_vq)
BUG_ON(!vq->we_own_ring);
+ if (vq->packed)
+ return vq->queue_dma_addr + ((char *)vq->vring_packed.device -
+ (char *)vq->vring_packed.desc);
+
return vq->queue_dma_addr +
((char *)vq->vring.used - (char *)vq->vring.desc);
}
EXPORT_SYMBOL_GPL(virtqueue_get_used_addr);
+/* Only available for split ring */
const struct vring *virtqueue_get_vring(struct virtqueue *vq)
{
return &to_vvq(vq)->vring;
diff --git a/include/linux/virtio_ring.h b/include/linux/virtio_ring.h
index bbf32524ab27..a0075894ad16 100644
--- a/include/linux/virtio_ring.h
+++ b/include/linux/virtio_ring.h
@@ -60,6 +60,11 @@ static inline void virtio_store_mb(bool weak_barriers,
struct virtio_device;
struct virtqueue;
+union vring_union {
+ struct vring vring_split;
+ struct vring_packed vring_packed;
+};
+
/*
* Creates a virtqueue and allocates the descriptor ring. If
* may_reduce_num is set, then this may allocate a smaller ring than
@@ -79,7 +84,8 @@ struct virtqueue *vring_create_virtqueue(unsigned int index,
/* Creates a virtqueue with a custom layout. */
struct virtqueue *__vring_new_virtqueue(unsigned int index,
- struct vring vring,
+ union vring_union vring,
+ bool packed,
struct virtio_device *vdev,
bool weak_barriers,
bool ctx,
diff --git a/include/uapi/linux/virtio_config.h b/include/uapi/linux/virtio_config.h
index 308e2096291f..a6e392325e3a 100644
--- a/include/uapi/linux/virtio_config.h
+++ b/include/uapi/linux/virtio_config.h
@@ -49,7 +49,7 @@
* transport being used (eg. virtio_ring), the rest are per-device feature
* bits. */
#define VIRTIO_TRANSPORT_F_START 28
-#define VIRTIO_TRANSPORT_F_END 34
+#define VIRTIO_TRANSPORT_F_END 36
#ifndef VIRTIO_CONFIG_NO_LEGACY
/* Do we get callbacks when the ring is completely used, even if we've
@@ -71,4 +71,14 @@
* this is for compatibility with legacy systems.
*/
#define VIRTIO_F_IOMMU_PLATFORM 33
+
+/* This feature indicates support for the packed virtqueue layout. */
+#define VIRTIO_F_RING_PACKED 34
+
+/*
+ * This feature indicates that all buffers are used by the device
+ * in the same order in which they have been made available.
+ */
+#define VIRTIO_F_IN_ORDER 35
+
#endif /* _UAPI_LINUX_VIRTIO_CONFIG_H */
diff --git a/include/uapi/linux/virtio_ring.h b/include/uapi/linux/virtio_ring.h
index 6d5d5faa989b..735d4207c988 100644
--- a/include/uapi/linux/virtio_ring.h
+++ b/include/uapi/linux/virtio_ring.h
@@ -44,6 +44,9 @@
/* This means the buffer contains a list of buffer descriptors. */
#define VRING_DESC_F_INDIRECT 4
+#define VRING_DESC_F_AVAIL(b) ((b) << 7)
+#define VRING_DESC_F_USED(b) ((b) << 15)
+
/* The Host uses this in used->flags to advise the Guest: don't kick me when
* you add a buffer. It's unreliable, so it's simply an optimization. Guest
* will still kick if it's out of buffers. */
@@ -53,6 +56,10 @@
* optimization. */
#define VRING_AVAIL_F_NO_INTERRUPT 1
+#define VRING_EVENT_F_ENABLE 0x0
+#define VRING_EVENT_F_DISABLE 0x1
+#define VRING_EVENT_F_DESC 0x2
+
/* We support indirect buffer descriptors */
#define VIRTIO_RING_F_INDIRECT_DESC 28
@@ -171,4 +178,58 @@ static inline int vring_need_event(__u16 event_idx, __u16 new_idx, __u16 old)
return (__u16)(new_idx - event_idx - 1) < (__u16)(new_idx - old);
}
+struct vring_packed_desc_event {
+ /* __virtio16 off : 15; // Descriptor Event Offset
+ * __virtio16 wrap : 1; // Descriptor Event Wrap Counter */
+ __virtio16 off_wrap;
+ /* __virtio16 flags : 2; // Descriptor Event Flags */
+ __virtio16 flags;
+};
+
+struct vring_packed_desc {
+ /* Buffer Address. */
+ __virtio64 addr;
+ /* Buffer Length. */
+ __virtio32 len;
+ /* Buffer ID. */
+ __virtio16 id;
+ /* The flags depending on descriptor type. */
+ __virtio16 flags;
+};
+
+struct vring_packed {
+ unsigned int num;
+
+ struct vring_packed_desc *desc;
+
+ struct vring_packed_desc_event *driver;
+
+ struct vring_packed_desc_event *device;
+};
+
+/* The standard layout for the packed ring is a continuous chunk of memory
+ * which looks like this.
+ *
+ * struct vring_packed
+ * {
+ * // The actual descriptors (16 bytes each)
+ * struct vring_packed_desc desc[num];
+ *
+ * // Padding to the next align boundary.
+ * char pad[];
+ *
+ * // Driver Event Suppression
+ * struct vring_packed_desc_event driver;
+ *
+ * // Device Event Suppression
+ * struct vring_packed_desc_event device;
+ * };
+ */
+
+static inline unsigned vring_packed_size(unsigned int num, unsigned long align)
+{
+ return ((sizeof(struct vring_packed_desc) * num + align - 1)
+ & ~(align - 1)) + sizeof(struct vring_packed_desc_event) * 2;
+}
+
#endif /* _UAPI_LINUX_VIRTIO_RING_H */
--
2.11.0
^ permalink raw reply related
* [PATCH net-next 0/5] virtio-net: Add SCTP checksum offload support
From: Vladislav Yasevich @ 2018-04-02 13:40 UTC (permalink / raw)
To: netdev; +Cc: nhorman, mst, virtualization, linux-sctp
Now that we have SCTP offload capabilities in the kernel, we can add
them to virtio as well. First step is SCTP checksum.
We need a new freature in virtio to negotiate this support since
SCTP is excluded with the stardard checksum and requires a little
bit extra. This series proposes VIRTIO_NET_F_SCTP_CSUM feature bit.
As the "little bit extra", the kernel uses a new bit in the skb
(skb->csum_not_inet) to determine whether to use standard inet checksum
or the SCTP CRC32c checksum. This bit has to be communicated between
the host and the guest. This bit is carried in the vnet header.
Tap and macvtap support is added through an extra feature for the
TUNSETOFFLOAD ioctl. Additionally macvtap will no correctly
do sctp checksumming if the receive doesn't support SCTP offload.
This also turns on sctp offloading for macvlan devices.
As for the perf numbers, I am seeing about a 5% increase in vm-to-vm
and vm-to-hos throughput which is the same as manually disabling
sctp checksumming,since this is exactly what we are emulatting.
Sending outside the host, the increase about 2.5-3%.
As for GSO, the way sctp GSO is currently implemented buys us nothing
in added support to virtio. To add true GSO, would require a lot of
re-work inside of SCTP and would require extensions to the virtio
net header to carry extra sctp data.
Vladislav Yasevich (5):
virtio: Add support for SCTP checksum offloading
sctp: Handle sctp packets with CHECKSUM_PARTIAL
sctp: Build sctp offload support into the base kernel
tun: Add support for SCTP checksum offload
macvlan/macvtap: Add support for SCTP checksum offload.
drivers/net/macvlan.c | 5 +++--
drivers/net/tap.c | 8 +++++---
drivers/net/tun.c | 5 +++++
drivers/net/virtio_net.c | 10 +++++++---
include/linux/virtio_net.h | 6 ++++++
include/net/sctp/sctp.h | 5 -----
include/uapi/linux/if_tun.h | 1 +
include/uapi/linux/virtio_net.h | 2 ++
net/Kconfig | 1 +
net/sctp/Kconfig | 1 -
net/sctp/Makefile | 3 ++-
net/sctp/input.c | 11 ++++++++++-
net/sctp/offload.c | 4 +++-
net/sctp/protocol.c | 3 ---
14 files changed, 45 insertions(+), 20 deletions(-)
--
2.9.5
^ permalink raw reply
* [PATCH net-next 1/5] virtio: Add support for SCTP checksum offloading
From: Vladislav Yasevich @ 2018-04-02 13:40 UTC (permalink / raw)
To: netdev; +Cc: nhorman, mst, virtualization, linux-sctp
In-Reply-To: <20180402134006.10111-1-vyasevic@redhat.com>
To support SCTP checksum offloading, we need to add a new feature
to virtio_net, so we can negotiate support between the hypervisor
and the guest.
The signalling to the guest that an alternate checksum needs to
be used is done via a new flag in the virtio_net_hdr. If the
flag is set, the host will know to perform an alternate checksum
calculation, which right now is only CRC32c.
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
---
drivers/net/virtio_net.c | 11 ++++++++---
include/linux/virtio_net.h | 6 ++++++
include/uapi/linux/virtio_net.h | 2 ++
3 files changed, 16 insertions(+), 3 deletions(-)
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 7b187ec..b601294 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2724,9 +2724,14 @@ static int virtnet_probe(struct virtio_device *vdev)
/* Do we support "hardware" checksums? */
if (virtio_has_feature(vdev, VIRTIO_NET_F_CSUM)) {
/* This opens up the world of extra features. */
- dev->hw_features |= NETIF_F_HW_CSUM | NETIF_F_SG;
+ netdev_features_t sctp = 0;
+
+ if (virtio_has_feature(vdev, VIRTIO_NET_F_SCTP_CSUM))
+ sctp |= NETIF_F_SCTP_CRC;
+
+ dev->hw_features |= NETIF_F_HW_CSUM | NETIF_F_SG | sctp;
if (csum)
- dev->features |= NETIF_F_HW_CSUM | NETIF_F_SG;
+ dev->features |= NETIF_F_HW_CSUM | NETIF_F_SG | sctp;
if (virtio_has_feature(vdev, VIRTIO_NET_F_GSO)) {
dev->hw_features |= NETIF_F_TSO
@@ -2952,7 +2957,7 @@ static struct virtio_device_id id_table[] = {
};
#define VIRTNET_FEATURES \
- VIRTIO_NET_F_CSUM, VIRTIO_NET_F_GUEST_CSUM, \
+ VIRTIO_NET_F_CSUM, VIRTIO_NET_F_GUEST_CSUM, VIRTIO_NET_F_SCTP_CSUM, \
VIRTIO_NET_F_MAC, \
VIRTIO_NET_F_HOST_TSO4, VIRTIO_NET_F_HOST_UFO, VIRTIO_NET_F_HOST_TSO6, \
VIRTIO_NET_F_HOST_ECN, VIRTIO_NET_F_GUEST_TSO4, VIRTIO_NET_F_GUEST_TSO6, \
diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
index f144216..2e7a64a 100644
--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -39,6 +39,9 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff *skb,
if (!skb_partial_csum_set(skb, start, off))
return -EINVAL;
+
+ if (hdr->flags & VIRTIO_NET_HDR_F_CSUM_NOT_INET)
+ skb->csum_not_inet = 1;
}
if (hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
@@ -96,6 +99,9 @@ static inline int virtio_net_hdr_from_skb(const struct sk_buff *skb,
hdr->flags = VIRTIO_NET_HDR_F_DATA_VALID;
} /* else everything is zero */
+ if (skb->csum_not_inet)
+ hdr->flags &= VIRTIO_NET_HDR_F_CSUM_NOT_INET;
+
return 0;
}
diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
index 5de6ed3..3f279c8 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -36,6 +36,7 @@
#define VIRTIO_NET_F_GUEST_CSUM 1 /* Guest handles pkts w/ partial csum */
#define VIRTIO_NET_F_CTRL_GUEST_OFFLOADS 2 /* Dynamic offload configuration. */
#define VIRTIO_NET_F_MTU 3 /* Initial MTU advice */
+#define VIRTIO_NET_F_SCTP_CSUM 4 /* SCTP checksum offload support */
#define VIRTIO_NET_F_MAC 5 /* Host has given MAC address. */
#define VIRTIO_NET_F_GUEST_TSO4 7 /* Guest can handle TSOv4 in. */
#define VIRTIO_NET_F_GUEST_TSO6 8 /* Guest can handle TSOv6 in. */
@@ -101,6 +102,7 @@ struct virtio_net_config {
struct virtio_net_hdr_v1 {
#define VIRTIO_NET_HDR_F_NEEDS_CSUM 1 /* Use csum_start, csum_offset */
#define VIRTIO_NET_HDR_F_DATA_VALID 2 /* Csum is valid */
+#define VIRTIO_NET_HDR_F_CSUM_NOT_INET 4 /* Checksum is not inet */
__u8 flags;
#define VIRTIO_NET_HDR_GSO_NONE 0 /* Not a GSO frame */
#define VIRTIO_NET_HDR_GSO_TCPV4 1 /* GSO frame, IPv4 TCP (TSO) */
--
2.9.5
^ permalink raw reply related
* [PATCH net-next 2/5] sctp: Handle sctp packets with CHECKSUM_PARTIAL
From: Vladislav Yasevich @ 2018-04-02 13:40 UTC (permalink / raw)
To: netdev; +Cc: nhorman, mst, virtualization, linux-sctp
In-Reply-To: <20180402134006.10111-1-vyasevic@redhat.com>
With SCTP checksum offload available in virtio, it is now
possible for virtio to receive a sctp packet with CHECKSUM_PARTIAL
set (guest-to-guest traffic). SCTP doesn't really have a
partial checksum like TCP does because CRC32c can't do partial
additive checksumming. It's all or nothing. So an SCTP packet
with CHECKSUM_PARTIAL will have checksum set to 0. Let SCTP
treat this as a valid checksum if CHECKSUM_PARTIAL is set.
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
---
net/sctp/input.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/net/sctp/input.c b/net/sctp/input.c
index ba8a6e6..055b8ffa 100644
--- a/net/sctp/input.c
+++ b/net/sctp/input.c
@@ -80,8 +80,17 @@ static inline int sctp_rcv_checksum(struct net *net, struct sk_buff *skb)
{
struct sctphdr *sh = sctp_hdr(skb);
__le32 cmp = sh->checksum;
- __le32 val = sctp_compute_cksum(skb, 0);
+ __le32 val = 0;
+ /* In sctp PARTIAL checksum is always 0. This is a case of
+ * a packet received from guest that supports checksum offload.
+ * Assume it's correct as there is really no way to verify,
+ * and we want to avaoid computing it unnecesarily.
+ */
+ if (skb->ip_summed == CHECKSUM_PARTIAL)
+ return 0;
+
+ val = sctp_compute_cksum(skb, 0);
if (val != cmp) {
/* CRC failure, dump it. */
__SCTP_INC_STATS(net, SCTP_MIB_CHECKSUMERRORS);
--
2.9.5
^ permalink raw reply related
* [PATCH net-next 3/5] sctp: Build sctp offload support into the base kernel
From: Vladislav Yasevich @ 2018-04-02 13:40 UTC (permalink / raw)
To: netdev; +Cc: nhorman, mst, virtualization, linux-sctp
In-Reply-To: <20180402134006.10111-1-vyasevic@redhat.com>
We need to take the sctp offload out of the sctp module
and add it to the base kernel to support guests that can
support it. This is similar to how IPv6 offloads are done
and works around kernels that exclude sctp protocol support.
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
---
include/net/sctp/sctp.h | 5 -----
net/Kconfig | 1 +
net/sctp/Kconfig | 1 -
net/sctp/Makefile | 3 ++-
net/sctp/offload.c | 4 +++-
net/sctp/protocol.c | 3 ---
6 files changed, 6 insertions(+), 11 deletions(-)
diff --git a/include/net/sctp/sctp.h b/include/net/sctp/sctp.h
index 72c5b8f..625b45f 100644
--- a/include/net/sctp/sctp.h
+++ b/include/net/sctp/sctp.h
@@ -183,11 +183,6 @@ struct sctp_transport *sctp_epaddr_lookup_transport(
int __net_init sctp_proc_init(struct net *net);
/*
- * sctp/offload.c
- */
-int sctp_offload_init(void);
-
-/*
* sctp/stream_sched.c
*/
void sctp_sched_ops_init(void);
diff --git a/net/Kconfig b/net/Kconfig
index 0428f12..2773f98 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -64,6 +64,7 @@ config INET
bool "TCP/IP networking"
select CRYPTO
select CRYPTO_AES
+ select LIBCRC32C
---help---
These are the protocols used on the Internet and on most local
Ethernets. It is highly recommended to say Y here (this will enlarge
diff --git a/net/sctp/Kconfig b/net/sctp/Kconfig
index c740b18..d07477a 100644
--- a/net/sctp/Kconfig
+++ b/net/sctp/Kconfig
@@ -9,7 +9,6 @@ menuconfig IP_SCTP
select CRYPTO
select CRYPTO_HMAC
select CRYPTO_SHA1
- select LIBCRC32C
---help---
Stream Control Transmission Protocol
diff --git a/net/sctp/Makefile b/net/sctp/Makefile
index e845e45..ee206ca 100644
--- a/net/sctp/Makefile
+++ b/net/sctp/Makefile
@@ -5,6 +5,7 @@
obj-$(CONFIG_IP_SCTP) += sctp.o
obj-$(CONFIG_INET_SCTP_DIAG) += sctp_diag.o
+obj-$(CONFIG_INET) += offload.o
sctp-y := sm_statetable.o sm_statefuns.o sm_sideeffect.o \
protocol.o endpointola.o associola.o \
@@ -12,7 +13,7 @@ sctp-y := sm_statetable.o sm_statefuns.o sm_sideeffect.o \
inqueue.o outqueue.o ulpqueue.o \
tsnmap.o bind_addr.o socket.o primitive.o \
output.o input.o debug.o stream.o auth.o \
- offload.o stream_sched.o stream_sched_prio.o \
+ stream_sched.o stream_sched_prio.o \
stream_sched_rr.o stream_interleave.o
sctp_diag-y := diag.o
diff --git a/net/sctp/offload.c b/net/sctp/offload.c
index 123e9f2..c61cbde 100644
--- a/net/sctp/offload.c
+++ b/net/sctp/offload.c
@@ -107,7 +107,7 @@ static const struct skb_checksum_ops crc32c_csum_ops = {
.combine = sctp_csum_combine,
};
-int __init sctp_offload_init(void)
+static int __init sctp_offload_init(void)
{
int ret;
@@ -127,3 +127,5 @@ int __init sctp_offload_init(void)
out:
return ret;
}
+
+fs_initcall(sctp_offload_init);
diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
index a24cde2..46d2b63 100644
--- a/net/sctp/protocol.c
+++ b/net/sctp/protocol.c
@@ -1479,9 +1479,6 @@ static __init int sctp_init(void)
if (status)
goto err_v6_add_protocol;
- if (sctp_offload_init() < 0)
- pr_crit("%s: Cannot add SCTP protocol offload\n", __func__);
-
out:
return status;
err_v6_add_protocol:
--
2.9.5
^ permalink raw reply related
* [PATCH net-next 4/5] tun: Add support for SCTP checksum offload
From: Vladislav Yasevich @ 2018-04-02 13:40 UTC (permalink / raw)
To: netdev; +Cc: nhorman, mst, virtualization, linux-sctp
In-Reply-To: <20180402134006.10111-1-vyasevic@redhat.com>
Adds a new tun offload flag to allow for SCTP checksum offload.
The flag has to be set by the user and defaults to "no offload".
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
---
drivers/net/tun.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index a1ba262..263bcbe 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2719,6 +2719,11 @@ static int set_offload(struct tun_struct *tun, unsigned long arg)
arg &= ~TUN_F_UFO;
}
+ if (arg & TUN_F_SCTP_CSUM) {
+ features |= NETIF_F_SCTP_CRC;
+ arg &= ~TUN_F_SCTP_CSUM;
+ }
+
/* This gives the user a way to test for new features in future by
* trying to set them. */
if (arg)
--
2.9.5
^ permalink raw reply related
* [PATCH net-next 5/5] macvlan/macvtap: Add support for SCTP checksum offload.
From: Vladislav Yasevich @ 2018-04-02 13:40 UTC (permalink / raw)
To: netdev; +Cc: nhorman, mst, virtualization, linux-sctp
In-Reply-To: <20180402134006.10111-1-vyasevic@redhat.com>
Since we now have support for software CRC32c offload, turn it on
for macvlan and macvtap devices so that guests can take advantage
of offload SCTP checksums to the host or host hardware.
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
---
drivers/net/macvlan.c | 5 +++--
drivers/net/tap.c | 8 +++++---
include/uapi/linux/if_tun.h | 1 +
3 files changed, 9 insertions(+), 5 deletions(-)
diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index 725f4b4..646b730 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -834,7 +834,7 @@ static struct lock_class_key macvlan_netdev_addr_lock_key;
#define ALWAYS_ON_OFFLOADS \
(NETIF_F_SG | NETIF_F_HW_CSUM | NETIF_F_GSO_SOFTWARE | \
- NETIF_F_GSO_ROBUST | NETIF_F_GSO_ENCAP_ALL)
+ NETIF_F_GSO_ROBUST | NETIF_F_GSO_ENCAP_ALL | NETIF_F_SCTP_CRC)
#define ALWAYS_ON_FEATURES (ALWAYS_ON_OFFLOADS | NETIF_F_LLTX)
@@ -842,7 +842,8 @@ static struct lock_class_key macvlan_netdev_addr_lock_key;
(NETIF_F_SG | NETIF_F_HW_CSUM | NETIF_F_HIGHDMA | NETIF_F_FRAGLIST | \
NETIF_F_GSO | NETIF_F_TSO | NETIF_F_LRO | \
NETIF_F_TSO_ECN | NETIF_F_TSO6 | NETIF_F_GRO | NETIF_F_RXCSUM | \
- NETIF_F_HW_VLAN_CTAG_FILTER | NETIF_F_HW_VLAN_STAG_FILTER)
+ NETIF_F_HW_VLAN_CTAG_FILTER | NETIF_F_HW_VLAN_STAG_FILTER | \
+ NETIF_F_SCTP_CRC)
#define MACVLAN_STATE_MASK \
((1<<__LINK_STATE_NOCARRIER) | (1<<__LINK_STATE_DORMANT))
diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 9b6cb78..2c8512b 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -369,8 +369,7 @@ rx_handler_result_t tap_handle_frame(struct sk_buff **pskb)
* check, we either support them all or none.
*/
if (skb->ip_summed == CHECKSUM_PARTIAL &&
- !(features & NETIF_F_CSUM_MASK) &&
- skb_checksum_help(skb))
+ skb_csum_hwoffload_help(skb, features))
goto drop;
if (ptr_ring_produce(&q->ring, skb))
goto drop;
@@ -945,6 +944,9 @@ static int set_offload(struct tap_queue *q, unsigned long arg)
}
}
+ if (arg & TUN_F_SCTP_CSUM)
+ feature_mask |= NETIF_F_SCTP_CRC;
+
/* tun/tap driver inverts the usage for TSO offloads, where
* setting the TSO bit means that the userspace wants to
* accept TSO frames and turning it off means that user space
@@ -1077,7 +1079,7 @@ static long tap_ioctl(struct file *file, unsigned int cmd,
case TUNSETOFFLOAD:
/* let the user check for future flags */
if (arg & ~(TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6 |
- TUN_F_TSO_ECN | TUN_F_UFO))
+ TUN_F_TSO_ECN | TUN_F_UFO | TUN_F_SCTP_CSUM))
return -EINVAL;
rtnl_lock();
diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h
index ee432cd..c3bb282 100644
--- a/include/uapi/linux/if_tun.h
+++ b/include/uapi/linux/if_tun.h
@@ -86,6 +86,7 @@
#define TUN_F_TSO6 0x04 /* I can handle TSO for IPv6 packets */
#define TUN_F_TSO_ECN 0x08 /* I can handle TSO with ECN bits. */
#define TUN_F_UFO 0x10 /* I can handle UFO packets */
+#define TUN_F_SCTP_CSUM 0x20 /* I can handle SCTP checksum offload */
/* Protocol info prepended to the packets (when IFF_NO_PI is not set) */
#define TUN_PKT_STRIP 0x0001
--
2.9.5
^ permalink raw reply related
* Re: [PATCH net-next 0/5] virtio-net: Add SCTP checksum offload support
From: David Miller @ 2018-04-02 14:16 UTC (permalink / raw)
To: vyasevich; +Cc: nhorman, mst, netdev, virtualization, linux-sctp
In-Reply-To: <20180402134006.10111-1-vyasevic@redhat.com>
From: Vladislav Yasevich <vyasevich@gmail.com>
Date: Mon, 2 Apr 2018 09:40:01 -0400
> Now that we have SCTP offload capabilities in the kernel, we can add
> them to virtio as well. First step is SCTP checksum.
Vlad, the net-next tree is closed, please resubmit this when the merge window
is over and the net-next tree opens back up.
Thank you.
^ permalink raw reply
* [RFC] vhost: introduce mdev based hardware vhost backend
From: Tiwei Bie @ 2018-04-02 15:23 UTC (permalink / raw)
To: mst, jasowang, alex.williamson, ddutile, alexander.h.duyck
Cc: jianfeng.tan, virtio-dev, kvm, netdev, linux-kernel,
virtualization, xiao.w.wang, zhihong.wang
This patch introduces a mdev (mediated device) based hardware
vhost backend. This backend is an abstraction of the various
hardware vhost accelerators (potentially any device that uses
virtio ring can be used as a vhost accelerator). Some generic
mdev parent ops are provided for accelerator drivers to support
generating mdev instances.
What's this
===========
The idea is that we can setup a virtio ring compatible device
with the messages available at the vhost-backend. Originally,
these messages are used to implement a software vhost backend,
but now we will use these messages to setup a virtio ring
compatible hardware device. Then the hardware device will be
able to work with the guest virtio driver in the VM just like
what the software backend does. That is to say, we can implement
a hardware based vhost backend in QEMU, and any virtio ring
compatible devices potentially can be used with this backend.
(We also call it vDPA -- vhost Data Path Acceleration).
One problem is that, different virtio ring compatible devices
may have different device interfaces. That is to say, we will
need different drivers in QEMU. It could be troublesome. And
that's what this patch trying to fix. The idea behind this
patch is very simple: mdev is a standard way to emulate device
in kernel. So we defined a standard device based on mdev, which
is able to accept vhost messages. When the mdev emulation code
(i.e. the generic mdev parent ops provided by this patch) gets
vhost messages, it will parse and deliver them to accelerator
drivers. Drivers can use these messages to setup accelerators.
That is to say, the generic mdev parent ops (e.g. read()/write()/
ioctl()/...) will be provided for accelerator drivers to register
accelerators as mdev parent devices. And each accelerator device
will support generating standard mdev instance(s).
With this standard device interface, we will be able to just
develop one userspace driver to implement the hardware based
vhost backend in QEMU.
Difference between vDPA and PCI passthru
========================================
The key difference between vDPA and PCI passthru is that, in
vDPA only the data path of the device (e.g. DMA ring, notify
region and queue interrupt) is pass-throughed to the VM, the
device control path (e.g. PCI configuration space and MMIO
regions) is still defined and emulated by QEMU.
The benefits of keeping virtio device emulation in QEMU compared
with virtio device PCI passthru include (but not limit to):
- consistent device interface for guest OS in the VM;
- max flexibility on the hardware design, especially the
accelerator for each vhost backend doesn't have to be a
full PCI device;
- leveraging the existing virtio live-migration framework;
The interface of this mdev based device
=======================================
1. BAR0
The MMIO region described by BAR0 is the main control
interface. Messages will be written to or read from
this region.
The message type is determined by the `request` field
in message header. The message size is encoded in the
message header too. The message format looks like this:
struct vhost_vfio_op {
__u64 request;
__u32 flags;
/* Flag values: */
#define VHOST_VFIO_NEED_REPLY 0x1 /* Whether need reply */
__u32 size;
union {
__u64 u64;
struct vhost_vring_state state;
struct vhost_vring_addr addr;
struct vhost_memory memory;
} payload;
};
The existing vhost-kernel ioctl cmds are reused as
the message requests in above structure.
Each message will be written to or read from this
region at offset 0:
int vhost_vfio_write(struct vhost_dev *dev, struct vhost_vfio_op *op)
{
int count = VHOST_VFIO_OP_HDR_SIZE + op->size;
struct vhost_vfio *vfio = dev->opaque;
int ret;
ret = pwrite64(vfio->device_fd, op, count, vfio->bar0_offset);
if (ret != count)
return -1;
return 0;
}
int vhost_vfio_read(struct vhost_dev *dev, struct vhost_vfio_op *op)
{
int count = VHOST_VFIO_OP_HDR_SIZE + op->size;
struct vhost_vfio *vfio = dev->opaque;
uint64_t request = op->request;
int ret;
ret = pread64(vfio->device_fd, op, count, vfio->bar0_offset);
if (ret != count || request != op->request)
return -1;
return 0;
}
It's quite straightforward to set things to the device.
Just need to write the message to device directly:
int vhost_vfio_set_features(struct vhost_dev *dev, uint64_t features)
{
struct vhost_vfio_op op;
op.request = VHOST_SET_FEATURES;
op.flags = 0;
op.size = sizeof(features);
op.payload.u64 = features;
return vhost_vfio_write(dev, &op);
}
To get things from the device, two steps are needed.
Take VHOST_GET_FEATURE as an example:
int vhost_vfio_get_features(struct vhost_dev *dev, uint64_t *features)
{
struct vhost_vfio_op op;
int ret;
op.request = VHOST_GET_FEATURES;
op.flags = VHOST_VFIO_NEED_REPLY;
op.size = 0;
/* Just need to write the header */
ret = vhost_vfio_write(dev, &op);
if (ret != 0)
goto out;
/* `op` wasn't changed during write */
op.flags = 0;
op.size = sizeof(*features);
ret = vhost_vfio_read(dev, &op);
if (ret != 0)
goto out;
*features = op.payload.u64;
out:
return ret;
}
2. BAR1 (mmap-able)
The MMIO region described by BAR1 will be used to notify the
device.
Each queue will has a page for notification, and it can be
mapped to VM (if hardware also supports), and the virtio
driver in the VM will be able to notify the device directly.
The MMIO region described by BAR1 is also write-able. If the
accelerator's notification register(s) cannot be mapped to the
VM, write() can also be used to notify the device. Something
like this:
void notify_relay(void *opaque)
{
......
offset = 0x1000 * queue_idx; /* XXX assume page size is 4K here. */
ret = pwrite64(vfio->device_fd, &queue_idx, sizeof(queue_idx),
vfio->bar1_offset + offset);
......
}
Other BARs are reserved.
3. VFIO interrupt ioctl API
VFIO interrupt ioctl API is used to setup device interrupts.
IRQ-bypass will also be supported.
Currently, only VFIO_PCI_MSIX_IRQ_INDEX is supported.
The API for drivers to provide mdev instances
=============================================
The read()/write()/ioctl()/mmap()/open()/release() mdev
parent ops have been provided for accelerators' drivers
to provide mdev instances.
ssize_t vdpa_read(struct mdev_device *mdev, char __user *buf,
size_t count, loff_t *ppos);
ssize_t vdpa_write(struct mdev_device *mdev, const char __user *buf,
size_t count, loff_t *ppos);
long vdpa_ioctl(struct mdev_device *mdev, unsigned int cmd, unsigned long arg);
int vdpa_mmap(struct mdev_device *mdev, struct vm_area_struct *vma);
int vdpa_open(struct mdev_device *mdev);
void vdpa_close(struct mdev_device *mdev);
Each accelerator driver just needs to implement its own
create()/remove() ops, and provide a vdpa device ops
which will be called by the generic mdev emulation code.
Currently, the vdpa device ops are defined as:
typedef int (*vdpa_start_device_t)(struct vdpa_dev *vdpa);
typedef int (*vdpa_stop_device_t)(struct vdpa_dev *vdpa);
typedef int (*vdpa_dma_map_t)(struct vdpa_dev *vdpa);
typedef int (*vdpa_dma_unmap_t)(struct vdpa_dev *vdpa);
typedef int (*vdpa_set_eventfd_t)(struct vdpa_dev *vdpa, int vector, int fd);
typedef u64 (*vdpa_supported_features_t)(struct vdpa_dev *vdpa);
typedef void (*vdpa_notify_device_t)(struct vdpa_dev *vdpa, int qid);
typedef u64 (*vdpa_get_notify_addr_t)(struct vdpa_dev *vdpa, int qid);
struct vdpa_device_ops {
vdpa_start_device_t start;
vdpa_stop_device_t stop;
vdpa_dma_map_t dma_map;
vdpa_dma_unmap_t dma_unmap;
vdpa_set_eventfd_t set_eventfd;
vdpa_supported_features_t supported_features;
vdpa_notify_device_t notify;
vdpa_get_notify_addr_t get_notify_addr;
};
struct vdpa_dev {
struct mdev_device *mdev;
struct mutex ops_lock;
u8 vconfig[VDPA_CONFIG_SIZE];
int nr_vring;
u64 features;
u64 state;
struct vhost_memory *mem_table;
bool pending_reply;
struct vhost_vfio_op pending;
const struct vdpa_device_ops *ops;
void *private;
int max_vrings;
struct vdpa_vring_info vring_info[0];
};
struct vdpa_dev *vdpa_alloc(struct mdev_device *mdev, void *private,
int max_vrings);
void vdpa_free(struct vdpa_dev *vdpa);
A simple example
================
# Query the number of available mdev instances
$ cat /sys/class/mdev_bus/0000:06:00.2/mdev_supported_types/ifcvf_vdpa-vdpa_virtio/available_instances
# Create a mdev instance
$ echo $UUID > /sys/class/mdev_bus/0000:06:00.2/mdev_supported_types/ifcvf_vdpa-vdpa_virtio/create
# Launch QEMU with a virtio-net device
$ qemu \
...... \
-netdev type=vhost-vfio,sysfsdev=/sys/bus/mdev/devices/$UUID,id=$ID \
-device virtio-net-pci,netdev=$ID
-------- END --------
Most of above words will be refined and moved to a doc in
the formal patch. In this RFC, all introductions and code
are gathered in this patch, the idea is to make it easier
to find all the relevant information. Anyone who wants to
comment could use inline comment and just keep the relevant
parts. Sorry for the big RFC patch..
This patch is just a RFC for now, and something is still
missing or needs to be refined. But it's never too early
to hear the thoughts from the community. So any comments
would be appreciated! Thanks! :-)
Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
---
drivers/vhost/Makefile | 3 +
drivers/vhost/vdpa.c | 805 +++++++++++++++++++++++++++++++++++++++++++++
include/linux/vdpa_mdev.h | 76 +++++
include/uapi/linux/vhost.h | 26 ++
4 files changed, 910 insertions(+)
create mode 100644 drivers/vhost/vdpa.c
create mode 100644 include/linux/vdpa_mdev.h
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index 6c6df24f770c..7d185e083140 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -11,3 +11,6 @@ vhost_vsock-y := vsock.o
obj-$(CONFIG_VHOST_RING) += vringh.o
obj-$(CONFIG_VHOST) += vhost.o
+
+obj-m += vhost_vdpa.o # FIXME: add an option
+vhost_vdpa-y := vdpa.o
diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
new file mode 100644
index 000000000000..aa19c266ea19
--- /dev/null
+++ b/drivers/vhost/vdpa.c
@@ -0,0 +1,805 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2018 Intel Corporation.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/vfio.h>
+#include <linux/vhost.h>
+#include <linux/mdev.h>
+#include <linux/vdpa_mdev.h>
+
+#define VDPA_BAR0_SIZE 0x1000000 // TBD
+
+#define VDPA_VFIO_PCI_OFFSET_SHIFT 40
+#define VDPA_VFIO_PCI_OFFSET_MASK \
+ ((1ULL << VDPA_VFIO_PCI_OFFSET_SHIFT) - 1)
+#define VDPA_VFIO_PCI_OFFSET_TO_INDEX(offset) \
+ ((offset) >> VDPA_VFIO_PCI_OFFSET_SHIFT)
+#define VDPA_VFIO_PCI_INDEX_TO_OFFSET(index) \
+ ((u64)(index) << VDPA_VFIO_PCI_OFFSET_SHIFT)
+#define VDPA_VFIO_PCI_BAR_OFFSET(offset) \
+ ((offset) & VDPA_VFIO_PCI_OFFSET_MASK)
+
+#define STORE_LE16(addr, val) (*(u16 *)(addr) = cpu_to_le16(val))
+#define STORE_LE32(addr, val) (*(u32 *)(addr) = cpu_to_le32(val))
+
+static void vdpa_create_config_space(struct vdpa_dev *vdpa)
+{
+ /* PCI device ID / vendor ID */
+ STORE_LE32(&vdpa->vconfig[0x0], 0xffffffff); // FIXME TBD
+
+ /* Programming interface class */
+ vdpa->vconfig[0x9] = 0x00;
+
+ /* Sub class */
+ vdpa->vconfig[0xa] = 0x00;
+
+ /* Base class */
+ vdpa->vconfig[0xb] = 0x02;
+
+ // FIXME TBD
+}
+
+struct vdpa_dev *vdpa_alloc(struct mdev_device *mdev, void *private,
+ int max_vrings)
+{
+ struct vdpa_dev *vdpa;
+ size_t size;
+
+ size = sizeof(struct vdpa_dev) + max_vrings *
+ sizeof(struct vdpa_vring_info);
+
+ vdpa = kzalloc(size, GFP_KERNEL);
+ if (vdpa == NULL)
+ return NULL;
+
+ mutex_init(&vdpa->ops_lock);
+
+ vdpa->mdev = mdev;
+ vdpa->private = private;
+ vdpa->max_vrings = max_vrings;
+
+ vdpa_create_config_space(vdpa);
+
+ return vdpa;
+}
+EXPORT_SYMBOL(vdpa_alloc);
+
+void vdpa_free(struct vdpa_dev *vdpa)
+{
+ struct mdev_device *mdev;
+
+ mdev = vdpa->mdev;
+
+ vdpa->ops->stop(vdpa);
+ vdpa->ops->dma_unmap(vdpa);
+
+ mdev_set_drvdata(mdev, NULL);
+
+ mutex_destroy(&vdpa->ops_lock);
+
+ kfree(vdpa->mem_table);
+ kfree(vdpa);
+}
+EXPORT_SYMBOL(vdpa_free);
+
+static ssize_t vdpa_handle_pcicfg_read(struct mdev_device *mdev,
+ char __user *buf, size_t count, loff_t *ppos)
+{
+ struct vdpa_dev *vdpa;
+ loff_t pos = *ppos;
+ loff_t offset;
+
+ vdpa = mdev_get_drvdata(mdev);
+ if (!vdpa)
+ return -ENODEV;
+
+ offset = VDPA_VFIO_PCI_BAR_OFFSET(pos);
+
+ if (count + offset > VDPA_CONFIG_SIZE)
+ return -EINVAL;
+
+ if (copy_to_user(buf, (vdpa->vconfig + offset), count))
+ return -EFAULT;
+
+ return count;
+}
+
+static ssize_t vdpa_handle_bar0_read(struct mdev_device *mdev,
+ char __user *buf, size_t count, loff_t *ppos)
+{
+ struct vdpa_dev *vdpa;
+ struct vhost_vfio_op *op = NULL;
+ loff_t pos = *ppos;
+ loff_t offset;
+ int ret;
+
+ vdpa = mdev_get_drvdata(mdev);
+ if (!vdpa) {
+ ret = -ENODEV;
+ goto out;
+ }
+
+ offset = VDPA_VFIO_PCI_BAR_OFFSET(pos);
+ if (offset != 0) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ if (!vdpa->pending_reply) {
+ ret = 0;
+ goto out;
+ }
+
+ vdpa->pending_reply = false;
+
+ op = kzalloc(VHOST_VFIO_OP_HDR_SIZE + VHOST_VFIO_OP_PAYLOAD_MAX_SIZE,
+ GFP_KERNEL);
+ if (op == NULL) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ op->request = vdpa->pending.request;
+
+ switch (op->request) {
+ case VHOST_GET_VRING_BASE:
+ op->payload.state = vdpa->pending.payload.state;
+ op->size = sizeof(op->payload.state);
+ break;
+ case VHOST_GET_FEATURES:
+ op->payload.u64 = vdpa->pending.payload.u64;
+ op->size = sizeof(op->payload.u64);
+ break;
+ default:
+ ret = -EINVAL;
+ goto out_free;
+ }
+
+ if (op->size + VHOST_VFIO_OP_HDR_SIZE != count) {
+ ret = -EINVAL;
+ goto out_free;
+ }
+
+ if (copy_to_user(buf, op, count)) {
+ ret = -EFAULT;
+ goto out_free;
+ }
+
+ ret = count;
+
+out_free:
+ kfree(op);
+out:
+ return ret;
+}
+
+ssize_t vdpa_read(struct mdev_device *mdev, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ int done = 0;
+ unsigned int index;
+ loff_t pos = *ppos;
+ struct vdpa_dev *vdpa;
+
+ vdpa = mdev_get_drvdata(mdev);
+ if (!vdpa)
+ return -ENODEV;
+
+ mutex_lock(&vdpa->ops_lock);
+
+ index = VDPA_VFIO_PCI_OFFSET_TO_INDEX(pos);
+
+ switch (index) {
+ case VFIO_PCI_CONFIG_REGION_INDEX:
+ done = vdpa_handle_pcicfg_read(mdev, buf, count, ppos);
+ break;
+ case VFIO_PCI_BAR0_REGION_INDEX:
+ done = vdpa_handle_bar0_read(mdev, buf, count, ppos);
+ break;
+ }
+
+ if (done > 0)
+ *ppos += done;
+
+ mutex_unlock(&vdpa->ops_lock);
+
+ return done;
+}
+EXPORT_SYMBOL(vdpa_read);
+
+static ssize_t vdpa_handle_pcicfg_write(struct mdev_device *mdev,
+ const char __user *buf, size_t count, loff_t *ppos)
+{
+ return count;
+}
+
+static int vhost_set_mem_table(struct mdev_device *mdev,
+ struct vhost_memory *mem)
+{
+ struct vdpa_dev *vdpa;
+ struct vhost_memory *mem_table;
+ size_t size;
+
+ vdpa = mdev_get_drvdata(mdev);
+ if (!vdpa)
+ return -ENODEV;
+
+ // FIXME fix this
+ if (vdpa->state != VHOST_DEVICE_S_STOPPED)
+ return -EBUSY;
+
+ size = sizeof(*mem) + mem->nregions * sizeof(*mem->regions);
+
+ mem_table = kzalloc(size, GFP_KERNEL);
+ if (mem_table == NULL)
+ return -ENOMEM;
+
+ memcpy(mem_table, mem, size);
+
+ kfree(vdpa->mem_table);
+
+ vdpa->mem_table = mem_table;
+
+ vdpa->ops->dma_unmap(vdpa);
+ vdpa->ops->dma_map(vdpa);
+
+ return 0;
+}
+
+static int vhost_set_vring_addr(struct mdev_device *mdev,
+ struct vhost_vring_addr *addr)
+{
+ struct vdpa_dev *vdpa;
+ int qid = addr->index;
+ struct vdpa_vring_info *vring;
+
+ vdpa = mdev_get_drvdata(mdev);
+ if (!vdpa)
+ return -ENODEV;
+
+ if (qid >= vdpa->max_vrings)
+ return -EINVAL;
+
+ /* FIXME to be fixed */
+ if (qid >= vdpa->nr_vring)
+ vdpa->nr_vring = qid + 1;
+
+ vring = &vdpa->vring_info[qid];
+
+ vring->desc_user_addr = addr->desc_user_addr;
+ vring->used_user_addr = addr->used_user_addr;
+ vring->avail_user_addr = addr->avail_user_addr;
+ vring->log_guest_addr = addr->log_guest_addr;
+
+ return 0;
+}
+
+static int vhost_set_vring_num(struct mdev_device *mdev,
+ struct vhost_vring_state *num)
+{
+ struct vdpa_dev *vdpa;
+ int qid = num->index;
+ struct vdpa_vring_info *vring;
+
+ vdpa = mdev_get_drvdata(mdev);
+ if (!vdpa)
+ return -ENODEV;
+
+ if (qid >= vdpa->max_vrings)
+ return -EINVAL;
+
+ vring = &vdpa->vring_info[qid];
+
+ vring->size = num->num;
+
+ return 0;
+}
+
+static int vhost_set_vring_base(struct mdev_device *mdev,
+ struct vhost_vring_state *base)
+{
+ struct vdpa_dev *vdpa;
+ int qid = base->index;
+ struct vdpa_vring_info *vring;
+
+ vdpa = mdev_get_drvdata(mdev);
+ if (!vdpa)
+ return -ENODEV;
+
+ if (qid >= vdpa->max_vrings)
+ return -EINVAL;
+
+ vring = &vdpa->vring_info[qid];
+
+ vring->base = base->num;
+
+ return 0;
+}
+
+static int vhost_get_vring_base(struct mdev_device *mdev,
+ struct vhost_vring_state *base)
+{
+ struct vdpa_dev *vdpa;
+
+ vdpa = mdev_get_drvdata(mdev);
+ if (!vdpa)
+ return -ENODEV;
+
+ vdpa->pending_reply = true;
+ vdpa->pending.request = VHOST_GET_VRING_BASE;
+ vdpa->pending.payload.state.index = base->index;
+
+ // FIXME to be implemented
+
+ return 0;
+}
+
+static int vhost_set_features(struct mdev_device *mdev, u64 *features)
+{
+ struct vdpa_dev *vdpa;
+
+ vdpa = mdev_get_drvdata(mdev);
+ if (!vdpa)
+ return -ENODEV;
+
+ vdpa->features = *features;
+
+ return 0;
+}
+
+static int vhost_get_features(struct mdev_device *mdev, u64 *features)
+{
+ struct vdpa_dev *vdpa;
+
+ vdpa = mdev_get_drvdata(mdev);
+ if (!vdpa)
+ return -ENODEV;
+
+ vdpa->pending_reply = true;
+ vdpa->pending.request = VHOST_GET_FEATURES;
+ vdpa->pending.payload.u64 =
+ vdpa->ops->supported_features(vdpa);
+
+ return 0;
+}
+
+static int vhost_set_owner(struct mdev_device *mdev)
+{
+ return 0;
+}
+
+static int vhost_reset_owner(struct mdev_device *mdev)
+{
+ return 0;
+}
+
+static int vhost_set_state(struct mdev_device *mdev, u64 *state)
+{
+ struct vdpa_dev *vdpa;
+
+ vdpa = mdev_get_drvdata(mdev);
+ if (!vdpa)
+ return -ENODEV;
+
+ if (*state >= VHOST_DEVICE_S_MAX)
+ return -EINVAL;
+
+ if (vdpa->state == *state)
+ return 0;
+
+ vdpa->state = *state;
+
+ switch (vdpa->state) {
+ case VHOST_DEVICE_S_RUNNING:
+ vdpa->ops->start(vdpa);
+ break;
+ case VHOST_DEVICE_S_STOPPED:
+ vdpa->ops->stop(vdpa);
+ break;
+ }
+
+ return 0;
+}
+
+static ssize_t vdpa_handle_bar0_write(struct mdev_device *mdev,
+ const char __user *buf, size_t count, loff_t *ppos)
+{
+ struct vhost_vfio_op *op = NULL;
+ loff_t pos = *ppos;
+ loff_t offset;
+ int ret;
+
+ offset = VDPA_VFIO_PCI_BAR_OFFSET(pos);
+ if (offset != 0) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ if (count < VHOST_VFIO_OP_HDR_SIZE) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ op = kzalloc(VHOST_VFIO_OP_HDR_SIZE + VHOST_VFIO_OP_PAYLOAD_MAX_SIZE,
+ GFP_KERNEL);
+ if (op == NULL) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ if (copy_from_user(op, buf, VHOST_VFIO_OP_HDR_SIZE)) {
+ ret = -EINVAL;
+ goto out_free;
+ }
+
+ if (op->size > VHOST_VFIO_OP_PAYLOAD_MAX_SIZE ||
+ op->size + VHOST_VFIO_OP_HDR_SIZE != count) {
+ ret = -EINVAL;
+ goto out_free;
+ }
+
+ if (copy_from_user(&op->payload, buf + VHOST_VFIO_OP_HDR_SIZE,
+ op->size)) {
+ ret = -EFAULT;
+ goto out_free;
+ }
+
+ switch (op->request) {
+ case VHOST_SET_LOG_BASE:
+ break;
+ case VHOST_SET_MEM_TABLE:
+ vhost_set_mem_table(mdev, &op->payload.memory);
+ break;
+ case VHOST_SET_VRING_ADDR:
+ vhost_set_vring_addr(mdev, &op->payload.addr);
+ break;
+ case VHOST_SET_VRING_NUM:
+ vhost_set_vring_num(mdev, &op->payload.state);
+ break;
+ case VHOST_SET_VRING_BASE:
+ vhost_set_vring_base(mdev, &op->payload.state);
+ break;
+ case VHOST_GET_VRING_BASE:
+ vhost_get_vring_base(mdev, &op->payload.state);
+ break;
+ case VHOST_SET_FEATURES:
+ vhost_set_features(mdev, &op->payload.u64);
+ break;
+ case VHOST_GET_FEATURES:
+ vhost_get_features(mdev, &op->payload.u64);
+ break;
+ case VHOST_SET_OWNER:
+ vhost_set_owner(mdev);
+ break;
+ case VHOST_RESET_OWNER:
+ vhost_reset_owner(mdev);
+ break;
+ case VHOST_DEVICE_SET_STATE:
+ vhost_set_state(mdev, &op->payload.u64);
+ break;
+ default:
+ break;
+ }
+
+ ret = count;
+
+out_free:
+ kfree(op);
+out:
+ return ret;
+}
+
+static ssize_t vdpa_handle_bar1_write(struct mdev_device *mdev,
+ const char __user *buf, size_t count, loff_t *ppos)
+{
+ struct vdpa_dev *vdpa;
+ int qid;
+
+ vdpa = mdev_get_drvdata(mdev);
+ if (!vdpa)
+ return -ENODEV;
+
+ if (count < sizeof(qid))
+ return -EINVAL;
+
+ if (copy_from_user(&qid, buf, sizeof(qid)))
+ return -EINVAL;
+
+ vdpa->ops->notify(vdpa, qid);
+
+ return count;
+}
+
+ssize_t vdpa_write(struct mdev_device *mdev, const char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ int done = 0;
+ unsigned int index;
+ loff_t pos = *ppos;
+ struct vdpa_dev *vdpa;
+
+ vdpa = mdev_get_drvdata(mdev);
+ if (!vdpa)
+ return -ENODEV;
+
+ mutex_lock(&vdpa->ops_lock);
+
+ index = VDPA_VFIO_PCI_OFFSET_TO_INDEX(pos);
+
+ switch (index) {
+ case VFIO_PCI_CONFIG_REGION_INDEX:
+ done = vdpa_handle_pcicfg_write(mdev, buf, count, ppos);
+ break;
+ case VFIO_PCI_BAR0_REGION_INDEX:
+ done = vdpa_handle_bar0_write(mdev, buf, count, ppos);
+ break;
+ case VFIO_PCI_BAR1_REGION_INDEX:
+ done = vdpa_handle_bar1_write(mdev, buf, count, ppos);
+ break;
+ }
+
+ if (done > 0)
+ *ppos += done;
+
+ mutex_unlock(&vdpa->ops_lock);
+
+ return done;
+}
+EXPORT_SYMBOL(vdpa_write);
+
+static int vdpa_get_region_info(struct mdev_device *mdev,
+ struct vfio_region_info *region_info,
+ u16 *cap_type_id, void **cap_type)
+{
+ struct vdpa_dev *vdpa;
+ u32 bar_index;
+ u64 size = 0;
+
+ if (!mdev)
+ return -EINVAL;
+
+ vdpa = mdev_get_drvdata(mdev);
+ if (!vdpa)
+ return -EINVAL;
+
+ bar_index = region_info->index;
+ if (bar_index >= VFIO_PCI_NUM_REGIONS)
+ return -EINVAL;
+
+ mutex_lock(&vdpa->ops_lock);
+
+ switch (bar_index) {
+ case VFIO_PCI_CONFIG_REGION_INDEX:
+ size = VDPA_CONFIG_SIZE;
+ break;
+ case VFIO_PCI_BAR0_REGION_INDEX:
+ size = VDPA_BAR0_SIZE;
+ break;
+ case VFIO_PCI_BAR1_REGION_INDEX:
+ size = (u64)vdpa->max_vrings << PAGE_SHIFT;
+ break;
+ default:
+ size = 0;
+ break;
+ }
+
+ // FIXME: mark BAR1 as mmap-able (VFIO_REGION_INFO_FLAG_MMAP)
+ region_info->size = size;
+ region_info->offset = VDPA_VFIO_PCI_INDEX_TO_OFFSET(bar_index);
+ region_info->flags = VFIO_REGION_INFO_FLAG_READ |
+ VFIO_REGION_INFO_FLAG_WRITE;
+ mutex_unlock(&vdpa->ops_lock);
+ return 0;
+}
+
+static int vdpa_reset(struct mdev_device *mdev)
+{
+ struct vdpa_dev *vdpa;
+
+ if (!mdev)
+ return -EINVAL;
+
+ vdpa = mdev_get_drvdata(mdev);
+ if (!vdpa)
+ return -EINVAL;
+
+ return 0;
+}
+
+static int vdpa_get_device_info(struct mdev_device *mdev,
+ struct vfio_device_info *dev_info)
+{
+ struct vdpa_dev *vdpa;
+
+ vdpa = mdev_get_drvdata(mdev);
+ if (!vdpa)
+ return -ENODEV;
+
+ dev_info->flags = VFIO_DEVICE_FLAGS_PCI;
+ dev_info->num_regions = VFIO_PCI_NUM_REGIONS;
+ dev_info->num_irqs = vdpa->max_vrings;
+
+ return 0;
+}
+
+static int vdpa_get_irq_info(struct mdev_device *mdev,
+ struct vfio_irq_info *info)
+{
+ struct vdpa_dev *vdpa;
+
+ vdpa = mdev_get_drvdata(mdev);
+ if (!vdpa)
+ return -ENODEV;
+
+ if (info->index != VFIO_PCI_MSIX_IRQ_INDEX)
+ return -ENOTSUPP;
+
+ info->flags = VFIO_IRQ_INFO_EVENTFD;
+ info->count = vdpa->max_vrings;
+
+ return 0;
+}
+
+static int vdpa_set_irqs(struct mdev_device *mdev, uint32_t flags,
+ unsigned int index, unsigned int start,
+ unsigned int count, void *data)
+{
+ struct vdpa_dev *vdpa;
+ int *fd = data, i;
+
+ vdpa = mdev_get_drvdata(mdev);
+ if (!vdpa)
+ return -EINVAL;
+
+ if (index != VFIO_PCI_MSIX_IRQ_INDEX)
+ return -ENOTSUPP;
+
+ for (i = 0; i < count; i++)
+ vdpa->ops->set_eventfd(vdpa, start + i,
+ (flags & VFIO_IRQ_SET_DATA_EVENTFD) ? fd[i] : -1);
+
+ return 0;
+}
+
+long vdpa_ioctl(struct mdev_device *mdev, unsigned int cmd, unsigned long arg)
+{
+ int ret = 0;
+ unsigned long minsz;
+ struct vdpa_dev *vdpa;
+
+ if (!mdev)
+ return -EINVAL;
+
+ vdpa = mdev_get_drvdata(mdev);
+ if (!vdpa)
+ return -ENODEV;
+
+ switch (cmd) {
+ case VFIO_DEVICE_GET_INFO:
+ {
+ struct vfio_device_info info;
+
+ minsz = offsetofend(struct vfio_device_info, num_irqs);
+
+ if (copy_from_user(&info, (void __user *)arg, minsz))
+ return -EFAULT;
+
+ if (info.argsz < minsz)
+ return -EINVAL;
+
+ ret = vdpa_get_device_info(mdev, &info);
+ if (ret)
+ return ret;
+
+ if (copy_to_user((void __user *)arg, &info, minsz))
+ return -EFAULT;
+
+ return 0;
+ }
+ case VFIO_DEVICE_GET_REGION_INFO:
+ {
+ struct vfio_region_info info;
+ u16 cap_type_id = 0;
+ void *cap_type = NULL;
+
+ minsz = offsetofend(struct vfio_region_info, offset);
+
+ if (copy_from_user(&info, (void __user *)arg, minsz))
+ return -EFAULT;
+
+ if (info.argsz < minsz)
+ return -EINVAL;
+
+ ret = vdpa_get_region_info(mdev, &info, &cap_type_id,
+ &cap_type);
+ if (ret)
+ return ret;
+
+ if (copy_to_user((void __user *)arg, &info, minsz))
+ return -EFAULT;
+
+ return 0;
+ }
+ case VFIO_DEVICE_GET_IRQ_INFO:
+ {
+ struct vfio_irq_info info;
+
+ minsz = offsetofend(struct vfio_irq_info, count);
+
+ if (copy_from_user(&info, (void __user *)arg, minsz))
+ return -EFAULT;
+
+ if (info.argsz < minsz || info.index >= vdpa->max_vrings)
+ return -EINVAL;
+
+ ret = vdpa_get_irq_info(mdev, &info);
+ if (ret)
+ return ret;
+
+ if (copy_to_user((void __user *)arg, &info, minsz))
+ return -EFAULT;
+
+ return 0;
+ }
+ case VFIO_DEVICE_SET_IRQS:
+ {
+ struct vfio_irq_set hdr;
+ size_t data_size = 0;
+ u8 *data = NULL;
+
+ minsz = offsetofend(struct vfio_irq_set, count);
+
+ if (copy_from_user(&hdr, (void __user *)arg, minsz))
+ return -EFAULT;
+
+ ret = vfio_set_irqs_validate_and_prepare(&hdr, vdpa->max_vrings,
+ VFIO_PCI_NUM_IRQS,
+ &data_size);
+ if (ret)
+ return ret;
+
+ if (data_size) {
+ data = memdup_user((void __user *)(arg + minsz),
+ data_size);
+ if (IS_ERR(data))
+ return PTR_ERR(data);
+ }
+
+ ret = vdpa_set_irqs(mdev, hdr.flags, hdr.index, hdr.start,
+ hdr.count, data);
+
+ kfree(data);
+ return ret;
+ }
+ case VFIO_DEVICE_RESET:
+ return vdpa_reset(mdev);
+ }
+ return -ENOTTY;
+}
+EXPORT_SYMBOL(vdpa_ioctl);
+
+int vdpa_mmap(struct mdev_device *mdev, struct vm_area_struct *vma)
+{
+ // FIXME: to be implemented
+
+ return 0;
+}
+EXPORT_SYMBOL(vdpa_mmap);
+
+int vdpa_open(struct mdev_device *mdev)
+{
+ return 0;
+}
+EXPORT_SYMBOL(vdpa_open);
+
+void vdpa_close(struct mdev_device *mdev)
+{
+}
+EXPORT_SYMBOL(vdpa_close);
+
+MODULE_VERSION("0.0.0");
+MODULE_LICENSE("GPL v2");
+MODULE_DESCRIPTION("Hardware virtio accelerator abstraction");
diff --git a/include/linux/vdpa_mdev.h b/include/linux/vdpa_mdev.h
new file mode 100644
index 000000000000..8414e86ba4b8
--- /dev/null
+++ b/include/linux/vdpa_mdev.h
@@ -0,0 +1,76 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2018 Intel Corporation.
+ */
+
+#ifndef VDPA_MDEV_H
+#define VDPA_MDEV_H
+
+#define VDPA_CONFIG_SIZE 0xff
+
+struct mdev_device;
+struct vdpa_dev;
+
+/*
+ * XXX: Any comments about the vDPA API design for drivers
+ * would be appreciated!
+ */
+
+typedef int (*vdpa_start_device_t)(struct vdpa_dev *vdpa);
+typedef int (*vdpa_stop_device_t)(struct vdpa_dev *vdpa);
+typedef int (*vdpa_dma_map_t)(struct vdpa_dev *vdpa);
+typedef int (*vdpa_dma_unmap_t)(struct vdpa_dev *vdpa);
+typedef int (*vdpa_set_eventfd_t)(struct vdpa_dev *vdpa, int vector, int fd);
+typedef u64 (*vdpa_supported_features_t)(struct vdpa_dev *vdpa);
+typedef void (*vdpa_notify_device_t)(struct vdpa_dev *vdpa, int qid);
+typedef u64 (*vdpa_get_notify_addr_t)(struct vdpa_dev *vdpa, int qid);
+
+struct vdpa_device_ops {
+ vdpa_start_device_t start;
+ vdpa_stop_device_t stop;
+ vdpa_dma_map_t dma_map;
+ vdpa_dma_unmap_t dma_unmap;
+ vdpa_set_eventfd_t set_eventfd;
+ vdpa_supported_features_t supported_features;
+ vdpa_notify_device_t notify;
+ vdpa_get_notify_addr_t get_notify_addr;
+};
+
+struct vdpa_vring_info {
+ u64 desc_user_addr;
+ u64 used_user_addr;
+ u64 avail_user_addr;
+ u64 log_guest_addr;
+ u16 size;
+ u16 base;
+};
+
+struct vdpa_dev {
+ struct mdev_device *mdev;
+ struct mutex ops_lock;
+ u8 vconfig[VDPA_CONFIG_SIZE];
+ int nr_vring;
+ u64 features;
+ u64 state;
+ struct vhost_memory *mem_table;
+ bool pending_reply;
+ struct vhost_vfio_op pending;
+ const struct vdpa_device_ops *ops;
+ void *private;
+ int max_vrings;
+ struct vdpa_vring_info vring_info[0];
+};
+
+struct vdpa_dev *vdpa_alloc(struct mdev_device *mdev, void *private,
+ int max_vrings);
+void vdpa_free(struct vdpa_dev *vdpa);
+ssize_t vdpa_read(struct mdev_device *mdev, char __user *buf,
+ size_t count, loff_t *ppos);
+ssize_t vdpa_write(struct mdev_device *mdev, const char __user *buf,
+ size_t count, loff_t *ppos);
+long vdpa_ioctl(struct mdev_device *mdev, unsigned int cmd, unsigned long arg);
+int vdpa_mmap(struct mdev_device *mdev, struct vm_area_struct *vma);
+int vdpa_open(struct mdev_device *mdev);
+void vdpa_close(struct mdev_device *mdev);
+
+#endif /* VDPA_MDEV_H */
diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
index c51f8e5cc608..92a1ca0b5fe1 100644
--- a/include/uapi/linux/vhost.h
+++ b/include/uapi/linux/vhost.h
@@ -207,4 +207,30 @@ struct vhost_scsi_target {
#define VHOST_VSOCK_SET_GUEST_CID _IOW(VHOST_VIRTIO, 0x60, __u64)
#define VHOST_VSOCK_SET_RUNNING _IOW(VHOST_VIRTIO, 0x61, int)
+/* VHOST_DEVICE specific defines */
+
+#define VHOST_DEVICE_SET_STATE _IOW(VHOST_VIRTIO, 0x70, __u64)
+
+#define VHOST_DEVICE_S_STOPPED 0
+#define VHOST_DEVICE_S_RUNNING 1
+#define VHOST_DEVICE_S_MAX 2
+
+struct vhost_vfio_op {
+ __u64 request;
+ __u32 flags;
+ /* Flag values: */
+#define VHOST_VFIO_NEED_REPLY 0x1 /* Whether need reply */
+ __u32 size;
+ union {
+ __u64 u64;
+ struct vhost_vring_state state;
+ struct vhost_vring_addr addr;
+ struct vhost_memory memory;
+ } payload;
+};
+
+#define VHOST_VFIO_OP_HDR_SIZE \
+ ((unsigned long)&((struct vhost_vfio_op *)NULL)->payload)
+#define VHOST_VFIO_OP_PAYLOAD_MAX_SIZE 1024 /* FIXME TBD */
+
#endif
--
2.11.0
^ permalink raw reply related
* Re: [PATCH net-next 4/5] tun: Add support for SCTP checksum offload
From: kbuild test robot @ 2018-04-03 0:49 UTC (permalink / raw)
To: Vladislav Yasevich
Cc: nhorman, mst, netdev, virtualization, linux-sctp, kbuild-all
In-Reply-To: <20180402134006.10111-5-vyasevic@redhat.com>
[-- Attachment #1: Type: text/plain, Size: 2668 bytes --]
Hi Vladislav,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on net-next/master]
url: https://github.com/0day-ci/linux/commits/Vladislav-Yasevich/virtio-net-Add-SCTP-checksum-offload-support/20180402-221407
config: m68k-hp300_defconfig (attached as .config)
compiler: m68k-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=m68k
Note: the linux-review/Vladislav-Yasevich/virtio-net-Add-SCTP-checksum-offload-support/20180402-221407 HEAD 5e0497a085e70055a1981959802173f4ff05c86b builds fine.
It only hurts bisectibility.
All errors (new ones prefixed by >>):
drivers/net/tun.c: In function 'set_offload':
>> drivers/net/tun.c:2722:12: error: 'TUN_F_SCTP_CSUM' undeclared (first use in this function); did you mean 'TUN_F_CSUM'?
if (arg & TUN_F_SCTP_CSUM) {
^~~~~~~~~~~~~~~
TUN_F_CSUM
drivers/net/tun.c:2722:12: note: each undeclared identifier is reported only once for each function it appears in
vim +2722 drivers/net/tun.c
2696
2697 /* This is like a cut-down ethtool ops, except done via tun fd so no
2698 * privs required. */
2699 static int set_offload(struct tun_struct *tun, unsigned long arg)
2700 {
2701 netdev_features_t features = 0;
2702
2703 if (arg & TUN_F_CSUM) {
2704 features |= NETIF_F_HW_CSUM;
2705 arg &= ~TUN_F_CSUM;
2706
2707 if (arg & (TUN_F_TSO4|TUN_F_TSO6)) {
2708 if (arg & TUN_F_TSO_ECN) {
2709 features |= NETIF_F_TSO_ECN;
2710 arg &= ~TUN_F_TSO_ECN;
2711 }
2712 if (arg & TUN_F_TSO4)
2713 features |= NETIF_F_TSO;
2714 if (arg & TUN_F_TSO6)
2715 features |= NETIF_F_TSO6;
2716 arg &= ~(TUN_F_TSO4|TUN_F_TSO6);
2717 }
2718
2719 arg &= ~TUN_F_UFO;
2720 }
2721
> 2722 if (arg & TUN_F_SCTP_CSUM) {
2723 features |= NETIF_F_SCTP_CRC;
2724 arg &= ~TUN_F_SCTP_CSUM;
2725 }
2726
2727 /* This gives the user a way to test for new features in future by
2728 * trying to set them. */
2729 if (arg)
2730 return -EINVAL;
2731
2732 tun->set_features = features;
2733 tun->dev->wanted_features &= ~TUN_USER_FEATURES;
2734 tun->dev->wanted_features |= features;
2735 netdev_update_features(tun->dev);
2736
2737 return 0;
2738 }
2739
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 12572 bytes --]
[-- Attachment #3: Type: text/plain, Size: 183 bytes --]
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox