* Re: [PATCH] vhost: correct the related warning message
From: Michael S. Tsirkin @ 2018-12-13 15:21 UTC (permalink / raw)
To: Sergei Shtylyov; +Cc: kvm, netdev, virtualization, piaojun, viro, hch, wangyan
In-Reply-To: <7c1a2b4c-670f-2be8-88e1-7485237029f6@cogentembedded.com>
On Thu, Dec 13, 2018 at 12:05:31PM +0300, Sergei Shtylyov wrote:
> On 13.12.2018 4:10, wangyan wrote:
>
> > Fixes: 'commit d588cf8f618d ("target: Fix se_tpg_tfo->tf_subsys regression + remove tf_subsystem")'
> > 'commit cbbd26b8b1a6 ("[iov_iter] new primitives - copy_from_iter_full() and friends")'
>
> Fixes: d588cf8f618d ("target: Fix se_tpg_tfo->tf_subsys regression + remove
> tf_subsystem")
> Fixes: cbbd26b8b1a6 ("[iov_iter] new primitives - copy_from_iter_full() and
> friends")
>
> But no line wrapping allowed (can't find how to turn it off in my mailer).
Well is the standard really sane if people can't find a way to
follow it without being forced to use a specific mailer?
> > Signed-off-by: Yan Wang <wangyan122@huawei.com>
> > Reviewed-by: Jun Piao <piaojun@huawei.com>
> [...]
>
> MBR, Sergei
^ permalink raw reply
* Re: [PATCH v2 5/5] VSOCK: batch sending rx buffer to increase bandwidth
From: Stefan Hajnoczi @ 2018-12-13 15:17 UTC (permalink / raw)
To: jiangyiwen; +Cc: netdev, virtualization, kvm, Michael S. Tsirkin
In-Reply-To: <5C10D65F.7030407@huawei.com>
[-- Attachment #1.1: Type: text/plain, Size: 649 bytes --]
On Wed, Dec 12, 2018 at 05:35:27PM +0800, jiangyiwen wrote:
> Batch sending rx buffer can improve total bandwidth.
>
> Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
> ---
Please send patches with git-send-email --thread --no-chain-reply-to so
that your patch series email thread looks like this:
* [PATCH 00/NN] My feature
+-- [PATCH 01/NN] First patch
+-- [PATCH 02/NN] Second patch
.
.
.
+-- [PATCH NN/NN] Last patch
This way it's much easier to view the entire series. At the moment you
are sending each patch as an independent email and there is no
relationship between the emails.
Thanks,
Stefan
[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]
[-- Attachment #2: Type: text/plain, Size: 183 bytes --]
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH v2 2/5] VSOCK: support fill data to mergeable rx buffer in host
From: Michael S. Tsirkin @ 2018-12-13 14:50 UTC (permalink / raw)
To: jiangyiwen; +Cc: kvm, netdev, virtualization, stefanha, David Miller
In-Reply-To: <5C11CDF4.4040405@huawei.com>
On Thu, Dec 13, 2018 at 11:11:48AM +0800, jiangyiwen wrote:
> On 2018/12/13 3:09, David Miller wrote:
> > From: jiangyiwen <jiangyiwen@huawei.com>
> > Date: Wed, 12 Dec 2018 17:29:31 +0800
> >
> >> diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
> >> index 1d57ed3..2292f30 100644
> >> --- a/include/uapi/linux/virtio_vsock.h
> >> +++ b/include/uapi/linux/virtio_vsock.h
> >> @@ -63,6 +63,11 @@ struct virtio_vsock_hdr {
> >> __le32 fwd_cnt;
> >> } __attribute__((packed));
> >>
> >> +/* It add mergeable rx buffers feature */
> >> +struct virtio_vsock_mrg_rxbuf_hdr {
> >> + __le16 num_buffers; /* number of mergeable rx buffers */
> >> +} __attribute__((packed));
> >> +
> >
> > I know the rest of this file uses 'packed' but this attribute should
> > only be used if absolutely necessary as it incurs a
> > non-trivial performance penalty for some architectures.
> >
> > .
> >
>
> Hi David,
>
> I hope Host can fill fewer bytes into rx virtqueue, so
> I keep structure virtio_vsock_mrg_rxbuf_hdr one byte
> alignment.
>
> Thanks,
> Yiwen.
It doesn't work like this now though, does it?
Buffers are preallocated and they are always aligned.
So I do not see the point.
--
MST
^ permalink raw reply
* Re: [PATCH v2 2/5] VSOCK: support fill data to mergeable rx buffer in host
From: Michael S. Tsirkin @ 2018-12-13 14:48 UTC (permalink / raw)
To: jiangyiwen; +Cc: netdev, kvm, Stefan Hajnoczi, virtualization
In-Reply-To: <5C11CD14.3040809@huawei.com>
On Thu, Dec 13, 2018 at 11:08:04AM +0800, jiangyiwen wrote:
> On 2018/12/12 23:37, Michael S. Tsirkin wrote:
> > On Wed, Dec 12, 2018 at 05:29:31PM +0800, jiangyiwen wrote:
> >> When vhost support VIRTIO_VSOCK_F_MRG_RXBUF feature,
> >> it will merge big packet into rx vq.
> >>
> >> Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
> >
> > I feel this approach jumps into making interface changes for
> > optimizations too quickly. For example, what prevents us
> > from taking a big buffer, prepending each chunk
> > with the header and writing it out without
> > host/guest interface changes?
> >
> > This should allow optimizations such as vhost_add_used_n
> > batching.
> >
> > I realize a header in each packet does have a cost,
> > but it also has advantages such as improved robustness,
> > I'd like to see more of an apples to apples comparison
> > of the performance gain from skipping them.
> >
> >
>
> Hi Michael,
>
> I don't fully understand what you mean, do you want to
> see a performance comparison that before performance and
> only use batching?
>
> In my opinion, guest don't fill big buffer in rx vq because
> the balance performance and guest memory pressure, add
> mergeable feature can improve big packets performance,
> as for small packets, I try to find out the reason, may be
> the fluctuation of test results, or in mergeable mode, when
> Host send a 4k packet to Guest, we should call vhost_get_vq_desc()
> twice in host(hdr + 4k data), and in guest we also should call
> virtqueue_get_buf() twice.
>
> Thanks,
> Yiwen.
What I mean is that at least some of the gain here is because
larger skbs are passed up the stack.
You do not need larger packets to build larger skbs though.
Just check that the addresses match and you can combine
multiple fragments in a single skb.
> >> ---
> >> drivers/vhost/vsock.c | 111 ++++++++++++++++++++++++++++++--------
> >> include/linux/virtio_vsock.h | 1 +
> >> include/uapi/linux/virtio_vsock.h | 5 ++
> >> 3 files changed, 94 insertions(+), 23 deletions(-)
> >>
> >> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> >> index 34bc3ab..dc52b0f 100644
> >> --- a/drivers/vhost/vsock.c
> >> +++ b/drivers/vhost/vsock.c
> >> @@ -22,7 +22,8 @@
> >> #define VHOST_VSOCK_DEFAULT_HOST_CID 2
> >>
> >> enum {
> >> - VHOST_VSOCK_FEATURES = VHOST_FEATURES,
> >> + VHOST_VSOCK_FEATURES = VHOST_FEATURES |
> >> + (1ULL << VIRTIO_VSOCK_F_MRG_RXBUF),
> >> };
> >>
> >> /* Used to track all the vhost_vsock instances on the system. */
> >> @@ -80,6 +81,69 @@ static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
> >> return vsock;
> >> }
> >>
> >> +/* This segment of codes are copied from drivers/vhost/net.c */
> >> +static int get_rx_bufs(struct vhost_virtqueue *vq,
> >> + struct vring_used_elem *heads, int datalen,
> >> + unsigned *iovcount, unsigned int quota)
> >> +{
> >> + unsigned int out, in;
> >> + int seg = 0;
> >> + int headcount = 0;
> >> + unsigned d;
> >> + int ret;
> >> + /*
> >> + * len is always initialized before use since we are always called with
> >> + * datalen > 0.
> >> + */
> >> + u32 uninitialized_var(len);
> >> +
> >> + while (datalen > 0 && headcount < quota) {
> >> + if (unlikely(seg >= UIO_MAXIOV)) {
> >> + ret = -ENOBUFS;
> >> + goto err;
> >> + }
> >> +
> >> + ret = vhost_get_vq_desc(vq, vq->iov + seg,
> >> + ARRAY_SIZE(vq->iov) - seg, &out,
> >> + &in, NULL, NULL);
> >> + if (unlikely(ret < 0))
> >> + goto err;
> >> +
> >> + d = ret;
> >> + if (d == vq->num) {
> >> + ret = 0;
> >> + goto err;
> >> + }
> >> +
> >> + if (unlikely(out || in <= 0)) {
> >> + vq_err(vq, "unexpected descriptor format for RX: "
> >> + "out %d, in %d\n", out, in);
> >> + ret = -EINVAL;
> >> + goto err;
> >> + }
> >> +
> >> + heads[headcount].id = cpu_to_vhost32(vq, d);
> >> + len = iov_length(vq->iov + seg, in);
> >> + heads[headcount].len = cpu_to_vhost32(vq, len);
> >> + datalen -= len;
> >> + ++headcount;
> >> + seg += in;
> >> + }
> >> +
> >> + heads[headcount - 1].len = cpu_to_vhost32(vq, len + datalen);
> >> + *iovcount = seg;
> >> +
> >> + /* Detect overrun */
> >> + if (unlikely(datalen > 0)) {
> >> + ret = UIO_MAXIOV + 1;
> >> + goto err;
> >> + }
> >> + return headcount;
> >> +err:
> >> + vhost_discard_vq_desc(vq, headcount);
> >> + return ret;
> >> +}
> >> +
> >> static void
> >> vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
> >> struct vhost_virtqueue *vq)
> >> @@ -87,22 +151,34 @@ static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
> >> struct vhost_virtqueue *tx_vq = &vsock->vqs[VSOCK_VQ_TX];
> >> bool added = false;
> >> bool restart_tx = false;
> >> + int mergeable;
> >> + size_t vsock_hlen;
> >>
> >> mutex_lock(&vq->mutex);
> >>
> >> if (!vq->private_data)
> >> goto out;
> >>
> >> + mergeable = vhost_has_feature(vq, VIRTIO_VSOCK_F_MRG_RXBUF);
> >> + /*
> >> + * Guest fill page for rx vq in mergeable case, so it will not
> >> + * allocate pkt structure, we should reserve size of pkt in advance.
> >> + */
> >> + if (likely(mergeable))
> >> + vsock_hlen = sizeof(struct virtio_vsock_pkt);
> >> + else
> >> + vsock_hlen = sizeof(struct virtio_vsock_hdr);
> >> +
> >> /* Avoid further vmexits, we're already processing the virtqueue */
> >> vhost_disable_notify(&vsock->dev, vq);
> >>
> >> for (;;) {
> >> struct virtio_vsock_pkt *pkt;
> >> struct iov_iter iov_iter;
> >> - unsigned out, in;
> >> + unsigned out = 0, in = 0;
> >> size_t nbytes;
> >> size_t len;
> >> - int head;
> >> + s16 headcount;
> >>
> >> spin_lock_bh(&vsock->send_pkt_list_lock);
> >> if (list_empty(&vsock->send_pkt_list)) {
> >> @@ -116,16 +192,9 @@ static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
> >> list_del_init(&pkt->list);
> >> spin_unlock_bh(&vsock->send_pkt_list_lock);
> >>
> >> - head = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
> >> - &out, &in, NULL, NULL);
> >> - if (head < 0) {
> >> - spin_lock_bh(&vsock->send_pkt_list_lock);
> >> - list_add(&pkt->list, &vsock->send_pkt_list);
> >> - spin_unlock_bh(&vsock->send_pkt_list_lock);
> >> - break;
> >> - }
> >> -
> >> - if (head == vq->num) {
> >> + headcount = get_rx_bufs(vq, vq->heads, vsock_hlen + pkt->len,
> >> + &in, likely(mergeable) ? UIO_MAXIOV : 1);
> >> + if (headcount <= 0) {
> >> spin_lock_bh(&vsock->send_pkt_list_lock);
> >> list_add(&pkt->list, &vsock->send_pkt_list);
> >> spin_unlock_bh(&vsock->send_pkt_list_lock);
> >> @@ -133,24 +202,20 @@ static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
> >> /* We cannot finish yet if more buffers snuck in while
> >> * re-enabling notify.
> >> */
> >> - if (unlikely(vhost_enable_notify(&vsock->dev, vq))) {
> >> + if (!headcount && unlikely(vhost_enable_notify(&vsock->dev, vq))) {
> >> vhost_disable_notify(&vsock->dev, vq);
> >> continue;
> >> }
> >> break;
> >> }
> >>
> >> - if (out) {
> >> - virtio_transport_free_pkt(pkt);
> >> - vq_err(vq, "Expected 0 output buffers, got %u\n", out);
> >> - break;
> >> - }
> >> -
> >> len = iov_length(&vq->iov[out], in);
> >> iov_iter_init(&iov_iter, READ, &vq->iov[out], in, len);
> >>
> >> - nbytes = copy_to_iter(&pkt->hdr, sizeof(pkt->hdr), &iov_iter);
> >> - if (nbytes != sizeof(pkt->hdr)) {
> >> + if (likely(mergeable))
> >> + pkt->mrg_rxbuf_hdr.num_buffers = cpu_to_le16(headcount);
> >> + nbytes = copy_to_iter(&pkt->hdr, vsock_hlen, &iov_iter);
> >> + if (nbytes != vsock_hlen) {
> >> virtio_transport_free_pkt(pkt);
> >> vq_err(vq, "Faulted on copying pkt hdr\n");
> >> break;
> >> @@ -163,7 +228,7 @@ static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
> >> break;
> >> }
> >>
> >> - vhost_add_used(vq, head, sizeof(pkt->hdr) + pkt->len);
> >> + vhost_add_used_n(vq, vq->heads, headcount);
> >> added = true;
> >>
> >> if (pkt->reply) {
> >> diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
> >> index bf84418..da9e1fe 100644
> >> --- a/include/linux/virtio_vsock.h
> >> +++ b/include/linux/virtio_vsock.h
> >> @@ -50,6 +50,7 @@ struct virtio_vsock_sock {
> >>
> >> struct virtio_vsock_pkt {
> >> struct virtio_vsock_hdr hdr;
> >> + struct virtio_vsock_mrg_rxbuf_hdr mrg_rxbuf_hdr;
> >> struct work_struct work;
> >> struct list_head list;
> >> /* socket refcnt not held, only use for cancellation */
> >> diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
> >> index 1d57ed3..2292f30 100644
> >> --- a/include/uapi/linux/virtio_vsock.h
> >> +++ b/include/uapi/linux/virtio_vsock.h
> >> @@ -63,6 +63,11 @@ struct virtio_vsock_hdr {
> >> __le32 fwd_cnt;
> >> } __attribute__((packed));
> >>
> >> +/* It add mergeable rx buffers feature */
> >> +struct virtio_vsock_mrg_rxbuf_hdr {
> >> + __le16 num_buffers; /* number of mergeable rx buffers */
> >> +} __attribute__((packed));
> >> +
> >> enum virtio_vsock_type {
> >> VIRTIO_VSOCK_TYPE_STREAM = 1,
> >> };
> >> --
> >> 1.8.3.1
> >>
> >
> > .
> >
>
^ permalink raw reply
* Re: [PATCH net V3 0/3] Fix various issue of vhost
From: Michael S. Tsirkin @ 2018-12-13 14:32 UTC (permalink / raw)
To: Jason Wang; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20181213025339.14023-1-jasowang@redhat.com>
On Thu, Dec 13, 2018 at 10:53:36AM +0800, Jason Wang wrote:
> Hi:
>
> This series tries to fix various issues of vhost:
>
> - Patch 1 adds a missing write barrier between used idx updating and
> logging.
> - Patch 2-3 brings back the protection of device IOTLB through vq
> mutex, this fixes possible use after free in device IOTLB entries.
>
> Please consider them for -stable.
Acked-by: Michael S. Tsirkin <mst@redhat.com>
> Thanks
>
> Changes from V2:
> - drop dirty page fix and make it for net-next
> Changes from V1:
> - silent compiler warning for 32bit.
> - use mutex_trylock() on slowpath instead of mutex_lock() even on fast
> path.
>
> Jason Wang (3):
> vhost: make sure used idx is seen before log in vhost_add_used_n()
> vhost_net: switch to use mutex_trylock() in vhost_net_busy_poll()
> Revert "net: vhost: lock the vqs one by one"
>
> drivers/vhost/net.c | 8 +++++++-
> drivers/vhost/vhost.c | 23 +++++++++++++++++++----
> 2 files changed, 26 insertions(+), 5 deletions(-)
>
> --
> 2.17.1
^ permalink raw reply
* Re: [PATCH net V2 4/4] vhost: log dirty page correctly
From: Michael S. Tsirkin @ 2018-12-13 14:31 UTC (permalink / raw)
To: Jason Wang; +Cc: Jintack Lim, netdev, linux-kernel, kvm, virtualization
In-Reply-To: <0239c220-e7ca-c08f-be26-eb9be63fced3@redhat.com>
On Thu, Dec 13, 2018 at 10:39:41AM +0800, Jason Wang wrote:
>
> On 2018/12/12 下午10:32, Michael S. Tsirkin wrote:
> > On Wed, Dec 12, 2018 at 06:08:19PM +0800, Jason Wang wrote:
> > > Vhost dirty page logging API is designed to sync through GPA. But we
> > > try to log GIOVA when device IOTLB is enabled. This is wrong and may
> > > lead to missing data after migration.
> > >
> > > To solve this issue, when logging with device IOTLB enabled, we will:
> > >
> > > 1) reuse the device IOTLB translation result of GIOVA->HVA mapping to
> > > get HVA, for writable descriptor, get HVA through iovec. For used
> > > ring update, translate its GIOVA to HVA
> > > 2) traverse the GPA->HVA mapping to get the possible GPA and log
> > > through GPA. Pay attention this reverse mapping is not guaranteed
> > > to be unique, so we should log each possible GPA in this case.
> > >
> > > This fix the failure of scp to guest during migration. In -next, we
> > > will probably support passing GIOVA->GPA instead of GIOVA->HVA.
> > >
> > > Fixes: 6b1e6cc7855b ("vhost: new device IOTLB API")
> > > Reported-by: Jintack Lim <jintack@cs.columbia.edu>
> > > Cc: Jintack Lim <jintack@cs.columbia.edu>
> > > Signed-off-by: Jason Wang <jasowang@redhat.com>
> > It's a nasty bug for sure but it's been like this for a long
> > time so I'm inclined to say let's put it in 4.21,
> > and queue for stable.
> >
> > So please split this out from this series.
>
>
> Ok.
>
>
> >
> > Also, I'd like to see a feature bit that allows GPA in IOTLBs.
>
>
> Just to make sure I understand this. It looks to me we should:
>
> - allow passing GIOVA->GPA through UAPI
>
> - cache GIOVA->GPA somewhere but still use GIOVA->HVA in device IOTLB for
> performance
>
> Is this what you suggest?
>
> Thanks
Not really. We already have GPA->HVA, so I suggested a flag to pass
GIOVA->GPA in the IOTLB.
This has advantages for security since a single table needs
then to be validated to ensure guest does not corrupt
QEMU memory.
>
> >
> > > ---
> > > drivers/vhost/net.c | 3 +-
> > > drivers/vhost/vhost.c | 79 +++++++++++++++++++++++++++++++++++--------
> > > drivers/vhost/vhost.h | 3 +-
> > > 3 files changed, 69 insertions(+), 16 deletions(-)
> > >
> > > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > > index ad7a6f475a44..784df2b49628 100644
> > > --- a/drivers/vhost/net.c
> > > +++ b/drivers/vhost/net.c
> > > @@ -1192,7 +1192,8 @@ static void handle_rx(struct vhost_net *net)
> > > if (nvq->done_idx > VHOST_NET_BATCH)
> > > vhost_net_signal_used(nvq);
> > > if (unlikely(vq_log))
> > > - vhost_log_write(vq, vq_log, log, vhost_len);
> > > + vhost_log_write(vq, vq_log, log, vhost_len,
> > > + vq->iov, in);
> > > total_len += vhost_len;
> > > if (unlikely(vhost_exceeds_weight(++recv_pkts, total_len))) {
> > > vhost_poll_queue(&vq->poll);
> > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > index 55e5aa662ad5..3660310604fd 100644
> > > --- a/drivers/vhost/vhost.c
> > > +++ b/drivers/vhost/vhost.c
> > > @@ -1733,11 +1733,67 @@ static int log_write(void __user *log_base,
> > > return r;
> > > }
> > > +static int log_write_hva(struct vhost_virtqueue *vq, u64 hva, u64 len)
> > > +{
> > > + struct vhost_umem *umem = vq->umem;
> > > + struct vhost_umem_node *u;
> > > + u64 gpa;
> > > + int r;
> > > + bool hit = false;
> > > +
> > > + list_for_each_entry(u, &umem->umem_list, link) {
> > > + if (u->userspace_addr < hva &&
> > > + u->userspace_addr + u->size >=
> > > + hva + len) {
> > > + gpa = u->start + hva - u->userspace_addr;
> > > + r = log_write(vq->log_base, gpa, len);
> > > + if (r < 0)
> > > + return r;
> > > + hit = true;
> > > + }
> > > + }
> > > +
> > > + /* No reverse mapping, should be a bug */
> > > + WARN_ON(!hit);
> > Maybe it should but userspace can trigger this easily I think.
> > We need to stop the device not warn in kernel log.
> >
> > Also there's an error fd: VHOST_SET_VRING_ERR, need to wake it up.
> >
>
> Ok.
>
>
> > > + return 0;
> > > +}
> > > +
> > > +static void log_used(struct vhost_virtqueue *vq, u64 used_offset, u64 len)
> > > +{
> > > + struct iovec iov[64];
> > > + int i, ret;
> > > +
> > > + if (!vq->iotlb) {
> > > + log_write(vq->log_base, vq->log_addr + used_offset, len);
> > > + return;
> > > + }
> > This change seems questionable. used ring writes
> > use their own machinery it does not go through iotlb.
> > Same should apply to log I think.
>
>
> The problem is used ring may not be physically contiguous with Device IOTLB
> enabled. So it should go through it.
>
>
> >
> > > +
> > > + ret = translate_desc(vq, (u64)(uintptr_t)vq->used + used_offset,
> > > + len, iov, 64, VHOST_ACCESS_WO);
> > > + WARN_ON(ret < 0);
> >
> > Same thing here. translation failures can be triggered from guest.
> > warn on is not a good error handling strategy ...
>
>
> Ok. Let me fix it.
>
>
> Thanks
>
>
> > > +
> > > + for (i = 0; i < ret; i++) {
> > > + ret = log_write_hva(vq, (u64)(uintptr_t)iov[i].iov_base,
> > > + iov[i].iov_len);
> > > + WARN_ON(ret);
> > > + }
> > > +}
> > > +
> > > int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
> > > - unsigned int log_num, u64 len)
> > > + unsigned int log_num, u64 len, struct iovec *iov, int count)
> > > {
> > > int i, r;
> > > + if (vq->iotlb) {
> > > + for (i = 0; i < count; i++) {
> > > + r = log_write_hva(vq, (u64)(uintptr_t)iov[i].iov_base,
> > > + iov[i].iov_len);
> > > + if (r < 0)
> > > + return r;
> > > + }
> > > + return 0;
> > > + }
> > > +
> > > /* Make sure data written is seen before log. */
> > > smp_wmb();
> > > for (i = 0; i < log_num; ++i) {
> > > @@ -1769,9 +1825,8 @@ static int vhost_update_used_flags(struct vhost_virtqueue *vq)
> > > smp_wmb();
> > > /* Log used flag write. */
> > > used = &vq->used->flags;
> > > - log_write(vq->log_base, vq->log_addr +
> > > - (used - (void __user *)vq->used),
> > > - sizeof vq->used->flags);
> > > + log_used(vq, (used - (void __user *)vq->used),
> > > + sizeof vq->used->flags);
> > > if (vq->log_ctx)
> > > eventfd_signal(vq->log_ctx, 1);
> > > }
> > > @@ -1789,9 +1844,8 @@ static int vhost_update_avail_event(struct vhost_virtqueue *vq, u16 avail_event)
> > > smp_wmb();
> > > /* Log avail event write */
> > > used = vhost_avail_event(vq);
> > > - log_write(vq->log_base, vq->log_addr +
> > > - (used - (void __user *)vq->used),
> > > - sizeof *vhost_avail_event(vq));
> > > + log_used(vq, (used - (void __user *)vq->used),
> > > + sizeof *vhost_avail_event(vq));
> > > if (vq->log_ctx)
> > > eventfd_signal(vq->log_ctx, 1);
> > > }
> > > @@ -2191,10 +2245,8 @@ static int __vhost_add_used_n(struct vhost_virtqueue *vq,
> > > /* Make sure data is seen before log. */
> > > smp_wmb();
> > > /* Log used ring entry write. */
> > > - log_write(vq->log_base,
> > > - vq->log_addr +
> > > - ((void __user *)used - (void __user *)vq->used),
> > > - count * sizeof *used);
> > > + log_used(vq, ((void __user *)used - (void __user *)vq->used),
> > > + count * sizeof *used);
> > > }
> > > old = vq->last_used_idx;
> > > new = (vq->last_used_idx += count);
> > > @@ -2236,9 +2288,8 @@ int vhost_add_used_n(struct vhost_virtqueue *vq, struct vring_used_elem *heads,
> > > /* Make sure used idx is seen before log. */
> > > smp_wmb();
> > > /* Log used index update. */
> > > - log_write(vq->log_base,
> > > - vq->log_addr + offsetof(struct vring_used, idx),
> > > - sizeof vq->used->idx);
> > > + log_used(vq, offsetof(struct vring_used, idx),
> > > + sizeof vq->used->idx);
> > > if (vq->log_ctx)
> > > eventfd_signal(vq->log_ctx, 1);
> > > }
> > > diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> > > index 466ef7542291..1b675dad5e05 100644
> > > --- a/drivers/vhost/vhost.h
> > > +++ b/drivers/vhost/vhost.h
> > > @@ -205,7 +205,8 @@ bool vhost_vq_avail_empty(struct vhost_dev *, struct vhost_virtqueue *);
> > > bool vhost_enable_notify(struct vhost_dev *, struct vhost_virtqueue *);
> > > int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
> > > - unsigned int log_num, u64 len);
> > > + unsigned int log_num, u64 len,
> > > + struct iovec *iov, int count);
> > > int vq_iotlb_prefetch(struct vhost_virtqueue *vq);
> > > struct vhost_msg_node *vhost_new_msg(struct vhost_virtqueue *vq, int type);
> > > --
> > > 2.17.1
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH v2 3/5] VSOCK: support receive mergeable rx buffer in guest
From: Michael S. Tsirkin @ 2018-12-13 14:29 UTC (permalink / raw)
To: jiangyiwen; +Cc: netdev, kvm, Stefan Hajnoczi, virtualization
In-Reply-To: <5C11C611.40901@huawei.com>
On Thu, Dec 13, 2018 at 10:38:09AM +0800, jiangyiwen wrote:
> Hi Michael,
>
> On 2018/12/12 23:31, Michael S. Tsirkin wrote:
> > On Wed, Dec 12, 2018 at 05:31:39PM +0800, jiangyiwen wrote:
> >> Guest receive mergeable rx buffer, it can merge
> >> scatter rx buffer into a big buffer and then copy
> >> to user space.
> >>
> >> In addition, it also use iovec to replace buf in struct
> >> virtio_vsock_pkt, keep tx and rx consistency. The only
> >> difference is now tx still uses a segment of continuous
> >> physical memory to implement.
> >>
> >> Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
> >> ---
> >> drivers/vhost/vsock.c | 31 +++++++---
> >> include/linux/virtio_vsock.h | 6 +-
> >> net/vmw_vsock/virtio_transport.c | 105 ++++++++++++++++++++++++++++----
> >> net/vmw_vsock/virtio_transport_common.c | 59 ++++++++++++++----
> >> 4 files changed, 166 insertions(+), 35 deletions(-)
> >
> >
> > This was supposed to be a guest patch, why is vhost changed here?
> >
>
> In mergeable rx buff cases, it need to scatter big packets into several
> buffers, so I add kvec variable in struct virtio_vsock_pkt, at the same
> time, in order to keep tx and rx consistency, I use kvec to replace
> variable buf, because vhost use the variable pkt->buf, so this patch
> caused vhost is changed.
You'd want to split these patches imho.
> >> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> >> index dc52b0f..c7ab0dd 100644
> >> --- a/drivers/vhost/vsock.c
> >> +++ b/drivers/vhost/vsock.c
> >> @@ -179,6 +179,8 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
> >> size_t nbytes;
> >> size_t len;
> >> s16 headcount;
> >> + size_t remain_len;
> >> + int i;
> >>
> >> spin_lock_bh(&vsock->send_pkt_list_lock);
> >> if (list_empty(&vsock->send_pkt_list)) {
> >> @@ -221,11 +223,19 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
> >> break;
> >> }
> >>
> >> - nbytes = copy_to_iter(pkt->buf, pkt->len, &iov_iter);
> >> - if (nbytes != pkt->len) {
> >> - virtio_transport_free_pkt(pkt);
> >> - vq_err(vq, "Faulted on copying pkt buf\n");
> >> - break;
> >> + remain_len = pkt->len;
> >> + for (i = 0; i < pkt->nr_vecs; i++) {
> >> + int tmp_len;
> >> +
> >> + tmp_len = min(remain_len, pkt->vec[i].iov_len);
> >> + nbytes = copy_to_iter(pkt->vec[i].iov_base, tmp_len, &iov_iter);
> >> + if (nbytes != tmp_len) {
> >> + virtio_transport_free_pkt(pkt);
> >> + vq_err(vq, "Faulted on copying pkt buf\n");
> >> + break;
> >> + }
> >> +
> >> + remain_len -= tmp_len;
> >> }
> >>
> >> vhost_add_used_n(vq, vq->heads, headcount);
> >> @@ -341,6 +351,7 @@ static void vhost_transport_send_pkt_work(struct vhost_work *work)
> >> struct iov_iter iov_iter;
> >> size_t nbytes;
> >> size_t len;
> >> + void *buf;
> >>
> >> if (in != 0) {
> >> vq_err(vq, "Expected 0 input buffers, got %u\n", in);
> >> @@ -375,13 +386,17 @@ static void vhost_transport_send_pkt_work(struct vhost_work *work)
> >> return NULL;
> >> }
> >>
> >> - pkt->buf = kmalloc(pkt->len, GFP_KERNEL);
> >> - if (!pkt->buf) {
> >> + buf = kmalloc(pkt->len, GFP_KERNEL);
> >> + if (!buf) {
> >> kfree(pkt);
> >> return NULL;
> >> }
> >>
> >> - nbytes = copy_from_iter(pkt->buf, pkt->len, &iov_iter);
> >> + pkt->vec[0].iov_base = buf;
> >> + pkt->vec[0].iov_len = pkt->len;
> >> + pkt->nr_vecs = 1;
> >> +
> >> + nbytes = copy_from_iter(buf, pkt->len, &iov_iter);
> >> if (nbytes != pkt->len) {
> >> vq_err(vq, "Expected %u byte payload, got %zu bytes\n",
> >> pkt->len, nbytes);
> >> diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
> >> index da9e1fe..734eeed 100644
> >> --- a/include/linux/virtio_vsock.h
> >> +++ b/include/linux/virtio_vsock.h
> >> @@ -13,6 +13,8 @@
> >> #define VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE (1024 * 4)
> >> #define VIRTIO_VSOCK_MAX_BUF_SIZE 0xFFFFFFFFUL
> >> #define VIRTIO_VSOCK_MAX_PKT_BUF_SIZE (1024 * 64)
> >> +/* virtio_vsock_pkt + max_pkt_len(default MAX_PKT_BUF_SIZE) */
> >> +#define VIRTIO_VSOCK_MAX_VEC_NUM ((VIRTIO_VSOCK_MAX_PKT_BUF_SIZE / PAGE_SIZE) + 1)
> >>
> >> /* Virtio-vsock feature */
> >> #define VIRTIO_VSOCK_F_MRG_RXBUF 0 /* Host can merge receive buffers. */
> >> @@ -55,10 +57,12 @@ struct virtio_vsock_pkt {
> >> struct list_head list;
> >> /* socket refcnt not held, only use for cancellation */
> >> struct vsock_sock *vsk;
> >> - void *buf;
> >> + struct kvec vec[VIRTIO_VSOCK_MAX_VEC_NUM];
> >> + int nr_vecs;
> >> u32 len;
> >> u32 off;
> >> bool reply;
> >> + bool mergeable;
> >> };
> >>
> >> struct virtio_vsock_pkt_info {
> >> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
> >> index c4a465c..148b58a 100644
> >> --- a/net/vmw_vsock/virtio_transport.c
> >> +++ b/net/vmw_vsock/virtio_transport.c
> >> @@ -155,8 +155,10 @@ static int virtio_transport_send_pkt_loopback(struct virtio_vsock *vsock,
> >>
> >> sg_init_one(&hdr, &pkt->hdr, sizeof(pkt->hdr));
> >> sgs[out_sg++] = &hdr;
> >> - if (pkt->buf) {
> >> - sg_init_one(&buf, pkt->buf, pkt->len);
> >> + if (pkt->len) {
> >> + /* Currently only support a segment of memory in tx */
> >> + BUG_ON(pkt->vec[0].iov_len != pkt->len);
> >> + sg_init_one(&buf, pkt->vec[0].iov_base, pkt->vec[0].iov_len);
> >> sgs[out_sg++] = &buf;
> >> }
> >>
> >> @@ -304,23 +306,28 @@ static int fill_old_rx_buff(struct virtqueue *vq)
> >> struct virtio_vsock_pkt *pkt;
> >> struct scatterlist hdr, buf, *sgs[2];
> >> int ret;
> >> + void *pkt_buf;
> >>
> >> pkt = kzalloc(sizeof(*pkt), GFP_KERNEL);
> >> if (!pkt)
> >> return -ENOMEM;
> >>
> >> - pkt->buf = kmalloc(buf_len, GFP_KERNEL);
> >> - if (!pkt->buf) {
> >> + pkt_buf = kmalloc(buf_len, GFP_KERNEL);
> >> + if (!pkt_buf) {
> >> virtio_transport_free_pkt(pkt);
> >> return -ENOMEM;
> >> }
> >>
> >> + pkt->vec[0].iov_base = pkt_buf;
> >> + pkt->vec[0].iov_len = buf_len;
> >> + pkt->nr_vecs = 1;
> >> +
> >> pkt->len = buf_len;
> >>
> >> sg_init_one(&hdr, &pkt->hdr, sizeof(pkt->hdr));
> >> sgs[0] = &hdr;
> >>
> >> - sg_init_one(&buf, pkt->buf, buf_len);
> >> + sg_init_one(&buf, pkt->vec[0].iov_base, buf_len);
> >> sgs[1] = &buf;
> >> ret = virtqueue_add_sgs(vq, sgs, 0, 2, pkt, GFP_KERNEL);
> >> if (ret)
> >> @@ -388,11 +395,78 @@ static bool virtio_transport_more_replies(struct virtio_vsock *vsock)
> >> return val < virtqueue_get_vring_size(vq);
> >> }
> >>
> >> +static struct virtio_vsock_pkt *receive_mergeable(struct virtqueue *vq,
> >> + struct virtio_vsock *vsock, unsigned int *total_len)
> >> +{
> >> + struct virtio_vsock_pkt *pkt;
> >> + u16 num_buf;
> >> + void *buf;
> >> + unsigned int len;
> >> + size_t vsock_hlen = sizeof(struct virtio_vsock_pkt);
> >> +
> >> + buf = virtqueue_get_buf(vq, &len);
> >> + if (!buf)
> >> + return NULL;
> >> +
> >> + *total_len = len;
> >> + vsock->rx_buf_nr--;
> >> +
> >> + if (unlikely(len < vsock_hlen)) {
> >> + put_page(virt_to_head_page(buf));
> >> + return NULL;
> >> + }
> >> +
> >> + pkt = buf;
> >> + num_buf = le16_to_cpu(pkt->mrg_rxbuf_hdr.num_buffers);
> >> + if (!num_buf || num_buf > VIRTIO_VSOCK_MAX_VEC_NUM) {
> >> + put_page(virt_to_head_page(buf));
> >> + return NULL;
> >> + }
> >
> > So everything just stops going, and host and user don't even
> > know what the reason is. And not only that - the next
> > packet will be corrupted because we skipped the first one.
> >
> >
>
> I understand this case will not encountered unless the code has
> *BUG*, like Host send some problematic packages (shorten/longer than
> expected). In this case, I think we should ignore/drop these packets.
If there's a specific packet length expected, e.g. in this case vector
size, it needs to be negotiated between host and guest in some way.
> >
> >> +
> >> + /* Initialize pkt residual structure */
> >> + memset(&pkt->work, 0, vsock_hlen - sizeof(struct virtio_vsock_hdr) -
> >> + sizeof(struct virtio_vsock_mrg_rxbuf_hdr));
> >> +
> >> + pkt->mergeable = true;
> >> + pkt->len = le32_to_cpu(pkt->hdr.len);
> >> + if (!pkt->len)
> >> + return pkt;
> >> +
> >> + len -= vsock_hlen;
> >> + if (len) {
> >> + pkt->vec[pkt->nr_vecs].iov_base = buf + vsock_hlen;
> >> + pkt->vec[pkt->nr_vecs].iov_len = len;
> >> + /* Shared page with pkt, so get page in advance */
> >> + get_page(virt_to_head_page(buf));
> >> + pkt->nr_vecs++;
> >> + }
> >> +
> >> + while (--num_buf) {
> >> + buf = virtqueue_get_buf(vq, &len);
> >> + if (!buf)
> >> + goto err;
> >> +
> >> + *total_len += len;
> >> + vsock->rx_buf_nr--;
> >> +
> >> + pkt->vec[pkt->nr_vecs].iov_base = buf;
> >> + pkt->vec[pkt->nr_vecs].iov_len = len;
> >> + pkt->nr_vecs++;
> >> + }
> >> +
> >> + return pkt;
> >> +err:
> >> + virtio_transport_free_pkt(pkt);
> >> + return NULL;
> >> +}
> >> +
> >> static void virtio_transport_rx_work(struct work_struct *work)
> >> {
> >> struct virtio_vsock *vsock =
> >> container_of(work, struct virtio_vsock, rx_work);
> >> struct virtqueue *vq;
> >> + size_t vsock_hlen = vsock->mergeable ? sizeof(struct virtio_vsock_pkt) :
> >> + sizeof(struct virtio_vsock_hdr);
> >>
> >> vq = vsock->vqs[VSOCK_VQ_RX];
> >>
> >> @@ -412,21 +486,26 @@ static void virtio_transport_rx_work(struct work_struct *work)
> >> goto out;
> >> }
> >>
> >> - pkt = virtqueue_get_buf(vq, &len);
> >> - if (!pkt) {
> >> - break;
> >> - }
> >> + if (likely(vsock->mergeable)) {
> >> + pkt = receive_mergeable(vq, vsock, &len);
> >> + if (!pkt)
> >> + break;
> >> + } else {
> >> + pkt = virtqueue_get_buf(vq, &len);
> >> + if (!pkt)
> >> + break;
> >>
> >
> > So looking at it, this seems to be the main source of the gain.
> > But why does this require host/guest changes?
> >
> >
> > The way I see it:
> > - get a buffer and create an skb
> > - get the next one, check header matches, if yes
> > tack it on the skb as a fragment. If not then
> > don't, deliver previous one and queue the new one.
> >
> >
>
> Vhost change reason I explain as above, and I hope use kvec
> to instead buf, after all buf only can express a contiguous
> physical memory.
>
> Thanks,
> Yiwen.
I got the reason but I am not yet convinced it's a good one.
You don't necessarily need host to skip headers
in all but the first chunk. Guest can do this just as well.
> >
> >> - vsock->rx_buf_nr--;
> >> + vsock->rx_buf_nr--;
> >> + }
> >>
> >> /* Drop short/long packets */
> >> - if (unlikely(len < sizeof(pkt->hdr) ||
> >> - len > sizeof(pkt->hdr) + pkt->len)) {
> >> + if (unlikely(len < vsock_hlen ||
> >> + len > vsock_hlen + pkt->len)) {
> >> virtio_transport_free_pkt(pkt);
> >> continue;
> >> }
> >>
> >> - pkt->len = len - sizeof(pkt->hdr);
> >> + pkt->len = len - vsock_hlen;
> >> virtio_transport_deliver_tap_pkt(pkt);
> >> virtio_transport_recv_pkt(pkt);
> >> }
> >> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> >> index 3ae3a33..123a8b6 100644
> >> --- a/net/vmw_vsock/virtio_transport_common.c
> >> +++ b/net/vmw_vsock/virtio_transport_common.c
> >> @@ -44,6 +44,7 @@ static const struct virtio_transport *virtio_transport_get_ops(void)
> >> {
> >> struct virtio_vsock_pkt *pkt;
> >> int err;
> >> + void *buf = NULL;
> >>
> >> pkt = kzalloc(sizeof(*pkt), GFP_KERNEL);
> >> if (!pkt)
> >> @@ -62,12 +63,16 @@ static const struct virtio_transport *virtio_transport_get_ops(void)
> >> pkt->vsk = info->vsk;
> >>
> >> if (info->msg && len > 0) {
> >> - pkt->buf = kmalloc(len, GFP_KERNEL);
> >> - if (!pkt->buf)
> >> + buf = kmalloc(len, GFP_KERNEL);
> >> + if (!buf)
> >> goto out_pkt;
> >> - err = memcpy_from_msg(pkt->buf, info->msg, len);
> >> + err = memcpy_from_msg(buf, info->msg, len);
> >> if (err)
> >> goto out;
> >> +
> >> + pkt->vec[0].iov_base = buf;
> >> + pkt->vec[0].iov_len = len;
> >> + pkt->nr_vecs = 1;
> >> }
> >>
> >> trace_virtio_transport_alloc_pkt(src_cid, src_port,
> >> @@ -80,7 +85,7 @@ static const struct virtio_transport *virtio_transport_get_ops(void)
> >> return pkt;
> >>
> >> out:
> >> - kfree(pkt->buf);
> >> + kfree(buf);
> >> out_pkt:
> >> kfree(pkt);
> >> return NULL;
> >> @@ -92,6 +97,7 @@ static struct sk_buff *virtio_transport_build_skb(void *opaque)
> >> struct virtio_vsock_pkt *pkt = opaque;
> >> struct af_vsockmon_hdr *hdr;
> >> struct sk_buff *skb;
> >> + int i;
> >>
> >> skb = alloc_skb(sizeof(*hdr) + sizeof(pkt->hdr) + pkt->len,
> >> GFP_ATOMIC);
> >> @@ -134,7 +140,8 @@ static struct sk_buff *virtio_transport_build_skb(void *opaque)
> >> skb_put_data(skb, &pkt->hdr, sizeof(pkt->hdr));
> >>
> >> if (pkt->len) {
> >> - skb_put_data(skb, pkt->buf, pkt->len);
> >> + for (i = 0; i < pkt->nr_vecs; i++)
> >> + skb_put_data(skb, pkt->vec[i].iov_base, pkt->vec[i].iov_len);
> >> }
> >>
> >> return skb;
> >> @@ -260,6 +267,9 @@ static int virtio_transport_send_credit_update(struct vsock_sock *vsk,
> >>
> >> spin_lock_bh(&vvs->rx_lock);
> >> while (total < len && !list_empty(&vvs->rx_queue)) {
> >> + size_t copy_bytes, last_vec_total = 0, vec_off;
> >> + int i;
> >> +
> >> pkt = list_first_entry(&vvs->rx_queue,
> >> struct virtio_vsock_pkt, list);
> >>
> >> @@ -272,14 +282,28 @@ static int virtio_transport_send_credit_update(struct vsock_sock *vsk,
> >> */
> >> spin_unlock_bh(&vvs->rx_lock);
> >>
> >> - err = memcpy_to_msg(msg, pkt->buf + pkt->off, bytes);
> >> - if (err)
> >> - goto out;
> >> + for (i = 0; i < pkt->nr_vecs; i++) {
> >> + if (pkt->off > last_vec_total + pkt->vec[i].iov_len) {
> >> + last_vec_total += pkt->vec[i].iov_len;
> >> + continue;
> >> + }
> >> +
> >> + vec_off = pkt->off - last_vec_total;
> >> + copy_bytes = min(pkt->vec[i].iov_len - vec_off, bytes);
> >> + err = memcpy_to_msg(msg, pkt->vec[i].iov_base + vec_off,
> >> + copy_bytes);
> >> + if (err)
> >> + goto out;
> >> +
> >> + bytes -= copy_bytes;
> >> + pkt->off += copy_bytes;
> >> + total += copy_bytes;
> >> + last_vec_total += pkt->vec[i].iov_len;
> >> + if (!bytes)
> >> + break;
> >> + }
> >>
> >> spin_lock_bh(&vvs->rx_lock);
> >> -
> >> - total += bytes;
> >> - pkt->off += bytes;
> >> if (pkt->off == pkt->len) {
> >> virtio_transport_dec_rx_pkt(vvs, pkt);
> >> list_del(&pkt->list);
> >> @@ -1050,8 +1074,17 @@ void virtio_transport_recv_pkt(struct virtio_vsock_pkt *pkt)
> >>
> >> void virtio_transport_free_pkt(struct virtio_vsock_pkt *pkt)
> >> {
> >> - kfree(pkt->buf);
> >> - kfree(pkt);
> >> + int i;
> >> +
> >> + if (pkt->mergeable) {
> >> + for (i = 0; i < pkt->nr_vecs; i++)
> >> + put_page(virt_to_head_page(pkt->vec[i].iov_base));
> >> + put_page(virt_to_head_page((void *)pkt));
> >> + } else {
> >> + for (i = 0; i < pkt->nr_vecs; i++)
> >> + kfree(pkt->vec[i].iov_base);
> >> + kfree(pkt);
> >> + }
> >> }
> >> EXPORT_SYMBOL_GPL(virtio_transport_free_pkt);
> >>
> >> --
> >> 1.8.3.1
> >>
> >
> > .
> >
>
^ permalink raw reply
* Re: [PATCH v6 0/7] Add virtio-iommu driver
From: Christoph Hellwig @ 2018-12-13 14:17 UTC (permalink / raw)
To: Jean-Philippe Brucker
Cc: Mark Rutland, virtio-dev@lists.oasis-open.org,
tnowicki@caviumnetworks.com, mst@redhat.com, Marc Zyngier,
linux-pci@vger.kernel.org, Joerg Roedel, Will Deacon,
virtualization@lists.linux-foundation.org,
iommu@lists.linux-foundation.org, robh+dt@kernel.org,
kvmarm@lists.cs.columbia.edu, bhelgaas@google.com, Robin Murphy,
devicetree@vger.kernel.org
In-Reply-To: <9110873f-d344-b6b9-c722-9accfc329db2@arm.com>
On Thu, Dec 13, 2018 at 12:50:29PM +0000, Jean-Philippe Brucker wrote:
> * DMA ops for x86 (see "HACK" commit). I'd like to use dma-iommu but I'm
> not sure how to implement the glue that sets dma_ops properly.
http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dma-iommu-ops
^ permalink raw reply
* Re: [virtio-dev] Re: [PATCH v5 5/7] iommu: Add virtio-iommu driver
From: Auger Eric @ 2018-12-13 14:13 UTC (permalink / raw)
To: Robin Murphy, Michael S. Tsirkin, Jean-Philippe Brucker
Cc: mark.rutland, virtio-dev, lorenzo.pieralisi, tnowicki, devicetree,
marc.zyngier, linux-pci, joro, will.deacon, virtualization, iommu,
robh+dt, bhelgaas, kvmarm
In-Reply-To: <1494c026-a763-6c77-2476-a3373c21ae6f@arm.com>
Hi Robin
On 12/13/18 1:37 PM, Robin Murphy wrote:
> On 2018-12-12 3:27 pm, Auger Eric wrote:
>> Hi,
>>
>> On 12/12/18 3:56 PM, Michael S. Tsirkin wrote:
>>> On Fri, Dec 07, 2018 at 06:52:31PM +0000, Jean-Philippe Brucker wrote:
>>>> Sorry for the delay, I wanted to do a little more performance analysis
>>>> before continuing.
>>>>
>>>> On 27/11/2018 18:10, Michael S. Tsirkin wrote:
>>>>> On Tue, Nov 27, 2018 at 05:55:20PM +0000, Jean-Philippe Brucker wrote:
>>>>>>>> + if (!virtio_has_feature(vdev, VIRTIO_F_VERSION_1) ||
>>>>>>>> + !virtio_has_feature(vdev, VIRTIO_IOMMU_F_MAP_UNMAP))
>>>>>>>
>>>>>>> Why bother with a feature bit for this then btw?
>>>>>>
>>>>>> We'll need a new feature bit for sharing page tables with the
>>>>>> hardware,
>>>>>> because they require different requests (attach_table/invalidate
>>>>>> instead
>>>>>> of map/unmap.) A future device supporting page table sharing won't
>>>>>> necessarily need to support map/unmap.
>>>>>>
>>>>> I don't see virtio iommu being extended to support ARM specific
>>>>> requests. This just won't scale, too many different
>>>>> descriptor formats out there.
>>>>
>>>> They aren't really ARM specific requests. The two new requests are
>>>> ATTACH_TABLE and INVALIDATE, which would be used by x86 IOMMUs as well.
>>>>
>>>> Sharing CPU address space with the HW IOMMU (SVM) has been in the scope
>>>> of virtio-iommu since the first RFC, and I've been working with that
>>>> extension in mind since the beginning. As an example you can have a
>>>> look
>>>> at my current draft for this [1], which is inspired from the VFIO work
>>>> we've been doing with Intel.
>>>>
>>>> The negotiation phase inevitably requires vendor-specific fields in the
>>>> descriptors - host tells which formats are supported, guest chooses a
>>>> format and attaches page tables. But invalidation and fault reporting
>>>> descriptors are fairly generic.
>>>
>>> We need to tread carefully here. People expect it that if user does
>>> lspci and sees a virtio device then it's reasonably portable.
>>>
>>>>> If you want to go that way down the road, you should avoid
>>>>> virtio iommu, instead emulate and share code with the ARM SMMU
>>>>> (probably
>>>>> with a different vendor id so you can implement the
>>>>> report on map for devices without PRI).
>>>>
>>>> vSMMU has to stay in userspace though. The main reason we're proposing
>>>> virtio-iommu is that emulating every possible vIOMMU model in the
>>>> kernel
>>>> would be unmaintainable. With virtio-iommu we can process the fast path
>>>> in the host kernel, through vhost-iommu, and do the heavy lifting in
>>>> userspace.
>>>
>>> Interesting.
>>>
>>>> As said above, I'm trying to keep the fast path for
>>>> virtio-iommu generic.
>>>>
>>>> More notes on what I consider to be the fast path, and comparison with
>>>> vSMMU:
>>>>
>>>> (1) The primary use-case we have in mind for vIOMMU is something like
>>>> DPDK in the guest, assigning a hardware device to guest userspace. DPDK
>>>> maps a large amount of memory statically, to be used by a pass-through
>>>> device. For this case I don't think we care about vIOMMU performance.
>>>> Setup and teardown need to be reasonably fast, sure, but the MAP/UNMAP
>>>> requests don't have to be optimal.
>>>>
>>>>
>>>> (2) If the assigned device is owned by the guest kernel, then mappings
>>>> are dynamic and require dma_map/unmap() to be fast, but there generally
>>>> is no need for a vIOMMU, since device and drivers are trusted by the
>>>> guest kernel. Even when the user does enable a vIOMMU for this case
>>>> (allowing to over-commit guest memory, which needs to be pinned
>>>> otherwise),
>>>
>>> BTW that's in theory in practice it doesn't really work.
>>>
>>>> we generally play tricks like lazy TLBI (non-strict mode) to
>>>> make it faster.
>>>
>>> Simple lazy TLB for guest/userspace drivers would be a big no no.
>>> You need something smarter.
>>>
>>>> Here device and drivers are trusted, therefore the
>>>> vulnerability window of lazy mode isn't a concern.
>>>>
>>>> If the reason to enable the vIOMMU is over-comitting guest memory
>>>> however, you can't use nested translation because it requires pinning
>>>> the second-level tables. For this case performance matters a bit,
>>>> because your invalidate-on-map needs to be fast, even if you enable
>>>> lazy
>>>> mode and only receive inval-on-unmap every 10ms. It won't ever be as
>>>> fast as nested translation, though. For this case I think vSMMU+Caching
>>>> Mode and userspace virtio-iommu with MAP/UNMAP would perform similarly
>>>> (given page-sized payloads), because the pagetable walk doesn't add a
>>>> lot of overhead compared to the context switch. But given the results
>>>> below, vhost-iommu would be faster than vSMMU+CM.
>>>>
>>>>
>>>> (3) Then there is SVM. For SVM, any destructive change to the process
>>>> address space requires a synchronous invalidation command to the
>>>> hardware (at least when using PCI ATS). Given that SVM is based on page
>>>> faults, fault reporting from host to guest also needs to be fast, as
>>>> well as fault response from guest to host.
>>>>
>>>> I think this is where performance matters the most. To get a feel of
>>>> the
>>>> advantage we get with virtio-iommu, I compared the vSMMU page-table
>>>> sharing implementation [2] and vhost-iommu + VFIO with page table
>>>> sharing (based on Tomasz Nowicki's vhost-iommu prototype). That's on a
>>>> ThunderX2 with a 10Gb NIC assigned to the guest kernel, which
>>>> corresponds to case (2) above, with nesting page tables and without the
>>>> lazy mode. The host's only job is forwarding invalidation to the HW
>>>> SMMU.
>>>>
>>>> vhost-iommu performed on average 1.8x and 5.5x better than vSMMU on
>>>> netperf TCP_STREAM and TCP_MAERTS respectively (~200 samples). I think
>>>> this can be further optimized (that was still polling under the vq
>>>> lock), and unlike vSMMU, virtio-iommu offers the possibility of
>>>> multi-queue for improved scalability. In addition, the guest will need
>>>> to send both TLB and ATC invalidations with vSMMU, but virtio-iommu
>>>> allows to multiplex those, and to invalidate ranges. Similarly for
>>>> fault
>>>> injection, having the ability to report page faults to the guest from
>>>> the host kernel should be significantly faster than having to go to
>>>> userspace and back to the kernel.
>>>
>>> Fascinating. Any data about host CPU utilization?
>>>
>>> Eric what do you think?
>>>
>>> Is it true that SMMUv3 is fundmentally slow at the architecture level
>>> and so a PV interface will always scale better until
>>> a new hardware interface is designed?
>>
>> As far as I understand the figures above correspond to vhost-iommu
>> against vsmmuv3. In the 2 cases the guest owns stage1 tables so the
>> difference comes from the IOTLB invalidation handling. With vhost we
>> avoid a kernel <-> userspace round trip which may mostly explain the
>> difference.
>>
>> About SMMUv3 issues I already reported one big limitation with respect
>> to hugepage invalidation. See [RFC v2 4/4] iommu/arm-smmu-v3: add
>> CMD_TLBI_NH_VA_AM command for iova range invalidation
>> (https://lkml.org/lkml/2017/8/11/428).
>>
>> At smmuv3 guest driver level, arm_smmu_tlb_inv_range_nosync(), when
>> called with a hugepage size, invalidates each 4K/64K page of the region
>> and not the whole region at once. Each of them are trapped by the SMMUv3
>> device which forwards them to the host. This stalls the guest. This
>> issue can be observed in DPDK case - not the use case benchmarked
>> above - .
>>
>> I raised this point again in recent discussions and it is unclear
>> whether this is an SMMUv3 driver limitation or an architecture
>> limitation. Seems a single invalidation within the block mapping should
>> invalidate the whole mapping at HW level. In the past I hacked a
>> workaround by defining an implementation defined invalidation command.
>>
>> Robin/Will, could you please explain the rationale behind the
>> arm_smmu_tlb_inv_range_nosync() implementation.
>
> Fundamentally, TLBI commands only take an address, so invalidations have
> to match the actual leaf PTEs being removed. If iommu_unmap() sees that
> 2MB is a valid block size, it may send a single "unmap this 2MB" request
> to the driver, but nobody knows how that region is actually mapped until
> the pagetable code walks the tables. If it does find a 2MB block PTE,
> then it can simply clear it and generate a single invalidation command -
> if a TLB entry exists for that block mapping then any address within the
> block will match it. If however that 2MB region was actually covered by
> a subtable of 4KB pages, then separate TLB entries may exist for any or
> all of those pages, and a single address can at best only match one of
> them. Thus after the table PTE is cleared, a separate command has to be
> generated for each page to ensure that all possible TLB entries
> associated with that table are invalidated.
>
> So if you're seeing page-granularity invalidation, it means that the
> thing you're unmapping wasn't actually mapped as that size of hugepage
> in the first place (or something pathological has caused it to be split
> in the meantime - I recall we once had an amusing bug which caused VFIO
> to do that on teardown). There is one suboptimal case if we're taking
> out potentially multiple levels of table at once (e.g. a 1GB region),
> where we don't bother recursing down the removed table to see whether
> anything was mapped with blocks at the intermediate level, and just
> invalidate the whole lot at page granularity to cover the worst-case
> scenario. I think we assumed that sort of unmap would be unlikely enough
> that it wasn't worth optimising for at the time, but there's certainly
> scope to improve it if unmapping 1GB worth of 2MB blocks turns out to be
> a common thing.
thank you for your reply. This last situation looks like the one I
encountered. I will test with DPDK again and let you know.
Thanks
Eric
>
> FWIW SMMUv3.2 has introduced actual range-invalidation commands -
> reworking all the TLBI logic in the driver to make worthwhile use of
> those is on the to-do list, but until real 3.2 hardware starts coming to
> light I've not been prioritising it.
>
> Robin.
>
>>
>> Thanks
>>
>> Eric
>>
>>
>>
>>>
>>>
>>>>
>>>> (4) Virtio and vhost endpoints weren't really a priority for the base
>>>> virtio-iommu device, we were looking mainly at device pass-through. I
>>>> have optimizations in mind for this, although a lot of them are
>>>> based on
>>>> page tables, not MAP/UNMAP requests. But just getting the vIOMMU closer
>>>> to vhost devices, avoiding the trip to userspace through vhost-tlb,
>>>> should already improve things.
>>>>
>>>> The important difference when DMA is done by software is that you don't
>>>> need to mirror all mappings into the HW IOMMU - you don't need
>>>> inval-on-map. The endpoint can ask the vIOMMU for mappings when it
>>>> needs
>>>> them, like vhost-iotlb does for example. So the MAP/UNMAP interface of
>>>> virtio-iommu performs poorly for emulated/PV endpoints compared to an
>>>> emulated IOMMU, since it requires three context switches for DMA
>>>> (MAP/DMA/UNMAP) between host and guest, rather than two (DMA/INVAL).
>>>> There is a feature I call "posted MAP", that avoids the kick on MAP and
>>>> instead lets the device fetch the MAP request on TLB miss, but I
>>>> haven't
>>>> spent enough time experimenting with this.
>>>>
>>>>> Others on the TC might feel differently.
>>>>>
>>>>> If someone's looking into adding virtio iommu support in hardware,
>>>>> that's a different matter. Which is it?
>>>>
>>>> I'm not aware of anything like that, and suspect that no one would
>>>> consider it until virtio-iommu is more widely adopted.
>>>>
>>>> Thanks,
>>>> Jean
>>>>
>>>>
>>>> [1] Diff between current spec and page table sharing draft
>>>> (Very rough, missing page fault support and I'd like to rework the
>>>> PASID model a bit, but table descriptors p.24-26 for both Arm
>>>> SMMUv2 and SMMUv3.)
>>>>
>>>> http://jpbrucker.net/virtio-iommu/spec-table/diffs/virtio-iommu-pdf-diff-v0.9-v0.10.dev03.pdf
>>>>
>>>>
>>>> [2] [RFC v2 00/28] vSMMUv3/pSMMUv3 2 stage VFIO integration
>>>> https://www.mail-archive.com/qemu-devel@nongnu.org/msg562369.html
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* [PATCH] drm/virtio: switch to generic fbdev emulation
From: Gerd Hoffmann @ 2018-12-13 13:49 UTC (permalink / raw)
To: dri-devel; +Cc: David Airlie, open list, open list:VIRTIO GPU DRIVER
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
---
drivers/gpu/drm/virtio/virtgpu_drv.h | 14 ---
drivers/gpu/drm/virtio/virtgpu_display.c | 1 -
drivers/gpu/drm/virtio/virtgpu_drv.c | 9 +-
drivers/gpu/drm/virtio/virtgpu_fb.c | 191 -------------------------------
drivers/gpu/drm/virtio/virtgpu_kms.c | 8 --
5 files changed, 8 insertions(+), 215 deletions(-)
diff --git a/drivers/gpu/drm/virtio/virtgpu_drv.h b/drivers/gpu/drm/virtio/virtgpu_drv.h
index 1deb41d42e..63704915f8 100644
--- a/drivers/gpu/drm/virtio/virtgpu_drv.h
+++ b/drivers/gpu/drm/virtio/virtgpu_drv.h
@@ -137,19 +137,10 @@ struct virtio_gpu_framebuffer {
#define to_virtio_gpu_framebuffer(x) \
container_of(x, struct virtio_gpu_framebuffer, base)
-struct virtio_gpu_fbdev {
- struct drm_fb_helper helper;
- struct virtio_gpu_framebuffer vgfb;
- struct virtio_gpu_device *vgdev;
- struct delayed_work work;
-};
-
struct virtio_gpu_mman {
struct ttm_bo_device bdev;
};
-struct virtio_gpu_fbdev;
-
struct virtio_gpu_queue {
struct virtqueue *vq;
spinlock_t qlock;
@@ -180,8 +171,6 @@ struct virtio_gpu_device {
struct virtio_gpu_mman mman;
- /* pointer to fbdev info structure */
- struct virtio_gpu_fbdev *vgfbdev;
struct virtio_gpu_output outputs[VIRTIO_GPU_MAX_SCANOUTS];
uint32_t num_scanouts;
@@ -249,9 +238,6 @@ int virtio_gpu_mode_dumb_mmap(struct drm_file *file_priv,
uint32_t handle, uint64_t *offset_p);
/* virtio_fb */
-#define VIRTIO_GPUFB_CONN_LIMIT 1
-int virtio_gpu_fbdev_init(struct virtio_gpu_device *vgdev);
-void virtio_gpu_fbdev_fini(struct virtio_gpu_device *vgdev);
int virtio_gpu_surface_dirty(struct virtio_gpu_framebuffer *qfb,
struct drm_clip_rect *clips,
unsigned int num_clips);
diff --git a/drivers/gpu/drm/virtio/virtgpu_display.c b/drivers/gpu/drm/virtio/virtgpu_display.c
index b5580b11a0..e1c223e18d 100644
--- a/drivers/gpu/drm/virtio/virtgpu_display.c
+++ b/drivers/gpu/drm/virtio/virtgpu_display.c
@@ -390,6 +390,5 @@ void virtio_gpu_modeset_fini(struct virtio_gpu_device *vgdev)
for (i = 0 ; i < vgdev->num_scanouts; ++i)
kfree(vgdev->outputs[i].edid);
- virtio_gpu_fbdev_fini(vgdev);
drm_mode_config_cleanup(vgdev->ddev);
}
diff --git a/drivers/gpu/drm/virtio/virtgpu_drv.c b/drivers/gpu/drm/virtio/virtgpu_drv.c
index f7f32a885a..7df50920c1 100644
--- a/drivers/gpu/drm/virtio/virtgpu_drv.c
+++ b/drivers/gpu/drm/virtio/virtgpu_drv.c
@@ -42,13 +42,20 @@ module_param_named(modeset, virtio_gpu_modeset, int, 0400);
static int virtio_gpu_probe(struct virtio_device *vdev)
{
+ int ret;
+
if (vgacon_text_force() && virtio_gpu_modeset == -1)
return -EINVAL;
if (virtio_gpu_modeset == 0)
return -EINVAL;
- return drm_virtio_init(&driver, vdev);
+ ret = drm_virtio_init(&driver, vdev);
+ if (ret)
+ return ret;
+
+ drm_fbdev_generic_setup(vdev->priv, 32);
+ return 0;
}
static void virtio_gpu_remove(struct virtio_device *vdev)
diff --git a/drivers/gpu/drm/virtio/virtgpu_fb.c b/drivers/gpu/drm/virtio/virtgpu_fb.c
index fb1cc8b2f1..b07584b1c2 100644
--- a/drivers/gpu/drm/virtio/virtgpu_fb.c
+++ b/drivers/gpu/drm/virtio/virtgpu_fb.c
@@ -27,8 +27,6 @@
#include <drm/drm_fb_helper.h>
#include "virtgpu_drv.h"
-#define VIRTIO_GPU_FBCON_POLL_PERIOD (HZ / 60)
-
static int virtio_gpu_dirty_update(struct virtio_gpu_framebuffer *fb,
bool store, int x, int y,
int width, int height)
@@ -150,192 +148,3 @@ int virtio_gpu_surface_dirty(struct virtio_gpu_framebuffer *vgfb,
left, top, right - left, bottom - top);
return 0;
}
-
-static void virtio_gpu_fb_dirty_work(struct work_struct *work)
-{
- struct delayed_work *delayed_work = to_delayed_work(work);
- struct virtio_gpu_fbdev *vfbdev =
- container_of(delayed_work, struct virtio_gpu_fbdev, work);
- struct virtio_gpu_framebuffer *vgfb = &vfbdev->vgfb;
-
- virtio_gpu_dirty_update(&vfbdev->vgfb, false, vgfb->x1, vgfb->y1,
- vgfb->x2 - vgfb->x1, vgfb->y2 - vgfb->y1);
-}
-
-static void virtio_gpu_3d_fillrect(struct fb_info *info,
- const struct fb_fillrect *rect)
-{
- struct virtio_gpu_fbdev *vfbdev = info->par;
-
- drm_fb_helper_sys_fillrect(info, rect);
- virtio_gpu_dirty_update(&vfbdev->vgfb, true, rect->dx, rect->dy,
- rect->width, rect->height);
- schedule_delayed_work(&vfbdev->work, VIRTIO_GPU_FBCON_POLL_PERIOD);
-}
-
-static void virtio_gpu_3d_copyarea(struct fb_info *info,
- const struct fb_copyarea *area)
-{
- struct virtio_gpu_fbdev *vfbdev = info->par;
-
- drm_fb_helper_sys_copyarea(info, area);
- virtio_gpu_dirty_update(&vfbdev->vgfb, true, area->dx, area->dy,
- area->width, area->height);
- schedule_delayed_work(&vfbdev->work, VIRTIO_GPU_FBCON_POLL_PERIOD);
-}
-
-static void virtio_gpu_3d_imageblit(struct fb_info *info,
- const struct fb_image *image)
-{
- struct virtio_gpu_fbdev *vfbdev = info->par;
-
- drm_fb_helper_sys_imageblit(info, image);
- virtio_gpu_dirty_update(&vfbdev->vgfb, true, image->dx, image->dy,
- image->width, image->height);
- schedule_delayed_work(&vfbdev->work, VIRTIO_GPU_FBCON_POLL_PERIOD);
-}
-
-static struct fb_ops virtio_gpufb_ops = {
- .owner = THIS_MODULE,
- DRM_FB_HELPER_DEFAULT_OPS,
- .fb_fillrect = virtio_gpu_3d_fillrect,
- .fb_copyarea = virtio_gpu_3d_copyarea,
- .fb_imageblit = virtio_gpu_3d_imageblit,
-};
-
-static int virtio_gpufb_create(struct drm_fb_helper *helper,
- struct drm_fb_helper_surface_size *sizes)
-{
- struct virtio_gpu_fbdev *vfbdev =
- container_of(helper, struct virtio_gpu_fbdev, helper);
- struct drm_device *dev = helper->dev;
- struct virtio_gpu_device *vgdev = dev->dev_private;
- struct fb_info *info;
- struct drm_framebuffer *fb;
- struct drm_mode_fb_cmd2 mode_cmd = {};
- struct virtio_gpu_object *obj;
- uint32_t format, size;
- int ret;
-
- mode_cmd.width = sizes->surface_width;
- mode_cmd.height = sizes->surface_height;
- mode_cmd.pitches[0] = mode_cmd.width * 4;
- mode_cmd.pixel_format = DRM_FORMAT_HOST_XRGB8888;
-
- format = virtio_gpu_translate_format(mode_cmd.pixel_format);
- if (format == 0)
- return -EINVAL;
-
- size = mode_cmd.pitches[0] * mode_cmd.height;
- obj = virtio_gpu_alloc_object(dev, size, false, true);
- if (IS_ERR(obj))
- return PTR_ERR(obj);
-
- virtio_gpu_cmd_create_resource(vgdev, obj, format,
- mode_cmd.width, mode_cmd.height);
-
- ret = virtio_gpu_object_kmap(obj);
- if (ret) {
- DRM_ERROR("failed to kmap fb %d\n", ret);
- goto err_obj_vmap;
- }
-
- /* attach the object to the resource */
- ret = virtio_gpu_object_attach(vgdev, obj, NULL);
- if (ret)
- goto err_obj_attach;
-
- info = drm_fb_helper_alloc_fbi(helper);
- if (IS_ERR(info)) {
- ret = PTR_ERR(info);
- goto err_fb_alloc;
- }
-
- info->par = helper;
-
- ret = virtio_gpu_framebuffer_init(dev, &vfbdev->vgfb,
- &mode_cmd, &obj->gem_base);
- if (ret)
- goto err_fb_alloc;
-
- fb = &vfbdev->vgfb.base;
-
- vfbdev->helper.fb = fb;
-
- strcpy(info->fix.id, "virtiodrmfb");
- info->fbops = &virtio_gpufb_ops;
- info->pixmap.flags = FB_PIXMAP_SYSTEM;
-
- info->screen_buffer = obj->vmap;
- info->screen_size = obj->gem_base.size;
- drm_fb_helper_fill_fix(info, fb->pitches[0], fb->format->depth);
- drm_fb_helper_fill_var(info, &vfbdev->helper,
- sizes->fb_width, sizes->fb_height);
-
- info->fix.mmio_start = 0;
- info->fix.mmio_len = 0;
- return 0;
-
-err_fb_alloc:
- virtio_gpu_object_detach(vgdev, obj);
-err_obj_attach:
-err_obj_vmap:
- virtio_gpu_gem_free_object(&obj->gem_base);
- return ret;
-}
-
-static int virtio_gpu_fbdev_destroy(struct drm_device *dev,
- struct virtio_gpu_fbdev *vgfbdev)
-{
- struct virtio_gpu_framebuffer *vgfb = &vgfbdev->vgfb;
-
- drm_fb_helper_unregister_fbi(&vgfbdev->helper);
-
- if (vgfb->base.obj[0])
- vgfb->base.obj[0] = NULL;
- drm_fb_helper_fini(&vgfbdev->helper);
- drm_framebuffer_cleanup(&vgfb->base);
-
- return 0;
-}
-static const struct drm_fb_helper_funcs virtio_gpu_fb_helper_funcs = {
- .fb_probe = virtio_gpufb_create,
-};
-
-int virtio_gpu_fbdev_init(struct virtio_gpu_device *vgdev)
-{
- struct virtio_gpu_fbdev *vgfbdev;
- int bpp_sel = 32; /* TODO: parameter from somewhere? */
- int ret;
-
- vgfbdev = kzalloc(sizeof(struct virtio_gpu_fbdev), GFP_KERNEL);
- if (!vgfbdev)
- return -ENOMEM;
-
- vgfbdev->vgdev = vgdev;
- vgdev->vgfbdev = vgfbdev;
- INIT_DELAYED_WORK(&vgfbdev->work, virtio_gpu_fb_dirty_work);
-
- drm_fb_helper_prepare(vgdev->ddev, &vgfbdev->helper,
- &virtio_gpu_fb_helper_funcs);
- ret = drm_fb_helper_init(vgdev->ddev, &vgfbdev->helper,
- VIRTIO_GPUFB_CONN_LIMIT);
- if (ret) {
- kfree(vgfbdev);
- return ret;
- }
-
- drm_fb_helper_single_add_all_connectors(&vgfbdev->helper);
- drm_fb_helper_initial_config(&vgfbdev->helper, bpp_sel);
- return 0;
-}
-
-void virtio_gpu_fbdev_fini(struct virtio_gpu_device *vgdev)
-{
- if (!vgdev->vgfbdev)
- return;
-
- virtio_gpu_fbdev_destroy(vgdev->ddev, vgdev->vgfbdev);
- kfree(vgdev->vgfbdev);
- vgdev->vgfbdev = NULL;
-}
diff --git a/drivers/gpu/drm/virtio/virtgpu_kms.c b/drivers/gpu/drm/virtio/virtgpu_kms.c
index 3af6181c05..1072064a0d 100644
--- a/drivers/gpu/drm/virtio/virtgpu_kms.c
+++ b/drivers/gpu/drm/virtio/virtgpu_kms.c
@@ -28,11 +28,6 @@
#include <drm/drmP.h>
#include "virtgpu_drv.h"
-static int virtio_gpu_fbdev = 1;
-
-MODULE_PARM_DESC(fbdev, "Disable/Enable framebuffer device & console");
-module_param_named(fbdev, virtio_gpu_fbdev, int, 0400);
-
static void virtio_gpu_config_changed_work_func(struct work_struct *work)
{
struct virtio_gpu_device *vgdev =
@@ -212,9 +207,6 @@ int virtio_gpu_driver_load(struct drm_device *dev, unsigned long flags)
virtio_gpu_cmd_get_display_info(vgdev);
wait_event_timeout(vgdev->resp_wq, !vgdev->display_info_pending,
5 * HZ);
- if (virtio_gpu_fbdev)
- virtio_gpu_fbdev_init(vgdev);
-
return 0;
err_modeset:
--
2.9.3
^ permalink raw reply related
* Re: [PATCH v6 0/7] Add virtio-iommu driver
From: Jean-Philippe Brucker @ 2018-12-13 12:50 UTC (permalink / raw)
To: Joerg Roedel
Cc: Mark Rutland, virtio-dev@lists.oasis-open.org, Lorenzo Pieralisi,
tnowicki@caviumnetworks.com, mst@redhat.com, Marc Zyngier,
linux-pci@vger.kernel.org, Will Deacon, Robin Murphy,
virtualization@lists.linux-foundation.org, eric.auger@redhat.com,
iommu@lists.linux-foundation.org, robh+dt@kernel.org,
bhelgaas@google.com, kvmarm@lists.cs.columbia.edu,
devicetree@vger.kernel.org
In-Reply-To: <20181212103545.GV16835@8bytes.org>
Hi Joerg,
On 12/12/2018 10:35, Joerg Roedel wrote:
> Hi,
>
> to make progress on this, we should first agree on the protocol used
> between guest and host. I have a few points to discuss on the protocol
> first.
>
> On Tue, Dec 11, 2018 at 06:20:57PM +0000, Jean-Philippe Brucker wrote:
>> [1] Virtio-iommu specification v0.9, sources and pdf
>> git://linux-arm.org/virtio-iommu.git virtio-iommu/v0.9
>> http://jpbrucker.net/virtio-iommu/spec/v0.9/virtio-iommu-v0.9.pdf
>
> Looking at this I wonder why it doesn't make the IOTLB visible to the
> guest. the UNMAP requests seem to require that the TLB is already
> flushed to make the unmap visible.
>
> I think that will cost significant performance for both, vfio and
> dma-iommu use-cases which both do (vfio at least to some degree),
> deferred flushing.
We already do deferred flush: UNMAP requests are added to the queue by
iommu_unmap(), and then flushed out by iotlb_sync(). So we switch to the
host only on iotlb_sync(), or when the request queue is full.
> I also wonder whether the protocol should implement a
> protocol version handshake and iommu-feature set queries.
With the virtio transport there is a handshake when the device (IOMMU)
is initialized, through feature bits and global config fields. Feature
bits are made of both transport-specific features, including the version
number, and device-specific features defined in section 2.3 of the above
document (the transport is described in the virtio 1.0 specification).
The device presents features that it supports in a register, and the
driver masks out the feature bits that it doesn't support. Then the
driver sets the global status to FEATURES_OK and initialization continues.
In addition virtio-iommu has per-endpoint features through the PROBE
request, since the vIOMMU may manage hardware (VFIO) and software
(virtio) endpoints at the same time, which don't have the same DMA
capabilities (different IOVA ranges, page granularity, reserved ranges,
pgtable sharing, etc). At the moment this is a one-way probe, not a
handshake. The device simply fills the properties of each endpoint, but
the driver doesn't have to ack them. Initially there was a way to
negotiate each PROBE property but it was deemed unnecessary during
review. By leaving a few spare bits in the property headers I made sure
it can be added back with a feature bit if we ever need it.
>> [3] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.9.1
>> git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.9
>
> Unfortunatly gitweb seems to be broken on linux-arm.org. What is missing
> in this patch-set to make this work on x86?
You should be able to access it here:
http://www.linux-arm.org/git?p=linux-jpb.git;a=shortlog;h=refs/heads/virtio-iommu/devel
That branch contains missing bits for x86 support:
* ACPI support. We have the code but it's waiting for an IORT spec
update, to reserve the IORT node ID. I expect it to take a while, given
that I'm alone requesting a change for something that's not upstream or
in hardware.
* DMA ops for x86 (see "HACK" commit). I'd like to use dma-iommu but I'm
not sure how to implement the glue that sets dma_ops properly.
Thanks,
Jean
^ permalink raw reply
* Re: [virtio-dev] Re: [PATCH v5 5/7] iommu: Add virtio-iommu driver
From: Robin Murphy @ 2018-12-13 12:37 UTC (permalink / raw)
To: Auger Eric, Michael S. Tsirkin, Jean-Philippe Brucker
Cc: mark.rutland, virtio-dev, lorenzo.pieralisi, tnowicki, devicetree,
marc.zyngier, linux-pci, joro, will.deacon, virtualization, iommu,
robh+dt, bhelgaas, kvmarm
In-Reply-To: <eb267a7c-a881-afd9-9ce2-bcdc317d16d7@redhat.com>
On 2018-12-12 3:27 pm, Auger Eric wrote:
> Hi,
>
> On 12/12/18 3:56 PM, Michael S. Tsirkin wrote:
>> On Fri, Dec 07, 2018 at 06:52:31PM +0000, Jean-Philippe Brucker wrote:
>>> Sorry for the delay, I wanted to do a little more performance analysis
>>> before continuing.
>>>
>>> On 27/11/2018 18:10, Michael S. Tsirkin wrote:
>>>> On Tue, Nov 27, 2018 at 05:55:20PM +0000, Jean-Philippe Brucker wrote:
>>>>>>> + if (!virtio_has_feature(vdev, VIRTIO_F_VERSION_1) ||
>>>>>>> + !virtio_has_feature(vdev, VIRTIO_IOMMU_F_MAP_UNMAP))
>>>>>>
>>>>>> Why bother with a feature bit for this then btw?
>>>>>
>>>>> We'll need a new feature bit for sharing page tables with the hardware,
>>>>> because they require different requests (attach_table/invalidate instead
>>>>> of map/unmap.) A future device supporting page table sharing won't
>>>>> necessarily need to support map/unmap.
>>>>>
>>>> I don't see virtio iommu being extended to support ARM specific
>>>> requests. This just won't scale, too many different
>>>> descriptor formats out there.
>>>
>>> They aren't really ARM specific requests. The two new requests are
>>> ATTACH_TABLE and INVALIDATE, which would be used by x86 IOMMUs as well.
>>>
>>> Sharing CPU address space with the HW IOMMU (SVM) has been in the scope
>>> of virtio-iommu since the first RFC, and I've been working with that
>>> extension in mind since the beginning. As an example you can have a look
>>> at my current draft for this [1], which is inspired from the VFIO work
>>> we've been doing with Intel.
>>>
>>> The negotiation phase inevitably requires vendor-specific fields in the
>>> descriptors - host tells which formats are supported, guest chooses a
>>> format and attaches page tables. But invalidation and fault reporting
>>> descriptors are fairly generic.
>>
>> We need to tread carefully here. People expect it that if user does
>> lspci and sees a virtio device then it's reasonably portable.
>>
>>>> If you want to go that way down the road, you should avoid
>>>> virtio iommu, instead emulate and share code with the ARM SMMU (probably
>>>> with a different vendor id so you can implement the
>>>> report on map for devices without PRI).
>>>
>>> vSMMU has to stay in userspace though. The main reason we're proposing
>>> virtio-iommu is that emulating every possible vIOMMU model in the kernel
>>> would be unmaintainable. With virtio-iommu we can process the fast path
>>> in the host kernel, through vhost-iommu, and do the heavy lifting in
>>> userspace.
>>
>> Interesting.
>>
>>> As said above, I'm trying to keep the fast path for
>>> virtio-iommu generic.
>>>
>>> More notes on what I consider to be the fast path, and comparison with
>>> vSMMU:
>>>
>>> (1) The primary use-case we have in mind for vIOMMU is something like
>>> DPDK in the guest, assigning a hardware device to guest userspace. DPDK
>>> maps a large amount of memory statically, to be used by a pass-through
>>> device. For this case I don't think we care about vIOMMU performance.
>>> Setup and teardown need to be reasonably fast, sure, but the MAP/UNMAP
>>> requests don't have to be optimal.
>>>
>>>
>>> (2) If the assigned device is owned by the guest kernel, then mappings
>>> are dynamic and require dma_map/unmap() to be fast, but there generally
>>> is no need for a vIOMMU, since device and drivers are trusted by the
>>> guest kernel. Even when the user does enable a vIOMMU for this case
>>> (allowing to over-commit guest memory, which needs to be pinned
>>> otherwise),
>>
>> BTW that's in theory in practice it doesn't really work.
>>
>>> we generally play tricks like lazy TLBI (non-strict mode) to
>>> make it faster.
>>
>> Simple lazy TLB for guest/userspace drivers would be a big no no.
>> You need something smarter.
>>
>>> Here device and drivers are trusted, therefore the
>>> vulnerability window of lazy mode isn't a concern.
>>>
>>> If the reason to enable the vIOMMU is over-comitting guest memory
>>> however, you can't use nested translation because it requires pinning
>>> the second-level tables. For this case performance matters a bit,
>>> because your invalidate-on-map needs to be fast, even if you enable lazy
>>> mode and only receive inval-on-unmap every 10ms. It won't ever be as
>>> fast as nested translation, though. For this case I think vSMMU+Caching
>>> Mode and userspace virtio-iommu with MAP/UNMAP would perform similarly
>>> (given page-sized payloads), because the pagetable walk doesn't add a
>>> lot of overhead compared to the context switch. But given the results
>>> below, vhost-iommu would be faster than vSMMU+CM.
>>>
>>>
>>> (3) Then there is SVM. For SVM, any destructive change to the process
>>> address space requires a synchronous invalidation command to the
>>> hardware (at least when using PCI ATS). Given that SVM is based on page
>>> faults, fault reporting from host to guest also needs to be fast, as
>>> well as fault response from guest to host.
>>>
>>> I think this is where performance matters the most. To get a feel of the
>>> advantage we get with virtio-iommu, I compared the vSMMU page-table
>>> sharing implementation [2] and vhost-iommu + VFIO with page table
>>> sharing (based on Tomasz Nowicki's vhost-iommu prototype). That's on a
>>> ThunderX2 with a 10Gb NIC assigned to the guest kernel, which
>>> corresponds to case (2) above, with nesting page tables and without the
>>> lazy mode. The host's only job is forwarding invalidation to the HW SMMU.
>>>
>>> vhost-iommu performed on average 1.8x and 5.5x better than vSMMU on
>>> netperf TCP_STREAM and TCP_MAERTS respectively (~200 samples). I think
>>> this can be further optimized (that was still polling under the vq
>>> lock), and unlike vSMMU, virtio-iommu offers the possibility of
>>> multi-queue for improved scalability. In addition, the guest will need
>>> to send both TLB and ATC invalidations with vSMMU, but virtio-iommu
>>> allows to multiplex those, and to invalidate ranges. Similarly for fault
>>> injection, having the ability to report page faults to the guest from
>>> the host kernel should be significantly faster than having to go to
>>> userspace and back to the kernel.
>>
>> Fascinating. Any data about host CPU utilization?
>>
>> Eric what do you think?
>>
>> Is it true that SMMUv3 is fundmentally slow at the architecture level
>> and so a PV interface will always scale better until
>> a new hardware interface is designed?
>
> As far as I understand the figures above correspond to vhost-iommu
> against vsmmuv3. In the 2 cases the guest owns stage1 tables so the
> difference comes from the IOTLB invalidation handling. With vhost we
> avoid a kernel <-> userspace round trip which may mostly explain the
> difference.
>
> About SMMUv3 issues I already reported one big limitation with respect
> to hugepage invalidation. See [RFC v2 4/4] iommu/arm-smmu-v3: add
> CMD_TLBI_NH_VA_AM command for iova range invalidation
> (https://lkml.org/lkml/2017/8/11/428).
>
> At smmuv3 guest driver level, arm_smmu_tlb_inv_range_nosync(), when
> called with a hugepage size, invalidates each 4K/64K page of the region
> and not the whole region at once. Each of them are trapped by the SMMUv3
> device which forwards them to the host. This stalls the guest. This
> issue can be observed in DPDK case - not the use case benchmarked above - .
>
> I raised this point again in recent discussions and it is unclear
> whether this is an SMMUv3 driver limitation or an architecture
> limitation. Seems a single invalidation within the block mapping should
> invalidate the whole mapping at HW level. In the past I hacked a
> workaround by defining an implementation defined invalidation command.
>
> Robin/Will, could you please explain the rationale behind the
> arm_smmu_tlb_inv_range_nosync() implementation.
Fundamentally, TLBI commands only take an address, so invalidations have
to match the actual leaf PTEs being removed. If iommu_unmap() sees that
2MB is a valid block size, it may send a single "unmap this 2MB" request
to the driver, but nobody knows how that region is actually mapped until
the pagetable code walks the tables. If it does find a 2MB block PTE,
then it can simply clear it and generate a single invalidation command -
if a TLB entry exists for that block mapping then any address within the
block will match it. If however that 2MB region was actually covered by
a subtable of 4KB pages, then separate TLB entries may exist for any or
all of those pages, and a single address can at best only match one of
them. Thus after the table PTE is cleared, a separate command has to be
generated for each page to ensure that all possible TLB entries
associated with that table are invalidated.
So if you're seeing page-granularity invalidation, it means that the
thing you're unmapping wasn't actually mapped as that size of hugepage
in the first place (or something pathological has caused it to be split
in the meantime - I recall we once had an amusing bug which caused VFIO
to do that on teardown). There is one suboptimal case if we're taking
out potentially multiple levels of table at once (e.g. a 1GB region),
where we don't bother recursing down the removed table to see whether
anything was mapped with blocks at the intermediate level, and just
invalidate the whole lot at page granularity to cover the worst-case
scenario. I think we assumed that sort of unmap would be unlikely enough
that it wasn't worth optimising for at the time, but there's certainly
scope to improve it if unmapping 1GB worth of 2MB blocks turns out to be
a common thing.
FWIW SMMUv3.2 has introduced actual range-invalidation commands -
reworking all the TLBI logic in the driver to make worthwhile use of
those is on the to-do list, but until real 3.2 hardware starts coming to
light I've not been prioritising it.
Robin.
>
> Thanks
>
> Eric
>
>
>
>>
>>
>>>
>>> (4) Virtio and vhost endpoints weren't really a priority for the base
>>> virtio-iommu device, we were looking mainly at device pass-through. I
>>> have optimizations in mind for this, although a lot of them are based on
>>> page tables, not MAP/UNMAP requests. But just getting the vIOMMU closer
>>> to vhost devices, avoiding the trip to userspace through vhost-tlb,
>>> should already improve things.
>>>
>>> The important difference when DMA is done by software is that you don't
>>> need to mirror all mappings into the HW IOMMU - you don't need
>>> inval-on-map. The endpoint can ask the vIOMMU for mappings when it needs
>>> them, like vhost-iotlb does for example. So the MAP/UNMAP interface of
>>> virtio-iommu performs poorly for emulated/PV endpoints compared to an
>>> emulated IOMMU, since it requires three context switches for DMA
>>> (MAP/DMA/UNMAP) between host and guest, rather than two (DMA/INVAL).
>>> There is a feature I call "posted MAP", that avoids the kick on MAP and
>>> instead lets the device fetch the MAP request on TLB miss, but I haven't
>>> spent enough time experimenting with this.
>>>
>>>> Others on the TC might feel differently.
>>>>
>>>> If someone's looking into adding virtio iommu support in hardware,
>>>> that's a different matter. Which is it?
>>>
>>> I'm not aware of anything like that, and suspect that no one would
>>> consider it until virtio-iommu is more widely adopted.
>>>
>>> Thanks,
>>> Jean
>>>
>>>
>>> [1] Diff between current spec and page table sharing draft
>>> (Very rough, missing page fault support and I'd like to rework the
>>> PASID model a bit, but table descriptors p.24-26 for both Arm
>>> SMMUv2 and SMMUv3.)
>>>
>>> http://jpbrucker.net/virtio-iommu/spec-table/diffs/virtio-iommu-pdf-diff-v0.9-v0.10.dev03.pdf
>>>
>>> [2] [RFC v2 00/28] vSMMUv3/pSMMUv3 2 stage VFIO integration
>>> https://www.mail-archive.com/qemu-devel@nongnu.org/msg562369.html
^ permalink raw reply
* Re: [PATCH] kbuild, x86: revert macros in extended asm workarounds
From: Peter Zijlstra @ 2018-12-13 10:51 UTC (permalink / raw)
To: Masahiro Yamada
Cc: linux-arch, Juergen Gross, Michal Marek, Richard Biener,
Arnd Bergmann, Segher Boessenkool, linux-kbuild, x86,
linux-kernel, virtualization, linux-sparse, Ingo Molnar,
Borislav Petkov, Andy Lutomirski, H . Peter Anvin, Sedat Dilek,
Nadav Amit, Thomas Gleixner, Logan Gunthorpe, Alok Kataria,
Luc Van Oostenryck
In-Reply-To: <1544692661-9455-1-git-send-email-yamada.masahiro@socionext.com>
On Thu, Dec 13, 2018 at 06:17:41PM +0900, Masahiro Yamada wrote:
> Revert the following commits:
>
> - 5bdcd510c2ac9efaf55c4cbd8d46421d8e2320cd
> ("x86/jump-labels: Macrofy inline assembly code to work around GCC inlining bugs")
>
> - d5a581d84ae6b8a4a740464b80d8d9cf1e7947b2
> ("x86/cpufeature: Macrofy inline assembly code to work around GCC inlining bugs")
>
> - 0474d5d9d2f7f3b11262f7bf87d0e7314ead9200.
> ("x86/extable: Macrofy inline assembly code to work around GCC inlining bugs")
>
> - 494b5168f2de009eb80f198f668da374295098dd.
> ("x86/paravirt: Work around GCC inlining bugs when compiling paravirt ops")
>
> - f81f8ad56fd1c7b99b2ed1c314527f7d9ac447c6.
> ("x86/bug: Macrofy the BUG table section handling, to work around GCC inlining bugs")
>
> - 77f48ec28e4ccff94d2e5f4260a83ac27a7f3099.
> ("x86/alternatives: Macrofy lock prefixes to work around GCC inlining bugs")
>
> - 9e1725b410594911cc5981b6c7b4cea4ec054ca8.
> ("x86/refcount: Work around GCC inlining bug")
> (Conflicts: arch/x86/include/asm/refcount.h)
>
> - c06c4d8090513f2974dfdbed2ac98634357ac475.
> ("x86/objtool: Use asm macros to work around GCC inlining bugs")
>
> - 77b0bf55bc675233d22cd5df97605d516d64525e.
> ("kbuild/Makefile: Prepare for using macros in inline assembly code to work around asm() related GCC inlining bugs")
>
I don't think we want to blindly revert all that. Some of them actually
made sense and did clean up things irrespective of the asm-inline issue.
In particular I like the jump-label one. The cpufeature one OTOh, yeah,
I'd love to get that reverted.
And as a note; the normal commit quoting style is:
d5a581d84ae6 ("x86/cpufeature: Macrofy inline assembly code to work around GCC inlining bugs")
^ permalink raw reply
* [PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address
From: Jason Wang @ 2018-12-13 10:10 UTC (permalink / raw)
To: mst, jasowang, kvm, virtualization, netdev, linux-kernel
In-Reply-To: <20181213101022.12475-1-jasowang@redhat.com>
It was noticed that the copy_user() friends that was used to access
virtqueue metdata tends to be very expensive for dataplane
implementation like vhost since it involves lots of software check,
speculation barrier, hardware feature toggling (e.g SMAP). The
extra cost will be more obvious when transferring small packets.
This patch tries to eliminate those overhead by pin vq metadata pages
and access them through vmap(). During SET_VRING_ADDR, we will setup
those mappings and memory accessors are modified to use pointers to
access the metadata directly.
Note, this was only done when device IOTLB is not enabled. We could
use similar method to optimize it in the future.
Tests shows about ~24% improvement on TX PPS when using virtio-user +
vhost_net + xdp1 on TAP (CONFIG_HARDENED_USERCOPY is not enabled):
Before: ~5.0Mpps
After: ~6.1Mpps
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
drivers/vhost/vhost.c | 178 ++++++++++++++++++++++++++++++++++++++++++
drivers/vhost/vhost.h | 11 +++
2 files changed, 189 insertions(+)
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index bafe39d2e637..1bd24203afb6 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -443,6 +443,9 @@ void vhost_dev_init(struct vhost_dev *dev,
vq->indirect = NULL;
vq->heads = NULL;
vq->dev = dev;
+ memset(&vq->avail_ring, 0, sizeof(vq->avail_ring));
+ memset(&vq->used_ring, 0, sizeof(vq->used_ring));
+ memset(&vq->desc_ring, 0, sizeof(vq->desc_ring));
mutex_init(&vq->mutex);
vhost_vq_reset(dev, vq);
if (vq->handle_kick)
@@ -614,6 +617,102 @@ static void vhost_clear_msg(struct vhost_dev *dev)
spin_unlock(&dev->iotlb_lock);
}
+static int vhost_init_vmap(struct vhost_vmap *map, unsigned long uaddr,
+ size_t size, int write)
+{
+ struct page **pages;
+ int npages = DIV_ROUND_UP(size, PAGE_SIZE);
+ int npinned;
+ void *vaddr;
+
+ pages = kmalloc_array(npages, sizeof(struct page *), GFP_KERNEL);
+ if (!pages)
+ return -ENOMEM;
+
+ npinned = get_user_pages_fast(uaddr, npages, write, pages);
+ if (npinned != npages)
+ goto err;
+
+ vaddr = vmap(pages, npages, VM_MAP, PAGE_KERNEL);
+ if (!vaddr)
+ goto err;
+
+ map->pages = pages;
+ map->addr = vaddr + (uaddr & (PAGE_SIZE - 1));
+ map->npages = npages;
+
+ return 0;
+
+err:
+ if (npinned > 0)
+ release_pages(pages, npinned);
+ kfree(pages);
+ return -EFAULT;
+}
+
+static void vhost_uninit_vmap(struct vhost_vmap *map)
+{
+ if (!map->addr)
+ return;
+
+ vunmap(map->addr);
+ release_pages(map->pages, map->npages);
+ kfree(map->pages);
+
+ map->addr = NULL;
+ map->pages = NULL;
+ map->npages = 0;
+}
+
+static void vhost_clean_vmaps(struct vhost_virtqueue *vq)
+{
+ vhost_uninit_vmap(&vq->avail_ring);
+ vhost_uninit_vmap(&vq->desc_ring);
+ vhost_uninit_vmap(&vq->used_ring);
+}
+
+static int vhost_setup_vmaps(struct vhost_virtqueue *vq, unsigned long avail,
+ unsigned long desc, unsigned long used)
+{
+ size_t event = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
+ size_t avail_size, desc_size, used_size;
+ int ret;
+
+ vhost_clean_vmaps(vq);
+
+ avail_size = sizeof(*vq->avail) +
+ sizeof(*vq->avail->ring) * vq->num + event;
+ ret = vhost_init_vmap(&vq->avail_ring, avail, avail_size, false);
+ if (ret) {
+ vq_err(vq, "Fail to setup vmap for avail ring!\n");
+ goto err_avail;
+ }
+
+ desc_size = sizeof(*vq->desc) * vq->num;
+ ret = vhost_init_vmap(&vq->desc_ring, desc, desc_size, false);
+ if (ret) {
+ vq_err(vq, "Fail to setup vmap for desc ring!\n");
+ goto err_desc;
+ }
+
+ used_size = sizeof(*vq->used) +
+ sizeof(*vq->used->ring) * vq->num + event;
+ ret = vhost_init_vmap(&vq->used_ring, used, used_size, true);
+ if (ret) {
+ vq_err(vq, "Fail to setup vmap for used ring!\n");
+ goto err_used;
+ }
+
+ return 0;
+
+err_used:
+ vhost_uninit_vmap(&vq->used_ring);
+err_desc:
+ vhost_uninit_vmap(&vq->avail_ring);
+err_avail:
+ return -EFAULT;
+}
+
void vhost_dev_cleanup(struct vhost_dev *dev)
{
int i;
@@ -626,6 +725,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
if (dev->vqs[i]->call_ctx)
eventfd_ctx_put(dev->vqs[i]->call_ctx);
vhost_vq_reset(dev, dev->vqs[i]);
+ vhost_clean_vmaps(dev->vqs[i]);
}
vhost_dev_free_iovecs(dev);
if (dev->log_ctx)
@@ -873,6 +973,14 @@ static inline void __user *__vhost_get_user(struct vhost_virtqueue *vq,
static inline int vhost_put_avail_event(struct vhost_virtqueue *vq)
{
+ if (!vq->iotlb) {
+ struct vring_used *used = vq->used_ring.addr;
+
+ *((__virtio16 *)&used->ring[vq->num]) =
+ cpu_to_vhost16(vq, vq->avail_idx);
+ return 0;
+ }
+
return vhost_put_user(vq, cpu_to_vhost16(vq, vq->avail_idx),
vhost_avail_event(vq));
}
@@ -881,6 +989,13 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq,
struct vring_used_elem *head, int idx,
int count)
{
+ if (!vq->iotlb) {
+ struct vring_used *used = vq->used_ring.addr;
+
+ memcpy(used->ring + idx, head, count * sizeof(*head));
+ return 0;
+ }
+
return vhost_copy_to_user(vq, vq->used->ring + idx, head,
count * sizeof(*head));
}
@@ -888,6 +1003,13 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq,
static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
{
+ if (!vq->iotlb) {
+ struct vring_used *used = vq->used_ring.addr;
+
+ used->flags = cpu_to_vhost16(vq, vq->used_flags);
+ return 0;
+ }
+
return vhost_put_user(vq, cpu_to_vhost16(vq, vq->used_flags),
&vq->used->flags);
}
@@ -895,6 +1017,13 @@ static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
static inline int vhost_put_used_idx(struct vhost_virtqueue *vq)
{
+ if (!vq->iotlb) {
+ struct vring_used *used = vq->used_ring.addr;
+
+ used->idx = cpu_to_vhost16(vq, vq->last_used_idx);
+ return 0;
+ }
+
return vhost_put_user(vq, cpu_to_vhost16(vq, vq->last_used_idx),
&vq->used->idx);
}
@@ -926,12 +1055,26 @@ static inline int vhost_put_used_idx(struct vhost_virtqueue *vq)
static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
__virtio16 *idx)
{
+ if (!vq->iotlb) {
+ struct vring_avail *avail = vq->avail_ring.addr;
+
+ *idx = avail->idx;
+ return 0;
+ }
+
return vhost_get_avail(vq, *idx, &vq->avail->idx);
}
static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
__virtio16 *head, int idx)
{
+ if (!vq->iotlb) {
+ struct vring_avail *avail = vq->avail_ring.addr;
+
+ *head = avail->ring[idx & (vq->num - 1)];
+ return 0;
+ }
+
return vhost_get_avail(vq, *head,
&vq->avail->ring[idx & (vq->num - 1)]);
}
@@ -939,24 +1082,52 @@ static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
__virtio16 *flags)
{
+ if (!vq->iotlb) {
+ struct vring_avail *avail = vq->avail_ring.addr;
+
+ *flags = avail->flags;
+ return 0;
+ }
+
return vhost_get_avail(vq, *flags, &vq->avail->flags);
}
static inline int vhost_get_used_event(struct vhost_virtqueue *vq,
__virtio16 *event)
{
+ if (!vq->iotlb) {
+ struct vring_avail *avail = vq->avail_ring.addr;
+
+ *event = (__virtio16)avail->ring[vq->num];
+ return 0;
+ }
+
return vhost_get_avail(vq, *event, vhost_used_event(vq));
}
static inline int vhost_get_used_idx(struct vhost_virtqueue *vq,
__virtio16 *idx)
{
+ if (!vq->iotlb) {
+ struct vring_used *used = vq->used_ring.addr;
+
+ *idx = used->idx;
+ return 0;
+ }
+
return vhost_get_used(vq, *idx, &vq->used->idx);
}
static inline int vhost_get_desc(struct vhost_virtqueue *vq,
struct vring_desc *desc, int idx)
{
+ if (!vq->iotlb) {
+ struct vring_desc *d = vq->desc_ring.addr;
+
+ *desc = *(d + idx);
+ return 0;
+ }
+
return vhost_copy_from_user(vq, desc, vq->desc + idx, sizeof(*desc));
}
@@ -1551,6 +1722,13 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
}
}
+ if (!vq->iotlb && vhost_setup_vmaps(vq, a.avail_user_addr,
+ a.desc_user_addr,
+ a.used_user_addr)) {
+ r = -EINVAL;
+ break;
+ }
+
vq->log_used = !!(a.flags & (0x1 << VHOST_VRING_F_LOG));
vq->desc = (void __user *)(unsigned long)a.desc_user_addr;
vq->avail = (void __user *)(unsigned long)a.avail_user_addr;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 466ef7542291..89dc0ad3d055 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -80,6 +80,12 @@ enum vhost_uaddr_type {
VHOST_NUM_ADDRS = 3,
};
+struct vhost_vmap {
+ struct page **pages;
+ void *addr;
+ int npages;
+};
+
/* The virtqueue structure describes a queue attached to a device. */
struct vhost_virtqueue {
struct vhost_dev *dev;
@@ -90,6 +96,11 @@ struct vhost_virtqueue {
struct vring_desc __user *desc;
struct vring_avail __user *avail;
struct vring_used __user *used;
+
+ struct vhost_vmap avail_ring;
+ struct vhost_vmap desc_ring;
+ struct vhost_vmap used_ring;
+
const struct vhost_umem_node *meta_iotlb[VHOST_NUM_ADDRS];
struct file *kick;
struct eventfd_ctx *call_ctx;
--
2.17.1
^ permalink raw reply related
* [PATCH net-next 2/3] vhost: fine grain userspace memory accessors
From: Jason Wang @ 2018-12-13 10:10 UTC (permalink / raw)
To: mst, jasowang, kvm, virtualization, netdev, linux-kernel
In-Reply-To: <20181213101022.12475-1-jasowang@redhat.com>
This is used to hide the metadata address from virtqueue helpers. This
will allow to implement a vmap based fast accessing to metadata.
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
drivers/vhost/vhost.c | 94 +++++++++++++++++++++++++++++++++++--------
1 file changed, 77 insertions(+), 17 deletions(-)
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 1c54ec1b82f8..bafe39d2e637 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -871,6 +871,34 @@ static inline void __user *__vhost_get_user(struct vhost_virtqueue *vq,
ret; \
})
+static inline int vhost_put_avail_event(struct vhost_virtqueue *vq)
+{
+ return vhost_put_user(vq, cpu_to_vhost16(vq, vq->avail_idx),
+ vhost_avail_event(vq));
+}
+
+static inline int vhost_put_used(struct vhost_virtqueue *vq,
+ struct vring_used_elem *head, int idx,
+ int count)
+{
+ return vhost_copy_to_user(vq, vq->used->ring + idx, head,
+ count * sizeof(*head));
+}
+
+static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
+
+{
+ return vhost_put_user(vq, cpu_to_vhost16(vq, vq->used_flags),
+ &vq->used->flags);
+}
+
+static inline int vhost_put_used_idx(struct vhost_virtqueue *vq)
+
+{
+ return vhost_put_user(vq, cpu_to_vhost16(vq, vq->last_used_idx),
+ &vq->used->idx);
+}
+
#define vhost_get_user(vq, x, ptr, type) \
({ \
int ret; \
@@ -895,6 +923,43 @@ static inline void __user *__vhost_get_user(struct vhost_virtqueue *vq,
#define vhost_get_used(vq, x, ptr) \
vhost_get_user(vq, x, ptr, VHOST_ADDR_USED)
+static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
+ __virtio16 *idx)
+{
+ return vhost_get_avail(vq, *idx, &vq->avail->idx);
+}
+
+static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
+ __virtio16 *head, int idx)
+{
+ return vhost_get_avail(vq, *head,
+ &vq->avail->ring[idx & (vq->num - 1)]);
+}
+
+static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
+ __virtio16 *flags)
+{
+ return vhost_get_avail(vq, *flags, &vq->avail->flags);
+}
+
+static inline int vhost_get_used_event(struct vhost_virtqueue *vq,
+ __virtio16 *event)
+{
+ return vhost_get_avail(vq, *event, vhost_used_event(vq));
+}
+
+static inline int vhost_get_used_idx(struct vhost_virtqueue *vq,
+ __virtio16 *idx)
+{
+ return vhost_get_used(vq, *idx, &vq->used->idx);
+}
+
+static inline int vhost_get_desc(struct vhost_virtqueue *vq,
+ struct vring_desc *desc, int idx)
+{
+ return vhost_copy_from_user(vq, desc, vq->desc + idx, sizeof(*desc));
+}
+
static int vhost_new_umem_range(struct vhost_umem *umem,
u64 start, u64 size, u64 end,
u64 userspace_addr, int perm)
@@ -1751,8 +1816,7 @@ EXPORT_SYMBOL_GPL(vhost_log_write);
static int vhost_update_used_flags(struct vhost_virtqueue *vq)
{
void __user *used;
- if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->used_flags),
- &vq->used->flags) < 0)
+ if (vhost_put_used_flags(vq))
return -EFAULT;
if (unlikely(vq->log_used)) {
/* Make sure the flag is seen before log. */
@@ -1770,8 +1834,7 @@ static int vhost_update_used_flags(struct vhost_virtqueue *vq)
static int vhost_update_avail_event(struct vhost_virtqueue *vq, u16 avail_event)
{
- if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->avail_idx),
- vhost_avail_event(vq)))
+ if (vhost_put_avail_event(vq))
return -EFAULT;
if (unlikely(vq->log_used)) {
void __user *used;
@@ -1808,7 +1871,7 @@ int vhost_vq_init_access(struct vhost_virtqueue *vq)
r = -EFAULT;
goto err;
}
- r = vhost_get_used(vq, last_used_idx, &vq->used->idx);
+ r = vhost_get_used_idx(vq, &last_used_idx);
if (r) {
vq_err(vq, "Can't access used idx at %p\n",
&vq->used->idx);
@@ -2007,7 +2070,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
last_avail_idx = vq->last_avail_idx;
if (vq->avail_idx == vq->last_avail_idx) {
- if (unlikely(vhost_get_avail(vq, avail_idx, &vq->avail->idx))) {
+ if (unlikely(vhost_get_avail_idx(vq, &avail_idx))) {
vq_err(vq, "Failed to access avail idx at %p\n",
&vq->avail->idx);
return -EFAULT;
@@ -2034,8 +2097,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
/* Grab the next descriptor number they're advertising, and increment
* the index we've seen. */
- if (unlikely(vhost_get_avail(vq, ring_head,
- &vq->avail->ring[last_avail_idx & (vq->num - 1)]))) {
+ if (unlikely(vhost_get_avail_head(vq, &ring_head, last_avail_idx))) {
vq_err(vq, "Failed to read head: idx %d address %p\n",
last_avail_idx,
&vq->avail->ring[last_avail_idx % vq->num]);
@@ -2070,8 +2132,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
i, vq->num, head);
return -EINVAL;
}
- ret = vhost_copy_from_user(vq, &desc, vq->desc + i,
- sizeof desc);
+ ret = vhost_get_desc(vq, &desc, i);
if (unlikely(ret)) {
vq_err(vq, "Failed to get descriptor: idx %d addr %p\n",
i, vq->desc + i);
@@ -2164,7 +2225,7 @@ static int __vhost_add_used_n(struct vhost_virtqueue *vq,
start = vq->last_used_idx & (vq->num - 1);
used = vq->used->ring + start;
- if (vhost_copy_to_user(vq, used, heads, count * sizeof *used)) {
+ if (vhost_put_used(vq, heads, start, count)) {
vq_err(vq, "Failed to write used");
return -EFAULT;
}
@@ -2208,8 +2269,7 @@ int vhost_add_used_n(struct vhost_virtqueue *vq, struct vring_used_elem *heads,
/* Make sure buffer is written before we update index. */
smp_wmb();
- if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->last_used_idx),
- &vq->used->idx)) {
+ if (vhost_put_used_idx(vq)) {
vq_err(vq, "Failed to increment used idx");
return -EFAULT;
}
@@ -2241,7 +2301,7 @@ static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
if (!vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX)) {
__virtio16 flags;
- if (vhost_get_avail(vq, flags, &vq->avail->flags)) {
+ if (vhost_get_avail_flags(vq, &flags)) {
vq_err(vq, "Failed to get flags");
return true;
}
@@ -2255,7 +2315,7 @@ static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
if (unlikely(!v))
return true;
- if (vhost_get_avail(vq, event, vhost_used_event(vq))) {
+ if (vhost_get_used_event(vq, &event)) {
vq_err(vq, "Failed to get used event idx");
return true;
}
@@ -2300,7 +2360,7 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
if (vq->avail_idx != vq->last_avail_idx)
return false;
- r = vhost_get_avail(vq, avail_idx, &vq->avail->idx);
+ r = vhost_get_avail_idx(vq, &avail_idx);
if (unlikely(r))
return false;
vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
@@ -2336,7 +2396,7 @@ bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
/* They could have slipped one in as we were doing that: make
* sure it's written, then check again. */
smp_mb();
- r = vhost_get_avail(vq, avail_idx, &vq->avail->idx);
+ r = vhost_get_avail_idx(vq, &avail_idx);
if (r) {
vq_err(vq, "Failed to check avail idx at %p: %d\n",
&vq->avail->idx, r);
--
2.17.1
^ permalink raw reply related
* [PATCH net-next 1/3] vhost: generalize adding used elem
From: Jason Wang @ 2018-12-13 10:10 UTC (permalink / raw)
To: mst, jasowang, kvm, virtualization, netdev, linux-kernel
In-Reply-To: <20181213101022.12475-1-jasowang@redhat.com>
Use one generic vhost_copy_to_user() instead of two dedicated
accessor. This will simplify the conversion to fine grain accessors.
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
drivers/vhost/vhost.c | 11 +----------
1 file changed, 1 insertion(+), 10 deletions(-)
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 3a5f81a66d34..1c54ec1b82f8 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2164,16 +2164,7 @@ static int __vhost_add_used_n(struct vhost_virtqueue *vq,
start = vq->last_used_idx & (vq->num - 1);
used = vq->used->ring + start;
- if (count == 1) {
- if (vhost_put_user(vq, heads[0].id, &used->id)) {
- vq_err(vq, "Failed to write used id");
- return -EFAULT;
- }
- if (vhost_put_user(vq, heads[0].len, &used->len)) {
- vq_err(vq, "Failed to write used len");
- return -EFAULT;
- }
- } else if (vhost_copy_to_user(vq, used, heads, count * sizeof *used)) {
+ if (vhost_copy_to_user(vq, used, heads, count * sizeof *used)) {
vq_err(vq, "Failed to write used");
return -EFAULT;
}
--
2.17.1
^ permalink raw reply related
* [PATCH net-next 0/3] vhost: accelerate metadata access through vmap()
From: Jason Wang @ 2018-12-13 10:10 UTC (permalink / raw)
To: mst, jasowang, kvm, virtualization, netdev, linux-kernel
Hi:
This series tries to access virtqueue metadata through kernel virtual
address instead of copy_user() friends since they had too much
overheads like checks, spec barriers or even hardware feature
toggling.
Test shows about 24% improvement on TX PPS. It should benefit other
cases as well.
Please review
Jason Wang (3):
vhost: generalize adding used elem
vhost: fine grain userspace memory accessors
vhost: access vq metadata through kernel virtual address
drivers/vhost/vhost.c | 281 ++++++++++++++++++++++++++++++++++++++----
drivers/vhost/vhost.h | 11 ++
2 files changed, 266 insertions(+), 26 deletions(-)
--
2.17.1
^ permalink raw reply
* [PATCH] kbuild, x86: revert macros in extended asm workarounds
From: Masahiro Yamada @ 2018-12-13 9:17 UTC (permalink / raw)
To: Ingo Molnar, Thomas Gleixner, Borislav Petkov, H . Peter Anvin,
x86
Cc: linux-arch, Juergen Gross, Michal Marek, Richard Biener,
Arnd Bergmann, Segher Boessenkool, linux-kbuild, Peter Zijlstra,
linux-kernel, virtualization, Masahiro Yamada, linux-sparse,
Nadav Amit, Andy Lutomirski, Sedat Dilek, Alok Kataria,
Logan Gunthorpe, Luc Van Oostenryck
Revert the following commits:
- 5bdcd510c2ac9efaf55c4cbd8d46421d8e2320cd
("x86/jump-labels: Macrofy inline assembly code to work around GCC inlining bugs")
- d5a581d84ae6b8a4a740464b80d8d9cf1e7947b2
("x86/cpufeature: Macrofy inline assembly code to work around GCC inlining bugs")
- 0474d5d9d2f7f3b11262f7bf87d0e7314ead9200.
("x86/extable: Macrofy inline assembly code to work around GCC inlining bugs")
- 494b5168f2de009eb80f198f668da374295098dd.
("x86/paravirt: Work around GCC inlining bugs when compiling paravirt ops")
- f81f8ad56fd1c7b99b2ed1c314527f7d9ac447c6.
("x86/bug: Macrofy the BUG table section handling, to work around GCC inlining bugs")
- 77f48ec28e4ccff94d2e5f4260a83ac27a7f3099.
("x86/alternatives: Macrofy lock prefixes to work around GCC inlining bugs")
- 9e1725b410594911cc5981b6c7b4cea4ec054ca8.
("x86/refcount: Work around GCC inlining bug")
(Conflicts: arch/x86/include/asm/refcount.h)
- c06c4d8090513f2974dfdbed2ac98634357ac475.
("x86/objtool: Use asm macros to work around GCC inlining bugs")
- 77b0bf55bc675233d22cd5df97605d516d64525e.
("kbuild/Makefile: Prepare for using macros in inline assembly code to work around asm() related GCC inlining bugs")
A few days after those commits applied, discussion started to solve
the issue more elegantly on the compiler side:
https://lkml.org/lkml/2018/10/7/92
The "asm inline" was implemented by Segher Boessenkool, and now queued
up for GCC 9. (People were positive even for back-porting it to older
compilers).
Since the in-kernel workarounds merged, some issues have been reported:
breakage of building with distcc/icecc, breakage of distro packages for
module building. (More fundamentally, we cannot build external modules
after 'make clean')
Patching around the build system would make the code even uglier.
Given that this issue will be solved in a cleaner way sooner or later,
let's revert the in-kernel workarounds, and wait for GCC 9.
Reported-by: Logan Gunthorpe <logang@deltatee.com> # distcc
Reported-by: Sedat Dilek <sedat.dilek@gmail.com> # debian/rpm package
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Segher Boessenkool <segher@kernel.crashing.org>
---
Please consider this for v4.20 release.
Currently, distro package build is broken.
Makefile | 9 +---
arch/x86/Makefile | 7 ---
arch/x86/entry/calling.h | 2 +-
arch/x86/include/asm/alternative-asm.h | 20 +++----
arch/x86/include/asm/alternative.h | 11 +++-
arch/x86/include/asm/asm.h | 53 +++++++++++-------
arch/x86/include/asm/bug.h | 98 +++++++++++++++-------------------
arch/x86/include/asm/cpufeature.h | 82 ++++++++++++----------------
arch/x86/include/asm/jump_label.h | 72 ++++++++++++++++++-------
arch/x86/include/asm/paravirt_types.h | 56 +++++++++----------
arch/x86/include/asm/refcount.h | 81 ++++++++++++----------------
arch/x86/kernel/macros.S | 16 ------
include/asm-generic/bug.h | 8 +--
include/linux/compiler.h | 56 +++++--------------
scripts/Kbuild.include | 4 +-
scripts/mod/Makefile | 2 -
16 files changed, 262 insertions(+), 315 deletions(-)
delete mode 100644 arch/x86/kernel/macros.S
diff --git a/Makefile b/Makefile
index f2c3423..4cf4c5b 100644
--- a/Makefile
+++ b/Makefile
@@ -1081,7 +1081,7 @@ scripts: scripts_basic scripts_dtc asm-generic gcc-plugins $(autoksyms_h)
# version.h and scripts_basic is processed / created.
# Listed in dependency order
-PHONY += prepare archprepare macroprepare prepare0 prepare1 prepare2 prepare3
+PHONY += prepare archprepare prepare0 prepare1 prepare2 prepare3
# prepare3 is used to check if we are building in a separate output directory,
# and if so do:
@@ -1104,9 +1104,7 @@ prepare2: prepare3 outputmakefile asm-generic
prepare1: prepare2 $(version_h) $(autoksyms_h) include/generated/utsrelease.h
$(cmd_crmodverdir)
-macroprepare: prepare1 archmacros
-
-archprepare: archheaders archscripts macroprepare scripts_basic
+archprepare: archheaders archscripts prepare1 scripts_basic
prepare0: archprepare gcc-plugins
$(Q)$(MAKE) $(build)=.
@@ -1174,9 +1172,6 @@ archheaders:
PHONY += archscripts
archscripts:
-PHONY += archmacros
-archmacros:
-
PHONY += __headers
__headers: $(version_h) scripts_basic uapi-asm-generic archheaders archscripts
$(Q)$(MAKE) $(build)=scripts build_unifdef
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 75ef499..85a66c4 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -232,13 +232,6 @@ archscripts: scripts_basic
archheaders:
$(Q)$(MAKE) $(build)=arch/x86/entry/syscalls all
-archmacros:
- $(Q)$(MAKE) $(build)=arch/x86/kernel arch/x86/kernel/macros.s
-
-ASM_MACRO_FLAGS = -Wa,arch/x86/kernel/macros.s
-export ASM_MACRO_FLAGS
-KBUILD_CFLAGS += $(ASM_MACRO_FLAGS)
-
###
# Kernel objects
diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 25e5a6b..20d0885 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -352,7 +352,7 @@ For 32-bit we have the following conventions - kernel is built with
.macro CALL_enter_from_user_mode
#ifdef CONFIG_CONTEXT_TRACKING
#ifdef HAVE_JUMP_LABEL
- STATIC_BRANCH_JMP l_yes=.Lafter_call_\@, key=context_tracking_enabled, branch=1
+ STATIC_JUMP_IF_FALSE .Lafter_call_\@, context_tracking_enabled, def=0
#endif
call enter_from_user_mode
.Lafter_call_\@:
diff --git a/arch/x86/include/asm/alternative-asm.h b/arch/x86/include/asm/alternative-asm.h
index 8e4ea39..31b627b 100644
--- a/arch/x86/include/asm/alternative-asm.h
+++ b/arch/x86/include/asm/alternative-asm.h
@@ -7,24 +7,16 @@
#include <asm/asm.h>
#ifdef CONFIG_SMP
-.macro LOCK_PREFIX_HERE
+ .macro LOCK_PREFIX
+672: lock
.pushsection .smp_locks,"a"
.balign 4
- .long 671f - . # offset
+ .long 672b - .
.popsection
-671:
-.endm
-
-.macro LOCK_PREFIX insn:vararg
- LOCK_PREFIX_HERE
- lock \insn
-.endm
+ .endm
#else
-.macro LOCK_PREFIX_HERE
-.endm
-
-.macro LOCK_PREFIX insn:vararg
-.endm
+ .macro LOCK_PREFIX
+ .endm
#endif
/*
diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index d7faa16..4cd6a3b 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -31,8 +31,15 @@
*/
#ifdef CONFIG_SMP
-#define LOCK_PREFIX_HERE "LOCK_PREFIX_HERE\n\t"
-#define LOCK_PREFIX "LOCK_PREFIX "
+#define LOCK_PREFIX_HERE \
+ ".pushsection .smp_locks,\"a\"\n" \
+ ".balign 4\n" \
+ ".long 671f - .\n" /* offset */ \
+ ".popsection\n" \
+ "671:"
+
+#define LOCK_PREFIX LOCK_PREFIX_HERE "\n\tlock; "
+
#else /* ! CONFIG_SMP */
#define LOCK_PREFIX_HERE ""
#define LOCK_PREFIX ""
diff --git a/arch/x86/include/asm/asm.h b/arch/x86/include/asm/asm.h
index 21b0867..6467757b 100644
--- a/arch/x86/include/asm/asm.h
+++ b/arch/x86/include/asm/asm.h
@@ -120,25 +120,12 @@
/* Exception table entry */
#ifdef __ASSEMBLY__
# define _ASM_EXTABLE_HANDLE(from, to, handler) \
- ASM_EXTABLE_HANDLE from to handler
-
-.macro ASM_EXTABLE_HANDLE from:req to:req handler:req
- .pushsection "__ex_table","a"
- .balign 4
- .long (\from) - .
- .long (\to) - .
- .long (\handler) - .
+ .pushsection "__ex_table","a" ; \
+ .balign 4 ; \
+ .long (from) - . ; \
+ .long (to) - . ; \
+ .long (handler) - . ; \
.popsection
-.endm
-#else /* __ASSEMBLY__ */
-
-# define _ASM_EXTABLE_HANDLE(from, to, handler) \
- "ASM_EXTABLE_HANDLE from=" #from " to=" #to \
- " handler=\"" #handler "\"\n\t"
-
-/* For C file, we already have NOKPROBE_SYMBOL macro */
-
-#endif /* __ASSEMBLY__ */
# define _ASM_EXTABLE(from, to) \
_ASM_EXTABLE_HANDLE(from, to, ex_handler_default)
@@ -161,7 +148,6 @@
_ASM_PTR (entry); \
.popsection
-#ifdef __ASSEMBLY__
.macro ALIGN_DESTINATION
/* check for bad alignment of destination */
movl %edi,%ecx
@@ -185,7 +171,34 @@
_ASM_EXTABLE_UA(100b, 103b)
_ASM_EXTABLE_UA(101b, 103b)
.endm
-#endif /* __ASSEMBLY__ */
+
+#else
+# define _EXPAND_EXTABLE_HANDLE(x) #x
+# define _ASM_EXTABLE_HANDLE(from, to, handler) \
+ " .pushsection \"__ex_table\",\"a\"\n" \
+ " .balign 4\n" \
+ " .long (" #from ") - .\n" \
+ " .long (" #to ") - .\n" \
+ " .long (" _EXPAND_EXTABLE_HANDLE(handler) ") - .\n" \
+ " .popsection\n"
+
+# define _ASM_EXTABLE(from, to) \
+ _ASM_EXTABLE_HANDLE(from, to, ex_handler_default)
+
+# define _ASM_EXTABLE_UA(from, to) \
+ _ASM_EXTABLE_HANDLE(from, to, ex_handler_uaccess)
+
+# define _ASM_EXTABLE_FAULT(from, to) \
+ _ASM_EXTABLE_HANDLE(from, to, ex_handler_fault)
+
+# define _ASM_EXTABLE_EX(from, to) \
+ _ASM_EXTABLE_HANDLE(from, to, ex_handler_ext)
+
+# define _ASM_EXTABLE_REFCOUNT(from, to) \
+ _ASM_EXTABLE_HANDLE(from, to, ex_handler_refcount)
+
+/* For C file, we already have NOKPROBE_SYMBOL macro */
+#endif
#ifndef __ASSEMBLY__
/*
diff --git a/arch/x86/include/asm/bug.h b/arch/x86/include/asm/bug.h
index 5090035..6804d66 100644
--- a/arch/x86/include/asm/bug.h
+++ b/arch/x86/include/asm/bug.h
@@ -4,8 +4,6 @@
#include <linux/stringify.h>
-#ifndef __ASSEMBLY__
-
/*
* Despite that some emulators terminate on UD2, we use it for WARN().
*
@@ -22,15 +20,53 @@
#define LEN_UD2 2
+#ifdef CONFIG_GENERIC_BUG
+
+#ifdef CONFIG_X86_32
+# define __BUG_REL(val) ".long " __stringify(val)
+#else
+# define __BUG_REL(val) ".long " __stringify(val) " - 2b"
+#endif
+
+#ifdef CONFIG_DEBUG_BUGVERBOSE
+
+#define _BUG_FLAGS(ins, flags) \
+do { \
+ asm volatile("1:\t" ins "\n" \
+ ".pushsection __bug_table,\"aw\"\n" \
+ "2:\t" __BUG_REL(1b) "\t# bug_entry::bug_addr\n" \
+ "\t" __BUG_REL(%c0) "\t# bug_entry::file\n" \
+ "\t.word %c1" "\t# bug_entry::line\n" \
+ "\t.word %c2" "\t# bug_entry::flags\n" \
+ "\t.org 2b+%c3\n" \
+ ".popsection" \
+ : : "i" (__FILE__), "i" (__LINE__), \
+ "i" (flags), \
+ "i" (sizeof(struct bug_entry))); \
+} while (0)
+
+#else /* !CONFIG_DEBUG_BUGVERBOSE */
+
#define _BUG_FLAGS(ins, flags) \
do { \
- asm volatile("ASM_BUG ins=\"" ins "\" file=%c0 line=%c1 " \
- "flags=%c2 size=%c3" \
- : : "i" (__FILE__), "i" (__LINE__), \
- "i" (flags), \
+ asm volatile("1:\t" ins "\n" \
+ ".pushsection __bug_table,\"aw\"\n" \
+ "2:\t" __BUG_REL(1b) "\t# bug_entry::bug_addr\n" \
+ "\t.word %c0" "\t# bug_entry::flags\n" \
+ "\t.org 2b+%c1\n" \
+ ".popsection" \
+ : : "i" (flags), \
"i" (sizeof(struct bug_entry))); \
} while (0)
+#endif /* CONFIG_DEBUG_BUGVERBOSE */
+
+#else
+
+#define _BUG_FLAGS(ins, flags) asm volatile(ins)
+
+#endif /* CONFIG_GENERIC_BUG */
+
#define HAVE_ARCH_BUG
#define BUG() \
do { \
@@ -46,54 +82,4 @@ do { \
#include <asm-generic/bug.h>
-#else /* __ASSEMBLY__ */
-
-#ifdef CONFIG_GENERIC_BUG
-
-#ifdef CONFIG_X86_32
-.macro __BUG_REL val:req
- .long \val
-.endm
-#else
-.macro __BUG_REL val:req
- .long \val - 2b
-.endm
-#endif
-
-#ifdef CONFIG_DEBUG_BUGVERBOSE
-
-.macro ASM_BUG ins:req file:req line:req flags:req size:req
-1: \ins
- .pushsection __bug_table,"aw"
-2: __BUG_REL val=1b # bug_entry::bug_addr
- __BUG_REL val=\file # bug_entry::file
- .word \line # bug_entry::line
- .word \flags # bug_entry::flags
- .org 2b+\size
- .popsection
-.endm
-
-#else /* !CONFIG_DEBUG_BUGVERBOSE */
-
-.macro ASM_BUG ins:req file:req line:req flags:req size:req
-1: \ins
- .pushsection __bug_table,"aw"
-2: __BUG_REL val=1b # bug_entry::bug_addr
- .word \flags # bug_entry::flags
- .org 2b+\size
- .popsection
-.endm
-
-#endif /* CONFIG_DEBUG_BUGVERBOSE */
-
-#else /* CONFIG_GENERIC_BUG */
-
-.macro ASM_BUG ins:req file:req line:req flags:req size:req
- \ins
-.endm
-
-#endif /* CONFIG_GENERIC_BUG */
-
-#endif /* __ASSEMBLY__ */
-
#endif /* _ASM_X86_BUG_H */
diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index 7d44272..aced6c9 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -2,10 +2,10 @@
#ifndef _ASM_X86_CPUFEATURE_H
#define _ASM_X86_CPUFEATURE_H
-#ifdef __KERNEL__
-#ifndef __ASSEMBLY__
-
#include <asm/processor.h>
+
+#if defined(__KERNEL__) && !defined(__ASSEMBLY__)
+
#include <asm/asm.h>
#include <linux/bitops.h>
@@ -161,10 +161,37 @@ extern void clear_cpu_cap(struct cpuinfo_x86 *c, unsigned int bit);
*/
static __always_inline __pure bool _static_cpu_has(u16 bit)
{
- asm_volatile_goto("STATIC_CPU_HAS bitnum=%[bitnum] "
- "cap_byte=\"%[cap_byte]\" "
- "feature=%P[feature] t_yes=%l[t_yes] "
- "t_no=%l[t_no] always=%P[always]"
+ asm_volatile_goto("1: jmp 6f\n"
+ "2:\n"
+ ".skip -(((5f-4f) - (2b-1b)) > 0) * "
+ "((5f-4f) - (2b-1b)),0x90\n"
+ "3:\n"
+ ".section .altinstructions,\"a\"\n"
+ " .long 1b - .\n" /* src offset */
+ " .long 4f - .\n" /* repl offset */
+ " .word %P[always]\n" /* always replace */
+ " .byte 3b - 1b\n" /* src len */
+ " .byte 5f - 4f\n" /* repl len */
+ " .byte 3b - 2b\n" /* pad len */
+ ".previous\n"
+ ".section .altinstr_replacement,\"ax\"\n"
+ "4: jmp %l[t_no]\n"
+ "5:\n"
+ ".previous\n"
+ ".section .altinstructions,\"a\"\n"
+ " .long 1b - .\n" /* src offset */
+ " .long 0\n" /* no replacement */
+ " .word %P[feature]\n" /* feature bit */
+ " .byte 3b - 1b\n" /* src len */
+ " .byte 0\n" /* repl len */
+ " .byte 0\n" /* pad len */
+ ".previous\n"
+ ".section .altinstr_aux,\"ax\"\n"
+ "6:\n"
+ " testb %[bitnum],%[cap_byte]\n"
+ " jnz %l[t_yes]\n"
+ " jmp %l[t_no]\n"
+ ".previous\n"
: : [feature] "i" (bit),
[always] "i" (X86_FEATURE_ALWAYS),
[bitnum] "i" (1 << (bit & 7)),
@@ -199,44 +226,5 @@ static __always_inline __pure bool _static_cpu_has(u16 bit)
#define CPU_FEATURE_TYPEVAL boot_cpu_data.x86_vendor, boot_cpu_data.x86, \
boot_cpu_data.x86_model
-#else /* __ASSEMBLY__ */
-
-.macro STATIC_CPU_HAS bitnum:req cap_byte:req feature:req t_yes:req t_no:req always:req
-1:
- jmp 6f
-2:
- .skip -(((5f-4f) - (2b-1b)) > 0) * ((5f-4f) - (2b-1b)),0x90
-3:
- .section .altinstructions,"a"
- .long 1b - . /* src offset */
- .long 4f - . /* repl offset */
- .word \always /* always replace */
- .byte 3b - 1b /* src len */
- .byte 5f - 4f /* repl len */
- .byte 3b - 2b /* pad len */
- .previous
- .section .altinstr_replacement,"ax"
-4:
- jmp \t_no
-5:
- .previous
- .section .altinstructions,"a"
- .long 1b - . /* src offset */
- .long 0 /* no replacement */
- .word \feature /* feature bit */
- .byte 3b - 1b /* src len */
- .byte 0 /* repl len */
- .byte 0 /* pad len */
- .previous
- .section .altinstr_aux,"ax"
-6:
- testb \bitnum,\cap_byte
- jnz \t_yes
- jmp \t_no
- .previous
-.endm
-
-#endif /* __ASSEMBLY__ */
-
-#endif /* __KERNEL__ */
+#endif /* defined(__KERNEL__) && !defined(__ASSEMBLY__) */
#endif /* _ASM_X86_CPUFEATURE_H */
diff --git a/arch/x86/include/asm/jump_label.h b/arch/x86/include/asm/jump_label.h
index a5fb34f..21efc9d 100644
--- a/arch/x86/include/asm/jump_label.h
+++ b/arch/x86/include/asm/jump_label.h
@@ -2,6 +2,19 @@
#ifndef _ASM_X86_JUMP_LABEL_H
#define _ASM_X86_JUMP_LABEL_H
+#ifndef HAVE_JUMP_LABEL
+/*
+ * For better or for worse, if jump labels (the gcc extension) are missing,
+ * then the entire static branch patching infrastructure is compiled out.
+ * If that happens, the code in here will malfunction. Raise a compiler
+ * error instead.
+ *
+ * In theory, jump labels and the static branch patching infrastructure
+ * could be decoupled to fix this.
+ */
+#error asm/jump_label.h included on a non-jump-label kernel
+#endif
+
#define JUMP_LABEL_NOP_SIZE 5
#ifdef CONFIG_X86_64
@@ -20,9 +33,15 @@
static __always_inline bool arch_static_branch(struct static_key *key, bool branch)
{
- asm_volatile_goto("STATIC_BRANCH_NOP l_yes=\"%l[l_yes]\" key=\"%c0\" "
- "branch=\"%c1\""
- : : "i" (key), "i" (branch) : : l_yes);
+ asm_volatile_goto("1:"
+ ".byte " __stringify(STATIC_KEY_INIT_NOP) "\n\t"
+ ".pushsection __jump_table, \"aw\" \n\t"
+ _ASM_ALIGN "\n\t"
+ ".long 1b - ., %l[l_yes] - . \n\t"
+ _ASM_PTR "%c0 + %c1 - .\n\t"
+ ".popsection \n\t"
+ : : "i" (key), "i" (branch) : : l_yes);
+
return false;
l_yes:
return true;
@@ -30,8 +49,14 @@ static __always_inline bool arch_static_branch(struct static_key *key, bool bran
static __always_inline bool arch_static_branch_jump(struct static_key *key, bool branch)
{
- asm_volatile_goto("STATIC_BRANCH_JMP l_yes=\"%l[l_yes]\" key=\"%c0\" "
- "branch=\"%c1\""
+ asm_volatile_goto("1:"
+ ".byte 0xe9\n\t .long %l[l_yes] - 2f\n\t"
+ "2:\n\t"
+ ".pushsection __jump_table, \"aw\" \n\t"
+ _ASM_ALIGN "\n\t"
+ ".long 1b - ., %l[l_yes] - . \n\t"
+ _ASM_PTR "%c0 + %c1 - .\n\t"
+ ".popsection \n\t"
: : "i" (key), "i" (branch) : : l_yes);
return false;
@@ -41,26 +66,37 @@ static __always_inline bool arch_static_branch_jump(struct static_key *key, bool
#else /* __ASSEMBLY__ */
-.macro STATIC_BRANCH_NOP l_yes:req key:req branch:req
-.Lstatic_branch_nop_\@:
- .byte STATIC_KEY_INIT_NOP
-.Lstatic_branch_no_after_\@:
+.macro STATIC_JUMP_IF_TRUE target, key, def
+.Lstatic_jump_\@:
+ .if \def
+ /* Equivalent to "jmp.d32 \target" */
+ .byte 0xe9
+ .long \target - .Lstatic_jump_after_\@
+.Lstatic_jump_after_\@:
+ .else
+ .byte STATIC_KEY_INIT_NOP
+ .endif
.pushsection __jump_table, "aw"
_ASM_ALIGN
- .long .Lstatic_branch_nop_\@ - ., \l_yes - .
- _ASM_PTR \key + \branch - .
+ .long .Lstatic_jump_\@ - ., \target - .
+ _ASM_PTR \key - .
.popsection
.endm
-.macro STATIC_BRANCH_JMP l_yes:req key:req branch:req
-.Lstatic_branch_jmp_\@:
- .byte 0xe9
- .long \l_yes - .Lstatic_branch_jmp_after_\@
-.Lstatic_branch_jmp_after_\@:
+.macro STATIC_JUMP_IF_FALSE target, key, def
+.Lstatic_jump_\@:
+ .if \def
+ .byte STATIC_KEY_INIT_NOP
+ .else
+ /* Equivalent to "jmp.d32 \target" */
+ .byte 0xe9
+ .long \target - .Lstatic_jump_after_\@
+.Lstatic_jump_after_\@:
+ .endif
.pushsection __jump_table, "aw"
_ASM_ALIGN
- .long .Lstatic_branch_jmp_\@ - ., \l_yes - .
- _ASM_PTR \key + \branch - .
+ .long .Lstatic_jump_\@ - ., \target - .
+ _ASM_PTR \key + 1 - .
.popsection
.endm
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 26942ad..488c596 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -348,11 +348,23 @@ extern struct paravirt_patch_template pv_ops;
#define paravirt_clobber(clobber) \
[paravirt_clobber] "i" (clobber)
+/*
+ * Generate some code, and mark it as patchable by the
+ * apply_paravirt() alternate instruction patcher.
+ */
+#define _paravirt_alt(insn_string, type, clobber) \
+ "771:\n\t" insn_string "\n" "772:\n" \
+ ".pushsection .parainstructions,\"a\"\n" \
+ _ASM_ALIGN "\n" \
+ _ASM_PTR " 771b\n" \
+ " .byte " type "\n" \
+ " .byte 772b-771b\n" \
+ " .short " clobber "\n" \
+ ".popsection\n"
+
/* Generate patchable code, with the default asm parameters. */
-#define paravirt_call \
- "PARAVIRT_CALL type=\"%c[paravirt_typenum]\"" \
- " clobber=\"%c[paravirt_clobber]\"" \
- " pv_opptr=\"%c[paravirt_opptr]\";"
+#define paravirt_alt(insn_string) \
+ _paravirt_alt(insn_string, "%c[paravirt_typenum]", "%c[paravirt_clobber]")
/* Simple instruction patching code. */
#define NATIVE_LABEL(a,x,b) "\n\t.globl " a #x "_" #b "\n" a #x "_" #b ":\n\t"
@@ -373,6 +385,16 @@ unsigned native_patch(u8 type, void *ibuf, unsigned long addr, unsigned len);
int paravirt_disable_iospace(void);
/*
+ * This generates an indirect call based on the operation type number.
+ * The type number, computed in PARAVIRT_PATCH, is derived from the
+ * offset into the paravirt_patch_template structure, and can therefore be
+ * freely converted back into a structure offset.
+ */
+#define PARAVIRT_CALL \
+ ANNOTATE_RETPOLINE_SAFE \
+ "call *%c[paravirt_opptr];"
+
+/*
* These macros are intended to wrap calls through one of the paravirt
* ops structs, so that they can be later identified and patched at
* runtime.
@@ -509,7 +531,7 @@ int paravirt_disable_iospace(void);
/* since this condition will never hold */ \
if (sizeof(rettype) > sizeof(unsigned long)) { \
asm volatile(pre \
- paravirt_call \
+ paravirt_alt(PARAVIRT_CALL) \
post \
: call_clbr, ASM_CALL_CONSTRAINT \
: paravirt_type(op), \
@@ -519,7 +541,7 @@ int paravirt_disable_iospace(void);
__ret = (rettype)((((u64)__edx) << 32) | __eax); \
} else { \
asm volatile(pre \
- paravirt_call \
+ paravirt_alt(PARAVIRT_CALL) \
post \
: call_clbr, ASM_CALL_CONSTRAINT \
: paravirt_type(op), \
@@ -546,7 +568,7 @@ int paravirt_disable_iospace(void);
PVOP_VCALL_ARGS; \
PVOP_TEST_NULL(op); \
asm volatile(pre \
- paravirt_call \
+ paravirt_alt(PARAVIRT_CALL) \
post \
: call_clbr, ASM_CALL_CONSTRAINT \
: paravirt_type(op), \
@@ -664,26 +686,6 @@ struct paravirt_patch_site {
extern struct paravirt_patch_site __parainstructions[],
__parainstructions_end[];
-#else /* __ASSEMBLY__ */
-
-/*
- * This generates an indirect call based on the operation type number.
- * The type number, computed in PARAVIRT_PATCH, is derived from the
- * offset into the paravirt_patch_template structure, and can therefore be
- * freely converted back into a structure offset.
- */
-.macro PARAVIRT_CALL type:req clobber:req pv_opptr:req
-771: ANNOTATE_RETPOLINE_SAFE
- call *\pv_opptr
-772: .pushsection .parainstructions,"a"
- _ASM_ALIGN
- _ASM_PTR 771b
- .byte \type
- .byte 772b-771b
- .short \clobber
- .popsection
-.endm
-
#endif /* __ASSEMBLY__ */
#endif /* _ASM_X86_PARAVIRT_TYPES_H */
diff --git a/arch/x86/include/asm/refcount.h b/arch/x86/include/asm/refcount.h
index a8b5e1e..dbaed55 100644
--- a/arch/x86/include/asm/refcount.h
+++ b/arch/x86/include/asm/refcount.h
@@ -4,41 +4,6 @@
* x86-specific implementation of refcount_t. Based on PAX_REFCOUNT from
* PaX/grsecurity.
*/
-
-#ifdef __ASSEMBLY__
-
-#include <asm/asm.h>
-#include <asm/bug.h>
-
-.macro REFCOUNT_EXCEPTION counter:req
- .pushsection .text..refcount
-111: lea \counter, %_ASM_CX
-112: ud2
- ASM_UNREACHABLE
- .popsection
-113: _ASM_EXTABLE_REFCOUNT(112b, 113b)
-.endm
-
-/* Trigger refcount exception if refcount result is negative. */
-.macro REFCOUNT_CHECK_LT_ZERO counter:req
- js 111f
- REFCOUNT_EXCEPTION counter="\counter"
-.endm
-
-/* Trigger refcount exception if refcount result is zero or negative. */
-.macro REFCOUNT_CHECK_LE_ZERO counter:req
- jz 111f
- REFCOUNT_CHECK_LT_ZERO counter="\counter"
-.endm
-
-/* Trigger refcount exception unconditionally. */
-.macro REFCOUNT_ERROR counter:req
- jmp 111f
- REFCOUNT_EXCEPTION counter="\counter"
-.endm
-
-#else /* __ASSEMBLY__ */
-
#include <linux/refcount.h>
#include <asm/bug.h>
@@ -50,12 +15,35 @@
* central refcount exception. The fixup address for the exception points
* back to the regular execution flow in .text.
*/
+#define _REFCOUNT_EXCEPTION \
+ ".pushsection .text..refcount\n" \
+ "111:\tlea %[var], %%" _ASM_CX "\n" \
+ "112:\t" ASM_UD2 "\n" \
+ ASM_UNREACHABLE \
+ ".popsection\n" \
+ "113:\n" \
+ _ASM_EXTABLE_REFCOUNT(112b, 113b)
+
+/* Trigger refcount exception if refcount result is negative. */
+#define REFCOUNT_CHECK_LT_ZERO \
+ "js 111f\n\t" \
+ _REFCOUNT_EXCEPTION
+
+/* Trigger refcount exception if refcount result is zero or negative. */
+#define REFCOUNT_CHECK_LE_ZERO \
+ "jz 111f\n\t" \
+ REFCOUNT_CHECK_LT_ZERO
+
+/* Trigger refcount exception unconditionally. */
+#define REFCOUNT_ERROR \
+ "jmp 111f\n\t" \
+ _REFCOUNT_EXCEPTION
static __always_inline void refcount_add(unsigned int i, refcount_t *r)
{
asm volatile(LOCK_PREFIX "addl %1,%0\n\t"
- "REFCOUNT_CHECK_LT_ZERO counter=\"%[counter]\""
- : [counter] "+m" (r->refs.counter)
+ REFCOUNT_CHECK_LT_ZERO
+ : [var] "+m" (r->refs.counter)
: "ir" (i)
: "cc", "cx");
}
@@ -63,32 +51,31 @@ static __always_inline void refcount_add(unsigned int i, refcount_t *r)
static __always_inline void refcount_inc(refcount_t *r)
{
asm volatile(LOCK_PREFIX "incl %0\n\t"
- "REFCOUNT_CHECK_LT_ZERO counter=\"%[counter]\""
- : [counter] "+m" (r->refs.counter)
+ REFCOUNT_CHECK_LT_ZERO
+ : [var] "+m" (r->refs.counter)
: : "cc", "cx");
}
static __always_inline void refcount_dec(refcount_t *r)
{
asm volatile(LOCK_PREFIX "decl %0\n\t"
- "REFCOUNT_CHECK_LE_ZERO counter=\"%[counter]\""
- : [counter] "+m" (r->refs.counter)
+ REFCOUNT_CHECK_LE_ZERO
+ : [var] "+m" (r->refs.counter)
: : "cc", "cx");
}
static __always_inline __must_check
bool refcount_sub_and_test(unsigned int i, refcount_t *r)
{
-
return GEN_BINARY_SUFFIXED_RMWcc(LOCK_PREFIX "subl",
- "REFCOUNT_CHECK_LT_ZERO counter=\"%[var]\"",
+ REFCOUNT_CHECK_LT_ZERO,
r->refs.counter, e, "er", i, "cx");
}
static __always_inline __must_check bool refcount_dec_and_test(refcount_t *r)
{
return GEN_UNARY_SUFFIXED_RMWcc(LOCK_PREFIX "decl",
- "REFCOUNT_CHECK_LT_ZERO counter=\"%[var]\"",
+ REFCOUNT_CHECK_LT_ZERO,
r->refs.counter, e, "cx");
}
@@ -106,8 +93,8 @@ bool refcount_add_not_zero(unsigned int i, refcount_t *r)
/* Did we try to increment from/to an undesirable state? */
if (unlikely(c < 0 || c == INT_MAX || result < c)) {
- asm volatile("REFCOUNT_ERROR counter=\"%[counter]\""
- : : [counter] "m" (r->refs.counter)
+ asm volatile(REFCOUNT_ERROR
+ : : [var] "m" (r->refs.counter)
: "cc", "cx");
break;
}
@@ -122,6 +109,4 @@ static __always_inline __must_check bool refcount_inc_not_zero(refcount_t *r)
return refcount_add_not_zero(1, r);
}
-#endif /* __ASSEMBLY__ */
-
#endif
diff --git a/arch/x86/kernel/macros.S b/arch/x86/kernel/macros.S
deleted file mode 100644
index 161c950..0000000
--- a/arch/x86/kernel/macros.S
+++ /dev/null
@@ -1,16 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-
-/*
- * This file includes headers whose assembly part includes macros which are
- * commonly used. The macros are precompiled into assmebly file which is later
- * assembled together with each compiled file.
- */
-
-#include <linux/compiler.h>
-#include <asm/refcount.h>
-#include <asm/alternative-asm.h>
-#include <asm/bug.h>
-#include <asm/paravirt.h>
-#include <asm/asm.h>
-#include <asm/cpufeature.h>
-#include <asm/jump_label.h>
diff --git a/include/asm-generic/bug.h b/include/asm-generic/bug.h
index cdafa5e..20561a6 100644
--- a/include/asm-generic/bug.h
+++ b/include/asm-generic/bug.h
@@ -17,8 +17,10 @@
#ifndef __ASSEMBLY__
#include <linux/kernel.h>
-struct bug_entry {
+#ifdef CONFIG_BUG
+
#ifdef CONFIG_GENERIC_BUG
+struct bug_entry {
#ifndef CONFIG_GENERIC_BUG_RELATIVE_POINTERS
unsigned long bug_addr;
#else
@@ -33,10 +35,8 @@ struct bug_entry {
unsigned short line;
#endif
unsigned short flags;
-#endif /* CONFIG_GENERIC_BUG */
};
-
-#ifdef CONFIG_BUG
+#endif /* CONFIG_GENERIC_BUG */
/*
* Don't use BUG() or BUG_ON() unless there's really no way out; one
diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index 06396c1..fc5004a 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -99,13 +99,22 @@ void ftrace_likely_update(struct ftrace_likely_data *f, int val,
* unique, to convince GCC not to merge duplicate inline asm statements.
*/
#define annotate_reachable() ({ \
- asm volatile("ANNOTATE_REACHABLE counter=%c0" \
- : : "i" (__COUNTER__)); \
+ asm volatile("%c0:\n\t" \
+ ".pushsection .discard.reachable\n\t" \
+ ".long %c0b - .\n\t" \
+ ".popsection\n\t" : : "i" (__COUNTER__)); \
})
#define annotate_unreachable() ({ \
- asm volatile("ANNOTATE_UNREACHABLE counter=%c0" \
- : : "i" (__COUNTER__)); \
+ asm volatile("%c0:\n\t" \
+ ".pushsection .discard.unreachable\n\t" \
+ ".long %c0b - .\n\t" \
+ ".popsection\n\t" : : "i" (__COUNTER__)); \
})
+#define ASM_UNREACHABLE \
+ "999:\n\t" \
+ ".pushsection .discard.unreachable\n\t" \
+ ".long 999b - .\n\t" \
+ ".popsection\n\t"
#else
#define annotate_reachable()
#define annotate_unreachable()
@@ -293,45 +302,6 @@ static inline void *offset_to_ptr(const int *off)
return (void *)((unsigned long)off + *off);
}
-#else /* __ASSEMBLY__ */
-
-#ifdef __KERNEL__
-#ifndef LINKER_SCRIPT
-
-#ifdef CONFIG_STACK_VALIDATION
-.macro ANNOTATE_UNREACHABLE counter:req
-\counter:
- .pushsection .discard.unreachable
- .long \counter\()b -.
- .popsection
-.endm
-
-.macro ANNOTATE_REACHABLE counter:req
-\counter:
- .pushsection .discard.reachable
- .long \counter\()b -.
- .popsection
-.endm
-
-.macro ASM_UNREACHABLE
-999:
- .pushsection .discard.unreachable
- .long 999b - .
- .popsection
-.endm
-#else /* CONFIG_STACK_VALIDATION */
-.macro ANNOTATE_UNREACHABLE counter:req
-.endm
-
-.macro ANNOTATE_REACHABLE counter:req
-.endm
-
-.macro ASM_UNREACHABLE
-.endm
-#endif /* CONFIG_STACK_VALIDATION */
-
-#endif /* LINKER_SCRIPT */
-#endif /* __KERNEL__ */
#endif /* __ASSEMBLY__ */
/* Compile time object size, -1 for unknown */
diff --git a/scripts/Kbuild.include b/scripts/Kbuild.include
index bb01555..3d09844 100644
--- a/scripts/Kbuild.include
+++ b/scripts/Kbuild.include
@@ -115,9 +115,7 @@ __cc-option = $(call try-run,\
# Do not attempt to build with gcc plugins during cc-option tests.
# (And this uses delayed resolution so the flags will be up to date.)
-# In addition, do not include the asm macros which are built later.
-CC_OPTION_FILTERED = $(GCC_PLUGINS_CFLAGS) $(ASM_MACRO_FLAGS)
-CC_OPTION_CFLAGS = $(filter-out $(CC_OPTION_FILTERED),$(KBUILD_CFLAGS))
+CC_OPTION_CFLAGS = $(filter-out $(GCC_PLUGINS_CFLAGS),$(KBUILD_CFLAGS))
# cc-option
# Usage: cflags-y += $(call cc-option,-march=winchip-c6,-march=i586)
diff --git a/scripts/mod/Makefile b/scripts/mod/Makefile
index a5b4af4..42c5d50 100644
--- a/scripts/mod/Makefile
+++ b/scripts/mod/Makefile
@@ -4,8 +4,6 @@ OBJECT_FILES_NON_STANDARD := y
hostprogs-y := modpost mk_elfconfig
always := $(hostprogs-y) empty.o
-CFLAGS_REMOVE_empty.o := $(ASM_MACRO_FLAGS)
-
modpost-objs := modpost.o file2alias.o sumversion.o
devicetable-offsets-file := devicetable-offsets.h
--
2.7.4
^ permalink raw reply related
* Re: [PATCH] vhost: correct the related warning message
From: Sergei Shtylyov @ 2018-12-13 9:05 UTC (permalink / raw)
To: wangyan, mst, jasowang; +Cc: kvm, netdev, virtualization, piaojun, viro, hch
In-Reply-To: <5C11B176.4040704@huawei.com>
On 13.12.2018 4:10, wangyan wrote:
> Fixes: 'commit d588cf8f618d ("target: Fix se_tpg_tfo->tf_subsys regression + remove tf_subsystem")'
> 'commit cbbd26b8b1a6 ("[iov_iter] new primitives - copy_from_iter_full() and friends")'
Fixes: d588cf8f618d ("target: Fix se_tpg_tfo->tf_subsys regression + remove
tf_subsystem")
Fixes: cbbd26b8b1a6 ("[iov_iter] new primitives - copy_from_iter_full() and
friends")
But no line wrapping allowed (can't find how to turn it off in my mailer).
> Signed-off-by: Yan Wang <wangyan122@huawei.com>
> Reviewed-by: Jun Piao <piaojun@huawei.com>
[...]
MBR, Sergei
^ permalink raw reply
* Re: [PATCH v2 2/5] VSOCK: support fill data to mergeable rx buffer in host
From: jiangyiwen @ 2018-12-13 7:42 UTC (permalink / raw)
To: David Miller; +Cc: kvm, mst, netdev, virtualization, stefanha
In-Reply-To: <20181212.215905.1657023590815103551.davem@davemloft.net>
On 2018/12/13 13:59, David Miller wrote:
> From: jiangyiwen <jiangyiwen@huawei.com>
> Date: Thu, 13 Dec 2018 11:11:48 +0800
>
>> I hope Host can fill fewer bytes into rx virtqueue, so
>> I keep structure virtio_vsock_mrg_rxbuf_hdr one byte
>> alignment.
>
> The question is if this actully matters.
>
> Do you know?
>
> If the obejct this is embeeded inside of is at least 2 byte aligned,
> you are marking it packed for nothing.
>
> There are only %100 downsides to using the packed attribute.
>
> Simply define your datastructures properly, with fixed sized types,
> and all padding defined explicitly.
>
> .
>
Hi David,
Thanks a lot, I need to send number buffers from Host to Guest, so I think
we need to keep the structure size the same between host and guest.
But after your reminder, I feel my code may exist a serious problem,
that in mergeable mode, I send the total structure virtio_vsock_pkt
from Host to Guest, however, this structure size may be different
under different compilers (Guest and Host are different). Then, Guest
may parse the wrong packet length.
David, I want to ask if there is such a problem?
In addition, why I send total virtio_vsock_pkt structure from Host to Guest?
- In order to avoid to allocate virtio_vsock_pkt memory when receiving
packets, in case of insufficient memory, it may have some advantages, and
we may keep consistent with old version.
Thanks again,
Yiwen.
^ permalink raw reply
* Re: [PATCH] Export mm_update_next_owner function for vhost-net
From: Jason Wang @ 2018-12-13 6:30 UTC (permalink / raw)
To: gchen.guomin, Michael S. Tsirkin, guominchen
Cc: kvm, netdev, linux-kernel, Dominik Brodowski, virtualization,
Luis R. Rodriguez, Eric W. Biederman, Andrew Morton,
Sudip Mukherjee
In-Reply-To: <1544676445-14897-1-git-send-email-gchen.guomin@gmail.com>
On 2018/12/13 下午12:47, gchen.guomin@gmail.com wrote:
> From: guomin chen <gchen.guomin@gmail.com>
>
> Under normal circumstances,When do_exit exits, mm->owner will
> be updated on exit_mm(). but when the kernel process calls
> unuse_mm() and then exits,mm->owner cannot be updated. And it
> will point to a task that has been released.
>
> Below is my issue on vhost_net:
> A, B are two kernel processes(such as vhost_worker),
> C is a user space process(such as qemu), and all
> three use the mm of the user process C.
> Now, because user process C exits abnormally, the owner of this
> mm becomes A. When A calls unuse_mm and exits, this mm->ower
> still points to the A that has been released.
> When B accesses this mm->owner again, A has been released.
Could you describe how you reproduce this issue? I believe vhost process
should exit before process C?
>
> Process A Process B
> vhost_worker() vhost_worker()
> --------- ---------
> use_mm() use_mm()
> ...
> unuse_mm()
> tsk->mm=NULL
> do_exit() page fault
> exit_mm() access mm->owner
> can't update owner kernel Oops
>
> unuse_mm()
>
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Jason Wang <jasowang@redhat.com>
> Cc: kvm@vger.kernel.org
> Cc: virtualization@lists.linux-foundation.org
> Cc: netdev@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
> Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
> Cc: Dominik Brodowski <linux@dominikbrodowski.net>
> Signed-off-by: guomin chen <gchen.guomin@gmail.com>
> ---
> drivers/vhost/vhost.c | 1 +
> kernel/exit.c | 1 +
> 2 files changed, 2 insertions(+)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 6b98d8e..7c09087 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -368,6 +368,7 @@ static int vhost_worker(void *data)
> }
> }
> unuse_mm(dev->mm);
> + mm_update_next_owner(dev->mm);
If you analysis is correct, this is still racy isn't it? (E.g page fault
happen between unuse_mm() and mm_update_next_owner()).
Thanks
> set_fs(oldfs);
> return 0;
> }
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 0e21e6d..9e046dd 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -486,6 +486,7 @@ void mm_update_next_owner(struct mm_struct *mm)
> task_unlock(c);
> put_task_struct(c);
> }
> +EXPORT_SYMBOL(mm_update_next_owner);
> #endif /* CONFIG_MEMCG */
>
> /*
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH v2 2/5] VSOCK: support fill data to mergeable rx buffer in host
From: David Miller @ 2018-12-13 5:59 UTC (permalink / raw)
To: jiangyiwen; +Cc: kvm, mst, netdev, virtualization, stefanha
In-Reply-To: <5C11CDF4.4040405@huawei.com>
From: jiangyiwen <jiangyiwen@huawei.com>
Date: Thu, 13 Dec 2018 11:11:48 +0800
> I hope Host can fill fewer bytes into rx virtqueue, so
> I keep structure virtio_vsock_mrg_rxbuf_hdr one byte
> alignment.
The question is if this actully matters.
Do you know?
If the obejct this is embeeded inside of is at least 2 byte aligned,
you are marking it packed for nothing.
There are only %100 downsides to using the packed attribute.
Simply define your datastructures properly, with fixed sized types,
and all padding defined explicitly.
^ permalink raw reply
* Re: [PATCH net V3 0/3] Fix various issue of vhost
From: David Miller @ 2018-12-13 5:57 UTC (permalink / raw)
To: jasowang; +Cc: netdev, virtualization, linux-kernel, kvm, mst
In-Reply-To: <20181213025339.14023-1-jasowang@redhat.com>
From: Jason Wang <jasowang@redhat.com>
Date: Thu, 13 Dec 2018 10:53:36 +0800
> This series tries to fix various issues of vhost:
>
> - Patch 1 adds a missing write barrier between used idx updating and
> logging.
> - Patch 2-3 brings back the protection of device IOTLB through vq
> mutex, this fixes possible use after free in device IOTLB entries.
>
> Please consider them for -stable.
>
> Thanks
>
> Changes from V2:
> - drop dirty page fix and make it for net-next
> Changes from V1:
> - silent compiler warning for 32bit.
> - use mutex_trylock() on slowpath instead of mutex_lock() even on fast
> path.
Series applied and queued up for -stable, thanks Jason.
^ permalink raw reply
* Re: [PATCH v2 2/5] VSOCK: support fill data to mergeable rx buffer in host
From: jiangyiwen @ 2018-12-13 3:11 UTC (permalink / raw)
To: David Miller; +Cc: kvm, mst, netdev, virtualization, stefanha
In-Reply-To: <20181212.110944.1077302806391053539.davem@davemloft.net>
On 2018/12/13 3:09, David Miller wrote:
> From: jiangyiwen <jiangyiwen@huawei.com>
> Date: Wed, 12 Dec 2018 17:29:31 +0800
>
>> diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
>> index 1d57ed3..2292f30 100644
>> --- a/include/uapi/linux/virtio_vsock.h
>> +++ b/include/uapi/linux/virtio_vsock.h
>> @@ -63,6 +63,11 @@ struct virtio_vsock_hdr {
>> __le32 fwd_cnt;
>> } __attribute__((packed));
>>
>> +/* It add mergeable rx buffers feature */
>> +struct virtio_vsock_mrg_rxbuf_hdr {
>> + __le16 num_buffers; /* number of mergeable rx buffers */
>> +} __attribute__((packed));
>> +
>
> I know the rest of this file uses 'packed' but this attribute should
> only be used if absolutely necessary as it incurs a
> non-trivial performance penalty for some architectures.
>
> .
>
Hi David,
I hope Host can fill fewer bytes into rx virtqueue, so
I keep structure virtio_vsock_mrg_rxbuf_hdr one byte
alignment.
Thanks,
Yiwen.
^ permalink raw reply
* Re: [PATCH v2 2/5] VSOCK: support fill data to mergeable rx buffer in host
From: jiangyiwen @ 2018-12-13 3:08 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: netdev, kvm, Stefan Hajnoczi, virtualization
In-Reply-To: <20181212103138-mutt-send-email-mst@kernel.org>
On 2018/12/12 23:37, Michael S. Tsirkin wrote:
> On Wed, Dec 12, 2018 at 05:29:31PM +0800, jiangyiwen wrote:
>> When vhost support VIRTIO_VSOCK_F_MRG_RXBUF feature,
>> it will merge big packet into rx vq.
>>
>> Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
>
> I feel this approach jumps into making interface changes for
> optimizations too quickly. For example, what prevents us
> from taking a big buffer, prepending each chunk
> with the header and writing it out without
> host/guest interface changes?
>
> This should allow optimizations such as vhost_add_used_n
> batching.
>
> I realize a header in each packet does have a cost,
> but it also has advantages such as improved robustness,
> I'd like to see more of an apples to apples comparison
> of the performance gain from skipping them.
>
>
Hi Michael,
I don't fully understand what you mean, do you want to
see a performance comparison that before performance and
only use batching?
In my opinion, guest don't fill big buffer in rx vq because
the balance performance and guest memory pressure, add
mergeable feature can improve big packets performance,
as for small packets, I try to find out the reason, may be
the fluctuation of test results, or in mergeable mode, when
Host send a 4k packet to Guest, we should call vhost_get_vq_desc()
twice in host(hdr + 4k data), and in guest we also should call
virtqueue_get_buf() twice.
Thanks,
Yiwen.
>> ---
>> drivers/vhost/vsock.c | 111 ++++++++++++++++++++++++++++++--------
>> include/linux/virtio_vsock.h | 1 +
>> include/uapi/linux/virtio_vsock.h | 5 ++
>> 3 files changed, 94 insertions(+), 23 deletions(-)
>>
>> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
>> index 34bc3ab..dc52b0f 100644
>> --- a/drivers/vhost/vsock.c
>> +++ b/drivers/vhost/vsock.c
>> @@ -22,7 +22,8 @@
>> #define VHOST_VSOCK_DEFAULT_HOST_CID 2
>>
>> enum {
>> - VHOST_VSOCK_FEATURES = VHOST_FEATURES,
>> + VHOST_VSOCK_FEATURES = VHOST_FEATURES |
>> + (1ULL << VIRTIO_VSOCK_F_MRG_RXBUF),
>> };
>>
>> /* Used to track all the vhost_vsock instances on the system. */
>> @@ -80,6 +81,69 @@ static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
>> return vsock;
>> }
>>
>> +/* This segment of codes are copied from drivers/vhost/net.c */
>> +static int get_rx_bufs(struct vhost_virtqueue *vq,
>> + struct vring_used_elem *heads, int datalen,
>> + unsigned *iovcount, unsigned int quota)
>> +{
>> + unsigned int out, in;
>> + int seg = 0;
>> + int headcount = 0;
>> + unsigned d;
>> + int ret;
>> + /*
>> + * len is always initialized before use since we are always called with
>> + * datalen > 0.
>> + */
>> + u32 uninitialized_var(len);
>> +
>> + while (datalen > 0 && headcount < quota) {
>> + if (unlikely(seg >= UIO_MAXIOV)) {
>> + ret = -ENOBUFS;
>> + goto err;
>> + }
>> +
>> + ret = vhost_get_vq_desc(vq, vq->iov + seg,
>> + ARRAY_SIZE(vq->iov) - seg, &out,
>> + &in, NULL, NULL);
>> + if (unlikely(ret < 0))
>> + goto err;
>> +
>> + d = ret;
>> + if (d == vq->num) {
>> + ret = 0;
>> + goto err;
>> + }
>> +
>> + if (unlikely(out || in <= 0)) {
>> + vq_err(vq, "unexpected descriptor format for RX: "
>> + "out %d, in %d\n", out, in);
>> + ret = -EINVAL;
>> + goto err;
>> + }
>> +
>> + heads[headcount].id = cpu_to_vhost32(vq, d);
>> + len = iov_length(vq->iov + seg, in);
>> + heads[headcount].len = cpu_to_vhost32(vq, len);
>> + datalen -= len;
>> + ++headcount;
>> + seg += in;
>> + }
>> +
>> + heads[headcount - 1].len = cpu_to_vhost32(vq, len + datalen);
>> + *iovcount = seg;
>> +
>> + /* Detect overrun */
>> + if (unlikely(datalen > 0)) {
>> + ret = UIO_MAXIOV + 1;
>> + goto err;
>> + }
>> + return headcount;
>> +err:
>> + vhost_discard_vq_desc(vq, headcount);
>> + return ret;
>> +}
>> +
>> static void
>> vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
>> struct vhost_virtqueue *vq)
>> @@ -87,22 +151,34 @@ static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
>> struct vhost_virtqueue *tx_vq = &vsock->vqs[VSOCK_VQ_TX];
>> bool added = false;
>> bool restart_tx = false;
>> + int mergeable;
>> + size_t vsock_hlen;
>>
>> mutex_lock(&vq->mutex);
>>
>> if (!vq->private_data)
>> goto out;
>>
>> + mergeable = vhost_has_feature(vq, VIRTIO_VSOCK_F_MRG_RXBUF);
>> + /*
>> + * Guest fill page for rx vq in mergeable case, so it will not
>> + * allocate pkt structure, we should reserve size of pkt in advance.
>> + */
>> + if (likely(mergeable))
>> + vsock_hlen = sizeof(struct virtio_vsock_pkt);
>> + else
>> + vsock_hlen = sizeof(struct virtio_vsock_hdr);
>> +
>> /* Avoid further vmexits, we're already processing the virtqueue */
>> vhost_disable_notify(&vsock->dev, vq);
>>
>> for (;;) {
>> struct virtio_vsock_pkt *pkt;
>> struct iov_iter iov_iter;
>> - unsigned out, in;
>> + unsigned out = 0, in = 0;
>> size_t nbytes;
>> size_t len;
>> - int head;
>> + s16 headcount;
>>
>> spin_lock_bh(&vsock->send_pkt_list_lock);
>> if (list_empty(&vsock->send_pkt_list)) {
>> @@ -116,16 +192,9 @@ static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
>> list_del_init(&pkt->list);
>> spin_unlock_bh(&vsock->send_pkt_list_lock);
>>
>> - head = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
>> - &out, &in, NULL, NULL);
>> - if (head < 0) {
>> - spin_lock_bh(&vsock->send_pkt_list_lock);
>> - list_add(&pkt->list, &vsock->send_pkt_list);
>> - spin_unlock_bh(&vsock->send_pkt_list_lock);
>> - break;
>> - }
>> -
>> - if (head == vq->num) {
>> + headcount = get_rx_bufs(vq, vq->heads, vsock_hlen + pkt->len,
>> + &in, likely(mergeable) ? UIO_MAXIOV : 1);
>> + if (headcount <= 0) {
>> spin_lock_bh(&vsock->send_pkt_list_lock);
>> list_add(&pkt->list, &vsock->send_pkt_list);
>> spin_unlock_bh(&vsock->send_pkt_list_lock);
>> @@ -133,24 +202,20 @@ static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
>> /* We cannot finish yet if more buffers snuck in while
>> * re-enabling notify.
>> */
>> - if (unlikely(vhost_enable_notify(&vsock->dev, vq))) {
>> + if (!headcount && unlikely(vhost_enable_notify(&vsock->dev, vq))) {
>> vhost_disable_notify(&vsock->dev, vq);
>> continue;
>> }
>> break;
>> }
>>
>> - if (out) {
>> - virtio_transport_free_pkt(pkt);
>> - vq_err(vq, "Expected 0 output buffers, got %u\n", out);
>> - break;
>> - }
>> -
>> len = iov_length(&vq->iov[out], in);
>> iov_iter_init(&iov_iter, READ, &vq->iov[out], in, len);
>>
>> - nbytes = copy_to_iter(&pkt->hdr, sizeof(pkt->hdr), &iov_iter);
>> - if (nbytes != sizeof(pkt->hdr)) {
>> + if (likely(mergeable))
>> + pkt->mrg_rxbuf_hdr.num_buffers = cpu_to_le16(headcount);
>> + nbytes = copy_to_iter(&pkt->hdr, vsock_hlen, &iov_iter);
>> + if (nbytes != vsock_hlen) {
>> virtio_transport_free_pkt(pkt);
>> vq_err(vq, "Faulted on copying pkt hdr\n");
>> break;
>> @@ -163,7 +228,7 @@ static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
>> break;
>> }
>>
>> - vhost_add_used(vq, head, sizeof(pkt->hdr) + pkt->len);
>> + vhost_add_used_n(vq, vq->heads, headcount);
>> added = true;
>>
>> if (pkt->reply) {
>> diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
>> index bf84418..da9e1fe 100644
>> --- a/include/linux/virtio_vsock.h
>> +++ b/include/linux/virtio_vsock.h
>> @@ -50,6 +50,7 @@ struct virtio_vsock_sock {
>>
>> struct virtio_vsock_pkt {
>> struct virtio_vsock_hdr hdr;
>> + struct virtio_vsock_mrg_rxbuf_hdr mrg_rxbuf_hdr;
>> struct work_struct work;
>> struct list_head list;
>> /* socket refcnt not held, only use for cancellation */
>> diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
>> index 1d57ed3..2292f30 100644
>> --- a/include/uapi/linux/virtio_vsock.h
>> +++ b/include/uapi/linux/virtio_vsock.h
>> @@ -63,6 +63,11 @@ struct virtio_vsock_hdr {
>> __le32 fwd_cnt;
>> } __attribute__((packed));
>>
>> +/* It add mergeable rx buffers feature */
>> +struct virtio_vsock_mrg_rxbuf_hdr {
>> + __le16 num_buffers; /* number of mergeable rx buffers */
>> +} __attribute__((packed));
>> +
>> enum virtio_vsock_type {
>> VIRTIO_VSOCK_TYPE_STREAM = 1,
>> };
>> --
>> 1.8.3.1
>>
>
> .
>
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox