From: "Michael S. Tsirkin" <mst@redhat.com>
To: Rusty Russell <rusty@rustcorp.com.au>
Cc: virtualization@lists.linux-foundation.org
Subject: Re: vhost questions.
Date: Thu, 14 Mar 2013 11:33:43 +0200
Message-ID: <20130314093342.GC14977@redhat.com>
In-Reply-To: <87wqtatrj5.fsf@rustcorp.com.au>
On Thu, Mar 14, 2013 at 05:08:22PM +1030, Rusty Russell wrote:
> "Michael S. Tsirkin" <mst@redhat.com> writes:
> > On Wed, Mar 13, 2013 at 05:19:51PM +1030, Rusty Russell wrote:
> >> OK, I've been trying to read the vhost and vhost net code, and I'm
> >> struggling with basic questions.
> >>
> >> 1) Why do we allow userspace to change an already-set-up vhost device?
> >> Why not have:
> >> open()
> >> ioctl(VHOST_GET_FEATURES)
> >> ioctl(VHOST_SET_VRING) x n (sets num, addresses, kick/call/err fds)
> >> ioctl(VHOST_NET_SETUP)
> >> ...
> >> close()
> >>
> >> You're not allowed to call things twice: to reset, you close and
> >> reopen. That would save a lot of code which I'm not sure is correct
> >> anyway.
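(For comparison, what a minimal user of the current API does is already pretty
close to your list, just spread across more ioctls. A rough sketch from memory,
single ring shown, error handling omitted; mem/addr describe guest memory and
ring placement and are filled in elsewhere:)

#include <fcntl.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>

static int vhost_net_setup(int tap_fd, struct vhost_memory *mem,
			   struct vhost_vring_addr *addr)
{
	int vhost = open("/dev/vhost-net", O_RDWR);
	uint64_t features;
	struct vhost_vring_state num  = { .index = 0, .num = 256 };
	struct vhost_vring_state base = { .index = 0, .num = 0 };
	struct vhost_vring_file kick  = { .index = 0, .fd = eventfd(0, 0) };
	struct vhost_vring_file call  = { .index = 0, .fd = eventfd(0, 0) };
	struct vhost_vring_file backend = { .index = 0, .fd = tap_fd };

	ioctl(vhost, VHOST_SET_OWNER, NULL);	       /* creates the worker thread */
	ioctl(vhost, VHOST_GET_FEATURES, &features);
	ioctl(vhost, VHOST_SET_FEATURES, &features);
	ioctl(vhost, VHOST_SET_MEM_TABLE, mem);	       /* guest memory layout */
	ioctl(vhost, VHOST_SET_VRING_NUM, &num);
	ioctl(vhost, VHOST_SET_VRING_BASE, &base);
	ioctl(vhost, VHOST_SET_VRING_ADDR, addr);
	ioctl(vhost, VHOST_SET_VRING_KICK, &kick);
	ioctl(vhost, VHOST_SET_VRING_CALL, &call);
	ioctl(vhost, VHOST_NET_SET_BACKEND, &backend); /* start processing */
	return vhost;
}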
> >
> > This won't work for the usecases we have in qemu.
> > The way things are setup currently, qemu typically does not have
> > the rights to open any char devices, instead it is passed an fd.
>
> Ah, because creating a vhost fd is privileged? Why? Because it creates
> a kernel thread? Why?
>
> Using a model where a userspace process/thread manages the ring (even if
> it spends 99.99% of its time in the kernel) would avoid this. This is
> how KVM itself works, it's no longer privileged.
KVM more or less has to have a thread per VCPU; there's no other option.
Here I don't think we know yet what the best threading model is.
We used to have a per-CPU thread, switched to a thread per vq *pair*, and
there are patches to go back to per-CPU threads; they seem to improve
scalability, though there are still some open issues wrt cgroups.
Using a kernel thread means we can change the threading model without touching userspace.
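To be concrete about the cgroups point: the worker created at VHOST_SET_OWNER
time gets pulled into the owner's cgroups, so whatever limits management put on
qemu also cover its vhost processing. A simplified sketch of that path (not the
literal vhost.c code):

/* Simplified sketch of what VHOST_SET_OWNER does today (not the exact code):
 * spawn a dedicated worker and move it into the calling process's cgroups,
 * so cgroup CPU limits on the VM also cover its vhost processing. */
static long vhost_dev_set_owner_sketch(struct vhost_dev *dev)
{
	struct task_struct *worker;

	worker = kthread_create(vhost_worker, dev, "vhost-%d", current->pid);
	if (IS_ERR(worker))
		return PTR_ERR(worker);

	dev->worker = worker;
	wake_up_process(worker);

	/* Attach the worker to every cgroup the owner is in. */
	return cgroup_attach_task_all(current, worker);
}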
> >> 2) Why do we implement write logging? Shouldn't qemu just switch to
> >> emulation if it wants to track writes?
> >
> > We could do it this way. Originally vhost didn't do logging,
> > but Anthony felt it's cleaner this way. We argued this way and that, in
> > the end I agreed it's better to always process rings in vhost, leave
> > userspace alone. For example this allows a simpler userspace that does
> > not implement tx/rx ring processing at all.
>
> We don't support things which don't exist.
>
> A simple implementation which supports live migration would have to have
> enough understanding to save and restore the net state. With a tun
> device (which handles all the header stuff), supporting networking is
> *trivial*: see lguest (though it needs MRG_RXBUF support). It's about
> 100 lines.
You assume networking is stopped while we migrate, but that doesn't work:
migration might take a long time. So you have to have the full device,
processing rings and sending/receiving packets, all in userspace.
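That's what the write logging buys us: qemu keeps vhost running during
migration and just asks it to mark the guest pages it touches. From the
userspace side it's roughly this (sketch only, error handling omitted;
the per-ring log addresses set via VHOST_SET_VRING_ADDR are left out):

#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/vhost.h>

static void *enable_dirty_log(int vhost_fd, uint64_t features, size_t log_size)
{
	/* One bit per guest page written by vhost; qemu scans and clears it
	 * while copying RAM to the destination. */
	void *log = mmap(NULL, log_size, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	uint64_t base = (uint64_t)(uintptr_t)log;

	features |= 1ULL << VHOST_F_LOG_ALL;		/* log every guest write */
	ioctl(vhost_fd, VHOST_SET_FEATURES, &features);
	ioctl(vhost_fd, VHOST_SET_LOG_BASE, &base);
	return log;
}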
> > We used to use a workqueue, but Tejun Heo required this switch.
> > Basically we want everything run from one thread so we
> > can apply cgroup limits to this thread. workqueue does not
> > guarantee it - one thread is an implementation detail.
>
> Right.
>
> >> vhost/net.c questions:
> >>
> >> 1) Why do we wake the net tx worker every time the reference count of
> >> the number of outstanding DMAs is a multiple of 16?
> >
> > Waking every time turns out to be wasteful. Waking only when nothing is
> > outstanding turns out to be slow. The comment says it all really:
> >
> > + * We also trigger polling periodically after each 16 packets
> > + * (the value 16 here is more or less arbitrary, it's tuned to trigger
> > + * less than 10% of times).
> >
> > If you have any suggestions on how to improve this,
>
> But this code does *not* wake every 16 packets.
>
> Anyway, what happens when we wake it? I'm assuming it's the net tx
> worker, which will just return the used bufs to the guest? Wouldn't it
> be better to do that directly? Maybe wake the tx worker if it needs to
> wake the guest, if we can't do that here.
Sorry, I don't get what you mean here.
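For reference, the bit of code we're both talking about looks approximately
like this (quoting from memory, so treat it as a sketch rather than verbatim):

static void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool success)
{
	struct vhost_ubuf_ref *ubufs = ubuf->ctx;
	struct vhost_virtqueue *vq = ubufs->vq;
	int cnt = atomic_read(&ubufs->kref.refcount);	/* DMAs still in flight */

	/* Wake the tx worker if the guest went quiet (refcount bottoming out)
	 * or, as per the comment quoted above, on every 16th completion. */
	if (cnt <= 2 || !(cnt % 16))
		vhost_poll_queue(&vq->poll);

	/* Mark this descriptor's buffers done so handle_tx() returns it. */
	vq->heads[ubuf->desc].len = success ?
		VHOST_DMA_DONE_LEN : VHOST_DMA_FAILED_LEN;
	vhost_ubuf_put(ubufs);
}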
> >> 3) Why don't we simply use a slab allocator to allocate a structure for
> >> outstanding xmit dmas? Wouldn't that be far simpler? Would it be
> >> slower?
> >
> > Which structure for outstanding xmit dmas? ubufs? We could use a cache
> > maybe; plain slab is usually pretty slow.
> >
> > I think what you are really worried about is the upend/done index thing?
> > It's not really there because of the array as such.
> > It's because we try to order the completion of zero copy buffers
> > in the same order they are submitted. This was because I was worried about
> > triggering fragmentation of the descriptor ring (Avi kept raising this
> > concern).
>
> I've never seen any evidence that caching is slowing us down. Many
> theories, but mainly our slowdowns are crap in qemu.
>
> And for xmit rings, it's *REALLY DUMB* because there can be no
> fragmentation with indirect buffers!
>
> If fragmentation becomes a problem, it's easy to use a bitmap-based
> allocator and avoid fragmentation.
>
> But most importantly, it's *not* the host's problem to fix! Just keep
> the damn latency low.
OK, let's try to rip it out and see what happens.
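If we do go there, one direction (untested sketch, names invented) is a plain
slab cache for the per-skb completion records instead of the preallocated
UIO_MAXIOV array indexed by upend_idx:

/* Untested sketch, names invented: hand out ubuf_info records from a
 * dedicated cache instead of the preallocated per-vq array. */
static struct kmem_cache *vhost_zc_cache;

static int __init vhost_zc_cache_init(void)
{
	vhost_zc_cache = kmem_cache_create("vhost_net_zc",
					   sizeof(struct ubuf_info), 0,
					   SLAB_HWCACHE_ALIGN, NULL);
	return vhost_zc_cache ? 0 : -ENOMEM;
}

static struct ubuf_info *vhost_zc_get(void)	/* tx path */
{
	return kmem_cache_alloc(vhost_zc_cache, GFP_KERNEL);
}

static void vhost_zc_put(struct ubuf_info *ubuf)	/* completion callback */
{
	kmem_cache_free(vhost_zc_cache, ubuf);
}

Whether that beats the array is exactly the kind of thing we'd have to measure.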
> >> 4) Why are the following fields in struct vhost_virtqueue, not struct
> >> vhost_net?
> >>
> >> /* hdr is used to store the virtio header.
> >> * Since each iovec has >= 1 byte length, we never need more than
> >> * header length entries to store the header. */
> >> struct iovec hdr[sizeof(struct virtio_net_hdr_mrg_rxbuf)];
> >> size_t vhost_hlen;
> >> size_t sock_hlen;
> >> struct vring_used_elem *heads;
> >>
> >> ...
> >
> > Because these are used for both tx and rx. While we handle both from
> > the same thread at the moment, so we could reuse same fields, there are
> > some patches by Anthony that process tx and rx in separate threads and
> > sometimes it's a win. So I don't want to close the door on that.
>
> But they're *net* specific: They don't belong in the vhost structure.
Ah, I see what you mean. Yes, we really need a vhost_net_virtqueue;
we didn't have one, and I let some net-specific stuff seep through.
We should do something like:
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index ec6fb3f..3c89d51 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -70,9 +70,13 @@ enum vhost_net_poll_state {
 	VHOST_NET_POLL_STOPPED = 2,
 };
 
+struct vhost_net_virtqueue {
+	struct vhost_virtqueue vq;
+};
+
 struct vhost_net {
 	struct vhost_dev dev;
-	struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
+	struct vhost_net_virtqueue vqs[VHOST_NET_VQ_MAX];
 	struct vhost_poll poll[VHOST_NET_VQ_MAX];
 	/* Tells us whether we are polling a socket for TX.
 	 * We only do this when socket buffer fills up.
@@ -612,17 +616,21 @@ static int vhost_net_open(struct inode *inode, struct file *f)
 {
 	struct vhost_net *n = kmalloc(sizeof *n, GFP_KERNEL);
 	struct vhost_dev *dev;
+	struct vhost_virtqueue **vqs = kmalloc(VHOST_NET_VQ_MAX * sizeof(*vqs),
+					       GFP_KERNEL);
 	int r;
 
-	if (!n)
+	if (!n || !vqs) {
+		kfree(n);
+		kfree(vqs);
 		return -ENOMEM;
+	}
 	dev = &n->dev;
-	n->vqs[VHOST_NET_VQ_TX].handle_kick = handle_tx_kick;
-	n->vqs[VHOST_NET_VQ_RX].handle_kick = handle_rx_kick;
-	r = vhost_dev_init(dev, n->vqs, VHOST_NET_VQ_MAX);
+	vqs[VHOST_NET_VQ_TX] = &n->vqs[VHOST_NET_VQ_TX].vq;
+	vqs[VHOST_NET_VQ_RX] = &n->vqs[VHOST_NET_VQ_RX].vq;
+	n->vqs[VHOST_NET_VQ_TX].vq.handle_kick = handle_tx_kick;
+	n->vqs[VHOST_NET_VQ_RX].vq.handle_kick = handle_rx_kick;
+	r = vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX);
 	if (r < 0) {
 		kfree(n);
+		kfree(vqs);
 		return r;
 	}
(with vhost_dev_init() taught to take the array of vhost_virtqueue pointers
allocated above). And then we can move the net-specific stuff into
vhost_net_virtqueue.
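Concretely, the end state I have in mind is roughly the following; this is a
sketch only, composed from the fields you listed, and the exact membership can
be sorted out while doing the conversion:

struct vhost_net_virtqueue {
	struct vhost_virtqueue vq;
	/* hdr is used to store the virtio header.
	 * Since each iovec has >= 1 byte length, we never need more than
	 * header length entries to store the header. */
	struct iovec hdr[sizeof(struct virtio_net_hdr_mrg_rxbuf)];
	size_t vhost_hlen;
	size_t sock_hlen;
	/* vhost zerocopy support fields below: */
	/* last used idx for outstanding DMA zerocopy buffers */
	int upend_idx;
	/* first used idx for DMA done zerocopy buffers */
	int done_idx;
	/* an array of userspace buffers info */
	struct ubuf_info *ubuf_info;
	/* Reference counting for outstanding ubufs.
	 * Protected by vq mutex. Writers must also take device mutex. */
	struct vhost_ubuf_ref *ubufs;
};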
> And from my reading, hdr[] is written to on the xmit side, but it's not
> used. sock_hlen is entirely unused on xmit. heads[] is abused, and
> should be its own structure.
True.
> >> /* vhost zerocopy support fields below: */
> >> /* last used idx for outstanding DMA zerocopy buffers */
> >> int upend_idx;
> >> /* first used idx for DMA done zerocopy buffers */
> >> int done_idx;
> >> /* an array of userspace buffers info */
> >> struct ubuf_info *ubuf_info;
> >> /* Reference counting for outstanding ubufs.
> >> * Protected by vq mutex. Writers must also take device mutex. */
> >> struct vhost_ubuf_ref *ubufs;
> >
> > Only used from tx. I was trying to keep the option of multiple
> > tx rings in one vhost, and the option of rx zero copy open, but it's not
> > a must.
>
> Again, these are all very net specific.
>
> >> 5) Why does vhost_net use the sendmsg interface for tun (and macvlan?)?
> >> It's not usable from userspace (it looks like if you can sendmsg to a
> >> tun device you can make it call an arbitrary pointer though).
> >
> > Yes we could switch to some other interface, but only if we
> > deprecate the packet socket backend and remove it at some point.
> >
> >> It would be better for us to create the skb, and have a special
> >> interface to inject it directly. This way we can handle the
> >> UIO_MAXIOV limit ourselves (eg. by not doing zero copy on such stupid
> >> cases).
> >
> > Basically because userspace already configures the backend with existing
> > tun ioctls. This way we don't need to introduce new interfaces for this,
> > just virtio stuff for vhost net. It's also good that we share code with
> > tun, often both vhost and virtio processing can gain from same
> > optimizations and new features. Happened in the past already.
>
> Sorry, I wasn't clear here. I mean instead of doing:
>
> err = sock->ops->recvmsg(NULL, sock, &msg,
> sock_len, MSG_DONTWAIT | MSG_TRUNC);
>
> We do something like:
>
> err = tun_inject_skb(sock, skb);
>
> Because the zerocopy_sg_from_iovec() in tun.c already has a comment
> about putting it in the network core... in fact, it's cut & pasted into
> macvtap :(
Right. There's a lot of code duplication between tun, macvtap and af_packet;
we could factor the common code out and then use it everywhere.
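E.g. the iovec-to-skb-frags helper could live in net/core with a signature
along these lines (name and exact arguments invented here, just to show the
shape; tun, macvtap and af_packet would all become callers):

int skb_zerocopy_sg_from_iovec(struct sk_buff *skb, const struct iovec *from,
			       int offset, size_t count);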
> The msg thing for the callback is a nasty hack, AFAICT...
Yes, it's not pretty.
> > Basically not many people care about virt. The less virt specific
> > code we have the better. I lost count of times I had this dialog when I
> > needed some help
> > - we see this problem XYZ in host net stack
> > - oh it's vhost I don't know what it does, sorry
> > - actually it only builds up iovecs all networking is done by tun that
> > everyone uses
> > - ok then let me take a look
> >
> >> 6) By calling tun_get_socket() and macvtap_get_socket() we are forcing
> >> those modules to load. We should use symbol_get() to avoid this.
> >
> > Good idea.
>
> I can create this patch.
>
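Something along these lines should do for the tun case (untested sketch; the
open tap file already pins the module while we hold onto the socket, macvtap
would get the same treatment):

/* Untested sketch: resolve tun_get_socket() at runtime so vhost_net does not
 * force tun.ko to load just by being linked against it. */
static struct socket *get_tap_socket_weak(struct file *file)
{
	struct socket *(*get_sock)(struct file *);
	struct socket *sock;

	get_sock = symbol_get(tun_get_socket);
	if (!get_sock)
		return ERR_PTR(-EBADFD);	/* tun not loaded: can't be a tun fd */
	sock = get_sock(file);
	symbol_put(tun_get_socket);
	return sock;
}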
> >> In short, this code is a mess. Assuming we can't just break the API,
> >> we've got a long process ahead of us to get it into shape :(
> >>
> >> Thanks,
> >> Rusty.
> >
> > The zerocopy path is the messiest, I think. We should really consider
> > removing the ordering of zero copy buffers. The rest I don't consider
> > so messy, and it should be easy to clean up zerocopy without
> > breaking the API.
>
> I've got some cleanup patches, but I'll keep going and send them to you
> for review...
>
> Thanks,
> Rusty
Sure. We have a lot going on with vhost-scsi ATM, but as long as the changes
come in small steps I'll sort out any conflicts without much trouble.
--
MST