From: "Michael S. Tsirkin" <mst@redhat.com>
To: Rusty Russell <rusty@rustcorp.com.au>
Cc: virtualization@lists.linux-foundation.org
Subject: Re: vhost questions.
Date: Thu, 14 Mar 2013 11:33:43 +0200
Message-ID: <20130314093342.GC14977@redhat.com>
In-Reply-To: <87wqtatrj5.fsf@rustcorp.com.au>
On Thu, Mar 14, 2013 at 05:08:22PM +1030, Rusty Russell wrote:
> "Michael S. Tsirkin" <mst@redhat.com> writes:
> > On Wed, Mar 13, 2013 at 05:19:51PM +1030, Rusty Russell wrote:
> >> OK, I've been trying to read the vhost and vhost net code, and I'm
> >> struggling with basic questions.
> >>
> >> 1) Why do we allow userspace to change an already-set-up vhost device?
> >> Why not have:
> >> open()
> >> ioctl(VHOST_GET_FEATURES)
> >> ioctl(VHOST_SET_VRING) x n (sets num, addresses, kick/call/err fds)
> >> ioctl(VHOST_NET_SETUP)
> >> ...
> >> close()
> >>
> >> You're not allowed to call things twice: to reset, you close and
> >> reopen. That would save a lot of code which I'm not sure is correct
> >> anyway.
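(For comparison, what a minimal user of the current API does is already pretty
close to your list, just spread across more ioctls. A rough sketch from memory,
single ring shown, error handling omitted; mem/addr describe guest memory and
ring placement and are filled in elsewhere:)

#include <fcntl.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>

static int vhost_net_setup(int tap_fd, struct vhost_memory *mem,
			   struct vhost_vring_addr *addr)
{
	int vhost = open("/dev/vhost-net", O_RDWR);
	uint64_t features;
	struct vhost_vring_state num  = { .index = 0, .num = 256 };
	struct vhost_vring_state base = { .index = 0, .num = 0 };
	struct vhost_vring_file kick  = { .index = 0, .fd = eventfd(0, 0) };
	struct vhost_vring_file call  = { .index = 0, .fd = eventfd(0, 0) };
	struct vhost_vring_file backend = { .index = 0, .fd = tap_fd };

	ioctl(vhost, VHOST_SET_OWNER, NULL);	       /* creates the worker thread */
	ioctl(vhost, VHOST_GET_FEATURES, &features);
	ioctl(vhost, VHOST_SET_FEATURES, &features);
	ioctl(vhost, VHOST_SET_MEM_TABLE, mem);	       /* guest memory layout */
	ioctl(vhost, VHOST_SET_VRING_NUM, &num);
	ioctl(vhost, VHOST_SET_VRING_BASE, &base);
	ioctl(vhost, VHOST_SET_VRING_ADDR, addr);
	ioctl(vhost, VHOST_SET_VRING_KICK, &kick);
	ioctl(vhost, VHOST_SET_VRING_CALL, &call);
	ioctl(vhost, VHOST_NET_SET_BACKEND, &backend); /* start processing */
	return vhost;
}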
> >
> > This won't work for the usecases we have in qemu.
> > The way things are setup currently, qemu typically does not have
> > the rights to open any char devices, instead it is passed an fd.
>
> Ah, because creating a vhost fd is privileged? Why? Because it creates
> a kernel thread? Why?
>
> Using a model where a userspace process/thread manages the ring (even if
> it spends 99.99% of its time in the kernel) would avoid this. This is
> how KVM itself works, it's no longer privileged.
KVM more or less has to have a thread per VCPU; there's no other option.
Here I don't think we know yet what the best threading model is.
We used to have a per-CPU thread, switched to a thread per vq *pair*, and
there are patches to go back to per-CPU threads; they seem to improve
scalability, though there are still some open issues wrt cgroups.
Using a kernel thread means we can change the threading model without touching userspace.
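To be concrete about the cgroups point: the worker created at VHOST_SET_OWNER
time gets pulled into the owner's cgroups, so whatever limits management put on
qemu also cover its vhost processing. A simplified sketch of that path (not the
literal vhost.c code):

/* Simplified sketch of what VHOST_SET_OWNER does today (not the exact code):
 * spawn a dedicated worker and move it into the calling process's cgroups,
 * so cgroup CPU limits on the VM also cover its vhost processing. */
static long vhost_dev_set_owner_sketch(struct vhost_dev *dev)
{
	struct task_struct *worker;

	worker = kthread_create(vhost_worker, dev, "vhost-%d", current->pid);
	if (IS_ERR(worker))
		return PTR_ERR(worker);

	dev->worker = worker;
	wake_up_process(worker);

	/* Attach the worker to every cgroup the owner is in. */
	return cgroup_attach_task_all(current, worker);
}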
> >> 2) Why do we implement write logging? Shouldn't qemu just switch to
> >> emulation if it wants to track writes?
> >
> > We could do it this way. Originally vhost didn't do logging,
> > but Anthony felt it's cleaner this way. We argued this way and that, in
> > the end I agreed it's better to always process rings in vhost, leave
> > userspace alone. For example this allows a simpler userspace that does
> > not implement tx/rx ring processing at all.
>
> We don't support things which don't exist.
>
> A simple implementation which supports live migration would have to have
> enough understanding to save and restore the net state. With a tun
> device (which handles all the header stuff), supporting networking is
> *trivial*: see lguest (though it needs MRG_RXBUF support). It's about
> 100 lines.
You assume networking is stopped while we migrate, but that doesn't work:
migration might take a long time. So you have to have the full device,
processing rings and sending/receiving packets, all in userspace.
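That's what the write logging buys us: qemu keeps vhost running during
migration and just asks it to mark the guest pages it touches. From the
userspace side it's roughly this (sketch only, error handling omitted;
the per-ring log addresses set via VHOST_SET_VRING_ADDR are left out):

#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/vhost.h>

static void *enable_dirty_log(int vhost_fd, uint64_t features, size_t log_size)
{
	/* One bit per guest page written by vhost; qemu scans and clears it
	 * while copying RAM to the destination. */
	void *log = mmap(NULL, log_size, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	uint64_t base = (uint64_t)(uintptr_t)log;

	features |= 1ULL << VHOST_F_LOG_ALL;		/* log every guest write */
	ioctl(vhost_fd, VHOST_SET_FEATURES, &features);
	ioctl(vhost_fd, VHOST_SET_LOG_BASE, &base);
	return log;
}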
> > We used to use a workqueue, but Tejun Heo required this switch.
> > Basically we want everything run from one thread so we
> > can apply cgroup limits to this thread. workqueue does not
> > guarantee it - one thread is an implementation detail.
>
> Right.
>
> >> vhost/net.c questions:
> >>
> >> 1) Why do we wake the net tx worker every time the reference count of
> >> the number of outstanding DMAs is a multiple of 16?
> >
> > Waking every time turns out to be wasteful. Waking only when nothing is
> > outstanding turns out to be slow. The comment says it all really:
> >
> > + * We also trigger polling periodically after each 16 packets
> > + * (the value 16 here is more or less arbitrary, it's tuned to trigger
> > + * less than 10% of times).
> >
> > If you have any suggestions on how to improve this,
>
> But this code does *not* wake every 16 packets.
>
> Anyway, what happens when we wake it? I'm assuming it's the net tx
> worker, which will just return the used bufs to the guest? Wouldn't it
> be better to do that directly? Maybe wake the tx worker if it needs to
> wake the guest, if we can't do that here.
Sorry, I don't get what you mean here.
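For reference, the bit of code we're both talking about looks approximately
like this (quoting from memory, so treat it as a sketch rather than verbatim):

static void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool success)
{
	struct vhost_ubuf_ref *ubufs = ubuf->ctx;
	struct vhost_virtqueue *vq = ubufs->vq;
	int cnt = atomic_read(&ubufs->kref.refcount);	/* DMAs still in flight */

	/* Wake the tx worker if the guest went quiet (refcount bottoming out)
	 * or, as per the comment quoted above, on every 16th completion. */
	if (cnt <= 2 || !(cnt % 16))
		vhost_poll_queue(&vq->poll);

	/* Mark this descriptor's buffers done so handle_tx() returns it. */
	vq->heads[ubuf->desc].len = success ?
		VHOST_DMA_DONE_LEN : VHOST_DMA_FAILED_LEN;
	vhost_ubuf_put(ubufs);
}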
> >> 3) Why don't we simply use a slab allocator to allocate a structure for
> >> outstanding xmit dmas? Wouldn't that be far simpler? Would it be
> >> slower?
> >
> > Which structure for outstanding xmit dmas? ubufs? We could use a cache
> > maybe; plain slab is usually pretty slow.
> >
> > I think what you are really worried about is the upend/done index thing?
> > It's not really there because of the array as such.
> > It's because we try to order the completion of zero copy buffers
> > in the same order they are submitted. This was because I was worried about
> > triggering fragmentation of the descriptor ring (Avi kept raising this
> > concern).
>
> I've never seen any evidence that caching is slowing us down. Many
> theories, but mainly our slowdowns are crap in qemu.
>
> And for xmit rings, it's *REALLY DUMB* because there can be no
> fragmentation with indirect buffers!
>
> If fragmentation becomes a problem, it's easy to use a bitmap-based
> allocator and avoid fragmentation.
>
> But most importantly, it's *not* the host's problem to fix! Just keep
> the damn latency low.
OK, let's try to rip it out and see what happens.
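If we do go there, one direction (untested sketch, names invented) is a plain
slab cache for the per-skb completion records instead of the preallocated
UIO_MAXIOV array indexed by upend_idx:

/* Untested sketch, names invented: hand out ubuf_info records from a
 * dedicated cache instead of the preallocated per-vq array. */
static struct kmem_cache *vhost_zc_cache;

static int __init vhost_zc_cache_init(void)
{
	vhost_zc_cache = kmem_cache_create("vhost_net_zc",
					   sizeof(struct ubuf_info), 0,
					   SLAB_HWCACHE_ALIGN, NULL);
	return vhost_zc_cache ? 0 : -ENOMEM;
}

static struct ubuf_info *vhost_zc_get(void)	/* tx path */
{
	return kmem_cache_alloc(vhost_zc_cache, GFP_KERNEL);
}

static void vhost_zc_put(struct ubuf_info *ubuf)	/* completion callback */
{
	kmem_cache_free(vhost_zc_cache, ubuf);
}

Whether that beats the array is exactly the kind of thing we'd have to measure.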
> >> 4) Why are the following fields in struct vhost_virtqueue, not struct
> >> vhost_net?
> >>
> >> /* hdr is used to store the virtio header.
> >> * Since each iovec has >= 1 byte length, we never need more than
> >> * header length entries to store the header. */
> >> struct iovec hdr[sizeof(struct virtio_net_hdr_mrg_rxbuf)];
> >> size_t vhost_hlen;
> >> size_t sock_hlen;
> >> struct vring_used_elem *heads;
> >>
> >> ...
> >
> > Because these are used for both tx and rx. While we handle both from
> > the same thread at the moment, so we could reuse same fields, there are
> > some patches by Anthony that process tx and rx in separate threads and
> > sometimes it's a win. So I don't want to close the door on that.
>
> But they're *net* specific: They don't belong in the vhost structure.
Ah, I see what you mean. Yes, we really need a vhost_net_virtqueue;
we didn't have one, and I let some net-specific stuff seep through.
We should do something like:
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index ec6fb3f..3c89d51 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -70,9 +70,13 @@ enum vhost_net_poll_state {
 	VHOST_NET_POLL_STOPPED = 2,
 };
 
+struct vhost_net_virtqueue {
+	struct vhost_virtqueue vq;
+};
+
 struct vhost_net {
 	struct vhost_dev dev;
-	struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
+	struct vhost_net_virtqueue vqs[VHOST_NET_VQ_MAX];
 	struct vhost_poll poll[VHOST_NET_VQ_MAX];
 	/* Tells us whether we are polling a socket for TX.
 	 * We only do this when socket buffer fills up.
@@ -612,17 +616,21 @@ static int vhost_net_open(struct inode *inode, struct file *f)
 {
 	struct vhost_net *n = kmalloc(sizeof *n, GFP_KERNEL);
 	struct vhost_dev *dev;
+	struct vhost_virtqueue **vqs = kmalloc(VHOST_NET_VQ_MAX * sizeof(*vqs),
+					       GFP_KERNEL);
 	int r;
 
-	if (!n)
+	if (!n || !vqs) {
+		kfree(n);
+		kfree(vqs);
 		return -ENOMEM;
+	}
 	dev = &n->dev;
-	n->vqs[VHOST_NET_VQ_TX].handle_kick = handle_tx_kick;
-	n->vqs[VHOST_NET_VQ_RX].handle_kick = handle_rx_kick;
-	r = vhost_dev_init(dev, n->vqs, VHOST_NET_VQ_MAX);
+	vqs[VHOST_NET_VQ_TX] = &n->vqs[VHOST_NET_VQ_TX].vq;
+	vqs[VHOST_NET_VQ_RX] = &n->vqs[VHOST_NET_VQ_RX].vq;
+	n->vqs[VHOST_NET_VQ_TX].vq.handle_kick = handle_tx_kick;
+	n->vqs[VHOST_NET_VQ_RX].vq.handle_kick = handle_rx_kick;
+	r = vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX);
 	if (r < 0) {
 		kfree(n);
+		kfree(vqs);
 		return r;
 	}
(with vhost_dev_init() taught to take the array of vhost_virtqueue pointers
allocated above). And then we can move the net-specific stuff into
vhost_net_virtqueue.
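Concretely, the end state I have in mind is roughly the following; this is a
sketch only, composed from the fields you listed, and the exact membership can
be sorted out while doing the conversion:

struct vhost_net_virtqueue {
	struct vhost_virtqueue vq;
	/* hdr is used to store the virtio header.
	 * Since each iovec has >= 1 byte length, we never need more than
	 * header length entries to store the header. */
	struct iovec hdr[sizeof(struct virtio_net_hdr_mrg_rxbuf)];
	size_t vhost_hlen;
	size_t sock_hlen;
	/* vhost zerocopy support fields below: */
	/* last used idx for outstanding DMA zerocopy buffers */
	int upend_idx;
	/* first used idx for DMA done zerocopy buffers */
	int done_idx;
	/* an array of userspace buffers info */
	struct ubuf_info *ubuf_info;
	/* Reference counting for outstanding ubufs.
	 * Protected by vq mutex. Writers must also take device mutex. */
	struct vhost_ubuf_ref *ubufs;
};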
> And from my reading, hdr[] is written to on the xmit side, but it's not
> used. sock_hlen is entirely unused on xmit. heads[] is abused, and
> should be its own structure.
True.
> >> /* vhost zerocopy support fields below: */
> >> /* last used idx for outstanding DMA zerocopy buffers */
> >> int upend_idx;
> >> /* first used idx for DMA done zerocopy buffers */
> >> int done_idx;
> >> /* an array of userspace buffers info */
> >> struct ubuf_info *ubuf_info;
> >> /* Reference counting for outstanding ubufs.
> >> * Protected by vq mutex. Writers must also take device mutex. */
> >> struct vhost_ubuf_ref *ubufs;
> >
> > Only used from tx. I was trying to keep the option of multiple
> > tx rings in one vhost, and the option of rx zero copy open, but it's not
> > a must.
>
> Again, these are all very net specific.
>
> >> 5) Why does vhost_net use the sendmsg interface for tun (and macvlan?)?
> >> It's not usable from userspace (it looks like if you can sendmsg to a
> >> tun device you can make it call an arbitrary pointer though).
> >
> > Yes we could switch to some other interface, but only if we
> > deprecate the packet socket backend and remove it at some point.
> >
> >> It would be better for us to create the skb, and have a special
> >> interface to inject it directly. This way we can handle the
> >> UIO_MAXIOV limit ourselves (eg. by not doing zero copy on such stupid
> >> cases).
> >
> > Basically because userspace already configures the backend with existing
> > tun ioctls. This way we don't need to introduce new interfaces for this,
> > just virtio stuff for vhost net. It's also good that we share code with
> > tun, often both vhost and virtio processing can gain from same
> > optimizations and new features. Happened in the past already.
>
> Sorry, I wasn't clear here. I mean instead of doing:
>
> err = sock->ops->recvmsg(NULL, sock, &msg,
> sock_len, MSG_DONTWAIT | MSG_TRUNC);
>
> We do something like:
>
> err = tun_inject_skb(sock, skb);
>
> Because the zerocopy_sg_from_iovec() in tun.c already has a comment
> about putting it in the network core... in fact, it's cut & pasted into
> macvtap :(
Right. There's a lot of code duplication between tun, macvtap and af_packet;
we could factor the common code out and then use it everywhere.
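E.g. the iovec-to-skb-frags helper could live in net/core with a signature
along these lines (name and exact arguments invented here, just to show the
shape; tun, macvtap and af_packet would all become callers):

int skb_zerocopy_sg_from_iovec(struct sk_buff *skb, const struct iovec *from,
			       int offset, size_t count);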
> The msg thing for the callback is a nasty hack, AFAICT...
Yes, it's not pretty.
> > Basically not many people care about virt. The less virt specific
> > code we have the better. I lost count of times I had this dialog when I
> > needed some help
> > - we see this problem XYZ in host net stack
> > - oh it's vhost I don't know what it does, sorry
> > - actually it only builds up iovecs all networking is done by tun that
> > everyone uses
> > - ok then let me take a look
> >
> >> 6) By calling tun_get_socket() and macvtap_get_socket() we are forcing
> >> those modules to load. We should use symbol_get() to avoid this.
> >
> > Good idea.
>
> I can create this patch.
>
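Something along these lines should do for the tun case (untested sketch; the
open tap file already pins the module while we hold onto the socket, macvtap
would get the same treatment):

/* Untested sketch: resolve tun_get_socket() at runtime so vhost_net does not
 * force tun.ko to load just by being linked against it. */
static struct socket *get_tap_socket_weak(struct file *file)
{
	struct socket *(*get_sock)(struct file *);
	struct socket *sock;

	get_sock = symbol_get(tun_get_socket);
	if (!get_sock)
		return ERR_PTR(-EBADFD);	/* tun not loaded: can't be a tun fd */
	sock = get_sock(file);
	symbol_put(tun_get_socket);
	return sock;
}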
> >> In short, this code is a mess. Assuming we can't just break the API,
> >> we've got a long process ahead of us to get it into shape :(
> >>
> >> Thanks,
> >> Rusty.
> >
> > The zerocopy path is the messiest, I think. We should really consider
> > removing the ordering of zero copy buffers. The rest I don't consider
> > so messy, and it should be easy to clean up zerocopy without
> > breaking the API.
>
> I've got some cleanup patches, but I'll keep going and send them to you
> for review...
>
> Thanks,
> Rusty
Sure. We have a lot going on with vhost-scsi ATM, but as long as the changes
come in small steps I'll sort out any conflicts without much trouble.
--
MST