From: "Michael S. Tsirkin" <mst@redhat.com>
To: Rusty Russell <rusty@rustcorp.com.au>
Cc: virtualization@lists.linux-foundation.org
Subject: Re: vhost questions.
Date: Mon, 18 Mar 2013 10:06:56 +0200 [thread overview]
Message-ID: <20130318080656.GA4512@redhat.com> (raw)
In-Reply-To: <87y5dlsf8e.fsf@rustcorp.com.au>
On Mon, Mar 18, 2013 at 11:20:41AM +1030, Rusty Russell wrote:
> "Michael S. Tsirkin" <mst@redhat.com> writes:
> > On Thu, Mar 14, 2013 at 05:08:22PM +1030, Rusty Russell wrote:
> >> "Michael S. Tsirkin" <mst@redhat.com> writes:
> >> > On Wed, Mar 13, 2013 at 05:19:51PM +1030, Rusty Russell wrote:
> >> >> OK, I've been trying to read the vhost and vhost net code, and I'm
> >> >> struggling with basic questions.
> >> >>
> >> >> 1) Why do we allow userspace to change an already-set-up vhost device?
> >> >> Why not have:
> >> >> open()
> >> >> ioctl(VHOST_GET_FEATURES)
> >> >> ioctl(VHOST_SET_VRING) x n (sets num, addresses, kick/call/err fds)
> >> >> ioctl(VHOST_NET_SETUP)
> >> >> ...
> >> >> close()
> >> >>
> >> >> You're not allowed to call things twice: to reset, you close and
> >> >> reopen. That would save a lot of code which I'm not sure is correct
> >> >> anyway.
> >> >
> >> > This won't work for the usecases we have in qemu.
> >> > The way things are setup currently, qemu typically does not have
> >> > the rights to open any char devices, instead it is passed an fd.
> >>
> >> Ah, because creating a vhost fd is privileged? Why? Because it creates
> >> a kernel thread? Why?
> >>
> >> Using a model where a userspace process/thread manages the ring (even if
> >> it spends 99.99% of its time in the kernel) would avoid this. This is
> >> how KVM itself works, it's no longer privileged.
> >
> > KVM kind of has to have a thread per VCPU, no other options.
>
> Not at all! It could spin off a helper kernel thread to do the work and
> come up with some IPC. But there's a damn good reason that it didn't,
> nor did lguest.
>
> That same reasoning applies here, too. The fact that you create a
> kernel thread then give it the same context and cgroup etc should be the
> clue!
>
> > Here I don't think we know what the best threading model is.
> > We used to have a per-CPU thread, switched to thread pair vq *pair* and
> > there are patches to go back to per-CPU threads, they seem to improve
> > scalability though there are still some open issues wrt cgroups.
> > Using a kernel thread means we can control this without touching userspace.
>
> But kernelspace is the wrong place to make these decisions.
>
> Let's try to imagine what it'd be like if we didn't do it this way, say
> a /dev/vhost-net2 device. You use an ioctl to set up the memory
> mapping, the (tun) fd, the location of a xmit ring and a recv ring (you
> specify one or both... we could have separate xmit and recv devices, but
> that seems a bit overkill).
>
> Now:
> 1) Simplest case, you want a dumb implementation with no extra threads
> in your code. Using select(), if tun fd is readable and vhost-net2 fd is
> writable, call ioctl(VHOST_NET_RECV), which processes as many packets
> as it can. If tun fd is writable and vhost-net2 fd is readable, call
> ioctl(VHOST_NET_XMIT).
>
> 2) You want an I/O thread in your code? That's easy, too. You might
> skip the writable fd checks, and just let the ioctl() block.
>
> 3) Want separate I/O threads? Separate /dev/vhost-net2 fds, one one
> only sets the xmit ring, one only sets the recv ring. They just
> call ioctl() over and over.
>
> 4) Multiqueue comes along? It only makes sense with separate threads
> per queue, so open /dev/vhost-net2 and tunfd once for each thread.
> Do your per-cpu binding here.
>
> Now, there may be issues if we actually coded this, but it *seems* sane.
Doing a per-cpu thread (without multiqueue) may be harder though. Or at
least I don't see a nice way to do it. This is what we started with, and
the interface to allow it stayed around to avoid breaking userspace.
> >> A simple implementation which supports live migration would have to have
> >> enough understanding to save and restore the net state. With a tun
> >> device (which handles all the header stuff), supporting networking is
> >> *trivial*: see lguest (though it needs MRG_RXBUF support). It's about
> >> 100 lines.
> >
> > You assume networking is stopped while we migrate but this doesn't work
> > as migration might take a long time. So you have to have the full
> > device, processing rings and sending/receiving packets all in userspace.
>
> No I don't: 100 lines of code *is* a full networking device.
>
> >> > + * We also trigger polling periodically after each 16 packets
> >> > + * (the value 16 here is more or less arbitrary, it's tuned to trigger
> >> > + * less than 10% of times).
> >> >
> >> > If you have any suggestions on how to improve this,
> >>
> >> But this code does *not* wake every 16 packets.
>
> It wakes when the reference count hits 16, 32, etc. But that reference
> count will go up and down as packets are processed: in theory, it might
> go 0, 1, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3... ?
I see. True. But it's just an optimization, cnt <= 2 makes sure there's
no deadlock.
> >> Anyway, what happens when we wake it? I'm assuming it's the net tx
> >> worker, which will just return the used bufs to the guest? Wouldn't it
> >> be better to do that directly? Maybe wake the tx worker if it needs to
> >> wake the guest, if we can't do that here.
> >
> > Sorry I don't get what you mean here.
>
> Let's check I've got this straight:
>
> Packet is fully xmitted:
> -> skb_release_data -> vhost_zerocopy_callback -> vhost_poll_queue()
>
> Wakes up tx thread -> handle_tx_kick -> handle_tx
> -> vhost_zerocopy_signal_used -> vhost_add_used_and_signal()
Exactly.
> But what if we kmapped the ring so we could access it in
> vhost_zerocopy_callback? We could skip the wakeup altogether, right?
>
> And it happens that I just wrote vringh accessors that work on
> kernelspace, so it should be easier than it was before... :)
>
Hmm this should be better for a latency test where we wake up every time,
but could worse for a bandwidth test.
One thing making it a bit harder is the need to implement flush for when
memory mappings change. I guess we could make the tx_flush flag atomic.
> >> But most importantly, it's *not* the host's problem to fix! Just keep
> >> the damn latency low.
> >
> > OK, let's try to rip it out and see what happens.
>
> Thanks! That'll make things much simpler for me to understand :)
>
> >> >> 4) Why are the following fields in struct vhost_virtqueue, not struct
> >> >> vhost_net?
> >> >>
> >> >> /* hdr is used to store the virtio header.
> >> >> * Since each iovec has >= 1 byte length, we never need more than
> >> >> * header length entries to store the header. */
> >> >> struct iovec hdr[sizeof(struct virtio_net_hdr_mrg_rxbuf)];
> >> >> size_t vhost_hlen;
> >> >> size_t sock_hlen;
> >> >> struct vring_used_elem *heads;
> >> >>
> >> >> ...
> >> >
> >> > Because these are used for both tx and rx. While we handle both from
> >> > the same thread at the moment, so we could reuse same fields, there are
> >> > some patches by Anthony that process tx and rx in separate threads and
> >> > sometimes it's a win. So I don't want to close the door on that.
> >>
> >> But they're *net* specific: They don't belong in the vhost structure.
> >
> > Ah, I see what you mean. Yes, we really need vhost_net_virtqueue,
> > we didn't have one and I let some net specific stuff seep through.
> > We should do something like
>
> Thanks. I'll put that in my tree and start playing with it.
Hmm no need to keep it out of upstream I think.
> >> We do something like:
> >>
> >> err = tun_inject_skb(sock, skb);
> >>
> >> Because the zerocopy_sg_from_iovec() in tun.c already has a comment
> >> about putting it in the network core... in fact, it's cut & pasted into
> >> macvtap :(
> >
> > Right. There's lots of code duplication in tun, macvtap and af_packet,
> > we could factor code out, and then use it everywhere.
> >
> >> The msg thing for the callback is a nasty hack, AFAICT...
> >
> > Yes it's not pretty.
>
> OK, I'll see what I can do.
>
> > Sure, we have lots going on with vhost scsi ATM, but as long as changes
> > are done in small steps I'll figure out conflicts if any without much
> > trouble.
>
> Yep, I think it can be done. I've done some patches so far, and the
> impact on tcm_vhost has been minimal...
>
> Thanks,
> Rusty.
next prev parent reply other threads:[~2013-03-18 8:06 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-03-13 6:49 vhost questions Rusty Russell
2013-03-13 15:59 ` Michael S. Tsirkin
2013-03-14 6:38 ` Rusty Russell
2013-03-14 9:33 ` Michael S. Tsirkin
2013-03-18 0:50 ` Rusty Russell
2013-03-18 8:06 ` Michael S. Tsirkin [this message]
2013-03-17 11:09 ` Michael S. Tsirkin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20130318080656.GA4512@redhat.com \
--to=mst@redhat.com \
--cc=rusty@rustcorp.com.au \
--cc=virtualization@lists.linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).