* vhost questions.
From: Rusty Russell @ 2013-03-13 6:49 UTC
To: mst; +Cc: virtualization
OK, I've been trying to read the vhost and vhost net code, and I'm
struggling with basic questions.
1) Why do we allow userspace to change an already-set-up vhost device?
Why not have:
open()
ioctl(VHOST_GET_FEATURES)
ioctl(VHOST_SET_VRING) x n (sets num, addresses, kick/call/err fds)
ioctl(VHOST_NET_SETUP)
...
close()
You're not allowed to call things twice: to reset, you close and
reopen. That would save a lot of code which I'm not sure is correct
anyway.
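For reference, the setup a userspace backend does today looks roughly like the sketch below (error handling dropped; mem, addr and the various fds are assumed to have been prepared by the caller). The point of the question is that nothing in the current API stops any of these calls being re-issued later on a live device.

#include <sys/ioctl.h>
#include <linux/vhost.h>

static void setup_vhost_net(int vhost_fd, unsigned int idx,
			    struct vhost_memory *mem,
			    struct vhost_vring_addr *addr,
			    int kick_fd, int call_fd, int backend_fd)
{
	__u64 features;

	/* once per device */
	ioctl(vhost_fd, VHOST_SET_OWNER);
	ioctl(vhost_fd, VHOST_GET_FEATURES, &features);
	ioctl(vhost_fd, VHOST_SET_FEATURES, &features);
	ioctl(vhost_fd, VHOST_SET_MEM_TABLE, mem);

	/* per vring */
	ioctl(vhost_fd, VHOST_SET_VRING_NUM,
	      &(struct vhost_vring_state){ .index = idx, .num = 256 });
	ioctl(vhost_fd, VHOST_SET_VRING_BASE,
	      &(struct vhost_vring_state){ .index = idx, .num = 0 });
	ioctl(vhost_fd, VHOST_SET_VRING_ADDR, addr);
	ioctl(vhost_fd, VHOST_SET_VRING_KICK,
	      &(struct vhost_vring_file){ .index = idx, .fd = kick_fd });
	ioctl(vhost_fd, VHOST_SET_VRING_CALL,
	      &(struct vhost_vring_file){ .index = idx, .fd = call_fd });
	ioctl(vhost_fd, VHOST_NET_SET_BACKEND,
	      &(struct vhost_vring_file){ .index = idx, .fd = backend_fd });
}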
2) Why do we implement write logging? Shouldn't qemu just switch to
emulation if it wants to track writes?
3) Why do we have our own workqueue implementation?
vhost/net.c questions:
1) Why do we wake the net tx worker every time the reference count of
the number of outstanding DMAs is a multiple of 16?
2) Why is vhost_ubuf_ref kmalloced, rather than a structure member?
3) Why don't we simply use a slab allocator to allocate a structure for
outstanding xmit dmas? Wouldn't that be far simpler? Would it be
slower?
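(Something like the sketch below is what "just use slab" would mean; the cache name and helpers here are made up, purely to make the question concrete.)

#include <linux/slab.h>
#include <linux/skbuff.h>

/* hypothetical: a dedicated cache for per-packet zerocopy completion state */
static struct kmem_cache *vhost_zc_cachep;

static int vhost_zc_cache_init(void)
{
	vhost_zc_cachep = kmem_cache_create("vhost_net_zc",
					    sizeof(struct ubuf_info), 0,
					    SLAB_HWCACHE_ALIGN, NULL);
	return vhost_zc_cachep ? 0 : -ENOMEM;
}

static struct ubuf_info *vhost_zc_alloc(void)
{
	/* allocate on xmit, free from the completion callback */
	return kmem_cache_alloc(vhost_zc_cachep, GFP_KERNEL);
}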
4) Why are the following fields in struct vhost_virtqueue, not struct
vhost_net?
/* hdr is used to store the virtio header.
* Since each iovec has >= 1 byte length, we never need more than
* header length entries to store the header. */
struct iovec hdr[sizeof(struct virtio_net_hdr_mrg_rxbuf)];
size_t vhost_hlen;
size_t sock_hlen;
struct vring_used_elem *heads;
...
/* vhost zerocopy support fields below: */
/* last used idx for outstanding DMA zerocopy buffers */
int upend_idx;
/* first used idx for DMA done zerocopy buffers */
int done_idx;
/* an array of userspace buffers info */
struct ubuf_info *ubuf_info;
/* Reference counting for outstanding ubufs.
* Protected by vq mutex. Writers must also take device mutex. */
struct vhost_ubuf_ref *ubufs;
5) Why does vhost_net use the sendmsg interface for tun (and macvlan?)?
It's not usable from userspace (it looks like if you can sendmsg to a
tun device you can make it call an arbitrary pointer though).
It would be better for us to create the skb, and have a special
interface to inject it directly. This way we can handle the
UIO_MAXIOV limit ourselves (eg. by not doing zero copy on such stupid
cases).
6) By calling tun_get_socket() and macvtap_get_socket() we are forcing
those modules to load. We should use symbol_get() to avoid this.
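(A rough sketch of the symbol_get() approach, untested; the lifetime of the returned socket vs. the module reference would need more care than shown here:)

#include <linux/module.h>
#include <linux/err.h>
#include <linux/if_tun.h>

static struct socket *get_tun_socket(struct file *file)
{
	struct socket *(*get_sock)(struct file *);
	struct socket *sock = ERR_PTR(-ENOTSOCK);

	/* only resolves if tun.ko is already loaded; no forced module load */
	get_sock = symbol_get(tun_get_socket);
	if (get_sock) {
		sock = get_sock(file);
		symbol_put(tun_get_socket);
	}
	return sock;
}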
In short, this code is a mess. Assuming we can't just break the API,
we've got a long process ahead of us to get it into shape :(
Thanks,
Rusty.
* Re: vhost questions.
From: Michael S. Tsirkin @ 2013-03-13 15:59 UTC
To: Rusty Russell; +Cc: virtualization
On Wed, Mar 13, 2013 at 05:19:51PM +1030, Rusty Russell wrote:
> OK, I've been trying to read the vhost and vhost net code, and I'm
> struggling with basic questions.
>
> 1) Why do we allow userspace to change an already-set-up vhost device?
> Why not have:
> open()
> ioctl(VHOST_GET_FEATURES)
> ioctl(VHOST_SET_VRING) x n (sets num, addresses, kick/call/err fds)
> ioctl(VHOST_NET_SETUP)
> ...
> close()
>
> You're not allowed to call things twice: to reset, you close and
> reopen. That would save a lot of code which I'm not sure is correct
> anyway.
This won't work for the usecases we have in qemu.
The way things are setup currently, qemu typically does not have
the rights to open any char devices, instead it is passed an fd.
> 2) Why do we implement write logging? Shouldn't qemu just switch to
> emulation if it wants to track writes?
We could do it this way. Originally vhost didn't do logging,
but Anthony felt it's cleaner this way. We argued this way and that, in
the end I agreed it's better to always process rings in vhost, leave
userspace alone. For example this allows a simpler userspace that does
not implement tx/rx ring processing at all.
> 3) Why do we have our own workqueue implementation?
We used to use a workqueue, but Tejun Heo required this switch.
Basically we want everything run from one thread so we
can apply cgroup limits to this thread. workqueue does not
guarantee it - one thread is an implementation detail.
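For reference, the relevant part of vhost_dev_set_owner() is roughly this shape (paraphrased from memory, not a verbatim copy):

	/* one dedicated worker thread per vhost device... */
	worker = kthread_create(vhost_worker, dev, "vhost-%d", current->pid);
	if (IS_ERR(worker))
		return PTR_ERR(worker);
	dev->worker = worker;
	wake_up_process(worker);

	/* ...moved into the owner's cgroups so its CPU time is accounted
	 * and limited along with the qemu process that owns the device
	 * (this ends up calling cgroup_attach_task_all()) */
	err = vhost_attach_cgroups(dev);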
> vhost/net.c questions:
>
> 1) Why do we wake the net tx worker every time the reference count of
> the number of outstanding DMAs is a multiple of 16?
Waking every time turns out to be wasteful. Waking only when nothing is
outstanding turns out to be slow. The comment says it all really:
+ * We also trigger polling periodically after each 16 packets
+ * (the value 16 here is more or less arbitrary, it's tuned to trigger
+ * less than 10% of times).
If you have any suggestions on how to improve this,
>
> 3) Why don't we simply use a slab allocator to allocate a structure for
> outstanding xmit dmas? Wouldn't that be far simpler? Would it be
> slower?
Which structure for outstanding xmit dmas? ubufs? We could use a cache
maybe, plain slab is usually pretty slow.
I think what you are really worried about is the upend/done index thing?
It's not really there because of the array really.
It's because we try to order the completion of zero copy buffers
in the same order they are submitted. This was because I was worried about
triggering fragmentation of the descriptor ring (Avi kept raising this
concern).
> 4) Why are the following fields in struct vhost_virtqueue, not struct
> vhost_net?
>
> /* hdr is used to store the virtio header.
> * Since each iovec has >= 1 byte length, we never need more than
> * header length entries to store the header. */
> struct iovec hdr[sizeof(struct virtio_net_hdr_mrg_rxbuf)];
> size_t vhost_hlen;
> size_t sock_hlen;
> struct vring_used_elem *heads;
>
> ...
Because these are used for both tx and rx. While we handle both from
the same thread at the moment, so we could reuse same fields, there are
some patches by Anthony that process tx and rx in separate threads and
sometimes it's a win. So I don't want to close the door on that.
>
> /* vhost zerocopy support fields below: */
> /* last used idx for outstanding DMA zerocopy buffers */
> int upend_idx;
> /* first used idx for DMA done zerocopy buffers */
> int done_idx;
> /* an array of userspace buffers info */
> struct ubuf_info *ubuf_info;
> /* Reference counting for outstanding ubufs.
> * Protected by vq mutex. Writers must also take device mutex. */
> struct vhost_ubuf_ref *ubufs;
Only used from tx. I was trying to keep the option of multiple
tx rings in one vhost, and the option of rx zero copy open, but it's not
a must.
> 5) Why does vhost_net use the sendmsg interface for tun (and macvlan?)?
> It's not usable from userspace (it looks like if you can sendmsg to a
> tun device you can make it call an arbitrary pointer though).
Yes we could switch to some other interface, but only if we
deprecate the packet socket backend and remove it at some point.
> It would be better for us to create the skb, and have a special
> interface to inject it directly. This way we can handle the
> UIO_MAXIOV limit ourselves (eg. by not doing zero copy on such stupid
> cases).
Basically because userspace already configures the backend with existing
tun ioctls. This way we don't need to introduce new interfaces for this,
just virtio stuff for vhost net. It's also good that we share code with
tun, often both vhost and virtio processing can gain from same
optimizations and new features. Happened in the past already.
Basically not many people care about virt. The less virt specific
code we have the better. I lost count of times I had this dialog when I
needed some help
- we see this problem XYZ in host net stack
- oh it's vhost I don't know what it does, sorry
- actually it only builds up iovecs all networking is done by tun that
everyone uses
- ok then let me take a look
> 6) By calling tun_get_socket() and macvtap_get_socket() we are forcing
> those modules to load. We should use symbol_get() to avoid this.
Good idea.
> In short, this code is a mess. Assuming we can't just break the API,
> we've got a long process ahead of us to get it into shape :(
>
> Thanks,
> Rusty.
Zerocopy path is the most messy I think. We should really consider
removing the ordering of zero copy buffers. The rest I don't consider
so messy, and it should be easy to cleanup zerocopy without
breaking the API.
--
MST
* Re: vhost questions.
From: Rusty Russell @ 2013-03-14 6:38 UTC
To: Michael S. Tsirkin; +Cc: virtualization
"Michael S. Tsirkin" <mst@redhat.com> writes:
> On Wed, Mar 13, 2013 at 05:19:51PM +1030, Rusty Russell wrote:
>> OK, I've been trying to read the vhost and vhost net code, and I'm
>> struggling with basic questions.
>>
>> 1) Why do we allow userspace to change an already-set-up vhost device?
>> Why not have:
>> open()
>> ioctl(VHOST_GET_FEATURES)
>> ioctl(VHOST_SET_VRING) x n (sets num, addresses, kick/call/err fds)
>> ioctl(VHOST_NET_SETUP)
>> ...
>> close()
>>
>> You're not allowed to call things twice: to reset, you close and
>> reopen. That would save a lot of code which I'm not sure is correct
>> anyway.
>
> This won't work for the usecases we have in qemu.
> The way things are setup currently, qemu typically does not have
> the rights to open any char devices, instead it is passed an fd.
Ah, because creating a vhost fd is privileged? Why? Because it creates
a kernel thread? Why?
Using a model where a userspace process/thread manages the ring (even if
it spends 99.99% of its time in the kernel) would avoid this. This is
how KVM itself works, it's no longer privileged.
>> 2) Why do we implement write logging? Shouldn't qemu just switch to
>> emulation if it wants to track writes?
>
> We could do it this way. Originally vhost didn't do logging,
> but Anthony felt it's cleaner this way. We argued this way and that, in
> the end I agreed it's better to always process rings in vhost, leave
> userspace alone. For example this allows a simpler userspace that does
> not implement tx/rx ring processing at all.
We don't support things which don't exist.
A simple implementation which supports live migration would have to have
enough understanding to save and restore the net state. With a tun
device (which handles all the header stuff), supporting networking is
*trivial*: see lguest (though it needs MRG_RXBUF support). It's about
100 lines.
> We used to use a workqueue, but Tejun Heo required this switch.
> Basically we want everything run from one thread so we
> can apply cgroup limits to this thread. workqueue does not
> guarantee it - one thread is an implementation detail.
Right.
>> vhost/net.c questions:
>>
>> 1) Why do we wake the net tx worker every time the reference count of
>> the number of outstanding DMAs is a multiple of 16?
>
> Waking every time turns out to be wasteful. Waking only when nothing is
> outstanding turns out to be slow. The comment says it all really:
>
> + * We also trigger polling periodically after each 16 packets
> + * (the value 16 here is more or less arbitrary, it's tuned to trigger
> + * less than 10% of times).
>
> If you have any suggestions on how to improve this,
But this code does *not* wake every 16 packets.
Anyway, what happens when we wake it? I'm assuming it's the net tx
worker, which will just return the used bufs to the guest? Wouldn't it
be better to do that directly? Maybe wake the tx worker if it needs to
wake the guest, if we can't do that here.
>> 3) Why don't we simply use a slab allocator to allocate a structure for
>> outstanding xmit dmas? Wouldn't that be far simpler? Would it be
>> slower?
>
> Which structure for outstanding xmit dmas? ubufs? We could use a cache
> maybe, plain slab is usually pretty slow.
>
> I think what you are really worried about is the upend/done index thing?
> It's not really there because of the array really.
> It's because we try to order the completion of zero copy buffers
> in the same order they are submitted. This was because I was worried about
> triggering fragmentation of the descriptor ring (Avi kept raising this
> concern).
I've never seen any evidence that caching is slowing us down. Many
theories, but mainly our slowdowns are crap in qemu.
And for xmit rings, it's *REALLY DUMB* because there can be no
fragmentation with indirect buffers!
If fragmentation becomes a problem, it's easy to use a bitmap-based
allocator and avoid fragmentation.
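(For example, something as simple as the sketch below; this is not a proposal for the real data structure, just to show that slot reuse without completion ordering is cheap:)

#include <linux/bitmap.h>
#include <linux/uio.h>

/* hypothetical per-vq slot allocator for in-flight zerocopy buffers */
static unsigned long zc_inflight[BITS_TO_LONGS(UIO_MAXIOV)];

static int zc_slot_get(void)
{
	int slot = find_first_zero_bit(zc_inflight, UIO_MAXIOV);

	if (slot >= UIO_MAXIOV)
		return -ENOSPC;	/* all slots busy: fall back to copying */
	__set_bit(slot, zc_inflight);
	return slot;
}

static void zc_slot_put(int slot)
{
	__clear_bit(slot, zc_inflight);	/* completion order doesn't matter */
}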
But most importantly, it's *not* the host's problem to fix! Just keep
the damn latency low.
>> 4) Why are the following fields in struct vhost_virtqueue, not struct
>> vhost_net?
>>
>> /* hdr is used to store the virtio header.
>> * Since each iovec has >= 1 byte length, we never need more than
>> * header length entries to store the header. */
>> struct iovec hdr[sizeof(struct virtio_net_hdr_mrg_rxbuf)];
>> size_t vhost_hlen;
>> size_t sock_hlen;
>> struct vring_used_elem *heads;
>>
>> ...
>
> Because these are used for both tx and rx. While we handle both from
> the same thread at the moment, so we could reuse same fields, there are
> some patches by Anthony that process tx and rx in separate threads and
> sometimes it's a win. So I don't want to close the door on that.
But they're *net* specific: They don't belong in the vhost structure.
And from my reading, hdr[] is written to on the xmit side, but it's not
used. sock_hlen is entirely unused on xmit. heads[] is abused, and
should be its own structure.
>> /* vhost zerocopy support fields below: */
>> /* last used idx for outstanding DMA zerocopy buffers */
>> int upend_idx;
>> /* first used idx for DMA done zerocopy buffers */
>> int done_idx;
>> /* an array of userspace buffers info */
>> struct ubuf_info *ubuf_info;
>> /* Reference counting for outstanding ubufs.
>> * Protected by vq mutex. Writers must also take device mutex. */
>> struct vhost_ubuf_ref *ubufs;
>
> Only used from tx. I was trying to keep the option of multiple
> tx rings in one vhost, and the option of rx zero copy open, but it's not
> a must.
Again, these are all very net specific.
>> 5) Why does vhost_net use the sendmsg interface for tun (and macvlan?)?
>> It's not usable from userspace (it looks like if you can sendmsg to a
>> tun device you can make it call an arbitrary pointer though).
>
> Yes we could switch to some other interface, but only if we
> deprecate the packet socket backend and remove it at some point.
>
>> It would be better for us to create the skb, and have a special
>> interface to inject it directly. This way we can handle the
>> UIO_MAXIOV limit ourselves (eg. by not doing zero copy on such stupid
>> cases).
>
> Basically because userspace already configures the backend with existing
> tun ioctls. This way we don't need to introduce new interfaces for this,
> just virtio stuff for vhost net. It's also good that we share code with
> tun, often both vhost and virtio processing can gain from same
> optimizations and new features. Happened in the past already.
Sorry, I wasn't clear here. I mean instead of doing:
err = sock->ops->recvmsg(NULL, sock, &msg,
sock_len, MSG_DONTWAIT | MSG_TRUNC);
We do something like:
err = tun_inject_skb(sock, skb);
Because the zerocopy_sg_from_iovec() in tun.c already has a comment
about putting it in the network core... in fact, it's cut & pasted into
macvtap :(
The msg thing for the callback is a nasty hack, AFAICT...
> Basically not many people care about virt. The less virt specific
> code we have the better. I lost count of times I had this dialog when I
> needed some help
> - we see this problem XYZ in host net stack
> - oh it's vhost I don't know what it does, sorry
> - actually it only builds up iovecs all networking is done by tun that
> everyone uses
> - ok then let me take a look
>
>> 6) By calling tun_get_socket() and macvtap_get_socket() we are forcing
>> those modules to load. We should use symbol_get() to avoid this.
>
> Good idea.
I can create this patch.
>> In short, this code is a mess. Assuming we can't just break the API,
>> we've got a long process ahead of us to get it into shape :(
>>
>> Thanks,
>> Rusty.
>
> Zerocopy path is the most messy I think. We should really consider
> removing the ordering of zero copy buffers. The rest I don't consider
> so messy, and it should be easy to cleanup zerocopy without
> breaking the API.
I've got some cleanup patches, but I'll keep going and send them to you
for review...
Thanks,
Rusty
* Re: vhost questions.
From: Michael S. Tsirkin @ 2013-03-14 9:33 UTC
To: Rusty Russell; +Cc: virtualization
On Thu, Mar 14, 2013 at 05:08:22PM +1030, Rusty Russell wrote:
> "Michael S. Tsirkin" <mst@redhat.com> writes:
> > On Wed, Mar 13, 2013 at 05:19:51PM +1030, Rusty Russell wrote:
> >> OK, I've been trying to read the vhost and vhost net code, and I'm
> >> struggling with basic questions.
> >>
> >> 1) Why do we allow userspace to change an already-set-up vhost device?
> >> Why not have:
> >> open()
> >> ioctl(VHOST_GET_FEATURES)
> >> ioctl(VHOST_SET_VRING) x n (sets num, addresses, kick/call/err fds)
> >> ioctl(VHOST_NET_SETUP)
> >> ...
> >> close()
> >>
> >> You're not allowed to call things twice: to reset, you close and
> >> reopen. That would save a lot of code which I'm not sure is correct
> >> anyway.
> >
> > This won't work for the usecases we have in qemu.
> > The way things are setup currently, qemu typically does not have
> > the rights to open any char devices, instead it is passed an fd.
>
> Ah, because creating a vhost fd is privileged? Why? Because it creates
> a kernel thread? Why?
>
> Using a model where a userspace process/thread manages the ring (even if
> it spends 99.99% of its time in the kernel) would avoid this. This is
> how KVM itself works, it's no longer privileged.
KVM kind of has to have a thread per VCPU, no other options.
Here I don't think we know what the best threading model is.
We used to have a per-CPU thread, switched to a thread per vq *pair*, and
there are patches to go back to per-CPU threads, they seem to improve
scalability though there are still some open issues wrt cgroups.
Using a kernel thread means we can control this without touching userspace.
> >> 2) Why do we implement write logging? Shouldn't qemu just switch to
> >> emulation if it wants to track writes?
> >
> > We could do it this way. Originally vhost didn't do logging,
> > but Anthony felt it's cleaner this way. We argued this way and that, in
> > the end I agreed it's better to always process rings in vhost, leave
> > userspace alone. For example this allows a simpler userspace that does
> > not implement tx/rx ring processing at all.
>
> We don't support things which don't exist.
>
> A simple implementation which supports live migration would have to have
> enough understanding to save and restore the net state. With a tun
> device (which handles all the header stuff), supporting networking is
> *trivial*: see lguest (though it needs MRG_RXBUF support). It's about
> 100 lines.
You assume networking is stopped while we migrate but this doesn't work
as migration might take a long time. So you have to have the full
device, processing rings and sending/receiving packets all in userspace.
> > We used to use a workqueue, but Tejun Heo required this switch.
> > Basically we want everything run from one thread so we
> > can apply cgroup limits to this thread. workqueue does not
> > guarantee it - one thread is an implementation detail.
>
> Right.
>
> >> vhost/net.c questions:
> >>
> >> 1) Why do we wake the net tx worker every time the reference count of
> >> the number of outstanding DMAs is a multiple of 16?
> >
> > Waking every time turns out to be wasteful. Waking only when nothing is
> > outstanding turns out to be slow. The comment says it all really:
> >
> > + * We also trigger polling periodically after each 16 packets
> > + * (the value 16 here is more or less arbitrary, it's tuned to trigger
> > + * less than 10% of times).
> >
> > If you have any suggestions on how to improve this,
>
> But this code does *not* wake every 16 packets.
>
> Anyway, what happens when we wake it? I'm assuming it's the net tx
> worker, which will just return the used bufs to the guest? Wouldn't it
> be better to do that directly? Maybe wake the tx worker if it needs to
> wake the guest, if we can't do that here.
Sorry I don't get what you mean here.
> >> 3) Why don't we simply use a slab allocator to allocate a structure for
> >> outstanding xmit dmas? Wouldn't that be far simpler? Would it be
> >> slower?
> >
> > Which structure for outstanding xmit dmas? ubufs? We could use a cache
> > maybe, plain slab is usually pretty slow.
> >
> > I think what you are really worried about is the upend/done index thing?
> > It's not really there because of the array really.
> > It's because we try to order the completion of zero copy buffers
> > in the same order they are submitted. This was because I was worried about
> > triggering fragmentation of the descriptor ring (Avi kept raising this
> > concern).
>
> I've never seen any evidence that caching is slowing us down. Many
> theories, but mainly our slowdowns are crap in qemu.
>
> And for xmit rings, it's *REALLY DUMB* because there can be no
> fragmentation with indirect buffers!
>
> If fragmentation becomes a problem, it's easy to use a bitmap-based
> allocator and avoid fragmentation.
>
> But most importantly, it's *not* the host's problem to fix! Just keep
> the damn latency low.
OK, let's try to rip it out and see what happens.
> >> 4) Why are the following fields in struct vhost_virtqueue, not struct
> >> vhost_net?
> >>
> >> /* hdr is used to store the virtio header.
> >> * Since each iovec has >= 1 byte length, we never need more than
> >> * header length entries to store the header. */
> >> struct iovec hdr[sizeof(struct virtio_net_hdr_mrg_rxbuf)];
> >> size_t vhost_hlen;
> >> size_t sock_hlen;
> >> struct vring_used_elem *heads;
> >>
> >> ...
> >
> > Because these are used for both tx and rx. While we handle both from
> > the same thread at the moment, so we could reuse same fields, there are
> > some patches by Anthony that process tx and rx in separate threads and
> > sometimes it's a win. So I don't want to close the door on that.
>
> But they're *net* specific: They don't belong in the vhost structure.
Ah, I see what you mean. Yes, we really need vhost_net_virtqueue,
we didn't have one and I let some net specific stuff seep through.
We should do something like
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index ec6fb3f..3c89d51 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -70,9 +70,13 @@ enum vhost_net_poll_state {
 	VHOST_NET_POLL_STOPPED = 2,
 };
 
+struct vhost_net_virtqueue {
+	struct vhost_virtqueue vq;
+};
+
 struct vhost_net {
 	struct vhost_dev dev;
-	struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
+	struct vhost_net_virtqueue vqs[VHOST_NET_VQ_MAX];
 	struct vhost_poll poll[VHOST_NET_VQ_MAX];
 	/* Tells us whether we are polling a socket for TX.
 	 * We only do this when socket buffer fills up.
@@ -612,17 +616,21 @@ static int vhost_net_open(struct inode *inode, struct file *f)
 {
 	struct vhost_net *n = kmalloc(sizeof *n, GFP_KERNEL);
 	struct vhost_dev *dev;
+	struct vhost_virtqueue **vqs = kmalloc(VHOST_NET_VQ_MAX * sizeof *vqs, GFP_KERNEL);
 	int r;
 
 	if (!n)
 		return -ENOMEM;
 	dev = &n->dev;
-	n->vqs[VHOST_NET_VQ_TX].handle_kick = handle_tx_kick;
-	n->vqs[VHOST_NET_VQ_RX].handle_kick = handle_rx_kick;
-	r = vhost_dev_init(dev, n->vqs, VHOST_NET_VQ_MAX);
+	vqs[VHOST_NET_VQ_TX] = &n->vqs[VHOST_NET_VQ_TX].vq;
+	vqs[VHOST_NET_VQ_RX] = &n->vqs[VHOST_NET_VQ_RX].vq;
+	n->vqs[VHOST_NET_VQ_TX].vq.handle_kick = handle_tx_kick;
+	n->vqs[VHOST_NET_VQ_RX].vq.handle_kick = handle_rx_kick;
+	r = vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX);
 	if (r < 0) {
 		kfree(n);
+		kfree(vqs);
 		return r;
 	}
And then we can move net specific stuff to vhost_net_virtqueue.
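For concreteness, the end state would presumably look something like this, with the net-only fields from question 4 pulled out of struct vhost_virtqueue (a sketch of the direction, not the final layout):

struct vhost_net_virtqueue {
	struct vhost_virtqueue vq;
	/* hdr is used to store the virtio header. */
	struct iovec hdr[sizeof(struct virtio_net_hdr_mrg_rxbuf)];
	size_t vhost_hlen;
	size_t sock_hlen;
	struct vring_used_elem *heads;
	/* zerocopy support, tx only */
	int upend_idx;
	int done_idx;
	struct ubuf_info *ubuf_info;
	struct vhost_ubuf_ref *ubufs;
};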
> And from my reading, hdr[] is written to on the xmit side, but it's not
> used. sock_hlen is entirely unused on xmit. heads[] is abused, and
> should be its own structure.
True.
> >> /* vhost zerocopy support fields below: */
> >> /* last used idx for outstanding DMA zerocopy buffers */
> >> int upend_idx;
> >> /* first used idx for DMA done zerocopy buffers */
> >> int done_idx;
> >> /* an array of userspace buffers info */
> >> struct ubuf_info *ubuf_info;
> >> /* Reference counting for outstanding ubufs.
> >> * Protected by vq mutex. Writers must also take device mutex. */
> >> struct vhost_ubuf_ref *ubufs;
> >
> > Only used from tx. I was trying to keep the option of multiple
> > tx rings in one vhost, and the option of rx zero copy open, but it's not
> > a must.
>
> Again, these are all very net specific.
>
> >> 5) Why does vhost_net use the sendmsg interface for tun (and macvlan?)?
> >> It's not usable from userspace (it looks like if you can sendmsg to a
> >> tun device you can make it call an arbitrary pointer though).
> >
> > Yes we could switch to some other interface, but only if we
> > deprecate the packet socket backend and remove it at some point.
> >
> >> It would be better for us to create the skb, and have a special
> >> interface to inject it directly. This way we can handle the
> >> UIO_MAXIOV limit ourselves (eg. by not doing zero copy on such stupid
> >> cases).
> >
> > Basically because userspace already configures the backend with existing
> > tun ioctls. This way we don't need to introduce new interfaces for this,
> > just virtio stuff for vhost net. It's also good that we share code with
> > tun, often both vhost and virtio processing can gain from same
> > optimizations and new features. Happened in the past already.
>
> Sorry, I wasn't clear here. I mean instead of doing:
>
> err = sock->ops->recvmsg(NULL, sock, &msg,
> sock_len, MSG_DONTWAIT | MSG_TRUNC);
>
> We do something like:
>
> err = tun_inject_skb(sock, skb);
>
> Because the zerocopy_sg_from_iovec() in tun.c already has a comment
> about putting it in the network core... in fact, it's cut & pasted into
> macvtap :(
Right. There's lots of code duplication in tun, macvtap and af_packet,
we could factor code out, and then use it everywhere.
> The msg thing for the callback is a nasty hack, AFAICT...
Yes it's not pretty.
> > Basically not many people care about virt. The less virt specific
> > code we have the better. I lost count of times I had this dialog when I
> > needed some help
> > - we see this problem XYZ in host net stack
> > - oh it's vhost I don't know what it does, sorry
> > - actually it only builds up iovecs all networking is done by tun that
> > everyone uses
> > - ok then let me take a look
> >
> >> 6) By calling tun_get_socket() and macvtap_get_socket() we are forcing
> >> those modules to load. We should use symbol_get() to avoid this.
> >
> > Good idea.
>
> I can create this patch.
>
> >> In short, this code is a mess. Assuming we can't just break the API,
> >> we've got a long process ahead of us to get it into shape :(
> >>
> >> Thanks,
> >> Rusty.
> >
> > Zerocopy path is the most messy I think. We should really consider
> > removing the ordering of zero copy buffers. The rest I don't consider
> > so messy, and it should be easy to cleanup zerocopy without
> > breaking the API.
>
> I've got some cleanup patches, but I'll keep going and send them to you
> for review...
>
> Thanks,
> Rusty
Sure, we have lots going on with vhost scsi ATM, but as long as changes
are done in small steps I'll figure out conflicts if any without much
trouble.
--
MST
* Re: vhost questions.
From: Michael S. Tsirkin @ 2013-03-17 11:09 UTC
To: Rusty Russell; +Cc: virtualization
On Thu, Mar 14, 2013 at 05:08:22PM +1030, Rusty Russell wrote:
> "Michael S. Tsirkin" <mst@redhat.com> writes:
> > On Wed, Mar 13, 2013 at 05:19:51PM +1030, Rusty Russell wrote:
> >> OK, I've been trying to read the vhost and vhost net code, and I'm
> >> struggling with basic questions.
> >>
> >> 1) Why do we allow userspace to change an already-set-up vhost device?
> >> Why not have:
> >> open()
> >> ioctl(VHOST_GET_FEATURES)
> >> ioctl(VHOST_SET_VRING) x n (sets num, addresses, kick/call/err fds)
> >> ioctl(VHOST_NET_SETUP)
> >> ...
> >> close()
> >>
> >> You're not allowed to call things twice: to reset, you close and
> >> reopen. That would save a lot of code which I'm not sure is correct
> >> anyway.
> >
> > This won't work for the usecases we have in qemu.
> > The way things are setup currently, qemu typically does not have
> > the rights to open any char devices, instead it is passed an fd.
By the way at least for the guest callback eventfd, we currently use the
ability to switch it on the fly for when guest wants to mask the interrupt.
At some level, this could be considered an argument that initialization
order is a policy best left to userspace.
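(Concretely, that is just userspace re-issuing the same ioctl with a different fd on a live device; mask_fd and irq_fd below stand for whatever eventfds qemu uses for the masked notifier and the real irqfd:)

	/* guest masked or unmasked the queue's interrupt: repoint the call
	 * eventfd without tearing anything else down */
	struct vhost_vring_file call = {
		.index = vq_index,
		.fd = masked ? mask_fd : irq_fd,
	};
	ioctl(vhost_fd, VHOST_SET_VRING_CALL, &call);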
--
MST
* Re: vhost questions.
From: Rusty Russell @ 2013-03-18 0:50 UTC
To: Michael S. Tsirkin; +Cc: virtualization
"Michael S. Tsirkin" <mst@redhat.com> writes:
> On Thu, Mar 14, 2013 at 05:08:22PM +1030, Rusty Russell wrote:
>> "Michael S. Tsirkin" <mst@redhat.com> writes:
>> > On Wed, Mar 13, 2013 at 05:19:51PM +1030, Rusty Russell wrote:
>> >> OK, I've been trying to read the vhost and vhost net code, and I'm
>> >> struggling with basic questions.
>> >>
>> >> 1) Why do we allow userspace to change an already-set-up vhost device?
>> >> Why not have:
>> >> open()
>> >> ioctl(VHOST_GET_FEATURES)
>> >> ioctl(VHOST_SET_VRING) x n (sets num, addresses, kick/call/err fds)
>> >> ioctl(VHOST_NET_SETUP)
>> >> ...
>> >> close()
>> >>
>> >> You're not allowed to call things twice: to reset, you close and
>> >> reopen. That would save a lot of code which I'm not sure is correct
>> >> anyway.
>> >
>> > This won't work for the usecases we have in qemu.
>> > The way things are setup currently, qemu typically does not have
>> > the rights to open any char devices, instead it is passed an fd.
>>
>> Ah, because creating a vhost fd is privileged? Why? Because it creates
>> a kernel thread? Why?
>>
>> Using a model where a userspace process/thread manages the ring (even if
>> it spends 99.99% of its time in the kernel) would avoid this. This is
>> how KVM itself works, it's no longer privileged.
>
> KVM kind of has to have a thread per VCPU, no other options.
Not at all! It could spin off a helper kernel thread to do the work and
come up with some IPC. But there's a damn good reason that it didn't,
nor did lguest.
That same reasoning applies here, too. The fact that you create a
kernel thread then give it the same context and cgroup etc should be the
clue!
> Here I don't think we know what the best threading model is.
> We used to have a per-CPU thread, switched to a thread per vq *pair*, and
> there are patches to go back to per-CPU threads, they seem to improve
> scalability though there are still some open issues wrt cgroups.
> Using a kernel thread means we can control this without touching userspace.
But kernelspace is the wrong place to make these decisions.
Let's try to imagine what it'd be like if we didn't do it this way, say
a /dev/vhost-net2 device. You use an ioctl to set up the memory
mapping, the (tun) fd, the location of a xmit ring and a recv ring (you
specify one or both... we could have separate xmit and recv devices, but
that seems a bit overkill).
Now:
1) Simplest case, you want a dumb implementation with no extra threads
in your code. Using select(), if tun fd is readable and vhost-net2 fd is
writable, call ioctl(VHOST_NET_RECV), which processes as many packets
as it can. If tun fd is writable and vhost-net2 fd is readable, call
ioctl(VHOST_NET_XMIT).
2) You want an I/O thread in your code? That's easy, too. You might
skip the writable fd checks, and just let the ioctl() block.
3) Want separate I/O threads? Separate /dev/vhost-net2 fds: one
only sets the xmit ring, one only sets the recv ring. They just
call ioctl() over and over.
4) Multiqueue comes along? It only makes sense with separate threads
per queue, so open /dev/vhost-net2 and tunfd once for each thread.
Do your per-cpu binding here.
Now, there may be issues if we actually coded this, but it *seems* sane.
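To make (1) above concrete, the dumb single-threaded backend could be little more than the loop below. Everything in it, the device node and both ioctls, is hypothetical, taken straight from the sketch above; the ioctl numbers are placeholders.

#include <sys/ioctl.h>
#include <sys/select.h>
#include <linux/ioctl.h>

/* hypothetical ioctls on the proposed /dev/vhost-net2 */
#define VHOST_NET_RECV	_IO(0xAF, 0x40)	/* placeholder number */
#define VHOST_NET_XMIT	_IO(0xAF, 0x41)	/* placeholder number */

static void serve(int tun_fd, int vhost_fd)
{
	int nfds = (tun_fd > vhost_fd ? tun_fd : vhost_fd) + 1;

	for (;;) {
		fd_set rfds, wfds;

		FD_ZERO(&rfds); FD_ZERO(&wfds);
		FD_SET(tun_fd, &rfds);   FD_SET(tun_fd, &wfds);
		FD_SET(vhost_fd, &rfds); FD_SET(vhost_fd, &wfds);
		if (select(nfds, &rfds, &wfds, NULL, NULL) < 0)
			break;

		/* packets arrived from the wire and rx ring work is pending */
		if (FD_ISSET(tun_fd, &rfds) && FD_ISSET(vhost_fd, &wfds))
			ioctl(vhost_fd, VHOST_NET_RECV);
		/* guest queued tx buffers and tun can take more packets */
		if (FD_ISSET(vhost_fd, &rfds) && FD_ISSET(tun_fd, &wfds))
			ioctl(vhost_fd, VHOST_NET_XMIT);
	}
}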
>> A simple implementation which supports live migration would have to have
>> enough understanding to save and restore the net state. With a tun
>> device (which handles all the header stuff), supporting networking is
>> *trivial*: see lguest (though it needs MRG_RXBUF support). It's about
>> 100 lines.
>
> You assume networking is stopped while we migrate but this doesn't work
> as migration might take a long time. So you have to have the full
> device, processing rings and sending/receiving packets all in userspace.
No I don't: 100 lines of code *is* a full networking device.
>> > + * We also trigger polling periodically after each 16 packets
>> > + * (the value 16 here is more or less arbitrary, it's tuned to trigger
>> > + * less than 10% of times).
>> >
>> > If you have any suggestions on how to improve this,
>>
>> But this code does *not* wake every 16 packets.
It wakes when the reference count hits 16, 32, etc. But that reference
count will go up and down as packets are processed: in theory, it might
go 0, 1, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3... ?
>> Anyway, what happens when we wake it? I'm assuming it's the net tx
>> worker, which will just return the used bufs to the guest? Wouldn't it
>> be better to do that directly? Maybe wake the tx worker if it needs to
>> wake the guest, if we can't do that here.
>
> Sorry I don't get what you mean here.
Let's check I've got this straight:
Packet is fully xmitted:
-> skb_release_data -> vhost_zerocopy_callback -> vhost_poll_queue()
Wakes up tx thread -> handle_tx_kick -> handle_tx
-> vhost_zerocopy_signal_used -> vhost_add_used_and_signal()
But what if we kmapped the ring so we could access it in
vhost_zerocopy_callback? We could skip the wakeup altogether, right?
And it happens that I just wrote vringh accessors that work on
kernelspace, so it should be easier than it was before... :)
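Very roughly, something like the sketch below. This is an entirely hypothetical restructuring: it assumes a vringh has been set up over a kernel mapping of the used ring, that ubuf->desc carries the descriptor head, and it ignores the locking, flush and interrupt-suppression questions for now.

static void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool success)
{
	struct vhost_ubuf_ref *ubufs = ubuf->ctx;
	struct vhost_net_virtqueue *nvq = ubufs->nvq;	/* assumed field */

	/* push the used entry right here instead of waking handle_tx() */
	if (!vringh_complete_kern(&nvq->vrh, ubuf->desc, 0) &&
	    nvq->vq.call_ctx)
		eventfd_signal(nvq->vq.call_ctx, 1);	/* notify the guest */
	vhost_ubuf_put(ubufs);
}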
>> But most importantly, it's *not* the host's problem to fix! Just keep
>> the damn latency low.
>
> OK, let's try to rip it out and see what happens.
Thanks! That'll make things much simpler for me to understand :)
>> >> 4) Why are the following fields in struct vhost_virtqueue, not struct
>> >> vhost_net?
>> >>
>> >> /* hdr is used to store the virtio header.
>> >> * Since each iovec has >= 1 byte length, we never need more than
>> >> * header length entries to store the header. */
>> >> struct iovec hdr[sizeof(struct virtio_net_hdr_mrg_rxbuf)];
>> >> size_t vhost_hlen;
>> >> size_t sock_hlen;
>> >> struct vring_used_elem *heads;
>> >>
>> >> ...
>> >
>> > Because these are used for both tx and rx. While we handle both from
>> > the same thread at the moment, so we could reuse same fields, there are
>> > some patches by Anthony that process tx and rx in separate threads and
>> > sometimes it's a win. So I don't want to close the door on that.
>>
>> But they're *net* specific: They don't belong in the vhost structure.
>
> Ah, I see what you mean. Yes, we really need vhost_net_virtqueue,
> we didn't have one and I let some net specific stuff seep through.
> We should do something like
Thanks. I'll put that in my tree and start playing with it.
>> We do something like:
>>
>> err = tun_inject_skb(sock, skb);
>>
>> Because the zerocopy_sg_from_iovec() in tun.c already has a comment
>> about putting it in the network core... in fact, it's cut & pasted into
>> macvtap :(
>
> Right. There's lots of code duplication in tun, macvtap and af_packet,
> we could factor code out, and then use it everywhere.
>
>> The msg thing for the callback is a nasty hack, AFAICT...
>
> Yes it's not pretty.
OK, I'll see what I can do.
> Sure, we have lots going on with vhost scsi ATM, but as long as changes
> are done in small steps I'll figure out conflicts if any without much
> trouble.
Yep, I think it can be done. I've done some patches so far, and the
impact on tcm_vhost has been minimal...
Thanks,
Rusty.
* Re: vhost questions.
From: Michael S. Tsirkin @ 2013-03-18 8:06 UTC
To: Rusty Russell; +Cc: virtualization
On Mon, Mar 18, 2013 at 11:20:41AM +1030, Rusty Russell wrote:
> "Michael S. Tsirkin" <mst@redhat.com> writes:
> > On Thu, Mar 14, 2013 at 05:08:22PM +1030, Rusty Russell wrote:
> >> "Michael S. Tsirkin" <mst@redhat.com> writes:
> >> > On Wed, Mar 13, 2013 at 05:19:51PM +1030, Rusty Russell wrote:
> >> >> OK, I've been trying to read the vhost and vhost net code, and I'm
> >> >> struggling with basic questions.
> >> >>
> >> >> 1) Why do we allow userspace to change an already-set-up vhost device?
> >> >> Why not have:
> >> >> open()
> >> >> ioctl(VHOST_GET_FEATURES)
> >> >> ioctl(VHOST_SET_VRING) x n (sets num, addresses, kick/call/err fds)
> >> >> ioctl(VHOST_NET_SETUP)
> >> >> ...
> >> >> close()
> >> >>
> >> >> You're not allowed to call things twice: to reset, you close and
> >> >> reopen. That would save a lot of code which I'm not sure is correct
> >> >> anyway.
> >> >
> >> > This won't work for the usecases we have in qemu.
> >> > The way things are setup currently, qemu typically does not have
> >> > the rights to open any char devices, instead it is passed an fd.
> >>
> >> Ah, because creating a vhost fd is privileged? Why? Because it creates
> >> a kernel thread? Why?
> >>
> >> Using a model where a userspace process/thread manages the ring (even if
> >> it spends 99.99% of its time in the kernel) would avoid this. This is
> >> how KVM itself works, it's no longer privileged.
> >
> > KVM kind of has to have a thread per VCPU, no other options.
>
> Not at all! It could spin off a helper kernel thread to do the work and
> come up with some IPC. But there's a damn good reason that it didn't,
> nor did lguest.
>
> That same reasoning applies here, too. The fact that you create a
> kernel thread then give it the same context and cgroup etc should be the
> clue!
>
> > Here I don't think we know what the best threading model is.
> > We used to have a per-CPU thread, switched to a thread per vq *pair*, and
> > there are patches to go back to per-CPU threads, they seem to improve
> > scalability though there are still some open issues wrt cgroups.
> > Using a kernel thread means we can control this without touching userspace.
>
> But kernelspace is the wrong place to make these decisions.
>
> Let's try to imagine what it'd be like if we didn't do it this way, say
> a /dev/vhost-net2 device. You use an ioctl to set up the memory
> mapping, the (tun) fd, the location of a xmit ring and a recv ring (you
> specify one or both... we could have separate xmit and recv devices, but
> that seems a bit overkill).
>
> Now:
> 1) Simplest case, you want a dumb implementation with no extra threads
> in your code. Using select(), if tun fd is readable and vhost-net2 fd is
> writable, call ioctl(VHOST_NET_RECV), which processes as many packets
> as it can. If tun fd is writable and vhost-net2 fd is readable, call
> ioctl(VHOST_NET_XMIT).
>
> 2) You want an I/O thread in your code? That's easy, too. You might
> skip the writable fd checks, and just let the ioctl() block.
>
> 3) Want separate I/O threads? Separate /dev/vhost-net2 fds: one
> only sets the xmit ring, one only sets the recv ring. They just
> call ioctl() over and over.
>
> 4) Multiqueue comes along? It only makes sense with separate threads
> per queue, so open /dev/vhost-net2 and tunfd once for each thread.
> Do your per-cpu binding here.
>
> Now, there may be issues if we actually coded this, but it *seems* sane.
Doing a per-cpu thread (without multiqueue) may be harder though. Or at
least I don't see a nice way to do it. This is what we started with, and
the interface to allow it stayed around to avoid breaking userspace.
> >> A simple implementation which supports live migration would have to have
> >> enough understanding to save and restore the net state. With a tun
> >> device (which handles all the header stuff), supporting networking is
> >> *trivial*: see lguest (though it needs MRG_RXBUF support). It's about
> >> 100 lines.
> >
> > You assume networking is stopped while we migrate but this doesn't work
> > as migration might take a long time. So you have to have the full
> > device, processing rings and sending/receiving packets all in userspace.
>
> No I don't: 100 lines of code *is* a full networking device.
>
> >> > + * We also trigger polling periodically after each 16 packets
> >> > + * (the value 16 here is more or less arbitrary, it's tuned to trigger
> >> > + * less than 10% of times).
> >> >
> >> > If you have any suggestions on how to improve this,
> >>
> >> But this code does *not* wake every 16 packets.
>
> It wakes when the reference count hits 16, 32, etc. But that reference
> count will go up and down as packets are processed: in theory, it might
> go 0, 1, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3... ?
I see. True. But it's just an optimization, cnt <= 2 makes sure there's
no deadlock.
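For reference, the wake-up condition in vhost_zerocopy_callback() is roughly this shape (paraphrased, not a verbatim copy of the source):

	int cnt = atomic_read(&ubufs->kref.refcount);

	/* wake the tx worker when the guest has (almost) stopped submitting
	 * (cnt <= 2, so we can't get stuck) or on every 16th in-flight
	 * buffer; a count bouncing around a multiple of 16 can retrigger */
	if (cnt <= 2 || !(cnt % 16))
		vhost_poll_queue(&vq->poll);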
> >> Anyway, what happens when we wake it? I'm assuming it's the net tx
> >> worker, which will just return the used bufs to the guest? Wouldn't it
> >> be better to do that directly? Maybe wake the tx worker if it needs to
> >> wake the guest, if we can't do that here.
> >
> > Sorry I don't get what you mean here.
>
> Let's check I've got this straight:
>
> Packet is fully xmitted:
> -> skb_release_data -> vhost_zerocopy_callback -> vhost_poll_queue()
>
> Wakes up tx thread -> handle_tx_kick -> handle_tx
> -> vhost_zerocopy_signal_used -> vhost_add_used_and_signal()
Exactly.
> But what if we kmapped the ring so we could access it in
> vhost_zerocopy_callback? We could skip the wakeup altogether, right?
>
> And it happens that I just wrote vringh accessors that work on
> kernelspace, so it should be easier than it was before... :)
>
Hmm this should be better for a latency test where we wake up every time,
but could be worse for a bandwidth test.
One thing making it a bit harder is the need to implement flush for when
memory mappings change. I guess we could make the tx_flush flag atomic.
> >> But most importantly, it's *not* the host's problem to fix! Just keep
> >> the damn latency low.
> >
> > OK, let's try to rip it out and see what happens.
>
> Thanks! That'll make things much simpler for me to understand :)
>
> >> >> 4) Why are the following fields in struct vhost_virtqueue, not struct
> >> >> vhost_net?
> >> >>
> >> >> /* hdr is used to store the virtio header.
> >> >> * Since each iovec has >= 1 byte length, we never need more than
> >> >> * header length entries to store the header. */
> >> >> struct iovec hdr[sizeof(struct virtio_net_hdr_mrg_rxbuf)];
> >> >> size_t vhost_hlen;
> >> >> size_t sock_hlen;
> >> >> struct vring_used_elem *heads;
> >> >>
> >> >> ...
> >> >
> >> > Because these are used for both tx and rx. While we handle both from
> >> > the same thread at the moment, so we could reuse same fields, there are
> >> > some patches by Anthony that process tx and rx in separate threads and
> >> > sometimes it's a win. So I don't want to close the door on that.
> >>
> >> But they're *net* specific: They don't belong in the vhost structure.
> >
> > Ah, I see what you mean. Yes, we really need vhost_net_virtqueue,
> > we didn't have one and I let some net specific stuff seep through.
> > We should do something like
>
> Thanks. I'll put that in my tree and start playing with it.
Hmm no need to keep it out of upstream I think.
> >> We do something like:
> >>
> >> err = tun_inject_skb(sock, skb);
> >>
> >> Because the zerocopy_sg_from_iovec() in tun.c already has a comment
> >> about putting it in the network core... in fact, it's cut & pasted into
> >> macvtap :(
> >
> > Right. There's lots of code duplication in tun, macvtap and af_packet,
> > we could factor code out, and then use it everywhere.
> >
> >> The msg thing for the callback is a nasty hack, AFAICT...
> >
> > Yes it's not pretty.
>
> OK, I'll see what I can do.
>
> > Sure, we have lots going on with vhost scsi ATM, but as long as changes
> > are done in small steps I'll figure out conflicts if any without much
> > trouble.
>
> Yep, I think it can be done. I've done some patches so far, and the
> impact on tcm_vhost has been minimal...
>
> Thanks,
> Rusty.