From: "Michael S. Tsirkin" <mst@redhat.com>
To: Antonios Motakis <a.motakis@virtualopensystems.com>
Cc: Luke Gorrie <lukego@gmail.com>,
snabb-devel@googlegroups.com,
Nikolay Nikolaev <n.nikolaev@virtualopensystems.com>,
qemu-devel qemu-devel <qemu-devel@nongnu.org>,
VirtualOpenSystems Technical Team <tech@virtualopensystems.com>
Subject: Re: [Qemu-devel] [PATCH v6 0/8] Vhost and vhost-net support for userspace based backends
Date: Mon, 27 Jan 2014 18:49:08 +0200 [thread overview]
Message-ID: <20140127164908.GA13260@redhat.com> (raw)
In-Reply-To: <CAG8rG2w_OsCBm7Eo1M_s6pROrt+F9AWjPQSrrW1sf5C7Bz3uoA@mail.gmail.com>
On Mon, Jan 27, 2014 at 05:37:02PM +0100, Antonios Motakis wrote:
> Hello again,
>
>
> On Wed, Jan 15, 2014 at 3:49 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Wed, Jan 15, 2014 at 01:50:47PM +0100, Antonios Motakis wrote:
> > >
> > >
> > >
> > > On Wed, Jan 15, 2014 at 10:07 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > >
> > > On Tue, Jan 14, 2014 at 07:13:43PM +0100, Antonios Motakis wrote:
> > > >
> > > >
> > > >
> > > > On Tue, Jan 14, 2014 at 12:33 PM, Michael S. Tsirkin <mst@redhat.com>
> > > wrote:
> > > >
> > > > On Mon, Jan 13, 2014 at 03:25:11PM +0100, Antonios Motakis wrote:
> > > > > In this patch series we would like to introduce our approach for
> > > putting
> > > > a
> > > > > virtio-net backend in an external userspace process. Our eventual
> > > target
> > > > is to
> > > > > run the network backend in the Snabbswitch ethernet switch, while
> > > > receiving
> > > > > traffic from a guest inside QEMU/KVM which runs an unmodified
> > > virtio-net
> > > > > implementation.
> > > > >
> > > > > For this, we are working into extending vhost to allow equivalent
> > > > functionality
> > > > > for userspace. Vhost already passes control of the data plane of
> > > > virtio-net to
> > > > > the host kernel; we want to realize a similar model, but for
> > > userspace.
> > > > >
> > > > > In this patch series the concept of a vhost-backend is introduced.
> > > > >
> > > > > We define two vhost backend types - vhost-kernel and vhost-user.
> > > The
> > > > former is
> > > > > the interface to the current kernel module implementation. Its
> > > control
> > > > plane is
> > > > > ioctl based. The data plane is the kernel directly accessing the
> > > QEMU
> > > > allocated,
> > > > > guest memory.
> > > > >
> > > > > In the new vhost-user backend, the control plane is based on
> > > > communication
> > > > > between QEMU and another userspace process using a unix domain
> > > socket.
> > > > This
> > > > > allows to implement a virtio backend for a guest running in QEMU,
> > > inside
> > > > the
> > > > > other userspace process.
> > > > >
> > > > > We change -mem-path to QemuOpts and add prealloc, share and unlink
> > > as
> > > > properties
> > > > > to it. HugeTLBFS requirements of -mem-path are relaxed, so any
> > > valid path
> > > > can
> > > > > be used now. The new properties allow more fine grained control
> > > over the
> > > > guest
> > > > > RAM backing store.
> > > > >
> > > > > The data path is realized by directly accessing the vrings and the
> > > buffer
> > > > data
> > > > > off the guest's memory.
> > > > >
> > > > > The current user of vhost-user is only vhost-net. We add new netdev
> > > > backend
> > > > > that is intended to initialize vhost-net with vhost-user backend.
> > > >
> > > > Some meta comments.
> > > >
> > > > Something that makes this patch harder to review is how it's
> > > > split up. Generally IMHO it's not a good idea to repeatedly
> > > > edit same part of file adding stuff in patch after patch,
> > > > it's only making things harder to read if you add stubs, then fill
> > > them up.
> > > > (we do this sometimes when we are changing existing code, but
> > > > it is generally not needed when adding new code)
> > > >
> > > > Instead, split it like this:
> > > >
> > > > 1. general refactoring, split out linux specific and generic parts
> > > > and add the ops indirection
> > > > 2. add new files for vhost-user with complete implementation.
> > > > without command line to support it, there will be no way to use
> > > it,
> > > > but should build fine.
> > > > 3. tie it all up with option parsing
> > > >
> > > >
> > > > Generic vhost and vhost net files should be kept separate.
> > > > Don't let vhost net stuff seep back into generic files,
> > > > we have vhost-scsi too.
> > > > I would also prefer that userspace vhost has its own files.
> > > >
> > > >
> > > > Ok, we'll keep this into account.
> > > >
> > > >
> > > >
> > > > We need a small test server qemu can talk to, to verify things
> > > > actually work.
> > > >
> > > >
> > > > We have implemented such a test app: https://github.com/
> > > virtualopensystems/vapp
> > > >
> > > > We use it for testing, and also as a reference implementation. A client
> > > is also
> > > > included.
> > > >
> > >
> > > Sounds good. Can we include this in qemu and tie
> > > it into the qtest framework?
> > > >From a brief look, it merely needs to be tweaked for portability,
> > > unless
> > >
> > > >
> > > > Already commented on: reuse the chardev syntax and preferably code.
> > > > We already support a bunch of options there for
> > > > domain sockets that will be useful here, they should
> > > > work here as well.
> > > >
> > > >
> > > > We adapted the syntax for this to be consistent with chardev. What we
> > > didn't
> > > > use, it is not obvious at all to us on how they should be used; a lot of
> > > the
> > > > chardev options just don't apply to us.
> > > >
> > >
> > > Well server option should work at least.
> > > nowait can work too?
> > >
> > > Also, if reconnect is useful it should be for chardevs too, so if we don't
> > > share code, need to code it in two places to stay consistent.
> > >
> > > Overall sharing some code might be better ...
> > >
> > >
> > >
> > > What you have in mind is to use the functions chardev uses from qemu-sockets.c
> > > right? Chardev itself doesn't look to have anything else that can be shared.
> >
> > Yes.
> >
> > > The problem with reconnect is that it is implemented at the protocol level; we
> > > are not just transparently reconnecting the socket. So the same approach would
> > > most likely not apply for chardev.
> >
> > Chardev mostly just could use transparent reconnect.
> > vhost-user could use that and get a callback to reconfigure
> > everything after reconnect.
> >
> > Once you write up the protocol in some text file we can
> > discuss this in more detail.
> > For example I wonder how would feature negotiation work
> > with reconnect: new connection could be from another
> > application that does not support same features, but
> > virtio assumes that device features never change.
> >
>
> I attach the text document that we will include in the next version of
> the series, which describes the vhost-user protocol.
>
> The protocol is based on and very close to the vhost kernel protocol.
> Of note is the VHOST_USER_ECHO message, which is the only one that
> doesn't have an equivalent ioctl in the kernel version of vhost; this
> is the message that is being used to detect that the remote party is
> not on the socket anymore. At that point QEMU will close the session
> and try to initiate a new one on the same socket.
What if e.g. features change in between?
Everything just goes south, doesn't it?
Is this detection and reconnect a must for your project?
I think it would be simpler to
- generalize char unix socket handling code and reuse for vhost-user
- as a separate step, add live detection and reconnect abilities
to the generic code
> > >
> > >
> > >
> > > > In particular you shouldn't require filesystem access by qemu,
> > > > passing fd for domain socket should work.
> > > >
> > > >
> > > > We can add an option to pass an fd for the domain socket if needed.
> > > However as
> > > > far as we understand, chardev doesn't do that either (at least form
> > > looking at
> > > > the man page). Maybe we misunderstand what you mean.
> > >
> > > Sorry. I got confused with e.g. tap which has this. This might be
> > > useful but does not have to block this patch.
> > >
> > > >
> > > >
> > > > > Example usage:
> > > > >
> > > > > qemu -m 1024 -mem-path /hugetlbfs,prealloc=on,share=on \
> > > > > -netdev type=vhost-user,id=net0,path=/path/to/sock,poll_time=
> > > 2500 \
> > > > > -device virtio-net-pci,netdev=net0
> > > >
> > > > It's not clear which parts of -mem-path are required for vhost-user.
> > > > It should be documented somewhere, made clear in -help
> > > > and should fail gracefully if misconfigured.
> > > >
> > > >
> > > >
> > > > Ok.
> > > >
> > > >
> > > >
> > > > >
> > > > > Changes from v5:
> > > > > - Split -mem-path unlink option to a separate patch
> > > > > - Fds are passed only in the ancillary data
> > > > > - Stricter message size checks on receive/send
> > > > > - Netdev vhost-user now includes path and poll_time options
> > > > > - The connection probing interval is configurable
> > > > >
> > > > > Changes from v4:
> > > > > - Use error_report for errors
> > > > > - VhostUserMsg has new field `size` indicating the following
> > > payload
> > > > length.
> > > > > Field `flags` now has version and reply bits. The structure is
> > > packed.
> > > > > - Send data is of variable length (`size` field in message)
> > > > > - Receive in 2 steps, header and payload
> > > > > - Add new message type VHOST_USER_ECHO, to check connection status
> > > > >
> > > > > Changes from v3:
> > > > > - Convert -mem-path to QemuOpts with prealloc, share and unlink
> > > > properties
> > > > > - Set 1 sec timeout when read/write to the unix domain socket
> > > > > - Fix file descriptor leak
> > > > >
> > > > > Changes from v2:
> > > > > - Reconnect when the backend disappears
> > > > >
> > > > > Changes from v1:
> > > > > - Implementation of vhost-user netdev backend
> > > > > - Code improvements
> > > > >
> > > > > Antonios Motakis (8):
> > > > > Convert -mem-path to QemuOpts and add prealloc and share
> > > properties
> > > > > New -mem-path option - unlink.
> > > > > Decouple vhost from kernel interface
> > > > > Add vhost-user skeleton
> > > > > Add domain socket communication for vhost-user backend
> > > > > Add vhost-user calls implementation
> > > > > Add new vhost-user netdev backend
> > > > > Add vhost-user reconnection
> > > > >
> > > > > exec.c | 57 +++-
> > > > > hmp-commands.hx | 4 +-
> > > > > hw/net/vhost_net.c | 144 +++++++---
> > > > > hw/net/virtio-net.c | 42 ++-
> > > > > hw/scsi/vhost-scsi.c | 13 +-
> > > > > hw/virtio/Makefile.objs | 2 +-
> > > > > hw/virtio/vhost-backend.c | 556
> > > > ++++++++++++++++++++++++++++++++++++++
> > > > > hw/virtio/vhost.c | 46 ++--
> > > > > include/exec/cpu-all.h | 3 -
> > > > > include/hw/virtio/vhost-backend.h | 40 +++
> > > > > include/hw/virtio/vhost.h | 4 +-
> > > > > include/net/vhost-user.h | 17 ++
> > > > > include/net/vhost_net.h | 15 +-
> > > > > net/Makefile.objs | 2 +-
> > > > > net/clients.h | 3 +
> > > > > net/hub.c | 1 +
> > > > > net/net.c | 2 +
> > > > > net/tap.c | 16 +-
> > > > > net/vhost-user.c | 177 ++++++++++++
> > > > > qapi-schema.json | 21 +-
> > > > > qemu-options.hx | 24 +-
> > > > > vl.c | 41 ++-
> > > > > 22 files changed, 1106 insertions(+), 124 deletions(-)
> > > > > create mode 100644 hw/virtio/vhost-backend.c
> > > > > create mode 100644 include/hw/virtio/vhost-backend.h
> > > > > create mode 100644 include/net/vhost-user.h
> > > > > create mode 100644 net/vhost-user.c
> > > > >
> > > > > --
> > > > > 1.8.3.2
> > > > >
> > > >
> > > >
> > >
> > >
> Vhost-user Protocol
> ===================
>
> This protocol is aiming to complement the ioctl interface used to control the
> vhost implementation in the Linux kernel. It implements the control plane needed
> to establish virtqueue sharing with a user space process on the same host. It
> uses communication over a Unix domain socket to share file descriptors in the
> ancillary data of the message.
>
> The protocol defines 2 sides of the communication, master and slave. Master is
> the application that shares it's virtqueues, in our case QEMU. Slave is the
> consumer of the virtqueues.
>
> In the current implementation QEMU is the Master, and the Slave is intended to
> be a software ethernet switch running in user space, such as Snabbswitch.
>
> Master and slave can be either a client (i.e. connecting) or server (listening)
> in the socket communication.
>
> Message Specification
> ---------------------
>
> Note that all numbers are in the machine native byte order. A vhost-user message
> consists of 3 header fields and a payload:
>
> ------------------------------------
> | request | flags | size | payload |
> ------------------------------------
>
> * Request: 32-bit type of the request
> * Flags: 32-bit bit field:
> - Lower 2 bits are the version (currently 0x01)
> - Bit 2 is the reply flag - needs to be sent on each reply from the slave
> * Size - 32-bit size of the payload
>
>
> Depending on the request type, payload can be:
>
> * A single 64-bit integer
> -------
> | u64 |
> -------
>
> u64: a 64-bit unsigned integer
>
> * A vring state description
> ---------------
> | index | num |
> ---------------
>
> Index: a 32-bit index
> Num: a 32-bit number
>
> * A vring address description
> --------------------------------------------------------------
> | index | flags | size | descriptor | used | available | log |
> --------------------------------------------------------------
>
> Index: a 32-bit vring index
> Flags: a 32-bit vring flags
> Descriptor: a 64-bit user address of the vring descriptor table
> Used: a 64-bit user address of the vring used ring
> Available: a 64-bit user address of the vring available ring
> Log: a 64-bit guest address for logging
>
> * Memory regions description
> ---------------------------------------------------
> | num regions | padding | region0 | ... | region7 |
> ---------------------------------------------------
>
> Num regions: a 32-bit number of regions
> Padding: 32-bit
>
> A region is:
> ---------------------------------------
> | guest address | size | user address |
> ---------------------------------------
>
> Guest address: a 64-bit guest address of the region
> Size: a 64-bit size
> User address: a 64-bit user address
>
>
> In QEMU the vhost-user message is implemented with the following struct:
>
> typedef struct VhostUserMsg {
> VhostUserRequest request;
> uint32_t flags;
> uint32_t size;
> union {
> uint64_t u64;
> struct vhost_vring_state state;
> struct vhost_vring_addr addr;
> VhostUserMemory memory;
> };
> } QEMU_PACKED VhostUserMsg;
>
> Communication
> -------------
>
> The protocol for vhost-user is based on the existing implementation of vhost
> for the Linux Kernel. Most messages that can be send via the Unix domain socket
> implementing vhost-user have an equivalent ioctl to the kernel implementation.
>
> The communication consists of master sending message requests and slave sending
> message replies. Most of the requests don't require replies. Here is a list of
> the ones that do:
>
> * VHOST_USER_ECHO
> * VHOST_GET_FEATURES
> * VHOST_GET_VRING_BASE
>
> There are several messages that the master sends with file descriptors passed
> in the ancillary data:
>
> * VHOST_SET_MEM_TABLE
> * VHOST_SET_LOG_FD
> * VHOST_SET_VRING_KICK
> * VHOST_SET_VRING_CALL
> * VHOST_SET_VRING_ERR
>
> If Master is unable to send the full message or receives a wrong reply it will
> close the connection. An optional reconnection mechanism can be implemented.
>
> Message types
> -------------
>
> * VHOST_USER_ECHO
>
> Id: 1
> Equivalent ioctl: N/A
> Master payload: N/A
>
> ECHO request that is used to periodically probe the connection. When
> received by the slave, it is expected that he will send back an ECHO
> packet with the REPLY flag set.
>
> * VHOST_USER_GET_FEATURES
>
> Id: 2
> Equivalent ioctl: VHOST_GET_FEATURES
> Master payload: N/A
> Slave payload: u64
>
> Get from the underlying vhost implementation the features bitmask.
>
> * VHOST_USER_SET_FEATURES
>
> Id: 3
> Ioctl: VHOST_SET_FEATURES
> Master payload: u64
>
> Enable features in the underlying vhost implementation using a bitmask.
>
> * VHOST_USER_SET_OWNER
>
> Id: 4
> Equivalent ioctl: VHOST_SET_OWNER
> Master payload: N/A
>
> Issued when a new connection is established. It sets the current Master
> as an owner of the session. This can be used on the Slave as a
> "session start" flag.
>
> * VHOST_USER_RESET_OWNER
>
> Id: 5
> Equivalent ioctl: VHOST_RESET_OWNER
> Master payload: N/A
>
> Issued when a new connection is about to be closed. The Master will no
> longer own this connection (and will usually close it).
>
> * VHOST_USER_SET_MEM_TABLE
>
> Id: 6
> Equivalent ioctl: VHOST_SET_MEM_TABLE
> Master payload: memory regions description
>
> Sets the memory map regions on the slave so it can translate the vring
> addresses. In the ancillary data there is an array of file descriptors
> for each memory mapped region. The size and ordering of the fds matches
> the number and ordering of memory regions.
>
> * VHOST_USER_SET_LOG_BASE
>
> Id: 7
> Equivalent ioctl: VHOST_SET_LOG_BASE
> Master payload: u64
>
> Sets the logging base address.
>
> * VHOST_USER_SET_LOG_FD
>
> Id: 8
> Equivalent ioctl: VHOST_SET_LOG_FD
> Master payload: N/A
>
> Sets the logging file descriptor, which is passed as ancillary data.
>
> * VHOST_USER_SET_VRING_NUM
>
> Id: 9
> Equivalent ioctl: VHOST_SET_VRING_NUM
> Master payload: vring state description
>
> Sets the number of vrings for this owner.
>
> * VHOST_USER_SET_VRING_ADDR
>
> Id: 10
> Equivalent ioctl: VHOST_SET_VRING_ADDR
> Master payload: vring address description
> Slave payload: N/A
>
> Sets the addresses of the different aspects of the vring.
>
> * VHOST_USER_SET_VRING_BASE
>
> Id: 11
> Equivalent ioctl: VHOST_SET_VRING_BASE
> Master payload: vring state description
>
> Sets the base address where the available descriptors are.
>
> * VHOST_USER_GET_VRING_BASE
>
> Id: 12
> Equivalent ioctl: VHOST_USER_GET_VRING_BASE
> Master payload: vring state description
> Slave payload: vring state description
>
> Get the vring base address.
>
> * VHOST_USER_SET_VRING_KICK
>
> Id: 13
> Equivalent ioctl: VHOST_SET_VRING_KICK
> Master payload: N/A
>
> Set the event file descriptor for adding buffers to the vring. It
> is passed in the ancillary data.
>
> * VHOST_USER_SET_VRING_CALL
>
> Id: 14
> Equivalent ioctl: VHOST_SET_VRING_CALL
> Master payload: N/A
>
> Set the event file descriptor to signal when buffers are used. It
> is passed in the ancillary data.
>
> * VHOST_USER_SET_VRING_ERR
>
> Id: 15
> Equivalent ioctl: VHOST_SET_VRING_ERR
> Master payload: N/A
>
> Set the event file descriptor to signal when error occurs. It
> is passed in the ancillary data.
next prev parent reply other threads:[~2014-01-27 16:44 UTC|newest]
Thread overview: 35+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-01-13 14:25 [Qemu-devel] [PATCH v6 0/8] Vhost and vhost-net support for userspace based backends Antonios Motakis
2014-01-13 14:25 ` [Qemu-devel] [PATCH v6 1/8] Convert -mem-path to QemuOpts and add prealloc and share properties Antonios Motakis
2014-01-13 14:25 ` [Qemu-devel] [PATCH v6 2/8] New -mem-path option - unlink Antonios Motakis
2014-01-14 11:16 ` Michael S. Tsirkin
2014-01-14 18:13 ` Antonios Motakis
2014-01-13 14:25 ` [Qemu-devel] [PATCH v6 3/8] Decouple vhost from kernel interface Antonios Motakis
2014-01-13 14:25 ` [Qemu-devel] [PATCH v6 4/8] Add vhost-user skeleton Antonios Motakis
2014-01-14 11:17 ` Michael S. Tsirkin
2014-01-13 14:25 ` [Qemu-devel] [PATCH v6 5/8] Add domain socket communication for vhost-user backend Antonios Motakis
2014-01-14 11:10 ` Michael S. Tsirkin
2014-01-14 18:14 ` Antonios Motakis
2014-01-15 9:13 ` Michael S. Tsirkin
2014-01-13 14:25 ` [Qemu-devel] [PATCH v6 6/8] Add vhost-user calls implementation Antonios Motakis
2014-01-14 11:21 ` Michael S. Tsirkin
2014-01-14 18:14 ` Antonios Motakis
2014-01-15 9:14 ` Michael S. Tsirkin
2014-01-15 10:08 ` Michael S. Tsirkin
2014-01-13 14:25 ` [Qemu-devel] [PATCH v6 7/8] Add new vhost-user netdev backend Antonios Motakis
2014-01-13 14:25 ` [Qemu-devel] [PATCH v6 8/8] Add vhost-user reconnection Antonios Motakis
2014-01-14 11:14 ` [Qemu-devel] [PATCH v6 0/8] Vhost and vhost-net support for userspace based backends Michael S. Tsirkin
2014-01-14 18:13 ` Antonios Motakis
2014-01-15 8:54 ` Michael S. Tsirkin
2014-01-14 11:33 ` Michael S. Tsirkin
2014-01-14 18:13 ` Antonios Motakis
2014-01-15 9:07 ` Michael S. Tsirkin
2014-01-15 12:50 ` Antonios Motakis
2014-01-15 14:49 ` Michael S. Tsirkin
2014-01-16 12:35 ` Antonios Motakis
2014-01-27 16:37 ` Antonios Motakis
2014-01-27 16:49 ` Michael S. Tsirkin [this message]
2014-01-29 12:04 ` Antonios Motakis
2014-01-29 14:10 ` Michael S. Tsirkin
2014-01-21 13:32 ` Antonios Motakis
2014-01-15 10:05 ` Michael S. Tsirkin
2014-01-21 13:32 ` Antonios Motakis
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20140127164908.GA13260@redhat.com \
--to=mst@redhat.com \
--cc=a.motakis@virtualopensystems.com \
--cc=lukego@gmail.com \
--cc=n.nikolaev@virtualopensystems.com \
--cc=qemu-devel@nongnu.org \
--cc=snabb-devel@googlegroups.com \
--cc=tech@virtualopensystems.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).