From: Wei Wang <wei.w.wang@intel.com>
To: "Marc-André Lureau" <marcandre.lureau@gmail.com>
Cc: virtio-comment@lists.oasis-open.org, qemu-devel@nongnu.org,
	mst@redhat.com, stefanha@redhat.com, pbonzini@redhat.com
Subject: Re: [Qemu-devel] [virtio-comment] Re: [PATCH v1] docs/vhost-user: extend the vhost-user protocol to support the vhost-pci based inter-vm communication
Date: Wed, 09 Nov 2016 16:32:16 +0800
Message-ID: <5822DF10.2030807@intel.com>
In-Reply-To: <CAJ+F1CJPLtV3WB5JT61H0zfuun1tn=ALt6QzEPjcUU4G=CxXmA@mail.gmail.com>

On 11/08/2016 08:17 PM, Marc-André Lureau wrote:
> >
>
>     >      Message Specification
>     >      ---------------------
>     >
>     >      Note that all numbers are in the machine native byte order. A
>     >     vhost-user message
>     >     -consists of 3 header fields and a payload:
>     >     +consists of 4 header fields and a payload:
>     >
>     >     -------------------------------------
>     >     -| request | flags | size | payload |
>     >     -------------------------------------
>     >     +----------------------------------------------
>     >     +| request | flags | conn_id | size | payload |
>     >     +----------------------------------------------
>     >
>     >       * Request: 32-bit type of the request
>     >       * Flags: 32-bit bit field:
>     >         - Lower 2 bits are the version (currently 0x01)
>     >     -   - Bit 2 is the reply flag - needs to be sent on each reply
>     >     from the slave
>     >     +   - Bit 2 is the reply flag - needs to be sent on each reply
>     >         - Bit 3 is the need_reply flag - see
>     >     VHOST_USER_PROTOCOL_F_REPLY_ACK for
>     >           details.
>     >     + * Conn_id: 64-bit connection id to identify a client socket
>     >     connection. It is
>     >     +            introduced in version 0x02 to support the
>     >     "1-server-N-client" model
>     >     +            and an asynchronous client read implementation. The
>     >     connection id,
>     >     +            0xFFFFFFFFFFFFFFFF, is used by an anonymous client
>     >     (e.g. a client who
>     >     +            has not got its connection id from the server
>     in the
>     >     initial talk)
>     >
>     >
>     > I don't understand why you need a connection id, on each message.
>     > What's the purpose? Since the communication is unicast, a single
>     > message should be enough.
>
>     Sure, please let me explain more:
>     The QEMU socket is going to be upgraded to support 1 server socket
>     being
>     connected by multiple client sockets (I've made patches to achieve
>     this). In other words, here, multiple masters will connect to one
>     slave,
>     and the slave creates a vhost-pci device for each master after
>     receiving
>     the necessary message info. The slave needs to know which master it is
>     talking to when receiving a message, as it maintains multiple
>     connections at the same time.
>
>
> You should be able to identify each connection in the slave (as a 
> socket server), without a need for connection id: connected sockets 
> are independent from each others.


Yes, that's doable. But why couldn't we do it at the protocol layer? I 
think that would be easier.

Please check below my thoughts about the implementation if we do it in 
the slave:

The interface for receiving a msg is tcp_chr_read(QIOChannel *chan, 
GIOCondition cond, void *opaque).

QIOChannel is what we can use to identify the master connection that 
sent the msg (the socket server now has an array of QIOChannels, 
ioc[MAX_CLIENTS]). Every time a msg is received, tcp_chr_read() needs 
to compare chan against the ioc[] array to find the id (the index into 
ioc[]), and then pass that id to qemu_chr_be_write(), and all the way 
down to the final slave handler where the msg is parsed and handled. 
This requires modifications to the existing APIs; for example, the 
mentioned qemu_chr_be_write() will need one more parameter, "id". This 
will not be compatible with the existing implementation, because all 
other callers of qemu_chr_be_write() will need to be patched to use the 
new qemu_chr_be_write(, "id", ).
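The lookup described above could be sketched as follows. This is a minimal, self-contained sketch: the QIOChannel struct here is only a stand-in for the real opaque QEMU type, and conn_index() is a hypothetical helper, not an existing API.

```c
#include <stddef.h>

#define MAX_CLIENTS 8

/* Stand-in for QEMU's opaque QIOChannel type (hypothetical for this sketch) */
typedef struct QIOChannel {
    int fd;
} QIOChannel;

/*
 * Return the index of chan within the server's ioc[] array, i.e. the
 * connection id that would have to be threaded down through
 * qemu_chr_be_write() to the slave handler. Returns -1 if not found.
 */
static int conn_index(QIOChannel *const ioc[], size_t n, const QIOChannel *chan)
{
    for (size_t i = 0; i < n; i++) {
        if (ioc[i] == chan) {
            return (int)i;
        }
    }
    return -1;
}
```

Doing this lookup per message is cheap for a handful of clients, but it is exactly the API churn described above: every caller of qemu_chr_be_write() would need the extra id parameter.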



>     >       * Size - 32-bit size of the payload
>     >
>     >
>     >     @@ -97,6 +106,13 @@ Depending on the request type, payload
>     can be:
>     >         log offset: offset from start of supplied file descriptor
>     >             where logging starts (i.e. where guest address 0
>     would be
>     >     logged)
>     >
>     >     +* Device info
>     >     +   --------------------
>     >     +   | virtio id | uuid |
>     >     +   --------------------
>     >     +   Virtio id: 16-bit virtio id of the device
>     >     +   UUID: 128-bit UUID to identify the QEMU instance that
>     creates
>     >     the device
>     >     +
>     >
>     >
>     > I wonder if UUID should be a different message.
>     >
>     We can make uuid another message if it has other usages.
>     Do you see any other usages of uuid?
>
>
> Allows to associate data/configuration with a particular VM, in a 
> multi-master/single-slave scenario. But tbh, I don't see how this is 
> necessary, I can imagine solving this differently (having different 
> connection address per vm for ex).


Using connection addresses, how could you know if the two connections 
are from the same VM?


> I would like to understand your use case.


Here is an example of the use case:
VM1 has two master connections (X and Y) and VM2 has one master 
connection (Z).
X, Y, and Z each have a connection id, but X and Y send the same uuid, 
uuid1, to the slave, while Z sends uuid2. This way the slave knows that 
X and Y are two connections from the same VM, and Z is a connection 
from a different VM.

For connection Y, the vhost-pci device can then be created in a way that 
does not need the driver to map the memory, since it has already been 
mapped for connection X from the same VM.
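A minimal sketch of how the slave could use the uuid for this. The Conn struct and uuid_already_mapped() are hypothetical names for illustration, not part of the proposed protocol:

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical per-connection record kept by the slave */
typedef struct Conn {
    unsigned char uuid[16]; /* 128-bit UUID of the QEMU instance (VM) */
    bool mapped;            /* whether this connection's VM memory is mapped */
} Conn;

/*
 * When a new connection announces its uuid, check whether an earlier
 * connection from the same VM has already mapped that VM's memory.
 * If so, the new vhost-pci device can skip the mapping step.
 */
static bool uuid_already_mapped(const Conn *conns, int n,
                                const unsigned char uuid[16])
{
    for (int i = 0; i < n; i++) {
        if (conns[i].mapped && memcmp(conns[i].uuid, uuid, 16) == 0) {
            return true;
        }
    }
    return false;
}
```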


>
>
>     >      [ Also see the section on REPLY_ACK protocol extension. ]
>     >
>     >     +Currently, the communication also supports the Slave (server)
>     >     sending messages
>     >     +to the Master (client). Here is a list of them:
>     >     + * VHOST_USER_SET_FEATURES
>     >
>     >     + * VHOST_USER_SET_PEER_CONNECTION (the server may actively
>     request
>     >     to disconnect
>     >     +   with the client)
>     >
>     >
>     > Oh, you are making the communication bidirectional? This is a
>     > fundamental change in the protocol. This may be difficult to
>     implement
>     > in qemu, since the communication in synchronous, a request
>     expects an
>     > immediate reply, if it gets back a request (from the slave) in the
>     > middle, it will fail.
>     >
>
>     Not really.
>     Adding the above two doesn't affect the existing synchronous read()
>     messages (basically, those VHOST_USER_GET_xx messages). Like
>     VHOST_USER_SET_FEATURES, the _SET_ messages don't need a reply.
>     Here, we
>     just make the slave capable of actively sending messages to the
>     master.
>
>
> Yes, that's the trouble. At any time the Master may send a request and 
> expects an immediate reply. There is a race of getting a request from 
> the Slave in the middle with your proposed change. I'd rather avoid 
> making the request bidirectionnal if possible. (I proposed a second 
> channel for Slave->Master request in the past: 
> https://lists.gnu.org/archive/html/qemu-devel/2016-04/msg00095.html)


If the message that the master reads back has a different "request" 
field value (i.e. it is a slave-initiated request rather than the 
expected reply), it simply drops it and reads again. The implementation 
is not complex either; please see the example change to 
vhost_user_get_u64() below:

    if (vhost_user_write(dev, &msg_w, NULL, 0) < 0) {
        return -1;
    }
retry:
    if (vhost_user_read(dev, &msg_r) < 0) {
        return -1;
    }
    if (msg_r.request != msg_w.request) {
        goto retry;
    }


On the other side, the slave's request to the master is dropped due to 
the race. This race can be solved at the protocol layer - let the _SET_ 
request ask for an ACK, and if no ACK is received, resend it. Also, this 
kind of race should be very rare in real usage.
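The resend-until-ACK idea can be illustrated with a self-contained sketch. Here try_send() merely simulates a channel that loses the first few requests to the race; it is not a real vhost-user API:

```c
#include <stdbool.h>

/*
 * Hypothetical channel model: drops_left counts how many _SET_ requests
 * will still be lost to the race before one gets through and is ACKed.
 */
typedef struct Channel {
    int drops_left;
} Channel;

static bool try_send(Channel *ch)
{
    if (ch->drops_left > 0) {
        ch->drops_left--;
        return false;  /* no ACK received: the request was dropped */
    }
    return true;       /* ACK received */
}

/* Resend the request until an ACK arrives, bounded to avoid spinning forever.
 * Returns the number of attempts used, or -1 if it gave up. */
static int send_with_ack(Channel *ch, int max_tries)
{
    for (int i = 0; i < max_tries; i++) {
        if (try_send(ch)) {
            return i + 1;
        }
    }
    return -1;
}
```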


>
>     >
>     > +This request should be sent only when
>     VHOST_USER_PROTOCOL_F_VHOST_PCI
>     > has...
>     >
>     >     +* VHOST_USER_SET_DEV_INFO
>     >     +
>     >     +      Id: 21
>     >     +      Equivalent ioctl: N/A
>     >     +      Master payload: dev info
>     >     +
>     >     +      The client sends the producer device info to the server.
>     >
>     >
>     > "Master sends producer device info to the Slave" works, no?
>
>     Yes, it works. The current dev info only contains a "virtio id" field
>     (assume we'll take uuid out as a separate message), which tells the
>     slave if it is a net, scsi, console or else. do you see any issue?
>     >
>     > Could we guarantee this message is sent before SET_VRING*?
>
>     Why do we need to guarantee this?
>
>
> It would simplify the protocol to have expectations on when messages 
> come. In particular, an early message with devinfo would allow to 
> check/pre-configure the Slave for a particular device. Also 
> VHOST_USER_SET_DEV_INFO should probably be unique (don't allow a 
> device to be reconfigured)


Yes, it is sent at an early stage of the vhost-user protocol 
interaction. It is implemented to be sent right after the 
VHOST_USER_SET_PROTOCOL_FEATURES msg. On the slave side, when it 
receives SET_DEV_INFO, it pre-configures the device in a table entry (as 
mentioned before, a device will be created from the table entry at a 
later stage of the protocol interaction).

I think this should be implementation logic, like 
VHOST_USER_SET_PROTOCOL_FEATURES. Why do we need to add a guarantee to 
the protocol that specifies the order?


>
>
>     >
>     >     + This request should be sent only when
>     >     VHOST_USER_PROTOCOL_F_VHOST_PCI has
>     >     +      been negotiated.
>     >     +
>     >
>     >
>     > I think this message could be useful for other purposes than
>     > vhost-pci, thus I would give it its own flag.
>
>     Could you please give an example of other usage? Thanks.
>
>
> You could have a Slave that implements various devices, and pick the 
> corresponding one dynamically (we already have implementations for 
> net/input/gpu/scsi...)


If I understand the example correctly, the various devices still belong 
to the vhost-pci series - in the future we would have vhost-pci-net, 
vhost-pci-scsi, vhost-pci-gpu, etc. If that's the case, we may still use 
the VHOST_PCI flag.


>
>     >
>     >     +* VHOST_USER_SET_PEER_CONNECTION
>     >     +
>     >     +      Id: 22
>     >     +      Equivalent ioctl: N/A
>     >     +      Master payload: u64
>     >     +
>     >     +      The producer device requests to connect or disconnect to
>     >     the consumer device.
>     >
>     >
>     > producer->Master, consummer->Slave
>     >
>     > How does it interact with SET_VRING_ENABLE?
>
>     It's independent of SET_VRING_ENABLE:
>     SET_VRING_ENABLE enables a virtq to be in "active".
>     SET_PEER_CONNECTION enables the peer (slave or master) device to be in
>     "active". The driver shouldn't send packets if the device is inactive.
>
>
> I fail to see the difference with SET_VRING_ENABLE, perhaps someone 
> more familiar with the protocol could help here.


I'm not sure if another email explaining this was sent out successfully, 
so I repost the explanation here:

The SET_PEER_CONNECTION msg is used to turn the (slave or master) 
device's connection status "ON/OFF". For example, when the master side 
VM wants to shut down, the virtio-net driver sets the virtio-net 
device's PEER_CONNECTION status to "OFF" - before this happens, the 
virtio-net device needs to sync up with the vhost-pci-net device first, 
that is, send a VHOST_USER_SET_PEER_CONNECTION(cmd=OFF) msg to the 
slave. In return (not as a synchronous reply, because it has to sync 
with the driver to stop using the slave side resources first), the 
vhost-pci-net device sends a VHOST_USER_SET_PEER_CONNECTION(cmd=OFF) msg 
to the master - this sets the virtio-net device's PEER_CONNECTION status 
to "OFF", and then the virtio driver is ready to unload. (The same 
applies when the vhost-pci-net driver unloads.)

SET_VRING_ENABLE controls the virtq status - the slave should not use a 
virtq that has not been enabled by the master. For example, a device may 
have 4 virtqs; if vq[0].enabled == 0, then the slave should not use 
virtq 0.
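So in a sketch, the slave's check before touching a ring would combine both states. The Dev struct and can_use_virtq() are hypothetical names used only to illustrate the distinction between the two messages:

```c
#include <stdbool.h>

#define NUM_VIRTQS 4

/* Hypothetical device state as seen by the slave */
typedef struct Dev {
    bool peer_connected;          /* set via SET_PEER_CONNECTION (device level) */
    bool vq_enabled[NUM_VIRTQS];  /* set via SET_VRING_ENABLE (per virtq) */
} Dev;

/*
 * The slave may use virtq i only if the peer device is connected
 * AND that particular virtq has been enabled by the master.
 */
static bool can_use_virtq(const Dev *d, int i)
{
    return d->peer_connected && d->vq_enabled[i];
}
```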


>     >
>     >     + The consumer device may request to disconnect to the producer
>     >     device. This
>     >     +      request should be sent only when
>     >     VHOST_USER_PROTOCOL_F_VHOST_PCI has been
>     >     +      negotiated.
>     >     +      Connection request: If the reply message indicates
>     >     "success", the vhost-pci based
>     >     +      inter-VM communication channel has been established.
>     >     +      Disconnection request: If the reply message indicates
>     >     "success", the vhost-pci based
>     >     +      inter-VM communication channel has been destroyed.
>     >     +      #define VHOST_USER_SET_PEER_CONNECTION_F_OFF 0
>     >     +      #define VHOST_USER_SET_PEER_CONNECTION_F_ON 1
>     >     +
>     >
>     I think it would be better to add one more command here:
>     #define VHOST_USER_SET_PEER_CONNECTION_F_INIT 2
>
>     The master uses this command to tell the slave it's ready to
>     create the
>     vhost-pci device. Regarding the implementation, it is put at the
>     bottom
>     of vhost_net_start() function (when all the vring info have been sent
>     and enabled).
>
>
> Do you have WIP branch for qemu vhost-pci? That could help to 
> understand the context.


Yes, I can share them.


Best,
Wei

Thread overview: 11+ messages
2016-10-24  7:10 [Qemu-devel] [PATCH v1] docs/vhost-user: extend the vhost-user protocol to support the vhost-pci based inter-vm communication Wei Wang
2016-10-30 18:02 ` Michael S. Tsirkin
2016-11-07  5:43 ` Wang, Wei W
2016-11-08  7:47 ` Marc-André Lureau
2016-11-08 11:35   ` [Qemu-devel] [virtio-comment] " Wei Wang
2016-11-08 11:59     ` Wang, Wei W
2016-11-08 12:17     ` Marc-André Lureau
2016-11-09  8:32       ` Wei Wang [this message]
2016-11-10 12:26         ` Marc-André Lureau
2016-11-11  8:24           ` Wei Wang
2016-11-11 17:12       ` Michael S. Tsirkin
