From: Xiao Guangrong <guangrong.xiao@linux.intel.com>
To: "Wang, Wei W" <wei.w.wang@intel.com>,
"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
"virtio-comment@lists.oasis-open.org"
<virtio-comment@lists.oasis-open.org>,
"mst@redhat.com" <mst@redhat.com>,
"stefanha@redhat.com" <stefanha@redhat.com>,
"pbonzini@redhat.com" <pbonzini@redhat.com>
Subject: Re: [Qemu-devel] [PATCH 4/6 Resend] Vhost-pci RFC: Detailed Description in the Virtio Specification Format
Date: Thu, 2 Jun 2016 11:52:22 +0800 [thread overview]
Message-ID: <574FAD76.6030601@linux.intel.com> (raw)
In-Reply-To: <286AC319A985734F985F78AFA26841F7C78F0A@shsmsx102.ccr.corp.intel.com>
On 06/02/2016 11:15 AM, Wang, Wei W wrote:
> On Wed 6/1/2016 4:15 PM, Xiao Guangrong wrote:
>> On 05/29/2016 04:11 PM, Wei Wang wrote:
>>> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
>>> ---
>>> Details | 324 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> 1 file changed, 324 insertions(+)
>>> create mode 100644 Details
>>>
>>> diff --git a/Details b/Details
>>> new file mode 100644
>>> index 0000000..4ea2252
>>> --- /dev/null
>>> +++ b/Details
>>> @@ -0,0 +1,324 @@
>>> +1 Device ID
>>> +TBD
>>> +
>>> +2 Virtqueues
>>> +0 controlq
>>> +
>>> +3 Feature Bits
>>> +3.1 Local Feature Bits
>>> +Currently no local feature bits are defined, so the standard virtio
>>> +feature bits negotiation will always be successful and complete.
>>> +
>>> +3.2 Remote Feature Bits
>>> +The remote feature bits are obtained from the frontend virtio device
>>> +and negotiated with the vhost-pci driver via the controlq. The
>>> +negotiation steps are described in 4.5 Device Initialization.
>>> +
>>> +4 Device Configuration Layout
>>> +struct vhost_pci_config {
>>> + #define VHOST_PCI_CONTROLQ_MEMORY_INFO_ACK 0
>>> + #define VHOST_PCI_CONTROLQ_DEVICE_INFO_ACK 1
>>> + #define VHOST_PCI_CONTROLQ_FEATURE_BITS_ACK 2
>>> + u32 ack_type;
>>> + u32 ack_device_type;
>>> + u64 ack_device_id;
>>> + union {
>>> + #define VHOST_PCI_CONTROLQ_ACK_ADD_DONE 0
>>> + #define VHOST_PCI_CONTROLQ_ACK_ADD_FAIL 1
>>> + #define VHOST_PCI_CONTROLQ_ACK_DEL_DONE 2
>>> + #define VHOST_PCI_CONTROLQ_ACK_DEL_FAIL 3
>>> + u64 ack_memory_info;
>>> + u64 ack_device_info;
>>> + u64 ack_feature_bits;
>>> + };
>>> +};
>>
>> Do you need to write all these 4 fields to ack the operation? It seems it is not
>> efficient, and it is not flexible if the driver needs to offer more data to the device
>> in the future. Can we dedicate a vq for this purpose?
>
> Yes, all 4 fields need to be written. The vhost-pci server usually connects to multiple clients, and the "ack_device_type" and "ack_device_id" fields are used to identify them.
>
> Agreed, another controlq for the guest->host direction looks better, and the above fields can be converted into the controlq message header.
>
Thanks.
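
To put the idea on record (a rough sketch only, nothing here is defined by
the patch; all names are illustrative assumptions): with a dedicated
guest->host controlq, the current ack fields of vhost_pci_config could
become a generic message header, e.g.:

struct vhost_pci_guest_controlq_hdr {
        /* hypothetical ack message types for the guest->host controlq */
        #define VHOST_PCI_GUEST_CONTROLQ_MEMORY_INFO_ACK   0
        #define VHOST_PCI_GUEST_CONTROLQ_DEVICE_INFO_ACK   1
        #define VHOST_PCI_GUEST_CONTROLQ_FEATURE_BITS_ACK  2
        u16 msg_type;
        u16 msg_version;
        u32 msg_len;          /* length of the payload that follows */
        u32 ack_device_type;  /* which frontend device the ack refers to */
        u64 ack_device_id;
        /* payload follows, e.g. a u64 ack code (ADD_DONE/ADD_FAIL/...) */
};

That way the driver can attach whatever payload a new message type needs
without being limited by the config space layout.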
>>
>> BTW, the current approach cannot handle the case where there are multiple requests of
>> the same kind in the control queue, e.g., two memory-add requests in
>> the control queue.
>
> A vhost-pci device corresponds to a driver VM. The two memory-add requests on the controlq are both for the same driver VM. Memory-add requests for different driver VMs couldn't be present on the same controlq. I haven't seen the issue yet. Can you please explain more? Thanks.
The issue is caused by "the two memory-add requests on the controlq are both for the same driver VM":
the driver needs to ACK these requests individually, but the two requests have the same
ack_type, ack_device_type, ack_device_id and ack_memory_info, so QEMU is not able to figure out which
request has been acked.
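
Just to illustrate one possible direction (an assumption on my side, not
something the patch defines): adding a per-request sequence number to the
existing vhost_pci_controlq_hdr and echoing it back in the ack would make
the two requests distinguishable:

struct vhost_pci_controlq_hdr {
        u16 msg_type;
        u16 msg_version;
        u32 msg_len;
        u64 msg_seq;       /* hypothetical: unique per request sent by the device */
};

struct vhost_pci_controlq_ack {
        u64 msg_seq;       /* echoes the request being acknowledged */
        u64 ack;           /* e.g. ADD_DONE / ADD_FAIL */
};

Two MEMORY_INFO(ADD) requests would then carry different msg_seq values, so
QEMU can match each ack to the request it answers.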
>
>
>>> +
>>> +The configuration fields are currently used by the vhost-pci driver
>>> +to acknowledge to the vhost-pci device after it receives controlq messages.
>>> +
>>> +4.5 Device Initialization
>>> +When a device VM boots, it creates a vhost-pci server socket.
>>> +
>>> +When a virtio device on the driver VM is created with a vhost-pci device
>>> +specified as its backend, a client socket is created
>>> +and connected to the corresponding vhost-pci server for message exchanges.
>>> +
>>> +The messages passed to the vhost-pci server are preceded by the
>>> +following header:
>>> +struct vhost_pci_socket_hdr {
>>> + #define VHOST_PCI_SOCKET_MEMORY_INFO 0
>>> + #define VHOST_PCI_SOCKET_MEMORY_INFO_ACK 1
>>> + #define VHOST_PCI_SOCKET_DEVICE_INFO 2
>>> + #define VHOST_PCI_SOCKET_DEVICE_INFO_ACK 3
>>> + #define VHOST_PCI_SOCKET_FEATURE_BITS 4
>>> + #define VHOST_PCI_SOCKET_FEATURE_BITS_ACK 5
>>> + u16 msg_type;
>>> + u16 msg_version;
>>> + u32 msg_len;
>>> + u64 qemu_pid;
>>> +};
>>> +
>>> +The payload of the above message types can be constructed using the
>>> +structures
>>> +below:
>>> +/* VHOST_PCI_SOCKET_MEMORY_INFO message */
>>> +struct vhost_pci_socket_memory_info {
>>> + #define VHOST_PCI_ADD_MEMORY 0
>>> + #define VHOST_PCI_DEL_MEMORY 1
>>> + u16 ops;
>>> + u32 nregions;
>>> + struct vhost_pci_memory_region {
>>> + int fd;
>>> + u64 guest_phys_addr;
>>> + u64 memory_size;
>>> + u64 mmap_offset;
>>> + } regions[VHOST_PCI_MAX_NREGIONS];
>>> +};
>>> +
>>> +/* VHOST_PCI_SOCKET_DEVICE_INFO message */
>>> +struct vhost_pci_device_info {
>>> + #define VHOST_PCI_ADD_FRONTEND_DEVICE 0
>>> + #define VHOST_PCI_DEL_FRONTEND_DEVICE 1
>>> + u16 ops;
>>> + u32 nvirtq;
>>> + #define VHOST_PCI_FRONTEND_DEVICE_NET 1
>>> + #define VHOST_PCI_FRONTEND_DEVICE_BLK 2
>>> + #define VHOST_PCI_FRONTEND_DEVICE_CONSOLE 3
>>> + #define VHOST_PCI_FRONTEND_DEVICE_ENTROPY 4
>>> + #define VHOST_PCI_FRONTEND_DEVICE_BALLOON 5
>>> + #define VHOST_PCI_FRONTEND_DEVICE_SCSI 8
>>> + u32 device_type;
>>> + u64 device_id;
>>> + struct virtq exotic_virtq[VHOST_PCI_MAX_NVIRTQ];
>>> +};
>>> +The device_id field identifies the device. For example, it can be
>>> +used to store a MAC address if the device_type is
>>> +VHOST_PCI_FRONTEND_DEVICE_NET.
>>> +
>>> +/* VHOST_PCI_SOCKET_FEATURE_BITS message */
>>> +struct vhost_pci_feature_bits {
>>> + u64 feature_bits;
>>> +};
>>
>> We not only have 'socket feature bits' but also the feature bits for each virtio
>> device plugged in on the side of the vhost-pci device.
>
> Yes. It is mentioned in "3 Feature Bits". The socket feature bits here are actually the remote feature bits (obtained from a socket message).
Hmm, there are two issues:
1) The device-related info (e.g., device_type, device_id) is not included
there, so the vhost-pci driver cannot write proper values to the fields of
vhost_pci_config.
2) Different virtio devices may have different feature bits, so VHOST_PCI_SOCKET_FEATURE_BITS
and VHOST_PCI_CONTROLQ_FEATURE_BITS should be able to negotiate the feature bits for
each virtio device separately (a sketch follows below).
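
For instance, the sketch below (only an illustration, not part of the patch)
extends the feature bits message with the device identity so that each
frontend device can negotiate its own bits:

struct vhost_pci_feature_bits {
        u32 device_type;   /* added field, e.g. VHOST_PCI_FRONTEND_DEVICE_NET */
        u64 device_id;     /* added field, identifies which frontend device these bits are for */
        u64 feature_bits;  /* the bits offered or accepted for that device */
};

The same identity would then also be available when the driver fills in the
ack fields of vhost_pci_config.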
>
>>
>> E.g., if there are two virtio devices (e.g., a NIC and a BLK device), both of them need to
>> directly communicate with another VM. The feature bits of these two devices
>> need to be negotiated with that VM respectively. And you cannot put these
>> feature bits in the vhost_pci_device_info struct as its vq has not been created at that time.
>
> Right. If you check the initialization steps below, there is a statement "When the device status is updated with DRIVER_OK".
>
>>> +
>>> +/* VHOST_PCI_SOCKET_xx_ACK messages */
>>> +struct vhost_pci_socket_ack {
>>> + #define VHOST_PCI_SOCKET_ACK_ADD_DONE 0
>>> + #define VHOST_PCI_SOCKET_ACK_ADD_FAIL 1
>>> + #define VHOST_PCI_SOCKET_ACK_DEL_DONE 2
>>> + #define VHOST_PCI_SOCKET_ACK_DEL_FAIL 3
>>> + u64 ack;
>>> +};
>>> +
>>> +The driver update message passed via the controlq is preceded by the
>>> +following
>>> +header:
>>> +struct vhost_pci_controlq_hdr {
>>> + #define VHOST_PCI_CONTROLQ_MEMORY_INFO 0
>>> + #define VHOST_PCI_CONTROLQ_DEVICE_INFO 1
>>> + #define VHOST_PCI_CONTROLQ_FEATURE_BITS 2
>>> + #define VHOST_PCI_CONTROLQ_UPDATE_DONE 3
>>> + u16 msg_type;
>>> + u16 msg_version;
>>> + u32 msg_len;
>>> +};
>>> +
>>> +The payload of a VHOST_PCI_CONTROLQ_MEMORY_INFO message can be
>>> +constructed using the following structure:
>>> +/* VHOST_PCI_CONTROLQ_MEMORY_INFO message */
>>> +struct vhost_pci_controlq_memory_info {
>>> + #define VHOST_PCI_ADD_MEMORY 0
>>> + #define VHOST_PCI_DEL_MEMORY 1
>>> + u16 ops;
>>> + u32 nregion;
>>> + struct exotic_memory_region {
>>> + u64 region_base_xgpa;
>>> + u64 size;
>>> + u64 offset_in_bar_area;
>>> + } region[VHOST_PCI_MAX_NREGIONS];
>>> +};
>>> +
>>> +The payload of VHOST_PCI_CONTROLQ_DEVICE_INFO and
>>> +VHOST_PCI_CONTROLQ_FEATURE_BITS messages can be constructed using the
>>> +vhost_pci_device_info structure and the vhost_pci_feature_bits
>>> +structure respectively.
>>> +
>>> +The payload of a VHOST_PCI_CONTROLQ_UPDATE_DONE message can be
>>> +constructed using the structure below:
>>> +struct vhost_pci_controlq_update_done {
>>> + u32 device_type;
>>> + u64 device_id;
>>> +};
>>> +
>>> +Fig. 1 shows the initialization steps.
>>> +
>>> +When the vhost-pci server receives a
>>> +VHOST_PCI_SOCKET_MEMORY_INFO(ADD) message, it checks if a vhost-pci
>>> +device has already been created for the requesting VM whose QEMU process id
>>> +is qemu_pid. If yes, it simply forwards the subsequently received
>>> +messages to the vhost-pci driver via the controlq. Otherwise, the
>>> +server creates a new vhost-pci device, and continues the following
>>> +initialization steps.
>>
>>
>> qemu_pid is not stable: the existing VM may be killed silently, and a new
>> VM reusing the same qemu_pid may ask to join before the vhost-pci
>> device gets to know the previous one has gone.
>
> Would it be a normal and legal operation to silently kill a QEMU? I guess only the system admin can do that, right?
Yup, it is a valid operation: a problematic VM may be killed, and as you said the admin
can kill it silently. In any case, the design should have a way to handle this properly.
>
> If that's true, I think we can add a new field, "u64 tsc_of_birth", to the vhost_pci_socket_hdr structure. It records the TSC value at the time the QEMU process is created.
You only need to identify a VM; could a UUID be used instead?
> If that's true, another problem would be the removal of the vhost-pci device for a silently killed driver VM.
> The vhost-pci server may need to periodically send a check message to detect whether the driver VM has been silently killed. If that happens, it should remove the related vhost-pci device.
Yes.
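
A sketch of the UUID idea (the field layout is my assumption, not part of
the patch): the socket header could carry the driver VM's UUID instead of
qemu_pid, so that a recycled pid cannot be mistaken for a live VM:

struct vhost_pci_socket_hdr {
        u16 msg_type;
        u16 msg_version;
        u32 msg_len;
        u8  vm_uuid[16];   /* hypothetical: 128-bit UUID of the driver VM, stable across pid reuse */
};

QEMU already provides a per-VM UUID (e.g. via the -uuid option), so the
client side would not need a new source of identity.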
>
>>> +
>>> +The vhost-pci server adds up all the memory region sizes, and uses a
>>> +64-bit device BAR for the mapping of all the memory regions obtained
>>> +from the socket message. To better support memory hot-plugging in the
>>> +driver VM, the BAR is configured with double the size of the driver
>>> +VM's memory. The server maps the received memory info via the QEMU
>>> +MemoryRegion mechanism, and then the newly created vhost-pci device is
>>> +hot-plugged into the VM.
>>> +
>>> +When the device status is updated with DRIVER_OK, a
>>> +VHOST_PCI_CONTROLQ_MEMORY_INFO(ADD) message, which stems from the
>>> +memory info socket message, is put on the controlq and a controlq
>>> +interrupt is injected to the VM.
>>> +
>>> +When the vhost-pci server receives a
>>> +VHOST_PCI_CONTROLQ_MEMORY_INFO_ACK(ADD_DONE) acknowledgement from the
>>> +driver, it sends a VHOST_PCI_SOCKET_MEMORY_INFO_ACK(ADD_DONE) message
>>> +to the client that is identified by the ack_device_type and ack_device_id fields.
>>> +
>>> +When the vhost-pci server receives a
>>> +VHOST_PCI_SOCKET_FEATURE_BITS(feature bits) message, a
>>> +VHOST_PCI_CONTROLQ_FEATURE_BITS(feature bits) message is put on the
>>> +controlq and a controlq interrupt is injected to the VM.
>>> +
>>> +If the vhost-pci server notices that the driver fully accepted the
>>> +offered feature bits, it sends a
>>> +VHOST_PCI_SOCKET_FEATURE_BITS_ACK(ADD_DONE) message to the client. If
>>> +the vhost-pci server notices that the vhost-pci driver only accepted
>>> +a subset of the offered feature bits, it sends a
>>> +VHOST_PCI_SOCKET_FEATURE_BITS(accepted feature bits) message back to
>>> +the client. The client side virtio device re-negotiates the new
>>> +feature bits with its driver, and sends back a
>>> +VHOST_PCI_SOCKET_FEATURE_BITS_ACK(ADD_DONE)
>>> +message to the server.
>>> +
>>> +Either when the vhost-pci driver fully accepts the offered feature bits
>>> +or when a VHOST_PCI_SOCKET_FEATURE_BITS_ACK(ADD_DONE) message is received from
>>> +the client, the vhost-pci server puts a
>>> +VHOST_PCI_CONTROLQ_UPDATE_DONE message on the controlq, and a
>>> +controlq interrupt is injected to the VM.
>>
>> Why is VHOST_PCI_CONTROLQ_UPDATE_DONE needed?
>
> OK, this one looks redundant. We can set up the related support for that frontend device when the device info is received via the controlq.
>
Great.