References: <1464509494-159509-1-git-send-email-wei.w.wang@intel.com> <1464509494-159509-5-git-send-email-wei.w.wang@intel.com> <574E999B.90307@linux.intel.com> <286AC319A985734F985F78AFA26841F7C78F0A@shsmsx102.ccr.corp.intel.com>
From: Xiao Guangrong
Message-ID: <574FAD76.6030601@linux.intel.com>
Date: Thu, 2 Jun 2016 11:52:22 +0800
In-Reply-To: <286AC319A985734F985F78AFA26841F7C78F0A@shsmsx102.ccr.corp.intel.com>
Subject: Re: [Qemu-devel] [PATCH 4/6 Resend] Vhost-pci RFC: Detailed Description in the Virtio Specification Format
To: "Wang, Wei W", "kvm@vger.kernel.org", "qemu-devel@nongnu.org", "virtio-comment@lists.oasis-open.org", "mst@redhat.com", "stefanha@redhat.com", "pbonzini@redhat.com"

On 06/02/2016 11:15 AM, Wang, Wei W wrote:
> On Wed 6/1/2016 4:15 PM, Xiao Guangrong wrote:
>> On 05/29/2016 04:11 PM, Wei Wang wrote:
>>> Signed-off-by: Wei Wang
>>> ---
>>>  Details | 324 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>  1 file changed, 324 insertions(+)
>>>  create mode 100644 Details
>>>
>>> diff --git a/Details b/Details
>>> new file mode 100644
>>> index 0000000..4ea2252
>>> --- /dev/null
>>> +++ b/Details
>>> @@ -0,0 +1,324 @@
>>> +1 Device ID
>>> +TBD
>>> +
>>> +2 Virtqueues
>>> +0 controlq
>>> +
>>> +3 Feature Bits
>>> +3.1 Local Feature Bits
>>> +Currently no local feature bits are defined, so the standard virtio
>>> +feature bits negotiation will always be successful and complete.
>>> +
>>> +3.2 Remote Feature Bits
>>> +The remote feature bits are obtained from the frontend virtio device
>>> +and negotiated with the vhost-pci driver via the controlq. The
>>> +negotiation steps are described in 4.5 Device Initialization.
>>> +
>>> +4 Device Configuration Layout
>>> +struct vhost_pci_config {
>>> +    #define VHOST_PCI_CONTROLQ_MEMORY_INFO_ACK 0
>>> +    #define VHOST_PCI_CONTROLQ_DEVICE_INFO_ACK 1
>>> +    #define VHOST_PCI_CONTROLQ_FEATURE_BITS_ACK 2
>>> +    u32 ack_type;
>>> +    u32 ack_device_type;
>>> +    u64 ack_device_id;
>>> +    union {
>>> +        #define VHOST_PCI_CONTROLQ_ACK_ADD_DONE 0
>>> +        #define VHOST_PCI_CONTROLQ_ACK_ADD_FAIL 1
>>> +        #define VHOST_PCI_CONTROLQ_ACK_DEL_DONE 2
>>> +        #define VHOST_PCI_CONTROLQ_ACK_DEL_FAIL 3
>>> +        u64 ack_memory_info;
>>> +        u64 ack_device_info;
>>> +        u64 ack_feature_bits;
>>> +    };
>>> +};
>>
>> Do you need to write all these 4 fields to ack the operation? It seems it is
>> not efficient, and it is not flexible if the driver needs to offer more data
>> to the device in the future. Can we dedicate a vq for this purpose?
>
> Yes, all the 4 fields are required to be written. The vhost-pci server usually
> connects to multiple clients, and the "ack_device_type" and "ack_device_id"
> fields are used to identify them.
>
> Agree, another controlq for the guest->host direction looks better, and the
> above fields can be converted to be the controlq message header.
> Thanks.
>
>> BTW, the current approach cannot handle the case where there are multiple
>> requests of the same kind in the control queue, e.g., if there are two
>> memory-add requests in the control queue.
>
> A vhost-pci device corresponds to a driver VM. The two memory-add requests on
> the controlq are all for the same driver VM. Memory-add requests for different
> driver VMs couldn't be present on the same controlq. I haven't seen the issue
> yet. Can you please explain more? Thanks.

The issue is caused by "The two memory-add requests on the controlq are all for
the same driver VM": the driver needs to ACK these requests respectively;
however, the two requests have the same ack_type, device_type, device_id and
ack_memory_info, so QEMU is not able to figure out which request has been
acked. (A rough sketch of a per-request ack is appended at the end of this
mail.)

>
>>> +
>>> +The configuration fields are currently used for the vhost-pci driver
>>> +to acknowledge to the vhost-pci device after it receives controlq messages.
>>> +
>>> +4.5 Device Initialization
>>> +When a device VM boots, it creates a vhost-pci server socket.
>>> +
>>> +When a virtio device on the driver VM is created with specifying the
>>> +use of a vhost-pci device as a backend, a client socket is created
>>> +and connected to the corresponding vhost-pci server for message exchanges.
>>> +
>>> +The messages passed to the vhost-pci server are preceded by the
>>> +following header:
>>> +struct vhost_pci_socket_hdr {
>>> +    #define VHOST_PCI_SOCKET_MEMORY_INFO 0
>>> +    #define VHOST_PCI_SOCKET_MEMORY_INFO_ACK 1
>>> +    #define VHOST_PCI_SOCKET_DEVICE_INFO 2
>>> +    #define VHOST_PCI_SOCKET_DEVICE_INFO_ACK 3
>>> +    #define VHOST_PCI_SOCKET_FEATURE_BITS 4
>>> +    #define VHOST_PCI_SOCKET_FEATURE_BITS_ACK 5
>>> +    u16 msg_type;
>>> +    u16 msg_version;
>>> +    u32 msg_len;
>>> +    u64 qemu_pid;
>>> +};
>>> +
>>> +The payload of the above message types can be constructed using the
>>> +structures below:
>>> +/* VHOST_PCI_SOCKET_MEMORY_INFO message */
>>> +struct vhost_pci_socket_memory_info {
>>> +    #define VHOST_PCI_ADD_MEMORY 0
>>> +    #define VHOST_PCI_DEL_MEMORY 1
>>> +    u16 ops;
>>> +    u32 nregions;
>>> +    struct vhost_pci_memory_region {
>>> +        int fd;
>>> +        u64 guest_phys_addr;
>>> +        u64 memory_size;
>>> +        u64 mmap_offset;
>>> +    } regions[VHOST_PCI_MAX_NREGIONS];
>>> +};
>>> +
>>> +/* VHOST_PCI_SOCKET_DEVICE_INFO message */
>>> +struct vhost_pci_device_info {
>>> +    #define VHOST_PCI_ADD_FRONTEND_DEVICE 0
>>> +    #define VHOST_PCI_DEL_FRONTEND_DEVICE 1
>>> +    u16 ops;
>>> +    u32 nvirtq;
>>> +    #define VHOST_PCI_FRONTEND_DEVICE_NET 1
>>> +    #define VHOST_PCI_FRONTEND_DEVICE_BLK 2
>>> +    #define VHOST_PCI_FRONTEND_DEVICE_CONSOLE 3
>>> +    #define VHOST_PCI_FRONTEND_DEVICE_ENTROPY 4
>>> +    #define VHOST_PCI_FRONTEND_DEVICE_BALLOON 5
>>> +    #define VHOST_PCI_FRONTEND_DEVICE_SCSI 8
>>> +    u32 device_type;
>>> +    u64 device_id;
>>> +    struct virtq exotic_virtq[VHOST_PCI_MAX_NVIRTQ];
>>> +};
>>> +The device_id field identifies the device. For example, it can be
>>> +used to store a MAC address if the device_type is
>>> +VHOST_PCI_FRONTEND_DEVICE_NET.
>>> +
>>> +/* VHOST_PCI_SOCKET_FEATURE_BITS message */
>>> +struct vhost_pci_feature_bits {
>>> +    u64 feature_bits;
>>> +};
>>
>> We not only have the 'socket feature bits' but also the feature bits for each
>> virtio device plugged in on the side of the vhost-pci device.
>
> Yes. It is mentioned in "3 Feature Bits".
The socket feature bits here are actually the remote feature bits (obtained
from a socket message). Hmm, there are two questions:
1) The device-related info (e.g., device-type, device-id, etc.) has not been
   included there, so the vhost-pci driver cannot write proper values to the
   fields of vhost_pci_config.
2) Different virtio devices may have different feature bits;
   VHOST_PCI_SOCKET_FEATURE_BITS and VHOST_PCI_CONTROLQ_FEATURE_BITS should be
   able to negotiate the feature bits for different virtio devices.
(A rough sketch of a per-device feature bits message is appended at the end of
this mail.)

>
>> E.g., if there are two virtio devices (e.g., a NIC and a BLK device), both of
>> them need to directly communicate with another VM. The feature bits of these
>> two devices need to be negotiated with that VM respectively. And you cannot
>> put these feature bits in the vhost_pci_device_info struct, as its vq is not
>> created at that time.
>
> Right. If you check the initialization steps below, there is a statement "When
> the device status is updated with DRIVER_OK".
>
>>> +
>>> +/* VHOST_PCI_SOCKET_xx_ACK messages */
>>> +struct vhost_pci_socket_ack {
>>> +    #define VHOST_PCI_SOCKET_ACK_ADD_DONE 0
>>> +    #define VHOST_PCI_SOCKET_ACK_ADD_FAIL 1
>>> +    #define VHOST_PCI_SOCKET_ACK_DEL_DONE 2
>>> +    #define VHOST_PCI_SOCKET_ACK_DEL_FAIL 3
>>> +    u64 ack;
>>> +};
>>> +
>>> +The driver update message passed via the controlq is preceded by the
>>> +following header:
>>> +struct vhost_pci_controlq_hdr {
>>> +    #define VHOST_PCI_CONTROLQ_MEMORY_INFO 0
>>> +    #define VHOST_PCI_CONTROLQ_DEVICE_INFO 1
>>> +    #define VHOST_PCI_CONTROLQ_FEATURE_BITS 2
>>> +    #define VHOST_PCI_CONTROLQ_UPDATE_DONE 3
>>> +    u16 msg_type;
>>> +    u16 msg_version;
>>> +    u32 msg_len;
>>> +};
>>> +
>>> +The payload of a VHOST_PCI_CONTROLQ_MEMORY_INFO message can be
>>> +constructed using the following structure:
>>> +/* VHOST_PCI_CONTROLQ_MEMORY_INFO message */
>>> +struct vhost_pci_controlq_memory_info {
>>> +    #define VHOST_PCI_ADD_MEMORY 0
>>> +    #define VHOST_PCI_DEL_MEMORY 1
>>> +    u16 ops;
>>> +    u32 nregion;
>>> +    struct exotic_memory_region {
>>> +        u64 region_base_xgpa;
>>> +        u64 size;
>>> +        u64 offset_in_bar_area;
>>> +    } region[VHOST_PCI_MAX_NREGIONS];
>>> +};
>>> +
>>> +The payload of VHOST_PCI_CONTROLQ_DEVICE_INFO and
>>> +VHOST_PCI_CONTROLQ_FEATURE_BITS messages can be constructed using the
>>> +vhost_pci_device_info structure and the vhost_pci_feature_bits
>>> +structure respectively.
>>> +
>>> +The payload of a VHOST_PCI_CONTROLQ_UPDATE_DONE message can be
>>> +constructed using the structure below:
>>> +struct vhost_pci_controlq_update_done {
>>> +    u32 device_type;
>>> +    u64 device_id;
>>> +};
>>> +
>>> +Fig. 1 shows the initialization steps.
>>> +
>>> +When the vhost-pci server receives a
>>> +VHOST_PCI_SOCKET_MEMORY_INFO(ADD) message, it checks if a vhost-pci
>>> +device has been created for the requesting VM whose QEMU process id
>>> +is qemu_pid. If yes, it simply passes the subsequently received
>>> +messages to the vhost-pci driver via the controlq. Otherwise, the
>>> +server creates a new vhost-pci device, and continues the following
>>> +initialization steps.
>>
>> qemu-pid is not stable, as the existing VM may be killed silently, and a new
>> vhost-pci driver reusing the same qemu-pid may ask to join before the
>> vhost-pci device gets to know that the previous one has gone.
>
> Would it be a normal and legal operation to silently kill a QEMU? I guess only
> the system admin can do that, right?
Yup, it is a valid operation: a problematic VM may be killed, and, as you said,
the admin can kill it silently. Anyway, the design should have a way to handle
this case properly.

>
> If that's true, I think we can add a new field, "u64 tsc_of_birth", to the
> vhost_pci_socket_hdr structure. It records the tsc when the QEMU is created.

You only need to identify a VM; can a UUID be used instead? (A sketch of the
socket header carrying a UUID is appended at the end of this mail.)

> If that's true, another problem would be the removal of the vhost-pci device
> for a silently killed driver VM.
> The vhost-pci server may need to periodically send a message to check whether
> the driver VM has been silently killed. If that really happens, it should
> remove the related vhost-pci device.

Yes.

>
>>> +
>>> +The vhost-pci server adds up all the memory region sizes, and uses a
>>> +64-bit device bar for the mapping of all the memory regions obtained
>>> +from the socket message. To better support memory hot-plugging of the
>>> +driver VM, the bar is configured with double the size of the driver
>>> +VM's memory. The server maps the received memory info via the QEMU
>>> +MemoryRegion mechanism, and then the newly created vhost-pci device is
>>> +hot-plugged into the VM.
>>> +
>>> +When the device status is updated with DRIVER_OK, a
>>> +VHOST_PCI_CONTROLQ_MEMORY_INFO(ADD) message, which stems from the
>>> +memory info socket message, is put on the controlq and a controlq
>>> +interrupt is injected into the VM.
>>> +
>>> +When the vhost-pci server receives a
>>> +VHOST_PCI_CONTROLQ_MEMORY_INFO_ACK(ADD_DONE) acknowledgement from the
>>> +driver, it sends a VHOST_PCI_SOCKET_MEMORY_INFO_ACK(ADD_DONE) message
>>> +to the client that is identified by the ack_device_type and
>>> +ack_device_id fields.
>>> +
>>> +When the vhost-pci server receives a
>>> +VHOST_PCI_SOCKET_FEATURE_BITS(feature bits) message, a
>>> +VHOST_PCI_CONTROLQ_FEATURE_BITS(feature bits) message is put on the
>>> +controlq and a controlq interrupt is injected into the VM.
>>> +
>>> +If the vhost-pci server notices that the driver fully accepted the
>>> +offered feature bits, it sends a
>>> +VHOST_PCI_SOCKET_FEATURE_BITS_ACK(ADD_DONE) message to the client. If
>>> +the vhost-pci server notices that the vhost-pci driver only accepted
>>> +a subset of the offered feature bits, it sends a
>>> +VHOST_PCI_SOCKET_FEATURE_BITS(accepted feature bits) message back to
>>> +the client. The client-side virtio device re-negotiates the new
>>> +feature bits with its driver, and sends back a
>>> +VHOST_PCI_SOCKET_FEATURE_BITS_ACK(ADD_DONE) message to the server.
>>> +
>>> +Either when the vhost-pci driver fully accepts the offered feature
>>> +bits, or when a VHOST_PCI_SOCKET_FEATURE_BITS_ACK(ADD_DONE) message is
>>> +received from the client, the vhost-pci server puts a
>>> +VHOST_PCI_CONTROLQ_UPDATE_DONE message on the controlq, and a controlq
>>> +interrupt is injected into the VM.
>>
>> Why is VHOST_PCI_CONTROLQ_UPDATE_DONE needed?
>
> OK, this one looks redundant. We can set up the related support for that
> frontend device when the device info is received via the controlq.

Great.
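
To make the ack ambiguity point above a bit more concrete, here is a rough
sketch of what a per-request ack on a dedicated guest->host controlq could look
like. It is only an illustration of the idea, not a proposed layout; the field
names below (request_id, ack_result, etc.) are placeholders:

/* Illustration only: a guest->host controlq ack keyed by a per-request id,
 * so that two otherwise identical memory-add requests for the same driver
 * VM can be acked independently.
 */
struct vhost_pci_controlq_ack {
    u64 request_id;    /* echoed from the host->guest controlq request header */
    u32 ack_type;      /* which request type is being acked, e.g. MEMORY_INFO */
    u64 ack_result;    /* e.g. ADD_DONE / ADD_FAIL / DEL_DONE / DEL_FAIL */
};

With an id assigned by the device in the host->guest controlq header and echoed
back here, QEMU can always tell which outstanding request a given ack belongs to.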
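
Likewise, for the per-device feature bits negotiation, the feature bits payload
would need to carry the identity of the frontend device it belongs to. A
possible shape, again only as an illustration and not as the actual layout:

/* Illustration only: feature bits carried together with the identity of the
 * frontend device they belong to, so a NIC and a BLK device behind the same
 * vhost-pci device can be negotiated independently.
 */
struct vhost_pci_feature_bits {
    u32 device_type;   /* e.g. VHOST_PCI_FRONTEND_DEVICE_NET */
    u64 device_id;     /* e.g. the MAC address for a NET device */
    u64 feature_bits;  /* feature bits offered/accepted for this device */
};

That also gives the vhost-pci driver the values it needs to fill in the
ack_device_type/ack_device_id fields when it acknowledges the negotiation.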
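
And for the VM identity question, the socket header could carry a stable
identifier such as the VM UUID instead of qemu_pid. A rough sketch only; the
uuid field name is a placeholder:

/* Illustration only: identify the driver VM by a stable UUID rather than by
 * the reusable QEMU process id.
 */
struct vhost_pci_socket_hdr {
    u16 msg_type;
    u16 msg_version;
    u32 msg_len;
    u8  uuid[16];      /* stable VM identity, e.g. the QEMU -uuid value */
};

Unlike a process id, the UUID is not reused when a VM is silently killed and a
new one starts, so the server can tell the two apart.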