Date: Fri, 13 Aug 2021 14:10:38 +0900
From: AKASHI Takahiro
To: François Ozog
Cc: Stefano Stabellini, paul@xen.org, Stratos Mailing List,
 virtio-dev@lists.oasis-open.org, Jan Kiszka, Arnd Bergmann,
 jgross@suse.com, julien@xen.org, Carl van Schaik,
 Bertrand.Marquis@arm.com, stefanha@redhat.com, Artem_Mygaiev@epam.com,
 xen-devel@lists.xen.org, olekstysh@gmail.com,
 Oleksandr_Tyshchenko@epam.com
Subject: Re: [Stratos-dev] Enabling hypervisor agnosticism for VirtIO backends
Message-ID: <20210813051038.GA77540@laputa>
References: <87v94ldrqq.fsf@linaro.org>
 <0100017b33e585a5-06d4248e-b1a7-485e-800c-7ead89e5f916-000000@email.amazonses.com>

Hi François,

On Thu, Aug 12, 2021 at 09:55:52AM +0200, François Ozog wrote:
> I top post as I find it difficult to identify where to make the comments.

Thank you for the posting.
I think that we should first discuss more about the goal/requirements/
practical use cases for the framework.

> 1) BE acceleration
> Network and storage backends may actually be executed in SmartNICs. As
> virtio 1.1 is hardware friendly, there may be SmartNICs with virtio 1.1 PCI
> VFs. Is it a valid use case for the generic BE framework to be used in this
> context?
> DPDK is used in some BE to significantly accelerate switching. DPDK is also
> used sometimes in guests. In that case, there are no event injection but
> just high performance memory scheme. Is this considered as a use case?
I'm not very familiar with DPDK, but according to [1] it seems to rely
heavily not only on virtqueues but also on KVM/Linux features and
functionality.
I'm afraid that DPDK is not suitable as a primary (or at least initial)
target use case.

# In my proposal, virtio-proxy, I have in mind the assumption that we would
# create the BE VM as a bare-metal application on an RTOS (and/or a unikernel).

But as far as virtqueues are concerned, I think we can discuss the general
technical details, as Alex suggested, including:
- sharing or mapping memory regions for data payload
- an efficient notification mechanism

[1] https://www.redhat.com/en/blog/journey-vhost-users-realm

> 2) Virtio as OS HAL
> Panasonic CTO has been calling for a virtio based HAL and based on the
> teachings of Google GKI, an internal HAL seems inevitable in the long term.
> Virtio is then a contender to Google promoted Android HAL. Could the
> framework be used in that context?

In this case, where will the implementation of "HAL" reside?
I don't think the portability of "HAL" code (as a set of virtio BEs)
is a requirement here.

-Takahiro Akashi

> On Wed, 11 Aug 2021 at 08:28, AKASHI Takahiro via Stratos-dev <
> stratos-dev@op-lists.linaro.org> wrote:
>
> > On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano Stabellini wrote:
> > > CCing people working on Xen+VirtIO and IOREQs. Not trimming the original
> > > email to let them read the full context.
> > >
> > > My comments below are related to a potential Xen implementation, not
> > > because it is the only implementation that matters, but because it is
> > > the one I know best.
> >
> > Please note that my proposal (and hence the working prototype)[1]
> > is based on Xen's virtio implementation (i.e. IOREQ) and particularly
> > EPAM's virtio-disk application (backend server).
> > It has been, I believe, well generalized but is still a bit biased
> > toward this original design.
> >
> > So I hope you like my approach :)
> >
> > [1]
> > https://op-lists.linaro.org/pipermail/stratos-dev/2021-August/000546.html
> >
> > Let me take this opportunity to explain a bit more about my approach below.
> >
> > > Also, please see this relevant email thread:
> > > https://marc.info/?l=xen-devel&m=162373754705233&w=2
> > >
> > >
> > > On Wed, 4 Aug 2021, Alex Bennée wrote:
> > > > Hi,
> > > >
> > > > One of the goals of Project Stratos is to enable hypervisor agnostic
> > > > backends so we can enable as much re-use of code as possible and avoid
> > > > repeating ourselves. This is the flip side of the front end where
> > > > multiple front-end implementations are required - one per OS, assuming
> > > > you don't just want Linux guests. The resultant guests are trivially
> > > > movable between hypervisors modulo any abstracted paravirt type
> > > > interfaces.
> > > >
> > > > In my original thumb nail sketch of a solution I envisioned vhost-user
> > > > daemons running in a broadly POSIX like environment. The interface to
> > > > the daemon is fairly simple requiring only some mapped memory and some
> > > > sort of signalling for events (on Linux this is eventfd). The idea was a
> > > > stub binary would be responsible for any hypervisor specific setup and
> > > > then launch a common binary to deal with the actual virtqueue requests
> > > > themselves.
> > > >
> > > > Since that original sketch we've seen an expansion in the sort of ways
> > > > backends could be created. There is interest in encapsulating backends
> > > > in RTOSes or unikernels for solutions like SCMI. The interest in Rust
> > > > has prompted ideas of using the trait interface to abstract differences
> > > > away as well as the idea of bare-metal Rust backends.
> > > >
> > > > We have a card (STR-12) called "Hypercall Standardisation" which
> > > > calls for a description of the APIs needed from the hypervisor side to
> > > > support VirtIO guests and their backends.
> > > > However we are some way off
> > > > from that at the moment as I think we need to at least demonstrate one
> > > > portable backend before we start codifying requirements. To that end I
> > > > want to think about what we need for a backend to function.
> > > >
> > > > Configuration
> > > > =============
> > > >
> > > > In the type-2 setup this is typically fairly simple because the host
> > > > system can orchestrate the various modules that make up the complete
> > > > system. In the type-1 case (or even type-2 with delegated service VMs)
> > > > we need some sort of mechanism to inform the backend VM about key
> > > > details about the system:
> > > >
> > > > - where virt queue memory is in its address space
> > > > - how it's going to receive (interrupt) and trigger (kick) events
> > > > - what (if any) resources the backend needs to connect to
> > > >
> > > > Obviously you can elide over configuration issues by having static
> > > > configurations and baking the assumptions into your guest images however
> > > > this isn't scalable in the long term. The obvious solution seems to be
> > > > extending a subset of Device Tree data to user space but perhaps there
> > > > are other approaches?
> > > >
> > > > Before any virtio transactions can take place the appropriate memory
> > > > mappings need to be made between the FE guest and the BE guest.
> > >
> > > > Currently the whole of the FE guest's address space needs to be visible
> > > > to whatever is serving the virtio requests. I can envision 3 approaches:
> > > >
> > > > * BE guest boots with memory already mapped
> > > >
> > > > This would entail the guest OS knowing where in its Guest Physical
> > > > Address space is already taken up and avoiding clashing. I would assume
> > > > in this case you would want a standard interface to userspace to then
> > > > make that address space visible to the backend daemon.
> >
> > Yet another way here is that we would have well known "shared memory" between
> > VMs. I think that Jailhouse's ivshmem gives us good insights on this matter
> > and that it can even be an alternative hypervisor-agnostic solution.
> >
> > (Please note memory regions in ivshmem appear as a PCI device and can be
> > mapped locally.)
> >
> > I want to add this shared memory aspect to my virtio-proxy, but
> > the resultant solution would eventually look similar to ivshmem.
> >
> > > > * BE guest boots with a hypervisor handle to memory
> > > >
> > > > The BE guest is then free to map the FE's memory to where it wants in
> > > > the BE's guest physical address space.
> > >
> > > I cannot see how this could work for Xen. There is no "handle" to give
> > > to the backend if the backend is not running in dom0. So for Xen I think
> > > the memory has to be already mapped
> >
> > In Xen's IOREQ solution (virtio-blk), the following information is expected
> > to be exposed to BE via Xenstore:
> > (I know that this is a tentative approach though.)
> >    - the start address of configuration space
> >    - interrupt number
> >    - file path for backing storage
> >    - read-only flag
> > And the BE server has to call a particular hypervisor interface to
> > map the configuration space.
> >
> > In my approach (virtio-proxy), all that Xen-specific (or hypervisor-specific)
> > stuff is contained in virtio-proxy, yet another VM, to hide all the details.
> >
> > # My point is that a "handle" is not mandatory for executing mapping.
> >
> > > and the mapping probably done by the
> > > toolstack (also see below.) Or we would have to invent a new Xen
> > > hypervisor interface and Xen virtual machine privileges to allow this
> > > kind of mapping.
> > >
> > > If we run the backend in Dom0 then we have no problems of course.
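
As a rough illustration only, the four items that the IOREQ-based
virtio-blk backend is expected to read from Xenstore (listed above) could
be gathered into one configuration record. The key names ("base", "irq",
"filename", "readonly") and the names BackendConfig and
parse_backend_config are hypothetical, chosen for this sketch; they are
not the actual Xenstore layout used by virtio-disk or virtio-proxy.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class BackendConfig:
    """Hypothetical record of what a virtio-blk backend needs at start-up."""
    config_base: int    # start address of the configuration space
    irq: int            # interrupt number
    backing_path: str   # file path for backing storage
    read_only: bool     # read-only flag


def parse_backend_config(entries: Mapping[str, str]) -> BackendConfig:
    """Build a BackendConfig from string key/value pairs.

    The key names are assumptions made for this sketch, not real
    Xenstore paths. Numbers may be decimal or 0x-prefixed hex.
    """
    return BackendConfig(
        config_base=int(entries["base"], 0),
        irq=int(entries["irq"], 0),
        backing_path=entries["filename"],
        read_only=entries["readonly"] == "1",
    )
```

Keeping the record independent of how the strings were obtained means
the same backend code could be fed from Xenstore, a Device Tree subset,
or a proxy VM, which is the kind of separation discussed in this thread.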
> >
> > One of the difficulties on Xen that I found in my approach is that calling
> > such hypervisor interfaces (registering IOREQ, mapping memory) is only
> > allowed on BE servers themselves and so we will have to extend those
> > interfaces.
> > This, however, will raise some concern on security and privilege distribution
> > as Stefan suggested.
> >
> > > >
> > > > To activate the mapping will
> > > > require some sort of hypercall to the hypervisor. I can see two options
> > > > at this point:
> > > >
> > > > - expose the handle to userspace for daemon/helper to trigger the
> > > >   mapping via existing hypercall interfaces. If using a helper you
> > > >   would have a hypervisor specific one to avoid the daemon having to
> > > >   care too much about the details or push that complexity into a
> > > >   compile time option for the daemon which would result in different
> > > >   binaries although a common source base.
> > > >
> > > > - expose a new kernel ABI to abstract the hypercall differences away
> > > >   in the guest kernel. In this case the userspace would essentially
> > > >   ask for an abstract "map guest N memory to userspace ptr" and let
> > > >   the kernel deal with the different hypercall interfaces. This of
> > > >   course assumes the majority of BE guests would be Linux kernels and
> > > >   leaves the bare-metal/unikernel approaches to their own devices.
> > > >
> > > > Operation
> > > > =========
> > > >
> > > > The core of the operation of VirtIO is fairly simple. Once the
> > > > vhost-user feature negotiation is done it's a case of receiving update
> > > > events and parsing the resultant virt queue for data. The vhost-user
> > > > specification handles a bunch of setup before that point, mostly to
> > > > detail where the virt queues are set up FD's for memory and event
> > > > communication. This is where the envisioned stub process would be
> > > > responsible for getting the daemon up and ready to run.
> > > > This is currently done inside a big VMM like QEMU but I suspect a modern
> > > > approach would be to use the rust-vmm vhost crate. It would then either
> > > > communicate with the kernel's abstracted ABI or be re-targeted as a
> > > > build option for the various hypervisors.
> > >
> > > One thing I mentioned before to Alex is that Xen doesn't have VMMs the
> > > way they are typically envisioned and described in other environments.
> > > Instead, Xen has IOREQ servers. Each of them connects independently to
> > > Xen via the IOREQ interface. E.g. today multiple QEMUs could be used as
> > > emulators for a single Xen VM, each of them connecting to Xen
> > > independently via the IOREQ interface.
> > >
> > > The component responsible for starting a daemon and/or setting up shared
> > > interfaces is the toolstack: the xl command and the libxl/libxc
> > > libraries.
> >
> > I think that VM configuration management (or orchestration in Stratos
> > jargon?) is a subject to be debated in parallel.
> > Otherwise, is there any good assumption to avoid it right now?
> >
> > > Oleksandr and others I CCed have been working on ways for the toolstack
> > > to create virtio backends and setup memory mappings. They might be able
> > > to provide more info on the subject. I do think we miss a way to provide
> > > the configuration to the backend and anything else that the backend
> > > might require to start doing its job.
> > >
> > >
> > > > One question is how to best handle notification and kicks. The existing
> > > > vhost-user framework uses eventfd to signal the daemon (although QEMU
> > > > is quite capable of simulating them when you use TCG). Xen has its own
> > > > IOREQ mechanism. However latency is an important factor and having
> > > > events go through the stub would add quite a lot.
> > >
> > > Yeah I think, regardless of anything else, we want the backends to
> > > connect directly to the Xen hypervisor.
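
As a rough illustration of the notification question above, here is a
minimal sketch of what an OS-agnostic and hypervisor-agnostic
notification interface could look like. Everything in it is an
assumption made for illustration: Notifier and CounterNotifier are
hypothetical names that do not come from vhost-user, Xen, or any
proposal in this thread. CounterNotifier only mimics, in-process, the
counter behaviour of a Linux eventfd (kicks accumulate and one wait
drains them all).

```python
import threading
from abc import ABC, abstractmethod
from typing import Optional


class Notifier(ABC):
    """Hypothetical notification interface a backend could be written against."""

    @abstractmethod
    def kick(self) -> None:
        """Signal the peer that new work is available."""

    @abstractmethod
    def wait(self, timeout: Optional[float] = None) -> int:
        """Block until a kick arrives; return the number of coalesced kicks
        (0 if the timeout expired first)."""


class CounterNotifier(Notifier):
    """In-process stand-in that mimics eventfd-style counter semantics."""

    def __init__(self) -> None:
        self._count = 0
        self._cond = threading.Condition()

    def kick(self) -> None:
        with self._cond:
            self._count += 1
            self._cond.notify()

    def wait(self, timeout: Optional[float] = None) -> int:
        with self._cond:
            # wait_for returns False if the predicate is still false at timeout.
            if not self._cond.wait_for(lambda: self._count > 0, timeout):
                return 0
            count, self._count = self._count, 0
            return count
```

A real backend would presumably put an event channel, an interrupt
handler, or an eventfd behind the same two calls, so the virtqueue
processing loop would not need to know which mechanism is in use.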
> >
> > In my approach,
> >  a) BE -> FE: interrupts triggered by BE calling a hypervisor interface
> >               via virtio-proxy
> >  b) FE -> BE: MMIO to config raises events (in event channels), which are
> >               converted to a callback to BE via virtio-proxy
> >               (Xen's event channel is internally implemented by
> >               interrupts.)
> >
> > I don't know what "connect directly" means here, but sending interrupts
> > to the opposite side would be the most efficient.
> > Ivshmem, I suppose, takes this approach by utilizing PCI's MSI-X mechanism.
> >
> > >
> > > > Could we consider the kernel internally converting IOREQ messages from
> > > > the Xen hypervisor to eventfd events? Would this scale with other kernel
> > > > hypercall interfaces?
> > > >
> > > > So any thoughts on what directions are worth experimenting with?
> > >
> > > One option we should consider is for each backend to connect to Xen via
> > > the IOREQ interface. We could generalize the IOREQ interface and make it
> > > hypervisor agnostic. The interface is really trivial and easy to add.
> >
> > As I said above, my proposal does the same thing that you mentioned here :)
> > The difference is that I do call hypervisor interfaces via virtio-proxy.
> >
> > > The only Xen-specific part is the notification mechanism, which is an
> > > event channel. If we replaced the event channel with something else the
> > > interface would be generic. See:
> > > https://gitlab.com/xen-project/xen/-/blob/staging/xen/include/public/hvm/ioreq.h#L52
> > >
> > > I don't think that translating IOREQs to eventfd in the kernel is a
> > > good idea: it feels like it would be extra complexity and that the
> > > kernel shouldn't be involved as this is a backend-hypervisor interface.
> >
> > Given that we may want to implement BE as a bare-metal application
> > as I did on Zephyr, I don't think that the translation would be
> > a big issue, especially on RTOSes.
> > It will be some kind of abstraction layer of interrupt handling
> > (or nothing but a callback mechanism).
> >
> > > Also, eventfd is very Linux-centric and we are trying to design an
> > > interface that could work well for RTOSes too. If we want to do
> > > something different, both OS-agnostic and hypervisor-agnostic, perhaps
> > > we could design a new interface. One that could be implementable in the
> > > Xen hypervisor itself (like IOREQ) and of course any other hypervisor
> > > too.
> > >
> > >
> > > There is also another problem. IOREQ is probably not the only
> > > interface needed. Have a look at
> > > https://marc.info/?l=xen-devel&m=162373754705233&w=2. Don't we also need
> > > an interface for the backend to inject interrupts into the frontend? And
> > > if the backend requires dynamic memory mappings of frontend pages, then
> > > we would also need an interface to map/unmap domU pages.
> >
> > My proposal document might help here; all the interfaces required for
> > virtio-proxy (or hypervisor-related interfaces) are listed as
> > RPC protocols :)
> >
> > > These interfaces are a lot more problematic than IOREQ: IOREQ is tiny
> > > and self-contained. It is easy to add anywhere. A new interface to
> > > inject interrupts or map pages is more difficult to manage because it
> > > would require changes scattered across the various emulators.
> >
> > Exactly. I am not confident yet that my approach will also apply
> > to hypervisors other than Xen.
> > Technically, yes, but whether people can accept it or not is a different
> > matter.
> >
> > Thanks,
> > -Takahiro Akashi
> >
> > --
> > Stratos-dev mailing list
> > Stratos-dev@op-lists.linaro.org
> > https://op-lists.linaro.org/mailman/listinfo/stratos-dev
> >
>
>
> --
> François-Frédéric Ozog | *Director Business Development*
> T: +33.67221.6485
> francois.ozog@linaro.org | Skype: ffozog