From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 73744C00140 for ; Tue, 26 Jul 2022 15:10:47 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230424AbiGZPKq (ORCPT ); Tue, 26 Jul 2022 11:10:46 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44576 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229804AbiGZPKk (ORCPT ); Tue, 26 Jul 2022 11:10:40 -0400 Received: from mail-pl1-x631.google.com (mail-pl1-x631.google.com [IPv6:2607:f8b0:4864:20::631]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 6688B656B for ; Tue, 26 Jul 2022 08:10:39 -0700 (PDT) Received: by mail-pl1-x631.google.com with SMTP id d3so5716898pls.4 for ; Tue, 26 Jul 2022 08:10:39 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to; bh=nEq1IaNhdkodBDzozbLgFdlJtMKKyeEDpgF5SJMm+Y8=; b=OWEnJwcMRH6J/QXCuwh5ay2U50ZpDhtt9UbnMucbCaA8Hn/Gf1tn7OIPJZvJJlYZph LW8ZsOCCLqtJX6LzYMwSle6gPv8wQUpMXg/MKvdNr6Jw/YVHewJywz77TGt6RYKsrv7s SaYVVhrt32cGZ4Vqk4SVa2I7QE52NbeI2bAhNurJ2VvJc9eEaSBJ74CCOhEq5pj6Lfjc GH7pedy/4PRnTzw3rnxOmJYwi3M2qip3MhP3oY2TjKi1A/z/1XkDjDb80k1rsXKkYTKQ 1rS339PAXGkNqzjwoqOFEuEedlCoeFDrxQRsqQt1kR5gbNkNRVUSH9SwC4ovtagB6bu7 dNxA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to; bh=nEq1IaNhdkodBDzozbLgFdlJtMKKyeEDpgF5SJMm+Y8=; b=I04VQpIFyXYc+v6ezmOPiEoEsYF/8NZlt3zAnSYNmGSKwlwZ/fsutSDygyaYDR7gDM Zugtxo7mGbA3iTHVYTK3iGnOYq+Klc88Hle5oSztyqaMYQjU0Bi/RS23joj+tksxAlA+ 5ja86yrrshKi86ne/iwH37ZGKt8nTfQk3OmHVk4R288eg1iCiqAWQdErWcUoysVyMVpj 2XgyU/HgAbb3N/3cdV7dzozvVlrVAztMibOUtlhkYUZMJQxXGBX7tx/bWAr03frUTK4f 7r6VsXzm+exbAhJyZ/dKx7U4HRfpTiup+RgR1SjuraJCwPLp9m+Q2Uaj1Dhc276xNoPT 2NRA== X-Gm-Message-State: AJIora8l5vmy13bSXmf4gKPUbXtsHGV1enXpeZ9Deu98Cu9TlVIUYETe LHpM8aIFUubDywVbmpNrSe0M7w== X-Google-Smtp-Source: AGRyM1tbthc+9spb5S+wWJFbOwwDqsIQ0p+TbIcXSswRVvW+f3FRIUOzzHyQP2Y909DPGDD80G8XRA== X-Received: by 2002:a17:902:d2d1:b0:16c:223e:a3db with SMTP id n17-20020a170902d2d100b0016c223ea3dbmr17861909plc.37.1658848238620; Tue, 26 Jul 2022 08:10:38 -0700 (PDT) Received: from google.com (7.104.168.34.bc.googleusercontent.com. [34.168.104.7]) by smtp.gmail.com with ESMTPSA id j1-20020a654d41000000b003fadd680908sm10389894pgt.83.2022.07.26.08.10.37 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 26 Jul 2022 08:10:37 -0700 (PDT) Date: Tue, 26 Jul 2022 15:10:34 +0000 From: Sean Christopherson To: Andrei Vagin Cc: Paolo Bonzini , linux-kernel@vger.kernel.org, kvm@vger.kernel.org, Wanpeng Li , Vitaly Kuznetsov , Jianfeng Tan , Adin Scannell , Konstantin Bogomolov , Etienne Perot , Andy Lutomirski , Thomas Gleixner , Ingo Molnar , Borislav Petkov , Dave Hansen , x86@kernel.org, "H. Peter Anvin" Subject: Re: [PATCH 0/5] KVM/x86: add a new hypercall to execute host system Message-ID: References: <20220722230241.1944655-1-avagin@google.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org On Tue, Jul 26, 2022, Andrei Vagin wrote: > On Fri, Jul 22, 2022 at 4:41 PM Sean Christopherson wrote: > > > > +x86 maintainers, patch 1 most definitely needs acceptance from folks beyond KVM. > > > > On Fri, Jul 22, 2022, Andrei Vagin wrote: > > > Another option is the KVM platform. In this case, the Sentry (gVisor > > > kernel) can run in a guest ring0 and create/manage multiple address > > > spaces. Its performance is much better than the ptrace one, but it is > > > still not great compared with the native performance. This change > > > optimizes the most critical part, which is the syscall overhead. > > > > What exactly is the source of the syscall overhead, > > Here are perf traces for two cases: when "guest" syscalls are executed via > hypercalls and when syscalls are executed by the user-space VMM: > https://gist.github.com/avagin/f50a6d569440c9ae382281448c187f4e > > And here are two tests that I use to collect these traces: > https://github.com/avagin/linux-task-diag/commit/4e19c7007bec6a15645025c337f2e85689b81f99 > > If we compare these traces, we can find that in the second case, we spend extra > time in vmx_prepare_switch_to_guest, fpu_swap_kvm_fpstate, vcpu_put, > syscall_exit_to_user_mode. So of those, I think the only path a robust implementation can actually avoid, without significantly whittling down the allowed set of syscalls, is syscall_exit_to_user_mode(). The bulk of vcpu_put() is vmx_prepare_switch_to_host(), and KVM needs to run through that before calling out of KVM. E.g. prctrl(ARCH_GET_GS) will read the wrong GS.base if MSR_KERNEL_GS_BASE isn't restored. And that necessitates calling vmx_prepare_switch_to_guest() when resuming the vCPU. FPU state, i.e. fpu_swap_kvm_fpstate() is likely a similar story, there's bound to be a syscall that accesses user FPU state and will do the wrong thing if guest state is loaded. For gVisor, that's all presumably a non-issue because it uses a small set of syscalls (or has guest==host state?), but for a common KVM feature it's problematic. > > and what alternatives have been explored? Making arbitrary syscalls from > > within KVM is mildly terrifying. > > "mildly terrifying" is a good sentence in this case:). If I were in your place, > I would think about it similarly. > > I understand these concerns about calling syscalls from the KVM code, and this > is why I hide this feature under a separate capability that can be enabled > explicitly. > > We can think about restricting the list of system calls that this hypercall can > execute. In the user-space changes for gVisor, we have a list of system calls > that are not executed via this hypercall. Can you provide that list? > But it has downsides: > * Each sentry system call trigger the full exit to hr3. > * Each vmenter/vmexit requires to trigger a signal but it is expensive. Can you explain this one? I didn't quite follow what this is referring to. > * It doesn't allow to support Confidential Computing (SEV-ES/SGX). The Sentry > has to be fully enclosed in a VM to be able to support these technologies. Speaking of SGX, this reminds me a lot of Graphene, SCONEs, etc..., which IIRC tackled the "syscalls are crazy expensive" problem by using a message queue and a dedicated task outside of the enclave to handle syscalls. Would something like that work, or is having to burn a pCPU (or more) to handle syscalls in the host a non-starter?