From: "Srivatsa S. Bhat"
Subject: Re: [PATCH v2] x86/hotplug: Do not put offline vCPUs in mwait idle state
Date: Wed, 25 Jan 2023 18:14:01 -0800
To: Sean Christopherson, Igor Mammedov
Cc: x86@kernel.org, kvm@vger.kernel.org, Peter Zijlstra, Dave Hansen,
 virtualization@lists.linux-foundation.org, ganb@vmware.com,
 ankitja@vmware.com, namit@vmware.com, "H. Peter Anvin",
 "Rafael J. Wysocki", VMware PV-Drivers Reviewers, Ingo Molnar,
 bordoloih@vmware.com, xen-devel@lists.xenproject.org, Wyes Karny,
 Tom Lendacky, "Paul E. McKenney", keerthanak@vmware.com,
 Borislav Petkov, Thomas Gleixner, Juergen Gross, amakhalov@vmware.com,
 linux-kernel@vger.kernel.org, Lewis Caroll, blamoreaux@vmware.com
References: <20230116060134.80259-1-srivatsa@csail.mit.edu>
 <20230116155526.05d37ff9@imammedo.users.ipa.redhat.com>
 <87bkmui5z4.ffs@tglx>
 <20230120163734.63e62444@imammedo.users.ipa.redhat.com>

Hi Igor and Sean,

On 1/20/23 10:35 AM, Sean Christopherson wrote:
> On Fri, Jan 20, 2023, Igor Mammedov wrote:
>> On Fri, 20 Jan 2023 05:55:11 -0800
>> "Srivatsa S. Bhat" wrote:
>>
>>> Hi Igor and Thomas,
>>>
>>> Thank you for your review!
>>>
>>> On 1/19/23 1:12 PM, Thomas Gleixner wrote:
>>>> On Mon, Jan 16 2023 at 15:55, Igor Mammedov wrote:
>>>>> "Srivatsa S. Bhat" wrote:
>>>>>> Fix this by preventing the use of the mwait idle state in the vCPU
>>>>>> offline play_dead() path for any hypervisor, even if mwait support
>>>>>> is available.
>>>>>
>>>>> If mwait is enabled, the guest very likely has cpuidle enabled and
>>>>> is using the same mwait as well. So exiting early from
>>>>> mwait_play_dead() might just punt the workflow down:
>>>>>   native_play_dead()
>>>>>     ...
>>>>>     mwait_play_dead();
>>>>>     if (cpuidle_play_dead())   <- possible mwait here
>>>>>         hlt_play_dead();
>>>>>
>>>>> and it will end up in mwait again; only if that fails will it take
>>>>> the HLT route and maybe transition to the VMM.
>>>>
>>>> Good point.
>>>>
>>>>> Instead of a workaround on the guest side, shouldn't the hypervisor
>>>>> force a VMEXIT on the vCPU being unplugged when it's actually
>>>>> hot-unplugging it? (For example, QEMU kicks the vCPU out of guest
>>>>> context when it is removing the vCPU, among other things.)
>>>>
>>>> For a pure guest-side CPU unplug operation:
>>>>
>>>>   guest$ echo 0 >/sys/devices/system/cpu/cpu$N/online
>>>>
>>>> the hypervisor is not involved at all. The vCPU is not removed in
>>>> that case.
>>>>
>>>
>>> Agreed, and this is indeed the scenario I was targeting with this
>>> patch, as opposed to vCPU removal from the host side. I'll add this
>>> clarification to the commit message.
>
> Forcing HLT doesn't solve anything; it's perfectly legal to pass
> through HLT. I guarantee there are use cases that pass through HLT but
> _not_ MONITOR/MWAIT, and use cases that pass through all of them.
>
>> The commit message explicitly said:
>> "which prevents the hypervisor from running other vCPUs or workloads
>> on the corresponding pCPU."
>>
>> and that implies unplug on the hypervisor side as well.
>> Why? Because when the hypervisor exposes mwait to the guest, it has to
>> reserve/pin a pCPU for each of the present vCPUs. And you can safely
>> run other VMs/workloads on that pCPU only once it is no longer
>> possible for it to be reused by the VM where it was used originally.
>
> Pinning isn't strictly required from a safety perspective. The latency
> of context switching may suffer due to wake times, but preempting a
> vCPU that's in C1 (or deeper) won't cause functional problems. Passing
> through an entire socket (or whatever scope triggers extra fun) might
> be a different story, but pinning isn't strictly required.
>
> That said, I 100% agree that this is expected behavior and not a bug.
> Letting the guest execute MWAIT or HLT means the host won't have
> perfect visibility into guest activity state.
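
Right, that matches my understanding. Just to spell out the fallback
chain Igor quoted above, here is a toy model I put together while
digging into this (plain userspace C with stubbed-out bodies, *not* the
real arch/x86/kernel/smpboot.c code; the two booleans are assumptions
for the example). The point it illustrates: with MWAIT exposed, the
offlined vCPU parks inside the guest, and even the HLT fallback is only
visible to the host if HLT exits stay enabled.

/*
 * Toy model of the vCPU offline fallback chain -- illustration only.
 * The function names mirror the kernel's, but the bodies are stubs;
 * the real functions never return once the CPU is parked.
 * Build with: cc -Wall -o play_dead_model play_dead_model.c
 */
#include <stdbool.h>
#include <stdio.h>

/* Assumptions for the example: MWAIT is exposed to the guest and the
 * guest's cpuidle driver also uses MWAIT-based C-states. */
static bool mwait_usable = true;
static bool cpuidle_uses_mwait = true;

/* The real mwait_play_dead() returns void and, when MWAIT is usable,
 * never returns at all (the offlined CPU loops in MONITOR/MWAIT). The
 * bool here only lets the toy model fall through the same way the real
 * code does when MWAIT cannot be used. */
static bool mwait_play_dead(void)
{
        if (!mwait_usable)
                return false;   /* fall through to cpuidle/HLT */
        printf("offline vCPU parks in a MONITOR/MWAIT loop in the guest\n");
        return true;            /* no VMEXIT unless MWAIT exits are intercepted */
}

static int cpuidle_play_dead(void)
{
        if (cpuidle_uses_mwait) {
                printf("cpuidle enters its deepest state -- possibly MWAIT again, as Igor notes\n");
                return 0;       /* handled; HLT fallback is skipped */
        }
        return -1;              /* no usable state; caller falls back to HLT */
}

static void hlt_play_dead(void)
{
        printf("HLT loop: the host sees this only if HLT exits are enabled\n");
}

/* Mirrors the shape of native_play_dead(): mwait first, then cpuidle,
 * and HLT only as the last resort. */
static void native_play_dead(void)
{
        if (mwait_play_dead())
                return;
        if (cpuidle_play_dead())
                hlt_play_dead();
}

int main(void)
{
        native_play_dead();
        return 0;
}

So skipping just the first step, as my patch did, mostly moves the
problem into cpuidle_play_dead().
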
>
> Oversubscribing a pCPU and exposing MWAIT and/or HLT to vCPUs is
> generally not done, precisely because the guest will always appear
> busy without extra effort on the host. E.g. KVM requires an explicit
> opt-in from userspace to expose MWAIT and/or HLT.
>
> If someone really wants to efficiently oversubscribe pCPUs and pass
> through MWAIT, then their best option is probably a paravirt interface
> so that the guest can tell the host it's offlining a vCPU. Barring
> that, the host could inspect the guest when preempting a vCPU to try
> and guesstimate how much work the vCPU is actually doing, in order to
> make better scheduling decisions.
>
>> Now consider the following worst (and most likely) case without
>> unplug on the hypervisor side:
>>
>> 1. vm1mwait: pin pCPU2 to vCPU2
>> 2. vm1mwait: guest$ echo 0 >/sys/devices/system/cpu/cpu2/online
>>    -> HLT -> VMEXIT
>> --
>> 3. vm2mwait: pin pCPU2 to vCPUx and start the VM
>> 4. vm2mwait: guest OS onlines the vCPU and starts using it, incl.
>>    going into the idle => mwait state
>> --
>> 5. vm1mwait: it still thinks that vCPU is present, so it can
>>    rightfully do:
>>    guest$ echo 1 >/sys/devices/system/cpu/cpu2/online
>> --
>> 6.1 best case: vm1mwait's online fails after a timeout
>> 6.2 worst case: vm2mwait does a VMEXIT on vCPUx around the time-frame
>>     when vm1mwait onlines vCPU2; the online may succeed, and then
>>     vm2mwait's vCPUx will be stuck (possibly indefinitely) until, for
>>     some reason, a VMEXIT happens on vm1mwait's vCPU2 _and_ the host
>>     decides to schedule vCPUx on pCPU2, which would make vm1mwait
>>     stuck on vCPU2.
>> So either way it's expected behavior.
>>
>> And if there is no intention to unplug the vCPU on the hypervisor
>> side, then a VMEXIT on play_dead is not really necessary (mwait is
>> better than HLT), since the hypervisor can't safely reuse the pCPU
>> elsewhere and the vCPU goes into deep sleep within guest context.
>>
>> PS:
>> The only case where doing HLT/VMEXIT on play_dead might work out
>> would be if the new workload were neither pinned to the same pCPU nor
>> using mwait (i.e. the host can migrate it elsewhere and schedule
>> vCPU2 back on pCPU2).

That makes sense. Thank you both for the detailed explanation!
Let's drop this patch.

Regards,
Srivatsa
VMware Photon OS
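
P.S. For anyone else following the thread: if I understood Sean's point
about the explicit opt-in correctly, the knob on the KVM side is
KVM_CAP_X86_DISABLE_EXITS (I believe this is what QEMU's
"-overcommit cpu-pm=on" flips), and IIRC it needs to be enabled on the
VM fd before vCPUs are created. A rough userspace sketch, based on my
reading of Documentation/virt/kvm/api.rst; the choice to disable both
MWAIT and HLT exits and the bare-bones error handling are just
assumptions for the example, so please double-check before relying on
it:

/* Sketch: ask KVM not to intercept MWAIT/HLT for this VM, so the guest
 * can execute them natively (and the host loses visibility into the
 * guest's activity state, as discussed above). */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

int main(void)
{
        int kvm = open("/dev/kvm", O_RDWR);
        if (kvm < 0) { perror("open /dev/kvm"); return 1; }

        int vm = ioctl(kvm, KVM_CREATE_VM, 0);
        if (vm < 0) { perror("KVM_CREATE_VM"); return 1; }

        struct kvm_enable_cap cap;
        memset(&cap, 0, sizeof(cap));
        cap.cap = KVM_CAP_X86_DISABLE_EXITS;
        /* Example choice: let the guest run both MWAIT and HLT natively. */
        cap.args[0] = KVM_X86_DISABLE_EXITS_MWAIT | KVM_X86_DISABLE_EXITS_HLT;

        if (ioctl(vm, KVM_ENABLE_CAP, &cap) < 0) {
                perror("KVM_ENABLE_CAP(KVM_CAP_X86_DISABLE_EXITS)");
                return 1;
        }

        printf("MWAIT/HLT exits disabled for this VM\n");
        /* ... create vCPUs, set up guest memory, run the guest ... */
        return 0;
}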