Subject: [MODERATED] Re: L1D-Fault KVM mitigation
From: Paolo Bonzini
To: speck@linutronix.de
Date: Tue, 24 Apr 2018 14:53:15 +0200
Message-ID: <8cbc35b2-f75a-6357-014d-e20ff7284ac0@redhat.com>
In-Reply-To: <20180424093537.GC4064@hirez.programming.kicks-ass.net>
References: <20180424090630.wlghmrpasn7v7wbn@suse.de> <20180424093537.GC4064@hirez.programming.kicks-ass.net>

On 24/04/2018 11:35, speck for Peter Zijlstra wrote:
> I know that I worked a little with Tim on this, and I know Google did
> their own thing (but have not seen patches from them -- is pjt on this
> list?). I've also heard Amazon was also working on things (are they
> here?). And I think RHT was also looking into something (mingo, bonzini
> -- are you guys reading?)

Yes, I am.

First of all: the cost of doing an L1D flush on every vmentry is
absolutely horrible on KVM microbenchmarks, but seems a little better
(around 6% worst case) on syscall microbenchmarks.  "Message-passing"
workloads with vCPUs repeatedly going to sleep are the worst.
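For concreteness, here is a rough sketch of what that per-vmentry flush
boils down to.  The flush-MSR number/bit and the 64 KiB fallback buffer
size below are illustrative assumptions on my part, not a settled
interface:

/*
 * Illustrative sketch only (kernel-style): flush the L1D right before
 * vmentry.  The MSR path assumes a microcode-provided flush command
 * MSR; the fallback reads a buffer large enough to displace a 32 KiB
 * L1D.
 */
#define MSR_FLUSH_CMD        0x10b           /* assumption */
#define FLUSH_CMD_L1D        (1ULL << 0)     /* assumption */

#define L1D_FALLBACK_SIZE    (64 * 1024)
static u8 l1d_fallback_buf[L1D_FALLBACK_SIZE] __aligned(PAGE_SIZE);

static void l1d_flush_before_vmentry(bool have_flush_msr)
{
	int i;

	if (have_flush_msr) {
		wrmsrl(MSR_FLUSH_CMD, FLUSH_CMD_L1D);
		return;
	}

	/* Software fallback: touch every cache line of the buffer. */
	for (i = 0; i < L1D_FALLBACK_SIZE; i += 64)
		asm volatile("" : : "r" (l1d_fallback_buf[i]) : "memory");
}

Either variant adds a fixed cost to every single vmentry, which
presumably is why the message-passing workloads above (constant
sleep/wakeup, hence constant vmexit/vmentry) hurt the most.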
Second, hyperthreading in general doesn't exactly shine when running
many small virtual machines, since the VMs are unlikely to share any
code or data and each sibling effectively gets half the normal amount
of L1 cache.  Perhaps KSM shares the guest kernels and recovers some of
the icache (assuming there are kernel-heavy benchmarks that _also_
benefit from hyperthreading), but it is more likely to hurt performance
than to improve it.

Hyperthreading may provide slightly lower jitter when you run two
different guests on the siblings.  But with gang scheduling you
wouldn't do that, so that's not an issue.

As a result, in the overcommitted case the main issue is having to
explain to the customers that disabling hyperthreading is not that bad.
Even in the non-overcommitted case, there is a possibility that host
IRQs or NMIs happen, which as Thomas pointed out can also pollute the
cache.

The only case where hyperthreading may be salvaged is the one where:

1) every guest vCPU is pinned to its own physical CPU, and memory is
   also reserved up front because you use 1GB hugetlbfs;

2) host IRQs either use VT-d posted interrupts or are pinned away from
   the physical CPUs that run guests;

3) you are using nohz_full and other similar fine-tuned configuration
   to ensure that the guest CPUs run smoothly.

This includes NFV usecases, but big databases like SAP are also run
like this sometimes.

In this case, you'd need to add synchronization around vmexit.
However, these workloads _will_ actually do vmexits, sometimes a lot
of them (e.g. unless you use nohz_full in the guest as well, you'll
have vmexits to program the LAPIC timer).  Either all of those exits
have to pay the synchronization cost, or you have to decide arbitrarily
that some vmexits are "confined" and unlikely to pollute the cache; for
those you skip the synchronization and the L1D flush.  For example you
could say "anything that does not do get_user_pages is confined".

Because you've made this arbitrary choice, the synchronization is total
security theater unless you know what you're doing: no two guests on
the same core, no interrupt handlers that can run during a vmexit and
pollute the L1 cache (if that happens, the other sibling would be able
to read that data), and so on.

BUT: 1) I'm not saying hyperthreading is valuable in those cases, only
that it can be salvaged; 2) if you're paranoid you're more likely to
disable HT anyway.

So while I do plan to test what happens when we do synchronization,
it's far from certain that we're going to ship it.  And even that would
only happen if it is acceptable upstream -- I'm not going to make it a
special Red Hat-only patch.

Ingo suggested, for ease of testing and also for ease of deployment, a
knob to easily online/offline all siblings but the first on each core.
There's still the chance that some userspace daemon starts before
hyperthreading is software-disabled that way and gets confused by the
number of CPUs suddenly halving, so the knob would have to exist both
on the kernel command line and in debugfs.

Paolo