Date: Thu, 5 Jul 2018 16:07:32 -0700
From: Andi Kleen
Subject: [MODERATED] Interrupts policy for SMT
Message-ID: <20180705230732.GJ17013@tassilo.jf.intel.com>
To: speck@linutronix.de

Hi,

When SMT is on for a guest and there is an interrupt on the sibling, the guest may be able to look at the data processed by the interrupt through L1TF.

This is mainly an issue when SMT is on, but the guest is confined to a small number of exclusive cores that don't run any other processes, e.g. through an exclusive cpuset.

There are three possible solutions for this:

1. Ask the user to change the affinity of interrupts away to other cores
2. Explicitly force the sibling to exit when an interrupt is processed
3. Consider it low risk and allow it

1. Ask the user to change affinity when running confined guests

This is somewhat complicated for the users. They would need a script that moves all interrupts away from the set of cores used by VMs through /proc/irq/*/smp_affinity. It would be possible to provide a tool that does this.

It also implies that at least one core needs to be left for interrupt processing.

For MSI-X drivers that bind an interrupt to every core and don't allow overriding it, this wouldn't work. However, those typically only process interrupts on CPUs that actually initiated the IO, and cores only running guests shouldn't initiate any.

One case where this might be violated is when a network driver uses a hash function to select the queue, and thus the CPU.
However, we expect network traffic to be encrypted anyway, so the risk of leaking something here should be low.

2. Explicitly force the sibling to exit for interrupts

The basic idea is to maintain the "I am in a guest" and "I am in an interrupt" states in two cache lines. The sibling always checks the guest state before executing interrupts. If it is set, it sends an IPI. KVM also needs to check the interrupt state and wait for interrupts to complete before entering the guest.

The IPI isn't actually fully executed because KVM can intercept it during an exit and ack it directly.

I did a prototype patch for this and we saw ~2% performance loss for doing a kernel build inside the guest. This was before the optimization of not executing the IPI; with that I would expect it to be faster.

I wouldn't expect any performance loss when the guest is not active, as it just checks a cache line that is never written (but this still needs to be verified).

So this has some overhead, but it would have the advantage that the user doesn't need to change the affinity, and it doesn't have the potential issue with hashed MSI-X interrupts.

3. Consider it low risk and allow it

Sensitive data is any user data and any kernel secrets, such as encryption keys. Ordinary kernel pointers are normally not sensitive, except with respect to KASLR (which is very hard to maintain with L1TF).

Interrupts for modern devices normally don't touch user data directly because they operate with DMA only, so they only handle descriptors and similar. So they should not leak any data. Really old interrupt handlers of course may use PIO, but we don't expect those to be used anymore.

Another case is the soft interrupts. For example, they may run the network stack, which can in some exceptional cases copy user data (for example TCP on a retransmit with MTU change if the device doesn't support full scatter-gather). However, we expect network traffic to be already encrypted, so this should be low risk.

Another case is timer handlers.
It's hard to say for sure, but my assumption is that they generally don't directly touch user data either.

For both timers and the networking stack, the processing also happens only on the cores that actually initiated a transaction, which we don't expect cores that only run guests to do frequently.

Of course, really verifying this for all interrupt handlers would be a daunting task. However, from a very preliminary analysis it seems low risk.

In theory, hybrid solutions of (2) and (3) would also be possible. For example, some interrupts could be whitelisted, with the synchronization only done if something not audited runs, or in the case of softirqs the work could be pushed to ksoftirqd.

I'm thinking that for most cases recommending (3) may actually be a reasonable approach, although it is definitely somewhat hand-wavy.

We could do (2) or, better, a (2)/(3) hybrid (however, proper whitelisting may be a lot of effort).

(1) is somewhat ugly and should probably only be a last resort.

Comments?

-Andi
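
For reference, the tool mentioned under (1) could look roughly like the Python sketch below. This is only an illustration, not part of the prototype: all function names are made up for this example, and it assumes the online CPUs are read from /sys/devices/system/cpu/online and that guest_spec lists every hardware thread of the cores reserved for VMs. It computes one mask that excludes the guest CPUs, insists that at least one CPU is left for interrupt processing, and reports interrupts whose affinity the kernel refuses to change, as can happen with the MSI-X case above.

```python
import glob
import os


def parse_cpu_list(spec):
    """Parse a cpulist string such as "2-3,6" into a set of CPU numbers."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def format_affinity_mask(cpus):
    """Format CPUs as a hex bitmask in comma-separated 32-bit groups,
    most significant group first, as used by smp_affinity."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    groups = []
    while True:
        groups.append("%08x" % (mask & 0xFFFFFFFF))
        mask >>= 32
        if not mask:
            break
    return ",".join(reversed(groups))


def irq_mask_avoiding(online_cpus, guest_cpus):
    """Return the affinity mask covering all online CPUs except guest CPUs."""
    allowed = set(online_cpus) - set(guest_cpus)
    if not allowed:
        # At least one CPU has to be left for interrupt processing.
        raise ValueError("no CPU left for interrupt processing")
    return format_affinity_mask(allowed)


def move_irqs_away(guest_spec, proc_irq="/proc/irq", dry_run=True):
    """Point every IRQ at the non-guest CPUs; return the paths that failed.

    Hypothetical helper: with dry_run=True it only prints what it would do.
    """
    with open("/sys/devices/system/cpu/online") as f:
        online = parse_cpu_list(f.read())
    mask = irq_mask_avoiding(online, parse_cpu_list(guest_spec))
    skipped = []
    for path in sorted(glob.glob(os.path.join(proc_irq, "*", "smp_affinity"))):
        if dry_run:
            print(path, "<-", mask)
            continue
        try:
            with open(path, "w") as f:
                f.write(mask)
        except OSError:
            # Some interrupts do not allow their affinity to be overridden.
            skipped.append(path)
    return skipped
```

For example, on an 8-CPU machine with CPUs 2, 3 and 6 reserved for guests, irq_mask_avoiding(set(range(8)), {2, 3, 6}) gives "000000b3", i.e. CPUs 0, 1, 4, 5 and 7. Anything returned by move_irqs_away() with dry_run=False would still need to be covered by (2) or (3).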