From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <ak@linux.intel.com>
Received: from mail.linutronix.de (146.0.238.70:993) by
  crypto-ml.lab.linutronix.de with IMAP4-SSL for <speck@linutronix.de>; 05 Jul
  2018 23:07:43 -0000
Received: from mga06.intel.com ([134.134.136.31])
	by Galois.linutronix.de with esmtps (TLS1.2:DHE_RSA_AES_256_CBC_SHA256:256)
	(Exim 4.80)
	(envelope-from <ak@linux.intel.com>)
	id 1fbDLh-00065h-Pw
	for speck@linutronix.de; Fri, 06 Jul 2018 01:07:42 +0200
Date: Thu, 5 Jul 2018 16:07:32 -0700
From: Andi Kleen <ak@linux.intel.com>
Subject: [MODERATED] Interrupts policy for SMT
Message-ID: <20180705230732.GJ17013@tassilo.jf.intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit
To: speck@linutronix.de
List-ID: <speck.linutronix.de>

Hi,

When SMT is on for a guest and there is an interrupt on the sibling, the guest
may be able to look at the data processed by the interrupt through L1TF.

This is mainly an issue when SMT is on, but the guest is confined
to a small number of exclusive cores that don't run any other processes,
e.g. through an exclusive cpuset.

There are three possible solutions for this:

1. Ask the user to move the affinity of interrupts away to other cores
2. Explicitly force the sibling to exit when an interrupt is processed
3. Consider it low risk and allow it


1. Ask the user to change affinity when running confined guests

This is somewhat complicated for the users. They would
need to have a script to move all the interrupts away from a given
set of cores used by VMs through /proc/irq/*/smp_affinity.
It would be possible to provide a tool that does this.
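
As a rough sketch of what such a tool could look like (this is not an
existing utility; the function names and the dry_run default are
assumptions made for illustration), the script below builds a mask of
all CPUs except the guest ones and writes it to every
/proc/irq/*/smp_affinity file. guest_cpus is assumed to already contain
both hyperthreads of every guest core. Some interrupts refuse affinity
changes, so failed writes are collected rather than treated as fatal.

```python
import glob
import os


def cpus_to_mask(cpus, ncpus):
    """Format a CPU set as comma-separated 32-bit hex groups, as used by smp_affinity."""
    value = 0
    for cpu in cpus:
        if not 0 <= cpu < ncpus:
            raise ValueError(f"CPU {cpu} out of range")
        value |= 1 << cpu
    ngroups = max(1, (ncpus + 31) // 32)
    groups = [(value >> (32 * i)) & 0xFFFFFFFF for i in range(ngroups)]
    # Most significant group comes first.
    return ",".join(f"{g:08x}" for g in reversed(groups))


def housekeeping_mask(guest_cpus, ncpus):
    """Mask of every CPU that is NOT reserved for guests."""
    allowed = set(range(ncpus)) - set(guest_cpus)
    if not allowed:
        # At least one core has to be left for interrupt processing.
        raise ValueError("no CPU left for interrupts")
    return cpus_to_mask(allowed, ncpus)


def move_irqs_away(guest_cpus, ncpus, proc_irq="/proc/irq", dry_run=True):
    """Point every IRQ at the housekeeping CPUs; returns (mask, failed paths)."""
    mask = housekeeping_mask(guest_cpus, ncpus)
    failed = []
    for path in sorted(glob.glob(os.path.join(proc_irq, "*", "smp_affinity"))):
        if dry_run:
            print(f"would write {mask} to {path}")
            continue
        try:
            with open(path, "w") as f:
                f.write(mask)
        except OSError:
            # Assumption: some IRQs cannot be moved; report instead of aborting.
            failed.append(path)
    return mask, failed
```

For example, move_irqs_away({2, 3, 6, 7}, 8) on an 8 thread machine would
only print what it would do; passing dry_run=False (as root) performs
the writes.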

It also implies that at least one core needs to be left for interrupt
processing.

For MSI-X drivers that bind an interrupt to every core and don't
allow overriding it, this wouldn't work. However those typically
only process interrupts on CPUs that actually initiated the IO,
and cores only running guests shouldn't do that.

One case where this might be violated is when a network driver uses a
hash function to select the queue, thus the cpu. However we expect network
traffic to be encrypted anyways, so the risk of leaking something
here should be low.


2. Explicitly force the sibling to exit for interrupts

The basic idea is to maintain the "I am in a guest" and "I am in
an interrupt" states in two cache lines.
The sibling always checks the guest state before executing interrupts. If
it is set, it sends an IPI. KVM also needs to check and wait
for interrupts before entering the guest.

The IPI isn't actually fully executed because KVM can intercept
it during an exit and ack it directly.
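
The handshake can be illustrated with a toy, single-threaded model of
one physical core with two hyperthreads. This is only a sketch of the
idea, not the prototype patch: the class and method names are made up,
the IPI is modeled as simply clearing the sibling's guest flag, and real
code would need atomic accesses and memory ordering that this model
ignores.

```python
class SmtCore:
    """Toy model of one core with two hyperthreads, numbered 0 and 1."""

    def __init__(self):
        self.in_guest = [False, False]  # "I am in a guest" flag per thread
        self.in_irq = [False, False]    # "I am in an interrupt" flag per thread
        self.ipis_sent = 0

    def irq_enter(self, thread):
        """Called before an interrupt handler runs on `thread`."""
        sibling = 1 - thread
        self.in_irq[thread] = True
        if self.in_guest[sibling]:
            # Model of the IPI: the sibling is forced out of the guest.
            self.ipis_sent += 1
            self.in_guest[sibling] = False

    def irq_exit(self, thread):
        """Called after the interrupt handler on `thread` has finished."""
        self.in_irq[thread] = False

    def try_guest_enter(self, thread):
        """KVM side: enter the guest only if the sibling is not in an interrupt."""
        sibling = 1 - thread
        if self.in_irq[sibling]:
            return False  # caller has to wait and retry
        self.in_guest[thread] = True
        return True
```

The property the model keeps is that a thread is never in the guest
while its sibling is inside an interrupt handler. When no guest is
running, irq_enter() only reads the flag and sends nothing.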

I did a prototype patch for this and we saw ~2% performance loss
for doing a kernel build inside the guest. This was before
the optimization of not executing the IPI; with that I would
expect it to be faster.

I wouldn't expect any performance loss when the guest is not active,
as it just checks a cache line that is never written (but
this still needs to be verified).

So this has some overhead, but it would have the advantage
that the user doesn't need to change the affinity, and doesn't
have the potential issue with hashed MSI-X interrupts.


3. Consider it low risk and allow it

Sensitive data is any user data and any kernel secrets, such as encryption
keys. Kernel pointers are normally not sensitive, except for KASLR
(which is very hard to maintain with L1TF).

Interrupts for modern devices normally don't touch user data directly 
because they operate with DMA only, so they only handle descriptors
and similar. So they should not leak any data.

Really old interrupt handlers of course may use PIO, but we don't
expect those to be used anymore.

Another case is the soft interrupts. For example it may run the
network stack, which can in some exceptional cases copy user data
(for example TCP on a retransmit with MTU change if the device
doesn't support full scatter-gather). However we expect network
traffic to be already encrypted, so this should be low risk.

Another case is timer handlers. It's hard to say for sure,
but my assumption is that they generally don't directly touch
user data either. 

For both timers and networking stack the processing is also
only on the cores that actually initiated a transaction, which
we don't expect cores that only run guests to do frequently.

Of course to really verify this for all interrupt handlers
would be a daunting task. However from a very preliminary
analysis it seems low risk.

In theory also hybrid solutions of (2) and (3) would be possible.
For example some interrupts could be white listed, and the synchronization
would only be done if something not audited runs, or in case of softirqs
it could be pushed to ksoftirqd.
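
The hybrid decision could look roughly like the sketch below. The
function name, the returned strings and the handler names in the
example are placeholders for illustration only; which handlers would
actually count as audited is exactly the open question.

```python
def irq_policy(handler, sibling_in_guest, is_softirq, audited):
    """Decide how to treat an interrupt under a hypothetical (2)/(3) hybrid."""
    if not sibling_in_guest or handler in audited:
        return "run"    # option (3): nothing to protect, or handler was audited
    if is_softirq:
        return "defer"  # push the softirq to ksoftirqd instead of running it here
    return "sync"       # option (2): force the sibling out of the guest first
```

For example, with audited = {"handler_a"} an unaudited hard interrupt
arriving while the sibling is in a guest gives "sync", while
"handler_a" gives "run".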


I'm thinking for most cases recommending (3) may actually be a reasonable
approach, although it is definitely somewhat hand-wavy.

We could do (2) or better a (2)/(3) hybrid (however proper
white listing may be a lot of effort). (3) is somewhat
ugly and should probably only be last resort.

Comments?

-Andi