Re: [RFC PATCH 00/47] Address Space Isolation for KVM

From: Alexandre Chartre <alexandre.chartre@oracle.com>
To: Junaid Shahid <junaids@google.com>, linux-kernel@vger.kernel.org
Cc: kvm@vger.kernel.org, pbonzini@redhat.com, jmattson@google.com,
	pjt@google.com, oweisse@google.com, rppt@linux.ibm.com,
	dave.hansen@linux.intel.com, peterz@infradead.org,
	tglx@linutronix.de, luto@kernel.org, linux-mm@kvack.org
Subject: Re: [RFC PATCH 00/47] Address Space Isolation for KVM
Date: Fri, 8 Apr 2022 10:52:12 +0200
Message-ID: <0813c9da-f91d-317e-2eda-f2ed0b95385f@oracle.com>
In-Reply-To: <a23e32d3-9738-278b-42d3-5fe45cfab721@google.com>


On 3/23/22 20:35, Junaid Shahid wrote:
> On 3/22/22 02:46, Alexandre Chartre wrote:
>> 
>> On 3/18/22 00:25, Junaid Shahid wrote:
>>> 
>>> I agree that it is not secure to run one sibling in the
>>> unrestricted kernel address space while the other sibling is
>>> running in an ASI restricted address space, without doing a cache
>>> flush before re-entering the VM. However, I think that avoiding
>>> this situation does not require doing a sibling stun operation
>>> immediately after VM Exit. The way we avoid it is as follows.
>>> 
>>> First, we always use ASI in conjunction with core scheduling.
>>> This means that if HT0 is running a VCPU thread, then HT1 will be
>>> running either a VCPU thread of the same VM or the Idle thread.
>>> If it is running a VCPU thread, then if/when that thread takes a
>>> VM Exit, it will also be running in the same ASI restricted
>>> address space. For the idle thread, we have created another ASI
>>> Class, called Idle-ASI, which maps only globally non-sensitive
>>> kernel memory. The idle loop enters this ASI address space.
>>> 
>>> This means that when HT0 does a VM Exit, HT1 will either be
>>> running the guest code of a VCPU of the same VM, or it will be
>>> running kernel code in either a KVM-ASI or the Idle-ASI address
>>> space. (If HT1 is already running in the full kernel address
>>> space, that would imply that it had previously done an ASI Exit,
>>> which would have triggered a stun_sibling, which would have
>>> already caused HT0 to exit the VM and wait in the kernel).
>> 
>> Note that using core scheduling (or not) is a detail, what is
>> important is whether HT are running with ASI or not. Running core
>> scheduling will just improve chances to have all siblings run ASI
>> at the same time and so improve ASI performances.
>> 
>> 
>>> If HT1 now does an ASI Exit, that will trigger the
>>> stun_sibling() operation in its pre_asi_exit() handler, which
>>> will set the state of the core/HT0 to Stunned (and possibly send
>>> an IPI too, though that will be ignored if HT0 was already in
>>> kernel mode). Now when HT0 tries to re-enter the VM, since its
>>> state is set to Stunned, it will just wait in a loop until HT1
>>> does an unstun_sibling() operation, which it will do in its
>>> post_asi_enter handler the next time it does an ASI Enter (which
>>> would be either just before VM Enter if it was KVM-ASI, or in the
>>> next iteration of the idle loop if it was Idle-ASI). In either
>>> case, HT1's post_asi_enter() handler would also do a
>>> flush_sensitive_cpu_state operation before the unstun_sibling(), 
>>> so when HT0 gets out of its wait-loop and does a VM Enter, there
>>> will not be any sensitive state left.
>>> 
>>> One thing that probably was not clear from the patch, is that
>>> the stun state check and wait-loop is still always executed
>>> before VM Enter, even if no ASI Exit happened in that execution.
>>> 
>> 
>> So if I understand correctly, you have following sequence:
>> 
>> 0 - Initially state is set to "stunned" for all cpus (i.e. a cpu
>> should wait before VMEnter)
>> 
>> 1 - After ASI Enter: Set sibling state to "unstunned" (i.e. sibling
>> can do VMEnter)
>> 
>> 2 - Before VMEnter : wait while my state is "stunned"
>> 
>> 3 - Before ASI Exit : Set sibling state to "stunned" (i.e. sibling
>> should wait before VMEnter)
>> 
>> I have tried this kind of implementation, and the problem is with
>> step 2 (wait while my state is "stunned"); how do you wait exactly?
>> You can't just do an active wait otherwise you have all kind of
>> problems (depending if you have interrupts enabled or not)
>> especially as you don't know how long you have to wait for (this
>> depends on what the other cpu is doing).
> 
> In our stunning implementation, we do an active wait with interrupts 
> enabled and with a need_resched() check to decide when to bail out
> to the scheduler (plus we also make sure that we re-enter ASI at the
> end of the wait in case some interrupt exited ASI). What kind of
> problems have you run into with an active wait, besides wasted CPU
> cycles?

If you wait with interrupts enabled, then there is a window after the
wait and before interrupts get disabled where a cpu can get an interrupt
and exit ASI while the sibling is entering the VM. Also, after a cpu has
passed the wait and has disabled interrupts, it can't be notified if the
sibling has exited ASI:

T+01 - cpu A and B enter ASI - interrupts are enabled
T+02 - cpu A and B pass the wait because both are using ASI - interrupts are enabled
T+03 - cpu A gets an interrupt
T+04 - cpu B disables interrupts
T+05 - cpu A exits ASI and processes the interrupt
T+06 - cpu B enters VM  => cpu B runs VM while cpu A is not using ASI
T+07 - cpu B exits VM
T+08 - cpu B exits ASI
T+09 - cpu A returns from interrupt
T+10 - cpu A disables interrupts and enters VM => cpu A runs VM while cpu B is not using ASI
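
The timeline above can be replayed as a small state model. This is only a
sketch: struct cpu, check() and replay_timeline() are hypothetical names
that do not come from the patch series, and the model assumes that cpu A
re-enters ASI after returning from the interrupt (T+09) before it enters
the VM. The only invariant it checks is that a cpu runs the VM while its
sibling is in ASI.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical model of one hyperthread; not taken from the patch series. */
struct cpu {
	bool in_asi;   /* running in an ASI restricted address space */
	bool irqs_on;  /* interrupts enabled */
	bool in_vm;    /* executing guest code */
};

enum { CPU_A, CPU_B, NR_CPUS };

static struct cpu cpus[NR_CPUS];

/*
 * Invariant under test: while a cpu runs the VM, its sibling must be in ASI.
 * Prints each violation and returns how many were found at this step.
 */
static int check(const char *step)
{
	int bad = 0;

	for (int i = 0; i < NR_CPUS; i++) {
		int sib = i ^ 1; /* the other hyperthread of the core */

		if (cpus[i].in_vm && !cpus[sib].in_asi) {
			printf("%s: cpu %c runs VM while cpu %c is not using ASI\n",
			       step, 'A' + i, 'A' + sib);
			bad++;
		}
	}
	return bad;
}

/* Replays T+01..T+10 and returns the total number of violations seen. */
int replay_timeline(void)
{
	int bad = 0;

	/* T+01: both cpus enter ASI, interrupts enabled. */
	cpus[CPU_A] = (struct cpu){ .in_asi = true, .irqs_on = true };
	cpus[CPU_B] = (struct cpu){ .in_asi = true, .irqs_on = true };
	bad += check("T+01");

	/* T+02/T+03: both pass the wait, interrupt pending on A; no state change. */
	bad += check("T+03");

	cpus[CPU_B].irqs_on = false;   /* T+04: B disables interrupts */
	bad += check("T+04");
	cpus[CPU_A].in_asi = false;    /* T+05: A exits ASI to handle the interrupt */
	bad += check("T+05");
	cpus[CPU_B].in_vm = true;      /* T+06: B enters the VM */
	bad += check("T+06");
	cpus[CPU_B].in_vm = false;     /* T+07: B exits the VM */
	bad += check("T+07");
	cpus[CPU_B].in_asi = false;    /* T+08: B exits ASI */
	bad += check("T+08");
	cpus[CPU_A].in_asi = true;     /* T+09: A returns; assumed to re-enter ASI */
	bad += check("T+09");
	cpus[CPU_A].irqs_on = false;   /* T+10: A disables interrupts ... */
	cpus[CPU_A].in_vm = true;      /* ... and enters the VM */
	bad += check("T+10");

	return bad;
}
```

In this model, replay_timeline() finds two violations, at T+06 and at
T+10. The model has no stun state or IPI at all, so it only illustrates
the ordering problem; it says nothing about whether a given stunning
implementation closes the window.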


> In any case, the specific stunning mechanism is orthogonal to ASI.
> This implementation of ASI can be integrated with different stunning
> implementations. The "kernel core scheduling" that you proposed is
> also an alternative to stunning and could be similarly integrated
> with ASI.

Yes, but for ASI to be relevant with KVM to prevent data leaks, you need
a fully functional and reliable stunning mechanism, otherwise ASI is
useless. That's why I think it is better to first focus on having an
effective stunning mechanism and then implement ASI.


alex.
