From: "Kalra, Ashish" <ashish.kalra@amd.com>
To: Mingwei Zhang <mizhang@google.com>,
Sean Christopherson <seanjc@google.com>,
Jacky Li <jackyli@google.com>
Cc: isaku.yamahata@intel.com, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, isaku.yamahata@gmail.com,
Michael Roth <michael.roth@amd.com>,
Paolo Bonzini <pbonzini@redhat.com>,
erdemaktas@google.com, Sagi Shahar <sagis@google.com>,
David Matlack <dmatlack@google.com>,
Kai Huang <kai.huang@intel.com>,
Zhi Wang <zhi.wang.linux@gmail.com>,
chen.bo@intel.com, linux-coco@lists.linux.dev,
Chao Peng <chao.p.peng@linux.intel.com>,
Ackerley Tng <ackerleytng@google.com>,
Vishal Annapurve <vannapurve@google.com>,
Yuan Yao <yuan.yao@linux.intel.com>,
Jarkko Sakkinen <jarkko@kernel.org>,
Xu Yilun <yilun.xu@intel.com>,
Quentin Perret <qperret@google.com>,
wei.w.wang@intel.com, Fuad Tabba <tabba@google.com>
Subject: Re: [PATCH 4/8] KVM: gmem: protect kvm_mmu_invalidate_end()
Date: Mon, 21 Aug 2023 16:44:37 -0500 [thread overview]
Message-ID: <df49bbb2-92c0-7792-ab90-e748be570b5d@amd.com> (raw)
In-Reply-To: <CAL715WL9TJzDxZE8_gfhUQFGtOAydG0kyuSbzkqWTs3pc57j7A@mail.gmail.com>
Hello Mingwei & Sean,
On 8/18/2023 9:08 PM, Mingwei Zhang wrote:
> +Jacky Li
>
> On Fri, Aug 18, 2023 at 3:45 PM Sean Christopherson <seanjc@google.com> wrote:
>>
>> +Mingwei to correct me if I'm wrong
>>
>> On Fri, Aug 18, 2023, Ashish Kalra wrote:
>>>
>>> On 8/18/2023 12:55 PM, Sean Christopherson wrote:
>>>> On Tue, Aug 15, 2023, isaku.yamahata@intel.com wrote:
>>>>> From: Isaku Yamahata <isaku.yamahata@intel.com>
>>>>>
>>>>> kvm_mmu_invalidate_end() updates struct kvm::mmu_invalidate_in_progress
>>>>> and is protected by kvm::mmu_lock. Call kvm_mmu_invalidate_end() before
>>>>> unlocking mmu_lock, not after the unlock.
>>>>>
>>>>>
>>>>> Fixes: 8e9009ca6d14 ("KVM: Introduce per-page memory attributes")
>>>>
>>>> This Fixes tag is wrong.  It won't matter in the long run, but it makes my life that
>>>> much harder.
>>>>
>>>>> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
>>>>> ---
>>>>> virt/kvm/kvm_main.c | 15 ++++++++++++++-
>>>>> 1 file changed, 14 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>>>> index 8bfeb615fc4d..49380cd62367 100644
>>>>> --- a/virt/kvm/kvm_main.c
>>>>> +++ b/virt/kvm/kvm_main.c
>>>>> @@ -535,6 +535,7 @@ struct kvm_mmu_notifier_range {
>>>>> } arg;
>>>>> gfn_handler_t handler;
>>>>> on_lock_fn_t on_lock;
>>>>> + on_unlock_fn_t before_unlock;
>>>>> on_unlock_fn_t on_unlock;
>>>>
>>>> Ugh, shame on my past me. Having on_lock and on_unlock be asymmetrical with respect
>>>> to the lock is nasty.
>>>>
>>>> I would much rather we either (a) be explicit, e.g. before_(un)lock and after_(un)lock,
>>>> or (b) have just on_(un)lock, make them symmetrical, and handle the SEV mess a
>>>> different way.
>>>>
>>>> The SEV hook doesn't actually care about running immediately after unlock, it just
>>>> wants to know if there was an overlapping memslot. It can run after SRCU is dropped,
>>>> because even if we make the behavior more precise (right now it blasts WBINVD),
>>>> just having a reference to memslots isn't sufficient, the code needs to guarantee
>>>> memslots are *stable*. And that is already guaranteed by the notifier code, i.e.
>>>> the SEV code could just reacquire SRCU.
>>>
>>> On a separate note here, the SEV hook blasting WBINVD is still causing
>>> serious performance degradation issues with SNP triggered via
>>> AutoNUMA/numad/KSM, etc. With reference to previous discussions related to
>>> it, we have plans to replace WBINVD with CLFLUSHOPT.
>>
>> Isn't the flush unnecessary when freeing shared memory?  My recollection is that
>> the problematic scenario is when encrypted memory is freed back to the host,
>> because KVM already flushes when mapping potentially encrypted memory into the
>> guest.
>>
>> With SNP+guest_memfd, private/encrypted memory should be unreachable via the
>> hva-based mmu_notifiers.  gmem should have full control of the page lifecycles,
>> i.e. can get the kernel virtual address as appropriate, and so SNP shouldn't
>> need the nuclear option.
>>
>> E.g. something like this?
>>
>> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
>> index 07756b7348ae..1c6828ae391d 100644
>> --- a/arch/x86/kvm/svm/sev.c
>> +++ b/arch/x86/kvm/svm/sev.c
>> @@ -2328,7 +2328,7 @@ static void sev_flush_encrypted_page(struct kvm_vcpu *vcpu, void *va)
>>
>> void sev_guest_memory_reclaimed(struct kvm *kvm)
>> {
>> - if (!sev_guest(kvm))
>> + if (!sev_guest(kvm) || sev_snp_guest(kvm))
>> return;
>>
>> wbinvd_on_all_cpus();
>
> I hope this is the final solution :)
>
> So, short answer: no.
>
> SNP+guest_memfd prevents untrusted host user space from directly
> modifying the data, which is good enough for CVE-2022-0171, but there
> is no guarantee that the host kernel could not, in some scenarios,
> access the data and generate dirty cache lines. In fact, AFAICT, an
> SNP VM does not track whether each page was previously shared, does
> it? If a page was previously shared and was written by the host kernel
> or devices before it was converted to private, no one tracks that, and
> the dirty cache lines are still there!
>
> So, to avoid any corner case situations like the above, it seems
> currently we have to retain the property: flushing the cache when the
> guest memory mapping leaves KVM NPT.
>
> Of course, this is fundamentally because SME_COHERENT only applies to
> CPU cores, but not to DMA. If SME_COHERENT were complete, flushing
> would no longer be needed. Alternatively, we need extra bookkeeping
> for KVM to know whether each page has dirty cache lines. Another
> alternative is to filter on the mmu_notifier reasons, which is the
> approach I am planning to take. Thoughts?
>
Now, running SNP+guest_memfd with the discard=both option enabled:
# bpftrace -e 'kprobe:sev_guest_memory_reclaimed {@[kstack]=count()}'
Attaching 1 probe...
^C
@[
sev_guest_memory_reclaimed+5
kvm_mmu_notifier_release+60
__mmu_notifier_release+128
exit_mmap+657
__mmput+72
mmput+49
do_exit+752
do_group_exit+57
get_signal+2486
arch_do_signal_or_restart+51
exit_to_user_mode_prepare+257
syscall_exit_to_user_mode+42
do_syscall_64+109
entry_SYSCALL_64_after_hwframe+114
]: 1
@[
sev_guest_memory_reclaimed+5
kvm_mmu_notifier_invalidate_range_start+869
__mmu_notifier_invalidate_range_start+152
change_protection+4628
change_prot_numa+93
task_numa_work+588
task_work_run+108
exit_to_user_mode_prepare+337
syscall_exit_to_user_mode+42
do_syscall_64+109
entry_SYSCALL_64_after_hwframe+114
]: 2
@[
sev_guest_memory_reclaimed+5
kvm_mmu_notifier_invalidate_range_start+869
__mmu_notifier_invalidate_range_start+152
change_protection+4628
change_prot_numa+93
task_numa_work+588
task_work_run+108
xfer_to_guest_mode_handle_work+228
kvm_arch_vcpu_ioctl_run+1572
kvm_vcpu_ioctl+671
__x64_sys_ioctl+153
do_syscall_64+96
entry_SYSCALL_64_after_hwframe+114
]: 2
@[
sev_guest_memory_reclaimed+5
kvm_set_memslot+740
__kvm_set_memory_region.part.0+411
kvm_set_memory_region+89
kvm_vm_ioctl+1482
__x64_sys_ioctl+153
do_syscall_64+96
entry_SYSCALL_64_after_hwframe+114
]: 104
@[
sev_guest_memory_reclaimed+5
kvm_mmu_notifier_invalidate_range_start+869
__mmu_notifier_invalidate_range_start+152
zap_page_range_single+384
unmap_mapping_range+279
shmem_fallocate+932
vfs_fallocate+345
__x64_sys_fallocate+71
do_syscall_64+96
entry_SYSCALL_64_after_hwframe+114
]: 5465
@[
sev_guest_memory_reclaimed+5
kvm_mmu_notifier_invalidate_range_start+869
__mmu_notifier_invalidate_range_start+152
zap_page_range_single+384
madvise_vma_behavior+1967
madvise_walk_vmas+190
do_madvise.part.0+264
__x64_sys_madvise+98
do_syscall_64+96
entry_SYSCALL_64_after_hwframe+114
]: 69677
The maximum hits are seen with shmem_fallocate and madvise, which we
believe are in response to shared<->private GHCB page-state-change
requests. discard=both handles discard for both private and shared
memory, so freeing shared memory via
fallocate(shared_memfd, FALLOC_FL_PUNCH_HOLE, ...) triggers the
notifiers when freeing shared pages after the guest converts a GPA to
private.

Now, as with SNP+guest_memfd guest private memory is no longer mapped
in the host, I added a generic fix (instead of Sean's proposed patch of
checking for an SNP guest inside sev_guest_memory_reclaimed()):
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -593,6 +593,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 			unsigned long hva_start, hva_end;
 
 			slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
+			if (kvm_slot_can_be_private(slot)) {
+				continue;
+			}
 			hva_start = max(range->start, slot->userspace_addr);
 			hva_end = min(range->end, slot->userspace_addr + (slot->npages << PAGE_SHIFT));
With this fix added, the traces are as follows:
# bpftrace -e 'kprobe:sev_guest_memory_reclaimed {@[kstack]=count()}'
Attaching 1 probe...
^C
@[
sev_guest_memory_reclaimed+5
kvm_mmu_notifier_invalidate_range_start+812
__mmu_notifier_invalidate_range_start+152
change_protection+4628
change_prot_numa+93
task_numa_work+588
task_work_run+108
exit_to_user_mode_prepare+337
syscall_exit_to_user_mode+42
do_syscall_64+109
entry_SYSCALL_64_after_hwframe+114
]: 1
@[
sev_guest_memory_reclaimed+5
kvm_mmu_notifier_release+60
__mmu_notifier_release+128
exit_mmap+657
__mmput+72
mmput+49
do_exit+752
do_group_exit+57
get_signal+2486
arch_do_signal_or_restart+51
exit_to_user_mode_prepare+257
syscall_exit_to_user_mode+42
do_syscall_64+109
entry_SYSCALL_64_after_hwframe+114
]: 1
@[
sev_guest_memory_reclaimed+5
kvm_mmu_notifier_invalidate_range_start+812
__mmu_notifier_invalidate_range_start+152
change_protection+4628
change_prot_numa+93
task_numa_work+588
task_work_run+108
xfer_to_guest_mode_handle_work+228
kvm_arch_vcpu_ioctl_run+1572
kvm_vcpu_ioctl+671
__x64_sys_ioctl+153
do_syscall_64+96
entry_SYSCALL_64_after_hwframe+114
]:
@[
sev_guest_memory_reclaimed+5
kvm_set_memslot+740
__kvm_set_memory_region.part.0+411
kvm_set_memory_region+89
kvm_vm_ioctl+1482
__x64_sys_ioctl+153
do_syscall_64+96
entry_SYSCALL_64_after_hwframe+114
]: 104
#
As expected, the SEV hook is not invoked for the guest private memory
pages (no more invalidations from shmem_fallocate() + madvise()).

Isn't it better to skip invoking the KVM MMU invalidation notifier when
the invalidated range belongs to guest private memory?
> In fact, AFAICT, an SNP VM does
> not track whether each page was previously shared, does it? If a page
> was previously shared and was written by the host kernel or devices
> before it was converted to private, no one tracks that, and the dirty
> cache lines are still there!
The skipped invalidation here covers the case Mingwei mentioned above,
where pages are converted from private->shared and the subsequent
freeing of those shared pages triggers the invalidation.

But then why are we concerned about this? I thought our concern was the
case where the dirty cache lines contain encrypted guest data?
Thanks,
Ashish