From: "Kalra, Ashish" <ashish.kalra@amd.com>
To: Mingwei Zhang <mizhang@google.com>,
Sean Christopherson <seanjc@google.com>,
Jacky Li <jackyli@google.com>
Cc: isaku.yamahata@intel.com, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, isaku.yamahata@gmail.com,
Michael Roth <michael.roth@amd.com>,
Paolo Bonzini <pbonzini@redhat.com>,
erdemaktas@google.com, Sagi Shahar <sagis@google.com>,
David Matlack <dmatlack@google.com>,
Kai Huang <kai.huang@intel.com>,
Zhi Wang <zhi.wang.linux@gmail.com>,
chen.bo@intel.com, linux-coco@lists.linux.dev,
Chao Peng <chao.p.peng@linux.intel.com>,
Ackerley Tng <ackerleytng@google.com>,
Vishal Annapurve <vannapurve@google.com>,
Yuan Yao <yuan.yao@linux.intel.com>,
Jarkko Sakkinen <jarkko@kernel.org>,
Xu Yilun <yilun.xu@intel.com>,
Quentin Perret <qperret@google.com>,
wei.w.wang@intel.com, Fuad Tabba <tabba@google.com>
Subject: Re: [PATCH 4/8] KVM: gmem: protect kvm_mmu_invalidate_end()
Date: Mon, 21 Aug 2023 16:44:37 -0500 [thread overview]
Message-ID: <df49bbb2-92c0-7792-ab90-e748be570b5d@amd.com> (raw)
In-Reply-To: <CAL715WL9TJzDxZE8_gfhUQFGtOAydG0kyuSbzkqWTs3pc57j7A@mail.gmail.com>
Hello Mingwei & Sean,
On 8/18/2023 9:08 PM, Mingwei Zhang wrote:
> +Jacky Li
>
> On Fri, Aug 18, 2023 at 3:45 PM Sean Christopherson <seanjc@google.com> wrote:
>>
>> +Mingwei to correct me if I'm wrong
>>
>> On Fri, Aug 18, 2023, Ashish Kalra wrote:
>>>
>>> On 8/18/2023 12:55 PM, Sean Christopherson wrote:
>>>> On Tue, Aug 15, 2023, isaku.yamahata@intel.com wrote:
>>>>> From: Isaku Yamahata <isaku.yamahata@intel.com>
>>>>>
>>>>> kvm_mmu_invalidate_end() updates struct kvm::mmu_invalidate_in_progress
>>>>> and is protected by kvm::mmu_lock. Call kvm_mmu_invalidate_end() before
>>>>> unlocking mmu_lock, not after the unlock.
>>>>>
>>>>>
>>>>> Fixes: 8e9009ca6d14 ("KVM: Introduce per-page memory attributes")
>>>>
>>>> This Fixes tag is wrong.  It won't matter in the long run, but it makes my life that
>>>> much harder.
>>>>
>>>>> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
>>>>> ---
>>>>> virt/kvm/kvm_main.c | 15 ++++++++++++++-
>>>>> 1 file changed, 14 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>>>> index 8bfeb615fc4d..49380cd62367 100644
>>>>> --- a/virt/kvm/kvm_main.c
>>>>> +++ b/virt/kvm/kvm_main.c
>>>>> @@ -535,6 +535,7 @@ struct kvm_mmu_notifier_range {
>>>>> } arg;
>>>>> gfn_handler_t handler;
>>>>> on_lock_fn_t on_lock;
>>>>> + on_unlock_fn_t before_unlock;
>>>>> on_unlock_fn_t on_unlock;
>>>>
>>>> Ugh, shame on my past me. Having on_lock and on_unlock be asymmetrical with respect
>>>> to the lock is nasty.
>>>>
>>>> I would much rather we either (a) be explicit, e.g. before_(un)lock and after_(un)lock,
>>>> or (b) have just on_(un)lock, make them symmetrical, and handle the SEV mess a
>>>> different way.
>>>>
>>>> The SEV hook doesn't actually care about running immediately after unlock, it just
>>>> wants to know if there was an overlapping memslot. It can run after SRCU is dropped,
>>>> because even if we make the behavior more precise (right now it blasts WBINVD),
>>>> just having a reference to memslots isn't sufficient, the code needs to guarantee
>>>> memslots are *stable*. And that is already guaranteed by the notifier code, i.e.
>>>> the SEV code could just reacquire SRCU.
>>>
>>> On a separate note here, the SEV hook blasting WBINVD is still causing
>>> serious performance degradation issues with SNP triggered via
>>> AutoNUMA/numad/KSM, etc. With reference to previous discussions related to
>>> it, we have plans to replace WBINVD with CLFLUSHOPT.
>>
>> Isn't the flush unnecessary when freeing shared memory?  My recollection is that
>> the problematic scenario is when encrypted memory is freed back to the host,
>> because KVM already flushes when mapping potentially encrypted memory into the
>> guest.
>>
>> With SNP+guest_memfd, private/encrypted memory should be unreachable via the
>> hva-based mmu_notifiers.  gmem should have full control of the page lifecycles,
>> i.e. can get the kernel virtual address as appropriate, and so SNP shouldn't
>> need the nuclear option.
>>
>> E.g. something like this?
>>
>> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
>> index 07756b7348ae..1c6828ae391d 100644
>> --- a/arch/x86/kvm/svm/sev.c
>> +++ b/arch/x86/kvm/svm/sev.c
>> @@ -2328,7 +2328,7 @@ static void sev_flush_encrypted_page(struct kvm_vcpu *vcpu, void *va)
>>
>> void sev_guest_memory_reclaimed(struct kvm *kvm)
>> {
>> - if (!sev_guest(kvm))
>> + if (!sev_guest(kvm) || sev_snp_guest(kvm))
>> return;
>>
>> wbinvd_on_all_cpus();
>
> I hope this is the final solution :)
>
> So, short answer: no.
>
> SNP+guest_memfd prevents untrusted host user space from directly
> modifying the data, which is good enough for CVE-2022-0171, but there
> is no guarantee that the host kernel could not, in some scenarios,
> access the data and generate dirty cache lines. In fact, AFAICT, an
> SNP VM does not track whether each page was previously shared, does
> it? If a page was previously shared and was written by the host kernel
> or devices before it was converted to private, no one tracks that, and
> the dirty cache lines are still there!
>
> So, to avoid any corner case situations like the above, it seems
> currently we have to retain the property: flushing the cache when the
> guest memory mapping leaves KVM NPT.
>
> Of course, this is fundamentally because SME_COHERENT only applies to
> CPU cores, but not to DMA. If SME_COHERENT were complete, flushing
> would no longer be needed. Alternatively, we need extra bookkeeping
> for KVM to know whether each page has dirty cache lines. Another
> alternative is to filter on the mmu_notifier reasons, which is the
> approach I am planning to take. Thoughts?
>
Now, running SNP+guest_memfd with the discard=both option enabled:
# bpftrace -e 'kprobe:sev_guest_memory_reclaimed {@[kstack]=count()}'
Attaching 1 probe...
^C
@[
sev_guest_memory_reclaimed+5
kvm_mmu_notifier_release+60
__mmu_notifier_release+128
exit_mmap+657
__mmput+72
mmput+49
do_exit+752
do_group_exit+57
get_signal+2486
arch_do_signal_or_restart+51
exit_to_user_mode_prepare+257
syscall_exit_to_user_mode+42
do_syscall_64+109
entry_SYSCALL_64_after_hwframe+114
]: 1
@[
sev_guest_memory_reclaimed+5
kvm_mmu_notifier_invalidate_range_start+869
__mmu_notifier_invalidate_range_start+152
change_protection+4628
change_prot_numa+93
task_numa_work+588
task_work_run+108
exit_to_user_mode_prepare+337
syscall_exit_to_user_mode+42
do_syscall_64+109
entry_SYSCALL_64_after_hwframe+114
]: 2
@[
sev_guest_memory_reclaimed+5
kvm_mmu_notifier_invalidate_range_start+869
__mmu_notifier_invalidate_range_start+152
change_protection+4628
change_prot_numa+93
task_numa_work+588
task_work_run+108
xfer_to_guest_mode_handle_work+228
kvm_arch_vcpu_ioctl_run+1572
kvm_vcpu_ioctl+671
__x64_sys_ioctl+153
do_syscall_64+96
entry_SYSCALL_64_after_hwframe+114
]: 2
@[
sev_guest_memory_reclaimed+5
kvm_set_memslot+740
__kvm_set_memory_region.part.0+411
kvm_set_memory_region+89
kvm_vm_ioctl+1482
__x64_sys_ioctl+153
do_syscall_64+96
entry_SYSCALL_64_after_hwframe+114
]: 104
@[
sev_guest_memory_reclaimed+5
kvm_mmu_notifier_invalidate_range_start+869
__mmu_notifier_invalidate_range_start+152
zap_page_range_single+384
unmap_mapping_range+279
shmem_fallocate+932
vfs_fallocate+345
__x64_sys_fallocate+71
do_syscall_64+96
entry_SYSCALL_64_after_hwframe+114
]: 5465
@[
sev_guest_memory_reclaimed+5
kvm_mmu_notifier_invalidate_range_start+869
__mmu_notifier_invalidate_range_start+152
zap_page_range_single+384
madvise_vma_behavior+1967
madvise_walk_vmas+190
do_madvise.part.0+264
__x64_sys_madvise+98
do_syscall_64+96
entry_SYSCALL_64_after_hwframe+114
]: 69677
The maximum hits are seen with shmem_fallocate and madvise, which we
believe are in response to shared<->private GHCB page-state-change
requests. discard=both handles discard for both private and shared
memory, so freeing shared memory via
fallocate(shared_memfd, FALLOC_FL_PUNCH_HOLE, ...) triggers the
notifiers when freeing shared pages after the guest converts a GPA to
private.

Now, as with SNP+guest_memfd guest private memory is no longer mapped
in the host, I added a generic fix (instead of Sean's proposed patch of
checking for an SNP guest inside sev_guest_memory_reclaimed()):
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -593,6 +593,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 			unsigned long hva_start, hva_end;
 
 			slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
+			if (kvm_slot_can_be_private(slot)) {
+				continue;
+			}
 			hva_start = max(range->start, slot->userspace_addr);
 			hva_end = min(range->end, slot->userspace_addr + (slot->npages << PAGE_SHIFT));
With this fix added, the traces are as follows:
# bpftrace -e 'kprobe:sev_guest_memory_reclaimed {@[kstack]=count()}'
Attaching 1 probe...
^C
@[
sev_guest_memory_reclaimed+5
kvm_mmu_notifier_invalidate_range_start+812
__mmu_notifier_invalidate_range_start+152
change_protection+4628
change_prot_numa+93
task_numa_work+588
task_work_run+108
exit_to_user_mode_prepare+337
syscall_exit_to_user_mode+42
do_syscall_64+109
entry_SYSCALL_64_after_hwframe+114
]: 1
@[
sev_guest_memory_reclaimed+5
kvm_mmu_notifier_release+60
__mmu_notifier_release+128
exit_mmap+657
__mmput+72
mmput+49
do_exit+752
do_group_exit+57
get_signal+2486
arch_do_signal_or_restart+51
exit_to_user_mode_prepare+257
syscall_exit_to_user_mode+42
do_syscall_64+109
entry_SYSCALL_64_after_hwframe+114
]: 1
@[
sev_guest_memory_reclaimed+5
kvm_mmu_notifier_invalidate_range_start+812
__mmu_notifier_invalidate_range_start+152
change_protection+4628
change_prot_numa+93
task_numa_work+588
task_work_run+108
xfer_to_guest_mode_handle_work+228
kvm_arch_vcpu_ioctl_run+1572
kvm_vcpu_ioctl+671
__x64_sys_ioctl+153
do_syscall_64+96
entry_SYSCALL_64_after_hwframe+114
]:
@[
sev_guest_memory_reclaimed+5
kvm_set_memslot+740
__kvm_set_memory_region.part.0+411
kvm_set_memory_region+89
kvm_vm_ioctl+1482
__x64_sys_ioctl+153
do_syscall_64+96
entry_SYSCALL_64_after_hwframe+114
]: 104
#
As expected, the SEV hook is not invoked for the guest private memory
pages (no more invalidations from shmem_fallocate() + madvise()).

Isn't it better to skip invoking the KVM MMU invalidation notifier when
the invalidated range belongs to guest private memory?
> In fact, AFAICT, an SNP VM does
> not track whether each page was previously shared, does it? If a page
> was previously shared and was written by the host kernel or devices
> before it was converted to private, no one tracks that, and the dirty
> cache lines are still there!
The skipped invalidation here covers the case Mingwei mentioned above,
where pages are converted from private->shared and the subsequent
freeing of those shared pages triggers the invalidation.

But then why are we concerned about this? I thought our concern was the
case where the dirty cache lines contain encrypted guest data?
Thanks,
Ashish