Re: [RFC PATCH] KVM: Introduce KVM VIRTIO device

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Sean Christopherson <seanjc@google.com>
To: Yan Zhao <yan.y.zhao@intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>,
	"wanpengli@tencent.com" <wanpengli@tencent.com>,
	Zhenyu Z Wang <zhenyu.z.wang@intel.com>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	Yiwei Zhang <zzyiwei@google.com>,
	"joro@8bytes.org" <joro@8bytes.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"dri-devel@lists.freedesktop.org"
	<dri-devel@lists.freedesktop.org>,
	"gurchetansingh@chromium.org" <gurchetansingh@chromium.org>,
	"kraxel@redhat.com" <kraxel@redhat.com>,
	Yongwei Ma <yongwei.ma@intel.com>,
	Zhiyuan Lv <zhiyuan.lv@intel.com>,
	"pbonzini@redhat.com" <pbonzini@redhat.com>,
	"vkuznets@redhat.com" <vkuznets@redhat.com>,
	"jmattson@google.com" <jmattson@google.com>
Subject: Re: [RFC PATCH] KVM: Introduce KVM VIRTIO device
Date: Wed, 20 Dec 2023 18:12:55 -0800	[thread overview]
Message-ID: <ZYOfJ_QWG01aL8Hl@google.com> (raw)
In-Reply-To: <ZYJOTLwreD+8l4gU@yzhao56-desk.sh.intel.com>

On Wed, Dec 20, 2023, Yan Zhao wrote:
> On Tue, Dec 19, 2023 at 12:26:45PM +0800, Yan Zhao wrote:
> > On Mon, Dec 18, 2023 at 07:08:51AM -0800, Sean Christopherson wrote:
> > > > > Implementation Consideration
> > > > > ===
> > > > > There is a previous series [1] from google to serve the same purpose to
> > > > > let KVM be aware of virtio GPU's noncoherent DMA status. That series
> > > > > requires a new memslot flag, and special memslots in user space.
> > > > > 
> > > > > We don't choose to use memslot flag to request honoring guest memory
> > > > > type.
> > > > 
> > > > memslot flag has the potential to restrict the impact e.g. when using
> > > > clflush-before-read in migration?
> > > 
> > > Yep, exactly.  E.g. if KVM needs to ensure coherency when freeing memory back to
> > > the host kernel, then the memslot flag will allow for a much more targeted
> > > operation.
> > > 
> > > > Of course the implication is to honor guest type only for the selected slot
> > > > in KVM instead of applying to the entire guest memory as in previous series
> > > > (which selects this way because vmx_get_mt_mask() is in perf-critical path
> > > > hence not good to check memslot flag?)
> > > 
> > > Checking a memslot flag won't impact performance.  KVM already has the memslot
> > > when creating SPTEs, e.g. the sole caller of vmx_get_mt_mask(), make_spte(), has
> > > access to the memslot.
> > > 
> > > That isn't coincidental, KVM _must_ have the memslot to construct the SPTE, e.g.
> > > to retrieve the associated PFN, update write-tracking for shadow pages, etc.
> > > 
> > Hi Sean,
> > Do you prefer to introduce a memslot flag KVM_MEM_DMA or KVM_MEM_WC?
> > For KVM_MEM_DMA, KVM needs to
> > (a) search VMA for vma->vm_page_prot and convert it to page cache mode (with
> >     pgprot2cachemode()? ), or
> > (b) look up memtype of the PFN, by calling lookup_memtype(), similar to that in
> >     pat_pfn_immune_to_uc_mtrr().
> > 
> > But pgprot2cachemode() and lookup_memtype() are not exported by x86 code now.
> > 
> > For KVM_MEM_WC, it requires user to ensure the memory is actually mapped
> > to WC, right?
> > 
> > Then, vmx_get_mt_mask() just ignores guest PAT and programs host PAT as EPT type
> > for the special memslot only, as below.
> > Is this understanding correct?
> > 
> > static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> > {
> >         if (is_mmio)                                                                           
> >                 return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;                          
> >                                                                                                
> >         if (gfn_in_dma_slot(vcpu->kvm, gfn)) {                                                 
> >                 u8 type = MTRR_TYPE_WRCOMB;                                      
> >                 //u8 type = pat_pfn_memtype(pfn);                                
> >                 return (type << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;       
> >         }                                                                                      
> >                                                                                                
> >         if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))                            
> >                 return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;         
> >                                                                                                
> >         if (kvm_read_cr0_bits(vcpu, X86_CR0_CD)) {                                             
> >                 if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))               
> >                         return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;                      
> >                 else                                                                           
> >                         return (MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT) | 
> >                                 VMX_EPT_IPAT_BIT;                                
> >         }                                                                        
> >                                                                                  
> >         return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT;
> > }
> > 
> > BTW, since the special memslot must be exposed to guest as virtio GPU BAR in
> > order to prevent other guest drivers from access, I wonder if it's better to
> > include some keyword like VIRTIO_GPU_BAR in memslot flag name.
> Another choice is to add a memslot flag KVM_MEM_HONOR_GUEST_PAT, then user
> (e.g. QEMU) does special treatment to this kind of memslots (e.g. skipping
> reading/writing to them in general paths).
> 
> @@ -7589,26 +7589,29 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
>         if (is_mmio)
>                 return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
> 
> +       if (in_slot_honor_guest_pat(vcpu->kvm, gfn))
> +               return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT;

This is more along the lines of what I was thinking, though the name should be
something like KVM_MEM_NON_COHERENT_DMA, i.e. not x86 specific and not contradictory
for AMD (which already honors guest PAT).

I also vote to deliberately ignore MTRRs, i.e. start us on the path of ripping
those out.  This is a new feature, so we have the luxury of defining KVM's ABI
for that feature, i.e. can state that on x86 it honors guest PAT, but not MTRRs.

Like so?

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d21f55f323ea..ed527acb2bd3 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7575,7 +7575,8 @@ static int vmx_vm_init(struct kvm *kvm)
        return 0;
 }
 
-static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
+static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio,
+                         struct kvm_memory_slot *slot)
 {
        /* We wanted to honor guest CD/MTRR/PAT, but doing so could result in
         * memory aliases with conflicting memory types and sometimes MCEs.
@@ -7598,6 +7599,9 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
        if (is_mmio)
                return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
 
+       if (kvm_memslot_has_non_coherent_dma(slot))
+               return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
+
        if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
                return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;

I like the idea of pulling the memtype from the host, but if we can make that
work then I don't see the need for a special memslot flag, i.e. just do it for
*all* SPTEs on VMX.  I don't think we need a VMA for that, e.g. we should be able
to get the memtype from the host PTEs, just like we do the page size.

KVM_MEM_WC is a hard "no" for me.  It's far too x86 centric, and as you alluded
to, it requires coordination from the guest, i.e. is effectively limited to
paravirt scenarios.

> +
>         if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
>                 return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
> 
>         if (kvm_read_cr0_bits(vcpu, X86_CR0_CD)) {
>                 if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
>                         return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
>                 else
>                         return (MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT) |
>                                 VMX_EPT_IPAT_BIT;
>         }
> 
>         return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT;
>  }

WARNING: multiple messages have this Message-ID (diff)

From: Sean Christopherson <seanjc@google.com>
To: Yan Zhao <yan.y.zhao@intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	 "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	 "dri-devel@lists.freedesktop.org"
	<dri-devel@lists.freedesktop.org>,
	 "pbonzini@redhat.com" <pbonzini@redhat.com>,
	"olvaffe@gmail.com" <olvaffe@gmail.com>,
	 Zhiyuan Lv <zhiyuan.lv@intel.com>,
	Zhenyu Z Wang <zhenyu.z.wang@intel.com>,
	 Yongwei Ma <yongwei.ma@intel.com>,
	"vkuznets@redhat.com" <vkuznets@redhat.com>,
	 "wanpengli@tencent.com" <wanpengli@tencent.com>,
	"jmattson@google.com" <jmattson@google.com>,
	 "joro@8bytes.org" <joro@8bytes.org>,
	 "gurchetansingh@chromium.org" <gurchetansingh@chromium.org>,
	"kraxel@redhat.com" <kraxel@redhat.com>,
	 Yiwei Zhang <zzyiwei@google.com>
Subject: Re: [RFC PATCH] KVM: Introduce KVM VIRTIO device
Date: Wed, 20 Dec 2023 18:12:55 -0800	[thread overview]
Message-ID: <ZYOfJ_QWG01aL8Hl@google.com> (raw)
In-Reply-To: <ZYJOTLwreD+8l4gU@yzhao56-desk.sh.intel.com>

On Wed, Dec 20, 2023, Yan Zhao wrote:
> On Tue, Dec 19, 2023 at 12:26:45PM +0800, Yan Zhao wrote:
> > On Mon, Dec 18, 2023 at 07:08:51AM -0800, Sean Christopherson wrote:
> > > > > Implementation Consideration
> > > > > ===
> > > > > There is a previous series [1] from google to serve the same purpose to
> > > > > let KVM be aware of virtio GPU's noncoherent DMA status. That series
> > > > > requires a new memslot flag, and special memslots in user space.
> > > > > 
> > > > > We don't choose to use memslot flag to request honoring guest memory
> > > > > type.
> > > > 
> > > > memslot flag has the potential to restrict the impact e.g. when using
> > > > clflush-before-read in migration?
> > > 
> > > Yep, exactly.  E.g. if KVM needs to ensure coherency when freeing memory back to
> > > the host kernel, then the memslot flag will allow for a much more targeted
> > > operation.
> > > 
> > > > Of course the implication is to honor guest type only for the selected slot
> > > > in KVM instead of applying to the entire guest memory as in previous series
> > > > (which selects this way because vmx_get_mt_mask() is in perf-critical path
> > > > hence not good to check memslot flag?)
> > > 
> > > Checking a memslot flag won't impact performance.  KVM already has the memslot
> > > when creating SPTEs, e.g. the sole caller of vmx_get_mt_mask(), make_spte(), has
> > > access to the memslot.
> > > 
> > > That isn't coincidental, KVM _must_ have the memslot to construct the SPTE, e.g.
> > > to retrieve the associated PFN, update write-tracking for shadow pages, etc.
> > > 
> > Hi Sean,
> > Do you prefer to introduce a memslot flag KVM_MEM_DMA or KVM_MEM_WC?
> > For KVM_MEM_DMA, KVM needs to
> > (a) search VMA for vma->vm_page_prot and convert it to page cache mode (with
> >     pgprot2cachemode()? ), or
> > (b) look up memtype of the PFN, by calling lookup_memtype(), similar to that in
> >     pat_pfn_immune_to_uc_mtrr().
> > 
> > But pgprot2cachemode() and lookup_memtype() are not exported by x86 code now.
> > 
> > For KVM_MEM_WC, it requires user to ensure the memory is actually mapped
> > to WC, right?
> > 
> > Then, vmx_get_mt_mask() just ignores guest PAT and programs host PAT as EPT type
> > for the special memslot only, as below.
> > Is this understanding correct?
> > 
> > static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> > {
> >         if (is_mmio)                                                                           
> >                 return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;                          
> >                                                                                                
> >         if (gfn_in_dma_slot(vcpu->kvm, gfn)) {                                                 
> >                 u8 type = MTRR_TYPE_WRCOMB;                                      
> >                 //u8 type = pat_pfn_memtype(pfn);                                
> >                 return (type << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;       
> >         }                                                                                      
> >                                                                                                
> >         if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))                            
> >                 return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;         
> >                                                                                                
> >         if (kvm_read_cr0_bits(vcpu, X86_CR0_CD)) {                                             
> >                 if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))               
> >                         return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;                      
> >                 else                                                                           
> >                         return (MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT) | 
> >                                 VMX_EPT_IPAT_BIT;                                
> >         }                                                                        
> >                                                                                  
> >         return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT;
> > }
> > 
> > BTW, since the special memslot must be exposed to guest as virtio GPU BAR in
> > order to prevent other guest drivers from access, I wonder if it's better to
> > include some keyword like VIRTIO_GPU_BAR in memslot flag name.
> Another choice is to add a memslot flag KVM_MEM_HONOR_GUEST_PAT, then user
> (e.g. QEMU) does special treatment to this kind of memslots (e.g. skipping
> reading/writing to them in general paths).
> 
> @@ -7589,26 +7589,29 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
>         if (is_mmio)
>                 return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
> 
> +       if (in_slot_honor_guest_pat(vcpu->kvm, gfn))
> +               return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT;

This is more along the lines of what I was thinking, though the name should be
something like KVM_MEM_NON_COHERENT_DMA, i.e. not x86 specific and not contradictory
for AMD (which already honors guest PAT).

I also vote to deliberately ignore MTRRs, i.e. start us on the path of ripping
those out.  This is a new feature, so we have the luxury of defining KVM's ABI
for that feature, i.e. can state that on x86 it honors guest PAT, but not MTRRs.

Like so?

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d21f55f323ea..ed527acb2bd3 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7575,7 +7575,8 @@ static int vmx_vm_init(struct kvm *kvm)
        return 0;
 }
 
-static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
+static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio,
+                         struct kvm_memory_slot *slot)
 {
        /* We wanted to honor guest CD/MTRR/PAT, but doing so could result in
         * memory aliases with conflicting memory types and sometimes MCEs.
@@ -7598,6 +7599,9 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
        if (is_mmio)
                return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
 
+       if (kvm_memslot_has_non_coherent_dma(slot))
+               return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
+
        if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
                return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;

I like the idea of pulling the memtype from the host, but if we can make that
work then I don't see the need for a special memslot flag, i.e. just do it for
*all* SPTEs on VMX.  I don't think we need a VMA for that, e.g. we should be able
to get the memtype from the host PTEs, just like we do the page size.

KVM_MEM_WC is a hard "no" for me.  It's far too x86 centric, and as you alluded
to, it requires coordination from the guest, i.e. is effectively limited to
paravirt scenarios.

> +
>         if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
>                 return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
> 
>         if (kvm_read_cr0_bits(vcpu, X86_CR0_CD)) {
>                 if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
>                         return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
>                 else
>                         return (MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT) |
>                                 VMX_EPT_IPAT_BIT;
>         }
> 
>         return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT;
>  }

next prev parent reply	other threads:[~2023-12-21  2:13 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-12-14 10:35 [RFC PATCH] KVM: Introduce KVM VIRTIO device Yan Zhao
2023-12-14 10:35 ` Yan Zhao
2023-12-15  6:23 ` Tian, Kevin
2023-12-15  6:23   ` Tian, Kevin
2023-12-18  2:58   ` Yan Zhao
2023-12-18  2:58     ` Yan Zhao
2023-12-18 15:08   ` Sean Christopherson
2023-12-18 15:08     ` Sean Christopherson
2023-12-18 18:14     ` Yiwei Zhang
2023-12-18 18:14       ` Yiwei Zhang
2023-12-19  4:26     ` Yan Zhao
2023-12-19  4:26       ` Yan Zhao
2023-12-20  2:15       ` Yan Zhao
2023-12-21  2:12         ` Sean Christopherson [this message]
2023-12-21  2:12           ` Sean Christopherson
2023-12-21  2:44           ` Yan Zhao
2023-12-21  2:44             ` Yan Zhao

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:d21f55f323e dfblob:ed527acb2bd dfblob:d21f55f323e
dfblob:ed527acb2bd )
 OR (
bs:"Re: [RFC PATCH] KVM: Introduce KVM VIRTIO device" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZYOfJ_QWG01aL8Hl@google.com \
    --to=seanjc@google.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=gurchetansingh@chromium.org \
    --cc=jmattson@google.com \
    --cc=joro@8bytes.org \
    --cc=kevin.tian@intel.com \
    --cc=kraxel@redhat.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=pbonzini@redhat.com \
    --cc=vkuznets@redhat.com \
    --cc=wanpengli@tencent.com \
    --cc=yan.y.zhao@intel.com \
    --cc=yongwei.ma@intel.com \
    --cc=zhenyu.z.wang@intel.com \
    --cc=zhiyuan.lv@intel.com \
    --cc=zzyiwei@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.