From: Nicholas Piggin <npiggin@gmail.com>
To: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: linuxppc-dev@lists.ozlabs.org,
	David Gibson <david@gibson.dropbear.id.au>,
	kvm-ppc@vger.kernel.org,
	Alex Williamson <alex.williamson@redhat.com>,
	Michael Ellerman <mpe@ellerman.id.au>,
	Paul Mackerras <paulus@ozlabs.org>,
	Balbir Singh <bsingharora@gmail.com>
Subject: Re: [PATCH kernel v5 2/2] KVM: PPC: Check if IOMMU page is contained in the pinned physical page
Date: Sat, 7 Jul 2018 23:47:42 +1000
Message-ID: <20180707234742.59bbc2e7@roar.ozlabs.ibm.com>
In-Reply-To: <20180707224445.24664324@aikyoga2>

On Sat, 7 Jul 2018 22:44:45 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On Sat, 7 Jul 2018 21:43:03 +1000
> Nicholas Piggin <npiggin@gmail.com> wrote:
> 
> > On Sat,  7 Jul 2018 20:44:10 +1000
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >   
> > > A VM which has:
> > >  - a DMA capable device passed through to it (eg. network card);
> > >  - running a malicious kernel that ignores H_PUT_TCE failure;
> > >  - capability of using IOMMU pages bigger than physical pages
> > > can create an IOMMU mapping that exposes (for example) 16MB of
> > > the host physical memory to the device when only 64K was allocated to the VM.
> > > 
> > > The remaining 16MB - 64K will be some other content of host memory, possibly
> > > including pages of the VM, but also pages of host kernel memory, host
> > > programs or other VMs.
> > > 
> > > The attacking VM does not control the location of the page it can map,
> > > and is only allowed to map as many pages as it has pages of RAM.
> > > 
> > > We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that
> > > an IOMMU page is contained in the physical page so the PCI hardware won't
> > > get access to unassigned host memory; however this check is missing in
> > > the KVM fastpath (H_PUT_TCE accelerated code). We have been lucky so far
> > > and have not hit this yet: the very first time the mapping happens we do
> > > not have tbl::it_userspace allocated yet and fall back to userspace,
> > > which in turn calls the VFIO IOMMU driver; this fails and the guest does
> > > not retry.
> > > 
> > > This stores the smallest preregistered page size in the preregistered
> > > region descriptor and changes the mm_iommu_xxx API to check this against
> > > the IOMMU page size. The maximum page size is calculated as the minimum
> > > of the natural region alignment and the hugetlbfs compound page size.
> > > 
> > > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > > ---
> > > Changes:
> > > v5:
> > > * only consider compound pages from hugetlbfs
> > > 
> > > v4:
> > > * reimplemented max pageshift calculation
> > > 
> > > v3:
> > > * fixed upper limit for the page size
> > > * added checks that we don't register parts of a huge page
> > > 
> > > v2:
> > > * explicitly check for compound pages before calling compound_order()
> > > 
> > > ---
> > > The bug is: run QEMU _without_ hugepages (no -mem-path) and tell it to
> > > advertise 16MB pages to the guest; a typical pseries guest will use 16MB
> > > for IOMMU pages without checking the mmu pagesize and this will fail
> > > at https://git.qemu.org/?p=qemu.git;a=blob;f=hw/vfio/common.c;h=fb396cf00ac40eb35967a04c9cc798ca896eed57;hb=refs/heads/master#l256
> > > 
> > > With the change, mapping will fail in KVM and the guest will print:
> > > 
> > > mlx5_core 0000:00:00.0: ibm,create-pe-dma-window(2027) 0 8000000 20000000 18 1f returned 0 (liobn = 0x80000001 starting addr = 8000000 0)
> > > mlx5_core 0000:00:00.0: created tce table LIOBN 0x80000001 for /pci@800000020000000/ethernet@0
> > > mlx5_core 0000:00:00.0: failed to map direct window for /pci@800000020000000/ethernet@0: -1
> > > ---
> > >  arch/powerpc/include/asm/mmu_context.h |  4 ++--
> > >  arch/powerpc/kvm/book3s_64_vio.c       |  2 +-
> > >  arch/powerpc/kvm/book3s_64_vio_hv.c    |  6 ++++--
> > >  arch/powerpc/mm/mmu_context_iommu.c    | 33 +++++++++++++++++++++++++++------
> > >  drivers/vfio/vfio_iommu_spapr_tce.c    |  2 +-
> > >  5 files changed, 35 insertions(+), 12 deletions(-)
> > > 
> > > diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
> > > index 896efa5..79d570c 100644
> > > --- a/arch/powerpc/include/asm/mmu_context.h
> > > +++ b/arch/powerpc/include/asm/mmu_context.h
> > > @@ -35,9 +35,9 @@ extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm(
> > >  extern struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
> > >  		unsigned long ua, unsigned long entries);
> > >  extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
> > > -		unsigned long ua, unsigned long *hpa);
> > > +		unsigned long ua, unsigned int pageshift, unsigned long *hpa);
> > >  extern long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
> > > -		unsigned long ua, unsigned long *hpa);
> > > +		unsigned long ua, unsigned int pageshift, unsigned long *hpa);
> > >  extern long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem);
> > >  extern void mm_iommu_mapped_dec(struct mm_iommu_table_group_mem_t *mem);
> > >  #endif
> > > diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> > > index d066e37..8c456fa 100644
> > > --- a/arch/powerpc/kvm/book3s_64_vio.c
> > > +++ b/arch/powerpc/kvm/book3s_64_vio.c
> > > @@ -449,7 +449,7 @@ long kvmppc_tce_iommu_do_map(struct kvm *kvm, struct iommu_table *tbl,
> > >  		/* This only handles v2 IOMMU type, v1 is handled via ioctl() */
> > >  		return H_TOO_HARD;
> > >  
> > > -	if (WARN_ON_ONCE(mm_iommu_ua_to_hpa(mem, ua, &hpa)))
> > > +	if (WARN_ON_ONCE(mm_iommu_ua_to_hpa(mem, ua, tbl->it_page_shift, &hpa)))
> > >  		return H_HARDWARE;
> > >  
> > >  	if (mm_iommu_mapped_inc(mem))
> > > diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> > > index 925fc31..5b298f5 100644
> > > --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> > > +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> > > @@ -279,7 +279,8 @@ static long kvmppc_rm_tce_iommu_do_map(struct kvm *kvm, struct iommu_table *tbl,
> > >  	if (!mem)
> > >  		return H_TOO_HARD;
> > >  
> > > -	if (WARN_ON_ONCE_RM(mm_iommu_ua_to_hpa_rm(mem, ua, &hpa)))
> > > +	if (WARN_ON_ONCE_RM(mm_iommu_ua_to_hpa_rm(mem, ua, tbl->it_page_shift,
> > > +			&hpa)))
> > >  		return H_HARDWARE;
> > >  
> > >  	pua = (void *) vmalloc_to_phys(pua);
> > > @@ -469,7 +470,8 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> > >  
> > >  		mem = mm_iommu_lookup_rm(vcpu->kvm->mm, ua, IOMMU_PAGE_SIZE_4K);
> > >  		if (mem)
> > > -			prereg = mm_iommu_ua_to_hpa_rm(mem, ua, &tces) == 0;
> > > +			prereg = mm_iommu_ua_to_hpa_rm(mem, ua,
> > > +					IOMMU_PAGE_SHIFT_4K, &tces) == 0;
> > >  	}
> > >  
> > >  	if (!prereg) {
> > > diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
> > > index abb4364..0aebed6 100644
> > > --- a/arch/powerpc/mm/mmu_context_iommu.c
> > > +++ b/arch/powerpc/mm/mmu_context_iommu.c
> > > @@ -18,6 +18,7 @@
> > >  #include <linux/migrate.h>
> > >  #include <linux/hugetlb.h>
> > >  #include <linux/swap.h>
> > > +#include <linux/hugetlb.h>
> > >  #include <asm/mmu_context.h>
> > >  
> > >  static DEFINE_MUTEX(mem_list_mutex);
> > > @@ -27,6 +28,7 @@ struct mm_iommu_table_group_mem_t {
> > >  	struct rcu_head rcu;
> > >  	unsigned long used;
> > >  	atomic64_t mapped;
> > > +	unsigned int pageshift;
> > >  	u64 ua;			/* userspace address */
> > >  	u64 entries;		/* number of entries in hpas[] */
> > >  	u64 *hpas;		/* vmalloc'ed */
> > > @@ -125,6 +127,7 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
> > >  {
> > >  	struct mm_iommu_table_group_mem_t *mem;
> > >  	long i, j, ret = 0, locked_entries = 0;
> > > +	unsigned int pageshift;
> > >  	struct page *page = NULL;
> > >  
> > >  	mutex_lock(&mem_list_mutex);
> > > @@ -159,6 +162,11 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
> > >  		goto unlock_exit;
> > >  	}
> > >  
> > > +	/*
> > > +	 * Use @ua and @entries alignment as a starting point for a maximum
> > > +	 * page size calculation below.
> > > +	 */
> > > +	mem->pageshift = __ffs(ua | (entries << PAGE_SHIFT));
> > >  	mem->hpas = vzalloc(array_size(entries, sizeof(mem->hpas[0])));
> > >  	if (!mem->hpas) {
> > >  		kfree(mem);    
> > 
> > I think it would be better to start with UINT_MAX or something to make
> > it clear that it's not a valid "guess" but rather just a starting point
> > for the min().  
> 
> This is to handle unaligned @ua/@entries backed by a huge page. For
> example, with a 1GB page where we are preregistering only the second
> half of it, the maximum pageshift should be 29 instead of the 30 we
> would get if we started from 64.

Ah okay that's what I'm missing. Maybe a small addition to the comment?
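
Something like this, for the record (my numbers, not from the patch;
assuming 64K base pages, i.e. PAGE_SHIFT = 16):

	/* preregistering only the second half of a 1GB huge page */
	unsigned long ua = 0x60000000;	/* lowest set bit is bit 29 */
	unsigned long entries = (1UL << 29) >> PAGE_SHIFT;	/* 8192 */

	__ffs(ua | (entries << PAGE_SHIFT));	/* __ffs(0x60000000) = 29 */

So the region alignment already caps the starting pageshift at 29, while
starting from 64 (or UINT_MAX) and taking the min() against the compound
page size alone would leave it at 30.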

> >   
> > > @@ -167,8 +175,11 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
> > >  	}
> > >  
> > >  	for (i = 0; i < entries; ++i) {
> > > -		if (1 != get_user_pages_fast(ua + (i << PAGE_SHIFT),
> > > -					1/* pages */, 1/* iswrite */, &page)) {
> > > +		struct vm_area_struct *vma = NULL;
> > > +
> > > +		if (1 != get_user_pages(ua + (i << PAGE_SHIFT),
> > > +					1/* pages */, 1/* iswrite */, &page,
> > > +					&vma)) {
> > >  			ret = -EFAULT;
> > >  			for (j = 0; j < i; ++j)
> > >  				put_page(pfn_to_page(mem->hpas[j] >>    
> > 
> > You need mmap_sem for read here, and held while you inspect the vma.
> > I would say don't hold it over the error handling and the
> > mm_iommu_move_page_from_cma() call though.  
> 
> 
> Ah, this is why it deadlocked when I tried this.
> 
> But do I really need mmap_sem here? get_user_pages_longterm() inspects
> the vma with no mmap_sem held in the VFIO IOMMU Type1 case:
> 
> vfio_fops_unl_ioctl
>   vfio_iommu_type1_ioctl
>     vfio_dma_do_map
>       vfio_pin_map_dma
>         vfio_pin_pages_remote
>           vaddr_get_pfn
>             get_user_pages_longterm
> 
> Is that a bug? Other cases seem to hold it though.

Yes you need mmap_sem. I think that would be a bug too.
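
To be clear, the shape I have in mind is roughly this (untested sketch,
using the hugetlbfs helpers since v5 only considers hugetlbfs compound
pages):

	down_read(&mm->mmap_sem);
	ret = get_user_pages(ua + (i << PAGE_SHIFT), 1 /* pages */,
			1 /* iswrite */, &page, &vma);
	/* inspect the vma only while mmap_sem is held */
	if (ret == 1 && vma && is_vm_hugetlb_page(vma))
		mem->pageshift = min_t(unsigned int, mem->pageshift,
				huge_page_shift(hstate_vma(vma)));
	up_read(&mm->mmap_sem);
	/*
	 * The -EFAULT unwinding and the mm_iommu_move_page_from_cma()
	 * call then happen after up_read().
	 */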

Thanks,
Nick
