Re: [RFC PATCH] KVM: PPC: Book3S HV: add support for page faults in VM_IO|VM_PFNMAP vmas

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
To: "Paolo Bonzini" <pbonzini@redhat.com>,
	"Cédric Le Goater" <clg@kaod.org>,
	kvm-ppc@vger.kernel.org, kvm@vger.kernel.org,
	"Paul Mackerras" <paulus@samba.org>
Subject: Re: [RFC PATCH] KVM: PPC: Book3S HV: add support for page faults in VM_IO|VM_PFNMAP vmas
Date: Mon, 12 Feb 2018 10:33:53 +0000	[thread overview]
Message-ID: <1518431633.2959.9.camel@kernel.crashing.org> (raw)
In-Reply-To: <f0d6d84c-1a59-a38f-1b38-eec798236378@redhat.com>

On Mon, 2018-02-12 at 10:57 +0100, Paolo Bonzini wrote:
> On 09/02/2018 22:44, Benjamin Herrenschmidt wrote:
> > > 
> > > (Also, why is this code reinventing most of hva_to_pfn?  Not asking you
> > > to fix that of course, but perhaps there's something to improve in the
> > > generic code too).
> > 
> > I've been wondering the same thing, which led me to try to understand
> > hva_to_pfn, which resulted in skull marks in the wall ;-)
> > 
> > So hva_to_pfn does a complicated dance around 3 or 4 different variants
> > of gup and it's really not obvious why, the "intent" doesn't appear to
> > be documented clearly. I think I somewhat figued out it relates to the
> > async page fault stuff but even then I don't actually know what those
> > are and what is their purpose :-)
> 
> Yeah, hva_to_pfn has all those arguments to tweak its behavior, however 
> you can treat it as essentially get_user_pages_unlocked:
> 
> - if async=false && atomic=false you can ignore hva_to_pfn_fast.  
> (hva_to_pfn_fast uses __get_user_pages_fast)

Do you mean async = NULL && atomic = false ? IE. async is a
pointer...

> - hva_to_pfn_slow then is essentially get_user_pages_unlocked.  
> Optionally it's followed by __get_user_pages_fast to get a writable 
> mapping if the caller would like it but does not require it, but you can 
> ignore this if you pass writable=NULL.
>
> PPC uses get_user_pages_fast; x86 cannot use it directly because we need 
> FOLL_HWPOISON and get_user_pages_fast does not support it.

Remind me what that does ? I've lost track ... we do support some
amount of HWPOISON stuff on powerpc so we should probably do that
too...

I also at some point would like to understand how we sync the dirty
bit in a race free way ;-) For the accessed bit it looks like the
generic mm always calls the notifiers that gets into our "age" callback
, but dirty seems to be treated differently. My limited understanding
right now is that we set it via gup on a write fault in the struct
page, and it can only be cleared via page_mkclean which takes the
mapping out. But I haven't checked that this is 100% tight (meaning we 
don't actually rely on the 2nd level page table dirty bit except for
dirty_map tracking).
 
> However, get_user_pages_fast is itself a combination of 
> __get_user_pages_fast and get_user_pages_unlocked.  So if it is useful 
> for PPC to use get_user_pages_fast, perhaps the generic code should go 
> down hva_to_pfn_fast unconditionally, even if async and atomic are both 
> false.
> 
> The other big difference is that you follow up with the PageHuge check 
> to get compound_head/compound_order.  This is a bit like 
> kvm_host_page_size, but different.  It would be fine for me to add a 
> struct page ** argument to hva_to_pfn if that's enough to remove the 
> duplicate code in arch/powerpc/kvm/.

Yes, that would be nice.

> See the untested/uncompiled patch below the signature for some ideas.

I can give that a spin tomorrow with a bit of luck (30 other things on
my plate), or maybe Paul can.

BTW. Is there some doc/explanation about the whole Async page fault
business ? What is it about ? I wonder if it's something we
could/should use as well.

Cheers,
Ben.

> Thanks,
> 
> Paolo
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index b4414842b023..a2c8824ae608 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1339,15 +1339,12 @@ static inline int check_user_page_hwpoison(unsigned long addr)
>   * The atomic path to get the writable pfn which will be stored in @pfn,
>   * true indicates success, otherwise false is returned.
>   */
> -static bool hva_to_pfn_fast(unsigned long addr, bool atomic, bool *async,
> -			    bool write_fault, bool *writable, kvm_pfn_t *pfn)
> +static struct page *__hva_to_page_fast(unsigned long addr, bool write_fault,
> +				       bool *writable)
>  {
>  	struct page *page[1];
>  	int npages;
>  
> -	if (!(async || atomic))
> -		return false;
> -
>  	/*
>  	 * Fast pin a writable pfn only if it is a write fault request
>  	 * or the caller allows to map a writable pfn for a read fault
> @@ -1357,23 +1354,21 @@ static bool hva_to_pfn_fast(unsigned long addr, bool atomic, bool *async,
>  		return false;
>  
>  	npages = __get_user_pages_fast(addr, 1, 1, page);
> -	if (npages = 1) {
> -		*pfn = page_to_pfn(page[0]);
> +	if (npages < 0)
> +		return ERR_PTR(npages);
>  
> -		if (writable)
> -			*writable = true;
> -		return true;
> -	}
> +	if (writable)
> +		*writable = true;
>  
> -	return false;
> +	return page[0];
>  }
>  
>  /*
>   * The slow path to get the pfn of the specified host virtual address,
>   * 1 indicates success, -errno is returned if error is detected.
>   */
> -static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
> -			   bool *writable, kvm_pfn_t *pfn)
> +static struct page *hva_to_page_unlocked(unsigned long addr, bool *async,
> +					 bool write_fault, bool *writable)
>  {
>  	struct page *page[1];
>  	int npages = 0;
> @@ -1395,24 +1390,20 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
>  
>  		npages = get_user_pages_unlocked(addr, 1, page, flags);
>  	}
> -	if (npages != 1)
> -		return npages;
> +	if (npages < 0)
> +		return ERR_PTR(npages);
>  
>  	/* map read fault as writable if possible */
>  	if (unlikely(!write_fault) && writable) {
> -		struct page *wpage[1];
> -
> -		npages = __get_user_pages_fast(addr, 1, 1, wpage);
> -		if (npages = 1) {
> -			*writable = true;
> +		struct page *wpage;
> +		wpage = __hva_to_page_fast(addr, write_fault, writable);
> +		if (!IS_ERR(wpage)) {
>  			put_page(page[0]);
> -			page[0] = wpage[0];
> +			return wpage;
>  		}
> -
> -		npages = 1;
>  	}
> -	*pfn = page_to_pfn(page[0]);
> -	return npages;
> +
> +	return page[0];
>  }
>  
>  static bool vma_is_valid(struct vm_area_struct *vma, bool write_fault)
> @@ -1487,33 +1478,37 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
>   *     whether the mapping is writable.
>   */
>  static kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool *async,
> -			bool write_fault, bool *writable)
> +			    bool write_fault, bool *writable, struct page **p_page)
>  {
>  	struct vm_area_struct *vma;
>  	kvm_pfn_t pfn = 0;
> -	int npages, r;
> +	struct page *page;
> +	int r;
>  
>  	/* we can do it either atomically or asynchronously, not both */
>  	BUG_ON(atomic && async);
>  
> -	if (hva_to_pfn_fast(addr, atomic, async, write_fault, writable, &pfn))
> -		return pfn;
> +	page = __hva_to_page_fast(addr, write_fault, writable);
> +	if (!IS_ERR(page))
> +		goto got_page;
>  
>  	if (atomic)
>  		return KVM_PFN_ERR_FAULT;
>  
> -	npages = hva_to_pfn_slow(addr, async, write_fault, writable, &pfn);
> -	if (npages = 1)
> -		return pfn;
> +	page = hva_to_page_unlocked(addr, async, write_fault, writable);
> +	if (!IS_ERR(page))
> +		goto got_page;
>  
>  	down_read(&current->mm->mmap_sem);
> -	if (npages = -EHWPOISON ||
> +	/* FIXME, is check_user_page_hwpoison still needed? */
> +	if (PTR_ERR(page) = -EHWPOISON ||
>  	      (!async && check_user_page_hwpoison(addr))) {
> -		pfn = KVM_PFN_ERR_HWPOISON;
> -		goto exit;
> +		up_read(&current->mm->mmap_sem);
> +		return KVM_PFN_ERR_HWPOISON;
>  	}
>  
>  retry:
> +	*p_page = NULL;
>  	vma = find_vma_intersection(current->mm, addr, addr + 1);
>  
>  	if (vma = NULL)
> @@ -1529,9 +1523,13 @@ static kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool *async,
>  			*async = true;
>  		pfn = KVM_PFN_ERR_FAULT;
>  	}
> -exit:
>  	up_read(&current->mm->mmap_sem);
>  	return pfn;
> +
> +got_page:
> +	if (p_page)
> +		*p_page = page;
> +	return page_to_pfn(page);
>  }
>  
>  kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
> @@ -1559,7 +1557,7 @@ kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
>  	}
>  
>  	return hva_to_pfn(addr, atomic, async, write_fault,
> -			  writable);
> +			  writable, NULL);
>  }
>  EXPORT_SYMBOL_GPL(__gfn_to_pfn_memslot);
>

WARNING: multiple messages have this Message-ID (diff)

From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
To: "Paolo Bonzini" <pbonzini@redhat.com>,
	"Cédric Le Goater" <clg@kaod.org>,
	kvm-ppc@vger.kernel.org, kvm@vger.kernel.org,
	"Paul Mackerras" <paulus@samba.org>
Subject: Re: [RFC PATCH] KVM: PPC: Book3S HV: add support for page faults in VM_IO|VM_PFNMAP vmas
Date: Mon, 12 Feb 2018 21:33:53 +1100	[thread overview]
Message-ID: <1518431633.2959.9.camel@kernel.crashing.org> (raw)
In-Reply-To: <f0d6d84c-1a59-a38f-1b38-eec798236378@redhat.com>

On Mon, 2018-02-12 at 10:57 +0100, Paolo Bonzini wrote:
> On 09/02/2018 22:44, Benjamin Herrenschmidt wrote:
> > > 
> > > (Also, why is this code reinventing most of hva_to_pfn?  Not asking you
> > > to fix that of course, but perhaps there's something to improve in the
> > > generic code too).
> > 
> > I've been wondering the same thing, which led me to try to understand
> > hva_to_pfn, which resulted in skull marks in the wall ;-)
> > 
> > So hva_to_pfn does a complicated dance around 3 or 4 different variants
> > of gup and it's really not obvious why, the "intent" doesn't appear to
> > be documented clearly. I think I somewhat figued out it relates to the
> > async page fault stuff but even then I don't actually know what those
> > are and what is their purpose :-)
> 
> Yeah, hva_to_pfn has all those arguments to tweak its behavior, however 
> you can treat it as essentially get_user_pages_unlocked:
> 
> - if async==false && atomic==false you can ignore hva_to_pfn_fast.  
> (hva_to_pfn_fast uses __get_user_pages_fast)

Do you mean async == NULL && atomic == false ? IE. async is a
pointer...

> - hva_to_pfn_slow then is essentially get_user_pages_unlocked.  
> Optionally it's followed by __get_user_pages_fast to get a writable 
> mapping if the caller would like it but does not require it, but you can 
> ignore this if you pass writable==NULL.
>
> PPC uses get_user_pages_fast; x86 cannot use it directly because we need 
> FOLL_HWPOISON and get_user_pages_fast does not support it.

Remind me what that does ? I've lost track ... we do support some
amount of HWPOISON stuff on powerpc so we should probably do that
too...

I also at some point would like to understand how we sync the dirty
bit in a race free way ;-) For the accessed bit it looks like the
generic mm always calls the notifiers that gets into our "age" callback
, but dirty seems to be treated differently. My limited understanding
right now is that we set it via gup on a write fault in the struct
page, and it can only be cleared via page_mkclean which takes the
mapping out. But I haven't checked that this is 100% tight (meaning we 
don't actually rely on the 2nd level page table dirty bit except for
dirty_map tracking).
 
> However, get_user_pages_fast is itself a combination of 
> __get_user_pages_fast and get_user_pages_unlocked.  So if it is useful 
> for PPC to use get_user_pages_fast, perhaps the generic code should go 
> down hva_to_pfn_fast unconditionally, even if async and atomic are both 
> false.
> 
> The other big difference is that you follow up with the PageHuge check 
> to get compound_head/compound_order.  This is a bit like 
> kvm_host_page_size, but different.  It would be fine for me to add a 
> struct page ** argument to hva_to_pfn if that's enough to remove the 
> duplicate code in arch/powerpc/kvm/.

Yes, that would be nice.

> See the untested/uncompiled patch below the signature for some ideas.

I can give that a spin tomorrow with a bit of luck (30 other things on
my plate), or maybe Paul can.

BTW. Is there some doc/explanation about the whole Async page fault
business ? What is it about ? I wonder if it's something we
could/should use as well.

Cheers,
Ben.

> Thanks,
> 
> Paolo
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index b4414842b023..a2c8824ae608 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1339,15 +1339,12 @@ static inline int check_user_page_hwpoison(unsigned long addr)
>   * The atomic path to get the writable pfn which will be stored in @pfn,
>   * true indicates success, otherwise false is returned.
>   */
> -static bool hva_to_pfn_fast(unsigned long addr, bool atomic, bool *async,
> -			    bool write_fault, bool *writable, kvm_pfn_t *pfn)
> +static struct page *__hva_to_page_fast(unsigned long addr, bool write_fault,
> +				       bool *writable)
>  {
>  	struct page *page[1];
>  	int npages;
>  
> -	if (!(async || atomic))
> -		return false;
> -
>  	/*
>  	 * Fast pin a writable pfn only if it is a write fault request
>  	 * or the caller allows to map a writable pfn for a read fault
> @@ -1357,23 +1354,21 @@ static bool hva_to_pfn_fast(unsigned long addr, bool atomic, bool *async,
>  		return false;
>  
>  	npages = __get_user_pages_fast(addr, 1, 1, page);
> -	if (npages == 1) {
> -		*pfn = page_to_pfn(page[0]);
> +	if (npages < 0)
> +		return ERR_PTR(npages);
>  
> -		if (writable)
> -			*writable = true;
> -		return true;
> -	}
> +	if (writable)
> +		*writable = true;
>  
> -	return false;
> +	return page[0];
>  }
>  
>  /*
>   * The slow path to get the pfn of the specified host virtual address,
>   * 1 indicates success, -errno is returned if error is detected.
>   */
> -static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
> -			   bool *writable, kvm_pfn_t *pfn)
> +static struct page *hva_to_page_unlocked(unsigned long addr, bool *async,
> +					 bool write_fault, bool *writable)
>  {
>  	struct page *page[1];
>  	int npages = 0;
> @@ -1395,24 +1390,20 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
>  
>  		npages = get_user_pages_unlocked(addr, 1, page, flags);
>  	}
> -	if (npages != 1)
> -		return npages;
> +	if (npages < 0)
> +		return ERR_PTR(npages);
>  
>  	/* map read fault as writable if possible */
>  	if (unlikely(!write_fault) && writable) {
> -		struct page *wpage[1];
> -
> -		npages = __get_user_pages_fast(addr, 1, 1, wpage);
> -		if (npages == 1) {
> -			*writable = true;
> +		struct page *wpage;
> +		wpage = __hva_to_page_fast(addr, write_fault, writable);
> +		if (!IS_ERR(wpage)) {
>  			put_page(page[0]);
> -			page[0] = wpage[0];
> +			return wpage;
>  		}
> -
> -		npages = 1;
>  	}
> -	*pfn = page_to_pfn(page[0]);
> -	return npages;
> +
> +	return page[0];
>  }
>  
>  static bool vma_is_valid(struct vm_area_struct *vma, bool write_fault)
> @@ -1487,33 +1478,37 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
>   *     whether the mapping is writable.
>   */
>  static kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool *async,
> -			bool write_fault, bool *writable)
> +			    bool write_fault, bool *writable, struct page **p_page)
>  {
>  	struct vm_area_struct *vma;
>  	kvm_pfn_t pfn = 0;
> -	int npages, r;
> +	struct page *page;
> +	int r;
>  
>  	/* we can do it either atomically or asynchronously, not both */
>  	BUG_ON(atomic && async);
>  
> -	if (hva_to_pfn_fast(addr, atomic, async, write_fault, writable, &pfn))
> -		return pfn;
> +	page = __hva_to_page_fast(addr, write_fault, writable);
> +	if (!IS_ERR(page))
> +		goto got_page;
>  
>  	if (atomic)
>  		return KVM_PFN_ERR_FAULT;
>  
> -	npages = hva_to_pfn_slow(addr, async, write_fault, writable, &pfn);
> -	if (npages == 1)
> -		return pfn;
> +	page = hva_to_page_unlocked(addr, async, write_fault, writable);
> +	if (!IS_ERR(page))
> +		goto got_page;
>  
>  	down_read(&current->mm->mmap_sem);
> -	if (npages == -EHWPOISON ||
> +	/* FIXME, is check_user_page_hwpoison still needed? */
> +	if (PTR_ERR(page) == -EHWPOISON ||
>  	      (!async && check_user_page_hwpoison(addr))) {
> -		pfn = KVM_PFN_ERR_HWPOISON;
> -		goto exit;
> +		up_read(&current->mm->mmap_sem);
> +		return KVM_PFN_ERR_HWPOISON;
>  	}
>  
>  retry:
> +	*p_page = NULL;
>  	vma = find_vma_intersection(current->mm, addr, addr + 1);
>  
>  	if (vma == NULL)
> @@ -1529,9 +1523,13 @@ static kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool *async,
>  			*async = true;
>  		pfn = KVM_PFN_ERR_FAULT;
>  	}
> -exit:
>  	up_read(&current->mm->mmap_sem);
>  	return pfn;
> +
> +got_page:
> +	if (p_page)
> +		*p_page = page;
> +	return page_to_pfn(page);
>  }
>  
>  kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
> @@ -1559,7 +1557,7 @@ kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
>  	}
>  
>  	return hva_to_pfn(addr, atomic, async, write_fault,
> -			  writable);
> +			  writable, NULL);
>  }
>  EXPORT_SYMBOL_GPL(__gfn_to_pfn_memslot);
>

next prev parent reply	other threads:[~2018-02-12 10:33 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-02-08 16:03 [RFC PATCH] KVM: PPC: Book3S HV: add support for page faults in VM_IO|VM_PFNMAP vmas Cédric Le Goater
2018-02-08 16:03 ` Cédric Le Goater
2018-02-08 22:19 ` Benjamin Herrenschmidt
2018-02-08 22:19   ` Benjamin Herrenschmidt
2018-02-09  7:14   ` Cédric Le Goater
2018-02-09  7:14     ` Cédric Le Goater
2018-02-09 15:01     ` Paolo Bonzini
2018-02-09 15:01       ` Paolo Bonzini
2018-02-09 21:44       ` Benjamin Herrenschmidt
2018-02-09 21:44         ` Benjamin Herrenschmidt
2018-02-12  9:57         ` Paolo Bonzini
2018-02-12  9:57           ` Paolo Bonzini
2018-02-12 10:33           ` Benjamin Herrenschmidt [this message]
2018-02-12 10:33             ` Benjamin Herrenschmidt
2018-02-12 10:41             ` Paolo Bonzini
2018-02-12 10:41               ` Paolo Bonzini
2018-02-12 10:54               ` Benjamin Herrenschmidt
2018-02-12 10:54                 ` Benjamin Herrenschmidt
2018-02-12 11:01                 ` Paolo Bonzini
2018-02-12 11:01                   ` Paolo Bonzini
2018-02-12 14:38                   ` Benjamin Herrenschmidt
2018-02-12 14:38                     ` Benjamin Herrenschmidt
2018-02-12 14:41                     ` Paolo Bonzini
2018-02-12 14:41                       ` Paolo Bonzini
2018-02-09 21:54     ` Benjamin Herrenschmidt
2018-02-09 21:54       ` Benjamin Herrenschmidt

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1518431633.2959.9.camel@kernel.crashing.org \
    --to=benh@kernel.crashing.org \
    --cc=clg@kaod.org \
    --cc=kvm-ppc@vger.kernel.org \
    --cc=kvm@vger.kernel.org \
    --cc=paulus@samba.org \
    --cc=pbonzini@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.