[PATCH 00/25] mm: Page fault accounting cleanups

public inbox for linux-s390@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH 00/25] mm: Page fault accounting cleanups
@ 2020-06-15 22:15 Peter Xu
  2020-06-15 22:23 ` [PATCH 19/25] mm/s390: Use mm_fault_accounting() Peter Xu
  2020-06-16 18:55 ` [PATCH 00/25] mm: Page fault accounting cleanups Linus Torvalds
  0 siblings, 2 replies; 13+ messages in thread
From: Peter Xu @ 2020-06-15 22:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: Gerald Schaefer, Andrew Morton, peterx, Linus Torvalds,
	Andrea Arcangeli

Gerald Schaefer reported an issue a week ago regarding to incorrect page fault
accountings for retried page fault after commit 4064b9827063 ("mm: allow
VM_FAULT_RETRY for multiple times"):

  https://lore.kernel.org/lkml/20200610174811.44b94525@thinkpad/

When I was looking at the issue, I actually found some other trivial issues too
related to the page fault accounting, namely:

  - Incorrect accountings

    - The major issue that was reported by Gerald, that we did multiple
      accounting when the page fault retried while we should only do it once
      (for either perf event accounting or per-task accounting).

    - In many archs, major fault is only accounted when the 1st page fault is a
      major fault, or the last page fault is a major fault.  Ideally we should
      account the page fault as a major fault as long as any of the page fault
      retries is a major fault and keep the same behavior across archs.

  - Missing accountings of perf events (PERF_COUNT_SW_PAGE_FAULTS[_[MAJ|MIN]]).

  - Doing accounting with mmap_sem: not a big deal, but logically we can move
    it after releasing the mmap_sem because accounting does not need it.

This series tries to address all of them by introducing mm_fault_accounting()
first, so that we move all the page fault accounting into the common code base,
then call it properly from arch pf handlers just like handle_mm_fault().  There
are some special cases, e.g., the um arch does not have "regs" in the fault
handler, so it's still using the manual statistics.

For each of the patch that fixes a specific arch, I'm CCing the maintainers and
the arch list if there is.  Besides, I only lightly tested this series on x86.

Please have a look, thanks.

Peter Xu (25):
  mm/um: Fix extra accounting for page fault retries
  mm: Introduce mm_fault_accounting()
  mm/alpha: Use mm_fault_accounting()
  mm/arc: Use mm_fault_accounting()
  mm/arm: Use mm_fault_accounting()
  mm/arm64: Use mm_fault_accounting()
  mm/csky: Use mm_fault_accounting()
  mm/hexagon: Use mm_fault_accounting()
  mm/ia64: Use mm_fault_accounting()
  mm/m68k: Use mm_fault_accounting()
  mm/microblaze: Use mm_fault_accounting()
  mm/mips: Use mm_fault_accounting()
  mm/nds32: Use mm_fault_accounting()
  mm/nios2: Use mm_fault_accounting()
  mm/openrisc: Use mm_fault_accounting()
  mm/parisc: Use mm_fault_accounting()
  mm/powerpc: Use mm_fault_accounting()
  mm/riscv: Use mm_fault_accounting()
  mm/s390: Use mm_fault_accounting()
  mm/sh: Use mm_fault_accounting()
  mm/sparc32: Use mm_fault_accounting()
  mm/sparc64: Use mm_fault_accounting()
  mm/unicore32: Use mm_fault_accounting()
  mm/x86: Use mm_fault_accounting()
  mm/xtensa: Use mm_fault_accounting()

 arch/alpha/mm/fault.c      |  9 ++++-----
 arch/arc/mm/fault.c        | 15 +++------------
 arch/arm/mm/fault.c        | 21 ++++-----------------
 arch/arm64/mm/fault.c      | 17 ++---------------
 arch/csky/mm/fault.c       | 13 +------------
 arch/hexagon/mm/vm_fault.c |  9 +++------
 arch/ia64/mm/fault.c       |  8 +++-----
 arch/m68k/mm/fault.c       | 13 +++----------
 arch/microblaze/mm/fault.c |  8 +++-----
 arch/mips/mm/fault.c       | 14 +++-----------
 arch/nds32/mm/fault.c      | 19 +++----------------
 arch/nios2/mm/fault.c      | 13 +++----------
 arch/openrisc/mm/fault.c   |  8 +++-----
 arch/parisc/mm/fault.c     |  8 +++-----
 arch/powerpc/mm/fault.c    | 13 ++++---------
 arch/riscv/mm/fault.c      | 21 +++------------------
 arch/s390/mm/fault.c       | 21 +++++----------------
 arch/sh/mm/fault.c         | 15 +++------------
 arch/sparc/mm/fault_32.c   | 16 +++-------------
 arch/sparc/mm/fault_64.c   | 16 ++++------------
 arch/um/kernel/trap.c      | 13 +++++++------
 arch/unicore32/mm/fault.c  | 15 +++++++--------
 arch/x86/mm/fault.c        | 10 +---------
 arch/xtensa/mm/fault.c     | 14 +++-----------
 include/linux/mm.h         |  2 ++
 mm/memory.c                | 18 ++++++++++++++++++
 26 files changed, 101 insertions(+), 248 deletions(-)

-- 
2.26.2

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 19/25] mm/s390: Use mm_fault_accounting()
  2020-06-15 22:15 [PATCH 00/25] mm: Page fault accounting cleanups Peter Xu
@ 2020-06-15 22:23 ` Peter Xu
  2020-06-16 15:59   ` Alexander Gordeev
  2020-06-16 18:55 ` [PATCH 00/25] mm: Page fault accounting cleanups Linus Torvalds
  1 sibling, 1 reply; 13+ messages in thread
From: Peter Xu @ 2020-06-15 22:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Xu, Gerald Schaefer, Andrew Morton, Linus Torvalds,
	Andrea Arcangeli, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, linux-s390

Use the new mm_fault_accounting() helper for page fault accounting.

Avoid doing page fault accounting multiple times if the page fault is retried.

CC: Heiko Carstens <heiko.carstens@de.ibm.com>
CC: Vasily Gorbik <gor@linux.ibm.com>
CC: Christian Borntraeger <borntraeger@de.ibm.com>
CC: linux-s390@vger.kernel.org
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/s390/mm/fault.c | 21 +++++----------------
 1 file changed, 5 insertions(+), 16 deletions(-)

diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index dedc28be27ab..8ca207635b59 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -392,7 +392,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
 	unsigned long trans_exc_code;
 	unsigned long address;
 	unsigned int flags;
-	vm_fault_t fault;
+	vm_fault_t fault, major = 0;
 
 	tsk = current;
 	/*
@@ -428,7 +428,6 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
 	}
 
 	address = trans_exc_code & __FAIL_ADDR_MASK;
-	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
 	flags = FAULT_FLAG_DEFAULT;
 	if (user_mode(regs))
 		flags |= FAULT_FLAG_USER;
@@ -480,6 +479,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
 	 * the fault.
 	 */
 	fault = handle_mm_fault(vma, address, flags);
+	major |= fault & VM_FAULT_MAJOR;
 	if (fault_signal_pending(fault, regs)) {
 		fault = VM_FAULT_SIGNAL;
 		if (flags & FAULT_FLAG_RETRY_NOWAIT)
@@ -489,21 +489,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
 	if (unlikely(fault & VM_FAULT_ERROR))
 		goto out_up;
 
-	/*
-	 * Major/minor page fault accounting is only done on the
-	 * initial attempt. If we go through a retry, it is extremely
-	 * likely that the page will be found in page cache at that point.
-	 */
 	if (flags & FAULT_FLAG_ALLOW_RETRY) {
-		if (fault & VM_FAULT_MAJOR) {
-			tsk->maj_flt++;
-			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1,
-				      regs, address);
-		} else {
-			tsk->min_flt++;
-			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1,
-				      regs, address);
-		}
 		if (fault & VM_FAULT_RETRY) {
 			if (IS_ENABLED(CONFIG_PGSTE) && gmap &&
 			    (flags & FAULT_FLAG_RETRY_NOWAIT)) {
@@ -519,6 +505,9 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
 			goto retry;
 		}
 	}
+
+	mm_fault_accounting(tsk, regs, address, major);
+
 	if (IS_ENABLED(CONFIG_PGSTE) && gmap) {
 		address =  __gmap_link(gmap, current->thread.gmap_addr,
 				       address);
-- 
2.26.2

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 19/25] mm/s390: Use mm_fault_accounting()
  2020-06-15 22:23 ` [PATCH 19/25] mm/s390: Use mm_fault_accounting() Peter Xu
@ 2020-06-16 15:59   ` Alexander Gordeev
  2020-06-16 16:35     ` Peter Xu
  0 siblings, 1 reply; 13+ messages in thread
From: Alexander Gordeev @ 2020-06-16 15:59 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, Gerald Schaefer, Andrew Morton, Linus Torvalds,
	Andrea Arcangeli, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, linux-s390

On Mon, Jun 15, 2020 at 06:23:02PM -0400, Peter Xu wrote:
> Use the new mm_fault_accounting() helper for page fault accounting.
> 
> Avoid doing page fault accounting multiple times if the page fault is retried.
> 
> CC: Heiko Carstens <heiko.carstens@de.ibm.com>
> CC: Vasily Gorbik <gor@linux.ibm.com>
> CC: Christian Borntraeger <borntraeger@de.ibm.com>
> CC: linux-s390@vger.kernel.org
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  arch/s390/mm/fault.c | 21 +++++----------------
>  1 file changed, 5 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
> index dedc28be27ab..8ca207635b59 100644
> --- a/arch/s390/mm/fault.c
> +++ b/arch/s390/mm/fault.c
> @@ -392,7 +392,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
>  	unsigned long trans_exc_code;
>  	unsigned long address;
>  	unsigned int flags;
> -	vm_fault_t fault;
> +	vm_fault_t fault, major = 0;
> 
>  	tsk = current;
>  	/*
> @@ -428,7 +428,6 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
>  	}
> 
>  	address = trans_exc_code & __FAIL_ADDR_MASK;
> -	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
>  	flags = FAULT_FLAG_DEFAULT;
>  	if (user_mode(regs))
>  		flags |= FAULT_FLAG_USER;
> @@ -480,6 +479,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
>  	 * the fault.
>  	 */
>  	fault = handle_mm_fault(vma, address, flags);
> +	major |= fault & VM_FAULT_MAJOR;
>  	if (fault_signal_pending(fault, regs)) {
>  		fault = VM_FAULT_SIGNAL;
>  		if (flags & FAULT_FLAG_RETRY_NOWAIT)
> @@ -489,21 +489,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
>  	if (unlikely(fault & VM_FAULT_ERROR))
>  		goto out_up;
> 
> -	/*
> -	 * Major/minor page fault accounting is only done on the
> -	 * initial attempt. If we go through a retry, it is extremely
> -	 * likely that the page will be found in page cache at that point.
> -	 */
>  	if (flags & FAULT_FLAG_ALLOW_RETRY) {
> -		if (fault & VM_FAULT_MAJOR) {
> -			tsk->maj_flt++;
> -			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1,
> -				      regs, address);
> -		} else {
> -			tsk->min_flt++;
> -			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1,
> -				      regs, address);
> -		}
>  		if (fault & VM_FAULT_RETRY) {
>  			if (IS_ENABLED(CONFIG_PGSTE) && gmap &&
>  			    (flags & FAULT_FLAG_RETRY_NOWAIT)) {

Seems like the call to mm_fault_accounting() will be missed if
we entered here with FAULT_FLAG_RETRY_NOWAIT flag set, since it
jumps to "out_up"...

> @@ -519,6 +505,9 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
>  			goto retry;
>  		}
>  	}
> +
> +	mm_fault_accounting(tsk, regs, address, major);
> +
>  	if (IS_ENABLED(CONFIG_PGSTE) && gmap) {
>  		address =  __gmap_link(gmap, current->thread.gmap_addr,
>  				       address);
> -- 
> 2.26.2
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 19/25] mm/s390: Use mm_fault_accounting()
  2020-06-16 15:59   ` Alexander Gordeev
@ 2020-06-16 16:35     ` Peter Xu
  2020-06-17  6:19       ` Christian Borntraeger
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Xu @ 2020-06-16 16:35 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: linux-kernel, Gerald Schaefer, Andrew Morton, Linus Torvalds,
	Andrea Arcangeli, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, linux-s390

Hi, Alexander,

On Tue, Jun 16, 2020 at 05:59:33PM +0200, Alexander Gordeev wrote:
> > @@ -489,21 +489,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
> >  	if (unlikely(fault & VM_FAULT_ERROR))
> >  		goto out_up;
> > 
> > -	/*
> > -	 * Major/minor page fault accounting is only done on the
> > -	 * initial attempt. If we go through a retry, it is extremely
> > -	 * likely that the page will be found in page cache at that point.
> > -	 */
> >  	if (flags & FAULT_FLAG_ALLOW_RETRY) {
> > -		if (fault & VM_FAULT_MAJOR) {
> > -			tsk->maj_flt++;
> > -			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1,
> > -				      regs, address);
> > -		} else {
> > -			tsk->min_flt++;
> > -			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1,
> > -				      regs, address);
> > -		}
> >  		if (fault & VM_FAULT_RETRY) {
> >  			if (IS_ENABLED(CONFIG_PGSTE) && gmap &&
> >  			    (flags & FAULT_FLAG_RETRY_NOWAIT)) {
> 
> Seems like the call to mm_fault_accounting() will be missed if
> we entered here with FAULT_FLAG_RETRY_NOWAIT flag set, since it
> jumps to "out_up"...

This is true as a functional change.  However that also means that we've got a
VM_FAULT_RETRY, which hints that this fault has been requested to retry rather
than handled correctly (for instance, due to some try_lock failed during the
fault process).

To me, that case should not be counted as a page fault at all?  Or we might get
the same duplicated accounting when the page fault retried from a higher stack.

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 00/25] mm: Page fault accounting cleanups
  2020-06-15 22:15 [PATCH 00/25] mm: Page fault accounting cleanups Peter Xu
  2020-06-15 22:23 ` [PATCH 19/25] mm/s390: Use mm_fault_accounting() Peter Xu
@ 2020-06-16 18:55 ` Linus Torvalds
  2020-06-16 21:03   ` Peter Xu
  2020-06-17  0:55   ` Michael Ellerman
  1 sibling, 2 replies; 13+ messages in thread
From: Linus Torvalds @ 2020-06-16 18:55 UTC (permalink / raw)
  To: Peter Xu
  Cc: Linux Kernel Mailing List, Gerald Schaefer, Andrew Morton,
	Andrea Arcangeli, openrisc, linux-arch, Alexander Gordeev,
	linux-s390, Will Deacon, Catalin Marinas, Linux ARM

[-- Attachment #1: Type: text/plain, Size: 2788 bytes --]

On Mon, Jun 15, 2020 at 3:16 PM Peter Xu <peterx@redhat.com> wrote:
>
> This series tries to address all of them by introducing mm_fault_accounting()
> first, so that we move all the page fault accounting into the common code base,
> then call it properly from arch pf handlers just like handle_mm_fault().

Hmm.

So having looked at this a bit more, I'd actually like to go even
further, and just get rid of the per-architecture code _entirely_.

Here's a straw-man patch to the generic code - the idea is mostly laid
out in the comment that I'm just quoting here directly too:

        /*
         * Do accounting in the common code, to avoid unnecessary
         * architecture differences or duplicated code.
         *
         * We arbitrarily make the rules be:
         *
         *  - faults that never even got here (because the address
         *    wasn't valid). That includes arch_vma_access_permitted()
         *    failing above.
         *
         *    So this is expressly not a "this many hardware page
         *    faults" counter. Use the hw profiling for that.
         *
         *  - incomplete faults (ie RETRY) do not count (see above).
         *    They will only count once completed.
         *
         *  - the fault counts as a "major" fault when the final
         *    successful fault is VM_FAULT_MAJOR, or if it was a
         *    retry (which implies that we couldn't handle it
         *    immediately previously).
         *
         *  - if the fault is done for GUP, regs wil be NULL and
         *    no accounting will be done (but you _could_ pass in
         *    your own regs and it would be accounted to the thread
         *    doing the fault, not to the target!)
         */

the code itself in the patch is

 (a) pretty trivial and self-evident

 (b) INCOMPLETE

that (b) is worth noting: this patch won't compile on its own. It
intentionally leaves all the users without the new 'regs' argument,
because you obviously simply need to remove all the code that
currently tries to do any accounting.

Comments?

This is a bigger change, but I think it might be worth it to _really_
consolidate the major/minor logic.

One detail worth noting: I do wonder if we should put the

    perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);

just in the arch code at the top of the fault handling, and consider
it entirely unrelated to the major/minor fault handling. The
major/minor faults fundamnetally are about successes. But the plain
PERF_COUNT_SW_PAGE_FAULTS could be about things that fail, including
things that never even get to this point at all.

I'm not convinced it's useful to have three SW events that are defined
to be A=B+C.

But this does *not* do that part. It' sreally just an RFC patch.

                    Linus

[-- Attachment #2: patch --]
[-- Type: application/octet-stream, Size: 2918 bytes --]

 include/linux/mm.h |  3 ++-
 mm/memory.c        | 46 +++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 47 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index dc7b87310c10..e82a604339c0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1648,7 +1648,8 @@ int invalidate_inode_page(struct page *page);
 
 #ifdef CONFIG_MMU
 extern vm_fault_t handle_mm_fault(struct vm_area_struct *vma,
-			unsigned long address, unsigned int flags);
+			unsigned long address, unsigned int flags,
+			struct pt_regs *);
 extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
 			    unsigned long address, unsigned int fault_flags,
 			    bool *unlocked);
diff --git a/mm/memory.c b/mm/memory.c
index dc7f3543b1fd..f15056151fb1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -72,6 +72,7 @@
 #include <linux/oom.h>
 #include <linux/numa.h>
 
+#include <linux/perf_event.h>
 #include <trace/events/kmem.h>
 
 #include <asm/io.h>
@@ -4353,7 +4354,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
  * return value.  See filemap_fault() and __lock_page_or_retry().
  */
 vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
-		unsigned int flags)
+		unsigned int flags, struct pt_regs *regs)
 {
 	vm_fault_t ret;
 
@@ -4394,6 +4395,49 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 			mem_cgroup_oom_synchronize(false);
 	}
 
+	if (ret & VM_FAULT_RETRY)
+		return ret;
+
+	/*
+	 * Do accounting in the common code, to avoid unnecessary
+	 * architecture differences or duplicated code.
+	 *
+	 * We arbitrarily make the rules be:
+	 *
+	 *  - faults that never even got here (because the address
+	 *    wasn't valid). That includes arch_vma_access_permitted()
+	 *    failing above.
+	 *
+	 *    So this is expressly not a "this many hardware page
+	 *    faults" counter. Use the hw profiling for that.
+	 *
+	 *  - incomplete faults (ie RETRY) do not count (see above).
+	 *    They will only count once completed.
+	 *
+	 *  - the fault counts as a "major" fault when the final
+	 *    successful fault is VM_FAULT_MAJOR, or if it was a
+	 *    retry (which implies that we couldn't handle it
+	 *    immediately previously).
+	 *
+	 *  - if the fault is done for GUP, regs wil be NULL and
+	 *    no accounting will be done (but you _could_ pass in
+	 *    your own regs and it would be accounted to the thread
+	 *    doing the fault, not to the target!)
+	 */
+
+	if (!regs)
+		return ret;
+
+	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
+
+	if ((ret & VM_FAULT_MAJOR) || (flags & FAULT_FLAG_TRIED)) {
+		current->maj_flt++;
+		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address);
+	} else {
+		current->min_flt++;
+		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
+	}
+
 	return ret;
 }
 EXPORT_SYMBOL_GPL(handle_mm_fault);

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 00/25] mm: Page fault accounting cleanups
  2020-06-16 18:55 ` [PATCH 00/25] mm: Page fault accounting cleanups Linus Torvalds
@ 2020-06-16 21:03   ` Peter Xu
  2020-06-17  0:55   ` Michael Ellerman
  1 sibling, 0 replies; 13+ messages in thread
From: Peter Xu @ 2020-06-16 21:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, Gerald Schaefer, Andrew Morton,
	Andrea Arcangeli, openrisc, linux-arch, Alexander Gordeev,
	linux-s390, Will Deacon, Catalin Marinas, Linux ARM

On Tue, Jun 16, 2020 at 11:55:17AM -0700, Linus Torvalds wrote:
> On Mon, Jun 15, 2020 at 3:16 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > This series tries to address all of them by introducing mm_fault_accounting()
> > first, so that we move all the page fault accounting into the common code base,
> > then call it properly from arch pf handlers just like handle_mm_fault().
> 
> Hmm.
> 
> So having looked at this a bit more, I'd actually like to go even
> further, and just get rid of the per-architecture code _entirely_.
> 
> Here's a straw-man patch to the generic code - the idea is mostly laid
> out in the comment that I'm just quoting here directly too:
> 
>         /*
>          * Do accounting in the common code, to avoid unnecessary
>          * architecture differences or duplicated code.
>          *
>          * We arbitrarily make the rules be:
>          *
>          *  - faults that never even got here (because the address
>          *    wasn't valid). That includes arch_vma_access_permitted()
>          *    failing above.
>          *
>          *    So this is expressly not a "this many hardware page
>          *    faults" counter. Use the hw profiling for that.
>          *
>          *  - incomplete faults (ie RETRY) do not count (see above).
>          *    They will only count once completed.
>          *
>          *  - the fault counts as a "major" fault when the final
>          *    successful fault is VM_FAULT_MAJOR, or if it was a
>          *    retry (which implies that we couldn't handle it
>          *    immediately previously).
>          *
>          *  - if the fault is done for GUP, regs wil be NULL and
>          *    no accounting will be done (but you _could_ pass in
>          *    your own regs and it would be accounted to the thread
>          *    doing the fault, not to the target!)
>          */
> 
> the code itself in the patch is
> 
>  (a) pretty trivial and self-evident
> 
>  (b) INCOMPLETE
> 
> that (b) is worth noting: this patch won't compile on its own. It
> intentionally leaves all the users without the new 'regs' argument,
> because you obviously simply need to remove all the code that
> currently tries to do any accounting.
> 
> Comments?

Looks clean to me.  The definition of "major faults" will slightly change even
for those who has done the "major |= fault & MAJOR" operations before, but at
least I can't see anything bad about that either...

To make things easier, we can use the 1st patch to introduce this change,
however pass regs==NULL at the callers to never trigger this accounting.  Then
we can still use one patch for each arch to do the final convertions.

> 
> This is a bigger change, but I think it might be worth it to _really_
> consolidate the major/minor logic.
> 
> One detail worth noting: I do wonder if we should put the
> 
>     perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
> 
> just in the arch code at the top of the fault handling, and consider
> it entirely unrelated to the major/minor fault handling. The
> major/minor faults fundamnetally are about successes. But the plain
> PERF_COUNT_SW_PAGE_FAULTS could be about things that fail, including
> things that never even get to this point at all.
> 
> I'm not convinced it's useful to have three SW events that are defined
> to be A=B+C.

IMHO it's still common to have a "total" statistics in softwares even if each
of the subsets are accounted separately.  Here it's just a bit special because
there're only two elements so the addition is so straightforward.  It seems a
trade-off on whether we'd like to do the accounting of errornous faults, or we
want to make it cleaner by put them altogether but only successful page faults.
I slightly preferred the latter due to the fact that I failed to find great
usefulness out of keeping error fault accountings, but no strong opinions..

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 00/25] mm: Page fault accounting cleanups
  2020-06-16 18:55 ` [PATCH 00/25] mm: Page fault accounting cleanups Linus Torvalds
  2020-06-16 21:03   ` Peter Xu
@ 2020-06-17  0:55   ` Michael Ellerman
  2020-06-17  8:04     ` Will Deacon
  1 sibling, 1 reply; 13+ messages in thread
From: Michael Ellerman @ 2020-06-17  0:55 UTC (permalink / raw)
  To: Linus Torvalds, Peter Xu
  Cc: Linux Kernel Mailing List, Gerald Schaefer, Andrew Morton,
	Andrea Arcangeli, openrisc, linux-arch, Alexander Gordeev,
	linux-s390, Will Deacon, Catalin Marinas, Linux ARM

Linus Torvalds <torvalds@linux-foundation.org> writes:
> On Mon, Jun 15, 2020 at 3:16 PM Peter Xu <peterx@redhat.com> wrote:
>> This series tries to address all of them by introducing mm_fault_accounting()
>> first, so that we move all the page fault accounting into the common code base,
>> then call it properly from arch pf handlers just like handle_mm_fault().
>
> Hmm.
>
> So having looked at this a bit more, I'd actually like to go even
> further, and just get rid of the per-architecture code _entirely_.

<snip>

> One detail worth noting: I do wonder if we should put the
>
>     perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
>
> just in the arch code at the top of the fault handling, and consider
> it entirely unrelated to the major/minor fault handling. The
> major/minor faults fundamnetally are about successes. But the plain
> PERF_COUNT_SW_PAGE_FAULTS could be about things that fail, including
> things that never even get to this point at all.

Yeah I think we should keep it in the arch code at roughly the top.

If it's moved to the end you could have a process spinning taking bad
page faults (and fixing them up), and see no sign of it from the perf
page fault counters.

cheers

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 19/25] mm/s390: Use mm_fault_accounting()
  2020-06-16 16:35     ` Peter Xu
@ 2020-06-17  6:19       ` Christian Borntraeger
  2020-06-17 16:06         ` Peter Xu
  0 siblings, 1 reply; 13+ messages in thread
From: Christian Borntraeger @ 2020-06-17  6:19 UTC (permalink / raw)
  To: Peter Xu, Alexander Gordeev
  Cc: linux-kernel, Gerald Schaefer, Andrew Morton, Linus Torvalds,
	Andrea Arcangeli, Heiko Carstens, Vasily Gorbik, linux-s390



On 16.06.20 18:35, Peter Xu wrote:
> Hi, Alexander,
> 
> On Tue, Jun 16, 2020 at 05:59:33PM +0200, Alexander Gordeev wrote:
>>> @@ -489,21 +489,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
>>>  	if (unlikely(fault & VM_FAULT_ERROR))
>>>  		goto out_up;
>>>
>>> -	/*
>>> -	 * Major/minor page fault accounting is only done on the
>>> -	 * initial attempt. If we go through a retry, it is extremely
>>> -	 * likely that the page will be found in page cache at that point.
>>> -	 */
>>>  	if (flags & FAULT_FLAG_ALLOW_RETRY) {
>>> -		if (fault & VM_FAULT_MAJOR) {
>>> -			tsk->maj_flt++;
>>> -			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1,
>>> -				      regs, address);
>>> -		} else {
>>> -			tsk->min_flt++;
>>> -			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1,
>>> -				      regs, address);
>>> -		}
>>>  		if (fault & VM_FAULT_RETRY) {
>>>  			if (IS_ENABLED(CONFIG_PGSTE) && gmap &&
>>>  			    (flags & FAULT_FLAG_RETRY_NOWAIT)) {
>>
>> Seems like the call to mm_fault_accounting() will be missed if
>> we entered here with FAULT_FLAG_RETRY_NOWAIT flag set, since it
>> jumps to "out_up"...
> 
> This is true as a functional change.  However that also means that we've got a
> VM_FAULT_RETRY, which hints that this fault has been requested to retry rather
> than handled correctly (for instance, due to some try_lock failed during the
> fault process).
> 
> To me, that case should not be counted as a page fault at all?  Or we might get
> the same duplicated accounting when the page fault retried from a higher stack.
> 
> Thanks

This case below (the one with the gmap) is the KVM case for doing a so called
pseudo page fault to our guests. (we notify our guests about major host page
faults and let it reschedule to something else instead of halting the vcpu).
This is being resolved with either gup or fixup_user_fault asynchronously by
KVM code (this can also be sync when the guest does not match some conditions)
We do not change the counters in that code as far as I can tell so we should
continue to do it here.

(see arch/s390/kvm/kvm-s390.c
static int vcpu_post_run(struct kvm_vcpu *vcpu, int exit_reason)
{
[...]
        } else if (current->thread.gmap_pfault) {
                trace_kvm_s390_major_guest_pfault(vcpu);
                current->thread.gmap_pfault = 0;
                if (kvm_arch_setup_async_pf(vcpu))
                        return 0;
                return kvm_arch_fault_in_page(vcpu, current->thread.gmap_addr, 1);
        }

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 00/25] mm: Page fault accounting cleanups
  2020-06-17  0:55   ` Michael Ellerman
@ 2020-06-17  8:04     ` Will Deacon
  2020-06-17 16:10       ` Peter Xu
  0 siblings, 1 reply; 13+ messages in thread
From: Will Deacon @ 2020-06-17  8:04 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Linus Torvalds, Peter Xu, Linux Kernel Mailing List,
	Gerald Schaefer, Andrew Morton, Andrea Arcangeli, openrisc,
	linux-arch, Alexander Gordeev, linux-s390, Catalin Marinas,
	Linux ARM

On Wed, Jun 17, 2020 at 10:55:14AM +1000, Michael Ellerman wrote:
> Linus Torvalds <torvalds@linux-foundation.org> writes:
> > On Mon, Jun 15, 2020 at 3:16 PM Peter Xu <peterx@redhat.com> wrote:
> >> This series tries to address all of them by introducing mm_fault_accounting()
> >> first, so that we move all the page fault accounting into the common code base,
> >> then call it properly from arch pf handlers just like handle_mm_fault().
> >
> > Hmm.
> >
> > So having looked at this a bit more, I'd actually like to go even
> > further, and just get rid of the per-architecture code _entirely_.
> 
> <snip>
> 
> > One detail worth noting: I do wonder if we should put the
> >
> >     perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
> >
> > just in the arch code at the top of the fault handling, and consider
> > it entirely unrelated to the major/minor fault handling. The
> > major/minor faults fundamnetally are about successes. But the plain
> > PERF_COUNT_SW_PAGE_FAULTS could be about things that fail, including
> > things that never even get to this point at all.
> 
> Yeah I think we should keep it in the arch code at roughly the top.

I agree. It's a nice idea to consolidate the code, but I don't see that
it's really possible for PERF_COUNT_SW_PAGE_FAULTS without significantly
changing the semantics (and a potentially less useful way. Of course,
moving more of do_page_fault() out of the arch code would be great, but
that's a much bigger effort.

> If it's moved to the end you could have a process spinning taking bad
> page faults (and fixing them up), and see no sign of it from the perf
> page fault counters.

The current arm64 behaviour is that we record PERF_COUNT_SW_PAGE_FAULTS
if _all_ of the following are true:

  1. The fault isn't handled by kprobes
  2. The pagefault handler is enabled
  3. We have an mm (current->mm)
  4. The fault isn't an unexpected kernel fault on a user address (we oops
     and kill the task in this case)

Which loosely corresponds to "we took a fault on a user address that it
looks like we can handle".

That said, I'm happy to tweak this if it brings us into line with other
architectures.

Will

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 19/25] mm/s390: Use mm_fault_accounting()
  2020-06-17  6:19       ` Christian Borntraeger
@ 2020-06-17 16:06         ` Peter Xu
  2020-06-17 16:14           ` Christian Borntraeger
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Xu @ 2020-06-17 16:06 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Alexander Gordeev, linux-kernel, Gerald Schaefer, Andrew Morton,
	Linus Torvalds, Andrea Arcangeli, Heiko Carstens, Vasily Gorbik,
	linux-s390

Hi, Christian,

On Wed, Jun 17, 2020 at 08:19:29AM +0200, Christian Borntraeger wrote:
> 
> 
> On 16.06.20 18:35, Peter Xu wrote:
> > Hi, Alexander,
> > 
> > On Tue, Jun 16, 2020 at 05:59:33PM +0200, Alexander Gordeev wrote:
> >>> @@ -489,21 +489,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
> >>>  	if (unlikely(fault & VM_FAULT_ERROR))
> >>>  		goto out_up;
> >>>
> >>> -	/*
> >>> -	 * Major/minor page fault accounting is only done on the
> >>> -	 * initial attempt. If we go through a retry, it is extremely
> >>> -	 * likely that the page will be found in page cache at that point.
> >>> -	 */
> >>>  	if (flags & FAULT_FLAG_ALLOW_RETRY) {
> >>> -		if (fault & VM_FAULT_MAJOR) {
> >>> -			tsk->maj_flt++;
> >>> -			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1,
> >>> -				      regs, address);
> >>> -		} else {
> >>> -			tsk->min_flt++;
> >>> -			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1,
> >>> -				      regs, address);
> >>> -		}
> >>>  		if (fault & VM_FAULT_RETRY) {
> >>>  			if (IS_ENABLED(CONFIG_PGSTE) && gmap &&
> >>>  			    (flags & FAULT_FLAG_RETRY_NOWAIT)) {

[1]

> >>
> >> Seems like the call to mm_fault_accounting() will be missed if
> >> we entered here with FAULT_FLAG_RETRY_NOWAIT flag set, since it
> >> jumps to "out_up"...
> > 
> > This is true as a functional change.  However that also means that we've got a
> > VM_FAULT_RETRY, which hints that this fault has been requested to retry rather
> > than handled correctly (for instance, due to some try_lock failed during the
> > fault process).
> > 
> > To me, that case should not be counted as a page fault at all?  Or we might get
> > the same duplicated accounting when the page fault retried from a higher stack.
> > 
> > Thanks
> 
> This case below (the one with the gmap) is the KVM case for doing a so called
> pseudo page fault to our guests. (we notify our guests about major host page
> faults and let it reschedule to something else instead of halting the vcpu).
> This is being resolved with either gup or fixup_user_fault asynchronously by
> KVM code (this can also be sync when the guest does not match some conditions)
> We do not change the counters in that code as far as I can tell so we should
> continue to do it here.
> 
> (see arch/s390/kvm/kvm-s390.c
> static int vcpu_post_run(struct kvm_vcpu *vcpu, int exit_reason)
> {
> [...]
>         } else if (current->thread.gmap_pfault) {
>                 trace_kvm_s390_major_guest_pfault(vcpu);
>                 current->thread.gmap_pfault = 0;
>                 if (kvm_arch_setup_async_pf(vcpu))
>                         return 0;
>                 return kvm_arch_fault_in_page(vcpu, current->thread.gmap_addr, 1);
>         }

Please correct me if I'm wrong... but I still think what this patch does is the
right thing to do.

Note again that IMHO when reached [1] above it means the page fault is not
handled correctly so we need to fallback to KVM async page fault, then we
shouldn't increment the accountings until it's finally handled correctly. That
final accounting should be done in the async pf path in gup code where the page
fault is handled:

  kvm_arch_fault_in_page
    gmap_fault
      fixup_user_fault

Where in fixup_user_fault() we have:

	if (tsk) {
		if (major)
			tsk->maj_flt++;
		else
			tsk->min_flt++;
	}

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 00/25] mm: Page fault accounting cleanups
  2020-06-17  8:04     ` Will Deacon
@ 2020-06-17 16:10       ` Peter Xu
  0 siblings, 0 replies; 13+ messages in thread
From: Peter Xu @ 2020-06-17 16:10 UTC (permalink / raw)
  To: Will Deacon
  Cc: Michael Ellerman, Linus Torvalds, Linux Kernel Mailing List,
	Gerald Schaefer, Andrew Morton, Andrea Arcangeli, openrisc,
	linux-arch, Alexander Gordeev, linux-s390, Catalin Marinas,
	Linux ARM

On Wed, Jun 17, 2020 at 09:04:06AM +0100, Will Deacon wrote:
> On Wed, Jun 17, 2020 at 10:55:14AM +1000, Michael Ellerman wrote:
> > Linus Torvalds <torvalds@linux-foundation.org> writes:
> > > On Mon, Jun 15, 2020 at 3:16 PM Peter Xu <peterx@redhat.com> wrote:
> > >> This series tries to address all of them by introducing mm_fault_accounting()
> > >> first, so that we move all the page fault accounting into the common code base,
> > >> then call it properly from arch pf handlers just like handle_mm_fault().
> > >
> > > Hmm.
> > >
> > > So having looked at this a bit more, I'd actually like to go even
> > > further, and just get rid of the per-architecture code _entirely_.
> > 
> > <snip>
> > 
> > > One detail worth noting: I do wonder if we should put the
> > >
> > >     perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
> > >
> > > just in the arch code at the top of the fault handling, and consider
> > > it entirely unrelated to the major/minor fault handling. The
> > > major/minor faults fundamnetally are about successes. But the plain
> > > PERF_COUNT_SW_PAGE_FAULTS could be about things that fail, including
> > > things that never even get to this point at all.
> > 
> > Yeah I think we should keep it in the arch code at roughly the top.
> 
> I agree. It's a nice idea to consolidate the code, but I don't see that
> it's really possible for PERF_COUNT_SW_PAGE_FAULTS without significantly
> changing the semantics (and a potentially less useful way. Of course,
> moving more of do_page_fault() out of the arch code would be great, but
> that's a much bigger effort.
> 
> > If it's moved to the end you could have a process spinning taking bad
> > page faults (and fixing them up), and see no sign of it from the perf
> > page fault counters.
> 
> The current arm64 behaviour is that we record PERF_COUNT_SW_PAGE_FAULTS
> if _all_ of the following are true:
> 
>   1. The fault isn't handled by kprobes
>   2. The pagefault handler is enabled
>   3. We have an mm (current->mm)
>   4. The fault isn't an unexpected kernel fault on a user address (we oops
>      and kill the task in this case)
> 
> Which loosely corresponds to "we took a fault on a user address that it
> looks like we can handle".
> 
> That said, I'm happy to tweak this if it brings us into line with other
> architectures.

I see.  I'll keep the semantics for PERF_COUNT_SW_PAGE_FAULTS in the next
version.  Thanks for all the comments!

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 19/25] mm/s390: Use mm_fault_accounting()
  2020-06-17 16:06         ` Peter Xu
@ 2020-06-17 16:14           ` Christian Borntraeger
  2020-06-17 16:44             ` Peter Xu
  0 siblings, 1 reply; 13+ messages in thread
From: Christian Borntraeger @ 2020-06-17 16:14 UTC (permalink / raw)
  To: Peter Xu
  Cc: Alexander Gordeev, linux-kernel, Gerald Schaefer, Andrew Morton,
	Linus Torvalds, Andrea Arcangeli, Heiko Carstens, Vasily Gorbik,
	linux-s390



On 17.06.20 18:06, Peter Xu wrote:
> Hi, Christian,
> 
> On Wed, Jun 17, 2020 at 08:19:29AM +0200, Christian Borntraeger wrote:
>>
>>
>> On 16.06.20 18:35, Peter Xu wrote:
>>> Hi, Alexander,
>>>
>>> On Tue, Jun 16, 2020 at 05:59:33PM +0200, Alexander Gordeev wrote:
>>>>> @@ -489,21 +489,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
>>>>>  	if (unlikely(fault & VM_FAULT_ERROR))
>>>>>  		goto out_up;
>>>>>
>>>>> -	/*
>>>>> -	 * Major/minor page fault accounting is only done on the
>>>>> -	 * initial attempt. If we go through a retry, it is extremely
>>>>> -	 * likely that the page will be found in page cache at that point.
>>>>> -	 */
>>>>>  	if (flags & FAULT_FLAG_ALLOW_RETRY) {
>>>>> -		if (fault & VM_FAULT_MAJOR) {
>>>>> -			tsk->maj_flt++;
>>>>> -			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1,
>>>>> -				      regs, address);
>>>>> -		} else {
>>>>> -			tsk->min_flt++;
>>>>> -			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1,
>>>>> -				      regs, address);
>>>>> -		}
>>>>>  		if (fault & VM_FAULT_RETRY) {
>>>>>  			if (IS_ENABLED(CONFIG_PGSTE) && gmap &&
>>>>>  			    (flags & FAULT_FLAG_RETRY_NOWAIT)) {
> 
> [1]
> 
>>>>
>>>> Seems like the call to mm_fault_accounting() will be missed if
>>>> we entered here with FAULT_FLAG_RETRY_NOWAIT flag set, since it
>>>> jumps to "out_up"...
>>>
>>> This is true as a functional change.  However that also means that we've got a
>>> VM_FAULT_RETRY, which hints that this fault has been requested to retry rather
>>> than handled correctly (for instance, due to some try_lock failed during the
>>> fault process).
>>>
>>> To me, that case should not be counted as a page fault at all?  Or we might get
>>> the same duplicated accounting when the page fault retried from a higher stack.
>>>
>>> Thanks
>>
>> This case below (the one with the gmap) is the KVM case for doing a so called
>> pseudo page fault to our guests. (we notify our guests about major host page
>> faults and let it reschedule to something else instead of halting the vcpu).
>> This is being resolved with either gup or fixup_user_fault asynchronously by
>> KVM code (this can also be sync when the guest does not match some conditions)
>> We do not change the counters in that code as far as I can tell so we should
>> continue to do it here.
>>
>> (see arch/s390/kvm/kvm-s390.c
>> static int vcpu_post_run(struct kvm_vcpu *vcpu, int exit_reason)
>> {
>> [...]
>>         } else if (current->thread.gmap_pfault) {
>>                 trace_kvm_s390_major_guest_pfault(vcpu);
>>                 current->thread.gmap_pfault = 0;
>>                 if (kvm_arch_setup_async_pf(vcpu))
>>                         return 0;
>>                 return kvm_arch_fault_in_page(vcpu, current->thread.gmap_addr, 1);
>>         }
> 
> Please correct me if I'm wrong... but I still think what this patch does is the
> right thing to do.
> 
> Note again that IMHO when reached [1] above it means the page fault is not
> handled correctly so we need to fallback to KVM async page fault, then we
> shouldn't increment the accountings until it's finally handled correctly. That
> final accounting should be done in the async pf path in gup code where the page
> fault is handled:
> 
>   kvm_arch_fault_in_page
>     gmap_fault
>       fixup_user_fault
> 
> Where in fixup_user_fault() we have:
> 
> 	if (tsk) {
> 		if (major)
> 			tsk->maj_flt++;
> 		else
> 			tsk->min_flt++;
> 	}
> 

Right that case does work. Its the case where we do not inject a pseudo pagefault and
instead fall back to synchronous fault-in.
What is about the other case:

kvm_setup_async_pf
	->workqueue
		async_pf_execute
			get_user_pages_remote

Does get_user_pages_remote do the accounting as well? I cant see that.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 19/25] mm/s390: Use mm_fault_accounting()
  2020-06-17 16:14           ` Christian Borntraeger
@ 2020-06-17 16:44             ` Peter Xu
  0 siblings, 0 replies; 13+ messages in thread
From: Peter Xu @ 2020-06-17 16:44 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Alexander Gordeev, linux-kernel, Gerald Schaefer, Andrew Morton,
	Linus Torvalds, Andrea Arcangeli, Heiko Carstens, Vasily Gorbik,
	linux-s390

On Wed, Jun 17, 2020 at 06:14:52PM +0200, Christian Borntraeger wrote:
> 
> 
> On 17.06.20 18:06, Peter Xu wrote:
> > Hi, Christian,
> > 
> > On Wed, Jun 17, 2020 at 08:19:29AM +0200, Christian Borntraeger wrote:
> >>
> >>
> >> On 16.06.20 18:35, Peter Xu wrote:
> >>> Hi, Alexander,
> >>>
> >>> On Tue, Jun 16, 2020 at 05:59:33PM +0200, Alexander Gordeev wrote:
> >>>>> @@ -489,21 +489,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
> >>>>>  	if (unlikely(fault & VM_FAULT_ERROR))
> >>>>>  		goto out_up;
> >>>>>
> >>>>> -	/*
> >>>>> -	 * Major/minor page fault accounting is only done on the
> >>>>> -	 * initial attempt. If we go through a retry, it is extremely
> >>>>> -	 * likely that the page will be found in page cache at that point.
> >>>>> -	 */
> >>>>>  	if (flags & FAULT_FLAG_ALLOW_RETRY) {
> >>>>> -		if (fault & VM_FAULT_MAJOR) {
> >>>>> -			tsk->maj_flt++;
> >>>>> -			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1,
> >>>>> -				      regs, address);
> >>>>> -		} else {
> >>>>> -			tsk->min_flt++;
> >>>>> -			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1,
> >>>>> -				      regs, address);
> >>>>> -		}
> >>>>>  		if (fault & VM_FAULT_RETRY) {
> >>>>>  			if (IS_ENABLED(CONFIG_PGSTE) && gmap &&
> >>>>>  			    (flags & FAULT_FLAG_RETRY_NOWAIT)) {
> > 
> > [1]
> > 
> >>>>
> >>>> Seems like the call to mm_fault_accounting() will be missed if
> >>>> we entered here with FAULT_FLAG_RETRY_NOWAIT flag set, since it
> >>>> jumps to "out_up"...
> >>>
> >>> This is true as a functional change.  However that also means that we've got a
> >>> VM_FAULT_RETRY, which hints that this fault has been requested to retry rather
> >>> than handled correctly (for instance, due to some try_lock failed during the
> >>> fault process).
> >>>
> >>> To me, that case should not be counted as a page fault at all?  Or we might get
> >>> the same duplicated accounting when the page fault retried from a higher stack.
> >>>
> >>> Thanks
> >>
> >> This case below (the one with the gmap) is the KVM case for doing a so called
> >> pseudo page fault to our guests. (we notify our guests about major host page
> >> faults and let it reschedule to something else instead of halting the vcpu).
> >> This is being resolved with either gup or fixup_user_fault asynchronously by
> >> KVM code (this can also be sync when the guest does not match some conditions)
> >> We do not change the counters in that code as far as I can tell so we should
> >> continue to do it here.
> >>
> >> (see arch/s390/kvm/kvm-s390.c
> >> static int vcpu_post_run(struct kvm_vcpu *vcpu, int exit_reason)
> >> {
> >> [...]
> >>         } else if (current->thread.gmap_pfault) {
> >>                 trace_kvm_s390_major_guest_pfault(vcpu);
> >>                 current->thread.gmap_pfault = 0;
> >>                 if (kvm_arch_setup_async_pf(vcpu))
> >>                         return 0;
> >>                 return kvm_arch_fault_in_page(vcpu, current->thread.gmap_addr, 1);
> >>         }
> > 
> > Please correct me if I'm wrong... but I still think what this patch does is the
> > right thing to do.
> > 
> > Note again that IMHO when reached [1] above it means the page fault is not
> > handled correctly so we need to fallback to KVM async page fault, then we
> > shouldn't increment the accountings until it's finally handled correctly. That
> > final accounting should be done in the async pf path in gup code where the page
> > fault is handled:
> > 
> >   kvm_arch_fault_in_page
> >     gmap_fault
> >       fixup_user_fault
> > 
> > Where in fixup_user_fault() we have:
> > 
> > 	if (tsk) {
> > 		if (major)
> > 			tsk->maj_flt++;
> > 		else
> > 			tsk->min_flt++;
> > 	}
> > 
> 
> Right that case does work. Its the case where we do not inject a pseudo pagefault and
> instead fall back to synchronous fault-in.
> What is about the other case:
> 
> kvm_setup_async_pf
> 	->workqueue
> 		async_pf_execute
> 			get_user_pages_remote
> 
> Does get_user_pages_remote do the accounting as well? I cant see that.
> 

It should, with:

  get_user_pages_remote
    __get_user_pages_remote
      __get_user_pages_locked
        __get_user_pages
          faultin_page

Where the accounting is done in faultin_page().

We probably need to also move the accounting in faultin_page() to be after the
retry check too, but that should be irrelevant to this patch.

Thanks!

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2020-06-17 16:44 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2020-06-15 22:15 [PATCH 00/25] mm: Page fault accounting cleanups Peter Xu
2020-06-15 22:23 ` [PATCH 19/25] mm/s390: Use mm_fault_accounting() Peter Xu
2020-06-16 15:59   ` Alexander Gordeev
2020-06-16 16:35     ` Peter Xu
2020-06-17  6:19       ` Christian Borntraeger
2020-06-17 16:06         ` Peter Xu
2020-06-17 16:14           ` Christian Borntraeger
2020-06-17 16:44             ` Peter Xu
2020-06-16 18:55 ` [PATCH 00/25] mm: Page fault accounting cleanups Linus Torvalds
2020-06-16 21:03   ` Peter Xu
2020-06-17  0:55   ` Michael Ellerman
2020-06-17  8:04     ` Will Deacon
2020-06-17 16:10       ` Peter Xu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox