All of lore.kernel.org
 help / color / mirror / Atom feed
From: Johannes Weiner <hannes@cmpxchg.org>
To: Peter Xu <peterx@redhat.com>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Hugh Dickins <hughd@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	David Hildenbrand <david@redhat.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Alistair Popple <apopple@nvidia.com>
Subject: Re: [PATCH] mm: Avoid unnecessary page fault retires on shared memory types
Date: Fri, 6 May 2022 16:27:06 -0400	[thread overview]
Message-ID: <YnWEmsUwJt9KOUN0@cmpxchg.org> (raw)
In-Reply-To: <20220505211748.41127-1-peterx@redhat.com>

On Thu, May 05, 2022 at 05:17:48PM -0400, Peter Xu wrote:
> I observed that for each of the shared file-backed page faults, we're very
> likely to retry one more time for the 1st write fault upon no page.  It's
> because we'll need to release the mmap lock for dirty rate limit purpose
> with balance_dirty_pages_ratelimited() (in fault_dirty_shared_page()).
> 
> Then after that throttling we return VM_FAULT_RETRY.
> 
> We did that probably because VM_FAULT_RETRY is the only way we can return
> to the fault handler at that time telling it we've released the mmap lock.
> 
> However that's not ideal because it's very likely the fault does not need
> to be retried at all since the pgtable was well installed before the
> throttling, so the next continuous fault (including taking mmap read lock,
> walk the pgtable, etc.) could be in most cases unnecessary.
> 
> It's not only slowing down page faults for shared file-backed, but also add
> more mmap lock contention which is in most cases not needed at all.
> 
> To observe this, one could try to write to some shmem page and look at
> "pgfault" value in /proc/vmstat, then we should expect 2 counts for each
> shmem write simply because we retried, and vm event "pgfault" will capture
> that.
> 
> To make it more efficient, add a new VM_FAULT_COMPLETED return code just to
> show that we've completed the whole fault and released the lock.  It's also
> a hint that we should very possibly not need another fault immediately on
> this page because we've just completed it.
> 
> This patch provides a ~12% perf boost on my aarch64 test VM with a simple
> program sequentially dirtying 400MB shmem file being mmap()ed:
> 
>   Before: 650980.20 (+-1.94%)
>   After:  569396.40 (+-1.38%)
> 
> I believe it could help more than that.
> 
> We need some special care on GUP and the s390 pgfault handler (for gmap
> code before returning from pgfault), the rest changes in the page fault
> handlers should be relatively straightforward.
> 
> Another thing to mention is that mm_account_fault() does take this new
> fault as a generic fault to be accounted, unlike VM_FAULT_RETRY.
> 
> I explicitly didn't touch hmm_vma_fault() and break_ksm() because they do
> not handle VM_FAULT_RETRY even with existing code, so I'm literally keeping
> them as-is.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

The change makes sense to me, but the unlock/retry signaling is
tricky...

> @@ -1227,6 +1247,18 @@ int fixup_user_fault(struct mm_struct *mm,
>  		return -EINTR;
>  
>  	ret = handle_mm_fault(vma, address, fault_flags, NULL);
> +
> +	if (ret & VM_FAULT_COMPLETED) {
> +		/*
> +		 * NOTE: it's a pity that we need to retake the lock here
> +		 * to pair with the unlock() in the callers. Ideally we
> +		 * could tell the callers so they do not need to unlock.
> +		 */
> +		mmap_read_lock(mm);
> +		*unlocked = true;
> +		return 0;
> +	}

unlocked can be NULL inside the function, yet you assume it's non-NULL
here. This is okay because COMPLETED can only be returned if RETRY is
set, and when RETRY is set unlocked must be non-NULL. It's correct but
not very obvious.

It might be cleaner to have separate flags for ALLOW_RETRY and
ALLOW_UNLOCK, with corresponding VM_FAULT_RETRY and VM_FAULT_UNLOCKED?
Even if not all combinations are used.

> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2942,7 +2942,7 @@ static vm_fault_t fault_dirty_shared_page(struct vm_fault *vmf)
>  		balance_dirty_pages_ratelimited(mapping);
>  		if (fpin) {
>  			fput(fpin);
> -			return VM_FAULT_RETRY;
> +			return VM_FAULT_COMPLETED;

There is one oddity in this now.

It completes the fault and no longer triggers a retry. Yet it's still
using maybe_unlock_mmap_for_io() and subject to retry limiting. This
means that if the fault already retried once, this code won't drop the
mmap_sem to call balance_dirty_pages() - even though it safely could
and should do so, without risking endless retries.

Here too IMO the distinction between ALLOW_RETRY|TRIED and
ALLOW_UNLOCK would make things cleaner and more obvious.


  parent reply	other threads:[~2022-05-06 20:28 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-05-05 21:17 [PATCH] mm: Avoid unnecessary page fault retires on shared memory types Peter Xu
2022-05-06  7:47 ` kernel test robot
2022-05-06 20:27 ` Johannes Weiner [this message]
2022-05-09 15:35   ` Peter Xu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YnWEmsUwJt9KOUN0@cmpxchg.org \
    --to=hannes@cmpxchg.org \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=apopple@nvidia.com \
    --cc=david@redhat.com \
    --cc=hughd@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=peterx@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.