Re: [PATCH 15/33] userfaultfd: hugetlbfs: add __mcopy_atomic_hugetlb for huge page UFFDIO_COPY

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Andrea Arcangeli <aarcange@redhat.com>
To: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>,
	'Andrew Morton' <akpm@linux-foundation.org>,
	linux-mm@kvack.org,
	"'Dr. David Alan Gilbert'" <dgilbert@redhat.com>,
	'Shaohua Li' <shli@fb.com>,
	'Pavel Emelyanov' <xemul@parallels.com>,
	'Mike Rapoport' <rppt@linux.vnet.ibm.com>
Subject: Re: [PATCH 15/33] userfaultfd: hugetlbfs: add __mcopy_atomic_hugetlb for huge page UFFDIO_COPY
Date: Fri, 18 Nov 2016 01:05:27 +0100	[thread overview]
Message-ID: <20161118000527.GB10229@redhat.com> (raw)
In-Reply-To: <718434af-d279-445d-e210-201bf02f434f@oracle.com>

On Thu, Nov 17, 2016 at 11:26:17AM -0800, Mike Kravetz wrote:
> On 11/17/2016 07:40 AM, Andrea Arcangeli wrote:
> > On Wed, Nov 16, 2016 at 10:53:39AM -0800, Mike Kravetz wrote:
> >> I was running some tests with error injection to exercise the error
> >> path and noticed the reservation leaks as the system eventually ran
> >> out of huge pages.  I need to think about it some more, but we may
> >> want to at least do something like the following before put_page (with
> >> a BIG comment):
> >>
> >> 	if (unlikely(PagePrivate(page)))
> >> 		ClearPagePrivate(page);
> >>
> >> That would at least keep the global reservation count from increasing.
> >> Let me look into that.
> > 
> > However what happens if the old vma got munmapped
> 
> When the huge page was allocated, the reservation map associated with
> the vma was marked to indicate the reservation was consumed.  In addition
> the global reservation count and subpool count were adjusted to account
> for the page allocation.  So, when the vma gets unmapped the reservation
> map will be examined.  Since the map indicates the reservation was consumed,
> no adjustment will be made to the global or subpool reservation count.

ClearPagePrivate before put_page, will simply avoid to run
h->resv_huge_pages++?

Not increasing resv_huge_pages means more non reserved allocations
will pass. That is a global value though, how is it ok to leave it
permanently lower?

If PagePrivate was set, it means alloc_huge_page already run this:

			SetPagePrivate(page);
			h->resv_huge_pages--;

But it would also have set a reserve map on the vma for that range
before that.

When the vma is destroyed the reserve is flushed back to global, minus
the consumed pages (reserve = (end - start) - region_count(resv,
start, end)).

Why should then we skip h->resv_huge_pages++ for the consumed pages by
running ClearPagePrivate?

It's not clear was wrong in the first place considering
put_page->free_huge_page() acts on the global stuff only?

void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma,
				unsigned long address, struct page *page)
{
	if (unlikely(PagePrivate(page))) {
		long rc = vma_needs_reservation(h, vma, address);

		if (unlikely(rc < 0)) {
			/*
			 * Rare out of memory condition in reserve map
			 * manipulation.  Clear PagePrivate so that
			 * global reserve count will not be incremented
			 * by free_huge_page.  This will make it appear
			 * as though the reservation for this page was
			 * consumed.  This may prevent the task from
			 * faulting in the page at a later time.  This
			 * is better than inconsistent global huge page
			 * accounting of reserve counts.
			 */
			ClearPagePrivate(page);

The ClearPagePrivate was run above because vma_needs_reservation run
out of memory and couldn't be added?

So I suppose the vma reservation wasn't possible in the above case, in
our allocation case alloc_huge_page succeeded at those reserve maps
allocations:

	map_chg = gbl_chg = vma_needs_reservation(h, vma, addr);
	if (map_chg < 0)
		return ERR_PTR(-ENOMEM);
	[..]
		if (!avoid_reserve && vma_has_reserves(vma, gbl_chg)) {
			SetPagePrivate(page);
			h->resv_huge_pages--;
		}

> >                                                   and a new compatible
> > vma was instantiated and passes revalidation fine? The reserved page
> > of the old vma goes to a different vma then?
> 
> No, the new vma should get a new reservation.  It can not use the old
> reservation as it was associated with the old vma.  This is at least
> the case for private mappings where the reservation maps are associated
> with the vma.

You're not suggesting to call ClearPagePrivate in the second pass of
the "retry" loop if all goes fine and second pass succeeds, but only if
we end up in a error of revalidation at the second pass?

So the page with PagePrivate set could go to a different vma despite
the vma reserve map was accounted for in the original vma? Is that ok?

> > This reservation code is complex and has lots of special cases anyway,
> > but the main concern at this point is the
> > set_page_private(subpool_vma(vma)) released by
> > hugetlb_vm_op_close->unlock_or_release_subpool.
> 
> Do note that set_page_private(subpool_vma(vma)) just indicates which
> subpool was used when the huge page was allocated.  I do not believe
> there is any connection made to the vma.  The vma is only used to get
> to the inode and superblock which contains subpool information.  With
> the subpool stored in page_private, the subpool count can be adjusted
> at free_huge_page time.  Also note that the subpool can not be free'ed
> in unlock_or_release_subpool until put_page is complete for the page.
> This is because the page is accounted for in spool->used_hpages.

Yes I figured myself shortly later used_hpages. So there's no risk of
use after free on the subpool pointed by the page at least.

I also considered shutting down this accounting entirely by calling
alloc_huge_page(allow_reserve = 0) in hugetlbfs mcopy atomic... Can't
we start that way so we don't have to worry about the reservation
accounting at all?

> > Aside the accounting, what about the page_private(page) subpool? It's
> > used by huge_page_free which would get out of sync with vma/inode
> > destruction if we release the mmap_sem.
> 
> I do not think that is the case.  Reservation and subpool adjustments
> made at vma/inode destruction time are based on entries in the reservation
> map.  Those entries are created/destroyed when holding mmap_sem.
> 
> > 	struct hugepage_subpool *spool =
> > 		(struct hugepage_subpool *)page_private(page);
> > 
> > I think in the revalidation code we need to check if
> > page_private(page) still matches the subpool_vma(vma), if it doesn't
> > and it's a stale pointer, we can't even call put_page before fixing up
> > the page_private first.
> 
> I do not think that is correct.  page_private(page) points to the subpool
> used when the page was allocated.  Therefore, adjustments were made to that
> subpool when the page was allocated.  We need to adjust the same subpool
> when calling put_page.  I don't think there is any need to look at the
> vma/subpool_vma(vma).  If it doesn't match, we certainly do not want to
> adjust counts in a potentially different subpool when calling page_put.

Isn't the subpool different for every mountpoint of hugetlbfs?

The old vma subpool can't be a stale pointer, because of the
used_hpages but if there are two different hugetlbfs mounts the
subpool seems to come from the superblock so it may change after we
release the mmap_sem.

Don't we have to add a check for the new vma subpool change against
the page->private?

Otherwise we'd be putting the page in some other subpool than the one
it was allocated from, as long as they pass the vma_hpagesize !=
vma_kernel_pagesize(dst_vma) check.

> As you said, this reservation code is complex.  It might be good if
> Hillf could comment as he understands this code.
> 
> I still believe a simple call to ClearPagePrivate(page) may be all we
> need to do in the error path.  If this is the case, the only downside
> is that it would appear the reservation was consumed for that page.
> So, subsequent faults 'might' not get a huge page.

I thought running out of hugepages is what you experienced already
with the current code if using error injection.

> Good catch.

Eh, that was an easy part :).

> Great.  I did review the patch, but did not test as planned.

Thanks,
Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2016-11-18  0:05 UTC|newest]

Thread overview: 69+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-11-02 19:33 [PATCH 00/33] userfaultfd tmpfs/hugetlbfs/non-cooperative Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 01/33] userfaultfd: document _IOR/_IOW Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 02/33] userfaultfd: correct comment about UFFD_FEATURE_PAGEFAULT_FLAG_WP Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 03/33] userfaultfd: convert BUG() to WARN_ON_ONCE() Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 04/33] userfaultfd: use vma_is_anonymous Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 05/33] userfaultfd: non-cooperative: Split the find_userfault() routine Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 06/33] userfaultfd: non-cooperative: Add ability to report non-PF events from uffd descriptor Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 07/33] userfaultfd: non-cooperative: report all available features to userland Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 08/33] userfaultfd: non-cooperative: Add fork() event Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 09/33] userfaultfd: non-cooperative: Add fork() event, build warning fix Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 10/33] userfaultfd: non-cooperative: dup_userfaultfd: use mm_count instead of mm_users Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 11/33] userfaultfd: non-cooperative: Add mremap() event Andrea Arcangeli
2016-11-03  7:41   ` Hillf Danton
2016-11-03 17:52     ` Mike Rapoport
2016-11-04 15:40     ` Mike Rapoport
2016-11-02 19:33 ` [PATCH 12/33] userfaultfd: non-cooperative: Add madvise() event for MADV_DONTNEED request Andrea Arcangeli
2016-11-03  8:01   ` Hillf Danton
2016-11-03 17:24     ` Mike Rapoport
2016-11-04 16:40       ` [PATCH 12/33] userfaultfd: non-cooperative: Add madvise() event for MADV_DONTNEED requestg Andrea Arcangeli
2016-11-04 15:42     ` [PATCH 12/33] userfaultfd: non-cooperative: Add madvise() event for MADV_DONTNEED request Mike Rapoport
2016-11-02 19:33 ` [PATCH 13/33] userfaultfd: hugetlbfs: add copy_huge_page_from_user for hugetlb userfaultfd support Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 14/33] userfaultfd: hugetlbfs: add hugetlb_mcopy_atomic_pte for " Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 15/33] userfaultfd: hugetlbfs: add __mcopy_atomic_hugetlb for huge page UFFDIO_COPY Andrea Arcangeli
2016-11-03 10:15   ` Hillf Danton
2016-11-03 17:33     ` Mike Kravetz
2016-11-03 19:14       ` Mike Kravetz
2016-11-04  6:43         ` Hillf Danton
2016-11-04 19:36         ` Andrea Arcangeli
2016-11-04 20:34           ` Mike Kravetz
2016-11-08 21:06           ` Mike Kravetz
2016-11-16 18:28             ` Andrea Arcangeli
2016-11-16 18:53               ` Mike Kravetz
2016-11-17 15:40                 ` Andrea Arcangeli
2016-11-17 19:26                   ` Mike Kravetz
2016-11-18  0:05                     ` Andrea Arcangeli [this message]
2016-11-18  5:52                       ` Mike Kravetz
2016-11-22  1:16                         ` Mike Kravetz
2016-11-23  6:38                           ` Hillf Danton
2016-12-15 19:02                             ` Andrea Arcangeli
2016-12-16  3:54                               ` Hillf Danton
2016-11-17 19:41               ` Mike Kravetz
2016-11-04 16:35       ` Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 16/33] userfaultfd: hugetlbfs: add userfaultfd hugetlb hook Andrea Arcangeli
2016-11-04  7:02   ` Hillf Danton
2016-11-02 19:33 ` [PATCH 17/33] userfaultfd: hugetlbfs: allow registration of ranges containing huge pages Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 18/33] userfaultfd: hugetlbfs: add userfaultfd_hugetlb test Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 19/33] userfaultfd: hugetlbfs: userfaultfd_huge_must_wait for hugepmd ranges Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 20/33] userfaultfd: introduce vma_can_userfault Andrea Arcangeli
2016-11-04  7:39   ` Hillf Danton
2016-11-02 19:33 ` [PATCH 21/33] userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 22/33] userfaultfd: shmem: introduce vma_is_shmem Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 23/33] userfaultfd: shmem: add tlbflush.h header for microblaze Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 24/33] userfaultfd: shmem: use shmem_mcopy_atomic_pte for shared memory Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 25/33] userfaultfd: shmem: add userfaultfd hook for shared memory faults Andrea Arcangeli
2016-11-04  8:59   ` Hillf Danton
2016-11-04 14:53     ` Mike Rapoport
2016-11-04 15:44     ` Mike Rapoport
2016-11-04 16:56       ` Andrea Arcangeli
2016-11-18  0:37       ` Andrea Arcangeli
2016-11-20 12:10         ` Mike Rapoport
2016-11-02 19:33 ` [PATCH 26/33] userfaultfd: shmem: allow registration of shared memory ranges Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 27/33] userfaultfd: shmem: add userfaultfd_shmem test Andrea Arcangeli
2016-11-02 19:34 ` [PATCH 28/33] userfaultfd: shmem: lock the page before adding it to pagecache Andrea Arcangeli
2016-11-02 19:34 ` [PATCH 29/33] userfaultfd: shmem: avoid leaking blocks and used blocks in UFFDIO_COPY Andrea Arcangeli
2016-11-02 19:34 ` [PATCH 30/33] userfaultfd: non-cooperative: selftest: introduce userfaultfd_open Andrea Arcangeli
2016-11-02 19:34 ` [PATCH 31/33] userfaultfd: non-cooperative: selftest: add ufd parameter to copy_page Andrea Arcangeli
2016-11-02 19:34 ` [PATCH 32/33] userfaultfd: non-cooperative: selftest: add test for FORK, MADVDONTNEED and REMAP events Andrea Arcangeli
2016-11-02 19:34 ` [PATCH 33/33] mm: mprotect: use pmd_trans_unstable instead of taking the pmd_lock Andrea Arcangeli
2016-11-02 20:07 ` [PATCH 00/33] userfaultfd tmpfs/hugetlbfs/non-cooperative Andrea Arcangeli

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20161118000527.GB10229@redhat.com \
    --to=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=dgilbert@redhat.com \
    --cc=hillf.zj@alibaba-inc.com \
    --cc=linux-mm@kvack.org \
    --cc=mike.kravetz@oracle.com \
    --cc=rppt@linux.vnet.ibm.com \
    --cc=shli@fb.com \
    --cc=xemul@parallels.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).