From: Andrea Arcangeli <aarcange@redhat.com>
Date: Thu, 17 Nov 2016 16:40:31 +0100
Subject: Re: [PATCH 15/33] userfaultfd: hugetlbfs: add __mcopy_atomic_hugetlb for huge page UFFDIO_COPY
Message-ID: <20161117154031.GA10229@redhat.com>
In-Reply-To: <8ee2c6db-7ee4-285f-4c68-75fd6e799c0d@oracle.com>
To: Mike Kravetz
Cc: Hillf Danton, 'Andrew Morton', linux-mm@kvack.org, "'Dr. David Alan Gilbert'", 'Shaohua Li', 'Pavel Emelyanov', 'Mike Rapoport'

On Wed, Nov 16, 2016 at 10:53:39AM -0800, Mike Kravetz wrote:
> I was running some tests with error injection to exercise the error
> path and noticed the reservation leaks as the system eventually ran
> out of huge pages. I need to think about it some more, but we may
> want to at least do something like the following before put_page (with
> a BIG comment):
>
> 	if (unlikely(PagePrivate(page)))
> 		ClearPagePrivate(page);
>
> That would at least keep the global reservation count from increasing.
> Let me look into that.

However, what happens if the old vma got munmapped and a new compatible
vma was instantiated and passes revalidation fine? Does the reserved
page of the old vma then go to a different vma? The reservation code is
complex and has lots of special cases anyway, but the main concern at
this point is the subpool pointer stored with set_page_private() (i.e.
subpool_vma(vma)), which is released by
hugetlb_vm_op_close->unlock_or_release_subpool.

Aside from the accounting, what about the page_private(page) subpool?
It's used by free_huge_page, which would get out of sync with vma/inode
destruction if we release the mmap_sem:

	struct hugepage_subpool *spool =
		(struct hugepage_subpool *)page_private(page);

I think in the revalidation code we need to check whether
page_private(page) still matches subpool_vma(vma); if it doesn't, it's
a stale pointer and we can't even call put_page before fixing up
page_private first.

The other way to solve this is not to release the mmap_sem at all and,
in the slow path, to call __get_user_pages(nonblocking=NULL). That's
slower than using the CPU TLB to reach the source data, and it would
also prevent handling a userfault in the source address of the
UFFDIO_COPY, because a userfault in the source would require the
mmap_sem to be released (i.e. it would require get_user_pages_fast,
which would again recurse on the mmap_sem, and in turn we could just as
well stick to the current non-blocking copy-user).
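To make the shape of that slow path explicit, here is a rough sketch
(not tested; the helper name is made up, error handling is omitted,
and I'm assuming the current __get_user_pages signature in mm/gup.c):

	/*
	 * Pin the UFFDIO_COPY source pages without ever dropping the
	 * mmap_sem.  With nonblocking == NULL, faultin_page never sets
	 * FAULT_FLAG_ALLOW_RETRY, so gup sleeps on the fault with the
	 * mmap_sem held instead of releasing it, and the vma (and its
	 * subpool) cannot go away under us.
	 */
	static long pin_uffdio_copy_source(struct mm_struct *mm,
					   unsigned long src_addr,
					   unsigned long nr_pages,
					   struct page **pages)
	{
		/* the source is only read: FOLL_GET, no FOLL_WRITE */
		return __get_user_pages(current, mm, src_addr, nr_pages,
					FOLL_GET, pages, NULL,
					NULL /* nonblocking */);
	}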
We currently don't handle nesting with non-cooperative userfaultfd
anyway, so it'd be ok for now not to release the mmap_sem while copying
in UFFDIO_COPY.

Offtopic here, but while reading this code I also noticed that
free_huge_page is wasting CPU, and then noticed other places wasting
CPU as well.
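P.S. Going back to the revalidation check suggested above, a minimal
sketch of what I mean (the helper name is made up; subpool_vma is
static in mm/hugetlb.c, so this would have to live there):

	/*
	 * After re-taking the mmap_sem, detect whether the huge page
	 * still belongs to the subpool of the (possibly
	 * re-instantiated) vma.  If it doesn't, page_private(page) may
	 * point to a subpool already freed by
	 * hugetlb_vm_op_close->unlock_or_release_subpool, and
	 * put_page->free_huge_page must not dereference it.
	 */
	static bool uffdio_copy_subpool_stale(struct vm_area_struct *vma,
					      struct page *page)
	{
		struct hugepage_subpool *spool =
			(struct hugepage_subpool *)page_private(page);

		return spool != subpool_vma(vma);
	}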