Re: [PATCH v2 0/4] hugetlbfs fallocate hole punch race with page faults

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Mike Kravetz <mike.kravetz@oracle.com>
To: Hugh Dickins <hughd@google.com>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Andrew Morton <akpm@linux-foundation.org>,
	Andrea Arcangeli <aarcange@redhat.com>
Subject: Re: [PATCH v2 0/4] hugetlbfs fallocate hole punch race with page faults
Date: Wed, 28 Oct 2015 14:13:33 -0700	[thread overview]
Message-ID: <56313A7D.4000102@oracle.com> (raw)
In-Reply-To: <alpine.LSU.2.11.1510281332050.4687@eggly.anvils>

On 10/28/2015 02:00 PM, Hugh Dickins wrote:
> On Wed, 28 Oct 2015, Mike Kravetz wrote:
>> On 10/27/2015 08:34 PM, Hugh Dickins wrote:
>>
>> Thanks for the detailed response Hugh.  I will try to address your questions
>> and provide more reasoning behind the use case and need for this code.
> 
> And thank you for your detailed response, Mike: that helped a lot.
> 
>> Ok, here is a bit more explanation of the proposed use case.  It all
>> revolves around a DB's use of hugetlbfs and the desire for more control
>> over the underlying memory.  This additional control is achieved by
>> adding existing fallocate and userfaultfd semantics to hugetlbfs.
>>
>> In this use case there is a single process that manages hugetlbfs files
>> and the underlying memory resources.  It pre-allocates/initializes these
>> files.
>>
>> In addition, there are many other processes which access (rw mode) these
>> files.  They will simply mmap the files.  It is expected that they will
>> not fault in any new pages.  Rather, all pages would have been pre-allocated
>> by the management process.
>>
>> At some time, the management process determines that specific ranges of
>> pages within the hugetlbfs files are no longer needed.  It will then punch
>> holes in the files.  These 'free' pages within the holes may then be used
>> for other purposes.  For applications like this (sophisticated DBs), huge
>> pages are reserved at system init time and closely managed by the
>> application.
>> Hence, the desire for this additional control.
>>
>> So, when a hole containing N huge pages is punched, the management process
>> wants to know that it really has N huge pages for other purposes.  Ideally,
>> none of the other processes mapping this file/area would access the hole.
>> This is an application error, and it can be 'caught' with  userfaultfd.
>>
>> Since these other (non-management) processes will never fault in pages,
>> they would simply set up userfaultfd to catch any page faults immediately
>> after mmaping the hugetlbfs file.
>>
>>>
>>> But it sounds to me more as if the holes you want punched are not
>>> quite like on other filesystems, and you want to be able to police
>>> them afterwards with userfaultfd, to prevent them from being refilled.
>>
>> I am not sure if they are any different.
>>
>> One could argue that a hole punch operation must always result in all
>> pages within the hole being deallocated.  As you point out, this could
>> race with a fault.  Previously, there would be no way to determine if
>> all pages had been deallocated because user space could not detect this
>> race.  Now, userfaultfd allows user space to catch page faults.  So,
>> it is now possible to catch/depend on hole punch deallocating all pages
>> within the hole.
>>
>>>
>>> Can't userfaultfd be used just slightly earlier, to prevent them from
>>> being filled while doing the holepunch?  Then no need for this patchset?
>>
>> I do not think so, at least with current userfaultfd semantics.  The hole
>> needs to be punched before being caught with UFFDIO_REGISTER_MODE_MISSING.
> 
> Great, that makes sense.
> 
> I was worried that you needed some kind of atomic treatment of the whole
> extent punched, but all you need is to close the hole/fault race one
> hugepage at a time.
> 
> Throw away all of 1/4, 2/4, 3/4: I think all you need is your 4/4
> (plus i_mmap_lock_write around the hugetlb_vmdelete_list of course).
> 
> There you already do the single hugepage hugetlb_vmdelete_list()
> under mutex_lock(&hugetlb_fault_mutex_table[hash]).
> 
> And it should come as no surprise that hugetlb_fault() does most
> of its work under that same mutex.
> 
> So once remove_inode_hugepages() unlocks the mutex, that page is gone
> from the file, and userfaultfd UFFDIO_REGISTER_MODE_MISSING will do
> what you want, won't it?
> 
> I don't think "my" code buys you anything at all: you're not in danger of
> shmem's starvation livelock issue, partly because remove_inode_hugepages()
> uses the simple loop from start to end, and partly because hugetlb_fault()
> already takes the serializing mutex (no equivalent in shmem_fault()).
> 
> Or am I dreaming?

I don't think you are dreaming.

I should have stepped back and thought about this more before before pulling
in the shmem code.  It really is only a 'page at a time' operation, and we
can use the fault mutex table for that.

I'll code it up with just the changes needed for 4/4 and put it through some
stress testing.

Thanks,
-- 
Mike Kravetz

> 
> Hugh
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

WARNING: multiple messages have this Message-ID (diff)

From: Mike Kravetz <mike.kravetz@oracle.com>
To: Hugh Dickins <hughd@google.com>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Andrew Morton <akpm@linux-foundation.org>,
	Andrea Arcangeli <aarcange@redhat.com>
Subject: Re: [PATCH v2 0/4] hugetlbfs fallocate hole punch race with page faults
Date: Wed, 28 Oct 2015 14:13:33 -0700	[thread overview]
Message-ID: <56313A7D.4000102@oracle.com> (raw)
In-Reply-To: <alpine.LSU.2.11.1510281332050.4687@eggly.anvils>

On 10/28/2015 02:00 PM, Hugh Dickins wrote:
> On Wed, 28 Oct 2015, Mike Kravetz wrote:
>> On 10/27/2015 08:34 PM, Hugh Dickins wrote:
>>
>> Thanks for the detailed response Hugh.  I will try to address your questions
>> and provide more reasoning behind the use case and need for this code.
> 
> And thank you for your detailed response, Mike: that helped a lot.
> 
>> Ok, here is a bit more explanation of the proposed use case.  It all
>> revolves around a DB's use of hugetlbfs and the desire for more control
>> over the underlying memory.  This additional control is achieved by
>> adding existing fallocate and userfaultfd semantics to hugetlbfs.
>>
>> In this use case there is a single process that manages hugetlbfs files
>> and the underlying memory resources.  It pre-allocates/initializes these
>> files.
>>
>> In addition, there are many other processes which access (rw mode) these
>> files.  They will simply mmap the files.  It is expected that they will
>> not fault in any new pages.  Rather, all pages would have been pre-allocated
>> by the management process.
>>
>> At some time, the management process determines that specific ranges of
>> pages within the hugetlbfs files are no longer needed.  It will then punch
>> holes in the files.  These 'free' pages within the holes may then be used
>> for other purposes.  For applications like this (sophisticated DBs), huge
>> pages are reserved at system init time and closely managed by the
>> application.
>> Hence, the desire for this additional control.
>>
>> So, when a hole containing N huge pages is punched, the management process
>> wants to know that it really has N huge pages for other purposes.  Ideally,
>> none of the other processes mapping this file/area would access the hole.
>> This is an application error, and it can be 'caught' with  userfaultfd.
>>
>> Since these other (non-management) processes will never fault in pages,
>> they would simply set up userfaultfd to catch any page faults immediately
>> after mmaping the hugetlbfs file.
>>
>>>
>>> But it sounds to me more as if the holes you want punched are not
>>> quite like on other filesystems, and you want to be able to police
>>> them afterwards with userfaultfd, to prevent them from being refilled.
>>
>> I am not sure if they are any different.
>>
>> One could argue that a hole punch operation must always result in all
>> pages within the hole being deallocated.  As you point out, this could
>> race with a fault.  Previously, there would be no way to determine if
>> all pages had been deallocated because user space could not detect this
>> race.  Now, userfaultfd allows user space to catch page faults.  So,
>> it is now possible to catch/depend on hole punch deallocating all pages
>> within the hole.
>>
>>>
>>> Can't userfaultfd be used just slightly earlier, to prevent them from
>>> being filled while doing the holepunch?  Then no need for this patchset?
>>
>> I do not think so, at least with current userfaultfd semantics.  The hole
>> needs to be punched before being caught with UFFDIO_REGISTER_MODE_MISSING.
> 
> Great, that makes sense.
> 
> I was worried that you needed some kind of atomic treatment of the whole
> extent punched, but all you need is to close the hole/fault race one
> hugepage at a time.
> 
> Throw away all of 1/4, 2/4, 3/4: I think all you need is your 4/4
> (plus i_mmap_lock_write around the hugetlb_vmdelete_list of course).
> 
> There you already do the single hugepage hugetlb_vmdelete_list()
> under mutex_lock(&hugetlb_fault_mutex_table[hash]).
> 
> And it should come as no surprise that hugetlb_fault() does most
> of its work under that same mutex.
> 
> So once remove_inode_hugepages() unlocks the mutex, that page is gone
> from the file, and userfaultfd UFFDIO_REGISTER_MODE_MISSING will do
> what you want, won't it?
> 
> I don't think "my" code buys you anything at all: you're not in danger of
> shmem's starvation livelock issue, partly because remove_inode_hugepages()
> uses the simple loop from start to end, and partly because hugetlb_fault()
> already takes the serializing mutex (no equivalent in shmem_fault()).
> 
> Or am I dreaming?

I don't think you are dreaming.

I should have stepped back and thought about this more before before pulling
in the shmem code.  It really is only a 'page at a time' operation, and we
can use the fault mutex table for that.

I'll code it up with just the changes needed for 4/4 and put it through some
stress testing.

Thanks,
-- 
Mike Kravetz

> 
> Hugh
>

next prev parent reply	other threads:[~2015-10-28 21:15 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-10-20 23:52 [PATCH v2 0/4] hugetlbfs fallocate hole punch race with page faults Mike Kravetz
2015-10-20 23:52 ` Mike Kravetz
2015-10-20 23:52 ` [PATCH v2 1/4] mm/hugetlb: Define hugetlb_falloc structure for hole punch race Mike Kravetz
2015-10-20 23:52   ` Mike Kravetz
2015-10-20 23:52 ` [PATCH v2 2/4] mm/hugetlb: Setup hugetlb_falloc during fallocate hole punch Mike Kravetz
2015-10-20 23:52   ` Mike Kravetz
2015-10-21  0:11   ` Dave Hansen
2015-10-21  0:11     ` Dave Hansen
2015-10-21  1:02     ` Mike Kravetz
2015-10-21  1:02       ` Mike Kravetz
2015-10-20 23:52 ` [PATCH v2 3/4] mm/hugetlb: page faults check for fallocate hole punch in progress and wait Mike Kravetz
2015-10-20 23:52   ` Mike Kravetz
2015-10-28  3:37   ` Hugh Dickins
2015-10-28  3:37     ` Hugh Dickins
2015-10-20 23:52 ` [PATCH v2 4/4] mm/hugetlb: Unmap pages to remove if page fault raced with hole punch Mike Kravetz
2015-10-20 23:52   ` Mike Kravetz
2015-10-28  3:34 ` [PATCH v2 0/4] hugetlbfs fallocate hole punch race with page faults Hugh Dickins
2015-10-28  3:34   ` Hugh Dickins
2015-10-28 16:06   ` Mike Kravetz
2015-10-28 16:06     ` Mike Kravetz
2015-10-28 21:00     ` Hugh Dickins
2015-10-28 21:00       ` Hugh Dickins
2015-10-28 21:13       ` Mike Kravetz [this message]
2015-10-28 21:13         ` Mike Kravetz
2015-10-29  0:21         ` Mike Kravetz
2015-10-29  0:21           ` Mike Kravetz

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=56313A7D.4000102@oracle.com \
    --to=mike.kravetz@oracle.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=dave.hansen@linux.intel.com \
    --cc=dave@stgolabs.net \
    --cc=hughd@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=n-horiguchi@ah.jp.nec.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.