Re: [PATCH] mm/hugetlb: Unmap pages if page fault raced with hole punch

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Mike Kravetz <mike.kravetz@oracle.com>
To: Hugh Dickins <hughd@google.com>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Andrew Morton <akpm@linux-foundation.org>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>,
	Davidlohr Bueso <dave@stgolabs.net>
Subject: Re: [PATCH] mm/hugetlb: Unmap pages if page fault raced with hole punch
Date: Mon, 9 Nov 2015 14:55:11 -0800	[thread overview]
Message-ID: <5641244F.3060108@oracle.com> (raw)
In-Reply-To: <alpine.LSU.2.11.1511082310390.15826@eggly.anvils>

On 11/08/2015 11:42 PM, Hugh Dickins wrote:
> On Fri, 30 Oct 2015, Mike Kravetz wrote:
>>
>> The 'next = start' code is actually from the original truncate_hugepages
>> routine.  This functionality was combined with that needed for hole punch
>> to create remove_inode_hugepages().
>>
>> The following code was in truncate_hugepages:
>>
>> 	next = start;
>> 	while (1) {
>> 		if (!pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
>> 			if (next == start)
>> 				break;
>> 			next = start;
>> 			continue;
>> 		}
>>
>>
>> So, in the truncate case pages starting at 'start' are deleted until
>> pagevec_lookup fails.  Then, we call pagevec_lookup() again.  If no
>> pages are found we are done.  Else, we repeat the whole process.
>>
>> Does anyone recall the reason for going back and looking for pages at
>> index'es already deleted?  Git doesn't help as that was part of initial
>> commit.  My thought is that truncate can race with page faults.  The
>> truncate code sets inode offset before unmapping and deleting pages.
>> So, faults after the new offset is set should fail.  But, I suppose a
>> fault could race with setting offset and deleting of pages.  Does this
>> sound right?  Or, is there some other reason I am missing?
> 
> I believe your thinking is correct.  But remember that
> truncate_inode_pages_range() is shared by almost all filesystems,
> and different filesystems have different internal locking conventions,
> and different propensities to such a race: it's trying to cover for
> all of them.
> 
> Typically, writing is well serialized (by i_mutex) against truncation,
> but faulting (like reading) sails through without enough of a lock.
> We resort to i_size checks to avoid the worst of it, but there's often
> a corner or two in which those checks are not quite good enough -
> it's easy to check i_size at the beginning, but it needs to be checked
> again at the end too, and what's been done undone - can be awkward.

Well, it looks like the hugetlb_no_page() routine is checking i_size both
before and after.  It appears to be doing the right thing to handle the
race, but I need to stare at the code some more to make sure.

Because of the way the truncate code went back and did an extra lookup
when done with the range, I assumed it was covering some race.  However,
that may not be the case.

> 
> I hope that in the case of hugetlbfs, since you already have the
> additional fault_mutex to handle races between faults and punching,
> it should be possible to get away without that "pincer" restarting.

Yes, it looks like this may work as a straight loop over the range of
pages.  I just need to study the code some more to make sure I am not
missing something.

-- 
Mike Kravetz

> 
> Hugh
> 
>>
>> I would like to continue having remove_inode_hugepages handle both the
>> truncate and hole punch case.  So, what to make sure the code correctly
>> handles both cases.
>>
>> -- 
>> Mike Kravetz

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

WARNING: multiple messages have this Message-ID (diff)

From: Mike Kravetz <mike.kravetz@oracle.com>
To: Hugh Dickins <hughd@google.com>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Andrew Morton <akpm@linux-foundation.org>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>,
	Davidlohr Bueso <dave@stgolabs.net>
Subject: Re: [PATCH] mm/hugetlb: Unmap pages if page fault raced with hole punch
Date: Mon, 9 Nov 2015 14:55:11 -0800	[thread overview]
Message-ID: <5641244F.3060108@oracle.com> (raw)
In-Reply-To: <alpine.LSU.2.11.1511082310390.15826@eggly.anvils>

On 11/08/2015 11:42 PM, Hugh Dickins wrote:
> On Fri, 30 Oct 2015, Mike Kravetz wrote:
>>
>> The 'next = start' code is actually from the original truncate_hugepages
>> routine.  This functionality was combined with that needed for hole punch
>> to create remove_inode_hugepages().
>>
>> The following code was in truncate_hugepages:
>>
>> 	next = start;
>> 	while (1) {
>> 		if (!pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
>> 			if (next == start)
>> 				break;
>> 			next = start;
>> 			continue;
>> 		}
>>
>>
>> So, in the truncate case pages starting at 'start' are deleted until
>> pagevec_lookup fails.  Then, we call pagevec_lookup() again.  If no
>> pages are found we are done.  Else, we repeat the whole process.
>>
>> Does anyone recall the reason for going back and looking for pages at
>> index'es already deleted?  Git doesn't help as that was part of initial
>> commit.  My thought is that truncate can race with page faults.  The
>> truncate code sets inode offset before unmapping and deleting pages.
>> So, faults after the new offset is set should fail.  But, I suppose a
>> fault could race with setting offset and deleting of pages.  Does this
>> sound right?  Or, is there some other reason I am missing?
> 
> I believe your thinking is correct.  But remember that
> truncate_inode_pages_range() is shared by almost all filesystems,
> and different filesystems have different internal locking conventions,
> and different propensities to such a race: it's trying to cover for
> all of them.
> 
> Typically, writing is well serialized (by i_mutex) against truncation,
> but faulting (like reading) sails through without enough of a lock.
> We resort to i_size checks to avoid the worst of it, but there's often
> a corner or two in which those checks are not quite good enough -
> it's easy to check i_size at the beginning, but it needs to be checked
> again at the end too, and what's been done undone - can be awkward.

Well, it looks like the hugetlb_no_page() routine is checking i_size both
before and after.  It appears to be doing the right thing to handle the
race, but I need to stare at the code some more to make sure.

Because of the way the truncate code went back and did an extra lookup
when done with the range, I assumed it was covering some race.  However,
that may not be the case.

> 
> I hope that in the case of hugetlbfs, since you already have the
> additional fault_mutex to handle races between faults and punching,
> it should be possible to get away without that "pincer" restarting.

Yes, it looks like this may work as a straight loop over the range of
pages.  I just need to study the code some more to make sure I am not
missing something.

-- 
Mike Kravetz

> 
> Hugh
> 
>>
>> I would like to continue having remove_inode_hugepages handle both the
>> truncate and hole punch case.  So, what to make sure the code correctly
>> handles both cases.
>>
>> -- 
>> Mike Kravetz

next prev parent reply	other threads:[~2015-11-09 23:00 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-10-29 22:33 [PATCH] mm/hugetlb: Unmap pages if page fault raced with hole punch Mike Kravetz
2015-10-29 22:33 ` Mike Kravetz
2015-10-30  3:32 ` Hugh Dickins
2015-10-30  3:32   ` Hugh Dickins
2015-10-30 16:45   ` Mike Kravetz
2015-10-30 16:45     ` Mike Kravetz
2015-10-30 20:56     ` Mike Kravetz
2015-10-30 20:56       ` Mike Kravetz
2015-11-09  7:42       ` Hugh Dickins
2015-11-09  7:42         ` Hugh Dickins
2015-11-09 22:55         ` Mike Kravetz [this message]
2015-11-09 22:55           ` Mike Kravetz
2015-11-10 22:41           ` Mike Kravetz
2015-11-10 22:41             ` Mike Kravetz
2015-11-14  0:36             ` Hugh Dickins
2015-11-14  0:36               ` Hugh Dickins

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5641244F.3060108@oracle.com \
    --to=mike.kravetz@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=dave.hansen@linux.intel.com \
    --cc=dave@stgolabs.net \
    --cc=hughd@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=n-horiguchi@ah.jp.nec.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.