linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Mike Kravetz <mike.kravetz@oracle.com>
To: Hillf Danton <hillf.zj@alibaba-inc.com>,
	Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org, linux-kernel <linux-kernel@vger.kernel.org>,
	Hugh Dickins <hughd@google.com>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>,
	Davidlohr Bueso <dave@stgolabs.net>
Subject: Re: [PATCH] mm/hugetlbfs Fix bugs in fallocate hole punch of areas with holes
Date: Sat, 31 Oct 2015 10:53:24 -0700	[thread overview]
Message-ID: <56350014.2040800@oracle.com> (raw)
In-Reply-To: <007901d1139a$030b0440$09210cc0$@alibaba-inc.com>

On 10/30/2015 10:07 PM, Hillf Danton wrote:
>>
>> Hugh Dickins pointed out problems with the new hugetlbfs fallocate
>> hole punch code.  These problems are in the routine remove_inode_hugepages
>> and mostly occur in the case where there are holes in the range of
>> pages to be removed.  These holes could be the result of a previous hole
>> punch or simply sparse allocation.
>>
>> remove_inode_hugepages handles both hole punch and truncate operations.
>> Page index handling was fixed/cleaned up so that holes are properly
>> handled.  In addition, code was changed to ensure multiple passes of the
>> address range only happens in the truncate case.  More comments were added
>> to explain the different actions in each case.  A cond_resched() was added
>> after removing up to PAGEVEC_SIZE pages.
>>
>> Some totally unnecessary code in hugetlbfs_fallocate() that remained from
>> early development was also removed.
>>
>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>> ---
>>  fs/hugetlbfs/inode.c | 44 +++++++++++++++++++++++++++++---------------
>>  1 file changed, 29 insertions(+), 15 deletions(-)
>>
>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>> index 316adb9..30cf534 100644
>> --- a/fs/hugetlbfs/inode.c
>> +++ b/fs/hugetlbfs/inode.c
>> @@ -368,10 +368,25 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
>>  			lookup_nr = end - next;
>>
>>  		/*
>> -		 * This pagevec_lookup() may return pages past 'end',
>> -		 * so we must check for page->index > end.
>> +		 * When no more pages are found, take different action for
>> +		 * hole punch and truncate.
>> +		 *
>> +		 * For hole punch, this indicates we have removed each page
>> +		 * within the range and are done.  Note that pages may have
>> +		 * been faulted in after being removed in the hole punch case.
>> +		 * This is OK as long as each page in the range was removed
>> +		 * once.
>> +		 *
>> +		 * For truncate, we need to make sure all pages within the
>> +		 * range are removed when exiting this routine.  We could
>> +		 * have raced with a fault that brought in a page after it
>> +		 * was first removed.  Check the range again until no pages
>> +		 * are found.
>>  		 */
>>  		if (!pagevec_lookup(&pvec, mapping, next, lookup_nr)) {
>> +			if (!truncate_op)
>> +				break;
>> +
>>  			if (next == start)
>>  				break;
>>  			next = start;
>> @@ -382,19 +397,23 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
>>  			struct page *page = pvec.pages[i];
>>  			u32 hash;
>>
>> +			/*
>> +			 * The page (index) could be beyond end.  This is
>> +			 * only possible in the punch hole case as end is
>> +			 * LLONG_MAX for truncate.
>> +			 */
>> +			if (page->index >= end) {
>> +				next = end;	/* we are done */
>> +				break;
>> +			}
>> +			next = page->index;
>> +
>>  			hash = hugetlb_fault_mutex_hash(h, current->mm,
>>  							&pseudo_vma,
>>  							mapping, next, 0);
>>  			mutex_lock(&hugetlb_fault_mutex_table[hash]);
>>
>>  			lock_page(page);
>> -			if (page->index >= end) {
>> -				unlock_page(page);
>> -				mutex_unlock(&hugetlb_fault_mutex_table[hash]);
>> -				next = end;	/* we are done */
>> -				break;
>> -			}
>> -
>>  			/*
>>  			 * If page is mapped, it was faulted in after being
>>  			 * unmapped.  Do nothing in this race case.  In the
>> @@ -423,15 +442,13 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
>>  				}
>>  			}
>>
>> -			if (page->index > next)
>> -				next = page->index;
>> -
>>  			++next;
>>  			unlock_page(page);
>>
>>  			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
>>  		}
>>  		huge_pagevec_release(&pvec);
>> +		cond_resched();
>>  	}
>>
>>  	if (truncate_op)
>> @@ -647,9 +664,6 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
> 
> This hunk is already in the next tree, see below please.
> 

Ah, the whole series to add shmem like code to handle hole punch/fault
races is in the next tree.  It has been determined that most of this
series is not necessary.  For the next tree, ideally the following
should happen:
- revert the series
	0830d5afd4ab69d01cf5ceba9b9f2796564c4eb6
	4e0a78fea078af972276c2d3aeaceb2bac80e033
	251c8a023a0c639725e014a612e8c05a631ce839
	03bcef375766af4db12ec783241ac39f8bf5e2b1
- Add this patch (if Ack'ed/reviewed) to fix remove_inode_hugepages
- Add a new patch for the handle hole punch/fault race.  It modifies
  same code as this patch, so I have not sent out until this is Ack'ed.

I will admit that I do not fully understand how maintainers manage their
trees and share patches.  If someone can make suggestions on how to handle
this situation (create patches against what tree? send patches to who?),
I will be happy to make it happen.

-- 
Mike Kravetz

>>  	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
>>  		i_size_write(inode, offset + len);
>>  	inode->i_ctime = CURRENT_TIME;
>> -	spin_lock(&inode->i_lock);
>> -	inode->i_private = NULL;
>> -	spin_unlock(&inode->i_lock);
>>  out:
>>  	mutex_unlock(&inode->i_mutex);
>>  	return error;
>> --
>> 2.4.3
>>
> In the next tree,
> 	4e0a78fea078af972276c2d3aeaceb2bac80e033
> 	mm/hugetlb: setup hugetlb_falloc during fallocate hole punch
> 
> @@ -647,9 +676,6 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
>  	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
>  		i_size_write(inode, offset + len);
>  	inode->i_ctime = CURRENT_TIME;
> -	spin_lock(&inode->i_lock);
> -	inode->i_private = NULL;
> -	spin_unlock(&inode->i_lock);
>  out:
>  	mutex_unlock(&inode->i_mutex);
>  	return error;
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2015-10-31 17:59 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-10-31  5:07 [PATCH] mm/hugetlbfs Fix bugs in fallocate hole punch of areas with holes Hillf Danton
2015-10-31 17:53 ` Mike Kravetz [this message]
2015-11-02  2:50   ` Hillf Danton
2015-11-02 17:39     ` Mike Kravetz
2015-11-02 21:40       ` Andrew Morton
  -- strict thread matches above, loose matches on Subject: below --
2015-10-30 23:32 Mike Kravetz
2015-11-09  6:57 ` Naoya Horiguchi
2015-11-09  7:09 ` Hugh Dickins
2015-11-09 23:24   ` Mike Kravetz

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=56350014.2040800@oracle.com \
    --to=mike.kravetz@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=dave.hansen@linux.intel.com \
    --cc=dave@stgolabs.net \
    --cc=hillf.zj@alibaba-inc.com \
    --cc=hughd@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=n-horiguchi@ah.jp.nec.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).