Re: [PATCH 1/1] allocate structures for reservation tracking in hugetlbfs outside of spinlocks

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Andy Whitcroft <apw@shadowen.org>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: gerald.schaefer@de.ibm.com, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, schwidefsky@de.ibm.com,
	heiko.carstens@de.ibm.com, mel@csn.ul.ie
Subject: Re: [PATCH 1/1] allocate structures for reservation tracking in hugetlbfs outside of spinlocks
Date: Fri, 8 Aug 2008 11:16:46 +0100	[thread overview]
Message-ID: <20080808101646.GJ14829@brain> (raw)
In-Reply-To: <20080807143824.8e0803da.akpm@linux-foundation.org>

On Thu, Aug 07, 2008 at 02:38:24PM -0700, Andrew Morton wrote:
> On Thu,  7 Aug 2008 21:28:23 +0100
> Andy Whitcroft <apw@shadowen.org> wrote:
> 
> > [Andrew, this fixes a problem in the private reservations stack, shown up
> > by some testing done by Gerald on s390 with PREEMPT.  It fixes an attempt
> > at allocation while holding locks.  This should be merged up to mainline
> > as a bug fix to those patches.]
> > 
> > In the normal case, hugetlbfs reserves hugepages at map time so that the
> > pages exist for future faults.  A struct file_region is used to track
> > when reservations have been consumed and where.  These file_regions
> > are allocated as necessary with kmalloc() which can sleep with the
> > mm->page_table_lock held.  This is wrong and triggers may-sleep warning
> > when PREEMPT is enabled.
> > 
> > Updates to the underlying file_region are done in two phases.  The first
> > phase prepares the region for the change, allocating any necessary memory,
> > without actually making the change.  The second phase actually commits
> > the change.  This patch makes use of this by checking the reservations
> > before the page_table_lock is taken; triggering any necessary allocations.
> > This may then be safely repeated within the locks without any allocations
> > being required.
> > 
> > Credit to Mel Gorman for diagnosing this failure and initial versions of
> > the patch.
> > 
> 
> After applying the patch:
> 
> : int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> : 			unsigned long address, int write_access)
> : {
> : 	pte_t *ptep;
> : 	pte_t entry;
> : 	int ret;
> : 	struct page *pagecache_page = NULL;
> : 	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
> : 	struct hstate *h = hstate_vma(vma);
> : 
> : 	ptep = huge_pte_alloc(mm, address, huge_page_size(h));
> : 	if (!ptep)
> : 		return VM_FAULT_OOM;
> : 
> : 	/*
> : 	 * Serialize hugepage allocation and instantiation, so that we don't
> : 	 * get spurious allocation failures if two CPUs race to instantiate
> : 	 * the same page in the page cache.
> : 	 */
> : 	mutex_lock(&hugetlb_instantiation_mutex);
> : 	entry = huge_ptep_get(ptep);
> : 	if (huge_pte_none(entry)) {
> : 		ret = hugetlb_no_page(mm, vma, address, ptep, write_access);
> : 		mutex_unlock(&hugetlb_instantiation_mutex);
> : 		return ret;
> : 	}
> : 
> : 	ret = 0;
> : 
> : 	/*
> : 	 * If we are going to COW the mapping later, we examine the pending
> : 	 * reservations for this page now. This will ensure that any
> : 	 * allocations necessary to record that reservation occur outside the
> : 	 * spinlock. For private mappings, we also lookup the pagecache
> : 	 * page now as it is used to determine if a reservation has been
> : 	 * consumed.
> : 	 */
> : 	if (write_access && !pte_write(entry)) {
> : 		vma_needs_reservation(h, vma, address);
> : 
> : 		if (!(vma->vm_flags & VM_SHARED))
> : 			pagecache_page = hugetlbfs_pagecache_page(h,
> : 								vma, address);
> : 	}
> 
> There's a seeming race window here, where a new page can get
> instantiated.  But down-read(mmap_sem) plus hugetlb_instantiation_mutex
> prevents that, yes?

Although that is true, I would prefer to not think of the
instantiation_mutex as protection for this, its primary concern is
serialisation.  I believe that the combination of down_read(mmap_sem),
the page lock, and perversely the page_table_lock protect this.

At this point we know that the PTE was not pte_none else we would
have branched to no_page.  No mapping operations can be occuring as
we have down_read(mmap_sem).  Any truncates racing with us first clear
the PTEs and then the pagecache references.  Should we pick up a stale
pagecache reference, we will detect it when we recheck the PTE under
the page_table_lock; this will also detect any racing instantiations.

Obviously we have the instantiation_mutex, and the locking rules for
the regions need it.  But I believe we are safe against this race even
without the instantiation_mutex.

> : 	spin_lock(&mm->page_table_lock);
> : 	/* Check for a racing update before calling hugetlb_cow */
> : 	if (likely(pte_same(entry, huge_ptep_get(ptep))))
> : 		if (write_access && !pte_write(entry))
> : 			ret = hugetlb_cow(mm, vma, address, ptep, entry,
> : 							pagecache_page);
> : 	spin_unlock(&mm->page_table_lock);
> : 
> : 	if (pagecache_page) {
> : 		unlock_page(pagecache_page);
> : 		put_page(pagecache_page);
> : 	}
> : 
> : 	mutex_unlock(&hugetlb_instantiation_mutex);
> : 
> : 	return ret;
> : }
> : 
> : 

-apw

WARNING: multiple messages have this Message-ID (diff)

From: Andy Whitcroft <apw@shadowen.org>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: gerald.schaefer@de.ibm.com, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, schwidefsky@de.ibm.com,
	heiko.carstens@de.ibm.com, mel@csn.ul.ie
Subject: Re: [PATCH 1/1] allocate structures for reservation tracking in hugetlbfs outside of spinlocks
Date: Fri, 8 Aug 2008 11:16:46 +0100	[thread overview]
Message-ID: <20080808101646.GJ14829@brain> (raw)
In-Reply-To: <20080807143824.8e0803da.akpm@linux-foundation.org>

On Thu, Aug 07, 2008 at 02:38:24PM -0700, Andrew Morton wrote:
> On Thu,  7 Aug 2008 21:28:23 +0100
> Andy Whitcroft <apw@shadowen.org> wrote:
> 
> > [Andrew, this fixes a problem in the private reservations stack, shown up
> > by some testing done by Gerald on s390 with PREEMPT.  It fixes an attempt
> > at allocation while holding locks.  This should be merged up to mainline
> > as a bug fix to those patches.]
> > 
> > In the normal case, hugetlbfs reserves hugepages at map time so that the
> > pages exist for future faults.  A struct file_region is used to track
> > when reservations have been consumed and where.  These file_regions
> > are allocated as necessary with kmalloc() which can sleep with the
> > mm->page_table_lock held.  This is wrong and triggers may-sleep warning
> > when PREEMPT is enabled.
> > 
> > Updates to the underlying file_region are done in two phases.  The first
> > phase prepares the region for the change, allocating any necessary memory,
> > without actually making the change.  The second phase actually commits
> > the change.  This patch makes use of this by checking the reservations
> > before the page_table_lock is taken; triggering any necessary allocations.
> > This may then be safely repeated within the locks without any allocations
> > being required.
> > 
> > Credit to Mel Gorman for diagnosing this failure and initial versions of
> > the patch.
> > 
> 
> After applying the patch:
> 
> : int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> : 			unsigned long address, int write_access)
> : {
> : 	pte_t *ptep;
> : 	pte_t entry;
> : 	int ret;
> : 	struct page *pagecache_page = NULL;
> : 	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
> : 	struct hstate *h = hstate_vma(vma);
> : 
> : 	ptep = huge_pte_alloc(mm, address, huge_page_size(h));
> : 	if (!ptep)
> : 		return VM_FAULT_OOM;
> : 
> : 	/*
> : 	 * Serialize hugepage allocation and instantiation, so that we don't
> : 	 * get spurious allocation failures if two CPUs race to instantiate
> : 	 * the same page in the page cache.
> : 	 */
> : 	mutex_lock(&hugetlb_instantiation_mutex);
> : 	entry = huge_ptep_get(ptep);
> : 	if (huge_pte_none(entry)) {
> : 		ret = hugetlb_no_page(mm, vma, address, ptep, write_access);
> : 		mutex_unlock(&hugetlb_instantiation_mutex);
> : 		return ret;
> : 	}
> : 
> : 	ret = 0;
> : 
> : 	/*
> : 	 * If we are going to COW the mapping later, we examine the pending
> : 	 * reservations for this page now. This will ensure that any
> : 	 * allocations necessary to record that reservation occur outside the
> : 	 * spinlock. For private mappings, we also lookup the pagecache
> : 	 * page now as it is used to determine if a reservation has been
> : 	 * consumed.
> : 	 */
> : 	if (write_access && !pte_write(entry)) {
> : 		vma_needs_reservation(h, vma, address);
> : 
> : 		if (!(vma->vm_flags & VM_SHARED))
> : 			pagecache_page = hugetlbfs_pagecache_page(h,
> : 								vma, address);
> : 	}
> 
> There's a seeming race window here, where a new page can get
> instantiated.  But down-read(mmap_sem) plus hugetlb_instantiation_mutex
> prevents that, yes?

Although that is true, I would prefer to not think of the
instantiation_mutex as protection for this, its primary concern is
serialisation.  I believe that the combination of down_read(mmap_sem),
the page lock, and perversely the page_table_lock protect this.

At this point we know that the PTE was not pte_none else we would
have branched to no_page.  No mapping operations can be occuring as
we have down_read(mmap_sem).  Any truncates racing with us first clear
the PTEs and then the pagecache references.  Should we pick up a stale
pagecache reference, we will detect it when we recheck the PTE under
the page_table_lock; this will also detect any racing instantiations.

Obviously we have the instantiation_mutex, and the locking rules for
the regions need it.  But I believe we are safe against this race even
without the instantiation_mutex.

> : 	spin_lock(&mm->page_table_lock);
> : 	/* Check for a racing update before calling hugetlb_cow */
> : 	if (likely(pte_same(entry, huge_ptep_get(ptep))))
> : 		if (write_access && !pte_write(entry))
> : 			ret = hugetlb_cow(mm, vma, address, ptep, entry,
> : 							pagecache_page);
> : 	spin_unlock(&mm->page_table_lock);
> : 
> : 	if (pagecache_page) {
> : 		unlock_page(pagecache_page);
> : 		put_page(pagecache_page);
> : 	}
> : 
> : 	mutex_unlock(&hugetlb_instantiation_mutex);
> : 
> : 	return ret;
> : }
> : 
> : 

-apw

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2008-08-08 10:17 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-08-06 14:43 [BUG] hugetlb: sleeping function called from invalid context Gerald Schaefer
2008-08-06 14:43 ` Gerald Schaefer
2008-08-06 19:03 ` [PATCH 1/1] allocate structures for reservation tracking in hugetlbfs outside of spinlocks Andy Whitcroft
2008-08-06 19:03   ` Andy Whitcroft
2008-08-07 11:35   ` Gerald Schaefer
2008-08-07 11:35     ` Gerald Schaefer
2008-08-07 20:28 ` Andy Whitcroft
2008-08-07 20:28   ` Andy Whitcroft
2008-08-07 21:38   ` Andrew Morton
2008-08-07 21:38     ` Andrew Morton
2008-08-08  8:33     ` Mel Gorman
2008-08-08  8:33       ` Mel Gorman
2008-08-08 10:16     ` Andy Whitcroft [this message]
2008-08-08 10:16       ` Andy Whitcroft
2008-08-08 11:10     ` [PATCH 1/1] allocate structures for reservation tracking in hugetlbfs outside of spinlocks v2 Andy Whitcroft
2008-08-08 11:10       ` Andy Whitcroft
2008-08-08 12:57       ` Gerald Schaefer
2008-08-08 12:57         ` Gerald Schaefer
2008-08-11 17:58     ` Andy Whitcroft
2008-08-11 17:58       ` Andy Whitcroft

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20080808101646.GJ14829@brain \
    --to=apw@shadowen.org \
    --cc=akpm@linux-foundation.org \
    --cc=gerald.schaefer@de.ibm.com \
    --cc=heiko.carstens@de.ibm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mel@csn.ul.ie \
    --cc=schwidefsky@de.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.