Re: [PATCH v3 3/3] mm/hugetlb: handle races in alloc_huge_page and hugetlb_reserve_pages

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Mike Kravetz <mike.kravetz@oracle.com>
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Andrew Morton <akpm@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>,
	Luiz Capitulino <lcapitulino@redhat.com>
Subject: Re: [PATCH v3 3/3] mm/hugetlb: handle races in alloc_huge_page and hugetlb_reserve_pages
Date: Mon, 01 Jun 2015 09:53:34 -0700	[thread overview]
Message-ID: <556C8E0E.5000309@oracle.com> (raw)
In-Reply-To: <1432749371-32220-4-git-send-email-mike.kravetz@oracle.com>

On 05/27/2015 10:56 AM, Mike Kravetz wrote:
> alloc_huge_page and hugetlb_reserve_pages use region_chg to
> calculate the number of pages which will be added to the reserve
> map.  Subpool and global reserve counts are adjusted based on
> the output of region_chg.  Before the pages are actually added
> to the reserve map, these routines could race and add fewer
> pages than expected.  If this happens, the subpool and global
> reserve counts are not correct.
>
> Compare the number of pages actually added (region_add) to those
> expected to added (region_chg).  If fewer pages are actually added,
> this indicates a race and adjust counters accordingly.
>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>   mm/hugetlb.c | 34 ++++++++++++++++++++++++++++++----
>   1 file changed, 30 insertions(+), 4 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index b3d3d59..038c84e 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1544,7 +1544,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
>   	struct hugepage_subpool *spool = subpool_vma(vma);
>   	struct hstate *h = hstate_vma(vma);
>   	struct page *page;
> -	long chg;
> +	long chg, commit;
>   	int ret, idx;
>   	struct hugetlb_cgroup *h_cg;
>
> @@ -1585,7 +1585,20 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
>
>   	set_page_private(page, (unsigned long)spool);
>
> -	vma_commit_reservation(h, vma, addr);
> +	commit = vma_commit_reservation(h, vma, addr);
> +	if (unlikely(chg > commit)) {
> +		/*
> +		 * The page was added to the reservation map between
> +		 * vma_needs_reservation and vma_commit_reservation.
> +		 * This indicates a race with hugetlb_reserve_pages.
> +		 * Adjust for the subpool count incremented above AND
> +		 * in hugetlb_reserve_pages for the same page.  Also,
> +		 * the reservation count added in hugetlb_reserve_pages
> +		 * no longer applies.
> +		 */
> +		hugepage_subpool_put_pages(spool, 1);
> +		hugetlb_acct_memory(h, -1);
> +	}
>   	return page;

Well, this is embarrassing.  The code to fix up counts is incomplete
and incorrect.  It does not take into account the hugetlbfs min_size
option to reserve a set of pages for a filesystem.  The code should
look like:

		long rsv_adjust;

		rsv_adjust = hugepage_subpool_put_pages(spool, 1);
		hugetlb_acct_memory(h, -rsv_adjust);

A similar change is required in hugetlb_reserve_pages().

I have verified that the global reserve count is not properly
adjusted with the original patch, and verified that it is properly
adjusted with the above change.

Should I create a new patch set, or create a patch on top of the
existing set?

I think this oversight points out the need to encapsulate the
subpool and global reserve count adjustments.  There are too
many counters/adjustments in the code and it is pretty easy not
to take them all into account.  I'll start work on some
encapsulation routines.

-- 
Mike Kravetz

>
>   out_uncharge_cgroup:
> @@ -3699,8 +3712,21 @@ int hugetlb_reserve_pages(struct inode *inode,
>   	 * consumed reservations are stored in the map. Hence, nothing
>   	 * else has to be done for private mappings here
>   	 */
> -	if (!vma || vma->vm_flags & VM_MAYSHARE)
> -		region_add(resv_map, from, to);
> +	if (!vma || vma->vm_flags & VM_MAYSHARE) {
> +		long add = region_add(resv_map, from, to);
> +
> +		if (unlikely(chg > add)) {
> +			/*
> +			 * pages in this range were added to the reserve
> +			 * map between region_chg and region_add.  This
> +			 * indicates a race with alloc_huge_page.  Adjust
> +			 * the subpool and reserve counts modified above
> +			 * based on the difference.
> +			 */
> +			hugepage_subpool_put_pages(spool, chg - add);
> +			hugetlb_acct_memory(h, -(chg - ret));
> +		}
> +	}
>   	return 0;
>   out_err:
>   	if (vma && is_vma_resv_set(vma, HPAGE_RESV_OWNER))
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

WARNING: multiple messages have this Message-ID (diff)

From: Mike Kravetz <mike.kravetz@oracle.com>
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Andrew Morton <akpm@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>,
	Luiz Capitulino <lcapitulino@redhat.com>
Subject: Re: [PATCH v3 3/3] mm/hugetlb: handle races in alloc_huge_page and hugetlb_reserve_pages
Date: Mon, 01 Jun 2015 09:53:34 -0700	[thread overview]
Message-ID: <556C8E0E.5000309@oracle.com> (raw)
In-Reply-To: <1432749371-32220-4-git-send-email-mike.kravetz@oracle.com>

On 05/27/2015 10:56 AM, Mike Kravetz wrote:
> alloc_huge_page and hugetlb_reserve_pages use region_chg to
> calculate the number of pages which will be added to the reserve
> map.  Subpool and global reserve counts are adjusted based on
> the output of region_chg.  Before the pages are actually added
> to the reserve map, these routines could race and add fewer
> pages than expected.  If this happens, the subpool and global
> reserve counts are not correct.
>
> Compare the number of pages actually added (region_add) to those
> expected to added (region_chg).  If fewer pages are actually added,
> this indicates a race and adjust counters accordingly.
>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>   mm/hugetlb.c | 34 ++++++++++++++++++++++++++++++----
>   1 file changed, 30 insertions(+), 4 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index b3d3d59..038c84e 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1544,7 +1544,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
>   	struct hugepage_subpool *spool = subpool_vma(vma);
>   	struct hstate *h = hstate_vma(vma);
>   	struct page *page;
> -	long chg;
> +	long chg, commit;
>   	int ret, idx;
>   	struct hugetlb_cgroup *h_cg;
>
> @@ -1585,7 +1585,20 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
>
>   	set_page_private(page, (unsigned long)spool);
>
> -	vma_commit_reservation(h, vma, addr);
> +	commit = vma_commit_reservation(h, vma, addr);
> +	if (unlikely(chg > commit)) {
> +		/*
> +		 * The page was added to the reservation map between
> +		 * vma_needs_reservation and vma_commit_reservation.
> +		 * This indicates a race with hugetlb_reserve_pages.
> +		 * Adjust for the subpool count incremented above AND
> +		 * in hugetlb_reserve_pages for the same page.  Also,
> +		 * the reservation count added in hugetlb_reserve_pages
> +		 * no longer applies.
> +		 */
> +		hugepage_subpool_put_pages(spool, 1);
> +		hugetlb_acct_memory(h, -1);
> +	}
>   	return page;

Well, this is embarrassing.  The code to fix up counts is incomplete
and incorrect.  It does not take into account the hugetlbfs min_size
option to reserve a set of pages for a filesystem.  The code should
look like:

		long rsv_adjust;

		rsv_adjust = hugepage_subpool_put_pages(spool, 1);
		hugetlb_acct_memory(h, -rsv_adjust);

A similar change is required in hugetlb_reserve_pages().

I have verified that the global reserve count is not properly
adjusted with the original patch, and verified that it is properly
adjusted with the above change.

Should I create a new patch set, or create a patch on top of the
existing set?

I think this oversight points out the need to encapsulate the
subpool and global reserve count adjustments.  There are too
many counters/adjustments in the code and it is pretty easy not
to take them all into account.  I'll start work on some
encapsulation routines.

-- 
Mike Kravetz

>
>   out_uncharge_cgroup:
> @@ -3699,8 +3712,21 @@ int hugetlb_reserve_pages(struct inode *inode,
>   	 * consumed reservations are stored in the map. Hence, nothing
>   	 * else has to be done for private mappings here
>   	 */
> -	if (!vma || vma->vm_flags & VM_MAYSHARE)
> -		region_add(resv_map, from, to);
> +	if (!vma || vma->vm_flags & VM_MAYSHARE) {
> +		long add = region_add(resv_map, from, to);
> +
> +		if (unlikely(chg > add)) {
> +			/*
> +			 * pages in this range were added to the reserve
> +			 * map between region_chg and region_add.  This
> +			 * indicates a race with alloc_huge_page.  Adjust
> +			 * the subpool and reserve counts modified above
> +			 * based on the difference.
> +			 */
> +			hugepage_subpool_put_pages(spool, chg - add);
> +			hugetlb_acct_memory(h, -(chg - ret));
> +		}
> +	}
>   	return 0;
>   out_err:
>   	if (vma && is_vma_resv_set(vma, HPAGE_RESV_OWNER))
>

next prev parent reply	other threads:[~2015-06-01 16:59 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-05-27 17:56 [PATCH v3 0/3] alloc_huge_page/hugetlb_reserve_pages race Mike Kravetz
2015-05-27 17:56 ` Mike Kravetz
2015-05-27 17:56 ` [PATCH v3 1/3] mm/hugetlb: document the reserve map/region tracking routines Mike Kravetz
2015-05-27 17:56   ` Mike Kravetz
2015-05-27 17:56 ` [PATCH v3 2/3] mm/hugetlb: compute/return the number of regions added by region_add() Mike Kravetz
2015-05-27 17:56   ` Mike Kravetz
2015-05-29  1:48   ` Naoya Horiguchi
2015-05-29  1:48     ` Naoya Horiguchi
2015-05-27 17:56 ` [PATCH v3 3/3] mm/hugetlb: handle races in alloc_huge_page and hugetlb_reserve_pages Mike Kravetz
2015-05-27 17:56   ` Mike Kravetz
2015-05-28 14:01   ` Davidlohr Bueso
2015-05-28 14:01     ` Davidlohr Bueso
2015-05-28 21:00     ` Mike Kravetz
2015-05-28 21:00       ` Mike Kravetz
2015-05-29  1:49   ` Naoya Horiguchi
2015-05-29  1:49     ` Naoya Horiguchi
2015-06-01 16:53   ` Mike Kravetz [this message]
2015-06-01 16:53     ` Mike Kravetz

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=556C8E0E.5000309@oracle.com \
    --to=mike.kravetz@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=dave@stgolabs.net \
    --cc=lcapitulino@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=n-horiguchi@ah.jp.nec.com \
    --cc=rientjes@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.