Re: [PATCH 5/5] hugetlb: add support for gigantic page allocation at runtime

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
To: Luiz Capitulino <lcapitulino@redhat.com>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	mtosatti@redhat.com, aarcange@redhat.com, mgorman@suse.de,
	akpm@linux-foundation.org, andi@firstfloor.org, davidlohr@hp.com,
	rientjes@google.com, yinghai@kernel.org, riel@redhat.com,
	n-horiguchi@ah.jp.nec.com
Subject: Re: [PATCH 5/5] hugetlb: add support for gigantic page allocation at runtime
Date: Thu, 10 Apr 2014 13:39:48 +0900	[thread overview]
Message-ID: <53462094.6000407@jp.fujitsu.com> (raw)
In-Reply-To: <20140409135614.0fb55016@redhat.com>

(2014/04/10 2:56), Luiz Capitulino wrote:
> On Wed, 9 Apr 2014 09:42:01 +0900
> Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> wrote:
>
>> (2014/04/09 4:02), Luiz Capitulino wrote:
>>> HugeTLB is limited to allocating hugepages whose size are less than
>>> MAX_ORDER order. This is so because HugeTLB allocates hugepages via
>>> the buddy allocator. Gigantic pages (that is, pages whose size is
>>> greater than MAX_ORDER order) have to be allocated at boottime.
>>>
>>> However, boottime allocation has at least two serious problems. First,
>>> it doesn't support NUMA and second, gigantic pages allocated at
>>> boottime can't be freed.
>>>
>>> This commit solves both issues by adding support for allocating gigantic
>>> pages during runtime. It works just like regular sized hugepages,
>>> meaning that the interface in sysfs is the same, it supports NUMA,
>>> and gigantic pages can be freed.
>>>
>>> For example, on x86_64 gigantic pages are 1GB big. To allocate two 1G
>>> gigantic pages on node 1, one can do:
>>>
>>>    # echo 2 > \
>>>      /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
>>>
>>> And to free them all:
>>>
>>>    # echo 0 > \
>>>      /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
>>>
>>> The one problem with gigantic page allocation at runtime is that it
>>> can't be serviced by the buddy allocator. To overcome that problem, this
>>> commit scans all zones from a node looking for a large enough contiguous
>>> region. When one is found, it's allocated by using CMA, that is, we call
>>> alloc_contig_range() to do the actual allocation. For example, on x86_64
>>> we scan all zones looking for a 1GB contiguous region. When one is found,
>>> it's allocated by alloc_contig_range().
>>>
>>> One expected issue with that approach is that such gigantic contiguous
>>> regions tend to vanish as runtime goes by. The best way to avoid this for
>>> now is to make gigantic page allocations very early during system boot, say
>>> from a init script. Other possible optimization include using compaction,
>>> which is supported by CMA but is not explicitly used by this commit.
>>>
>>> It's also important to note the following:
>>>
>>>    1. Gigantic pages allocated at boottime by the hugepages= command-line
>>>       option can be freed at runtime just fine
>>>
>>>    2. This commit adds support for gigantic pages only to x86_64. The
>>>       reason is that I don't have access to nor experience with other archs.
>>>       The code is arch indepedent though, so it should be simple to add
>>>       support to different archs
>>>
>>>    3. I didn't add support for hugepage overcommit, that is allocating
>>>       a gigantic page on demand when
>>>      /proc/sys/vm/nr_overcommit_hugepages > 0. The reason is that I don't
>>>      think it's reasonable to do the hard and long work required for
>>>      allocating a gigantic page at fault time. But it should be simple
>>>      to add this if wanted
>>>
>>> Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
>>> ---
>>>    mm/hugetlb.c | 158 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-----
>>>    1 file changed, 147 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>> index 9dded98..2258045 100644
>>> --- a/mm/hugetlb.c
>>> +++ b/mm/hugetlb.c
>>> @@ -679,11 +679,141 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
>>>    		((node = hstate_next_node_to_free(hs, mask)) || 1);	\
>>>    		nr_nodes--)
>>>
>>> +#if defined(CONFIG_CMA) && defined(CONFIG_X86_64)
>>> +static void destroy_compound_gigantic_page(struct page *page,
>>> +					unsigned long order)
>>> +{
>>> +	int i;
>>> +	int nr_pages = 1 << order;
>>> +	struct page *p = page + 1;
>>> +
>>> +	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
>>> +		__ClearPageTail(p);
>>> +		set_page_refcounted(p);
>>> +		p->first_page = NULL;
>>> +	}
>>> +
>>> +	set_compound_order(page, 0);
>>> +	__ClearPageHead(page);
>>> +}
>>> +
>>> +static void free_gigantic_page(struct page *page, unsigned order)
>>> +{
>>> +	free_contig_range(page_to_pfn(page), 1 << order);
>>> +}
>>> +
>>> +static int __alloc_gigantic_page(unsigned long start_pfn, unsigned long count)
>>> +{
>>> +	unsigned long end_pfn = start_pfn + count;
>>> +	return alloc_contig_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
>>> +}
>>> +
>>> +static bool pfn_range_valid_gigantic(unsigned long start_pfn,
>>> +				unsigned long nr_pages)
>>> +{
>>> +	unsigned long i, end_pfn = start_pfn + nr_pages;
>>> +	struct page *page;
>>> +
>>> +	for (i = start_pfn; i < end_pfn; i++) {
>>> +		if (!pfn_valid(i))
>>> +			return false;
>>> +
>>> +		page = pfn_to_page(i);
>>> +
>>> +		if (PageReserved(page))
>>> +			return false;
>>> +
>>> +		if (page_count(page) > 0)
>>> +			return false;
>>> +
>>> +		if (PageHuge(page))
>>> +			return false;
>>> +	}
>>> +
>>> +	return true;
>>> +}
>>> +
>>> +static struct page *alloc_gigantic_page(int nid, unsigned order)
>>> +{
>>> +	unsigned long nr_pages = 1 << order;
>>> +	unsigned long ret, pfn, flags;
>>> +	struct zone *z;
>>> +
>>> +	z = NODE_DATA(nid)->node_zones;
>>> +	for (; z - NODE_DATA(nid)->node_zones < MAX_NR_ZONES; z++) {
>>> +		spin_lock_irqsave(&z->lock, flags);
>>> +
>>> +		pfn = ALIGN(z->zone_start_pfn, nr_pages);
>>> +		for (; pfn < zone_end_pfn(z); pfn += nr_pages) {
>>
>>> +			if (pfn_range_valid_gigantic(pfn, nr_pages)) {
>>
>> How about it. It can reduce the indentation level.
>> 			if (!pfn_range_valid_gigantic(...))
>> 				continue;
>>
>> And I think following check is necessary:
>> 			if (pfn + nr_pages >= zone_end_pfn(z))
>> 				break;
>
> You're right that we have to check if the whole range is within the zone,
> but the check above is off-by-one. What about the following:
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 01c0068..b01cdeb 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -734,6 +734,13 @@ static bool pfn_range_valid_gigantic(unsigned long start_pfn,
>   	return true;
>   }
>
> +static bool zone_spans_last_pfn(const struct zone *zone,
> +			unsigned long start_pfn, unsigned long nr_pages)
> +{
> +	unsigned long last_pfn = start_pfn + nr_pages - 1;
> +	return zone_spans_pfn(zone, last_pfn);
> +}
> +
>   static struct page *alloc_gigantic_page(int nid, unsigned order)
>   {
>   	unsigned long nr_pages = 1 << order;
> @@ -745,7 +752,7 @@ static struct page *alloc_gigantic_page(int nid, unsigned order)
>   		spin_lock_irqsave(&z->lock, flags);
>
>   		pfn = ALIGN(z->zone_start_pfn, nr_pages);
> -		for (; pfn < zone_end_pfn(z); pfn += nr_pages) {
> +		while (zone_spans_last_pfn(z, pfn, nr_pages)) {
>   			if (pfn_range_valid_gigantic(pfn, nr_pages)) {
>   				/*
>   				 * We release the zone lock here because
> @@ -760,6 +767,7 @@ static struct page *alloc_gigantic_page(int nid, unsigned order)
>   					return pfn_to_page(pfn);
>   				spin_lock_irqsave(&z->lock, flags);
>   			}
> +			pfn += nr_pages;
>   		}
>
>   		spin_unlock_irqrestore(&z->lock, flags);
>

Looks good to me.

Thanks,
Yasuaki Ishimatsu

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

WARNING: multiple messages have this Message-ID (diff)

From: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
To: Luiz Capitulino <lcapitulino@redhat.com>
Cc: <linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>,
	<mtosatti@redhat.com>, <aarcange@redhat.com>, <mgorman@suse.de>,
	<akpm@linux-foundation.org>, <andi@firstfloor.org>,
	<davidlohr@hp.com>, <rientjes@google.com>, <yinghai@kernel.org>,
	<riel@redhat.com>, <n-horiguchi@ah.jp.nec.com>
Subject: Re: [PATCH 5/5] hugetlb: add support for gigantic page allocation at runtime
Date: Thu, 10 Apr 2014 13:39:48 +0900	[thread overview]
Message-ID: <53462094.6000407@jp.fujitsu.com> (raw)
In-Reply-To: <20140409135614.0fb55016@redhat.com>

(2014/04/10 2:56), Luiz Capitulino wrote:
> On Wed, 9 Apr 2014 09:42:01 +0900
> Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> wrote:
>
>> (2014/04/09 4:02), Luiz Capitulino wrote:
>>> HugeTLB is limited to allocating hugepages whose size are less than
>>> MAX_ORDER order. This is so because HugeTLB allocates hugepages via
>>> the buddy allocator. Gigantic pages (that is, pages whose size is
>>> greater than MAX_ORDER order) have to be allocated at boottime.
>>>
>>> However, boottime allocation has at least two serious problems. First,
>>> it doesn't support NUMA and second, gigantic pages allocated at
>>> boottime can't be freed.
>>>
>>> This commit solves both issues by adding support for allocating gigantic
>>> pages during runtime. It works just like regular sized hugepages,
>>> meaning that the interface in sysfs is the same, it supports NUMA,
>>> and gigantic pages can be freed.
>>>
>>> For example, on x86_64 gigantic pages are 1GB big. To allocate two 1G
>>> gigantic pages on node 1, one can do:
>>>
>>>    # echo 2 > \
>>>      /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
>>>
>>> And to free them all:
>>>
>>>    # echo 0 > \
>>>      /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
>>>
>>> The one problem with gigantic page allocation at runtime is that it
>>> can't be serviced by the buddy allocator. To overcome that problem, this
>>> commit scans all zones from a node looking for a large enough contiguous
>>> region. When one is found, it's allocated by using CMA, that is, we call
>>> alloc_contig_range() to do the actual allocation. For example, on x86_64
>>> we scan all zones looking for a 1GB contiguous region. When one is found,
>>> it's allocated by alloc_contig_range().
>>>
>>> One expected issue with that approach is that such gigantic contiguous
>>> regions tend to vanish as runtime goes by. The best way to avoid this for
>>> now is to make gigantic page allocations very early during system boot, say
>>> from a init script. Other possible optimization include using compaction,
>>> which is supported by CMA but is not explicitly used by this commit.
>>>
>>> It's also important to note the following:
>>>
>>>    1. Gigantic pages allocated at boottime by the hugepages= command-line
>>>       option can be freed at runtime just fine
>>>
>>>    2. This commit adds support for gigantic pages only to x86_64. The
>>>       reason is that I don't have access to nor experience with other archs.
>>>       The code is arch indepedent though, so it should be simple to add
>>>       support to different archs
>>>
>>>    3. I didn't add support for hugepage overcommit, that is allocating
>>>       a gigantic page on demand when
>>>      /proc/sys/vm/nr_overcommit_hugepages > 0. The reason is that I don't
>>>      think it's reasonable to do the hard and long work required for
>>>      allocating a gigantic page at fault time. But it should be simple
>>>      to add this if wanted
>>>
>>> Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
>>> ---
>>>    mm/hugetlb.c | 158 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-----
>>>    1 file changed, 147 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>> index 9dded98..2258045 100644
>>> --- a/mm/hugetlb.c
>>> +++ b/mm/hugetlb.c
>>> @@ -679,11 +679,141 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
>>>    		((node = hstate_next_node_to_free(hs, mask)) || 1);	\
>>>    		nr_nodes--)
>>>
>>> +#if defined(CONFIG_CMA) && defined(CONFIG_X86_64)
>>> +static void destroy_compound_gigantic_page(struct page *page,
>>> +					unsigned long order)
>>> +{
>>> +	int i;
>>> +	int nr_pages = 1 << order;
>>> +	struct page *p = page + 1;
>>> +
>>> +	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
>>> +		__ClearPageTail(p);
>>> +		set_page_refcounted(p);
>>> +		p->first_page = NULL;
>>> +	}
>>> +
>>> +	set_compound_order(page, 0);
>>> +	__ClearPageHead(page);
>>> +}
>>> +
>>> +static void free_gigantic_page(struct page *page, unsigned order)
>>> +{
>>> +	free_contig_range(page_to_pfn(page), 1 << order);
>>> +}
>>> +
>>> +static int __alloc_gigantic_page(unsigned long start_pfn, unsigned long count)
>>> +{
>>> +	unsigned long end_pfn = start_pfn + count;
>>> +	return alloc_contig_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
>>> +}
>>> +
>>> +static bool pfn_range_valid_gigantic(unsigned long start_pfn,
>>> +				unsigned long nr_pages)
>>> +{
>>> +	unsigned long i, end_pfn = start_pfn + nr_pages;
>>> +	struct page *page;
>>> +
>>> +	for (i = start_pfn; i < end_pfn; i++) {
>>> +		if (!pfn_valid(i))
>>> +			return false;
>>> +
>>> +		page = pfn_to_page(i);
>>> +
>>> +		if (PageReserved(page))
>>> +			return false;
>>> +
>>> +		if (page_count(page) > 0)
>>> +			return false;
>>> +
>>> +		if (PageHuge(page))
>>> +			return false;
>>> +	}
>>> +
>>> +	return true;
>>> +}
>>> +
>>> +static struct page *alloc_gigantic_page(int nid, unsigned order)
>>> +{
>>> +	unsigned long nr_pages = 1 << order;
>>> +	unsigned long ret, pfn, flags;
>>> +	struct zone *z;
>>> +
>>> +	z = NODE_DATA(nid)->node_zones;
>>> +	for (; z - NODE_DATA(nid)->node_zones < MAX_NR_ZONES; z++) {
>>> +		spin_lock_irqsave(&z->lock, flags);
>>> +
>>> +		pfn = ALIGN(z->zone_start_pfn, nr_pages);
>>> +		for (; pfn < zone_end_pfn(z); pfn += nr_pages) {
>>
>>> +			if (pfn_range_valid_gigantic(pfn, nr_pages)) {
>>
>> How about it. It can reduce the indentation level.
>> 			if (!pfn_range_valid_gigantic(...))
>> 				continue;
>>
>> And I think following check is necessary:
>> 			if (pfn + nr_pages >= zone_end_pfn(z))
>> 				break;
>
> You're right that we have to check if the whole range is within the zone,
> but the check above is off-by-one. What about the following:
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 01c0068..b01cdeb 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -734,6 +734,13 @@ static bool pfn_range_valid_gigantic(unsigned long start_pfn,
>   	return true;
>   }
>
> +static bool zone_spans_last_pfn(const struct zone *zone,
> +			unsigned long start_pfn, unsigned long nr_pages)
> +{
> +	unsigned long last_pfn = start_pfn + nr_pages - 1;
> +	return zone_spans_pfn(zone, last_pfn);
> +}
> +
>   static struct page *alloc_gigantic_page(int nid, unsigned order)
>   {
>   	unsigned long nr_pages = 1 << order;
> @@ -745,7 +752,7 @@ static struct page *alloc_gigantic_page(int nid, unsigned order)
>   		spin_lock_irqsave(&z->lock, flags);
>
>   		pfn = ALIGN(z->zone_start_pfn, nr_pages);
> -		for (; pfn < zone_end_pfn(z); pfn += nr_pages) {
> +		while (zone_spans_last_pfn(z, pfn, nr_pages)) {
>   			if (pfn_range_valid_gigantic(pfn, nr_pages)) {
>   				/*
>   				 * We release the zone lock here because
> @@ -760,6 +767,7 @@ static struct page *alloc_gigantic_page(int nid, unsigned order)
>   					return pfn_to_page(pfn);
>   				spin_lock_irqsave(&z->lock, flags);
>   			}
> +			pfn += nr_pages;
>   		}
>
>   		spin_unlock_irqrestore(&z->lock, flags);
>

Looks good to me.

Thanks,
Yasuaki Ishimatsu

next prev parent reply	other threads:[~2014-04-10  4:40 UTC|newest]

Thread overview: 33+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-04-08 19:02 [PATCH v2 0/5] hugetlb: add support gigantic page allocation at runtime Luiz Capitulino
2014-04-08 19:02 ` Luiz Capitulino
2014-04-08 19:02 ` [PATCH 1/5] hugetlb: prep_compound_gigantic_page(): drop __init marker Luiz Capitulino
2014-04-08 19:02   ` Luiz Capitulino
2014-04-08 19:51   ` Naoya Horiguchi
     [not found]   ` <1396986711-o0m2kq1v@n-horiguchi@ah.jp.nec.com>
2014-04-08 20:25     ` Luiz Capitulino
2014-04-08 20:25       ` Luiz Capitulino
2014-04-08 19:02 ` [PATCH 2/5] hugetlb: add hstate_is_gigantic() Luiz Capitulino
2014-04-08 19:02   ` Luiz Capitulino
2014-04-08 19:02 ` [PATCH 3/5] hugetlb: update_and_free_page(): don't clear PG_reserved bit Luiz Capitulino
2014-04-08 19:02   ` Luiz Capitulino
2014-04-08 20:51   ` Kirill A. Shutemov
2014-04-08 21:11     ` Luiz Capitulino
2014-04-08 21:11       ` Luiz Capitulino
2014-04-08 19:02 ` [PATCH 4/5] hugetlb: move helpers up in the file Luiz Capitulino
2014-04-08 19:02   ` Luiz Capitulino
2014-04-08 19:02 ` [PATCH 5/5] hugetlb: add support for gigantic page allocation at runtime Luiz Capitulino
2014-04-08 19:02   ` Luiz Capitulino
2014-04-08 20:38   ` Naoya Horiguchi
2014-04-09  0:42   ` Yasuaki Ishimatsu
2014-04-09  0:42     ` Yasuaki Ishimatsu
2014-04-09 17:56     ` Luiz Capitulino
2014-04-09 17:56       ` Luiz Capitulino
2014-04-10  4:39       ` Yasuaki Ishimatsu [this message]
2014-04-10  4:39         ` Yasuaki Ishimatsu
  -- strict thread matches above, loose matches on Subject: below --
2014-04-10 17:58 [PATCH v3 0/5] hugetlb: add support " Luiz Capitulino
2014-04-10 17:58 ` [PATCH 5/5] hugetlb: add support for " Luiz Capitulino
2014-04-10 17:58   ` Luiz Capitulino
2014-04-13 23:31   ` Yasuaki Ishimatsu
2014-04-13 23:31     ` Yasuaki Ishimatsu
2014-04-17 23:00   ` Andrew Morton
2014-04-17 23:00     ` Andrew Morton
2014-04-22 21:19     ` Luiz Capitulino
2014-04-22 21:19       ` Luiz Capitulino

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=53462094.6000407@jp.fujitsu.com \
    --to=isimatu.yasuaki@jp.fujitsu.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=andi@firstfloor.org \
    --cc=davidlohr@hp.com \
    --cc=lcapitulino@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@suse.de \
    --cc=mtosatti@redhat.com \
    --cc=n-horiguchi@ah.jp.nec.com \
    --cc=riel@redhat.com \
    --cc=rientjes@google.com \
    --cc=yinghai@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.