linux-mm.kvack.org archive mirror
* [PATCH 1/2] mm: thp: cleanup: mv alloc_hugepage to better place
@ 2013-09-02  3:45 Bob Liu
  2013-09-02  3:45 ` [PATCH 2/2] mm: thp: khugepaged: add policy for finding target node Bob Liu
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Bob Liu @ 2013-09-02  3:45 UTC (permalink / raw)
  To: akpm
  Cc: linux-mm, aarcange, kirill.shutemov, mgorman, konrad.wilk,
	davidoff, Bob Liu

Move alloc_hugepage to a better place; there is no need for a separate #ifndef CONFIG_NUMA block.

Signed-off-by: Bob Liu <bob.liu@oracle.com>
---
 mm/huge_memory.c |   14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a92012a..7448cf9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -753,14 +753,6 @@ static inline struct page *alloc_hugepage_vma(int defrag,
 			       HPAGE_PMD_ORDER, vma, haddr, nd);
 }
 
-#ifndef CONFIG_NUMA
-static inline struct page *alloc_hugepage(int defrag)
-{
-	return alloc_pages(alloc_hugepage_gfpmask(defrag, 0),
-			   HPAGE_PMD_ORDER);
-}
-#endif
-
 static bool set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
 		struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd,
 		struct page *zero_page)
@@ -2204,6 +2196,12 @@ static struct page
 	return *hpage;
 }
 #else
+static inline struct page *alloc_hugepage(int defrag)
+{
+	return alloc_pages(alloc_hugepage_gfpmask(defrag, 0),
+			   HPAGE_PMD_ORDER);
+}
+
 static struct page *khugepaged_alloc_hugepage(bool *wait)
 {
 	struct page *hpage;
-- 
1.7.10.4


* [PATCH 2/2] mm: thp: khugepaged: add policy for finding target node
  2013-09-02  3:45 [PATCH 1/2] mm: thp: cleanup: mv alloc_hugepage to better place Bob Liu
@ 2013-09-02  3:45 ` Bob Liu
  2013-09-07 15:32   ` Andrew Davidoff
                     ` (2 more replies)
  2013-09-02 10:55 ` [PATCH 1/2] mm: thp: cleanup: mv alloc_hugepage to better place Kirill A. Shutemov
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 13+ messages in thread
From: Bob Liu @ 2013-09-02  3:45 UTC (permalink / raw)
  To: akpm
  Cc: linux-mm, aarcange, kirill.shutemov, mgorman, konrad.wilk,
	davidoff, Bob Liu

Currently khugepaged tries to merge HPAGE_PMD_NR normal pages into a huge page
that is allocated from the node of the first normal page. This policy is very
rough and may affect userland applications.
Andrew Davidoff reported a related issue several days ago.

Using "numactl --interleave=all ./test" to run the testcase, but the result
wasn't not as expected.
cat /proc/2814/numa_maps:
7f50bd440000 interleave:0-3 anon=51403 dirty=51403 N0=435 N1=435 N2=435
N3=50098
The result showed that most pages came from node 3 instead of being
interleaved among nodes 0-3, which was unreasonable.

This patch adds a more sophisticated policy: while scanning the HPAGE_PMD_NR
normal pages, record which node each of them comes from, and always allocate
the huge page from the node with the highest count. If several nodes share
the highest count, interleave among them.
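
To illustrate the interleaving, the selection logic can be exercised with a
small standalone userspace sketch. It mirrors the khugepaged_find_target_node()
added below; the 4-node machine and the equal per-node counts are made up:

#include <stdio.h>

#define MAX_NUMNODES	4		/* assumed for this sketch */
#define NUMA_NO_NODE	(-1)

static int node_load[MAX_NUMNODES];
static int last_target_node = NUMA_NO_NODE;

/* same algorithm as khugepaged_find_target_node() in this patch */
static int find_target_node(void)
{
	int i, target_node = 0, max_value = 1;

	/* first node with the most normal pages hit */
	for (i = 0; i < MAX_NUMNODES; i++)
		if (node_load[i] > max_value) {
			max_value = node_load[i];
			target_node = i;
		}

	/* rotate among nodes that share the same maximum count */
	if (target_node <= last_target_node) {
		for (i = last_target_node + 1; i < MAX_NUMNODES; i++)
			if (max_value == node_load[i]) {
				target_node = i;
				break;
			}
	}

	last_target_node = target_node;
	return target_node;
}

int main(void)
{
	int i;

	/* pretend every node contributed the same number of normal pages */
	for (i = 0; i < MAX_NUMNODES; i++)
		node_load[i] = 128;

	/* successive collapses rotate across the tied nodes: prints 0 1 2 3 */
	for (i = 0; i < 4; i++)
		printf("%d ", find_target_node());
	printf("\n");
	return 0;
}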

After this patch the result was as expected:
7f78399c0000 interleave:0-3 anon=51403 dirty=51403 N0=12723 N1=12723 N2=13235
N3=12722

The simple testcase is like this:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>	/* for sleep() */

int main() {
	char *p;
	int i;
	int j;

	for (i=0; i < 200; i++) {
		p = (char *)malloc(1048576);
		printf("malloc done\n");

		if (p == 0) {
			printf("Out of memory\n");
			return 1;
		}
		for (j=0; j < 1048576; j++) {
			p[j] = 'A';
		}
		printf("touched memory\n");

		sleep(1);
	}
	printf("enter sleep\n");
	while(1) {
		sleep(100);
	}
}

Reported-by: Andrew Davidoff <davidoff@qedmf.net>
Signed-off-by: Bob Liu <bob.liu@oracle.com>
---
 mm/huge_memory.c |   50 +++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 41 insertions(+), 9 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 7448cf9..86c7f0d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2144,7 +2144,33 @@ static void khugepaged_alloc_sleep(void)
 			msecs_to_jiffies(khugepaged_alloc_sleep_millisecs));
 }
 
+static int khugepaged_node_load[MAX_NUMNODES];
 #ifdef CONFIG_NUMA
+static int last_khugepaged_target_node = NUMA_NO_NODE;
+static int khugepaged_find_target_node(void)
+{
+	int i, target_node = 0, max_value = 1;
+
+	/* find first node with most normal pages hit */
+	for (i = 0; i < MAX_NUMNODES; i++)
+		if (khugepaged_node_load[i] > max_value) {
+			max_value = khugepaged_node_load[i];
+			target_node = i;
+		}
+
+	/* do some balance if several nodes have the same hit number */
+	if (target_node <= last_khugepaged_target_node) {
+		for (i = last_khugepaged_target_node + 1; i < MAX_NUMNODES; i++)
+			if (max_value == khugepaged_node_load[i]) {
+				target_node = i;
+				break;
+			}
+	}
+
+	last_khugepaged_target_node = target_node;
+	return target_node;
+}
+
 static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
 {
 	if (IS_ERR(*hpage)) {
@@ -2178,9 +2204,8 @@ static struct page
 	 * mmap_sem in read mode is good idea also to allow greater
 	 * scalability.
 	 */
-	*hpage  = alloc_hugepage_vma(khugepaged_defrag(), vma, address,
-				      node, __GFP_OTHER_NODE);
-
+	*hpage = alloc_pages_exact_node(node, alloc_hugepage_gfpmask(
+			khugepaged_defrag(), __GFP_OTHER_NODE), HPAGE_PMD_ORDER);
 	/*
 	 * After allocating the hugepage, release the mmap_sem read lock in
 	 * preparation for taking it in write mode.
@@ -2196,6 +2221,11 @@ static struct page
 	return *hpage;
 }
 #else
+static int khugepaged_find_target_node(void)
+{
+	return 0;
+}
+
 static inline struct page *alloc_hugepage(int defrag)
 {
 	return alloc_pages(alloc_hugepage_gfpmask(defrag, 0),
@@ -2405,6 +2435,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 	if (pmd_trans_huge(*pmd))
 		goto out;
 
+	memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
 	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
 	for (_address = address, _pte = pte; _pte < pte+HPAGE_PMD_NR;
 	     _pte++, _address += PAGE_SIZE) {
@@ -2421,12 +2452,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 		if (unlikely(!page))
 			goto out_unmap;
 		/*
-		 * Chose the node of the first page. This could
-		 * be more sophisticated and look at more pages,
-		 * but isn't for now.
+		 * Chose the node of most normal pages hit, record this
+		 * informaction to khugepaged_node_load[]
 		 */
-		if (node == NUMA_NO_NODE)
-			node = page_to_nid(page);
+		node = page_to_nid(page);
+		khugepaged_node_load[node]++;
 		VM_BUG_ON(PageCompound(page));
 		if (!PageLRU(page) || PageLocked(page) || !PageAnon(page))
 			goto out_unmap;
@@ -2441,9 +2471,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 		ret = 1;
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
-	if (ret)
+	if (ret) {
+		node = khugepaged_find_target_node();
 		/* collapse_huge_page will return with the mmap_sem released */
 		collapse_huge_page(mm, address, hpage, vma, node);
+	}
 out:
 	return ret;
 }
-- 
1.7.10.4


* RE: [PATCH 1/2] mm: thp: cleanup: mv alloc_hugepage to better place
  2013-09-02  3:45 [PATCH 1/2] mm: thp: cleanup: mv alloc_hugepage to better place Bob Liu
  2013-09-02  3:45 ` [PATCH 2/2] mm: thp: khugepaged: add policy for finding target node Bob Liu
@ 2013-09-02 10:55 ` Kirill A. Shutemov
  2013-09-07 15:31 ` Andrew Davidoff
  2013-09-10  1:28 ` Yasuaki Ishimatsu
  3 siblings, 0 replies; 13+ messages in thread
From: Kirill A. Shutemov @ 2013-09-02 10:55 UTC (permalink / raw)
  To: Bob Liu
  Cc: akpm, linux-mm, aarcange, kirill.shutemov, mgorman, konrad.wilk,
	davidoff, Bob Liu

Bob Liu wrote:
> Move alloc_hugepage to better place, no need for a seperate #ifndef CONFIG_NUMA
> 
> Signed-off-by: Bob Liu <bob.liu@oracle.com>

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
 Kirill A. Shutemov


* Re: [PATCH 1/2] mm: thp: cleanup: mv alloc_hugepage to better place
  2013-09-02  3:45 [PATCH 1/2] mm: thp: cleanup: mv alloc_hugepage to better place Bob Liu
  2013-09-02  3:45 ` [PATCH 2/2] mm: thp: khugepaged: add policy for finding target node Bob Liu
  2013-09-02 10:55 ` [PATCH 1/2] mm: thp: cleanup: mv alloc_hugepage to better place Kirill A. Shutemov
@ 2013-09-07 15:31 ` Andrew Davidoff
  2013-09-10  1:28 ` Yasuaki Ishimatsu
  3 siblings, 0 replies; 13+ messages in thread
From: Andrew Davidoff @ 2013-09-07 15:31 UTC (permalink / raw)
  To: Bob Liu
  Cc: akpm, linux-mm, aarcange, kirill.shutemov, mgorman, konrad.wilk,
	Bob Liu

On Sun, Sep 1, 2013 at 11:45 PM, Bob Liu <lliubbo@gmail.com> wrote:
> Move alloc_hugepage to better place, no need for a seperate #ifndef CONFIG_NUMA
>
> Signed-off-by: Bob Liu <bob.liu@oracle.com>

Tested-by: Andrew Davidoff <davidoff@qedmf.net>

> ---
>  mm/huge_memory.c |   14 ++++++--------
>  1 file changed, 6 insertions(+), 8 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index a92012a..7448cf9 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -753,14 +753,6 @@ static inline struct page *alloc_hugepage_vma(int defrag,
>                                HPAGE_PMD_ORDER, vma, haddr, nd);
>  }
>
> -#ifndef CONFIG_NUMA
> -static inline struct page *alloc_hugepage(int defrag)
> -{
> -       return alloc_pages(alloc_hugepage_gfpmask(defrag, 0),
> -                          HPAGE_PMD_ORDER);
> -}
> -#endif
> -
>  static bool set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
>                 struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd,
>                 struct page *zero_page)
> @@ -2204,6 +2196,12 @@ static struct page
>         return *hpage;
>  }
>  #else
> +static inline struct page *alloc_hugepage(int defrag)
> +{
> +       return alloc_pages(alloc_hugepage_gfpmask(defrag, 0),
> +                          HPAGE_PMD_ORDER);
> +}
> +
>  static struct page *khugepaged_alloc_hugepage(bool *wait)
>  {
>         struct page *hpage;
> --
> 1.7.10.4
>


* Re: [PATCH 2/2] mm: thp: khugepaged: add policy for finding target node
  2013-09-02  3:45 ` [PATCH 2/2] mm: thp: khugepaged: add policy for finding target node Bob Liu
@ 2013-09-07 15:32   ` Andrew Davidoff
  2013-09-10  0:45   ` Yasuaki Ishimatsu
  2013-09-10  2:51   ` Yasuaki Ishimatsu
  2 siblings, 0 replies; 13+ messages in thread
From: Andrew Davidoff @ 2013-09-07 15:32 UTC (permalink / raw)
  To: Bob Liu
  Cc: akpm, linux-mm, aarcange, kirill.shutemov, mgorman, konrad.wilk,
	Bob Liu

On Sun, Sep 1, 2013 at 11:45 PM, Bob Liu <lliubbo@gmail.com> wrote:
>
> Reported-by: Andrew Davidoff <davidoff@qedmf.net>
> Signed-off-by: Bob Liu <bob.liu@oracle.com>

Tested-by: Andrew Davidoff <davidoff@qedmf.net>

> ---
>  mm/huge_memory.c |   50 +++++++++++++++++++++++++++++++++++++++++---------
>  1 file changed, 41 insertions(+), 9 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 7448cf9..86c7f0d 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2144,7 +2144,33 @@ static void khugepaged_alloc_sleep(void)
>                         msecs_to_jiffies(khugepaged_alloc_sleep_millisecs));
>  }
>
> +static int khugepaged_node_load[MAX_NUMNODES];
>  #ifdef CONFIG_NUMA
> +static int last_khugepaged_target_node = NUMA_NO_NODE;
> +static int khugepaged_find_target_node(void)
> +{
> +       int i, target_node = 0, max_value = 1;
> +
> +       /* find first node with most normal pages hit */
> +       for (i = 0; i < MAX_NUMNODES; i++)
> +               if (khugepaged_node_load[i] > max_value) {
> +                       max_value = khugepaged_node_load[i];
> +                       target_node = i;
> +               }
> +
> +       /* do some balance if several nodes have the same hit number */
> +       if (target_node <= last_khugepaged_target_node) {
> +               for (i = last_khugepaged_target_node + 1; i < MAX_NUMNODES; i++)
> +                       if (max_value == khugepaged_node_load[i]) {
> +                               target_node = i;
> +                               break;
> +                       }
> +       }
> +
> +       last_khugepaged_target_node = target_node;
> +       return target_node;
> +}
> +
>  static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
>  {
>         if (IS_ERR(*hpage)) {
> @@ -2178,9 +2204,8 @@ static struct page
>          * mmap_sem in read mode is good idea also to allow greater
>          * scalability.
>          */
> -       *hpage  = alloc_hugepage_vma(khugepaged_defrag(), vma, address,
> -                                     node, __GFP_OTHER_NODE);
> -
> +       *hpage = alloc_pages_exact_node(node, alloc_hugepage_gfpmask(
> +                       khugepaged_defrag(), __GFP_OTHER_NODE), HPAGE_PMD_ORDER);
>         /*
>          * After allocating the hugepage, release the mmap_sem read lock in
>          * preparation for taking it in write mode.
> @@ -2196,6 +2221,11 @@ static struct page
>         return *hpage;
>  }
>  #else
> +static int khugepaged_find_target_node(void)
> +{
> +       return 0;
> +}
> +
>  static inline struct page *alloc_hugepage(int defrag)
>  {
>         return alloc_pages(alloc_hugepage_gfpmask(defrag, 0),
> @@ -2405,6 +2435,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>         if (pmd_trans_huge(*pmd))
>                 goto out;
>
> +       memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
>         pte = pte_offset_map_lock(mm, pmd, address, &ptl);
>         for (_address = address, _pte = pte; _pte < pte+HPAGE_PMD_NR;
>              _pte++, _address += PAGE_SIZE) {
> @@ -2421,12 +2452,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>                 if (unlikely(!page))
>                         goto out_unmap;
>                 /*
> -                * Chose the node of the first page. This could
> -                * be more sophisticated and look at more pages,
> -                * but isn't for now.
> +                * Chose the node of most normal pages hit, record this
> +                * informaction to khugepaged_node_load[]
>                  */
> -               if (node == NUMA_NO_NODE)
> -                       node = page_to_nid(page);
> +               node = page_to_nid(page);
> +               khugepaged_node_load[node]++;
>                 VM_BUG_ON(PageCompound(page));
>                 if (!PageLRU(page) || PageLocked(page) || !PageAnon(page))
>                         goto out_unmap;
> @@ -2441,9 +2471,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>                 ret = 1;
>  out_unmap:
>         pte_unmap_unlock(pte, ptl);
> -       if (ret)
> +       if (ret) {
> +               node = khugepaged_find_target_node();
>                 /* collapse_huge_page will return with the mmap_sem released */
>                 collapse_huge_page(mm, address, hpage, vma, node);
> +       }
>  out:
>         return ret;
>  }
> --
> 1.7.10.4
>


* Re: [PATCH 2/2] mm: thp: khugepaged: add policy for finding target node
  2013-09-02  3:45 ` [PATCH 2/2] mm: thp: khugepaged: add policy for finding target node Bob Liu
  2013-09-07 15:32   ` Andrew Davidoff
@ 2013-09-10  0:45   ` Yasuaki Ishimatsu
  2013-09-10  0:55     ` Wanpeng Li
  2013-09-10  0:55     ` Wanpeng Li
  2013-09-10  2:51   ` Yasuaki Ishimatsu
  2 siblings, 2 replies; 13+ messages in thread
From: Yasuaki Ishimatsu @ 2013-09-10  0:45 UTC (permalink / raw)
  To: Bob Liu
  Cc: akpm, linux-mm, aarcange, kirill.shutemov, mgorman, konrad.wilk,
	davidoff, Bob Liu

(2013/09/02 12:45), Bob Liu wrote:
> Currently khugepaged will try to merge HPAGE_PMD_NR normal pages to a huge page
> which is allocated from the node of the first normal page, this policy is very
> rough and may affect userland applications.

> Andrew Davidoff reported a related issue several days ago.

Where is the original e-mail?
I tried to find it in my mailbox, but I cannot find it.

Thanks,
Yasuaki Ishimatsu

> 
> Using "numactl --interleave=all ./test" to run the testcase, but the result
> wasn't not as expected.
> cat /proc/2814/numa_maps:
> 7f50bd440000 interleave:0-3 anon=51403 dirty=51403 N0=435 N1=435 N2=435
> N3=50098
> The end results showed that most pages are from Node3 instead of interleave
> among node0-3 which was unreasonable.
> 
> This patch adds a more complicated policy.
> When searching HPAGE_PMD_NR normal pages, record which node those pages come
> from. Alway allocate hugepage from the node with the max record. If several
> nodes have the same max record, try to interleave among them.
> 
> After this patch the result was as expected:
> 7f78399c0000 interleave:0-3 anon=51403 dirty=51403 N0=12723 N1=12723 N2=13235
> N3=12722
> 
> The simple testcase is like this:
> #include<stdio.h>
> #include<stdlib.h>
> 
> int main() {
> 	char *p;
> 	int i;
> 	int j;
> 
> 	for (i=0; i < 200; i++) {
> 		p = (char *)malloc(1048576);
> 		printf("malloc done\n");
> 
> 		if (p == 0) {
> 			printf("Out of memory\n");
> 			return 1;
> 		}
> 		for (j=0; j < 1048576; j++) {
> 			p[j] = 'A';
> 		}
> 		printf("touched memory\n");
> 
> 		sleep(1);
> 	}
> 	printf("enter sleep\n");
> 	while(1) {
> 		sleep(100);
> 	}
> }
> 
> Reported-by: Andrew Davidoff <davidoff@qedmf.net>
> Signed-off-by: Bob Liu <bob.liu@oracle.com>
> ---
>   mm/huge_memory.c |   50 +++++++++++++++++++++++++++++++++++++++++---------
>   1 file changed, 41 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 7448cf9..86c7f0d 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2144,7 +2144,33 @@ static void khugepaged_alloc_sleep(void)
>   			msecs_to_jiffies(khugepaged_alloc_sleep_millisecs));
>   }
>   
> +static int khugepaged_node_load[MAX_NUMNODES];
>   #ifdef CONFIG_NUMA
> +static int last_khugepaged_target_node = NUMA_NO_NODE;
> +static int khugepaged_find_target_node(void)
> +{
> +	int i, target_node = 0, max_value = 1;
> +
> +	/* find first node with most normal pages hit */
> +	for (i = 0; i < MAX_NUMNODES; i++)
> +		if (khugepaged_node_load[i] > max_value) {
> +			max_value = khugepaged_node_load[i];
> +			target_node = i;
> +		}
> +
> +	/* do some balance if several nodes have the same hit number */
> +	if (target_node <= last_khugepaged_target_node) {
> +		for (i = last_khugepaged_target_node + 1; i < MAX_NUMNODES; i++)
> +			if (max_value == khugepaged_node_load[i]) {
> +				target_node = i;
> +				break;
> +			}
> +	}
> +
> +	last_khugepaged_target_node = target_node;
> +	return target_node;
> +}
> +
>   static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
>   {
>   	if (IS_ERR(*hpage)) {
> @@ -2178,9 +2204,8 @@ static struct page
>   	 * mmap_sem in read mode is good idea also to allow greater
>   	 * scalability.
>   	 */
> -	*hpage  = alloc_hugepage_vma(khugepaged_defrag(), vma, address,
> -				      node, __GFP_OTHER_NODE);
> -
> +	*hpage = alloc_pages_exact_node(node, alloc_hugepage_gfpmask(
> +			khugepaged_defrag(), __GFP_OTHER_NODE), HPAGE_PMD_ORDER);
>   	/*
>   	 * After allocating the hugepage, release the mmap_sem read lock in
>   	 * preparation for taking it in write mode.
> @@ -2196,6 +2221,11 @@ static struct page
>   	return *hpage;
>   }
>   #else
> +static int khugepaged_find_target_node(void)
> +{
> +	return 0;
> +}
> +
>   static inline struct page *alloc_hugepage(int defrag)
>   {
>   	return alloc_pages(alloc_hugepage_gfpmask(defrag, 0),
> @@ -2405,6 +2435,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>   	if (pmd_trans_huge(*pmd))
>   		goto out;
>   
> +	memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
>   	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
>   	for (_address = address, _pte = pte; _pte < pte+HPAGE_PMD_NR;
>   	     _pte++, _address += PAGE_SIZE) {
> @@ -2421,12 +2452,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>   		if (unlikely(!page))
>   			goto out_unmap;
>   		/*
> -		 * Chose the node of the first page. This could
> -		 * be more sophisticated and look at more pages,
> -		 * but isn't for now.
> +		 * Chose the node of most normal pages hit, record this
> +		 * informaction to khugepaged_node_load[]
>   		 */
> -		if (node == NUMA_NO_NODE)
> -			node = page_to_nid(page);
> +		node = page_to_nid(page);
> +		khugepaged_node_load[node]++;
>   		VM_BUG_ON(PageCompound(page));
>   		if (!PageLRU(page) || PageLocked(page) || !PageAnon(page))
>   			goto out_unmap;
> @@ -2441,9 +2471,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>   		ret = 1;
>   out_unmap:
>   	pte_unmap_unlock(pte, ptl);
> -	if (ret)
> +	if (ret) {
> +		node = khugepaged_find_target_node();
>   		/* collapse_huge_page will return with the mmap_sem released */
>   		collapse_huge_page(mm, address, hpage, vma, node);
> +	}
>   out:
>   	return ret;
>   }
> 



* Re: [PATCH 2/2] mm: thp: khugepaged: add policy for finding target node
  2013-09-10  0:45   ` Yasuaki Ishimatsu
@ 2013-09-10  0:55     ` Wanpeng Li
  2013-09-10  2:19       ` Yasuaki Ishimatsu
  2013-09-10  0:55     ` Wanpeng Li
  1 sibling, 1 reply; 13+ messages in thread
From: Wanpeng Li @ 2013-09-10  0:55 UTC (permalink / raw)
  To: Yasuaki Ishimatsu
  Cc: Bob Liu, akpm, linux-mm, aarcange, kirill.shutemov, mgorman,
	konrad.wilk, davidoff, Bob Liu

On Tue, Sep 10, 2013 at 09:45:09AM +0900, Yasuaki Ishimatsu wrote:
>(2013/09/02 12:45), Bob Liu wrote:
>> Currently khugepaged will try to merge HPAGE_PMD_NR normal pages to a huge page
>> which is allocated from the node of the first normal page, this policy is very
>> rough and may affect userland applications.
>
>> Andrew Davidoff reported a related issue several days ago.
>
>Where is an original e-mail?
>I tried to find original e-mail in my mailbox. But I cannot find it.
>

http://marc.info/?l=linux-mm&m=137701470529356&w=2

>Thanks,
>Yasuaki Ishimatsu
>
>> 
>> Using "numactl --interleave=all ./test" to run the testcase, but the result
>> wasn't not as expected.
>> cat /proc/2814/numa_maps:
>> 7f50bd440000 interleave:0-3 anon=51403 dirty=51403 N0=435 N1=435 N2=435
>> N3=50098
>> The end results showed that most pages are from Node3 instead of interleave
>> among node0-3 which was unreasonable.
>> 
>> This patch adds a more complicated policy.
>> When searching HPAGE_PMD_NR normal pages, record which node those pages come
>> from. Alway allocate hugepage from the node with the max record. If several
>> nodes have the same max record, try to interleave among them.
>> 
>> After this patch the result was as expected:
>> 7f78399c0000 interleave:0-3 anon=51403 dirty=51403 N0=12723 N1=12723 N2=13235
>> N3=12722
>> 
>> The simple testcase is like this:
>> #include<stdio.h>
>> #include<stdlib.h>
>> 
>> int main() {
>> 	char *p;
>> 	int i;
>> 	int j;
>> 
>> 	for (i=0; i < 200; i++) {
>> 		p = (char *)malloc(1048576);
>> 		printf("malloc done\n");
>> 
>> 		if (p == 0) {
>> 			printf("Out of memory\n");
>> 			return 1;
>> 		}
>> 		for (j=0; j < 1048576; j++) {
>> 			p[j] = 'A';
>> 		}
>> 		printf("touched memory\n");
>> 
>> 		sleep(1);
>> 	}
>> 	printf("enter sleep\n");
>> 	while(1) {
>> 		sleep(100);
>> 	}
>> }
>> 
>> Reported-by: Andrew Davidoff <davidoff@qedmf.net>
>> Signed-off-by: Bob Liu <bob.liu@oracle.com>
>> ---
>>   mm/huge_memory.c |   50 +++++++++++++++++++++++++++++++++++++++++---------
>>   1 file changed, 41 insertions(+), 9 deletions(-)
>> 
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 7448cf9..86c7f0d 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -2144,7 +2144,33 @@ static void khugepaged_alloc_sleep(void)
>>   			msecs_to_jiffies(khugepaged_alloc_sleep_millisecs));
>>   }
>>   
>> +static int khugepaged_node_load[MAX_NUMNODES];
>>   #ifdef CONFIG_NUMA
>> +static int last_khugepaged_target_node = NUMA_NO_NODE;
>> +static int khugepaged_find_target_node(void)
>> +{
>> +	int i, target_node = 0, max_value = 1;
>> +
>> +	/* find first node with most normal pages hit */
>> +	for (i = 0; i < MAX_NUMNODES; i++)
>> +		if (khugepaged_node_load[i] > max_value) {
>> +			max_value = khugepaged_node_load[i];
>> +			target_node = i;
>> +		}
>> +
>> +	/* do some balance if several nodes have the same hit number */
>> +	if (target_node <= last_khugepaged_target_node) {
>> +		for (i = last_khugepaged_target_node + 1; i < MAX_NUMNODES; i++)
>> +			if (max_value == khugepaged_node_load[i]) {
>> +				target_node = i;
>> +				break;
>> +			}
>> +	}
>> +
>> +	last_khugepaged_target_node = target_node;
>> +	return target_node;
>> +}
>> +
>>   static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
>>   {
>>   	if (IS_ERR(*hpage)) {
>> @@ -2178,9 +2204,8 @@ static struct page
>>   	 * mmap_sem in read mode is good idea also to allow greater
>>   	 * scalability.
>>   	 */
>> -	*hpage  = alloc_hugepage_vma(khugepaged_defrag(), vma, address,
>> -				      node, __GFP_OTHER_NODE);
>> -
>> +	*hpage = alloc_pages_exact_node(node, alloc_hugepage_gfpmask(
>> +			khugepaged_defrag(), __GFP_OTHER_NODE), HPAGE_PMD_ORDER);
>>   	/*
>>   	 * After allocating the hugepage, release the mmap_sem read lock in
>>   	 * preparation for taking it in write mode.
>> @@ -2196,6 +2221,11 @@ static struct page
>>   	return *hpage;
>>   }
>>   #else
>> +static int khugepaged_find_target_node(void)
>> +{
>> +	return 0;
>> +}
>> +
>>   static inline struct page *alloc_hugepage(int defrag)
>>   {
>>   	return alloc_pages(alloc_hugepage_gfpmask(defrag, 0),
>> @@ -2405,6 +2435,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>>   	if (pmd_trans_huge(*pmd))
>>   		goto out;
>>   
>> +	memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
>>   	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
>>   	for (_address = address, _pte = pte; _pte < pte+HPAGE_PMD_NR;
>>   	     _pte++, _address += PAGE_SIZE) {
>> @@ -2421,12 +2452,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>>   		if (unlikely(!page))
>>   			goto out_unmap;
>>   		/*
>> -		 * Chose the node of the first page. This could
>> -		 * be more sophisticated and look at more pages,
>> -		 * but isn't for now.
>> +		 * Chose the node of most normal pages hit, record this
>> +		 * informaction to khugepaged_node_load[]
>>   		 */
>> -		if (node == NUMA_NO_NODE)
>> -			node = page_to_nid(page);
>> +		node = page_to_nid(page);
>> +		khugepaged_node_load[node]++;
>>   		VM_BUG_ON(PageCompound(page));
>>   		if (!PageLRU(page) || PageLocked(page) || !PageAnon(page))
>>   			goto out_unmap;
>> @@ -2441,9 +2471,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>>   		ret = 1;
>>   out_unmap:
>>   	pte_unmap_unlock(pte, ptl);
>> -	if (ret)
>> +	if (ret) {
>> +		node = khugepaged_find_target_node();
>>   		/* collapse_huge_page will return with the mmap_sem released */
>>   		collapse_huge_page(mm, address, hpage, vma, node);
>> +	}
>>   out:
>>   	return ret;
>>   }
>> 
>
>

* Re: [PATCH 2/2] mm: thp: khugepaged: add policy for finding target node
  2013-09-10  0:45   ` Yasuaki Ishimatsu
  2013-09-10  0:55     ` Wanpeng Li
@ 2013-09-10  0:55     ` Wanpeng Li
  1 sibling, 0 replies; 13+ messages in thread
From: Wanpeng Li @ 2013-09-10  0:55 UTC (permalink / raw)
  To: Yasuaki Ishimatsu
  Cc: Bob Liu, akpm, linux-mm, aarcange, kirill.shutemov, mgorman,
	konrad.wilk, davidoff, Bob Liu

On Tue, Sep 10, 2013 at 09:45:09AM +0900, Yasuaki Ishimatsu wrote:
>(2013/09/02 12:45), Bob Liu wrote:
>> Currently khugepaged will try to merge HPAGE_PMD_NR normal pages to a huge page
>> which is allocated from the node of the first normal page, this policy is very
>> rough and may affect userland applications.
>
>> Andrew Davidoff reported a related issue several days ago.
>
>Where is an original e-mail?
>I tried to find original e-mail in my mailbox. But I cannot find it.
>

http://marc.info/?l=linux-mm&m=137701470529356&w=2

>Thanks,
>Yasuaki Ishimatsu
>
>> 
>> Using "numactl --interleave=all ./test" to run the testcase, but the result
>> wasn't not as expected.
>> cat /proc/2814/numa_maps:
>> 7f50bd440000 interleave:0-3 anon=51403 dirty=51403 N0=435 N1=435 N2=435
>> N3=50098
>> The end results showed that most pages are from Node3 instead of interleave
>> among node0-3 which was unreasonable.
>> 
>> This patch adds a more complicated policy.
>> When searching HPAGE_PMD_NR normal pages, record which node those pages come
>> from. Alway allocate hugepage from the node with the max record. If several
>> nodes have the same max record, try to interleave among them.
>> 
>> After this patch the result was as expected:
>> 7f78399c0000 interleave:0-3 anon=51403 dirty=51403 N0=12723 N1=12723 N2=13235
>> N3=12722
>> 
>> The simple testcase is like this:
>> #include<stdio.h>
>> #include<stdlib.h>
>> 
>> int main() {
>> 	char *p;
>> 	int i;
>> 	int j;
>> 
>> 	for (i=0; i < 200; i++) {
>> 		p = (char *)malloc(1048576);
>> 		printf("malloc done\n");
>> 
>> 		if (p == 0) {
>> 			printf("Out of memory\n");
>> 			return 1;
>> 		}
>> 		for (j=0; j < 1048576; j++) {
>> 			p[j] = 'A';
>> 		}
>> 		printf("touched memory\n");
>> 
>> 		sleep(1);
>> 	}
>> 	printf("enter sleep\n");
>> 	while(1) {
>> 		sleep(100);
>> 	}
>> }
>> 
>> Reported-by: Andrew Davidoff <davidoff@qedmf.net>
>> Signed-off-by: Bob Liu <bob.liu@oracle.com>
>> ---
>>   mm/huge_memory.c |   50 +++++++++++++++++++++++++++++++++++++++++---------
>>   1 file changed, 41 insertions(+), 9 deletions(-)
>> 
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 7448cf9..86c7f0d 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -2144,7 +2144,33 @@ static void khugepaged_alloc_sleep(void)
>>   			msecs_to_jiffies(khugepaged_alloc_sleep_millisecs));
>>   }
>>   
>> +static int khugepaged_node_load[MAX_NUMNODES];
>>   #ifdef CONFIG_NUMA
>> +static int last_khugepaged_target_node = NUMA_NO_NODE;
>> +static int khugepaged_find_target_node(void)
>> +{
>> +	int i, target_node = 0, max_value = 1;
>> +
>> +	/* find first node with most normal pages hit */
>> +	for (i = 0; i < MAX_NUMNODES; i++)
>> +		if (khugepaged_node_load[i] > max_value) {
>> +			max_value = khugepaged_node_load[i];
>> +			target_node = i;
>> +		}
>> +
>> +	/* do some balance if several nodes have the same hit number */
>> +	if (target_node <= last_khugepaged_target_node) {
>> +		for (i = last_khugepaged_target_node + 1; i < MAX_NUMNODES; i++)
>> +			if (max_value == khugepaged_node_load[i]) {
>> +				target_node = i;
>> +				break;
>> +			}
>> +	}
>> +
>> +	last_khugepaged_target_node = target_node;
>> +	return target_node;
>> +}
>> +
>>   static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
>>   {
>>   	if (IS_ERR(*hpage)) {
>> @@ -2178,9 +2204,8 @@ static struct page
>>   	 * mmap_sem in read mode is good idea also to allow greater
>>   	 * scalability.
>>   	 */
>> -	*hpage  = alloc_hugepage_vma(khugepaged_defrag(), vma, address,
>> -				      node, __GFP_OTHER_NODE);
>> -
>> +	*hpage = alloc_pages_exact_node(node, alloc_hugepage_gfpmask(
>> +			khugepaged_defrag(), __GFP_OTHER_NODE), HPAGE_PMD_ORDER);
>>   	/*
>>   	 * After allocating the hugepage, release the mmap_sem read lock in
>>   	 * preparation for taking it in write mode.
>> @@ -2196,6 +2221,11 @@ static struct page
>>   	return *hpage;
>>   }
>>   #else
>> +static int khugepaged_find_target_node(void)
>> +{
>> +	return 0;
>> +}
>> +
>>   static inline struct page *alloc_hugepage(int defrag)
>>   {
>>   	return alloc_pages(alloc_hugepage_gfpmask(defrag, 0),
>> @@ -2405,6 +2435,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>>   	if (pmd_trans_huge(*pmd))
>>   		goto out;
>>   
>> +	memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
>>   	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
>>   	for (_address = address, _pte = pte; _pte < pte+HPAGE_PMD_NR;
>>   	     _pte++, _address += PAGE_SIZE) {
>> @@ -2421,12 +2452,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>>   		if (unlikely(!page))
>>   			goto out_unmap;
>>   		/*
>> -		 * Chose the node of the first page. This could
>> -		 * be more sophisticated and look at more pages,
>> -		 * but isn't for now.
>> +		 * Chose the node of most normal pages hit, record this
>> +		 * informaction to khugepaged_node_load[]
>>   		 */
>> -		if (node == NUMA_NO_NODE)
>> -			node = page_to_nid(page);
>> +		node = page_to_nid(page);
>> +		khugepaged_node_load[node]++;
>>   		VM_BUG_ON(PageCompound(page));
>>   		if (!PageLRU(page) || PageLocked(page) || !PageAnon(page))
>>   			goto out_unmap;
>> @@ -2441,9 +2471,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>>   		ret = 1;
>>   out_unmap:
>>   	pte_unmap_unlock(pte, ptl);
>> -	if (ret)
>> +	if (ret) {
>> +		node = khugepaged_find_target_node();
>>   		/* collapse_huge_page will return with the mmap_sem released */
>>   		collapse_huge_page(mm, address, hpage, vma, node);
>> +	}
>>   out:
>>   	return ret;
>>   }
>> 
>
>

* Re: [PATCH 1/2] mm: thp: cleanup: mv alloc_hugepage to better place
  2013-09-02  3:45 [PATCH 1/2] mm: thp: cleanup: mv alloc_hugepage to better place Bob Liu
                   ` (2 preceding siblings ...)
  2013-09-07 15:31 ` Andrew Davidoff
@ 2013-09-10  1:28 ` Yasuaki Ishimatsu
  3 siblings, 0 replies; 13+ messages in thread
From: Yasuaki Ishimatsu @ 2013-09-10  1:28 UTC (permalink / raw)
  To: Bob Liu
  Cc: akpm, linux-mm, aarcange, kirill.shutemov, mgorman, konrad.wilk,
	davidoff, Bob Liu

(2013/09/02 12:45), Bob Liu wrote:
> Move alloc_hugepage to better place, no need for a seperate #ifndef CONFIG_NUMA
> 
> Signed-off-by: Bob Liu <bob.liu@oracle.com>
> ---

Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>

Thanks,
Yasuaki Ishimatsu

>   mm/huge_memory.c |   14 ++++++--------
>   1 file changed, 6 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index a92012a..7448cf9 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -753,14 +753,6 @@ static inline struct page *alloc_hugepage_vma(int defrag,
>   			       HPAGE_PMD_ORDER, vma, haddr, nd);
>   }
>   
> -#ifndef CONFIG_NUMA
> -static inline struct page *alloc_hugepage(int defrag)
> -{
> -	return alloc_pages(alloc_hugepage_gfpmask(defrag, 0),
> -			   HPAGE_PMD_ORDER);
> -}
> -#endif
> -
>   static bool set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
>   		struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd,
>   		struct page *zero_page)
> @@ -2204,6 +2196,12 @@ static struct page
>   	return *hpage;
>   }
>   #else
> +static inline struct page *alloc_hugepage(int defrag)
> +{
> +	return alloc_pages(alloc_hugepage_gfpmask(defrag, 0),
> +			   HPAGE_PMD_ORDER);
> +}
> +
>   static struct page *khugepaged_alloc_hugepage(bool *wait)
>   {
>   	struct page *hpage;
> 



* Re: [PATCH 2/2] mm: thp: khugepaged: add policy for finding target node
  2013-09-10  0:55     ` Wanpeng Li
@ 2013-09-10  2:19       ` Yasuaki Ishimatsu
  0 siblings, 0 replies; 13+ messages in thread
From: Yasuaki Ishimatsu @ 2013-09-10  2:19 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Bob Liu, akpm, linux-mm, aarcange, kirill.shutemov, mgorman,
	konrad.wilk, davidoff, Bob Liu

(2013/09/10 9:55), Wanpeng Li wrote:
> On Tue, Sep 10, 2013 at 09:45:09AM +0900, Yasuaki Ishimatsu wrote:
>> (2013/09/02 12:45), Bob Liu wrote:
>>> Currently khugepaged will try to merge HPAGE_PMD_NR normal pages to a huge page
>>> which is allocated from the node of the first normal page, this policy is very
>>> rough and may affect userland applications.
>>
>>> Andrew Davidoff reported a related issue several days ago.
>>
>> Where is an original e-mail?
>> I tried to find original e-mail in my mailbox. But I cannot find it.
>>
>

> http://marc.info/?l=linux-mm&m=137701470529356&w=2

Thank you for pointing it out.

Thanks,
Yasuaki Ishimatsu

>
>> Thanks,
>> Yasuaki Ishimatsu
>>
>>>
>>> Using "numactl --interleave=all ./test" to run the testcase, but the result
>>> wasn't not as expected.
>>> cat /proc/2814/numa_maps:
>>> 7f50bd440000 interleave:0-3 anon=51403 dirty=51403 N0=435 N1=435 N2=435
>>> N3=50098
>>> The end results showed that most pages are from Node3 instead of interleave
>>> among node0-3 which was unreasonable.
>>>
>>> This patch adds a more complicated policy.
>>> When searching HPAGE_PMD_NR normal pages, record which node those pages come
>>> from. Alway allocate hugepage from the node with the max record. If several
>>> nodes have the same max record, try to interleave among them.
>>>
>>> After this patch the result was as expected:
>>> 7f78399c0000 interleave:0-3 anon=51403 dirty=51403 N0=12723 N1=12723 N2=13235
>>> N3=12722
>>>
>>> The simple testcase is like this:
>>> #include<stdio.h>
>>> #include<stdlib.h>
>>>
>>> int main() {
>>> 	char *p;
>>> 	int i;
>>> 	int j;
>>>
>>> 	for (i=0; i < 200; i++) {
>>> 		p = (char *)malloc(1048576);
>>> 		printf("malloc done\n");
>>>
>>> 		if (p == 0) {
>>> 			printf("Out of memory\n");
>>> 			return 1;
>>> 		}
>>> 		for (j=0; j < 1048576; j++) {
>>> 			p[j] = 'A';
>>> 		}
>>> 		printf("touched memory\n");
>>>
>>> 		sleep(1);
>>> 	}
>>> 	printf("enter sleep\n");
>>> 	while(1) {
>>> 		sleep(100);
>>> 	}
>>> }
>>>
>>> Reported-by: Andrew Davidoff <davidoff@qedmf.net>
>>> Signed-off-by: Bob Liu <bob.liu@oracle.com>
>>> ---
>>>    mm/huge_memory.c |   50 +++++++++++++++++++++++++++++++++++++++++---------
>>>    1 file changed, 41 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index 7448cf9..86c7f0d 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -2144,7 +2144,33 @@ static void khugepaged_alloc_sleep(void)
>>>    			msecs_to_jiffies(khugepaged_alloc_sleep_millisecs));
>>>    }
>>>
>>> +static int khugepaged_node_load[MAX_NUMNODES];
>>>    #ifdef CONFIG_NUMA
>>> +static int last_khugepaged_target_node = NUMA_NO_NODE;
>>> +static int khugepaged_find_target_node(void)
>>> +{
>>> +	int i, target_node = 0, max_value = 1;
>>> +
>>> +	/* find first node with most normal pages hit */
>>> +	for (i = 0; i < MAX_NUMNODES; i++)
>>> +		if (khugepaged_node_load[i] > max_value) {
>>> +			max_value = khugepaged_node_load[i];
>>> +			target_node = i;
>>> +		}
>>> +
>>> +	/* do some balance if several nodes have the same hit number */
>>> +	if (target_node <= last_khugepaged_target_node) {
>>> +		for (i = last_khugepaged_target_node + 1; i < MAX_NUMNODES; i++)
>>> +			if (max_value == khugepaged_node_load[i]) {
>>> +				target_node = i;
>>> +				break;
>>> +			}
>>> +	}
>>> +
>>> +	last_khugepaged_target_node = target_node;
>>> +	return target_node;
>>> +}
>>> +
>>>    static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
>>>    {
>>>    	if (IS_ERR(*hpage)) {
>>> @@ -2178,9 +2204,8 @@ static struct page
>>>    	 * mmap_sem in read mode is good idea also to allow greater
>>>    	 * scalability.
>>>    	 */
>>> -	*hpage  = alloc_hugepage_vma(khugepaged_defrag(), vma, address,
>>> -				      node, __GFP_OTHER_NODE);
>>> -
>>> +	*hpage = alloc_pages_exact_node(node, alloc_hugepage_gfpmask(
>>> +			khugepaged_defrag(), __GFP_OTHER_NODE), HPAGE_PMD_ORDER);
>>>    	/*
>>>    	 * After allocating the hugepage, release the mmap_sem read lock in
>>>    	 * preparation for taking it in write mode.
>>> @@ -2196,6 +2221,11 @@ static struct page
>>>    	return *hpage;
>>>    }
>>>    #else
>>> +static int khugepaged_find_target_node(void)
>>> +{
>>> +	return 0;
>>> +}
>>> +
>>>    static inline struct page *alloc_hugepage(int defrag)
>>>    {
>>>    	return alloc_pages(alloc_hugepage_gfpmask(defrag, 0),
>>> @@ -2405,6 +2435,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>>>    	if (pmd_trans_huge(*pmd))
>>>    		goto out;
>>>
>>> +	memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
>>>    	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
>>>    	for (_address = address, _pte = pte; _pte < pte+HPAGE_PMD_NR;
>>>    	     _pte++, _address += PAGE_SIZE) {
>>> @@ -2421,12 +2452,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>>>    		if (unlikely(!page))
>>>    			goto out_unmap;
>>>    		/*
>>> -		 * Chose the node of the first page. This could
>>> -		 * be more sophisticated and look at more pages,
>>> -		 * but isn't for now.
>>> +		 * Chose the node of most normal pages hit, record this
>>> +		 * informaction to khugepaged_node_load[]
>>>    		 */
>>> -		if (node == NUMA_NO_NODE)
>>> -			node = page_to_nid(page);
>>> +		node = page_to_nid(page);
>>> +		khugepaged_node_load[node]++;
>>>    		VM_BUG_ON(PageCompound(page));
>>>    		if (!PageLRU(page) || PageLocked(page) || !PageAnon(page))
>>>    			goto out_unmap;
>>> @@ -2441,9 +2471,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>>>    		ret = 1;
>>>    out_unmap:
>>>    	pte_unmap_unlock(pte, ptl);
>>> -	if (ret)
>>> +	if (ret) {
>>> +		node = khugepaged_find_target_node();
>>>    		/* collapse_huge_page will return with the mmap_sem released */
>>>    		collapse_huge_page(mm, address, hpage, vma, node);
>>> +	}
>>>    out:
>>>    	return ret;
>>>    }
>>>
>>
>>

* Re: [PATCH 2/2] mm: thp: khugepaged: add policy for finding target node
  2013-09-02  3:45 ` [PATCH 2/2] mm: thp: khugepaged: add policy for finding target node Bob Liu
  2013-09-07 15:32   ` Andrew Davidoff
  2013-09-10  0:45   ` Yasuaki Ishimatsu
@ 2013-09-10  2:51   ` Yasuaki Ishimatsu
  2013-09-10 14:28     ` Bob Liu
  2 siblings, 1 reply; 13+ messages in thread
From: Yasuaki Ishimatsu @ 2013-09-10  2:51 UTC (permalink / raw)
  To: Bob Liu
  Cc: akpm, linux-mm, aarcange, kirill.shutemov, mgorman, konrad.wilk,
	davidoff, Bob Liu

(2013/09/02 12:45), Bob Liu wrote:
> Currently khugepaged will try to merge HPAGE_PMD_NR normal pages to a huge page
> which is allocated from the node of the first normal page, this policy is very
> rough and may affect userland applications.
> Andrew Davidoff reported a related issue several days ago.
> 
> Using "numactl --interleave=all ./test" to run the testcase, but the result
> wasn't not as expected.
> cat /proc/2814/numa_maps:
> 7f50bd440000 interleave:0-3 anon=51403 dirty=51403 N0=435 N1=435 N2=435
> N3=50098
> The end results showed that most pages are from Node3 instead of interleave
> among node0-3 which was unreasonable.
> 

> This patch adds a more complicated policy.
> When searching HPAGE_PMD_NR normal pages, record which node those pages come
> from. Alway allocate hugepage from the node with the max record. If several
> nodes have the same max record, try to interleave among them.

I don't understand this policy. Why does this patch allocate the huge page
from the node with the max record?

> 
> After this patch the result was as expected:
> 7f78399c0000 interleave:0-3 anon=51403 dirty=51403 N0=12723 N1=12723 N2=13235
> N3=12722
> 
> The simple testcase is like this:
> #include<stdio.h>
> #include<stdlib.h>
> 
> int main() {
> 	char *p;
> 	int i;
> 	int j;
> 
> 	for (i=0; i < 200; i++) {
> 		p = (char *)malloc(1048576);
> 		printf("malloc done\n");
> 
> 		if (p == 0) {
> 			printf("Out of memory\n");
> 			return 1;
> 		}
> 		for (j=0; j < 1048576; j++) {
> 			p[j] = 'A';
> 		}
> 		printf("touched memory\n");
> 
> 		sleep(1);
> 	}
> 	printf("enter sleep\n");
> 	while(1) {
> 		sleep(100);
> 	}
> }
> 
> Reported-by: Andrew Davidoff <davidoff@qedmf.net>
> Signed-off-by: Bob Liu <bob.liu@oracle.com>
> ---
>   mm/huge_memory.c |   50 +++++++++++++++++++++++++++++++++++++++++---------
>   1 file changed, 41 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 7448cf9..86c7f0d 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2144,7 +2144,33 @@ static void khugepaged_alloc_sleep(void)
>   			msecs_to_jiffies(khugepaged_alloc_sleep_millisecs));
>   }
>   
> +static int khugepaged_node_load[MAX_NUMNODES];
>   #ifdef CONFIG_NUMA
> +static int last_khugepaged_target_node = NUMA_NO_NODE;
> +static int khugepaged_find_target_node(void)
> +{

> +	int i, target_node = 0, max_value = 1;

i is used as a node id, so please use node or nid instead of i.

> +

> +	/* find first node with most normal pages hit */
> +	for (i = 0; i < MAX_NUMNODES; i++)
> +		if (khugepaged_node_load[i] > max_value) {
> +			max_value = khugepaged_node_load[i];
> +			target_node = i;
> +		}

khugepaged_node_load[] is initialized to 0 and max_value is initialized
to 1, so this loop does not work until khugepaged_node_load[i] reaches
2 or more. How about initializing max_value to 0?


> +
> +	/* do some balance if several nodes have the same hit number */
> +	if (target_node <= last_khugepaged_target_node) {
> +		for (i = last_khugepaged_target_node + 1; i < MAX_NUMNODES; i++)
> +			if (max_value == khugepaged_node_load[i]) {
> +				target_node = i;
> +				break;
> +			}
> +	}
> +
> +	last_khugepaged_target_node = target_node;
> +	return target_node;
> +}
> +
>   static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
>   {
>   	if (IS_ERR(*hpage)) {
> @@ -2178,9 +2204,8 @@ static struct page
>   	 * mmap_sem in read mode is good idea also to allow greater
>   	 * scalability.
>   	 */

> -	*hpage  = alloc_hugepage_vma(khugepaged_defrag(), vma, address,
> -				      node, __GFP_OTHER_NODE);
> -
> +	*hpage = alloc_pages_exact_node(node, alloc_hugepage_gfpmask(
> +			khugepaged_defrag(), __GFP_OTHER_NODE), HPAGE_PMD_ORDER);

Why do you use alloc_pages_exact_node()?

Thanks,
Yasuaki Ishimatsu

>   	/*
>   	 * After allocating the hugepage, release the mmap_sem read lock in
>   	 * preparation for taking it in write mode.
> @@ -2196,6 +2221,11 @@ static struct page
>   	return *hpage;
>   }
>   #else
> +static int khugepaged_find_target_node(void)
> +{
> +	return 0;
> +}
> +
>   static inline struct page *alloc_hugepage(int defrag)
>   {
>   	return alloc_pages(alloc_hugepage_gfpmask(defrag, 0),
> @@ -2405,6 +2435,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>   	if (pmd_trans_huge(*pmd))
>   		goto out;
>   
> +	memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
>   	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
>   	for (_address = address, _pte = pte; _pte < pte+HPAGE_PMD_NR;
>   	     _pte++, _address += PAGE_SIZE) {
> @@ -2421,12 +2452,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>   		if (unlikely(!page))
>   			goto out_unmap;
>   		/*
> -		 * Chose the node of the first page. This could
> -		 * be more sophisticated and look at more pages,
> -		 * but isn't for now.
> +		 * Chose the node of most normal pages hit, record this
> +		 * informaction to khugepaged_node_load[]
>   		 */
> -		if (node == NUMA_NO_NODE)
> -			node = page_to_nid(page);
> +		node = page_to_nid(page);
> +		khugepaged_node_load[node]++;
>   		VM_BUG_ON(PageCompound(page));
>   		if (!PageLRU(page) || PageLocked(page) || !PageAnon(page))
>   			goto out_unmap;
> @@ -2441,9 +2471,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>   		ret = 1;
>   out_unmap:
>   	pte_unmap_unlock(pte, ptl);
> -	if (ret)
> +	if (ret) {
> +		node = khugepaged_find_target_node();
>   		/* collapse_huge_page will return with the mmap_sem released */
>   		collapse_huge_page(mm, address, hpage, vma, node);
> +	}
>   out:
>   	return ret;
>   }
> 



* Re: [PATCH 2/2] mm: thp: khugepaged: add policy for finding target node
  2013-09-10  2:51   ` Yasuaki Ishimatsu
@ 2013-09-10 14:28     ` Bob Liu
  2013-09-11  2:23       ` Yasuaki Ishimatsu
  0 siblings, 1 reply; 13+ messages in thread
From: Bob Liu @ 2013-09-10 14:28 UTC (permalink / raw)
  To: Yasuaki Ishimatsu
  Cc: Bob Liu, akpm, linux-mm, aarcange, kirill.shutemov, mgorman,
	konrad.wilk, davidoff

Hi Yasuaki,

On 09/10/2013 10:51 AM, Yasuaki Ishimatsu wrote:
> (2013/09/02 12:45), Bob Liu wrote:
>> Currently khugepaged will try to merge HPAGE_PMD_NR normal pages to a huge page
>> which is allocated from the node of the first normal page, this policy is very
>> rough and may affect userland applications.
>> Andrew Davidoff reported a related issue several days ago.
>>
>> Using "numactl --interleave=all ./test" to run the testcase, but the result
>> wasn't not as expected.
>> cat /proc/2814/numa_maps:
>> 7f50bd440000 interleave:0-3 anon=51403 dirty=51403 N0=435 N1=435 N2=435
>> N3=50098
>> The end results showed that most pages are from Node3 instead of interleave
>> among node0-3 which was unreasonable.
>>
> 
>> This patch adds a more complicated policy.
>> When searching HPAGE_PMD_NR normal pages, record which node those pages come
>> from. Alway allocate hugepage from the node with the max record. If several
>> nodes have the same max record, try to interleave among them.
> 
> I don't understand this policy. Why does ths patch allocate hugepage from the
> node with the max record?
> 

Thanks for your review.

The reason is that khugepaged always allocates huge pages from the node of the
first scanned normal page, which may break the original page balancing
among all nodes.
Consider the case where the first scanned normal page is allocated
from node A while most of the other scanned normal pages are allocated from
node B or C.
khugepaged will still always allocate the huge page from node A, which
causes extra memory pressure on node A and is not what users expect.

The policy used in this patch (allocate the huge page from the node with the
max record) tries to minimize the effect on the original page balancing.

The other point is that even if normal pages are allocated from nodes A, B and
C equally, once khugepaged starts, node A alone will suffer extra memory
pressure because of the huge pages.
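
As a concrete illustration of the difference between the two policies (the
512-page split below is made up), here is a small userspace sketch:

#include <stdio.h>

#define HPAGE_PMD_NR	512
#define MAX_NUMNODES	4	/* assumed for this sketch */

int main(void)
{
	/* hypothetical scan of one PMD range: the node of each normal page */
	int page_nid[HPAGE_PMD_NR];
	int node_load[MAX_NUMNODES] = { 0 };
	int i, max_node = 0;

	/* say the first page happens to sit on node 0 and the rest on node 2 */
	page_nid[0] = 0;
	for (i = 1; i < HPAGE_PMD_NR; i++)
		page_nid[i] = 2;

	for (i = 0; i < HPAGE_PMD_NR; i++)
		node_load[page_nid[i]]++;

	for (i = 1; i < MAX_NUMNODES; i++)
		if (node_load[i] > node_load[max_node])
			max_node = i;

	/* old policy: node of the first scanned page; new: node with max load */
	printf("old policy collapses onto node %d\n", page_nid[0]);
	printf("new policy collapses onto node %d (%d of %d pages)\n",
	       max_node, node_load[max_node], HPAGE_PMD_NR);
	return 0;
}

It prints that the old policy would collapse onto node 0 while the new one
picks node 2, which holds 511 of the 512 pages.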

>>
>> After this patch the result was as expected:
>> 7f78399c0000 interleave:0-3 anon=51403 dirty=51403 N0=12723 N1=12723 N2=13235
>> N3=12722
>>
>> The simple testcase is like this:
>> #include<stdio.h>
>> #include<stdlib.h>
>>
>> int main() {
>> 	char *p;
>> 	int i;
>> 	int j;
>>
>> 	for (i=0; i < 200; i++) {
>> 		p = (char *)malloc(1048576);
>> 		printf("malloc done\n");
>>
>> 		if (p == 0) {
>> 			printf("Out of memory\n");
>> 			return 1;
>> 		}
>> 		for (j=0; j < 1048576; j++) {
>> 			p[j] = 'A';
>> 		}
>> 		printf("touched memory\n");
>>
>> 		sleep(1);
>> 	}
>> 	printf("enter sleep\n");
>> 	while(1) {
>> 		sleep(100);
>> 	}
>> }
>>
>> Reported-by: Andrew Davidoff <davidoff@qedmf.net>
>> Signed-off-by: Bob Liu <bob.liu@oracle.com>
>> ---
>>   mm/huge_memory.c |   50 +++++++++++++++++++++++++++++++++++++++++---------
>>   1 file changed, 41 insertions(+), 9 deletions(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 7448cf9..86c7f0d 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -2144,7 +2144,33 @@ static void khugepaged_alloc_sleep(void)
>>   			msecs_to_jiffies(khugepaged_alloc_sleep_millisecs));
>>   }
>>   
>> +static int khugepaged_node_load[MAX_NUMNODES];
>>   #ifdef CONFIG_NUMA
>> +static int last_khugepaged_target_node = NUMA_NO_NODE;
>> +static int khugepaged_find_target_node(void)
>> +{
> 
>> +	int i, target_node = 0, max_value = 1;
> 
> i is used as node ids. So please use node or nid instead of i.
> 

Sure!

>> +
> 
>> +	/* find first node with most normal pages hit */
>> +	for (i = 0; i < MAX_NUMNODES; i++)
>> +		if (khugepaged_node_load[i] > max_value) {
>> +			max_value = khugepaged_node_load[i];
>> +			target_node = i;
>> +		}
> 
> khugepaged_node_load[] is initialized as 0 and max_value is initialized
> as 1. So this loop does not work well until khugepage_node_load[] is set
> to 2 or more. How about initializing max_value to 0?
> 

Sure!

> 
>> +
>> +	/* do some balance if several nodes have the same hit number */
>> +	if (target_node <= last_khugepaged_target_node) {
>> +		for (i = last_khugepaged_target_node + 1; i < MAX_NUMNODES; i++)
>> +			if (max_value == khugepaged_node_load[i]) {
>> +				target_node = i;
>> +				break;
>> +			}
>> +	}
>> +
>> +	last_khugepaged_target_node = target_node;
>> +	return target_node;
>> +}
>> +
>>   static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
>>   {
>>   	if (IS_ERR(*hpage)) {
>> @@ -2178,9 +2204,8 @@ static struct page
>>   	 * mmap_sem in read mode is good idea also to allow greater
>>   	 * scalability.
>>   	 */
> 
>> -	*hpage  = alloc_hugepage_vma(khugepaged_defrag(), vma, address,
>> -				      node, __GFP_OTHER_NODE);
>> -
>> +	*hpage = alloc_pages_exact_node(node, alloc_hugepage_gfpmask(
>> +			khugepaged_defrag(), __GFP_OTHER_NODE), HPAGE_PMD_ORDER);
> 
> Why do you use alloc_pages_exact_node()?
> 

alloc_hugepage_vma() calls alloc_pages_vma(), which applies a mempolicy,
but sometimes that mempolicy is not the one we want for khugepaged.

In Andrew's example, he set his application's mempolicy to MPOL_INTERLEAVE.
khugepaged doesn't know this, so khugepaged's own mempolicy (MPOL_PREFERRED)
is used when alloc_pages_vma() is called in the khugepaged thread.
As a result, all huge pages are allocated from node A, which doesn't
match what userland requested.
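
For reference, this is roughly what "numactl --interleave=all" arranges for
the application before exec'ing it: the interleave policy is attached to the
application's task, which khugepaged does not share. A small userspace sketch
(it assumes a 4-node machine and libnuma's <numaif.h>; build with -lnuma):

#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
	unsigned long nodemask = 0xfUL;	/* nodes 0-3, assumed */
	char *p;

	/* set the calling task's policy to interleave, as numactl does */
	if (set_mempolicy(MPOL_INTERLEAVE, &nodemask, sizeof(nodemask) * 8)) {
		perror("set_mempolicy");
		return 1;
	}

	/*
	 * Anonymous memory faulted in from now on is spread across nodes 0-3;
	 * khugepaged, however, runs with its own task policy, which is why
	 * this patch passes the target node to the allocator explicitly.
	 */
	p = malloc(16 << 20);
	if (!p)
		return 1;
	memset(p, 'A', 16 << 20);
	printf("touched 16MB under MPOL_INTERLEAVE\n");
	return 0;
}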

Thanks,
-Bob


* Re: [PATCH 2/2] mm: thp: khugepaged: add policy for finding target node
  2013-09-10 14:28     ` Bob Liu
@ 2013-09-11  2:23       ` Yasuaki Ishimatsu
  0 siblings, 0 replies; 13+ messages in thread
From: Yasuaki Ishimatsu @ 2013-09-11  2:23 UTC (permalink / raw)
  To: Bob Liu
  Cc: Bob Liu, akpm, linux-mm, aarcange, kirill.shutemov, mgorman,
	konrad.wilk, davidoff

Hi Bob,

(2013/09/10 23:28), Bob Liu wrote:
> Hi Yasuaki,
> 
> On 09/10/2013 10:51 AM, Yasuaki Ishimatsu wrote:
>> (2013/09/02 12:45), Bob Liu wrote:
>>> Currently khugepaged will try to merge HPAGE_PMD_NR normal pages to a huge page
>>> which is allocated from the node of the first normal page, this policy is very
>>> rough and may affect userland applications.
>>> Andrew Davidoff reported a related issue several days ago.
>>>
>>> Using "numactl --interleave=all ./test" to run the testcase, but the result
>>> wasn't not as expected.
>>> cat /proc/2814/numa_maps:
>>> 7f50bd440000 interleave:0-3 anon=51403 dirty=51403 N0=435 N1=435 N2=435
>>> N3=50098
>>> The end results showed that most pages are from Node3 instead of interleave
>>> among node0-3 which was unreasonable.
>>>
>>
>>> This patch adds a more complicated policy.
>>> When searching HPAGE_PMD_NR normal pages, record which node those pages come
>>> from. Alway allocate hugepage from the node with the max record. If several
>>> nodes have the same max record, try to interleave among them.
>>
>> I don't understand this policy. Why does this patch allocate the hugepage from the
>> node with the max record?
>>
> 
> Thanks for your review.
> 
> The reason is that khugepaged always allocates huge pages from the node
> of the first scanned normal page, which may break the original page
> balancing among all nodes.
> Consider the case where the first scanned normal page is allocated from
> node A, while most of the other scanned normal pages come from node B
> or C.
> khugepaged will still allocate the huge page from node A, which puts
> extra memory pressure on node A and is not what the user expected.
> 
> The policy used in this patch (allocate the huge page from the node
> with the maximum count) tries to minimize the effect on the original
> page balancing.
> 
> The other point is that even if normal pages are allocated from nodes
> A, B and C equally, once khugepaged starts, node A will still suffer
> extra memory pressure because of the huge pages.

Thank you for your explanation.
I understood it.

> 
>>>
>>> After this patch the result was as expected:
>>> 7f78399c0000 interleave:0-3 anon=51403 dirty=51403 N0=12723 N1=12723 N2=13235
>>> N3=12722
>>>
>>> The simple testcase is like this:
>>> #include<stdio.h>
>>> #include<stdlib.h>
>>>
>>> int main() {
>>> 	char *p;
>>> 	int i;
>>> 	int j;
>>>
>>> 	for (i=0; i < 200; i++) {
>>> 		p = (char *)malloc(1048576);
>>> 		printf("malloc done\n");
>>>
>>> 		if (p == 0) {
>>> 			printf("Out of memory\n");
>>> 			return 1;
>>> 		}
>>> 		for (j=0; j < 1048576; j++) {
>>> 			p[j] = 'A';
>>> 		}
>>> 		printf("touched memory\n");
>>>
>>> 		sleep(1);
>>> 	}
>>> 	printf("enter sleep\n");
>>> 	while(1) {
>>> 		sleep(100);
>>> 	}
>>> }
>>>
>>> Reported-by: Andrew Davidoff <davidoff@qedmf.net>
>>> Signed-off-by: Bob Liu <bob.liu@oracle.com>
>>> ---
>>>    mm/huge_memory.c |   50 +++++++++++++++++++++++++++++++++++++++++---------
>>>    1 file changed, 41 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index 7448cf9..86c7f0d 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -2144,7 +2144,33 @@ static void khugepaged_alloc_sleep(void)
>>>    			msecs_to_jiffies(khugepaged_alloc_sleep_millisecs));
>>>    }
>>>    
>>> +static int khugepaged_node_load[MAX_NUMNODES];
>>>    #ifdef CONFIG_NUMA
>>> +static int last_khugepaged_target_node = NUMA_NO_NODE;
>>> +static int khugepaged_find_target_node(void)
>>> +{
>>
>>> +	int i, target_node = 0, max_value = 1;
>>
>> i is used as a node id. So please use node or nid instead of i.
>>
> 
> Sure!
> 
>>> +
>>
>>> +	/* find first node with most normal pages hit */
>>> +	for (i = 0; i < MAX_NUMNODES; i++)
>>> +		if (khugepaged_node_load[i] > max_value) {
>>> +			max_value = khugepaged_node_load[i];
>>> +			target_node = i;
>>> +		}
>>
>> khugepaged_node_load[] is initialized as 0 and max_value is initialized
>> as 1. So this loop does not work well until khugepaged_node_load[] is set
>> to 2 or more. How about initializing max_value to 0?
>>
> 
> Sure!
> 
>>
>>> +
>>> +	/* do some balance if several nodes have the same hit number */
>>> +	if (target_node <= last_khugepaged_target_node) {
>>> +		for (i = last_khugepaged_target_node + 1; i < MAX_NUMNODES; i++)
>>> +			if (max_value == khugepaged_node_load[i]) {
>>> +				target_node = i;
>>> +				break;
>>> +			}
>>> +	}
>>> +
>>> +	last_khugepaged_target_node = target_node;
>>> +	return target_node;
>>> +}
>>> +
>>>    static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
>>>    {
>>>    	if (IS_ERR(*hpage)) {
>>> @@ -2178,9 +2204,8 @@ static struct page
>>>    	 * mmap_sem in read mode is good idea also to allow greater
>>>    	 * scalability.
>>>    	 */
>>
>>> -	*hpage  = alloc_hugepage_vma(khugepaged_defrag(), vma, address,
>>> -				      node, __GFP_OTHER_NODE);
>>> -
>>> +	*hpage = alloc_pages_exact_node(node, alloc_hugepage_gfpmask(
>>> +			khugepaged_defrag(), __GFP_OTHER_NODE), HPAGE_PMD_ORDER);
>>
>> Why do you use alloc_pages_exact_node()?
>>
> 
> alloc_hugepage_vma() calls alloc_pages_vma(), which applies a mempolicy.
> But sometimes that mempolicy is not the one we want for khugepaged.
> 
> In Andrew's example, he set his application's mempolicy to MPOL_INTERLEAVE.
> khugepaged doesn't know this, so when alloc_pages_vma() is called from the
> khugepaged thread, khugepaged's own mempolicy (MPOL_PREFERRED) is used.
> As a result, all huge pages are allocated from Node A, which doesn't
> match what userland asked for.

I understood it.

Thanks,
Yasuaki Ishimatsu

> 
> Thanks,
> -Bob
> 


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2013-09-11  2:24 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-09-02  3:45 [PATCH 1/2] mm: thp: cleanup: mv alloc_hugepage to better place Bob Liu
2013-09-02  3:45 ` [PATCH 2/2] mm: thp: khugepaged: add policy for finding target node Bob Liu
2013-09-07 15:32   ` Andrew Davidoff
2013-09-10  0:45   ` Yasuaki Ishimatsu
2013-09-10  0:55     ` Wanpeng Li
2013-09-10  2:19       ` Yasuaki Ishimatsu
2013-09-10  0:55     ` Wanpeng Li
2013-09-10  2:51   ` Yasuaki Ishimatsu
2013-09-10 14:28     ` Bob Liu
2013-09-11  2:23       ` Yasuaki Ishimatsu
2013-09-02 10:55 ` [PATCH 1/2] mm: thp: cleanup: mv alloc_hugepage to better place Kirill A. Shutemov
2013-09-07 15:31 ` Andrew Davidoff
2013-09-10  1:28 ` Yasuaki Ishimatsu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).