public inbox for linux-kernel@vger.kernel.org
* [PATCH] Add nid sanity on alloc_pages_node
@ 2007-07-13  2:45 Joe Jin
  2007-07-13  5:18 ` Andrew Morton
  0 siblings, 1 reply; 37+ messages in thread
From: Joe Jin @ 2007-07-13  2:45 UTC (permalink / raw)
  To: akpm, bill.irwin; +Cc: linux-kernel

This patch adds a nid sanity check to alloc_pages_node().
When two processes change nr_hugepages on a system, alloc_fresh_huge_page()
is called.  In that function, nid is defined as a static variable, but there
is no protection of it, so if two processes call it at the same time, an
invalid nid may be passed to alloc_pages_node().

We hit this with the following script:

#!/bin/bash
while : ; do 
	echo 1000000000000 > /proc/sys/vm/nr_hugepages 
	echo 1 > /proc/sys/vm/nr_hugepages 
	echo 10000000000000000000 > /proc/sys/vm/nr_hugepages 
	echo 0 > /proc/sys/vm/nr_hugepages 
done 


Run the script in _two_ different terminals; after a short time, kernel
panic output will be printed.


Signed-off-by: Joe Jin <joe.jin@oracle.com>
---

--- linux-2.6.22/include/linux/gfp.h.orig	2007-07-12 15:06:23.000000000 +0800
+++ linux-2.6.22/include/linux/gfp.h	2007-07-12 15:02:59.000000000 +0800
@@ -133,6 +133,9 @@
 	/* Unknown node is current node */
 	if (nid < 0)
 		nid = numa_node_id();
+	
+	if (unlikely(nid == MAX_NUMNODES))
+		nid = first_node(node_online_map);
 
 	return __alloc_pages(gfp_mask, order,
 		NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_mask));

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] Add nid sanity on alloc_pages_node
  2007-07-13  2:45 [PATCH] Add nid sanity on alloc_pages_node Joe Jin
@ 2007-07-13  5:18 ` Andrew Morton
  2007-07-13  6:40   ` Joe Jin
  2007-07-17 15:04   ` Hugh Dickins
  0 siblings, 2 replies; 37+ messages in thread
From: Andrew Morton @ 2007-07-13  5:18 UTC (permalink / raw)
  To: Joe Jin; +Cc: bill.irwin, linux-kernel

On Fri, 13 Jul 2007 10:45:07 +0800 Joe Jin <joe.jin@oracle.com> wrote:

> This patch adds a nid sanity check to alloc_pages_node().
> When two processes change nr_hugepages on a system, alloc_fresh_huge_page()
> is called.  In that function, nid is defined as a static variable, but there
> is no protection of it, so if two processes call it at the same time, an
> invalid nid may be passed to alloc_pages_node().
> 
> We hit this with the following script:
> 
> #!/bin/bash
> while : ; do 
> 	echo 1000000000000 > /proc/sys/vm/nr_hugepages 
> 	echo 1 > /proc/sys/vm/nr_hugepages 
> 	echo 10000000000000000000 > /proc/sys/vm/nr_hugepages 
> 	echo 0 > /proc/sys/vm/nr_hugepages 
> done 
> 
> 
> Run the script in _two_ different terminals; after a short time, kernel
> panic output will be printed.
> 
> 
> Signed-off-by: Joe Jin <joe.jin@oracle.com>
> ---
> 
> --- linux-2.6.22/include/linux/gfp.h.orig	2007-07-12 15:06:23.000000000 +0800
> +++ linux-2.6.22/include/linux/gfp.h	2007-07-12 15:02:59.000000000 +0800
> @@ -133,6 +133,9 @@
>  	/* Unknown node is current node */
>  	if (nid < 0)
>  		nid = numa_node_id();
> +	
> +	if (unlikely(nid == MAX_NUMNODES))
> +		nid = first_node(node_online_map);
>  
>  	return __alloc_pages(gfp_mask, order,
>  		NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_mask));

alloc_pages_node() is pretty much the last place where we want to fix this:
it adds more cycles and more code to many important codepaths in the
kernel.

It'd be much better to fix the race within alloc_fresh_huge_page().  That
function is pretty pathetic.

Something like this?

--- a/mm/hugetlb.c~a
+++ a/mm/hugetlb.c
@@ -105,13 +105,20 @@ static void free_huge_page(struct page *
 
 static int alloc_fresh_huge_page(void)
 {
-	static int nid = 0;
+	static int prev_nid;
+	static DEFINE_SPINLOCK(nid_lock);
 	struct page *page;
-	page = alloc_pages_node(nid, htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
-					HUGETLB_PAGE_ORDER);
-	nid = next_node(nid, node_online_map);
+	int nid;
+
+	spin_lock(&nid_lock);
+	nid = next_node(prev_nid, node_online_map);
 	if (nid == MAX_NUMNODES)
 		nid = first_node(node_online_map);
+	prev_nid = nid;
+	spin_unlock(&nid_lock);
+
+	page = alloc_pages_node(nid, htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
+					HUGETLB_PAGE_ORDER);
 	if (page) {
 		set_compound_page_dtor(page, free_huge_page);
 		spin_lock(&hugetlb_lock);
_



* Re: [PATCH] Add nid sanity on alloc_pages_node
  2007-07-13  5:18 ` Andrew Morton
@ 2007-07-13  6:40   ` Joe Jin
  2007-07-13  6:49     ` Andrew Morton
  2007-07-17 15:04   ` Hugh Dickins
  1 sibling, 1 reply; 37+ messages in thread
From: Joe Jin @ 2007-07-13  6:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Joe Jin, bill.irwin, linux-kernel, gurudas.pai

On 2007-07-12 22:18, Andrew Morton wrote:
> On Fri, 13 Jul 2007 10:45:07 +0800 Joe Jin <joe.jin@oracle.com> wrote:
> 
> > This patch adds a nid sanity check to alloc_pages_node().
> > When two processes change nr_hugepages on a system, alloc_fresh_huge_page()
> > is called.  In that function, nid is defined as a static variable, but there
> > is no protection of it, so if two processes call it at the same time, an
> > invalid nid may be passed to alloc_pages_node().
> > 
> > We hit this with the following script:
> > 
> > #!/bin/bash
> > while : ; do 
> > 	echo 1000000000000 > /proc/sys/vm/nr_hugepages 
> > 	echo 1 > /proc/sys/vm/nr_hugepages 
> > 	echo 10000000000000000000 > /proc/sys/vm/nr_hugepages 
> > 	echo 0 > /proc/sys/vm/nr_hugepages 
> > done 
> > 
> > 
> > Run the script in _two_ different terminals; after a short time, kernel
> > panic output will be printed.
> > 
> > 
> > Signed-off-by: Joe Jin <joe.jin@oracle.com>
> > ---
> > 
> > --- linux-2.6.22/include/linux/gfp.h.orig	2007-07-12 15:06:23.000000000 +0800
> > +++ linux-2.6.22/include/linux/gfp.h	2007-07-12 15:02:59.000000000 +0800
> > @@ -133,6 +133,9 @@
> >  	/* Unknown node is current node */
> >  	if (nid < 0)
> >  		nid = numa_node_id();
> > +	
> > +	if (unlikely(nid == MAX_NUMNODES))
> > +		nid = first_node(node_online_map);
> >  
> >  	return __alloc_pages(gfp_mask, order,
> >  		NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_mask));
> 
> alloc_pages_node() is pretty much the last place where we want to fix this:
> it adds more cycles and more code to many important codepaths in the
> kernel.
> 
> It'd be much better to fix the race within alloc_fresh_huge_page().  That
> function is pretty pathetic.
> 
> Something like this?
> 
> --- a/mm/hugetlb.c~a
> +++ a/mm/hugetlb.c
> @@ -105,13 +105,20 @@ static void free_huge_page(struct page *
>  
>  static int alloc_fresh_huge_page(void)
>  {
> -	static int nid = 0;
> +	static int prev_nid;
> +	static DEFINE_SPINLOCK(nid_lock);
>  	struct page *page;
> -	page = alloc_pages_node(nid, htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
> -					HUGETLB_PAGE_ORDER);
> -	nid = next_node(nid, node_online_map);
> +	int nid;
> +
> +	spin_lock(&nid_lock);
> +	nid = next_node(prev_nid, node_online_map);
>  	if (nid == MAX_NUMNODES)
>  		nid = first_node(node_online_map);
> +	prev_nid = nid;
> +	spin_unlock(&nid_lock);
> +
> +	page = alloc_pages_node(nid, htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
> +					HUGETLB_PAGE_ORDER);
>  	if (page) {
>  		set_compound_page_dtor(page, free_huge_page);
>  		spin_lock(&hugetlb_lock);
> _
> 

The patch looks good for this bug, thanks :)

If another caller passes an invalid nid to alloc_pages_node(), __alloc_pages
will crash again.

So I think adding a sanity check for nid in alloc_pages_node() is
meaningful.

Another question: if (nid >= MAX_NUMNODES), may I set nid to 0 directly,
as in the following code?

	if (unlikely(nid >= MAX_NUMNODES))
		nid = 0;

Thanks,
Joe



* Re: [PATCH] Add nid sanity on alloc_pages_node
  2007-07-13  6:40   ` Joe Jin
@ 2007-07-13  6:49     ` Andrew Morton
  2007-07-13  6:57       ` Andrew Morton
                         ` (3 more replies)
  0 siblings, 4 replies; 37+ messages in thread
From: Andrew Morton @ 2007-07-13  6:49 UTC (permalink / raw)
  To: Joe Jin; +Cc: bill.irwin, linux-kernel, gurudas.pai

On Fri, 13 Jul 2007 14:40:04 +0800 Joe Jin <joe.jin@oracle.com> wrote:

> On 2007-07-12 22:18, Andrew Morton wrote:
> > On Fri, 13 Jul 2007 10:45:07 +0800 Joe Jin <joe.jin@oracle.com> wrote:
> > 
> > Something like this?
> > 
> > --- a/mm/hugetlb.c~a
> > +++ a/mm/hugetlb.c
> > @@ -105,13 +105,20 @@ static void free_huge_page(struct page *
> >  
> >  static int alloc_fresh_huge_page(void)
> >  {
> > -	static int nid = 0;
> > +	static int prev_nid;
> > +	static DEFINE_SPINLOCK(nid_lock);
> >  	struct page *page;
> > -	page = alloc_pages_node(nid, htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
> > -					HUGETLB_PAGE_ORDER);
> > -	nid = next_node(nid, node_online_map);
> > +	int nid;
> > +
> > +	spin_lock(&nid_lock);
> > +	nid = next_node(prev_nid, node_online_map);
> >  	if (nid == MAX_NUMNODES)
> >  		nid = first_node(node_online_map);
> > +	prev_nid = nid;
> > +	spin_unlock(&nid_lock);
> > +
> > +	page = alloc_pages_node(nid, htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
> > +					HUGETLB_PAGE_ORDER);
> >  	if (page) {
> >  		set_compound_page_dtor(page, free_huge_page);
> >  		spin_lock(&hugetlb_lock);
> > _
> > 
> 
> The patch looks good for this bug, thanks :)

If you have time, could you test it and send it back to me, please?

> If another caller passes an invalid nid to alloc_pages_node(), __alloc_pages
> will crash again.

That would be a buggy caller, so we should fix that caller.

> So I think adding a sanity check for nid in alloc_pages_node() is
> meaningful.
> 
> Another question: if (nid >= MAX_NUMNODES), may I set nid to 0 directly,
> as in the following code?
> 
> 	if (unlikely(nid >= MAX_NUMNODES))
> 		nid = 0;

if (nid > MAX_NUMNODES) then that is a bug and we should report it (doing
this via a BUG() is OK) rather than quietly covering it up.

if (nid == MAX_NUMNODES) then we should set it to
first_node(node_online_map);


* Re: [PATCH] Add nid sanity on alloc_pages_node
  2007-07-13  6:49     ` Andrew Morton
@ 2007-07-13  6:57       ` Andrew Morton
  2007-07-13  8:29         ` Paul Jackson
  2007-07-13  8:03       ` Joe Jin
                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 37+ messages in thread
From: Andrew Morton @ 2007-07-13  6:57 UTC (permalink / raw)
  To: Joe Jin, bill.irwin, linux-kernel, gurudas.pai; +Cc: Paul Jackson

On Thu, 12 Jul 2007 23:49:38 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:

> first_node(node_online_map);

Incidentally, we have a helper for this operation:

#if MAX_NUMNODES > 1
...
#define first_online_node	first_node(node_online_map)
...
#else
...
#define first_online_node	0
...
#endif

So I suppose we should use that - it should generate better code for
non-NUMA builds.

It's a pretty sadly implemented helper though.  It should be
"first_online_node()", not "first_online_node".


I'm scratching my head over that min_t in __first_node(), too.   I don't think
it's possible for find_first_bit(..., N) to return anything >N _anyway_.  And if
it does, we want to know about it.

<looks at Paul>


* Re: [PATCH] Add nid sanity on alloc_pages_node
  2007-07-13  6:49     ` Andrew Morton
  2007-07-13  6:57       ` Andrew Morton
@ 2007-07-13  8:03       ` Joe Jin
  2007-07-13  8:15         ` Andrew Morton
  2007-07-13  8:04       ` gurudas pai
  2007-07-13  8:37       ` Joe Jin
  3 siblings, 1 reply; 37+ messages in thread
From: Joe Jin @ 2007-07-13  8:03 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Joe Jin, bill.irwin, linux-kernel, gurudas.pai

> > 
> > The patch looks good for this bug, thanks :)
> 
> If you have time, could you test it and send it back to me, please?
> 

Test passed and the panic is gone.
Thanks.

> > If another caller passes an invalid nid to alloc_pages_node(), __alloc_pages
> > will crash again.
> 
> That would be a buggy caller, so we should fix that caller.
> 
> > So I think adding a sanity check for nid in alloc_pages_node() is
> > meaningful.
> > 
> > Another question: if (nid >= MAX_NUMNODES), may I set nid to 0 directly,
> > as in the following code?
> > 
> > 	if (unlikely(nid >= MAX_NUMNODES))
> > 		nid = 0;
> 
> if (nid > MAX_NUMNODES) then that is a bug and we should report it (doing
> this via a BUG() is OK) rather than quietly covering it up.
> 
> if (nid == MAX_NUMNODES) then we should set it to
> first_node(node_online_map);

OK, will do it and send you a patch :)


* Re: [PATCH] Add nid sanity on alloc_pages_node
  2007-07-13  6:49     ` Andrew Morton
  2007-07-13  6:57       ` Andrew Morton
  2007-07-13  8:03       ` Joe Jin
@ 2007-07-13  8:04       ` gurudas pai
  2007-07-13  8:19         ` Andrew Morton
  2007-07-13  8:37       ` Joe Jin
  3 siblings, 1 reply; 37+ messages in thread
From: gurudas pai @ 2007-07-13  8:04 UTC (permalink / raw)
  To: akpm; +Cc: Joe Jin, bill.irwin, linux-kernel, GURUDAS,PAI



Andrew Morton wrote:
> On Fri, 13 Jul 2007 14:40:04 +0800 Joe Jin <joe.jin@oracle.com> wrote:
>
>> On 2007-07-12 22:18, Andrew Morton wrote:
>>> On Fri, 13 Jul 2007 10:45:07 +0800 Joe Jin <joe.jin@oracle.com> wrote:
>>>
>>> Something like this?
>>>
>>> --- a/mm/hugetlb.c~a
>>> +++ a/mm/hugetlb.c
>>> @@ -105,13 +105,20 @@ static void free_huge_page(struct page *
>>>  
>>>  static int alloc_fresh_huge_page(void)
>>>  {
>>> -	static int nid = 0;
>>> +	static int prev_nid;
>>> +	static DEFINE_SPINLOCK(nid_lock);
>>>  	struct page *page;
>>> -	page = alloc_pages_node(nid, htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
>>> -					HUGETLB_PAGE_ORDER);
>>> -	nid = next_node(nid, node_online_map);
>>> +	int nid;
>>> +
>>> +	spin_lock(&nid_lock);
>>> +	nid = next_node(prev_nid, node_online_map);
>>>  	if (nid == MAX_NUMNODES)
>>>  		nid = first_node(node_online_map);
>>> +	prev_nid = nid;
>>> +	spin_unlock(&nid_lock);
>>> +
>>> +	page = alloc_pages_node(nid, htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
>>> +					HUGETLB_PAGE_ORDER);
>>>  	if (page) {
>>>  		set_compound_page_dtor(page, free_huge_page);
>>>  		spin_lock(&hugetlb_lock);
>>> _
>>>
I think this will never get pages from node 0?  Because nid =
next_node(prev_nid, node_online_map), even if prev_nid == 0, nid will
become 1.

How about this patch?


--- linux-2.6.22.orig/mm/hugetlb.c 2007-07-08 16:32:17.000000000 -0700
+++ linux-2.6.22-devel/mm//hugetlb.c 2007-07-13 00:26:27.000000000 -0700
@@ -103,11 +103,18 @@
{
static int nid = 0;
struct page *page;
- page = alloc_pages_node(nid, GFP_HIGHUSER|__GFP_COMP|__GFP_NOWARN,
- HUGETLB_PAGE_ORDER);
+ int cur_nid;
+ static DEFINE_SPINLOCK(nid_lock);
+
+ spin_lock(&nid_lock);
+ cur_nid = nid;
nid = next_node(nid, node_online_map);
if (nid == MAX_NUMNODES)
nid = first_node(node_online_map);
+ spin_unlock(&nid_lock);
+
+ page = alloc_pages_node(cur_nid, GFP_HIGHUSER|__GFP_COMP|__GFP_NOWARN,
+ HUGETLB_PAGE_ORDER);
if (page) {
set_compound_page_dtor(page, free_huge_page);
spin_lock(&hugetlb_lock);



Thanks,
-Guru




* Re: [PATCH] Add nid sanity on alloc_pages_node
  2007-07-13  8:03       ` Joe Jin
@ 2007-07-13  8:15         ` Andrew Morton
  2007-07-13 12:18           ` Joe Jin
  0 siblings, 1 reply; 37+ messages in thread
From: Andrew Morton @ 2007-07-13  8:15 UTC (permalink / raw)
  To: Joe Jin; +Cc: bill.irwin, linux-kernel, gurudas.pai, Paul Jackson

On Fri, 13 Jul 2007 16:03:36 +0800 Joe Jin <joe.jin@oracle.com> wrote:

> > > 
> > > The patch looks good for this bug, thanks :)
> > 
> > If you have time, could you test it and send it back to me, please?
> > 
> 
> Test passed and the panic is gone.

Patch gone too ;) I deleted it.  I was hoping that you'd send me the final
finished product (please).

> thanks.
> 
> > > If another caller passes an invalid nid to alloc_pages_node(), __alloc_pages
> > > will crash again.
> > 
> > That would be a buggy caller, so we should fix that caller.
> > 
> > > So I think adding a sanity check for nid in alloc_pages_node() is
> > > meaningful.
> > > 
> > > Another question: if (nid >= MAX_NUMNODES), may I set nid to 0 directly,
> > > as in the following code?
> > > 
> > > 	if (unlikely(nid >= MAX_NUMNODES))
> > > 		nid = 0;
> > 
> > if (nid > MAX_NUMNODES) then that is a bug and we should report it (doing
> > this via a BUG() is OK) rather than quietly covering it up.
> > 
> > if (nid == MAX_NUMNODES) then we should set it to
> > first_node(node_online_map);
> 
> OK, will do it and send you a patch :)

Thanks.  Please cc Paul Jackson <pj@sgi.com> on that patch.


* Re: [PATCH] Add nid sanity on alloc_pages_node
  2007-07-13  8:04       ` gurudas pai
@ 2007-07-13  8:19         ` Andrew Morton
  2007-07-13 12:37           ` gurudas pai
  0 siblings, 1 reply; 37+ messages in thread
From: Andrew Morton @ 2007-07-13  8:19 UTC (permalink / raw)
  To: gurudas pai; +Cc: Joe Jin, bill.irwin, linux-kernel

On Fri, 13 Jul 2007 13:34:29 +0530 gurudas pai <gurudas.pai@oracle.com> wrote:

> 
> 
> Andrew Morton wrote:
> > On Fri, 13 Jul 2007 14:40:04 +0800 Joe Jin <joe.jin@oracle.com> wrote:
> >
> >> On 2007-07-12 22:18, Andrew Morton wrote:
> >>> On Fri, 13 Jul 2007 10:45:07 +0800 Joe Jin <joe.jin@oracle.com> wrote:
> >>>
> >>> Something like this?
> >>>
> >>> --- a/mm/hugetlb.c~a
> >>> +++ a/mm/hugetlb.c
> >>> @@ -105,13 +105,20 @@ static void free_huge_page(struct page *
> >>>  
> >>>  static int alloc_fresh_huge_page(void)
> >>>  {
> >>> -	static int nid = 0;
> >>> +	static int prev_nid;
> >>> +	static DEFINE_SPINLOCK(nid_lock);
> >>>  	struct page *page;
> >>> -	page = alloc_pages_node(nid, htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
> >>> -					HUGETLB_PAGE_ORDER);
> >>> -	nid = next_node(nid, node_online_map);
> >>> +	int nid;
> >>> +
> >>> +	spin_lock(&nid_lock);
> >>> +	nid = next_node(prev_nid, node_online_map);
> >>>  	if (nid == MAX_NUMNODES)
> >>>  		nid = first_node(node_online_map);
> >>> +	prev_nid = nid;
> >>> +	spin_unlock(&nid_lock);
> >>> +
> >>> +	page = alloc_pages_node(nid, htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
> >>> +					HUGETLB_PAGE_ORDER);
> >>>  	if (page) {
> >>>  		set_compound_page_dtor(page, free_huge_page);
> >>>  		spin_lock(&hugetlb_lock);
> >>> _
> >>>
> I think this will never get pages from node 0?  Because nid =
> next_node(prev_nid, node_online_map), even if prev_nid == 0, nid will
> become 1.

It'll start out at node 1.  But it will visit the final node (which is
less than MAX_NUMNODES) and will then advance onto the first node (which
can be >= 0).

This code needs a bit of thought and testing for the non-numa case too
please.  At the least, there might be optimisation opportunities.


> How about this patch ?
> 
> 
> --- linux-2.6.22.orig/mm/hugetlb.c 2007-07-08 16:32:17.000000000 -0700
> +++ linux-2.6.22-devel/mm//hugetlb.c 2007-07-13 00:26:27.000000000 -0700
> @@ -103,11 +103,18 @@
> {
> static int nid = 0;
> struct page *page;
> - page = alloc_pages_node(nid, GFP_HIGHUSER|__GFP_COMP|__GFP_NOWARN,
> - HUGETLB_PAGE_ORDER);
> + int cur_nid;
> + static DEFINE_SPINLOCK(nid_lock);
> +
> + spin_lock(&nid_lock);
> + cur_nid = nid;
> nid = next_node(nid, node_online_map);
> if (nid == MAX_NUMNODES)
> nid = first_node(node_online_map);
> + spin_unlock(&nid_lock);
> +
> + page = alloc_pages_node(cur_nid, GFP_HIGHUSER|__GFP_COMP|__GFP_NOWARN,
> + HUGETLB_PAGE_ORDER);
> if (page) {
> set_compound_page_dtor(page, free_huge_page);
> spin_lock(&hugetlb_lock);

whoa, your email client made a huge mess of that one.


* Re: [PATCH] Add nid sanity on alloc_pages_node
  2007-07-13  6:57       ` Andrew Morton
@ 2007-07-13  8:29         ` Paul Jackson
  2007-07-13  8:38           ` Andrew Morton
  0 siblings, 1 reply; 37+ messages in thread
From: Paul Jackson @ 2007-07-13  8:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: joe.jin, bill.irwin, linux-kernel, gurudas.pai, Zwane Mwaikambo

> I'm scratching my head over that min_t in __first_node(), too.   I don't think
> it's possible for find_first_bit(..., N) to return anything >N _anyway_.  And if
> it does, we want to know about it.
> 
> <looks at Paul>

I'm not sure I've got this right, but looks like that min_t went in after
Zwane Mwaikambo, then <zwane@fsmlabs.com>, whom I am presuming is the same
person as now at <zwane@arm.linux.org.uk>, found a problem with the i386
find_next_bit implementation returning > N when merging i386 cpu hotplug.

See the thread:

  http://lkml.org/lkml/2004/7/31/102
  [PATCH][2.6] first/next_cpu returns values > NR_CPUS

I apparently lobbied at the time to mandate that find_first_bit(..., N)
return exactly N on failure to find a set bit, but gave up, after some
confusions on my part.

Adding Zwane to this thread -- the other participant on that thread,
Bill Irwin, is already on the CC list.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [PATCH] Add nid sanity on alloc_pages_node
  2007-07-13  6:49     ` Andrew Morton
                         ` (2 preceding siblings ...)
  2007-07-13  8:04       ` gurudas pai
@ 2007-07-13  8:37       ` Joe Jin
  2007-07-13  8:44         ` Andrew Morton
  3 siblings, 1 reply; 37+ messages in thread
From: Joe Jin @ 2007-07-13  8:37 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Joe Jin, bill.irwin, linux-kernel, gurudas.pai

> 
> if (nid > MAX_NUMNODES) then that is a bug and we should report it (doing
> this via a BUG() is OK) rather than quietly covering it up.

I have created a patch to check whether nid > MAX_NUMNODES; please apply it.
Thanks.

Signed-off-by: Joe Jin <joe.jin@oracle.com>
---

--- linux-2.6.22/include/linux/gfp.h.orig	2007-07-12 15:06:23.000000000 +0800
+++ linux-2.6.22/include/linux/gfp.h	2007-07-13 16:23:52.000000000 +0800
@@ -127,6 +127,8 @@ FASTCALL(__alloc_pages(gfp_t, unsigned i
 static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
 						unsigned int order)
 {
+	BUG_ON(nid > MAX_NUMNODES);
+
 	if (unlikely(order >= MAX_ORDER))
 		return NULL;
 


* Re: [PATCH] Add nid sanity on alloc_pages_node
  2007-07-13  8:29         ` Paul Jackson
@ 2007-07-13  8:38           ` Andrew Morton
  2007-07-13  8:43             ` Paul Jackson
  0 siblings, 1 reply; 37+ messages in thread
From: Andrew Morton @ 2007-07-13  8:38 UTC (permalink / raw)
  To: Paul Jackson
  Cc: joe.jin, bill.irwin, linux-kernel, gurudas.pai, Zwane Mwaikambo

On Fri, 13 Jul 2007 01:29:06 -0700 Paul Jackson <pj@sgi.com> wrote:

> > I'm scratching my head over that min_t in __first_node(), too.   I don't think
> > it's possible for find_first_bit(..., N) to return anything >N _anyway_.  And if
> > it does, we want to know about it.
> > 
> > <looks at Paul>
> 
> I'm not sure I've got this right, but looks like that min_t went in after
> Zwane Mwaikambo, then <zwane@fsmlabs.com>, whom I am presuming is the same
> person as now at <zwane@arm.linux.org.uk>, found a problem with the i386
> find_next_bit implementation returning > N when merging i386 cpu hotplug.

Ah, Zwane was involved - say no more ;)

> See the thread:
> 
>   http://lkml.org/lkml/2004/7/31/102
>   [PATCH][2.6] first/next_cpu returns values > NR_CPUS
> 
> I apparently lobbied at the time to mandate that find_first_bit(..., N)
> return exactly N on failure to find a set bit, but gave up, after some
> confusions on my part.

iirc, find_first_bit(..., N) _must_ return N on nothing-found.  It'd be
untidy to return some randomly-larger number.

I wonder which was the culpable architecture?  Oh, i386.

Note how the i386 implementation's documentation carefully avoids describing
the return value.  I don't think _any_ of our find_foo_bit()
implementations have return-value docs, and here we see the result.

Sigh.  What crap.  I guess we leave it as-is.




* Re: [PATCH] Add nid sanity on alloc_pages_node
  2007-07-13  8:38           ` Andrew Morton
@ 2007-07-13  8:43             ` Paul Jackson
  2007-07-13  8:49               ` Andrew Morton
  0 siblings, 1 reply; 37+ messages in thread
From: Paul Jackson @ 2007-07-13  8:43 UTC (permalink / raw)
  To: Andrew Morton; +Cc: joe.jin, bill.irwin, linux-kernel, gurudas.pai, zwane

> Sigh.  What crap.  I guess we leave it as-is.

That matches my conclusion.

The thought of exposing a bug in some user of this find bit code by
tightening up the API to return == N, not >= N, scares poor little old
me.  I can imagine a nasty bug coming from this.

If someone bold enough and strong enough to slay such dragons came
along, I'd gladly cheer them along.  But such people are a rare breed,
and probably have bigger dragons to slay first.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [PATCH] Add nid sanity on alloc_pages_node
  2007-07-13  8:37       ` Joe Jin
@ 2007-07-13  8:44         ` Andrew Morton
  0 siblings, 0 replies; 37+ messages in thread
From: Andrew Morton @ 2007-07-13  8:44 UTC (permalink / raw)
  To: Joe Jin; +Cc: bill.irwin, linux-kernel, gurudas.pai

On Fri, 13 Jul 2007 16:37:32 +0800 Joe Jin <joe.jin@oracle.com> wrote:

> > 
> > if (nid > MAX_NUMNODES) then that is a bug and we should report it (doing
> > this via a BUG() is OK) rather than quietly covering it up.
> 
> I have created a patch to check whether nid > MAX_NUMNODES; please apply it.
> Thanks.
> 
> Signed-off-by: Joe Jin <joe.jin@oracle.com>
> ---
> 
> --- linux-2.6.22/include/linux/gfp.h.orig	2007-07-12 15:06:23.000000000 +0800
> +++ linux-2.6.22/include/linux/gfp.h	2007-07-13 16:23:52.000000000 +0800
> @@ -127,6 +127,8 @@ FASTCALL(__alloc_pages(gfp_t, unsigned i
>  static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
>  						unsigned int order)
>  {
> +	BUG_ON(nid > MAX_NUMNODES);
> +
>  	if (unlikely(order >= MAX_ORDER))
>  		return NULL;
>  

nope ;)

Would really prefer not to go adding overhead like this into a
frequently-called and frequently-inlined codepath.

If we do have a bug in a caller then the code will go on to overindex
NODE_DATA() which will hopefully produce a nice oops for at least some
people, and that's good enough.



* Re: [PATCH] Add nid sanity on alloc_pages_node
  2007-07-13  8:43             ` Paul Jackson
@ 2007-07-13  8:49               ` Andrew Morton
  2007-07-13  8:54                 ` Paul Jackson
  0 siblings, 1 reply; 37+ messages in thread
From: Andrew Morton @ 2007-07-13  8:49 UTC (permalink / raw)
  To: Paul Jackson; +Cc: joe.jin, bill.irwin, linux-kernel, gurudas.pai, zwane

On Fri, 13 Jul 2007 01:43:49 -0700 Paul Jackson <pj@sgi.com> wrote:

> > Sigh.  What crap.  I guess we leave it as-is.
> 
> That matches my conclusion.
> 
> The thought of exposing a bug in some user of this find bit code by
> tightening up the API to return == N, not >= N, scares poor little old
> me.  I can imagine a nasty bug coming from this.
> 
> If someone bold enough and strong enough to slay such dragons came
> along, I'd gladly cheer them along.  But such people are a rare breed,
> and probably have bigger dragons to slay first.
> 

We could do this: add a little bitops torture-test and run that
unconditionally, nice and early.

Something for the young and keen amongst us...


* Re: [PATCH] Add nid sanity on alloc_pages_node
  2007-07-13  8:49               ` Andrew Morton
@ 2007-07-13  8:54                 ` Paul Jackson
  2007-07-13 12:48                   ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 37+ messages in thread
From: Paul Jackson @ 2007-07-13  8:54 UTC (permalink / raw)
  To: Andrew Morton; +Cc: joe.jin, bill.irwin, linux-kernel, gurudas.pai, zwane

> Something for the young ...

... ancient self smiles with relief, glad that Andrew is not looking
this way ;).

> We could do this: add a little bitops torture-test and run that
> unconditionally, nice and early.

Good idea.  That would allow for widespread coverage, across all arches,
that they obey the rules for the tighter API - find_first_bit returning
exactly N on finding no bits set - before we take the risk of actually
relying on the changed API by removing those min_t() guards.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [PATCH] Add nid sanity on alloc_pages_node
  2007-07-13  8:15         ` Andrew Morton
@ 2007-07-13 12:18           ` Joe Jin
  2007-07-13 12:42             ` Paul Jackson
  2007-07-14 17:40             ` Nish Aravamudan
  0 siblings, 2 replies; 37+ messages in thread
From: Joe Jin @ 2007-07-13 12:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Joe Jin, bill.irwin, linux-kernel, gurudas.pai, Paul Jackson

> 
> Patch gone too ;) I deleted it.  I was hoping that you'd send me the final
> finished product (please).
> 

Ha.., here is the patch against 2.6.22.  Your patch uses htlb_alloc_mask, but I
cannot find it in the 2.6.22 kernel tree, so I think you must be using a
different kernel tree :)

Thanks,
Joe

--- linux-2.6.22/mm/hugetlb.c.orig	2007-07-12 15:02:19.000000000 +0800
+++ linux-2.6.22/mm/hugetlb.c	2007-07-13 17:33:45.000000000 +0800
@@ -101,13 +101,20 @@ static void free_huge_page(struct page *
 
 static int alloc_fresh_huge_page(void)
 {
-	static int nid = 0;
+	static int prev_nid;
 	struct page *page;
-	page = alloc_pages_node(nid, GFP_HIGHUSER|__GFP_COMP|__GFP_NOWARN,
-					HUGETLB_PAGE_ORDER);
-	nid = next_node(nid, node_online_map);
+	static DEFINE_SPINLOCK(nid_lock);
+	int nid;
+
+	spin_lock(&nid_lock);
+	nid = next_node(prev_nid, node_online_map);
 	if (nid == MAX_NUMNODES)
 		nid = first_node(node_online_map);
+	prev_nid = nid;
+	spin_unlock(&nid_lock);
+	
+	page = alloc_pages_node(nid, GFP_HIGHUSER|__GFP_COMP|__GFP_NOWARN,
+					HUGETLB_PAGE_ORDER);
 	if (page) {
 		set_compound_page_dtor(page, free_huge_page);
 		spin_lock(&hugetlb_lock);


* Re: [PATCH] Add nid sanity on alloc_pages_node
  2007-07-13  8:19         ` Andrew Morton
@ 2007-07-13 12:37           ` gurudas pai
  0 siblings, 0 replies; 37+ messages in thread
From: gurudas pai @ 2007-07-13 12:37 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Joe Jin, bill.irwin, linux-kernel


>>>>> On Fri, 13 Jul 2007 10:45:07 +0800 Joe Jin <joe.jin@oracle.com> wrote:
>>>>>
>>>>> Something like this?
>>>>>
>>>>> --- a/mm/hugetlb.c~a
>>>>> +++ a/mm/hugetlb.c
>>>>> @@ -105,13 +105,20 @@ static void free_huge_page(struct page *
>>>>>  
>>>>>  static int alloc_fresh_huge_page(void)
>>>>>  {
>>>>> -	static int nid = 0;
>>>>> +	static int prev_nid;
>>>>> +	static DEFINE_SPINLOCK(nid_lock);
>>>>>  	struct page *page;
>>>>> -	page = alloc_pages_node(nid, htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
>>>>> -					HUGETLB_PAGE_ORDER);
>>>>> -	nid = next_node(nid, node_online_map);
>>>>> +	int nid;
>>>>> +
>>>>> +	spin_lock(&nid_lock);
>>>>> +	nid = next_node(prev_nid, node_online_map);
>>>>>  	if (nid == MAX_NUMNODES)
>>>>>  		nid = first_node(node_online_map);
>>>>> +	prev_nid = nid;
>>>>> +	spin_unlock(&nid_lock);
>>>>> +
>>>>> +	page = alloc_pages_node(nid, htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
>>>>> +					HUGETLB_PAGE_ORDER);
>>>>>  	if (page) {
>>>>>  		set_compound_page_dtor(page, free_huge_page);
>>>>>  		spin_lock(&hugetlb_lock);
>>>>> _
>>>>>
>> I think this will never get pages from node 0?  Because nid =
>> next_node(prev_nid, node_online_map), even if prev_nid == 0, nid will
>> become 1.
>
> It'll start out at node 1.  But it will visit the final node (which is
> less than MAX_NUMNODES) and will then advance onto the first node (which
> can be >= 0).
>
> This code needs a bit of thought and testing for the non-numa case too
> please.  At the least, there might be optimisation opportunities.
I tested on non-numa machine. Your patch works fine.

Thanks,
-Guru




* Re: [PATCH] Add nid sanity on alloc_pages_node
  2007-07-13 12:18           ` Joe Jin
@ 2007-07-13 12:42             ` Paul Jackson
  2007-07-14 17:40             ` Nish Aravamudan
  1 sibling, 0 replies; 37+ messages in thread
From: Paul Jackson @ 2007-07-13 12:42 UTC (permalink / raw)
  To: Joe Jin; +Cc: akpm, joe.jin, bill.irwin, linux-kernel, gurudas.pai

Joe wrote:
> --- linux-2.6.22/mm/hugetlb.c.orig	2007-07-12 15:02:19.000000000 +0800
> +++ linux-2.6.22/mm/hugetlb.c	2007-07-13 17:33:45.000000000 +0800
> @@ -101,13 +101,20 @@ static void free_huge_page(struct page *
>  
>  static int alloc_fresh_huge_page(void)
>  {
>	...

Patch looks ok to me, except for ...

> Ha.., the patch is against 2.6.22; your patch uses htlb_alloc_mask, but I
> cannot find it in the 2.6.22 kernel tree, so I think you must be using a
> different kernel tree :)

Yes, as a matter of fact, he does have a different kernel tree, found at:

    ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/mm/

This is the infamous *-mm tree, currently at version 2.6.22-rc6-mm1, and
also available as a quilt patch set, beneath:

    ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.22-rc6/2.6.22-rc6-mm1/

My guess is Andrew will prefer a patch against this *-mm kernel, though
sometimes he is a good chap and fixes up minor differences, such as this
one involving htlb_alloc_mask replacing GFP_HIGHUSER.  If you had the
chance to verify your testing of this change against this
2.6.22-rc6-mm1, that would be worth doing.  Probably the change to
htlb_alloc_mask from GFP_HIGHUSER won't be any problem, but I suppose
that there's a small chance that some other patch in *-mm will interact
poorly with your proposed change.

Thanks for getting on this so quickly.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [PATCH] Add nid sanity on alloc_pages_node
  2007-07-13  8:54                 ` Paul Jackson
@ 2007-07-13 12:48                   ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 37+ messages in thread
From: Benjamin Herrenschmidt @ 2007-07-13 12:48 UTC (permalink / raw)
  To: Paul Jackson
  Cc: Andrew Morton, joe.jin, bill.irwin, linux-kernel, gurudas.pai,
	zwane

On Fri, 2007-07-13 at 01:54 -0700, Paul Jackson wrote:
> > Something for the young ...
> 
> ... ancient self smiles with relief, glad that Andrew is not looking
> this way ;).
> 
> > We could do this: add a little bitops torture-test and run that
> > unconditionally, nice and early.
> 
> Good idea.  That would allow for wide spread coverage across all arch's
> that they obeyed the rules for the tighter API - find_first_bit returning
> exactly N on finding no bits set, before we took the risk of actually
> relying on the changed API, by removing those min_t() guards.

That, the lockdep tests, the RCU tests...

Looks like the kernel is building up its own test suite :-) That's a
-good- thing; maybe we should formalize it a bit (separate out the "test
suite" config options)?

Ben.




* Re: [PATCH] Add nid sanity on alloc_pages_node
  2007-07-13 12:18           ` Joe Jin
  2007-07-13 12:42             ` Paul Jackson
@ 2007-07-14 17:40             ` Nish Aravamudan
  2007-07-14 18:04               ` Andrew Morton
  1 sibling, 1 reply; 37+ messages in thread
From: Nish Aravamudan @ 2007-07-14 17:40 UTC (permalink / raw)
  To: Joe Jin; +Cc: Andrew Morton, bill.irwin, linux-kernel, gurudas.pai,
	Paul Jackson

On 7/13/07, Joe Jin <joe.jin@oracle.com> wrote:
> >
> > Patch gone too ;) I deleted it.  I was hoping that you'd send me the final
> > finished product (please).
> >
>
> Ha.., the patch is against 2.6.22; your patch uses htlb_alloc_mask, but I
> cannot find it in the 2.6.22 kernel tree, so I think you must be using a
> different kernel tree :)

I believe this patch will be unnecessary if my "Fix hugetlb pool
allocation with empty nodes" patch gets pulled into -mm.
alloc_fresh_huge_page() now takes a mempolicy which is created by the
two callers, rather than reinventing interleaving itself. I believe
this will avoid the oops you saw. I am still waiting on some test
results (annoying -mm config changes) before I repost them (and they
depend on Christoph's fixes for memoryless nodes).

Thanks,
Nish

> --- linux-2.6.22/mm/hugetlb.c.orig      2007-07-12 15:02:19.000000000 +0800
> +++ linux-2.6.22/mm/hugetlb.c   2007-07-13 17:33:45.000000000 +0800
> @@ -101,13 +101,20 @@ static void free_huge_page(struct page *
>
>  static int alloc_fresh_huge_page(void)
>  {
> -       static int nid = 0;
> +       static int prev_nid;
>         struct page *page;
> -       page = alloc_pages_node(nid, GFP_HIGHUSER|__GFP_COMP|__GFP_NOWARN,
> -                                       HUGETLB_PAGE_ORDER);
> -       nid = next_node(nid, node_online_map);
> +       static DEFINE_SPINLOCK(nid_lock);
> +       int nid;
> +
> +       spin_lock(&nid_lock);
> +       nid = next_node(prev_nid, node_online_map);
>         if (nid == MAX_NUMNODES)
>                 nid = first_node(node_online_map);
> +       prev_nid = nid;
> +       spin_unlock(&nid_lock);
> +
> +       page = alloc_pages_node(nid, GFP_HIGHUSER|__GFP_COMP|__GFP_NOWARN,
> +                                       HUGETLB_PAGE_ORDER);
>         if (page) {
>                 set_compound_page_dtor(page, free_huge_page);
>                 spin_lock(&hugetlb_lock);
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>


* Re: [PATCH] Add nid sanity on alloc_pages_node
  2007-07-14 17:40             ` Nish Aravamudan
@ 2007-07-14 18:04               ` Andrew Morton
  2007-07-14 20:47                 ` Nish Aravamudan
  0 siblings, 1 reply; 37+ messages in thread
From: Andrew Morton @ 2007-07-14 18:04 UTC (permalink / raw)
  To: Nish Aravamudan
  Cc: Joe Jin, bill.irwin, linux-kernel, gurudas.pai, Paul Jackson

On Sat, 14 Jul 2007 10:40:25 -0700 "Nish Aravamudan" <nish.aravamudan@gmail.com> wrote:

> On 7/13/07, Joe Jin <joe.jin@oracle.com> wrote:
> > >
> > > Patch gone too ;) I deleted it.  I was hoping that you'd send me the final
> > > finished product (please).
> > >
> >
> > Ha.., the patch is against 2.6.22; your patch uses htlb_alloc_mask, but I
> > cannot find it in the 2.6.22 kernel tree, so I think you must be using a
> > different kernel tree :)
> 
> I believe this patch will be unnecessary if my "Fix hugetlb pool
> allocation with empty nodes" patch gets pulled into -mm.
> alloc_fresh_huge_page() now takes a mempolicy which is created by the
> two callers, rather than reinventing interleaving itself. I believe
> this will avoid the oops you saw. I am still waiting on some test
> results (annoying -mm config changes) before I repost them (and they
> depend on Christoph's fixes for memoryless nodes).

Could be so, but the patch which I have queued is nice and simple and
can be backported into 2.6.22.x.


* Re: [PATCH] Add nid sanity on alloc_pages_node
  2007-07-14 18:04               ` Andrew Morton
@ 2007-07-14 20:47                 ` Nish Aravamudan
  0 siblings, 0 replies; 37+ messages in thread
From: Nish Aravamudan @ 2007-07-14 20:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Joe Jin, bill.irwin, linux-kernel, gurudas.pai, Paul Jackson

On 7/14/07, Andrew Morton <akpm@linux-foundation.org> wrote:
> On Sat, 14 Jul 2007 10:40:25 -0700 "Nish Aravamudan" <nish.aravamudan@gmail.com> wrote:
>
> > On 7/13/07, Joe Jin <joe.jin@oracle.com> wrote:
> > > >
> > > > Patch gone too ;) I deleted it.  I was hoping that you'd send me the final
> > > > finished product (please).
> > > >
> > >
> > > Ha.., the patch is against 2.6.22; your patch uses htlb_alloc_mask, but I
> > > cannot find it in the 2.6.22 kernel tree, so I think you must be using a
> > > different kernel tree :)
> >
> > I believe this patch will be unnecessary if my "Fix hugetlb pool
> > allocation with empty nodes" patch gets pulled into -mm.
> > alloc_fresh_huge_page() now takes a mempolicy which is created by the
> > two callers, rather than reinventing interleaving itself. I believe
> > this will avoid the oops you saw. I am still waiting on some test
> > results (annoying -mm config changes) before I repost them (and they
> > depend on Christoph's fixes for memoryless nodes).
>
> Could be so, but the patch which I have queued is nice and simple and
> can be backported into 2.6.22.x.

Fair enough. FWIW, just finished testing my patches and have posted
the current incarnation to linux-mm.

Thanks,
Nish


* Re: [PATCH] Add nid sanity on alloc_pages_node
  2007-07-13  5:18 ` Andrew Morton
  2007-07-13  6:40   ` Joe Jin
@ 2007-07-17 15:04   ` Hugh Dickins
  2007-07-17 16:32     ` Andrew Morton
  1 sibling, 1 reply; 37+ messages in thread
From: Hugh Dickins @ 2007-07-17 15:04 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Joe Jin, bill.irwin, linux-kernel

On Thu, 12 Jul 2007, Andrew Morton wrote:
> 
> It'd be much better to fix the race within alloc_fresh_huge_page().  That
> function is pretty pathetic.
> 
> Something like this?
> 
> --- a/mm/hugetlb.c~a
> +++ a/mm/hugetlb.c
> @@ -105,13 +105,20 @@ static void free_huge_page(struct page *
>  
>  static int alloc_fresh_huge_page(void)
>  {
> -	static int nid = 0;
> +	static int prev_nid;
> +	static DEFINE_SPINLOCK(nid_lock);
>  	struct page *page;
> -	page = alloc_pages_node(nid, htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
> -					HUGETLB_PAGE_ORDER);
> -	nid = next_node(nid, node_online_map);
> +	int nid;
> +
> +	spin_lock(&nid_lock);
> +	nid = next_node(prev_nid, node_online_map);
>  	if (nid == MAX_NUMNODES)
>  		nid = first_node(node_online_map);
> +	prev_nid = nid;
> +	spin_unlock(&nid_lock);
> +
> +	page = alloc_pages_node(nid, htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
> +					HUGETLB_PAGE_ORDER);
>  	if (page) {
>  		set_compound_page_dtor(page, free_huge_page);
>  		spin_lock(&hugetlb_lock);

Now that it's gone into the tree, I look at it and wonder, does your
nid_lock really serve any purpose?  We're just doing a simple assignment
to prev_nid, and it doesn't matter if occasionally two racers choose the
same node, and there's no protection here against a node being offlined
before the alloc_pages_node anyway (unsupported? I'm ignorant).

Hugh


* Re: [PATCH] Add nid sanity on alloc_pages_node
  2007-07-17 15:04   ` Hugh Dickins
@ 2007-07-17 16:32     ` Andrew Morton
  2007-07-17 17:26       ` Hugh Dickins
  0 siblings, 1 reply; 37+ messages in thread
From: Andrew Morton @ 2007-07-17 16:32 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Joe Jin, bill.irwin, linux-kernel

On Tue, 17 Jul 2007 16:04:54 +0100 (BST) Hugh Dickins <hugh@veritas.com> wrote:

> On Thu, 12 Jul 2007, Andrew Morton wrote:
> > 
> > It'd be much better to fix the race within alloc_fresh_huge_page().  That
> > function is pretty pathetic.
> > 
> > Something like this?
> > 
> > --- a/mm/hugetlb.c~a
> > +++ a/mm/hugetlb.c
> > @@ -105,13 +105,20 @@ static void free_huge_page(struct page *
> >  
> >  static int alloc_fresh_huge_page(void)
> >  {
> > -	static int nid = 0;
> > +	static int prev_nid;
> > +	static DEFINE_SPINLOCK(nid_lock);
> >  	struct page *page;
> > -	page = alloc_pages_node(nid, htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
> > -					HUGETLB_PAGE_ORDER);
> > -	nid = next_node(nid, node_online_map);
> > +	int nid;
> > +
> > +	spin_lock(&nid_lock);
> > +	nid = next_node(prev_nid, node_online_map);
> >  	if (nid == MAX_NUMNODES)
> >  		nid = first_node(node_online_map);
> > +	prev_nid = nid;
> > +	spin_unlock(&nid_lock);
> > +
> > +	page = alloc_pages_node(nid, htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
> > +					HUGETLB_PAGE_ORDER);
> >  	if (page) {
> >  		set_compound_page_dtor(page, free_huge_page);
> >  		spin_lock(&hugetlb_lock);
> 
> Now that it's gone into the tree, I look at it and wonder, does your
> nid_lock really serve any purpose?  We're just doing a simple assignment
> to prev_nid, and it doesn't matter if occasionally two racers choose the
> same node, and there's no protection here against a node being offlined
> before the alloc_pages_node anyway (unsupported? I'm ignorant).

umm, actually, yes, the code as it happens to be structured does mean that
there is no longer a way in which a race can cause us to pass MAX_NUMNODES
into alloc_pages_node().

Or not.  We can call next_node(MAX_NUMNODES, node_online_map) in that race
window, with perhaps bad results.

I think I like the lock ;)


* Re: [PATCH] Add nid sanity on alloc_pages_node
  2007-07-17 16:32     ` Andrew Morton
@ 2007-07-17 17:26       ` Hugh Dickins
  2007-07-17 18:58         ` Andrew Morton
  0 siblings, 1 reply; 37+ messages in thread
From: Hugh Dickins @ 2007-07-17 17:26 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Joe Jin, bill.irwin, linux-kernel

On Tue, 17 Jul 2007, Andrew Morton wrote:
> On Tue, 17 Jul 2007 16:04:54 +0100 (BST) Hugh Dickins <hugh@veritas.com> wrote:
> > On Thu, 12 Jul 2007, Andrew Morton wrote:
> > > 
> > > It'd be much better to fix the race within alloc_fresh_huge_page().  That
> > > function is pretty pathetic.
> > > 
> > > Something like this?
> > > 
> > > --- a/mm/hugetlb.c~a
> > > +++ a/mm/hugetlb.c
> > > @@ -105,13 +105,20 @@ static void free_huge_page(struct page *
> > >  
> > >  static int alloc_fresh_huge_page(void)
> > >  {
> > > -	static int nid = 0;
> > > +	static int prev_nid;
> > > +	static DEFINE_SPINLOCK(nid_lock);
> > >  	struct page *page;
> > > -	page = alloc_pages_node(nid, htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
> > > -					HUGETLB_PAGE_ORDER);
> > > -	nid = next_node(nid, node_online_map);
> > > +	int nid;
> > > +
> > > +	spin_lock(&nid_lock);
> > > +	nid = next_node(prev_nid, node_online_map);
> > >  	if (nid == MAX_NUMNODES)
> > >  		nid = first_node(node_online_map);
> > > +	prev_nid = nid;
> > > +	spin_unlock(&nid_lock);
> > > +
> > > +	page = alloc_pages_node(nid, htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
> > > +					HUGETLB_PAGE_ORDER);
> > >  	if (page) {
> > >  		set_compound_page_dtor(page, free_huge_page);
> > >  		spin_lock(&hugetlb_lock);
> > 
> > Now that it's gone into the tree, I look at it and wonder, does your
> > nid_lock really serve any purpose?  We're just doing a simple assignment
> > to prev_nid, and it doesn't matter if occasionally two racers choose the
> > same node, and there's no protection here against a node being offlined
> > before the alloc_pages_node anyway (unsupported? I'm ignorant).
> 
> umm, actually, yes, the code as it happens to be structured does mean that
> ther is no longer a way in which a race can cause us to pass MAX_NUMNODES
> into alloc_pages_node().
> 
> Or not.  We can call next_node(MAX_NUMNODES, node_online_map) in that race
> window, with perhaps bad results.
> 
> I think I like the lock ;)

I hate to waste your time, but I'm still puzzled.  Wasn't the race fixed
by your changeover from use of "static int nid" throughout, to setting
local "int nid" from "static int prev_nid", working with nid, then
setting prev_nid from nid at the end?  What does the lock add to that?

Hugh


* Re: [PATCH] Add nid sanity on alloc_pages_node
  2007-07-17 17:26       ` Hugh Dickins
@ 2007-07-17 18:58         ` Andrew Morton
  2007-07-17 19:49           ` Hugh Dickins
  0 siblings, 1 reply; 37+ messages in thread
From: Andrew Morton @ 2007-07-17 18:58 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Joe Jin, bill.irwin, linux-kernel

On Tue, 17 Jul 2007 18:26:14 +0100 (BST) Hugh Dickins <hugh@veritas.com> wrote:

> On Tue, 17 Jul 2007, Andrew Morton wrote:
> > On Tue, 17 Jul 2007 16:04:54 +0100 (BST) Hugh Dickins <hugh@veritas.com> wrote:
> > > On Thu, 12 Jul 2007, Andrew Morton wrote:
> > > > 
> > > > It'd be much better to fix the race within alloc_fresh_huge_page().  That
> > > > function is pretty pathetic.
> > > > 
> > > > Something like this?
> > > > 
> > > > --- a/mm/hugetlb.c~a
> > > > +++ a/mm/hugetlb.c
> > > > @@ -105,13 +105,20 @@ static void free_huge_page(struct page *
> > > >  
> > > >  static int alloc_fresh_huge_page(void)
> > > >  {
> > > > -	static int nid = 0;
> > > > +	static int prev_nid;
> > > > +	static DEFINE_SPINLOCK(nid_lock);
> > > >  	struct page *page;
> > > > -	page = alloc_pages_node(nid, htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
> > > > -					HUGETLB_PAGE_ORDER);
> > > > -	nid = next_node(nid, node_online_map);
> > > > +	int nid;
> > > > +
> > > > +	spin_lock(&nid_lock);
> > > > +	nid = next_node(prev_nid, node_online_map);
> > > >  	if (nid == MAX_NUMNODES)
> > > >  		nid = first_node(node_online_map);
> > > > +	prev_nid = nid;
> > > > +	spin_unlock(&nid_lock);
> > > > +
> > > > +	page = alloc_pages_node(nid, htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
> > > > +					HUGETLB_PAGE_ORDER);
> > > >  	if (page) {
> > > >  		set_compound_page_dtor(page, free_huge_page);
> > > >  		spin_lock(&hugetlb_lock);
> > > 
> > > Now that it's gone into the tree, I look at it and wonder, does your
> > > nid_lock really serve any purpose?  We're just doing a simple assignment
> > > to prev_nid, and it doesn't matter if occasionally two racers choose the
> > > same node, and there's no protection here against a node being offlined
> > > before the alloc_pages_node anyway (unsupported? I'm ignorant).
> > 
> > umm, actually, yes, the code as it happens to be structured does mean that
> > ther is no longer a way in which a race can cause us to pass MAX_NUMNODES
> > into alloc_pages_node().
> > 
> > Or not.  We can call next_node(MAX_NUMNODES, node_online_map) in that race
> > window, with perhaps bad results.
> > 
> > I think I like the lock ;)
> 
> I hate to waste your time, but I'm still puzzled.  Wasn't the race fixed
> by your changeover from use of "static int nid" throughout, to setting
> local "int nid" from "static int prev_nid", working with nid, then
> setting prev_nid from nid at the end?  What does the lock add to that?
> 

There are still minor races without the lock - two CPUs will allocate
from the first node, and prev_nid can occasionally go backwards.

I agree that they are sufficiently minor that we could remove the lock.




* Re: [PATCH] Add nid sanity on alloc_pages_node
  2007-07-17 18:58         ` Andrew Morton
@ 2007-07-17 19:49           ` Hugh Dickins
  2007-07-17 20:01             ` Andrew Morton
  0 siblings, 1 reply; 37+ messages in thread
From: Hugh Dickins @ 2007-07-17 19:49 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Joe Jin, bill.irwin, linux-kernel

On Tue, 17 Jul 2007, Andrew Morton wrote:
> On Tue, 17 Jul 2007 18:26:14 +0100 (BST) Hugh Dickins <hugh@veritas.com> wrote:
> > > I think I like the lock ;)
> > 
> > I hate to waste your time, but I'm still puzzled.  Wasn't the race fixed
> > by your changeover from use of "static int nid" throughout, to setting
> > local "int nid" from "static int prev_nid", working with nid, then
> > setting prev_nid from nid at the end?  What does the lock add to that?
> 
> There are still minor races without the lock - two CPUs will allocate
> from the first node, and prev_nid can occasionally go backwards.
> 
> I agree that they are sufficiently minor that we could remove the lock.

Phew, yes, that's what I meant - thanks.


[PATCH] Remove nid_lock from alloc_fresh_huge_page

The fix to the race in alloc_fresh_huge_page() which could give an illegal
node ID did not need nid_lock at all: the fix was to replace static int nid
by static int prev_nid and do the work on local int nid.  nid_lock did make
sure that racers strictly roundrobin the nodes, but that's not something we
actually need to enforce strictly.  Kill that nid_lock.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
---
 mm/hugetlb.c |    3 ---
 1 file changed, 3 deletions(-)

--- 2.6.22-git9/mm/hugetlb.c	2007-07-17 20:29:33.000000000 +0100
+++ linux/mm/hugetlb.c	2007-07-17 20:32:51.000000000 +0100
@@ -107,15 +107,12 @@ static int alloc_fresh_huge_page(void)
 {
 	static int prev_nid;
 	struct page *page;
-	static DEFINE_SPINLOCK(nid_lock);
 	int nid;
 
-	spin_lock(&nid_lock);
 	nid = next_node(prev_nid, node_online_map);
 	if (nid == MAX_NUMNODES)
 		nid = first_node(node_online_map);
 	prev_nid = nid;
-	spin_unlock(&nid_lock);
 
 	page = alloc_pages_node(nid, htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
 					HUGETLB_PAGE_ORDER);


* Re: [PATCH] Add nid sanity on alloc_pages_node
  2007-07-17 19:49           ` Hugh Dickins
@ 2007-07-17 20:01             ` Andrew Morton
  2007-07-17 20:35               ` Hugh Dickins
  0 siblings, 1 reply; 37+ messages in thread
From: Andrew Morton @ 2007-07-17 20:01 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Joe Jin, bill.irwin, linux-kernel

On Tue, 17 Jul 2007 20:49:25 +0100 (BST) Hugh Dickins <hugh@veritas.com> wrote:

> --- 2.6.22-git9/mm/hugetlb.c	2007-07-17 20:29:33.000000000 +0100
> +++ linux/mm/hugetlb.c	2007-07-17 20:32:51.000000000 +0100
> @@ -107,15 +107,12 @@ static int alloc_fresh_huge_page(void)
>  {
>  	static int prev_nid;
>  	struct page *page;
> -	static DEFINE_SPINLOCK(nid_lock);
>  	int nid;
>  
> -	spin_lock(&nid_lock);
>  	nid = next_node(prev_nid, node_online_map);
>  	if (nid == MAX_NUMNODES)
>  		nid = first_node(node_online_map);
>  	prev_nid = nid;
> -	spin_unlock(&nid_lock);

Given that we've now gone and added deliberate-but-we-hope-benign 
races into this code, an elaborate comment which explains and justifies
it all is pretty much obligatory, IMO.


* Re: [PATCH] Add nid sanity on alloc_pages_node
  2007-07-17 20:01             ` Andrew Morton
@ 2007-07-17 20:35               ` Hugh Dickins
  2007-07-18  1:40                 ` Joe Jin
  0 siblings, 1 reply; 37+ messages in thread
From: Hugh Dickins @ 2007-07-17 20:35 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Joe Jin, bill.irwin, linux-kernel

On Tue, 17 Jul 2007, Andrew Morton wrote:
> 
> Given that we've now gone and added deliberate-but-we-hope-benign 
> races into this code, an elaborate comment which explains and justifies
> it all is pretty much obligatory, IMO.

[PATCH] Remove nid_lock from alloc_fresh_huge_page

The fix to that race in alloc_fresh_huge_page() which could give an illegal
node ID did not need nid_lock at all: the fix was to replace static int nid
by static int prev_nid and do the work on local int nid.  nid_lock did make
sure that racers strictly roundrobin the nodes, but that's not something we
need to enforce strictly.  Kill nid_lock.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
---
 mm/hugetlb.c |   10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

--- 2.6.22-git9/mm/hugetlb.c	2007-07-17 20:29:33.000000000 +0100
+++ linux/mm/hugetlb.c	2007-07-17 21:29:58.000000000 +0100
@@ -107,15 +107,19 @@ static int alloc_fresh_huge_page(void)
 {
 	static int prev_nid;
 	struct page *page;
-	static DEFINE_SPINLOCK(nid_lock);
 	int nid;
 
-	spin_lock(&nid_lock);
+	/*
+	 * Copy static prev_nid to local nid, work on that, then copy it
+	 * back to prev_nid afterwards: otherwise there's a window in which
+	 * a racer might pass invalid nid MAX_NUMNODES to alloc_pages_node.
+	 * But we don't need to use a spin_lock here: it really doesn't
+	 * matter if occasionally a racer chooses the same nid as we do.
+	 */
 	nid = next_node(prev_nid, node_online_map);
 	if (nid == MAX_NUMNODES)
 		nid = first_node(node_online_map);
 	prev_nid = nid;
-	spin_unlock(&nid_lock);
 
 	page = alloc_pages_node(nid, htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
 					HUGETLB_PAGE_ORDER);


* Re: [PATCH] Add nid sanity on alloc_pages_node
  2007-07-17 20:35               ` Hugh Dickins
@ 2007-07-18  1:40                 ` Joe Jin
  2007-07-18  4:49                   ` Hugh Dickins
  0 siblings, 1 reply; 37+ messages in thread
From: Joe Jin @ 2007-07-18  1:40 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrew Morton, Joe Jin, bill.irwin, linux-kernel


With your patch, I have reproduced the panic:

Unable to handle kernel paging request at 000000000000186a RIP: 
 [<ffffffff8105d4e7>] __alloc_pages+0x2f/0x2c3
PGD 72595067 PUD 72594067 PMD 0 
Oops: 0000 [1] SMP 
CPU 0 
Modules linked in: xt_tcpudp iptable_filter ip_tables x_tables cpufreq_ondemand dm_mirror dm_multipath dm_mod video sbs button battery backlight ac snd_intel8x0 snd_ac97_codec ac97_bus snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm sg snd_timer shpchp snd soundcore tg3 i2c_i801 piix i2c_core snd_page_alloc ide_cd cdrom serio_raw ata_piix libata sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
Pid: 3996, comm: sh Not tainted 2.6.22 #9
RIP: 0010:[<ffffffff8105d4e7>]  [<ffffffff8105d4e7>] __alloc_pages+0x2f/0x2c3
RSP: 0018:ffff810071563e48  EFLAGS: 00010246
RAX: 0000000000000000 RBX: 000000e8d4a51000 RCX: 0000000000000000
RDX: ffff810071563fd8 RSI: 0000000000000009 RDI: 00000000000242d2
RBP: 00000000000242d2 R08: 0000000000000000 R09: 0000000000043a01
R10: ffff810000e88000 R11: ffffffff81072a02 R12: 0000000000001862
R13: ffff810070c74ea0 R14: 0000000000000009 R15: 00002adc51042000
FS:  00002adc4db3ddb0(0000) GS:ffffffff81312000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000000000186a CR3: 000000007d963000 CR4: 00000000000006e0
Process sh (pid: 3996, threadinfo ffff810071562000, task ffff810070c74ea0)
Stack:  00000010000242d2 ffff810000e88000 ffff810070e76710 ffffffff812e31e0
 ffffffffffffffff 000000e8d4a51000 ffffffffffffffff ffff810077d11b00
 000000000000000e ffff810071563f50 00002adc51042000 ffffffff81071a67
Call Trace:
 [<ffffffff81071a67>] alloc_fresh_huge_page+0x98/0xe7
 [<ffffffff8107263b>] hugetlb_sysctl_handler+0x14/0xf1
 [<ffffffff810bdad9>] proc_sys_write+0x7c/0xa6
 [<ffffffff8108074b>] vfs_write+0xad/0x156
 [<ffffffff81080ced>] sys_write+0x45/0x6e
 [<ffffffff81009c9c>] tracesys+0xdc/0xe1


Code: 49 83 7c 24 08 00 75 0e 48 c7 44 24 08 00 00 00 00 e9 6a 02 
RIP  [<ffffffff8105d4e7>] __alloc_pages+0x2f/0x2c3
 RSP <ffff810071563e48>
CR2: 000000000000186a

From your patch, if we don't take the lock, the race condition may still
occur at next_node().

On 2007-07-17 21:35, Hugh Dickins wrote:
> On Tue, 17 Jul 2007, Andrew Morton wrote:
> > 
> > Given that we've now gone and added deliberate-but-we-hope-benign 
> > races into this code, an elaborate comment which explains and justifies
> > it all is pretty much obligatory, IMO.
> 
> [PATCH] Remove nid_lock from alloc_fresh_huge_page
> 
> The fix to that race in alloc_fresh_huge_page() which could give an illegal
> node ID did not need nid_lock at all: the fix was to replace static int nid
> by static int prev_nid and do the work on local int nid.  nid_lock did make
> sure that racers strictly roundrobin the nodes, but that's not something we
> need to enforce strictly.  Kill nid_lock.
> 
> Signed-off-by: Hugh Dickins <hugh@veritas.com>
> ---
>  mm/hugetlb.c |   10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> --- 2.6.22-git9/mm/hugetlb.c	2007-07-17 20:29:33.000000000 +0100
> +++ linux/mm/hugetlb.c	2007-07-17 21:29:58.000000000 +0100
> @@ -107,15 +107,19 @@ static int alloc_fresh_huge_page(void)
>  {
>  	static int prev_nid;
>  	struct page *page;
> -	static DEFINE_SPINLOCK(nid_lock);
>  	int nid;
>  
> -	spin_lock(&nid_lock);
> +	/*
> +	 * Copy static prev_nid to local nid, work on that, then copy it
> +	 * back to prev_nid afterwards: otherwise there's a window in which
> +	 * a racer might pass invalid nid MAX_NUMNODES to alloc_pages_node.
> +	 * But we don't need to use a spin_lock here: it really doesn't
> +	 * matter if occasionally a racer chooses the same nid as we do.
> +	 */
>  	nid = next_node(prev_nid, node_online_map);
>  	if (nid == MAX_NUMNODES)
>  		nid = first_node(node_online_map);
>  	prev_nid = nid;
> -	spin_unlock(&nid_lock);
>  
>  	page = alloc_pages_node(nid, htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
>  					HUGETLB_PAGE_ORDER);


* Re: [PATCH] Add nid sanity on alloc_pages_node
  2007-07-18  1:40                 ` Joe Jin
@ 2007-07-18  4:49                   ` Hugh Dickins
  2007-07-18  5:45                     ` Andrew Morton
                                       ` (2 more replies)
  0 siblings, 3 replies; 37+ messages in thread
From: Hugh Dickins @ 2007-07-18  4:49 UTC (permalink / raw)
  To: Joe Jin; +Cc: Andrew Morton, Oleg Nesterov, bill.irwin, linux-kernel

On Wed, 18 Jul 2007, Joe Jin wrote:
> 
> With your patch, I have reproduced the panic:

That is... surprising to me.  (I hadn't been able to reproduce it with
or without the patches: maybe I just need to try harder.)  Please post
your gcc --version, and the disassembly (objdump -d) output for
alloc_fresh_huge_page.

Or can someone else make sense of this - Oleg?  To me it still seems
that nid_lock can only be irrelevant (doesn't even provide a compiler
barrier between prev_nid and nid transactions).

You remark lower down
> From your patch, if we don't take the lock, the race condition may occur
> at next_node().
but I don't see how (if next_node were a macro which evaluates its
args more than once, perhaps, but that doesn't seem to be the case).

Thanks a lot,
Hugh

> 
> Unable to handle kernel paging request at 000000000000186a RIP: 
>  [<ffffffff8105d4e7>] __alloc_pages+0x2f/0x2c3
> PGD 72595067 PUD 72594067 PMD 0 
> Oops: 0000 [1] SMP 
> CPU 0 
> Modules linked in: xt_tcpudp iptable_filter ip_tables x_tables cpufreq_ondemand dm_mirror dm_multipath dm_mod video sbs button battery backlight ac snd_intel8x0 snd_ac97_codec ac97_bus snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm sg snd_timer shpchp snd soundcore tg3 i2c_i801 piix i2c_core snd_page_alloc ide_cd cdrom serio_raw ata_piix libata sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
> Pid: 3996, comm: sh Not tainted 2.6.22 #9
> RIP: 0010:[<ffffffff8105d4e7>]  [<ffffffff8105d4e7>] __alloc_pages+0x2f/0x2c3
> RSP: 0018:ffff810071563e48  EFLAGS: 00010246
> RAX: 0000000000000000 RBX: 000000e8d4a51000 RCX: 0000000000000000
> RDX: ffff810071563fd8 RSI: 0000000000000009 RDI: 00000000000242d2
> RBP: 00000000000242d2 R08: 0000000000000000 R09: 0000000000043a01
> R10: ffff810000e88000 R11: ffffffff81072a02 R12: 0000000000001862
> R13: ffff810070c74ea0 R14: 0000000000000009 R15: 00002adc51042000
> FS:  00002adc4db3ddb0(0000) GS:ffffffff81312000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 000000000000186a CR3: 000000007d963000 CR4: 00000000000006e0
> Process sh (pid: 3996, threadinfo ffff810071562000, task ffff810070c74ea0)
> Stack:  00000010000242d2 ffff810000e88000 ffff810070e76710 ffffffff812e31e0
>  ffffffffffffffff 000000e8d4a51000 ffffffffffffffff ffff810077d11b00
>  000000000000000e ffff810071563f50 00002adc51042000 ffffffff81071a67
> Call Trace:
>  [<ffffffff81071a67>] alloc_fresh_huge_page+0x98/0xe7
>  [<ffffffff8107263b>] hugetlb_sysctl_handler+0x14/0xf1
>  [<ffffffff810bdad9>] proc_sys_write+0x7c/0xa6
>  [<ffffffff8108074b>] vfs_write+0xad/0x156
>  [<ffffffff81080ced>] sys_write+0x45/0x6e
>  [<ffffffff81009c9c>] tracesys+0xdc/0xe1
> 
> 
> Code: 49 83 7c 24 08 00 75 0e 48 c7 44 24 08 00 00 00 00 e9 6a 02 
> RIP  [<ffffffff8105d4e7>] __alloc_pages+0x2f/0x2c3
>  RSP <ffff810071563e48>
> CR2: 000000000000186a
> 
> From your patch, if we don't take the lock, the race condition may still occur
> at next_node(). 
> 
> On 2007-07-17 21:35, Hugh Dickins wrote:
> > On Tue, 17 Jul 2007, Andrew Morton wrote:
> > > 
> > > Given that we've now gone and added deliberate-but-we-hope-benign 
> > > races into this code, an elaborate comment which explains and justifies
> > > it all is pretty much obligatory, IMO.
> > 
> > [PATCH] Remove nid_lock from alloc_fresh_huge_page
> > 
> > The fix to that race in alloc_fresh_huge_page() which could give an illegal
> > node ID did not need nid_lock at all: the fix was to replace static int nid
> > by static int prev_nid and do the work on local int nid.  nid_lock did make
> > sure that racers strictly roundrobin the nodes, but that's not something we
> > need to enforce strictly.  Kill nid_lock.
> > 
> > Signed-off-by: Hugh Dickins <hugh@veritas.com>
> > ---
> >  mm/hugetlb.c |   10 +++++++---
> >  1 file changed, 7 insertions(+), 3 deletions(-)
> > 
> > --- 2.6.22-git9/mm/hugetlb.c	2007-07-17 20:29:33.000000000 +0100
> > +++ linux/mm/hugetlb.c	2007-07-17 21:29:58.000000000 +0100
> > @@ -107,15 +107,19 @@ static int alloc_fresh_huge_page(void)
> >  {
> >  	static int prev_nid;
> >  	struct page *page;
> > -	static DEFINE_SPINLOCK(nid_lock);
> >  	int nid;
> >  
> > -	spin_lock(&nid_lock);
> > +	/*
> > +	 * Copy static prev_nid to local nid, work on that, then copy it
> > +	 * back to prev_nid afterwards: otherwise there's a window in which
> > +	 * a racer might pass invalid nid MAX_NUMNODES to alloc_pages_node.
> > +	 * But we don't need to use a spin_lock here: it really doesn't
> > +	 * matter if occasionally a racer chooses the same nid as we do.
> > +	 */
> >  	nid = next_node(prev_nid, node_online_map);
> >  	if (nid == MAX_NUMNODES)
> >  		nid = first_node(node_online_map);
> >  	prev_nid = nid;
> > -	spin_unlock(&nid_lock);
> >  
> >  	page = alloc_pages_node(nid, htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
> >  					HUGETLB_PAGE_ORDER);

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] Add nid sanity on alloc_pages_node
  2007-07-18  4:49                   ` Hugh Dickins
@ 2007-07-18  5:45                     ` Andrew Morton
  2007-07-18  7:34                       ` Joe Jin
  2007-07-18  6:32                     ` Joe Jin
  2007-07-18  8:09                     ` Joe Jin
  2 siblings, 1 reply; 37+ messages in thread
From: Andrew Morton @ 2007-07-18  5:45 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Joe Jin, Oleg Nesterov, bill.irwin, linux-kernel

On Wed, 18 Jul 2007 05:49:50 +0100 (BST) Hugh Dickins <hugh@veritas.com> wrote:

> > With your patch, I have reproduced the panic:
> 
> That is... surprising to me.

To me also.  I'd like to double-check the code which Joe actually
tested, please - have a sneaking suspicion that the 2.6.22 version
of that function was tested ;)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] Add nid sanity on alloc_pages_node
  2007-07-18  4:49                   ` Hugh Dickins
  2007-07-18  5:45                     ` Andrew Morton
@ 2007-07-18  6:32                     ` Joe Jin
  2007-07-18  8:09                     ` Joe Jin
  2 siblings, 0 replies; 37+ messages in thread
From: Joe Jin @ 2007-07-18  6:32 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Joe Jin, Andrew Morton, Oleg Nesterov, bill.irwin, linux-kernel


On 2007-07-18 05:49, Hugh Dickins wrote:
> On Wed, 18 Jul 2007, Joe Jin wrote:
> > 
> > With your patch, I have reproduced the panic:
> 
> That is... surprising to me.  (I hadn't been able to reproduce it with
> or without the patches: maybe I just need to try harder.)  Please post
> your gcc --version, and the disassembly (objdump -d) output for
> alloc_fresh_huge_page.
> 

It is easy to reproduce: just run the following script in *2* different
terminals at the same time; I think you will hit it in a short time.

#!/bin/bash
while : ; do
        echo 1000000000000 > /proc/sys/vm/nr_hugepages
        echo 1 > /proc/sys/vm/nr_hugepages
        echo 10000000000000000000 > /proc/sys/vm/nr_hugepages
        echo 0 > /proc/sys/vm/nr_hugepages
done

# gcc --version
gcc (GCC) 4.1.1 20070105 (Red Hat 4.1.1-52)
Copyright (C) 2006 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

------------------------------------------------------------------------------------------------------------------------
Disassembly of alloc_fresh_huge_page with gdb:

0xffffffff810719cf <alloc_fresh_huge_page+0>:   mov    3651606(%rip),%r8d        # 0xffffffff813ed1ec <prev_nid.16422>
0xffffffff810719d6 <alloc_fresh_huge_page+7>:   mov    2841027(%rip),%rdi        # 0xffffffff813273a0 <node_online_map>
0xffffffff810719dd <alloc_fresh_huge_page+14>:  mov    $0x3f,%eax
0xffffffff810719e2 <alloc_fresh_huge_page+19>:  push   %rbx
0xffffffff810719e3 <alloc_fresh_huge_page+20>:  lea    0x1(%r8),%esi
0xffffffff810719e7 <alloc_fresh_huge_page+24>:  sub    %r8d,%eax
0xffffffff810719ea <alloc_fresh_huge_page+27>:  mov    %rdi,%rdx
0xffffffff810719ed <alloc_fresh_huge_page+30>:  cltq   
0xffffffff810719ef <alloc_fresh_huge_page+32>:  mov    %sil,%cl
0xffffffff810719f2 <alloc_fresh_huge_page+35>:  shr    %cl,%rdx
0xffffffff810719f5 <alloc_fresh_huge_page+38>:  bsf    %rdx,%rcx
0xffffffff810719f9 <alloc_fresh_huge_page+42>:  cmove  %rax,%rcx
0xffffffff810719fd <alloc_fresh_huge_page+46>:  lea    (%rsi,%rcx,1),%ecx
0xffffffff81071a00 <alloc_fresh_huge_page+49>:  mov    $0x40,%eax
0xffffffff81071a05 <alloc_fresh_huge_page+54>:  cmp    $0x40,%ecx
0xffffffff81071a08 <alloc_fresh_huge_page+57>:  cmovg  %eax,%ecx
0xffffffff81071a0b <alloc_fresh_huge_page+60>:  cmp    $0x40,%ecx
0xffffffff81071a0e <alloc_fresh_huge_page+63>:  mov    %ecx,3651540(%rip)        # 0xffffffff813ed1e8 <nid.16423>
0xffffffff81071a14 <alloc_fresh_huge_page+69>:  jne    0xffffffff81071a2f <alloc_fresh_huge_page+96>
0xffffffff81071a16 <alloc_fresh_huge_page+71>:  mov    $0x40,%eax
0xffffffff81071a1b <alloc_fresh_huge_page+76>:  bsf    %rdi,%rdx
0xffffffff81071a1f <alloc_fresh_huge_page+80>:  cmove  %rax,%rdx
0xffffffff81071a23 <alloc_fresh_huge_page+84>:  cmp    $0x40,%edx
0xffffffff81071a26 <alloc_fresh_huge_page+87>:  cmovle %edx,%ecx
0xffffffff81071a29 <alloc_fresh_huge_page+90>:  mov    %ecx,3651513(%rip)        # 0xffffffff813ed1e8 <nid.16423>
0xffffffff81071a2f <alloc_fresh_huge_page+96>:  mov    3651507(%rip),%eax        # 0xffffffff813ed1e8 <nid.16423>
0xffffffff81071a35 <alloc_fresh_huge_page+102>: test   %eax,%eax
0xffffffff81071a37 <alloc_fresh_huge_page+104>: mov    %eax,3651503(%rip)        # 0xffffffff813ed1ec <prev_nid.16422>
0xffffffff81071a3d <alloc_fresh_huge_page+110>: jns    0xffffffff81071a47 <alloc_fresh_huge_page+120>
0xffffffff81071a3f <alloc_fresh_huge_page+112>: mov    %gs:0x30,%eax
0xffffffff81071a47 <alloc_fresh_huge_page+120>: cltq   
0xffffffff81071a49 <alloc_fresh_huge_page+122>: mov    $0x9,%esi
0xffffffff81071a4e <alloc_fresh_huge_page+127>: mov    $0x242d2,%edi
0xffffffff81071a53 <alloc_fresh_huge_page+132>: mov    0xffffffff813260e0(,%rax,8),%rdx
0xffffffff81071a5b <alloc_fresh_huge_page+140>: add    $0x1860,%rdx
0xffffffff81071a62 <alloc_fresh_huge_page+147>: callq  0xffffffff8105d4b8 <__alloc_pages>
0xffffffff81071a67 <alloc_fresh_huge_page+152>: mov    %rax,%rbx
0xffffffff81071a6a <alloc_fresh_huge_page+155>: xor    %eax,%eax
0xffffffff81071a6c <alloc_fresh_huge_page+157>: test   %rbx,%rbx
0xffffffff81071a6f <alloc_fresh_huge_page+160>: je     0xffffffff81071ab4 <alloc_fresh_huge_page+229>
0xffffffff81071a71 <alloc_fresh_huge_page+162>: movq   $0xffffffff81072a02,0x60(%rbx)
0xffffffff81071a79 <alloc_fresh_huge_page+170>: mov    $0xffffffff812e6c68,%rdi
0xffffffff81071a80 <alloc_fresh_huge_page+177>: callq  0xffffffff81217952 <_spin_lock>
0xffffffff81071a85 <alloc_fresh_huge_page+182>: incq   3651428(%rip)        # 0xffffffff813ed1f0 <nr_huge_pages>
0xffffffff81071a8c <alloc_fresh_huge_page+189>: mov    (%rbx),%rax
0xffffffff81071a8f <alloc_fresh_huge_page+192>: shr    $0x2d,%rax
0xffffffff81071a93 <alloc_fresh_huge_page+196>: and    $0x3f,%eax
0xffffffff81071a96 <alloc_fresh_huge_page+199>: incl   0xffffffff813ed620(,%rax,4)
0xffffffff81071a9d <alloc_fresh_huge_page+206>: movl   $0x1,2576833(%rip)        # 0xffffffff812e6c68 <hugetlb_lock>
0xffffffff81071aa7 <alloc_fresh_huge_page+216>: mov    %rbx,%rdi
0xffffffff81071aaa <alloc_fresh_huge_page+219>: callq  0xffffffff8105f9cc <put_page>
0xffffffff81071aaf <alloc_fresh_huge_page+224>: mov    $0x1,%eax
0xffffffff81071ab4 <alloc_fresh_huge_page+229>: pop    %rbx
0xffffffff81071ab5 <alloc_fresh_huge_page+230>: retq   


Thanks,
Joe

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] Add nid sanity on alloc_pages_node
  2007-07-18  5:45                     ` Andrew Morton
@ 2007-07-18  7:34                       ` Joe Jin
  0 siblings, 0 replies; 37+ messages in thread
From: Joe Jin @ 2007-07-18  7:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Joe Jin, Oleg Nesterov, bill.irwin, linux-kernel

> > 
> > That is... surprising to me.
> 
> To me also.  I'd like to double-check the code which Joe actually
> tested, please - have a sneaking suspicion that the 2.6.22 version
> of that function was tested ;)

Maybe I made a mistake? I'll test it again and then give you a report.

Thanks,
Joe

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] Add nid sanity on alloc_pages_node
  2007-07-18  4:49                   ` Hugh Dickins
  2007-07-18  5:45                     ` Andrew Morton
  2007-07-18  6:32                     ` Joe Jin
@ 2007-07-18  8:09                     ` Joe Jin
  2007-07-18  8:35                       ` Hugh Dickins
  2 siblings, 1 reply; 37+ messages in thread
From: Joe Jin @ 2007-07-18  8:09 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Joe Jin, Andrew Morton, Oleg Nesterov, bill.irwin, linux-kernel

Sorry, I used an unclean kernel tree; the panic is gone ;)
Also, I will test it again on another server; if I hit it, I'll let you know.

Thanks,
Joe

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] Add nid sanity on alloc_pages_node
  2007-07-18  8:09                     ` Joe Jin
@ 2007-07-18  8:35                       ` Hugh Dickins
  0 siblings, 0 replies; 37+ messages in thread
From: Hugh Dickins @ 2007-07-18  8:35 UTC (permalink / raw)
  To: Joe Jin; +Cc: Andrew Morton, Oleg Nesterov, bill.irwin, linux-kernel

On Wed, 18 Jul 2007, Joe Jin wrote:

> Sorry, I used an unclean kernel tree; the panic is gone ;)
> Also, I will test it again on another server; if I hit it, I'll let you know.

That's a relief!  Thanks,
Hugh

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2007-07-18  9:01 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-07-13  2:45 [PATCH] Add nid sanity on alloc_pages_node Joe Jin
2007-07-13  5:18 ` Andrew Morton
2007-07-13  6:40   ` Joe Jin
2007-07-13  6:49     ` Andrew Morton
2007-07-13  6:57       ` Andrew Morton
2007-07-13  8:29         ` Paul Jackson
2007-07-13  8:38           ` Andrew Morton
2007-07-13  8:43             ` Paul Jackson
2007-07-13  8:49               ` Andrew Morton
2007-07-13  8:54                 ` Paul Jackson
2007-07-13 12:48                   ` Benjamin Herrenschmidt
2007-07-13  8:03       ` Joe Jin
2007-07-13  8:15         ` Andrew Morton
2007-07-13 12:18           ` Joe Jin
2007-07-13 12:42             ` Paul Jackson
2007-07-14 17:40             ` Nish Aravamudan
2007-07-14 18:04               ` Andrew Morton
2007-07-14 20:47                 ` Nish Aravamudan
2007-07-13  8:04       ` gurudas pai
2007-07-13  8:19         ` Andrew Morton
2007-07-13 12:37           ` gurudas pai
2007-07-13  8:37       ` Joe Jin
2007-07-13  8:44         ` Andrew Morton
2007-07-17 15:04   ` Hugh Dickins
2007-07-17 16:32     ` Andrew Morton
2007-07-17 17:26       ` Hugh Dickins
2007-07-17 18:58         ` Andrew Morton
2007-07-17 19:49           ` Hugh Dickins
2007-07-17 20:01             ` Andrew Morton
2007-07-17 20:35               ` Hugh Dickins
2007-07-18  1:40                 ` Joe Jin
2007-07-18  4:49                   ` Hugh Dickins
2007-07-18  5:45                     ` Andrew Morton
2007-07-18  7:34                       ` Joe Jin
2007-07-18  6:32                     ` Joe Jin
2007-07-18  8:09                     ` Joe Jin
2007-07-18  8:35                       ` Hugh Dickins

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox