[PATCH] Fix off-by-one bug in mbind() syscall implementation

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] Fix off-by-one bug in mbind() syscall implementation
@ 2010-07-26  9:28 Andre Przywara
  2010-07-26  9:49 ` Andi Kleen
  0 siblings, 1 reply; 6+ messages in thread
From: Andre Przywara @ 2010-07-26  9:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Christoph Lameter, linux-kernel, linux-mm,
	Andre Przywara

When the mbind() syscall implementation processes the node mask
provided by the user, the last node is accidentally masked out.
This is present since the dawn of time (aka Before Git), I guess
nobody realized that because libnuma as the most prominent user of
mbind() uses large masks (sizeof(long)) and nobody cared if the
64th node is not handled properly. But if the user application
defers the masking to the kernel and provides the number of valid bits
in maxnodes, there is always the last node missing.
However this also affect the special case with maxnodes=0, the manpage
reads that mbind(ptr, len, MPOL_DEFAULT, &some_long, 0, 0); should
reset the policy to the default one, but in fact it returns EINVAL.
This patch just removes the decrease-by-one statement, I hope that
there is no workaround code in the wild that relies on the bogus
behavior.

Signed-off-by: Andre Przywara <andre.przywara@amd.com>
---
 mm/mempolicy.c |    1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 5bc0a96..e70025b 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1174,7 +1174,6 @@ static int get_nodes(nodemask_t *nodes, const unsigned long __user *nmask,
 	unsigned long nlongs;
 	unsigned long endmask;

-	--maxnode;
 	nodes_clear(*nodes);
 	if (maxnode == 0 || !nmask)
 		return 0;
-- 
1.6.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] Fix off-by-one bug in mbind() syscall implementation
  2010-07-26  9:28 [PATCH] Fix off-by-one bug in mbind() syscall implementation Andre Przywara
@ 2010-07-26  9:49 ` Andi Kleen
  2010-07-26 10:23   ` Andre Przywara
  0 siblings, 1 reply; 6+ messages in thread
From: Andi Kleen @ 2010-07-26  9:49 UTC (permalink / raw)
  To: Andre Przywara
  Cc: Andrew Morton, Andi Kleen, Christoph Lameter, linux-kernel,
	linux-mm

On Mon, Jul 26, 2010 at 11:28:18AM +0200, Andre Przywara wrote:
> When the mbind() syscall implementation processes the node mask
> provided by the user, the last node is accidentally masked out.
> This is present since the dawn of time (aka Before Git), I guess
> nobody realized that because libnuma as the most prominent user of
> mbind() uses large masks (sizeof(long)) and nobody cared if the
> 64th node is not handled properly. But if the user application
> defers the masking to the kernel and provides the number of valid bits
> in maxnodes, there is always the last node missing.
> However this also affect the special case with maxnodes=0, the manpage
> reads that mbind(ptr, len, MPOL_DEFAULT, &some_long, 0, 0); should
> reset the policy to the default one, but in fact it returns EINVAL.
> This patch just removes the decrease-by-one statement, I hope that
> there is no workaround code in the wild that relies on the bogus
> behavior.

Actually libnuma and likely most existing users rely on it.

The only way to change it would be to add new system calls.

-Andi

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] Fix off-by-one bug in mbind() syscall implementation
  2010-07-26  9:49 ` Andi Kleen
@ 2010-07-26 10:23   ` Andre Przywara
  2010-07-26 10:40     ` Andi Kleen
  0 siblings, 1 reply; 6+ messages in thread
From: Andre Przywara @ 2010-07-26 10:23 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Christoph Lameter, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org

Andi Kleen wrote:
> On Mon, Jul 26, 2010 at 11:28:18AM +0200, Andre Przywara wrote:
>> When the mbind() syscall implementation processes the node mask
>> provided by the user, the last node is accidentally masked out.
>> This is present since the dawn of time (aka Before Git), I guess
>> nobody realized that because libnuma as the most prominent user of
>> mbind() uses large masks (sizeof(long)) and nobody cared if the
>> 64th node is not handled properly. But if the user application
>> defers the masking to the kernel and provides the number of valid bits
>> in maxnodes, there is always the last node missing.
>> However this also affect the special case with maxnodes=0, the manpage
>> reads that mbind(ptr, len, MPOL_DEFAULT, &some_long, 0, 0); should
>> reset the policy to the default one, but in fact it returns EINVAL.
>> This patch just removes the decrease-by-one statement, I hope that
>> there is no workaround code in the wild that relies on the bogus
>> behavior.
> 
> Actually libnuma and likely most existing users rely on it.
If grep didn't fool me, then the only users in libnuma aware of that bug 
are the test implementations in numactl-2.0.3/test, namely /test/tshm.c 
(NUMA_MAX_NODES+1) and test/mbind_mig_pages.c (old_nodes->size + 1).

Has this bug been known before?
> 
> The only way to change it would be to add new system calls.
That would probably be overkill, but if this behavior is now fixed, it 
should be documented (in the manpage and in the kernel code).
Also the actual libnuma code should be adjusted, then.

Regards,
Andre.

-- 
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 448-3567-12

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] Fix off-by-one bug in mbind() syscall implementation
  2010-07-26 10:23   ` Andre Przywara
@ 2010-07-26 10:40     ` Andi Kleen
  2010-07-29  8:50       ` Andre Przywara
  0 siblings, 1 reply; 6+ messages in thread
From: Andi Kleen @ 2010-07-26 10:40 UTC (permalink / raw)
  To: Andre Przywara
  Cc: Andi Kleen, Andrew Morton, Christoph Lameter,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org

On Mon, Jul 26, 2010 at 12:23:10PM +0200, Andre Przywara wrote:
> Andi Kleen wrote:
> >On Mon, Jul 26, 2010 at 11:28:18AM +0200, Andre Przywara wrote:
> >>When the mbind() syscall implementation processes the node mask
> >>provided by the user, the last node is accidentally masked out.
> >>This is present since the dawn of time (aka Before Git), I guess
> >>nobody realized that because libnuma as the most prominent user of
> >>mbind() uses large masks (sizeof(long)) and nobody cared if the
> >>64th node is not handled properly. But if the user application
> >>defers the masking to the kernel and provides the number of valid bits
> >>in maxnodes, there is always the last node missing.
> >>However this also affect the special case with maxnodes=0, the manpage
> >>reads that mbind(ptr, len, MPOL_DEFAULT, &some_long, 0, 0); should
> >>reset the policy to the default one, but in fact it returns EINVAL.
> >>This patch just removes the decrease-by-one statement, I hope that
> >>there is no workaround code in the wild that relies on the bogus
> >>behavior.
> >
> >Actually libnuma and likely most existing users rely on it.
> If grep didn't fool me, then the only users in libnuma aware of that
> bug are the test implementations in numactl-2.0.3/test, namely
> /test/tshm.c (NUMA_MAX_NODES+1) and test/mbind_mig_pages.c
> (old_nodes->size + 1).

At least libnuma 1 (which is the libnuma most distributions use today)
explicitely knows about it and will break if you change it.

> 
> Has this bug been known before?

Yes (and you can argue whether it's a problem or not)

-Andi

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] Fix off-by-one bug in mbind() syscall implementation
  2010-07-26 10:40     ` Andi Kleen
@ 2010-07-29  8:50       ` Andre Przywara
  2010-07-29 16:15         ` Jack Steiner
  0 siblings, 1 reply; 6+ messages in thread
From: Andre Przywara @ 2010-07-29  8:50 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Christoph Lameter, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org

Andi Kleen wrote:
> On Mon, Jul 26, 2010 at 12:23:10PM +0200, Andre Przywara wrote:
>> Andi Kleen wrote:
>>> On Mon, Jul 26, 2010 at 11:28:18AM +0200, Andre Przywara wrote:
>>>> When the mbind() syscall implementation processes the node mask
>>>> provided by the user, the last node is accidentally masked out.
>>>> This is present since the dawn of time (aka Before Git), I guess
>>>> nobody realized that because libnuma as the most prominent user of
>>>> mbind() uses large masks (sizeof(long)) and nobody cared if the
>>>> 64th node is not handled properly. But if the user application
>>>> defers the masking to the kernel and provides the number of valid bits
>>>> in maxnodes, there is always the last node missing.
>>>> However this also affect the special case with maxnodes=0, the manpage
>>>> reads that mbind(ptr, len, MPOL_DEFAULT, &some_long, 0, 0); should
>>>> reset the policy to the default one, but in fact it returns EINVAL.
>>>> This patch just removes the decrease-by-one statement, I hope that
>>>> there is no workaround code in the wild that relies on the bogus
>>>> behavior.
>>> Actually libnuma and likely most existing users rely on it.
>> If grep didn't fool me, then the only users in libnuma aware of that
>> bug are the test implementations in numactl-2.0.3/test, namely
>> /test/tshm.c (NUMA_MAX_NODES+1) and test/mbind_mig_pages.c
>> (old_nodes->size + 1).
> 
> At least libnuma 1 (which is the libnuma most distributions use today)
> explicitely knows about it and will break if you change it.
Please define most distributions. I just did some research:
Old libnuma with the workaround active:
* OpenSuse 11.0 (recently EOL)
* Fedora 9 (EOL for about a year)
* SLES10 (still supported, but unlikey to get a vanilla kernel update)
* CentOS 5.5 (same as SLES10)
First version with a safe libnuma:
* OpenSuse 11.1
* Fedora 10
* SLES11
Didn't check others, but I guess that looks similar. If they get an 
official kernel update, they likely get the corresponding library fixes 
along with it.
Also I found that numactl-1.0.3 already had the bug fix.

So how big is the chance the anyone with these old distros will use a 
2.6.36+ kernel with it? If someone does so, then I'd guess he'd be on 
his own and will probably also update other parts of the system (or 
better upgrade the whole setup).
I see that this is a general question and should not be answered with 
probability arguments, but I would like to hear other statements on this 
policy. After all this is a clear kernel bug and should be fixed. Recent 
library implementation will trigger this bug.
Also I would like to know whether we support any older library with 
newer kernels. I guess there is no such promise (thinking of modutils, 
udev, ...)
Is the stable syscall interface defined by documentation or by (possibly 
buggy) de facto implementation?

> 
>> Has this bug been known before?
> 
> Yes (and you can argue whether it's a problem or not)
OK, I will:
1. It's not documented, neither in the kernel nor in libnuma.
2. The default interface for large bitmaps (consisting of a number of 
longs) is to pass the number of valid bits. A variant would be passing 
the highest valid bit number. The number of bits plus one is not in the 
list.
3. There is a special case in the syscall interface for resetting the 
policy. It says you need to pass either a NULL pointer or 0 for the 
number of bits (along with MPOL_DEFAULT). This simply does not work. 
Instead you have to pass a NULL pointer or _1_. Also that means that 
passing 1 intentionally triggers the special case.
3. libnuma changed the behavior from work-arounding to ignoring some 18 
month or so before. This bug will lead to the 64th node (or the 128th 
node, the 192th node, ...) to be ignored. And please don't argument that 
nobody will ever have 64 nodes...
4. If one use mbind() directly and lets the kernel do the masking by 
passing the number of valid bits (and not the size of the buffer) then 
the last node will always be masked off.

So I strongly opt for fixing this by removing the line and maybe add 
some documentation about the old behavior.

Regards,
Andre.

-- 
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 448-3567-12

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] Fix off-by-one bug in mbind() syscall implementation
  2010-07-29  8:50       ` Andre Przywara
@ 2010-07-29 16:15         ` Jack Steiner
  0 siblings, 0 replies; 6+ messages in thread
From: Jack Steiner @ 2010-07-29 16:15 UTC (permalink / raw)
  To: Andre Przywara
  Cc: Andi Kleen, Andrew Morton, Christoph Lameter,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org

On Thu, Jul 29, 2010 at 10:50:37AM +0200, Andre Przywara wrote:
> Andi Kleen wrote:
>> On Mon, Jul 26, 2010 at 12:23:10PM +0200, Andre Przywara wrote:
>>> Andi Kleen wrote:
>>>> On Mon, Jul 26, 2010 at 11:28:18AM +0200, Andre Przywara wrote:
>>>>> When the mbind() syscall implementation processes the node mask
>>>>> provided by the user, the last node is accidentally masked out.
>>>>> This is present since the dawn of time (aka Before Git), I guess
>>>>> nobody realized that because libnuma as the most prominent user of
>>>>> mbind() uses large masks (sizeof(long)) and nobody cared if the
>>>>> 64th node is not handled properly. But if the user application
>>>>> defers the masking to the kernel and provides the number of valid bits
>>>>> in maxnodes, there is always the last node missing.
>>>>> However this also affect the special case with maxnodes=0, the manpage
>>>>> reads that mbind(ptr, len, MPOL_DEFAULT, &some_long, 0, 0); should
>>>>> reset the policy to the default one, but in fact it returns EINVAL.
>>>>> This patch just removes the decrease-by-one statement, I hope that
>>>>> there is no workaround code in the wild that relies on the bogus
>>>>> behavior.
>>>> Actually libnuma and likely most existing users rely on it.
>>> If grep didn't fool me, then the only users in libnuma aware of that
>>> bug are the test implementations in numactl-2.0.3/test, namely
>>> /test/tshm.c (NUMA_MAX_NODES+1) and test/mbind_mig_pages.c
>>> (old_nodes->size + 1).
>>
>> At least libnuma 1 (which is the libnuma most distributions use today)
>> explicitely knows about it and will break if you change it.
> Please define most distributions. I just did some research:
> Old libnuma with the workaround active:
> * OpenSuse 11.0 (recently EOL)
> * Fedora 9 (EOL for about a year)
> * SLES10 (still supported, but unlikey to get a vanilla kernel update)
> * CentOS 5.5 (same as SLES10)
> First version with a safe libnuma:
> * OpenSuse 11.1
> * Fedora 10
> * SLES11
> Didn't check others, but I guess that looks similar. If they get an official 
> kernel update, they likely get the corresponding library fixes along with 
> it.
> Also I found that numactl-1.0.3 already had the bug fix.
>
> So how big is the chance the anyone with these old distros will use a 
> 2.6.36+ kernel with it? If someone does so, then I'd guess he'd be on his 
> own and will probably also update other parts of the system (or better 
> upgrade the whole setup).
> I see that this is a general question and should not be answered with 
> probability arguments, but I would like to hear other statements on this 
> policy. After all this is a clear kernel bug and should be fixed. Recent 
> library implementation will trigger this bug.
> Also I would like to know whether we support any older library with newer 
> kernels. I guess there is no such promise (thinking of modutils, udev, ...)
> Is the stable syscall interface defined by documentation or by (possibly 
> buggy) de facto implementation?
>
>>
>>> Has this bug been known before?
>>
>> Yes (and you can argue whether it's a problem or not)
> OK, I will:
> 1. It's not documented, neither in the kernel nor in libnuma.
> 2. The default interface for large bitmaps (consisting of a number of longs) 
> is to pass the number of valid bits. A variant would be passing the highest 
> valid bit number. The number of bits plus one is not in the list.
> 3. There is a special case in the syscall interface for resetting the 
> policy. It says you need to pass either a NULL pointer or 0 for the number 
> of bits (along with MPOL_DEFAULT). This simply does not work. Instead you 
> have to pass a NULL pointer or _1_. Also that means that passing 1 
> intentionally triggers the special case.
> 3. libnuma changed the behavior from work-arounding to ignoring some 18 
> month or so before. This bug will lead to the 64th node (or the 128th node, 
> the 192th node, ...) to be ignored. And please don't argument that nobody 
> will ever have 64 nodes...

FYI -
	cct405-1:~ # numactl --hardware
	available: 254 nodes (0-253)
	node 0 cpus: 0 1 2 3 1648 1649 1650 1651
	node 0 size: 14298 MB
	node 0 free: 13352 MB
	...
	node 253 cpus: 1640 1641 1642 1643 1644 1645 1646 1647 2904 2905 2906 2907 2908 2909 2910 2911
	node 253 size: 32752 MB
	node 253 free: 32229 MB

> 4. If one use mbind() directly and lets the kernel do the masking by passing 
> the number of valid bits (and not the size of the buffer) then the last node 
> will always be masked off.
>
> So I strongly opt for fixing this by removing the line and maybe add some 
> documentation about the old behavior.
>
> Regards,
> Andre.
>
> -- 
> Andre Przywara
> AMD-Operating System Research Center (OSRC), Dresden, Germany
> Tel: +49 351 448-3567-12
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2010-07-29 16:15 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-07-26  9:28 [PATCH] Fix off-by-one bug in mbind() syscall implementation Andre Przywara
2010-07-26  9:49 ` Andi Kleen
2010-07-26 10:23   ` Andre Przywara
2010-07-26 10:40     ` Andi Kleen
2010-07-29  8:50       ` Andre Przywara
2010-07-29 16:15         ` Jack Steiner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).