linux-mm.kvack.org archive mirror
From: Jack Steiner <steiner@sgi.com>
To: Andre Przywara <andre.przywara@amd.com>
Cc: Andi Kleen <andi@firstfloor.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Christoph Lameter <cl@linux-foundation.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>
Subject: Re: [PATCH] Fix off-by-one bug in mbind() syscall implementation
Date: Thu, 29 Jul 2010 11:15:37 -0500
Message-ID: <20100729161537.GA13268@sgi.com>
In-Reply-To: <4C5140DD.802@amd.com>

On Thu, Jul 29, 2010 at 10:50:37AM +0200, Andre Przywara wrote:
> Andi Kleen wrote:
>> On Mon, Jul 26, 2010 at 12:23:10PM +0200, Andre Przywara wrote:
>>> Andi Kleen wrote:
>>>> On Mon, Jul 26, 2010 at 11:28:18AM +0200, Andre Przywara wrote:
>>>>> When the mbind() syscall implementation processes the node mask
>>>>> provided by the user, the last node is accidentally masked out.
>>>>> This is present since the dawn of time (aka Before Git), I guess
>>>>> nobody realized that because libnuma as the most prominent user of
>>>>> mbind() uses large masks (sizeof(long)) and nobody cared if the
>>>>> 64th node is not handled properly. But if the user application
>>>>> defers the masking to the kernel and provides the number of valid bits
>>>>> in maxnodes, there is always the last node missing.
>>>>> However this also affects the special case with maxnodes=0, the manpage
>>>>> reads that mbind(ptr, len, MPOL_DEFAULT, &some_long, 0, 0); should
>>>>> reset the policy to the default one, but in fact it returns EINVAL.
>>>>> This patch just removes the decrease-by-one statement, I hope that
>>>>> there is no workaround code in the wild that relies on the bogus
>>>>> behavior.
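
A minimal sketch of the masking described above, for a node mask that fits in a single long. It is a userspace model based only on the description in this thread, not the kernel source; the helper names low_bits_mask(), mask_old() and mask_fixed() are hypothetical.

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical helper: a mask with the low `nbits` bits set. */
static unsigned long low_bits_mask(unsigned long nbits)
{
    assert(nbits <= BITS_PER_LONG);
    if (nbits == BITS_PER_LONG)
        return ~0UL;                 /* avoid shifting by the full word width */
    return (1UL << nbits) - 1;       /* nbits == 0 yields an empty mask */
}

/* Model of the behavior under discussion: maxnode is decreased by one
 * before the valid bits are computed, so the highest node is dropped. */
static unsigned long mask_old(unsigned long user_mask, unsigned long maxnode)
{
    assert(maxnode >= 1 && maxnode <= BITS_PER_LONG);
    return user_mask & low_bits_mask(maxnode - 1);
}

/* Model of the proposed behavior: maxnode is the number of valid bits. */
static unsigned long mask_fixed(unsigned long user_mask, unsigned long maxnode)
{
    assert(maxnode >= 1 && maxnode <= BITS_PER_LONG);
    return user_mask & low_bits_mask(maxnode);
}
```

With four nodes (user_mask 0xF, maxnode 4), mask_old() yields 0x7, so node 3 is lost, while mask_fixed() keeps 0xF. Passing the full word width as maxnode loses only the top bit, which is why callers passing large masks would rarely notice.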
>>>> Actually libnuma and likely most existing users rely on it.
>>> If grep didn't fool me, then the only users in libnuma aware of that
>>> bug are the test implementations in numactl-2.0.3/test, namely
>>> /test/tshm.c (NUMA_MAX_NODES+1) and test/mbind_mig_pages.c
>>> (old_nodes->size + 1).
>>
>> At least libnuma 1 (which is the libnuma most distributions use today)
>> explicitly knows about it and will break if you change it.
> Please define most distributions. I just did some research:
> Old libnuma with the workaround active:
> * OpenSuse 11.0 (recently EOL)
> * Fedora 9 (EOL for about a year)
> * SLES10 (still supported, but unlikely to get a vanilla kernel update)
> * CentOS 5.5 (same as SLES10)
> First version with a safe libnuma:
> * OpenSuse 11.1
> * Fedora 10
> * SLES11
> Didn't check others, but I guess that looks similar. If they get an official 
> kernel update, they likely get the corresponding library fixes along with 
> it.
> Also I found that numactl-1.0.3 already had the bug fix.
>
> So how big is the chance that anyone with these old distros will use a 
> 2.6.36+ kernel with it? If someone does so, then I'd guess he'd be on his 
> own and will probably also update other parts of the system (or better 
> upgrade the whole setup).
> I see that this is a general question and should not be answered with 
> probability arguments, but I would like to hear other statements on this 
> policy. After all this is a clear kernel bug and should be fixed. Recent 
> library implementations will trigger this bug.
> Also I would like to know whether we support any older library with newer 
> kernels. I guess there is no such promise (thinking of modutils, udev, ...)
> Is the stable syscall interface defined by documentation or by (possibly 
> buggy) de facto implementation?
>
>>
>>> Has this bug been known before?
>>
>> Yes (and you can argue whether it's a problem or not)
> OK, I will:
> 1. It's not documented, neither in the kernel nor in libnuma.
> 2. The default interface for large bitmaps (consisting of a number of longs) 
> is to pass the number of valid bits. A variant would be passing the highest 
> valid bit number. The number of bits plus one is not in the list.
> 3. There is a special case in the syscall interface for resetting the 
> policy. It says you need to pass either a NULL pointer or 0 for the number 
> of bits (along with MPOL_DEFAULT). This simply does not work. Instead you 
> have to pass a NULL pointer or _1_. Also that means that passing 1 
> intentionally triggers the special case.
> 3. libnuma changed the behavior from work-arounding to ignoring some 18 
> months or so ago. This bug will lead to the 64th node (or the 128th node, 
> the 192nd node, ...) being ignored. And please don't argue that nobody 
> will ever have 64 nodes...

FYI -
	cct405-1:~ # numactl --hardware
	available: 254 nodes (0-253)
	node 0 cpus: 0 1 2 3 1648 1649 1650 1651
	node 0 size: 14298 MB
	node 0 free: 13352 MB
	...
	node 253 cpus: 1640 1641 1642 1643 1644 1645 1646 1647 2904 2905 2906 2907 2908 2909 2910 2911
	node 253 size: 32752 MB
	node 253 free: 32229 MB

> 4. If one uses mbind() directly and lets the kernel do the masking by passing 
> the number of valid bits (and not the size of the buffer) then the last node 
> will always be masked off.
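
The special case from the list above (maxnode of 0 versus 1) can be sketched the same way. This is a hedged userspace model of the argument check as described in this thread, not the kernel code; classify_old(), classify_fixed() and the MODEL_* constants are hypothetical names, and the 4096-byte page size is an assumption.

```c
#include <errno.h>
#include <limits.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096UL                    /* assumed page size */
#define MODEL_MAX_BITS  (MODEL_PAGE_SIZE * CHAR_BIT)

/* Return values: -EINVAL (rejected), 0 (mask treated as empty, i.e. the
 * "reset policy" special case), 1 (mask bits would be read). */

/* Model of the behavior under discussion: decrement first. A maxnode of 0
 * wraps around to ULONG_MAX and is then rejected as too large. */
static int classify_old(const unsigned long *nmask, unsigned long maxnode)
{
    --maxnode;                                    /* unsigned wrap for 0 */
    if (maxnode == 0 || nmask == NULL)
        return 0;
    if (maxnode > MODEL_MAX_BITS)
        return -EINVAL;
    return 1;
}

/* Model of the proposed behavior: no decrement, so 0 means "no bits". */
static int classify_fixed(const unsigned long *nmask, unsigned long maxnode)
{
    if (maxnode == 0 || nmask == NULL)
        return 0;
    if (maxnode > MODEL_MAX_BITS)
        return -EINVAL;
    return 1;
}
```

In this model a non-NULL mask with maxnode 0 is rejected with -EINVAL by classify_old(), while maxnode 1 takes the empty-mask path; classify_fixed() treats 0 as the empty mask and reads one bit for maxnode 1.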
>
> So I strongly opt for fixing this by removing the line and maybe add some 
> documentation about the old behavior.
>
> Regards,
> Andre.
>
> -- 
> Andre Przywara
> AMD-Operating System Research Center (OSRC), Dresden, Germany
> Tel: +49 351 448-3567-12
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>



Thread overview: 6+ messages
2010-07-26  9:28 [PATCH] Fix off-by-one bug in mbind() syscall implementation Andre Przywara
2010-07-26  9:49 ` Andi Kleen
2010-07-26 10:23   ` Andre Przywara
2010-07-26 10:40     ` Andi Kleen
2010-07-29  8:50       ` Andre Przywara
2010-07-29 16:15         ` Jack Steiner [this message]
