public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* numa mm/mempolicy.c maxnode off by one
@ 2004-07-19  1:18 Paul Jackson
  2004-07-20  0:31 ` Paul Jackson
  0 siblings, 1 reply; 2+ messages in thread
From: Paul Jackson @ 2004-07-19  1:18 UTC (permalink / raw)
  To: Andi Kleen, linux-kernel

Andi,

As best as I can tell from code reading, the value of maxnode that is
passed in on the numa system calls mbind, get_mempolicy and
set_mempolicy is "off by one".  This seems to be undocumented, and so
far as I have yet been able to imagine, unjustified.

The kernel code in mm/mempolicy.c expects a value one larger than is
natural, and then decrements it, or it uses "maxnode-1" (sometimes one
way, sometimes the other, inconsistently).  Your libnuma user library
hides this off by one with a corresponding increment by one of maxnode. 
But we are not all using libnuma.  Some of us are using the actual
system calls, for our own nefarious purposes.

For example, on my 256 node system, with 256 bit user nodemasks, I have
to pass in a maxnode value of 257.  And the libnuma library, and numactl
command utility based on it, which are hardcoded for 2048 bit nodemasks,
always pass in a maxnode value of 2049.

Someday, someone is going to pass in some nice power of two value,
corresponding to the actual number of nodes in their nodemasks, not
expecting that their last node is being ignored.

Andi - could you explain how this came to be, and perhaps recommend a
way to fix it (or explain why it's not broken and shouldn't be fixed)?

I could propose ways to fix this without breaking any libnuma that
you've already shipped.  But first I should find out if my understanding
of the situation is correct.

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373

^ permalink raw reply	[flat|nested] 2+ messages in thread

* Re: numa mm/mempolicy.c maxnode off by one
  2004-07-19  1:18 numa mm/mempolicy.c maxnode off by one Paul Jackson
@ 2004-07-20  0:31 ` Paul Jackson
  0 siblings, 0 replies; 2+ messages in thread
From: Paul Jackson @ 2004-07-20  0:31 UTC (permalink / raw)
  To: ak, linux-kernel; +Cc: Paul Jackson

> This seems to be undocumented ...

Actually, documented, but with text that I find misleading.

And more over a hardcoded 64 in mm/mempolicy.c that could surprise
a 32 bit user by overwriting an additional long, unexpectedly.

Presume for example I have a 32 bit system (sizeof(long) == 4), and I
have 32 nodes, numbered 0 .. 31.  Further presume that the size of the
kernel's nodemask_t is 32 bits (4 bytes).  Finally, presume that I have
a user level nodemask that is essentially a single unsigned long (32
bits).

The get_mempolicy(2) man page (numactl-0.7-pre1) states:

    maxnode is the maximum bit number plus one that can be stored
    into nodemask.  The bit number is always rounded to an multiple
    of unsigned long.

Since the highest numbered bit I can store is bit number 31, I
would expect to specify maxnode == 32 on calls.

But if I read the code correctly:

 1) I'm off by one, and should pass 33.  See further the various
    maxnode-- and maxnode-1 expressions in mm/mempolicy.c.

 2) Actually, that's wrong too.  Since the kernel doesn't round up
    by a multiple of unsigned long, but a multiple of 64, in the
    source file mm/mempolicy.c of linux version 2.6.7-mm1:

	/* Copy a kernel node mask to user space */
	static int copy_nodes_to_user(unsigned long __user *mask,
				unsigned long maxnode, void *nodes, unsigned nbytes)
	{
        	unsigned long copy = ALIGN(maxnode-1, 64) / 8;

   I had better actually have a user nodemask of 64 bits, and pass in 65.

The mbind(2) and set_mempolicy(2) state this in different words,
but words which, to me, mean pretty much the same (wrong) thing:

    nodemask is pointer to a bit field of nodes that contains upto maxnode
    bits. The bit field size is rounded to the next multiple of
    sizeof(unsigned long), but the kernel will only use bits upto maxnode.

So, in addition to suggesting that the kernel not be decrementing or
subtracting one from the passed in value of maxnode (which may require
some hackery to avoid breaking shipped libraries), I also suggest that
the allignment in copy_nodes_to_user() be to the number of bits in an
unsigned long, not to a hardcoded 64.  I speculate that without this
last change, a user on a 32 bit system such as in my example above would
be surprised when the kernel responded to their get_mempolicy request by
overwriting 64 bits where they expected to get 32 bits.

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373

^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2004-07-20  0:32 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2004-07-19  1:18 numa mm/mempolicy.c maxnode off by one Paul Jackson
2004-07-20  0:31 ` Paul Jackson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox