public inbox for linux-kernel@vger.kernel.org
* NUMA API
@ 2004-04-30  7:35 Ulrich Drepper
  2004-04-30  8:30 ` William Lee Irwin III
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Ulrich Drepper @ 2004-04-30  7:35 UTC (permalink / raw)
  To: Linux Kernel

[-- Attachment #1: Type: text/plain, Size: 4077 bytes --]

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

In the last weeks I have been working on designing a new API for a NUMA
support library.  I am aware of the code in libnuma by ak but this code
has many shortcomings:

~ inadequate topology discovery
~ fixed cpu set size
~ no clear separation of memory nodes
~ no inclusion of SMT/multicore in the cpu hierarchy
~ awkward (at best) memory allocation interface
~ etc etc

and last but not least

~ a completely unacceptable library interface (e.g., global variables as
part of the API, WTF?)

At the end of the attached document is a comparison of the two APIs.


I'm only posting now about this since I wanted to get some sanity checks
of the API first.  Some of our (i.e., Red Hat's) partners provided this.
 They might identify themselves, or not.  This is not because other
parties are meant to be excluded.


The API described here is meant to be a minimal one which can be wrapped for
use in any kind of higher-level language (or even in another C library
using the interface).  For this reason the CPU and memory node sets are
not handled by an abstract data type but instead as bitmaps.  Using
abstract data types (in C) means restricting the way wrapper libraries
can be designed.  In a C++ wrapper, for instance, the bit sets certainly
should be abstract.  A later version of the attached document might try
to provide higher-level interfaces.


The text of the API proposal is not yet polished.  In fact, most
descriptions are fairly short.  I want to get some more assurance that
the API is received well before spending significantly more time on it.

As specified, the implementation of the interface is designed with only
the requirements of a program on NUMA hardware in mind.  I have paid no
attention to the currently proposed kernel extensions.  If the latter do
not really allow implementing the functionality programmers need then it
is wasted effort.

For instance, I think the way memory is allocated in interleaved fashion
is not "ideal".  Interleaved allocation is a property of a specific
allocation.  Global states for processes (or threads) are a terrible way
to handle this and other properties since it requires the programmer to
constantly switch the mode back and forth since any part of the runtime
might be NUMA aware and reset the mode.

Also, the concept of hard/soft sets for CPUs is useful.  Likewise
"spilling" over to other memory nodes.  Usually using NUMA means hinting
the desired configuration to the system.  It'll be used whenever
possible.  If it is not possible (for instance, if a given processor is
not available) it is mostly not a good idea to completely fail the
execution.  Instead a less optimal resource should be used.  For memory
it is hard to know how much memory on which node is in use etc.

Another missing feature in libnuma and the current kernel design is
support for changes in the configuration.  CPUs might be added or
removed, likewise memory.  Additional interconnects between NUMA blocks
might be added etc.


Overall I think the proposed API provides an architecture-independent,
future-safe NUMA API.  If no program uses the kernel functionality
directly (which is possible with the API) the kernel interface can be
changed and adapted for each architecture or even specific machine
without the program noticing it.


The selection of names for the functions is by no means fixed.  These
are proposals.  I'm open for constructive criticism.  In case you find
interfaces to be missing or wrong or not optimal, please let me know as
well.  Once the API is regarded useful we can start thinking about the
kernel interface so keep these two things separated.


Please direct comments to me.  In case there is interest I can set up a
separate mailing list since lkml is probably not the best venue.

- --
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)

iD8DBQFAkgG92ijCOnn/RHQRAqgUAJ9bJ83LxSZ43TW5+5I1VhXV+zRPNACgnjmQ
SnFjDhA7v+5CGaZO5/jOxhw=
=93mp
-----END PGP SIGNATURE-----

[-- Attachment #2: numa-if --]
[-- Type: text/plain, Size: 24459 bytes --]

		      Thoughts about a NUMA API

Ulrich Drepper
Red Hat, Inc.
Time-stamp: <2004-04-29 01:18:32 drepper>

*** Very early draft.  I'll clean up the interface when I get some positive
*** feedback.


The technology used in NUMA machines is still evolving which means
that any proposed interface will fall short over time.  We cannot
think about every possibility and nuance the hardware designers come
up with.  The following is a list of assumptions made for this
document.  Some assumptions will be too general for some
implementations which allows simplification.  But the interfaces should
cover more designs.

1.  Non-uniform resources are processors and memory

2.  The address spaces of processors overlap

3.  Possible measure: distance of processors

    The distance is measured by the minimal difference of cost of
    accessing memory.

    ~ SMT and multi-core (MC) processors share some processor cache;

    ~ processors on the same SMP node (which might just be one
      processor in size) have the same distance, which is larger
      than SMT/MC distance

    ~ processors on different NUMA nodes increase in distance with each
      interconnect which has to be used.

4.  The machine's architecture can change over time.

    ~ hotplug CPUs/RAM

    ~ dynamically enabling/disabling parts of the machine based on
      resource requirements


Example
=======

  +----------------------------------------+
  | +--------------+      +--------------+ |
  | | +--+    +--+ |      | +--+    +--+ | |
  | | |T1|    |T1| |      | |T1|    |T1| | |
  | | +--+    +--+ |      | +--+    +--+ | |
  | | |T2|    |T2| |      | |T2|    |T2| | |
  | | +--+    +--+ |      | +--+    +--+ | |
  | |  C1      C2  |      |  C1      C2  | |
  | +------P1------+      +------P2------+ |
  |                                        |
  | +------------------------------------+ |
  | |                 M1                 | |
  | +------------------------------------+ |
  +-------------------N1-------------------+

  +----------------------------------------+
  | +--------------+      +--------------+ |
  | | +--+    +--+ |      | +--+    +--+ | |
  | | |T1|    |T1| |      | |T1|    |T1| | |
  | | +--+    +--+ |      | +--+    +--+ | |
  | | |T2|    |T2| |      | |T2|    |T2| | |
  | | +--+    +--+ |      | +--+    +--+ | |
  | |  C1      C2  |      |  C1      C2  | |
  | +------P1------+      +------P2------+ |
  |                                        |
  | +------------------------------------+ |
  | |                 M2                 | |
  | +------------------------------------+ |
  +-------------------N2-------------------+

  +----------------------------------------+
  | +--------------+      +--------------+ |
  | | +--+    +--+ |      | +--+    +--+ | |
  | | |T1|    |T1| |      | |T1|    |T1| | |
  | | +--+    +--+ |      | +--+    +--+ | |
  | | |T2|    |T2| |      | |T2|    |T2| | |
  | | +--+    +--+ |      | +--+    +--+ | |
  | |  C1      C2  |      |  C1      C2  | |
  | +------P1------+      +------P2------+ |
  |                                        |
  | +------------------------------------+ |
  | |                 M3                 | |
  | +------------------------------------+ |
  +-------------------N3-------------------+

These are three NUMA blocks, each consisting of two SMP processors,
each of which has two cores which in turn have two threads.  We use
the notation T1:C2:P2:N3 for the first thread, in the second core, in
the second processor, on the third node.  The main memory in the nodes
is represented by M1, M2, M3.

A simplistic measure in this case could be: requiring to access the
next level of memory doubles the cost.  So we might have the following
costs:

  T1:C1:P1:N1 <-> T2:C1:P1:N1   ==  1
  T1:C1:P1:N1 <-> T1:C2:P1:N1   ==  2
  T1:C1:P1:N1 <-> T1:C1:P2:N1   ==  4
  T1:C1:P1:N1 <-> T1:C1:P1:N2   ==  8
  T1:C1:P1:N1 <-> T1:C1:P1:N3   ==  16 (i.e., 2 * 8 since two interconnects
                                        are used)

It might be better to compute the distance based on real memory access
costs.  The above is just an example.
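
The cost table above can be reproduced with a few lines of C.  This is
only a sketch: the struct, the function name and the assumption that the
nodes form a chain (so the number of interconnects used is the difference
of the node numbers) are not part of the proposal.

```c
#include <assert.h>
#include <stdlib.h>

/* Position of a hardware thread in the example machine, written
   T:C:P:N in the text.  All fields are 1-based.  Illustration only. */
struct hw_thread {
  int thread;  /* T1 or T2 */
  int core;    /* C1 or C2 */
  int proc;    /* P1 or P2 */
  int node;    /* N1 .. N3 */
};

/* Cost of sharing data between two hardware threads under the
   "each level doubles the cost" rule.  Assumes the nodes form a chain,
   so the number of interconnects used is |a.node - b.node|. */
static int cpu_distance(struct hw_thread a, struct hw_thread b)
{
  assert(a.node >= 1 && b.node >= 1);
  if (a.node != b.node)
    return 8 * abs(a.node - b.node);  /* 8 per interconnect used */
  if (a.proc != b.proc)
    return 4;                         /* same node, other processor */
  if (a.core != b.core)
    return 2;                         /* same processor, other core */
  if (a.thread != b.thread)
    return 1;                         /* SMT siblings */
  return 0;                           /* same hardware thread */
}
```

A real implementation would more likely derive these values from measured
access costs, as noted above.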

The above costs automatically take into account when the main memory
of a node has to be used or when data is shared in caches.

A second cost does not take the sharing of data between processors
into account but instead measures access to data stored in a specific
memory node.

  T1:C1:P1:N1 ->  M1            == 4
  T1:C1:P1:N1 ->  M2            == 8
  T1:C1:P1:N1 ->  M3            == 16

This cost can be derived from the more detailed CPU-to-CPU cost but
since there can be memory nodes without CPUs and often it is not
sharing data between CPUs (but instead access to stored memory) which
is important, this simplified cost is useful, too.
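
The simplified CPU-to-memory cost can be sketched the same way.  The
function name is hypothetical, and the rule used here (local memory costs
4, every interconnect doubles it, nodes form a chain) is just one reading
that matches the three values listed above.

```c
#include <assert.h>
#include <stdlib.h>

/* Cost for a CPU on node CPU_NODE to access memory on node MEM_NODE
   (both 1-based).  Local main memory costs 4; each interconnect used
   doubles the cost.  Assumes the nodes form a chain. */
static int mem_cost(int cpu_node, int mem_node)
{
  assert(cpu_node >= 1 && mem_node >= 1);
  int hops = abs(cpu_node - mem_node);
  return 4 << hops;
}
```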


Interfaces
==========

The interfaces can be grouped:

1. Topology.  Programs need to know about the machine's layout.

2. Placement/affinity

   ~ of execution
   ~ of memory allocation

3. Realignment: adjust placement/affinity to new situation

4. Temporal changes



Topology Interfaces
-------------------

Two different types of information must be accessible:

1.  enumeration of the memory hierarchies

    This includes SMT/MC

2.  distance


The fundamental data type is a bitset with each bit representing a
processor.  glibc defines cpu_set_t.  The size is arbitrarily large.
We might introduce interfaces to dynamically allocate them.  For now,
cpu_set_t is a fixed-size type.

CPU_SETSIZE                        number of processors in cpu_set_t

CPU_SET_S(cpu, setsize, cpuset)    set bit corresponding to CPU in CPUSET
CPU_CLR_S(cpu, setsize, cpuset)    clear bit corresponding to CPU in CPUSET
CPU_ISSET_S(cpu, setsize, cpuset)  check whether bit corresponding to CPU is set
CPU_ZERO_S(setsize, cpuset)        clear set

CPU_EQUAL_S(setsize1, cpuset1, setsize2, cpuset2)
                                   Check whether the set bits in the two
                                   sets match.
CPU_EQUAL(cpuset1, cpuset2)  CPU_EQUAL_S(sizeof(cpu_set_t), cpuset1,
                                         sizeof(cpu_set_t), cpuset2)


CPU_SET(cpu, cpuset)         CPU_SET_S(cpu, sizeof(cpu_set_t), cpuset)
CPU_CLR(cpu, cpuset)         CPU_CLR_S(cpu, sizeof(cpu_set_t), cpuset)
CPU_ISSET(cpu, cpuset)       CPU_ISSET_S(cpu, sizeof(cpu_set_t), cpuset)
CPU_ZERO(cpuset)             CPU_ZERO_S(sizeof(cpu_set_t), cpuset)


We probably need the following:

CPU_AND_S(destsize, destset, set1size, srcset1, set2size, srcset2)
   logical AND of srcset1 and srcset2, place result in destset.  destset
   might be one of the source sets

CPU_OR_S(destsize, destset, set1size, srcset1, set2size, srcset2)
   logical OR of srcset1 and srcset2, place result in destset.  destset
   might be one of the source sets

CPU_XOR_S(destsize, destset, set1size, srcset1, set2size, srcset2)
   logical XOR of srcset1 and srcset2, place result in destset.  destset
   might be one of the source sets
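
A minimal sketch of how an operation with three independent sizes could
behave.  Everything prefixed ex_ is hypothetical, ex_mask_t stands in for
the array element type, and treating the missing words of a shorter set
as zero is an assumption; the proposal does not say how sets of different
sizes combine.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

typedef unsigned long ex_mask_t;   /* hypothetical array element type */
#define EX_BITS_PER_MASK (sizeof(ex_mask_t) * CHAR_BIT)

/* Set the bit for CPU in a set of SIZE bytes; out-of-range CPUs are
   silently ignored. */
static void ex_set(size_t cpu, size_t size, ex_mask_t *set)
{
  if (cpu < size * CHAR_BIT)
    set[cpu / EX_BITS_PER_MASK] |= (ex_mask_t)1 << (cpu % EX_BITS_PER_MASK);
}

/* Nonzero if the bit for CPU is set in a set of SIZE bytes. */
static int ex_isset(size_t cpu, size_t size, const ex_mask_t *set)
{
  if (cpu >= size * CHAR_BIT)
    return 0;
  return (set[cpu / EX_BITS_PER_MASK] >> (cpu % EX_BITS_PER_MASK)) & 1;
}

/* Word I of a set of SIZE bytes; words past the end read as zero, so a
   short set behaves like one padded with zeros. */
static ex_mask_t ex_word(size_t size, const ex_mask_t *set, size_t i)
{
  return i < size / sizeof(ex_mask_t) ? set[i] : 0;
}

/* dest = src1 & src2.  All sizes are in bytes.  DEST may be one of the
   sources because each word is read before it is written. */
static void ex_and_s(size_t destsize, ex_mask_t *dest,
                     size_t size1, const ex_mask_t *src1,
                     size_t size2, const ex_mask_t *src2)
{
  assert(dest != NULL);
  for (size_t i = 0; i < destsize / sizeof(ex_mask_t); i++)
    dest[i] = ex_word(size1, src1, i) & ex_word(size2, src2, i);
}
```

The OR and XOR variants would differ only in the operator used in the
loop.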

For dynamic allocation:

__cpu_mask             type for array element in cpu_set_t

CPU_ALLOC_SIZE(count)  number of bytes needed to represent cpu_set_t
                       which can at least represent CPU number COUNT

CPU_ALLOC(count)       allocate cpu_set_t which can represent at least
                       CPU number COUNT

CPU_FREE(cpuset)       free CPU set previously allocated with CPU_ALLOC()
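
For comparison, current glibc ships dynamically sized sets under the same
macro names, although with a single size argument and pointer arguments
rather than the signatures proposed here.  The helper below is a sketch
using those glibc macros; count_common_cpus is a made-up name.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <assert.h>
#include <stddef.h>

/* Count the CPUs that appear in both lists A and B, using sets large
   enough for CPU numbers below NCPUS.  Returns -1 if allocation fails. */
static int count_common_cpus(int ncpus,
                             const int *a, size_t na,
                             const int *b, size_t nb)
{
  assert(ncpus > 0);
  size_t size = CPU_ALLOC_SIZE(ncpus);   /* bytes per set */
  cpu_set_t *sa = CPU_ALLOC(ncpus);
  cpu_set_t *sb = CPU_ALLOC(ncpus);
  int result = -1;

  if (sa != NULL && sb != NULL) {
    CPU_ZERO_S(size, sa);
    CPU_ZERO_S(size, sb);
    for (size_t i = 0; i < na; i++)
      CPU_SET_S(a[i], size, sa);
    for (size_t i = 0; i < nb; i++)
      CPU_SET_S(b[i], size, sb);
    CPU_AND_S(size, sa, sa, sb);         /* sa = sa & sb */
    result = CPU_COUNT_S(size, sa);
  }
  if (sa != NULL)
    CPU_FREE(sa);
  if (sb != NULL)
    CPU_FREE(sb);
  return result;
}
```

With NCPUS set to 4096 the sets are four times the size of the fixed
cpu_set_t, which is the case the variable-size interface is meant for.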


Maybe interfaces to iterate over set are useful (C++ interface).


A similar type is defined for the representation of memory nodes.
Each processor is associated with one memory node and each memory node
can have zero or more processors associated.

memnode_set_t        basic type

MEMNODE_SET_S(node, memnodesize, memnodeset)
                    set bit corresponding to NODE in MEMNODESET
MEMNODE_CLR_S(node, memnodesize, memnodeset)
                    clear bit corresponding to NODE in MEMNODESET
MEMNODE_ISSET_S(node, memnodesize, memnodeset)
                    check whether bit corresponding to NODE is set
MEMNODE_ZERO_S(memnodesize, memnodeset)        clear set

MEMNODE_EQUAL_S(setsize1, memnodeset1, setsize2, memnodeset2)
                                   Check whether the set bits in the two
                                   sets match.

We probably need the following:

MEMNODE_AND_S(destsize, destset, set1size, srcset1, set2size, srcset2)
   logical AND of srcset1 and srcset2, place result in destset.  destset
   might be one of the source sets

MEMNODE_OR_S(destsize, destset, set1size, srcset1, set2size, srcset2)
   logical OR of srcset1 and srcset2, place result in destset.  destset
   might be one of the source sets

MEMNODE_XOR_S(destsize, destset, set1size, srcset1, set2size, srcset2)
   logical XOR of srcset1 and srcset2, place result in destset.  destset
   might be one of the source sets

For dynamic allocation:

__memnode_mask         type for array element in memnode_set_t

MEMNODE_ALLOC_SIZE(count)  see CPU_ALLOC_SIZE()

MEMNODE_ALLOC(count)       see CPU_ALLOC()

MEMNODE_FREE(memnodeset)   see CPU_FREE()


To determine the topology:


int NUMA_cpu_count(unsigned int *countp)

  Return in *COUNTP the number of online CPUs.  The
  sysconf(_SC_NPROCESSORS_ONLN) information might be sufficient, though.
  Returning an error can signal that the NUMA support is not present.


int NUMA_cpu_all(size_t destsize, cpu_set_t dest)

  Set bits for all (currently) available CPUs.


int NUMA_cpu_self(size_t destsize, cpu_set_t dest)

  Set bit for current processor

int NUMA_cpu_self_idx(void)

  Return index in cpu_set_t for current processor.


int NUMA_cpu_at_level(size_t destsize, cpu_set_t dest, size_t srcsize,
                      const cpu_set_t src, int level)

  Fill DEST with the bitmap which has all the bits corresponding to
  processors which are (currently) LEVEL or less levels away from any
  processor in SRC.

  In the simplest case one bit is set in SRC.  Level 1 might be used
  to find out all SMT siblings.  If more than one bit is set the
  search is started from all of them.

  NB: this is "level", not "distance".  Since the distance could be
  relative to the access cost there need not be sequential values which
  can be used in iteration.  With this interface we can go on incrementing
  level until no further processor is found which could be signalled
  by a return value.


int NUMA_cpu_distance(int *minp, int *maxp, size_t setsize,
                      const cpu_set_t set)

  Determine the minimum and maximum distance between nodes in SET.

  This is the distance which is a measure for the cost of sharing
  memory.

  Usually two bits are set.  If more bits are set the spread between min
  and max is useful.


int NUMA_mem_main_level(int cpuidx, int *levelp)

  Return in *LEVELP the level where the local main memory for processor
  CPUIDX is.


int NUMA_memnode_count(unsigned int *countp)

  Return in *COUNTP the number of online memnodes.


int NUMA_memnode_all(size_t destsize, memnode_set_t dest)

  Set bits for all (currently) available memnodes.


int NUMA_cpu_to_memnode(size_t cpusetsize, const cpu_set_t cpuset,
                        size_t memnodesize, memnode_set_t memnodeset)

  Set bits in MEMNODESET which correspond to memory nodes which are local
  to any of the CPUs represented by bits set in CPUSET.


int NUMA_memnode_to_cpu(size_t memnodesize, const memnode_set_t memnodeset,
                        size_t cpusetsize, cpu_set_t cpuset)

  Set bits in CPUSET which correspond to CPUs which are local
  to any of the memory nodes represented by bits set in MEMNODESET.


int NUMA_mem_distance(int *minp, int *maxp, void *ptr, size_t setsize,
                      const memnode_set_t set)

  Determine the minimum and maximum level difference to the memory pointed
  to by PTR from any of the CPUs in SET.

  Usually one bit is set in SET.

  If the difference between *MINP and the value returned from
  NUMA_mem_main_level() is zero, the memory is local to at least one CPU
  in set.  If the difference between *MAXP and the NUMA_mem_main_level()
  value is zero, the memory is local to all CPUs.


int NUMA_cpu_mem_cost(int *minp, int *maxp, size_t cpusetsize,
                      const cpu_set_t cpuset, size_t memsetsize,
                      const memnode_set_t memnodeset)

  Compute minimum and maximum access costs of processors in CPUSET to
  any of the memory nodes in MEMNODESET.


Example: Determine CPUs on neighbor "nodes"

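  cpu_set_t level0;
  CPU_ZERO(level0);
  CPU_SET(the_cpu, level0);

  cpu_set_t levelN;
  NUMA_cpu_at_level(sizeof(levelN), levelN, sizeof(level0), level0, N);

  cpu_set_t levelNp1;
  NUMA_cpu_at_level(sizeof(levelNp1), levelNp1, sizeof(level0), level0,
                    N + 1);

  CPU_XOR_S(sizeof(levelNp1), levelNp1, sizeof(levelNp1), levelNp1,
            sizeof(levelN), levelN);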

Given a CPU index, the CPUs at level N are determined, then those at
level N+1.  The difference (XOR) is the set of processors at level N+1
from the given CPU.
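
To make the level arithmetic concrete, here is a self-contained mock for
the three-node example machine.  None of this is the proposed interface:
sets are plain 32-bit masks, the ex_ names are invented, and the CPU index
layout and the level numbering (1 = SMT siblings, 2 = same processor,
3 = same node, one more per interconnect) are assumptions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define EX_NCPUS 24   /* 3 nodes x 2 processors x 2 cores x 2 threads */

/* Number of levels between two CPU indices.  Assumed index layout:
   bit 0 = thread, bit 1 = core, bit 2 = processor, rest = node. */
static int ex_level_between(int a, int b)
{
  if (a == b)
    return 0;
  if ((a >> 1) == (b >> 1))
    return 1;                              /* same core: SMT siblings */
  if ((a >> 2) == (b >> 2))
    return 2;                              /* same processor */
  if ((a >> 3) == (b >> 3))
    return 3;                              /* same node */
  return 3 + abs((a >> 3) - (b >> 3));     /* one more per interconnect */
}

/* Mock of NUMA_cpu_at_level(): all CPUs that are LEVEL or fewer levels
   away from any CPU in SRC. */
static uint32_t ex_cpu_at_level(uint32_t src, int level)
{
  uint32_t dest = 0;
  for (int a = 0; a < EX_NCPUS; a++) {
    if (!((src >> a) & 1))
      continue;
    for (int b = 0; b < EX_NCPUS; b++)
      if (ex_level_between(a, b) <= level)
        dest |= UINT32_C(1) << b;
  }
  return dest;
}

/* CPUs exactly LEVEL levels away from those in SRC: the XOR of two
   neighbouring level sets. */
static uint32_t ex_ring(uint32_t src, int level)
{
  assert(level >= 1);
  return ex_cpu_at_level(src, level) ^ ex_cpu_at_level(src, level - 1);
}

/* Keep incrementing the level until the set stops growing. */
static int ex_max_level(uint32_t src)
{
  int level = 0;
  while (ex_cpu_at_level(src, level + 1) != ex_cpu_at_level(src, level))
    level++;
  return level;
}
```

Starting from the first thread on the first node, the rings are the SMT
sibling, the other core, the other processor, and then one whole node per
additional level.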



Memory Information
------------------


It is necessary to know something about the memory at a given level.
For instance, level 1 might be "level 1 CPU cache", level 4 might be
"main memory".

int NUMA_mem_info_size(int level, int cpuidx, NUMA_size_t *size)
int NUMA_mem_info_associativity(int level, int cpuidx, NUMA_size_t *size)
int NUMA_mem_info_linesize(int level, int cpuidx, NUMA_size_t *size)

*_size applies to all kinds of memory.  _associativity and _linesize
mainly apply to caches.  Maybe it's useful for main memory, too.  If not
an error could be returned.


int NUMA_mem_total(int memnodeidx, NUMA_size_t *size)
int NUMA_mem_avail(int memnodeidx, NUMA_size_t *size)

The total memory and available memory on memory node MEMNODEIDX.


Placement/Affinity Interfaces
-----------------------------

int NUMA_mem_set_home(pid_t pid, size_t setsize, memnode_set_t set)

  Install SET as mask of preferred nodes for memory allocation for process
  PID.  This applies only to directly attached memory (NUMA_mem_main_level()).
  If more than one bit is set in SET the memory allocation can be spread
  across all the local memory for the CPUs in the set.

int NUMA_mem_get_home(pid_t pid, size_t setsize, memnode_set_t set)

  Return currently installed preferred node set.


int NUMA_mem_set_home_thread(pthread_t th, size_t setsize, memnode_set_t set)

  Similar, but limited to the given thread.

int NUMA_mem_get_home_thread(pthread_t th, size_t setsize, memnode_set_t set)

  Likewise to retrieve the information.


int NUMA_aff_set_cpu(pid_t pid, size_t setsize, cpu_set_t set, int hard)

  Set affinity mask for process PID to the processors in SET.  There are
  two masks: the hard and the soft.  No processor not in the hard mask
  can ever be used.  The soft mask is a recommendation.

int NUMA_aff_get_cpu(pid_t pid, size_t setsize, cpu_set_t set, int hard)

  The corresponding interface to get the data.

int NUMA_aff_set_cpu_thread(pthread_t th, size_t setsize, cpu_set_t set,
                            int hard)

  Similar to NUMA_aff_set_cpu() but for the given thread.

int NUMA_aff_get_cpu_thread(pthread_t th, size_t setsize, cpu_set_t set,
                            int hard)

  Get the data.


The "hard" variants are basically the existing sched_setaffinity and
pthread_setaffinity.  The soft and hard maps are maintained separately.
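
One way to read the hard/soft semantics, as a sketch only: the scheduler
intersects the online CPUs with the hard mask, then prefers the soft mask
if anything is left and otherwise falls back to the hard mask.  The
function name and the use of 64-bit masks are assumptions made for
brevity.

```c
#include <assert.h>
#include <stdint.h>

/* Pick the CPUs a thread may run on.  ONLINE, HARD and SOFT are bit
   masks with one bit per CPU.  Returns 0 if nothing is permitted. */
static uint64_t ex_effective_cpus(uint64_t online, uint64_t hard,
                                  uint64_t soft)
{
  uint64_t allowed = online & hard;      /* the hard mask is never exceeded */
  uint64_t preferred = allowed & soft;   /* the soft mask is only a hint */
  return preferred != 0 ? preferred : allowed;
}
```

If the preferred CPU goes offline, execution continues on the remaining
CPUs of the hard mask instead of failing.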


void *NUMA_mem_alloc_local(NUMA_size_t size, int spill, int interleave)

  Allocate SIZE bytes local to the current process, regardless of the
  registered preferred memory node mask.  Unless SPILL is nonzero the
  allocation fails if no memory available locally.  If SPILL is nonzero
  memory at greater distances is considered.  This is a convenience
  interface, it could be implemented using NUMA_mem_alloc() below.

  Possible extension: SPILL could specify how far away the memory can
  be spilled.  For instance, the value 1 could mean one NUMA node away,
  2 for up to 2 NUMA nodes away etc.

  If INTERLEAVE is nonzero the memory is allocated in interleaved form
  from all the nodes specified.  Otherwise all memory comes from one node.
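
The spill rule, including the possible extension where SPILL is a maximum
distance, can be sketched as a node selection function.  The name, the
avail[] array and the assumption that the distance between nodes i and j
is |i - j| are all made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Choose the node to satisfy an allocation of SIZE bytes for a thread
   running on node LOCAL.  AVAIL[i] is the free memory on node i.
   SPILL is the maximum number of nodes away the allocation may go
   (0 = local only).  Returns the node index, or -1 on failure. */
static int ex_pick_node(const size_t *avail, int nnodes, int local,
                        size_t size, int spill)
{
  assert(local >= 0 && local < nnodes && spill >= 0);
  if (spill > nnodes)
    spill = nnodes;                /* no node is farther away than this */
  for (int dist = 0; dist <= spill; dist++) {
    int lo = local - dist;
    int hi = local + dist;
    if (lo >= 0 && avail[lo] >= size)
      return lo;
    if (hi < nnodes && avail[hi] >= size)
      return hi;
  }
  return -1;
}
```

Closer nodes are always tried first, so a nonzero SPILL only changes the
result when the local node cannot satisfy the request.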


void *NUMA_mem_alloc_preferred(NUMA_size_t size, int spill, int interleave)

  Allocate memory according to the mask registered with NUMA_mem_set_home
  or NUMA_mem_set_home_thread.  SPILL and INTERLEAVE are handled as in
  NUMA_mem_alloc_local.


void *NUMA_mem_alloc(NUMA_size_t size, size_t setsize, memnode_set_t set,
                     int spill, int interleave)

  Allocate memory on any of the nodes in SET


What chunks of memory can be allocated is debatable.  It might make sense
to restrict all sizes to page size granularity.  Or at least round all
values up.
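
If sizes are rounded up to page granularity, an interleaved allocation
could place consecutive pages round-robin on the requested nodes.  A
sketch with invented names; the round-robin order is an assumption, the
text only says the memory comes "from all the nodes specified".

```c
#include <assert.h>
#include <stddef.h>

/* Round SIZE up to a multiple of PAGESIZE, which must be a power of
   two. */
static size_t ex_round_up(size_t size, size_t pagesize)
{
  assert(pagesize != 0 && (pagesize & (pagesize - 1)) == 0);
  return (size + pagesize - 1) & ~(pagesize - 1);
}

/* Node that backs page number PAGE of an interleaved allocation spread
   round-robin over the NNODES entries of NODES. */
static int ex_interleave_node(size_t page, const int *nodes, size_t nnodes)
{
  assert(nnodes > 0);
  return nodes[page % nnodes];
}
```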

??? Should the granularity be configurable ???


void NUMA_mem_free(void *)

  Obviously, free the memory.


int NUMA_mem_get_nodes(void *addr, size_t destsize, memnode_set_t dest)

  The function will set the bits in DEST which represent processors
  which are local to the memory pointed to by ADDR.


int NUMA_mem_bind(void *addr, size_t size, size_t setsize, memnode_set_t set,
                  int spill)

  The memory in the range of [addr,addr+size) in the current process is bound
  to one of the nodes represented in SET.  Unless SPILL is nonzero  the
  call will fail if no memory is available on the nodes.


int NUMA_mem_get_nodes(void *addr, size_t size,
                       size_t destsize, memnode_set_t dest)

  The function returns information about the nodes on which the memory
  in the range [addr,addr+size) is allocated.  If the memory is not continuously
  allocated (or in case of multi-threaded or multi-core processors) this can
  mean more than one bit is set in the result set.


Realignment
-----------

CPU sets can be realigned at any time using NUMA_aff_set_cpu() etc.


void *NUMA_mem_relocate(void *ptr, size_t setsize, memnode_set_t set)

  Relocate the content of the memory pointed to by PTR to a node in SET.
  Return the new address.


Temporal Changes
----------------

The machine configuration can change over time.  New processors can
come online, others go offline, memory banks are switched on or off.
The above interfaces return information about the currently active
configuration.  There is possibly the danger that data sets from
different configurations are used.

One solution would be to require an open()-like function which
retrieves all the information in one step and all the interfaces
mentioned above will use that cached data.  The problem with this is
that if the configuration changes the decision made using the cached
data is outdated.  Second, the amount of data which is needed can be
big or, more likely, expensive to get even though only parts of the
information are used.

A different possibility would be to provide a simple callback which
returns a unique ID for each configuration.  Any use of the topology
interfaces would then start and end with a call to the function to get the
ID.  If the two values differ, the collected data is inconsistent.
This would eliminate the second problem mentioned above, but not the
first.

A third possibility is to register a signal handler with the kernel so
that the kernel can send a signal whenever the configuration changes.
Alternatively, a /proc/file or netlink socket could be used to signal
interested parties (who then could send a signal if necessary).  Using
d-bus is possible, too.  This notification could not only be used to
notice changes while reading the topology, it could also get a process
at any time to reconsider the current decision and reorganize the
processor/memory usage.

From these possibilities the d-bus route seems to be the most
appealing since d-bus already receives this kind of information from
the kernel and any number of processes can receive them.


Comparison with libnuma
=======================

nodemask_t:

  Unlike nodemask_t, cpu_set_t is already in use in glibc.  The affinity
  interfaces use it so there is no need to duplicate the functionality
  and no need to define other versions of the affinity interfaces.

  Furthermore, the nodemask_t type is of fixed size.  The cpu_set_t
  has a convenience version which is of fixed size but can be of
  arbitrary size.  This is important as a bit of math shows:

    Assume a processor with four cores and 4 threads each

    Four such processors on a single NUMA node

    That's a total of 64 virtual processors for one node.  With 16 such
    nodes the 1024 processors of cpu_set_t would be filled.  And we do
    not want to mention the total of 64 supported processors in libnuma's
    nodemask_t.  To be future safe the bitset size must be variable.


  In addition there is the type memnode_set_t which represents memory nodes.
  It is possible to have memory nodes without processors so only a cpu_set_t
  is not sufficient.


nodemask_zero()  -->  CPU_ZERO() which is already in glibc
nodemask_set()   -->  CPU_SET() ditto
nodemask_clr()   -->  CPU_CLR() ditto
nodemask_isset() -->  CPU_ISSET() ditto

nodemask_equal() -->  CPU_EQUAL()

Plus the appropriate macros to handle memnode_set_t.


numa_available() -->  NUMA_cpu_count()  for instance

numa_max_node()  -->  either NUMA_cpu_count()
                      or NUMA_cpu_all()

numa_homenode()  -->  NUMA_mem_get_home() or NUMA_aff_get_cpu()
                      or NUMA_aff_get_cpu_thread() or NUMA_cpu_self()

  The concept of a never-changing home node strikes me as odd.  Especially
  with hot-swap CPUs.  Declaring one or more CPUs the home nodes is fine.
  The default can be the CPU the thread started on.


numa_node_size() --> NUMA_mem_avail()

  The main memory is at level NUMA_mem_main_level()

numa_pagesize()  -->  nothing yet since useless

  It is not clear to me what this really should do.  I.e., the interface
  of numa_pagesize() seems useless.  With no argument, the pagesize which
  can be determined is the pagesize of the system.  When hugepages etc
  come into play it is necessary to provide a pointer to a memory address
  so it can be determined which kind of memory it is.

??? Should we add NUMA_size_t NUMA_pagesize(void *addr) ???


numa_all_nodes --> global variables are *EVIL*

  Use NUMA_cpu_all()

numa_no_nodes  --> global variables are *EVIL*

  cpu_set_t s;
  CPU_ZERO(s);

numa_bind()    --> NUMA_mem_set_home() or NUMA_mem_set_home_thread()
                   or NUMA_aff_set_cpu() or NUMA_aff_set_cpu_thread()

  numa_bind() misses A LOT of flexibility.  First, memory and CPU need
  not be the same nodes. Second, thread handling is missing.  Third,
  hard versus soft requirements are not handled for CPU usage.


numa_set_interleave_mask() --> see comment
numa_get_interleave_mask() --> see comment
numa_get_interleave_node() --> see comment
numa_alloc_interleaved_subset() --> see comment
numa_alloc_interleaved() --> see comment
numa_interleave_memory() --> see comment

  I do not think that interleaving should be a completely separate mechanism
  next to normal memory allocation.  Instead it is a logical extension of
  memory allocation.  Interleaving is a parameter for the memory allocation
  functions like NUMA_mem_alloc().


numa_set_homenode()  -->  NUMA_mem_set_home() or NUMA_aff_set_cpu()
                          or NUMA_aff_set_cpu_thread() or NUMA_cpu_self()

numa_set_localalloc() --> NUMA_mem_set_home() or NUMA_mem_set_home_thread()


numa_set_membind()   --> NUMA_mem_bind()

numa_get_membind()   --> NUMA_mem_get_nodes()


numa_alloc_onnode()  -->  NUMA_mem_alloc()

numa_alloc_local()   -->  NUMA_mem_alloc_local()

numa_alloc()         -->  NUMA_mem_alloc_preferred()

numa_free()          -->  NUMA_mem_free()

numa_tonode_memory() -->  NUMA_mem_relocate()

numa_setlocal_memory() -->  NUMA_mem_relocate()

numa_police_memory()  -->  nothing yet

  I don't see why this is necessary.  Yes, address space allocation and
  the actual allocation of memory are two steps.  But this should be
  taken care of by the allocation functions (if necessary).  To support
  memory allocation with other interfaces than those described here and
  magically treat them in the "NUMA-way" seems dumb.


numa_run_on_node_mask() --> NUMA_aff_set_cpu() or NUMA_aff_set_cpu_thread()

numa_run_on_node() --> NUMA_aff_set_cpu() or NUMA_aff_set_cpu_thread()


numa_set_bind_policy() --> too coarse grained

  This cannot be a process property.  And it must be possible to change
  it from another thread, so the interface is completely broken.  Besides,
  it seems much more useful to differentiate between hard and soft masks
  since this allows, if necessary, to spill over to other nodes.  The
  NUMA_aff_set_cpu() and NUMA_aff_set_cpu_thread() allow specifying
  two masks.


* Re: NUMA API
  2004-04-30  7:35 NUMA API Ulrich Drepper
@ 2004-04-30  8:30 ` William Lee Irwin III
  2004-05-03 18:37   ` Ulrich Drepper
  2004-04-30  8:49 ` Paul Jackson
  2004-05-03 12:48 ` NUMA API - wish list Zoltan Menyhart
  2 siblings, 1 reply; 9+ messages in thread
From: William Lee Irwin III @ 2004-04-30  8:30 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: Linux Kernel

On Fri, Apr 30, 2004 at 12:35:26AM -0700, Ulrich Drepper wrote:
> In the last weeks I have been working on designing a new API for a NUMA
> support library.  I am aware of the code in libnuma by ak but this code
> has many shortcomings:
> ~ inadequate topology discovery
> ~ fixed cpu set size
> ~ no clear separation of memory nodes
> ~ no inclusion of SMT/multicore in the cpu hierarchy
> ~ awkward (at best) memory allocation interface
> ~ etc etc
> and last but not least
> ~ a completely unacceptable library interface (e.g., global variables as
> part of the API, WTF?)

Regardless of issues addressed, Andi's been working with everyone for
something on the order of 12+ months and this is out of the blue. I very
very strongly suggest that you take up each of these issues with him so
that they can be addressed as individual incremental improvements to the
API everyone's been working with for all that time as opposed to screwing
the world (esp. now that commodity NUMA boxen are becoming more prevalent)
with transparent and deliberate distro-competition motivated API skew.

-- wli


* Re: NUMA API
  2004-04-30  7:35 NUMA API Ulrich Drepper
  2004-04-30  8:30 ` William Lee Irwin III
@ 2004-04-30  8:49 ` Paul Jackson
  2004-04-30  9:50   ` William Lee Irwin III
  2004-05-03 12:48 ` NUMA API - wish list Zoltan Menyhart
  2 siblings, 1 reply; 9+ messages in thread
From: Paul Jackson @ 2004-04-30  8:49 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: linux-kernel

> Please direct comments to me.  In case there is interest I can set up a
> separate mailing list since lkml is probably not the best venue.

Thanks for posting this.

If not the kernel mailing list, then could you specify some other
existing public list?  Sometimes useful feedback comes from the
interaction of several people responding, which is less likely to
happen if it is all funneled through one person.

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373


* Re: NUMA API
  2004-04-30  8:49 ` Paul Jackson
@ 2004-04-30  9:50   ` William Lee Irwin III
  0 siblings, 0 replies; 9+ messages in thread
From: William Lee Irwin III @ 2004-04-30  9:50 UTC (permalink / raw)
  To: Paul Jackson; +Cc: Ulrich Drepper, linux-kernel

At some point in the past, Uli wrote:
>> Please direct comments to me.  In case there is interest I can set up a
>> separate mailing list since lkml is probably not the best venue.

On Fri, Apr 30, 2004 at 01:49:33AM -0700, Paul Jackson wrote:
> Thanks for posting this.
> If not the kernel mailing list, then could you specify some other
> existing public list?  Sometimes useful feedback comes from the
> interaction of several people responding, which is less likely to
> happen if it is all funneled through one person.

The real problem with this is that ak's been maintaining and bugfixing
and handling feature requests for people who are actually going to
depend on this stuff for a rather long time, and the total rewrite here
throws all that work to make things work for those who rely on the stuff
away in addition to creating a brand new vendor skew problem from scratch.

Uli likely has legitimate points in need of addressing. From-scratch
rewrites are not the proper ways to address them, especially not when
such a very strong precedent and various 3rd-parties' reliance on the
preexisting API's and codebases are established.

The proper methods for addressing these issues are by incrementally
improving ak's codebase and fixing the bugs and/or limitations discussed
(e.g. hotplugging vs. NUMA API issues). What Uli has expressed is not a
sound basis for a ground-up, from-scratch API implementation. The issues
Uli wants to address are bugfixes and extensions, and should be
required to go through the same procedures and review as such, and these
in turn require working with the preexisting codebase, not wild from-
scratch rewrites of the known universe. Especially not with the
extremely transparent ulterior motives for incompatible API's proposed
on the day of SuSE's freeze.

I'm all in favor of the best. As the deficiencies are pointed out, I
won't rest until these are fixed and the implementation is the best.
But this proposed API divergence is not how it should be made so. Being
the best means having a coherent story, and bickering and contrived
incompatibilities between distros is not how coherent stories and
customer satisfaction happen.


-- wli


* Re: NUMA API - wish list
  2004-04-30  7:35 NUMA API Ulrich Drepper
  2004-04-30  8:30 ` William Lee Irwin III
  2004-04-30  8:49 ` Paul Jackson
@ 2004-05-03 12:48 ` Zoltan Menyhart
  2004-05-03 17:57   ` Paul Jackson
  2 siblings, 1 reply; 9+ messages in thread
From: Zoltan Menyhart @ 2004-05-03 12:48 UTC (permalink / raw)
  To: Ulrich Drepper, linux-kernel

Can you remember the "old golden days" when there were no open(),
read(), lseek(), write(), mmap(), etc., and one had to state explicitly
(on job control punched cards) that s/he needed sectors 123...145 on
the disk on channel 6, unit 7?
Or, somewhat more recently, when one had to manage memory and
overlays by hand?
Now we are going to manage topology and CPU or memory binding from
applications. Moreover, we are going to have the applications resolve
resource management and dependency problems and conflicts among
themselves...

The operating systems should provide for abstractions of the actual
HW platform: file system, virtual memory, shared CPUs, etc.

Why should an application care for the actual physical characteristics ?
Including counting nanoseconds of some HW resource access time ? We'll
end up with some completely un-portable applications.

I think an application should describe what it needs for its optimal run,
e.g.:
	- I need 3 * N (where N = 1, 2, 3,...) CPUs "very close"
	  together and 2.5 Gbytes / N real memory (working set size) for
	  each CPU, "very very close to" its respective CPU
	- Should it not fit into a "domain", the CPUs have to be
	  "very very close" to each other 3 by 3
	- If no resources for even N == 1, do not start it at all
	- Use "gang scheduling" for them, otherwise I'll busy wait :-)
	- In addition, I need M CPUs + X Gbytes of memory
	  "where my previous group is" and I need a disk I/O path of
	  the capacity of 200 Mbytes / sec "more or less close to" my
	  memory
	- I need "some more" CPUs "somewhere" with some 100 Mbytes of
	  memory "preferably close to" the CPUs and 10 Mbytes / sec
	  TCP/IP bandwidth "close to" my memory 

	- I need 70 % of the CPU time on my CPUs (the scheduler can
	  select others for the 30 % of the time left)

	- O.K. should my request be too much, here is my minimal,
	  "degraded" configuration:...

The OS reserves the resources for the application (exec time assignment)
and reports to the application which of its needs have been granted.

When the application allocates some memory, it'll say: you know, this
is for the memory pool I've described in the 5th criterion.
When it creates threads, it'll say they are in the 2nd group of threads
mentioned on the 1st line.

The work load manager / load balancer can negotiate other resource
assignment at any time with the application.
The work load manager / load balancer is free to move a collection of
resources from some NUMA domains to others, provided the application's
requirements are still met. (No hard binding.)
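
A minimal sketch of the request-and-grant idea above, in C. Every name
in it (struct resource_group, struct resource_grant, grant_satisfies(),
choose_config()) is hypothetical and invented for illustration; nothing
like it exists in the kernel or in either proposed library. It models
only the "preferred configuration first, degraded configuration next,
otherwise do not start" part of the wish list. The closeness hints are
carried along but not evaluated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical types: they only illustrate the wish list above. */
enum closeness { ANYWHERE, CLOSE, VERY_CLOSE, VERY_VERY_CLOSE };

/* What the application asks for, one entry per acceptable configuration. */
struct resource_group {
    unsigned ncpus;              /* CPUs wanted in this group */
    enum closeness cpu_to_cpu;   /* hint: how close the CPUs should be */
    size_t mbytes_per_cpu;       /* working set per CPU, in Mbytes */
    enum closeness mem_to_cpu;   /* hint: how close memory should be */
    unsigned cpu_share_percent;  /* share of CPU time wanted, 1..100 */
};

/* What the OS is able to hand out at exec time. */
struct resource_grant {
    unsigned ncpus;
    size_t mbytes_per_cpu;
    unsigned cpu_share_percent;
};

/* True if the grant covers every quantity the group asked for. */
bool grant_satisfies(const struct resource_group *want,
                     const struct resource_grant *got)
{
    return got->ncpus >= want->ncpus &&
           got->mbytes_per_cpu >= want->mbytes_per_cpu &&
           got->cpu_share_percent >= want->cpu_share_percent;
}

/*
 * Walk the configurations in order of preference (best first, degraded
 * ones later) and return the index of the first one that fits, or -1
 * for "do not start it at all".
 */
int choose_config(const struct resource_group *options, size_t noptions,
                  const struct resource_grant *available)
{
    assert(available != NULL);
    assert(options != NULL || noptions == 0);

    for (size_t i = 0; i < noptions; i++)
        if (grant_satisfies(&options[i], available))
            return (int)i;
    return -1;
}
```

With the options for N == 2 (6 CPUs, 1280 Mbytes each) and N == 1
(3 CPUs, 2560 Mbytes each), a machine that can offer 4 CPUs with
2560 Mbytes each would select index 1, the N == 1 configuration.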

Billing is done accordingly :-)

As you do not need to know anything about SCSI LUNs, sector IDs,
physical memory maps or the other applications when you compile your
kernel, why should an application care for HW NUMA details ?

Thanks,

Zoltán Menyhárt


* Re: NUMA API - wish list
       [not found] ` <1RLdk-29R-11@gated-at.bofh.it>
@ 2004-05-03 13:17   ` Andi Kleen
  0 siblings, 0 replies; 9+ messages in thread
From: Andi Kleen @ 2004-05-03 13:17 UTC (permalink / raw)
  To: Zoltan.Menyhart; +Cc: linux-kernel

Zoltan Menyhart <Zoltan.Menyhart_AT_bull.net@nospam.org> writes:

> The work load manager / load balancer can negotiate other resource
> assignment at any time with the application.
> The work load manager / load balancer is free to move a collection of
> resources from some NUMA domains to others, provided the application's
> requirements are still met. (No hard binding.)

IMHO these are hard research topics that will need considerably
more work to be automated, if they will ever work automated at all.
The main problem is that you have several conflicting goals: you
want to use all available CPU power, all available memory,
all available memory bandwidth and the best average memory latency.
They all conflict.

First: basically any more advanced automatic scheme will
require going all the way to a full workload manager
that can move memory around later, because it is nearly impossible
to get even two of these goals right in advance.

I first tried to develop a NUMA scheduler "homenode scheduler" that
attempted to do a lot of this automatically.  I then realized that it
is just too hard to do and it never worked very well. That is why I
changed gears and just started with a simple API to let the user tell
the kernel what he wants.

The advantage of this is that a lot of complexity is avoided; 
e.g. the NUMA API avoids any need to move memory around.

Now if somebody comes up with a good design for a workload manager and
does all the experiments needed to validate it then it could be added
later. But deferring NUMA optimization efforts until this considerable
task is solved (if it even can be solved) would be a big mistake IMHO.

> Billing is done accordingly :-)
>
> As you do not need to know anything about SCSI LUNs, sector IDs,
> physical memory maps or the other applications when you compile your kernel,
> why should an application care for HW NUMA details ?

There is a big difference between these and NUMA. 

LUNs, sectors, physical memory are all hidden for correctness. For
that, virtualization is fine, because performance is secondary
to correctness.

But NUMA knowledge is purely for optimization. And for optimization
purposes you want to avoid virtualization layers, because they get
in the way of your optimization efforts.

When a human does NUMA optimization they usually want to work near the
bare hardware.  And if your dream of an automatic workload manager ever
worked it would also work on the bare hardware.

-Andi



* Re: NUMA API - wish list
  2004-05-03 12:48 ` NUMA API - wish list Zoltan Menyhart
@ 2004-05-03 17:57   ` Paul Jackson
  0 siblings, 0 replies; 9+ messages in thread
From: Paul Jackson @ 2004-05-03 17:57 UTC (permalink / raw)
  To: Zoltan.Menyhart; +Cc: drepper, linux-kernel

> The operating systems should provide for abstractions of the actual ...

True ... so long as you don't confuse "operating system" with "kernel".

Most of what you describe can and should be in user space, as what I
call "system software", constructed of libraries, daemons, utilities
and specific language support.

Having the kernel support the abstraction of "file", to hide details of
sectors, channels and devices has been a great success.  But the kernel
doesn't need to support every such abstraction, such as in this case
"abstract computers" with certain amounts of compute, memory and i/o
resources.

Rather the kernel only needs to provide the essential primitives, such
as cpu and memory placement, jobs (as related set of tasks), and access
to primitive topology and hardware attributes.
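
As a rough illustration of that split, the sketch below keeps the
placement policy in user space and assumes the kernel only reports
primitive facts. struct topology and pick_node() are hypothetical names
made up for this example, and the distance and free-memory values stand
in for whatever topology interface the kernel actually exposes; none of
this is an existing API.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 8  /* arbitrary limit for the sketch */

/* Hypothetical snapshot of the primitive facts the kernel would report. */
struct topology {
    size_t nnodes;                           /* nodes actually present */
    unsigned distance[MAX_NODES][MAX_NODES]; /* relative node-to-node cost */
    size_t free_mbytes[MAX_NODES];           /* free memory on each node */
};

/*
 * User-space policy built on those facts: pick the node nearest to
 * `from` that still has `need_mbytes` free.  Returns the node number,
 * or -1 if no node can hold the request.
 */
int pick_node(const struct topology *t, size_t from, size_t need_mbytes)
{
    assert(t != NULL && t->nnodes <= MAX_NODES && from < t->nnodes);

    int best = -1;
    for (size_t n = 0; n < t->nnodes; n++) {
        if (t->free_mbytes[n] < need_mbytes)
            continue;  /* node too full, skip it */
        if (best < 0 ||
            t->distance[from][n] < t->distance[from][(size_t)best])
            best = (int)n;
    }
    return best;
}
```

A library or daemon could swap this policy for another (interleaving,
strict binding, and so on) without any change to what the kernel has to
provide.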

(Your spam-encoded From address "Zoltan.Menyhart_AT_bull.net@nospam.org"
is a minor annoyance ...).

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373


* Re: NUMA API
  2004-04-30  8:30 ` William Lee Irwin III
@ 2004-05-03 18:37   ` Ulrich Drepper
  2004-05-04 10:01     ` Christoph Hellwig
  0 siblings, 1 reply; 9+ messages in thread
From: Ulrich Drepper @ 2004-05-03 18:37 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: Linux Kernel

William Lee Irwin III wrote:
> I very
> very strongly suggest that you take up each of these issues with him

And what exactly do you think this is about?


> so
> that they can be addressed as individual incremental improvements

That's not a possibility.  The interface is simply inadequate.

I do not claim to be the expert when it comes to all the fancy NUMA
functionality.  But I surely can recognize a broken library interface.
*That's* my concern.  I do not yet care too much about the kernel interface.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


* Re: NUMA API
  2004-05-03 18:37   ` Ulrich Drepper
@ 2004-05-04 10:01     ` Christoph Hellwig
  0 siblings, 0 replies; 9+ messages in thread
From: Christoph Hellwig @ 2004-05-04 10:01 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: William Lee Irwin III, Linux Kernel

On Mon, May 03, 2004 at 11:37:41AM -0700, Ulrich Drepper wrote:
> I do not claim to be the expert when it comes to all the fancy NUMA
> functionality.  But I surely can recognize a broken library interface.
> *That's* my concern.  I do not yet care too much about the kernel interface.

Then it's rather offtopic for this list.  If you need additions and/or changes
to the kernel interface to your library we'd love to hear about that as early
as possible, though.

