[PATCH v2 00/11] mempolicy2, mbind2, and weighted interleave

linux-api.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2 00/11] mempolicy2, mbind2, and weighted interleave
@ 2023-12-09  6:59 Gregory Price
  2023-12-09  6:59 ` [PATCH v2 01/11] mm/mempolicy: implement the sysfs-based weighted_interleave interface Gregory Price
                   ` (11 more replies)
  0 siblings, 12 replies; 21+ messages in thread
From: Gregory Price @ 2023-12-09  6:59 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-doc, linux-fsdevel, linux-api, linux-arch, linux-kernel,
	akpm, arnd, tglx, luto, mingo, bp, dave.hansen, x86, hpa, mhocko,
	tj, ying.huang, gregory.price, corbet, rakie.kim, hyeongtak.ji,
	honggyu.kim, vtavarespetr, peterz, jgroves, ravis.opensrc,
	sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha, Johannes Weiner,
	Hasan Al Maruf, Hao Wang, Dan Williams, Michal Hocko, Zhongkun He,
	Frank van der Linden, John Groves, Jonathan Cameron

v2:
  changes / adds:
- flattened weight matrix to an array at requested of Ying Huang
- Updated ABI docs per Davidlohr Bueso request
- change uapi structure to use aligned/fixed-length members as
  Suggested-by: Arnd Bergmann <arnd@arndb.de>
- Implemented weight fetch logic in get_mempolicy2
- mbind2 was changed to take (iovec,len) as function arguments
  rather than add them to the uapi structure, since they describe
  where to apply the mempolicy - as opposed to being part of it.

  fixes:
- fixed bug on fork/pthread
  Reported-by: Seungjun Ha <seungjun.ha@samsung.com>
  Link: https://lore.kernel.org/linux-cxl/20231206080944epcms2p76ebb230b9f4595f5cfcd2531d67ab3ce@epcms2p7/
- fixed bug in mbind2 where MPOL_F_GWEIGHTS was not set when il_weights
  was omitted after local weights were added as an option
- fixed bug in interleave logic where an OOB access was made if
  next_node_in returned MAX_NUMNODES
- fixed bug in bulk weighted interleave allocator where over-allocation
  could occur.

  tests:
- LTP: validated existing get_mempolicy, set_mempolicy, and mbind testss
- LTP: validated existing get_mempolicy, set_mempolicy, and mbind with
       MPOL_WEIGHTED_INTERLEAVE added.
- basic set_mempolicy2 tests and numactl -w --interleave tests

  numactl:
- Sample numactl extension for set_mempolicy available here:
  Link: https://github.com/gmprice/numactl/tree/weighted_interleave_master

(summary of LTP tests and manual tests added to end of cover letter)

=====================================================================

This patch set extends the mempolicy interface to enable new
mempolicies which may require extended data to operate. One
such policy is included with this set as an example.

There are 3 major "phases" in the patch set:
1) Implement a "global weight" mechanism via sysfs, which allows
   set_mempolicy to implement MPOL_WEIGHTED_INTERLEAVE utilizing
   weights set by the administrator (or system daemon).

2) A refactor of the mempolicy creation mechanism to accept an
   extensible argument structure `struct mempolicy_args` to promote
   code re-use between the original mempolicy/mbind interfaces and
   the new extended mempolicy/mbind interfaces.

3) Implementation of set_mempolicy2, get_mempolicy2, and mbind2,
   along with the addition of task-local weights so that per-task
   weights can be registered for MPOL_WEIGHTED_INTERLEAVE.

=====================================================================
(Patch 1) : sysfs addition - /sys/kernel/mm/mempolicy/

This feature  provides a way to set interleave weight information under
sysfs at /sys/kernel/mm/mempolicy/weighted_interleave/nodeN/nodeM/weight

    The sysfs structure is designed as follows.

      $ tree /sys/kernel/mm/mempolicy/
      /sys/kernel/mm/mempolicy/
      ├── possible_nodes
      └── weighted_interleave
          ├── nodeN
          │   └── weight
          └── nodeN+X
              └── weight

'mempolicy' is added to '/sys/kernel/mm/' as a control group for
the mempolicy subsystem.

'possible_nodes' is added to 'mm/mempolicy' to help describe the
expected structures under mempolicy directorys. For example,
possible_nodes describes what nodeN directories wille exist under
the weighted_interleave directory.

Internally, weights are represented as an array of unsigned char

static unsigned char iw_table[MAX_NUMNODES];

char was chosen as most reasonable distributions can be represented
as factors <100, and to minimize memory usage (1KB)

We present possible nodes, instead of online nodes, to simplify the
management interface, considering that a) the table is of size
MAX_NUMNODES anyway to simplify fetching of weights (no need to track
sizes, and MAX_NUMNODES is typically at most 1kb), and b) it simplifies
management of hotplug events, allowing for weights to be set prior to
a node coming online, which may be beneficial for immediate use.

the 'weight' of a node (an unsigned char of value 1-255) is the number
of pages that are allocated during a "weighted interleave" round.
(See 'weighted interleave' for more details').

=====================================================================
(Patch 2) set_mempolicy: MPOL_WEIGHTED_INTERLEAVE

Weighted interleave is a new memory policy that interleaves memory
across numa nodes in the provided nodemask based on the weights
described in patch 1 (sysfs global weights).

When a system has multiple NUMA nodes and it becomes bandwidth hungry,
the current MPOL_INTERLEAVE could be an wise option.

However, if those NUMA nodes consist of different types of memory such
as having local DRAM and CXL memory together, the current round-robin
based interleaving policy doesn't maximize the overall bandwidth
because of their different bandwidth characteristics.

Instead, the interleaving can be more efficient when the allocation
policy follows each NUMA nodes' bandwidth weight rather than having 1:1
round-robin allocation.

This patch introduces a new memory policy, MPOL_WEIGHTED_INTERLEAVE,
which enables weighted interleaving between NUMA nodes.  Weighted
interleave allows for a proportional distribution of memory across
multiple numa nodes, preferablly apportioned to match the bandwidth
capacity of each node from the perspective of the accessing node.

For example, if a system has 1 CPU node (0), and 2 memory nodes (0,1),
with a relative bandwidth of (100GB/s, 50GB/s) respectively, the
appropriate weight distribution is (2:1).

Weights will be acquired from the global weight array exposed by the
sysfs extension: /sys/kernel/mm/mempolicy/weighted_interleave/

The policy will then allocate the number of pages according to the
set weights.  For example, if the weights are (2,1), then 2 pages
will be allocated on node0 for every 1 page allocated on node1.

The new flag MPOL_WEIGHTED_INTERLEAVE can be used in set_mempolicy(2)
and mbind(2).

=====================================================================
(Patches 3-6) Refactoring mempolicy for code-reuse

To avoid multiple paths of mempolicy creation, we should refactor the
existing code to enable the designed extensibility, and refactor
existing users to utilize the new interface (while retaining the
existing userland interface).

This set of patches introduces a new mempolicy_args structure, which
is used to more fully describe a requested mempolicy - to include
existing and future extensions.

/*
 * Describes settings of a mempolicy during set/get syscalls and
 * kernel internal calls to do_set_mempolicy()
 */
struct mempolicy_args {
    unsigned short mode;            /* policy mode */
    unsigned short mode_flags;      /* policy mode flags */
    nodemask_t *policy_nodes;       /* get/set/mbind */
    int policy_node;                /* get: policy node information */
    unsigned long addr;             /* get: vma address */
    int addr_node;                  /* get: node the address belongs to */
    int home_node;                  /* mbind: use MPOL_MF_HOME_NODE */
    unsigned char *il_weights;      /* for mode MPOL_WEIGHTED_INTERLEAVE */
};

This arg structure will eventually be utilized by the following
interfaces:
    mpol_new() - new mempolicy creation
    do_get_mempolicy() - acquiring information about mempolicy
    do_set_mempolicy() - setting the task mempolicy
    do_mbind()         - setting a vma mempolicy

do_get_mempolicy() is completely refactored to break it out into
separate functionality based on the flags provided by get_mempolicy(2)
    MPOL_F_MEMS_ALLOWED: acquires task->mems_allowed
    MPOL_F_ADDR: acquires information on vma policies
    MPOL_F_NODE: changes the output for the policy arg to node info

We refactor the get_mempolicy syscall flatten the logic based on these
flags, and aloow for set_mempolicy2() to re-use the underlying logic.

The result of this refactor, and the new mempolicy_args structure, is
that extensions like 'sys_set_mempolicy_home_node' can now be directly
integrated into the initial call to 'set_mempolicy2', and that more
complete information about a mempolicy can be returned with a single
call to 'get_mempolicy2', rather than multiple calls to 'get_mempolicy'


=====================================================================
(Patches 7-10) set_mempolicy2, get_mempolicy2, mbind2

These interfaces are the 'extended' counterpart to their relatives.
They use the userland 'struct mpol_args' structure to communicate a
complete mempolicy configuration to the kernel.  This structure
looks very much like the kernel-internal 'struct mempolicy_args':

struct mpol_args {
        /* Basic mempolicy settings */
        __u16 mode;
        __u16 mode_flags;
        __s32 home_node;
        __aligned_u64 pol_nodes;
        __u64 pol_maxnodes;
        __u64 addr;
        __s32 policy_node;
        __s32 addr_node;
        __aligned_u64 *il_weights;      /* of size pol_maxnodes */
};

The basic mempolicy settings which are shared across all interfaces
are captured at the top of the structure, while extensions such as
'policy_node' and 'addr' are collected beneath.

The syscalls are uniform and defined as follows:

long sys_mbind2(struct iovec *vec, size_t vlen,
                struct mpol_args *args, size_t usize,
                unsigned long flags);

long sys_get_mempolicy2(struct mpol_args *args, size_t size,
                        unsigned long flags);

long sys_set_mempolicy2(struct mpol_args *args, size_t size,
                        unsigned long flags);

The 'flags' argument for mbind2 is the same as 'mbind', except with
the addition of MPOL_MF_HOME_NODE to denote whether the 'home_node'
field should be utilized.

The 'flags' argument for get_mempolicy2 is the same as get_mempolicy.

The 'flags' argument is not used by 'set_mempolicy' at this time, but
may end up allowing the use of MPOL_MF_HOME_NODE if such functionality
is desired.

The extensions can be summed up as follows:

get_mempolicy2 extensions:
    'mode', 'policy_node', and 'addr_node' can now be fetched with
    a single call, rather than multiple with a combination of flags.
    - 'mode' will always return the policy mode
    - 'policy_node' will replace the functionality of MPOL_F_NODE
    - 'addr_node' will return the node for 'addr' w/ MPOL_F_ADDR

set_mempolicy2:
    - task-local interleave weights can be set via 'il_weights'
      (see next patch)

mbind2:
    - 'vec' and 'vlen' are sed to operate on multiple memory
       ranges, rather than a single memory range per syscall.
    - 'home_node' field sets policy home node w/ MPOL_MF_HOME_NODE
    - task-local interleave weights can be set via 'il_weights'
      (see next patch)

=====================================================================
(Patch 11) set_mempolicy2/mbind2: MPOL_WEIGHTED_INTERLEAVE

This patch shows the explicit extension pattern when adding new
policies to mempolicy2/mbind2.  This adds the 'il_weights' field
to mpol_args and adds the logic to fill in task-local weights.

There are now two ways to weight a mempolicy: global and local.
To denote which mode the task is in, we add the internal flag:
MPOL_F_GWEIGHT /* Utilize global weights */

When MPOL_F_GWEIGHT is set, the global weights are used, and
when it is not set, task-local weights are used.

Example logic:
if (pol->flags & MPOL_F_GWEIGHT)
       pol_weights = iw_table[numa_node_id()].weights;
else
       pol_weights = pol->wil.weights;

set_mempolicy is changed to always set MPOL_F_GWEIGHT, since this
syscall is incapable of passing weights via its interfaces, while
set_mempolicy2 sets MPOL_F_GWEIGHT if MPOL_F_WEIGHTED_INTERLEAVE
is required but (*il_weights) in mpol_args is null.

The operation of task-local weighted is otherwise exactly the
same - except for what occurs on task migration.

On task migration, the system presently has no way of determining
what the new weights "should be", or what the user "intended".

For this reason, we default all weights to '1' and do not allow
weights to be '0'.  This means, should a migration occur where
one or more nodes appear into the nodemask - the effective weight
for that node will be '1'.  This avoids a potential allocation
failure condition if a migration occurs and introduces a node
which otherwise did not have a weight.

For this reason, users should use task-local weighting when
migrations are not expected, and global weighting when migrations
are expected or possible.

=====================================================================
Test Reports:

LTP set_mempolicy, get_mempolicy, mbind regression tests:

MPOL_WEIGHTED_INTERLEAVE added manually to test basic functionality
but did not adjust tests for weighting.  Basically the weights were
set to 1, which is the default, and it should behavior like standard
MPOL_INTERLEAVE if logic is correct.

== set_mempolicy01
Summary:
passed   18
failed   0
broken   0
skipped  0
warnings 0

== set_mempolicy02
Summary:
passed   10
failed   0
broken   0
skipped  0
warnings 0

== set_mempolicy03
Summary:
passed   64
failed   0
broken   0
skipped  0
warnings 0

== set_mempolicy04
Summary:
passed   32
failed   0
broken   0
skipped  0
warnings 0

== set_mempolicy05 - n/a on non-x86

== set_mempolicy06 - set_mempolicy02 + MPOL_WEIGHTED_INTERLEAVE
Summary:
passed   10
failed   0
broken   0
skipped  0
warnings 0

== set_mempolicy07 - set_mempolicy04 + MPOL_WEIGHTED_INTERLEAVE
Summary:
passed   32
failed   0
broken   0
skipped  0
warnings 0


== get_mempolicy01 - added MPOL_WEIGHTED_INTERLEAVE
Summary:
passed   12
failed   0
broken   0
skipped  0
warnings 0

== get_mempolicy02
Summary:
passed   2
failed   0
broken   0
skipped  0
warnings 0


== mbind01 - added WEIGHTED_INTERLEAVE
Summary:
passed   15
failed   0
broken   0
skipped  0
warnings 0

== mbind02 - added WEIGHTED_INTERLEAVE
Summary:
passed   4
failed   0
broken   0
skipped  0
warnings 0

== mbind03 - added WEIGHTED_INTERLEAVE
Summary:
passed   16
failed   0
broken   0
skipped  0
warnings 0

== mbind04 - added WEIGHTED_INTERLEAVE
Summary:
passed   48
failed   0
broken   0
skipped  0
warnings 0

=====================================================================
Basic set_mempolicy2 test

set_mempolicy2 w/ weighted interleave, task-local weights and uses
pthread_create to demonstrate the mempolicy is overwritten by child.

Manually validating the distribution via numa_maps

007c0000 weighted interleave:0-1 heap anon=65794 dirty=65794 active=0 N0=54829 N1=10965 kernelpagesize_kB=4
7f3f2c000000 weighted interleave:0-1 anon=32768 dirty=32768 active=0 N0=5461 N1=27307 kernelpagesize_kB=4
7f3f34000000 weighted interleave:0-1 anon=16384 dirty=16384 active=0 N0=2731 N1=13653 kernelpagesize_kB=4
7f3f3bffe000 weighted interleave:0-1 anon=65538 dirty=65538 active=0 N0=10924 N1=54614 kernelpagesize_kB=4
7f3f5c000000 weighted interleave:0-1 anon=16384 dirty=16384 active=0 N0=2731 N1=13653 kernelpagesize_kB=4
7f3f60dfe000 weighted interleave:0-1 anon=65537 dirty=65537 active=0 N0=54615 N1=10922 kernelpagesize_kB=4

Expected distribution is 5:1 or 1:5 (less node should be ~16.666%)
1) 10965/65794 : 16.6656...
2) 5461/32768  : 16.6656...
3) 2731/16384  : 16.6687...
4) 10924/65538 : 16.6682...
5) 2731/16384  : 16.6687...
6) 10922/65537 : 16.6653...


#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <numa.h>
#include <errno.h>
#include <numaif.h>
#include <unistd.h>
#include <pthread.h>
#include <sys/uio.h>
#include <sys/types.h>
#include <stdint.h>

#define MPOL_WEIGHTED_INTERLEAVE 6
#define SET_MEMPOLICY2(a, b) syscall(457, a, b, 0)

#define M256 (1024*1024*256)
#define PAGE_SIZE (4096)

struct mpol_args {
        /* Basic mempolicy settings */
        uint16_t mode;
        uint16_t mode_flags;
        int32_t home_node;
        uint64_t pol_nodes;
        uint64_t pol_maxnodes;
        uint64_t addr;
        int32_t policy_node;
        int32_t addr_node;
        uint64_t il_weights;
};

struct mpol_args wil_args;
struct bitmask *wil_nodes;
unsigned char *weights;
int total_nodes = -1;
pthread_t tid;

void set_mempolicy_call(int which)
{
        weights = (unsigned char *)calloc(total_nodes, sizeof(unsigned char));
        wil_nodes = numa_allocate_nodemask();

        numa_bitmask_setbit(wil_nodes, 0); weights[0] = which ? 1 : 5;
        numa_bitmask_setbit(wil_nodes, 1); weights[1] = which ? 5 : 1;

        memset(&wil_args, 0, sizeof(wil_args));
        wil_args.mode = MPOL_WEIGHTED_INTERLEAVE;
        wil_args.mode_flags = 0;
        wil_args.pol_nodes = wil_nodes->maskp;
        wil_args.pol_maxnodes = total_nodes + 1;
        wil_args.il_weights = weights;

        int ret = SET_MEMPOLICY2(&wil_args, sizeof(wil_args));
        fprintf(stderr, "set_mempolicy2 result: %d(%s)\n", ret, strerror(errno));
}

void *func(void *arg)
{
        char *mainmem = malloc(M256);
        int i;

        set_mempolicy_call(1); /* weight 1 heavier */

        mainmem = malloc(M256);
        memset(mainmem, 1, M256);
        for (i = 0; i < (M256/PAGE_SIZE); i++) {
                mainmem = malloc(PAGE_SIZE);
                mainmem[0] = 1;
        }
        printf("thread done %d\n", getpid());
        getchar();
        return arg;
}

int main()
{
        char * mainmem;
        int i;

        total_nodes = numa_max_node() + 1;

        set_mempolicy_call(0); /* weight 0 heavier */
        pthread_create(&tid, NULL, func, NULL);

        mainmem = malloc(M256);
        memset(mainmem, 1, M256);
        for (i = 0; i < (M256/PAGE_SIZE); i++) {
                mainmem = malloc(PAGE_SIZE);
                mainmem[0] = 1;
        }
        printf("main done %d\n", getpid());
        getchar();

        return 0;
}

=====================================================================
numactl (set_mempolicy) w/ global weighting test
numactl fork: https://github.com/gmprice/numactl/tree/weighted_interleave_master

command: numactl -w --interleave=0,1 ./eatmem

result (weights 1:1):
0176a000 weighted interleave:0-1 heap anon=65793 dirty=65793 active=0 N0=32897 N1=32896 kernelpagesize_kB=4
7fceeb9ff000 weighted interleave:0-1 anon=65537 dirty=65537 active=0 N0=32768 N1=32769 kernelpagesize_kB=4

50% distribution is correct

result (weights 5:1):
01b14000 weighted interleave:0-1 heap anon=65793 dirty=65793 active=0 N0=54828 N1=10965 kernelpagesize_kB=4
7f47a1dff000 weighted interleave:0-1 anon=65537 dirty=65537 active=0 N0=54614 N1=10923 kernelpagesize_kB=4

16.666% distribution is correct

result (weights 1:5):
01f07000 weighted interleave:0-1 heap anon=65793 dirty=65793 active=0 N0=10966 N1=54827 kernelpagesize_kB=4
7f17b1dff000 weighted interleave:0-1 anon=65537 dirty=65537 active=0 N0=10923 N1=54614 kernelpagesize_kB=4

16.666% distribution is correct

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main (void)
{
        char* mem = malloc(1024*1024*256);
        memset(mem, 1, 1024*1024*256);
        for (int i = 0; i  < ((1024*1024*256)/4096); i++)
        {
                mem = malloc(4096);
                mem[0] = 1;
        }
        printf("done\n");
        getchar();
        return 0;
}

=====================================================================

Suggested-by: Gregory Price <gregory.price@memverge.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Hasan Al Maruf <hasanalmaruf@fb.com>
Suggested-by: Hao Wang <haowang3@fb.com>
Suggested-by: Ying Huang <ying.huang@intel.com>
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Suggested-by: tj <tj@kernel.org>
Suggested-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
Suggested-by: Frank van der Linden <fvdl@google.com>
Suggested-by: John Groves <john@jagalactic.com>
Suggested-by: Vinicius Tavares Petrucci <vtavarespetr@micron.com>
Suggested-by: Srinivasulu Thanneeru <sthanneeru@micron.com>
Suggested-by: Ravi Jonnalagadda <ravis.opensrc@micron.com>
Suggested-by: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
Signed-off-by: Gregory Price <gregory.price@memverge.com>

Gregory Price (9):
  mm/mempolicy: refactor sanitize_mpol_flags for reuse
  mm/mempolicy: create struct mempolicy_args for creating new
    mempolicies
  mm/mempolicy: refactor kernel_get_mempolicy for code re-use
  mm/mempolicy: allow home_node to be set by mpol_new
  mm/mempolicy: add userland mempolicy arg structure
  mm/mempolicy: add set_mempolicy2 syscall
  mm/mempolicy: add get_mempolicy2 syscall
  mm/mempolicy: add the mbind2 syscall
  mm/mempolicy: extend set_mempolicy2 and mbind2 to support weighted
    interleave

Rakie Kim (2):
  mm/mempolicy: implement the sysfs-based weighted_interleave interface
  mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted
    interleaving

 .../ABI/testing/sysfs-kernel-mm-mempolicy     |  18 +
 ...fs-kernel-mm-mempolicy-weighted-interleave |  21 +
 .../admin-guide/mm/numa_memory_policy.rst     |  67 ++
 arch/alpha/kernel/syscalls/syscall.tbl        |   3 +
 arch/arm/tools/syscall.tbl                    |   3 +
 arch/m68k/kernel/syscalls/syscall.tbl         |   3 +
 arch/microblaze/kernel/syscalls/syscall.tbl   |   3 +
 arch/mips/kernel/syscalls/syscall_n32.tbl     |   3 +
 arch/mips/kernel/syscalls/syscall_o32.tbl     |   3 +
 arch/parisc/kernel/syscalls/syscall.tbl       |   3 +
 arch/powerpc/kernel/syscalls/syscall.tbl      |   3 +
 arch/s390/kernel/syscalls/syscall.tbl         |   3 +
 arch/sh/kernel/syscalls/syscall.tbl           |   3 +
 arch/sparc/kernel/syscalls/syscall.tbl        |   3 +
 arch/x86/entry/syscalls/syscall_32.tbl        |   3 +
 arch/x86/entry/syscalls/syscall_64.tbl        |   3 +
 arch/xtensa/kernel/syscalls/syscall.tbl       |   3 +
 include/linux/mempolicy.h                     |  21 +
 include/linux/syscalls.h                      |   7 +
 include/uapi/asm-generic/unistd.h             |   8 +-
 include/uapi/linux/mempolicy.h                |  20 +-
 mm/mempolicy.c                                | 946 ++++++++++++++++--
 22 files changed, 1036 insertions(+), 114 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave

-- 
2.39.1


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v2 01/11] mm/mempolicy: implement the sysfs-based weighted_interleave interface
  2023-12-09  6:59 [PATCH v2 00/11] mempolicy2, mbind2, and weighted interleave Gregory Price
@ 2023-12-09  6:59 ` Gregory Price
  2023-12-09  6:59 ` [PATCH v2 02/11] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving Gregory Price
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 21+ messages in thread
From: Gregory Price @ 2023-12-09  6:59 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-doc, linux-fsdevel, linux-api, linux-arch, linux-kernel,
	akpm, arnd, tglx, luto, mingo, bp, dave.hansen, x86, hpa, mhocko,
	tj, ying.huang, gregory.price, corbet, rakie.kim, hyeongtak.ji,
	honggyu.kim, vtavarespetr, peterz, jgroves, ravis.opensrc,
	sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha

From: Rakie Kim <rakie.kim@sk.com>

This patch provides a way to set interleave weight information under
sysfs at /sys/kernel/mm/mempolicy/weighted_interleave/nodeN/weight

The sysfs structure is designed as follows.

  $ tree /sys/kernel/mm/mempolicy/
  /sys/kernel/mm/mempolicy/ [1]
  ├── possible_nodes [2]
  └── weighted_interleave [3]
      ├── node0 [4]
      │   └── weight [5]
      └── node1
          └── weight

Each file above can be explained as follows.

[1] mm/mempolicy: configuration interface for mempolicy subsystem

[2] possible_nodes: list of possible nodes

    informational interface which may be used across multiple memory
    policy configurations.  Lists the `possible` nodes for which
    configurations may be required.  A `possible` node is one which has
    been reserved by the kernel at boot, but may or may not be online.

    For example, the weighted_interleave policy generates a nodeN/
    folder for possible node N.

[3] weighted_interleave/: config interface for weighted interleave policy

[4] weighted_interleave/nodeN/:  possible node configurations

[5] weighted_interleave/nodeN/weight: weight for nodeN

Signed-off-by: Rakie Kim <rakie.kim@sk.com>
Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
Co-developed-by: Gregory Price <gregory.price@memverge.com>
Signed-off-by: Gregory Price <gregory.price@memverge.com>
Co-developed-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
---
 .../ABI/testing/sysfs-kernel-mm-mempolicy     |  18 ++
 ...fs-kernel-mm-mempolicy-weighted-interleave |  21 +++
 mm/mempolicy.c                                | 169 ++++++++++++++++++
 3 files changed, 208 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave

diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy
new file mode 100644
index 000000000000..445377dfd232
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy
@@ -0,0 +1,18 @@
+What:		/sys/kernel/mm/mempolicy/
+Date:		December 2023
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	Interface for Mempolicy
+
+What:		/sys/kernel/mm/mempolicy/possible_nodes
+Date:		December 2023
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	The numa nodes which are possible to come online
+
+		A possible numa node is one which has been reserved by the
+		system at boot, but may or may not be online at runtime.
+
+		Example output:
+
+		=========     ========================================
+		"0,1,2,3"     nodes 0-3 are possibly online or offline
+		=========     ========================================
diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
new file mode 100644
index 000000000000..7c19a606725f
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
@@ -0,0 +1,21 @@
+What:		/sys/kernel/mm/mempolicy/weighted_interleave/
+Date:		December 2023
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	Configuration Interface for the Weighted Interleave policy
+
+What:		/sys/kernel/mm/mempolicy/weighted_interleave/nodeN/
+		/sys/kernel/mm/mempolicy/weighted_interleave/nodeN/weight
+Date:		December 2023
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	Weight configuration interface for nodeN
+
+		The interleave weight for a memory node (N). These weights are
+		utilized by processes which have set their mempolicy to
+		MPOL_WEIGHTED_INTERLEAVE and have opted into global weights by
+		omitting a task-local weight array.
+
+		These weights only affect new allocations, and changes at runtime
+		will not cause migrations on already allocated pages.
+
+		Minimum weight: 1
+		Maximum weight: 255
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 10a590ee1c89..28dfae195beb 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -131,6 +131,8 @@ static struct mempolicy default_policy = {
 
 static struct mempolicy preferred_node_policy[MAX_NUMNODES];
 
+static char iw_table[MAX_NUMNODES];
+
 /**
  * numa_nearest_node - Find nearest node by state
  * @node: Node id to start the search
@@ -3067,3 +3069,170 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
 		p += scnprintf(p, buffer + maxlen - p, ":%*pbl",
 			       nodemask_pr_args(&nodes));
 }
+
+struct iw_node_info {
+	struct kobject kobj;
+	int nid;
+};
+
+static ssize_t node_weight_show(struct kobject *kobj,
+				struct kobj_attribute *attr, char *buf)
+{
+	struct iw_node_info *node_info = container_of(kobj, struct iw_node_info,
+						      kobj);
+	return sysfs_emit(buf, "%d\n", iw_table[node_info->nid]);
+}
+
+static ssize_t node_weight_store(struct kobject *kobj,
+				 struct kobj_attribute *attr,
+				 const char *buf, size_t count)
+{
+	unsigned char weight = 0;
+	struct iw_node_info *node_info = NULL;
+
+	node_info = container_of(kobj, struct iw_node_info, kobj);
+
+	if (kstrtou8(buf, 0, &weight) || !weight)
+		return -EINVAL;
+
+	iw_table[node_info->nid] = weight;
+
+	return count;
+}
+
+static struct kobj_attribute node_weight =
+	__ATTR(weight, 0664, node_weight_show, node_weight_store);
+
+static struct attribute *dst_node_attrs[] = {
+	&node_weight.attr,
+	NULL,
+};
+
+static struct attribute_group dst_node_attr_group = {
+	.attrs = dst_node_attrs,
+};
+
+static const struct attribute_group *dst_node_attr_groups[] = {
+	&dst_node_attr_group,
+	NULL,
+};
+
+static const struct kobj_type dst_node_kobj_ktype = {
+	.sysfs_ops = &kobj_sysfs_ops,
+	.default_groups = dst_node_attr_groups,
+};
+
+static int add_weight_node(int nid, struct kobject *src_kobj)
+{
+	struct iw_node_info *node_info = NULL;
+	int ret;
+
+	node_info = kzalloc(sizeof(struct iw_node_info), GFP_KERNEL);
+	if (!node_info)
+		return -ENOMEM;
+	node_info->nid = nid;
+
+	kobject_init(&node_info->kobj, &dst_node_kobj_ktype);
+	ret = kobject_add(&node_info->kobj, src_kobj, "node%d", nid);
+	if (ret) {
+		pr_err("kobject_add error [node%d]: %d", nid, ret);
+		kobject_put(&node_info->kobj);
+	}
+	return ret;
+}
+
+static int add_weighted_interleave_group(struct kobject *root_kobj)
+{
+	struct kobject *wi_kobj;
+	int nid, err;
+
+	wi_kobj = kobject_create_and_add("weighted_interleave", root_kobj);
+	if (!wi_kobj) {
+		pr_err("failed to create node kobject\n");
+		return -ENOMEM;
+	}
+
+	for_each_node_state(nid, N_POSSIBLE) {
+		err = add_weight_node(nid, wi_kobj);
+		if (err) {
+			pr_err("failed to add sysfs [node%d]\n", nid);
+			break;
+		}
+	}
+	if (err)
+		kobject_put(wi_kobj);
+	return 0;
+
+}
+
+static ssize_t possible_nodes_show(struct kobject *kobj,
+				   struct kobj_attribute *attr, char *buf)
+{
+	int nid, next_nid;
+	int len = 0;
+
+	for_each_node_state(nid, N_POSSIBLE) {
+		len += sysfs_emit_at(buf, len, "%d", nid);
+		next_nid = next_node(nid, node_states[N_POSSIBLE]);
+		if (next_nid < MAX_NUMNODES)
+			len += sysfs_emit_at(buf, len, ",");
+	}
+	len += sysfs_emit_at(buf, len, "\n");
+
+	return len;
+}
+
+static struct kobj_attribute possible_nodes_attr = __ATTR_RO(possible_nodes);
+
+static struct attribute *mempolicy_attrs[] = {
+	&possible_nodes_attr.attr,
+	NULL,
+};
+
+static const struct attribute_group mempolicy_attr_group = {
+	.attrs = mempolicy_attrs,
+	NULL,
+};
+
+static void mempolicy_kobj_release(struct kobject *kobj)
+{
+	kfree(kobj);
+}
+
+static const struct kobj_type mempolicy_kobj_ktype = {
+	.release = mempolicy_kobj_release,
+	.sysfs_ops = &kobj_sysfs_ops,
+};
+
+static int __init mempolicy_sysfs_init(void)
+{
+	int err;
+	struct kobject *root_kobj;
+
+	memset(&iw_table, 1, sizeof(iw_table));
+
+	root_kobj = kzalloc(sizeof(struct kobject), GFP_KERNEL);
+	if (!root_kobj)
+		return -ENOMEM;
+
+	kobject_init(root_kobj, &mempolicy_kobj_ktype);
+	err = kobject_add(root_kobj, mm_kobj, "mempolicy");
+	if (err) {
+		pr_err("failed to add kobject to the system\n");
+		goto fail_obj;
+	}
+
+	err = sysfs_create_group(root_kobj, &mempolicy_attr_group);
+	if (err) {
+		pr_err("failed to register mempolicy group\n");
+		goto fail_obj;
+	}
+
+	err = add_weighted_interleave_group(root_kobj);
+fail_obj:
+	if (err)
+		kobject_put(root_kobj);
+	return err;
+
+}
+late_initcall(mempolicy_sysfs_init);
-- 
2.39.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v2 02/11] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving
  2023-12-09  6:59 [PATCH v2 00/11] mempolicy2, mbind2, and weighted interleave Gregory Price
  2023-12-09  6:59 ` [PATCH v2 01/11] mm/mempolicy: implement the sysfs-based weighted_interleave interface Gregory Price
@ 2023-12-09  6:59 ` Gregory Price
  2023-12-09 21:24   ` kernel test robot
  2023-12-09  6:59 ` [PATCH v2 03/11] mm/mempolicy: refactor sanitize_mpol_flags for reuse Gregory Price
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 21+ messages in thread
From: Gregory Price @ 2023-12-09  6:59 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-doc, linux-fsdevel, linux-api, linux-arch, linux-kernel,
	akpm, arnd, tglx, luto, mingo, bp, dave.hansen, x86, hpa, mhocko,
	tj, ying.huang, gregory.price, corbet, rakie.kim, hyeongtak.ji,
	honggyu.kim, vtavarespetr, peterz, jgroves, ravis.opensrc,
	sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha,
	Srinivasulu Thanneeru

From: Rakie Kim <rakie.kim@sk.com>

When a system has multiple NUMA nodes and it becomes bandwidth hungry,
the current MPOL_INTERLEAVE could be an wise option.

However, if those NUMA nodes consist of different types of memory such
as having local DRAM and CXL memory together, the current round-robin
based interleaving policy doesn't maximize the overall bandwidth because
of their different bandwidth characteristics.

Instead, the interleaving can be more efficient when the allocation
policy follows each NUMA nodes' bandwidth weight rather than having 1:1
round-robin allocation.

This patch introduces a new memory policy, MPOL_WEIGHTED_INTERLEAVE, which
enables weighted interleaving between NUMA nodes.  Weighted interleave
allows for a proportional distribution of memory across multiple numa
nodes, preferablly apportioned to match the bandwidth capacity of each
node from the perspective of the accessing node.

For example, if a system has 1 CPU node (0), and 2 memory nodes (0,1),
with a relative bandwidth of (100GB/s, 50GB/s) respectively, the
appropriate weight distribution is (2:1).

Weights will be acquired from the global weight array exposed by the
sysfs extension: /sys/kernel/mm/mempolicy/weighted_interleave/

The policy will then allocate the number of pages according to the
set weights.  For example, if the weights are (2,1), then 2 pages
will be allocated on node0 for every 1 page allocated on node1.

The new flag MPOL_WEIGHTED_INTERLEAVE can be used in set_mempolicy(2)
and mbind(2).

There are 3 integration points:

weighted_interleave_nodes:
    Counts the number of allocations as they occur, and applies the
    weight for the current node.  When the weight reaches 0, switch
    to the next node. Applied by `mempolicy_slab_node()` and
    `policy_nodemask()`

weighted_interleave_nid:
    Gets the total weight of the nodemask as well as each individual
    node weight, then calculates the node based on the given index.
    Applied by `policy_nodemask()` and `mpol_misplaced()`

bulk_array_weighted_interleave:
    Gets the total weight of the nodemask as well as each individual
    node weight, then calculates the number of "interleave rounds" as
    well as any delta ("partial round").  Calculates the number of
    pages for each node and allocates them.

    If a node was scheduled for interleave via interleave_nodes, the
    current weight (pol->cur_weight) will be allocated first, before
    the remaining bulk calculation is done. This simplifies the
    calculation at the cost of an additional allocation call.

One piece of complexity is the interaction between a recent refactor
which split the logic to acquire the "ilx" (interleave index) of an
allocation and the actually application of the interleave.  The
calculation of the `interleave index` is done by `get_vma_policy()`,
while the actual selection of the node will be later applied by the
relevant weighted_interleave function.

Suggested-by: Hasan Al Maruf <Hasan.Maruf@amd.com>
Signed-off-by: Rakie Kim <rakie.kim@sk.com>
Co-developed-by: Honggyu Kim <honggyu.kim@sk.com>
Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
Co-developed-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
Co-developed-by: Gregory Price <gregory.price@memverge.com>
Signed-off-by: Gregory Price <gregory.price@memverge.com>
Co-developed-by: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>
Signed-off-by: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>
Co-developed-by: Ravi Jonnalagadda <ravis.opensrc@micron.com>
Signed-off-by: Ravi Jonnalagadda <ravis.opensrc@micron.com>
---
 .../admin-guide/mm/numa_memory_policy.rst     |  11 +
 include/linux/mempolicy.h                     |   5 +
 include/uapi/linux/mempolicy.h                |   1 +
 mm/mempolicy.c                                | 197 +++++++++++++++++-
 4 files changed, 211 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index eca38fa81e0f..d2c8e712785b 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -250,6 +250,17 @@ MPOL_PREFERRED_MANY
 	can fall back to all existing numa nodes. This is effectively
 	MPOL_PREFERRED allowed for a mask rather than a single node.
 
+MPOL_WEIGHTED_INTERLEAVE
+	This mode operates the same as MPOL_INTERLEAVE, except that
+	interleaving behavior is executed based on weights set in
+	/sys/kernel/mm/mempolicy/weighted_interleave/
+
+	Weighted interleave allocations pages on nodes according to
+	their weight.  For example if nodes [0,1] are weighted [5,2]
+	respectively, 5 pages will be allocated on node0 for every
+	2 pages allocated on node1.  This can better distribute data
+	according to bandwidth on heterogeneous memory systems.
+
 NUMA memory policy supports the following optional mode flags:
 
 MPOL_F_STATIC_NODES
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 931b118336f4..ba09167e80f7 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -54,6 +54,11 @@ struct mempolicy {
 		nodemask_t cpuset_mems_allowed;	/* relative to these nodes */
 		nodemask_t user_nodemask;	/* nodemask passed by user */
 	} w;
+
+	/* Weighted interleave settings */
+	struct {
+		unsigned char cur_weight;
+	} wil;
 };
 
 /*
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index a8963f7ef4c2..1f9bb10d1a47 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -23,6 +23,7 @@ enum {
 	MPOL_INTERLEAVE,
 	MPOL_LOCAL,
 	MPOL_PREFERRED_MANY,
+	MPOL_WEIGHTED_INTERLEAVE,
 	MPOL_MAX,	/* always last member of enum */
 };
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 28dfae195beb..b4d94646e6a2 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -305,6 +305,7 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
 	policy->mode = mode;
 	policy->flags = flags;
 	policy->home_node = NUMA_NO_NODE;
+	policy->wil.cur_weight = 0;
 
 	return policy;
 }
@@ -417,6 +418,10 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
 		.create = mpol_new_nodemask,
 		.rebind = mpol_rebind_preferred,
 	},
+	[MPOL_WEIGHTED_INTERLEAVE] = {
+		.create = mpol_new_nodemask,
+		.rebind = mpol_rebind_nodemask,
+	},
 };
 
 static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist,
@@ -838,7 +843,8 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags,
 
 	old = current->mempolicy;
 	current->mempolicy = new;
-	if (new && new->mode == MPOL_INTERLEAVE)
+	if (new && (new->mode == MPOL_INTERLEAVE ||
+		    new->mode == MPOL_WEIGHTED_INTERLEAVE))
 		current->il_prev = MAX_NUMNODES-1;
 	task_unlock(current);
 	mpol_put(old);
@@ -864,6 +870,7 @@ static void get_policy_nodemask(struct mempolicy *pol, nodemask_t *nodes)
 	case MPOL_INTERLEAVE:
 	case MPOL_PREFERRED:
 	case MPOL_PREFERRED_MANY:
+	case MPOL_WEIGHTED_INTERLEAVE:
 		*nodes = pol->nodes;
 		break;
 	case MPOL_LOCAL:
@@ -948,6 +955,13 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
 		} else if (pol == current->mempolicy &&
 				pol->mode == MPOL_INTERLEAVE) {
 			*policy = next_node_in(current->il_prev, pol->nodes);
+		} else if (pol == current->mempolicy &&
+				(pol->mode == MPOL_WEIGHTED_INTERLEAVE)) {
+			if (pol->wil.cur_weight)
+				*policy = current->il_prev;
+			else
+				*policy = next_node_in(current->il_prev,
+						       pol->nodes);
 		} else {
 			err = -EINVAL;
 			goto out;
@@ -1777,7 +1791,8 @@ struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
 	pol = __get_vma_policy(vma, addr, ilx);
 	if (!pol)
 		pol = get_task_policy(current);
-	if (pol->mode == MPOL_INTERLEAVE) {
+	if (pol->mode == MPOL_INTERLEAVE ||
+	    pol->mode == MPOL_WEIGHTED_INTERLEAVE) {
 		*ilx += vma->vm_pgoff >> order;
 		*ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order);
 	}
@@ -1827,6 +1842,24 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
 	return zone >= dynamic_policy_zone;
 }
 
+static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
+{
+	unsigned int next;
+	struct task_struct *me = current;
+
+	next = next_node_in(me->il_prev, policy->nodes);
+	if (next == MAX_NUMNODES)
+		return next;
+
+	if (!policy->wil.cur_weight)
+		policy->wil.cur_weight = iw_table[next];
+
+	policy->wil.cur_weight--;
+	if (!policy->wil.cur_weight)
+		me->il_prev = next;
+	return next;
+}
+
 /* Do dynamic interleaving for a process */
 static unsigned int interleave_nodes(struct mempolicy *policy)
 {
@@ -1861,6 +1894,9 @@ unsigned int mempolicy_slab_node(void)
 	case MPOL_INTERLEAVE:
 		return interleave_nodes(policy);
 
+	case MPOL_WEIGHTED_INTERLEAVE:
+		return weighted_interleave_nodes(policy);
+
 	case MPOL_BIND:
 	case MPOL_PREFERRED_MANY:
 	{
@@ -1885,6 +1921,41 @@ unsigned int mempolicy_slab_node(void)
 	}
 }
 
+static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
+{
+	nodemask_t nodemask = pol->nodes;
+	unsigned int target, weight_total = 0;
+	int nid;
+	unsigned char weights[MAX_NUMNODES];
+	unsigned char weight;
+
+	barrier();
+
+	/* first ensure we have a valid nodemask */
+	nid = first_node(nodemask);
+	if (nid == MAX_NUMNODES)
+		return nid;
+
+	/* Then collect weights on stack and calculate totals */
+	for_each_node_mask(nid, nodemask) {
+		weight = iw_table[nid];
+		weight_total += weight;
+		weights[nid] = weight;
+	}
+
+	/* Finally, calculate the node offset based on totals */
+	target = (unsigned int)ilx % weight_total;
+	nid = first_node(nodemask);
+	while (target) {
+		weight = weights[nid];
+		if (target < weight)
+			break;
+		target -= weight;
+		nid = next_node_in(nid, nodemask);
+	}
+	return nid;
+}
+
 /*
  * Do static interleaving for interleave index @ilx.  Returns the ilx'th
  * node in pol->nodes (starting from ilx=0), wrapping around if ilx
@@ -1953,6 +2024,11 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
 		*nid = (ilx == NO_INTERLEAVE_INDEX) ?
 			interleave_nodes(pol) : interleave_nid(pol, ilx);
 		break;
+	case MPOL_WEIGHTED_INTERLEAVE:
+		*nid = (ilx == NO_INTERLEAVE_INDEX) ?
+			weighted_interleave_nodes(pol) :
+			weighted_interleave_nid(pol, ilx);
+		break;
 	}
 
 	return nodemask;
@@ -2014,6 +2090,7 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
 	case MPOL_PREFERRED_MANY:
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
+	case MPOL_WEIGHTED_INTERLEAVE:
 		*mask = mempolicy->nodes;
 		break;
 
@@ -2113,7 +2190,8 @@ struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
 		 * If the policy is interleave or does not allow the current
 		 * node in its nodemask, we allocate the standard way.
 		 */
-		if (pol->mode != MPOL_INTERLEAVE &&
+		if ((pol->mode != MPOL_INTERLEAVE &&
+		    pol->mode != MPOL_WEIGHTED_INTERLEAVE) &&
 		    (!nodemask || node_isset(nid, *nodemask))) {
 			/*
 			 * First, try to allocate THP only on local node, but
@@ -2249,6 +2327,106 @@ static unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp,
 	return total_allocated;
 }
 
+static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
+		struct mempolicy *pol, unsigned long nr_pages,
+		struct page **page_array)
+{
+	struct task_struct *me = current;
+	unsigned long total_allocated = 0;
+	unsigned long nr_allocated;
+	unsigned long rounds;
+	unsigned long node_pages, delta;
+	unsigned char weight;
+	unsigned char weights[MAX_NUMNODES];
+	unsigned int weight_total;
+	unsigned long rem_pages = nr_pages;
+	nodemask_t nodes = pol->nodes;
+	int nnodes, node, prev_node;
+	int i;
+
+	/* Stabilize the nodemask on the stack */
+	barrier();
+
+	nnodes = nodes_weight(nodes);
+
+	/* Collect weights and save them on stack so they don't change */
+	for_each_node_mask(node, nodes) {
+		weight = iw_table[node];
+		weight_total += weight;
+		weights[node] = weight;
+	}
+
+	/* Continue allocating from most recent node and adjust the nr_pages */
+	if (pol->wil.cur_weight) {
+		node = next_node_in(me->il_prev, nodes);
+		node_pages = pol->wil.cur_weight;
+		if (node_pages > rem_pages)
+			node_pages = rem_pages;
+		nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
+						  NULL, page_array);
+		page_array += nr_allocated;
+		total_allocated += nr_allocated;
+		/* if that's all the pages, no need to interleave */
+		if (rem_pages <= pol->wil.cur_weight) {
+			pol->wil.cur_weight -= rem_pages;
+			return total_allocated;
+		}
+		/* Otherwise we adjust nr_pages down, and continue from there */
+		rem_pages -= pol->wil.cur_weight;
+		pol->wil.cur_weight = 0;
+		prev_node = node;
+	}
+
+	/* Now we can continue allocating as if from 0 instead of an offset */
+	rounds = rem_pages / weight_total;
+	delta = rem_pages % weight_total;
+	for (i = 0; i < nnodes; i++) {
+		node = next_node_in(prev_node, nodes);
+		weight = weights[node];
+		node_pages = weight * rounds;
+		if (delta) {
+			if (delta > weight) {
+				node_pages += weight;
+				delta -= weight;
+			} else {
+				node_pages += delta;
+				delta = 0;
+			}
+		}
+		/* We may not make it all the way around */
+		if (!node_pages)
+			break;
+		/* If an over-allocation would occur, floor it */
+		if (node_pages + total_allocated > nr_pages) {
+			node_pages = nr_pages - total_allocated;
+			delta = 0;
+		}
+		nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
+						  NULL, page_array);
+		page_array += nr_allocated;
+		total_allocated += nr_allocated;
+		prev_node = node;
+	}
+
+	/*
+	 * Finally, we need to update me->il_prev and pol->wil.cur_weight
+	 * if there were overflow pages, but not equivalent to the node
+	 * weight, set the cur_weight to node_weight - delta and the
+	 * me->il_prev to the previous node. Otherwise if it was perfect
+	 * we can simply set il_prev to node and cur_weight to 0
+	 */
+	if (node_pages) {
+		me->il_prev = prev_node;
+		node_pages %= weight;
+		pol->wil.cur_weight = weight - node_pages;
+	} else {
+		me->il_prev = node;
+		pol->wil.cur_weight = 0;
+	}
+
+	return total_allocated;
+}
+
 static unsigned long alloc_pages_bulk_array_preferred_many(gfp_t gfp, int nid,
 		struct mempolicy *pol, unsigned long nr_pages,
 		struct page **page_array)
@@ -2289,6 +2467,11 @@ unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,
 		return alloc_pages_bulk_array_interleave(gfp, pol,
 							 nr_pages, page_array);
 
+	if (pol->mode == MPOL_WEIGHTED_INTERLEAVE)
+		return alloc_pages_bulk_array_weighted_interleave(gfp, pol,
+								  nr_pages,
+								  page_array);
+
 	if (pol->mode == MPOL_PREFERRED_MANY)
 		return alloc_pages_bulk_array_preferred_many(gfp,
 				numa_node_id(), pol, nr_pages, page_array);
@@ -2364,6 +2547,7 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b)
 	case MPOL_INTERLEAVE:
 	case MPOL_PREFERRED:
 	case MPOL_PREFERRED_MANY:
+	case MPOL_WEIGHTED_INTERLEAVE:
 		return !!nodes_equal(a->nodes, b->nodes);
 	case MPOL_LOCAL:
 		return true;
@@ -2500,6 +2684,10 @@ int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma,
 		polnid = interleave_nid(pol, ilx);
 		break;
 
+	case MPOL_WEIGHTED_INTERLEAVE:
+		polnid = weighted_interleave_nid(pol, ilx);
+		break;
+
 	case MPOL_PREFERRED:
 		if (node_isset(curnid, pol->nodes))
 			goto out;
@@ -2874,6 +3062,7 @@ static const char * const policy_modes[] =
 	[MPOL_PREFERRED]  = "prefer",
 	[MPOL_BIND]       = "bind",
 	[MPOL_INTERLEAVE] = "interleave",
+	[MPOL_WEIGHTED_INTERLEAVE] = "weighted interleave",
 	[MPOL_LOCAL]      = "local",
 	[MPOL_PREFERRED_MANY]  = "prefer (many)",
 };
@@ -2933,6 +3122,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
 		}
 		break;
 	case MPOL_INTERLEAVE:
+	case MPOL_WEIGHTED_INTERLEAVE:
 		/*
 		 * Default to online nodes with memory if no nodelist
 		 */
@@ -3043,6 +3233,7 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
 	case MPOL_PREFERRED_MANY:
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
+	case MPOL_WEIGHTED_INTERLEAVE:
 		nodes = pol->nodes;
 		break;
 	default:
-- 
2.39.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v2 03/11] mm/mempolicy: refactor sanitize_mpol_flags for reuse
  2023-12-09  6:59 [PATCH v2 00/11] mempolicy2, mbind2, and weighted interleave Gregory Price
  2023-12-09  6:59 ` [PATCH v2 01/11] mm/mempolicy: implement the sysfs-based weighted_interleave interface Gregory Price
  2023-12-09  6:59 ` [PATCH v2 02/11] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving Gregory Price
@ 2023-12-09  6:59 ` Gregory Price
  2023-12-09  6:59 ` [PATCH v2 04/11] mm/mempolicy: create struct mempolicy_args for creating new mempolicies Gregory Price
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 21+ messages in thread
From: Gregory Price @ 2023-12-09  6:59 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-doc, linux-fsdevel, linux-api, linux-arch, linux-kernel,
	akpm, arnd, tglx, luto, mingo, bp, dave.hansen, x86, hpa, mhocko,
	tj, ying.huang, gregory.price, corbet, rakie.kim, hyeongtak.ji,
	honggyu.kim, vtavarespetr, peterz, jgroves, ravis.opensrc,
	sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha

split sanitize_mpol_flags into sanitize and validate.

Sanitize is used by set_mempolicy to split (int mode) into mode
and mode_flags, and then validates them.

Validate validates already split flags.

Validate will be reused for new syscalls that accept already
split mode and mode_flags.

Signed-off-by: Gregory Price <gregory.price@memverge.com>
---
 mm/mempolicy.c | 29 ++++++++++++++++++++++-------
 1 file changed, 22 insertions(+), 7 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index b4d94646e6a2..65d023720e83 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1463,24 +1463,39 @@ static int copy_nodes_to_user(unsigned long __user *mask, unsigned long maxnode,
 	return copy_to_user(mask, nodes_addr(*nodes), copy) ? -EFAULT : 0;
 }
 
-/* Basic parameter sanity check used by both mbind() and set_mempolicy() */
-static inline int sanitize_mpol_flags(int *mode, unsigned short *flags)
+/*
+ * Basic parameter sanity check used by mbind/set_mempolicy
+ * May modify flags to include internal flags (e.g. MPOL_F_MOF/F_MORON)
+ */
+static inline int validate_mpol_flags(unsigned short mode, unsigned short *flags)
 {
-	*flags = *mode & MPOL_MODE_FLAGS;
-	*mode &= ~MPOL_MODE_FLAGS;
-
-	if ((unsigned int)(*mode) >=  MPOL_MAX)
+	if ((unsigned int)(mode) >= MPOL_MAX)
 		return -EINVAL;
 	if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES))
 		return -EINVAL;
 	if (*flags & MPOL_F_NUMA_BALANCING) {
-		if (*mode != MPOL_BIND)
+		if (mode != MPOL_BIND)
 			return -EINVAL;
 		*flags |= (MPOL_F_MOF | MPOL_F_MORON);
 	}
 	return 0;
 }
 
+/*
+ * Used by mbind/set_memplicy to split and validate mode/flags
+ * set_mempolicy combines (mode | flags), split them out into separate
+ * fields and return just the mode in mode_arg and flags in flags.
+ */
+static inline int sanitize_mpol_flags(int *mode_arg, unsigned short *flags)
+{
+	unsigned short mode = (*mode_arg & ~MPOL_MODE_FLAGS);
+
+	*flags = *mode_arg & MPOL_MODE_FLAGS;
+	*mode_arg = mode;
+
+	return validate_mpol_flags(mode, flags);
+}
+
 static long kernel_mbind(unsigned long start, unsigned long len,
 			 unsigned long mode, const unsigned long __user *nmask,
 			 unsigned long maxnode, unsigned int flags)
-- 
2.39.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v2 04/11] mm/mempolicy: create struct mempolicy_args for creating new mempolicies
  2023-12-09  6:59 [PATCH v2 00/11] mempolicy2, mbind2, and weighted interleave Gregory Price
                   ` (2 preceding siblings ...)
  2023-12-09  6:59 ` [PATCH v2 03/11] mm/mempolicy: refactor sanitize_mpol_flags for reuse Gregory Price
@ 2023-12-09  6:59 ` Gregory Price
  2023-12-09  6:59 ` [PATCH v2 05/11] mm/mempolicy: refactor kernel_get_mempolicy for code re-use Gregory Price
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 21+ messages in thread
From: Gregory Price @ 2023-12-09  6:59 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-doc, linux-fsdevel, linux-api, linux-arch, linux-kernel,
	akpm, arnd, tglx, luto, mingo, bp, dave.hansen, x86, hpa, mhocko,
	tj, ying.huang, gregory.price, corbet, rakie.kim, hyeongtak.ji,
	honggyu.kim, vtavarespetr, peterz, jgroves, ravis.opensrc,
	sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha

This patch adds a new kernel structure `struct mempolicy_args`,
intended to be used for an extensible get/set_mempolicy interface.

This implements the fields required to support the existing syscall
interfaces interfaces, but does not expose any user-facing arg
structure.

mpol_new is refactored to take the argument structure so that future
mempolicy extensions can all be managed in the mempolicy constructor.

The get_mempolicy and mbind syscalls are refactored to utilize the
new argument structure, as are all the callers of mpol_new() and
do_set_mempolicy.

Signed-off-by: Gregory Price <gregory.price@memverge.com>
---
 include/linux/mempolicy.h | 14 ++++++++
 mm/mempolicy.c            | 69 +++++++++++++++++++++++++++++----------
 2 files changed, 65 insertions(+), 18 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index ba09167e80f7..117c5395c6eb 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -61,6 +61,20 @@ struct mempolicy {
 	} wil;
 };
 
+/*
+ * Describes settings of a mempolicy during set/get syscalls and
+ * kernel internal calls to do_set_mempolicy()
+ */
+struct mempolicy_args {
+	unsigned short mode;		/* policy mode */
+	unsigned short mode_flags;	/* policy mode flags */
+	nodemask_t *policy_nodes;	/* get/set/mbind */
+	int policy_node;		/* get: policy node information */
+	unsigned long addr;		/* get: vma address */
+	int addr_node;			/* get: node the address belongs to */
+	int home_node;			/* mbind: use MPOL_MF_HOME_NODE */
+};
+
 /*
  * Support for managing mempolicy data objects (clone, copy, destroy)
  * The default fast path of a NULL MPOL_DEFAULT policy is always inlined.
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 65d023720e83..324dbf1782df 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -265,10 +265,12 @@ static int mpol_set_nodemask(struct mempolicy *pol,
  * This function just creates a new policy, does some check and simple
  * initialization. You must invoke mpol_set_nodemask() to set nodes.
  */
-static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
-				  nodemask_t *nodes)
+static struct mempolicy *mpol_new(struct mempolicy_args *args)
 {
 	struct mempolicy *policy;
+	unsigned short mode = args->mode;
+	unsigned short flags = args->mode_flags;
+	nodemask_t *nodes = args->policy_nodes;
 
 	if (mode == MPOL_DEFAULT) {
 		if (nodes && !nodes_empty(*nodes))
@@ -817,8 +819,7 @@ static int mbind_range(struct vma_iterator *vmi, struct vm_area_struct *vma,
 }
 
 /* Set the process memory policy */
-static long do_set_mempolicy(unsigned short mode, unsigned short flags,
-			     nodemask_t *nodes)
+static long do_set_mempolicy(struct mempolicy_args *args)
 {
 	struct mempolicy *new, *old;
 	NODEMASK_SCRATCH(scratch);
@@ -827,14 +828,14 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags,
 	if (!scratch)
 		return -ENOMEM;
 
-	new = mpol_new(mode, flags, nodes);
+	new = mpol_new(args);
 	if (IS_ERR(new)) {
 		ret = PTR_ERR(new);
 		goto out;
 	}
 
 	task_lock(current);
-	ret = mpol_set_nodemask(new, nodes, scratch);
+	ret = mpol_set_nodemask(new, args->policy_nodes, scratch);
 	if (ret) {
 		task_unlock(current);
 		mpol_put(new);
@@ -1232,8 +1233,7 @@ static struct folio *alloc_migration_target_by_mpol(struct folio *src,
 #endif
 
 static long do_mbind(unsigned long start, unsigned long len,
-		     unsigned short mode, unsigned short mode_flags,
-		     nodemask_t *nmask, unsigned long flags)
+		     struct mempolicy_args *margs, unsigned long flags)
 {
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma, *prev;
@@ -1253,7 +1253,7 @@ static long do_mbind(unsigned long start, unsigned long len,
 	if (start & ~PAGE_MASK)
 		return -EINVAL;
 
-	if (mode == MPOL_DEFAULT)
+	if (margs->mode == MPOL_DEFAULT)
 		flags &= ~MPOL_MF_STRICT;
 
 	len = PAGE_ALIGN(len);
@@ -1264,7 +1264,7 @@ static long do_mbind(unsigned long start, unsigned long len,
 	if (end == start)
 		return 0;
 
-	new = mpol_new(mode, mode_flags, nmask);
+	new = mpol_new(margs);
 	if (IS_ERR(new))
 		return PTR_ERR(new);
 
@@ -1281,7 +1281,8 @@ static long do_mbind(unsigned long start, unsigned long len,
 		NODEMASK_SCRATCH(scratch);
 		if (scratch) {
 			mmap_write_lock(mm);
-			err = mpol_set_nodemask(new, nmask, scratch);
+			err = mpol_set_nodemask(new, margs->policy_nodes,
+						scratch);
 			if (err)
 				mmap_write_unlock(mm);
 		} else
@@ -1295,7 +1296,7 @@ static long do_mbind(unsigned long start, unsigned long len,
 	 * Lock the VMAs before scanning for pages to migrate,
 	 * to ensure we don't miss a concurrently inserted page.
 	 */
-	nr_failed = queue_pages_range(mm, start, end, nmask,
+	nr_failed = queue_pages_range(mm, start, end, margs->policy_nodes,
 			flags | MPOL_MF_INVERT | MPOL_MF_WRLOCK, &pagelist);
 
 	if (nr_failed < 0) {
@@ -1500,6 +1501,7 @@ static long kernel_mbind(unsigned long start, unsigned long len,
 			 unsigned long mode, const unsigned long __user *nmask,
 			 unsigned long maxnode, unsigned int flags)
 {
+	struct mempolicy_args margs;
 	unsigned short mode_flags;
 	nodemask_t nodes;
 	int lmode = mode;
@@ -1514,7 +1516,12 @@ static long kernel_mbind(unsigned long start, unsigned long len,
 	if (err)
 		return err;
 
-	return do_mbind(start, len, lmode, mode_flags, &nodes, flags);
+	memset(&margs, 0, sizeof(margs));
+	margs.mode = lmode;
+	margs.mode_flags = mode_flags;
+	margs.policy_nodes = &nodes;
+
+	return do_mbind(start, len, &margs, flags);
 }
 
 SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, len,
@@ -1595,6 +1602,7 @@ SYSCALL_DEFINE6(mbind, unsigned long, start, unsigned long, len,
 static long kernel_set_mempolicy(int mode, const unsigned long __user *nmask,
 				 unsigned long maxnode)
 {
+	struct mempolicy_args args;
 	unsigned short mode_flags;
 	nodemask_t nodes;
 	int lmode = mode;
@@ -1608,7 +1616,12 @@ static long kernel_set_mempolicy(int mode, const unsigned long __user *nmask,
 	if (err)
 		return err;
 
-	return do_set_mempolicy(lmode, mode_flags, &nodes);
+	memset(&args, 0, sizeof(args));
+	args.mode = lmode;
+	args.mode_flags = mode_flags;
+	args.policy_nodes = &nodes;
+
+	return do_set_mempolicy(&args);
 }
 
 SYSCALL_DEFINE3(set_mempolicy, int, mode, const unsigned long __user *, nmask,
@@ -2890,6 +2903,7 @@ static int shared_policy_replace(struct shared_policy *sp, pgoff_t start,
 void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol)
 {
 	int ret;
+	struct mempolicy_args margs;
 
 	sp->root = RB_ROOT;		/* empty tree == default mempolicy */
 	rwlock_init(&sp->lock);
@@ -2902,8 +2916,12 @@ void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol)
 		if (!scratch)
 			goto put_mpol;
 
+		memset(&margs, 0, sizeof(margs));
+		margs.mode = mpol->mode;
+		margs.mode_flags = mpol->flags;
+		margs.policy_nodes = &mpol->w.user_nodemask;
 		/* contextualize the tmpfs mount point mempolicy to this file */
-		npol = mpol_new(mpol->mode, mpol->flags, &mpol->w.user_nodemask);
+		npol = mpol_new(&margs);
 		if (IS_ERR(npol))
 			goto free_scratch; /* no valid nodemask intersection */
 
@@ -3011,6 +3029,7 @@ static inline void __init check_numabalancing_enable(void)
 
 void __init numa_policy_init(void)
 {
+	struct mempolicy_args args;
 	nodemask_t interleave_nodes;
 	unsigned long largest = 0;
 	int nid, prefer = 0;
@@ -3056,7 +3075,11 @@ void __init numa_policy_init(void)
 	if (unlikely(nodes_empty(interleave_nodes)))
 		node_set(prefer, interleave_nodes);
 
-	if (do_set_mempolicy(MPOL_INTERLEAVE, 0, &interleave_nodes))
+	memset(&args, 0, sizeof(args));
+	args.mode = MPOL_INTERLEAVE;
+	args.policy_nodes = &interleave_nodes;
+
+	if (do_set_mempolicy(&args))
 		pr_err("%s: interleaving failed\n", __func__);
 
 	check_numabalancing_enable();
@@ -3065,7 +3088,12 @@ void __init numa_policy_init(void)
 /* Reset policy of current process to default */
 void numa_default_policy(void)
 {
-	do_set_mempolicy(MPOL_DEFAULT, 0, NULL);
+	struct mempolicy_args args;
+
+	memset(&args, 0, sizeof(args));
+	args.mode = MPOL_DEFAULT;
+
+	do_set_mempolicy(&args);
 }
 
 /*
@@ -3095,6 +3123,7 @@ static const char * const policy_modes[] =
  */
 int mpol_parse_str(char *str, struct mempolicy **mpol)
 {
+	struct mempolicy_args margs;
 	struct mempolicy *new = NULL;
 	unsigned short mode_flags;
 	nodemask_t nodes;
@@ -3181,7 +3210,11 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
 			goto out;
 	}
 
-	new = mpol_new(mode, mode_flags, &nodes);
+	memset(&margs, 0, sizeof(margs));
+	margs.mode = mode;
+	margs.mode_flags = mode_flags;
+	margs.policy_nodes = &nodes;
+	new = mpol_new(&margs);
 	if (IS_ERR(new))
 		goto out;
 
-- 
2.39.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v2 05/11] mm/mempolicy: refactor kernel_get_mempolicy for code re-use
  2023-12-09  6:59 [PATCH v2 00/11] mempolicy2, mbind2, and weighted interleave Gregory Price
                   ` (3 preceding siblings ...)
  2023-12-09  6:59 ` [PATCH v2 04/11] mm/mempolicy: create struct mempolicy_args for creating new mempolicies Gregory Price
@ 2023-12-09  6:59 ` Gregory Price
  2023-12-09  6:59 ` [PATCH v2 06/11] mm/mempolicy: allow home_node to be set by mpol_new Gregory Price
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 21+ messages in thread
From: Gregory Price @ 2023-12-09  6:59 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-doc, linux-fsdevel, linux-api, linux-arch, linux-kernel,
	akpm, arnd, tglx, luto, mingo, bp, dave.hansen, x86, hpa, mhocko,
	tj, ying.huang, gregory.price, corbet, rakie.kim, hyeongtak.ji,
	honggyu.kim, vtavarespetr, peterz, jgroves, ravis.opensrc,
	sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha

Pull operation flag checking from inside do_get_mempolicy out
to kernel_get_mempolicy.  This allows us to flatten the
internal code, and break it into separate functions for future
syscalls (get_mempolicy2, process_get_mempolicy) to re-use the
code, even after additional extensions are made.

The primary change is that the flag is treated as the multiplexer
that it actually is.  For get_mempolicy, the flags represents 3
different primary operations:

if (flags & MPOL_F_MEMS_ALLOWED)
	return task->mems_allowed
else if (flags & MPOL_F_ADDR)
	return vma mempolicy information
else
	return task mempolicy information

Plus the behavior modifying flag:

if (flags & MPOL_F_NODE)
	change the return value of (int __user *policy)
	based on whether MPOL_F_ADDR was set.

The original behavior of get_mempolicy is retained, but we utilize
the new mempolicy_args structure to pass the operations down the
stack.  This will allow us to extend the internal functions without
affecting the legacy behavior of get_mempolicy.

Signed-off-by: Gregory Price <gregory.price@memverge.com>
---
 mm/mempolicy.c | 240 ++++++++++++++++++++++++++++++-------------------
 1 file changed, 150 insertions(+), 90 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 324dbf1782df..ce5b7963e9b5 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -895,106 +895,107 @@ static int lookup_node(struct mm_struct *mm, unsigned long addr)
 	return ret;
 }
 
-/* Retrieve NUMA policy */
-static long do_get_mempolicy(int *policy, nodemask_t *nmask,
-			     unsigned long addr, unsigned long flags)
+/* Retrieve the mems_allowed for current task */
+static inline long do_get_mems_allowed(nodemask_t *nmask)
 {
-	int err;
-	struct mm_struct *mm = current->mm;
-	struct vm_area_struct *vma = NULL;
-	struct mempolicy *pol = current->mempolicy, *pol_refcount = NULL;
+	task_lock(current);
+	*nmask  = cpuset_current_mems_allowed;
+	task_unlock(current);
+	return 0;
+}
 
-	if (flags &
-		~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR|MPOL_F_MEMS_ALLOWED))
-		return -EINVAL;
+/* If the policy has additional node information to retrieve, return it */
+static long do_get_policy_node(struct mempolicy *pol)
+{
+	/*
+	 * For MPOL_INTERLEAVE, the extended node information is the next
+	 * node that will be selected for interleave. For weighted interleave
+	 * we return the next node based on the current weight.
+	 */
+	if (pol == current->mempolicy && pol->mode == MPOL_INTERLEAVE)
+		return next_node_in(current->il_prev, pol->nodes);
 
-	if (flags & MPOL_F_MEMS_ALLOWED) {
-		if (flags & (MPOL_F_NODE|MPOL_F_ADDR))
-			return -EINVAL;
-		*policy = 0;	/* just so it's initialized */
+	if (pol == current->mempolicy &&
+	    pol->mode == MPOL_WEIGHTED_INTERLEAVE) {
+		if (pol->wil.cur_weight)
+			return current->il_prev;
+		else
+			return next_node_in(current->il_prev, pol->nodes);
+	}
+	return -EINVAL;
+}
+
+/* Handle user_nodemask condition when fetching nodemask for userspace */
+static void do_get_mempolicy_nodemask(struct mempolicy *pol, nodemask_t *nmask)
+{
+	if (mpol_store_user_nodemask(pol)) {
+		*nmask = pol->w.user_nodemask;
+	} else {
 		task_lock(current);
-		*nmask  = cpuset_current_mems_allowed;
+		get_policy_nodemask(pol, nmask);
 		task_unlock(current);
-		return 0;
 	}
+}
 
-	if (flags & MPOL_F_ADDR) {
-		pgoff_t ilx;		/* ignored here */
-		/*
-		 * Do NOT fall back to task policy if the
-		 * vma/shared policy at addr is NULL.  We
-		 * want to return MPOL_DEFAULT in this case.
-		 */
-		mmap_read_lock(mm);
-		vma = vma_lookup(mm, addr);
-		if (!vma) {
-			mmap_read_unlock(mm);
-			return -EFAULT;
-		}
-		pol = __get_vma_policy(vma, addr, &ilx);
-	} else if (addr)
-		return -EINVAL;
+/* Retrieve NUMA policy for a VMA assocated with a given address  */
+static long do_get_vma_mempolicy(struct mempolicy_args *args)
+{
+	pgoff_t ilx;
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *vma = NULL;
+	struct mempolicy *pol = NULL;
 
+	mmap_read_lock(mm);
+	vma = vma_lookup(mm, args->addr);
+	if (!vma) {
+		mmap_read_unlock(mm);
+		return -EFAULT;
+	}
+	pol = __get_vma_policy(vma, args->addr, &ilx);
 	if (!pol)
-		pol = &default_policy;	/* indicates default behavior */
+		pol = &default_policy;
+	/* this may cause a double-reference, resolved by a put+cond_put */
+	mpol_get(pol);
+	mmap_read_unlock(mm);
 
-	if (flags & MPOL_F_NODE) {
-		if (flags & MPOL_F_ADDR) {
-			/*
-			 * Take a refcount on the mpol, because we are about to
-			 * drop the mmap_lock, after which only "pol" remains
-			 * valid, "vma" is stale.
-			 */
-			pol_refcount = pol;
-			vma = NULL;
-			mpol_get(pol);
-			mmap_read_unlock(mm);
-			err = lookup_node(mm, addr);
-			if (err < 0)
-				goto out;
-			*policy = err;
-		} else if (pol == current->mempolicy &&
-				pol->mode == MPOL_INTERLEAVE) {
-			*policy = next_node_in(current->il_prev, pol->nodes);
-		} else if (pol == current->mempolicy &&
-				(pol->mode == MPOL_WEIGHTED_INTERLEAVE)) {
-			if (pol->wil.cur_weight)
-				*policy = current->il_prev;
-			else
-				*policy = next_node_in(current->il_prev,
-						       pol->nodes);
-		} else {
-			err = -EINVAL;
-			goto out;
-		}
-	} else {
-		*policy = pol == &default_policy ? MPOL_DEFAULT :
-						pol->mode;
-		/*
-		 * Internal mempolicy flags must be masked off before exposing
-		 * the policy to userspace.
-		 */
-		*policy |= (pol->flags & MPOL_MODE_FLAGS);
-	}
+	/* Fetch the node for the given address */
+	args->addr_node = lookup_node(mm, args->addr);
 
-	err = 0;
-	if (nmask) {
-		if (mpol_store_user_nodemask(pol)) {
-			*nmask = pol->w.user_nodemask;
-		} else {
-			task_lock(current);
-			get_policy_nodemask(pol, nmask);
-			task_unlock(current);
-		}
+	args->mode = pol == &default_policy ? MPOL_DEFAULT : pol->mode;
+	args->mode_flags = (pol->flags & MPOL_MODE_FLAGS);
+
+	/* If this policy has extra node info, fetch that */
+	args->policy_node = do_get_policy_node(pol);
+
+	if (args->policy_nodes)
+		do_get_mempolicy_nodemask(pol, args->policy_nodes);
+
+	if (pol != &default_policy) {
+		mpol_put(pol);
+		mpol_cond_put(pol);
 	}
 
- out:
-	mpol_cond_put(pol);
-	if (vma)
-		mmap_read_unlock(mm);
-	if (pol_refcount)
-		mpol_put(pol_refcount);
-	return err;
+	return 0;
+}
+
+/* Retrieve NUMA policy for the current task */
+static long do_get_task_mempolicy(struct mempolicy_args *args)
+{
+	struct mempolicy *pol = current->mempolicy;
+
+	if (!pol)
+		pol = &default_policy;	/* indicates default behavior */
+
+	args->mode = pol == &default_policy ? MPOL_DEFAULT : pol->mode;
+	/* Internal flags must be masked off before exposing to userspace */
+	args->mode_flags = (pol->flags & MPOL_MODE_FLAGS);
+
+	args->policy_node = do_get_policy_node(pol);
+
+	if (args->policy_nodes)
+		do_get_mempolicy_nodemask(pol, args->policy_nodes);
+
+	return 0;
 }
 
 #ifdef CONFIG_MIGRATION
@@ -1731,16 +1732,75 @@ static int kernel_get_mempolicy(int __user *policy,
 				unsigned long addr,
 				unsigned long flags)
 {
+	struct mempolicy_args args;
 	int err;
-	int pval;
+	int pval = 0;
 	nodemask_t nodes;
 
 	if (nmask != NULL && maxnode < nr_node_ids)
 		return -EINVAL;
 
-	addr = untagged_addr(addr);
+	if (flags &
+		~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR|MPOL_F_MEMS_ALLOWED))
+		return -EINVAL;
 
-	err = do_get_mempolicy(&pval, &nodes, addr, flags);
+	/* Ensure any data that may be copied to userland is initialized */
+	memset(&args, 0, sizeof(args));
+	args.policy_nodes = &nodes;
+	args.addr = untagged_addr(addr);
+
+	/*
+	 * set_mempolicy was originally multiplexed based on 3 flags:
+	 *   MPOL_F_MEMS_ALLOWED:  fetch task->mems_allowed
+	 *   MPOL_F_ADDR        :  operate on vma->mempolicy
+	 *   MPOL_F_NODE        :  change return value of *policy
+	 *
+	 * Split this behavior out here, rather than internal functions,
+	 * so that the internal functions can be re-used by future
+	 * get_mempolicy2 interfaces and the arg structure made extensible
+	 */
+	if (flags & MPOL_F_MEMS_ALLOWED) {
+		if (flags & (MPOL_F_NODE|MPOL_F_ADDR))
+			return -EINVAL;
+		pval = 0;	/* just so it's initialized */
+		err = do_get_mems_allowed(&nodes);
+	} else if (flags & MPOL_F_ADDR) {
+		/* If F_ADDR, we operation on a vma policy (or default) */
+		err = do_get_vma_mempolicy(&args);
+		if (err)
+			return err;
+		 /* if (F_ADDR | F_NODE), *pval is the address' node */
+		if (flags & MPOL_F_NODE) {
+			/* if we failed to fetch, that's likely an EFAULT */
+			if (args.addr_node < 0)
+				return args.addr_node;
+			pval = args.addr_node;
+		} else
+			pval = args.mode | args.mode_flags;
+	} else {
+		 /* if not F_ADDR and addr != null, EINVAL */
+		if (addr)
+			return -EINVAL;
+
+		err = do_get_task_mempolicy(&args);
+		if (err)
+			return err;
+		/*
+		 * if F_NODE was set and mode was MPOL_INTERLEAVE
+		 * *pval is equal to next interleave node.
+		 *
+		 * if args.policy_node < 0, this means the mode did
+		 * not have a policy.  This presently emulates the
+		 * original behavior of (F_NODE) & (!MPOL_INTERLEAVE)
+		 * producing -EINVAL
+		 */
+		if (flags & MPOL_F_NODE) {
+			if (args.policy_node < 0)
+				return args.policy_node;
+			pval = args.policy_node;
+		} else
+			pval = args.mode | args.mode_flags;
+	}
 
 	if (err)
 		return err;
-- 
2.39.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v2 06/11] mm/mempolicy: allow home_node to be set by mpol_new
  2023-12-09  6:59 [PATCH v2 00/11] mempolicy2, mbind2, and weighted interleave Gregory Price
                   ` (4 preceding siblings ...)
  2023-12-09  6:59 ` [PATCH v2 05/11] mm/mempolicy: refactor kernel_get_mempolicy for code re-use Gregory Price
@ 2023-12-09  6:59 ` Gregory Price
  2023-12-09  6:59 ` [PATCH v2 07/11] mm/mempolicy: add userland mempolicy arg structure Gregory Price
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 21+ messages in thread
From: Gregory Price @ 2023-12-09  6:59 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-doc, linux-fsdevel, linux-api, linux-arch, linux-kernel,
	akpm, arnd, tglx, luto, mingo, bp, dave.hansen, x86, hpa, mhocko,
	tj, ying.huang, gregory.price, corbet, rakie.kim, hyeongtak.ji,
	honggyu.kim, vtavarespetr, peterz, jgroves, ravis.opensrc,
	sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha

This patch adds the plumbing into mpol_new() to allow the argument
structure's home_node field to set mempolicy home node.

The syscall sys_set_mempolicy_home_node was added to allow a home
node to be registered for a vma.

For set_mempolicy2 and mbind2 syscalls, it would be useful to add
this as an extension to allow the user to submit a fully formed
mempolicy configuration in a single call, rather than require
multiple calls to configure a mempolicy.

This will become particularly useful if/when pidfd interfaces to
change process mempolicies from outside the task appear, as each
call to change the mempolicy does an atomic swap of that policy
in the task, rather than mutate the policy.

Signed-off-by: Gregory Price <gregory.price@memverge.com>
---
 mm/mempolicy.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index ce5b7963e9b5..446167dcebdc 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -308,6 +308,7 @@ static struct mempolicy *mpol_new(struct mempolicy_args *args)
 	policy->flags = flags;
 	policy->home_node = NUMA_NO_NODE;
 	policy->wil.cur_weight = 0;
+	policy->home_node = args->home_node;
 
 	return policy;
 }
@@ -1621,6 +1622,7 @@ static long kernel_set_mempolicy(int mode, const unsigned long __user *nmask,
 	args.mode = lmode;
 	args.mode_flags = mode_flags;
 	args.policy_nodes = &nodes;
+	args.home_node = NUMA_NO_NODE;
 
 	return do_set_mempolicy(&args);
 }
@@ -2980,6 +2982,8 @@ void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol)
 		margs.mode = mpol->mode;
 		margs.mode_flags = mpol->flags;
 		margs.policy_nodes = &mpol->w.user_nodemask;
+		margs.home_node = NUMA_NO_NODE;
+
 		/* contextualize the tmpfs mount point mempolicy to this file */
 		npol = mpol_new(&margs);
 		if (IS_ERR(npol))
@@ -3138,6 +3142,7 @@ void __init numa_policy_init(void)
 	memset(&args, 0, sizeof(args));
 	args.mode = MPOL_INTERLEAVE;
 	args.policy_nodes = &interleave_nodes;
+	args.home_node = NUMA_NO_NODE;
 
 	if (do_set_mempolicy(&args))
 		pr_err("%s: interleaving failed\n", __func__);
@@ -3152,6 +3157,7 @@ void numa_default_policy(void)
 
 	memset(&args, 0, sizeof(args));
 	args.mode = MPOL_DEFAULT;
+	args.home_node = NUMA_NO_NODE;
 
 	do_set_mempolicy(&args);
 }
@@ -3274,6 +3280,8 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
 	margs.mode = mode;
 	margs.mode_flags = mode_flags;
 	margs.policy_nodes = &nodes;
+	margs.home_node = NUMA_NO_NODE;
+
 	new = mpol_new(&margs);
 	if (IS_ERR(new))
 		goto out;
-- 
2.39.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v2 07/11] mm/mempolicy: add userland mempolicy arg structure
  2023-12-09  6:59 [PATCH v2 00/11] mempolicy2, mbind2, and weighted interleave Gregory Price
                   ` (5 preceding siblings ...)
  2023-12-09  6:59 ` [PATCH v2 06/11] mm/mempolicy: allow home_node to be set by mpol_new Gregory Price
@ 2023-12-09  6:59 ` Gregory Price
  2023-12-09  6:59 ` [PATCH v2 08/11] mm/mempolicy: add set_mempolicy2 syscall Gregory Price
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 21+ messages in thread
From: Gregory Price @ 2023-12-09  6:59 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-doc, linux-fsdevel, linux-api, linux-arch, linux-kernel,
	akpm, arnd, tglx, luto, mingo, bp, dave.hansen, x86, hpa, mhocko,
	tj, ying.huang, gregory.price, corbet, rakie.kim, hyeongtak.ji,
	honggyu.kim, vtavarespetr, peterz, jgroves, ravis.opensrc,
	sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha,
	Frank van der Linden

This patch adds the new user-api argument structure intended for
set_mempolicy2 and mbind2.

struct mpol_args {
  __u16 mode;
  __u16 mode_flags;
  __s32 home_node;          /* mbind2: policy home node */
  __aligned_u64 *pol_nodes;
  __u64 pol_maxnodes;
  __u64 addr;     /* get_mempolicy: policy address */
  __s32 policy_node;        /* get_mempolicy: policy node info */
  __s32 addr_node;          /* get_mempolicy: memory range policy */
};

This structure is intended to be extensible as new mempolicy extensions
are added.

For example, set_mempolicy_home_node was added to allow vma mempolicies
to have a preferred/home node assigned.  This structure allows the
addition of that setting at the time the mempolicy is set, rather
than requiring additional calls to modify the policy.

Full breakdown of arguments as of this patch:
    mode:         Mempolicy mode (MPOL_DEFAULT, MPOL_INTERLEAVE)

    mode_flags:   Flags previously or'd into mode in set_mempolicy
                  (e.g.: MPOL_F_STATIC_NODES, MPOL_F_RELATIVE_NODES)

    home_node:    for mbind2.  Allows the setting of a policy's home
                  with the use of MPOL_MF_HOME_NODE

    pol_nodes:    Policy nodemask

    pol_maxnodes: Max number of nodes in the policy nodemask

    policy_node:  for get_mempolicy2.  Returns extended information
                  about a policy that was previously reported by
                  passing MPOL_F_NODE to get_mempolicy.  Instead of
                  overriding the mode value, simply add a field.

    addr:         for get_mempolicy2.  Used with MPOL_F_ADDR to run
                  get_mempolicy against the vma the address belongs
                  to instead of the task.

    addr_node:    for get_mempolicy2.  Returns the node the address
                  belongs to.  Previously get_mempolicy() would
                  override the output value of (mode) if MPOL_F_ADDR
                  and MPOL_F_NODE were set.  Instead, we extend
                  mpol_args to do this by default if MPOL_F_ADDR is
                  set and do away with MPOL_F_NODE.

Suggested-by: Frank van der Linden <fvdl@google.com>
Suggested-by: Vinicius Tavares Petrucci <vtavarespetr@micron.com>
Suggested-by: Hasan Al Maruf <Hasan.Maruf@amd.com>
Signed-off-by: Gregory Price <gregory.price@memverge.com>
Co-developed-by: Vinicius Tavares Petrucci <vtavarespetr@micron.com>
Signed-off-by: Vinicius Tavares Petrucci <vtavarespetr@micron.com>
---
 .../admin-guide/mm/numa_memory_policy.rst     | 20 +++++++++++++++++++
 include/uapi/linux/mempolicy.h                | 12 +++++++++++
 2 files changed, 32 insertions(+)

diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index d2c8e712785b..64c5804dc40f 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -482,6 +482,26 @@ closest to which page allocation will come from. Specifying the home node overri
 the default allocation policy to allocate memory close to the local node for an
 executing CPU.
 
+Extended Mempolicy Arguments::
+
+	struct mpol_args {
+		__u16 mode;
+		__u16 mode_flags;
+		__s32 home_node; /* mbind2: policy home node */
+		__aligned_u64 pol_nodes; /* nodemask pointer */
+		__u64 pol_maxnodes;
+		__u64 addr; /* get_mempolicy2: policy address */
+		__s32 policy_node; /* get_mempolicy2: policy node information */
+		__s32 addr_node; /* get_mempolicy2: memory range policy */
+	};
+
+The extended mempolicy argument structure is defined to allow the mempolicy
+interfaces future extensibility without the need for additional system calls.
+
+The core arguments (mode, mode_flags, pol_nodes, and pol_maxnodes) apply to
+all interfaces relative to their non-extended counterparts. Each additional
+field may only apply to specific extended interfaces.  See the respective
+extended interface man page for more details.
 
 Memory Policy Command Line Interface
 ====================================
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 1f9bb10d1a47..00a673e30047 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -27,6 +27,18 @@ enum {
 	MPOL_MAX,	/* always last member of enum */
 };
 
+struct mpol_args {
+	/* Basic mempolicy settings */
+	__u16 mode;
+	__u16 mode_flags;
+	__s32 home_node;	/* mbind2: policy home node */
+	__aligned_u64 pol_nodes;
+	__u64 pol_maxnodes;
+	__u64 addr;
+	__s32 policy_node;	/* get_mempolicy: policy node info */
+	__s32 addr_node;	/* get_mempolicy: memory range policy */
+};
+
 /* Flags for set_mempolicy */
 #define MPOL_F_STATIC_NODES	(1 << 15)
 #define MPOL_F_RELATIVE_NODES	(1 << 14)
-- 
2.39.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v2 08/11] mm/mempolicy: add set_mempolicy2 syscall
  2023-12-09  6:59 [PATCH v2 00/11] mempolicy2, mbind2, and weighted interleave Gregory Price
                   ` (6 preceding siblings ...)
  2023-12-09  6:59 ` [PATCH v2 07/11] mm/mempolicy: add userland mempolicy arg structure Gregory Price
@ 2023-12-09  6:59 ` Gregory Price
  2023-12-09 16:46   ` kernel test robot
  2023-12-09 18:24   ` kernel test robot
  2023-12-09  6:59 ` [PATCH v2 09/11] mm/mempolicy: add get_mempolicy2 syscall Gregory Price
                   ` (3 subsequent siblings)
  11 siblings, 2 replies; 21+ messages in thread
From: Gregory Price @ 2023-12-09  6:59 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-doc, linux-fsdevel, linux-api, linux-arch, linux-kernel,
	akpm, arnd, tglx, luto, mingo, bp, dave.hansen, x86, hpa, mhocko,
	tj, ying.huang, gregory.price, corbet, rakie.kim, hyeongtak.ji,
	honggyu.kim, vtavarespetr, peterz, jgroves, ravis.opensrc,
	sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha, Michal Hocko

set_mempolicy2 is an extensible set_mempolicy interface which allows
a user to set the per-task memory policy.

Defined as:

set_mempolicy2(struct mpol_args *args, size_t size, unsigned long flags);

relevant mpol_args fields include the following:

mode:         The MPOL_* policy (DEFAULT, INTERLEAVE, etc.)
mode_flags:   The MPOL_F_* flags that were previously passed in or'd
              into the mode.  This was split to hopefully allow future
              extensions additional mode/flag space.
pol_nodes:    the nodemask to apply for the memory policy
pol_maxnodes: The max number of nodes described by pol_nodes

The usize arg is intended for the user to pass in sizeof(mpol_args)
to allow forward/backward compatibility whenever possible.

The flags argument is intended to future proof the syscall against
future extensions which may require interpreting the arguments in
the structure differently.

Semantics of `set_mempolicy` are otherwise the same as `set_mempolicy`
as of this patch.

Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Gregory Price <gregory.price@memverge.com>
---
 .../admin-guide/mm/numa_memory_policy.rst     | 10 ++++++
 arch/alpha/kernel/syscalls/syscall.tbl        |  1 +
 arch/arm/tools/syscall.tbl                    |  1 +
 arch/m68k/kernel/syscalls/syscall.tbl         |  1 +
 arch/microblaze/kernel/syscalls/syscall.tbl   |  1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl     |  1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl     |  1 +
 arch/parisc/kernel/syscalls/syscall.tbl       |  1 +
 arch/powerpc/kernel/syscalls/syscall.tbl      |  1 +
 arch/s390/kernel/syscalls/syscall.tbl         |  1 +
 arch/sh/kernel/syscalls/syscall.tbl           |  1 +
 arch/sparc/kernel/syscalls/syscall.tbl        |  1 +
 arch/x86/entry/syscalls/syscall_32.tbl        |  1 +
 arch/x86/entry/syscalls/syscall_64.tbl        |  1 +
 arch/xtensa/kernel/syscalls/syscall.tbl       |  1 +
 include/linux/syscalls.h                      |  2 ++
 include/uapi/asm-generic/unistd.h             |  4 ++-
 mm/mempolicy.c                                | 36 +++++++++++++++++++
 18 files changed, 65 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index 64c5804dc40f..aabc24db92d3 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -432,6 +432,8 @@ Set [Task] Memory Policy::
 
 	long set_mempolicy(int mode, const unsigned long *nmask,
 					unsigned long maxnode);
+	long set_mempolicy2(struct mpol_args args, size_t size,
+			    unsigned long flags);
 
 Set's the calling task's "task/process memory policy" to mode
 specified by the 'mode' argument and the set of nodes defined by
@@ -440,6 +442,12 @@ specified by the 'mode' argument and the set of nodes defined by
 'mode' argument with the flag (for example: MPOL_INTERLEAVE |
 MPOL_F_STATIC_NODES).
 
+set_mempolicy2() is an extended version of set_mempolicy() capable
+of setting a mempolicy which requires more information than can be
+passed via get_mempolicy().  For example, weighted interleave with
+task-local weights requires a weight array to be passed via the
+'mpol_args->il_weights' argument in the 'struct mpol_args' arg.
+
 See the set_mempolicy(2) man page for more details
 
 
@@ -498,6 +506,8 @@ Extended Mempolicy Arguments::
 The extended mempolicy argument structure is defined to allow the mempolicy
 interfaces future extensibility without the need for additional system calls.
 
+Extended interfaces (set_mempolicy2) use this argument structure.
+
 The core arguments (mode, mode_flags, pol_nodes, and pol_maxnodes) apply to
 all interfaces relative to their non-extended counterparts. Each additional
 field may only apply to specific extended interfaces.  See the respective
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 18c842ca6c32..0dc288a1118a 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -496,3 +496,4 @@
 564	common	futex_wake			sys_futex_wake
 565	common	futex_wait			sys_futex_wait
 566	common	futex_requeue			sys_futex_requeue
+567	common	set_mempolicy2			sys_set_mempolicy2
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 584f9528c996..50172ec0e1f5 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -470,3 +470,4 @@
 454	common	futex_wake			sys_futex_wake
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
+457	common	set_mempolicy2			sys_set_mempolicy2
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 7a4b780e82cb..839d90c535f2 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -456,3 +456,4 @@
 454	common	futex_wake			sys_futex_wake
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
+457	common	set_mempolicy2			sys_set_mempolicy2
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 5b6a0b02b7de..567c8b883735 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -462,3 +462,4 @@
 454	common	futex_wake			sys_futex_wake
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
+457	common	set_mempolicy2			sys_set_mempolicy2
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index a842b41c8e06..cc0640e16f2f 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -395,3 +395,4 @@
 454	n32	futex_wake			sys_futex_wake
 455	n32	futex_wait			sys_futex_wait
 456	n32	futex_requeue			sys_futex_requeue
+457	n32	set_mempolicy2			sys_set_mempolicy2
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index 525cc54bc63b..f7262fde98d9 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -444,3 +444,4 @@
 454	o32	futex_wake			sys_futex_wake
 455	o32	futex_wait			sys_futex_wait
 456	o32	futex_requeue			sys_futex_requeue
+457	o32	set_mempolicy2			sys_set_mempolicy2
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index a47798fed54e..e10f0e8bd064 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -455,3 +455,4 @@
 454	common	futex_wake			sys_futex_wake
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
+457	common	set_mempolicy2			sys_set_mempolicy2
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 7fab411378f2..4f03f5f42b78 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -543,3 +543,4 @@
 454	common	futex_wake			sys_futex_wake
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
+457	common	set_mempolicy2			sys_set_mempolicy2
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 86fec9b080f6..f98dadc2e9df 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -459,3 +459,4 @@
 454  common	futex_wake		sys_futex_wake			sys_futex_wake
 455  common	futex_wait		sys_futex_wait			sys_futex_wait
 456  common	futex_requeue		sys_futex_requeue		sys_futex_requeue
+457  common	set_mempolicy2		sys_set_mempolicy2		sys_set_mempolicy2
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 363fae0fe9bf..f47ba9f2d05d 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -459,3 +459,4 @@
 454	common	futex_wake			sys_futex_wake
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
+457	common	set_mempolicy2			sys_set_mempolicy2
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index 7bcaa3d5ea44..53fb16616728 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -502,3 +502,4 @@
 454	common	futex_wake			sys_futex_wake
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
+457	common	set_mempolicy2			sys_set_mempolicy2
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index c8fac5205803..4b4dc41b24ee 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -461,3 +461,4 @@
 454	i386	futex_wake		sys_futex_wake
 455	i386	futex_wait		sys_futex_wait
 456	i386	futex_requeue		sys_futex_requeue
+457	i386	set_mempolicy2		sys_set_mempolicy2
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 8cb8bf68721c..1bc2190bec27 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -378,6 +378,7 @@
 454	common	futex_wake		sys_futex_wake
 455	common	futex_wait		sys_futex_wait
 456	common	futex_requeue		sys_futex_requeue
+457	common	set_mempolicy2		sys_set_mempolicy2
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 06eefa9c1458..e26dc89399eb 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -427,3 +427,4 @@
 454	common	futex_wake			sys_futex_wake
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
+457	common	set_mempolicy2			sys_set_mempolicy2
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index fd9d12de7e92..3244cd990858 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -822,6 +822,8 @@ asmlinkage long sys_get_mempolicy(int __user *policy,
 				unsigned long addr, unsigned long flags);
 asmlinkage long sys_set_mempolicy(int mode, const unsigned long __user *nmask,
 				unsigned long maxnode);
+asmlinkage long sys_set_mempolicy2(struct mpol_args *args, size_t size,
+				   unsigned long flags);
 asmlinkage long sys_migrate_pages(pid_t pid, unsigned long maxnode,
 				const unsigned long __user *from,
 				const unsigned long __user *to);
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 756b013fb832..55486aba099f 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -828,9 +828,11 @@ __SYSCALL(__NR_futex_wake, sys_futex_wake)
 __SYSCALL(__NR_futex_wait, sys_futex_wait)
 #define __NR_futex_requeue 456
 __SYSCALL(__NR_futex_requeue, sys_futex_requeue)
+#define __NR_set_mempolicy2 457
+__SYSCALL(__NR_set_mempolicy2, sys_set_mempolicy2)
 
 #undef __NR_syscalls
-#define __NR_syscalls 457
+#define __NR_syscalls 458
 
 /*
  * 32 bit systems traditionally used different
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 446167dcebdc..a56ff02f780e 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1633,6 +1633,42 @@ SYSCALL_DEFINE3(set_mempolicy, int, mode, const unsigned long __user *, nmask,
 	return kernel_set_mempolicy(mode, nmask, maxnode);
 }
 
+SYSCALL_DEFINE3(set_mempolicy2, struct mpol_args __user *, uargs, size_t, usize,
+		unsigned long, flags)
+{
+	struct mpol_args kargs;
+	struct mempolicy_args margs;
+	int err;
+	nodemask_t policy_nodemask;
+	unsigned long __user *nodes_ptr;
+
+	if (flags)
+		return -EINVAL;
+
+	err = copy_struct_from_user(&kargs, sizeof(kargs), uargs, usize);
+	if (err)
+		return err;
+
+	err = validate_mpol_flags(kargs.mode, &kargs.mode_flags);
+	if (err)
+		return err;
+
+	memset(&margs, '\0', sizeof(margs));
+	margs.mode = kargs.mode;
+	margs.mode_flags = kargs.mode_flags;
+	if (kargs.pol_nodes) {
+		nodes_ptr = u64_to_user_ptr(kargs.pol_nodes);
+		err = get_nodes(&policy_nodemask, nodes_ptr,
+				kargs.pol_maxnodes);
+		if (err)
+			return err;
+		margs.policy_nodes = &policy_nodemask;
+	} else
+		margs.policy_nodes = NULL;
+
+	return do_set_mempolicy(&margs);
+}
+
 static int kernel_migrate_pages(pid_t pid, unsigned long maxnode,
 				const unsigned long __user *old_nodes,
 				const unsigned long __user *new_nodes)
-- 
2.39.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v2 09/11] mm/mempolicy: add get_mempolicy2 syscall
  2023-12-09  6:59 [PATCH v2 00/11] mempolicy2, mbind2, and weighted interleave Gregory Price
                   ` (7 preceding siblings ...)
  2023-12-09  6:59 ` [PATCH v2 08/11] mm/mempolicy: add set_mempolicy2 syscall Gregory Price
@ 2023-12-09  6:59 ` Gregory Price
  2023-12-09  6:59 ` [PATCH v2 10/11] mm/mempolicy: add the mbind2 syscall Gregory Price
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 21+ messages in thread
From: Gregory Price @ 2023-12-09  6:59 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-doc, linux-fsdevel, linux-api, linux-arch, linux-kernel,
	akpm, arnd, tglx, luto, mingo, bp, dave.hansen, x86, hpa, mhocko,
	tj, ying.huang, gregory.price, corbet, rakie.kim, hyeongtak.ji,
	honggyu.kim, vtavarespetr, peterz, jgroves, ravis.opensrc,
	sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha, Michal Hocko

get_mempolicy2 is an extensible get_mempolicy interface which allows
a user to retrieve the memory policy for a task or address.

Defined as:

get_mempolicy2(struct mpol_args *args, size_t size, unsigned long flags)

Input values include the following fields of mpol_args:

pol_nodes:    if set, the nodemask of the policy returned here
pol_maxnodes: if pol_nodes is set, must describe max number of nodes
              to be copied to pol_nodes
addr:         if MPOL_F_ADDR is passed in `flags`, this address will be
              used to return the mempolicy details of the vma the
              address belongs to

flags:        if MPOL_F_MEMS_ALLOWED, returns mems_allowed in pol_nodes
              if MPOL_F_ADDR, return mempolicy info vma containing addr
              else, returns per-task mempolicy information

Output values include the following fields of mpol_args:

mode:         mempolicy mode
mode_flags:   mempolicy mode flags
pol_nodes:    if set, the nodemask for the mempolicy
policy_node:  if the policy has extended node information, it will
              be placed here.  For example MPOL_INTERLEAVE will
              return the next node which will be used for allocation
addr_node:    If MPOL_F_ADDR is set, the numa node that the address
              is located on will be returned.
home_node:    policy home node will be returned here, or -1 if not.

MPOL_F_NODE has been dropped from get_mempolicy2 (it is ignored) in
favor or returning explicit values in `policy_node` and `addr_node`.

Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Gregory Price <gregory.price@memverge.com>
---
 .../admin-guide/mm/numa_memory_policy.rst     |  8 +++-
 arch/alpha/kernel/syscalls/syscall.tbl        |  1 +
 arch/arm/tools/syscall.tbl                    |  1 +
 arch/m68k/kernel/syscalls/syscall.tbl         |  1 +
 arch/microblaze/kernel/syscalls/syscall.tbl   |  1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl     |  1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl     |  1 +
 arch/parisc/kernel/syscalls/syscall.tbl       |  1 +
 arch/powerpc/kernel/syscalls/syscall.tbl      |  1 +
 arch/s390/kernel/syscalls/syscall.tbl         |  1 +
 arch/sh/kernel/syscalls/syscall.tbl           |  1 +
 arch/sparc/kernel/syscalls/syscall.tbl        |  1 +
 arch/x86/entry/syscalls/syscall_32.tbl        |  1 +
 arch/x86/entry/syscalls/syscall_64.tbl        |  1 +
 arch/xtensa/kernel/syscalls/syscall.tbl       |  1 +
 include/linux/syscalls.h                      |  2 +
 include/uapi/asm-generic/unistd.h             |  4 +-
 mm/mempolicy.c                                | 47 +++++++++++++++++++
 18 files changed, 73 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index aabc24db92d3..a52624ab659a 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -456,11 +456,17 @@ Get [Task] Memory Policy or Related Information::
 	long get_mempolicy(int *mode,
 			   const unsigned long *nmask, unsigned long maxnode,
 			   void *addr, int flags);
+	long get_mempolicy2(struct mpol_args args, size_t size,
+			    unsigned long flags);
 
 Queries the "task/process memory policy" of the calling task, or the
 policy or location of a specified virtual address, depending on the
 'flags' argument.
 
+get_mempolicy2() is an extended version of get_mempolicy() capable of
+acquiring extended information about a mempolicy, including those
+that can only be set via set_mempolicy2() or mbind2()..
+
 See the get_mempolicy(2) man page for more details
 
 
@@ -506,7 +512,7 @@ Extended Mempolicy Arguments::
 The extended mempolicy argument structure is defined to allow the mempolicy
 interfaces future extensibility without the need for additional system calls.
 
-Extended interfaces (set_mempolicy2) use this argument structure.
+Extended interfaces (set_mempolicy2 and get_mempolicy2) use this structure.
 
 The core arguments (mode, mode_flags, pol_nodes, and pol_maxnodes) apply to
 all interfaces relative to their non-extended counterparts. Each additional
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 0dc288a1118a..0301a8b0a262 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -497,3 +497,4 @@
 565	common	futex_wait			sys_futex_wait
 566	common	futex_requeue			sys_futex_requeue
 567	common	set_mempolicy2			sys_set_mempolicy2
+568	common	get_mempolicy2			sys_get_mempolicy2
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 50172ec0e1f5..771a33446e8e 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -471,3 +471,4 @@
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
+458	common	get_mempolicy2			sys_get_mempolicy2
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 839d90c535f2..048a409e684c 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -457,3 +457,4 @@
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
+458	common	get_mempolicy2			sys_get_mempolicy2
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 567c8b883735..327b01bd6793 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -463,3 +463,4 @@
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
+458	common	get_mempolicy2			sys_get_mempolicy2
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index cc0640e16f2f..921d58e1da23 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -396,3 +396,4 @@
 455	n32	futex_wait			sys_futex_wait
 456	n32	futex_requeue			sys_futex_requeue
 457	n32	set_mempolicy2			sys_set_mempolicy2
+458	n32	get_mempolicy2			sys_get_mempolicy2
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index f7262fde98d9..9271c83c9993 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -445,3 +445,4 @@
 455	o32	futex_wait			sys_futex_wait
 456	o32	futex_requeue			sys_futex_requeue
 457	o32	set_mempolicy2			sys_set_mempolicy2
+458	o32	get_mempolicy2			sys_get_mempolicy2
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index e10f0e8bd064..0654f3f89fc7 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -456,3 +456,4 @@
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
+458	common	get_mempolicy2			sys_get_mempolicy2
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 4f03f5f42b78..ac11d2064e7a 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -544,3 +544,4 @@
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
+458	common	get_mempolicy2			sys_get_mempolicy2
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index f98dadc2e9df..1cdcafe1ccca 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -460,3 +460,4 @@
 455  common	futex_wait		sys_futex_wait			sys_futex_wait
 456  common	futex_requeue		sys_futex_requeue		sys_futex_requeue
 457  common	set_mempolicy2		sys_set_mempolicy2		sys_set_mempolicy2
+458  common	get_mempolicy2		sys_get_mempolicy2		sys_get_mempolicy2
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index f47ba9f2d05d..f71742024c29 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -460,3 +460,4 @@
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
+458	common	get_mempolicy2			sys_get_mempolicy2
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index 53fb16616728..2fbf5dbe0620 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -503,3 +503,4 @@
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
+458	common	get_mempolicy2			sys_get_mempolicy2
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 4b4dc41b24ee..0af813b9a118 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -462,3 +462,4 @@
 455	i386	futex_wait		sys_futex_wait
 456	i386	futex_requeue		sys_futex_requeue
 457	i386	set_mempolicy2		sys_set_mempolicy2
+458	i386	get_mempolicy2		sys_get_mempolicy2
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 1bc2190bec27..0b777876fc15 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -379,6 +379,7 @@
 455	common	futex_wait		sys_futex_wait
 456	common	futex_requeue		sys_futex_requeue
 457	common	set_mempolicy2		sys_set_mempolicy2
+458	common	get_mempolicy2		sys_get_mempolicy2
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index e26dc89399eb..4536c9a4227d 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -428,3 +428,4 @@
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
+458	common	get_mempolicy2			sys_get_mempolicy2
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 3244cd990858..774512b7934e 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -820,6 +820,8 @@ asmlinkage long sys_get_mempolicy(int __user *policy,
 				unsigned long __user *nmask,
 				unsigned long maxnode,
 				unsigned long addr, unsigned long flags);
+asmlinkage long sys_get_mempolicy2(struct mpol_args *args, size_t size,
+				   unsigned long flags);
 asmlinkage long sys_set_mempolicy(int mode, const unsigned long __user *nmask,
 				unsigned long maxnode);
 asmlinkage long sys_set_mempolicy2(struct mpol_args *args, size_t size,
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 55486aba099f..719accc731db 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -830,9 +830,11 @@ __SYSCALL(__NR_futex_wait, sys_futex_wait)
 __SYSCALL(__NR_futex_requeue, sys_futex_requeue)
 #define __NR_set_mempolicy2 457
 __SYSCALL(__NR_set_mempolicy2, sys_set_mempolicy2)
+#define __NR_get_mempolicy2 458
+__SYSCALL(__NR_get_mempolicy2, sys_get_mempolicy2)
 
 #undef __NR_syscalls
-#define __NR_syscalls 458
+#define __NR_syscalls 459
 
 /*
  * 32 bit systems traditionally used different
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index a56ff02f780e..cfe22156ef13 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1859,6 +1859,53 @@ SYSCALL_DEFINE5(get_mempolicy, int __user *, policy,
 	return kernel_get_mempolicy(policy, nmask, maxnode, addr, flags);
 }
 
+SYSCALL_DEFINE3(get_mempolicy2, struct mpol_args __user *, uargs, size_t, usize,
+		unsigned long, flags)
+{
+	struct mpol_args kargs;
+	struct mempolicy_args margs;
+	int err;
+	nodemask_t policy_nodemask;
+	unsigned long __user *nodes_ptr;
+
+	err = copy_struct_from_user(&kargs, sizeof(kargs), uargs, usize);
+	if (err)
+		return -EINVAL;
+
+	if (flags & MPOL_F_MEMS_ALLOWED) {
+		if (!margs.policy_nodes)
+			return -EINVAL;
+		err = do_get_mems_allowed(&policy_nodemask);
+		if (err)
+			return err;
+		nodes_ptr = u64_to_user_ptr(kargs.pol_nodes);
+		return copy_nodes_to_user(nodes_ptr, kargs.pol_maxnodes,
+					  &policy_nodemask);
+	}
+
+	margs.policy_nodes = kargs.pol_nodes ? &policy_nodemask : NULL;
+	if (flags & MPOL_F_ADDR) {
+		margs.addr = kargs.addr;
+		err = do_get_vma_mempolicy(&margs);
+	} else
+		err = do_get_task_mempolicy(&margs);
+
+	if (err)
+		return err;
+
+	kargs.mode = margs.mode;
+	kargs.mode_flags = margs.mode_flags;
+	kargs.policy_node = margs.policy_node;
+	kargs.addr_node = (flags & MPOL_F_ADDR) ? margs.addr_node : -1;
+	if (kargs.pol_nodes) {
+		nodes_ptr = u64_to_user_ptr(kargs.pol_nodes);
+		err = copy_nodes_to_user(nodes_ptr, kargs.pol_maxnodes,
+					 margs.policy_nodes);
+	}
+
+	return copy_to_user(uargs, &kargs, usize) ? -EFAULT : 0;
+}
+
 bool vma_migratable(struct vm_area_struct *vma)
 {
 	if (vma->vm_flags & (VM_IO | VM_PFNMAP))
-- 
2.39.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v2 10/11] mm/mempolicy: add the mbind2 syscall
  2023-12-09  6:59 [PATCH v2 00/11] mempolicy2, mbind2, and weighted interleave Gregory Price
                   ` (8 preceding siblings ...)
  2023-12-09  6:59 ` [PATCH v2 09/11] mm/mempolicy: add get_mempolicy2 syscall Gregory Price
@ 2023-12-09  6:59 ` Gregory Price
  2023-12-09  6:59 ` [PATCH v2 11/11] mm/mempolicy: extend set_mempolicy2 and mbind2 to support weighted interleave Gregory Price
  2023-12-11  5:53 ` [PATCH v2 00/11] mempolicy2, mbind2, and " Huang, Ying
  11 siblings, 0 replies; 21+ messages in thread
From: Gregory Price @ 2023-12-09  6:59 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-doc, linux-fsdevel, linux-api, linux-arch, linux-kernel,
	akpm, arnd, tglx, luto, mingo, bp, dave.hansen, x86, hpa, mhocko,
	tj, ying.huang, gregory.price, corbet, rakie.kim, hyeongtak.ji,
	honggyu.kim, vtavarespetr, peterz, jgroves, ravis.opensrc,
	sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha, Michal Hocko,
	Frank van der Linden

mbind2 is an extensible mbind interface which allows a user to
set the mempolicy for one or more address ranges.

Defined as:

mbind2(struct mpol_args *args, size_t size, unsigned long flags)

Input values include the following fields of mpl_args:

mode:         The MPOL_* policy (DEFAULT, INTERLEAVE, etc.)
mode_flags:   The MPOL_F_* flags that were previously passed in or'd
	      into the mode.  This was split to hopefully allow future
	      extensions additional mode/flag space.
pol_nodes:    the nodemask to apply for the memory policy
pol_maxnodes: The max number of nodes described by pol_nodes
home_node:    if MPOL_MF_HOME_NODE, set home node of policy to this
vec:          the vector of (address, len) memory ranges to operate on
vlen:         the number of entries in vec

The semantics are otherwise the same as mbind(), except that
the home_node can be set, and all address ranges defined by
vec/vlen will be operated on.

Valid flags for mbind2 include the same flags as mbind, plus
MPOL_MF_HOME_NODE, which informs the syscall to utilize the value
of mpol_args->home_node to set the mempolicy home node.

Suggested-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Frank van der Linden <fvdl@google.com>
Suggested-by: Vinicius Tavares Petrucci <vtavarespetr@micron.com>
Suggested-by: Rakie Kim <rakie.kim@sk.com>
Suggested-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
Suggested-by: Honggyu Kim <honggyu.kim@sk.com>
Signed-off-by: Gregory Price <gregory.price@memverge.com>
Co-developed-by: Vinicius Tavares Petrucci <vtavarespetr@micron.com>
---
 .../admin-guide/mm/numa_memory_policy.rst     | 12 +++-
 arch/alpha/kernel/syscalls/syscall.tbl        |  1 +
 arch/arm/tools/syscall.tbl                    |  1 +
 arch/m68k/kernel/syscalls/syscall.tbl         |  1 +
 arch/microblaze/kernel/syscalls/syscall.tbl   |  1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl     |  1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl     |  1 +
 arch/parisc/kernel/syscalls/syscall.tbl       |  1 +
 arch/powerpc/kernel/syscalls/syscall.tbl      |  1 +
 arch/s390/kernel/syscalls/syscall.tbl         |  1 +
 arch/sh/kernel/syscalls/syscall.tbl           |  1 +
 arch/sparc/kernel/syscalls/syscall.tbl        |  1 +
 arch/x86/entry/syscalls/syscall_32.tbl        |  1 +
 arch/x86/entry/syscalls/syscall_64.tbl        |  1 +
 arch/xtensa/kernel/syscalls/syscall.tbl       |  1 +
 include/linux/syscalls.h                      |  3 +
 include/uapi/asm-generic/unistd.h             |  4 +-
 include/uapi/linux/mempolicy.h                |  5 +-
 mm/mempolicy.c                                | 68 +++++++++++++++++++
 19 files changed, 102 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index a52624ab659a..f1ba33de3a6e 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -475,12 +475,18 @@ Install VMA/Shared Policy for a Range of Task's Address Space::
 	long mbind(void *start, unsigned long len, int mode,
 		   const unsigned long *nmask, unsigned long maxnode,
 		   unsigned flags);
+	long mbind2(struct iovec *vec, size_t vlen, struct mpol_args args,
+		    size_t size, unsigned long flags);
 
 mbind() installs the policy specified by (mode, nmask, maxnodes) as a
 VMA policy for the range of the calling task's address space specified
 by the 'start' and 'len' arguments.  Additional actions may be
 requested via the 'flags' argument.
 
+mbind2() is an extended version of mbind() capable of operating on multiple
+memory ranges in one syscall, and which is capable of setting the home node
+for the memory policy without an additional call to set_mempolicy_home_node()
+
 See the mbind(2) man page for more details.
 
 Set home node for a Range of Task's Address Spacec::
@@ -496,6 +502,9 @@ closest to which page allocation will come from. Specifying the home node overri
 the default allocation policy to allocate memory close to the local node for an
 executing CPU.
 
+mbind2() also provides a way for the home node to be set at the time the
+mempolicy is set. See the mbind(2) man page for more details.
+
 Extended Mempolicy Arguments::
 
 	struct mpol_args {
@@ -512,7 +521,8 @@ Extended Mempolicy Arguments::
 The extended mempolicy argument structure is defined to allow the mempolicy
 interfaces future extensibility without the need for additional system calls.
 
-Extended interfaces (set_mempolicy2 and get_mempolicy2) use this structure.
+Extended interfaces (set_mempolicy2, get_mempolicy2, and mbind2) use this
+this argument structure.
 
 The core arguments (mode, mode_flags, pol_nodes, and pol_maxnodes) apply to
 all interfaces relative to their non-extended counterparts. Each additional
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 0301a8b0a262..e8239293c35a 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -498,3 +498,4 @@
 566	common	futex_requeue			sys_futex_requeue
 567	common	set_mempolicy2			sys_set_mempolicy2
 568	common	get_mempolicy2			sys_get_mempolicy2
+569	common	mbind2				sys_mbind2
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 771a33446e8e..a3f39750257a 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -472,3 +472,4 @@
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
 458	common	get_mempolicy2			sys_get_mempolicy2
+459	common	mbind2				sys_mbind2
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 048a409e684c..9a12dface18e 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -458,3 +458,4 @@
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
 458	common	get_mempolicy2			sys_get_mempolicy2
+459	common	mbind2				sys_mbind2
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 327b01bd6793..6cb740123137 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -464,3 +464,4 @@
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
 458	common	get_mempolicy2			sys_get_mempolicy2
+459	common	mbind2				sys_mbind2
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 921d58e1da23..52cf720f8ae2 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -397,3 +397,4 @@
 456	n32	futex_requeue			sys_futex_requeue
 457	n32	set_mempolicy2			sys_set_mempolicy2
 458	n32	get_mempolicy2			sys_get_mempolicy2
+459	n32	mbind2				sys_mbind2
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index 9271c83c9993..fd37c5301a48 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -446,3 +446,4 @@
 456	o32	futex_requeue			sys_futex_requeue
 457	o32	set_mempolicy2			sys_set_mempolicy2
 458	o32	get_mempolicy2			sys_get_mempolicy2
+459	o32	mbind2				sys_mbind2
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index 0654f3f89fc7..fcd67bc405b1 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -457,3 +457,4 @@
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
 458	common	get_mempolicy2			sys_get_mempolicy2
+459	common	mbind2				sys_mbind2
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index ac11d2064e7a..89715417014c 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -545,3 +545,4 @@
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
 458	common	get_mempolicy2			sys_get_mempolicy2
+459	common	mbind2				sys_mbind2
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 1cdcafe1ccca..c8304e0d0aa7 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -461,3 +461,4 @@
 456  common	futex_requeue		sys_futex_requeue		sys_futex_requeue
 457  common	set_mempolicy2		sys_set_mempolicy2		sys_set_mempolicy2
 458  common	get_mempolicy2		sys_get_mempolicy2		sys_get_mempolicy2
+459  common	mbind2			sys_mbind2			sys_mbind2
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index f71742024c29..e5c51b6c367f 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -461,3 +461,4 @@
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
 458	common	get_mempolicy2			sys_get_mempolicy2
+459	common	mbind2				sys_mbind2
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index 2fbf5dbe0620..74527f585500 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -504,3 +504,4 @@
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
 458	common	get_mempolicy2			sys_get_mempolicy2
+459	common	mbind2				sys_mbind2
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 0af813b9a118..be2e2aa17dd8 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -463,3 +463,4 @@
 456	i386	futex_requeue		sys_futex_requeue
 457	i386	set_mempolicy2		sys_set_mempolicy2
 458	i386	get_mempolicy2		sys_get_mempolicy2
+459	i386	mbind2			sys_mbind2
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 0b777876fc15..6e2347eb8773 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -380,6 +380,7 @@
 456	common	futex_requeue		sys_futex_requeue
 457	common	set_mempolicy2		sys_set_mempolicy2
 458	common	get_mempolicy2		sys_get_mempolicy2
+459	common	mbind2			sys_mbind2
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 4536c9a4227d..f00a21317dc0 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -429,3 +429,4 @@
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
 458	common	get_mempolicy2			sys_get_mempolicy2
+459	common	mbind2				sys_mbind2
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 774512b7934e..487dd9155b25 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -816,6 +816,9 @@ asmlinkage long sys_mbind(unsigned long start, unsigned long len,
 				const unsigned long __user *nmask,
 				unsigned long maxnode,
 				unsigned flags);
+asmlinkage long sys_mbind2(const struct iovec __user *vec, size_t vlen,
+			   const struct mpol_args __user *uargs, size_t usize,
+			   unsigned long flags);
 asmlinkage long sys_get_mempolicy(int __user *policy,
 				unsigned long __user *nmask,
 				unsigned long maxnode,
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 719accc731db..cd31599bb9cc 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -832,9 +832,11 @@ __SYSCALL(__NR_futex_requeue, sys_futex_requeue)
 __SYSCALL(__NR_set_mempolicy2, sys_set_mempolicy2)
 #define __NR_get_mempolicy2 458
 __SYSCALL(__NR_get_mempolicy2, sys_get_mempolicy2)
+#define __NR_mbind2 459
+__SYSCALL(__NR_mbind2, sys_mbind2)
 
 #undef __NR_syscalls
-#define __NR_syscalls 459
+#define __NR_syscalls 460
 
 /*
  * 32 bit systems traditionally used different
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 00a673e30047..506ea0f8f34e 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -56,13 +56,14 @@ struct mpol_args {
 #define MPOL_F_ADDR	(1<<1)	/* look up vma using address */
 #define MPOL_F_MEMS_ALLOWED (1<<2) /* return allowed memories */
 
-/* Flags for mbind */
+/* Flags for mbind/mbind2 */
 #define MPOL_MF_STRICT	(1<<0)	/* Verify existing pages in the mapping */
 #define MPOL_MF_MOVE	 (1<<1)	/* Move pages owned by this process to conform
 				   to policy */
 #define MPOL_MF_MOVE_ALL (1<<2)	/* Move every page to conform to policy */
 #define MPOL_MF_LAZY	 (1<<3)	/* UNSUPPORTED FLAG: Lazy migrate on fault */
-#define MPOL_MF_INTERNAL (1<<4)	/* Internal flags start here */
+#define MPOL_MF_HOME_NODE (1<<4)	/* mbind2: set home node */
+#define MPOL_MF_INTERNAL (1<<5)	/* Internal flags start here */
 
 #define MPOL_MF_VALID	(MPOL_MF_STRICT   | 	\
 			 MPOL_MF_MOVE     | 	\
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index cfe22156ef13..8f609204fbe7 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1600,6 +1600,74 @@ SYSCALL_DEFINE6(mbind, unsigned long, start, unsigned long, len,
 	return kernel_mbind(start, len, mode, nmask, maxnode, flags);
 }
 
+SYSCALL_DEFINE5(mbind2, const struct iovec __user *, vec, size_t, vlen,
+		const struct mpol_args __user *, uargs, size_t, usize,
+		unsigned long, flags)
+{
+	struct mpol_args kargs;
+	struct mempolicy_args margs;
+	nodemask_t policy_nodes;
+	unsigned long __user *nodes_ptr;
+	struct iovec iovstack[UIO_FASTIOV];
+	struct iovec *iov = iovstack;
+	struct iov_iter iter;
+	int err;
+
+	if (!vec || !vlen)
+		return -EINVAL;
+
+	err = copy_struct_from_user(&kargs, sizeof(kargs), uargs, usize);
+	if (err)
+		return -EINVAL;
+
+	err = validate_mpol_flags(kargs.mode, &kargs.mode_flags);
+	if (err)
+		return err;
+
+	margs.mode = kargs.mode;
+	margs.mode_flags = kargs.mode_flags;
+	margs.addr = kargs.addr;
+
+	/* if home node given, validate it is online */
+	if (flags & MPOL_MF_HOME_NODE) {
+		if ((kargs.home_node >= MAX_NUMNODES) ||
+			!node_online(kargs.home_node))
+			return -EINVAL;
+		margs.home_node = kargs.home_node;
+	} else
+		margs.home_node = NUMA_NO_NODE;
+	flags &= ~MPOL_MF_HOME_NODE;
+
+	if (kargs.pol_nodes) {
+		nodes_ptr = u64_to_user_ptr(kargs.pol_nodes);
+		err = get_nodes(&policy_nodes, nodes_ptr,
+				kargs.pol_maxnodes);
+		if (err)
+			return err;
+		margs.policy_nodes = &policy_nodes;
+	} else
+		margs.policy_nodes = NULL;
+
+	/* For each address range in vector, do_mbind */
+	err = import_iovec(ITER_DEST, vec, vlen, ARRAY_SIZE(iovstack), &iov,
+			   &iter);
+	if (err)
+		return err;
+	while (iov_iter_count(&iter)) {
+		unsigned long start, len;
+
+		start = untagged_addr((unsigned long)iter_iov_addr(&iter));
+		len = iter_iov_len(&iter);
+		err = do_mbind(start, len, &margs, flags);
+		if (err)
+			break;
+		iov_iter_advance(&iter, iter_iov_len(&iter));
+	}
+
+	kfree(iov);
+	return err;
+}
+
 /* Set the process memory policy */
 static long kernel_set_mempolicy(int mode, const unsigned long __user *nmask,
 				 unsigned long maxnode)
-- 
2.39.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v2 11/11] mm/mempolicy: extend set_mempolicy2 and mbind2 to support weighted interleave
  2023-12-09  6:59 [PATCH v2 00/11] mempolicy2, mbind2, and weighted interleave Gregory Price
                   ` (9 preceding siblings ...)
  2023-12-09  6:59 ` [PATCH v2 10/11] mm/mempolicy: add the mbind2 syscall Gregory Price
@ 2023-12-09  6:59 ` Gregory Price
  2023-12-09 22:28   ` kernel test robot
  2023-12-11  5:53 ` [PATCH v2 00/11] mempolicy2, mbind2, and " Huang, Ying
  11 siblings, 1 reply; 21+ messages in thread
From: Gregory Price @ 2023-12-09  6:59 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-doc, linux-fsdevel, linux-api, linux-arch, linux-kernel,
	akpm, arnd, tglx, luto, mingo, bp, dave.hansen, x86, hpa, mhocko,
	tj, ying.huang, gregory.price, corbet, rakie.kim, hyeongtak.ji,
	honggyu.kim, vtavarespetr, peterz, jgroves, ravis.opensrc,
	sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha, Gregory Price

From: Gregory Price <gregory@gregoryprice.net>

Extend set_mempolicy2 and mbind2 to support weighted interleave, and
demonstrate the extensibility of the mpol_args structure.

To support weighted interleave we add interleave weight fields to the
following structures:

Kernel Internal:  (include/linux/mempolicy.h)
struct mempolicy {
	/* task-local weights to apply to weighted interleave */
	unsigned char weights[MAX_NUMNODES];
}
struct mempolicy_args {
	/* Optional: interleave weights for MPOL_WEIGHTED_INTERLEAVE */
	unsigned char *il_weights;	/* of size MAX_NUMNODES */
}

UAPI: (/include/uapi/linux/mempolicy.h)
struct mpol_args {
	/* Optional: interleave weights for MPOL_WEIGHTED_INTERLEAVE */
	unsigned char *il_weights;	/* of size pol_max_nodes */
}

The task-local weights are a single, one-dimensional array of weights
that apply to all possible nodes on the system.  If a node is set in
the mempolicy nodemask, the weight in `il_weights` must be >= 1,
otherwise set_mempolicy2() will return -EINVAL.  If a node is not
set in pol_nodemask, the weight will default to `1` in the task policy.

The default value of `1` is required to handle the situation where a
task migrates to a set of nodes for which weights were not set (up to
and including the local numa node).  For example, a migrated task whose
nodemask changes entirely will have all its weights defaulted back
to `1`, or if the nodemask changes to include a mix of nodes that
were not previously accounted for - the weighted interleave may be
suboptimal.

If migrations are expected, a task should prefer not to use task-local
interleave weights, and instead utilize the global settings for natural
re-weighting on migration.

To support global vs local weighting,  we add the kernel-internal flag:
MPOL_F_GWEIGHT (1 << 5) /* Utilize global weights */

This flag is set when il_weights is omitted by set_mempolicy2(), or
when MPOL_WEIGHTED_INTERLEAVE is set by set_mempolicy(). This internal
mode_flag dictates whether global weights or task-local weights are
utilized by the the various weighted interleave functions:

* weighted_interleave_nodes
* weighted_interleave_nid
* alloc_pages_bulk_array_weighted_interleave

if (pol->flags & MPOL_F_GWEIGHT)
	pol_weights = iw_table[numa_node_id()].weights;
else
	pol_weights = pol->wil.weights;

To simplify creations and duplication of mempolicies, the weights are
added as a structure directly within mempolicy. This allows the
existing logic in __mpol_dup to copy the weights without additional
allocations:

if (old == current->mempolicy) {
	task_lock(current);
	*new = *old;
	task_unlock(current);
} else
	*new = *old

Suggested-by: Rakie Kim <rakie.kim@sk.com>
Suggested-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
Suggested-by: Honggyu Kim <honggyu.kim@sk.com>
Suggested-by: Vinicius Tavares Petrucci <vtavarespetr@micron.com>
Signed-off-by: Gregory Price <gregory.price@memverge.com>
Co-developed-by: Rakie Kim <rakie.kim@sk.com>
Signed-off-by: Rakie Kim <rakie.kim@sk.com>
Co-developed-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
Co-developed-by: Honggyu Kim <honggyu.kim@sk.com>
Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
Co-developed-by: Vinicius Tavares Petrucci <vtavarespetr@micron.com>
Signed-off-by: Vinicius Tavares Petrucci <vtavarespetr@micron.com>
---
 .../admin-guide/mm/numa_memory_policy.rst     |  10 ++
 include/linux/mempolicy.h                     |   2 +
 include/uapi/linux/mempolicy.h                |   2 +
 mm/mempolicy.c                                | 105 +++++++++++++++++-
 4 files changed, 115 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index f1ba33de3a6e..84c076af74c3 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -254,6 +254,8 @@ MPOL_WEIGHTED_INTERLEAVE
 	This mode operates the same as MPOL_INTERLEAVE, except that
 	interleaving behavior is executed based on weights set in
 	/sys/kernel/mm/mempolicy/weighted_interleave/
+	when configured to utilize global weights, or based on task-local
+	weights configured with set_mempolicy2(2) or mbind2(2).
 
 	Weighted interleave allocations pages on nodes according to
 	their weight.  For example if nodes [0,1] are weighted [5,2]
@@ -261,6 +263,13 @@ MPOL_WEIGHTED_INTERLEAVE
 	2 pages allocated on node1.  This can better distribute data
 	according to bandwidth on heterogeneous memory systems.
 
+	When utilizing task-local weights, weights are not rebalanced
+	in the event of a task migration.  If a weight has not been
+	explicitly set for a node set in the new nodemask, the
+	value of that weight defaults to "1".  For this reason, if
+	migrations are expected or possible, users should consider
+	utilizing global interleave weights.
+
 NUMA memory policy supports the following optional mode flags:
 
 MPOL_F_STATIC_NODES
@@ -516,6 +525,7 @@ Extended Mempolicy Arguments::
 		__u64 addr; /* get_mempolicy2: policy address */
 		__s32 policy_node; /* get_mempolicy2: policy node information */
 		__s32 addr_node; /* get_mempolicy2: memory range policy */
+		__aligned_u64 il_weights;  /* u8 buf of size pol_maxnodes */
 	};
 
 The extended mempolicy argument structure is defined to allow the mempolicy
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 117c5395c6eb..c78874bd84dd 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -58,6 +58,7 @@ struct mempolicy {
 	/* Weighted interleave settings */
 	struct {
 		unsigned char cur_weight;
+		unsigned char weights[MAX_NUMNODES];
 	} wil;
 };
 
@@ -73,6 +74,7 @@ struct mempolicy_args {
 	unsigned long addr;		/* get: vma address */
 	int addr_node;			/* get: node the address belongs to */
 	int home_node;			/* mbind: use MPOL_MF_HOME_NODE */
+	unsigned char *il_weights;	/* for mode MPOL_WEIGHTED_INTERLEAVE */
 };
 
 /*
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 506ea0f8f34e..687c72fbe6a1 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -37,6 +37,7 @@ struct mpol_args {
 	__u64 addr;
 	__s32 policy_node;	/* get_mempolicy: policy node info */
 	__s32 addr_node;	/* get_mempolicy: memory range policy */
+	__aligned_u64 il_weights; /* size: pol_maxnodes * sizeof(char) */
 };
 
 /* Flags for set_mempolicy */
@@ -77,6 +78,7 @@ struct mpol_args {
 #define MPOL_F_SHARED  (1 << 0)	/* identify shared policies */
 #define MPOL_F_MOF	(1 << 3) /* this policy wants migrate on fault */
 #define MPOL_F_MORON	(1 << 4) /* Migrate On protnone Reference On Node */
+#define MPOL_F_GWEIGHT	(1 << 5) /* Utilize global weights */
 
 /*
  * These bit locations are exposed in the vm.zone_reclaim_mode sysctl
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 8f609204fbe7..e5f86e430207 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -271,6 +271,7 @@ static struct mempolicy *mpol_new(struct mempolicy_args *args)
 	unsigned short mode = args->mode;
 	unsigned short flags = args->mode_flags;
 	nodemask_t *nodes = args->policy_nodes;
+	int node;
 
 	if (mode == MPOL_DEFAULT) {
 		if (nodes && !nodes_empty(*nodes))
@@ -297,6 +298,19 @@ static struct mempolicy *mpol_new(struct mempolicy_args *args)
 		    (flags & MPOL_F_STATIC_NODES) ||
 		    (flags & MPOL_F_RELATIVE_NODES))
 			return ERR_PTR(-EINVAL);
+	} else if (mode == MPOL_WEIGHTED_INTERLEAVE) {
+		/* weighted interleave requires a nodemask and weights > 0 */
+		if (nodes_empty(*nodes))
+			return ERR_PTR(-EINVAL);
+		if (args->il_weights) {
+			node = first_node(*nodes);
+			while (node != MAX_NUMNODES) {
+				if (!args->il_weights[node])
+					return ERR_PTR(-EINVAL);
+				node = next_node(node, *nodes);
+			}
+		} else if (!(args->mode_flags & MPOL_F_GWEIGHT))
+			return ERR_PTR(-EINVAL);
 	} else if (nodes_empty(*nodes))
 		return ERR_PTR(-EINVAL);
 
@@ -309,6 +323,16 @@ static struct mempolicy *mpol_new(struct mempolicy_args *args)
 	policy->home_node = NUMA_NO_NODE;
 	policy->wil.cur_weight = 0;
 	policy->home_node = args->home_node;
+	if (policy->mode == MPOL_WEIGHTED_INTERLEAVE && args->il_weights) {
+		policy->wil.cur_weight = 0;
+		/* Minimum weight value is always 1 */
+		memset(policy->wil.weights, 1, MAX_NUMNODES);
+		node = first_node(*nodes);
+		while (node != MAX_NUMNODES) {
+			policy->wil.weights[node] = args->il_weights[node];
+			node = next_node(node, *nodes);
+		}
+	}
 
 	return policy;
 }
@@ -1518,6 +1542,9 @@ static long kernel_mbind(unsigned long start, unsigned long len,
 	if (err)
 		return err;
 
+	if (mode & MPOL_WEIGHTED_INTERLEAVE)
+		mode_flags |= MPOL_F_GWEIGHT;
+
 	memset(&margs, 0, sizeof(margs));
 	margs.mode = lmode;
 	margs.mode_flags = mode_flags;
@@ -1611,6 +1638,8 @@ SYSCALL_DEFINE5(mbind2, const struct iovec __user *, vec, size_t, vlen,
 	struct iovec iovstack[UIO_FASTIOV];
 	struct iovec *iov = iovstack;
 	struct iov_iter iter;
+	unsigned char weights[MAX_NUMNODES];
+	unsigned char *weights_ptr;
 	int err;
 
 	if (!vec || !vlen)
@@ -1648,6 +1677,20 @@ SYSCALL_DEFINE5(mbind2, const struct iovec __user *, vec, size_t, vlen,
 	} else
 		margs.policy_nodes = NULL;
 
+	if (kargs.mode == MPOL_WEIGHTED_INTERLEAVE) {
+		weights_ptr = u64_to_user_ptr(kargs.il_weights);
+		err = copy_struct_from_user(&weights,
+					    sizeof(weights),
+					    weights_ptr,
+					    kargs.pol_maxnodes);
+		if (err)
+			return err;
+		margs.il_weights = weights;
+	} else {
+		margs.il_weights = NULL;
+		flags |= MPOL_F_GWEIGHT;
+	}
+
 	/* For each address range in vector, do_mbind */
 	err = import_iovec(ITER_DEST, vec, vlen, ARRAY_SIZE(iovstack), &iov,
 			   &iter);
@@ -1686,6 +1729,9 @@ static long kernel_set_mempolicy(int mode, const unsigned long __user *nmask,
 	if (err)
 		return err;
 
+	if (mode & MPOL_WEIGHTED_INTERLEAVE)
+		mode_flags |= MPOL_F_GWEIGHT;
+
 	memset(&args, 0, sizeof(args));
 	args.mode = lmode;
 	args.mode_flags = mode_flags;
@@ -1709,6 +1755,8 @@ SYSCALL_DEFINE3(set_mempolicy2, struct mpol_args __user *, uargs, size_t, usize,
 	int err;
 	nodemask_t policy_nodemask;
 	unsigned long __user *nodes_ptr;
+	unsigned char weights[MAX_NUMNODES];
+	unsigned char __user *weights_ptr;
 
 	if (flags)
 		return -EINVAL;
@@ -1734,6 +1782,20 @@ SYSCALL_DEFINE3(set_mempolicy2, struct mpol_args __user *, uargs, size_t, usize,
 	} else
 		margs.policy_nodes = NULL;
 
+	if (kargs.mode == MPOL_WEIGHTED_INTERLEAVE && kargs.il_weights) {
+		weights_ptr = u64_to_user_ptr(kargs.il_weights);
+		err = copy_struct_from_user(weights,
+					    sizeof(weights),
+					    weights_ptr,
+					    kargs.pol_maxnodes);
+		if (err)
+			return err;
+		margs.il_weights = weights;
+	} else {
+		margs.il_weights = NULL;
+		flags |= MPOL_F_GWEIGHT;
+	}
+
 	return do_set_mempolicy(&margs);
 }
 
@@ -1935,6 +1997,8 @@ SYSCALL_DEFINE3(get_mempolicy2, struct mpol_args __user *, uargs, size_t, usize,
 	int err;
 	nodemask_t policy_nodemask;
 	unsigned long __user *nodes_ptr;
+	unsigned char *weights_ptr;
+	unsigned char weights[MAX_NUMNODES];
 
 	err = copy_struct_from_user(&kargs, sizeof(kargs), uargs, usize);
 	if (err)
@@ -1951,6 +2015,9 @@ SYSCALL_DEFINE3(get_mempolicy2, struct mpol_args __user *, uargs, size_t, usize,
 					  &policy_nodemask);
 	}
 
+	if (kargs.il_weights)
+		margs.il_weights = weights;
+
 	margs.policy_nodes = kargs.pol_nodes ? &policy_nodemask : NULL;
 	if (flags & MPOL_F_ADDR) {
 		margs.addr = kargs.addr;
@@ -1971,6 +2038,13 @@ SYSCALL_DEFINE3(get_mempolicy2, struct mpol_args __user *, uargs, size_t, usize,
 					 margs.policy_nodes);
 	}
 
+	if (kargs.il_weights) {
+		weights_ptr = u64_to_user_ptr(kargs.il_weights);
+		err = copy_to_user(weights_ptr, weights, kargs.pol_maxnodes);
+		if (err)
+			return err;
+	}
+
 	return copy_to_user(uargs, &kargs, usize) ? -EFAULT : 0;
 }
 
@@ -2087,13 +2161,18 @@ static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
 {
 	unsigned int next;
 	struct task_struct *me = current;
+	unsigned char next_weight;
 
 	next = next_node_in(me->il_prev, policy->nodes);
 	if (next == MAX_NUMNODES)
 		return next;
 
-	if (!policy->wil.cur_weight)
-		policy->wil.cur_weight = iw_table[next];
+	if (!policy->wil.cur_weight) {
+		next_weight = (policy->flags & MPOL_F_GWEIGHT) ?
+				iw_table[next] :
+				policy->wil.weights[next];
+		policy->wil.cur_weight = next_weight ? next_weight : 1;
+	}
 
 	policy->wil.cur_weight--;
 	if (!policy->wil.cur_weight)
@@ -2167,6 +2246,7 @@ static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
 	nodemask_t nodemask = pol->nodes;
 	unsigned int target, weight_total = 0;
 	int nid;
+	unsigned char *pol_weights;
 	unsigned char weights[MAX_NUMNODES];
 	unsigned char weight;
 
@@ -2178,8 +2258,13 @@ static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
 		return nid;
 
 	/* Then collect weights on stack and calculate totals */
+	if (pol->flags & MPOL_F_GWEIGHT)
+		pol_weights = iw_table;
+	else
+		pol_weights = pol->wil.weights;
+
 	for_each_node_mask(nid, nodemask) {
-		weight = iw_table[nid];
+		weight = pol_weights[nid];
 		weight_total += weight;
 		weights[nid] = weight;
 	}
@@ -2577,6 +2662,7 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
 	unsigned long nr_allocated;
 	unsigned long rounds;
 	unsigned long node_pages, delta;
+	unsigned char *pol_weights;
 	unsigned char weight;
 	unsigned char weights[MAX_NUMNODES];
 	unsigned int weight_total;
@@ -2590,9 +2676,14 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
 
 	nnodes = nodes_weight(nodes);
 
+	if (pol->flags & MPOL_F_GWEIGHT)
+		pol_weights = iw_table;
+	else
+		pol_weights = pol->wil.weights;
+
 	/* Collect weights and save them on stack so they don't change */
 	for_each_node_mask(node, nodes) {
-		weight = iw_table[node];
+		weight = pol_weights[node];
 		weight_total += weight;
 		weights[node] = weight;
 	}
@@ -3117,6 +3208,7 @@ void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol)
 {
 	int ret;
 	struct mempolicy_args margs;
+	unsigned char weights[MAX_NUMNODES];
 
 	sp->root = RB_ROOT;		/* empty tree == default mempolicy */
 	rwlock_init(&sp->lock);
@@ -3134,6 +3226,11 @@ void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol)
 		margs.mode_flags = mpol->flags;
 		margs.policy_nodes = &mpol->w.user_nodemask;
 		margs.home_node = NUMA_NO_NODE;
+		if (margs.mode == MPOL_WEIGHTED_INTERLEAVE &&
+		    !(margs.mode_flags & MPOL_F_GWEIGHT)) {
+			memcpy(weights, mpol->wil.weights, sizeof(weights));
+			margs.il_weights = weights;
+		}
 
 		/* contextualize the tmpfs mount point mempolicy to this file */
 		npol = mpol_new(&margs);
-- 
2.39.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 08/11] mm/mempolicy: add set_mempolicy2 syscall
  2023-12-09  6:59 ` [PATCH v2 08/11] mm/mempolicy: add set_mempolicy2 syscall Gregory Price
@ 2023-12-09 16:46   ` kernel test robot
  2023-12-09 18:24   ` kernel test robot
  1 sibling, 0 replies; 21+ messages in thread
From: kernel test robot @ 2023-12-09 16:46 UTC (permalink / raw)
  To: Gregory Price, linux-mm
  Cc: oe-kbuild-all, linux-doc, linux-fsdevel, linux-api, linux-arch,
	linux-kernel, akpm, arnd, tglx, luto, mingo, bp, dave.hansen, x86,
	hpa, mhocko, tj, ying.huang, gregory.price, corbet, rakie.kim,
	hyeongtak.ji, honggyu.kim, vtavarespetr, peterz, jgroves,
	ravis.opensrc, sthanneeru, emirakhur, Hasan.Maruf

Hi Gregory,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on deller-parisc/for-next powerpc/next powerpc/fixes s390/features jcmvbkbc-xtensa/xtensa-for-next arnd-asm-generic/master linus/master v6.7-rc4]
[cannot apply to tip/x86/asm geert-m68k/for-next geert-m68k/for-linus next-20231208]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Gregory-Price/mm-mempolicy-implement-the-sysfs-based-weighted_interleave-interface/20231209-150314
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20231209065931.3458-9-gregory.price%40memverge.com
patch subject: [PATCH v2 08/11] mm/mempolicy: add set_mempolicy2 syscall
config: arc-alldefconfig (https://download.01.org/0day-ci/archive/20231210/202312100003.aUR6uBvr-lkp@intel.com/config)
compiler: arc-elf-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231210/202312100003.aUR6uBvr-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202312100003.aUR6uBvr-lkp@intel.com/

All error/warnings (new ones prefixed by >>):

   In file included from init/main.c:21:
>> include/linux/syscalls.h:825:43: warning: 'struct mpol_args' declared inside parameter list will not be visible outside of this definition or declaration
     825 | asmlinkage long sys_set_mempolicy2(struct mpol_args *args, size_t size,
         |                                           ^~~~~~~~~
--
   In file included from arch/arc/kernel/sys.c:3:
>> include/linux/syscalls.h:825:43: warning: 'struct mpol_args' declared inside parameter list will not be visible outside of this definition or declaration
     825 | asmlinkage long sys_set_mempolicy2(struct mpol_args *args, size_t size,
         |                                           ^~~~~~~~~
   arch/arc/kernel/sys.c:13:36: warning: initialized field overwritten [-Woverride-init]
      13 | #define __SYSCALL(nr, call) [nr] = (call),
         |                                    ^
   include/uapi/asm-generic/unistd.h:29:37: note: in expansion of macro '__SYSCALL'
      29 | #define __SC_COMP(_nr, _sys, _comp) __SYSCALL(_nr, _sys)
         |                                     ^~~~~~~~~
   include/uapi/asm-generic/unistd.h:34:1: note: in expansion of macro '__SC_COMP'
      34 | __SC_COMP(__NR_io_setup, sys_io_setup, compat_sys_io_setup)
         | ^~~~~~~~~
   arch/arc/kernel/sys.c:13:36: note: (near initialization for 'sys_call_table[0]')
      13 | #define __SYSCALL(nr, call) [nr] = (call),
         |                                    ^
   include/uapi/asm-generic/unistd.h:29:37: note: in expansion of macro '__SYSCALL'
      29 | #define __SC_COMP(_nr, _sys, _comp) __SYSCALL(_nr, _sys)
         |                                     ^~~~~~~~~
   include/uapi/asm-generic/unistd.h:34:1: note: in expansion of macro '__SC_COMP'
      34 | __SC_COMP(__NR_io_setup, sys_io_setup, compat_sys_io_setup)
         | ^~~~~~~~~
   arch/arc/kernel/sys.c:13:36: warning: initialized field overwritten [-Woverride-init]
      13 | #define __SYSCALL(nr, call) [nr] = (call),
         |                                    ^
   include/uapi/asm-generic/unistd.h:36:1: note: in expansion of macro '__SYSCALL'
      36 | __SYSCALL(__NR_io_destroy, sys_io_destroy)
         | ^~~~~~~~~
   arch/arc/kernel/sys.c:13:36: note: (near initialization for 'sys_call_table[1]')
      13 | #define __SYSCALL(nr, call) [nr] = (call),
         |                                    ^
   include/uapi/asm-generic/unistd.h:36:1: note: in expansion of macro '__SYSCALL'
      36 | __SYSCALL(__NR_io_destroy, sys_io_destroy)
         | ^~~~~~~~~
   arch/arc/kernel/sys.c:13:36: warning: initialized field overwritten [-Woverride-init]
      13 | #define __SYSCALL(nr, call) [nr] = (call),
         |                                    ^
   include/uapi/asm-generic/unistd.h:29:37: note: in expansion of macro '__SYSCALL'
      29 | #define __SC_COMP(_nr, _sys, _comp) __SYSCALL(_nr, _sys)
         |                                     ^~~~~~~~~
   include/uapi/asm-generic/unistd.h:38:1: note: in expansion of macro '__SC_COMP'
      38 | __SC_COMP(__NR_io_submit, sys_io_submit, compat_sys_io_submit)
         | ^~~~~~~~~
   arch/arc/kernel/sys.c:13:36: note: (near initialization for 'sys_call_table[2]')
      13 | #define __SYSCALL(nr, call) [nr] = (call),
         |                                    ^
   include/uapi/asm-generic/unistd.h:29:37: note: in expansion of macro '__SYSCALL'
      29 | #define __SC_COMP(_nr, _sys, _comp) __SYSCALL(_nr, _sys)
         |                                     ^~~~~~~~~
   include/uapi/asm-generic/unistd.h:38:1: note: in expansion of macro '__SC_COMP'
      38 | __SC_COMP(__NR_io_submit, sys_io_submit, compat_sys_io_submit)
         | ^~~~~~~~~
   arch/arc/kernel/sys.c:13:36: warning: initialized field overwritten [-Woverride-init]
      13 | #define __SYSCALL(nr, call) [nr] = (call),
         |                                    ^
   include/uapi/asm-generic/unistd.h:40:1: note: in expansion of macro '__SYSCALL'
      40 | __SYSCALL(__NR_io_cancel, sys_io_cancel)
         | ^~~~~~~~~
   arch/arc/kernel/sys.c:13:36: note: (near initialization for 'sys_call_table[3]')
      13 | #define __SYSCALL(nr, call) [nr] = (call),
         |                                    ^
   include/uapi/asm-generic/unistd.h:40:1: note: in expansion of macro '__SYSCALL'
      40 | __SYSCALL(__NR_io_cancel, sys_io_cancel)
         | ^~~~~~~~~
   arch/arc/kernel/sys.c:13:36: warning: initialized field overwritten [-Woverride-init]
      13 | #define __SYSCALL(nr, call) [nr] = (call),
         |                                    ^
   include/uapi/asm-generic/unistd.h:20:34: note: in expansion of macro '__SYSCALL'
      20 | #define __SC_3264(_nr, _32, _64) __SYSCALL(_nr, _32)
         |                                  ^~~~~~~~~
   include/uapi/asm-generic/unistd.h:44:1: note: in expansion of macro '__SC_3264'
      44 | __SC_3264(__NR_io_getevents, sys_io_getevents_time32, sys_io_getevents)
         | ^~~~~~~~~
   arch/arc/kernel/sys.c:13:36: note: (near initialization for 'sys_call_table[4]')
      13 | #define __SYSCALL(nr, call) [nr] = (call),
         |                                    ^
   include/uapi/asm-generic/unistd.h:20:34: note: in expansion of macro '__SYSCALL'
      20 | #define __SC_3264(_nr, _32, _64) __SYSCALL(_nr, _32)
         |                                  ^~~~~~~~~
   include/uapi/asm-generic/unistd.h:44:1: note: in expansion of macro '__SC_3264'
      44 | __SC_3264(__NR_io_getevents, sys_io_getevents_time32, sys_io_getevents)
         | ^~~~~~~~~
   arch/arc/kernel/sys.c:13:36: warning: initialized field overwritten [-Woverride-init]
      13 | #define __SYSCALL(nr, call) [nr] = (call),
         |                                    ^
   include/uapi/asm-generic/unistd.h:48:1: note: in expansion of macro '__SYSCALL'
      48 | __SYSCALL(__NR_setxattr, sys_setxattr)
         | ^~~~~~~~~
   arch/arc/kernel/sys.c:13:36: note: (near initialization for 'sys_call_table[5]')
      13 | #define __SYSCALL(nr, call) [nr] = (call),
         |                                    ^
   include/uapi/asm-generic/unistd.h:48:1: note: in expansion of macro '__SYSCALL'
      48 | __SYSCALL(__NR_setxattr, sys_setxattr)
         | ^~~~~~~~~
   arch/arc/kernel/sys.c:13:36: warning: initialized field overwritten [-Woverride-init]
      13 | #define __SYSCALL(nr, call) [nr] = (call),
         |                                    ^
   include/uapi/asm-generic/unistd.h:50:1: note: in expansion of macro '__SYSCALL'
      50 | __SYSCALL(__NR_lsetxattr, sys_lsetxattr)
         | ^~~~~~~~~
   arch/arc/kernel/sys.c:13:36: note: (near initialization for 'sys_call_table[6]')
      13 | #define __SYSCALL(nr, call) [nr] = (call),
--
   In file included from kernel/time/hrtimer.c:30:
>> include/linux/syscalls.h:825:43: warning: 'struct mpol_args' declared inside parameter list will not be visible outside of this definition or declaration
     825 | asmlinkage long sys_set_mempolicy2(struct mpol_args *args, size_t size,
         |                                           ^~~~~~~~~
   kernel/time/hrtimer.c:120:35: warning: initialized field overwritten [-Woverride-init]
     120 |         [CLOCK_REALTIME]        = HRTIMER_BASE_REALTIME,
         |                                   ^~~~~~~~~~~~~~~~~~~~~
   kernel/time/hrtimer.c:120:35: note: (near initialization for 'hrtimer_clock_to_base_table[0]')
   kernel/time/hrtimer.c:121:35: warning: initialized field overwritten [-Woverride-init]
     121 |         [CLOCK_MONOTONIC]       = HRTIMER_BASE_MONOTONIC,
         |                                   ^~~~~~~~~~~~~~~~~~~~~~
   kernel/time/hrtimer.c:121:35: note: (near initialization for 'hrtimer_clock_to_base_table[1]')
   kernel/time/hrtimer.c:122:35: warning: initialized field overwritten [-Woverride-init]
     122 |         [CLOCK_BOOTTIME]        = HRTIMER_BASE_BOOTTIME,
         |                                   ^~~~~~~~~~~~~~~~~~~~~
   kernel/time/hrtimer.c:122:35: note: (near initialization for 'hrtimer_clock_to_base_table[7]')
   kernel/time/hrtimer.c:123:35: warning: initialized field overwritten [-Woverride-init]
     123 |         [CLOCK_TAI]             = HRTIMER_BASE_TAI,
         |                                   ^~~~~~~~~~~~~~~~
   kernel/time/hrtimer.c:123:35: note: (near initialization for 'hrtimer_clock_to_base_table[11]')
   kernel/time/hrtimer.c: In function '__run_hrtimer':
   kernel/time/hrtimer.c:1651:14: warning: variable 'expires_in_hardirq' set but not used [-Wunused-but-set-variable]
    1651 |         bool expires_in_hardirq;
         |              ^~~~~~~~~~~~~~~~~~
--
>> arc-elf-ld: arch/arc/kernel/sys.o:(.data+0x724): undefined reference to `sys_set_mempolicy2'
>> arc-elf-ld: arch/arc/kernel/sys.o:(.data+0x724): undefined reference to `sys_set_mempolicy2'

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 08/11] mm/mempolicy: add set_mempolicy2 syscall
  2023-12-09  6:59 ` [PATCH v2 08/11] mm/mempolicy: add set_mempolicy2 syscall Gregory Price
  2023-12-09 16:46   ` kernel test robot
@ 2023-12-09 18:24   ` kernel test robot
  1 sibling, 0 replies; 21+ messages in thread
From: kernel test robot @ 2023-12-09 18:24 UTC (permalink / raw)
  To: Gregory Price, linux-mm
  Cc: llvm, oe-kbuild-all, linux-doc, linux-fsdevel, linux-api,
	linux-arch, linux-kernel, akpm, arnd, tglx, luto, mingo, bp,
	dave.hansen, x86, hpa, mhocko, tj, ying.huang, gregory.price,
	corbet, rakie.kim, hyeongtak.ji, honggyu.kim, vtavarespetr,
	peterz, jgroves, ravis.opensrc, sthanneeru, emirakhur,
	Hasan.Maruf

Hi Gregory,

kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on deller-parisc/for-next powerpc/next powerpc/fixes s390/features jcmvbkbc-xtensa/xtensa-for-next arnd-asm-generic/master linus/master v6.7-rc4]
[cannot apply to tip/x86/asm geert-m68k/for-next geert-m68k/for-linus next-20231208]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Gregory-Price/mm-mempolicy-implement-the-sysfs-based-weighted_interleave-interface/20231209-150314
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20231209065931.3458-9-gregory.price%40memverge.com
patch subject: [PATCH v2 08/11] mm/mempolicy: add set_mempolicy2 syscall
config: arm-lpc32xx_defconfig (https://download.01.org/0day-ci/archive/20231210/202312100245.Jgz5mPhJ-lkp@intel.com/config)
compiler: clang version 17.0.0 (https://github.com/llvm/llvm-project.git 4a5ac14ee968ff0ad5d2cc1ffa0299048db4c88a)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231210/202312100245.Jgz5mPhJ-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202312100245.Jgz5mPhJ-lkp@intel.com/

All warnings (new ones prefixed by >>):

   In file included from kernel/time/time.c:33:
>> include/linux/syscalls.h:825:43: warning: declaration of 'struct mpol_args' will not be visible outside of this function [-Wvisibility]
     825 | asmlinkage long sys_set_mempolicy2(struct mpol_args *args, size_t size,
         |                                           ^
   1 warning generated.
--
   In file included from kernel/time/hrtimer.c:30:
>> include/linux/syscalls.h:825:43: warning: declaration of 'struct mpol_args' will not be visible outside of this function [-Wvisibility]
     825 | asmlinkage long sys_set_mempolicy2(struct mpol_args *args, size_t size,
         |                                           ^
   kernel/time/hrtimer.c:1651:7: warning: variable 'expires_in_hardirq' set but not used [-Wunused-but-set-variable]
    1651 |         bool expires_in_hardirq;
         |              ^
   kernel/time/hrtimer.c:277:20: warning: unused function 'is_migration_base' [-Wunused-function]
     277 | static inline bool is_migration_base(struct hrtimer_clock_base *base)
         |                    ^
   kernel/time/hrtimer.c:1876:20: warning: unused function '__hrtimer_peek_ahead_timers' [-Wunused-function]
    1876 | static inline void __hrtimer_peek_ahead_timers(void)
         |                    ^
   4 warnings generated.


vim +825 include/linux/syscalls.h

   794	
   795	/* CONFIG_MMU only */
   796	asmlinkage long sys_swapon(const char __user *specialfile, int swap_flags);
   797	asmlinkage long sys_swapoff(const char __user *specialfile);
   798	asmlinkage long sys_mprotect(unsigned long start, size_t len,
   799					unsigned long prot);
   800	asmlinkage long sys_msync(unsigned long start, size_t len, int flags);
   801	asmlinkage long sys_mlock(unsigned long start, size_t len);
   802	asmlinkage long sys_munlock(unsigned long start, size_t len);
   803	asmlinkage long sys_mlockall(int flags);
   804	asmlinkage long sys_munlockall(void);
   805	asmlinkage long sys_mincore(unsigned long start, size_t len,
   806					unsigned char __user * vec);
   807	asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
   808	asmlinkage long sys_process_madvise(int pidfd, const struct iovec __user *vec,
   809				size_t vlen, int behavior, unsigned int flags);
   810	asmlinkage long sys_process_mrelease(int pidfd, unsigned int flags);
   811	asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
   812				unsigned long prot, unsigned long pgoff,
   813				unsigned long flags);
   814	asmlinkage long sys_mbind(unsigned long start, unsigned long len,
   815					unsigned long mode,
   816					const unsigned long __user *nmask,
   817					unsigned long maxnode,
   818					unsigned flags);
   819	asmlinkage long sys_get_mempolicy(int __user *policy,
   820					unsigned long __user *nmask,
   821					unsigned long maxnode,
   822					unsigned long addr, unsigned long flags);
   823	asmlinkage long sys_set_mempolicy(int mode, const unsigned long __user *nmask,
   824					unsigned long maxnode);
 > 825	asmlinkage long sys_set_mempolicy2(struct mpol_args *args, size_t size,
   826					   unsigned long flags);
   827	asmlinkage long sys_migrate_pages(pid_t pid, unsigned long maxnode,
   828					const unsigned long __user *from,
   829					const unsigned long __user *to);
   830	asmlinkage long sys_move_pages(pid_t pid, unsigned long nr_pages,
   831					const void __user * __user *pages,
   832					const int __user *nodes,
   833					int __user *status,
   834					int flags);
   835	asmlinkage long sys_rt_tgsigqueueinfo(pid_t tgid, pid_t  pid, int sig,
   836			siginfo_t __user *uinfo);
   837	asmlinkage long sys_perf_event_open(
   838			struct perf_event_attr __user *attr_uptr,
   839			pid_t pid, int cpu, int group_fd, unsigned long flags);
   840	asmlinkage long sys_accept4(int, struct sockaddr __user *, int __user *, int);
   841	asmlinkage long sys_recvmmsg(int fd, struct mmsghdr __user *msg,
   842				     unsigned int vlen, unsigned flags,
   843				     struct __kernel_timespec __user *timeout);
   844	asmlinkage long sys_recvmmsg_time32(int fd, struct mmsghdr __user *msg,
   845				     unsigned int vlen, unsigned flags,
   846				     struct old_timespec32 __user *timeout);
   847	asmlinkage long sys_wait4(pid_t pid, int __user *stat_addr,
   848					int options, struct rusage __user *ru);
   849	asmlinkage long sys_prlimit64(pid_t pid, unsigned int resource,
   850					const struct rlimit64 __user *new_rlim,
   851					struct rlimit64 __user *old_rlim);
   852	asmlinkage long sys_fanotify_init(unsigned int flags, unsigned int event_f_flags);
   853	asmlinkage long sys_fanotify_mark(int fanotify_fd, unsigned int flags,
   854					  u64 mask, int fd,
   855					  const char  __user *pathname);
   856	asmlinkage long sys_name_to_handle_at(int dfd, const char __user *name,
   857					      struct file_handle __user *handle,
   858					      int __user *mnt_id, int flag);
   859	asmlinkage long sys_open_by_handle_at(int mountdirfd,
   860					      struct file_handle __user *handle,
   861					      int flags);
   862	asmlinkage long sys_clock_adjtime(clockid_t which_clock,
   863					struct __kernel_timex __user *tx);
   864	asmlinkage long sys_clock_adjtime32(clockid_t which_clock,
   865					struct old_timex32 __user *tx);
   866	asmlinkage long sys_syncfs(int fd);
   867	asmlinkage long sys_setns(int fd, int nstype);
   868	asmlinkage long sys_pidfd_open(pid_t pid, unsigned int flags);
   869	asmlinkage long sys_sendmmsg(int fd, struct mmsghdr __user *msg,
   870				     unsigned int vlen, unsigned flags);
   871	asmlinkage long sys_process_vm_readv(pid_t pid,
   872					     const struct iovec __user *lvec,
   873					     unsigned long liovcnt,
   874					     const struct iovec __user *rvec,
   875					     unsigned long riovcnt,
   876					     unsigned long flags);
   877	asmlinkage long sys_process_vm_writev(pid_t pid,
   878					      const struct iovec __user *lvec,
   879					      unsigned long liovcnt,
   880					      const struct iovec __user *rvec,
   881					      unsigned long riovcnt,
   882					      unsigned long flags);
   883	asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
   884				 unsigned long idx1, unsigned long idx2);
   885	asmlinkage long sys_finit_module(int fd, const char __user *uargs, int flags);
   886	asmlinkage long sys_sched_setattr(pid_t pid,
   887						struct sched_attr __user *attr,
   888						unsigned int flags);
   889	asmlinkage long sys_sched_getattr(pid_t pid,
   890						struct sched_attr __user *attr,
   891						unsigned int size,
   892						unsigned int flags);
   893	asmlinkage long sys_renameat2(int olddfd, const char __user *oldname,
   894				      int newdfd, const char __user *newname,
   895				      unsigned int flags);
   896	asmlinkage long sys_seccomp(unsigned int op, unsigned int flags,
   897				    void __user *uargs);
   898	asmlinkage long sys_getrandom(char __user *buf, size_t count,
   899				      unsigned int flags);
   900	asmlinkage long sys_memfd_create(const char __user *uname_ptr, unsigned int flags);
   901	asmlinkage long sys_bpf(int cmd, union bpf_attr *attr, unsigned int size);
   902	asmlinkage long sys_execveat(int dfd, const char __user *filename,
   903				const char __user *const __user *argv,
   904				const char __user *const __user *envp, int flags);
   905	asmlinkage long sys_userfaultfd(int flags);
   906	asmlinkage long sys_membarrier(int cmd, unsigned int flags, int cpu_id);
   907	asmlinkage long sys_mlock2(unsigned long start, size_t len, int flags);
   908	asmlinkage long sys_copy_file_range(int fd_in, loff_t __user *off_in,
   909					    int fd_out, loff_t __user *off_out,
   910					    size_t len, unsigned int flags);
   911	asmlinkage long sys_preadv2(unsigned long fd, const struct iovec __user *vec,
   912				    unsigned long vlen, unsigned long pos_l, unsigned long pos_h,
   913				    rwf_t flags);
   914	asmlinkage long sys_pwritev2(unsigned long fd, const struct iovec __user *vec,
   915				    unsigned long vlen, unsigned long pos_l, unsigned long pos_h,
   916				    rwf_t flags);
   917	asmlinkage long sys_pkey_mprotect(unsigned long start, size_t len,
   918					  unsigned long prot, int pkey);
   919	asmlinkage long sys_pkey_alloc(unsigned long flags, unsigned long init_val);
   920	asmlinkage long sys_pkey_free(int pkey);
   921	asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags,
   922				  unsigned mask, struct statx __user *buffer);
   923	asmlinkage long sys_rseq(struct rseq __user *rseq, uint32_t rseq_len,
   924				 int flags, uint32_t sig);
   925	asmlinkage long sys_open_tree(int dfd, const char __user *path, unsigned flags);
   926	asmlinkage long sys_move_mount(int from_dfd, const char __user *from_path,
   927				       int to_dfd, const char __user *to_path,
   928				       unsigned int ms_flags);
   929	asmlinkage long sys_mount_setattr(int dfd, const char __user *path,
   930					  unsigned int flags,
   931					  struct mount_attr __user *uattr, size_t usize);
   932	asmlinkage long sys_fsopen(const char __user *fs_name, unsigned int flags);
   933	asmlinkage long sys_fsconfig(int fs_fd, unsigned int cmd, const char __user *key,
   934				     const void __user *value, int aux);
   935	asmlinkage long sys_fsmount(int fs_fd, unsigned int flags, unsigned int ms_flags);
   936	asmlinkage long sys_fspick(int dfd, const char __user *path, unsigned int flags);
   937	asmlinkage long sys_pidfd_send_signal(int pidfd, int sig,
   938					       siginfo_t __user *info,
   939					       unsigned int flags);
   940	asmlinkage long sys_pidfd_getfd(int pidfd, int fd, unsigned int flags);
   941	asmlinkage long sys_landlock_create_ruleset(const struct landlock_ruleset_attr __user *attr,
   942			size_t size, __u32 flags);
   943	asmlinkage long sys_landlock_add_rule(int ruleset_fd, enum landlock_rule_type rule_type,
   944			const void __user *rule_attr, __u32 flags);
   945	asmlinkage long sys_landlock_restrict_self(int ruleset_fd, __u32 flags);
   946	asmlinkage long sys_memfd_secret(unsigned int flags);
   947	asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
   948						    unsigned long home_node,
   949						    unsigned long flags);
   950	asmlinkage long sys_cachestat(unsigned int fd,
   951			struct cachestat_range __user *cstat_range,
   952			struct cachestat __user *cstat, unsigned int flags);
   953	asmlinkage long sys_map_shadow_stack(unsigned long addr, unsigned long size, unsigned int flags);
   954	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 02/11] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving
  2023-12-09  6:59 ` [PATCH v2 02/11] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving Gregory Price
@ 2023-12-09 21:24   ` kernel test robot
  0 siblings, 0 replies; 21+ messages in thread
From: kernel test robot @ 2023-12-09 21:24 UTC (permalink / raw)
  To: Gregory Price, linux-mm
  Cc: llvm, oe-kbuild-all, linux-doc, linux-fsdevel, linux-api,
	linux-arch, linux-kernel, akpm, arnd, tglx, luto, mingo, bp,
	dave.hansen, x86, hpa, mhocko, tj, ying.huang, gregory.price,
	corbet, rakie.kim, hyeongtak.ji, honggyu.kim, vtavarespetr,
	peterz, jgroves, ravis.opensrc, sthanneeru, emirakhur,
	Hasan.Maruf

Hi Gregory,

kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on deller-parisc/for-next powerpc/next powerpc/fixes s390/features jcmvbkbc-xtensa/xtensa-for-next arnd-asm-generic/master linus/master v6.7-rc4 next-20231208]
[cannot apply to tip/x86/asm geert-m68k/for-next geert-m68k/for-linus]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Gregory-Price/mm-mempolicy-implement-the-sysfs-based-weighted_interleave-interface/20231209-150314
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20231209065931.3458-3-gregory.price%40memverge.com
patch subject: [PATCH v2 02/11] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving
config: x86_64-rhel-8.3-rust (https://download.01.org/0day-ci/archive/20231210/202312100543.ix4jxw81-lkp@intel.com/config)
compiler: clang version 16.0.4 (https://github.com/llvm/llvm-project.git ae42196bc493ffe877a7e3dff8be32035dea4d07)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231210/202312100543.ix4jxw81-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202312100543.ix4jxw81-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> mm/mempolicy.c:2355:3: warning: variable 'weight_total' is uninitialized when used here [-Wuninitialized]
                   weight_total += weight;
                   ^~~~~~~~~~~~
   mm/mempolicy.c:2341:27: note: initialize the variable 'weight_total' to silence this warning
           unsigned int weight_total;
                                    ^
                                     = 0
   1 warning generated.


vim +/weight_total +2355 mm/mempolicy.c

  2329	
  2330	static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
  2331			struct mempolicy *pol, unsigned long nr_pages,
  2332			struct page **page_array)
  2333	{
  2334		struct task_struct *me = current;
  2335		unsigned long total_allocated = 0;
  2336		unsigned long nr_allocated;
  2337		unsigned long rounds;
  2338		unsigned long node_pages, delta;
  2339		unsigned char weight;
  2340		unsigned char weights[MAX_NUMNODES];
  2341		unsigned int weight_total;
  2342		unsigned long rem_pages = nr_pages;
  2343		nodemask_t nodes = pol->nodes;
  2344		int nnodes, node, prev_node;
  2345		int i;
  2346	
  2347		/* Stabilize the nodemask on the stack */
  2348		barrier();
  2349	
  2350		nnodes = nodes_weight(nodes);
  2351	
  2352		/* Collect weights and save them on stack so they don't change */
  2353		for_each_node_mask(node, nodes) {
  2354			weight = iw_table[node];
> 2355			weight_total += weight;
  2356			weights[node] = weight;
  2357		}
  2358	
  2359		/* Continue allocating from most recent node and adjust the nr_pages */
  2360		if (pol->wil.cur_weight) {
  2361			node = next_node_in(me->il_prev, nodes);
  2362			node_pages = pol->wil.cur_weight;
  2363			if (node_pages > rem_pages)
  2364				node_pages = rem_pages;
  2365			nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
  2366							  NULL, page_array);
  2367			page_array += nr_allocated;
  2368			total_allocated += nr_allocated;
  2369			/* if that's all the pages, no need to interleave */
  2370			if (rem_pages <= pol->wil.cur_weight) {
  2371				pol->wil.cur_weight -= rem_pages;
  2372				return total_allocated;
  2373			}
  2374			/* Otherwise we adjust nr_pages down, and continue from there */
  2375			rem_pages -= pol->wil.cur_weight;
  2376			pol->wil.cur_weight = 0;
  2377			prev_node = node;
  2378		}
  2379	
  2380		/* Now we can continue allocating as if from 0 instead of an offset */
  2381		rounds = rem_pages / weight_total;
  2382		delta = rem_pages % weight_total;
  2383		for (i = 0; i < nnodes; i++) {
  2384			node = next_node_in(prev_node, nodes);
  2385			weight = weights[node];
  2386			node_pages = weight * rounds;
  2387			if (delta) {
  2388				if (delta > weight) {
  2389					node_pages += weight;
  2390					delta -= weight;
  2391				} else {
  2392					node_pages += delta;
  2393					delta = 0;
  2394				}
  2395			}
  2396			/* We may not make it all the way around */
  2397			if (!node_pages)
  2398				break;
  2399			/* If an over-allocation would occur, floor it */
  2400			if (node_pages + total_allocated > nr_pages) {
  2401				node_pages = nr_pages - total_allocated;
  2402				delta = 0;
  2403			}
  2404			nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
  2405							  NULL, page_array);
  2406			page_array += nr_allocated;
  2407			total_allocated += nr_allocated;
  2408			prev_node = node;
  2409		}
  2410	
  2411		/*
  2412		 * Finally, we need to update me->il_prev and pol->wil.cur_weight
  2413		 * if there were overflow pages, but not equivalent to the node
  2414		 * weight, set the cur_weight to node_weight - delta and the
  2415		 * me->il_prev to the previous node. Otherwise if it was perfect
  2416		 * we can simply set il_prev to node and cur_weight to 0
  2417		 */
  2418		if (node_pages) {
  2419			me->il_prev = prev_node;
  2420			node_pages %= weight;
  2421			pol->wil.cur_weight = weight - node_pages;
  2422		} else {
  2423			me->il_prev = node;
  2424			pol->wil.cur_weight = 0;
  2425		}
  2426	
  2427		return total_allocated;
  2428	}
  2429	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 11/11] mm/mempolicy: extend set_mempolicy2 and mbind2 to support weighted interleave
  2023-12-09  6:59 ` [PATCH v2 11/11] mm/mempolicy: extend set_mempolicy2 and mbind2 to support weighted interleave Gregory Price
@ 2023-12-09 22:28   ` kernel test robot
  0 siblings, 0 replies; 21+ messages in thread
From: kernel test robot @ 2023-12-09 22:28 UTC (permalink / raw)
  To: Gregory Price, linux-mm
  Cc: oe-kbuild-all, linux-doc, linux-fsdevel, linux-api, linux-arch,
	linux-kernel, akpm, arnd, tglx, luto, mingo, bp, dave.hansen, x86,
	hpa, mhocko, tj, ying.huang, gregory.price, corbet, rakie.kim,
	hyeongtak.ji, honggyu.kim, vtavarespetr, peterz, jgroves,
	ravis.opensrc, sthanneeru, emirakhur, Hasan.Maruf

Hi Gregory,

kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on deller-parisc/for-next powerpc/next powerpc/fixes s390/features jcmvbkbc-xtensa/xtensa-for-next arnd-asm-generic/master linus/master v6.7-rc4]
[cannot apply to tip/x86/asm geert-m68k/for-next geert-m68k/for-linus next-20231208]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Gregory-Price/mm-mempolicy-implement-the-sysfs-based-weighted_interleave-interface/20231209-150314
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20231209065931.3458-12-gregory.price%40memverge.com
patch subject: [PATCH v2 11/11] mm/mempolicy: extend set_mempolicy2 and mbind2 to support weighted interleave
config: x86_64-randconfig-123-20231210 (https://download.01.org/0day-ci/archive/20231210/202312100606.2aOpv2T5-lkp@intel.com/config)
compiler: clang version 16.0.4 (https://github.com/llvm/llvm-project.git ae42196bc493ffe877a7e3dff8be32035dea4d07)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231210/202312100606.2aOpv2T5-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202312100606.2aOpv2T5-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
   mm/mempolicy.c: note: in included file (through include/linux/rculist.h, include/linux/pid.h, include/linux/sched.h, ...):
   include/linux/rcupdate.h:778:9: sparse: sparse: context imbalance in 'queue_folios_pte_range' - unexpected unlock
   mm/mempolicy.c: note: in included file (through arch/x86/include/asm/uaccess.h, include/linux/uaccess.h, include/linux/sched/task.h, ...):
   arch/x86/include/asm/uaccess_64.h:88:24: sparse: sparse: cast removes address space '__user' of expression
>> mm/mempolicy.c:1681:29: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected unsigned char *weights_ptr @@     got void [noderef] __user * @@
   mm/mempolicy.c:1681:29: sparse:     expected unsigned char *weights_ptr
   mm/mempolicy.c:1681:29: sparse:     got void [noderef] __user *
>> mm/mempolicy.c:1684:45: sparse: sparse: incorrect type in argument 3 (different address spaces) @@     expected void const [noderef] __user *src @@     got unsigned char *weights_ptr @@
   mm/mempolicy.c:1684:45: sparse:     expected void const [noderef] __user *src
   mm/mempolicy.c:1684:45: sparse:     got unsigned char *weights_ptr
   mm/mempolicy.c:2042:29: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected unsigned char *weights_ptr @@     got void [noderef] __user * @@
   mm/mempolicy.c:2042:29: sparse:     expected unsigned char *weights_ptr
   mm/mempolicy.c:2042:29: sparse:     got void [noderef] __user *
>> mm/mempolicy.c:2043:36: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected void [noderef] __user *to @@     got unsigned char *weights_ptr @@
   mm/mempolicy.c:2043:36: sparse:     expected void [noderef] __user *to
   mm/mempolicy.c:2043:36: sparse:     got unsigned char *weights_ptr

vim +1681 mm/mempolicy.c

  1629	
  1630	SYSCALL_DEFINE5(mbind2, const struct iovec __user *, vec, size_t, vlen,
  1631			const struct mpol_args __user *, uargs, size_t, usize,
  1632			unsigned long, flags)
  1633	{
  1634		struct mpol_args kargs;
  1635		struct mempolicy_args margs;
  1636		nodemask_t policy_nodes;
  1637		unsigned long __user *nodes_ptr;
  1638		struct iovec iovstack[UIO_FASTIOV];
  1639		struct iovec *iov = iovstack;
  1640		struct iov_iter iter;
  1641		unsigned char weights[MAX_NUMNODES];
  1642		unsigned char *weights_ptr;
  1643		int err;
  1644	
  1645		if (!vec || !vlen)
  1646			return -EINVAL;
  1647	
  1648		err = copy_struct_from_user(&kargs, sizeof(kargs), uargs, usize);
  1649		if (err)
  1650			return -EINVAL;
  1651	
  1652		err = validate_mpol_flags(kargs.mode, &kargs.mode_flags);
  1653		if (err)
  1654			return err;
  1655	
  1656		margs.mode = kargs.mode;
  1657		margs.mode_flags = kargs.mode_flags;
  1658		margs.addr = kargs.addr;
  1659	
  1660		/* if home node given, validate it is online */
  1661		if (flags & MPOL_MF_HOME_NODE) {
  1662			if ((kargs.home_node >= MAX_NUMNODES) ||
  1663				!node_online(kargs.home_node))
  1664				return -EINVAL;
  1665			margs.home_node = kargs.home_node;
  1666		} else
  1667			margs.home_node = NUMA_NO_NODE;
  1668		flags &= ~MPOL_MF_HOME_NODE;
  1669	
  1670		if (kargs.pol_nodes) {
  1671			nodes_ptr = u64_to_user_ptr(kargs.pol_nodes);
  1672			err = get_nodes(&policy_nodes, nodes_ptr,
  1673					kargs.pol_maxnodes);
  1674			if (err)
  1675				return err;
  1676			margs.policy_nodes = &policy_nodes;
  1677		} else
  1678			margs.policy_nodes = NULL;
  1679	
  1680		if (kargs.mode == MPOL_WEIGHTED_INTERLEAVE) {
> 1681			weights_ptr = u64_to_user_ptr(kargs.il_weights);
  1682			err = copy_struct_from_user(&weights,
  1683						    sizeof(weights),
> 1684						    weights_ptr,
  1685						    kargs.pol_maxnodes);
  1686			if (err)
  1687				return err;
  1688			margs.il_weights = weights;
  1689		} else {
  1690			margs.il_weights = NULL;
  1691			flags |= MPOL_F_GWEIGHT;
  1692		}
  1693	
  1694		/* For each address range in vector, do_mbind */
  1695		err = import_iovec(ITER_DEST, vec, vlen, ARRAY_SIZE(iovstack), &iov,
  1696				   &iter);
  1697		if (err)
  1698			return err;
  1699		while (iov_iter_count(&iter)) {
  1700			unsigned long start, len;
  1701	
  1702			start = untagged_addr((unsigned long)iter_iov_addr(&iter));
  1703			len = iter_iov_len(&iter);
  1704			err = do_mbind(start, len, &margs, flags);
  1705			if (err)
  1706				break;
  1707			iov_iter_advance(&iter, iter_iov_len(&iter));
  1708		}
  1709	
  1710		kfree(iov);
  1711		return err;
  1712	}
  1713	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 00/11] mempolicy2, mbind2, and weighted interleave
  2023-12-09  6:59 [PATCH v2 00/11] mempolicy2, mbind2, and weighted interleave Gregory Price
                   ` (10 preceding siblings ...)
  2023-12-09  6:59 ` [PATCH v2 11/11] mm/mempolicy: extend set_mempolicy2 and mbind2 to support weighted interleave Gregory Price
@ 2023-12-11  5:53 ` Huang, Ying
  2023-12-11 16:42   ` Gregory Price
  11 siblings, 1 reply; 21+ messages in thread
From: Huang, Ying @ 2023-12-11  5:53 UTC (permalink / raw)
  To: Gregory Price
  Cc: linux-mm, linux-doc, linux-fsdevel, linux-api, linux-arch,
	linux-kernel, akpm, arnd, tglx, luto, mingo, bp, dave.hansen, x86,
	hpa, mhocko, tj, gregory.price, corbet, rakie.kim, hyeongtak.ji,
	honggyu.kim, vtavarespetr, peterz, jgroves, ravis.opensrc,
	sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha, Johannes Weiner,
	Hasan Al Maruf, Hao Wang, Dan Williams, Michal Hocko, Zhongkun He,
	Frank van der Linden, John Groves, Jonathan Cameron

Hi, Gregory,

Thanks for updated version!

Gregory Price <gourry.memverge@gmail.com> writes:

> v2:
>   changes / adds:
> - flattened weight matrix to an array at requested of Ying Huang
> - Updated ABI docs per Davidlohr Bueso request
> - change uapi structure to use aligned/fixed-length members as
>   Suggested-by: Arnd Bergmann <arnd@arndb.de>
> - Implemented weight fetch logic in get_mempolicy2
> - mbind2 was changed to take (iovec,len) as function arguments
>   rather than add them to the uapi structure, since they describe
>   where to apply the mempolicy - as opposed to being part of it.
>
>   fixes:
> - fixed bug on fork/pthread
>   Reported-by: Seungjun Ha <seungjun.ha@samsung.com>
>   Link: https://lore.kernel.org/linux-cxl/20231206080944epcms2p76ebb230b9f4595f5cfcd2531d67ab3ce@epcms2p7/
> - fixed bug in mbind2 where MPOL_F_GWEIGHTS was not set when il_weights
>   was omitted after local weights were added as an option
> - fixed bug in interleave logic where an OOB access was made if
>   next_node_in returned MAX_NUMNODES
> - fixed bug in bulk weighted interleave allocator where over-allocation
>   could occur.
>
>   tests:
> - LTP: validated existing get_mempolicy, set_mempolicy, and mbind testss
> - LTP: validated existing get_mempolicy, set_mempolicy, and mbind with
>        MPOL_WEIGHTED_INTERLEAVE added.
> - basic set_mempolicy2 tests and numactl -w --interleave tests
>
>   numactl:
> - Sample numactl extension for set_mempolicy available here:
>   Link: https://github.com/gmprice/numactl/tree/weighted_interleave_master
>
> (summary of LTP tests and manual tests added to end of cover letter)
>
> =====================================================================
>
> This patch set extends the mempolicy interface to enable new
> mempolicies which may require extended data to operate. One
> such policy is included with this set as an example.
>
> There are 3 major "phases" in the patch set:
> 1) Implement a "global weight" mechanism via sysfs, which allows
>    set_mempolicy to implement MPOL_WEIGHTED_INTERLEAVE utilizing
>    weights set by the administrator (or system daemon).
>
> 2) A refactor of the mempolicy creation mechanism to accept an
>    extensible argument structure `struct mempolicy_args` to promote
>    code re-use between the original mempolicy/mbind interfaces and
>    the new extended mempolicy/mbind interfaces.
>
> 3) Implementation of set_mempolicy2, get_mempolicy2, and mbind2,
>    along with the addition of task-local weights so that per-task
>    weights can be registered for MPOL_WEIGHTED_INTERLEAVE.
>
> =====================================================================
> (Patch 1) : sysfs addition - /sys/kernel/mm/mempolicy/
>
> This feature  provides a way to set interleave weight information under
> sysfs at /sys/kernel/mm/mempolicy/weighted_interleave/nodeN/nodeM/weight
>
>     The sysfs structure is designed as follows.
>
>       $ tree /sys/kernel/mm/mempolicy/
>       /sys/kernel/mm/mempolicy/
>       ├── possible_nodes
>       └── weighted_interleave
>           ├── nodeN
>           │   └── weight
>           └── nodeN+X
>               └── weight
>
> 'mempolicy' is added to '/sys/kernel/mm/' as a control group for
> the mempolicy subsystem.

Is it good to add 'mempolicy' in '/sys/kernel/mm/numa'?  The advantage
is that 'mempolicy' here is in fact "NUMA mempolicy".  The disadvantage
is one more directory nesting.  I have no strong opinion here.

> 'possible_nodes' is added to 'mm/mempolicy' to help describe the
> expected structures under mempolicy directorys. For example,
> possible_nodes describes what nodeN directories wille exist under
> the weighted_interleave directory.

We have '/sys/devices/system/node/possible' already.  Is this just a
duplication?  If so, why?  And, the possible nodes can be gotten via
contents of 'weighted_interleave' too.

And it appears not necessary to make 'weighted_interleave/nodeN'
directory.  Why not just make it a file.

And, can we add a way to reset weight to the default value?  For example
`echo > nodeN/weight` or `echo > nodeN`.

> Internally, weights are represented as an array of unsigned char
>
> static unsigned char iw_table[MAX_NUMNODES];
>
> char was chosen as most reasonable distributions can be represented
> as factors <100, and to minimize memory usage (1KB)
>
> We present possible nodes, instead of online nodes, to simplify the
> management interface, considering that a) the table is of size
> MAX_NUMNODES anyway to simplify fetching of weights (no need to track
> sizes, and MAX_NUMNODES is typically at most 1kb), and b) it simplifies
> management of hotplug events, allowing for weights to be set prior to
> a node coming online, which may be beneficial for immediate use.
>
> the 'weight' of a node (an unsigned char of value 1-255) is the number
> of pages that are allocated during a "weighted interleave" round.
> (See 'weighted interleave' for more details').
>
> =====================================================================
> (Patch 2) set_mempolicy: MPOL_WEIGHTED_INTERLEAVE
>
> Weighted interleave is a new memory policy that interleaves memory
> across numa nodes in the provided nodemask based on the weights
> described in patch 1 (sysfs global weights).
>
> When a system has multiple NUMA nodes and it becomes bandwidth hungry,
> the current MPOL_INTERLEAVE could be an wise option.
>
> However, if those NUMA nodes consist of different types of memory such
> as having local DRAM and CXL memory together, the current round-robin
> based interleaving policy doesn't maximize the overall bandwidth
> because of their different bandwidth characteristics.
>
> Instead, the interleaving can be more efficient when the allocation
> policy follows each NUMA nodes' bandwidth weight rather than having 1:1
> round-robin allocation.
>
> This patch introduces a new memory policy, MPOL_WEIGHTED_INTERLEAVE,
> which enables weighted interleaving between NUMA nodes.  Weighted
> interleave allows for a proportional distribution of memory across
> multiple numa nodes, preferablly apportioned to match the bandwidth
> capacity of each node from the perspective of the accessing node.
>
> For example, if a system has 1 CPU node (0), and 2 memory nodes (0,1),
> with a relative bandwidth of (100GB/s, 50GB/s) respectively, the
> appropriate weight distribution is (2:1).
>
> Weights will be acquired from the global weight array exposed by the
> sysfs extension: /sys/kernel/mm/mempolicy/weighted_interleave/
>
> The policy will then allocate the number of pages according to the
> set weights.  For example, if the weights are (2,1), then 2 pages
> will be allocated on node0 for every 1 page allocated on node1.
>
> The new flag MPOL_WEIGHTED_INTERLEAVE can be used in set_mempolicy(2)
> and mbind(2).
>
> =====================================================================
> (Patches 3-6) Refactoring mempolicy for code-reuse
>
> To avoid multiple paths of mempolicy creation, we should refactor the
> existing code to enable the designed extensibility, and refactor
> existing users to utilize the new interface (while retaining the
> existing userland interface).
>
> This set of patches introduces a new mempolicy_args structure, which
> is used to more fully describe a requested mempolicy - to include
> existing and future extensions.
>
> /*
>  * Describes settings of a mempolicy during set/get syscalls and
>  * kernel internal calls to do_set_mempolicy()
>  */
> struct mempolicy_args {
>     unsigned short mode;            /* policy mode */
>     unsigned short mode_flags;      /* policy mode flags */
>     nodemask_t *policy_nodes;       /* get/set/mbind */
>     int policy_node;                /* get: policy node information */
>     unsigned long addr;             /* get: vma address */
>     int addr_node;                  /* get: node the address belongs to */
>     int home_node;                  /* mbind: use MPOL_MF_HOME_NODE */
>     unsigned char *il_weights;      /* for mode MPOL_WEIGHTED_INTERLEAVE */
> };
>
> This arg structure will eventually be utilized by the following
> interfaces:
>     mpol_new() - new mempolicy creation
>     do_get_mempolicy() - acquiring information about mempolicy
>     do_set_mempolicy() - setting the task mempolicy
>     do_mbind()         - setting a vma mempolicy
>
> do_get_mempolicy() is completely refactored to break it out into
> separate functionality based on the flags provided by get_mempolicy(2)
>     MPOL_F_MEMS_ALLOWED: acquires task->mems_allowed
>     MPOL_F_ADDR: acquires information on vma policies
>     MPOL_F_NODE: changes the output for the policy arg to node info
>
> We refactor the get_mempolicy syscall flatten the logic based on these
> flags, and aloow for set_mempolicy2() to re-use the underlying logic.
>
> The result of this refactor, and the new mempolicy_args structure, is
> that extensions like 'sys_set_mempolicy_home_node' can now be directly
> integrated into the initial call to 'set_mempolicy2', and that more
> complete information about a mempolicy can be returned with a single
> call to 'get_mempolicy2', rather than multiple calls to 'get_mempolicy'
>
>
> =====================================================================
> (Patches 7-10) set_mempolicy2, get_mempolicy2, mbind2
>
> These interfaces are the 'extended' counterpart to their relatives.
> They use the userland 'struct mpol_args' structure to communicate a
> complete mempolicy configuration to the kernel.  This structure
> looks very much like the kernel-internal 'struct mempolicy_args':
>
> struct mpol_args {
>         /* Basic mempolicy settings */
>         __u16 mode;
>         __u16 mode_flags;
>         __s32 home_node;
>         __aligned_u64 pol_nodes;
>         __u64 pol_maxnodes;
>         __u64 addr;
>         __s32 policy_node;
>         __s32 addr_node;
>         __aligned_u64 *il_weights;      /* of size pol_maxnodes */
> };

This looks unnecessarily complex.  I don't think that it's a good idea
to use exact same parameter for all 3 syscalls.

For example, can we use something as below?

  long set_mempolicy2(int mode, const unsigned long *nodemask, unsigned int *il_weights,
                          unsigned long maxnode, unsigned long home_node,
                          unsigned long flags);

  long mbind2(unsigned long start, unsigned long len,
                          int mode, const unsigned long *nodemask, unsigned int *il_weights,
                          unsigned long maxnode, unsigned long home_node,
                          unsigned long flags);

A struct may be defined to hold mempolicy iteself.

struct mpol {
        int mode;
        unsigned int home_node;
        const unsigned long *nodemask;
        unsigned int *il_weights;
        unsigned int maxnode;
};

> The basic mempolicy settings which are shared across all interfaces
> are captured at the top of the structure, while extensions such as
> 'policy_node' and 'addr' are collected beneath.
>
> The syscalls are uniform and defined as follows:
>
> long sys_mbind2(struct iovec *vec, size_t vlen,
>                 struct mpol_args *args, size_t usize,
>                 unsigned long flags);
>
> long sys_get_mempolicy2(struct mpol_args *args, size_t size,
>                         unsigned long flags);
>
> long sys_set_mempolicy2(struct mpol_args *args, size_t size,
>                         unsigned long flags);
>
> The 'flags' argument for mbind2 is the same as 'mbind', except with
> the addition of MPOL_MF_HOME_NODE to denote whether the 'home_node'
> field should be utilized.
>
> The 'flags' argument for get_mempolicy2 is the same as get_mempolicy.
>
> The 'flags' argument is not used by 'set_mempolicy' at this time, but
> may end up allowing the use of MPOL_MF_HOME_NODE if such functionality
> is desired.
>
> The extensions can be summed up as follows:
>
> get_mempolicy2 extensions:
>     'mode', 'policy_node', and 'addr_node' can now be fetched with
>     a single call, rather than multiple with a combination of flags.
>     - 'mode' will always return the policy mode
>     - 'policy_node' will replace the functionality of MPOL_F_NODE
>     - 'addr_node' will return the node for 'addr' w/ MPOL_F_ADDR
>
> set_mempolicy2:
>     - task-local interleave weights can be set via 'il_weights'
>       (see next patch)
>
> mbind2:
>     - 'vec' and 'vlen' are sed to operate on multiple memory
>        ranges, rather than a single memory range per syscall.
>     - 'home_node' field sets policy home node w/ MPOL_MF_HOME_NODE
>     - task-local interleave weights can be set via 'il_weights'
>       (see next patch)
>

--
Best Regards,
Huang, Ying

[snip]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 00/11] mempolicy2, mbind2, and weighted interleave
  2023-12-11  5:53 ` [PATCH v2 00/11] mempolicy2, mbind2, and " Huang, Ying
@ 2023-12-11 16:42   ` Gregory Price
  2023-12-12  7:08     ` Huang, Ying
  0 siblings, 1 reply; 21+ messages in thread
From: Gregory Price @ 2023-12-11 16:42 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Gregory Price, linux-mm, linux-doc, linux-fsdevel, linux-api,
	linux-arch, linux-kernel, akpm, arnd, tglx, luto, mingo, bp,
	dave.hansen, x86, hpa, mhocko, tj, corbet, rakie.kim,
	hyeongtak.ji, honggyu.kim, vtavarespetr, peterz, jgroves,
	ravis.opensrc, sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha,
	Johannes Weiner, Hasan Al Maruf, Hao Wang, Dan Williams,
	Michal Hocko, Zhongkun He, Frank van der Linden, John Groves,
	Jonathan Cameron

On Mon, Dec 11, 2023 at 01:53:40PM +0800, Huang, Ying wrote:
> Hi, Gregory,
> 
> Thanks for updated version!
> 
> Gregory Price <gourry.memverge@gmail.com> writes:
> 
> > v2:
> >   changes / adds:
> > - flattened weight matrix to an array at requested of Ying Huang
> > - Updated ABI docs per Davidlohr Bueso request
> > - change uapi structure to use aligned/fixed-length members as
> >   Suggested-by: Arnd Bergmann <arnd@arndb.de>
> > - Implemented weight fetch logic in get_mempolicy2
> > - mbind2 was changed to take (iovec,len) as function arguments
> >   rather than add them to the uapi structure, since they describe
> >   where to apply the mempolicy - as opposed to being part of it.
> >
> >     The sysfs structure is designed as follows.
> >
> >       $ tree /sys/kernel/mm/mempolicy/
> >       /sys/kernel/mm/mempolicy/
> >       ├── possible_nodes
> >       └── weighted_interleave
> >           ├── nodeN
> >           │   └── weight
> >           └── nodeN+X
> >               └── weight
> >
> > 'mempolicy' is added to '/sys/kernel/mm/' as a control group for
> > the mempolicy subsystem.
> 
> Is it good to add 'mempolicy' in '/sys/kernel/mm/numa'?  The advantage
> is that 'mempolicy' here is in fact "NUMA mempolicy".  The disadvantage
> is one more directory nesting.  I have no strong opinion here.
> 

i don't have a strong opinion here.

> > 'possible_nodes' is added to 'mm/mempolicy' to help describe the
> > expected structures under mempolicy directorys. For example,
> > possible_nodes describes what nodeN directories wille exist under
> > the weighted_interleave directory.
> 
> We have '/sys/devices/system/node/possible' already.  Is this just a
> duplication?  If so, why?  And, the possible nodes can be gotten via
> contents of 'weighted_interleave' too.
> 

I'll remove it

> And it appears not necessary to make 'weighted_interleave/nodeN'
> directory.  Why not just make it a file.
> 

Originally I wasn't sure whether there would be more attributes, but
this is probably fine.  I'll change it.

> And, can we add a way to reset weight to the default value?  For example
> `echo > nodeN/weight` or `echo > nodeN`.
> 

Seems reasonable.

> > =====================================================================
> > (Patches 7-10) set_mempolicy2, get_mempolicy2, mbind2
> >
> > These interfaces are the 'extended' counterpart to their relatives.
> > They use the userland 'struct mpol_args' structure to communicate a
> > complete mempolicy configuration to the kernel.  This structure
> > looks very much like the kernel-internal 'struct mempolicy_args':
> >
> > struct mpol_args {
> >         /* Basic mempolicy settings */
> >         __u16 mode;
> >         __u16 mode_flags;
> >         __s32 home_node;
> >         __aligned_u64 pol_nodes;
> >         __u64 pol_maxnodes;
> >         __u64 addr;
> >         __s32 policy_node;
> >         __s32 addr_node;
> >         __aligned_u64 *il_weights;      /* of size pol_maxnodes */
> > };
> 
> This looks unnecessarily complex.  I don't think that it's a good idea
> to use exact same parameter for all 3 syscalls.
>

It is exactly as complex as mempolicy is.  Everything here is already
described in the existing interfaces (except il_weights).

> For example, can we use something as below?
> 
>   long set_mempolicy2(int mode, const unsigned long *nodemask, unsigned int *il_weights,
>                           unsigned long maxnode, unsigned long home_node,
>                           unsigned long flags);
> 
>   long mbind2(unsigned long start, unsigned long len,
>                           int mode, const unsigned long *nodemask, unsigned int *il_weights,
>                           unsigned long maxnode, unsigned long home_node,
>                           unsigned long flags);
> 

Your definition of mbind2 is impossible.

Neither of these interfaces solve the extensibility issue.  If a new
policy which requires a new format of data arrives, we can look forward
to set_mempolicy3 and mbind3.

> A struct may be defined to hold mempolicy iteself.
> 
> struct mpol {
>         int mode;
>         unsigned int home_node;
>         const unsigned long *nodemask;
>         unsigned int *il_weights;
>         unsigned int maxnode;
> };
> 

addr could be pulled out for get_mempolicy2, so i will do that

'addr_node' and 'policy_node' are warts that came from the original
get_mempolicy.  Removing them increases the complexity of handling
arguments in the common get_mempolicy code.

I could probably just drop support for retrieving the addr_node from
get_mempolicy2, since it's already possible with get_mempolicy.  So I
will do that.

~Gregory

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 00/11] mempolicy2, mbind2, and weighted interleave
  2023-12-11 16:42   ` Gregory Price
@ 2023-12-12  7:08     ` Huang, Ying
  2023-12-12 15:59       ` Gregory Price
  0 siblings, 1 reply; 21+ messages in thread
From: Huang, Ying @ 2023-12-12  7:08 UTC (permalink / raw)
  To: Gregory Price
  Cc: Gregory Price, linux-mm, linux-doc, linux-fsdevel, linux-api,
	linux-arch, linux-kernel, akpm, arnd, tglx, luto, mingo, bp,
	dave.hansen, x86, hpa, mhocko, tj, corbet, rakie.kim,
	hyeongtak.ji, honggyu.kim, vtavarespetr, peterz, jgroves,
	ravis.opensrc, sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha,
	Johannes Weiner, Hasan Al Maruf, Hao Wang, Dan Williams,
	Michal Hocko, Zhongkun He, Frank van der Linden, John Groves,
	Jonathan Cameron

Gregory Price <gregory.price@memverge.com> writes:

> On Mon, Dec 11, 2023 at 01:53:40PM +0800, Huang, Ying wrote:
>> Hi, Gregory,
>> 
>> Thanks for updated version!
>> 
>> Gregory Price <gourry.memverge@gmail.com> writes:
>> 
>> > v2:
>> >   changes / adds:
>> > - flattened weight matrix to an array at requested of Ying Huang
>> > - Updated ABI docs per Davidlohr Bueso request
>> > - change uapi structure to use aligned/fixed-length members as
>> >   Suggested-by: Arnd Bergmann <arnd@arndb.de>
>> > - Implemented weight fetch logic in get_mempolicy2
>> > - mbind2 was changed to take (iovec,len) as function arguments
>> >   rather than add them to the uapi structure, since they describe
>> >   where to apply the mempolicy - as opposed to being part of it.
>> >
>> >     The sysfs structure is designed as follows.
>> >
>> >       $ tree /sys/kernel/mm/mempolicy/
>> >       /sys/kernel/mm/mempolicy/
>> >       ├── possible_nodes
>> >       └── weighted_interleave
>> >           ├── nodeN
>> >           │   └── weight
>> >           └── nodeN+X
>> >               └── weight
>> >
>> > 'mempolicy' is added to '/sys/kernel/mm/' as a control group for
>> > the mempolicy subsystem.
>> 
>> Is it good to add 'mempolicy' in '/sys/kernel/mm/numa'?  The advantage
>> is that 'mempolicy' here is in fact "NUMA mempolicy".  The disadvantage
>> is one more directory nesting.  I have no strong opinion here.
>> 
>
> i don't have a strong opinion here.
>
>> > 'possible_nodes' is added to 'mm/mempolicy' to help describe the
>> > expected structures under mempolicy directorys. For example,
>> > possible_nodes describes what nodeN directories wille exist under
>> > the weighted_interleave directory.
>> 
>> We have '/sys/devices/system/node/possible' already.  Is this just a
>> duplication?  If so, why?  And, the possible nodes can be gotten via
>> contents of 'weighted_interleave' too.
>> 
>
> I'll remove it
>
>> And it appears not necessary to make 'weighted_interleave/nodeN'
>> directory.  Why not just make it a file.
>> 
>
> Originally I wasn't sure whether there would be more attributes, but
> this is probably fine.  I'll change it.
>
>> And, can we add a way to reset weight to the default value?  For example
>> `echo > nodeN/weight` or `echo > nodeN`.
>> 
>
> Seems reasonable.
>
>> > =====================================================================
>> > (Patches 7-10) set_mempolicy2, get_mempolicy2, mbind2
>> >
>> > These interfaces are the 'extended' counterpart to their relatives.
>> > They use the userland 'struct mpol_args' structure to communicate a
>> > complete mempolicy configuration to the kernel.  This structure
>> > looks very much like the kernel-internal 'struct mempolicy_args':
>> >
>> > struct mpol_args {
>> >         /* Basic mempolicy settings */
>> >         __u16 mode;
>> >         __u16 mode_flags;
>> >         __s32 home_node;
>> >         __aligned_u64 pol_nodes;
>> >         __u64 pol_maxnodes;
>> >         __u64 addr;
>> >         __s32 policy_node;
>> >         __s32 addr_node;
>> >         __aligned_u64 *il_weights;      /* of size pol_maxnodes */
>> > };
>> 
>> This looks unnecessarily complex.  I don't think that it's a good idea
>> to use exact same parameter for all 3 syscalls.
>>
>
> It is exactly as complex as mempolicy is.  Everything here is already
> described in the existing interfaces (except il_weights).
>
>> For example, can we use something as below?
>> 
>>   long set_mempolicy2(int mode, const unsigned long *nodemask, unsigned int *il_weights,
>>                           unsigned long maxnode, unsigned long home_node,
>>                           unsigned long flags);
>> 
>>   long mbind2(unsigned long start, unsigned long len,
>>                           int mode, const unsigned long *nodemask, unsigned int *il_weights,
>>                           unsigned long maxnode, unsigned long home_node,
>>                           unsigned long flags);
>> 
>
> Your definition of mbind2 is impossible.
>
> Neither of these interfaces solve the extensibility issue.  If a new
> policy which requires a new format of data arrives, we can look forward
> to set_mempolicy3 and mbind3.

IIUC, we will not over-engineering too much.  It's hard to predict the
requirements in the future.

>> A struct may be defined to hold mempolicy iteself.
>> 
>> struct mpol {
>>         int mode;
>>         unsigned int home_node;
>>         const unsigned long *nodemask;
>>         unsigned int *il_weights;
>>         unsigned int maxnode;
>> };
>> 
>
> addr could be pulled out for get_mempolicy2, so i will do that
>
> 'addr_node' and 'policy_node' are warts that came from the original
> get_mempolicy.  Removing them increases the complexity of handling
> arguments in the common get_mempolicy code.
>
> I could probably just drop support for retrieving the addr_node from
> get_mempolicy2, since it's already possible with get_mempolicy.  So I
> will do that.

If it's necessary, we can add another struct for get_mempolicy2().  But
I don't think that it's necessary to add get_mempolicy2() specific
parameters for set_mempolicy2() or mbind2().

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 00/11] mempolicy2, mbind2, and weighted interleave
  2023-12-12  7:08     ` Huang, Ying
@ 2023-12-12 15:59       ` Gregory Price
  2023-12-13  2:44         ` Huang, Ying
  0 siblings, 1 reply; 21+ messages in thread
From: Gregory Price @ 2023-12-12 15:59 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Gregory Price, linux-mm, linux-doc, linux-fsdevel, linux-api,
	linux-arch, linux-kernel, akpm, arnd, tglx, luto, mingo, bp,
	dave.hansen, x86, hpa, mhocko, tj, corbet, rakie.kim,
	hyeongtak.ji, honggyu.kim, vtavarespetr, peterz, jgroves,
	ravis.opensrc, sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha,
	Johannes Weiner, Hasan Al Maruf, Hao Wang, Dan Williams,
	Michal Hocko, Zhongkun He, Frank van der Linden, John Groves,
	Jonathan Cameron

On Tue, Dec 12, 2023 at 03:08:24PM +0800, Huang, Ying wrote:
> Gregory Price <gregory.price@memverge.com> writes:
> 
> >> For example, can we use something as below?
> >> 
> >>   long set_mempolicy2(int mode, const unsigned long *nodemask, unsigned int *il_weights,
> >>                           unsigned long maxnode, unsigned long home_node,
> >>                           unsigned long flags);
> >> 
> >>   long mbind2(unsigned long start, unsigned long len,
> >>                           int mode, const unsigned long *nodemask, unsigned int *il_weights,
> >>                           unsigned long maxnode, unsigned long home_node,
> >>                           unsigned long flags);
> >> 
> >
> > Your definition of mbind2 is impossible.
> >
> > Neither of these interfaces solve the extensibility issue.  If a new
> > policy which requires a new format of data arrives, we can look forward
> > to set_mempolicy3 and mbind3.
> 
> IIUC, we will not over-engineering too much.  It's hard to predict the
> requirements in the future.
> 

Sure, but having the mempolicy struct at least gives us more flexibility
than the original interface.

> >> A struct may be defined to hold mempolicy iteself.
> >> 
> >> struct mpol {
> >>         int mode;
> >>         unsigned int home_node;
> >>         const unsigned long *nodemask;
> >>         unsigned int *il_weights;
> >>         unsigned int maxnode;
> >> };
> >> 
> >
> > addr could be pulled out for get_mempolicy2, so i will do that
> >
> > 'addr_node' and 'policy_node' are warts that came from the original
> > get_mempolicy.  Removing them increases the complexity of handling
> > arguments in the common get_mempolicy code.
> >
> > I could probably just drop support for retrieving the addr_node from
> > get_mempolicy2, since it's already possible with get_mempolicy.  So I
> > will do that.
> 
> If it's necessary, we can add another struct for get_mempolicy2().  But
> I don't think that it's necessary to add get_mempolicy2() specific
> parameters for set_mempolicy2() or mbind2().

After edits, the only parameter that doesn't have parity between
interfaces is `addr_node` and `policy_node`.  This was an unfortunate
wart on the original get_mempolicy() that multiplexed the output of
(*mode) based on whether MPOL_F_NODE was set.

Example:
if (MPOL_F_ADDR | MPOL_F_NODE), then get_mempolicy() would return
details about a VMA mempolicy + the node of that address in (*mode).

Right now in get_mempolicy2() I fetch this unconditionally instead of
requiring MPOL_F_NODE.  I did not want to multiplexing (*mode) output.

I see two options:
1) Get rid of MPOL_F_NODE functionality in get_mempolicy2()
   If a user wants that information, they can still use get_mempolicy()

2) Keep MPOL_F_NODE and mpol_args->addr_node/policy_node, but don't allow
   any future extensions that create this kind of situation.

I'm fine with either.  I originally aimed for get_mempolicy2() to be
all of get_mempolicy() features + new data, but that obviously isn't
required.

~Gregory

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 00/11] mempolicy2, mbind2, and weighted interleave
  2023-12-12 15:59       ` Gregory Price
@ 2023-12-13  2:44         ` Huang, Ying
  0 siblings, 0 replies; 21+ messages in thread
From: Huang, Ying @ 2023-12-13  2:44 UTC (permalink / raw)
  To: Gregory Price
  Cc: Gregory Price, linux-mm, linux-doc, linux-fsdevel, linux-api,
	linux-arch, linux-kernel, akpm, arnd, tglx, luto, mingo, bp,
	dave.hansen, x86, hpa, mhocko, tj, corbet, rakie.kim,
	hyeongtak.ji, honggyu.kim, vtavarespetr, peterz, jgroves,
	ravis.opensrc, sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha,
	Johannes Weiner, Hasan Al Maruf, Hao Wang, Dan Williams,
	Michal Hocko, Zhongkun He, Frank van der Linden, John Groves,
	Jonathan Cameron

Gregory Price <gregory.price@memverge.com> writes:

> On Tue, Dec 12, 2023 at 03:08:24PM +0800, Huang, Ying wrote:
>> Gregory Price <gregory.price@memverge.com> writes:
>> 
>> >> For example, can we use something as below?
>> >> 
>> >>   long set_mempolicy2(int mode, const unsigned long *nodemask, unsigned int *il_weights,
>> >>                           unsigned long maxnode, unsigned long home_node,
>> >>                           unsigned long flags);
>> >> 
>> >>   long mbind2(unsigned long start, unsigned long len,
>> >>                           int mode, const unsigned long *nodemask, unsigned int *il_weights,
>> >>                           unsigned long maxnode, unsigned long home_node,
>> >>                           unsigned long flags);
>> >> 
>> >
>> > Your definition of mbind2 is impossible.
>> >
>> > Neither of these interfaces solve the extensibility issue.  If a new
>> > policy which requires a new format of data arrives, we can look forward
>> > to set_mempolicy3 and mbind3.
>> 
>> IIUC, we will not over-engineering too much.  It's hard to predict the
>> requirements in the future.
>> 
>
> Sure, but having the mempolicy struct at least gives us more flexibility
> than the original interface.
>
>> >> A struct may be defined to hold mempolicy iteself.
>> >> 
>> >> struct mpol {
>> >>         int mode;
>> >>         unsigned int home_node;
>> >>         const unsigned long *nodemask;
>> >>         unsigned int *il_weights;
>> >>         unsigned int maxnode;
>> >> };
>> >> 
>> >
>> > addr could be pulled out for get_mempolicy2, so i will do that
>> >
>> > 'addr_node' and 'policy_node' are warts that came from the original
>> > get_mempolicy.  Removing them increases the complexity of handling
>> > arguments in the common get_mempolicy code.
>> >
>> > I could probably just drop support for retrieving the addr_node from
>> > get_mempolicy2, since it's already possible with get_mempolicy.  So I
>> > will do that.
>> 
>> If it's necessary, we can add another struct for get_mempolicy2().  But
>> I don't think that it's necessary to add get_mempolicy2() specific
>> parameters for set_mempolicy2() or mbind2().
>
> After edits, the only parameter that doesn't have parity between
> interfaces is `addr_node` and `policy_node`.  This was an unfortunate
> wart on the original get_mempolicy() that multiplexed the output of
> (*mode) based on whether MPOL_F_NODE was set.
>
> Example:
> if (MPOL_F_ADDR | MPOL_F_NODE), then get_mempolicy() would return
> details about a VMA mempolicy + the node of that address in (*mode).
>
> Right now in get_mempolicy2() I fetch this unconditionally instead of
> requiring MPOL_F_NODE.  I did not want to multiplexing (*mode) output.
>
> I see two options:
> 1) Get rid of MPOL_F_NODE functionality in get_mempolicy2()
>    If a user wants that information, they can still use get_mempolicy()
>
> 2) Keep MPOL_F_NODE and mpol_args->addr_node/policy_node, but don't allow
>    any future extensions that create this kind of situation.

3) Add another parameter to get_mempolicy2(), such as "unsigned long
*value" to retrieve addr_node or policy_node.  We can extend it to be a
"struct *" in the future if necessary.

> I'm fine with either.  I originally aimed for get_mempolicy2() to be
> all of get_mempolicy() features + new data, but that obviously isn't
> required.

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2023-12-13  2:46 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-12-09  6:59 [PATCH v2 00/11] mempolicy2, mbind2, and weighted interleave Gregory Price
2023-12-09  6:59 ` [PATCH v2 01/11] mm/mempolicy: implement the sysfs-based weighted_interleave interface Gregory Price
2023-12-09  6:59 ` [PATCH v2 02/11] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving Gregory Price
2023-12-09 21:24   ` kernel test robot
2023-12-09  6:59 ` [PATCH v2 03/11] mm/mempolicy: refactor sanitize_mpol_flags for reuse Gregory Price
2023-12-09  6:59 ` [PATCH v2 04/11] mm/mempolicy: create struct mempolicy_args for creating new mempolicies Gregory Price
2023-12-09  6:59 ` [PATCH v2 05/11] mm/mempolicy: refactor kernel_get_mempolicy for code re-use Gregory Price
2023-12-09  6:59 ` [PATCH v2 06/11] mm/mempolicy: allow home_node to be set by mpol_new Gregory Price
2023-12-09  6:59 ` [PATCH v2 07/11] mm/mempolicy: add userland mempolicy arg structure Gregory Price
2023-12-09  6:59 ` [PATCH v2 08/11] mm/mempolicy: add set_mempolicy2 syscall Gregory Price
2023-12-09 16:46   ` kernel test robot
2023-12-09 18:24   ` kernel test robot
2023-12-09  6:59 ` [PATCH v2 09/11] mm/mempolicy: add get_mempolicy2 syscall Gregory Price
2023-12-09  6:59 ` [PATCH v2 10/11] mm/mempolicy: add the mbind2 syscall Gregory Price
2023-12-09  6:59 ` [PATCH v2 11/11] mm/mempolicy: extend set_mempolicy2 and mbind2 to support weighted interleave Gregory Price
2023-12-09 22:28   ` kernel test robot
2023-12-11  5:53 ` [PATCH v2 00/11] mempolicy2, mbind2, and " Huang, Ying
2023-12-11 16:42   ` Gregory Price
2023-12-12  7:08     ` Huang, Ying
2023-12-12 15:59       ` Gregory Price
2023-12-13  2:44         ` Huang, Ying

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).