[PATCH 0/6] futex: The remaining futex2 bits

linux-api.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/6] futex: The remaining futex2 bits
@ 2024-10-25  9:03 Peter Zijlstra
  2024-10-25  9:03 ` [PATCH 1/6] mm: Add vmalloc_huge_node() Peter Zijlstra
                   ` (5 more replies)
  0 siblings, 6 replies; 17+ messages in thread
From: Peter Zijlstra @ 2024-10-25  9:03 UTC (permalink / raw)
  To: tglx
  Cc: linux-kernel, peterz, mingo, dvhart, dave, andrealmeid,
	Andrew Morton, urezki, hch, lstoakes, Arnd Bergmann, linux-api,
	linux-mm, linux-arch, malteskarupke, cl, llong

Hi!

By popular demand a repost of the remaining futex2 bits.

Notably this is the FUTEX2_NUMA and FUTEX2_{8,16} 'small' futex support.

I'm not sure how much demand there actually is, since back-channels and
whispers don't really count. So for those of you out there on the big wide
interweb, if you want this, respond with Tested-by tags.

I know Christoph wants the NUMA bits, so I suppose that is at least one user
for this new interface, and that might see it through, but more interested
parties would be better and certainly move things along faster.

Same goes for the small futex bits, if nobody replies, they're going to get
left behind -- again!

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 1/6] mm: Add vmalloc_huge_node()
  2024-10-25  9:03 [PATCH 0/6] futex: The remaining futex2 bits Peter Zijlstra
@ 2024-10-25  9:03 ` Peter Zijlstra
  2024-10-25  9:58   ` Uladzislau Rezki
  2024-10-25 16:00   ` Davidlohr Bueso
  2024-10-25  9:03 ` [PATCH 2/6] futex: Implement FUTEX2_NUMA Peter Zijlstra
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 17+ messages in thread
From: Peter Zijlstra @ 2024-10-25  9:03 UTC (permalink / raw)
  To: tglx
  Cc: linux-kernel, peterz, mingo, dvhart, dave, andrealmeid,
	Andrew Morton, urezki, hch, lstoakes, Arnd Bergmann, linux-api,
	linux-mm, linux-arch, malteskarupke, cl, llong, Christoph Hellwig

To enable node specific hash-tables.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 include/linux/vmalloc.h |    3 +++
 mm/vmalloc.c            |    7 +++++++
 2 files changed, 10 insertions(+)

--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -177,6 +177,9 @@ void *__vmalloc_node_noprof(unsigned lon
 void *vmalloc_huge_noprof(unsigned long size, gfp_t gfp_mask) __alloc_size(1);
 #define vmalloc_huge(...)	alloc_hooks(vmalloc_huge_noprof(__VA_ARGS__))
 
+void *vmalloc_huge_node_noprof(unsigned long size, gfp_t gfp_mask, int node) __alloc_size(1);
+#define vmalloc_huge_node(...)	alloc_hooks(vmalloc_huge_node_noprof(__VA_ARGS__))
+
 extern void *__vmalloc_array_noprof(size_t n, size_t size, gfp_t flags) __alloc_size(1, 2);
 #define __vmalloc_array(...)	alloc_hooks(__vmalloc_array_noprof(__VA_ARGS__))
 
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3948,6 +3948,13 @@ void *vmalloc_huge_noprof(unsigned long
 }
 EXPORT_SYMBOL_GPL(vmalloc_huge_noprof);
 
+void *vmalloc_huge_node_noprof(unsigned long size, gfp_t gfp_mask, int node)
+{
+	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
+				    gfp_mask, PAGE_KERNEL, VM_ALLOW_HUGE_VMAP,
+				    node, __builtin_return_address(0));
+}
+
 /**
  * vzalloc - allocate virtually contiguous memory with zero fill
  * @size:    allocation size



^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 2/6] futex: Implement FUTEX2_NUMA
  2024-10-25  9:03 [PATCH 0/6] futex: The remaining futex2 bits Peter Zijlstra
  2024-10-25  9:03 ` [PATCH 1/6] mm: Add vmalloc_huge_node() Peter Zijlstra
@ 2024-10-25  9:03 ` Peter Zijlstra
  2024-10-25 18:30   ` Davidlohr Bueso
                     ` (2 more replies)
  2024-10-25  9:03 ` [PATCH 3/6] futex: Propagate flags into futex_get_value_locked() Peter Zijlstra
                   ` (3 subsequent siblings)
  5 siblings, 3 replies; 17+ messages in thread
From: Peter Zijlstra @ 2024-10-25  9:03 UTC (permalink / raw)
  To: tglx
  Cc: linux-kernel, peterz, mingo, dvhart, dave, andrealmeid,
	Andrew Morton, urezki, hch, lstoakes, Arnd Bergmann, linux-api,
	linux-mm, linux-arch, malteskarupke, cl, llong

Extend the futex2 interface to be numa aware.

When FUTEX2_NUMA is specified for a futex, the user value is extended
to two words (of the same size). The first is the user value we all
know, the second one will be the node to place this futex on.

  struct futex_numa_32 {
	u32 val;
	u32 node;
  };

When node is set to ~0, WAIT will set it to the current node_id such
that WAKE knows where to find it. If userspace corrupts the node value
between WAIT and WAKE, the futex will not be found and no wakeup will
happen.

When FUTEX2_NUMA is not set, the node is simply an extention of the
hash, such that traditional futexes are still interleaved over the
nodes.

This is done to avoid having to have a separate !numa hash-table.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/futex.h      |    3 +
 include/uapi/linux/futex.h |    8 ++
 kernel/futex/core.c        |  128 ++++++++++++++++++++++++++++++++++++---------
 kernel/futex/futex.h       |   17 +++++
 4 files changed, 131 insertions(+), 25 deletions(-)

--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -34,6 +34,7 @@ union futex_key {
 		u64 i_seq;
 		unsigned long pgoff;
 		unsigned int offset;
+		/* unsigned int node; */
 	} shared;
 	struct {
 		union {
@@ -42,11 +43,13 @@ union futex_key {
 		};
 		unsigned long address;
 		unsigned int offset;
+		/* unsigned int node; */
 	} private;
 	struct {
 		u64 ptr;
 		unsigned long word;
 		unsigned int offset;
+		unsigned int node;	/* NOT hashed! */
 	} both;
 };
 
--- a/include/uapi/linux/futex.h
+++ b/include/uapi/linux/futex.h
@@ -74,6 +74,14 @@
 /* do not use */
 #define FUTEX_32		FUTEX2_SIZE_U32 /* historical accident :-( */
 
+
+/*
+ * When FUTEX2_NUMA doubles the futex word, the second word is a node value.
+ * The special value -1 indicates no-node. This is the same value as
+ * NUMA_NO_NODE, except that value is not ABI, this is.
+ */
+#define FUTEX_NO_NODE		(-1)
+
 /*
  * Max numbers of elements in a futex_waitv array
  */
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -36,7 +36,8 @@
 #include <linux/pagemap.h>
 #include <linux/debugfs.h>
 #include <linux/plist.h>
-#include <linux/memblock.h>
+#include <linux/gfp.h>
+#include <linux/vmalloc.h>
 #include <linux/fault-inject.h>
 #include <linux/slab.h>
 
@@ -49,12 +50,14 @@
  * reside in the same cacheline.
  */
 static struct {
-	struct futex_hash_bucket *queues;
 	unsigned long            hashsize;
+	unsigned int		 hashshift;
+	struct futex_hash_bucket *queues[MAX_NUMNODES];
 } __futex_data __read_mostly __aligned(2*sizeof(long));
-#define futex_queues   (__futex_data.queues)
-#define futex_hashsize (__futex_data.hashsize)
 
+#define futex_hashsize	(__futex_data.hashsize)
+#define futex_hashshift	(__futex_data.hashshift)
+#define futex_queues	(__futex_data.queues)
 
 /*
  * Fault injections for futexes.
@@ -107,6 +110,26 @@ late_initcall(fail_futex_debugfs);
 
 #endif /* CONFIG_FAIL_FUTEX */
 
+static int futex_get_value(u32 *val, u32 __user *from, unsigned int flags)
+{
+	switch (futex_size(flags)) {
+	case 1: return __get_user(*val, (u8 __user *)from);
+	case 2: return __get_user(*val, (u16 __user *)from);
+	case 4: return __get_user(*val, (u32 __user *)from);
+	default: BUG();
+	}
+}
+
+static int futex_put_value(u32 val, u32 __user *to, unsigned int flags)
+{
+	switch (futex_size(flags)) {
+	case 1: return __put_user(val, (u8 __user *)to);
+	case 2: return __put_user(val, (u16 __user *)to);
+	case 4: return __put_user(val, (u32 __user *)to);
+	default: BUG();
+	}
+}
+
 /**
  * futex_hash - Return the hash bucket in the global hash
  * @key:	Pointer to the futex key for which the hash is calculated
@@ -116,10 +139,29 @@ late_initcall(fail_futex_debugfs);
  */
 struct futex_hash_bucket *futex_hash(union futex_key *key)
 {
-	u32 hash = jhash2((u32 *)key, offsetof(typeof(*key), both.offset) / 4,
+	u32 hash = jhash2((u32 *)key,
+			  offsetof(typeof(*key), both.offset) / sizeof(u32),
 			  key->both.offset);
+	int node = key->both.node;
 
-	return &futex_queues[hash & (futex_hashsize - 1)];
+	if (node == FUTEX_NO_NODE) {
+		/*
+		 * In case of !FLAGS_NUMA, use some unused hash bits to pick a
+		 * node -- this ensures regular futexes are interleaved across
+		 * the nodes and avoids having to allocate multiple
+		 * hash-tables.
+		 *
+		 * NOTE: this isn't perfectly uniform, but it is fast and
+		 * handles sparse node masks.
+		 */
+		node = (hash >> futex_hashshift) % nr_node_ids;
+		if (!node_possible(node)) {
+			node = find_next_bit_wrap(node_possible_map.bits,
+						  nr_node_ids, node);
+		}
+	}
+
+	return &futex_queues[node][hash & (futex_hashsize - 1)];
 }
 
 
@@ -219,7 +261,7 @@ static u64 get_inode_sequence_number(str
  *
  * lock_page() might sleep, the caller should not hold a spinlock.
  */
-int get_futex_key(u32 __user *uaddr, unsigned int flags, union futex_key *key,
+int get_futex_key(void __user *uaddr, unsigned int flags, union futex_key *key,
 		  enum futex_access rw)
 {
 	unsigned long address = (unsigned long)uaddr;
@@ -227,25 +269,49 @@ int get_futex_key(u32 __user *uaddr, uns
 	struct page *page;
 	struct folio *folio;
 	struct address_space *mapping;
-	int err, ro = 0;
+	int node, err, size, ro = 0;
 	bool fshared;
 
 	fshared = flags & FLAGS_SHARED;
+	size = futex_size(flags);
+	if (flags & FLAGS_NUMA)
+		size *= 2;
 
 	/*
 	 * The futex address must be "naturally" aligned.
 	 */
 	key->both.offset = address % PAGE_SIZE;
-	if (unlikely((address % sizeof(u32)) != 0))
+	if (unlikely((address % size) != 0))
 		return -EINVAL;
 	address -= key->both.offset;
 
-	if (unlikely(!access_ok(uaddr, sizeof(u32))))
+	if (unlikely(!access_ok(uaddr, size)))
 		return -EFAULT;
 
 	if (unlikely(should_fail_futex(fshared)))
 		return -EFAULT;
 
+	if (flags & FLAGS_NUMA) {
+		void __user *naddr = uaddr + size / 2;
+
+		if (futex_get_value(&node, naddr, flags))
+			return -EFAULT;
+
+		if (node == FUTEX_NO_NODE) {
+			node = numa_node_id();
+			if (futex_put_value(node, naddr, flags))
+				return -EFAULT;
+
+		} else if (node >= MAX_NUMNODES || !node_possible(node)) {
+			return -EINVAL;
+		}
+
+		key->both.node = node;
+
+	} else {
+		key->both.node = FUTEX_NO_NODE;
+	}
+
 	/*
 	 * PROCESS_PRIVATE futexes are fast.
 	 * As the mm cannot disappear under us and the 'key' only needs
@@ -1148,26 +1214,42 @@ void futex_exit_release(struct task_stru
 
 static int __init futex_init(void)
 {
-	unsigned int futex_shift;
-	unsigned long i;
+	unsigned int order, n;
+	unsigned long size, i;
 
 #ifdef CONFIG_BASE_SMALL
 	futex_hashsize = 16;
 #else
-	futex_hashsize = roundup_pow_of_two(256 * num_possible_cpus());
+	futex_hashsize = 256 * num_possible_cpus();
+	futex_hashsize /= num_possible_nodes();
+	futex_hashsize = roundup_pow_of_two(futex_hashsize);
 #endif
+	futex_hashshift = ilog2(futex_hashsize);
+	size = sizeof(struct futex_hash_bucket) * futex_hashsize;
+	order = get_order(size);
+
+	for_each_node(n) {
+		struct futex_hash_bucket *table;
+
+		if (order > MAX_ORDER)
+			table = vmalloc_huge_node(size, GFP_KERNEL, n);
+		else
+			table = alloc_pages_exact_nid(n, size, GFP_KERNEL);
+
+		BUG_ON(!table);
+
+		for (i = 0; i < futex_hashsize; i++) {
+			atomic_set(&table[i].waiters, 0);
+			spin_lock_init(&table[i].lock);
+			plist_head_init(&table[i].chain);
+		}
 
-	futex_queues = alloc_large_system_hash("futex", sizeof(*futex_queues),
-					       futex_hashsize, 0, 0,
-					       &futex_shift, NULL,
-					       futex_hashsize, futex_hashsize);
-	futex_hashsize = 1UL << futex_shift;
-
-	for (i = 0; i < futex_hashsize; i++) {
-		atomic_set(&futex_queues[i].waiters, 0);
-		plist_head_init(&futex_queues[i].chain);
-		spin_lock_init(&futex_queues[i].lock);
+		futex_queues[n] = table;
 	}
+	pr_info("futex hash table, %d nodes, %ld entries (order: %d, %lu bytes)\n",
+		num_possible_nodes(),
+		futex_hashsize, order,
+		sizeof(struct futex_hash_bucket) * futex_hashsize);
 
 	return 0;
 }
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -52,7 +52,7 @@ static inline unsigned int futex_to_flag
 	return flags;
 }
 
-#define FUTEX2_VALID_MASK (FUTEX2_SIZE_MASK | FUTEX2_PRIVATE)
+#define FUTEX2_VALID_MASK (FUTEX2_SIZE_MASK | FUTEX2_NUMA | FUTEX2_PRIVATE)
 
 /* FUTEX2_ to FLAGS_ */
 static inline unsigned int futex2_to_flags(unsigned int flags2)
@@ -85,6 +85,19 @@ static inline bool futex_flags_valid(uns
 	if ((flags & FLAGS_SIZE_MASK) != FLAGS_SIZE_32)
 		return false;
 
+	/*
+	 * Must be able to represent both FUTEX_NO_NODE and every valid nodeid
+	 * in a futex word.
+	 */
+	if (flags & FLAGS_NUMA) {
+		int bits = 8 * futex_size(flags);
+		u64 max = ~0ULL;
+
+		max >>= 64 - bits;
+		if (nr_node_ids >= max)
+			return false;
+	}
+
 	return true;
 }
 
@@ -193,7 +206,7 @@ enum futex_access {
 	FUTEX_WRITE
 };
 
-extern int get_futex_key(u32 __user *uaddr, unsigned int flags, union futex_key *key,
+extern int get_futex_key(void __user *uaddr, unsigned int flags, union futex_key *key,
 			 enum futex_access rw);
 
 extern struct hrtimer_sleeper *



^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 3/6] futex: Propagate flags into futex_get_value_locked()
  2024-10-25  9:03 [PATCH 0/6] futex: The remaining futex2 bits Peter Zijlstra
  2024-10-25  9:03 ` [PATCH 1/6] mm: Add vmalloc_huge_node() Peter Zijlstra
  2024-10-25  9:03 ` [PATCH 2/6] futex: Implement FUTEX2_NUMA Peter Zijlstra
@ 2024-10-25  9:03 ` Peter Zijlstra
  2024-10-25  9:03 ` [PATCH 4/6] futex: Enable FUTEX2_{8,16} Peter Zijlstra
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2024-10-25  9:03 UTC (permalink / raw)
  To: tglx
  Cc: linux-kernel, peterz, mingo, dvhart, dave, andrealmeid,
	Andrew Morton, urezki, hch, lstoakes, Arnd Bergmann, linux-api,
	linux-mm, linux-arch, malteskarupke, cl, llong

In order to facilitate variable sized futexes propagate the flags into
futex_get_value_locked().

No functional change intended.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/futex/core.c     |    4 ++--
 kernel/futex/futex.h    |    2 +-
 kernel/futex/pi.c       |    8 ++++----
 kernel/futex/requeue.c  |    4 ++--
 kernel/futex/waitwake.c |    4 ++--
 5 files changed, 11 insertions(+), 11 deletions(-)

--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -528,12 +528,12 @@ int futex_cmpxchg_value_locked(u32 *curv
 	return ret;
 }
 
-int futex_get_value_locked(u32 *dest, u32 __user *from)
+int futex_get_value_locked(u32 *dest, u32 __user *from, unsigned int flags)
 {
 	int ret;
 
 	pagefault_disable();
-	ret = __get_user(*dest, from);
+	ret = futex_get_value(dest, from, flags);
 	pagefault_enable();
 
 	return ret ? -EFAULT : 0;
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -239,7 +239,7 @@ extern void futex_wake_mark(struct wake_
 
 extern int fault_in_user_writeable(u32 __user *uaddr);
 extern int futex_cmpxchg_value_locked(u32 *curval, u32 __user *uaddr, u32 uval, u32 newval);
-extern int futex_get_value_locked(u32 *dest, u32 __user *from);
+extern int futex_get_value_locked(u32 *dest, u32 __user *from, unsigned int flags);
 extern struct futex_q *futex_top_waiter(struct futex_hash_bucket *hb, union futex_key *key);
 
 extern void __futex_unqueue(struct futex_q *q);
--- a/kernel/futex/pi.c
+++ b/kernel/futex/pi.c
@@ -240,7 +240,7 @@ static int attach_to_pi_state(u32 __user
 	 * still is what we expect it to be, otherwise retry the entire
 	 * operation.
 	 */
-	if (futex_get_value_locked(&uval2, uaddr))
+	if (futex_get_value_locked(&uval2, uaddr, FLAGS_SIZE_32))
 		goto out_efault;
 
 	if (uval != uval2)
@@ -359,7 +359,7 @@ static int handle_exit_race(u32 __user *
 	 * The same logic applies to the case where the exiting task is
 	 * already gone.
 	 */
-	if (futex_get_value_locked(&uval2, uaddr))
+	if (futex_get_value_locked(&uval2, uaddr, FLAGS_SIZE_32))
 		return -EFAULT;
 
 	/* If the user space value has changed, try again. */
@@ -527,7 +527,7 @@ int futex_lock_pi_atomic(u32 __user *uad
 	 * Read the user space value first so we can validate a few
 	 * things before proceeding further.
 	 */
-	if (futex_get_value_locked(&uval, uaddr))
+	if (futex_get_value_locked(&uval, uaddr, FLAGS_SIZE_32))
 		return -EFAULT;
 
 	if (unlikely(should_fail_futex(true)))
@@ -750,7 +750,7 @@ static int __fixup_pi_state_owner(u32 __
 	if (!pi_state->owner)
 		newtid |= FUTEX_OWNER_DIED;
 
-	err = futex_get_value_locked(&uval, uaddr);
+	err = futex_get_value_locked(&uval, uaddr, FLAGS_SIZE_32);
 	if (err)
 		goto handle_err;
 
--- a/kernel/futex/requeue.c
+++ b/kernel/futex/requeue.c
@@ -275,7 +275,7 @@ futex_proxy_trylock_atomic(u32 __user *p
 	u32 curval;
 	int ret;
 
-	if (futex_get_value_locked(&curval, pifutex))
+	if (futex_get_value_locked(&curval, pifutex, FLAGS_SIZE_32))
 		return -EFAULT;
 
 	if (unlikely(should_fail_futex(true)))
@@ -453,7 +453,7 @@ int futex_requeue(u32 __user *uaddr1, un
 	if (likely(cmpval != NULL)) {
 		u32 curval;
 
-		ret = futex_get_value_locked(&curval, uaddr1);
+		ret = futex_get_value_locked(&curval, uaddr1, FLAGS_SIZE_32);
 
 		if (unlikely(ret)) {
 			double_unlock_hb(hb1, hb2);
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -453,7 +453,7 @@ int futex_wait_multiple_setup(struct fut
 		u32 val = vs[i].w.val;
 
 		hb = futex_q_lock(q);
-		ret = futex_get_value_locked(&uval, uaddr);
+		ret = futex_get_value_locked(&uval, uaddr, FLAGS_SIZE_32);
 
 		if (!ret && uval == val) {
 			/*
@@ -621,7 +621,7 @@ int futex_wait_setup(u32 __user *uaddr,
 retry_private:
 	*hb = futex_q_lock(q);
 
-	ret = futex_get_value_locked(&uval, uaddr);
+	ret = futex_get_value_locked(&uval, uaddr, FLAGS_SIZE_32);
 
 	if (ret) {
 		futex_q_unlock(*hb);



^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 4/6] futex: Enable FUTEX2_{8,16}
  2024-10-25  9:03 [PATCH 0/6] futex: The remaining futex2 bits Peter Zijlstra
                   ` (2 preceding siblings ...)
  2024-10-25  9:03 ` [PATCH 3/6] futex: Propagate flags into futex_get_value_locked() Peter Zijlstra
@ 2024-10-25  9:03 ` Peter Zijlstra
  2024-10-25  9:03 ` [PATCH 5/6] futex,selftests: Extend the futex selftests Peter Zijlstra
  2024-10-25  9:03 ` [PATCH 6/6] futex,selftests: Extend the futex selftests for NUMA Peter Zijlstra
  5 siblings, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2024-10-25  9:03 UTC (permalink / raw)
  To: tglx
  Cc: linux-kernel, peterz, mingo, dvhart, dave, andrealmeid,
	Andrew Morton, urezki, hch, lstoakes, Arnd Bergmann, linux-api,
	linux-mm, linux-arch, malteskarupke, cl, llong

When futexes are no longer u32 aligned, the lower offset bits are no
longer available to put type info in. However, since offset is the
offset within a page, there are plenty bits available on the top end.

After that, pass flags into futex_get_value_locked() for WAIT and
disallow FUTEX2_SIZE_U64 instead of mandating FUTEX2_SIZE_U32.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/futex.h   |   11 ++++++-----
 kernel/futex/core.c     |    9 +++++++++
 kernel/futex/futex.h    |    4 ++--
 kernel/futex/waitwake.c |    5 +++--
 4 files changed, 20 insertions(+), 9 deletions(-)

--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -16,18 +16,19 @@ struct task_struct;
  * The key type depends on whether it's a shared or private mapping.
  * Don't rearrange members without looking at hash_futex().
  *
- * offset is aligned to a multiple of sizeof(u32) (== 4) by definition.
- * We use the two low order bits of offset to tell what is the kind of key :
+ * offset is the position within a page and is in the range [0, PAGE_SIZE).
+ * The high bits of the offset indicate what kind of key this is:
  *  00 : Private process futex (PTHREAD_PROCESS_PRIVATE)
  *       (no reference on an inode or mm)
  *  01 : Shared futex (PTHREAD_PROCESS_SHARED)
  *	mapped on a file (reference on the underlying inode)
  *  10 : Shared futex (PTHREAD_PROCESS_SHARED)
  *       (but private mapping on an mm, and reference taken on it)
-*/
+ */
 
-#define FUT_OFF_INODE    1 /* We set bit 0 if key has a reference on inode */
-#define FUT_OFF_MMSHARED 2 /* We set bit 1 if key has a reference on mm */
+#define FUT_OFF_INODE    (PAGE_SIZE << 0)
+#define FUT_OFF_MMSHARED (PAGE_SIZE << 1)
+#define FUT_OFF_SIZE	 (PAGE_SIZE << 2)
 
 union futex_key {
 	struct {
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -313,6 +313,15 @@ int get_futex_key(void __user *uaddr, un
 	}
 
 	/*
+	 * Encode the futex size in the offset. This makes cross-size
+	 * wake-wait fail -- see futex_match().
+	 *
+	 * NOTE that cross-size wake-wait is fundamentally broken wrt
+	 * FLAGS_NUMA.
+	 */
+	key->both.offset |= FUT_OFF_SIZE * (flags & FLAGS_SIZE_MASK);
+
+	/*
 	 * PROCESS_PRIVATE futexes are fast.
 	 * As the mm cannot disappear under us and the 'key' only needs
 	 * virtual address, we dont even have to find the underlying vma.
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -81,8 +81,8 @@ static inline bool futex_flags_valid(uns
 			return false;
 	}
 
-	/* Only 32bit futexes are implemented -- for now */
-	if ((flags & FLAGS_SIZE_MASK) != FLAGS_SIZE_32)
+	/* 64bit futexes aren't implemented -- yet */
+	if ((flags & FLAGS_SIZE_MASK) == FLAGS_SIZE_64)
 		return false;
 
 	/*
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -449,11 +449,12 @@ int futex_wait_multiple_setup(struct fut
 
 	for (i = 0; i < count; i++) {
 		u32 __user *uaddr = (u32 __user *)(unsigned long)vs[i].w.uaddr;
+		unsigned int flags = vs[i].w.flags;
 		struct futex_q *q = &vs[i].q;
 		u32 val = vs[i].w.val;
 
 		hb = futex_q_lock(q);
-		ret = futex_get_value_locked(&uval, uaddr, FLAGS_SIZE_32);
+		ret = futex_get_value_locked(&uval, uaddr, flags);
 
 		if (!ret && uval == val) {
 			/*
@@ -621,7 +622,7 @@ int futex_wait_setup(u32 __user *uaddr,
 retry_private:
 	*hb = futex_q_lock(q);
 
-	ret = futex_get_value_locked(&uval, uaddr, FLAGS_SIZE_32);
+	ret = futex_get_value_locked(&uval, uaddr, flags);
 
 	if (ret) {
 		futex_q_unlock(*hb);



^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 5/6] futex,selftests: Extend the futex selftests
  2024-10-25  9:03 [PATCH 0/6] futex: The remaining futex2 bits Peter Zijlstra
                   ` (3 preceding siblings ...)
  2024-10-25  9:03 ` [PATCH 4/6] futex: Enable FUTEX2_{8,16} Peter Zijlstra
@ 2024-10-25  9:03 ` Peter Zijlstra
  2024-10-25  9:03 ` [PATCH 6/6] futex,selftests: Extend the futex selftests for NUMA Peter Zijlstra
  5 siblings, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2024-10-25  9:03 UTC (permalink / raw)
  To: tglx
  Cc: linux-kernel, peterz, mingo, dvhart, dave, andrealmeid,
	Andrew Morton, urezki, hch, lstoakes, Arnd Bergmann, linux-api,
	linux-mm, linux-arch, malteskarupke, cl, llong

Extend the wait/requeue selftests to also cover the futex2 syscalls.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 tools/testing/selftests/futex/functional/futex_requeue.c         |  100 +++++++++-
 tools/testing/selftests/futex/functional/futex_wait.c            |   56 ++++-
 tools/testing/selftests/futex/functional/futex_wait_timeout.c    |   16 +
 tools/testing/selftests/futex/functional/futex_wait_wouldblock.c |   28 ++
 tools/testing/selftests/futex/functional/futex_waitv.c           |   15 -
 tools/testing/selftests/futex/functional/run.sh                  |    6 
 tools/testing/selftests/futex/include/futex2test.h               |   52 +++++
 7 files changed, 243 insertions(+), 30 deletions(-)

--- a/tools/testing/selftests/futex/functional/futex_requeue.c
+++ b/tools/testing/selftests/futex/functional/futex_requeue.c
@@ -7,8 +7,10 @@
 
 #include <pthread.h>
 #include <limits.h>
+#include <stdbool.h>
 #include "logging.h"
 #include "futextest.h"
+#include "futex2test.h"
 
 #define TEST_NAME "futex-requeue"
 #define timeout_ns  30000000
@@ -16,24 +18,58 @@
 
 volatile futex_t *f1;
 
+bool futex2 = 0;
+bool mixed = 0;
+
 void usage(char *prog)
 {
 	printf("Usage: %s\n", prog);
 	printf("  -c	Use color\n");
+	printf("  -n	Use futex2 interface\n");
+	printf("  -x	Use mixed size futex\n");
 	printf("  -h	Display this help message\n");
 	printf("  -v L	Verbosity level: %d=QUIET %d=CRITICAL %d=INFO\n",
 	       VQUIET, VCRITICAL, VINFO);
 }
 
-void *waiterfn(void *arg)
+static void *waiterfn(void *arg)
 {
+	unsigned int flags = 0;
 	struct timespec to;
 
-	to.tv_sec = 0;
-	to.tv_nsec = timeout_ns;
+	if (futex2) {
+		unsigned long mask;
+
+		if (clock_gettime(CLOCK_MONOTONIC, &to)) {
+			printf("clock_gettime() failed errno %d", errno);
+			return NULL;
+		}
+
+		to.tv_nsec += timeout_ns;
+		if (to.tv_nsec >= 1000000000) {
+			to.tv_sec++;
+			to.tv_nsec -= 1000000000;
+		}
+
+		if (mixed) {
+			flags |= FUTEX2_SIZE_U16;
+			mask = (unsigned short)(~0U);
+		} else {
+			flags |= FUTEX2_SIZE_U32;
+			mask = (unsigned int)(~0U);
+		}
+
+		if (futex2_wait(f1, *f1, mask, flags,
+				&to, CLOCK_MONOTONIC))
+			printf("waiter failed errno %d\n", errno);
+	} else {
+
+		to.tv_sec = 0;
+		to.tv_nsec = timeout_ns;
 
-	if (futex_wait(f1, *f1, &to, 0))
-		printf("waiter failed errno %d\n", errno);
+		if (futex_wait(f1, *f1, &to, flags))
+			printf("waiter failed errno %d\n", errno);
+	}
 
 	return NULL;
 }
@@ -48,7 +84,7 @@ int main(int argc, char *argv[])
 
 	f1 = &_f1;
 
-	while ((c = getopt(argc, argv, "cht:v:")) != -1) {
+	while ((c = getopt(argc, argv, "xncht:v:")) != -1) {
 		switch (c) {
 		case 'c':
 			log_color(1);
@@ -59,6 +95,12 @@ int main(int argc, char *argv[])
 		case 'v':
 			log_verbosity(atoi(optarg));
 			break;
+		case 'x':
+			mixed=1;
+			/* fallthrough */
+		case 'n':
+			futex2=1;
+			break;
 		default:
 			usage(basename(argv[0]));
 			exit(1);
@@ -79,7 +121,22 @@ int main(int argc, char *argv[])
 	usleep(WAKE_WAIT_US);
 
 	info("Requeuing 1 futex from f1 to f2\n");
-	res = futex_cmp_requeue(f1, 0, &f2, 0, 1, 0);
+	if (futex2) {
+		struct futex_waitv futexes[2] = {
+			{
+				.val = 0,
+				.uaddr = (unsigned long)f1,
+				.flags = mixed ? FUTEX2_SIZE_U16 : FUTEX2_SIZE_U32,
+			},
+			{
+				.uaddr = (unsigned long)&f2,
+				.flags = FUTEX2_SIZE_U32,
+			},
+		};
+		res = futex2_requeue(futexes, 0, 0, 1);
+	} else {
+		res = futex_cmp_requeue(f1, 0, &f2, 0, 1, 0);
+	}
 	if (res != 1) {
 		ksft_test_result_fail("futex_requeue simple returned: %d %s\n",
 				      res ? errno : res,
@@ -89,7 +146,11 @@ int main(int argc, char *argv[])
 
 
 	info("Waking 1 futex at f2\n");
-	res = futex_wake(&f2, 1, 0);
+	if (futex2) {
+		res = futex2_wake(&f2, ~0U, 1, FUTEX2_SIZE_U32);
+	} else {
+		res = futex_wake(&f2, 1, 0);
+	}
 	if (res != 1) {
 		ksft_test_result_fail("futex_requeue simple returned: %d %s\n",
 				      res ? errno : res,
@@ -112,7 +173,22 @@ int main(int argc, char *argv[])
 	usleep(WAKE_WAIT_US);
 
 	info("Waking 3 futexes at f1 and requeuing 7 futexes from f1 to f2\n");
-	res = futex_cmp_requeue(f1, 0, &f2, 3, 7, 0);
+	if (futex2) {
+		struct futex_waitv futexes[2] = {
+			{
+				.val = 0,
+				.uaddr = (unsigned long)f1,
+				.flags = mixed ? FUTEX2_SIZE_U16 : FUTEX2_SIZE_U32,
+			},
+			{
+				.uaddr = (unsigned long)&f2,
+				.flags = FUTEX2_SIZE_U32,
+			},
+		};
+		res = futex2_requeue(futexes, 0, 3, 7);
+	} else {
+		res = futex_cmp_requeue(f1, 0, &f2, 3, 7, 0);
+	}
 	if (res != 10) {
 		ksft_test_result_fail("futex_requeue many returned: %d %s\n",
 				      res ? errno : res,
@@ -121,7 +197,11 @@ int main(int argc, char *argv[])
 	}
 
 	info("Waking INT_MAX futexes at f2\n");
-	res = futex_wake(&f2, INT_MAX, 0);
+	if (futex2) {
+		res = futex2_wake(&f2, ~0U, INT_MAX, FUTEX2_SIZE_U32);
+	} else {
+		res = futex_wake(&f2, INT_MAX, 0);
+	}
 	if (res != 7) {
 		ksft_test_result_fail("futex_requeue many returned: %d %s\n",
 				      res ? errno : res,
--- a/tools/testing/selftests/futex/functional/futex_wait.c
+++ b/tools/testing/selftests/futex/functional/futex_wait.c
@@ -9,8 +9,10 @@
 #include <sys/shm.h>
 #include <sys/mman.h>
 #include <fcntl.h>
+#include <stdbool.h>
 #include "logging.h"
 #include "futextest.h"
+#include "futex2test.h"
 
 #define TEST_NAME "futex-wait"
 #define timeout_ns  30000000
@@ -19,10 +21,13 @@
 
 void *futex;
 
+bool futex2 = 0;
+
 void usage(char *prog)
 {
 	printf("Usage: %s\n", prog);
 	printf("  -c	Use color\n");
+	printf("  -n	Use futex2 interface\n");
 	printf("  -h	Display this help message\n");
 	printf("  -v L	Verbosity level: %d=QUIET %d=CRITICAL %d=INFO\n",
 	       VQUIET, VCRITICAL, VINFO);
@@ -30,17 +35,35 @@ void usage(char *prog)
 
 static void *waiterfn(void *arg)
 {
-	struct timespec to;
 	unsigned int flags = 0;
+	struct timespec to;
 
 	if (arg)
 		flags = *((unsigned int *) arg);
 
-	to.tv_sec = 0;
-	to.tv_nsec = timeout_ns;
+	if (futex2) {
+		if (clock_gettime(CLOCK_MONOTONIC, &to)) {
+			printf("clock_gettime() failed errno %d", errno);
+			return NULL;
+		}
 
-	if (futex_wait(futex, 0, &to, flags))
-		printf("waiter failed errno %d\n", errno);
+		to.tv_nsec += timeout_ns;
+		if (to.tv_nsec >= 1000000000) {
+			to.tv_sec++;
+			to.tv_nsec -= 1000000000;
+		}
+
+		if (futex2_wait(futex, 0, ~0U, flags | FUTEX2_SIZE_U32,
+				&to, CLOCK_MONOTONIC))
+			printf("waiter failed errno %d\n", errno);
+	} else {
+
+		to.tv_sec = 0;
+		to.tv_nsec = timeout_ns;
+
+		if (futex_wait(futex, 0, &to, flags))
+			printf("waiter failed errno %d\n", errno);
+	}
 
 	return NULL;
 }
@@ -55,7 +78,7 @@ int main(int argc, char *argv[])
 
 	futex = &f_private;
 
-	while ((c = getopt(argc, argv, "cht:v:")) != -1) {
+	while ((c = getopt(argc, argv, "ncht:v:")) != -1) {
 		switch (c) {
 		case 'c':
 			log_color(1);
@@ -66,6 +89,9 @@ int main(int argc, char *argv[])
 		case 'v':
 			log_verbosity(atoi(optarg));
 			break;
+		case 'n':
+			futex2=1;
+			break;
 		default:
 			usage(basename(argv[0]));
 			exit(1);
@@ -84,7 +110,11 @@ int main(int argc, char *argv[])
 	usleep(WAKE_WAIT_US);
 
 	info("Calling private futex_wake on futex: %p\n", futex);
-	res = futex_wake(futex, 1, FUTEX_PRIVATE_FLAG);
+	if (futex2) {
+		res = futex2_wake(futex, ~0U, 1, FUTEX2_SIZE_U32 | FUTEX2_PRIVATE);
+	} else {
+		res = futex_wake(futex, 1, FUTEX_PRIVATE_FLAG);
+	}
 	if (res != 1) {
 		ksft_test_result_fail("futex_wake private returned: %d %s\n",
 				      errno, strerror(errno));
@@ -112,7 +142,11 @@ int main(int argc, char *argv[])
 	usleep(WAKE_WAIT_US);
 
 	info("Calling shared (page anon) futex_wake on futex: %p\n", futex);
-	res = futex_wake(futex, 1, 0);
+	if (futex2) {
+		res = futex2_wake(futex, ~0U, 1, FUTEX2_SIZE_U32);
+	} else {
+		res = futex_wake(futex, 1, 0);
+	}
 	if (res != 1) {
 		ksft_test_result_fail("futex_wake shared (page anon) returned: %d %s\n",
 				      errno, strerror(errno));
@@ -151,7 +185,11 @@ int main(int argc, char *argv[])
 	usleep(WAKE_WAIT_US);
 
 	info("Calling shared (file backed) futex_wake on futex: %p\n", futex);
-	res = futex_wake(shm, 1, 0);
+	if (futex2) {
+		res = futex2_wake(shm, ~0U, 1, FUTEX2_SIZE_U32);
+	} else {
+		res = futex_wake(shm, 1, 0);
+	}
 	if (res != 1) {
 		ksft_test_result_fail("futex_wake shared (file backed) returned: %d %s\n",
 				      errno, strerror(errno));
--- a/tools/testing/selftests/futex/functional/futex_wait_timeout.c
+++ b/tools/testing/selftests/futex/functional/futex_wait_timeout.c
@@ -103,7 +103,7 @@ int main(int argc, char *argv[])
 	struct futex_waitv waitv = {
 			.uaddr = (uintptr_t)&f1,
 			.val = f1,
-			.flags = FUTEX_32,
+			.flags = FUTEX2_SIZE_U32,
 			.__reserved = 0
 		};
 
@@ -128,7 +128,7 @@ int main(int argc, char *argv[])
 	}
 
 	ksft_print_header();
-	ksft_set_plan(9);
+	ksft_set_plan(11);
 	ksft_print_msg("%s: Block on a futex and wait for timeout\n",
 	       basename(argv[0]));
 	ksft_print_msg("\tArguments: timeout=%ldns\n", timeout_ns);
@@ -201,6 +201,18 @@ int main(int argc, char *argv[])
 	res = futex_waitv(&waitv, 1, 0, &to, CLOCK_REALTIME);
 	test_timeout(res, &ret, "futex_waitv realtime", ETIMEDOUT);
 
+	/* futex2_wait with CLOCK_MONOTONIC */
+	if (futex_get_abs_timeout(CLOCK_MONOTONIC, &to, timeout_ns))
+		return RET_FAIL;
+	res = futex2_wait(&f1, f1, 1, FUTEX2_SIZE_U32, &to, CLOCK_MONOTONIC);
+	test_timeout(res, &ret, "futex2_wait monotonic", ETIMEDOUT);
+
+	/* futex2_wait with CLOCK_REALTIME */
+	if (futex_get_abs_timeout(CLOCK_REALTIME, &to, timeout_ns))
+		return RET_FAIL;
+	res = futex2_wait(&f1, f1, 1, FUTEX2_SIZE_U32, &to, CLOCK_REALTIME);
+	test_timeout(res, &ret, "futex2_wait realtime", ETIMEDOUT);
+
 	ksft_print_cnts();
 	return ret;
 }
--- a/tools/testing/selftests/futex/functional/futex_wait_wouldblock.c
+++ b/tools/testing/selftests/futex/functional/futex_wait_wouldblock.c
@@ -46,7 +46,7 @@ int main(int argc, char *argv[])
 	struct futex_waitv waitv = {
 			.uaddr = (uintptr_t)&f1,
 			.val = f1+1,
-			.flags = FUTEX_32,
+			.flags = FUTEX2_SIZE_U32 | FUTEX2_PRIVATE,
 			.__reserved = 0
 		};
 
@@ -68,7 +68,7 @@ int main(int argc, char *argv[])
 	}
 
 	ksft_print_header();
-	ksft_set_plan(2);
+	ksft_set_plan(3);
 	ksft_print_msg("%s: Test the unexpected futex value in FUTEX_WAIT\n",
 	       basename(argv[0]));
 
@@ -106,6 +106,30 @@ int main(int argc, char *argv[])
 		ksft_test_result_pass("futex_waitv\n");
 	}
 
+	if (clock_gettime(CLOCK_MONOTONIC, &to)) {
+		error("clock_gettime failed\n", errno);
+		return errno;
+	}
+
+	to.tv_nsec += timeout_ns;
+
+	if (to.tv_nsec >= 1000000000) {
+		to.tv_sec++;
+		to.tv_nsec -= 1000000000;
+	}
+
+	info("Calling futex2_wait on f1: %u @ %p with val=%u\n", f1, &f1, f1+1);
+	res = futex2_wait(&f1, f1+1, ~0U, FUTEX2_SIZE_U32 | FUTEX2_PRIVATE,
+			  &to, CLOCK_MONOTONIC);
+	if (!res || errno != EWOULDBLOCK) {
+		ksft_test_result_pass("futex2_wait returned: %d %s\n",
+				      res ? errno : res,
+				      res ? strerror(errno) : "");
+		ret = RET_FAIL;
+	} else {
+		ksft_test_result_pass("futex2_wait\n");
+	}
+
 	ksft_print_cnts();
 	return ret;
 }
--- a/tools/testing/selftests/futex/functional/futex_waitv.c
+++ b/tools/testing/selftests/futex/functional/futex_waitv.c
@@ -88,7 +88,7 @@ int main(int argc, char *argv[])
 
 	for (i = 0; i < NR_FUTEXES; i++) {
 		waitv[i].uaddr = (uintptr_t)&futexes[i];
-		waitv[i].flags = FUTEX_32 | FUTEX_PRIVATE_FLAG;
+		waitv[i].flags = FUTEX2_SIZE_U32 | FUTEX2_PRIVATE;
 		waitv[i].val = 0;
 		waitv[i].__reserved = 0;
 	}
@@ -99,7 +99,8 @@ int main(int argc, char *argv[])
 
 	usleep(WAKE_WAIT_US);
 
-	res = futex_wake(u64_to_ptr(waitv[NR_FUTEXES - 1].uaddr), 1, FUTEX_PRIVATE_FLAG);
+	res = futex2_wake(u64_to_ptr(waitv[NR_FUTEXES - 1].uaddr), ~0U, 1,
+			  FUTEX2_PRIVATE | FUTEX2_SIZE_U32);
 	if (res != 1) {
 		ksft_test_result_fail("futex_wake private returned: %d %s\n",
 				      res ? errno : res,
@@ -122,7 +123,7 @@ int main(int argc, char *argv[])
 
 		*shared_data = 0;
 		waitv[i].uaddr = (uintptr_t)shared_data;
-		waitv[i].flags = FUTEX_32;
+		waitv[i].flags = FUTEX2_SIZE_U32;
 		waitv[i].val = 0;
 		waitv[i].__reserved = 0;
 	}
@@ -145,8 +146,8 @@ int main(int argc, char *argv[])
 	for (i = 0; i < NR_FUTEXES; i++)
 		shmdt(u64_to_ptr(waitv[i].uaddr));
 
-	/* Testing a waiter without FUTEX_32 flag */
-	waitv[0].flags = FUTEX_PRIVATE_FLAG;
+	/* Testing a waiter without FUTEX2_SIZE_U32 flag */
+	waitv[0].flags = FUTEX2_PRIVATE;
 
 	if (clock_gettime(CLOCK_MONOTONIC, &to))
 		error("gettime64 failed\n", errno);
@@ -160,11 +161,11 @@ int main(int argc, char *argv[])
 				      res ? strerror(errno) : "");
 		ret = RET_FAIL;
 	} else {
-		ksft_test_result_pass("futex_waitv without FUTEX_32\n");
+		ksft_test_result_pass("futex_waitv without FUTEX2_SIZE_U32\n");
 	}
 
 	/* Testing a waiter with an unaligned address */
-	waitv[0].flags = FUTEX_PRIVATE_FLAG | FUTEX_32;
+	waitv[0].flags = FUTEX2_PRIVATE | FUTEX2_SIZE_U32;
 	waitv[0].uaddr = 1;
 
 	if (clock_gettime(CLOCK_MONOTONIC, &to))
--- a/tools/testing/selftests/futex/functional/run.sh
+++ b/tools/testing/selftests/futex/functional/run.sh
@@ -76,9 +76,15 @@ echo
 
 echo
 ./futex_wait $COLOR
+echo
+./futex_wait -n $COLOR
 
 echo
 ./futex_requeue $COLOR
+echo
+./futex_requeue -n $COLOR
+echo
+./futex_requeue -x $COLOR
 
 echo
 ./futex_waitv $COLOR
--- a/tools/testing/selftests/futex/include/futex2test.h
+++ b/tools/testing/selftests/futex/include/futex2test.h
@@ -8,6 +8,41 @@
 
 #define u64_to_ptr(x) ((void *)(uintptr_t)(x))
 
+#ifndef __NR_futex_waitv
+#define __NR_futex_waitv 449
+
+struct futex_waitv {
+	__u64 val;
+	__u64 uaddr;
+	__u32 flags;
+	__u32 __reserved;
+};
+#endif
+
+#ifndef __NR_futex_wake
+#define __NR_futex_wake 454
+#define __NR_futex_wait 455
+#define __NR_futex_requeue 456
+#endif
+
+#ifndef FUTEX2_SIZE_U8
+/*
+ * Flags for futex2 syscalls.
+ */
+#define FUTEX2_SIZE_U8		0x00
+#define FUTEX2_SIZE_U16		0x01
+#define FUTEX2_SIZE_U32		0x02
+#define FUTEX2_SIZE_U64		0x03
+#define FUTEX2_NUMA		0x04
+			/*	0x08 */
+			/*	0x10 */
+			/*	0x20 */
+			/*	0x40 */
+#define FUTEX2_PRIVATE		FUTEX_PRIVATE_FLAG
+#endif
+
+#define FUTEX_NO_NODE (-1)
+
 /**
  * futex_waitv - Wait at multiple futexes, wake on any
  * @waiters:    Array of waiters
@@ -20,3 +55,20 @@ static inline int futex_waitv(volatile s
 {
 	return syscall(__NR_futex_waitv, waiters, nr_waiters, flags, timo, clockid);
 }
+
+static inline int futex2_wake(volatile void *uaddr, unsigned long mask, int nr, unsigned int flags)
+{
+	return syscall(__NR_futex_wake, uaddr, mask, nr, flags);
+}
+
+static inline int futex2_wait(volatile void *uaddr, unsigned long val, unsigned long mask,
+			      unsigned int flags, struct timespec *timo, clockid_t clockid)
+{
+	return syscall(__NR_futex_wait, uaddr, val, mask, flags, timo, clockid);
+}
+
+static inline int futex2_requeue(struct futex_waitv *futexes, unsigned int flags,
+				 int nr_wake, int nr_requeue)
+{
+	return syscall(__NR_futex_requeue, futexes, flags, nr_wake, nr_requeue);
+}



^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 6/6] futex,selftests: Extend the futex selftests for NUMA
  2024-10-25  9:03 [PATCH 0/6] futex: The remaining futex2 bits Peter Zijlstra
                   ` (4 preceding siblings ...)
  2024-10-25  9:03 ` [PATCH 5/6] futex,selftests: Extend the futex selftests Peter Zijlstra
@ 2024-10-25  9:03 ` Peter Zijlstra
  5 siblings, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2024-10-25  9:03 UTC (permalink / raw)
  To: tglx
  Cc: linux-kernel, peterz, mingo, dvhart, dave, andrealmeid,
	Andrew Morton, urezki, hch, lstoakes, Arnd Bergmann, linux-api,
	linux-mm, linux-arch, malteskarupke, cl, llong

XXX

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 tools/testing/selftests/futex/functional/Makefile     |    3 
 tools/testing/selftests/futex/functional/futex_numa.c |  262 ++++++++++++++++++
 2 files changed, 264 insertions(+), 1 deletion(-)

--- a/tools/testing/selftests/futex/functional/Makefile
+++ b/tools/testing/selftests/futex/functional/Makefile
@@ -17,7 +17,8 @@ TEST_GEN_PROGS := \
 	futex_wait_private_mapped_file \
 	futex_wait \
 	futex_requeue \
-	futex_waitv
+	futex_waitv \
+	futex_numa
 
 TEST_PROGS := run.sh
 
--- /dev/null
+++ b/tools/testing/selftests/futex/functional/futex_numa.c
@@ -0,0 +1,262 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <pthread.h>
+#include <sys/shm.h>
+#include <sys/mman.h>
+#include <fcntl.h>
+#include <stdbool.h>
+#include <time.h>
+#include <assert.h>
+#include "logging.h"
+#include "futextest.h"
+#include "futex2test.h"
+
+typedef u_int32_t u32;
+typedef int32_t   s32;
+typedef u_int64_t u64;
+
+static int fflags = (FUTEX2_SIZE_U32 | FUTEX2_PRIVATE);
+static int fnode = FUTEX_NO_NODE;
+
+/* fairly stupid test-and-set lock with a waiter flag */
+
+#define N_LOCK		0x0000001
+#define N_WAITERS	0x0001000
+
+struct futex_numa_32 {
+	union {
+		u64 full;
+		struct {
+			u32 val;
+			u32 node;
+		};
+	};
+};
+
+void futex_numa_32_lock(struct futex_numa_32 *lock)
+{
+	for (;;) {
+		struct futex_numa_32 new, old = {
+			.full = __atomic_load_n(&lock->full, __ATOMIC_RELAXED),
+		};
+
+		for (;;) {
+			new = old;
+			if (old.val == 0) {
+				/* no waiter, no lock -> first lock, set no-node */
+				new.node = fnode;
+			}
+			if (old.val & N_LOCK) {
+				/* contention, set waiter */
+				new.val |= N_WAITERS;
+			}
+			new.val |= N_LOCK;
+
+			/* nothing changed, ready to block */
+			if (old.full == new.full)
+				break;
+
+			/*
+			 * Use u64 cmpxchg to set the futex value and node in a
+			 * consistent manner.
+			 */
+			if (__atomic_compare_exchange_n(&lock->full,
+							&old.full, new.full,
+							/* .weak */ false,
+							__ATOMIC_ACQUIRE,
+							__ATOMIC_RELAXED)) {
+
+				/* if we just set N_LOCK, we own it */
+				if (!(old.val & N_LOCK))
+					return;
+
+				/* go block */
+				break;
+			}
+		}
+
+		futex2_wait(lock, new.val, ~0U, fflags, NULL, 0);
+	}
+}
+
+void futex_numa_32_unlock(struct futex_numa_32 *lock)
+{
+	u32 val = __atomic_sub_fetch(&lock->val, N_LOCK, __ATOMIC_RELEASE);
+	assert((s32)val >= 0);
+	if (val & N_WAITERS) {
+		int woken = futex2_wake(lock, ~0U, 1, fflags);
+		assert(val == N_WAITERS);
+		if (!woken) {
+			__atomic_compare_exchange_n(&lock->val, &val, 0U,
+						    false, __ATOMIC_RELAXED,
+						    __ATOMIC_RELAXED);
+		}
+	}
+}
+
+static long nanos = 50000;
+
+struct thread_args {
+	pthread_t tid;
+	volatile int * done;
+	struct futex_numa_32 *lock;
+	int val;
+	int *val1, *val2;
+	int node;
+};
+
+static void *threadfn(void *_arg)
+{
+	struct thread_args *args = _arg;
+	struct timespec ts = {
+		.tv_nsec = nanos,
+	};
+	int node;
+
+	while (!*args->done) {
+
+		futex_numa_32_lock(args->lock);
+		args->val++;
+
+		assert(*args->val1 == *args->val2);
+		(*args->val1)++;
+		nanosleep(&ts, NULL);
+		(*args->val2)++;
+
+		node = args->lock->node;
+		futex_numa_32_unlock(args->lock);
+
+		if (node != args->node) {
+			args->node = node;
+			printf("node: %d\n", node);
+		}
+
+		nanosleep(&ts, NULL);
+	}
+
+	return NULL;
+}
+
+static void *contendfn(void *_arg)
+{
+	struct thread_args *args = _arg;
+
+	while (!*args->done) {
+		/*
+		 * futex2_wait() will take hb-lock, verify *var == val and
+		 * queue/abort.  By knowingly setting val 'wrong' this will
+		 * abort and thereby generate hb-lock contention.
+		 */
+		futex2_wait(&args->lock->val, ~0U, ~0U, fflags, NULL, 0);
+		args->val++;
+	}
+
+	return NULL;
+}
+
+static volatile int done = 0;
+static struct futex_numa_32 lock = { .val = 0, };
+static int val1, val2;
+
+int main(int argc, char *argv[])
+{
+	struct thread_args *tas[512], *cas[512];
+	int c, t, threads = 2, contenders = 0;
+	int sleeps = 10;
+	int total = 0;
+
+	while ((c = getopt(argc, argv, "c:t:s:n:N::")) != -1) {
+		switch (c) {
+		case 'c':
+			contenders = atoi(optarg);
+			break;
+		case 't':
+			threads = atoi(optarg);
+			break;
+		case 's':
+			sleeps = atoi(optarg);
+			break;
+		case 'n':
+			nanos = atoi(optarg);
+			break;
+		case 'N':
+			fflags |= FUTEX2_NUMA;
+			if (optarg)
+				fnode = atoi(optarg);
+			break;
+		default:
+			exit(1);
+			break;
+		}
+	}
+
+	for (t = 0; t < contenders; t++) {
+		struct thread_args *args = calloc(1, sizeof(*args));
+		if (!args) {
+			perror("thread_args");
+			exit(-1);
+		}
+
+		args->done = &done;
+		args->lock = &lock;
+		args->val1 = &val1;
+		args->val2 = &val2;
+		args->node = -1;
+
+		if (pthread_create(&args->tid, NULL, contendfn, args)) {
+			perror("pthread_create");
+			exit(-1);
+		}
+
+		cas[t] = args;
+	}
+
+	for (t = 0; t < threads; t++) {
+		struct thread_args *args = calloc(1, sizeof(*args));
+		if (!args) {
+			perror("thread_args");
+			exit(-1);
+		}
+
+		args->done = &done;
+		args->lock = &lock;
+		args->val1 = &val1;
+		args->val2 = &val2;
+		args->node = -1;
+
+		if (pthread_create(&args->tid, NULL, threadfn, args)) {
+			perror("pthread_create");
+			exit(-1);
+		}
+
+		tas[t] = args;
+	}
+
+	sleep(sleeps);
+
+	done = true;
+
+	for (t = 0; t < threads; t++) {
+		struct thread_args *args = tas[t];
+
+		pthread_join(args->tid, NULL);
+		total += args->val;
+//		printf("tval: %d\n", args->val);
+	}
+	printf("total: %d\n", total);
+
+	if (contenders) {
+		total = 0;
+		for (t = 0; t < contenders; t++) {
+			struct thread_args *args = cas[t];
+
+			pthread_join(args->tid, NULL);
+			total += args->val;
+			//		printf("tval: %d\n", args->val);
+		}
+		printf("contenders: %d\n", total);
+	}
+
+	return 0;
+}
+



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/6] mm: Add vmalloc_huge_node()
  2024-10-25  9:03 ` [PATCH 1/6] mm: Add vmalloc_huge_node() Peter Zijlstra
@ 2024-10-25  9:58   ` Uladzislau Rezki
  2024-10-25 16:00   ` Davidlohr Bueso
  1 sibling, 0 replies; 17+ messages in thread
From: Uladzislau Rezki @ 2024-10-25  9:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, linux-kernel, mingo, dvhart, dave, andrealmeid,
	Andrew Morton, urezki, hch, lstoakes, Arnd Bergmann, linux-api,
	linux-mm, linux-arch, malteskarupke, cl, llong, Christoph Hellwig

On Fri, Oct 25, 2024 at 11:03:48AM +0200, Peter Zijlstra wrote:
> To enable node specific hash-tables.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>  include/linux/vmalloc.h |    3 +++
>  mm/vmalloc.c            |    7 +++++++
>  2 files changed, 10 insertions(+)
> 
> --- a/include/linux/vmalloc.h
> +++ b/include/linux/vmalloc.h
> @@ -177,6 +177,9 @@ void *__vmalloc_node_noprof(unsigned lon
>  void *vmalloc_huge_noprof(unsigned long size, gfp_t gfp_mask) __alloc_size(1);
>  #define vmalloc_huge(...)	alloc_hooks(vmalloc_huge_noprof(__VA_ARGS__))
>  
> +void *vmalloc_huge_node_noprof(unsigned long size, gfp_t gfp_mask, int node) __alloc_size(1);
> +#define vmalloc_huge_node(...)	alloc_hooks(vmalloc_huge_node_noprof(__VA_ARGS__))
> +
>  extern void *__vmalloc_array_noprof(size_t n, size_t size, gfp_t flags) __alloc_size(1, 2);
>  #define __vmalloc_array(...)	alloc_hooks(__vmalloc_array_noprof(__VA_ARGS__))
>  
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3948,6 +3948,13 @@ void *vmalloc_huge_noprof(unsigned long
>  }
>  EXPORT_SYMBOL_GPL(vmalloc_huge_noprof);
>  
> +void *vmalloc_huge_node_noprof(unsigned long size, gfp_t gfp_mask, int node)
> +{
> +	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
> +				    gfp_mask, PAGE_KERNEL, VM_ALLOW_HUGE_VMAP,
> +				    node, __builtin_return_address(0));
> +}
> +
>  /**
>   * vzalloc - allocate virtually contiguous memory with zero fill
>   * @size:    allocation size
> 
> 
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/6] mm: Add vmalloc_huge_node()
  2024-10-25  9:03 ` [PATCH 1/6] mm: Add vmalloc_huge_node() Peter Zijlstra
  2024-10-25  9:58   ` Uladzislau Rezki
@ 2024-10-25 16:00   ` Davidlohr Bueso
  1 sibling, 0 replies; 17+ messages in thread
From: Davidlohr Bueso @ 2024-10-25 16:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, linux-kernel, mingo, dvhart, andrealmeid, Andrew Morton,
	urezki, hch, lstoakes, Arnd Bergmann, linux-api, linux-mm,
	linux-arch, malteskarupke, cl, llong, Christoph Hellwig

On Fri, 25 Oct 2024, Peter Zijlstra wrote:

>To enable node specific hash-tables.
>
>Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>Reviewed-by: Christoph Hellwig <hch@lst.de>

Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/6] futex: Implement FUTEX2_NUMA
  2024-10-25  9:03 ` [PATCH 2/6] futex: Implement FUTEX2_NUMA Peter Zijlstra
@ 2024-10-25 18:30   ` Davidlohr Bueso
  2024-10-28  9:38     ` Peter Zijlstra
  2024-10-25 19:28   ` Christoph Lameter (Ampere)
  2024-10-25 21:23   ` André Almeida
  2 siblings, 1 reply; 17+ messages in thread
From: Davidlohr Bueso @ 2024-10-25 18:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, linux-kernel, mingo, dvhart, andrealmeid, Andrew Morton,
	urezki, hch, lstoakes, Arnd Bergmann, linux-api, linux-mm,
	linux-arch, malteskarupke, cl, llong

On Fri, 25 Oct 2024, Peter Zijlstra wrote:\n

> static int __init futex_init(void)
> {
>-	unsigned int futex_shift;
>-	unsigned long i;
>+	unsigned int order, n;
>+	unsigned long size, i;
>
> #ifdef CONFIG_BASE_SMALL
> 	futex_hashsize = 16;
> #else
>-	futex_hashsize = roundup_pow_of_two(256 * num_possible_cpus());
>+	futex_hashsize = 256 * num_possible_cpus();
>+	futex_hashsize /= num_possible_nodes();
>+	futex_hashsize = roundup_pow_of_two(futex_hashsize);
> #endif
>+	futex_hashshift = ilog2(futex_hashsize);
>+	size = sizeof(struct futex_hash_bucket) * futex_hashsize;
>+	order = get_order(size);
>+
>+	for_each_node(n) {

Probably want to skip nodes that don't have CPUs, those will never
have the remote for .node value.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/6] futex: Implement FUTEX2_NUMA
  2024-10-25  9:03 ` [PATCH 2/6] futex: Implement FUTEX2_NUMA Peter Zijlstra
  2024-10-25 18:30   ` Davidlohr Bueso
@ 2024-10-25 19:28   ` Christoph Lameter (Ampere)
  2024-10-28  1:59     ` Davidlohr Bueso
  2024-10-28  9:46     ` Peter Zijlstra
  2024-10-25 21:23   ` André Almeida
  2 siblings, 2 replies; 17+ messages in thread
From: Christoph Lameter (Ampere) @ 2024-10-25 19:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, linux-kernel, mingo, dvhart, dave, andrealmeid,
	Andrew Morton, urezki, hch, lstoakes, Arnd Bergmann, linux-api,
	linux-mm, linux-arch, malteskarupke, llong

On Fri, 25 Oct 2024, Peter Zijlstra wrote:

> Extend the futex2 interface to be numa aware.
>
> When FUTEX2_NUMA is specified for a futex, the user value is extended
> to two words (of the same size). The first is the user value we all
> know, the second one will be the node to place this futex on.
>
>   struct futex_numa_32 {
> 	u32 val;
> 	u32 node;
>   };
>
> When node is set to ~0, WAIT will set it to the current node_id such
> that WAKE knows where to find it. If userspace corrupts the node value
> between WAIT and WAKE, the futex will not be found and no wakeup will
> happen.
>
> When FUTEX2_NUMA is not set, the node is simply an extention of the
> hash, such that traditional futexes are still interleaved over the
> nodes.

Would it be possible to follow the NUMA memory policy set up for a task
when making these decisions? We may not need a separate FUTEX2_NUMA
option. There are supportive functions in mm/mempolicy.c that will yield
a node for the futex logic to use.

See f.e. linux/include/uapi/mempolicy.h for the types of memory policy
that can be set for a task in current->mempolicy.

MPOL_DEFAULT	get local memory / use system default policy
MPOL_INTERLEAVE interleaved over nodes
MPOL_BIND	use the node specified in the task policy.
MPOL_LOCAL	get_local_memory

etc.

You will get a page or objects with the correct node by calling
alloc_pages() or kmalloc without GFP_THISNODE.

If you just need the node to use then use

mempolicy_slab_node()

and assign that to the node of the futex.

The function will determine which node to use depending on the active
memory policy.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/6] futex: Implement FUTEX2_NUMA
  2024-10-25  9:03 ` [PATCH 2/6] futex: Implement FUTEX2_NUMA Peter Zijlstra
  2024-10-25 18:30   ` Davidlohr Bueso
  2024-10-25 19:28   ` Christoph Lameter (Ampere)
@ 2024-10-25 21:23   ` André Almeida
  2 siblings, 0 replies; 17+ messages in thread
From: André Almeida @ 2024-10-25 21:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, dvhart, dave, Andrew Morton, urezki, hch,
	lstoakes, Arnd Bergmann, linux-api, linux-mm, linux-arch,
	malteskarupke, cl, llong, tglx

Hey Peter,

Em 25/10/2024 06:03, Peter Zijlstra escreveu:
> Extend the futex2 interface to be numa aware.
> 
> When FUTEX2_NUMA is specified for a futex, the user value is extended
> to two words (of the same size). The first is the user value we all
> know, the second one will be the node to place this futex on.
> 
>    struct futex_numa_32 {
> 	u32 val;
> 	u32 node;
>    };
> 

Maybe this should live at include/uapi/linux/futex.h.

> When node is set to ~0, WAIT will set it to the current node_id such
> that WAKE knows where to find it. If userspace corrupts the node value
> between WAIT and WAKE, the futex will not be found and no wakeup will
> happen.
> 
> When FUTEX2_NUMA is not set, the node is simply an extention of the
> hash, such that traditional futexes are still interleaved over the
> nodes.
> 
> This is done to avoid having to have a separate !numa hash-table.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Do you think some of those changes should be guarded with #ifdef 
CONFIG_NUMA? Or is fine as it is? I see that most of NUMA_ values 
defines to 1 anyway on !numa, but maybe the futex_init() and 
futex_hash() would be a bit more simplified.

[...]

>   
> +static int futex_get_value(u32 *val, u32 __user *from, unsigned int flags)
> +{
> +	switch (futex_size(flags)) {
> +	case 1: return __get_user(*val, (u8 __user *)from);
> +	case 2: return __get_user(*val, (u16 __user *)from);
> +	case 4: return __get_user(*val, (u32 __user *)from);
> +	default: BUG();
> +	}
> +}
> +
> +static int futex_put_value(u32 val, u32 __user *to, unsigned int flags)
> +{
> +	switch (futex_size(flags)) {
> +	case 1: return __put_user(val, (u8 __user *)to);
> +	case 2: return __put_user(val, (u16 __user *)to);
> +	case 4: return __put_user(val, (u32 __user *)to);
> +	default: BUG();
> +	}
> +}
> +

I found a bit confusing that this is here, shouldn't be at [PATCH 4/6] 
futex: Enable FUTEX2_{8,16}?


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/6] futex: Implement FUTEX2_NUMA
  2024-10-25 19:28   ` Christoph Lameter (Ampere)
@ 2024-10-28  1:59     ` Davidlohr Bueso
  2024-10-28 22:34       ` Christoph Lameter (Ampere)
  2024-10-28  9:46     ` Peter Zijlstra
  1 sibling, 1 reply; 17+ messages in thread
From: Davidlohr Bueso @ 2024-10-28  1:59 UTC (permalink / raw)
  To: Christoph Lameter (Ampere)
  Cc: Peter Zijlstra, tglx, linux-kernel, mingo, dvhart, andrealmeid,
	Andrew Morton, urezki, hch, lstoakes, Arnd Bergmann, linux-api,
	linux-mm, linux-arch, malteskarupke, llong

On Fri, 25 Oct 2024, Christoph Lameter (Ampere) wrote:\n

>Would it be possible to follow the NUMA memory policy set up for a task
>when making these decisions? We may not need a separate FUTEX2_NUMA
>option. There are supportive functions in mm/mempolicy.c that will yield
>a node for the futex logic to use.

With numa-awareness, when would lookups ever want to be anywhere but
local? mempolicy is about allocations, futexes are not that.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/6] futex: Implement FUTEX2_NUMA
  2024-10-25 18:30   ` Davidlohr Bueso
@ 2024-10-28  9:38     ` Peter Zijlstra
  0 siblings, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2024-10-28  9:38 UTC (permalink / raw)
  To: tglx, linux-kernel, mingo, dvhart, andrealmeid, Andrew Morton,
	urezki, hch, lstoakes, Arnd Bergmann, linux-api, linux-mm,
	linux-arch, malteskarupke, cl, llong

On Fri, Oct 25, 2024 at 11:30:26AM -0700, Davidlohr Bueso wrote:
> On Fri, 25 Oct 2024, Peter Zijlstra wrote:\n
> 
> > static int __init futex_init(void)
> > {
> > -	unsigned int futex_shift;
> > -	unsigned long i;
> > +	unsigned int order, n;
> > +	unsigned long size, i;
> > 
> > #ifdef CONFIG_BASE_SMALL
> > 	futex_hashsize = 16;
> > #else
> > -	futex_hashsize = roundup_pow_of_two(256 * num_possible_cpus());
> > +	futex_hashsize = 256 * num_possible_cpus();
> > +	futex_hashsize /= num_possible_nodes();
> > +	futex_hashsize = roundup_pow_of_two(futex_hashsize);
> > #endif
> > +	futex_hashshift = ilog2(futex_hashsize);
> > +	size = sizeof(struct futex_hash_bucket) * futex_hashsize;
> > +	order = get_order(size);
> > +
> > +	for_each_node(n) {
> 
> Probably want to skip nodes that don't have CPUs, those will never
> have the remote for .node value.

What if the CPU-less node is placed equidistant between two (or more)
regular nodes and it is the best location for a futex that is spanning
those nodes?

That is to say, just because it doesn't have CPUs, doesn't mean it is
never the right node.

Hmm?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/6] futex: Implement FUTEX2_NUMA
  2024-10-25 19:28   ` Christoph Lameter (Ampere)
  2024-10-28  1:59     ` Davidlohr Bueso
@ 2024-10-28  9:46     ` Peter Zijlstra
  2024-10-28 22:37       ` Christoph Lameter (Ampere)
  1 sibling, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2024-10-28  9:46 UTC (permalink / raw)
  To: Christoph Lameter (Ampere)
  Cc: tglx, linux-kernel, mingo, dvhart, dave, andrealmeid,
	Andrew Morton, urezki, hch, lstoakes, Arnd Bergmann, linux-api,
	linux-mm, linux-arch, malteskarupke, llong

On Fri, Oct 25, 2024 at 12:28:54PM -0700, Christoph Lameter (Ampere) wrote:
> On Fri, 25 Oct 2024, Peter Zijlstra wrote:
> 
> > Extend the futex2 interface to be numa aware.
> >
> > When FUTEX2_NUMA is specified for a futex, the user value is extended
> > to two words (of the same size). The first is the user value we all
> > know, the second one will be the node to place this futex on.
> >
> >   struct futex_numa_32 {
> > 	u32 val;
> > 	u32 node;
> >   };
> >
> > When node is set to ~0, WAIT will set it to the current node_id such
> > that WAKE knows where to find it. If userspace corrupts the node value
> > between WAIT and WAKE, the futex will not be found and no wakeup will
> > happen.
> >
> > When FUTEX2_NUMA is not set, the node is simply an extention of the
> > hash, such that traditional futexes are still interleaved over the
> > nodes.
> 
> 
> Would it be possible to follow the NUMA memory policy set up for a task
> when making these decisions? We may not need a separate FUTEX2_NUMA
> option. There are supportive functions in mm/mempolicy.c that will yield
> a node for the futex logic to use.

Using get_task_policy() seems very dangerous to me. It is explicitly
possible for different tasks in a process to have different policies,
which means (private) futexes would fail to work correctly.

We need something that is process wide consistent -- like the vma
policies. Except at current, those are to expensive to readily access.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/6] futex: Implement FUTEX2_NUMA
  2024-10-28  1:59     ` Davidlohr Bueso
@ 2024-10-28 22:34       ` Christoph Lameter (Ampere)
  0 siblings, 0 replies; 17+ messages in thread
From: Christoph Lameter (Ampere) @ 2024-10-28 22:34 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Peter Zijlstra, tglx, linux-kernel, mingo, dvhart, andrealmeid,
	Andrew Morton, urezki, hch, lstoakes, Arnd Bergmann, linux-api,
	linux-mm, linux-arch, malteskarupke, llong

On Sun, 27 Oct 2024, Davidlohr Bueso wrote:

> On Fri, 25 Oct 2024, Christoph Lameter (Ampere) wrote:\n
>
> > Would it be possible to follow the NUMA memory policy set up for a task
> > when making these decisions? We may not need a separate FUTEX2_NUMA
> > option. There are supportive functions in mm/mempolicy.c that will yield
> > a node for the futex logic to use.
>
> With numa-awareness, when would lookups ever want to be anywhere but
> local? mempolicy is about allocations, futexes are not that.

futexes use kernel metadata right? Those allocations are controlled by
the tasks memory policy.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/6] futex: Implement FUTEX2_NUMA
  2024-10-28  9:46     ` Peter Zijlstra
@ 2024-10-28 22:37       ` Christoph Lameter (Ampere)
  0 siblings, 0 replies; 17+ messages in thread
From: Christoph Lameter (Ampere) @ 2024-10-28 22:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, linux-kernel, mingo, dvhart, dave, andrealmeid,
	Andrew Morton, urezki, hch, lstoakes, Arnd Bergmann, linux-api,
	linux-mm, linux-arch, malteskarupke, llong

On Mon, 28 Oct 2024, Peter Zijlstra wrote:

> Using get_task_policy() seems very dangerous to me. It is explicitly
> possible for different tasks in a process to have different policies,
> which means (private) futexes would fail to work correctly.
>
> We need something that is process wide consistent -- like the vma
> policies. Except at current, those are to expensive to readily access.

The vma policies are bound to addresses that in turn yields address space
wide validity.

However, different threads may run on processes on different nodes and
therefore having different numa nodes close to them etc.


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2024-10-28 22:37 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-10-25  9:03 [PATCH 0/6] futex: The remaining futex2 bits Peter Zijlstra
2024-10-25  9:03 ` [PATCH 1/6] mm: Add vmalloc_huge_node() Peter Zijlstra
2024-10-25  9:58   ` Uladzislau Rezki
2024-10-25 16:00   ` Davidlohr Bueso
2024-10-25  9:03 ` [PATCH 2/6] futex: Implement FUTEX2_NUMA Peter Zijlstra
2024-10-25 18:30   ` Davidlohr Bueso
2024-10-28  9:38     ` Peter Zijlstra
2024-10-25 19:28   ` Christoph Lameter (Ampere)
2024-10-28  1:59     ` Davidlohr Bueso
2024-10-28 22:34       ` Christoph Lameter (Ampere)
2024-10-28  9:46     ` Peter Zijlstra
2024-10-28 22:37       ` Christoph Lameter (Ampere)
2024-10-25 21:23   ` André Almeida
2024-10-25  9:03 ` [PATCH 3/6] futex: Propagate flags into futex_get_value_locked() Peter Zijlstra
2024-10-25  9:03 ` [PATCH 4/6] futex: Enable FUTEX2_{8,16} Peter Zijlstra
2024-10-25  9:03 ` [PATCH 5/6] futex,selftests: Extend the futex selftests Peter Zijlstra
2024-10-25  9:03 ` [PATCH 6/6] futex,selftests: Extend the futex selftests for NUMA Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).