public inbox for linux-kernel@vger.kernel.org
* [RFC PATCH 0/5] sched: NUMA-aware concurrency IDs
@ 2024-08-19 14:24 Mathieu Desnoyers
  2024-08-19 14:24 ` [RFC PATCH 1/5] lib: Implement find_{first,next,nth}_notandnot_bit, find_first_andnot_bit Mathieu Desnoyers
                   ` (5 more replies)
  0 siblings, 6 replies; 13+ messages in thread
From: Mathieu Desnoyers @ 2024-08-19 14:24 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: linux-kernel, Mathieu Desnoyers, Valentin Schneider, Mel Gorman,
	Steven Rostedt, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Yury Norov, Rasmus Villemoes, Shuah Khan

The issue addressed by this series is the non-locality of NUMA accesses
to data structures indexed by concurrency IDs: for example, in a
scenario where a process has two threads, and they periodically run one
after the other on different NUMA nodes, each will be assigned mm_cid=0.
As a consequence, they will end up accessing the same pages, and thus at
least one of the threads will need to perform remote NUMA accesses,
which is inefficient.

Solve this by making the rseq concurrency ID (mm_cid) NUMA-aware. On
NUMA systems, when a NUMA-aware concurrency ID is observed by user-space
to be associated with a NUMA node, guarantee that it never changes NUMA
node unless either a kernel-level NUMA configuration change happens, or
scheduler migrations end up migrating tasks across NUMA nodes.

There is a tradeoff between NUMA locality and compactness of the
concurrency ID allocation. Favor compactness over NUMA locality when
the scheduler migrates tasks across NUMA nodes, as such migrations do
not cause the frequent remote NUMA accesses described above. This is
done by limiting the concurrency ID range to the minimum of the number
of threads belonging to the process and the number of allowed CPUs.
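[Editor's note: a hypothetical user-space sketch, not part of this series,
illustrating why mm_cid is used as a dense index into per-thread data and
why its range is bounded by min(nr_threads, nr_allowed_cpus). All names
below are made up for illustration.]

```c
#include <stdint.h>
#include <stdlib.h>

#define CACHELINE 64

/* One counter per concurrency ID, padded to avoid false sharing.
 * If two threads alternating across NUMA nodes both observe
 * mm_cid == 0, they touch the same slot, hence the remote accesses
 * this series avoids. */
struct percid_counter {
	int64_t count;
} __attribute__((aligned(CACHELINE)));

/* The cid range user-space must provision for is bounded by
 * min(nr_threads, nr_allowed_cpus). */
static inline unsigned int cid_range(unsigned int nr_threads,
				     unsigned int nr_allowed_cpus)
{
	return nr_threads < nr_allowed_cpus ? nr_threads : nr_allowed_cpus;
}

static struct percid_counter *alloc_percid(unsigned int nr_threads,
					   unsigned int nr_allowed_cpus)
{
	return calloc(cid_range(nr_threads, nr_allowed_cpus),
		      sizeof(struct percid_counter));
}
```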

This series applies on top of v6.10.3.

Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Yury Norov <yury.norov@gmail.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Shuah Khan <skhan@linuxfoundation.org>

Mathieu Desnoyers (5):
  lib: Implement find_{first,next,nth}_notandnot_bit,
    find_first_andnot_bit
  cpumask: Implement cpumask_{first,next}_{not,}andnot
  sched: NUMA-aware per-memory-map concurrency IDs
  selftests/rseq: x86: Implement rseq_load_u32_u32
  selftests/rseq: Implement NUMA node id vs mm_cid invariant test

 include/linux/cpumask.h                       |  60 ++++++++
 include/linux/find.h                          | 122 ++++++++++++++-
 include/linux/mm_types.h                      |  57 ++++++-
 kernel/sched/core.c                           |  10 +-
 kernel/sched/sched.h                          | 139 +++++++++++++++--
 lib/find_bit.c                                |  42 +++++
 tools/testing/selftests/rseq/.gitignore       |   1 +
 tools/testing/selftests/rseq/Makefile         |   2 +-
 .../testing/selftests/rseq/basic_numa_test.c  | 144 ++++++++++++++++++
 tools/testing/selftests/rseq/rseq-x86-bits.h  |  43 ++++++
 tools/testing/selftests/rseq/rseq.h           |  14 ++
 11 files changed, 613 insertions(+), 21 deletions(-)
 create mode 100644 tools/testing/selftests/rseq/basic_numa_test.c

-- 
2.39.2

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [RFC PATCH 1/5] lib: Implement find_{first,next,nth}_notandnot_bit, find_first_andnot_bit
  2024-08-19 14:24 [RFC PATCH 0/5] sched: NUMA-aware concurrency IDs Mathieu Desnoyers
@ 2024-08-19 14:24 ` Mathieu Desnoyers
  2024-08-19 19:19   ` Yury Norov
  2024-08-19 14:24 ` [RFC PATCH 2/5] cpumask: Implement cpumask_{first,next}_{not,}andnot Mathieu Desnoyers
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 13+ messages in thread
From: Mathieu Desnoyers @ 2024-08-19 14:24 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: linux-kernel, Mathieu Desnoyers, Valentin Schneider, Mel Gorman,
	Steven Rostedt, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Yury Norov, Rasmus Villemoes, Shuah Khan

Allow finding the first, next, or nth bit which is cleared in both of
two input bitmasks.

Also allow finding the first bit which is set in the first mask and
cleared in the second mask: find_next_andnot_bit and
find_nth_andnot_bit already exist, but the find-first variant is
missing.
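[Editor's note: a minimal single-word model of the semantics of the new
helpers, assuming 64-bit longs and GCC/Clang builtins; the kernel
implementations additionally handle multi-word bitmaps.]

```c
/* first_notandnot: lowest bit index cleared in both a and b,
 * restricted to the low `size` bits; returns size if none. */
static unsigned long first_notandnot(unsigned long a, unsigned long b,
				     unsigned long size)
{
	unsigned long mask = size >= 64 ? ~0UL : (1UL << size) - 1;
	unsigned long val = ~a & ~b & mask;

	return val ? (unsigned long)__builtin_ctzl(val) : size;
}

/* first_andnot: lowest bit index set in a and cleared in b. */
static unsigned long first_andnot(unsigned long a, unsigned long b,
				  unsigned long size)
{
	unsigned long mask = size >= 64 ? ~0UL : (1UL << size) - 1;
	unsigned long val = a & ~b & mask;

	return val ? (unsigned long)__builtin_ctzl(val) : size;
}
```

For example, with a = 0b0111 and b = 0b1001, bits 0-3 are set in at least
one mask, so the first bit cleared in both is bit 4.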

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Yury Norov <yury.norov@gmail.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
---
 include/linux/find.h | 122 +++++++++++++++++++++++++++++++++++++++++--
 lib/find_bit.c       |  42 +++++++++++++++
 2 files changed, 160 insertions(+), 4 deletions(-)

diff --git a/include/linux/find.h b/include/linux/find.h
index 5dfca4225fef..6b2377006b22 100644
--- a/include/linux/find.h
+++ b/include/linux/find.h
@@ -14,6 +14,8 @@ unsigned long _find_next_and_bit(const unsigned long *addr1, const unsigned long
 					unsigned long nbits, unsigned long start);
 unsigned long _find_next_andnot_bit(const unsigned long *addr1, const unsigned long *addr2,
 					unsigned long nbits, unsigned long start);
+unsigned long _find_next_notandnot_bit(const unsigned long *addr1, const unsigned long *addr2,
+					unsigned long nbits, unsigned long start);
 unsigned long _find_next_or_bit(const unsigned long *addr1, const unsigned long *addr2,
 					unsigned long nbits, unsigned long start);
 unsigned long _find_next_zero_bit(const unsigned long *addr, unsigned long nbits,
@@ -24,11 +26,17 @@ unsigned long __find_nth_and_bit(const unsigned long *addr1, const unsigned long
 				unsigned long size, unsigned long n);
 unsigned long __find_nth_andnot_bit(const unsigned long *addr1, const unsigned long *addr2,
 					unsigned long size, unsigned long n);
+unsigned long __find_nth_notandnot_bit(const unsigned long *addr1, const unsigned long *addr2,
+					unsigned long size, unsigned long n);
 unsigned long __find_nth_and_andnot_bit(const unsigned long *addr1, const unsigned long *addr2,
 					const unsigned long *addr3, unsigned long size,
 					unsigned long n);
 extern unsigned long _find_first_and_bit(const unsigned long *addr1,
 					 const unsigned long *addr2, unsigned long size);
+extern unsigned long _find_first_andnot_bit(const unsigned long *addr1,
+					 const unsigned long *addr2, unsigned long size);
+extern unsigned long _find_first_notandnot_bit(const unsigned long *addr1,
+					 const unsigned long *addr2, unsigned long size);
 unsigned long _find_first_and_and_bit(const unsigned long *addr1, const unsigned long *addr2,
 				      const unsigned long *addr3, unsigned long size);
 extern unsigned long _find_first_zero_bit(const unsigned long *addr, unsigned long size);
@@ -102,15 +110,14 @@ unsigned long find_next_and_bit(const unsigned long *addr1,
 
 #ifndef find_next_andnot_bit
 /**
- * find_next_andnot_bit - find the next set bit in *addr1 excluding all the bits
- *                        in *addr2
+ * find_next_andnot_bit - find the next set bit in *addr1, cleared in *addr2
  * @addr1: The first address to base the search on
  * @addr2: The second address to base the search on
  * @size: The bitmap size in bits
  * @offset: The bitnumber to start searching at
  *
- * Returns the bit number for the next set bit
- * If no bits are set, returns @size.
+ * Returns the bit number for the next bit set in *addr1, cleared in *addr2.
+ * If no such bits are found, returns @size.
  */
 static inline
 unsigned long find_next_andnot_bit(const unsigned long *addr1,
@@ -131,6 +138,36 @@ unsigned long find_next_andnot_bit(const unsigned long *addr1,
 }
 #endif
 
+#ifndef find_next_notandnot_bit
+/**
+ * find_next_notandnot_bit - find the next bit cleared in both *addr1 and *addr2
+ * @addr1: The first address to base the search on
+ * @addr2: The second address to base the search on
+ * @size: The bitmap size in bits
+ * @offset: The bitnumber to start searching at
+ *
+ * Returns the bit number for the next bit cleared in both *addr1 and *addr2.
+ * If no such bits are found, returns @size.
+ */
+static inline
+unsigned long find_next_notandnot_bit(const unsigned long *addr1,
+		const unsigned long *addr2, unsigned long size,
+		unsigned long offset)
+{
+	if (small_const_nbits(size)) {
+		unsigned long val;
+
+		if (unlikely(offset >= size))
+			return size;
+
+		val = (~*addr1) & (~*addr2) & GENMASK(size - 1, offset);
+		return val ? __ffs(val) : size;
+	}
+
+	return _find_next_notandnot_bit(addr1, addr2, size, offset);
+}
+#endif
+
 #ifndef find_next_or_bit
 /**
  * find_next_or_bit - find the next set bit in either memory regions
@@ -292,6 +329,32 @@ unsigned long find_nth_andnot_bit(const unsigned long *addr1, const unsigned lon
 	return __find_nth_andnot_bit(addr1, addr2, size, n);
 }
 
+/**
+ * find_nth_notandnot_bit - find the N'th bit cleared in both of 2 memory regions.
+ * @addr1: The 1st address to start the search at
+ * @addr2: The 2nd address to start the search at
+ * @size: The maximum number of bits to search
+ * @n: The index of the cleared bit to find, counting from 0
+ *
+ * Returns the bit number of the N'th bit cleared in the two regions.
+ * If no such bit exists, returns @size.
+ */
+static inline
+unsigned long find_nth_notandnot_bit(const unsigned long *addr1, const unsigned long *addr2,
+				unsigned long size, unsigned long n)
+{
+	if (n >= size)
+		return size;
+
+	if (small_const_nbits(size)) {
+		unsigned long val = (~*addr1) & (~*addr2) & GENMASK(size - 1, 0);
+
+		return val ? fns(val, n) : size;
+	}
+
+	return __find_nth_notandnot_bit(addr1, addr2, size, n);
+}
+
 /**
  * find_nth_and_andnot_bit - find N'th set bit in 2 memory regions,
  *			     excluding those set in 3rd region
@@ -347,6 +410,57 @@ unsigned long find_first_and_bit(const unsigned long *addr1,
 }
 #endif
 
+#ifndef find_first_andnot_bit
+/**
+ * find_first_andnot_bit - find the first bit set in *addr1,
+ *                         cleared in *addr2
+ * @addr1: The first address to base the search on
+ * @addr2: The second address to base the search on
+ * @size: The bitmap size in bits
+ *
+ * Returns the bit number of the first bit set in *addr1, cleared in *addr2.
+ * If no such bits are found, returns @size.
+ */
+static inline
+unsigned long find_first_andnot_bit(const unsigned long *addr1,
+				 const unsigned long *addr2,
+				 unsigned long size)
+{
+	if (small_const_nbits(size)) {
+		unsigned long val = *addr1 & (~*addr2) & GENMASK(size - 1, 0);
+
+		return val ? __ffs(val) : size;
+	}
+
+	return _find_first_andnot_bit(addr1, addr2, size);
+}
+#endif
+
+#ifndef find_first_notandnot_bit
+/**
+ * find_first_notandnot_bit - find the first cleared bit in 2 memory regions
+ * @addr1: The first address to base the search on
+ * @addr2: The second address to base the search on
+ * @size: The bitmap size in bits
+ *
+ * Returns the bit number of the first bit cleared in both *addr1 and *addr2.
+ * If no such bits are found, returns @size.
+ */
+static inline
+unsigned long find_first_notandnot_bit(const unsigned long *addr1,
+				 const unsigned long *addr2,
+				 unsigned long size)
+{
+	if (small_const_nbits(size)) {
+		unsigned long val = (~*addr1) & (~*addr2) & GENMASK(size - 1, 0);
+
+		return val ? __ffs(val) : size;
+	}
+
+	return _find_first_notandnot_bit(addr1, addr2, size);
+}
+#endif
+
 /**
  * find_first_and_and_bit - find the first set bit in 3 memory regions
  * @addr1: The first address to base the search on
diff --git a/lib/find_bit.c b/lib/find_bit.c
index 0836bb3d76c5..b4a3dd62a255 100644
--- a/lib/find_bit.c
+++ b/lib/find_bit.c
@@ -116,6 +116,32 @@ unsigned long _find_first_and_bit(const unsigned long *addr1,
 EXPORT_SYMBOL(_find_first_and_bit);
 #endif
 
+#ifndef find_first_andnot_bit
+/*
+ * Find the first bit set in the first region and cleared in the second.
+ */
+unsigned long _find_first_andnot_bit(const unsigned long *addr1,
+				  const unsigned long *addr2,
+				  unsigned long size)
+{
+	return FIND_FIRST_BIT(addr1[idx] & ~addr2[idx], /* nop */, size);
+}
+EXPORT_SYMBOL(_find_first_andnot_bit);
+#endif
+
+#ifndef find_first_notandnot_bit
+/*
+ * Find the first cleared bit in two memory regions.
+ */
+unsigned long _find_first_notandnot_bit(const unsigned long *addr1,
+				  const unsigned long *addr2,
+				  unsigned long size)
+{
+	return FIND_FIRST_BIT(~addr1[idx] & ~addr2[idx], /* nop */, size);
+}
+EXPORT_SYMBOL(_find_first_notandnot_bit);
+#endif
+
 /*
  * Find the first set bit in three memory regions.
  */
@@ -167,6 +193,13 @@ unsigned long __find_nth_andnot_bit(const unsigned long *addr1, const unsigned l
 }
 EXPORT_SYMBOL(__find_nth_andnot_bit);
 
+unsigned long __find_nth_notandnot_bit(const unsigned long *addr1, const unsigned long *addr2,
+				 unsigned long size, unsigned long n)
+{
+	return FIND_NTH_BIT(~addr1[idx] & ~addr2[idx], size, n);
+}
+EXPORT_SYMBOL(__find_nth_notandnot_bit);
+
 unsigned long __find_nth_and_andnot_bit(const unsigned long *addr1,
 					const unsigned long *addr2,
 					const unsigned long *addr3,
@@ -194,6 +227,15 @@ unsigned long _find_next_andnot_bit(const unsigned long *addr1, const unsigned l
 EXPORT_SYMBOL(_find_next_andnot_bit);
 #endif
 
+#ifndef find_next_notandnot_bit
+unsigned long _find_next_notandnot_bit(const unsigned long *addr1, const unsigned long *addr2,
+					unsigned long nbits, unsigned long start)
+{
+	return FIND_NEXT_BIT(~addr1[idx] & ~addr2[idx], /* nop */, nbits, start);
+}
+EXPORT_SYMBOL(_find_next_notandnot_bit);
+#endif
+
 #ifndef find_next_or_bit
 unsigned long _find_next_or_bit(const unsigned long *addr1, const unsigned long *addr2,
 					unsigned long nbits, unsigned long start)
-- 
2.39.2



* [RFC PATCH 2/5] cpumask: Implement cpumask_{first,next}_{not,}andnot
  2024-08-19 14:24 [RFC PATCH 0/5] sched: NUMA-aware concurrency IDs Mathieu Desnoyers
  2024-08-19 14:24 ` [RFC PATCH 1/5] lib: Implement find_{first,next,nth}_notandnot_bit, find_first_andnot_bit Mathieu Desnoyers
@ 2024-08-19 14:24 ` Mathieu Desnoyers
  2024-08-19 19:24   ` Yury Norov
  2024-08-19 14:24 ` [RFC PATCH 3/5] sched: NUMA-aware per-memory-map concurrency IDs Mathieu Desnoyers
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 13+ messages in thread
From: Mathieu Desnoyers @ 2024-08-19 14:24 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: linux-kernel, Mathieu Desnoyers, Valentin Schneider, Mel Gorman,
	Steven Rostedt, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Yury Norov, Rasmus Villemoes, Shuah Khan

Allow finding the first or next bit within two input cpumasks which is
either:

- cleared in both masks, or
- set in the first mask and cleared in the second mask.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Yury Norov <yury.norov@gmail.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
---
 include/linux/cpumask.h | 60 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 60 insertions(+)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 23686bed441d..57b7d99d6da1 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -204,6 +204,32 @@ unsigned int cpumask_first_and_and(const struct cpumask *srcp1,
 				      cpumask_bits(srcp3), small_cpumask_bits);
 }
 
+/**
+ * cpumask_first_andnot - return the first cpu from *srcp1 & ~*srcp2
+ * @srcp1: the first input
+ * @srcp2: the second input
+ *
+ * Returns >= nr_cpu_ids if no such cpu is found.
+ */
+static inline
+unsigned int cpumask_first_andnot(const struct cpumask *srcp1, const struct cpumask *srcp2)
+{
+	return find_first_andnot_bit(cpumask_bits(srcp1), cpumask_bits(srcp2), nr_cpumask_bits);
+}
+
+/**
+ * cpumask_first_notandnot - return the first cpu from ~*srcp1 & ~*srcp2
+ * @srcp1: the first input
+ * @srcp2: the second input
+ *
+ * Returns >= nr_cpu_ids if no such cpu is found.
+ */
+static inline
+unsigned int cpumask_first_notandnot(const struct cpumask *srcp1, const struct cpumask *srcp2)
+{
+	return find_first_notandnot_bit(cpumask_bits(srcp1), cpumask_bits(srcp2), nr_cpumask_bits);
+}
+
 /**
  * cpumask_last - get the last CPU in a cpumask
  * @srcp:	- the cpumask pointer
@@ -246,6 +272,40 @@ static inline unsigned int cpumask_next_zero(int n, const struct cpumask *srcp)
 	return find_next_zero_bit(cpumask_bits(srcp), small_cpumask_bits, n+1);
 }
 
+/**
+ * cpumask_next_andnot - return the next cpu from *srcp1 & ~*srcp2
+ * @n: the cpu prior to the place to search (ie. return will be > @n)
+ * @srcp1: the first input
+ * @srcp2: the second input
+ *
+ * Returns >= nr_cpu_ids if no such cpu is found.
+ */
+static inline
+unsigned int cpumask_next_andnot(int n, const struct cpumask *srcp1, const struct cpumask *srcp2)
+{
+	/* -1 is a legal arg here. */
+	if (n != -1)
+		cpumask_check(n);
+	return find_next_andnot_bit(cpumask_bits(srcp1), cpumask_bits(srcp2), nr_cpumask_bits, n+1);
+}
+
+/**
+ * cpumask_next_notandnot - return the next cpu from ~*srcp1 & ~*srcp2
+ * @n: the cpu prior to the place to search (ie. return will be > @n)
+ * @srcp1: the first input
+ * @srcp2: the second input
+ *
+ * Returns >= nr_cpu_ids if no such cpu is found.
+ */
+static inline
+unsigned int cpumask_next_notandnot(int n, const struct cpumask *srcp1, const struct cpumask *srcp2)
+{
+	/* -1 is a legal arg here. */
+	if (n != -1)
+		cpumask_check(n);
+	return find_next_notandnot_bit(cpumask_bits(srcp1), cpumask_bits(srcp2), nr_cpumask_bits, n+1);
+}
+
 #if NR_CPUS == 1
 /* Uniprocessor: there is only one valid CPU */
 static inline unsigned int cpumask_local_spread(unsigned int i, int node)
-- 
2.39.2



* [RFC PATCH 3/5] sched: NUMA-aware per-memory-map concurrency IDs
  2024-08-19 14:24 [RFC PATCH 0/5] sched: NUMA-aware concurrency IDs Mathieu Desnoyers
  2024-08-19 14:24 ` [RFC PATCH 1/5] lib: Implement find_{first,next,nth}_notandnot_bit, find_first_andnot_bit Mathieu Desnoyers
  2024-08-19 14:24 ` [RFC PATCH 2/5] cpumask: Implement cpumask_{first,next}_{not,}andnot Mathieu Desnoyers
@ 2024-08-19 14:24 ` Mathieu Desnoyers
  2024-08-19 14:24 ` [RFC PATCH 4/5] selftests/rseq: x86: Implement rseq_load_u32_u32 Mathieu Desnoyers
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 13+ messages in thread
From: Mathieu Desnoyers @ 2024-08-19 14:24 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: linux-kernel, Mathieu Desnoyers, Valentin Schneider, Mel Gorman,
	Steven Rostedt, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Yury Norov, Rasmus Villemoes, Shuah Khan

The issue addressed by this change is the non-locality of NUMA accesses
to data structures indexed by concurrency IDs: for example, in a
scenario where a process has two threads, and they periodically run one
after the other on different NUMA nodes, each will be assigned mm_cid=0.
As a consequence, they will end up accessing the same pages, and thus at
least one of the threads will need to perform remote NUMA accesses,
which is inefficient.

That being said, the same issue theoretically exists due to false
sharing of cache lines by threads running one after another on
different cores/CPUs within a single NUMA node, but the performance
impact is smaller than that of remote NUMA accesses.

Solve this by making the rseq concurrency ID (mm_cid) NUMA-aware. On
NUMA systems, when a NUMA-aware concurrency ID is observed by user-space
to be associated with a NUMA node, guarantee that it never changes NUMA
node unless either a kernel-level NUMA configuration change happens, or
scheduler migrations end up migrating tasks across NUMA nodes.

There is a tradeoff between NUMA locality and compactness of the
concurrency ID allocation. Favor compactness over NUMA locality when
the scheduler migrates tasks across NUMA nodes, as such migrations do
not cause the frequent remote NUMA accesses described above. This is
done by limiting the concurrency ID range to the minimum of the number
of threads belonging to the process and the number of allowed CPUs.
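[Editor's note: a simplified, hypothetical single-word model of the
allocation order implemented by __mm_cid_try_get() below: prefer a cid
already bound to the current node, then a cid not yet bound to any node.
The cross-node stealing paths and all concurrency handling are left out
of this sketch; names and types are illustrative only.]

```c
#include <stdint.h>

struct cid_state {
	uint64_t used;       /* models mm_cidmask: cids currently in use */
	uint64_t node_bound; /* models node_alloc cidmask: cids bound to a node */
	uint64_t node[2];    /* models per-node cidmasks (two nodes here) */
};

static int cid_alloc(struct cid_state *s, int node_id, int max_cids)
{
	uint64_t limit = (1ULL << max_cids) - 1;
	uint64_t cand;
	int cid;

	/* 1) Reuse the lowest free cid already bound to this node. */
	cand = s->node[node_id] & ~s->used & limit;
	if (!cand) /* 2) Take the lowest cid not yet bound to any node. */
		cand = ~s->node_bound & ~s->used & limit;
	if (!cand)
		return -1; /* the kernel would then try to steal a cid */
	cid = __builtin_ctzll(cand);
	s->used |= 1ULL << cid;
	s->node_bound |= 1ULL << cid;
	s->node[node_id] |= 1ULL << cid;
	return cid;
}
```

A thread that releases its cid and later reruns on the same node gets the
same cid back (step 1), which is what keeps the cid-to-node association
stable in the common case.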

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ben Segall <bsegall@google.com>
---
 include/linux/mm_types.h |  57 +++++++++++++++-
 kernel/sched/core.c      |  10 ++-
 kernel/sched/sched.h     | 139 +++++++++++++++++++++++++++++++++++----
 3 files changed, 190 insertions(+), 16 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index af3a0256fa93..4307352c8900 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -19,6 +19,7 @@
 #include <linux/workqueue.h>
 #include <linux/seqlock.h>
 #include <linux/percpu_counter.h>
+#include <linux/nodemask.h>
 
 #include <asm/mmu.h>
 
@@ -1154,6 +1155,59 @@ static inline cpumask_t *mm_cidmask(struct mm_struct *mm)
 	return (struct cpumask *)cid_bitmap;
 }
 
+#ifdef CONFIG_NUMA
+/*
+ * Layout of NUMA cidmasks:
+ * - node_alloc cidmask:        cpumask tracking which cids were
+ *                              allocated (across nodes) in this
+ *                              memory map.
+ * - node cidmask[nr_node_ids]: per-node cpumask tracking which cids
+ *                              were allocated in this memory map.
+ */
+static inline cpumask_t *mm_node_alloc_cidmask(struct mm_struct *mm)
+{
+	unsigned long cid_bitmap = (unsigned long)mm_cidmask(mm);
+
+	/* Skip mm_cidmask */
+	cid_bitmap += cpumask_size();
+	return (struct cpumask *)cid_bitmap;
+}
+
+static inline cpumask_t *mm_node_cidmask(struct mm_struct *mm, unsigned int node)
+{
+	unsigned long cid_bitmap = (unsigned long)mm_node_alloc_cidmask(mm);
+
+	/* Skip node alloc cidmask */
+	cid_bitmap += cpumask_size();
+	cid_bitmap += node * cpumask_size();
+	return (struct cpumask *)cid_bitmap;
+}
+
+static inline void mm_init_node_cidmask(struct mm_struct *mm)
+{
+	unsigned int node;
+
+	if (num_possible_nodes() == 1)
+		return;
+	cpumask_clear(mm_node_alloc_cidmask(mm));
+	for (node = 0; node < nr_node_ids; node++)
+		cpumask_clear(mm_node_cidmask(mm, node));
+}
+
+static inline unsigned int mm_node_cidmask_size(void)
+{
+	if (num_possible_nodes() == 1)
+		return 0;
+	return (nr_node_ids + 1) * cpumask_size();
+}
+#else /* CONFIG_NUMA */
+static inline void mm_init_node_cidmask(struct mm_struct *mm) { }
+static inline unsigned int mm_node_cidmask_size(void)
+{
+	return 0;
+}
+#endif /* CONFIG_NUMA */
+
 static inline void mm_init_cid(struct mm_struct *mm)
 {
 	int i;
@@ -1165,6 +1219,7 @@ static inline void mm_init_cid(struct mm_struct *mm)
 		pcpu_cid->time = 0;
 	}
 	cpumask_clear(mm_cidmask(mm));
+	mm_init_node_cidmask(mm);
 }
 
 static inline int mm_alloc_cid_noprof(struct mm_struct *mm)
@@ -1185,7 +1240,7 @@ static inline void mm_destroy_cid(struct mm_struct *mm)
 
 static inline unsigned int mm_cid_size(void)
 {
-	return cpumask_size();
+	return cpumask_size() + mm_node_cidmask_size();
 }
 #else /* CONFIG_SCHED_MM_CID */
 static inline void mm_init_cid(struct mm_struct *mm) { }
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ebf21373f663..74b0e76bf036 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -11792,9 +11792,13 @@ void sched_mm_cid_migrate_to(struct rq *dst_rq, struct task_struct *t)
 	 * scenarios.
 	 *
 	 * It is not useful to clear the src cid when the number of threads is
-	 * greater or equal to the number of allowed cpus, because user-space
+	 * greater or equal to the number of CPUs allowed, because user-space
 	 * can expect that the number of allowed cids can reach the number of
-	 * allowed cpus.
+	 * CPUs allowed.
+	 *
+	 * This also prevents moving cid across NUMA nodes when the
+	 * number of threads is greater or equal to the number of
+	 * CPUs allowed.
 	 */
 	dst_pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu_of(dst_rq));
 	dst_cid = READ_ONCE(dst_pcpu_cid->cid);
@@ -12053,7 +12057,7 @@ void sched_mm_cid_after_execve(struct task_struct *t)
 		 * Matches barrier in sched_mm_cid_remote_clear_old().
 		 */
 		smp_mb();
-		t->last_mm_cid = t->mm_cid = mm_cid_get(rq, mm);
+		t->last_mm_cid = t->mm_cid = mm_cid_get(rq, t, mm);
 	}
 	rseq_set_notify_resume(t);
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 38aeedd8a6cc..723f3fb727b4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -68,6 +68,7 @@
 #include <linux/wait_api.h>
 #include <linux/wait_bit.h>
 #include <linux/workqueue_api.h>
+#include <linux/nodemask.h>
 
 #include <trace/events/power.h>
 #include <trace/events/sched.h>
@@ -3311,12 +3312,10 @@ static inline void mm_cid_put(struct mm_struct *mm)
 	__mm_cid_put(mm, mm_cid_clear_lazy_put(cid));
 }
 
-static inline int __mm_cid_try_get(struct mm_struct *mm)
+static inline int __mm_cid_test_and_set_first(struct cpumask *cpumask)
 {
-	struct cpumask *cpumask;
 	int cid;
 
-	cpumask = mm_cidmask(mm);
 	/*
 	 * Retry finding first zero bit if the mask is temporarily
 	 * filled. This only happens during concurrent remote-clear
@@ -3333,9 +3332,123 @@ static inline int __mm_cid_try_get(struct mm_struct *mm)
 	return cid;
 }
 
+#ifdef CONFIG_NUMA
+/*
+ * NUMA locality is preserved as long as the mm_cid range is restricted
+ * to the minimum between the number of CPUs allowed and the number of
+ * threads with references to the mm_struct.
+ */
+static inline int __mm_cid_try_get(struct task_struct *t, struct mm_struct *mm)
+{
+	struct cpumask *cpumask = mm_cidmask(mm),
+		      *node_cpumask = mm_node_cidmask(mm, numa_node_id()),
+		      *node_alloc_cpumask = mm_node_alloc_cidmask(mm);
+	unsigned int node;
+	int cid;
+
+	if (num_possible_nodes() == 1)
+		return __mm_cid_test_and_set_first(cpumask);
+
+	/*
+	 * Try to reserve lowest available cid number within those
+	 * already reserved for this NUMA node.
+	 */
+	cid = cpumask_first_andnot(node_cpumask, cpumask);
+	if (cid >= t->nr_cpus_allowed || cid >= atomic_read(&mm->mm_users))
+		goto alloc_numa;
+	if (cpumask_test_and_set_cpu(cid, cpumask))
+		return -1;
+	goto end;
+
+alloc_numa:
+	/*
+	 * Try to reserve lowest available cid number within those not
+	 * already allocated for NUMA nodes.
+	 */
+	cid = cpumask_first_notandnot(node_alloc_cpumask, cpumask);
+	if (cid >= t->nr_cpus_allowed)
+		goto steal_overprovisioned_cid;
+	if (cid >= atomic_read(&mm->mm_users))
+		goto steal_first_available_cid;
+	if (cpumask_test_and_set_cpu(cid, cpumask))
+		return -1;
+	__cpumask_set_cpu(cid, node_cpumask);
+	__cpumask_set_cpu(cid, node_alloc_cpumask);
+	goto end;
+
+steal_overprovisioned_cid:
+	/*
+	 * Either the NUMA node id configuration changed for at least
+	 * one CPU in the system, or the scheduler migrated threads
+	 * across NUMA nodes, or the CPUs allowed mask changed. We need
+	 * to steal a currently unused cid. Userspace must handle the
+	 * fact that the node id associated with this cid may change.
+	 *
+	 * Try to steal an available cid number from an overprovisioned
+	 * NUMA node. A NUMA node is overprovisioned when more cids are
+	 * associated to it than the number of cores associated with
+	 * this NUMA node in the CPUs allowed mask. Stealing from
+	 * overprovisioned NUMA nodes ensures cid movement across NUMA
+	 * nodes stabilises after a configuration or CPUs allowed mask
+	 * change.
+	 */
+	for (node = 0; node < nr_node_ids; node++) {
+		struct cpumask *iter_cpumask;
+		int nr_allowed_cores;
+
+		if (node == numa_node_id())
+			continue;
+		iter_cpumask = mm_node_cidmask(mm, node);
+		nr_allowed_cores = cpumask_weight_and(cpumask_of_node(node), t->cpus_ptr);
+		if (cpumask_weight(iter_cpumask) <= nr_allowed_cores)
+			continue;
+		/* Try to steal from an overprovisioned NUMA node. */
+		cid = cpumask_first_andnot(iter_cpumask, cpumask);
+		if (cid >= t->nr_cpus_allowed || cid >= atomic_read(&mm->mm_users))
+			goto steal_first_available_cid;
+		if (cpumask_test_and_set_cpu(cid, cpumask))
+			return -1;
+		__cpumask_clear_cpu(cid, iter_cpumask);
+		__cpumask_set_cpu(cid, node_cpumask);
+		goto end;
+	}
+
+steal_first_available_cid:
+	/*
+	 * Steal the first available cid, without caring about NUMA
+	 * locality. This is needed when the scheduler migrates threads
+	 * across NUMA nodes, when those threads belong to processes
+	 * which have fewer threads than the number of CPUs allowed.
+	 */
+	cid = __mm_cid_test_and_set_first(cpumask);
+	if (cid < 0)
+		return -1;
+	/* Steal cid from its NUMA node mask. */
+	for (node = 0; node < nr_node_ids; node++) {
+		struct cpumask *iter_cpumask;
+
+		if (node == numa_node_id())
+			continue;
+		iter_cpumask = mm_node_cidmask(mm, node);
+		if (cpumask_test_cpu(cid, iter_cpumask)) {
+			__cpumask_clear_cpu(cid, iter_cpumask);
+			break;
+		}
+	}
+	__cpumask_set_cpu(cid, node_cpumask);
+end:
+	return cid;
+}
+#else
+static inline int __mm_cid_try_get(struct task_struct *t, struct mm_struct *mm)
+{
+	return __mm_cid_test_and_set_first(mm_cidmask(mm));
+}
+#endif
+
 /*
- * Save a snapshot of the current runqueue time of this cpu
- * with the per-cpu cid value, allowing to estimate how recently it was used.
+ * Save a snapshot of the current runqueue time of this CPU
+ * with the per-CPU cid value, allowing to estimate how recently it was used.
  */
 static inline void mm_cid_snapshot_time(struct rq *rq, struct mm_struct *mm)
 {
@@ -3345,7 +3458,8 @@ static inline void mm_cid_snapshot_time(struct rq *rq, struct mm_struct *mm)
 	WRITE_ONCE(pcpu_cid->time, rq->clock);
 }
 
-static inline int __mm_cid_get(struct rq *rq, struct mm_struct *mm)
+static inline int __mm_cid_get(struct rq *rq, struct task_struct *t,
+			       struct mm_struct *mm)
 {
 	int cid;
 
@@ -3355,13 +3469,13 @@ static inline int __mm_cid_get(struct rq *rq, struct mm_struct *mm)
 	 * guarantee forward progress.
 	 */
 	if (!READ_ONCE(use_cid_lock)) {
-		cid = __mm_cid_try_get(mm);
+		cid = __mm_cid_try_get(t, mm);
 		if (cid >= 0)
 			goto end;
 		raw_spin_lock(&cid_lock);
 	} else {
 		raw_spin_lock(&cid_lock);
-		cid = __mm_cid_try_get(mm);
+		cid = __mm_cid_try_get(t, mm);
 		if (cid >= 0)
 			goto unlock;
 	}
@@ -3381,7 +3495,7 @@ static inline int __mm_cid_get(struct rq *rq, struct mm_struct *mm)
 	 * all newcoming allocations observe the use_cid_lock flag set.
 	 */
 	do {
-		cid = __mm_cid_try_get(mm);
+		cid = __mm_cid_try_get(t, mm);
 		cpu_relax();
 	} while (cid < 0);
 	/*
@@ -3397,7 +3511,8 @@ static inline int __mm_cid_get(struct rq *rq, struct mm_struct *mm)
 	return cid;
 }
 
-static inline int mm_cid_get(struct rq *rq, struct mm_struct *mm)
+static inline int mm_cid_get(struct rq *rq, struct task_struct *t,
+			     struct mm_struct *mm)
 {
 	struct mm_cid __percpu *pcpu_cid = mm->pcpu_cid;
 	struct cpumask *cpumask;
@@ -3414,7 +3529,7 @@ static inline int mm_cid_get(struct rq *rq, struct mm_struct *mm)
 		if (try_cmpxchg(&this_cpu_ptr(pcpu_cid)->cid, &cid, MM_CID_UNSET))
 			__mm_cid_put(mm, mm_cid_clear_lazy_put(cid));
 	}
-	cid = __mm_cid_get(rq, mm);
+	cid = __mm_cid_get(rq, t, mm);
 	__this_cpu_write(pcpu_cid->cid, cid);
 	return cid;
 }
@@ -3467,7 +3582,7 @@ static inline void switch_mm_cid(struct rq *rq,
 		prev->mm_cid = -1;
 	}
 	if (next->mm_cid_active)
-		next->last_mm_cid = next->mm_cid = mm_cid_get(rq, next->mm);
+		next->last_mm_cid = next->mm_cid = mm_cid_get(rq, next, next->mm);
 }
 
 #else
-- 
2.39.2



* [RFC PATCH 4/5] selftests/rseq: x86: Implement rseq_load_u32_u32
  2024-08-19 14:24 [RFC PATCH 0/5] sched: NUMA-aware concurrency IDs Mathieu Desnoyers
                   ` (2 preceding siblings ...)
  2024-08-19 14:24 ` [RFC PATCH 3/5] sched: NUMA-aware per-memory-map concurrency IDs Mathieu Desnoyers
@ 2024-08-19 14:24 ` Mathieu Desnoyers
  2024-08-19 14:24 ` [RFC PATCH 5/5] selftests/rseq: Implement NUMA node id vs mm_cid invariant test Mathieu Desnoyers
  2024-08-22  2:09 ` [RFC PATCH 0/5] sched: NUMA-aware concurrency IDs Shuah Khan
  5 siblings, 0 replies; 13+ messages in thread
From: Mathieu Desnoyers @ 2024-08-19 14:24 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: linux-kernel, Mathieu Desnoyers, Valentin Schneider, Mel Gorman,
	Steven Rostedt, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Yury Norov, Rasmus Villemoes, Shuah Khan, linux-kselftest

Allow loading a pair of u32 within a rseq critical section. It can be
used in situations where both rseq_abi()->mm_cid and
rseq_abi()->node_id need to be sampled atomically with respect to
preemption, signal delivery and migration.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: linux-kselftest@vger.kernel.org
---
 tools/testing/selftests/rseq/rseq-x86-bits.h | 43 ++++++++++++++++++++
 tools/testing/selftests/rseq/rseq.h          | 14 +++++++
 2 files changed, 57 insertions(+)

diff --git a/tools/testing/selftests/rseq/rseq-x86-bits.h b/tools/testing/selftests/rseq/rseq-x86-bits.h
index 8a9431eec467..fdf5ef398393 100644
--- a/tools/testing/selftests/rseq/rseq-x86-bits.h
+++ b/tools/testing/selftests/rseq/rseq-x86-bits.h
@@ -990,4 +990,47 @@ int RSEQ_TEMPLATE_IDENTIFIER(rseq_cmpeqv_trymemcpy_storev)(intptr_t *v, intptr_t
 
 #endif
 
+#if defined(RSEQ_TEMPLATE_CPU_ID_NONE) && defined(RSEQ_TEMPLATE_MO_RELAXED)
+
+#define RSEQ_ARCH_HAS_LOAD_U32_U32
+
+static inline __attribute__((always_inline))
+int RSEQ_TEMPLATE_IDENTIFIER(rseq_load_u32_u32)(uint32_t *dst1, uint32_t *src1,
+		      uint32_t *dst2, uint32_t *src2)
+{
+	RSEQ_INJECT_C(9)
+
+	__asm__ __volatile__ goto (
+		RSEQ_ASM_DEFINE_TABLE(3, 1f, 2f, 4f) /* start, commit, abort */
+		/* Start rseq by storing table entry pointer into rseq_cs. */
+		RSEQ_ASM_STORE_RSEQ_CS(1, 3b, RSEQ_ASM_TP_SEGMENT:RSEQ_CS_OFFSET(%[rseq_offset]))
+		RSEQ_INJECT_ASM(3)
+		"movl %[src1], %%eax\n\t"
+		"movl %%eax, %[dst1]\n\t"
+		"movl %[src2], %%eax\n\t"
+		"movl %%eax, %[dst2]\n\t"
+		"2:\n\t"
+		RSEQ_INJECT_ASM(4)
+		RSEQ_ASM_DEFINE_ABORT(4, "", abort)
+		: /* gcc asm goto does not allow outputs */
+		: [rseq_offset]		"r" (rseq_offset),
+		  /* final store input */
+		  [dst1]		"m" (*dst1),
+		  [src1]		"m" (*src1),
+		  [dst2]		"m" (*dst2),
+		  [src2]		"m" (*src2)
+		: "memory", "cc", "rax"
+		  RSEQ_INJECT_CLOBBER
+		: abort
+	);
+	rseq_after_asm_goto();
+	return 0;
+abort:
+	rseq_after_asm_goto();
+	RSEQ_INJECT_FAILED
+	return -1;
+}
+
+#endif /* defined(RSEQ_TEMPLATE_CPU_ID_NONE) && defined(RSEQ_TEMPLATE_MO_RELAXED) */
+
 #include "rseq-bits-reset.h"
diff --git a/tools/testing/selftests/rseq/rseq.h b/tools/testing/selftests/rseq/rseq.h
index d7364ea4d201..b6095c2a5da6 100644
--- a/tools/testing/selftests/rseq/rseq.h
+++ b/tools/testing/selftests/rseq/rseq.h
@@ -381,4 +381,18 @@ int rseq_cmpeqv_trymemcpy_storev(enum rseq_mo rseq_mo, enum rseq_percpu_mode per
 	}
 }
 
+#ifdef RSEQ_ARCH_HAS_LOAD_U32_U32
+
+static inline __attribute__((always_inline))
+int rseq_load_u32_u32(enum rseq_mo rseq_mo,
+		      uint32_t *dst1, uint32_t *src1,
+		      uint32_t *dst2, uint32_t *src2)
+{
+	if (rseq_mo != RSEQ_MO_RELAXED)
+		return -1;
+	return rseq_load_u32_u32_relaxed(dst1, src1, dst2, src2);
+}
+
+#endif
+
 #endif  /* RSEQ_H_ */
-- 
2.39.2



* [RFC PATCH 5/5] selftests/rseq: Implement NUMA node id vs mm_cid invariant test
  2024-08-19 14:24 [RFC PATCH 0/5] sched: NUMA-aware concurrency IDs Mathieu Desnoyers
                   ` (3 preceding siblings ...)
  2024-08-19 14:24 ` [RFC PATCH 4/5] selftests/rseq: x86: Implement rseq_load_u32_u32 Mathieu Desnoyers
@ 2024-08-19 14:24 ` Mathieu Desnoyers
  2024-08-22  2:09 ` [RFC PATCH 0/5] sched: NUMA-aware concurrency IDs Shuah Khan
  5 siblings, 0 replies; 13+ messages in thread
From: Mathieu Desnoyers @ 2024-08-19 14:24 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: linux-kernel, Mathieu Desnoyers, Valentin Schneider, Mel Gorman,
	Steven Rostedt, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Yury Norov, Rasmus Villemoes, Shuah Khan, linux-kselftest

This test validates that the mapping between a mm_cid and a NUMA node id
remains invariant for the lifetime of a process whose number of threads
is >= the number of allowed CPUs. In other words, it validates that if
any thread within the process running on behalf of a mm_cid N observes a
NUMA node id M, all threads within this process will always observe the
same NUMA node id value when running on behalf of that same mm_cid.

This characteristic is important for NUMA locality.

On all architectures except Power, once a CPU has been associated with a
NUMA node, the NUMA topology is never reconfigured for the lifetime of
the system. Even on Power, we can assume that NUMA topology
reconfiguration happens rarely, and therefore we do not expect it to
happen while the NUMA test is running.

As a result the NUMA node id associated with a mm_cid should be
invariant as long as:

- A process has a number of threads >= number of allowed CPUs,
- The allowed CPUs mask is unchanged, and
- The NUMA configuration is unchanged.

This test is skipped on architectures that do not implement
rseq_load_u32_u32.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: linux-kselftest@vger.kernel.org
---
 tools/testing/selftests/rseq/.gitignore       |   1 +
 tools/testing/selftests/rseq/Makefile         |   2 +-
 .../testing/selftests/rseq/basic_numa_test.c  | 144 ++++++++++++++++++
 3 files changed, 146 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/rseq/basic_numa_test.c

diff --git a/tools/testing/selftests/rseq/.gitignore b/tools/testing/selftests/rseq/.gitignore
index 16496de5f6ce..8a8d163cbb9f 100644
--- a/tools/testing/selftests/rseq/.gitignore
+++ b/tools/testing/selftests/rseq/.gitignore
@@ -1,4 +1,5 @@
 # SPDX-License-Identifier: GPL-2.0-only
+basic_numa_test
 basic_percpu_ops_test
 basic_percpu_ops_mm_cid_test
 basic_test
diff --git a/tools/testing/selftests/rseq/Makefile b/tools/testing/selftests/rseq/Makefile
index 5a3432fceb58..9ef1c949114a 100644
--- a/tools/testing/selftests/rseq/Makefile
+++ b/tools/testing/selftests/rseq/Makefile
@@ -14,7 +14,7 @@ LDLIBS += -lpthread -ldl
 # still track changes to header files and depend on shared object.
 OVERRIDE_TARGETS = 1
 
-TEST_GEN_PROGS = basic_test basic_percpu_ops_test basic_percpu_ops_mm_cid_test param_test \
+TEST_GEN_PROGS = basic_test basic_numa_test basic_percpu_ops_test basic_percpu_ops_mm_cid_test param_test \
 		param_test_benchmark param_test_compare_twice param_test_mm_cid \
 		param_test_mm_cid_benchmark param_test_mm_cid_compare_twice
 
diff --git a/tools/testing/selftests/rseq/basic_numa_test.c b/tools/testing/selftests/rseq/basic_numa_test.c
new file mode 100644
index 000000000000..8e51c662057d
--- /dev/null
+++ b/tools/testing/selftests/rseq/basic_numa_test.c
@@ -0,0 +1,144 @@
+// SPDX-License-Identifier: LGPL-2.1
+/*
+ * Basic rseq NUMA test. Validate that (mm_cid, numa_node_id) pairs are
+ * invariant when the number of threads >= number of allowed CPUs, as
+ * long as the following preconditions hold:
+ *
+ *   - A process has a number of threads >= number of allowed CPUs,
+ *   - The allowed CPUs mask is unchanged, and
+ *   - The NUMA configuration is unchanged.
+ */
+#define _GNU_SOURCE
+#include <assert.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdio.h>
+#include <string.h>
+#include <poll.h>
+#include <sys/time.h>
+
+#include "rseq.h"
+
+#define NR_LOOPS	100
+
+static int nr_threads, nr_active_threads, test_go, test_stop;
+
+#ifdef RSEQ_ARCH_HAS_LOAD_U32_U32
+
+static int cpu_numa_id[CPU_SETSIZE];
+
+static int get_affinity_weight(void)
+{
+	cpu_set_t allowed_cpus;
+
+	if (sched_getaffinity(0, sizeof(allowed_cpus), &allowed_cpus)) {
+		perror("sched_getaffinity");
+		abort();
+	}
+	return CPU_COUNT(&allowed_cpus);
+}
+
+static void numa_id_init(void)
+{
+	int i;
+
+	for (i = 0; i < CPU_SETSIZE; i++)
+		cpu_numa_id[i] = -1;
+}
+
+static void *test_thread(void *arg)
+{
+	int i;
+
+	if (rseq_register_current_thread()) {
+		fprintf(stderr, "Error: rseq_register_current_thread(...) failed(%d): %s\n",
+			errno, strerror(errno));
+		abort();
+	}
+	/*
+	 * Rendez-vous across all threads to make sure the number of
+	 * threads >= number of allowed CPUs for the entire test duration.
+	 */
+	if (__atomic_add_fetch(&nr_active_threads, 1, __ATOMIC_RELAXED) == nr_threads)
+		__atomic_store_n(&test_go, 1, __ATOMIC_RELAXED);
+	while (!__atomic_load_n(&test_go, __ATOMIC_RELAXED))
+		rseq_barrier();
+
+	for (i = 0; i < NR_LOOPS; i++) {
+		uint32_t mm_cid, node;
+		int cached_node_id;
+
+		while (rseq_load_u32_u32(RSEQ_MO_RELAXED, &mm_cid,
+					 &rseq_get_abi()->mm_cid,
+					 &node, &rseq_get_abi()->node_id) != 0) {
+			/* Retry. */
+		}
+		cached_node_id = RSEQ_READ_ONCE(cpu_numa_id[mm_cid]);
+		if (cached_node_id == -1) {
+			RSEQ_WRITE_ONCE(cpu_numa_id[mm_cid], node);
+		} else {
+			if (node != cached_node_id) {
+				fprintf(stderr, "Error: NUMA node id discrepancy: mm_cid %u cached node id %d node id %u.\n",
+					mm_cid, cached_node_id, node);
+				fprintf(stderr, "This is likely a kernel bug, or caused by a concurrent NUMA topology reconfiguration.\n");
+				abort();
+			}
+		}
+		(void) poll(NULL, 0, 10);	/* wait 10ms */
+	}
+	/*
+	 * Rendez-vous before exiting all threads to make sure the
+	 * number of threads >= number of allowed CPUs for the entire
+	 * test duration.
+	 */
+	if (__atomic_sub_fetch(&nr_active_threads, 1, __ATOMIC_RELAXED) == 0)
+		__atomic_store_n(&test_stop, 1, __ATOMIC_RELAXED);
+	while (!__atomic_load_n(&test_stop, __ATOMIC_RELAXED))
+		rseq_barrier();
+
+	if (rseq_unregister_current_thread()) {
+		fprintf(stderr, "Error: rseq_unregister_current_thread(...) failed(%d): %s\n",
+			errno, strerror(errno));
+		abort();
+	}
+	return NULL;
+}
+
+static int test_numa(void)
+{
+	pthread_t tid[nr_threads];
+	int err, i;
+	void *tret;
+
+	numa_id_init();
+
+	printf("testing rseq (mm_cid, numa_node_id) invariant, multi-threaded (%d threads)\n",
+	       nr_threads);
+
+	for (i = 0; i < nr_threads; i++) {
+		err = pthread_create(&tid[i], NULL, test_thread, NULL);
+		if (err != 0)
+			abort();
+	}
+
+	for (i = 0; i < nr_threads; i++) {
+		err = pthread_join(tid[i], &tret);
+		if (err != 0)
+			abort();
+	}
+
+	return 0;
+}
+#else
+static int test_numa(void)
+{
+	fprintf(stderr, "rseq_load_u32_u32 is not implemented on this architecture. Skipping numa test.\n");
+	return 0;
+}
+#endif
+
+int main(int argc, char **argv)
+{
+	nr_threads = get_affinity_weight();
+	return test_numa();
+}
-- 
2.39.2



* Re: [RFC PATCH 1/5] lib: Implement find_{first,next,nth}_notandnot_bit, find_first_andnot_bit
  2024-08-19 14:24 ` [RFC PATCH 1/5] lib: Implement find_{first,next,nth}_notandnot_bit, find_first_andnot_bit Mathieu Desnoyers
@ 2024-08-19 19:19   ` Yury Norov
  2024-08-20 17:19     ` Mathieu Desnoyers
  0 siblings, 1 reply; 13+ messages in thread
From: Yury Norov @ 2024-08-19 19:19 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Valentin Schneider,
	Mel Gorman, Steven Rostedt, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Rasmus Villemoes, Shuah Khan

On Mon, Aug 19, 2024 at 04:24:02PM +0200, Mathieu Desnoyers wrote:
> Allow finding the first, next, or nth bit within two input bitmasks
> which is zero in both masks.
> 
> Allow finding the first bit within two input bitmasks which is set in
> the first mask and cleared in the second mask. find_next_andnot_bit and
> find_nth_andnot_bit already exist, so the find-first variant appears to
> be missing.
> 
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Yury Norov <yury.norov@gmail.com>
> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
> ---
>  include/linux/find.h | 122 +++++++++++++++++++++++++++++++++++++++++--
>  lib/find_bit.c       |  42 +++++++++++++++
>  2 files changed, 160 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/find.h b/include/linux/find.h
> index 5dfca4225fef..6b2377006b22 100644
> --- a/include/linux/find.h
> +++ b/include/linux/find.h
> @@ -14,6 +14,8 @@ unsigned long _find_next_and_bit(const unsigned long *addr1, const unsigned long
>  					unsigned long nbits, unsigned long start);
>  unsigned long _find_next_andnot_bit(const unsigned long *addr1, const unsigned long *addr2,
>  					unsigned long nbits, unsigned long start);
> +unsigned long _find_next_notandnot_bit(const unsigned long *addr1, const unsigned long *addr2,
> +					unsigned long nbits, unsigned long start);
>  unsigned long _find_next_or_bit(const unsigned long *addr1, const unsigned long *addr2,
>  					unsigned long nbits, unsigned long start);
>  unsigned long _find_next_zero_bit(const unsigned long *addr, unsigned long nbits,
> @@ -24,11 +26,17 @@ unsigned long __find_nth_and_bit(const unsigned long *addr1, const unsigned long
>  				unsigned long size, unsigned long n);
>  unsigned long __find_nth_andnot_bit(const unsigned long *addr1, const unsigned long *addr2,
>  					unsigned long size, unsigned long n);
> +unsigned long __find_nth_notandnot_bit(const unsigned long *addr1, const unsigned long *addr2,
> +					unsigned long size, unsigned long n);
>  unsigned long __find_nth_and_andnot_bit(const unsigned long *addr1, const unsigned long *addr2,
>  					const unsigned long *addr3, unsigned long size,
>  					unsigned long n);
>  extern unsigned long _find_first_and_bit(const unsigned long *addr1,
>  					 const unsigned long *addr2, unsigned long size);
> +extern unsigned long _find_first_andnot_bit(const unsigned long *addr1,
> +					 const unsigned long *addr2, unsigned long size);
> +extern unsigned long _find_first_notandnot_bit(const unsigned long *addr1,
> +					 const unsigned long *addr2, unsigned long size);
>  unsigned long _find_first_and_and_bit(const unsigned long *addr1, const unsigned long *addr2,
>  				      const unsigned long *addr3, unsigned long size);
>  extern unsigned long _find_first_zero_bit(const unsigned long *addr, unsigned long size);
> @@ -102,15 +110,14 @@ unsigned long find_next_and_bit(const unsigned long *addr1,
>  
>  #ifndef find_next_andnot_bit
>  /**
> - * find_next_andnot_bit - find the next set bit in *addr1 excluding all the bits
> - *                        in *addr2
> + * find_next_andnot_bit - find the next set bit in *addr1, cleared in *addr2
>   * @addr1: The first address to base the search on
>   * @addr2: The second address to base the search on
>   * @size: The bitmap size in bits
>   * @offset: The bitnumber to start searching at
>   *
> - * Returns the bit number for the next set bit
> - * If no bits are set, returns @size.
> + * Returns the bit number for the next bit set in *addr1, cleared in *addr2.
> + * If no such bits are found, returns @size.

Can you split rewording of existing comments out to a separate patch
please?

>   */
>  static inline
>  unsigned long find_next_andnot_bit(const unsigned long *addr1,
> @@ -131,6 +138,36 @@ unsigned long find_next_andnot_bit(const unsigned long *addr1,
>  }
>  #endif
>  
> +#ifndef find_next_notandnot_bit

Don't protect new functions. This is only for those having arch
implementation, and it's only armv7 now.

> +/**
> + * find_next_notandnot_bit - find the next bit cleared in both *addr1 and *addr2
> + * @addr1: The first address to base the search on
> + * @addr2: The second address to base the search on
> + * @size: The bitmap size in bits
> + * @offset: The bitnumber to start searching at
> + *
> + * Returns the bit number for the next bit cleared in both *addr1 and *addr2.
> + * If no such bits are found, returns @size.
> + */
> +static inline
> +unsigned long find_next_notandnot_bit(const unsigned long *addr1,
> +		const unsigned long *addr2, unsigned long size,
> +		unsigned long offset)
> +{
> +	if (small_const_nbits(size)) {
> +		unsigned long val;
> +
> +		if (unlikely(offset >= size))
> +			return size;
> +
> +		val = (~*addr1) & (~*addr2) & GENMASK(size - 1, offset);
> +		return val ? __ffs(val) : size;
> +	}
> +
> +	return _find_next_notandnot_bit(addr1, addr2, size, offset);
> +}
> +#endif
> +

It's not said explicitly, but some naming conventions exist around bitmap
searching.

If you're looking for a clear (unset) bit in a mask, you'd use a 'zero'
modifier. We have only 2 such functions now: find_{first,next}_zero_bit,
both taking one bitmap. I think it's time to extend this rule for
many bitmaps and write down the naming rules.

With the following, the find_next_notandnot_bit() should be named
like find_next_zero_and_bit(). It's not perfect, but still sounds
better to me than the 'notandnot' thing.

If we search for a set bit in bitmap, we use find_first or find_next
prefixes:
 - find_first_bit;
 - find_next_bit.

If we'd like to pass an additional bitmap to AND, OR or XOR with the
1st bitmap, we provide the corresponding logical operation as
suffix(es):
 - find_first_and_bit(b1, b2) : b1 & b2;
 - find_next_and_or_bit(b1, b2, b3) : b1 & b2 | b3.

If additional bitmap must be inverted, we provide a 'not' after the
corresponding logical operation:
 - find_first_andnot_bit(b1, b2) : b1 & ~b2;
 - find_next_and_ornot_bit(b1, b2, b3) : b1 & b2 | ~b3.

If all bitmaps have to be inverted, or in other words, we are looking
for an unset bit in a bitmap or a combination of bitmaps, we provide
a 'zero' prefix in the function name:
 - find_next_zero_bit;
 - find_next_zero_and_bit;
 - find_next_zero_and_or_bit;

Functions having 'zero' prefix should not negate bitmaps (should not
have 'not' in names) because of commutative property. For example,
find_next_zero_andnot_bit(), which is ~b1 & ~(~b2), is redundant
because it's the same as find_next_andnot_bit() : b2 & ~b1.

Iterators over unset bits in bitmap(s) (those based on 'zero' search
functions) should have a 'clear' prefix in the name:
 - for_each_clear_bit;
 - for_each_clear_bit_from;

I should probably put the above on top of the file...

Thanks,
Yury


* Re: [RFC PATCH 2/5] cpumask: Implement cpumask_{first,next}_{not,}andnot
  2024-08-19 14:24 ` [RFC PATCH 2/5] cpumask: Implement cpumask_{first,next}_{not,}andnot Mathieu Desnoyers
@ 2024-08-19 19:24   ` Yury Norov
  2024-08-20 17:28     ` Mathieu Desnoyers
  0 siblings, 1 reply; 13+ messages in thread
From: Yury Norov @ 2024-08-19 19:24 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Valentin Schneider,
	Mel Gorman, Steven Rostedt, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Rasmus Villemoes, Shuah Khan

On Mon, Aug 19, 2024 at 04:24:03PM +0200, Mathieu Desnoyers wrote:
> Allow finding the first or next bit within two input cpumasks which is
> either:

"first or next CPU..." here.
 
> - both zero and zero,
> - respectively one and zero.
> 
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Yury Norov <yury.norov@gmail.com>
> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
> ---
>  include/linux/cpumask.h | 60 +++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 60 insertions(+)
> 
> diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
> index 23686bed441d..57b7d99d6da1 100644
> --- a/include/linux/cpumask.h
> +++ b/include/linux/cpumask.h
> @@ -204,6 +204,32 @@ unsigned int cpumask_first_and_and(const struct cpumask *srcp1,
>  				      cpumask_bits(srcp3), small_cpumask_bits);
>  }
>  
> +/**
> + * cpumask_first_andnot - return the first cpu from *srcp1 & ~*srcp2
> + * @src1p: the first input
> + * @src2p: the second input
> + *
> + * Returns >= nr_cpu_ids if no cpus match in both.
> + */
> +static inline
> +unsigned int cpumask_first_andnot(const struct cpumask *srcp1, const struct cpumask *srcp2)

Please use __always_inline to enforce compile-time optimizations.
Check for this series:
https://lore.kernel.org/lkml/20240719005127.2449328-4-briannorris@chromium.org/T/

It's already in -next.

Thanks,
Yury

> +{
> +	return find_first_andnot_bit(cpumask_bits(srcp1), cpumask_bits(srcp2), nr_cpumask_bits);
> +}
> +
> +/**
> + * cpumask_first_notandnot - return the first cpu from ~*srcp1 & ~*srcp2
> + * @src1p: the first input
> + * @src2p: the second input
> + *
> + * Returns >= nr_cpu_ids if no cpus match in both.
> + */
> +static inline
> +unsigned int cpumask_first_notandnot(const struct cpumask *srcp1, const struct cpumask *srcp2)
> +{
> +	return find_first_notandnot_bit(cpumask_bits(srcp1), cpumask_bits(srcp2), nr_cpumask_bits);
> +}
> +
>  /**
>   * cpumask_last - get the last CPU in a cpumask
>   * @srcp:	- the cpumask pointer
> @@ -246,6 +272,40 @@ static inline unsigned int cpumask_next_zero(int n, const struct cpumask *srcp)
>  	return find_next_zero_bit(cpumask_bits(srcp), small_cpumask_bits, n+1);
>  }
>  
> +/**
> + * cpumask_next_andnot - return the next cpu from *srcp1 & ~*srcp2
> + * @n: the cpu prior to the place to search (ie. return will be > @n)
> + * @src1p: the first input
> + * @src2p: the second input
> + *
> + * Returns >= nr_cpu_ids if no cpus match in both.
> + */
> +static inline
> +unsigned int cpumask_next_andnot(int n, const struct cpumask *srcp1, const struct cpumask *srcp2)
> +{
> +	/* -1 is a legal arg here. */
> +	if (n != -1)
> +		cpumask_check(n);
> +	return find_next_andnot_bit(cpumask_bits(srcp1), cpumask_bits(srcp2), nr_cpumask_bits, n+1);
> +}
> +
> +/**
> + * cpumask_next_notandnot - return the next cpu from ~*srcp1 & ~*srcp2
> + * @n: the cpu prior to the place to search (ie. return will be > @n)
> + * @src1p: the first input
> + * @src2p: the second input
> + *
> + * Returns >= nr_cpu_ids if no cpus match in both.
> + */
> +static inline
> +unsigned int cpumask_next_notandnot(int n, const struct cpumask *srcp1, const struct cpumask *srcp2)
> +{
> +	/* -1 is a legal arg here. */
> +	if (n != -1)
> +		cpumask_check(n);
> +	return find_next_notandnot_bit(cpumask_bits(srcp1), cpumask_bits(srcp2), nr_cpumask_bits, n+1);
> +}
> +
>  #if NR_CPUS == 1
>  /* Uniprocessor: there is only one valid CPU */
>  static inline unsigned int cpumask_local_spread(unsigned int i, int node)
> -- 
> 2.39.2


* Re: [RFC PATCH 1/5] lib: Implement find_{first,next,nth}_notandnot_bit, find_first_andnot_bit
  2024-08-19 19:19   ` Yury Norov
@ 2024-08-20 17:19     ` Mathieu Desnoyers
  2024-08-20 20:45       ` Mathieu Desnoyers
  0 siblings, 1 reply; 13+ messages in thread
From: Mathieu Desnoyers @ 2024-08-20 17:19 UTC (permalink / raw)
  To: Yury Norov
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Valentin Schneider,
	Mel Gorman, Steven Rostedt, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Rasmus Villemoes, Shuah Khan

On 2024-08-19 21:19, Yury Norov wrote:
[...]
> Can you split rewording of existing comments out to a separate patch
> please?

Will do.

> 
>>    */
>>   static inline
>>   unsigned long find_next_andnot_bit(const unsigned long *addr1,
>> @@ -131,6 +138,36 @@ unsigned long find_next_andnot_bit(const unsigned long *addr1,
>>   }
>>   #endif
>>   
>> +#ifndef find_next_notandnot_bit
> 
> Don't protect new functions. This is only for those having arch
> implementation, and it's only armv7 now.

OK

> 
>> +/**
>> + * find_next_notandnot_bit - find the next bit cleared in both *addr1 and *addr2
>> + * @addr1: The first address to base the search on
>> + * @addr2: The second address to base the search on
>> + * @size: The bitmap size in bits
>> + * @offset: The bitnumber to start searching at
>> + *
>> + * Returns the bit number for the next bit cleared in both *addr1 and *addr2.
>> + * If no such bits are found, returns @size.
>> + */
>> +static inline
>> +unsigned long find_next_notandnot_bit(const unsigned long *addr1,
>> +		const unsigned long *addr2, unsigned long size,
>> +		unsigned long offset)
>> +{
>> +	if (small_const_nbits(size)) {
>> +		unsigned long val;
>> +
>> +		if (unlikely(offset >= size))
>> +			return size;
>> +
>> +		val = (~*addr1) & (~*addr2) & GENMASK(size - 1, offset);
>> +		return val ? __ffs(val) : size;
>> +	}
>> +
>> +	return _find_next_notandnot_bit(addr1, addr2, size, offset);
>> +}
>> +#endif
>> +
> 
> It's not said explicitly, but some naming conventions exist around bitmap
> searching.
> 
> If you're looking for a clear (unset) bit in a mask, you'd use a 'zero'
> modifier. We have only 2 such functions now: find_{first,next}_zero_bit,
> both taking one bitmap. I think it's time to extend this rule for
> many bitmaps and write down the naming rules.
> 
> With the following, the find_next_notandnot_bit() should be named
> like find_next_zero_and_bit(). It's not perfect, but still sounds
> better to me than the 'notandnot' thing.
> 
> If we search for a set bit in bitmap, we use find_first or find_next
> prefixes:
>   - find_first_bit;
>   - find_next_bit.
> 
> If we'd like to pass an additional bitmap to AND, OR or XOR with the
> 1st bitmap, we provide the corresponding logical operation as
> suffix(es):
>   - find_first_and_bit(b1, b2) : b1 & b2;
>   - find_next_and_or_bit(b1, b2, b3) : b1 & b2 | b3.
> 
> If additional bitmap must be inverted, we provide a 'not' after the
> corresponding logical operation:
>   - find_first_andnot_bit(b1, b2) : b1 & ~b2;
>   - find_next_and_ornot_bit(b1, b2, b3) : b1 & b2 | ~b3.
> 
> If all bitmaps have to be inverted, or in other words, we are looking
> for an unset bit in a bitmap or a combination of bitmaps, we provide
> a 'zero' prefix in the function name:
>   - find_next_zero_bit;
>   - find_next_zero_and_bit;
>   - find_next_zero_and_or_bit;
> 
> Functions having 'zero' prefix should not negate bitmaps (should not
> have 'not' in names) because of commutative property. For example,
> find_next_zero_andnot_bit(), which is ~b1 & ~(~b2), is redundant
> because it's the same as find_next_andnot_bit() : b2 & ~b1.
> 
> Iterators over unset bits in bitmap(s) (those based on 'zero' search
> functions) should have a 'clear' prefix in the name:
>   - for_each_clear_bit;
>   - for_each_clear_bit_from;
> 
> I should probably put the above on top of the file...

I'll do this for the next round. Yes, it would be good to add
those explanations on top of the file.

Thanks for the review !

Mathieu

> 
> Thanks,
> Yury

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com



* Re: [RFC PATCH 2/5] cpumask: Implement cpumask_{first,next}_{not,}andnot
  2024-08-19 19:24   ` Yury Norov
@ 2024-08-20 17:28     ` Mathieu Desnoyers
  0 siblings, 0 replies; 13+ messages in thread
From: Mathieu Desnoyers @ 2024-08-20 17:28 UTC (permalink / raw)
  To: Yury Norov
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Valentin Schneider,
	Mel Gorman, Steven Rostedt, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Rasmus Villemoes, Shuah Khan

On 2024-08-19 21:24, Yury Norov wrote:
> On Mon, Aug 19, 2024 at 04:24:03PM +0200, Mathieu Desnoyers wrote:
>> Allow finding the first or next bit within two input cpumasks which is
>> either:
> 
> "first or next CPU..." here.
>   
>> - both zero and zero,
>> - respectively one and zero.
>>
>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>> Cc: Yury Norov <yury.norov@gmail.com>
>> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
>> ---
>>   include/linux/cpumask.h | 60 +++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 60 insertions(+)
>>
>> diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
>> index 23686bed441d..57b7d99d6da1 100644
>> --- a/include/linux/cpumask.h
>> +++ b/include/linux/cpumask.h
>> @@ -204,6 +204,32 @@ unsigned int cpumask_first_and_and(const struct cpumask *srcp1,
>>   				      cpumask_bits(srcp3), small_cpumask_bits);
>>   }
>>   
>> +/**
>> + * cpumask_first_andnot - return the first cpu from *srcp1 & ~*srcp2
>> + * @src1p: the first input
>> + * @src2p: the second input
>> + *
>> + * Returns >= nr_cpu_ids if no cpus match in both.
>> + */
>> +static inline
>> +unsigned int cpumask_first_andnot(const struct cpumask *srcp1, const struct cpumask *srcp2)
> 
> Please use __always_inline to enforce compile-time optimizations.
> Check for this series:
> https://lore.kernel.org/lkml/20240719005127.2449328-4-briannorris@chromium.org/T/

I'll use __always_inline in both bitmap and cpumask patches.

I'll update this patch to rename notandnot to zero_and.

Thanks,

Mathieu

> 
> It's already in -next.
> 
> Thanks,
> Yury
> 
>> +{
>> +	return find_first_andnot_bit(cpumask_bits(srcp1), cpumask_bits(srcp2), nr_cpumask_bits);
>> +}
>> +
>> +/**
>> + * cpumask_first_notandnot - return the first cpu from ~*srcp1 & ~*srcp2
>> + * @src1p: the first input
>> + * @src2p: the second input
>> + *
>> + * Returns >= nr_cpu_ids if no cpus match in both.
>> + */
>> +static inline
>> +unsigned int cpumask_first_notandnot(const struct cpumask *srcp1, const struct cpumask *srcp2)
>> +{
>> +	return find_first_notandnot_bit(cpumask_bits(srcp1), cpumask_bits(srcp2), nr_cpumask_bits);
>> +}
>> +
>>   /**
>>    * cpumask_last - get the last CPU in a cpumask
>>    * @srcp:	- the cpumask pointer
>> @@ -246,6 +272,40 @@ static inline unsigned int cpumask_next_zero(int n, const struct cpumask *srcp)
>>   	return find_next_zero_bit(cpumask_bits(srcp), small_cpumask_bits, n+1);
>>   }
>>   
>> +/**
>> + * cpumask_next_andnot - return the next cpu from *srcp1 & ~*srcp2
>> + * @n: the cpu prior to the place to search (ie. return will be > @n)
>> + * @src1p: the first input
>> + * @src2p: the second input
>> + *
>> + * Returns >= nr_cpu_ids if no cpus match in both.
>> + */
>> +static inline
>> +unsigned int cpumask_next_andnot(int n, const struct cpumask *srcp1, const struct cpumask *srcp2)
>> +{
>> +	/* -1 is a legal arg here. */
>> +	if (n != -1)
>> +		cpumask_check(n);
>> +	return find_next_andnot_bit(cpumask_bits(srcp1), cpumask_bits(srcp2), nr_cpumask_bits, n+1);
>> +}
>> +
>> +/**
>> + * cpumask_next_notandnot - return the next cpu from ~*srcp1 & ~*srcp2
>> + * @n: the cpu prior to the place to search (ie. return will be > @n)
>> + * @src1p: the first input
>> + * @src2p: the second input
>> + *
>> + * Returns >= nr_cpu_ids if no cpus match in both.
>> + */
>> +static inline
>> +unsigned int cpumask_next_notandnot(int n, const struct cpumask *srcp1, const struct cpumask *srcp2)
>> +{
>> +	/* -1 is a legal arg here. */
>> +	if (n != -1)
>> +		cpumask_check(n);
>> +	return find_next_notandnot_bit(cpumask_bits(srcp1), cpumask_bits(srcp2), nr_cpumask_bits, n+1);
>> +}
>> +
>>   #if NR_CPUS == 1
>>   /* Uniprocessor: there is only one valid CPU */
>>   static inline unsigned int cpumask_local_spread(unsigned int i, int node)
>> -- 
>> 2.39.2

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com



* Re: [RFC PATCH 1/5] lib: Implement find_{first,next,nth}_notandnot_bit, find_first_andnot_bit
  2024-08-20 17:19     ` Mathieu Desnoyers
@ 2024-08-20 20:45       ` Mathieu Desnoyers
  2024-08-20 21:15         ` Yury Norov
  0 siblings, 1 reply; 13+ messages in thread
From: Mathieu Desnoyers @ 2024-08-20 20:45 UTC (permalink / raw)
  To: Yury Norov
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Valentin Schneider,
	Mel Gorman, Steven Rostedt, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Rasmus Villemoes, Shuah Khan

On 2024-08-20 19:19, Mathieu Desnoyers wrote:
> On 2024-08-19 21:19, Yury Norov wrote:
[...]
>>> +/**
>>> + * find_next_notandnot_bit - find the next bit cleared in both *addr1 and *addr2
>>> + * @addr1: The first address to base the search on
>>> + * @addr2: The second address to base the search on
>>> + * @size: The bitmap size in bits
>>> + * @offset: The bitnumber to start searching at
>>> + *
>>> + * Returns the bit number for the next bit cleared in both *addr1 
>>> and *addr2.
>>> + * If no such bits are found, returns @size.
>>> + */
>>> +static inline
>>> +unsigned long find_next_notandnot_bit(const unsigned long *addr1,
>>> +        const unsigned long *addr2, unsigned long size,
>>> +        unsigned long offset)
>>> +{
>>> +    if (small_const_nbits(size)) {
>>> +        unsigned long val;
>>> +
>>> +        if (unlikely(offset >= size))
>>> +            return size;
>>> +
>>> +        val = (~*addr1) & (~*addr2) & GENMASK(size - 1, offset);
>>> +        return val ? __ffs(val) : size;
>>> +    }
>>> +
>>> +    return _find_next_notandnot_bit(addr1, addr2, size, offset);
>>> +}
>>> +#endif
>>> +
>>
>> It's not said explicitly, but some naming conventions exist around bitmap
>> searching.
>>
>> If you're looking for a clear (unset) bit in a mask, you'd use a 'zero'
>> modifier. We have only 2 such functions now: find_{first,next}_zero_bit,
>> both taking one bitmap. I think it's time to extend this rule for
>> many bitmaps and write down the naming rules.
>>
>> With the following, find_next_notandnot_bit() should be named
>> find_next_zero_and_bit(). It's not perfect, but it still sounds
>> better to me than the 'notandnot' thing.

Actually, now that I come to think of it in terms of logic gates:

~A & ~B == ~(A | B)

So this "notandnot" is simply a "NOR" gate.

I therefore intend to name it "find_next_nor_bit" if that's OK with
you.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 1/5] lib: Implement find_{first,next,nth}_notandnot_bit, find_first_andnot_bit
  2024-08-20 20:45       ` Mathieu Desnoyers
@ 2024-08-20 21:15         ` Yury Norov
  0 siblings, 0 replies; 13+ messages in thread
From: Yury Norov @ 2024-08-20 21:15 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Valentin Schneider,
	Mel Gorman, Steven Rostedt, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Rasmus Villemoes, Shuah Khan

On Tue, Aug 20, 2024 at 10:45:17PM +0200, Mathieu Desnoyers wrote:
> On 2024-08-20 19:19, Mathieu Desnoyers wrote:
> > On 2024-08-19 21:19, Yury Norov wrote:
> [...]
> > > > +/**
> > > > + * find_next_notandnot_bit - find the next bit cleared in both *addr1 and *addr2
> > > > + * @addr1: The first address to base the search on
> > > > + * @addr2: The second address to base the search on
> > > > + * @size: The bitmap size in bits
> > > > + * @offset: The bit number to start searching at
> > > > + *
> > > > + * Returns the bit number for the next bit cleared in both *addr1 and *addr2.
> > > > + * If no such bits are found, returns @size.
> > > > + */
> > > > +static inline
> > > > +unsigned long find_next_notandnot_bit(const unsigned long *addr1,
> > > > +        const unsigned long *addr2, unsigned long size,
> > > > +        unsigned long offset)
> > > > +{
> > > > +    if (small_const_nbits(size)) {
> > > > +        unsigned long val;
> > > > +
> > > > +        if (unlikely(offset >= size))
> > > > +            return size;
> > > > +
> > > > +        val = (~*addr1) & (~*addr2) & GENMASK(size - 1, offset);
> > > > +        return val ? __ffs(val) : size;
> > > > +    }
> > > > +
> > > > +    return _find_next_notandnot_bit(addr1, addr2, size, offset);
> > > > +}
> > > > +#endif
> > > > +
> > > 
> > > It's not said explicitly, but some naming conventions exist around bitmap
> > > searching.
> > > 
> > > If you're looking for a clear (unset) bit in a mask, you'd use a 'zero'
> > > modifier. We have only 2 such functions now: find_{first,next}_zero_bit,
> > > both taking one bitmap. I think it's time to extend this rule for
> > > many bitmaps and write down the naming rules.
> > > 
> > > With the following, find_next_notandnot_bit() should be named
> > > find_next_zero_and_bit(). It's not perfect, but it still sounds
> > > better to me than the 'notandnot' thing.
> 
> Actually, now that I come to think of it in terms of logic gates:
> 
> ~A & ~B == ~(A | B)
> 
> So this "notandnot" is simply a "NOR" gate.
> 
> I therefore intend to name it "find_next_nor_bit" if that's OK with
> you.

Yes, I'm OK.

To me, if you can express the logical operation directly inside the
FIND_NEXT_BIT() macro, you can name the function after that operation.
So in this case, find_next_nor_bit would be:

  FIND_NEXT_BIT(~(addr1[idx] | addr2[idx]), /* nop */, size)

Correspondingly, instead of 'zero_or' we should use 'nand' notation,
if it is ever needed. I'll note that in the naming manual.

Good catch.

Thanks,
Yury

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 0/5] sched: NUMA-aware concurrency IDs
  2024-08-19 14:24 [RFC PATCH 0/5] sched: NUMA-aware concurrency IDs Mathieu Desnoyers
                   ` (4 preceding siblings ...)
  2024-08-19 14:24 ` [RFC PATCH 5/5] selftests/rseq: Implement NUMA node id vs mm_cid invariant test Mathieu Desnoyers
@ 2024-08-22  2:09 ` Shuah Khan
  5 siblings, 0 replies; 13+ messages in thread
From: Shuah Khan @ 2024-08-22  2:09 UTC (permalink / raw)
  To: Mathieu Desnoyers, Peter Zijlstra, Ingo Molnar
  Cc: linux-kernel, Valentin Schneider, Mel Gorman, Steven Rostedt,
	Vincent Guittot, Dietmar Eggemann, Ben Segall, Yury Norov,
	Rasmus Villemoes, Shuah Khan

On 8/19/24 08:24, Mathieu Desnoyers wrote:
> The issue addressed by this series is the non-locality of NUMA accesses
> to data structures indexed by concurrency IDs: for example, in a
> scenario where a process has two threads, and they periodically run one
> after the other on different NUMA nodes, each will be assigned mm_cid=0.
> As a consequence, they will end up accessing the same pages, and thus at
> least one of the threads will need to perform remote NUMA accesses,
> which is inefficient.
> 
> Solve this by making the rseq concurrency ID (mm_cid) NUMA-aware. On
> NUMA systems, when a NUMA-aware concurrency ID is observed by user-space
> to be associated with a NUMA node, guarantee that it never changes NUMA
> node unless either a kernel-level NUMA configuration change happens, or
> scheduler migrations end up migrating tasks across NUMA nodes.
> 
> There is a tradeoff between NUMA locality and compactness of the
> concurrency ID allocation. Favor compactness over NUMA locality when
> the scheduler migrates tasks across NUMA nodes, as this does not cause
> the frequent remote NUMA accesses behavior. This is done by limiting the
> concurrency ID range to minimum between the number of threads belonging
> to the process and the number of allowed CPUs.
> 
> This series applies on top of v6.10.3.
> 
> Cc: Valentin Schneider <vschneid@redhat.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Yury Norov <yury.norov@gmail.com>
> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
> Cc: Shuah Khan <skhan@linuxfoundation.org>
> 
> Mathieu Desnoyers (5):
>    lib: Implement find_{first,next,nth}_notandnot_bit,
>      find_first_andnot_bit
>    cpumask: Implement cpumask_{first,next}_{not,}andnot
>    sched: NUMA-aware per-memory-map concurrency IDs
>    selftests/rseq: x86: Implement rseq_load_u32_u32
>    selftests/rseq: Implement NUMA node id vs mm_cid invariant test
> 
>   include/linux/cpumask.h                       |  60 ++++++++
>   include/linux/find.h                          | 122 ++++++++++++++-
>   include/linux/mm_types.h                      |  57 ++++++-
>   kernel/sched/core.c                           |  10 +-
>   kernel/sched/sched.h                          | 139 +++++++++++++++--
>   lib/find_bit.c                                |  42 +++++
>   tools/testing/selftests/rseq/.gitignore       |   1 +
>   tools/testing/selftests/rseq/Makefile         |   2 +-
>   .../testing/selftests/rseq/basic_numa_test.c  | 144 ++++++++++++++++++
>   tools/testing/selftests/rseq/rseq-x86-bits.h  |  43 ++++++
>   tools/testing/selftests/rseq/rseq.h           |  14 ++
>   11 files changed, 613 insertions(+), 21 deletions(-)
>   create mode 100644 tools/testing/selftests/rseq/basic_numa_test.c
> 

Looks good to me - for selftests:

Reviewed-by: Shuah Khan <skhan@linuxfoundation.org>

thanks,
-- Shuah

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2024-08-22  2:10 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-08-19 14:24 [RFC PATCH 0/5] sched: NUMA-aware concurrency IDs Mathieu Desnoyers
2024-08-19 14:24 ` [RFC PATCH 1/5] lib: Implement find_{first,next,nth}_notandnot_bit, find_first_andnot_bit Mathieu Desnoyers
2024-08-19 19:19   ` Yury Norov
2024-08-20 17:19     ` Mathieu Desnoyers
2024-08-20 20:45       ` Mathieu Desnoyers
2024-08-20 21:15         ` Yury Norov
2024-08-19 14:24 ` [RFC PATCH 2/5] cpumask: Implement cpumask_{first,next}_{not,}andnot Mathieu Desnoyers
2024-08-19 19:24   ` Yury Norov
2024-08-20 17:28     ` Mathieu Desnoyers
2024-08-19 14:24 ` [RFC PATCH 3/5] sched: NUMA-aware per-memory-map concurrency IDs Mathieu Desnoyers
2024-08-19 14:24 ` [RFC PATCH 4/5] selftests/rseq: x86: Implement rseq_load_u32_u32 Mathieu Desnoyers
2024-08-19 14:24 ` [RFC PATCH 5/5] selftests/rseq: Implement NUMA node id vs mm_cid invariant test Mathieu Desnoyers
2024-08-22  2:09 ` [RFC PATCH 0/5] sched: NUMA-aware concurrency IDs Shuah Khan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox