* [RFC][PATCH 00/10] futex: More Futex2 bits
@ 2023-07-14 13:38 Peter Zijlstra
2023-07-14 13:39 ` [RFC][PATCH 01/10] futex: Clarify FUTEX2 flags Peter Zijlstra
` (9 more replies)
0 siblings, 10 replies; 19+ messages in thread
From: Peter Zijlstra @ 2023-07-14 13:38 UTC (permalink / raw)
To: tglx, axboe
Cc: linux-kernel, peterz, mingo, dvhart, dave, andrealmeid,
Andrew Morton, urezki, hch, lstoakes, Arnd Bergmann, linux-api,
linux-mm, linux-arch, malteskarupke
Hi,
Reviewing Jens' series to add io_uring futex support got me looking at futex
again, and I realized the current flags situation is a mess. After cleaning
that up I decided to continue and implement most of the missing flags for
futex2.
I initially also wanted to add a futex_wait() syscall, but given the number and
kind of arguments it needs, that's just not going to work on 32bit.
futex_waitv() will have to do for now.
I've not yet done futex_requeue(), that's even worse than futex_wait() and I
think the only option is to do something like:
sys_futex_requeue(struct futex_waitv __user futexes[2], unsigned int flags, int nr_wake, int nr_requeue)
Where we use struct futex_waitv to specify the two futexes (addr and flags) and
cmp value.
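To make the proposed argument layout concrete, here is a minimal userspace sketch of how a caller might fill the two entries. The struct mirrors the layout of the existing struct futex_waitv; the `_sketch`/`_SKETCH` names and the helper fill_requeue_pair() are hypothetical, and the sketch assumes the compare value travels in the first entry. No syscall is made, since futex_requeue() does not exist in this series.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of struct futex_waitv: val, uaddr, flags, reserved. */
struct futex_waitv_sketch {
	uint64_t val;
	uint64_t uaddr;
	uint32_t flags;
	uint32_t reserved;	/* must be zero */
};

#define FUTEX2_32_SKETCH	0x02u
#define FUTEX2_PRIVATE_SKETCH	128u	/* FUTEX_PRIVATE_FLAG */

/*
 * Hypothetical helper: futexes[0] describes the source futex (address,
 * flags, compare value), futexes[1] the requeue target (address, flags).
 */
static void fill_requeue_pair(struct futex_waitv_sketch futexes[2],
			      uint32_t *from, uint32_t expected, uint32_t *to)
{
	memset(futexes, 0, 2 * sizeof(futexes[0]));

	futexes[0].uaddr = (uint64_t)(uintptr_t)from;
	futexes[0].val   = expected;
	futexes[0].flags = FUTEX2_32_SKETCH | FUTEX2_PRIVATE_SKETCH;

	futexes[1].uaddr = (uint64_t)(uintptr_t)to;
	futexes[1].flags = FUTEX2_32_SKETCH | FUTEX2_PRIVATE_SKETCH;
}
```

Because each entry carries its own flags word, the two futexes could in principle declare different sizes, which is the property the cover letter relies on.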
Additionally, robust futexes can fundamentally only support 32bit unless we go
make more lists.
The 'small' futex support is very limited; especially when combined with
FUTEX2_NUMA, mixing sizes is really not an option. The requeue variant above
would be able to specify different sizes for each futex and might just do.
The whole series is *very* lightly tested, as in, it builds and boots, but I've
not yet written a single line of user code to trigger any of the new paths.
Please handle with care etc.. ;-)
Jens, given you do a completely new futex interface, it probably makes more
sense if you use FUTEX2 flags and a u64 value and mask.
I'm hoping this series clarifies that situation a little instead of making it a
worse mess. Many of these things have been brewing over the past several years
but nobody put it all together before.
* [RFC][PATCH 01/10] futex: Clarify FUTEX2 flags
2023-07-14 13:38 [RFC][PATCH 00/10] futex: More Futex2 bits Peter Zijlstra
@ 2023-07-14 13:39 ` Peter Zijlstra
2023-07-14 13:39 ` [RFC][PATCH 02/10] futex: Extend the " Peter Zijlstra
` (8 subsequent siblings)
9 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2023-07-14 13:39 UTC (permalink / raw)
To: tglx, axboe
Cc: linux-kernel, peterz, mingo, dvhart, dave, andrealmeid,
Andrew Morton, urezki, hch, lstoakes, Arnd Bergmann, linux-api,
linux-mm, linux-arch, malteskarupke
sys_futex_waitv() is part of the futex2 series (the first and only so
far) of syscalls and has a flags field per futex (as opposed to flags
being encoded in the futex op).
This new flags field has a new namespace, which unfortunately isn't
super explicit. Notably it currently takes FUTEX_32 and
FUTEX_PRIVATE_FLAG.
Introduce the FUTEX2 namespace to clarify this.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
include/uapi/linux/futex.h | 16 +++++++++++++---
kernel/futex/syscalls.c | 7 +++----
2 files changed, 16 insertions(+), 7 deletions(-)
--- a/include/uapi/linux/futex.h
+++ b/include/uapi/linux/futex.h
@@ -44,10 +44,20 @@
FUTEX_PRIVATE_FLAG)
/*
- * Flags to specify the bit length of the futex word for futex2 syscalls.
- * Currently, only 32 is supported.
+ * Flags for futex2 syscalls.
*/
-#define FUTEX_32 2
+ /* 0x00 */
+ /* 0x01 */
+#define FUTEX2_32 0x02
+ /* 0x04 */
+ /* 0x08 */
+ /* 0x10 */
+ /* 0x20 */
+ /* 0x40 */
+#define FUTEX2_PRIVATE FUTEX_PRIVATE_FLAG
+
+/* do not use */
+#define FUTEX_32 FUTEX2_32 /* historical accident :-( */
/*
* Max numbers of elements in a futex_waitv array
--- a/kernel/futex/syscalls.c
+++ b/kernel/futex/syscalls.c
@@ -183,8 +183,7 @@ SYSCALL_DEFINE6(futex, u32 __user *, uad
return do_futex(uaddr, op, val, tp, uaddr2, (unsigned long)utime, val3);
}
-/* Mask of available flags for each futex in futex_waitv list */
-#define FUTEXV_WAITER_MASK (FUTEX_32 | FUTEX_PRIVATE_FLAG)
+#define FUTEX2_MASK (FUTEX2_32 | FUTEX2_PRIVATE)
/**
* futex_parse_waitv - Parse a waitv array from userspace
@@ -205,10 +204,10 @@ static int futex_parse_waitv(struct fute
if (copy_from_user(&aux, &uwaitv[i], sizeof(aux)))
return -EFAULT;
- if ((aux.flags & ~FUTEXV_WAITER_MASK) || aux.__reserved)
+ if ((aux.flags & ~FUTEX2_MASK) || aux.__reserved)
return -EINVAL;
- if (!(aux.flags & FUTEX_32))
+ if (!(aux.flags & FUTEX2_32))
return -EINVAL;
futexv[i].w.flags = aux.flags;
* [RFC][PATCH 02/10] futex: Extend the FUTEX2 flags
2023-07-14 13:38 [RFC][PATCH 00/10] futex: More Futex2 bits Peter Zijlstra
2023-07-14 13:39 ` [RFC][PATCH 01/10] futex: Clarify FUTEX2 flags Peter Zijlstra
@ 2023-07-14 13:39 ` Peter Zijlstra
2023-07-14 13:39 ` [RFC][PATCH 03/10] futex: Flag conversion Peter Zijlstra
` (7 subsequent siblings)
9 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2023-07-14 13:39 UTC (permalink / raw)
To: tglx, axboe
Cc: linux-kernel, peterz, mingo, dvhart, dave, andrealmeid,
Andrew Morton, urezki, hch, lstoakes, Arnd Bergmann, linux-api,
linux-mm, linux-arch, malteskarupke
Add the definitions for the missing but always intended extra sizes,
and add a NUMA flag for the planned NUMA extension.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
include/uapi/linux/futex.h | 7 ++++---
kernel/futex/syscalls.c | 4 ++--
2 files changed, 6 insertions(+), 5 deletions(-)
--- a/include/uapi/linux/futex.h
+++ b/include/uapi/linux/futex.h
@@ -46,10 +46,11 @@
/*
* Flags for futex2 syscalls.
*/
- /* 0x00 */
- /* 0x01 */
+#define FUTEX2_8 0x00
+#define FUTEX2_16 0x01
#define FUTEX2_32 0x02
- /* 0x04 */
+#define FUTEX2_64 0x03
+#define FUTEX2_NUMA 0x04
/* 0x08 */
/* 0x10 */
/* 0x20 */
--- a/kernel/futex/syscalls.c
+++ b/kernel/futex/syscalls.c
@@ -183,7 +183,7 @@ SYSCALL_DEFINE6(futex, u32 __user *, uad
return do_futex(uaddr, op, val, tp, uaddr2, (unsigned long)utime, val3);
}
-#define FUTEX2_MASK (FUTEX2_32 | FUTEX2_PRIVATE)
+#define FUTEX2_MASK (FUTEX2_64 | FUTEX2_PRIVATE)
/**
* futex_parse_waitv - Parse a waitv array from userspace
@@ -207,7 +207,7 @@ static int futex_parse_waitv(struct fute
if ((aux.flags & ~FUTEX2_MASK) || aux.__reserved)
return -EINVAL;
- if (!(aux.flags & FUTEX2_32))
+ if ((aux.flags & FUTEX2_64) != FUTEX2_32)
return -EINVAL;
futexv[i].w.flags = aux.flags;
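Why the size check has to change from a bit test to a field comparison can be seen in a small userspace sketch. The constants are copied from the hunks above (FUTEX2_PRIVATE is FUTEX_PRIVATE_FLAG, i.e. 128); old_check() and new_check() are made-up names that only mirror the two variants of the test.

```c
#include <assert.h>
#include <stdbool.h>

#define FUTEX2_8	0x00u
#define FUTEX2_16	0x01u
#define FUTEX2_32	0x02u
#define FUTEX2_64	0x03u	/* doubles as the mask of the two-bit size field */
#define FUTEX2_NUMA	0x04u
#define FUTEX2_PRIVATE	128u	/* FUTEX_PRIVATE_FLAG */

#define FUTEX2_MASK	(FUTEX2_64 | FUTEX2_PRIVATE)

/* Variant before this patch: true whenever bit 0x02 is set. */
static bool old_check(unsigned int flags)
{
	return (flags & FUTEX2_32) != 0;
}

/* Variant after this patch: the size field must be exactly FUTEX2_32. */
static bool new_check(unsigned int flags)
{
	if (flags & ~FUTEX2_MASK)
		return false;

	return (flags & FUTEX2_64) == FUTEX2_32;
}
```

With sizes encoded as a two-bit field, FUTEX2_64 (0x03) has bit 0x02 set, so the old bit test would have accepted it even though only 32-bit futexes are supported at this point in the series.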
* [RFC][PATCH 03/10] futex: Flag conversion
2023-07-14 13:38 [RFC][PATCH 00/10] futex: More Futex2 bits Peter Zijlstra
2023-07-14 13:39 ` [RFC][PATCH 01/10] futex: Clarify FUTEX2 flags Peter Zijlstra
2023-07-14 13:39 ` [RFC][PATCH 02/10] futex: Extend the " Peter Zijlstra
@ 2023-07-14 13:39 ` Peter Zijlstra
2023-07-14 13:39 ` [RFC][PATCH 04/10] futex: Add sys_futex_wake() Peter Zijlstra
` (6 subsequent siblings)
9 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2023-07-14 13:39 UTC (permalink / raw)
To: tglx, axboe
Cc: linux-kernel, peterz, mingo, dvhart, dave, andrealmeid,
Andrew Morton, urezki, hch, lstoakes, Arnd Bergmann, linux-api,
linux-mm, linux-arch, malteskarupke
Futex has 3 sets of flags:
- legacy futex op bits
- futex2 flags
- internal flags
Add a few helpers to convert from the API flags into the internal
flags.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/futex/futex.h | 48 +++++++++++++++++++++++++++++++++++++++++++++---
kernel/futex/syscalls.c | 21 +++++++++++++--------
kernel/futex/waitwake.c | 4 ++--
3 files changed, 60 insertions(+), 13 deletions(-)
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -16,8 +16,15 @@
* Futex flags used to encode options to functions and preserve them across
* restarts.
*/
+#define FLAGS_SIZE_8 0x00
+#define FLAGS_SIZE_16 0x01
+#define FLAGS_SIZE_32 0x02
+#define FLAGS_SIZE_64 0x03
+
+#define FLAGS_SIZE_MASK 0x03
+
#ifdef CONFIG_MMU
-# define FLAGS_SHARED 0x01
+# define FLAGS_SHARED 0x10
#else
/*
* NOMMU does not have per process address space. Let the compiler optimize
@@ -25,8 +32,43 @@
*/
# define FLAGS_SHARED 0x00
#endif
-#define FLAGS_CLOCKRT 0x02
-#define FLAGS_HAS_TIMEOUT 0x04
+#define FLAGS_CLOCKRT 0x20
+#define FLAGS_HAS_TIMEOUT 0x40
+#define FLAGS_NUMA 0x80
+
+/* FUTEX_ to FLAGS_ */
+static inline unsigned int futex_to_flags(unsigned int op)
+{
+ unsigned int flags = FLAGS_SIZE_32;
+
+ if (!(op & FUTEX_PRIVATE_FLAG))
+ flags |= FLAGS_SHARED;
+
+ if (op & FUTEX_CLOCK_REALTIME)
+ flags |= FLAGS_CLOCKRT;
+
+ return flags;
+}
+
+/* FUTEX2_ to FLAGS_ */
+static inline unsigned int futex2_to_flags(unsigned int flags2)
+{
+ unsigned int flags = flags2 & FUTEX2_64;
+
+ if (!(flags2 & FUTEX2_PRIVATE))
+ flags |= FLAGS_SHARED;
+
+ if (flags2 & FUTEX2_NUMA)
+ flags |= FLAGS_NUMA;
+
+ return flags;
+}
+
+static inline unsigned int futex_size(unsigned int flags)
+{
+ unsigned int size = flags & FLAGS_SIZE_MASK;
+ return 1 << size; /* {0,1,2,3} -> {1,2,4,8} */
+}
#ifdef CONFIG_FAIL_FUTEX
extern bool should_fail_futex(bool fshared);
--- a/kernel/futex/syscalls.c
+++ b/kernel/futex/syscalls.c
@@ -85,15 +85,12 @@ SYSCALL_DEFINE3(get_robust_list, int, pi
long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
u32 __user *uaddr2, u32 val2, u32 val3)
{
+ unsigned int flags = futex_to_flags(op);
int cmd = op & FUTEX_CMD_MASK;
- unsigned int flags = 0;
- if (!(op & FUTEX_PRIVATE_FLAG))
- flags |= FLAGS_SHARED;
-
- if (op & FUTEX_CLOCK_REALTIME) {
- flags |= FLAGS_CLOCKRT;
- if (cmd != FUTEX_WAIT_BITSET && cmd != FUTEX_WAIT_REQUEUE_PI &&
+ if (flags & FLAGS_CLOCKRT) {
+ if (cmd != FUTEX_WAIT_BITSET &&
+ cmd != FUTEX_WAIT_REQUEUE_PI &&
cmd != FUTEX_LOCK_PI2)
return -ENOSYS;
}
@@ -201,6 +198,8 @@ static int futex_parse_waitv(struct fute
unsigned int i;
for (i = 0; i < nr_futexes; i++) {
+ unsigned int bits, flags;
+
if (copy_from_user(&aux, &uwaitv[i], sizeof(aux)))
return -EFAULT;
@@ -210,7 +209,13 @@ static int futex_parse_waitv(struct fute
if ((aux.flags & FUTEX2_64) != FUTEX2_32)
return -EINVAL;
- futexv[i].w.flags = aux.flags;
+ flags = futex2_to_flags(aux.flags);
+ bits = 8 * futex_size(flags);
+
+ if (bits < 64 && aux.val >> bits)
+ return -EINVAL;
+
+ futexv[i].w.flags = flags;
futexv[i].w.val = aux.val;
futexv[i].w.uaddr = aux.uaddr;
futexv[i].q = futex_q_init;
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -419,11 +419,11 @@ static int futex_wait_multiple_setup(str
*/
retry:
for (i = 0; i < count; i++) {
- if ((vs[i].w.flags & FUTEX_PRIVATE_FLAG) && retry)
+ if (!(vs[i].w.flags & FLAGS_SHARED) && retry)
continue;
ret = get_futex_key(u64_to_user_ptr(vs[i].w.uaddr),
- !(vs[i].w.flags & FUTEX_PRIVATE_FLAG),
+ vs[i].w.flags & FLAGS_SHARED,
&vs[i].q.key, FUTEX_READ);
if (unlikely(ret))
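As a userspace illustration of the helpers added here, the sketch below copies futex2_to_flags() and futex_size() from the hunk above, assuming CONFIG_MMU so that FLAGS_SHARED is 0x10. value_fits() is a hypothetical wrapper around the new width check and is not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FUTEX2_32	0x02u
#define FUTEX2_64	0x03u
#define FUTEX2_NUMA	0x04u
#define FUTEX2_PRIVATE	128u	/* FUTEX_PRIVATE_FLAG */

#define FLAGS_SIZE_MASK	0x03u
#define FLAGS_SHARED	0x10u	/* CONFIG_MMU case */
#define FLAGS_NUMA	0x80u

/* FUTEX2_ to FLAGS_, as in the patch. */
static unsigned int futex2_to_flags(unsigned int flags2)
{
	unsigned int flags = flags2 & FUTEX2_64;

	if (!(flags2 & FUTEX2_PRIVATE))
		flags |= FLAGS_SHARED;

	if (flags2 & FUTEX2_NUMA)
		flags |= FLAGS_NUMA;

	return flags;
}

/* Size in bytes: {0,1,2,3} -> {1,2,4,8}. */
static unsigned int futex_size(unsigned int flags)
{
	return 1u << (flags & FLAGS_SIZE_MASK);
}

/* Hypothetical wrapper: does val fit in a futex word of this size? */
static bool value_fits(uint64_t val, unsigned int flags)
{
	unsigned int bits = 8 * futex_size(flags);

	/* Shifting a u64 by 64 is undefined in C, hence the bits < 64 guard. */
	return !(bits < 64 && (val >> bits));
}
```

Note that FUTEX2_PRIVATE and FLAGS_NUMA share the numeric value 0x80 in their respective namespaces, so the internal flags cannot be compared against the API flags directly.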
* [RFC][PATCH 04/10] futex: Add sys_futex_wake()
2023-07-14 13:38 [RFC][PATCH 00/10] futex: More Futex2 bits Peter Zijlstra
` (2 preceding siblings ...)
2023-07-14 13:39 ` [RFC][PATCH 03/10] futex: Flag conversion Peter Zijlstra
@ 2023-07-14 13:39 ` Peter Zijlstra
2023-07-14 14:26 ` Arnd Bergmann
2023-07-14 13:39 ` [RFC][PATCH 05/10] mm: Add vmalloc_huge_node() Peter Zijlstra
` (5 subsequent siblings)
9 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2023-07-14 13:39 UTC (permalink / raw)
To: tglx, axboe
Cc: linux-kernel, peterz, mingo, dvhart, dave, andrealmeid,
Andrew Morton, urezki, hch, lstoakes, Arnd Bergmann, linux-api,
linux-mm, linux-arch, malteskarupke
To complement sys_futex_waitv() add sys_futex_wake(). Together they
provide the basic Futex2 WAIT/WAKE functionality.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
arch/alpha/kernel/syscalls/syscall.tbl | 1
arch/arm/tools/syscall.tbl | 1
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1
arch/m68k/kernel/syscalls/syscall.tbl | 1
arch/microblaze/kernel/syscalls/syscall.tbl | 1
arch/mips/kernel/syscalls/syscall_n32.tbl | 1
arch/mips/kernel/syscalls/syscall_n64.tbl | 1
arch/mips/kernel/syscalls/syscall_o32.tbl | 1
arch/parisc/kernel/syscalls/syscall.tbl | 1
arch/powerpc/kernel/syscalls/syscall.tbl | 1
arch/s390/kernel/syscalls/syscall.tbl | 1
arch/sh/kernel/syscalls/syscall.tbl | 1
arch/sparc/kernel/syscalls/syscall.tbl | 1
arch/x86/entry/syscalls/syscall_32.tbl | 1
arch/x86/entry/syscalls/syscall_64.tbl | 1
arch/xtensa/kernel/syscalls/syscall.tbl | 1
include/linux/syscalls.h | 3 ++
include/uapi/asm-generic/unistd.h | 5 ++-
kernel/futex/syscalls.c | 37 ++++++++++++++++++++++++++++
kernel/sys_ni.c | 1
21 files changed, 62 insertions(+), 2 deletions(-)
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -491,3 +491,4 @@
559 common futex_waitv sys_futex_waitv
560 common set_mempolicy_home_node sys_ni_syscall
561 common cachestat sys_cachestat
+562 common futex_wake sys_futex_wake
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -465,3 +465,4 @@
449 common futex_waitv sys_futex_waitv
450 common set_mempolicy_home_node sys_set_mempolicy_home_node
451 common cachestat sys_cachestat
+452 common futex_wake sys_futex_wake
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -909,6 +909,8 @@ __SYSCALL(__NR_futex_waitv, sys_futex_wa
__SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
#define __NR_cachestat 451
__SYSCALL(__NR_cachestat, sys_cachestat)
+#define __NR_futex_wake 452
+__SYSCALL(__NR_futex_wake, sys_futex_wake)
/*
* Please add new compat syscalls above this comment and update
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -372,3 +372,4 @@
449 common futex_waitv sys_futex_waitv
450 common set_mempolicy_home_node sys_set_mempolicy_home_node
451 common cachestat sys_cachestat
+452 common futex_wake sys_futex_wake
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -451,3 +451,4 @@
449 common futex_waitv sys_futex_waitv
450 common set_mempolicy_home_node sys_set_mempolicy_home_node
451 common cachestat sys_cachestat
+452 common futex_wake sys_futex_wake
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -457,3 +457,4 @@
449 common futex_waitv sys_futex_waitv
450 common set_mempolicy_home_node sys_set_mempolicy_home_node
451 common cachestat sys_cachestat
+452 common futex_wake sys_futex_wake
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -390,3 +390,4 @@
449 n32 futex_waitv sys_futex_waitv
450 n32 set_mempolicy_home_node sys_set_mempolicy_home_node
451 n32 cachestat sys_cachestat
+452 n32 futex_wake sys_futex_wake
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -366,3 +366,4 @@
449 n64 futex_waitv sys_futex_waitv
450 common set_mempolicy_home_node sys_set_mempolicy_home_node
451 n64 cachestat sys_cachestat
+452 n64 futex_wake sys_futex_wake
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -439,3 +439,4 @@
449 o32 futex_waitv sys_futex_waitv
450 o32 set_mempolicy_home_node sys_set_mempolicy_home_node
451 o32 cachestat sys_cachestat
+452 o32 futex_wake sys_futex_wake
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -450,3 +450,4 @@
449 common futex_waitv sys_futex_waitv
450 common set_mempolicy_home_node sys_set_mempolicy_home_node
451 common cachestat sys_cachestat
+452 common futex_wake sys_futex_wake
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -538,3 +538,4 @@
449 common futex_waitv sys_futex_waitv
450 nospu set_mempolicy_home_node sys_set_mempolicy_home_node
451 common cachestat sys_cachestat
+452 common futex_wake sys_futex_wake
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -454,3 +454,4 @@
449 common futex_waitv sys_futex_waitv sys_futex_waitv
450 common set_mempolicy_home_node sys_set_mempolicy_home_node sys_set_mempolicy_home_node
451 common cachestat sys_cachestat sys_cachestat
+452 common futex_wake sys_futex_wake sys_futex_wake
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -454,3 +454,4 @@
449 common futex_waitv sys_futex_waitv
450 common set_mempolicy_home_node sys_set_mempolicy_home_node
451 common cachestat sys_cachestat
+452 common futex_wake sys_futex_wake
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -497,3 +497,4 @@
449 common futex_waitv sys_futex_waitv
450 common set_mempolicy_home_node sys_set_mempolicy_home_node
451 common cachestat sys_cachestat
+452 common futex_wake sys_futex_wake
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -456,3 +456,4 @@
449 i386 futex_waitv sys_futex_waitv
450 i386 set_mempolicy_home_node sys_set_mempolicy_home_node
451 i386 cachestat sys_cachestat
+452 i386 futex_wake sys_futex_wake
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -373,6 +373,7 @@
449 common futex_waitv sys_futex_waitv
450 common set_mempolicy_home_node sys_set_mempolicy_home_node
451 common cachestat sys_cachestat
+452 common futex_wake sys_futex_wake
#
# Due to a historical design error, certain syscalls are numbered differently
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -422,3 +422,4 @@
449 common futex_waitv sys_futex_waitv
450 common set_mempolicy_home_node sys_set_mempolicy_home_node
451 common cachestat sys_cachestat
+452 common futex_wake sys_futex_wake
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -563,6 +563,9 @@ asmlinkage long sys_set_robust_list(stru
asmlinkage long sys_futex_waitv(struct futex_waitv *waiters,
unsigned int nr_futexes, unsigned int flags,
struct __kernel_timespec __user *timeout, clockid_t clockid);
+
+asmlinkage long sys_futex_wake(void __user *uaddr, int nr, unsigned int flags, u64 mask);
+
asmlinkage long sys_nanosleep(struct __kernel_timespec __user *rqtp,
struct __kernel_timespec __user *rmtp);
asmlinkage long sys_nanosleep_time32(struct old_timespec32 __user *rqtp,
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -816,12 +816,13 @@ __SYSCALL(__NR_process_mrelease, sys_pro
__SYSCALL(__NR_futex_waitv, sys_futex_waitv)
#define __NR_set_mempolicy_home_node 450
__SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
-
#define __NR_cachestat 451
__SYSCALL(__NR_cachestat, sys_cachestat)
+#define __NR_futex_wake 452
+__SYSCALL(__NR_futex_wake, sys_futex_wake)
#undef __NR_syscalls
-#define __NR_syscalls 452
+#define __NR_syscalls 453
/*
* 32 bit systems traditionally used different
--- a/kernel/futex/syscalls.c
+++ b/kernel/futex/syscalls.c
@@ -309,6 +309,43 @@ SYSCALL_DEFINE5(futex_waitv, struct fute
return ret;
}
+/*
+ * sys_futex_wake - Wake a number of futexes
+ * @uaddr: Address of the futex(es) to wake
+ * @nr: Number of the futexes to wake
+ * @flags: FUTEX2 flags
+ * @mask: bitmask
+ *
+ * Identical to the traditional FUTEX_WAKE_BITSET op, except it matches futex_waitv() above
+ * in that it enables u64 futex values and has a new flags set.
+ *
+ * NOTE: u64 futexes are not actually supported yet, but both these interfaces
+ * should allow for this to happen.
+ */
+
+SYSCALL_DEFINE4(futex_wake,
+ void __user *, uaddr,
+ int, nr,
+ unsigned int, flags,
+ u64, mask)
+{
+ int bits;
+
+ if (flags & ~FUTEX2_MASK)
+ return -EINVAL;
+
+ if ((flags & FUTEX2_64) != FUTEX2_32)
+ return -EINVAL;
+
+ flags = futex2_to_flags(flags);
+ bits = 8 * futex_size(flags);
+
+ if (bits < 64 && mask >> bits)
+ return -EINVAL;
+
+ return futex_wake(uaddr, flags, nr, mask);
+}
+
#ifdef CONFIG_COMPAT
COMPAT_SYSCALL_DEFINE2(set_robust_list,
struct compat_robust_list_head __user *, head,
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -87,6 +87,7 @@ COND_SYSCALL_COMPAT(set_robust_list);
COND_SYSCALL(get_robust_list);
COND_SYSCALL_COMPAT(get_robust_list);
COND_SYSCALL(futex_waitv);
+COND_SYSCALL(futex_wake);
COND_SYSCALL(kexec_load);
COND_SYSCALL_COMPAT(kexec_load);
COND_SYSCALL(init_module);
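The mask semantics inherited from FUTEX_WAKE_BITSET can be sketched in userspace as follows: a waiter is eligible when its bitset shares at least one bit with the wake mask, and an empty mask is rejected. struct waiter_sketch and wake_matching() are hypothetical stand-ins for the kernel's queue walk, not the kernel code, and the sketch only models nr >= 1.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a queued waiter. */
struct waiter_sketch {
	uint32_t bitset;	/* bits the waiter registered interest in */
	int woken;
};

/* Wake up to nr not-yet-woken waiters whose bitset intersects mask. */
static int wake_matching(struct waiter_sketch *w, size_t count,
			 int nr, uint32_t mask)
{
	int ret = 0;
	size_t i;

	if (!mask)
		return -EINVAL;	/* futex_wake() rejects an empty bitset */

	for (i = 0; i < count && ret < nr; i++) {
		if (w[i].woken || !(w[i].bitset & mask))
			continue;

		w[i].woken = 1;
		ret++;
	}

	return ret;
}
```

For example, with waiters registered on bitsets 0x1, 0x2 and 0x3, a wake with mask 0x2 reaches only the second and third.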
* [RFC][PATCH 05/10] mm: Add vmalloc_huge_node()
2023-07-14 13:38 [RFC][PATCH 00/10] futex: More Futex2 bits Peter Zijlstra
` (3 preceding siblings ...)
2023-07-14 13:39 ` [RFC][PATCH 04/10] futex: Add sys_futex_wake() Peter Zijlstra
@ 2023-07-14 13:39 ` Peter Zijlstra
2023-07-14 14:37 ` Matthew Wilcox
2023-07-14 13:39 ` [RFC][PATCH 06/10] futex: Propagate flags into get_futex_key() Peter Zijlstra
` (4 subsequent siblings)
9 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2023-07-14 13:39 UTC (permalink / raw)
To: tglx, axboe
Cc: linux-kernel, peterz, mingo, dvhart, dave, andrealmeid,
Andrew Morton, urezki, hch, lstoakes, Arnd Bergmann, linux-api,
linux-mm, linux-arch, malteskarupke
Add vmalloc_huge_node() to enable node-specific hash tables.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
include/linux/vmalloc.h | 1 +
mm/vmalloc.c | 11 ++++++++---
2 files changed, 9 insertions(+), 3 deletions(-)
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -152,6 +152,7 @@ extern void *__vmalloc_node_range(unsign
void *__vmalloc_node(unsigned long size, unsigned long align, gfp_t gfp_mask,
int node, const void *caller) __alloc_size(1);
void *vmalloc_huge(unsigned long size, gfp_t gfp_mask) __alloc_size(1);
+void *vmalloc_huge_node(unsigned long size, gfp_t gfp_mask, int node) __alloc_size(1);
extern void *__vmalloc_array(size_t n, size_t size, gfp_t flags) __alloc_size(1, 2);
extern void *vmalloc_array(size_t n, size_t size) __alloc_size(1, 2);
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3416,6 +3416,13 @@ void *vmalloc(unsigned long size)
}
EXPORT_SYMBOL(vmalloc);
+void *vmalloc_huge_node(unsigned long size, gfp_t gfp_mask, int node)
+{
+ return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
+ gfp_mask, PAGE_KERNEL, VM_ALLOW_HUGE_VMAP,
+ node, __builtin_return_address(0));
+}
+
/**
* vmalloc_huge - allocate virtually contiguous memory, allow huge pages
* @size: allocation size
@@ -3430,9 +3437,7 @@ EXPORT_SYMBOL(vmalloc);
*/
void *vmalloc_huge(unsigned long size, gfp_t gfp_mask)
{
- return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
- gfp_mask, PAGE_KERNEL, VM_ALLOW_HUGE_VMAP,
- NUMA_NO_NODE, __builtin_return_address(0));
+ return vmalloc_huge_node(size, gfp_mask, NUMA_NO_NODE);
}
EXPORT_SYMBOL_GPL(vmalloc_huge);
* [RFC][PATCH 06/10] futex: Propagate flags into get_futex_key()
2023-07-14 13:38 [RFC][PATCH 00/10] futex: More Futex2 bits Peter Zijlstra
` (4 preceding siblings ...)
2023-07-14 13:39 ` [RFC][PATCH 05/10] mm: Add vmalloc_huge_node() Peter Zijlstra
@ 2023-07-14 13:39 ` Peter Zijlstra
2023-07-14 13:39 ` [RFC][PATCH 07/10] futex: Implement FUTEX2_NUMA Peter Zijlstra
` (3 subsequent siblings)
9 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2023-07-14 13:39 UTC (permalink / raw)
To: tglx, axboe
Cc: linux-kernel, peterz, mingo, dvhart, dave, andrealmeid,
Andrew Morton, urezki, hch, lstoakes, Arnd Bergmann, linux-api,
linux-mm, linux-arch, malteskarupke
Instead of only passing FLAGS_SHARED as a boolean, pass down flags as
a whole.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/futex/core.c | 5 ++++-
kernel/futex/futex.h | 2 +-
kernel/futex/pi.c | 4 ++--
kernel/futex/requeue.c | 6 +++---
kernel/futex/waitwake.c | 15 ++++++++-------
5 files changed, 18 insertions(+), 14 deletions(-)
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -217,7 +217,7 @@ static u64 get_inode_sequence_number(str
*
* lock_page() might sleep, the caller should not hold a spinlock.
*/
-int get_futex_key(u32 __user *uaddr, bool fshared, union futex_key *key,
+int get_futex_key(u32 __user *uaddr, unsigned int flags, union futex_key *key,
enum futex_access rw)
{
unsigned long address = (unsigned long)uaddr;
@@ -225,6 +225,9 @@ int get_futex_key(u32 __user *uaddr, boo
struct page *page, *tail;
struct address_space *mapping;
int err, ro = 0;
+ bool fshared;
+
+ fshared = flags & FLAGS_SHARED;
/*
* The futex address must be "naturally" aligned.
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -158,7 +158,7 @@ enum futex_access {
FUTEX_WRITE
};
-extern int get_futex_key(u32 __user *uaddr, bool fshared, union futex_key *key,
+extern int get_futex_key(u32 __user *uaddr, unsigned int flags, union futex_key *key,
enum futex_access rw);
extern struct hrtimer_sleeper *
--- a/kernel/futex/pi.c
+++ b/kernel/futex/pi.c
@@ -945,7 +945,7 @@ int futex_lock_pi(u32 __user *uaddr, uns
to = futex_setup_timer(time, &timeout, flags, 0);
retry:
- ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q.key, FUTEX_WRITE);
+ ret = get_futex_key(uaddr, flags, &q.key, FUTEX_WRITE);
if (unlikely(ret != 0))
goto out;
@@ -1117,7 +1117,7 @@ int futex_unlock_pi(u32 __user *uaddr, u
if ((uval & FUTEX_TID_MASK) != vpid)
return -EPERM;
- ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key, FUTEX_WRITE);
+ ret = get_futex_key(uaddr, flags, &key, FUTEX_WRITE);
if (ret)
return ret;
--- a/kernel/futex/requeue.c
+++ b/kernel/futex/requeue.c
@@ -424,10 +424,10 @@ int futex_requeue(u32 __user *uaddr1, un
}
retry:
- ret = get_futex_key(uaddr1, flags & FLAGS_SHARED, &key1, FUTEX_READ);
+ ret = get_futex_key(uaddr1, flags, &key1, FUTEX_READ);
if (unlikely(ret != 0))
return ret;
- ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2,
+ ret = get_futex_key(uaddr2, flags, &key2,
requeue_pi ? FUTEX_WRITE : FUTEX_READ);
if (unlikely(ret != 0))
return ret;
@@ -789,7 +789,7 @@ int futex_wait_requeue_pi(u32 __user *ua
*/
rt_mutex_init_waiter(&rt_waiter);
- ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2, FUTEX_WRITE);
+ ret = get_futex_key(uaddr2, flags, &key2, FUTEX_WRITE);
if (unlikely(ret != 0))
goto out;
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -145,13 +145,13 @@ int futex_wake(u32 __user *uaddr, unsign
struct futex_hash_bucket *hb;
struct futex_q *this, *next;
union futex_key key = FUTEX_KEY_INIT;
- int ret;
DEFINE_WAKE_Q(wake_q);
+ int ret;
if (!bitset)
return -EINVAL;
- ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key, FUTEX_READ);
+ ret = get_futex_key(uaddr, flags, &key, FUTEX_READ);
if (unlikely(ret != 0))
return ret;
@@ -245,10 +245,10 @@ int futex_wake_op(u32 __user *uaddr1, un
DEFINE_WAKE_Q(wake_q);
retry:
- ret = get_futex_key(uaddr1, flags & FLAGS_SHARED, &key1, FUTEX_READ);
+ ret = get_futex_key(uaddr1, flags, &key1, FUTEX_READ);
if (unlikely(ret != 0))
return ret;
- ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2, FUTEX_WRITE);
+ ret = get_futex_key(uaddr2, flags, &key2, FUTEX_WRITE);
if (unlikely(ret != 0))
return ret;
@@ -423,7 +423,7 @@ static int futex_wait_multiple_setup(str
continue;
ret = get_futex_key(u64_to_user_ptr(vs[i].w.uaddr),
- vs[i].w.flags & FLAGS_SHARED,
+ vs[i].w.flags,
&vs[i].q.key, FUTEX_READ);
if (unlikely(ret))
@@ -435,7 +435,8 @@ static int futex_wait_multiple_setup(str
for (i = 0; i < count; i++) {
u32 __user *uaddr = (u32 __user *)(unsigned long)vs[i].w.uaddr;
struct futex_q *q = &vs[i].q;
- u32 val = (u32)vs[i].w.val;
+ unsigned int flags = vs[i].w.flags;
+ u32 val = vs[i].w.val;
hb = futex_q_lock(q);
ret = futex_get_value_locked(&uval, uaddr);
@@ -599,7 +600,7 @@ int futex_wait_setup(u32 __user *uaddr,
* while the syscall executes.
*/
retry:
- ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q->key, FUTEX_READ);
+ ret = get_futex_key(uaddr, flags, &q->key, FUTEX_READ);
if (unlikely(ret != 0))
return ret;
* [RFC][PATCH 07/10] futex: Implement FUTEX2_NUMA
2023-07-14 13:38 [RFC][PATCH 00/10] futex: More Futex2 bits Peter Zijlstra
` (5 preceding siblings ...)
2023-07-14 13:39 ` [RFC][PATCH 06/10] futex: Propagate flags into get_futex_key() Peter Zijlstra
@ 2023-07-14 13:39 ` Peter Zijlstra
2023-07-14 15:22 ` Peter Zijlstra
2023-07-14 13:39 ` [RFC][PATCH 08/10] futex: Propagate flags into futex_get_value_locked() Peter Zijlstra
` (2 subsequent siblings)
9 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2023-07-14 13:39 UTC (permalink / raw)
To: tglx, axboe
Cc: linux-kernel, peterz, mingo, dvhart, dave, andrealmeid,
Andrew Morton, urezki, hch, lstoakes, Arnd Bergmann, linux-api,
linux-mm, linux-arch, malteskarupke
Extend the futex2 interface to be numa aware.
When FUTEX2_NUMA is specified for a futex, the user value is extended
to two words (of the same size). The first is the user value we all
know, the second one will be the node to place this futex on.
struct futex_numa_32 {
u32 val;
u32 node;
};
When node is set to ~0, WAIT will set it to the current node_id such
that WAKE knows where to find it. If userspace corrupts the node value
between WAIT and WAKE, the futex will not be found and no wakeup will
happen.
When FUTEX2_NUMA is not set, the node is simply an extension of the
hash, such that traditional futexes are still interleaved over the
nodes.
This is done to avoid having to have a separate !numa hash-table.
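The node selection described above can be sketched in userspace as below. All names (struct futex_numa_32_sketch, resolve_wait_node(), pick_node(), pick_bucket()) are hypothetical; the node count and current node are plain parameters rather than kernel queries, the hash is an arbitrary input, and the sketch assumes the shift equals log2 of the per-node table size. It only illustrates the arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace mirror of the two-word layout from the commit message. */
struct futex_numa_32_sketch {
	uint32_t val;
	uint32_t node;	/* ~0 means "let WAIT pick the current node" */
};

/*
 * FUTEX2_NUMA case: WAIT replaces ~0 with the current node so that WAKE
 * can find the futex. Returns the node, or -1 if it is out of range
 * (where the patch returns -EINVAL).
 */
static int resolve_wait_node(struct futex_numa_32_sketch *f,
			     uint32_t current_node, uint32_t nr_nodes)
{
	if (f->node == UINT32_MAX)
		f->node = current_node;

	if (f->node >= nr_nodes)
		return -1;

	return (int)f->node;
}

/* !FUTEX2_NUMA case: otherwise unused hash bits select the node. */
static uint32_t pick_node(uint32_t hash, unsigned int hashshift,
			  uint32_t nr_nodes)
{
	return (hash >> hashshift) % nr_nodes;
}

/* Bucket index within the per-node table; hashsize is a power of two. */
static uint32_t pick_bucket(uint32_t hash, uint32_t hashsize)
{
	return hash & (hashsize - 1);
}
```

Under these assumptions, a hash of 0x12345 with 256 buckets per node and 4 nodes lands in bucket 0x45 on node 3, so non-NUMA futexes still spread across all nodes.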
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
include/linux/futex.h | 3 +
kernel/futex/core.c | 125 +++++++++++++++++++++++++++++++++++++++---------
kernel/futex/futex.h | 2
kernel/futex/syscalls.c | 2
4 files changed, 107 insertions(+), 25 deletions(-)
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -34,6 +34,7 @@ union futex_key {
u64 i_seq;
unsigned long pgoff;
unsigned int offset;
+ /* unsigned int node; */
} shared;
struct {
union {
@@ -42,11 +43,13 @@ union futex_key {
};
unsigned long address;
unsigned int offset;
+ /* unsigned int node; */
} private;
struct {
u64 ptr;
unsigned long word;
unsigned int offset;
+ unsigned int node; /* NOT hashed! */
} both;
};
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -47,12 +47,14 @@
* reside in the same cacheline.
*/
static struct {
- struct futex_hash_bucket *queues;
unsigned long hashsize;
+ unsigned int hashshift;
+ struct futex_hash_bucket *queues[MAX_NUMNODES];
} __futex_data __read_mostly __aligned(2*sizeof(long));
-#define futex_queues (__futex_data.queues)
-#define futex_hashsize (__futex_data.hashsize)
+#define futex_hashsize (__futex_data.hashsize)
+#define futex_hashshift (__futex_data.hashshift)
+#define futex_queues (__futex_data.queues)
/*
* Fault injections for futexes.
@@ -105,6 +107,26 @@ late_initcall(fail_futex_debugfs);
#endif /* CONFIG_FAIL_FUTEX */
+static int futex_get_value(u32 *val, u32 __user *from, unsigned int flags)
+{
+ switch (futex_size(flags)) {
+ case 1: return __get_user(*val, (u8 __user *)from);
+ case 2: return __get_user(*val, (u16 __user *)from);
+ case 4: return __get_user(*val, (u32 __user *)from);
+ default: BUG();
+ }
+}
+
+static int futex_put_value(u32 val, u32 __user *to, unsigned int flags)
+{
+ switch (futex_size(flags)) {
+ case 1: return __put_user(val, (u8 __user *)to);
+ case 2: return __put_user(val, (u16 __user *)to);
+ case 4: return __put_user(val, (u32 __user *)to);
+ default: BUG();
+ }
+}
+
/**
* futex_hash - Return the hash bucket in the global hash
* @key: Pointer to the futex key for which the hash is calculated
@@ -114,10 +136,20 @@ late_initcall(fail_futex_debugfs);
*/
struct futex_hash_bucket *futex_hash(union futex_key *key)
{
- u32 hash = jhash2((u32 *)key, offsetof(typeof(*key), both.offset) / 4,
+ u32 hash = jhash2((u32 *)key,
+ offsetof(typeof(*key), both.offset) / sizeof(u32),
key->both.offset);
+ int node = key->both.node;
+
+ if (node == -1) {
+ /*
+ * In case of !FLAGS_NUMA, use some unused hash bits to pick a
+ * node.
+ */
+ node = (hash >> futex_hashshift) % num_possible_nodes();
+ }
- return &futex_queues[hash & (futex_hashsize - 1)];
+ return &futex_queues[node][hash & (futex_hashsize - 1)];
}
@@ -217,32 +249,64 @@ static u64 get_inode_sequence_number(str
*
* lock_page() might sleep, the caller should not hold a spinlock.
*/
-int get_futex_key(u32 __user *uaddr, unsigned int flags, union futex_key *key,
+int get_futex_key(void __user *uaddr, unsigned int flags, union futex_key *key,
enum futex_access rw)
{
unsigned long address = (unsigned long)uaddr;
struct mm_struct *mm = current->mm;
struct page *page, *tail;
struct address_space *mapping;
- int err, ro = 0;
+ int node, err, size, ro = 0;
bool fshared;
fshared = flags & FLAGS_SHARED;
+ size = futex_size(flags);
/*
* The futex address must be "naturally" aligned.
*/
key->both.offset = address % PAGE_SIZE;
- if (unlikely((address % sizeof(u32)) != 0))
+ if (unlikely((address % size) != 0))
return -EINVAL;
address -= key->both.offset;
- if (unlikely(!access_ok(uaddr, sizeof(u32))))
+ if (flags & FLAGS_NUMA)
+ size *= 2;
+
+ if (unlikely(!access_ok(uaddr, size)))
return -EFAULT;
if (unlikely(should_fail_futex(fshared)))
return -EFAULT;
+ key->both.node = -1;
+ if (flags & FLAGS_NUMA) {
+ void __user *naddr = uaddr + size/2;
+
+ if (futex_get_value(&node, naddr, flags))
+ return -EFAULT;
+
+ if (node == -1) {
+ node = numa_node_id();
+ if (futex_put_value(node, naddr, flags))
+ return -EFAULT;
+ }
+
+ if (node >= num_possible_nodes())
+ return -EINVAL;
+
+ key->both.node = node;
+ }
+
+ /*
+ * Encode the futex size in the offset. This makes cross-size
+ * wake-wait fail -- see futex_match().
+ *
+ * NOTE that cross-size wake-wait is fundamentally broken wrt
+ * FLAGS_NUMA but could possibly work for !NUMA.
+ */
+ key->both.offset |= FUT_OFF_SIZE * (flags & FLAGS_SIZE_MASK);
+
/*
* PROCESS_PRIVATE futexes are fast.
* As the mm cannot disappear under us and the 'key' only needs
@@ -1125,27 +1189,42 @@ void futex_exit_release(struct task_stru
static int __init futex_init(void)
{
- unsigned int futex_shift;
- unsigned long i;
+ unsigned int order, n;
+ unsigned long size, i;
#if CONFIG_BASE_SMALL
futex_hashsize = 16;
#else
- futex_hashsize = roundup_pow_of_two(256 * num_possible_cpus());
+ futex_hashsize = 256 * num_possible_cpus();
+ futex_hashsize /= num_possible_nodes();
+ futex_hashsize = roundup_pow_of_two(futex_hashsize);
#endif
+ futex_hashshift = ilog2(futex_hashsize);
+ size = sizeof(struct futex_hash_bucket) * futex_hashsize;
+ order = get_order(size);
+
+ for_each_node(n) {
+ struct futex_hash_bucket *table;
+
+ if (order > MAX_ORDER)
+ table = vmalloc_huge_node(size, GFP_KERNEL, n);
+ else
+ table = alloc_pages_exact_nid(n, size, GFP_KERNEL);
+
+ BUG_ON(!table);
+
+ for (i = 0; i < futex_hashsize; i++) {
+ atomic_set(&table[i].waiters, 0);
+ spin_lock_init(&table[i].lock);
+ plist_head_init(&table[i].chain);
+ }
- futex_queues = alloc_large_system_hash("futex", sizeof(*futex_queues),
- futex_hashsize, 0,
- futex_hashsize < 256 ? HASH_SMALL : 0,
- &futex_shift, NULL,
- futex_hashsize, futex_hashsize);
- futex_hashsize = 1UL << futex_shift;
-
- for (i = 0; i < futex_hashsize; i++) {
- atomic_set(&futex_queues[i].waiters, 0);
- plist_head_init(&futex_queues[i].chain);
- spin_lock_init(&futex_queues[i].lock);
+ futex_queues[n] = table;
}
+ pr_info("futex hash table, %d nodes, %ld entries (order: %d, %lu bytes)\n",
+ num_possible_nodes(),
+ futex_hashsize, order,
+ sizeof(struct futex_hash_bucket) * futex_hashsize);
return 0;
}
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -158,7 +158,7 @@ enum futex_access {
FUTEX_WRITE
};
-extern int get_futex_key(u32 __user *uaddr, unsigned int flags, union futex_key *key,
+extern int get_futex_key(void __user *uaddr, unsigned int flags, union futex_key *key,
enum futex_access rw);
extern struct hrtimer_sleeper *
--- a/kernel/futex/syscalls.c
+++ b/kernel/futex/syscalls.c
@@ -180,7 +180,7 @@ SYSCALL_DEFINE6(futex, u32 __user *, uad
return do_futex(uaddr, op, val, tp, uaddr2, (unsigned long)utime, val3);
}
-#define FUTEX2_MASK (FUTEX2_64 | FUTEX2_PRIVATE)
+#define FUTEX2_MASK (FUTEX2_64 | FUTEX2_NUMA | FUTEX2_PRIVATE)
/**
* futex_parse_waitv - Parse a waitv array from userspace
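
A minimal userspace sketch of the per-node table geometry and bucket selection that futex_init() and futex_hash() set up above. The helper names (roundup_pow2, ilog2_ul, hash_geometry_init, pick_node, pick_bucket) are hypothetical stand-ins, not kernel API; only the arithmetic is taken from the patch.

```c
#include <stdint.h>

/* Hypothetical stand-in for the kernel's roundup_pow_of_two(). */
static unsigned long roundup_pow2(unsigned long v)
{
	unsigned long r = 1;

	while (r < v)
		r <<= 1;
	return r;
}

/* Hypothetical stand-in for the kernel's ilog2(); v must be non-zero. */
static unsigned int ilog2_ul(unsigned long v)
{
	unsigned int shift = 0;

	while (v >>= 1)
		shift++;
	return shift;
}

struct hash_geometry {
	unsigned long hashsize;		/* buckets per node, a power of two */
	unsigned int hashshift;		/* ilog2(hashsize) */
	unsigned int nr_nodes;		/* number of possible nodes */
};

/* Same arithmetic as the !CONFIG_BASE_SMALL branch of futex_init(). */
static struct hash_geometry hash_geometry_init(unsigned int cpus,
					       unsigned int nodes)
{
	struct hash_geometry g;

	g.hashsize = roundup_pow2(256UL * cpus / nodes);
	g.hashshift = ilog2_ul(g.hashsize);
	g.nr_nodes = nodes;
	return g;
}

/*
 * node == -1 means no node was supplied (the !FLAGS_NUMA case): derive
 * one from the hash bits above those used for the bucket index.
 */
static unsigned int pick_node(const struct hash_geometry *g, uint32_t hash,
			      int node)
{
	if (node == -1)
		return (hash >> g->hashshift) % g->nr_nodes;
	return (unsigned int)node;
}

/* The low hashshift bits select the bucket within the node's table. */
static unsigned long pick_bucket(const struct hash_geometry *g, uint32_t hash)
{
	return hash & (g->hashsize - 1);
}
```

With 8 CPUs and 2 nodes this gives 1024 buckets per node and a shift of 10, so the node and bucket indices are drawn from disjoint bits of the same hash.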
* [RFC][PATCH 08/10] futex: Propagate flags into futex_get_value_locked()
2023-07-14 13:38 [RFC][PATCH 00/10] futex: More Futex2 bits Peter Zijlstra
` (6 preceding siblings ...)
2023-07-14 13:39 ` [RFC][PATCH 07/10] futex: Implement FUTEX2_NUMA Peter Zijlstra
@ 2023-07-14 13:39 ` Peter Zijlstra
2023-07-14 13:39 ` [RFC][PATCH 09/10] futex: Enable FUTEX2_{8,16} Peter Zijlstra
2023-07-14 13:39 ` [HACK][PATCH 10/10] futex: Munge size and numa into the legacy interface Peter Zijlstra
9 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2023-07-14 13:39 UTC (permalink / raw)
To: tglx, axboe
Cc: linux-kernel, peterz, mingo, dvhart, dave, andrealmeid,
Andrew Morton, urezki, hch, lstoakes, Arnd Bergmann, linux-api,
linux-mm, linux-arch, malteskarupke
In order to facilitate variable-sized futexes, propagate the flags into
futex_get_value_locked().
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/futex/core.c | 4 ++--
kernel/futex/futex.h | 2 +-
kernel/futex/pi.c | 8 ++++----
kernel/futex/requeue.c | 4 ++--
kernel/futex/waitwake.c | 4 ++--
5 files changed, 11 insertions(+), 11 deletions(-)
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -506,12 +506,12 @@ int futex_cmpxchg_value_locked(u32 *curv
return ret;
}
-int futex_get_value_locked(u32 *dest, u32 __user *from)
+int futex_get_value_locked(u32 *dest, u32 __user *from, unsigned int flags)
{
int ret;
pagefault_disable();
- ret = __get_user(*dest, from);
+ ret = futex_get_value(dest, from, flags);
pagefault_enable();
return ret ? -EFAULT : 0;
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -190,7 +190,7 @@ extern void futex_wake_mark(struct wake_
extern int fault_in_user_writeable(u32 __user *uaddr);
extern int futex_cmpxchg_value_locked(u32 *curval, u32 __user *uaddr, u32 uval, u32 newval);
-extern int futex_get_value_locked(u32 *dest, u32 __user *from);
+extern int futex_get_value_locked(u32 *dest, u32 __user *from, unsigned int flags);
extern struct futex_q *futex_top_waiter(struct futex_hash_bucket *hb, union futex_key *key);
extern void __futex_unqueue(struct futex_q *q);
--- a/kernel/futex/pi.c
+++ b/kernel/futex/pi.c
@@ -239,7 +239,7 @@ static int attach_to_pi_state(u32 __user
* still is what we expect it to be, otherwise retry the entire
* operation.
*/
- if (futex_get_value_locked(&uval2, uaddr))
+ if (futex_get_value_locked(&uval2, uaddr, FLAGS_SIZE_32))
goto out_efault;
if (uval != uval2)
@@ -358,7 +358,7 @@ static int handle_exit_race(u32 __user *
* The same logic applies to the case where the exiting task is
* already gone.
*/
- if (futex_get_value_locked(&uval2, uaddr))
+ if (futex_get_value_locked(&uval2, uaddr, FLAGS_SIZE_32))
return -EFAULT;
/* If the user space value has changed, try again. */
@@ -526,7 +526,7 @@ int futex_lock_pi_atomic(u32 __user *uad
* Read the user space value first so we can validate a few
* things before proceeding further.
*/
- if (futex_get_value_locked(&uval, uaddr))
+ if (futex_get_value_locked(&uval, uaddr, FLAGS_SIZE_32))
return -EFAULT;
if (unlikely(should_fail_futex(true)))
@@ -762,7 +762,7 @@ static int __fixup_pi_state_owner(u32 __
if (!pi_state->owner)
newtid |= FUTEX_OWNER_DIED;
- err = futex_get_value_locked(&uval, uaddr);
+ err = futex_get_value_locked(&uval, uaddr, FLAGS_SIZE_32);
if (err)
goto handle_err;
--- a/kernel/futex/requeue.c
+++ b/kernel/futex/requeue.c
@@ -273,7 +273,7 @@ futex_proxy_trylock_atomic(u32 __user *p
u32 curval;
int ret;
- if (futex_get_value_locked(&curval, pifutex))
+ if (futex_get_value_locked(&curval, pifutex, FLAGS_SIZE_32))
return -EFAULT;
if (unlikely(should_fail_futex(true)))
@@ -449,7 +449,7 @@ int futex_requeue(u32 __user *uaddr1, un
if (likely(cmpval != NULL)) {
u32 curval;
- ret = futex_get_value_locked(&curval, uaddr1);
+ ret = futex_get_value_locked(&curval, uaddr1, FLAGS_SIZE_32);
if (unlikely(ret)) {
double_unlock_hb(hb1, hb2);
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -439,7 +439,7 @@ static int futex_wait_multiple_setup(str
u32 val = vs[i].w.val;
hb = futex_q_lock(q);
- ret = futex_get_value_locked(&uval, uaddr);
+ ret = futex_get_value_locked(&uval, uaddr, FLAGS_SIZE_32);
if (!ret && uval == val) {
/*
@@ -607,7 +607,7 @@ int futex_wait_setup(u32 __user *uaddr,
retry_private:
*hb = futex_q_lock(q);
- ret = futex_get_value_locked(&uval, uaddr);
+ ret = futex_get_value_locked(&uval, uaddr, FLAGS_SIZE_32);
if (ret) {
futex_q_unlock(*hb);
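
A hedged userspace analog of why the size has to reach the load: a 1- or 2-byte futex must be read with a load of that width, otherwise neighbouring bytes leak into the comparison against the expected value. load_sized() is a hypothetical name; it mirrors the shape of futex_get_value() from patch 07 but takes a byte count instead of flags and uses memcpy() instead of __get_user().

```c
#include <stdint.h>
#include <string.h>

/*
 * Read a 1, 2 or 4 byte value from 'from' and zero-extend it to u32.
 * Returns 0 on success, -1 for an unsupported size.
 */
static int load_sized(uint32_t *val, const void *from, unsigned int size)
{
	uint8_t v8;
	uint16_t v16;
	uint32_t v32;

	switch (size) {
	case 1:
		memcpy(&v8, from, sizeof(v8));
		*val = v8;
		return 0;
	case 2:
		memcpy(&v16, from, sizeof(v16));
		*val = v16;
		return 0;
	case 4:
		memcpy(&v32, from, sizeof(v32));
		*val = v32;
		return 0;
	default:
		return -1;
	}
}
```

Unlike the kernel helper, the sketch returns an error for an unknown size rather than calling BUG().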
* [RFC][PATCH 09/10] futex: Enable FUTEX2_{8,16}
2023-07-14 13:38 [RFC][PATCH 00/10] futex: More Futex2 bits Peter Zijlstra
` (7 preceding siblings ...)
2023-07-14 13:39 ` [RFC][PATCH 08/10] futex: Propagate flags into futex_get_value_locked() Peter Zijlstra
@ 2023-07-14 13:39 ` Peter Zijlstra
2023-07-14 13:39 ` [HACK][PATCH 10/10] futex: Munge size and numa into the legacy interface Peter Zijlstra
9 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2023-07-14 13:39 UTC (permalink / raw)
To: tglx, axboe
Cc: linux-kernel, peterz, mingo, dvhart, dave, andrealmeid,
Andrew Morton, urezki, hch, lstoakes, Arnd Bergmann, linux-api,
linux-mm, linux-arch, malteskarupke
When futexes are no longer u32 aligned, the lower offset bits are no
longer available to put type info in. However, since offset is the
offset within a page, there are plenty of bits available at the top end.
After that, pass flags into futex_get_value_locked() for WAIT and
disallow FUTEX2_64 instead of mandating FUTEX2_32.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
include/linux/futex.h | 11 ++++++-----
kernel/futex/syscalls.c | 4 ++--
kernel/futex/waitwake.c | 4 ++--
3 files changed, 10 insertions(+), 9 deletions(-)
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -16,18 +16,19 @@ struct task_struct;
* The key type depends on whether it's a shared or private mapping.
* Don't rearrange members without looking at hash_futex().
*
- * offset is aligned to a multiple of sizeof(u32) (== 4) by definition.
- * We use the two low order bits of offset to tell what is the kind of key :
+ * offset is the position within a page and is in the range [0, PAGE_SIZE).
+ * The high bits of the offset indicate what kind of key this is:
* 00 : Private process futex (PTHREAD_PROCESS_PRIVATE)
* (no reference on an inode or mm)
* 01 : Shared futex (PTHREAD_PROCESS_SHARED)
* mapped on a file (reference on the underlying inode)
* 10 : Shared futex (PTHREAD_PROCESS_SHARED)
* (but private mapping on an mm, and reference taken on it)
-*/
+ */
-#define FUT_OFF_INODE 1 /* We set bit 0 if key has a reference on inode */
-#define FUT_OFF_MMSHARED 2 /* We set bit 1 if key has a reference on mm */
+#define FUT_OFF_INODE (PAGE_SIZE << 0)
+#define FUT_OFF_MMSHARED (PAGE_SIZE << 1)
+#define FUT_OFF_SIZE (PAGE_SIZE << 2)
union futex_key {
struct {
--- a/kernel/futex/syscalls.c
+++ b/kernel/futex/syscalls.c
@@ -206,7 +206,7 @@ static int futex_parse_waitv(struct fute
if ((aux.flags & ~FUTEX2_MASK) || aux.__reserved)
return -EINVAL;
- if ((aux.flags & FUTEX2_64) != FUTEX2_32)
+ if ((aux.flags & FUTEX2_64) == FUTEX2_64)
return -EINVAL;
flags = futex2_to_flags(aux.flags);
@@ -334,7 +334,7 @@ SYSCALL_DEFINE4(futex_wake,
if (flags & ~FUTEX2_MASK)
return -EINVAL;
- if ((flags & FUTEX2_64) != FUTEX2_32)
+ if ((flags & FUTEX2_64) == FUTEX2_64)
return -EINVAL;
flags = futex2_to_flags(flags);
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -439,7 +439,7 @@ static int futex_wait_multiple_setup(str
u32 val = vs[i].w.val;
hb = futex_q_lock(q);
- ret = futex_get_value_locked(&uval, uaddr, FLAGS_SIZE_32);
+ ret = futex_get_value_locked(&uval, uaddr, flags);
if (!ret && uval == val) {
/*
@@ -607,7 +607,7 @@ int futex_wait_setup(u32 __user *uaddr,
retry_private:
*hb = futex_q_lock(q);
- ret = futex_get_value_locked(&uval, uaddr, FLAGS_SIZE_32);
+ ret = futex_get_value_locked(&uval, uaddr, flags);
if (ret) {
futex_q_unlock(*hb);
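
A small sketch of the offset encoding described above, assuming 4 KiB pages and a 2-bit size field (both assumptions; the real values come from PAGE_SIZE and FLAGS_SIZE_MASK). The FUT_OFF_* definitions follow the patch; encode_offset(), offset_in_page() and offset_size_field() are hypothetical helpers for illustration.

```c
#define SKETCH_PAGE_SIZE	4096UL	/* assumption: 4 KiB pages */
#define SKETCH_SIZE_MASK	3UL	/* assumption: 2-bit size field */

#define FUT_OFF_INODE		(SKETCH_PAGE_SIZE << 0)
#define FUT_OFF_MMSHARED	(SKETCH_PAGE_SIZE << 1)
#define FUT_OFF_SIZE		(SKETCH_PAGE_SIZE << 2)

/*
 * The in-page offset occupies the bits below SKETCH_PAGE_SIZE, so the key
 * type and the size field can be stored above it without colliding with
 * byte-aligned addresses.
 */
static unsigned long encode_offset(unsigned long address,
				   unsigned long type_bits,
				   unsigned long size_field)
{
	unsigned long offset = address % SKETCH_PAGE_SIZE;

	offset |= type_bits;
	offset |= FUT_OFF_SIZE * (size_field & SKETCH_SIZE_MASK);
	return offset;
}

/* Recover the byte offset within the page. */
static unsigned long offset_in_page(unsigned long offset)
{
	return offset & (SKETCH_PAGE_SIZE - 1);
}

/* Recover the size field stored above the two type bits. */
static unsigned long offset_size_field(unsigned long offset)
{
	return (offset / FUT_OFF_SIZE) & SKETCH_SIZE_MASK;
}
```

Two keys for the same address but different size fields compare unequal, which is what makes a cross-size wake-wait miss.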
* [HACK][PATCH 10/10] futex: Munge size and numa into the legacy interface
2023-07-14 13:38 [RFC][PATCH 00/10] futex: More Futex2 bits Peter Zijlstra
` (8 preceding siblings ...)
2023-07-14 13:39 ` [RFC][PATCH 09/10] futex: Enable FUTEX2_{8,16} Peter Zijlstra
@ 2023-07-14 13:39 ` Peter Zijlstra
9 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2023-07-14 13:39 UTC (permalink / raw)
To: tglx, axboe
Cc: linux-kernel, peterz, mingo, dvhart, dave, andrealmeid,
Andrew Morton, urezki, hch, lstoakes, Arnd Bergmann, linux-api,
linux-mm, linux-arch, malteskarupke
Avert your eyes...
Arguably just the NUMA thing wouldn't be too bad.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
include/uapi/linux/futex.h | 15 ++++++++++++---
kernel/futex/futex.h | 9 ++++++++-
kernel/futex/syscalls.c | 18 ++++++++++++++++++
3 files changed, 38 insertions(+), 4 deletions(-)
--- a/include/uapi/linux/futex.h
+++ b/include/uapi/linux/futex.h
@@ -23,9 +23,18 @@
#define FUTEX_CMP_REQUEUE_PI 12
#define FUTEX_LOCK_PI2 13
-#define FUTEX_PRIVATE_FLAG 128
-#define FUTEX_CLOCK_REALTIME 256
-#define FUTEX_CMD_MASK ~(FUTEX_PRIVATE_FLAG | FUTEX_CLOCK_REALTIME)
+#define FUTEX_PRIVATE_FLAG (1 << 7)
+#define FUTEX_CLOCK_REALTIME (1 << 8)
+#define FUTEX_NUMA (1 << 9)
+#define FUTEX_SIZE_32 (0 << 10) /* backwards compat */
+#define FUTEX_SIZE_64 (1 << 10)
+#define FUTEX_SIZE_8 (2 << 10)
+#define FUTEX_SIZE_16 (3 << 10)
+
+#define FUTEX_CMD_MASK ~(FUTEX_PRIVATE_FLAG | \
+ FUTEX_CLOCK_REALTIME | \
+ FUTEX_NUMA | \
+ FUTEX_SIZE_16)
#define FUTEX_WAIT_PRIVATE (FUTEX_WAIT | FUTEX_PRIVATE_FLAG)
#define FUTEX_WAKE_PRIVATE (FUTEX_WAKE | FUTEX_PRIVATE_FLAG)
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -39,7 +39,7 @@
/* FUTEX_ to FLAGS_ */
static inline unsigned int futex_to_flags(unsigned int op)
{
- unsigned int flags = FLAGS_SIZE_32;
+ unsigned int sz, flags = 0;
if (!(op & FUTEX_PRIVATE_FLAG))
flags |= FLAGS_SHARED;
@@ -47,6 +47,13 @@ static inline unsigned int futex_to_flag
if (op & FUTEX_CLOCK_REALTIME)
flags |= FLAGS_CLOCKRT;
+ if (op & FUTEX_NUMA)
+ flags |= FLAGS_NUMA;
+
+ /* { 2,3,0,1 } -> { 0,1,2,3 } */
+ sz = ((op + FUTEX_SIZE_8) & FUTEX_SIZE_16) >> 10;
+ flags |= sz;
+
return flags;
}
--- a/kernel/futex/syscalls.c
+++ b/kernel/futex/syscalls.c
@@ -95,6 +95,24 @@ long do_futex(u32 __user *uaddr, int op,
return -ENOSYS;
}
+ /* can't support u64 with a u32 based interface */
+ if ((flags & FLAGS_SIZE_MASK) == FLAGS_SIZE_64)
+ return -ENOSYS;
+
+ switch (cmd) {
+ case FUTEX_WAIT:
+ case FUTEX_WAIT_BITSET:
+ case FUTEX_WAKE:
+ case FUTEX_WAKE_BITSET:
+ /* u8, u16, u32 */
+ break;
+
+ default:
+ /* only u32 for now */
+ if ((flags & FLAGS_SIZE_MASK) != FLAGS_SIZE_32)
+ return -ENOSYS;
+ }
+
switch (cmd) {
case FUTEX_WAIT:
val3 = FUTEX_BITSET_MATCH_ANY;
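
The "{ 2,3,0,1 } -> { 0,1,2,3 }" rotation in futex_to_flags() can be checked in isolation. The sketch below copies the FUTEX_SIZE_* encodings from the uapi hunk (as unsigned constants) and assumes the internal size field is numbered 0..3 for 8, 16, 32 and 64 bits, as that comment implies; legacy_size_field() is a hypothetical name.

```c
#define FUTEX_SIZE_32	(0u << 10)	/* backwards compat: field value 0 */
#define FUTEX_SIZE_64	(1u << 10)
#define FUTEX_SIZE_8	(2u << 10)
#define FUTEX_SIZE_16	(3u << 10)

/*
 * Adding FUTEX_SIZE_8 rotates the 2-bit field by two; any carry out of
 * bit 11 is discarded by the FUTEX_SIZE_16 mask, and lower op bits never
 * carry into the field.
 */
static unsigned int legacy_size_field(unsigned int op)
{
	return ((op + FUTEX_SIZE_8) & FUTEX_SIZE_16) >> 10;
}
```

A legacy op with no size bits set therefore still maps to the 32-bit size, which keeps existing callers working.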
* Re: [RFC][PATCH 04/10] futex: Add sys_futex_wake()
2023-07-14 13:39 ` [RFC][PATCH 04/10] futex: Add sys_futex_wake() Peter Zijlstra
@ 2023-07-14 14:26 ` Arnd Bergmann
2023-07-14 14:47 ` Peter Zijlstra
0 siblings, 1 reply; 19+ messages in thread
From: Arnd Bergmann @ 2023-07-14 14:26 UTC (permalink / raw)
To: Peter Zijlstra, Thomas Gleixner, Jens Axboe
Cc: linux-kernel, Ingo Molnar, Darren Hart, dave, andrealmeid,
Andrew Morton, urezki, Christoph Hellwig, Lorenzo Stoakes,
linux-api, linux-mm, Linux-Arch, malteskarupke
On Fri, Jul 14, 2023, at 15:39, Peter Zijlstra wrote:
>
> +++ b/include/linux/syscalls.h
> @@ -563,6 +563,9 @@ asmlinkage long sys_set_robust_list(stru
> asmlinkage long sys_futex_waitv(struct futex_waitv *waiters,
> unsigned int nr_futexes, unsigned int flags,
> struct __kernel_timespec __user *timeout, clockid_t clockid);
> +
> +asmlinkage long sys_futex_wake(void __user *uaddr, int nr, unsigned
> int flags, u64 mask);
> +
You can't really use 'u64' arguments in portable syscalls, it causes
a couple of problems, both with defining the user space wrappers,
and with compat mode.
Variants that would work include:
- using 'unsigned long' instead of 'u64'
- passing 'mask' by reference, as in splice()
- passing the mask in two 32-bit arguments like in llseek()
Not sure if any of the above work for you.
Arnd
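
For illustration only, a sketch of the third variant: the caller splits the 64-bit mask into two 32-bit halves and the callee joins them again. mask_split() and mask_join() are hypothetical helpers; the argument order and names are assumptions and not a proposed ABI.

```c
#include <stdint.h>

/* Caller side: split a 64-bit mask into low and high 32-bit halves. */
static void mask_split(uint64_t mask, uint32_t *lo, uint32_t *hi)
{
	*lo = (uint32_t)mask;
	*hi = (uint32_t)(mask >> 32);
}

/* Callee side: rebuild the 64-bit mask from the two halves. */
static uint64_t mask_join(uint32_t lo, uint32_t hi)
{
	return ((uint64_t)hi << 32) | lo;
}
```

Each half fits in a single register on 32-bit architectures, so no register-pair convention is involved, at the cost of one extra argument slot.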
* Re: [RFC][PATCH 05/10] mm: Add vmalloc_huge_node()
2023-07-14 13:39 ` [RFC][PATCH 05/10] mm: Add vmalloc_huge_node() Peter Zijlstra
@ 2023-07-14 14:37 ` Matthew Wilcox
2023-07-14 15:09 ` Peter Zijlstra
0 siblings, 1 reply; 19+ messages in thread
From: Matthew Wilcox @ 2023-07-14 14:37 UTC (permalink / raw)
To: Peter Zijlstra
Cc: tglx, axboe, linux-kernel, mingo, dvhart, dave, andrealmeid,
Andrew Morton, urezki, hch, lstoakes, Arnd Bergmann, linux-api,
linux-mm, linux-arch, malteskarupke
On Fri, Jul 14, 2023 at 03:39:04PM +0200, Peter Zijlstra wrote:
> +void *vmalloc_huge_node(unsigned long size, gfp_t gfp_mask, int node)
> +{
> + return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
> + gfp_mask, PAGE_KERNEL, VM_ALLOW_HUGE_VMAP,
> + node, __builtin_return_address(0));
> +}
> +
> /**
> * vmalloc_huge - allocate virtually contiguous memory, allow huge pages
> * @size: allocation size
> @@ -3430,9 +3437,7 @@ EXPORT_SYMBOL(vmalloc);
> */
> void *vmalloc_huge(unsigned long size, gfp_t gfp_mask)
> {
> - return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
> - gfp_mask, PAGE_KERNEL, VM_ALLOW_HUGE_VMAP,
> - NUMA_NO_NODE, __builtin_return_address(0));
> + return vmalloc_huge_node(size, gfp_mask, NUMA_NO_NODE);
> }
Isn't this going to result in the "caller" being always recorded as
vmalloc_huge() instead of the caller of vmalloc_huge()?
* Re: [RFC][PATCH 04/10] futex: Add sys_futex_wake()
2023-07-14 14:26 ` Arnd Bergmann
@ 2023-07-14 14:47 ` Peter Zijlstra
2023-07-14 20:10 ` Arnd Bergmann
0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2023-07-14 14:47 UTC (permalink / raw)
To: Arnd Bergmann
Cc: Thomas Gleixner, Jens Axboe, linux-kernel, Ingo Molnar,
Darren Hart, dave, andrealmeid, Andrew Morton, urezki,
Christoph Hellwig, Lorenzo Stoakes, linux-api, linux-mm,
Linux-Arch, malteskarupke
On Fri, Jul 14, 2023 at 04:26:45PM +0200, Arnd Bergmann wrote:
> On Fri, Jul 14, 2023, at 15:39, Peter Zijlstra wrote:
> >
> > +++ b/include/linux/syscalls.h
> > @@ -563,6 +563,9 @@ asmlinkage long sys_set_robust_list(stru
> > asmlinkage long sys_futex_waitv(struct futex_waitv *waiters,
> > unsigned int nr_futexes, unsigned int flags,
> > struct __kernel_timespec __user *timeout, clockid_t clockid);
> > +
> > +asmlinkage long sys_futex_wake(void __user *uaddr, int nr, unsigned
> > int flags, u64 mask);
> > +
>
> You can't really use 'u64' arguments in portable syscalls, it causes
> a couple of problems, both with defining the user space wrappers,
> and with compat mode.
>
> Variants that would work include:
>
> - using 'unsigned long' instead of 'u64'
> - passing 'mask' by reference, as in splice()
> - passing the mask in two 32-bit arguments like in llseek()
>
> Not sure if any of the above work for you.
Durr, I was hoping they'd use register pairs, but yeah I can see how
that would be very hard to do in generic code.
Hurmph.. using 2 u32s is unfortunate on 64bit, while unsigned long
would limit 64bit futexes to 64bit machines (perhaps not too bad).
Using unsigned long would help with the futex_wait() thing as well.
I'll ponder things a bit.
Obviously I only did build x86_64 ;-)
* Re: [RFC][PATCH 05/10] mm: Add vmalloc_huge_node()
2023-07-14 14:37 ` Matthew Wilcox
@ 2023-07-14 15:09 ` Peter Zijlstra
2023-07-14 15:11 ` Matthew Wilcox
0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2023-07-14 15:09 UTC (permalink / raw)
To: Matthew Wilcox
Cc: tglx, axboe, linux-kernel, mingo, dvhart, dave, andrealmeid,
Andrew Morton, urezki, hch, lstoakes, Arnd Bergmann, linux-api,
linux-mm, linux-arch, malteskarupke
On Fri, Jul 14, 2023 at 03:37:38PM +0100, Matthew Wilcox wrote:
> On Fri, Jul 14, 2023 at 03:39:04PM +0200, Peter Zijlstra wrote:
> > +void *vmalloc_huge_node(unsigned long size, gfp_t gfp_mask, int node)
> > +{
> > + return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
> > + gfp_mask, PAGE_KERNEL, VM_ALLOW_HUGE_VMAP,
> > + node, __builtin_return_address(0));
> > +}
> > +
> > /**
> > * vmalloc_huge - allocate virtually contiguous memory, allow huge pages
> > * @size: allocation size
> > @@ -3430,9 +3437,7 @@ EXPORT_SYMBOL(vmalloc);
> > */
> > void *vmalloc_huge(unsigned long size, gfp_t gfp_mask)
> > {
> > - return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
> > - gfp_mask, PAGE_KERNEL, VM_ALLOW_HUGE_VMAP,
> > - NUMA_NO_NODE, __builtin_return_address(0));
> > + return vmalloc_huge_node(size, gfp_mask, NUMA_NO_NODE);
> > }
>
> Isn't this going to result in the "caller" being always recorded as
> vmalloc_huge() instead of the caller of vmalloc_huge()?
Durr, I missed that, but it depends, not if the compiler inlines it.
I'll make a common __always_inline helper to cure this.
* Re: [RFC][PATCH 05/10] mm: Add vmalloc_huge_node()
2023-07-14 15:09 ` Peter Zijlstra
@ 2023-07-14 15:11 ` Matthew Wilcox
2023-07-14 15:26 ` Peter Zijlstra
0 siblings, 1 reply; 19+ messages in thread
From: Matthew Wilcox @ 2023-07-14 15:11 UTC (permalink / raw)
To: Peter Zijlstra
Cc: tglx, axboe, linux-kernel, mingo, dvhart, dave, andrealmeid,
Andrew Morton, urezki, hch, lstoakes, Arnd Bergmann, linux-api,
linux-mm, linux-arch, malteskarupke
On Fri, Jul 14, 2023 at 05:09:48PM +0200, Peter Zijlstra wrote:
> On Fri, Jul 14, 2023 at 03:37:38PM +0100, Matthew Wilcox wrote:
> > On Fri, Jul 14, 2023 at 03:39:04PM +0200, Peter Zijlstra wrote:
> > > +void *vmalloc_huge_node(unsigned long size, gfp_t gfp_mask, int node)
> > > +{
> > > + return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
> > > + gfp_mask, PAGE_KERNEL, VM_ALLOW_HUGE_VMAP,
> > > + node, __builtin_return_address(0));
> > > +}
> > > +
> > > /**
> > > * vmalloc_huge - allocate virtually contiguous memory, allow huge pages
> > > * @size: allocation size
> > > @@ -3430,9 +3437,7 @@ EXPORT_SYMBOL(vmalloc);
> > > */
> > > void *vmalloc_huge(unsigned long size, gfp_t gfp_mask)
> > > {
> > > - return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
> > > - gfp_mask, PAGE_KERNEL, VM_ALLOW_HUGE_VMAP,
> > > - NUMA_NO_NODE, __builtin_return_address(0));
> > > + return vmalloc_huge_node(size, gfp_mask, NUMA_NO_NODE);
> > > }
> >
> > Isn't this going to result in the "caller" being always recorded as
> > vmalloc_huge() instead of the caller of vmalloc_huge()?
>
> Durr, I missed that, but it depends, not if the compiler inlines it.
>
> I'll make a common __always_inline helper to cure this.
... or just don't change vmalloc_huge()? Or make the common helper take
the __builtin_return_address as a parameter?
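
A userspace sketch of the second suggestion, assuming GCC or Clang (for __builtin_return_address() and the noinline attribute). All names here (sketch_alloc_common and friends) are hypothetical, and malloc() stands in for the real allocator: the common helper receives the caller address as a parameter, so each public wrapper records its own caller rather than another wrapper.

```c
#include <stddef.h>
#include <stdlib.h>

/* What the helper recorded on its most recent call. */
static const void *last_caller;
static int last_node;

/* Common helper: the caller address is supplied by the wrapper. */
static void *sketch_alloc_common(size_t size, int node, const void *caller)
{
	last_caller = caller;
	last_node = node;
	return malloc(size);
}

/* Each wrapper passes its own return address, i.e. its caller. */
__attribute__((noinline))
void *sketch_alloc_node(size_t size, int node)
{
	return sketch_alloc_common(size, node, __builtin_return_address(0));
}

__attribute__((noinline))
void *sketch_alloc_any(size_t size)
{
	return sketch_alloc_common(size, -1, __builtin_return_address(0));
}
```

Had sketch_alloc_any() simply called sketch_alloc_node() without being inlined, the recorded address would point into sketch_alloc_any() itself, which is the concern raised above.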
* Re: [RFC][PATCH 07/10] futex: Implement FUTEX2_NUMA
2023-07-14 13:39 ` [RFC][PATCH 07/10] futex: Implement FUTEX2_NUMA Peter Zijlstra
@ 2023-07-14 15:22 ` Peter Zijlstra
0 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2023-07-14 15:22 UTC (permalink / raw)
To: tglx, axboe
Cc: linux-kernel, mingo, dvhart, dave, andrealmeid, Andrew Morton,
urezki, hch, lstoakes, Arnd Bergmann, linux-api, linux-mm,
linux-arch, malteskarupke
On Fri, Jul 14, 2023 at 03:39:06PM +0200, Peter Zijlstra wrote:
> + /*
> + * Encode the futex size in the offset. This makes cross-size
> + * wake-wait fail -- see futex_match().
> + *
> + * NOTE that cross-size wake-wait is fundamentally broken wrt
> + * FLAGS_NUMA but could possibly work for !NUMA.
> + */
> + key->both.offset |= FUT_OFF_SIZE * (flags & FLAGS_SIZE_MASK);
this wee bit should've gone in patch 9.
* Re: [RFC][PATCH 05/10] mm: Add vmalloc_huge_node()
2023-07-14 15:11 ` Matthew Wilcox
@ 2023-07-14 15:26 ` Peter Zijlstra
0 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2023-07-14 15:26 UTC (permalink / raw)
To: Matthew Wilcox
Cc: tglx, axboe, linux-kernel, mingo, dvhart, dave, andrealmeid,
Andrew Morton, urezki, hch, lstoakes, Arnd Bergmann, linux-api,
linux-mm, linux-arch, malteskarupke
On Fri, Jul 14, 2023 at 04:11:39PM +0100, Matthew Wilcox wrote:
> ... or just don't change vmalloc_huge()?
Yeah, that; everything else just adds more lines without real benefit. I
ended up with the below.
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -152,6 +152,7 @@ extern void *__vmalloc_node_range(unsign
void *__vmalloc_node(unsigned long size, unsigned long align, gfp_t gfp_mask,
int node, const void *caller) __alloc_size(1);
void *vmalloc_huge(unsigned long size, gfp_t gfp_mask) __alloc_size(1);
+void *vmalloc_huge_node(unsigned long size, gfp_t gfp_mask, int node) __alloc_size(1);
extern void *__vmalloc_array(size_t n, size_t size, gfp_t flags) __alloc_size(1, 2);
extern void *vmalloc_array(size_t n, size_t size) __alloc_size(1, 2);
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3416,6 +3416,13 @@ void *vmalloc(unsigned long size)
}
EXPORT_SYMBOL(vmalloc);
+void *vmalloc_huge_node(unsigned long size, gfp_t gfp_mask, int node)
+{
+ return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
+ gfp_mask, PAGE_KERNEL, VM_ALLOW_HUGE_VMAP,
+ node, __builtin_return_address(0));
+}
+
/**
* vmalloc_huge - allocate virtually contiguous memory, allow huge pages
* @size: allocation size
* Re: [RFC][PATCH 04/10] futex: Add sys_futex_wake()
2023-07-14 14:47 ` Peter Zijlstra
@ 2023-07-14 20:10 ` Arnd Bergmann
0 siblings, 0 replies; 19+ messages in thread
From: Arnd Bergmann @ 2023-07-14 20:10 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Thomas Gleixner, Jens Axboe, linux-kernel, Ingo Molnar,
Darren Hart, dave, andrealmeid, Andrew Morton, urezki,
Christoph Hellwig, Lorenzo Stoakes, linux-api, linux-mm,
Linux-Arch, malteskarupke
On Fri, Jul 14, 2023, at 16:47, Peter Zijlstra wrote:
> On Fri, Jul 14, 2023 at 04:26:45PM +0200, Arnd Bergmann wrote:
>> On Fri, Jul 14, 2023, at 15:39, Peter Zijlstra wrote:
>> >
>> > +++ b/include/linux/syscalls.h
>> > @@ -563,6 +563,9 @@ asmlinkage long sys_set_robust_list(stru
>> > asmlinkage long sys_futex_waitv(struct futex_waitv *waiters,
>> > unsigned int nr_futexes, unsigned int flags,
>> > struct __kernel_timespec __user *timeout, clockid_t clockid);
>> > +
>> > +asmlinkage long sys_futex_wake(void __user *uaddr, int nr, unsigned
>> > int flags, u64 mask);
>> > +
>>
>> You can't really use 'u64' arguments in portable syscalls, it causes
>> a couple of problems, both with defining the user space wrappers,
>> and with compat mode.
>>
>> Variants that would work include:
>>
>> - using 'unsigned long' instead of 'u64'
>> - passing 'mask' by reference, as in splice()
>> - passing the mask in two 32-bit arguments like in llseek()
>>
>> Not sure if any of the above work for you.
>
> Durr, I was hoping they'd use register pairs, but yeah I can see how
> that would be very hard to do in generic code.
It kind of works to just use register pairs, the actual problem
you run into here is that:
- depending on the architecture, the register pairs need to be
even/odd pairs, so there are two different ways that 32-bit
architectures handle it
- The compat handler needs to explicitly name the registers that
are used, so to make your version above work correctly, we'd
need three entry points, for native 64-bit, compat 32-bit
odd/even pairs and compat 32-bit even/odd pairs.
> Hurmph.. using 2 u32s is unfortunate on 64bit, while unsigned long
> would limit 64bit futexes to 64bit machines (perhaps not too bad).
>
> Using unsigned long would help with the futex_wait() thing as well.
>
> I'll ponder things a bit.
>
> Obviously I only did build x86_64 ;-)
I suspect that restricting the futexes to native word size is
ok since many 32-bit architectures don't have 64-bit atomic
instructions anyway (armv6k+ and i586tsc+ being the obvious
exceptions), so userspace code that relies on it becomes
nonportable.
Arnd
Thread overview: 19+ messages
2023-07-14 13:38 [RFC][PATCH 00/10] futex: More Futex2 bits Peter Zijlstra
2023-07-14 13:39 ` [RFC][PATCH 01/10] futex: Clarify FUTEX2 flags Peter Zijlstra
2023-07-14 13:39 ` [RFC][PATCH 02/10] futex: Extend the " Peter Zijlstra
2023-07-14 13:39 ` [RFC][PATCH 03/10] futex: Flag conversion Peter Zijlstra
2023-07-14 13:39 ` [RFC][PATCH 04/10] futex: Add sys_futex_wake() Peter Zijlstra
2023-07-14 14:26 ` Arnd Bergmann
2023-07-14 14:47 ` Peter Zijlstra
2023-07-14 20:10 ` Arnd Bergmann
2023-07-14 13:39 ` [RFC][PATCH 05/10] mm: Add vmalloc_huge_node() Peter Zijlstra
2023-07-14 14:37 ` Matthew Wilcox
2023-07-14 15:09 ` Peter Zijlstra
2023-07-14 15:11 ` Matthew Wilcox
2023-07-14 15:26 ` Peter Zijlstra
2023-07-14 13:39 ` [RFC][PATCH 06/10] futex: Propagate flags into get_futex_key() Peter Zijlstra
2023-07-14 13:39 ` [RFC][PATCH 07/10] futex: Implement FUTEX2_NUMA Peter Zijlstra
2023-07-14 15:22 ` Peter Zijlstra
2023-07-14 13:39 ` [RFC][PATCH 08/10] futex: Propagate flags into futex_get_value_locked() Peter Zijlstra
2023-07-14 13:39 ` [RFC][PATCH 09/10] futex: Enable FUTEX2_{8,16} Peter Zijlstra
2023-07-14 13:39 ` [HACK][PATCH 10/10] futex: Munge size and numa into the legacy interface Peter Zijlstra