linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 1/7] powerpc: properly negate error in syscall_set_return_value()
       [not found] <20250113170925.GA392@strace.io>
@ 2025-01-13 17:10 ` Dmitry V. Levin
  2025-01-13 17:34   ` Christophe Leroy
  2025-01-14 13:00   ` Alexey Gladkov
  2025-01-13 17:11 ` [PATCH v2 2/7] mips: fix mips_get_syscall_arg() for O32 and N32 Dmitry V. Levin
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 65+ messages in thread
From: Dmitry V. Levin @ 2025-01-13 17:10 UTC
  To: Oleg Nesterov, Michael Ellerman
  Cc: Eugene Syromyatnikov, Mike Frysinger, Renzo Davoli,
	Davide Berardi, strace-devel, Madhavan Srinivasan,
	Nicholas Piggin, Christophe Leroy, Naveen N Rao, linuxppc-dev,
	linux-kernel

Bring syscall_set_return_value() in sync with syscall_get_error(),
and let the upcoming ptrace/set_syscall_info selftest pass on powerpc.

This reverts commit 1b1a3702a65c ("powerpc: Don't negate error in
syscall_set_return_value()").

Signed-off-by: Dmitry V. Levin <ldv@strace.io>
---
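
Illustration (not part of the patch): a minimal user-space sketch of the
round trip this change restores for the non-scv path.  struct demo_regs,
DEMO_CR0_SO, demo_set_return_value() and demo_get_error() are hypothetical
stand-ins, not kernel code, and the getter reflects an assumption about how
syscall_get_error() reads the registers (a negated gpr[3] when the SO bit
is set, 0 otherwise).

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the two pt_regs fields involved. */
struct demo_regs {
	unsigned long ccr;	/* condition register */
	unsigned long gpr3;	/* syscall return register */
};

/* Same bit the patch toggles in regs->ccr. */
#define DEMO_CR0_SO 0x10000000UL

/* Mirrors the patched setter: error is 0 or a negative errno. */
static void demo_set_return_value(struct demo_regs *regs, int error, long val)
{
	assert(error <= 0);
	if (error) {
		regs->ccr |= DEMO_CR0_SO;
		regs->gpr3 = -error;	/* store the positive error code */
	} else {
		regs->ccr &= ~DEMO_CR0_SO;
		regs->gpr3 = val;
	}
}

/* Assumed getter: negative errno when SO is set, 0 otherwise. */
static long demo_get_error(const struct demo_regs *regs)
{
	return (regs->ccr & DEMO_CR0_SO) ? -(long)regs->gpr3 : 0;
}
```

With the un-negated store that this patch removes, the getter in this sketch
would hand back a positive value for a failed syscall, so a value written by
the setter could not be read back unchanged.
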
 arch/powerpc/include/asm/syscall.h | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/syscall.h b/arch/powerpc/include/asm/syscall.h
index 3dd36c5e334a..422d7735ace6 100644
--- a/arch/powerpc/include/asm/syscall.h
+++ b/arch/powerpc/include/asm/syscall.h
@@ -82,7 +82,11 @@ static inline void syscall_set_return_value(struct task_struct *task,
 		 */
 		if (error) {
 			regs->ccr |= 0x10000000L;
-			regs->gpr[3] = error;
+			/*
+			 * In case of an error regs->gpr[3] contains
+			 * a positive ERRORCODE.
+			 */
+			regs->gpr[3] = -error;
 		} else {
 			regs->ccr &= ~0x10000000L;
 			regs->gpr[3] = val;
-- 
ldv

* [PATCH v2 2/7] mips: fix mips_get_syscall_arg() for O32 and N32
       [not found] <20250113170925.GA392@strace.io>
  2025-01-13 17:10 ` [PATCH v2 1/7] powerpc: properly negate error in syscall_set_return_value() Dmitry V. Levin
@ 2025-01-13 17:11 ` Dmitry V. Levin
  2025-01-14  3:29   ` Maciej W. Rozycki
  2025-01-13 17:11 ` [PATCH v2 3/7] syscall.h: add syscall_set_arguments() and syscall_set_return_value() Dmitry V. Levin
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 65+ messages in thread
From: Dmitry V. Levin @ 2025-01-13 17:11 UTC
  To: Oleg Nesterov, Thomas Bogendoerfer
  Cc: Eugene Syromyatnikov, Mike Frysinger, Renzo Davoli,
	Davide Berardi, strace-devel, linux-mips, linux-kernel

Fix the following get_syscall_info test assertion on mips O32:
  # get_syscall_info.c:218:get_syscall_info:Expected exp_args[5] (3134521044) == info.entry.args[4] (4911432)
  # get_syscall_info.c:219:get_syscall_info:wait #1: entry stop mismatch

Fix the following get_syscall_info test assertion on mips64 O32 and mips64 N32:
  # get_syscall_info.c:209:get_syscall_info:Expected exp_args[2] (3134324433) == info.entry.args[1] (18446744072548908753)
  # get_syscall_info.c:210:get_syscall_info:wait #1: entry stop mismatch

This makes the ptrace/get_syscall_info selftest pass on mips O32,
mips64 O32, and mips64 N32.

Signed-off-by: Dmitry V. Levin <ldv@strace.io>
---

Note that I'm not a MIPS expert, so I cannot tell why the get_user()
approach doesn't work for O32.  Also, during experiments I discovered that
the regs->pad0 approach works for O32, but why it works remains a mystery.
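
Illustration (not part of the patch): the mips64 O32/N32 mismatch quoted
above is consistent with a 32-bit argument being sign-extended into a 64-bit
register, which the (unsigned int) cast in the #else branch undoes.  The
helper names below are hypothetical, and the sketch assumes the usual
two's-complement conversion to int32_t.

```c
#include <assert.h>
#include <stdint.h>

/* What a 64-bit register holds after a 32-bit value is sign-extended. */
static uint64_t demo_sign_extend32(uint32_t v)
{
	uint64_t r = (uint64_t)(int64_t)(int32_t)v;

	/* Sign extension never alters the low 32 bits. */
	assert((uint32_t)r == v);
	return r;
}

/* The remedy applied for 32-bit tasks: keep only the low 32 bits. */
static uint64_t demo_truncate32(uint64_t reg)
{
	return (uint32_t)reg;
}
```

Feeding the expected value 3134324433 through demo_sign_extend32() yields
18446744072548908753, the value the selftest reported, and
demo_truncate32() maps it back.
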

 arch/mips/include/asm/syscall.h | 34 ++++++++++-----------------------
 1 file changed, 10 insertions(+), 24 deletions(-)

diff --git a/arch/mips/include/asm/syscall.h b/arch/mips/include/asm/syscall.h
index ebdf4d910af2..2f85f2d8f754 100644
--- a/arch/mips/include/asm/syscall.h
+++ b/arch/mips/include/asm/syscall.h
@@ -57,37 +57,23 @@ static inline void mips_syscall_update_nr(struct task_struct *task,
 static inline void mips_get_syscall_arg(unsigned long *arg,
 	struct task_struct *task, struct pt_regs *regs, unsigned int n)
 {
-	unsigned long usp __maybe_unused = regs->regs[29];
-
+#ifdef CONFIG_32BIT
 	switch (n) {
 	case 0: case 1: case 2: case 3:
 		*arg = regs->regs[4 + n];
-
-		return;
-
-#ifdef CONFIG_32BIT
-	case 4: case 5: case 6: case 7:
-		get_user(*arg, (int *)usp + n);
 		return;
-#endif
-
-#ifdef CONFIG_64BIT
 	case 4: case 5: case 6: case 7:
-#ifdef CONFIG_MIPS32_O32
-		if (test_tsk_thread_flag(task, TIF_32BIT_REGS))
-			get_user(*arg, (int *)usp + n);
-		else
-#endif
-			*arg = regs->regs[4 + n];
-
+		*arg = regs->pad0[n];
 		return;
-#endif
-
-	default:
-		BUG();
 	}
-
-	unreachable();
+#else
+	*arg = regs->regs[4 + n];
+	if ((IS_ENABLED(CONFIG_MIPS32_O32) &&
+	     test_tsk_thread_flag(task, TIF_32BIT_REGS)) ||
+	    (IS_ENABLED(CONFIG_MIPS32_N32) &&
+	     test_tsk_thread_flag(task, TIF_32BIT_ADDR)))
+		*arg = (unsigned int)*arg;
+#endif
 }
 
 static inline long syscall_get_error(struct task_struct *task,
-- 
ldv

* [PATCH v2 3/7] syscall.h: add syscall_set_arguments() and syscall_set_return_value()
       [not found] <20250113170925.GA392@strace.io>
  2025-01-13 17:10 ` [PATCH v2 1/7] powerpc: properly negate error in syscall_set_return_value() Dmitry V. Levin
  2025-01-13 17:11 ` [PATCH v2 2/7] mips: fix mips_get_syscall_arg() for O32 and N32 Dmitry V. Levin
@ 2025-01-13 17:11 ` Dmitry V. Levin
  2025-01-16  2:20   ` Charlie Jenkins
  2025-01-13 17:11 ` [PATCH v2 4/7] syscall.h: introduce syscall_set_nr() Dmitry V. Levin
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 65+ messages in thread
From: Dmitry V. Levin @ 2025-01-13 17:11 UTC
  To: Oleg Nesterov
  Cc: Eugene Syromyatnikov, Mike Frysinger, Renzo Davoli,
	Davide Berardi, strace-devel, Vineet Gupta, Russell King,
	Will Deacon, Guo Ren, Brian Cain, Huacai Chen, WANG Xuerui,
	Thomas Bogendoerfer, Dinh Nguyen, Jonas Bonn, Stefan Kristiansson,
	Stafford Horne, James E.J. Bottomley, Helge Deller,
	Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy, Naveen N Rao, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, Yoshinori Sato, Rich Felker,
	John Paul Adrian Glaubitz, David S. Miller, Andreas Larsson,
	Richard Weinberger, Anton Ivanov, Johannes Berg, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Chris Zankel, Max Filippov, Arnd Bergmann, linux-snps-arc,
	linux-kernel, linux-arm-kernel, linux-csky, linux-hexagon,
	loongarch, linux-mips, linux-openrisc, linux-parisc, linuxppc-dev,
	linux-riscv, linux-s390, linux-sh, sparclinux, linux-um,
	linux-arch

These functions are going to be needed on all HAVE_ARCH_TRACEHOOK
architectures to implement the PTRACE_SET_SYSCALL_INFO API.

This partially reverts commit 7962c2eddbfe ("arch: remove unused
function syscall_set_arguments()") by reusing some of the old
syscall_set_arguments() implementations.

Signed-off-by: Dmitry V. Levin <ldv@strace.io>
---

Note that I'm not a MIPS expert; I just added mips_set_syscall_arg() by
looking at mips_get_syscall_arg(), and the result passes tests in qemu on
mips O32, mips64 O32, mips64 N32, and mips64 N64.
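
Illustration (not part of the patch): several hunks below (for example
riscv and loongarch) store the first argument in a separate orig_* slot
rather than in the first argument register.  The sketch below models that
with a hypothetical struct demo_regs, loosely following the riscv hunk; it
uses an array instead of named registers and is not kernel code.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical register file loosely modelled on the riscv hunk. */
struct demo_regs {
	unsigned long orig_a0;	/* saved copy of the first argument */
	unsigned long a[6];	/* a[0] doubles as the return-value register */
};

/* Reader: the first argument comes from the saved slot, not from a[0]. */
static void demo_get_arguments(const struct demo_regs *regs,
			       unsigned long *args)
{
	args[0] = regs->orig_a0;
	memcpy(&args[1], &regs->a[1], 5 * sizeof(args[0]));
}

/* Writer: must update the same slot the reader consults. */
static void demo_set_arguments(struct demo_regs *regs,
			       const unsigned long *args)
{
	assert(args != NULL);
	regs->orig_a0 = args[0];
	memcpy(&regs->a[1], &args[1], 5 * sizeof(args[0]));
}
```

The point of the sketch is only that the setter and the getter have to agree
on where the first argument lives; a[0] is left untouched here, as in the
riscv hunk.
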

 arch/arc/include/asm/syscall.h        | 14 +++++++++++
 arch/arm/include/asm/syscall.h        | 13 ++++++++++
 arch/arm64/include/asm/syscall.h      | 13 ++++++++++
 arch/csky/include/asm/syscall.h       | 13 ++++++++++
 arch/hexagon/include/asm/syscall.h    | 14 +++++++++++
 arch/loongarch/include/asm/syscall.h  |  8 ++++++
 arch/mips/include/asm/syscall.h       | 32 ++++++++++++++++++++++++
 arch/nios2/include/asm/syscall.h      | 11 ++++++++
 arch/openrisc/include/asm/syscall.h   |  7 ++++++
 arch/parisc/include/asm/syscall.h     | 12 +++++++++
 arch/powerpc/include/asm/syscall.h    | 10 ++++++++
 arch/riscv/include/asm/syscall.h      |  9 +++++++
 arch/s390/include/asm/syscall.h       | 12 +++++++++
 arch/sh/include/asm/syscall_32.h      | 12 +++++++++
 arch/sparc/include/asm/syscall.h      | 10 ++++++++
 arch/um/include/asm/syscall-generic.h | 14 +++++++++++
 arch/x86/include/asm/syscall.h        | 36 +++++++++++++++++++++++++++
 arch/xtensa/include/asm/syscall.h     | 11 ++++++++
 include/asm-generic/syscall.h         | 16 ++++++++++++
 19 files changed, 267 insertions(+)

diff --git a/arch/arc/include/asm/syscall.h b/arch/arc/include/asm/syscall.h
index 9709256e31c8..89c1e1736356 100644
--- a/arch/arc/include/asm/syscall.h
+++ b/arch/arc/include/asm/syscall.h
@@ -67,6 +67,20 @@ syscall_get_arguments(struct task_struct *task, struct pt_regs *regs,
 	}
 }
 
+static inline void
+syscall_set_arguments(struct task_struct *task, struct pt_regs *regs,
+		      unsigned long *args)
+{
+	unsigned long *inside_ptregs = &regs->r0;
+	unsigned int n = 6;
+	unsigned int i = 0;
+
+	while (n--) {
+		*inside_ptregs = args[i++];
+		inside_ptregs--;
+	}
+}
+
 static inline int
 syscall_get_arch(struct task_struct *task)
 {
diff --git a/arch/arm/include/asm/syscall.h b/arch/arm/include/asm/syscall.h
index fe4326d938c1..21927fa0ae2b 100644
--- a/arch/arm/include/asm/syscall.h
+++ b/arch/arm/include/asm/syscall.h
@@ -80,6 +80,19 @@ static inline void syscall_get_arguments(struct task_struct *task,
 	memcpy(args, &regs->ARM_r0 + 1, 5 * sizeof(args[0]));
 }
 
+static inline void syscall_set_arguments(struct task_struct *task,
+					 struct pt_regs *regs,
+					 const unsigned long *args)
+{
+	memcpy(&regs->ARM_r0, args, 6 * sizeof(args[0]));
+	/*
+	 * Also copy the first argument into ARM_ORIG_r0
+	 * so that syscall_get_arguments() would return it
+	 * instead of the previous value.
+	 */
+	regs->ARM_ORIG_r0 = regs->ARM_r0;
+}
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
 	/* ARM tasks don't change audit architectures on the fly. */
diff --git a/arch/arm64/include/asm/syscall.h b/arch/arm64/include/asm/syscall.h
index ab8e14b96f68..76020b66286b 100644
--- a/arch/arm64/include/asm/syscall.h
+++ b/arch/arm64/include/asm/syscall.h
@@ -73,6 +73,19 @@ static inline void syscall_get_arguments(struct task_struct *task,
 	memcpy(args, &regs->regs[1], 5 * sizeof(args[0]));
 }
 
+static inline void syscall_set_arguments(struct task_struct *task,
+					 struct pt_regs *regs,
+					 const unsigned long *args)
+{
+	memcpy(&regs->regs[0], args, 6 * sizeof(args[0]));
+	/*
+	 * Also copy the first argument into orig_x0
+	 * so that syscall_get_arguments() would return it
+	 * instead of the previous value.
+	 */
+	regs->orig_x0 = regs->regs[0];
+}
+
 /*
  * We don't care about endianness (__AUDIT_ARCH_LE bit) here because
  * AArch64 has the same system calls both on little- and big- endian.
diff --git a/arch/csky/include/asm/syscall.h b/arch/csky/include/asm/syscall.h
index 0de5734950bf..30403f7a0487 100644
--- a/arch/csky/include/asm/syscall.h
+++ b/arch/csky/include/asm/syscall.h
@@ -59,6 +59,19 @@ syscall_get_arguments(struct task_struct *task, struct pt_regs *regs,
 	memcpy(args, &regs->a1, 5 * sizeof(args[0]));
 }
 
+static inline void
+syscall_set_arguments(struct task_struct *task, struct pt_regs *regs,
+		      const unsigned long *args)
+{
+	memcpy(&regs->a0, args, 6 * sizeof(regs->a0));
+	/*
+	 * Also copy the first argument into orig_a0
+	 * so that syscall_get_arguments() would return it
+	 * instead of the previous value.
+	 */
+	regs->orig_a0 = regs->a0;
+}
+
 static inline int
 syscall_get_arch(struct task_struct *task)
 {
diff --git a/arch/hexagon/include/asm/syscall.h b/arch/hexagon/include/asm/syscall.h
index f6e454f18038..1024a6548d78 100644
--- a/arch/hexagon/include/asm/syscall.h
+++ b/arch/hexagon/include/asm/syscall.h
@@ -33,6 +33,13 @@ static inline void syscall_get_arguments(struct task_struct *task,
 	memcpy(args, &(&regs->r00)[0], 6 * sizeof(args[0]));
 }
 
+static inline void syscall_set_arguments(struct task_struct *task,
+					 struct pt_regs *regs,
+					 unsigned long *args)
+{
+	memcpy(&(&regs->r00)[0], args, 6 * sizeof(args[0]));
+}
+
 static inline long syscall_get_error(struct task_struct *task,
 				     struct pt_regs *regs)
 {
@@ -45,6 +52,13 @@ static inline long syscall_get_return_value(struct task_struct *task,
 	return regs->r00;
 }
 
+static inline void syscall_set_return_value(struct task_struct *task,
+					    struct pt_regs *regs,
+					    int error, long val)
+{
+	regs->r00 = (long) error ?: val;
+}
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
 	return AUDIT_ARCH_HEXAGON;
diff --git a/arch/loongarch/include/asm/syscall.h b/arch/loongarch/include/asm/syscall.h
index e286dc58476e..ff415b3c0a8e 100644
--- a/arch/loongarch/include/asm/syscall.h
+++ b/arch/loongarch/include/asm/syscall.h
@@ -61,6 +61,14 @@ static inline void syscall_get_arguments(struct task_struct *task,
 	memcpy(&args[1], &regs->regs[5], 5 * sizeof(long));
 }
 
+static inline void syscall_set_arguments(struct task_struct *task,
+					 struct pt_regs *regs,
+					 unsigned long *args)
+{
+	regs->orig_a0 = args[0];
+	memcpy(&regs->regs[5], &args[1], 5 * sizeof(long));
+}
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
 	return AUDIT_ARCH_LOONGARCH64;
diff --git a/arch/mips/include/asm/syscall.h b/arch/mips/include/asm/syscall.h
index 2f85f2d8f754..3163d1506fae 100644
--- a/arch/mips/include/asm/syscall.h
+++ b/arch/mips/include/asm/syscall.h
@@ -76,6 +76,23 @@ static inline void mips_get_syscall_arg(unsigned long *arg,
 #endif
 }
 
+static inline void mips_set_syscall_arg(unsigned long *arg,
+	struct task_struct *task, struct pt_regs *regs, unsigned int n)
+{
+#ifdef CONFIG_32BIT
+	switch (n) {
+	case 0: case 1: case 2: case 3:
+		regs->regs[4 + n] = *arg;
+		return;
+	case 4: case 5: case 6: case 7:
+		regs->pad0[n] = *arg;
+		return;
+	}
+#else
+	regs->regs[4 + n] = *arg;
+#endif
+}
+
 static inline long syscall_get_error(struct task_struct *task,
 				     struct pt_regs *regs)
 {
@@ -122,6 +139,21 @@ static inline void syscall_get_arguments(struct task_struct *task,
 		mips_get_syscall_arg(args++, task, regs, i++);
 }
 
+static inline void syscall_set_arguments(struct task_struct *task,
+					 struct pt_regs *regs,
+					 unsigned long *args)
+{
+	unsigned int i = 0;
+	unsigned int n = 6;
+
+	/* O32 ABI syscall() */
+	if (mips_syscall_is_indirect(task, regs))
+		i++;
+
+	while (n--)
+		mips_set_syscall_arg(args++, task, regs, i++);
+}
+
 extern const unsigned long sys_call_table[];
 extern const unsigned long sys32_call_table[];
 extern const unsigned long sysn32_call_table[];
diff --git a/arch/nios2/include/asm/syscall.h b/arch/nios2/include/asm/syscall.h
index fff52205fb65..526449edd768 100644
--- a/arch/nios2/include/asm/syscall.h
+++ b/arch/nios2/include/asm/syscall.h
@@ -58,6 +58,17 @@ static inline void syscall_get_arguments(struct task_struct *task,
 	*args   = regs->r9;
 }
 
+static inline void syscall_set_arguments(struct task_struct *task,
+	struct pt_regs *regs, const unsigned long *args)
+{
+	regs->r4 = *args++;
+	regs->r5 = *args++;
+	regs->r6 = *args++;
+	regs->r7 = *args++;
+	regs->r8 = *args++;
+	regs->r9 = *args;
+}
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
 	return AUDIT_ARCH_NIOS2;
diff --git a/arch/openrisc/include/asm/syscall.h b/arch/openrisc/include/asm/syscall.h
index 903ed882bdec..e6383be2a195 100644
--- a/arch/openrisc/include/asm/syscall.h
+++ b/arch/openrisc/include/asm/syscall.h
@@ -57,6 +57,13 @@ syscall_get_arguments(struct task_struct *task, struct pt_regs *regs,
 	memcpy(args, &regs->gpr[3], 6 * sizeof(args[0]));
 }
 
+static inline void
+syscall_set_arguments(struct task_struct *task, struct pt_regs *regs,
+		      const unsigned long *args)
+{
+	memcpy(&regs->gpr[3], args, 6 * sizeof(args[0]));
+}
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
 	return AUDIT_ARCH_OPENRISC;
diff --git a/arch/parisc/include/asm/syscall.h b/arch/parisc/include/asm/syscall.h
index 00b127a5e09b..b146d0ae4c77 100644
--- a/arch/parisc/include/asm/syscall.h
+++ b/arch/parisc/include/asm/syscall.h
@@ -29,6 +29,18 @@ static inline void syscall_get_arguments(struct task_struct *tsk,
 	args[0] = regs->gr[26];
 }
 
+static inline void syscall_set_arguments(struct task_struct *tsk,
+					 struct pt_regs *regs,
+					 unsigned long *args)
+{
+	regs->gr[21] = args[5];
+	regs->gr[22] = args[4];
+	regs->gr[23] = args[3];
+	regs->gr[24] = args[2];
+	regs->gr[25] = args[1];
+	regs->gr[26] = args[0];
+}
+
 static inline long syscall_get_error(struct task_struct *task,
 				     struct pt_regs *regs)
 {
diff --git a/arch/powerpc/include/asm/syscall.h b/arch/powerpc/include/asm/syscall.h
index 422d7735ace6..521f279e6b33 100644
--- a/arch/powerpc/include/asm/syscall.h
+++ b/arch/powerpc/include/asm/syscall.h
@@ -114,6 +114,16 @@ static inline void syscall_get_arguments(struct task_struct *task,
 	}
 }
 
+static inline void syscall_set_arguments(struct task_struct *task,
+					 struct pt_regs *regs,
+					 const unsigned long *args)
+{
+	memcpy(&regs->gpr[3], args, 6 * sizeof(args[0]));
+
+	/* Also copy the first argument into orig_gpr3 */
+	regs->orig_gpr3 = args[0];
+}
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
 	if (is_tsk_32bit_task(task))
diff --git a/arch/riscv/include/asm/syscall.h b/arch/riscv/include/asm/syscall.h
index 121fff429dce..8d389ba995c8 100644
--- a/arch/riscv/include/asm/syscall.h
+++ b/arch/riscv/include/asm/syscall.h
@@ -66,6 +66,15 @@ static inline void syscall_get_arguments(struct task_struct *task,
 	memcpy(args, &regs->a1, 5 * sizeof(args[0]));
 }
 
+static inline void syscall_set_arguments(struct task_struct *task,
+					 struct pt_regs *regs,
+					 const unsigned long *args)
+{
+	regs->orig_a0 = args[0];
+	args++;
+	memcpy(&regs->a1, args, 5 * sizeof(regs->a1));
+}
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
 #ifdef CONFIG_64BIT
diff --git a/arch/s390/include/asm/syscall.h b/arch/s390/include/asm/syscall.h
index 27e3d804b311..b3dd883699e7 100644
--- a/arch/s390/include/asm/syscall.h
+++ b/arch/s390/include/asm/syscall.h
@@ -78,6 +78,18 @@ static inline void syscall_get_arguments(struct task_struct *task,
 	args[0] = regs->orig_gpr2 & mask;
 }
 
+static inline void syscall_set_arguments(struct task_struct *task,
+					 struct pt_regs *regs,
+					 const unsigned long *args)
+{
+	unsigned int n = 6;
+
+	while (n-- > 0)
+		if (n > 0)
+			regs->gprs[2 + n] = args[n];
+	regs->orig_gpr2 = args[0];
+}
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
 #ifdef CONFIG_COMPAT
diff --git a/arch/sh/include/asm/syscall_32.h b/arch/sh/include/asm/syscall_32.h
index d87738eebe30..cb51a7528384 100644
--- a/arch/sh/include/asm/syscall_32.h
+++ b/arch/sh/include/asm/syscall_32.h
@@ -57,6 +57,18 @@ static inline void syscall_get_arguments(struct task_struct *task,
 	args[0] = regs->regs[4];
 }
 
+static inline void syscall_set_arguments(struct task_struct *task,
+					 struct pt_regs *regs,
+					 const unsigned long *args)
+{
+	regs->regs[1] = args[5];
+	regs->regs[0] = args[4];
+	regs->regs[7] = args[3];
+	regs->regs[6] = args[2];
+	regs->regs[5] = args[1];
+	regs->regs[4] = args[0];
+}
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
 	int arch = AUDIT_ARCH_SH;
diff --git a/arch/sparc/include/asm/syscall.h b/arch/sparc/include/asm/syscall.h
index 20c109ac8cc9..62a5a78804c4 100644
--- a/arch/sparc/include/asm/syscall.h
+++ b/arch/sparc/include/asm/syscall.h
@@ -117,6 +117,16 @@ static inline void syscall_get_arguments(struct task_struct *task,
 	}
 }
 
+static inline void syscall_set_arguments(struct task_struct *task,
+					 struct pt_regs *regs,
+					 const unsigned long *args)
+{
+	unsigned int i;
+
+	for (i = 0; i < 6; i++)
+		regs->u_regs[UREG_I0 + i] = args[i];
+}
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
 #if defined(CONFIG_SPARC64) && defined(CONFIG_COMPAT)
diff --git a/arch/um/include/asm/syscall-generic.h b/arch/um/include/asm/syscall-generic.h
index 172b74143c4b..2984feb9d576 100644
--- a/arch/um/include/asm/syscall-generic.h
+++ b/arch/um/include/asm/syscall-generic.h
@@ -62,6 +62,20 @@ static inline void syscall_get_arguments(struct task_struct *task,
 	*args   = UPT_SYSCALL_ARG6(r);
 }
 
+static inline void syscall_set_arguments(struct task_struct *task,
+					 struct pt_regs *regs,
+					 const unsigned long *args)
+{
+	struct uml_pt_regs *r = &regs->regs;
+
+	UPT_SYSCALL_ARG1(r) = *args++;
+	UPT_SYSCALL_ARG2(r) = *args++;
+	UPT_SYSCALL_ARG3(r) = *args++;
+	UPT_SYSCALL_ARG4(r) = *args++;
+	UPT_SYSCALL_ARG5(r) = *args++;
+	UPT_SYSCALL_ARG6(r) = *args;
+}
+
 /* See arch/x86/um/asm/syscall.h for syscall_get_arch() definition. */
 
 #endif	/* __UM_SYSCALL_GENERIC_H */
diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
index 7c488ff0c764..b9c249dd9e3d 100644
--- a/arch/x86/include/asm/syscall.h
+++ b/arch/x86/include/asm/syscall.h
@@ -90,6 +90,18 @@ static inline void syscall_get_arguments(struct task_struct *task,
 	args[5] = regs->bp;
 }
 
+static inline void syscall_set_arguments(struct task_struct *task,
+					 struct pt_regs *regs,
+					 const unsigned long *args)
+{
+	regs->bx = args[0];
+	regs->cx = args[1];
+	regs->dx = args[2];
+	regs->si = args[3];
+	regs->di = args[4];
+	regs->bp = args[5];
+}
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
 	return AUDIT_ARCH_I386;
@@ -121,6 +133,30 @@ static inline void syscall_get_arguments(struct task_struct *task,
 	}
 }
 
+static inline void syscall_set_arguments(struct task_struct *task,
+					 struct pt_regs *regs,
+					 const unsigned long *args)
+{
+# ifdef CONFIG_IA32_EMULATION
+	if (task->thread_info.status & TS_COMPAT) {
+		regs->bx = *args++;
+		regs->cx = *args++;
+		regs->dx = *args++;
+		regs->si = *args++;
+		regs->di = *args++;
+		regs->bp = *args;
+	} else
+# endif
+	{
+		regs->di = *args++;
+		regs->si = *args++;
+		regs->dx = *args++;
+		regs->r10 = *args++;
+		regs->r8 = *args++;
+		regs->r9 = *args;
+	}
+}
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
 	/* x32 tasks should be considered AUDIT_ARCH_X86_64. */
diff --git a/arch/xtensa/include/asm/syscall.h b/arch/xtensa/include/asm/syscall.h
index 5ee974bf8330..f9a671cbf933 100644
--- a/arch/xtensa/include/asm/syscall.h
+++ b/arch/xtensa/include/asm/syscall.h
@@ -68,6 +68,17 @@ static inline void syscall_get_arguments(struct task_struct *task,
 		args[i] = regs->areg[reg[i]];
 }
 
+static inline void syscall_set_arguments(struct task_struct *task,
+					 struct pt_regs *regs,
+					 const unsigned long *args)
+{
+	static const unsigned int reg[] = XTENSA_SYSCALL_ARGUMENT_REGS;
+	unsigned int i;
+
+	for (i = 0; i < 6; ++i)
+		regs->areg[reg[i]] = args[i];
+}
+
 asmlinkage long xtensa_rt_sigreturn(void);
 asmlinkage long xtensa_shmat(int, char __user *, int);
 asmlinkage long xtensa_fadvise64_64(int, int,
diff --git a/include/asm-generic/syscall.h b/include/asm-generic/syscall.h
index 5a80fe728dc8..0f7b9a493de7 100644
--- a/include/asm-generic/syscall.h
+++ b/include/asm-generic/syscall.h
@@ -117,6 +117,22 @@ void syscall_set_return_value(struct task_struct *task, struct pt_regs *regs,
 void syscall_get_arguments(struct task_struct *task, struct pt_regs *regs,
 			   unsigned long *args);
 
+/**
+ * syscall_set_arguments - change system call parameter value
+ * @task:	task of interest, must be in system call entry tracing
+ * @regs:	task_pt_regs() of @task
+ * @args:	array of argument values to store
+ *
+ * Changes 6 arguments to the system call.
+ * The first argument gets value @args[0], and so on.
+ *
+ * It's only valid to call this when @task is stopped for tracing on
+ * entry to a system call, due to %SYSCALL_WORK_SYSCALL_TRACE or
+ * %SYSCALL_WORK_SYSCALL_AUDIT.
+ */
+void syscall_set_arguments(struct task_struct *task, struct pt_regs *regs,
+			   const unsigned long *args);
+
 /**
  * syscall_get_arch - return the AUDIT_ARCH for the current system call
  * @task:	task of interest, must be blocked
-- 
ldv

* [PATCH v2 4/7] syscall.h: introduce syscall_set_nr()
       [not found] <20250113170925.GA392@strace.io>
                   ` (2 preceding siblings ...)
  2025-01-13 17:11 ` [PATCH v2 3/7] syscall.h: add syscall_set_arguments() and syscall_set_return_value() Dmitry V. Levin
@ 2025-01-13 17:11 ` Dmitry V. Levin
  2025-01-16  2:20   ` Charlie Jenkins
  2025-01-13 17:12 ` [PATCH v2 5/7] ptrace_get_syscall_info: factor out ptrace_get_syscall_info_op Dmitry V. Levin
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 65+ messages in thread
From: Dmitry V. Levin @ 2025-01-13 17:11 UTC
  To: Oleg Nesterov
  Cc: Eugene Syromyatnikov, Mike Frysinger, Renzo Davoli,
	Davide Berardi, strace-devel, Vineet Gupta, Russell King,
	Catalin Marinas, Will Deacon, Brian Cain, Huacai Chen,
	WANG Xuerui, Geert Uytterhoeven, Michal Simek,
	Thomas Bogendoerfer, Dinh Nguyen, Jonas Bonn, Stefan Kristiansson,
	Stafford Horne, James E.J. Bottomley, Helge Deller,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy, Naveen N Rao,
	Madhavan Srinivasan, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, Yoshinori Sato, Rich Felker,
	John Paul Adrian Glaubitz, David S. Miller, Andreas Larsson,
	Richard Weinberger, Anton Ivanov, Johannes Berg, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Chris Zankel, Max Filippov, Arnd Bergmann, linux-snps-arc,
	linux-kernel, linux-arm-kernel, linux-hexagon, loongarch,
	linux-m68k, linux-mips, linux-openrisc, linux-parisc,
	linuxppc-dev, linux-riscv, linux-s390, linux-sh, sparclinux,
	linux-um, linux-arch

Similar to syscall_set_arguments() that complements
syscall_get_arguments(), introduce syscall_set_nr()
that complements syscall_get_nr().

syscall_set_nr() is going to be needed along with
syscall_set_arguments() on all HAVE_ARCH_TRACEHOOK
architectures to implement the PTRACE_SET_SYSCALL_INFO API.

Signed-off-by: Dmitry V. Levin <ldv@strace.io>
---
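
Illustration (not part of the patch): in the arm hunk below, when the
kernel is not built as AEABI-only, the new number replaces only the bits
covered by __NR_SYSCALL_MASK and the remaining bits of abi_syscall are
kept.  A minimal sketch of that arithmetic follows; demo_set_nr() is
hypothetical, and DEMO_NR_SYSCALL_MASK is an assumed value chosen for
illustration rather than taken from the kernel headers.

```c
#include <assert.h>

/* Assumed value for illustration; the kernel defines __NR_SYSCALL_MASK. */
#define DEMO_NR_SYSCALL_MASK 0x0fffffu

/* Returns the new abi_syscall value for a requested syscall number. */
static int demo_set_nr(int abi_syscall, int nr)
{
	assert(nr >= -1);
	if (nr == -1)
		return -1;	/* -1 means "skip the syscall" */

	/* Keep the bits outside the mask, replace the bits inside it. */
	return (int)(((unsigned int)abi_syscall & ~DEMO_NR_SYSCALL_MASK) |
		     ((unsigned int)nr & DEMO_NR_SYSCALL_MASK));
}
```

Under the assumed mask, demo_set_nr(0x900004, 20) keeps the 0x900000 part
and replaces only the low bits.  The real hunk additionally sets the return
value to -ENOSYS when nr is -1, which this sketch leaves out.
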
 arch/arc/include/asm/syscall.h        | 11 +++++++++++
 arch/arm/include/asm/syscall.h        | 24 ++++++++++++++++++++++++
 arch/arm64/include/asm/syscall.h      | 16 ++++++++++++++++
 arch/hexagon/include/asm/syscall.h    |  7 +++++++
 arch/loongarch/include/asm/syscall.h  |  7 +++++++
 arch/m68k/include/asm/syscall.h       |  7 +++++++
 arch/microblaze/include/asm/syscall.h |  7 +++++++
 arch/mips/include/asm/syscall.h       | 14 ++++++++++++++
 arch/nios2/include/asm/syscall.h      |  5 +++++
 arch/openrisc/include/asm/syscall.h   |  6 ++++++
 arch/parisc/include/asm/syscall.h     |  7 +++++++
 arch/powerpc/include/asm/syscall.h    | 10 ++++++++++
 arch/riscv/include/asm/syscall.h      |  7 +++++++
 arch/s390/include/asm/syscall.h       | 12 ++++++++++++
 arch/sh/include/asm/syscall_32.h      | 12 ++++++++++++
 arch/sparc/include/asm/syscall.h      | 12 ++++++++++++
 arch/um/include/asm/syscall-generic.h |  5 +++++
 arch/x86/include/asm/syscall.h        |  7 +++++++
 arch/xtensa/include/asm/syscall.h     |  7 +++++++
 include/asm-generic/syscall.h         | 14 ++++++++++++++
 20 files changed, 197 insertions(+)

diff --git a/arch/arc/include/asm/syscall.h b/arch/arc/include/asm/syscall.h
index 89c1e1736356..728d625a10f1 100644
--- a/arch/arc/include/asm/syscall.h
+++ b/arch/arc/include/asm/syscall.h
@@ -23,6 +23,17 @@ syscall_get_nr(struct task_struct *task, struct pt_regs *regs)
 		return -1;
 }
 
+static inline void
+syscall_set_nr(struct task_struct *task, struct pt_regs *regs, int nr)
+{
+	/*
+	 * Unlike syscall_get_nr(), syscall_set_nr() can be called only when
+	 * the target task is stopped for tracing on entering syscall, so
+	 * there is no need to have the same check syscall_get_nr() has.
+	 */
+	regs->r8 = nr;
+}
+
 static inline void
 syscall_rollback(struct task_struct *task, struct pt_regs *regs)
 {
diff --git a/arch/arm/include/asm/syscall.h b/arch/arm/include/asm/syscall.h
index 21927fa0ae2b..18b102a30741 100644
--- a/arch/arm/include/asm/syscall.h
+++ b/arch/arm/include/asm/syscall.h
@@ -68,6 +68,30 @@ static inline void syscall_set_return_value(struct task_struct *task,
 	regs->ARM_r0 = (long) error ? error : val;
 }
 
+static inline void syscall_set_nr(struct task_struct *task,
+				  struct pt_regs *regs,
+				  int nr)
+{
+	if (nr == -1) {
+		task_thread_info(task)->abi_syscall = -1;
+		/*
+		 * When the syscall number is set to -1, the syscall will be
+		 * skipped.  In this case the syscall return value has to be
+		 * set explicitly, otherwise the first syscall argument is
+		 * returned as the syscall return value.
+		 */
+		syscall_set_return_value(task, regs, -ENOSYS, 0);
+		return;
+	}
+	if ((IS_ENABLED(CONFIG_AEABI) && !IS_ENABLED(CONFIG_OABI_COMPAT))) {
+		task_thread_info(task)->abi_syscall = nr;
+		return;
+	}
+	task_thread_info(task)->abi_syscall =
+		(task_thread_info(task)->abi_syscall & ~__NR_SYSCALL_MASK) |
+		(nr & __NR_SYSCALL_MASK);
+}
+
 #define SYSCALL_MAX_ARGS 7
 
 static inline void syscall_get_arguments(struct task_struct *task,
diff --git a/arch/arm64/include/asm/syscall.h b/arch/arm64/include/asm/syscall.h
index 76020b66286b..712daa90e643 100644
--- a/arch/arm64/include/asm/syscall.h
+++ b/arch/arm64/include/asm/syscall.h
@@ -61,6 +61,22 @@ static inline void syscall_set_return_value(struct task_struct *task,
 	regs->regs[0] = val;
 }
 
+static inline void syscall_set_nr(struct task_struct *task,
+				  struct pt_regs *regs,
+				  int nr)
+{
+	regs->syscallno = nr;
+	if (nr == -1) {
+		/*
+		 * When the syscall number is set to -1, the syscall will be
+		 * skipped.  In this case the syscall return value has to be
+		 * set explicitly, otherwise the first syscall argument is
+		 * returned as the syscall return value.
+		 */
+		syscall_set_return_value(task, regs, -ENOSYS, 0);
+	}
+}
+
 #define SYSCALL_MAX_ARGS 6
 
 static inline void syscall_get_arguments(struct task_struct *task,
diff --git a/arch/hexagon/include/asm/syscall.h b/arch/hexagon/include/asm/syscall.h
index 1024a6548d78..70637261817a 100644
--- a/arch/hexagon/include/asm/syscall.h
+++ b/arch/hexagon/include/asm/syscall.h
@@ -26,6 +26,13 @@ static inline long syscall_get_nr(struct task_struct *task,
 	return regs->r06;
 }
 
+static inline void syscall_set_nr(struct task_struct *task,
+				  struct pt_regs *regs,
+				  int nr)
+{
+	regs->r06 = nr;
+}
+
 static inline void syscall_get_arguments(struct task_struct *task,
 					 struct pt_regs *regs,
 					 unsigned long *args)
diff --git a/arch/loongarch/include/asm/syscall.h b/arch/loongarch/include/asm/syscall.h
index ff415b3c0a8e..81d2733f7b94 100644
--- a/arch/loongarch/include/asm/syscall.h
+++ b/arch/loongarch/include/asm/syscall.h
@@ -26,6 +26,13 @@ static inline long syscall_get_nr(struct task_struct *task,
 	return regs->regs[11];
 }
 
+static inline void syscall_set_nr(struct task_struct *task,
+				  struct pt_regs *regs,
+				  int nr)
+{
+	regs->regs[11] = nr;
+}
+
 static inline void syscall_rollback(struct task_struct *task,
 				    struct pt_regs *regs)
 {
diff --git a/arch/m68k/include/asm/syscall.h b/arch/m68k/include/asm/syscall.h
index d1453e850cdd..bf84b160c2eb 100644
--- a/arch/m68k/include/asm/syscall.h
+++ b/arch/m68k/include/asm/syscall.h
@@ -14,6 +14,13 @@ static inline int syscall_get_nr(struct task_struct *task,
 	return regs->orig_d0;
 }
 
+static inline void syscall_set_nr(struct task_struct *task,
+				  struct pt_regs *regs,
+				  int nr)
+{
+	regs->orig_d0 = nr;
+}
+
 static inline void syscall_rollback(struct task_struct *task,
 				    struct pt_regs *regs)
 {
diff --git a/arch/microblaze/include/asm/syscall.h b/arch/microblaze/include/asm/syscall.h
index 5eb3f624cc59..b5b6b91fae3e 100644
--- a/arch/microblaze/include/asm/syscall.h
+++ b/arch/microblaze/include/asm/syscall.h
@@ -14,6 +14,13 @@ static inline long syscall_get_nr(struct task_struct *task,
 	return regs->r12;
 }
 
+static inline void syscall_set_nr(struct task_struct *task,
+				  struct pt_regs *regs,
+				  int nr)
+{
+	regs->r12 = nr;
+}
+
 static inline void syscall_rollback(struct task_struct *task,
 				    struct pt_regs *regs)
 {
diff --git a/arch/mips/include/asm/syscall.h b/arch/mips/include/asm/syscall.h
index 3163d1506fae..58d68205fd2c 100644
--- a/arch/mips/include/asm/syscall.h
+++ b/arch/mips/include/asm/syscall.h
@@ -41,6 +41,20 @@ static inline long syscall_get_nr(struct task_struct *task,
 	return task_thread_info(task)->syscall;
 }
 
+static inline void syscall_set_nr(struct task_struct *task,
+				  struct pt_regs *regs,
+				  int nr)
+{
+	/*
+	 * New syscall number has to be assigned to regs[2] because
+	 * syscall_trace_entry() loads it from there unconditionally.
+	 *
+	 * Consequently, if the syscall was indirect and nr != __NR_syscall,
+	 * then after this assignment the syscall will cease to be indirect.
+	 */
+	task_thread_info(task)->syscall = regs->regs[2] = nr;
+}
+
 static inline void mips_syscall_update_nr(struct task_struct *task,
 					  struct pt_regs *regs)
 {
diff --git a/arch/nios2/include/asm/syscall.h b/arch/nios2/include/asm/syscall.h
index 526449edd768..8e3eb1d689bb 100644
--- a/arch/nios2/include/asm/syscall.h
+++ b/arch/nios2/include/asm/syscall.h
@@ -15,6 +15,11 @@ static inline int syscall_get_nr(struct task_struct *task, struct pt_regs *regs)
 	return regs->r2;
 }
 
+static inline void syscall_set_nr(struct task_struct *task, struct pt_regs *regs, int nr)
+{
+	regs->r2 = nr;
+}
+
 static inline void syscall_rollback(struct task_struct *task,
 				struct pt_regs *regs)
 {
diff --git a/arch/openrisc/include/asm/syscall.h b/arch/openrisc/include/asm/syscall.h
index e6383be2a195..5e037d9659c5 100644
--- a/arch/openrisc/include/asm/syscall.h
+++ b/arch/openrisc/include/asm/syscall.h
@@ -25,6 +25,12 @@ syscall_get_nr(struct task_struct *task, struct pt_regs *regs)
 	return regs->orig_gpr11;
 }
 
+static inline void
+syscall_set_nr(struct task_struct *task, struct pt_regs *regs, int nr)
+{
+	regs->orig_gpr11 = nr;
+}
+
 static inline void
 syscall_rollback(struct task_struct *task, struct pt_regs *regs)
 {
diff --git a/arch/parisc/include/asm/syscall.h b/arch/parisc/include/asm/syscall.h
index b146d0ae4c77..c11222798ab2 100644
--- a/arch/parisc/include/asm/syscall.h
+++ b/arch/parisc/include/asm/syscall.h
@@ -17,6 +17,13 @@ static inline long syscall_get_nr(struct task_struct *tsk,
 	return regs->gr[20];
 }
 
+static inline void syscall_set_nr(struct task_struct *tsk,
+				  struct pt_regs *regs,
+				  int nr)
+{
+	regs->gr[20] = nr;
+}
+
 static inline void syscall_get_arguments(struct task_struct *tsk,
 					 struct pt_regs *regs,
 					 unsigned long *args)
diff --git a/arch/powerpc/include/asm/syscall.h b/arch/powerpc/include/asm/syscall.h
index 521f279e6b33..7505dcfed247 100644
--- a/arch/powerpc/include/asm/syscall.h
+++ b/arch/powerpc/include/asm/syscall.h
@@ -39,6 +39,16 @@ static inline int syscall_get_nr(struct task_struct *task, struct pt_regs *regs)
 		return -1;
 }
 
+static inline void syscall_set_nr(struct task_struct *task, struct pt_regs *regs, int nr)
+{
+	/*
+	 * Unlike syscall_get_nr(), syscall_set_nr() can be called only when
+	 * the target task is stopped for tracing on entering syscall, so
+	 * there is no need to have the same check syscall_get_nr() has.
+	 */
+	regs->gpr[0] = nr;
+}
+
 static inline void syscall_rollback(struct task_struct *task,
 				    struct pt_regs *regs)
 {
diff --git a/arch/riscv/include/asm/syscall.h b/arch/riscv/include/asm/syscall.h
index 8d389ba995c8..a5281cdf2b10 100644
--- a/arch/riscv/include/asm/syscall.h
+++ b/arch/riscv/include/asm/syscall.h
@@ -30,6 +30,13 @@ static inline int syscall_get_nr(struct task_struct *task,
 	return regs->a7;
 }
 
+static inline void syscall_set_nr(struct task_struct *task,
+				  struct pt_regs *regs,
+				  int nr)
+{
+	regs->a7 = nr;
+}
+
 static inline void syscall_rollback(struct task_struct *task,
 				    struct pt_regs *regs)
 {
diff --git a/arch/s390/include/asm/syscall.h b/arch/s390/include/asm/syscall.h
index b3dd883699e7..12cd0c60c07b 100644
--- a/arch/s390/include/asm/syscall.h
+++ b/arch/s390/include/asm/syscall.h
@@ -24,6 +24,18 @@ static inline long syscall_get_nr(struct task_struct *task,
 		(regs->int_code & 0xffff) : -1;
 }
 
+static inline void syscall_set_nr(struct task_struct *task,
+				  struct pt_regs *regs,
+				  int nr)
+{
+	/*
+	 * Unlike syscall_get_nr(), syscall_set_nr() can be called only when
+	 * the target task is stopped for tracing on entering syscall, so
+	 * there is no need to have the same check syscall_get_nr() has.
+	 */
+	regs->int_code = (regs->int_code & ~0xffff) | (nr & 0xffff);
+}
+
 static inline void syscall_rollback(struct task_struct *task,
 				    struct pt_regs *regs)
 {
diff --git a/arch/sh/include/asm/syscall_32.h b/arch/sh/include/asm/syscall_32.h
index cb51a7528384..7027d87d901d 100644
--- a/arch/sh/include/asm/syscall_32.h
+++ b/arch/sh/include/asm/syscall_32.h
@@ -15,6 +15,18 @@ static inline long syscall_get_nr(struct task_struct *task,
 	return (regs->tra >= 0) ? regs->regs[3] : -1L;
 }
 
+static inline void syscall_set_nr(struct task_struct *task,
+				  struct pt_regs *regs,
+				  int nr)
+{
+	/*
+	 * Unlike syscall_get_nr(), syscall_set_nr() can be called only when
+	 * the target task is stopped for tracing on entering syscall, so
+	 * there is no need to have the same check syscall_get_nr() has.
+	 */
+	regs->regs[3] = nr;
+}
+
 static inline void syscall_rollback(struct task_struct *task,
 				    struct pt_regs *regs)
 {
diff --git a/arch/sparc/include/asm/syscall.h b/arch/sparc/include/asm/syscall.h
index 62a5a78804c4..b0233924d323 100644
--- a/arch/sparc/include/asm/syscall.h
+++ b/arch/sparc/include/asm/syscall.h
@@ -25,6 +25,18 @@ static inline long syscall_get_nr(struct task_struct *task,
 	return (syscall_p ? regs->u_regs[UREG_G1] : -1L);
 }
 
+static inline void syscall_set_nr(struct task_struct *task,
+				  struct pt_regs *regs,
+				  int nr)
+{
+	/*
+	 * Unlike syscall_get_nr(), syscall_set_nr() can be called only when
+	 * the target task is stopped for tracing on entering syscall, so
+	 * there is no need to have the same check syscall_get_nr() has.
+	 */
+	regs->u_regs[UREG_G1] = nr;
+}
+
 static inline void syscall_rollback(struct task_struct *task,
 				    struct pt_regs *regs)
 {
diff --git a/arch/um/include/asm/syscall-generic.h b/arch/um/include/asm/syscall-generic.h
index 2984feb9d576..bcd73bcfe577 100644
--- a/arch/um/include/asm/syscall-generic.h
+++ b/arch/um/include/asm/syscall-generic.h
@@ -21,6 +21,11 @@ static inline int syscall_get_nr(struct task_struct *task, struct pt_regs *regs)
 	return PT_REGS_SYSCALL_NR(regs);
 }
 
+static inline void syscall_set_nr(struct task_struct *task, struct pt_regs *regs, int nr)
+{
+	PT_REGS_SYSCALL_NR(regs) = nr;
+}
+
 static inline void syscall_rollback(struct task_struct *task,
 				    struct pt_regs *regs)
 {
diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
index b9c249dd9e3d..c10dbb74cd00 100644
--- a/arch/x86/include/asm/syscall.h
+++ b/arch/x86/include/asm/syscall.h
@@ -38,6 +38,13 @@ static inline int syscall_get_nr(struct task_struct *task, struct pt_regs *regs)
 	return regs->orig_ax;
 }
 
+static inline void syscall_set_nr(struct task_struct *task,
+				  struct pt_regs *regs,
+				  int nr)
+{
+	regs->orig_ax = nr;
+}
+
 static inline void syscall_rollback(struct task_struct *task,
 				    struct pt_regs *regs)
 {
diff --git a/arch/xtensa/include/asm/syscall.h b/arch/xtensa/include/asm/syscall.h
index f9a671cbf933..7db3b489c8ad 100644
--- a/arch/xtensa/include/asm/syscall.h
+++ b/arch/xtensa/include/asm/syscall.h
@@ -28,6 +28,13 @@ static inline long syscall_get_nr(struct task_struct *task,
 	return regs->syscall;
 }
 
+static inline void syscall_set_nr(struct task_struct *task,
+				  struct pt_regs *regs,
+				  int nr)
+{
+	regs->syscall = nr;
+}
+
 static inline void syscall_rollback(struct task_struct *task,
 				    struct pt_regs *regs)
 {
diff --git a/include/asm-generic/syscall.h b/include/asm-generic/syscall.h
index 0f7b9a493de7..e33fd4e783c1 100644
--- a/include/asm-generic/syscall.h
+++ b/include/asm-generic/syscall.h
@@ -37,6 +37,20 @@ struct pt_regs;
  */
 int syscall_get_nr(struct task_struct *task, struct pt_regs *regs);
 
+/**
+ * syscall_set_nr - change the system call a task is executing
+ * @task:	task of interest, must be blocked
+ * @regs:	task_pt_regs() of @task
+ * @nr:		system call number
+ *
+ * Changes the system call number @task is about to execute.
+ *
+ * It's only valid to call this when @task is stopped for tracing on
+ * entry to a system call, due to %SYSCALL_WORK_SYSCALL_TRACE or
+ * %SYSCALL_WORK_SYSCALL_AUDIT.
+ */
+void syscall_set_nr(struct task_struct *task, struct pt_regs *regs, int nr);
+
 /**
  * syscall_rollback - roll back registers after an aborted system call
  * @task:	task of interest, must be in system call exit tracing
-- 
ldv

* [PATCH v2 5/7] ptrace_get_syscall_info: factor out ptrace_get_syscall_info_op
       [not found] <20250113170925.GA392@strace.io>
                   ` (3 preceding siblings ...)
  2025-01-13 17:11 ` [PATCH v2 4/7] syscall.h: introduce syscall_set_nr() Dmitry V. Levin
@ 2025-01-13 17:12 ` Dmitry V. Levin
  2025-01-13 17:12 ` [PATCH v2 6/7] ptrace: introduce PTRACE_SET_SYSCALL_INFO request Dmitry V. Levin
  2025-01-13 17:12 ` [PATCH v2 7/7] selftests/ptrace: add a test case for PTRACE_SET_SYSCALL_INFO Dmitry V. Levin
  6 siblings, 0 replies; 65+ messages in thread
From: Dmitry V. Levin @ 2025-01-13 17:12 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Eugene Syromyatnikov, Mike Frysinger, Renzo Davoli,
	Davide Berardi, strace-devel, linux-kernel

Move the code that calculates the type of the system call stop
out of ptrace_get_syscall_info() into a separate function,
ptrace_get_syscall_info_op(), which will be used later
to implement the PTRACE_SET_SYSCALL_INFO API.
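
As an illustration only, the classification performed by the new helper can
be sketched in userspace C.  This is not the kernel code: the SK_* names and
sk_classify_stop() are hypothetical, the constant values are local copies of
the uapi ones, and the sketch assumes the kernel switch is keyed on the
si_code of child->last_siginfo (0 when there is none), with "message"
standing in for child->ptrace_message.

```c
#include <signal.h>

/* Local mirrors of the uapi constants; the SK_ names are hypothetical. */
enum sk_info_op {
	SK_INFO_NONE    = 0,	/* PTRACE_SYSCALL_INFO_NONE */
	SK_INFO_ENTRY   = 1,	/* PTRACE_SYSCALL_INFO_ENTRY */
	SK_INFO_EXIT    = 2,	/* PTRACE_SYSCALL_INFO_EXIT */
	SK_INFO_SECCOMP = 3,	/* PTRACE_SYSCALL_INFO_SECCOMP */
};

enum {
	SK_EVENT_SECCOMP = 7,	/* PTRACE_EVENT_SECCOMP */
	SK_MSG_ENTRY     = 1,	/* PTRACE_EVENTMSG_SYSCALL_ENTRY */
	SK_MSG_EXIT      = 2,	/* PTRACE_EVENTMSG_SYSCALL_EXIT */
};

/*
 * si_code: assumed to stand in for child->last_siginfo->si_code
 * message: assumed to stand in for child->ptrace_message
 */
static enum sk_info_op sk_classify_stop(int si_code, unsigned long message)
{
	switch (si_code) {
	case SIGTRAP | 0x80:
		/* syscall stop: the event message tells entry from exit */
		switch (message) {
		case SK_MSG_ENTRY:
			return SK_INFO_ENTRY;
		case SK_MSG_EXIT:
			return SK_INFO_EXIT;
		default:
			return SK_INFO_NONE;
		}
	case SIGTRAP | (SK_EVENT_SECCOMP << 8):
		return SK_INFO_SECCOMP;
	default:
		/* any other kind of stop is not a system call stop */
		return SK_INFO_NONE;
	}
}
```

Returning the op from one place lets the getter and a future setter agree on
what kind of stop the tracee is in without duplicating the switch.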

Signed-off-by: Dmitry V. Levin <ldv@strace.io>
---
 kernel/ptrace.c | 58 +++++++++++++++++++++++++++++--------------------
 1 file changed, 34 insertions(+), 24 deletions(-)

diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index d5f89f9ef29f..22e7d74cf4cd 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -921,7 +921,6 @@ ptrace_get_syscall_info_entry(struct task_struct *child, struct pt_regs *regs,
 	unsigned long args[ARRAY_SIZE(info->entry.args)];
 	int i;
 
-	info->op = PTRACE_SYSCALL_INFO_ENTRY;
 	info->entry.nr = syscall_get_nr(child, regs);
 	syscall_get_arguments(child, regs, args);
 	for (i = 0; i < ARRAY_SIZE(args); i++)
@@ -943,7 +942,6 @@ ptrace_get_syscall_info_seccomp(struct task_struct *child, struct pt_regs *regs,
 	 * diverge significantly enough.
 	 */
 	ptrace_get_syscall_info_entry(child, regs, info);
-	info->op = PTRACE_SYSCALL_INFO_SECCOMP;
 	info->seccomp.ret_data = child->ptrace_message;
 
 	/* ret_data is the last field in struct ptrace_syscall_info.seccomp */
@@ -954,7 +952,6 @@ static unsigned long
 ptrace_get_syscall_info_exit(struct task_struct *child, struct pt_regs *regs,
 			     struct ptrace_syscall_info *info)
 {
-	info->op = PTRACE_SYSCALL_INFO_EXIT;
 	info->exit.rval = syscall_get_error(child, regs);
 	info->exit.is_error = !!info->exit.rval;
 	if (!info->exit.is_error)
@@ -965,19 +962,8 @@ ptrace_get_syscall_info_exit(struct task_struct *child, struct pt_regs *regs,
 }
 
 static int
-ptrace_get_syscall_info(struct task_struct *child, unsigned long user_size,
-			void __user *datavp)
+ptrace_get_syscall_info_op(struct task_struct *child)
 {
-	struct pt_regs *regs = task_pt_regs(child);
-	struct ptrace_syscall_info info = {
-		.op = PTRACE_SYSCALL_INFO_NONE,
-		.arch = syscall_get_arch(child),
-		.instruction_pointer = instruction_pointer(regs),
-		.stack_pointer = user_stack_pointer(regs),
-	};
-	unsigned long actual_size = offsetof(struct ptrace_syscall_info, entry);
-	unsigned long write_size;
-
 	/*
 	 * This does not need lock_task_sighand() to access
 	 * child->last_siginfo because ptrace_freeze_traced()
@@ -988,18 +974,42 @@ ptrace_get_syscall_info(struct task_struct *child, unsigned long user_size,
 	case SIGTRAP | 0x80:
 		switch (child->ptrace_message) {
 		case PTRACE_EVENTMSG_SYSCALL_ENTRY:
-			actual_size = ptrace_get_syscall_info_entry(child, regs,
-								    &info);
-			break;
+			return PTRACE_SYSCALL_INFO_ENTRY;
 		case PTRACE_EVENTMSG_SYSCALL_EXIT:
-			actual_size = ptrace_get_syscall_info_exit(child, regs,
-								   &info);
-			break;
+			return PTRACE_SYSCALL_INFO_EXIT;
+		default:
+			return PTRACE_SYSCALL_INFO_NONE;
 		}
-		break;
 	case SIGTRAP | (PTRACE_EVENT_SECCOMP << 8):
-		actual_size = ptrace_get_syscall_info_seccomp(child, regs,
-							      &info);
+		return PTRACE_SYSCALL_INFO_SECCOMP;
+	default:
+		return PTRACE_SYSCALL_INFO_NONE;
+	}
+}
+
+static int
+ptrace_get_syscall_info(struct task_struct *child, unsigned long user_size,
+			void __user *datavp)
+{
+	struct pt_regs *regs = task_pt_regs(child);
+	struct ptrace_syscall_info info = {
+		.op = ptrace_get_syscall_info_op(child),
+		.arch = syscall_get_arch(child),
+		.instruction_pointer = instruction_pointer(regs),
+		.stack_pointer = user_stack_pointer(regs),
+	};
+	unsigned long actual_size = offsetof(struct ptrace_syscall_info, entry);
+	unsigned long write_size;
+
+	switch (info.op) {
+	case PTRACE_SYSCALL_INFO_ENTRY:
+		actual_size = ptrace_get_syscall_info_entry(child, regs, &info);
+		break;
+	case PTRACE_SYSCALL_INFO_EXIT:
+		actual_size = ptrace_get_syscall_info_exit(child, regs, &info);
+		break;
+	case PTRACE_SYSCALL_INFO_SECCOMP:
+		actual_size = ptrace_get_syscall_info_seccomp(child, regs, &info);
 		break;
 	}
 
-- 
ldv

* [PATCH v2 6/7] ptrace: introduce PTRACE_SET_SYSCALL_INFO request
       [not found] <20250113170925.GA392@strace.io>
                   ` (4 preceding siblings ...)
  2025-01-13 17:12 ` [PATCH v2 5/7] ptrace_get_syscall_info: factor out ptrace_get_syscall_info_op Dmitry V. Levin
@ 2025-01-13 17:12 ` Dmitry V. Levin
  2025-01-15 16:38   ` Oleg Nesterov
                     ` (2 more replies)
  2025-01-13 17:12 ` [PATCH v2 7/7] selftests/ptrace: add a test case for PTRACE_SET_SYSCALL_INFO Dmitry V. Levin
  6 siblings, 3 replies; 65+ messages in thread
From: Dmitry V. Levin @ 2025-01-13 17:12 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Eugene Syromyatnikov, Mike Frysinger, Renzo Davoli,
	Davide Berardi, strace-devel, linux-kernel, linux-api

PTRACE_SET_SYSCALL_INFO is a generic ptrace API that complements
PTRACE_GET_SYSCALL_INFO by letting the ptracer modify details of
system calls the tracee is blocked in.

This API allows ptracers to obtain and modify system call details
in a straightforward and architecture-agnostic way.

The current implementation supports changing only those bits of system call
information that are used by strace, namely, syscall number, syscall
arguments, and syscall return value.

Support for changing additional details returned by PTRACE_GET_SYSCALL_INFO,
such as instruction pointer and stack pointer, could be added later
if needed, by using struct ptrace_syscall_info.flags to specify
the additional details that should be set.  Currently, flags and reserved
fields of struct ptrace_syscall_info must be initialized with zeroes;
arch, instruction_pointer, and stack_pointer fields are ignored.
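
A minimal tracer-side sketch of preparing such a request follows.  The
struct is a hypothetical local mirror of struct ptrace_syscall_info as laid
out by this patch, and sk_fill_entry() and sk_fill_exit() are illustrative
helpers, not part of any API.  Zeroing the whole structure satisfies the
"flags and reserved must be zero" rule; the ignored fields are simply left
at zero.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical local mirror of struct ptrace_syscall_info after this patch. */
struct sk_syscall_info {
	uint8_t  op;
	uint8_t  reserved;		/* must be zero */
	uint16_t flags;			/* must be zero */
	uint32_t arch;			/* ignored on set */
	uint64_t instruction_pointer;	/* ignored on set */
	uint64_t stack_pointer;		/* ignored on set */
	union {
		struct { uint64_t nr; uint64_t args[6]; } entry;
		struct { int64_t rval; uint8_t is_error; } exit;
		struct { uint64_t nr; uint64_t args[6]; uint32_t ret_data; } seccomp;
	};
};

enum { SK_INFO_ENTRY = 1, SK_INFO_EXIT = 2 };	/* PTRACE_SYSCALL_INFO_* */

/* Prepare an entry-stop request: new syscall number and six arguments. */
static void sk_fill_entry(struct sk_syscall_info *info, int nr,
			  const uint64_t args[6])
{
	memset(info, 0, sizeof(*info));
	info->op = SK_INFO_ENTRY;
	/*
	 * Sign-extend, so that -1 becomes all ones: the kernel side in
	 * this patch rejects values that do not round-trip through int.
	 */
	info->entry.nr = (uint64_t)(int64_t)nr;
	for (int i = 0; i < 6; i++)
		info->entry.args[i] = args[i];
}

/* Prepare an exit-stop request: rval is -errno when is_error is set. */
static void sk_fill_exit(struct sk_syscall_info *info, int is_error,
			 int64_t rval)
{
	memset(info, 0, sizeof(*info));
	info->op = SK_INFO_EXIT;
	info->exit.is_error = is_error ? 1 : 0;
	info->exit.rval = rval;
}
```

A real tracer would more likely start from the structure returned by
PTRACE_GET_SYSCALL_INFO and change only the fields of interest, as the
selftest in this series does.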

PTRACE_SET_SYSCALL_INFO currently supports only PTRACE_SYSCALL_INFO_ENTRY,
PTRACE_SYSCALL_INFO_EXIT, and PTRACE_SYSCALL_INFO_SECCOMP operations.
Other operations could be added later if needed.

Ideally, PTRACE_SET_SYSCALL_INFO should have been introduced along with
PTRACE_GET_SYSCALL_INFO, but it didn't happen.  The last straw that
convinced me to implement PTRACE_SET_SYSCALL_INFO was the apparent failure
to provide an API for changing the first system call argument on the riscv
architecture.

ptrace(2) man page:

long ptrace(enum __ptrace_request request, pid_t pid, void *addr, void *data);
...
PTRACE_SET_SYSCALL_INFO
       Modify information about the system call that caused the stop.
       The "data" argument is a pointer to struct ptrace_syscall_info
       that specifies the system call information to be set.
       The "addr" argument should be set to sizeof(struct ptrace_syscall_info).
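
The request-level checks described above (size passed in "addr", zeroed
flags and reserved, op matching the current stop) can be sketched as a
pure function.  This is a userspace illustration under stated assumptions,
not the kernel implementation: sk_check_request() and the SK_* names are
hypothetical, page_size stands in for PAGE_SIZE, and the copy from user
memory, which may fail on its own, is left out.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SK_INFO_SIZE_VER0 84	/* PTRACE_SYSCALL_INFO_SIZE_VER0 in this patch */

enum {
	SK_INFO_NONE    = 0,	/* PTRACE_SYSCALL_INFO_NONE */
	SK_INFO_ENTRY   = 1,	/* PTRACE_SYSCALL_INFO_ENTRY */
	SK_INFO_EXIT    = 2,	/* PTRACE_SYSCALL_INFO_EXIT */
	SK_INFO_SECCOMP = 3,	/* PTRACE_SYSCALL_INFO_SECCOMP */
};

/*
 * Returns 0 if a request with these properties would pass the checks
 * sketched here, -EINVAL otherwise.  current_op is the type of the stop
 * the tracee is actually in.
 */
static int sk_check_request(size_t user_size, size_t page_size,
			    uint8_t op, uint8_t reserved, uint16_t flags,
			    uint8_t current_op)
{
	/* The size must cover at least the first published layout. */
	if (user_size < SK_INFO_SIZE_VER0 || user_size > page_size)
		return -EINVAL;

	/* Reserved for future use. */
	if (flags || reserved)
		return -EINVAL;

	/* Changing the type of the system call stop is not supported. */
	if (op != current_op)
		return -EINVAL;

	switch (op) {
	case SK_INFO_ENTRY:
	case SK_INFO_EXIT:
	case SK_INFO_SECCOMP:
		return 0;
	default:
		/* Other types of stops are not supported. */
		return -EINVAL;
	}
}
```

Accepting any size from the first published layout up to one page leaves
room for the structure to grow without breaking older tracers.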

Link: https://lore.kernel.org/all/59505464-c84a-403d-972f-d4b2055eeaac@gmail.com/
Signed-off-by: Dmitry V. Levin <ldv@strace.io>
---
 include/linux/ptrace.h      |  3 ++
 include/uapi/linux/ptrace.h |  4 +-
 kernel/ptrace.c             | 95 +++++++++++++++++++++++++++++++++++++
 3 files changed, 101 insertions(+), 1 deletion(-)

diff --git a/include/linux/ptrace.h b/include/linux/ptrace.h
index 90507d4afcd6..c8dbf1e498bf 100644
--- a/include/linux/ptrace.h
+++ b/include/linux/ptrace.h
@@ -17,6 +17,9 @@ struct syscall_info {
 	struct seccomp_data	data;
 };
 
+/* sizeof() the first published struct ptrace_syscall_info */
+#define PTRACE_SYSCALL_INFO_SIZE_VER0	84
+
 extern int ptrace_access_vm(struct task_struct *tsk, unsigned long addr,
 			    void *buf, int len, unsigned int gup_flags);
 
diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h
index 72c038fc71d0..ca75b3ab5d22 100644
--- a/include/uapi/linux/ptrace.h
+++ b/include/uapi/linux/ptrace.h
@@ -74,6 +74,7 @@ struct seccomp_metadata {
 };
 
 #define PTRACE_GET_SYSCALL_INFO		0x420e
+#define PTRACE_SET_SYSCALL_INFO		0x4212
 #define PTRACE_SYSCALL_INFO_NONE	0
 #define PTRACE_SYSCALL_INFO_ENTRY	1
 #define PTRACE_SYSCALL_INFO_EXIT	2
@@ -81,7 +82,8 @@ struct seccomp_metadata {
 
 struct ptrace_syscall_info {
 	__u8 op;	/* PTRACE_SYSCALL_INFO_* */
-	__u8 pad[3];
+	__u8 reserved;
+	__u16 flags;
 	__u32 arch;
 	__u64 instruction_pointer;
 	__u64 stack_pointer;
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 22e7d74cf4cd..41d37cb8f74a 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -1016,6 +1016,97 @@ ptrace_get_syscall_info(struct task_struct *child, unsigned long user_size,
 	write_size = min(actual_size, user_size);
 	return copy_to_user(datavp, &info, write_size) ? -EFAULT : actual_size;
 }
+
+static unsigned long
+ptrace_set_syscall_info_entry(struct task_struct *child, struct pt_regs *regs,
+			      struct ptrace_syscall_info *info)
+{
+	unsigned long args[ARRAY_SIZE(info->entry.args)];
+	int nr = info->entry.nr;
+	int i;
+
+	if (nr != info->entry.nr)
+		return -ERANGE;
+
+	for (i = 0; i < ARRAY_SIZE(args); i++) {
+		args[i] = info->entry.args[i];
+		if (args[i] != info->entry.args[i])
+			return -ERANGE;
+	}
+
+	syscall_set_nr(child, regs, nr);
+	/*
+	 * If the syscall number is set to -1, setting syscall arguments is not
+	 * just pointless, it would also clobber the syscall return value on
+	 * those architectures that share the same register both for the first
+	 * argument of syscall and its return value.
+	 */
+	if (nr != -1)
+		syscall_set_arguments(child, regs, args);
+
+	return 0;
+}
+
+static unsigned long
+ptrace_set_syscall_info_seccomp(struct task_struct *child, struct pt_regs *regs,
+				struct ptrace_syscall_info *info)
+{
+	/*
+	 * info->entry is currently a subset of info->seccomp,
+	 * info->seccomp.ret_data is currently ignored.
+	 */
+	return ptrace_set_syscall_info_entry(child, regs, info);
+}
+
+static unsigned long
+ptrace_set_syscall_info_exit(struct task_struct *child, struct pt_regs *regs,
+			     struct ptrace_syscall_info *info)
+{
+	if (info->exit.is_error)
+		syscall_set_return_value(child, regs, info->exit.rval, 0);
+	else
+		syscall_set_return_value(child, regs, 0, info->exit.rval);
+
+	return 0;
+}
+
+static int
+ptrace_set_syscall_info(struct task_struct *child, unsigned long user_size,
+			void __user *datavp)
+{
+	struct pt_regs *regs = task_pt_regs(child);
+	struct ptrace_syscall_info info;
+	int error;
+
+	BUILD_BUG_ON(sizeof(struct ptrace_syscall_info) < PTRACE_SYSCALL_INFO_SIZE_VER0);
+
+	if (user_size < PTRACE_SYSCALL_INFO_SIZE_VER0 || user_size > PAGE_SIZE)
+		return -EINVAL;
+
+	error = copy_struct_from_user(&info, sizeof(info), datavp, user_size);
+	if (error)
+		return error;
+
+	/* Reserved for future use. */
+	if (info.flags || info.reserved)
+		return -EINVAL;
+
+	/* Changing the type of the system call stop is not supported. */
+	if (ptrace_get_syscall_info_op(child) != info.op)
+		return -EINVAL;
+
+	switch (info.op) {
+	case PTRACE_SYSCALL_INFO_ENTRY:
+		return ptrace_set_syscall_info_entry(child, regs, &info);
+	case PTRACE_SYSCALL_INFO_EXIT:
+		return ptrace_set_syscall_info_exit(child, regs, &info);
+	case PTRACE_SYSCALL_INFO_SECCOMP:
+		return ptrace_set_syscall_info_seccomp(child, regs, &info);
+	default:
+		/* Other types of system call stops are not supported. */
+		return -EINVAL;
+	}
+}
 #endif /* CONFIG_HAVE_ARCH_TRACEHOOK */
 
 int ptrace_request(struct task_struct *child, long request,
@@ -1234,6 +1325,10 @@ int ptrace_request(struct task_struct *child, long request,
 	case PTRACE_GET_SYSCALL_INFO:
 		ret = ptrace_get_syscall_info(child, addr, datavp);
 		break;
+
+	case PTRACE_SET_SYSCALL_INFO:
+		ret = ptrace_set_syscall_info(child, addr, datavp);
+		break;
 #endif
 
 	case PTRACE_SECCOMP_GET_FILTER:
-- 
ldv

* [PATCH v2 7/7] selftests/ptrace: add a test case for PTRACE_SET_SYSCALL_INFO
       [not found] <20250113170925.GA392@strace.io>
                   ` (5 preceding siblings ...)
  2025-01-13 17:12 ` [PATCH v2 6/7] ptrace: introduce PTRACE_SET_SYSCALL_INFO request Dmitry V. Levin
@ 2025-01-13 17:12 ` Dmitry V. Levin
  6 siblings, 0 replies; 65+ messages in thread
From: Dmitry V. Levin @ 2025-01-13 17:12 UTC (permalink / raw)
  To: Oleg Nesterov, Shuah Khan
  Cc: Eugene Syromyatnikov, Mike Frysinger, Renzo Davoli,
	Davide Berardi, strace-devel, linux-kernel, linux-kselftest

Check whether the PTRACE_SET_SYSCALL_INFO semantics implemented in the
kernel match userspace expectations.

Signed-off-by: Dmitry V. Levin <ldv@strace.io>
---
 tools/testing/selftests/ptrace/Makefile       |   2 +-
 .../selftests/ptrace/set_syscall_info.c       | 441 ++++++++++++++++++
 2 files changed, 442 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/ptrace/set_syscall_info.c

diff --git a/tools/testing/selftests/ptrace/Makefile b/tools/testing/selftests/ptrace/Makefile
index 1c631740a730..c5e0b76ba6ac 100644
--- a/tools/testing/selftests/ptrace/Makefile
+++ b/tools/testing/selftests/ptrace/Makefile
@@ -1,6 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 CFLAGS += -std=c99 -pthread -Wall $(KHDR_INCLUDES)
 
-TEST_GEN_PROGS := get_syscall_info peeksiginfo vmaccess get_set_sud
+TEST_GEN_PROGS := get_syscall_info set_syscall_info peeksiginfo vmaccess get_set_sud
 
 include ../lib.mk
diff --git a/tools/testing/selftests/ptrace/set_syscall_info.c b/tools/testing/selftests/ptrace/set_syscall_info.c
new file mode 100644
index 000000000000..c977991e0c4a
--- /dev/null
+++ b/tools/testing/selftests/ptrace/set_syscall_info.c
@@ -0,0 +1,441 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (c) 2018-2025 Dmitry V. Levin <ldv@strace.io>
+ * All rights reserved.
+ *
+ * Check whether PTRACE_SET_SYSCALL_INFO semantics implemented in the kernel
+ * matches userspace expectations.
+ */
+
+#include "../kselftest_harness.h"
+#include <err.h>
+#include <fcntl.h>
+#include <signal.h>
+#include <asm/unistd.h>
+#include <linux/types.h>
+#include <linux/ptrace.h>
+
+static int
+kill_tracee(pid_t pid)
+{
+	if (!pid)
+		return 0;
+
+	int saved_errno = errno;
+
+	int rc = kill(pid, SIGKILL);
+
+	errno = saved_errno;
+	return rc;
+}
+
+static long
+sys_ptrace(int request, pid_t pid, unsigned long addr, unsigned long data)
+{
+	return syscall(__NR_ptrace, request, pid, addr, data);
+}
+
+#define LOG_KILL_TRACEE(fmt, ...)				\
+	do {							\
+		kill_tracee(pid);				\
+		TH_LOG("wait #%d: " fmt,			\
+		       ptrace_stop, ##__VA_ARGS__);		\
+	} while (0)
+
+struct si_entry {
+	int nr;
+	__kernel_ulong_t args[6];
+};
+struct si_exit {
+	unsigned int is_error;
+	int rval;
+};
+
+TEST(set_syscall_info)
+{
+	const pid_t tracer_pid = getpid();
+	const __kernel_ulong_t dummy[] = {
+		(__kernel_ulong_t) 0xdad0bef0bad0fed0ULL,
+		(__kernel_ulong_t) 0xdad1bef1bad1fed1ULL,
+		(__kernel_ulong_t) 0xdad2bef2bad2fed2ULL,
+		(__kernel_ulong_t) 0xdad3bef3bad3fed3ULL,
+		(__kernel_ulong_t) 0xdad4bef4bad4fed4ULL,
+		(__kernel_ulong_t) 0xdad5bef5bad5fed5ULL,
+	};
+	int splice_in[2], splice_out[2];
+
+	ASSERT_EQ(0, pipe(splice_in));
+	ASSERT_EQ(0, pipe(splice_out));
+	ASSERT_EQ(sizeof(dummy), write(splice_in[1], dummy, sizeof(dummy)));
+
+	const struct {
+		struct si_entry entry[2];
+		struct si_exit exit[2];
+	} si[] = {
+		/* change scno, keep non-error rval */
+		{
+			{
+				{
+					__NR_gettid,
+					{
+						dummy[0], dummy[1], dummy[2],
+						dummy[3], dummy[4], dummy[5]
+					}
+				}, {
+					__NR_getppid,
+					{
+						dummy[0], dummy[1], dummy[2],
+						dummy[3], dummy[4], dummy[5]
+					}
+				}
+			}, {
+				{ 0, tracer_pid }, { 0, tracer_pid }
+			}
+		},
+
+		/* set scno to -1, keep error rval */
+		{
+			{
+				{
+					__NR_chdir,
+					{
+						(__kernel_ulong_t) ".",
+						dummy[1], dummy[2],
+						dummy[3], dummy[4], dummy[5]
+					}
+				}, {
+					-1,
+					{
+						(__kernel_ulong_t) ".",
+						dummy[1], dummy[2],
+						dummy[3], dummy[4], dummy[5]
+					}
+				}
+			}, {
+				{ 1, -ENOSYS }, { 1, -ENOSYS }
+			}
+		},
+
+		/* keep scno, change non-error rval */
+		{
+			{
+				{
+					__NR_getppid,
+					{
+						dummy[0], dummy[1], dummy[2],
+						dummy[3], dummy[4], dummy[5]
+					}
+				}, {
+					__NR_getppid,
+					{
+						dummy[0], dummy[1], dummy[2],
+						dummy[3], dummy[4], dummy[5]
+					}
+				}
+			}, {
+				{ 0, tracer_pid }, { 0, tracer_pid + 1 }
+			}
+		},
+
+		/* change arg1, keep non-error rval */
+		{
+			{
+				{
+					__NR_chdir,
+					{
+						(__kernel_ulong_t) "",
+						dummy[1], dummy[2],
+						dummy[3], dummy[4], dummy[5]
+					}
+				}, {
+					__NR_chdir,
+					{
+						(__kernel_ulong_t) ".",
+						dummy[1], dummy[2],
+						dummy[3], dummy[4], dummy[5]
+					}
+				}
+			}, {
+				{ 0, 0 }, { 0, 0 }
+			}
+		},
+
+		/* set scno to -1, change error rval to non-error */
+		{
+			{
+				{
+					__NR_gettid,
+					{
+						dummy[0], dummy[1], dummy[2],
+						dummy[3], dummy[4], dummy[5]
+					}
+				}, {
+					-1,
+					{
+						dummy[0], dummy[1], dummy[2],
+						dummy[3], dummy[4], dummy[5]
+					}
+				}
+			}, {
+				{ 1, -ENOSYS }, { 0, tracer_pid }
+			}
+		},
+
+		/* change scno, change non-error rval to error */
+		{
+			{
+				{
+					__NR_chdir,
+					{
+						dummy[0], dummy[1], dummy[2],
+						dummy[3], dummy[4], dummy[5]
+					}
+				}, {
+					__NR_getppid,
+					{
+						dummy[0], dummy[1], dummy[2],
+						dummy[3], dummy[4], dummy[5]
+					}
+				}
+			}, {
+				{ 0, tracer_pid }, { 1, -EISDIR }
+			}
+		},
+
+		/* change scno and all args, change non-error rval */
+		{
+			{
+				{
+					__NR_gettid,
+					{
+						splice_in[0], dummy[1], dummy[2],
+						dummy[3], dummy[4], dummy[5]
+					}
+				}, {
+					__NR_splice,
+					{
+						splice_in[0], 0, splice_out[1], 0,
+						sizeof(dummy), SPLICE_F_NONBLOCK
+					}
+				}
+			}, {
+				{ 0, sizeof(dummy) }, { 0, sizeof(dummy) + 1 }
+			}
+		},
+
+		/* change arg1, no exit stop */
+		{
+			{
+				{
+					__NR_exit_group,
+					{
+						dummy[0], dummy[1], dummy[2],
+						dummy[3], dummy[4], dummy[5]
+					}
+				}, {
+					__NR_exit_group,
+					{
+						0, dummy[1], dummy[2],
+						dummy[3], dummy[4], dummy[5]
+					}
+				}
+			}, {
+				{ 0, 0 }, { 0, 0 }
+			}
+		},
+	};
+
+	long rc;
+	unsigned int i;
+	unsigned int ptrace_stop;
+
+	pid_t pid = fork();
+
+	ASSERT_LE(0, pid) {
+		TH_LOG("fork: %m");
+	}
+
+	if (pid == 0) {
+		/* get the pid before PTRACE_TRACEME */
+		pid = getpid();
+		ASSERT_EQ(0, sys_ptrace(PTRACE_TRACEME, 0, 0, 0)) {
+			TH_LOG("PTRACE_TRACEME: %m");
+		}
+		ASSERT_EQ(0, kill(pid, SIGSTOP)) {
+			/* cannot happen */
+			TH_LOG("kill SIGSTOP: %m");
+		}
+		for (i = 0; i < ARRAY_SIZE(si); ++i) {
+			rc = syscall(si[i].entry[0].nr,
+				     si[i].entry[0].args[0],
+				     si[i].entry[0].args[1],
+				     si[i].entry[0].args[2],
+				     si[i].entry[0].args[3],
+				     si[i].entry[0].args[4],
+				     si[i].entry[0].args[5]);
+			if (si[i].exit[1].is_error) {
+				if (rc != -1 || errno != -si[i].exit[1].rval)
+					break;
+			} else {
+				if (rc != si[i].exit[1].rval)
+					break;
+			}
+		}
+		/*
+		 * Something went wrong, but in this state tracee
+		 * cannot reliably issue syscalls, so just crash.
+		 */
+		*(volatile unsigned char *) (uintptr_t) i = 42;
+		/* unreachable */
+		_exit(i + 1);
+	}
+
+	for (ptrace_stop = 0; ; ++ptrace_stop) {
+		struct ptrace_syscall_info info = {
+			.op = 0xff	/* invalid PTRACE_SYSCALL_INFO_* op */
+		};
+		const size_t size = sizeof(info);
+		const int expected_entry_size =
+			(void *) &info.entry.args[6] - (void *) &info;
+		const int expected_exit_size =
+			(void *) (&info.exit.is_error + 1) -
+			(void *) &info;
+		int status;
+
+		ASSERT_EQ(pid, wait(&status)) {
+			/* cannot happen */
+			LOG_KILL_TRACEE("wait: %m");
+		}
+		if (WIFEXITED(status)) {
+			pid = 0;	/* the tracee is no more */
+			ASSERT_EQ(0, WEXITSTATUS(status)) {
+				LOG_KILL_TRACEE("unexpected exit status %u",
+						WEXITSTATUS(status));
+			}
+			break;
+		}
+		ASSERT_FALSE(WIFSIGNALED(status)) {
+			pid = 0;	/* the tracee is no more */
+			LOG_KILL_TRACEE("unexpected signal %u",
+					WTERMSIG(status));
+		}
+		ASSERT_TRUE(WIFSTOPPED(status)) {
+			/* cannot happen */
+			LOG_KILL_TRACEE("unexpected wait status %#x", status);
+		}
+
+		ASSERT_LT(ptrace_stop, ARRAY_SIZE(si) * 2) {
+			LOG_KILL_TRACEE("ptrace stop overflow");
+		}
+
+		switch (WSTOPSIG(status)) {
+		case SIGSTOP:
+			ASSERT_EQ(0, ptrace_stop) {
+				LOG_KILL_TRACEE("unexpected signal stop");
+			}
+			ASSERT_EQ(0, sys_ptrace(PTRACE_SETOPTIONS, pid, 0,
+						PTRACE_O_TRACESYSGOOD)) {
+				LOG_KILL_TRACEE("PTRACE_SETOPTIONS: %m");
+			}
+			break;
+
+		case SIGTRAP | 0x80:
+			ASSERT_LT(0, ptrace_stop) {
+				LOG_KILL_TRACEE("unexpected syscall stop");
+			}
+			ASSERT_LT(0, (rc = sys_ptrace(PTRACE_GET_SYSCALL_INFO,
+						      pid, size,
+						      (uintptr_t) &info))) {
+				LOG_KILL_TRACEE("PTRACE_GET_SYSCALL_INFO: %m");
+			}
+			if (ptrace_stop & 1) {
+				/* entering syscall */
+				const struct si_entry *exp_entry =
+					&si[ptrace_stop / 2].entry[0];
+				const struct si_entry *set_entry =
+					&si[ptrace_stop / 2].entry[1];
+
+				ASSERT_EQ(expected_entry_size, rc) {
+					LOG_KILL_TRACEE("entry stop mismatch");
+				}
+				ASSERT_EQ(PTRACE_SYSCALL_INFO_ENTRY, info.op) {
+					LOG_KILL_TRACEE("entry stop mismatch");
+				}
+				ASSERT_TRUE(info.arch) {
+					LOG_KILL_TRACEE("entry stop mismatch");
+				}
+				ASSERT_TRUE(info.instruction_pointer) {
+					LOG_KILL_TRACEE("entry stop mismatch");
+				}
+				ASSERT_TRUE(info.stack_pointer) {
+					LOG_KILL_TRACEE("entry stop mismatch");
+				}
+				ASSERT_EQ(exp_entry->nr, info.entry.nr) {
+					LOG_KILL_TRACEE("syscall nr mismatch");
+				}
+				for (i = 0; i < ARRAY_SIZE(exp_entry->args); ++i) {
+					ASSERT_EQ(exp_entry->args[i], info.entry.args[i]) {
+						LOG_KILL_TRACEE("syscall arg #%u mismatch", i);
+					}
+				}
+				info.entry.nr = set_entry->nr;
+				for (i = 0; i < ARRAY_SIZE(set_entry->args); ++i)
+					info.entry.args[i] = set_entry->args[i];
+				ASSERT_EQ(0, sys_ptrace(PTRACE_SET_SYSCALL_INFO,
+							pid, size,
+							(uintptr_t) &info)) {
+					LOG_KILL_TRACEE("PTRACE_SET_SYSCALL_INFO: %m");
+				}
+			} else {
+				/* exiting syscall */
+				const struct si_exit *exp_exit =
+					&si[ptrace_stop / 2 - 1].exit[0];
+				const struct si_exit *set_exit =
+					&si[ptrace_stop / 2 - 1].exit[1];
+
+				ASSERT_EQ(expected_exit_size, rc) {
+					LOG_KILL_TRACEE("exit stop mismatch");
+				}
+				ASSERT_EQ(PTRACE_SYSCALL_INFO_EXIT, info.op) {
+					LOG_KILL_TRACEE("exit stop mismatch");
+				}
+				ASSERT_TRUE(info.arch) {
+					LOG_KILL_TRACEE("exit stop mismatch");
+				}
+				ASSERT_TRUE(info.instruction_pointer) {
+					LOG_KILL_TRACEE("exit stop mismatch");
+				}
+				ASSERT_TRUE(info.stack_pointer) {
+					LOG_KILL_TRACEE("exit stop mismatch");
+				}
+				ASSERT_EQ(exp_exit->is_error, info.exit.is_error) {
+					LOG_KILL_TRACEE("exit stop mismatch");
+				}
+				ASSERT_EQ(exp_exit->rval, info.exit.rval) {
+					LOG_KILL_TRACEE("exit stop mismatch");
+				}
+				info.exit.is_error = set_exit->is_error;
+				info.exit.rval = set_exit->rval;
+				ASSERT_EQ(0, sys_ptrace(PTRACE_SET_SYSCALL_INFO,
+							pid, size,
+							(uintptr_t) &info)) {
+					LOG_KILL_TRACEE("PTRACE_SET_SYSCALL_INFO: %m");
+				}
+			}
+			break;
+
+		default:
+			LOG_KILL_TRACEE("unexpected stop signal %u",
+					WSTOPSIG(status));
+			abort();
+		}
+
+		ASSERT_EQ(0, sys_ptrace(PTRACE_SYSCALL, pid, 0, 0)) {
+			LOG_KILL_TRACEE("PTRACE_SYSCALL: %m");
+		}
+	}
+
+	ASSERT_EQ(ptrace_stop, ARRAY_SIZE(si) * 2);
+}
+
+TEST_HARNESS_MAIN
-- 
ldv

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 1/7] powerpc: properly negate error in syscall_set_return_value()
  2025-01-13 17:10 ` [PATCH v2 1/7] powerpc: properly negate error in syscall_set_return_value() Dmitry V. Levin
@ 2025-01-13 17:34   ` Christophe Leroy
  2025-01-13 17:54     ` Dmitry V. Levin
  2025-01-14 17:04     ` Dmitry V. Levin
  2025-01-14 13:00   ` Alexey Gladkov
  1 sibling, 2 replies; 65+ messages in thread
From: Christophe Leroy @ 2025-01-13 17:34 UTC (permalink / raw)
  To: Dmitry V. Levin, Oleg Nesterov, Michael Ellerman
  Cc: Eugene Syromyatnikov, Mike Frysinger, Renzo Davoli,
	Davide Berardi, strace-devel, Madhavan Srinivasan,
	Nicholas Piggin, Naveen N Rao, linuxppc-dev, linux-kernel



On 13/01/2025 at 18:10, Dmitry V. Levin wrote:
> Bring syscall_set_return_value() in sync with syscall_get_error(),
> and let upcoming ptrace/set_syscall_info selftest pass on powerpc.
> 
> This reverts commit 1b1a3702a65c ("powerpc: Don't negate error in
> syscall_set_return_value()").

There is a clear, detailed explanation in that commit of why it needs to
be done.

If you think that commit is wrong, you have to explain why in at least
the same level of detail.

> 
> Signed-off-by: Dmitry V. Levin <ldv@strace.io>
> ---
>   arch/powerpc/include/asm/syscall.h | 6 +++++-
>   1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/syscall.h b/arch/powerpc/include/asm/syscall.h
> index 3dd36c5e334a..422d7735ace6 100644
> --- a/arch/powerpc/include/asm/syscall.h
> +++ b/arch/powerpc/include/asm/syscall.h
> @@ -82,7 +82,11 @@ static inline void syscall_set_return_value(struct task_struct *task,
>   		 */
>   		if (error) {
>   			regs->ccr |= 0x10000000L;
> -			regs->gpr[3] = error;
> +			/*
> +			 * In case of an error regs->gpr[3] contains
> +			 * a positive ERRORCODE.
> +			 */
> +			regs->gpr[3] = -error;
>   		} else {
>   			regs->ccr &= ~0x10000000L;
>   			regs->gpr[3] = val;


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 1/7] powerpc: properly negate error in syscall_set_return_value()
  2025-01-13 17:34   ` Christophe Leroy
@ 2025-01-13 17:54     ` Dmitry V. Levin
  2025-01-14 17:04     ` Dmitry V. Levin
  1 sibling, 0 replies; 65+ messages in thread
From: Dmitry V. Levin @ 2025-01-13 17:54 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Oleg Nesterov, Michael Ellerman, Eugene Syromyatnikov,
	Mike Frysinger, Renzo Davoli, Davide Berardi, strace-devel,
	Madhavan Srinivasan, Nicholas Piggin, Naveen N Rao, linuxppc-dev,
	linux-kernel

On Mon, Jan 13, 2025 at 06:34:44PM +0100, Christophe Leroy wrote:
> On 13/01/2025 at 18:10, Dmitry V. Levin wrote:
> > Bring syscall_set_return_value() in sync with syscall_get_error(),
> > and let upcoming ptrace/set_syscall_info selftest pass on powerpc.
> > 
> > This reverts commit 1b1a3702a65c ("powerpc: Don't negate error in
> > syscall_set_return_value()").
> 
> There is a clear, detailed explanation in that commit of why it needs to
> be done.
> 
> If you think that commit is wrong, you have to explain why in at least
> the same level of detail.

I'm sorry, I'm by no means a powerpc expert, so I cannot explain why that
commit was added in the first place; I wish Michael could do that
himself.  All I can say is that, for some mysterious reason, the current
syscall_set_return_value() implementation assumes that in case of an error
regs->gpr[3] has to be negative, while, according to the well-tested
syscall_get_error(), it has to be positive.

This is very visible with PTRACE_SET_SYSCALL_INFO that exposes
syscall_set_return_value() to userspace, and, in particular, with the
architecture-agnostic ptrace/set_syscall_info selftest added later in the
series.

> > diff --git a/arch/powerpc/include/asm/syscall.h b/arch/powerpc/include/asm/syscall.h
> > index 3dd36c5e334a..422d7735ace6 100644
> > --- a/arch/powerpc/include/asm/syscall.h
> > +++ b/arch/powerpc/include/asm/syscall.h
> > @@ -82,7 +82,11 @@ static inline void syscall_set_return_value(struct task_struct *task,
> >   		 */
> >   		if (error) {
> >   			regs->ccr |= 0x10000000L;
> > -			regs->gpr[3] = error;
> > +			/*
> > +			 * In case of an error regs->gpr[3] contains
> > +			 * a positive ERRORCODE.
> > +			 */
> > +			regs->gpr[3] = -error;
> >   		} else {
> >   			regs->ccr &= ~0x10000000L;
> >   			regs->gpr[3] = val;

-- 
ldv

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 2/7] mips: fix mips_get_syscall_arg() for O32 and N32
  2025-01-13 17:11 ` [PATCH v2 2/7] mips: fix mips_get_syscall_arg() for O32 and N32 Dmitry V. Levin
@ 2025-01-14  3:29   ` Maciej W. Rozycki
  2025-01-14  8:47     ` Dmitry V. Levin
  0 siblings, 1 reply; 65+ messages in thread
From: Maciej W. Rozycki @ 2025-01-14  3:29 UTC (permalink / raw)
  To: Dmitry V. Levin
  Cc: Oleg Nesterov, Thomas Bogendoerfer, Eugene Syromyatnikov,
	Mike Frysinger, Renzo Davoli, Davide Berardi, strace-devel,
	linux-mips, linux-kernel

On Mon, 13 Jan 2025, Dmitry V. Levin wrote:

> Fix the following get_syscall_info test assertion on mips O32:
>   # get_syscall_info.c:218:get_syscall_info:Expected exp_args[5] (3134521044) == info.entry.args[4] (4911432)
>   # get_syscall_info.c:219:get_syscall_info:wait #1: entry stop mismatch
> 
> Fix the following get_syscall_info test assertion on mips64 O32 and mips64 N32:
>   # get_syscall_info.c:209:get_syscall_info:Expected exp_args[2] (3134324433) == info.entry.args[1] (18446744072548908753)
>   # get_syscall_info.c:210:get_syscall_info:wait #1: entry stop mismatch

 How did you produce these results?

> This makes ptrace/get_syscall_info selftest pass on mips O32,
> mips64 O32, and mips64 N32.
> 
> Signed-off-by: Dmitry V. Levin <ldv@strace.io>
> ---
> 
> Note that I'm not a MIPS expert, so I cannot tell why the get_user()
> approach doesn't work for O32.  Also, during experiments I discovered that
> regs->pad0 approach works for O32, but why it works remains a mystery.

 The patch is definitely broken, the calling convention is the same 
between n32 and n64: 64-bit arguments in $4 through $11 registers as 
required, and your change makes n32 truncate arguments to 32 bits.

 The regs->pad0 approach works due to this piece:

	/*
	 * Ok, copy the args from the luser stack to the kernel stack.
	 */

	.set    push
	.set    noreorder
	.set	nomacro

load_a4: user_lw(t5, 16(t0))		# argument #5 from usp
load_a5: user_lw(t6, 20(t0))		# argument #6 from usp
load_a6: user_lw(t7, 24(t0))		# argument #7 from usp
load_a7: user_lw(t8, 28(t0))		# argument #8 from usp
loads_done:

	sw	t5, 16(sp)		# argument #5 to ksp
	sw	t6, 20(sp)		# argument #6 to ksp
	sw	t7, 24(sp)		# argument #7 to ksp
	sw	t8, 28(sp)		# argument #8 to ksp
	.set	pop

	.section __ex_table,"a"
	PTR_WD	load_a4, bad_stack_a4
	PTR_WD	load_a5, bad_stack_a5
	PTR_WD	load_a6, bad_stack_a6
	PTR_WD	load_a7, bad_stack_a7
	.previous

in arch/mips/kernel/scall32-o32.S (and arch/mips/kernel/scall64-o32.S has 
analogous code to adapt to the native n64 calling convention instead), but 
this doesn't seem to me to be the correct approach here.  At first glance 
`mips_get_syscall_arg' does appear fine as it is, so I'd like to know how 
you obtained your results.

  Maciej

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 2/7] mips: fix mips_get_syscall_arg() for O32 and N32
  2025-01-14  3:29   ` Maciej W. Rozycki
@ 2025-01-14  8:47     ` Dmitry V. Levin
  2025-01-14 16:03       ` Maciej W. Rozycki
  0 siblings, 1 reply; 65+ messages in thread
From: Dmitry V. Levin @ 2025-01-14  8:47 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Oleg Nesterov, Thomas Bogendoerfer, Eugene Syromyatnikov,
	Mike Frysinger, Renzo Davoli, Davide Berardi, strace-devel,
	linux-mips, linux-kernel

On Tue, Jan 14, 2025 at 03:29:11AM +0000, Maciej W. Rozycki wrote:
> On Mon, 13 Jan 2025, Dmitry V. Levin wrote:
> 
> > Fix the following get_syscall_info test assertion on mips O32:
> >   # get_syscall_info.c:218:get_syscall_info:Expected exp_args[5] (3134521044) == info.entry.args[4] (4911432)
> >   # get_syscall_info.c:219:get_syscall_info:wait #1: entry stop mismatch
> > 
> > Fix the following get_syscall_info test assertion on mips64 O32 and mips64 N32:
> >   # get_syscall_info.c:209:get_syscall_info:Expected exp_args[2] (3134324433) == info.entry.args[1] (18446744072548908753)
> >   # get_syscall_info.c:210:get_syscall_info:wait #1: entry stop mismatch
> 
>  How did you produce these results?

$ PATH="$HOME/x-tools/mips64-unknown-linux-gnu/bin:$PATH" make -j`nproc` ARCH=mips CROSS_COMPILE=mips64-unknown-linux-gnu- -C tools/testing/selftests TARGETS=ptrace USERLDFLAGS='-static' USERCFLAGS='-mabi=32'
$ echo init |(cd tools/testing/selftests/ptrace && ln -snf get_syscall_info init && cpio --dereference -o -H newc -R 0:0) |gzip >get_syscall_info.mips-o32.img
$ qemu-system-mips -nographic -kernel vmlinuz -initrd get_syscall_info.mips-o32.img -append 'console=ttyS0'

Likewise for mips64, but the patch for kselftest_harness.h from [1]
is needed to see correct mismatch values in the test diagnostics.

[1] https://lore.kernel.org/all/20250108170757.GA6723@strace.io/

> > This makes ptrace/get_syscall_info selftest pass on mips O32,
> > mips64 O32, and mips64 N32.
> > 
> > Signed-off-by: Dmitry V. Levin <ldv@strace.io>
> > ---
> > 
> > Note that I'm not a MIPS expert, so I cannot tell why the get_user()
> > approach doesn't work for O32.  Also, during experiments I discovered that
> > regs->pad0 approach works for O32, but why it works remains a mystery.
> 
>  The patch is definitely broken, the calling convention is the same 
> between n32 and n64: 64-bit arguments in $4 through $11 registers as 
> required, and your change makes n32 truncate arguments to 32 bits.

There must be something very specific to n32 then: apparently,
__kernel_ulong_t is a 32-bit type on n32, so the syscall arguments are
32-bit values, at some point (in glibc?) they get sign-extended from 32 to
64 bits, and syscall_get_arguments returns them as 64-bit values different
from the original syscall arguments, breaking the test.
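
To make the mismatch in the quoted diagnostics concrete, here is a minimal
user-space sketch (not kernel code; the names sign_extend_32_to_64 and
demo_n32_sign_extension are made up for illustration).  It only shows the
arithmetic: sign-extending the expected 32-bit value 3134324433 to 64 bits
gives exactly the 18446744072548908753 reported by the test.  Where the
extension actually happens on n32 is not established here.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: build the 64-bit register image that results
 * from sign-extending a 32-bit value, without relying on
 * implementation-defined signed conversions.
 */
static uint64_t sign_extend_32_to_64(uint32_t v)
{
	uint64_t r = v;

	if (v & UINT32_C(0x80000000))
		r |= UINT64_C(0xffffffff00000000);
	return r;
}

/* Checks the pair of values from the mips64 diagnostics quoted above. */
void demo_n32_sign_extension(void)
{
	uint32_t expected = UINT32_C(3134324433);	/* exp_args[2] */
	uint64_t seen = sign_extend_32_to_64(expected);	/* info.entry.args[1] */

	assert(seen == UINT64_C(18446744072548908753));
	/* Truncating back to 32 bits recovers the original argument. */
	assert((uint32_t)seen == expected);
}
```

Values with bit 31 clear are unaffected by the extension, so only arguments
with the top bit of the 32-bit value set would show this mismatch.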

If this is the expected behaviour, then I'd have to add an exception for
mips n32 both to the kernel test and to strace that uses this interface.

>  The regs->pad0 approach works due to this piece:
> 
> 	/*
> 	 * Ok, copy the args from the luser stack to the kernel stack.
> 	 */
> 
> 	.set    push
> 	.set    noreorder
> 	.set	nomacro
> 
> load_a4: user_lw(t5, 16(t0))		# argument #5 from usp
> load_a5: user_lw(t6, 20(t0))		# argument #6 from usp
> load_a6: user_lw(t7, 24(t0))		# argument #7 from usp
> load_a7: user_lw(t8, 28(t0))		# argument #8 from usp
> loads_done:
> 
> 	sw	t5, 16(sp)		# argument #5 to ksp
> 	sw	t6, 20(sp)		# argument #6 to ksp
> 	sw	t7, 24(sp)		# argument #7 to ksp
> 	sw	t8, 28(sp)		# argument #8 to ksp
> 	.set	pop
> 
> 	.section __ex_table,"a"
> 	PTR_WD	load_a4, bad_stack_a4
> 	PTR_WD	load_a5, bad_stack_a5
> 	PTR_WD	load_a6, bad_stack_a6
> 	PTR_WD	load_a7, bad_stack_a7
> 	.previous
> 
> in arch/mips/kernel/scall32-o32.S (and arch/mips/kernel/scall64-o32.S has 
> analogous code to adapt to the native n64 calling convention instead), but 
> this doesn't seem to me to be the correct approach here.  At first glance 
> `mips_get_syscall_arg' does appear fine as it is, so I'd like to know how 
> you obtained your results.
> 
>   Maciej

-- 
ldv

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 1/7] powerpc: properly negate error in syscall_set_return_value()
  2025-01-13 17:10 ` [PATCH v2 1/7] powerpc: properly negate error in syscall_set_return_value() Dmitry V. Levin
  2025-01-13 17:34   ` Christophe Leroy
@ 2025-01-14 13:00   ` Alexey Gladkov
  2025-01-14 13:48     ` Dmitry V. Levin
  1 sibling, 1 reply; 65+ messages in thread
From: Alexey Gladkov @ 2025-01-14 13:00 UTC (permalink / raw)
  To: Dmitry V. Levin
  Cc: Oleg Nesterov, Michael Ellerman, Eugene Syromyatnikov,
	Mike Frysinger, Renzo Davoli, Davide Berardi, strace-devel,
	Madhavan Srinivasan, Nicholas Piggin, Christophe Leroy,
	Naveen N Rao, linuxppc-dev, linux-kernel

On Mon, Jan 13, 2025 at 07:10:54PM +0200, Dmitry V. Levin wrote:
> Bring syscall_set_return_value() in sync with syscall_get_error(),
> and let upcoming ptrace/set_syscall_info selftest pass on powerpc.
> 
> This reverts commit 1b1a3702a65c ("powerpc: Don't negate error in
> syscall_set_return_value()").
> 
> Signed-off-by: Dmitry V. Levin <ldv@strace.io>
> ---
>  arch/powerpc/include/asm/syscall.h | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/syscall.h b/arch/powerpc/include/asm/syscall.h
> index 3dd36c5e334a..422d7735ace6 100644
> --- a/arch/powerpc/include/asm/syscall.h
> +++ b/arch/powerpc/include/asm/syscall.h
> @@ -82,7 +82,11 @@ static inline void syscall_set_return_value(struct task_struct *task,
>  		 */
>  		if (error) {
>  			regs->ccr |= 0x10000000L;
> -			regs->gpr[3] = error;
> +			/*
> +			 * In case of an error regs->gpr[3] contains
> +			 * a positive ERRORCODE.
> +			 */
> +			regs->gpr[3] = -error;

After this change, syscall_get_error() will return a positive value if
the system call failed, since syscall_get_error() still believes
regs->gpr[3] is positive in the !trap_is_scv() case.

Or am I missing something?

It looks like the selftest you mentioned in the commit message doesn't
check the !trap_is_scv() branch.

>  		} else {
>  			regs->ccr &= ~0x10000000L;
>  			regs->gpr[3] = val;
> -- 
> ldv

-- 
Rgrds, legion


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 1/7] powerpc: properly negate error in syscall_set_return_value()
  2025-01-14 13:00   ` Alexey Gladkov
@ 2025-01-14 13:48     ` Dmitry V. Levin
  2025-01-14 14:53       ` Alexey Gladkov
  0 siblings, 1 reply; 65+ messages in thread
From: Dmitry V. Levin @ 2025-01-14 13:48 UTC (permalink / raw)
  To: Alexey Gladkov
  Cc: Oleg Nesterov, Michael Ellerman, Eugene Syromyatnikov,
	Mike Frysinger, Renzo Davoli, Davide Berardi, strace-devel,
	Madhavan Srinivasan, Nicholas Piggin, Christophe Leroy,
	Naveen N Rao, linuxppc-dev, linux-kernel

On Tue, Jan 14, 2025 at 02:00:16PM +0100, Alexey Gladkov wrote:
> On Mon, Jan 13, 2025 at 07:10:54PM +0200, Dmitry V. Levin wrote:
> > Bring syscall_set_return_value() in sync with syscall_get_error(),
> > and let upcoming ptrace/set_syscall_info selftest pass on powerpc.
> > 
> > This reverts commit 1b1a3702a65c ("powerpc: Don't negate error in
> > syscall_set_return_value()").
> > 
> > Signed-off-by: Dmitry V. Levin <ldv@strace.io>
> > ---
> >  arch/powerpc/include/asm/syscall.h | 6 +++++-
> >  1 file changed, 5 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/powerpc/include/asm/syscall.h b/arch/powerpc/include/asm/syscall.h
> > index 3dd36c5e334a..422d7735ace6 100644
> > --- a/arch/powerpc/include/asm/syscall.h
> > +++ b/arch/powerpc/include/asm/syscall.h
> > @@ -82,7 +82,11 @@ static inline void syscall_set_return_value(struct task_struct *task,
> >  		 */
> >  		if (error) {
> >  			regs->ccr |= 0x10000000L;
> > -			regs->gpr[3] = error;
> > +			/*
> > +			 * In case of an error regs->gpr[3] contains
> > +			 * a positive ERRORCODE.
> > +			 */
> > +			regs->gpr[3] = -error;
> 
> After this change, syscall_get_error() will return a positive value if
> the system call failed, since syscall_get_error() still believes
> regs->gpr[3] is positive in the !trap_is_scv() case.
> 
> Or am I missing something?

syscall_get_error() does the following in case of !trap_is_scv():

                /*
                 * If the system call failed,
                 * regs->gpr[3] contains a positive ERRORCODE.
                 */
                return (regs->ccr & 0x10000000UL) ? -regs->gpr[3] : 0;

That is, in !trap_is_scv() case it assumes that regs->gpr[3] is positive
and is going to return a negative value (-ERRORCODE).

> It looks like the selftest you mentioned in the commit message doesn't
> check the !trap_is_scv() branch.

The selftest is architecture-agnostic, it just executes syscalls and
checks whether the data returned by PTRACE_GET_SYSCALL_INFO meets
expectations.  Do you mean that syscall() is not good enough for syscall
invocation from coverage perspective on powerpc?

See also commit d72500f99284 ("powerpc/64s/syscall: Fix ptrace syscall
info with scv syscalls").


-- 
ldv

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 1/7] powerpc: properly negate error in syscall_set_return_value()
  2025-01-14 13:48     ` Dmitry V. Levin
@ 2025-01-14 14:53       ` Alexey Gladkov
  0 siblings, 0 replies; 65+ messages in thread
From: Alexey Gladkov @ 2025-01-14 14:53 UTC (permalink / raw)
  To: Dmitry V. Levin
  Cc: Oleg Nesterov, Michael Ellerman, Eugene Syromyatnikov,
	Mike Frysinger, Renzo Davoli, Davide Berardi, strace-devel,
	Madhavan Srinivasan, Nicholas Piggin, Christophe Leroy,
	Naveen N Rao, linuxppc-dev, linux-kernel

On Tue, Jan 14, 2025 at 03:48:44PM +0200, Dmitry V. Levin wrote:
> On Tue, Jan 14, 2025 at 02:00:16PM +0100, Alexey Gladkov wrote:
> > On Mon, Jan 13, 2025 at 07:10:54PM +0200, Dmitry V. Levin wrote:
> > > Bring syscall_set_return_value() in sync with syscall_get_error(),
> > > and let upcoming ptrace/set_syscall_info selftest pass on powerpc.
> > > 
> > > This reverts commit 1b1a3702a65c ("powerpc: Don't negate error in
> > > syscall_set_return_value()").
> > > 
> > > Signed-off-by: Dmitry V. Levin <ldv@strace.io>
> > > ---
> > >  arch/powerpc/include/asm/syscall.h | 6 +++++-
> > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/arch/powerpc/include/asm/syscall.h b/arch/powerpc/include/asm/syscall.h
> > > index 3dd36c5e334a..422d7735ace6 100644
> > > --- a/arch/powerpc/include/asm/syscall.h
> > > +++ b/arch/powerpc/include/asm/syscall.h
> > > @@ -82,7 +82,11 @@ static inline void syscall_set_return_value(struct task_struct *task,
> > >  		 */
> > >  		if (error) {
> > >  			regs->ccr |= 0x10000000L;
> > > -			regs->gpr[3] = error;
> > > +			/*
> > > +			 * In case of an error regs->gpr[3] contains
> > > +			 * a positive ERRORCODE.
> > > +			 */
> > > +			regs->gpr[3] = -error;
> > 
> > After this change, syscall_get_error() will return a positive value if
> > the system call failed, since syscall_get_error() still believes
> > regs->gpr[3] is positive in the !trap_is_scv() case.
> > 
> > Or am I missing something?
> 
> syscall_get_error() does the following in case of !trap_is_scv():
> 
>                 /*
>                  * If the system call failed,
>                  * regs->gpr[3] contains a positive ERRORCODE.
>                  */
>                 return (regs->ccr & 0x10000000UL) ? -regs->gpr[3] : 0;
> 
> That is, in !trap_is_scv() case it assumes that regs->gpr[3] is positive
> and is going to return a negative value (-ERRORCODE).

Yeah. Now I see it.

	if (trap_is_scv(regs)) {
		regs->result = -EINTR;
		regs->gpr[3] = -EINTR;
	} else {
		regs->result = -EINTR;
		regs->gpr[3] = EINTR;
		regs->ccr |= 0x10000000;
	}

Two different APIs imply gpr[3] with a different sign.

You can add:

Reviewed-by: Alexey Gladkov <legion@kernel.org>

> > It looks like the selftest you mentioned in the commit message doesn't
> > check the !trap_is_scv() branch.
> 
> The selftest is architecture-agnostic, it just executes syscalls and
> checks whether the data returned by PTRACE_GET_SYSCALL_INFO meets
> expectations.  Do you mean that syscall() is not good enough for syscall
> invocation from coverage perspective on powerpc?
> 
> See also commit d72500f99284 ("powerpc/64s/syscall: Fix ptrace syscall
> info with scv syscalls").
> 
> 
> -- 
> ldv

-- 
Rgrds, legion


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 2/7] mips: fix mips_get_syscall_arg() for O32 and N32
  2025-01-14  8:47     ` Dmitry V. Levin
@ 2025-01-14 16:03       ` Maciej W. Rozycki
  2025-01-14 16:42         ` Dmitry V. Levin
  0 siblings, 1 reply; 65+ messages in thread
From: Maciej W. Rozycki @ 2025-01-14 16:03 UTC (permalink / raw)
  To: Dmitry V. Levin
  Cc: Oleg Nesterov, Thomas Bogendoerfer, Eugene Syromyatnikov,
	Mike Frysinger, Renzo Davoli, Davide Berardi, strace-devel,
	linux-mips, linux-kernel

On Tue, 14 Jan 2025, Dmitry V. Levin wrote:

> >  How did you produce these results?
> 
> $ PATH="$HOME/x-tools/mips64-unknown-linux-gnu/bin:$PATH" make -j`nproc` ARCH=mips CROSS_COMPILE=mips64-unknown-linux-gnu- -C tools/testing/selftests TARGETS=ptrace USERLDFLAGS='-static' USERCFLAGS='-mabi=32'
> $ echo init |(cd tools/testing/selftests/ptrace && ln -snf get_syscall_info init && cpio --dereference -o -H newc -R 0:0) |gzip >get_syscall_info.mips-o32.img
> $ qemu-system-mips -nographic -kernel vmlinuz -initrd get_syscall_info.mips-o32.img -append 'console=ttyS0'
> 
> Likewise for mips64, but the patch for kselftest_harness.h from [1]
> is needed to see correct mismatch values in the test diagnostics.
> 
> [1] https://lore.kernel.org/all/20250108170757.GA6723@strace.io/

 Thanks, I'll try to see what's going on with `get_user'.

> >  The patch is definitely broken, the calling convention is the same 
> > between n32 and n64: 64-bit arguments in $4 through $11 registers as 
> > required, and your change makes n32 truncate arguments to 32 bits.
> 
> There must be something very specific to n32 then: apparently,
> __kernel_ulong_t is a 32-bit type on n32, so the syscall arguments are
> 32-bit values, at some point (in glibc?) they get sign-extended from 32 to
> 64 bits, and syscall_get_arguments returns them as 64-bit values different
> from the original syscall arguments, breaking the test.

 This matters at least for `lseek', which has an `off64_t' aka `long long' 
argument on n32 (there's no `_llseek' on n32).  Since arguments are passed 
via 64-bit registers and a `long long' datum is held in just one this is 
transparent between n32 and n64 (of course on n64 this corresponds to the 
plain `long' data type, so the kernel, which is always n64 for 64-bit 
configurations, sees the incoming argument as `long', and the same stands 
for the outgoing return value).

 Surely non-LFS lseek(2) will produce the syscall's `long long' argument 
truncated (cf. sysdeps/unix/sysv/linux/mips/mips64/n32/lseek.c in glibc), 
but both LFS lseek(2) and lseek64(2) will pass the native value on n32.

> If this is the expected behaviour, then I'd have to add an exception for
> mips n32 both to the kernel test and to strace that uses this interface.

 Is MIPS n32 the only psABI across all our supported architectures that
can have `long long' syscall arguments?  I guess that might actually be the
case, and then I wouldn't be surprised if it needs specific handling.

  Maciej

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 2/7] mips: fix mips_get_syscall_arg() for O32 and N32
  2025-01-14 16:03       ` Maciej W. Rozycki
@ 2025-01-14 16:42         ` Dmitry V. Levin
  0 siblings, 0 replies; 65+ messages in thread
From: Dmitry V. Levin @ 2025-01-14 16:42 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Oleg Nesterov, Thomas Bogendoerfer, Eugene Syromyatnikov,
	Mike Frysinger, Renzo Davoli, Davide Berardi, strace-devel,
	linux-mips, linux-kernel

On Tue, Jan 14, 2025 at 04:03:28PM +0000, Maciej W. Rozycki wrote:
> On Tue, 14 Jan 2025, Dmitry V. Levin wrote:
> 
> > >  How did you produce these results?
> > 
> > $ PATH="$HOME/x-tools/mips64-unknown-linux-gnu/bin:$PATH" make -j`nproc` ARCH=mips CROSS_COMPILE=mips64-unknown-linux-gnu- -C tools/testing/selftests TARGETS=ptrace USERLDFLAGS='-static' USERCFLAGS='-mabi=32'
> > $ echo init |(cd tools/testing/selftests/ptrace && ln -snf get_syscall_info init && cpio --dereference -o -H newc -R 0:0) |gzip >get_syscall_info.mips-o32.img
> > $ qemu-system-mips -nographic -kernel vmlinuz -initrd get_syscall_info.mips-o32.img -append 'console=ttyS0'
> > 
> > Likewise for mips64, but the patch for kselftest_harness.h from [1]
> > is needed to see correct mismatch values in the test diagnostics.
> > 
> > [1] https://lore.kernel.org/all/20250108170757.GA6723@strace.io/
> 
>  Thanks, I'll try to see what's going on with `get_user'.

Thanks.

> > >  The patch is definitely broken, the calling convention is the same 
> > > between n32 and n64: 64-bit arguments in $4 through $11 registers as 
> > > required, and your change makes n32 truncate arguments to 32 bits.
> > 
> > There must be something very specific to n32 then: apparently,
> > __kernel_ulong_t is a 32-bit type on n32, so the syscall arguments are
> > 32-bit values, at some point (in glibc?) they get sign-extended from 32 to
> > 64 bits, and syscall_get_arguments returns them as 64-bit values different
> > from the original syscall arguments, breaking the test.
> 
>  This matters at least for `lseek', which has an `off64_t' aka `long long' 
> argument on n32 (there's no `_llseek' on n32).  Since arguments are passed 
> via 64-bit registers and a `long long' datum is held in just one this is 
> transparent between n32 and n64 (of course on n64 this corresponds to the 
> plain `long' data type, so the kernel, which is always n64 for 64-bit 
> configurations, sees the incoming argument as `long', and the same stands 
> for the outgoing return value).
> 
>  Surely non-LFS lseek(2) will produce the syscall's `long long' argument 
> truncated (cf. sysdeps/unix/sysv/linux/mips/mips64/n32/lseek.c in glibc), 
> but both LFS lseek(2) and lseek64(2) will pass the native value on n32.
> 
> > If this is the expected behaviour, then I'd have to add an exception for
> > mips n32 both to the kernel test and to strace that uses this interface.
> 
>  Is MIPS n32 the only psABI across all our supported architectures that
> can have `long long' syscall arguments?  I guess that might actually be the
> case, and then I wouldn't be surprised if it needs specific handling.

This is very similar to x32, with the only essential difference:
on x32 the 64-bitness of syscall arguments is exposed to userspace via
__kernel_ulong_t:

arch/x86/include/uapi/asm/posix_types_x32.h:typedef unsigned long long __kernel_ulong_t;

In fact, there is a workaround in strace for this case, and I have now
realized that it ceased to work on mips n32 a long time ago:

# if defined HAVE___KERNEL_LONG_T && defined HAVE___KERNEL_ULONG_T

#  include <asm/posix_types.h>

typedef __kernel_long_t kernel_long_t;
typedef __kernel_ulong_t kernel_ulong_t;

# elif (defined __x86_64__ && defined __ILP32__) || defined LINUX_MIPSN32

typedef long long kernel_long_t;
typedef unsigned long long kernel_ulong_t;

# else

typedef long kernel_long_t;
typedef unsigned long kernel_ulong_t;

# endif


-- 
ldv

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 1/7] powerpc: properly negate error in syscall_set_return_value()
  2025-01-13 17:34   ` Christophe Leroy
  2025-01-13 17:54     ` Dmitry V. Levin
@ 2025-01-14 17:04     ` Dmitry V. Levin
  2025-01-20 13:51       ` Christophe Leroy
  1 sibling, 1 reply; 65+ messages in thread
From: Dmitry V. Levin @ 2025-01-14 17:04 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Alexey Gladkov, Oleg Nesterov, Michael Ellerman,
	Eugene Syromyatnikov, Mike Frysinger, Renzo Davoli,
	Davide Berardi, strace-devel, Madhavan Srinivasan,
	Nicholas Piggin, Naveen N Rao, linuxppc-dev, linux-kernel

On Mon, Jan 13, 2025 at 06:34:44PM +0100, Christophe Leroy wrote:
> Le 13/01/2025 à 18:10, Dmitry V. Levin a écrit :
> > Bring syscall_set_return_value() in sync with syscall_get_error(),
> > and let upcoming ptrace/set_syscall_info selftest pass on powerpc.
> > 
> > This reverts commit 1b1a3702a65c ("powerpc: Don't negate error in
> > syscall_set_return_value()").
> 
> There is a clear detailed explanation in that commit of why it needs to 
> be done.
> 
> If you think that commit is wrong you have to explain why with at least 
> the same level of details.

OK, please check whether this explanation is clear and detailed enough:

=======
powerpc: properly negate error in syscall_set_return_value()

When syscall_set_return_value() is used to set an error code, the caller
specifies it as a negative value in -ERRORCODE form.

In the !trap_is_scv case the error code is traditionally stored as follows:
gpr[3] contains a positive ERRORCODE, and ccr has 0x10000000 flag set.
Here are a few examples to illustrate this convention.  The first one
is from syscall_get_error():
        /*
         * If the system call failed,
         * regs->gpr[3] contains a positive ERRORCODE.
         */
        return (regs->ccr & 0x10000000UL) ? -regs->gpr[3] : 0;

The second example is from regs_return_value():
        if (is_syscall_success(regs))
                return regs->gpr[3];
        else
                return -regs->gpr[3];

The third example is from check_syscall_restart():
        regs->result = -EINTR;
        regs->gpr[3] = EINTR;
        regs->ccr |= 0x10000000;

Compared with these examples, the failure of syscall_set_return_value()
to assign a positive ERRORCODE into regs->gpr[3] is clearly visible:
	/*
	 * In the general case it's not obvious that we must deal with
	 * CCR here, as the syscall exit path will also do that for us.
	 * However there are some places, eg. the signal code, which
	 * check ccr to decide if the value in r3 is actually an error.
	 */
	if (error) {
		regs->ccr |= 0x10000000L;
		regs->gpr[3] = error;
	} else {
		regs->ccr &= ~0x10000000L;
		regs->gpr[3] = val;
	}

This fix brings syscall_set_return_value() in sync with syscall_get_error()
and lets upcoming ptrace/set_syscall_info selftest pass on powerpc.

Fixes: 1b1a3702a65c ("powerpc: Don't negate error in syscall_set_return_value()")
=======
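
The convention described in the commit message above can be sketched as a
small user-space model.  This is only an illustration under stated
assumptions: struct model_regs, MODEL_CCR_ERROR and the model_* functions
are invented names standing in for the two pt_regs fields and the
!trap_is_scv branches quoted in this thread; it is not the kernel
implementation.

```c
#include <assert.h>
#include <errno.h>

/* Illustrative stand-in for the two pt_regs fields involved. */
struct model_regs {
	unsigned long ccr;
	unsigned long gpr3;
};

/* The 0x10000000 flag in ccr that marks a failed syscall in this thread. */
#define MODEL_CCR_ERROR 0x10000000UL

/* Models the patched setter: error arrives as -ERRORCODE, gpr3 gets +ERRORCODE. */
static void model_set_return_value(struct model_regs *regs, int error, long val)
{
	if (error) {
		regs->ccr |= MODEL_CCR_ERROR;
		regs->gpr3 = -error;
	} else {
		regs->ccr &= ~MODEL_CCR_ERROR;
		regs->gpr3 = val;
	}
}

/* Models the quoted syscall_get_error() logic for the !trap_is_scv case. */
static long model_get_error(const struct model_regs *regs)
{
	return (regs->ccr & MODEL_CCR_ERROR) ? -(long)regs->gpr3 : 0;
}

/* Round trip: what the setter stores, the getter must report back. */
void demo_error_roundtrip(void)
{
	struct model_regs regs = { 0, 0 };

	model_set_return_value(&regs, -EINTR, 0);
	assert(regs.gpr3 == EINTR);		/* positive ERRORCODE */
	assert(model_get_error(&regs) == -EINTR);

	model_set_return_value(&regs, 0, 42);
	assert(regs.gpr3 == 42);
	assert(model_get_error(&regs) == 0);
}
```

With the unpatched assignment (gpr3 = error) the same round trip would make
the model getter report +EINTR instead of -EINTR, which is the asymmetry the
commit message points at.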


-- 
ldv

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 6/7] ptrace: introduce PTRACE_SET_SYSCALL_INFO request
  2025-01-13 17:12 ` [PATCH v2 6/7] ptrace: introduce PTRACE_SET_SYSCALL_INFO request Dmitry V. Levin
@ 2025-01-15 16:38   ` Oleg Nesterov
  2025-01-15 17:36     ` Dmitry V. Levin
  2025-01-16  1:55   ` Charlie Jenkins
  2025-01-16 15:21   ` Oleg Nesterov
  2 siblings, 1 reply; 65+ messages in thread
From: Oleg Nesterov @ 2025-01-15 16:38 UTC (permalink / raw)
  To: Dmitry V. Levin
  Cc: Eugene Syromyatnikov, Mike Frysinger, Renzo Davoli,
	Davide Berardi, strace-devel, linux-kernel, linux-api

Dmitry,

I can't review the non-x86 changes in 1/7 - 4/7.

As for this and the previous patch, I see nothing wrong after a quick glance.

I just have some concerns about the "future extensions"; I'll write another
email tomorrow. In particular, I personally hate the very idea of
copy_struct_from_user/check_zeroed_user ;)

On 01/13, Dmitry V. Levin wrote:
>
> +ptrace_set_syscall_info_entry(struct task_struct *child, struct pt_regs *regs,
> +			      struct ptrace_syscall_info *info)
> +{
> +	unsigned long args[ARRAY_SIZE(info->entry.args)];
> +	int nr = info->entry.nr;
> +	int i;
> +
> +	if (nr != info->entry.nr)
> +		return -ERANGE;
> +
> +	for (i = 0; i < ARRAY_SIZE(args); i++) {
> +		args[i] = info->entry.args[i];
> +		if (args[i] != info->entry.args[i])
> +			return -ERANGE;
> +	}
> +
> +	syscall_set_nr(child, regs, nr);
> +	/*
> +	 * If the syscall number is set to -1, setting syscall arguments is not
> +	 * just pointless, it would also clobber the syscall return value on
> +	 * those architectures that share the same register both for the first
> +	 * argument of syscall and its return value.
> +	 */
> +	if (nr != -1)
> +		syscall_set_arguments(child, regs, args);

Thanks, much better than I tried to suggest in my reply to V1.

But may be

	if (syscall_get_nr() != -1)
		syscall_set_arguments(...);

will look a bit more consistent?

Oleg.
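
The range checks at the top of the quoted ptrace_set_syscall_info_entry() follow a store-then-compare pattern: a wide user-supplied value is stored into a narrower kernel type and rejected if it does not compare equal afterwards. A minimal userspace sketch of that pattern for the syscall number, where narrow_syscall_nr() is a hypothetical helper name and the 64-bit input type is an assumption about the width of info->entry.nr:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Store a 64-bit user-supplied syscall number into an int and reject
 * values that do not survive the round trip.  The narrowing conversion
 * is implementation-defined in C; GCC and Clang truncate, which is what
 * this sketch relies on.
 */
static int narrow_syscall_nr(uint64_t user_nr, int *out)
{
	int nr = (int)user_nr;

	assert(out != 0);
	/* nr is sign-extended for the comparison, so (uint64_t)-1 maps to -1. */
	if ((uint64_t)nr != user_nr)
		return -ERANGE;

	*out = nr;
	return 0;
}
```

Because of the sign extension, an all-ones 64-bit value is accepted as -1, while 0xFFFFFFFF or any value with bits set only above bit 31 is rejected with -ERANGE.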


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 6/7] ptrace: introduce PTRACE_SET_SYSCALL_INFO request
  2025-01-15 16:38   ` Oleg Nesterov
@ 2025-01-15 17:36     ` Dmitry V. Levin
  2025-01-15 19:10       ` Oleg Nesterov
  0 siblings, 1 reply; 65+ messages in thread
From: Dmitry V. Levin @ 2025-01-15 17:36 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Alexey Gladkov, Eugene Syromyatnikov, Mike Frysinger,
	Renzo Davoli, Davide Berardi, strace-devel, linux-kernel,
	linux-api

On Wed, Jan 15, 2025 at 05:38:09PM +0100, Oleg Nesterov wrote:
[...]
> > +	syscall_set_nr(child, regs, nr);
> > +	/*
> > +	 * If the syscall number is set to -1, setting syscall arguments is not
> > +	 * just pointless, it would also clobber the syscall return value on
> > +	 * those architectures that share the same register both for the first
> > +	 * argument of syscall and its return value.
> > +	 */
> > +	if (nr != -1)
> > +		syscall_set_arguments(child, regs, args);
> 
> Thanks, much better than I tried to suggest in my reply to V1.
> 
> But may be
> 
> 	if (syscall_get_nr() != -1)
> 		syscall_set_arguments(...);
> 
> will look a bit more consistent?

I'm sorry, but I didn't follow.  As we've just set the syscall number with
syscall_set_nr(), why would we want to call syscall_get_nr() right after
that to obtain the syscall number?


-- 
ldv

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 6/7] ptrace: introduce PTRACE_SET_SYSCALL_INFO request
  2025-01-15 17:36     ` Dmitry V. Levin
@ 2025-01-15 19:10       ` Oleg Nesterov
  0 siblings, 0 replies; 65+ messages in thread
From: Oleg Nesterov @ 2025-01-15 19:10 UTC (permalink / raw)
  To: Dmitry V. Levin
  Cc: Alexey Gladkov, Eugene Syromyatnikov, Mike Frysinger,
	Renzo Davoli, Davide Berardi, strace-devel, linux-kernel,
	linux-api

On 01/15, Dmitry V. Levin wrote:
>
> On Wed, Jan 15, 2025 at 05:38:09PM +0100, Oleg Nesterov wrote:
> >
> > But may be
> >
> > 	if (syscall_get_nr() != -1)
> > 		syscall_set_arguments(...);
> >
> > will look a bit more consistent?
>
> I'm sorry, but I didn't follow.  As we've just set the syscall number with
> syscall_set_nr(), why would we want to call syscall_get_nr() right after
> that to obtain the syscall number?

Mostly for grep. We have more syscall_get_nr() != -1 checks. Even right after
syscall_set_nr-like code, see putreg32().

I think this needs another helper (which can have more users) and some cleanups.

But this is another issue, so please forget. I agree that syscall_get_nr() in
this code will probably just add the unnecessary confusion.

Oleg.
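
For readers following the -1 check under discussion, here is a small userspace sketch of the clobbering problem that the patch comment describes. Everything in it is hypothetical: struct mock_regs models an imaginary architecture where register a[0] carries both the first argument and the return value, and the mock_* helpers only imitate the shape of the kernel functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical register file: a[0] is both first argument and return value. */
struct mock_regs {
	long nr;		/* syscall number */
	unsigned long a[6];	/* argument registers */
};

static void mock_set_nr(struct mock_regs *regs, int nr)
{
	regs->nr = nr;
}

static void mock_set_arguments(struct mock_regs *regs,
			       const unsigned long *args)
{
	for (size_t i = 0; i < 6; i++)
		regs->a[i] = args[i];
}

/* Same shape as the quoted entry handler: skip the arguments when nr is -1. */
static void mock_set_entry(struct mock_regs *regs, int nr,
			   const unsigned long *args)
{
	assert(regs != NULL && args != NULL);
	mock_set_nr(regs, nr);
	if (nr != -1)
		mock_set_arguments(regs, args);
}
```

If a tracer has already placed a return value in a[0] and then sets the number to -1 to skip the call, the guard leaves a[0] alone. Without the guard the argument write would overwrite that value.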


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 6/7] ptrace: introduce PTRACE_SET_SYSCALL_INFO request
  2025-01-13 17:12 ` [PATCH v2 6/7] ptrace: introduce PTRACE_SET_SYSCALL_INFO request Dmitry V. Levin
  2025-01-15 16:38   ` Oleg Nesterov
@ 2025-01-16  1:55   ` Charlie Jenkins
  2025-01-16  8:33     ` Dmitry V. Levin
  2025-01-16 15:21   ` Oleg Nesterov
  2 siblings, 1 reply; 65+ messages in thread
From: Charlie Jenkins @ 2025-01-16  1:55 UTC (permalink / raw)
  To: Dmitry V. Levin
  Cc: Oleg Nesterov, Eugene Syromyatnikov, Mike Frysinger, Renzo Davoli,
	Davide Berardi, strace-devel, linux-kernel, linux-api

On Mon, Jan 13, 2025 at 07:12:08PM +0200, Dmitry V. Levin wrote:
> PTRACE_SET_SYSCALL_INFO is a generic ptrace API that complements
> PTRACE_GET_SYSCALL_INFO by letting the ptracer modify details of
> system calls the tracee is blocked in.
> 
> This API allows ptracers to obtain and modify system call details
> in a straightforward and architecture-agnostic way.
> 
> Current implementation supports changing only those bits of system call
> information that are used by strace, namely, syscall number, syscall
> arguments, and syscall return value.
> 
> Support of changing additional details returned by PTRACE_GET_SYSCALL_INFO,
> such as instruction pointer and stack pointer, could be added later
> if needed, by using struct ptrace_syscall_info.flags to specify
> the additional details that should be set.  Currently, flags and reserved
> fields of struct ptrace_syscall_info must be initialized with zeroes;
> arch, instruction_pointer, and stack_pointer fields are ignored.
> 
> PTRACE_SET_SYSCALL_INFO currently supports only PTRACE_SYSCALL_INFO_ENTRY,
> PTRACE_SYSCALL_INFO_EXIT, and PTRACE_SYSCALL_INFO_SECCOMP operations.
> Other operations could be added later if needed.
> 
> Ideally, PTRACE_SET_SYSCALL_INFO should have been introduced along with
> PTRACE_GET_SYSCALL_INFO, but it didn't happen.  The last straw that
> convinced me to implement PTRACE_SET_SYSCALL_INFO was apparent failure
> to provide an API of changing the first system call argument on riscv
> architecture.
> 
> ptrace(2) man page:
> 
> long ptrace(enum __ptrace_request request, pid_t pid, void *addr, void *data);
> ...
> PTRACE_SET_SYSCALL_INFO
>        Modify information about the system call that caused the stop.
>        The "data" argument is a pointer to struct ptrace_syscall_info
>        that specifies the system call information to be set.
>        The "addr" argument should be set to sizeof(struct ptrace_syscall_info).
> 
> Link: https://lore.kernel.org/all/59505464-c84a-403d-972f-d4b2055eeaac@gmail.com/
> Signed-off-by: Dmitry V. Levin <ldv@strace.io>
> ---
>  include/linux/ptrace.h      |  3 ++
>  include/uapi/linux/ptrace.h |  4 +-
>  kernel/ptrace.c             | 95 +++++++++++++++++++++++++++++++++++++
>  3 files changed, 101 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/ptrace.h b/include/linux/ptrace.h
> index 90507d4afcd6..c8dbf1e498bf 100644
> --- a/include/linux/ptrace.h
> +++ b/include/linux/ptrace.h
> @@ -17,6 +17,9 @@ struct syscall_info {
>  	struct seccomp_data	data;
>  };
>  
> +/* sizeof() the first published struct ptrace_syscall_info */
> +#define PTRACE_SYSCALL_INFO_SIZE_VER0	84
> +
>  extern int ptrace_access_vm(struct task_struct *tsk, unsigned long addr,
>  			    void *buf, int len, unsigned int gup_flags);
>  
> diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h
> index 72c038fc71d0..ca75b3ab5d22 100644
> --- a/include/uapi/linux/ptrace.h
> +++ b/include/uapi/linux/ptrace.h
> @@ -74,6 +74,7 @@ struct seccomp_metadata {
>  };
>  
>  #define PTRACE_GET_SYSCALL_INFO		0x420e
> +#define PTRACE_SET_SYSCALL_INFO		0x4212
>  #define PTRACE_SYSCALL_INFO_NONE	0
>  #define PTRACE_SYSCALL_INFO_ENTRY	1
>  #define PTRACE_SYSCALL_INFO_EXIT	2
> @@ -81,7 +82,8 @@ struct seccomp_metadata {
>  
>  struct ptrace_syscall_info {
>  	__u8 op;	/* PTRACE_SYSCALL_INFO_* */
> -	__u8 pad[3];
> +	__u8 reserved;
> +	__u16 flags;
>  	__u32 arch;
>  	__u64 instruction_pointer;
>  	__u64 stack_pointer;
> diff --git a/kernel/ptrace.c b/kernel/ptrace.c
> index 22e7d74cf4cd..41d37cb8f74a 100644
> --- a/kernel/ptrace.c
> +++ b/kernel/ptrace.c
> @@ -1016,6 +1016,97 @@ ptrace_get_syscall_info(struct task_struct *child, unsigned long user_size,
>  	write_size = min(actual_size, user_size);
>  	return copy_to_user(datavp, &info, write_size) ? -EFAULT : actual_size;
>  }
> +
> +static unsigned long
> +ptrace_set_syscall_info_entry(struct task_struct *child, struct pt_regs *regs,
> +			      struct ptrace_syscall_info *info)
> +{
> +	unsigned long args[ARRAY_SIZE(info->entry.args)];
> +	int nr = info->entry.nr;
> +	int i;
> +
> +	if (nr != info->entry.nr)
> +		return -ERANGE;
> +
> +	for (i = 0; i < ARRAY_SIZE(args); i++) {
> +		args[i] = info->entry.args[i];
> +		if (args[i] != info->entry.args[i])
> +			return -ERANGE;
> +	}
> +
> +	syscall_set_nr(child, regs, nr);
> +	/*
> +	 * If the syscall number is set to -1, setting syscall arguments is not
> +	 * just pointless, it would also clobber the syscall return value on
> +	 * those architectures that share the same register both for the first
> +	 * argument of syscall and its return value.
> +	 */
> +	if (nr != -1)
> +		syscall_set_arguments(child, regs, args);
> +
> +	return 0;
> +}
> +
> +static unsigned long
> +ptrace_set_syscall_info_seccomp(struct task_struct *child, struct pt_regs *regs,
> +				struct ptrace_syscall_info *info)
> +{
> +	/*
> +	 * info->entry is currently a subset of info->seccomp,
> +	 * info->seccomp.ret_data is currently ignored.
> +	 */
> +	return ptrace_set_syscall_info_entry(child, regs, info);
> +}
> +
> +static unsigned long
> +ptrace_set_syscall_info_exit(struct task_struct *child, struct pt_regs *regs,
> +			     struct ptrace_syscall_info *info)
> +{
> +	if (info->exit.is_error)
> +		syscall_set_return_value(child, regs, info->exit.rval, 0);
> +	else
> +		syscall_set_return_value(child, regs, 0, info->exit.rval);
> +
> +	return 0;
> +}
> +
> +static int
> +ptrace_set_syscall_info(struct task_struct *child, unsigned long user_size,
> +			void __user *datavp)
> +{
> +	struct pt_regs *regs = task_pt_regs(child);
> +	struct ptrace_syscall_info info;
> +	int error;
> +
> +	BUILD_BUG_ON(sizeof(struct ptrace_syscall_info) < PTRACE_SYSCALL_INFO_SIZE_VER0);
> +
> +	if (user_size < PTRACE_SYSCALL_INFO_SIZE_VER0 || user_size > PAGE_SIZE)
> +		return -EINVAL;
> +
> +	error = copy_struct_from_user(&info, sizeof(info), datavp, user_size);
> +	if (error)
> +		return error;
> +
> +	/* Reserved for future use. */
> +	if (info.flags || info.reserved)
> +		return -EINVAL;
> +
> +	/* Changing the type of the system call stop is not supported. */
> +	if (ptrace_get_syscall_info_op(child) != info.op)

Since this isn't supported anyway, would it make sense to set the
info.op to ptrace_get_syscall_info_op(child) like is done for
get_syscall_info? The usecase I see for this is simplifying when the
user doesn't call PTRACE_GET_SYSCALL_INFO before calling
PTRACE_SET_SYSCALL_INFO.

- Charlie

> +		return -EINVAL;
> +
> +	switch (info.op) {
> +	case PTRACE_SYSCALL_INFO_ENTRY:
> +		return ptrace_set_syscall_info_entry(child, regs, &info);
> +	case PTRACE_SYSCALL_INFO_EXIT:
> +		return ptrace_set_syscall_info_exit(child, regs, &info);
> +	case PTRACE_SYSCALL_INFO_SECCOMP:
> +		return ptrace_set_syscall_info_seccomp(child, regs, &info);
> +	default:
> +		/* Other types of system call stops are not supported. */
> +		return -EINVAL;
> +	}
> +}
>  #endif /* CONFIG_HAVE_ARCH_TRACEHOOK */
>  
>  int ptrace_request(struct task_struct *child, long request,
> @@ -1234,6 +1325,10 @@ int ptrace_request(struct task_struct *child, long request,
>  	case PTRACE_GET_SYSCALL_INFO:
>  		ret = ptrace_get_syscall_info(child, addr, datavp);
>  		break;
> +
> +	case PTRACE_SET_SYSCALL_INFO:
> +		ret = ptrace_set_syscall_info(child, addr, datavp);
> +		break;
>  #endif
>  
>  	case PTRACE_SECCOMP_GET_FILTER:
> -- 
> ldv
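
The size handling in the quoted ptrace_set_syscall_info() can be approximated in userspace as follows. This is only a sketch on plain memory: mock_copy_struct() imitates the documented behaviour of copy_struct_from_user() (zero-fill when the user struct is smaller, require a zeroed tail when it is larger), mock_fetch_info() is a hypothetical wrapper for the bounds check, and the 4 KiB page size is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define MOCK_SIZE_VER0 84	/* PTRACE_SYSCALL_INFO_SIZE_VER0 in the patch */
#define MOCK_PAGE_SIZE 4096	/* assumption: 4 KiB pages */

/* Userspace approximation of copy_struct_from_user() on ordinary buffers. */
static int mock_copy_struct(void *dst, size_t ksize,
			    const void *src, size_t usize)
{
	size_t size = usize < ksize ? usize : ksize;
	const unsigned char *tail = (const unsigned char *)src + size;

	assert(dst != NULL && src != NULL);

	/* A larger (newer) user struct is accepted only if its extra tail is zero. */
	for (size_t i = 0; size + i < usize; i++)
		if (tail[i] != 0)
			return -E2BIG;

	memcpy(dst, src, size);
	/* A smaller (older) user struct leaves the remaining fields zeroed. */
	memset((unsigned char *)dst + size, 0, ksize - size);
	return 0;
}

/* Bounds check from the patch, followed by the copy. */
static int mock_fetch_info(void *dst, size_t ksize,
			   const void *src, size_t usize)
{
	if (usize < MOCK_SIZE_VER0 || usize > MOCK_PAGE_SIZE)
		return -EINVAL;
	return mock_copy_struct(dst, ksize, src, usize);
}
```

Under these assumptions a caller built against a future, larger struct still works as long as the fields this kernel does not know about are zero, and a caller passing fewer than 84 bytes is rejected outright.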

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 3/7] syscall.h: add syscall_set_arguments() and syscall_set_return_value()
  2025-01-13 17:11 ` [PATCH v2 3/7] syscall.h: add syscall_set_arguments() and syscall_set_return_value() Dmitry V. Levin
@ 2025-01-16  2:20   ` Charlie Jenkins
  2025-01-17  0:59     ` H. Peter Anvin
  0 siblings, 1 reply; 65+ messages in thread
From: Charlie Jenkins @ 2025-01-16  2:20 UTC (permalink / raw)
  To: Dmitry V. Levin
  Cc: Oleg Nesterov, Eugene Syromyatnikov, Mike Frysinger, Renzo Davoli,
	Davide Berardi, strace-devel, Vineet Gupta, Russell King,
	Will Deacon, Guo Ren, Brian Cain, Huacai Chen, WANG Xuerui,
	Thomas Bogendoerfer, Dinh Nguyen, Jonas Bonn, Stefan Kristiansson,
	Stafford Horne, James E.J. Bottomley, Helge Deller,
	Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy, Naveen N Rao, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, Yoshinori Sato, Rich Felker,
	John Paul Adrian Glaubitz, David S. Miller, Andreas Larsson,
	Richard Weinberger, Anton Ivanov, Johannes Berg, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Chris Zankel, Max Filippov, Arnd Bergmann, linux-snps-arc,
	linux-kernel, linux-arm-kernel, linux-csky, linux-hexagon,
	loongarch, linux-mips, linux-openrisc, linux-parisc, linuxppc-dev,
	linux-riscv, linux-s390, linux-sh, sparclinux, linux-um,
	linux-arch

On Mon, Jan 13, 2025 at 07:11:40PM +0200, Dmitry V. Levin wrote:
> These functions are going to be needed on all HAVE_ARCH_TRACEHOOK
> architectures to implement PTRACE_SET_SYSCALL_INFO API.
> 
> This partially reverts commit 7962c2eddbfe ("arch: remove unused
> function syscall_set_arguments()") by reusing some of old
> syscall_set_arguments() implementations.
> 
> Signed-off-by: Dmitry V. Levin <ldv@strace.io>
> ---
> 
> Note that I'm not a MIPS expert, I just added mips_set_syscall_arg() by
> looking at mips_get_syscall_arg() and the result passes tests in qemu on
> mips O32, mips64 O32, mips64 N32, and mips64 N64.
> 
>  arch/arc/include/asm/syscall.h        | 14 +++++++++++
>  arch/arm/include/asm/syscall.h        | 13 ++++++++++
>  arch/arm64/include/asm/syscall.h      | 13 ++++++++++
>  arch/csky/include/asm/syscall.h       | 13 ++++++++++
>  arch/hexagon/include/asm/syscall.h    | 14 +++++++++++
>  arch/loongarch/include/asm/syscall.h  |  8 ++++++
>  arch/mips/include/asm/syscall.h       | 32 ++++++++++++++++++++++++
>  arch/nios2/include/asm/syscall.h      | 11 ++++++++
>  arch/openrisc/include/asm/syscall.h   |  7 ++++++
>  arch/parisc/include/asm/syscall.h     | 12 +++++++++
>  arch/powerpc/include/asm/syscall.h    | 10 ++++++++
>  arch/riscv/include/asm/syscall.h      |  9 +++++++
>  arch/s390/include/asm/syscall.h       | 12 +++++++++
>  arch/sh/include/asm/syscall_32.h      | 12 +++++++++
>  arch/sparc/include/asm/syscall.h      | 10 ++++++++
>  arch/um/include/asm/syscall-generic.h | 14 +++++++++++
>  arch/x86/include/asm/syscall.h        | 36 +++++++++++++++++++++++++++
>  arch/xtensa/include/asm/syscall.h     | 11 ++++++++
>  include/asm-generic/syscall.h         | 16 ++++++++++++
>  19 files changed, 267 insertions(+)
> 
> diff --git a/arch/arc/include/asm/syscall.h b/arch/arc/include/asm/syscall.h
> index 9709256e31c8..89c1e1736356 100644
> --- a/arch/arc/include/asm/syscall.h
> +++ b/arch/arc/include/asm/syscall.h
> @@ -67,6 +67,20 @@ syscall_get_arguments(struct task_struct *task, struct pt_regs *regs,
>  	}
>  }
>  
> +static inline void
> +syscall_set_arguments(struct task_struct *task, struct pt_regs *regs,
> +		      unsigned long *args)
> +{
> +	unsigned long *inside_ptregs = &regs->r0;
> +	unsigned int n = 6;
> +	unsigned int i = 0;
> +
> +	while (n--) {
> +		*inside_ptregs = args[i++];
> +		inside_ptregs--;
> +	}
> +}
> +
>  static inline int
>  syscall_get_arch(struct task_struct *task)
>  {
> diff --git a/arch/arm/include/asm/syscall.h b/arch/arm/include/asm/syscall.h
> index fe4326d938c1..21927fa0ae2b 100644
> --- a/arch/arm/include/asm/syscall.h
> +++ b/arch/arm/include/asm/syscall.h
> @@ -80,6 +80,19 @@ static inline void syscall_get_arguments(struct task_struct *task,
>  	memcpy(args, &regs->ARM_r0 + 1, 5 * sizeof(args[0]));
>  }
>  
> +static inline void syscall_set_arguments(struct task_struct *task,
> +					 struct pt_regs *regs,
> +					 const unsigned long *args)
> +{
> +	memcpy(&regs->ARM_r0, args, 6 * sizeof(args[0]));
> +	/*
> +	 * Also copy the first argument into ARM_ORIG_r0
> +	 * so that syscall_get_arguments() would return it
> +	 * instead of the previous value.
> +	 */
> +	regs->ARM_ORIG_r0 = regs->ARM_r0;
> +}
> +
>  static inline int syscall_get_arch(struct task_struct *task)
>  {
>  	/* ARM tasks don't change audit architectures on the fly. */
> diff --git a/arch/arm64/include/asm/syscall.h b/arch/arm64/include/asm/syscall.h
> index ab8e14b96f68..76020b66286b 100644
> --- a/arch/arm64/include/asm/syscall.h
> +++ b/arch/arm64/include/asm/syscall.h
> @@ -73,6 +73,19 @@ static inline void syscall_get_arguments(struct task_struct *task,
>  	memcpy(args, &regs->regs[1], 5 * sizeof(args[0]));
>  }
>  
> +static inline void syscall_set_arguments(struct task_struct *task,
> +					 struct pt_regs *regs,
> +					 const unsigned long *args)
> +{
> +	memcpy(&regs->regs[0], args, 6 * sizeof(args[0]));
> +	/*
> +	 * Also copy the first argument into orig_x0
> +	 * so that syscall_get_arguments() would return it
> +	 * instead of the previous value.
> +	 */
> +	regs->orig_x0 = regs->regs[0];
> +}
> +
>  /*
>   * We don't care about endianness (__AUDIT_ARCH_LE bit) here because
>   * AArch64 has the same system calls both on little- and big- endian.
> diff --git a/arch/csky/include/asm/syscall.h b/arch/csky/include/asm/syscall.h
> index 0de5734950bf..30403f7a0487 100644
> --- a/arch/csky/include/asm/syscall.h
> +++ b/arch/csky/include/asm/syscall.h
> @@ -59,6 +59,19 @@ syscall_get_arguments(struct task_struct *task, struct pt_regs *regs,
>  	memcpy(args, &regs->a1, 5 * sizeof(args[0]));
>  }
>  
> +static inline void
> +syscall_set_arguments(struct task_struct *task, struct pt_regs *regs,
> +		      const unsigned long *args)
> +{
> +	memcpy(&regs->a0, args, 6 * sizeof(regs->a0));
> +	/*
> +	 * Also copy the first argument into orig_a0
> +	 * so that syscall_get_arguments() would return it
> +	 * instead of the previous value.
> +	 */
> +	regs->orig_a0 = regs->a0;
> +}
> +
>  static inline int
>  syscall_get_arch(struct task_struct *task)
>  {
> diff --git a/arch/hexagon/include/asm/syscall.h b/arch/hexagon/include/asm/syscall.h
> index f6e454f18038..1024a6548d78 100644
> --- a/arch/hexagon/include/asm/syscall.h
> +++ b/arch/hexagon/include/asm/syscall.h
> @@ -33,6 +33,13 @@ static inline void syscall_get_arguments(struct task_struct *task,
>  	memcpy(args, &(&regs->r00)[0], 6 * sizeof(args[0]));
>  }
>  
> +static inline void syscall_set_arguments(struct task_struct *task,
> +					 struct pt_regs *regs,
> +					 unsigned long *args)
> +{
> +	memcpy(&(&regs->r00)[0], args, 6 * sizeof(args[0]));
> +}
> +
>  static inline long syscall_get_error(struct task_struct *task,
>  				     struct pt_regs *regs)
>  {
> @@ -45,6 +52,13 @@ static inline long syscall_get_return_value(struct task_struct *task,
>  	return regs->r00;
>  }
>  
> +static inline void syscall_set_return_value(struct task_struct *task,
> +					    struct pt_regs *regs,
> +					    int error, long val)
> +{
> +	regs->r00 = (long) error ?: val;
> +}
> +
>  static inline int syscall_get_arch(struct task_struct *task)
>  {
>  	return AUDIT_ARCH_HEXAGON;
> diff --git a/arch/loongarch/include/asm/syscall.h b/arch/loongarch/include/asm/syscall.h
> index e286dc58476e..ff415b3c0a8e 100644
> --- a/arch/loongarch/include/asm/syscall.h
> +++ b/arch/loongarch/include/asm/syscall.h
> @@ -61,6 +61,14 @@ static inline void syscall_get_arguments(struct task_struct *task,
>  	memcpy(&args[1], &regs->regs[5], 5 * sizeof(long));
>  }
>  
> +static inline void syscall_set_arguments(struct task_struct *task,
> +					 struct pt_regs *regs,
> +					 unsigned long *args)
> +{
> +	regs->orig_a0 = args[0];
> +	memcpy(&regs->regs[5], &args[1], 5 * sizeof(long));
> +}
> +
>  static inline int syscall_get_arch(struct task_struct *task)
>  {
>  	return AUDIT_ARCH_LOONGARCH64;
> diff --git a/arch/mips/include/asm/syscall.h b/arch/mips/include/asm/syscall.h
> index 2f85f2d8f754..3163d1506fae 100644
> --- a/arch/mips/include/asm/syscall.h
> +++ b/arch/mips/include/asm/syscall.h
> @@ -76,6 +76,23 @@ static inline void mips_get_syscall_arg(unsigned long *arg,
>  #endif
>  }
>  
> +static inline void mips_set_syscall_arg(unsigned long *arg,
> +	struct task_struct *task, struct pt_regs *regs, unsigned int n)
> +{
> +#ifdef CONFIG_32BIT
> +	switch (n) {
> +	case 0: case 1: case 2: case 3:
> +		regs->regs[4 + n] = *arg;
> +		return;
> +	case 4: case 5: case 6: case 7:
> +		*arg = regs->pad0[n] = *arg;
> +		return;
> +	}
> +#else
> +	regs->regs[4 + n] = *arg;
> +#endif
> +}
> +
>  static inline long syscall_get_error(struct task_struct *task,
>  				     struct pt_regs *regs)
>  {
> @@ -122,6 +139,21 @@ static inline void syscall_get_arguments(struct task_struct *task,
>  		mips_get_syscall_arg(args++, task, regs, i++);
>  }
>  
> +static inline void syscall_set_arguments(struct task_struct *task,
> +					 struct pt_regs *regs,
> +					 unsigned long *args)
> +{
> +	unsigned int i = 0;
> +	unsigned int n = 6;
> +
> +	/* O32 ABI syscall() */
> +	if (mips_syscall_is_indirect(task, regs))
> +		i++;
> +
> +	while (n--)
> +		mips_set_syscall_arg(args++, task, regs, i++);
> +}
> +
>  extern const unsigned long sys_call_table[];
>  extern const unsigned long sys32_call_table[];
>  extern const unsigned long sysn32_call_table[];
> diff --git a/arch/nios2/include/asm/syscall.h b/arch/nios2/include/asm/syscall.h
> index fff52205fb65..526449edd768 100644
> --- a/arch/nios2/include/asm/syscall.h
> +++ b/arch/nios2/include/asm/syscall.h
> @@ -58,6 +58,17 @@ static inline void syscall_get_arguments(struct task_struct *task,
>  	*args   = regs->r9;
>  }
>  
> +static inline void syscall_set_arguments(struct task_struct *task,
> +	struct pt_regs *regs, const unsigned long *args)
> +{
> +	regs->r4 = *args++;
> +	regs->r5 = *args++;
> +	regs->r6 = *args++;
> +	regs->r7 = *args++;
> +	regs->r8 = *args++;
> +	regs->r9 = *args;
> +}
> +
>  static inline int syscall_get_arch(struct task_struct *task)
>  {
>  	return AUDIT_ARCH_NIOS2;
> diff --git a/arch/openrisc/include/asm/syscall.h b/arch/openrisc/include/asm/syscall.h
> index 903ed882bdec..e6383be2a195 100644
> --- a/arch/openrisc/include/asm/syscall.h
> +++ b/arch/openrisc/include/asm/syscall.h
> @@ -57,6 +57,13 @@ syscall_get_arguments(struct task_struct *task, struct pt_regs *regs,
>  	memcpy(args, &regs->gpr[3], 6 * sizeof(args[0]));
>  }
>  
> +static inline void
> +syscall_set_arguments(struct task_struct *task, struct pt_regs *regs,
> +		      const unsigned long *args)
> +{
> +	memcpy(&regs->gpr[3], args, 6 * sizeof(args[0]));
> +}
> +
>  static inline int syscall_get_arch(struct task_struct *task)
>  {
>  	return AUDIT_ARCH_OPENRISC;
> diff --git a/arch/parisc/include/asm/syscall.h b/arch/parisc/include/asm/syscall.h
> index 00b127a5e09b..b146d0ae4c77 100644
> --- a/arch/parisc/include/asm/syscall.h
> +++ b/arch/parisc/include/asm/syscall.h
> @@ -29,6 +29,18 @@ static inline void syscall_get_arguments(struct task_struct *tsk,
>  	args[0] = regs->gr[26];
>  }
>  
> +static inline void syscall_set_arguments(struct task_struct *tsk,
> +					 struct pt_regs *regs,
> +					 unsigned long *args)
> +{
> +	regs->gr[21] = args[5];
> +	regs->gr[22] = args[4];
> +	regs->gr[23] = args[3];
> +	regs->gr[24] = args[2];
> +	regs->gr[25] = args[1];
> +	regs->gr[26] = args[0];
> +}
> +
>  static inline long syscall_get_error(struct task_struct *task,
>  				     struct pt_regs *regs)
>  {
> diff --git a/arch/powerpc/include/asm/syscall.h b/arch/powerpc/include/asm/syscall.h
> index 422d7735ace6..521f279e6b33 100644
> --- a/arch/powerpc/include/asm/syscall.h
> +++ b/arch/powerpc/include/asm/syscall.h
> @@ -114,6 +114,16 @@ static inline void syscall_get_arguments(struct task_struct *task,
>  	}
>  }
>  
> +static inline void syscall_set_arguments(struct task_struct *task,
> +					 struct pt_regs *regs,
> +					 const unsigned long *args)
> +{
> +	memcpy(&regs->gpr[3], args, 6 * sizeof(args[0]));
> +
> +	/* Also copy the first argument into orig_gpr3 */
> +	regs->orig_gpr3 = args[0];
> +}
> +
>  static inline int syscall_get_arch(struct task_struct *task)
>  {
>  	if (is_tsk_32bit_task(task))
> diff --git a/arch/riscv/include/asm/syscall.h b/arch/riscv/include/asm/syscall.h
> index 121fff429dce..8d389ba995c8 100644
> --- a/arch/riscv/include/asm/syscall.h
> +++ b/arch/riscv/include/asm/syscall.h
> @@ -66,6 +66,15 @@ static inline void syscall_get_arguments(struct task_struct *task,
>  	memcpy(args, &regs->a1, 5 * sizeof(args[0]));
>  }
>  
> +static inline void syscall_set_arguments(struct task_struct *task,
> +					 struct pt_regs *regs,
> +					 const unsigned long *args)
> +{
> +	regs->orig_a0 = args[0];
> +	args++;
> +	memcpy(&regs->a1, args, 5 * sizeof(regs->a1));
> +}

Looks good for riscv.

Tested-by: Charlie Jenkins <charlie@rivosinc.com>
Reviewed-by: Charlie Jenkins <charlie@rivosinc.com>

> +
>  static inline int syscall_get_arch(struct task_struct *task)
>  {
>  #ifdef CONFIG_64BIT
> diff --git a/arch/s390/include/asm/syscall.h b/arch/s390/include/asm/syscall.h
> index 27e3d804b311..b3dd883699e7 100644
> --- a/arch/s390/include/asm/syscall.h
> +++ b/arch/s390/include/asm/syscall.h
> @@ -78,6 +78,18 @@ static inline void syscall_get_arguments(struct task_struct *task,
>  	args[0] = regs->orig_gpr2 & mask;
>  }
>  
> +static inline void syscall_set_arguments(struct task_struct *task,
> +					 struct pt_regs *regs,
> +					 const unsigned long *args)
> +{
> +	unsigned int n = 6;
> +
> +	while (n-- > 0)
> +		if (n > 0)
> +			regs->gprs[2 + n] = args[n];
> +	regs->orig_gpr2 = args[0];
> +}
> +
>  static inline int syscall_get_arch(struct task_struct *task)
>  {
>  #ifdef CONFIG_COMPAT
> diff --git a/arch/sh/include/asm/syscall_32.h b/arch/sh/include/asm/syscall_32.h
> index d87738eebe30..cb51a7528384 100644
> --- a/arch/sh/include/asm/syscall_32.h
> +++ b/arch/sh/include/asm/syscall_32.h
> @@ -57,6 +57,18 @@ static inline void syscall_get_arguments(struct task_struct *task,
>  	args[0] = regs->regs[4];
>  }
>  
> +static inline void syscall_set_arguments(struct task_struct *task,
> +					 struct pt_regs *regs,
> +					 const unsigned long *args)
> +{
> +	regs->regs[1] = args[5];
> +	regs->regs[0] = args[4];
> +	regs->regs[7] = args[3];
> +	regs->regs[6] = args[2];
> +	regs->regs[5] = args[1];
> +	regs->regs[4] = args[0];
> +}
> +
>  static inline int syscall_get_arch(struct task_struct *task)
>  {
>  	int arch = AUDIT_ARCH_SH;
> diff --git a/arch/sparc/include/asm/syscall.h b/arch/sparc/include/asm/syscall.h
> index 20c109ac8cc9..62a5a78804c4 100644
> --- a/arch/sparc/include/asm/syscall.h
> +++ b/arch/sparc/include/asm/syscall.h
> @@ -117,6 +117,16 @@ static inline void syscall_get_arguments(struct task_struct *task,
>  	}
>  }
>  
> +static inline void syscall_set_arguments(struct task_struct *task,
> +					 struct pt_regs *regs,
> +					 const unsigned long *args)
> +{
> +	unsigned int i;
> +
> +	for (i = 0; i < 6; i++)
> +		regs->u_regs[UREG_I0 + i] = args[i];
> +}
> +
>  static inline int syscall_get_arch(struct task_struct *task)
>  {
>  #if defined(CONFIG_SPARC64) && defined(CONFIG_COMPAT)
> diff --git a/arch/um/include/asm/syscall-generic.h b/arch/um/include/asm/syscall-generic.h
> index 172b74143c4b..2984feb9d576 100644
> --- a/arch/um/include/asm/syscall-generic.h
> +++ b/arch/um/include/asm/syscall-generic.h
> @@ -62,6 +62,20 @@ static inline void syscall_get_arguments(struct task_struct *task,
>  	*args   = UPT_SYSCALL_ARG6(r);
>  }
>  
> +static inline void syscall_set_arguments(struct task_struct *task,
> +					 struct pt_regs *regs,
> +					 const unsigned long *args)
> +{
> +	struct uml_pt_regs *r = &regs->regs;
> +
> +	UPT_SYSCALL_ARG1(r) = *args++;
> +	UPT_SYSCALL_ARG2(r) = *args++;
> +	UPT_SYSCALL_ARG3(r) = *args++;
> +	UPT_SYSCALL_ARG4(r) = *args++;
> +	UPT_SYSCALL_ARG5(r) = *args++;
> +	UPT_SYSCALL_ARG6(r) = *args;
> +}
> +
>  /* See arch/x86/um/asm/syscall.h for syscall_get_arch() definition. */
>  
>  #endif	/* __UM_SYSCALL_GENERIC_H */
> diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
> index 7c488ff0c764..b9c249dd9e3d 100644
> --- a/arch/x86/include/asm/syscall.h
> +++ b/arch/x86/include/asm/syscall.h
> @@ -90,6 +90,18 @@ static inline void syscall_get_arguments(struct task_struct *task,
>  	args[5] = regs->bp;
>  }
>  
> +static inline void syscall_set_arguments(struct task_struct *task,
> +					 struct pt_regs *regs,
> +					 const unsigned long *args)
> +{
> +	regs->bx = args[0];
> +	regs->cx = args[1];
> +	regs->dx = args[2];
> +	regs->si = args[3];
> +	regs->di = args[4];
> +	regs->bp = args[5];
> +}
> +
>  static inline int syscall_get_arch(struct task_struct *task)
>  {
>  	return AUDIT_ARCH_I386;
> @@ -121,6 +133,30 @@ static inline void syscall_get_arguments(struct task_struct *task,
>  	}
>  }
>  
> +static inline void syscall_set_arguments(struct task_struct *task,
> +					 struct pt_regs *regs,
> +					 const unsigned long *args)
> +{
> +# ifdef CONFIG_IA32_EMULATION
> +	if (task->thread_info.status & TS_COMPAT) {
> +		regs->bx = *args++;
> +		regs->cx = *args++;
> +		regs->dx = *args++;
> +		regs->si = *args++;
> +		regs->di = *args++;
> +		regs->bp = *args;
> +	} else
> +# endif
> +	{
> +		regs->di = *args++;
> +		regs->si = *args++;
> +		regs->dx = *args++;
> +		regs->r10 = *args++;
> +		regs->r8 = *args++;
> +		regs->r9 = *args;
> +	}
> +}
> +
>  static inline int syscall_get_arch(struct task_struct *task)
>  {
>  	/* x32 tasks should be considered AUDIT_ARCH_X86_64. */
> diff --git a/arch/xtensa/include/asm/syscall.h b/arch/xtensa/include/asm/syscall.h
> index 5ee974bf8330..f9a671cbf933 100644
> --- a/arch/xtensa/include/asm/syscall.h
> +++ b/arch/xtensa/include/asm/syscall.h
> @@ -68,6 +68,17 @@ static inline void syscall_get_arguments(struct task_struct *task,
>  		args[i] = regs->areg[reg[i]];
>  }
>  
> +static inline void syscall_set_arguments(struct task_struct *task,
> +					 struct pt_regs *regs,
> +					 const unsigned long *args)
> +{
> +	static const unsigned int reg[] = XTENSA_SYSCALL_ARGUMENT_REGS;
> +	unsigned int i;
> +
> +	for (i = 0; i < 6; ++i)
> +		regs->areg[reg[i]] = args[i];
> +}
> +
>  asmlinkage long xtensa_rt_sigreturn(void);
>  asmlinkage long xtensa_shmat(int, char __user *, int);
>  asmlinkage long xtensa_fadvise64_64(int, int,
> diff --git a/include/asm-generic/syscall.h b/include/asm-generic/syscall.h
> index 5a80fe728dc8..0f7b9a493de7 100644
> --- a/include/asm-generic/syscall.h
> +++ b/include/asm-generic/syscall.h
> @@ -117,6 +117,22 @@ void syscall_set_return_value(struct task_struct *task, struct pt_regs *regs,
>  void syscall_get_arguments(struct task_struct *task, struct pt_regs *regs,
>  			   unsigned long *args);
>  
> +/**
> + * syscall_set_arguments - change system call parameter value
> + * @task:	task of interest, must be in system call entry tracing
> + * @regs:	task_pt_regs() of @task
> + * @args:	array of argument values to store
> + *
> + * Changes 6 arguments to the system call.
> + * The first argument gets value @args[0], and so on.
> + *
> + * It's only valid to call this when @task is stopped for tracing on
> + * entry to a system call, due to %SYSCALL_WORK_SYSCALL_TRACE or
> + * %SYSCALL_WORK_SYSCALL_AUDIT.
> + */
> +void syscall_set_arguments(struct task_struct *task, struct pt_regs *regs,
> +			   const unsigned long *args);
> +
>  /**
>   * syscall_get_arch - return the AUDIT_ARCH for the current system call
>   * @task:	task of interest, must be blocked
> -- 
> ldv
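
To make the per-ABI register order in the quoted x86 hunk easier to compare, here is a userspace sketch of the same mapping. The struct mock_regs type is a hypothetical subset of pt_regs, and the compat parameter stands in for the TS_COMPAT test in the patch. Nothing here touches real registers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical subset of x86-64 pt_regs: only the argument registers. */
struct mock_regs {
	unsigned long bx, cx, dx, si, di, bp, r8, r9, r10;
};

/* Same register order as the quoted hunk; compat selects the ia32 layout. */
static void mock_set_arguments(struct mock_regs *regs, bool compat,
			       const unsigned long *args)
{
	assert(regs != NULL && args != NULL);

	if (compat) {
		/* 32-bit emulation: bx, cx, dx, si, di, bp */
		regs->bx = args[0];
		regs->cx = args[1];
		regs->dx = args[2];
		regs->si = args[3];
		regs->di = args[4];
		regs->bp = args[5];
	} else {
		/* native 64-bit: di, si, dx, r10, r8, r9 */
		regs->di = args[0];
		regs->si = args[1];
		regs->dx = args[2];
		regs->r10 = args[3];
		regs->r8 = args[4];
		regs->r9 = args[5];
	}
}
```

The two layouts share dx for the third argument but otherwise place the same index in different registers, which is why the helper has to branch on the task's mode rather than copy a fixed block.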

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 4/7] syscall.h: introduce syscall_set_nr()
  2025-01-13 17:11 ` [PATCH v2 4/7] syscall.h: introduce syscall_set_nr() Dmitry V. Levin
@ 2025-01-16  2:20   ` Charlie Jenkins
  0 siblings, 0 replies; 65+ messages in thread
From: Charlie Jenkins @ 2025-01-16  2:20 UTC (permalink / raw)
  To: Dmitry V. Levin
  Cc: Oleg Nesterov, Eugene Syromyatnikov, Mike Frysinger, Renzo Davoli,
	Davide Berardi, strace-devel, Vineet Gupta, Russell King,
	Catalin Marinas, Will Deacon, Brian Cain, Huacai Chen,
	WANG Xuerui, Geert Uytterhoeven, Michal Simek,
	Thomas Bogendoerfer, Dinh Nguyen, Jonas Bonn, Stefan Kristiansson,
	Stafford Horne, James E.J. Bottomley, Helge Deller,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy, Naveen N Rao,
	Madhavan Srinivasan, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, Yoshinori Sato, Rich Felker,
	John Paul Adrian Glaubitz, David S. Miller, Andreas Larsson,
	Richard Weinberger, Anton Ivanov, Johannes Berg, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Chris Zankel, Max Filippov, Arnd Bergmann, linux-snps-arc,
	linux-kernel, linux-arm-kernel, linux-hexagon, loongarch,
	linux-m68k, linux-mips, linux-openrisc, linux-parisc,
	linuxppc-dev, linux-riscv, linux-s390, linux-sh, sparclinux,
	linux-um, linux-arch

On Mon, Jan 13, 2025 at 07:11:51PM +0200, Dmitry V. Levin wrote:
> Similar to syscall_set_arguments() that complements
> syscall_get_arguments(), introduce syscall_set_nr()
> that complements syscall_get_nr().
> 
> syscall_set_nr() is going to be needed along with
> syscall_set_arguments() on all HAVE_ARCH_TRACEHOOK
> architectures to implement PTRACE_SET_SYSCALL_INFO API.
> 
> Signed-off-by: Dmitry V. Levin <ldv@strace.io>
> ---
>  arch/arc/include/asm/syscall.h        | 11 +++++++++++
>  arch/arm/include/asm/syscall.h        | 24 ++++++++++++++++++++++++
>  arch/arm64/include/asm/syscall.h      | 16 ++++++++++++++++
>  arch/hexagon/include/asm/syscall.h    |  7 +++++++
>  arch/loongarch/include/asm/syscall.h  |  7 +++++++
>  arch/m68k/include/asm/syscall.h       |  7 +++++++
>  arch/microblaze/include/asm/syscall.h |  7 +++++++
>  arch/mips/include/asm/syscall.h       | 14 ++++++++++++++
>  arch/nios2/include/asm/syscall.h      |  5 +++++
>  arch/openrisc/include/asm/syscall.h   |  6 ++++++
>  arch/parisc/include/asm/syscall.h     |  7 +++++++
>  arch/powerpc/include/asm/syscall.h    | 10 ++++++++++
>  arch/riscv/include/asm/syscall.h      |  7 +++++++
>  arch/s390/include/asm/syscall.h       | 12 ++++++++++++
>  arch/sh/include/asm/syscall_32.h      | 12 ++++++++++++
>  arch/sparc/include/asm/syscall.h      | 12 ++++++++++++
>  arch/um/include/asm/syscall-generic.h |  5 +++++
>  arch/x86/include/asm/syscall.h        |  7 +++++++
>  arch/xtensa/include/asm/syscall.h     |  7 +++++++
>  include/asm-generic/syscall.h         | 14 ++++++++++++++
>  20 files changed, 197 insertions(+)
> 
> diff --git a/arch/arc/include/asm/syscall.h b/arch/arc/include/asm/syscall.h
> index 89c1e1736356..728d625a10f1 100644
> --- a/arch/arc/include/asm/syscall.h
> +++ b/arch/arc/include/asm/syscall.h
> @@ -23,6 +23,17 @@ syscall_get_nr(struct task_struct *task, struct pt_regs *regs)
>  		return -1;
>  }
>  
> +static inline void
> +syscall_set_nr(struct task_struct *task, struct pt_regs *regs, int nr)
> +{
> +	/*
> +	 * Unlike syscall_get_nr(), syscall_set_nr() can be called only when
> +	 * the target task is stopped for tracing on entering syscall, so
> +	 * there is no need to have the same check syscall_get_nr() has.
> +	 */
> +	regs->r8 = nr;
> +}
> +
>  static inline void
>  syscall_rollback(struct task_struct *task, struct pt_regs *regs)
>  {
> diff --git a/arch/arm/include/asm/syscall.h b/arch/arm/include/asm/syscall.h
> index 21927fa0ae2b..18b102a30741 100644
> --- a/arch/arm/include/asm/syscall.h
> +++ b/arch/arm/include/asm/syscall.h
> @@ -68,6 +68,30 @@ static inline void syscall_set_return_value(struct task_struct *task,
>  	regs->ARM_r0 = (long) error ? error : val;
>  }
>  
> +static inline void syscall_set_nr(struct task_struct *task,
> +				  struct pt_regs *regs,
> +				  int nr)
> +{
> +	if (nr == -1) {
> +		task_thread_info(task)->abi_syscall = -1;
> +		/*
> +		 * When the syscall number is set to -1, the syscall will be
> +		 * skipped.  In this case the syscall return value has to be
> +		 * set explicitly, otherwise the first syscall argument is
> +		 * returned as the syscall return value.
> +		 */
> +		syscall_set_return_value(task, regs, -ENOSYS, 0);
> +		return;
> +	}
> +	if ((IS_ENABLED(CONFIG_AEABI) && !IS_ENABLED(CONFIG_OABI_COMPAT))) {
> +		task_thread_info(task)->abi_syscall = nr;
> +		return;
> +	}
> +	task_thread_info(task)->abi_syscall =
> +		(task_thread_info(task)->abi_syscall & ~__NR_SYSCALL_MASK) |
> +		(nr & __NR_SYSCALL_MASK);
> +}
> +
>  #define SYSCALL_MAX_ARGS 7
>  
>  static inline void syscall_get_arguments(struct task_struct *task,
> diff --git a/arch/arm64/include/asm/syscall.h b/arch/arm64/include/asm/syscall.h
> index 76020b66286b..712daa90e643 100644
> --- a/arch/arm64/include/asm/syscall.h
> +++ b/arch/arm64/include/asm/syscall.h
> @@ -61,6 +61,22 @@ static inline void syscall_set_return_value(struct task_struct *task,
>  	regs->regs[0] = val;
>  }
>  
> +static inline void syscall_set_nr(struct task_struct *task,
> +				  struct pt_regs *regs,
> +				  int nr)
> +{
> +	regs->syscallno = nr;
> +	if (nr == -1) {
> +		/*
> +		 * When the syscall number is set to -1, the syscall will be
> +		 * skipped.  In this case the syscall return value has to be
> +		 * set explicitly, otherwise the first syscall argument is
> +		 * returned as the syscall return value.
> +		 */
> +		syscall_set_return_value(task, regs, -ENOSYS, 0);
> +	}
> +}
> +
>  #define SYSCALL_MAX_ARGS 6
>  
>  static inline void syscall_get_arguments(struct task_struct *task,
> diff --git a/arch/hexagon/include/asm/syscall.h b/arch/hexagon/include/asm/syscall.h
> index 1024a6548d78..70637261817a 100644
> --- a/arch/hexagon/include/asm/syscall.h
> +++ b/arch/hexagon/include/asm/syscall.h
> @@ -26,6 +26,13 @@ static inline long syscall_get_nr(struct task_struct *task,
>  	return regs->r06;
>  }
>  
> +static inline void syscall_set_nr(struct task_struct *task,
> +				  struct pt_regs *regs,
> +				  int nr)
> +{
> +	regs->r06 = nr;
> +}
> +
>  static inline void syscall_get_arguments(struct task_struct *task,
>  					 struct pt_regs *regs,
>  					 unsigned long *args)
> diff --git a/arch/loongarch/include/asm/syscall.h b/arch/loongarch/include/asm/syscall.h
> index ff415b3c0a8e..81d2733f7b94 100644
> --- a/arch/loongarch/include/asm/syscall.h
> +++ b/arch/loongarch/include/asm/syscall.h
> @@ -26,6 +26,13 @@ static inline long syscall_get_nr(struct task_struct *task,
>  	return regs->regs[11];
>  }
>  
> +static inline void syscall_set_nr(struct task_struct *task,
> +				  struct pt_regs *regs,
> +				  int nr)
> +{
> +	regs->regs[11] = nr;
> +}
> +
>  static inline void syscall_rollback(struct task_struct *task,
>  				    struct pt_regs *regs)
>  {
> diff --git a/arch/m68k/include/asm/syscall.h b/arch/m68k/include/asm/syscall.h
> index d1453e850cdd..bf84b160c2eb 100644
> --- a/arch/m68k/include/asm/syscall.h
> +++ b/arch/m68k/include/asm/syscall.h
> @@ -14,6 +14,13 @@ static inline int syscall_get_nr(struct task_struct *task,
>  	return regs->orig_d0;
>  }
>  
> +static inline void syscall_set_nr(struct task_struct *task,
> +				  struct pt_regs *regs,
> +				  int nr)
> +{
> +	regs->orig_d0 = nr;
> +}
> +
>  static inline void syscall_rollback(struct task_struct *task,
>  				    struct pt_regs *regs)
>  {
> diff --git a/arch/microblaze/include/asm/syscall.h b/arch/microblaze/include/asm/syscall.h
> index 5eb3f624cc59..b5b6b91fae3e 100644
> --- a/arch/microblaze/include/asm/syscall.h
> +++ b/arch/microblaze/include/asm/syscall.h
> @@ -14,6 +14,13 @@ static inline long syscall_get_nr(struct task_struct *task,
>  	return regs->r12;
>  }
>  
> +static inline void syscall_set_nr(struct task_struct *task,
> +				  struct pt_regs *regs,
> +				  int nr)
> +{
> +	regs->r12 = nr;
> +}
> +
>  static inline void syscall_rollback(struct task_struct *task,
>  				    struct pt_regs *regs)
>  {
> diff --git a/arch/mips/include/asm/syscall.h b/arch/mips/include/asm/syscall.h
> index 3163d1506fae..58d68205fd2c 100644
> --- a/arch/mips/include/asm/syscall.h
> +++ b/arch/mips/include/asm/syscall.h
> @@ -41,6 +41,20 @@ static inline long syscall_get_nr(struct task_struct *task,
>  	return task_thread_info(task)->syscall;
>  }
>  
> +static inline void syscall_set_nr(struct task_struct *task,
> +				  struct pt_regs *regs,
> +				  int nr)
> +{
> +	/*
> +	 * New syscall number has to be assigned to regs[2] because
> +	 * syscall_trace_entry() loads it from there unconditionally.
> +	 *
> +	 * Consequently, if the syscall was indirect and nr != __NR_syscall,
> +	 * then after this assignment the syscall will cease to be indirect.
> +	 */
> +	task_thread_info(task)->syscall = regs->regs[2] = nr;
> +}
> +
>  static inline void mips_syscall_update_nr(struct task_struct *task,
>  					  struct pt_regs *regs)
>  {
> diff --git a/arch/nios2/include/asm/syscall.h b/arch/nios2/include/asm/syscall.h
> index 526449edd768..8e3eb1d689bb 100644
> --- a/arch/nios2/include/asm/syscall.h
> +++ b/arch/nios2/include/asm/syscall.h
> @@ -15,6 +15,11 @@ static inline int syscall_get_nr(struct task_struct *task, struct pt_regs *regs)
>  	return regs->r2;
>  }
>  
> +static inline void syscall_set_nr(struct task_struct *task, struct pt_regs *regs, int nr)
> +{
> +	regs->r2 = nr;
> +}
> +
>  static inline void syscall_rollback(struct task_struct *task,
>  				struct pt_regs *regs)
>  {
> diff --git a/arch/openrisc/include/asm/syscall.h b/arch/openrisc/include/asm/syscall.h
> index e6383be2a195..5e037d9659c5 100644
> --- a/arch/openrisc/include/asm/syscall.h
> +++ b/arch/openrisc/include/asm/syscall.h
> @@ -25,6 +25,12 @@ syscall_get_nr(struct task_struct *task, struct pt_regs *regs)
>  	return regs->orig_gpr11;
>  }
>  
> +static inline void
> +syscall_set_nr(struct task_struct *task, struct pt_regs *regs, int nr)
> +{
> +	regs->orig_gpr11 = nr;
> +}
> +
>  static inline void
>  syscall_rollback(struct task_struct *task, struct pt_regs *regs)
>  {
> diff --git a/arch/parisc/include/asm/syscall.h b/arch/parisc/include/asm/syscall.h
> index b146d0ae4c77..c11222798ab2 100644
> --- a/arch/parisc/include/asm/syscall.h
> +++ b/arch/parisc/include/asm/syscall.h
> @@ -17,6 +17,13 @@ static inline long syscall_get_nr(struct task_struct *tsk,
>  	return regs->gr[20];
>  }
>  
> +static inline void syscall_set_nr(struct task_struct *tsk,
> +				  struct pt_regs *regs,
> +				  int nr)
> +{
> +	regs->gr[20] = nr;
> +}
> +
>  static inline void syscall_get_arguments(struct task_struct *tsk,
>  					 struct pt_regs *regs,
>  					 unsigned long *args)
> diff --git a/arch/powerpc/include/asm/syscall.h b/arch/powerpc/include/asm/syscall.h
> index 521f279e6b33..7505dcfed247 100644
> --- a/arch/powerpc/include/asm/syscall.h
> +++ b/arch/powerpc/include/asm/syscall.h
> @@ -39,6 +39,16 @@ static inline int syscall_get_nr(struct task_struct *task, struct pt_regs *regs)
>  		return -1;
>  }
>  
> +static inline void syscall_set_nr(struct task_struct *task, struct pt_regs *regs, int nr)
> +{
> +	/*
> +	 * Unlike syscall_get_nr(), syscall_set_nr() can be called only when
> +	 * the target task is stopped for tracing on entering syscall, so
> +	 * there is no need to have the same check syscall_get_nr() has.
> +	 */
> +	regs->gpr[0] = nr;
> +}
> +
>  static inline void syscall_rollback(struct task_struct *task,
>  				    struct pt_regs *regs)
>  {
> diff --git a/arch/riscv/include/asm/syscall.h b/arch/riscv/include/asm/syscall.h
> index 8d389ba995c8..a5281cdf2b10 100644
> --- a/arch/riscv/include/asm/syscall.h
> +++ b/arch/riscv/include/asm/syscall.h
> @@ -30,6 +30,13 @@ static inline int syscall_get_nr(struct task_struct *task,
>  	return regs->a7;
>  }
>  
> +static inline void syscall_set_nr(struct task_struct *task,
> +				  struct pt_regs *regs,
> +				  int nr)
> +{
> +	regs->a7 = nr;
> +}

Looks good for riscv.

Tested-by: Charlie Jenkins <charlie@rivosinc.com>
Reviewed-by: Charlie Jenkins <charlie@rivosinc.com>

> +
>  static inline void syscall_rollback(struct task_struct *task,
>  				    struct pt_regs *regs)
>  {
> diff --git a/arch/s390/include/asm/syscall.h b/arch/s390/include/asm/syscall.h
> index b3dd883699e7..12cd0c60c07b 100644
> --- a/arch/s390/include/asm/syscall.h
> +++ b/arch/s390/include/asm/syscall.h
> @@ -24,6 +24,18 @@ static inline long syscall_get_nr(struct task_struct *task,
>  		(regs->int_code & 0xffff) : -1;
>  }
>  
> +static inline void syscall_set_nr(struct task_struct *task,
> +				  struct pt_regs *regs,
> +				  int nr)
> +{
> +	/*
> +	 * Unlike syscall_get_nr(), syscall_set_nr() can be called only when
> +	 * the target task is stopped for tracing on entering syscall, so
> +	 * there is no need to have the same check syscall_get_nr() has.
> +	 */
> +	regs->int_code = (regs->int_code & ~0xffff) | (nr & 0xffff);
> +}
> +
>  static inline void syscall_rollback(struct task_struct *task,
>  				    struct pt_regs *regs)
>  {
> diff --git a/arch/sh/include/asm/syscall_32.h b/arch/sh/include/asm/syscall_32.h
> index cb51a7528384..7027d87d901d 100644
> --- a/arch/sh/include/asm/syscall_32.h
> +++ b/arch/sh/include/asm/syscall_32.h
> @@ -15,6 +15,18 @@ static inline long syscall_get_nr(struct task_struct *task,
>  	return (regs->tra >= 0) ? regs->regs[3] : -1L;
>  }
>  
> +static inline void syscall_set_nr(struct task_struct *task,
> +				  struct pt_regs *regs,
> +				  int nr)
> +{
> +	/*
> +	 * Unlike syscall_get_nr(), syscall_set_nr() can be called only when
> +	 * the target task is stopped for tracing on entering syscall, so
> +	 * there is no need to have the same check syscall_get_nr() has.
> +	 */
> +	regs->regs[3] = nr;
> +}
> +
>  static inline void syscall_rollback(struct task_struct *task,
>  				    struct pt_regs *regs)
>  {
> diff --git a/arch/sparc/include/asm/syscall.h b/arch/sparc/include/asm/syscall.h
> index 62a5a78804c4..b0233924d323 100644
> --- a/arch/sparc/include/asm/syscall.h
> +++ b/arch/sparc/include/asm/syscall.h
> @@ -25,6 +25,18 @@ static inline long syscall_get_nr(struct task_struct *task,
>  	return (syscall_p ? regs->u_regs[UREG_G1] : -1L);
>  }
>  
> +static inline void syscall_set_nr(struct task_struct *task,
> +				  struct pt_regs *regs,
> +				  int nr)
> +{
> +	/*
> +	 * Unlike syscall_get_nr(), syscall_set_nr() can be called only when
> +	 * the target task is stopped for tracing on entering syscall, so
> +	 * there is no need to have the same check syscall_get_nr() has.
> +	 */
> +	regs->u_regs[UREG_G1] = nr;
> +}
> +
>  static inline void syscall_rollback(struct task_struct *task,
>  				    struct pt_regs *regs)
>  {
> diff --git a/arch/um/include/asm/syscall-generic.h b/arch/um/include/asm/syscall-generic.h
> index 2984feb9d576..bcd73bcfe577 100644
> --- a/arch/um/include/asm/syscall-generic.h
> +++ b/arch/um/include/asm/syscall-generic.h
> @@ -21,6 +21,11 @@ static inline int syscall_get_nr(struct task_struct *task, struct pt_regs *regs)
>  	return PT_REGS_SYSCALL_NR(regs);
>  }
>  
> +static inline void syscall_set_nr(struct task_struct *task, struct pt_regs *regs, int nr)
> +{
> +	PT_REGS_SYSCALL_NR(regs) = nr;
> +}
> +
>  static inline void syscall_rollback(struct task_struct *task,
>  				    struct pt_regs *regs)
>  {
> diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
> index b9c249dd9e3d..c10dbb74cd00 100644
> --- a/arch/x86/include/asm/syscall.h
> +++ b/arch/x86/include/asm/syscall.h
> @@ -38,6 +38,13 @@ static inline int syscall_get_nr(struct task_struct *task, struct pt_regs *regs)
>  	return regs->orig_ax;
>  }
>  
> +static inline void syscall_set_nr(struct task_struct *task,
> +				  struct pt_regs *regs,
> +				  int nr)
> +{
> +	regs->orig_ax = nr;
> +}
> +
>  static inline void syscall_rollback(struct task_struct *task,
>  				    struct pt_regs *regs)
>  {
> diff --git a/arch/xtensa/include/asm/syscall.h b/arch/xtensa/include/asm/syscall.h
> index f9a671cbf933..7db3b489c8ad 100644
> --- a/arch/xtensa/include/asm/syscall.h
> +++ b/arch/xtensa/include/asm/syscall.h
> @@ -28,6 +28,13 @@ static inline long syscall_get_nr(struct task_struct *task,
>  	return regs->syscall;
>  }
>  
> +static inline void syscall_set_nr(struct task_struct *task,
> +				  struct pt_regs *regs,
> +				  int nr)
> +{
> +	regs->syscall = nr;
> +}
> +
>  static inline void syscall_rollback(struct task_struct *task,
>  				    struct pt_regs *regs)
>  {
> diff --git a/include/asm-generic/syscall.h b/include/asm-generic/syscall.h
> index 0f7b9a493de7..e33fd4e783c1 100644
> --- a/include/asm-generic/syscall.h
> +++ b/include/asm-generic/syscall.h
> @@ -37,6 +37,20 @@ struct pt_regs;
>   */
>  int syscall_get_nr(struct task_struct *task, struct pt_regs *regs);
>  
> +/**
> + * syscall_set_nr - change the system call a task is executing
> + * @task:	task of interest, must be blocked
> + * @regs:	task_pt_regs() of @task
> + * @nr:		system call number
> + *
> + * Changes the system call number @task is about to execute.
> + *
> + * It's only valid to call this when @task is stopped for tracing on
> + * entry to a system call, due to %SYSCALL_WORK_SYSCALL_TRACE or
> + * %SYSCALL_WORK_SYSCALL_AUDIT.
> + */
> +void syscall_set_nr(struct task_struct *task, struct pt_regs *regs, int nr);
> +
>  /**
>   * syscall_rollback - roll back registers after an aborted system call
>   * @task:	task of interest, must be in system call exit tracing
> -- 
> ldv

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 6/7] ptrace: introduce PTRACE_SET_SYSCALL_INFO request
  2025-01-16  1:55   ` Charlie Jenkins
@ 2025-01-16  8:33     ` Dmitry V. Levin
  2025-01-16 21:07       ` Charlie Jenkins
  0 siblings, 1 reply; 65+ messages in thread
From: Dmitry V. Levin @ 2025-01-16  8:33 UTC (permalink / raw)
  To: Charlie Jenkins
  Cc: Oleg Nesterov, Eugene Syromyatnikov, Mike Frysinger, Renzo Davoli,
	Davide Berardi, strace-devel, linux-kernel, linux-api

On Wed, Jan 15, 2025 at 05:55:31PM -0800, Charlie Jenkins wrote:
> On Mon, Jan 13, 2025 at 07:12:08PM +0200, Dmitry V. Levin wrote:
[...]
> > +	/* Changing the type of the system call stop is not supported. */
> > +	if (ptrace_get_syscall_info_op(child) != info.op)
> 
> Since this isn't supported anyway, would it make sense to set the
> info.op to ptrace_get_syscall_info_op(child) like is done for
> get_syscall_info? The use case I see for this is simplifying when the
> user doesn't call PTRACE_GET_SYSCALL_INFO before calling
> PTRACE_SET_SYSCALL_INFO.

struct ptrace_syscall_info.op is a field that specifies how to interpret
the union fields of the structure, so if "op" is ignored, then the
kernel would infer the meaning of the structure specified by the userspace
tracer from the kernel state of the tracee.  This looks a bit too
error-prone to allow.  For example, nothing good is expected to happen
if syscall entry information is applied in a syscall exit stop.

The tracer is not obliged to call PTRACE_GET_SYSCALL_INFO to set
struct ptrace_syscall_info.op.  If the tracer keeps track of ptrace stops
by other means, it can assign the right value by itself.

And, btw, the comment should say "is not currently supported",
I'll update it in the next iteration.

An idea mentioned in prior discussions was that it would make sense to
specify syscall return value along with skipping the syscall in seccomp stop,
and this would require a different value for "op" field, but
I decided not to introduce this extra complexity yet.


-- 
ldv

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 6/7] ptrace: introduce PTRACE_SET_SYSCALL_INFO request
  2025-01-13 17:12 ` [PATCH v2 6/7] ptrace: introduce PTRACE_SET_SYSCALL_INFO request Dmitry V. Levin
  2025-01-15 16:38   ` Oleg Nesterov
  2025-01-16  1:55   ` Charlie Jenkins
@ 2025-01-16 15:21   ` Oleg Nesterov
  2025-01-16 16:04     ` Dmitry V. Levin
  2 siblings, 1 reply; 65+ messages in thread
From: Oleg Nesterov @ 2025-01-16 15:21 UTC (permalink / raw)
  To: Dmitry V. Levin
  Cc: Eugene Syromyatnikov, Mike Frysinger, Renzo Davoli,
	Davide Berardi, strace-devel, linux-kernel, linux-api

On 01/13, Dmitry V. Levin wrote:
>
> +static int
> +ptrace_set_syscall_info(struct task_struct *child, unsigned long user_size,
> +			void __user *datavp)
> +{
> +	struct pt_regs *regs = task_pt_regs(child);
> +	struct ptrace_syscall_info info;
> +	int error;
> +
> +	BUILD_BUG_ON(sizeof(struct ptrace_syscall_info) < PTRACE_SYSCALL_INFO_SIZE_VER0);
> +
> +	if (user_size < PTRACE_SYSCALL_INFO_SIZE_VER0 || user_size > PAGE_SIZE)
> +		return -EINVAL;
> +
> +	error = copy_struct_from_user(&info, sizeof(info), datavp, user_size);

OK, I certainly can't understand why copy_struct_from_user/check_zeroed_user
is useful, at least in this case. In particular, this won't allow running the
new code (which uses the "extended" ptrace_syscall_info) on the older kernels?

Can't we just use user_size as a version number?

We can also turn info->reserved into info->version filled by
ptrace_get_syscall_info().

ptrace_set_syscall_info() can check that info->version matches user_size.

Oleg.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 6/7] ptrace: introduce PTRACE_SET_SYSCALL_INFO request
  2025-01-16 15:21   ` Oleg Nesterov
@ 2025-01-16 16:04     ` Dmitry V. Levin
  2025-01-16 16:40       ` Dmitry V. Levin
  2025-01-17 14:45       ` Oleg Nesterov
  0 siblings, 2 replies; 65+ messages in thread
From: Dmitry V. Levin @ 2025-01-16 16:04 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Eugene Syromyatnikov, Mike Frysinger, Renzo Davoli,
	Davide Berardi, strace-devel, linux-kernel, linux-api

On Thu, Jan 16, 2025 at 04:21:38PM +0100, Oleg Nesterov wrote:
> On 01/13, Dmitry V. Levin wrote:
> >
> > +static int
> > +ptrace_set_syscall_info(struct task_struct *child, unsigned long user_size,
> > +			void __user *datavp)
> > +{
> > +	struct pt_regs *regs = task_pt_regs(child);
> > +	struct ptrace_syscall_info info;
> > +	int error;
> > +
> > +	BUILD_BUG_ON(sizeof(struct ptrace_syscall_info) < PTRACE_SYSCALL_INFO_SIZE_VER0);
> > +
> > +	if (user_size < PTRACE_SYSCALL_INFO_SIZE_VER0 || user_size > PAGE_SIZE)
> > +		return -EINVAL;
> > +
> > +	error = copy_struct_from_user(&info, sizeof(info), datavp, user_size);
> 
> OK, I certainly can't understand why copy_struct_from_user/check_zeroed_user
> is useful, at least in this case. In particular, this won't allow running the
> new code (which uses the "extended" ptrace_syscall_info) on the older kernels?
> 
> Can't we just use user_size as a version number?
> 
> We can also turn info->reserved into info->version filled by
> ptrace_get_syscall_info().
> 
> ptrace_set_syscall_info() can check that info->version matches user_size.

The idea is to use "op" to specify the operation, and "flags" to specify
future extensions to the operation.  For example, we could later add
a PTRACE_SYSCALL_INFO_SECCOMP_SKIP operation to specify exit-like
data for seccomp stops, or some flag to set instruction_pointer or
stack_pointer.  I don't think any of these would require a version field,
though.

That is, the zero check implied by copy_struct_from_user() is not really
needed here since the compatibility is tracked by "op" and "flags":
if "op" and "flags" do not instruct the kernel to use these unknown
extra bits, the kernel is not obliged to check them either.
For the same reason I don't think the kernel is obliged to read more
than sizeof(info) from userspace.
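
For reference, a rough userspace model of what copy_struct_from_user()
does with mismatched sizes.  This is a sketch only: model_copy_struct()
is a hypothetical name, memcpy() stands in for the user copy, and fault
handling is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * ksize: size of the structure the kernel knows about.
 * usize: size of the structure passed by userspace.
 */
static int model_copy_struct(void *dst, size_t ksize,
			     const void *src, size_t usize)
{
	size_t size = ksize < usize ? ksize : usize;
	size_t rest = (ksize > usize ? ksize : usize) - size;

	assert(dst != NULL && src != NULL);

	if (usize < ksize) {
		/* Older userspace: zero-fill the fields it does not know. */
		memset((char *)dst + size, 0, rest);
	} else if (usize > ksize) {
		/* Newer userspace: unknown trailing bytes must all be zero. */
		const unsigned char *tail = (const unsigned char *)src + size;

		for (size_t i = 0; i < rest; i++)
			if (tail[i] != 0)
				return -E2BIG;
	}
	memcpy(dst, src, size);
	return 0;
}
```

The -E2BIG branch is the zero check referred to above: a larger
structure is accepted only if everything past ksize is zero.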

What would you recommend using instead of copy_struct_from_user in this
case?


-- 
ldv

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 6/7] ptrace: introduce PTRACE_SET_SYSCALL_INFO request
  2025-01-16 16:04     ` Dmitry V. Levin
@ 2025-01-16 16:40       ` Dmitry V. Levin
  2025-01-17 14:45       ` Oleg Nesterov
  1 sibling, 0 replies; 65+ messages in thread
From: Dmitry V. Levin @ 2025-01-16 16:40 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Eugene Syromyatnikov, Mike Frysinger, Renzo Davoli,
	Davide Berardi, strace-devel, linux-kernel, linux-api

On Thu, Jan 16, 2025 at 06:04:03PM +0200, Dmitry V. Levin wrote:
> On Thu, Jan 16, 2025 at 04:21:38PM +0100, Oleg Nesterov wrote:
> > On 01/13, Dmitry V. Levin wrote:
> > >
> > > +static int
> > > +ptrace_set_syscall_info(struct task_struct *child, unsigned long user_size,
> > > +			void __user *datavp)
> > > +{
> > > +	struct pt_regs *regs = task_pt_regs(child);
> > > +	struct ptrace_syscall_info info;
> > > +	int error;
> > > +
> > > +	BUILD_BUG_ON(sizeof(struct ptrace_syscall_info) < PTRACE_SYSCALL_INFO_SIZE_VER0);
> > > +
> > > +	if (user_size < PTRACE_SYSCALL_INFO_SIZE_VER0 || user_size > PAGE_SIZE)
> > > +		return -EINVAL;
> > > +
> > > +	error = copy_struct_from_user(&info, sizeof(info), datavp, user_size);
> > 
> > OK, I certainly can't understand why copy_struct_from_user/check_zeroed_user
> > is useful, at least in this case. In particular, this won't allow running the
> > new code (which uses the "extended" ptrace_syscall_info) on the older kernels?
> > 
> > Can't we just use user_size as a version number?
> > 
> > We can also turn info->reserved into info->version filled by
> > ptrace_get_syscall_info().
> > 
> > ptrace_set_syscall_info() can check that info->version matches user_size.
> 
> The idea is to use "op" to specify the operation, and "flags" to specify
> future extensions to the operation.  For example, we could later add
> a PTRACE_SYSCALL_INFO_SECCOMP_SKIP operation to specify exit-like
> data for seccomp stops, or some flag to set instruction_pointer or
> stack_pointer.  I don't think any of these would require a version field,
> though.
> 
> That is, the zero check implied by copy_struct_from_user() is not really
> needed here since the compatibility is tracked by "op" and "flags":
> if "op" and "flags" do not instruct the kernel to use these unknown
> extra bits, the kernel is not obliged to check them either.
> For the same reason I don't think the kernel is obliged to read more
> than sizeof(info) from userspace.
> 
> What would you recommend using instead of copy_struct_from_user in this
> case?

Something like this?

        if (user_size < PTRACE_SYSCALL_INFO_SIZE_VER0 || user_size > PAGE_SIZE)
                return -EINVAL;

        if (copy_from_user(&info, datavp, min(sizeof(info), user_size)))
                return -EFAULT;

        if (user_size < sizeof(info))
                memset((void *)&info + user_size, 0, sizeof(info) - user_size);
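
The same logic as a self-contained userspace sketch, to make the
behaviour easy to try out.  MODEL_SIZE_VER0, MODEL_PAGE_SIZE, the
16-byte struct model_info and model_read_info() are hypothetical
stand-ins for the real constants and structure, and memcpy() takes the
place of copy_from_user(), so -EFAULT cannot occur here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for PTRACE_SYSCALL_INFO_SIZE_VER0 and PAGE_SIZE. */
#define MODEL_SIZE_VER0 8
#define MODEL_PAGE_SIZE 4096

/* Hypothetical stand-in for struct ptrace_syscall_info. */
struct model_info { unsigned char bytes[16]; };

static int model_read_info(struct model_info *info,
			   const void *user, size_t user_size)
{
	size_t n;

	assert(info != NULL && user != NULL);

	if (user_size < MODEL_SIZE_VER0 || user_size > MODEL_PAGE_SIZE)
		return -EINVAL;

	/* Read at most sizeof(*info); any extra bytes are ignored unchecked. */
	n = user_size < sizeof(*info) ? user_size : sizeof(*info);
	memcpy(info, user, n);

	/* A shorter structure gets its missing tail zero-filled. */
	if (user_size < sizeof(*info))
		memset((unsigned char *)info + user_size, 0,
		       sizeof(*info) - user_size);
	return 0;
}
```

Unlike copy_struct_from_user(), non-zero bytes beyond sizeof(*info) are
accepted silently in this variant.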

-- 
ldv

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 6/7] ptrace: introduce PTRACE_SET_SYSCALL_INFO request
  2025-01-16  8:33     ` Dmitry V. Levin
@ 2025-01-16 21:07       ` Charlie Jenkins
  2025-01-16 21:47         ` Charlie Jenkins
  0 siblings, 1 reply; 65+ messages in thread
From: Charlie Jenkins @ 2025-01-16 21:07 UTC (permalink / raw)
  To: Dmitry V. Levin
  Cc: Oleg Nesterov, Eugene Syromyatnikov, Mike Frysinger, Renzo Davoli,
	Davide Berardi, strace-devel, linux-kernel, linux-api

On Thu, Jan 16, 2025 at 10:33:28AM +0200, Dmitry V. Levin wrote:
> On Wed, Jan 15, 2025 at 05:55:31PM -0800, Charlie Jenkins wrote:
> > On Mon, Jan 13, 2025 at 07:12:08PM +0200, Dmitry V. Levin wrote:
> [...]
> > > +	/* Changing the type of the system call stop is not supported. */
> > > +	if (ptrace_get_syscall_info_op(child) != info.op)
> > 
> > Since this isn't supported anyway, would it make sense to set the
> > info.op to ptrace_get_syscall_info_op(child) like is done for
> > get_syscall_info? The use case I see for this is simplifying when the
> > user doesn't call PTRACE_GET_SYSCALL_INFO before calling
> > PTRACE_SET_SYSCALL_INFO.
> 
> struct ptrace_syscall_info.op is a field that specifies how to interpret
> the union fields of the structure, so if "op" is ignored, then the
> kernel would infer the meaning of the structure specified by the userspace
> tracer from the kernel state of the tracee.  This looks a bit too
> error-prone to allow.  For example, nothing good is expected to happen
> if syscall entry information is applied in a syscall exit stop.

Yes, that's a good point.

> 
> The tracer is not obliged to call PTRACE_GET_SYSCALL_INFO to set
> struct ptrace_syscall_info.op.  If the tracer keeps track of ptrace stops
> by other means, it can assign the right value by itself.
>
> And, btw, the comment should say "is not currently supported",
> I'll update it in the next iteration.
> 
> An idea mentioned in prior discussions was that it would make sense to
> specify syscall return value along with skipping the syscall in seccomp stop,
> and this would require a different value for "op" field, but
> I decided not to introduce this extra complexity yet.

Makes sense, thank you!

- Charlie

> 
> 
> -- 
> ldv

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 6/7] ptrace: introduce PTRACE_SET_SYSCALL_INFO request
  2025-01-16 21:07       ` Charlie Jenkins
@ 2025-01-16 21:47         ` Charlie Jenkins
  0 siblings, 0 replies; 65+ messages in thread
From: Charlie Jenkins @ 2025-01-16 21:47 UTC (permalink / raw)
  To: Dmitry V. Levin
  Cc: Oleg Nesterov, Eugene Syromyatnikov, Mike Frysinger, Renzo Davoli,
	Davide Berardi, Celeste Liu, strace-devel, linux-kernel,
	linux-api

On Thu, Jan 16, 2025 at 01:07:59PM -0800, Charlie Jenkins wrote:
> On Thu, Jan 16, 2025 at 10:33:28AM +0200, Dmitry V. Levin wrote:
> > On Wed, Jan 15, 2025 at 05:55:31PM -0800, Charlie Jenkins wrote:
> > > On Mon, Jan 13, 2025 at 07:12:08PM +0200, Dmitry V. Levin wrote:
> > [...]
> > > > +	/* Changing the type of the system call stop is not supported. */
> > > > +	if (ptrace_get_syscall_info_op(child) != info.op)
> > > 
> > > Since this isn't supported anyway, would it make sense to set the
> > > info.op to ptrace_get_syscall_info_op(child) like is done for
> > > get_syscall_info? The use case I see for this is simplifying when the
> > > user doesn't call PTRACE_GET_SYSCALL_INFO before calling
> > > PTRACE_SET_SYSCALL_INFO.
> > 
> > struct ptrace_syscall_info.op is a field that specifies how to interpret
> > the union fields of the structure, so if "op" is ignored, then the
> > kernel would infer the meaning of the structure specified by the userspace
> > tracer from the kernel state of the tracee.  This looks a bit too
> > error-prone to allow.  For example, nothing good is expected to happen
> > if syscall entry information is applied in a syscall exit stop.
> 
> Yes that's a good point. 
> 
> > 
> > The tracer is not obliged to call PTRACE_GET_SYSCALL_INFO to set
> > struct ptrace_syscall_info.op.  If the tracer keeps track of ptrace stops
> > by other means, it can assign the right value by itself.
> >
> > And, btw, the comment should say "is not currently supported",
> > I'll update it in the next iteration.
> > 
> > An idea mentioned in prior discussions was that it would make sense to
> > specify syscall return value along with skipping the syscall in seccomp stop,
> > and this would require a different value for "op" field, but
> > I decided not to introduce this extra complexity yet.
> 
> Makes sense, thank you!
> 
> - Charlie

I am no longer convinced that we need Celeste's patch that solves this
problem on riscv [1]. That patch is necessary without this change, but
PTRACE_SET_SYSCALL_INFO seems like a cleaner solution.

Reviewed-by: Charlie Jenkins <charlie@rivosinc.com>
Tested-by: Charlie Jenkins <charlie@rivosinc.com>

- Charlie

[1] https://lore.kernel.org/lkml/20250115-13cc73c36c7bb3b9f046f614@orel/T/

> 
> > 
> > 
> > -- 
> > ldv

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 3/7] syscall.h: add syscall_set_arguments() and syscall_set_return_value()
  2025-01-16  2:20   ` Charlie Jenkins
@ 2025-01-17  0:59     ` H. Peter Anvin
  2025-01-17 15:45       ` Eugene Syromyatnikov
  0 siblings, 1 reply; 65+ messages in thread
From: H. Peter Anvin @ 2025-01-17  0:59 UTC (permalink / raw)
  To: Charlie Jenkins, Dmitry V. Levin
  Cc: Oleg Nesterov, Eugene Syromyatnikov, Mike Frysinger, Renzo Davoli,
	Davide Berardi, strace-devel, Vineet Gupta, Russell King,
	Will Deacon, Guo Ren, Brian Cain, Huacai Chen, WANG Xuerui,
	Thomas Bogendoerfer, Dinh Nguyen, Jonas Bonn, Stefan Kristiansson,
	Stafford Horne, James E.J. Bottomley, Helge Deller,
	Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy, Naveen N Rao, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, Yoshinori Sato, Rich Felker,
	John Paul Adrian Glaubitz, David S. Miller, Andreas Larsson,
	Richard Weinberger, Anton Ivanov, Johannes Berg, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, Chris Zankel,
	Max Filippov, Arnd Bergmann, linux-snps-arc, linux-kernel,
	linux-arm-kernel, linux-csky, linux-hexagon, loongarch,
	linux-mips, linux-openrisc, linux-parisc, linuxppc-dev,
	linux-riscv, linux-s390, linux-sh, sparclinux, linux-um,
	linux-arch

I like the concept of this patchset, but *please* make it clear in the
comments that this does not solve the issue of 64-bit kernel arguments 
on 32-bit systems being ABI specific.

This isn't unique to this patch in any way; the only way to handle it is 
by keeping track of each ABI.
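
To illustrate the point (a sketch only, not part of the patch set): on a
32-bit ABI a 64-bit value occupies two slots of the args[] array, and both
the alignment of the pair and the order of the halves are per-ABI choices.
The helper name put_u64_arg() and its align_pair/hi_first parameters are
hypothetical; which setting a given ABI needs has to be looked up for that
ABI, nothing here detects it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: store a 64-bit value into two 32-bit argument
 * slots starting at idx and return the index of the next free slot.
 *
 * align_pair: skip one slot so that the pair starts at an even index.
 * hi_first:   put the upper 32 bits into the first slot of the pair.
 */
static unsigned int put_u64_arg(uint32_t *args, unsigned int nslots,
				unsigned int idx, uint64_t val,
				bool align_pair, bool hi_first)
{
	uint32_t lo = (uint32_t)val;
	uint32_t hi = (uint32_t)(val >> 32);

	if (align_pair && (idx & 1))
		idx++;

	/* Both halves must fit into the argument array. */
	assert(idx + 2 <= nslots);

	args[idx] = hi_first ? hi : lo;
	args[idx + 1] = hi_first ? lo : hi;
	return idx + 2;
}
```

The same 64-bit value therefore lands in different slots depending on the
two flags, which is why a tracer has to keep a per-ABI table.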

On 1/15/25 18:20, Charlie Jenkins wrote:
> On Mon, Jan 13, 2025 at 07:11:40PM +0200, Dmitry V. Levin wrote:
>> These functions are going to be needed on all HAVE_ARCH_TRACEHOOK
>> architectures to implement PTRACE_SET_SYSCALL_INFO API.
>>
>> This partially reverts commit 7962c2eddbfe ("arch: remove unused
>> function syscall_set_arguments()") by reusing some of old
>> syscall_set_arguments() implementations.
>>
>> Signed-off-by: Dmitry V. Levin <ldv@strace.io>
>> ---
>>
>> Note that I'm not a MIPS expert, I just added mips_set_syscall_arg() by
>> looking at mips_get_syscall_arg() and the result passes tests in qemu on
>> mips O32, mips64 O32, mips64 N32, and mips64 N64.
>>
>>   arch/arc/include/asm/syscall.h        | 14 +++++++++++
>>   arch/arm/include/asm/syscall.h        | 13 ++++++++++
>>   arch/arm64/include/asm/syscall.h      | 13 ++++++++++
>>   arch/csky/include/asm/syscall.h       | 13 ++++++++++
>>   arch/hexagon/include/asm/syscall.h    | 14 +++++++++++
>>   arch/loongarch/include/asm/syscall.h  |  8 ++++++
>>   arch/mips/include/asm/syscall.h       | 32 ++++++++++++++++++++++++
>>   arch/nios2/include/asm/syscall.h      | 11 ++++++++
>>   arch/openrisc/include/asm/syscall.h   |  7 ++++++
>>   arch/parisc/include/asm/syscall.h     | 12 +++++++++
>>   arch/powerpc/include/asm/syscall.h    | 10 ++++++++
>>   arch/riscv/include/asm/syscall.h      |  9 +++++++
>>   arch/s390/include/asm/syscall.h       | 12 +++++++++
>>   arch/sh/include/asm/syscall_32.h      | 12 +++++++++
>>   arch/sparc/include/asm/syscall.h      | 10 ++++++++
>>   arch/um/include/asm/syscall-generic.h | 14 +++++++++++
>>   arch/x86/include/asm/syscall.h        | 36 +++++++++++++++++++++++++++
>>   arch/xtensa/include/asm/syscall.h     | 11 ++++++++
>>   include/asm-generic/syscall.h         | 16 ++++++++++++
>>   19 files changed, 267 insertions(+)
>>
>> diff --git a/arch/arc/include/asm/syscall.h b/arch/arc/include/asm/syscall.h
>> index 9709256e31c8..89c1e1736356 100644
>> --- a/arch/arc/include/asm/syscall.h
>> +++ b/arch/arc/include/asm/syscall.h
>> @@ -67,6 +67,20 @@ syscall_get_arguments(struct task_struct *task, struct pt_regs *regs,
>>   	}
>>   }
>>   
>> +static inline void
>> +syscall_set_arguments(struct task_struct *task, struct pt_regs *regs,
>> +		      unsigned long *args)
>> +{
>> +	unsigned long *inside_ptregs = &regs->r0;
>> +	unsigned int n = 6;
>> +	unsigned int i = 0;
>> +
>> +	while (n--) {
>> +		*inside_ptregs = args[i++];
>> +		inside_ptregs--;
>> +	}
>> +}
>> +
>>   static inline int
>>   syscall_get_arch(struct task_struct *task)
>>   {
>> diff --git a/arch/arm/include/asm/syscall.h b/arch/arm/include/asm/syscall.h
>> index fe4326d938c1..21927fa0ae2b 100644
>> --- a/arch/arm/include/asm/syscall.h
>> +++ b/arch/arm/include/asm/syscall.h
>> @@ -80,6 +80,19 @@ static inline void syscall_get_arguments(struct task_struct *task,
>>   	memcpy(args, &regs->ARM_r0 + 1, 5 * sizeof(args[0]));
>>   }
>>   
>> +static inline void syscall_set_arguments(struct task_struct *task,
>> +					 struct pt_regs *regs,
>> +					 const unsigned long *args)
>> +{
>> +	memcpy(&regs->ARM_r0, args, 6 * sizeof(args[0]));
>> +	/*
>> +	 * Also copy the first argument into ARM_ORIG_r0
>> +	 * so that syscall_get_arguments() would return it
>> +	 * instead of the previous value.
>> +	 */
>> +	regs->ARM_ORIG_r0 = regs->ARM_r0;
>> +}
>> +
>>   static inline int syscall_get_arch(struct task_struct *task)
>>   {
>>   	/* ARM tasks don't change audit architectures on the fly. */
>> diff --git a/arch/arm64/include/asm/syscall.h b/arch/arm64/include/asm/syscall.h
>> index ab8e14b96f68..76020b66286b 100644
>> --- a/arch/arm64/include/asm/syscall.h
>> +++ b/arch/arm64/include/asm/syscall.h
>> @@ -73,6 +73,19 @@ static inline void syscall_get_arguments(struct task_struct *task,
>>   	memcpy(args, &regs->regs[1], 5 * sizeof(args[0]));
>>   }
>>   
>> +static inline void syscall_set_arguments(struct task_struct *task,
>> +					 struct pt_regs *regs,
>> +					 const unsigned long *args)
>> +{
>> +	memcpy(&regs->regs[0], args, 6 * sizeof(args[0]));
>> +	/*
>> +	 * Also copy the first argument into orig_x0
>> +	 * so that syscall_get_arguments() would return it
>> +	 * instead of the previous value.
>> +	 */
>> +	regs->orig_x0 = regs->regs[0];
>> +}
>> +
>>   /*
>>    * We don't care about endianness (__AUDIT_ARCH_LE bit) here because
>>    * AArch64 has the same system calls both on little- and big- endian.
>> diff --git a/arch/csky/include/asm/syscall.h b/arch/csky/include/asm/syscall.h
>> index 0de5734950bf..30403f7a0487 100644
>> --- a/arch/csky/include/asm/syscall.h
>> +++ b/arch/csky/include/asm/syscall.h
>> @@ -59,6 +59,19 @@ syscall_get_arguments(struct task_struct *task, struct pt_regs *regs,
>>   	memcpy(args, &regs->a1, 5 * sizeof(args[0]));
>>   }
>>   
>> +static inline void
>> +syscall_set_arguments(struct task_struct *task, struct pt_regs *regs,
>> +		      const unsigned long *args)
>> +{
>> +	memcpy(&regs->a0, args, 6 * sizeof(regs->a0));
>> +	/*
>> +	 * Also copy the first argument into orig_a0
>> +	 * so that syscall_get_arguments() would return it
>> +	 * instead of the previous value.
>> +	 */
>> +	regs->orig_a0 = regs->a0;
>> +}
>> +
>>   static inline int
>>   syscall_get_arch(struct task_struct *task)
>>   {
>> diff --git a/arch/hexagon/include/asm/syscall.h b/arch/hexagon/include/asm/syscall.h
>> index f6e454f18038..1024a6548d78 100644
>> --- a/arch/hexagon/include/asm/syscall.h
>> +++ b/arch/hexagon/include/asm/syscall.h
>> @@ -33,6 +33,13 @@ static inline void syscall_get_arguments(struct task_struct *task,
>>   	memcpy(args, &(&regs->r00)[0], 6 * sizeof(args[0]));
>>   }
>>   
>> +static inline void syscall_set_arguments(struct task_struct *task,
>> +					 struct pt_regs *regs,
>> +					 unsigned long *args)
>> +{
>> +	memcpy(&(&regs->r00)[0], args, 6 * sizeof(args[0]));
>> +}
>> +
>>   static inline long syscall_get_error(struct task_struct *task,
>>   				     struct pt_regs *regs)
>>   {
>> @@ -45,6 +52,13 @@ static inline long syscall_get_return_value(struct task_struct *task,
>>   	return regs->r00;
>>   }
>>   
>> +static inline void syscall_set_return_value(struct task_struct *task,
>> +					    struct pt_regs *regs,
>> +					    int error, long val)
>> +{
>> +	regs->r00 = (long) error ?: val;
>> +}
>> +
>>   static inline int syscall_get_arch(struct task_struct *task)
>>   {
>>   	return AUDIT_ARCH_HEXAGON;
>> diff --git a/arch/loongarch/include/asm/syscall.h b/arch/loongarch/include/asm/syscall.h
>> index e286dc58476e..ff415b3c0a8e 100644
>> --- a/arch/loongarch/include/asm/syscall.h
>> +++ b/arch/loongarch/include/asm/syscall.h
>> @@ -61,6 +61,14 @@ static inline void syscall_get_arguments(struct task_struct *task,
>>   	memcpy(&args[1], &regs->regs[5], 5 * sizeof(long));
>>   }
>>   
>> +static inline void syscall_set_arguments(struct task_struct *task,
>> +					 struct pt_regs *regs,
>> +					 unsigned long *args)
>> +{
>> +	regs->orig_a0 = args[0];
>> +	memcpy(&regs->regs[5], &args[1], 5 * sizeof(long));
>> +}
>> +
>>   static inline int syscall_get_arch(struct task_struct *task)
>>   {
>>   	return AUDIT_ARCH_LOONGARCH64;
>> diff --git a/arch/mips/include/asm/syscall.h b/arch/mips/include/asm/syscall.h
>> index 2f85f2d8f754..3163d1506fae 100644
>> --- a/arch/mips/include/asm/syscall.h
>> +++ b/arch/mips/include/asm/syscall.h
>> @@ -76,6 +76,23 @@ static inline void mips_get_syscall_arg(unsigned long *arg,
>>   #endif
>>   }
>>   
>> +static inline void mips_set_syscall_arg(unsigned long *arg,
>> +	struct task_struct *task, struct pt_regs *regs, unsigned int n)
>> +{
>> +#ifdef CONFIG_32BIT
>> +	switch (n) {
>> +	case 0: case 1: case 2: case 3:
>> +		regs->regs[4 + n] = *arg;
>> +		return;
>> +	case 4: case 5: case 6: case 7:
>> +		regs->pad0[n] = *arg;
>> +		return;
>> +	}
>> +#else
>> +	regs->regs[4 + n] = *arg;
>> +#endif
>> +}
>> +
>>   static inline long syscall_get_error(struct task_struct *task,
>>   				     struct pt_regs *regs)
>>   {
>> @@ -122,6 +139,21 @@ static inline void syscall_get_arguments(struct task_struct *task,
>>   		mips_get_syscall_arg(args++, task, regs, i++);
>>   }
>>   
>> +static inline void syscall_set_arguments(struct task_struct *task,
>> +					 struct pt_regs *regs,
>> +					 unsigned long *args)
>> +{
>> +	unsigned int i = 0;
>> +	unsigned int n = 6;
>> +
>> +	/* O32 ABI syscall() */
>> +	if (mips_syscall_is_indirect(task, regs))
>> +		i++;
>> +
>> +	while (n--)
>> +		mips_set_syscall_arg(args++, task, regs, i++);
>> +}
>> +
>>   extern const unsigned long sys_call_table[];
>>   extern const unsigned long sys32_call_table[];
>>   extern const unsigned long sysn32_call_table[];
>> diff --git a/arch/nios2/include/asm/syscall.h b/arch/nios2/include/asm/syscall.h
>> index fff52205fb65..526449edd768 100644
>> --- a/arch/nios2/include/asm/syscall.h
>> +++ b/arch/nios2/include/asm/syscall.h
>> @@ -58,6 +58,17 @@ static inline void syscall_get_arguments(struct task_struct *task,
>>   	*args   = regs->r9;
>>   }
>>   
>> +static inline void syscall_set_arguments(struct task_struct *task,
>> +	struct pt_regs *regs, const unsigned long *args)
>> +{
>> +	regs->r4 = *args++;
>> +	regs->r5 = *args++;
>> +	regs->r6 = *args++;
>> +	regs->r7 = *args++;
>> +	regs->r8 = *args++;
>> +	regs->r9 = *args;
>> +}
>> +
>>   static inline int syscall_get_arch(struct task_struct *task)
>>   {
>>   	return AUDIT_ARCH_NIOS2;
>> diff --git a/arch/openrisc/include/asm/syscall.h b/arch/openrisc/include/asm/syscall.h
>> index 903ed882bdec..e6383be2a195 100644
>> --- a/arch/openrisc/include/asm/syscall.h
>> +++ b/arch/openrisc/include/asm/syscall.h
>> @@ -57,6 +57,13 @@ syscall_get_arguments(struct task_struct *task, struct pt_regs *regs,
>>   	memcpy(args, &regs->gpr[3], 6 * sizeof(args[0]));
>>   }
>>   
>> +static inline void
>> +syscall_set_arguments(struct task_struct *task, struct pt_regs *regs,
>> +		      const unsigned long *args)
>> +{
>> +	memcpy(&regs->gpr[3], args, 6 * sizeof(args[0]));
>> +}
>> +
>>   static inline int syscall_get_arch(struct task_struct *task)
>>   {
>>   	return AUDIT_ARCH_OPENRISC;
>> diff --git a/arch/parisc/include/asm/syscall.h b/arch/parisc/include/asm/syscall.h
>> index 00b127a5e09b..b146d0ae4c77 100644
>> --- a/arch/parisc/include/asm/syscall.h
>> +++ b/arch/parisc/include/asm/syscall.h
>> @@ -29,6 +29,18 @@ static inline void syscall_get_arguments(struct task_struct *tsk,
>>   	args[0] = regs->gr[26];
>>   }
>>   
>> +static inline void syscall_set_arguments(struct task_struct *tsk,
>> +					 struct pt_regs *regs,
>> +					 unsigned long *args)
>> +{
>> +	regs->gr[21] = args[5];
>> +	regs->gr[22] = args[4];
>> +	regs->gr[23] = args[3];
>> +	regs->gr[24] = args[2];
>> +	regs->gr[25] = args[1];
>> +	regs->gr[26] = args[0];
>> +}
>> +
>>   static inline long syscall_get_error(struct task_struct *task,
>>   				     struct pt_regs *regs)
>>   {
>> diff --git a/arch/powerpc/include/asm/syscall.h b/arch/powerpc/include/asm/syscall.h
>> index 422d7735ace6..521f279e6b33 100644
>> --- a/arch/powerpc/include/asm/syscall.h
>> +++ b/arch/powerpc/include/asm/syscall.h
>> @@ -114,6 +114,16 @@ static inline void syscall_get_arguments(struct task_struct *task,
>>   	}
>>   }
>>   
>> +static inline void syscall_set_arguments(struct task_struct *task,
>> +					 struct pt_regs *regs,
>> +					 const unsigned long *args)
>> +{
>> +	memcpy(&regs->gpr[3], args, 6 * sizeof(args[0]));
>> +
>> +	/* Also copy the first argument into orig_gpr3 */
>> +	regs->orig_gpr3 = args[0];
>> +}
>> +
>>   static inline int syscall_get_arch(struct task_struct *task)
>>   {
>>   	if (is_tsk_32bit_task(task))
>> diff --git a/arch/riscv/include/asm/syscall.h b/arch/riscv/include/asm/syscall.h
>> index 121fff429dce..8d389ba995c8 100644
>> --- a/arch/riscv/include/asm/syscall.h
>> +++ b/arch/riscv/include/asm/syscall.h
>> @@ -66,6 +66,15 @@ static inline void syscall_get_arguments(struct task_struct *task,
>>   	memcpy(args, &regs->a1, 5 * sizeof(args[0]));
>>   }
>>   
>> +static inline void syscall_set_arguments(struct task_struct *task,
>> +					 struct pt_regs *regs,
>> +					 const unsigned long *args)
>> +{
>> +	regs->orig_a0 = args[0];
>> +	args++;
>> +	memcpy(&regs->a1, args, 5 * sizeof(regs->a1));
>> +}
> 
> Looks good for riscv.
> 
> Tested-by: Charlie Jenkins <charlie@rivosinc.com>
> Reviewed-by: Charlie Jenkins <charlie@rivosinc.com>
> 
>> +
>>   static inline int syscall_get_arch(struct task_struct *task)
>>   {
>>   #ifdef CONFIG_64BIT
>> diff --git a/arch/s390/include/asm/syscall.h b/arch/s390/include/asm/syscall.h
>> index 27e3d804b311..b3dd883699e7 100644
>> --- a/arch/s390/include/asm/syscall.h
>> +++ b/arch/s390/include/asm/syscall.h
>> @@ -78,6 +78,18 @@ static inline void syscall_get_arguments(struct task_struct *task,
>>   	args[0] = regs->orig_gpr2 & mask;
>>   }
>>   
>> +static inline void syscall_set_arguments(struct task_struct *task,
>> +					 struct pt_regs *regs,
>> +					 const unsigned long *args)
>> +{
>> +	unsigned int n = 6;
>> +
>> +	while (n-- > 0)
>> +		if (n > 0)
>> +			regs->gprs[2 + n] = args[n];
>> +	regs->orig_gpr2 = args[0];
>> +}
>> +
>>   static inline int syscall_get_arch(struct task_struct *task)
>>   {
>>   #ifdef CONFIG_COMPAT
>> diff --git a/arch/sh/include/asm/syscall_32.h b/arch/sh/include/asm/syscall_32.h
>> index d87738eebe30..cb51a7528384 100644
>> --- a/arch/sh/include/asm/syscall_32.h
>> +++ b/arch/sh/include/asm/syscall_32.h
>> @@ -57,6 +57,18 @@ static inline void syscall_get_arguments(struct task_struct *task,
>>   	args[0] = regs->regs[4];
>>   }
>>   
>> +static inline void syscall_set_arguments(struct task_struct *task,
>> +					 struct pt_regs *regs,
>> +					 const unsigned long *args)
>> +{
>> +	regs->regs[1] = args[5];
>> +	regs->regs[0] = args[4];
>> +	regs->regs[7] = args[3];
>> +	regs->regs[6] = args[2];
>> +	regs->regs[5] = args[1];
>> +	regs->regs[4] = args[0];
>> +}
>> +
>>   static inline int syscall_get_arch(struct task_struct *task)
>>   {
>>   	int arch = AUDIT_ARCH_SH;
>> diff --git a/arch/sparc/include/asm/syscall.h b/arch/sparc/include/asm/syscall.h
>> index 20c109ac8cc9..62a5a78804c4 100644
>> --- a/arch/sparc/include/asm/syscall.h
>> +++ b/arch/sparc/include/asm/syscall.h
>> @@ -117,6 +117,16 @@ static inline void syscall_get_arguments(struct task_struct *task,
>>   	}
>>   }
>>   
>> +static inline void syscall_set_arguments(struct task_struct *task,
>> +					 struct pt_regs *regs,
>> +					 const unsigned long *args)
>> +{
>> +	unsigned int i;
>> +
>> +	for (i = 0; i < 6; i++)
>> +		regs->u_regs[UREG_I0 + i] = args[i];
>> +}
>> +
>>   static inline int syscall_get_arch(struct task_struct *task)
>>   {
>>   #if defined(CONFIG_SPARC64) && defined(CONFIG_COMPAT)
>> diff --git a/arch/um/include/asm/syscall-generic.h b/arch/um/include/asm/syscall-generic.h
>> index 172b74143c4b..2984feb9d576 100644
>> --- a/arch/um/include/asm/syscall-generic.h
>> +++ b/arch/um/include/asm/syscall-generic.h
>> @@ -62,6 +62,20 @@ static inline void syscall_get_arguments(struct task_struct *task,
>>   	*args   = UPT_SYSCALL_ARG6(r);
>>   }
>>   
>> +static inline void syscall_set_arguments(struct task_struct *task,
>> +					 struct pt_regs *regs,
>> +					 const unsigned long *args)
>> +{
>> +	struct uml_pt_regs *r = &regs->regs;
>> +
>> +	UPT_SYSCALL_ARG1(r) = *args++;
>> +	UPT_SYSCALL_ARG2(r) = *args++;
>> +	UPT_SYSCALL_ARG3(r) = *args++;
>> +	UPT_SYSCALL_ARG4(r) = *args++;
>> +	UPT_SYSCALL_ARG5(r) = *args++;
>> +	UPT_SYSCALL_ARG6(r) = *args;
>> +}
>> +
>>   /* See arch/x86/um/asm/syscall.h for syscall_get_arch() definition. */
>>   
>>   #endif	/* __UM_SYSCALL_GENERIC_H */
>> diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
>> index 7c488ff0c764..b9c249dd9e3d 100644
>> --- a/arch/x86/include/asm/syscall.h
>> +++ b/arch/x86/include/asm/syscall.h
>> @@ -90,6 +90,18 @@ static inline void syscall_get_arguments(struct task_struct *task,
>>   	args[5] = regs->bp;
>>   }
>>   
>> +static inline void syscall_set_arguments(struct task_struct *task,
>> +					 struct pt_regs *regs,
>> +					 const unsigned long *args)
>> +{
>> +	regs->bx = args[0];
>> +	regs->cx = args[1];
>> +	regs->dx = args[2];
>> +	regs->si = args[3];
>> +	regs->di = args[4];
>> +	regs->bp = args[5];
>> +}
>> +
>>   static inline int syscall_get_arch(struct task_struct *task)
>>   {
>>   	return AUDIT_ARCH_I386;
>> @@ -121,6 +133,30 @@ static inline void syscall_get_arguments(struct task_struct *task,
>>   	}
>>   }
>>   
>> +static inline void syscall_set_arguments(struct task_struct *task,
>> +					 struct pt_regs *regs,
>> +					 const unsigned long *args)
>> +{
>> +# ifdef CONFIG_IA32_EMULATION
>> +	if (task->thread_info.status & TS_COMPAT) {
>> +		regs->bx = *args++;
>> +		regs->cx = *args++;
>> +		regs->dx = *args++;
>> +		regs->si = *args++;
>> +		regs->di = *args++;
>> +		regs->bp = *args;
>> +	} else
>> +# endif
>> +	{
>> +		regs->di = *args++;
>> +		regs->si = *args++;
>> +		regs->dx = *args++;
>> +		regs->r10 = *args++;
>> +		regs->r8 = *args++;
>> +		regs->r9 = *args;
>> +	}
>> +}
>> +
>>   static inline int syscall_get_arch(struct task_struct *task)
>>   {
>>   	/* x32 tasks should be considered AUDIT_ARCH_X86_64. */
>> diff --git a/arch/xtensa/include/asm/syscall.h b/arch/xtensa/include/asm/syscall.h
>> index 5ee974bf8330..f9a671cbf933 100644
>> --- a/arch/xtensa/include/asm/syscall.h
>> +++ b/arch/xtensa/include/asm/syscall.h
>> @@ -68,6 +68,17 @@ static inline void syscall_get_arguments(struct task_struct *task,
>>   		args[i] = regs->areg[reg[i]];
>>   }
>>   
>> +static inline void syscall_set_arguments(struct task_struct *task,
>> +					 struct pt_regs *regs,
>> +					 const unsigned long *args)
>> +{
>> +	static const unsigned int reg[] = XTENSA_SYSCALL_ARGUMENT_REGS;
>> +	unsigned int i;
>> +
>> +	for (i = 0; i < 6; ++i)
>> +		regs->areg[reg[i]] = args[i];
>> +}
>> +
>>   asmlinkage long xtensa_rt_sigreturn(void);
>>   asmlinkage long xtensa_shmat(int, char __user *, int);
>>   asmlinkage long xtensa_fadvise64_64(int, int,
>> diff --git a/include/asm-generic/syscall.h b/include/asm-generic/syscall.h
>> index 5a80fe728dc8..0f7b9a493de7 100644
>> --- a/include/asm-generic/syscall.h
>> +++ b/include/asm-generic/syscall.h
>> @@ -117,6 +117,22 @@ void syscall_set_return_value(struct task_struct *task, struct pt_regs *regs,
>>   void syscall_get_arguments(struct task_struct *task, struct pt_regs *regs,
>>   			   unsigned long *args);
>>   
>> +/**
>> + * syscall_set_arguments - change system call parameter value
>> + * @task:	task of interest, must be in system call entry tracing
>> + * @regs:	task_pt_regs() of @task
>> + * @args:	array of argument values to store
>> + *
>> + * Changes 6 arguments to the system call.
>> + * The first argument gets value @args[0], and so on.
>> + *
>> + * It's only valid to call this when @task is stopped for tracing on
>> + * entry to a system call, due to %SYSCALL_WORK_SYSCALL_TRACE or
>> + * %SYSCALL_WORK_SYSCALL_AUDIT.
>> + */
>> +void syscall_set_arguments(struct task_struct *task, struct pt_regs *regs,
>> +			   const unsigned long *args);
>> +
>>   /**
>>    * syscall_get_arch - return the AUDIT_ARCH for the current system call
>>    * @task:	task of interest, must be blocked
>> -- 
>> ldv


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 6/7] ptrace: introduce PTRACE_SET_SYSCALL_INFO request
  2025-01-16 16:04     ` Dmitry V. Levin
  2025-01-16 16:40       ` Dmitry V. Levin
@ 2025-01-17 14:45       ` Oleg Nesterov
  2025-01-17 15:06         ` Dmitry V. Levin
  1 sibling, 1 reply; 65+ messages in thread
From: Oleg Nesterov @ 2025-01-17 14:45 UTC (permalink / raw)
  To: Dmitry V. Levin
  Cc: Eugene Syromyatnikov, Mike Frysinger, Renzo Davoli,
	Davide Berardi, strace-devel, linux-kernel, linux-api

On 01/16, Dmitry V. Levin wrote:
>
> The idea is to use "op" to specify the operation, and "flags" to specify
> future extensions to the operation.

OK,

> That is, the zero check implied by copy_struct_from_user() is not really
> needed here since the compatibility is tracked by "op" and "flags":

OK, but then why this patch uses copy_struct_from_user() ?

Why can't we simply do

	if (user_size != PTRACE_SYSCALL_INFO_SIZE_VER0)
		return -EINVAL;

	if (copy_from_user(..., user_size))
		return -EFAULT;

now, until we add the extensions ?

Oleg.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 6/7] ptrace: introduce PTRACE_SET_SYSCALL_INFO request
  2025-01-17 14:45       ` Oleg Nesterov
@ 2025-01-17 15:06         ` Dmitry V. Levin
  2025-01-17 15:32           ` Oleg Nesterov
  0 siblings, 1 reply; 65+ messages in thread
From: Dmitry V. Levin @ 2025-01-17 15:06 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Eugene Syromyatnikov, Mike Frysinger, Renzo Davoli,
	Davide Berardi, strace-devel, linux-kernel, linux-api

On Fri, Jan 17, 2025 at 03:45:57PM +0100, Oleg Nesterov wrote:
> On 01/16, Dmitry V. Levin wrote:
> >
> > The idea is to use "op" to specify the operation, and "flags" to specify
> > future extensions to the operation.
> 
> OK,
> 
> > That is, the zero check implied by copy_struct_from_user() is not really
> > needed here since the compatibility is tracked by "op" and "flags":
> 
> OK, but then why this patch uses copy_struct_from_user() ?
> 
> Why can't we simply do
> 
> 	if (user_size != PTRACE_SYSCALL_INFO_SIZE_VER0)
> 		return -EINVAL;
> 
> 	if (copy_from_user(..., user_size))
> 		return -EFAULT;
> 
> now, until we add the extensions ?

We should accept larger user_size from the very beginning, so that in case
the structure grows in the future, the userspace that sticks to the current
set of supported features would still be able to work with older kernels.

I think we can do the following:

        /*
         * ptrace_syscall_info.seccomp is the largest member in the union,
         * and ret_data is the last field there.
         * min_size can be less than sizeof(info) due to alignment.
         */
        size_t min_size = offsetofend(struct ptrace_syscall_info, seccomp.ret_data);
        size_t copy_size = min(sizeof(info), user_size);

        if (copy_size < min_size)
                return -EINVAL;

        if (copy_from_user(&info, datavp, copy_size))
                return -EFAULT;

We cannot just use sizeof(info) because it depends on the alignment of
__u64.  Also, I don't think we need to fill with zeroes the trailing
padding bytes of the structure as we are not going to use them in any way.


-- 
ldv

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 6/7] ptrace: introduce PTRACE_SET_SYSCALL_INFO request
  2025-01-17 15:06         ` Dmitry V. Levin
@ 2025-01-17 15:32           ` Oleg Nesterov
  2025-01-17 16:22             ` Dmitry V. Levin
  0 siblings, 1 reply; 65+ messages in thread
From: Oleg Nesterov @ 2025-01-17 15:32 UTC (permalink / raw)
  To: Dmitry V. Levin
  Cc: Eugene Syromyatnikov, Mike Frysinger, Renzo Davoli,
	Davide Berardi, strace-devel, linux-kernel, linux-api

Dmitry,

You certainly understand the user-space needs much better than me.
I am just trying to understand your point.

On 01/17, Dmitry V. Levin wrote:
>
> We should accept larger user_size from the very beginning, so that in case
> the structure grows in the future, the userspace that sticks to the current
> set of supported features would still be able to work with older kernels.

This is what I can't understand, perhaps I have a blind spot here ;)

Could you provide an example (even absolutely artificial) of possible extension
which can help me to understand?

> We cannot just use sizeof(info) because it depends on the alignment of
> __u64.

Hmm why? I thought that the kernel already depends on the "natural" alignment?
And if we can't, then PTRACE_SYSCALL_INFO_SIZE_VER0 added by this patch makes
no sense?

Sorry I guess I must have missed something, I am sick today.

> Also, I don't think we need to fill with zeroes the trailing
> padding bytes of the structure as we are not going to use them in any way.

At least we seem to agree here ;)

Oleg.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 3/7] syscall.h: add syscall_set_arguments() and syscall_set_return_value()
  2025-01-17  0:59     ` H. Peter Anvin
@ 2025-01-17 15:45       ` Eugene Syromyatnikov
  2025-01-18  4:34         ` H. Peter Anvin
  0 siblings, 1 reply; 65+ messages in thread
From: Eugene Syromyatnikov @ 2025-01-17 15:45 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Charlie Jenkins, Dmitry V. Levin, Oleg Nesterov, Mike Frysinger,
	Renzo Davoli, Davide Berardi, strace-devel, Vineet Gupta,
	Russell King, Will Deacon, Guo Ren, Brian Cain, Huacai Chen,
	WANG Xuerui, Thomas Bogendoerfer, Dinh Nguyen, Jonas Bonn,
	Stefan Kristiansson, Stafford Horne, James E.J. Bottomley,
	Helge Deller, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy, Naveen N Rao, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	Yoshinori Sato, Rich Felker, John Paul Adrian Glaubitz,
	David S. Miller, Andreas Larsson, Richard Weinberger,
	Anton Ivanov, Johannes Berg, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, Chris Zankel, Max Filippov,
	Arnd Bergmann, linux-snps-arc, linux-kernel, linux-arm-kernel,
	linux-csky, linux-hexagon, loongarch, linux-mips, linux-openrisc,
	linux-parisc, linuxppc-dev, linux-riscv, linux-s390, linux-sh,
	sparclinux, linux-um, linux-arch

On Fri, Jan 17, 2025 at 2:03 AM H. Peter Anvin <hpa@zytor.com> wrote:
>
> I like the concept of this patchset, but *please* make it clear in the
> comments that this does not solve the issue of 64-bit kernel arguments
> on 32-bit systems being ABI specific.

Sorry, but I don't see how this is relevant; each architecture has its
own ABI with its own set of peculiarities, and there's a lot of
(completely unrelated) work needed in order to make an ABI that is
architecture-agnostic.  All this patch set does is provide a
consistent way to manipulate scno and args across architectures;  it
doesn't address the fact that some architectures have mmap2/mmap_pgoff
syscall, or that some have fadvise64_64 in addition to fadvise64, or
the existence of clone2, or socketcall, or ipc; or that some
architectures don't have open or stat;  or that scnos on different
architectures or even different bit-widths within the "same"
architecture are different.

> This isn't unique to this patch in any way; the only way to handle it is
> by keeping track of each ABI.

That's true, but this patch doesn't even try to address that.

-- 
Eugene Syromyatnikov
mailto:evgsyr@gmail.com
xmpp:esyr@jabber.{ru|org}

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 6/7] ptrace: introduce PTRACE_SET_SYSCALL_INFO request
  2025-01-17 15:32           ` Oleg Nesterov
@ 2025-01-17 16:22             ` Dmitry V. Levin
  2025-01-18 14:13               ` Oleg Nesterov
  0 siblings, 1 reply; 65+ messages in thread
From: Dmitry V. Levin @ 2025-01-17 16:22 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Eugene Syromyatnikov, Mike Frysinger, Renzo Davoli,
	Davide Berardi, strace-devel, linux-kernel, linux-api

On Fri, Jan 17, 2025 at 04:32:59PM +0100, Oleg Nesterov wrote:
> Dmitry,
> 
> You certainly understand the user-space needs much better than me.
> I am just trying to understand your point.
> 
> On 01/17, Dmitry V. Levin wrote:
> >
> > We should accept larger user_size from the very beginning, so that in case
> > the structure grows in the future, the userspace that sticks to the current
> > set of supported features would still be able to work with older kernels.
> 
> This is what I can't understand, perhaps I have a blind spot here ;)
> 
> Could you provide an example (even absolutely artificial) of possible extension
> which can help me to understand?

An absolutely artificial example: let's say we're adding an optional 
64-bit field "artificial" to ptrace_syscall_info.seccomp, this means
sizeof(ptrace_syscall_info) grows by 8 bytes.  When userspace wants
to set this optional field, it sets a bit in ptrace_syscall_info.flags,
this tells the kernel to look into this new "artificial" field.
When userspace is not interested in setting new optional fields,
it just keeps ptrace_syscall_info.flags == 0.  Remember, however, that
by adding the new optional field sizeof(ptrace_syscall_info) grew by 8 bytes.

What we need is to make sure that an older kernel that has no idea of this
new field would still accept the bigger size, so that userspace would be
able to continue doing its
	ptrace(PTRACE_SET_SYSCALL_INFO, pid, sizeof(info), &info)
despite potential growth of sizeof(info) until it actually starts using
new optional fields.
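
A user-space sketch of that scenario, under stated assumptions: struct
info_v0 and struct info_v1 are made-up stand-ins (not the real struct
ptrace_syscall_info), old_kernel_set() is a hypothetical model of an older
kernel's size handling, memcpy() stands in for copy_from_user(), and the
rejection of non-zero flags is an assumption about how unknown extensions
would be refused.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* What the hypothetical older kernel knows about. */
struct info_v0 {
	uint32_t flags;
	uint32_t ret_data;
};

/* What newer userspace headers declare: one optional field was added. */
struct info_v1 {
	uint32_t flags;
	uint32_t ret_data;
	uint64_t artificial;	/* only meaningful if a flag bit is set */
};

static_assert(sizeof(struct info_v1) > sizeof(struct info_v0),
	      "the newer structure is expected to be larger");

/* Model of the older kernel: accept any size >= what it understands. */
static int old_kernel_set(const void *data, size_t user_size,
			  struct info_v0 *out)
{
	struct info_v0 info;
	size_t copy_size = user_size < sizeof(info) ? user_size : sizeof(info);

	if (copy_size < sizeof(info))
		return -EINVAL;

	memcpy(&info, data, copy_size);	/* stands in for copy_from_user() */

	if (info.flags)			/* unknown extension requested */
		return -EINVAL;

	*out = info;
	return 0;
}
```

With flags == 0 the larger structure is accepted and the trailing bytes are
simply not read; a strict "user_size == sizeof(info)" check would have
rejected the same call.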

> > We cannot just use sizeof(info) because it depends on the alignment of
> > __u64.
> 
> Hmm why? I thought that the kernel already depends on the "natural" alignment?
> And if we can't, then PTRACE_SYSCALL_INFO_SIZE_VER0 added by this patch makes
> no sense?

struct ptrace_syscall_info has members of type __u64, and it currently
ends with "__u32 ret_data".  So depending on the alignment, the structure
either has extra 4 trailing padding bytes, or it doesn't.

For example, on x86_64 sizeof(struct ptrace_syscall_info) is currently 88,
while on x86 it is 84.
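
The arithmetic can be checked in user space.  The sketch below uses
struct demo_info, a stand-in written for this illustration rather than the
UAPI definition; OFFSETOFEND() is a local equivalent of the kernel's
offsetofend(), and demo_copy_size() is a hypothetical helper that mirrors
the min_size/copy_size check proposed earlier in the thread.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Local equivalent of the kernel's offsetofend() helper. */
#define OFFSETOFEND(type, member) \
	(offsetof(type, member) + sizeof(((type *)0)->member))

/* Stand-in layout: 64-bit members, and the largest union member ends in a 32-bit field. */
struct demo_info {
	uint8_t  op;
	uint8_t  reserved;
	uint16_t flags;
	uint32_t arch;
	uint64_t instruction_pointer;
	uint64_t stack_pointer;
	union {
		struct { uint64_t nr; uint64_t args[6]; } entry;
		struct { int64_t rval; uint8_t is_error; } exit;
		struct { uint64_t nr; uint64_t args[6]; uint32_t ret_data; } seccomp;
	};
};

/* The end of the last field does not depend on the alignment of uint64_t. */
static_assert(OFFSETOFEND(struct demo_info, seccomp.ret_data) == 84,
	      "last field is expected to end at byte 84");

/* Returns the number of bytes to copy, or -EINVAL if user_size is too small. */
static long demo_copy_size(size_t user_size)
{
	size_t min_size = OFFSETOFEND(struct demo_info, seccomp.ret_data);
	size_t copy_size = user_size < sizeof(struct demo_info) ?
			   user_size : sizeof(struct demo_info);

	if (copy_size < min_size)
		return -EINVAL;
	return (long)copy_size;
}
```

Where uint64_t is 8-byte aligned, sizeof(struct demo_info) is rounded up
past byte 84 by trailing padding, while the end offset of ret_data stays
the same, so only the latter is usable as a minimum size across ABIs.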


-- 
ldv

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 3/7] syscall.h: add syscall_set_arguments() and syscall_set_return_value()
  2025-01-17 15:45       ` Eugene Syromyatnikov
@ 2025-01-18  4:34         ` H. Peter Anvin
  0 siblings, 0 replies; 65+ messages in thread
From: H. Peter Anvin @ 2025-01-18  4:34 UTC (permalink / raw)
  To: Eugene Syromyatnikov
  Cc: Charlie Jenkins, Dmitry V. Levin, Oleg Nesterov, Mike Frysinger,
	Renzo Davoli, Davide Berardi, strace-devel, Vineet Gupta,
	Russell King, Will Deacon, Guo Ren, Brian Cain, Huacai Chen,
	WANG Xuerui, Thomas Bogendoerfer, Dinh Nguyen, Jonas Bonn,
	Stefan Kristiansson, Stafford Horne, James E.J. Bottomley,
	Helge Deller, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy, Naveen N Rao, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	Yoshinori Sato, Rich Felker, John Paul Adrian Glaubitz,
	David S. Miller, Andreas Larsson, Richard Weinberger,
	Anton Ivanov, Johannes Berg, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, Chris Zankel, Max Filippov,
	Arnd Bergmann, linux-snps-arc, linux-kernel, linux-arm-kernel,
	linux-csky, linux-hexagon, loongarch, linux-mips, linux-openrisc,
	linux-parisc, linuxppc-dev, linux-riscv, linux-s390, linux-sh,
	sparclinux, linux-um, linux-arch

On January 17, 2025 7:45:02 AM PST, Eugene Syromyatnikov <evgsyr@gmail.com> wrote:
>On Fri, Jan 17, 2025 at 2:03 AM H. Peter Anvin <hpa@zytor.com> wrote:
>>
>> I like the concept of this patchset, but *please* make it clear in the
>> comments that this does not solve the issue of 64-bit kernel arguments
>> on 32-bit systems being ABI specific.
>
>Sorry, but I don't see how this is relevant; each architecture has its
>own ABI with its own set of peculiarities, and there's a lot of
>(completely unrelated) work needed in order to make an ABI that is
>architecture-agnostic.  All this patch set does is provides a
>consistent way to manipulate scno and args across architectures;  it
>doesn't address the fact that some architectures have mmap2/mmap_pgoff
>syscall, or that some have fadvise64_64 in addition to fadvise64, or
>the existence of clone2, or socketcall, or ipc; or that some
>architectures don't have open or stat;  or that scnos on different
>architectures or even different bit-widths within the "same"
>architecture are different.
>
>> This isn't unique to this patch in any way; the only way to handle it is
>> by keeping track of each ABI.
>
>That's true, but this patch doesn't even try to address that.
>

I just want it noted in the comment, that's all.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 6/7] ptrace: introduce PTRACE_SET_SYSCALL_INFO request
  2025-01-17 16:22             ` Dmitry V. Levin
@ 2025-01-18 14:13               ` Oleg Nesterov
  2025-01-19 12:44                 ` Dmitry V. Levin
  2025-01-19 14:38                 ` Aleksa Sarai
  0 siblings, 2 replies; 65+ messages in thread
From: Oleg Nesterov @ 2025-01-18 14:13 UTC (permalink / raw)
  To: Dmitry V. Levin
  Cc: Eugene Syromyatnikov, Mike Frysinger, Renzo Davoli,
	Davide Berardi, strace-devel, linux-kernel, linux-api

On 01/17, Dmitry V. Levin wrote:
>

(reordered)

> struct ptrace_syscall_info has members of type __u64, and it currently
> ends with "__u32 ret_data".  So depending on the alignment, the structure
> either has extra 4 trailing padding bytes, or it doesn't.

Ah, I didn't realize that the last member is __u32, so I completely
misunderstood your "it depends on the alignment of __u64" note.

> For example, on x86_64 sizeof(struct ptrace_syscall_info) is currently 88,
> while on x86 it is 84.

Not good, but too late to complain...

OK, I see your point now and I won't argue with the approach you outlined in
your previous email:

        size_t min_size = offsetofend(struct ptrace_syscall_info, seccomp.ret_data);
        size_t copy_size = min(sizeof(info), user_size);

        if (copy_size < min_size)
                return -EINVAL;

        if (copy_from_user(&info, datavp, copy_size))
                return -EFAULT;

-------------------------------------------------------------------------------
That said... Can't resist,

> An absolutely artificial example: let's say we're adding an optional
> 64-bit field "artificial" to ptrace_syscall_info.seccomp, this means
> sizeof(ptrace_syscall_info) grows by 8 bytes.  When userspace wants
> to set this optional field, it sets a bit in ptrace_syscall_info.flags,
> this tells the kernel to look into this new "artificial" field.
> When userspace is not interested in setting new optional fields,
> it just keeps ptrace_syscall_info.flags == 0.  Remember, however, that
> by adding the new optional field sizeof(ptrace_syscall_info) grew by 8 bytes.
>
> What we need is to make sure that an older kernel that has no idea of this
> new field would still accept the bigger size, so that userspace would be
> able to continue doing its
> 	ptrace(PTRACE_SET_SYSCALL_INFO, pid, sizeof(info), &info)
> despite the potential growth of sizeof(info) until it actually starts using
> new optional fields.

This is clear, but personally I don't really like this pattern... Consider

	void set_syscall_info(int unlikely_condition)
	{
		struct ptrace_syscall_info info;

		fill_info(&info);
		if (unlikely_condition) {
			info.flags = USE_ARTIFICIAL;
			info.artificial = 1;
		}

		assert(ptrace(PTRACE_SET_SYSCALL_INFO, sizeof(info), &info) == 0);
	}

Now this application (running on the older kernel) can fail or not, depending
on "unlikely_condition". To me it would be better to always fail in this case.

That is why I tried to suggest to use "user_size" as a version number.
Currently we have PTRACE_SYSCALL_INFO_SIZE_VER0, when we add the new
"artificial" member we will have PTRACE_SYSCALL_INFO_SIZE_VER1. Granted,
this way set_syscall_info() can't use sizeof(info), it should do

	ptrace(PTRACE_SET_SYSCALL_INFO, PTRACE_SYSCALL_INFO_SIZE_VER1, info);

and the kernel needs more checks, but this is what I had in mind when I said
that the 1st version can just require "user_size == PTRACE_SYSCALL_INFO_SIZE_VER0".

But I won't insist, I do not pretend I understand the user-space needs.

Thanks!

Oleg.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 6/7] ptrace: introduce PTRACE_SET_SYSCALL_INFO request
  2025-01-18 14:13               ` Oleg Nesterov
@ 2025-01-19 12:44                 ` Dmitry V. Levin
  2025-01-20 19:56                   ` Oleg Nesterov
  2025-01-19 14:38                 ` Aleksa Sarai
  1 sibling, 1 reply; 65+ messages in thread
From: Dmitry V. Levin @ 2025-01-19 12:44 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Eugene Syromyatnikov, Mike Frysinger, Renzo Davoli,
	Davide Berardi, strace-devel, linux-kernel, linux-api

On Sat, Jan 18, 2025 at 03:13:42PM +0100, Oleg Nesterov wrote:
> On 01/17, Dmitry V. Levin wrote:
[...]
> > For example, on x86_64 sizeof(struct ptrace_syscall_info) is currently 88,
> > while on x86 it is 84.
> 
> Not good, but too late to complain...

Actually, I don't think it's too late to add an extra __u32 padding
there since it wouldn't affect PTRACE_GET_SYSCALL_INFO.

I can add an explicit padding to the structure if you say
you like it better this way.

[...]
> That said... Can't resist,
> 
> > An absolutely artificial example: let's say we're adding an optional
> > 64-bit field "artificial" to ptrace_syscall_info.seccomp, this means
> > sizeof(ptrace_syscall_info) grows by 8 bytes.  When userspace wants
> > to set this optional field, it sets a bit in ptrace_syscall_info.flags,
> > this tells the kernel to look into this new "artificial" field.
> > When userspace is not interested in setting new optional fields,
> > it just keeps ptrace_syscall_info.flags == 0.  Remember, however, that
> > by adding the new optional field sizeof(ptrace_syscall_info) grew by 8 bytes.
> >
> > What we need is to make sure that an older kernel that has no idea of this
> > new field would still accept the bigger size, so that userspace would be
> > able to continue doing its
> > 	ptrace(PTRACE_SET_SYSCALL_INFO, pid, sizeof(info), &info)
> > despite the potential growth of sizeof(info) until it actually starts using
> > new optional fields.
> 
> This is clear, but personally I don't really like this pattern... Consider
> 
> 	void set_syscall_info(int unlikely_condition)
> 	{
> 		struct ptrace_syscall_info info;
> 
> 		fill_info(&info);
> 		if (unlikely_condition) {
> 			info.flags = USE_ARTIFICIAL;
> 			info.artificial = 1;
> 		}
> 
> 		assert(ptrace(PTRACE_SET_SYSCALL_INFO, sizeof(info), &info) == 0);
> 	}
> 
> Now this application (running on the older kernel) can fail or not, depending
> on "unlikely_condition". To me it would be better to always fail in this case.

In practice, user-space programs rarely have the luxury to assume that
some new kernel API is available.  For example, strace still performs a
runtime check for PTRACE_GET_SYSCALL_INFO (introduced more than 5 years
ago) and falls back to pre-PTRACE_GET_SYSCALL_INFO interfaces when the
kernel lacks support.  Consequently, user-space programs would have to
keep track of PTRACE_SET_SYSCALL_INFO interfaces supported by the kernel,
so ...

> That is why I tried to suggest to use "user_size" as a version number.
> Currently we have PTRACE_SYSCALL_INFO_SIZE_VER0, when we add the new
> "artificial" member we will have PTRACE_SYSCALL_INFO_SIZE_VER1. Granted,
> this way set_syscall_info() can't use sizeof(info), it should do
> 
> 	ptrace(PTRACE_SET_SYSCALL_INFO, PTRACE_SYSCALL_INFO_SIZE_VER1, info);
> 
> and the kernel needs more checks, but this is what I had in mind when I said
> that the 1st version can just require "user_size == PTRACE_SYSCALL_INFO_SIZE_VER0".

... it wouldn't be a big deal for user-space to specify also an
appropriate "user_size", e.g. PTRACE_SYSCALL_INFO_SIZE_VER1 when it starts
using the interface available since VER1, but it wouldn't help user-space
programs either as they would have to update "op" and/or "flags" anyway,
and "user_size" would become just yet another detail they have to care
about.

At the same time, "flags" is needed anyway because the most likely
extension of PTRACE_SET_SYSCALL_INFO would be support of setting some
fields that are present in the structure already, e.g.
instruction_pointer.


-- 
ldv

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 6/7] ptrace: introduce PTRACE_SET_SYSCALL_INFO request
  2025-01-18 14:13               ` Oleg Nesterov
  2025-01-19 12:44                 ` Dmitry V. Levin
@ 2025-01-19 14:38                 ` Aleksa Sarai
  1 sibling, 0 replies; 65+ messages in thread
From: Aleksa Sarai @ 2025-01-19 14:38 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Dmitry V. Levin, Eugene Syromyatnikov, Mike Frysinger,
	Renzo Davoli, Davide Berardi, strace-devel, linux-kernel,
	linux-api

[-- Attachment #1: Type: text/plain, Size: 4362 bytes --]

On 2025-01-18, Oleg Nesterov <oleg@redhat.com> wrote:
> On 01/17, Dmitry V. Levin wrote:
> >
> 
> (reordered)
> 
> > struct ptrace_syscall_info has members of type __u64, and it currently
> > ends with "__u32 ret_data".  So depending on the alignment, the structure
> > either has extra 4 trailing padding bytes, or it doesn't.
> 
> Ah, I didn't realize that the last member is __u32, so I completely
> misunderstood your "it depends on the alignment of __u64" note.
> 
> > For example, on x86_64 sizeof(struct ptrace_syscall_info) is currently 88,
> > while on x86 it is 84.
> 
> Not good, but too late to complain...
> 
> OK, I see your point now and I won't argue with the approach you outlined in
> your previous email:
> 
>         size_t min_size = offsetofend(struct ptrace_syscall_info, seccomp.ret_data);
>         size_t copy_size = min(sizeof(info), user_size);
> 
>         if (copy_size < min_size)
>                 return -EINVAL;
> 
>         if (copy_from_user(&info, datavp, copy_size))
>                 return -EFAULT;
> 
> -------------------------------------------------------------------------------
> That said... Can't resist,
> 
> > An absolutely artificial example: let's say we're adding an optional
> > 64-bit field "artificial" to ptrace_syscall_info.seccomp, this means
> > sizeof(ptrace_syscall_info) grows by 8 bytes.  When userspace wants
> > to set this optional field, it sets a bit in ptrace_syscall_info.flags,
> > this tells the kernel to look into this new "artificial" field.
> > When userspace is not interested in setting new optional fields,
> > it just keeps ptrace_syscall_info.flags == 0.  Remember, however, that
> > by adding the new optional field sizeof(ptrace_syscall_info) grew by 8 bytes.
> >
> > What we need is to make sure that an older kernel that has no idea of this
> > new field would still accept the bigger size, so that userspace would be
> > able to continue doing its
> > 	ptrace(PTRACE_SET_SYSCALL_INFO, pid, sizeof(info), &info)
> > despite the potential growth of sizeof(info) until it actually starts using
> > new optional fields.
> 
> This is clear, but personally I don't really like this pattern... Consider
> 
> 	void set_syscall_info(int unlikely_condition)
> 	{
> 		struct ptrace_syscall_info info;
> 
> 		fill_info(&info);
> 		if (unlikely_condition) {
> 			info.flags = USE_ARTIFICIAL;
> 			info.artificial = 1;
> 		}
> 
> 		assert(ptrace(PTRACE_SET_SYSCALL_INFO, sizeof(info), &info) == 0);
> 	}
> 
> Now this application (running on the older kernel) can fail or not, depending
> on "unlikely_condition". To me it would be better to always fail in this case.
> 
> That is why I tried to suggest to use "user_size" as a version number.
> Currently we have PTRACE_SYSCALL_INFO_SIZE_VER0, when we add the new
> "artificial" member we will have PTRACE_SYSCALL_INFO_SIZE_VER1. Granted,
> this way set_syscall_info() can't use sizeof(info), it should do

user_size *is* a version number, it's just that copy_struct_from_user()
allows programs built with newer headers to run on older kernels (if
they don't use the new features). The alternative is that programs that
build with a newer set of kernel headers will implicitly have a larger
ptrace_syscall_info struct, which will cause them to start failing after
the binary is rebuilt.

*Strictly speaking* this wouldn't be a kernel regression (because it's a
new binary, the old binary would still work), but the risk of these
kinds of APIs being incredibly fragile is the reason why I went with the
check_zeroed_user() approach in copy_struct_from_user().

(I haven't looked at the details of this patchset, this is just a
general comment about copy_struct_from_user() and why this feature is
useful to userspace programs. Not all APIs need the extensibility of
copy_struct_from_user().)
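
To make the size handling concrete, here is a rough user-space model of the
semantics described above. It is a sketch, not the kernel code: the name
model_copy_struct() is invented, plain memcpy() replaces copy_from_user(),
and the byte loop stands in for check_zeroed_user().

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy a user structure of usize bytes into a kernel structure of ksize
 * bytes, treating the size as an extensible version number.
 */
static int model_copy_struct(void *dst, size_t ksize,
			     const void *src, size_t usize)
{
	const unsigned char *p = src;
	size_t size = usize < ksize ? usize : ksize;
	size_t i;

	if (usize > ksize) {
		/*
		 * Newer userspace, older kernel: accept the larger struct
		 * only if every byte the kernel does not know is zero.
		 */
		for (i = ksize; i < usize; i++)
			if (p[i] != 0)
				return -E2BIG;
	} else if (usize < ksize) {
		/* Older userspace, newer kernel: unknown tail reads as zero. */
		memset((unsigned char *)dst + size, 0, ksize - size);
	}

	memcpy(dst, src, size);
	return 0;
}
```

In this model a binary rebuilt against newer headers keeps working on an
older kernel as long as the fields it gained stay zero, and it fails with
-E2BIG only once it actually sets one of them.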

> 	ptrace(PTRACE_SET_SYSCALL_INFO, PTRACE_SYSCALL_INFO_SIZE_VER1, info);
> 
> and the kernel needs more checks, but this is what I had in mind when I said
> that the 1st version can just require "user_size == PTRACE_SYSCALL_INFO_SIZE_VER0".
> 
> But I won't insist, I do not pretend I understand the user-space needs.
> 
> Thanks!
> 
> Oleg.
> 
> 

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 1/7] powerpc: properly negate error in syscall_set_return_value()
  2025-01-14 17:04     ` Dmitry V. Levin
@ 2025-01-20 13:51       ` Christophe Leroy
  2025-01-20 17:12         ` Dmitry V. Levin
  2025-01-23 18:28         ` Dmitry V. Levin
  0 siblings, 2 replies; 65+ messages in thread
From: Christophe Leroy @ 2025-01-20 13:51 UTC (permalink / raw)
  To: Dmitry V. Levin
  Cc: Alexey Gladkov, Oleg Nesterov, Michael Ellerman,
	Eugene Syromyatnikov, Mike Frysinger, Renzo Davoli,
	Davide Berardi, strace-devel, Madhavan Srinivasan,
	Nicholas Piggin, Naveen N Rao, linuxppc-dev, linux-kernel



Le 14/01/2025 à 18:04, Dmitry V. Levin a écrit :
> On Mon, Jan 13, 2025 at 06:34:44PM +0100, Christophe Leroy wrote:
>> Le 13/01/2025 à 18:10, Dmitry V. Levin a écrit :
>>> Bring syscall_set_return_value() in sync with syscall_get_error(),
>>> and let upcoming ptrace/set_syscall_info selftest pass on powerpc.
>>>
>>> This reverts commit 1b1a3702a65c ("powerpc: Don't negate error in
>>> syscall_set_return_value()").
>>
>> There is a clear detailed explanation in that commit of why it needs to
>> be done.
>>
>> If you think that commit is wrong you have to explain why with at least
>> the same level of details.
> 
> OK, please have a look whether this explanation is clear and detailed enough:
> 
> =======
> powerpc: properly negate error in syscall_set_return_value()
> 
> When syscall_set_return_value() is used to set an error code, the caller
> specifies it as a negative value in -ERRORCODE form.
> 
> In !trap_is_scv case the error code is traditionally stored as follows:
> gpr[3] contains a positive ERRORCODE, and ccr has 0x10000000 flag set.
> Here are a few examples to illustrate this convention.  The first one
> is from syscall_get_error():
>          /*
>           * If the system call failed,
>           * regs->gpr[3] contains a positive ERRORCODE.
>           */
>          return (regs->ccr & 0x10000000UL) ? -regs->gpr[3] : 0;
> 
> The second example is from regs_return_value():
>          if (is_syscall_success(regs))
>                  return regs->gpr[3];
>          else
>                  return -regs->gpr[3];
> 
> The third example is from check_syscall_restart():
>          regs->result = -EINTR;
>          regs->gpr[3] = EINTR;
>          regs->ccr |= 0x10000000;
> 
> Compared with these examples, the failure of syscall_set_return_value()
> to assign a positive ERRORCODE into regs->gpr[3] is clearly visible:
> 	/*
> 	 * In the general case it's not obvious that we must deal with
> 	 * CCR here, as the syscall exit path will also do that for us.
> 	 * However there are some places, eg. the signal code, which
> 	 * check ccr to decide if the value in r3 is actually an error.
> 	 */
> 	if (error) {
> 		regs->ccr |= 0x10000000L;
> 		regs->gpr[3] = error;
> 	} else {
> 		regs->ccr &= ~0x10000000L;
> 		regs->gpr[3] = val;
> 	}
> 
> This fix brings syscall_set_return_value() in sync with syscall_get_error()
> and lets upcoming ptrace/set_syscall_info selftest pass on powerpc.
> 
> Fixes: 1b1a3702a65c ("powerpc: Don't negate error in syscall_set_return_value()").
> =======
> 
> 

I think there is still something going wrong.

do_seccomp() sets regs->gpr[3] = -ENOSYS; by default.

Then it calls __secure_computing() which returns what __seccomp_filter() 
returns.

In case of error, __seccomp_filter() calls syscall_set_return_value() 
with a negative value then returns -1

do_seccomp() is called by do_syscall_trace_enter() which returns -1 when 
do_seccomp() doesn't return 0.

do_syscall_trace_enter() is called by system_call_exception() and 
returns -1, so system_call_exception() returns regs->gpr[3]

In entry_32.S, transfer_to_syscall, syscall_exit_prepare() is then 
called with the return of system_call_exception() as first parameter, which 
leads to:

	if (unlikely(r3 >= (unsigned long)-MAX_ERRNO) && is_not_scv) {
		if (likely(!(ti_flags & (_TIF_NOERROR | _TIF_RESTOREALL)))) {
			r3 = -r3;
			regs->ccr |= 0x10000000; /* Set SO bit in CR */
		}
	}

By chance, because you have already changed the sign of gpr[3], the 
above test fails and nothing is done to r3, and because you have also 
already set regs->ccr it works.

But all this looks inconsistent with the fact that do_seccomp sets 
-ENOSYS as default value

Also, when do_seccomp() returns 0, do_syscall_trace_enter() checks the 
syscall number and when it is wrong it goes to skip: which sets 
regs->gpr[3] = -ENOSYS;

So really I think it is not in line with your changes to set positive 
value in gpr[3].

Maybe your change is still correct but it needs to be handled completely 
in that case.

Christophe

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 1/7] powerpc: properly negate error in syscall_set_return_value()
  2025-01-20 13:51       ` Christophe Leroy
@ 2025-01-20 17:12         ` Dmitry V. Levin
  2025-01-21 11:13           ` Madhavan Srinivasan
  2025-01-23 18:28         ` Dmitry V. Levin
  1 sibling, 1 reply; 65+ messages in thread
From: Dmitry V. Levin @ 2025-01-20 17:12 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Alexey Gladkov, Oleg Nesterov, Michael Ellerman,
	Eugene Syromyatnikov, Mike Frysinger, Renzo Davoli,
	Davide Berardi, strace-devel, Madhavan Srinivasan,
	Nicholas Piggin, Naveen N Rao, linuxppc-dev, linux-kernel

On Mon, Jan 20, 2025 at 02:51:38PM +0100, Christophe Leroy wrote:
> Le 14/01/2025 à 18:04, Dmitry V. Levin a écrit :
> > On Mon, Jan 13, 2025 at 06:34:44PM +0100, Christophe Leroy wrote:
> >> Le 13/01/2025 à 18:10, Dmitry V. Levin a écrit :
> >>> Bring syscall_set_return_value() in sync with syscall_get_error(),
> >>> and let upcoming ptrace/set_syscall_info selftest pass on powerpc.
> >>>
> >>> This reverts commit 1b1a3702a65c ("powerpc: Don't negate error in
> >>> syscall_set_return_value()").
> >>
> >> There is a clear detailed explanation in that commit of why it needs to
> >> be done.
> >>
> >> If you think that commit is wrong you have to explain why with at least
> >> the same level of details.
> > 
> > OK, please have a look whether this explanation is clear and detailed enough:
> > 
> > =======
> > powerpc: properly negate error in syscall_set_return_value()
> > 
> > When syscall_set_return_value() is used to set an error code, the caller
> > specifies it as a negative value in -ERRORCODE form.
> > 
> > In !trap_is_scv case the error code is traditionally stored as follows:
> > gpr[3] contains a positive ERRORCODE, and ccr has 0x10000000 flag set.
> > Here are a few examples to illustrate this convention.  The first one
> > is from syscall_get_error():
> >          /*
> >           * If the system call failed,
> >           * regs->gpr[3] contains a positive ERRORCODE.
> >           */
> >          return (regs->ccr & 0x10000000UL) ? -regs->gpr[3] : 0;
> > 
> > The second example is from regs_return_value():
> >          if (is_syscall_success(regs))
> >                  return regs->gpr[3];
> >          else
> >                  return -regs->gpr[3];
> > 
> > The third example is from check_syscall_restart():
> >          regs->result = -EINTR;
> >          regs->gpr[3] = EINTR;
> >          regs->ccr |= 0x10000000;
> > 
> > Compared with these examples, the failure of syscall_set_return_value()
> > to assign a positive ERRORCODE into regs->gpr[3] is clearly visible:
> > 	/*
> > 	 * In the general case it's not obvious that we must deal with
> > 	 * CCR here, as the syscall exit path will also do that for us.
> > 	 * However there are some places, eg. the signal code, which
> > 	 * check ccr to decide if the value in r3 is actually an error.
> > 	 */
> > 	if (error) {
> > 		regs->ccr |= 0x10000000L;
> > 		regs->gpr[3] = error;
> > 	} else {
> > 		regs->ccr &= ~0x10000000L;
> > 		regs->gpr[3] = val;
> > 	}
> > 
> > This fix brings syscall_set_return_value() in sync with syscall_get_error()
> > and lets upcoming ptrace/set_syscall_info selftest pass on powerpc.
> > 
> > Fixes: 1b1a3702a65c ("powerpc: Don't negate error in syscall_set_return_value()").
> > =======
> > 
> > 
> 
> I think there is still something going wrong.
> 
> do_seccomp() sets regs->gpr[3] = -ENOSYS; by default.
> 
> Then it calls __secure_computing() which returns what __seccomp_filter() 
> returns.
> 
> In case of error, __seccomp_filter() calls syscall_set_return_value() 
> with a negative value then returns -1
> 
> do_seccomp() is called by do_syscall_trace_enter() which returns -1 when 
> do_seccomp() doesn't return 0.
> 
> do_syscall_trace_enter() is called by system_call_exception() and 
> returns -1, so system_call_exception() returns regs->gpr[3]
> 
> In entry_32.S, transfer_to_syscall, syscall_exit_prepare() is then 
> called with the return of system_call_exception() as first parameter, which 
> leads to:
> 
> 	if (unlikely(r3 >= (unsigned long)-MAX_ERRNO) && is_not_scv) {
> 		if (likely(!(ti_flags & (_TIF_NOERROR | _TIF_RESTOREALL)))) {
> 			r3 = -r3;
> 			regs->ccr |= 0x10000000; /* Set SO bit in CR */
> 		}
> 	}

Note the "unlikely" keyword here reminding us once more that in !scv case
regs->gpr[3] does not normally have -ERRORCODE form.

> By chance, because you have already changed the sign of gpr[3], the 
> above test fails and nothing is done to r3, and because you have also 
> already set regs->ccr it works.
> 
> But all this looks inconsistent with the fact that do_seccomp sets 
> -ENOSYS as default value
> 
> Also, when do_seccomp() returns 0, do_syscall_trace_enter() checks the 
> syscall number and when it is wrong it goes to skip: which sets 
> regs->gpr[3] = -ENOSYS;

It looks like do_seccomp() and do_syscall_trace_enter() get away by sheer
luck, implicitly relying on syscall_exit_prepare() transparently fixing
regs->gpr[3] for them.
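
The interaction can be modelled in a few lines of user-space C. This is a
simplified sketch under stated assumptions: struct model_regs and the
model_*() helpers are hypothetical stand-ins, only the !scv case is covered,
and the ti_flags checks from the snippet quoted above are left out.

```c
#include <errno.h>

#define MODEL_MAX_ERRNO	4095
#define MODEL_CR0_SO	0x10000000UL	/* SO bit in CR, as in the thread */

struct model_regs {
	unsigned long gpr3;
	unsigned long ccr;
};

/* Patched behaviour: store a positive ERRORCODE and set the SO bit. */
static void model_set_return_value(struct model_regs *regs, int error, long val)
{
	if (error) {
		regs->ccr |= MODEL_CR0_SO;
		regs->gpr3 = -error;
	} else {
		regs->ccr &= ~MODEL_CR0_SO;
		regs->gpr3 = val;
	}
}

/* Mirrors syscall_get_error(): negate gpr3 when the SO bit is set. */
static long model_get_error(const struct model_regs *regs)
{
	return (regs->ccr & MODEL_CR0_SO) ? -(long)regs->gpr3 : 0;
}

/* Exit path: convert a raw -ERRORCODE in r3 into the positive form. */
static void model_exit_prepare(struct model_regs *regs, unsigned long r3)
{
	if (r3 >= (unsigned long)-MODEL_MAX_ERRNO) {
		r3 = -r3;
		regs->ccr |= MODEL_CR0_SO;
	}
	regs->gpr3 = r3;
}
```

In this model, a value stored by model_set_return_value() passes through
model_exit_prepare() unchanged, whereas a raw -ENOSYS written straight into
gpr3 is only brought into the positive form by the exit path.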

> So really I think it is not in line with your changes to set positive 
> value in gpr[3].
> 
> Maybe your change is still correct but it needs to be handled completely 
> in that case.

By the way, is there any reason why do_seccomp() and
do_syscall_trace_enter() don't use syscall_set_return_value() yet?


-- 
ldv

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 6/7] ptrace: introduce PTRACE_SET_SYSCALL_INFO request
  2025-01-19 12:44                 ` Dmitry V. Levin
@ 2025-01-20 19:56                   ` Oleg Nesterov
  0 siblings, 0 replies; 65+ messages in thread
From: Oleg Nesterov @ 2025-01-20 19:56 UTC (permalink / raw)
  To: Dmitry V. Levin
  Cc: Eugene Syromyatnikov, Mike Frysinger, Renzo Davoli,
	Davide Berardi, strace-devel, linux-kernel, linux-api

On 01/19, Dmitry V. Levin wrote:
>
> On Sat, Jan 18, 2025 at 03:13:42PM +0100, Oleg Nesterov wrote:
> > On 01/17, Dmitry V. Levin wrote:
> [...]
> > > For example, on x86_64 sizeof(struct ptrace_syscall_info) is currently 88,
> > > while on x86 it is 84.
> >
> > Not good, but too late to complain...
>
> Actually, I don't think it's too late to add an extra __u32 padding
> there since it wouldn't affect PTRACE_GET_SYSCALL_INFO.

Hmm, indeed, thanks for correcting me. I forgot that ptrace_get_syscall_info()
returns actual_size, not sizeof().

> I can add an explicit padding to the structure if you say
> you like it better this way.

I dunno, up to you...

Well if we add "__u32 padding" at the end, we can probably use sizeof(info)
instead of min_size = offsetofend(struct ptrace_syscall_info, seccomp.ret_data)
in ptrace_set_syscall_info(), but then it probably makes sense to check
info->padding == 0 (just like info.flags || info.reserved) and rename this
member to reserved2.

Again, up to you, I don't know.

> > Currently we have PTRACE_SYSCALL_INFO_SIZE_VER0, when we add the new
> > "artificial" member we will have PTRACE_SYSCALL_INFO_SIZE_VER1. Granted,
> > this way set_syscall_info() can't use sizeof(info), it should do
> >
> > 	ptrace(PTRACE_SET_SYSCALL_INFO, PTRACE_SYSCALL_INFO_SIZE_VER1, info);
> >
> > and the kernel needs more checks, but this is what I had in mind when I said
> > that the 1st version can just require "user_size == PTRACE_SYSCALL_INFO_SIZE_VER0".
>
> ... it wouldn't be a big deal for user-space to specify also an
> appropriate "user_size", e.g. PTRACE_SYSCALL_INFO_SIZE_VER1 when it starts
> using the interface available since VER1, but it wouldn't help user-space
> programs either as they would have to update "op" and/or "flags" anyway,

Sure, and yes, "flags" is needed anyway.

> and "user_size" would become just yet another detail they have to care
> about.

True.

It is not that I ever thought that my suggestion could "help user-space".
Not at all. Just imo it would be better to fail "early" on the older kernel
in the case when user-space expects the "extended" API, even if flags == 0.
And no, it is not that I am 100% sure it would be always better.

So let me repeat: please do what you think is right, I won't argue. I just
tried to understand your points and explain mine to ensure we more or less
understand each other.

Oleg.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 1/7] powerpc: properly negate error in syscall_set_return_value()
  2025-01-20 17:12         ` Dmitry V. Levin
@ 2025-01-21 11:13           ` Madhavan Srinivasan
  2025-01-21 11:28             ` Christophe Leroy
  0 siblings, 1 reply; 65+ messages in thread
From: Madhavan Srinivasan @ 2025-01-21 11:13 UTC (permalink / raw)
  To: Dmitry V. Levin, Christophe Leroy
  Cc: Alexey Gladkov, Oleg Nesterov, Michael Ellerman,
	Eugene Syromyatnikov, Mike Frysinger, Renzo Davoli,
	Davide Berardi, strace-devel, Nicholas Piggin, Naveen N Rao,
	linuxppc-dev, linux-kernel



On 1/20/25 10:42 PM, Dmitry V. Levin wrote:
> On Mon, Jan 20, 2025 at 02:51:38PM +0100, Christophe Leroy wrote:
>> Le 14/01/2025 à 18:04, Dmitry V. Levin a écrit :
>>> On Mon, Jan 13, 2025 at 06:34:44PM +0100, Christophe Leroy wrote:
>>>> Le 13/01/2025 à 18:10, Dmitry V. Levin a écrit :
>>>>> Bring syscall_set_return_value() in sync with syscall_get_error(),
>>>>> and let upcoming ptrace/set_syscall_info selftest pass on powerpc.
>>>>>

Sorry for getting to this thread late.

Tried the series without this patch on

1) a power9 PowerNV system and a power10 pSeries LPAR

# ./set_syscall_info
TAP version 13
1..1
# Starting 1 tests from 1 test cases.
#  RUN           global.set_syscall_info ...
#            OK  global.set_syscall_info
ok 1 global.set_syscall_info
# PASSED: 1 / 1 tests passed.
# Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0

and in both cases set_syscall_info passes.
Will look at it further.

Maddy

>>>>> This reverts commit 1b1a3702a65c ("powerpc: Don't negate error in
>>>>> syscall_set_return_value()").
>>>>
>>>> There is a clear detailed explanation in that commit of why it needs to
>>>> be done.
>>>>
>>>> If you think that commit is wrong you have to explain why with at least
>>>> the same level of details.
>>>
>>> OK, please have a look whether this explanation is clear and detailed enough:
>>>
>>> =======
>>> powerpc: properly negate error in syscall_set_return_value()
>>>
>>> When syscall_set_return_value() is used to set an error code, the caller
>>> specifies it as a negative value in -ERRORCODE form.
>>>
>>> In !trap_is_scv case the error code is traditionally stored as follows:
>>> gpr[3] contains a positive ERRORCODE, and ccr has 0x10000000 flag set.
>>> Here are a few examples to illustrate this convention.  The first one
>>> is from syscall_get_error():
>>>          /*
>>>           * If the system call failed,
>>>           * regs->gpr[3] contains a positive ERRORCODE.
>>>           */
>>>          return (regs->ccr & 0x10000000UL) ? -regs->gpr[3] : 0;
>>>
>>> The second example is from regs_return_value():
>>>          if (is_syscall_success(regs))
>>>                  return regs->gpr[3];
>>>          else
>>>                  return -regs->gpr[3];
>>>
>>> The third example is from check_syscall_restart():
>>>          regs->result = -EINTR;
>>>          regs->gpr[3] = EINTR;
>>>          regs->ccr |= 0x10000000;
>>>
>>> Compared with these examples, the failure of syscall_set_return_value()
>>> to assign a positive ERRORCODE into regs->gpr[3] is clearly visible:
>>> 	/*
>>> 	 * In the general case it's not obvious that we must deal with
>>> 	 * CCR here, as the syscall exit path will also do that for us.
>>> 	 * However there are some places, eg. the signal code, which
>>> 	 * check ccr to decide if the value in r3 is actually an error.
>>> 	 */
>>> 	if (error) {
>>> 		regs->ccr |= 0x10000000L;
>>> 		regs->gpr[3] = error;
>>> 	} else {
>>> 		regs->ccr &= ~0x10000000L;
>>> 		regs->gpr[3] = val;
>>> 	}
>>>
>>> This fix brings syscall_set_return_value() in sync with syscall_get_error()
>>> and lets upcoming ptrace/set_syscall_info selftest pass on powerpc.
>>>
>>> Fixes: 1b1a3702a65c ("powerpc: Don't negate error in syscall_set_return_value()").
>>> =======
>>>
>>>
>>
>> I think there is still something going wrong.
>>
>> do_seccomp() sets regs->gpr[3] = -ENOSYS; by default.
>>
>> Then it calls __secure_computing() which returns what __seccomp_filter() 
>> returns.
>>
>> In case of error, __seccomp_filter() calls syscall_set_return_value() 
>> with a negative value then returns -1
>>
>> do_seccomp() is called by do_syscall_trace_enter() which returns -1 when 
>> do_seccomp() doesn't return 0.
>>
>> do_syscall_trace_enter() is called by system_call_exception() and 
>> returns -1, so system_call_exception() returns regs->gpr[3]
>>
>> In entry_32.S, transfer_to_syscall, syscall_exit_prepare() is then 
>> called with the return of system_call_exception() as first parameter, which 
>> leads to:
>>
>> 	if (unlikely(r3 >= (unsigned long)-MAX_ERRNO) && is_not_scv) {
>> 		if (likely(!(ti_flags & (_TIF_NOERROR | _TIF_RESTOREALL)))) {
>> 			r3 = -r3;
>> 			regs->ccr |= 0x10000000; /* Set SO bit in CR */
>> 		}
>> 	}
> 
> Note the "unlikely" keyword here reminding us once more that in !scv case
> regs->gpr[3] does not normally have -ERRORCODE form.
> 
>> By chance, because you have already changed the sign of gpr[3], the 
>> above test fails and nothing is done to r3, and because you have also 
>> already set regs->ccr it works.
>>
>> But all this looks inconsistent with the fact that do_seccomp sets 
>> -ENOSYS as default value
>>
>> Also, when do_seccomp() returns 0, do_syscall_trace_enter() check the 
>> syscall number and when it is wrong it goes to skip: which sets 
>> regs->gpr[3] = -ENOSYS;
> 
> It looks like do_seccomp() and do_syscall_trace_enter() get away by sheer
> luck, implicitly relying on syscall_exit_prepare() transparently fixing
> regs->gpr[3] for them.
> 
>> So really I think it is not in line with your changes to set positive 
>> value in gpr[3].
>>
>> Maybe your change is still correct but it needs to be handled completely 
>> in that case.
> 
> By the way, is there any reasons why do_seccomp() and
> do_syscall_trace_enter() don't use syscall_set_return_value() yet?
> 
> 


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 1/7] powerpc: properly negate error in syscall_set_return_value()
  2025-01-21 11:13           ` Madhavan Srinivasan
@ 2025-01-21 11:28             ` Christophe Leroy
  2025-01-21 12:25               ` Madhavan Srinivasan
  0 siblings, 1 reply; 65+ messages in thread
From: Christophe Leroy @ 2025-01-21 11:28 UTC (permalink / raw)
  To: Madhavan Srinivasan, Dmitry V. Levin
  Cc: Alexey Gladkov, Oleg Nesterov, Michael Ellerman,
	Eugene Syromyatnikov, Mike Frysinger, Renzo Davoli,
	Davide Berardi, strace-devel, Nicholas Piggin, Naveen N Rao,
	linuxppc-dev, linux-kernel



Le 21/01/2025 à 12:13, Madhavan Srinivasan a écrit :
> 
> 
> On 1/20/25 10:42 PM, Dmitry V. Levin wrote:
>> On Mon, Jan 20, 2025 at 02:51:38PM +0100, Christophe Leroy wrote:
>>> Le 14/01/2025 à 18:04, Dmitry V. Levin a écrit :
>>>> On Mon, Jan 13, 2025 at 06:34:44PM +0100, Christophe Leroy wrote:
>>>>> Le 13/01/2025 à 18:10, Dmitry V. Levin a écrit :
>>>>>> Bring syscall_set_return_value() in sync with syscall_get_error(),
>>>>>> and let upcoming ptrace/set_syscall_info selftest pass on powerpc.
>>>>>>
> 
> Sorry for getting to this thread late.
> 
> Tried the series without this patch in
> 
> 1) power9 PowerNV system and in power10 pSeries lpar
> 
> # ./set_syscall_info
> TAP version 13
> 1..1
> # Starting 1 tests from 1 test cases.
> #  RUN           global.set_syscall_info ...
> #            OK  global.set_syscall_info
> ok 1 global.set_syscall_info
> # PASSED: 1 / 1 tests passed.
> # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
> 
> and in both case set_syscall_info passes.
> Will look at it further.

I guess it works because power9/10 use scv rather than sc for system calls,
hence the new ABI?

Christophe

> 
> Maddy
> 
>>>>>> This reverts commit 1b1a3702a65c ("powerpc: Don't negate error in
>>>>>> syscall_set_return_value()").
>>>>>
>>>>> There is a clear detailed explanation in that commit of why it needs to
>>>>> be done.
>>>>>
>>>>> If you think that commit is wrong you have to explain why with at least
>>>>> the same level of details.
>>>>
>>>> OK, please have a look whether this explanation is clear and detailed enough:
>>>>
>>>> =======
>>>> powerpc: properly negate error in syscall_set_return_value()
>>>>
>>>> When syscall_set_return_value() is used to set an error code, the caller
>>>> specifies it as a negative value in -ERRORCODE form.
>>>>
>>>> In !trap_is_scv case the error code is traditionally stored as follows:
>>>> gpr[3] contains a positive ERRORCODE, and ccr has 0x10000000 flag set.
>>>> Here are a few examples to illustrate this convention.  The first one
>>>> is from syscall_get_error():
>>>>           /*
>>>>            * If the system call failed,
>>>>            * regs->gpr[3] contains a positive ERRORCODE.
>>>>            */
>>>>           return (regs->ccr & 0x10000000UL) ? -regs->gpr[3] : 0;
>>>>
>>>> The second example is from regs_return_value():
>>>>           if (is_syscall_success(regs))
>>>>                   return regs->gpr[3];
>>>>           else
>>>>                   return -regs->gpr[3];
>>>>
>>>> The third example is from check_syscall_restart():
>>>>           regs->result = -EINTR;
>>>>           regs->gpr[3] = EINTR;
>>>>           regs->ccr |= 0x10000000;
>>>>
>>>> Compared with these examples, the failure of syscall_set_return_value()
>>>> to assign a positive ERRORCODE into regs->gpr[3] is clearly visible:
>>>> 	/*
>>>> 	 * In the general case it's not obvious that we must deal with
>>>> 	 * CCR here, as the syscall exit path will also do that for us.
>>>> 	 * However there are some places, eg. the signal code, which
>>>> 	 * check ccr to decide if the value in r3 is actually an error.
>>>> 	 */
>>>> 	if (error) {
>>>> 		regs->ccr |= 0x10000000L;
>>>> 		regs->gpr[3] = error;
>>>> 	} else {
>>>> 		regs->ccr &= ~0x10000000L;
>>>> 		regs->gpr[3] = val;
>>>> 	}
>>>>
>>>> This fix brings syscall_set_return_value() in sync with syscall_get_error()
>>>> and lets upcoming ptrace/set_syscall_info selftest pass on powerpc.
>>>>
>>>> Fixes: 1b1a3702a65c ("powerpc: Don't negate error in syscall_set_return_value()").
>>>> =======
>>>>
>>>>
>>>
>>> I think there is still something going wrong.
>>>
>>> do_seccomp() sets regs->gpr[3] = -ENOSYS; by default.
>>>
>>> Then it calls __secure_computing() which returns what __seccomp_filter()
>>> returns.
>>>
>>> In case of error, __seccomp_filter() calls syscall_set_return_value()
>>> with a negative value then returns -1
>>>
>>> do_seccomp() is called by do_syscall_trace_enter() which returns -1 when
>>> do_seccomp() doesn't return 0.
>>>
>>> do_syscall_trace_enter() is called by system_call_exception() and
>>> returns -1, so syscall_exception() returns regs->gpr[3]
>>>
>>> In entry_32.S, transfer_to_syscall, syscall_exit_prepare() is then
>>> called with the return of syscall_exception() as first parameter, which
>>> leads to:
>>>
>>> 	if (unlikely(r3 >= (unsigned long)-MAX_ERRNO) && is_not_scv) {
>>> 		if (likely(!(ti_flags & (_TIF_NOERROR | _TIF_RESTOREALL)))) {
>>> 			r3 = -r3;
>>> 			regs->ccr |= 0x10000000; /* Set SO bit in CR */
>>> 		}
>>> 	}
>>
>> Note the "unlikely" keyword here reminding us once more that in !scv case
>> regs->gpr[3] does not normally have -ERRORCODE form.
>>
>>> By chance, because you have already changed the sign of gpr[3], the
>>> above test fails and nothing is done to r3, and because you have also
>>> already set regs->ccr it works.
>>>
>>> But all this looks inconsistent with the fact that do_seccomp sets
>>> -ENOSYS as default value
>>>
>>> Also, when do_seccomp() returns 0, do_syscall_trace_enter() check the
>>> syscall number and when it is wrong it goes to skip: which sets
>>> regs->gpr[3] = -ENOSYS;
>>
>> It looks like do_seccomp() and do_syscall_trace_enter() get away by sheer
>> luck, implicitly relying on syscall_exit_prepare() transparently fixing
>> regs->gpr[3] for them.
>>
>>> So really I think it is not in line with your changes to set positive
>>> value in gpr[3].
>>>
>>> Maybe your change is still correct but it needs to be handled completely
>>> in that case.
>>
>> By the way, is there any reasons why do_seccomp() and
>> do_syscall_trace_enter() don't use syscall_set_return_value() yet?
>>
>>
> 


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 1/7] powerpc: properly negate error in syscall_set_return_value()
  2025-01-21 11:28             ` Christophe Leroy
@ 2025-01-21 12:25               ` Madhavan Srinivasan
  2025-01-21 12:42                 ` Dmitry V. Levin
  0 siblings, 1 reply; 65+ messages in thread
From: Madhavan Srinivasan @ 2025-01-21 12:25 UTC (permalink / raw)
  To: Christophe Leroy, Dmitry V. Levin
  Cc: Alexey Gladkov, Oleg Nesterov, Michael Ellerman,
	Eugene Syromyatnikov, Mike Frysinger, Renzo Davoli,
	Davide Berardi, strace-devel, Nicholas Piggin, Naveen N Rao,
	linuxppc-dev, linux-kernel



On 1/21/25 4:58 PM, Christophe Leroy wrote:
> 
> 
> Le 21/01/2025 à 12:13, Madhavan Srinivasan a écrit :
>>
>>
>> On 1/20/25 10:42 PM, Dmitry V. Levin wrote:
>>> On Mon, Jan 20, 2025 at 02:51:38PM +0100, Christophe Leroy wrote:
>>>> Le 14/01/2025 à 18:04, Dmitry V. Levin a écrit :
>>>>> On Mon, Jan 13, 2025 at 06:34:44PM +0100, Christophe Leroy wrote:
>>>>>> Le 13/01/2025 à 18:10, Dmitry V. Levin a écrit :
>>>>>>> Bring syscall_set_return_value() in sync with syscall_get_error(),
>>>>>>> and let upcoming ptrace/set_syscall_info selftest pass on powerpc.
>>>>>>>
>>
>> Sorry for getting to this thread late.
>>
>> Tried the series without this patch in
>>
>> 1) power9 PowerNV system and in power10 pSeries lpar
>>
>> # ./set_syscall_info
>> TAP version 13
>> 1..1
>> # Starting 1 tests from 1 test cases.
>> #  RUN           global.set_syscall_info ...
>> #            OK  global.set_syscall_info
>> ok 1 global.set_syscall_info
>> # PASSED: 1 / 1 tests passed.
>> # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
>>
>> and in both case set_syscall_info passes.
>> Will look at it further.
> 
> I guess it works because power9/10 are using scv not sc for system call, hence using the new ABI ?
> 

Yeah, I guess.
This is from a Power8 pSeries LPAR without this patch:

# ./set_syscall_info 
TAP version 13
1..1
# Starting 1 tests from 1 test cases.
#  RUN           global.set_syscall_info ...
# set_syscall_info.c:428:set_syscall_info:wait #5: unexpected stop signal 11
# set_syscall_info: Test terminated by assertion
#          FAIL  global.set_syscall_info
not ok 1 global.set_syscall_info
# FAILED: 0 / 1 tests passed.
# Totals: pass:0 fail:1 xfail:0 xpass:0 skip:0 error:0

Maddy

> Christophe
> 
>>
>> Maddy
>>
>>>>>>> This reverts commit 1b1a3702a65c ("powerpc: Don't negate error in
>>>>>>> syscall_set_return_value()").
>>>>>>
>>>>>> There is a clear detailed explanation in that commit of why it needs to
>>>>>> be done.
>>>>>>
>>>>>> If you think that commit is wrong you have to explain why with at least
>>>>>> the same level of details.
>>>>>
>>>>> OK, please have a look whether this explanation is clear and detailed enough:
>>>>>
>>>>> =======
>>>>> powerpc: properly negate error in syscall_set_return_value()
>>>>>
>>>>> When syscall_set_return_value() is used to set an error code, the caller
>>>>> specifies it as a negative value in -ERRORCODE form.
>>>>>
>>>>> In !trap_is_scv case the error code is traditionally stored as follows:
>>>>> gpr[3] contains a positive ERRORCODE, and ccr has 0x10000000 flag set.
>>>>> Here are a few examples to illustrate this convention.  The first one
>>>>> is from syscall_get_error():
>>>>>           /*
>>>>>            * If the system call failed,
>>>>>            * regs->gpr[3] contains a positive ERRORCODE.
>>>>>            */
>>>>>           return (regs->ccr & 0x10000000UL) ? -regs->gpr[3] : 0;
>>>>>
>>>>> The second example is from regs_return_value():
>>>>>           if (is_syscall_success(regs))
>>>>>                   return regs->gpr[3];
>>>>>           else
>>>>>                   return -regs->gpr[3];
>>>>>
>>>>> The third example is from check_syscall_restart():
>>>>>           regs->result = -EINTR;
>>>>>           regs->gpr[3] = EINTR;
>>>>>           regs->ccr |= 0x10000000;
>>>>>
>>>>> Compared with these examples, the failure of syscall_set_return_value()
>>>>> to assign a positive ERRORCODE into regs->gpr[3] is clearly visible:
>>>>>     /*
>>>>>      * In the general case it's not obvious that we must deal with
>>>>>      * CCR here, as the syscall exit path will also do that for us.
>>>>>      * However there are some places, eg. the signal code, which
>>>>>      * check ccr to decide if the value in r3 is actually an error.
>>>>>      */
>>>>>     if (error) {
>>>>>         regs->ccr |= 0x10000000L;
>>>>>         regs->gpr[3] = error;
>>>>>     } else {
>>>>>         regs->ccr &= ~0x10000000L;
>>>>>         regs->gpr[3] = val;
>>>>>     }
>>>>>
>>>>> This fix brings syscall_set_return_value() in sync with syscall_get_error()
>>>>> and lets upcoming ptrace/set_syscall_info selftest pass on powerpc.
>>>>>
>>>>> Fixes: 1b1a3702a65c ("powerpc: Don't negate error in syscall_set_return_value()").
>>>>> =======
>>>>>
>>>>>
>>>>
>>>> I think there is still something going wrong.
>>>>
>>>> do_seccomp() sets regs->gpr[3] = -ENOSYS; by default.
>>>>
>>>> Then it calls __secure_computing() which returns what __seccomp_filter()
>>>> returns.
>>>>
>>>> In case of error, __seccomp_filter() calls syscall_set_return_value()
>>>> with a negative value then returns -1
>>>>
>>>> do_seccomp() is called by do_syscall_trace_enter() which returns -1 when
>>>> do_seccomp() doesn't return 0.
>>>>
>>>> do_syscall_trace_enter() is called by system_call_exception() and
>>>> returns -1, so syscall_exception() returns regs->gpr[3]
>>>>
>>>> In entry_32.S, transfer_to_syscall, syscall_exit_prepare() is then
>>>> called with the return of syscall_exception() as first parameter, which
>>>> leads to:
>>>>
>>>>     if (unlikely(r3 >= (unsigned long)-MAX_ERRNO) && is_not_scv) {
>>>>         if (likely(!(ti_flags & (_TIF_NOERROR | _TIF_RESTOREALL)))) {
>>>>             r3 = -r3;
>>>>             regs->ccr |= 0x10000000; /* Set SO bit in CR */
>>>>         }
>>>>     }
>>>
>>> Note the "unlikely" keyword here reminding us once more that in !scv case
>>> regs->gpr[3] does not normally have -ERRORCODE form.
>>>
>>>> By chance, because you have already changed the sign of gpr[3], the
>>>> above test fails and nothing is done to r3, and because you have also
>>>> already set regs->ccr it works.
>>>>
>>>> But all this looks inconsistent with the fact that do_seccomp sets
>>>> -ENOSYS as default value
>>>>
>>>> Also, when do_seccomp() returns 0, do_syscall_trace_enter() check the
>>>> syscall number and when it is wrong it goes to skip: which sets
>>>> regs->gpr[3] = -ENOSYS;
>>>
>>> It looks like do_seccomp() and do_syscall_trace_enter() get away by sheer
>>> luck, implicitly relying on syscall_exit_prepare() transparently fixing
>>> regs->gpr[3] for them.
>>>
>>>> So really I think it is not in line with your changes to set positive
>>>> value in gpr[3].
>>>>
>>>> Maybe your change is still correct but it needs to be handled completely
>>>> in that case.
>>>
>>> By the way, is there any reasons why do_seccomp() and
>>> do_syscall_trace_enter() don't use syscall_set_return_value() yet?
>>>
>>>
>>
> 


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 1/7] powerpc: properly negate error in syscall_set_return_value()
  2025-01-21 12:25               ` Madhavan Srinivasan
@ 2025-01-21 12:42                 ` Dmitry V. Levin
  0 siblings, 0 replies; 65+ messages in thread
From: Dmitry V. Levin @ 2025-01-21 12:42 UTC (permalink / raw)
  To: Madhavan Srinivasan
  Cc: Christophe Leroy, Alexey Gladkov, Oleg Nesterov, Michael Ellerman,
	Eugene Syromyatnikov, Mike Frysinger, Renzo Davoli,
	Davide Berardi, strace-devel, Nicholas Piggin, Naveen N Rao,
	linuxppc-dev, linux-kernel

On Tue, Jan 21, 2025 at 05:55:40PM +0530, Madhavan Srinivasan wrote:
> On 1/21/25 4:58 PM, Christophe Leroy wrote:
> > Le 21/01/2025 à 12:13, Madhavan Srinivasan a écrit :
> >> On 1/20/25 10:42 PM, Dmitry V. Levin wrote:
> >>> On Mon, Jan 20, 2025 at 02:51:38PM +0100, Christophe Leroy wrote:
> >>>> Le 14/01/2025 à 18:04, Dmitry V. Levin a écrit :
> >>>>> On Mon, Jan 13, 2025 at 06:34:44PM +0100, Christophe Leroy wrote:
> >>>>>> Le 13/01/2025 à 18:10, Dmitry V. Levin a écrit :
> >>>>>>> Bring syscall_set_return_value() in sync with syscall_get_error(),
> >>>>>>> and let upcoming ptrace/set_syscall_info selftest pass on powerpc.
> >>
> >> Sorry for getting to this thread late.
> >>
> >> Tried the series without this patch in
> >>
> >> 1) power9 PowerNV system and in power10 pSeries lpar
> >>
> >> # ./set_syscall_info
> >> TAP version 13
> >> 1..1
> >> # Starting 1 tests from 1 test cases.
> >> #  RUN           global.set_syscall_info ...
> >> #            OK  global.set_syscall_info
> >> ok 1 global.set_syscall_info
> >> # PASSED: 1 / 1 tests passed.
> >> # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
> >>
> >> and in both case set_syscall_info passes.
> >> Will look at it further.
> > 
> > I guess it works because power9/10 are using scv not sc for system call, hence using the new ABI ?
> 
> yeah, I guess.
> This is from the a Power8 pSeries lpar without this patch
> 
> # ./set_syscall_info 
> TAP version 13
> 1..1
> # Starting 1 tests from 1 test cases.
> #  RUN           global.set_syscall_info ...
> # set_syscall_info.c:428:set_syscall_info:wait #5: unexpected stop signal 11
> # set_syscall_info: Test terminated by assertion
> #          FAIL  global.set_syscall_info
> not ok 1 global.set_syscall_info
> # FAILED: 0 / 1 tests passed.
> # Totals: pass:0 fail:1 xfail:0 xpass:0 skip:0 error:0

I've enhanced the error diagnostics of the test a bit.  Inspired by this
powerpc bug, in the next iteration of the patchset the test will also
invoke PTRACE_GET_SYSCALL_INFO right after PTRACE_SET_SYSCALL_INFO to
check whether the changes are applied by the kernel correctly.
Without the fix, in the non-scv case the test would complain this way:

# set_syscall_info.c:119:set_syscall_info:Expected exp_exit->rval (-38) == info->exit.rval (38)
# set_syscall_info.c:120:set_syscall_info:wait #4: PTRACE_GET_SYSCALL_INFO #2: exit stop mismatch
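
To make the -38 vs 38 mismatch concrete, here is a minimal user-space
sketch of the set/get round trip in the !trap_is_scv case.  It is only a
model: struct model_regs, MODEL_CR0_SO, the model_* functions and the
"negate" switch are made up for illustration and are not kernel code; the
getter mirrors the syscall_get_error() excerpt quoted in this thread.

```c
#include <assert.h>
#include <errno.h>

#define MODEL_CR0_SO 0x10000000UL	/* the ccr flag discussed in the thread */

/* Hypothetical stand-in for the two pt_regs fields involved. */
struct model_regs {
	unsigned long gpr3;
	unsigned long ccr;
};

/*
 * Model of the setter for the !trap_is_scv case.
 * negate != 0 models the behaviour with the fix, negate == 0 without it.
 */
static void model_set_return_value(struct model_regs *regs, int error,
				   long val, int negate)
{
	assert(error <= 0);	/* callers pass 0 or -ERRORCODE */
	if (error) {
		regs->ccr |= MODEL_CR0_SO;
		regs->gpr3 = negate ? -(long)error : (long)error;
	} else {
		regs->ccr &= ~MODEL_CR0_SO;
		regs->gpr3 = val;
	}
}

/* Model of the getter: gpr3 is expected to hold a positive ERRORCODE. */
static long model_get_error(const struct model_regs *regs)
{
	return (regs->ccr & MODEL_CR0_SO) ? -(long)regs->gpr3 : 0;
}
```

In this model, setting -ENOSYS with negate == 1 reads back as -ENOSYS,
while negate == 0 reads back as +ENOSYS, which is the sign flip the
diagnostics above report.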


-- 
ldv

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 1/7] powerpc: properly negate error in syscall_set_return_value()
  2025-01-20 13:51       ` Christophe Leroy
  2025-01-20 17:12         ` Dmitry V. Levin
@ 2025-01-23 18:28         ` Dmitry V. Levin
  2025-01-23 19:11           ` Eugene Syromyatnikov
                             ` (3 more replies)
  1 sibling, 4 replies; 65+ messages in thread
From: Dmitry V. Levin @ 2025-01-23 18:28 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Alexey Gladkov, Oleg Nesterov, Michael Ellerman,
	Eugene Syromyatnikov, Mike Frysinger, Renzo Davoli,
	Davide Berardi, strace-devel, Madhavan Srinivasan,
	Nicholas Piggin, Naveen N Rao, linuxppc-dev, linux-kernel

On Mon, Jan 20, 2025 at 02:51:38PM +0100, Christophe Leroy wrote:
> Le 14/01/2025 à 18:04, Dmitry V. Levin a écrit :
> > On Mon, Jan 13, 2025 at 06:34:44PM +0100, Christophe Leroy wrote:
> >> Le 13/01/2025 à 18:10, Dmitry V. Levin a écrit :
> >>> Bring syscall_set_return_value() in sync with syscall_get_error(),
> >>> and let upcoming ptrace/set_syscall_info selftest pass on powerpc.
> >>>
> >>> This reverts commit 1b1a3702a65c ("powerpc: Don't negate error in
> >>> syscall_set_return_value()").
> >>
> >> There is a clear detailed explanation in that commit of why it needs to
> >> be done.
> >>
> >> If you think that commit is wrong you have to explain why with at least
> >> the same level of details.
> > 
> > OK, please have a look whether this explanation is clear and detailed enough:
> > 
> > =======
> > powerpc: properly negate error in syscall_set_return_value()
> > 
> > When syscall_set_return_value() is used to set an error code, the caller
> > specifies it as a negative value in -ERRORCODE form.
> > 
> > In !trap_is_scv case the error code is traditionally stored as follows:
> > gpr[3] contains a positive ERRORCODE, and ccr has 0x10000000 flag set.
> > Here are a few examples to illustrate this convention.  The first one
> > is from syscall_get_error():
> >          /*
> >           * If the system call failed,
> >           * regs->gpr[3] contains a positive ERRORCODE.
> >           */
> >          return (regs->ccr & 0x10000000UL) ? -regs->gpr[3] : 0;
> > 
> > The second example is from regs_return_value():
> >          if (is_syscall_success(regs))
> >                  return regs->gpr[3];
> >          else
> >                  return -regs->gpr[3];
> > 
> > The third example is from check_syscall_restart():
> >          regs->result = -EINTR;
> >          regs->gpr[3] = EINTR;
> >          regs->ccr |= 0x10000000;
> > 
> > Compared with these examples, the failure of syscall_set_return_value()
> > to assign a positive ERRORCODE into regs->gpr[3] is clearly visible:
> > 	/*
> > 	 * In the general case it's not obvious that we must deal with
> > 	 * CCR here, as the syscall exit path will also do that for us.
> > 	 * However there are some places, eg. the signal code, which
> > 	 * check ccr to decide if the value in r3 is actually an error.
> > 	 */
> > 	if (error) {
> > 		regs->ccr |= 0x10000000L;
> > 		regs->gpr[3] = error;
> > 	} else {
> > 		regs->ccr &= ~0x10000000L;
> > 		regs->gpr[3] = val;
> > 	}
> > 
> > This fix brings syscall_set_return_value() in sync with syscall_get_error()
> > and lets upcoming ptrace/set_syscall_info selftest pass on powerpc.
> > 
> > Fixes: 1b1a3702a65c ("powerpc: Don't negate error in syscall_set_return_value()").
> > =======
> 
> I think there is still something going wrong.
> 
> do_seccomp() sets regs->gpr[3] = -ENOSYS; by default.
> 
> Then it calls __secure_computing() which returns what __seccomp_filter() 
> returns.
> 
> In case of error, __seccomp_filter() calls syscall_set_return_value() 
> with a negative value then returns -1
> 
> do_seccomp() is called by do_syscall_trace_enter() which returns -1 when 
> do_seccomp() doesn't return 0.
> 
> do_syscall_trace_enter() is called by system_call_exception() and 
> returns -1, so syscall_exception() returns regs->gpr[3]
> 
> In entry_32.S, transfer_to_syscall, syscall_exit_prepare() is then 
> called with the return of syscall_exception() as first parameter, which 
> leads to:
> 
> 	if (unlikely(r3 >= (unsigned long)-MAX_ERRNO) && is_not_scv) {
> 		if (likely(!(ti_flags & (_TIF_NOERROR | _TIF_RESTOREALL)))) {
> 			r3 = -r3;
> 			regs->ccr |= 0x10000000; /* Set SO bit in CR */
> 		}
> 	}
> 
> By chance, because you have already changed the sign of gpr[3], the 
> above test fails and nothing is done to r3, and because you have also 
> already set regs->ccr it works.
> 
> But all this looks inconsistent with the fact that do_seccomp sets 
> -ENOSYS as default value
> 
> Also, when do_seccomp() returns 0, do_syscall_trace_enter() check the 
> syscall number and when it is wrong it goes to skip: which sets 
> regs->gpr[3] = -ENOSYS;
> 
> So really I think it is not in line with your changes to set positive 
> value in gpr[3].
> 
> Maybe your change is still correct but it needs to be handled completely 
> in that case.

Indeed, there is an inconsistency in !trap_is_scv case.

In some places such as syscall_get_error() and regs_return_value() the
semantics is as I described earlier: gpr[3] contains a positive ERRORCODE
and ccr has 0x10000000 flag set.  This semantics is a part of the ABI and
therefore cannot be changed.

In some other places like do_seccomp() and do_syscall_trace_enter() the
semantics is similar to the trap_is_scv case: gpr[3] contains a negative
ERRORCODE and ccr is unchanged.  In addition, system_call_exception()
returns the system call function return value when it is executed, and
gpr[3] otherwise.  The value returned by system_call_exception() is passed
on to syscall_exit_prepare() which performs the conversion you mentioned.

What's remarkable is that the places that are a part of the ABI keep the
traditional semantics, whereas the other places follow the
trap_is_scv-like semantics while still supporting the traditional one.

The only case where I see some intersection is do_seccomp() where the
tracer would be able to see -ENOSYS in gpr[3].  However, the seccomp stop
is not the place where the tracer *reads* the system call exit status,
so whatever was written in gpr[3] before __secure_computing() is not
really relevant.  Consequently, selftests/seccomp/seccomp_bpf passes with
this patch applied as well as without it.

After looking at system_call_exception() I doubt this inconsistency can be
easily avoided, so I don't see how this patch could be enhanced further,
or what else I could do with it besides dropping it and leaving the
!trap_is_scv case unsupported by the PTRACE_SET_SYSCALL_INFO API, which
would be unfortunate.
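
As a rough illustration of the conversion mentioned above, here is a
user-space model of the !scv branch of syscall_exit_prepare() as quoted in
this thread.  The names model_regs, model_exit_prepare and MODEL_* are
made up for the sketch, and the ti_flags check (_TIF_NOERROR,
_TIF_RESTOREALL) is deliberately left out, so this is not the kernel's
actual code.

```c
#include <assert.h>
#include <errno.h>

#define MODEL_MAX_ERRNO	4095UL
#define MODEL_CR0_SO	0x10000000UL

/* Hypothetical stand-in holding only the condition register. */
struct model_regs {
	unsigned long ccr;
};

/*
 * A value in -ERRORCODE form is turned into a positive ERRORCODE with
 * the SO bit set; anything else, including a positive ERRORCODE that
 * already has SO set, passes through unchanged.
 */
static unsigned long model_exit_prepare(unsigned long r3,
					struct model_regs *regs)
{
	assert(regs);
	if (r3 >= (unsigned long)-MODEL_MAX_ERRNO) {
		r3 = -r3;
		regs->ccr |= MODEL_CR0_SO;
	}
	return r3;
}
```

Under these assumptions both conventions end up in the same final state:
-ENOSYS with SO clear and ENOSYS with SO already set both leave r3 equal
to ENOSYS with SO set, which is why the callers using the scv-like form
still produce the traditional result at exit.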


-- 
ldv

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 1/7] powerpc: properly negate error in syscall_set_return_value()
  2025-01-23 18:28         ` Dmitry V. Levin
@ 2025-01-23 19:11           ` Eugene Syromyatnikov
  2025-01-23 22:16             ` Dmitry V. Levin
  2025-01-23 22:07           ` Christophe Leroy
                             ` (2 subsequent siblings)
  3 siblings, 1 reply; 65+ messages in thread
From: Eugene Syromyatnikov @ 2025-01-23 19:11 UTC (permalink / raw)
  To: Dmitry V. Levin
  Cc: Christophe Leroy, Alexey Gladkov, Oleg Nesterov, Michael Ellerman,
	Mike Frysinger, Renzo Davoli, Davide Berardi, strace-devel,
	Madhavan Srinivasan, Nicholas Piggin, Naveen N Rao, linuxppc-dev,
	linux-kernel

On Thu, Jan 23, 2025 at 7:28 PM Dmitry V. Levin <ldv@strace.io> wrote:
> Indeed, there is an inconsistency in !trap_is_scv case.
>
> In some places such as syscall_get_error() and regs_return_value() the
> semantics is as I described earlier: gpr[3] contains a positive ERRORCODE
> and ccr has 0x10000000 flag set.  This semantics is a part of the ABI and
> therefore cannot be changed.
>
> In some other places like do_seccomp() and do_syscall_trace_enter() the
> semantics is similar to the trap_is_scv case: gpr[3] contains a negative
> ERRORCODE and ccr is unchanged.  In addition, system_call_exception()
> returns the system call function return value when it is executed, and
> gpr[3] otherwise.  The value returned by system_call_exception() is passed
> on to syscall_exit_prepare() which performs the conversion you mentioned.
>
> What's remarkable is that in those places that are a part of the ABI the
> traditional semantics is kept, while in other places the implementation
> follows the trap_is_scv-like semantics, while traditional semantics is
> also supported there.
>
> The only case where I see some intersection is do_seccomp() where the
> tracer would be able to see -ENOSYS in gpr[3].  However, the seccomp stop
> is not the place where the tracer *reads* the system call exit status,
> so whatever was written in gpr[3] before __secure_computing() is not
> really relevant, consequently, selftests/seccomp/seccomp_bpf passes with
> this patch applied as well as without it.
>
> After looking at system_call_exception() I doubt this inconsistency can be
> easily avoided, so I don't see how this patch could be enhanced further,
> and what else could I do with the patch besides dropping it and letting
> !trap_is_scv case be unsupported by PTRACE_SET_SYSCALL_INFO API, which
> would be unfortunate.

The semantics of r3 on syscall return (including whether the errno value
is negated) is documented in [1] (at least for the 64-bit case, but I
conjecture the 32-bit one is the same, apart from lacking the v2 ABI and
scv), so I would suggest treating any deviation from that as a kernel
programming error to be fixed.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/arch/powerpc/syscall64-abi.rst?id=v6.13#n30
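
For readers without the document at hand, a small sketch of how a
user-space caller would decode the result under each convention, as I
read [1]: with sc, cr0.SO signals failure and r3 holds the error value;
with scv 0, a return value in -4095..-1 signals failure and the error
value is the negated return value.  The model_* helpers and
MODEL_MAX_ERRNO are hypothetical names, not taken from any libc.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_ERRNO	4095L

/* sc: fold (r3, cr0.SO) into the usual "value or -errno" form. */
static long model_decode_sc(long r3, bool cr0_so)
{
	return cr0_so ? -r3 : r3;
}

/* scv 0: failure is encoded in r3 alone. */
static bool model_scv_failed(long r3)
{
	return r3 < 0 && r3 >= -MODEL_MAX_ERRNO;
}

/* scv 0: error value of a failed call. */
static long model_scv_errno(long r3)
{
	assert(model_scv_failed(r3));
	return -r3;
}
```

So ENOSYS would be reported as r3 == 38 with cr0.SO set under sc, and as
r3 == -38 under scv 0.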

-- 
Eugene Syromyatnikov
mailto:evgsyr@gmail.com
xmpp:esyr@jabber.{ru|org}

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 1/7] powerpc: properly negate error in syscall_set_return_value()
  2025-01-23 18:28         ` Dmitry V. Levin
  2025-01-23 19:11           ` Eugene Syromyatnikov
@ 2025-01-23 22:07           ` Christophe Leroy
  2025-01-23 22:35             ` Dmitry V. Levin
  2025-01-27 11:20             ` Dmitry V. Levin
  2025-01-23 23:43           ` Dmitry V. Levin
  2025-01-25 12:17           ` Michael Ellerman
  3 siblings, 2 replies; 65+ messages in thread
From: Christophe Leroy @ 2025-01-23 22:07 UTC (permalink / raw)
  To: Dmitry V. Levin
  Cc: Alexey Gladkov, Oleg Nesterov, Michael Ellerman,
	Eugene Syromyatnikov, Mike Frysinger, Renzo Davoli,
	Davide Berardi, strace-devel, Madhavan Srinivasan,
	Nicholas Piggin, Naveen N Rao, linuxppc-dev, linux-kernel



Le 23/01/2025 à 19:28, Dmitry V. Levin a écrit :
> On Mon, Jan 20, 2025 at 02:51:38PM +0100, Christophe Leroy wrote:
>> Le 14/01/2025 à 18:04, Dmitry V. Levin a écrit :
>>> On Mon, Jan 13, 2025 at 06:34:44PM +0100, Christophe Leroy wrote:
>>>> Le 13/01/2025 à 18:10, Dmitry V. Levin a écrit :
>>>>> Bring syscall_set_return_value() in sync with syscall_get_error(),
>>>>> and let upcoming ptrace/set_syscall_info selftest pass on powerpc.
>>>>>
>>>>> This reverts commit 1b1a3702a65c ("powerpc: Don't negate error in
>>>>> syscall_set_return_value()").
>>>>
>>>> There is a clear detailed explanation in that commit of why it needs to
>>>> be done.
>>>>
>>>> If you think that commit is wrong you have to explain why with at least
>>>> the same level of details.
>>>
>>> OK, please have a look whether this explanation is clear and detailed enough:
>>>
>>> =======
>>> powerpc: properly negate error in syscall_set_return_value()
>>>
>>> When syscall_set_return_value() is used to set an error code, the caller
>>> specifies it as a negative value in -ERRORCODE form.
>>>
>>> In !trap_is_scv case the error code is traditionally stored as follows:
>>> gpr[3] contains a positive ERRORCODE, and ccr has 0x10000000 flag set.
>>> Here are a few examples to illustrate this convention.  The first one
>>> is from syscall_get_error():
>>>           /*
>>>            * If the system call failed,
>>>            * regs->gpr[3] contains a positive ERRORCODE.
>>>            */
>>>           return (regs->ccr & 0x10000000UL) ? -regs->gpr[3] : 0;
>>>
>>> The second example is from regs_return_value():
>>>           if (is_syscall_success(regs))
>>>                   return regs->gpr[3];
>>>           else
>>>                   return -regs->gpr[3];
>>>
>>> The third example is from check_syscall_restart():
>>>           regs->result = -EINTR;
>>>           regs->gpr[3] = EINTR;
>>>           regs->ccr |= 0x10000000;
>>>
>>> Compared with these examples, the failure of syscall_set_return_value()
>>> to assign a positive ERRORCODE into regs->gpr[3] is clearly visible:
>>> 	/*
>>> 	 * In the general case it's not obvious that we must deal with
>>> 	 * CCR here, as the syscall exit path will also do that for us.
>>> 	 * However there are some places, eg. the signal code, which
>>> 	 * check ccr to decide if the value in r3 is actually an error.
>>> 	 */
>>> 	if (error) {
>>> 		regs->ccr |= 0x10000000L;
>>> 		regs->gpr[3] = error;
>>> 	} else {
>>> 		regs->ccr &= ~0x10000000L;
>>> 		regs->gpr[3] = val;
>>> 	}
>>>
>>> This fix brings syscall_set_return_value() in sync with syscall_get_error()
>>> and lets upcoming ptrace/set_syscall_info selftest pass on powerpc.
>>>
>>> Fixes: 1b1a3702a65c ("powerpc: Don't negate error in syscall_set_return_value()").
>>> =======
>>
>> I think there is still something going wrong.
>>
>> do_seccomp() sets regs->gpr[3] = -ENOSYS; by default.
>>
>> Then it calls __secure_computing() which returns what __seccomp_filter()
>> returns.
>>
>> In case of error, __seccomp_filter() calls syscall_set_return_value()
>> with a negative value then returns -1
>>
>> do_seccomp() is called by do_syscall_trace_enter() which returns -1 when
>> do_seccomp() doesn't return 0.
>>
>> do_syscall_trace_enter() is called by system_call_exception() and
>> returns -1, so syscall_exception() returns regs->gpr[3]
>>
>> In entry_32.S, transfer_to_syscall, syscall_exit_prepare() is then
>> called with the return of syscall_exception() as first parameter, which
>> leads to:
>>
>> 	if (unlikely(r3 >= (unsigned long)-MAX_ERRNO) && is_not_scv) {
>> 		if (likely(!(ti_flags & (_TIF_NOERROR | _TIF_RESTOREALL)))) {
>> 			r3 = -r3;
>> 			regs->ccr |= 0x10000000; /* Set SO bit in CR */
>> 		}
>> 	}
>>
>> By chance, because you have already changed the sign of gpr[3], the
>> above test fails and nothing is done to r3, and because you have also
>> already set regs->ccr it works.
>>
>> But all this looks inconsistent with the fact that do_seccomp sets
>> -ENOSYS as default value
>>
>> Also, when do_seccomp() returns 0, do_syscall_trace_enter() check the
>> syscall number and when it is wrong it goes to skip: which sets
>> regs->gpr[3] = -ENOSYS;
>>
>> So really I think it is not in line with your changes to set positive
>> value in gpr[3].
>>
>> Maybe your change is still correct but it needs to be handled completely
>> in that case.
> 
> Indeed, there is an inconsistency in !trap_is_scv case.
> 
> In some places such as syscall_get_error() and regs_return_value() the
> semantics is as I described earlier: gpr[3] contains a positive ERRORCODE
> and ccr has 0x10000000 flag set.  This semantics is a part of the ABI and
> therefore cannot be changed.
> 
> In some other places like do_seccomp() and do_syscall_trace_enter() the
> semantics is similar to the trap_is_scv case: gpr[3] contains a negative
> ERRORCODE and ccr is unchanged.  In addition, system_call_exception()
> returns the system call function return value when it is executed, and
> gpr[3] otherwise.  The value returned by system_call_exception() is passed
> on to syscall_exit_prepare() which performs the conversion you mentioned.
> 
> What's remarkable is that in those places that are a part of the ABI the
> traditional semantics is kept, while in other places the implementation
> follows the trap_is_scv-like semantics, while traditional semantics is
> also supported there.
> 
> The only case where I see some intersection is do_seccomp() where the
> tracer would be able to see -ENOSYS in gpr[3].  However, the seccomp stop
> is not the place where the tracer *reads* the system call exit status,
> so whatever was written in gpr[3] before __secure_computing() is not
> really relevant, consequently, selftests/seccomp/seccomp_bpf passes with
> this patch applied as well as without it.
> 
> After looking at system_call_exception() I doubt this inconsistency can be
> easily avoided, so I don't see how this patch could be enhanced further,
> and what else could I do with the patch besides dropping it and letting
> !trap_is_scv case be unsupported by PTRACE_SET_SYSCALL_INFO API, which
> would be unfortunate.
> 
> 

To add a bit more to the confusion, a task can be flagged with
TIF_NOERROR by calling force_successful_syscall_return(), in which case
even if gpr[3] contains a negative value between -MAX_ERRNO and -1 the
syscall will be handled as successful, hence CCR[SO] won't be set. But it
seems this is not handled by syscall_set_return_value(). So what will
happen with time() when approaching year 2036, for instance?
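
The interaction described above can be sketched in userspace C.  This is
only a toy model of the syscall_exit_prepare() conversion quoted earlier in
this message; struct toy_exit, exit_convert() and the TOY_TIF_NOERROR bit
value are made up for illustration and are not kernel names.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_ERRNO        4095
#define CR0_SO           0x10000000UL /* SO bit in CR, as in the quoted code */
#define TOY_TIF_NOERROR  0x1UL        /* hypothetical flag value, not the kernel's */

/* Toy pair of values: what ends up in r3 and CR on syscall exit. */
struct toy_exit {
	unsigned long r3;
	unsigned long ccr;
};

/* Models the quoted exit-path check for the !scv case. */
static struct toy_exit exit_convert(unsigned long r3, unsigned long ccr,
				    unsigned long ti_flags, bool is_not_scv)
{
	struct toy_exit out = { r3, ccr };

	if (r3 >= (unsigned long)-MAX_ERRNO && is_not_scv) {
		if (!(ti_flags & TOY_TIF_NOERROR)) {
			out.r3 = -r3;        /* -ERRORCODE becomes ERRORCODE */
			out.ccr |= CR0_SO;   /* and the error flag is raised */
		}
	}
	return out;
}

void exit_convert_selfcheck(void)
{
	struct toy_exit e;

	/* Negative errno without TIF_NOERROR: negated, SO set. */
	e = exit_convert((unsigned long)-38L, 0, 0, true);
	assert(e.r3 == 38 && e.ccr == CR0_SO);

	/* Same value with the no-error flag: passed through as success. */
	e = exit_convert((unsigned long)-38L, 0, TOY_TIF_NOERROR, true);
	assert(e.r3 == (unsigned long)-38L && e.ccr == 0);

	/* Already positive with SO set: the range test fails, nothing changes. */
	e = exit_convert(38, CR0_SO, 0, true);
	assert(e.r3 == 38 && e.ccr == CR0_SO);
}
```

The last case corresponds to the situation discussed earlier, where gpr[3]
has already been made positive and ccr already set before the exit path runs.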

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 1/7] powerpc: properly negate error in syscall_set_return_value()
  2025-01-23 19:11           ` Eugene Syromyatnikov
@ 2025-01-23 22:16             ` Dmitry V. Levin
  0 siblings, 0 replies; 65+ messages in thread
From: Dmitry V. Levin @ 2025-01-23 22:16 UTC (permalink / raw)
  To: Eugene Syromyatnikov
  Cc: Christophe Leroy, Alexey Gladkov, Oleg Nesterov, Michael Ellerman,
	Mike Frysinger, Renzo Davoli, Davide Berardi, strace-devel,
	Madhavan Srinivasan, Nicholas Piggin, Naveen N Rao, linuxppc-dev,
	linux-kernel

On Thu, Jan 23, 2025 at 08:11:44PM +0100, Eugene Syromyatnikov wrote:
> On Thu, Jan 23, 2025 at 7:28 PM Dmitry V. Levin <ldv@strace.io> wrote:
> > Indeed, there is an inconsistency in !trap_is_scv case.
> >
> > In some places such as syscall_get_error() and regs_return_value() the
> > semantics is as I described earlier: gpr[3] contains a positive ERRORCODE
> > and ccr has 0x10000000 flag set.  This semantics is a part of the ABI and
> > therefore cannot be changed.
> >
> > In some other places like do_seccomp() and do_syscall_trace_enter() the
> > semantics is similar to the trap_is_scv case: gpr[3] contains a negative
> > ERRORCODE and ccr is unchanged.  In addition, system_call_exception()
> > returns the system call function return value when it is executed, and
> > gpr[3] otherwise.  The value returned by system_call_exception() is passed
> > on to syscall_exit_prepare() which performs the conversion you mentioned.
> >
> > What's remarkable is that in those places that are a part of the ABI the
> > traditional semantics is kept, while in other places the implementation
> > follows the trap_is_scv-like semantics, while traditional semantics is
> > also supported there.
> >
> > The only case where I see some intersection is do_seccomp() where the
> > tracer would be able to see -ENOSYS in gpr[3].  However, the seccomp stop
> > is not the place where the tracer *reads* the system call exit status,
> > so whatever was written in gpr[3] before __secure_computing() is not
> > really relevant, consequently, selftests/seccomp/seccomp_bpf passes with
> > this patch applied as well as without it.
> >
> > After looking at system_call_exception() I doubt this inconsistency can be
> > easily avoided, so I don't see how this patch could be enhanced further,
> > and what else could I do with the patch besides dropping it and letting
> > !trap_is_scv case be unsupported by PTRACE_SET_SYSCALL_INFO API, which
> > would be unfortunate.
> 
> The semantics of r3 on syscall return (including the negatedness of
> the errno value) is documented in [1] (at least for the 64-bit case,
> but I conjecture the 32-bit one is the same, sans the lack of the v2
> ABI and scv there), so I would suggest to consider any deviation from
> that a kernel programming error to be fixed.
> 
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/arch/powerpc/syscall64-abi.rst?id=v6.13#n30

The semantics of r3 on syscall return is correct, thanks to
syscall_exit_prepare() that performs necessary manipulations with gpr[3].

What's wrong on powerpc in the !trap_is_scv case is that the current
implementation of syscall_set_return_value() follows different semantics,
making it unusable on syscall return.  As long as syscall_set_return_value()
was used only on syscall entry via do_seccomp(), this was not a problem.
It became a problem when we started to use it on syscall return, in the
same state in which its sibling syscall_get_error() is used.  Note that
among all the architectures in the kernel tree, powerpc in the !trap_is_scv
case is the only one that has this problem.  My patch is intended to address
this without breaking anything else.
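
To make the mismatch concrete, here is a minimal userspace sketch.  It is
not kernel code: struct toy_regs and the function names are invented for
illustration, and only the two branches of syscall_set_return_value() and
the syscall_get_error() expression are taken from the snippets quoted in
this thread.

```c
#include <assert.h>
#include <errno.h>

#define CR0_SO 0x10000000UL /* the ccr error flag used throughout the thread */

/* Toy stand-in for the two pt_regs fields under discussion. */
struct toy_regs {
	unsigned long gpr3;
	unsigned long ccr;
};

/* Current behaviour: the negative error is stored unchanged. */
static void set_return_value_old(struct toy_regs *regs, int error, long val)
{
	if (error) {
		regs->ccr |= CR0_SO;
		regs->gpr3 = error;
	} else {
		regs->ccr &= ~CR0_SO;
		regs->gpr3 = val;
	}
}

/* Patched behaviour: gpr3 receives a positive ERRORCODE. */
static void set_return_value_new(struct toy_regs *regs, int error, long val)
{
	if (error) {
		regs->ccr |= CR0_SO;
		regs->gpr3 = -error;
	} else {
		regs->ccr &= ~CR0_SO;
		regs->gpr3 = val;
	}
}

/* Reader side, following the quoted syscall_get_error() expression. */
static long get_error(const struct toy_regs *regs)
{
	return (regs->ccr & CR0_SO) ? -(long)regs->gpr3 : 0;
}

void roundtrip_selfcheck(void)
{
	struct toy_regs regs = { 0, 0 };

	/* Old setter: the reader gets the sign wrong. */
	set_return_value_old(&regs, -ENOSYS, 0);
	assert(get_error(&regs) == ENOSYS);

	/* Patched setter: what was written is what is read back. */
	set_return_value_new(&regs, -ENOSYS, 0);
	assert(get_error(&regs) == -ENOSYS);
}
```

With the old setter a value written on syscall return is read back with the
opposite sign; with the patched one the setter and getter agree.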


-- 
ldv

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 1/7] powerpc: properly negate error in syscall_set_return_value()
  2025-01-23 22:07           ` Christophe Leroy
@ 2025-01-23 22:35             ` Dmitry V. Levin
  2025-01-27 11:20             ` Dmitry V. Levin
  1 sibling, 0 replies; 65+ messages in thread
From: Dmitry V. Levin @ 2025-01-23 22:35 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Alexey Gladkov, Oleg Nesterov, Michael Ellerman,
	Eugene Syromyatnikov, Mike Frysinger, Renzo Davoli,
	Davide Berardi, strace-devel, Madhavan Srinivasan,
	Nicholas Piggin, Naveen N Rao, linuxppc-dev, linux-kernel

On Thu, Jan 23, 2025 at 11:07:21PM +0100, Christophe Leroy wrote:
[...]
> To add a bit more to the confusion, a task can be flagged with
> TIF_NOERROR by calling force_successful_syscall_return(), in which case
> even if gpr[3] contains a negative value between -MAX_ERRNO and -1 the
> syscall will be handled as successful, hence CCR[SO] won't be set. But it
> seems this is not handled by syscall_set_return_value(). So what will
> happen with time() when approaching year 2036, for instance?

syscall_set_return_value() takes both "int error" and "long val"
arguments.  It doesn't and shouldn't take TIF_NOERROR into account.
With my patch applied, when it's called by PTRACE_SET_SYSCALL_INFO
from do_syscall_trace_leave(), it will properly update gpr[3] and ccr
regardless of TIF_NOERROR.  If a tracer wants to set an error status for
a syscall that cannot return an error, it's up to the tracer to face the
consequences.  Tracers can already do this via PTRACE_SETREGS* anyway.


-- 
ldv

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 1/7] powerpc: properly negate error in syscall_set_return_value()
  2025-01-23 18:28         ` Dmitry V. Levin
  2025-01-23 19:11           ` Eugene Syromyatnikov
  2025-01-23 22:07           ` Christophe Leroy
@ 2025-01-23 23:43           ` Dmitry V. Levin
  2025-01-24 15:18             ` Alexey Gladkov
  2025-01-25 12:17             ` Michael Ellerman
  2025-01-25 12:17           ` Michael Ellerman
  3 siblings, 2 replies; 65+ messages in thread
From: Dmitry V. Levin @ 2025-01-23 23:43 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Alexey Gladkov, Oleg Nesterov, Michael Ellerman,
	Eugene Syromyatnikov, Mike Frysinger, Renzo Davoli,
	Davide Berardi, strace-devel, Madhavan Srinivasan,
	Nicholas Piggin, Naveen N Rao, linuxppc-dev, linux-kernel

On Thu, Jan 23, 2025 at 08:28:15PM +0200, Dmitry V. Levin wrote:
> On Mon, Jan 20, 2025 at 02:51:38PM +0100, Christophe Leroy wrote:
> > Le 14/01/2025 à 18:04, Dmitry V. Levin a écrit :
> > > On Mon, Jan 13, 2025 at 06:34:44PM +0100, Christophe Leroy wrote:
> > >> Le 13/01/2025 à 18:10, Dmitry V. Levin a écrit :
> > >>> Bring syscall_set_return_value() in sync with syscall_get_error(),
> > >>> and let upcoming ptrace/set_syscall_info selftest pass on powerpc.
> > >>>
> > >>> This reverts commit 1b1a3702a65c ("powerpc: Don't negate error in
> > >>> syscall_set_return_value()").
> > >>
> > >> There is a clear detailed explanation in that commit of why it needs to
> > >> be done.
> > >>
> > >> If you think that commit is wrong you have to explain why with at least
> > >> the same level of details.
> > > 
> > > OK, please have a look whether this explanation is clear and detailed enough:
> > > 
> > > =======
> > > powerpc: properly negate error in syscall_set_return_value()
> > > 
> > > When syscall_set_return_value() is used to set an error code, the caller
> > > specifies it as a negative value in -ERRORCODE form.
> > > 
> > > In !trap_is_scv case the error code is traditionally stored as follows:
> > > gpr[3] contains a positive ERRORCODE, and ccr has 0x10000000 flag set.
> > > Here are a few examples to illustrate this convention.  The first one
> > > is from syscall_get_error():
> > >          /*
> > >           * If the system call failed,
> > >           * regs->gpr[3] contains a positive ERRORCODE.
> > >           */
> > >          return (regs->ccr & 0x10000000UL) ? -regs->gpr[3] : 0;
> > > 
> > > The second example is from regs_return_value():
> > >          if (is_syscall_success(regs))
> > >                  return regs->gpr[3];
> > >          else
> > >                  return -regs->gpr[3];
> > > 
> > > The third example is from check_syscall_restart():
> > >          regs->result = -EINTR;
> > >          regs->gpr[3] = EINTR;
> > >          regs->ccr |= 0x10000000;
> > > 
> > > Compared with these examples, the failure of syscall_set_return_value()
> > > to assign a positive ERRORCODE into regs->gpr[3] is clearly visible:
> > > 	/*
> > > 	 * In the general case it's not obvious that we must deal with
> > > 	 * CCR here, as the syscall exit path will also do that for us.
> > > 	 * However there are some places, eg. the signal code, which
> > > 	 * check ccr to decide if the value in r3 is actually an error.
> > > 	 */
> > > 	if (error) {
> > > 		regs->ccr |= 0x10000000L;
> > > 		regs->gpr[3] = error;
> > > 	} else {
> > > 		regs->ccr &= ~0x10000000L;
> > > 		regs->gpr[3] = val;
> > > 	}
> > > 
> > > This fix brings syscall_set_return_value() in sync with syscall_get_error()
> > > and lets upcoming ptrace/set_syscall_info selftest pass on powerpc.
> > > 
> > > Fixes: 1b1a3702a65c ("powerpc: Don't negate error in syscall_set_return_value()").
> > > =======
> > 
> > I think there is still something going wrong.
> > 
> > do_seccomp() sets regs->gpr[3] = -ENOSYS; by default.
> > 
> > Then it calls __secure_computing() which returns what __seccomp_filter() 
> > returns.
> > 
> > In case of error, __seccomp_filter() calls syscall_set_return_value() 
> > with a negative value then returns -1
> > 
> > do_seccomp() is called by do_syscall_trace_enter() which returns -1 when 
> > do_seccomp() doesn't return 0.
> > 
> > do_syscall_trace_enter() is called by system_call_exception() and 
> > > returns -1, so system_call_exception() returns regs->gpr[3]
> > >
> > > In entry_32.S, transfer_to_syscall, syscall_exit_prepare() is then
> > > called with the return of system_call_exception() as first parameter, which
> > leads to:
> > 
> > 	if (unlikely(r3 >= (unsigned long)-MAX_ERRNO) && is_not_scv) {
> > 		if (likely(!(ti_flags & (_TIF_NOERROR | _TIF_RESTOREALL)))) {
> > 			r3 = -r3;
> > 			regs->ccr |= 0x10000000; /* Set SO bit in CR */
> > 		}
> > 	}
> > 
> > By chance, because you have already changed the sign of gpr[3], the 
> > above test fails and nothing is done to r3, and because you have also 
> > already set regs->ccr it works.
> > 
> > But all this looks inconsistent with the fact that do_seccomp sets 
> > -ENOSYS as default value
> > 
> > Also, when do_seccomp() returns 0, do_syscall_trace_enter() check the 
> > syscall number and when it is wrong it goes to skip: which sets 
> > regs->gpr[3] = -ENOSYS;
> > 
> > So really I think it is not in line with your changes to set positive 
> > value in gpr[3].
> > 
> > Maybe your change is still correct but it needs to be handled completely 
> > in that case.
> 
> Indeed, there is an inconsistency in !trap_is_scv case.
> 
> In some places such as syscall_get_error() and regs_return_value() the
> semantics is as I described earlier: gpr[3] contains a positive ERRORCODE
> and ccr has 0x10000000 flag set.  This semantics is a part of the ABI and
> therefore cannot be changed.
> 
> In some other places like do_seccomp() and do_syscall_trace_enter() the
> semantics is similar to the trap_is_scv case: gpr[3] contains a negative
> ERRORCODE and ccr is unchanged.  In addition, system_call_exception()
> returns the system call function return value when it is executed, and
> gpr[3] otherwise.  The value returned by system_call_exception() is passed
> on to syscall_exit_prepare() which performs the conversion you mentioned.
> 
> What's remarkable is that in those places that are a part of the ABI the
> traditional semantics is kept, while in other places the implementation
> follows the trap_is_scv-like semantics, while traditional semantics is
> also supported there.
> 
> The only case where I see some intersection is do_seccomp() where the
> tracer would be able to see -ENOSYS in gpr[3].  However, the seccomp stop
> is not the place where the tracer *reads* the system call exit status,
> so whatever was written in gpr[3] before __secure_computing() is not
> really relevant, consequently, selftests/seccomp/seccomp_bpf passes with
> this patch applied as well as without it.
> 
> After looking at system_call_exception() I doubt this inconsistency can be
> easily avoided, so I don't see how this patch could be enhanced further,
> and what else could I do with the patch besides dropping it and letting
> !trap_is_scv case be unsupported by PTRACE_SET_SYSCALL_INFO API, which
> would be unfortunate.

If you say this would bring some consistency, I can extend the patch with
something like this:

diff --git a/arch/powerpc/kernel/ptrace/ptrace.c b/arch/powerpc/kernel/ptrace/ptrace.c
index 727ed4a14545..dda276a934fd 100644
--- a/arch/powerpc/kernel/ptrace/ptrace.c
+++ b/arch/powerpc/kernel/ptrace/ptrace.c
@@ -207,7 +207,7 @@ static int do_seccomp(struct pt_regs *regs)
 	 * syscall parameter. This is different to the ptrace ABI where
 	 * both r3 and orig_gpr3 contain the first syscall parameter.
 	 */
-	regs->gpr[3] = -ENOSYS;
+	syscall_set_return_value(current, regs, -ENOSYS, 0);
 
 	/*
 	 * We use the __ version here because we have already checked
@@ -225,7 +225,7 @@ static int do_seccomp(struct pt_regs *regs)
 	 * modify the first syscall parameter (in orig_gpr3) and also
 	 * allow the syscall to proceed.
 	 */
-	regs->gpr[3] = regs->orig_gpr3;
+	syscall_set_return_value(current, regs, 0, regs->orig_gpr3);
 
 	return 0;
 }
@@ -315,7 +315,7 @@ long do_syscall_trace_enter(struct pt_regs *regs)
 	 * If we are aborting explicitly, or if the syscall number is
 	 * now invalid, set the return value to -ENOSYS.
 	 */
-	regs->gpr[3] = -ENOSYS;
+	syscall_set_return_value(current, regs, -ENOSYS, 0);
 	return -1;
 }
 
diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c
index aa17e62f3754..c921e0cb54b8 100644
--- a/arch/powerpc/kernel/signal.c
+++ b/arch/powerpc/kernel/signal.c
@@ -229,14 +229,8 @@ static void check_syscall_restart(struct pt_regs *regs, struct k_sigaction *ka,
 		regs_add_return_ip(regs, -4);
 		regs->result = 0;
 	} else {
-		if (trap_is_scv(regs)) {
-			regs->result = -EINTR;
-			regs->gpr[3] = -EINTR;
-		} else {
-			regs->result = -EINTR;
-			regs->gpr[3] = EINTR;
-			regs->ccr |= 0x10000000;
-		}
+		regs->result = -EINTR;
+		syscall_set_return_value(current, regs, -EINTR, 0);
 	}
 }
 

-- 
ldv

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 1/7] powerpc: properly negate error in syscall_set_return_value()
  2025-01-23 23:43           ` Dmitry V. Levin
@ 2025-01-24 15:18             ` Alexey Gladkov
  2025-01-25  0:25               ` Dmitry V. Levin
  2025-01-25 12:18               ` Michael Ellerman
  2025-01-25 12:17             ` Michael Ellerman
  1 sibling, 2 replies; 65+ messages in thread
From: Alexey Gladkov @ 2025-01-24 15:18 UTC (permalink / raw)
  To: Dmitry V. Levin
  Cc: Christophe Leroy, Oleg Nesterov, Michael Ellerman,
	Eugene Syromyatnikov, Mike Frysinger, Renzo Davoli,
	Davide Berardi, strace-devel, Madhavan Srinivasan,
	Nicholas Piggin, Naveen N Rao, linuxppc-dev, linux-kernel

On Fri, Jan 24, 2025 at 01:43:22AM +0200, Dmitry V. Levin wrote:
> On Thu, Jan 23, 2025 at 08:28:15PM +0200, Dmitry V. Levin wrote:
> > On Mon, Jan 20, 2025 at 02:51:38PM +0100, Christophe Leroy wrote:
> > > Le 14/01/2025 à 18:04, Dmitry V. Levin a écrit :
> > > > On Mon, Jan 13, 2025 at 06:34:44PM +0100, Christophe Leroy wrote:
> > > >> Le 13/01/2025 à 18:10, Dmitry V. Levin a écrit :
> > > >>> Bring syscall_set_return_value() in sync with syscall_get_error(),
> > > >>> and let upcoming ptrace/set_syscall_info selftest pass on powerpc.
> > > >>>
> > > >>> This reverts commit 1b1a3702a65c ("powerpc: Don't negate error in
> > > >>> syscall_set_return_value()").
> > > >>
> > > >> There is a clear detailed explanation in that commit of why it needs to
> > > >> be done.
> > > >>
> > > >> If you think that commit is wrong you have to explain why with at least
> > > >> the same level of details.
> > > > 
> > > > OK, please have a look whether this explanation is clear and detailed enough:
> > > > 
> > > > =======
> > > > powerpc: properly negate error in syscall_set_return_value()
> > > > 
> > > > When syscall_set_return_value() is used to set an error code, the caller
> > > > specifies it as a negative value in -ERRORCODE form.
> > > > 
> > > > In !trap_is_scv case the error code is traditionally stored as follows:
> > > > gpr[3] contains a positive ERRORCODE, and ccr has 0x10000000 flag set.
> > > > Here are a few examples to illustrate this convention.  The first one
> > > > is from syscall_get_error():
> > > >          /*
> > > >           * If the system call failed,
> > > >           * regs->gpr[3] contains a positive ERRORCODE.
> > > >           */
> > > >          return (regs->ccr & 0x10000000UL) ? -regs->gpr[3] : 0;
> > > > 
> > > > The second example is from regs_return_value():
> > > >          if (is_syscall_success(regs))
> > > >                  return regs->gpr[3];
> > > >          else
> > > >                  return -regs->gpr[3];
> > > > 
> > > > The third example is from check_syscall_restart():
> > > >          regs->result = -EINTR;
> > > >          regs->gpr[3] = EINTR;
> > > >          regs->ccr |= 0x10000000;
> > > > 
> > > > Compared with these examples, the failure of syscall_set_return_value()
> > > > to assign a positive ERRORCODE into regs->gpr[3] is clearly visible:
> > > > 	/*
> > > > 	 * In the general case it's not obvious that we must deal with
> > > > 	 * CCR here, as the syscall exit path will also do that for us.
> > > > 	 * However there are some places, eg. the signal code, which
> > > > 	 * check ccr to decide if the value in r3 is actually an error.
> > > > 	 */
> > > > 	if (error) {
> > > > 		regs->ccr |= 0x10000000L;
> > > > 		regs->gpr[3] = error;
> > > > 	} else {
> > > > 		regs->ccr &= ~0x10000000L;
> > > > 		regs->gpr[3] = val;
> > > > 	}
> > > > 
> > > > This fix brings syscall_set_return_value() in sync with syscall_get_error()
> > > > and lets upcoming ptrace/set_syscall_info selftest pass on powerpc.
> > > > 
> > > > Fixes: 1b1a3702a65c ("powerpc: Don't negate error in syscall_set_return_value()").
> > > > =======
> > > 
> > > I think there is still something going wrong.
> > > 
> > > do_seccomp() sets regs->gpr[3] = -ENOSYS; by default.
> > > 
> > > Then it calls __secure_computing() which returns what __seccomp_filter() 
> > > returns.
> > > 
> > > In case of error, __seccomp_filter() calls syscall_set_return_value() 
> > > with a negative value then returns -1
> > > 
> > > do_seccomp() is called by do_syscall_trace_enter() which returns -1 when 
> > > do_seccomp() doesn't return 0.
> > > 
> > > do_syscall_trace_enter() is called by system_call_exception() and 
> > > > returns -1, so system_call_exception() returns regs->gpr[3]
> > > >
> > > > In entry_32.S, transfer_to_syscall, syscall_exit_prepare() is then
> > > > called with the return of system_call_exception() as first parameter, which
> > > leads to:
> > > 
> > > 	if (unlikely(r3 >= (unsigned long)-MAX_ERRNO) && is_not_scv) {
> > > 		if (likely(!(ti_flags & (_TIF_NOERROR | _TIF_RESTOREALL)))) {
> > > 			r3 = -r3;
> > > 			regs->ccr |= 0x10000000; /* Set SO bit in CR */
> > > 		}
> > > 	}
> > > 
> > > By chance, because you have already changed the sign of gpr[3], the 
> > > above test fails and nothing is done to r3, and because you have also 
> > > already set regs->ccr it works.
> > > 
> > > But all this looks inconsistent with the fact that do_seccomp sets 
> > > -ENOSYS as default value
> > > 
> > > Also, when do_seccomp() returns 0, do_syscall_trace_enter() check the 
> > > syscall number and when it is wrong it goes to skip: which sets 
> > > regs->gpr[3] = -ENOSYS;
> > > 
> > > So really I think it is not in line with your changes to set positive 
> > > value in gpr[3].
> > > 
> > > Maybe your change is still correct but it needs to be handled completely 
> > > in that case.
> > 
> > Indeed, there is an inconsistency in !trap_is_scv case.
> > 
> > In some places such as syscall_get_error() and regs_return_value() the
> > semantics is as I described earlier: gpr[3] contains a positive ERRORCODE
> > and ccr has 0x10000000 flag set.  This semantics is a part of the ABI and
> > therefore cannot be changed.
> > 
> > In some other places like do_seccomp() and do_syscall_trace_enter() the
> > semantics is similar to the trap_is_scv case: gpr[3] contains a negative
> > ERRORCODE and ccr is unchanged.  In addition, system_call_exception()
> > returns the system call function return value when it is executed, and
> > gpr[3] otherwise.  The value returned by system_call_exception() is passed
> > on to syscall_exit_prepare() which performs the conversion you mentioned.
> > 
> > What's remarkable is that in those places that are a part of the ABI the
> > traditional semantics is kept, while in other places the implementation
> > follows the trap_is_scv-like semantics, while traditional semantics is
> > also supported there.
> > 
> > The only case where I see some intersection is do_seccomp() where the
> > tracer would be able to see -ENOSYS in gpr[3].  However, the seccomp stop
> > is not the place where the tracer *reads* the system call exit status,
> > so whatever was written in gpr[3] before __secure_computing() is not
> > really relevant, consequently, selftests/seccomp/seccomp_bpf passes with
> > this patch applied as well as without it.
> > 
> > After looking at system_call_exception() I doubt this inconsistency can be
> > easily avoided, so I don't see how this patch could be enhanced further,
> > and what else could I do with the patch besides dropping it and letting
> > !trap_is_scv case be unsupported by PTRACE_SET_SYSCALL_INFO API, which
> > would be unfortunate.
> 
> If you say this would bring some consistency, I can extend the patch with
> something like this:
> 
> diff --git a/arch/powerpc/kernel/ptrace/ptrace.c b/arch/powerpc/kernel/ptrace/ptrace.c
> index 727ed4a14545..dda276a934fd 100644
> --- a/arch/powerpc/kernel/ptrace/ptrace.c
> +++ b/arch/powerpc/kernel/ptrace/ptrace.c
> @@ -207,7 +207,7 @@ static int do_seccomp(struct pt_regs *regs)
>  	 * syscall parameter. This is different to the ptrace ABI where
>  	 * both r3 and orig_gpr3 contain the first syscall parameter.
>  	 */
> -	regs->gpr[3] = -ENOSYS;
> +	syscall_set_return_value(current, regs, -ENOSYS, 0);
>  
>  	/*
>  	 * We use the __ version here because we have already checked
> @@ -225,7 +225,7 @@ static int do_seccomp(struct pt_regs *regs)
>  	 * modify the first syscall parameter (in orig_gpr3) and also
>  	 * allow the syscall to proceed.
>  	 */
> -	regs->gpr[3] = regs->orig_gpr3;
> +	syscall_set_return_value(current, regs, 0, regs->orig_gpr3);
>  
>  	return 0;
>  }
> @@ -315,7 +315,7 @@ long do_syscall_trace_enter(struct pt_regs *regs)
>  	 * If we are aborting explicitly, or if the syscall number is
>  	 * now invalid, set the return value to -ENOSYS.
>  	 */
> -	regs->gpr[3] = -ENOSYS;
> +	syscall_set_return_value(current, regs, -ENOSYS, 0);
>  	return -1;
>  }
>  
> diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c
> index aa17e62f3754..c921e0cb54b8 100644
> --- a/arch/powerpc/kernel/signal.c
> +++ b/arch/powerpc/kernel/signal.c
> @@ -229,14 +229,8 @@ static void check_syscall_restart(struct pt_regs *regs, struct k_sigaction *ka,
>  		regs_add_return_ip(regs, -4);
>  		regs->result = 0;
>  	} else {
> -		if (trap_is_scv(regs)) {
> -			regs->result = -EINTR;
> -			regs->gpr[3] = -EINTR;
> -		} else {
> -			regs->result = -EINTR;
> -			regs->gpr[3] = EINTR;
> -			regs->ccr |= 0x10000000;
> -		}
> +		regs->result = -EINTR;
> +		syscall_set_return_value(current, regs, -EINTR, 0);
>  	}
>  }

I'm not a powerpc expert, but shouldn't regs->gpr[3] be accessed via
regs_return_value() in system_call_exception()?

notrace long system_call_exception(struct pt_regs *regs, unsigned long r0)
{
...
		r0 = do_syscall_trace_enter(regs);
		if (unlikely(r0 >= NR_syscalls))
			return regs->gpr[3];

	} else if (unlikely(r0 >= NR_syscalls)) {
		if (unlikely(trap_is_unsupported_scv(regs))) {
			/* Unsupported scv vector */
			_exception(SIGILL, regs, ILL_ILLOPC, regs->nip);
			return regs->gpr[3];
		}
		return -ENOSYS;
	}
}

-- 
Rgrds, legion


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 1/7] powerpc: properly negate error in syscall_set_return_value()
  2025-01-24 15:18             ` Alexey Gladkov
@ 2025-01-25  0:25               ` Dmitry V. Levin
  2025-01-25 12:18               ` Michael Ellerman
  1 sibling, 0 replies; 65+ messages in thread
From: Dmitry V. Levin @ 2025-01-25  0:25 UTC (permalink / raw)
  To: Alexey Gladkov
  Cc: Christophe Leroy, Oleg Nesterov, Michael Ellerman,
	Eugene Syromyatnikov, Mike Frysinger, Renzo Davoli,
	Davide Berardi, strace-devel, Madhavan Srinivasan,
	Nicholas Piggin, Naveen N Rao, linuxppc-dev, linux-kernel

On Fri, Jan 24, 2025 at 04:18:10PM +0100, Alexey Gladkov wrote:
> On Fri, Jan 24, 2025 at 01:43:22AM +0200, Dmitry V. Levin wrote:
> > On Thu, Jan 23, 2025 at 08:28:15PM +0200, Dmitry V. Levin wrote:
> > > On Mon, Jan 20, 2025 at 02:51:38PM +0100, Christophe Leroy wrote:
> > > > Le 14/01/2025 à 18:04, Dmitry V. Levin a écrit :
> > > > > On Mon, Jan 13, 2025 at 06:34:44PM +0100, Christophe Leroy wrote:
> > > > >> Le 13/01/2025 à 18:10, Dmitry V. Levin a écrit :
> > > > >>> Bring syscall_set_return_value() in sync with syscall_get_error(),
> > > > >>> and let upcoming ptrace/set_syscall_info selftest pass on powerpc.
> > > > >>>
> > > > >>> This reverts commit 1b1a3702a65c ("powerpc: Don't negate error in
> > > > >>> syscall_set_return_value()").
> > > > >>
> > > > >> There is a clear detailed explanation in that commit of why it needs to
> > > > >> be done.
> > > > >>
> > > > >> If you think that commit is wrong you have to explain why with at least
> > > > >> the same level of details.
> > > > > 
> > > > > OK, please have a look whether this explanation is clear and detailed enough:
> > > > > 
> > > > > =======
> > > > > powerpc: properly negate error in syscall_set_return_value()
> > > > > 
> > > > > When syscall_set_return_value() is used to set an error code, the caller
> > > > > specifies it as a negative value in -ERRORCODE form.
> > > > > 
> > > > > In !trap_is_scv case the error code is traditionally stored as follows:
> > > > > gpr[3] contains a positive ERRORCODE, and ccr has 0x10000000 flag set.
> > > > > Here are a few examples to illustrate this convention.  The first one
> > > > > is from syscall_get_error():
> > > > >          /*
> > > > >           * If the system call failed,
> > > > >           * regs->gpr[3] contains a positive ERRORCODE.
> > > > >           */
> > > > >          return (regs->ccr & 0x10000000UL) ? -regs->gpr[3] : 0;
> > > > > 
> > > > > The second example is from regs_return_value():
> > > > >          if (is_syscall_success(regs))
> > > > >                  return regs->gpr[3];
> > > > >          else
> > > > >                  return -regs->gpr[3];
> > > > > 
> > > > > The third example is from check_syscall_restart():
> > > > >          regs->result = -EINTR;
> > > > >          regs->gpr[3] = EINTR;
> > > > >          regs->ccr |= 0x10000000;
> > > > > 
> > > > > Compared with these examples, the failure of syscall_set_return_value()
> > > > > to assign a positive ERRORCODE into regs->gpr[3] is clearly visible:
> > > > > 	/*
> > > > > 	 * In the general case it's not obvious that we must deal with
> > > > > 	 * CCR here, as the syscall exit path will also do that for us.
> > > > > 	 * However there are some places, eg. the signal code, which
> > > > > 	 * check ccr to decide if the value in r3 is actually an error.
> > > > > 	 */
> > > > > 	if (error) {
> > > > > 		regs->ccr |= 0x10000000L;
> > > > > 		regs->gpr[3] = error;
> > > > > 	} else {
> > > > > 		regs->ccr &= ~0x10000000L;
> > > > > 		regs->gpr[3] = val;
> > > > > 	}
> > > > > 
> > > > > This fix brings syscall_set_return_value() in sync with syscall_get_error()
> > > > > and lets upcoming ptrace/set_syscall_info selftest pass on powerpc.
> > > > > 
> > > > > Fixes: 1b1a3702a65c ("powerpc: Don't negate error in syscall_set_return_value()").
> > > > > =======
> > > > 
> > > > I think there is still something going wrong.
> > > > 
> > > > do_seccomp() sets regs->gpr[3] = -ENOSYS; by default.
> > > > 
> > > > Then it calls __secure_computing() which returns what __seccomp_filter() 
> > > > returns.
> > > > 
> > > > In case of error, __seccomp_filter() calls syscall_set_return_value() 
> > > > with a negative value then returns -1
> > > > 
> > > > do_seccomp() is called by do_syscall_trace_enter() which returns -1 when 
> > > > do_seccomp() doesn't return 0.
> > > > 
> > > > do_syscall_trace_enter() is called by system_call_exception() and 
> > > > > returns -1, so system_call_exception() returns regs->gpr[3]
> > > > >
> > > > > In entry_32.S, transfer_to_syscall, syscall_exit_prepare() is then
> > > > > called with the return of system_call_exception() as first parameter, which
> > > > leads to:
> > > > 
> > > > 	if (unlikely(r3 >= (unsigned long)-MAX_ERRNO) && is_not_scv) {
> > > > 		if (likely(!(ti_flags & (_TIF_NOERROR | _TIF_RESTOREALL)))) {
> > > > 			r3 = -r3;
> > > > 			regs->ccr |= 0x10000000; /* Set SO bit in CR */
> > > > 		}
> > > > 	}
> > > > 
> > > > By chance, because you have already changed the sign of gpr[3], the 
> > > > above test fails and nothing is done to r3, and because you have also 
> > > > already set regs->ccr it works.
> > > > 
> > > > But all this looks inconsistent with the fact that do_seccomp sets 
> > > > -ENOSYS as default value
> > > > 
> > > > Also, when do_seccomp() returns 0, do_syscall_trace_enter() checks the
> > > > syscall number and when it is wrong it goes to skip: which sets
> > > > regs->gpr[3] = -ENOSYS;
> > > > 
> > > > So really I think it is not in line with your changes to set positive 
> > > > value in gpr[3].
> > > > 
> > > > Maybe your change is still correct but it needs to be handled completely 
> > > > in that case.
> > > 
> > > Indeed, there is an inconsistency in !trap_is_scv case.
> > > 
> > > In some places such as syscall_get_error() and regs_return_value() the
> > > semantics is as I described earlier: gpr[3] contains a positive ERRORCODE
> > > and ccr has 0x10000000 flag set.  This semantics is a part of the ABI and
> > > therefore cannot be changed.
> > > 
> > > In some other places like do_seccomp() and do_syscall_trace_enter() the
> > > semantics is similar to the trap_is_scv case: gpr[3] contains a negative
> > > ERRORCODE and ccr is unchanged.  In addition, system_call_exception()
> > > returns the system call function return value when it is executed, and
> > > gpr[3] otherwise.  The value returned by system_call_exception() is passed
> > > on to syscall_exit_prepare() which performs the conversion you mentioned.
> > > 
> > > What's remarkable is that in those places that are a part of the ABI the
> > > traditional semantics is kept, while in other places the implementation
> > > follows the trap_is_scv-like semantics, while traditional semantics is
> > > also supported there.
> > > 
> > > The only case where I see some intersection is do_seccomp() where the
> > > tracer would be able to see -ENOSYS in gpr[3].  However, the seccomp stop
> > > is not the place where the tracer *reads* the system call exit status,
> > > so whatever was written in gpr[3] before __secure_computing() is not
> > > really relevant, consequently, selftests/seccomp/seccomp_bpf passes with
> > > this patch applied as well as without it.
> > > 
> > > After looking at system_call_exception() I doubt this inconsistency can be
> > > easily avoided, so I don't see how this patch could be enhanced further,
> > > and what else could I do with the patch besides dropping it and letting
> > > !trap_is_scv case be unsupported by PTRACE_SET_SYSCALL_INFO API, which
> > > would be unfortunate.
> > 
> > If you say this would bring some consistency, I can extend the patch with
> > something like this:
> > 
> > diff --git a/arch/powerpc/kernel/ptrace/ptrace.c b/arch/powerpc/kernel/ptrace/ptrace.c
> > index 727ed4a14545..dda276a934fd 100644
> > --- a/arch/powerpc/kernel/ptrace/ptrace.c
> > +++ b/arch/powerpc/kernel/ptrace/ptrace.c
> > @@ -207,7 +207,7 @@ static int do_seccomp(struct pt_regs *regs)
> >  	 * syscall parameter. This is different to the ptrace ABI where
> >  	 * both r3 and orig_gpr3 contain the first syscall parameter.
> >  	 */
> > -	regs->gpr[3] = -ENOSYS;
> > +	syscall_set_return_value(current, regs, -ENOSYS, 0);
> >  
> >  	/*
> >  	 * We use the __ version here because we have already checked
> > @@ -225,7 +225,7 @@ static int do_seccomp(struct pt_regs *regs)
> >  	 * modify the first syscall parameter (in orig_gpr3) and also
> >  	 * allow the syscall to proceed.
> >  	 */
> > -	regs->gpr[3] = regs->orig_gpr3;
> > +	syscall_set_return_value(current, regs, 0, regs->orig_gpr3);
> >  
> >  	return 0;
> >  }
> > @@ -315,7 +315,7 @@ long do_syscall_trace_enter(struct pt_regs *regs)
> >  	 * If we are aborting explicitly, or if the syscall number is
> >  	 * now invalid, set the return value to -ENOSYS.
> >  	 */
> > -	regs->gpr[3] = -ENOSYS;
> > +	syscall_set_return_value(current, regs, -ENOSYS, 0);
> >  	return -1;
> >  }
> >  
> > diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c
> > index aa17e62f3754..c921e0cb54b8 100644
> > --- a/arch/powerpc/kernel/signal.c
> > +++ b/arch/powerpc/kernel/signal.c
> > @@ -229,14 +229,8 @@ static void check_syscall_restart(struct pt_regs *regs, struct k_sigaction *ka,
> >  		regs_add_return_ip(regs, -4);
> >  		regs->result = 0;
> >  	} else {
> > -		if (trap_is_scv(regs)) {
> > -			regs->result = -EINTR;
> > -			regs->gpr[3] = -EINTR;
> > -		} else {
> > -			regs->result = -EINTR;
> > -			regs->gpr[3] = EINTR;
> > -			regs->ccr |= 0x10000000;
> > -		}
> > +		regs->result = -EINTR;
> > +		syscall_set_return_value(current, regs, -EINTR, 0);
> >  	}
> >  }
> 
> I'm not a powerpc expert but shouldn't regs->gpr[3] be accessed via
> regs_return_value() in system_call_exception()?

This would ensure that system_call_exception() returns errors in -ERRORCODE
form, which wouldn't make any practical difference given that the return
code is passed on to syscall_exit_prepare() which performs the conversion.

However, this could bring more consistency when applied along with other
consistency-related changes.

I wish the people responsible for powerpc would be more specific about
the level of consistency they are ready to maintain.


-- 
ldv

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 1/7] powerpc: properly negate error in syscall_set_return_value()
  2025-01-23 18:28         ` Dmitry V. Levin
                             ` (2 preceding siblings ...)
  2025-01-23 23:43           ` Dmitry V. Levin
@ 2025-01-25 12:17           ` Michael Ellerman
  2025-01-25 21:25             ` Dmitry V. Levin
  3 siblings, 1 reply; 65+ messages in thread
From: Michael Ellerman @ 2025-01-25 12:17 UTC (permalink / raw)
  To: Dmitry V. Levin, Christophe Leroy
  Cc: Alexey Gladkov, Oleg Nesterov, Eugene Syromyatnikov,
	Mike Frysinger, Renzo Davoli, Davide Berardi, strace-devel,
	Madhavan Srinivasan, Nicholas Piggin, Naveen N Rao, linuxppc-dev,
	linux-kernel

"Dmitry V. Levin" <ldv@strace.io> writes:
> On Mon, Jan 20, 2025 at 02:51:38PM +0100, Christophe Leroy wrote:
>> Le 14/01/2025 à 18:04, Dmitry V. Levin a écrit :
>> > On Mon, Jan 13, 2025 at 06:34:44PM +0100, Christophe Leroy wrote:
>> >> Le 13/01/2025 à 18:10, Dmitry V. Levin a écrit :
>> >>> Bring syscall_set_return_value() in sync with syscall_get_error(),
>> >>> and let upcoming ptrace/set_syscall_info selftest pass on powerpc.
>> >>>
>> >>> This reverts commit 1b1a3702a65c ("powerpc: Don't negate error in
>> >>> syscall_set_return_value()").
>> >>
>> >> There is a clear detailed explanation in that commit of why it needs to
>> >> be done.
>> >>
>> >> If you think that commit is wrong you have to explain why with at least
>> >> the same level of details.
>> > 
>> > OK, please have a look whether this explanation is clear and detailed enough:
>> > 
>> > =======
>> > powerpc: properly negate error in syscall_set_return_value()
>> > 
>> > When syscall_set_return_value() is used to set an error code, the caller
>> > specifies it as a negative value in -ERRORCODE form.
>> > 
>> > In !trap_is_scv case the error code is traditionally stored as follows:
>> > gpr[3] contains a positive ERRORCODE, and ccr has 0x10000000 flag set.
>> > Here are a few examples to illustrate this convention.  The first one
>> > is from syscall_get_error():
>> >          /*
>> >           * If the system call failed,
>> >           * regs->gpr[3] contains a positive ERRORCODE.
>> >           */
>> >          return (regs->ccr & 0x10000000UL) ? -regs->gpr[3] : 0;
>> > 
>> > The second example is from regs_return_value():
>> >          if (is_syscall_success(regs))
>> >                  return regs->gpr[3];
>> >          else
>> >                  return -regs->gpr[3];
>> > 
>> > The third example is from check_syscall_restart():
>> >          regs->result = -EINTR;
>> >          regs->gpr[3] = EINTR;
>> >          regs->ccr |= 0x10000000;
>> > 
>> > Compared with these examples, the failure of syscall_set_return_value()
>> > to assign a positive ERRORCODE into regs->gpr[3] is clearly visible:
>> > 	/*
>> > 	 * In the general case it's not obvious that we must deal with
>> > 	 * CCR here, as the syscall exit path will also do that for us.
>> > 	 * However there are some places, eg. the signal code, which
>> > 	 * check ccr to decide if the value in r3 is actually an error.
>> > 	 */
>> > 	if (error) {
>> > 		regs->ccr |= 0x10000000L;
>> > 		regs->gpr[3] = error;
>> > 	} else {
>> > 		regs->ccr &= ~0x10000000L;
>> > 		regs->gpr[3] = val;
>> > 	}
>> > 
>> > This fix brings syscall_set_return_value() in sync with syscall_get_error()
>> > and lets upcoming ptrace/set_syscall_info selftest pass on powerpc.
>> > 
>> > Fixes: 1b1a3702a65c ("powerpc: Don't negate error in syscall_set_return_value()").
>> > =======
>> 
>> I think there is still something going wrong.
>> 
>> do_seccomp() sets regs->gpr[3] = -ENOSYS; by default.
>> 
>> Then it calls __secure_computing() which returns what __seccomp_filter() 
>> returns.
>> 
>> In case of error, __seccomp_filter() calls syscall_set_return_value() 
>> with a negative value then returns -1
>> 
>> do_seccomp() is called by do_syscall_trace_enter() which returns -1 when 
>> do_seccomp() doesn't return 0.
>> 
>> do_syscall_trace_enter() is called by system_call_exception() and
>> returns -1, so system_call_exception() returns regs->gpr[3]
>> 
>> In entry_32.S, transfer_to_syscall, syscall_exit_prepare() is then
>> called with the return of system_call_exception() as first parameter,
>> which leads to:
>> 
>> 	if (unlikely(r3 >= (unsigned long)-MAX_ERRNO) && is_not_scv) {
>> 		if (likely(!(ti_flags & (_TIF_NOERROR | _TIF_RESTOREALL)))) {
>> 			r3 = -r3;
>> 			regs->ccr |= 0x10000000; /* Set SO bit in CR */
>> 		}
>> 	}
>> 
>> By chance, because you have already changed the sign of gpr[3], the 
>> above test fails and nothing is done to r3, and because you have also 
>> already set regs->ccr it works.
>> 
>> But all this looks inconsistent with the fact that do_seccomp sets 
>> -ENOSYS as default value
>> 
>> Also, when do_seccomp() returns 0, do_syscall_trace_enter() checks the
>> syscall number and when it is wrong it goes to skip: which sets
>> regs->gpr[3] = -ENOSYS;
>> 
>> So really I think it is not in line with your changes to set positive 
>> value in gpr[3].
>> 
>> Maybe your change is still correct but it needs to be handled completely 
>> in that case.
>
> Indeed, there is an inconsistency in !trap_is_scv case.
>
> In some places such as syscall_get_error() and regs_return_value() the
> semantics is as I described earlier: gpr[3] contains a positive ERRORCODE
> and ccr has 0x10000000 flag set.  This semantics is a part of the ABI and
> therefore cannot be changed.
>
> In some other places like do_seccomp() and do_syscall_trace_enter() the
> semantics is similar to the trap_is_scv case: gpr[3] contains a negative
> ERRORCODE and ccr is unchanged.  In addition, system_call_exception()
> returns the system call function return value when it is executed, and
> gpr[3] otherwise.  The value returned by system_call_exception() is passed
> on to syscall_exit_prepare() which performs the conversion you mentioned.
>
> What's remarkable is that in those places that are a part of the ABI the
> traditional semantics is kept, while in other places the implementation
> follows the trap_is_scv-like semantics, while traditional semantics is
> also supported there.

scv didn't exist when the seccomp code was written so that's not really
the right way to look at it.

The distinction was between the in-kernel semantic of negative
ERRORCODE, which is used everywhere, vs the original (non-scv) syscall
ABI which uses positive ERRORCODE and CCR.SO.

The way I wrote it at the time was to try and maintain the negative
ERRORCODE semantic in the kernel, and only flip to positive ERRORCODE
when we actually exit to userspace.

But even back then syscall_set_return_value() needed to set CCR.SO to
make some cases work, so it was probably the wrong design.
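
For illustration only, here is a minimal userspace sketch of the two conventions described above, assuming the non-scv case. The struct model_regs type, the CR0_SO macro and the model_* function names are hypothetical stand-ins, not kernel code; the setter follows the negation proposed in this patch and the getter follows the syscall_get_error() excerpt quoted earlier in the thread.

```c
#include <errno.h>

#define CR0_SO 0x10000000UL	/* the flag the thread calls CCR.SO */

/* Hypothetical stand-in for the two pt_regs fields discussed here. */
struct model_regs {
	unsigned long gpr3;
	unsigned long ccr;
};

/* Setter with the negation proposed in this patch (non-scv case only). */
static void model_set_return_value(struct model_regs *regs, int error, long val)
{
	if (error) {
		regs->ccr |= CR0_SO;
		/* The caller passes -ERRORCODE, the register gets ERRORCODE. */
		regs->gpr3 = (unsigned long)-(long)error;
	} else {
		regs->ccr &= ~CR0_SO;
		regs->gpr3 = (unsigned long)val;
	}
}

/* Getter mirroring the quoted syscall_get_error() logic. */
static long model_get_error(const struct model_regs *regs)
{
	return (regs->ccr & CR0_SO) ? -(long)regs->gpr3 : 0;
}
```

With this model, setting -ENOSYS and reading the error back yields -ENOSYS again, which is the "in sync" property the patch description refers to.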

> The only case where I see some intersection is do_seccomp() where the
> tracer would be able to see -ENOSYS in gpr[3].  However, the seccomp stop
> is not the place where the tracer *reads* the system call exit status,
> so whatever was written in gpr[3] before __secure_computing() is not
> really relevant, consequently, selftests/seccomp/seccomp_bpf passes with
> this patch applied as well as without it.
 
IIRC it is important for a tracer that blocks the syscall but doesn't
explicitly set the return value. But it's only important that the
default return value is syscall failure (ie. ENOSYS/-ENOSYS), the actual
sign of the r3 value should be irrelevant to the tracer.

If the selftest still passes then that's probably sufficient.

cheers

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 1/7] powerpc: properly negate error in syscall_set_return_value()
  2025-01-23 23:43           ` Dmitry V. Levin
  2025-01-24 15:18             ` Alexey Gladkov
@ 2025-01-25 12:17             ` Michael Ellerman
  2025-01-25 20:48               ` Dmitry V. Levin
  1 sibling, 1 reply; 65+ messages in thread
From: Michael Ellerman @ 2025-01-25 12:17 UTC (permalink / raw)
  To: Dmitry V. Levin, Christophe Leroy
  Cc: Alexey Gladkov, Oleg Nesterov, Eugene Syromyatnikov,
	Mike Frysinger, Renzo Davoli, Davide Berardi, strace-devel,
	Madhavan Srinivasan, Nicholas Piggin, Naveen N Rao, linuxppc-dev,
	linux-kernel

"Dmitry V. Levin" <ldv@strace.io> writes:
> On Thu, Jan 23, 2025 at 08:28:15PM +0200, Dmitry V. Levin wrote:
...
>> After looking at system_call_exception() I doubt this inconsistency can be
>> easily avoided, so I don't see how this patch could be enhanced further,
>> and what else could I do with the patch besides dropping it and letting
>> !trap_is_scv case be unsupported by PTRACE_SET_SYSCALL_INFO API, which
>> would be unfortunate.
>
> If you say this would bring some consistency, I can extend the patch with
> something like this:

Yes that would improve things IMHO, with one caveat ....

> diff --git a/arch/powerpc/kernel/ptrace/ptrace.c b/arch/powerpc/kernel/ptrace/ptrace.c
> index 727ed4a14545..dda276a934fd 100644
> --- a/arch/powerpc/kernel/ptrace/ptrace.c
> +++ b/arch/powerpc/kernel/ptrace/ptrace.c
> @@ -207,7 +207,7 @@ static int do_seccomp(struct pt_regs *regs)
>  	 * syscall parameter. This is different to the ptrace ABI where
>  	 * both r3 and orig_gpr3 contain the first syscall parameter.
>  	 */
> -	regs->gpr[3] = -ENOSYS;
> +	syscall_set_return_value(current, regs, -ENOSYS, 0);
>  
>  	/*
>  	 * We use the __ version here because we have already checked
> @@ -225,7 +225,7 @@ static int do_seccomp(struct pt_regs *regs)
>  	 * modify the first syscall parameter (in orig_gpr3) and also
>  	 * allow the syscall to proceed.
>  	 */
> -	regs->gpr[3] = regs->orig_gpr3;
> +	syscall_set_return_value(current, regs, 0, regs->orig_gpr3);

This case should remain as-is. The orig_gpr3 value here is not a syscall
error code, it's the original r3 value, which is a syscall parameter.

If the tracer wants to fail the syscall it should have set something in
r3, not orig_gpr3.

>  	return 0;
>  }
> @@ -315,7 +315,7 @@ long do_syscall_trace_enter(struct pt_regs *regs)
>  	 * If we are aborting explicitly, or if the syscall number is
>  	 * now invalid, set the return value to -ENOSYS.
>  	 */
> -	regs->gpr[3] = -ENOSYS;
> +	syscall_set_return_value(current, regs, -ENOSYS, 0);
>  	return -1;
>  }
>  
> diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c
> index aa17e62f3754..c921e0cb54b8 100644
> --- a/arch/powerpc/kernel/signal.c
> +++ b/arch/powerpc/kernel/signal.c
> @@ -229,14 +229,8 @@ static void check_syscall_restart(struct pt_regs *regs, struct k_sigaction *ka,
>  		regs_add_return_ip(regs, -4);
>  		regs->result = 0;
>  	} else {
> -		if (trap_is_scv(regs)) {
> -			regs->result = -EINTR;
> -			regs->gpr[3] = -EINTR;
> -		} else {
> -			regs->result = -EINTR;
> -			regs->gpr[3] = EINTR;
> -			regs->ccr |= 0x10000000;
> -		}
> +		regs->result = -EINTR;
> +		syscall_set_return_value(current, regs, -EINTR, 0);
>  	}
>  }

cheers

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 1/7] powerpc: properly negate error in syscall_set_return_value()
  2025-01-24 15:18             ` Alexey Gladkov
  2025-01-25  0:25               ` Dmitry V. Levin
@ 2025-01-25 12:18               ` Michael Ellerman
  2025-01-27 11:13                 ` Dmitry V. Levin
  1 sibling, 1 reply; 65+ messages in thread
From: Michael Ellerman @ 2025-01-25 12:18 UTC (permalink / raw)
  To: Alexey Gladkov, Dmitry V. Levin
  Cc: Christophe Leroy, Oleg Nesterov, Eugene Syromyatnikov,
	Mike Frysinger, Renzo Davoli, Davide Berardi, strace-devel,
	Madhavan Srinivasan, Nicholas Piggin, Naveen N Rao, linuxppc-dev,
	linux-kernel

Alexey Gladkov <legion@kernel.org> writes:
>
...
> I'm not a powerpc expert but shouldn't regs->gpr[3] be accessed via
> regs_return_value() in system_call_exception()?

Yes I agree.

> notrace long system_call_exception(struct pt_regs *regs, unsigned long r0)
> {
> ...
> 		r0 = do_syscall_trace_enter(regs);
> 		if (unlikely(r0 >= NR_syscalls))
> 			return regs->gpr[3];

This is the case where we're expecting the r3 value to be a negative
error code, to match the in-kernel semantics. But after this change it
would be a positive error value. It is probably harmless with the
current code structure, but that's just luck.

cheers

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 1/7] powerpc: properly negate error in syscall_set_return_value()
  2025-01-25 12:17             ` Michael Ellerman
@ 2025-01-25 20:48               ` Dmitry V. Levin
  0 siblings, 0 replies; 65+ messages in thread
From: Dmitry V. Levin @ 2025-01-25 20:48 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Christophe Leroy, Alexey Gladkov, Oleg Nesterov,
	Eugene Syromyatnikov, Mike Frysinger, Renzo Davoli,
	Davide Berardi, strace-devel, Madhavan Srinivasan,
	Nicholas Piggin, Naveen N Rao, linuxppc-dev, linux-kernel

On Sat, Jan 25, 2025 at 11:17:58PM +1100, Michael Ellerman wrote:
> "Dmitry V. Levin" <ldv@strace.io> writes:
> > On Thu, Jan 23, 2025 at 08:28:15PM +0200, Dmitry V. Levin wrote:
> ...
> >> After looking at system_call_exception() I doubt this inconsistency can be
> >> easily avoided, so I don't see how this patch could be enhanced further,
> >> and what else could I do with the patch besides dropping it and letting
> >> !trap_is_scv case be unsupported by PTRACE_SET_SYSCALL_INFO API, which
> >> would be unfortunate.
> >
> > If you say this would bring some consistency, I can extend the patch with
> > something like this:
> 
> Yes that would improve things IMHO, with one caveat ....
> 
> > diff --git a/arch/powerpc/kernel/ptrace/ptrace.c b/arch/powerpc/kernel/ptrace/ptrace.c
> > index 727ed4a14545..dda276a934fd 100644
> > --- a/arch/powerpc/kernel/ptrace/ptrace.c
> > +++ b/arch/powerpc/kernel/ptrace/ptrace.c
> > @@ -207,7 +207,7 @@ static int do_seccomp(struct pt_regs *regs)
> >  	 * syscall parameter. This is different to the ptrace ABI where
> >  	 * both r3 and orig_gpr3 contain the first syscall parameter.
> >  	 */
> > -	regs->gpr[3] = -ENOSYS;
> > +	syscall_set_return_value(current, regs, -ENOSYS, 0);
> >  
> >  	/*
> >  	 * We use the __ version here because we have already checked
> > @@ -225,7 +225,7 @@ static int do_seccomp(struct pt_regs *regs)
> >  	 * modify the first syscall parameter (in orig_gpr3) and also
> >  	 * allow the syscall to proceed.
> >  	 */
> > -	regs->gpr[3] = regs->orig_gpr3;
> > +	syscall_set_return_value(current, regs, 0, regs->orig_gpr3);
> 
> This case should remain as-is. The orig_gpr3 value here is not a syscall
> error code, it's the original r3 value, which is a syscall parameter.

I agree, but shouldn't CCR.SO be cleared somehow after it was set earlier by
	syscall_set_return_value(current, regs, -ENOSYS, 0);
?
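
To make the question concrete, a small userspace sketch follows, assuming the non-scv case. All names (struct model_regs, CR0_SO, the model_* helpers) are hypothetical, and the second restore helper is only one conceivable answer; the sketch does not claim that clearing the bit is the right fix.

```c
#include <errno.h>

#define CR0_SO 0x10000000UL

/* Hypothetical stand-in for the pt_regs fields involved. */
struct model_regs {
	unsigned long gpr3;
	unsigned long orig_gpr3;
	unsigned long ccr;
};

/* Default failure as the negating setter would record it (non-scv). */
static void model_set_default_enosys(struct model_regs *regs)
{
	regs->ccr |= CR0_SO;
	regs->gpr3 = ENOSYS;
}

/* Plain restore of the first syscall argument: ccr is left untouched. */
static void model_restore_arg_plain(struct model_regs *regs)
{
	regs->gpr3 = regs->orig_gpr3;
}

/* Hypothetical variant that also drops the error flag. */
static void model_restore_arg_clear_so(struct model_regs *regs)
{
	regs->gpr3 = regs->orig_gpr3;
	regs->ccr &= ~CR0_SO;
}
```

In this model the plain restore leaves the flag set after the default -ENOSYS was recorded through the setter, which is the situation the question is about.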


-- 
ldv

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 1/7] powerpc: properly negate error in syscall_set_return_value()
  2025-01-25 12:17           ` Michael Ellerman
@ 2025-01-25 21:25             ` Dmitry V. Levin
  0 siblings, 0 replies; 65+ messages in thread
From: Dmitry V. Levin @ 2025-01-25 21:25 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Christophe Leroy, Alexey Gladkov, Oleg Nesterov,
	Eugene Syromyatnikov, Mike Frysinger, Renzo Davoli,
	Davide Berardi, strace-devel, Madhavan Srinivasan,
	Nicholas Piggin, Naveen N Rao, linuxppc-dev, linux-kernel

On Sat, Jan 25, 2025 at 11:17:45PM +1100, Michael Ellerman wrote:
> "Dmitry V. Levin" <ldv@strace.io> writes:
[...]
> > The only case where I see some intersection is do_seccomp() where the
> > tracer would be able to see -ENOSYS in gpr[3].  However, the seccomp stop
> > is not the place where the tracer *reads* the system call exit status,
> > so whatever was written in gpr[3] before __secure_computing() is not
> > really relevant, consequently, selftests/seccomp/seccomp_bpf passes with
> > this patch applied as well as without it.
>  
> IIRC it is important for a tracer that blocks the syscall but doesn't
> explicitly set the return value. But it's only important that the
> default return value is syscall failure (ie. ENOSYS/-ENOSYS), the actual
> sign of the r3 value should be irrelevant to the tracer.
> 
> If the selftest still passes then that's probably sufficient.

Yes, I failed to explain this properly, thanks for correcting me.
With the current implementation, both -ENOSYS and ENOSYS/cr0.SO semantics
of the error code at the __secure_computing() stage lead to the same result;
this is why the seccomp_bpf selftest passes regardless of the patch.

At any point where the tracer is entitled to interpret gpr[3] as a syscall
return value, the semantics of gpr[3] is well-defined (ERRORCODE/cr0.SO
in non-scv case) and is a part of the ABI.

However, since we have to provide backwards compatibility with the current
inconsistent implementation, in the non-scv case we have to continue
supporting both -ENOSYS and ENOSYS/cr0.SO semantics of the syscall return
value set by the tracer at __secure_computing() stage.
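
The "same result" claim can be checked with a small userspace model of the non-scv conversion quoted earlier from syscall_exit_prepare(). This is a sketch under assumptions: struct model_regs, CR0_SO and model_exit_prepare() are hypothetical names, the _TIF_NOERROR/_TIF_RESTOREALL check is left out, and the result is written straight back into the model register.

```c
#include <errno.h>

#define CR0_SO    0x10000000UL
#define MAX_ERRNO 4095

/* Hypothetical stand-in for the two pt_regs fields discussed here. */
struct model_regs {
	unsigned long gpr3;
	unsigned long ccr;
};

/* Model of the quoted exit-path conversion for the non-scv case. */
static void model_exit_prepare(struct model_regs *regs)
{
	unsigned long r3 = regs->gpr3;

	/* Only values in the -MAX_ERRNO..-1 range are treated as errors. */
	if (r3 >= (unsigned long)-MAX_ERRNO) {
		r3 = -r3;
		regs->ccr |= CR0_SO;
	}
	regs->gpr3 = r3;
}
```

Starting from either { -ENOSYS, SO clear } or { ENOSYS, SO set }, the model ends in the same state: a positive ENOSYS with the flag set.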


-- 
ldv

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 1/7] powerpc: properly negate error in syscall_set_return_value()
  2025-01-25 12:18               ` Michael Ellerman
@ 2025-01-27 11:13                 ` Dmitry V. Levin
  0 siblings, 0 replies; 65+ messages in thread
From: Dmitry V. Levin @ 2025-01-27 11:13 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Alexey Gladkov, Christophe Leroy, Oleg Nesterov,
	Eugene Syromyatnikov, Mike Frysinger, Renzo Davoli,
	Davide Berardi, strace-devel, Madhavan Srinivasan,
	Nicholas Piggin, Naveen N Rao, linuxppc-dev, linux-kernel

On Sat, Jan 25, 2025 at 11:18:06PM +1100, Michael Ellerman wrote:
> Alexey Gladkov <legion@kernel.org> writes:
> >
> ...
> > I'm not a powerpc expert but shouldn't regs->gpr[3] be accessed via
> > regs_return_value() in system_call_exception()?
> 
> Yes I agree.
> 
> > notrace long system_call_exception(struct pt_regs *regs, unsigned long r0)
> > {
> > ...
> > 		r0 = do_syscall_trace_enter(regs);
> > 		if (unlikely(r0 >= NR_syscalls))
> > 			return regs->gpr[3];
> 
> This is the case where we're expecting the r3 value to be a negative
> error code, to match the in-kernel semantics. But after this change it
> would be a positive error value. It is probably harmless with the
> current code structure, but that's just luck.

I'm afraid that's not just luck.  From the very beginning, do_seccomp()
has supported both the generic kernel -ERRORCODE return value ABI and the
powerpc sc syscall return ABI, thanks to syscall_exit_prepare() that
converts the former to the latter.  Given that this inconsistency has been
exposed to user space via PTRACE_EVENT_SECCOMP tracers for so many years,
I suppose backwards compatibility has to be provided.  Consequently, from
the point of __secure_computing() invocation up to the point of
conversion in syscall_exit_prepare(), gpr[3] may be set according to
either of these two ABIs.  Unfortunately, this means any future attempt
to avoid the inconsistency would be inherently incomplete.

For this reason, I doubt it would make sense to include into the patch
any changes that are needed only to address this consistency issue.


-- 
ldv

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 1/7] powerpc: properly negate error in syscall_set_return_value()
  2025-01-23 22:07           ` Christophe Leroy
  2025-01-23 22:35             ` Dmitry V. Levin
@ 2025-01-27 11:20             ` Dmitry V. Levin
  2025-01-27 11:36               ` Christophe Leroy
  1 sibling, 1 reply; 65+ messages in thread
From: Dmitry V. Levin @ 2025-01-27 11:20 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Alexey Gladkov, Oleg Nesterov, Michael Ellerman,
	Eugene Syromyatnikov, Mike Frysinger, Renzo Davoli,
	Davide Berardi, strace-devel, Madhavan Srinivasan,
	Nicholas Piggin, Naveen N Rao, linuxppc-dev, linux-kernel

On Thu, Jan 23, 2025 at 11:07:21PM +0100, Christophe Leroy wrote:
[...]
> To add a bit more to the confusion,

Looks like there is no end to it:

static inline long regs_return_value(struct pt_regs *regs)
{
        if (trap_is_scv(regs))
                return regs->gpr[3];

        if (is_syscall_success(regs))
                return regs->gpr[3];
        else
                return -regs->gpr[3];
}

static inline void regs_set_return_value(struct pt_regs *regs, unsigned long rc)
{
        regs->gpr[3] = rc;
}

This doesn't look consistent, does it?
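
A minimal userspace sketch of the asymmetry, for illustration only. The names struct model_regs, CR0_SO and the model_* functions are hypothetical, and model_is_syscall_success() is an assumption about the non-scv definition (success meaning the cr0.SO flag is clear), since that helper is not quoted in this thread.

```c
#include <errno.h>
#include <stdbool.h>

#define CR0_SO 0x10000000UL

/* Hypothetical stand-in for pt_regs plus a flag replacing trap_is_scv(). */
struct model_regs {
	unsigned long gpr3;
	unsigned long ccr;
	bool scv;
};

/* Assumed non-scv definition: success means CR0.SO is clear. */
static bool model_is_syscall_success(const struct model_regs *regs)
{
	return !(regs->ccr & CR0_SO);
}

/* Reader modelled on the regs_return_value() shown above. */
static long model_regs_return_value(const struct model_regs *regs)
{
	if (regs->scv || model_is_syscall_success(regs))
		return (long)regs->gpr3;
	return -(long)regs->gpr3;
}

/* Writer modelled on the regs_set_return_value() shown above. */
static void model_regs_set_return_value(struct model_regs *regs, unsigned long rc)
{
	regs->gpr3 = rc;
}
```

In this model, writing -EINTR and reading it back gives +EINTR when the flag happens to be set in the non-scv case, and -EINTR otherwise, so the round trip depends on state that the writer never touches.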


-- 
ldv

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 1/7] powerpc: properly negate error in syscall_set_return_value()
  2025-01-27 11:20             ` Dmitry V. Levin
@ 2025-01-27 11:36               ` Christophe Leroy
  2025-01-27 11:44                 ` Dmitry V. Levin
  0 siblings, 1 reply; 65+ messages in thread
From: Christophe Leroy @ 2025-01-27 11:36 UTC (permalink / raw)
  To: Dmitry V. Levin
  Cc: Alexey Gladkov, Oleg Nesterov, Michael Ellerman,
	Eugene Syromyatnikov, Mike Frysinger, Renzo Davoli,
	Davide Berardi, strace-devel, Madhavan Srinivasan,
	Nicholas Piggin, Naveen N Rao, linuxppc-dev, linux-kernel



Le 27/01/2025 à 12:20, Dmitry V. Levin a écrit :
> On Thu, Jan 23, 2025 at 11:07:21PM +0100, Christophe Leroy wrote:
> [...]
>> To add a bit more to the confusion,
> 
> Looks like there is no end to it:
> 
> static inline long regs_return_value(struct pt_regs *regs)
> {
>          if (trap_is_scv(regs))
>                  return regs->gpr[3];
> 
>          if (is_syscall_success(regs))
>                  return regs->gpr[3];
>          else
>                  return -regs->gpr[3];
> }
> 
> static inline void regs_set_return_value(struct pt_regs *regs, unsigned long rc)
> {
>          regs->gpr[3] = rc;
> }
> 
> This doesn't look consistent, does it?
> 
> 

That regs_set_return_value() looks pretty similar to 
syscall_get_return_value().

regs_set_return_value() documentation in asm-generic/syscall.h
explicitly says: This value is meaningless if syscall_get_error()
returned nonzero

Is it the same with regs_set_return_value(), only meaningful where
there is no error?

By the way, why have two very similar APIs, one in syscall.h one in 
ptrace.h ?

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 1/7] powerpc: properly negate error in syscall_set_return_value()
  2025-01-27 11:36               ` Christophe Leroy
@ 2025-01-27 11:44                 ` Dmitry V. Levin
  2025-01-27 12:04                   ` Christophe Leroy
  0 siblings, 1 reply; 65+ messages in thread
From: Dmitry V. Levin @ 2025-01-27 11:44 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Alexey Gladkov, Oleg Nesterov, Michael Ellerman,
	Eugene Syromyatnikov, Mike Frysinger, Renzo Davoli,
	Davide Berardi, strace-devel, Madhavan Srinivasan,
	Nicholas Piggin, Naveen N Rao, linuxppc-dev, linux-kernel

On Mon, Jan 27, 2025 at 12:36:53PM +0100, Christophe Leroy wrote:
> Le 27/01/2025 à 12:20, Dmitry V. Levin a écrit :
> > On Thu, Jan 23, 2025 at 11:07:21PM +0100, Christophe Leroy wrote:
> > [...]
> >> To add a bit more to the confusion,
> > 
> > Looks like there is no end to it:
> > 
> > static inline long regs_return_value(struct pt_regs *regs)
> > {
> >          if (trap_is_scv(regs))
> >                  return regs->gpr[3];
> > 
> >          if (is_syscall_success(regs))
> >                  return regs->gpr[3];
> >          else
> >                  return -regs->gpr[3];
> > }
> > 
> > static inline void regs_set_return_value(struct pt_regs *regs, unsigned long rc)
> > {
> >          regs->gpr[3] = rc;
> > }
> > 
> > This doesn't look consistent, does it?
> > 
> > 
> 
> That regs_set_return_value() looks pretty similar to 
> syscall_get_return_value().

Yes, but here similarities end, and differences begin.

> regs_set_return_value() documentation in asm-generic/syscall.h
> explicitly says: This value is meaningless if syscall_get_error()
> returned nonzero
> 
> Is it the same with regs_set_return_value(), only meaningful where
> there is no error?

Did you mean syscall_set_return_value?  No, it explicitly has two
arguments, "int error" and "long val", so it can be used to either
clear or set the error condition as specified by the caller.

> By the way, why have two very similar APIs, one in syscall.h one in 
> ptrace.h ?

I have no polite answer to this, sorry.


-- 
ldv

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 1/7] powerpc: properly negate error in syscall_set_return_value()
  2025-01-27 11:44                 ` Dmitry V. Levin
@ 2025-01-27 12:04                   ` Christophe Leroy
  2025-01-27 12:26                     ` Dmitry V. Levin
  0 siblings, 1 reply; 65+ messages in thread
From: Christophe Leroy @ 2025-01-27 12:04 UTC (permalink / raw)
  To: Dmitry V. Levin
  Cc: Alexey Gladkov, Oleg Nesterov, Michael Ellerman,
	Eugene Syromyatnikov, Mike Frysinger, Renzo Davoli,
	Davide Berardi, strace-devel, Madhavan Srinivasan,
	Nicholas Piggin, Naveen N Rao, linuxppc-dev, linux-kernel



Le 27/01/2025 à 12:44, Dmitry V. Levin a écrit :
> On Mon, Jan 27, 2025 at 12:36:53PM +0100, Christophe Leroy wrote:
>> Le 27/01/2025 à 12:20, Dmitry V. Levin a écrit :
>>> On Thu, Jan 23, 2025 at 11:07:21PM +0100, Christophe Leroy wrote:
>>> [...]
>>>> To add a bit more to the confusion,
>>>
>>> Looks like there is no end to it:
>>>
>>> static inline long regs_return_value(struct pt_regs *regs)
>>> {
>>>           if (trap_is_scv(regs))
>>>                   return regs->gpr[3];
>>>
>>>           if (is_syscall_success(regs))
>>>                   return regs->gpr[3];
>>>           else
>>>                   return -regs->gpr[3];
>>> }
>>>
>>> static inline void regs_set_return_value(struct pt_regs *regs, unsigned long rc)
>>> {
>>>           regs->gpr[3] = rc;
>>> }
>>>
>>> This doesn't look consistent, does it?
>>>
>>>
>>
>> That regs_set_return_value() looks pretty similar to
>> syscall_get_return_value().
> 
> Yes, but here similarities end, and differences begin.
> 
>> regs_set_return_value() documentation in asm-generic/syscall.h
>> explicitly says: This value is meaningless if syscall_get_error()
>> returned nonzero
>>
>> Is it the same with regs_set_return_value(), only meaningful where
>> there is no error?
> 
> Did you mean syscall_set_return_value?  No, it explicitly has two
> arguments, "int error" and "long val", so it can be used to either
> clear or set the error condition as specified by the caller.

Sorry, I mean syscall_get_return_value() here.

static inline long syscall_get_return_value(struct task_struct *task,
					    struct pt_regs *regs)
{
	return regs->gpr[3];
}

Versus

static inline void regs_set_return_value(struct pt_regs *regs, unsigned long rc)
{
	regs->gpr[3] = rc;
}

> 
>> By the way, why have two very similar APIs, one in syscall.h one in
>> ptrace.h ?
> 
> I have no polite answer to this, sorry.
> 
> 


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 1/7] powerpc: properly negate error in syscall_set_return_value()
  2025-01-27 12:04                   ` Christophe Leroy
@ 2025-01-27 12:26                     ` Dmitry V. Levin
  0 siblings, 0 replies; 65+ messages in thread
From: Dmitry V. Levin @ 2025-01-27 12:26 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Alexey Gladkov, Oleg Nesterov, Michael Ellerman,
	Eugene Syromyatnikov, Mike Frysinger, Renzo Davoli,
	Davide Berardi, strace-devel, Madhavan Srinivasan,
	Nicholas Piggin, Naveen N Rao, linuxppc-dev, linux-kernel

On Mon, Jan 27, 2025 at 01:04:27PM +0100, Christophe Leroy wrote:
> 
> 
> On 27/01/2025 at 12:44, Dmitry V. Levin wrote:
> > On Mon, Jan 27, 2025 at 12:36:53PM +0100, Christophe Leroy wrote:
> >> On 27/01/2025 at 12:20, Dmitry V. Levin wrote:
> >>> On Thu, Jan 23, 2025 at 11:07:21PM +0100, Christophe Leroy wrote:
> >>> [...]
> >>>> To add a bit more to the confusion,
> >>>
> >>> Looks like there is no end to it:
> >>>
> >>> static inline long regs_return_value(struct pt_regs *regs)
> >>> {
> >>>           if (trap_is_scv(regs))
> >>>                   return regs->gpr[3];
> >>>
> >>>           if (is_syscall_success(regs))
> >>>                   return regs->gpr[3];
> >>>           else
> >>>                   return -regs->gpr[3];
> >>> }
> >>>
> >>> static inline void regs_set_return_value(struct pt_regs *regs, unsigned long rc)
> >>> {
> >>>           regs->gpr[3] = rc;
> >>> }
> >>>
> >>> This doesn't look consistent, does it?
> >>>
> >>>
> >>
> >> That regs_set_return_value() looks pretty similar to
> >> syscall_get_return_value().
> > 
> > Yes, but here similarities end, and differences begin.
> > 
> >> regs_set_return_value() documentation in asm-generic/syscall.h
> >> explicitly says: This value is meaningless if syscall_get_error()
> >> returned nonzero
> >>
> >> Is it the same with regs_set_return_value(), only meaningful where
> >> there is no error?
> > 
> > Did you mean syscall_set_return_value?  No, it explicitly has two
> > arguments, "int error" and "long val", so it can be used to either
> > clear or set the error condition as specified by the caller.
> 
> Sorry, I mean syscall_get_return_value() here.
> 
> static inline long syscall_get_return_value(struct task_struct *task,
> 					    struct pt_regs *regs)
> {
> 	return regs->gpr[3];
> }
> 
> Versus
> 
> static inline void regs_set_return_value(struct pt_regs *regs, unsigned long rc)
> {
> 	regs->gpr[3] = rc;
> }

The asm/syscall.h API provides two functions to obtain the return value:
syscall_get_error() and syscall_get_return_value().  The first one is used
to obtain the error code when the error condition is set.  When the error
condition is not set, it returns 0.  The second function is used to obtain
the return value when the error condition is not set.  When the error
condition is set, its return value is undefined.
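
For illustration only, here is a minimal userspace sketch of how a caller
might combine the two getters, together with a setter that follows the
convention from patch 1/7.  The struct model_regs type and all model_*
names are hypothetical and are not kernel code; only the non-scv powerpc
case is modelled, where the error flag is bit 0x10000000 in ccr and gpr[3]
holds a positive error code.

```c
/* Hypothetical userspace model; none of these names exist in the kernel. */
struct model_regs {
	unsigned long gpr3;	/* stands in for regs->gpr[3] */
	unsigned long ccr;	/* stands in for regs->ccr */
};

#define MODEL_CR0_SO 0x10000000UL	/* error flag bit used in the patch */

/* Negative error code when the error condition is set, otherwise 0. */
static long model_get_error(const struct model_regs *regs)
{
	if (regs->ccr & MODEL_CR0_SO)
		return -(long)regs->gpr3;
	return 0;
}

/* Only meaningful when model_get_error() returned 0. */
static long model_get_return_value(const struct model_regs *regs)
{
	return (long)regs->gpr3;
}

/* Sets or clears the error condition as requested by the caller. */
static void model_set_return_value(struct model_regs *regs, int error, long val)
{
	if (error) {
		regs->ccr |= MODEL_CR0_SO;
		/* error is negative; store the positive error code */
		regs->gpr3 = (unsigned long)-(long)error;
	} else {
		regs->ccr &= ~MODEL_CR0_SO;
		regs->gpr3 = (unsigned long)val;
	}
}

/* Check the error first, and read the return value only if there is none. */
static long model_syscall_result(const struct model_regs *regs)
{
	long err = model_get_error(regs);

	return err ? err : model_get_return_value(regs);
}
```

In this model, model_set_return_value(&regs, -2, 0) leaves 2 in gpr3 with
the error bit set, and model_syscall_result(&regs) then yields -2, so the
setter and the getters agree on the sign.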


-- 
ldv


