linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 1/2] powerpc: properly negate error in syscall_set_return_value() in sc case
@ 2025-01-27 18:13 Dmitry V. Levin
  2025-01-27 18:14 ` [PATCH 2/2] powerpc: fix inconsistencies in syscall error return handling Dmitry V. Levin
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Dmitry V. Levin @ 2025-01-27 18:13 UTC (permalink / raw)
  To: Michael Ellerman, Christophe Leroy
  Cc: Alexey Gladkov, Eugene Syromyatnikov, Oleg Nesterov,
	Madhavan Srinivasan, Nicholas Piggin, Naveen N Rao, linuxppc-dev,
	strace-devel, linux-kernel

According to the Power Architecture Linux system call ABI documented in
[1], when the syscall is made with the sc instruction, both a value and an
error condition are returned, where r3 register contains the return value,
and cr0.SO bit specifies the error condition.  When cr0.SO is clear, the
syscall succeeded and r3 is the return value.  When cr0.SO is set, the
syscall failed and r3 is the error value.  This syscall return semantics
was implemented from the very beginning of Power Architecture on Linux,
and syscall tracers and debuggers like strace that read or modify syscall
return information also rely on this ABI.

r3 and cr0.SO are exposed directly via struct pt_regs where gpr[3] and
(ccr & 0x10000000) correspond to r3 and cr0.SO, respectively.
For example, here is an excerpt from check_syscall_restart() that assigns
these members of struct pt_regs:
        regs->result = -EINTR;
        regs->gpr[3] = EINTR;
        regs->ccr |= 0x10000000;
In this example, the semantics of negative ERRORCODE that's being used
virtually everywhere in generic kernel code is translated to powerpc sc
syscall return ABI which uses positive ERRORCODE and cr0.SO bit.

Also, r3 and cr0.SO are exposed indirectly via helpers.
For example, here is an excerpt from syscall_get_error():
        /*
         * If the system call failed,
         * regs->gpr[3] contains a positive ERRORCODE.
         */
        return (regs->ccr & 0x10000000UL) ? -regs->gpr[3] : 0;
and here is another example, from regs_return_value():
        if (is_syscall_success(regs))
                return regs->gpr[3];
        else
                return -regs->gpr[3];
In these examples, the powerpc sc syscall return ABI which uses positive
ERRORCODE and cr0.SO bit is translated to the semantics of negative
ERRORCODE that's being used virtually everywhere in generic kernel code.

Up to a certain point in time the kernel managed to implement the powerpc
sc syscall return ABI in all cases where struct pt_regs was exposed to user
space.

The situation changed when SECCOMP_RET_TRACE support was introduced.
At this point the -ERRORCODE semantics that was used under the hood to
implement seccomp on powerpc became exposed to user space.  The tracer
handling PTRACE_EVENT_SECCOMP is not just able to observe -ENOSYS in gpr[3]
- this is relatively harmless as at this stage there is no syscall return
yet so the powerpc sc syscall return ABI does not apply.  What's important
is that the tracer can change the syscall number to -1 thus making the
syscall fail, and at this point the tracer is also able to specify the
error value.  This has to be done in accordance with the syscall return
ABI, however, the current implementation of do_seccomp() supports both the
generic kernel -ERRORCODE return value ABI and the powerpc sc syscall
return ABI, thanks to syscall_exit_prepare() that converts the former to
the latter.  Consequently, seccomp_bpf selftest passes both with and
without this change.

Now comes the moment when PTRACE_SET_SYSCALL_INFO is going to be
introduced.  PTRACE_SET_SYSCALL_INFO is a generic ptrace API that
complements PTRACE_GET_SYSCALL_INFO by letting the ptracer modify
the details of the system calls the tracee is blocked in.

One of the helpers that have to be used to implement
PTRACE_SET_SYSCALL_INFO is syscall_set_return_value().
This helper complements other two helpers, syscall_get_error() and
syscall_get_return_value(), that are currently used to implement
PTRACE_GET_SYSCALL_INFO on syscall return.  When syscall_set_return_value()
is used to set an error code, the caller specifies it as a negative value
in -ERRORCODE format.

Unfortunately, this does not work well on powerpc since commit 1b1a3702a65c
("powerpc: Don't negate error in syscall_set_return_value()") because
syscall_set_return_value() does not follow the powerpc sc syscall return
ABI:
	/*
	 * In the general case it's not obvious that we must deal with
	 * CCR here, as the syscall exit path will also do that for us.
	 * However there are some places, eg. the signal code, which
	 * check ccr to decide if the value in r3 is actually an error.
	 */
	if (error) {
		regs->ccr |= 0x10000000L;
		regs->gpr[3] = error;
	} else {
		regs->ccr &= ~0x10000000L;
		regs->gpr[3] = val;
	}

The reason why this syscall_set_return_value() implementation was able to
get away with violating the powerpc sc syscall return ABI is the following:
Up to now, syscall_set_return_value() on powerpc could be called only from
do_syscall_trace_enter() via do_seccomp(), there was no way it could be
called from do_syscall_trace_leave() which is the point where tracers on
syscall return are activated and the powerpc sc syscall return ABI has
to be respected.

Introduction of PTRACE_SET_SYSCALL_INFO necessitates a change of
syscall_set_return_value() to comply with the powerpc sc syscall return
ABI.  Without the change, the upcoming ptrace/set_syscall_info selftest
fails with the following diagnostics:

  # set_syscall_info.c:119:set_syscall_info:Expected exp_exit->rval (-38) == info->exit.rval (38)
  # set_syscall_info.c:120:set_syscall_info:wait #4: PTRACE_GET_SYSCALL_INFO #2: exit stop mismatch

Note that since backwards compatibility with the current implementation has
to be provided, the kernel has to continue supporting simultaneously both
the generic kernel -ERRORCODE return value ABI and the powerpc sc syscall
return ABI at least for PTRACE_EVENT_SECCOMP tracers.  Consequently, since
the point of __secure_computing() invocation and up to the point of
conversion in syscall_exit_prepare(), gpr[3] may be set according to either
of these two ABIs.  An attempt to address code inconsistencies in syscall
error return handling that were introduced as a side effect of the dual
ABI support follows in a separate patch.

Link: https://www.kernel.org/doc/html/latest/arch/powerpc/syscall64-abi.html#return-value [1]
Signed-off-by: Dmitry V. Levin <ldv@strace.io>
Reviewed-by: Alexey Gladkov <legion@kernel.org>
---
 arch/powerpc/include/asm/syscall.h | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/syscall.h b/arch/powerpc/include/asm/syscall.h
index 3dd36c5e334a..422d7735ace6 100644
--- a/arch/powerpc/include/asm/syscall.h
+++ b/arch/powerpc/include/asm/syscall.h
@@ -82,7 +82,11 @@ static inline void syscall_set_return_value(struct task_struct *task,
 		 */
 		if (error) {
 			regs->ccr |= 0x10000000L;
-			regs->gpr[3] = error;
+			/*
+			 * In case of an error regs->gpr[3] contains
+			 * a positive ERRORCODE.
+			 */
+			regs->gpr[3] = -error;
 		} else {
 			regs->ccr &= ~0x10000000L;
 			regs->gpr[3] = val;
-- 
ldv

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 2/2] powerpc: fix inconsistencies in syscall error return handling
  2025-01-27 18:13 [PATCH 1/2] powerpc: properly negate error in syscall_set_return_value() in sc case Dmitry V. Levin
@ 2025-01-27 18:14 ` Dmitry V. Levin
  2025-01-28 18:01   ` Christophe Leroy
  2025-01-28 14:59 ` [PATCH 1/2] powerpc: properly negate error in syscall_set_return_value() in sc case Christophe Leroy
  2025-06-02 11:36 ` Dmitry V. Levin
  2 siblings, 1 reply; 10+ messages in thread
From: Dmitry V. Levin @ 2025-01-27 18:14 UTC (permalink / raw)
  To: Michael Ellerman, Christophe Leroy
  Cc: Alexey Gladkov, Eugene Syromyatnikov, Oleg Nesterov,
	Madhavan Srinivasan, Nicholas Piggin, Naveen N Rao, linuxppc-dev,
	strace-devel, linux-kernel

Since the introduction of SECCOMP_RET_TRACE support, the kernel supports
simultaneously both the generic kernel -ERRORCODE return value ABI and
the powerpc sc syscall return ABI for PTRACE_EVENT_SECCOMP tracers.
This change is an attempt to address the code inconsistencies in syscall
error return handling that were introduced as a side effect of the dual
ABI support.

Signed-off-by: Dmitry V. Levin <ldv@strace.io>
---
 arch/powerpc/kernel/ptrace/ptrace.c | 23 ++++++++++++++++++++---
 arch/powerpc/kernel/signal.c        | 11 +++--------
 arch/powerpc/kernel/syscall.c       |  6 +++---
 3 files changed, 26 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/kernel/ptrace/ptrace.c b/arch/powerpc/kernel/ptrace/ptrace.c
index 727ed4a14545..3778775bf6ba 100644
--- a/arch/powerpc/kernel/ptrace/ptrace.c
+++ b/arch/powerpc/kernel/ptrace/ptrace.c
@@ -207,7 +207,7 @@ static int do_seccomp(struct pt_regs *regs)
 	 * syscall parameter. This is different to the ptrace ABI where
 	 * both r3 and orig_gpr3 contain the first syscall parameter.
 	 */
-	regs->gpr[3] = -ENOSYS;
+	syscall_set_return_value(current, regs, -ENOSYS, 0);
 
 	/*
 	 * We use the __ version here because we have already checked
@@ -215,8 +215,18 @@ static int do_seccomp(struct pt_regs *regs)
 	 * have already loaded -ENOSYS into r3, or seccomp has put
 	 * something else in r3 (via SECCOMP_RET_ERRNO/TRACE).
 	 */
-	if (__secure_computing(NULL))
+	if (__secure_computing(NULL)) {
+
+		/*
+		 * Traditionally, both the generic kernel -ERRORCODE return
+		 * value ABI and the powerpc sc syscall return ABI is
+		 * supported.  For consistency, if the former is detected,
+		 * convert it to the latter.
+		 */
+		if (!trap_is_scv(regs) && IS_ERR_VALUE(regs->gpr[3]))
+			syscall_set_return_value(current, regs, regs->gpr[3], 0);
 		return -1;
+	}
 
 	/*
 	 * The syscall was allowed by seccomp, restore the register
@@ -226,6 +236,13 @@ static int do_seccomp(struct pt_regs *regs)
 	 * allow the syscall to proceed.
 	 */
 	regs->gpr[3] = regs->orig_gpr3;
+	if (!trap_is_scv(regs)) {
+		/*
+		 * Clear SO bit that was set in this function earlier by
+		 * syscall_set_return_value.
+		 */
+		regs->ccr &= ~0x10000000L;
+	}
 
 	return 0;
 }
@@ -315,7 +332,7 @@ long do_syscall_trace_enter(struct pt_regs *regs)
 	 * If we are aborting explicitly, or if the syscall number is
 	 * now invalid, set the return value to -ENOSYS.
 	 */
-	regs->gpr[3] = -ENOSYS;
+	syscall_set_return_value(current, regs, -ENOSYS, 0);
 	return -1;
 }
 
diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c
index aa17e62f3754..1a38d6bcaed6 100644
--- a/arch/powerpc/kernel/signal.c
+++ b/arch/powerpc/kernel/signal.c
@@ -19,6 +19,7 @@
 #include <asm/unistd.h>
 #include <asm/debug.h>
 #include <asm/tm.h>
+#include <asm/syscall.h>
 
 #include "signal.h"
 
@@ -229,14 +230,8 @@ static void check_syscall_restart(struct pt_regs *regs, struct k_sigaction *ka,
 		regs_add_return_ip(regs, -4);
 		regs->result = 0;
 	} else {
-		if (trap_is_scv(regs)) {
-			regs->result = -EINTR;
-			regs->gpr[3] = -EINTR;
-		} else {
-			regs->result = -EINTR;
-			regs->gpr[3] = EINTR;
-			regs->ccr |= 0x10000000;
-		}
+		regs->result = -EINTR;
+		syscall_set_return_value(current, regs, -EINTR, 0);
 	}
 }
 
diff --git a/arch/powerpc/kernel/syscall.c b/arch/powerpc/kernel/syscall.c
index be159ad4b77b..2fe47191e509 100644
--- a/arch/powerpc/kernel/syscall.c
+++ b/arch/powerpc/kernel/syscall.c
@@ -122,7 +122,7 @@ notrace long system_call_exception(struct pt_regs *regs, unsigned long r0)
 		if (unlikely(trap_is_unsupported_scv(regs))) {
 			/* Unsupported scv vector */
 			_exception(SIGILL, regs, ILL_ILLOPC, regs->nip);
-			return regs->gpr[3];
+			return regs_return_value(regs);
 		}
 		/*
 		 * We use the return value of do_syscall_trace_enter() as the
@@ -133,13 +133,13 @@ notrace long system_call_exception(struct pt_regs *regs, unsigned long r0)
 		 */
 		r0 = do_syscall_trace_enter(regs);
 		if (unlikely(r0 >= NR_syscalls))
-			return regs->gpr[3];
+			return regs_return_value(regs);
 
 	} else if (unlikely(r0 >= NR_syscalls)) {
 		if (unlikely(trap_is_unsupported_scv(regs))) {
 			/* Unsupported scv vector */
 			_exception(SIGILL, regs, ILL_ILLOPC, regs->nip);
-			return regs->gpr[3];
+			return regs_return_value(regs);
 		}
 		return -ENOSYS;
 	}
-- 
ldv

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/2] powerpc: properly negate error in syscall_set_return_value() in sc case
  2025-01-27 18:13 [PATCH 1/2] powerpc: properly negate error in syscall_set_return_value() in sc case Dmitry V. Levin
  2025-01-27 18:14 ` [PATCH 2/2] powerpc: fix inconsistencies in syscall error return handling Dmitry V. Levin
@ 2025-01-28 14:59 ` Christophe Leroy
  2025-01-28 15:52   ` Dmitry V. Levin
  2025-01-28 15:54   ` Eugene Syromyatnikov
  2025-06-02 11:36 ` Dmitry V. Levin
  2 siblings, 2 replies; 10+ messages in thread
From: Christophe Leroy @ 2025-01-28 14:59 UTC (permalink / raw)
  To: Dmitry V. Levin, Michael Ellerman
  Cc: Alexey Gladkov, Eugene Syromyatnikov, Oleg Nesterov,
	Madhavan Srinivasan, Nicholas Piggin, Naveen N Rao, linuxppc-dev,
	strace-devel, linux-kernel


Le 27/01/2025 à 19:13, Dmitry V. Levin a écrit :
> According to the Power Architecture Linux system call ABI documented in
> [1], when the syscall is made with the sc instruction, both a value and an
> error condition are returned, where r3 register contains the return value,
> and cr0.SO bit specifies the error condition.  When cr0.SO is clear, the
> syscall succeeded and r3 is the return value.  When cr0.SO is set, the
> syscall failed and r3 is the error value.  This syscall return semantics
> was implemented from the very beginning of Power Architecture on Linux,
> and syscall tracers and debuggers like strace that read or modify syscall
> return information also rely on this ABI.

I see a quite similar ABI on microblaze, mips, nios2 and sparc. Do they 
behave all the same ?

> 
> r3 and cr0.SO are exposed directly via struct pt_regs where gpr[3] and
> (ccr & 0x10000000) correspond to r3 and cr0.SO, respectively.
> For example, here is an excerpt from check_syscall_restart() that assigns
> these members of struct pt_regs:
>          regs->result = -EINTR;
>          regs->gpr[3] = EINTR;
>          regs->ccr |= 0x10000000;
> In this example, the semantics of negative ERRORCODE that's being used
> virtually everywhere in generic kernel code is translated to powerpc sc
> syscall return ABI which uses positive ERRORCODE and cr0.SO bit.

At what point are they exposed really ? At what point do they need to 
comply with the ABI ?

I'm also a bit lost between regs->orig_r3, regs->gpr[3] and regs->result.

The comment added by commit 1b1a3702a65c ("powerpc: Don't negate error 
in syscall_set_return_value()") says that CCR needs to be set because of 
signal code. But signal code is invoked by syscall_exit_prepare() 
through call to interrupt_exit_user_prepare_main() after setting CR[SO] 
and negating syscall result.

> 
> Also, r3 and cr0.SO are exposed indirectly via helpers.
> For example, here is an excerpt from syscall_get_error():
>          /*
>           * If the system call failed,
>           * regs->gpr[3] contains a positive ERRORCODE.
>           */
>          return (regs->ccr & 0x10000000UL) ? -regs->gpr[3] : 0;
> and here is another example, from regs_return_value():
>          if (is_syscall_success(regs))
>                  return regs->gpr[3];
>          else
>                  return -regs->gpr[3];
> In these examples, the powerpc sc syscall return ABI which uses positive
> ERRORCODE and cr0.SO bit is translated to the semantics of negative
> ERRORCODE that's being used virtually everywhere in generic kernel code.
> 
> Up to a certain point in time the kernel managed to implement the powerpc
> sc syscall return ABI in all cases where struct pt_regs was exposed to user
> space.
> 
> The situation changed when SECCOMP_RET_TRACE support was introduced.
> At this point the -ERRORCODE semantics that was used under the hood to
> implement seccomp on powerpc became exposed to user space.  The tracer
> handling PTRACE_EVENT_SECCOMP is not just able to observe -ENOSYS in gpr[3]
> - this is relatively harmless as at this stage there is no syscall return
> yet so the powerpc sc syscall return ABI does not apply.  What's important
> is that the tracer can change the syscall number to -1 thus making the
> syscall fail, and at this point the tracer is also able to specify the
> error value.  This has to be done in accordance with the syscall return
> ABI, however, the current implementation of do_seccomp() supports both the
> generic kernel -ERRORCODE return value ABI and the powerpc sc syscall
> return ABI, thanks to syscall_exit_prepare() that converts the former to
> the latter.  Consequently, seccomp_bpf selftest passes both with and
> without this change.
> 
> Now comes the moment when PTRACE_SET_SYSCALL_INFO is going to be
> introduced.  PTRACE_SET_SYSCALL_INFO is a generic ptrace API that
> complements PTRACE_GET_SYSCALL_INFO by letting the ptracer modify
> the details of the system calls the tracee is blocked in.
> 
> One of the helpers that have to be used to implement
> PTRACE_SET_SYSCALL_INFO is syscall_set_return_value().
> This helper complements other two helpers, syscall_get_error() and
> syscall_get_return_value(), that are currently used to implement
> PTRACE_GET_SYSCALL_INFO on syscall return.  When syscall_set_return_value()
> is used to set an error code, the caller specifies it as a negative value
> in -ERRORCODE format.
> 
> Unfortunately, this does not work well on powerpc since commit 1b1a3702a65c
> ("powerpc: Don't negate error in syscall_set_return_value()") because
> syscall_set_return_value() does not follow the powerpc sc syscall return
> ABI:
> 	/*
> 	 * In the general case it's not obvious that we must deal with
> 	 * CCR here, as the syscall exit path will also do that for us.
> 	 * However there are some places, eg. the signal code, which
> 	 * check ccr to decide if the value in r3 is actually an error.
> 	 */
> 	if (error) {
> 		regs->ccr |= 0x10000000L;
> 		regs->gpr[3] = error;
> 	} else {
> 		regs->ccr &= ~0x10000000L;
> 		regs->gpr[3] = val;
> 	}
> 
> The reason why this syscall_set_return_value() implementation was able to
> get away with violating the powerpc sc syscall return ABI is the following:
> Up to now, syscall_set_return_value() on powerpc could be called only from
> do_syscall_trace_enter() via do_seccomp(), there was no way it could be
> called from do_syscall_trace_leave() which is the point where tracers on
> syscall return are activated and the powerpc sc syscall return ABI has
> to be respected.
> 
> Introduction of PTRACE_SET_SYSCALL_INFO necessitates a change of
> syscall_set_return_value() to comply with the powerpc sc syscall return
> ABI.  Without the change, the upcoming ptrace/set_syscall_info selftest
> fails with the following diagnostics:
> 
>    # set_syscall_info.c:119:set_syscall_info:Expected exp_exit->rval (-38) == info->exit.rval (38)
>    # set_syscall_info.c:120:set_syscall_info:wait #4: PTRACE_GET_SYSCALL_INFO #2: exit stop mismatch
> 
> Note that since backwards compatibility with the current implementation has
> to be provided, the kernel has to continue supporting simultaneously both
> the generic kernel -ERRORCODE return value ABI and the powerpc sc syscall
> return ABI at least for PTRACE_EVENT_SECCOMP tracers.  Consequently, since
> the point of __secure_computing() invocation and up to the point of
> conversion in syscall_exit_prepare(), gpr[3] may be set according to either
> of these two ABIs.  An attempt to address code inconsistencies in syscall
> error return handling that were introduced as a side effect of the dual
> ABI support follows in a separate patch.

What do you mean by "backwards compatibility" here ? backwards 
compatibility applies only to userspace API doesn't it ? So if there was 
no way to trigger the problem previously, what does it mean ?

> 
> Link: https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.kernel.org%2Fdoc%2Fhtml%2Flatest%2Farch%2Fpowerpc%2Fsyscall64-abi.html%23return-value&data=05%7C02%7Cchristophe.leroy%40csgroup.eu%7Cc2cf590281c24fe1478408dd3efe4a3e%7C8b87af7d86474dc78df45f69a2011bb5%7C0%7C0%7C638735984085033893%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=EEP9s6k%2Fs5VfqWgrs6VXi879HEfJ8BYOOJ8InmmVTQA%3D&reserved=0 [1]
> Signed-off-by: Dmitry V. Levin <ldv@strace.io>
> Reviewed-by: Alexey Gladkov <legion@kernel.org>
> ---
>   arch/powerpc/include/asm/syscall.h | 6 +++++-
>   1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/syscall.h b/arch/powerpc/include/asm/syscall.h
> index 3dd36c5e334a..422d7735ace6 100644
> --- a/arch/powerpc/include/asm/syscall.h
> +++ b/arch/powerpc/include/asm/syscall.h
> @@ -82,7 +82,11 @@ static inline void syscall_set_return_value(struct task_struct *task,
>   		 */
>   		if (error) {
>   			regs->ccr |= 0x10000000L;
> -			regs->gpr[3] = error;
> +			/*
> +			 * In case of an error regs->gpr[3] contains
> +			 * a positive ERRORCODE.
> +			 */
> +			regs->gpr[3] = -error;
>   		} else {
>   			regs->ccr &= ~0x10000000L;
>   			regs->gpr[3] = val;


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/2] powerpc: properly negate error in syscall_set_return_value() in sc case
  2025-01-28 14:59 ` [PATCH 1/2] powerpc: properly negate error in syscall_set_return_value() in sc case Christophe Leroy
@ 2025-01-28 15:52   ` Dmitry V. Levin
  2025-01-28 16:00     ` Christophe Leroy
  2025-01-28 15:54   ` Eugene Syromyatnikov
  1 sibling, 1 reply; 10+ messages in thread
From: Dmitry V. Levin @ 2025-01-28 15:52 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Michael Ellerman, Alexey Gladkov, Eugene Syromyatnikov,
	Oleg Nesterov, Madhavan Srinivasan, Nicholas Piggin, Naveen N Rao,
	linuxppc-dev, strace-devel, linux-kernel

On Tue, Jan 28, 2025 at 03:59:29PM +0100, Christophe Leroy wrote:
> Le 27/01/2025 à 19:13, Dmitry V. Levin a écrit :
> > According to the Power Architecture Linux system call ABI documented in
> > [1], when the syscall is made with the sc instruction, both a value and an
> > error condition are returned, where r3 register contains the return value,
> > and cr0.SO bit specifies the error condition.  When cr0.SO is clear, the
> > syscall succeeded and r3 is the return value.  When cr0.SO is set, the
> > syscall failed and r3 is the error value.  This syscall return semantics
> > was implemented from the very beginning of Power Architecture on Linux,
> > and syscall tracers and debuggers like strace that read or modify syscall
> > return information also rely on this ABI.
> 
> I see a quite similar ABI on microblaze, mips, nios2 and sparc. Do they 
> behave all the same ?

Yes, also on alpha.  I don't think microblaze should be in this list,
though.

> > r3 and cr0.SO are exposed directly via struct pt_regs where gpr[3] and
> > (ccr & 0x10000000) correspond to r3 and cr0.SO, respectively.
> > For example, here is an excerpt from check_syscall_restart() that assigns
> > these members of struct pt_regs:
> >          regs->result = -EINTR;
> >          regs->gpr[3] = EINTR;
> >          regs->ccr |= 0x10000000;
> > In this example, the semantics of negative ERRORCODE that's being used
> > virtually everywhere in generic kernel code is translated to powerpc sc
> > syscall return ABI which uses positive ERRORCODE and cr0.SO bit.
> 
> At what point are they exposed really ? At what point do they need to 
> comply with the ABI ?

That's a good question.  Of course when returning to user space, but,
besides that, at syscall exit tracepoint (trace_sys_exit), ptrace syscall
exit stop (ptrace_report_syscall_exit), and PTRACE_EVENT_SECCOMP stop
(__secure_computing).  There could be other points where this is exposed.
For example, on many architectures the tracer can specify syscall error
return value also at ptrace syscall entry stop
(ptrace_report_syscall_entry), but powerpc does not implement this.

> I'm also a bit lost between regs->orig_r3, regs->gpr[3] and regs->result.

I'm not aware of the role of regs->result as it's not in struct user_pt_regs
and therefore is not exposed to user space.

> The comment added by commit 1b1a3702a65c ("powerpc: Don't negate error 
> in syscall_set_return_value()") says that CCR needs to be set because of 
> signal code. But signal code is invoked by syscall_exit_prepare() 
> through call to interrupt_exit_user_prepare_main() after setting CR[SO] 
> and negating syscall result.

I hope Michael can comment on this.

> > Also, r3 and cr0.SO are exposed indirectly via helpers.
> > For example, here is an excerpt from syscall_get_error():
> >          /*
> >           * If the system call failed,
> >           * regs->gpr[3] contains a positive ERRORCODE.
> >           */
> >          return (regs->ccr & 0x10000000UL) ? -regs->gpr[3] : 0;
> > and here is another example, from regs_return_value():
> >          if (is_syscall_success(regs))
> >                  return regs->gpr[3];
> >          else
> >                  return -regs->gpr[3];
> > In these examples, the powerpc sc syscall return ABI which uses positive
> > ERRORCODE and cr0.SO bit is translated to the semantics of negative
> > ERRORCODE that's being used virtually everywhere in generic kernel code.
> > 
> > Up to a certain point in time the kernel managed to implement the powerpc
> > sc syscall return ABI in all cases where struct pt_regs was exposed to user
> > space.
> > 
> > The situation changed when SECCOMP_RET_TRACE support was introduced.
> > At this point the -ERRORCODE semantics that was used under the hood to
> > implement seccomp on powerpc became exposed to user space.  The tracer
> > handling PTRACE_EVENT_SECCOMP is not just able to observe -ENOSYS in gpr[3]
> > - this is relatively harmless as at this stage there is no syscall return
> > yet so the powerpc sc syscall return ABI does not apply.  What's important
> > is that the tracer can change the syscall number to -1 thus making the
> > syscall fail, and at this point the tracer is also able to specify the
> > error value.  This has to be done in accordance with the syscall return
> > ABI, however, the current implementation of do_seccomp() supports both the
> > generic kernel -ERRORCODE return value ABI and the powerpc sc syscall
> > return ABI, thanks to syscall_exit_prepare() that converts the former to
> > the latter.  Consequently, seccomp_bpf selftest passes both with and
> > without this change.
> > 
> > Now comes the moment when PTRACE_SET_SYSCALL_INFO is going to be
> > introduced.  PTRACE_SET_SYSCALL_INFO is a generic ptrace API that
> > complements PTRACE_GET_SYSCALL_INFO by letting the ptracer modify
> > the details of the system calls the tracee is blocked in.
> > 
> > One of the helpers that have to be used to implement
> > PTRACE_SET_SYSCALL_INFO is syscall_set_return_value().
> > This helper complements other two helpers, syscall_get_error() and
> > syscall_get_return_value(), that are currently used to implement
> > PTRACE_GET_SYSCALL_INFO on syscall return.  When syscall_set_return_value()
> > is used to set an error code, the caller specifies it as a negative value
> > in -ERRORCODE format.
> > 
> > Unfortunately, this does not work well on powerpc since commit 1b1a3702a65c
> > ("powerpc: Don't negate error in syscall_set_return_value()") because
> > syscall_set_return_value() does not follow the powerpc sc syscall return
> > ABI:
> > 	/*
> > 	 * In the general case it's not obvious that we must deal with
> > 	 * CCR here, as the syscall exit path will also do that for us.
> > 	 * However there are some places, eg. the signal code, which
> > 	 * check ccr to decide if the value in r3 is actually an error.
> > 	 */
> > 	if (error) {
> > 		regs->ccr |= 0x10000000L;
> > 		regs->gpr[3] = error;
> > 	} else {
> > 		regs->ccr &= ~0x10000000L;
> > 		regs->gpr[3] = val;
> > 	}
> > 
> > The reason why this syscall_set_return_value() implementation was able to
> > get away with violating the powerpc sc syscall return ABI is the following:
> > Up to now, syscall_set_return_value() on powerpc could be called only from
> > do_syscall_trace_enter() via do_seccomp(), there was no way it could be
> > called from do_syscall_trace_leave() which is the point where tracers on
> > syscall return are activated and the powerpc sc syscall return ABI has
> > to be respected.
> > 
> > Introduction of PTRACE_SET_SYSCALL_INFO necessitates a change of
> > syscall_set_return_value() to comply with the powerpc sc syscall return
> > ABI.  Without the change, the upcoming ptrace/set_syscall_info selftest
> > fails with the following diagnostics:
> > 
> >    # set_syscall_info.c:119:set_syscall_info:Expected exp_exit->rval (-38) == info->exit.rval (38)
> >    # set_syscall_info.c:120:set_syscall_info:wait #4: PTRACE_GET_SYSCALL_INFO #2: exit stop mismatch
> > 
> > Note that since backwards compatibility with the current implementation has
> > to be provided, the kernel has to continue supporting simultaneously both
> > the generic kernel -ERRORCODE return value ABI and the powerpc sc syscall
> > return ABI at least for PTRACE_EVENT_SECCOMP tracers.  Consequently, since
> > the point of __secure_computing() invocation and up to the point of
> > conversion in syscall_exit_prepare(), gpr[3] may be set according to either
> > of these two ABIs.  An attempt to address code inconsistencies in syscall
> > error return handling that were introduced as a side effect of the dual
> > ABI support follows in a separate patch.
> 
> What do you mean by "backwards compatibility" here ? backwards 
> compatibility applies only to userspace API doesn't it ? So if there was 
> no way to trigger the problem previously, what does it mean ?

As I wrote earlier, userspace tracers handling PTRACE_EVENT_SECCOMP
are able to set gpr[3] to either -ERRORCODE or ERRORCODE (along with
SO bit) for all these years, and both methods work up to now.


-- 
ldv

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/2] powerpc: properly negate error in syscall_set_return_value() in sc case
  2025-01-28 14:59 ` [PATCH 1/2] powerpc: properly negate error in syscall_set_return_value() in sc case Christophe Leroy
  2025-01-28 15:52   ` Dmitry V. Levin
@ 2025-01-28 15:54   ` Eugene Syromyatnikov
  1 sibling, 0 replies; 10+ messages in thread
From: Eugene Syromyatnikov @ 2025-01-28 15:54 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Dmitry V. Levin, Michael Ellerman, Alexey Gladkov, Oleg Nesterov,
	Madhavan Srinivasan, Nicholas Piggin, Naveen N Rao, linuxppc-dev,
	strace-devel, linux-kernel

On Tue, Jan 28, 2025 at 3:59 PM Christophe Leroy
<christophe.leroy@csgroup.eu> wrote:
>
>
> Le 27/01/2025 à 19:13, Dmitry V. Levin a écrit :
> > According to the Power Architecture Linux system call ABI documented in
> > [1], when the syscall is made with the sc instruction, both a value and an
> > error condition are returned, where r3 register contains the return value,
> > and cr0.SO bit specifies the error condition.  When cr0.SO is clear, the
> > syscall succeeded and r3 is the return value.  When cr0.SO is set, the
> > syscall failed and r3 is the error value.  This syscall return semantics
> > was implemented from the very beginning of Power Architecture on Linux,
> > and syscall tracers and debuggers like strace that read or modify syscall
> > return information also rely on this ABI.
>
> I see a quite similar ABI on microblaze, mips, nios2 and sparc. Do they
> behave all the same ?

In terms of ABI, yes:
https://gitlab.com/strace/strace/-/blob/master/src/linux/sparc/get_error.c
https://gitlab.com/strace/strace/-/blob/master/src/linux/mips/get_error.c
https://gitlab.com/strace/strace/-/blob/master/src/linux/nios2/get_error.c

(cf. https://gitlab.com/strace/strace/-/blob/master/src/linux/powerpc/get_error.c
)

I can't find microblaze SyscallV ABI, but at least both strace and
glibc expect negated errno in r3:
https://gitlab.com/strace/strace/-/blob/master/src/linux/microblaze/get_error.c
https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/microblaze/syscall.S;h=6ebf688d800e19c14a140f0cd5a0ba9344fa95a5;hb=HEAD

> > r3 and cr0.SO are exposed directly via struct pt_regs where gpr[3] and
> > (ccr & 0x10000000) correspond to r3 and cr0.SO, respectively.
> > For example, here is an excerpt from check_syscall_restart() that assigns
> > these members of struct pt_regs:
> >          regs->result = -EINTR;
> >          regs->gpr[3] = EINTR;
> >          regs->ccr |= 0x10000000;
> > In this example, the semantics of negative ERRORCODE that's being used
> > virtually everywhere in generic kernel code is translated to powerpc sc
> > syscall return ABI which uses positive ERRORCODE and cr0.SO bit.
>
> At what point are they exposed really ? At what point do they need to
> comply with the ABI ?

On ptrace stops via pt_regs, for example?

> I'm also a bit lost between regs->orig_r3, regs->gpr[3] and regs->result.
>
> The comment added by commit 1b1a3702a65c ("powerpc: Don't negate error
> in syscall_set_return_value()") says that CCR needs to be set because of
> signal code. But signal code is invoked by syscall_exit_prepare()
> through call to interrupt_exit_user_prepare_main() after setting CR[SO]
> and negating syscall result.
>
> >
> > Also, r3 and cr0.SO are exposed indirectly via helpers.
> > For example, here is an excerpt from syscall_get_error():
> >          /*
> >           * If the system call failed,
> >           * regs->gpr[3] contains a positive ERRORCODE.
> >           */
> >          return (regs->ccr & 0x10000000UL) ? -regs->gpr[3] : 0;
> > and here is another example, from regs_return_value():
> >          if (is_syscall_success(regs))
> >                  return regs->gpr[3];
> >          else
> >                  return -regs->gpr[3];
> > In these examples, the powerpc sc syscall return ABI which uses positive
> > ERRORCODE and cr0.SO bit is translated to the semantics of negative
> > ERRORCODE that's being used virtually everywhere in generic kernel code.
> >
> > Up to a certain point in time the kernel managed to implement the powerpc
> > sc syscall return ABI in all cases where struct pt_regs was exposed to user
> > space.
> >
> > The situation changed when SECCOMP_RET_TRACE support was introduced.
> > At this point the -ERRORCODE semantics that was used under the hood to
> > implement seccomp on powerpc became exposed to user space.  The tracer
> > handling PTRACE_EVENT_SECCOMP is not just able to observe -ENOSYS in gpr[3]
> > - this is relatively harmless as at this stage there is no syscall return
> > yet so the powerpc sc syscall return ABI does not apply.  What's important
> > is that the tracer can change the syscall number to -1 thus making the
> > syscall fail, and at this point the tracer is also able to specify the
> > error value.  This has to be done in accordance with the syscall return
> > ABI, however, the current implementation of do_seccomp() supports both the
> > generic kernel -ERRORCODE return value ABI and the powerpc sc syscall
> > return ABI, thanks to syscall_exit_prepare() that converts the former to
> > the latter.  Consequently, seccomp_bpf selftest passes both with and
> > without this change.
> >
> > Now comes the moment when PTRACE_SET_SYSCALL_INFO is going to be
> > introduced.  PTRACE_SET_SYSCALL_INFO is a generic ptrace API that
> > complements PTRACE_GET_SYSCALL_INFO by letting the ptracer modify
> > the details of the system calls the tracee is blocked in.
> >
> > One of the helpers that have to be used to implement
> > PTRACE_SET_SYSCALL_INFO is syscall_set_return_value().
> > This helper complements other two helpers, syscall_get_error() and
> > syscall_get_return_value(), that are currently used to implement
> > PTRACE_GET_SYSCALL_INFO on syscall return.  When syscall_set_return_value()
> > is used to set an error code, the caller specifies it as a negative value
> > in -ERRORCODE format.
> >
> > Unfortunately, this does not work well on powerpc since commit 1b1a3702a65c
> > ("powerpc: Don't negate error in syscall_set_return_value()") because
> > syscall_set_return_value() does not follow the powerpc sc syscall return
> > ABI:
> >       /*
> >        * In the general case it's not obvious that we must deal with
> >        * CCR here, as the syscall exit path will also do that for us.
> >        * However there are some places, eg. the signal code, which
> >        * check ccr to decide if the value in r3 is actually an error.
> >        */
> >       if (error) {
> >               regs->ccr |= 0x10000000L;
> >               regs->gpr[3] = error;
> >       } else {
> >               regs->ccr &= ~0x10000000L;
> >               regs->gpr[3] = val;
> >       }
> >
> > The reason why this syscall_set_return_value() implementation was able to
> > get away with violating the powerpc sc syscall return ABI is the following:
> > Up to now, syscall_set_return_value() on powerpc could be called only from
> > do_syscall_trace_enter() via do_seccomp(), there was no way it could be
> > called from do_syscall_trace_leave() which is the point where tracers on
> > syscall return are activated and the powerpc sc syscall return ABI has
> > to be respected.
> >
> > Introduction of PTRACE_SET_SYSCALL_INFO necessitates a change of
> > syscall_set_return_value() to comply with the powerpc sc syscall return
> > ABI.  Without the change, the upcoming ptrace/set_syscall_info selftest
> > fails with the following diagnostics:
> >
> >    # set_syscall_info.c:119:set_syscall_info:Expected exp_exit->rval (-38) == info->exit.rval (38)
> >    # set_syscall_info.c:120:set_syscall_info:wait #4: PTRACE_GET_SYSCALL_INFO #2: exit stop mismatch
> >
> > Note that since backwards compatibility with the current implementation has
> > to be provided, the kernel has to continue supporting simultaneously both
> > the generic kernel -ERRORCODE return value ABI and the powerpc sc syscall
> > return ABI at least for PTRACE_EVENT_SECCOMP tracers.  Consequently, since
> > the point of __secure_computing() invocation and up to the point of
> > conversion in syscall_exit_prepare(), gpr[3] may be set according to either
> > of these two ABIs.  An attempt to address code inconsistencies in syscall
> > error return handling that were introduced as a side effect of the dual
> > ABI support follows in a separate patch.
>
> What do you mean by "backwards compatibility" here ? backwards
> compatibility applies only to userspace API doesn't it ? So if there was
> no way to trigger the problem previously, what does it mean ?

ptrace and seccomp users are pretty much userspace, aren't they?

> >
> > Link: https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.kernel.org%2Fdoc%2Fhtml%2Flatest%2Farch%2Fpowerpc%2Fsyscall64-abi.html%23return-value&data=05%7C02%7Cchristophe.leroy%40csgroup.eu%7Cc2cf590281c24fe1478408dd3efe4a3e%7C8b87af7d86474dc78df45f69a2011bb5%7C0%7C0%7C638735984085033893%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=EEP9s6k%2Fs5VfqWgrs6VXi879HEfJ8BYOOJ8InmmVTQA%3D&reserved=0 [1]
> > Signed-off-by: Dmitry V. Levin <ldv@strace.io>
> > Reviewed-by: Alexey Gladkov <legion@kernel.org>
> > ---
> >   arch/powerpc/include/asm/syscall.h | 6 +++++-
> >   1 file changed, 5 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/powerpc/include/asm/syscall.h b/arch/powerpc/include/asm/syscall.h
> > index 3dd36c5e334a..422d7735ace6 100644
> > --- a/arch/powerpc/include/asm/syscall.h
> > +++ b/arch/powerpc/include/asm/syscall.h
> > @@ -82,7 +82,11 @@ static inline void syscall_set_return_value(struct task_struct *task,
> >                */
> >               if (error) {
> >                       regs->ccr |= 0x10000000L;
> > -                     regs->gpr[3] = error;
> > +                     /*
> > +                      * In case of an error regs->gpr[3] contains
> > +                      * a positive ERRORCODE.
> > +                      */
> > +                     regs->gpr[3] = -error;
> >               } else {
> >                       regs->ccr &= ~0x10000000L;
> >                       regs->gpr[3] = val;
>


-- 
Eugene Syromyatnikov
mailto:evgsyr@gmail.com
xmpp:esyr@jabber.{ru|org}

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/2] powerpc: properly negate error in syscall_set_return_value() in sc case
  2025-01-28 15:52   ` Dmitry V. Levin
@ 2025-01-28 16:00     ` Christophe Leroy
  2025-01-28 16:10       ` Dmitry V. Levin
  0 siblings, 1 reply; 10+ messages in thread
From: Christophe Leroy @ 2025-01-28 16:00 UTC (permalink / raw)
  To: Dmitry V. Levin
  Cc: Michael Ellerman, Alexey Gladkov, Eugene Syromyatnikov,
	Oleg Nesterov, Madhavan Srinivasan, Nicholas Piggin, Naveen N Rao,
	linuxppc-dev, strace-devel, linux-kernel



Le 28/01/2025 à 16:52, Dmitry V. Levin a écrit :
> On Tue, Jan 28, 2025 at 03:59:29PM +0100, Christophe Leroy wrote:
>> Le 27/01/2025 à 19:13, Dmitry V. Levin a écrit :
>>> According to the Power Architecture Linux system call ABI documented in
>>> [1], when the syscall is made with the sc instruction, both a value and an
>>> error condition are returned, where r3 register contains the return value,
>>> and cr0.SO bit specifies the error condition.  When cr0.SO is clear, the
>>> syscall succeeded and r3 is the return value.  When cr0.SO is set, the
>>> syscall failed and r3 is the error value.  This syscall return semantics
>>> was implemented from the very beginning of Power Architecture on Linux,
>>> and syscall tracers and debuggers like strace that read or modify syscall
>>> return information also rely on this ABI.
>>
>> I see a quite similar ABI on microblaze, mips, nios2 and sparc. Do they
>> behave all the same ?
> 
> Yes, also on alpha.  I don't think microblaze should be in this list,
> though.

Microblaze has

static inline void syscall_set_return_value(struct task_struct *task,
					    struct pt_regs *regs,
					    int error, long val)
{
	if (error)
		regs->r3 = -error;
	else
		regs->r3 = val;
}

So it has a positive error setting allthough it has no flag to tell it 
is an error. Wondering how it works at the end.

Alpha I'm not sure, I see nothing obvious in include/asm/ptrace.h or 
include/asm/syscall.h



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/2] powerpc: properly negate error in syscall_set_return_value() in sc case
  2025-01-28 16:00     ` Christophe Leroy
@ 2025-01-28 16:10       ` Dmitry V. Levin
  0 siblings, 0 replies; 10+ messages in thread
From: Dmitry V. Levin @ 2025-01-28 16:10 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Michael Ellerman, Alexey Gladkov, Eugene Syromyatnikov,
	Oleg Nesterov, Madhavan Srinivasan, Nicholas Piggin, Naveen N Rao,
	linuxppc-dev, strace-devel, linux-kernel

On Tue, Jan 28, 2025 at 05:00:31PM +0100, Christophe Leroy wrote:
> Le 28/01/2025 à 16:52, Dmitry V. Levin a écrit :
> > On Tue, Jan 28, 2025 at 03:59:29PM +0100, Christophe Leroy wrote:
> >> Le 27/01/2025 à 19:13, Dmitry V. Levin a écrit :
> >>> According to the Power Architecture Linux system call ABI documented in
> >>> [1], when the syscall is made with the sc instruction, both a value and an
> >>> error condition are returned, where r3 register contains the return value,
> >>> and cr0.SO bit specifies the error condition.  When cr0.SO is clear, the
> >>> syscall succeeded and r3 is the return value.  When cr0.SO is set, the
> >>> syscall failed and r3 is the error value.  This syscall return semantics
> >>> was implemented from the very beginning of Power Architecture on Linux,
> >>> and syscall tracers and debuggers like strace that read or modify syscall
> >>> return information also rely on this ABI.
> >>
> >> I see a quite similar ABI on microblaze, mips, nios2 and sparc. Do they
> >> behave all the same ?
> > 
> > Yes, also on alpha.  I don't think microblaze should be in this list,
> > though.
> 
> Microblaze has
> 
> static inline void syscall_set_return_value(struct task_struct *task,
> 					    struct pt_regs *regs,
> 					    int error, long val)
> {
> 	if (error)
> 		regs->r3 = -error;
> 	else
> 		regs->r3 = val;
> }
> 
> So it has a positive error setting allthough it has no flag to tell it 
> is an error. Wondering how it works at the end.

It's a bug, but given that microblaze doesn't enable
CONFIG_HAVE_ARCH_TRACEHOOK, most likely this function is unused there.

> Alpha I'm not sure, I see nothing obvious in include/asm/ptrace.h or 
> include/asm/syscall.h

Alpha doesn't enable CONFIG_HAVE_ARCH_TRACEHOOK, it just lacks the
necessary interfaces, but it uses a3 register for this purpose, see
arch/alpha/kernel/entry.S for details.


-- 
ldv

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] powerpc: fix inconsistencies in syscall error return handling
  2025-01-27 18:14 ` [PATCH 2/2] powerpc: fix inconsistencies in syscall error return handling Dmitry V. Levin
@ 2025-01-28 18:01   ` Christophe Leroy
  2025-01-28 21:39     ` Dmitry V. Levin
  0 siblings, 1 reply; 10+ messages in thread
From: Christophe Leroy @ 2025-01-28 18:01 UTC (permalink / raw)
  To: Dmitry V. Levin, Michael Ellerman
  Cc: Alexey Gladkov, Eugene Syromyatnikov, Oleg Nesterov,
	Madhavan Srinivasan, Nicholas Piggin, Naveen N Rao, linuxppc-dev,
	strace-devel, linux-kernel



Le 27/01/2025 à 19:14, Dmitry V. Levin a écrit :
> Since the introduction of SECCOMP_RET_TRACE support, the kernel supports
> simultaneously both the generic kernel -ERRORCODE return value ABI and
> the powerpc sc syscall return ABI for PTRACE_EVENT_SECCOMP tracers.
> This change is an attempt to address the code inconsistencies in syscall
> error return handling that were introduced as a side effect of the dual
> ABI support.
> 
> Signed-off-by: Dmitry V. Levin <ldv@strace.io>
> ---
>   arch/powerpc/kernel/ptrace/ptrace.c | 23 ++++++++++++++++++++---
>   arch/powerpc/kernel/signal.c        | 11 +++--------
>   arch/powerpc/kernel/syscall.c       |  6 +++---
>   3 files changed, 26 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/ptrace/ptrace.c b/arch/powerpc/kernel/ptrace/ptrace.c
> index 727ed4a14545..3778775bf6ba 100644
> --- a/arch/powerpc/kernel/ptrace/ptrace.c
> +++ b/arch/powerpc/kernel/ptrace/ptrace.c
> @@ -207,7 +207,7 @@ static int do_seccomp(struct pt_regs *regs)
>   	 * syscall parameter. This is different to the ptrace ABI where
>   	 * both r3 and orig_gpr3 contain the first syscall parameter.
>   	 */
> -	regs->gpr[3] = -ENOSYS;
> +	syscall_set_return_value(current, regs, -ENOSYS, 0);
>   
>   	/*
>   	 * We use the __ version here because we have already checked
> @@ -215,8 +215,18 @@ static int do_seccomp(struct pt_regs *regs)
>   	 * have already loaded -ENOSYS into r3, or seccomp has put
>   	 * something else in r3 (via SECCOMP_RET_ERRNO/TRACE).
>   	 */
> -	if (__secure_computing(NULL))
> +	if (__secure_computing(NULL)) {
> +
> +		/*
> +		 * Traditionally, both the generic kernel -ERRORCODE return
> +		 * value ABI and the powerpc sc syscall return ABI is
> +		 * supported.  For consistency, if the former is detected,
> +		 * convert it to the latter.
> +		 */
> +		if (!trap_is_scv(regs) && IS_ERR_VALUE(regs->gpr[3]))

Why !trap_is_scv(regs) ? Shouldn't this also work with scv allthough it 
should be a noop ?


> +			syscall_set_return_value(current, regs, regs->gpr[3], 0);
>   		return -1;
> +	}
>   
>   	/*
>   	 * The syscall was allowed by seccomp, restore the register
> @@ -226,6 +236,13 @@ static int do_seccomp(struct pt_regs *regs)
>   	 * allow the syscall to proceed.
>   	 */
>   	regs->gpr[3] = regs->orig_gpr3;
> +	if (!trap_is_scv(regs)) {
> +		/*
> +		 * Clear SO bit that was set in this function earlier by
> +		 * syscall_set_return_value.
> +		 */
> +		regs->ccr &= ~0x10000000L;
> +	}

Can't we use syscall_set_return_value() to do that ?

>   
>   	return 0;
>   }
> @@ -315,7 +332,7 @@ long do_syscall_trace_enter(struct pt_regs *regs)
>   	 * If we are aborting explicitly, or if the syscall number is
>   	 * now invalid, set the return value to -ENOSYS.
>   	 */
> -	regs->gpr[3] = -ENOSYS;
> +	syscall_set_return_value(current, regs, -ENOSYS, 0);
>   	return -1;
>   }
>   
> diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c
> index aa17e62f3754..1a38d6bcaed6 100644
> --- a/arch/powerpc/kernel/signal.c
> +++ b/arch/powerpc/kernel/signal.c
> @@ -19,6 +19,7 @@
>   #include <asm/unistd.h>
>   #include <asm/debug.h>
>   #include <asm/tm.h>
> +#include <asm/syscall.h>
>   
>   #include "signal.h"
>   
> @@ -229,14 +230,8 @@ static void check_syscall_restart(struct pt_regs *regs, struct k_sigaction *ka,
>   		regs_add_return_ip(regs, -4);
>   		regs->result = 0;
>   	} else {
> -		if (trap_is_scv(regs)) {
> -			regs->result = -EINTR;
> -			regs->gpr[3] = -EINTR;
> -		} else {
> -			regs->result = -EINTR;
> -			regs->gpr[3] = EINTR;
> -			regs->ccr |= 0x10000000;
> -		}
> +		regs->result = -EINTR;
> +		syscall_set_return_value(current, regs, -EINTR, 0);
>   	}
>   }
>   
> diff --git a/arch/powerpc/kernel/syscall.c b/arch/powerpc/kernel/syscall.c
> index be159ad4b77b..2fe47191e509 100644
> --- a/arch/powerpc/kernel/syscall.c
> +++ b/arch/powerpc/kernel/syscall.c
> @@ -122,7 +122,7 @@ notrace long system_call_exception(struct pt_regs *regs, unsigned long r0)
>   		if (unlikely(trap_is_unsupported_scv(regs))) {
>   			/* Unsupported scv vector */
>   			_exception(SIGILL, regs, ILL_ILLOPC, regs->nip);
> -			return regs->gpr[3];
> +			return regs_return_value(regs);
>   		}
>   		/*
>   		 * We use the return value of do_syscall_trace_enter() as the
> @@ -133,13 +133,13 @@ notrace long system_call_exception(struct pt_regs *regs, unsigned long r0)
>   		 */
>   		r0 = do_syscall_trace_enter(regs);
>   		if (unlikely(r0 >= NR_syscalls))
> -			return regs->gpr[3];
> +			return regs_return_value(regs);
>   
>   	} else if (unlikely(r0 >= NR_syscalls)) {
>   		if (unlikely(trap_is_unsupported_scv(regs))) {
>   			/* Unsupported scv vector */
>   			_exception(SIGILL, regs, ILL_ILLOPC, regs->nip);
> -			return regs->gpr[3];
> +			return regs_return_value(regs);
>   		}
>   		return -ENOSYS;
>   	}


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] powerpc: fix inconsistencies in syscall error return handling
  2025-01-28 18:01   ` Christophe Leroy
@ 2025-01-28 21:39     ` Dmitry V. Levin
  0 siblings, 0 replies; 10+ messages in thread
From: Dmitry V. Levin @ 2025-01-28 21:39 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Michael Ellerman, Alexey Gladkov, Eugene Syromyatnikov,
	Oleg Nesterov, Madhavan Srinivasan, Nicholas Piggin, Naveen N Rao,
	linuxppc-dev, strace-devel, linux-kernel

On Tue, Jan 28, 2025 at 07:01:47PM +0100, Christophe Leroy wrote:
> Le 27/01/2025 à 19:14, Dmitry V. Levin a écrit :
> > Since the introduction of SECCOMP_RET_TRACE support, the kernel supports
> > simultaneously both the generic kernel -ERRORCODE return value ABI and
> > the powerpc sc syscall return ABI for PTRACE_EVENT_SECCOMP tracers.
> > This change is an attempt to address the code inconsistencies in syscall
> > error return handling that were introduced as a side effect of the dual
> > ABI support.
> > 
> > Signed-off-by: Dmitry V. Levin <ldv@strace.io>
> > ---
> >   arch/powerpc/kernel/ptrace/ptrace.c | 23 ++++++++++++++++++++---
> >   arch/powerpc/kernel/signal.c        | 11 +++--------
> >   arch/powerpc/kernel/syscall.c       |  6 +++---
> >   3 files changed, 26 insertions(+), 14 deletions(-)
> > 
> > diff --git a/arch/powerpc/kernel/ptrace/ptrace.c b/arch/powerpc/kernel/ptrace/ptrace.c
> > index 727ed4a14545..3778775bf6ba 100644
> > --- a/arch/powerpc/kernel/ptrace/ptrace.c
> > +++ b/arch/powerpc/kernel/ptrace/ptrace.c
> > @@ -207,7 +207,7 @@ static int do_seccomp(struct pt_regs *regs)
> >   	 * syscall parameter. This is different to the ptrace ABI where
> >   	 * both r3 and orig_gpr3 contain the first syscall parameter.
> >   	 */
> > -	regs->gpr[3] = -ENOSYS;
> > +	syscall_set_return_value(current, regs, -ENOSYS, 0);
> >   
> >   	/*
> >   	 * We use the __ version here because we have already checked
> > @@ -215,8 +215,18 @@ static int do_seccomp(struct pt_regs *regs)
> >   	 * have already loaded -ENOSYS into r3, or seccomp has put
> >   	 * something else in r3 (via SECCOMP_RET_ERRNO/TRACE).
> >   	 */
> > -	if (__secure_computing(NULL))
> > +	if (__secure_computing(NULL)) {
> > +
> > +		/*
> > +		 * Traditionally, both the generic kernel -ERRORCODE return
> > +		 * value ABI and the powerpc sc syscall return ABI is
> > +		 * supported.  For consistency, if the former is detected,
> > +		 * convert it to the latter.
> > +		 */
> > +		if (!trap_is_scv(regs) && IS_ERR_VALUE(regs->gpr[3]))
> 
> Why !trap_is_scv(regs) ? Shouldn't this also work with scv allthough it 
> should be a noop ?

In trap_is_scv(regs) case both the source and the target ABIs are
-ERRORCODE so there is no subject for conversion.

> > +			syscall_set_return_value(current, regs, regs->gpr[3], 0);
> >   		return -1;
> > +	}
> >   
> >   	/*
> >   	 * The syscall was allowed by seccomp, restore the register
> > @@ -226,6 +236,13 @@ static int do_seccomp(struct pt_regs *regs)
> >   	 * allow the syscall to proceed.
> >   	 */
> >   	regs->gpr[3] = regs->orig_gpr3;
> > +	if (!trap_is_scv(regs)) {
> > +		/*
> > +		 * Clear SO bit that was set in this function earlier by
> > +		 * syscall_set_return_value.
> > +		 */
> > +		regs->ccr &= ~0x10000000L;
> > +	}
> 
> Can't we use syscall_set_return_value() to do that ?

Of course we could do

	syscall_set_return_value(current, regs, 0, regs->orig_gpr3);

but Michael has objected to this already, see
https://lore.kernel.org/all/87jzajjde1.fsf@mpe.ellerman.id.au/


-- 
ldv

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/2] powerpc: properly negate error in syscall_set_return_value() in sc case
  2025-01-27 18:13 [PATCH 1/2] powerpc: properly negate error in syscall_set_return_value() in sc case Dmitry V. Levin
  2025-01-27 18:14 ` [PATCH 2/2] powerpc: fix inconsistencies in syscall error return handling Dmitry V. Levin
  2025-01-28 14:59 ` [PATCH 1/2] powerpc: properly negate error in syscall_set_return_value() in sc case Christophe Leroy
@ 2025-06-02 11:36 ` Dmitry V. Levin
  2 siblings, 0 replies; 10+ messages in thread
From: Dmitry V. Levin @ 2025-06-02 11:36 UTC (permalink / raw)
  To: Michael Ellerman, Nicholas Piggin, Christophe Leroy
  Cc: Alexey Gladkov, Eugene Syromyatnikov, Oleg Nesterov,
	Madhavan Srinivasan, Naveen N Rao, linuxppc-dev, strace-devel,
	linux-kernel

On Mon, Jan 27, 2025 at 08:13:23PM +0200, Dmitry V. Levin wrote:
> According to the Power Architecture Linux system call ABI documented in
> [1], when the syscall is made with the sc instruction, both a value and an
> error condition are returned, where r3 register contains the return value,
> and cr0.SO bit specifies the error condition.  When cr0.SO is clear, the
> syscall succeeded and r3 is the return value.  When cr0.SO is set, the
> syscall failed and r3 is the error value.  This syscall return semantics
> was implemented from the very beginning of Power Architecture on Linux,
> and syscall tracers and debuggers like strace that read or modify syscall
> return information also rely on this ABI.
> 
> r3 and cr0.SO are exposed directly via struct pt_regs where gpr[3] and
> (ccr & 0x10000000) correspond to r3 and cr0.SO, respectively.
> For example, here is an excerpt from check_syscall_restart() that assigns
> these members of struct pt_regs:
>         regs->result = -EINTR;
>         regs->gpr[3] = EINTR;
>         regs->ccr |= 0x10000000;
> In this example, the semantics of negative ERRORCODE that's being used
> virtually everywhere in generic kernel code is translated to powerpc sc
> syscall return ABI which uses positive ERRORCODE and cr0.SO bit.
> 
> Also, r3 and cr0.SO are exposed indirectly via helpers.
> For example, here is an excerpt from syscall_get_error():
>         /*
>          * If the system call failed,
>          * regs->gpr[3] contains a positive ERRORCODE.
>          */
>         return (regs->ccr & 0x10000000UL) ? -regs->gpr[3] : 0;
> and here is another example, from regs_return_value():
>         if (is_syscall_success(regs))
>                 return regs->gpr[3];
>         else
>                 return -regs->gpr[3];
> In these examples, the powerpc sc syscall return ABI which uses positive
> ERRORCODE and cr0.SO bit is translated to the semantics of negative
> ERRORCODE that's being used virtually everywhere in generic kernel code.
> 
> Up to a certain point in time the kernel managed to implement the powerpc
> sc syscall return ABI in all cases where struct pt_regs was exposed to user
> space.
> 
> The situation changed when SECCOMP_RET_TRACE support was introduced.
> At this point the -ERRORCODE semantics that was used under the hood to
> implement seccomp on powerpc became exposed to user space.  The tracer
> handling PTRACE_EVENT_SECCOMP is not just able to observe -ENOSYS in gpr[3]
> - this is relatively harmless as at this stage there is no syscall return
> yet so the powerpc sc syscall return ABI does not apply.  What's important
> is that the tracer can change the syscall number to -1 thus making the
> syscall fail, and at this point the tracer is also able to specify the
> error value.  This has to be done in accordance with the syscall return
> ABI, however, the current implementation of do_seccomp() supports both the
> generic kernel -ERRORCODE return value ABI and the powerpc sc syscall
> return ABI, thanks to syscall_exit_prepare() that converts the former to
> the latter.  Consequently, seccomp_bpf selftest passes both with and
> without this change.
> 
> Now comes the moment when PTRACE_SET_SYSCALL_INFO is going to be
> introduced.  PTRACE_SET_SYSCALL_INFO is a generic ptrace API that
> complements PTRACE_GET_SYSCALL_INFO by letting the ptracer modify
> the details of the system calls the tracee is blocked in.
> 
> One of the helpers that have to be used to implement
> PTRACE_SET_SYSCALL_INFO is syscall_set_return_value().
> This helper complements other two helpers, syscall_get_error() and
> syscall_get_return_value(), that are currently used to implement
> PTRACE_GET_SYSCALL_INFO on syscall return.  When syscall_set_return_value()
> is used to set an error code, the caller specifies it as a negative value
> in -ERRORCODE format.
> 
> Unfortunately, this does not work well on powerpc since commit 1b1a3702a65c
> ("powerpc: Don't negate error in syscall_set_return_value()") because
> syscall_set_return_value() does not follow the powerpc sc syscall return
> ABI:
> 	/*
> 	 * In the general case it's not obvious that we must deal with
> 	 * CCR here, as the syscall exit path will also do that for us.
> 	 * However there are some places, eg. the signal code, which
> 	 * check ccr to decide if the value in r3 is actually an error.
> 	 */
> 	if (error) {
> 		regs->ccr |= 0x10000000L;
> 		regs->gpr[3] = error;
> 	} else {
> 		regs->ccr &= ~0x10000000L;
> 		regs->gpr[3] = val;
> 	}
> 
> The reason why this syscall_set_return_value() implementation was able to
> get away with violating the powerpc sc syscall return ABI is the following:
> Up to now, syscall_set_return_value() on powerpc could be called only from
> do_syscall_trace_enter() via do_seccomp(), there was no way it could be
> called from do_syscall_trace_leave() which is the point where tracers on
> syscall return are activated and the powerpc sc syscall return ABI has
> to be respected.
> 
> Introduction of PTRACE_SET_SYSCALL_INFO necessitates a change of
> syscall_set_return_value() to comply with the powerpc sc syscall return
> ABI.  Without the change, the upcoming ptrace/set_syscall_info selftest
> fails with the following diagnostics:
> 
>   # set_syscall_info.c:119:set_syscall_info:Expected exp_exit->rval (-38) == info->exit.rval (38)
>   # set_syscall_info.c:120:set_syscall_info:wait #4: PTRACE_GET_SYSCALL_INFO #2: exit stop mismatch
> 
> Note that since backwards compatibility with the current implementation has
> to be provided, the kernel has to continue supporting simultaneously both
> the generic kernel -ERRORCODE return value ABI and the powerpc sc syscall
> return ABI at least for PTRACE_EVENT_SECCOMP tracers.  Consequently, since
> the point of __secure_computing() invocation and up to the point of
> conversion in syscall_exit_prepare(), gpr[3] may be set according to either
> of these two ABIs.  An attempt to address code inconsistencies in syscall
> error return handling that were introduced as a side effect of the dual
> ABI support follows in a separate patch.
> 
> Link: https://www.kernel.org/doc/html/latest/arch/powerpc/syscall64-abi.html#return-value [1]
> Signed-off-by: Dmitry V. Levin <ldv@strace.io>
> Reviewed-by: Alexey Gladkov <legion@kernel.org>
> ---
>  arch/powerpc/include/asm/syscall.h | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/syscall.h b/arch/powerpc/include/asm/syscall.h
> index 3dd36c5e334a..422d7735ace6 100644
> --- a/arch/powerpc/include/asm/syscall.h
> +++ b/arch/powerpc/include/asm/syscall.h
> @@ -82,7 +82,11 @@ static inline void syscall_set_return_value(struct task_struct *task,
>  		 */
>  		if (error) {
>  			regs->ccr |= 0x10000000L;
> -			regs->gpr[3] = error;
> +			/*
> +			 * In case of an error regs->gpr[3] contains
> +			 * a positive ERRORCODE.
> +			 */
> +			regs->gpr[3] = -error;
>  		} else {
>  			regs->ccr &= ~0x10000000L;
>  			regs->gpr[3] = val;

Given that the PTRACE_SET_SYSCALL_INFO API has been merged and this
powerpc issue is now exposed to userspace, can we make it fixed, please?


-- 
ldv

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2025-06-02 11:44 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-01-27 18:13 [PATCH 1/2] powerpc: properly negate error in syscall_set_return_value() in sc case Dmitry V. Levin
2025-01-27 18:14 ` [PATCH 2/2] powerpc: fix inconsistencies in syscall error return handling Dmitry V. Levin
2025-01-28 18:01   ` Christophe Leroy
2025-01-28 21:39     ` Dmitry V. Levin
2025-01-28 14:59 ` [PATCH 1/2] powerpc: properly negate error in syscall_set_return_value() in sc case Christophe Leroy
2025-01-28 15:52   ` Dmitry V. Levin
2025-01-28 16:00     ` Christophe Leroy
2025-01-28 16:10       ` Dmitry V. Levin
2025-01-28 15:54   ` Eugene Syromyatnikov
2025-06-02 11:36 ` Dmitry V. Levin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).