* [PATCH 2/2] powerpc: fix inconsistencies in syscall error return handling
2025-01-27 18:13 [PATCH 1/2] powerpc: properly negate error in syscall_set_return_value() in sc case Dmitry V. Levin
@ 2025-01-27 18:14 ` Dmitry V. Levin
2025-01-28 18:01 ` Christophe Leroy
2025-01-28 14:59 ` [PATCH 1/2] powerpc: properly negate error in syscall_set_return_value() in sc case Christophe Leroy
2025-06-02 11:36 ` Dmitry V. Levin
2 siblings, 1 reply; 10+ messages in thread
From: Dmitry V. Levin @ 2025-01-27 18:14 UTC (permalink / raw)
To: Michael Ellerman, Christophe Leroy
Cc: Alexey Gladkov, Eugene Syromyatnikov, Oleg Nesterov,
Madhavan Srinivasan, Nicholas Piggin, Naveen N Rao, linuxppc-dev,
strace-devel, linux-kernel
Since the introduction of SECCOMP_RET_TRACE support, the kernel supports
simultaneously both the generic kernel -ERRORCODE return value ABI and
the powerpc sc syscall return ABI for PTRACE_EVENT_SECCOMP tracers.
This change is an attempt to address the code inconsistencies in syscall
error return handling that were introduced as a side effect of the dual
ABI support.
Signed-off-by: Dmitry V. Levin <ldv@strace.io>
---
arch/powerpc/kernel/ptrace/ptrace.c | 23 ++++++++++++++++++++---
arch/powerpc/kernel/signal.c | 11 +++--------
arch/powerpc/kernel/syscall.c | 6 +++---
3 files changed, 26 insertions(+), 14 deletions(-)
diff --git a/arch/powerpc/kernel/ptrace/ptrace.c b/arch/powerpc/kernel/ptrace/ptrace.c
index 727ed4a14545..3778775bf6ba 100644
--- a/arch/powerpc/kernel/ptrace/ptrace.c
+++ b/arch/powerpc/kernel/ptrace/ptrace.c
@@ -207,7 +207,7 @@ static int do_seccomp(struct pt_regs *regs)
* syscall parameter. This is different to the ptrace ABI where
* both r3 and orig_gpr3 contain the first syscall parameter.
*/
- regs->gpr[3] = -ENOSYS;
+ syscall_set_return_value(current, regs, -ENOSYS, 0);
/*
* We use the __ version here because we have already checked
@@ -215,8 +215,18 @@ static int do_seccomp(struct pt_regs *regs)
* have already loaded -ENOSYS into r3, or seccomp has put
* something else in r3 (via SECCOMP_RET_ERRNO/TRACE).
*/
- if (__secure_computing(NULL))
+ if (__secure_computing(NULL)) {
+
+ /*
+ * Traditionally, both the generic kernel -ERRORCODE return
+ * value ABI and the powerpc sc syscall return ABI is
+ * supported. For consistency, if the former is detected,
+ * convert it to the latter.
+ */
+ if (!trap_is_scv(regs) && IS_ERR_VALUE(regs->gpr[3]))
+ syscall_set_return_value(current, regs, regs->gpr[3], 0);
return -1;
+ }
/*
* The syscall was allowed by seccomp, restore the register
@@ -226,6 +236,13 @@ static int do_seccomp(struct pt_regs *regs)
* allow the syscall to proceed.
*/
regs->gpr[3] = regs->orig_gpr3;
+ if (!trap_is_scv(regs)) {
+ /*
+ * Clear SO bit that was set in this function earlier by
+ * syscall_set_return_value.
+ */
+ regs->ccr &= ~0x10000000L;
+ }
return 0;
}
@@ -315,7 +332,7 @@ long do_syscall_trace_enter(struct pt_regs *regs)
* If we are aborting explicitly, or if the syscall number is
* now invalid, set the return value to -ENOSYS.
*/
- regs->gpr[3] = -ENOSYS;
+ syscall_set_return_value(current, regs, -ENOSYS, 0);
return -1;
}
diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c
index aa17e62f3754..1a38d6bcaed6 100644
--- a/arch/powerpc/kernel/signal.c
+++ b/arch/powerpc/kernel/signal.c
@@ -19,6 +19,7 @@
#include <asm/unistd.h>
#include <asm/debug.h>
#include <asm/tm.h>
+#include <asm/syscall.h>
#include "signal.h"
@@ -229,14 +230,8 @@ static void check_syscall_restart(struct pt_regs *regs, struct k_sigaction *ka,
regs_add_return_ip(regs, -4);
regs->result = 0;
} else {
- if (trap_is_scv(regs)) {
- regs->result = -EINTR;
- regs->gpr[3] = -EINTR;
- } else {
- regs->result = -EINTR;
- regs->gpr[3] = EINTR;
- regs->ccr |= 0x10000000;
- }
+ regs->result = -EINTR;
+ syscall_set_return_value(current, regs, -EINTR, 0);
}
}
diff --git a/arch/powerpc/kernel/syscall.c b/arch/powerpc/kernel/syscall.c
index be159ad4b77b..2fe47191e509 100644
--- a/arch/powerpc/kernel/syscall.c
+++ b/arch/powerpc/kernel/syscall.c
@@ -122,7 +122,7 @@ notrace long system_call_exception(struct pt_regs *regs, unsigned long r0)
if (unlikely(trap_is_unsupported_scv(regs))) {
/* Unsupported scv vector */
_exception(SIGILL, regs, ILL_ILLOPC, regs->nip);
- return regs->gpr[3];
+ return regs_return_value(regs);
}
/*
* We use the return value of do_syscall_trace_enter() as the
@@ -133,13 +133,13 @@ notrace long system_call_exception(struct pt_regs *regs, unsigned long r0)
*/
r0 = do_syscall_trace_enter(regs);
if (unlikely(r0 >= NR_syscalls))
- return regs->gpr[3];
+ return regs_return_value(regs);
} else if (unlikely(r0 >= NR_syscalls)) {
if (unlikely(trap_is_unsupported_scv(regs))) {
/* Unsupported scv vector */
_exception(SIGILL, regs, ILL_ILLOPC, regs->nip);
- return regs->gpr[3];
+ return regs_return_value(regs);
}
return -ENOSYS;
}
--
ldv
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH 2/2] powerpc: fix inconsistencies in syscall error return handling
2025-01-27 18:14 ` [PATCH 2/2] powerpc: fix inconsistencies in syscall error return handling Dmitry V. Levin
@ 2025-01-28 18:01 ` Christophe Leroy
2025-01-28 21:39 ` Dmitry V. Levin
0 siblings, 1 reply; 10+ messages in thread
From: Christophe Leroy @ 2025-01-28 18:01 UTC (permalink / raw)
To: Dmitry V. Levin, Michael Ellerman
Cc: Alexey Gladkov, Eugene Syromyatnikov, Oleg Nesterov,
Madhavan Srinivasan, Nicholas Piggin, Naveen N Rao, linuxppc-dev,
strace-devel, linux-kernel
Le 27/01/2025 à 19:14, Dmitry V. Levin a écrit :
> Since the introduction of SECCOMP_RET_TRACE support, the kernel supports
> simultaneously both the generic kernel -ERRORCODE return value ABI and
> the powerpc sc syscall return ABI for PTRACE_EVENT_SECCOMP tracers.
> This change is an attempt to address the code inconsistencies in syscall
> error return handling that were introduced as a side effect of the dual
> ABI support.
>
> Signed-off-by: Dmitry V. Levin <ldv@strace.io>
> ---
> arch/powerpc/kernel/ptrace/ptrace.c | 23 ++++++++++++++++++++---
> arch/powerpc/kernel/signal.c | 11 +++--------
> arch/powerpc/kernel/syscall.c | 6 +++---
> 3 files changed, 26 insertions(+), 14 deletions(-)
>
> diff --git a/arch/powerpc/kernel/ptrace/ptrace.c b/arch/powerpc/kernel/ptrace/ptrace.c
> index 727ed4a14545..3778775bf6ba 100644
> --- a/arch/powerpc/kernel/ptrace/ptrace.c
> +++ b/arch/powerpc/kernel/ptrace/ptrace.c
> @@ -207,7 +207,7 @@ static int do_seccomp(struct pt_regs *regs)
> * syscall parameter. This is different to the ptrace ABI where
> * both r3 and orig_gpr3 contain the first syscall parameter.
> */
> - regs->gpr[3] = -ENOSYS;
> + syscall_set_return_value(current, regs, -ENOSYS, 0);
>
> /*
> * We use the __ version here because we have already checked
> @@ -215,8 +215,18 @@ static int do_seccomp(struct pt_regs *regs)
> * have already loaded -ENOSYS into r3, or seccomp has put
> * something else in r3 (via SECCOMP_RET_ERRNO/TRACE).
> */
> - if (__secure_computing(NULL))
> + if (__secure_computing(NULL)) {
> +
> + /*
> + * Traditionally, both the generic kernel -ERRORCODE return
> + * value ABI and the powerpc sc syscall return ABI is
> + * supported. For consistency, if the former is detected,
> + * convert it to the latter.
> + */
> + if (!trap_is_scv(regs) && IS_ERR_VALUE(regs->gpr[3]))
Why !trap_is_scv(regs) ? Shouldn't this also work with scv allthough it
should be a noop ?
> + syscall_set_return_value(current, regs, regs->gpr[3], 0);
> return -1;
> + }
>
> /*
> * The syscall was allowed by seccomp, restore the register
> @@ -226,6 +236,13 @@ static int do_seccomp(struct pt_regs *regs)
> * allow the syscall to proceed.
> */
> regs->gpr[3] = regs->orig_gpr3;
> + if (!trap_is_scv(regs)) {
> + /*
> + * Clear SO bit that was set in this function earlier by
> + * syscall_set_return_value.
> + */
> + regs->ccr &= ~0x10000000L;
> + }
Can't we use syscall_set_return_value() to do that ?
>
> return 0;
> }
> @@ -315,7 +332,7 @@ long do_syscall_trace_enter(struct pt_regs *regs)
> * If we are aborting explicitly, or if the syscall number is
> * now invalid, set the return value to -ENOSYS.
> */
> - regs->gpr[3] = -ENOSYS;
> + syscall_set_return_value(current, regs, -ENOSYS, 0);
> return -1;
> }
>
> diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c
> index aa17e62f3754..1a38d6bcaed6 100644
> --- a/arch/powerpc/kernel/signal.c
> +++ b/arch/powerpc/kernel/signal.c
> @@ -19,6 +19,7 @@
> #include <asm/unistd.h>
> #include <asm/debug.h>
> #include <asm/tm.h>
> +#include <asm/syscall.h>
>
> #include "signal.h"
>
> @@ -229,14 +230,8 @@ static void check_syscall_restart(struct pt_regs *regs, struct k_sigaction *ka,
> regs_add_return_ip(regs, -4);
> regs->result = 0;
> } else {
> - if (trap_is_scv(regs)) {
> - regs->result = -EINTR;
> - regs->gpr[3] = -EINTR;
> - } else {
> - regs->result = -EINTR;
> - regs->gpr[3] = EINTR;
> - regs->ccr |= 0x10000000;
> - }
> + regs->result = -EINTR;
> + syscall_set_return_value(current, regs, -EINTR, 0);
> }
> }
>
> diff --git a/arch/powerpc/kernel/syscall.c b/arch/powerpc/kernel/syscall.c
> index be159ad4b77b..2fe47191e509 100644
> --- a/arch/powerpc/kernel/syscall.c
> +++ b/arch/powerpc/kernel/syscall.c
> @@ -122,7 +122,7 @@ notrace long system_call_exception(struct pt_regs *regs, unsigned long r0)
> if (unlikely(trap_is_unsupported_scv(regs))) {
> /* Unsupported scv vector */
> _exception(SIGILL, regs, ILL_ILLOPC, regs->nip);
> - return regs->gpr[3];
> + return regs_return_value(regs);
> }
> /*
> * We use the return value of do_syscall_trace_enter() as the
> @@ -133,13 +133,13 @@ notrace long system_call_exception(struct pt_regs *regs, unsigned long r0)
> */
> r0 = do_syscall_trace_enter(regs);
> if (unlikely(r0 >= NR_syscalls))
> - return regs->gpr[3];
> + return regs_return_value(regs);
>
> } else if (unlikely(r0 >= NR_syscalls)) {
> if (unlikely(trap_is_unsupported_scv(regs))) {
> /* Unsupported scv vector */
> _exception(SIGILL, regs, ILL_ILLOPC, regs->nip);
> - return regs->gpr[3];
> + return regs_return_value(regs);
> }
> return -ENOSYS;
> }
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 2/2] powerpc: fix inconsistencies in syscall error return handling
2025-01-28 18:01 ` Christophe Leroy
@ 2025-01-28 21:39 ` Dmitry V. Levin
0 siblings, 0 replies; 10+ messages in thread
From: Dmitry V. Levin @ 2025-01-28 21:39 UTC (permalink / raw)
To: Christophe Leroy
Cc: Michael Ellerman, Alexey Gladkov, Eugene Syromyatnikov,
Oleg Nesterov, Madhavan Srinivasan, Nicholas Piggin, Naveen N Rao,
linuxppc-dev, strace-devel, linux-kernel
On Tue, Jan 28, 2025 at 07:01:47PM +0100, Christophe Leroy wrote:
> Le 27/01/2025 à 19:14, Dmitry V. Levin a écrit :
> > Since the introduction of SECCOMP_RET_TRACE support, the kernel supports
> > simultaneously both the generic kernel -ERRORCODE return value ABI and
> > the powerpc sc syscall return ABI for PTRACE_EVENT_SECCOMP tracers.
> > This change is an attempt to address the code inconsistencies in syscall
> > error return handling that were introduced as a side effect of the dual
> > ABI support.
> >
> > Signed-off-by: Dmitry V. Levin <ldv@strace.io>
> > ---
> > arch/powerpc/kernel/ptrace/ptrace.c | 23 ++++++++++++++++++++---
> > arch/powerpc/kernel/signal.c | 11 +++--------
> > arch/powerpc/kernel/syscall.c | 6 +++---
> > 3 files changed, 26 insertions(+), 14 deletions(-)
> >
> > diff --git a/arch/powerpc/kernel/ptrace/ptrace.c b/arch/powerpc/kernel/ptrace/ptrace.c
> > index 727ed4a14545..3778775bf6ba 100644
> > --- a/arch/powerpc/kernel/ptrace/ptrace.c
> > +++ b/arch/powerpc/kernel/ptrace/ptrace.c
> > @@ -207,7 +207,7 @@ static int do_seccomp(struct pt_regs *regs)
> > * syscall parameter. This is different to the ptrace ABI where
> > * both r3 and orig_gpr3 contain the first syscall parameter.
> > */
> > - regs->gpr[3] = -ENOSYS;
> > + syscall_set_return_value(current, regs, -ENOSYS, 0);
> >
> > /*
> > * We use the __ version here because we have already checked
> > @@ -215,8 +215,18 @@ static int do_seccomp(struct pt_regs *regs)
> > * have already loaded -ENOSYS into r3, or seccomp has put
> > * something else in r3 (via SECCOMP_RET_ERRNO/TRACE).
> > */
> > - if (__secure_computing(NULL))
> > + if (__secure_computing(NULL)) {
> > +
> > + /*
> > + * Traditionally, both the generic kernel -ERRORCODE return
> > + * value ABI and the powerpc sc syscall return ABI is
> > + * supported. For consistency, if the former is detected,
> > + * convert it to the latter.
> > + */
> > + if (!trap_is_scv(regs) && IS_ERR_VALUE(regs->gpr[3]))
>
> Why !trap_is_scv(regs) ? Shouldn't this also work with scv allthough it
> should be a noop ?
In trap_is_scv(regs) case both the source and the target ABIs are
-ERRORCODE so there is no subject for conversion.
> > + syscall_set_return_value(current, regs, regs->gpr[3], 0);
> > return -1;
> > + }
> >
> > /*
> > * The syscall was allowed by seccomp, restore the register
> > @@ -226,6 +236,13 @@ static int do_seccomp(struct pt_regs *regs)
> > * allow the syscall to proceed.
> > */
> > regs->gpr[3] = regs->orig_gpr3;
> > + if (!trap_is_scv(regs)) {
> > + /*
> > + * Clear SO bit that was set in this function earlier by
> > + * syscall_set_return_value.
> > + */
> > + regs->ccr &= ~0x10000000L;
> > + }
>
> Can't we use syscall_set_return_value() to do that ?
Of course we could do
syscall_set_return_value(current, regs, 0, regs->orig_gpr3);
but Michael has objected to this already, see
https://lore.kernel.org/all/87jzajjde1.fsf@mpe.ellerman.id.au/
--
ldv
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 1/2] powerpc: properly negate error in syscall_set_return_value() in sc case
2025-01-27 18:13 [PATCH 1/2] powerpc: properly negate error in syscall_set_return_value() in sc case Dmitry V. Levin
2025-01-27 18:14 ` [PATCH 2/2] powerpc: fix inconsistencies in syscall error return handling Dmitry V. Levin
@ 2025-01-28 14:59 ` Christophe Leroy
2025-01-28 15:52 ` Dmitry V. Levin
2025-01-28 15:54 ` Eugene Syromyatnikov
2025-06-02 11:36 ` Dmitry V. Levin
2 siblings, 2 replies; 10+ messages in thread
From: Christophe Leroy @ 2025-01-28 14:59 UTC (permalink / raw)
To: Dmitry V. Levin, Michael Ellerman
Cc: Alexey Gladkov, Eugene Syromyatnikov, Oleg Nesterov,
Madhavan Srinivasan, Nicholas Piggin, Naveen N Rao, linuxppc-dev,
strace-devel, linux-kernel
Le 27/01/2025 à 19:13, Dmitry V. Levin a écrit :
> According to the Power Architecture Linux system call ABI documented in
> [1], when the syscall is made with the sc instruction, both a value and an
> error condition are returned, where r3 register contains the return value,
> and cr0.SO bit specifies the error condition. When cr0.SO is clear, the
> syscall succeeded and r3 is the return value. When cr0.SO is set, the
> syscall failed and r3 is the error value. This syscall return semantics
> was implemented from the very beginning of Power Architecture on Linux,
> and syscall tracers and debuggers like strace that read or modify syscall
> return information also rely on this ABI.
I see a quite similar ABI on microblaze, mips, nios2 and sparc. Do they
behave all the same ?
>
> r3 and cr0.SO are exposed directly via struct pt_regs where gpr[3] and
> (ccr & 0x10000000) correspond to r3 and cr0.SO, respectively.
> For example, here is an excerpt from check_syscall_restart() that assigns
> these members of struct pt_regs:
> regs->result = -EINTR;
> regs->gpr[3] = EINTR;
> regs->ccr |= 0x10000000;
> In this example, the semantics of negative ERRORCODE that's being used
> virtually everywhere in generic kernel code is translated to powerpc sc
> syscall return ABI which uses positive ERRORCODE and cr0.SO bit.
At what point are they exposed really ? At what point do they need to
comply with the ABI ?
I'm also a bit lost between regs->orig_r3, regs->gpr[3] and regs->result.
The comment added by commit 1b1a3702a65c ("powerpc: Don't negate error
in syscall_set_return_value()") says that CCR needs to be set because of
signal code. But signal code is invoked by syscall_exit_prepare()
through call to interrupt_exit_user_prepare_main() after setting CR[SO]
and negating syscall result.
>
> Also, r3 and cr0.SO are exposed indirectly via helpers.
> For example, here is an excerpt from syscall_get_error():
> /*
> * If the system call failed,
> * regs->gpr[3] contains a positive ERRORCODE.
> */
> return (regs->ccr & 0x10000000UL) ? -regs->gpr[3] : 0;
> and here is another example, from regs_return_value():
> if (is_syscall_success(regs))
> return regs->gpr[3];
> else
> return -regs->gpr[3];
> In these examples, the powerpc sc syscall return ABI which uses positive
> ERRORCODE and cr0.SO bit is translated to the semantics of negative
> ERRORCODE that's being used virtually everywhere in generic kernel code.
>
> Up to a certain point in time the kernel managed to implement the powerpc
> sc syscall return ABI in all cases where struct pt_regs was exposed to user
> space.
>
> The situation changed when SECCOMP_RET_TRACE support was introduced.
> At this point the -ERRORCODE semantics that was used under the hood to
> implement seccomp on powerpc became exposed to user space. The tracer
> handling PTRACE_EVENT_SECCOMP is not just able to observe -ENOSYS in gpr[3]
> - this is relatively harmless as at this stage there is no syscall return
> yet so the powerpc sc syscall return ABI does not apply. What's important
> is that the tracer can change the syscall number to -1 thus making the
> syscall fail, and at this point the tracer is also able to specify the
> error value. This has to be done in accordance with the syscall return
> ABI, however, the current implementation of do_seccomp() supports both the
> generic kernel -ERRORCODE return value ABI and the powerpc sc syscall
> return ABI, thanks to syscall_exit_prepare() that converts the former to
> the latter. Consequently, seccomp_bpf selftest passes both with and
> without this change.
>
> Now comes the moment when PTRACE_SET_SYSCALL_INFO is going to be
> introduced. PTRACE_SET_SYSCALL_INFO is a generic ptrace API that
> complements PTRACE_GET_SYSCALL_INFO by letting the ptracer modify
> the details of the system calls the tracee is blocked in.
>
> One of the helpers that have to be used to implement
> PTRACE_SET_SYSCALL_INFO is syscall_set_return_value().
> This helper complements other two helpers, syscall_get_error() and
> syscall_get_return_value(), that are currently used to implement
> PTRACE_GET_SYSCALL_INFO on syscall return. When syscall_set_return_value()
> is used to set an error code, the caller specifies it as a negative value
> in -ERRORCODE format.
>
> Unfortunately, this does not work well on powerpc since commit 1b1a3702a65c
> ("powerpc: Don't negate error in syscall_set_return_value()") because
> syscall_set_return_value() does not follow the powerpc sc syscall return
> ABI:
> /*
> * In the general case it's not obvious that we must deal with
> * CCR here, as the syscall exit path will also do that for us.
> * However there are some places, eg. the signal code, which
> * check ccr to decide if the value in r3 is actually an error.
> */
> if (error) {
> regs->ccr |= 0x10000000L;
> regs->gpr[3] = error;
> } else {
> regs->ccr &= ~0x10000000L;
> regs->gpr[3] = val;
> }
>
> The reason why this syscall_set_return_value() implementation was able to
> get away with violating the powerpc sc syscall return ABI is the following:
> Up to now, syscall_set_return_value() on powerpc could be called only from
> do_syscall_trace_enter() via do_seccomp(), there was no way it could be
> called from do_syscall_trace_leave() which is the point where tracers on
> syscall return are activated and the powerpc sc syscall return ABI has
> to be respected.
>
> Introduction of PTRACE_SET_SYSCALL_INFO necessitates a change of
> syscall_set_return_value() to comply with the powerpc sc syscall return
> ABI. Without the change, the upcoming ptrace/set_syscall_info selftest
> fails with the following diagnostics:
>
> # set_syscall_info.c:119:set_syscall_info:Expected exp_exit->rval (-38) == info->exit.rval (38)
> # set_syscall_info.c:120:set_syscall_info:wait #4: PTRACE_GET_SYSCALL_INFO #2: exit stop mismatch
>
> Note that since backwards compatibility with the current implementation has
> to be provided, the kernel has to continue supporting simultaneously both
> the generic kernel -ERRORCODE return value ABI and the powerpc sc syscall
> return ABI at least for PTRACE_EVENT_SECCOMP tracers. Consequently, since
> the point of __secure_computing() invocation and up to the point of
> conversion in syscall_exit_prepare(), gpr[3] may be set according to either
> of these two ABIs. An attempt to address code inconsistencies in syscall
> error return handling that were introduced as a side effect of the dual
> ABI support follows in a separate patch.
What do you mean by "backwards compatibility" here ? backwards
compatibility applies only to userspace API doesn't it ? So if there was
no way to trigger the problem previously, what does it mean ?
>
> Link: https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.kernel.org%2Fdoc%2Fhtml%2Flatest%2Farch%2Fpowerpc%2Fsyscall64-abi.html%23return-value&data=05%7C02%7Cchristophe.leroy%40csgroup.eu%7Cc2cf590281c24fe1478408dd3efe4a3e%7C8b87af7d86474dc78df45f69a2011bb5%7C0%7C0%7C638735984085033893%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=EEP9s6k%2Fs5VfqWgrs6VXi879HEfJ8BYOOJ8InmmVTQA%3D&reserved=0 [1]
> Signed-off-by: Dmitry V. Levin <ldv@strace.io>
> Reviewed-by: Alexey Gladkov <legion@kernel.org>
> ---
> arch/powerpc/include/asm/syscall.h | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/include/asm/syscall.h b/arch/powerpc/include/asm/syscall.h
> index 3dd36c5e334a..422d7735ace6 100644
> --- a/arch/powerpc/include/asm/syscall.h
> +++ b/arch/powerpc/include/asm/syscall.h
> @@ -82,7 +82,11 @@ static inline void syscall_set_return_value(struct task_struct *task,
> */
> if (error) {
> regs->ccr |= 0x10000000L;
> - regs->gpr[3] = error;
> + /*
> + * In case of an error regs->gpr[3] contains
> + * a positive ERRORCODE.
> + */
> + regs->gpr[3] = -error;
> } else {
> regs->ccr &= ~0x10000000L;
> regs->gpr[3] = val;
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 1/2] powerpc: properly negate error in syscall_set_return_value() in sc case
2025-01-28 14:59 ` [PATCH 1/2] powerpc: properly negate error in syscall_set_return_value() in sc case Christophe Leroy
@ 2025-01-28 15:52 ` Dmitry V. Levin
2025-01-28 16:00 ` Christophe Leroy
2025-01-28 15:54 ` Eugene Syromyatnikov
1 sibling, 1 reply; 10+ messages in thread
From: Dmitry V. Levin @ 2025-01-28 15:52 UTC (permalink / raw)
To: Christophe Leroy
Cc: Michael Ellerman, Alexey Gladkov, Eugene Syromyatnikov,
Oleg Nesterov, Madhavan Srinivasan, Nicholas Piggin, Naveen N Rao,
linuxppc-dev, strace-devel, linux-kernel
On Tue, Jan 28, 2025 at 03:59:29PM +0100, Christophe Leroy wrote:
> Le 27/01/2025 à 19:13, Dmitry V. Levin a écrit :
> > According to the Power Architecture Linux system call ABI documented in
> > [1], when the syscall is made with the sc instruction, both a value and an
> > error condition are returned, where r3 register contains the return value,
> > and cr0.SO bit specifies the error condition. When cr0.SO is clear, the
> > syscall succeeded and r3 is the return value. When cr0.SO is set, the
> > syscall failed and r3 is the error value. This syscall return semantics
> > was implemented from the very beginning of Power Architecture on Linux,
> > and syscall tracers and debuggers like strace that read or modify syscall
> > return information also rely on this ABI.
>
> I see a quite similar ABI on microblaze, mips, nios2 and sparc. Do they
> behave all the same ?
Yes, also on alpha. I don't think microblaze should be in this list,
though.
> > r3 and cr0.SO are exposed directly via struct pt_regs where gpr[3] and
> > (ccr & 0x10000000) correspond to r3 and cr0.SO, respectively.
> > For example, here is an excerpt from check_syscall_restart() that assigns
> > these members of struct pt_regs:
> > regs->result = -EINTR;
> > regs->gpr[3] = EINTR;
> > regs->ccr |= 0x10000000;
> > In this example, the semantics of negative ERRORCODE that's being used
> > virtually everywhere in generic kernel code is translated to powerpc sc
> > syscall return ABI which uses positive ERRORCODE and cr0.SO bit.
>
> At what point are they exposed really ? At what point do they need to
> comply with the ABI ?
That's a good question. Of course when returning to user space, but,
besides that, at syscall exit tracepoint (trace_sys_exit), ptrace syscall
exit stop (ptrace_report_syscall_exit), and PTRACE_EVENT_SECCOMP stop
(__secure_computing). There could be other points where this is exposed.
For example, on many architectures the tracer can specify syscall error
return value also at ptrace syscall entry stop
(ptrace_report_syscall_entry), but powerpc does not implement this.
> I'm also a bit lost between regs->orig_r3, regs->gpr[3] and regs->result.
I'm not aware of the role of regs->result as it's not in struct user_pt_regs
and therefore is not exposed to user space.
> The comment added by commit 1b1a3702a65c ("powerpc: Don't negate error
> in syscall_set_return_value()") says that CCR needs to be set because of
> signal code. But signal code is invoked by syscall_exit_prepare()
> through call to interrupt_exit_user_prepare_main() after setting CR[SO]
> and negating syscall result.
I hope Michael can comment on this.
> > Also, r3 and cr0.SO are exposed indirectly via helpers.
> > For example, here is an excerpt from syscall_get_error():
> > /*
> > * If the system call failed,
> > * regs->gpr[3] contains a positive ERRORCODE.
> > */
> > return (regs->ccr & 0x10000000UL) ? -regs->gpr[3] : 0;
> > and here is another example, from regs_return_value():
> > if (is_syscall_success(regs))
> > return regs->gpr[3];
> > else
> > return -regs->gpr[3];
> > In these examples, the powerpc sc syscall return ABI which uses positive
> > ERRORCODE and cr0.SO bit is translated to the semantics of negative
> > ERRORCODE that's being used virtually everywhere in generic kernel code.
> >
> > Up to a certain point in time the kernel managed to implement the powerpc
> > sc syscall return ABI in all cases where struct pt_regs was exposed to user
> > space.
> >
> > The situation changed when SECCOMP_RET_TRACE support was introduced.
> > At this point the -ERRORCODE semantics that was used under the hood to
> > implement seccomp on powerpc became exposed to user space. The tracer
> > handling PTRACE_EVENT_SECCOMP is not just able to observe -ENOSYS in gpr[3]
> > - this is relatively harmless as at this stage there is no syscall return
> > yet so the powerpc sc syscall return ABI does not apply. What's important
> > is that the tracer can change the syscall number to -1 thus making the
> > syscall fail, and at this point the tracer is also able to specify the
> > error value. This has to be done in accordance with the syscall return
> > ABI, however, the current implementation of do_seccomp() supports both the
> > generic kernel -ERRORCODE return value ABI and the powerpc sc syscall
> > return ABI, thanks to syscall_exit_prepare() that converts the former to
> > the latter. Consequently, seccomp_bpf selftest passes both with and
> > without this change.
> >
> > Now comes the moment when PTRACE_SET_SYSCALL_INFO is going to be
> > introduced. PTRACE_SET_SYSCALL_INFO is a generic ptrace API that
> > complements PTRACE_GET_SYSCALL_INFO by letting the ptracer modify
> > the details of the system calls the tracee is blocked in.
> >
> > One of the helpers that have to be used to implement
> > PTRACE_SET_SYSCALL_INFO is syscall_set_return_value().
> > This helper complements other two helpers, syscall_get_error() and
> > syscall_get_return_value(), that are currently used to implement
> > PTRACE_GET_SYSCALL_INFO on syscall return. When syscall_set_return_value()
> > is used to set an error code, the caller specifies it as a negative value
> > in -ERRORCODE format.
> >
> > Unfortunately, this does not work well on powerpc since commit 1b1a3702a65c
> > ("powerpc: Don't negate error in syscall_set_return_value()") because
> > syscall_set_return_value() does not follow the powerpc sc syscall return
> > ABI:
> > /*
> > * In the general case it's not obvious that we must deal with
> > * CCR here, as the syscall exit path will also do that for us.
> > * However there are some places, eg. the signal code, which
> > * check ccr to decide if the value in r3 is actually an error.
> > */
> > if (error) {
> > regs->ccr |= 0x10000000L;
> > regs->gpr[3] = error;
> > } else {
> > regs->ccr &= ~0x10000000L;
> > regs->gpr[3] = val;
> > }
> >
> > The reason why this syscall_set_return_value() implementation was able to
> > get away with violating the powerpc sc syscall return ABI is the following:
> > Up to now, syscall_set_return_value() on powerpc could be called only from
> > do_syscall_trace_enter() via do_seccomp(), there was no way it could be
> > called from do_syscall_trace_leave() which is the point where tracers on
> > syscall return are activated and the powerpc sc syscall return ABI has
> > to be respected.
> >
> > Introduction of PTRACE_SET_SYSCALL_INFO necessitates a change of
> > syscall_set_return_value() to comply with the powerpc sc syscall return
> > ABI. Without the change, the upcoming ptrace/set_syscall_info selftest
> > fails with the following diagnostics:
> >
> > # set_syscall_info.c:119:set_syscall_info:Expected exp_exit->rval (-38) == info->exit.rval (38)
> > # set_syscall_info.c:120:set_syscall_info:wait #4: PTRACE_GET_SYSCALL_INFO #2: exit stop mismatch
> >
> > Note that since backwards compatibility with the current implementation has
> > to be provided, the kernel has to continue supporting simultaneously both
> > the generic kernel -ERRORCODE return value ABI and the powerpc sc syscall
> > return ABI at least for PTRACE_EVENT_SECCOMP tracers. Consequently, since
> > the point of __secure_computing() invocation and up to the point of
> > conversion in syscall_exit_prepare(), gpr[3] may be set according to either
> > of these two ABIs. An attempt to address code inconsistencies in syscall
> > error return handling that were introduced as a side effect of the dual
> > ABI support follows in a separate patch.
>
> What do you mean by "backwards compatibility" here ? backwards
> compatibility applies only to userspace API doesn't it ? So if there was
> no way to trigger the problem previously, what does it mean ?
As I wrote earlier, userspace tracers handling PTRACE_EVENT_SECCOMP
are able to set gpr[3] to either -ERRORCODE or ERRORCODE (along with
SO bit) for all these years, and both methods work up to now.
--
ldv
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 1/2] powerpc: properly negate error in syscall_set_return_value() in sc case
2025-01-28 15:52 ` Dmitry V. Levin
@ 2025-01-28 16:00 ` Christophe Leroy
2025-01-28 16:10 ` Dmitry V. Levin
0 siblings, 1 reply; 10+ messages in thread
From: Christophe Leroy @ 2025-01-28 16:00 UTC (permalink / raw)
To: Dmitry V. Levin
Cc: Michael Ellerman, Alexey Gladkov, Eugene Syromyatnikov,
Oleg Nesterov, Madhavan Srinivasan, Nicholas Piggin, Naveen N Rao,
linuxppc-dev, strace-devel, linux-kernel
Le 28/01/2025 à 16:52, Dmitry V. Levin a écrit :
> On Tue, Jan 28, 2025 at 03:59:29PM +0100, Christophe Leroy wrote:
>> Le 27/01/2025 à 19:13, Dmitry V. Levin a écrit :
>>> According to the Power Architecture Linux system call ABI documented in
>>> [1], when the syscall is made with the sc instruction, both a value and an
>>> error condition are returned, where r3 register contains the return value,
>>> and cr0.SO bit specifies the error condition. When cr0.SO is clear, the
>>> syscall succeeded and r3 is the return value. When cr0.SO is set, the
>>> syscall failed and r3 is the error value. This syscall return semantics
>>> was implemented from the very beginning of Power Architecture on Linux,
>>> and syscall tracers and debuggers like strace that read or modify syscall
>>> return information also rely on this ABI.
>>
>> I see a quite similar ABI on microblaze, mips, nios2 and sparc. Do they
>> behave all the same ?
>
> Yes, also on alpha. I don't think microblaze should be in this list,
> though.
Microblaze has
static inline void syscall_set_return_value(struct task_struct *task,
struct pt_regs *regs,
int error, long val)
{
if (error)
regs->r3 = -error;
else
regs->r3 = val;
}
So it has a positive error setting allthough it has no flag to tell it
is an error. Wondering how it works at the end.
Alpha I'm not sure, I see nothing obvious in include/asm/ptrace.h or
include/asm/syscall.h
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 1/2] powerpc: properly negate error in syscall_set_return_value() in sc case
2025-01-28 16:00 ` Christophe Leroy
@ 2025-01-28 16:10 ` Dmitry V. Levin
0 siblings, 0 replies; 10+ messages in thread
From: Dmitry V. Levin @ 2025-01-28 16:10 UTC (permalink / raw)
To: Christophe Leroy
Cc: Michael Ellerman, Alexey Gladkov, Eugene Syromyatnikov,
Oleg Nesterov, Madhavan Srinivasan, Nicholas Piggin, Naveen N Rao,
linuxppc-dev, strace-devel, linux-kernel
On Tue, Jan 28, 2025 at 05:00:31PM +0100, Christophe Leroy wrote:
> Le 28/01/2025 à 16:52, Dmitry V. Levin a écrit :
> > On Tue, Jan 28, 2025 at 03:59:29PM +0100, Christophe Leroy wrote:
> >> Le 27/01/2025 à 19:13, Dmitry V. Levin a écrit :
> >>> According to the Power Architecture Linux system call ABI documented in
> >>> [1], when the syscall is made with the sc instruction, both a value and an
> >>> error condition are returned, where r3 register contains the return value,
> >>> and cr0.SO bit specifies the error condition. When cr0.SO is clear, the
> >>> syscall succeeded and r3 is the return value. When cr0.SO is set, the
> >>> syscall failed and r3 is the error value. This syscall return semantics
> >>> was implemented from the very beginning of Power Architecture on Linux,
> >>> and syscall tracers and debuggers like strace that read or modify syscall
> >>> return information also rely on this ABI.
> >>
> >> I see a quite similar ABI on microblaze, mips, nios2 and sparc. Do they
> >> behave all the same ?
> >
> > Yes, also on alpha. I don't think microblaze should be in this list,
> > though.
>
> Microblaze has
>
> static inline void syscall_set_return_value(struct task_struct *task,
> struct pt_regs *regs,
> int error, long val)
> {
> if (error)
> regs->r3 = -error;
> else
> regs->r3 = val;
> }
>
> So it has a positive error setting allthough it has no flag to tell it
> is an error. Wondering how it works at the end.
It's a bug, but given that microblaze doesn't enable
CONFIG_HAVE_ARCH_TRACEHOOK, most likely this function is unused there.
> Alpha I'm not sure, I see nothing obvious in include/asm/ptrace.h or
> include/asm/syscall.h
Alpha doesn't enable CONFIG_HAVE_ARCH_TRACEHOOK, it just lacks the
necessary interfaces, but it uses a3 register for this purpose, see
arch/alpha/kernel/entry.S for details.
--
ldv
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 1/2] powerpc: properly negate error in syscall_set_return_value() in sc case
2025-01-28 14:59 ` [PATCH 1/2] powerpc: properly negate error in syscall_set_return_value() in sc case Christophe Leroy
2025-01-28 15:52 ` Dmitry V. Levin
@ 2025-01-28 15:54 ` Eugene Syromyatnikov
1 sibling, 0 replies; 10+ messages in thread
From: Eugene Syromyatnikov @ 2025-01-28 15:54 UTC (permalink / raw)
To: Christophe Leroy
Cc: Dmitry V. Levin, Michael Ellerman, Alexey Gladkov, Oleg Nesterov,
Madhavan Srinivasan, Nicholas Piggin, Naveen N Rao, linuxppc-dev,
strace-devel, linux-kernel
On Tue, Jan 28, 2025 at 3:59 PM Christophe Leroy
<christophe.leroy@csgroup.eu> wrote:
>
>
> Le 27/01/2025 à 19:13, Dmitry V. Levin a écrit :
> > According to the Power Architecture Linux system call ABI documented in
> > [1], when the syscall is made with the sc instruction, both a value and an
> > error condition are returned, where r3 register contains the return value,
> > and cr0.SO bit specifies the error condition. When cr0.SO is clear, the
> > syscall succeeded and r3 is the return value. When cr0.SO is set, the
> > syscall failed and r3 is the error value. This syscall return semantics
> > was implemented from the very beginning of Power Architecture on Linux,
> > and syscall tracers and debuggers like strace that read or modify syscall
> > return information also rely on this ABI.
>
> I see a quite similar ABI on microblaze, mips, nios2 and sparc. Do they
> behave all the same ?
In terms of ABI, yes:
https://gitlab.com/strace/strace/-/blob/master/src/linux/sparc/get_error.c
https://gitlab.com/strace/strace/-/blob/master/src/linux/mips/get_error.c
https://gitlab.com/strace/strace/-/blob/master/src/linux/nios2/get_error.c
(cf. https://gitlab.com/strace/strace/-/blob/master/src/linux/powerpc/get_error.c
)
I can't find microblaze SyscallV ABI, but at least both strace and
glibc expect negated errno in r3:
https://gitlab.com/strace/strace/-/blob/master/src/linux/microblaze/get_error.c
https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/microblaze/syscall.S;h=6ebf688d800e19c14a140f0cd5a0ba9344fa95a5;hb=HEAD
> > r3 and cr0.SO are exposed directly via struct pt_regs where gpr[3] and
> > (ccr & 0x10000000) correspond to r3 and cr0.SO, respectively.
> > For example, here is an excerpt from check_syscall_restart() that assigns
> > these members of struct pt_regs:
> > regs->result = -EINTR;
> > regs->gpr[3] = EINTR;
> > regs->ccr |= 0x10000000;
> > In this example, the semantics of negative ERRORCODE that's being used
> > virtually everywhere in generic kernel code is translated to powerpc sc
> > syscall return ABI which uses positive ERRORCODE and cr0.SO bit.
>
> At what point are they exposed really ? At what point do they need to
> comply with the ABI ?
On ptrace stops via pt_regs, for example?
> I'm also a bit lost between regs->orig_r3, regs->gpr[3] and regs->result.
>
> The comment added by commit 1b1a3702a65c ("powerpc: Don't negate error
> in syscall_set_return_value()") says that CCR needs to be set because of
> signal code. But signal code is invoked by syscall_exit_prepare()
> through call to interrupt_exit_user_prepare_main() after setting CR[SO]
> and negating syscall result.
>
> >
> > Also, r3 and cr0.SO are exposed indirectly via helpers.
> > For example, here is an excerpt from syscall_get_error():
> > /*
> > * If the system call failed,
> > * regs->gpr[3] contains a positive ERRORCODE.
> > */
> > return (regs->ccr & 0x10000000UL) ? -regs->gpr[3] : 0;
> > and here is another example, from regs_return_value():
> > if (is_syscall_success(regs))
> > return regs->gpr[3];
> > else
> > return -regs->gpr[3];
> > In these examples, the powerpc sc syscall return ABI which uses positive
> > ERRORCODE and cr0.SO bit is translated to the semantics of negative
> > ERRORCODE that's being used virtually everywhere in generic kernel code.
> >
> > Up to a certain point in time the kernel managed to implement the powerpc
> > sc syscall return ABI in all cases where struct pt_regs was exposed to user
> > space.
> >
> > The situation changed when SECCOMP_RET_TRACE support was introduced.
> > At this point the -ERRORCODE semantics that was used under the hood to
> > implement seccomp on powerpc became exposed to user space. The tracer
> > handling PTRACE_EVENT_SECCOMP is not just able to observe -ENOSYS in gpr[3]
> > - this is relatively harmless as at this stage there is no syscall return
> > yet so the powerpc sc syscall return ABI does not apply. What's important
> > is that the tracer can change the syscall number to -1 thus making the
> > syscall fail, and at this point the tracer is also able to specify the
> > error value. This has to be done in accordance with the syscall return
> > ABI, however, the current implementation of do_seccomp() supports both the
> > generic kernel -ERRORCODE return value ABI and the powerpc sc syscall
> > return ABI, thanks to syscall_exit_prepare() that converts the former to
> > the latter. Consequently, seccomp_bpf selftest passes both with and
> > without this change.
> >
> > Now comes the moment when PTRACE_SET_SYSCALL_INFO is going to be
> > introduced. PTRACE_SET_SYSCALL_INFO is a generic ptrace API that
> > complements PTRACE_GET_SYSCALL_INFO by letting the ptracer modify
> > the details of the system calls the tracee is blocked in.
> >
> > One of the helpers that have to be used to implement
> > PTRACE_SET_SYSCALL_INFO is syscall_set_return_value().
> > This helper complements other two helpers, syscall_get_error() and
> > syscall_get_return_value(), that are currently used to implement
> > PTRACE_GET_SYSCALL_INFO on syscall return. When syscall_set_return_value()
> > is used to set an error code, the caller specifies it as a negative value
> > in -ERRORCODE format.
> >
> > Unfortunately, this does not work well on powerpc since commit 1b1a3702a65c
> > ("powerpc: Don't negate error in syscall_set_return_value()") because
> > syscall_set_return_value() does not follow the powerpc sc syscall return
> > ABI:
> > /*
> > * In the general case it's not obvious that we must deal with
> > * CCR here, as the syscall exit path will also do that for us.
> > * However there are some places, eg. the signal code, which
> > * check ccr to decide if the value in r3 is actually an error.
> > */
> > if (error) {
> > regs->ccr |= 0x10000000L;
> > regs->gpr[3] = error;
> > } else {
> > regs->ccr &= ~0x10000000L;
> > regs->gpr[3] = val;
> > }
> >
> > The reason why this syscall_set_return_value() implementation was able to
> > get away with violating the powerpc sc syscall return ABI is the following:
> > Up to now, syscall_set_return_value() on powerpc could be called only from
> > do_syscall_trace_enter() via do_seccomp(), there was no way it could be
> > called from do_syscall_trace_leave() which is the point where tracers on
> > syscall return are activated and the powerpc sc syscall return ABI has
> > to be respected.
> >
> > Introduction of PTRACE_SET_SYSCALL_INFO necessitates a change of
> > syscall_set_return_value() to comply with the powerpc sc syscall return
> > ABI. Without the change, the upcoming ptrace/set_syscall_info selftest
> > fails with the following diagnostics:
> >
> > # set_syscall_info.c:119:set_syscall_info:Expected exp_exit->rval (-38) == info->exit.rval (38)
> > # set_syscall_info.c:120:set_syscall_info:wait #4: PTRACE_GET_SYSCALL_INFO #2: exit stop mismatch
> >
> > Note that since backwards compatibility with the current implementation has
> > to be provided, the kernel has to continue supporting simultaneously both
> > the generic kernel -ERRORCODE return value ABI and the powerpc sc syscall
> > return ABI at least for PTRACE_EVENT_SECCOMP tracers. Consequently, since
> > the point of __secure_computing() invocation and up to the point of
> > conversion in syscall_exit_prepare(), gpr[3] may be set according to either
> > of these two ABIs. An attempt to address code inconsistencies in syscall
> > error return handling that were introduced as a side effect of the dual
> > ABI support follows in a separate patch.
>
> What do you mean by "backwards compatibility" here ? backwards
> compatibility applies only to userspace API doesn't it ? So if there was
> no way to trigger the problem previously, what does it mean ?
ptrace and seccomp users are pretty much userspace, aren't they?
> >
> > Link: https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.kernel.org%2Fdoc%2Fhtml%2Flatest%2Farch%2Fpowerpc%2Fsyscall64-abi.html%23return-value&data=05%7C02%7Cchristophe.leroy%40csgroup.eu%7Cc2cf590281c24fe1478408dd3efe4a3e%7C8b87af7d86474dc78df45f69a2011bb5%7C0%7C0%7C638735984085033893%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=EEP9s6k%2Fs5VfqWgrs6VXi879HEfJ8BYOOJ8InmmVTQA%3D&reserved=0 [1]
> > Signed-off-by: Dmitry V. Levin <ldv@strace.io>
> > Reviewed-by: Alexey Gladkov <legion@kernel.org>
> > ---
> > arch/powerpc/include/asm/syscall.h | 6 +++++-
> > 1 file changed, 5 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/powerpc/include/asm/syscall.h b/arch/powerpc/include/asm/syscall.h
> > index 3dd36c5e334a..422d7735ace6 100644
> > --- a/arch/powerpc/include/asm/syscall.h
> > +++ b/arch/powerpc/include/asm/syscall.h
> > @@ -82,7 +82,11 @@ static inline void syscall_set_return_value(struct task_struct *task,
> > */
> > if (error) {
> > regs->ccr |= 0x10000000L;
> > - regs->gpr[3] = error;
> > + /*
> > + * In case of an error regs->gpr[3] contains
> > + * a positive ERRORCODE.
> > + */
> > + regs->gpr[3] = -error;
> > } else {
> > regs->ccr &= ~0x10000000L;
> > regs->gpr[3] = val;
>
--
Eugene Syromyatnikov
mailto:evgsyr@gmail.com
xmpp:esyr@jabber.{ru|org}
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 1/2] powerpc: properly negate error in syscall_set_return_value() in sc case
2025-01-27 18:13 [PATCH 1/2] powerpc: properly negate error in syscall_set_return_value() in sc case Dmitry V. Levin
2025-01-27 18:14 ` [PATCH 2/2] powerpc: fix inconsistencies in syscall error return handling Dmitry V. Levin
2025-01-28 14:59 ` [PATCH 1/2] powerpc: properly negate error in syscall_set_return_value() in sc case Christophe Leroy
@ 2025-06-02 11:36 ` Dmitry V. Levin
2 siblings, 0 replies; 10+ messages in thread
From: Dmitry V. Levin @ 2025-06-02 11:36 UTC (permalink / raw)
To: Michael Ellerman, Nicholas Piggin, Christophe Leroy
Cc: Alexey Gladkov, Eugene Syromyatnikov, Oleg Nesterov,
Madhavan Srinivasan, Naveen N Rao, linuxppc-dev, strace-devel,
linux-kernel
On Mon, Jan 27, 2025 at 08:13:23PM +0200, Dmitry V. Levin wrote:
> According to the Power Architecture Linux system call ABI documented in
> [1], when the syscall is made with the sc instruction, both a value and an
> error condition are returned, where r3 register contains the return value,
> and cr0.SO bit specifies the error condition. When cr0.SO is clear, the
> syscall succeeded and r3 is the return value. When cr0.SO is set, the
> syscall failed and r3 is the error value. This syscall return semantics
> was implemented from the very beginning of Power Architecture on Linux,
> and syscall tracers and debuggers like strace that read or modify syscall
> return information also rely on this ABI.
>
> r3 and cr0.SO are exposed directly via struct pt_regs where gpr[3] and
> (ccr & 0x10000000) correspond to r3 and cr0.SO, respectively.
> For example, here is an excerpt from check_syscall_restart() that assigns
> these members of struct pt_regs:
> regs->result = -EINTR;
> regs->gpr[3] = EINTR;
> regs->ccr |= 0x10000000;
> In this example, the semantics of negative ERRORCODE that's being used
> virtually everywhere in generic kernel code is translated to powerpc sc
> syscall return ABI which uses positive ERRORCODE and cr0.SO bit.
>
> Also, r3 and cr0.SO are exposed indirectly via helpers.
> For example, here is an excerpt from syscall_get_error():
> /*
> * If the system call failed,
> * regs->gpr[3] contains a positive ERRORCODE.
> */
> return (regs->ccr & 0x10000000UL) ? -regs->gpr[3] : 0;
> and here is another example, from regs_return_value():
> if (is_syscall_success(regs))
> return regs->gpr[3];
> else
> return -regs->gpr[3];
> In these examples, the powerpc sc syscall return ABI which uses positive
> ERRORCODE and cr0.SO bit is translated to the semantics of negative
> ERRORCODE that's being used virtually everywhere in generic kernel code.
>
> Up to a certain point in time the kernel managed to implement the powerpc
> sc syscall return ABI in all cases where struct pt_regs was exposed to user
> space.
>
> The situation changed when SECCOMP_RET_TRACE support was introduced.
> At this point the -ERRORCODE semantics that was used under the hood to
> implement seccomp on powerpc became exposed to user space. The tracer
> handling PTRACE_EVENT_SECCOMP is not just able to observe -ENOSYS in gpr[3]
> - this is relatively harmless as at this stage there is no syscall return
> yet so the powerpc sc syscall return ABI does not apply. What's important
> is that the tracer can change the syscall number to -1 thus making the
> syscall fail, and at this point the tracer is also able to specify the
> error value. This has to be done in accordance with the syscall return
> ABI, however, the current implementation of do_seccomp() supports both the
> generic kernel -ERRORCODE return value ABI and the powerpc sc syscall
> return ABI, thanks to syscall_exit_prepare() that converts the former to
> the latter. Consequently, seccomp_bpf selftest passes both with and
> without this change.
>
> Now comes the moment when PTRACE_SET_SYSCALL_INFO is going to be
> introduced. PTRACE_SET_SYSCALL_INFO is a generic ptrace API that
> complements PTRACE_GET_SYSCALL_INFO by letting the ptracer modify
> the details of the system calls the tracee is blocked in.
>
> One of the helpers that have to be used to implement
> PTRACE_SET_SYSCALL_INFO is syscall_set_return_value().
> This helper complements other two helpers, syscall_get_error() and
> syscall_get_return_value(), that are currently used to implement
> PTRACE_GET_SYSCALL_INFO on syscall return. When syscall_set_return_value()
> is used to set an error code, the caller specifies it as a negative value
> in -ERRORCODE format.
>
> Unfortunately, this does not work well on powerpc since commit 1b1a3702a65c
> ("powerpc: Don't negate error in syscall_set_return_value()") because
> syscall_set_return_value() does not follow the powerpc sc syscall return
> ABI:
> /*
> * In the general case it's not obvious that we must deal with
> * CCR here, as the syscall exit path will also do that for us.
> * However there are some places, eg. the signal code, which
> * check ccr to decide if the value in r3 is actually an error.
> */
> if (error) {
> regs->ccr |= 0x10000000L;
> regs->gpr[3] = error;
> } else {
> regs->ccr &= ~0x10000000L;
> regs->gpr[3] = val;
> }
>
> The reason why this syscall_set_return_value() implementation was able to
> get away with violating the powerpc sc syscall return ABI is the following:
> Up to now, syscall_set_return_value() on powerpc could be called only from
> do_syscall_trace_enter() via do_seccomp(), there was no way it could be
> called from do_syscall_trace_leave() which is the point where tracers on
> syscall return are activated and the powerpc sc syscall return ABI has
> to be respected.
>
> Introduction of PTRACE_SET_SYSCALL_INFO necessitates a change of
> syscall_set_return_value() to comply with the powerpc sc syscall return
> ABI. Without the change, the upcoming ptrace/set_syscall_info selftest
> fails with the following diagnostics:
>
> # set_syscall_info.c:119:set_syscall_info:Expected exp_exit->rval (-38) == info->exit.rval (38)
> # set_syscall_info.c:120:set_syscall_info:wait #4: PTRACE_GET_SYSCALL_INFO #2: exit stop mismatch
>
> Note that since backwards compatibility with the current implementation has
> to be provided, the kernel has to continue supporting simultaneously both
> the generic kernel -ERRORCODE return value ABI and the powerpc sc syscall
> return ABI at least for PTRACE_EVENT_SECCOMP tracers. Consequently, since
> the point of __secure_computing() invocation and up to the point of
> conversion in syscall_exit_prepare(), gpr[3] may be set according to either
> of these two ABIs. An attempt to address code inconsistencies in syscall
> error return handling that were introduced as a side effect of the dual
> ABI support follows in a separate patch.
>
> Link: https://www.kernel.org/doc/html/latest/arch/powerpc/syscall64-abi.html#return-value [1]
> Signed-off-by: Dmitry V. Levin <ldv@strace.io>
> Reviewed-by: Alexey Gladkov <legion@kernel.org>
> ---
> arch/powerpc/include/asm/syscall.h | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/include/asm/syscall.h b/arch/powerpc/include/asm/syscall.h
> index 3dd36c5e334a..422d7735ace6 100644
> --- a/arch/powerpc/include/asm/syscall.h
> +++ b/arch/powerpc/include/asm/syscall.h
> @@ -82,7 +82,11 @@ static inline void syscall_set_return_value(struct task_struct *task,
> */
> if (error) {
> regs->ccr |= 0x10000000L;
> - regs->gpr[3] = error;
> + /*
> + * In case of an error regs->gpr[3] contains
> + * a positive ERRORCODE.
> + */
> + regs->gpr[3] = -error;
> } else {
> regs->ccr &= ~0x10000000L;
> regs->gpr[3] = val;
Given that the PTRACE_SET_SYSCALL_INFO API has been merged and this
powerpc issue is now exposed to userspace, can we make it fixed, please?
--
ldv
^ permalink raw reply [flat|nested] 10+ messages in thread