* [PATCH] riscv/entry: get correct syscall number from syscall_get_nr()
@ 2024-10-16 17:49 Celeste Liu via B4 Relay
2024-10-21 14:00 ` Björn Töpel
0 siblings, 1 reply; 17+ messages in thread
From: Celeste Liu via B4 Relay @ 2024-10-16 17:49 UTC (permalink / raw)
To: Paul Walmsley, Palmer Dabbelt, Albert Ou, Björn Töpel,
Celeste Liu
Cc: Palmer Dabbelt, Alexandre Ghiti, Dmitry V. Levin,
Andrea Bolognani, Felix Yan, Ruizhe Pan, Shiqi Zhang, Guo Ren,
Yao Zi, Han Gao, linux-riscv, linux-kernel, stable, Celeste Liu
From: Celeste Liu <CoelacanthusHex@gmail.com>
The return value of syscall_enter_from_user_mode() is always -1 when the
syscall was filtered. We can't know whether syscall_nr is -1 when we get -1
from syscall_enter_from_user_mode(). And the old syscall variable is
unusable because syscall_enter_from_user_mode() may change a7 register.
So get correct syscall number from syscall_get_nr().
So syscall number part of return value of syscall_enter_from_user_mode()
is completely useless. We can remove it from API and require caller to
get syscall number from syscall_get_nr(). But this change affect more
architectures and will block more time. So we split it into another
patchset to avoid block this fix. (Other architectures can works
without this change but riscv need it, see Link: tag below)
Fixes: 61119394631f ("riscv: entry: always initialize regs->a0 to -ENOSYS")
Reported-by: Andrea Bolognani <abologna@redhat.com>
Closes: https://github.com/strace/strace/issues/315
Link: https://lore.kernel.org/all/59505464-c84a-403d-972f-d4b2055eeaac@gmail.com/
Signed-off-by: Celeste Liu <CoelacanthusHex@gmail.com>
---
arch/riscv/kernel/traps.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c
index 51ebfd23e0076447518081d137102a9a11ff2e45..3125fab8ee4af468ace9f692dd34e1797555cce3 100644
--- a/arch/riscv/kernel/traps.c
+++ b/arch/riscv/kernel/traps.c
@@ -316,18 +316,25 @@ void do_trap_ecall_u(struct pt_regs *regs)
{
if (user_mode(regs)) {
long syscall = regs->a7;
+ long res;
regs->epc += 4;
regs->orig_a0 = regs->a0;
- regs->a0 = -ENOSYS;
riscv_v_vstate_discard(regs);
- syscall = syscall_enter_from_user_mode(regs, syscall);
+ res = syscall_enter_from_user_mode(regs, syscall);
+ /*
+ * Call syscall_get_nr() again because syscall_enter_from_user_mode()
+ * may change a7 register.
+ */
+ syscall = syscall_get_nr(current, regs);
add_random_kstack_offset();
- if (syscall >= 0 && syscall < NR_syscalls)
+ if (syscall < 0 || syscall >= NR_syscalls)
+ regs->a0 = -ENOSYS;
+ else if (res != -1)
syscall_handler(regs, syscall);
/*
---
base-commit: 2f87d0916ce0d2925cedbc9e8f5d6291ba2ac7b2
change-id: 20241016-fix-riscv-syscall-nr-917b566f97f3
Best regards,
--
Celeste Liu <CoelacanthusHex@gmail.com>
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH] riscv/entry: get correct syscall number from syscall_get_nr()
2024-10-16 17:49 [PATCH] riscv/entry: get correct syscall number from syscall_get_nr() Celeste Liu via B4 Relay
@ 2024-10-21 14:00 ` Björn Töpel
2024-10-21 15:23 ` Celeste Liu
0 siblings, 1 reply; 17+ messages in thread
From: Björn Töpel @ 2024-10-21 14:00 UTC (permalink / raw)
To: Celeste Liu via B4 Relay, Paul Walmsley, Palmer Dabbelt,
Albert Ou, Björn Töpel, Celeste Liu
Cc: Palmer Dabbelt, Alexandre Ghiti, Dmitry V. Levin,
Andrea Bolognani, Felix Yan, Ruizhe Pan, Shiqi Zhang, Guo Ren,
Yao Zi, Han Gao, linux-riscv, linux-kernel, stable, Celeste Liu,
Thomas Gleixner
Celeste!
I'll pick up your thread [1] here, since there's code here! :-) Let's
try to clear up/define the possible flow.
The common/generic entry syscall_enter_from_user_mode{,_work}() says
that a return value of -1 means that the syscall should be skipped.
The RISC-V calling convention uses the same register for arg0/retval
(a0). So, when a ptracer (PTRACE_SYSCALL+PTRACE_GETREGS). That means
that the kernel cannot call into the tracer, *after* changing a0.
The interesting flow for a syscall is roughly:
1. The exception/trap function
2. syscall_enter_from_user_mode() which might return -1, meaning that
the syscall should be skipped. A tracer might have altered the
regs. More importantly, if it's -1 the kernel cannot change the
return value, because seccomp filtering might already done that.
3. If it's a valid syscall, perform the syscall.
Now some scenarios:
* A user does a valid syscall.
* A user does an invalid syscall(-1)
* A user does an invalid syscall(!= -1)
Then there're the tracing variants of those, and that's where we go get
trouble.
Now the bugs we've seen in RISC-V:
1. strace "tracing": Requires that regs->a0 is not tampered with prior
ptrace notification
E.g.:
| # ./strace /
| execve("/", ["/"], 0x7ffffaac3890 /* 21 vars */) = -1 EACCES (Permission denied)
| ./strace: exec: Permission denied
| +++ exited with 1 +++
| # ./disable_ptrace_get_syscall_info ./strace /
| execve(0xffffffffffffffda, ["/"], 0x7fffd893ce10 /* 21 vars */) = -1 EACCES (Permission denied)
| ./strace: exec: Permission denied
| +++ exited with 1 +++
In the second case, arg0 is prematurely set to -ENOSYS
(0xffffffffffffffda).
2. strace "syscall tampering": Requires that ENOSYS is returned for
syscall(-1), and not skipped w/o a proper return value.
E.g.:
| ./strace -a0 -ewrite -einject=write:error=enospc echo helloject=write:error=enospc echo hello
Here, strace expects that injecting -1, would result in a ENOSYS.
3. seccomp filtering: Requires that the a0 is not tampered to
First 3 was broken (tampering with a0 after seccomp), then 2 was
broken (not setting ENOSYS for -1), and finally 1 was broken (and
still is!).
Now for your patch:
Celeste Liu via B4 Relay <devnull+CoelacanthusHex.gmail.com@kernel.org>
writes:
> From: Celeste Liu <CoelacanthusHex@gmail.com>
>
> The return value of syscall_enter_from_user_mode() is always -1 when the
> syscall was filtered. We can't know whether syscall_nr is -1 when we get -1
> from syscall_enter_from_user_mode(). And the old syscall variable is
> unusable because syscall_enter_from_user_mode() may change a7 register.
> So get correct syscall number from syscall_get_nr().
>
> So syscall number part of return value of syscall_enter_from_user_mode()
> is completely useless. We can remove it from API and require caller to
> get syscall number from syscall_get_nr(). But this change affect more
> architectures and will block more time. So we split it into another
> patchset to avoid block this fix. (Other architectures can works
> without this change but riscv need it, see Link: tag below)
>
> Fixes: 61119394631f ("riscv: entry: always initialize regs->a0 to -ENOSYS")
> Reported-by: Andrea Bolognani <abologna@redhat.com>
> Closes: https://github.com/strace/strace/issues/315
> Link: https://lore.kernel.org/all/59505464-c84a-403d-972f-d4b2055eeaac@gmail.com/
> Signed-off-by: Celeste Liu <CoelacanthusHex@gmail.com>
> ---
> arch/riscv/kernel/traps.c | 13 ++++++++++---
> 1 file changed, 10 insertions(+), 3 deletions(-)
>
> diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c
> index 51ebfd23e0076447518081d137102a9a11ff2e45..3125fab8ee4af468ace9f692dd34e1797555cce3 100644
> --- a/arch/riscv/kernel/traps.c
> +++ b/arch/riscv/kernel/traps.c
> @@ -316,18 +316,25 @@ void do_trap_ecall_u(struct pt_regs *regs)
> {
> if (user_mode(regs)) {
> long syscall = regs->a7;
> + long res;
>
> regs->epc += 4;
> regs->orig_a0 = regs->a0;
> - regs->a0 = -ENOSYS;
>
> riscv_v_vstate_discard(regs);
>
> - syscall = syscall_enter_from_user_mode(regs, syscall);
> + res = syscall_enter_from_user_mode(regs, syscall);
> + /*
> + * Call syscall_get_nr() again because syscall_enter_from_user_mode()
> + * may change a7 register.
> + */
> + syscall = syscall_get_nr(current, regs);
>
> add_random_kstack_offset();
>
> - if (syscall >= 0 && syscall < NR_syscalls)
> + if (syscall < 0 || syscall >= NR_syscalls)
> + regs->a0 = -ENOSYS;
> + else if (res != -1)
> syscall_handler(regs, syscall);
Here we can perform the syscall, even if res is -1. E.g., if this path
[2] is taken, we might have a valid syscall number in a7, but the
syscall should not be performed.
Also, one reason for the generic entry is so that it should be less
work. Here, you pull (IMO) details that belong to the common entry
implementation/API up the common entry user. Wdyt about pushing it down
to common entry? Something like:
--8<--
diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c
index 51ebfd23e007..66fded8e4b60 100644
--- a/arch/riscv/kernel/traps.c
+++ b/arch/riscv/kernel/traps.c
@@ -319,7 +319,6 @@ void do_trap_ecall_u(struct pt_regs *regs)
regs->epc += 4;
regs->orig_a0 = regs->a0;
- regs->a0 = -ENOSYS;
riscv_v_vstate_discard(regs);
@@ -329,6 +328,8 @@ void do_trap_ecall_u(struct pt_regs *regs)
if (syscall >= 0 && syscall < NR_syscalls)
syscall_handler(regs, syscall);
+ else if (syscall != -1)
+ regs->a0 = -ENOSYS;
/*
* Ultimately, this value will get limited by KSTACK_OFFSET_MAX(),
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 1e50cdb83ae5..9b69c2ad4f12 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -14,6 +14,7 @@
#include <linux/kmsan.h>
#include <asm/entry-common.h>
+#include <asm/syscall.h>
/*
* Define dummy _TIF work flags if not defined by the architecture or for
@@ -166,6 +167,8 @@ static __always_inline long syscall_enter_from_user_mode_work(struct pt_regs *re
if (work & SYSCALL_WORK_ENTER)
syscall = syscall_trace_enter(regs, syscall, work);
+ else if (syscall == -1L)
+ syscall_set_return_value(current, regs, -ENOSYS, 0);
return syscall;
}
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 5b6934e23c21..99742aee5002 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -43,8 +43,10 @@ long syscall_trace_enter(struct pt_regs *regs, long syscall,
/* Handle ptrace */
if (work & (SYSCALL_WORK_SYSCALL_TRACE | SYSCALL_WORK_SYSCALL_EMU)) {
ret = ptrace_report_syscall_entry(regs);
- if (ret || (work & SYSCALL_WORK_SYSCALL_EMU))
+ if (ret || (work & SYSCALL_WORK_SYSCALL_EMU)) {
+ syscall_set_return_value(current, regs, -ENOSYS, 0);
return -1L;
+ }
}
/* Do seccomp after ptrace, to catch any tracer changes. */
@@ -66,6 +68,14 @@ long syscall_trace_enter(struct pt_regs *regs, long syscall,
syscall = syscall_get_nr(current, regs);
}
+ /*
+ * If we're not setting the return value here, the context is
+ * gone; For higher callers, -1 means that the syscall should
+ * be skipped.
+ */
+ if (syscall == -1L)
+ syscall_set_return_value(current, regs, -ENOSYS, 0);
+
syscall_enter_audit(regs, syscall);
return ret ? : syscall;
--8<--
I did a quick test of the above, and it seems to cover all the previous
bugs -- but who knows! ;-)
Björn
[1] https://lore.kernel.org/linux-riscv/59505464-c84a-403d-972f-d4b2055eeaac@gmail.com/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/entry/common.c#n47
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH] riscv/entry: get correct syscall number from syscall_get_nr()
2024-10-21 14:00 ` Björn Töpel
@ 2024-10-21 15:23 ` Celeste Liu
2024-10-21 16:46 ` Björn Töpel
0 siblings, 1 reply; 17+ messages in thread
From: Celeste Liu @ 2024-10-21 15:23 UTC (permalink / raw)
To: Björn Töpel, Celeste Liu via B4 Relay, Paul Walmsley,
Palmer Dabbelt, Albert Ou, Björn Töpel
Cc: Palmer Dabbelt, Alexandre Ghiti, Dmitry V. Levin,
Andrea Bolognani, Felix Yan, Ruizhe Pan, Shiqi Zhang, Guo Ren,
Yao Zi, Han Gao, linux-riscv, linux-kernel, stable,
Thomas Gleixner
On 2024-10-21 22:00, Björn Töpel wrote:
> Celeste!
>
> I'll pick up your thread [1] here, since there's code here! :-) Let's
> try to clear up/define the possible flow.
>
> The common/generic entry syscall_enter_from_user_mode{,_work}() says
> that a return value of -1 means that the syscall should be skipped.
>
> The RISC-V calling convention uses the same register for arg0/retval
> (a0). So, when a ptracer (PTRACE_SYSCALL+PTRACE_GETREGS). That means
> that the kernel cannot call into the tracer, *after* changing a0.
>
> The interesting flow for a syscall is roughly:
> 1. The exception/trap function
> 2. syscall_enter_from_user_mode() which might return -1, meaning that
> the syscall should be skipped. A tracer might have altered the
> regs. More importantly, if it's -1 the kernel cannot change the
> return value, because seccomp filtering might already done that.
> 3. If it's a valid syscall, perform the syscall.
>
> Now some scenarios:
> * A user does a valid syscall.
> * A user does an invalid syscall(-1)
> * A user does an invalid syscall(!= -1)
>
> Then there're the tracing variants of those, and that's where we go get
> trouble.
>
> Now the bugs we've seen in RISC-V:
>
> 1. strace "tracing": Requires that regs->a0 is not tampered with prior
> ptrace notification
>
> E.g.:
> | # ./strace /
> | execve("/", ["/"], 0x7ffffaac3890 /* 21 vars */) = -1 EACCES (Permission denied)
> | ./strace: exec: Permission denied
> | +++ exited with 1 +++
> | # ./disable_ptrace_get_syscall_info ./strace /
> | execve(0xffffffffffffffda, ["/"], 0x7fffd893ce10 /* 21 vars */) = -1 EACCES (Permission denied)
> | ./strace: exec: Permission denied
> | +++ exited with 1 +++
>
> In the second case, arg0 is prematurely set to -ENOSYS
> (0xffffffffffffffda).
>
> 2. strace "syscall tampering": Requires that ENOSYS is returned for
> syscall(-1), and not skipped w/o a proper return value.
>
> E.g.:
> | ./strace -a0 -ewrite -einject=write:error=enospc echo helloject=write:error=enospc echo hello
>
> Here, strace expects that injecting -1, would result in a ENOSYS.
>
> 3. seccomp filtering: Requires that the a0 is not tampered to
>
> First 3 was broken (tampering with a0 after seccomp), then 2 was
> broken (not setting ENOSYS for -1), and finally 1 was broken (and
> still is!).
>
> Now for your patch:
>
> Celeste Liu via B4 Relay <devnull+CoelacanthusHex.gmail.com@kernel.org>
> writes:
>
>> From: Celeste Liu <CoelacanthusHex@gmail.com>
>>
>> The return value of syscall_enter_from_user_mode() is always -1 when the
>> syscall was filtered. We can't know whether syscall_nr is -1 when we get -1
>> from syscall_enter_from_user_mode(). And the old syscall variable is
>> unusable because syscall_enter_from_user_mode() may change a7 register.
>> So get correct syscall number from syscall_get_nr().
>>
>> So syscall number part of return value of syscall_enter_from_user_mode()
>> is completely useless. We can remove it from API and require caller to
>> get syscall number from syscall_get_nr(). But this change affect more
>> architectures and will block more time. So we split it into another
>> patchset to avoid block this fix. (Other architectures can works
>> without this change but riscv need it, see Link: tag below)
>>
>> Fixes: 61119394631f ("riscv: entry: always initialize regs->a0 to -ENOSYS")
>> Reported-by: Andrea Bolognani <abologna@redhat.com>
>> Closes: https://github.com/strace/strace/issues/315
>> Link: https://lore.kernel.org/all/59505464-c84a-403d-972f-d4b2055eeaac@gmail.com/
>> Signed-off-by: Celeste Liu <CoelacanthusHex@gmail.com>
>> ---
>> arch/riscv/kernel/traps.c | 13 ++++++++++---
>> 1 file changed, 10 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c
>> index 51ebfd23e0076447518081d137102a9a11ff2e45..3125fab8ee4af468ace9f692dd34e1797555cce3 100644
>> --- a/arch/riscv/kernel/traps.c
>> +++ b/arch/riscv/kernel/traps.c
>> @@ -316,18 +316,25 @@ void do_trap_ecall_u(struct pt_regs *regs)
>> {
>> if (user_mode(regs)) {
>> long syscall = regs->a7;
>> + long res;
>>
>> regs->epc += 4;
>> regs->orig_a0 = regs->a0;
>> - regs->a0 = -ENOSYS;
>>
>> riscv_v_vstate_discard(regs);
>>
>> - syscall = syscall_enter_from_user_mode(regs, syscall);
>> + res = syscall_enter_from_user_mode(regs, syscall);
>> + /*
>> + * Call syscall_get_nr() again because syscall_enter_from_user_mode()
>> + * may change a7 register.
>> + */
>> + syscall = syscall_get_nr(current, regs);
>>
>> add_random_kstack_offset();
>>
>> - if (syscall >= 0 && syscall < NR_syscalls)
>> + if (syscall < 0 || syscall >= NR_syscalls)
>> + regs->a0 = -ENOSYS;
>> + else if (res != -1)
>> syscall_handler(regs, syscall);
>
> Here we can perform the syscall, even if res is -1. E.g., if this path
> [2] is taken, we might have a valid syscall number in a7, but the
> syscall should not be performed.
I may misunderstand what you said, but I can't see the issue you pointed.
A syscall is performed iff
1) syscall number in a7 must be valid, so it can reach "else if" branch.
2) res != -1, so syscall_enter_from_user_mode() doesn't return -1 to
inform the syscall should be skipped.
If [2] path is taken, syscall_trace_enter() will return -1 and pass through
syscall_enter_from_user_mode() so res == -1, the syscall will not be performed.
>
> Also, one reason for the generic entry is so that it should be less
> work. Here, you pull (IMO) details that belong to the common entry
> implementation/API up the common entry user. Wdyt about pushing it down
> to common entry? Something like:
Yeah, we can. But I pull it out of common entry to get more simple API semantic:
1. syscall_enter_from_user_mode() will do two things:
1) the return value is only to inform whether the syscall should be skipped.
2) regs will be modified by filters (seccomp or ptrace and so on).
2. for common entry user, there is two informations: syscall number and
the return value of syscall_enter_from_user_mode() (called is_skipped below).
so there is three situations:
1) if syscall number is invalid, the syscall should not be performed, and
we set a0 to -ENOSYS to inform userspace the syscall doesn't exist.
2) if syscall number is valid, is_skipped will be used:
a) if is_skipped is -1, which means there are some filters reject this syscall,
so the syscall should not performed. (Of course, we can use bool instead to
get better semantic)
b) if is_skipped != -1, which means the filters approved this syscall,
so we invoke syscall handler with modified regs.
In your design, the logical condition is not obvious. Why syscall_enter_from_user_mode()
informed the syscall will be skipped but the syscall handler will be called
when syscall number is invalid? The users need to think two things to get result:
a) -1 means skip
b) -1 < 0 in signed integer, so the skip condition is always a invalid syscall number.
In may way, the users only need to think one thing: The syscall_enter_from_user_mode()
said -1 means the syscall should not be performed, so use it as a condition of reject
directly. They just need to combine the informations that they get from API as the
condition of control flow.
>
> --8<--
> diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c
> index 51ebfd23e007..66fded8e4b60 100644
> --- a/arch/riscv/kernel/traps.c
> +++ b/arch/riscv/kernel/traps.c
> @@ -319,7 +319,6 @@ void do_trap_ecall_u(struct pt_regs *regs)
>
> regs->epc += 4;
> regs->orig_a0 = regs->a0;
> - regs->a0 = -ENOSYS;
>
> riscv_v_vstate_discard(regs);
>
> @@ -329,6 +328,8 @@ void do_trap_ecall_u(struct pt_regs *regs)
>
> if (syscall >= 0 && syscall < NR_syscalls)
> syscall_handler(regs, syscall);
> + else if (syscall != -1)
> + regs->a0 = -ENOSYS;
>
> /*
> * Ultimately, this value will get limited by KSTACK_OFFSET_MAX(),
> diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
> index 1e50cdb83ae5..9b69c2ad4f12 100644
> --- a/include/linux/entry-common.h
> +++ b/include/linux/entry-common.h
> @@ -14,6 +14,7 @@
> #include <linux/kmsan.h>
>
> #include <asm/entry-common.h>
> +#include <asm/syscall.h>
>
> /*
> * Define dummy _TIF work flags if not defined by the architecture or for
> @@ -166,6 +167,8 @@ static __always_inline long syscall_enter_from_user_mode_work(struct pt_regs *re
>
> if (work & SYSCALL_WORK_ENTER)
> syscall = syscall_trace_enter(regs, syscall, work);
> + else if (syscall == -1L)
> + syscall_set_return_value(current, regs, -ENOSYS, 0);
>
> return syscall;
> }
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index 5b6934e23c21..99742aee5002 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -43,8 +43,10 @@ long syscall_trace_enter(struct pt_regs *regs, long syscall,
> /* Handle ptrace */
> if (work & (SYSCALL_WORK_SYSCALL_TRACE | SYSCALL_WORK_SYSCALL_EMU)) {
> ret = ptrace_report_syscall_entry(regs);
> - if (ret || (work & SYSCALL_WORK_SYSCALL_EMU))
> + if (ret || (work & SYSCALL_WORK_SYSCALL_EMU)) {
> + syscall_set_return_value(current, regs, -ENOSYS, 0);
> return -1L;
> + }
> }
>
> /* Do seccomp after ptrace, to catch any tracer changes. */
> @@ -66,6 +68,14 @@ long syscall_trace_enter(struct pt_regs *regs, long syscall,
> syscall = syscall_get_nr(current, regs);
> }
>
> + /*
> + * If we're not setting the return value here, the context is
> + * gone; For higher callers, -1 means that the syscall should
> + * be skipped.
> + */
> + if (syscall == -1L)
> + syscall_set_return_value(current, regs, -ENOSYS, 0);
> +
> syscall_enter_audit(regs, syscall);
>
> return ret ? : syscall;
> --8<--
>
> I did a quick test of the above, and it seems to cover all the previous
> bugs -- but who knows! ;-)
>
>
> Björn
>
> [1] https://lore.kernel.org/linux-riscv/59505464-c84a-403d-972f-d4b2055eeaac@gmail.com/
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/entry/common.c#n47
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] riscv/entry: get correct syscall number from syscall_get_nr()
2024-10-21 15:23 ` Celeste Liu
@ 2024-10-21 16:46 ` Björn Töpel
2024-10-25 13:12 ` Thomas Gleixner
0 siblings, 1 reply; 17+ messages in thread
From: Björn Töpel @ 2024-10-21 16:46 UTC (permalink / raw)
To: Celeste Liu, Celeste Liu via B4 Relay, Paul Walmsley,
Palmer Dabbelt, Albert Ou, Björn Töpel
Cc: Palmer Dabbelt, Alexandre Ghiti, Dmitry V. Levin,
Andrea Bolognani, Felix Yan, Ruizhe Pan, Shiqi Zhang, Guo Ren,
Yao Zi, Han Gao, linux-riscv, linux-kernel, stable,
Thomas Gleixner
Celeste Liu <coelacanthushex@gmail.com> writes:
>>> diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c
>>> index 51ebfd23e0076447518081d137102a9a11ff2e45..3125fab8ee4af468ace9f692dd34e1797555cce3 100644
>>> --- a/arch/riscv/kernel/traps.c
>>> +++ b/arch/riscv/kernel/traps.c
>>> @@ -316,18 +316,25 @@ void do_trap_ecall_u(struct pt_regs *regs)
>>> {
>>> if (user_mode(regs)) {
>>> long syscall = regs->a7;
>>> + long res;
>>>
>>> regs->epc += 4;
>>> regs->orig_a0 = regs->a0;
>>> - regs->a0 = -ENOSYS;
>>>
>>> riscv_v_vstate_discard(regs);
>>>
>>> - syscall = syscall_enter_from_user_mode(regs, syscall);
>>> + res = syscall_enter_from_user_mode(regs, syscall);
>>> + /*
>>> + * Call syscall_get_nr() again because syscall_enter_from_user_mode()
>>> + * may change a7 register.
>>> + */
>>> + syscall = syscall_get_nr(current, regs);
>>>
>>> add_random_kstack_offset();
>>>
>>> - if (syscall >= 0 && syscall < NR_syscalls)
>>> + if (syscall < 0 || syscall >= NR_syscalls)
>>> + regs->a0 = -ENOSYS;
>>> + else if (res != -1)
>>> syscall_handler(regs, syscall);
>>
>> Here we can perform the syscall, even if res is -1. E.g., if this path
>> [2] is taken, we might have a valid syscall number in a7, but the
>> syscall should not be performed.
>
> I may misunderstand what you said, but I can't see the issue you pointed.
> A syscall is performed iff
>
> 1) syscall number in a7 must be valid, so it can reach "else if" branch.
> 2) res != -1, so syscall_enter_from_user_mode() doesn't return -1 to
> inform the syscall should be skipped.
Ah, indeed. Apologies, that'll work!
Related, now wont this reintroduce the seccomp filtering problem? Say,
res is -1 *and* syscall invalid, then a0 updated by seccomp will be
overwritten here?
>> Also, one reason for the generic entry is so that it should be less
>> work. Here, you pull (IMO) details that belong to the common entry
>> implementation/API up the common entry user. Wdyt about pushing it down
>> to common entry? Something like:
>
> Yeah, we can. But I pull it out of common entry to get more simple API semantic:
>
> 1. syscall_enter_from_user_mode() will do two things:
> 1) the return value is only to inform whether the syscall should be skipped.
> 2) regs will be modified by filters (seccomp or ptrace and so on).
> 2. for common entry user, there is two informations: syscall number and
> the return value of syscall_enter_from_user_mode() (called is_skipped below).
> so there is three situations:
> 1) if syscall number is invalid, the syscall should not be performed, and
> we set a0 to -ENOSYS to inform userspace the syscall doesn't exist.
> 2) if syscall number is valid, is_skipped will be used:
> a) if is_skipped is -1, which means there are some filters reject this syscall,
> so the syscall should not performed. (Of course, we can use bool instead to
> get better semantic)
> b) if is_skipped != -1, which means the filters approved this syscall,
> so we invoke syscall handler with modified regs.
>
> In your design, the logical condition is not obvious. Why syscall_enter_from_user_mode()
> informed the syscall will be skipped but the syscall handler will be called
> when syscall number is invalid? The users need to think two things to get result:
> a) -1 means skip
> b) -1 < 0 in signed integer, so the skip condition is always a invalid syscall number.
>
> In may way, the users only need to think one thing: The syscall_enter_from_user_mode()
> said -1 means the syscall should not be performed, so use it as a condition of reject
> directly. They just need to combine the informations that they get from API as the
> condition of control flow.
I'm all-in for simpler API usage! Maybe massage the
syscall_enter_from_user_mode() (or a new one), so that additional
syscall_get_nr() call is not needed?
Björn
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] riscv/entry: get correct syscall number from syscall_get_nr()
2024-10-21 16:46 ` Björn Töpel
@ 2024-10-25 13:12 ` Thomas Gleixner
2024-10-25 14:30 ` Björn Töpel
0 siblings, 1 reply; 17+ messages in thread
From: Thomas Gleixner @ 2024-10-25 13:12 UTC (permalink / raw)
To: Björn Töpel, Celeste Liu, Celeste Liu via B4 Relay,
Paul Walmsley, Palmer Dabbelt, Albert Ou, Björn Töpel
Cc: Palmer Dabbelt, Alexandre Ghiti, Dmitry V. Levin,
Andrea Bolognani, Felix Yan, Ruizhe Pan, Shiqi Zhang, Guo Ren,
Yao Zi, Han Gao, linux-riscv, linux-kernel, stable
On Mon, Oct 21 2024 at 09:46, Björn Töpel wrote:
> Celeste Liu <coelacanthushex@gmail.com> writes:
>> 1. syscall_enter_from_user_mode() will do two things:
>> 1) the return value is only to inform whether the syscall should be skipped.
>> 2) regs will be modified by filters (seccomp or ptrace and so on).
>> 2. for common entry user, there is two informations: syscall number and
>> the return value of syscall_enter_from_user_mode() (called is_skipped below).
>> so there is three situations:
>> 1) if syscall number is invalid, the syscall should not be performed, and
>> we set a0 to -ENOSYS to inform userspace the syscall doesn't exist.
>> 2) if syscall number is valid, is_skipped will be used:
>> a) if is_skipped is -1, which means there are some filters reject this syscall,
>> so the syscall should not performed. (Of course, we can use bool instead to
>> get better semantic)
>> b) if is_skipped != -1, which means the filters approved this syscall,
>> so we invoke syscall handler with modified regs.
>>
>> In your design, the logical condition is not obvious. Why syscall_enter_from_user_mode()
>> informed the syscall will be skipped but the syscall handler will be called
>> when syscall number is invalid? The users need to think two things to get result:
>> a) -1 means skip
>> b) -1 < 0 in signed integer, so the skip condition is always a invalid syscall number.
>>
>> In may way, the users only need to think one thing: The syscall_enter_from_user_mode()
>> said -1 means the syscall should not be performed, so use it as a condition of reject
>> directly. They just need to combine the informations that they get from API as the
>> condition of control flow.
>
> I'm all-in for simpler API usage! Maybe massage the
> syscall_enter_from_user_mode() (or a new one), so that additional
> syscall_get_nr() call is not needed?
It's completely unclear to me what the actual problem is. The flow how
this works on all architectures is:
regs->orig_a0 = regs->a0
regs->a0 = -ENOSYS;
nr = syscall_enter_from_user_mode(....);
if (nr >= 0)
regs->a0 = nr < MAX_SYSCALL ? syscall(nr) : -ENOSYS;
If syscall_trace_enter() returns -1 to skip the syscall, then regs->a0
is unmodified, unless one of the magic operations modified it.
If syscall_trace_enter() was not active (no tracer, no seccomp ...) then
regs->a0 already contains -ENOSYS.
So what's the exact problem?
Thanks,
tglx
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] riscv/entry: get correct syscall number from syscall_get_nr()
2024-10-25 13:12 ` Thomas Gleixner
@ 2024-10-25 14:30 ` Björn Töpel
2024-10-26 20:21 ` Thomas Gleixner
0 siblings, 1 reply; 17+ messages in thread
From: Björn Töpel @ 2024-10-25 14:30 UTC (permalink / raw)
To: Thomas Gleixner, Celeste Liu, Celeste Liu via B4 Relay,
Paul Walmsley, Palmer Dabbelt, Albert Ou, Björn Töpel
Cc: Palmer Dabbelt, Alexandre Ghiti, Dmitry V. Levin,
Andrea Bolognani, Felix Yan, Ruizhe Pan, Shiqi Zhang, Guo Ren,
Yao Zi, Han Gao, linux-riscv, linux-kernel, stable
Thomas Gleixner <tglx@linutronix.de> writes:
> On Mon, Oct 21 2024 at 09:46, Björn Töpel wrote:
>> Celeste Liu <coelacanthushex@gmail.com> writes:
>>> 1. syscall_enter_from_user_mode() will do two things:
>>> 1) the return value is only to inform whether the syscall should be skipped.
>>> 2) regs will be modified by filters (seccomp or ptrace and so on).
>>> 2. for common entry user, there is two informations: syscall number and
>>> the return value of syscall_enter_from_user_mode() (called is_skipped below).
>>> so there is three situations:
>>> 1) if syscall number is invalid, the syscall should not be performed, and
>>> we set a0 to -ENOSYS to inform userspace the syscall doesn't exist.
>>> 2) if syscall number is valid, is_skipped will be used:
>>> a) if is_skipped is -1, which means there are some filters reject this syscall,
>>> so the syscall should not performed. (Of course, we can use bool instead to
>>> get better semantic)
>>> b) if is_skipped != -1, which means the filters approved this syscall,
>>> so we invoke syscall handler with modified regs.
>>>
>>> In your design, the logical condition is not obvious. Why syscall_enter_from_user_mode()
>>> informed the syscall will be skipped but the syscall handler will be called
>>> when syscall number is invalid? The users need to think two things to get result:
>>> a) -1 means skip
>>> b) -1 < 0 in signed integer, so the skip condition is always a invalid syscall number.
>>>
>>> In may way, the users only need to think one thing: The syscall_enter_from_user_mode()
>>> said -1 means the syscall should not be performed, so use it as a condition of reject
>>> directly. They just need to combine the informations that they get from API as the
>>> condition of control flow.
>>
>> I'm all-in for simpler API usage! Maybe massage the
>> syscall_enter_from_user_mode() (or a new one), so that additional
>> syscall_get_nr() call is not needed?
>
> It's completely unclear to me what the actual problem is. The flow how
> this works on all architectures is:
>
> regs->orig_a0 = regs->a0
> regs->a0 = -ENOSYS;
>
> nr = syscall_enter_from_user_mode(....);
>
> if (nr >= 0)
> regs->a0 = nr < MAX_SYSCALL ? syscall(nr) : -ENOSYS;
>
> If syscall_trace_enter() returns -1 to skip the syscall, then regs->a0
> is unmodified, unless one of the magic operations modified it.
>
> If syscall_trace_enter() was not active (no tracer, no seccomp ...) then
> regs->a0 already contains -ENOSYS.
>
> So what's the exact problem?
It's a mix of calling convention, and UAPI:
* RISC-V uses a0 for arg0 *and* return value (like arm64).
* RISC-V does not expose orig_a0 to userland, and cannot easily start
doing that w/o breaking UAPI.
Now, when setting a0 to -ENOSYS, it's clobbering arg0, and the ptracer
will have an incorrect arg0 (-ENOSYS).
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] riscv/entry: get correct syscall number from syscall_get_nr()
2024-10-25 14:30 ` Björn Töpel
@ 2024-10-26 20:21 ` Thomas Gleixner
2024-10-27 15:29 ` Celeste Liu
2024-10-28 9:45 ` Björn Töpel
0 siblings, 2 replies; 17+ messages in thread
From: Thomas Gleixner @ 2024-10-26 20:21 UTC (permalink / raw)
To: Björn Töpel, Celeste Liu, Celeste Liu via B4 Relay,
Paul Walmsley, Palmer Dabbelt, Albert Ou, Björn Töpel
Cc: Palmer Dabbelt, Alexandre Ghiti, Dmitry V. Levin,
Andrea Bolognani, Felix Yan, Ruizhe Pan, Shiqi Zhang, Guo Ren,
Yao Zi, Han Gao, linux-riscv, linux-kernel, stable
On Fri, Oct 25 2024 at 07:30, Björn Töpel wrote:
> Thomas Gleixner <tglx@linutronix.de> writes:
>> It's completely unclear to me what the actual problem is. The flow how
>> this works on all architectures is:
>>
>> regs->orig_a0 = regs->a0
>> regs->a0 = -ENOSYS;
>>
>> nr = syscall_enter_from_user_mode(....);
>>
>> if (nr >= 0)
>> regs->a0 = nr < MAX_SYSCALL ? syscall(nr) : -ENOSYS;
>>
>> If syscall_trace_enter() returns -1 to skip the syscall, then regs->a0
>> is unmodified, unless one of the magic operations modified it.
>>
>> If syscall_trace_enter() was not active (no tracer, no seccomp ...) then
>> regs->a0 already contains -ENOSYS.
>>
>> So what's the exact problem?
>
> It's a mix of calling convention, and UAPI:
> * RISC-V uses a0 for arg0 *and* return value (like arm64).
> * RISC-V does not expose orig_a0 to userland, and cannot easily start
> doing that w/o breaking UAPI.
>
> Now, when setting a0 to -ENOSYS, it's clobbering arg0, and the ptracer
> will have an incorrect arg0 (-ENOSYS).
Oh I see. I was looking at it from the x86 POV...
Looking deeper into this, this is all completely inconsistent across
architectures. All of them copied either from x86 or from some other
close enough existing copy and changed stuff on top.
So we have two different scenarios AFAICT (I did not look really
deeply):
1) The register which holds the syscall number is used for the
return value
2) An argument register is used for the return value
#1 is the easy case and just "works"
because orig_$REG holds the original syscall number and everything
falls into place.
#2 needs some thought, but we are not going to add this:
> if (work & SYSCALL_WORK_ENTER)
> syscall = syscall_trace_enter(regs, syscall, work);
> + else if (syscall == -1L)
> + syscall_set_return_value(current, regs, -ENOSYS, 0);
>
into the syscall path just to make #2 work. That's hotpath and affects
all other architectures too.
So the problem for the #2 case is that there is no distinction between a
user space issued syscall(@nr = -1) and the return value of (-1) of
various functions involved in the syscall 'tracer' processing.
So what the issue with Celeste's change is:
res = syscall_enter_from_user_mode(regs, syscall);
syscall = syscall_get_nr(current, regs);
add_random_kstack_offset();
if (syscall < 0 || syscall >= NR_syscalls)
regs->a0 = -ENOSYS;
As the tracer can invalidate the syscall number along with regs->a0,
this overwrites the error code set by the tracer. Your solution has a
similar problem.
There is another issue vs. regs->a0. Assume a ptracer modified regs->a0
(arg0) and lets the task continue (no fatal signal pending).
Then the following seccomp() invocation will get regs->orig_a0 from
syscall_get_arguments(), which is not what the ptracer set, right?
Let me look at your failure analysis from your first reply:
> 1. strace "tracing": Requires that regs->a0 is not tampered with prior
> ptrace notification
>
> E.g.:
> | # ./strace /
> | execve("/", ["/"], 0x7ffffaac3890 /* 21 vars */) = -1 EACCES (Permission denied)
> | ./strace: exec: Permission denied
> | +++ exited with 1 +++
> | # ./disable_ptrace_get_syscall_info ./strace /
> | execve(0xffffffffffffffda, ["/"], 0x7fffd893ce10 /* 21 vars */) = -1 EACCES (Permission denied)
> | ./strace: exec: Permission denied
> | +++ exited with 1 +++
>
> In the second case, arg0 is prematurely set to -ENOSYS
> (0xffffffffffffffda).
That's expected if ptrace_get_syscall_info() is not used. Plain dumping
registers will give you the current value on all architectures.
ptrace_get_syscall_info() exist exactly for that reason.
> 2. strace "syscall tampering": Requires that ENOSYS is returned for
> syscall(-1), and not skipped w/o a proper return value.
>
> E.g.:
> | ./strace -a0 -ewrite -einject=write:error=enospc echo helloject=write:error=enospc echo hello
>
> Here, strace expects that injecting -1, would result in a ENOSYS.
No. It expects ENOSPC with the above command line. man strace:
If :error=errno option is specified, a fault is injected into a
syscall invocation: the syscall number is replaced by -1 which
corresponds to an invalid syscall (unless a syscall is specified
with :syscall= option), and the error code is specified using a
symbolic errno value like ENOSYS or a numeric value within
1..4095 range.
Similar for -einject:retval=$N
So you cannot overwrite a0 with ENOSYS if the syscall needs to be
skipped.
> 3. seccomp filtering: Requires that the a0 is not tampered to
No. seccomp uses syscall_get_arguments() which sets a0 to orig_a0 for
inspection. As I said before that fails when the ptracer changed
argument 0 before the seccomp invocation. seccomp will see the original
argument and waves it through.
Looking at Celeste's analysis again:
> We can't know whether syscall_nr is -1 when we get -1
> from syscall_enter_from_user_mode(). And the old syscall variable is
> unusable because syscall_enter_from_user_mode() may change a7 register.
You obviously can save the user space supplied value away
in do_trap_ecall_u() by simply doing
long orig_nr = regs->a7;
No? But I'm not sure that it solves all problems. It cannot solve the
ptrace/seccomp interaction.
The rest of the changelog is simply bogus. Just because riscv made a
mistake with the UABI design does not mean that it's useless for
everyone else.
And no, I'm not going to change x86 for that just to have a pointless
load in the syscall hotpath, when the normal operation just keeps the
syscall number in the same register.
The real problem is that orig_a0 is not exposed in the user view of the
registers. Changing that struct breaks the existing applications
obviously.
But you can expose it without changing the struct by exposing a regset
for orig_a0 which allows you to read and write it similar to what ARM64
does for the syscall number.
That of course requires updated user space, but existing user space will
continue to work with the current limitations.
Thanks,
tglx
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] riscv/entry: get correct syscall number from syscall_get_nr()
2024-10-26 20:21 ` Thomas Gleixner
@ 2024-10-27 15:29 ` Celeste Liu
2024-10-27 15:56 ` Thomas Gleixner
2024-10-28 9:45 ` Björn Töpel
1 sibling, 1 reply; 17+ messages in thread
From: Celeste Liu @ 2024-10-27 15:29 UTC (permalink / raw)
To: Thomas Gleixner, Björn Töpel, Celeste Liu via B4 Relay,
Paul Walmsley, Palmer Dabbelt, Albert Ou, Björn Töpel
Cc: Palmer Dabbelt, Alexandre Ghiti, Dmitry V. Levin,
Andrea Bolognani, Felix Yan, Ruizhe Pan, Shiqi Zhang, Guo Ren,
Yao Zi, Han Gao, linux-riscv, linux-kernel, stable
On 2024-10-27 04:21, Thomas Gleixner wrote:
> On Fri, Oct 25 2024 at 07:30, Björn Töpel wrote:
>> Thomas Gleixner <tglx@linutronix.de> writes:
>>> It's completely unclear to me what the actual problem is. The flow how
>>> this works on all architectures is:
>>>
>>> regs->orig_a0 = regs->a0
>>> regs->a0 = -ENOSYS;
>>>
>>> nr = syscall_enter_from_user_mode(....);
>>>
>>> if (nr >= 0)
>>> regs->a0 = nr < MAX_SYSCALL ? syscall(nr) : -ENOSYS;
>>>
>>> If syscall_trace_enter() returns -1 to skip the syscall, then regs->a0
>>> is unmodified, unless one of the magic operations modified it.
>>>
>>> If syscall_trace_enter() was not active (no tracer, no seccomp ...) then
>>> regs->a0 already contains -ENOSYS.
>>>
>>> So what's the exact problem?
>>
>> It's a mix of calling convention, and UAPI:
>> * RISC-V uses a0 for arg0 *and* return value (like arm64).
>> * RISC-V does not expose orig_a0 to userland, and cannot easily start
>> doing that w/o breaking UAPI.
>>
>> Now, when setting a0 to -ENOSYS, it's clobbering arg0, and the ptracer
>> will have an incorrect arg0 (-ENOSYS).
>
> Oh I see. I was looking at it from the x86 POV...
>
> Looking deeper into this, this is all completely inconsistent across
> architectures. All of them copied either from x86 or from some other
> close enough existing copy and changed stuff on top.
>
> So we have two different scenarios AFAICT (I did not look really
> deeply):
>
> 1) The register which holds the syscall number is used for the
> return value
>
> 2) An argument register is used for the return value
>
> #1 is the easy case and just "works"
>
> because orig_$REG holds the original syscall number and everything
> falls into place.
>
> #2 needs some thought, but we are not going to add this:
>
>> if (work & SYSCALL_WORK_ENTER)
>> syscall = syscall_trace_enter(regs, syscall, work);
>> + else if (syscall == -1L)
>> + syscall_set_return_value(current, regs, -ENOSYS, 0);
>>
>
> into the syscall path just to make #2 work. That's hotpath and affects
> all other architectures too.
>
> So the problem for the #2 case is that there is no distinction between a
> user space issued syscall(@nr = -1) and the return value of (-1) of
> various functions involved in the syscall 'tracer' processing.
>
> So what the issue with Celeste's change is:
>
> res = syscall_enter_from_user_mode(regs, syscall);
> syscall = syscall_get_nr(current, regs);
>
> add_random_kstack_offset();
>
> if (syscall < 0 || syscall >= NR_syscalls)
> regs->a0 = -ENOSYS;
>
> As the tracer can invalidate the syscall number along with regs->a0,
> this overwrites the error code set by the tracer. Your solution has a
> similar problem.
Oh. It's a real issue.
>
> There is another issue vs. regs->a0. Assume a ptracer modified regs->a0
> (arg0) and lets the task continue (no fatal signal pending).
>
> Then the following seccomp() invocation will get regs->orig_a0 from
> syscall_get_arguments(), which is not what the ptracer set, right?
>
> Let me look at your failure analysis from your first reply:
>
>> 1. strace "tracing": Requires that regs->a0 is not tampered with prior
>> ptrace notification
>>
>> E.g.:
>> | # ./strace /
>> | execve("/", ["/"], 0x7ffffaac3890 /* 21 vars */) = -1 EACCES (Permission denied)
>> | ./strace: exec: Permission denied
>> | +++ exited with 1 +++
>> | # ./disable_ptrace_get_syscall_info ./strace /
>> | execve(0xffffffffffffffda, ["/"], 0x7fffd893ce10 /* 21 vars */) = -1 EACCES (Permission denied)
>> | ./strace: exec: Permission denied
>> | +++ exited with 1 +++
>>
>> In the second case, arg0 is prematurely set to -ENOSYS
>> (0xffffffffffffffda).
>
> That's expected if ptrace_get_syscall_info() is not used. Plain dumping
> registers will give you the current value on all architectures.
> ptrace_get_syscall_info() exist exactly for that reason.
>
>> 2. strace "syscall tampering": Requires that ENOSYS is returned for
>> syscall(-1), and not skipped w/o a proper return value.
>>
>> E.g.:
>> | ./strace -a0 -ewrite -einject=write:error=enospc echo helloject=write:error=enospc echo hello
>>
>> Here, strace expects that injecting -1, would result in a ENOSYS.
>
> No. It expects ENOSPC with the above command line. man strace:
>
> If :error=errno option is specified, a fault is injected into a
> syscall invocation: the syscall number is replaced by -1 which
> corresponds to an invalid syscall (unless a syscall is specified
> with :syscall= option), and the error code is specified using a
> symbolic errno value like ENOSYS or a numeric value within
> 1..4095 range.
>
> Similar for -einject:retval=$N
>
> So you cannot overwrite a0 with ENOSYS if the syscall needs to be
> skipped.
>
>> 3. seccomp filtering: Requires that the a0 is not tampered to
>
> No. seccomp uses syscall_get_arguments() which sets a0 to orig_a0 for
> inspection. As I said before that fails when the ptracer changed
> argument 0 before the seccomp invocation. seccomp will see the original
> argument and waves it through.
>
> Looking at Celeste's analysis again:
>
>> We can't know whether syscall_nr is -1 when we get -1
>> from syscall_enter_from_user_mode(). And the old syscall variable is
>> unusable because syscall_enter_from_user_mode() may change a7 register.
>
> You obviously can save the user space supplied value away
> in do_trap_ecall_u() by simply doing
>
> long orig_nr = regs->a7;
>
> No? But I'm not sure that it solves all problems. It cannot solve the
> ptrace/seccomp interaction.
>
> The rest of the changelog is simply bogus. Just because riscv made a
> mistake with the UABI design does not mean that it's useless for
> everyone else.
>
> And no, I'm not going to change x86 for that just to have a pointless
> load in the syscall hotpath, when the normal operation just keeps the
> syscall number in the same register.
>
> The real problem is that orig_a0 is not exposed in the user view of the
> registers. Changing that struct breaks the existing applications
> obviously.
>
> But you can expose it without changing the struct by exposing a regset
> for orig_a0 which allows you to read and write it similar to what ARM64
> does for the syscall number.
If we add something like NT_SYSCALL_NR to UAPI, it cannot solve anything: We
already have PTRACE_GET_SYSCALL_INFO to get syscall number, which was introduced
in 5.3 kernel. The problem is only in the kernel before 5.3. So we can't fix
this issue unless we also backport NT_SYSCALL_NR to 4.19 LTS. But if we can
backport it, we can backport PTRACE_GET_SYSCALL_INFO directly instead.
That's why I try to limit changes in "internal handle logic" in all 3 tries
before.
>
> That of course requires updated user space, but existing user space will
> continue to work with the current limitations.
>
> Thanks,
>
> tglx
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] riscv/entry: get correct syscall number from syscall_get_nr()
2024-10-27 15:29 ` Celeste Liu
@ 2024-10-27 15:56 ` Thomas Gleixner
2024-10-27 17:01 ` Celeste Liu
0 siblings, 1 reply; 17+ messages in thread
From: Thomas Gleixner @ 2024-10-27 15:56 UTC (permalink / raw)
To: Celeste Liu, Björn Töpel, Celeste Liu via B4 Relay,
Paul Walmsley, Palmer Dabbelt, Albert Ou, Björn Töpel
Cc: Palmer Dabbelt, Alexandre Ghiti, Dmitry V. Levin,
Andrea Bolognani, Felix Yan, Ruizhe Pan, Shiqi Zhang, Guo Ren,
Yao Zi, Han Gao, linux-riscv, linux-kernel, stable
On Sun, Oct 27 2024 at 23:29, Celeste Liu wrote:
> On 2024-10-27 04:21, Thomas Gleixner wrote:
>> The real problem is that orig_a0 is not exposed in the user view of the
>> registers. Changing that struct breaks the existing applications
>> obviously.
>>
>> But you can expose it without changing the struct by exposing a regset
>> for orig_a0 which allows you to read and write it similar to what ARM64
>> does for the syscall number.
>
> If we add something like NT_SYSCALL_NR to UAPI, it cannot solve anything: We
> already have PTRACE_GET_SYSCALL_INFO to get syscall number, which was introduced
> in 5.3 kernel. The problem is only in the kernel before 5.3. So we can't fix
> this issue unless we also backport NT_SYSCALL_NR to 4.19 LTS. But if we can
> backport it, we can backport PTRACE_GET_SYSCALL_INFO directly instead.
PTRACE_GET_SYSCALL_INFO only solves half of the problem. It correctly
returns orig_a0, but there is no way to modify orig_a0, which is
required to change arg0.
On x86 AX contains the syscall number and is used for the return
value. So the tracer has do modify orig_AX when it wants to change the
syscall number.
Equivalently you need to be able to modify orig_a0 for changing arg0,
no?
Thanks,
tglx
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] riscv/entry: get correct syscall number from syscall_get_nr()
2024-10-27 15:56 ` Thomas Gleixner
@ 2024-10-27 17:01 ` Celeste Liu
2024-10-27 21:52 ` Thomas Gleixner
0 siblings, 1 reply; 17+ messages in thread
From: Celeste Liu @ 2024-10-27 17:01 UTC (permalink / raw)
To: Thomas Gleixner, Björn Töpel, Celeste Liu via B4 Relay,
Paul Walmsley, Palmer Dabbelt, Albert Ou, Björn Töpel,
Greg Kroah-Hartman
Cc: Palmer Dabbelt, Alexandre Ghiti, Dmitry V. Levin,
Andrea Bolognani, Felix Yan, Ruizhe Pan, Shiqi Zhang, Guo Ren,
Yao Zi, Han Gao, linux-riscv, linux-kernel, stable
On 2024-10-27 23:56, Thomas Gleixner wrote:
> On Sun, Oct 27 2024 at 23:29, Celeste Liu wrote:
>> On 2024-10-27 04:21, Thomas Gleixner wrote:
>>> The real problem is that orig_a0 is not exposed in the user view of the
>>> registers. Changing that struct breaks the existing applications
>>> obviously.
>>>
>>> But you can expose it without changing the struct by exposing a regset
>>> for orig_a0 which allows you to read and write it similar to what ARM64
>>> does for the syscall number.
>>
>> If we add something like NT_SYSCALL_NR to UAPI, it cannot solve anything: We
>> already have PTRACE_GET_SYSCALL_INFO to get syscall number, which was introduced
>> in 5.3 kernel. The problem is only in the kernel before 5.3. So we can't fix
>> this issue unless we also backport NT_SYSCALL_NR to 4.19 LTS. But if we can
>> backport it, we can backport PTRACE_GET_SYSCALL_INFO directly instead.
>
> PTRACE_GET_SYSCALL_INFO only solves half of the problem. It correctly
> returns orig_a0, but there is no way to modify orig_a0, which is
> required to change arg0.
>
> On x86 AX contains the syscall number and is used for the return
> value. So the tracer has do modify orig_AX when it wants to change the
> syscall number.
>
> Equivalently you need to be able to modify orig_a0 for changing arg0,
> no?
Ok.
Greg, could you accept a backport a new API parameter for
PTRACE_GETREGSET/PTRACE_SETREGSET to 4.19 LTS branch?
>
> Thanks,
>
> tglx
>
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] riscv/entry: get correct syscall number from syscall_get_nr()
2024-10-27 17:01 ` Celeste Liu
@ 2024-10-27 21:52 ` Thomas Gleixner
2024-10-28 0:17 ` Ron Economos
0 siblings, 1 reply; 17+ messages in thread
From: Thomas Gleixner @ 2024-10-27 21:52 UTC (permalink / raw)
To: Celeste Liu, Björn Töpel, Celeste Liu via B4 Relay,
Paul Walmsley, Palmer Dabbelt, Albert Ou, Björn Töpel,
Greg Kroah-Hartman
Cc: Palmer Dabbelt, Alexandre Ghiti, Dmitry V. Levin,
Andrea Bolognani, Felix Yan, Ruizhe Pan, Shiqi Zhang, Guo Ren,
Yao Zi, Han Gao, linux-riscv, linux-kernel, stable
On Mon, Oct 28 2024 at 01:01, Celeste Liu wrote:
> On 2024-10-27 23:56, Thomas Gleixner wrote:
>> Equivalently you need to be able to modify orig_a0 for changing arg0,
>> no?
>
> Ok.
>
> Greg, could you accept a backport a new API parameter for
> PTRACE_GETREGSET/PTRACE_SETREGSET to 4.19 LTS branch?
Fix the problem properly and put a proper Fixes tag on it and worry
about the backport later.
Thanks,
tglx
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] riscv/entry: get correct syscall number from syscall_get_nr()
2024-10-27 21:52 ` Thomas Gleixner
@ 2024-10-28 0:17 ` Ron Economos
2024-10-28 16:25 ` Celeste Liu
0 siblings, 1 reply; 17+ messages in thread
From: Ron Economos @ 2024-10-28 0:17 UTC (permalink / raw)
To: Thomas Gleixner, Celeste Liu, Björn Töpel,
Celeste Liu via B4 Relay, Paul Walmsley, Palmer Dabbelt,
Albert Ou, Björn Töpel, Greg Kroah-Hartman
Cc: Palmer Dabbelt, Alexandre Ghiti, Dmitry V. Levin,
Andrea Bolognani, Felix Yan, Ruizhe Pan, Shiqi Zhang, Guo Ren,
Yao Zi, Han Gao, linux-riscv, linux-kernel, stable
On 10/27/24 2:52 PM, Thomas Gleixner wrote:
> On Mon, Oct 28 2024 at 01:01, Celeste Liu wrote:
>> On 2024-10-27 23:56, Thomas Gleixner wrote:
>>> Equivalently you need to be able to modify orig_a0 for changing arg0,
>>> no?
>> Ok.
>>
>> Greg, could you accept a backport a new API parameter for
>> PTRACE_GETREGSET/PTRACE_SETREGSET to 4.19 LTS branch?
> Fix the problem properly and put a proper Fixes tag on it and worry
> about the backport later.
>
> Thanks,
>
> tglx
>
I wouldn't worry about backporting to the 4.19 kernel. It's essentially
prehistoric for RISC-V. There's no device tree support for any hardware.
Also, 4.19 will be going EOL very soon (December 2024).
Ron
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] riscv/entry: get correct syscall number from syscall_get_nr()
2024-10-26 20:21 ` Thomas Gleixner
2024-10-27 15:29 ` Celeste Liu
@ 2024-10-28 9:45 ` Björn Töpel
2024-11-15 21:49 ` Aurelien Jarno
1 sibling, 1 reply; 17+ messages in thread
From: Björn Töpel @ 2024-10-28 9:45 UTC (permalink / raw)
To: Thomas Gleixner, Celeste Liu, Celeste Liu via B4 Relay,
Paul Walmsley, Palmer Dabbelt, Albert Ou, Björn Töpel
Cc: Palmer Dabbelt, Alexandre Ghiti, Dmitry V. Levin,
Andrea Bolognani, Felix Yan, Ruizhe Pan, Shiqi Zhang, Guo Ren,
Yao Zi, Han Gao, linux-riscv, linux-kernel, stable
Thanks for helping out to dissect this! Much appreciated!
Thomas Gleixner <tglx@linutronix.de> writes:
> Let me look at your failure analysis from your first reply:
>
>> 1. strace "tracing": Requires that regs->a0 is not tampered with prior
>> ptrace notification
>>
>> E.g.:
>> | # ./strace /
>> | execve("/", ["/"], 0x7ffffaac3890 /* 21 vars */) = -1 EACCES (Permission denied)
>> | ./strace: exec: Permission denied
>> | +++ exited with 1 +++
>> | # ./disable_ptrace_get_syscall_info ./strace /
>> | execve(0xffffffffffffffda, ["/"], 0x7fffd893ce10 /* 21 vars */) = -1 EACCES (Permission denied)
>> | ./strace: exec: Permission denied
>> | +++ exited with 1 +++
>>
>> In the second case, arg0 is prematurely set to -ENOSYS
>> (0xffffffffffffffda).
>
> That's expected if ptrace_get_syscall_info() is not used. Plain dumping
> registers will give you the current value on all architectures.
> ptrace_get_syscall_info() exist exactly for that reason.
Noted! So this shouldn't be considered as a regression. IOW, the
existing upstream code is OK for this scenario.
>> 2. strace "syscall tampering": Requires that ENOSYS is returned for
>> syscall(-1), and not skipped w/o a proper return value.
>>
>> E.g.:
>> | ./strace -a0 -ewrite -einject=write:error=enospc echo helloject=write:error=enospc echo hello
>>
>> Here, strace expects that injecting -1, would result in a ENOSYS.
>
> No. It expects ENOSPC with the above command line. man strace:
>
> If :error=errno option is specified, a fault is injected into a
> syscall invocation: the syscall number is replaced by -1 which
> corresponds to an invalid syscall (unless a syscall is specified
> with :syscall= option), and the error code is specified using a
> symbolic errno value like ENOSYS or a numeric value within
> 1..4095 range.
>
> Similar for -einject:retval=$N
>
> So you cannot overwrite a0 with ENOSYS if the syscall needs to be
> skipped.
ACK!
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] riscv/entry: get correct syscall number from syscall_get_nr()
2024-10-28 0:17 ` Ron Economos
@ 2024-10-28 16:25 ` Celeste Liu
2024-10-28 19:33 ` Björn Töpel
0 siblings, 1 reply; 17+ messages in thread
From: Celeste Liu @ 2024-10-28 16:25 UTC (permalink / raw)
To: Ron Economos, Thomas Gleixner, Björn Töpel,
Celeste Liu via B4 Relay, Paul Walmsley, Palmer Dabbelt,
Albert Ou, Björn Töpel, Greg Kroah-Hartman
Cc: Palmer Dabbelt, Alexandre Ghiti, Dmitry V. Levin,
Andrea Bolognani, Felix Yan, Ruizhe Pan, Shiqi Zhang, Guo Ren,
Yao Zi, Han Gao, linux-riscv, linux-kernel, stable
On 2024-10-28 08:17, Ron Economos wrote:
> On 10/27/24 2:52 PM, Thomas Gleixner wrote:
>> On Mon, Oct 28 2024 at 01:01, Celeste Liu wrote:
>>> On 2024-10-27 23:56, Thomas Gleixner wrote:
>>>> Equivalently you need to be able to modify orig_a0 for changing arg0,
>>>> no?
>>> Ok.
>>>
>>> Greg, could you accept a backport a new API parameter for
>>> PTRACE_GETREGSET/PTRACE_SETREGSET to 4.19 LTS branch?
>> Fix the problem properly and put a proper Fixes tag on it and worry
>> about the backport later.
>>
>> Thanks,
>>
>> tglx
>>
> I wouldn't worry about backporting to the 4.19 kernel. It's essentially prehistoric for RISC-V. There's no device tree support for any hardware. Also, 4.19 will be going EOL very soon (December 2024).
Ok, I will work on preparing a new patch to add a new set in
PTRACE_GETREGSET/PTRACE_SETREGSET.
>
> Ron
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] riscv/entry: get correct syscall number from syscall_get_nr()
2024-10-28 16:25 ` Celeste Liu
@ 2024-10-28 19:33 ` Björn Töpel
2024-11-30 22:39 ` Celeste Liu
0 siblings, 1 reply; 17+ messages in thread
From: Björn Töpel @ 2024-10-28 19:33 UTC (permalink / raw)
To: Celeste Liu
Cc: Ron Economos, Thomas Gleixner, Celeste Liu via B4 Relay,
Paul Walmsley, Palmer Dabbelt, Albert Ou, Björn Töpel,
Greg Kroah-Hartman, Palmer Dabbelt, Alexandre Ghiti,
Dmitry V. Levin, Andrea Bolognani, Felix Yan, Ruizhe Pan,
Shiqi Zhang, Guo Ren, Yao Zi, Han Gao, linux-riscv, linux-kernel,
stable
On Mon, 28 Oct 2024 at 17:26, Celeste Liu <coelacanthushex@gmail.com> wrote:
>
>
> On 2024-10-28 08:17, Ron Economos wrote:
> > On 10/27/24 2:52 PM, Thomas Gleixner wrote:
> >> On Mon, Oct 28 2024 at 01:01, Celeste Liu wrote:
> >>> On 2024-10-27 23:56, Thomas Gleixner wrote:
> >>>> Equivalently you need to be able to modify orig_a0 for changing arg0,
> >>>> no?
> >>> Ok.
> >>>
> >>> Greg, could you accept a backport a new API parameter for
> >>> PTRACE_GETREGSET/PTRACE_SETREGSET to 4.19 LTS branch?
> >> Fix the problem properly and put a proper Fixes tag on it and worry
> >> about the backport later.
> >>
> >> Thanks,
> >>
> >> tglx
> >>
> > I wouldn't worry about backporting to the 4.19 kernel. It's essentially prehistoric for RISC-V. There's no device tree support for any hardware. Also, 4.19 will be going EOL very soon (December 2024).
>
> Ok, I will work on preparing a new patch to add a new set in
> PTRACE_GETREGSET/PTRACE_SETREGSET.
Thanks for working/finding working on this! Looking forward to the patch!
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] riscv/entry: get correct syscall number from syscall_get_nr()
2024-10-28 9:45 ` Björn Töpel
@ 2024-11-15 21:49 ` Aurelien Jarno
0 siblings, 0 replies; 17+ messages in thread
From: Aurelien Jarno @ 2024-11-15 21:49 UTC (permalink / raw)
To: Björn Töpel
Cc: Thomas Gleixner, Celeste Liu, Celeste Liu via B4 Relay,
Paul Walmsley, Palmer Dabbelt, Albert Ou, Björn Töpel,
Palmer Dabbelt, Alexandre Ghiti, Dmitry V. Levin,
Andrea Bolognani, Felix Yan, Ruizhe Pan, Shiqi Zhang, Guo Ren,
Yao Zi, Han Gao, linux-riscv, linux-kernel, stable
On 2024-10-28 02:45, Björn Töpel wrote:
> Thanks for helping out to dissect this! Much appreciated!
>
> Thomas Gleixner <tglx@linutronix.de> writes:
>
> > Let me look at your failure analysis from your first reply:
> >
> >> 1. strace "tracing": Requires that regs->a0 is not tampered with prior
> >> ptrace notification
> >>
> >> E.g.:
> >> | # ./strace /
> >> | execve("/", ["/"], 0x7ffffaac3890 /* 21 vars */) = -1 EACCES (Permission denied)
> >> | ./strace: exec: Permission denied
> >> | +++ exited with 1 +++
> >> | # ./disable_ptrace_get_syscall_info ./strace /
> >> | execve(0xffffffffffffffda, ["/"], 0x7fffd893ce10 /* 21 vars */) = -1 EACCES (Permission denied)
> >> | ./strace: exec: Permission denied
> >> | +++ exited with 1 +++
> >>
> >> In the second case, arg0 is prematurely set to -ENOSYS
> >> (0xffffffffffffffda).
> >
> > That's expected if ptrace_get_syscall_info() is not used. Plain dumping
> > registers will give you the current value on all architectures.
> > ptrace_get_syscall_info() exist exactly for that reason.
>
> Noted! So this shouldn't be considered as a regression. IOW, the
> existing upstream code is OK for this scenario.
Not however that it breaks some programs, for instance I arrived on this
thread by debugging python-ptrace. I'll try to look at adding support
for ptrace_get_syscall_info(), but I am afraid we will find more broken
programs.
Regards
Aurelien
[1] https://buildd.debian.org/status/fetch.php?pkg=python-ptrace&arch=riscv64&ver=0.9.9-0.1%2Bb2&stamp=1731547088&raw=0
--
Aurelien Jarno GPG: 4096R/1DDD8C9B
aurelien@aurel32.net http://aurel32.net
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] riscv/entry: get correct syscall number from syscall_get_nr()
2024-10-28 19:33 ` Björn Töpel
@ 2024-11-30 22:39 ` Celeste Liu
0 siblings, 0 replies; 17+ messages in thread
From: Celeste Liu @ 2024-11-30 22:39 UTC (permalink / raw)
To: Björn Töpel, Celeste Liu
Cc: Ron Economos, Thomas Gleixner, Celeste Liu via B4 Relay,
Paul Walmsley, Palmer Dabbelt, Albert Ou, Björn Töpel,
Greg Kroah-Hartman, Palmer Dabbelt, Alexandre Ghiti,
Dmitry V. Levin, Andrea Bolognani, Felix Yan, Ruizhe Pan,
Shiqi Zhang, Guo Ren, Yao Zi, Han Gao, linux-riscv, linux-kernel,
stable
On 2024-10-29 03:33, Björn Töpel wrote:
> On Mon, 28 Oct 2024 at 17:26, Celeste Liu <coelacanthushex@gmail.com> wrote:
>>
>>
>> On 2024-10-28 08:17, Ron Economos wrote:
>>> On 10/27/24 2:52 PM, Thomas Gleixner wrote:
>>>> On Mon, Oct 28 2024 at 01:01, Celeste Liu wrote:
>>>>> On 2024-10-27 23:56, Thomas Gleixner wrote:
>>>>>> Equivalently you need to be able to modify orig_a0 for changing arg0,
>>>>>> no?
>>>>> Ok.
>>>>>
>>>>> Greg, could you accept a backport a new API parameter for
>>>>> PTRACE_GETREGSET/PTRACE_SETREGSET to 4.19 LTS branch?
>>>> Fix the problem properly and put a proper Fixes tag on it and worry
>>>> about the backport later.
>>>>
>>>> Thanks,
>>>>
>>>> tglx
>>>>
>>> I wouldn't worry about backporting to the 4.19 kernel. It's essentially prehistoric for RISC-V. There's no device tree support for any hardware. Also, 4.19 will be going EOL very soon (December 2024).
>>
>> Ok, I will work on preparing a new patch to add a new set in
>> PTRACE_GETREGSET/PTRACE_SETREGSET.
>
> Thanks for working/finding working on this! Looking forward to the patch!
The patch to add new regset has been sent.
See https://lore.kernel.org/lkml/20241201-riscv-new-regset-v1-1-c83c58abcc7b@coelacanthus.name/T/
^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2024-11-30 22:40 UTC | newest]
Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-10-16 17:49 [PATCH] riscv/entry: get correct syscall number from syscall_get_nr() Celeste Liu via B4 Relay
2024-10-21 14:00 ` Björn Töpel
2024-10-21 15:23 ` Celeste Liu
2024-10-21 16:46 ` Björn Töpel
2024-10-25 13:12 ` Thomas Gleixner
2024-10-25 14:30 ` Björn Töpel
2024-10-26 20:21 ` Thomas Gleixner
2024-10-27 15:29 ` Celeste Liu
2024-10-27 15:56 ` Thomas Gleixner
2024-10-27 17:01 ` Celeste Liu
2024-10-27 21:52 ` Thomas Gleixner
2024-10-28 0:17 ` Ron Economos
2024-10-28 16:25 ` Celeste Liu
2024-10-28 19:33 ` Björn Töpel
2024-11-30 22:39 ` Celeste Liu
2024-10-28 9:45 ` Björn Töpel
2024-11-15 21:49 ` Aurelien Jarno
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).