* [RFC] riscv/entry: issue about a0/orig_a0 register and ENOSYS
@ 2024-09-17 4:09 Celeste Liu
2024-09-17 5:59 ` Celeste Liu
2024-10-16 12:00 ` Alexandre Ghiti
0 siblings, 2 replies; 8+ messages in thread
From: Celeste Liu @ 2024-09-17 4:09 UTC (permalink / raw)
To: linux-riscv, Paul Walmsley, Palmer Dabbelt, Albert Ou,
Oleg Nesterov, Dmitry V. Levin
Cc: Andrea Bolognani, WANG Xuerui, Jiaxun Yang, Huacai Chen,
Felix Yan, Ruizhe Pan, Shiqi Zhang, Guo Ren, Yao Zi, Yangyu Chen,
Han Gao, linux-kernel, rsworktech
Before PTRACE_GET_SYSCALL_INFO was implemented in v5.3, the only way to
get syscall arguments was to get user_regs_struct via PTRACE_GETREGSET.
On some architectures where a register is used as both the first
argument and the return value and thus will be changed at some stage of
the syscall process, something like orig_a0 is provided to save the
original argument register value. But RISC-V doesn't export orig_a0 in
user_regs_struct (This ABI was designed at e2c0cdfba7f6 ("RISC-V:
User-facing API").) so userspace application like strace will get the
right or wrong result depends on the operation order in do_trap_ecall_u()
function.
This requires we put the ENOSYS process after syscall_enter_from_user_mode()
or syscall_handler()[1]. Unfortunately, the generic entry API
syscall_enter_from_user_mode() requires we
* process ENOSYS before syscall_enter_from_user_mode()
* or only set a0 to ENOSYS when the return value of
syscall_enter_from_user_mode() != -1
Again, if we choose the latter way to avoid conflict with the first
issue, we will meet the third problem: strace depends on that kernel
will return ENOSYS when syscall nr is -1 to implement their syscall
tampering feature[2].
Actually, we tried the both ways in 52449c17bdd1 ("riscv: entry: set
a0 = -ENOSYS only when syscall != -1") and 61119394631f ("riscv: entry:
always initialize regs->a0 to -ENOSYS") before.
Naturally, there is a solution:
1. Just add orig_a0 in user_regs_struct and let strace use it as
loongarch does. So only two problems, which can be resolved without
conflict, are left here.
The conflicts are the direct result of the limitation of generic entry
API, so we have another two solutions:
2. Give up the generic entry API, and switch back to the
architecture-specific standardalone implementation.
3. Redesign the generic entry API: the problem was caused by
syscall_enter_from_user_mode() using the value -1 (which is unused
normally) of syscall nr to inform syscall was reject by seccomp/bpf.
In theory, the Solution 1 is best:
* a0 was used for two purposes in ABI, so using two variables to store
it is natural.
* Userspace can implement features without depending on the internal
behavior of the kernel.
Unfortunately, it's difficult to implement based on the current code.
The RISC-V defined struct pt_regs as below:
struct pt_regs {
unsigned long epc;
...
unsigned long t6;
/* Supervisor/Machine CSRs */
unsigned long status;
unsigned long badaddr;
unsigned long cause;
/* a0 value before the syscall */
unsigned long orig_a0;
};
And user_regs_struct needs to be a prefix of struct pt_regs, so if we
want to include orig_a0 in user_regs_struct, we will need to include
Supervisor/Machine CSRs as well. It's not a big problem. Since
struct pt_regs is the internal ABI of the kernel, we can reorder it.
Unfortunately, struct user_regs_struct is defined as below:
struct user_regs_struct {
unsigned long pc;
...
unsigned long t6;
};
It doesn't contain something like reserved[] as padding to leave the
space to add more registers from struct pt_regs!
The loongarch do the right thing as below:
struct user_pt_regs {
/* Main processor registers. */
unsigned long regs[32];
...
unsigned long reserved[10];
} __attribute__((aligned(8)));
RISC-V can't include orig_a0 in user_regs_struct without breaking UABI.
Need a discussion to decide to use which solution, or is there any
other better solution?
[1]: https://github.com/strace/strace/issues/315
[2]: https://lore.kernel.org/linux-riscv/20240627071422.GA2626@altlinux.org/
^ permalink raw reply [flat|nested] 8+ messages in thread* Re: [RFC] riscv/entry: issue about a0/orig_a0 register and ENOSYS 2024-09-17 4:09 [RFC] riscv/entry: issue about a0/orig_a0 register and ENOSYS Celeste Liu @ 2024-09-17 5:59 ` Celeste Liu 2024-09-17 12:51 ` Eugene Syromiatnikov 2024-10-16 12:00 ` Alexandre Ghiti 1 sibling, 1 reply; 8+ messages in thread From: Celeste Liu @ 2024-09-17 5:59 UTC (permalink / raw) To: linux-riscv, Paul Walmsley, Palmer Dabbelt, Albert Ou, Oleg Nesterov, Dmitry V. Levin Cc: Andrea Bolognani, WANG Xuerui, Jiaxun Yang, Huacai Chen, Felix Yan, Ruizhe Pan, Shiqi Zhang, Guo Ren, Yao Zi, Yangyu Chen, Han Gao, linux-kernel, rsworktech On 2024-09-17 12:09, Celeste Liu wrote: > Before PTRACE_GET_SYSCALL_INFO was implemented in v5.3, the only way to > get syscall arguments was to get user_regs_struct via PTRACE_GETREGSET. > On some architectures where a register is used as both the first > argument and the return value and thus will be changed at some stage of > the syscall process, something like orig_a0 is provided to save the > original argument register value. But RISC-V doesn't export orig_a0 in > user_regs_struct (This ABI was designed at e2c0cdfba7f6 ("RISC-V: > User-facing API").) so userspace application like strace will get the > right or wrong result depends on the operation order in do_trap_ecall_u() > function. > > This requires we put the ENOSYS process after syscall_enter_from_user_mode() > or syscall_handler()[1]. Unfortunately, the generic entry API > syscall_enter_from_user_mode() requires we > > * process ENOSYS before syscall_enter_from_user_mode() > * or only set a0 to ENOSYS when the return value of > syscall_enter_from_user_mode() != -1 > > Again, if we choose the latter way to avoid conflict with the first > issue, we will meet the third problem: strace depends on that kernel > will return ENOSYS when syscall nr is -1 to implement their syscall > tampering feature[2]. > > Actually, we tried the both ways in 52449c17bdd1 ("riscv: entry: set > a0 = -ENOSYS only when syscall != -1") and 61119394631f ("riscv: entry: > always initialize regs->a0 to -ENOSYS") before. It's also impossible to save original syscall number before syscall_enter_from_user_mode() to use later because some API like ptrace() can change syscall number in syscall_enter_from_user_mode(). > > Naturally, there is a solution: > > 1. Just add orig_a0 in user_regs_struct and let strace use it as > loongarch does. So only two problems, which can be resolved without > conflict, are left here. > > The conflicts are the direct result of the limitation of generic entry > API, so we have another two solutions: > > 2. Give up the generic entry API, and switch back to the > architecture-specific standardalone implementation. > 3. Redesign the generic entry API: the problem was caused by > syscall_enter_from_user_mode() using the value -1 (which is unused > normally) of syscall nr to inform syscall was reject by seccomp/bpf. The issue of generic entry API is that -1 is invalid syscall number, but it still contains some information. It's similar to situation of Python's PyLong_As* API: all bits of return value are used to contains some information. Since there is no elegant way to implement sum type in C, we can split to into two value: the return value is just success or reject, and an argument to pass syscall out. But from another angle, syscall number is in a7 register, so we can call the get_syscall_nr() after calling the syscall_enter_from_user_mode() to bypass the information lost of the return value of the syscall_enter_from_user_mode(). But in this way, the syscall number in the syscall_enter_from_user_mode() return value is useless, and we can remove it directly. > > In theory, the Solution 1 is best: > > * a0 was used for two purposes in ABI, so using two variables to store > it is natural. > * Userspace can implement features without depending on the internal > behavior of the kernel. > > Unfortunately, it's difficult to implement based on the current code. > The RISC-V defined struct pt_regs as below: > > struct pt_regs { > unsigned long epc; > ... > unsigned long t6; > /* Supervisor/Machine CSRs */ > unsigned long status; > unsigned long badaddr; > unsigned long cause; > /* a0 value before the syscall */ > unsigned long orig_a0; > }; > > And user_regs_struct needs to be a prefix of struct pt_regs, so if we > want to include orig_a0 in user_regs_struct, we will need to include > Supervisor/Machine CSRs as well. It's not a big problem. Since > struct pt_regs is the internal ABI of the kernel, we can reorder it. > Unfortunately, struct user_regs_struct is defined as below: > > struct user_regs_struct { > unsigned long pc; > ... > unsigned long t6; > }; > > It doesn't contain something like reserved[] as padding to leave the > space to add more registers from struct pt_regs! > The loongarch do the right thing as below: > > struct user_pt_regs { > /* Main processor registers. */ > unsigned long regs[32]; > ... > unsigned long reserved[10]; > } __attribute__((aligned(8))); > > RISC-V can't include orig_a0 in user_regs_struct without breaking UABI. > > Need a discussion to decide to use which solution, or is there any > other better solution? > > [1]: https://github.com/strace/strace/issues/315 > [2]: https://lore.kernel.org/linux-riscv/20240627071422.GA2626@altlinux.org/ > ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [RFC] riscv/entry: issue about a0/orig_a0 register and ENOSYS 2024-09-17 5:59 ` Celeste Liu @ 2024-09-17 12:51 ` Eugene Syromiatnikov 0 siblings, 0 replies; 8+ messages in thread From: Eugene Syromiatnikov @ 2024-09-17 12:51 UTC (permalink / raw) To: Celeste Liu Cc: linux-riscv, Paul Walmsley, Palmer Dabbelt, Albert Ou, Oleg Nesterov, Dmitry V. Levin, Andrea Bolognani, WANG Xuerui, Jiaxun Yang, Huacai Chen, Felix Yan, Ruizhe Pan, Shiqi Zhang, Guo Ren, Yao Zi, Yangyu Chen, Han Gao, linux-kernel, rsworktech On Tue, Sep 17, 2024 at 01:59:23PM +0800, Celeste Liu wrote: > On 2024-09-17 12:09, Celeste Liu wrote: > [...] > > Unfortunately, struct user_regs_struct is defined as below: > > > > struct user_regs_struct { > > unsigned long pc; > > ... > > unsigned long t6; > > }; > > > > It doesn't contain something like reserved[] as padding to leave the > > space to add more registers from struct pt_regs! > > The loongarch do the right thing as below: > > > > struct user_pt_regs { > > /* Main processor registers. */ > > unsigned long regs[32]; > > ... > > unsigned long reserved[10]; > > } __attribute__((aligned(8))); > > > > RISC-V can't include orig_a0 in user_regs_struct without breaking UABI. > > > > Need a discussion to decide to use which solution, or is there any > > other better solution? As another data point, AArch64 has NT_ARM_SYSTEM_CALL (introduced in v3.19-rc1~59^2~16[3]) for the purposes of syscall number tampering. > > [1]: https://github.com/strace/strace/issues/315 > > [2]: https://lore.kernel.org/linux-riscv/20240627071422.GA2626@altlinux.org/ [3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=766a85d7bc5d7f1ddd6de28bdb844eae45ec63b0 ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [RFC] riscv/entry: issue about a0/orig_a0 register and ENOSYS 2024-09-17 4:09 [RFC] riscv/entry: issue about a0/orig_a0 register and ENOSYS Celeste Liu 2024-09-17 5:59 ` Celeste Liu @ 2024-10-16 12:00 ` Alexandre Ghiti 2024-10-16 12:23 ` Celeste Liu 1 sibling, 1 reply; 8+ messages in thread From: Alexandre Ghiti @ 2024-10-16 12:00 UTC (permalink / raw) To: Celeste Liu, linux-riscv, Paul Walmsley, Palmer Dabbelt, Albert Ou, Oleg Nesterov, Dmitry V. Levin Cc: Andrea Bolognani, WANG Xuerui, Jiaxun Yang, Huacai Chen, Felix Yan, Ruizhe Pan, Shiqi Zhang, Guo Ren, Yao Zi, Yangyu Chen, Han Gao, linux-kernel, rsworktech Hi Celeste, Thank you for looking into this and really sorry about the late response. On 17/09/2024 06:09, Celeste Liu wrote: > Before PTRACE_GET_SYSCALL_INFO was implemented in v5.3, the only way to > get syscall arguments was to get user_regs_struct via PTRACE_GETREGSET. > On some architectures where a register is used as both the first > argument and the return value and thus will be changed at some stage of > the syscall process, something like orig_a0 is provided to save the > original argument register value. But RISC-V doesn't export orig_a0 in > user_regs_struct (This ABI was designed at e2c0cdfba7f6 ("RISC-V: > User-facing API").) so userspace application like strace will get the > right or wrong result depends on the operation order in do_trap_ecall_u() > function. > > This requires we put the ENOSYS process after syscall_enter_from_user_mode() > or syscall_handler()[1]. Unfortunately, the generic entry API > syscall_enter_from_user_mode() requires we > > * process ENOSYS before syscall_enter_from_user_mode() Where does this requirement come from? > * or only set a0 to ENOSYS when the return value of > syscall_enter_from_user_mode() != -1 > > Again, if we choose the latter way to avoid conflict with the first > issue, we will meet the third problem: strace depends on that kernel > will return ENOSYS when syscall nr is -1 to implement their syscall > tampering feature[2]. IIUC, seccomp and others in syscall_enter_from_user_mode() could return -1 and then we could not differentiate with the syscall number being -1. But could we imagine, to distinguish between an error and the syscall number being -1, checking again the syscall number after we call syscall_enter_from_user_mode()? If the syscall number is -1, then we set ENOSYS otherwise we don't do anything (a bit like what you did in 52449c17bdd1 ("riscv: entry: set a0 = -ENOSYS only when syscall != -1")). Let me know if I completely misunderstood here! Thanks again for the thorough explanation, Alex > > Actually, we tried the both ways in 52449c17bdd1 ("riscv: entry: set > a0 = -ENOSYS only when syscall != -1") and 61119394631f ("riscv: entry: > always initialize regs->a0 to -ENOSYS") before. > > Naturally, there is a solution: > > 1. Just add orig_a0 in user_regs_struct and let strace use it as > loongarch does. So only two problems, which can be resolved without > conflict, are left here. > > The conflicts are the direct result of the limitation of generic entry > API, so we have another two solutions: > > 2. Give up the generic entry API, and switch back to the > architecture-specific standardalone implementation. > 3. Redesign the generic entry API: the problem was caused by > syscall_enter_from_user_mode() using the value -1 (which is unused > normally) of syscall nr to inform syscall was reject by seccomp/bpf. > > In theory, the Solution 1 is best: > > * a0 was used for two purposes in ABI, so using two variables to store > it is natural. > * Userspace can implement features without depending on the internal > behavior of the kernel. > > Unfortunately, it's difficult to implement based on the current code. > The RISC-V defined struct pt_regs as below: > > struct pt_regs { > unsigned long epc; > ... > unsigned long t6; > /* Supervisor/Machine CSRs */ > unsigned long status; > unsigned long badaddr; > unsigned long cause; > /* a0 value before the syscall */ > unsigned long orig_a0; > }; > > And user_regs_struct needs to be a prefix of struct pt_regs, so if we > want to include orig_a0 in user_regs_struct, we will need to include > Supervisor/Machine CSRs as well. It's not a big problem. Since > struct pt_regs is the internal ABI of the kernel, we can reorder it. > Unfortunately, struct user_regs_struct is defined as below: > > struct user_regs_struct { > unsigned long pc; > ... > unsigned long t6; > }; > > It doesn't contain something like reserved[] as padding to leave the > space to add more registers from struct pt_regs! > The loongarch do the right thing as below: > > struct user_pt_regs { > /* Main processor registers. */ > unsigned long regs[32]; > ... > unsigned long reserved[10]; > } __attribute__((aligned(8))); > > RISC-V can't include orig_a0 in user_regs_struct without breaking UABI. > > Need a discussion to decide to use which solution, or is there any > other better solution? > > [1]: https://github.com/strace/strace/issues/315 > [2]: https://lore.kernel.org/linux-riscv/20240627071422.GA2626@altlinux.org/ > > > _______________________________________________ > linux-riscv mailing list > linux-riscv@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/linux-riscv ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [RFC] riscv/entry: issue about a0/orig_a0 register and ENOSYS 2024-10-16 12:00 ` Alexandre Ghiti @ 2024-10-16 12:23 ` Celeste Liu 2024-10-16 12:56 ` Alexandre Ghiti 0 siblings, 1 reply; 8+ messages in thread From: Celeste Liu @ 2024-10-16 12:23 UTC (permalink / raw) To: Alexandre Ghiti, linux-riscv, Paul Walmsley, Palmer Dabbelt, Albert Ou, Oleg Nesterov, Dmitry V. Levin Cc: Andrea Bolognani, WANG Xuerui, Jiaxun Yang, Huacai Chen, Felix Yan, Ruizhe Pan, Shiqi Zhang, Guo Ren, Yao Zi, Yangyu Chen, Han Gao, linux-kernel, rsworktech On 2024-10-16 20:00, Alexandre Ghiti wrote: > Hi Celeste, > > Thank you for looking into this and really sorry about the late response. > > On 17/09/2024 06:09, Celeste Liu wrote: >> Before PTRACE_GET_SYSCALL_INFO was implemented in v5.3, the only way to >> get syscall arguments was to get user_regs_struct via PTRACE_GETREGSET. >> On some architectures where a register is used as both the first >> argument and the return value and thus will be changed at some stage of >> the syscall process, something like orig_a0 is provided to save the >> original argument register value. But RISC-V doesn't export orig_a0 in >> user_regs_struct (This ABI was designed at e2c0cdfba7f6 ("RISC-V: >> User-facing API").) so userspace application like strace will get the >> right or wrong result depends on the operation order in do_trap_ecall_u() >> function. >> >> This requires we put the ENOSYS process after syscall_enter_from_user_mode() >> or syscall_handler()[1]. Unfortunately, the generic entry API >> syscall_enter_from_user_mode() requires we >> >> * process ENOSYS before syscall_enter_from_user_mode() > > > Where does this requirement come from? > > >> * or only set a0 to ENOSYS when the return value of >> syscall_enter_from_user_mode() != -1 >> >> Again, if we choose the latter way to avoid conflict with the first >> issue, we will meet the third problem: strace depends on that kernel >> will return ENOSYS when syscall nr is -1 to implement their syscall >> tampering feature[2]. > > > IIUC, seccomp and others in syscall_enter_from_user_mode() could return -1 and then we could not differentiate with the syscall number being -1. > > But could we imagine, to distinguish between an error and the syscall number being -1, checking again the syscall number after we call syscall_enter_from_user_mode()? If the syscall number is -1, then we set ENOSYS otherwise we don't do anything (a bit like what you did in 52449c17bdd1 ("riscv: entry: set a0 = -ENOSYS only when syscall != -1")). > > Let me know if I completely misunderstood here! Yeah. I found this a bit later after I post this RFC. I include it in a update reply, copy here as well: > But from another angle, syscall number is in a7 register, so we can call the > get_syscall_nr() after calling the syscall_enter_from_user_mode() to bypass the > information lost of the return value of the syscall_enter_from_user_mode(). But > in this way, the syscall number in the syscall_enter_from_user_mode() return > value is useless, and we can remove it directly. So if we get syscall number from a7 register again, the syscall number part of the return value of syscall_enter_from_user_mode() is useless completely. I think it's better to drop it so the later new architecture developer will not run into the same issue. (Actually, the syscall number returned by syscall_enter_from_user_mode() is also the result of get_syscall_nr() at the end of it.) But it will affect other architecture's code so I think there still need some discussions. Or if you think it's better to post a patch and then discuss in patch thread directly, I'm glad to do this. > > Thanks again for the thorough explanation, > > Alex > > >> >> Actually, we tried the both ways in 52449c17bdd1 ("riscv: entry: set >> a0 = -ENOSYS only when syscall != -1") and 61119394631f ("riscv: entry: >> always initialize regs->a0 to -ENOSYS") before. >> >> Naturally, there is a solution: >> >> 1. Just add orig_a0 in user_regs_struct and let strace use it as >> loongarch does. So only two problems, which can be resolved without >> conflict, are left here. >> >> The conflicts are the direct result of the limitation of generic entry >> API, so we have another two solutions: >> >> 2. Give up the generic entry API, and switch back to the >> architecture-specific standardalone implementation. >> 3. Redesign the generic entry API: the problem was caused by >> syscall_enter_from_user_mode() using the value -1 (which is unused >> normally) of syscall nr to inform syscall was reject by seccomp/bpf. >> >> In theory, the Solution 1 is best: >> >> * a0 was used for two purposes in ABI, so using two variables to store >> it is natural. >> * Userspace can implement features without depending on the internal >> behavior of the kernel. >> >> Unfortunately, it's difficult to implement based on the current code. >> The RISC-V defined struct pt_regs as below: >> >> struct pt_regs { >> unsigned long epc; >> ... >> unsigned long t6; >> /* Supervisor/Machine CSRs */ >> unsigned long status; >> unsigned long badaddr; >> unsigned long cause; >> /* a0 value before the syscall */ >> unsigned long orig_a0; >> }; >> >> And user_regs_struct needs to be a prefix of struct pt_regs, so if we >> want to include orig_a0 in user_regs_struct, we will need to include >> Supervisor/Machine CSRs as well. It's not a big problem. Since >> struct pt_regs is the internal ABI of the kernel, we can reorder it. >> Unfortunately, struct user_regs_struct is defined as below: >> >> struct user_regs_struct { >> unsigned long pc; >> ... >> unsigned long t6; >> }; >> >> It doesn't contain something like reserved[] as padding to leave the >> space to add more registers from struct pt_regs! >> The loongarch do the right thing as below: >> >> struct user_pt_regs { >> /* Main processor registers. */ >> unsigned long regs[32]; >> ... >> unsigned long reserved[10]; >> } __attribute__((aligned(8))); >> >> RISC-V can't include orig_a0 in user_regs_struct without breaking UABI. >> >> Need a discussion to decide to use which solution, or is there any >> other better solution? >> >> [1]: https://github.com/strace/strace/issues/315 >> [2]: https://lore.kernel.org/linux-riscv/20240627071422.GA2626@altlinux.org/ >> >> >> _______________________________________________ >> linux-riscv mailing list >> linux-riscv@lists.infradead.org >> http://lists.infradead.org/mailman/listinfo/linux-riscv ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [RFC] riscv/entry: issue about a0/orig_a0 register and ENOSYS 2024-10-16 12:23 ` Celeste Liu @ 2024-10-16 12:56 ` Alexandre Ghiti 2024-10-16 13:07 ` Celeste Liu 0 siblings, 1 reply; 8+ messages in thread From: Alexandre Ghiti @ 2024-10-16 12:56 UTC (permalink / raw) To: Celeste Liu, linux-riscv, Paul Walmsley, Palmer Dabbelt, Albert Ou, Oleg Nesterov, Dmitry V. Levin Cc: Andrea Bolognani, WANG Xuerui, Jiaxun Yang, Huacai Chen, Felix Yan, Ruizhe Pan, Shiqi Zhang, Guo Ren, Yao Zi, Yangyu Chen, Han Gao, linux-kernel, rsworktech On 16/10/2024 14:23, Celeste Liu wrote: > On 2024-10-16 20:00, Alexandre Ghiti wrote: >> Hi Celeste, >> >> Thank you for looking into this and really sorry about the late response. >> >> On 17/09/2024 06:09, Celeste Liu wrote: >>> Before PTRACE_GET_SYSCALL_INFO was implemented in v5.3, the only way to >>> get syscall arguments was to get user_regs_struct via PTRACE_GETREGSET. >>> On some architectures where a register is used as both the first >>> argument and the return value and thus will be changed at some stage of >>> the syscall process, something like orig_a0 is provided to save the >>> original argument register value. But RISC-V doesn't export orig_a0 in >>> user_regs_struct (This ABI was designed at e2c0cdfba7f6 ("RISC-V: >>> User-facing API").) so userspace application like strace will get the >>> right or wrong result depends on the operation order in do_trap_ecall_u() >>> function. >>> >>> This requires we put the ENOSYS process after syscall_enter_from_user_mode() >>> or syscall_handler()[1]. Unfortunately, the generic entry API >>> syscall_enter_from_user_mode() requires we >>> >>> * process ENOSYS before syscall_enter_from_user_mode() >> >> Where does this requirement come from? >> >> >>> * or only set a0 to ENOSYS when the return value of >>> syscall_enter_from_user_mode() != -1 >>> >>> Again, if we choose the latter way to avoid conflict with the first >>> issue, we will meet the third problem: strace depends on that kernel >>> will return ENOSYS when syscall nr is -1 to implement their syscall >>> tampering feature[2]. >> >> IIUC, seccomp and others in syscall_enter_from_user_mode() could return -1 and then we could not differentiate with the syscall number being -1. >> >> But could we imagine, to distinguish between an error and the syscall number being -1, checking again the syscall number after we call syscall_enter_from_user_mode()? If the syscall number is -1, then we set ENOSYS otherwise we don't do anything (a bit like what you did in 52449c17bdd1 ("riscv: entry: set a0 = -ENOSYS only when syscall != -1")). >> >> Let me know if I completely misunderstood here! > Yeah. I found this a bit later after I post this RFC. I include it in a update reply, > copy here as well: > >> But from another angle, syscall number is in a7 register, so we can call the >> get_syscall_nr() after calling the syscall_enter_from_user_mode() to bypass the >> information lost of the return value of the syscall_enter_from_user_mode(). But >> in this way, the syscall number in the syscall_enter_from_user_mode() return >> value is useless, and we can remove it directly. > So if we get syscall number from a7 register again, the syscall number part of > the return value of syscall_enter_from_user_mode() is useless completely. > I think it's better to drop it so the later new architecture developer will not > run into the same issue. (Actually, the syscall number returned by > syscall_enter_from_user_mode() is also the result of get_syscall_nr() at the end > of it.) But it will affect other architecture's code so I think there still need > some discussions. > > Or if you think it's better to post a patch and then discuss in patch thread > directly, I'm glad to do this. Great that we have a solution that does not need to change the ABI :) I think we should start by implementing a fix for riscv only that implements the get_syscall_nr() after syscall_enter_from_user_mode() so that we can merge that in 6.12-rcX. And after that, you could come with the nicer solution you propose. Do you think you can send a patch for the quick fix soon? In the meantime, I'm adding the strace testsuite to my CI to make sure it works and we don't break it again :) Thanks! Alex > >> Thanks again for the thorough explanation, >> >> Alex >> >> >>> Actually, we tried the both ways in 52449c17bdd1 ("riscv: entry: set >>> a0 = -ENOSYS only when syscall != -1") and 61119394631f ("riscv: entry: >>> always initialize regs->a0 to -ENOSYS") before. >>> >>> Naturally, there is a solution: >>> >>> 1. Just add orig_a0 in user_regs_struct and let strace use it as >>> loongarch does. So only two problems, which can be resolved without >>> conflict, are left here. >>> >>> The conflicts are the direct result of the limitation of generic entry >>> API, so we have another two solutions: >>> >>> 2. Give up the generic entry API, and switch back to the >>> architecture-specific standardalone implementation. >>> 3. Redesign the generic entry API: the problem was caused by >>> syscall_enter_from_user_mode() using the value -1 (which is unused >>> normally) of syscall nr to inform syscall was reject by seccomp/bpf. >>> >>> In theory, the Solution 1 is best: >>> >>> * a0 was used for two purposes in ABI, so using two variables to store >>> it is natural. >>> * Userspace can implement features without depending on the internal >>> behavior of the kernel. >>> >>> Unfortunately, it's difficult to implement based on the current code. >>> The RISC-V defined struct pt_regs as below: >>> >>> struct pt_regs { >>> unsigned long epc; >>> ... >>> unsigned long t6; >>> /* Supervisor/Machine CSRs */ >>> unsigned long status; >>> unsigned long badaddr; >>> unsigned long cause; >>> /* a0 value before the syscall */ >>> unsigned long orig_a0; >>> }; >>> >>> And user_regs_struct needs to be a prefix of struct pt_regs, so if we >>> want to include orig_a0 in user_regs_struct, we will need to include >>> Supervisor/Machine CSRs as well. It's not a big problem. Since >>> struct pt_regs is the internal ABI of the kernel, we can reorder it. >>> Unfortunately, struct user_regs_struct is defined as below: >>> >>> struct user_regs_struct { >>> unsigned long pc; >>> ... >>> unsigned long t6; >>> }; >>> >>> It doesn't contain something like reserved[] as padding to leave the >>> space to add more registers from struct pt_regs! >>> The loongarch do the right thing as below: >>> >>> struct user_pt_regs { >>> /* Main processor registers. */ >>> unsigned long regs[32]; >>> ... >>> unsigned long reserved[10]; >>> } __attribute__((aligned(8))); >>> >>> RISC-V can't include orig_a0 in user_regs_struct without breaking UABI. >>> >>> Need a discussion to decide to use which solution, or is there any >>> other better solution? >>> >>> [1]: https://github.com/strace/strace/issues/315 >>> [2]: https://lore.kernel.org/linux-riscv/20240627071422.GA2626@altlinux.org/ >>> >>> >>> _______________________________________________ >>> linux-riscv mailing list >>> linux-riscv@lists.infradead.org >>> http://lists.infradead.org/mailman/listinfo/linux-riscv ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [RFC] riscv/entry: issue about a0/orig_a0 register and ENOSYS 2024-10-16 12:56 ` Alexandre Ghiti @ 2024-10-16 13:07 ` Celeste Liu 2024-10-16 17:56 ` Celeste Liu 0 siblings, 1 reply; 8+ messages in thread From: Celeste Liu @ 2024-10-16 13:07 UTC (permalink / raw) To: Alexandre Ghiti, linux-riscv, Paul Walmsley, Palmer Dabbelt, Albert Ou, Oleg Nesterov, Dmitry V. Levin Cc: Andrea Bolognani, WANG Xuerui, Jiaxun Yang, Huacai Chen, Felix Yan, Ruizhe Pan, Shiqi Zhang, Guo Ren, Yao Zi, Yangyu Chen, Han Gao, linux-kernel, rsworktech On 2024-10-16 20:56, Alexandre Ghiti wrote: > On 16/10/2024 14:23, Celeste Liu wrote: >> On 2024-10-16 20:00, Alexandre Ghiti wrote: >>> Hi Celeste, >>> >>> Thank you for looking into this and really sorry about the late response. >>> >>> On 17/09/2024 06:09, Celeste Liu wrote: >>>> Before PTRACE_GET_SYSCALL_INFO was implemented in v5.3, the only way to >>>> get syscall arguments was to get user_regs_struct via PTRACE_GETREGSET. >>>> On some architectures where a register is used as both the first >>>> argument and the return value and thus will be changed at some stage of >>>> the syscall process, something like orig_a0 is provided to save the >>>> original argument register value. But RISC-V doesn't export orig_a0 in >>>> user_regs_struct (This ABI was designed at e2c0cdfba7f6 ("RISC-V: >>>> User-facing API").) so userspace application like strace will get the >>>> right or wrong result depends on the operation order in do_trap_ecall_u() >>>> function. >>>> >>>> This requires we put the ENOSYS process after syscall_enter_from_user_mode() >>>> or syscall_handler()[1]. Unfortunately, the generic entry API >>>> syscall_enter_from_user_mode() requires we >>>> >>>> * process ENOSYS before syscall_enter_from_user_mode() >>> >>> Where does this requirement come from? >>> >>> >>>> * or only set a0 to ENOSYS when the return value of >>>> syscall_enter_from_user_mode() != -1 >>>> >>>> Again, if we choose the latter way to avoid conflict with the first >>>> issue, we will meet the third problem: strace depends on that kernel >>>> will return ENOSYS when syscall nr is -1 to implement their syscall >>>> tampering feature[2]. >>> >>> IIUC, seccomp and others in syscall_enter_from_user_mode() could return -1 and then we could not differentiate with the syscall number being -1. >>> >>> But could we imagine, to distinguish between an error and the syscall number being -1, checking again the syscall number after we call syscall_enter_from_user_mode()? If the syscall number is -1, then we set ENOSYS otherwise we don't do anything (a bit like what you did in 52449c17bdd1 ("riscv: entry: set a0 = -ENOSYS only when syscall != -1")). >>> >>> Let me know if I completely misunderstood here! >> Yeah. I found this a bit later after I post this RFC. I include it in a update reply, >> copy here as well: >> >>> But from another angle, syscall number is in a7 register, so we can call the >>> get_syscall_nr() after calling the syscall_enter_from_user_mode() to bypass the >>> information lost of the return value of the syscall_enter_from_user_mode(). But >>> in this way, the syscall number in the syscall_enter_from_user_mode() return >>> value is useless, and we can remove it directly. >> So if we get syscall number from a7 register again, the syscall number part of >> the return value of syscall_enter_from_user_mode() is useless completely. >> I think it's better to drop it so the later new architecture developer will not >> run into the same issue. (Actually, the syscall number returned by >> syscall_enter_from_user_mode() is also the result of get_syscall_nr() at the end >> of it.) But it will affect other architecture's code so I think there still need >> some discussions. >> >> Or if you think it's better to post a patch and then discuss in patch thread >> directly, I'm glad to do this. > > > Great that we have a solution that does not need to change the ABI :) > > I think we should start by implementing a fix for riscv only that implements the get_syscall_nr() after syscall_enter_from_user_mode() so that we can merge that in 6.12-rcX. > > And after that, you could come with the nicer solution you propose. > > Do you think you can send a patch for the quick fix soon? In the meantime, I'm adding the strace testsuite to my CI to make sure it works and we don't break it again :) Ok. I will send patch soon. > > Thanks! > > Alex > > >> >>> Thanks again for the thorough explanation, >>> >>> Alex >>> >>> >>>> Actually, we tried the both ways in 52449c17bdd1 ("riscv: entry: set >>>> a0 = -ENOSYS only when syscall != -1") and 61119394631f ("riscv: entry: >>>> always initialize regs->a0 to -ENOSYS") before. >>>> >>>> Naturally, there is a solution: >>>> >>>> 1. Just add orig_a0 in user_regs_struct and let strace use it as >>>> loongarch does. So only two problems, which can be resolved without >>>> conflict, are left here. >>>> >>>> The conflicts are the direct result of the limitation of generic entry >>>> API, so we have another two solutions: >>>> >>>> 2. Give up the generic entry API, and switch back to the >>>> architecture-specific standardalone implementation. >>>> 3. Redesign the generic entry API: the problem was caused by >>>> syscall_enter_from_user_mode() using the value -1 (which is unused >>>> normally) of syscall nr to inform syscall was reject by seccomp/bpf. >>>> >>>> In theory, the Solution 1 is best: >>>> >>>> * a0 was used for two purposes in ABI, so using two variables to store >>>> it is natural. >>>> * Userspace can implement features without depending on the internal >>>> behavior of the kernel. >>>> >>>> Unfortunately, it's difficult to implement based on the current code. >>>> The RISC-V defined struct pt_regs as below: >>>> >>>> struct pt_regs { >>>> unsigned long epc; >>>> ... >>>> unsigned long t6; >>>> /* Supervisor/Machine CSRs */ >>>> unsigned long status; >>>> unsigned long badaddr; >>>> unsigned long cause; >>>> /* a0 value before the syscall */ >>>> unsigned long orig_a0; >>>> }; >>>> >>>> And user_regs_struct needs to be a prefix of struct pt_regs, so if we >>>> want to include orig_a0 in user_regs_struct, we will need to include >>>> Supervisor/Machine CSRs as well. It's not a big problem. Since >>>> struct pt_regs is the internal ABI of the kernel, we can reorder it. >>>> Unfortunately, struct user_regs_struct is defined as below: >>>> >>>> struct user_regs_struct { >>>> unsigned long pc; >>>> ... >>>> unsigned long t6; >>>> }; >>>> >>>> It doesn't contain something like reserved[] as padding to leave the >>>> space to add more registers from struct pt_regs! >>>> The loongarch do the right thing as below: >>>> >>>> struct user_pt_regs { >>>> /* Main processor registers. */ >>>> unsigned long regs[32]; >>>> ... >>>> unsigned long reserved[10]; >>>> } __attribute__((aligned(8))); >>>> >>>> RISC-V can't include orig_a0 in user_regs_struct without breaking UABI. >>>> >>>> Need a discussion to decide to use which solution, or is there any >>>> other better solution? >>>> >>>> [1]: https://github.com/strace/strace/issues/315 >>>> [2]: https://lore.kernel.org/linux-riscv/20240627071422.GA2626@altlinux.org/ >>>> >>>> >>>> _______________________________________________ >>>> linux-riscv mailing list >>>> linux-riscv@lists.infradead.org >>>> http://lists.infradead.org/mailman/listinfo/linux-riscv ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [RFC] riscv/entry: issue about a0/orig_a0 register and ENOSYS 2024-10-16 13:07 ` Celeste Liu @ 2024-10-16 17:56 ` Celeste Liu 0 siblings, 0 replies; 8+ messages in thread From: Celeste Liu @ 2024-10-16 17:56 UTC (permalink / raw) To: Alexandre Ghiti, linux-riscv, Paul Walmsley, Palmer Dabbelt, Albert Ou, Oleg Nesterov, Dmitry V. Levin Cc: Andrea Bolognani, WANG Xuerui, Jiaxun Yang, Huacai Chen, Felix Yan, Ruizhe Pan, Shiqi Zhang, Guo Ren, Yao Zi, Yangyu Chen, Han Gao, linux-kernel, rsworktech On 2024-10-16 21:07, Celeste Liu wrote: > > On 2024-10-16 20:56, Alexandre Ghiti wrote: >> On 16/10/2024 14:23, Celeste Liu wrote: >>> On 2024-10-16 20:00, Alexandre Ghiti wrote: >>>> Hi Celeste, >>>> >>>> Thank you for looking into this and really sorry about the late response. >>>> >>>> On 17/09/2024 06:09, Celeste Liu wrote: >>>>> Before PTRACE_GET_SYSCALL_INFO was implemented in v5.3, the only way to >>>>> get syscall arguments was to get user_regs_struct via PTRACE_GETREGSET. >>>>> On some architectures where a register is used as both the first >>>>> argument and the return value and thus will be changed at some stage of >>>>> the syscall process, something like orig_a0 is provided to save the >>>>> original argument register value. But RISC-V doesn't export orig_a0 in >>>>> user_regs_struct (This ABI was designed at e2c0cdfba7f6 ("RISC-V: >>>>> User-facing API").) so userspace application like strace will get the >>>>> right or wrong result depends on the operation order in do_trap_ecall_u() >>>>> function. >>>>> >>>>> This requires we put the ENOSYS process after syscall_enter_from_user_mode() >>>>> or syscall_handler()[1]. Unfortunately, the generic entry API >>>>> syscall_enter_from_user_mode() requires we >>>>> >>>>> * process ENOSYS before syscall_enter_from_user_mode() >>>> >>>> Where does this requirement come from? >>>> >>>> >>>>> * or only set a0 to ENOSYS when the return value of >>>>> syscall_enter_from_user_mode() != -1 >>>>> >>>>> Again, if we choose the latter way to avoid conflict with the first >>>>> issue, we will meet the third problem: strace depends on that kernel >>>>> will return ENOSYS when syscall nr is -1 to implement their syscall >>>>> tampering feature[2]. >>>> >>>> IIUC, seccomp and others in syscall_enter_from_user_mode() could return -1 and then we could not differentiate with the syscall number being -1. >>>> >>>> But could we imagine, to distinguish between an error and the syscall number being -1, checking again the syscall number after we call syscall_enter_from_user_mode()? If the syscall number is -1, then we set ENOSYS otherwise we don't do anything (a bit like what you did in 52449c17bdd1 ("riscv: entry: set a0 = -ENOSYS only when syscall != -1")). >>>> >>>> Let me know if I completely misunderstood here! >>> Yeah. I found this a bit later after I post this RFC. I include it in a update reply, >>> copy here as well: >>> >>>> But from another angle, syscall number is in a7 register, so we can call the >>>> get_syscall_nr() after calling the syscall_enter_from_user_mode() to bypass the >>>> information lost of the return value of the syscall_enter_from_user_mode(). But >>>> in this way, the syscall number in the syscall_enter_from_user_mode() return >>>> value is useless, and we can remove it directly. >>> So if we get syscall number from a7 register again, the syscall number part of >>> the return value of syscall_enter_from_user_mode() is useless completely. >>> I think it's better to drop it so the later new architecture developer will not >>> run into the same issue. (Actually, the syscall number returned by >>> syscall_enter_from_user_mode() is also the result of get_syscall_nr() at the end >>> of it.) But it will affect other architecture's code so I think there still need >>> some discussions. >>> >>> Or if you think it's better to post a patch and then discuss in patch thread >>> directly, I'm glad to do this. >> >> >> Great that we have a solution that does not need to change the ABI :) >> >> I think we should start by implementing a fix for riscv only that implements the get_syscall_nr() after syscall_enter_from_user_mode() so that we can merge that in 6.12-rcX. >> >> And after that, you could come with the nicer solution you propose. >> >> Do you think you can send a patch for the quick fix soon? In the meantime, I'm adding the strace testsuite to my CI to make sure it works and we don't break it again :) > > Ok. I will send patch soon. Patch has been sent. See https://lore.kernel.org/all/20241017-fix-riscv-syscall-nr-v1-1-4edb4ca07f07@gmail.com/T/ > >> >> Thanks! >> >> Alex >> >> >>> >>>> Thanks again for the thorough explanation, >>>> >>>> Alex >>>> >>>> >>>>> Actually, we tried the both ways in 52449c17bdd1 ("riscv: entry: set >>>>> a0 = -ENOSYS only when syscall != -1") and 61119394631f ("riscv: entry: >>>>> always initialize regs->a0 to -ENOSYS") before. >>>>> >>>>> Naturally, there is a solution: >>>>> >>>>> 1. Just add orig_a0 in user_regs_struct and let strace use it as >>>>> loongarch does. So only two problems, which can be resolved without >>>>> conflict, are left here. >>>>> >>>>> The conflicts are the direct result of the limitation of generic entry >>>>> API, so we have another two solutions: >>>>> >>>>> 2. Give up the generic entry API, and switch back to the >>>>> architecture-specific standardalone implementation. >>>>> 3. Redesign the generic entry API: the problem was caused by >>>>> syscall_enter_from_user_mode() using the value -1 (which is unused >>>>> normally) of syscall nr to inform syscall was reject by seccomp/bpf. >>>>> >>>>> In theory, the Solution 1 is best: >>>>> >>>>> * a0 was used for two purposes in ABI, so using two variables to store >>>>> it is natural. >>>>> * Userspace can implement features without depending on the internal >>>>> behavior of the kernel. >>>>> >>>>> Unfortunately, it's difficult to implement based on the current code. >>>>> The RISC-V defined struct pt_regs as below: >>>>> >>>>> struct pt_regs { >>>>> unsigned long epc; >>>>> ... >>>>> unsigned long t6; >>>>> /* Supervisor/Machine CSRs */ >>>>> unsigned long status; >>>>> unsigned long badaddr; >>>>> unsigned long cause; >>>>> /* a0 value before the syscall */ >>>>> unsigned long orig_a0; >>>>> }; >>>>> >>>>> And user_regs_struct needs to be a prefix of struct pt_regs, so if we >>>>> want to include orig_a0 in user_regs_struct, we will need to include >>>>> Supervisor/Machine CSRs as well. It's not a big problem. Since >>>>> struct pt_regs is the internal ABI of the kernel, we can reorder it. >>>>> Unfortunately, struct user_regs_struct is defined as below: >>>>> >>>>> struct user_regs_struct { >>>>> unsigned long pc; >>>>> ... >>>>> unsigned long t6; >>>>> }; >>>>> >>>>> It doesn't contain something like reserved[] as padding to leave the >>>>> space to add more registers from struct pt_regs! >>>>> The loongarch do the right thing as below: >>>>> >>>>> struct user_pt_regs { >>>>> /* Main processor registers. */ >>>>> unsigned long regs[32]; >>>>> ... >>>>> unsigned long reserved[10]; >>>>> } __attribute__((aligned(8))); >>>>> >>>>> RISC-V can't include orig_a0 in user_regs_struct without breaking UABI. >>>>> >>>>> Need a discussion to decide to use which solution, or is there any >>>>> other better solution? >>>>> >>>>> [1]: https://github.com/strace/strace/issues/315 >>>>> [2]: https://lore.kernel.org/linux-riscv/20240627071422.GA2626@altlinux.org/ >>>>> >>>>> >>>>> _______________________________________________ >>>>> linux-riscv mailing list >>>>> linux-riscv@lists.infradead.org >>>>> http://lists.infradead.org/mailman/listinfo/linux-riscv > ^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2024-10-16 17:57 UTC | newest] Thread overview: 8+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2024-09-17 4:09 [RFC] riscv/entry: issue about a0/orig_a0 register and ENOSYS Celeste Liu 2024-09-17 5:59 ` Celeste Liu 2024-09-17 12:51 ` Eugene Syromiatnikov 2024-10-16 12:00 ` Alexandre Ghiti 2024-10-16 12:23 ` Celeste Liu 2024-10-16 12:56 ` Alexandre Ghiti 2024-10-16 13:07 ` Celeste Liu 2024-10-16 17:56 ` Celeste Liu
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox