From: "Michal Suchánek" <msuchanek@suse.de>
To: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
Jonathan Corbet <corbet@lwn.net>,
Shuah Khan <skhan@linuxfoundation.org>,
Huacai Chen <chenhuacai@kernel.org>,
WANG Xuerui <kernel@xen0n.name>,
Madhavan Srinivasan <maddy@linux.ibm.com>,
Michael Ellerman <mpe@ellerman.id.au>,
Nicholas Piggin <npiggin@gmail.com>,
"Christophe Leroy (CS GROUP)" <chleroy@kernel.org>,
Paul Walmsley <pjw@kernel.org>,
Palmer Dabbelt <palmer@dabbelt.com>,
Albert Ou <aou@eecs.berkeley.edu>,
Alexandre Ghiti <alex@ghiti.fr>,
Heiko Carstens <hca@linux.ibm.com>,
Vasily Gorbik <gor@linux.ibm.com>,
Alexander Gordeev <agordeev@linux.ibm.com>,
Christian Borntraeger <borntraeger@linux.ibm.com>,
Sven Schnelle <svens@linux.ibm.com>,
Andy Lutomirski <luto@kernel.org>,
Thomas Gleixner <tglx@kernel.org>, Ingo Molnar <mingo@redhat.com>,
Borislav Petkov <bp@alien8.de>,
Dave Hansen <dave.hansen@linux.intel.com>,
x86@kernel.org, Andrew Donnellan <andrew+kernel@donnellan.id.au>,
Mark Rutland <mark.rutland@arm.com>,
Arnd Bergmann <arnd@arndb.de>,
Jiaxun Yang <jiaxun.yang@flygoat.com>,
Ryan Roberts <ryan.roberts@arm.com>,
Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
Mukesh Kumar Chaurasiya <mkchauras@linux.ibm.com>,
Shrikanth Hegde <sshegde@linux.ibm.com>,
Zong Li <zong.li@sifive.com>, Nam Cao <namcao@linutronix.de>,
Deepak Gupta <debug@rivosinc.com>,
Lukas Gerlach <lukas.gerlach@cispa.de>,
Rui Qi <qirui.001@bytedance.com>, Kees Cook <kees@kernel.org>,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
loongarch@lists.linux.dev, linuxppc-dev@lists.ozlabs.org,
linux-riscv@lists.infradead.org, linux-s390@vger.kernel.org
Subject: Re: [RFC] entry: Untangle the return value of syscall_enter_from_user_mode from syscall NR
Date: Thu, 2 Jul 2026 11:30:20 +0200
Message-ID: <akYvrFoMaw4tLuSd@kunlun.suse.cz>
In-Reply-To: <BA7CD91D-C0E5-47A1-B49C-BC6AF6604182@zytor.com>
On Wed, Jul 01, 2026 at 11:29:01AM -0700, H. Peter Anvin wrote:
> On July 1, 2026 10:42:08 AM PDT, "Michal Suchánek" <msuchanek@suse.de> wrote:
> >The return value of syscall_enter_from_user_mode is used both for the
> >adjusted syscall number and the indicator that a syscall should be
> >skipped.
> >
> >As seccomp can be invoked on any syscall, including invalid ones, this
> >somewhat undermines seccomp.
> >
> >While the seccomp variants that terminate the process do not need to
> >care about this, for the filter that sets the syscall return value this
> >distinction is required.
> >
> >Pass the syscall number as a pointer to the inline entry functions, and
> >use the return value exclusively for the indication that the syscall is
> >already handled.
> >
> >This should avoid the need for the s390 PIF_SYSCALL_RET_SET which is the
> >workaround for exactly this deficiency.
> >
> >If this is desirable the patch could be split into some series that
> >adjusts the code flow where needed so that the final change is mostly
> >mechanical.
> >
> >There is also another way to handle this problem.
> >
> >With x86 using bit 30 to denote a compatibility syscall, it sounds like
> >declaring the syscall number a 30-bit quantity would work.
> >
> >Then bit 31 could be used to denote an invalid syscall that can never be
> >executed, and the -1 returned from syscall_enter_from_user_mode would
> >then be inherently invalid.
> >
> >That is so long as no architectures use syscall numbers outside of this
> >range so far, and the limitation is considered fine.
> >
>
> Negative numbers must most definitely not be assigned as valid system calls, not now, not ever.
Negativity of a number is a matter of interpretation. Sometimes the
syscall number is declared as int, sometimes long, sometimes unsigned
long.

Passing -1 to strtoul generates some bit pattern that can then be
compared to another bit pattern inside a seccomp filter program, for
example.
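To illustrate the point about bit patterns, here is a minimal user-space sketch. The helper names parse_nr() and nr_as_filter_word() are made up for this example; the only real API used is strtoul(). It assumes a tool that parses the syscall number with strtoul() and a filter that compares it as a 32-bit word.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: parse a syscall number the way a tool using
 * strtoul() might. For "-1" the result is the negation of 1 in
 * unsigned long, i.e. all bits set. */
static unsigned long parse_nr(const char *s)
{
	return strtoul(s, NULL, 0);
}

/* Hypothetical helper: a seccomp filter compares 32-bit words, so only
 * the low 32 bits of the parsed value take part in the comparison. */
static uint32_t nr_as_filter_word(unsigned long nr)
{
	return (uint32_t)nr;
}
```

With these helpers, "-1" ends up as the word 0xffffffff, which a filter can match like any other constant, regardless of whether the kernel side treats the number as int, long, or unsigned long.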
> Therein lies some serious madness.
>
> I believe setting the syscall number to -1 to skip is an ABI already in e.g. ptrace, so I doubt we can just get rid of it anyway.
Yes, and seccomp can set the syscall number to -1, indicating it was
handled already, even if the number was -1 to start with. While -1 is
not a valid syscall number, it can still be filtered, at least on some
architectures.
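The ambiguity can be modelled in user space roughly as follows. This is only a sketch: enum verdict, enter_combined() and enter_split() are hypothetical names for this example, not the kernel's entry functions, and the model ignores everything except the return-value convention.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the decision taken by seccomp or ptrace. */
enum verdict { VERDICT_RUN, VERDICT_SKIP };

/* Combined style: one return value carries both the (possibly adjusted)
 * syscall number and the "skip" indication, with -1 meaning skip. */
static long enter_combined(long nr, enum verdict v)
{
	return v == VERDICT_SKIP ? -1L : nr;
}

/* Split style: the number travels through a pointer and the return
 * value only says whether the syscall was already handled. */
static bool enter_split(long *nr, enum verdict v)
{
	(void)nr; /* a real implementation could rewrite *nr here */
	return v == VERDICT_SKIP;
}
```

With the combined style, an incoming number of -1 yields the same result whether or not the filter asked to skip, so the caller cannot tell whether a return value was already set. The split style keeps the two cases apart.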
> I would say as follows:
>
> Let's formally define that:
>
> - valid system call numbers are positive 32-bit numbers, using the appropriate ABI convention for "int".
>
> - bits [30:n] for some value of n are reserved for architecture-specific flags/modes. MIPS uses an offset of 2000 decimal between its syscall ABIs, which would imply n ~ 11, although I personally think that is too restrictive (MIPS could in fact use such a flag to provide an escape into a larger number space if we ever need more than 2000 system calls.)
>
> I would suggest n = 24, at least for now. It is easier to give up additional bits later than to claw them back when already used.
>
> Thus:
>
> 1. The type for a system call is int.
>
> 2. A valid system call number is always going to be positive.
>
> 3. Bits [30:24] are available for architecture ABI use. The "architecture independent" part of the system call number is therefore 24 bits wide.
Will that also work correctly with seccomp?

As I understand it, the current situation is that on x86 the BPF code
passed to seccomp must itself filter the compat syscall bit, and I do
not see how restricting the syscall value to 24 bits would happen
without changing the seccomp filter API.

See e.g. https://lore.kernel.org/linuxppc-dev/akTExSO3ZT7iRtBa@kunlun.suse.cz/
for sample code.
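For readers without the link at hand, a filter of the kind meant here might look roughly like the sketch below. It is not the code from the linked message. COMPAT_SYSCALL_BIT is a name made up for this example and is assumed to match bit 30 as used on x86; the policy (kill on anything unexpected, allow the rest) is likewise only illustrative.

```c
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Assumption for this sketch: the x86 compat/x32 marker is bit 30. */
#define COMPAT_SYSCALL_BIT 0x40000000u

static struct sock_filter compat_bit_filter[] = {
	/* Load the audit architecture and require x86-64. */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
	/* Load the syscall number as a 32-bit word. */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
	/* Unsigned compare: anything with bit 30 or bit 31 set (which
	 * includes -1) is rejected by the filter itself. */
	BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, COMPAT_SYSCALL_BIT, 0, 1),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

static struct sock_fprog compat_bit_prog = {
	.len = sizeof(compat_bit_filter) / sizeof(compat_bit_filter[0]),
	.filter = compat_bit_filter,
};
```

The point is that the masking or range check lives in the user-supplied BPF program, which sees the raw 32-bit number. Narrowing the architecture-independent part to 24 bits would not change what such existing filters see.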
>
> 4. The exact ABI is platform-specific, obviously, but as a general guideline (especially for new platforms/ABIs) should follow the rules for a platform "int" if practical. Notably, when passing a value in a register larger than 32 bits, which side of the calling interface is responsible for sign-extending a value passed in a register. If caller side, the kernel should validate, if callee side the kernel should ignore the additional bits and do the extension.
Do we even want to play with sign extension?

If the syscall number is >= 1<<n after masking off flags recognized by
the platform (if any), it's invalid.
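A minimal sketch of that check, assuming n = 24 as suggested above and a single architecture flag at bit 30. NR_BITS, ARCH_FLAG_MASK and nr_is_valid() are hypothetical names for this example only.

```c
#include <stdbool.h>
#include <stdint.h>

#define NR_BITS        24           /* n from the proposal */
#define ARCH_FLAG_MASK 0x40000000u  /* assumption: one flag, at bit 30 */

/* Treat the number as a plain 32-bit pattern: drop the flags the
 * platform recognizes, then require the rest to fit in NR_BITS bits.
 * No sign extension is involved. */
static bool nr_is_valid(uint32_t raw)
{
	uint32_t nr = raw & ~ARCH_FLAG_MASK;

	return nr < (UINT32_C(1) << NR_BITS);
}
```

Under these assumptions, -1 (0xffffffff) is rejected simply because bit 31 and the unrecognized flag bits remain set after masking.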
> 5. A negative system call number is guaranteed to return -ENOSYS (unless intercepted by seccomp, ptrace, or another mechanism under user space control.)
Interception by seccomp is exactly the case that's wonky.
> 6. If the platform needs to algorithmically modify the system call number due to platform-specific concerns (say, the platform uses a 16-bit special purpose register for the syscall number, or it has multiple kernel entry points with different behavior), it should if at all possible transcode the system call number as necessary to match this convention in APIs that are exposed to general kernel code.
>
> For example, in the future I could very much see the IA32 code in the x86 kernel using bit 29 internally to indicate an ia32 system call, simplifying the is_compat implementation on x86. It should not mean that passing bit 29 to either the syscall instruction or int $0x80 will be accepted.
As I understand the code, it uses bit 30 for that. Maybe I missed
something?
Thanks
Michal