Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* [PATCH net-next] net: uapi: Provide an UAPI definition of 'struct sockaddr'
From: Thomas Weißschuh @ 2026-01-05  8:25 UTC (permalink / raw)
  To: Eric Dumazet, Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn
  Cc: netdev, linux-kernel, linux-api, Arnd Bergmann,
	Thomas Weißschuh

Various UAPI headers reference 'struct sockaddr'. Currently the
definition of this struct is pulled in from the libc header
sys/socket.h. This is problematic as it introduces a dependency
on a full userspace toolchain.

Instead expose a custom but compatible definition of 'struct sockaddr'
in the UAPI headers. It is guarded by the libc compatibility
infrastructure to avoid potential conflicts.

The compatibility symbol won't be supported by glibc right away,
but right now __UAPI_DEF_IF_IFNAMSIZ is not supported either,
so including the libc headers before the UAPI headers is broken anyways.

Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
---
 include/linux/socket.h           | 10 ----------
 include/uapi/linux/if.h          |  4 ----
 include/uapi/linux/libc-compat.h | 12 ++++++++++++
 include/uapi/linux/socket.h      | 14 ++++++++++++++
 4 files changed, 26 insertions(+), 14 deletions(-)

diff --git a/include/linux/socket.h b/include/linux/socket.h
index ec715ad4bf25..8363d4e0a044 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -28,16 +28,6 @@ extern void socket_seq_show(struct seq_file *seq);
 
 typedef __kernel_sa_family_t	sa_family_t;
 
-/*
- *	1003.1g requires sa_family_t and that sa_data is char.
- */
-
-/* Deprecated for in-kernel use. Use struct sockaddr_unsized instead. */
-struct sockaddr {
-	sa_family_t	sa_family;	/* address family, AF_xxx	*/
-	char		sa_data[14];	/* 14 bytes of protocol address	*/
-};
-
 /**
  * struct sockaddr_unsized - Unspecified size sockaddr for callbacks
  * @sa_family: Address family (AF_UNIX, AF_INET, AF_INET6, etc.)
diff --git a/include/uapi/linux/if.h b/include/uapi/linux/if.h
index 797ba2c1562a..a4bc54196a07 100644
--- a/include/uapi/linux/if.h
+++ b/include/uapi/linux/if.h
@@ -25,10 +25,6 @@
 #include <linux/socket.h>		/* for "struct sockaddr" et al	*/
 #include <linux/compiler.h>		/* for "__user" et al           */
 
-#ifndef __KERNEL__
-#include <sys/socket.h>			/* for struct sockaddr.		*/
-#endif
-
 #if __UAPI_DEF_IF_IFNAMSIZ
 #define	IFNAMSIZ	16
 #endif /* __UAPI_DEF_IF_IFNAMSIZ */
diff --git a/include/uapi/linux/libc-compat.h b/include/uapi/linux/libc-compat.h
index 0eca95ccb41e..13a06ce4e825 100644
--- a/include/uapi/linux/libc-compat.h
+++ b/include/uapi/linux/libc-compat.h
@@ -140,6 +140,13 @@
 
 #endif /* _NETINET_IN_H */
 
+/* Definitions for socket.h */
+#if defined(_SYS_SOCKET_H)
+#define __UAPI_DEF_SOCKADDR		0
+#else
+#define __UAPI_DEF_SOCKADDR		1
+#endif
+
 /* Definitions for xattr.h */
 #if defined(_SYS_XATTR_H)
 #define __UAPI_DEF_XATTR		0
@@ -221,6 +228,11 @@
 #define __UAPI_DEF_IP6_MTUINFO		1
 #endif
 
+/* Definitions for socket.h */
+#ifndef __UAPI_DEF_SOCKADDR
+#define __UAPI_DEF_SOCKADDR		1
+#endif
+
 /* Definitions for xattr.h */
 #ifndef __UAPI_DEF_XATTR
 #define __UAPI_DEF_XATTR		1
diff --git a/include/uapi/linux/socket.h b/include/uapi/linux/socket.h
index d3fcd3b5ec53..35d7d5f4b1a8 100644
--- a/include/uapi/linux/socket.h
+++ b/include/uapi/linux/socket.h
@@ -2,6 +2,8 @@
 #ifndef _UAPI_LINUX_SOCKET_H
 #define _UAPI_LINUX_SOCKET_H
 
+#include <linux/libc-compat.h>          /* for compatibility with glibc */
+
 /*
  * Desired design of maximum size and alignment (see RFC2553)
  */
@@ -26,6 +28,18 @@ struct __kernel_sockaddr_storage {
 	};
 };
 
+/*
+ *	1003.1g requires sa_family_t and that sa_data is char.
+ */
+
+/* Deprecated for in-kernel use. Use struct sockaddr_unsized instead. */
+#if __UAPI_DEF_SOCKADDR
+struct sockaddr {
+	__kernel_sa_family_t	sa_family;	/* address family, AF_xxx	*/
+	char			sa_data[14];	/* 14 bytes of protocol address	*/
+};
+#endif /* __UAPI_DEF_SOCKADDR */
+
 #define SOCK_SNDBUF_LOCK	1
 #define SOCK_RCVBUF_LOCK	2
 

---
base-commit: dbf8fe85a16a33d6b6bd01f2bc606fc017771465
change-id: 20251222-uapi-sockaddr-cf10e7624729

Best regards,
-- 
Thomas Weißschuh <thomas.weissschuh@linutronix.de>


^ permalink raw reply related

* Re: [PATCH v3] vdso: Remove struct getcpu_cache
From: Heiko Carstens @ 2026-01-02 12:20 UTC (permalink / raw)
  To: Thomas Weißschuh
  Cc: Huacai Chen, WANG Xuerui, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, Andy Lutomirski,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Vincenzo Frascino, Shuah Khan, Arnd Bergmann,
	loongarch, linux-kernel, linux-s390, linux-api, linux-kselftest
In-Reply-To: <20251230-getcpu_cache-v3-1-fb9c5f880ebe@linutronix.de>

On Tue, Dec 30, 2025 at 08:08:44AM +0100, Thomas Weißschuh wrote:
> The cache parameter of getcpu() is useless nowadays for various reasons.
> * It is never passed by userspace for either the vDSO or syscalls.
> * It is never used by the kernel.
> * It could not be made to work on the current vDSO architecture.
> * The structure definition is not part of the UAPI headers.
> * vdso_getcpu() is superseded by restartable sequences in any case.
> 
> Remove the struct and its header.
> 
> As a side-effect we get rid of an unwanted inclusion of the linux/
> header namespace from vDSO code.
> 
> Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
> ---
> Changes in v3:
> - Rebase on v6.19-rc1
>   - Fix conflict with UML vdso_getcpu() removal
> - Flesh out commit message
> - Link to v2: https://lore.kernel.org/r/20251013-getcpu_cache-v2-1-880fbfa3b7cc@linutronix.de
> 
> Changes in v2:
> - Rebase on v6.18-rc1
> - Link to v1: https://lore.kernel.org/r/20250826-getcpu_cache-v1-1-8748318f6141@linutronix.de
> ---
> We could also completely remove the parameter, but I am not sure if
> that is a good idea for syscalls and vDSO entrypoints.
> ---
>  arch/loongarch/vdso/vgetcpu.c                   |  5 ++---
>  arch/s390/kernel/vdso/getcpu.c                  |  3 +--
>  arch/s390/kernel/vdso/vdso.h                    |  4 +---
>  arch/x86/entry/vdso/vgetcpu.c                   |  5 ++---
>  arch/x86/include/asm/vdso/processor.h           |  4 +---
>  include/linux/getcpu.h                          | 19 -------------------
>  include/linux/syscalls.h                        |  3 +--
>  kernel/sys.c                                    |  4 +---
>  tools/testing/selftests/vDSO/vdso_test_getcpu.c |  4 +---
>  9 files changed, 10 insertions(+), 41 deletions(-)

Acked-by: Heiko Carstens <hca@linux.ibm.com> # s390

^ permalink raw reply

* Re: [PATCH v3] vdso: Remove struct getcpu_cache
From: Arnd Bergmann @ 2025-12-30 21:23 UTC (permalink / raw)
  To: Thomas Weißschuh, Huacai Chen, WANG Xuerui, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Vincenzo Frascino, shuah
  Cc: loongarch, linux-kernel, linux-s390, linux-api, linux-kselftest
In-Reply-To: <20251230-getcpu_cache-v3-1-fb9c5f880ebe@linutronix.de>

On Tue, Dec 30, 2025, at 08:08, Thomas Weißschuh wrote:
> The cache parameter of getcpu() is useless nowadays for various reasons.
> * It is never passed by userspace for either the vDSO or syscalls.
> * It is never used by the kernel.
> * It could not be made to work on the current vDSO architecture.
> * The structure definition is not part of the UAPI headers.
> * vdso_getcpu() is superseded by restartable sequences in any case.
>
> Remove the struct and its header.
>
> As a side-effect we get rid of an unwanted inclusion of the linux/
> header namespace from vDSO code.
>
> Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>

Acked-by: Arnd Bergmann <arnd@arndb.de>

^ permalink raw reply

* [PATCH v3] vdso: Remove struct getcpu_cache
From: Thomas Weißschuh @ 2025-12-30  7:08 UTC (permalink / raw)
  To: Huacai Chen, WANG Xuerui, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Vincenzo Frascino, Shuah Khan
  Cc: Arnd Bergmann, loongarch, linux-kernel, linux-s390, linux-api,
	linux-kselftest, Thomas Weißschuh

The cache parameter of getcpu() is useless nowadays for various reasons.
* It is never passed by userspace for either the vDSO or syscalls.
* It is never used by the kernel.
* It could not be made to work on the current vDSO architecture.
* The structure definition is not part of the UAPI headers.
* vdso_getcpu() is superseded by restartable sequences in any case.

Remove the struct and its header.

As a side-effect we get rid of an unwanted inclusion of the linux/
header namespace from vDSO code.

Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
---
Changes in v3:
- Rebase on v6.19-rc1
  - Fix conflict with UML vdso_getcpu() removal
- Flesh out commit message
- Link to v2: https://lore.kernel.org/r/20251013-getcpu_cache-v2-1-880fbfa3b7cc@linutronix.de

Changes in v2:
- Rebase on v6.18-rc1
- Link to v1: https://lore.kernel.org/r/20250826-getcpu_cache-v1-1-8748318f6141@linutronix.de
---
We could also completely remove the parameter, but I am not sure if
that is a good idea for syscalls and vDSO entrypoints.
---
 arch/loongarch/vdso/vgetcpu.c                   |  5 ++---
 arch/s390/kernel/vdso/getcpu.c                  |  3 +--
 arch/s390/kernel/vdso/vdso.h                    |  4 +---
 arch/x86/entry/vdso/vgetcpu.c                   |  5 ++---
 arch/x86/include/asm/vdso/processor.h           |  4 +---
 include/linux/getcpu.h                          | 19 -------------------
 include/linux/syscalls.h                        |  3 +--
 kernel/sys.c                                    |  4 +---
 tools/testing/selftests/vDSO/vdso_test_getcpu.c |  4 +---
 9 files changed, 10 insertions(+), 41 deletions(-)

diff --git a/arch/loongarch/vdso/vgetcpu.c b/arch/loongarch/vdso/vgetcpu.c
index 73af49242ecd..6f054ec898c7 100644
--- a/arch/loongarch/vdso/vgetcpu.c
+++ b/arch/loongarch/vdso/vgetcpu.c
@@ -4,7 +4,6 @@
  */
 
 #include <asm/vdso.h>
-#include <linux/getcpu.h>
 
 static __always_inline int read_cpu_id(void)
 {
@@ -28,8 +27,8 @@ static __always_inline int read_cpu_id(void)
 }
 
 extern
-int __vdso_getcpu(unsigned int *cpu, unsigned int *node, struct getcpu_cache *unused);
-int __vdso_getcpu(unsigned int *cpu, unsigned int *node, struct getcpu_cache *unused)
+int __vdso_getcpu(unsigned int *cpu, unsigned int *node, void *unused);
+int __vdso_getcpu(unsigned int *cpu, unsigned int *node, void *unused)
 {
 	int cpu_id;
 
diff --git a/arch/s390/kernel/vdso/getcpu.c b/arch/s390/kernel/vdso/getcpu.c
index 5c5d4a848b76..1e17665616c5 100644
--- a/arch/s390/kernel/vdso/getcpu.c
+++ b/arch/s390/kernel/vdso/getcpu.c
@@ -2,11 +2,10 @@
 /* Copyright IBM Corp. 2020 */
 
 #include <linux/compiler.h>
-#include <linux/getcpu.h>
 #include <asm/timex.h>
 #include "vdso.h"
 
-int __s390_vdso_getcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *unused)
+int __s390_vdso_getcpu(unsigned *cpu, unsigned *node, void *unused)
 {
 	union tod_clock clk;
 
diff --git a/arch/s390/kernel/vdso/vdso.h b/arch/s390/kernel/vdso/vdso.h
index 8cff033dd854..1fe52a6f5a56 100644
--- a/arch/s390/kernel/vdso/vdso.h
+++ b/arch/s390/kernel/vdso/vdso.h
@@ -4,9 +4,7 @@
 
 #include <vdso/datapage.h>
 
-struct getcpu_cache;
-
-int __s390_vdso_getcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *unused);
+int __s390_vdso_getcpu(unsigned *cpu, unsigned *node, void *unused);
 int __s390_vdso_gettimeofday(struct __kernel_old_timeval *tv, struct timezone *tz);
 int __s390_vdso_clock_gettime(clockid_t clock, struct __kernel_timespec *ts);
 int __s390_vdso_clock_getres(clockid_t clock, struct __kernel_timespec *ts);
diff --git a/arch/x86/entry/vdso/vgetcpu.c b/arch/x86/entry/vdso/vgetcpu.c
index e4640306b2e3..6381b472b7c5 100644
--- a/arch/x86/entry/vdso/vgetcpu.c
+++ b/arch/x86/entry/vdso/vgetcpu.c
@@ -6,17 +6,16 @@
  */
 
 #include <linux/kernel.h>
-#include <linux/getcpu.h>
 #include <asm/segment.h>
 #include <vdso/processor.h>
 
 notrace long
-__vdso_getcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *unused)
+__vdso_getcpu(unsigned *cpu, unsigned *node, void *unused)
 {
 	vdso_read_cpunode(cpu, node);
 
 	return 0;
 }
 
-long getcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *tcache)
+long getcpu(unsigned *cpu, unsigned *node, void *tcache)
 	__attribute__((weak, alias("__vdso_getcpu")));
diff --git a/arch/x86/include/asm/vdso/processor.h b/arch/x86/include/asm/vdso/processor.h
index 7000aeb59aa2..93e0e24e5cb4 100644
--- a/arch/x86/include/asm/vdso/processor.h
+++ b/arch/x86/include/asm/vdso/processor.h
@@ -18,9 +18,7 @@ static __always_inline void cpu_relax(void)
 	native_pause();
 }
 
-struct getcpu_cache;
-
-notrace long __vdso_getcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *unused);
+notrace long __vdso_getcpu(unsigned *cpu, unsigned *node, void *unused);
 
 #endif /* __ASSEMBLER__ */
 
diff --git a/include/linux/getcpu.h b/include/linux/getcpu.h
deleted file mode 100644
index c304dcdb4eac..000000000000
--- a/include/linux/getcpu.h
+++ /dev/null
@@ -1,19 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _LINUX_GETCPU_H
-#define _LINUX_GETCPU_H 1
-
-/* Cache for getcpu() to speed it up. Results might be a short time
-   out of date, but will be faster.
-
-   User programs should not refer to the contents of this structure.
-   I repeat they should not refer to it. If they do they will break
-   in future kernels.
-
-   It is only a private cache for vgetcpu(). It will change in future kernels.
-   The user program must store this information per thread (__thread)
-   If you want 100% accurate information pass NULL instead. */
-struct getcpu_cache {
-	unsigned long blob[128 / sizeof(long)];
-};
-
-#endif
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index cf84d98964b2..23704e006afd 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -59,7 +59,6 @@ struct compat_stat;
 struct old_timeval32;
 struct robust_list_head;
 struct futex_waitv;
-struct getcpu_cache;
 struct old_linux_dirent;
 struct perf_event_attr;
 struct file_handle;
@@ -718,7 +717,7 @@ asmlinkage long sys_getrusage(int who, struct rusage __user *ru);
 asmlinkage long sys_umask(int mask);
 asmlinkage long sys_prctl(int option, unsigned long arg2, unsigned long arg3,
 			unsigned long arg4, unsigned long arg5);
-asmlinkage long sys_getcpu(unsigned __user *cpu, unsigned __user *node, struct getcpu_cache __user *cache);
+asmlinkage long sys_getcpu(unsigned __user *cpu, unsigned __user *node, void __user *cache);
 asmlinkage long sys_gettimeofday(struct __kernel_old_timeval __user *tv,
 				struct timezone __user *tz);
 asmlinkage long sys_settimeofday(struct __kernel_old_timeval __user *tv,
diff --git a/kernel/sys.c b/kernel/sys.c
index 8b58eece4e58..f1780ab132a3 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -31,7 +31,6 @@
 #include <linux/tty.h>
 #include <linux/signal.h>
 #include <linux/cn_proc.h>
-#include <linux/getcpu.h>
 #include <linux/task_io_accounting_ops.h>
 #include <linux/seccomp.h>
 #include <linux/cpu.h>
@@ -2876,8 +2875,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	return error;
 }
 
-SYSCALL_DEFINE3(getcpu, unsigned __user *, cpup, unsigned __user *, nodep,
-		struct getcpu_cache __user *, unused)
+SYSCALL_DEFINE3(getcpu, unsigned __user *, cpup, unsigned __user *, nodep, void __user *, unused)
 {
 	int err = 0;
 	int cpu = raw_smp_processor_id();
diff --git a/tools/testing/selftests/vDSO/vdso_test_getcpu.c b/tools/testing/selftests/vDSO/vdso_test_getcpu.c
index bea8ad54da11..3fe49cbdae98 100644
--- a/tools/testing/selftests/vDSO/vdso_test_getcpu.c
+++ b/tools/testing/selftests/vDSO/vdso_test_getcpu.c
@@ -16,9 +16,7 @@
 #include "vdso_config.h"
 #include "vdso_call.h"
 
-struct getcpu_cache;
-typedef long (*getcpu_t)(unsigned int *, unsigned int *,
-			 struct getcpu_cache *);
+typedef long (*getcpu_t)(unsigned int *, unsigned int *, void *);
 
 int main(int argc, char **argv)
 {

---
base-commit: 8f0b4cce4481fb22653697cced8d0d04027cb1e8
change-id: 20250825-getcpu_cache-3abcd2e65437

Best regards,
-- 
Thomas Weißschuh <thomas.weissschuh@linutronix.de>


^ permalink raw reply related

* [PATCH 9/9] MIPS: vdso: Provide getres_time64() for 32-bit ABIs
From: Thomas Weißschuh @ 2025-12-23  6:59 UTC (permalink / raw)
  To: Andy Lutomirski, Thomas Gleixner, Vincenzo Frascino, Shuah Khan,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Russell King, Catalin Marinas, Will Deacon, Thomas Bogendoerfer
  Cc: linux-kernel, linux-kselftest, Russell King, linux-arm-kernel,
	linux-mips, Arnd Bergmann, linux-api, Thomas Weißschuh
In-Reply-To: <20251223-vdso-compat-time32-v1-0-97ea7a06a543@linutronix.de>

For consistency with __vdso_clock_gettime64() there should also be a
64-bit variant of clock_getres(). This will allow the extension of
CONFIG_COMPAT_32BIT_TIME to the vDSO and finally the removal of 32-bit
time types from the kernel and UAPI.

Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
---
 arch/mips/vdso/vdso.lds.S      | 1 +
 arch/mips/vdso/vgettimeofday.c | 6 ++++++
 2 files changed, 7 insertions(+)

diff --git a/arch/mips/vdso/vdso.lds.S b/arch/mips/vdso/vdso.lds.S
index c8bbe56d89cb..5d08be3a6b85 100644
--- a/arch/mips/vdso/vdso.lds.S
+++ b/arch/mips/vdso/vdso.lds.S
@@ -103,6 +103,7 @@ VERSION
 		__vdso_clock_getres;
 #if _MIPS_SIM != _MIPS_SIM_ABI64
 		__vdso_clock_gettime64;
+		__vdso_clock_getres_time64;
 #endif
 #endif
 	local: *;
diff --git a/arch/mips/vdso/vgettimeofday.c b/arch/mips/vdso/vgettimeofday.c
index 604afea3f336..59627f2f51b7 100644
--- a/arch/mips/vdso/vgettimeofday.c
+++ b/arch/mips/vdso/vgettimeofday.c
@@ -46,6 +46,12 @@ int __vdso_clock_gettime64(clockid_t clock,
 	return __cvdso_clock_gettime(clock, ts);
 }
 
+int __vdso_clock_getres_time64(clockid_t clock,
+			       struct __kernel_timespec *ts)
+{
+	return __cvdso_clock_getres(clock, ts);
+}
+
 #else
 
 int __vdso_clock_gettime(clockid_t clock,

-- 
2.52.0


^ permalink raw reply related

* [PATCH 8/9] arm64: vdso32: Provide clock_getres_time64()
From: Thomas Weißschuh @ 2025-12-23  6:59 UTC (permalink / raw)
  To: Andy Lutomirski, Thomas Gleixner, Vincenzo Frascino, Shuah Khan,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Russell King, Catalin Marinas, Will Deacon, Thomas Bogendoerfer
  Cc: linux-kernel, linux-kselftest, Russell King, linux-arm-kernel,
	linux-mips, Arnd Bergmann, linux-api, Thomas Weißschuh
In-Reply-To: <20251223-vdso-compat-time32-v1-0-97ea7a06a543@linutronix.de>

For consistency with __vdso_clock_gettime64() there should also be a
64-bit variant of clock_getres(). This will allow the extension of
CONFIG_COMPAT_32BIT_TIME to the vDSO and finally the removal of 32-bit
time types from the kernel and UAPI.

Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
---
 arch/arm64/kernel/vdso32/vdso.lds.S      | 1 +
 arch/arm64/kernel/vdso32/vgettimeofday.c | 6 ++++++
 2 files changed, 7 insertions(+)

diff --git a/arch/arm64/kernel/vdso32/vdso.lds.S b/arch/arm64/kernel/vdso32/vdso.lds.S
index e02b27487ce8..c374fb0146f3 100644
--- a/arch/arm64/kernel/vdso32/vdso.lds.S
+++ b/arch/arm64/kernel/vdso32/vdso.lds.S
@@ -86,6 +86,7 @@ VERSION
 		__vdso_gettimeofday;
 		__vdso_clock_getres;
 		__vdso_clock_gettime64;
+		__vdso_clock_getres_time64;
 	local: *;
 	};
 }
diff --git a/arch/arm64/kernel/vdso32/vgettimeofday.c b/arch/arm64/kernel/vdso32/vgettimeofday.c
index 29b4d8f61e39..d7b39b0a9668 100644
--- a/arch/arm64/kernel/vdso32/vgettimeofday.c
+++ b/arch/arm64/kernel/vdso32/vgettimeofday.c
@@ -32,6 +32,12 @@ int __vdso_clock_getres(clockid_t clock_id,
 	return __cvdso_clock_getres_time32(clock_id, res);
 }
 
+int __vdso_clock_getres_time64(clockid_t clock_id,
+			       struct __kernel_timespec *res)
+{
+	return __cvdso_clock_getres(clock_id, res);
+}
+
 /* Avoid unresolved references emitted by GCC */
 
 void __aeabi_unwind_cpp_pr0(void)

-- 
2.52.0


^ permalink raw reply related

* [PATCH 7/9] ARM: VDSO: provide clock_getres_time64()
From: Thomas Weißschuh @ 2025-12-23  6:59 UTC (permalink / raw)
  To: Andy Lutomirski, Thomas Gleixner, Vincenzo Frascino, Shuah Khan,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Russell King, Catalin Marinas, Will Deacon, Thomas Bogendoerfer
  Cc: linux-kernel, linux-kselftest, Russell King, linux-arm-kernel,
	linux-mips, Arnd Bergmann, linux-api, Thomas Weißschuh
In-Reply-To: <20251223-vdso-compat-time32-v1-0-97ea7a06a543@linutronix.de>

For consistency with __vdso_clock_gettime64() there should also be a
64-bit variant of clock_getres(). This will allow the extension of
CONFIG_COMPAT_32BIT_TIME to the vDSO and finally the removal of 32-bit
time types from the kernel and UAPI.

Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
---
 arch/arm/kernel/vdso.c        | 1 +
 arch/arm/vdso/vdso.lds.S      | 1 +
 arch/arm/vdso/vgettimeofday.c | 6 ++++++
 3 files changed, 8 insertions(+)

diff --git a/arch/arm/kernel/vdso.c b/arch/arm/kernel/vdso.c
index 566c40f0f7c7..0108f33d6bed 100644
--- a/arch/arm/kernel/vdso.c
+++ b/arch/arm/kernel/vdso.c
@@ -162,6 +162,7 @@ static void __init patch_vdso(void *ehdr)
 		vdso_nullpatch_one(&einfo, "__vdso_clock_gettime");
 		vdso_nullpatch_one(&einfo, "__vdso_clock_gettime64");
 		vdso_nullpatch_one(&einfo, "__vdso_clock_getres");
+		vdso_nullpatch_one(&einfo, "__vdso_clock_getres_time64");
 	}
 }
 
diff --git a/arch/arm/vdso/vdso.lds.S b/arch/arm/vdso/vdso.lds.S
index 7c08371f4400..74d8d8bc8a40 100644
--- a/arch/arm/vdso/vdso.lds.S
+++ b/arch/arm/vdso/vdso.lds.S
@@ -74,6 +74,7 @@ VERSION
 		__vdso_gettimeofday;
 		__vdso_clock_getres;
 		__vdso_clock_gettime64;
+		__vdso_clock_getres_time64;
 	local: *;
 	};
 }
diff --git a/arch/arm/vdso/vgettimeofday.c b/arch/arm/vdso/vgettimeofday.c
index 3554aa35f1ba..2874dde7e6cf 100644
--- a/arch/arm/vdso/vgettimeofday.c
+++ b/arch/arm/vdso/vgettimeofday.c
@@ -34,6 +34,12 @@ int __vdso_clock_getres(clockid_t clock_id,
 	return __cvdso_clock_getres_time32(clock_id, res);
 }
 
+int __vdso_clock_getres_time64(clockid_t clock_id,
+			       struct __kernel_timespec *res)
+{
+	return __cvdso_clock_getres(clock_id, res);
+}
+
 /* Avoid unresolved references emitted by GCC */
 
 void __aeabi_unwind_cpp_pr0(void)

-- 
2.52.0


^ permalink raw reply related

* [PATCH 6/9] ARM: VDSO: also patch out __vdso_clock_getres() if unavailable
From: Thomas Weißschuh @ 2025-12-23  6:59 UTC (permalink / raw)
  To: Andy Lutomirski, Thomas Gleixner, Vincenzo Frascino, Shuah Khan,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Russell King, Catalin Marinas, Will Deacon, Thomas Bogendoerfer
  Cc: linux-kernel, linux-kselftest, Russell King, linux-arm-kernel,
	linux-mips, Arnd Bergmann, linux-api, Thomas Weißschuh
In-Reply-To: <20251223-vdso-compat-time32-v1-0-97ea7a06a543@linutronix.de>

The vDSO code hides symbols which are non-functional.
__vdso_clock_getres() was not added to this list when it got introduced.

Fixes: 052e76a31b4a ("ARM: 8931/1: Add clock_getres entry point")
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
---
 arch/arm/kernel/vdso.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm/kernel/vdso.c b/arch/arm/kernel/vdso.c
index e38a30477f3d..566c40f0f7c7 100644
--- a/arch/arm/kernel/vdso.c
+++ b/arch/arm/kernel/vdso.c
@@ -161,6 +161,7 @@ static void __init patch_vdso(void *ehdr)
 		vdso_nullpatch_one(&einfo, "__vdso_gettimeofday");
 		vdso_nullpatch_one(&einfo, "__vdso_clock_gettime");
 		vdso_nullpatch_one(&einfo, "__vdso_clock_gettime64");
+		vdso_nullpatch_one(&einfo, "__vdso_clock_getres");
 	}
 }
 

-- 
2.52.0


^ permalink raw reply related

* [PATCH 4/9] selftests: vDSO: vdso_test_abi: Add test for clock_getres_time64()
From: Thomas Weißschuh @ 2025-12-23  6:59 UTC (permalink / raw)
  To: Andy Lutomirski, Thomas Gleixner, Vincenzo Frascino, Shuah Khan,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Russell King, Catalin Marinas, Will Deacon, Thomas Bogendoerfer
  Cc: linux-kernel, linux-kselftest, Russell King, linux-arm-kernel,
	linux-mips, Arnd Bergmann, linux-api, Thomas Weißschuh
In-Reply-To: <20251223-vdso-compat-time32-v1-0-97ea7a06a543@linutronix.de>

Some architectures will start to implement this function.
Make sure it works correctly.

Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
---
 tools/testing/selftests/vDSO/vdso_test_abi.c | 53 +++++++++++++++++++++++++++-
 1 file changed, 52 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/vDSO/vdso_test_abi.c b/tools/testing/selftests/vDSO/vdso_test_abi.c
index a75c12dcb0f1..b162a4ba9c4f 100644
--- a/tools/testing/selftests/vDSO/vdso_test_abi.c
+++ b/tools/testing/selftests/vDSO/vdso_test_abi.c
@@ -36,6 +36,7 @@ typedef long (*vdso_gettimeofday_t)(struct timeval *tv, struct timezone *tz);
 typedef long (*vdso_clock_gettime_t)(clockid_t clk_id, struct timespec *ts);
 typedef long (*vdso_clock_gettime64_t)(clockid_t clk_id, struct vdso_timespec64 *ts);
 typedef long (*vdso_clock_getres_t)(clockid_t clk_id, struct timespec *ts);
+typedef long (*vdso_clock_getres_time64_t)(clockid_t clk_id, struct vdso_timespec64 *ts);
 typedef time_t (*vdso_time_t)(time_t *t);
 
 static const char * const vdso_clock_name[] = {
@@ -196,6 +197,55 @@ static void vdso_test_clock_getres(clockid_t clk_id)
 	}
 }
 
+#ifdef __NR_clock_getres_time64
+static void vdso_test_clock_getres_time64(clockid_t clk_id)
+{
+	int clock_getres_fail = 0;
+
+	/* Find clock_getres. */
+	vdso_clock_getres_time64_t vdso_clock_getres_time64 =
+		(vdso_clock_getres_time64_t)vdso_sym(version, name[7]);
+
+	if (!vdso_clock_getres_time64) {
+		ksft_print_msg("Couldn't find %s\n", name[7]);
+		ksft_test_result_skip("%s %s\n", name[7],
+				      vdso_clock_name[clk_id]);
+		return;
+	}
+
+	struct vdso_timespec64 ts, sys_ts;
+	long ret = VDSO_CALL(vdso_clock_getres_time64, 2, clk_id, &ts);
+
+	if (ret == 0) {
+		ksft_print_msg("The vdso resolution is %lld %lld\n",
+			       (long long)ts.tv_sec, (long long)ts.tv_nsec);
+	} else {
+		clock_getres_fail++;
+	}
+
+	ret = syscall(__NR_clock_getres_time64, clk_id, &sys_ts);
+
+	ksft_print_msg("The syscall resolution is %lld %lld\n",
+			(long long)sys_ts.tv_sec, (long long)sys_ts.tv_nsec);
+
+	if ((sys_ts.tv_sec != ts.tv_sec) || (sys_ts.tv_nsec != ts.tv_nsec))
+		clock_getres_fail++;
+
+	if (clock_getres_fail > 0) {
+		ksft_test_result_fail("%s %s\n", name[7],
+				      vdso_clock_name[clk_id]);
+	} else {
+		ksft_test_result_pass("%s %s\n", name[7],
+				      vdso_clock_name[clk_id]);
+	}
+}
+#else /* !__NR_clock_getres_time64 */
+static void vdso_test_clock_getres_time64(clockid_t clk_id)
+{
+	ksft_test_result_skip("%s %s\n", name[7], vdso_clock_name[clk_id]);
+}
+#endif /* __NR_clock_getres_time64 */
+
 /*
  * This function calls vdso_test_clock_gettime and vdso_test_clock_getres
  * with different values for clock_id.
@@ -208,9 +258,10 @@ static inline void vdso_test_clock(clockid_t clock_id)
 	vdso_test_clock_gettime64(clock_id);
 
 	vdso_test_clock_getres(clock_id);
+	vdso_test_clock_getres_time64(clock_id);
 }
 
-#define VDSO_TEST_PLAN	29
+#define VDSO_TEST_PLAN	38
 
 int main(int argc, char **argv)
 {

-- 
2.52.0


^ permalink raw reply related

* [PATCH 5/9] x86/vdso: Provide clock_getres_time64() for x86-32
From: Thomas Weißschuh @ 2025-12-23  6:59 UTC (permalink / raw)
  To: Andy Lutomirski, Thomas Gleixner, Vincenzo Frascino, Shuah Khan,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Russell King, Catalin Marinas, Will Deacon, Thomas Bogendoerfer
  Cc: linux-kernel, linux-kselftest, Russell King, linux-arm-kernel,
	linux-mips, Arnd Bergmann, linux-api, Thomas Weißschuh
In-Reply-To: <20251223-vdso-compat-time32-v1-0-97ea7a06a543@linutronix.de>

For consistency with __vdso_clock_gettime64() there should also be a
64-bit variant of clock_getres(). This will allow the extension of
CONFIG_COMPAT_32BIT_TIME to the vDSO and finally the removal of 32-bit
time types from the kernel and UAPI.

Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
---
 arch/x86/entry/vdso/vclock_gettime.c    | 8 ++++++++
 arch/x86/entry/vdso/vdso32/vdso32.lds.S | 1 +
 2 files changed, 9 insertions(+)

diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
index 0debc194bd78..027b7e88d753 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -74,4 +74,12 @@ int __vdso_clock_getres(clockid_t clock, struct old_timespec32 *res)
 
 int clock_getres(clockid_t, struct old_timespec32 *)
 	__attribute__((weak, alias("__vdso_clock_getres")));
+
+int __vdso_clock_getres_time64(clockid_t clock, struct __kernel_timespec *ts)
+{
+	return __cvdso_clock_getres(clock, ts);
+}
+
+int clock_getres_time64(clockid_t, struct __kernel_timespec *)
+	__attribute__((weak, alias("__vdso_clock_getres_time64")));
 #endif
diff --git a/arch/x86/entry/vdso/vdso32/vdso32.lds.S b/arch/x86/entry/vdso/vdso32/vdso32.lds.S
index 8a3be07006bb..6f977c103584 100644
--- a/arch/x86/entry/vdso/vdso32/vdso32.lds.S
+++ b/arch/x86/entry/vdso/vdso32/vdso32.lds.S
@@ -28,6 +28,7 @@ VERSION
 		__vdso_time;
 		__vdso_clock_getres;
 		__vdso_clock_gettime64;
+		__vdso_clock_getres_time64;
 		__vdso_getcpu;
 	};
 

-- 
2.52.0


^ permalink raw reply related

* [PATCH 3/9] selftests: vDSO: vdso_test_abi: Use UAPI system call numbers
From: Thomas Weißschuh @ 2025-12-23  6:59 UTC (permalink / raw)
  To: Andy Lutomirski, Thomas Gleixner, Vincenzo Frascino, Shuah Khan,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Russell King, Catalin Marinas, Will Deacon, Thomas Bogendoerfer
  Cc: linux-kernel, linux-kselftest, Russell King, linux-arm-kernel,
	linux-mips, Arnd Bergmann, linux-api, Thomas Weißschuh
In-Reply-To: <20251223-vdso-compat-time32-v1-0-97ea7a06a543@linutronix.de>

SYS_clock_getres might have been redirected by libc to some other system
call than the actual clock_getres. In the test we want to make sure to
use exactly this system call.

Use the system call number exported by the UAPI headers which is always
correct.

Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
---
 tools/testing/selftests/vDSO/vdso_test_abi.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/vDSO/vdso_test_abi.c b/tools/testing/selftests/vDSO/vdso_test_abi.c
index c620317eaeea..a75c12dcb0f1 100644
--- a/tools/testing/selftests/vDSO/vdso_test_abi.c
+++ b/tools/testing/selftests/vDSO/vdso_test_abi.c
@@ -179,7 +179,7 @@ static void vdso_test_clock_getres(clockid_t clk_id)
 		clock_getres_fail++;
 	}
 
-	ret = syscall(SYS_clock_getres, clk_id, &sys_ts);
+	ret = syscall(__NR_clock_getres, clk_id, &sys_ts);
 
 	ksft_print_msg("The syscall resolution is %lld %lld\n",
 			(long long)sys_ts.tv_sec, (long long)sys_ts.tv_nsec);

-- 
2.52.0


^ permalink raw reply related

* [PATCH 2/9] selftests: vDSO: vdso_config: Add configurations for clock_getres_time64()
From: Thomas Weißschuh @ 2025-12-23  6:59 UTC (permalink / raw)
  To: Andy Lutomirski, Thomas Gleixner, Vincenzo Frascino, Shuah Khan,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Russell King, Catalin Marinas, Will Deacon, Thomas Bogendoerfer
  Cc: linux-kernel, linux-kselftest, Russell King, linux-arm-kernel,
	linux-mips, Arnd Bergmann, linux-api, Thomas Weißschuh
In-Reply-To: <20251223-vdso-compat-time32-v1-0-97ea7a06a543@linutronix.de>

Some architectures will start to implement this function.
Make sure that tests can be written for it.

Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
---
 tools/testing/selftests/vDSO/vdso_config.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/vDSO/vdso_config.h b/tools/testing/selftests/vDSO/vdso_config.h
index 50c261005111..5da223731b81 100644
--- a/tools/testing/selftests/vDSO/vdso_config.h
+++ b/tools/testing/selftests/vDSO/vdso_config.h
@@ -66,7 +66,7 @@ static const char *versions[7] = {
 };
 
 __attribute__((unused))
-static const char *names[2][7] = {
+static const char *names[2][8] = {
 	{
 		"__kernel_gettimeofday",
 		"__kernel_clock_gettime",
@@ -75,6 +75,7 @@ static const char *names[2][7] = {
 		"__kernel_getcpu",
 		"__kernel_clock_gettime64",
 		"__kernel_getrandom",
+		"__kernel_clock_getres_time64",
 	},
 	{
 		"__vdso_gettimeofday",
@@ -84,6 +85,7 @@ static const char *names[2][7] = {
 		"__vdso_getcpu",
 		"__vdso_clock_gettime64",
 		"__vdso_getrandom",
+		"__vdso_clock_getres_time64",
 	},
 };
 

-- 
2.52.0


^ permalink raw reply related

* [PATCH 1/9] vdso: Add prototype for __vdso_clock_getres_time64()
From: Thomas Weißschuh @ 2025-12-23  6:59 UTC (permalink / raw)
  To: Andy Lutomirski, Thomas Gleixner, Vincenzo Frascino, Shuah Khan,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Russell King, Catalin Marinas, Will Deacon, Thomas Bogendoerfer
  Cc: linux-kernel, linux-kselftest, Russell King, linux-arm-kernel,
	linux-mips, Arnd Bergmann, linux-api, Thomas Weißschuh
In-Reply-To: <20251223-vdso-compat-time32-v1-0-97ea7a06a543@linutronix.de>

For consistency with __vdso_clock_gettime64() there should also be a
64-bit variant of clock_getres(). This will allow the extension of
CONFIG_COMPAT_32BIT_TIME to the vDSO and finally the removal of 32-bit
time types from the kernel and UAPI. The generic vDSO library already
provides nearly all necessary building blocks for architectures to
provide this function. Only a prototype is missing.

Add the prototype to the generic header so architectures can start
providing this function.

Suggested-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
---
 include/vdso/gettime.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/vdso/gettime.h b/include/vdso/gettime.h
index 9ac161866653..16a0a0556b86 100644
--- a/include/vdso/gettime.h
+++ b/include/vdso/gettime.h
@@ -20,5 +20,6 @@ int __vdso_clock_gettime(clockid_t clock, struct __kernel_timespec *ts);
 __kernel_old_time_t __vdso_time(__kernel_old_time_t *t);
 int __vdso_gettimeofday(struct __kernel_old_timeval *tv, struct timezone *tz);
 int __vdso_clock_gettime64(clockid_t clock, struct __kernel_timespec *ts);
+int __vdso_clock_getres_time64(clockid_t clock, struct __kernel_timespec *ts);
 
 #endif

-- 
2.52.0


^ permalink raw reply related

* [PATCH 0/9] vDSO: Provide clock_getres_time64() where applicable
From: Thomas Weißschuh @ 2025-12-23  6:59 UTC (permalink / raw)
  To: Andy Lutomirski, Thomas Gleixner, Vincenzo Frascino, Shuah Khan,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Russell King, Catalin Marinas, Will Deacon, Thomas Bogendoerfer
  Cc: linux-kernel, linux-kselftest, Russell King, linux-arm-kernel,
	linux-mips, Arnd Bergmann, linux-api, Thomas Weißschuh

For consistency with __vdso_clock_gettime64() there should also be a
64-bit variant of clock_getres(). This will allow the extension of
CONFIG_COMPAT_32BIT_TIME to the vDSO and finally the removal of 32-bit
time types from the kernel and UAPI.
    

Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
---
Thomas Weißschuh (9):
      vdso: Add prototype for __vdso_clock_getres_time64()
      selftests: vDSO: vdso_config: Add configurations for clock_getres_time64()
      selftests: vDSO: vdso_test_abi: Use UAPI system call numbers
      selftests: vDSO: vdso_test_abi: Add test for clock_getres_time64()
      x86/vdso: Provide clock_getres_time64() for x86-32
      ARM: VDSO: also patch out __vdso_clock_getres() if unavailable
      ARM: VDSO: provide clock_getres_time64()
      arm64: vdso32: Provide clock_getres_time64()
      MIPS: vdso: Provide getres_time64() for 32-bit ABIs

 arch/arm/kernel/vdso.c                       |  2 +
 arch/arm/vdso/vdso.lds.S                     |  1 +
 arch/arm/vdso/vgettimeofday.c                |  6 +++
 arch/arm64/kernel/vdso32/vdso.lds.S          |  1 +
 arch/arm64/kernel/vdso32/vgettimeofday.c     |  6 +++
 arch/mips/vdso/vdso.lds.S                    |  1 +
 arch/mips/vdso/vgettimeofday.c               |  6 +++
 arch/x86/entry/vdso/vclock_gettime.c         |  8 ++++
 arch/x86/entry/vdso/vdso32/vdso32.lds.S      |  1 +
 include/vdso/gettime.h                       |  1 +
 tools/testing/selftests/vDSO/vdso_config.h   |  4 +-
 tools/testing/selftests/vDSO/vdso_test_abi.c | 55 +++++++++++++++++++++++++++-
 12 files changed, 89 insertions(+), 3 deletions(-)
---
base-commit: 15a11f3ffb629cbbf6efd272239c04a9eb3180e2
change-id: 20251120-vdso-compat-time32-f4684ff250ba

Best regards,
-- 
Thomas Weißschuh <thomas.weissschuh@linutronix.de>


^ permalink raw reply

* Re: [PATCH 1/6] uapi: promote EFSCORRUPTED and EUCLEAN to errno.h
From: Jan Kara @ 2025-12-22 15:01 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: brauner, linux-api, linux-ext4, jack, linux-xfs, linux-fsdevel,
	gabriel, hch, amir73il
In-Reply-To: <176602332146.686273.6355079912638580915.stgit@frogsfrogsfrogs>

On Wed 17-12-25 18:02:56, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Stop definining these privately and instead move them to the uapi
> errno.h so that they become canonical instead of copy pasta.
> 
> Cc: linux-api@vger.kernel.org
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  arch/alpha/include/uapi/asm/errno.h        |    2 ++
>  arch/mips/include/uapi/asm/errno.h         |    2 ++
>  arch/parisc/include/uapi/asm/errno.h       |    2 ++
>  arch/sparc/include/uapi/asm/errno.h        |    2 ++
>  fs/erofs/internal.h                        |    2 --
>  fs/ext2/ext2.h                             |    1 -
>  fs/ext4/ext4.h                             |    3 ---
>  fs/f2fs/f2fs.h                             |    3 ---
>  fs/minix/minix.h                           |    2 --
>  fs/udf/udf_sb.h                            |    2 --
>  fs/xfs/xfs_linux.h                         |    2 --
>  include/linux/jbd2.h                       |    3 ---
>  include/uapi/asm-generic/errno.h           |    2 ++
>  tools/arch/alpha/include/uapi/asm/errno.h  |    2 ++
>  tools/arch/mips/include/uapi/asm/errno.h   |    2 ++
>  tools/arch/parisc/include/uapi/asm/errno.h |    2 ++
>  tools/arch/sparc/include/uapi/asm/errno.h  |    2 ++
>  tools/include/uapi/asm-generic/errno.h     |    2 ++
>  18 files changed, 20 insertions(+), 18 deletions(-)
> 
> 
> diff --git a/arch/alpha/include/uapi/asm/errno.h b/arch/alpha/include/uapi/asm/errno.h
> index 3d265f6babaf0a..6791f6508632ee 100644
> --- a/arch/alpha/include/uapi/asm/errno.h
> +++ b/arch/alpha/include/uapi/asm/errno.h
> @@ -55,6 +55,7 @@
>  #define	ENOSR		82	/* Out of streams resources */
>  #define	ETIME		83	/* Timer expired */
>  #define	EBADMSG		84	/* Not a data message */
> +#define	EFSBADCRC	EBADMSG	/* Bad CRC detected */
>  #define	EPROTO		85	/* Protocol error */
>  #define	ENODATA		86	/* No data available */
>  #define	ENOSTR		87	/* Device not a stream */
> @@ -96,6 +97,7 @@
>  #define	EREMCHG		115	/* Remote address changed */
>  
>  #define	EUCLEAN		117	/* Structure needs cleaning */
> +#define	EFSCORRUPTED	EUCLEAN	/* Filesystem is corrupted */
>  #define	ENOTNAM		118	/* Not a XENIX named type file */
>  #define	ENAVAIL		119	/* No XENIX semaphores available */
>  #define	EISNAM		120	/* Is a named type file */
> diff --git a/arch/mips/include/uapi/asm/errno.h b/arch/mips/include/uapi/asm/errno.h
> index 2fb714e2d6d8fc..c01ed91b1ef44b 100644
> --- a/arch/mips/include/uapi/asm/errno.h
> +++ b/arch/mips/include/uapi/asm/errno.h
> @@ -50,6 +50,7 @@
>  #define EDOTDOT		73	/* RFS specific error */
>  #define EMULTIHOP	74	/* Multihop attempted */
>  #define EBADMSG		77	/* Not a data message */
> +#define EFSBADCRC	EBADMSG	/* Bad CRC detected */
>  #define ENAMETOOLONG	78	/* File name too long */
>  #define EOVERFLOW	79	/* Value too large for defined data type */
>  #define ENOTUNIQ	80	/* Name not unique on network */
> @@ -88,6 +89,7 @@
>  #define EISCONN		133	/* Transport endpoint is already connected */
>  #define ENOTCONN	134	/* Transport endpoint is not connected */
>  #define EUCLEAN		135	/* Structure needs cleaning */
> +#define EFSCORRUPTED	EUCLEAN	/* Filesystem is corrupted */
>  #define ENOTNAM		137	/* Not a XENIX named type file */
>  #define ENAVAIL		138	/* No XENIX semaphores available */
>  #define EISNAM		139	/* Is a named type file */
> diff --git a/arch/parisc/include/uapi/asm/errno.h b/arch/parisc/include/uapi/asm/errno.h
> index 8d94739d75c67c..8cbc07c1903e4c 100644
> --- a/arch/parisc/include/uapi/asm/errno.h
> +++ b/arch/parisc/include/uapi/asm/errno.h
> @@ -36,6 +36,7 @@
>  
>  #define	EDOTDOT		66	/* RFS specific error */
>  #define	EBADMSG		67	/* Not a data message */
> +#define	EFSBADCRC	EBADMSG	/* Bad CRC detected */
>  #define	EUSERS		68	/* Too many users */
>  #define	EDQUOT		69	/* Quota exceeded */
>  #define	ESTALE		70	/* Stale file handle */
> @@ -62,6 +63,7 @@
>  #define	ERESTART	175	/* Interrupted system call should be restarted */
>  #define	ESTRPIPE	176	/* Streams pipe error */
>  #define	EUCLEAN		177	/* Structure needs cleaning */
> +#define	EFSCORRUPTED	EUCLEAN	/* Filesystem is corrupted */
>  #define	ENOTNAM		178	/* Not a XENIX named type file */
>  #define	ENAVAIL		179	/* No XENIX semaphores available */
>  #define	EISNAM		180	/* Is a named type file */
> diff --git a/arch/sparc/include/uapi/asm/errno.h b/arch/sparc/include/uapi/asm/errno.h
> index 81a732b902ee38..4a41e7835fd5b8 100644
> --- a/arch/sparc/include/uapi/asm/errno.h
> +++ b/arch/sparc/include/uapi/asm/errno.h
> @@ -48,6 +48,7 @@
>  #define	ENOSR		74	/* Out of streams resources */
>  #define	ENOMSG		75	/* No message of desired type */
>  #define	EBADMSG		76	/* Not a data message */
> +#define	EFSBADCRC	EBADMSG	/* Bad CRC detected */
>  #define	EIDRM		77	/* Identifier removed */
>  #define	EDEADLK		78	/* Resource deadlock would occur */
>  #define	ENOLCK		79	/* No record locks available */
> @@ -91,6 +92,7 @@
>  #define	ENOTUNIQ	115	/* Name not unique on network */
>  #define	ERESTART	116	/* Interrupted syscall should be restarted */
>  #define	EUCLEAN		117	/* Structure needs cleaning */
> +#define	EFSCORRUPTED	EUCLEAN	/* Filesystem is corrupted */
>  #define	ENOTNAM		118	/* Not a XENIX named type file */
>  #define	ENAVAIL		119	/* No XENIX semaphores available */
>  #define	EISNAM		120	/* Is a named type file */
> diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
> index f7f622836198da..d06e99baf5d5ae 100644
> --- a/fs/erofs/internal.h
> +++ b/fs/erofs/internal.h
> @@ -541,6 +541,4 @@ long erofs_ioctl(struct file *filp, unsigned int cmd, unsigned long arg);
>  long erofs_compat_ioctl(struct file *filp, unsigned int cmd,
>  			unsigned long arg);
>  
> -#define EFSCORRUPTED    EUCLEAN         /* Filesystem is corrupted */
> -
>  #endif	/* __EROFS_INTERNAL_H */
> diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
> index cf97b76e9fd3e9..5e0c6c5fcb6cd6 100644
> --- a/fs/ext2/ext2.h
> +++ b/fs/ext2/ext2.h
> @@ -357,7 +357,6 @@ struct ext2_inode {
>   */
>  #define	EXT2_VALID_FS			0x0001	/* Unmounted cleanly */
>  #define	EXT2_ERROR_FS			0x0002	/* Errors detected */
> -#define	EFSCORRUPTED			EUCLEAN	/* Filesystem is corrupted */
>  
>  /*
>   * Mount flags
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 56112f201cace7..62c091b52bacdf 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -3938,7 +3938,4 @@ extern int ext4_block_write_begin(handle_t *handle, struct folio *folio,
>  				  get_block_t *get_block);
>  #endif	/* __KERNEL__ */
>  
> -#define EFSBADCRC	EBADMSG		/* Bad CRC detected */
> -#define EFSCORRUPTED	EUCLEAN		/* Filesystem is corrupted */
> -
>  #endif	/* _EXT4_H */
> diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
> index 20edbb99b814a7..9f3aa3c7f12613 100644
> --- a/fs/f2fs/f2fs.h
> +++ b/fs/f2fs/f2fs.h
> @@ -5004,7 +5004,4 @@ static inline void f2fs_invalidate_internal_cache(struct f2fs_sb_info *sbi,
>  	f2fs_invalidate_compress_pages_range(sbi, blkaddr, len);
>  }
>  
> -#define EFSBADCRC	EBADMSG		/* Bad CRC detected */
> -#define EFSCORRUPTED	EUCLEAN		/* Filesystem is corrupted */
> -
>  #endif /* _LINUX_F2FS_H */
> diff --git a/fs/minix/minix.h b/fs/minix/minix.h
> index 2bfaf377f2086c..7e1f652f16d311 100644
> --- a/fs/minix/minix.h
> +++ b/fs/minix/minix.h
> @@ -175,6 +175,4 @@ static inline int minix_test_bit(int nr, const void *vaddr)
>  	__minix_error_inode((inode), __func__, __LINE__,	\
>  			    (fmt), ##__VA_ARGS__)
>  
> -#define EFSCORRUPTED	EUCLEAN		/* Filesystem is corrupted */
> -
>  #endif /* FS_MINIX_H */
> diff --git a/fs/udf/udf_sb.h b/fs/udf/udf_sb.h
> index 08ec8756b9487b..8399accc788dea 100644
> --- a/fs/udf/udf_sb.h
> +++ b/fs/udf/udf_sb.h
> @@ -55,8 +55,6 @@
>  #define MF_DUPLICATE_MD		0x01
>  #define MF_MIRROR_FE_LOADED	0x02
>  
> -#define EFSCORRUPTED EUCLEAN
> -
>  struct udf_meta_data {
>  	__u32	s_meta_file_loc;
>  	__u32	s_mirror_file_loc;
> diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
> index 4dd747bdbccab2..55064228c4d574 100644
> --- a/fs/xfs/xfs_linux.h
> +++ b/fs/xfs/xfs_linux.h
> @@ -121,8 +121,6 @@ typedef __u32			xfs_nlink_t;
>  
>  #define ENOATTR		ENODATA		/* Attribute not found */
>  #define EWRONGFS	EINVAL		/* Mount with wrong filesystem type */
> -#define EFSCORRUPTED	EUCLEAN		/* Filesystem is corrupted */
> -#define EFSBADCRC	EBADMSG		/* Bad CRC detected */
>  
>  #define __return_address __builtin_return_address(0)
>  
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index f5eaf76198f377..a53a00d36228ce 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -1815,7 +1815,4 @@ static inline int jbd2_handle_buffer_credits(handle_t *handle)
>  
>  #endif	/* __KERNEL__ */
>  
> -#define EFSBADCRC	EBADMSG		/* Bad CRC detected */
> -#define EFSCORRUPTED	EUCLEAN		/* Filesystem is corrupted */
> -
>  #endif	/* _LINUX_JBD2_H */
> diff --git a/include/uapi/asm-generic/errno.h b/include/uapi/asm-generic/errno.h
> index cf9c51ac49f97e..92e7ae493ee315 100644
> --- a/include/uapi/asm-generic/errno.h
> +++ b/include/uapi/asm-generic/errno.h
> @@ -55,6 +55,7 @@
>  #define	EMULTIHOP	72	/* Multihop attempted */
>  #define	EDOTDOT		73	/* RFS specific error */
>  #define	EBADMSG		74	/* Not a data message */
> +#define	EFSBADCRC	EBADMSG	/* Bad CRC detected */
>  #define	EOVERFLOW	75	/* Value too large for defined data type */
>  #define	ENOTUNIQ	76	/* Name not unique on network */
>  #define	EBADFD		77	/* File descriptor in bad state */
> @@ -98,6 +99,7 @@
>  #define	EINPROGRESS	115	/* Operation now in progress */
>  #define	ESTALE		116	/* Stale file handle */
>  #define	EUCLEAN		117	/* Structure needs cleaning */
> +#define	EFSCORRUPTED	EUCLEAN	/* Filesystem is corrupted */
>  #define	ENOTNAM		118	/* Not a XENIX named type file */
>  #define	ENAVAIL		119	/* No XENIX semaphores available */
>  #define	EISNAM		120	/* Is a named type file */
> diff --git a/tools/arch/alpha/include/uapi/asm/errno.h b/tools/arch/alpha/include/uapi/asm/errno.h
> index 3d265f6babaf0a..6791f6508632ee 100644
> --- a/tools/arch/alpha/include/uapi/asm/errno.h
> +++ b/tools/arch/alpha/include/uapi/asm/errno.h
> @@ -55,6 +55,7 @@
>  #define	ENOSR		82	/* Out of streams resources */
>  #define	ETIME		83	/* Timer expired */
>  #define	EBADMSG		84	/* Not a data message */
> +#define	EFSBADCRC	EBADMSG	/* Bad CRC detected */
>  #define	EPROTO		85	/* Protocol error */
>  #define	ENODATA		86	/* No data available */
>  #define	ENOSTR		87	/* Device not a stream */
> @@ -96,6 +97,7 @@
>  #define	EREMCHG		115	/* Remote address changed */
>  
>  #define	EUCLEAN		117	/* Structure needs cleaning */
> +#define	EFSCORRUPTED	EUCLEAN	/* Filesystem is corrupted */
>  #define	ENOTNAM		118	/* Not a XENIX named type file */
>  #define	ENAVAIL		119	/* No XENIX semaphores available */
>  #define	EISNAM		120	/* Is a named type file */
> diff --git a/tools/arch/mips/include/uapi/asm/errno.h b/tools/arch/mips/include/uapi/asm/errno.h
> index 2fb714e2d6d8fc..c01ed91b1ef44b 100644
> --- a/tools/arch/mips/include/uapi/asm/errno.h
> +++ b/tools/arch/mips/include/uapi/asm/errno.h
> @@ -50,6 +50,7 @@
>  #define EDOTDOT		73	/* RFS specific error */
>  #define EMULTIHOP	74	/* Multihop attempted */
>  #define EBADMSG		77	/* Not a data message */
> +#define EFSBADCRC	EBADMSG	/* Bad CRC detected */
>  #define ENAMETOOLONG	78	/* File name too long */
>  #define EOVERFLOW	79	/* Value too large for defined data type */
>  #define ENOTUNIQ	80	/* Name not unique on network */
> @@ -88,6 +89,7 @@
>  #define EISCONN		133	/* Transport endpoint is already connected */
>  #define ENOTCONN	134	/* Transport endpoint is not connected */
>  #define EUCLEAN		135	/* Structure needs cleaning */
> +#define EFSCORRUPTED	EUCLEAN	/* Filesystem is corrupted */
>  #define ENOTNAM		137	/* Not a XENIX named type file */
>  #define ENAVAIL		138	/* No XENIX semaphores available */
>  #define EISNAM		139	/* Is a named type file */
> diff --git a/tools/arch/parisc/include/uapi/asm/errno.h b/tools/arch/parisc/include/uapi/asm/errno.h
> index 8d94739d75c67c..8cbc07c1903e4c 100644
> --- a/tools/arch/parisc/include/uapi/asm/errno.h
> +++ b/tools/arch/parisc/include/uapi/asm/errno.h
> @@ -36,6 +36,7 @@
>  
>  #define	EDOTDOT		66	/* RFS specific error */
>  #define	EBADMSG		67	/* Not a data message */
> +#define	EFSBADCRC	EBADMSG	/* Bad CRC detected */
>  #define	EUSERS		68	/* Too many users */
>  #define	EDQUOT		69	/* Quota exceeded */
>  #define	ESTALE		70	/* Stale file handle */
> @@ -62,6 +63,7 @@
>  #define	ERESTART	175	/* Interrupted system call should be restarted */
>  #define	ESTRPIPE	176	/* Streams pipe error */
>  #define	EUCLEAN		177	/* Structure needs cleaning */
> +#define	EFSCORRUPTED	EUCLEAN	/* Filesystem is corrupted */
>  #define	ENOTNAM		178	/* Not a XENIX named type file */
>  #define	ENAVAIL		179	/* No XENIX semaphores available */
>  #define	EISNAM		180	/* Is a named type file */
> diff --git a/tools/arch/sparc/include/uapi/asm/errno.h b/tools/arch/sparc/include/uapi/asm/errno.h
> index 81a732b902ee38..4a41e7835fd5b8 100644
> --- a/tools/arch/sparc/include/uapi/asm/errno.h
> +++ b/tools/arch/sparc/include/uapi/asm/errno.h
> @@ -48,6 +48,7 @@
>  #define	ENOSR		74	/* Out of streams resources */
>  #define	ENOMSG		75	/* No message of desired type */
>  #define	EBADMSG		76	/* Not a data message */
> +#define	EFSBADCRC	EBADMSG	/* Bad CRC detected */
>  #define	EIDRM		77	/* Identifier removed */
>  #define	EDEADLK		78	/* Resource deadlock would occur */
>  #define	ENOLCK		79	/* No record locks available */
> @@ -91,6 +92,7 @@
>  #define	ENOTUNIQ	115	/* Name not unique on network */
>  #define	ERESTART	116	/* Interrupted syscall should be restarted */
>  #define	EUCLEAN		117	/* Structure needs cleaning */
> +#define	EFSCORRUPTED	EUCLEAN	/* Filesystem is corrupted */
>  #define	ENOTNAM		118	/* Not a XENIX named type file */
>  #define	ENAVAIL		119	/* No XENIX semaphores available */
>  #define	EISNAM		120	/* Is a named type file */
> diff --git a/tools/include/uapi/asm-generic/errno.h b/tools/include/uapi/asm-generic/errno.h
> index cf9c51ac49f97e..92e7ae493ee315 100644
> --- a/tools/include/uapi/asm-generic/errno.h
> +++ b/tools/include/uapi/asm-generic/errno.h
> @@ -55,6 +55,7 @@
>  #define	EMULTIHOP	72	/* Multihop attempted */
>  #define	EDOTDOT		73	/* RFS specific error */
>  #define	EBADMSG		74	/* Not a data message */
> +#define	EFSBADCRC	EBADMSG	/* Bad CRC detected */
>  #define	EOVERFLOW	75	/* Value too large for defined data type */
>  #define	ENOTUNIQ	76	/* Name not unique on network */
>  #define	EBADFD		77	/* File descriptor in bad state */
> @@ -98,6 +99,7 @@
>  #define	EINPROGRESS	115	/* Operation now in progress */
>  #define	ESTALE		116	/* Stale file handle */
>  #define	EUCLEAN		117	/* Structure needs cleaning */
> +#define	EFSCORRUPTED	EUCLEAN	/* Filesystem is corrupted */
>  #define	ENOTNAM		118	/* Not a XENIX named type file */
>  #define	ENAVAIL		119	/* No XENIX semaphores available */
>  #define	EISNAM		120	/* Is a named type file */
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* [RFC PATCH v5 15/15] kernel/api: add API specification for sys_write
From: Sasha Levin @ 2025-12-18 20:42 UTC (permalink / raw)
  To: linux-api; +Cc: linux-doc, linux-kernel, tools, gpaoloni, Sasha Levin
In-Reply-To: <20251218204239.4159453-1-sashal@kernel.org>

Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/read_write.c | 377 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 377 insertions(+)

diff --git a/fs/read_write.c b/fs/read_write.c
index 422046a666b1d..685bf6b9bd3b1 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1030,6 +1030,383 @@ ssize_t ksys_write(unsigned int fd, const char __user *buf, size_t count)
 	return ret;
 }
 
+/**
+ * sys_write - Write data to a file descriptor
+ * @fd: File descriptor to write to
+ * @buf: User-space buffer containing data to write
+ * @count: Maximum number of bytes to write
+ *
+ * long-desc: Attempts to write up to count bytes from the buffer starting at
+ *   buf to the file referred to by the file descriptor fd. For seekable files
+ *   (regular files, block devices), the write begins at the current file offset,
+ *   and the file offset is advanced by the number of bytes written. If the file
+ *   was opened with O_APPEND, the file offset is first set to the end of the
+ *   file before writing. For non-seekable files (pipes, FIFOs, sockets, character
+ *   devices), the file offset is not used and writing occurs at the current
+ *   position as defined by the device.
+ *
+ *   The number of bytes written may be less than count if, for example, there is
+ *   insufficient space on the underlying physical medium, or the RLIMIT_FSIZE
+ *   resource limit is encountered, or the call was interrupted by a signal
+ *   handler after having written less than count bytes. In the event of a
+ *   successful partial write, the caller should make another write() call to
+ *   transfer the remaining bytes. This behavior is called a "short write."
+ *
+ *   On Linux, write() transfers at most MAX_RW_COUNT (0x7ffff000, approximately
+ *   2GB minus one page) bytes per call, regardless of whether the file or
+ *   filesystem would allow more. This prevents signed arithmetic overflow.
+ *
+ *   For regular files, a successful write() does not guarantee that data has been
+ *   committed to disk. Use fsync(2) or fdatasync(2) if durability is required.
+ *   For O_SYNC or O_DSYNC files, the kernel automatically syncs data on write.
+ *
+ *   POSIX permits writes that are interrupted after partial writes to either
+ *   return -1 with errno=EINTR, or to return the count of bytes already written.
+ *   Linux implements the latter behavior: if some data has been written before
+ *   a signal arrives, write() returns the number of bytes written rather than
+ *   failing with EINTR.
+ *
+ * context-flags: KAPI_CTX_PROCESS | KAPI_CTX_SLEEPABLE
+ *
+ * param: fd
+ *   type: KAPI_TYPE_FD
+ *   flags: KAPI_PARAM_IN
+ *   constraint-type: KAPI_CONSTRAINT_RANGE
+ *   range: 0, INT_MAX
+ *   constraint: Must be a valid, open file descriptor with write permission.
+ *     The file must have been opened with O_WRONLY or O_RDWR. File descriptors
+ *     opened with O_RDONLY, O_PATH, or that have been closed return EBADF.
+ *     Standard file descriptors 0 (stdin), 1 (stdout), 2 (stderr) are valid if
+ *     open and writable. AT_FDCWD and other special values are not valid.
+ *
+ * param: buf
+ *   type: KAPI_TYPE_USER_PTR
+ *   flags: KAPI_PARAM_IN | KAPI_PARAM_USER
+ *   constraint-type: KAPI_CONSTRAINT_CUSTOM
+ *   constraint: Must point to a valid, readable user-space memory region of at
+ *     least count bytes. The buffer is validated via access_ok() before any
+ *     write operation. NULL is invalid and returns EFAULT. For O_DIRECT writes,
+ *     the buffer may need to be aligned to the filesystem's block size (varies
+ *     by filesystem; query with statx() using STATX_DIOALIGN on Linux 6.1+).
+ *
+ * param: count
+ *   type: KAPI_TYPE_UINT
+ *   flags: KAPI_PARAM_IN
+ *   constraint-type: KAPI_CONSTRAINT_RANGE
+ *   range: 0, SIZE_MAX
+ *   constraint: Maximum number of bytes to write. Clamped internally to
+ *     MAX_RW_COUNT (INT_MAX & PAGE_MASK, approximately 0x7ffff000 bytes) to
+ *     prevent signed overflow. A count of 0 returns 0 immediately without any
+ *     file operations. Cast to ssize_t must not be negative.
+ *
+ * return:
+ *   type: KAPI_TYPE_INT
+ *   check-type: KAPI_RETURN_RANGE
+ *   success: >= 0
+ *   desc: On success, returns the number of bytes written (non-negative). Zero
+ *     indicates that nothing was written (count was 0, or no space available
+ *     for non-blocking writes). The return value may be less than count due to
+ *     resource limits, signal interruption, or device constraints (short write).
+ *     On error, returns a negative error code.
+ *
+ * error: EBADF, Bad file descriptor
+ *   desc: fd is not a valid file descriptor, or fd was not opened for writing.
+ *     This includes file descriptors opened with O_RDONLY, O_PATH, or file
+ *     descriptors that have been closed. Also returned if the file structure
+ *     does not have FMODE_WRITE or FMODE_CAN_WRITE set.
+ *
+ * error: EFAULT, Bad address
+ *   desc: buf points outside the accessible address space. The buffer address
+ *     failed access_ok() validation. Can also occur if a fault happens during
+ *     copy_from_user() when reading data from user space.
+ *
+ * error: EINVAL, Invalid argument
+ *   desc: Returned in several cases: (1) The file descriptor refers to an
+ *     object that is not suitable for writing (no write or write_iter method).
+ *     (2) The file was opened with O_DIRECT and the buffer alignment, offset,
+ *     or count does not meet the filesystem's alignment requirements. (3) The
+ *     count argument, when cast to ssize_t, is negative. (4) For IOCB_NOWAIT
+ *     operations on non-O_DIRECT files that don't support WASYNC.
+ *
+ * error: EAGAIN, Resource temporarily unavailable
+ *   desc: fd refers to a file (pipe, socket, device) that is marked non-blocking
+ *     (O_NONBLOCK) and the write would block because the buffer is full. Also
+ *     returned with IOCB_NOWAIT when data cannot be written immediately.
+ *     Equivalent to EWOULDBLOCK. The application should retry later or use
+ *     select/poll/epoll to wait for writability.
+ *
+ * error: EINTR, Interrupted system call
+ *   desc: The call was interrupted by a signal before any data was written. This
+ *     only occurs if no data has been transferred; if some data was written
+ *     before the signal, the call returns the number of bytes written. The
+ *     caller should typically restart the write.
+ *
+ * error: EPIPE, Broken pipe
+ *   desc: fd refers to a pipe or socket whose reading end has been closed.
+ *     When this condition occurs, the calling process also receives a SIGPIPE
+ *     signal unless MSG_NOSIGNAL is used (for sockets) or IOCB_NOSIGNAL is set.
+ *     If the signal is caught or ignored, EPIPE is still returned.
+ *
+ * error: EFBIG, File too large
+ *   desc: An attempt was made to write a file that exceeds the implementation-
+ *     defined maximum file size or the file size limit (RLIMIT_FSIZE) of the
+ *     process. When RLIMIT_FSIZE is exceeded, the process also receives SIGXFSZ.
+ *     For files not opened with O_LARGEFILE on 32-bit systems, the limit is 2GB.
+ *
+ * error: ENOSPC, No space left on device
+ *   desc: The device containing the file has no room for the data. This can
+ *     occur mid-write resulting in a short write followed by ENOSPC on retry.
+ *
+ * error: EDQUOT, Disk quota exceeded
+ *   desc: The user's quota of disk blocks on the filesystem has been exhausted.
+ *     Like ENOSPC, this can result in a short write.
+ *
+ * error: EIO, Input/output error
+ *   desc: A low-level I/O error occurred while modifying the inode or writing
+ *     data. This typically indicates hardware failure, filesystem corruption,
+ *     or network filesystem timeout. Some data may have been written.
+ *
+ * error: EPERM, Operation not permitted
+ *   desc: The operation was prevented: (1) by a file seal (F_SEAL_WRITE or
+ *     F_SEAL_FUTURE_WRITE on memfd/shmem), (2) writing to an immutable inode
+ *     (IS_IMMUTABLE), (3) by an LSM hook denying the operation, or (4) by a
+ *     fanotify permission event denying the write.
+ *
+ * error: EOVERFLOW, Value too large for defined data type
+ *   desc: The file position plus count would exceed LLONG_MAX. Also returned
+ *     when the offset would exceed filesystem limits after the write.
+ *
+ * error: EDESTADDRREQ, Destination address required
+ *   desc: fd is a datagram socket for which no peer address has been set using
+ *     connect(2). Use sendto(2) to specify the destination address.
+ *
+ * error: ETXTBSY, Text file busy
+ *   desc: The file is being used as a swap file (IS_SWAPFILE).
+ *
+ * error: EXDEV, Cross-device link
+ *   desc: When writing to a pipe that has been configured as a watch queue
+ *     (CONFIG_WATCH_QUEUE), direct write() calls are not supported.
+ *
+ * error: ENOMEM, Out of memory
+ *   desc: Insufficient kernel memory was available for the write operation.
+ *     For pipes, this occurs when allocating pages for the pipe buffer.
+ *
+ * error: ERESTARTSYS, Restart system call (internal)
+ *   desc: Internal error code indicating the syscall should be restarted. This
+ *     is converted to EINTR if SA_RESTART is not set on the signal handler, or
+ *     the syscall is transparently restarted if SA_RESTART is set. User space
+ *     should not see this error code directly.
+ *
+ * error: EACCES, Permission denied
+ *   desc: The security subsystem (LSM such as SELinux or AppArmor) denied the
+ *     write operation via security_file_permission(). This can occur even if
+ *     the file was successfully opened.
+ *
+ * lock: file->f_pos_lock
+ *   type: KAPI_LOCK_MUTEX
+ *   acquired: conditional
+ *   released: true
+ *   desc: For regular files that require atomic position updates (FMODE_ATOMIC_POS),
+ *     the f_pos_lock mutex is acquired by fdget_pos() at syscall entry and released
+ *     by fdput_pos() at syscall exit. This serializes concurrent writes sharing
+ *     the same file description. Not acquired for stream files (FMODE_STREAM like
+ *     pipes and sockets) or when the file is not shared.
+ *
+ * lock: sb->s_writers (freeze protection)
+ *   type: KAPI_LOCK_CUSTOM
+ *   acquired: conditional
+ *   released: true
+ *   desc: For regular files, file_start_write() acquires freeze protection on
+ *     the superblock via sb_start_write() before the write, and file_end_write()
+ *     releases it after. This prevents writes during filesystem freeze. Not
+ *     acquired for non-regular files (pipes, sockets, devices).
+ *
+ * lock: inode->i_rwsem
+ *   type: KAPI_LOCK_RWLOCK
+ *   acquired: conditional
+ *   released: true
+ *   desc: For regular files using generic_file_write_iter(), the inode's i_rwsem
+ *     is acquired in write mode before modifying file data. This is internal to
+ *     the filesystem and released before return. Not all filesystems use this
+ *     pattern.
+ *
+ * lock: pipe->mutex
+ *   type: KAPI_LOCK_MUTEX
+ *   acquired: conditional
+ *   released: true
+ *   desc: For pipes and FIFOs, the pipe's mutex is held while modifying pipe
+ *     buffers. Released temporarily while waiting for space, then reacquired.
+ *
+ * lock: RCU read-side
+ *   type: KAPI_LOCK_RCU
+ *   acquired: conditional
+ *   released: true
+ *   desc: Used during file descriptor lookup via fdget(). RCU read lock protects
+ *     access to the file descriptor table. Released by fdput() at syscall exit.
+ *
+ * signal: SIGPIPE
+ *   direction: KAPI_SIGNAL_SEND
+ *   action: KAPI_SIGNAL_ACTION_TERMINATE
+ *   condition: Writing to a pipe or socket with no readers
+ *   desc: When writing to a pipe whose read end is closed, or a socket whose
+ *     peer has closed, SIGPIPE is sent to the calling process. The default
+ *     action terminates the process. Use signal(SIGPIPE, SIG_IGN) or set
+ *     IOCB_NOSIGNAL/MSG_NOSIGNAL to suppress. EPIPE is returned regardless.
+ *   timing: KAPI_SIGNAL_TIME_DURING
+ *
+ * signal: SIGXFSZ
+ *   direction: KAPI_SIGNAL_SEND
+ *   action: KAPI_SIGNAL_ACTION_COREDUMP
+ *   condition: Writing exceeds RLIMIT_FSIZE
+ *   desc: When a write would exceed the soft file size limit (RLIMIT_FSIZE),
+ *     SIGXFSZ is sent. The default action terminates with a core dump. The
+ *     write returns EFBIG. If RLIMIT_FSIZE is RLIM_INFINITY, no signal is sent.
+ *   timing: KAPI_SIGNAL_TIME_DURING
+ *
+ * signal: Any signal
+ *   direction: KAPI_SIGNAL_RECEIVE
+ *   action: KAPI_SIGNAL_ACTION_RETURN
+ *   condition: While blocked waiting for space (pipes, sockets)
+ *   desc: The syscall may be interrupted by signals while waiting for buffer
+ *     space to become available. If interrupted before any data is written,
+ *     returns -EINTR or -ERESTARTSYS. If data was already written, returns the
+ *     byte count. Restartable if SA_RESTART is set and no data was written.
+ *   error: -EINTR
+ *   timing: KAPI_SIGNAL_TIME_DURING
+ *   restartable: yes
+ *
+ * side-effect: KAPI_EFFECT_FILE_POSITION
+ *   target: file->f_pos
+ *   condition: For seekable files when write succeeds (returns > 0)
+ *   desc: The file offset (f_pos) is advanced by the number of bytes written.
+ *     For files opened with O_APPEND, f_pos is first set to file size. For
+ *     stream files (FMODE_STREAM such as pipes and sockets), the offset is not
+ *     used or modified. Position updates are protected by f_pos_lock when
+ *     shared.
+ *   reversible: no
+ *
+ * side-effect: KAPI_EFFECT_MODIFY_STATE
+ *   target: inode timestamps (mtime, ctime)
+ *   condition: When write succeeds (returns > 0)
+ *   desc: Updates the file's modification time (mtime) and change time (ctime)
+ *     via file_update_time(). The update precision depends on filesystem mount
+ *     options (fine-grained timestamps for multigrain inodes).
+ *   reversible: no
+ *
+ * side-effect: KAPI_EFFECT_MODIFY_STATE
+ *   target: SUID/SGID bits (mode)
+ *   condition: When writing to a setuid/setgid file
+ *   desc: The SUID bit is cleared when a non-root user writes to a file with
+ *     the bit set. The SGID bit may also be cleared. This is a security feature
+ *     to prevent privilege escalation via modified setuid binaries. Done via
+ *     file_remove_privs().
+ *   reversible: no
+ *
+ * side-effect: KAPI_EFFECT_MODIFY_STATE
+ *   target: file data
+ *   condition: When write succeeds (returns > 0)
+ *   desc: Modifies the file's data content. For regular files, data is written
+ *     to the page cache (buffered I/O) or directly to storage (O_DIRECT).
+ *     Data may not be persistent until fsync() is called or the file is closed.
+ *   reversible: no
+ *
+ * side-effect: KAPI_EFFECT_MODIFY_STATE
+ *   target: task I/O accounting
+ *   condition: Always
+ *   desc: Updates the current task's I/O accounting statistics. The wchar field
+ *     (write characters) is incremented by bytes written via add_wchar(). The
+ *     syscw field (syscall write count) is incremented via inc_syscw(). These
+ *     statistics are visible in /proc/[pid]/io.
+ *   reversible: no
+ *
+ * side-effect: KAPI_EFFECT_MODIFY_STATE
+ *   target: fsnotify events
+ *   condition: When write returns > 0
+ *   desc: Generates an FS_MODIFY fsnotify event via fsnotify_modify(), allowing
+ *     inotify, fanotify, and dnotify watchers to be notified of the write.
+ *
+ * capability: CAP_DAC_OVERRIDE
+ *   type: KAPI_CAP_BYPASS_CHECK
+ *   allows: Bypass discretionary access control on write permission
+ *   without: Standard DAC checks are enforced
+ *   condition: Checked via security_file_permission() during rw_verify_area()
+ *
+ * capability: CAP_FOWNER
+ *   type: KAPI_CAP_BYPASS_CHECK
+ *   allows: Bypass ownership checks for SUID/SGID clearing
+ *   without: SUID/SGID bits are cleared on write by non-owner
+ *   condition: Checked during file_remove_privs()
+ *
+ * constraint: MAX_RW_COUNT
+ *   desc: The count parameter is silently clamped to MAX_RW_COUNT (INT_MAX &
+ *     PAGE_MASK, approximately 2GB minus one page) to prevent integer overflow
+ *     in internal calculations. This is transparent to the caller.
+ *   expr: actual_count = min(count, MAX_RW_COUNT)
+ *
+ * constraint: File must be open for writing
+ *   desc: The file descriptor must have been opened with O_WRONLY or O_RDWR.
+ *     Files opened with O_RDONLY or O_PATH cannot be written and return EBADF.
+ *     The file must have both FMODE_WRITE and FMODE_CAN_WRITE flags set.
+ *   expr: (file->f_mode & FMODE_WRITE) && (file->f_mode & FMODE_CAN_WRITE)
+ *
+ * constraint: RLIMIT_FSIZE
+ *   desc: The size of data written is constrained by the RLIMIT_FSIZE resource
+ *     limit. If writing would exceed this limit, SIGXFSZ is sent and EFBIG is
+ *     returned. The limit does not apply to files beyond the limit - only to
+ *     writes that would cross it.
+ *   expr: pos + count <= rlimit(RLIMIT_FSIZE) || rlimit(RLIMIT_FSIZE) == RLIM_INFINITY
+ *
+ * constraint: File seals
+ *   desc: For memfd or shmem files with F_SEAL_WRITE or F_SEAL_FUTURE_WRITE
+ *     seals applied, all write operations fail with EPERM. With F_SEAL_GROW,
+ *     writes that would extend file size fail with EPERM.
+ *
+ * examples: n = write(fd, buf, sizeof(buf));  // Basic write
+ *   n = write(STDOUT_FILENO, msg, strlen(msg));  // Write to stdout
+ *   while (total < len) { n = write(fd, buf+total, len-total); if (n<0) break; total += n; }  // Handle short writes
+ *   if (write(pipefd[1], &byte, 1) < 0 && errno == EPIPE) { handle_broken_pipe(); }  // Pipe error handling
+ *
+ * notes: The behavior of write() varies significantly depending on the type of
+ *   file descriptor:
+ *
+ *   - Regular files: Writes to the page cache (buffered) or directly to storage
+ *     (O_DIRECT). Short writes are rare except near RLIMIT_FSIZE or disk full.
+ *     O_APPEND is atomic for determining write position.
+ *
+ *   - Pipes and FIFOs: Blocking by default. Writes up to PIPE_BUF (4096 bytes
+ *     on Linux) are guaranteed atomic. Larger writes may be interleaved with
+ *     writes from other processes. Blocks if pipe is full; returns EAGAIN with
+ *     O_NONBLOCK. SIGPIPE/EPIPE if no readers.
+ *
+ *   - Sockets: Behavior depends on socket type and protocol. Stream sockets
+ *     (TCP) may return partial writes. Datagram sockets (UDP) typically write
+ *     complete messages or fail. SIGPIPE/EPIPE for broken connections (unless
+ *     MSG_NOSIGNAL). EDESTADDRREQ for unconnected datagram sockets.
+ *
+ *   - Terminals: May block on flow control. Canonical vs raw mode affects
+ *     behavior. Special characters may be interpreted.
+ *
+ *   - Device special files: Behavior is device-specific. Block devices behave
+ *     similarly to regular files. Character device behavior varies.
+ *
+ *   Race condition considerations: Concurrent writes from threads sharing a
+ *   file description race on the file position. Linux 3.14+ provides atomic
+ *   position updates via f_pos_lock for regular files (FMODE_ATOMIC_POS), but
+ *   for maximum safety, use pwrite() for concurrent positioned writes.
+ *
+ *   O_DIRECT writes bypass the page cache and typically require buffer and
+ *   offset alignment to filesystem block size. Query requirements via statx()
+ *   with STATX_DIOALIGN (Linux 6.1+). Unaligned O_DIRECT writes return EINVAL
+ *   on most filesystems.
+ *
+ *   For zero-copy writes, consider using splice(2), sendfile(2), or vmsplice(2)
+ *   instead of copying data through user-space buffers with write().
+ *
+ *   Partial writes (short writes) must be handled by application code.
+ *   Applications should loop until all data is written or an error occurs.
+ *
+ * since-version: 1.0
+ */
 SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
 		size_t, count)
 {
-- 
2.51.0


^ permalink raw reply related

* [RFC PATCH v5 14/15] kernel/api: add API specification for sys_read
From: Sasha Levin @ 2025-12-18 20:42 UTC (permalink / raw)
  To: linux-api; +Cc: linux-doc, linux-kernel, tools, gpaoloni, Sasha Levin
In-Reply-To: <20251218204239.4159453-1-sashal@kernel.org>

Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/read_write.c | 287 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 287 insertions(+)

diff --git a/fs/read_write.c b/fs/read_write.c
index 833bae068770a..422046a666b1d 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -719,6 +719,293 @@ ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count)
 	return ret;
 }
 
+/**
+ * sys_read - Read data from a file descriptor
+ * @fd: File descriptor to read from
+ * @buf: User-space buffer to read data into
+ * @count: Maximum number of bytes to read
+ *
+ * long-desc: Attempts to read up to count bytes from file descriptor fd into
+ *   the buffer starting at buf. For seekable files (regular files, block
+ *   devices), the read begins at the current file offset, and the file offset
+ *   is advanced by the number of bytes read. For non-seekable files (pipes,
+ *   FIFOs, sockets, character devices), the file offset is not used.
+ *
+ *   If count is zero and fd refers to a regular file, read() may detect errors
+ *   as described below. In the absence of errors, or if read() does not check
+ *   for errors, a read() with a count of 0 returns zero and has no other effects.
+ *
+ *   On success, the number of bytes read is returned (zero indicates end of
+ *   file for regular files). It is not an error if this number is smaller than
+ *   the number of bytes requested; this may happen because fewer bytes are
+ *   actually available right now (maybe because we were close to end-of-file,
+ *   or because we are reading from a pipe, socket, or terminal), or because
+ *   read() was interrupted by a signal.
+ *
+ *   On Linux, read() transfers at most MAX_RW_COUNT (0x7ffff000, approximately
+ *   2GB) bytes per call, regardless of whether the filesystem would allow more.
+ *   This is to avoid issues with signed arithmetic overflow on 32-bit systems.
+ *
+ *   POSIX allows reads that are interrupted after reading some data to either
+ *   return -1 (with errno set to EINTR) or return the number of bytes already
+ *   read. Linux follows the latter behavior: if data has been read before a
+ *   signal arrives, the call returns the bytes read rather than failing.
+ *
+ * context-flags: KAPI_CTX_PROCESS | KAPI_CTX_SLEEPABLE
+ *
+ * param: fd
+ *   type: KAPI_TYPE_FD
+ *   flags: KAPI_PARAM_IN
+ *   constraint-type: KAPI_CONSTRAINT_RANGE
+ *   range: 0, INT_MAX
+ *   constraint: Must be a valid, open file descriptor with read permission.
+ *     The file must have been opened with O_RDONLY or O_RDWR. Special values
+ *     like AT_FDCWD are not valid. File descriptors for directories return
+ *     EISDIR. Standard file descriptors 0 (stdin), 1 (stdout), 2 (stderr) are
+ *     valid if open and readable.
+ *
+ * param: buf
+ *   type: KAPI_TYPE_USER_PTR
+ *   flags: KAPI_PARAM_OUT | KAPI_PARAM_USER
+ *   constraint-type: KAPI_CONSTRAINT_CUSTOM
+ *   constraint: Must point to a valid, writable user-space memory region of at
+ *     least count bytes. The buffer is validated via access_ok() before any
+ *     read operation. NULL is invalid and will return EFAULT. The buffer may
+ *     be partially written if an error occurs mid-read. For O_DIRECT reads,
+ *     the buffer may need to be aligned to the filesystem's block size (varies
+ *     by filesystem, check via statx() with STATX_DIOALIGN).
+ *
+ * param: count
+ *   type: KAPI_TYPE_UINT
+ *   flags: KAPI_PARAM_IN
+ *   constraint-type: KAPI_CONSTRAINT_RANGE
+ *   range: 0, SIZE_MAX
+ *   constraint: Maximum number of bytes to read. Clamped internally to
+ *     MAX_RW_COUNT (INT_MAX & PAGE_MASK, approximately 0x7ffff000 bytes) to
+ *     prevent signed overflow issues. A count of 0 returns immediately with 0
+ *     without accessing the file (but may still detect errors). Large values
+ *     are not errors but will be clamped. Cast to ssize_t must not be negative.
+ *
+ * return:
+ *   type: KAPI_TYPE_INT
+ *   check-type: KAPI_RETURN_RANGE
+ *   success: >= 0
+ *   desc: On success, returns the number of bytes read (non-negative). Zero
+ *     indicates end-of-file (EOF) for regular files, or no data available
+ *     from a device that does not block. The return value may be less than
+ *     count if fewer bytes were available (short read). Partial reads are
+ *     not errors. On error, returns a negative error code.
+ *
+ * error: EBADF, Bad file descriptor
+ *   desc: fd is not a valid file descriptor, or fd was not opened for reading.
+ *     This includes file descriptors opened with O_WRONLY, O_PATH, or file
+ *     descriptors that have been closed. Also returned if the file structure
+ *     does not have FMODE_READ set.
+ *
+ * error: EFAULT, Bad address
+ *   desc: buf points outside the accessible address space. The buffer address
+ *     failed access_ok() validation. Can also occur if a fault happens during
+ *     copy_to_user() when transferring data to user space after the read
+ *     completes in kernel space.
+ *
+ * error: EINVAL, Invalid argument
+ *   desc: Returned in several cases: (1) The file descriptor refers to an
+ *     object that is not suitable for reading (no read or read_iter method).
+ *     (2) The file was opened with O_DIRECT and the buffer alignment, offset,
+ *     or count does not meet the filesystem's alignment requirements. (3) For
+ *     timerfd file descriptors, the buffer is smaller than 8 bytes. (4) The
+ *     count argument, when cast to ssize_t, is negative.
+ *
+ * error: EISDIR, Is a directory
+ *   desc: fd refers to a directory. Directories cannot be read using read();
+ *     use getdents64() instead. This error is returned by the generic_read_dir()
+ *     handler installed for directory file operations.
+ *
+ * error: EAGAIN, Resource temporarily unavailable
+ *   desc: fd refers to a file (pipe, socket, device) that is marked non-blocking
+ *     (O_NONBLOCK) and the read would block. Also returned with IOCB_NOWAIT
+ *     when data is not immediately available. Equivalent to EWOULDBLOCK.
+ *     The application should retry the read later or use select/poll/epoll.
+ *
+ * error: EINTR, Interrupted system call
+ *   desc: The call was interrupted by a signal before any data was read. This
+ *     only occurs if no data has been transferred; if some data was read before
+ *     the signal, the call returns the number of bytes read. The caller should
+ *     typically restart the read.
+ *
+ * error: EIO, Input/output error
+ *   desc: A low-level I/O error occurred. For regular files, this typically
+ *     indicates a hardware error on the storage device, a filesystem error,
+ *     or a network filesystem timeout. For terminals, this may indicate the
+ *     controlling terminal has been closed for a background process.
+ *
+ * error: EOVERFLOW, Value too large for defined data type
+ *   desc: The file position plus count would exceed LLONG_MAX. Also returned
+ *     when reading from certain files (e.g., some /proc files) where the file
+ *     position would overflow. For files without FOP_UNSIGNED_OFFSET flag,
+ *     negative file positions are not allowed.
+ *
+ * error: ENOBUFS, No buffer space available
+ *   desc: Returned when reading from pipe-based watch queues (CONFIG_WATCH_QUEUE)
+ *     when the buffer is too small to hold a complete notification, or when
+ *     reading packets from pipes with PIPE_BUF_FLAG_WHOLE set.
+ *
+ * error: ERESTARTSYS, Restart system call (internal)
+ *   desc: Internal error code indicating the syscall should be restarted. This
+ *     is typically translated to EINTR if SA_RESTART is not set on the signal
+ *     handler, or the syscall is transparently restarted if SA_RESTART is set.
+ *     User space should not see this error code directly.
+ *
+ * error: EACCES, Permission denied
+ *   desc: The security subsystem (LSM such as SELinux or AppArmor) denied
+ *     the read operation via security_file_permission(). This can occur even
+ *     if the file was successfully opened, as LSM policies may enforce per-
+ *     operation checks.
+ *
+ * error: EPERM, Operation not permitted
+ *   desc: Returned by fanotify permission events (CONFIG_FANOTIFY_ACCESS_PERMISSIONS)
+ *     when a user-space fanotify listener denies the read operation via
+ *     fsnotify_file_area_perm().
+ *
+ * lock: file->f_pos_lock
+ *   type: KAPI_LOCK_MUTEX
+ *   acquired: conditional
+ *   released: true
+ *   desc: For regular files that require atomic position updates (FMODE_ATOMIC_POS),
+ *     the f_pos_lock mutex is acquired by fdget_pos() at syscall entry and released
+ *     by fdput_pos() at syscall exit. This serializes concurrent reads that share
+ *     the same file description. Not acquired for files opened with FMODE_STREAM
+ *     (pipes, sockets) or when the file is not shared.
+ *
+ * lock: Filesystem-specific locks
+ *   type: KAPI_LOCK_CUSTOM
+ *   acquired: conditional
+ *   released: true
+ *   desc: The filesystem's read_iter or read method may acquire additional locks.
+ *     For regular files, this typically includes the inode's i_rwsem for certain
+ *     operations. For pipes, the pipe->mutex is acquired. For sockets, socket
+ *     lock is acquired. These are internal to the file operation and released
+ *     before return.
+ *
+ * lock: RCU read-side
+ *   type: KAPI_LOCK_RCU
+ *   acquired: conditional
+ *   released: true
+ *   desc: Used during file descriptor lookup via fdget(). RCU read lock protects
+ *     access to the file descriptor table. Released by fdput() at syscall exit.
+ *
+ * signal: Any signal
+ *   direction: KAPI_SIGNAL_RECEIVE
+ *   action: KAPI_SIGNAL_ACTION_RETURN
+ *   condition: When blocked waiting for data on interruptible operations
+ *   desc: The syscall may be interrupted by signals while waiting for data to
+ *     become available (pipes, sockets, terminals) or waiting for locks. If
+ *     interrupted before any data is read, returns -EINTR or -ERESTARTSYS.
+ *     If data has already been read, returns the number of bytes read.
+ *   error: -EINTR
+ *   timing: KAPI_SIGNAL_TIME_DURING
+ *   restartable: yes
+ *
+ * side-effect: KAPI_EFFECT_FILE_POSITION
+ *   target: file->f_pos
+ *   condition: For seekable files when read succeeds (returns > 0)
+ *   desc: The file offset (f_pos) is advanced by the number of bytes read.
+ *     For stream files (FMODE_STREAM such as pipes and sockets), the offset
+ *     is not used or modified. The offset update is protected by f_pos_lock
+ *     when the file is shared between threads/processes.
+ *   reversible: no
+ *
+ * side-effect: KAPI_EFFECT_MODIFY_STATE
+ *   target: inode access time (atime)
+ *   condition: When read succeeds and O_NOATIME is not set
+ *   desc: Updates the file's access time (atime) via touch_atime(). The update
+ *     may be suppressed by mount options (noatime, relatime), the O_NOATIME
+ *     flag, or if the filesystem does not support atime. Relatime only updates
+ *     atime if it is older than mtime or ctime, or more than a day old.
+ *   reversible: no
+ *
+ * side-effect: KAPI_EFFECT_MODIFY_STATE
+ *   target: task I/O accounting
+ *   condition: Always
+ *   desc: Updates the current task's I/O accounting statistics. The rchar field
+ *     (read characters) is incremented by bytes read via add_rchar(). The syscr
+ *     field (syscall read count) is incremented via inc_syscr(). These statistics
+ *     are visible in /proc/[pid]/io. Updated regardless of success or failure.
+ *   reversible: no
+ *
+ * side-effect: KAPI_EFFECT_MODIFY_STATE
+ *   target: fsnotify events
+ *   condition: When read returns > 0
+ *   desc: Generates an FS_ACCESS fsnotify event via fsnotify_access() allowing
+ *     inotify, fanotify, and dnotify watchers to be notified of the read. This
+ *     occurs after data transfer completes successfully.
+ *   reversible: no
+ *
+ * capability: CAP_DAC_OVERRIDE
+ *   type: KAPI_CAP_BYPASS_CHECK
+ *   allows: Bypass discretionary access control on read permission
+ *   without: Standard DAC checks are enforced
+ *   condition: Checked via security_file_permission() during rw_verify_area()
+ *
+ * capability: CAP_DAC_READ_SEARCH
+ *   type: KAPI_CAP_BYPASS_CHECK
+ *   allows: Bypass read permission checks on regular files
+ *   without: Must have read permission on file
+ *   condition: Checked by LSM hooks during the read operation
+ *
+ * constraint: MAX_RW_COUNT
+ *   desc: The count parameter is silently clamped to MAX_RW_COUNT (INT_MAX &
+ *     PAGE_MASK, approximately 2GB minus one page) to prevent integer overflow
+ *     in internal calculations. This is transparent to the caller; the syscall
+ *     succeeds but reads at most MAX_RW_COUNT bytes.
+ *   expr: actual_count = min(count, MAX_RW_COUNT)
+ *
+ * constraint: File must be open for reading
+ *   desc: The file descriptor must have been opened with O_RDONLY or O_RDWR.
+ *     Files opened with O_WRONLY or O_PATH cannot be read and return EBADF.
+ *     The file must have both FMODE_READ and FMODE_CAN_READ flags set.
+ *   expr: (file->f_mode & FMODE_READ) && (file->f_mode & FMODE_CAN_READ)
+ *
+ * examples: n = read(fd, buf, sizeof(buf));  // Basic read
+ *   n = read(STDIN_FILENO, buf, 1024);  // Read from stdin
+ *   while ((n = read(fd, buf, 4096)) > 0) { process(buf, n); }  // Read loop
+ *   if (read(fd, buf, count) == 0) { handle_eof(); }  // Check for EOF
+ *
+ * notes: The behavior of read() varies significantly depending on the type of
+ *   file descriptor:
+ *
+ *   - Regular files: Reads from current position, advances position, returns 0
+ *     at EOF. Short reads are rare but possible near EOF or on signal.
+ *
+ *   - Pipes and FIFOs: Blocking by default. Returns available data (up to count)
+ *     or blocks until data is available. Returns 0 when all writers have closed.
+ *     O_NONBLOCK returns EAGAIN when empty instead of blocking.
+ *
+ *   - Sockets: Similar to pipes. Specific behavior depends on socket type and
+ *     protocol. MSG_* flags can be specified via recv() for more control.
+ *
+ *   - Terminals: Line-buffered in canonical mode; read returns when newline is
+ *     entered or buffer is full. Raw mode returns immediately when data available.
+ *     Special handling for signals (SIGINT on Ctrl+C, etc.).
+ *
+ *   - Device special files: Behavior is device-specific. Some devices support
+ *     seeking, others do not. Read size may be constrained by device.
+ *
+ *   Race condition: Concurrent reads from the same file description (not just
+ *   file descriptor) can race on the file position. Linux 3.14+ provides atomic
+ *   position updates for regular files via f_pos_lock, but applications should
+ *   use pread() for concurrent positioned reads.
+ *
+ *   O_DIRECT reads bypass the page cache and typically require aligned buffers
+ *   and positions. Alignment requirements are filesystem-specific; use statx()
+ *   with STATX_DIOALIGN (Linux 6.1+) to query. Unaligned O_DIRECT reads fail
+ *   with EINVAL on most filesystems.
+ *
+ *   For splice(2)-like zero-copy reads, consider using splice(), sendfile(),
+ *   or copy_file_range() instead of read() + write().
+ *
+ * since-version: 1.0
+ */
 SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
 {
 	return ksys_read(fd, buf, count);
-- 
2.51.0


^ permalink raw reply related

* [RFC PATCH v5 13/15] kernel/api: add API specification for sys_close
From: Sasha Levin @ 2025-12-18 20:42 UTC (permalink / raw)
  To: linux-api; +Cc: linux-doc, linux-kernel, tools, gpaoloni, Sasha Levin
In-Reply-To: <20251218204239.4159453-1-sashal@kernel.org>

Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/open.c | 247 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 243 insertions(+), 4 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index 343e6d3798ec3..26d8ee8336405 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1868,10 +1868,249 @@ int filp_close(struct file *filp, fl_owner_t id)
 }
 EXPORT_SYMBOL(filp_close);
 
-/*
- * Careful here! We test whether the file pointer is NULL before
- * releasing the fd. This ensures that one clone task can't release
- * an fd while another clone is opening it.
+/**
+ * sys_close - Close a file descriptor
+ * @fd: The file descriptor to close
+ *
+ * long-desc: Terminates access to an open file descriptor, releasing the file
+ *   descriptor for reuse by subsequent open(), dup(), or similar syscalls. Any
+ *   advisory record locks (POSIX locks, OFD locks, and flock locks) held on the
+ *   associated file are released. When this is the last file descriptor
+ *   referring to the underlying open file description, associated resources are
+ *   freed. If the file was previously unlinked, the file itself is deleted when
+ *   the last reference is closed.
+ *
+ *   CRITICAL: The file descriptor is ALWAYS closed, even when close() returns
+ *   an error. This differs from POSIX semantics where the state of the file
+ *   descriptor is unspecified after EINTR. On Linux, the fd is released early
+ *   in close() processing before flush operations that may fail. Therefore,
+ *   retrying close() after an error return is DANGEROUS and may close an
+ *   unrelated file descriptor that was assigned to another thread.
+ *
+ *   Errors returned from close() (EIO, ENOSPC, EDQUOT) indicate that the final
+ *   flush of buffered data failed. These errors commonly occur on network
+ *   filesystems like NFS when write errors are deferred to close time. A
+ *   successful return from close() does NOT guarantee that data has been
+ *   successfully written to disk; the kernel uses buffer cache to defer writes.
+ *   To ensure data persistence, call fsync() before close().
+ *
+ *   On close, the following cleanup operations are performed: POSIX advisory
+ *   locks are removed, dnotify registrations are cleaned up, the file is
+ *   flushed if the file operations define a flush callback, and the file
+ *   reference is released. If this was the last reference, additional cleanup
+ *   includes: fsnotify close notification, epoll cleanup, flock and lease
+ *   removal, FASYNC cleanup, the file's release callback invocation, and
+ *   the file structure deallocation.
+ *
+ * context-flags: KAPI_CTX_PROCESS | KAPI_CTX_SLEEPABLE
+ *
+ * param: fd
+ *   type: KAPI_TYPE_FD
+ *   flags: KAPI_PARAM_IN
+ *   constraint-type: KAPI_CONSTRAINT_RANGE
+ *   range: 0, INT_MAX
+ *   constraint: Must be a valid, open file descriptor for the current process.
+ *     The value 0, 1, or 2 (stdin, stdout, stderr) may be closed like any other
+ *     fd, though this is unusual and may cause issues with libraries that assume
+ *     these descriptors are valid. The parameter is unsigned int to match kernel
+ *     file descriptor table indexing, but values exceeding INT_MAX are effectively
+ *     invalid due to internal checks.
+ *
+ * return:
+ *   type: KAPI_TYPE_INT
+ *   check-type: KAPI_RETURN_EXACT
+ *   success: 0
+ *   desc: Returns 0 on success. On error, returns a negative error code.
+ *     IMPORTANT: Even when an error is returned, the file descriptor is still
+ *     closed and must not be used again. The error indicates a problem with
+ *     the final flush operation, not that the fd remains open.
+ *
+ * error: EBADF, Bad file descriptor
+ *   desc: The file descriptor fd is not a valid open file descriptor, or was
+ *     already closed. This is the only error that indicates the fd was NOT
+ *     closed (because it was never open to begin with). Occurs when fd is out
+ *     of range, has no file assigned, or was already closed.
+ *
+ * error: EINTR, Interrupted system call
+ *   desc: The flush operation was interrupted by a signal before completion.
+ *     This occurs when a file's flush callback (e.g., NFS) performs an
+ *     interruptible wait that receives a signal. IMPORTANT: Despite this error,
+ *     the file descriptor IS closed and must not be used again. This error
+ *     is generated by converting kernel-internal restart codes (ERESTARTSYS,
+ *     ERESTARTNOINTR, ERESTARTNOHAND, ERESTART_RESTARTBLOCK) to EINTR because
+ *     restarting the syscall would be incorrect once the fd is freed.
+ *
+ * error: EIO, I/O error
+ *   desc: An I/O error occurred during the flush of buffered data to the
+ *     underlying storage. This typically indicates a hardware error, network
+ *     failure on NFS, or other storage system error. The file descriptor is
+ *     still closed. Previously buffered write data may have been lost.
+ *
+ * error: ENOSPC, No space left on device
+ *   desc: There was insufficient space on the storage device to flush buffered
+ *     writes. This is common on NFS when the server runs out of space between
+ *     write() and close(). The file descriptor is still closed.
+ *
+ * error: EDQUOT, Disk quota exceeded
+ *   desc: The user's disk quota was exceeded while attempting to flush buffered
+ *     writes. Common on NFS when quota is exceeded between write() and close().
+ *     The file descriptor is still closed.
+ *
+ * lock: files->file_lock
+ *   type: KAPI_LOCK_SPINLOCK
+ *   acquired: true
+ *   released: true
+ *   desc: Acquired via file_close_fd() to atomically lookup and remove the fd
+ *     from the file descriptor table. Held only during the table manipulation;
+ *     released before flush and final cleanup operations. This ensures that
+ *     another thread cannot allocate the same fd number while close is in
+ *     progress.
+ *
+ * lock: file->f_lock
+ *   type: KAPI_LOCK_SPINLOCK
+ *   acquired: true
+ *   released: true
+ *   desc: Acquired during epoll cleanup (eventpoll_release_file) and dnotify
+ *     cleanup to safely unlink the file from monitoring structures. May also
+ *     be acquired during lock context operations.
+ *
+ * lock: ep->mtx
+ *   type: KAPI_LOCK_MUTEX
+ *   acquired: true
+ *   released: true
+ *   desc: Acquired during epoll cleanup if the file was monitored by epoll.
+ *     Used to safely remove the file from epoll interest lists.
+ *
+ * lock: flc_lock
+ *   type: KAPI_LOCK_SPINLOCK
+ *   acquired: true
+ *   released: true
+ *   desc: File lock context spinlock, acquired during locks_remove_file() to
+ *     safely remove POSIX, flock, and lease locks associated with the file.
+ *
+ * signal: pending_signals
+ *   direction: KAPI_SIGNAL_RECEIVE
+ *   action: KAPI_SIGNAL_ACTION_RETURN
+ *   condition: When flush callback performs interruptible wait
+ *   desc: If the file's flush callback (e.g., nfs_file_flush) performs an
+ *     interruptible wait and a signal is pending, the wait is interrupted.
+ *     Any kernel restart codes are converted to EINTR since close cannot be
+ *     restarted after the fd is freed.
+ *   error: -EINTR
+ *   timing: KAPI_SIGNAL_TIME_DURING
+ *   restartable: no
+ *
+ * side-effect: KAPI_EFFECT_RESOURCE_DESTROY | KAPI_EFFECT_IRREVERSIBLE
+ *   target: File descriptor table entry
+ *   desc: The file descriptor is removed from the process's file descriptor
+ *     table, making the fd number available for reuse by subsequent open(),
+ *     dup(), or similar calls. This occurs BEFORE any flush or cleanup that
+ *     might fail, making the operation irreversible regardless of return value.
+ *   condition: Always (when fd is valid)
+ *   reversible: no
+ *
+ * side-effect: KAPI_EFFECT_LOCK_RELEASE
+ *   target: POSIX advisory locks, OFD locks, flock locks
+ *   desc: All advisory locks held on the file by this process are removed.
+ *     POSIX locks are removed via locks_remove_posix() during filp_flush().
+ *     All lock types (POSIX, OFD, flock) are removed via locks_remove_file()
+ *     during __fput() when this is the last reference.
+ *   condition: File has FMODE_OPENED and !(FMODE_PATH)
+ *   reversible: no
+ *
+ * side-effect: KAPI_EFFECT_RESOURCE_DESTROY
+ *   target: File leases
+ *   desc: Any file leases held on the file are removed during locks_remove_file()
+ *     when this is the last reference to the open file description.
+ *   condition: File had leases and this is the last close
+ *   reversible: no
+ *
+ * side-effect: KAPI_EFFECT_MODIFY_STATE
+ *   target: dnotify registrations
+ *   desc: Directory notification (dnotify) registrations associated with this
+ *     file are cleaned up via dnotify_flush(). This only applies to directories.
+ *   condition: File is a directory with dnotify registrations
+ *   reversible: no
+ *
+ * side-effect: KAPI_EFFECT_MODIFY_STATE
+ *   target: epoll interest lists
+ *   desc: If the file was being monitored by epoll instances, it is removed
+ *     from those interest lists via eventpoll_release().
+ *   condition: File was added to epoll instances
+ *   reversible: no
+ *
+ * side-effect: KAPI_EFFECT_FILESYSTEM
+ *   target: Buffered data
+ *   desc: The file's flush callback is invoked if defined (e.g., NFS calls
+ *     nfs_file_flush). This attempts to write any buffered data to storage
+ *     and may return errors (EIO, ENOSPC, EDQUOT) if the flush fails. The
+ *     success of this flush is NOT guaranteed even with a 0 return; use
+ *     fsync() before close() to ensure data persistence.
+ *   condition: File has a flush callback and was opened for writing
+ *   reversible: no
+ *
+ * side-effect: KAPI_EFFECT_FREE_MEMORY
+ *   target: struct file and related structures
+ *   desc: When this is the last reference to the file, __fput() is called
+ *     synchronously (fput_close_sync), which frees the file structure, releases
+ *     the dentry and mount references, and invokes the file's release callback.
+ *   condition: This is the last reference to the file
+ *   reversible: no
+ *
+ * side-effect: KAPI_EFFECT_FILESYSTEM
+ *   target: Unlinked file deletion
+ *   desc: If the file was previously unlinked (deleted) but kept open, closing
+ *     the last reference causes the actual file data to be removed from the
+ *     filesystem and the inode to be freed.
+ *   condition: File was unlinked and this is the last reference
+ *   reversible: no
+ *
+ * state-trans: file_descriptor
+ *   from: open
+ *   to: closed/free
+ *   condition: Valid fd passed to close
+ *   desc: The file descriptor transitions from open (usable) to closed (invalid).
+ *     The fd number becomes available for reuse. This transition occurs early
+ *     in close() processing, before any operations that might fail.
+ *
+ * state-trans: file_reference_count
+ *   from: n
+ *   to: n-1 (or freed if n was 1)
+ *   condition: Always on successful fd lookup
+ *   desc: The file's reference count is decremented. If this was the last
+ *     reference, the file is fully cleaned up and freed.
+ *
+ * constraint: File Descriptor Reuse Race
+ *   desc: Because the fd is freed early in close() processing, another thread
+ *     may receive the same fd number from a concurrent open() before close()
+ *     returns. Applications must not retry close() after an error return, as
+ *     this could close an unrelated file opened by another thread.
+ *   expr: After close(fd) returns (even with error), fd is invalid
+ *
+ * examples: close(fd);  // Basic usage - ignore errors (common but not ideal)
+ *   if (close(fd) == -1) perror("close");  // Log errors for debugging
+ *   fsync(fd); close(fd);  // Ensure data persistence before closing
+ *
+ * notes: This syscall has subtle non-POSIX semantics: the fd is ALWAYS closed
+ *   regardless of the return value. POSIX specifies that on EINTR, the state
+ *   of the fd is unspecified, but Linux always closes it. HP-UX requires
+ *   retrying close() on EINTR, but doing so on Linux may close an unrelated
+ *   fd that was reassigned by another thread. For portable code, the safest
+ *   approach is to check for errors but never retry close().
+ *
+ *   Error codes from the flush callback (EIO, ENOSPC, EDQUOT) indicate that
+ *   previously written data may have been lost. These errors are particularly
+ *   common on NFS where write errors are often deferred to close time.
+ *
+ *   The driver's release() callback errors are explicitly ignored by the
+ *   kernel, so device driver cleanup errors are not propagated to userspace.
+ *
+ *   Calling close() on a file descriptor while another thread is using it
+ *   (e.g., in a blocking read() or write()) has implementation-defined
+ *   behavior. On Linux, the blocked operation continues on the underlying
+ *   file and may complete even after close() returns.
+ *
+ * since-version: 1.0
  */
 SYSCALL_DEFINE1(close, unsigned int, fd)
 {
-- 
2.51.0


^ permalink raw reply related

* [RFC PATCH v5 12/15] kernel/api: add API specification for sys_open
From: Sasha Levin @ 2025-12-18 20:42 UTC (permalink / raw)
  To: linux-api; +Cc: linux-doc, linux-kernel, tools, gpaoloni, Sasha Levin
In-Reply-To: <20251218204239.4159453-1-sashal@kernel.org>

Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/open.c | 318 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 318 insertions(+)

diff --git a/fs/open.c b/fs/open.c
index f328622061c56..343e6d3798ec3 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1437,6 +1437,324 @@ int do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode)
 }
 
 
+/**
+ * sys_open - Open or create a file
+ * @filename: Pathname of the file to open or create
+ * @flags: File access mode and behavior flags (O_RDONLY, O_WRONLY, O_RDWR, etc.)
+ * @mode: File permission bits for newly created files (only with O_CREAT/O_TMPFILE)
+ *
+ * long-desc: Opens the file specified by pathname. If O_CREAT or O_TMPFILE is
+ *   specified in flags, the file is created if it does not exist; its mode is
+ *   set according to the mode parameter modified by the process's umask.
+ *
+ *   The flags argument must include one of the following access modes: O_RDONLY
+ *   (read-only), O_WRONLY (write-only), or O_RDWR (read/write). These are the
+ *   low-order two bits of flags. In addition, zero or more file creation and
+ *   file status flags can be bitwise-ORed in flags.
+ *
+ *   File creation flags: O_CREAT, O_EXCL, O_NOCTTY, O_TRUNC, O_DIRECTORY,
+ *   O_NOFOLLOW, O_CLOEXEC, O_TMPFILE. These flags affect open behavior.
+ *
+ *   File status flags: O_APPEND, O_ASYNC, O_DIRECT, O_DSYNC, O_LARGEFILE,
+ *   O_NOATIME, O_NONBLOCK (O_NDELAY), O_PATH, O_SYNC. These become part of the
+ *   file's open file description and can be retrieved/modified with fcntl().
+ *
+ *   The return value is a file descriptor, a small nonnegative integer used in
+ *   subsequent system calls (read, write, lseek, fcntl, etc.) to refer to the
+ *   open file. The file descriptor returned by a successful open is the lowest-
+ *   numbered file descriptor not currently open for the process.
+ *
+ *   On 64-bit systems, O_LARGEFILE is automatically added to the flags. On 32-bit
+ *   systems, files larger than 2GB require O_LARGEFILE to be explicitly set.
+ *
+ *   This syscall is a legacy interface. Modern code should prefer openat() for
+ *   relative path operations and openat2() for additional control via resolve
+ *   flags. The open() call is equivalent to openat(AT_FDCWD, pathname, flags).
+ *
+ * context-flags: KAPI_CTX_PROCESS | KAPI_CTX_SLEEPABLE
+ *
+ * param: filename
+ *   type: KAPI_TYPE_PATH
+ *   flags: KAPI_PARAM_IN | KAPI_PARAM_USER
+ *   constraint-type: KAPI_CONSTRAINT_USER_PATH
+ *   constraint: Must be a valid null-terminated path string in user memory.
+ *     Maximum path length is PATH_MAX (4096 bytes) including null terminator.
+ *     For relative paths, resolution starts from current working directory.
+ *     The path is followed (symlinks resolved) unless O_NOFOLLOW is specified.
+ *
+ * param: flags
+ *   type: KAPI_TYPE_INT
+ *   flags: KAPI_PARAM_IN
+ *   constraint-type: KAPI_CONSTRAINT_MASK
+ *   valid-mask: O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY |
+ *               O_TRUNC | O_APPEND | O_NONBLOCK | O_DSYNC | O_SYNC | FASYNC |
+ *               O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | O_NOATIME |
+ *               O_CLOEXEC | O_PATH | O_TMPFILE
+ *   constraint: Must include exactly one of O_RDONLY (0), O_WRONLY (1), or
+ *     O_RDWR (2) as the access mode. Additional flags may be ORed. Invalid flag
+ *     combinations (e.g., O_DIRECTORY|O_CREAT, O_PATH with incompatible flags,
+ *     O_TMPFILE without O_DIRECTORY, O_TMPFILE with read-only mode) return
+ *     EINVAL. Unknown flags are silently ignored for backward compatibility
+ *     (unlike openat2 which rejects them).
+ *
+ * param: mode
+ *   type: KAPI_TYPE_UINT
+ *   flags: KAPI_PARAM_IN
+ *   constraint-type: KAPI_CONSTRAINT_MASK
+ *   valid-mask: S_ISUID | S_ISGID | S_ISVTX | S_IRWXU | S_IRWXG | S_IRWXO
+ *   constraint: Only meaningful when O_CREAT or O_TMPFILE is specified in
+ *     flags. Specifies the file mode bits (permissions and setuid/setgid/sticky
+ *     bits) for a newly created file. The effective mode is (mode & ~umask).
+ *     When O_CREAT/O_TMPFILE is not set, mode is ignored. Mode values exceeding
+ *     S_IALLUGO (07777) are masked off.
+ *
+ * return:
+ *   type: KAPI_TYPE_INT
+ *   check-type: KAPI_RETURN_FD
+ *   success: >= 0
+ *   desc: On success, returns a new file descriptor (non-negative integer).
+ *     The returned file descriptor is the lowest-numbered descriptor not
+ *     currently open for the process. On error, returns -1 and errno is set.
+ *
+ * error: EACCES, Permission denied
+ *   desc: The requested access to the file is not allowed, or search permission
+ *     is denied for one of the directories in the path prefix of pathname, or
+ *     the file did not exist yet and write access to the parent directory is
+ *     not allowed, or O_TRUNC is specified but write permission is denied, or
+ *     the file is on a filesystem mounted with noexec and MAY_EXEC was implied.
+ *
+ * error: EBUSY, Device or resource busy
+ *   desc: O_EXCL was specified in flags and pathname refers to a block device
+ *     that is in use by the system (e.g., it is mounted).
+ *
+ * error: EDQUOT, Disk quota exceeded
+ *   desc: O_CREAT is specified and the file does not exist, and the user's quota
+ *     of disk blocks or inodes on the filesystem has been exhausted.
+ *
+ * error: EEXIST, File exists
+ *   desc: O_CREAT and O_EXCL were specified in flags, but pathname already exists.
+ *     This error is atomic with respect to file creation - it prevents race
+ *     conditions (TOCTOU) when creating files.
+ *
+ * error: EFAULT, Bad address
+ *   desc: pathname points outside the process's accessible address space.
+ *
+ * error: EINTR, Interrupted system call
+ *   desc: The call was interrupted by a signal handler before completing file
+ *     open. This can occur during lock acquisition or when breaking leases.
+ *
+ * error: EINVAL, Invalid argument
+ *   desc: Returned for several conditions: (1) Invalid O_* flag combinations
+ *     (O_DIRECTORY|O_CREAT, O_TMPFILE without O_DIRECTORY, O_TMPFILE with
+ *     read-only access, O_PATH with flags other than O_DIRECTORY|O_NOFOLLOW|
+ *     O_CLOEXEC). (2) mode contains bits outside S_IALLUGO when O_CREAT/O_TMPFILE
+ *     is set (openat2 only). (3) O_DIRECT requested but filesystem doesn't
+ *     support it. (4) The filesystem does not support O_SYNC or O_DSYNC.
+ *
+ * error: EISDIR, Is a directory
+ *   desc: pathname refers to a directory and the access requested involved
+ *     writing (O_WRONLY, O_RDWR, or O_TRUNC). Also returned when O_TMPFILE is
+ *     used on a directory that doesn't support tmpfile operations.
+ *
+ * error: ELOOP, Too many symbolic links
+ *   desc: Too many symbolic links were encountered in resolving pathname, or
+ *     O_NOFOLLOW was specified but pathname refers to a symbolic link.
+ *
+ * error: EMFILE, Too many open files
+ *   desc: The per-process limit on the number of open file descriptors has been
+ *     reached. This limit is RLIMIT_NOFILE (default typically 1024, max set by
+ *     /proc/sys/fs/nr_open).
+ *
+ * error: ENAMETOOLONG, File name too long
+ *   desc: pathname was too long, exceeding PATH_MAX (4096) bytes, or a single
+ *     path component exceeded NAME_MAX (usually 255) bytes.
+ *
+ * error: ENFILE, Too many open files in system
+ *   desc: The system-wide limit on the total number of open files has been
+ *     reached (/proc/sys/fs/file-max). Processes with CAP_SYS_ADMIN can exceed
+ *     this limit.
+ *
+ * error: ENODEV, No such device
+ *   desc: pathname refers to a special file that has no corresponding device, or
+ *     the file's inode has no file operations assigned.
+ *
+ * error: ENOENT, No such file or directory
+ *   desc: A directory component in pathname does not exist or is a dangling
+ *     symbolic link, or O_CREAT is not set and the named file does not exist,
+ *     or pathname is an empty string (unless AT_EMPTY_PATH is used with openat2).
+ *
+ * error: ENOMEM, Out of memory
+ *   desc: The kernel could not allocate sufficient memory for the file structure,
+ *     path lookup structures, or the filename buffer.
+ *
+ * error: ENOSPC, No space left on device
+ *   desc: O_CREAT was specified and the file does not exist, and the directory
+ *     or filesystem containing the file has no room for a new file entry.
+ *
+ * error: ENOTDIR, Not a directory
+ *   desc: A component used as a directory in pathname is not actually a directory,
+ *     or O_DIRECTORY was specified and pathname was not a directory.
+ *
+ * error: ENXIO, No such device or address
+ *   desc: O_NONBLOCK | O_WRONLY is set and the named file is a FIFO and no
+ *     process has the FIFO open for reading. Also returned when opening a device
+ *     special file that does not exist.
+ *
+ * error: EOPNOTSUPP, Operation not supported
+ *   desc: The filesystem containing pathname does not support O_TMPFILE.
+ *
+ * error: EOVERFLOW, Value too large for defined data type
+ *   desc: pathname refers to a regular file that is too large to be opened.
+ *     This occurs on 32-bit systems without O_LARGEFILE when the file size
+ *     exceeds 2GB (2^31 - 1 bytes).
+ *
+ * error: EPERM, Operation not permitted
+ *   desc: O_NOATIME flag was specified but the effective UID of the caller did
+ *     not match the owner of the file and the caller is not privileged, or the
+ *     file is append-only and O_TRUNC was specified or write mode without
+ *     O_APPEND, or the file is immutable, or a seal prevents the operation.
+ *
+ * error: EROFS, Read-only file system
+ *   desc: pathname refers to a file on a read-only filesystem and write access
+ *     was requested.
+ *
+ * error: ETXTBSY, Text file busy
+ *   desc: pathname refers to an executable image which is currently being
+ *     executed, or to a swap file, and write access or truncation was requested.
+ *
+ * error: EWOULDBLOCK, Resource temporarily unavailable
+ *   desc: O_NONBLOCK was specified and an incompatible lease is held on the file.
+ *
+ * lock: files->file_lock
+ *   type: KAPI_LOCK_SPINLOCK
+ *   acquired: true
+ *   released: true
+ *   desc: Acquired when allocating a file descriptor slot. Held briefly during
+ *     fd allocation via alloc_fd() and released before the syscall returns.
+ *
+ * lock: inode->i_rwsem (parent directory)
+ *   type: KAPI_LOCK_RWLOCK
+ *   acquired: conditional
+ *   released: true
+ *   desc: Write lock acquired on parent directory inode when creating a new file
+ *     (O_CREAT). Acquired via inode_lock_nested() in lookup path. May use
+ *     killable variant which can return EINTR on fatal signal.
+ *
+ * lock: RCU read-side
+ *   type: KAPI_LOCK_RCU
+ *   acquired: true
+ *   released: true
+ *   desc: Path lookup uses RCU mode initially for performance. If RCU lookup
+ *     fails (returns -ECHILD), falls back to reference-based lookup.
+ *
+ * signal: Any signal
+ *   direction: KAPI_SIGNAL_RECEIVE
+ *   action: KAPI_SIGNAL_ACTION_RETURN
+ *   condition: When blocked on interruptible or killable operations
+ *   desc: The syscall may be interrupted during path lookup, lock acquisition,
+ *     or lease breaking. Fatal signals (SIGKILL, etc.) will interrupt killable
+ *     operations. Non-fatal signals may interrupt interruptible operations.
+ *   error: -EINTR
+ *   timing: KAPI_SIGNAL_TIME_DURING
+ *   restartable: yes
+ *
+ * side-effect: KAPI_EFFECT_RESOURCE_CREATE | KAPI_EFFECT_ALLOC_MEMORY
+ *   target: file descriptor, file structure, dentry cache
+ *   desc: Allocates a new file descriptor in the process's fd table. Allocates
+ *     a struct file from the filp slab cache. May allocate dentries and inodes
+ *     during path lookup. System-wide file count (nr_files) is incremented.
+ *   reversible: yes
+ *
+ * side-effect: KAPI_EFFECT_FILESYSTEM
+ *   target: filesystem, inode
+ *   condition: When O_CREAT is specified and file doesn't exist
+ *   desc: Creates a new file on the filesystem. Creates new inode, allocates
+ *     data blocks as needed, and creates directory entry. Updates parent
+ *     directory mtime and ctime.
+ *   reversible: no
+ *
+ * side-effect: KAPI_EFFECT_FILESYSTEM
+ *   target: file content
+ *   condition: When O_TRUNC is specified for existing file
+ *   desc: Truncates the file to zero length, releasing data blocks. Updates
+ *     file mtime and ctime. May trigger notifications to lease holders.
+ *   reversible: no
+ *
+ * side-effect: KAPI_EFFECT_MODIFY_STATE
+ *   target: inode timestamps
+ *   condition: Unless O_NOATIME is specified
+ *   desc: Opens for reading may update inode access time (atime) unless mounted
+ *     with noatime/relatime or O_NOATIME is specified. Opens for writing that
+ *     truncate or create update mtime and ctime.
+ *
+ * capability: CAP_DAC_OVERRIDE
+ *   type: KAPI_CAP_BYPASS_CHECK
+ *   allows: Bypass file read, write, and execute permission checks
+ *   without: Standard DAC (discretionary access control) checks are applied
+ *   condition: Checked when file permission would otherwise deny access
+ *
+ * capability: CAP_DAC_READ_SEARCH
+ *   type: KAPI_CAP_BYPASS_CHECK
+ *   allows: Bypass read permission on files and search permission on directories
+ *   without: Must have read permission on file or search permission on directory
+ *   condition: Checked during path traversal and file open
+ *
+ * capability: CAP_FOWNER
+ *   type: KAPI_CAP_BYPASS_CHECK
+ *   allows: Use O_NOATIME on files not owned by caller
+ *   without: O_NOATIME returns EPERM if caller is not file owner
+ *   condition: Checked when O_NOATIME is specified and caller is not owner
+ *
+ * capability: CAP_SYS_ADMIN
+ *   type: KAPI_CAP_INCREASE_LIMIT
+ *   allows: Exceed the system-wide file limit (file-max)
+ *   without: Returns ENFILE when system limit is reached
+ *   condition: Checked in alloc_empty_file() when nr_files >= max_files
+ *
+ * constraint: RLIMIT_NOFILE (per-process fd limit)
+ *   desc: The returned file descriptor must be less than the process's
+ *     RLIMIT_NOFILE limit. Default is typically 1024, maximum is controlled
+ *     by /proc/sys/fs/nr_open (default 1048576). Exceeding returns EMFILE.
+ *   expr: fd < rlimit(RLIMIT_NOFILE)
+ *
+ * constraint: file-max (system-wide limit)
+ *   desc: System-wide limit on open files in /proc/sys/fs/file-max. Processes
+ *     without CAP_SYS_ADMIN receive ENFILE when this limit is reached. The
+ *     limit is computed based on system memory at boot time.
+ *   expr: nr_files < files_stat.max_files || capable(CAP_SYS_ADMIN)
+ *
+ * constraint: PATH_MAX
+ *   desc: Maximum length of pathname including null terminator is PATH_MAX
+ *     (4096 bytes). Individual path components must not exceed NAME_MAX (255).
+ *
+ * examples: fd = open("/etc/passwd", O_RDONLY);  // Read existing file
+ *   fd = open("/tmp/newfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);  // Create/truncate
+ *   fd = open("/tmp/lockfile", O_WRONLY | O_CREAT | O_EXCL, 0600);  // Exclusive create
+ *   fd = open("/dev/null", O_RDWR);  // Open device
+ *   fd = open("/tmp", O_RDONLY | O_DIRECTORY);  // Open directory
+ *   fd = open("/tmp", O_TMPFILE | O_RDWR, 0600);  // Anonymous temp file
+ *
+ * notes: The distinction between O_RDONLY, O_WRONLY, and O_RDWR is critical.
+ *   O_RDONLY is defined as 0, so (flags & O_RDONLY) will be true for all flags.
+ *   Test access mode using (flags & O_ACCMODE) == O_RDONLY.
+ *
+ *   When O_CREAT is specified without O_EXCL, there is a race condition between
+ *   testing for file existence and creating it. Use O_CREAT | O_EXCL for atomic
+ *   exclusive file creation.
+ *
+ *   O_CLOEXEC should be used in multithreaded programs to prevent file descriptor
+ *   leaks to child processes between fork() and execve().
+ *
+ *   O_DIRECT has alignment requirements that vary by filesystem. Use statx()
+ *   with STATX_DIOALIGN (Linux 6.1+) to query requirements. Unaligned I/O may
+ *   fail with EINVAL or fall back to buffered I/O.
+ *
+ *   O_PATH opens a file descriptor that can be used only for certain operations
+ *   (fstat, dup, fcntl, close, fchdir on directories, as dirfd for *at() calls).
+ *   I/O operations will fail with EBADF.
+ *
+ * since-version: 1.0
+ */
 SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, umode_t, mode)
 {
 	if (force_o_largefile())
-- 
2.51.0


^ permalink raw reply related

* [RFC PATCH v5 11/15] kernel/api: add API specification for fsetxattr
From: Sasha Levin @ 2025-12-18 20:42 UTC (permalink / raw)
  To: linux-api; +Cc: linux-doc, linux-kernel, tools, gpaoloni, Sasha Levin
In-Reply-To: <20251218204239.4159453-1-sashal@kernel.org>

Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/xattr.c | 322 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 322 insertions(+)

diff --git a/fs/xattr.c b/fs/xattr.c
index 466dcaf7ba83e..8a27c11905f7e 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -1392,6 +1392,328 @@ SYSCALL_DEFINE5(lsetxattr, const char __user *, pathname,
 			       value, size, flags);
 }
 
+/**
+ * sys_fsetxattr - Set an extended attribute value on an open file descriptor
+ * @fd: File descriptor of the file on which to set the extended attribute
+ * @name: Null-terminated name of the extended attribute (includes namespace prefix)
+ * @value: Buffer containing the attribute value to set
+ * @size: Size of the value buffer in bytes
+ * @flags: Flags controlling attribute creation/replacement behavior
+ *
+ * long-desc: Sets the value of an extended attribute identified by name on
+ *   the file referred to by the open file descriptor fd. Extended attributes
+ *   are name:value pairs associated with inodes (files, directories, symbolic
+ *   links, etc.) that extend the normal attributes (stat data) associated with
+ *   all inodes.
+ *
+ *   This syscall is similar to setxattr() but operates on an already-open file
+ *   descriptor rather than a pathname. This is useful when the file is already
+ *   open, when the caller wants to avoid race conditions between opening and
+ *   setting attributes, or when operating on file descriptors that cannot be
+ *   easily reopened.
+ *
+ *   The attribute name must include a namespace prefix. Valid namespaces are:
+ *   - "user." - User-defined attributes (regular files and directories only)
+ *   - "trusted." - Trusted attributes (requires CAP_SYS_ADMIN)
+ *   - "security." - Security module attributes (e.g., SELinux, Smack, capabilities)
+ *   - "system." - System attributes (e.g., POSIX ACLs via system.posix_acl_access)
+ *
+ *   The value can be arbitrary binary data or text. A zero-length value is
+ *   permitted and creates an attribute with an empty value (different from
+ *   removing the attribute).
+ *
+ *   The file descriptor must have been opened for writing to modify extended
+ *   attributes. The file descriptor cannot be an O_PATH file descriptor.
+ *
+ * context-flags: KAPI_CTX_PROCESS | KAPI_CTX_SLEEPABLE
+ *
+ * param: fd
+ *   type: KAPI_TYPE_FD
+ *   flags: KAPI_PARAM_IN
+ *   constraint-type: KAPI_CONSTRAINT_CUSTOM
+ *   constraint: Must be a valid file descriptor returned by open(), creat(),
+ *     or similar syscalls. The file descriptor cannot be an O_PATH file
+ *     descriptor. The file must be on a filesystem that is not mounted
+ *     read-only. AT_FDCWD (-100) is NOT valid for this syscall as it operates
+ *     on file descriptors, not directory handles.
+ *
+ * param: name
+ *   type: KAPI_TYPE_USER_PTR
+ *   flags: KAPI_PARAM_IN | KAPI_PARAM_USER
+ *   constraint-type: KAPI_CONSTRAINT_USER_STRING
+ *   range: 1, 255
+ *   constraint: Must be a valid null-terminated string in user memory containing
+ *     the extended attribute name with namespace prefix (e.g., "user.myattr").
+ *     The name (including prefix) must be between 1 and XATTR_NAME_MAX (255)
+ *     characters. An empty name returns ERANGE.
+ *
+ * param: value
+ *   type: KAPI_TYPE_USER_PTR
+ *   flags: KAPI_PARAM_IN | KAPI_PARAM_USER | KAPI_PARAM_OPTIONAL
+ *   constraint-type: KAPI_CONSTRAINT_CUSTOM
+ *   constraint: Must be a valid pointer to user memory containing the attribute
+ *     value, or NULL if size is 0. When size is non-zero, the pointer must be
+ *     valid and accessible for size bytes.
+ *
+ * param: size
+ *   type: KAPI_TYPE_UINT
+ *   flags: KAPI_PARAM_IN
+ *   constraint-type: KAPI_CONSTRAINT_RANGE
+ *   range: 0, 65536
+ *   constraint: Size of the value in bytes. Must not exceed XATTR_SIZE_MAX
+ *     (65536 bytes). Zero is permitted and creates an attribute with empty value.
+ *     Filesystem-specific limits may be smaller (e.g., ext4 limits total xattr
+ *     space to one filesystem block, typically 4KB).
+ *
+ * param: flags
+ *   type: KAPI_TYPE_INT
+ *   flags: KAPI_PARAM_IN
+ *   constraint-type: KAPI_CONSTRAINT_MASK
+ *   valid-mask: XATTR_CREATE | XATTR_REPLACE
+ *   constraint: Controls creation/replacement behavior. Valid values are 0,
+ *     XATTR_CREATE (0x1), or XATTR_REPLACE (0x2). XATTR_CREATE fails if the
+ *     attribute already exists. XATTR_REPLACE fails if the attribute does not
+ *     exist. With flags=0, the attribute is created if it doesn't exist or
+ *     replaced if it does. XATTR_CREATE and XATTR_REPLACE are mutually exclusive.
+ *
+ * return:
+ *   type: KAPI_TYPE_INT
+ *   check-type: KAPI_RETURN_ERROR_CHECK
+ *   success: 0
+ *   desc: Returns 0 on success. The extended attribute is set with the specified
+ *     value. Any previous value for the attribute is replaced.
+ *
+ * error: EBADF, Bad file descriptor
+ *   desc: The file descriptor fd is not valid or is not open for writing. This
+ *     is returned from the fd class lookup when the file descriptor does not
+ *     refer to an open file.
+ *
+ * error: EPERM, Operation not permitted
+ *   desc: Returned when: (1) file is immutable or append-only, (2) trusted.*
+ *     without CAP_SYS_ADMIN, (3) security.* (except capability) without
+ *     CAP_SYS_ADMIN, (4) user.* on sticky dir without ownership/CAP_FOWNER,
+ *     (5) unmapped ID in idmapped mount, (6) user.* on non-regular/non-dir.
+ *
+ * error: ENODATA, Attribute not found
+ *   desc: XATTR_REPLACE was specified but the named attribute does not exist on
+ *     the file. Also returned when reading trusted.* without CAP_SYS_ADMIN.
+ *
+ * error: EEXIST, Attribute already exists
+ *   desc: XATTR_CREATE was specified but the named attribute already exists on
+ *     the file.
+ *
+ * error: ERANGE, Name out of range
+ *   desc: The attribute name is empty (zero length) or exceeds XATTR_NAME_MAX
+ *     (255 characters). Returned from import_xattr_name() via strncpy_from_user().
+ *
+ * error: E2BIG, Value too large
+ *   desc: The size parameter exceeds XATTR_SIZE_MAX (65536 bytes). Returned from
+ *     setxattr_copy() before attempting to copy the value from userspace.
+ *
+ * error: EINVAL, Invalid argument
+ *   desc: The flags parameter contains bits other than XATTR_CREATE and
+ *     XATTR_REPLACE. Also returned for malformed capability values when setting
+ *     security.capability (invalid header format, invalid rootid mapping), or
+ *     when the xattr name doesn't match any handler prefix.
+ *
+ * error: EFAULT, Bad address
+ *   desc: One of the user pointers (name or value) is invalid or points to
+ *     memory that cannot be accessed. Returned from strncpy_from_user() for
+ *     name or vmemdup_user()/copy_from_user() for value.
+ *
+ * error: ENOMEM, Out of memory
+ *   desc: Kernel could not allocate memory to copy the attribute value from
+ *     userspace (via vmemdup_user), or for namespace capability conversion
+ *     (cap_convert_nscap allocates memory for v3 capability format).
+ *
+ * error: EOPNOTSUPP, Operation not supported
+ *   desc: The filesystem does not support extended attributes (IOP_XATTR not set),
+ *     or no xattr handler exists for the given namespace prefix, or the handler
+ *     does not implement the set operation. Also returned for POSIX ACL xattrs
+ *     (system.posix_acl_*) when CONFIG_FS_POSIX_ACL is disabled.
+ *
+ * error: EROFS, Read-only filesystem
+ *   desc: The filesystem containing the file is mounted read-only. Returned from
+ *     mnt_want_write_file() before attempting any modification.
+ *
+ * error: EIO, I/O error
+ *   desc: The inode is marked as bad (is_bad_inode), indicating filesystem
+ *     corruption or I/O failure. Also may be returned by filesystem-specific
+ *     xattr handler operations.
+ *
+ * error: EDQUOT, Disk quota exceeded
+ *   desc: The user's disk quota for extended attributes has been exceeded.
+ *     Filesystem-specific error returned from the handler's set operation.
+ *
+ * error: ENOSPC, No space left on device
+ *   desc: The filesystem has insufficient space to store the extended attribute.
+ *     Filesystem-specific error from handler's set operation.
+ *
+ * error: EACCES, Permission denied
+ *   desc: Write access to the file is denied based on DAC permissions. The caller
+ *     does not have appropriate permission to modify xattrs on this file.
+ *
+ * lock: inode->i_rwsem
+ *   type: KAPI_LOCK_MUTEX
+ *   acquired: true
+ *   released: true
+ *   desc: The inode's read-write semaphore is acquired exclusively via inode_lock()
+ *     before calling __vfs_setxattr_locked() and released via inode_unlock() after.
+ *     This serializes concurrent xattr modifications on the same inode.
+ *
+ * lock: sb->s_writers (superblock freeze protection)
+ *   type: KAPI_LOCK_SEMAPHORE
+ *   acquired: true
+ *   released: true
+ *   desc: Write access to the mount is acquired via mnt_want_write_file() which
+ *     calls sb_start_write(). This prevents filesystem freeze during the operation.
+ *     Released via mnt_drop_write_file() after the operation completes.
+ *
+ * lock: file_rwsem (delegation breaking)
+ *   type: KAPI_LOCK_SEMAPHORE
+ *   acquired: true
+ *   released: true
+ *   desc: If the file has NFSv4 delegations, the percpu file_rwsem is acquired
+ *     during delegation breaking in __break_lease(). The syscall may wait for
+ *     delegation holders to acknowledge the break.
+ *
+ * signal: Any
+ *   direction: KAPI_SIGNAL_RECEIVE
+ *   action: KAPI_SIGNAL_ACTION_RESTART
+ *   condition: Signal arrives during interruptible wait for delegation breaking
+ *   desc: The syscall may wait for NFSv4 delegation holders to release their
+ *     delegations via wait_event_interruptible_timeout() in __break_lease().
+ *     During this wait, signals can interrupt the operation. If a signal is
+ *     pending, the wait is interrupted and the operation may be retried by
+ *     the kernel automatically if the signal disposition allows (SA_RESTART).
+ *   error: -ERESTARTSYS
+ *   timing: KAPI_SIGNAL_TIME_DURING
+ *   restartable: yes
+ *
+ * side-effect: KAPI_EFFECT_ALLOC_MEMORY
+ *   target: Kernel buffer for attribute value
+ *   desc: The attribute value is copied from userspace to a kernel buffer
+ *     allocated via vmemdup_user(). This memory is freed (kvfree) after the
+ *     operation completes, regardless of success or failure.
+ *   reversible: yes
+ *
+ * side-effect: KAPI_EFFECT_FILESYSTEM
+ *   target: File's extended attributes
+ *   desc: On success, the specified extended attribute is created or modified.
+ *     The change is typically persisted to storage synchronously or asynchronously
+ *     depending on filesystem and mount options.
+ *   reversible: yes
+ *   condition: Operation succeeds
+ *
+ * side-effect: KAPI_EFFECT_MODIFY_STATE
+ *   target: Inode flags (S_NOSEC)
+ *   desc: When setting security.* attributes, the S_NOSEC flag is cleared from
+ *     the inode. This flag is an optimization that indicates no security xattrs
+ *     exist; clearing it ensures proper security checks on subsequent accesses.
+ *   condition: Setting security.* namespace attribute
+ *   reversible: no
+ *
+ * side-effect: KAPI_EFFECT_MODIFY_STATE
+ *   target: fsnotify event
+ *   desc: On success, fsnotify_xattr() is called to notify any registered
+ *     watchers (inotify, fanotify) of the extended attribute modification.
+ *     This generates an IN_ATTRIB event.
+ *   condition: Operation succeeds
+ *
+ * state-trans: extended attribute
+ *   from: nonexistent or has old value
+ *   to: has new value
+ *   condition: Operation succeeds with flags=0 or appropriate flags
+ *   desc: The extended attribute transitions from not existing (or having its
+ *     previous value) to containing the new value. With XATTR_CREATE, the
+ *     attribute must not exist beforehand. With XATTR_REPLACE, it must exist.
+ *
+ * capability: CAP_SYS_ADMIN
+ *   type: KAPI_CAP_GRANT_PERMISSION
+ *   allows: Setting trusted.* namespace attributes and most security.* attributes
+ *   without: Setting trusted.* returns EPERM. Setting security.* (except
+ *     security.capability) returns EPERM. The check uses ns_capable() against
+ *     the filesystem's user namespace.
+ *   condition: Attribute name starts with "trusted." or "security." (except
+ *     security.capability)
+ *
+ * capability: CAP_SETFCAP
+ *   type: KAPI_CAP_GRANT_PERMISSION
+ *   allows: Setting the security.capability extended attribute
+ *   without: Setting security.capability returns EPERM
+ *   condition: Attribute name is "security.capability". Checked via
+ *     capable_wrt_inode_uidgid() which considers the inode's ownership.
+ *
+ * capability: CAP_FOWNER
+ *   type: KAPI_CAP_BYPASS_CHECK
+ *   allows: Bypassing owner check for user.* on sticky directories
+ *   without: Non-owners cannot set user.* attributes on files in sticky
+ *     directories without this capability
+ *   condition: Setting user.* namespace attribute on a file in a sticky directory
+ *
+ * constraint: Filesystem support
+ *   desc: The filesystem must support extended attributes (have IOP_XATTR flag
+ *     set and provide xattr handlers). Common filesystems supporting xattrs
+ *     include ext4, XFS, Btrfs, and tmpfs. Some filesystems (e.g., FAT, older
+ *     ext2) do not support extended attributes.
+ *
+ * constraint: Filesystem-specific size limits
+ *   desc: While the VFS limit is 64KB (XATTR_SIZE_MAX), filesystems may impose
+ *     smaller limits. For example, ext4 limits all xattrs on an inode to fit
+ *     in a single filesystem block (typically 4KB). XFS and ReiserFS support
+ *     the full 64KB. Exceeding filesystem limits returns ENOSPC or E2BIG.
+ *
+ * constraint: user.* namespace restrictions
+ *   desc: The user.* namespace is only supported on regular files and directories.
+ *     Attempting to set user.* attributes on other file types (symlinks, devices,
+ *     sockets, FIFOs) returns EPERM (for write) or ENODATA (for read).
+ *
+ * constraint: LSM checks
+ *   desc: Linux Security Modules (SELinux, Smack, AppArmor) may impose additional
+ *     restrictions via security_inode_setxattr() hook. These can return various
+ *     error codes depending on the security policy. The LSM is called after
+ *     permission checks but before the actual xattr modification.
+ *
+ * constraint: File descriptor must not be O_PATH
+ *   desc: The file descriptor must be a regular file descriptor, not one opened
+ *     with O_PATH. O_PATH file descriptors do not provide access to the file
+ *     contents or metadata modification operations.
+ *
+ * examples: fsetxattr(fd, "user.comment", "test", 4, 0);  // Set user attr
+ *   fsetxattr(fd, "user.new", "val", 3, XATTR_CREATE);  // Create only, fail if exists
+ *   fsetxattr(fd, "user.existing", "new", 3, XATTR_REPLACE);  // Replace only
+ *   fsetxattr(fd, "user.empty", "", 0, 0);  // Create attribute with empty value
+ *
+ * notes: Extended attributes provide a way to associate arbitrary metadata with
+ *   files beyond the standard stat attributes. They are commonly used for:
+ *   - SELinux security contexts (security.selinux)
+ *   - File capabilities (security.capability)
+ *   - POSIX ACLs (system.posix_acl_access, system.posix_acl_default)
+ *   - User-defined metadata (user.* namespace)
+ *
+ *   Using fsetxattr() with an already-open file descriptor avoids potential
+ *   TOCTOU (time-of-check-time-of-use) race conditions that can occur when
+ *   using setxattr() with a pathname, where the file might be replaced between
+ *   opening and setting the attribute.
+ *
+ *   The trusted.* namespace is designed for use by privileged processes to store
+ *   data that should not be accessible to unprivileged users (e.g., during
+ *   backup/restore operations).
+ *
+ *   NFSv4 delegation support means this syscall may need to wait for remote
+ *   clients to release their delegations before the operation can complete.
+ *   This can introduce unbounded delays in pathological cases.
+ *
+ *   For security.capability specifically, the kernel may convert between v2
+ *   (non-namespaced) and v3 (namespaced) capability formats depending on the
+ *   filesystem's user namespace and caller's capabilities.
+ *
+ *   Unlike setxattr() and lsetxattr(), fsetxattr() does not involve path
+ *   resolution, so errors related to path traversal (ENOENT, ENOTDIR,
+ *   ENAMETOOLONG, ELOOP, ESTALE) are not possible.
+ *
+ * since-version: 2.4
+ */
 SYSCALL_DEFINE5(fsetxattr, int, fd, const char __user *, name,
 		const void __user *,value, size_t, size, int, flags)
 {
-- 
2.51.0


^ permalink raw reply related

* [RFC PATCH v5 10/15] kernel/api: add API specification for lsetxattr
From: Sasha Levin @ 2025-12-18 20:42 UTC (permalink / raw)
  To: linux-api; +Cc: linux-doc, linux-kernel, tools, gpaoloni, Sasha Levin
In-Reply-To: <20251218204239.4159453-1-sashal@kernel.org>

Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/xattr.c | 327 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 327 insertions(+)

diff --git a/fs/xattr.c b/fs/xattr.c
index 02a946227129e..466dcaf7ba83e 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -1057,6 +1057,333 @@ SYSCALL_DEFINE5(setxattr, const char __user *, pathname,
 	return path_setxattrat(AT_FDCWD, pathname, 0, name, value, size, flags);
 }
 
+/**
+ * sys_lsetxattr - Set an extended attribute value on a symbolic link
+ * @pathname: Path to the file or symbolic link on which to set the attribute
+ * @name: Null-terminated name of the extended attribute (includes namespace prefix)
+ * @value: Buffer containing the attribute value to set
+ * @size: Size of the value buffer in bytes
+ * @flags: Flags controlling attribute creation/replacement behavior
+ *
+ * long-desc: Sets the value of an extended attribute identified by name on
+ *   the file specified by pathname. Unlike setxattr(), this syscall does not
+ *   follow symbolic links - if pathname refers to a symbolic link, the
+ *   extended attribute is set on the link itself, not on the file it refers to.
+ *
+ *   Extended attributes are name:value pairs associated with inodes (files,
+ *   directories, symbolic links, etc.) that extend the normal attributes
+ *   (stat data) associated with all inodes.
+ *
+ *   The attribute name must include a namespace prefix. Valid namespaces are:
+ *   - "user." - User-defined attributes (regular files and directories only)
+ *   - "trusted." - Trusted attributes (requires CAP_SYS_ADMIN)
+ *   - "security." - Security module attributes (e.g., SELinux, Smack, capabilities)
+ *   - "system." - System attributes (e.g., POSIX ACLs via system.posix_acl_access)
+ *
+ *   The value can be arbitrary binary data or text. A zero-length value is
+ *   permitted and creates an attribute with an empty value (different from
+ *   removing the attribute).
+ *
+ *   Note that not all filesystems support extended attributes on symbolic links.
+ *   Additionally, the user.* namespace is not available on symbolic links since
+ *   they are not regular files or directories.
+ *
+ * context-flags: KAPI_CTX_PROCESS | KAPI_CTX_SLEEPABLE
+ *
+ * param: pathname
+ *   type: KAPI_TYPE_PATH
+ *   flags: KAPI_PARAM_IN | KAPI_PARAM_USER
+ *   constraint-type: KAPI_CONSTRAINT_USER_PATH
+ *   constraint: Must be a valid null-terminated path string in user memory.
+ *     The path is resolved WITHOUT following symbolic links - if the final
+ *     component is a symbolic link, the operation applies to the link itself.
+ *     Maximum path length is PATH_MAX (4096 bytes). The file or link must
+ *     exist and the caller must have appropriate permissions.
+ *
+ * param: name
+ *   type: KAPI_TYPE_USER_PTR
+ *   flags: KAPI_PARAM_IN | KAPI_PARAM_USER
+ *   constraint-type: KAPI_CONSTRAINT_USER_STRING
+ *   range: 1, 255
+ *   constraint: Must be a valid null-terminated string in user memory containing
+ *     the extended attribute name with namespace prefix (e.g., "security.selinux").
+ *     The name (including prefix) must be between 1 and XATTR_NAME_MAX (255)
+ *     characters. An empty name returns ERANGE. Note that user.* namespace is
+ *     not supported on symbolic links.
+ *
+ * param: value
+ *   type: KAPI_TYPE_USER_PTR
+ *   flags: KAPI_PARAM_IN | KAPI_PARAM_USER | KAPI_PARAM_OPTIONAL
+ *   constraint-type: KAPI_CONSTRAINT_CUSTOM
+ *   constraint: Must be a valid pointer to user memory containing the attribute
+ *     value, or NULL if size is 0. When size is non-zero, the pointer must be
+ *     valid and accessible for size bytes.
+ *
+ * param: size
+ *   type: KAPI_TYPE_UINT
+ *   flags: KAPI_PARAM_IN
+ *   constraint-type: KAPI_CONSTRAINT_RANGE
+ *   range: 0, 65536
+ *   constraint: Size of the value in bytes. Must not exceed XATTR_SIZE_MAX
+ *     (65536 bytes). Zero is permitted and creates an attribute with empty value.
+ *     Filesystem-specific limits may be smaller (e.g., ext4 limits total xattr
+ *     space to one filesystem block, typically 4KB).
+ *
+ * param: flags
+ *   type: KAPI_TYPE_INT
+ *   flags: KAPI_PARAM_IN
+ *   constraint-type: KAPI_CONSTRAINT_MASK
+ *   valid-mask: XATTR_CREATE | XATTR_REPLACE
+ *   constraint: Controls creation/replacement behavior. Valid values are 0,
+ *     XATTR_CREATE (0x1), or XATTR_REPLACE (0x2). XATTR_CREATE fails if the
+ *     attribute already exists. XATTR_REPLACE fails if the attribute does not
+ *     exist. With flags=0, the attribute is created if it doesn't exist or
+ *     replaced if it does. XATTR_CREATE and XATTR_REPLACE are mutually exclusive.
+ *
+ * return:
+ *   type: KAPI_TYPE_INT
+ *   check-type: KAPI_RETURN_ERROR_CHECK
+ *   success: 0
+ *   desc: Returns 0 on success. The extended attribute is set with the specified
+ *     value on the symbolic link itself. Any previous value for the attribute
+ *     is replaced.
+ *
+ * error: ENOENT, File or symlink not found
+ *   desc: The file or symbolic link specified by pathname does not exist, or a
+ *     directory component in the path does not exist. Returned from path lookup.
+ *
+ * error: EACCES, Permission denied
+ *   desc: Permission denied during path resolution (search permission on a directory
+ *     component) or write access to the file is denied based on DAC permissions.
+ *
+ * error: EPERM, Operation not permitted
+ *   desc: Returned in several cases: (1) The file is marked immutable (chattr +i)
+ *     or append-only (chattr +a). (2) For trusted.* namespace, caller lacks
+ *     CAP_SYS_ADMIN in the filesystem's user namespace. (3) For security.*
+ *     namespace (except security.capability), caller lacks CAP_SYS_ADMIN.
+ *     (4) For user.* namespace on sticky directories, caller is not the owner
+ *     and lacks CAP_FOWNER. (5) The inode has an unmapped ID in an idmapped mount.
+ *     (6) Attempting to set user.* namespace on a symbolic link (not supported).
+ *
+ * error: ENODATA, Attribute not found
+ *   desc: XATTR_REPLACE was specified but the named attribute does not exist on
+ *     the symbolic link.
+ *
+ * error: EEXIST, Attribute already exists
+ *   desc: XATTR_CREATE was specified but the named attribute already exists on
+ *     the symbolic link.
+ *
+ * error: ERANGE, Name out of range
+ *   desc: The attribute name is empty (zero length) or exceeds XATTR_NAME_MAX
+ *     (255 characters). Returned from import_xattr_name() via strncpy_from_user().
+ *
+ * error: E2BIG, Value too large
+ *   desc: The size parameter exceeds XATTR_SIZE_MAX (65536 bytes). Returned from
+ *     setxattr_copy() before attempting to copy the value from userspace.
+ *
+ * error: EINVAL, Invalid argument
+ *   desc: The flags parameter contains bits other than XATTR_CREATE and
+ *     XATTR_REPLACE. Also returned for malformed capability values when setting
+ *     security.capability, or when the xattr name doesn't match any handler prefix.
+ *
+ * error: EFAULT, Bad address
+ *   desc: One of the user pointers (pathname, name, or value) is invalid or
+ *     points to memory that cannot be accessed. Returned from strncpy_from_user()
+ *     for pathname/name or vmemdup_user()/copy_from_user() for value.
+ *
+ * error: ENOMEM, Out of memory
+ *   desc: Kernel could not allocate memory to copy the attribute value from
+ *     userspace (via vmemdup_user), or for namespace capability conversion
+ *     (cap_convert_nscap allocates memory for v3 capability format).
+ *
+ * error: EOPNOTSUPP, Operation not supported
+ *   desc: The filesystem does not support extended attributes on symbolic links,
+ *     or no xattr handler exists for the given namespace prefix, or the handler
+ *     does not implement the set operation. Many filesystems do not support
+ *     setting xattrs on symbolic links.
+ *
+ * error: EROFS, Read-only filesystem
+ *   desc: The filesystem containing the symbolic link is mounted read-only.
+ *     Returned from mnt_want_write() before attempting any modification.
+ *
+ * error: EIO, I/O error
+ *   desc: The inode is marked as bad (is_bad_inode), indicating filesystem
+ *     corruption or I/O failure. Also may be returned by filesystem-specific
+ *     xattr handler operations.
+ *
+ * error: EDQUOT, Disk quota exceeded
+ *   desc: The user's disk quota for extended attributes has been exceeded.
+ *     Filesystem-specific error returned from the handler's set operation.
+ *
+ * error: ENOSPC, No space left on device
+ *   desc: The filesystem has insufficient space to store the extended attribute.
+ *     Filesystem-specific error from handler's set operation.
+ *
+ * error: ELOOP, Too many symbolic links
+ *   desc: Too many symbolic links were encountered during path resolution of
+ *     directory components (more than MAXSYMLINKS, typically 40). Note that the
+ *     final component (the target of the operation) is not followed.
+ *
+ * error: ENAMETOOLONG, Filename too long
+ *   desc: The pathname or a component of the pathname exceeds the system limit
+ *     (PATH_MAX or NAME_MAX).
+ *
+ * error: ENOTDIR, Not a directory
+ *   desc: A component of the path prefix is not a directory.
+ *
+ * error: ESTALE, Stale file handle
+ *   desc: The file handle became stale during the operation (NFS). The syscall
+ *     automatically retries with LOOKUP_REVAL in this case.
+ *
+ * lock: inode->i_rwsem
+ *   type: KAPI_LOCK_MUTEX
+ *   acquired: true
+ *   released: true
+ *   desc: The inode's read-write semaphore is acquired exclusively via inode_lock()
+ *     before calling __vfs_setxattr_locked() and released via inode_unlock() after.
+ *     This serializes concurrent xattr modifications on the same inode.
+ *
+ * lock: sb->s_writers (superblock freeze protection)
+ *   type: KAPI_LOCK_SEMAPHORE
+ *   acquired: true
+ *   released: true
+ *   desc: Write access to the mount is acquired via mnt_want_write() which calls
+ *     sb_start_write(). This prevents filesystem freeze during the operation.
+ *     Released via mnt_drop_write() after the operation completes.
+ *
+ * lock: file_rwsem (delegation breaking)
+ *   type: KAPI_LOCK_SEMAPHORE
+ *   acquired: true
+ *   released: true
+ *   desc: If the file has NFSv4 delegations, the percpu file_rwsem is acquired
+ *     during delegation breaking in __break_lease(). The syscall may wait for
+ *     delegation holders to acknowledge the break.
+ *
+ * signal: Any
+ *   direction: KAPI_SIGNAL_RECEIVE
+ *   action: KAPI_SIGNAL_ACTION_RESTART
+ *   condition: Signal arrives during interruptible waits (delegation breaking)
+ *   desc: The syscall may wait for NFSv4 delegation holders to release their
+ *     delegations. During this wait, signals can interrupt the operation. If a
+ *     signal is pending, the wait may be interrupted and the operation retried.
+ *     Most blocking points in this syscall use non-interruptible waits.
+ *   timing: KAPI_SIGNAL_TIME_DURING
+ *   restartable: yes
+ *
+ * side-effect: KAPI_EFFECT_ALLOC_MEMORY
+ *   target: Kernel buffer for attribute value
+ *   desc: The attribute value is copied from userspace to a kernel buffer
+ *     allocated via vmemdup_user(). This memory is freed (kvfree) after the
+ *     operation completes, regardless of success or failure.
+ *   reversible: yes
+ *
+ * side-effect: KAPI_EFFECT_FILESYSTEM
+ *   target: Symbolic link's extended attributes
+ *   desc: On success, the specified extended attribute is created or modified
+ *     on the symbolic link itself. The change is typically persisted to storage
+ *     synchronously or asynchronously depending on filesystem and mount options.
+ *   reversible: yes
+ *   condition: Operation succeeds
+ *
+ * side-effect: KAPI_EFFECT_MODIFY_STATE
+ *   target: Inode flags (S_NOSEC)
+ *   desc: When setting security.* attributes, the S_NOSEC flag is cleared from
+ *     the inode. This flag is an optimization that indicates no security xattrs
+ *     exist; clearing it ensures proper security checks on subsequent accesses.
+ *   condition: Setting security.* namespace attribute
+ *   reversible: no
+ *
+ * side-effect: KAPI_EFFECT_MODIFY_STATE
+ *   target: fsnotify event
+ *   desc: On success, fsnotify_xattr() is called to notify any registered
+ *     watchers (inotify, fanotify) of the extended attribute modification.
+ *     This generates an IN_ATTRIB event.
+ *   condition: Operation succeeds
+ *
+ * state-trans: extended attribute
+ *   from: nonexistent or has old value
+ *   to: has new value
+ *   condition: Operation succeeds with flags=0 or appropriate flags
+ *   desc: The extended attribute on the symbolic link transitions from not
+ *     existing (or having its previous value) to containing the new value.
+ *     With XATTR_CREATE, the attribute must not exist beforehand. With
+ *     XATTR_REPLACE, it must exist.
+ *
+ * capability: CAP_SYS_ADMIN
+ *   type: KAPI_CAP_GRANT_PERMISSION
+ *   allows: Setting trusted.* namespace attributes and most security.* attributes
+ *   without: Setting trusted.* returns EPERM. Setting security.* (except
+ *     security.capability) returns EPERM. The check uses ns_capable() against
+ *     the filesystem's user namespace.
+ *   condition: Attribute name starts with "trusted." or "security." (except
+ *     security.capability)
+ *
+ * capability: CAP_SETFCAP
+ *   type: KAPI_CAP_GRANT_PERMISSION
+ *   allows: Setting the security.capability extended attribute
+ *   without: Setting security.capability returns EPERM
+ *   condition: Attribute name is "security.capability". Checked via
+ *     capable_wrt_inode_uidgid() which considers the inode's ownership.
+ *
+ * capability: CAP_FOWNER
+ *   type: KAPI_CAP_BYPASS_CHECK
+ *   allows: Bypassing owner check for user.* on sticky directories
+ *   without: Non-owners cannot set user.* attributes on files in sticky
+ *     directories without this capability
+ *   condition: Setting user.* namespace attribute on a file in a sticky directory
+ *
+ * constraint: Filesystem support for symlinks
+ *   desc: Not all filesystems support extended attributes on symbolic links.
+ *     Some filesystems (like ext4) may only support certain xattr namespaces
+ *     on symlinks. The user.* namespace is explicitly not supported on symbolic
+ *     links since they are not regular files or directories.
+ *
+ * constraint: Filesystem-specific size limits
+ *   desc: While the VFS limit is 64KB (XATTR_SIZE_MAX), filesystems may impose
+ *     smaller limits. For example, ext4 limits all xattrs on an inode to fit
+ *     in a single filesystem block (typically 4KB). XFS and ReiserFS support
+ *     the full 64KB. Exceeding filesystem limits returns ENOSPC or E2BIG.
+ *
+ * constraint: user.* namespace restrictions on symlinks
+ *   desc: The user.* namespace is only supported on regular files and directories.
+ *     Attempting to set user.* attributes on symbolic links returns EPERM.
+ *     This is because user.* xattrs have permission semantics that don't apply
+ *     to symbolic links which anyone can follow.
+ *
+ * constraint: LSM checks
+ *   desc: Linux Security Modules (SELinux, Smack, AppArmor) may impose additional
+ *     restrictions via security_inode_setxattr() hook. These can return various
+ *     error codes depending on the security policy. The LSM is called after
+ *     permission checks but before the actual xattr modification.
+ *
+ * examples: lsetxattr("/path/symlink", "security.selinux", ctx, len, 0);  // Set SELinux context on link
+ *   lsetxattr("/path/symlink", "trusted.overlay.opaque", "y", 1, XATTR_CREATE);  // Set overlay attr
+ *
+ * notes: This syscall is primarily used for security labeling of symbolic links
+ *   themselves (as opposed to their targets). Common use cases include:
+ *   - SELinux security contexts on symbolic links (security.selinux)
+ *   - Overlay filesystem metadata (trusted.overlay.*)
+ *   - IMA/EVM integrity metadata (security.ima, security.evm)
+ *
+ *   Unlike regular files and directories, symbolic links do not support the
+ *   user.* xattr namespace. This is because user.* xattrs require ownership
+ *   or capability checks that don't make sense for symlinks which can be
+ *   followed by anyone with directory access.
+ *
+ *   The trusted.* namespace on symbolic links requires CAP_SYS_ADMIN and is
+ *   commonly used by overlay filesystems to store metadata about redirected
+ *   or opaque directories.
+ *
+ *   NFSv4 delegation support means this syscall may need to wait for remote
+ *   clients to release their delegations before the operation can complete.
+ *
+ *   This syscall was introduced alongside setxattr(), fsetxattr(), and the
+ *   corresponding get/list/remove variants in Linux 2.4 to provide the
+ *   non-following behavior needed for backup/restore tools and security
+ *   labeling of links.
+ *
+ * since-version: 2.4
+ */
 SYSCALL_DEFINE5(lsetxattr, const char __user *, pathname,
 		const char __user *, name, const void __user *, value,
 		size_t, size, int, flags)
-- 
2.51.0


^ permalink raw reply related

* [RFC PATCH v5 09/15] kernel/api: add API specification for setxattr
From: Sasha Levin @ 2025-12-18 20:42 UTC (permalink / raw)
  To: linux-api; +Cc: linux-doc, linux-kernel, tools, gpaoloni, Sasha Levin
In-Reply-To: <20251218204239.4159453-1-sashal@kernel.org>

Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/xattr.c | 310 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 310 insertions(+)

diff --git a/fs/xattr.c b/fs/xattr.c
index 32d445fb60aaf..02a946227129e 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -740,6 +740,316 @@ SYSCALL_DEFINE6(setxattrat, int, dfd, const char __user *, pathname, unsigned in
 			       args.flags);
 }
 
+/**
+ * sys_setxattr - Set an extended attribute value on a file
+ * @pathname: Path to the file on which to set the extended attribute
+ * @name: Null-terminated name of the extended attribute (includes namespace prefix)
+ * @value: Buffer containing the attribute value to set
+ * @size: Size of the value buffer in bytes
+ * @flags: Flags controlling attribute creation/replacement behavior
+ *
+ * long-desc: Sets the value of an extended attribute identified by name on
+ *   the file specified by pathname. Extended attributes are name:value pairs
+ *   associated with inodes (files, directories, symbolic links, etc.) that
+ *   extend the normal attributes (stat data) associated with all inodes.
+ *
+ *   The attribute name must include a namespace prefix. Valid namespaces are:
+ *   - "user." - User-defined attributes (regular files and directories only)
+ *   - "trusted." - Trusted attributes (requires CAP_SYS_ADMIN)
+ *   - "security." - Security module attributes (e.g., SELinux, Smack, capabilities)
+ *   - "system." - System attributes (e.g., POSIX ACLs via system.posix_acl_access)
+ *
+ *   The value can be arbitrary binary data or text. A zero-length value is
+ *   permitted and creates an attribute with an empty value (different from
+ *   removing the attribute).
+ *
+ *   This syscall follows symbolic links. Use lsetxattr() to operate on the
+ *   symbolic link itself, or fsetxattr() to operate on an open file descriptor.
+ *
+ * context-flags: KAPI_CTX_PROCESS | KAPI_CTX_SLEEPABLE
+ *
+ * param: pathname
+ *   type: KAPI_TYPE_PATH
+ *   flags: KAPI_PARAM_IN | KAPI_PARAM_USER
+ *   constraint-type: KAPI_CONSTRAINT_USER_PATH
+ *   constraint: Must be a valid null-terminated path string in user memory.
+ *     The path is resolved following symbolic links. Maximum path length is
+ *     PATH_MAX (4096 bytes). The file must exist and the caller must have
+ *     appropriate permissions.
+ *
+ * param: name
+ *   type: KAPI_TYPE_USER_PTR
+ *   flags: KAPI_PARAM_IN | KAPI_PARAM_USER
+ *   constraint-type: KAPI_CONSTRAINT_USER_STRING
+ *   range: 1, 255
+ *   constraint: Must be a valid null-terminated string in user memory containing
+ *     the extended attribute name with namespace prefix (e.g., "user.myattr").
+ *     The name (including prefix) must be between 1 and XATTR_NAME_MAX (255)
+ *     characters. An empty name returns ERANGE.
+ *
+ * param: value
+ *   type: KAPI_TYPE_USER_PTR
+ *   flags: KAPI_PARAM_IN | KAPI_PARAM_USER | KAPI_PARAM_OPTIONAL
+ *   constraint-type: KAPI_CONSTRAINT_CUSTOM
+ *   constraint: Must be a valid pointer to user memory containing the attribute
+ *     value, or NULL if size is 0. When size is non-zero, the pointer must be
+ *     valid and accessible for size bytes.
+ *
+ * param: size
+ *   type: KAPI_TYPE_UINT
+ *   flags: KAPI_PARAM_IN
+ *   constraint-type: KAPI_CONSTRAINT_RANGE
+ *   range: 0, 65536
+ *   constraint: Size of the value in bytes. Must not exceed XATTR_SIZE_MAX
+ *     (65536 bytes). Zero is permitted and creates an attribute with empty value.
+ *     Filesystem-specific limits may be smaller (e.g., ext4 limits total xattr
+ *     space to one filesystem block, typically 4KB).
+ *
+ * param: flags
+ *   type: KAPI_TYPE_INT
+ *   flags: KAPI_PARAM_IN
+ *   constraint-type: KAPI_CONSTRAINT_MASK
+ *   valid-mask: XATTR_CREATE | XATTR_REPLACE
+ *   constraint: Controls creation/replacement behavior. Valid values are 0,
+ *     XATTR_CREATE (0x1), or XATTR_REPLACE (0x2). XATTR_CREATE fails if the
+ *     attribute already exists. XATTR_REPLACE fails if the attribute does not
+ *     exist. With flags=0, the attribute is created if it doesn't exist or
+ *     replaced if it does. XATTR_CREATE and XATTR_REPLACE are mutually exclusive.
+ *
+ * return:
+ *   type: KAPI_TYPE_INT
+ *   check-type: KAPI_RETURN_ERROR_CHECK
+ *   success: 0
+ *   desc: Returns 0 on success. The extended attribute is set with the specified
+ *     value. Any previous value for the attribute is replaced.
+ *
+ * error: ENOENT, File not found
+ *   desc: The file specified by pathname does not exist, or a directory component
+ *     in the path does not exist. Returned from path lookup (filename_lookup).
+ *
+ * error: EACCES, Permission denied
+ *   desc: Permission denied during path resolution (search permission on a directory
+ *     component) or write access to the file is denied based on DAC permissions.
+ *
+ * error: EPERM, Operation not permitted
+ *   desc: Returned in several cases: (1) The file is marked immutable (chattr +i)
+ *     or append-only (chattr +a). (2) For trusted.* namespace, caller lacks
+ *     CAP_SYS_ADMIN in the filesystem's user namespace. (3) For security.*
+ *     namespace (except security.capability), caller lacks CAP_SYS_ADMIN.
+ *     (4) For user.* namespace on sticky directories, caller is not the owner
+ *     and lacks CAP_FOWNER. (5) The inode has an unmapped ID in an idmapped mount.
+ *
+ * error: ENODATA, Attribute not found
+ *   desc: XATTR_REPLACE was specified but the named attribute does not exist on
+ *     the file. Also returned when reading trusted.* without CAP_SYS_ADMIN (for
+ *     read operations, but included here for completeness with the flag).
+ *
+ * error: EEXIST, Attribute already exists
+ *   desc: XATTR_CREATE was specified but the named attribute already exists on
+ *     the file.
+ *
+ * error: ERANGE, Name out of range
+ *   desc: The attribute name is empty (zero length) or exceeds XATTR_NAME_MAX
+ *     (255 characters). Returned from import_xattr_name() via strncpy_from_user().
+ *
+ * error: E2BIG, Value too large
+ *   desc: The size parameter exceeds XATTR_SIZE_MAX (65536 bytes). Returned from
+ *     setxattr_copy() before attempting to copy the value from userspace.
+ *
+ * error: EINVAL, Invalid argument
+ *   desc: The flags parameter contains bits other than XATTR_CREATE and
+ *     XATTR_REPLACE. Also returned for malformed capability values when setting
+ *     security.capability, or when the xattr name doesn't match any handler prefix.
+ *
+ * error: EFAULT, Bad address
+ *   desc: One of the user pointers (pathname, name, or value) is invalid or
+ *     points to memory that cannot be accessed. Returned from strncpy_from_user()
+ *     for pathname/name or vmemdup_user()/copy_from_user() for value.
+ *
+ * error: ENOMEM, Out of memory
+ *   desc: Kernel could not allocate memory to copy the attribute value from
+ *     userspace (via vmemdup_user), or for namespace capability conversion
+ *     (cap_convert_nscap allocates memory for v3 capability format).
+ *
+ * error: EOPNOTSUPP, Operation not supported
+ *   desc: The filesystem does not support extended attributes (IOP_XATTR not set),
+ *     or no xattr handler exists for the given namespace prefix, or the handler
+ *     does not implement the set operation. Also returned for POSIX ACL xattrs
+ *     (system.posix_acl_*) when CONFIG_FS_POSIX_ACL is disabled.
+ *
+ * error: EROFS, Read-only filesystem
+ *   desc: The filesystem containing the file is mounted read-only. Returned from
+ *     mnt_want_write() before attempting any modification.
+ *
+ * error: EIO, I/O error
+ *   desc: The inode is marked as bad (is_bad_inode), indicating filesystem
+ *     corruption or I/O failure. Also may be returned by filesystem-specific
+ *     xattr handler operations.
+ *
+ * error: EDQUOT, Disk quota exceeded
+ *   desc: The user's disk quota for extended attributes has been exceeded.
+ *     Filesystem-specific error returned from the handler's set operation.
+ *
+ * error: ENOSPC, No space left on device
+ *   desc: The filesystem has insufficient space to store the extended attribute.
+ *     Filesystem-specific error from handler's set operation.
+ *
+ * error: ELOOP, Too many symbolic links
+ *   desc: Too many symbolic links were encountered during path resolution
+ *     (more than MAXSYMLINKS, typically 40).
+ *
+ * error: ENAMETOOLONG, Filename too long
+ *   desc: The pathname or a component of the pathname exceeds the system limit
+ *     (PATH_MAX or NAME_MAX).
+ *
+ * error: ENOTDIR, Not a directory
+ *   desc: A component of the path prefix is not a directory.
+ *
+ * error: ESTALE, Stale file handle
+ *   desc: The file handle became stale during the operation (NFS). The syscall
+ *     automatically retries with LOOKUP_REVAL in this case.
+ *
+ * lock: inode->i_rwsem
+ *   type: KAPI_LOCK_MUTEX
+ *   desc: The inode's read-write semaphore is acquired exclusively via inode_lock()
+ *     before calling __vfs_setxattr_locked() and released via inode_unlock() after.
+ *     This serializes concurrent xattr modifications on the same inode.
+ *
+ * lock: sb->s_writers (superblock freeze protection)
+ *   type: KAPI_LOCK_SEMAPHORE
+ *   desc: Write access to the mount is acquired via mnt_want_write() which calls
+ *     sb_start_write(). This prevents filesystem freeze during the operation.
+ *     Released via mnt_drop_write() after the operation completes.
+ *
+ * lock: file_rwsem (delegation breaking)
+ *   type: KAPI_LOCK_SEMAPHORE
+ *   desc: If the file has NFSv4 delegations, the percpu file_rwsem is acquired
+ *     during delegation breaking in __break_lease(). The syscall may wait for
+ *     delegation holders to acknowledge the break.
+ *
+ * signal: Any
+ *   direction: KAPI_SIGNAL_RECEIVE
+ *   action: KAPI_SIGNAL_ACTION_RESTART
+ *   condition: Signal arrives during interruptible waits (delegation breaking)
+ *   desc: The syscall may wait for NFSv4 delegation holders to release their
+ *     delegations. During this wait, signals can interrupt the operation. If a
+ *     signal is pending, the wait may be interrupted and the operation retried.
+ *     Most blocking points in this syscall use non-interruptible waits.
+ *   timing: KAPI_SIGNAL_TIME_DURING
+ *   restartable: yes
+ *
+ * side-effect: KAPI_EFFECT_ALLOC_MEMORY
+ *   target: Kernel buffer for attribute value
+ *   desc: The attribute value is copied from userspace to a kernel buffer
+ *     allocated via vmemdup_user(). This memory is freed (kvfree) after the
+ *     operation completes, regardless of success or failure.
+ *   reversible: yes
+ *
+ * side-effect: KAPI_EFFECT_FILESYSTEM
+ *   target: File's extended attributes
+ *   desc: On success, the specified extended attribute is created or modified.
+ *     The change is typically persisted to storage synchronously or asynchronously
+ *     depending on filesystem and mount options.
+ *   reversible: yes
+ *   condition: Operation succeeds
+ *
+ * side-effect: KAPI_EFFECT_MODIFY_STATE
+ *   target: Inode flags (S_NOSEC)
+ *   desc: When setting security.* attributes, the S_NOSEC flag is cleared from
+ *     the inode. This flag is an optimization that indicates no security xattrs
+ *     exist; clearing it ensures proper security checks on subsequent accesses.
+ *   condition: Setting security.* namespace attribute
+ *   reversible: no
+ *
+ * side-effect: KAPI_EFFECT_MODIFY_STATE
+ *   target: fsnotify event
+ *   desc: On success, fsnotify_xattr() is called to notify any registered
+ *     watchers (inotify, fanotify) of the extended attribute modification.
+ *     This generates an IN_ATTRIB event.
+ *   condition: Operation succeeds
+ *
+ * state-trans: extended attribute
+ *   from: nonexistent or has old value
+ *   to: has new value
+ *   condition: Operation succeeds with flags=0 or appropriate flags
+ *   desc: The extended attribute transitions from not existing (or having its
+ *     previous value) to containing the new value. With XATTR_CREATE, the
+ *     attribute must not exist beforehand. With XATTR_REPLACE, it must exist.
+ *
+ * capability: CAP_SYS_ADMIN
+ *   type: KAPI_CAP_GRANT_PERMISSION
+ *   allows: Setting trusted.* namespace attributes and most security.* attributes
+ *   without: Setting trusted.* returns EPERM. Setting security.* (except
+ *     security.capability) returns EPERM. The check uses ns_capable() against
+ *     the filesystem's user namespace.
+ *   condition: Attribute name starts with "trusted." or "security." (except
+ *     security.capability)
+ *
+ * capability: CAP_SETFCAP
+ *   type: KAPI_CAP_GRANT_PERMISSION
+ *   allows: Setting the security.capability extended attribute
+ *   without: Setting security.capability returns EPERM
+ *   condition: Attribute name is "security.capability". Checked via
+ *     capable_wrt_inode_uidgid() which considers the inode's ownership.
+ *
+ * capability: CAP_FOWNER
+ *   type: KAPI_CAP_BYPASS_CHECK
+ *   allows: Bypassing owner check for user.* on sticky directories
+ *   without: Non-owners cannot set user.* attributes on files in sticky
+ *     directories without this capability
+ *   condition: Setting user.* namespace attribute on a file in a sticky directory
+ *
+ * constraint: Filesystem support
+ *   desc: The filesystem must support extended attributes (have IOP_XATTR flag
+ *     set and provide xattr handlers). Common filesystems supporting xattrs
+ *     include ext4, XFS, Btrfs, and tmpfs. Some filesystems (e.g., FAT, older
+ *     ext2) do not support extended attributes.
+ *
+ * constraint: Filesystem-specific size limits
+ *   desc: While the VFS limit is 64KB (XATTR_SIZE_MAX), filesystems may impose
+ *     smaller limits. For example, ext4 limits all xattrs on an inode to fit
+ *     in a single filesystem block (typically 4KB). XFS and ReiserFS support
+ *     the full 64KB. Exceeding filesystem limits returns ENOSPC or E2BIG.
+ *
+ * constraint: user.* namespace restrictions
+ *   desc: The user.* namespace is only supported on regular files and directories.
+ *     Attempting to set user.* attributes on other file types (symlinks, devices,
+ *     sockets, FIFOs) returns EPERM (for write) or ENODATA (for read).
+ *
+ * constraint: LSM checks
+ *   desc: Linux Security Modules (SELinux, Smack, AppArmor) may impose additional
+ *     restrictions via security_inode_setxattr() hook. These can return various
+ *     error codes depending on the security policy. The LSM is called after
+ *     permission checks but before the actual xattr modification.
+ *
+ * examples: setxattr("/path/file", "user.comment", "test", 4, 0);  // Set user attr
+ *   setxattr("/path/file", "user.new", "val", 3, XATTR_CREATE);  // Create only
+ *   setxattr("/path/file", "user.existing", "new", 3, XATTR_REPLACE);  // Replace
+ *
+ * notes: Extended attributes provide a way to associate arbitrary metadata with
+ *   files beyond the standard stat attributes. They are commonly used for:
+ *   - SELinux security contexts (security.selinux)
+ *   - File capabilities (security.capability)
+ *   - POSIX ACLs (system.posix_acl_access, system.posix_acl_default)
+ *   - User-defined metadata (user.* namespace)
+ *
+ *   The trusted.* namespace is designed for use by privileged processes to store
+ *   data that should not be accessible to unprivileged users (e.g., during
+ *   backup/restore operations).
+ *
+ *   NFSv4 delegation support means this syscall may need to wait for remote
+ *   clients to release their delegations before the operation can complete.
+ *   This can introduce unbounded delays in pathological cases.
+ *
+ *   For security.capability specifically, the kernel may convert between v2
+ *   (non-namespaced) and v3 (namespaced) capability formats depending on the
+ *   filesystem's user namespace and caller's capabilities.
+ *
+ *   The setxattrat() syscall (added in Linux 6.17) provides more flexibility
+ *   with AT_FDCWD and AT_* flags for specifying the file location.
+ *
+ * since-version: 2.4
+ */
 SYSCALL_DEFINE5(setxattr, const char __user *, pathname,
 		const char __user *, name, const void __user *, value,
 		size_t, size, int, flags)
-- 
2.51.0


^ permalink raw reply related

* [RFC PATCH v5 08/15] kernel/api: add API specification for io_cancel
From: Sasha Levin @ 2025-12-18 20:42 UTC (permalink / raw)
  To: linux-api; +Cc: linux-doc, linux-kernel, tools, gpaoloni, Sasha Levin
In-Reply-To: <20251218204239.4159453-1-sashal@kernel.org>

Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/aio.c | 246 +++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 237 insertions(+), 9 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index f6f1b3790c88b..710517c9a990d 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -2843,15 +2843,243 @@ COMPAT_SYSCALL_DEFINE3(io_submit, compat_aio_context_t, ctx_id,
 }
 #endif
 
-/* sys_io_cancel:
- *	Attempts to cancel an iocb previously passed to io_submit.  If
- *	the operation is successfully cancelled, the resulting event is
- *	copied into the memory pointed to by result without being placed
- *	into the completion queue and 0 is returned.  May fail with
- *	-EFAULT if any of the data structures pointed to are invalid.
- *	May fail with -EINVAL if aio_context specified by ctx_id is
- *	invalid.  May fail with -EAGAIN if the iocb specified was not
- *	cancelled.  Will fail with -ENOSYS if not implemented.
+/**
+ * sys_io_cancel - Attempt to cancel an outstanding asynchronous I/O operation
+ * @ctx_id: AIO context handle returned by io_setup
+ * @iocb: Pointer to the iocb structure that was previously submitted
+ * @result: Unused parameter (historically for result storage, now ignored)
+ *
+ * long-desc: Attempts to cancel an asynchronous I/O operation that was
+ *   previously submitted via io_submit(). The syscall searches for the
+ *   specified iocb in the context's active request list and invokes the
+ *   operation-specific cancellation callback if found.
+ *
+ *   The cancellation behavior depends on the type of I/O operation:
+ *   - For poll operations (IOCB_CMD_POLL): The request is marked as cancelled
+ *     and a work item is scheduled to complete the cancellation.
+ *   - For USB gadget I/O: The USB endpoint dequeue function is called, which
+ *     triggers the completion callback with -ECONNRESET status.
+ *   - For most direct I/O operations: Cancellation is typically not supported
+ *     as these operations do not register a cancel callback.
+ *
+ *   If the iocb is found and has a registered cancellation callback, that
+ *   callback is invoked and the iocb is removed from the active request list.
+ *   The completion event is delivered via the ring buffer (not via the result
+ *   parameter, which is now unused for this purpose).
+ *
+ *   On successful cancellation initiation, the syscall returns -EINPROGRESS
+ *   (not 0) to indicate that cancellation is in progress. This is because
+ *   the actual completion may occur asynchronously via the cancel callback.
+ *
+ *   Important limitations:
+ *   - Most file I/O operations do not support cancellation
+ *   - The iocb must still be pending (not yet completed)
+ *   - The iocb must have been submitted via io_submit (aio_key == KIOCB_KEY)
+ *   - Only operations that register a ki_cancel callback can be cancelled
+ *
+ * context-flags: KAPI_CTX_PROCESS | KAPI_CTX_ATOMIC
+ *
+ * param: ctx_id
+ *   type: KAPI_TYPE_UINT
+ *   flags: KAPI_PARAM_IN
+ *   constraint-type: KAPI_CONSTRAINT_CUSTOM
+ *   constraint: Must be a valid AIO context handle previously returned by
+ *     io_setup() for the current process. The context must not have been
+ *     destroyed via io_destroy(). A value of 0 is always invalid. The handle
+ *     is actually the virtual address of the ring buffer mapping, and must
+ *     belong to the calling process's address space.
+ *
+ * param: iocb
+ *   type: KAPI_TYPE_USER_PTR
+ *   flags: KAPI_PARAM_IN | KAPI_PARAM_USER
+ *   size: sizeof(struct iocb)
+ *   constraint-type: KAPI_CONSTRAINT_USER_PTR
+ *   constraint: Must be a valid userspace pointer to a struct iocb that was
+ *     previously submitted via io_submit(). The iocb's aio_key field must
+ *     contain KIOCB_KEY (0), which is written by the kernel during io_submit.
+ *     A NULL pointer will result in EFAULT. The iocb must still be pending
+ *     (present in the context's active_reqs list) for cancellation to succeed.
+ *
+ * param: result
+ *   type: KAPI_TYPE_USER_PTR
+ *   flags: KAPI_PARAM_IN | KAPI_PARAM_USER | KAPI_PARAM_OPTIONAL
+ *   constraint-type: KAPI_CONSTRAINT_NONE
+ *   constraint: This parameter is no longer used by the kernel. It was
+ *     historically intended to receive the io_event result on successful
+ *     cancellation, but completion events are now always delivered via the
+ *     ring buffer. May be NULL.
+ *
+ * return:
+ *   type: KAPI_TYPE_INT
+ *   check-type: KAPI_RETURN_ERROR_CHECK
+ *   success: -EINPROGRESS
+ *   desc: Returns -EINPROGRESS when the cancellation callback was successfully
+ *     invoked and the request is being cancelled. This is the expected return
+ *     value on successful cancellation initiation. The completion event will
+ *     be delivered via the ring buffer. Note that this is different from the
+ *     man page which claims 0 is returned on success.
+ *
+ * error: EFAULT, Cannot read iocb from userspace
+ *   desc: Returned if the iocb pointer is invalid or points to memory that
+ *     cannot be read. Specifically, the kernel attempts to read the aio_key
+ *     field from the iocb via get_user() and returns EFAULT if this fails.
+ *     A NULL iocb pointer will trigger this error.
+ *
+ * error: EINVAL, iocb not submitted via io_submit
+ *   desc: Returned if the aio_key field of the iocb does not contain KIOCB_KEY
+ *     (which is 0). The kernel sets aio_key to KIOCB_KEY when an iocb is
+ *     successfully submitted via io_submit(). If aio_key contains a different
+ *     value, it indicates the iocb was never successfully submitted, is
+ *     corrupted, or the memory has been reused.
+ *
+ * error: EINVAL, Invalid AIO context
+ *   desc: Returned if ctx_id does not refer to a valid AIO context. This can
+ *     occur if: (1) the context was never created, (2) the context was
+ *     destroyed via io_destroy(), (3) the ctx_id is 0, (4) the ring buffer
+ *     header cannot be read from userspace, (5) the context belongs to a
+ *     different process, or (6) the context's internal ID doesn't match.
+ *
+ * error: EINVAL, iocb not found or not cancellable
+ *   desc: Returned if the specified iocb is not present in the context's
+ *     active request list. This occurs when: (1) the operation has already
+ *     completed and the completion event is in the ring buffer, (2) the
+ *     operation was never submitted to this context, (3) the iocb pointer
+ *     does not match any pending operation (comparison is by pointer value
+ *     converted to u64), or (4) the operation did not register a cancellation
+ *     callback (though in this case EINVAL comes from the default ret value).
+ *     Note: The man page documents EAGAIN for this case, but the actual
+ *     implementation returns EINVAL.
+ *
+ * error: ENOSYS, AIO not implemented
+ *   desc: Returned if the kernel was compiled without CONFIG_AIO support.
+ *     This error is returned by the syscall dispatch mechanism before the
+ *     io_cancel implementation is even reached.
+ *
+ * error: (driver-specific), Cancellation callback failed
+ *   desc: If the iocb is found and its ki_cancel callback is invoked, the
+ *     callback's return value is propagated to userspace if non-zero. For
+ *     USB gadget operations, usb_ep_dequeue() may return various errors
+ *     including EINVAL if the request wasn't queued. The aio_poll_cancel
+ *     callback always returns 0. Driver-specific cancellation functions
+ *     may return other error codes.
+ *
+ * lock: RCU read lock
+ *   type: KAPI_LOCK_RCU
+ *   desc: Acquired in lookup_ioctx() during context lookup. Protects against
+ *     concurrent modification of the mm->ioctx_table while searching for the
+ *     context. Released before any spinlocks are acquired.
+ *
+ * lock: ctx->ctx_lock
+ *   type: KAPI_LOCK_SPINLOCK
+ *   desc: Per-context spinlock acquired with interrupts disabled via
+ *     spin_lock_irq(). Held while iterating through the active_reqs list
+ *     searching for the iocb, while invoking the ki_cancel callback, and
+ *     while removing the iocb from the list. The cancel callback is invoked
+ *     with this lock held, so callbacks must not sleep and must be IRQ-safe.
+ *
+ * side-effect: KAPI_EFFECT_MODIFY_STATE
+ *   target: ctx->active_reqs list
+ *   desc: If the iocb is found and its cancellation callback is invoked, the
+ *     kiocb is removed from the context's active_reqs list via list_del_init().
+ *     This prevents the iocb from being found by subsequent io_cancel calls.
+ *   condition: iocb found and ki_cancel callback invoked
+ *   reversible: no
+ *
+ * side-effect: KAPI_EFFECT_MODIFY_STATE
+ *   target: Pending I/O operation
+ *   desc: The cancellation callback may modify the state of the underlying
+ *     I/O operation. For poll operations, the cancelled flag is set. For USB
+ *     operations, the USB request is dequeued which triggers the completion
+ *     callback. The completion event is delivered via the ring buffer.
+ *   condition: ki_cancel callback is invoked
+ *   reversible: no
+ *
+ * side-effect: KAPI_EFFECT_SCHEDULE
+ *   target: aio_poll work queue
+ *   desc: For poll operations (IOCB_CMD_POLL), the aio_poll_cancel callback
+ *     schedules a work item via schedule_work() to complete the cancellation
+ *     asynchronously. This work item will eventually deliver the completion
+ *     event to the ring buffer.
+ *   condition: Cancelling a poll operation
+ *   reversible: no
+ *
+ * state-trans: kiocb state
+ *   from: in_flight (in active_reqs list)
+ *   to: cancelling (removed from list, cancel callback invoked)
+ *   condition: iocb found and ki_cancel invoked
+ *   desc: When the iocb is found in the active_reqs list and its cancellation
+ *     callback is invoked, the kiocb transitions from in-flight to cancelling
+ *     state. The kiocb is removed from the active_reqs list, preventing
+ *     duplicate cancellation attempts. Final completion occurs asynchronously.
+ *
+ * state-trans: poll_iocb cancelled flag
+ *   from: false
+ *   to: true
+ *   condition: aio_poll_cancel is invoked
+ *   desc: For poll operations, the aio_poll_cancel callback sets the cancelled
+ *     flag on the poll_iocb structure. This signals to the poll completion
+ *     handler that the operation was cancelled rather than completed normally.
+ *
+ * constraint: Operation must support cancellation
+ *   desc: Only operations that register a ki_cancel callback can be cancelled.
+ *     Operations that don't set this callback (most direct I/O operations)
+ *     will never appear in the active_reqs list and thus cannot be cancelled.
+ *     Currently, only IOCB_CMD_POLL operations in the kernel AIO subsystem
+ *     and USB gadget operations support cancellation.
+ *
+ * constraint: Timing window for cancellation
+ *   desc: The iocb must still be pending at the time io_cancel is called.
+ *     There is an inherent race condition: the operation may complete
+ *     naturally between the time the application decides to cancel and when
+ *     io_cancel is invoked. In this case, EINVAL is returned because the
+ *     iocb is no longer in the active_reqs list.
+ *
+ * constraint: CONFIG_AIO required
+ *   desc: The kernel must be compiled with CONFIG_AIO=y for this syscall
+ *     to be available. If not configured, ENOSYS is returned.
+ *
+ * examples: io_cancel(ctx, &iocb, NULL);  // Cancel with unused result param
+ *   if (io_cancel(ctx, &iocb, NULL) == -EINPROGRESS) handle_cancellation();
+ *   ret = io_cancel(ctx, &iocb, NULL); if (ret && ret != -EINPROGRESS) error();
+ *
+ * notes: The return value semantics are unusual: -EINPROGRESS indicates
+ *   successful cancellation initiation, not an error. This is because the
+ *   actual cancellation may complete asynchronously, with the completion
+ *   event delivered via the ring buffer.
+ *
+ *   The result parameter is completely ignored by current kernels. It was
+ *   historically used to return the io_event directly, but since commit
+ *   28468cbed92e ("Revert 'fs/aio: Make io_cancel() generate completions
+ *   again'"), completion events are always delivered via the ring buffer.
+ *   Applications should use io_getevents() to retrieve the cancelled
+ *   operation's completion event.
+ *
+ *   The man page documents EAGAIN as a possible error when "the iocb specified
+ *   was not cancelled", but code analysis shows that EINVAL is actually
+ *   returned in this case. The man page is outdated in this regard.
+ *
+ *   The aio_key field must equal KIOCB_KEY (0) because the kernel writes this
+ *   value during io_submit. If an application attempts to cancel an iocb
+ *   before submitting it, or after the memory has been reused, this check
+ *   will fail with EINVAL.
+ *
+ *   For poll operations specifically, the cancellation is marked but the
+ *   actual completion may be delayed until a worker processes it. The
+ *   -EINPROGRESS return value reflects this asynchronous completion model.
+ *
+ *   USB gadget operations are an exception: when usb_ep_dequeue() is called,
+ *   it typically completes the request synchronously with -ECONNRESET status
+ *   in the completion callback.
+ *
+ *   There is no glibc wrapper for this syscall. Applications must use
+ *   syscall(SYS_io_cancel, ...) or the libaio library. The libaio wrapper
+ *   returns negative error numbers directly rather than returning -1 and
+ *   setting errno.
+ *
+ *   io_uring (since Linux 5.1) provides a more capable and widely-supported
+ *   async I/O interface with better cancellation support via IORING_OP_ASYNC_CANCEL.
+ *
+ * since-version: 2.5
  */
 SYSCALL_DEFINE3(io_cancel, aio_context_t, ctx_id, struct iocb __user *, iocb,
 		struct io_event __user *, result)
-- 
2.51.0


^ permalink raw reply related

* [RFC PATCH v5 07/15] kernel/api: add API specification for io_submit
From: Sasha Levin @ 2025-12-18 20:42 UTC (permalink / raw)
  To: linux-api; +Cc: linux-doc, linux-kernel, tools, gpaoloni, Sasha Levin
In-Reply-To: <20251218204239.4159453-1-sashal@kernel.org>

Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/aio.c | 319 +++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 308 insertions(+), 11 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index ff2a8527e1b85..f6f1b3790c88b 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -2450,17 +2450,314 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 	return err;
 }
 
-/* sys_io_submit:
- *	Queue the nr iocbs pointed to by iocbpp for processing.  Returns
- *	the number of iocbs queued.  May return -EINVAL if the aio_context
- *	specified by ctx_id is invalid, if nr is < 0, if the iocb at
- *	*iocbpp[0] is not properly initialized, if the operation specified
- *	is invalid for the file descriptor in the iocb.  May fail with
- *	-EFAULT if any of the data structures point to invalid data.  May
- *	fail with -EBADF if the file descriptor specified in the first
- *	iocb is invalid.  May fail with -EAGAIN if insufficient resources
- *	are available to queue any iocbs.  Will return 0 if nr is 0.  Will
- *	fail with -ENOSYS if not implemented.
+/**
+ * sys_io_submit - Submit asynchronous I/O operations for processing
+ * @ctx_id: AIO context handle returned by io_setup
+ * @nr: Number of I/O control blocks to submit
+ * @iocbpp: Array of pointers to iocb structures describing the operations
+ *
+ * long-desc: Submits one or more asynchronous I/O operations for processing
+ *   against a previously created AIO context. Each iocb structure describes
+ *   a single I/O operation including the operation type, file descriptor,
+ *   buffer, size, and offset.
+ *
+ *   The syscall processes iocbs sequentially from the array. If an error
+ *   occurs while processing an iocb, submission stops at that point and
+ *   the number of successfully submitted operations is returned. This means
+ *   partial submission is possible: if submitting 10 iocbs and the 5th fails,
+ *   4 is returned and iocbs 0-3 are queued for processing.
+ *
+ *   Supported operations (specified via aio_lio_opcode):
+ *   - IOCB_CMD_PREAD (0): Positioned read from file
+ *   - IOCB_CMD_PWRITE (1): Positioned write to file
+ *   - IOCB_CMD_FSYNC (2): Sync file data and metadata
+ *   - IOCB_CMD_FDSYNC (3): Sync file data only
+ *   - IOCB_CMD_POLL (5): Poll for events on file descriptor
+ *   - IOCB_CMD_NOOP (6): No operation (useful for testing)
+ *   - IOCB_CMD_PREADV (7): Positioned scatter read
+ *   - IOCB_CMD_PWRITEV (8): Positioned gather write
+ *
+ *   The iocb structure fields include:
+ *   - aio_data: User data copied to io_event on completion
+ *   - aio_lio_opcode: Operation type (one of IOCB_CMD_*)
+ *   - aio_fildes: File descriptor for the operation
+ *   - aio_buf: Buffer address (or iovec array for vectored ops)
+ *   - aio_nbytes: Buffer size (or iovec count for vectored ops)
+ *   - aio_offset: File offset for positioned operations
+ *   - aio_flags: Optional flags (IOCB_FLAG_RESFD, IOCB_FLAG_IOPRIO)
+ *   - aio_resfd: eventfd to signal on completion (if IOCB_FLAG_RESFD set)
+ *   - aio_rw_flags: Per-operation RWF_* flags
+ *   - aio_reqprio: I/O priority (if IOCB_FLAG_IOPRIO set)
+ *
+ *   After successful submission, operations complete asynchronously. Results
+ *   are delivered to the completion ring buffer and can be retrieved via
+ *   io_getevents(). If aio_resfd specifies a valid eventfd, it is signaled
+ *   when each operation completes.
+ *
+ *   The actual I/O may complete synchronously if the data is cached or if
+ *   the underlying filesystem doesn't support truly asynchronous I/O. In
+ *   such cases, the operation is still reported via the completion ring.
+ *
+ * context-flags: KAPI_CTX_PROCESS | KAPI_CTX_SLEEPABLE
+ *
+ * param: ctx_id
+ *   type: KAPI_TYPE_UINT
+ *   flags: KAPI_PARAM_IN
+ *   constraint-type: KAPI_CONSTRAINT_CUSTOM
+ *   constraint: Must be a valid AIO context handle previously returned by
+ *     io_setup() for the current process. The context must not have been
+ *     destroyed. A value of 0 is always invalid. The handle is actually
+ *     the virtual address of the ring buffer mapping.
+ *
+ * param: nr
+ *   type: KAPI_TYPE_INT
+ *   flags: KAPI_PARAM_IN
+ *   constraint-type: KAPI_CONSTRAINT_RANGE
+ *   range: 0, LONG_MAX
+ *   constraint: Must be >= 0. If 0, the syscall returns immediately with 0.
+ *     The actual number processed is capped to ctx->nr_events (the context's
+ *     capacity). Very large values are effectively limited by the context
+ *     capacity and available ring buffer slots.
+ *
+ * param: iocbpp
+ *   type: KAPI_TYPE_USER_PTR
+ *   flags: KAPI_PARAM_IN | KAPI_PARAM_USER
+ *   constraint-type: KAPI_CONSTRAINT_CUSTOM
+ *   constraint: Must be a valid userspace pointer to an array of nr pointers
+ *     to struct iocb. Each iocb pointer must itself be valid and point to a
+ *     properly initialized iocb structure. The iocb structures must have
+ *     aio_reserved2 set to 0 for forward compatibility.
+ *
+ * return:
+ *   type: KAPI_TYPE_INT
+ *   check-type: KAPI_RETURN_RANGE
+ *   success: >= 0
+ *   desc: Returns the number of iocbs successfully submitted (0 to nr). If
+ *     partial submission occurs due to an error, returns the count of
+ *     successfully submitted operations. Returns 0 if nr is 0.
+ *
+ * error: EINVAL, Invalid context or parameter
+ *   desc: Returned if ctx_id is invalid, nr is negative, aio_reserved2 is
+ *     non-zero, aio_lio_opcode is invalid, aio_buf/aio_nbytes overflow,
+ *     aio_resfd is not an eventfd, conflicting aio_rw_flags, file lacks
+ *     required operation support, invalid POLL/FSYNC parameters, or
+ *     invalid aio_reqprio class.
+ *
+ * error: EFAULT, Invalid memory access
+ *   desc: Returned if: (1) iocbpp is not a valid userspace pointer, (2) any
+ *     pointer in the iocbpp array is invalid, (3) the iocb data cannot be
+ *     copied from userspace, (4) aio_buf points to invalid memory, or
+ *     (5) the kernel cannot write the aio_key field back to userspace.
+ *
+ * error: EBADF, Bad file descriptor
+ *   desc: Returned if: (1) aio_fildes in an iocb does not refer to an open
+ *     file, (2) aio_resfd does not refer to a valid file descriptor when
+ *     IOCB_FLAG_RESFD is set, (3) the file is not opened with appropriate
+ *     mode for the operation (e.g., read on write-only file).
+ *
+ * error: EAGAIN, Resource temporarily unavailable
+ *   desc: Returned if insufficient slots are available in the completion
+ *     ring buffer. This typically means too many operations are already
+ *     in flight and the application should call io_getevents() to consume
+ *     completed events before submitting more.
+ *
+ * error: EPERM, Operation not permitted
+ *   desc: Returned if: (1) IOCB_FLAG_IOPRIO is set and aio_reqprio specifies
+ *     IOPRIO_CLASS_RT (real-time I/O priority) but the process lacks
+ *     CAP_SYS_ADMIN or CAP_SYS_NICE capability, or (2) RWF_NOAPPEND is
+ *     specified but the file has the append-only attribute (IS_APPEND).
+ *
+ * error: EOPNOTSUPP, Operation not supported
+ *   desc: Returned if: (1) unsupported aio_rw_flags are specified, (2)
+ *     RWF_NOWAIT is specified but the file doesn't support non-blocking I/O
+ *     (FMODE_NOWAIT not set), (3) RWF_ATOMIC is specified for a read or
+ *     the file doesn't support atomic writes, or (4) RWF_DONTCACHE is
+ *     specified but not supported by the filesystem or file is DAX-mapped.
+ *
+ * error: EOVERFLOW, Value too large
+ *   desc: Returned if aio_offset plus aio_nbytes would overflow and the
+ *     file does not support unsigned offsets. This check prevents reading
+ *     or writing past the maximum representable file position.
+ *
+ * error: ENOMEM, Out of memory
+ *   desc: Returned if memory allocation fails when preparing credentials
+ *     for IOCB_CMD_FSYNC operations, or if vectored I/O (preadv/pwritev)
+ *     requires allocating iovec arrays larger than the stack buffer.
+ *
+ * lock: RCU read lock
+ *   type: KAPI_LOCK_RCU
+ *   desc: Acquired during context lookup in lookup_ioctx(). Protects against
+ *     concurrent modification of the ioctx_table while looking up the
+ *     context. Released before processing any iocbs.
+ *
+ * lock: ctx->completion_lock
+ *   type: KAPI_LOCK_SPINLOCK
+ *   desc: Per-context spinlock acquired briefly during request slot allocation
+ *     via user_refill_reqs_available() if the percpu request counter is empty.
+ *     Protects the ring buffer tail and completed_events counters.
+ *
+ * lock: ctx->ctx_lock
+ *   type: KAPI_LOCK_SPINLOCK
+ *   desc: Per-context spinlock acquired when adding cancellable requests to
+ *     the active_reqs list. This enables io_cancel() to find and cancel
+ *     in-flight operations.
+ *
+ * lock: blk_plug
+ *   type: KAPI_LOCK_CUSTOM
+ *   desc: Block layer plugging is enabled when nr > 2 (AIO_PLUG_THRESHOLD)
+ *     to batch block I/O requests for better performance. This is not a
+ *     traditional lock but affects I/O scheduling.
+ *
+ * signal: any
+ *   direction: KAPI_SIGNAL_RECEIVE
+ *   action: KAPI_SIGNAL_ACTION_TRANSFORM
+ *   condition: Signal arrives during underlying read/write operation
+ *   desc: If a signal arrives during the underlying file read/write operation
+ *     and the operation returns ERESTARTSYS/ERESTARTNOINTR/etc., the error
+ *     is transformed to EINTR for the completion event. AIO operations cannot
+ *     be restarted in the traditional sense because other operations may have
+ *     already been submitted. The syscall itself (io_submit) is NOT interrupted
+ *     by signals - only the individual async operations can be.
+ *   error: -EINTR (in io_event.res, not syscall return)
+ *   timing: KAPI_SIGNAL_TIME_DURING
+ *   restartable: no
+ *
+ * side-effect: KAPI_EFFECT_ALLOC_MEMORY
+ *   target: aio_kiocb structures
+ *   desc: Allocates one aio_kiocb structure per submitted operation from the
+ *     kiocb_cachep slab cache. These structures track the in-flight operations
+ *     and are freed after completion is recorded in the ring buffer.
+ *   reversible: yes
+ *
+ * side-effect: KAPI_EFFECT_MODIFY_STATE
+ *   target: AIO context request counters
+ *   desc: Decrements the available request slot counter in the context.
+ *     Slots are reclaimed when completion events are consumed from the ring
+ *     buffer via io_getevents().
+ *   reversible: yes
+ *
+ * side-effect: KAPI_EFFECT_MODIFY_STATE
+ *   target: ctx->active_reqs list
+ *   desc: Cancellable operations (reads, writes, polls) are added to the
+ *     context's active_reqs list, enabling cancellation via io_cancel().
+ *   condition: Operation supports cancellation
+ *   reversible: yes
+ *
+ * side-effect: KAPI_EFFECT_MODIFY_STATE
+ *   target: iocb->aio_key field
+ *   desc: The kernel writes KIOCB_KEY (0) to the aio_key field of each
+ *     submitted iocb in userspace memory. This marks the iocb as submitted
+ *     and is checked by io_cancel() to validate the iocb.
+ *   reversible: no
+ *
+ * side-effect: KAPI_EFFECT_MODIFY_STATE
+ *   target: file reference count
+ *   desc: Increments the reference count of the file descriptor's struct file
+ *     via fget() for each submitted operation. The reference is released
+ *     when the operation completes (via fput() in iocb_destroy()).
+ *   reversible: yes
+ *
+ * side-effect: KAPI_EFFECT_FILESYSTEM
+ *   target: target file(s)
+ *   desc: For write operations, the file content may be modified. For fsync
+ *     operations, dirty data is flushed to storage. The actual I/O may
+ *     complete synchronously or asynchronously depending on the filesystem.
+ *   condition: IOCB_CMD_PWRITE, IOCB_CMD_PWRITEV, IOCB_CMD_FSYNC, IOCB_CMD_FDSYNC
+ *   reversible: no
+ *
+ * side-effect: KAPI_EFFECT_SCHEDULE
+ *   target: fsync work queue
+ *   desc: FSYNC and FDSYNC operations are scheduled to run on a workqueue
+ *     because vfs_fsync() can block. The operation runs asynchronously and
+ *     completion is signaled via the ring buffer.
+ *   condition: IOCB_CMD_FSYNC or IOCB_CMD_FDSYNC
+ *   reversible: no
+ *
+ * state-trans: iocb state
+ *   from: user-prepared iocb
+ *   to: submitted (aio_key set to KIOCB_KEY)
+ *   condition: successful submission of each iocb
+ *   desc: Each successfully submitted iocb transitions from user-prepared
+ *     state to submitted state, marked by the kernel writing KIOCB_KEY to
+ *     aio_key. The iocb remains in submitted state until completion.
+ *
+ * state-trans: AIO context slot availability
+ *   from: slots_available = N
+ *   to: slots_available = N - submitted_count
+ *   condition: successful submission
+ *   desc: Available slots in the context decrease by the number of successfully
+ *     submitted operations. Slots are reclaimed when io_getevents() consumes
+ *     completion events.
+ *
+ * capability: CAP_SYS_ADMIN
+ *   type: KAPI_CAP_GRANT_PERMISSION
+ *   allows: Use of IOPRIO_CLASS_RT (real-time I/O priority class)
+ *   without: Returns EPERM when attempting to use RT I/O priority
+ *   condition: IOCB_FLAG_IOPRIO set and aio_reqprio specifies IOPRIO_CLASS_RT
+ *
+ * capability: CAP_SYS_NICE
+ *   type: KAPI_CAP_GRANT_PERMISSION
+ *   allows: Use of IOPRIO_CLASS_RT (alternative to CAP_SYS_ADMIN)
+ *   without: Returns EPERM when attempting to use RT I/O priority
+ *   condition: IOCB_FLAG_IOPRIO set and aio_reqprio specifies IOPRIO_CLASS_RT
+ *
+ * constraint: Ring buffer slot availability
+ *   desc: There must be available slots in the completion ring buffer for
+ *     each operation to be submitted. If all slots are occupied by pending
+ *     completion events, submission fails with EAGAIN. The number of slots
+ *     is determined by nr_events passed to io_setup(), though internal
+ *     doubling means more slots may be available.
+ *   expr: available_slots >= 1 for each submission
+ *
+ * constraint: Valid file descriptor per iocb
+ *   desc: Each iocb must reference a valid, open file descriptor via
+ *     aio_fildes. The file must be opened with appropriate access mode
+ *     for the requested operation (read access for PREAD, write access
+ *     for PWRITE, etc.).
+ *
+ * constraint: File must support operation
+ *   desc: For read/write operations, the underlying file must implement
+ *     read_iter/write_iter file operations. For fsync, the file must
+ *     implement fsync. For poll, the file must support vfs_poll().
+ *
+ * constraint: CONFIG_AIO required
+ *   desc: The kernel must be compiled with CONFIG_AIO=y for this syscall
+ *     to be available. If not configured, returns -ENOSYS.
+ *
+ * examples: struct iocb iocb, *iocbp = &iocb; io_submit(ctx, 1, &iocbp);
+ *   struct iocb iocbs[10], *ptrs[10]; io_submit(ctx, 10, ptrs);  // Batch submit
+ *
+ * notes: Unlike traditional synchronous I/O, errors from io_submit() indicate
+ *   submission failures, not I/O errors. Actual I/O errors are reported via
+ *   the res field of struct io_event when retrieved via io_getevents().
+ *
+ *   The return value indicates how many iocbs were successfully submitted.
+ *   If this is less than nr, the application should check which operation
+ *   failed (it's the one at index = return_value) and handle the error.
+ *   Previously submitted operations in the batch are still queued.
+ *
+ *   For vectored operations (PREADV/PWRITEV), aio_buf points to an array
+ *   of struct iovec and aio_nbytes contains the iovec count. The maximum
+ *   iovec count is UIO_MAXIOV (1024).
+ *
+ *   Block layer plugging is automatically enabled for batches larger than
+ *   2 operations, improving I/O merging and reducing per-I/O overhead.
+ *
+ *   The COMPAT_SYSCALL variant handles 32-bit userspace on 64-bit kernels,
+ *   using compat_uptr_t for the iocbpp array elements.
+ *
+ *   Historical note: commit d6b2615f7d31d ("aio: simplify - and fix - fget/fput
+ *   for io_submit()") fixed file descriptor reference counting issues. Earlier
+ *   kernels could leak file references on certain error paths.
+ *
+ *   io_uring (since Linux 5.1) is a more modern and performant alternative.
+ *   Consider using io_uring_enter() for new applications requiring async I/O.
+ *
+ *   There is no glibc wrapper; use syscall(SYS_io_submit, ...) or the libaio
+ *   library. The libaio wrapper io_submit() returns negative error numbers
+ *   directly rather than returning -1 and setting errno.
+ *
+ * since-version: 2.5
  */
 SYSCALL_DEFINE3(io_submit, aio_context_t, ctx_id, long, nr,
 		struct iocb __user * __user *, iocbpp)
-- 
2.51.0


^ permalink raw reply related

* [RFC PATCH v5 06/15] kernel/api: add API specification for io_destroy
From: Sasha Levin @ 2025-12-18 20:42 UTC (permalink / raw)
  To: linux-api; +Cc: linux-doc, linux-kernel, tools, gpaoloni, Sasha Levin
In-Reply-To: <20251218204239.4159453-1-sashal@kernel.org>

Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/aio.c | 189 +++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 184 insertions(+), 5 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 36556e7a8e2c0..ff2a8527e1b85 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1646,11 +1646,190 @@ COMPAT_SYSCALL_DEFINE2(io_setup, unsigned, nr_events, u32 __user *, ctx32p)
 }
 #endif
 
-/* sys_io_destroy:
- *	Destroy the aio_context specified.  May cancel any outstanding 
- *	AIOs and block on completion.  Will fail with -ENOSYS if not
- *	implemented.  May fail with -EINVAL if the context pointed to
- *	is invalid.
+/**
+ * sys_io_destroy - Destroy an asynchronous I/O context
+ * @ctx: AIO context handle returned by io_setup
+ *
+ * long-desc: Destroys the asynchronous I/O context identified by ctx. This
+ *   syscall will attempt to cancel all outstanding asynchronous I/O operations
+ *   against the context and block until all operations have completed. Once
+ *   this syscall returns successfully, the context handle becomes invalid and
+ *   must not be used with any other io_* syscalls.
+ *
+ *   The context's memory-mapped ring buffer is unmapped from the process address
+ *   space, and all associated kernel resources are freed. The system-wide AIO
+ *   event counter (aio_nr) is decremented by the original nr_events value that
+ *   was passed to io_setup when creating this context.
+ *
+ *   This syscall blocks until all in-flight I/O operations have completed. This
+ *   ensures that userspace buffers passed to io_submit are no longer accessed
+ *   by the kernel after io_destroy returns. The wait is NOT interruptible by
+ *   signals, so callers cannot cancel this blocking behavior.
+ *
+ *   If two threads call io_destroy on the same context simultaneously, only the
+ *   first call will succeed; subsequent calls return -EINVAL as the context is
+ *   already marked as dead.
+ *
+ * context-flags: KAPI_CTX_PROCESS | KAPI_CTX_SLEEPABLE
+ *
+ * param: ctx
+ *   type: KAPI_TYPE_UINT
+ *   flags: KAPI_PARAM_IN
+ *   constraint-type: KAPI_CONSTRAINT_CUSTOM
+ *   constraint: Must be a valid context handle previously returned by io_setup.
+ *     The handle is actually the virtual address of the ring buffer mapping in
+ *     the calling process's address space. A value of 0 is always invalid.
+ *     The context must not have been previously destroyed.
+ *
+ * return:
+ *   type: KAPI_TYPE_INT
+ *   check-type: KAPI_RETURN_ERROR_CHECK
+ *   success: 0
+ *   desc: Returns 0 on success. After successful return, the context handle is
+ *     invalid and all resources have been released. All outstanding I/O
+ *     operations have completed.
+ *
+ * error: EINVAL, Invalid context
+ *   desc: The ctx argument does not refer to a valid AIO context in the calling
+ *     process. This can occur if: (1) ctx was never returned by io_setup,
+ *     (2) ctx was returned by io_setup in a different process, (3) ctx was
+ *     already destroyed by a previous io_destroy call, (4) ctx is 0 or an
+ *     arbitrary invalid value, or (5) the ring buffer at the ctx address has
+ *     been corrupted (e.g., the id field no longer matches).
+ *
+ * lock: mm->ioctx_lock
+ *   type: KAPI_LOCK_SPINLOCK
+ *   desc: Per-mm spinlock protecting the ioctx_table. Held briefly while
+ *     marking the context as dead and removing it from the process's AIO
+ *     context table.
+ *
+ * lock: RCU read lock
+ *   type: KAPI_LOCK_RCU
+ *   desc: RCU read-side critical section held during context lookup in
+ *     lookup_ioctx(). Protects against concurrent modification of the
+ *     ioctx_table.
+ *
+ * lock: ctx->ctx_lock
+ *   type: KAPI_LOCK_SPINLOCK
+ *   desc: Per-context spinlock held while cancelling outstanding I/O requests
+ *     in free_ioctx_users(). Protects the active_reqs list.
+ *
+ * lock: mmap_lock
+ *   type: KAPI_LOCK_RWLOCK
+ *   desc: Process memory map write lock acquired during vm_munmap() when
+ *     unmapping the ring buffer. May contend with other memory operations
+ *     in the same process.
+ *
+ * side-effect: KAPI_EFFECT_MODIFY_STATE
+ *   target: ctx->dead flag
+ *   desc: Atomically sets the context's dead flag to 1, marking it as being
+ *     destroyed. This prevents new I/O submissions and ensures subsequent
+ *     io_destroy calls return -EINVAL.
+ *   reversible: no
+ *
+ * side-effect: KAPI_EFFECT_MODIFY_STATE
+ *   target: mm->ioctx_table
+ *   desc: Removes the context from the process's AIO context table by setting
+ *     the corresponding table entry to NULL. After this, lookup_ioctx will
+ *     no longer find this context.
+ *   reversible: no
+ *
+ * side-effect: KAPI_EFFECT_MODIFY_STATE
+ *   target: aio_nr (global counter)
+ *   desc: Decrements the system-wide AIO context counter by the context's
+ *     max_reqs value (the nr_events originally passed to io_setup). This
+ *     counter is visible via /proc/sys/fs/aio-nr.
+ *   reversible: no
+ *
+ * side-effect: KAPI_EFFECT_MODIFY_STATE
+ *   target: process virtual memory
+ *   desc: Unmaps the ring buffer from the process's address space via
+ *     vm_munmap(). The memory region at ctx becomes invalid.
+ *   condition: ctx->mmap_size > 0
+ *   reversible: no
+ *
+ * side-effect: KAPI_EFFECT_FREE_MEMORY
+ *   target: kioctx structure and associated resources
+ *   desc: Frees the AIO context structure, percpu data, ring buffer pages, and
+ *     the anonymous file backing the ring buffer. Deferred via RCU work queue
+ *     to ensure safe cleanup after all references are dropped.
+ *   reversible: no
+ *
+ * side-effect: KAPI_EFFECT_SIGNAL_SEND
+ *   target: outstanding AIO operations
+ *   desc: Cancels all outstanding asynchronous I/O operations by invoking their
+ *     ki_cancel callbacks. The specific effect depends on the operation type
+ *     (read, write, fsync, poll).
+ *   condition: active_reqs list is not empty
+ *   reversible: no
+ *
+ * state-trans: AIO context state
+ *   from: alive (ctx->dead == 0)
+ *   to: dead (ctx->dead == 1)
+ *   condition: successful atomic exchange in kill_ioctx
+ *   desc: The context transitions from usable to destroyed. Once dead, the
+ *     context cannot be used for any operations and will be freed after all
+ *     references are dropped.
+ *
+ * state-trans: process AIO state
+ *   from: has AIO context(s)
+ *   to: context removed (or no contexts)
+ *   condition: successful io_destroy
+ *   desc: The destroyed context is removed from the process's context table.
+ *     If this was the only context, the process no longer has any active
+ *     AIO contexts.
+ *
+ * state-trans: system AIO resources
+ *   from: aio_nr = N
+ *   to: aio_nr = N - max_reqs
+ *   condition: successful io_destroy
+ *   desc: System-wide AIO resource counter decreases, making room for other
+ *     processes to create new AIO contexts.
+ *
+ * constraint: CONFIG_AIO required
+ *   desc: The kernel must be compiled with CONFIG_AIO=y for this syscall to be
+ *     available. If not configured, the syscall returns -ENOSYS. This is
+ *     typically enabled by default but may be disabled on embedded systems.
+ *
+ * constraint: Context must belong to calling process
+ *   desc: Each AIO context is bound to a specific process (mm_struct). A context
+ *     created by one process cannot be destroyed by another process, even if
+ *     the context handle value is somehow known.
+ *   expr: ctx belongs to current->mm
+ *
+ * examples: io_destroy(ctx);  // Destroy context and wait for completion
+ *   if (io_destroy(ctx) == -EINVAL) handle_error();  // Invalid context
+ *
+ * notes: The man page documents EFAULT as a possible error, but code analysis
+ *   shows that EFAULT conditions (e.g., invalid ring buffer pointer) actually
+ *   result in EINVAL being returned, as lookup_ioctx returns NULL on any
+ *   failure to access the ring buffer header.
+ *
+ *   This syscall blocks in TASK_UNINTERRUPTIBLE state while waiting for
+ *   outstanding I/O operations to complete. This means the process cannot be
+ *   interrupted by signals during this wait. In extreme cases with very slow
+ *   I/O devices, this could cause the process to appear hung.
+ *
+ *   Historical note: Before kernel 3.11, io_destroy blocked waiting for I/O
+ *   completion. A refactoring in 3.11 accidentally removed this behavior,
+ *   creating a race where userspace buffers could be freed while the kernel
+ *   was still using them. This was fixed by commit e02ba72aabfa that blocks
+ *   io_destroy until all context requests are completed.
+ *
+ *   Race condition handling: A race between io_destroy and io_submit was fixed
+ *   by commit 7137c6bd4552. A race between io_setup and io_destroy was fixed
+ *   by commit 86b62a2cb4fc. Both fixes ensure proper synchronization via
+ *   reference counting.
+ *
+ *   io_uring (since Linux 5.1) is a more modern alternative that provides better
+ *   performance and more features. Consider using io_uring for new applications.
+ *
+ *   There is no glibc wrapper for this syscall. Use syscall(SYS_io_destroy, ctx)
+ *   or the libaio library wrapper io_destroy(). Note: libaio has slightly
+ *   different error semantics, returning negative error numbers directly instead
+ *   of -1 with errno.
+ *
+ * since-version: 2.5
  */
 SYSCALL_DEFINE1(io_destroy, aio_context_t, ctx)
 {
-- 
2.51.0


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox