Linux userland API discussions
* Re: [RFC PATCH v3 1/6] uapi: add goldfish_address_space userspace ABI header
From: Arnd Bergmann @ 2026-04-13 16:28 UTC (permalink / raw)
  To: Wenzhao Liao, rust-for-linux, linux-pci
  Cc: Miguel Ojeda, Danilo Krummrich, bhelgaas,
	Krzysztof Wilczyński, Greg Kroah-Hartman, linux-kernel,
	linux-api
In-Reply-To: <20260406165120.166928-2-wenzhaoliao@ruc.edu.cn>

On Mon, Apr 6, 2026, at 18:51, Wenzhao Liao wrote:

> +struct goldfish_address_space_allocate_block {
> +	__u64 size;
> +	__u64 offset;
> +	__u64 phys_addr;
> +};
> +
> +struct goldfish_address_space_ping {
> +	__u64 offset;
> +	__u64 size;
> +	__u64 metadata;
> +	__u32 version;
> +	__u32 wait_fd;
> +	__u32 wait_flags;
> +	__u32 direction;
> +};
> +
> +struct goldfish_address_space_claim_shared {
> +	__u64 offset;
> +	__u64 size;
> +};

All these ioctl structures are well-formed in the sense that they
are portable across architectures and won't leak kernel data
through implicit padding.
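
The "no implicit padding" property can be checked at compile time. The following is a minimal sketch, assuming only the layout of struct goldfish_address_space_ping quoted above; the PING_MEMBER_BYTES helper macro is hypothetical and not part of the proposed header.

```c
#include <assert.h>      /* static_assert (C11) */
#include <stddef.h>      /* offsetof */
#include <linux/types.h> /* __u32, __u64 */

/* Layout copied from the quoted patch. */
struct goldfish_address_space_ping {
	__u64 offset;
	__u64 size;
	__u64 metadata;
	__u32 version;
	__u32 wait_fd;
	__u32 wait_flags;
	__u32 direction;
};

/* Hypothetical helper: sum of the member sizes, 3 * 8 + 4 * 4 = 40 bytes. */
#define PING_MEMBER_BYTES (3 * sizeof(__u64) + 4 * sizeof(__u32))

/* If the compiler inserted padding, sizeof would exceed the member sum. */
static_assert(sizeof(struct goldfish_address_space_ping) == PING_MEMBER_BYTES,
	      "struct contains implicit padding");

/* The 64-bit members come first, so the 32-bit ones start at offset 24. */
static_assert(offsetof(struct goldfish_address_space_ping, version) == 24,
	      "unexpected hole before 'version'");

/* A size that is a multiple of 8 avoids tail padding on 64-bit ABIs. */
static_assert(sizeof(struct goldfish_address_space_ping) % 8 == 0,
	      "struct would need tail padding");
```

Because the four __u32 members come in an even count after the __u64 members, the size is the same whether __u64 is 4-byte or 8-byte aligned.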

Two of the members are a bit worrying, but that may just
be my own understanding:

- the 'phys_addr' member sounds like it refers to a physical
  memory location in the CPU address space, which in general
  should not be visible to user space, as that tends to
  expose security problems if users with access to the
  device can use this to access data they should not.

- the 'version' field may refer to the version of the ioctl
  command, which is similarly discouraged since it is
  harder to deal with than just coming up with new ioctl
  command codes. If this refers to the version of the
  remote side, this is probably fine.

> +#define GOLDFISH_ADDRESS_SPACE_IOCTL_MAGIC 'G'
> +
> +#define GOLDFISH_ADDRESS_SPACE_IOCTL_OP(OP, T) \
> +	_IOWR(GOLDFISH_ADDRESS_SPACE_IOCTL_MAGIC, OP, T)

I think it would be better to remove this intermediate macro, since
it prevents easy scraping of ioctl command codes from looking
at the source file with regular expressions.

It is also unusual that all commands are both reading
and writing the data. Please check if you can make some
of them read-only or write-only.
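
As an illustration of both points, the definitions could be spelled out directly with the standard _IOW/_IOWR macros. This is only a sketch: the EXAMPLE_* names, the command numbers (10, 13) and the choice of direction for each command are assumptions made for the example and are not taken from the patch.

```c
#include <assert.h>      /* static_assert (C11) */
#include <linux/ioctl.h> /* _IOW, _IOWR, _IOC_DIR, _IOC_WRITE */
#include <linux/types.h> /* __u64 */

/* Layouts copied from the quoted patch. */
struct goldfish_address_space_allocate_block {
	__u64 size;
	__u64 offset;
	__u64 phys_addr;
};

struct goldfish_address_space_claim_shared {
	__u64 offset;
	__u64 size;
};

#define GOLDFISH_ADDRESS_SPACE_IOCTL_MAGIC 'G'

/*
 * Hypothetical command numbers and directions, for illustration only.
 * Each command uses _IOW/_IOWR on its own line, so a regular expression
 * matching "_IO[RW]*(" can find every command code in the header.
 */
#define EXAMPLE_IOCTL_ALLOCATE_BLOCK \
	_IOWR(GOLDFISH_ADDRESS_SPACE_IOCTL_MAGIC, 10, \
	      struct goldfish_address_space_allocate_block)
#define EXAMPLE_IOCTL_CLAIM_SHARED \
	_IOW(GOLDFISH_ADDRESS_SPACE_IOCTL_MAGIC, 13, \
	     struct goldfish_address_space_claim_shared)

/* A write-only command encodes only the "userspace writes" direction bit. */
static_assert(_IOC_DIR(EXAMPLE_IOCTL_CLAIM_SHARED) == _IOC_WRITE,
	      "claim_shared is expected to be write-only in this sketch");
```

Whether a given command can really be write-only depends on whether the driver copies anything back to user space, which only the driver code can answer.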

     Arnd

^ permalink raw reply

* [RFC PATCH v2 2/2] selftest: add tests for mkdirat2()
From: Jori Koolstra @ 2026-04-12 13:54 UTC (permalink / raw)
  To: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Alexander Viro, Christian Brauner,
	Arnd Bergmann
  Cc: H . Peter Anvin, Jan Kara, Peter Zijlstra, Andrey Albershteyn,
	Masami Hiramatsu, Jori Koolstra, Jiri Olsa, Thomas Weißschuh,
	Mathieu Desnoyers, Jeff Layton, Aleksa Sarai, cmirabil,
	Greg Kroah-Hartman, linux-kernel, linux-fsdevel, linux-api,
	linux-arch
In-Reply-To: <20260412135434.3095416-1-jkoolstra@xs4all.nl>

Add some tests for the new mkdirat2() syscall to test compliance and
to showcase its behaviour.

Signed-off-by: Jori Koolstra <jkoolstra@xs4all.nl>
---
 tools/include/uapi/asm-generic/unistd.h       |   5 +-
 .../testing/selftests/filesystems/.gitignore  |   1 +
 tools/testing/selftests/filesystems/Makefile  |   4 +-
 .../selftests/filesystems/mkdirat_fd_test.c   | 143 ++++++++++++++++++
 4 files changed, 150 insertions(+), 3 deletions(-)
 create mode 100644 tools/testing/selftests/filesystems/mkdirat_fd_test.c

diff --git a/tools/include/uapi/asm-generic/unistd.h b/tools/include/uapi/asm-generic/unistd.h
index a627acc8fb5f..6efc21779b62 100644
--- a/tools/include/uapi/asm-generic/unistd.h
+++ b/tools/include/uapi/asm-generic/unistd.h
@@ -863,8 +863,11 @@ __SYSCALL(__NR_listns, sys_listns)
 #define __NR_rseq_slice_yield 471
 __SYSCALL(__NR_rseq_slice_yield, sys_rseq_slice_yield)
 
+#define __NR_mkdirat2 472
+__SYSCALL(__NR_mkdirat2, sys_mkdirat2)
+
 #undef __NR_syscalls
-#define __NR_syscalls 472
+#define __NR_syscalls 473
 
 /*
  * 32 bit systems traditionally used different
diff --git a/tools/testing/selftests/filesystems/.gitignore b/tools/testing/selftests/filesystems/.gitignore
index 64ac0dfa46b7..84e2175d171f 100644
--- a/tools/testing/selftests/filesystems/.gitignore
+++ b/tools/testing/selftests/filesystems/.gitignore
@@ -5,3 +5,4 @@ fclog
 file_stressor
 anon_inode_test
 kernfs_test
+mkdirat_fd_test
diff --git a/tools/testing/selftests/filesystems/Makefile b/tools/testing/selftests/filesystems/Makefile
index 85427d7f19b9..7357769db57a 100644
--- a/tools/testing/selftests/filesystems/Makefile
+++ b/tools/testing/selftests/filesystems/Makefile
@@ -1,7 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0
 
-CFLAGS += $(KHDR_INCLUDES)
-TEST_GEN_PROGS := devpts_pts file_stressor anon_inode_test kernfs_test fclog
+CFLAGS += $(KHDR_INCLUDES) $(TOOLS_INCLUDES)
+TEST_GEN_PROGS := devpts_pts file_stressor anon_inode_test kernfs_test fclog mkdirat_fd_test
 TEST_GEN_PROGS_EXTENDED := dnotify_test
 
 include ../lib.mk
diff --git a/tools/testing/selftests/filesystems/mkdirat_fd_test.c b/tools/testing/selftests/filesystems/mkdirat_fd_test.c
new file mode 100644
index 000000000000..a02c0223d63b
--- /dev/null
+++ b/tools/testing/selftests/filesystems/mkdirat_fd_test.c
@@ -0,0 +1,143 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <errno.h>
+#include <fcntl.h>
+#include <limits.h>
+#include <sys/stat.h>
+
+#include <asm-generic/unistd.h>
+
+#include "kselftest_harness.h"
+
+#ifndef VALID_MKDIRAT2_FLAGS
+#define VALID_MKDIRAT2_FLAGS (AT_SYMLINK_NOFOLLOW | AT_NO_AUTOMOUNT)
+#endif
+
+#define mkdirat2_checked_flags(dfd, pathname, flags) ({		\
+	struct stat __st;					\
+	int __fd = sys_mkdirat2(dfd, pathname, S_IRWXU, flags);	\
+	ASSERT_GE(__fd, 0);					\
+	EXPECT_EQ(fstat(__fd, &__st), 0);			\
+	EXPECT_TRUE(S_ISDIR(__st.st_mode));			\
+	__fd;							\
+})
+
+#define mkdirat2_checked(dfd, pathname) \
+	mkdirat2_checked_flags(dfd, pathname, 0)
+
+
+static inline int sys_mkdirat2(int dfd, const char *pathname, mode_t mode,
+				 unsigned int flags)
+{
+	return syscall(__NR_mkdirat2, dfd, pathname, mode, flags);
+}
+
+FIXTURE(mkdirat2) {
+	char dirpath[PATH_MAX];
+	int dfd;
+};
+
+FIXTURE_SETUP(mkdirat2)
+{
+	snprintf(self->dirpath, sizeof(self->dirpath),
+		 "/tmp/mkdirat2_test.%d", getpid());
+	ASSERT_EQ(mkdir(self->dirpath, S_IRWXU), 0);
+
+	self->dfd = open(self->dirpath, O_DIRECTORY);
+	ASSERT_GE(self->dfd, 0);
+}
+
+FIXTURE_TEARDOWN(mkdirat2)
+{
+	close(self->dfd);
+	rmdir(self->dirpath);
+}
+
+/* Does mkdirat2 return a fd at all */
+TEST_F(mkdirat2, returns_fd)
+{
+	int fd = mkdirat2_checked(self->dfd, "newdir");
+	EXPECT_EQ(close(fd), 0);
+	EXPECT_EQ(unlinkat(self->dfd, "newdir", AT_REMOVEDIR), 0);
+}
+
+/* The fd must refer to the directory that was just created. */
+TEST_F(mkdirat2, fd_is_created_dir)
+{
+	int fd;
+	struct stat st_via_fd, st_via_path;
+	char path[PATH_MAX];
+
+	fd = mkdirat2_checked(self->dfd, "checkdir");
+
+	ASSERT_EQ(fstat(fd, &st_via_fd), 0);
+
+	snprintf(path, sizeof(path), "%s/checkdir", self->dirpath);
+	ASSERT_EQ(stat(path, &st_via_path), 0);
+
+	EXPECT_EQ(st_via_fd.st_ino, st_via_path.st_ino);
+	EXPECT_EQ(st_via_fd.st_dev, st_via_path.st_dev);
+
+	EXPECT_EQ(close(fd), 0);
+	EXPECT_EQ(rmdir(path), 0);
+}
+
+
+/* Missing parent component must fail with ENOENT. */
+TEST_F(mkdirat2, enoent_missing_parent)
+{
+	EXPECT_EQ(sys_mkdirat2(self->dfd, "nonexistent/child", S_IRWXU, 0), -1);
+	EXPECT_EQ(errno, ENOENT);
+}
+
+/* An invalid dfd must fail with EBADF. */
+TEST_F(mkdirat2, ebadf)
+{
+	EXPECT_EQ(sys_mkdirat2(-42, "badfdir", S_IRWXU, 0), -1);
+	EXPECT_EQ(errno, EBADF);
+}
+
+/* A dfd that points to a file (not a directory) must fail with ENOTDIR. */
+TEST_F(mkdirat2, enotdir_dfd)
+{
+	int file_fd;
+
+	file_fd = openat(self->dfd, "file",
+			 O_CREAT | O_WRONLY, S_IRWXU);
+	ASSERT_GE(file_fd, 0);
+
+	EXPECT_EQ(sys_mkdirat2(file_fd, "subdir", S_IRWXU, 0), -1);
+	EXPECT_EQ(errno, ENOTDIR);
+
+	EXPECT_EQ(close(file_fd), 0);
+	EXPECT_EQ(unlinkat(self->dfd, "file", 0), 0);
+}
+
+/*
+ * The returned fd must be usable as a dfd for further *at() calls.
+ */
+TEST_F(mkdirat2, fd_usable_as_dfd)
+{
+	int parent_fd, child_fd;
+
+	parent_fd = mkdirat2_checked(self->dfd, "parent");
+	child_fd = mkdirat2_checked(parent_fd, "child");
+
+	EXPECT_EQ(close(child_fd), 0);
+	EXPECT_EQ(close(parent_fd), 0);
+
+	char path[PATH_MAX];
+	snprintf(path, sizeof(path), "%s/parent/child", self->dirpath);
+	EXPECT_EQ(rmdir(path), 0);
+	snprintf(path, sizeof(path), "%s/parent", self->dirpath);
+	EXPECT_EQ(rmdir(path), 0);
+}
+
+/* Unknown flags must be rejected with EINVAL. */
+TEST_F(mkdirat2, einval_unknown_flags)
+{
+	EXPECT_EQ(sys_mkdirat2(self->dfd, "flagsdir", S_IRWXU, ~VALID_MKDIRAT2_FLAGS), -1);
+	EXPECT_EQ(errno, EINVAL);
+}
+
+TEST_HARNESS_MAIN
-- 
2.53.0


^ permalink raw reply related

* [RFC PATCH v2 1/2] vfs: syscalls: add mkdirat2() that returns an O_DIRECTORY fd
From: Jori Koolstra @ 2026-04-12 13:54 UTC (permalink / raw)
  To: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Alexander Viro, Christian Brauner,
	Arnd Bergmann
  Cc: H . Peter Anvin, Jan Kara, Peter Zijlstra, Andrey Albershteyn,
	Masami Hiramatsu, Jori Koolstra, Jiri Olsa, Thomas Weißschuh,
	Mathieu Desnoyers, Jeff Layton, Aleksa Sarai, cmirabil,
	Greg Kroah-Hartman, linux-kernel, linux-fsdevel, linux-api,
	linux-arch
In-Reply-To: <20260412135434.3095416-1-jkoolstra@xs4all.nl>

Currently there is no way to race-freely create and open a directory.
For regular files we have open(O_CREAT) for creating a new file inode,
and returning a pinning fd to it. The lack of such functionality for
directories means that when populating a directory tree there's always
a race involved: the inodes first need to be created, and then opened
to adjust their permissions/ownership/labels/timestamps/acls/xattrs/...,
but in the time window between the creation and the opening they might
be replaced by something else.

Addressing this race without proper APIs is possible (by immediately
fstat()ing what was opened, to verify that it has the right inode type),
but difficult to get right. Hence, mkdirat2() that creates a directory
and returns an O_DIRECTORY fd is useful.
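
For comparison, the userspace workaround described above might look roughly like the following. This is a minimal sketch using only existing calls; the helper name mkdir_then_open() is hypothetical. Note that it narrows the race but does not close it: the fstat() check confirms that a directory was opened, not that it is the directory that was just created.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a directory, then open it.
 * Returns an O_CLOEXEC directory fd, or -1 with errno set.
 */
int mkdir_then_open(int dfd, const char *name, mode_t mode)
{
	struct stat st;
	int fd, saved_errno;

	assert(name != NULL);

	if (mkdirat(dfd, name, mode) < 0)
		return -1;

	/* Race window: "name" can be replaced before openat() runs. */
	fd = openat(dfd, name, O_RDONLY | O_DIRECTORY | O_NOFOLLOW | O_CLOEXEC);
	if (fd < 0)
		return -1;

	/* Verify the inode type of what was actually opened. */
	if (fstat(fd, &st) < 0) {
		saved_errno = errno;
	} else if (!S_ISDIR(st.st_mode)) {
		saved_errno = ENOTDIR;
	} else {
		return fd;
	}

	close(fd);
	errno = saved_errno;
	return -1;
}
```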

This feature idea (and description) is taken from the UAPI group:
https://github.com/uapi-group/kernel-features?tab=readme-ov-file#race-free-creation-and-opening-of-non-file-inodes

Signed-off-by: Jori Koolstra <jkoolstra@xs4all.nl>
---
 arch/x86/entry/syscalls/syscall_64.tbl |  1 +
 fs/internal.h                          |  2 ++
 fs/namei.c                             | 44 +++++++++++++++++++++++---
 include/linux/syscalls.h               |  2 ++
 include/uapi/asm-generic/unistd.h      |  5 ++-
 scripts/syscall.tbl                    |  1 +
 6 files changed, 50 insertions(+), 5 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 524155d655da..e200ca2067a4 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -396,6 +396,7 @@
 469	common	file_setattr		sys_file_setattr
 470	common	listns			sys_listns
 471	common	rseq_slice_yield	sys_rseq_slice_yield
+472	common	mkdirat2		sys_mkdirat2
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/fs/internal.h b/fs/internal.h
index cbc384a1aa09..c6a79afadacf 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -59,6 +59,8 @@ int may_linkat(struct mnt_idmap *idmap, const struct path *link);
 int filename_renameat2(int olddfd, struct filename *oldname, int newdfd,
 		 struct filename *newname, unsigned int flags);
 int filename_mkdirat(int dfd, struct filename *name, umode_t mode);
+struct file *do_file_mkdirat(int dfd, struct filename *name, umode_t mode,
+		unsigned int flags, bool open);
 int filename_mknodat(int dfd, struct filename *name, umode_t mode, unsigned int dev);
 int filename_symlinkat(struct filename *from, int newdfd, struct filename *to);
 int filename_linkat(int olddfd, struct filename *old, int newdfd,
diff --git a/fs/namei.c b/fs/namei.c
index a880454a6415..6451e96dc225 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -5255,18 +5255,36 @@ struct dentry *vfs_mkdir(struct mnt_idmap *idmap, struct inode *dir,
 }
 EXPORT_SYMBOL(vfs_mkdir);
 
-int filename_mkdirat(int dfd, struct filename *name, umode_t mode)
+static int mkdirat_lookup_flags(unsigned int flags)
+{
+	int lookup_flags = LOOKUP_DIRECTORY;
+
+	if (!(flags & AT_SYMLINK_NOFOLLOW))
+		lookup_flags |= LOOKUP_FOLLOW;
+	if (!(flags & AT_NO_AUTOMOUNT))
+		lookup_flags |= LOOKUP_AUTOMOUNT;
+
+	return lookup_flags;
+}
+
+int filename_mkdirat(int dfd, struct filename *name, umode_t mode) {
+	return PTR_ERR_OR_ZERO(do_file_mkdirat(dfd, name, mode, 0, false));
+}
+
+struct file *do_file_mkdirat(int dfd, struct filename *name, umode_t mode,
+		unsigned int flags, bool open)
 {
 	struct dentry *dentry;
 	struct path path;
 	int error;
-	unsigned int lookup_flags = LOOKUP_DIRECTORY;
+	struct file *filp = NULL;
+	unsigned int lookup_flags = mkdirat_lookup_flags(flags);
 	struct delegated_inode delegated_inode = { };
 
 retry:
 	dentry = filename_create(dfd, name, &path, lookup_flags);
 	if (IS_ERR(dentry))
-		return PTR_ERR(dentry);
+		return ERR_CAST(dentry);
 
 	error = security_path_mkdir(&path, dentry,
 			mode_strip_umask(path.dentry->d_inode, mode));
@@ -5276,6 +5294,10 @@ int filename_mkdirat(int dfd, struct filename *name, umode_t mode)
 		if (IS_ERR(dentry))
 			error = PTR_ERR(dentry);
 	}
+	if (open && !error && !is_delegated(&delegated_inode)) {
+		const struct path new_path = { .mnt = path.mnt, .dentry = dentry };
+		filp = dentry_open(&new_path, O_DIRECTORY, current_cred());
+	}
 	end_creating_path(&path, dentry);
 	if (is_delegated(&delegated_inode)) {
 		error = break_deleg_wait(&delegated_inode);
@@ -5286,7 +5308,21 @@ int filename_mkdirat(int dfd, struct filename *name, umode_t mode)
 		lookup_flags |= LOOKUP_REVAL;
 		goto retry;
 	}
-	return error;
+	if (error)
+		return ERR_PTR(error);
+	return filp;
+}
+
+#define VALID_MKDIRAT2_FLAGS (AT_SYMLINK_NOFOLLOW | AT_NO_AUTOMOUNT)
+
+SYSCALL_DEFINE4(mkdirat2, int, dfd, const char __user *, pathname, umode_t, mode,
+		unsigned int, flags)
+{
+	CLASS(filename, name)(pathname);
+	if (flags & ~VALID_MKDIRAT2_FLAGS)
+		return -EINVAL;
+
+	return FD_ADD(O_CLOEXEC, do_file_mkdirat(dfd, name, mode, flags, true));
 }
 
 SYSCALL_DEFINE3(mkdirat, int, dfd, const char __user *, pathname, umode_t, mode)
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 02bd6ddb6278..b3b4ae26dbdd 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -999,6 +999,8 @@ asmlinkage long sys_lsm_get_self_attr(unsigned int attr, struct lsm_ctx __user *
 asmlinkage long sys_lsm_set_self_attr(unsigned int attr, struct lsm_ctx __user *ctx,
 				      u32 size, u32 flags);
 asmlinkage long sys_lsm_list_modules(u64 __user *ids, u32 __user *size, u32 flags);
+asmlinkage long sys_mkdirat2(int dfd, const char __user *pathname, umode_t mode,
+				     unsigned int flags);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index a627acc8fb5f..6efc21779b62 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -863,8 +863,11 @@ __SYSCALL(__NR_listns, sys_listns)
 #define __NR_rseq_slice_yield 471
 __SYSCALL(__NR_rseq_slice_yield, sys_rseq_slice_yield)
 
+#define __NR_mkdirat2 472
+__SYSCALL(__NR_mkdirat2, sys_mkdirat2)
+
 #undef __NR_syscalls
-#define __NR_syscalls 472
+#define __NR_syscalls 473
 
 /*
  * 32 bit systems traditionally used different
diff --git a/scripts/syscall.tbl b/scripts/syscall.tbl
index 7a42b32b6577..9d86f29762ae 100644
--- a/scripts/syscall.tbl
+++ b/scripts/syscall.tbl
@@ -412,3 +412,4 @@
 469	common	file_setattr			sys_file_setattr
 470	common	listns				sys_listns
 471	common	rseq_slice_yield		sys_rseq_slice_yield
+472	common	mkdirat2			sys_mkdirat2
-- 
2.53.0


^ permalink raw reply related

* [RFC PATCH v2 0/2] vfs: syscalls: add mkdirat2() that returns an O_DIRECTORY fd
From: Jori Koolstra @ 2026-04-12 13:54 UTC (permalink / raw)
  To: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Alexander Viro, Christian Brauner,
	Arnd Bergmann
  Cc: H . Peter Anvin, Jan Kara, Peter Zijlstra, Andrey Albershteyn,
	Masami Hiramatsu, Jori Koolstra, Jiri Olsa, Thomas Weißschuh,
	Mathieu Desnoyers, Jeff Layton, Aleksa Sarai, cmirabil,
	Greg Kroah-Hartman, linux-kernel, linux-fsdevel, linux-api,
	linux-arch

This series implements the mkdirat2() syscall that was suggested over
at the UAPI group kernel feature page [1] with some tests.

Obviously, we probably also want to implement equivalent mknodat2() and
symlinkat2() syscalls, but I believe their implementation can be done in
much the same way.

This has been compiled and tested on x86 only.

[1]: https://github.com/uapi-group/kernel-features?tab=readme-ov-file#race-free-creation-and-opening-of-non-file-inodes

v2:
- Use AT_* flags.
- Ensure an fd is allocated only if mkdir and dentry_open() succeed.
- The returned fd gets O_CLOEXEC by default.
- Renamed syscall from mkdirat_fd() to mkdirat2().

Jori Koolstra (2):
  vfs: syscalls: add mkdirat2() that returns an O_DIRECTORY fd
  selftest: add tests for mkdirat2()

 arch/x86/entry/syscalls/syscall_64.tbl        |   1 +
 fs/internal.h                                 |   2 +
 fs/namei.c                                    |  44 +++++-
 include/linux/syscalls.h                      |   2 +
 include/uapi/asm-generic/unistd.h             |   5 +-
 scripts/syscall.tbl                           |   1 +
 tools/include/uapi/asm-generic/unistd.h       |   5 +-
 .../testing/selftests/filesystems/.gitignore  |   1 +
 tools/testing/selftests/filesystems/Makefile  |   4 +-
 .../selftests/filesystems/mkdirat_fd_test.c   | 143 ++++++++++++++++++
 10 files changed, 200 insertions(+), 8 deletions(-)
 create mode 100644 tools/testing/selftests/filesystems/mkdirat_fd_test.c

-- 
2.53.0


^ permalink raw reply

* Re: Avoid reading /sys/kernel/mm/transparent_hugepage/?
From: H.J. Lu @ 2026-04-11  0:12 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Florian Weimer, GNU C Library, linux-kernel, linux-arch,
	linux-api
In-Reply-To: <d095cc40-5217-4318-ae2e-40e5fe3be47a@p183>

On Fri, Apr 10, 2026 at 4:35 PM Alexey Dobriyan <adobriyan@gmail.com> wrote:
>
> On Fri, Apr 10, 2026 at 03:40:30PM +0800, H.J. Lu wrote:
> > On Fri, Apr 10, 2026 at 3:28 PM Florian Weimer <fweimer@redhat.com> wrote:
> > >
> > > * H. J. Lu:
> > >
> > > > To enable THP segment load, ld.so opens and reads 2 files under
> > > > /sys/kernel/mm/transparent_hugepage/.   This requires mounting
> > > > /sys and is expensive.   Is it possible to put such info in vDSO?
> > >
> > > Alexey Dobriyan proposed adding AT_PAGE_SHIFT_LIST to the auxiliary
> >
> > Does it cover
> >
> > [hjl@gnu-tgl-3 linux]$ cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
> > 2097152
> > [hjl@gnu-tgl-3 linux]$
> >
> > > vector a while back, but I don't know the status of that.
>
> Status: nothing happened.
>
> > How can we get
> >
> > [hjl@gnu-tgl-3 linux]$ cat /sys/kernel/mm/transparent_hugepage/enabled
> > always [madvise] never
> > [hjl@gnu-tgl-3 linux]$
>
> This is not covered, see the link:
> https://lore.kernel.org/lkml/ecb049aa-bcac-45c7-bbb1-4612d094935a@p183/
>
> PAGE_SHIFT_MASK should probably be folded into a system call.

We need a fast way to check THP status for THP segment load.
A system call to return /sys/kernel/mm/transparent_hugepage/enabled
and /sys/kernel/mm/transparent_hugepage/hpage_pmd_size should
work.
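
For context, the file-based approach being discussed amounts to something like the sketch below. It is not the actual ld.so code; the helper names read_small_file() and thp_active_mode() are hypothetical, and only the sysfs output format shown earlier in the thread ("always [madvise] never") is assumed.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: read a small sysfs file into buf and
 * NUL-terminate it. Returns the number of bytes read, or -1.
 */
ssize_t read_small_file(const char *path, char *buf, size_t buflen)
{
	ssize_t n;
	int fd;

	assert(buflen > 0);

	fd = open(path, O_RDONLY | O_CLOEXEC);
	if (fd < 0)
		return -1;
	n = read(fd, buf, buflen - 1);
	close(fd);
	if (n < 0)
		return -1;
	buf[n] = '\0';
	return n;
}

/*
 * Hypothetical helper: copy the bracketed (currently active) choice
 * from a line such as "always [madvise] never" into out.
 * Returns 0 on success, -1 if no bracketed word fits in out.
 */
int thp_active_mode(const char *line, char *out, size_t outlen)
{
	const char *lb, *rb;
	size_t len;

	assert(line != NULL && out != NULL);

	lb = strchr(line, '[');
	if (!lb)
		return -1;
	rb = strchr(lb, ']');
	if (!rb)
		return -1;

	len = (size_t)(rb - lb) - 1;
	if (len >= outlen)
		return -1;
	memcpy(out, lb + 1, len);
	out[len] = '\0';
	return 0;
}
```

Each lookup costs an open/read/close sequence and depends on /sys being mounted, which is the overhead the thread is trying to avoid.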

-- 
H.J.

^ permalink raw reply

* Re: [RFC] Modernizing Linux authentication logs (lastlog, btmp, utmp, wtmp) with SQLite
From: Thorsten Kukuk @ 2026-04-10 12:38 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Roman Bakshansky, linux-api, linux-kernel, audit, libc-alpha
In-Reply-To: <87cy175zrg.fsf@mid.deneb.enyo.de>

On Fri, Mar 13, 2026 at 7:51 PM Florian Weimer <fw@deneb.enyo.de> wrote:
>
> * Roman Bakshansky:
>
> > The full RFC, including preliminary database schemas and API drafts,
> > is available in the discussion repository:
> >
> >      https://github.com/bakshansky/linux-auth-logs
>
> I don't understand how SQLite (without a daemon) addresses the locking
> issue.  WAL mode still uses fcntl locking.

It doesn't, that's why wtmpdb is using a daemon for this.
With pam_lastlog2, the messages aren't important or reliable enough to
justify the overhead. But if you want, you would need to introduce a
daemon, too.
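
For readers unfamiliar with the primitive in question, an fcntl() advisory lock looks roughly like this. The sketch only shows the locking call itself, not how SQLite uses it internally; the helper names are hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: try to take an exclusive advisory lock on the
 * whole file without blocking. Returns 0 on success, or -1 with errno
 * set (EACCES or EAGAIN if another process holds a conflicting lock).
 */
int try_lock_whole_file(int fd)
{
	struct flock fl = {
		.l_type = F_WRLCK,    /* exclusive (write) lock */
		.l_whence = SEEK_SET,
		.l_start = 0,
		.l_len = 0,           /* 0 means "until end of file" */
	};

	assert(fd >= 0);
	return fcntl(fd, F_SETLK, &fl);
}

/* Hypothetical helper: release the lock taken above. */
int unlock_whole_file(int fd)
{
	struct flock fl = {
		.l_type = F_UNLCK,
		.l_whence = SEEK_SET,
		.l_start = 0,
		.l_len = 0,
	};

	assert(fd >= 0);
	return fcntl(fd, F_SETLK, &fl);
}
```

These locks are advisory and owned by the process, so every program that opens the database directly takes part in the locking protocol. Routing access through a single daemon avoids that.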

Regards,
Thorsten

-- 
Thorsten Kukuk, Distinguished Engineer, Future Technologies
SUSE Software Solutions Germany GmbH, Frankenstraße 146, 90461
Nuernberg, Germany
Geschäftsführer: Jochen Jaser, Andrew McDonald, Werner Knoblich, (HRB
36809, AG Nürnberg)

^ permalink raw reply

* Re: [RFC] Modernizing Linux authentication logs (lastlog, btmp, utmp, wtmp) with SQLite
From: Thorsten Kukuk @ 2026-04-10 12:35 UTC (permalink / raw)
  To: linux-api, linux-kernel, audit, libc-alpha; +Cc: Roman Bakshansky
In-Reply-To: <20260313144508.GA5446@cventin.lip.ens-lyon.fr>

On Fri, Mar 13, 2026 at 3:45 PM Vincent Lefevre <vincent@vinc17.net> wrote:
>
> On 2026-03-13 10:59:11 -0300, Adhemerval Zanella Netto wrote:
> > From the glibc standpoint my plan is just to make the accounting database
> > function no-op [1] (I hope to get this into the next 2.44 release).
> >
> > And I think Thorsten Kukuk already adapted most of the usages in current
> > distros [2][3] using similar strategy, along with a better systemd
> > integration.  I am not sure if/when distros are incorporating his work.
> >
> > [1] https://patchwork.sourceware.org/project/glibc/list/?series=37271
> > [2] https://www.thkukuk.de/blog/Y2038_glibc_lastlog_64bit/
> > [3] https://www.thkukuk.de/blog/Y2038_glibc_utmp_64bit/
>
> FYI, utmp has been reintroduced in Debian for libutempter (and thus
> applications that use this library), because systemd was not working
> or at least not sufficiently documented:
>
>   https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1125682

They reintroduced the old "hack" to get "wall" working without solving
the underlying problem.
The same thing will now happen again: everyone running xterm will get
the wall message in every terminal.
People not using a terminal (most normal users, as opposed to
developers) will not see the message at all, because web browsers and
other graphical applications don't show it.
The correct solution is for desktop environments to register a
session and, when there is a wall message, show it in a dialog of
their own, so that everybody gets the message once, rather than one
person getting it 50 times and the others not at all.

Regards,
Thorsten

-- 
Thorsten Kukuk, Distinguished Engineer, Future Technologies
SUSE Software Solutions Germany GmbH, Frankenstraße 146, 90461
Nuernberg, Germany
Geschäftsführer: Jochen Jaser, Andrew McDonald, Werner Knoblich, (HRB
36809, AG Nürnberg)

^ permalink raw reply

* Re: Avoid reading /sys/kernel/mm/transparent_hugepage/?
From: Alexey Dobriyan @ 2026-04-10  8:37 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Florian Weimer, GNU C Library, linux-kernel, linux-arch,
	linux-api
In-Reply-To: <CAMe9rOpf2f8u4ng+nnaqEYB3bUvvVPu3mGv7bt=5xfzDHcMOFg@mail.gmail.com>

On Fri, Apr 10, 2026 at 03:40:30PM +0800, H.J. Lu wrote:
> On Fri, Apr 10, 2026 at 3:28 PM Florian Weimer <fweimer@redhat.com> wrote:
> >
> > * H. J. Lu:
> >
> > > To enable THP segment load, ld.so opens and reads 2 files under
> > > /sys/kernel/mm/transparent_hugepage/.   This requires mounting
> > > /sys and is expensive.   Is it possible to put such info in vDSO?
> >
> > Alexey Dobriyan proposed adding AT_PAGE_SHIFT_LIST to the auxiliary
> 
> Does it cover
> 
> [hjl@gnu-tgl-3 linux]$ cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
> 2097152
> [hjl@gnu-tgl-3 linux]$
> 
> > vector a while back, but I don't know the status of that.

Status: nothing happened.

> How can we get
> 
> [hjl@gnu-tgl-3 linux]$ cat /sys/kernel/mm/transparent_hugepage/enabled
> always [madvise] never
> [hjl@gnu-tgl-3 linux]$

This is not covered, see the link:
https://lore.kernel.org/lkml/ecb049aa-bcac-45c7-bbb1-4612d094935a@p183/

PAGE_SHIFT_MASK should probably be folded into a system call.

^ permalink raw reply

* Re: Avoid reading /sys/kernel/mm/transparent_hugepage/?
From: H.J. Lu @ 2026-04-10  7:40 UTC (permalink / raw)
  To: Florian Weimer
  Cc: GNU C Library, Alexey Dobriyan, linux-kernel, linux-arch,
	linux-api
In-Reply-To: <lhupl47e0lc.fsf@oldenburg.str.redhat.com>

On Fri, Apr 10, 2026 at 3:28 PM Florian Weimer <fweimer@redhat.com> wrote:
>
> * H. J. Lu:
>
> > To enable THP segment load, ld.so opens and reads 2 files under
> > /sys/kernel/mm/transparent_hugepage/.   This requires mounting
> > /sys and is expensive.   Is it possible to put such info in vDSO?
>
> Alexey Dobriyan proposed adding AT_PAGE_SHIFT_LIST to the auxiliary

Does it cover

[hjl@gnu-tgl-3 linux]$ cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
2097152
[hjl@gnu-tgl-3 linux]$

> vector a while back, but I don't know the status of that.
>

How can we get

[hjl@gnu-tgl-3 linux]$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
[hjl@gnu-tgl-3 linux]$

-- 
H.J.

^ permalink raw reply

* Re: Avoid reading /sys/kernel/mm/transparent_hugepage/?
From: Florian Weimer @ 2026-04-10  7:27 UTC (permalink / raw)
  To: H.J. Lu; +Cc: GNU C Library, Alexey Dobriyan, linux-kernel, linux-arch,
	linux-api
In-Reply-To: <CAMe9rOrk20jCXO_Bun4LK6M3fd_8HzEtAu94FW+-xSkwNiOt7w@mail.gmail.com>

* H. J. Lu:

> To enable THP segment load, ld.so opens and reads 2 files under
> /sys/kernel/mm/transparent_hugepage/.   This requires mounting
> /sys and is expensive.   Is it possible to put such info in vDSO?

Alexey Dobriyan proposed adding AT_PAGE_SHIFT_LIST to the auxiliary
vector a while back, but I don't know the status of that.
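
AT_PAGE_SHIFT_LIST is only a proposal, so it cannot be shown here. As an analogy, the sketch below reads the existing AT_PAGESZ entry with getauxval(), which illustrates how auxiliary-vector data reaches a process without opening any file. The helper name auxv_page_size() is hypothetical.

```c
#include <assert.h>
#include <sys/auxv.h> /* getauxval, AT_PAGESZ */

/*
 * Hypothetical helper: return the base page size that the kernel
 * placed in the auxiliary vector at exec time. No file is opened
 * and no system call is needed for the lookup itself.
 */
unsigned long auxv_page_size(void)
{
	unsigned long sz = getauxval(AT_PAGESZ);

	/* getauxval() returns 0 when the entry is missing. */
	assert(sz != 0);
	return sz;
}
```

An entry describing the supported huge page sizes could presumably be read the same way, but it would still not report the runtime "enabled" setting.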

Thanks,
Florian


^ permalink raw reply

* Re: [RFC PATCH 1/2] vfs: syscalls: add mkdirat_fd()
From: Aleksa Sarai @ 2026-04-09  7:58 UTC (permalink / raw)
  To: Jori Koolstra
  Cc: Mateusz Guzik, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, Alexander Viro,
	Christian Brauner, Jeff Layton, Chuck Lever, Arnd Bergmann,
	Shuah Khan, Greg Kroah-Hartman, H. Peter Anvin, Jan Kara,
	Alexander Aring, Peter Zijlstra, Oleg Nesterov,
	Andrey Albershteyn, Jiri Olsa, Mathieu Desnoyers,
	Thomas Weißschuh, Namhyung Kim, Arnaldo Carvalho de Melo,
	linux-kernel, linux-fsdevel, linux-api, linux-arch,
	linux-kselftest, cmirabil, Masami Hiramatsu (Google)
In-Reply-To: <52244194.1650546.1775684643065@kpc.webmail.kpnmail.nl>

On 2026-04-08, Jori Koolstra <jkoolstra@xs4all.nl> wrote:
> 
> > On 02-04-2026 04:52 CEST, Aleksa Sarai <cyphar@cyphar.com> wrote:
> > 
> > Please do not use O_* flags! O_CLOEXEC takes up 3 flag bits on different
> > architectures which makes adding new flags a nightmare.
> > 
> > I think this should take AT_* flags and (like most newer syscalls)
> > O_CLOEXEC should be automatically set. Userspace can unset it with
> > fcntl(F_SETFD) in the relatively rare case where they don't want
> > O_CLOEXEC.
> 
> And then do something like statx_lookup_flags() does to build the lookup
> flags from those AT flags?

Yeah, that is the usual pattern for *at(2) syscalls.

> But there is also no AT_ROOT_CONTAINED (or whatever you would want to
> call the RESOLVE_IN_ROOT AT-equivalent) right now.

This point about AT_* flags was a separate point to my hopes that we
could get this into openat2(2). We don't have AT_* equivalents to
RESOLVE_* flags because it would burn too many bits (at least 5,
likely more when we add RESOLVE_NO_DOTDOT and other such extensions) and
openat2(2) is actually sufficient for almost all operations in practice.

> > Alternatively, we could just bite the bullet and make
> > AT_NO_CLOEXEC a thing...
> 
> What's the bullet to bite there?

It's not a big deal but it just burns another generic AT_* flag bit for
something that userspace can do themselves with fcntl(2).

Maybe having it will encourage future syscall authors to default to
O_CLOEXEC, but we could end up with the slightly silly AT_SYMLINK_FOLLOW
/ AT_SYMLINK_NOFOLLOW situation too. Documenting it in
seemingly-rarely-read Documentation/process/adding-syscalls.rst might
end up being equally (in)effective without burning a flag bit... :/

-- 
Aleksa Sarai
https://www.cyphar.com/

^ permalink raw reply

* Re: [RFC PATCH 1/2] vfs: syscalls: add mkdirat_fd()
From: Aleksa Sarai @ 2026-04-09  7:45 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Jori Koolstra, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, Alexander Viro,
	Christian Brauner, Jeff Layton, Chuck Lever, Arnd Bergmann,
	Shuah Khan, Greg Kroah-Hartman, H. Peter Anvin, Jan Kara,
	Alexander Aring, Peter Zijlstra, Oleg Nesterov,
	Andrey Albershteyn, Jiri Olsa, Mathieu Desnoyers,
	Thomas Weißschuh, Namhyung Kim, Arnaldo Carvalho de Melo,
	linux-kernel, linux-fsdevel, linux-api, linux-arch,
	linux-kselftest, cmirabil, Masami Hiramatsu (Google)
In-Reply-To: <CAGudoHH7z8CwAXMxAxTbjfovRBpne5f19Tz0okMh7_6G9NfQ-Q@mail.gmail.com>

On 2026-04-07, Mateusz Guzik <mjguzik@gmail.com> wrote:
> On Thu, Apr 2, 2026 at 4:52 AM Aleksa Sarai <cyphar@cyphar.com> wrote:
> >
> > On 2026-04-01, Mateusz Guzik <mjguzik@gmail.com> wrote:
> > > Trying to handle this in open() is a no-go. openat2 is rather
> > > problematic.
> >
> > I'm interested in what makes you say that. It would be very nice to be able
> > to do mkdir + RESOLVE_IN_ROOT and get an fd back all in one syscall. :D
> >
> 
> Not handling this in either of open or openat2 does not preclude mkdir
> + RESOLVE_IN_ROOT + getting a fd in one go from existing.

Well, that would also require passing RESOLVE_* flags to mkdirat2(2)
which kind of begs the question why not just integrate it into
openat2(2) -- otherwise there will always be more features available to
O_CREAT than mkdirat2(2) which seems unfortunate.

> Creating a directory was always a different syscall than creating a
> file. I don't see any benefit to squeezing it into open. I do see a
> downside because of an extra branchfest to differentiate the cases.

Ah, so it's just an issue of taste, not a technical problem (as the mail
I replied to made it sound)?

> > > The routine would have to start with validating the passed O_ flags, for
> > > now only allowing O_CLOEXEC and EINVAL-ing otherwise.
> >
> > Please do not use O_* flags! O_CLOEXEC takes up 3 flag bits on different
> > architectures which makes adding new flags a nightmare.
> >
> 
> With my proposal there are no new flags added so I don't think that's relevant.

I'm confused, was "the new routine would have to start with validating
the passed O_ flags" talking about a hypothetical API you oppose? It
read like a suggestion on my first pass-through, hence the reply.

If you're saying that your proposal doesn't add any new O_* (or
MKDIRAT_*) flags that really isn't the issue -- any syscall that takes a
flag argument will grow new flags eventually and using the literal value
of O_CLOEXEC for some other syscall's flags just leads to burning three
flag bits needlessly.

This is arguably the most painful thing about open_tree(2)'s flags --
most other syscalls define their own flag that is equivalent to
O_CLOEXEC but not literally equal to it (this is even recommended in
Documentation/process/adding-syscalls.rst!).

> > I think this should take AT_* flags and (like most newer syscalls)
> > O_CLOEXEC should be automatically set. Userspace can unset it with
> > fcntl(F_SETFD) in the relatively rare case where they don't want
> > O_CLOEXEC. Alternatively, we could just bite the bullet and make
> > AT_NO_CLOEXEC a thing...
> >
> 
> I would say that's a pretty weird discrepancy vs what normally happens
> with other syscalls, but perhaps it would be fine.

Quite a few of the newer uAPIs do this -- all of the pidfd APIs do it,
as well as newer ioctls that return fds (like the NS_GET_* ioctls for
nsfs).

Clearing O_CLOEXEC safely is trivial but safely setting it is not really
possible in multi-threaded programs (see "man 2 openat"), so it makes
more sense for newer APIs to just default to O_CLOEXEC and userspace can
unset it (and that is what newer APIs already do).

We should probably update Documentation/process/adding-syscalls.rst to
mention this...

-- 
Aleksa Sarai
https://www.cyphar.com/


^ permalink raw reply

* Re: [RFC PATCH 1/2] vfs: syscalls: add mkdirat_fd()
From: David Laight @ 2026-04-09  7:44 UTC (permalink / raw)
  To: Jori Koolstra
  Cc: Mateusz Guzik, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, Alexander Viro,
	Christian Brauner, Jeff Layton, Chuck Lever, Arnd Bergmann,
	Shuah Khan, Greg Kroah-Hartman, H. Peter Anvin, Jan Kara,
	Alexander Aring, Peter Zijlstra, Oleg Nesterov,
	Andrey Albershteyn, Jiri Olsa, Mathieu Desnoyers,
	Thomas Weißschuh, Namhyung Kim, Arnaldo Carvalho de Melo,
	Aleksa Sarai, linux-kernel, linux-fsdevel, linux-api, linux-arch,
	linux-kselftest, cmirabil, Masami Hiramatsu (Google)
In-Reply-To: <1333067272.1649333.1775682991132@kpc.webmail.kpnmail.nl>

On Wed, 8 Apr 2026 23:16:31 +0200 (CEST)
Jori Koolstra <jkoolstra@xs4all.nl> wrote:

> > Op 07-04-2026 11:00 CEST schreef Mateusz Guzik <mjguzik@gmail.com>:
...
> > I am not saying it's impossible. I am saying mkdir was always a
> > separate codepath and in order to change that you would need to add a
> > branchfest to open. I don't see any reason to go that route.

The open code is complex enough that an extra branch won't matter. 

> That's a fair point. But there are also upsides like Aleksa has mentioned.
> I'm not very opinionated on the matter, especially since I don't know why
> those paths were ever separated.

I doubt they were ever joined.
mkdir() is more likely to have been separated from mknod() when the code
to add the "." and ".." entries was moved into the kernel filesystem code.
I'm not sure when that would have happened, mvdir() was done in userspace
with the link() and unlink() system calls until (at least) the mid 1980s.
It was probably the complexity of locking in SMP kernels that made both "."
and ".." be 'canned' names rather than just references to another directory.
(Yes, it used to be easy to make ".." refer to the 'wrong' place and get
find to loop.)
Of course, this all predates Linux.

	David

> 
> Thanks,
> Jori.
> 


^ permalink raw reply

* Re: [RFC PATCH 1/2] vfs: syscalls: add mkdirat_fd()
From: Jori Koolstra @ 2026-04-08 21:44 UTC (permalink / raw)
  To: Aleksa Sarai, Mateusz Guzik
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Alexander Viro, Christian Brauner, Jeff Layton,
	Chuck Lever, Arnd Bergmann, Shuah Khan, Greg Kroah-Hartman,
	H. Peter Anvin, Jan Kara, Alexander Aring, Peter Zijlstra,
	Oleg Nesterov, Andrey Albershteyn, Jiri Olsa, Mathieu Desnoyers,
	Thomas Weißschuh, Namhyung Kim, Arnaldo Carvalho de Melo,
	linux-kernel, linux-fsdevel, linux-api, linux-arch,
	linux-kselftest, cmirabil, Masami Hiramatsu (Google)
In-Reply-To: <2026-04-02-aged-convex-snowbird-foxes-Ym20JZ@cyphar.com>


> Op 02-04-2026 04:52 CEST schreef Aleksa Sarai <cyphar@cyphar.com>:
> 
> Please do not use O_* flags! O_CLOEXEC takes up 3 flag bits on different
> architectures which makes adding new flags a nightmare.
> 
> I think this should take AT_* flags and (like most newer syscalls)
> O_CLOEXEC should be automatically set. Userspace can unset it with
> fcntl(F_SETFD) in the relatively rare case where they don't want
> O_CLOEXEC.

And then do something like statx_lookup_flags() does to build the lookup
flags from those AT flags? But there is also no AT_ROOT_CONTAINED (or whatever
you would want to call the RESOLVE_IN_ROOT AT-equivalent) right now.

> Alternatively, we could just bite the bullet and make
> AT_NO_CLOEXEC a thing...

What's the bullet to bite there?

> 
> But yes, new syscalls *absolutely* need to take some kind of flag
> argument. I'd hoped we finally learned our lesson on that one...
> 
> -- 
> Aleksa Sarai
> https://www.cyphar.com/

Thanks,
Jori.

^ permalink raw reply

* Re: [RFC PATCH 1/2] vfs: syscalls: add mkdirat_fd()
From: Jori Koolstra @ 2026-04-08 21:16 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Alexander Viro, Christian Brauner, Jeff Layton,
	Chuck Lever, Arnd Bergmann, Shuah Khan, Greg Kroah-Hartman,
	H. Peter Anvin, Jan Kara, Alexander Aring, Peter Zijlstra,
	Oleg Nesterov, Andrey Albershteyn, Jiri Olsa, Mathieu Desnoyers,
	Thomas Weißschuh, Namhyung Kim, Arnaldo Carvalho de Melo,
	Aleksa Sarai, linux-kernel, linux-fsdevel, linux-api, linux-arch,
	linux-kselftest, cmirabil, Masami Hiramatsu (Google)
In-Reply-To: <CAGudoHE-zSfiL2aVf41UHOtMsE53gCqLpVoy-NxoB8HeXtdgEA@mail.gmail.com>


> Op 07-04-2026 11:00 CEST schreef Mateusz Guzik <mjguzik@gmail.com>:
> 
>  
> On Wed, Apr 1, 2026 at 12:25 PM Jori Koolstra <jkoolstra@xs4all.nl> wrote:
> >
> >
> > > Op 01-04-2026 06:19 CEST schreef Mateusz Guzik <mjguzik@gmail.com>:
> > >
> > >
> > > On Tue, Mar 31, 2026 at 07:19:58PM +0200, Jori Koolstra wrote:
> > > > @@ -5286,7 +5290,25 @@ int filename_mkdirat(int dfd, struct filename *name, umode_t mode)
> > > >             lookup_flags |= LOOKUP_REVAL;
> > > >             goto retry;
> > > >     }
> > > > +
> > > > +   if (!error && (flags & MKDIRAT_FD_NEED_FD)) {
> > > > +           struct path new_path = { .mnt = path.mnt, .dentry = dentry };
> > > > +           error = FD_ADD(0, dentry_open(&new_path, O_DIRECTORY, current_cred()));
> > > > +   }
> > > > +   end_creating_path(&path, dentry);
> > > >     return error;
> > >
> > >
> > > You can't do it like this. Should it turn out no fd can be allocated,
> > > the entire thing is going to error out while keeping the newly created
> > > directory behind. You need to allocate the fd first, then do the hard
> > > work, and only then fd_install or free the fd. The FD_ADD machinery
> > > can probably still be used provided proper wrapping of the real new
> > > mkdir.
> >
> > But isn't this exactly what happens in open(O_CREAT) too? Eventually we
> > call
> >                 error = dir_inode->i_op->create(idmap, dir_inode, dentry,
> >                                                 mode, open_flag & O_EXCL);
> >
> > and only then do we assign and install the fd. AFAIK there is no cleanup
> > happening there either if the FD_ADD step fails. You will just have a
> > regular file and no descriptor. But I would have to test this to be sure.
> >
> 
> FD_ADD(how->flags, do_file_open(dfd, name, &op)) means the fd itself is
> allocated up front and only then does file creation happen, which is
> how I'm saying it should be done. With your patch the directory is
> created first and the possibly failing fd allocation happens later.

Err, you're right. I understand what you mean now. That does need to be fixed.
I misremembered how FD_ADD works. I'll get back to this over the weekend.

> > > Trying to handle this in open() is a no-go. openat2 is rather
> > > problematic.
> >
> > I don't think that is necessarily true. It turned out O_CREAT | O_DIRECTORY
> > was bugged for a very long time. Christian Brauner fixed it eventually, and
> > that combination now returns EINVAL. But I think there is nothing really
> > stopping us from implementing that combination in the expected way, apart
> > from whatever reasons there were for not allowing this in the first place,
> > which I don't know about (maybe mixing semantics?)
> >
> 
> I am not saying it's impossible. I am saying mkdir was always a
> separate codepath and in order to change that you would need to add a
> branchfest to open. I don't see any reason to go that route.

That's a fair point. But there are also upsides like Aleksa has mentioned.
I'm not very opinionated on the matter, especially since I don't know why
those paths were ever separated.

Thanks,
Jori.

^ permalink raw reply

* Re: [PATCH 0/9] Kernel API Specification Framework
From: Geert Uytterhoeven @ 2026-04-08 12:05 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Jakub Kicinski, linux-api, linux-kernel, linux-doc, linux-fsdevel,
	linux-kbuild, linux-kselftest, workflows, tools, x86,
	Thomas Gleixner, Paul E. McKenney, Greg Kroah-Hartman,
	Jonathan Corbet, Dmitry Vyukov, Randy Dunlap, Cyril Hrubis,
	Kees Cook, Jake Edge, David Laight, Askar Safin, Gabriele Paoloni,
	Mauro Carvalho Chehab, Christian Brauner, Alexander Viro,
	Andrew Morton, Masahiro Yamada, Shuah Khan, Ingo Molnar,
	Arnd Bergmann
In-Reply-To: <abZTg9ZwnE5J4qXa@laps>

Hi Sasha,

On Sun, 15 Mar 2026 at 07:36, Sasha Levin <sashal@kernel.org> wrote:
> On Sat, Mar 14, 2026 at 11:18:22AM -0700, Jakub Kicinski wrote:
> >On Fri, 13 Mar 2026 11:09:10 -0400 Sasha Levin wrote:
> >> This enables static analysis tools to verify userspace API usage at compile
> >> time, test generation based on formal specifications, consistent error handling
> >> validation, automated documentation generation, and formal verification of
> >> kernel interfaces.
> >
> >Could you give some examples? We have machine readable descriptions for
> >Netlink interfaces, we approached syzbot folks and they did not really
> >seem to care for those.
>
> Once the API is in a machine-readable format, we can write formatters to
> output whatever downstream tools need. The kapi tool in the series
> already ships with plain text, JSON, and RST formatters, and adding new
> output formats is straightforward. We don't need to convince the
> syzkaller folks to consume our specs, we can just output them in a
> format that syzkaller already understands.
>
> For example, I have a syzlang formatter that produces the following
> from the sys_read spec in this series:
>
>    # --- read ---
>    # Read data from a file descriptor
>    #
>    # @context process, sleepable
>    #
>    # @capability CAP_DAC_OVERRIDE: Bypass discretionary access control on read permission
>    # @capability CAP_DAC_READ_SEARCH: Bypass read permission checks on regular files
>    #
>    # @error EPERM (-1): Returned by fanotify permission events...
>    # @error EINTR (-4): The call was interrupted by a signal before any data was read.
>    # @error EIO (-5): A low-level I/O error occurred.
>    # @error EBADF (-9): fd is not a valid file descriptor, or fd was not opened for reading.
>    # @error EAGAIN (-11): O_NONBLOCK set and read would block.
>    # @error EACCES (-13): LSM denied the read operation via security_file_permission().
>    # @error EFAULT (-14): buf points outside the accessible address space.
>    # @error EISDIR (-21): fd refers to a directory.
>    # @error EINVAL (-22): fd not suitable for reading, O_DIRECT misaligned, count negative...
>    # @error ENODATA (-61): Data not available in cache...
>    # @error EOVERFLOW (-75): File position plus count would exceed LLONG_MAX.
>    # @error EOPNOTSUPP (-95): Read not supported for this file type...
>    # @error ENOBUFS (-105): Buffer too small for complete notification...

The actual E-values are positive, so I guess you want e.g. -EPERM?

Note that the actual errno values are architecture-specific.
E.g. EOPNOTSUPP can be 45, 95, 122, or 223.
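
One way a formatter could sidestep both issues is to carry symbolic
names and resolve them against the build target's <errno.h>. The
sketch below is only an illustration of that idea, not part of the
kapi tool; the table and the ex_neg_errno helper are hypothetical.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

struct ex_errno_name {
	const char *name;
	int value;	/* positive, as defined by this target's errno.h */
};

static const struct ex_errno_name ex_errno_table[] = {
	{ "EPERM", EPERM },
	{ "EINTR", EINTR },
	{ "EIO", EIO },
	{ "EBADF", EBADF },
	{ "EOPNOTSUPP", EOPNOTSUPP },
};

/*
 * Resolve a symbolic errno name to the negated value a kernel-side
 * function would return on this target. Returns 0 for unknown names.
 */
int ex_neg_errno(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(ex_errno_table) / sizeof(ex_errno_table[0]); i++) {
		if (strcmp(ex_errno_table[i].name, name) == 0)
			return -ex_errno_table[i].value;
	}
	return 0;
}
```

With this shape the spec text would say "-EOPNOTSUPP" and the number is
filled in per architecture at build time instead of being hard-coded.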

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply

* Re: [PATCH 1/4] exec: inherit HWCAPs from the parent process
From: Mark Rutland @ 2026-04-07 15:23 UTC (permalink / raw)
  To: Andrei Vagin
  Cc: Will Deacon, Kees Cook, Andrew Morton, Marek Szyprowski,
	Cyrill Gorcunov, Mike Rapoport, Alexander Mikhalitsyn,
	linux-kernel, linux-fsdevel, linux-mm, criu, Catalin Marinas,
	linux-arm-kernel, Chen Ridong, Christian Brauner,
	David Hildenbrand, Eric Biederman, Lorenzo Stoakes, Michal Koutny,
	Alexander Mikhalitsyn, Linux API
In-Reply-To: <CAEWA0a7AZiuy1F+0LDxtEtJpdu=zA-RKhPb1wcDMpy2tSMFO5g@mail.gmail.com>

On Fri, Mar 27, 2026 at 05:21:26PM -0700, Andrei Vagin wrote:
> Hi Mark,
> 
> I understand all these points and they are valid. However, as I
> mentioned, we are not trying to introduce a mechanism that will strictly
> enforce feature sets for every container. While we would like to have
> that functionality, as you and Will mentioned, it would require
> substantially more complexity to address, and maintainers would be
> unlikely to pick up that complexity.

The crux of my complaint here is that unless you do that (to some
degree), this is not going to work reliably, even with the constraints
you outline.

Further, I disagree with your proposed solution of pushing more
constraints onto userspace (to also consider HWCAPs as overriding other
mechanisms, etc).

I think that as-is, the approach is flawed.

> Even masking ID registers on a per-container basis would introduce
> extra complexity that could make architecture maintainers unhappy.
> There were a few attempts to introduce container CPUID masking on
> x86_64 in the past.

> In CRIU, we are not aiming to handle every possible workload. Our goal
> is to target workloads where developers are ready to cooperate and
> willing to make adjustments to be C/R compatible. The goal here is to
> provide developers with clear instructions on what they can do to ensure
> their applications are C/R compatible. When I say "workloads", I mean
> this in a broad sense. A container might pack a set of tools with
> different runtimes (Go, Java, libc-based). All these runtimes should
> detect only allowed features.

I do not think that arbitrary applications (and libraries!) should have
to pick up additional constraints that are unnecessary without CRIU,
especially where that goes against deliberate design decisions (e.g.
features in arm64's HINT instruction space, which are designed to be
usable in fast paths WITHOUT needing explicit checks of things like
HWCAPs). Note that those typically *do* have kernel controls.

I think there's a much larger problem space than you anticipate, and
adding an incomplete solution now is just going to introduce a
maintenance burden.

> Returning to the subject of this patchset: this series extends the role
> of hwcaps. With this change, we would establish that hwcaps is the
> "source of truth" for which features an application can safely use. Any
> other features available on the current CPU would not be guaranteed to
> remain available after migration to another machine.
> 
> After this discussion, I found that the current version missed one major
> thing: there should be a signal indicating that hwcaps must be used for
> feature detection. Since we will need to integrate this interface into
> libc, Go, and other runtimes, they definitely should not rely just on
> hwcaps by default, especially in the early stages. This can be solved
> via the prctl command.  Libraries like libc would call
> prctl(PR_USER_HWCAP_ENABLED). If this returns true, the runtime knows
> that only the features explicitly listed in hwcaps should be used.

I do not think we should be pushing that shape of constraint onto
userspace.

> You are right, the controlled feature set will be limited to features
> the kernel knows about. And yes, we would need to report CPU features in
> hwcaps even if the kernel isn't directly involved in handling them.

To be clear, that is not what I am arguing.

As I mentioned before, the way this works on arm64 is that the kernel
only exposes what it is aware of, even in the ID regs accessible to
userspace. We usually *can* hide features, and do that for cases of
mismatched big.LITTLE, virtual machines, etc.

> Honestly, I am not certain if this is the "right" interface for that,
> and I would be happy to consider other ideas. I understand that these
> hwcaps will not work right out of the box, but we need a way to solve
> this problem. Having a centralized API for CPU/kernel feature detection
> seems like the right direction.

I think that for better or worse the approach you are taking here simply
does not solve enough of the problem to actually be worthwhile.

> As for signal frame size and extended states like SVE/SME, we are aware
> of this problem.  However, it is partly mitigated by the fact that if
> an application does not use some features, those states are not placed
> in the signal frame.

That is not true. The kernel can and will create signal frames for
architectural state that a task might never have touched.

Generally arm64 creates signal frames for features when the feature
*exists*, regardless of whether the task has actively manipulated the
relevant state. For example, on systems with SVE a trivial SVE signal
frame gets created even if a task only uses the FPSIMD registers, and on
systems with SME a TPIDR2 signal frame gets created even if the task has
never read/written TPIDR2.

When restoring, an unrecognised signal frame is treated as invalid, and
we can require that certain signal frames are present.

> In the future, when we construct/reload a signal frame, we could look
> at a process feature set for a process and generate a frame according
> to those features...

When you say 'we' here, are you talking about within the kernel, or
within the userspace C/R mechanism?

Mark.

^ permalink raw reply

* Re: [RFC PATCH 1/2] vfs: syscalls: add mkdirat_fd()
From: Mateusz Guzik @ 2026-04-07  9:00 UTC (permalink / raw)
  To: Jori Koolstra
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Alexander Viro, Christian Brauner, Jeff Layton,
	Chuck Lever, Arnd Bergmann, Shuah Khan, Greg Kroah-Hartman,
	H. Peter Anvin, Jan Kara, Alexander Aring, Peter Zijlstra,
	Oleg Nesterov, Andrey Albershteyn, Jiri Olsa, Mathieu Desnoyers,
	Thomas Weißschuh, Namhyung Kim, Arnaldo Carvalho de Melo,
	Aleksa Sarai, linux-kernel, linux-fsdevel, linux-api, linux-arch,
	linux-kselftest, cmirabil, Masami Hiramatsu (Google)
In-Reply-To: <1632825771.784338.1775039101736@kpc.webmail.kpnmail.nl>

On Wed, Apr 1, 2026 at 12:25 PM Jori Koolstra <jkoolstra@xs4all.nl> wrote:
>
>
> > Op 01-04-2026 06:19 CEST schreef Mateusz Guzik <mjguzik@gmail.com>:
> >
> >
> > On Tue, Mar 31, 2026 at 07:19:58PM +0200, Jori Koolstra wrote:
> > > @@ -5286,7 +5290,25 @@ int filename_mkdirat(int dfd, struct filename *name, umode_t mode)
> > >             lookup_flags |= LOOKUP_REVAL;
> > >             goto retry;
> > >     }
> > > +
> > > +   if (!error && (flags & MKDIRAT_FD_NEED_FD)) {
> > > +           struct path new_path = { .mnt = path.mnt, .dentry = dentry };
> > > +           error = FD_ADD(0, dentry_open(&new_path, O_DIRECTORY, current_cred()));
> > > +   }
> > > +   end_creating_path(&path, dentry);
> > >     return error;
> >
> >
> > You can't do it like this. Should it turn out no fd can be allocated,
> > the entire thing is going to error out while keeping the newly created
> > directory behind. You need to allocate the fd first, then do the hard
> > work, and only then fd_install or free the fd. The FD_ADD machinery
> > can probably still be used provided proper wrapping of the real new
> > mkdir.
>
> But isn't this exactly what happens in open(O_CREAT) too? Eventually we
> call
>                 error = dir_inode->i_op->create(idmap, dir_inode, dentry,
>                                                 mode, open_flag & O_EXCL);
>
> and only then do we assign and install the fd. AFAIK there is no cleanup
> happening there either if the FD_ADD step fails. You will just have a
> regular file and no descriptor. But I would have to test this to be sure.
>

FD_ADD(how->flags, do_file_open(dfd, name, &op)) means the fd itself is
allocated up front and only then does file creation happen, which is
how I'm saying it should be done. With your patch the directory is
created first and the possibly failing fd allocation happens later.

> >
> > On top of that similarly to what other people mentioned the new syscall
> > will definitely want to support O_CLOEXEC and probably other flags down
> > the line.
> >
>
> I agree, and perhaps O_PATH too. Maybe just all open flags relevant to
> directories?
>

I don't know about O_PATH as is, but certainly the syscall needs to be
able to grab more flags in the future.

> > Trying to handle this in open() is a no-go. openat2 is rather
> > problematic.
>
> I don't think that is necessarily true. It turned out O_CREAT | O_DIRECTORY
> was bugged for a very long time. Christian Brauner fixed it eventually, and
> that combination now returns EINVAL. But I think there is nothing really
> stopping us from implementing that combination in the expected way, apart
> from whatever reasons there were for not allowing this in the first place,
> which I don't know about (maybe mixing semantics?)
>

I am not saying it's impossible. I am saying mkdir was always a
separate codepath and in order to change that you would need to add a
branchfest to open. I don't see any reason to go that route.

^ permalink raw reply

* Re: [RFC PATCH 1/2] vfs: syscalls: add mkdirat_fd()
From: Mateusz Guzik @ 2026-04-07  8:52 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Jori Koolstra, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, Alexander Viro,
	Christian Brauner, Jeff Layton, Chuck Lever, Arnd Bergmann,
	Shuah Khan, Greg Kroah-Hartman, H. Peter Anvin, Jan Kara,
	Alexander Aring, Peter Zijlstra, Oleg Nesterov,
	Andrey Albershteyn, Jiri Olsa, Mathieu Desnoyers,
	Thomas Weißschuh, Namhyung Kim, Arnaldo Carvalho de Melo,
	linux-kernel, linux-fsdevel, linux-api, linux-arch,
	linux-kselftest, cmirabil, Masami Hiramatsu (Google)
In-Reply-To: <2026-04-02-aged-convex-snowbird-foxes-Ym20JZ@cyphar.com>

On Thu, Apr 2, 2026 at 4:52 AM Aleksa Sarai <cyphar@cyphar.com> wrote:
>
> On 2026-04-01, Mateusz Guzik <mjguzik@gmail.com> wrote:
> > Trying to handle this in open() is a no-go. openat2 is rather
> > problematic.
>
> I'm interested in what makes you say that. It would be very nice to be able
> to do mkdir + RESOLVE_IN_ROOT and get an fd back all in one syscall. :D
>

Not handling this in either of open or openat2 does not preclude mkdir
+ RESOLVE_IN_ROOT + getting a fd in one go from existing.

Creating a directory was always a different syscall than creating a
file. I don't see any benefit to squeezing it into open. I do see a
downside because of an extra branchfest to differentiate the cases.

> > The routine would have to start with validating the passed O_ flags, for
> > now only allowing O_CLOEXEC and EINVAL-ing otherwise.
>
> Please do not use O_* flags! O_CLOEXEC takes up 3 flag bits on different
> architectures which makes adding new flags a nightmare.
>

With my proposal there are no new flags added so I don't think that's relevant.

> I think this should take AT_* flags and (like most newer syscalls)
> O_CLOEXEC should be automatically set. Userspace can unset it with
> fcntl(F_SETFD) in the relatively rare case where they don't want
> O_CLOEXEC. Alternatively, we could just bite the bullet and make
> AT_NO_CLOEXEC a thing...
>

I would say that's a pretty weird discrepancy vs what normally happens
with other syscalls, but perhaps it would be fine.

^ permalink raw reply

* [RFC PATCH v3 6/6] platform/goldfish: add Rust goldfish_address_space driver
From: Wenzhao Liao @ 2026-04-06 16:51 UTC (permalink / raw)
  To: rust-for-linux, linux-pci
  Cc: ojeda, dakr, bhelgaas, kwilczynski, arnd, gregkh, linux-kernel,
	linux-api
In-Reply-To: <20260406165120.166928-1-wenzhaoliao@ruc.edu.cn>

Add a Rust implementation of the goldfish address-space driver and wire
it into drivers/platform/goldfish.

This RFC intentionally scopes the driver to the open/release/ioctl ABI
subset; userspace mmap is not part of this series. The driver keeps all
unsafe and bindings-facing work inside the Rust abstraction layers,
carries #![forbid(unsafe_code)] in the driver crate, and uses typed
miscdevice registration data plus SharedMemoryBar to stay on the safe
side of those abstractions.

On teardown, unbind first deregisters the miscdevice, then drains
already-running operations and revokes live file-owned device state
before disabling the PCI function. Probe also pairs
enable_device_mem() with a ScopeGuard so mid-probe failures cannot leak
an enabled device.

Signed-off-by: Wenzhao Liao <wenzhaoliao@ruc.edu.cn>
---
 MAINTAINERS                                   |   2 +
 drivers/platform/goldfish/Kconfig             |  11 +
 drivers/platform/goldfish/Makefile            |   1 +
 .../goldfish/goldfish_address_space.rs        | 917 ++++++++++++++++++
 4 files changed, 931 insertions(+)
 create mode 100644 drivers/platform/goldfish/goldfish_address_space.rs

diff --git a/MAINTAINERS b/MAINTAINERS
index 800b2fe0e648..0a9193854f1b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1888,7 +1888,9 @@ L:	linux-kernel@vger.kernel.org
 L:	linux-pci@vger.kernel.org
 L:	rust-for-linux@vger.kernel.org
 S:	Maintained
+F:	drivers/platform/goldfish/goldfish_address_space.rs
 F:	include/uapi/linux/goldfish_address_space.h
+K:	\bGOLDFISH_ADDRESS_SPACE\b
 
 ANDROID GOLDFISH RTC DRIVER
 M:	Jiaxun Yang <jiaxun.yang@flygoat.com>
diff --git a/drivers/platform/goldfish/Kconfig b/drivers/platform/goldfish/Kconfig
index 03ca5bf19f98..58ccf5a757bd 100644
--- a/drivers/platform/goldfish/Kconfig
+++ b/drivers/platform/goldfish/Kconfig
@@ -17,4 +17,15 @@ config GOLDFISH_PIPE
 	  This is a virtual device to drive the QEMU pipe interface used by
 	  the Goldfish Android Virtual Device.
 
+config GOLDFISH_ADDRESS_SPACE
+	tristate "Goldfish address space driver in Rust"
+	depends on PCI && RUST && MMU
+	help
+	  Adds a Rust implementation of the Goldfish address space driver
+	  used by the Android Goldfish emulator.
+
+	  This implementation uses typed Rust abstractions for PCI resource
+	  setup, miscdevice registration, page-backed ping state, and the
+	  userspace ioctl interface.
+
 endif # GOLDFISH
diff --git a/drivers/platform/goldfish/Makefile b/drivers/platform/goldfish/Makefile
index 76ba1d571896..17f67c223e95 100644
--- a/drivers/platform/goldfish/Makefile
+++ b/drivers/platform/goldfish/Makefile
@@ -3,3 +3,4 @@
 # Makefile for Goldfish platform specific drivers
 #
 obj-$(CONFIG_GOLDFISH_PIPE)	+= goldfish_pipe.o
+obj-$(CONFIG_GOLDFISH_ADDRESS_SPACE) += goldfish_address_space.o
diff --git a/drivers/platform/goldfish/goldfish_address_space.rs b/drivers/platform/goldfish/goldfish_address_space.rs
new file mode 100644
index 000000000000..7742c76ea7fc
--- /dev/null
+++ b/drivers/platform/goldfish/goldfish_address_space.rs
@@ -0,0 +1,917 @@
+// SPDX-License-Identifier: GPL-2.0
+
+//! Rust Goldfish address space driver.
+
+#![forbid(unsafe_code)]
+
+use core::{mem::size_of, pin::Pin};
+use kernel::{
+    alloc::KVVec,
+    device::Core,
+    devres::Devres,
+    error::Error,
+    fs::File,
+    io::{Io, PhysAddr},
+    ioctl,
+    miscdevice::{MiscDevice, MiscDeviceOpenContext, MiscDeviceOptions, MiscDeviceRegistration},
+    new_condvar, new_mutex,
+    page::{Page, PAGE_SIZE},
+    pci,
+    prelude::*,
+    sync::{Arc, ArcBorrow, CondVar, Mutex},
+    types::ScopeGuard,
+    uaccess::{UserPtr, UserSlice},
+    uapi,
+};
+
+const GOLDFISH_AS_CONTROL_BAR: u32 = 0;
+const GOLDFISH_AS_AREA_BAR: u32 = 1;
+const GOLDFISH_AS_VENDOR_ID: u32 = 0x607d;
+const GOLDFISH_AS_DEVICE_ID: u32 = 0xf153;
+const GOLDFISH_AS_SUPPORTED_REVISION: u8 = 1;
+const GOLDFISH_AS_INVALID_HANDLE: u32 = u32::MAX;
+
+const GOLDFISH_ADDRESS_SPACE_IOCTL_MAGIC: u32 = uapi::GOLDFISH_ADDRESS_SPACE_IOCTL_MAGIC as u32;
+const GOLDFISH_ADDRESS_SPACE_IOCTL_ALLOCATE_BLOCK: u32 =
+    ioctl::_IOWR::<AllocateBlockIoctl>(GOLDFISH_ADDRESS_SPACE_IOCTL_MAGIC, 10);
+const GOLDFISH_ADDRESS_SPACE_IOCTL_DEALLOCATE_BLOCK: u32 =
+    ioctl::_IOWR::<u64>(GOLDFISH_ADDRESS_SPACE_IOCTL_MAGIC, 11);
+const GOLDFISH_ADDRESS_SPACE_IOCTL_PING: u32 =
+    ioctl::_IOWR::<PingIoctl>(GOLDFISH_ADDRESS_SPACE_IOCTL_MAGIC, 12);
+const GOLDFISH_ADDRESS_SPACE_IOCTL_CLAIM_SHARED: u32 =
+    ioctl::_IOWR::<ClaimSharedIoctl>(GOLDFISH_ADDRESS_SPACE_IOCTL_MAGIC, 13);
+const GOLDFISH_ADDRESS_SPACE_IOCTL_UNCLAIM_SHARED: u32 =
+    ioctl::_IOWR::<u64>(GOLDFISH_ADDRESS_SPACE_IOCTL_MAGIC, 14);
+
+struct Registers;
+
+impl Registers {
+    const COMMAND: usize = 0;
+    const STATUS: usize = 4;
+    const GUEST_PAGE_SIZE: usize = 8;
+    const BLOCK_SIZE_LOW: usize = 12;
+    const BLOCK_SIZE_HIGH: usize = 16;
+    const BLOCK_OFFSET_LOW: usize = 20;
+    const BLOCK_OFFSET_HIGH: usize = 24;
+    const PING: usize = 28;
+    const PING_INFO_ADDR_LOW: usize = 32;
+    const PING_INFO_ADDR_HIGH: usize = 36;
+    const HANDLE: usize = 40;
+    const PHYS_START_LOW: usize = 44;
+    const PHYS_START_HIGH: usize = 48;
+    const END: usize = 56;
+}
+
+#[repr(u32)]
+#[derive(Clone, Copy)]
+enum CommandId {
+    AllocateBlock = 1,
+    DeallocateBlock = 2,
+    GenHandle = 3,
+    DestroyHandle = 4,
+    TellPingInfoAddr = 5,
+}
+
+type ControlBar = pci::Bar<{ Registers::END }>;
+
+#[derive(Clone, Copy)]
+struct Block {
+    offset: u64,
+    size: u64,
+}
+
+struct BlockSet {
+    blocks: KVVec<Block>,
+}
+
+impl BlockSet {
+    fn new() -> Self {
+        Self {
+            blocks: KVVec::new(),
+        }
+    }
+
+    fn insert(&mut self, block: Block) -> Result {
+        self.blocks.push(block, GFP_KERNEL)?;
+        Ok(())
+    }
+
+    fn remove(&mut self, offset: u64) -> Result<Block> {
+        let index = self
+            .blocks
+            .iter()
+            .position(|block| block.offset == offset)
+            .ok_or(ENXIO)?;
+        self.blocks.remove(index).map_err(|_| EINVAL)
+    }
+
+    fn iter(&self) -> impl Iterator<Item = Block> + '_ {
+        self.blocks.iter().copied()
+    }
+
+    fn clear(&mut self) {
+        let _ = self.take_all();
+    }
+
+    fn take_all(&mut self) -> KVVec<Block> {
+        let mut blocks = KVVec::new();
+        core::mem::swap(&mut blocks, &mut self.blocks);
+        blocks
+    }
+}
+
+#[derive(Clone, Copy, Default)]
+struct PingInfoHeader {
+    offset: u64,
+    size: u64,
+    metadata: u64,
+    version: u32,
+    wait_fd: u32,
+    wait_flags: u32,
+    direction: u32,
+    data_size: u64,
+}
+
+impl PingInfoHeader {
+    const ENCODED_LEN: usize = 48;
+
+    fn encode(self) -> [u8; Self::ENCODED_LEN] {
+        let mut bytes = [0u8; Self::ENCODED_LEN];
+
+        bytes[0..8].copy_from_slice(&self.offset.to_ne_bytes());
+        bytes[8..16].copy_from_slice(&self.size.to_ne_bytes());
+        bytes[16..24].copy_from_slice(&self.metadata.to_ne_bytes());
+        bytes[24..28].copy_from_slice(&self.version.to_ne_bytes());
+        bytes[28..32].copy_from_slice(&self.wait_fd.to_ne_bytes());
+        bytes[32..36].copy_from_slice(&self.wait_flags.to_ne_bytes());
+        bytes[36..40].copy_from_slice(&self.direction.to_ne_bytes());
+        bytes[40..48].copy_from_slice(&self.data_size.to_ne_bytes());
+
+        bytes
+    }
+
+    fn decode(bytes: &[u8; Self::ENCODED_LEN]) -> Self {
+        Self {
+            offset: u64::from_ne_bytes(bytes[0..8].try_into().unwrap()),
+            size: u64::from_ne_bytes(bytes[8..16].try_into().unwrap()),
+            metadata: u64::from_ne_bytes(bytes[16..24].try_into().unwrap()),
+            version: u32::from_ne_bytes(bytes[24..28].try_into().unwrap()),
+            wait_fd: u32::from_ne_bytes(bytes[28..32].try_into().unwrap()),
+            wait_flags: u32::from_ne_bytes(bytes[32..36].try_into().unwrap()),
+            direction: u32::from_ne_bytes(bytes[36..40].try_into().unwrap()),
+            data_size: u64::from_ne_bytes(bytes[40..48].try_into().unwrap()),
+        }
+    }
+}
+
+#[repr(C)]
+#[derive(Clone, Copy, Default)]
+struct AllocateBlockIoctl {
+    size: u64,
+    offset: u64,
+    phys_addr: u64,
+}
+
+impl AllocateBlockIoctl {
+    const ENCODED_LEN: usize = 24;
+
+    fn encode(self) -> [u8; Self::ENCODED_LEN] {
+        let mut bytes = [0u8; Self::ENCODED_LEN];
+        bytes[0..8].copy_from_slice(&self.size.to_ne_bytes());
+        bytes[8..16].copy_from_slice(&self.offset.to_ne_bytes());
+        bytes[16..24].copy_from_slice(&self.phys_addr.to_ne_bytes());
+        bytes
+    }
+
+    fn decode(bytes: &[u8; Self::ENCODED_LEN]) -> Self {
+        Self {
+            size: u64::from_ne_bytes(bytes[0..8].try_into().unwrap()),
+            offset: u64::from_ne_bytes(bytes[8..16].try_into().unwrap()),
+            phys_addr: u64::from_ne_bytes(bytes[16..24].try_into().unwrap()),
+        }
+    }
+}
+
+#[repr(C)]
+#[derive(Clone, Copy, Default)]
+struct PingIoctl {
+    offset: u64,
+    size: u64,
+    metadata: u64,
+    version: u32,
+    wait_fd: u32,
+    wait_flags: u32,
+    direction: u32,
+}
+
+impl PingIoctl {
+    const ENCODED_LEN: usize = 40;
+
+    fn encode(self) -> [u8; Self::ENCODED_LEN] {
+        let mut bytes = [0u8; Self::ENCODED_LEN];
+        bytes[0..8].copy_from_slice(&self.offset.to_ne_bytes());
+        bytes[8..16].copy_from_slice(&self.size.to_ne_bytes());
+        bytes[16..24].copy_from_slice(&self.metadata.to_ne_bytes());
+        bytes[24..28].copy_from_slice(&self.version.to_ne_bytes());
+        bytes[28..32].copy_from_slice(&self.wait_fd.to_ne_bytes());
+        bytes[32..36].copy_from_slice(&self.wait_flags.to_ne_bytes());
+        bytes[36..40].copy_from_slice(&self.direction.to_ne_bytes());
+        bytes
+    }
+
+    fn decode(bytes: &[u8; Self::ENCODED_LEN]) -> Self {
+        Self {
+            offset: u64::from_ne_bytes(bytes[0..8].try_into().unwrap()),
+            size: u64::from_ne_bytes(bytes[8..16].try_into().unwrap()),
+            metadata: u64::from_ne_bytes(bytes[16..24].try_into().unwrap()),
+            version: u32::from_ne_bytes(bytes[24..28].try_into().unwrap()),
+            wait_fd: u32::from_ne_bytes(bytes[28..32].try_into().unwrap()),
+            wait_flags: u32::from_ne_bytes(bytes[32..36].try_into().unwrap()),
+            direction: u32::from_ne_bytes(bytes[36..40].try_into().unwrap()),
+        }
+    }
+}
+
+#[repr(C)]
+#[derive(Clone, Copy, Default)]
+struct ClaimSharedIoctl {
+    offset: u64,
+    size: u64,
+}
+
+impl ClaimSharedIoctl {
+    const ENCODED_LEN: usize = 16;
+
+    fn encode(self) -> [u8; Self::ENCODED_LEN] {
+        let mut bytes = [0u8; Self::ENCODED_LEN];
+        bytes[0..8].copy_from_slice(&self.offset.to_ne_bytes());
+        bytes[8..16].copy_from_slice(&self.size.to_ne_bytes());
+        bytes
+    }
+
+    fn decode(bytes: &[u8; Self::ENCODED_LEN]) -> Self {
+        Self {
+            offset: u64::from_ne_bytes(bytes[0..8].try_into().unwrap()),
+            size: u64::from_ne_bytes(bytes[8..16].try_into().unwrap()),
+        }
+    }
+}
+
+struct PingState {
+    page: Page,
+}
+
+impl PingState {
+    fn new() -> Result<Self> {
+        let mut page = Page::alloc_page(GFP_KERNEL)?;
+        page.fill_zero(0, PAGE_SIZE)?;
+        Ok(Self { page })
+    }
+
+    fn phys_addr(&self) -> PhysAddr {
+        self.page.phys_addr()
+    }
+
+    fn shared_offset(offset: u64, shared_phys_start: PhysAddr) -> Result<u64> {
+        let shared_phys_start = u64::try_from(shared_phys_start).map_err(|_| EOVERFLOW)?;
+        offset.checked_add(shared_phys_start).ok_or(EOVERFLOW)
+    }
+
+    fn prepare_ping(&mut self, request: &PingIoctl, shared_phys_start: PhysAddr) -> Result {
+        let header = PingInfoHeader {
+            offset: Self::shared_offset(request.offset, shared_phys_start)?,
+            size: request.size,
+            metadata: request.metadata,
+            version: request.version,
+            wait_fd: request.wait_fd,
+            wait_flags: request.wait_flags,
+            direction: request.direction,
+            data_size: 0,
+        };
+
+        self.page.fill_zero(0, PAGE_SIZE)?;
+        self.page.write_slice(&header.encode(), 0)
+    }
+
+    fn finish_ping(&self, request: &mut PingIoctl) -> Result {
+        let mut bytes = [0u8; PingInfoHeader::ENCODED_LEN];
+        self.page.read_slice(&mut bytes, 0)?;
+        let header = PingInfoHeader::decode(&bytes);
+        request.offset = header.offset;
+        request.size = header.size;
+        request.metadata = header.metadata;
+        request.version = header.version;
+        request.wait_fd = header.wait_fd;
+        request.wait_flags = header.wait_flags;
+        request.direction = header.direction;
+        Ok(())
+    }
+}
+
+#[pin_data]
+struct DeviceRuntime {
+    #[pin]
+    control_bar: Devres<ControlBar>,
+    #[pin]
+    shared_bar: Devres<pci::SharedMemoryBar>,
+    #[pin]
+    registers_lock: Mutex<()>,
+    #[pin]
+    lifecycle: Mutex<RuntimeLifecycleState>,
+    #[pin]
+    lifecycle_idle: CondVar,
+}
+
+struct RuntimeLifecycleState {
+    accepting_new_ops: bool,
+    active_ops: usize,
+    live_files: KVVec<Arc<FileState>>,
+}
+
+struct RuntimeOpGuard {
+    runtime: Arc<DeviceRuntime>,
+}
+
+impl Drop for RuntimeOpGuard {
+    fn drop(&mut self) {
+        let mut state = self.runtime.lifecycle.lock();
+        state.active_ops -= 1;
+        self.runtime.notify_if_idle(&state);
+    }
+}
+
+impl DeviceRuntime {
+    fn new(pdev: &pci::Device<Core>) -> Result<Arc<Self>> {
+        Arc::pin_init(
+            try_pin_init!(Self {
+                control_bar <- pdev.iomap_region_sized::<{ Registers::END }>(
+                    GOLDFISH_AS_CONTROL_BAR,
+                    c"goldfish_address_space/control",
+                ),
+                shared_bar <- pdev.memremap_bar(
+                    GOLDFISH_AS_AREA_BAR,
+                    c"goldfish_address_space/area",
+                ),
+                registers_lock <- new_mutex!(()),
+                lifecycle <- new_mutex!(RuntimeLifecycleState {
+                    accepting_new_ops: true,
+                    active_ops: 0,
+                    live_files: KVVec::new(),
+                }),
+                lifecycle_idle <- new_condvar!("goldfish_address_space/lifecycle_idle"),
+            }),
+            GFP_KERNEL,
+        )
+    }
+
+    fn notify_if_idle(&self, state: &RuntimeLifecycleState) {
+        if !state.accepting_new_ops && state.active_ops == 0 {
+            self.lifecycle_idle.notify_all();
+        }
+    }
+
+    fn begin_operation(self: &Arc<Self>) -> Result<RuntimeOpGuard> {
+        let mut state = self.lifecycle.lock();
+        if !state.accepting_new_ops {
+            return Err(ENODEV);
+        }
+
+        state.active_ops = state.active_ops.checked_add(1).ok_or(EBUSY)?;
+        drop(state);
+
+        Ok(RuntimeOpGuard {
+            runtime: self.clone(),
+        })
+    }
+
+    fn register_live_file(&self, file: Arc<FileState>) -> Result {
+        let mut state = self.lifecycle.lock();
+        if !state.accepting_new_ops {
+            return Err(ENODEV);
+        }
+
+        state.live_files.push(file, GFP_KERNEL)?;
+        Ok(())
+    }
+
+    fn unregister_live_file(&self, file: &Arc<FileState>) {
+        let mut state = self.lifecycle.lock();
+        let Some(index) = state
+            .live_files
+            .iter()
+            .position(|entry| Arc::ptr_eq(entry, file))
+        else {
+            return;
+        };
+
+        if let Ok(entry) = state.live_files.remove(index) {
+            drop(entry);
+        }
+    }
+
+    fn shutdown(&self) {
+        let mut state = self.lifecycle.lock();
+        // `unbind()` deregisters the miscdevice before calling `shutdown()`, so no new opens can
+        // arrive. We only wait for syscalls that have already entered; live files are revoked
+        // below, so removal does not depend on userspace closing its descriptors.
+        state.accepting_new_ops = false;
+
+        while state.active_ops != 0 {
+            self.lifecycle_idle.wait(&mut state);
+        }
+
+        let mut live_files = KVVec::new();
+        core::mem::swap(&mut live_files, &mut state.live_files);
+        drop(state);
+
+        for file in &live_files {
+            file.revoke_for_shutdown();
+        }
+    }
+
+    fn control_bar(&self) -> Result<impl core::ops::Deref<Target = ControlBar> + '_> {
+        self.control_bar.try_access().ok_or(ENXIO)
+    }
+
+    fn shared_bar(&self) -> Result<impl core::ops::Deref<Target = pci::SharedMemoryBar> + '_> {
+        self.shared_bar.try_access().ok_or(ENXIO)
+    }
+
+    fn run_command_locked(control: &ControlBar, command: CommandId) -> Result {
+        control.write32(command as u32, Registers::COMMAND);
+
+        let status = i32::try_from(control.read32(Registers::STATUS)).map_err(|_| EIO)?;
+        if status == 0 {
+            Ok(())
+        } else {
+            Err(Error::from_errno(-status))
+        }
+    }
+
+    fn issue_command_locked(control: &ControlBar, command: CommandId) {
+        control.write32(command as u32, Registers::COMMAND);
+    }
+
+    fn write_u64(control: &ControlBar, low_offset: usize, high_offset: usize, value: u64) {
+        control.write32(value as u32, low_offset);
+        control.write32((value >> 32) as u32, high_offset);
+    }
+
+    fn read_u64(control: &ControlBar, low_offset: usize, high_offset: usize) -> u64 {
+        u64::from(control.read32(low_offset)) | (u64::from(control.read32(high_offset)) << 32)
+    }
+
+    fn program_host_visible_state(&self) -> Result {
+        let control = self.control_bar()?;
+        let shared = self.shared_bar()?;
+        let phys_start = u64::try_from(shared.phys_start()).map_err(|_| EOVERFLOW)?;
+
+        control.write32(PAGE_SIZE as u32, Registers::GUEST_PAGE_SIZE);
+        Self::write_u64(
+            &control,
+            Registers::PHYS_START_LOW,
+            Registers::PHYS_START_HIGH,
+            phys_start,
+        );
+
+        Ok(())
+    }
+
+    fn shared_phys_start(&self) -> Result<PhysAddr> {
+        Ok(self.shared_bar()?.phys_start())
+    }
+
+    fn generate_handle(&self) -> Result<u32> {
+        let _guard = self.registers_lock.lock();
+        let control = self.control_bar()?;
+
+        // The external C driver does not gate `GEN_HANDLE` on the status register and instead
+        // validates completion by reading back the handle.
+        Self::issue_command_locked(&control, CommandId::GenHandle);
+
+        let handle = control.read32(Registers::HANDLE);
+        if handle == GOLDFISH_AS_INVALID_HANDLE {
+            return Err(EINVAL);
+        }
+
+        Ok(handle)
+    }
+
+    fn tell_ping_info_addr(&self, handle: u32, ping_info_phys: PhysAddr) -> Result {
+        let _guard = self.registers_lock.lock();
+        let control = self.control_bar()?;
+        let ping_info_phys = u64::try_from(ping_info_phys).map_err(|_| EOVERFLOW)?;
+
+        control.write32(handle, Registers::HANDLE);
+        Self::write_u64(
+            &control,
+            Registers::PING_INFO_ADDR_LOW,
+            Registers::PING_INFO_ADDR_HIGH,
+            ping_info_phys,
+        );
+        // The external C driver validates `TELL_PING_INFO_ADDR` through the echoed physical
+        // address rather than through the status register.
+        Self::issue_command_locked(&control, CommandId::TellPingInfoAddr);
+
+        let returned = Self::read_u64(
+            &control,
+            Registers::PING_INFO_ADDR_LOW,
+            Registers::PING_INFO_ADDR_HIGH,
+        );
+        if returned != ping_info_phys {
+            return Err(EINVAL);
+        }
+
+        Ok(())
+    }
+
+    fn destroy_handle(&self, handle: u32) -> Result {
+        let _guard = self.registers_lock.lock();
+        let control = self.control_bar()?;
+        control.write32(handle, Registers::HANDLE);
+        Self::issue_command_locked(&control, CommandId::DestroyHandle);
+        Ok(())
+    }
+
+    fn allocate_block(&self, size: u64) -> Result<Block> {
+        let _guard = self.registers_lock.lock();
+        let control = self.control_bar()?;
+
+        Self::write_u64(
+            &control,
+            Registers::BLOCK_SIZE_LOW,
+            Registers::BLOCK_SIZE_HIGH,
+            size,
+        );
+        Self::run_command_locked(&control, CommandId::AllocateBlock)?;
+
+        Ok(Block {
+            offset: Self::read_u64(
+                &control,
+                Registers::BLOCK_OFFSET_LOW,
+                Registers::BLOCK_OFFSET_HIGH,
+            ),
+            size: Self::read_u64(
+                &control,
+                Registers::BLOCK_SIZE_LOW,
+                Registers::BLOCK_SIZE_HIGH,
+            ),
+        })
+    }
+
+    fn deallocate_block(&self, offset: u64) -> Result {
+        let _guard = self.registers_lock.lock();
+        let control = self.control_bar()?;
+        Self::write_u64(
+            &control,
+            Registers::BLOCK_OFFSET_LOW,
+            Registers::BLOCK_OFFSET_HIGH,
+            offset,
+        );
+        Self::run_command_locked(&control, CommandId::DeallocateBlock)
+    }
+
+    fn ping(&self, handle: u32) -> Result {
+        let _guard = self.registers_lock.lock();
+        self.control_bar()?.write32(handle, Registers::PING);
+        Ok(())
+    }
+
+    fn cleanup_file_resources<I>(&self, handle: u32, blocks: I)
+    where
+        I: IntoIterator<Item = Block>,
+    {
+        // `unbind()` revokes live files before `disable_device()`, so both the shutdown path and a
+        // concurrent `release()` may still legitimately touch the BAR here.
+        if let Err(err) = self.destroy_handle(handle) {
+            pr_warn!(
+                "goldfish_address_space: destroy handle {} failed: {}\n",
+                handle,
+                err.to_errno()
+            );
+        }
+
+        for block in blocks {
+            if let Err(err) = self.deallocate_block(block.offset) {
+                pr_warn!(
+                    "goldfish_address_space: deallocate block 0x{:x} failed: {}\n",
+                    block.offset,
+                    err.to_errno()
+                );
+            }
+        }
+    }
+}
+
+struct FileResources {
+    handle: Option<u32>,
+    allocated_blocks: BlockSet,
+    shared_blocks: BlockSet,
+}
+
+impl FileResources {
+    fn new(handle: u32) -> Self {
+        Self {
+            handle: Some(handle),
+            allocated_blocks: BlockSet::new(),
+            shared_blocks: BlockSet::new(),
+        }
+    }
+}
+
+#[pin_data]
+struct FileState {
+    runtime: Arc<DeviceRuntime>,
+    #[pin]
+    ping: Mutex<PingState>,
+    #[pin]
+    resources: Mutex<FileResources>,
+}
+
+impl FileState {
+    fn new(runtime: Arc<DeviceRuntime>, handle: u32, ping: PingState) -> Result<Arc<Self>> {
+        Arc::pin_init(
+            try_pin_init!(Self {
+                runtime: runtime,
+                ping <- new_mutex!(ping),
+                resources <- new_mutex!(FileResources::new(handle)),
+            }),
+            GFP_KERNEL,
+        )
+    }
+
+    fn shared_phys_addr(&self, offset: u64) -> Result<u64> {
+        let base = u64::try_from(self.runtime.shared_phys_start()?).map_err(|_| EOVERFLOW)?;
+        base.checked_add(offset).ok_or(EOVERFLOW)
+    }
+
+    fn allocate_block(
+        self: ArcBorrow<'_, Self>,
+        mut request: AllocateBlockIoctl,
+    ) -> Result<AllocateBlockIoctl> {
+        let block = self.runtime.allocate_block(request.size)?;
+        let mut resources = self.resources.lock();
+        if resources.handle.is_none() {
+            drop(resources);
+            let _ = self.runtime.deallocate_block(block.offset);
+            return Err(ENODEV);
+        }
+
+        if let Err(err) = resources.allocated_blocks.insert(block) {
+            drop(resources);
+            let _ = self.runtime.deallocate_block(block.offset);
+            return Err(err);
+        }
+
+        request.size = block.size;
+        request.offset = block.offset;
+        request.phys_addr = self.shared_phys_addr(block.offset)?;
+        Ok(request)
+    }
+
+    fn deallocate_block(self: ArcBorrow<'_, Self>, offset: u64) -> Result {
+        let mut resources = self.resources.lock();
+        if resources.handle.is_none() {
+            return Err(ENODEV);
+        }
+
+        if !resources
+            .allocated_blocks
+            .iter()
+            .any(|block| block.offset == offset)
+        {
+            return Err(ENXIO);
+        }
+
+        self.runtime.deallocate_block(offset)?;
+        let _ = resources.allocated_blocks.remove(offset)?;
+        Ok(())
+    }
+
+    fn claim_shared(
+        self: ArcBorrow<'_, Self>,
+        request: ClaimSharedIoctl,
+    ) -> Result<ClaimSharedIoctl> {
+        let mut resources = self.resources.lock();
+        if resources.handle.is_none() {
+            return Err(ENODEV);
+        }
+
+        resources.shared_blocks.insert(Block {
+            offset: request.offset,
+            size: request.size,
+        })?;
+        Ok(request)
+    }
+
+    fn unclaim_shared(self: ArcBorrow<'_, Self>, offset: u64) -> Result {
+        let mut resources = self.resources.lock();
+        if resources.handle.is_none() {
+            return Err(ENODEV);
+        }
+
+        resources.shared_blocks.remove(offset)?;
+        Ok(())
+    }
+
+    fn ping(self: ArcBorrow<'_, Self>, mut request: PingIoctl) -> Result<PingIoctl> {
+        let handle = self.resources.lock().handle.ok_or(ENODEV)?;
+        let mut ping = self.ping.lock();
+        ping.prepare_ping(&request, self.runtime.shared_phys_start()?)?;
+        self.runtime.ping(handle)?;
+        ping.finish_ping(&mut request)?;
+        Ok(request)
+    }
+
+    fn cleanup_resources(&self) {
+        let mut resources = self.resources.lock();
+        let Some(handle) = resources.handle.take() else {
+            return;
+        };
+
+        self.runtime
+            .cleanup_file_resources(handle, resources.allocated_blocks.iter());
+        resources.allocated_blocks.clear();
+        resources.shared_blocks.clear();
+    }
+
+    fn revoke_for_shutdown(&self) {
+        self.cleanup_resources();
+    }
+
+    fn release(self: Arc<Self>) {
+        self.cleanup_resources();
+        self.runtime.unregister_live_file(&self);
+    }
+}
+
+#[pin_data]
+struct GoldfishAddressSpaceDriver {
+    runtime: Arc<DeviceRuntime>,
+    #[pin]
+    misc: MiscDeviceRegistration<GoldfishAddressSpaceMisc>,
+}
+
+struct GoldfishAddressSpaceMisc;
+
+#[vtable]
+impl MiscDevice for GoldfishAddressSpaceMisc {
+    type Ptr = Arc<FileState>;
+    type RegistrationData = Arc<DeviceRuntime>;
+
+    fn open(_file: &File, ctx: &MiscDeviceOpenContext<'_, Self>) -> Result<Self::Ptr> {
+        let runtime = ctx.data().clone();
+        let _op = runtime.begin_operation()?;
+        let ping = PingState::new()?;
+        let handle = runtime.generate_handle()?;
+        let cleanup = ScopeGuard::new_with_data((runtime.clone(), handle), |(runtime, handle)| {
+            let _ = runtime.destroy_handle(handle);
+        });
+
+        runtime.tell_ping_info_addr(handle, ping.phys_addr())?;
+        let state = FileState::new(runtime.clone(), handle, ping)?;
+        cleanup.dismiss();
+
+        // Register the file for shutdown revocation before returning it to the miscdevice core.
+        if let Err(err) = runtime.register_live_file(state.clone()) {
+            state.release();
+            return Err(err);
+        }
+
+        Ok(state)
+    }
+
+    fn release(device: Self::Ptr, _file: &File) {
+        device.release();
+    }
+
+    fn ioctl(
+        device: ArcBorrow<'_, FileState>,
+        _file: &File,
+        cmd: u32,
+        arg: usize,
+    ) -> Result<isize> {
+        let _op = device.runtime.begin_operation()?;
+        match cmd {
+            GOLDFISH_ADDRESS_SPACE_IOCTL_ALLOCATE_BLOCK => {
+                let data = UserSlice::new(UserPtr::from_addr(arg), AllocateBlockIoctl::ENCODED_LEN);
+                let (mut reader, mut writer) = data.reader_writer();
+                let mut bytes = [0u8; AllocateBlockIoctl::ENCODED_LEN];
+                reader.read_slice(&mut bytes)?;
+                let request = AllocateBlockIoctl::decode(&bytes);
+                let response = device.allocate_block(request)?;
+                writer.write_slice(&response.encode())?;
+                Ok(0)
+            }
+            GOLDFISH_ADDRESS_SPACE_IOCTL_DEALLOCATE_BLOCK => {
+                let mut reader = UserSlice::new(UserPtr::from_addr(arg), size_of::<u64>()).reader();
+                device.deallocate_block(reader.read::<u64>()?)?;
+                Ok(0)
+            }
+            GOLDFISH_ADDRESS_SPACE_IOCTL_PING => {
+                let data = UserSlice::new(UserPtr::from_addr(arg), PingIoctl::ENCODED_LEN);
+                let (mut reader, mut writer) = data.reader_writer();
+                let mut bytes = [0u8; PingIoctl::ENCODED_LEN];
+                reader.read_slice(&mut bytes)?;
+                let request = PingIoctl::decode(&bytes);
+                let response = device.ping(request)?;
+                writer.write_slice(&response.encode())?;
+                Ok(0)
+            }
+            GOLDFISH_ADDRESS_SPACE_IOCTL_CLAIM_SHARED => {
+                let data = UserSlice::new(UserPtr::from_addr(arg), ClaimSharedIoctl::ENCODED_LEN);
+                let (mut reader, mut writer) = data.reader_writer();
+                let mut bytes = [0u8; ClaimSharedIoctl::ENCODED_LEN];
+                reader.read_slice(&mut bytes)?;
+                let request = ClaimSharedIoctl::decode(&bytes);
+                let response = device.claim_shared(request)?;
+                writer.write_slice(&response.encode())?;
+                Ok(0)
+            }
+            GOLDFISH_ADDRESS_SPACE_IOCTL_UNCLAIM_SHARED => {
+                let mut reader = UserSlice::new(UserPtr::from_addr(arg), size_of::<u64>()).reader();
+                device.unclaim_shared(reader.read::<u64>()?)?;
+                Ok(0)
+            }
+            _ => Err(ENOTTY),
+        }
+    }
+
+    #[cfg(CONFIG_COMPAT)]
+    fn compat_ioctl(
+        device: ArcBorrow<'_, FileState>,
+        file: &File,
+        cmd: u32,
+        arg: usize,
+    ) -> Result<isize> {
+        Self::ioctl(device, file, cmd, arg)
+    }
+}
+
+kernel::declare_misc_device_fops!(GoldfishAddressSpaceMisc);
+
+kernel::pci_device_table!(
+    PCI_TABLE,
+    MODULE_PCI_TABLE,
+    <GoldfishAddressSpaceDriver as pci::Driver>::IdInfo,
+    [(
+        pci::DeviceId::from_id(
+            pci::Vendor::from_raw(GOLDFISH_AS_VENDOR_ID as u16),
+            GOLDFISH_AS_DEVICE_ID,
+        ),
+        (),
+    )]
+);
+
+impl pci::Driver for GoldfishAddressSpaceDriver {
+    type IdInfo = ();
+
+    const ID_TABLE: pci::IdTable<Self::IdInfo> = &PCI_TABLE;
+
+    fn probe(pdev: &pci::Device<Core>, _id_info: &Self::IdInfo) -> impl PinInit<Self, Error> {
+        pin_init::pin_init_scope(move || {
+            if pdev.revision_id() != GOLDFISH_AS_SUPPORTED_REVISION {
+                return Err(ENODEV);
+            }
+
+            pdev.enable_device_mem()?;
+            let enable_guard = ScopeGuard::new(|| pdev.disable_device());
+
+            let runtime = DeviceRuntime::new(pdev)?;
+            runtime.program_host_visible_state()?;
+
+            let driver = try_pin_init!(Self {
+                runtime: runtime.clone(),
+                misc <- MiscDeviceRegistration::register_with_data(
+                    MiscDeviceOptions {
+                        name: c"goldfish_address_space",
+                    },
+                    runtime.clone(),
+                ),
+            });
+            enable_guard.dismiss();
+
+            Ok(driver)
+        })
+    }
+
+    fn unbind(pdev: &pci::Device<Core>, this: Pin<&Self>) {
+        let this = this.get_ref();
+        // 1. Stop new miscdevice opens from reaching the driver.
+        this.misc.deregister();
+        // 2. Wait for already-running syscalls, then revoke every still-live file's device-side
+        //    state before the PCI function disappears.
+        this.runtime.shutdown();
+        // 3. Only then disable the PCI function, so post-shutdown release never needs to touch a
+        //    disabled device.
+        pdev.disable_device();
+    }
+}
+
+kernel::module_pci_driver! {
+    type: GoldfishAddressSpaceDriver,
+    name: "goldfish_address_space",
+    authors: ["Wenzhao Liao"],
+    description: "Rust Goldfish address space driver",
+    license: "GPL v2",
+}
-- 
2.34.1



* [RFC PATCH v3 5/6] rust: miscdevice: harden registration and safe file_operations invariants
From: Wenzhao Liao @ 2026-04-06 16:51 UTC (permalink / raw)
  To: rust-for-linux, linux-pci
  Cc: ojeda, dakr, bhelgaas, kwilczynski, arnd, gregkh, linux-kernel,
	linux-api
In-Reply-To: <20260406165120.166928-1-wenzhaoliao@ruc.edu.cn>

Extend miscdevice registration with typed per-device data that open()
can read through a publication-safe context, and move raw
file_operations exposure behind an internal vtable boundary generated by
declare_misc_device_fops!().

As a result, safe open() only sees state that was fully initialized
before misc_register() published the device, file_operations.owner is
always bound to THIS_MODULE for safe drivers, and the private_data
ownership protocol stays inside the abstraction instead of in driver
code. The goldfish driver uses the typed registration data to pass its
runtime into open() without raw casts or container_of() traversal.
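
The ordering described above can be sketched in plain userspace Rust.
This is only an analogy under stated assumptions, not the kernel API:
Registry, Backing, OpenContext, Runtime and demo() are hypothetical
stand-ins, and a Mutex-protected Vec plays the role of the miscdevice
core. It shows the registration data being fully built before the
publication point, and open() reaching it through a typed, read-only
context.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for the driver state passed as registration data.
struct Runtime {
    name: &'static str,
}

/// Hypothetical heap backing; built completely before it is published.
struct Backing<D> {
    data: D,
}

/// Hypothetical open context: typed, read-only access to the registration data.
struct OpenContext<'a, D> {
    data: &'a D,
}

impl<'a, D> OpenContext<'a, D> {
    fn data(&self) -> &'a D {
        self.data
    }
}

/// Toy registry standing in for the miscdevice core (an assumption of this sketch).
struct Registry<D> {
    published: Mutex<Vec<Arc<Backing<D>>>>,
}

impl<D> Registry<D> {
    fn new() -> Self {
        Registry {
            published: Mutex::new(Vec::new()),
        }
    }

    /// Initialize every field first; pushing into `published` is the publication point.
    fn register_with_data(&self, data: D) -> Arc<Backing<D>> {
        let backing = Arc::new(Backing { data });
        self.published.lock().unwrap().push(backing.clone());
        backing
    }

    /// Run `open` against a published entry, handing it only the typed context.
    fn open<R>(&self, index: usize, open: impl FnOnce(&OpenContext<'_, D>) -> R) -> Option<R> {
        let backing = self.published.lock().unwrap().get(index)?.clone();
        let ctx = OpenContext { data: &backing.data };
        Some(open(&ctx))
    }
}

/// Registers a runtime, opens it once, and reports what open() observed.
fn demo() -> (String, bool) {
    let registry = Registry::new();
    let runtime = Arc::new(Runtime {
        name: "goldfish_address_space",
    });
    let _registration = registry.register_with_data(runtime.clone());

    // open() clones the runtime out of the context: no raw casts, no pointer arithmetic.
    let opened: Arc<Runtime> = registry
        .open(0, |ctx| ctx.data().clone())
        .expect("entry 0 was registered above");

    (opened.name.to_string(), Arc::ptr_eq(&opened, &runtime))
}
```

Calling demo() returns the name seen by open() and whether open()
received the same runtime object that was passed at registration time.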

Signed-off-by: Wenzhao Liao <wenzhaoliao@ruc.edu.cn>
---
 rust/kernel/miscdevice.rs        | 409 +++++++++++++++++++++++--------
 samples/rust/rust_misc_device.rs |   9 +-
 2 files changed, 306 insertions(+), 112 deletions(-)

diff --git a/rust/kernel/miscdevice.rs b/rust/kernel/miscdevice.rs
index c3c2052c9206..c2db81cd5da2 100644
--- a/rust/kernel/miscdevice.rs
+++ b/rust/kernel/miscdevice.rs
@@ -9,7 +9,8 @@
 //! Reference: <https://www.kernel.org/doc/html/latest/driver-api/misc_devices.html>
 
 use crate::{
-    bindings,
+    alloc::KBox,
+    bindings, container_of,
     device::Device,
     error::{to_result, Error, Result, VTABLE_DEFAULT_ERROR},
     ffi::{c_int, c_long, c_uint, c_ulong},
@@ -18,9 +19,15 @@
     mm::virt::VmaNew,
     prelude::*,
     seq_file::SeqFile,
-    types::{ForeignOwnable, Opaque},
+    sync::aref::ARef,
+    types::{ForeignOwnable, Opaque, ScopeGuard},
+};
+use core::{
+    marker::{PhantomData, PhantomPinned},
+    pin::Pin,
+    ptr::drop_in_place,
+    sync::atomic::{AtomicBool, Ordering},
 };
-use core::{marker::PhantomData, pin::Pin};
 
 /// Options for creating a misc device.
 #[derive(Copy, Clone)]
@@ -31,94 +38,258 @@ pub struct MiscDeviceOptions {
 
 impl MiscDeviceOptions {
     /// Create a raw `struct miscdev` ready for registration.
-    pub const fn into_raw<T: MiscDevice>(self) -> bindings::miscdevice {
+    pub fn into_raw<T: MiscDeviceVTable + 'static>(self) -> bindings::miscdevice {
         let mut result: bindings::miscdevice = pin_init::zeroed();
         result.minor = bindings::MISC_DYNAMIC_MINOR as ffi::c_int;
         result.name = crate::str::as_char_ptr_in_const_context(self.name);
-        result.fops = MiscdeviceVTable::<T>::build();
+        result.fops = T::file_operations();
         result
     }
 }
 
+/// Generates the `MiscDeviceVTable` implementation for a concrete miscdevice type.
+///
+/// Place this macro after `impl MiscDevice for ...`.
+///
+/// The generated implementation always binds `file_operations.owner` to the current module's
+/// `THIS_MODULE`, so safe drivers cannot accidentally publish owner-less or foreign-owned misc
+/// device callbacks.
+#[macro_export]
+macro_rules! declare_misc_device_fops {
+    ($type:ty) => {
+        // SAFETY: This implements the standard Rust miscdevice vtable generated by
+        // `build_file_operations()`, which wires up owner/module pinning and the private-data
+        // protocol enforced by this abstraction.
+        unsafe impl $crate::miscdevice::MiscDeviceVTable for $type {
+            fn file_operations() -> &'static $crate::bindings::file_operations {
+                struct AssertSync<T>(T);
+                // SAFETY: This wrapper is only used for immutable `file_operations` tables stored
+                // in a `static`.
+                unsafe impl<T> Sync for AssertSync<T> {}
+
+                static FOPS: AssertSync<$crate::bindings::file_operations> = AssertSync(
+                    $crate::miscdevice::build_file_operations::<$type>(THIS_MODULE.as_ptr()),
+                );
+
+                &FOPS.0
+            }
+        }
+    };
+}
+
+#[repr(C)]
+struct RegistrationBacking<T: MiscDevice + 'static> {
+    misc: Opaque<bindings::miscdevice>,
+    data: T::RegistrationData,
+    owner: *const MiscDeviceRegistration<T>,
+    registered: AtomicBool,
+}
+
+struct OpenFile<T: MiscDevice + 'static> {
+    data: *mut ffi::c_void,
+    _t: PhantomData<T>,
+}
+
+impl<T: MiscDevice + 'static> OpenFile<T> {
+    fn empty() -> Self {
+        Self {
+            data: core::ptr::null_mut(),
+            _t: PhantomData,
+        }
+    }
+
+    fn borrow(&self) -> <T::Ptr as ForeignOwnable>::Borrowed<'_> {
+        // SAFETY: `self.data` comes from `T::Ptr::into_foreign()` and is only converted back in
+        // `release`, after all borrows from this file operation callback have ended.
+        unsafe { <T::Ptr as ForeignOwnable>::borrow(self.data) }
+    }
+}
+
 /// A registration of a miscdevice.
 ///
 /// # Invariants
 ///
-/// - `inner` contains a `struct miscdevice` that is registered using
-///   `misc_register()`.
-/// - This registration remains valid for the entire lifetime of the
-///   [`MiscDeviceRegistration`] instance.
-/// - Deregistration occurs exactly once in [`Drop`] via `misc_deregister()`.
-/// - `inner` wraps a valid, pinned `miscdevice` created using
+/// - `backing.misc` contains a valid `struct miscdevice` created using
 ///   [`MiscDeviceOptions::into_raw`].
-#[repr(transparent)]
+/// - When `backing.registered` is `true`, `backing.misc` is registered using
+///   `misc_register()`.
+/// - Before `misc_register()` publishes `backing.misc`, every field reachable through the safe
+///   open context (`backing.data` and `backing.owner`) is fully initialized.
+/// - `backing.owner` points back to this wrapper for the entire time the miscdevice is registered.
+/// - Deregistration occurs at most once, either via [`MiscDeviceRegistration::deregister`] or
+///   [`Drop`].
 #[pin_data(PinnedDrop)]
-pub struct MiscDeviceRegistration<T> {
+pub struct MiscDeviceRegistration<T: MiscDevice + 'static> {
+    backing: KBox<RegistrationBacking<T>>,
     #[pin]
-    inner: Opaque<bindings::miscdevice>,
+    _pin: PhantomPinned,
     _t: PhantomData<T>,
 }
 
 // SAFETY: It is allowed to call `misc_deregister` on a different thread from where you called
 // `misc_register`.
-unsafe impl<T> Send for MiscDeviceRegistration<T> {}
+unsafe impl<T: MiscDevice> Send for MiscDeviceRegistration<T> {}
 // SAFETY: All `&self` methods on this type are written to ensure that it is safe to call them in
 // parallel.
-unsafe impl<T> Sync for MiscDeviceRegistration<T> {}
+unsafe impl<T: MiscDevice> Sync for MiscDeviceRegistration<T> {}
 
-impl<T: MiscDevice> MiscDeviceRegistration<T> {
+impl<T: MiscDevice + 'static> MiscDeviceRegistration<T> {
     /// Register a misc device.
-    pub fn register(opts: MiscDeviceOptions) -> impl PinInit<Self, Error> {
-        try_pin_init!(Self {
-            inner <- Opaque::try_ffi_init(move |slot: *mut bindings::miscdevice| {
-                // SAFETY: The initializer can write to the provided `slot`.
-                unsafe { slot.write(opts.into_raw::<T>()) };
-
-                // SAFETY: We just wrote the misc device options to the slot. The miscdevice will
-                // get unregistered before `slot` is deallocated because the memory is pinned and
-                // the destructor of this type deallocates the memory.
-                // INVARIANT: If this returns `Ok(())`, then the `slot` will contain a registered
-                // misc device.
-                to_result(unsafe { bindings::misc_register(slot) })
-            }),
-            _t: PhantomData,
-        })
+    pub fn register(opts: MiscDeviceOptions) -> impl PinInit<Self, Error>
+    where
+        T: MiscDevice<RegistrationData = ()> + MiscDeviceVTable,
+    {
+        Self::register_with_data(opts, ())
+    }
+
+    /// Register a misc device together with driver-defined registration data.
+    pub fn register_with_data(
+        opts: MiscDeviceOptions,
+        data: T::RegistrationData,
+    ) -> impl PinInit<Self, Error>
+    where
+        T: MiscDeviceVTable,
+    {
+        let init = move |slot: *mut Self| {
+            let backing = KBox::new(
+                RegistrationBacking {
+                    misc: Opaque::new(opts.into_raw::<T>()),
+                    data,
+                    owner: slot.cast_const(),
+                    registered: AtomicBool::new(false),
+                },
+                GFP_KERNEL,
+            )?;
+
+            // SAFETY: `slot` is valid for writes for the duration of this initializer.
+            unsafe {
+                slot.write(Self {
+                    backing,
+                    _pin: PhantomPinned,
+                    _t: PhantomData,
+                })
+            };
+
+            // SAFETY: `slot` points to the fully-initialized registration wrapper we just wrote
+            // above.
+            let this = unsafe { &mut *slot };
+            // SAFETY: `this.as_raw()` points at the fully initialized `struct miscdevice`
+            // contained in the heap-backed registration backing.
+            let ret = to_result(unsafe { bindings::misc_register(this.as_raw()) });
+            if let Err(err) = ret {
+                // SAFETY: The wrapper was fully initialized above, so dropping it here correctly
+                // releases the heap-backed registration backing.
+                unsafe { drop_in_place(slot) };
+                return Err(err);
+            }
+
+            this.backing.registered.store(true, Ordering::Release);
+            Ok(())
+        };
+
+        // SAFETY:
+        // - On success, the closure writes a fully-initialized `Self` into `slot` before making
+        //   the miscdevice visible via `misc_register()`. All state observable through the safe
+        //   open context is initialized before publication.
+        // - On failure after the write, it drops the initialized value before returning.
+        unsafe { pin_init::pin_init_from_closure(init) }
+    }
+
+    /// Returns the registration wrapper for a raw `struct miscdevice` pointer.
+    ///
+    /// # Safety
+    ///
+    /// `misc` must point at the `misc` field of a live [`RegistrationBacking<T>`].
+    unsafe fn from_raw_misc<'a>(misc: *mut bindings::miscdevice) -> &'a Self {
+        // SAFETY: The caller guarantees that `misc` points at the `misc` field of a live
+        // `RegistrationBacking<T>`, whose `owner` points back to the live registration wrapper.
+        let backing =
+            unsafe { &*container_of!(Opaque::cast_from(misc), RegistrationBacking<T>, misc) };
+        // SAFETY: By the type invariant, `owner` points at the live wrapper that owns `backing`.
+        unsafe { &*backing.owner }
     }
 
     /// Returns a raw pointer to the misc device.
     pub fn as_raw(&self) -> *mut bindings::miscdevice {
-        self.inner.get()
+        self.backing.misc.get()
+    }
+
+    /// Returns the registration data that was supplied at registration time.
+    pub fn data(&self) -> &T::RegistrationData {
+        &self.backing.data
+    }
+
+    fn deregister_inner(backing: &RegistrationBacking<T>) {
+        if backing.registered.swap(false, Ordering::AcqRel) {
+            // SAFETY: `registered == true` guarantees that the miscdevice was successfully
+            // registered and has not been deregistered yet.
+            unsafe { bindings::misc_deregister(backing.misc.get()) };
+        }
     }
 
-    /// Access the `this_device` field.
-    pub fn device(&self) -> &Device {
-        // SAFETY: This can only be called after a successful register(), which always
-        // initialises `this_device` with a valid device. Furthermore, the signature of this
-        // function tells the borrow-checker that the `&Device` reference must not outlive the
-        // `&MiscDeviceRegistration<T>` used to obtain it, so the last use of the reference must be
-        // before the underlying `struct miscdevice` is destroyed.
-        unsafe { Device::from_raw((*self.as_raw()).this_device) }
+    /// Deregister this misc device if it is still registered.
+    ///
+    /// After this returns, the misc core will no longer route new opens to [`MiscDevice::open`].
+    /// Existing open files keep their own pinned `file_operations` table and private data and must
+    /// be drained by the driver before it tears down device-side resources that those file handles
+    /// still own.
+    pub fn deregister(&self) {
+        Self::deregister_inner(&self.backing);
     }
 }
 
 #[pinned_drop]
-impl<T> PinnedDrop for MiscDeviceRegistration<T> {
+impl<T: MiscDevice + 'static> PinnedDrop for MiscDeviceRegistration<T> {
     fn drop(self: Pin<&mut Self>) {
-        // SAFETY: We know that the device is registered by the type invariants.
-        unsafe { bindings::misc_deregister(self.inner.get()) };
+        let this = self.project();
+        Self::deregister_inner(this.backing);
+    }
+}
+
+/// Publication-safe context passed to [`MiscDevice::open`].
+pub struct MiscDeviceOpenContext<'a, T: MiscDevice + 'static> {
+    registration: &'a MiscDeviceRegistration<T>,
+    device: ARef<Device>,
+}
+
+impl<'a, T: MiscDevice + 'static> MiscDeviceOpenContext<'a, T> {
+    /// Returns the registration data supplied at registration time.
+    pub fn data(&self) -> &T::RegistrationData {
+        self.registration.data()
+    }
+
+    /// Returns the class device backing this miscdevice open.
+    pub fn device(&self) -> ARef<Device> {
+        self.device.clone()
     }
 }
 
+/// Internal trait that supplies the concrete `file_operations` table used for a Rust miscdevice.
+///
+/// # Safety
+///
+/// Implementations must return a stable `file_operations` table produced by this abstraction so
+/// that owner/module pinning and the private-data protocol remain intact. Drivers should use
+/// [`declare_misc_device_fops!`] instead of implementing this trait manually.
+#[doc(hidden)]
+pub unsafe trait MiscDeviceVTable: MiscDevice + 'static {
+    /// Returns the `file_operations` table for this miscdevice implementation.
+    fn file_operations() -> &'static bindings::file_operations;
+}
+
 /// Trait implemented by the private data of an open misc device.
 #[vtable]
 pub trait MiscDevice: Sized {
     /// What kind of pointer should `Self` be wrapped in.
     type Ptr: ForeignOwnable + Send + Sync;
 
+    /// Driver-defined data stored in the miscdevice registration.
+    type RegistrationData: Send + Sync + 'static;
+
     /// Called when the misc device is opened.
     ///
     /// The returned pointer will be stored as the private data for the file.
-    fn open(_file: &File, _misc: &MiscDeviceRegistration<Self>) -> Result<Self::Ptr>;
+    fn open(_file: &File, _ctx: &MiscDeviceOpenContext<'_, Self>) -> Result<Self::Ptr>;
 
     /// Called when the misc device is released.
     fn release(device: Self::Ptr, _file: &File) {
@@ -195,7 +366,45 @@ fn show_fdinfo(
 /// A vtable for the file operations of a Rust miscdevice.
 struct MiscdeviceVTable<T: MiscDevice>(PhantomData<T>);
 
-impl<T: MiscDevice> MiscdeviceVTable<T> {
+impl<T: MiscDevice + 'static> MiscdeviceVTable<T> {
+    const fn build(owner: *mut bindings::module) -> bindings::file_operations {
+        bindings::file_operations {
+            owner,
+            open: Some(Self::open),
+            release: Some(Self::release),
+            mmap: if T::HAS_MMAP { Some(Self::mmap) } else { None },
+            read_iter: if T::HAS_READ_ITER {
+                Some(Self::read_iter)
+            } else {
+                None
+            },
+            write_iter: if T::HAS_WRITE_ITER {
+                Some(Self::write_iter)
+            } else {
+                None
+            },
+            unlocked_ioctl: if T::HAS_IOCTL {
+                Some(Self::ioctl)
+            } else {
+                None
+            },
+            #[cfg(CONFIG_COMPAT)]
+            compat_ioctl: if T::HAS_COMPAT_IOCTL {
+                Some(Self::compat_ioctl)
+            } else if T::HAS_IOCTL {
+                bindings::compat_ptr_ioctl
+            } else {
+                None
+            },
+            show_fdinfo: if T::HAS_SHOW_FDINFO {
+                Some(Self::show_fdinfo)
+            } else {
+                None
+            },
+            ..pin_init::zeroed()
+        }
+    }
+
     /// # Safety
     ///
     /// `file` and `inode` must be the file and inode for a file that is undergoing initialization.
@@ -214,25 +423,38 @@ impl<T: MiscDevice> MiscdeviceVTable<T> {
         // associated `struct miscdevice` before calling into this method. Furthermore,
         // `misc_open()` ensures that the miscdevice can't be unregistered and freed during this
         // call to `fops_open`.
-        let misc = unsafe { &*misc_ptr.cast::<MiscDeviceRegistration<T>>() };
+        let misc = unsafe { MiscDeviceRegistration::<T>::from_raw_misc(misc_ptr.cast()) };
 
         // SAFETY:
         // * This underlying file is valid for (much longer than) the duration of `T::open`.
         // * There is no active fdget_pos region on the file on this thread.
         let file = unsafe { File::from_raw_file(raw_file) };
 
-        let ptr = match T::open(file, misc) {
+        // SAFETY: `misc_open()` serializes with `misc_deregister()` via `misc_mtx`, so the class
+        // device remains live for the duration of this callback. Taking an extra reference here
+        // lets the safe open context own the device independently from later teardown.
+        let ctx = MiscDeviceOpenContext {
+            registration: misc,
+            device: unsafe { Device::get_device((*misc.as_raw()).this_device) },
+        };
+
+        let ptr = match T::open(file, &ctx) {
             Ok(ptr) => ptr,
             Err(err) => return err.to_errno(),
         };
+        let ptr = ScopeGuard::new_with_data(ptr, |ptr| T::release(ptr, file));
 
-        // This overwrites the private data with the value specified by the user, changing the type
-        // of this file's private data. All future accesses to the private data is performed by
-        // other fops_* methods in this file, which all correctly cast the private data to the new
-        // type.
+        let mut open_file = match KBox::new(OpenFile::<T>::empty(), GFP_KERNEL) {
+            Ok(open_file) => open_file,
+            Err(_) => return ENOMEM.to_errno(),
+        };
+        open_file.data = ptr.dismiss().into_foreign();
+
+        // This overwrites the private data with a small Rust-owned wrapper that keeps the module
+        // pinned for the full file lifetime and owns the driver's foreign private data handle.
         //
         // SAFETY: The open call of a file can access the private data.
-        unsafe { (*raw_file).private_data = ptr.into_foreign() };
+        unsafe { (*raw_file).private_data = open_file.into_foreign() };
 
         0
     }
@@ -243,14 +465,17 @@ impl<T: MiscDevice> MiscdeviceVTable<T> {
     /// must be associated with a `MiscDeviceRegistration<T>`.
     unsafe extern "C" fn release(_inode: *mut bindings::inode, file: *mut bindings::file) -> c_int {
         // SAFETY: The release call of a file owns the private data.
-        let private = unsafe { (*file).private_data };
-        // SAFETY: The release call of a file owns the private data.
-        let ptr = unsafe { <T::Ptr as ForeignOwnable>::from_foreign(private) };
+        let open_file =
+            unsafe { <KBox<OpenFile<T>> as ForeignOwnable>::from_foreign((*file).private_data) };
+        let data = open_file.data;
 
         // SAFETY:
         // * The file is valid for the duration of this call.
         // * There is no active fdget_pos region on the file on this thread.
-        T::release(ptr, unsafe { File::from_raw_file(file) });
+        T::release(
+            unsafe { <T::Ptr as ForeignOwnable>::from_foreign(data) },
+            unsafe { File::from_raw_file(file) },
+        );
 
         0
     }
@@ -304,11 +529,9 @@ impl<T: MiscDevice> MiscdeviceVTable<T> {
         vma: *mut bindings::vm_area_struct,
     ) -> c_int {
         // SAFETY: The mmap call of a file can access the private data.
-        let private = unsafe { (*file).private_data };
-        // SAFETY: This is a Rust Miscdevice, so we call `into_foreign` in `open` and
-        // `from_foreign` in `release`, and `fops_mmap` is guaranteed to be called between those
-        // two operations.
-        let device = unsafe { <T::Ptr as ForeignOwnable>::borrow(private.cast()) };
+        let open_file =
+            unsafe { <KBox<OpenFile<T>> as ForeignOwnable>::borrow((*file).private_data) };
+        let device = open_file.borrow();
         // SAFETY: The caller provides a vma that is undergoing initial VMA setup.
         let area = unsafe { VmaNew::from_raw(vma) };
         // SAFETY:
@@ -327,9 +550,9 @@ impl<T: MiscDevice> MiscdeviceVTable<T> {
     /// `file` must be a valid file that is associated with a `MiscDeviceRegistration<T>`.
     unsafe extern "C" fn ioctl(file: *mut bindings::file, cmd: c_uint, arg: c_ulong) -> c_long {
         // SAFETY: The ioctl call of a file can access the private data.
-        let private = unsafe { (*file).private_data };
-        // SAFETY: Ioctl calls can borrow the private data of the file.
-        let device = unsafe { <T::Ptr as ForeignOwnable>::borrow(private) };
+        let open_file =
+            unsafe { <KBox<OpenFile<T>> as ForeignOwnable>::borrow((*file).private_data) };
+        let device = open_file.borrow();
 
         // SAFETY:
         // * The file is valid for the duration of this call.
@@ -352,9 +575,9 @@ impl<T: MiscDevice> MiscdeviceVTable<T> {
         arg: c_ulong,
     ) -> c_long {
         // SAFETY: The compat ioctl call of a file can access the private data.
-        let private = unsafe { (*file).private_data };
-        // SAFETY: Ioctl calls can borrow the private data of the file.
-        let device = unsafe { <T::Ptr as ForeignOwnable>::borrow(private) };
+        let open_file =
+            unsafe { <KBox<OpenFile<T>> as ForeignOwnable>::borrow((*file).private_data) };
+        let device = open_file.borrow();
 
         // SAFETY:
         // * The file is valid for the duration of this call.
@@ -373,9 +596,9 @@ impl<T: MiscDevice> MiscdeviceVTable<T> {
     /// - `seq_file` must be a valid `struct seq_file` that we can write to.
     unsafe extern "C" fn show_fdinfo(seq_file: *mut bindings::seq_file, file: *mut bindings::file) {
         // SAFETY: The release call of a file owns the private data.
-        let private = unsafe { (*file).private_data };
-        // SAFETY: Ioctl calls can borrow the private data of the file.
-        let device = unsafe { <T::Ptr as ForeignOwnable>::borrow(private) };
+        let open_file =
+            unsafe { <KBox<OpenFile<T>> as ForeignOwnable>::borrow((*file).private_data) };
+        let device = open_file.borrow();
         // SAFETY:
         // * The file is valid for the duration of this call.
         // * There is no active fdget_pos region on the file on this thread.
@@ -386,43 +609,11 @@ impl<T: MiscDevice> MiscdeviceVTable<T> {
 
         T::show_fdinfo(device, m, file);
     }
+}
 
-    const VTABLE: bindings::file_operations = bindings::file_operations {
-        open: Some(Self::open),
-        release: Some(Self::release),
-        mmap: if T::HAS_MMAP { Some(Self::mmap) } else { None },
-        read_iter: if T::HAS_READ_ITER {
-            Some(Self::read_iter)
-        } else {
-            None
-        },
-        write_iter: if T::HAS_WRITE_ITER {
-            Some(Self::write_iter)
-        } else {
-            None
-        },
-        unlocked_ioctl: if T::HAS_IOCTL {
-            Some(Self::ioctl)
-        } else {
-            None
-        },
-        #[cfg(CONFIG_COMPAT)]
-        compat_ioctl: if T::HAS_COMPAT_IOCTL {
-            Some(Self::compat_ioctl)
-        } else if T::HAS_IOCTL {
-            bindings::compat_ptr_ioctl
-        } else {
-            None
-        },
-        show_fdinfo: if T::HAS_SHOW_FDINFO {
-            Some(Self::show_fdinfo)
-        } else {
-            None
-        },
-        ..pin_init::zeroed()
-    };
-
-    const fn build() -> &'static bindings::file_operations {
-        &Self::VTABLE
-    }
+#[doc(hidden)]
+pub const fn build_file_operations<T: MiscDevice + 'static>(
+    owner: *mut bindings::module,
+) -> bindings::file_operations {
+    MiscdeviceVTable::<T>::build(owner)
 }
diff --git a/samples/rust/rust_misc_device.rs b/samples/rust/rust_misc_device.rs
index 87a1fe63533a..f2d7a98a5715 100644
--- a/samples/rust/rust_misc_device.rs
+++ b/samples/rust/rust_misc_device.rs
@@ -100,7 +100,7 @@
     fs::{File, Kiocb},
     ioctl::{_IO, _IOC_SIZE, _IOR, _IOW},
     iov::{IovIterDest, IovIterSource},
-    miscdevice::{MiscDevice, MiscDeviceOptions, MiscDeviceRegistration},
+    miscdevice::{MiscDevice, MiscDeviceOpenContext, MiscDeviceOptions, MiscDeviceRegistration},
     new_mutex,
     prelude::*,
     sync::{aref::ARef, Mutex},
@@ -154,9 +154,10 @@ struct RustMiscDevice {
 #[vtable]
 impl MiscDevice for RustMiscDevice {
     type Ptr = Pin<KBox<Self>>;
+    type RegistrationData = ();
 
-    fn open(_file: &File, misc: &MiscDeviceRegistration<Self>) -> Result<Pin<KBox<Self>>> {
-        let dev = ARef::from(misc.device());
+    fn open(_file: &File, ctx: &MiscDeviceOpenContext<'_, Self>) -> Result<Pin<KBox<Self>>> {
+        let dev = ctx.device();
 
         dev_info!(dev, "Opening Rust Misc Device Sample\n");
 
@@ -222,6 +223,8 @@ fn ioctl(me: Pin<&RustMiscDevice>, _file: &File, cmd: u32, arg: usize) -> Result
     }
 }
 
+kernel::declare_misc_device_fops!(RustMiscDevice);
+
 #[pinned_drop]
 impl PinnedDrop for RustMiscDevice {
     fn drop(self: Pin<&mut Self>) {
-- 
2.34.1


* [RFC PATCH v3 3/6] rust: page: add helpers for page-backed ping state
From: Wenzhao Liao @ 2026-04-06 16:51 UTC (permalink / raw)
  To: rust-for-linux, linux-pci
  Cc: ojeda, dakr, bhelgaas, kwilczynski, arnd, gregkh, linux-kernel,
	linux-api
In-Reply-To: <20260406165120.166928-1-wenzhaoliao@ruc.edu.cn>

Add the minimal safe page helpers needed by the goldfish ping buffer:
physical address discovery plus bounded read, write, and zeroing
operations.

The driver uses these helpers to manage its per-file ping page entirely
from safe Rust while keeping the raw page mapping and pointer handling
inside the page abstraction.
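
As a rough illustration of what "bounded" means for these helpers, the
following is a userspace model only. `PageModel`, `PAGE_SIZE`, and
`PageError` are hypothetical stand-ins and not the kernel `Page` API;
the sketch assumes a 4096-byte page and reports out-of-range accesses
the way the kernel helpers report `EINVAL`.

```rust
use std::ops::Range;

// Hypothetical page size for the model; the kernel uses its own PAGE_SIZE.
const PAGE_SIZE: usize = 4096;

#[derive(Debug, PartialEq)]
enum PageError {
    /// Stands in for the `EINVAL` returned on out-of-range accesses.
    Invalid,
}

/// Userspace stand-in for a page: a plain byte array, no mapping involved.
struct PageModel {
    bytes: [u8; PAGE_SIZE],
}

impl PageModel {
    fn new() -> Self {
        Self { bytes: [0; PAGE_SIZE] }
    }

    /// Returns `offset..offset + len`, rejecting arithmetic overflow and
    /// any range that extends past the end of the page.
    fn range(offset: usize, len: usize) -> Result<Range<usize>, PageError> {
        let end = offset.checked_add(len).ok_or(PageError::Invalid)?;
        if end > PAGE_SIZE {
            return Err(PageError::Invalid);
        }
        Ok(offset..end)
    }

    /// Shared borrow: reading does not need exclusive access.
    fn read_slice(&self, dst: &mut [u8], offset: usize) -> Result<(), PageError> {
        let r = Self::range(offset, dst.len())?;
        dst.copy_from_slice(&self.bytes[r]);
        Ok(())
    }

    /// Exclusive borrow: the borrow checker rules out overlapping writers.
    fn write_slice(&mut self, src: &[u8], offset: usize) -> Result<(), PageError> {
        let r = Self::range(offset, src.len())?;
        self.bytes[r].copy_from_slice(src);
        Ok(())
    }

    fn fill_zero(&mut self, offset: usize, len: usize) -> Result<(), PageError> {
        let r = Self::range(offset, len)?;
        self.bytes[r].fill(0);
        Ok(())
    }
}

fn main() {
    let mut page = PageModel::new();
    page.write_slice(&[1, 2, 3, 4], 8).unwrap();

    let mut buf = [0u8; 4];
    page.read_slice(&mut buf, 8).unwrap();
    assert_eq!(buf, [1, 2, 3, 4]);

    page.fill_zero(8, 2).unwrap();
    page.read_slice(&mut buf, 8).unwrap();
    assert_eq!(buf, [0, 0, 3, 4]);

    // A write that would cross the end of the page is rejected.
    assert_eq!(
        page.write_slice(&[0; 8], PAGE_SIZE - 4),
        Err(PageError::Invalid)
    );
}
```

The split between `&self` for reads and `&mut self` for writes and
zeroing mirrors the signatures in the patch; the bounds check itself is
only a sketch of the behaviour, not the kernel implementation.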

Signed-off-by: Wenzhao Liao <wenzhaoliao@ruc.edu.cn>
---
 rust/helpers/page.c |  5 +++++
 rust/kernel/page.rs | 52 +++++++++++++++++++++++----------------------
 2 files changed, 32 insertions(+), 25 deletions(-)

diff --git a/rust/helpers/page.c b/rust/helpers/page.c
index f8463fbed2a2..05824bdc4fd8 100644
--- a/rust/helpers/page.c
+++ b/rust/helpers/page.c
@@ -20,6 +20,11 @@ __rust_helper void rust_helper_kunmap_local(const void *addr)
 	kunmap_local(addr);
 }
 
+__rust_helper phys_addr_t rust_helper_page_to_phys(struct page *page)
+{
+	return page_to_phys(page);
+}
+
 #ifndef NODE_NOT_IN_PAGE_FLAGS
 __rust_helper int rust_helper_page_to_nid(const struct page *page)
 {
diff --git a/rust/kernel/page.rs b/rust/kernel/page.rs
index adecb200c654..e8336d1bcc12 100644
--- a/rust/kernel/page.rs
+++ b/rust/kernel/page.rs
@@ -7,7 +7,7 @@
     bindings,
     error::code::*,
     error::Result,
-    uaccess::UserSliceReader,
+    io::PhysAddr,
 };
 use core::{
     marker::PhantomData,
@@ -198,6 +198,13 @@ pub fn nid(&self) -> i32 {
         unsafe { bindings::page_to_nid(self.as_ptr()) }
     }
 
+    /// Returns the physical address of the start of this page.
+    #[inline]
+    pub fn phys_addr(&self) -> PhysAddr {
+        // SAFETY: `self.as_ptr()` is a live `struct page` owned by this `Page`.
+        unsafe { bindings::page_to_phys(self.as_ptr()) }
+    }
+
     /// Runs a piece of code with this page mapped to an address.
     ///
     /// The page is unmapped when this call returns.
@@ -337,30 +344,25 @@ pub unsafe fn fill_zero_raw(&self, offset: usize, len: usize) -> Result {
         })
     }
 
-    /// Copies data from userspace into this page.
-    ///
-    /// This method will perform bounds checks on the page offset. If `offset .. offset+len` goes
-    /// outside of the page, then this call returns [`EINVAL`].
-    ///
-    /// Like the other `UserSliceReader` methods, data races are allowed on the userspace address.
-    /// However, they are not allowed on the page you are copying into.
-    ///
-    /// # Safety
-    ///
-    /// Callers must ensure that this call does not race with a read or write to the same page that
-    /// overlaps with this write.
-    pub unsafe fn copy_from_user_slice_raw(
-        &self,
-        reader: &mut UserSliceReader,
-        offset: usize,
-        len: usize,
-    ) -> Result {
-        self.with_pointer_into_page(offset, len, move |dst| {
-            // SAFETY: If `with_pointer_into_page` calls into this closure, then it has performed a
-            // bounds check and guarantees that `dst` is valid for `len` bytes. Furthermore, we have
-            // exclusive access to the slice since the caller guarantees that there are no races.
-            reader.read_raw(unsafe { core::slice::from_raw_parts_mut(dst.cast(), len) })
-        })
+    /// Maps the page and reads from it into the given buffer.
+    pub fn read_slice(&self, dst: &mut [u8], offset: usize) -> Result {
+        // SAFETY: `dst` is a valid mutable slice for `dst.len()` bytes. Safe Rust also prevents
+        // callers from obtaining a mutable reference to this `Page` while this shared borrow
+        // exists, so concurrent writes through the safe API cannot overlap with this read.
+        unsafe { self.read_raw(dst.as_mut_ptr(), offset, dst.len()) }
+    }
+
+    /// Maps the page and writes the given buffer into it.
+    pub fn write_slice(&mut self, src: &[u8], offset: usize) -> Result {
+        // SAFETY: `src` is a valid immutable slice for `src.len()` bytes, and `&mut self`
+        // guarantees unique access to the page through the safe API.
+        unsafe { self.write_raw(src.as_ptr(), offset, src.len()) }
+    }
+
+    /// Zeroes a range within the page.
+    pub fn fill_zero(&mut self, offset: usize, len: usize) -> Result {
+        // SAFETY: `&mut self` guarantees unique access to the page through the safe API.
+        unsafe { self.fill_zero_raw(offset, len) }
     }
 }
 
-- 
2.34.1


* [RFC PATCH v3 2/6] rust: bindings: expose goldfish address-space headers
From: Wenzhao Liao @ 2026-04-06 16:51 UTC (permalink / raw)
  To: rust-for-linux, linux-pci
  Cc: ojeda, dakr, bhelgaas, kwilczynski, arnd, gregkh, linux-kernel,
	linux-api
In-Reply-To: <20260406165120.166928-1-wenzhaoliao@ruc.edu.cn>

Expose the UAPI header and the Linux I/O declarations needed by the Rust
goldfish address-space driver.

This keeps the driver-side code on typed Rust interfaces while still
allowing the binding and helper layers to see the header and memremap
support required by the abstraction patches that follow.

Signed-off-by: Wenzhao Liao <wenzhaoliao@ruc.edu.cn>
---
 rust/bindings/bindings_helper.h | 1 +
 rust/uapi/uapi_helper.h         | 1 +
 2 files changed, 2 insertions(+)

diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h
index 083cc44aa952..b0baff4c6349 100644
--- a/rust/bindings/bindings_helper.h
+++ b/rust/bindings/bindings_helper.h
@@ -59,6 +59,7 @@
 #include <linux/fs.h>
 #include <linux/i2c.h>
 #include <linux/interrupt.h>
+#include <linux/io.h>
 #include <linux/io-pgtable.h>
 #include <linux/ioport.h>
 #include <linux/jiffies.h>
diff --git a/rust/uapi/uapi_helper.h b/rust/uapi/uapi_helper.h
index 06d7d1a2e8da..ff19edab81da 100644
--- a/rust/uapi/uapi_helper.h
+++ b/rust/uapi/uapi_helper.h
@@ -11,6 +11,7 @@
 #include <uapi/drm/nova_drm.h>
 #include <uapi/drm/panthor_drm.h>
 #include <uapi/linux/android/binder.h>
+#include <uapi/linux/goldfish_address_space.h>
 #include <uapi/linux/mdio.h>
 #include <uapi/linux/mii.h>
 #include <uapi/linux/ethtool.h>
-- 
2.34.1


* [RFC PATCH v3 4/6] rust: pci: add shared BAR memremap support
From: Wenzhao Liao @ 2026-04-06 16:51 UTC (permalink / raw)
  To: rust-for-linux, linux-pci
  Cc: ojeda, dakr, bhelgaas, kwilczynski, arnd, gregkh, linux-kernel,
	linux-api
In-Reply-To: <20260406165120.166928-1-wenzhaoliao@ruc.edu.cn>

Add a small Rust-owned abstraction for PCI BARs that back shared memory
instead of register MMIO.

The new SharedMemoryBar type owns both the BAR reservation and the
memremap() lifetime, exposes the physical BAR start needed by the
address-space ping path, and keeps the resource bookkeeping out of the
Rust driver.

The current RFC no longer exposes userspace mmap, but the driver still
needs an owned shared-BAR reservation and the BAR's physical base for
the ping path. Keeping the reservation/memremap() pairing in a Rust
abstraction avoids pushing that lifetime bookkeeping back into driver
code.
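
To make the reservation/memremap() pairing concrete, here is a small
userspace model of the ordering. It is a sketch under assumptions:
`Guard`, `SharedBarModel`, and the string log are hypothetical and only
record which step would run; no PCI or memremap() call is made.

```rust
use std::cell::RefCell;

/// Records the order of the modelled calls.
type Log = RefCell<Vec<&'static str>>;

/// Tiny scope guard: runs its closure on drop unless dismissed.
struct Guard<F: FnOnce()> {
    cleanup: Option<F>,
}

impl<F: FnOnce()> Guard<F> {
    fn new(f: F) -> Self {
        Self { cleanup: Some(f) }
    }

    /// Disarms the guard so the cleanup closure never runs.
    fn dismiss(mut self) {
        self.cleanup = None;
    }
}

impl<F: FnOnce()> Drop for Guard<F> {
    fn drop(&mut self) {
        if let Some(f) = self.cleanup.take() {
            f();
        }
    }
}

/// Hypothetical owner of a region reservation plus its mapping.
struct SharedBarModel<'a> {
    log: &'a Log,
}

impl<'a> SharedBarModel<'a> {
    /// `map_ok` selects whether the modelled mapping step succeeds.
    fn new(log: &'a Log, map_ok: bool) -> Result<Self, &'static str> {
        log.borrow_mut().push("request_region");
        // If mapping fails below, the guard undoes the reservation.
        let release = Guard::new(|| {
            log.borrow_mut().push("release_region");
        });

        if !map_ok {
            return Err("ENOMEM");
        }
        log.borrow_mut().push("memremap");

        // Both steps succeeded: ownership moves to the returned value.
        release.dismiss();
        Ok(Self { log })
    }
}

impl Drop for SharedBarModel<'_> {
    fn drop(&mut self) {
        // Teardown runs in reverse order of setup.
        let mut log = self.log.borrow_mut();
        log.push("memunmap");
        log.push("release_region");
    }
}

fn main() {
    let log = Log::default();

    drop(SharedBarModel::new(&log, true).unwrap());
    assert_eq!(
        *log.borrow(),
        ["request_region", "memremap", "memunmap", "release_region"]
    );

    log.borrow_mut().clear();
    assert!(SharedBarModel::new(&log, false).is_err());
    assert_eq!(*log.borrow(), ["request_region", "release_region"]);
}
```

The point of the model is that the driver never sees a half-set-up
state: either both steps are owned by one value, or the reservation has
already been rolled back.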

Signed-off-by: Wenzhao Liao <wenzhaoliao@ruc.edu.cn>
---
 rust/kernel/pci.rs    |   8 +++
 rust/kernel/pci/id.rs |   2 +-
 rust/kernel/pci/io.rs | 112 +++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 120 insertions(+), 2 deletions(-)

diff --git a/rust/kernel/pci.rs b/rust/kernel/pci.rs
index af74ddff6114..4c63c931ffb2 100644
--- a/rust/kernel/pci.rs
+++ b/rust/kernel/pci.rs
@@ -47,6 +47,7 @@
     ConfigSpaceSize,
     Extended,
     Normal, //
+    SharedMemoryBar,
 };
 pub use self::irq::{
     IrqType,
@@ -458,6 +459,13 @@ pub fn set_master(&self) {
         // SAFETY: `self.as_raw` is guaranteed to be a pointer to a valid `struct pci_dev`.
         unsafe { bindings::pci_set_master(self.as_raw()) };
     }
+
+    /// Disable this PCI device.
+    #[inline]
+    pub fn disable_device(&self) {
+        // SAFETY: `self.as_raw` is guaranteed to be a pointer to a valid `struct pci_dev`.
+        unsafe { bindings::pci_disable_device(self.as_raw()) };
+    }
 }
 
 // SAFETY: `pci::Device` is a transparent wrapper of `struct pci_dev`.
diff --git a/rust/kernel/pci/id.rs b/rust/kernel/pci/id.rs
index 50005d176561..bd3cf17fd8de 100644
--- a/rust/kernel/pci/id.rs
+++ b/rust/kernel/pci/id.rs
@@ -156,7 +156,7 @@ fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
 impl Vendor {
     /// Create a Vendor from a raw 16-bit vendor ID.
     #[inline]
-    pub(super) fn from_raw(vendor_id: u16) -> Self {
+    pub const fn from_raw(vendor_id: u16) -> Self {
         Self(vendor_id)
     }
 
diff --git a/rust/kernel/pci/io.rs b/rust/kernel/pci/io.rs
index fb6edab2aea7..89bf882b9634 100644
--- a/rust/kernel/pci/io.rs
+++ b/rust/kernel/pci/io.rs
@@ -7,6 +7,7 @@
     bindings,
     device,
     devres::Devres,
+    ffi::{c_ulong, c_void},
     io::{
         io_define_read,
         io_define_write,
@@ -17,11 +18,13 @@
         MmioRaw, //
     },
     prelude::*,
-    sync::aref::ARef, //
+    sync::aref::ARef,
+    types::ScopeGuard,
 };
 use core::{
     marker::PhantomData,
     ops::Deref, //
+    ptr::NonNull,
 };
 
 /// Represents the size of a PCI configuration space.
@@ -285,6 +288,104 @@ fn deref(&self) -> &Self::Target {
     }
 }
 
+/// A cacheable shared-memory mapping of a PCI BAR created via `memremap()`.
+///
+/// This is intended for BARs that back shared memory rather than device register MMIO. The
+/// mapping owns both the underlying PCI region reservation and the `memremap()` lifetime, so
+/// driver code does not need to keep raw pointers or manually pair teardown calls.
+pub struct SharedMemoryBar {
+    pdev: ARef<Device>,
+    addr: NonNull<c_void>,
+    phys_start: bindings::resource_size_t,
+    len: usize,
+    num: i32,
+}
+
+// SAFETY: `SharedMemoryBar` owns a stable BAR reservation plus its `memremap()` mapping. Moving
+// the owner to another thread does not change the validity of the underlying PCI resource.
+unsafe impl Send for SharedMemoryBar {}
+
+// SAFETY: Shared references only expose immutable metadata queries; the mapped pointer itself is
+// not exposed for dereferencing.
+unsafe impl Sync for SharedMemoryBar {}
+
+impl SharedMemoryBar {
+    fn new(pdev: &Device, num: u32, name: &CStr) -> Result<Self> {
+        if !Bar::index_is_valid(num) {
+            return Err(EINVAL);
+        }
+
+        let len = pdev.resource_len(num)?;
+        if len == 0 {
+            return Err(ENXIO);
+        }
+
+        let len = usize::try_from(len)?;
+        let phys_start = pdev.resource_start(num)?;
+        let num = i32::try_from(num)?;
+
+        // SAFETY:
+        // - `pdev` is valid by the invariants of `Device`.
+        // - `num` is checked above.
+        // - `name` is a valid NUL-terminated string.
+        let ret = unsafe { bindings::pci_request_region(pdev.as_raw(), num, name.as_char_ptr()) };
+        if ret != 0 {
+            return Err(EBUSY);
+        }
+
+        let release_region = ScopeGuard::new(|| {
+            // SAFETY:
+            // - `pdev` is still valid for the duration of this constructor.
+            // - `num` has just been successfully reserved.
+            unsafe { bindings::pci_release_region(pdev.as_raw(), num) };
+        });
+
+        // SAFETY:
+        // - `phys_start`/`len` describe the BAR range we just reserved.
+        // - `MEMREMAP_WB` matches the external goldfish driver behaviour.
+        let addr = unsafe { bindings::memremap(phys_start, len, bindings::MEMREMAP_WB as c_ulong) };
+        let addr = NonNull::new(addr.cast()).ok_or(ENOMEM)?;
+
+        release_region.dismiss();
+
+        Ok(Self {
+            pdev: pdev.into(),
+            addr,
+            phys_start,
+            len,
+            num,
+        })
+    }
+
+    /// Returns the physical start address of the BAR.
+    #[inline]
+    pub fn phys_start(&self) -> bindings::resource_size_t {
+        self.phys_start
+    }
+
+    /// Returns the BAR size in bytes.
+    #[inline]
+    pub fn len(&self) -> usize {
+        self.len
+    }
+
+    fn release(&self) {
+        // SAFETY:
+        // - `self.addr` is a valid `memremap()` result owned by `self`.
+        // - `self.num` is the BAR region successfully reserved by `Self::new`.
+        unsafe {
+            bindings::memunmap(self.addr.as_ptr().cast());
+            bindings::pci_release_region(self.pdev.as_raw(), self.num);
+        }
+    }
+}
+
+impl Drop for SharedMemoryBar {
+    fn drop(&mut self) {
+        self.release();
+    }
+}
+
 impl Device<device::Bound> {
     /// Maps an entire PCI BAR after performing a region-request on it. I/O operation bound checks
     /// can be performed on compile time for offsets (plus the requested type size) < SIZE.
@@ -305,6 +406,15 @@ pub fn iomap_region<'a>(
         self.iomap_region_sized::<0>(bar, name)
     }
 
+    /// Reserve and `memremap()` an entire PCI BAR as cacheable shared memory.
+    pub fn memremap_bar<'a>(
+        &'a self,
+        bar: u32,
+        name: &'a CStr,
+    ) -> impl PinInit<Devres<SharedMemoryBar>, Error> + 'a {
+        Devres::new(self.as_ref(), SharedMemoryBar::new(self, bar, name))
+    }
+
     /// Returns the size of configuration space.
     pub fn cfg_size(&self) -> ConfigSpaceSize {
         // SAFETY: `self.as_raw` is a valid pointer to a `struct pci_dev`.
-- 
2.34.1


* [RFC PATCH v3 0/6] Rust goldfish_address_space driver (ioctl-only subset)
From: Wenzhao Liao @ 2026-04-06 16:51 UTC
  To: rust-for-linux, linux-pci
  Cc: ojeda, dakr, bhelgaas, kwilczynski, arnd, gregkh, linux-kernel,
	linux-api
In-Reply-To: <cover.1775456181.git.wenzhaoliao@ruc.edu.cn>

This respin narrows the Rust goldfish_address_space RFC to the
open/release/ioctl ABI subset. Userspace mmap and PING_WITH_DATA are
not part of this series.

I would like to send this as a small first upstream step for the Rust
driver, instead of asking reviewers to take the mmap/VMA lifecycle work
in the same round.

The goal of the respin is to keep only the pieces that are still
required by the current driver:
- the goldfish UAPI header and Rust bindings exposure,
- minimal page helpers for the ping page,
- a small SharedMemoryBar abstraction for shared BAR reservation,
  memremap() lifetime, and physical base discovery,
- hardened miscdevice registration/open boundaries,
- and the Rust goldfish_address_space driver itself.
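
For readers who have not looked at the pci/io.rs patch, the SharedMemoryBar
lifetime can be illustrated with a plain userspace model. This is only a
sketch: `ModelBar`, `ScopeGuard`, `Log`, and `run` are hypothetical names,
the log strings stand in for the real reserve/map/unmap calls, and nothing
here is kernel code. It shows a guard releasing the region when mapping
fails, the guard being dismissed on success, and Drop undoing the mapping
before the reservation.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Shared log so the order of acquire/release steps can be observed.
type Log = Rc<RefCell<Vec<&'static str>>>;

/// Runs a cleanup closure on drop unless `dismiss()` was called first.
struct ScopeGuard<F: FnOnce()> {
    cleanup: Option<F>,
}

impl<F: FnOnce()> ScopeGuard<F> {
    fn new(cleanup: F) -> Self {
        Self { cleanup: Some(cleanup) }
    }

    /// Cancels the cleanup; responsibility for the resource moves elsewhere.
    fn dismiss(mut self) {
        self.cleanup = None;
    }
}

impl<F: FnOnce()> Drop for ScopeGuard<F> {
    fn drop(&mut self) {
        if let Some(cleanup) = self.cleanup.take() {
            cleanup();
        }
    }
}

/// Hypothetical stand-in for the mapped BAR (not the kernel type).
struct ModelBar {
    log: Log,
}

impl ModelBar {
    /// `map_ok` simulates whether the mapping step succeeds (assumption).
    fn new(log: Log, map_ok: bool) -> Result<Self, &'static str> {
        log.borrow_mut().push("request_region");
        let guard_log = log.clone();
        let release_region = ScopeGuard::new(move || {
            guard_log.borrow_mut().push("release_region");
        });

        if !map_ok {
            // Early return: the guard drops here and releases the region.
            return Err("ENOMEM");
        }
        log.borrow_mut().push("memremap");

        // Success: from now on `Drop for ModelBar` owns the cleanup.
        release_region.dismiss();
        Ok(Self { log })
    }
}

impl Drop for ModelBar {
    fn drop(&mut self) {
        // Reverse order of acquisition: unmap first, then release.
        let mut log = self.log.borrow_mut();
        log.push("memunmap");
        log.push("release_region");
    }
}

/// Builds a `ModelBar`, drops it, and returns the recorded step order.
fn run(map_ok: bool) -> Vec<&'static str> {
    let log: Log = Rc::new(RefCell::new(Vec::new()));
    let bar = ModelBar::new(log.clone(), map_ok);
    drop(bar);
    let steps = log.borrow().clone();
    steps
}

fn main() {
    println!("{:?}", run(true));
    println!("{:?}", run(false));
}
```

In both paths the region is released exactly once, which is the property
the guard plus Drop pairing is meant to provide.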

Compared to the previous round, this drops the Rust VMA/BAR-to-VMA
mapping work from the series and rewrites the driver and miscdevice
pieces around the current teardown and publication model. The driver
remains #![forbid(unsafe_code)].

Feedback would be especially helpful on:
- whether the ioctl-only ABI subset is a reasonable first upstream step
  for goldfish_address_space;
- whether SharedMemoryBar is the right minimal Rust abstraction for
  shared-memory BAR reservation plus memremap() lifetime;
- whether the miscdevice hardening direction makes sense, especially the
  publication-safe open context and the THIS_MODULE-owned safe
  file_operations path.

Changes since v2:
- dropped the userspace mmap portion of the RFC and removed the unused
  Rust VMA/BAR-to-VMA mapping patch from the series;
- narrowed the goldfish Kconfig help text and driver description to the
  open/release/ioctl ABI subset;
- reworked miscdevice so safe open() only sees publication-safe state
  and safe drivers no longer have a raw file_operations escape hatch;
- reworked goldfish teardown around deregister() -> shutdown() ->
  disable_device(), with live-file revocation before PCI disable and
  explicit enable_device_mem() probe unwind;
- kept the in-tree Rust VMA helpers still used by binder out of this
  series, so the respin only carries code with a current caller.
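
The teardown ordering above can be modelled in userspace as follows. This
is a rough sketch under assumptions, not the driver code: `ModelDevice`,
`ModelFile`, the three flags, and the "ENODEV" strings are hypothetical,
and mapping live-file revocation onto the shutdown step is my reading of
the changelog entry rather than something the sketch proves.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical model of the state shared between the device and open files.
struct ModelDevice {
    registered: AtomicBool, // new open() calls are accepted
    revoked: AtomicBool,    // live files must stop touching the device
    enabled: AtomicBool,    // stands in for the PCI device being enabled
}

/// Hypothetical open-file handle holding a reference to the device state.
struct ModelFile {
    dev: Arc<ModelDevice>,
}

impl ModelDevice {
    fn new() -> Arc<Self> {
        Arc::new(Self {
            registered: AtomicBool::new(true),
            revoked: AtomicBool::new(false),
            enabled: AtomicBool::new(true),
        })
    }

    fn open(self: &Arc<Self>) -> Result<ModelFile, &'static str> {
        if !self.registered.load(Ordering::Acquire) {
            return Err("ENODEV");
        }
        Ok(ModelFile { dev: Arc::clone(self) })
    }

    /// Models deregister() -> shutdown() -> disable_device().
    fn teardown(&self) {
        // 1. deregister: no new opens can reach the device.
        self.registered.store(false, Ordering::Release);
        // 2. shutdown: revoke files that are already open.
        self.revoked.store(true, Ordering::Release);
        // 3. disable_device: only now is the device itself disabled.
        self.enabled.store(false, Ordering::Release);
    }
}

impl ModelFile {
    fn ioctl(&self) -> Result<(), &'static str> {
        if self.dev.revoked.load(Ordering::Acquire) {
            return Err("ENODEV");
        }
        // A non-revoked file must only ever see an enabled device.
        assert!(self.dev.enabled.load(Ordering::Acquire));
        Ok(())
    }
}

fn main() {
    let dev = ModelDevice::new();
    let file = dev.open().expect("open before teardown");
    println!("ioctl before teardown: {:?}", file.ioctl());
    dev.teardown();
    println!("ioctl after teardown: {:?}", file.ioctl());
    println!("reopen after teardown ok: {}", dev.open().is_ok());
}
```

Because revocation is stored before the enabled flag is cleared, a file
that still passes the revocation check cannot observe a disabled device
in this single-threaded model; the real driver needs its own
synchronization to make the same guarantee.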

Behavior exercised for the RFC-limited ABI subset:
- open / release
- allocate_block / deallocate_block
- ping
- claim_shared / unclaim_shared
- unknown ioctl
- reopen

No claim is made beyond that subset in this respin.

Build-tested:
- make LLVM=1 rust/kernel.o
- make LLVM=1 drivers/platform/goldfish/goldfish_address_space.o
- make LLVM=1 samples/rust/rust_misc_device.o

Wenzhao Liao (6):
  uapi: add goldfish_address_space userspace ABI header
  rust: bindings: expose goldfish address-space headers
  rust: page: add helpers for page-backed ping state
  rust: pci: add shared BAR memremap support
  rust: miscdevice: harden registration and safe file_operations
    invariants
  platform/goldfish: add Rust goldfish_address_space driver

 MAINTAINERS                                   |  10 +
 drivers/platform/goldfish/Kconfig             |  11 +
 drivers/platform/goldfish/Makefile            |   1 +
 .../goldfish/goldfish_address_space.rs        | 917 ++++++++++++++++++
 include/uapi/linux/goldfish_address_space.h   |  54 ++
 rust/bindings/bindings_helper.h               |   1 +
 rust/helpers/page.c                           |   5 +
 rust/kernel/miscdevice.rs                     | 409 +++++---
 rust/kernel/page.rs                           |  52 +-
 rust/kernel/pci.rs                            |   8 +
 rust/kernel/pci/id.rs                         |   2 +-
 rust/kernel/pci/io.rs                         | 112 ++-
 rust/uapi/uapi_helper.h                       |   1 +
 samples/rust/rust_misc_device.rs              |   9 +-
 14 files changed, 1453 insertions(+), 139 deletions(-)
 create mode 100644 drivers/platform/goldfish/goldfish_address_space.rs
 create mode 100644 include/uapi/linux/goldfish_address_space.h

-- 
2.34.1


