Linux userland API discussions


* Re: [PATCH v3 00/30] Live Update Orchestrator
From: Pasha Tatashin @ 2025-08-26 15:02 UTC
  To: Jason Gunthorpe
  Cc: Pratyush Yadav, jasonmiu, graf, changyuanl, rppt, dmatlack,
	rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
	saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250826142406.GE1970008@nvidia.com>

On Tue, Aug 26, 2025 at 2:24 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Tue, Aug 26, 2025 at 01:54:31PM +0000, Pasha Tatashin wrote:
> > > > https://github.com/googleprodkernel/linux-liveupdate/tree/luo/v3
> > > >
> > > > Changelog from v2:
> > > > - Addressed comments from Mike Rapoport and Jason Gunthorpe
> > > > - Only one user agent (LiveupdateD) can open /dev/liveupdate
> > > > - With the above changes, sessions are not needed, and should be
> > > >   maintained by the user-agent itself, so removed support for
> > > >   sessions.
> > >
> > > If all the FDs are restored in the agent's context, this assigns all the
> > > resources to the agent. For example, if the agent restores a memfd, all
> > > the memory gets charged to the agent's cgroup, and the client gets none
> > > of it. This makes it impossible to do any kind of resource limits.
> > >
> > > This was one of the advantages of being able to pass around sessions
> > > instead of FDs. The agent can pass on the right session to the right
> > > client, and then the client does the restore, getting all the resources
> > > charged to it.
> > >
> > > If we don't allow this, I think we will make LUO/LiveupdateD unsuitable
> > > for many kinds of workloads. Do you have any ideas on how to do proper
> > > resource attribution with the current patches? If not, then perhaps we
> > > should reconsider this change?
> >
> > Hi Pratyush,
> >
> > That's an excellent point, and you're right that we must have a
> > solution for correct resource charging.
> >
> > I'd prefer to keep the session logic in the userspace agent (luod
> > https://tinyurl.com/luoddesign).
> >
> > For the charging problem, I believe there's a clear path forward with
> > the current ioctl-based API. The design of the ioctl commands (with a
> > size field in each struct) is intentionally extensible. In a follow-up
> > patch, we can extend the liveupdate_ioctl_fd_restore struct to include
> > a target pid field. The luod agent would then be able to restore an
> > FD on behalf of a client and instruct the kernel to charge the
> > associated resources to that client's PID.
>
> This wasn't quite the idea though..
>
> The session sub-FDs were intended to be passed directly to other
> processes through Unix sockets and FD passing so they could run their
> own ioctls in their own context for both save and restore. The ioctls
> available on the sessions should be specifically narrowed to be safe
> for this.
>
> I can understand not implementing session FDs in the first version,
> but when session FDs are available they should work like this and
> solve the namespace/cgroup/etc issues.
>
> Passing some PID in an ioctl is not a great idea...

Hi Jason,

I'm trying to understand the drawbacks of the PID-based approach.
Could you elaborate on why passing a PID in the RESTORE_FD ioctl is
not a good idea?

From my perspective, luod would have a live, open socket to the client
process requesting the restore. It can use SO_PEERCRED to securely
identify the client's PID at that moment. The flow would be:

1. Client connects and resumes its session with luod.
2. Client requests to restore TOKEN_X.
3. luod verifies the client owns TOKEN_X for its session.
4. luod calls the RESTORE_FD ioctl, telling the kernel: "Please
restore TOKEN_X and charge the resources to PID Y (which I just
verified is on the other end of this socket)."
5. The kernel performs the action.
6. luod receives the new FD from the kernel and passes it back to the
client over the socket.

In this flow, the client isn't providing an arbitrary PID; the trusted
luod agent is providing the PID of a process it has an active
connection with.
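
As a rough illustration of the identification step in the flow above,
the sketch below reads the peer's PID from a connected AF_UNIX socket
with SO_PEERCRED. The helper name peer_pid() is made up for this example
and is not part of luod or of the LUO series. Per unix(7), the
credentials returned are those that were in effect when the peer called
connect(2) or socketpair(2).

```c
#define _GNU_SOURCE            /* glibc exposes struct ucred under this macro */
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Hypothetical helper: return the PID recorded for the process on the
 * other end of a connected AF_UNIX socket, or -1 on error (errno is set
 * by getsockopt()).
 */
pid_t peer_pid(int sock)
{
	struct ucred cred;
	socklen_t len = sizeof(cred);

	if (getsockopt(sock, SOL_SOCKET, SO_PEERCRED, &cred, &len) < 0)
		return -1;

	/* The kernel reports how many bytes of struct ucred it filled in. */
	assert(len == sizeof(cred));
	return cred.pid;
}
```

An agent would call peer_pid() on the accepted client socket before
issuing the restore on that client's behalf.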

The idea was to let luod handle the session/security story, and the
kernel handle the core preservation mechanism. Adding sessions to the
kernel delegates the management and part of the security model to the
kernel. I am not sure that is necessary: what can be cleanly managed in
userspace should stay in userspace.

Thanks,
Pasha


>
> Jason


* [PATCH 5.4 313/403] move_mount: allow to add a mount into an existing group
From: Greg Kroah-Hartman @ 2025-08-26 11:10 UTC
  To: stable
  Cc: Greg Kroah-Hartman, patches, Eric W. Biederman, Alexander Viro,
	Christian Brauner, Mattias Nissler, Aleksa Sarai, Andrei Vagin,
	linux-fsdevel, linux-api, lkml, Pavel Tikhomirov, Sasha Levin
In-Reply-To: <20250826110905.607690791@linuxfoundation.org>

5.4-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

[ Upstream commit 9ffb14ef61bab83fa818736bf3e7e6b6e182e8e2 ]

Previously a sharing group (shared and master ids pair) could only be
inherited when a mount was created via bindmount. This patch adds the
ability to add an existing private mount into an existing sharing group.

With this functionality one can first create the desired mount tree from
only private mounts (without the need to care about undesired mount
propagation or mount creation order implied by sharing group
dependencies), and then set up any desired mount sharing between
those mounts in the tree as needed.

This allows CRIU to restore any set of mount namespaces, mount trees and
sharing group trees for a container.

We have many issues with restoring mounts in CRIU related to sharing
groups and propagation:
- reverse sharing groups vs mount tree order requires complex mounts
  reordering which mostly implies also using some temporary mounts
(please see https://lkml.org/lkml/2021/3/23/569 for more info)

- mount() syscall creates tons of mounts due to propagation
- mount re-parenting due to propagation
- "Mount Trap" due to propagation
- "Non Uniform" propagation, meaning that with different tricks with
  mount order and temporary children-"lock" mounts one can create mount
  trees which can't be restored without those tricks
(see https://www.linuxplumbersconf.org/event/7/contributions/640/)

With this new functionality we can resolve all the problems with
propagation at once.
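
For illustration only, a userspace caller might exercise the new flag
roughly as follows. This is a minimal sketch, not code from the patch:
the wrapper name set_mount_group() is hypothetical, and the fallback
syscall number 429 is an assumption that only applies when the libc
headers do not define SYS_move_mount. Going by the checks in
do_set_group() in this patch, both paths must be mount roots on the same
superblock, the "to" mount must be private, the "from" mount must be
shared or slave, and the caller needs CAP_SYS_ADMIN over both mount
namespaces.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>          /* AT_FDCWD */
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_move_mount
#define SYS_move_mount 429  /* assumption: generic syscall table number */
#endif
#ifndef MOVE_MOUNT_SET_GROUP
#define MOVE_MOUNT_SET_GROUP 0x00000100  /* value added by this patch */
#endif

/*
 * Hypothetical wrapper: ask the kernel to put the mount at `to` into the
 * sharing group of the mount at `from`. Returns 0 on success, or -1 with
 * errno set (for example EINVAL on kernels without the flag).
 */
int set_mount_group(const char *from, const char *to)
{
	assert(from && to);
	return (int)syscall(SYS_move_mount, AT_FDCWD, from, AT_FDCWD, to,
			    MOVE_MOUNT_SET_GROUP);
}
```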

Link: https://lore.kernel.org/r/20210715100714.120228-1-ptikhomirov@virtuozzo.com
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Mattias Nissler <mnissler@chromium.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-api@vger.kernel.org
Cc: lkml <linux-kernel@vger.kernel.org>
Co-developed-by: Andrei Vagin <avagin@gmail.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Stable-dep-of: cffd0441872e ("use uniform permission checks for all mount propagation changes")
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/namespace.c             | 77 +++++++++++++++++++++++++++++++++++++-
 include/uapi/linux/mount.h |  3 +-
 2 files changed, 78 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index ee5a87061f20..3c1afe60d438 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2624,6 +2624,78 @@ static bool check_for_nsfs_mounts(struct mount *subtree)
 	return ret;
 }
 
+static int do_set_group(struct path *from_path, struct path *to_path)
+{
+	struct mount *from, *to;
+	int err;
+
+	from = real_mount(from_path->mnt);
+	to = real_mount(to_path->mnt);
+
+	namespace_lock();
+
+	err = -EINVAL;
+	/* To and From must be mounted */
+	if (!is_mounted(&from->mnt))
+		goto out;
+	if (!is_mounted(&to->mnt))
+		goto out;
+
+	err = -EPERM;
+	/* We should be allowed to modify mount namespaces of both mounts */
+	if (!ns_capable(from->mnt_ns->user_ns, CAP_SYS_ADMIN))
+		goto out;
+	if (!ns_capable(to->mnt_ns->user_ns, CAP_SYS_ADMIN))
+		goto out;
+
+	err = -EINVAL;
+	/* To and From paths should be mount roots */
+	if (from_path->dentry != from_path->mnt->mnt_root)
+		goto out;
+	if (to_path->dentry != to_path->mnt->mnt_root)
+		goto out;
+
+	/* Setting sharing groups is only allowed across same superblock */
+	if (from->mnt.mnt_sb != to->mnt.mnt_sb)
+		goto out;
+
+	/* From mount root should be wider than To mount root */
+	if (!is_subdir(to->mnt.mnt_root, from->mnt.mnt_root))
+		goto out;
+
+	/* From mount should not have locked children in place of To's root */
+	if (has_locked_children(from, to->mnt.mnt_root))
+		goto out;
+
+	/* Setting sharing groups is only allowed on private mounts */
+	if (IS_MNT_SHARED(to) || IS_MNT_SLAVE(to))
+		goto out;
+
+	/* From should not be private */
+	if (!IS_MNT_SHARED(from) && !IS_MNT_SLAVE(from))
+		goto out;
+
+	if (IS_MNT_SLAVE(from)) {
+		struct mount *m = from->mnt_master;
+
+		list_add(&to->mnt_slave, &m->mnt_slave_list);
+		to->mnt_master = m;
+	}
+
+	if (IS_MNT_SHARED(from)) {
+		to->mnt_group_id = from->mnt_group_id;
+		list_add(&to->mnt_share, &from->mnt_share);
+		lock_mount_hash();
+		set_mnt_shared(to);
+		unlock_mount_hash();
+	}
+
+	err = 0;
+out:
+	namespace_unlock();
+	return err;
+}
+
 static int do_move_mount(struct path *old_path, struct path *new_path)
 {
 	struct mnt_namespace *ns;
@@ -3583,7 +3655,10 @@ SYSCALL_DEFINE5(move_mount,
 	if (ret < 0)
 		goto out_to;
 
-	ret = do_move_mount(&from_path, &to_path);
+	if (flags & MOVE_MOUNT_SET_GROUP)
+		ret = do_set_group(&from_path, &to_path);
+	else
+		ret = do_move_mount(&from_path, &to_path);
 
 out_to:
 	path_put(&to_path);
diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
index 96a0240f23fe..535ca707dfd7 100644
--- a/include/uapi/linux/mount.h
+++ b/include/uapi/linux/mount.h
@@ -70,7 +70,8 @@
 #define MOVE_MOUNT_T_SYMLINKS		0x00000010 /* Follow symlinks on to path */
 #define MOVE_MOUNT_T_AUTOMOUNTS		0x00000020 /* Follow automounts on to path */
 #define MOVE_MOUNT_T_EMPTY_PATH		0x00000040 /* Empty to path permitted */
-#define MOVE_MOUNT__MASK		0x00000077
+#define MOVE_MOUNT_SET_GROUP		0x00000100 /* Set sharing group instead */
+#define MOVE_MOUNT__MASK		0x00000177
 
 /*
  * fsopen() flags.
-- 
2.50.1





* Re: [PATCH v3 00/30] Live Update Orchestrator
From: Jason Gunthorpe @ 2025-08-26 14:24 UTC
  To: Pasha Tatashin
  Cc: Pratyush Yadav, jasonmiu, graf, changyuanl, rppt, dmatlack,
	rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
	saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <CA+CK2bBoLi9tYWHSFyDEHWd_cwvS_hR4q2HMmg-C+SJpQDNs=g@mail.gmail.com>

On Tue, Aug 26, 2025 at 01:54:31PM +0000, Pasha Tatashin wrote:
> > > https://github.com/googleprodkernel/linux-liveupdate/tree/luo/v3
> > >
> > > Changelog from v2:
> > > - Addressed comments from Mike Rapoport and Jason Gunthorpe
> > > - Only one user agent (LiveupdateD) can open /dev/liveupdate
> > > - With the above changes, sessions are not needed, and should be
> > >   maintained by the user-agent itself, so removed support for
> > >   sessions.
> >
> > If all the FDs are restored in the agent's context, this assigns all the
> > resources to the agent. For example, if the agent restores a memfd, all
> > the memory gets charged to the agent's cgroup, and the client gets none
> > of it. This makes it impossible to do any kind of resource limits.
> >
> > This was one of the advantages of being able to pass around sessions
> > instead of FDs. The agent can pass on the right session to the right
> > client, and then the client does the restore, getting all the resources
> > charged to it.
> >
> > If we don't allow this, I think we will make LUO/LiveupdateD unsuitable
> > for many kinds of workloads. Do you have any ideas on how to do proper
> > resource attribution with the current patches? If not, then perhaps we
> > should reconsider this change?
> 
> Hi Pratyush,
> 
> That's an excellent point, and you're right that we must have a
> solution for correct resource charging.
> 
> I'd prefer to keep the session logic in the userspace agent (luod
> https://tinyurl.com/luoddesign).
> 
> For the charging problem, I believe there's a clear path forward with
> the current ioctl-based API. The design of the ioctl commands (with a
> size field in each struct) is intentionally extensible. In a follow-up
> patch, we can extend the liveupdate_ioctl_fd_restore struct to include
> a target pid field. The luod agent would then be able to restore an
> FD on behalf of a client and instruct the kernel to charge the
> associated resources to that client's PID.

This wasn't quite the idea though..

The session sub-FDs were intended to be passed directly to other
processes through Unix sockets and FD passing so they could run their
own ioctls in their own context for both save and restore. The ioctls
available on the sessions should be specifically narrowed to be safe
for this.

I can understand not implementing session FDs in the first version,
but when session FDs are available they should work like this and
solve the namespace/cgroup/etc issues.
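
The FD passing referred to above is the standard SCM_RIGHTS mechanism on
AF_UNIX sockets. A minimal sketch follows; send_fd() and recv_fd() are
illustrative names, and nothing here is specific to LUO or to how a
session FD would actually be created.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Control buffer sized and aligned for exactly one file descriptor. */
union fd_cmsg {
	char buf[CMSG_SPACE(sizeof(int))];
	struct cmsghdr align;
};

/* Send `fd` over the connected AF_UNIX socket `sock`. Returns 0 or -1. */
int send_fd(int sock, int fd)
{
	char byte = 0;  /* one byte of ordinary data carries the control message */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = u.buf, .msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	cmsg = CMSG_FIRSTHDR(&msg);
	assert(cmsg != NULL);  /* the buffer is large enough for one header */
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor from `sock`. Returns the new fd, or -1. */
int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = u.buf, .msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg;
	int fd;

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The receiver ends up with its own descriptor for the same open file, so
any ioctl it issues on that descriptor runs in the receiver's context.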

Passing some PID in an ioctl is not a great idea...

Jason


* [PATCH 5.10 402/523] move_mount: allow to add a mount into an existing group
From: Greg Kroah-Hartman @ 2025-08-26 11:10 UTC
  To: stable
  Cc: Greg Kroah-Hartman, patches, Eric W. Biederman, Alexander Viro,
	Christian Brauner, Mattias Nissler, Aleksa Sarai, Andrei Vagin,
	linux-fsdevel, linux-api, lkml, Pavel Tikhomirov, Sasha Levin
In-Reply-To: <20250826110924.562212281@linuxfoundation.org>

5.10-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

[ Upstream commit 9ffb14ef61bab83fa818736bf3e7e6b6e182e8e2 ]

Previously a sharing group (shared and master ids pair) could only be
inherited when a mount was created via bindmount. This patch adds the
ability to add an existing private mount into an existing sharing group.

With this functionality one can first create the desired mount tree from
only private mounts (without the need to care about undesired mount
propagation or mount creation order implied by sharing group
dependencies), and then set up any desired mount sharing between
those mounts in the tree as needed.

This allows CRIU to restore any set of mount namespaces, mount trees and
sharing group trees for a container.

We have many issues with restoring mounts in CRIU related to sharing
groups and propagation:
- reverse sharing groups vs mount tree order requires complex mounts
  reordering which mostly implies also using some temporary mounts
(please see https://lkml.org/lkml/2021/3/23/569 for more info)

- mount() syscall creates tons of mounts due to propagation
- mount re-parenting due to propagation
- "Mount Trap" due to propagation
- "Non Uniform" propagation, meaning that with different tricks with
  mount order and temporary children-"lock" mounts one can create mount
  trees which can't be restored without those tricks
(see https://www.linuxplumbersconf.org/event/7/contributions/640/)

With this new functionality we can resolve all the problems with
propagation at once.

Link: https://lore.kernel.org/r/20210715100714.120228-1-ptikhomirov@virtuozzo.com
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Mattias Nissler <mnissler@chromium.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-api@vger.kernel.org
Cc: lkml <linux-kernel@vger.kernel.org>
Co-developed-by: Andrei Vagin <avagin@gmail.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Stable-dep-of: cffd0441872e ("use uniform permission checks for all mount propagation changes")
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/namespace.c             | 77 +++++++++++++++++++++++++++++++++++++-
 include/uapi/linux/mount.h |  3 +-
 2 files changed, 78 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index ee6d139f7529..7f7ccc9e53b8 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2692,6 +2692,78 @@ static bool check_for_nsfs_mounts(struct mount *subtree)
 	return ret;
 }
 
+static int do_set_group(struct path *from_path, struct path *to_path)
+{
+	struct mount *from, *to;
+	int err;
+
+	from = real_mount(from_path->mnt);
+	to = real_mount(to_path->mnt);
+
+	namespace_lock();
+
+	err = -EINVAL;
+	/* To and From must be mounted */
+	if (!is_mounted(&from->mnt))
+		goto out;
+	if (!is_mounted(&to->mnt))
+		goto out;
+
+	err = -EPERM;
+	/* We should be allowed to modify mount namespaces of both mounts */
+	if (!ns_capable(from->mnt_ns->user_ns, CAP_SYS_ADMIN))
+		goto out;
+	if (!ns_capable(to->mnt_ns->user_ns, CAP_SYS_ADMIN))
+		goto out;
+
+	err = -EINVAL;
+	/* To and From paths should be mount roots */
+	if (from_path->dentry != from_path->mnt->mnt_root)
+		goto out;
+	if (to_path->dentry != to_path->mnt->mnt_root)
+		goto out;
+
+	/* Setting sharing groups is only allowed across same superblock */
+	if (from->mnt.mnt_sb != to->mnt.mnt_sb)
+		goto out;
+
+	/* From mount root should be wider than To mount root */
+	if (!is_subdir(to->mnt.mnt_root, from->mnt.mnt_root))
+		goto out;
+
+	/* From mount should not have locked children in place of To's root */
+	if (has_locked_children(from, to->mnt.mnt_root))
+		goto out;
+
+	/* Setting sharing groups is only allowed on private mounts */
+	if (IS_MNT_SHARED(to) || IS_MNT_SLAVE(to))
+		goto out;
+
+	/* From should not be private */
+	if (!IS_MNT_SHARED(from) && !IS_MNT_SLAVE(from))
+		goto out;
+
+	if (IS_MNT_SLAVE(from)) {
+		struct mount *m = from->mnt_master;
+
+		list_add(&to->mnt_slave, &m->mnt_slave_list);
+		to->mnt_master = m;
+	}
+
+	if (IS_MNT_SHARED(from)) {
+		to->mnt_group_id = from->mnt_group_id;
+		list_add(&to->mnt_share, &from->mnt_share);
+		lock_mount_hash();
+		set_mnt_shared(to);
+		unlock_mount_hash();
+	}
+
+	err = 0;
+out:
+	namespace_unlock();
+	return err;
+}
+
 static int do_move_mount(struct path *old_path, struct path *new_path)
 {
 	struct mnt_namespace *ns;
@@ -3667,7 +3739,10 @@ SYSCALL_DEFINE5(move_mount,
 	if (ret < 0)
 		goto out_to;
 
-	ret = do_move_mount(&from_path, &to_path);
+	if (flags & MOVE_MOUNT_SET_GROUP)
+		ret = do_set_group(&from_path, &to_path);
+	else
+		ret = do_move_mount(&from_path, &to_path);
 
 out_to:
 	path_put(&to_path);
diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
index dd8306ea336c..fc6a2e63130b 100644
--- a/include/uapi/linux/mount.h
+++ b/include/uapi/linux/mount.h
@@ -71,7 +71,8 @@
 #define MOVE_MOUNT_T_SYMLINKS		0x00000010 /* Follow symlinks on to path */
 #define MOVE_MOUNT_T_AUTOMOUNTS		0x00000020 /* Follow automounts on to path */
 #define MOVE_MOUNT_T_EMPTY_PATH		0x00000040 /* Empty to path permitted */
-#define MOVE_MOUNT__MASK		0x00000077
+#define MOVE_MOUNT_SET_GROUP		0x00000100 /* Set sharing group instead */
+#define MOVE_MOUNT__MASK		0x00000177
 
 /*
  * fsopen() flags.
-- 
2.50.1





* Re: [PATCH v3 00/30] Live Update Orchestrator
From: Pasha Tatashin @ 2025-08-26 13:54 UTC
  To: Pratyush Yadav
  Cc: jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, lennart,
	brauner, linux-api, linux-fsdevel, saeedm, ajayachandra, jgg,
	parav, leonro, witu
In-Reply-To: <mafs0ms7mxly1.fsf@kernel.org>

> > https://github.com/googleprodkernel/linux-liveupdate/tree/luo/v3
> >
> > Changelog from v2:
> > - Addressed comments from Mike Rapoport and Jason Gunthorpe
> > - Only one user agent (LiveupdateD) can open /dev/liveupdate
> > - With the above changes, sessions are not needed, and should be
> >   maintained by the user-agent itself, so removed support for
> >   sessions.
>
> If all the FDs are restored in the agent's context, this assigns all the
> resources to the agent. For example, if the agent restores a memfd, all
> the memory gets charged to the agent's cgroup, and the client gets none
> of it. This makes it impossible to do any kind of resource limits.
>
> This was one of the advantages of being able to pass around sessions
> instead of FDs. The agent can pass on the right session to the right
> client, and then the client does the restore, getting all the resources
> charged to it.
>
> If we don't allow this, I think we will make LUO/LiveupdateD unsuitable
> for many kinds of workloads. Do you have any ideas on how to do proper
> resource attribution with the current patches? If not, then perhaps we
> should reconsider this change?

Hi Pratyush,

That's an excellent point, and you're right that we must have a
solution for correct resource charging.

I'd prefer to keep the session logic in the userspace agent (luod
https://tinyurl.com/luoddesign).

For the charging problem, I believe there's a clear path forward with
the current ioctl-based API. The design of the ioctl commands (with a
size field in each struct) is intentionally extensible. In a follow-up
patch, we can extend the liveupdate_ioctl_fd_restore struct to include
a target pid field. The luod agent would then be able to restore an
FD on behalf of a client and instruct the kernel to charge the
associated resources to that client's PID.
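
To make the extensibility argument concrete, here is a small userspace
sketch of size-versioned argument parsing, similar in spirit to what the
kernel's copy_struct_from_user() does. The struct layouts, field names,
and parse_restore_args() are hypothetical; the real
liveupdate_ioctl_fd_restore definition may look quite different.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical first version of the argument struct. */
struct restore_v1 {
	uint32_t size;        /* sizeof() of the struct the caller was built with */
	uint32_t flags;
	uint64_t token;
};

/* Hypothetical extension: same prefix, new fields appended at the end. */
struct restore_v2 {
	uint32_t size;
	uint32_t flags;
	uint64_t token;
	int32_t  target_pid;  /* 0 means "charge the caller" in this sketch */
	uint32_t reserved;
};

/*
 * Copy a caller-supplied struct into the newest layout. Shorter (older)
 * structs get their missing fields zero-filled; longer (newer) structs
 * are accepted only if every byte beyond the known layout is zero.
 */
int parse_restore_args(const void *user, struct restore_v2 *out)
{
	const unsigned char *p = user;
	uint32_t user_size;
	size_t n;

	assert(user && out);
	memcpy(&user_size, p, sizeof(user_size));
	if (user_size < sizeof(struct restore_v1))
		return -EINVAL;

	for (size_t i = sizeof(*out); i < user_size; i++)
		if (p[i])
			return -E2BIG;

	n = user_size < sizeof(*out) ? user_size : sizeof(*out);
	memset(out, 0, sizeof(*out));
	memcpy(out, p, n);
	return 0;
}
```

With this pattern an old binary that passes the v1 struct keeps working
unchanged, and a new binary opts in by passing the larger struct.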

This keeps the responsibilities clean: luod manages sessions and
authorization, while the kernel provides the specific mechanism for
resource attribution. I agree this is a must-have feature, but I think
it can be cleanly added on top of the current foundation.

Pasha

>
> [...]
>
> --
> Regards,
> Pratyush Yadav


* Re: [RFC] Extension to POSIX API for zero-copy data transfers
From: Mihai-Drosi Câju @ 2025-08-26 13:17 UTC
  To: Serge E. Hallyn; +Cc: linux-api, alx
In-Reply-To: <aK2rRYT+FXe6BvwC@mail.hallyn.com>

On Tue, Aug 26, 2025 at 3:40 PM Serge E. Hallyn <serge@hallyn.com> wrote:
>
> On Sun, Jan 19, 2025 at 10:21:45PM +0200, mcaju95@gmail.com wrote:
> > Greetings,
> >
> > I've been thinking about a POSIX-like API that would allow
> > read/write/send/recv to be zero-copy instead of being buffered. As such,
> > storage devices and network sockets can have data transferred to and from
> > them directly to a user-space application's buffers.
>
> Hi Mihai,
>
> You're proposing a particular API.  Do you have a kernel side
> implementation of something along these lines?  Do you have a particular
> user space use case of your own in mind, or have you spoken to any
> potential users?
>
I have a user-space implementation based on DPDK; it has a different and
less hackish API than the one presented here. I thought it best to seek
feedback from the Linux community before actually writing code for the kernel.

The use case is the same as for the normal Berkeley sockets API,
except that it's faster because you don't copy buffers between kernel and
user space on each send and recv. You can even receive a buffer that
will be written to disk or vice versa, thereby making KTLS, etc. obsolete.

I have not spoken to potential users, but I am aware of several attempts
at a zero-copy TCP/IP stack: F-Stack, mTCP, io_uring.

> > My focus was initially on network stacks and I drew inspiration from DPDK.
> > I'm also aware of some work underway on extending io_uring to support zero
> > copy.
>
> I've not really been following io_uring work.  Can you summarize the
> status of their zero copy support and the advantages that this new
> API would bring?
>

So far, io_uring only supports zero-copy reception of TCP segments
(https://docs.kernel.org/networking/iou-zcrx.html), and
it's rather cluttered...

> thanks,
> -serge
>
> > A draft API would work as follows:
> > * The application fills-out a series of iovec's with buffers in its own
> > memory that can store data from protocols such as TCP or UDP. These iovec's
> > will serve as hints that will tell the network stack that it can definitely
> > map a part of a frame's contents into the described buffers. For example, an
> > iovec may contain { .iov_base = 0x4000, .iov_len = 0xa000 }. In this case,
> > the data payload may end-up anywhere between 0x4000 and 0xe000 and after the
> > syscall, its fields will be overwritten to something like { .iov_base =
> > 0x4036, .iov_len = 1460 }
> > * In order to receive packets, the application calls readv or a readv-like
> > syscall and its array of iovec will be modified to point to data payloads.
> > Given that their pages will be mapped directly to user-space, some header
> > fields or tail-room may have to be zero-ed out before being mapped, in order
> > to prevent information leaks. Any array of iovec's passed to one such readv
> > syscall should be checked for sanity such as being able to hold data
> > payloads in corner cases, not overlap with each-other and hold values that
> > would prove to map pages to.
> > * The return value would be the number of data payloads that have been
> > populated. Only the first such elements in the provided array would end-up
> > containing data payloads.
> > * The syscall's prototype would be quite identical to that of readv, except
> > that iov would not be a const struct iovec *, but just a struct iovec * and
> > the return type would be modified. Like so:
> >  int zc_readv(int fd, struct iovec *iov, int iovcnt);
> >
> > * In the case of write's a struct iovec may not suffice as the provided
> > buffers should not only provide the location and size of data to be sent,
> > but also the guarantee that the buffers have sufficient head and tail room.
> > A hackish syscall would look like so:
> >  int zc_writev(int fd, const struct iovec (*iov)[2], int iovcnt);
> > * While the first iovec should describe the entire memory area available to
> > a packet, including enough head and tail room for headers and CRC's or other
> > fields specific to the NIC, the second should describe a sub-buffer that
> > holds the data to be written.
> > * Again, sanity checks should be performed on the entire array, for things
> > like having enough room for other fields, not overlapping, proper alignment,
> > ability to DMA to a device, etc.
> > * After calling zc_writev the pages associated with the provided iovec's are
> > immediately swapped for zero-pages to avoid data-leaks.
> > * For writes, arbitrary physical pages may not work for every NIC as some
> > are bound by 32bit addressing constraints on the PCIe bus, etc. As such the
> > application would have to manage a memory pool associated with each
> > file-descriptor(possibly NIC) that would contain memory that is physically
> > mapped to areas that can be DMA'ed to the proper devices. For example one
> > may mmap the file-descriptor to obtain a pool of a certain size for this
> > purpose.
> >
> > This concept can be extended to storage devices, unfortunately I am
> > unfamiliar with NVMe and SCSI so I can only guess that they work in a
> > similar manner to NIC rings, in that data can be written and read to
> > arbitrary physical RAM(as allowed by the IOMMU). Syscalls similar to zc_read
> > and zc_write can be used on file descriptors pointing to storage devices to
> > fetch or write sectors that contain data belonging to files. Some data
> > should be zeroed-out in this case as well, as sectors more often than not
> > will contain data that does not belong to the intended files.
> >
> > For example one can mix such syscalls to read directly from storage into NIC
> > buffers, providing in-place encryption on the way(via TLS) and send them to
> > a client in a similar way that Netflix does with in-kernel TLS and sendfile.
> >
> > All the best,
> > Mihai
> >
> >
> >


* Re: [PATCH v3 00/30] Live Update Orchestrator
From: Pratyush Yadav @ 2025-08-26 13:16 UTC
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
	saeedm, ajayachandra, jgg, parav, leonro, witu
In-Reply-To: <20250807014442.3829950-1-pasha.tatashin@soleen.com>

Hi Pasha,

On Thu, Aug 07 2025, Pasha Tatashin wrote:

> This series introduces the LUO, a kernel subsystem designed to
> facilitate live kernel updates with minimal downtime,
> particularly in cloud deployments aiming to update without fully
> disrupting running virtual machines.
>
> This series builds upon KHO framework by adding programmatic
> control over KHO's lifecycle and leveraging KHO for persisting LUO's
> own metadata across the kexec boundary. The git branch for this series
> can be found at:
>
> https://github.com/googleprodkernel/linux-liveupdate/tree/luo/v3
>
> Changelog from v2:
> - Addressed comments from Mike Rapoport and Jason Gunthorpe
> - Only one user agent (LiveupdateD) can open /dev/liveupdate
> - With the above changes, sessions are not needed, and should be
>   maintained by the user-agent itself, so removed support for
>   sessions.

If all the FDs are restored in the agent's context, this assigns all the
resources to the agent. For example, if the agent restores a memfd, all
the memory gets charged to the agent's cgroup, and the client gets none
of it. This makes it impossible to do any kind of resource limits.

This was one of the advantages of being able to pass around sessions
instead of FDs. The agent can pass on the right session to the right
client, and then the client does the restore, getting all the resources
charged to it.

If we don't allow this, I think we will make LUO/LiveupdateD unsuitable
for many kinds of workloads. Do you have any ideas on how to do proper
resource attribution with the current patches? If not, then perhaps we
should reconsider this change?

[...]

-- 
Regards,
Pratyush Yadav


* Re: [RFC] Extension to POSIX API for zero-copy data transfers
From: Serge E. Hallyn @ 2025-08-26 12:40 UTC (permalink / raw)
  To: mcaju95; +Cc: linux-api, alx, serge
In-Reply-To: <98RCQS.25Q70IQZ9KFA1@gmail.com>

On Sun, Jan 19, 2025 at 10:21:45PM +0200, mcaju95@gmail.com wrote:
> Greetings,
> 
> I've been thinking about a POSIX-like API that would allow
> read/write/send/recv to be zero-copy instead of being buffered. As such,
> storage devices and network sockets can have data transferred to and from
> them directly to a user-space application's buffers.

Hi Mihai,

You're proposing a particular API.  Do you have a kernel side
implementation of something along these lines?  Do you have a particular
user space use case of your own in mind, or have you spoken to any
potential users?

> My focus was initially on network stacks and I drew inspiration from DPDK.
> I'm also aware of some work underway on extending io_uring to support zero
> copy.

I've not really been following io_uring work.  Can you summarize the
status of their zero copy support and the advantages that this new
API would bring?

thanks,
-serge

> A draft API would work as follows:
> * The application fills-out a series of iovec's with buffers in its own
> memory that can store data from protocols such as TCP or UDP. These iovec's
> will serve as hints that will tell the network stack that it can definitely
> map a part of a frame's contents into the described buffers. For example, an
> iovec may contain { .iov_base = 0x4000, .iov_len = 0xa000 }. In this case,
> the data payload may end-up anywhere between 0x4000 and 0xe000 and after the
> syscall, its fields will be overwritten to something like { .iov_base =
> 0x4036, .iov_len = 1460 }
> * In order to receive packets, the application calls readv or a readv-like
> syscall and its array of iovec will be modified to point to data payloads.
> Given that their pages will be mapped directly to user-space, some header
> fields or tail-room may have to be zero-ed out before being mapped, in order
> to prevent information leaks. Any array of iovec's passed to one such readv
> syscall should be checked for sanity, such as being able to hold data
> payloads in corner cases, not overlapping with each other, and holding values
> that pages can actually be mapped to.
> * The return value would be the number of data payloads that have been
> populated. Only the first such elements in the provided array would end-up
> containing data payloads.
> * The syscall's prototype would be quite identical to that of readv, except
> that iov would not be a const struct iovec *, but just a struct iovec * and
> the return type would be modified. Like so:
>  int zc_readv(int fd, struct iovec *iov, int iovcnt);
> 
> * In the case of write's a struct iovec may not suffice as the provided
> buffers should not only provide the location and size of data to be sent,
> but also the guarantee that the buffers have sufficient head and tail room.
> A hackish syscall would look like so:
>  int zc_writev(int fd, const struct iovec (*iov)[2], int iovcnt);
> * While the first iovec should describe the entire memory area available to
> a packet, including enough head and tail room for headers and CRC's or other
> fields specific to the NIC, the second should describe a sub-buffer that
> holds the data to be written.
> * Again, sanity checks should be performed on the entire array, for things
> like having enough room for other fields, not overlapping, proper alignment,
> ability to DMA to a device, etc.
> * After calling zc_writev the pages associated with the provided iovec's are
> immediately swapped for zero-pages to avoid data-leaks.
> * For writes, arbitrary physical pages may not work for every NIC as some
> are bound by 32bit addressing constraints on the PCIe bus, etc. As such the
> application would have to manage a memory pool associated with each
> file-descriptor(possibly NIC) that would contain memory that is physically
> mapped to areas that can be DMA'ed to the proper devices. For example one
> may mmap the file-descriptor to obtain a pool of a certain size for this
> purpose.
> 
> This concept can be extended to storage devices, unfortunately I am
> unfamiliar with NVMe and SCSI so I can only guess that they work in a
> similar manner to NIC rings, in that data can be written and read to
> arbitrary physical RAM(as allowed by the IOMMU). Syscalls similar to zc_read
> and zc_write can be used on file descriptors pointing to storage devices to
> fetch or write sectors that contain data belonging to files. Some data
> should be zeroed-out in this case as well, as sectors more often than not
> will contain data that does not belong to the intended files.
> 
> For example one can mix such syscalls to read directly from storage into NIC
> buffers, providing in-place encryption on the way(via TLS) and send them to
> a client in a similar way that Netflix does with in-kernel TLS and sendfile.
> 
> All the best,
> Mihai
> 
> 
> 
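
The "not overlapping" sanity check mentioned in the quoted proposal can be
illustrated in user space.  This is only a minimal sketch and not part of any
posted patch: the helper name iovecs_disjoint is hypothetical, and the
alignment, head/tail room and DMA checks the proposal also lists are not
shown.

```c
#include <stdbool.h>
#include <stdint.h>
#include <sys/uio.h>

/*
 * Hypothetical helper: return true if no two entries of iov[] overlap.
 * Each entry is treated as the half-open range [iov_base, iov_base + iov_len).
 * An entry whose end wraps around the address space is rejected.
 */
static bool iovecs_disjoint(const struct iovec *iov, int iovcnt)
{
    for (int i = 0; i < iovcnt; i++) {
        uintptr_t a0 = (uintptr_t) iov[i].iov_base;
        uintptr_t a1 = a0 + iov[i].iov_len;

        if (a1 < a0)
            return false;           /* range wraps around */

        for (int j = i + 1; j < iovcnt; j++) {
            uintptr_t b0 = (uintptr_t) iov[j].iov_base;
            uintptr_t b1 = b0 + iov[j].iov_len;

            if (b1 < b0)
                return false;       /* range wraps around */
            if (a0 < b1 && b0 < a1)
                return false;       /* the two ranges intersect */
        }
    }
    return true;
}
```

With the example from the proposal, { 0x4000, 0xa000 } ends at 0xe000, so a
second buffer starting at 0xe000 is accepted while one starting at 0xd000 is
rejected.  The pairwise comparison is quadratic in iovcnt; sorting by
iov_base first would be the obvious alternative for large arrays.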


* Re: [RFC PATCH v1 1/2] fs: Add O_DENY_WRITE
From: Mickaël Salaün @ 2025-08-26 12:39 UTC (permalink / raw)
  To: Jeff Xu
  Cc: Andy Lutomirski, Jann Horn, Al Viro, Christian Brauner, Kees Cook,
	Paul Moore, Serge Hallyn, Andy Lutomirski, Arnd Bergmann,
	Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu,
	Florian Weimer, Jonathan Corbet, Jordan R Abrahams,
	Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski,
	Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite,
	Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb,
	kernel-hardening, linux-api, linux-fsdevel, linux-integrity,
	linux-kernel, linux-security-module, Jeff Xu
In-Reply-To: <CALmYWFv90uzq0J76+xtUFjZxDzR2rYvrFbrr5Jva5zdy_dvoHA@mail.gmail.com>

On Mon, Aug 25, 2025 at 10:57:57AM -0700, Jeff Xu wrote:
> Hi Mickaël
> 
> On Mon, Aug 25, 2025 at 2:31 AM Mickaël Salaün <mic@digikod.net> wrote:
> >
> > On Sun, Aug 24, 2025 at 11:04:03AM -0700, Andy Lutomirski wrote:
> > > On Sun, Aug 24, 2025 at 4:03 AM Mickaël Salaün <mic@digikod.net> wrote:
> > > >
> > > > On Fri, Aug 22, 2025 at 09:45:32PM +0200, Jann Horn wrote:
> > > > > On Fri, Aug 22, 2025 at 7:08 PM Mickaël Salaün <mic@digikod.net> wrote:
> > > > > > Add a new O_DENY_WRITE flag usable at open time and on opened file (e.g.
> > > > > > passed file descriptors).  This changes the state of the opened file by
> > > > > > making it read-only until it is closed.  The main use case is for script
> > > > > > interpreters to get the guarantee that a script's content cannot be altered
> > > > > > while being read and interpreted.  This is useful for generic distros
> > > > > > that may not have a write-xor-execute policy.  See commit a5874fde3c08
> > > > > > ("exec: Add a new AT_EXECVE_CHECK flag to execveat(2)")
> > > > > >
> > > > > > Both execve(2) and the IOCTL to enable fsverity can already set this
> > > > > > property on files with deny_write_access().  This new O_DENY_WRITE makes
> > > > >
> > > > > The kernel actually tried to get rid of this behavior on execve() in
> > > > > commit 2a010c41285345da60cece35575b4e0af7e7bf44; but sadly that had
> > > > > to be reverted in commit 3b832035387ff508fdcf0fba66701afc78f79e3d
> > > > > because it broke userspace assumptions.
> > > >
> > > > Oh, good to know.
> > > >
> > > > >
> > > > > > it widely available.  This is similar to what other OSs may provide
> > > > > > e.g., opening a file with only FILE_SHARE_READ on Windows.
> > > > >
> > > > > We used to have the analogous mmap() flag MAP_DENYWRITE, and that was
> > > > > removed for security reasons; as
> > > > > https://man7.org/linux/man-pages/man2/mmap.2.html says:
> > > > >
> > > > > |        MAP_DENYWRITE
> > > > > |               This flag is ignored.  (Long ago—Linux 2.0 and earlier—it
> > > > > |               signaled that attempts to write to the underlying file
> > > > > |               should fail with ETXTBSY.  But this was a source of denial-
> > > > > |               of-service attacks.)
> > > > >
> > > > > It seems to me that the same issue applies to your patch - it would
> > > > > allow unprivileged processes to essentially lock files such that other
> > > > > processes can't write to them anymore. This might allow unprivileged
> > > > > users to prevent root from updating config files or stuff like that if
> > > > > they're updated in-place.
> > > >
> > > > Yes, I agree, but since it is the case for executed files I thought it
> > > > was worth starting a discussion on this topic.  This new flag could be
> > > > restricted to executable files, but we should avoid system-wide locks
> > > > like this.  I'm not sure how Windows handles these issues though.
> > > >
> > > > Anyway, we should rely on the access control policy to control write and
> > > > execute access in a consistent way (e.g. write-xor-execute).  Thanks for
> > > > the references and the background!
> > >
> > > I'm confused.  I understand that there are many contexts in which one
> > > would want to prevent execution of unapproved content, which might
> > > include preventing a given process from modifying some code and then
> > > executing it.
> > >
> > > I don't understand what these deny-write features have to do with it.
> > > These features merely prevent someone from modifying code *that is
> > > currently in use*, which is not at all the same thing as preventing
> > > modifying code that might get executed -- one can often modify
> > > contents *before* executing those contents.
> >
> > The order of checks would be:
> > 1. open script with O_DENY_WRITE
> > 2. check executability with AT_EXECVE_CHECK
> > 3. read the content and interpret it
> >
> I'm not sure about the O_DENY_WRITE approach, but the problem is worth solving.
> 
> AT_EXECVE_CHECK is not just for scripting languages. It could also
> work with bytecodes like Java, for example. If we let the Java runtime
> call AT_EXECVE_CHECK before loading the bytecode, the LSM could
> develop a policy based on that.

Sure, I'm using "script" to make it simple, but this applies to other
use cases.

> 
> > The deny-write feature was to guarantee that there is no race condition
> > between step 2 and 3.  All these checks are supposed to be done by a
> > trusted interpreter (which is allowed to be executed).  The
> > AT_EXECVE_CHECK call enables the caller to know if the kernel (and
> > associated security policies) allowed the *current* content of the file
> > to be executed.  Whatever happens before or after that (wrt.
> > O_DENY_WRITE) should be covered by the security policy.
> >
> Agree, the race problem needs to be solved in order for AT_EXECVE_CHECK to be useful.
> 
> Enforcing non-write for the path that stores scripts or bytecodes can
> be challenging due to historical or backward compatibility reasons.
> Since AT_EXECVE_CHECK provides a mechanism to check the file right
> before it is used, we can assume it will detect any "problem" that
> happened before that, (e.g. the file was overwritten). However, that
> also imposes two additional requirements:
> 1> the file doesn't change while AT_EXECVE_CHECK does the check.

This is already the case, so any kind of LSM checks are good.

> 2> The file content kept by the process remains unchanged after passing
> the AT_EXECVE_CHECK.

The goal of this patch was to avoid such race condition in the case
where executable files can be updated.  But in most cases it should not
be a security issue (because processes allowed to write to executable
files should be trusted), but this could still lead to bugs (because of
inconsistent file content, half-updated).

> 
> I imagine, the complete solution for AT_EXECVE_CHECK would include
> those two guarantees.

There is no issue directly with AT_EXECVE_CHECK, but according to the
system configuration, interpreters could read a file that was updated
after the AT_EXECVE_CHECK.  This should not be an issue for secure
systems where executable files are only updated with trusted code,
except if the update mechanism is not atomic.  The main use case for
this patch series was for generic distros that may not have the
write-xor-execute guarantees e.g., for developers.

The only viable solution I see would be to have some kind of snapshot of
files, requested by interpreters, but I'm not sure if it is worth it.
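
For reference, the check-then-read ordering discussed above can be sketched
from the interpreter's side.  This is a minimal sketch under stated
assumptions, not code from the series: the helper names check_executable,
snapshot_fd and load_script are hypothetical, the fallback values are
assumed to match the Linux 6.14 uapi <linux/fcntl.h>, and kernels without
AT_EXECVE_CHECK are expected to reject the flag.  It does not close the
race: the file can still be modified between the check and the read.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef AT_EMPTY_PATH
#define AT_EMPTY_PATH 0x1000
#endif
#ifndef AT_EXECVE_CHECK
#define AT_EXECVE_CHECK 0x10000 /* assumed value, Linux 6.14 uapi */
#endif

/* Ask the kernel whether fd would be allowed to execute; nothing is run.
 * Returns 0 if allowed, -1 with errno set otherwise. */
static int check_executable(int fd)
{
    char *const argv[] = { "script", NULL };
    char *const envp[] = { NULL };

    return (int) syscall(SYS_execveat, fd, "", argv, envp,
                         AT_EMPTY_PATH | AT_EXECVE_CHECK);
}

/* Copy everything readable from fd into a malloc()ed buffer. */
static char *snapshot_fd(int fd, size_t *len_out)
{
    size_t cap = 4096, len = 0;
    char *buf = malloc(cap);

    if (!buf)
        return NULL;
    for (;;) {
        if (len == cap) {
            char *tmp = realloc(buf, cap * 2);
            if (!tmp) {
                free(buf);
                return NULL;
            }
            buf = tmp;
            cap *= 2;
        }
        ssize_t n = read(fd, buf + len, cap - len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            free(buf);
            return NULL;
        }
        if (n == 0)
            break;              /* end of file */
        len += (size_t) n;
    }
    *len_out = len;
    return buf;
}

/* Open, check, then read through the same file descriptor. */
static char *load_script(const char *path, size_t *len_out)
{
    char *buf = NULL;
    int fd = open(path, O_RDONLY | O_CLOEXEC);

    if (fd < 0)
        return NULL;
    if (check_executable(fd) == 0)
        buf = snapshot_fd(fd, len_out);
    close(fd);
    return buf;
}
```

An interpreter would call load_script() and then parse only the returned
buffer, never re-reading the path.  Using one descriptor for both the check
and the read avoids a path-based swap, but not an in-place write to the same
inode, which is the gap this thread is about.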


* Re: [RFC PATCH v1 1/2] fs: Add O_DENY_WRITE
From: Mickaël Salaün @ 2025-08-26 12:35 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Andy Lutomirski, Jann Horn, Al Viro, Christian Brauner, Kees Cook,
	Paul Moore, Serge Hallyn, Andy Lutomirski, Arnd Bergmann,
	Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu, Jeff Xu,
	Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian,
	Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar,
	Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell,
	Steve Dower, Steve Grubb, kernel-hardening, linux-api,
	linux-fsdevel, linux-integrity, linux-kernel,
	linux-security-module, Jeff Xu
In-Reply-To: <lhuikibbv0g.fsf@oldenburg.str.redhat.com>

On Mon, Aug 25, 2025 at 11:39:11AM +0200, Florian Weimer wrote:
> * Mickaël Salaün:
> 
> > The order of checks would be:
> > 1. open script with O_DENY_WRITE
> > 2. check executability with AT_EXECVE_CHECK
> > 3. read the content and interpret it
> >
> > The deny-write feature was to guarantee that there is no race condition
> > between step 2 and 3.  All these checks are supposed to be done by a
> > trusted interpreter (which is allowed to be executed).  The
> > AT_EXECVE_CHECK call enables the caller to know if the kernel (and
> > associated security policies) allowed the *current* content of the file
> > to be executed.  Whatever happens before or after that (wrt.
> > O_DENY_WRITE) should be covered by the security policy.
> 
> Why isn't it an improper system configuration if the script file is
> writable?

It is, except if the system only wants to track executions (e.g. record
checksum of scripts) without restricting file modifications.

> 
> In the past, the argument was that making a file (writable and)
> executable was an auditable event, and that provided enough coverage for
> those people who are interested in this.

Yes, but in this case there is a race condition that this patch tried to
fix.


* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)
From: Theodore Ts'o @ 2025-08-26 12:30 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Christian Brauner, Al Viro, Kees Cook, Paul Moore, Serge Hallyn,
	Andy Lutomirski, Arnd Bergmann, Christian Heimes, Dmitry Vyukov,
	Elliott Hughes, Fan Wu, Florian Weimer, Jann Horn, Jeff Xu,
	Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian,
	Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar,
	Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell,
	Steve Dower, Steve Grubb, kernel-hardening, linux-api,
	linux-fsdevel, linux-integrity, linux-kernel,
	linux-security-module
In-Reply-To: <20250826.aig5aiShunga@digikod.net>

Is there a single, unified design and requirements document that
describes the threat model, and what you are trying to achieve with
AT_EXECVE_CHECK and O_DENY_WRITE?  I've been looking at the cover
letters for AT_EXECVE_CHECK and O_DENY_WRITE, and the documentation
that has landed for AT_EXECVE_CHECK and it really doesn't describe
what *are* the checks that AT_EXECVE_CHECK is trying to achieve:

   "The AT_EXECVE_CHECK execveat(2) flag, and the
   SECBIT_EXEC_RESTRICT_FILE and SECBIT_EXEC_DENY_INTERACTIVE
   securebits are intended for script interpreters and dynamic linkers
   to enforce a consistent execution security policy handled by the
   kernel."

Um, what security policy?  What checks?  What is a sample exploit
which is blocked by AT_EXECVE_CHECK?

And then on top of it, why can't you do these checks by modifying the
script interpreters?

Confused,

						- Ted


* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)
From: Mickaël Salaün @ 2025-08-26 11:23 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Al Viro, Kees Cook, Paul Moore, Serge Hallyn, Andy Lutomirski,
	Arnd Bergmann, Christian Heimes, Dmitry Vyukov, Elliott Hughes,
	Fan Wu, Florian Weimer, Jann Horn, Jeff Xu, Jonathan Corbet,
	Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi,
	Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet,
	Robert Waite, Roberto Sassu, Scott Shell, Steve Dower,
	Steve Grubb, kernel-hardening, linux-api, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module
In-Reply-To: <20250826-skorpion-magma-141496988fdc@brauner>

On Tue, Aug 26, 2025 at 11:07:03AM +0200, Christian Brauner wrote:
> On Fri, Aug 22, 2025 at 07:07:58PM +0200, Mickaël Salaün wrote:
> > Hi,
> > 
> > Script interpreters can check if a file would be allowed to be executed
> > by the kernel using the new AT_EXECVE_CHECK flag. This approach works
> > well on systems with write-xor-execute policies, where scripts cannot
> > be modified by malicious processes. However, this protection may not be
> > available on more generic distributions.
> > 
> > The key difference between `./script.sh` and `sh script.sh` (when using
> > AT_EXECVE_CHECK) is that execve(2) prevents the script from being opened
> > for writing while it's being executed. To achieve parity, the kernel
> > should provide a mechanism for script interpreters to deny write access
> > during script interpretation. While interpreters can copy script content
> > into a buffer, a race condition remains possible after AT_EXECVE_CHECK.
> > 
> > This patch series introduces a new O_DENY_WRITE flag for use with
> > open*(2) and fcntl(2). Both interfaces are necessary since script
> > interpreters may receive either a file path or file descriptor. For
> > backward compatibility, open(2) with O_DENY_WRITE will not fail on
> > unsupported systems, while users requiring explicit support guarantees
> > can use openat2(2).
> 
> We've said no to abusing the O_* flag space for that AT_EXECVE_* stuff
> before and you've been told by Linus as well that this is a nogo.

Oh, please, don't mix up everything.  First, this is an RFC, and as I
explained, the goal is to start a discussion with something concrete.
Second, doing a one-time check on a file and providing guarantees for
the whole lifetime of an opened file requires different approaches,
hence this O_ *proposal*.

> 
> Nothing has changed in that regard and I'm not interested in stuffing
> the VFS APIs full of special-purpose behavior to work around the fact
> that this is work that needs to be done in userspace. Change the apps,
> stop pushing more and more cruft into the VFS that has no business
> there.

It would be interesting to know how to patch user space to get the same
guarantees...  Do you think I would propose a kernel patch otherwise?

> 
> That's before we get into all the issues that are introduced by this
> mechanism that magically makes arbitrary files unwritable. It's not just
> a DoS it's likely to cause breakage in userspace as well. I removed the
> deny-write from execve because it already breaks various use-cases or
> leads to spurious failures in e.g., go. We're not spreading this disease
> as a first-class VFS API.

Jann explained it very well, and the deny-write for execve is still
there, but let's keep it civil.  I already agreed that this is not a
good approach, but we could get interesting proposals.


* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)
From: Christian Brauner @ 2025-08-26  9:07 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Al Viro, Kees Cook, Paul Moore, Serge Hallyn, Andy Lutomirski,
	Arnd Bergmann, Christian Heimes, Dmitry Vyukov, Elliott Hughes,
	Fan Wu, Florian Weimer, Jann Horn, Jeff Xu, Jonathan Corbet,
	Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi,
	Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet,
	Robert Waite, Roberto Sassu, Scott Shell, Steve Dower,
	Steve Grubb, kernel-hardening, linux-api, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module
In-Reply-To: <20250822170800.2116980-1-mic@digikod.net>

On Fri, Aug 22, 2025 at 07:07:58PM +0200, Mickaël Salaün wrote:
> Hi,
> 
> Script interpreters can check if a file would be allowed to be executed
> by the kernel using the new AT_EXECVE_CHECK flag. This approach works
> well on systems with write-xor-execute policies, where scripts cannot
> be modified by malicious processes. However, this protection may not be
> available on more generic distributions.
> 
> The key difference between `./script.sh` and `sh script.sh` (when using
> AT_EXECVE_CHECK) is that execve(2) prevents the script from being opened
> for writing while it's being executed. To achieve parity, the kernel
> should provide a mechanism for script interpreters to deny write access
> during script interpretation. While interpreters can copy script content
> into a buffer, a race condition remains possible after AT_EXECVE_CHECK.
> 
> This patch series introduces a new O_DENY_WRITE flag for use with
> open*(2) and fcntl(2). Both interfaces are necessary since script
> interpreters may receive either a file path or file descriptor. For
> backward compatibility, open(2) with O_DENY_WRITE will not fail on
> unsupported systems, while users requiring explicit support guarantees
> can use openat2(2).

We've said no to abusing the O_* flag space for that AT_EXECVE_* stuff
before and you've been told by Linus as well that this is a nogo.

Nothing has changed in that regard and I'm not interested in stuffing
the VFS APIs full of special-purpose behavior to work around the fact
that this is work that needs to be done in userspace. Change the apps,
stop pushing more and more cruft into the VFS that has no business
there.

That's before we get into all the issues that are introduced by this
mechanism that magically makes arbitrary files unwritable. It's not just
a DoS it's likely to cause breakage in userspace as well. I removed the
deny-write from execve because it already breaks various use-cases or
leads to spurious failures in e.g., go. We're not spreading this disease
as a first-class VFS API.


* Re: [PATCH v2 1/1] man2/mount.2: expand and clarify docs for MS_REMOUNT | MS_BIND
From: Alejandro Colomar @ 2025-08-26  8:51 UTC (permalink / raw)
  To: Askar Safin
  Cc: Aleksa Sarai, Alexander Viro, linux-api, linux-fsdevel,
	David Howells, Christian Brauner, linux-man
In-Reply-To: <198e5864132.1283ed42534579.7191562270325331624@zohomail.com>


Hi Askar,

On Tue, Aug 26, 2025 at 12:37:17PM +0400, Askar Safin wrote:
>  ---- On Mon, 25 Aug 2025 23:13:05 +0400  Alejandro Colomar <alx@kernel.org> wrote --- 
>  > Should we say "mount point" instead?  Otherwise, it's inconsistent with
> 
> d-user@comp:/rbt/man-pages$ grep -E -r -I -i 'mount point' /rbt/man-pages/man | wc -l
> 101
> d-user@comp:/rbt/man-pages$ grep -E -r -I -i 'mount-point' /rbt/man-pages/man | wc -l
> 9
> d-user@comp:/rbt/man-pages$ grep -E -r -I -i 'mountpoint' /rbt/man-pages/man | wc -l
> 4
> 
> My experiments show that "mount point" is indeed the most popular variant.
> 
> I changed all "mountpoint" to "mount point".
> 
> I decided to keep all "per-mount-point".

Thanks!

>  > > +have its existing per-mount-point flags
>  > > +cleared and replaced with those in
>  > > +.I mountflags
>  > > +when
>  > > +.B MS_REMOUNT
>  > > +and
>  > > +.B MS_BIND
>  > > +are specified.
>  > 
>  > Maybe reverse the sentence to start with this?
> 
> I decided simply to remove that "MS_REMOUNT and MS_BIND" part
> (because it is already present in previous sentence).

Okay.

>  > > +This means that if
>  > 
>  > I would move the 'if' to the next line.
> 
> I moved it. But, please, next time do it yourself.
> I don't plan to become a regular man-pages contributor.

I do these small things myself if they're the only issue.  If there are
more important issues, I _also_ point these out, just because it's
useful.

In general, when writing documentation sentences, write them similarly
to how you would write them if they were code.  You never put an if at
the end of a line of code; never put it at the end of a line of
documentation text.

> I addressed all complaints except for those listed above and sent v3.

I'll check.


Have a lovely day!
Alex

-- 
<https://www.alejandro-colomar.es/>



* Re: [PATCH v2 1/1] man2/mount.2: expand and clarify docs for MS_REMOUNT | MS_BIND
From: Askar Safin @ 2025-08-26  8:37 UTC (permalink / raw)
  To: Alejandro Colomar
  Cc: Aleksa Sarai, Alexander Viro, linux-api, linux-fsdevel,
	David Howells, Christian Brauner, linux-man
In-Reply-To: <rxl7zzllf374j6osujwvpvbvsnrjwikoo5tj2o3pqntfjdmwps@isiyqms4s776>

 ---- On Mon, 25 Aug 2025 23:13:05 +0400  Alejandro Colomar <alx@kernel.org> wrote --- 
 > Should we say "mount point" instead?  Otherwise, it's inconsistent with

d-user@comp:/rbt/man-pages$ grep -E -r -I -i 'mount point' /rbt/man-pages/man | wc -l
101
d-user@comp:/rbt/man-pages$ grep -E -r -I -i 'mount-point' /rbt/man-pages/man | wc -l
9
d-user@comp:/rbt/man-pages$ grep -E -r -I -i 'mountpoint' /rbt/man-pages/man | wc -l
4

My experiments show that "mount point" is indeed the most popular variant.

I changed all "mountpoint" to "mount point".

I decided to keep all "per-mount-point".

 > > +have its existing per-mount-point flags
 > > +cleared and replaced with those in
 > > +.I mountflags
 > > +when
 > > +.B MS_REMOUNT
 > > +and
 > > +.B MS_BIND
 > > +are specified.
 > 
 > Maybe reverse the sentence to start with this?

I decided simply to remove that "MS_REMOUNT and MS_BIND" part
(because it is already present in previous sentence).

 > > +This means that if
 > 
 > I would move the 'if' to the next line.

I moved it. But, please, next time do it yourself.
I don't plan to become a regular man-pages contributor.

I addressed all complaints except for those listed above and sent v3.

====
Askar Safin
https://types.pl/@safinaskar



* [PATCH v3 2/2] man2/mount.2: tfix (mountpoint => mount point)
From: Askar Safin @ 2025-08-26  8:32 UTC (permalink / raw)
  To: Alejandro Colomar
  Cc: Aleksa Sarai, Alexander Viro, linux-api, linux-fsdevel,
	David Howells, Christian Brauner, linux-man
In-Reply-To: <20250826083227.2611457-1-safinaskar@zohomail.com>

Here we fix the only remaining mention of "mountpoint"
in all man pages

Signed-off-by: Askar Safin <safinaskar@zohomail.com>
---
 man/man2/mount.2 | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/man/man2/mount.2 b/man/man2/mount.2
index 599c2d6fa..9b11fff51 100644
--- a/man/man2/mount.2
+++ b/man/man2/mount.2
@@ -311,7 +311,7 @@ Since Linux 2.6.16,
 can be set or cleared on a per-mount-point basis as well as on
 the underlying filesystem superblock.
 The mounted filesystem will be writable only if neither the filesystem
-nor the mountpoint are flagged as read-only.
+nor the mount point are flagged as read-only.
 .\"
 .SS Remounting an existing mount
 An existing mount may be remounted by specifying
-- 
2.47.2



* [PATCH v3 1/2] man2/mount.2: expand and clarify docs for MS_REMOUNT | MS_BIND
From: Askar Safin @ 2025-08-26  8:32 UTC (permalink / raw)
  To: Alejandro Colomar
  Cc: Aleksa Sarai, Alexander Viro, linux-api, linux-fsdevel,
	David Howells, Christian Brauner, linux-man
In-Reply-To: <20250826083227.2611457-1-safinaskar@zohomail.com>

My edit is based on experiments and reading Linux code

Signed-off-by: Askar Safin <safinaskar@zohomail.com>
---
 man/man2/mount.2 | 27 ++++++++++++++++++++++++---
 1 file changed, 24 insertions(+), 3 deletions(-)

diff --git a/man/man2/mount.2 b/man/man2/mount.2
index 5d83231f9..599c2d6fa 100644
--- a/man/man2/mount.2
+++ b/man/man2/mount.2
@@ -405,7 +405,25 @@ flag can be used with
 to modify only the per-mount-point flags.
 .\" See https://lwn.net/Articles/281157/
 This is particularly useful for setting or clearing the "read-only"
-flag on a mount without changing the underlying filesystem.
+flag on a mount without changing the underlying filesystem parameters.
+The
+.I data
+argument is ignored if
+.B MS_REMOUNT
+and
+.B MS_BIND
+are specified.
+The mount point will
+have its existing per-mount-point flags
+cleared and replaced with those in
+.IR mountflags .
+This means that
+if you wish to preserve
+any existing per-mount-point flags,
+you need to include them in
+.IR mountflags ,
+along with the per-mount-point flags you wish to set
+(or with the flags you wish to clear missing).
 Specifying
 .I mountflags
 as:
@@ -416,8 +434,11 @@ MS_REMOUNT | MS_BIND | MS_RDONLY
 .EE
 .in
 .P
-will make access through this mountpoint read-only, without affecting
-other mounts.
+will make access through this mount point read-only
+(clearing all other per-mount-point flags),
+without affecting
+other mounts
+of this filesystem.
 .\"
 .SS Creating a bind mount
 If
-- 
2.47.2



* [PATCH v3 0/2] man2/mount.2: expand and clarify docs for MS_REMOUNT | MS_BIND
From: Askar Safin @ 2025-08-26  8:32 UTC (permalink / raw)
  To: Alejandro Colomar
  Cc: Aleksa Sarai, Alexander Viro, linux-api, linux-fsdevel,
	David Howells, Christian Brauner, linux-man

My edit is based on experiments and reading Linux code

You will find C code I used for experiments below

v1: https://lore.kernel.org/linux-man/20250822114315.1571537-1-safinaskar@zohomail.com/
v2: https://lore.kernel.org/linux-man/20250825154839.2422856-1-safinaskar@zohomail.com/

Askar Safin (2):
  man2/mount.2: expand and clarify docs for MS_REMOUNT | MS_BIND
  man2/mount.2: tfix (mountpoint => mount point)

 man/man2/mount.2 | 29 +++++++++++++++++++++++++----
 1 file changed, 25 insertions(+), 4 deletions(-)

-- 
2.47.2

// You need to be root in initial user namespace

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sched.h>
#include <errno.h>
#include <sys/stat.h>
#include <sys/mount.h>
#include <sys/syscall.h>
#include <sys/sysmacros.h>
#include <linux/openat2.h>

#define MY_ASSERT(cond) do { \
    if (!(cond)) { \
        fprintf (stderr, "%d: %s: assertion failed\n", __LINE__, #cond); \
        exit (1); \
    } \
} while (0)

int
main (void)
{
    // Init
    {
        MY_ASSERT (chdir ("/") == 0);
        MY_ASSERT (unshare (CLONE_NEWNS) == 0);
        MY_ASSERT (mount (NULL, "/", NULL, MS_PRIVATE | MS_REC, NULL) == 0);
        MY_ASSERT (mount (NULL, "/tmp", "tmpfs", 0, NULL) == 0);
    }

    MY_ASSERT (mkdir ("/tmp/a", 0777) == 0);
    MY_ASSERT (mkdir ("/tmp/b", 0777) == 0);

    // MS_REMOUNT sets options for superblock
    {
        MY_ASSERT (mount (NULL, "/tmp/a", "tmpfs", 0, NULL) == 0);
        MY_ASSERT (mount ("/tmp/a", "/tmp/b", NULL, MS_BIND, NULL) == 0);
        MY_ASSERT (mount (NULL, "/tmp/a", NULL, MS_REMOUNT | MS_RDONLY, NULL) == 0);
        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == -1);
        MY_ASSERT (errno == EROFS);
        MY_ASSERT (mkdir ("/tmp/b/c", 0777) == -1);
        MY_ASSERT (errno == EROFS);
        MY_ASSERT (umount ("/tmp/a") == 0);
        MY_ASSERT (umount ("/tmp/b") == 0);
    }

    // MS_REMOUNT | MS_BIND sets options for vfsmount
    {
        MY_ASSERT (mount (NULL, "/tmp/a", "tmpfs", 0, NULL) == 0);
        MY_ASSERT (mount ("/tmp/a", "/tmp/b", NULL, MS_BIND, NULL) == 0);
        MY_ASSERT (mount (NULL, "/tmp/a", NULL, MS_REMOUNT | MS_BIND | MS_RDONLY, NULL) == 0);
        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == -1);
        MY_ASSERT (errno == EROFS);
        MY_ASSERT (mkdir ("/tmp/b/c", 0777) == 0);
        MY_ASSERT (rmdir ("/tmp/b/c") == 0);
        MY_ASSERT (umount ("/tmp/a") == 0);
        MY_ASSERT (umount ("/tmp/b") == 0);
    }

    // fspick sets options for superblock
    {
        MY_ASSERT (mount (NULL, "/tmp/a", "tmpfs", 0, NULL) == 0);
        MY_ASSERT (mount ("/tmp/a", "/tmp/b", NULL, MS_BIND, NULL) == 0);
        {
            int fsfd = fspick (AT_FDCWD, "/tmp/a", 0);
            MY_ASSERT (fsfd >= 0);
            MY_ASSERT (fsconfig (fsfd, FSCONFIG_SET_FLAG, "ro", NULL, 0) == 0);
            MY_ASSERT (fsconfig (fsfd, FSCONFIG_CMD_RECONFIGURE, NULL, NULL, 0) == 0);
            MY_ASSERT (close (fsfd) == 0);
        }
        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == -1);
        MY_ASSERT (errno == EROFS);
        MY_ASSERT (mkdir ("/tmp/b/c", 0777) == -1);
        MY_ASSERT (errno == EROFS);
        MY_ASSERT (umount ("/tmp/a") == 0);
        MY_ASSERT (umount ("/tmp/b") == 0);
    }

    // mount_setattr sets options for vfsmount
    {
        MY_ASSERT (mount (NULL, "/tmp/a", "tmpfs", 0, NULL) == 0);
        MY_ASSERT (mount ("/tmp/a", "/tmp/b", NULL, MS_BIND, NULL) == 0);
        {
            struct mount_attr attr;
            memset (&attr, 0, sizeof attr);
            attr.attr_set = MOUNT_ATTR_RDONLY;
            MY_ASSERT (mount_setattr (AT_FDCWD, "/tmp/a", 0, &attr, sizeof attr) == 0);
        }
        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == -1);
        MY_ASSERT (errno == EROFS);
        MY_ASSERT (mkdir ("/tmp/b/c", 0777) == 0);
        MY_ASSERT (rmdir ("/tmp/b/c") == 0);
        MY_ASSERT (umount ("/tmp/a") == 0);
        MY_ASSERT (umount ("/tmp/b") == 0);
    }

    // "ro" as a string works for MS_REMOUNT
    {
        MY_ASSERT (mount (NULL, "/tmp/a", "tmpfs", 0, NULL) == 0);
        MY_ASSERT (mount ("/tmp/a", "/tmp/b", NULL, MS_BIND, NULL) == 0);
        MY_ASSERT (mount (NULL, "/tmp/a", NULL, MS_REMOUNT, "ro") == 0);
        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == -1);
        MY_ASSERT (errno == EROFS);
        MY_ASSERT (mkdir ("/tmp/b/c", 0777) == -1);
        MY_ASSERT (errno == EROFS);
        MY_ASSERT (umount ("/tmp/a") == 0);
        MY_ASSERT (umount ("/tmp/b") == 0);
    }

    // "ro" as a string doesn't work for MS_REMOUNT | MS_BIND
    // Option string is ignored
    {
        MY_ASSERT (mount (NULL, "/tmp/a", "tmpfs", 0, NULL) == 0);
        MY_ASSERT (mount ("/tmp/a", "/tmp/b", NULL, MS_BIND, NULL) == 0);
        MY_ASSERT (mount (NULL, "/tmp/a", NULL, MS_REMOUNT | MS_BIND, "ro") == 0);
        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == 0);
        MY_ASSERT (rmdir ("/tmp/a/c") == 0);
        MY_ASSERT (mkdir ("/tmp/b/c", 0777) == 0);
        MY_ASSERT (rmdir ("/tmp/b/c") == 0);
        MY_ASSERT (umount ("/tmp/a") == 0);
        MY_ASSERT (umount ("/tmp/b") == 0);
    }

    // Removing MS_RDONLY makes the mount writable again (in case of MS_REMOUNT | MS_BIND)
    // Same for other options (not tested, but I did read the code)
    {
        MY_ASSERT (mount (NULL, "/tmp/a", "tmpfs", 0, NULL) == 0);
        MY_ASSERT (mount (NULL, "/tmp/a", NULL, MS_REMOUNT | MS_BIND | MS_RDONLY, NULL) == 0);
        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == -1);
        MY_ASSERT (errno == EROFS);
        MY_ASSERT (mount (NULL, "/tmp/a", NULL, MS_REMOUNT | MS_BIND, NULL) == 0);
        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == 0);
        MY_ASSERT (umount ("/tmp/a") == 0);
    }

    // Removing "ro" from the option string makes the mount writable again (in case of MS_REMOUNT)
    // I.e. mount(2) works exactly as documented
    // This works even if the option string is NULL, i.e. NULL works as the default option string
    {
        typedef const char *c_string;
        c_string opts[3] = {NULL, "", "rw"};
        for (int i = 0; i != 3; ++i)
            {
                for (int j = 0; j != 3; ++j)
                    {
                        MY_ASSERT (mount (NULL, "/tmp/a", "tmpfs", 0, opts[i]) == 0);
                        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == 0);
                        MY_ASSERT (rmdir ("/tmp/a/c") == 0);
                        MY_ASSERT (mount (NULL, "/tmp/a", NULL, MS_REMOUNT, "ro") == 0);
                        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == -1);
                        MY_ASSERT (errno == EROFS);
                        MY_ASSERT (mount (NULL, "/tmp/a", NULL, MS_REMOUNT, opts[j]) == 0);
                        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == 0);
                        MY_ASSERT (umount ("/tmp/a") == 0);
                    }
            }
    }

    // Removing MS_RDONLY makes mount writable again (in case of MS_REMOUNT)
    // I. e. mount(2) works exactly as documented
    {
        MY_ASSERT (mount (NULL, "/tmp/a", "tmpfs", 0, NULL) == 0);
        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == 0);
        MY_ASSERT (rmdir ("/tmp/a/c") == 0);
        MY_ASSERT (mount (NULL, "/tmp/a", NULL, MS_REMOUNT | MS_RDONLY, NULL) == 0);
        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == -1);
        MY_ASSERT (errno == EROFS);
        MY_ASSERT (mount (NULL, "/tmp/a", NULL, MS_REMOUNT, NULL) == 0);
        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == 0);
        MY_ASSERT (rmdir ("/tmp/a/c") == 0);
        MY_ASSERT (umount ("/tmp/a") == 0);
    }

    // Setting MS_RDONLY (without other flags) removes all other flags, such as MS_NODEV (in case of MS_REMOUNT | MS_BIND)
    {
        MY_ASSERT (mount (NULL, "/tmp/a", "tmpfs", 0, NULL) == 0);
        MY_ASSERT (mknod ("/tmp/a/mynull", S_IFCHR | 0666, makedev (1, 3)) == 0);

        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == 0);
        MY_ASSERT (rmdir ("/tmp/a/c") == 0);
        {
            int fd = open ("/tmp/a/mynull", O_WRONLY);
            MY_ASSERT (fd >= 0);
            MY_ASSERT (write (fd, "a", 1) == 1);
            MY_ASSERT (close (fd) == 0);
        }
        MY_ASSERT (mount (NULL, "/tmp/a", NULL, MS_REMOUNT | MS_BIND | MS_NODEV, NULL) == 0);
        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == 0);
        MY_ASSERT (rmdir ("/tmp/a/c") == 0);
        MY_ASSERT (open ("/tmp/a/mynull", O_WRONLY) == -1);
        MY_ASSERT (mount (NULL, "/tmp/a", NULL, MS_REMOUNT | MS_BIND | MS_RDONLY, NULL) == 0);
        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == -1);
        {
            int fd = open ("/tmp/a/mynull", O_WRONLY);
            MY_ASSERT (fd >= 0);
            MY_ASSERT (write (fd, "a", 1) == 1);
            MY_ASSERT (close (fd) == 0);
        }
        MY_ASSERT (umount ("/tmp/a") == 0);
    }
    printf ("All tests passed\n");
    exit (0);
}
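
The last test above shows that MS_REMOUNT | MS_BIND replaces the whole set
of per-mount flags, so a caller that only wants to add MS_RDONLY has to
pass the existing flags again. A minimal sketch of one common approach
follows, assuming statvfs(3) reports the flags of the mount at the given
path. The helper names ms_flags_from_statvfs and remount_bind_rdonly_keep_flags
are hypothetical and not part of the patch series or of any library.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mount.h>
#include <sys/statvfs.h>

/* Translate statvfs(3) ST_* bits into the matching mount(2) MS_* bits. */
unsigned long
ms_flags_from_statvfs (unsigned long f_flag)
{
    unsigned long ms = 0;

    if (f_flag & ST_RDONLY)     ms |= MS_RDONLY;
    if (f_flag & ST_NOSUID)     ms |= MS_NOSUID;
    if (f_flag & ST_NODEV)      ms |= MS_NODEV;
    if (f_flag & ST_NOEXEC)     ms |= MS_NOEXEC;
    if (f_flag & ST_NOATIME)    ms |= MS_NOATIME;
    if (f_flag & ST_NODIRATIME) ms |= MS_NODIRATIME;
    if (f_flag & ST_RELATIME)   ms |= MS_RELATIME;
    return ms;
}

/*
 * Hypothetical helper: make the mount at `path` read-only while passing
 * back the flags it already has. Needs CAP_SYS_ADMIN.
 * Returns 0 on success, -1 with errno set on failure.
 */
int
remount_bind_rdonly_keep_flags (const char *path)
{
    struct statvfs st;

    assert (path != NULL);
    if (statvfs (path, &st) != 0)
        return -1;
    return mount (NULL, path, NULL,
                  MS_REMOUNT | MS_BIND | MS_RDONLY
                  | ms_flags_from_statvfs (st.f_flag),
                  NULL);
}
```

The flag translation is kept as a separate pure function so it can be
checked without privileges. The remount itself is subject to the same
requirements as the test program.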

^ permalink raw reply

* [PATCH] vdso: Remove struct getcpu_cache
From: Thomas Weißschuh @ 2025-08-26  5:29 UTC (permalink / raw)
  To: Huacai Chen, WANG Xuerui, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Richard Weinberger,
	Anton Ivanov, Johannes Berg, Vincenzo Frascino, Shuah Khan
  Cc: loongarch, linux-kernel, linux-s390, linux-um, linux-api,
	linux-kselftest, Thomas Weißschuh

The cache parameter of getcpu() is not used by the kernel and no user ever
passes it in anyways.

Remove the struct and its header.

Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
---
We could also completely remove the parameter, but I am not sure if
that is a good idea for syscalls and vDSO entrypoints.
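
For illustration only, here is a minimal sketch of how userspace can call
getcpu through the raw syscall with NULL as the third argument, which is
the usage the commit message describes. The helper name
current_cpu_and_node is hypothetical and not part of this patch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: query the CPU and NUMA node the calling thread is
 * running on. The third syscall argument is the unused cache pointer, so
 * NULL is passed. Returns 0 on success, -1 with errno set on failure.
 */
long
current_cpu_and_node (unsigned int *cpu, unsigned int *node)
{
    assert (cpu != NULL && node != NULL);
    return syscall (SYS_getcpu, cpu, node, NULL);
}
```

The result may be stale by the time the caller looks at it, since the
thread can migrate to another CPU right after the call returns.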
---
 arch/loongarch/vdso/vgetcpu.c                   |  5 ++---
 arch/s390/kernel/vdso64/getcpu.c                |  3 +--
 arch/s390/kernel/vdso64/vdso.h                  |  4 +---
 arch/x86/entry/vdso/vgetcpu.c                   |  5 ++---
 arch/x86/include/asm/vdso/processor.h           |  4 +---
 arch/x86/um/vdso/um_vdso.c                      |  7 +++----
 include/linux/getcpu.h                          | 19 -------------------
 include/linux/syscalls.h                        |  3 +--
 kernel/sys.c                                    |  4 +---
 tools/testing/selftests/vDSO/vdso_test_getcpu.c |  4 +---
 10 files changed, 13 insertions(+), 45 deletions(-)

diff --git a/arch/loongarch/vdso/vgetcpu.c b/arch/loongarch/vdso/vgetcpu.c
index 5301cd9d0f839eb0fd7b73a1d36e80aaa75d5e76..aefba899873ed035d70766a95b0b6fea881e94df 100644
--- a/arch/loongarch/vdso/vgetcpu.c
+++ b/arch/loongarch/vdso/vgetcpu.c
@@ -4,7 +4,6 @@
  */
 
 #include <asm/vdso.h>
-#include <linux/getcpu.h>
 
 static __always_inline int read_cpu_id(void)
 {
@@ -20,8 +19,8 @@ static __always_inline int read_cpu_id(void)
 }
 
 extern
-int __vdso_getcpu(unsigned int *cpu, unsigned int *node, struct getcpu_cache *unused);
-int __vdso_getcpu(unsigned int *cpu, unsigned int *node, struct getcpu_cache *unused)
+int __vdso_getcpu(unsigned int *cpu, unsigned int *node, void *unused);
+int __vdso_getcpu(unsigned int *cpu, unsigned int *node, void *unused)
 {
 	int cpu_id;
 
diff --git a/arch/s390/kernel/vdso64/getcpu.c b/arch/s390/kernel/vdso64/getcpu.c
index 5c5d4a848b7669436e73df8e3b711e5b876eb3db..1e17665616c5fa766ca66c8de276b212528934bd 100644
--- a/arch/s390/kernel/vdso64/getcpu.c
+++ b/arch/s390/kernel/vdso64/getcpu.c
@@ -2,11 +2,10 @@
 /* Copyright IBM Corp. 2020 */
 
 #include <linux/compiler.h>
-#include <linux/getcpu.h>
 #include <asm/timex.h>
 #include "vdso.h"
 
-int __s390_vdso_getcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *unused)
+int __s390_vdso_getcpu(unsigned *cpu, unsigned *node, void *unused)
 {
 	union tod_clock clk;
 
diff --git a/arch/s390/kernel/vdso64/vdso.h b/arch/s390/kernel/vdso64/vdso.h
index 9e5397e7b590a23c149ccc6043d0c0b0d5ea8457..cadd307d7a365cabf53f5c8d313be3718625533d 100644
--- a/arch/s390/kernel/vdso64/vdso.h
+++ b/arch/s390/kernel/vdso64/vdso.h
@@ -4,9 +4,7 @@
 
 #include <vdso/datapage.h>
 
-struct getcpu_cache;
-
-int __s390_vdso_getcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *unused);
+int __s390_vdso_getcpu(unsigned *cpu, unsigned *node, void *unused);
 int __s390_vdso_gettimeofday(struct __kernel_old_timeval *tv, struct timezone *tz);
 int __s390_vdso_clock_gettime(clockid_t clock, struct __kernel_timespec *ts);
 int __s390_vdso_clock_getres(clockid_t clock, struct __kernel_timespec *ts);
diff --git a/arch/x86/entry/vdso/vgetcpu.c b/arch/x86/entry/vdso/vgetcpu.c
index e4640306b2e3c95d74d73037ab6b09294b8e1d6c..6381b472b7c52487bccf3cbf0664c3d7a0e59699 100644
--- a/arch/x86/entry/vdso/vgetcpu.c
+++ b/arch/x86/entry/vdso/vgetcpu.c
@@ -6,17 +6,16 @@
  */
 
 #include <linux/kernel.h>
-#include <linux/getcpu.h>
 #include <asm/segment.h>
 #include <vdso/processor.h>
 
 notrace long
-__vdso_getcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *unused)
+__vdso_getcpu(unsigned *cpu, unsigned *node, void *unused)
 {
 	vdso_read_cpunode(cpu, node);
 
 	return 0;
 }
 
-long getcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *tcache)
+long getcpu(unsigned *cpu, unsigned *node, void *tcache)
 	__attribute__((weak, alias("__vdso_getcpu")));
diff --git a/arch/x86/include/asm/vdso/processor.h b/arch/x86/include/asm/vdso/processor.h
index 7000aeb59aa287e2119c3d43ab3eaf82befb59c4..93e0e24e5cb47f7b0056c13f2a7f2304ed4a0595 100644
--- a/arch/x86/include/asm/vdso/processor.h
+++ b/arch/x86/include/asm/vdso/processor.h
@@ -18,9 +18,7 @@ static __always_inline void cpu_relax(void)
 	native_pause();
 }
 
-struct getcpu_cache;
-
-notrace long __vdso_getcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *unused);
+notrace long __vdso_getcpu(unsigned *cpu, unsigned *node, void *unused);
 
 #endif /* __ASSEMBLER__ */
 
diff --git a/arch/x86/um/vdso/um_vdso.c b/arch/x86/um/vdso/um_vdso.c
index cbae2584124fd0ff0f9d240c33fefb8d213c84cd..9aa2c62cce6b7a07bbaf8441014d347162d1950d 100644
--- a/arch/x86/um/vdso/um_vdso.c
+++ b/arch/x86/um/vdso/um_vdso.c
@@ -10,14 +10,13 @@
 #define DISABLE_BRANCH_PROFILING
 
 #include <linux/time.h>
-#include <linux/getcpu.h>
 #include <asm/unistd.h>
 
 /* workaround for -Wmissing-prototypes warnings */
 int __vdso_clock_gettime(clockid_t clock, struct __kernel_old_timespec *ts);
 int __vdso_gettimeofday(struct __kernel_old_timeval *tv, struct timezone *tz);
 __kernel_old_time_t __vdso_time(__kernel_old_time_t *t);
-long __vdso_getcpu(unsigned int *cpu, unsigned int *node, struct getcpu_cache *unused);
+long __vdso_getcpu(unsigned int *cpu, unsigned int *node, void *unused);
 
 int __vdso_clock_gettime(clockid_t clock, struct __kernel_old_timespec *ts)
 {
@@ -60,7 +59,7 @@ __kernel_old_time_t __vdso_time(__kernel_old_time_t *t)
 __kernel_old_time_t time(__kernel_old_time_t *t) __attribute__((weak, alias("__vdso_time")));
 
 long
-__vdso_getcpu(unsigned int *cpu, unsigned int *node, struct getcpu_cache *unused)
+__vdso_getcpu(unsigned int *cpu, unsigned int *node, void *unused)
 {
 	/*
 	 * UML does not support SMP, we can cheat here. :)
@@ -74,5 +73,5 @@ __vdso_getcpu(unsigned int *cpu, unsigned int *node, struct getcpu_cache *unused
 	return 0;
 }
 
-long getcpu(unsigned int *cpu, unsigned int *node, struct getcpu_cache *tcache)
+long getcpu(unsigned int *cpu, unsigned int *node, void *tcache)
 	__attribute__((weak, alias("__vdso_getcpu")));
diff --git a/include/linux/getcpu.h b/include/linux/getcpu.h
deleted file mode 100644
index c304dcdb4eac2a9117080e6a14f4e3f28d07fd56..0000000000000000000000000000000000000000
--- a/include/linux/getcpu.h
+++ /dev/null
@@ -1,19 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _LINUX_GETCPU_H
-#define _LINUX_GETCPU_H 1
-
-/* Cache for getcpu() to speed it up. Results might be a short time
-   out of date, but will be faster.
-
-   User programs should not refer to the contents of this structure.
-   I repeat they should not refer to it. If they do they will break
-   in future kernels.
-
-   It is only a private cache for vgetcpu(). It will change in future kernels.
-   The user program must store this information per thread (__thread)
-   If you want 100% accurate information pass NULL instead. */
-struct getcpu_cache {
-	unsigned long blob[128 / sizeof(long)];
-};
-
-#endif
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 77f45e5d44139da36a5dacbf9db7b65261d13398..81822d203eac5d8d91488a18ff7fcdc65670df54 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -59,7 +59,6 @@ struct compat_stat;
 struct old_timeval32;
 struct robust_list_head;
 struct futex_waitv;
-struct getcpu_cache;
 struct old_linux_dirent;
 struct perf_event_attr;
 struct file_handle;
@@ -714,7 +713,7 @@ asmlinkage long sys_getrusage(int who, struct rusage __user *ru);
 asmlinkage long sys_umask(int mask);
 asmlinkage long sys_prctl(int option, unsigned long arg2, unsigned long arg3,
 			unsigned long arg4, unsigned long arg5);
-asmlinkage long sys_getcpu(unsigned __user *cpu, unsigned __user *node, struct getcpu_cache __user *cache);
+asmlinkage long sys_getcpu(unsigned __user *cpu, unsigned __user *node, void __user *cache);
 asmlinkage long sys_gettimeofday(struct __kernel_old_timeval __user *tv,
 				struct timezone __user *tz);
 asmlinkage long sys_settimeofday(struct __kernel_old_timeval __user *tv,
diff --git a/kernel/sys.c b/kernel/sys.c
index 1e28b40053ce206d7d0ed27e8a4fce8b616c3565..a830d78c1e1eb1d6cef31294feeb6a88dc0f83f3 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -31,7 +31,6 @@
 #include <linux/tty.h>
 #include <linux/signal.h>
 #include <linux/cn_proc.h>
-#include <linux/getcpu.h>
 #include <linux/task_io_accounting_ops.h>
 #include <linux/seccomp.h>
 #include <linux/cpu.h>
@@ -2813,8 +2812,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	return error;
 }
 
-SYSCALL_DEFINE3(getcpu, unsigned __user *, cpup, unsigned __user *, nodep,
-		struct getcpu_cache __user *, unused)
+SYSCALL_DEFINE3(getcpu, unsigned __user *, cpup, unsigned __user *, nodep, void __user *, unused)
 {
 	int err = 0;
 	int cpu = raw_smp_processor_id();
diff --git a/tools/testing/selftests/vDSO/vdso_test_getcpu.c b/tools/testing/selftests/vDSO/vdso_test_getcpu.c
index cdeaed45fb26c61f6314c58fe1b71fa0be3c0108..994ce569dc37c6689b1a3c79156e3dfc8bf27f22 100644
--- a/tools/testing/selftests/vDSO/vdso_test_getcpu.c
+++ b/tools/testing/selftests/vDSO/vdso_test_getcpu.c
@@ -16,9 +16,7 @@
 #include "vdso_config.h"
 #include "vdso_call.h"
 
-struct getcpu_cache;
-typedef long (*getcpu_t)(unsigned int *, unsigned int *,
-			 struct getcpu_cache *);
+typedef long (*getcpu_t)(unsigned int *, unsigned int *, void *);
 
 int main(int argc, char **argv)
 {

---
base-commit: 8f5ae30d69d7543eee0d70083daf4de8fe15d585
change-id: 20250825-getcpu_cache-3abcd2e65437

Best regards,
-- 
Thomas Weißschuh <thomas.weissschuh@linutronix.de>


^ permalink raw reply related

* Re: [RFC PATCH v1 1/2] fs: Add O_DENY_WRITE
From: Jeff Xu @ 2025-08-25 23:06 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Mickaël Salaün, Jann Horn, Al Viro, Christian Brauner,
	Kees Cook, Paul Moore, Serge Hallyn, Andy Lutomirski,
	Arnd Bergmann, Christian Heimes, Dmitry Vyukov, Elliott Hughes,
	Fan Wu, Florian Weimer, Jonathan Corbet, Jordan R Abrahams,
	Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski,
	Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite,
	Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb,
	kernel-hardening, linux-api, linux-fsdevel, linux-integrity,
	linux-kernel, linux-security-module, Jeff Xu
In-Reply-To: <F0E70FC7-8DCE-4057-8E91-9FA1AC5BC758@amacapital.net>

On Mon, Aug 25, 2025 at 2:56 PM Andy Lutomirski <luto@amacapital.net> wrote:
>
>
> > On Aug 25, 2025, at 11:10 AM, Jeff Xu <jeffxu@google.com> wrote:
> >
> > On Mon, Aug 25, 2025 at 9:43 AM Andy Lutomirski <luto@amacapital.net> wrote:
> >>> On Mon, Aug 25, 2025 at 2:31 AM Mickaël Salaün <mic@digikod.net> wrote:
> >>> On Sun, Aug 24, 2025 at 11:04:03AM -0700, Andy Lutomirski wrote:
> >>>> On Sun, Aug 24, 2025 at 4:03 AM Mickaël Salaün <mic@digikod.net> wrote:
> >>>>> On Fri, Aug 22, 2025 at 09:45:32PM +0200, Jann Horn wrote:
> >>>>>> On Fri, Aug 22, 2025 at 7:08 PM Mickaël Salaün <mic@digikod.net> wrote:
> >>>>>>> Add a new O_DENY_WRITE flag usable at open time and on opened file (e.g.
> >>>>>>> passed file descriptors).  This changes the state of the opened file by
> >>>>>>> making it read-only until it is closed.  The main use case is for script
> >>>>>>> interpreters to get the guarantee that a script's content cannot be altered
> >>>>>>> while being read and interpreted.  This is useful for generic distros
> >>>>>>> that may not have a write-xor-execute policy.  See commit a5874fde3c08
> >>>>>>> ("exec: Add a new AT_EXECVE_CHECK flag to execveat(2)")
> >>>>>>> Both execve(2) and the IOCTL to enable fsverity can already set this
> >>>>>>> property on files with deny_write_access().  This new O_DENY_WRITE makes
> >>>>>> The kernel actually tried to get rid of this behavior on execve() in
> >>>>>> commit 2a010c41285345da60cece35575b4e0af7e7bf44; but sadly that had
> >>>>>> to be reverted in commit 3b832035387ff508fdcf0fba66701afc78f79e3d
> >>>>>> because it broke userspace assumptions.
> >>>>> Oh, good to know.
> >>>>>>> it widely available.  This is similar to what other OSs may provide
> >>>>>>> e.g., opening a file with only FILE_SHARE_READ on Windows.
> >>>>>> We used to have the analogous mmap() flag MAP_DENYWRITE, and that was
> >>>>>> removed for security reasons; as
> >>>>>> https://man7.org/linux/man-pages/man2/mmap.2.html says:
> >>>>>> |        MAP_DENYWRITE
> >>>>>> |               This flag is ignored.  (Long ago—Linux 2.0 and earlier—it
> >>>>>> |               signaled that attempts to write to the underlying file
> >>>>>> |               should fail with ETXTBSY.  But this was a source of denial-
> >>>>>> |               of-service attacks.)"
> >>>>>> It seems to me that the same issue applies to your patch - it would
> >>>>>> allow unprivileged processes to essentially lock files such that other
> >>>>>> processes can't write to them anymore. This might allow unprivileged
> >>>>>> users to prevent root from updating config files or stuff like that if
> >>>>>> they're updated in-place.
> >>>>> Yes, I agree, but since it is the case for executed files I thought it
> >>>>> was worth starting a discussion on this topic.  This new flag could be
> >>>>> restricted to executable files, but we should avoid system-wide locks
> >>>>> like this.  I'm not sure how Windows handles these issues though.
> >>>>> Anyway, we should rely on the access control policy to control write and
> >>>>> execute access in a consistent way (e.g. write-xor-execute).  Thanks for
> >>>>> the references and the background!
> >>>> I'm confused.  I understand that there are many contexts in which one
> >>>> would want to prevent execution of unapproved content, which might
> >>>> include preventing a given process from modifying some code and then
> >>>> executing it.
> >>>> I don't understand what these deny-write features have to do with it.
> >>>> These features merely prevent someone from modifying code *that is
> >>>> currently in use*, which is not at all the same thing as preventing
> >>>> modifying code that might get executed -- one can often modify
> >>>> contents *before* executing those contents.
> >>> The order of checks would be:
> >>> 1. open script with O_DENY_WRITE
> >>> 2. check executability with AT_EXECVE_CHECK
> >>> 3. read the content and interpret it
> >> Hmm.  Common LSM configurations should be able to handle this without
> >> deny write, I think.  If you don't want a program to be able to make
> >> their own scripts, then don't allow AT_EXECVE_CHECK to succeed on a
> >> script that the program can write.
> > Yes, common LSM configurations could handle this. However, for historical
> > and app backward-compatibility reasons, it is sometimes impossible to
> > enforce that kind of policy in practice, so as an alternative, a
> > mechanism such as AT_EXECVE_CHECK is really useful.
>
> Can you clarify?  I’m suspicious that we’re talking past each other.
>
Apologies, my response wasn't clear.

> AT_EXECVE_CHECK solves a problem that there are actions that effectively “execute” a file that don’t execute literal CPU instructions for it. Sometimes open+read has the effect of interpreting the contents of the file as something code-like.
>
Yes. We have the same understanding of this.
As an example, take a shell script or Java bytecode: the file
permissions can be rw with no x bit set, yet the interpreter reads the
file and executes it.

> But, as I see it, deny-write is almost entirely orthogonal. If you open a file with the intent of executing it (mmap-execute or interpret — makes little practical difference here), then the kernel can enforce some policy. If the file is writable by a process that ought not have permission to execute code in the context of the opening-for-execute process, then LSMs need deny-write to be enforced so that they can verify the contents at the time of opening.
>
> But let’s step back a moment: is there any actual sensible security policy that does this?  If I want to *enforce* that a process only execute approved code, then wouldn’t I do it by only allowing executing files that the process can’t write?
>
I imagine the following situation: an app has "rw" access to the
file that holds the script code. The "w" is needed because the app
sometimes updates the script.

What is a reasonable sandbox solution for such an app? There are
perhaps two options:

1> Split the app into two processes: process A has "w" access to the
script for updating it when needed. Process B has "r" access but no "w",
for executing it. Process A and process B coordinate to avoid racing
on the script update.

2> The process uses AT_EXECVE_CHECK (added by the interpreter) to
validate the file before opening it, and the file content held by the
process should be immutable while it is being validated and later
executed by the interpreter.

Option 1 is the ideal, and IIUC, you favor it too. However, it
requires refactoring the app into two processes.
Option 2 is an alternative. Because it doesn't require changes to
the apps, it is a solution worth considering.
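
To make option 2 concrete, here is a rough sketch of the check an
interpreter could perform on a script it has already opened. It assumes
that AT_EXECVE_CHECK has the value 0x10000 (as I understand the uapi header
from the commit referenced earlier in the thread) and that kernels without
the flag reject the call instead of executing anything. The helper names
exec_check_fd and open_checked_script are hypothetical. The sketch does not
use O_DENY_WRITE, which is only proposed in this RFC.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef AT_EXECVE_CHECK
#define AT_EXECVE_CHECK 0x10000 /* assumed value, for older userspace headers */
#endif

/*
 * Ask the kernel whether executing the file behind `fd` would be allowed.
 * With AT_EXECVE_CHECK nothing is executed; only the checks are performed.
 * Returns 0 if allowed, otherwise the errno value reported by the kernel.
 */
int
exec_check_fd (int fd)
{
    char *const empty[] = { NULL };

    if (syscall (SYS_execveat, fd, "", empty, empty,
                 AT_EMPTY_PATH | AT_EXECVE_CHECK) == 0)
        return 0;
    return errno;
}

/*
 * Open a script read-only and keep the descriptor only if the check
 * passes. Returns the descriptor, or -1 with errno set.
 */
int
open_checked_script (const char *path)
{
    int fd, err;

    assert (path != NULL);
    fd = open (path, O_RDONLY | O_CLOEXEC);
    if (fd < 0)
        return -1;
    err = exec_check_fd (fd);
    if (err != 0)
        {
            close (fd);
            errno = err;
            return -1;
        }
    return fd;
}
```

The interpreter would then read the script from the returned descriptor.
Nothing in this sketch stops another writer from changing the file after
the check, which is the gap the immutability requirement in option 2 is
meant to close.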

> The reason for the removal of deny-write wasn’t security — it was a functionality issue: a linker accidentally modified an in-use binary. If you have permission to use gcc or lld, etc to create binaries, and you have permission to run them, then you pretty much have permission to run whatever code you like.
>
> So, if there’s a real security use case for deny-write, I’m still not seeing it.
>
Although the current patch might not be ideal due to the potential DoS
attack, it does offer a starting point for addressing the need. Let's
continue the discussion based on this patch and explore different
ideas.

Thanks and regards,
-Jeff

> >> Keep in mind that trying to lock this down too hard is pointless for
> >> users who are allowed to ptrace-write to their own processes.  Or
> >> for users who can do JIT, or for users who can run a REPL, etc.
> > The ptrace-write and /proc/pid/mem writing are on my radar, at least
> > for ChromeOS and Android.
> > AT_EXECVE_CHECK is orthogonal to those IMO, I hope eventually all
> > those paths will be hardened.
> >
> > Thanks and regards,
> > -Jeff

^ permalink raw reply

* [PATCH 07/11] tools headers: Sync syscall tables with the kernel source
From: Namhyung Kim @ 2025-08-25 21:58 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Ian Rogers, Jiri Olsa, Adrian Hunter, Peter Zijlstra, Ingo Molnar,
	LKML, linux-perf-users, Arnd Bergmann, linux-api
In-Reply-To: <20250825215904.2594216-1-namhyung@kernel.org>

To pick up the changes in this cset:

  be7efb2d20d67f33 fs: introduce file_getattr and file_setattr syscalls

This addresses these perf build warnings:

  Warning: Kernel ABI header differences:
    diff -u tools/include/uapi/asm-generic/unistd.h include/uapi/asm-generic/unistd.h
    diff -u tools/scripts/syscall.tbl scripts/syscall.tbl
    diff -u tools/perf/arch/x86/entry/syscalls/syscall_32.tbl arch/x86/entry/syscalls/syscall_32.tbl
    diff -u tools/perf/arch/x86/entry/syscalls/syscall_64.tbl arch/x86/entry/syscalls/syscall_64.tbl
    diff -u tools/perf/arch/powerpc/entry/syscalls/syscall.tbl arch/powerpc/kernel/syscalls/syscall.tbl
    diff -u tools/perf/arch/s390/entry/syscalls/syscall.tbl arch/s390/kernel/syscalls/syscall.tbl
    diff -u tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl arch/mips/kernel/syscalls/syscall_n64.tbl
    diff -u tools/perf/arch/arm/entry/syscalls/syscall.tbl arch/arm/tools/syscall.tbl
    diff -u tools/perf/arch/sh/entry/syscalls/syscall.tbl arch/sh/kernel/syscalls/syscall.tbl
    diff -u tools/perf/arch/sparc/entry/syscalls/syscall.tbl arch/sparc/kernel/syscalls/syscall.tbl
    diff -u tools/perf/arch/xtensa/entry/syscalls/syscall.tbl arch/xtensa/kernel/syscalls/syscall.tbl

Please see tools/include/uapi/README for further details.

Cc: Arnd Bergmann <arnd@arndb.de>
CC: linux-api@vger.kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/include/uapi/asm-generic/unistd.h             | 8 +++++++-
 tools/perf/arch/arm/entry/syscalls/syscall.tbl      | 2 ++
 tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl | 2 ++
 tools/perf/arch/powerpc/entry/syscalls/syscall.tbl  | 2 ++
 tools/perf/arch/s390/entry/syscalls/syscall.tbl     | 2 ++
 tools/perf/arch/sh/entry/syscalls/syscall.tbl       | 2 ++
 tools/perf/arch/sparc/entry/syscalls/syscall.tbl    | 2 ++
 tools/perf/arch/x86/entry/syscalls/syscall_32.tbl   | 2 ++
 tools/perf/arch/x86/entry/syscalls/syscall_64.tbl   | 2 ++
 tools/perf/arch/xtensa/entry/syscalls/syscall.tbl   | 2 ++
 tools/scripts/syscall.tbl                           | 2 ++
 11 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/tools/include/uapi/asm-generic/unistd.h b/tools/include/uapi/asm-generic/unistd.h
index 2892a45023af6d3e..04e0077fb4c97a4d 100644
--- a/tools/include/uapi/asm-generic/unistd.h
+++ b/tools/include/uapi/asm-generic/unistd.h
@@ -852,8 +852,14 @@ __SYSCALL(__NR_removexattrat, sys_removexattrat)
 #define __NR_open_tree_attr 467
 __SYSCALL(__NR_open_tree_attr, sys_open_tree_attr)
 
+/* fs/inode.c */
+#define __NR_file_getattr 468
+__SYSCALL(__NR_file_getattr, sys_file_getattr)
+#define __NR_file_setattr 469
+__SYSCALL(__NR_file_setattr, sys_file_setattr)
+
 #undef __NR_syscalls
-#define __NR_syscalls 468
+#define __NR_syscalls 470
 
 /*
  * 32 bit systems traditionally used different
diff --git a/tools/perf/arch/arm/entry/syscalls/syscall.tbl b/tools/perf/arch/arm/entry/syscalls/syscall.tbl
index 27c1d5ebcd91c8c2..b07e699aaa3c2840 100644
--- a/tools/perf/arch/arm/entry/syscalls/syscall.tbl
+++ b/tools/perf/arch/arm/entry/syscalls/syscall.tbl
@@ -482,3 +482,5 @@
 465	common	listxattrat			sys_listxattrat
 466	common	removexattrat			sys_removexattrat
 467	common	open_tree_attr			sys_open_tree_attr
+468	common	file_getattr			sys_file_getattr
+469	common	file_setattr			sys_file_setattr
diff --git a/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl b/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl
index 1e8c44c7b61492ea..7a7049c2c307885f 100644
--- a/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl
+++ b/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl
@@ -382,3 +382,5 @@
 465	n64	listxattrat			sys_listxattrat
 466	n64	removexattrat			sys_removexattrat
 467	n64	open_tree_attr			sys_open_tree_attr
+468	n64	file_getattr			sys_file_getattr
+469	n64	file_setattr			sys_file_setattr
diff --git a/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl b/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl
index 9a084bdb892694bc..b453e80dfc003796 100644
--- a/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl
+++ b/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl
@@ -558,3 +558,5 @@
 465	common	listxattrat			sys_listxattrat
 466	common	removexattrat			sys_removexattrat
 467	common	open_tree_attr			sys_open_tree_attr
+468	common	file_getattr			sys_file_getattr
+469	common	file_setattr			sys_file_setattr
diff --git a/tools/perf/arch/s390/entry/syscalls/syscall.tbl b/tools/perf/arch/s390/entry/syscalls/syscall.tbl
index a4569b96ef06c54c..8a6744d658db3986 100644
--- a/tools/perf/arch/s390/entry/syscalls/syscall.tbl
+++ b/tools/perf/arch/s390/entry/syscalls/syscall.tbl
@@ -470,3 +470,5 @@
 465  common	listxattrat		sys_listxattrat			sys_listxattrat
 466  common	removexattrat		sys_removexattrat		sys_removexattrat
 467  common	open_tree_attr		sys_open_tree_attr		sys_open_tree_attr
+468  common	file_getattr		sys_file_getattr		sys_file_getattr
+469  common	file_setattr		sys_file_setattr		sys_file_setattr
diff --git a/tools/perf/arch/sh/entry/syscalls/syscall.tbl b/tools/perf/arch/sh/entry/syscalls/syscall.tbl
index 52a7652fcff6394b..5e9c9eff5539e241 100644
--- a/tools/perf/arch/sh/entry/syscalls/syscall.tbl
+++ b/tools/perf/arch/sh/entry/syscalls/syscall.tbl
@@ -471,3 +471,5 @@
 465	common	listxattrat			sys_listxattrat
 466	common	removexattrat			sys_removexattrat
 467	common	open_tree_attr			sys_open_tree_attr
+468	common	file_getattr			sys_file_getattr
+469	common	file_setattr			sys_file_setattr
diff --git a/tools/perf/arch/sparc/entry/syscalls/syscall.tbl b/tools/perf/arch/sparc/entry/syscalls/syscall.tbl
index 83e45eb6c095a36b..ebb7d06d1044fa9b 100644
--- a/tools/perf/arch/sparc/entry/syscalls/syscall.tbl
+++ b/tools/perf/arch/sparc/entry/syscalls/syscall.tbl
@@ -513,3 +513,5 @@
 465	common	listxattrat			sys_listxattrat
 466	common	removexattrat			sys_removexattrat
 467	common	open_tree_attr			sys_open_tree_attr
+468	common	file_getattr			sys_file_getattr
+469	common	file_setattr			sys_file_setattr
diff --git a/tools/perf/arch/x86/entry/syscalls/syscall_32.tbl b/tools/perf/arch/x86/entry/syscalls/syscall_32.tbl
index ac007ea00979dc28..4877e16da69a50f2 100644
--- a/tools/perf/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/tools/perf/arch/x86/entry/syscalls/syscall_32.tbl
@@ -473,3 +473,5 @@
 465	i386	listxattrat		sys_listxattrat
 466	i386	removexattrat		sys_removexattrat
 467	i386	open_tree_attr		sys_open_tree_attr
+468	i386	file_getattr		sys_file_getattr
+469	i386	file_setattr		sys_file_setattr
diff --git a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl b/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl
index cfb5ca41e30de1a4..92cf0fe2291eb99b 100644
--- a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl
@@ -391,6 +391,8 @@
 465	common	listxattrat		sys_listxattrat
 466	common	removexattrat		sys_removexattrat
 467	common	open_tree_attr		sys_open_tree_attr
+468	common	file_getattr		sys_file_getattr
+469	common	file_setattr		sys_file_setattr
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/tools/perf/arch/xtensa/entry/syscalls/syscall.tbl b/tools/perf/arch/xtensa/entry/syscalls/syscall.tbl
index f657a77314f8667f..374e4cb788d8a6d4 100644
--- a/tools/perf/arch/xtensa/entry/syscalls/syscall.tbl
+++ b/tools/perf/arch/xtensa/entry/syscalls/syscall.tbl
@@ -438,3 +438,5 @@
 465	common	listxattrat			sys_listxattrat
 466	common	removexattrat			sys_removexattrat
 467	common	open_tree_attr			sys_open_tree_attr
+468	common	file_getattr			sys_file_getattr
+469	common	file_setattr			sys_file_setattr
diff --git a/tools/scripts/syscall.tbl b/tools/scripts/syscall.tbl
index 580b4e246aecd5f0..d1ae5e92c615b58e 100644
--- a/tools/scripts/syscall.tbl
+++ b/tools/scripts/syscall.tbl
@@ -408,3 +408,5 @@
 465	common	listxattrat			sys_listxattrat
 466	common	removexattrat			sys_removexattrat
 467	common	open_tree_attr			sys_open_tree_attr
+468	common	file_getattr			sys_file_getattr
+469	common	file_setattr			sys_file_setattr
-- 
2.51.0.261.g7ce5a0a67e-goog



* Re: [RFC PATCH v1 1/2] fs: Add O_DENY_WRITE
From: Andy Lutomirski @ 2025-08-25 21:56 UTC (permalink / raw)
  To: Jeff Xu
  Cc: Mickaël Salaün, Jann Horn, Al Viro, Christian Brauner,
	Kees Cook, Paul Moore, Serge Hallyn, Andy Lutomirski,
	Arnd Bergmann, Christian Heimes, Dmitry Vyukov, Elliott Hughes,
	Fan Wu, Florian Weimer, Jonathan Corbet, Jordan R Abrahams,
	Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski,
	Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite,
	Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb,
	kernel-hardening, linux-api, linux-fsdevel, linux-integrity,
	linux-kernel, linux-security-module, Jeff Xu


> On Aug 25, 2025, at 11:10 AM, Jeff Xu <jeffxu@google.com> wrote:
> 
> On Mon, Aug 25, 2025 at 9:43 AM Andy Lutomirski <luto@amacapital.net> wrote:
>>> On Mon, Aug 25, 2025 at 2:31 AM Mickaël Salaün <mic@digikod.net> wrote:
>>> On Sun, Aug 24, 2025 at 11:04:03AM -0700, Andy Lutomirski wrote:
>>>> On Sun, Aug 24, 2025 at 4:03 AM Mickaël Salaün <mic@digikod.net> wrote:
>>>>> On Fri, Aug 22, 2025 at 09:45:32PM +0200, Jann Horn wrote:
>>>>>> On Fri, Aug 22, 2025 at 7:08 PM Mickaël Salaün <mic@digikod.net> wrote:
>>>>>>> Add a new O_DENY_WRITE flag usable at open time and on opened file (e.g.
>>>>>>> passed file descriptors).  This changes the state of the opened file by
>>>>>>> making it read-only until it is closed.  The main use case is for script
>>>>>>> interpreters to get the guarantee that a script's content cannot be altered
>>>>>>> while being read and interpreted.  This is useful for generic distros
>>>>>>> that may not have a write-xor-execute policy.  See commit a5874fde3c08
>>>>>>> ("exec: Add a new AT_EXECVE_CHECK flag to execveat(2)")
>>>>>>> Both execve(2) and the IOCTL to enable fsverity can already set this
>>>>>>> property on files with deny_write_access().  This new O_DENY_WRITE makes
>>>>>> The kernel actually tried to get rid of this behavior on execve() in
>>>>>> commit 2a010c41285345da60cece35575b4e0af7e7bf44, but sadly that had
>>>>>> to be reverted in commit 3b832035387ff508fdcf0fba66701afc78f79e3d
>>>>>> because it broke userspace assumptions.
>>>>> Oh, good to know.
>>>>>>> it widely available.  This is similar to what other OSs may provide
>>>>>>> e.g., opening a file with only FILE_SHARE_READ on Windows.
>>>>>> We used to have the analogous mmap() flag MAP_DENYWRITE, and that was
>>>>>> removed for security reasons; as
>>>>>> https://man7.org/linux/man-pages/man2/mmap.2.html says:
>>>>>> |        MAP_DENYWRITE
>>>>>> |               This flag is ignored.  (Long ago—Linux 2.0 and earlier—it
>>>>>> |               signaled that attempts to write to the underlying file
>>>>>> |               should fail with ETXTBSY.  But this was a source of denial-
>>>>>> |               of-service attacks.)"
>>>>>> It seems to me that the same issue applies to your patch - it would
>>>>>> allow unprivileged processes to essentially lock files such that other
>>>>>> processes can't write to them anymore. This might allow unprivileged
>>>>>> users to prevent root from updating config files or stuff like that if
>>>>>> they're updated in-place.
>>>>> Yes, I agree, but since it is the case for executed files I thought it
>>>>> was worth starting a discussion on this topic.  This new flag could be
>>>>> restricted to executable files, but we should avoid system-wide locks
>>>>> like this.  I'm not sure how Windows handles these issues though.
>>>>> Anyway, we should rely on the access control policy to control write and
>>>>> execute access in a consistent way (e.g. write-xor-execute).  Thanks for
>>>>> the references and the background!
>>>> I'm confused.  I understand that there are many contexts in which one
>>>> would want to prevent execution of unapproved content, which might
>>>> include preventing a given process from modifying some code and then
>>>> executing it.
>>>> I don't understand what these deny-write features have to do with it.
>>>> These features merely prevent someone from modifying code *that is
>>>> currently in use*, which is not at all the same thing as preventing
>>>> modifying code that might get executed -- one can often modify
>>>> contents *before* executing those contents.
>>> The order of checks would be:
>>> 1. open script with O_DENY_WRITE
>>> 2. check executability with AT_EXECVE_CHECK
>>> 3. read the content and interpret it
>> Hmm.  Common LSM configurations should be able to handle this without
>> deny write, I think.  If you don't want a program to be able to make
>> their own scripts, then don't allow AT_EXECVE_CHECK to succeed on a
>> script that the program can write.
> Yes, common LSMs could handle this. However, for historical and app
> backward-compatibility reasons, it is sometimes impossible to enforce
> that kind of policy in practice. Therefore, as an alternative, a
> mechanism such as AT_EXECVE_CHECK is really useful.

Can you clarify?  I’m suspicious that we’re talking past each other.

AT_EXECVE_CHECK solves the problem that some actions effectively “execute” a file without executing literal CPU instructions from it. Sometimes open+read has the effect of interpreting the contents of the file as something code-like.

But, as I see it, deny-write is almost entirely orthogonal. If you open a file with the intent of executing it (mmap-execute or interpret — makes little practical difference here), then the kernel can enforce some policy. If the file is writable by a process that ought not have permission to execute code in the context of the opening-for-execute process, then LSMs need deny-write to be enforced so that they can verify the contents at the time of opening.

But let’s step back a moment: is there any actual sensible security policy that does this?  If I want to *enforce* that a process only executes approved code, then wouldn’t I do it by only allowing execution of files that the process can’t write?

The problem with the removal of deny-write wasn’t security; it was a functionality issue: a linker accidentally modified an in-use binary. If you have permission to use gcc or lld, etc. to create binaries, and you have permission to run them, then you pretty much have permission to run whatever code you like.

So, if there’s a real security use case for deny-write, I’m still not seeing it.
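
For reference, the open / check / read sequence quoted above can be sketched in C roughly as follows. This is only an illustration under stated assumptions, not code from the patch: O_DENY_WRITE is merely proposed and has no value in released headers, so step 1 is a plain read-only open; the helper names check_script_executable() and open_script_checked() are hypothetical; and the fallback value for AT_EXECVE_CHECK is assumed to match linux/fcntl.h on kernels that support the flag. On kernels without support, the check is expected to fail with EINVAL.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: value taken from linux/fcntl.h on kernels that have the flag. */
#ifndef AT_EXECVE_CHECK
# define AT_EXECVE_CHECK 0x10000
#endif

extern char **environ;

/*
 * Step 2: ask the kernel whether executing the file behind fd would be
 * allowed, without executing anything. Returns 0 if allowed, -1 with
 * errno set otherwise (including EINVAL when the flag is unknown).
 */
int check_script_executable(int fd, const char *name)
{
	assert(name != NULL);
	char *const argv[] = { (char *)name, NULL };

	return (int)syscall(SYS_execveat, fd, "", argv, environ,
			    AT_EMPTY_PATH | AT_EXECVE_CHECK);
}

/*
 * Steps 1 and 2 combined. Returns an fd the interpreter may read from
 * (step 3), or -1 with errno set.
 */
int open_script_checked(const char *path)
{
	/* The proposed O_DENY_WRITE would be OR-ed in here if it existed. */
	int fd = open(path, O_RDONLY | O_CLOEXEC);

	if (fd < 0)
		return -1;
	if (check_script_executable(fd, path) != 0) {
		int saved = errno;	/* keep the check's errno across close() */

		close(fd);
		errno = saved;
		return -1;
	}
	return fd;
}
```

Note that nothing in this sketch prevents the file from being rewritten between the check and the read, which is the gap the deny-write proposal is aimed at.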

>> Keep in mind that trying to lock this down too hard is pointless for
>> users who are allowed to ptrace-write to their own processes.  Or
>> for users who can do JIT, or for users who can run a REPL, etc.
> The ptrace-write and /proc/pid/mem writing are on my radar, at least
> for ChromeOS and Android.
> AT_EXECVE_CHECK is orthogonal to those IMO, I hope eventually all
> those paths will be hardened.
> 
> Thanks and regards,
> -Jeff


* Re: [PATCH] uapi/fcntl: conditionally define AT_RENAME* macros
From: Randy Dunlap @ 2025-08-25 19:29 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-kernel, Amir Goldstein, Jeff Layton, Chuck Lever,
	Alexander Aring, Josef Bacik, Aleksa Sarai, Jan Kara,
	Christian Brauner, linux-fsdevel, linux-api
In-Reply-To: <aKyvO2bvPCZEzuBd@casper.infradead.org>



On 8/25/25 11:45 AM, Matthew Wilcox wrote:
> On Mon, Aug 25, 2025 at 10:52:31AM -0700, Randy Dunlap wrote:
>> $ grep -r AT_RENAME_NOREPLACE /usr/include
>> /usr/include/stdio.h:# define AT_RENAME_NOREPLACE RENAME_NOREPLACE
>> /usr/include/linux/fcntl.h:#define AT_RENAME_NOREPLACE	0x0001
>>
>> I have libc 2.42-1.1 (openSUSE).
> 
> I wonder if we can fix it by changing include/uapi/linux/fcntl.h
> from being an explicit 0x0001 to RENAME_NOREPLACE?  There's probably
> a horrendous include problem between linux/fcntl.h and linux/fs.h
> though?

I'm working on something like that now (suggested by Amir),
but it might depend on whether stdio.h has been #included first.
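
The include-order dependence can be illustrated with a small C sketch of the guard pattern named in the subject line. This is not the actual patch; it assumes only what the grep output above shows (stdio.h may already define AT_RENAME_NOREPLACE, and linux/fcntl.h uses 0x0001), and the helper at_rename_noreplace_value() is a hypothetical name used for checking the result.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>	/* may already define AT_RENAME_NOREPLACE */

/*
 * Only define the macro if an earlier header has not done so. This
 * avoids a redefinition with a different token sequence, but it only
 * helps when the other header is included first.
 */
#ifndef AT_RENAME_NOREPLACE
# define AT_RENAME_NOREPLACE 0x0001
#endif

/* Whichever definition won, the value must agree with the UAPI one. */
static_assert(AT_RENAME_NOREPLACE == 0x0001,
	      "AT_RENAME_NOREPLACE differs from the UAPI value");

unsigned int at_rename_noreplace_value(void)
{
	return AT_RENAME_NOREPLACE;
}
```

If the guarded header is included before stdio.h instead, the unguarded definition in stdio.h can still trigger a redefinition warning, which is the ordering concern raised here.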

-- 
~Randy



* Re: [PATCH v2 1/1] man2/mount.2: expand and clarify docs for MS_REMOUNT | MS_BIND
From: Alejandro Colomar @ 2025-08-25 19:13 UTC (permalink / raw)
  To: Askar Safin
  Cc: Aleksa Sarai, Alexander Viro, linux-api, linux-fsdevel,
	David Howells, Christian Brauner, linux-man
In-Reply-To: <20250825154839.2422856-2-safinaskar@zohomail.com>


Hi Askar,

On Mon, Aug 25, 2025 at 03:48:39PM +0000, Askar Safin wrote:
> My edit is based on experiments and reading Linux code
> 
> Signed-off-by: Askar Safin <safinaskar@zohomail.com>
> ---
>  man/man2/mount.2 | 32 +++++++++++++++++++++++++++++---
>  1 file changed, 29 insertions(+), 3 deletions(-)
> 
> diff --git a/man/man2/mount.2 b/man/man2/mount.2
> index 5d83231f9..47fc2d21f 100644
> --- a/man/man2/mount.2
> +++ b/man/man2/mount.2
> @@ -405,7 +405,30 @@ flag can be used with
>  to modify only the per-mount-point flags.
>  .\" See https://lwn.net/Articles/281157/
>  This is particularly useful for setting or clearing the "read-only"
> -flag on a mount without changing the underlying filesystem.
> +flag on a mount without changing the underlying filesystem parameters.
> +The
> +.I data
> +argument is ignored if
> +.B MS_REMOUNT
> +and
> +.B MS_BIND
> +are specified.
> +Note that the mountpoint will

I would remove "Note that".  Starting with "The" is equally
meaningful, and it spares the reader two meaningless words.

Should we say "mount point" instead?  Otherwise, it's inconsistent with
"mount-point flags" below.  Also, see:

alx@debian:~/src/linux/man-pages/man-pages/master/man$ grep -rn 'mount point' | wc -l
98
alx@debian:~/src/linux/man-pages/man-pages/master/man$ grep -rn 'mountpoint' | wc -l
3


> +have its existing per-mount-point flags
> +cleared and replaced with those in
> +.I mountflags
> +when
> +.B MS_REMOUNT
> +and
> +.B MS_BIND
> +are specified.

Maybe reverse the sentence to start with this?

	When
	.B MS_REMOUNT
	and
	.B MS_BIND
	are specified,
	the ...
	will have its existing ...
	cleared and replaced with those in
	.IR mountflags .

Having conditionals at the end makes my brain have to reparse the
previous text to understand it.  If I read the conditional early on,
my branch predictor kind of knows what to expect.  :)

> +This means that if

I would move the 'if' to the next line.

> +you wish to preserve
> +any existing per-mount-point flags,
> +you need to include them in
> +.IR mountflags ,
> +along with the per-mount-point flags you wish to set
> +(or with the flags you wish to clear missing).
>  Specifying
>  .I mountflags
>  as:
> @@ -416,8 +439,11 @@ MS_REMOUNT | MS_BIND | MS_RDONLY
>  .EE
>  .in
>  .P
> -will make access through this mountpoint read-only, without affecting
> -other mounts.

Hmmm, I see this uses 'mountpoint' already.

I guess we should have a clear direction of what term we want to use.
Since the existing text already uses this, I think we should change it
in a separate commit.  Do you want to send a second patch to use
'mount point'?

> +will make access through this mountpoint read-only
> +(clearing all other per-mount-point flags),
> +without affecting
> +other mounts
> +of this filesystem.


Have a lovely night!
Alex

>  .\"
>  .SS Creating a bind mount
>  If
> -- 
> 2.47.2
> 

-- 
<https://www.alejandro-colomar.es/>



* Re: [PATCH] uapi/fcntl: conditionally define AT_RENAME* macros
From: Matthew Wilcox @ 2025-08-25 18:45 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: linux-kernel, Amir Goldstein, Jeff Layton, Chuck Lever,
	Alexander Aring, Josef Bacik, Aleksa Sarai, Jan Kara,
	Christian Brauner, linux-fsdevel, linux-api
In-Reply-To: <0c755ddc-9ed1-462e-a9f1-16762ebe0a19@infradead.org>

On Mon, Aug 25, 2025 at 10:52:31AM -0700, Randy Dunlap wrote:
> $ grep -r AT_RENAME_NOREPLACE /usr/include
> /usr/include/stdio.h:# define AT_RENAME_NOREPLACE RENAME_NOREPLACE
> /usr/include/linux/fcntl.h:#define AT_RENAME_NOREPLACE	0x0001
> 
> I have libc 2.42-1.1 (openSUSE).

I wonder if we can fix it by changing include/uapi/linux/fcntl.h
from being an explicit 0x0001 to RENAME_NOREPLACE?  There's probably
a horrendous include problem between linux/fcntl.h and linux/fs.h
though?

