linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH 0/4] API extension for handling sysctl
       [not found] <CAHk-=whi2SzU4XT_FsdTCAuK2qtYmH+-hwi1cbSdG8zu0KXL=g@mail.gmail.com>
@ 2022-06-01 13:20 ` Alexey Gladkov
  2022-06-01 13:20   ` [RFC PATCH 1/4] sysctl: " Alexey Gladkov
                     ` (4 more replies)
  0 siblings, 5 replies; 16+ messages in thread
From: Alexey Gladkov @ 2022-06-01 13:20 UTC (permalink / raw)
  To: LKML, Eric W . Biederman, Linus Torvalds
  Cc: Andrew Morton, Christian Brauner, Iurii Zaikin, Kees Cook,
	Linux Containers, linux-fsdevel, Luis Chamberlain, Vasily Averin

On Fri, Apr 22, 2022 at 01:44:50PM -0700, Linus Torvalds wrote:
> On Fri, Apr 22, 2022 at 5:53 AM Alexey Gladkov <legion@kernel.org> wrote:
> >
> > Yes, Linus, these changes are not the refactoring you were talking
> > about, but I plan to try to do such a refactoring in the my next
> > patchset.
> 
> Heh. Ok, I'm not saying these patches are pretty, and looking up the
> namespace thing is a bit subtle, but it's certainly prettier than the
> existing odd "create a new ctl_table entry because of field abuse".

As I promised, here is one of the possible options for how to get rid of dynamic
memory allocation.

We can slightly extend the API and thus be able to save data at the time the
file is opened. This will not only eliminate the need to allocate memory, but
also provide access to file struct and f_cred.

I made an RFC because I'm not sure that I did the permissions check for
ipc_sysctl. I also did not change all the places where this API can be applied
to make the patch smaller. As in the case of /proc/sys/kernel/printk where
CAP_SYS_ADMIN is checked[1] for the current process at the time of write.

I made a patchset on top of:

git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-next

Because there are my previous changes.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/printk/sysctl.c#n17

--

Alexey Gladkov (4):
  sysctl: API extension for handling sysctl
  sysctl: ipc: Do not use dynamic memory
  sysctl: userns: Do not use dynamic memory
  sysctl: mqueue: Do not use dynamic memory

 fs/proc/proc_sysctl.c          |  71 ++++++++--
 include/linux/ipc_namespace.h  |  35 -----
 include/linux/sysctl.h         |  20 ++-
 include/linux/user_namespace.h |   6 -
 ipc/ipc_sysctl.c               | 236 +++++++++++++++++----------------
 ipc/mq_sysctl.c                | 138 ++++++++++---------
 ipc/mqueue.c                   |   5 -
 ipc/namespace.c                |  10 --
 kernel/ucount.c                | 116 +++++++---------
 kernel/user_namespace.c        |  10 +-
 10 files changed, 323 insertions(+), 324 deletions(-)

-- 
2.33.3


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [RFC PATCH 1/4] sysctl: API extension for handling sysctl
  2022-06-01 13:20 ` [RFC PATCH 0/4] API extension for handling sysctl Alexey Gladkov
@ 2022-06-01 13:20   ` Alexey Gladkov
  2022-06-01 19:19     ` Matthew Wilcox
  2022-06-01 13:20   ` [RFC PATCH 2/4] sysctl: ipc: Do not use dynamic memory Alexey Gladkov
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 16+ messages in thread
From: Alexey Gladkov @ 2022-06-01 13:20 UTC (permalink / raw)
  To: LKML, Eric W . Biederman, Linus Torvalds
  Cc: Andrew Morton, Christian Brauner, Iurii Zaikin, Kees Cook,
	Linux Containers, linux-fsdevel, Luis Chamberlain, Vasily Averin

This adds additional optional functions for handling open, read, and
write operations that can be customized for each sysctl file. It also
creates ctl_context that persists from opening to closing the file in
the /proc/sys.

The context allows us to store dynamic information at the time the file
is opened. This eliminates the need to duplicate ctl_table in order to
dynamically change .data, .extra1 or .extra2.

This API extends the existing one and does not require any changes to
already existing sysctl handlers.

Signed-off-by: Alexey Gladkov <legion@kernel.org>
---
 fs/proc/proc_sysctl.c  | 71 +++++++++++++++++++++++++++++++++++-------
 include/linux/sysctl.h | 20 ++++++++++--
 2 files changed, 77 insertions(+), 14 deletions(-)

diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index 7d9cfc730bd4..d3d43e738f01 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -560,6 +560,7 @@ static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter,
 	struct inode *inode = file_inode(iocb->ki_filp);
 	struct ctl_table_header *head = grab_header(inode);
 	struct ctl_table *table = PROC_I(inode)->sysctl_entry;
+	struct ctl_fops *fops = table->ctl_fops;
 	size_t count = iov_iter_count(iter);
 	char *kbuf;
 	ssize_t error;
@@ -577,7 +578,7 @@ static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter,
 
 	/* if that can happen at all, it should be -EINVAL, not -EISDIR */
 	error = -EINVAL;
-	if (!table->proc_handler)
+	if (!table->proc_handler && !fops)
 		goto out;
 
 	/* don't even try if the size is too large */
@@ -600,8 +601,20 @@ static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter,
 	if (error)
 		goto out_free_buf;
 
-	/* careful: calling conventions are nasty here */
-	error = table->proc_handler(table, write, kbuf, &count, &iocb->ki_pos);
+	if (fops) {
+		struct ctl_context *ctx = iocb->ki_filp->private_data;
+
+		if (write && fops->write)
+			error = fops->write(ctx, iocb->ki_filp, kbuf, &count, &iocb->ki_pos);
+		else if (!write && fops->read)
+			error = fops->read(ctx, iocb->ki_filp, kbuf, &count, &iocb->ki_pos);
+		else
+			error = -EINVAL;
+	} else {
+		/* careful: calling conventions are nasty here */
+		error = table->proc_handler(table, write, kbuf, &count, &iocb->ki_pos);
+	}
+
 	if (error)
 		goto out_free_buf;
 
@@ -634,17 +647,50 @@ static int proc_sys_open(struct inode *inode, struct file *filp)
 {
 	struct ctl_table_header *head = grab_header(inode);
 	struct ctl_table *table = PROC_I(inode)->sysctl_entry;
+	struct ctl_context *ctx;
+	int ret = 0;
 
 	/* sysctl was unregistered */
 	if (IS_ERR(head))
 		return PTR_ERR(head);
 
-	if (table->poll)
-		filp->private_data = proc_sys_poll_event(table->poll);
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		return -ENOMEM;
+
+	ctx->table = table;
+	filp->private_data = ctx;
+
+	if (table->ctl_fops && table->ctl_fops->open)
+		ret = table->ctl_fops->open(ctx, inode, filp);
+
+	if (!ret && table->poll)
+		ctx->poll_event = proc_sys_poll_event(table->poll);
 
 	sysctl_head_finish(head);
 
-	return 0;
+	return ret;
+}
+
+static int proc_sys_release(struct inode *inode, struct file *filp)
+{
+	struct ctl_table_header *head = grab_header(inode);
+	struct ctl_table *table = PROC_I(inode)->sysctl_entry;
+	struct ctl_context *ctx = filp->private_data;
+	int ret = 0;
+
+	if (IS_ERR(head))
+		return PTR_ERR(head);
+
+	if (table->ctl_fops && table->ctl_fops->release)
+		ret = table->ctl_fops->release(ctx, inode, filp);
+
+	sysctl_head_finish(head);
+
+	kfree(ctx);
+	filp->private_data =  NULL;
+
+	return ret;
 }
 
 static __poll_t proc_sys_poll(struct file *filp, poll_table *wait)
@@ -653,23 +699,23 @@ static __poll_t proc_sys_poll(struct file *filp, poll_table *wait)
 	struct ctl_table_header *head = grab_header(inode);
 	struct ctl_table *table = PROC_I(inode)->sysctl_entry;
 	__poll_t ret = DEFAULT_POLLMASK;
-	unsigned long event;
+	struct ctl_context *ctx;
 
 	/* sysctl was unregistered */
 	if (IS_ERR(head))
 		return EPOLLERR | EPOLLHUP;
 
-	if (!table->proc_handler)
+	if (!table->proc_handler && !table->ctl_fops)
 		goto out;
 
 	if (!table->poll)
 		goto out;
 
-	event = (unsigned long)filp->private_data;
+	ctx = filp->private_data;
 	poll_wait(filp, &table->poll->wait, wait);
 
-	if (event != atomic_read(&table->poll->event)) {
-		filp->private_data = proc_sys_poll_event(table->poll);
+	if (ctx->poll_event != atomic_read(&table->poll->event)) {
+		ctx->poll_event = proc_sys_poll_event(table->poll);
 		ret = EPOLLIN | EPOLLRDNORM | EPOLLERR | EPOLLPRI;
 	}
 
@@ -866,6 +912,7 @@ static int proc_sys_getattr(struct user_namespace *mnt_userns,
 
 static const struct file_operations proc_sys_file_operations = {
 	.open		= proc_sys_open,
+	.release	= proc_sys_release,
 	.poll		= proc_sys_poll,
 	.read_iter	= proc_sys_read,
 	.write_iter	= proc_sys_write,
@@ -1153,7 +1200,7 @@ static int sysctl_check_table(const char *path, struct ctl_table *table)
 			else
 				err |= sysctl_check_table_array(path, table);
 		}
-		if (!table->proc_handler)
+		if (!table->proc_handler && !table->ctl_fops)
 			err |= sysctl_err(path, table, "No proc_handler");
 
 		if ((table->mode & (S_IRUGO|S_IWUGO)) != table->mode)
diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index 6353d6db69b2..ca5657c9fcb2 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -116,9 +116,9 @@ struct ctl_table_poll {
 	wait_queue_head_t wait;
 };
 
-static inline void *proc_sys_poll_event(struct ctl_table_poll *poll)
+static inline unsigned long proc_sys_poll_event(struct ctl_table_poll *poll)
 {
-	return (void *)(unsigned long)atomic_read(&poll->event);
+	return (unsigned long)atomic_read(&poll->event);
 }
 
 #define __CTL_TABLE_POLL_INITIALIZER(name) {				\
@@ -128,6 +128,21 @@ static inline void *proc_sys_poll_event(struct ctl_table_poll *poll)
 #define DEFINE_CTL_TABLE_POLL(name)					\
 	struct ctl_table_poll name = __CTL_TABLE_POLL_INITIALIZER(name)
 
+struct ctl_context {
+	struct ctl_table *table;
+	unsigned long poll_event;
+	void *ctl_data;
+};
+
+struct inode;
+
+struct ctl_fops {
+	int (*open) (struct ctl_context *, struct inode *, struct file *);
+	int (*release) (struct ctl_context *, struct inode *, struct file *);
+	ssize_t (*read) (struct ctl_context *, struct file *, char *, size_t *, loff_t *);
+	ssize_t (*write) (struct ctl_context *, struct file *, char *, size_t *, loff_t *);
+};
+
 /* A sysctl table is an array of struct ctl_table: */
 struct ctl_table {
 	const char *procname;		/* Text ID for /proc/sys, or zero */
@@ -139,6 +154,7 @@ struct ctl_table {
 	struct ctl_table_poll *poll;
 	void *extra1;
 	void *extra2;
+	struct ctl_fops *ctl_fops;
 } __randomize_layout;
 
 struct ctl_node {
-- 
2.33.3


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [RFC PATCH 2/4] sysctl: ipc: Do not use dynamic memory
  2022-06-01 13:20 ` [RFC PATCH 0/4] API extension for handling sysctl Alexey Gladkov
  2022-06-01 13:20   ` [RFC PATCH 1/4] sysctl: " Alexey Gladkov
@ 2022-06-01 13:20   ` Alexey Gladkov
  2022-06-01 16:45     ` Linus Torvalds
  2022-06-01 13:20   ` [RFC PATCH 3/4] sysctl: userns: " Alexey Gladkov
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 16+ messages in thread
From: Alexey Gladkov @ 2022-06-01 13:20 UTC (permalink / raw)
  To: LKML, Eric W . Biederman, Linus Torvalds
  Cc: Andrew Morton, Christian Brauner, Iurii Zaikin, Kees Cook,
	Linux Containers, linux-fsdevel, Luis Chamberlain, Vasily Averin

Dynamic memory allocation is needed to modify .data and specify the per
namespace parameter. The new sysctl API is allowed to get rid of the
need for such modification.

Signed-off-by: Alexey Gladkov <legion@kernel.org>
---
 include/linux/ipc_namespace.h |  18 ---
 ipc/ipc_sysctl.c              | 236 +++++++++++++++++-----------------
 ipc/namespace.c               |   4 -
 3 files changed, 121 insertions(+), 137 deletions(-)

diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
index e3e8c8662b49..51c2c247c447 100644
--- a/include/linux/ipc_namespace.h
+++ b/include/linux/ipc_namespace.h
@@ -191,22 +191,4 @@ static inline bool setup_mq_sysctls(struct ipc_namespace *ns)
 }
 
 #endif /* CONFIG_POSIX_MQUEUE_SYSCTL */
-
-#ifdef CONFIG_SYSVIPC_SYSCTL
-
-bool setup_ipc_sysctls(struct ipc_namespace *ns);
-void retire_ipc_sysctls(struct ipc_namespace *ns);
-
-#else /* CONFIG_SYSVIPC_SYSCTL */
-
-static inline void retire_ipc_sysctls(struct ipc_namespace *ns)
-{
-}
-
-static inline bool setup_ipc_sysctls(struct ipc_namespace *ns)
-{
-	return true;
-}
-
-#endif /* CONFIG_SYSVIPC_SYSCTL */
 #endif
diff --git a/ipc/ipc_sysctl.c b/ipc/ipc_sysctl.c
index ef313ecfb53a..833b670c38f3 100644
--- a/ipc/ipc_sysctl.c
+++ b/ipc/ipc_sysctl.c
@@ -68,26 +68,94 @@ static int proc_ipc_sem_dointvec(struct ctl_table *table, int write,
 	return ret;
 }
 
+static inline void *data_from_ns(struct ctl_context *ctx, struct ctl_table *table);
+
+static int ipc_sys_open(struct ctl_context *ctx, struct inode *inode, struct file *file)
+{
+	struct ipc_namespace *ns = current->nsproxy->ipc_ns;
+
+	// For now, we only allow changes in init_user_ns.
+	if (ns->user_ns != &init_user_ns)
+		return -EPERM;
+
+#ifdef CONFIG_CHECKPOINT_RESTORE
+	int index = (ctx->table - ipc_sysctls);
+
+	switch (index) {
+		case IPC_SYSCTL_SEM_NEXT_ID:
+		case IPC_SYSCTL_MSG_NEXT_ID:
+		case IPC_SYSCTL_SHM_NEXT_ID:
+			if (!checkpoint_restore_ns_capable(ns->user_ns))
+				return -EPERM;
+			break;
+	}
+#endif
+	ctx->ctl_data = ns;
+	return 0;
+}
+
+static ssize_t ipc_sys_read(struct ctl_context *ctx, struct file *file,
+		     char *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct ctl_table table = *ctx->table;
+	table.data = data_from_ns(ctx, ctx->table);
+	return table.proc_handler(&table, 0, buffer, lenp, ppos);
+}
+
+static ssize_t ipc_sys_write(struct ctl_context *ctx, struct file *file,
+		      char *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct ctl_table table = *ctx->table;
+	table.data = data_from_ns(ctx, ctx->table);
+	return table.proc_handler(&table, 1, buffer, lenp, ppos);
+}
+
+static struct ctl_fops ipc_sys_fops = {
+	.open	= ipc_sys_open,
+	.read	= ipc_sys_read,
+	.write	= ipc_sys_write,
+};
+
 int ipc_mni = IPCMNI;
 int ipc_mni_shift = IPCMNI_SHIFT;
 int ipc_min_cycle = RADIX_TREE_MAP_SIZE;
 
+enum {
+	IPC_SYSCTL_SHMMAX,
+	IPC_SYSCTL_SHMALL,
+	IPC_SYSCTL_SHMMNI,
+	IPC_SYSCTL_SHM_RMID_FORCED,
+	IPC_SYSCTL_MSGMAX,
+	IPC_SYSCTL_MSGMNI,
+	IPC_SYSCTL_AUTO_MSGMNI,
+	IPC_SYSCTL_MSGMNB,
+	IPC_SYSCTL_SEM,
+#ifdef CONFIG_CHECKPOINT_RESTORE
+	IPC_SYSCTL_SEM_NEXT_ID,
+	IPC_SYSCTL_MSG_NEXT_ID,
+	IPC_SYSCTL_SHM_NEXT_ID,
+#endif
+	IPC_SYSCTL_COUNTS
+};
+
 static struct ctl_table ipc_sysctls[] = {
-	{
+	[IPC_SYSCTL_SHMMAX] = {
 		.procname	= "shmmax",
 		.data		= &init_ipc_ns.shm_ctlmax,
 		.maxlen		= sizeof(init_ipc_ns.shm_ctlmax),
 		.mode		= 0644,
-		.proc_handler	= proc_doulongvec_minmax,
+		.proc_handler   = proc_doulongvec_minmax,
+		.ctl_fops	= &ipc_sys_fops,
 	},
-	{
+	[IPC_SYSCTL_SHMALL] = {
 		.procname	= "shmall",
 		.data		= &init_ipc_ns.shm_ctlall,
 		.maxlen		= sizeof(init_ipc_ns.shm_ctlall),
 		.mode		= 0644,
-		.proc_handler	= proc_doulongvec_minmax,
+		.proc_handler   = proc_doulongvec_minmax,
+		.ctl_fops	= &ipc_sys_fops,
 	},
-	{
+	[IPC_SYSCTL_SHMMNI] = {
 		.procname	= "shmmni",
 		.data		= &init_ipc_ns.shm_ctlmni,
 		.maxlen		= sizeof(init_ipc_ns.shm_ctlmni),
@@ -95,8 +163,9 @@ static struct ctl_table ipc_sysctls[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= &ipc_mni,
+		.ctl_fops	= &ipc_sys_fops,
 	},
-	{
+	[IPC_SYSCTL_SHM_RMID_FORCED] = {
 		.procname	= "shm_rmid_forced",
 		.data		= &init_ipc_ns.shm_rmid_forced,
 		.maxlen		= sizeof(init_ipc_ns.shm_rmid_forced),
@@ -104,8 +173,9 @@ static struct ctl_table ipc_sysctls[] = {
 		.proc_handler	= proc_ipc_dointvec_minmax_orphans,
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= SYSCTL_ONE,
+		.ctl_fops	= &ipc_sys_fops,
 	},
-	{
+	[IPC_SYSCTL_MSGMAX] = {
 		.procname	= "msgmax",
 		.data		= &init_ipc_ns.msg_ctlmax,
 		.maxlen		= sizeof(init_ipc_ns.msg_ctlmax),
@@ -113,8 +183,9 @@ static struct ctl_table ipc_sysctls[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= SYSCTL_INT_MAX,
+		.ctl_fops	= &ipc_sys_fops,
 	},
-	{
+	[IPC_SYSCTL_MSGMNI] = {
 		.procname	= "msgmni",
 		.data		= &init_ipc_ns.msg_ctlmni,
 		.maxlen		= sizeof(init_ipc_ns.msg_ctlmni),
@@ -122,8 +193,9 @@ static struct ctl_table ipc_sysctls[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= &ipc_mni,
+		.ctl_fops	= &ipc_sys_fops,
 	},
-	{
+	[IPC_SYSCTL_AUTO_MSGMNI] = {
 		.procname	= "auto_msgmni",
 		.data		= NULL,
 		.maxlen		= sizeof(int),
@@ -131,8 +203,9 @@ static struct ctl_table ipc_sysctls[] = {
 		.proc_handler	= proc_ipc_auto_msgmni,
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= SYSCTL_ONE,
+		.ctl_fops	= &ipc_sys_fops,
 	},
-	{
+	[IPC_SYSCTL_MSGMNB] = {
 		.procname	=  "msgmnb",
 		.data		= &init_ipc_ns.msg_ctlmnb,
 		.maxlen		= sizeof(init_ipc_ns.msg_ctlmnb),
@@ -140,152 +213,85 @@ static struct ctl_table ipc_sysctls[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= SYSCTL_INT_MAX,
+		.ctl_fops	= &ipc_sys_fops,
 	},
-	{
+	[IPC_SYSCTL_SEM] = {
 		.procname	= "sem",
 		.data		= &init_ipc_ns.sem_ctls,
 		.maxlen		= 4*sizeof(int),
 		.mode		= 0644,
 		.proc_handler	= proc_ipc_sem_dointvec,
+		.ctl_fops	= &ipc_sys_fops,
 	},
 #ifdef CONFIG_CHECKPOINT_RESTORE
-	{
+	[IPC_SYSCTL_SEM_NEXT_ID] = {
 		.procname	= "sem_next_id",
 		.data		= &init_ipc_ns.ids[IPC_SEM_IDS].next_id,
 		.maxlen		= sizeof(init_ipc_ns.ids[IPC_SEM_IDS].next_id),
-		.mode		= 0444,
+		.mode		= 0666,
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= SYSCTL_INT_MAX,
+		.ctl_fops	= &ipc_sys_fops,
 	},
-	{
+	[IPC_SYSCTL_MSG_NEXT_ID] = {
 		.procname	= "msg_next_id",
 		.data		= &init_ipc_ns.ids[IPC_MSG_IDS].next_id,
 		.maxlen		= sizeof(init_ipc_ns.ids[IPC_MSG_IDS].next_id),
-		.mode		= 0444,
+		.mode		= 0666,
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= SYSCTL_INT_MAX,
+		.ctl_fops	= &ipc_sys_fops,
 	},
-	{
+	[IPC_SYSCTL_SHM_NEXT_ID] = {
 		.procname	= "shm_next_id",
 		.data		= &init_ipc_ns.ids[IPC_SHM_IDS].next_id,
 		.maxlen		= sizeof(init_ipc_ns.ids[IPC_SHM_IDS].next_id),
-		.mode		= 0444,
+		.mode		= 0666,
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= SYSCTL_INT_MAX,
+		.ctl_fops	= &ipc_sys_fops,
 	},
 #endif
-	{}
+	[IPC_SYSCTL_COUNTS] = {}
 };
 
-static struct ctl_table_set *set_lookup(struct ctl_table_root *root)
+static inline void *data_from_ns(struct ctl_context *ctx, struct ctl_table *table)
 {
-	return &current->nsproxy->ipc_ns->ipc_set;
-}
-
-static int set_is_seen(struct ctl_table_set *set)
-{
-	return &current->nsproxy->ipc_ns->ipc_set == set;
-}
-
-static int ipc_permissions(struct ctl_table_header *head, struct ctl_table *table)
-{
-	int mode = table->mode;
-
+	struct ipc_namespace *ns = ctx->ctl_data;
+
+	switch (ctx->table - ipc_sysctls) {
+		case IPC_SYSCTL_SHMMAX:			return &ns->shm_ctlmax;
+		case IPC_SYSCTL_SHMALL:			return &ns->shm_ctlall;
+		case IPC_SYSCTL_SHMMNI:			return &ns->shm_ctlmni;
+		case IPC_SYSCTL_SHM_RMID_FORCED:	return &ns->shm_rmid_forced;
+		case IPC_SYSCTL_MSGMAX:			return &ns->msg_ctlmax;
+		case IPC_SYSCTL_MSGMNI:			return &ns->msg_ctlmni;
+		case IPC_SYSCTL_MSGMNB:			return &ns->msg_ctlmnb;
+		case IPC_SYSCTL_SEM:			return &ns->sem_ctls;
 #ifdef CONFIG_CHECKPOINT_RESTORE
-	struct ipc_namespace *ns = current->nsproxy->ipc_ns;
-
-	if (((table->data == &ns->ids[IPC_SEM_IDS].next_id) ||
-	     (table->data == &ns->ids[IPC_MSG_IDS].next_id) ||
-	     (table->data == &ns->ids[IPC_SHM_IDS].next_id)) &&
-	    checkpoint_restore_ns_capable(ns->user_ns))
-		mode = 0666;
+		case IPC_SYSCTL_SEM_NEXT_ID:		return &ns->ids[IPC_SEM_IDS].next_id;
+		case IPC_SYSCTL_MSG_NEXT_ID:		return &ns->ids[IPC_MSG_IDS].next_id;
+		case IPC_SYSCTL_SHM_NEXT_ID:		return &ns->ids[IPC_SHM_IDS].next_id;
 #endif
-	return mode;
-}
-
-static struct ctl_table_root set_root = {
-	.lookup = set_lookup,
-	.permissions = ipc_permissions,
-};
-
-bool setup_ipc_sysctls(struct ipc_namespace *ns)
-{
-	struct ctl_table *tbl;
-
-	setup_sysctl_set(&ns->ipc_set, &set_root, set_is_seen);
-
-	tbl = kmemdup(ipc_sysctls, sizeof(ipc_sysctls), GFP_KERNEL);
-	if (tbl) {
-		int i;
-
-		for (i = 0; i < ARRAY_SIZE(ipc_sysctls); i++) {
-			if (tbl[i].data == &init_ipc_ns.shm_ctlmax)
-				tbl[i].data = &ns->shm_ctlmax;
-
-			else if (tbl[i].data == &init_ipc_ns.shm_ctlall)
-				tbl[i].data = &ns->shm_ctlall;
-
-			else if (tbl[i].data == &init_ipc_ns.shm_ctlmni)
-				tbl[i].data = &ns->shm_ctlmni;
-
-			else if (tbl[i].data == &init_ipc_ns.shm_rmid_forced)
-				tbl[i].data = &ns->shm_rmid_forced;
-
-			else if (tbl[i].data == &init_ipc_ns.msg_ctlmax)
-				tbl[i].data = &ns->msg_ctlmax;
-
-			else if (tbl[i].data == &init_ipc_ns.msg_ctlmni)
-				tbl[i].data = &ns->msg_ctlmni;
-
-			else if (tbl[i].data == &init_ipc_ns.msg_ctlmnb)
-				tbl[i].data = &ns->msg_ctlmnb;
-
-			else if (tbl[i].data == &init_ipc_ns.sem_ctls)
-				tbl[i].data = &ns->sem_ctls;
-#ifdef CONFIG_CHECKPOINT_RESTORE
-			else if (tbl[i].data == &init_ipc_ns.ids[IPC_SEM_IDS].next_id)
-				tbl[i].data = &ns->ids[IPC_SEM_IDS].next_id;
-
-			else if (tbl[i].data == &init_ipc_ns.ids[IPC_MSG_IDS].next_id)
-				tbl[i].data = &ns->ids[IPC_MSG_IDS].next_id;
-
-			else if (tbl[i].data == &init_ipc_ns.ids[IPC_SHM_IDS].next_id)
-				tbl[i].data = &ns->ids[IPC_SHM_IDS].next_id;
-#endif
-			else
-				tbl[i].data = NULL;
-		}
-
-		ns->ipc_sysctls = __register_sysctl_table(&ns->ipc_set, "kernel", tbl);
-	}
-	if (!ns->ipc_sysctls) {
-		kfree(tbl);
-		retire_sysctl_set(&ns->ipc_set);
-		return false;
 	}
-
-	return true;
+	return NULL;
 }
 
-void retire_ipc_sysctls(struct ipc_namespace *ns)
-{
-	struct ctl_table *tbl;
-
-	tbl = ns->ipc_sysctls->ctl_table_arg;
-	unregister_sysctl_table(ns->ipc_sysctls);
-	retire_sysctl_set(&ns->ipc_set);
-	kfree(tbl);
-}
+static struct ctl_table ipc_root_table[] = {
+	{
+		.procname       = "kernel",
+		.mode           = 0555,
+		.child          = ipc_sysctls,
+	},
+	{}
+};
 
 static int __init ipc_sysctl_init(void)
 {
-	if (!setup_ipc_sysctls(&init_ipc_ns)) {
-		pr_warn("ipc sysctl registration failed\n");
-		return -ENOMEM;
-	}
+	register_sysctl_table(ipc_root_table);
 	return 0;
 }
 
diff --git a/ipc/namespace.c b/ipc/namespace.c
index 754f3237194a..f760243ca685 100644
--- a/ipc/namespace.c
+++ b/ipc/namespace.c
@@ -63,9 +63,6 @@ static struct ipc_namespace *create_ipc_ns(struct user_namespace *user_ns,
 	if (!setup_mq_sysctls(ns))
 		goto fail_put;
 
-	if (!setup_ipc_sysctls(ns))
-		goto fail_put;
-
 	sem_init_ns(ns);
 	msg_init_ns(ns);
 	shm_init_ns(ns);
@@ -133,7 +130,6 @@ static void free_ipc_ns(struct ipc_namespace *ns)
 	shm_exit_ns(ns);
 
 	retire_mq_sysctls(ns);
-	retire_ipc_sysctls(ns);
 
 	dec_ipc_namespaces(ns->ucounts);
 	put_user_ns(ns->user_ns);
-- 
2.33.3


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [RFC PATCH 3/4] sysctl: userns: Do not use dynamic memory
  2022-06-01 13:20 ` [RFC PATCH 0/4] API extension for handling sysctl Alexey Gladkov
  2022-06-01 13:20   ` [RFC PATCH 1/4] sysctl: " Alexey Gladkov
  2022-06-01 13:20   ` [RFC PATCH 2/4] sysctl: ipc: Do not use dynamic memory Alexey Gladkov
@ 2022-06-01 13:20   ` Alexey Gladkov
  2022-06-01 13:20   ` [RFC PATCH 4/4] sysctl: mqueue: " Alexey Gladkov
  2022-06-09 16:45   ` [RFC PATCH 0/4] API extension for handling sysctl Luis Chamberlain
  4 siblings, 0 replies; 16+ messages in thread
From: Alexey Gladkov @ 2022-06-01 13:20 UTC (permalink / raw)
  To: LKML, Eric W . Biederman, Linus Torvalds
  Cc: Andrew Morton, Christian Brauner, Iurii Zaikin, Kees Cook,
	Linux Containers, linux-fsdevel, Luis Chamberlain, Vasily Averin

Dynamic memory allocation is needed to modify .data and specify the
per namespace parameter. The new sysctl API is allowed to get rid of
the need for such modification.

Signed-off-by: Alexey Gladkov <legion@kernel.org>
---
 include/linux/user_namespace.h |   6 --
 kernel/ucount.c                | 116 +++++++++++++--------------------
 kernel/user_namespace.c        |  10 +--
 3 files changed, 46 insertions(+), 86 deletions(-)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 45f09bec02c4..7b134516e5cb 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -95,10 +95,6 @@ struct user_namespace {
 	struct key		*persistent_keyring_register;
 #endif
 	struct work_struct	work;
-#ifdef CONFIG_SYSCTL
-	struct ctl_table_set	set;
-	struct ctl_table_header *sysctls;
-#endif
 	struct ucounts		*ucounts;
 	long ucount_max[UCOUNT_COUNTS];
 	long rlimit_max[UCOUNT_RLIMIT_COUNTS];
@@ -116,8 +112,6 @@ struct ucounts {
 extern struct user_namespace init_user_ns;
 extern struct ucounts init_ucounts;
 
-bool setup_userns_sysctls(struct user_namespace *ns);
-void retire_userns_sysctls(struct user_namespace *ns);
 struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid, enum ucount_type type);
 void dec_ucount(struct ucounts *ucounts, enum ucount_type type);
 struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid);
diff --git a/kernel/ucount.c b/kernel/ucount.c
index ee8e57fd6f90..4a5072671847 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -7,6 +7,7 @@
 #include <linux/hash.h>
 #include <linux/kmemleak.h>
 #include <linux/user_namespace.h>
+#include <linux/fs.h>
 
 struct ucounts init_ucounts = {
 	.ns    = &init_user_ns,
@@ -26,38 +27,20 @@ static DEFINE_SPINLOCK(ucounts_lock);
 
 
 #ifdef CONFIG_SYSCTL
-static struct ctl_table_set *
-set_lookup(struct ctl_table_root *root)
-{
-	return &current_user_ns()->set;
-}
-
-static int set_is_seen(struct ctl_table_set *set)
-{
-	return &current_user_ns()->set == set;
-}
-
-static int set_permissions(struct ctl_table_header *head,
-				  struct ctl_table *table)
-{
-	struct user_namespace *user_ns =
-		container_of(head->set, struct user_namespace, set);
-	int mode;
-
-	/* Allow users with CAP_SYS_RESOURCE unrestrained access */
-	if (ns_capable(user_ns, CAP_SYS_RESOURCE))
-		mode = (table->mode & S_IRWXU) >> 6;
-	else
-	/* Allow all others at most read-only access */
-		mode = table->mode & S_IROTH;
-	return (mode << 6) | (mode << 3) | mode;
-}
-
-static struct ctl_table_root set_root = {
-	.lookup = set_lookup,
-	.permissions = set_permissions,
+static int user_sys_open(struct ctl_context *ctx, struct inode *inode,
+		        struct file *file);
+static ssize_t user_sys_read(struct ctl_context *ctx, struct file *file,
+			    char *buffer, size_t *lenp, loff_t *ppos);
+static ssize_t user_sys_write(struct ctl_context *ctx, struct file *file,
+			     char *buffer, size_t *lenp, loff_t *ppos);
+
+static struct ctl_fops user_sys_fops = {
+	.open	= user_sys_open,
+	.read	= user_sys_read,
+	.write	= user_sys_write,
 };
 
+static long ue_dummy = 0;
 static long ue_zero = 0;
 static long ue_int_max = INT_MAX;
 
@@ -66,9 +49,11 @@ static long ue_int_max = INT_MAX;
 		.procname	= name,				\
 		.maxlen		= sizeof(long),			\
 		.mode		= 0644,				\
+		.data		= &ue_dummy,			\
 		.proc_handler	= proc_doulongvec_minmax,	\
 		.extra1		= &ue_zero,			\
 		.extra2		= &ue_int_max,			\
+		.ctl_fops	= &user_sys_fops,		\
 	}
 static struct ctl_table user_table[] = {
 	UCOUNT_ENTRY("max_user_namespaces"),
@@ -89,44 +74,43 @@ static struct ctl_table user_table[] = {
 #endif
 	{ }
 };
-#endif /* CONFIG_SYSCTL */
 
-bool setup_userns_sysctls(struct user_namespace *ns)
+static int user_sys_open(struct ctl_context *ctx, struct inode *inode, struct file *file)
 {
-#ifdef CONFIG_SYSCTL
-	struct ctl_table *tbl;
-
-	BUILD_BUG_ON(ARRAY_SIZE(user_table) != UCOUNT_COUNTS + 1);
-	setup_sysctl_set(&ns->set, &set_root, set_is_seen);
-	tbl = kmemdup(user_table, sizeof(user_table), GFP_KERNEL);
-	if (tbl) {
-		int i;
-		for (i = 0; i < UCOUNT_COUNTS; i++) {
-			tbl[i].data = &ns->ucount_max[i];
-		}
-		ns->sysctls = __register_sysctl_table(&ns->set, "user", tbl);
-	}
-	if (!ns->sysctls) {
-		kfree(tbl);
-		retire_sysctl_set(&ns->set);
-		return false;
-	}
-#endif
-	return true;
+	/* Allow users with CAP_SYS_RESOURCE unrestrained access */
+	if ((file->f_mode & FMODE_WRITE) &&
+	    !ns_capable(file->f_cred->user_ns, CAP_SYS_RESOURCE))
+		return -EPERM;
+	return 0;
 }
 
-void retire_userns_sysctls(struct user_namespace *ns)
+static ssize_t user_sys_read(struct ctl_context *ctx, struct file *file,
+		     char *buffer, size_t *lenp, loff_t *ppos)
 {
-#ifdef CONFIG_SYSCTL
-	struct ctl_table *tbl;
+	struct ctl_table table = *ctx->table;
+	table.data = &file->f_cred->user_ns->ucount_max[ctx->table - user_table];
+	return table.proc_handler(&table, 0, buffer, lenp, ppos);
+}
 
-	tbl = ns->sysctls->ctl_table_arg;
-	unregister_sysctl_table(ns->sysctls);
-	retire_sysctl_set(&ns->set);
-	kfree(tbl);
-#endif
+static ssize_t user_sys_write(struct ctl_context *ctx, struct file *file,
+		      char *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct ctl_table table = *ctx->table;
+	table.data = &file->f_cred->user_ns->ucount_max[ctx->table - user_table];
+	return table.proc_handler(&table, 1, buffer, lenp, ppos);
 }
 
+static struct ctl_table user_root_table[] = {
+	{
+		.procname       = "user",
+		.mode           = 0555,
+		.child          = user_table,
+	},
+	{}
+};
+
+#endif /* CONFIG_SYSCTL */
+
 static struct ucounts *find_ucounts(struct user_namespace *ns, kuid_t uid, struct hlist_head *hashent)
 {
 	struct ucounts *ucounts;
@@ -357,17 +341,7 @@ bool is_rlimit_overlimit(struct ucounts *ucounts, enum rlimit_type type, unsigne
 static __init int user_namespace_sysctl_init(void)
 {
 #ifdef CONFIG_SYSCTL
-	static struct ctl_table_header *user_header;
-	static struct ctl_table empty[1];
-	/*
-	 * It is necessary to register the user directory in the
-	 * default set so that registrations in the child sets work
-	 * properly.
-	 */
-	user_header = register_sysctl("user", empty);
-	kmemleak_ignore(user_header);
-	BUG_ON(!user_header);
-	BUG_ON(!setup_userns_sysctls(&init_user_ns));
+	register_sysctl_table(user_root_table);
 #endif
 	hlist_add_ucounts(&init_ucounts);
 	inc_rlimit_ucounts(&init_ucounts, UCOUNT_RLIMIT_NPROC, 1);
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 981bb2d10d83..c0e707bc9a31 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -149,17 +149,10 @@ int create_user_ns(struct cred *new)
 	INIT_LIST_HEAD(&ns->keyring_name_list);
 	init_rwsem(&ns->keyring_sem);
 #endif
-	ret = -ENOMEM;
-	if (!setup_userns_sysctls(ns))
-		goto fail_keyring;
 
 	set_cred_user_ns(new, ns);
 	return 0;
-fail_keyring:
-#ifdef CONFIG_PERSISTENT_KEYRINGS
-	key_put(ns->persistent_keyring_register);
-#endif
-	ns_free_inum(&ns->ns);
+
 fail_free:
 	kmem_cache_free(user_ns_cachep, ns);
 fail_dec:
@@ -208,7 +201,6 @@ static void free_user_ns(struct work_struct *work)
 			kfree(ns->projid_map.forward);
 			kfree(ns->projid_map.reverse);
 		}
-		retire_userns_sysctls(ns);
 		key_free_user_ns(ns);
 		ns_free_inum(&ns->ns);
 		kmem_cache_free(user_ns_cachep, ns);
-- 
2.33.3


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [RFC PATCH 4/4] sysctl: mqueue: Do not use dynamic memory
  2022-06-01 13:20 ` [RFC PATCH 0/4] API extension for handling sysctl Alexey Gladkov
                     ` (2 preceding siblings ...)
  2022-06-01 13:20   ` [RFC PATCH 3/4] sysctl: userns: " Alexey Gladkov
@ 2022-06-01 13:20   ` Alexey Gladkov
  2022-06-09 16:45   ` [RFC PATCH 0/4] API extension for handling sysctl Luis Chamberlain
  4 siblings, 0 replies; 16+ messages in thread
From: Alexey Gladkov @ 2022-06-01 13:20 UTC (permalink / raw)
  To: LKML, Eric W . Biederman, Linus Torvalds
  Cc: Andrew Morton, Christian Brauner, Iurii Zaikin, Kees Cook,
	Linux Containers, linux-fsdevel, Luis Chamberlain, Vasily Averin

Dynamic memory allocation is needed to modify .data and specify the
per namespace parameter. The new sysctl API is allowed to get rid of
the need for such modification.

Signed-off-by: Alexey Gladkov <legion@kernel.org>
---
 include/linux/ipc_namespace.h |  17 -----
 ipc/mq_sysctl.c               | 138 +++++++++++++++++++---------------
 ipc/mqueue.c                  |   5 --
 ipc/namespace.c               |   6 --
 4 files changed, 79 insertions(+), 87 deletions(-)

diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
index 51c2c247c447..d20753093a2c 100644
--- a/include/linux/ipc_namespace.h
+++ b/include/linux/ipc_namespace.h
@@ -174,21 +174,4 @@ static inline void put_ipc_ns(struct ipc_namespace *ns)
 }
 #endif
 
-#ifdef CONFIG_POSIX_MQUEUE_SYSCTL
-
-void retire_mq_sysctls(struct ipc_namespace *ns);
-bool setup_mq_sysctls(struct ipc_namespace *ns);
-
-#else /* CONFIG_POSIX_MQUEUE_SYSCTL */
-
-static inline void retire_mq_sysctls(struct ipc_namespace *ns)
-{
-}
-
-static inline bool setup_mq_sysctls(struct ipc_namespace *ns)
-{
-	return true;
-}
-
-#endif /* CONFIG_POSIX_MQUEUE_SYSCTL */
 #endif
diff --git a/ipc/mq_sysctl.c b/ipc/mq_sysctl.c
index fbf6a8b93a26..08ff7dfb721c 100644
--- a/ipc/mq_sysctl.c
+++ b/ipc/mq_sysctl.c
@@ -13,6 +13,45 @@
 #include <linux/capability.h>
 #include <linux/slab.h>
 
+static inline void *data_from_ns(struct ctl_context *ctx, struct ctl_table *table);
+
+static int mq_sys_open(struct ctl_context *ctx, struct inode *inode, struct file *file)
+{
+	ctx->ctl_data = current->nsproxy->ipc_ns;
+	return 0;
+}
+
+static ssize_t mq_sys_read(struct ctl_context *ctx, struct file *file,
+		     char *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct ctl_table table = *ctx->table;
+	table.data = data_from_ns(ctx, ctx->table);
+	return table.proc_handler(&table, 0, buffer, lenp, ppos);
+}
+
+static ssize_t mq_sys_write(struct ctl_context *ctx, struct file *file,
+		      char *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct ctl_table table = *ctx->table;
+	table.data = data_from_ns(ctx, ctx->table);
+	return table.proc_handler(&table, 1, buffer, lenp, ppos);
+}
+
+static struct ctl_fops mq_sys_fops = {
+	.open	= mq_sys_open,
+	.read	= mq_sys_read,
+	.write	= mq_sys_write,
+};
+
+enum {
+	MQ_SYSCTL_QUEUES_MAX,
+	MQ_SYSCTL_MSG_MAX,
+	MQ_SYSCTL_MSGSIZE_MAX,
+	MQ_SYSCTL_MSG_DEFAULT,
+	MQ_SYSCTL_MSGSIZE_DEFAULT,
+	MQ_SYSCTL_COUNTS
+};
+
 static int msg_max_limit_min = MIN_MSGMAX;
 static int msg_max_limit_max = HARD_MSGMAX;
 
@@ -20,14 +59,15 @@ static int msg_maxsize_limit_min = MIN_MSGSIZEMAX;
 static int msg_maxsize_limit_max = HARD_MSGSIZEMAX;
 
 static struct ctl_table mq_sysctls[] = {
-	{
+	[MQ_SYSCTL_QUEUES_MAX] = {
 		.procname	= "queues_max",
 		.data		= &init_ipc_ns.mq_queues_max,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
+		.ctl_fops	= &mq_sys_fops,
 	},
-	{
+	[MQ_SYSCTL_MSG_MAX] = {
 		.procname	= "msg_max",
 		.data		= &init_ipc_ns.mq_msg_max,
 		.maxlen		= sizeof(int),
@@ -35,8 +75,9 @@ static struct ctl_table mq_sysctls[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= &msg_max_limit_min,
 		.extra2		= &msg_max_limit_max,
+		.ctl_fops	= &mq_sys_fops,
 	},
-	{
+	[MQ_SYSCTL_MSGSIZE_MAX] = {
 		.procname	= "msgsize_max",
 		.data		= &init_ipc_ns.mq_msgsize_max,
 		.maxlen		= sizeof(int),
@@ -44,8 +85,9 @@ static struct ctl_table mq_sysctls[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= &msg_maxsize_limit_min,
 		.extra2		= &msg_maxsize_limit_max,
+		.ctl_fops	= &mq_sys_fops,
 	},
-	{
+	[MQ_SYSCTL_MSG_DEFAULT] = {
 		.procname	= "msg_default",
 		.data		= &init_ipc_ns.mq_msg_default,
 		.maxlen		= sizeof(int),
@@ -53,8 +95,9 @@ static struct ctl_table mq_sysctls[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= &msg_max_limit_min,
 		.extra2		= &msg_max_limit_max,
+		.ctl_fops	= &mq_sys_fops,
 	},
-	{
+	[MQ_SYSCTL_MSGSIZE_DEFAULT] = {
 		.procname	= "msgsize_default",
 		.data		= &init_ipc_ns.mq_msgsize_default,
 		.maxlen		= sizeof(int),
@@ -62,70 +105,47 @@ static struct ctl_table mq_sysctls[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= &msg_maxsize_limit_min,
 		.extra2		= &msg_maxsize_limit_max,
+		.ctl_fops	= &mq_sys_fops,
 	},
 	{}
 };
 
-static struct ctl_table_set *set_lookup(struct ctl_table_root *root)
+static inline void *data_from_ns(struct ctl_context *ctx, struct ctl_table *table)
 {
-	return &current->nsproxy->ipc_ns->mq_set;
+	struct ipc_namespace *ns = ctx->ctl_data;
+
+	switch (ctx->table - mq_sysctls) {
+		case MQ_SYSCTL_QUEUES_MAX:      return &ns->mq_queues_max;
+		case MQ_SYSCTL_MSG_MAX:         return &ns->mq_msg_max;
+		case MQ_SYSCTL_MSGSIZE_MAX:     return &ns->mq_msgsize_max;
+		case MQ_SYSCTL_MSG_DEFAULT:     return &ns->mq_msg_default;
+		case MQ_SYSCTL_MSGSIZE_DEFAULT: return &ns->mq_msgsize_default;
+	}
+	return NULL;
 }
 
-static int set_is_seen(struct ctl_table_set *set)
-{
-	return &current->nsproxy->ipc_ns->mq_set == set;
-}
+static struct ctl_table mq_sysctl_dir[] = {
+	{
+		.procname       = "mqueue",
+		.mode           = 0555,
+		.child          = mq_sysctls,
+	},
+	{}
+};
 
-static struct ctl_table_root set_root = {
-	.lookup = set_lookup,
+static struct ctl_table mq_sysctl_root[] = {
+	{
+		.procname       = "fs",
+		.mode           = 0555,
+		.child          = mq_sysctl_dir,
+	},
+	{}
 };
 
-bool setup_mq_sysctls(struct ipc_namespace *ns)
+static int __init mq_sysctl_init(void)
 {
-	struct ctl_table *tbl;
-
-	setup_sysctl_set(&ns->mq_set, &set_root, set_is_seen);
-
-	tbl = kmemdup(mq_sysctls, sizeof(mq_sysctls), GFP_KERNEL);
-	if (tbl) {
-		int i;
-
-		for (i = 0; i < ARRAY_SIZE(mq_sysctls); i++) {
-			if (tbl[i].data == &init_ipc_ns.mq_queues_max)
-				tbl[i].data = &ns->mq_queues_max;
-
-			else if (tbl[i].data == &init_ipc_ns.mq_msg_max)
-				tbl[i].data = &ns->mq_msg_max;
-
-			else if (tbl[i].data == &init_ipc_ns.mq_msgsize_max)
-				tbl[i].data = &ns->mq_msgsize_max;
-
-			else if (tbl[i].data == &init_ipc_ns.mq_msg_default)
-				tbl[i].data = &ns->mq_msg_default;
-
-			else if (tbl[i].data == &init_ipc_ns.mq_msgsize_default)
-				tbl[i].data = &ns->mq_msgsize_default;
-			else
-				tbl[i].data = NULL;
-		}
-
-		ns->mq_sysctls = __register_sysctl_table(&ns->mq_set, "fs/mqueue", tbl);
-	}
-	if (!ns->mq_sysctls) {
-		kfree(tbl);
-		retire_sysctl_set(&ns->mq_set);
-		return false;
-	}
-
-	return true;
+	register_sysctl_table(mq_sysctl_root);
+	return 0;
 }
 
-void retire_mq_sysctls(struct ipc_namespace *ns)
-{
-	struct ctl_table *tbl;
-
-	tbl = ns->mq_sysctls->ctl_table_arg;
-	unregister_sysctl_table(ns->mq_sysctls);
-	retire_sysctl_set(&ns->mq_set);
-	kfree(tbl);
-}
+device_initcall(mq_sysctl_init);
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index c0f24cc9f619..ffb79a24d70b 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -1711,11 +1711,6 @@ static int __init init_mqueue_fs(void)
 	if (mqueue_inode_cachep == NULL)
 		return -ENOMEM;
 
-	if (!setup_mq_sysctls(&init_ipc_ns)) {
-		pr_warn("sysctl registration failed\n");
-		return -ENOMEM;
-	}
-
 	error = register_filesystem(&mqueue_fs_type);
 	if (error)
 		goto out_sysctl;
diff --git a/ipc/namespace.c b/ipc/namespace.c
index f760243ca685..ae83f0f2651b 100644
--- a/ipc/namespace.c
+++ b/ipc/namespace.c
@@ -59,10 +59,6 @@ static struct ipc_namespace *create_ipc_ns(struct user_namespace *user_ns,
 	if (err)
 		goto fail_put;
 
-	err = -ENOMEM;
-	if (!setup_mq_sysctls(ns))
-		goto fail_put;
-
 	sem_init_ns(ns);
 	msg_init_ns(ns);
 	shm_init_ns(ns);
@@ -129,8 +125,6 @@ static void free_ipc_ns(struct ipc_namespace *ns)
 	msg_exit_ns(ns);
 	shm_exit_ns(ns);
 
-	retire_mq_sysctls(ns);
-
 	dec_ipc_namespaces(ns->ucounts);
 	put_user_ns(ns->user_ns);
 	ns_free_inum(&ns->ns);
-- 
2.33.3


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 2/4] sysctl: ipc: Do not use dynamic memory
  2022-06-01 13:20   ` [RFC PATCH 2/4] sysctl: ipc: Do not use dynamic memory Alexey Gladkov
@ 2022-06-01 16:45     ` Linus Torvalds
  2022-06-01 18:24       ` Alexey Gladkov
  0 siblings, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2022-06-01 16:45 UTC (permalink / raw)
  To: Alexey Gladkov
  Cc: LKML, Eric W . Biederman, Andrew Morton, Christian Brauner,
	Iurii Zaikin, Kees Cook, Linux Containers, linux-fsdevel,
	Luis Chamberlain, Vasily Averin

On Wed, Jun 1, 2022 at 6:20 AM Alexey Gladkov <legion@kernel.org> wrote:
>
> Dynamic memory allocation is needed to modify .data and specify the per
> namespace parameter. The new sysctl API is allowed to get rid of the
> need for such modification.

Ok, this is looking better. That said, a few comments:

>
> diff --git a/ipc/ipc_sysctl.c b/ipc/ipc_sysctl.c
> index ef313ecfb53a..833b670c38f3 100644
> --- a/ipc/ipc_sysctl.c
> +++ b/ipc/ipc_sysctl.c
> @@ -68,26 +68,94 @@ static int proc_ipc_sem_dointvec(struct ctl_table *table, int write,
>         return ret;
>  }
>
> +static inline void *data_from_ns(struct ctl_context *ctx, struct ctl_table *table);
> +
> +static int ipc_sys_open(struct ctl_context *ctx, struct inode *inode, struct file *file)
> +{
> +       struct ipc_namespace *ns = current->nsproxy->ipc_ns;
> +
> +       // For now, we only allow changes in init_user_ns.
> +       if (ns->user_ns != &init_user_ns)
> +               return -EPERM;
> +
> +#ifdef CONFIG_CHECKPOINT_RESTORE
> +       int index = (ctx->table - ipc_sysctls);
> +
> +       switch (index) {
> +               case IPC_SYSCTL_SEM_NEXT_ID:
> +               case IPC_SYSCTL_MSG_NEXT_ID:
[...]

I don't think you actually even compile-tested this, because you're
using these IPC_SYSCTL_SEM_NEXT_ID etc enums before you even declared
them later in the same file.

> +static ssize_t ipc_sys_read(struct ctl_context *ctx, struct file *file,
> +                    char *buffer, size_t *lenp, loff_t *ppos)
> +{
> +       struct ctl_table table = *ctx->table;
> +       table.data = data_from_ns(ctx, ctx->table);
> +       return table.proc_handler(&table, 0, buffer, lenp, ppos);
> +}

Can we please fix the names and the types of this new 'ctx' structure?

Yes, yes, I know the old legacy "sysctl table" is horribly named, and
uses "ctl_table".

But let's just write it out. It's not some random control table for
anything. It's a very odd and specific thing: "sysctl". Let's use the
full name.

Also, Please just make that "ctl_data" member in that "ctl_context"
struct not just have a real name, but a real type. Make it literally
be

    struct ipc_namespace *ipc_ns;

and if we end up with other things wanting other pointers, just add a
new one (or make a union if we care about the size of that allocation,
which I don't see any reason we'd do when it's literally just like a
couple of pointers in size).

There is no reason to have some pseudo-generic "void *ctl_data" that
makes it ambiguous and allows for type confusion and isn't
self-documenting. I'd rather have a properly typed pointer that is
just initialized to NULL and is not always used or needed, but always
has a clear case for *what* it would be used for.

Yes, yes, we have f_private etc for things that are really very very
generic and have arbitrary users. But 'sysctl' is not that kind of
truly generic use.

I wish we didn't have that silly "create a temporary ctl_table entry"
either, and I wish it was properly named. But it's not worth the
pointless churn to fix old bad interfaces. But the new ones should
have better names, and try to avoid those bad old decisions.

But yeah, I think this all is a step in the right direction. And maybe
some of those cases and old 'ctl_table' things can be migrated to just
using individual read() functions entirely. The whole 'ctl_table'
model was broken, and came from the bad old days with an actual
'sysctl()' system call.

Because I think it would be lovely if people would move away from the
'sysctl table' approach entirely for cases where that makes sense, and
these guys that already need special handling are very much in that
situation.

          Linus

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 2/4] sysctl: ipc: Do not use dynamic memory
  2022-06-01 16:45     ` Linus Torvalds
@ 2022-06-01 18:24       ` Alexey Gladkov
  2022-06-01 18:34         ` Linus Torvalds
  0 siblings, 1 reply; 16+ messages in thread
From: Alexey Gladkov @ 2022-06-01 18:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: LKML, Eric W . Biederman, Andrew Morton, Christian Brauner,
	Iurii Zaikin, Kees Cook, Linux Containers, linux-fsdevel,
	Luis Chamberlain, Vasily Averin

On Wed, Jun 01, 2022 at 09:45:15AM -0700, Linus Torvalds wrote:
> On Wed, Jun 1, 2022 at 6:20 AM Alexey Gladkov <legion@kernel.org> wrote:
> >
> > Dynamic memory allocation is needed to modify .data and specify the per
> > namespace parameter. The new sysctl API is allowed to get rid of the
> > need for such modification.
> 
> Ok, this is looking better. That said, a few comments:
> 
> >
> > diff --git a/ipc/ipc_sysctl.c b/ipc/ipc_sysctl.c
> > index ef313ecfb53a..833b670c38f3 100644
> > --- a/ipc/ipc_sysctl.c
> > +++ b/ipc/ipc_sysctl.c
> > @@ -68,26 +68,94 @@ static int proc_ipc_sem_dointvec(struct ctl_table *table, int write,
> >         return ret;
> >  }
> >
> > +static inline void *data_from_ns(struct ctl_context *ctx, struct ctl_table *table);
> > +
> > +static int ipc_sys_open(struct ctl_context *ctx, struct inode *inode, struct file *file)
> > +{
> > +       struct ipc_namespace *ns = current->nsproxy->ipc_ns;
> > +
> > +       // For now, we only allow changes in init_user_ns.
> > +       if (ns->user_ns != &init_user_ns)
> > +               return -EPERM;
> > +
> > +#ifdef CONFIG_CHECKPOINT_RESTORE
> > +       int index = (ctx->table - ipc_sysctls);
> > +
> > +       switch (index) {
> > +               case IPC_SYSCTL_SEM_NEXT_ID:
> > +               case IPC_SYSCTL_MSG_NEXT_ID

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/bpf/cgroup.c#n1392:
> [...]
> 
> I don't think you actually even compile-tested this, because you're
> using these IPC_SYSCTL_SEM_NEXT_ID etc enums before you even declared
> them later in the same file.

I did it but without CONFIG_CHECKPOINT_RESTORE.

This is where I'm not sure who can write to ipc sysctls inside
ipc_namespace.

> > +static ssize_t ipc_sys_read(struct ctl_context *ctx, struct file *file,
> > +                    char *buffer, size_t *lenp, loff_t *ppos)
> > +{
> > +       struct ctl_table table = *ctx->table;
> > +       table.data = data_from_ns(ctx, ctx->table);
> > +       return table.proc_handler(&table, 0, buffer, lenp, ppos);
> > +}
> 
> Can we please fix the names and the types of this new 'ctx' structure?
> 
> Yes, yes, I know the old legacy "sysctl table" is horribly named, and
> uses "ctl_table".

Sure.

> But let's just write it out. It's not some random control table for
> anything. It's a very odd and specific thing: "sysctl". Let's use the
> full name.
> 
> Also, Please just make that "ctl_data" member in that "ctl_context"
> struct not just have a real name, but a real type. Make it literally
> be
> 
>     struct ipc_namespace *ipc_ns;
> 
> and if we end up with other things wanting other pointers, just add a
> new one (or make a union if we care about the size of that allocation,
> which I don't see any reason we'd do when it's literally just like a
> couple of pointers in size).
> 
> There is no reason to have some pseudo-generic "void *ctl_data" that
> makes it ambiguous and allows for type confusion and isn't
> self-documenting. I'd rather have a properly typed pointer that is
> just initialized to NULL and is not always used or needed, but always
> has a clear case for *what* it would be used for.
> 
> Yes, yes, we have f_private etc for things that are really very very
> generic and have arbitrary users. But 'sysctl' is not that kind of
> truly generic use.

Yep. I made ctl_data in the same way as f_private. My idea is that if
someone needs to store more than one pointer, they can put a struct there.
But it turned out that at least now, apart from ipc_namespace, nothing is
needed.

> I wish we didn't have that silly "create a temporary ctl_table entry"
> either, and I wish it was properly named. But it's not worth the
> pointless churn to fix old bad interfaces. But the new ones should
> have better names, and try to avoid those bad old decisions.

Currently temporary ctl_table is the main strategy for handling sysctl
entries.

Perhaps it will be possible to get rid of this if we add another
get_data() that would return what is currently placed in .data in
ctl_table. I mean make getting .data dynamic.

> But yeah, I think this all is a step in the right direction. And maybe
> some of those cases and old 'ctl_table' things can be migrated to just
> using individual read() functions entirely. The whole 'ctl_table'
> model was broken, and came from the bad old days with an actual
> 'sysctl()' system call.

I'm not sure how to get rid of ctl_table since net sysctls are heavily
dependent on it.

I was wondering if it's possible to get rid of ctl_table but if it's not
possible to rewrite everything to some kind of new magic API, then keeping
two of them would be a nightmare.

Another problem is that ctl_table is being used by __cgroup_bpf_run_filter_sysctl.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/bpf/cgroup.c#n1392

> Because I think it would be lovely if people would move away from the
> 'sysctl table' approach entirely for cases where that makes sense, and
> these guys that already need special handling are very much in that
> situation.

Since you think that these patches are a step in the right direction, then
I will prepare the first version with your comments in mind.

-- 
Rgrds, legion


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 2/4] sysctl: ipc: Do not use dynamic memory
  2022-06-01 18:24       ` Alexey Gladkov
@ 2022-06-01 18:34         ` Linus Torvalds
  2022-06-01 19:05           ` Alexey Gladkov
  2022-06-09 18:51           ` Luis Chamberlain
  0 siblings, 2 replies; 16+ messages in thread
From: Linus Torvalds @ 2022-06-01 18:34 UTC (permalink / raw)
  To: Alexey Gladkov
  Cc: LKML, Eric W . Biederman, Andrew Morton, Christian Brauner,
	Iurii Zaikin, Kees Cook, Linux Containers, linux-fsdevel,
	Luis Chamberlain, Vasily Averin

On Wed, Jun 1, 2022 at 11:25 AM Alexey Gladkov <legion@kernel.org> wrote:
>
> I'm not sure how to get rid of ctl_table since net sysctls are heavily
> dependent on it.

I don't actually think it's worth getting rid of entirely, because
there's just a lot of simple cases where it "JustWorks(tm)" and having
just that table entry describe all the semantics is not wrong at all.

The name may suck, but hey, it's not a big deal. Changing it now would
be more pain than it's worth.

No, I was more thinking that things that already need more
infrastructure than that simple static ctl_table entry might be better
off trying to migrate to your new "proper read op" model, and having
more of that dynamic behavior in the read op.

The whole "create dynamic ctl_table entries on the fly" model works,
but it's kind of ugly.

Anyway, I think all of this is "I think there is more room for cleanup
in this area", and maybe we'll never have enough motivation to
actually do that.

Your patches seem to fix the extant issue with the ipc namespace, and
the truly disgusting parts (although maybe there are other truly
disgusting things hiding - I didn't go look for them).

                      Linus

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 2/4] sysctl: ipc: Do not use dynamic memory
  2022-06-01 18:34         ` Linus Torvalds
@ 2022-06-01 19:05           ` Alexey Gladkov
  2022-06-09 18:51           ` Luis Chamberlain
  1 sibling, 0 replies; 16+ messages in thread
From: Alexey Gladkov @ 2022-06-01 19:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: LKML, Eric W . Biederman, Andrew Morton, Christian Brauner,
	Iurii Zaikin, Kees Cook, Linux Containers, linux-fsdevel,
	Luis Chamberlain, Vasily Averin

On Wed, Jun 01, 2022 at 11:34:18AM -0700, Linus Torvalds wrote:
> On Wed, Jun 1, 2022 at 11:25 AM Alexey Gladkov <legion@kernel.org> wrote:
> >
> > I'm not sure how to get rid of ctl_table since net sysctls are heavily
> > dependent on it.
> 
> I don't actually think it's worth getting rid of entirely, because
> there's just a lot of simple cases where it "JustWorks(tm)" and having
> just that table entry describe all the semantics is not wrong at all.
> 
> The name may suck, but hey, it's not a big deal. Changing it now would
> be more pain than it's worth.
> 
> No, I was more thinking that things that already need more
> infrastructure than that simple static ctl_table entry might be better
> off trying to migrate to your new "proper read op" model, and having
> more of that dynamic behavior in the read op.

This was part of my plan. I wanted to step by step try migrating other
sysctls to use open/read/write where it makes sense.

To be honest, it was Eric Biederman who came up with the idea to separate
open, read and write. I am very grateful to him.

> The whole "create dynamic ctl_table entries on the fly" model works,
> but it's kind of ugly.
> 
> Anyway, I think all of this is "I think there is more room for cleanup
> in this area", and maybe we'll never have enough motivation to
> actually do that.
> 
> Your patches seem to fix the extant issue with the ipc namespace, and
> the truly disgusting parts (although maybe there are other truly
> disgusting things hiding - I didn't go look for them).

I also hope to try and fix the f_cred issue.

-- 
Rgrds, legion


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 1/4] sysctl: API extension for handling sysctl
  2022-06-01 13:20   ` [RFC PATCH 1/4] sysctl: " Alexey Gladkov
@ 2022-06-01 19:19     ` Matthew Wilcox
  2022-06-01 19:23       ` Linus Torvalds
  2022-06-01 19:32       ` Alexey Gladkov
  0 siblings, 2 replies; 16+ messages in thread
From: Matthew Wilcox @ 2022-06-01 19:19 UTC (permalink / raw)
  To: Alexey Gladkov
  Cc: LKML, Eric W . Biederman, Linus Torvalds, Andrew Morton,
	Christian Brauner, Iurii Zaikin, Kees Cook, Linux Containers,
	linux-fsdevel, Luis Chamberlain, Vasily Averin

On Wed, Jun 01, 2022 at 03:20:29PM +0200, Alexey Gladkov wrote:
> +struct ctl_fops {
> +	int (*open) (struct ctl_context *, struct inode *, struct file *);
> +	int (*release) (struct ctl_context *, struct inode *, struct file *);
> +	ssize_t (*read) (struct ctl_context *, struct file *, char *, size_t *, loff_t *);
> +	ssize_t (*write) (struct ctl_context *, struct file *, char *, size_t *, loff_t *);
> +};

Why not pass the iocb in ->read and ->write?  We're still regretting not
doing that with file_operations.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 1/4] sysctl: API extension for handling sysctl
  2022-06-01 19:19     ` Matthew Wilcox
@ 2022-06-01 19:23       ` Linus Torvalds
  2022-06-01 19:25         ` Matthew Wilcox
  2022-06-01 19:32       ` Alexey Gladkov
  1 sibling, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2022-06-01 19:23 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Alexey Gladkov, LKML, Eric W . Biederman, Andrew Morton,
	Christian Brauner, Iurii Zaikin, Kees Cook, Linux Containers,
	linux-fsdevel, Luis Chamberlain, Vasily Averin

On Wed, Jun 1, 2022 at 12:19 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> Why not pass the iocb in ->read and ->write?  We're still regretting not
> doing that with file_operations.

No, all the actual "io" is done by the caller.

There is no way in hell I want the sysctl callbacks to actually
possibly do user space accesses etc.

They get a kernel buffer that has already been set up. There is no
iocb or iovec left for them.

(That also means that they can take whatever locks they need,
including spinlocks, because there's not going to be any random user
accesses or complex pipe buffer lookups or whatever).

                Linus

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 1/4] sysctl: API extension for handling sysctl
  2022-06-01 19:23       ` Linus Torvalds
@ 2022-06-01 19:25         ` Matthew Wilcox
  2022-06-01 19:31           ` Linus Torvalds
  0 siblings, 1 reply; 16+ messages in thread
From: Matthew Wilcox @ 2022-06-01 19:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alexey Gladkov, LKML, Eric W . Biederman, Andrew Morton,
	Christian Brauner, Iurii Zaikin, Kees Cook, Linux Containers,
	linux-fsdevel, Luis Chamberlain, Vasily Averin

On Wed, Jun 01, 2022 at 12:23:06PM -0700, Linus Torvalds wrote:
> On Wed, Jun 1, 2022 at 12:19 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > Why not pass the iocb in ->read and ->write?  We're still regretting not
> > doing that with file_operations.
> 
> No, all the actual "io" is done by the caller.
> 
> There is no way in hell I want the sysctl callbacks to actually
> possibly do user space accesses etc.
> 
> They get a kernel buffer that has already been set up. There is no
> iocb or iovec left for them.

I wasn't suggesting the iovec.  Just the iocb, instead of passing in the
ki_filp and the ki_pos.

> (That also means that they can take whatever locks they need,
> including spinlocks, because there's not going to be any random user
> accesses or complex pipe buffer lookups or whatever).
> 
>                 Linus

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 1/4] sysctl: API extension for handling sysctl
  2022-06-01 19:25         ` Matthew Wilcox
@ 2022-06-01 19:31           ` Linus Torvalds
  0 siblings, 0 replies; 16+ messages in thread
From: Linus Torvalds @ 2022-06-01 19:31 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Alexey Gladkov, LKML, Eric W . Biederman, Andrew Morton,
	Christian Brauner, Iurii Zaikin, Kees Cook, Linux Containers,
	linux-fsdevel, Luis Chamberlain, Vasily Averin

On Wed, Jun 1, 2022 at 12:25 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> I wasn't suggesting the iovec.  Just the iocb, instead of passing in the
> ki_filp and the ki_pos.

I guess that would work, but would it actually help anything?

I think it's a lot more obvious when it just looks mostly like a
"read/write()" system call (ok, pread/pwrite since you get the pos,
but still).

There is nothing that could possibly be relevant in the kiocb. All
those fields are about odd async or atomicity issues (or "report
completion" issues) that simply cannot possibly be valid for a sysctl
value.

And if they are, that sysctl value is doing something really really
really wrong. So we absolutely don't want to encourage that.

This is not a "do IO" interface. This is a "read or write value" interface.

               Linus

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 1/4] sysctl: API extension for handling sysctl
  2022-06-01 19:19     ` Matthew Wilcox
  2022-06-01 19:23       ` Linus Torvalds
@ 2022-06-01 19:32       ` Alexey Gladkov
  1 sibling, 0 replies; 16+ messages in thread
From: Alexey Gladkov @ 2022-06-01 19:32 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: LKML, Eric W . Biederman, Linus Torvalds, Andrew Morton,
	Christian Brauner, Iurii Zaikin, Kees Cook, Linux Containers,
	linux-fsdevel, Luis Chamberlain, Vasily Averin

On Wed, Jun 01, 2022 at 08:19:14PM +0100, Matthew Wilcox wrote:
> On Wed, Jun 01, 2022 at 03:20:29PM +0200, Alexey Gladkov wrote:
> > +struct ctl_fops {
> > +	int (*open) (struct ctl_context *, struct inode *, struct file *);
> > +	int (*release) (struct ctl_context *, struct inode *, struct file *);
> > +	ssize_t (*read) (struct ctl_context *, struct file *, char *, size_t *, loff_t *);
> > +	ssize_t (*write) (struct ctl_context *, struct file *, char *, size_t *, loff_t *);
> > +};
> 
> Why not pass the iocb in ->read and ->write?  We're still regretting not
> doing that with file_operations.

Because buf and len can be modified in BPF_CGROUP_RUN_PROG_SYSCTL.
We need to pass the result of this hook to read/write callback.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/proc/proc_sysctl.c#n605
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/bpf/cgroup.c#n1441

-- 
Rgrds, legion


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 0/4] API extension for handling sysctl
  2022-06-01 13:20 ` [RFC PATCH 0/4] API extension for handling sysctl Alexey Gladkov
                     ` (3 preceding siblings ...)
  2022-06-01 13:20   ` [RFC PATCH 4/4] sysctl: mqueue: " Alexey Gladkov
@ 2022-06-09 16:45   ` Luis Chamberlain
  4 siblings, 0 replies; 16+ messages in thread
From: Luis Chamberlain @ 2022-06-09 16:45 UTC (permalink / raw)
  To: Alexey Gladkov
  Cc: LKML, Eric W . Biederman, Linus Torvalds, Andrew Morton,
	Christian Brauner, Iurii Zaikin, Kees Cook, Linux Containers,
	linux-fsdevel, Vasily Averin

On Wed, Jun 01, 2022 at 03:20:28PM +0200, Alexey Gladkov wrote:
> On Fri, Apr 22, 2022 at 01:44:50PM -0700, Linus Torvalds wrote:
> > On Fri, Apr 22, 2022 at 5:53 AM Alexey Gladkov <legion@kernel.org> wrote:
> > >
> > > Yes, Linus, these changes are not the refactoring you were talking
> > > about, but I plan to try to do such a refactoring in the my next
> > > patchset.
> > 
> > Heh. Ok, I'm not saying these patches are pretty, and looking up the
> > namespace thing is a bit subtle, but it's certainly prettier than the
> > existing odd "create a new ctl_table entry because of field abuse".
> 
> As I promised, here is one of the possible options for how to get rid of dynamic
> memory allocation.
> 
> We can slightly extend the API and thus be able to save data at the time the
> file is opened. This will not only eliminate the need to allocate memory, but
> also provide access to file struct and f_cred.
> 
> I made an RFC because I'm not sure that I did the permissions check for
> ipc_sysctl. I also did not change all the places where this API can be applied
> to make the patch smaller. As in the case of /proc/sys/kernel/printk where
> CAP_SYS_ADMIN is checked[1] for the current process at the time of write.

Thanks for all this, can you also add respective selftests extensions
for this on lib/test_sysctl.c and tools/testing/selftests/sysctl/sysctl.sh ?

  Luis

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 2/4] sysctl: ipc: Do not use dynamic memory
  2022-06-01 18:34         ` Linus Torvalds
  2022-06-01 19:05           ` Alexey Gladkov
@ 2022-06-09 18:51           ` Luis Chamberlain
  1 sibling, 0 replies; 16+ messages in thread
From: Luis Chamberlain @ 2022-06-09 18:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alexey Gladkov, LKML, Eric W . Biederman, Andrew Morton,
	Christian Brauner, Iurii Zaikin, Kees Cook, Linux Containers,
	linux-fsdevel, Vasily Averin, Muchun Song, Jia He, Pan Xinhui,
	Thomas Huth, ChuckLever

On Wed, Jun 01, 2022 at 11:34:18AM -0700, Linus Torvalds wrote:
> On Wed, Jun 1, 2022 at 11:25 AM Alexey Gladkov <legion@kernel.org> wrote:
> >
> > I'm not sure how to get rid of ctl_table since net sysctls are heavily
> > dependent on it.
> 
> Anyway, I think all of this is "I think there is more room for cleanup
> in this area", and maybe we'll never have enough motivation to
> actually do that.

That's a good summary of the cleanup with sysctls so I appreciate the nudges
in the right directions.

  Luis

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2022-06-09 18:51 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <CAHk-=whi2SzU4XT_FsdTCAuK2qtYmH+-hwi1cbSdG8zu0KXL=g@mail.gmail.com>
2022-06-01 13:20 ` [RFC PATCH 0/4] API extension for handling sysctl Alexey Gladkov
2022-06-01 13:20   ` [RFC PATCH 1/4] sysctl: " Alexey Gladkov
2022-06-01 19:19     ` Matthew Wilcox
2022-06-01 19:23       ` Linus Torvalds
2022-06-01 19:25         ` Matthew Wilcox
2022-06-01 19:31           ` Linus Torvalds
2022-06-01 19:32       ` Alexey Gladkov
2022-06-01 13:20   ` [RFC PATCH 2/4] sysctl: ipc: Do not use dynamic memory Alexey Gladkov
2022-06-01 16:45     ` Linus Torvalds
2022-06-01 18:24       ` Alexey Gladkov
2022-06-01 18:34         ` Linus Torvalds
2022-06-01 19:05           ` Alexey Gladkov
2022-06-09 18:51           ` Luis Chamberlain
2022-06-01 13:20   ` [RFC PATCH 3/4] sysctl: userns: " Alexey Gladkov
2022-06-01 13:20   ` [RFC PATCH 4/4] sysctl: mqueue: " Alexey Gladkov
2022-06-09 16:45   ` [RFC PATCH 0/4] API extension for handling sysctl Luis Chamberlain

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).