linux-fsdevel.vger.kernel.org archive mirror
* [PATCH 0/2] acct: don't allow access to internal filesystems
       [not found] <20250210-unordnung-petersilie-90e37411db18@brauner>
@ 2025-02-11 17:15 ` Christian Brauner
  2025-02-11 17:15   ` [PATCH 1/2] acct: perform last write from workqueue Christian Brauner
                     ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Christian Brauner @ 2025-02-11 17:15 UTC (permalink / raw)
  To: Zicheng Qu, Linus Torvalds
  Cc: Christian Brauner, jlayton, axboe, joel.granados, tglx, viro, hch,
	len.brown, pavel, pengfei.xu, rafael, tanghui20, zhangqiao22,
	judy.chenhui, linux-kernel, linux-fsdevel, syzkaller-bugs,
	linux-pm, stable

In [1] it was reported that the acct(2) system call can be used to
trigger a NULL deref in cases where it is set to write to a file that
triggers an internal lookup.

This can happen, e.g., when pointing acct(2) to /sys/power/resume. At the
point where the write to this file happens, the calling task has
already exited and called exit_fs(), but an internal lookup might still
be triggered through lookup_bdev(). This may trigger a NULL deref
when accessing current->fs.

This series does two things:

- Reorganize the code so that the final write happens from the
  workqueue but with the caller's credentials. This preserves the
  (strange) permission model and has almost no regression risk.

- Block access to kernel internal filesystems as well as procfs and
  sysfs in the first place.

This API should cease to exist imho.

Link: https://lore.kernel.org/r/20250127091811.3183623-1-quzicheng@huawei.com [1]

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
Christian Brauner (2):
      acct: perform last write from workqueue
      acct: block access to kernel internal filesystems

 kernel/acct.c | 134 ++++++++++++++++++++++++++++++++++++----------------------
 1 file changed, 84 insertions(+), 50 deletions(-)
---
base-commit: af69e27b3c8240f7889b6c457d71084458984d8e
change-id: 20250211-work-acct-a6d8e92a5fe0



* [PATCH 1/2] acct: perform last write from workqueue
  2025-02-11 17:15 ` [PATCH 0/2] acct: don't allow access to internal filesystems Christian Brauner
@ 2025-02-11 17:15   ` Christian Brauner
  2025-02-11 17:16   ` [PATCH 2/2] acct: block access to kernel internal filesystems Christian Brauner
  2025-02-11 18:56   ` [PATCH 0/2] acct: don't allow access to " Jeff Layton
  2 siblings, 0 replies; 9+ messages in thread
From: Christian Brauner @ 2025-02-11 17:15 UTC (permalink / raw)
  To: Zicheng Qu, Linus Torvalds
  Cc: Christian Brauner, jlayton, axboe, joel.granados, tglx, viro, hch,
	len.brown, pavel, pengfei.xu, rafael, tanghui20, zhangqiao22,
	judy.chenhui, linux-kernel, linux-fsdevel, syzkaller-bugs,
	linux-pm, stable

In [1] it was reported that the acct(2) system call can be used to
trigger a NULL deref in cases where it is set to write to a file that
triggers an internal lookup. This can happen, e.g., when pointing acct(2)
to /sys/power/resume. At the point where the write to this file
happens, the calling task has already exited and called exit_fs(). A
lookup will thus trigger a NULL deref when accessing current->fs.

Reorganize the code so that the final write happens from the
workqueue but with the caller's credentials. This preserves the
(strange) permission model and has almost no regression risk.
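
The split can be pictured with a small userspace model. This is only a
sketch under stated assumptions: every model_* name below is hypothetical
and none of it is kernel code. It shows the idea that all task-dependent
data is captured while the task is still intact, so the deferred write
never has to look at the task (and so never at a cleared fs pointer).

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for kernel objects; these are not kernel types. */
struct model_fs { const char *cwd; };
struct model_task { struct model_fs *fs; int pid; int exit_code; };
struct model_record { int pid; int exit_code; };

struct model_acct {
	struct model_record rec;	/* plays the role of the new acct->ac member */
	int have_record;
	char out[64];			/* stands in for the accounting file */
};

/* Phase 1: runs in the exiting task, the only place task data is read. */
static void model_fill(struct model_acct *acct, const struct model_task *task)
{
	acct->rec.pid = task->pid;
	acct->rec.exit_code = task->exit_code;
	acct->have_record = 1;
}

/* Phase 2: runs later, as if from the workqueue; it takes no task pointer. */
static int model_write(struct model_acct *acct)
{
	if (!acct->have_record)
		return -1;
	snprintf(acct->out, sizeof(acct->out), "pid=%d exit=%d",
		 acct->rec.pid, acct->rec.exit_code);
	return 0;
}

/* Drives both phases and copies the "file" contents into buf. */
static int run_model(char *buf, size_t len)
{
	struct model_fs fs = { "/" };
	struct model_task task = { &fs, 42, 3 };
	struct model_acct acct;

	assert(buf != NULL && len > 0);
	memset(&acct, 0, sizeof(acct));

	model_fill(&acct, &task);
	task.fs = NULL;			/* models exit_fs() having already run */
	if (model_write(&acct))
		return -1;
	snprintf(buf, len, "%s", acct.out);
	return 0;
}
```

Calling run_model() with a 64-byte buffer fills it with the record text
even though the model task's fs pointer was cleared before the write.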

This API should cease to exist though.

Reported-by: Zicheng Qu <quzicheng@huawei.com>
Link: https://lore.kernel.org/r/20250127091811.3183623-1-quzicheng@huawei.com [1]
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: <stable@vger.kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 kernel/acct.c | 120 ++++++++++++++++++++++++++++++++++------------------------
 1 file changed, 70 insertions(+), 50 deletions(-)

diff --git a/kernel/acct.c b/kernel/acct.c
index 31222e8cd534..48283efe8a12 100644
--- a/kernel/acct.c
+++ b/kernel/acct.c
@@ -103,48 +103,50 @@ struct bsd_acct_struct {
 	atomic_long_t		count;
 	struct rcu_head		rcu;
 	struct mutex		lock;
-	int			active;
+	bool			active;
+	bool			check_space;
 	unsigned long		needcheck;
 	struct file		*file;
 	struct pid_namespace	*ns;
 	struct work_struct	work;
 	struct completion	done;
+	acct_t			ac;
 };
 
-static void do_acct_process(struct bsd_acct_struct *acct);
+static void fill_ac(struct bsd_acct_struct *acct);
+static void acct_write_process(struct bsd_acct_struct *acct);
 
 /*
  * Check the amount of free space and suspend/resume accordingly.
  */
-static int check_free_space(struct bsd_acct_struct *acct)
+static bool check_free_space(struct bsd_acct_struct *acct)
 {
 	struct kstatfs sbuf;
 
-	if (time_is_after_jiffies(acct->needcheck))
-		goto out;
+	if (!acct->check_space)
+		return acct->active;
 
 	/* May block */
 	if (vfs_statfs(&acct->file->f_path, &sbuf))
-		goto out;
+		return acct->active;
 
 	if (acct->active) {
 		u64 suspend = sbuf.f_blocks * SUSPEND;
 		do_div(suspend, 100);
 		if (sbuf.f_bavail <= suspend) {
-			acct->active = 0;
+			acct->active = false;
 			pr_info("Process accounting paused\n");
 		}
 	} else {
 		u64 resume = sbuf.f_blocks * RESUME;
 		do_div(resume, 100);
 		if (sbuf.f_bavail >= resume) {
-			acct->active = 1;
+			acct->active = true;
 			pr_info("Process accounting resumed\n");
 		}
 	}
 
 	acct->needcheck = jiffies + ACCT_TIMEOUT*HZ;
-out:
 	return acct->active;
 }
 
@@ -189,7 +191,11 @@ static void acct_pin_kill(struct fs_pin *pin)
 {
 	struct bsd_acct_struct *acct = to_acct(pin);
 	mutex_lock(&acct->lock);
-	do_acct_process(acct);
+	/*
+	 * Fill the accounting struct with the exiting task's info
+	 * before punting to the workqueue.
+	 */
+	fill_ac(acct);
 	schedule_work(&acct->work);
 	wait_for_completion(&acct->done);
 	cmpxchg(&acct->ns->bacct, pin, NULL);
@@ -202,6 +208,9 @@ static void close_work(struct work_struct *work)
 {
 	struct bsd_acct_struct *acct = container_of(work, struct bsd_acct_struct, work);
 	struct file *file = acct->file;
+
+	/* We were fired by acct_pin_kill() which holds acct->lock. */
+	acct_write_process(acct);
 	if (file->f_op->flush)
 		file->f_op->flush(file, NULL);
 	__fput_sync(file);
@@ -430,13 +439,27 @@ static u32 encode_float(u64 value)
  *  do_exit() or when switching to a different output file.
  */
 
-static void fill_ac(acct_t *ac)
+static void fill_ac(struct bsd_acct_struct *acct)
 {
 	struct pacct_struct *pacct = &current->signal->pacct;
+	struct file *file = acct->file;
+	acct_t *ac = &acct->ac;
 	u64 elapsed, run_time;
 	time64_t btime;
 	struct tty_struct *tty;
 
+	lockdep_assert_held(&acct->lock);
+
+	if (time_is_after_jiffies(acct->needcheck)) {
+		acct->check_space = false;
+
+		/* Don't fill in @ac if nothing will be written. */
+		if (!acct->active)
+			return;
+	} else {
+		acct->check_space = true;
+	}
+
 	/*
 	 * Fill the accounting struct with the needed info as recorded
 	 * by the different kernel functions.
@@ -484,64 +507,61 @@ static void fill_ac(acct_t *ac)
 	ac->ac_majflt = encode_comp_t(pacct->ac_majflt);
 	ac->ac_exitcode = pacct->ac_exitcode;
 	spin_unlock_irq(&current->sighand->siglock);
-}
-/*
- *  do_acct_process does all actual work. Caller holds the reference to file.
- */
-static void do_acct_process(struct bsd_acct_struct *acct)
-{
-	acct_t ac;
-	unsigned long flim;
-	const struct cred *orig_cred;
-	struct file *file = acct->file;
-
-	/*
-	 * Accounting records are not subject to resource limits.
-	 */
-	flim = rlimit(RLIMIT_FSIZE);
-	current->signal->rlim[RLIMIT_FSIZE].rlim_cur = RLIM_INFINITY;
-	/* Perform file operations on behalf of whoever enabled accounting */
-	orig_cred = override_creds(file->f_cred);
 
-	/*
-	 * First check to see if there is enough free_space to continue
-	 * the process accounting system.
-	 */
-	if (!check_free_space(acct))
-		goto out;
-
-	fill_ac(&ac);
 	/* we really need to bite the bullet and change layout */
-	ac.ac_uid = from_kuid_munged(file->f_cred->user_ns, orig_cred->uid);
-	ac.ac_gid = from_kgid_munged(file->f_cred->user_ns, orig_cred->gid);
+	ac->ac_uid = from_kuid_munged(file->f_cred->user_ns, current_uid());
+	ac->ac_gid = from_kgid_munged(file->f_cred->user_ns, current_gid());
 #if ACCT_VERSION == 1 || ACCT_VERSION == 2
 	/* backward-compatible 16 bit fields */
-	ac.ac_uid16 = ac.ac_uid;
-	ac.ac_gid16 = ac.ac_gid;
+	ac->ac_uid16 = ac->ac_uid;
+	ac->ac_gid16 = ac->ac_gid;
 #elif ACCT_VERSION == 3
 	{
 		struct pid_namespace *ns = acct->ns;
 
-		ac.ac_pid = task_tgid_nr_ns(current, ns);
+		ac->ac_pid = task_tgid_nr_ns(current, ns);
 		rcu_read_lock();
-		ac.ac_ppid = task_tgid_nr_ns(rcu_dereference(current->real_parent),
-					     ns);
+		ac->ac_ppid = task_tgid_nr_ns(rcu_dereference(current->real_parent), ns);
 		rcu_read_unlock();
 	}
 #endif
+}
+
+static void acct_write_process(struct bsd_acct_struct *acct)
+{
+	struct file *file = acct->file;
+	const struct cred *cred;
+	acct_t *ac = &acct->ac;
+
+	/* Perform file operations on behalf of whoever enabled accounting */
+	cred = override_creds(file->f_cred);
+
 	/*
-	 * Get freeze protection. If the fs is frozen, just skip the write
-	 * as we could deadlock the system otherwise.
+	 * First check to see if there is enough free_space to continue
+	 * the process accounting system. Then get freeze protection. If
+	 * the fs is frozen, just skip the write as we could deadlock
+	 * the system otherwise.
 	 */
-	if (file_start_write_trylock(file)) {
+	if (check_free_space(acct) && file_start_write_trylock(file)) {
 		/* it's been opened O_APPEND, so position is irrelevant */
 		loff_t pos = 0;
-		__kernel_write(file, &ac, sizeof(acct_t), &pos);
+		__kernel_write(file, ac, sizeof(acct_t), &pos);
 		file_end_write(file);
 	}
-out:
+
+	revert_creds(cred);
+}
+
+static void do_acct_process(struct bsd_acct_struct *acct)
+{
+	unsigned long flim;
+
+	/* Accounting records are not subject to resource limits. */
+	flim = rlimit(RLIMIT_FSIZE);
+	current->signal->rlim[RLIMIT_FSIZE].rlim_cur = RLIM_INFINITY;
+	fill_ac(acct);
+	acct_write_process(acct);
 	current->signal->rlim[RLIMIT_FSIZE].rlim_cur = flim;
-	revert_creds(orig_cred);
 }
 
 /**

-- 
2.47.2



* [PATCH 2/2] acct: block access to kernel internal filesystems
  2025-02-11 17:15 ` [PATCH 0/2] acct: don't allow access to internal filesystems Christian Brauner
  2025-02-11 17:15   ` [PATCH 1/2] acct: perform last write from workqueue Christian Brauner
@ 2025-02-11 17:16   ` Christian Brauner
  2025-02-11 20:30     ` Amir Goldstein
  2025-02-11 20:54     ` Al Viro
  2025-02-11 18:56   ` [PATCH 0/2] acct: don't allow access to " Jeff Layton
  2 siblings, 2 replies; 9+ messages in thread
From: Christian Brauner @ 2025-02-11 17:16 UTC (permalink / raw)
  To: Zicheng Qu, Linus Torvalds
  Cc: Christian Brauner, jlayton, axboe, joel.granados, tglx, viro, hch,
	len.brown, pavel, pengfei.xu, rafael, tanghui20, zhangqiao22,
	judy.chenhui, linux-kernel, linux-fsdevel, syzkaller-bugs,
	linux-pm, stable

There's no point in allowing anything kernel internal, nor procfs or
sysfs.

Reported-by: Zicheng Qu <quzicheng@huawei.com>
Link: https://lore.kernel.org/r/20250127091811.3183623-1-quzicheng@huawei.com
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: <stable@vger.kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 kernel/acct.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/kernel/acct.c b/kernel/acct.c
index 48283efe8a12..6520baa13669 100644
--- a/kernel/acct.c
+++ b/kernel/acct.c
@@ -243,6 +243,20 @@ static int acct_on(struct filename *pathname)
 		return -EACCES;
 	}
 
+	/* Exclude kernel internal filesystems. */
+	if (file_inode(file)->i_sb->s_flags & (SB_NOUSER | SB_KERNMOUNT)) {
+		kfree(acct);
+		filp_close(file, NULL);
+		return -EINVAL;
+	}
+
+	/* Exclude procfs and sysfs. */
+	if (file_inode(file)->i_sb->s_iflags & SB_I_USERNS_VISIBLE) {
+		kfree(acct);
+		filp_close(file, NULL);
+		return -EINVAL;
+	}
+
 	if (!(file->f_mode & FMODE_CAN_WRITE)) {
 		kfree(acct);
 		filp_close(file, NULL);

-- 
2.47.2



* Re: [PATCH 0/2] acct: don't allow access to internal filesystems
  2025-02-11 17:15 ` [PATCH 0/2] acct: don't allow access to internal filesystems Christian Brauner
  2025-02-11 17:15   ` [PATCH 1/2] acct: perform last write from workqueue Christian Brauner
  2025-02-11 17:16   ` [PATCH 2/2] acct: block access to kernel internal filesystems Christian Brauner
@ 2025-02-11 18:56   ` Jeff Layton
  2025-02-12 11:16     ` Christian Brauner
  2 siblings, 1 reply; 9+ messages in thread
From: Jeff Layton @ 2025-02-11 18:56 UTC (permalink / raw)
  To: Christian Brauner, Zicheng Qu, Linus Torvalds
  Cc: axboe, joel.granados, tglx, viro, hch, len.brown, pavel,
	pengfei.xu, rafael, tanghui20, zhangqiao22, judy.chenhui,
	linux-kernel, linux-fsdevel, syzkaller-bugs, linux-pm, stable

On Tue, 2025-02-11 at 18:15 +0100, Christian Brauner wrote:
> In [1] it was reported that the acct(2) system call can be used to
> trigger a NULL deref in cases where it is set to write to a file that
> triggers an internal lookup.
> 
> This can e.g., happen when pointing acct(2) to /sys/power/resume. At the
> point the where the write to this file happens the calling task has
> already exited and called exit_fs() but an internal lookup might be
> triggered through lookup_bdev(). This may trigger a NULL-deref
> when accessing current->fs.
> 
> This series does two things:
> 
> - Reorganize the code so that the the final write happens from the
>   workqueue but with the caller's credentials. This preserves the
>   (strange) permission model and has almost no regression risk.
> 
> - Block access to kernel internal filesystems as well as procfs and
>   sysfs in the first place.
> 
> This api should stop to exist imho.
> 

I wonder who uses it these days, and what would we suggest they replace
it with? Maybe syscall auditing?

config BSD_PROCESS_ACCT
        bool "BSD Process Accounting"
        depends on MULTIUSER
        help
          If you say Y here, a user level program will be able to instruct the
          kernel (via a special system call) to write process accounting
          information to a file: whenever a process exits, information about
          that process will be appended to the file by the kernel.  The
          information includes things such as creation time, owning user,
          command name, memory usage, controlling terminal etc. (the complete
          list is in the struct acct in <file:include/linux/acct.h>).  It is
          up to the user level program to do useful things with this
          information.  This is generally a good idea, so say Y.

Maybe it's at least time to replace that last sentence and make this
default to 'n'?
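
For readers wondering what "doing useful things" with the file involves:
some record fields are stored in a compressed form (the diff in patch 1
fills ac_majflt via encode_comp_t(), for instance). Below is a minimal
userspace sketch of the inverse, assuming the comp_t layout described in
acct(5) (3-bit base-8 exponent in the top bits, 13-bit mantissa below).
The name decode_comp_t is hypothetical and not an existing API.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Expand a 16-bit comp_t value.
 * Assumed layout (per acct(5)): bits 15..13 hold a base-8 exponent,
 * bits 12..0 hold the mantissa, so value = mantissa * 8^exponent.
 */
static uint64_t decode_comp_t(uint16_t c)
{
	unsigned int exponent = (c >> 13) & 0x7;
	uint64_t mantissa = c & 0x1fff;

	/* Multiplying by 8^exponent is a left shift by 3 * exponent bits. */
	return mantissa << (3 * exponent);
}
```

Values below 8192 have a zero exponent and decode to themselves; larger
counts lose their low bits to the encoding, so decoded values are
approximations.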

> Link: https://lore.kernel.org/r/20250127091811.3183623-1-quzicheng@huawei.com [1]
> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
> Christian Brauner (2):
>       acct: perform last write from workqueue
>       acct: block access to kernel internal filesystems
> 
>  kernel/acct.c | 134 ++++++++++++++++++++++++++++++++++++----------------------
>  1 file changed, 84 insertions(+), 50 deletions(-)
> ---
> base-commit: af69e27b3c8240f7889b6c457d71084458984d8e
> change-id: 20250211-work-acct-a6d8e92a5fe0
> 

-- 
Jeff Layton <jlayton@kernel.org>


* Re: [PATCH 2/2] acct: block access to kernel internal filesystems
  2025-02-11 17:16   ` [PATCH 2/2] acct: block access to kernel internal filesystems Christian Brauner
@ 2025-02-11 20:30     ` Amir Goldstein
  2025-02-11 20:54     ` Al Viro
  1 sibling, 0 replies; 9+ messages in thread
From: Amir Goldstein @ 2025-02-11 20:30 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Zicheng Qu, Linus Torvalds, jlayton, axboe, joel.granados, tglx,
	viro, hch, len.brown, pavel, pengfei.xu, rafael, tanghui20,
	zhangqiao22, judy.chenhui, linux-kernel, linux-fsdevel,
	syzkaller-bugs, linux-pm, stable

On Tue, Feb 11, 2025 at 6:17 PM Christian Brauner <brauner@kernel.org> wrote:
>
> There's no point in allowing anything kernel internal nor procfs or
> sysfs.
>
> Reported-by: Zicheng Qu <quzicheng@huawei.com>
> Link: https://lore.kernel.org/r/20250127091811.3183623-1-quzicheng@huawei.com
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Christian Brauner <brauner@kernel.org>

Reviewed-by: Amir Goldstein <amir73il@gmail.com>

> ---
>  kernel/acct.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
>
> diff --git a/kernel/acct.c b/kernel/acct.c
> index 48283efe8a12..6520baa13669 100644
> --- a/kernel/acct.c
> +++ b/kernel/acct.c
> @@ -243,6 +243,20 @@ static int acct_on(struct filename *pathname)
>                 return -EACCES;
>         }
>
> +       /* Exclude kernel kernel internal filesystems. */
> +       if (file_inode(file)->i_sb->s_flags & (SB_NOUSER | SB_KERNMOUNT)) {
> +               kfree(acct);
> +               filp_close(file, NULL);
> +               return -EINVAL;
> +       }
> +
> +       /* Exclude procfs and sysfs. */
> +       if (file_inode(file)->i_sb->s_iflags & SB_I_USERNS_VISIBLE) {
> +               kfree(acct);
> +               filp_close(file, NULL);
> +               return -EINVAL;
> +       }
> +
>         if (!(file->f_mode & FMODE_CAN_WRITE)) {
>                 kfree(acct);
>                 filp_close(file, NULL);
>
> --
> 2.47.2
>
>


* Re: [PATCH 2/2] acct: block access to kernel internal filesystems
  2025-02-11 17:16   ` [PATCH 2/2] acct: block access to kernel internal filesystems Christian Brauner
  2025-02-11 20:30     ` Amir Goldstein
@ 2025-02-11 20:54     ` Al Viro
  2025-02-12 10:32       ` Christian Brauner
  1 sibling, 1 reply; 9+ messages in thread
From: Al Viro @ 2025-02-11 20:54 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Zicheng Qu, Linus Torvalds, jlayton, axboe, joel.granados, tglx,
	hch, len.brown, pavel, pengfei.xu, rafael, tanghui20, zhangqiao22,
	judy.chenhui, linux-kernel, linux-fsdevel, syzkaller-bugs,
	linux-pm, stable

On Tue, Feb 11, 2025 at 06:16:00PM +0100, Christian Brauner wrote:
> There's no point in allowing anything kernel internal nor procfs or
> sysfs.

> +	/* Exclude kernel kernel internal filesystems. */
> +	if (file_inode(file)->i_sb->s_flags & (SB_NOUSER | SB_KERNMOUNT)) {
> +		kfree(acct);
> +		filp_close(file, NULL);
> +		return -EINVAL;
> +	}
> +
> +	/* Exclude procfs and sysfs. */
> +	if (file_inode(file)->i_sb->s_iflags & SB_I_USERNS_VISIBLE) {
> +		kfree(acct);
> +		filp_close(file, NULL);
> +		return -EINVAL;
> +	}

That looks like a really weird way to test it, especially the second
part...


* Re: [PATCH 2/2] acct: block access to kernel internal filesystems
  2025-02-11 20:54     ` Al Viro
@ 2025-02-12 10:32       ` Christian Brauner
  0 siblings, 0 replies; 9+ messages in thread
From: Christian Brauner @ 2025-02-12 10:32 UTC (permalink / raw)
  To: Al Viro
  Cc: Zicheng Qu, Linus Torvalds, jlayton, axboe, joel.granados, tglx,
	hch, len.brown, pavel, pengfei.xu, rafael, tanghui20, zhangqiao22,
	judy.chenhui, linux-kernel, linux-fsdevel, syzkaller-bugs,
	linux-pm, stable

On Tue, Feb 11, 2025 at 08:54:18PM +0000, Al Viro wrote:
> On Tue, Feb 11, 2025 at 06:16:00PM +0100, Christian Brauner wrote:
> > There's no point in allowing anything kernel internal nor procfs or
> > sysfs.
> 
> > +	/* Exclude kernel kernel internal filesystems. */
> > +	if (file_inode(file)->i_sb->s_flags & (SB_NOUSER | SB_KERNMOUNT)) {
> > +		kfree(acct);
> > +		filp_close(file, NULL);
> > +		return -EINVAL;
> > +	}
> > +
> > +	/* Exclude procfs and sysfs. */
> > +	if (file_inode(file)->i_sb->s_iflags & SB_I_USERNS_VISIBLE) {
> > +		kfree(acct);
> > +		filp_close(file, NULL);
> > +		return -EINVAL;
> > +	}
> 
> That looks like a really weird way to test it, especially the second
> part...

SB_I_USERNS_VISIBLE has only ever applied to procfs and sysfs.

Granted, its main purpose is to indicate that a caller in an
unprivileged userns might already have a restricted view of sysfs/procfs,
so mounting it again must be prevented to avoid revealing any overmounted
entities (a strong candidate for the prize of least transparent cause of
EPERMs from the kernel, imho).

That flag could reasonably go and be replaced by explicit checks for
procfs and sysfs in general because we haven't ever grown any additional
candidates for that mess and it's unlikely that we ever will. But as
long as we have this I don't mind using it. If it's important to you
I'll happily change it. If you can live with the comment I added I'll
leave it.
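
To illustrate what an explicit procfs/sysfs check could look like, here
is a userspace analogue only, not the kernel-side test: it compares the
filesystem magic reported by statfs(2) against PROC_SUPER_MAGIC and
SYSFS_MAGIC from <linux/magic.h>. The two helper names are hypothetical,
and the sketch assumes a Linux system with the kernel uapi headers
installed.

```c
#include <assert.h>
#include <sys/vfs.h>
#include <linux/magic.h>

/* Pure comparison on a filesystem magic number. */
static int magic_is_proc_or_sysfs(long f_type)
{
	return f_type == PROC_SUPER_MAGIC || f_type == SYSFS_MAGIC;
}

/*
 * Returns 1 if path lives on procfs or sysfs, 0 if it lives elsewhere,
 * and -1 if statfs() failed (errno is left as set by statfs()).
 */
static int path_is_proc_or_sysfs(const char *path)
{
	struct statfs sb;

	if (statfs(path, &sb) != 0)
		return -1;
	return magic_is_proc_or_sysfs((long)sb.f_type);
}
```

A caller could run path_is_proc_or_sysfs() on a candidate accounting
file before handing it to acct(2). Such a check in userspace is advisory
at best, since the file could be replaced between the check and the
system call.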

To be perfectly blunt: imho, this API isn't worth massaging a single
line of VFS code, which is why this isn't going to win the prize for
prettiest fix of a NULL deref.


* Re: [PATCH 0/2] acct: don't allow access to internal filesystems
  2025-02-11 18:56   ` [PATCH 0/2] acct: don't allow access to " Jeff Layton
@ 2025-02-12 11:16     ` Christian Brauner
  2025-02-13 14:56       ` Christian Brauner
  0 siblings, 1 reply; 9+ messages in thread
From: Christian Brauner @ 2025-02-12 11:16 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Zicheng Qu, Linus Torvalds, axboe, joel.granados, tglx, viro, hch,
	len.brown, pavel, pengfei.xu, rafael, tanghui20, zhangqiao22,
	judy.chenhui, linux-kernel, linux-fsdevel, syzkaller-bugs,
	linux-pm, stable

On Tue, Feb 11, 2025 at 01:56:41PM -0500, Jeff Layton wrote:
> On Tue, 2025-02-11 at 18:15 +0100, Christian Brauner wrote:
> > In [1] it was reported that the acct(2) system call can be used to
> > trigger a NULL deref in cases where it is set to write to a file that
> > triggers an internal lookup.
> > 
> > This can e.g., happen when pointing acct(2) to /sys/power/resume. At the
> > point the where the write to this file happens the calling task has
> > already exited and called exit_fs() but an internal lookup might be
> > triggered through lookup_bdev(). This may trigger a NULL-deref
> > when accessing current->fs.
> > 
> > This series does two things:
> > 
> > - Reorganize the code so that the the final write happens from the
> >   workqueue but with the caller's credentials. This preserves the
> >   (strange) permission model and has almost no regression risk.
> > 
> > - Block access to kernel internal filesystems as well as procfs and
> >   sysfs in the first place.
> > 
> > This api should stop to exist imho.
> > 
> 
> I wonder who uses it these days, and what would we suggest they replace
> it with? Maybe syscall auditing?

Someone pointed me to atop but that also works without it. Since this is
a privileged api I think the natural candidate to replace all of this is
bpf. I'm pretty sure that it's relatively straightforward to get a lot
more information out of it than with acct(2) and it will probably be
more performant too.

Without any limitations, as it is right now, acct(2) can quite easily
lock up the system by pointing it to various things in sysfs, and
I'm sure it can be abused in other ways. So I wouldn't enable it.

> 
> config BSD_PROCESS_ACCT
>         bool "BSD Process Accounting"
>         depends on MULTIUSER
>         help
>           If you say Y here, a user level program will be able to instruct the
>           kernel (via a special system call) to write process accounting
>           information to a file: whenever a process exits, information about
>           that process will be appended to the file by the kernel.  The
>           information includes things such as creation time, owning user,
>           command name, memory usage, controlling terminal etc. (the complete
>           list is in the struct acct in <file:include/linux/acct.h>).  It is
>           up to the user level program to do useful things with this
>           information.  This is generally a good idea, so say Y.
> 
> Maybe at least time to replace that last sentence and make this default
> to 'n'?

I agree.

> 
> > Link: https://lore.kernel.org/r/20250127091811.3183623-1-quzicheng@huawei.com [1]
> > 
> > Signed-off-by: Christian Brauner <brauner@kernel.org>
> > ---
> > Christian Brauner (2):
> >       acct: perform last write from workqueue
> >       acct: block access to kernel internal filesystems
> > 
> >  kernel/acct.c | 134 ++++++++++++++++++++++++++++++++++++----------------------
> >  1 file changed, 84 insertions(+), 50 deletions(-)
> > ---
> > base-commit: af69e27b3c8240f7889b6c457d71084458984d8e
> > change-id: 20250211-work-acct-a6d8e92a5fe0
> > 
> 
> -- 
> Jeff Layton <jlayton@kernel.org>


* Re: [PATCH 0/2] acct: don't allow access to internal filesystems
  2025-02-12 11:16     ` Christian Brauner
@ 2025-02-13 14:56       ` Christian Brauner
  0 siblings, 0 replies; 9+ messages in thread
From: Christian Brauner @ 2025-02-13 14:56 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Zicheng Qu, Linus Torvalds, axboe, joel.granados, tglx, viro, hch,
	len.brown, pavel, pengfei.xu, rafael, tanghui20, zhangqiao22,
	judy.chenhui, linux-kernel, linux-fsdevel, syzkaller-bugs,
	linux-pm, stable

On Wed, Feb 12, 2025 at 12:16:44PM +0100, Christian Brauner wrote:
> On Tue, Feb 11, 2025 at 01:56:41PM -0500, Jeff Layton wrote:
> > On Tue, 2025-02-11 at 18:15 +0100, Christian Brauner wrote:
> > > In [1] it was reported that the acct(2) system call can be used to
> > > trigger a NULL deref in cases where it is set to write to a file that
> > > triggers an internal lookup.
> > > 
> > > This can e.g., happen when pointing acct(2) to /sys/power/resume. At the
> > > point the where the write to this file happens the calling task has
> > > already exited and called exit_fs() but an internal lookup might be
> > > triggered through lookup_bdev(). This may trigger a NULL-deref
> > > when accessing current->fs.
> > > 
> > > This series does two things:
> > > 
> > > - Reorganize the code so that the the final write happens from the
> > >   workqueue but with the caller's credentials. This preserves the
> > >   (strange) permission model and has almost no regression risk.
> > > 
> > > - Block access to kernel internal filesystems as well as procfs and
> > >   sysfs in the first place.
> > > 
> > > This api should stop to exist imho.
> > > 
> > 
> > I wonder who uses it these days, and what would we suggest they replace
> > it with? Maybe syscall auditing?
> 
> Someone pointed me to atop but that also works without it. Since this is
> a privileged api I think the natural candidate to replace all of this is
> bpf. I'm pretty sure that it's relatively straightforward to get a lot
> more information out of it than with acct(2) and it will probably be
> more performant too.
> 
> Without any limitations as it is right now, acct(2) can easily lockup
> the system quite easily by pointing it to various things in sysfs and
> I'm sure it can be abused in other ways. So I wouldn't enable it.

And I totally forgot about taskstats via Netlink:
https://www.kernel.org/doc/Documentation/accounting/taskstats.txt
include/uapi/linux/taskstats.h

