* [PATCH][RFC] fs: add levels to inode write access
@ 2024-05-29 20:41 Josef Bacik
2024-05-29 22:00 ` Jeff Layton
` (2 more replies)
0 siblings, 3 replies; 27+ messages in thread
From: Josef Bacik @ 2024-05-29 20:41 UTC (permalink / raw)
To: linux-fsdevel, brauner, viro, jack, david, amir73il, hch
NOTE:
This is compile tested only. It's also based on an out of tree feature branch
from Amir that I'm extending to add page fault content events to allow us to
have on-demand binary loading at exec time. If you need more context please let
me know, I'll push my current branch somewhere and wire up how I plan to use
this patch so you can see it in better context, but hopefully I've described
what I'm trying to accomplish enough that this leads to useful discussion.
Currently we have ->i_writecount to control write access to an inode.
Callers may deny write access by calling deny_write_access(), which will
cause ->i_writecount to go negative, and allow_write_access() to push it
back up to 0.
This is used in a few ways, the biggest user being exec. Historically
we've blocked write access to files that are opened for executing.
fsverity is the other notable user.
With the upcoming fanotify feature that allows for on-demand population
of files, this blanket policy of denying writes to files opened for
executing creates a problem. We have to issue events right before
denying access, and the entire file must be populated before we can
continue with the exec.
This creates a problem for users who have large binaries they want to
populate on demand. Inside of Meta we often have multi-gigabyte
binaries, even on the order of tens of gigabytes. Pre-loading these
large files is costly, especially when large portions of the binary may
never be read (think debuginfo).
In order to facilitate on-demand loading of binaries we need to have a
way to punch a hole in this exec related write denial.
This patch accomplishes this by doing something similar to the freeze
levels on the super block. We have two levels, one for all write
denial, and one for exec. People wishing to deny writes will specify
the level they're denying. Users wishing to get write access must go
through all of the levels and increment each levels counter until it
increments them all, or encounters a level that is currently denied, at
which point they undo their increments and return an error.
Future patches will use the get_write_access_level() helper in order to
skip the level they wish to be allowed to access. Any inode being
populated via the pre-content fanotify mechanism will be marked with a
special flag, and the open path will use get_write_access_level() in
order to bypass the exec restriction.
This is a significant behavior change, as it allows us to write to a
file that is currently being exec'ed. However this will be limited to a
very narrow use case.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
drivers/md/md.c | 2 +-
fs/binfmt_elf.c | 4 +-
fs/exec.c | 6 +--
fs/ext4/file.c | 4 +-
fs/inode.c | 3 +-
fs/kernel_read_file.c | 4 +-
fs/locks.c | 2 +-
fs/quota/dquot.c | 2 +-
fs/verity/enable.c | 4 +-
include/linux/fs.h | 90 +++++++++++++++++++++++++++----
include/trace/events/filelock.h | 2 +-
kernel/fork.c | 11 ++--
security/integrity/evm/evm_main.c | 2 +-
security/integrity/ima/ima_main.c | 2 +-
14 files changed, 104 insertions(+), 34 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index aff9118ff697..134cefbd7bef 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7231,7 +7231,7 @@ static int set_bitmap_file(struct mddev *mddev, int fd)
pr_warn("%s: error: bitmap file must open for write\n",
mdname(mddev));
err = -EBADF;
- } else if (atomic_read(&inode->i_writecount) != 1) {
+ } else if (atomic_read(&inode->i_writecount[INODE_DENY_WRITE_ALL]) != 1) {
pr_warn("%s: error: bitmap file is already in use\n",
mdname(mddev));
err = -EBUSY;
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index a43897b03ce9..9a6fcf8ba17c 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -1216,7 +1216,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
}
reloc_func_desc = interp_load_addr;
- allow_write_access(interpreter);
+ allow_write_access(interpreter, INODE_DENY_WRITE_EXEC);
fput(interpreter);
kfree(interp_elf_ex);
@@ -1308,7 +1308,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
kfree(interp_elf_ex);
kfree(interp_elf_phdata);
out_free_file:
- allow_write_access(interpreter);
+ allow_write_access(interpreter, INODE_DENY_WRITE_EXEC);
if (interpreter)
fput(interpreter);
out_free_ph:
diff --git a/fs/exec.c b/fs/exec.c
index 18f057ba64a5..6b2900ee448d 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -971,7 +971,7 @@ static struct file *do_open_execat(int fd, struct filename *name, int flags)
if (err)
goto exit;
- err = deny_write_access(file);
+ err = deny_write_access(file, INODE_DENY_WRITE_EXEC);
if (err)
goto exit;
@@ -1545,7 +1545,7 @@ static void do_close_execat(struct file *file)
{
if (!file)
return;
- allow_write_access(file);
+ allow_write_access(file, INODE_DENY_WRITE_EXEC);
fput(file);
}
@@ -1865,7 +1865,7 @@ static int exec_binprm(struct linux_binprm *bprm)
bprm->file = bprm->interpreter;
bprm->interpreter = NULL;
- allow_write_access(exec);
+ allow_write_access(exec, INODE_DENY_WRITE_EXEC);
if (unlikely(bprm->have_execfd)) {
if (bprm->executable) {
fput(exec);
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index c89e434db6b7..6196f449649c 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -171,8 +171,8 @@ static int ext4_release_file(struct inode *inode, struct file *filp)
}
/* if we are the last writer on the inode, drop the block reservation */
if ((filp->f_mode & FMODE_WRITE) &&
- (atomic_read(&inode->i_writecount) == 1) &&
- !EXT4_I(inode)->i_reserved_data_blocks) {
+ (atomic_read(&inode->i_writecount[INODE_DENY_WRITE_ALL]) == 1) &&
+ !EXT4_I(inode)->i_reserved_data_blocks) {
down_write(&EXT4_I(inode)->i_data_sem);
ext4_discard_preallocations(inode);
up_write(&EXT4_I(inode)->i_data_sem);
diff --git a/fs/inode.c b/fs/inode.c
index 3a41f83a4ba5..fb6155412252 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -173,7 +173,8 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
inode->i_opflags |= IOP_XATTR;
i_uid_write(inode, 0);
i_gid_write(inode, 0);
- atomic_set(&inode->i_writecount, 0);
+ for (int i = 0; i < INODE_DENY_WRITE_LEVEL; i++)
+ atomic_set(&inode->i_writecount[i], 0);
inode->i_size = 0;
inode->i_write_hint = WRITE_LIFE_NOT_SET;
inode->i_blocks = 0;
diff --git a/fs/kernel_read_file.c b/fs/kernel_read_file.c
index c429c42a6867..9af82d20aa1f 100644
--- a/fs/kernel_read_file.c
+++ b/fs/kernel_read_file.c
@@ -48,7 +48,7 @@ ssize_t kernel_read_file(struct file *file, loff_t offset, void **buf,
if (!S_ISREG(file_inode(file)->i_mode))
return -EINVAL;
- ret = deny_write_access(file);
+ ret = deny_write_access(file, INODE_DENY_WRITE_ALL);
if (ret)
return ret;
@@ -119,7 +119,7 @@ ssize_t kernel_read_file(struct file *file, loff_t offset, void **buf,
}
out:
- allow_write_access(file);
+ allow_write_access(file, INODE_DENY_WRITE_ALL);
return ret == 0 ? copied : ret;
}
EXPORT_SYMBOL_GPL(kernel_read_file);
diff --git a/fs/locks.c b/fs/locks.c
index 90c8746874de..3e6a62f56528 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1763,7 +1763,7 @@ check_conflicting_open(struct file *filp, const int arg, int flags)
else if (filp->f_mode & FMODE_READ)
self_rcount = 1;
- if (atomic_read(&inode->i_writecount) != self_wcount ||
+ if (atomic_read(&inode->i_writecount[INODE_DENY_WRITE_ALL]) != self_wcount ||
atomic_read(&inode->i_readcount) != self_rcount)
return -EAGAIN;
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index 627eb2f72ef3..fa470d76f0b3 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -1031,7 +1031,7 @@ static int add_dquot_ref(struct super_block *sb, int type)
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
spin_lock(&inode->i_lock);
if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
- !atomic_read(&inode->i_writecount) ||
+ !inode_is_open_for_write(inode) ||
!dqinit_needed(inode, type)) {
spin_unlock(&inode->i_lock);
continue;
diff --git a/fs/verity/enable.c b/fs/verity/enable.c
index c284f46d1b53..a0e0d49c6ccc 100644
--- a/fs/verity/enable.c
+++ b/fs/verity/enable.c
@@ -371,7 +371,7 @@ int fsverity_ioctl_enable(struct file *filp, const void __user *uarg)
if (err) /* -EROFS */
return err;
- err = deny_write_access(filp);
+ err = deny_write_access(filp, INODE_DENY_WRITE_ALL);
if (err) /* -ETXTBSY */
goto out_drop_write;
@@ -397,7 +397,7 @@ int fsverity_ioctl_enable(struct file *filp, const void __user *uarg)
* allow_write_access() is needed to pair with deny_write_access().
* Regardless, the filesystem won't allow writing to verity files.
*/
- allow_write_access(filp);
+ allow_write_access(filp, INODE_DENY_WRITE_ALL);
out_drop_write:
mnt_drop_write_file(filp);
return err;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 0283cf366c2a..f0720cd0ab45 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -620,6 +620,22 @@ is_uncached_acl(struct posix_acl *acl)
#define IOP_XATTR 0x0008
#define IOP_DEFAULT_READLINK 0x0010
+/*
+ * These are used with the *write_access helpers.
+ *
+ * INODE_DENY_WRITE_ALL - Do not allow writes at all.
+ * INODE_DENY_WRITE_EXEC - Do not allow writes because we are
+ * exec'ing the file. This can be bypassed
+ * in cases where the file needs to be
+ * filled in via the fanotify pre-content
+ * hooks.
+ */
+enum inode_write_level {
+ INODE_DENY_WRITE_ALL,
+ INODE_DENY_WRITE_EXEC,
+ INODE_DENY_WRITE_LEVEL,
+};
+
/*
* Keep mostly read-only and often accessed (especially for
* the RCU path lookup and 'stat' data) fields at the beginning
@@ -701,7 +717,7 @@ struct inode {
atomic64_t i_sequence; /* see futex */
atomic_t i_count;
atomic_t i_dio_count;
- atomic_t i_writecount;
+ atomic_t i_writecount[INODE_DENY_WRITE_LEVEL];
#if defined(CONFIG_IMA) || defined(CONFIG_FILE_LOCKING)
atomic_t i_readcount; /* struct files open RO */
#endif
@@ -2920,38 +2936,90 @@ static inline void kiocb_end_write(struct kiocb *iocb)
* put_write_access() releases this write permission.
* deny_write_access() denies write access to a file.
* allow_write_access() re-enables write access to a file.
+ * __get_write_access_level() gets write permission for a file for the given
+ * level.
+ * put_write_access_level() releases the level write permission.
*
- * The i_writecount field of an inode can have the following values:
+ * The level indicates which level we're denying writes for. Some levels are
+ * allowed to be bypassed in special circumstances. In the cases that the write
+ * level needs to be bypassed the user must use the
+ * get_write_access_level()/put_write_access_level() pairs of the helpers, which
+ * will allow the user to bypass the given level, but none of the others.
+ *
+ * The i_writecount[level] field of an inode can have the following values:
* 0: no write access, no denied write access
- * < 0: (-i_writecount) users that denied write access to the file.
- * > 0: (i_writecount) users that have write access to the file.
+ * < 0: (-i_writecount[level]) users that denied write access to the file.
+ * > 0: (i_writecount[level]) users that have write access to the file.
*
* Normally we operate on that counter with atomic_{inc,dec} and it's safe
* except for the cases where we don't hold i_writecount yet. Then we need to
* use {get,deny}_write_access() - these functions check the sign and refuse
* to do the change if sign is wrong.
*/
+static inline int __get_write_access(struct inode *inode,
+ enum inode_write_level skip_level)
+{
+ int i;
+
+ /* We are not allowed to skip INODE_DENY_WRITE_ALL. */
+ WARN_ON_ONCE(skip_level == INODE_DENY_WRITE_ALL);
+
+ for (i = 0; i < INODE_DENY_WRITE_LEVEL; i++) {
+ if (i == skip_level)
+ continue;
+ if (!atomic_inc_unless_negative(&inode->i_writecount[i]))
+ goto fail;
+ }
+ return 0;
+fail:
+ while (--i >= 0) {
+ if (i == skip_level)
+ continue;
+ atomic_dec(&inode->i_writecount[i]);
+ }
+ return -ETXTBSY;
+}
static inline int get_write_access(struct inode *inode)
{
- return atomic_inc_unless_negative(&inode->i_writecount) ? 0 : -ETXTBSY;
+ return __get_write_access(inode, INODE_DENY_WRITE_LEVEL);
}
-static inline int deny_write_access(struct file *file)
+static inline int get_write_access_level(struct inode *inode,
+ enum inode_write_level skip_level)
+{
+ return __get_write_access(inode, skip_level);
+}
+static inline int deny_write_access(struct file *file,
+ enum inode_write_level level)
{
struct inode *inode = file_inode(file);
- return atomic_dec_unless_positive(&inode->i_writecount) ? 0 : -ETXTBSY;
+ return atomic_dec_unless_positive(&inode->i_writecount[level]) ? 0 : -ETXTBSY;
+}
+static inline void __put_write_access(struct inode *inode,
+ enum inode_write_level skip_level)
+{
+ for (int i = 0; i < INODE_DENY_WRITE_LEVEL; i++) {
+ if (i == skip_level)
+ continue;
+ atomic_dec(&inode->i_writecount[i]);
+ }
}
static inline void put_write_access(struct inode * inode)
{
- atomic_dec(&inode->i_writecount);
+ __put_write_access(inode, INODE_DENY_WRITE_LEVEL);
}
-static inline void allow_write_access(struct file *file)
+static inline void put_write_access_level(struct inode *inode,
+ enum inode_write_level skip_level)
+{
+ __put_write_access(inode, skip_level);
+}
+static inline void allow_write_access(struct file *file, enum inode_write_level level)
{
if (file)
- atomic_inc(&file_inode(file)->i_writecount);
+ atomic_inc(&file_inode(file)->i_writecount[level]);
}
static inline bool inode_is_open_for_write(const struct inode *inode)
{
- return atomic_read(&inode->i_writecount) > 0;
+ return atomic_read(&inode->i_writecount[INODE_DENY_WRITE_ALL]) > 0;
}
#if defined(CONFIG_IMA) || defined(CONFIG_FILE_LOCKING)
diff --git a/include/trace/events/filelock.h b/include/trace/events/filelock.h
index b8d1e00a7982..ad076e87a956 100644
--- a/include/trace/events/filelock.h
+++ b/include/trace/events/filelock.h
@@ -187,7 +187,7 @@ TRACE_EVENT(generic_add_lease,
TP_fast_assign(
__entry->s_dev = inode->i_sb->s_dev;
__entry->i_ino = inode->i_ino;
- __entry->wcount = atomic_read(&inode->i_writecount);
+ __entry->wcount = atomic_read(&inode->i_writecount[INODE_DENY_WRITE_ALL]);
__entry->rcount = atomic_read(&inode->i_readcount);
__entry->icount = atomic_read(&inode->i_count);
__entry->owner = fl->c.flc_owner;
diff --git a/kernel/fork.c b/kernel/fork.c
index 99076dbe27d8..b6dc7aed2ebf 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -620,7 +620,7 @@ static void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm)
* We depend on the oldmm having properly denied write access to the
* exe_file already.
*/
- if (exe_file && deny_write_access(exe_file))
+ if (exe_file && deny_write_access(exe_file, INODE_DENY_WRITE_EXEC))
pr_warn_once("deny_write_access() failed in %s\n", __func__);
}
@@ -1417,13 +1417,14 @@ int set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file)
* We expect the caller (i.e., sys_execve) to already denied
* write access, so this is unlikely to fail.
*/
- if (unlikely(deny_write_access(new_exe_file)))
+ if (unlikely(deny_write_access(new_exe_file,
+ INODE_DENY_WRITE_EXEC)))
return -EACCES;
get_file(new_exe_file);
}
rcu_assign_pointer(mm->exe_file, new_exe_file);
if (old_exe_file) {
- allow_write_access(old_exe_file);
+ allow_write_access(old_exe_file, INODE_DENY_WRITE_EXEC);
fput(old_exe_file);
}
return 0;
@@ -1464,7 +1465,7 @@ int replace_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file)
return ret;
}
- ret = deny_write_access(new_exe_file);
+ ret = deny_write_access(new_exe_file, INODE_DENY_WRITE_EXEC);
if (ret)
return -EACCES;
get_file(new_exe_file);
@@ -1476,7 +1477,7 @@ int replace_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file)
mmap_write_unlock(mm);
if (old_exe_file) {
- allow_write_access(old_exe_file);
+ allow_write_access(old_exe_file, INODE_DENY_WRITE_EXEC);
fput(old_exe_file);
}
return 0;
diff --git a/security/integrity/evm/evm_main.c b/security/integrity/evm/evm_main.c
index 62fe66dd53ce..b83dfb295d85 100644
--- a/security/integrity/evm/evm_main.c
+++ b/security/integrity/evm/evm_main.c
@@ -1084,7 +1084,7 @@ static void evm_file_release(struct file *file)
if (!S_ISREG(inode->i_mode) || !(mode & FMODE_WRITE))
return;
- if (iint && atomic_read(&inode->i_writecount) == 1)
+ if (iint && atomic_read(&inode->i_writecount[INODE_DENY_WRITE_ALL]) == 1)
iint->flags &= ~EVM_NEW_FILE;
}
diff --git a/security/integrity/ima/ima_main.c b/security/integrity/ima/ima_main.c
index f04f43af651c..7aed80a70692 100644
--- a/security/integrity/ima/ima_main.c
+++ b/security/integrity/ima/ima_main.c
@@ -164,7 +164,7 @@ static void ima_check_last_writer(struct ima_iint_cache *iint,
return;
mutex_lock(&iint->mutex);
- if (atomic_read(&inode->i_writecount) == 1) {
+ if (atomic_read(&inode->i_writecount[INODE_DENY_WRITE_ALL]) == 1) {
struct kstat stat;
update = test_and_clear_bit(IMA_UPDATE_XATTR,
--
2.43.0
^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [PATCH][RFC] fs: add levels to inode write access
2024-05-29 20:41 [PATCH][RFC] fs: add levels to inode write access Josef Bacik
@ 2024-05-29 22:00 ` Jeff Layton
2024-05-30 1:14 ` Matthew Wilcox
2024-05-30 10:32 ` Christian Brauner
2 siblings, 0 replies; 27+ messages in thread
From: Jeff Layton @ 2024-05-29 22:00 UTC (permalink / raw)
To: Josef Bacik, linux-fsdevel, brauner, viro, jack, david, amir73il,
hch
On Wed, 2024-05-29 at 16:41 -0400, Josef Bacik wrote:
> NOTE:
> This is compile tested only. It's also based on an out of tree
> feature branch
> from Amir that I'm extending to add page fault content events to
> allow us to
> have on-demand binary loading at exec time. If you need more context
> please let
> me know, I'll push my current branch somewhere and wire up how I plan
> to use
> this patch so you can see it in better context, but hopefully I've
> described
> what I'm trying to accomplish enough that this leads to useful
> discussion.
>
>
> Currently we have ->i_writecount to control write access to an inode.
> Callers may deny write access by calling deny_write_access(), which
> will
> cause ->i_writecount to go negative, and allow_write_access() to push
> it
> back up to 0.
>
> This is used in a few ways, the biggest user being exec.
> Historically
> we've blocked write access to files that are opened for executing.
> fsverity is the other notable user.
>
> With the upcoming fanotify feature that allows for on-demand
> population
> of files, this blanket policy of denying writes to files opened for
> executing creates a problem. We have to issue events right before
> denying access, and the entire file must be populated before we can
> continue with the exec.
>
> This creates a problem for users who have large binaries they want to
> populate on demand. Inside of Meta we often have multi-gigabyte
> binaries, even on the order of tens of gigabytes. Pre-loading these
> large files is costly, especially when large portions of the binary
> may
> never be read (think debuginfo).
>
> In order to facilitate on-demand loading of binaries we need to have
> a
> way to punch a hole in this exec related write denial.
>
> This patch accomplishes this by doing something similar to the freeze
> levels on the super block. We have two levels, one for all write
> denial, and one for exec. People wishing to deny writes will specify
> the level they're denying. Users wishing to get write access must go
> through all of the levels and increment each levels counter until it
> increments them all, or encounters a level that is currently denied,
> at
> which point they undo their increments and return an error.
>
> Future patches will use the get_write_access_level() helper in order
> to
> skip the level they wish to be allowed to access. Any inode being
> populated via the pre-content fanotify mechanism will be marked with
> a
> special flag, and the open path will use get_write_access_level() in
> order to bypass the exec restriction.
>
> This is a significant behavior change, as it allows us to write to a
> file that is currently being exec'ed. However this will be limited
> to a
> very narrow use case.
>
>
Given that this is a change in behavior, should this be behind Kconfig
option or a sysctl or something?
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
> drivers/md/md.c | 2 +-
> fs/binfmt_elf.c | 4 +-
> fs/exec.c | 6 +--
> fs/ext4/file.c | 4 +-
> fs/inode.c | 3 +-
> fs/kernel_read_file.c | 4 +-
> fs/locks.c | 2 +-
> fs/quota/dquot.c | 2 +-
> fs/verity/enable.c | 4 +-
> include/linux/fs.h | 90 +++++++++++++++++++++++++++--
> --
> include/trace/events/filelock.h | 2 +-
> kernel/fork.c | 11 ++--
> security/integrity/evm/evm_main.c | 2 +-
> security/integrity/ima/ima_main.c | 2 +-
> 14 files changed, 104 insertions(+), 34 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index aff9118ff697..134cefbd7bef 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -7231,7 +7231,7 @@ static int set_bitmap_file(struct mddev *mddev,
> int fd)
> pr_warn("%s: error: bitmap file must open
> for write\n",
> mdname(mddev));
> err = -EBADF;
> - } else if (atomic_read(&inode->i_writecount) != 1) {
> + } else if (atomic_read(&inode-
> >i_writecount[INODE_DENY_WRITE_ALL]) != 1) {
> pr_warn("%s: error: bitmap file is already
> in use\n",
> mdname(mddev));
> err = -EBUSY;
> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> index a43897b03ce9..9a6fcf8ba17c 100644
> --- a/fs/binfmt_elf.c
> +++ b/fs/binfmt_elf.c
> @@ -1216,7 +1216,7 @@ static int load_elf_binary(struct linux_binprm
> *bprm)
> }
> reloc_func_desc = interp_load_addr;
>
> - allow_write_access(interpreter);
> + allow_write_access(interpreter,
> INODE_DENY_WRITE_EXEC);
> fput(interpreter);
>
> kfree(interp_elf_ex);
> @@ -1308,7 +1308,7 @@ static int load_elf_binary(struct linux_binprm
> *bprm)
> kfree(interp_elf_ex);
> kfree(interp_elf_phdata);
> out_free_file:
> - allow_write_access(interpreter);
> + allow_write_access(interpreter, INODE_DENY_WRITE_EXEC);
> if (interpreter)
> fput(interpreter);
> out_free_ph:
> diff --git a/fs/exec.c b/fs/exec.c
> index 18f057ba64a5..6b2900ee448d 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -971,7 +971,7 @@ static struct file *do_open_execat(int fd, struct
> filename *name, int flags)
> if (err)
> goto exit;
>
> - err = deny_write_access(file);
> + err = deny_write_access(file, INODE_DENY_WRITE_EXEC);
> if (err)
> goto exit;
>
> @@ -1545,7 +1545,7 @@ static void do_close_execat(struct file *file)
> {
> if (!file)
> return;
> - allow_write_access(file);
> + allow_write_access(file, INODE_DENY_WRITE_EXEC);
> fput(file);
> }
>
> @@ -1865,7 +1865,7 @@ static int exec_binprm(struct linux_binprm
> *bprm)
> bprm->file = bprm->interpreter;
> bprm->interpreter = NULL;
>
> - allow_write_access(exec);
> + allow_write_access(exec, INODE_DENY_WRITE_EXEC);
> if (unlikely(bprm->have_execfd)) {
> if (bprm->executable) {
> fput(exec);
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index c89e434db6b7..6196f449649c 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -171,8 +171,8 @@ static int ext4_release_file(struct inode *inode,
> struct file *filp)
> }
> /* if we are the last writer on the inode, drop the block
> reservation */
> if ((filp->f_mode & FMODE_WRITE) &&
> - (atomic_read(&inode->i_writecount) == 1) &&
> - !EXT4_I(inode)->i_reserved_data_blocks) {
> + (atomic_read(&inode->i_writecount[INODE_DENY_WRITE_ALL])
> == 1) &&
> + !EXT4_I(inode)->i_reserved_data_blocks) {
> down_write(&EXT4_I(inode)->i_data_sem);
> ext4_discard_preallocations(inode);
> up_write(&EXT4_I(inode)->i_data_sem);
> diff --git a/fs/inode.c b/fs/inode.c
> index 3a41f83a4ba5..fb6155412252 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -173,7 +173,8 @@ int inode_init_always(struct super_block *sb,
> struct inode *inode)
> inode->i_opflags |= IOP_XATTR;
> i_uid_write(inode, 0);
> i_gid_write(inode, 0);
> - atomic_set(&inode->i_writecount, 0);
> + for (int i = 0; i < INODE_DENY_WRITE_LEVEL; i++)
> + atomic_set(&inode->i_writecount[i], 0);
> inode->i_size = 0;
> inode->i_write_hint = WRITE_LIFE_NOT_SET;
> inode->i_blocks = 0;
> diff --git a/fs/kernel_read_file.c b/fs/kernel_read_file.c
> index c429c42a6867..9af82d20aa1f 100644
> --- a/fs/kernel_read_file.c
> +++ b/fs/kernel_read_file.c
> @@ -48,7 +48,7 @@ ssize_t kernel_read_file(struct file *file, loff_t
> offset, void **buf,
> if (!S_ISREG(file_inode(file)->i_mode))
> return -EINVAL;
>
> - ret = deny_write_access(file);
> + ret = deny_write_access(file, INODE_DENY_WRITE_ALL);
> if (ret)
> return ret;
>
> @@ -119,7 +119,7 @@ ssize_t kernel_read_file(struct file *file,
> loff_t offset, void **buf,
> }
>
> out:
> - allow_write_access(file);
> + allow_write_access(file, INODE_DENY_WRITE_ALL);
> return ret == 0 ? copied : ret;
> }
> EXPORT_SYMBOL_GPL(kernel_read_file);
> diff --git a/fs/locks.c b/fs/locks.c
> index 90c8746874de..3e6a62f56528 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -1763,7 +1763,7 @@ check_conflicting_open(struct file *filp, const
> int arg, int flags)
> else if (filp->f_mode & FMODE_READ)
> self_rcount = 1;
>
> - if (atomic_read(&inode->i_writecount) != self_wcount ||
> + if (atomic_read(&inode->i_writecount[INODE_DENY_WRITE_ALL])
> != self_wcount ||
> atomic_read(&inode->i_readcount) != self_rcount)
> return -EAGAIN;
>
> diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
> index 627eb2f72ef3..fa470d76f0b3 100644
> --- a/fs/quota/dquot.c
> +++ b/fs/quota/dquot.c
> @@ -1031,7 +1031,7 @@ static int add_dquot_ref(struct super_block
> *sb, int type)
> list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> spin_lock(&inode->i_lock);
> if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
> ||
> - !atomic_read(&inode->i_writecount) ||
> + !inode_is_open_for_write(inode) ||
> !dqinit_needed(inode, type)) {
> spin_unlock(&inode->i_lock);
> continue;
> diff --git a/fs/verity/enable.c b/fs/verity/enable.c
> index c284f46d1b53..a0e0d49c6ccc 100644
> --- a/fs/verity/enable.c
> +++ b/fs/verity/enable.c
> @@ -371,7 +371,7 @@ int fsverity_ioctl_enable(struct file *filp,
> const void __user *uarg)
> if (err) /* -EROFS */
> return err;
>
> - err = deny_write_access(filp);
> + err = deny_write_access(filp, INODE_DENY_WRITE_ALL);
> if (err) /* -ETXTBSY */
> goto out_drop_write;
>
> @@ -397,7 +397,7 @@ int fsverity_ioctl_enable(struct file *filp,
> const void __user *uarg)
> * allow_write_access() is needed to pair with
> deny_write_access().
> * Regardless, the filesystem won't allow writing to verity
> files.
> */
> - allow_write_access(filp);
> + allow_write_access(filp, INODE_DENY_WRITE_ALL);
> out_drop_write:
> mnt_drop_write_file(filp);
> return err;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 0283cf366c2a..f0720cd0ab45 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -620,6 +620,22 @@ is_uncached_acl(struct posix_acl *acl)
> #define IOP_XATTR 0x0008
> #define IOP_DEFAULT_READLINK 0x0010
>
> +/*
> + * These are used with the *write_access helpers.
> + *
> + * INODE_DENY_WRITE_ALL - Do not allow writes
> at all.
> + * INODE_DENY_WRITE_EXEC - Do not allow writes because
> we are
> + * exec'ing the file. This can
> be bypassed
> + * in cases where the file
> needs to be
> + * filled in via the fanotify
> pre-content
> + * hooks.
> + */
> +enum inode_write_level {
> + INODE_DENY_WRITE_ALL,
> + INODE_DENY_WRITE_EXEC,
> + INODE_DENY_WRITE_LEVEL,
> +};
> +
> /*
> * Keep mostly read-only and often accessed (especially for
> * the RCU path lookup and 'stat' data) fields at the beginning
> @@ -701,7 +717,7 @@ struct inode {
> atomic64_t i_sequence; /* see futex */
> atomic_t i_count;
> atomic_t i_dio_count;
> - atomic_t i_writecount;
> + atomic_t i_writecount[INODE_DENY_WRITE_LEVEL]
> ;
Rather than turning this into an array, I think this would be easier to
reason about as i_execcount. Then you could add a new
allow/deny_exec_access(), and leave most of the existing write_access
helpers alone, etc.
> #if defined(CONFIG_IMA) || defined(CONFIG_FILE_LOCKING)
> atomic_t i_readcount; /* struct files open RO
> */
> #endif
> @@ -2920,38 +2936,90 @@ static inline void kiocb_end_write(struct
> kiocb *iocb)
> * put_write_access() releases this write permission.
> * deny_write_access() denies write access to a file.
> * allow_write_access() re-enables write access to a file.
> + * __get_write_access_level() gets write permission for a file for
> the given
> + * level.
> + * put_write_access_level() releases the level write permission.
> *
> - * The i_writecount field of an inode can have the following values:
> + * The level indicates which level we're denying writes for. Some
> levels are
> + * allowed to be bypassed in special circumstances. In the cases
> that the write
> + * level needs to be bypassed the user must use the
> + * get_write_access_level()/put_write_access_level() pairs of the
> helpers, which
> + * will allow the user to bypass the given level, but none of the
> others.
> + *
> + * The i_writecount[level] field of an inode can have the following
> values:
> * 0: no write access, no denied write access
> - * < 0: (-i_writecount) users that denied write access to the file.
> - * > 0: (i_writecount) users that have write access to the file.
> + * < 0: (-i_writecount[level]) users that denied write access to the
> file.
> + * > 0: (i_writecount[level]) users that have write access to the
> file.
> *
> * Normally we operate on that counter with atomic_{inc,dec} and
> it's safe
> * except for the cases where we don't hold i_writecount yet. Then
> we need to
> * use {get,deny}_write_access() - these functions check the sign
> and refuse
> * to do the change if sign is wrong.
> */
> +static inline int __get_write_access(struct inode *inode,
> + enum inode_write_level
> skip_level)
> +{
> + int i;
> +
> + /* We are not allowed to skip INODE_DENY_WRITE_ALL. */
> + WARN_ON_ONCE(skip_level == INODE_DENY_WRITE_ALL);
> +
> + for (i = 0; i < INODE_DENY_WRITE_LEVEL; i++) {
> + if (i == skip_level)
> + continue;
> + if (!atomic_inc_unless_negative(&inode-
> >i_writecount[i]))
> + goto fail;
> + }
> + return 0;
> +fail:
> + while (--i >= 0) {
> + if (i == skip_level)
> + continue;
> + atomic_dec(&inode->i_writecount[i]);
> + }
> + return -ETXTBSY;
> +}
> static inline int get_write_access(struct inode *inode)
> {
> - return atomic_inc_unless_negative(&inode->i_writecount) ? 0
> : -ETXTBSY;
> + return __get_write_access(inode, INODE_DENY_WRITE_LEVEL);
> }
> -static inline int deny_write_access(struct file *file)
> +static inline int get_write_access_level(struct inode *inode,
> + enum inode_write_level
> skip_level)
> +{
> + return __get_write_access(inode, skip_level);
> +}
> +static inline int deny_write_access(struct file *file,
> + enum inode_write_level level)
> {
> struct inode *inode = file_inode(file);
> - return atomic_dec_unless_positive(&inode->i_writecount) ? 0
> : -ETXTBSY;
> + return atomic_dec_unless_positive(&inode-
> >i_writecount[level]) ? 0 : -ETXTBSY;
> +}
> +static inline void __put_write_access(struct inode *inode,
> + enum inode_write_level
> skip_level)
> +{
> + for (int i = 0; i < INODE_DENY_WRITE_LEVEL; i++) {
> + if (i == skip_level)
> + continue;
> + atomic_dec(&inode->i_writecount[i]);
> + }
> }
> static inline void put_write_access(struct inode * inode)
> {
> - atomic_dec(&inode->i_writecount);
> + __put_write_access(inode, INODE_DENY_WRITE_LEVEL);
> }
> -static inline void allow_write_access(struct file *file)
> +static inline void put_write_access_level(struct inode *inode,
> + enum inode_write_level
> skip_level)
> +{
> + __put_write_access(inode, skip_level);
> +}
> +static inline void allow_write_access(struct file *file, enum
> inode_write_level level)
> {
> if (file)
> - atomic_inc(&file_inode(file)->i_writecount);
> + atomic_inc(&file_inode(file)->i_writecount[level]);
> }
> static inline bool inode_is_open_for_write(const struct inode
> *inode)
> {
> - return atomic_read(&inode->i_writecount) > 0;
> + return atomic_read(&inode-
> >i_writecount[INODE_DENY_WRITE_ALL]) > 0;
> }
>
> #if defined(CONFIG_IMA) || defined(CONFIG_FILE_LOCKING)
> diff --git a/include/trace/events/filelock.h
> b/include/trace/events/filelock.h
> index b8d1e00a7982..ad076e87a956 100644
> --- a/include/trace/events/filelock.h
> +++ b/include/trace/events/filelock.h
> @@ -187,7 +187,7 @@ TRACE_EVENT(generic_add_lease,
> TP_fast_assign(
> __entry->s_dev = inode->i_sb->s_dev;
> __entry->i_ino = inode->i_ino;
> - __entry->wcount = atomic_read(&inode->i_writecount);
> + __entry->wcount = atomic_read(&inode-
> >i_writecount[INODE_DENY_WRITE_ALL]);
> __entry->rcount = atomic_read(&inode->i_readcount);
> __entry->icount = atomic_read(&inode->i_count);
> __entry->owner = fl->c.flc_owner;
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 99076dbe27d8..b6dc7aed2ebf 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -620,7 +620,7 @@ static void dup_mm_exe_file(struct mm_struct *mm,
> struct mm_struct *oldmm)
> * We depend on the oldmm having properly denied write
> access to the
> * exe_file already.
> */
> - if (exe_file && deny_write_access(exe_file))
> + if (exe_file && deny_write_access(exe_file,
> INODE_DENY_WRITE_EXEC))
> pr_warn_once("deny_write_access() failed in %s\n",
> __func__);
> }
>
> @@ -1417,13 +1417,14 @@ int set_mm_exe_file(struct mm_struct *mm,
> struct file *new_exe_file)
> * We expect the caller (i.e., sys_execve) to
> already denied
> * write access, so this is unlikely to fail.
> */
> - if (unlikely(deny_write_access(new_exe_file)))
> + if (unlikely(deny_write_access(new_exe_file,
> +
> INODE_DENY_WRITE_EXEC)))
> return -EACCES;
> get_file(new_exe_file);
> }
> rcu_assign_pointer(mm->exe_file, new_exe_file);
> if (old_exe_file) {
> - allow_write_access(old_exe_file);
> + allow_write_access(old_exe_file,
> INODE_DENY_WRITE_EXEC);
> fput(old_exe_file);
> }
> return 0;
> @@ -1464,7 +1465,7 @@ int replace_mm_exe_file(struct mm_struct *mm,
> struct file *new_exe_file)
> return ret;
> }
>
> - ret = deny_write_access(new_exe_file);
> + ret = deny_write_access(new_exe_file,
> INODE_DENY_WRITE_EXEC);
> if (ret)
> return -EACCES;
> get_file(new_exe_file);
> @@ -1476,7 +1477,7 @@ int replace_mm_exe_file(struct mm_struct *mm,
> struct file *new_exe_file)
> mmap_write_unlock(mm);
>
> if (old_exe_file) {
> - allow_write_access(old_exe_file);
> + allow_write_access(old_exe_file,
> INODE_DENY_WRITE_EXEC);
> fput(old_exe_file);
> }
> return 0;
> diff --git a/security/integrity/evm/evm_main.c
> b/security/integrity/evm/evm_main.c
> index 62fe66dd53ce..b83dfb295d85 100644
> --- a/security/integrity/evm/evm_main.c
> +++ b/security/integrity/evm/evm_main.c
> @@ -1084,7 +1084,7 @@ static void evm_file_release(struct file *file)
> if (!S_ISREG(inode->i_mode) || !(mode & FMODE_WRITE))
> return;
>
> - if (iint && atomic_read(&inode->i_writecount) == 1)
> + if (iint && atomic_read(&inode-
> >i_writecount[INODE_DENY_WRITE_ALL]) == 1)
> iint->flags &= ~EVM_NEW_FILE;
> }
>
> diff --git a/security/integrity/ima/ima_main.c
> b/security/integrity/ima/ima_main.c
> index f04f43af651c..7aed80a70692 100644
> --- a/security/integrity/ima/ima_main.c
> +++ b/security/integrity/ima/ima_main.c
> @@ -164,7 +164,7 @@ static void ima_check_last_writer(struct
> ima_iint_cache *iint,
> return;
>
> mutex_lock(&iint->mutex);
> - if (atomic_read(&inode->i_writecount) == 1) {
> + if (atomic_read(&inode->i_writecount[INODE_DENY_WRITE_ALL])
> == 1) {
> struct kstat stat;
>
> update = test_and_clear_bit(IMA_UPDATE_XATTR,
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH][RFC] fs: add levels to inode write access
2024-05-29 20:41 [PATCH][RFC] fs: add levels to inode write access Josef Bacik
2024-05-29 22:00 ` Jeff Layton
@ 2024-05-30 1:14 ` Matthew Wilcox
2024-05-30 10:32 ` Christian Brauner
2 siblings, 0 replies; 27+ messages in thread
From: Matthew Wilcox @ 2024-05-30 1:14 UTC (permalink / raw)
To: Josef Bacik; +Cc: linux-fsdevel, brauner, viro, jack, david, amir73il, hch
On Wed, May 29, 2024 at 04:41:32PM -0400, Josef Bacik wrote:
> +++ b/drivers/md/md.c
> @@ -7231,7 +7231,7 @@ static int set_bitmap_file(struct mddev *mddev, int fd)
> pr_warn("%s: error: bitmap file must open for write\n",
> mdname(mddev));
> err = -EBADF;
> - } else if (atomic_read(&inode->i_writecount) != 1) {
> + } else if (atomic_read(&inode->i_writecount[INODE_DENY_WRITE_ALL]) != 1) {
I think this needs an abstraction because I have no idea what this
means.
} else if (!write_access_ok(inode, INODE_DENY_WRITE_ALL)) {
perhaps?
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH][RFC] fs: add levels to inode write access
2024-05-29 20:41 [PATCH][RFC] fs: add levels to inode write access Josef Bacik
2024-05-29 22:00 ` Jeff Layton
2024-05-30 1:14 ` Matthew Wilcox
@ 2024-05-30 10:32 ` Christian Brauner
2024-05-30 12:57 ` Amir Goldstein
` (3 more replies)
2 siblings, 4 replies; 27+ messages in thread
From: Christian Brauner @ 2024-05-30 10:32 UTC (permalink / raw)
To: Josef Bacik, Linus Torvalds
Cc: linux-fsdevel, viro, jack, david, amir73il, hch
[-- Attachment #1: Type: text/plain, Size: 19069 bytes --]
On Wed, May 29, 2024 at 04:41:32PM -0400, Josef Bacik wrote:
> NOTE:
> This is compile tested only. It's also based on an out of tree feature branch
> from Amir that I'm extending to add page fault content events to allow us to
> have on-demand binary loading at exec time. If you need more context please let
> me know, I'll push my current branch somewhere and wire up how I plan to use
> this patch so you can see it in better context, but hopefully I've described
> what I'm trying to accomplish enough that this leads to useful discussion.
>
>
> Currently we have ->i_writecount to control write access to an inode.
> Callers may deny write access by calling deny_write_access(), which will
> cause ->i_writecount to go negative, and allow_write_access() to push it
> back up to 0.
>
> This is used in a few ways, the biggest user being exec. Historically
> we've blocked write access to files that are opened for executing.
> fsverity is the other notable user.
>
> With the upcoming fanotify feature that allows for on-demand population
> of files, this blanket policy of denying writes to files opened for
> executing creates a problem. We have to issue events right before
> denying access, and the entire file must be populated before we can
> continue with the exec.
>
> This creates a problem for users who have large binaries they want to
> populate on demand. Inside of Meta we often have multi-gigabyte
> binaries, even on the order of tens of gigabytes. Pre-loading these
> large files is costly, especially when large portions of the binary may
> never be read (think debuginfo).
>
> In order to facilitate on-demand loading of binaries we need to have a
> way to punch a hole in this exec related write denial.
Hm. I suggest we try to tackle this in a completely different way. Maybe
I mentioned it during LSFMM but instead of doing this dance we should
try and remove the deny_write_access() mechanisms for exec completely.
Back in 2021 we removed the remaining VM_DENYWRITE bits but left in
deny_write_access() for exec. Back then I had thought that this was a
bit risky for not too much gain. But looking at this code here I think
we now have an even stronger reason to try and get rid of this
restriction. And I've since changed my mind. See my notes on the
_completely untested_ RFC patch I appended.
Ofc depends on whether Linus still agrees that removing this might be
something we could try.
> This patch accomplishes this by doing something similar to the freeze
> levels on the super block. We have two levels, one for all write
> denial, and one for exec. People wishing to deny writes will specify
> the level they're denying. Users wishing to get write access must go
> through all of the levels and increment each levels counter until it
> increments them all, or encounters a level that is currently denied, at
> which point they undo their increments and return an error.
>
> Future patches will use the get_write_access_level() helper in order to
> skip the level they wish to be allowed to access. Any inode being
> populated via the pre-content fanotify mechanism will be marked with a
> special flag, and the open path will use get_write_access_level() in
> order to bypass the exec restriction.
>
> This is a significant behavior change, as it allows us to write to a
> file that is currently being exec'ed. However this will be limited to a
> very narrow use case.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
> drivers/md/md.c | 2 +-
> fs/binfmt_elf.c | 4 +-
> fs/exec.c | 6 +--
> fs/ext4/file.c | 4 +-
> fs/inode.c | 3 +-
> fs/kernel_read_file.c | 4 +-
> fs/locks.c | 2 +-
> fs/quota/dquot.c | 2 +-
> fs/verity/enable.c | 4 +-
> include/linux/fs.h | 90 +++++++++++++++++++++++++++----
> include/trace/events/filelock.h | 2 +-
> kernel/fork.c | 11 ++--
> security/integrity/evm/evm_main.c | 2 +-
> security/integrity/ima/ima_main.c | 2 +-
> 14 files changed, 104 insertions(+), 34 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index aff9118ff697..134cefbd7bef 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -7231,7 +7231,7 @@ static int set_bitmap_file(struct mddev *mddev, int fd)
> pr_warn("%s: error: bitmap file must open for write\n",
> mdname(mddev));
> err = -EBADF;
> - } else if (atomic_read(&inode->i_writecount) != 1) {
> + } else if (atomic_read(&inode->i_writecount[INODE_DENY_WRITE_ALL]) != 1) {
> pr_warn("%s: error: bitmap file is already in use\n",
> mdname(mddev));
> err = -EBUSY;
> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> index a43897b03ce9..9a6fcf8ba17c 100644
> --- a/fs/binfmt_elf.c
> +++ b/fs/binfmt_elf.c
> @@ -1216,7 +1216,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
> }
> reloc_func_desc = interp_load_addr;
>
> - allow_write_access(interpreter);
> + allow_write_access(interpreter, INODE_DENY_WRITE_EXEC);
> fput(interpreter);
>
> kfree(interp_elf_ex);
> @@ -1308,7 +1308,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
> kfree(interp_elf_ex);
> kfree(interp_elf_phdata);
> out_free_file:
> - allow_write_access(interpreter);
> + allow_write_access(interpreter, INODE_DENY_WRITE_EXEC);
> if (interpreter)
> fput(interpreter);
> out_free_ph:
> diff --git a/fs/exec.c b/fs/exec.c
> index 18f057ba64a5..6b2900ee448d 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -971,7 +971,7 @@ static struct file *do_open_execat(int fd, struct filename *name, int flags)
> if (err)
> goto exit;
>
> - err = deny_write_access(file);
> + err = deny_write_access(file, INODE_DENY_WRITE_EXEC);
> if (err)
> goto exit;
>
> @@ -1545,7 +1545,7 @@ static void do_close_execat(struct file *file)
> {
> if (!file)
> return;
> - allow_write_access(file);
> + allow_write_access(file, INODE_DENY_WRITE_EXEC);
> fput(file);
> }
>
> @@ -1865,7 +1865,7 @@ static int exec_binprm(struct linux_binprm *bprm)
> bprm->file = bprm->interpreter;
> bprm->interpreter = NULL;
>
> - allow_write_access(exec);
> + allow_write_access(exec, INODE_DENY_WRITE_EXEC);
> if (unlikely(bprm->have_execfd)) {
> if (bprm->executable) {
> fput(exec);
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index c89e434db6b7..6196f449649c 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -171,8 +171,8 @@ static int ext4_release_file(struct inode *inode, struct file *filp)
> }
> /* if we are the last writer on the inode, drop the block reservation */
> if ((filp->f_mode & FMODE_WRITE) &&
> - (atomic_read(&inode->i_writecount) == 1) &&
> - !EXT4_I(inode)->i_reserved_data_blocks) {
> + (atomic_read(&inode->i_writecount[INODE_DENY_WRITE_ALL]) == 1) &&
> + !EXT4_I(inode)->i_reserved_data_blocks) {
> down_write(&EXT4_I(inode)->i_data_sem);
> ext4_discard_preallocations(inode);
> up_write(&EXT4_I(inode)->i_data_sem);
> diff --git a/fs/inode.c b/fs/inode.c
> index 3a41f83a4ba5..fb6155412252 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -173,7 +173,8 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
> inode->i_opflags |= IOP_XATTR;
> i_uid_write(inode, 0);
> i_gid_write(inode, 0);
> - atomic_set(&inode->i_writecount, 0);
> + for (int i = 0; i < INODE_DENY_WRITE_LEVEL; i++)
> + atomic_set(&inode->i_writecount[i], 0);
> inode->i_size = 0;
> inode->i_write_hint = WRITE_LIFE_NOT_SET;
> inode->i_blocks = 0;
> diff --git a/fs/kernel_read_file.c b/fs/kernel_read_file.c
> index c429c42a6867..9af82d20aa1f 100644
> --- a/fs/kernel_read_file.c
> +++ b/fs/kernel_read_file.c
> @@ -48,7 +48,7 @@ ssize_t kernel_read_file(struct file *file, loff_t offset, void **buf,
> if (!S_ISREG(file_inode(file)->i_mode))
> return -EINVAL;
>
> - ret = deny_write_access(file);
> + ret = deny_write_access(file, INODE_DENY_WRITE_ALL);
> if (ret)
> return ret;
>
> @@ -119,7 +119,7 @@ ssize_t kernel_read_file(struct file *file, loff_t offset, void **buf,
> }
>
> out:
> - allow_write_access(file);
> + allow_write_access(file, INODE_DENY_WRITE_ALL);
> return ret == 0 ? copied : ret;
> }
> EXPORT_SYMBOL_GPL(kernel_read_file);
> diff --git a/fs/locks.c b/fs/locks.c
> index 90c8746874de..3e6a62f56528 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -1763,7 +1763,7 @@ check_conflicting_open(struct file *filp, const int arg, int flags)
> else if (filp->f_mode & FMODE_READ)
> self_rcount = 1;
>
> - if (atomic_read(&inode->i_writecount) != self_wcount ||
> + if (atomic_read(&inode->i_writecount[INODE_DENY_WRITE_ALL]) != self_wcount ||
> atomic_read(&inode->i_readcount) != self_rcount)
> return -EAGAIN;
>
> diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
> index 627eb2f72ef3..fa470d76f0b3 100644
> --- a/fs/quota/dquot.c
> +++ b/fs/quota/dquot.c
> @@ -1031,7 +1031,7 @@ static int add_dquot_ref(struct super_block *sb, int type)
> list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> spin_lock(&inode->i_lock);
> if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
> - !atomic_read(&inode->i_writecount) ||
> + !inode_is_open_for_write(inode) ||
> !dqinit_needed(inode, type)) {
> spin_unlock(&inode->i_lock);
> continue;
> diff --git a/fs/verity/enable.c b/fs/verity/enable.c
> index c284f46d1b53..a0e0d49c6ccc 100644
> --- a/fs/verity/enable.c
> +++ b/fs/verity/enable.c
> @@ -371,7 +371,7 @@ int fsverity_ioctl_enable(struct file *filp, const void __user *uarg)
> if (err) /* -EROFS */
> return err;
>
> - err = deny_write_access(filp);
> + err = deny_write_access(filp, INODE_DENY_WRITE_ALL);
> if (err) /* -ETXTBSY */
> goto out_drop_write;
>
> @@ -397,7 +397,7 @@ int fsverity_ioctl_enable(struct file *filp, const void __user *uarg)
> * allow_write_access() is needed to pair with deny_write_access().
> * Regardless, the filesystem won't allow writing to verity files.
> */
> - allow_write_access(filp);
> + allow_write_access(filp, INODE_DENY_WRITE_ALL);
> out_drop_write:
> mnt_drop_write_file(filp);
> return err;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 0283cf366c2a..f0720cd0ab45 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -620,6 +620,22 @@ is_uncached_acl(struct posix_acl *acl)
> #define IOP_XATTR 0x0008
> #define IOP_DEFAULT_READLINK 0x0010
>
> +/*
> + * These are used with the *write_access helpers.
> + *
> + * INODE_DENY_WRITE_ALL - Do not allow writes at all.
> + * INODE_DENY_WRITE_EXEC - Do not allow writes because we are
> + * exec'ing the file. This can be bypassed
> + * in cases where the file needs to be
> + * filled in via the fanotify pre-content
> + * hooks.
> + */
> +enum inode_write_level {
> + INODE_DENY_WRITE_ALL,
> + INODE_DENY_WRITE_EXEC,
> + INODE_DENY_WRITE_LEVEL,
> +};
> +
> /*
> * Keep mostly read-only and often accessed (especially for
> * the RCU path lookup and 'stat' data) fields at the beginning
> @@ -701,7 +717,7 @@ struct inode {
> atomic64_t i_sequence; /* see futex */
> atomic_t i_count;
> atomic_t i_dio_count;
> - atomic_t i_writecount;
> + atomic_t i_writecount[INODE_DENY_WRITE_LEVEL];
> #if defined(CONFIG_IMA) || defined(CONFIG_FILE_LOCKING)
> atomic_t i_readcount; /* struct files open RO */
> #endif
> @@ -2920,38 +2936,90 @@ static inline void kiocb_end_write(struct kiocb *iocb)
> * put_write_access() releases this write permission.
> * deny_write_access() denies write access to a file.
> * allow_write_access() re-enables write access to a file.
> + * __get_write_access_level() gets write permission for a file for the given
> + * level.
> + * put_write_access_level() releases the level write permission.
> *
> - * The i_writecount field of an inode can have the following values:
> + * The level indicates which level we're denying writes for. Some levels are
> + * allowed to be bypassed in special circumstances. In the cases that the write
> + * level needs to be bypassed the user must use the
> + * get_write_access_level()/put_write_access_level() pairs of the helpers, which
> + * will allow the user to bypass the given level, but none of the others.
> + *
> + * The i_writecount[level] field of an inode can have the following values:
> * 0: no write access, no denied write access
> - * < 0: (-i_writecount) users that denied write access to the file.
> - * > 0: (i_writecount) users that have write access to the file.
> + * < 0: (-i_writecount[level]) users that denied write access to the file.
> + * > 0: (i_writecount[level]) users that have write access to the file.
> *
> * Normally we operate on that counter with atomic_{inc,dec} and it's safe
> * except for the cases where we don't hold i_writecount yet. Then we need to
> * use {get,deny}_write_access() - these functions check the sign and refuse
> * to do the change if sign is wrong.
> */
> +static inline int __get_write_access(struct inode *inode,
> + enum inode_write_level skip_level)
> +{
> + int i;
> +
> + /* We are not allowed to skip INODE_DENY_WRITE_ALL. */
> + WARN_ON_ONCE(skip_level == INODE_DENY_WRITE_ALL);
> +
> + for (i = 0; i < INODE_DENY_WRITE_LEVEL; i++) {
> + if (i == skip_level)
> + continue;
> + if (!atomic_inc_unless_negative(&inode->i_writecount[i]))
> + goto fail;
> + }
> + return 0;
> +fail:
> + while (--i >= 0) {
> + if (i == skip_level)
> + continue;
> + atomic_dec(&inode->i_writecount[i]);
> + }
> + return -ETXTBSY;
> +}
> static inline int get_write_access(struct inode *inode)
> {
> - return atomic_inc_unless_negative(&inode->i_writecount) ? 0 : -ETXTBSY;
> + return __get_write_access(inode, INODE_DENY_WRITE_LEVEL);
> }
> -static inline int deny_write_access(struct file *file)
> +static inline int get_write_access_level(struct inode *inode,
> + enum inode_write_level skip_level)
> +{
> + return __get_write_access(inode, skip_level);
> +}
> +static inline int deny_write_access(struct file *file,
> + enum inode_write_level level)
> {
> struct inode *inode = file_inode(file);
> - return atomic_dec_unless_positive(&inode->i_writecount) ? 0 : -ETXTBSY;
> + return atomic_dec_unless_positive(&inode->i_writecount[level]) ? 0 : -ETXTBSY;
> +}
> +static inline void __put_write_access(struct inode *inode,
> + enum inode_write_level skip_level)
> +{
> + for (int i = 0; i < INODE_DENY_WRITE_LEVEL; i++) {
> + if (i == skip_level)
> + continue;
> + atomic_dec(&inode->i_writecount[i]);
> + }
> }
> static inline void put_write_access(struct inode * inode)
> {
> - atomic_dec(&inode->i_writecount);
> + __put_write_access(inode, INODE_DENY_WRITE_LEVEL);
> }
> -static inline void allow_write_access(struct file *file)
> +static inline void put_write_access_level(struct inode *inode,
> + enum inode_write_level skip_level)
> +{
> + __put_write_access(inode, skip_level);
> +}
> +static inline void allow_write_access(struct file *file, enum inode_write_level level)
> {
> if (file)
> - atomic_inc(&file_inode(file)->i_writecount);
> + atomic_inc(&file_inode(file)->i_writecount[level]);
> }
> static inline bool inode_is_open_for_write(const struct inode *inode)
> {
> - return atomic_read(&inode->i_writecount) > 0;
> + return atomic_read(&inode->i_writecount[INODE_DENY_WRITE_ALL]) > 0;
> }
>
> #if defined(CONFIG_IMA) || defined(CONFIG_FILE_LOCKING)
> diff --git a/include/trace/events/filelock.h b/include/trace/events/filelock.h
> index b8d1e00a7982..ad076e87a956 100644
> --- a/include/trace/events/filelock.h
> +++ b/include/trace/events/filelock.h
> @@ -187,7 +187,7 @@ TRACE_EVENT(generic_add_lease,
> TP_fast_assign(
> __entry->s_dev = inode->i_sb->s_dev;
> __entry->i_ino = inode->i_ino;
> - __entry->wcount = atomic_read(&inode->i_writecount);
> + __entry->wcount = atomic_read(&inode->i_writecount[INODE_DENY_WRITE_ALL]);
> __entry->rcount = atomic_read(&inode->i_readcount);
> __entry->icount = atomic_read(&inode->i_count);
> __entry->owner = fl->c.flc_owner;
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 99076dbe27d8..b6dc7aed2ebf 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -620,7 +620,7 @@ static void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm)
> * We depend on the oldmm having properly denied write access to the
> * exe_file already.
> */
> - if (exe_file && deny_write_access(exe_file))
> + if (exe_file && deny_write_access(exe_file, INODE_DENY_WRITE_EXEC))
> pr_warn_once("deny_write_access() failed in %s\n", __func__);
> }
>
> @@ -1417,13 +1417,14 @@ int set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file)
> * We expect the caller (i.e., sys_execve) to already denied
> * write access, so this is unlikely to fail.
> */
> - if (unlikely(deny_write_access(new_exe_file)))
> + if (unlikely(deny_write_access(new_exe_file,
> + INODE_DENY_WRITE_EXEC)))
> return -EACCES;
> get_file(new_exe_file);
> }
> rcu_assign_pointer(mm->exe_file, new_exe_file);
> if (old_exe_file) {
> - allow_write_access(old_exe_file);
> + allow_write_access(old_exe_file, INODE_DENY_WRITE_EXEC);
> fput(old_exe_file);
> }
> return 0;
> @@ -1464,7 +1465,7 @@ int replace_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file)
> return ret;
> }
>
> - ret = deny_write_access(new_exe_file);
> + ret = deny_write_access(new_exe_file, INODE_DENY_WRITE_EXEC);
> if (ret)
> return -EACCES;
> get_file(new_exe_file);
> @@ -1476,7 +1477,7 @@ int replace_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file)
> mmap_write_unlock(mm);
>
> if (old_exe_file) {
> - allow_write_access(old_exe_file);
> + allow_write_access(old_exe_file, INODE_DENY_WRITE_EXEC);
> fput(old_exe_file);
> }
> return 0;
> diff --git a/security/integrity/evm/evm_main.c b/security/integrity/evm/evm_main.c
> index 62fe66dd53ce..b83dfb295d85 100644
> --- a/security/integrity/evm/evm_main.c
> +++ b/security/integrity/evm/evm_main.c
> @@ -1084,7 +1084,7 @@ static void evm_file_release(struct file *file)
> if (!S_ISREG(inode->i_mode) || !(mode & FMODE_WRITE))
> return;
>
> - if (iint && atomic_read(&inode->i_writecount) == 1)
> + if (iint && atomic_read(&inode->i_writecount[INODE_DENY_WRITE_ALL]) == 1)
> iint->flags &= ~EVM_NEW_FILE;
> }
>
> diff --git a/security/integrity/ima/ima_main.c b/security/integrity/ima/ima_main.c
> index f04f43af651c..7aed80a70692 100644
> --- a/security/integrity/ima/ima_main.c
> +++ b/security/integrity/ima/ima_main.c
> @@ -164,7 +164,7 @@ static void ima_check_last_writer(struct ima_iint_cache *iint,
> return;
>
> mutex_lock(&iint->mutex);
> - if (atomic_read(&inode->i_writecount) == 1) {
> + if (atomic_read(&inode->i_writecount[INODE_DENY_WRITE_ALL]) == 1) {
> struct kstat stat;
>
> update = test_and_clear_bit(IMA_UPDATE_XATTR,
> --
> 2.43.0
>
[-- Attachment #2: 0001-RFC-UNTESTED-fs-don-t-deny-writes-to-executables.patch --]
[-- Type: text/x-diff, Size: 11183 bytes --]
From ba4a40635236fef36663bcfeca13dd816f9de69a Mon Sep 17 00:00:00 2001
From: Christian Brauner <brauner@kernel.org>
Date: Thu, 30 May 2024 09:35:44 +0200
Subject: [PATCH] [RFC UNTESTED]: fs: don't deny writes to executables
Back in 2021 we already discussed removing deny_write_access() for
executables. Back then I was hesistant because I thought that this might
cause issues in userspace. But even back then I had started taking some
notes on what could potentially depend on this and I didn't come up with
a lot so I've changed my mind and I would like to try this.
Here are some of the notes that I took:
(1) The deny_write_access() mechanism is causing really pointless issues
such as [1]. If a thread in a thread-group opens a file writable,
then writes some stuff, then closing the file descriptor and then
calling execve() they can fail the execve() with ETXTBUSY because
another thread in the thread-group could have concurrently called
fork(). Multi-threaded libraries such as go suffer from this.
(2) There are userspace attacks that rely on overwriting the binary of a
running process. These attacks are _mitigated_ but _not at all
prevented_ from ocurring by the deny_write_access() mechanism.
I'll go over some details. The clearest example of such attacks was
the attack against runC in CVE-2019-5736 (cf. [3]).
An attack could compromise the runC host binary from inside a
_privileged_ runC container. The malicious binary could then be used
to take over the host.
(It is crucial to note that this attack is _not_ possible with
unprivileged containers. IOW, the setup here is already insecure.)
The attack can be made when attaching to a running container or when
starting a container running a specially crafted image. For example,
when runC attaches to a container the attacker can trick it into
executing itself.
This could be done by replacing the target binary inside the
container with a custom binary pointing back at the runC binary
itself. As an example, if the target binary was /bin/bash, this
could be replaced with an executable script specifying the
interpreter path #!/proc/self/exe.
As such when /bin/bash is executed inside the container, instead the
target of /proc/self/exe will be executed. That magic link will
point to the runc binary on the host. The attacker can then proceed
to write to the target of /proc/self/exe to try and overwrite the
runC binary on the host.
However, this will not succeed because of deny_write_access(). Now,
one might think that this would prevent the attack but it doesn't.
To overcome this, the attacker has multiple ways:
* Open a file descriptor to /proc/self/exe using the O_PATH flag and
then proceed to reopen the binary as O_WRONLY through
/proc/self/fd/<nr> and try to write to it in a busy loop from a
separate process. Ultimately it will succeed when the runC binary
exits. After this the runC binary is compromised and can be used
to attack other containers or the host itself.
* Use a malicious shared library annotating a function in there with
the constructor attribute making the malicious function run as an
initializor. The malicious library will then open /proc/self/exe
for creating a new entry under /proc/self/fd/<nr>. It'll then call
exec to a) force runC to exit and b) hand the file descriptor off
to a program that then reopens /proc/self/fd/<nr> for writing
(which is now possible because runC has exited) and overwriting
that binary.
To sum up: the deny_write_access() mechanism doesn't prevent such
attacks in insecure setups. It just makes them minimally harder.
That's all.
The only way back then to prevent this is to create a temporary copy
of the calling binary itself when it starts or attaches to
containers. So what I did back then for LXC (and Aleksa for runC)
was to create an anonymous, in-memory file using the memfd_create()
system call and to copy itself into the temporary in-memory file,
which is then sealed to prevent further modifications. This sealed,
in-memory file copy is then executed instead of the original on-disk
binary.
Any compromising write operations from a privileged container to the
host binary will then write to the temporary in-memory binary and
not to the host binary on-disk, preserving the integrity of the host
binary. Also as the temporary, in-memory binary is sealed, writes to
this will also fail.
The point is that deny_write_access() is uselss to prevent these
attacks.
(3) Denying write access to an inode because it's currently used in an
exec path could easily be done on an LSM level. It might need an
additional hook but that should be about it.
(4) deny_write_access() doesn't help with hardlinks.
(5) The MAP_DENYWRITE flag for mmap() has been deprecated a long time
ago so while we do protect the main executable the bigger portion of
the things you'd think need protecting such as the shared libraries
aren't. IOW, we let anyone happily overwrite shared libraries.
(6) We removed all remaining uses of VM_DENYWRITE in [2]. That means:
(6.1) We removed the legacy uselib() protection for preventing
overwriting of shared libraries. Nobody cared in 3 years.
(6.2) We allow write access to the elf interpreter after exec
completed treating it on a par with shared libraries.
(7) Yes, someone in userspace could potentially be relying on this. It's
not completely out of the realm of possibility but let's find out if
that's actually the case and not guess.
Link: https://github.com/golang/go/issues/22315 [1]
Link: 49624efa65ac ("Merge tag 'denywrite-for-5.15' of git://github.com/davidhildenbrand/linux") [2]
Link: https://unit42.paloaltonetworks.com/breaking-docker-via-runc-explaining-cve-2019-5736 [3]
Link: https://lwn.net/Articles/866493
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
fs/binfmt_elf.c | 2 --
fs/binfmt_elf_fdpic.c | 5 +----
fs/binfmt_misc.c | 7 ++-----
fs/exec.c | 14 +++-----------
kernel/fork.c | 26 +++-----------------------
5 files changed, 9 insertions(+), 45 deletions(-)
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index a43897b03ce9..6fdec541f8bf 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -1216,7 +1216,6 @@ static int load_elf_binary(struct linux_binprm *bprm)
}
reloc_func_desc = interp_load_addr;
- allow_write_access(interpreter);
fput(interpreter);
kfree(interp_elf_ex);
@@ -1308,7 +1307,6 @@ static int load_elf_binary(struct linux_binprm *bprm)
kfree(interp_elf_ex);
kfree(interp_elf_phdata);
out_free_file:
- allow_write_access(interpreter);
if (interpreter)
fput(interpreter);
out_free_ph:
diff --git a/fs/binfmt_elf_fdpic.c b/fs/binfmt_elf_fdpic.c
index b799701454a9..28a3439f163a 100644
--- a/fs/binfmt_elf_fdpic.c
+++ b/fs/binfmt_elf_fdpic.c
@@ -394,7 +394,6 @@ static int load_elf_fdpic_binary(struct linux_binprm *bprm)
goto error;
}
- allow_write_access(interpreter);
fput(interpreter);
interpreter = NULL;
}
@@ -466,10 +465,8 @@ static int load_elf_fdpic_binary(struct linux_binprm *bprm)
retval = 0;
error:
- if (interpreter) {
- allow_write_access(interpreter);
+ if (interpreter)
fput(interpreter);
- }
kfree(interpreter_name);
kfree(exec_params.phdrs);
kfree(exec_params.loadmap);
diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
index 68fa225f89e5..21ce5ec1ea76 100644
--- a/fs/binfmt_misc.c
+++ b/fs/binfmt_misc.c
@@ -247,13 +247,10 @@ static int load_misc_binary(struct linux_binprm *bprm)
if (retval < 0)
goto ret;
- if (fmt->flags & MISC_FMT_OPEN_FILE) {
+ if (fmt->flags & MISC_FMT_OPEN_FILE)
interp_file = file_clone_open(fmt->interp_file);
- if (!IS_ERR(interp_file))
- deny_write_access(interp_file);
- } else {
+ else
interp_file = open_exec(fmt->interpreter);
- }
retval = PTR_ERR(interp_file);
if (IS_ERR(interp_file))
goto ret;
diff --git a/fs/exec.c b/fs/exec.c
index 40073142288f..4dee205452e2 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -952,10 +952,6 @@ static struct file *do_open_execat(int fd, struct filename *name, int flags)
path_noexec(&file->f_path)))
goto exit;
- err = deny_write_access(file);
- if (err)
- goto exit;
-
out:
return file;
@@ -971,8 +967,7 @@ static struct file *do_open_execat(int fd, struct filename *name, int flags)
*
* Returns ERR_PTR on failure or allocated struct file on success.
*
- * As this is a wrapper for the internal do_open_execat(), callers
- * must call allow_write_access() before fput() on release. Also see
+ * As this is a wrapper for the internal do_open_execat(). Also see
* do_close_execat().
*/
struct file *open_exec(const char *name)
@@ -1524,10 +1519,8 @@ static int prepare_bprm_creds(struct linux_binprm *bprm)
/* Matches do_open_execat() */
static void do_close_execat(struct file *file)
{
- if (!file)
- return;
- allow_write_access(file);
- fput(file);
+ if (file)
+ fput(file);
}
static void free_bprm(struct linux_binprm *bprm)
@@ -1846,7 +1839,6 @@ static int exec_binprm(struct linux_binprm *bprm)
bprm->file = bprm->interpreter;
bprm->interpreter = NULL;
- allow_write_access(exec);
if (unlikely(bprm->have_execfd)) {
if (bprm->executable) {
fput(exec);
diff --git a/kernel/fork.c b/kernel/fork.c
index 99076dbe27d8..763a042eef9c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -616,12 +616,6 @@ static void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm)
exe_file = get_mm_exe_file(oldmm);
RCU_INIT_POINTER(mm->exe_file, exe_file);
- /*
- * We depend on the oldmm having properly denied write access to the
- * exe_file already.
- */
- if (exe_file && deny_write_access(exe_file))
- pr_warn_once("deny_write_access() failed in %s\n", __func__);
}
#ifdef CONFIG_MMU
@@ -1412,20 +1406,11 @@ int set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file)
*/
old_exe_file = rcu_dereference_raw(mm->exe_file);
- if (new_exe_file) {
- /*
- * We expect the caller (i.e., sys_execve) to already denied
- * write access, so this is unlikely to fail.
- */
- if (unlikely(deny_write_access(new_exe_file)))
- return -EACCES;
+ if (new_exe_file)
get_file(new_exe_file);
- }
rcu_assign_pointer(mm->exe_file, new_exe_file);
- if (old_exe_file) {
- allow_write_access(old_exe_file);
+ if (old_exe_file)
fput(old_exe_file);
- }
return 0;
}
@@ -1464,9 +1449,6 @@ int replace_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file)
return ret;
}
- ret = deny_write_access(new_exe_file);
- if (ret)
- return -EACCES;
get_file(new_exe_file);
/* set the new file */
@@ -1475,10 +1457,8 @@ int replace_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file)
rcu_assign_pointer(mm->exe_file, new_exe_file);
mmap_write_unlock(mm);
- if (old_exe_file) {
- allow_write_access(old_exe_file);
+ if (old_exe_file)
fput(old_exe_file);
- }
return 0;
}
--
2.43.0
^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [PATCH][RFC] fs: add levels to inode write access
2024-05-30 10:32 ` Christian Brauner
@ 2024-05-30 12:57 ` Amir Goldstein
2024-05-30 14:58 ` Josef Bacik
` (2 subsequent siblings)
3 siblings, 0 replies; 27+ messages in thread
From: Amir Goldstein @ 2024-05-30 12:57 UTC (permalink / raw)
To: Christian Brauner
Cc: Josef Bacik, Linus Torvalds, linux-fsdevel, viro, jack, david,
hch
On Thu, May 30, 2024 at 1:32 PM Christian Brauner <brauner@kernel.org> wrote:
>
> On Wed, May 29, 2024 at 04:41:32PM -0400, Josef Bacik wrote:
> > NOTE:
> > This is compile tested only. It's also based on an out of tree feature branch
> > from Amir that I'm extending to add page fault content events to allow us to
> > have on-demand binary loading at exec time. If you need more context please let
> > me know, I'll push my current branch somewhere and wire up how I plan to use
> > this patch so you can see it in better context, but hopefully I've described
> > what I'm trying to accomplish enough that this leads to useful discussion.
> >
> >
> > Currently we have ->i_writecount to control write access to an inode.
> > Callers may deny write access by calling deny_write_access(), which will
> > cause ->i_writecount to go negative, and allow_write_access() to push it
> > back up to 0.
> >
> > This is used in a few ways, the biggest user being exec. Historically
> > we've blocked write access to files that are opened for executing.
> > fsverity is the other notable user.
> >
> > With the upcoming fanotify feature that allows for on-demand population
> > of files, this blanket policy of denying writes to files opened for
> > executing creates a problem. We have to issue events right before
> > denying access, and the entire file must be populated before we can
> > continue with the exec.
> >
> > This creates a problem for users who have large binaries they want to
> > populate on demand. Inside of Meta we often have multi-gigabyte
> > binaries, even on the order of tens of gigabytes. Pre-loading these
> > large files is costly, especially when large portions of the binary may
> > never be read (think debuginfo).
> >
> > In order to facilitate on-demand loading of binaries we need to have a
> > way to punch a hole in this exec related write denial.
>
> Hm. I suggest we try to tackle this in a completely different way. Maybe
> I mentioned it during LSFMM but instead of doing this dance we should
> try and remove the deny_write_access() mechanisms for exec completely.
>
> Back in 2021 we removed the remaining VM_DENYWRITE bits but left in
> deny_write_access() for exec. Back then I had thought that this was a
> bit risky for not too much gain. But looking at this code here I think
> we now have an even stronger reason to try and get rid of this
> restriction. And I've since changed my mind. See my notes on the
> _completely untested_ RFC patch I appended.
>
This looks much simpler :)
Obviously, a proper patch would need to also turn get_write_access()
to void and clean up the error handling code of its callers.
> Ofc depends on whether Linus still agrees that removing this might be
> something we could try.
>
Crossing my fingers that this will be acceptable.
Thanks,
Amir.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH][RFC] fs: add levels to inode write access
2024-05-30 10:32 ` Christian Brauner
2024-05-30 12:57 ` Amir Goldstein
@ 2024-05-30 14:58 ` Josef Bacik
2024-05-30 15:23 ` Christian Brauner
2024-05-30 15:49 ` Linus Torvalds
3 siblings, 0 replies; 27+ messages in thread
From: Josef Bacik @ 2024-05-30 14:58 UTC (permalink / raw)
To: Christian Brauner
Cc: Linus Torvalds, linux-fsdevel, viro, jack, david, amir73il, hch
On Thu, May 30, 2024 at 12:32:06PM +0200, Christian Brauner wrote:
> On Wed, May 29, 2024 at 04:41:32PM -0400, Josef Bacik wrote:
> > NOTE:
> > This is compile tested only. It's also based on an out of tree feature branch
> > from Amir that I'm extending to add page fault content events to allow us to
> > have on-demand binary loading at exec time. If you need more context please let
> > me know, I'll push my current branch somewhere and wire up how I plan to use
> > this patch so you can see it in better context, but hopefully I've described
> > what I'm trying to accomplish enough that this leads to useful discussion.
> >
> >
> > Currently we have ->i_writecount to control write access to an inode.
> > Callers may deny write access by calling deny_write_access(), which will
> > cause ->i_writecount to go negative, and allow_write_access() to push it
> > back up to 0.
> >
> > This is used in a few ways, the biggest user being exec. Historically
> > we've blocked write access to files that are opened for executing.
> > fsverity is the other notable user.
> >
> > With the upcoming fanotify feature that allows for on-demand population
> > of files, this blanket policy of denying writes to files opened for
> > executing creates a problem. We have to issue events right before
> > denying access, and the entire file must be populated before we can
> > continue with the exec.
> >
> > This creates a problem for users who have large binaries they want to
> > populate on demand. Inside of Meta we often have multi-gigabyte
> > binaries, even on the order of tens of gigabytes. Pre-loading these
> > large files is costly, especially when large portions of the binary may
> > never be read (think debuginfo).
> >
> > In order to facilitate on-demand loading of binaries we need to have a
> > way to punch a hole in this exec related write denial.
>
> Hm. I suggest we try to tackle this in a completely different way. Maybe
> I mentioned it during LSFMM but instead of doing this dance we should
> try and remove the deny_write_access() mechanisms for exec completely.
>
> Back in 2021 we removed the remaining VM_DENYWRITE bits but left in
> deny_write_access() for exec. Back then I had thought that this was a
> bit risky for not too much gain. But looking at this code here I think
> we now have an even stronger reason to try and get rid of this
> restriction. And I've since changed my mind. See my notes on the
> _completely untested_ RFC patch I appended.
>
> Ofc depends on whether Linus still agrees that removing this might be
> something we could try.
>
Muahaha you took the bait.
I obviously much prefer this solution, and from what I can tell we're all pretty
well in agreement about this. Turn it into a real patch and I'll happily add my
reviewed-by. Thanks,
Josef
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH][RFC] fs: add levels to inode write access
2024-05-30 10:32 ` Christian Brauner
2024-05-30 12:57 ` Amir Goldstein
2024-05-30 14:58 ` Josef Bacik
@ 2024-05-30 15:23 ` Christian Brauner
2024-05-30 15:49 ` Linus Torvalds
3 siblings, 0 replies; 27+ messages in thread
From: Christian Brauner @ 2024-05-30 15:23 UTC (permalink / raw)
To: Josef Bacik, Linus Torvalds
Cc: linux-fsdevel, viro, jack, david, amir73il, hch
On Thu, May 30, 2024 at 12:32:06PM +0200, Christian Brauner wrote:
> On Wed, May 29, 2024 at 04:41:32PM -0400, Josef Bacik wrote:
> > NOTE:
> > This is compile tested only. It's also based on an out of tree feature branch
> > from Amir that I'm extending to add page fault content events to allow us to
> > have on-demand binary loading at exec time. If you need more context please let
> > me know, I'll push my current branch somewhere and wire up how I plan to use
> > this patch so you can see it in better context, but hopefully I've described
> > what I'm trying to accomplish enough that this leads to useful discussion.
> >
> >
> > Currently we have ->i_writecount to control write access to an inode.
> > Callers may deny write access by calling deny_write_access(), which will
> > cause ->i_writecount to go negative, and allow_write_access() to push it
> > back up to 0.
> >
> > This is used in a few ways, the biggest user being exec. Historically
> > we've blocked write access to files that are opened for executing.
> > fsverity is the other notable user.
> >
> > With the upcoming fanotify feature that allows for on-demand population
> > of files, this blanket policy of denying writes to files opened for
> > executing creates a problem. We have to issue events right before
> > denying access, and the entire file must be populated before we can
> > continue with the exec.
> >
> > This creates a problem for users who have large binaries they want to
> > populate on demand. Inside of Meta we often have multi-gigabyte
> > binaries, even on the order of tens of gigabytes. Pre-loading these
> > large files is costly, especially when large portions of the binary may
> > never be read (think debuginfo).
> >
> > In order to facilitate on-demand loading of binaries we need to have a
> > way to punch a hole in this exec related write denial.
>
> Hm. I suggest we try to tackle this in a completely different way. Maybe
> I mentioned it during LSFMM but instead of doing this dance we should
> try and remove the deny_write_access() mechanisms for exec completely.
>
> Back in 2021 we removed the remaining VM_DENYWRITE bits but left in
> deny_write_access() for exec. Back then I had thought that this was a
> bit risky for not too much gain. But looking at this code here I think
> we now have an even stronger reason to try and get rid of this
> restriction. And I've since changed my mind. See my notes on the
> _completely untested_ RFC patch I appended.
>
> Ofc depends on whether Linus still agrees that removing this might be
> something we could try.
(Ignore point (4) on my notes list. I somehow forgot to remove this
as it's obviously nonsense.)
>
> > This patch accomplishes this by doing something similar to the freeze
> > levels on the super block. We have two levels, one for all write
> > denial, and one for exec. People wishing to deny writes will specify
> > the level they're denying. Users wishing to get write access must go
> > through all of the levels and increment each levels counter until it
> > increments them all, or encounters a level that is currently denied, at
> > which point they undo their increments and return an error.
> >
> > Future patches will use the get_write_access_level() helper in order to
> > skip the level they wish to be allowed to access. Any inode being
> > populated via the pre-content fanotify mechanism will be marked with a
> > special flag, and the open path will use get_write_access_level() in
> > order to bypass the exec restriction.
> >
> > This is a significant behavior change, as it allows us to write to a
> > file that is currently being exec'ed. However this will be limited to a
> > very narrow use case.
> >
> > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > ---
> > drivers/md/md.c | 2 +-
> > fs/binfmt_elf.c | 4 +-
> > fs/exec.c | 6 +--
> > fs/ext4/file.c | 4 +-
> > fs/inode.c | 3 +-
> > fs/kernel_read_file.c | 4 +-
> > fs/locks.c | 2 +-
> > fs/quota/dquot.c | 2 +-
> > fs/verity/enable.c | 4 +-
> > include/linux/fs.h | 90 +++++++++++++++++++++++++++----
> > include/trace/events/filelock.h | 2 +-
> > kernel/fork.c | 11 ++--
> > security/integrity/evm/evm_main.c | 2 +-
> > security/integrity/ima/ima_main.c | 2 +-
> > 14 files changed, 104 insertions(+), 34 deletions(-)
> >
> > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > index aff9118ff697..134cefbd7bef 100644
> > --- a/drivers/md/md.c
> > +++ b/drivers/md/md.c
> > @@ -7231,7 +7231,7 @@ static int set_bitmap_file(struct mddev *mddev, int fd)
> > pr_warn("%s: error: bitmap file must open for write\n",
> > mdname(mddev));
> > err = -EBADF;
> > - } else if (atomic_read(&inode->i_writecount) != 1) {
> > + } else if (atomic_read(&inode->i_writecount[INODE_DENY_WRITE_ALL]) != 1) {
> > pr_warn("%s: error: bitmap file is already in use\n",
> > mdname(mddev));
> > err = -EBUSY;
> > diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> > index a43897b03ce9..9a6fcf8ba17c 100644
> > --- a/fs/binfmt_elf.c
> > +++ b/fs/binfmt_elf.c
> > @@ -1216,7 +1216,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
> > }
> > reloc_func_desc = interp_load_addr;
> >
> > - allow_write_access(interpreter);
> > + allow_write_access(interpreter, INODE_DENY_WRITE_EXEC);
> > fput(interpreter);
> >
> > kfree(interp_elf_ex);
> > @@ -1308,7 +1308,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
> > kfree(interp_elf_ex);
> > kfree(interp_elf_phdata);
> > out_free_file:
> > - allow_write_access(interpreter);
> > + allow_write_access(interpreter, INODE_DENY_WRITE_EXEC);
> > if (interpreter)
> > fput(interpreter);
> > out_free_ph:
> > diff --git a/fs/exec.c b/fs/exec.c
> > index 18f057ba64a5..6b2900ee448d 100644
> > --- a/fs/exec.c
> > +++ b/fs/exec.c
> > @@ -971,7 +971,7 @@ static struct file *do_open_execat(int fd, struct filename *name, int flags)
> > if (err)
> > goto exit;
> >
> > - err = deny_write_access(file);
> > + err = deny_write_access(file, INODE_DENY_WRITE_EXEC);
> > if (err)
> > goto exit;
> >
> > @@ -1545,7 +1545,7 @@ static void do_close_execat(struct file *file)
> > {
> > if (!file)
> > return;
> > - allow_write_access(file);
> > + allow_write_access(file, INODE_DENY_WRITE_EXEC);
> > fput(file);
> > }
> >
> > @@ -1865,7 +1865,7 @@ static int exec_binprm(struct linux_binprm *bprm)
> > bprm->file = bprm->interpreter;
> > bprm->interpreter = NULL;
> >
> > - allow_write_access(exec);
> > + allow_write_access(exec, INODE_DENY_WRITE_EXEC);
> > if (unlikely(bprm->have_execfd)) {
> > if (bprm->executable) {
> > fput(exec);
> > diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> > index c89e434db6b7..6196f449649c 100644
> > --- a/fs/ext4/file.c
> > +++ b/fs/ext4/file.c
> > @@ -171,8 +171,8 @@ static int ext4_release_file(struct inode *inode, struct file *filp)
> > }
> > /* if we are the last writer on the inode, drop the block reservation */
> > if ((filp->f_mode & FMODE_WRITE) &&
> > - (atomic_read(&inode->i_writecount) == 1) &&
> > - !EXT4_I(inode)->i_reserved_data_blocks) {
> > + (atomic_read(&inode->i_writecount[INODE_DENY_WRITE_ALL]) == 1) &&
> > + !EXT4_I(inode)->i_reserved_data_blocks) {
> > down_write(&EXT4_I(inode)->i_data_sem);
> > ext4_discard_preallocations(inode);
> > up_write(&EXT4_I(inode)->i_data_sem);
> > diff --git a/fs/inode.c b/fs/inode.c
> > index 3a41f83a4ba5..fb6155412252 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -173,7 +173,8 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
> > inode->i_opflags |= IOP_XATTR;
> > i_uid_write(inode, 0);
> > i_gid_write(inode, 0);
> > - atomic_set(&inode->i_writecount, 0);
> > + for (int i = 0; i < INODE_DENY_WRITE_LEVEL; i++)
> > + atomic_set(&inode->i_writecount[i], 0);
> > inode->i_size = 0;
> > inode->i_write_hint = WRITE_LIFE_NOT_SET;
> > inode->i_blocks = 0;
> > diff --git a/fs/kernel_read_file.c b/fs/kernel_read_file.c
> > index c429c42a6867..9af82d20aa1f 100644
> > --- a/fs/kernel_read_file.c
> > +++ b/fs/kernel_read_file.c
> > @@ -48,7 +48,7 @@ ssize_t kernel_read_file(struct file *file, loff_t offset, void **buf,
> > if (!S_ISREG(file_inode(file)->i_mode))
> > return -EINVAL;
> >
> > - ret = deny_write_access(file);
> > + ret = deny_write_access(file, INODE_DENY_WRITE_ALL);
> > if (ret)
> > return ret;
> >
> > @@ -119,7 +119,7 @@ ssize_t kernel_read_file(struct file *file, loff_t offset, void **buf,
> > }
> >
> > out:
> > - allow_write_access(file);
> > + allow_write_access(file, INODE_DENY_WRITE_ALL);
> > return ret == 0 ? copied : ret;
> > }
> > EXPORT_SYMBOL_GPL(kernel_read_file);
> > diff --git a/fs/locks.c b/fs/locks.c
> > index 90c8746874de..3e6a62f56528 100644
> > --- a/fs/locks.c
> > +++ b/fs/locks.c
> > @@ -1763,7 +1763,7 @@ check_conflicting_open(struct file *filp, const int arg, int flags)
> > else if (filp->f_mode & FMODE_READ)
> > self_rcount = 1;
> >
> > - if (atomic_read(&inode->i_writecount) != self_wcount ||
> > + if (atomic_read(&inode->i_writecount[INODE_DENY_WRITE_ALL]) != self_wcount ||
> > atomic_read(&inode->i_readcount) != self_rcount)
> > return -EAGAIN;
> >
> > diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
> > index 627eb2f72ef3..fa470d76f0b3 100644
> > --- a/fs/quota/dquot.c
> > +++ b/fs/quota/dquot.c
> > @@ -1031,7 +1031,7 @@ static int add_dquot_ref(struct super_block *sb, int type)
> > list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> > spin_lock(&inode->i_lock);
> > if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
> > - !atomic_read(&inode->i_writecount) ||
> > + !inode_is_open_for_write(inode) ||
> > !dqinit_needed(inode, type)) {
> > spin_unlock(&inode->i_lock);
> > continue;
> > diff --git a/fs/verity/enable.c b/fs/verity/enable.c
> > index c284f46d1b53..a0e0d49c6ccc 100644
> > --- a/fs/verity/enable.c
> > +++ b/fs/verity/enable.c
> > @@ -371,7 +371,7 @@ int fsverity_ioctl_enable(struct file *filp, const void __user *uarg)
> > if (err) /* -EROFS */
> > return err;
> >
> > - err = deny_write_access(filp);
> > + err = deny_write_access(filp, INODE_DENY_WRITE_ALL);
> > if (err) /* -ETXTBSY */
> > goto out_drop_write;
> >
> > @@ -397,7 +397,7 @@ int fsverity_ioctl_enable(struct file *filp, const void __user *uarg)
> > * allow_write_access() is needed to pair with deny_write_access().
> > * Regardless, the filesystem won't allow writing to verity files.
> > */
> > - allow_write_access(filp);
> > + allow_write_access(filp, INODE_DENY_WRITE_ALL);
> > out_drop_write:
> > mnt_drop_write_file(filp);
> > return err;
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 0283cf366c2a..f0720cd0ab45 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -620,6 +620,22 @@ is_uncached_acl(struct posix_acl *acl)
> > #define IOP_XATTR 0x0008
> > #define IOP_DEFAULT_READLINK 0x0010
> >
> > +/*
> > + * These are used with the *write_access helpers.
> > + *
> > + * INODE_DENY_WRITE_ALL - Do not allow writes at all.
> > + * INODE_DENY_WRITE_EXEC - Do not allow writes because we are
> > + * exec'ing the file. This can be bypassed
> > + * in cases where the file needs to be
> > + * filled in via the fanotify pre-content
> > + * hooks.
> > + */
> > +enum inode_write_level {
> > + INODE_DENY_WRITE_ALL,
> > + INODE_DENY_WRITE_EXEC,
> > + INODE_DENY_WRITE_LEVEL,
> > +};
> > +
> > /*
> > * Keep mostly read-only and often accessed (especially for
> > * the RCU path lookup and 'stat' data) fields at the beginning
> > @@ -701,7 +717,7 @@ struct inode {
> > atomic64_t i_sequence; /* see futex */
> > atomic_t i_count;
> > atomic_t i_dio_count;
> > - atomic_t i_writecount;
> > + atomic_t i_writecount[INODE_DENY_WRITE_LEVEL];
> > #if defined(CONFIG_IMA) || defined(CONFIG_FILE_LOCKING)
> > atomic_t i_readcount; /* struct files open RO */
> > #endif
> > @@ -2920,38 +2936,90 @@ static inline void kiocb_end_write(struct kiocb *iocb)
> > * put_write_access() releases this write permission.
> > * deny_write_access() denies write access to a file.
> > * allow_write_access() re-enables write access to a file.
> > + * __get_write_access_level() gets write permission for a file for the given
> > + * level.
> > + * put_write_access_level() releases the level write permission.
> > *
> > - * The i_writecount field of an inode can have the following values:
> > + * The level indicates which level we're denying writes for. Some levels are
> > + * allowed to be bypassed in special circumstances. In the cases that the write
> > + * level needs to be bypassed the user must use the
> > + * get_write_access_level()/put_write_access_level() pairs of the helpers, which
> > + * will allow the user to bypass the given level, but none of the others.
> > + *
> > + * The i_writecount[level] field of an inode can have the following values:
> > * 0: no write access, no denied write access
> > - * < 0: (-i_writecount) users that denied write access to the file.
> > - * > 0: (i_writecount) users that have write access to the file.
> > + * < 0: (-i_writecount[level]) users that denied write access to the file.
> > + * > 0: (i_writecount[level]) users that have write access to the file.
> > *
> > * Normally we operate on that counter with atomic_{inc,dec} and it's safe
> > * except for the cases where we don't hold i_writecount yet. Then we need to
> > * use {get,deny}_write_access() - these functions check the sign and refuse
> > * to do the change if sign is wrong.
> > */
> > +static inline int __get_write_access(struct inode *inode,
> > + enum inode_write_level skip_level)
> > +{
> > + int i;
> > +
> > + /* We are not allowed to skip INODE_DENY_WRITE_ALL. */
> > + WARN_ON_ONCE(skip_level == INODE_DENY_WRITE_ALL);
> > +
> > + for (i = 0; i < INODE_DENY_WRITE_LEVEL; i++) {
> > + if (i == skip_level)
> > + continue;
> > + if (!atomic_inc_unless_negative(&inode->i_writecount[i]))
> > + goto fail;
> > + }
> > + return 0;
> > +fail:
> > + while (--i >= 0) {
> > + if (i == skip_level)
> > + continue;
> > + atomic_dec(&inode->i_writecount[i]);
> > + }
> > + return -ETXTBSY;
> > +}
> > static inline int get_write_access(struct inode *inode)
> > {
> > - return atomic_inc_unless_negative(&inode->i_writecount) ? 0 : -ETXTBSY;
> > + return __get_write_access(inode, INODE_DENY_WRITE_LEVEL);
> > }
> > -static inline int deny_write_access(struct file *file)
> > +static inline int get_write_access_level(struct inode *inode,
> > + enum inode_write_level skip_level)
> > +{
> > + return __get_write_access(inode, skip_level);
> > +}
> > +static inline int deny_write_access(struct file *file,
> > + enum inode_write_level level)
> > {
> > struct inode *inode = file_inode(file);
> > - return atomic_dec_unless_positive(&inode->i_writecount) ? 0 : -ETXTBSY;
> > + return atomic_dec_unless_positive(&inode->i_writecount[level]) ? 0 : -ETXTBSY;
> > +}
> > +static inline void __put_write_access(struct inode *inode,
> > + enum inode_write_level skip_level)
> > +{
> > + for (int i = 0; i < INODE_DENY_WRITE_LEVEL; i++) {
> > + if (i == skip_level)
> > + continue;
> > + atomic_dec(&inode->i_writecount[i]);
> > + }
> > }
> > static inline void put_write_access(struct inode * inode)
> > {
> > - atomic_dec(&inode->i_writecount);
> > + __put_write_access(inode, INODE_DENY_WRITE_LEVEL);
> > }
> > -static inline void allow_write_access(struct file *file)
> > +static inline void put_write_access_level(struct inode *inode,
> > + enum inode_write_level skip_level)
> > +{
> > + __put_write_access(inode, skip_level);
> > +}
> > +static inline void allow_write_access(struct file *file, enum inode_write_level level)
> > {
> > if (file)
> > - atomic_inc(&file_inode(file)->i_writecount);
> > + atomic_inc(&file_inode(file)->i_writecount[level]);
> > }
> > static inline bool inode_is_open_for_write(const struct inode *inode)
> > {
> > - return atomic_read(&inode->i_writecount) > 0;
> > + return atomic_read(&inode->i_writecount[INODE_DENY_WRITE_ALL]) > 0;
> > }
> >
> > #if defined(CONFIG_IMA) || defined(CONFIG_FILE_LOCKING)
> > diff --git a/include/trace/events/filelock.h b/include/trace/events/filelock.h
> > index b8d1e00a7982..ad076e87a956 100644
> > --- a/include/trace/events/filelock.h
> > +++ b/include/trace/events/filelock.h
> > @@ -187,7 +187,7 @@ TRACE_EVENT(generic_add_lease,
> > TP_fast_assign(
> > __entry->s_dev = inode->i_sb->s_dev;
> > __entry->i_ino = inode->i_ino;
> > - __entry->wcount = atomic_read(&inode->i_writecount);
> > + __entry->wcount = atomic_read(&inode->i_writecount[INODE_DENY_WRITE_ALL]);
> > __entry->rcount = atomic_read(&inode->i_readcount);
> > __entry->icount = atomic_read(&inode->i_count);
> > __entry->owner = fl->c.flc_owner;
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 99076dbe27d8..b6dc7aed2ebf 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -620,7 +620,7 @@ static void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm)
> > * We depend on the oldmm having properly denied write access to the
> > * exe_file already.
> > */
> > - if (exe_file && deny_write_access(exe_file))
> > + if (exe_file && deny_write_access(exe_file, INODE_DENY_WRITE_EXEC))
> > pr_warn_once("deny_write_access() failed in %s\n", __func__);
> > }
> >
> > @@ -1417,13 +1417,14 @@ int set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file)
> > * We expect the caller (i.e., sys_execve) to already denied
> > * write access, so this is unlikely to fail.
> > */
> > - if (unlikely(deny_write_access(new_exe_file)))
> > + if (unlikely(deny_write_access(new_exe_file,
> > + INODE_DENY_WRITE_EXEC)))
> > return -EACCES;
> > get_file(new_exe_file);
> > }
> > rcu_assign_pointer(mm->exe_file, new_exe_file);
> > if (old_exe_file) {
> > - allow_write_access(old_exe_file);
> > + allow_write_access(old_exe_file, INODE_DENY_WRITE_EXEC);
> > fput(old_exe_file);
> > }
> > return 0;
> > @@ -1464,7 +1465,7 @@ int replace_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file)
> > return ret;
> > }
> >
> > - ret = deny_write_access(new_exe_file);
> > + ret = deny_write_access(new_exe_file, INODE_DENY_WRITE_EXEC);
> > if (ret)
> > return -EACCES;
> > get_file(new_exe_file);
> > @@ -1476,7 +1477,7 @@ int replace_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file)
> > mmap_write_unlock(mm);
> >
> > if (old_exe_file) {
> > - allow_write_access(old_exe_file);
> > + allow_write_access(old_exe_file, INODE_DENY_WRITE_EXEC);
> > fput(old_exe_file);
> > }
> > return 0;
> > diff --git a/security/integrity/evm/evm_main.c b/security/integrity/evm/evm_main.c
> > index 62fe66dd53ce..b83dfb295d85 100644
> > --- a/security/integrity/evm/evm_main.c
> > +++ b/security/integrity/evm/evm_main.c
> > @@ -1084,7 +1084,7 @@ static void evm_file_release(struct file *file)
> > if (!S_ISREG(inode->i_mode) || !(mode & FMODE_WRITE))
> > return;
> >
> > - if (iint && atomic_read(&inode->i_writecount) == 1)
> > + if (iint && atomic_read(&inode->i_writecount[INODE_DENY_WRITE_ALL]) == 1)
> > iint->flags &= ~EVM_NEW_FILE;
> > }
> >
> > diff --git a/security/integrity/ima/ima_main.c b/security/integrity/ima/ima_main.c
> > index f04f43af651c..7aed80a70692 100644
> > --- a/security/integrity/ima/ima_main.c
> > +++ b/security/integrity/ima/ima_main.c
> > @@ -164,7 +164,7 @@ static void ima_check_last_writer(struct ima_iint_cache *iint,
> > return;
> >
> > mutex_lock(&iint->mutex);
> > - if (atomic_read(&inode->i_writecount) == 1) {
> > + if (atomic_read(&inode->i_writecount[INODE_DENY_WRITE_ALL]) == 1) {
> > struct kstat stat;
> >
> > update = test_and_clear_bit(IMA_UPDATE_XATTR,
> > --
> > 2.43.0
> >
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH][RFC] fs: add levels to inode write access
2024-05-30 10:32 ` Christian Brauner
` (2 preceding siblings ...)
2024-05-30 15:23 ` Christian Brauner
@ 2024-05-30 15:49 ` Linus Torvalds
2024-05-31 10:02 ` Christian Brauner
3 siblings, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2024-05-30 15:49 UTC (permalink / raw)
To: Christian Brauner
Cc: Josef Bacik, linux-fsdevel, viro, jack, david, amir73il, hch
On Thu, 30 May 2024 at 03:32, Christian Brauner <brauner@kernel.org> wrote:
>
> Ofc depends on whether Linus still agrees that removing this might be
> something we could try.
I _definitely_ do not want to see any more complex deny_write_access().
So yes, if people have good reasons to override the inode write
access, I'd rather remove it entirely than make it some eldritch
horror that is even worse than what we have now.
It would obviously have to be tested in case some odd case actually
depends on the ETXTBSY semantics, since we *have* supported it for a
long time. But iirc nobody even noticed when we removed it from
shared libraries, so...
That said, verity seems to depend on it as a way to do the
"enable_verity()" atomically with no concurrent writes, and I see some
i_writecount noise in the integrity code too.
But maybe that's just a belt-and-suspenders thing?
Because if execve() no longer does it, I think we should just remove
that i_writecount thing entirely.
Linus
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH][RFC] fs: add levels to inode write access
2024-05-30 15:49 ` Linus Torvalds
@ 2024-05-31 10:02 ` Christian Brauner
2024-05-31 12:32 ` Christian Brauner
0 siblings, 1 reply; 27+ messages in thread
From: Christian Brauner @ 2024-05-31 10:02 UTC (permalink / raw)
To: Linus Torvalds
Cc: Josef Bacik, linux-fsdevel, viro, jack, david, amir73il, hch
On Thu, May 30, 2024 at 08:49:12AM -0700, Linus Torvalds wrote:
> On Thu, 30 May 2024 at 03:32, Christian Brauner <brauner@kernel.org> wrote:
> >
> > Ofc depends on whether Linus still agrees that removing this might be
> > something we could try.
>
> I _definitely_ do not want to see any more complex deny_write_access().
>
> So yes, if people have good reasons to override the inode write
> access, I'd rather remove it entirely than make it some eldritch
> horror that is even worse than what we have now.
>
> It would obviously have to be tested in case some odd case actually
> depends on the ETXTBSY semantics, since we *have* supported it for a
> long time. But iirc nobody even noticed when we removed it from
> shared libraries, so...
>
> That said, verity seems to depend on it as a way to do the
> "enable_verity()" atomically with no concurrent writes, and I see some
> i_writecount noise in the integrity code too.
>
> But maybe that's just a belt-and-suspenders thing?
>
> Because if execve() no longer does it, I think we should just remove
> that i_writecount thing entirely.
deny_write_access() is being used from kernel_read_file() which has a
few wrappers around it and they are used in various places:
(1) kernel_read_file() based helpers:
(1.1) kernel_read_file_from_path()
(1.2) kernel_read_file_from_path_initns()
(1.3) kernel_read_file_from_fd()
(2) kernel_read_file() users:
(2.1) kernel/module/main.c:init_module_from_file()
(2.2) security/loadpin/loadpin.c:read_trusted_verity_root_digests()
(3) kernel_read_file_from_path() users:
(3.1) security/integrity/digsig.c:integrity_load_x509()
(3.2) security/integrity/ima/ima_fs.c:ima_read_busy()
(4) kernel_read_file_from_path_initns() users:
(4.1) drivers/base/firmware_loader/main.c:fw_get_filesystem_firmware()
(5) kernel_read_file_from_fd() users:
(5.1) kernel/kexec_file.c:kimage_file_prepare_segments()
In order to remove i_writecount completely we would need to do this in
multiple steps as some of that stuff seems potentially sensitive.
The exec deny write mechanism can be removed because we have a decent
understanding of the implications and there's decent justification for
removing it.
So I propose that I do various testing (LTP) etc. now, send the patch
and then put this into -next to see if anything breaks?
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH][RFC] fs: add levels to inode write access
2024-05-31 10:02 ` Christian Brauner
@ 2024-05-31 12:32 ` Christian Brauner
2024-05-31 13:01 ` [PATCH] fs: don't block i_writecount during exec Christian Brauner
2024-05-31 13:09 ` [PATCH][RFC] fs: add levels to inode write access Amir Goldstein
0 siblings, 2 replies; 27+ messages in thread
From: Christian Brauner @ 2024-05-31 12:32 UTC (permalink / raw)
To: Linus Torvalds
Cc: Josef Bacik, linux-fsdevel, viro, jack, david, amir73il, hch
On Fri, May 31, 2024 at 12:02:16PM +0200, Christian Brauner wrote:
> On Thu, May 30, 2024 at 08:49:12AM -0700, Linus Torvalds wrote:
> > On Thu, 30 May 2024 at 03:32, Christian Brauner <brauner@kernel.org> wrote:
> > >
> > > Ofc depends on whether Linus still agrees that removing this might be
> > > something we could try.
> >
> > I _definitely_ do not want to see any more complex deny_write_access().
> >
> > So yes, if people have good reasons to override the inode write
> > access, I'd rather remove it entirely than make it some eldritch
> > horror that is even worse than what we have now.
> >
> > It would obviously have to be tested in case some odd case actually
> > depends on the ETXTBSY semantics, since we *have* supported it for a
> > long time. But iirc nobody even noticed when we removed it from
> > shared libraries, so...
> >
> > That said, verity seems to depend on it as a way to do the
> > "enable_verity()" atomically with no concurrent writes, and I see some
> > i_writecount noise in the integrity code too.
> >
> > But maybe that's just a belt-and-suspenders thing?
> >
> > Because if execve() no longer does it, I think we should just remove
> > that i_writecount thing entirely.
>
> deny_write_access() is being used from kernel_read_file() which has a
> few wrappers around it and they are used in various places:
>
> (1) kernel_read_file() based helpers:
> (1.1) kernel_read_file_from_path()
> (1.2) kernel_read_file_from_path_initns()
> (1.3) kernel_read_file_from_fd()
>
> (2) kernel_read_file() users:
> (2.1) kernel/module/main.c:init_module_from_file()
> (2.2) security/loadpin/loadpin.c:read_trusted_verity_root_digests()
>
> (3) kernel_read_file_from_path() users:
> (3.1) security/integrity/digsig.c:integrity_load_x509()
> (3.2) security/integrity/ima/ima_fs.c:ima_read_busy()
>
> (4) kernel_read_file_from_path_initns() users:
> (4.1) drivers/base/firmware_loader/main.c:fw_get_filesystem_firmware()
>
> (5) kernel_read_file_from_fd() users:
> (5.1) kernel/kexec_file.c:kimage_file_prepare_segments()
>
> In order to remove i_writecount completely we would need to do this in
Sorry, typo s/i_write_count/deny_write_access()/g
(I don't think we can remove i_writecount itself as it's used for file
leases and locks.)
> multiple steps as some of that stuff seems potentially sensitive.
>
> The exec deny write mechanism can be removed because we have a decent
> understanding of the implications and there's decent justification for
> removing it.
>
> So I propose that I do various testing (LTP) etc. now, send the patch
> and then put this into -next to see if anything breaks?
^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH] fs: don't block i_writecount during exec
2024-05-31 12:32 ` Christian Brauner
@ 2024-05-31 13:01 ` Christian Brauner
2024-05-31 15:40 ` Linus Torvalds
` (4 more replies)
2024-05-31 13:09 ` [PATCH][RFC] fs: add levels to inode write access Amir Goldstein
1 sibling, 5 replies; 27+ messages in thread
From: Christian Brauner @ 2024-05-31 13:01 UTC (permalink / raw)
To: Josef Bacik, Linus Torvalds, amir73il
Cc: Christian Brauner, linux-fsdevel, viro, jack, david, hch
Back in 2021 we already discussed removing deny_write_access() for
executables. Back then I was hesistant because I thought that this might
cause issues in userspace. But even back then I had started taking some
notes on what could potentially depend on this and I didn't come up with
a lot so I've changed my mind and I would like to try this.
Here are some of the notes that I took:
(1) The deny_write_access() mechanism is causing really pointless issues
such as [1]. If a thread in a thread-group opens a file writable,
then writes some stuff, then closing the file descriptor and then
calling execve() they can fail the execve() with ETXTBUSY because
another thread in the thread-group could have concurrently called
fork(). Multi-threaded libraries such as go suffer from this.
(2) There are userspace attacks that rely on overwriting the binary of a
running process. These attacks are _mitigated_ but _not at all
prevented_ from ocurring by the deny_write_access() mechanism.
I'll go over some details. The clearest example of such attacks was
the attack against runC in CVE-2019-5736 (cf. [3]).
An attack could compromise the runC host binary from inside a
_privileged_ runC container. The malicious binary could then be used
to take over the host.
(It is crucial to note that this attack is _not_ possible with
unprivileged containers. IOW, the setup here is already insecure.)
The attack can be made when attaching to a running container or when
starting a container running a specially crafted image. For example,
when runC attaches to a container the attacker can trick it into
executing itself.
This could be done by replacing the target binary inside the
container with a custom binary pointing back at the runC binary
itself. As an example, if the target binary was /bin/bash, this
could be replaced with an executable script specifying the
interpreter path #!/proc/self/exe.
As such when /bin/bash is executed inside the container, instead the
target of /proc/self/exe will be executed. That magic link will
point to the runc binary on the host. The attacker can then proceed
to write to the target of /proc/self/exe to try and overwrite the
runC binary on the host.
However, this will not succeed because of deny_write_access(). Now,
one might think that this would prevent the attack but it doesn't.
To overcome this, the attacker has multiple ways:
* Open a file descriptor to /proc/self/exe using the O_PATH flag and
then proceed to reopen the binary as O_WRONLY through
/proc/self/fd/<nr> and try to write to it in a busy loop from a
separate process. Ultimately it will succeed when the runC binary
exits. After this the runC binary is compromised and can be used
to attack other containers or the host itself.
* Use a malicious shared library annotating a function in there with
the constructor attribute making the malicious function run as an
initializor. The malicious library will then open /proc/self/exe
for creating a new entry under /proc/self/fd/<nr>. It'll then call
exec to a) force runC to exit and b) hand the file descriptor off
to a program that then reopens /proc/self/fd/<nr> for writing
(which is now possible because runC has exited) and overwriting
that binary.
To sum up: the deny_write_access() mechanism doesn't prevent such
attacks in insecure setups. It just makes them minimally harder.
That's all.
The only way back then to prevent this is to create a temporary copy
of the calling binary itself when it starts or attaches to
containers. So what I did back then for LXC (and Aleksa for runC)
was to create an anonymous, in-memory file using the memfd_create()
system call and to copy itself into the temporary in-memory file,
which is then sealed to prevent further modifications. This sealed,
in-memory file copy is then executed instead of the original on-disk
binary.
Any compromising write operations from a privileged container to the
host binary will then write to the temporary in-memory binary and
not to the host binary on-disk, preserving the integrity of the host
binary. Also as the temporary, in-memory binary is sealed, writes to
this will also fail.
The point is that deny_write_access() is uselss to prevent these
attacks.
(3) Denying write access to an inode because it's currently used in an
exec path could easily be done on an LSM level. It might need an
additional hook but that should be about it.
(4) The MAP_DENYWRITE flag for mmap() has been deprecated a long time
ago so while we do protect the main executable the bigger portion of
the things you'd think need protecting such as the shared libraries
aren't. IOW, we let anyone happily overwrite shared libraries.
(5) We removed all remaining uses of VM_DENYWRITE in [2]. That means:
(5.1) We removed the legacy uselib() protection for preventing
overwriting of shared libraries. Nobody cared in 3 years.
(5.2) We allow write access to the elf interpreter after exec
completed treating it on a par with shared libraries.
Yes, someone in userspace could potentially be relying on this. It's not
completely out of the realm of possibility but let's find out if that's
actually the case and not guess.
Link: https://github.com/golang/go/issues/22315 [1]
Link: 49624efa65ac ("Merge tag 'denywrite-for-5.15' of git://github.com/davidhildenbrand/linux") [2]
Link: https://unit42.paloaltonetworks.com/breaking-docker-via-runc-explaining-cve-2019-5736 [3]
Link: https://lwn.net/Articles/866493
Link: https://github.com/golang/go/issues/22220
Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/work/buildid.go#L724
Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/work/exec.go#L1493
Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/script/cmds.go#L457
Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/test/test.go#L1557
Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/os/exec/lp_linux_test.go#L61
Link: https://github.com/buildkite/agent/pull/2736
Link: https://github.com/rust-lang/rust/issues/114554
Link: https://bugs.openjdk.org/browse/JDK-8068370
Link: https://github.com/dotnet/runtime/issues/58964
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
---
fs/binfmt_elf.c | 2 --
fs/binfmt_elf_fdpic.c | 5 +----
fs/binfmt_misc.c | 7 ++-----
fs/exec.c | 14 +++-----------
kernel/fork.c | 26 +++-----------------------
5 files changed, 9 insertions(+), 45 deletions(-)
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index a43897b03ce9..6fdec541f8bf 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -1216,7 +1216,6 @@ static int load_elf_binary(struct linux_binprm *bprm)
}
reloc_func_desc = interp_load_addr;
- allow_write_access(interpreter);
fput(interpreter);
kfree(interp_elf_ex);
@@ -1308,7 +1307,6 @@ static int load_elf_binary(struct linux_binprm *bprm)
kfree(interp_elf_ex);
kfree(interp_elf_phdata);
out_free_file:
- allow_write_access(interpreter);
if (interpreter)
fput(interpreter);
out_free_ph:
diff --git a/fs/binfmt_elf_fdpic.c b/fs/binfmt_elf_fdpic.c
index b799701454a9..28a3439f163a 100644
--- a/fs/binfmt_elf_fdpic.c
+++ b/fs/binfmt_elf_fdpic.c
@@ -394,7 +394,6 @@ static int load_elf_fdpic_binary(struct linux_binprm *bprm)
goto error;
}
- allow_write_access(interpreter);
fput(interpreter);
interpreter = NULL;
}
@@ -466,10 +465,8 @@ static int load_elf_fdpic_binary(struct linux_binprm *bprm)
retval = 0;
error:
- if (interpreter) {
- allow_write_access(interpreter);
+ if (interpreter)
fput(interpreter);
- }
kfree(interpreter_name);
kfree(exec_params.phdrs);
kfree(exec_params.loadmap);
diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
index 68fa225f89e5..21ce5ec1ea76 100644
--- a/fs/binfmt_misc.c
+++ b/fs/binfmt_misc.c
@@ -247,13 +247,10 @@ static int load_misc_binary(struct linux_binprm *bprm)
if (retval < 0)
goto ret;
- if (fmt->flags & MISC_FMT_OPEN_FILE) {
+ if (fmt->flags & MISC_FMT_OPEN_FILE)
interp_file = file_clone_open(fmt->interp_file);
- if (!IS_ERR(interp_file))
- deny_write_access(interp_file);
- } else {
+ else
interp_file = open_exec(fmt->interpreter);
- }
retval = PTR_ERR(interp_file);
if (IS_ERR(interp_file))
goto ret;
diff --git a/fs/exec.c b/fs/exec.c
index 40073142288f..4dee205452e2 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -952,10 +952,6 @@ static struct file *do_open_execat(int fd, struct filename *name, int flags)
path_noexec(&file->f_path)))
goto exit;
- err = deny_write_access(file);
- if (err)
- goto exit;
-
out:
return file;
@@ -971,8 +967,7 @@ static struct file *do_open_execat(int fd, struct filename *name, int flags)
*
* Returns ERR_PTR on failure or allocated struct file on success.
*
- * As this is a wrapper for the internal do_open_execat(), callers
- * must call allow_write_access() before fput() on release. Also see
+ * As this is a wrapper for the internal do_open_execat(). Also see
* do_close_execat().
*/
struct file *open_exec(const char *name)
@@ -1524,10 +1519,8 @@ static int prepare_bprm_creds(struct linux_binprm *bprm)
/* Matches do_open_execat() */
static void do_close_execat(struct file *file)
{
- if (!file)
- return;
- allow_write_access(file);
- fput(file);
+ if (file)
+ fput(file);
}
static void free_bprm(struct linux_binprm *bprm)
@@ -1846,7 +1839,6 @@ static int exec_binprm(struct linux_binprm *bprm)
bprm->file = bprm->interpreter;
bprm->interpreter = NULL;
- allow_write_access(exec);
if (unlikely(bprm->have_execfd)) {
if (bprm->executable) {
fput(exec);
diff --git a/kernel/fork.c b/kernel/fork.c
index 99076dbe27d8..763a042eef9c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -616,12 +616,6 @@ static void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm)
exe_file = get_mm_exe_file(oldmm);
RCU_INIT_POINTER(mm->exe_file, exe_file);
- /*
- * We depend on the oldmm having properly denied write access to the
- * exe_file already.
- */
- if (exe_file && deny_write_access(exe_file))
- pr_warn_once("deny_write_access() failed in %s\n", __func__);
}
#ifdef CONFIG_MMU
@@ -1412,20 +1406,11 @@ int set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file)
*/
old_exe_file = rcu_dereference_raw(mm->exe_file);
- if (new_exe_file) {
- /*
- * We expect the caller (i.e., sys_execve) to already denied
- * write access, so this is unlikely to fail.
- */
- if (unlikely(deny_write_access(new_exe_file)))
- return -EACCES;
+ if (new_exe_file)
get_file(new_exe_file);
- }
rcu_assign_pointer(mm->exe_file, new_exe_file);
- if (old_exe_file) {
- allow_write_access(old_exe_file);
+ if (old_exe_file)
fput(old_exe_file);
- }
return 0;
}
@@ -1464,9 +1449,6 @@ int replace_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file)
return ret;
}
- ret = deny_write_access(new_exe_file);
- if (ret)
- return -EACCES;
get_file(new_exe_file);
/* set the new file */
@@ -1475,10 +1457,8 @@ int replace_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file)
rcu_assign_pointer(mm->exe_file, new_exe_file);
mmap_write_unlock(mm);
- if (old_exe_file) {
- allow_write_access(old_exe_file);
+ if (old_exe_file)
fput(old_exe_file);
- }
return 0;
}
---
base-commit: 1613e604df0cd359cf2a7fbd9be7a0bcfacfabd0
change-id: 20240531-vfs-i_writecount-ee88353b2d7f
^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [PATCH][RFC] fs: add levels to inode write access
2024-05-31 12:32 ` Christian Brauner
2024-05-31 13:01 ` [PATCH] fs: don't block i_writecount during exec Christian Brauner
@ 2024-05-31 13:09 ` Amir Goldstein
2024-05-31 14:50 ` Christian Brauner
1 sibling, 1 reply; 27+ messages in thread
From: Amir Goldstein @ 2024-05-31 13:09 UTC (permalink / raw)
To: Christian Brauner
Cc: Linus Torvalds, Josef Bacik, linux-fsdevel, viro, jack, david,
hch, Mimi Zohar
On Fri, May 31, 2024 at 3:32 PM Christian Brauner <brauner@kernel.org> wrote:
>
> On Fri, May 31, 2024 at 12:02:16PM +0200, Christian Brauner wrote:
> > On Thu, May 30, 2024 at 08:49:12AM -0700, Linus Torvalds wrote:
> > > On Thu, 30 May 2024 at 03:32, Christian Brauner <brauner@kernel.org> wrote:
> > > >
> > > > Ofc depends on whether Linus still agrees that removing this might be
> > > > something we could try.
> > >
> > > I _definitely_ do not want to see any more complex deny_write_access().
> > >
> > > So yes, if people have good reasons to override the inode write
> > > access, I'd rather remove it entirely than make it some eldritch
> > > horror that is even worse than what we have now.
> > >
> > > It would obviously have to be tested in case some odd case actually
> > > depends on the ETXTBSY semantics, since we *have* supported it for a
> > > long time. But iirc nobody even noticed when we removed it from
> > > shared libraries, so...
> > >
> > > That said, verity seems to depend on it as a way to do the
> > > "enable_verity()" atomically with no concurrent writes, and I see some
> > > i_writecount noise in the integrity code too.
This one is a bit more challenging.
The IMA ima_bprm_check() LSM hook (called from exec_binprm() context)
may read the file (in ima_collect_measurement()) and record the signature
of the file to be executed, assuming that it cannot be modified.
Not sure how to deal with this expectation.
Only thing I could think of is that IMA would be allowed to
deny_write_access() and set FMODE_EXEC_DENY_WRITE
as a hint for do_close_execat() to allow_write_access(), but
it's pretty ugly, I admit.
> > >
> > > But maybe that's just a belt-and-suspenders thing?
> > >
> > > Because if execve() no longer does it, I think we should just remove
> > > that i_writecount thing entirely.
> >
> > deny_write_access() is being used from kernel_read_file() which has a
> > few wrappers around it and they are used in various places:
> >
> > (1) kernel_read_file() based helpers:
> > (1.1) kernel_read_file_from_path()
> > (1.2) kernel_read_file_from_path_initns()
> > (1.3) kernel_read_file_from_fd()
> >
> > (2) kernel_read_file() users:
> > (2.1) kernel/module/main.c:init_module_from_file()
> > (2.2) security/loadpin/loadpin.c:read_trusted_verity_root_digests()
> >
> > (3) kernel_read_file_from_path() users:
> > (3.1) security/integrity/digsig.c:integrity_load_x509()
> > (3.2) security/integrity/ima/ima_fs.c:ima_read_busy()
> >
> > (4) kernel_read_file_from_path_initns() users:
> > (4.1) drivers/base/firmware_loader/main.c:fw_get_filesystem_firmware()
> >
> > (5) kernel_read_file_from_fd() users:
> > (5.1) kernel/kexec_file.c:kimage_file_prepare_segments()
> >
> > In order to remove i_writecount completely we would need to do this in
>
> Sorry, typo s/i_write_count/deny_write_access()/g
> (I don't think we can remove i_writecount itself as it's used for file
> leases and locks.)
Indeed, i_writecount (as does i_readcount) is used by fs/locks.c:
check_conflicting_open(), but not as a synchronization primitive.
>
> > multiple steps as some of that stuff seems potentially sensitive.
> >
> > The exec deny write mechanism can be removed because we have a decent
> > understanding of the implications and there's decent justification for
> > removing it.
> >
> > So I propose that I do various testing (LTP) etc. now, send the patch
> > and then put this into -next to see if anything breaks?
Wouldn't hurt to see what else we are missing.
Thanks,
Amir.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH][RFC] fs: add levels to inode write access
2024-05-31 13:09 ` [PATCH][RFC] fs: add levels to inode write access Amir Goldstein
@ 2024-05-31 14:50 ` Christian Brauner
2024-05-31 15:47 ` Matthew Wilcox
0 siblings, 1 reply; 27+ messages in thread
From: Christian Brauner @ 2024-05-31 14:50 UTC (permalink / raw)
To: Amir Goldstein, Linus Torvalds
Cc: Josef Bacik, linux-fsdevel, viro, jack, david, hch, Mimi Zohar
On Fri, May 31, 2024 at 04:09:17PM +0300, Amir Goldstein wrote:
> On Fri, May 31, 2024 at 3:32 PM Christian Brauner <brauner@kernel.org> wrote:
> >
> > On Fri, May 31, 2024 at 12:02:16PM +0200, Christian Brauner wrote:
> > > On Thu, May 30, 2024 at 08:49:12AM -0700, Linus Torvalds wrote:
> > > > On Thu, 30 May 2024 at 03:32, Christian Brauner <brauner@kernel.org> wrote:
> > > > >
> > > > > Ofc depends on whether Linus still agrees that removing this might be
> > > > > something we could try.
> > > >
> > > > I _definitely_ do not want to see any more complex deny_write_access().
> > > >
> > > > So yes, if people have good reasons to override the inode write
> > > > access, I'd rather remove it entirely than make it some eldritch
> > > > horror that is even worse than what we have now.
> > > >
> > > > It would obviously have to be tested in case some odd case actually
> > > > depends on the ETXTBSY semantics, since we *have* supported it for a
> > > > long time. But iirc nobody even noticed when we removed it from
> > > > shared libraries, so...
> > > >
> > > > That said, verity seems to depend on it as a way to do the
> > > > "enable_verity()" atomically with no concurrent writes, and I see some
> > > > i_writecount noise in the integrity code too.
>
> This one is a bit more challenging.
> The IMA ima_bprm_check() LSM hook (called from exec_binprm() context)
> may read the file (in ima_collect_measurement()) and record the signature
> of the file to be executed, assuming that it cannot be modified.
> Not sure how to deal with this expectation.
This is very annoying. And it probably doesn't work for dynamic
binaries. 90% of the code will be in libraries to which one can happily
write. Plus there's various attacks to break this:
https://svs.informatik.uni-hamburg.de/publications/2020/2020-08-27-Bohling-IMA.pdf
Anyway, I really really don't want to add more complex
deny_write_access() either. It's hard to understand, it's hard to
document and it punches holes into this anyway. The freezing levels are
annoying enough already.
So then I propose we just make the deny write stuff during exec
conditional on IMA being active. At the end it's small- vs chicken pox.
(I figure it won't be enough for IMA to read the executable after it has
been mapped MS_PRIVATE?)
> Only thing I could think of is that IMA would be allowed to
> deny_write_access() and set FMODE_EXEC_DENY_WRITE
> as a hint for do_close_execat() to allow_write_access(), but
> it's pretty ugly, I admit.
>
> > > >
> > > > But maybe that's just a belt-and-suspenders thing?
> > > >
> > > > Because if execve() no longer does it, I think we should just remove
> > > > that i_writecount thing entirely.
> > >
> > > deny_write_access() is being used from kernel_read_file() which has a
> > > few wrappers around it and they are used in various places:
> > >
> > > (1) kernel_read_file() based helpers:
> > > (1.1) kernel_read_file_from_path()
> > > (1.2) kernel_read_file_from_path_initns()
> > > (1.3) kernel_read_file_from_fd()
> > >
> > > (2) kernel_read_file() users:
> > > (2.1) kernel/module/main.c:init_module_from_file()
> > > (2.2) security/loadpin/loadpin.c:read_trusted_verity_root_digests()
> > >
> > > (3) kernel_read_file_from_path() users:
> > > (3.1) security/integrity/digsig.c:integrity_load_x509()
> > > (3.2) security/integrity/ima/ima_fs.c:ima_read_busy()
> > >
> > > (4) kernel_read_file_from_path_initns() users:
> > > (4.1) drivers/base/firmware_loader/main.c:fw_get_filesystem_firmware()
> > >
> > > (5) kernel_read_file_from_fd() users:
> > > (5.1) kernel/kexec_file.c:kimage_file_prepare_segments()
> > >
> > > In order to remove i_writecount completely we would need to do this in
> >
> > Sorry, typo s/i_write_count/deny_write_access()/g
> > (I don't think we can remove i_writecount itself as it's used for file
> > leases and locks.)
>
> Indeed, i_writecount (as does i_readcount) is used by fs/locks.c:
> check_conflicting_open(), but not as a synchronization primitive.
>
> >
> > > multiple steps as some of that stuff seems potentially sensitive.
> > >
> > > The exec deny write mechanism can be removed because we have a decent
> > > understanding of the implications and there's decent justification for
> > > removing it.
> > >
> > > So I propose that I do various testing (LTP) etc. now, send the patch
> > > and then put this into -next to see if anything breaks?
>
> Wouldn't hurt to see what else we are missing.
>
> Thanks,
> Amir.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] fs: don't block i_writecount during exec
2024-05-31 13:01 ` [PATCH] fs: don't block i_writecount during exec Christian Brauner
@ 2024-05-31 15:40 ` Linus Torvalds
2024-05-31 18:08 ` Jeff Layton
2024-05-31 22:08 ` Josef Bacik
` (3 subsequent siblings)
4 siblings, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2024-05-31 15:40 UTC (permalink / raw)
To: Christian Brauner
Cc: Josef Bacik, amir73il, linux-fsdevel, viro, jack, david, hch
On Fri, 31 May 2024 at 06:05, Christian Brauner <brauner@kernel.org> wrote:
>
> Back in 2021 we already discussed removing deny_write_access() for
> executables. Back then I was hesistant because I thought that this might
> cause issues in userspace. But even back then I had started taking some
> notes on what could potentially depend on this and I didn't come up with
> a lot so I've changed my mind and I would like to try this.
Ack. Let's try it and see if anybody notices. Judging by past
performance, nobody will, but...
I still think we should strive to remove the underlying i_writecount
entirely, but yes, even if we're eventually able to do that, it should
be done in small steps, and this is the obvious first one.
Linus
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH][RFC] fs: add levels to inode write access
2024-05-31 14:50 ` Christian Brauner
@ 2024-05-31 15:47 ` Matthew Wilcox
2024-05-31 22:14 ` Matthew Wilcox
0 siblings, 1 reply; 27+ messages in thread
From: Matthew Wilcox @ 2024-05-31 15:47 UTC (permalink / raw)
To: Christian Brauner
Cc: Amir Goldstein, Linus Torvalds, Josef Bacik, linux-fsdevel, viro,
jack, david, hch, Mimi Zohar
On Fri, May 31, 2024 at 04:50:16PM +0200, Christian Brauner wrote:
> So then I propose we just make the deny write stuff during exec
> conditional on IMA being active. At the end it's small- vs chicken pox.
>
> (I figure it won't be enough for IMA to read the executable after it has
> been mapped MS_PRIVATE?)
do you mean MAP_PRIVATE?
If so, you have a misapprehension. We can change the contents of the
pagecache after MAP_PRIVATE and that will not cause COW. COW only
occurs if someone writes through a MAP_PRIVATE.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] fs: don't block i_writecount during exec
2024-05-31 15:40 ` Linus Torvalds
@ 2024-05-31 18:08 ` Jeff Layton
0 siblings, 0 replies; 27+ messages in thread
From: Jeff Layton @ 2024-05-31 18:08 UTC (permalink / raw)
To: Linus Torvalds, Christian Brauner
Cc: Josef Bacik, amir73il, linux-fsdevel, viro, jack, david, hch
On Fri, 2024-05-31 at 08:40 -0700, Linus Torvalds wrote:
> On Fri, 31 May 2024 at 06:05, Christian Brauner <brauner@kernel.org>
> wrote:
> >
> > Back in 2021 we already discussed removing deny_write_access() for
> > executables. Back then I was hesistant because I thought that this
> > might
> > cause issues in userspace. But even back then I had started taking
> > some
> > notes on what could potentially depend on this and I didn't come up
> > with
> > a lot so I've changed my mind and I would like to try this.
>
> Ack. Let's try it and see if anybody notices. Judging by past
> performance, nobody will, but...
>
> I still think we should strive to remove the underlying i_writecount
> entirely, but yes, even if we're eventually able to do that, it
> should
> be done in small steps, and this is the obvious first one.
>
>
FWIW: file leases also use i_writecount and i_readcount(...see
check_conflicting_open()), but they don't rely on the semantics for
blocking writes. They're just interested in whether the inode is open
for read or write anywhere.
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] fs: don't block i_writecount during exec
2024-05-31 13:01 ` [PATCH] fs: don't block i_writecount during exec Christian Brauner
2024-05-31 15:40 ` Linus Torvalds
@ 2024-05-31 22:08 ` Josef Bacik
2024-06-03 13:52 ` (subset) " Christian Brauner
` (2 subsequent siblings)
4 siblings, 0 replies; 27+ messages in thread
From: Josef Bacik @ 2024-05-31 22:08 UTC (permalink / raw)
To: Christian Brauner
Cc: Linus Torvalds, amir73il, linux-fsdevel, viro, jack, david, hch
On Fri, May 31, 2024 at 03:01:43PM +0200, Christian Brauner wrote:
> Back in 2021 we already discussed removing deny_write_access() for
> executables. Back then I was hesistant because I thought that this might
> cause issues in userspace. But even back then I had started taking some
> notes on what could potentially depend on this and I didn't come up with
> a lot so I've changed my mind and I would like to try this.
>
> Here are some of the notes that I took:
>
> (1) The deny_write_access() mechanism is causing really pointless issues
> such as [1]. If a thread in a thread-group opens a file writable,
> then writes some stuff, then closing the file descriptor and then
> calling execve() they can fail the execve() with ETXTBUSY because
> another thread in the thread-group could have concurrently called
> fork(). Multi-threaded libraries such as go suffer from this.
>
> (2) There are userspace attacks that rely on overwriting the binary of a
> running process. These attacks are _mitigated_ but _not at all
> prevented_ from ocurring by the deny_write_access() mechanism.
>
> I'll go over some details. The clearest example of such attacks was
> the attack against runC in CVE-2019-5736 (cf. [3]).
>
> An attack could compromise the runC host binary from inside a
> _privileged_ runC container. The malicious binary could then be used
> to take over the host.
>
> (It is crucial to note that this attack is _not_ possible with
> unprivileged containers. IOW, the setup here is already insecure.)
>
> The attack can be made when attaching to a running container or when
> starting a container running a specially crafted image. For example,
> when runC attaches to a container the attacker can trick it into
> executing itself.
>
> This could be done by replacing the target binary inside the
> container with a custom binary pointing back at the runC binary
> itself. As an example, if the target binary was /bin/bash, this
> could be replaced with an executable script specifying the
> interpreter path #!/proc/self/exe.
>
> As such when /bin/bash is executed inside the container, instead the
> target of /proc/self/exe will be executed. That magic link will
> point to the runc binary on the host. The attacker can then proceed
> to write to the target of /proc/self/exe to try and overwrite the
> runC binary on the host.
>
> However, this will not succeed because of deny_write_access(). Now,
> one might think that this would prevent the attack but it doesn't.
>
> To overcome this, the attacker has multiple ways:
> * Open a file descriptor to /proc/self/exe using the O_PATH flag and
> then proceed to reopen the binary as O_WRONLY through
> /proc/self/fd/<nr> and try to write to it in a busy loop from a
> separate process. Ultimately it will succeed when the runC binary
> exits. After this the runC binary is compromised and can be used
> to attack other containers or the host itself.
> * Use a malicious shared library annotating a function in there with
> the constructor attribute making the malicious function run as an
> initializor. The malicious library will then open /proc/self/exe
> for creating a new entry under /proc/self/fd/<nr>. It'll then call
> exec to a) force runC to exit and b) hand the file descriptor off
> to a program that then reopens /proc/self/fd/<nr> for writing
> (which is now possible because runC has exited) and overwriting
> that binary.
>
> To sum up: the deny_write_access() mechanism doesn't prevent such
> attacks in insecure setups. It just makes them minimally harder.
> That's all.
>
> The only way back then to prevent this is to create a temporary copy
> of the calling binary itself when it starts or attaches to
> containers. So what I did back then for LXC (and Aleksa for runC)
> was to create an anonymous, in-memory file using the memfd_create()
> system call and to copy itself into the temporary in-memory file,
> which is then sealed to prevent further modifications. This sealed,
> in-memory file copy is then executed instead of the original on-disk
> binary.
>
> Any compromising write operations from a privileged container to the
> host binary will then write to the temporary in-memory binary and
> not to the host binary on-disk, preserving the integrity of the host
> binary. Also as the temporary, in-memory binary is sealed, writes to
> this will also fail.
>
> The point is that deny_write_access() is uselss to prevent these
> attacks.
>
> (3) Denying write access to an inode because it's currently used in an
> exec path could easily be done on an LSM level. It might need an
> additional hook but that should be about it.
>
> (4) The MAP_DENYWRITE flag for mmap() has been deprecated a long time
> ago so while we do protect the main executable the bigger portion of
> the things you'd think need protecting such as the shared libraries
> aren't. IOW, we let anyone happily overwrite shared libraries.
>
> (5) We removed all remaining uses of VM_DENYWRITE in [2]. That means:
> (5.1) We removed the legacy uselib() protection for preventing
> overwriting of shared libraries. Nobody cared in 3 years.
> (5.2) We allow write access to the elf interpreter after exec
> completed treating it on a par with shared libraries.
>
> Yes, someone in userspace could potentially be relying on this. It's not
> completely out of the realm of possibility but let's find out if that's
> actually the case and not guess.
>
> Link: https://github.com/golang/go/issues/22315 [1]
> Link: 49624efa65ac ("Merge tag 'denywrite-for-5.15' of git://github.com/davidhildenbrand/linux") [2]
> Link: https://unit42.paloaltonetworks.com/breaking-docker-via-runc-explaining-cve-2019-5736 [3]
> Link: https://lwn.net/Articles/866493
> Link: https://github.com/golang/go/issues/22220
> Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/work/buildid.go#L724
> Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/work/exec.go#L1493
> Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/script/cmds.go#L457
> Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/test/test.go#L1557
> Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/os/exec/lp_linux_test.go#L61
> Link: https://github.com/buildkite/agent/pull/2736
> Link: https://github.com/rust-lang/rust/issues/114554
> Link: https://bugs.openjdk.org/browse/JDK-8068370
> Link: https://github.com/dotnet/runtime/issues/58964
> Signed-off-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Thanks,
Josef
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH][RFC] fs: add levels to inode write access
2024-05-31 15:47 ` Matthew Wilcox
@ 2024-05-31 22:14 ` Matthew Wilcox
0 siblings, 0 replies; 27+ messages in thread
From: Matthew Wilcox @ 2024-05-31 22:14 UTC (permalink / raw)
To: Christian Brauner
Cc: Amir Goldstein, Linus Torvalds, Josef Bacik, linux-fsdevel, viro,
jack, david, hch, Mimi Zohar
On Fri, May 31, 2024 at 04:47:20PM +0100, Matthew Wilcox wrote:
> On Fri, May 31, 2024 at 04:50:16PM +0200, Christian Brauner wrote:
> > So then I propose we just make the deny write stuff during exec
> > conditional on IMA being active. At the end it's small- vs chicken pox.
> >
> > (I figure it won't be enough for IMA to read the executable after it has
> > been mapped MS_PRIVATE?)
>
> do you mean MAP_PRIVATE?
>
> If so, you have a misapprehension. We can change the contents of the
> pagecache after MAP_PRIVATE and that will not cause COW. COW only
> occurs if someone writes through a MAP_PRIVATE.
If IMA does want to prevent writes, I suggest it puts a lease on the
file. That's a mechanism that all writes must honour, rather than
it being an IMA speciality.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: (subset) [PATCH] fs: don't block i_writecount during exec
2024-05-31 13:01 ` [PATCH] fs: don't block i_writecount during exec Christian Brauner
2024-05-31 15:40 ` Linus Torvalds
2024-05-31 22:08 ` Josef Bacik
@ 2024-06-03 13:52 ` Christian Brauner
2024-06-06 12:45 ` Aishwarya TCV
2024-09-04 17:04 ` Jann Horn
4 siblings, 0 replies; 27+ messages in thread
From: Christian Brauner @ 2024-06-03 13:52 UTC (permalink / raw)
To: Christian Brauner
Cc: linux-fsdevel, viro, jack, david, hch, Josef Bacik,
Linus Torvalds, amir73il
On Fri, 31 May 2024 15:01:43 +0200, Christian Brauner wrote:
> Back in 2021 we already discussed removing deny_write_access() for
> executables. Back then I was hesistant because I thought that this might
> cause issues in userspace. But even back then I had started taking some
> notes on what could potentially depend on this and I didn't come up with
> a lot so I've changed my mind and I would like to try this.
>
> Here are some of the notes that I took:
>
> [...]
Applied to the vfs.misc branch of the vfs/vfs.git tree.
Patches in the vfs.misc branch should appear in linux-next soon.
Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.
It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.
Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.
tree: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.misc
[1/1] fs: don't block i_writecount during exec
https://git.kernel.org/vfs/vfs/c/244ebddd34a0
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] fs: don't block i_writecount during exec
2024-05-31 13:01 ` [PATCH] fs: don't block i_writecount during exec Christian Brauner
` (2 preceding siblings ...)
2024-06-03 13:52 ` (subset) " Christian Brauner
@ 2024-06-06 12:45 ` Aishwarya TCV
2024-06-06 15:37 ` Mark Brown
2024-09-04 17:04 ` Jann Horn
4 siblings, 1 reply; 27+ messages in thread
From: Aishwarya TCV @ 2024-06-06 12:45 UTC (permalink / raw)
To: Christian Brauner
Cc: linux-fsdevel, viro, jack, david, hch, Josef Bacik,
Linus Torvalds, amir73il, Mark Brown, Naresh Kamboju
On 31/05/2024 14:01, Christian Brauner wrote:
> Back in 2021 we already discussed removing deny_write_access() for
>
> executables. Back then I was hesistant because I thought that this might
>
> cause issues in userspace. But even back then I had started taking some
>
> notes on what could potentially depend on this and I didn't come up with
>
> a lot so I've changed my mind and I would like to try this.
>
>
>
Hi Christian,
LTP test "execve04" is failing when run against
next-master(next-20240606) kernel with Arm64 on JUNO in our CI.
A bisect identified 244ebddd34a0ab7b1ef865811864136873f4b67c as the
first bad commit. Bisected it on the tag "next-20240604" at repo
"https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git".
This works fine on Linux version 6.10-rc2
Failure log
------------
tst_test.c:1690: TINFO: LTP version: 20230929
tst_test.c:1574: TINFO: Timeout per run is 0h 01m 30s
execve_child.c:27: TFAIL: execve_child shouldn't be executed
tst_test.c:1622: TINFO: Killed the leftover descendant processes
Summary:
passed 0
failed 1
broken 0
skipped 0
warnings 0
Bisect log:
----------
git bisect start
# good: [c3f38fa61af77b49866b006939479069cd451173] Linux 6.10-rc2
git bisect good c3f38fa61af77b49866b006939479069cd451173
# bad: [d97496ca23a2d4ee80b7302849404859d9058bcd] Add linux-next
specific files for 20240604
git bisect bad d97496ca23a2d4ee80b7302849404859d9058bcd
# bad: [f89ceec12d7134c9be89f880655b72e36dd3c681] Merge branch
'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git
git bisect bad f89ceec12d7134c9be89f880655b72e36dd3c681
# good: [a933c8b54a9ea9bb6b10d168351638933c40b487] Merge branch
'for-next' of
git://git.kernel.org/pub/scm/linux/kernel/git/mmind/linux-rockchip.git
git bisect good a933c8b54a9ea9bb6b10d168351638933c40b487
# bad: [2d893301453cefb11116f6095a791b8013bbc3e9] Merge branch
'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux.git
git bisect bad 2d893301453cefb11116f6095a791b8013bbc3e9
# good: [a6a2e13e3bc64365c70b52d42b5d3a674152d5cd] Merge branch
'nfsd-next' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
git bisect good a6a2e13e3bc64365c70b52d42b5d3a674152d5cd
# bad: [e4c1aa3f74bcc90cff6cf8ea575e222066f5be7b] Merge branch 'fs-next'
of linux-next
git bisect bad e4c1aa3f74bcc90cff6cf8ea575e222066f5be7b
# good: [37007c875042e5f5fc65087ea7d67ca388412875] Merge branch
'renesas-clk' of
git://git.kernel.org/pub/scm/linux/kernel/git/geert/renesas-drivers.git
git bisect good 37007c875042e5f5fc65087ea7d67ca388412875
# bad: [4247aca2219daab5ea6aeebc3d27d32ad56d2472] Merge branch
'vfs.module.description' into vfs.all
git bisect bad 4247aca2219daab5ea6aeebc3d27d32ad56d2472
# bad: [68efe7d9f3055ee986c51178e76061bdad782ff9] Merge branch
'vfs.xattr' into vfs.all
git bisect bad 68efe7d9f3055ee986c51178e76061bdad782ff9
# good: [1f9ccdf69c9ffb9a9084cc6e1a47c5030cebed26] readdir: Add missing
quote in macro comment
git bisect good 1f9ccdf69c9ffb9a9084cc6e1a47c5030cebed26
# bad: [03950a7a6f1a511a372bc46cd4fddd2df7e37a62] Merge branch
'vfs.misc' into vfs.all
git bisect bad 03950a7a6f1a511a372bc46cd4fddd2df7e37a62
# bad: [f113ef08b6bde5f4c74eb4d66f7ca52e09305bb0] tmpfs: don't interrupt
fallocate with EINTR
git bisect bad f113ef08b6bde5f4c74eb4d66f7ca52e09305bb0
# bad: [244ebddd34a0ab7b1ef865811864136873f4b67c] fs: don't block
i_writecount during exec
git bisect bad 244ebddd34a0ab7b1ef865811864136873f4b67c
# first bad commit: [244ebddd34a0ab7b1ef865811864136873f4b67c] fs: don't
block i_writecount during exec
Thanks,
Aishwarya
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] fs: don't block i_writecount during exec
2024-06-06 12:45 ` Aishwarya TCV
@ 2024-06-06 15:37 ` Mark Brown
2024-06-06 16:53 ` Josef Bacik
0 siblings, 1 reply; 27+ messages in thread
From: Mark Brown @ 2024-06-06 15:37 UTC (permalink / raw)
To: Aishwarya TCV
Cc: Christian Brauner, linux-fsdevel, viro, jack, david, hch,
Josef Bacik, Linus Torvalds, amir73il, Naresh Kamboju
[-- Attachment #1: Type: text/plain, Size: 1851 bytes --]
On Thu, Jun 06, 2024 at 01:45:05PM +0100, Aishwarya TCV wrote:
> On 31/05/2024 14:01, Christian Brauner wrote:
> > Back in 2021 we already discussed removing deny_write_access() for
> >
> > executables. Back then I was hesistant because I thought that this might
> >
> > cause issues in userspace. But even back then I had started taking some
> >
> > notes on what could potentially depend on this and I didn't come up with
> >
> > a lot so I've changed my mind and I would like to try this.
> LTP test "execve04" is failing when run against
> next-master(next-20240606) kernel with Arm64 on JUNO in our CI.
It's also causing the LTP creat07 test to fail with basically the same
bisection (I started from next/pending-fixes rather than the -rc so the
initial phases were different):
tst_test.c:1690: TINFO: LTP version: 20230929
tst_test.c:1574: TINFO: Timeout per run is 0h 01m 30s
creat07.c:37: TFAIL: creat() succeeded unexpectedly
Test timeouted, sending SIGKILL!
tst_test.c:1622: TINFO: Killed the leftover descendant processes
tst_test.c:1628: TINFO: If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1
tst_test.c:1630: TBROK: Test killed! (timeout?)
The code in the testcase is below:
static void verify_creat(void)
{
pid_t pid;
pid = SAFE_FORK();
if (pid == 0) {
SAFE_EXECL(TEST_APP, TEST_APP, NULL);
exit(1);
}
TST_CHECKPOINT_WAIT(0);
TEST(creat(TEST_APP, O_WRONLY));
if (TST_RET != -1) {
tst_res(TFAIL, "creat() succeeded unexpectedly");
return;
}
if (TST_ERR == ETXTBSY)
tst_res(TPASS, "creat() received EXTBSY");
else
tst_res(TFAIL | TTERRNO, "creat() failed unexpectedly");
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] fs: don't block i_writecount during exec
2024-06-06 15:37 ` Mark Brown
@ 2024-06-06 16:53 ` Josef Bacik
2024-06-06 17:33 ` Mark Brown
0 siblings, 1 reply; 27+ messages in thread
From: Josef Bacik @ 2024-06-06 16:53 UTC (permalink / raw)
To: Mark Brown
Cc: Aishwarya TCV, Christian Brauner, linux-fsdevel, viro, jack,
david, hch, Linus Torvalds, amir73il, Naresh Kamboju
On Thu, Jun 06, 2024 at 04:37:49PM +0100, Mark Brown wrote:
> On Thu, Jun 06, 2024 at 01:45:05PM +0100, Aishwarya TCV wrote:
> > On 31/05/2024 14:01, Christian Brauner wrote:
>
> > > Back in 2021 we already discussed removing deny_write_access() for
> > >
> > > executables. Back then I was hesistant because I thought that this might
> > >
> > > cause issues in userspace. But even back then I had started taking some
> > >
> > > notes on what could potentially depend on this and I didn't come up with
> > >
> > > a lot so I've changed my mind and I would like to try this.
>
> > LTP test "execve04" is failing when run against
> > next-master(next-20240606) kernel with Arm64 on JUNO in our CI.
>
> It's also causing the LTP creat07 test to fail with basically the same
> bisection (I started from next/pending-fixes rather than the -rc so the
> initial phases were different):
>
> tst_test.c:1690: TINFO: LTP version: 20230929
> tst_test.c:1574: TINFO: Timeout per run is 0h 01m 30s
> creat07.c:37: TFAIL: creat() succeeded unexpectedly
> Test timeouted, sending SIGKILL!
> tst_test.c:1622: TINFO: Killed the leftover descendant processes
> tst_test.c:1628: TINFO: If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1
> tst_test.c:1630: TBROK: Test killed! (timeout?)
>
> The code in the testcase is below:
>
> static void verify_creat(void)
> {
> pid_t pid;
>
> pid = SAFE_FORK();
> if (pid == 0) {
> SAFE_EXECL(TEST_APP, TEST_APP, NULL);
> exit(1);
> }
>
> TST_CHECKPOINT_WAIT(0);
>
> TEST(creat(TEST_APP, O_WRONLY));
>
> if (TST_RET != -1) {
> tst_res(TFAIL, "creat() succeeded unexpectedly");
> return;
> }
>
> if (TST_ERR == ETXTBSY)
> tst_res(TPASS, "creat() received EXTBSY");
> else
> tst_res(TFAIL | TTERRNO, "creat() failed unexpectedly");
>
These tests will have to be updated, as this patch removes that behavior.
Thanks,
Josef
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] fs: don't block i_writecount during exec
2024-06-06 16:53 ` Josef Bacik
@ 2024-06-06 17:33 ` Mark Brown
2024-06-06 17:49 ` Mark Brown
0 siblings, 1 reply; 27+ messages in thread
From: Mark Brown @ 2024-06-06 17:33 UTC (permalink / raw)
To: Josef Bacik
Cc: Aishwarya TCV, Christian Brauner, linux-fsdevel, viro, jack,
david, hch, Linus Torvalds, amir73il, Naresh Kamboju, ltp
[-- Attachment #1: Type: text/plain, Size: 1168 bytes --]
On Thu, Jun 06, 2024 at 12:53:18PM -0400, Josef Bacik wrote:
> On Thu, Jun 06, 2024 at 04:37:49PM +0100, Mark Brown wrote:
> > On Thu, Jun 06, 2024 at 01:45:05PM +0100, Aishwarya TCV wrote:
> > > LTP test "execve04" is failing when run against
> > > next-master(next-20240606) kernel with Arm64 on JUNO in our CI.
> > It's also causing the LTP creat07 test to fail with basically the same
> > bisection (I started from next/pending-fixes rather than the -rc so the
> > initial phases were different):
> > tst_test.c:1690: TINFO: LTP version: 20230929
> > tst_test.c:1574: TINFO: Timeout per run is 0h 01m 30s
> > creat07.c:37: TFAIL: creat() succeeded unexpectedly
> > Test timeouted, sending SIGKILL!
> > tst_test.c:1622: TINFO: Killed the leftover descendant processes
> > tst_test.c:1628: TINFO: If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1
> > tst_test.c:1630: TBROK: Test killed! (timeout?)
> > The code in the testcase is below:
> These tests will have to be updated, as this patch removes that behavior.
Adding the LTP list - looking at execve04 it seems to be trying for a
similar thing to creat07, it's looking for an ETXTBUSY.
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] fs: don't block i_writecount during exec
2024-06-06 17:33 ` Mark Brown
@ 2024-06-06 17:49 ` Mark Brown
2024-08-07 9:59 ` Tudor Ambarus
0 siblings, 1 reply; 27+ messages in thread
From: Mark Brown @ 2024-06-06 17:49 UTC (permalink / raw)
To: Josef Bacik
Cc: Aishwarya TCV, Christian Brauner, linux-fsdevel, viro, jack,
david, hch, Linus Torvalds, amir73il, Naresh Kamboju, ltp
[-- Attachment #1: Type: text/plain, Size: 1347 bytes --]
On Thu, Jun 06, 2024 at 06:33:34PM +0100, Mark Brown wrote:
> On Thu, Jun 06, 2024 at 12:53:18PM -0400, Josef Bacik wrote:
> > On Thu, Jun 06, 2024 at 04:37:49PM +0100, Mark Brown wrote:
> > > On Thu, Jun 06, 2024 at 01:45:05PM +0100, Aishwarya TCV wrote:
>
> > > > LTP test "execve04" is failing when run against
> > > > next-master(next-20240606) kernel with Arm64 on JUNO in our CI.
>
> > > It's also causing the LTP creat07 test to fail with basically the same
> > > bisection (I started from next/pending-fixes rather than the -rc so the
> > > initial phases were different):
>
> > > tst_test.c:1690: TINFO: LTP version: 20230929
> > > tst_test.c:1574: TINFO: Timeout per run is 0h 01m 30s
> > > creat07.c:37: TFAIL: creat() succeeded unexpectedly
> > > Test timeouted, sending SIGKILL!
> > > tst_test.c:1622: TINFO: Killed the leftover descendant processes
> > > tst_test.c:1628: TINFO: If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1
> > > tst_test.c:1630: TBROK: Test killed! (timeout?)
>
> > > The code in the testcase is below:
>
> > These tests will have to be updated, as this patch removes that behavior.
>
> Adding the LTP list - looking at execve04 it seems to be trying for a
> similar thing to creat07, it's looking for an ETXTBUSY.
Or not since they reject signed mail :/
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] fs: don't block i_writecount during exec
2024-06-06 17:49 ` Mark Brown
@ 2024-08-07 9:59 ` Tudor Ambarus
0 siblings, 0 replies; 27+ messages in thread
From: Tudor Ambarus @ 2024-08-07 9:59 UTC (permalink / raw)
To: Mark Brown, Josef Bacik
Cc: Aishwarya TCV, Christian Brauner, linux-fsdevel, viro, jack,
david, hch, Linus Torvalds, amir73il, Naresh Kamboju, ltp,
lee@kernel.org, Peter Griffin, André Draszik,
William McVicker
On 6/6/24 6:49 PM, Mark Brown wrote:
> On Thu, Jun 06, 2024 at 06:33:34PM +0100, Mark Brown wrote:
>> On Thu, Jun 06, 2024 at 12:53:18PM -0400, Josef Bacik wrote:
>>> On Thu, Jun 06, 2024 at 04:37:49PM +0100, Mark Brown wrote:
>>>> On Thu, Jun 06, 2024 at 01:45:05PM +0100, Aishwarya TCV wrote:
>>
>>>>> LTP test "execve04" is failing when run against
>>>>> next-master(next-20240606) kernel with Arm64 on JUNO in our CI.
>>
>>>> It's also causing the LTP creat07 test to fail with basically the same
>>>> bisection (I started from next/pending-fixes rather than the -rc so the
>>>> initial phases were different):
>>
>>>> tst_test.c:1690: TINFO: LTP version: 20230929
>>>> tst_test.c:1574: TINFO: Timeout per run is 0h 01m 30s
>>>> creat07.c:37: TFAIL: creat() succeeded unexpectedly
>>>> Test timeouted, sending SIGKILL!
>>>> tst_test.c:1622: TINFO: Killed the leftover descendant processes
>>>> tst_test.c:1628: TINFO: If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1
>>>> tst_test.c:1630: TBROK: Test killed! (timeout?)
>>
>>>> The code in the testcase is below:
>>
>>> These tests will have to be updated, as this patch removes that behavior.
>>
>> Adding the LTP list - looking at execve04 it seems to be trying for a
>> similar thing to creat07, it's looking for an ETXTBUSY.
>
> Or not since they reject signed mail :/
FYI, I encountered the same test failures. I opened a bug on the github
ltp project suggesting that the tests need to be updated. Here it is:
https://github.com/linux-test-project/ltp/issues/1184
Cheers,
ta
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] fs: don't block i_writecount during exec
2024-05-31 13:01 ` [PATCH] fs: don't block i_writecount during exec Christian Brauner
` (3 preceding siblings ...)
2024-06-06 12:45 ` Aishwarya TCV
@ 2024-09-04 17:04 ` Jann Horn
2024-09-05 7:38 ` Roberto Sassu
4 siblings, 1 reply; 27+ messages in thread
From: Jann Horn @ 2024-09-04 17:04 UTC (permalink / raw)
To: Christian Brauner, Mimi Zohar, Roberto Sassu, Dmitry Kasatkin,
Eric Snowberg
Cc: Josef Bacik, Linus Torvalds, amir73il, linux-fsdevel, viro, jack,
david, hch, linux-integrity
[necrothreading...]
[+IMA folks]
On Fri, May 31, 2024 at 3:01 PM Christian Brauner <brauner@kernel.org> wrote:
> Back in 2021 we already discussed removing deny_write_access() for
> executables. Back then I was hesistant because I thought that this might
> cause issues in userspace. But even back then I had started taking some
> notes on what could potentially depend on this and I didn't come up with
> a lot so I've changed my mind and I would like to try this.
[snip]
> Yes, someone in userspace could potentially be relying on this. It's not
> completely out of the realm of possibility but let's find out if that's
> actually the case and not guess.
FYI, ima_bprm_check() still has a comment that claims that executables
use deny_write_access():
/**
* ima_bprm_check - based on policy, collect/store measurement.
* @bprm: contains the linux_binprm structure
*
* The OS protects against an executable file, already open for write,
* from being executed in deny_write_access() and an executable file,
* already open for execute, from being modified in get_write_access().
* So we can be certain that what we verify and measure here is actually
* what is being executed.
*
* On success return 0. On integrity appraisal error, assuming the file
* is in policy and IMA-appraisal is in enforcing mode, return -EACCES.
*/
But what actually happens in there is not so different from what
happens in ima_file_mmap(), so I think probably the only change
required here is to fix up the comment...
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] fs: don't block i_writecount during exec
2024-09-04 17:04 ` Jann Horn
@ 2024-09-05 7:38 ` Roberto Sassu
0 siblings, 0 replies; 27+ messages in thread
From: Roberto Sassu @ 2024-09-05 7:38 UTC (permalink / raw)
To: Jann Horn, Christian Brauner, Mimi Zohar, Roberto Sassu,
Dmitry Kasatkin, Eric Snowberg
Cc: Josef Bacik, Linus Torvalds, amir73il, linux-fsdevel, viro, jack,
david, hch, linux-integrity
On Wed, 2024-09-04 at 19:04 +0200, Jann Horn wrote:
> [necrothreading...]
> [+IMA folks]
>
> On Fri, May 31, 2024 at 3:01 PM Christian Brauner <brauner@kernel.org> wrote:
> > Back in 2021 we already discussed removing deny_write_access() for
> > executables. Back then I was hesistant because I thought that this might
> > cause issues in userspace. But even back then I had started taking some
> > notes on what could potentially depend on this and I didn't come up with
> > a lot so I've changed my mind and I would like to try this.
> [snip]
> > Yes, someone in userspace could potentially be relying on this. It's not
> > completely out of the realm of possibility but let's find out if that's
> > actually the case and not guess.
>
> FYI, ima_bprm_check() still has a comment that claims that executables
> use deny_write_access():
>
> /**
> * ima_bprm_check - based on policy, collect/store measurement.
> * @bprm: contains the linux_binprm structure
> *
> * The OS protects against an executable file, already open for write,
> * from being executed in deny_write_access() and an executable file,
> * already open for execute, from being modified in get_write_access().
> * So we can be certain that what we verify and measure here is actually
> * what is being executed.
> *
> * On success return 0. On integrity appraisal error, assuming the file
> * is in policy and IMA-appraisal is in enforcing mode, return -EACCES.
> */
>
> But what actually happens in there is not so different from what
> happens in ima_file_mmap(), so I think probably the only change
> required here is to fix up the comment...
We need to do the violation check for the BPRM_CHECK IMA hook too:
violation_check = ((func == FILE_CHECK || func == MMAP_CHECK
||
func == MMAP_CHECK_REQPROT) &&
(ima_policy_flag & IMA_MEASURE));
Roberto
^ permalink raw reply [flat|nested] 27+ messages in thread
end of thread, other threads:[~2024-09-05 7:38 UTC | newest]
Thread overview: 27+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-05-29 20:41 [PATCH][RFC] fs: add levels to inode write access Josef Bacik
2024-05-29 22:00 ` Jeff Layton
2024-05-30 1:14 ` Matthew Wilcox
2024-05-30 10:32 ` Christian Brauner
2024-05-30 12:57 ` Amir Goldstein
2024-05-30 14:58 ` Josef Bacik
2024-05-30 15:23 ` Christian Brauner
2024-05-30 15:49 ` Linus Torvalds
2024-05-31 10:02 ` Christian Brauner
2024-05-31 12:32 ` Christian Brauner
2024-05-31 13:01 ` [PATCH] fs: don't block i_writecount during exec Christian Brauner
2024-05-31 15:40 ` Linus Torvalds
2024-05-31 18:08 ` Jeff Layton
2024-05-31 22:08 ` Josef Bacik
2024-06-03 13:52 ` (subset) " Christian Brauner
2024-06-06 12:45 ` Aishwarya TCV
2024-06-06 15:37 ` Mark Brown
2024-06-06 16:53 ` Josef Bacik
2024-06-06 17:33 ` Mark Brown
2024-06-06 17:49 ` Mark Brown
2024-08-07 9:59 ` Tudor Ambarus
2024-09-04 17:04 ` Jann Horn
2024-09-05 7:38 ` Roberto Sassu
2024-05-31 13:09 ` [PATCH][RFC] fs: add levels to inode write access Amir Goldstein
2024-05-31 14:50 ` Christian Brauner
2024-05-31 15:47 ` Matthew Wilcox
2024-05-31 22:14 ` Matthew Wilcox
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).