From: Josef Bacik <josef@toxicpanda.com>
To: Christian Brauner <brauner@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
amir73il@gmail.com, linux-fsdevel@vger.kernel.org,
viro@zeniv.linux.org.uk, jack@suse.cz, david@fromorbit.com,
hch@lst.de
Subject: Re: [PATCH] fs: don't block i_writecount during exec
Date: Fri, 31 May 2024 18:08:44 -0400 [thread overview]
Message-ID: <20240531220844.GA2233362@perftesting> (raw)
In-Reply-To: <20240531-vfs-i_writecount-v1-1-a17bea7ee36b@kernel.org>
On Fri, May 31, 2024 at 03:01:43PM +0200, Christian Brauner wrote:
> Back in 2021 we already discussed removing deny_write_access() for
> executables. Back then I was hesistant because I thought that this might
> cause issues in userspace. But even back then I had started taking some
> notes on what could potentially depend on this and I didn't come up with
> a lot so I've changed my mind and I would like to try this.
>
> Here are some of the notes that I took:
>
> (1) The deny_write_access() mechanism is causing really pointless issues
> such as [1]. If a thread in a thread-group opens a file writable,
> then writes some stuff, then closing the file descriptor and then
> calling execve() they can fail the execve() with ETXTBUSY because
> another thread in the thread-group could have concurrently called
> fork(). Multi-threaded libraries such as go suffer from this.
>
> (2) There are userspace attacks that rely on overwriting the binary of a
> running process. These attacks are _mitigated_ but _not at all
> prevented_ from ocurring by the deny_write_access() mechanism.
>
> I'll go over some details. The clearest example of such attacks was
> the attack against runC in CVE-2019-5736 (cf. [3]).
>
> An attack could compromise the runC host binary from inside a
> _privileged_ runC container. The malicious binary could then be used
> to take over the host.
>
> (It is crucial to note that this attack is _not_ possible with
> unprivileged containers. IOW, the setup here is already insecure.)
>
> The attack can be made when attaching to a running container or when
> starting a container running a specially crafted image. For example,
> when runC attaches to a container the attacker can trick it into
> executing itself.
>
> This could be done by replacing the target binary inside the
> container with a custom binary pointing back at the runC binary
> itself. As an example, if the target binary was /bin/bash, this
> could be replaced with an executable script specifying the
> interpreter path #!/proc/self/exe.
>
> As such when /bin/bash is executed inside the container, instead the
> target of /proc/self/exe will be executed. That magic link will
> point to the runc binary on the host. The attacker can then proceed
> to write to the target of /proc/self/exe to try and overwrite the
> runC binary on the host.
>
> However, this will not succeed because of deny_write_access(). Now,
> one might think that this would prevent the attack but it doesn't.
>
> To overcome this, the attacker has multiple ways:
> * Open a file descriptor to /proc/self/exe using the O_PATH flag and
> then proceed to reopen the binary as O_WRONLY through
> /proc/self/fd/<nr> and try to write to it in a busy loop from a
> separate process. Ultimately it will succeed when the runC binary
> exits. After this the runC binary is compromised and can be used
> to attack other containers or the host itself.
> * Use a malicious shared library annotating a function in there with
> the constructor attribute making the malicious function run as an
> initializor. The malicious library will then open /proc/self/exe
> for creating a new entry under /proc/self/fd/<nr>. It'll then call
> exec to a) force runC to exit and b) hand the file descriptor off
> to a program that then reopens /proc/self/fd/<nr> for writing
> (which is now possible because runC has exited) and overwriting
> that binary.
>
> To sum up: the deny_write_access() mechanism doesn't prevent such
> attacks in insecure setups. It just makes them minimally harder.
> That's all.
>
> The only way back then to prevent this is to create a temporary copy
> of the calling binary itself when it starts or attaches to
> containers. So what I did back then for LXC (and Aleksa for runC)
> was to create an anonymous, in-memory file using the memfd_create()
> system call and to copy itself into the temporary in-memory file,
> which is then sealed to prevent further modifications. This sealed,
> in-memory file copy is then executed instead of the original on-disk
> binary.
>
> Any compromising write operations from a privileged container to the
> host binary will then write to the temporary in-memory binary and
> not to the host binary on-disk, preserving the integrity of the host
> binary. Also as the temporary, in-memory binary is sealed, writes to
> this will also fail.
>
> The point is that deny_write_access() is uselss to prevent these
> attacks.
>
> (3) Denying write access to an inode because it's currently used in an
> exec path could easily be done on an LSM level. It might need an
> additional hook but that should be about it.
>
> (4) The MAP_DENYWRITE flag for mmap() has been deprecated a long time
> ago so while we do protect the main executable the bigger portion of
> the things you'd think need protecting such as the shared libraries
> aren't. IOW, we let anyone happily overwrite shared libraries.
>
> (5) We removed all remaining uses of VM_DENYWRITE in [2]. That means:
> (5.1) We removed the legacy uselib() protection for preventing
> overwriting of shared libraries. Nobody cared in 3 years.
> (5.2) We allow write access to the elf interpreter after exec
> completed treating it on a par with shared libraries.
>
> Yes, someone in userspace could potentially be relying on this. It's not
> completely out of the realm of possibility but let's find out if that's
> actually the case and not guess.
>
> Link: https://github.com/golang/go/issues/22315 [1]
> Link: 49624efa65ac ("Merge tag 'denywrite-for-5.15' of git://github.com/davidhildenbrand/linux") [2]
> Link: https://unit42.paloaltonetworks.com/breaking-docker-via-runc-explaining-cve-2019-5736 [3]
> Link: https://lwn.net/Articles/866493
> Link: https://github.com/golang/go/issues/22220
> Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/work/buildid.go#L724
> Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/work/exec.go#L1493
> Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/script/cmds.go#L457
> Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/test/test.go#L1557
> Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/os/exec/lp_linux_test.go#L61
> Link: https://github.com/buildkite/agent/pull/2736
> Link: https://github.com/rust-lang/rust/issues/114554
> Link: https://bugs.openjdk.org/browse/JDK-8068370
> Link: https://github.com/dotnet/runtime/issues/58964
> Signed-off-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Thanks,
Josef
next prev parent reply other threads:[~2024-05-31 22:08 UTC|newest]
Thread overview: 27+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-05-29 20:41 [PATCH][RFC] fs: add levels to inode write access Josef Bacik
2024-05-29 22:00 ` Jeff Layton
2024-05-30 1:14 ` Matthew Wilcox
2024-05-30 10:32 ` Christian Brauner
2024-05-30 12:57 ` Amir Goldstein
2024-05-30 14:58 ` Josef Bacik
2024-05-30 15:23 ` Christian Brauner
2024-05-30 15:49 ` Linus Torvalds
2024-05-31 10:02 ` Christian Brauner
2024-05-31 12:32 ` Christian Brauner
2024-05-31 13:01 ` [PATCH] fs: don't block i_writecount during exec Christian Brauner
2024-05-31 15:40 ` Linus Torvalds
2024-05-31 18:08 ` Jeff Layton
2024-05-31 22:08 ` Josef Bacik [this message]
2024-06-03 13:52 ` (subset) " Christian Brauner
2024-06-06 12:45 ` Aishwarya TCV
2024-06-06 15:37 ` Mark Brown
2024-06-06 16:53 ` Josef Bacik
2024-06-06 17:33 ` Mark Brown
2024-06-06 17:49 ` Mark Brown
2024-08-07 9:59 ` Tudor Ambarus
2024-09-04 17:04 ` Jann Horn
2024-09-05 7:38 ` Roberto Sassu
2024-05-31 13:09 ` [PATCH][RFC] fs: add levels to inode write access Amir Goldstein
2024-05-31 14:50 ` Christian Brauner
2024-05-31 15:47 ` Matthew Wilcox
2024-05-31 22:14 ` Matthew Wilcox
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20240531220844.GA2233362@perftesting \
--to=josef@toxicpanda.com \
--cc=amir73il@gmail.com \
--cc=brauner@kernel.org \
--cc=david@fromorbit.com \
--cc=hch@lst.de \
--cc=jack@suse.cz \
--cc=linux-fsdevel@vger.kernel.org \
--cc=torvalds@linux-foundation.org \
--cc=viro@zeniv.linux.org.uk \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).