linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: CVE-2025-21830: landlock: Handle weird files
       [not found]   ` <2025031034-savanna-debit-eb8e@gregkh>
@ 2025-03-10 23:42     ` Dave Chinner
  2025-03-11  2:09       ` Kent Overstreet
                         ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Dave Chinner @ 2025-03-10 23:42 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Mickaël Salaün, cve, Günther Noack,
	linux-security-module, Kent Overstreet, linux-bcachefs,
	linux-fsdevel

[cc linux-fsdevel]

On Mon, Mar 10, 2025 at 03:36:04PM +0100, Greg Kroah-Hartman wrote:
> On Mon, Mar 10, 2025 at 01:00:50PM +0100, Mickaël Salaün wrote:
> > Hi Greg,
> > 
> > FYI, I don't think this patch fixes a security issue.  If attackers can
> > corrupt a filesystem, then they should already be able to harm the whole
> > system.
> > 
> > The commit description might be a bit confusing, but from an access
> > control point of view, the filesystem on which we spotted this issue
> > (bcachefs) does not allow to open weird files (but they are still
> > visible, hence this patch) and I guess it would be the same for other
> > filesystems, right?  I'm not sure how a weird file could be used by user
> > space.  See
> > https://lore.kernel.org/all/Zpc46HEacI%2Fwd7Rg@dread.disaster.area/
> > 
> > The goal of this fix was mainly to not warn about a bcachefs issue (and
> > avoid related syzkaller report for Landlock), and to harden Landlock in
> > case other filesystems have this kind of bug.
> 
> It was issue a CVE because the reviewers thought that it was a way to
> circumvent the landlock permission checks, based on the changelog text
> (note, creating a "corrupted filesystem" is quite easy to get many Linux
> systems to auto-mount it, so those types of issues do get assigned
> CVEs.)

That's an argument straight from the security theatre.

> If you all do not think this meets the definition of a vulnerability as
> defined by CVE.org as:
> 	An instance of one or more weaknesses in a Product that can be
> 	exploited, causing a negative impact to confidentiality, integrity, or
> 	availability; a set of conditions or behaviors that allows the
> 	violation of an explicit or implicit security policy.

Yes, so shall we follow this reasoning based on untrusted user
auto-mounts of untrusted devices to it's logical conclusion?

If an untrusted user is in control of the filesystem image, then
they don't need to corrupt the filesystem image to subvert the
system. They can just change the permissions on files, change ACLs,
change security xattrs (selinux, landlock, smack, etc),
replace the contents of file data (e.g. trojan executables), etc.

The filesystem will not flag *any* of these shenanigans as they
don't involve actually corrupting the filesystem structure. IOWs,
the kernel filesystem code can function perfectly and bug free, yet
the system can be silently compromised through the hole punched in
the *implicitly trusted security information under user control* in
the fs image.

This is a "trusted device contains trusted security information"
model deficiency, not a filesystem implementation issue. The CVE
worthy issue here is that the security model is violated by the
untrusted automounts, not by how the filesystem reacts to the
security model violation that has already occurred.

Further, the kernel (and therefore the filesystem implemenation)
cannot prevent untrusted user device auto-mounts, so this must be
considered a system level vulnerability that requires userspace
policy and implementation changes to mitigate.

We've tried for years to get userspace to adopt a more
security-aware model for untrusted devices, but have made pretty
much no progress. Filesystem developers have ended up with their
userspace filesystem packages shipping udisks rules to turn off
automounting of those filesystem types for application that use
udisks for this stuff. That catches -some- of the automounting
behaviour, but not all of it. And we can't do anything else without
changes to the wider userspace/distro policies around user
automounting of untrusted devices.

IOWs, to prevent these "corrupted filesystem causes issues" from
being considered security issues, we need userspace to stop
violating the kernel trust model for persistent security information
storage.

Greg, you have the ability to issue a CVE that will require
downstream distros to fix userspace-based vulnerabilities if they
want various certifications. You have the power to force downstream
distros to -change their security model policies- for the wider
good.

We could knock out this whole class of vulnerability in one CVE:
issue a CVE considering the auto-mounting of untrusted filesystem
images as a *critical system vulnerability*. This can only be solved
by changing the distro policies and implementations that allow this
dangerous behaviour to persist.

We've suggested many relatively user friendly ways this can be
handled in the past (e.g. device fingerprinting via libblkid (which
it kinda already does) and prompting the user to allow/deny devices
with an unknown fingerprint). The simplest policy fix is to simply
disallow auto-mount of removable devices by default across the
entire distro.

If distros want to close that kernel CVE then they have to, at
minimum, turn off device auto-mount by default across the entire
distro.

At worst, this makes the reason you give for filesystem corruption
issues being considered CVE worthy go away completely.

At best, we get full distro level integration of efficient,
persistent untrusted device handling at the desktop interfaces.
That would be a win for -everyone-, not just the distro people who
have to handle kernel CVEs....

If we want filesystem corruption CVEs to be any other than security
theatre, then use we should be using the kernel CVE powers for the
reason they were obtained in the first place. i.e. to force
downstream distros to address issues they would otherwise ignore to
help make our linux systems more reliable and secure.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: CVE-2025-21830: landlock: Handle weird files
  2025-03-10 23:42     ` CVE-2025-21830: landlock: Handle weird files Dave Chinner
@ 2025-03-11  2:09       ` Kent Overstreet
  2025-03-11  4:24         ` Dave Chinner
  2025-03-11  2:19       ` Unprivileged filesystem mounts Demi Marie Obenour
  2025-03-11  6:53       ` CVE-2025-21830: landlock: Handle weird files Greg Kroah-Hartman
  2 siblings, 1 reply; 22+ messages in thread
From: Kent Overstreet @ 2025-03-11  2:09 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Greg Kroah-Hartman, Mickaël Salaün, cve,
	Günther Noack, linux-security-module, linux-bcachefs,
	linux-fsdevel

On Tue, Mar 11, 2025 at 10:42:41AM +1100, Dave Chinner wrote:
> [cc linux-fsdevel]
> 
> On Mon, Mar 10, 2025 at 03:36:04PM +0100, Greg Kroah-Hartman wrote:
> > On Mon, Mar 10, 2025 at 01:00:50PM +0100, Mickaël Salaün wrote:
> > > Hi Greg,
> > > 
> > > FYI, I don't think this patch fixes a security issue.  If attackers can
> > > corrupt a filesystem, then they should already be able to harm the whole
> > > system.
> > > 
> > > The commit description might be a bit confusing, but from an access
> > > control point of view, the filesystem on which we spotted this issue
> > > (bcachefs) does not allow to open weird files (but they are still
> > > visible, hence this patch) and I guess it would be the same for other
> > > filesystems, right?  I'm not sure how a weird file could be used by user
> > > space.  See
> > > https://lore.kernel.org/all/Zpc46HEacI%2Fwd7Rg@dread.disaster.area/
> > > 
> > > The goal of this fix was mainly to not warn about a bcachefs issue (and
> > > avoid related syzkaller report for Landlock), and to harden Landlock in
> > > case other filesystems have this kind of bug.
> > 
> > It was issue a CVE because the reviewers thought that it was a way to
> > circumvent the landlock permission checks, based on the changelog text
> > (note, creating a "corrupted filesystem" is quite easy to get many Linux
> > systems to auto-mount it, so those types of issues do get assigned
> > CVEs.)
> 
> That's an argument straight from the security theatre.
> 
> > If you all do not think this meets the definition of a vulnerability as
> > defined by CVE.org as:
> > 	An instance of one or more weaknesses in a Product that can be
> > 	exploited, causing a negative impact to confidentiality, integrity, or
> > 	availability; a set of conditions or behaviors that allows the
> > 	violation of an explicit or implicit security policy.
> 
> Yes, so shall we follow this reasoning based on untrusted user
> auto-mounts of untrusted devices to it's logical conclusion?
> 
> If an untrusted user is in control of the filesystem image, then
> they don't need to corrupt the filesystem image to subvert the
> system. They can just change the permissions on files, change ACLs,
> change security xattrs (selinux, landlock, smack, etc),
> replace the contents of file data (e.g. trojan executables), etc.

If user mounts are enabled, that comes with UID mapping, and device
nodes disabled - no?

Out of curiosity, what's keeping us from saying "user mounts are
generally expected to be safe" for XFS?

Obviously, that does expose a massive attack surface, so saying that for
a C codebase that wasn't initially designed for it has a high pucker
factor.

But I've been impressed with syzbot's ability to find bugs, so barring
architectural issues which I assume you'd know about it seems it's not
nearly as crazy a thought as it used to be - for XFS, as you guys have
been the most rigorous about hardening so I expect that's about as good
as it's going to get until we start rewriting our filesystems in Rust.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Unprivileged filesystem mounts
  2025-03-10 23:42     ` CVE-2025-21830: landlock: Handle weird files Dave Chinner
  2025-03-11  2:09       ` Kent Overstreet
@ 2025-03-11  2:19       ` Demi Marie Obenour
  2025-03-11  5:57         ` Dave Chinner
  2025-03-11  6:53       ` CVE-2025-21830: landlock: Handle weird files Greg Kroah-Hartman
  2 siblings, 1 reply; 22+ messages in thread
From: Demi Marie Obenour @ 2025-03-11  2:19 UTC (permalink / raw)
  To: david
  Cc: cve, gnoack, gregkh, kent.overstreet, linux-bcachefs,
	linux-fsdevel, linux-security-module, mic, Demi Marie Obenour

People have stuff to get done.  If you disallow unprivileged filesystem
mounts, they will just use sudo (or equivalent) instead.  The problem is
not that users are mounting untrusted filesystems.  The problem is that
mounting untrusted filesystems is unsafe.

Making untrusted filesystems safe to mount is the only solution that
lets users do what they actually need to do.  That means either actually
fixing the filesystem code, or running it in a sufficiently tight
sandbox that vulnerabilities in it are of too low importance to matter.
libguestfs+FUSE is the most obvious way to do this, but the performance
might not be enough for distros to turn it on.

For ext4 and F2FS, if there is a vulnerability that can be exploited by
a malicious filesystem image, it is a verified boot bypass for Chrome OS
and Android, respectively.  Verified boot is a security boundary for
both of them, so just forward syzbot reports to their respective
security teams and let them do the jobs they are paid to do.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: CVE-2025-21830: landlock: Handle weird files
  2025-03-11  2:09       ` Kent Overstreet
@ 2025-03-11  4:24         ` Dave Chinner
  2025-03-11 10:50           ` Kent Overstreet
  0 siblings, 1 reply; 22+ messages in thread
From: Dave Chinner @ 2025-03-11  4:24 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Greg Kroah-Hartman, Mickaël Salaün, cve,
	Günther Noack, linux-security-module, linux-bcachefs,
	linux-fsdevel

On Mon, Mar 10, 2025 at 10:09:22PM -0400, Kent Overstreet wrote:
> On Tue, Mar 11, 2025 at 10:42:41AM +1100, Dave Chinner wrote:
> > [cc linux-fsdevel]
> > 
> > On Mon, Mar 10, 2025 at 03:36:04PM +0100, Greg Kroah-Hartman wrote:
> > > On Mon, Mar 10, 2025 at 01:00:50PM +0100, Mickaël Salaün wrote:
> > > > Hi Greg,
> > > > 
> > > > FYI, I don't think this patch fixes a security issue.  If attackers can
> > > > corrupt a filesystem, then they should already be able to harm the whole
> > > > system.
> > > > 
> > > > The commit description might be a bit confusing, but from an access
> > > > control point of view, the filesystem on which we spotted this issue
> > > > (bcachefs) does not allow to open weird files (but they are still
> > > > visible, hence this patch) and I guess it would be the same for other
> > > > filesystems, right?  I'm not sure how a weird file could be used by user
> > > > space.  See
> > > > https://lore.kernel.org/all/Zpc46HEacI%2Fwd7Rg@dread.disaster.area/
> > > > 
> > > > The goal of this fix was mainly to not warn about a bcachefs issue (and
> > > > avoid related syzkaller report for Landlock), and to harden Landlock in
> > > > case other filesystems have this kind of bug.
> > > 
> > > It was issue a CVE because the reviewers thought that it was a way to
> > > circumvent the landlock permission checks, based on the changelog text
> > > (note, creating a "corrupted filesystem" is quite easy to get many Linux
> > > systems to auto-mount it, so those types of issues do get assigned
> > > CVEs.)
> > 
> > That's an argument straight from the security theatre.
> > 
> > > If you all do not think this meets the definition of a vulnerability as
> > > defined by CVE.org as:
> > > 	An instance of one or more weaknesses in a Product that can be
> > > 	exploited, causing a negative impact to confidentiality, integrity, or
> > > 	availability; a set of conditions or behaviors that allows the
> > > 	violation of an explicit or implicit security policy.
> > 
> > Yes, so shall we follow this reasoning based on untrusted user
> > auto-mounts of untrusted devices to it's logical conclusion?
> > 
> > If an untrusted user is in control of the filesystem image, then
> > they don't need to corrupt the filesystem image to subvert the
> > system. They can just change the permissions on files, change ACLs,
> > change security xattrs (selinux, landlock, smack, etc),
> > replace the contents of file data (e.g. trojan executables), etc.
> 
> If user mounts are enabled, that comes with UID mapping, and device
> nodes disabled - no?

Not necessarily. Those security mechanisms are all optional mount
options under userspace control....

> Out of curiosity, what's keeping us from saying "user mounts are
> generally expected to be safe" for XFS?

What does "generally expected to be safe" actually mean?

If be "safe" you mean "won't crash the kernel if the structure has
been altered in detectable ways with", then we already largely tick
that box. However, there are whole classes of DOS attacks that are
very difficult to detect without rigorous, expensive runtime
checking (e.g. loops in btree pointers).

Hence while we catch almost all the the obvious out-of-bounds
corruptions within an object, detecting corruptions that require
spanning a largely unbound number of objects to detect are not
handled at all. I can corrupt a filesystem to induce an endless
btree search loop like this pretty easily with a little bit of
xfs_db magic. Yup, we even provide the tools to make doing stuff
like this easy...

If by "safe" you mean "can detect all cases where a metadata field
or file data has been tampered with", then XFS is completely unsafe
and should not be used.

We can't detect that a malicious actor has changed something like a
file permission field or the contents of a security xattr.  To do
that requires cryptographically secure signatures of metadata
objects and file data. We do not have that sort of feature in the
on-disk format. We expect users that need protection from such
tampering will use an envrypted block device to prevent malicious
actors from being able to mutate the filesystem structure in this
way.

> Obviously, that does expose a massive attack surface, so saying that for
> a C codebase that wasn't initially designed for it has a high pucker
> factor.
>
> But I've been impressed with syzbot's ability to find bugs, so barring
> architectural issues which I assume you'd know about it seems it's not
> nearly as crazy a thought as it used to be - for XFS, as you guys have
> been the most rigorous about hardening so I expect that's about as good
> as it's going to get until we start rewriting our filesystems in Rust.

The concerns I have about malicious actors are not mitigated by the
language the filesystem is implemented in.

It has everything to do with the fact that a filesystem like XFS or
ext4 cannot detect someone changing permissions on a file to, say,
add a setuid bit to the permissions field and then hide the
modification by recalculating the correct CRC for the metdata block.

Solving that problem requires a fundamentally different fs/device
trust model (i.e. the device is *never* trusted) and an on-disk
format that is based around "trust nothing" rather than "trust
everything".

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Unprivileged filesystem mounts
  2025-03-11  2:19       ` Unprivileged filesystem mounts Demi Marie Obenour
@ 2025-03-11  5:57         ` Dave Chinner
  2025-03-11 11:01           ` Christian Brauner
                             ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Dave Chinner @ 2025-03-11  5:57 UTC (permalink / raw)
  To: Demi Marie Obenour
  Cc: cve, gnoack, gregkh, kent.overstreet, linux-bcachefs,
	linux-fsdevel, linux-security-module, mic, Demi Marie Obenour

On Mon, Mar 10, 2025 at 10:19:57PM -0400, Demi Marie Obenour wrote:
> People have stuff to get done.  If you disallow unprivileged filesystem
> mounts, they will just use sudo (or equivalent) instead.

I am not advocating that we disallow mounting of untrusted devices.

> The problem is
> not that users are mounting untrusted filesystems.  The problem is that
> mounting untrusted filesystems is unsafe.

> Making untrusted filesystems safe to mount is the only solution that
> lets users do what they actually need to do. That means either actually
> fixing the filesystem code,

Yes, and the point I keep making is that we cannot provide that
guarantee from the kernel for existing filesystems. We cannot detect
all possible malicous tampering situations without cryptogrpahically
secure verification, and we can't generate full trust from nothing.

The typical desktop policy of "probe and automount any device that
is plugged in" prevents the user from examining the device to
determine if it contains what it is supposed to contain.  The user
is not given any opportunity to device if trust is warranted before
the kernel filesystem parser running in ring 0 is exposed to the
malicious image.

That's the fundamental policy problem we need to address: the user
and/or admin is not in control of their own security because
application developers and/or distro maintainers have decided they
should not have a choice.

In this situation, the choice of what to do *must* fall to the user,
but the argument for "filesystem corruption is a CVE-worthy bug" is
that the choice has been taken away from the user. That's what I'm
saying needs to change - the choice needs to be returned to the
user...

> or running it in a sufficiently tight
> sandbox that vulnerabilities in it are of too low importance to matter.
> libguestfs+FUSE is the most obvious way to do this, but the performance
> might not be enough for distros to turn it on.

Yes, I have advocated for that to be used for desktop mounts in the
past. Similarly, I have also advocated for liblinux + FUSE to be
used so that the kernel filesystem code is used but run from a
userspace context where the kernel cannot be compromised.

I have also advocated for user removable devices to be encrypted by
default. The act of the user unlocking the device automatically
marks it as trusted because undetectable malicious tampering is
highly unlikely.

I have also advocated for a device registry that records removable
device signatures and whether the user trusted them or not so that
they only need to be prompted once for any given removable device
they use.

There are *many* potential user-friendly solutions to the problem,
but they -all- lie in the domain of userspace applications and/or
policies. This is *not* a problem more or better code in the kernel
can solve.

Kees and Co keep telling us we should be making changes that make it
harder (or compeltely prevent) entire classes of vulnerabilities
from being exploited. Yet every time we suggest that a more secure
policy should be applied to automounting filesystems to prevent
system compromise on device hotplug, nobody seems to be willing to
put security first.

> For ext4 and F2FS, if there is a vulnerability that can be exploited by
> a malicious filesystem image, it is a verified boot bypass for Chrome OS
> and Android, respectively. Verified boot is a security boundary for
> both of them,

How does one maliciously corrupt the root filesystem on an Android
phone? How many security boundaries have to be violated before
an attacker can directly modify the physical storage underlying the
read-only system partition?

Again, if the attacker has device modification capability, why
would they bother trying to perform a complex filesystem
corruption attack during boot when they can simply modify what
runs on startup?

And is this a real attack vector that Android must defend against,
why isn't that device and filesystem image cryptographically signed
and verified at boot time to prevent such attacks? That will prevent
the entire class of malicious tampering exploits completely without
having to care about undiscovered filesystem bugs - that's a much
more robust solution from a verified boot and system security
perspective...

> so just forward syzbot reports to their respective
> security teams and let them do the jobs they are paid to do.

Security teams don't fix "syzbot bugs"; they are typically the
people that run syzbot instances. It's the developers who then
have to triage and fix the issues that are found, so that's who the
bug reports should go to (and do). And just because syzbot finds an
issue, that doesn't make it a security issue - all it is is another
bug found by another automated test suite that needs fixing.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: CVE-2025-21830: landlock: Handle weird files
  2025-03-10 23:42     ` CVE-2025-21830: landlock: Handle weird files Dave Chinner
  2025-03-11  2:09       ` Kent Overstreet
  2025-03-11  2:19       ` Unprivileged filesystem mounts Demi Marie Obenour
@ 2025-03-11  6:53       ` Greg Kroah-Hartman
  2 siblings, 0 replies; 22+ messages in thread
From: Greg Kroah-Hartman @ 2025-03-11  6:53 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Mickaël Salaün, cve, Günther Noack,
	linux-security-module, Kent Overstreet, linux-bcachefs,
	linux-fsdevel

On Tue, Mar 11, 2025 at 10:42:41AM +1100, Dave Chinner wrote:
> Greg, you have the ability to issue a CVE that will require
> downstream distros to fix userspace-based vulnerabilities if they
> want various certifications. You have the power to force downstream
> distros to -change their security model policies- for the wider
> good.
> 
> We could knock out this whole class of vulnerability in one CVE:
> issue a CVE considering the auto-mounting of untrusted filesystem
> images as a *critical system vulnerability*. This can only be solved
> by changing the distro policies and implementations that allow this
> dangerous behaviour to persist.

I wish we could do that, but remember, we can not tell people how to use
Linux.  We have no "control" over that at all.  All we can do is point
out "here is a potential vulnerability, it might be applicable to you,
or you might not, depending on your use case, it's up to you to figure
it out".  And we do that by issuing CVEs.

Heck, if we could dictate use, I would issue a "stop using panic on warn
you fools!" CVE right now which would instantly get rid of a huge
percentage of all kernel CVEs out there.  Smart users of Linux do
disable that, and so they are not vulnerable to those at all.

Remember, we issue on average, 11-13 CVEs a day, here's our most recent
numbers:

	=== CVEs Published in Last 6 Months ===
	   October 2024:  427 CVEs
	  November 2024:  280 CVEs
	  December 2024:  358 CVEs
	   January 2025:  234 CVEs
	  February 2025:  929 CVEs
	     March 2025:   56 CVEs

	=== Overall Averages ===
	Average CVEs per month: 415.99
	Average CVEs per week: 95.64
	Average CVEs per day: 13.66

So don't get all worried about individual CVEs, unless you all think
they are not valid at all, which we are glad to revoke.

> At worst, this makes the reason you give for filesystem corruption
> issues being considered CVE worthy go away completely.

Filesystem corruption or data loss is not considered a vulnerability by
cve.org, so we do not track them at this point in time.  However other
group's requirements might require this in the future, so this might
change (i.e. the CRA law in Europe.)

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: CVE-2025-21830: landlock: Handle weird files
  2025-03-11  4:24         ` Dave Chinner
@ 2025-03-11 10:50           ` Kent Overstreet
  0 siblings, 0 replies; 22+ messages in thread
From: Kent Overstreet @ 2025-03-11 10:50 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Greg Kroah-Hartman, Mickaël Salaün, cve,
	Günther Noack, linux-security-module, linux-bcachefs,
	linux-fsdevel

On Tue, Mar 11, 2025 at 03:24:40PM +1100, Dave Chinner wrote:
> On Mon, Mar 10, 2025 at 10:09:22PM -0400, Kent Overstreet wrote:
> > On Tue, Mar 11, 2025 at 10:42:41AM +1100, Dave Chinner wrote:
> > If user mounts are enabled, that comes with UID mapping, and device
> > nodes disabled - no?
> 
> Not necessarily. Those security mechanisms are all optional mount
> options under userspace control....

Well, if someone's being an idiot, that's on them and not something I'm
going to argue about :) Uidmapping has been around for plenty long
enough for userspace to start using it.

> 
> > Out of curiosity, what's keeping us from saying "user mounts are
> > generally expected to be safe" for XFS?
> 
> What does "generally expected to be safe" actually mean?
> 
> If be "safe" you mean "won't crash the kernel if the structure has
> been altered in detectable ways with", then we already largely tick
> that box. However, there are whole classes of DOS attacks that are
> very difficult to detect without rigorous, expensive runtime
> checking (e.g. loops in btree pointers).

btree nodes don't change depth, so just recording the level of a node
and validating it trivially defeats that. bcachefs has that in its on
disk format, but if you don't have that then that might be a problem -
you'd at least need to know a priori the depth of the root node.

> Hence while we catch almost all the the obvious out-of-bounds
> corruptions within an object, detecting corruptions that require
> spanning a largely unbound number of objects to detect are not
> handled at all. I can corrupt a filesystem to induce an endless
> btree search loop like this pretty easily with a little bit of
> xfs_db magic. Yup, we even provide the tools to make doing stuff
> like this easy...

*nod*

In bcachefs, we right now have no way to cleanly detect "filesystem is
actually full, disk accounting info is wrong" so - that means corruption
causes allocations to get stuck. That one is fixable, and I'm going to
have to at some point since syzbot knows how to trigger it :)

> If by "safe" you mean "can detect all cases where a metadata field
> or file data has been tampered with", then XFS is completely unsafe
> and should not be used.
> 
> We can't detect that a malicious actor has changed something like a
> file permission field or the contents of a security xattr.  To do
> that requires cryptographically secure signatures of metadata
> objects and file data. We do not have that sort of feature in the
> on-disk format. We expect users that need protection from such
> tampering will use an envrypted block device to prevent malicious
> actors from being able to mutate the filesystem structure in this
> way.

Yeah, but that's the less interesting case to me. Not uninteresting,
since "I don't fully trust my block device" is a real scenario with
network attached storage. But generally, the tampering would be done by
the user that did the mount - so perhaps we need to find some new nudges
to make uidmapping of user mounts required?

That could be done in util-linux...

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Unprivileged filesystem mounts
  2025-03-11  5:57         ` Dave Chinner
@ 2025-03-11 11:01           ` Christian Brauner
  2025-03-11 17:36             ` Al Viro
  2025-03-11 17:54           ` Eric Biggers
  2025-03-11 20:10           ` Demi Marie Obenour
  2 siblings, 1 reply; 22+ messages in thread
From: Christian Brauner @ 2025-03-11 11:01 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Demi Marie Obenour, cve, gnoack, gregkh, kent.overstreet,
	linux-bcachefs, linux-fsdevel, linux-security-module, mic,
	Demi Marie Obenour

On Tue, Mar 11, 2025 at 04:57:54PM +1100, Dave Chinner wrote:
> On Mon, Mar 10, 2025 at 10:19:57PM -0400, Demi Marie Obenour wrote:
> > People have stuff to get done.  If you disallow unprivileged filesystem
> > mounts, they will just use sudo (or equivalent) instead.
> 
> I am not advocating that we disallow mounting of untrusted devices.
> 
> > The problem is
> > not that users are mounting untrusted filesystems.  The problem is that
> > mounting untrusted filesystems is unsafe.
> 
> > Making untrusted filesystems safe to mount is the only solution that
> > lets users do what they actually need to do. That means either actually
> > fixing the filesystem code,
> 
> Yes, and the point I keep making is that we cannot provide that
> guarantee from the kernel for existing filesystems. We cannot detect
> all possible malicous tampering situations without cryptogrpahically
> secure verification, and we can't generate full trust from nothing.
> 
> The typical desktop policy of "probe and automount any device that
> is plugged in" prevents the user from examining the device to
> determine if it contains what it is supposed to contain.  The user
> is not given any opportunity to device if trust is warranted before
> the kernel filesystem parser running in ring 0 is exposed to the
> malicious image.
> 
> That's the fundamental policy problem we need to address: the user
> and/or admin is not in control of their own security because
> application developers and/or distro maintainers have decided they
> should not have a choice.
> 
> In this situation, the choice of what to do *must* fall to the user,
> but the argument for "filesystem corruption is a CVE-worthy bug" is
> that the choice has been taken away from the user. That's what I'm
> saying needs to change - the choice needs to be returned to the
> user...
> 
> > or running it in a sufficiently tight
> > sandbox that vulnerabilities in it are of too low importance to matter.
> > libguestfs+FUSE is the most obvious way to do this, but the performance
> > might not be enough for distros to turn it on.
> 
> Yes, I have advocated for that to be used for desktop mounts in the
> past. Similarly, I have also advocated for liblinux + FUSE to be
> used so that the kernel filesystem code is used but run from a
> userspace context where the kernel cannot be compromised.
> 
> I have also advocated for user removable devices to be encrypted by
> default. The act of the user unlocking the device automatically
> marks it as trusted because undetectable malicious tampering is
> highly unlikely.
> 
> I have also advocated for a device registry that records removable
> device signatures and whether the user trusted them or not so that
> they only need to be prompted once for any given removable device
> they use.
> 
> There are *many* potential user-friendly solutions to the problem,
> but they -all- lie in the domain of userspace applications and/or
> policies. This is *not* a problem more or better code in the kernel
> can solve.

Strongly agree.

> 
> Kees and Co keep telling us we should be making changes that make it
> harder (or compeltely prevent) entire classes of vulnerabilities
> from being exploited. Yet every time we suggest that a more secure
> policy should be applied to automounting filesystems to prevent
> system compromise on device hotplug, nobody seems to be willing to
> put security first.

I agree with Dave here a lot.

The case where arbitrary devices stuck into a laptop (e.g., USB sticks)
are mounted isn't solved by making a filesystem mountable unprivileged.
The mounted device cannot show up in the global mount namespace
somewhere since the user doesn't own the initial mount+user namespace.
So it's pointless. In other words, there's filesystem level checks and
mount namespace based checks. Circumventing that restriction means that
any user can just mount the device at any location in the global mount
namespace and therefore simply overmount other stuff.

The other thing is whether or not a filesystem is allowed to be mounted
by an unprivileged user namespaces. That is not a policy decision the
kernel can make, should make, or has to make. This is a road to security
disaster.

The new mount api has built-in
delegation capabilities for exactly this reason and use-case so the
kernel doesn't have to do that. Policy like that belongs into userspace. 
The new mount api makes it possible for userspace to correctly and
safely delegate any filesystem mount to unprivileged users. It's e.g.,
heavily used by bpf to make bpffs and thus bpf usable by unprivileged
userspace and containers.

There's a generic API for this already that we presented on in [1] at
LSFMM 2023. This has proper security policies in place when and how it
is allowed even for a user not in a user namespace to mount an arbitrary
filesystem (device or no device-based).

    NAME
    systemd-mountfsd.service, systemd-mountfsd - Disk Image File System Mount Service
    
    SYNOPSIS
    systemd-mountfsd.service
    
    /usr/lib/systemd/systemd-mountfsd
    
    DESCRIPTION
    systemd-mountfsd is a system service that dissects disk images, and
    returns mount file descriptors for the file systems contained therein to
    clients, via a Varlink IPC API.
    
    The disk images provided must contain a raw file system image or must
    follow the Discoverable Partitions Specification[1]. Before mounting any
    file systems authenticity of the disk image is established in one or a
    combination of the following ways:
    
    1. If the disk image is located in a regular file in one of the
       directories /var/lib/machines/, /var/lib/portables/,
       /var/lib/extensions/, /var/lib/confexts/ or their counterparts in the
       /etc/, /run/, /usr/lib/ it is assumed to be trusted.
    
    2. If the disk image contains a Verity enabled disk image, along with a
       signature partition with a key in the kernel keyring or in
       /etc/verity.d/ (and related directories) the disk image is considered
       trusted.

    This service provides one Varlink[2] service:
    io.systemd.MountFileSystem which accepts a file descriptor to a
    regular file or block device, and returns a number of file
    descriptors referring to an fsmount() file descriptor the client may
    then attach to a path of their choice.
    
    The returned mounts are automatically allowlisted in the
    per-user-namespace allowlist maintained by
    systemd-nsresourced.service(8).

    The file systems are automatically fsck(8)'ed before mounting.

    NOTES
    1. Discoverable Partitions Specification
       https://uapi-group.org/specifications/specs/discoverable_partitions_specification/

    2. Varlink
       https://varlink.org/

This work has now also been expanded to cover plain directory trees and
will be available in the next release.

It is currently part of systemd but like with a lot of other such tools
they are available standalone for non-systemd systems and if not that
can be done.

[1]: https://youtu.be/RbMhupT3Dk4?si=pIGH5XPPUJ0m6bi0

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Unprivileged filesystem mounts
  2025-03-11 11:01           ` Christian Brauner
@ 2025-03-11 17:36             ` Al Viro
  2025-03-11 17:43               ` Kent Overstreet
  0 siblings, 1 reply; 22+ messages in thread
From: Al Viro @ 2025-03-11 17:36 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Dave Chinner, Demi Marie Obenour, cve, gnoack, gregkh,
	kent.overstreet, linux-bcachefs, linux-fsdevel,
	linux-security-module, mic, Demi Marie Obenour

On Tue, Mar 11, 2025 at 12:01:48PM +0100, Christian Brauner wrote:

> The case where arbitrary devices stuck into a laptop (e.g., USB sticks)
> are mounted isn't solved by making a filesystem mountable unprivileged.
> The mounted device cannot show up in the global mount namespace
> somewhere since the user doesn't own the initial mount+user namespace.
> So it's pointless. In other words, there's filesystem level checks and
> mount namespace based checks. Circumventing that restriction means that
> any user can just mount the device at any location in the global mount
> namespace and therefore simply overmount other stuff.

Note that "untrusted contents" is not the worst thing you can run into -
it can be content changing behind your back.  I seriously doubt that
anyone fuzzes for that kind of crap (and no, it's not an invitation to
start).  I seriously doubt that there's any local filesystem that would
be resilent to that...

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Unprivileged filesystem mounts
  2025-03-11 17:36             ` Al Viro
@ 2025-03-11 17:43               ` Kent Overstreet
  0 siblings, 0 replies; 22+ messages in thread
From: Kent Overstreet @ 2025-03-11 17:43 UTC (permalink / raw)
  To: Al Viro
  Cc: Christian Brauner, Dave Chinner, Demi Marie Obenour, cve, gnoack,
	gregkh, linux-bcachefs, linux-fsdevel, linux-security-module, mic,
	Demi Marie Obenour

On Tue, Mar 11, 2025 at 05:36:00PM +0000, Al Viro wrote:
> On Tue, Mar 11, 2025 at 12:01:48PM +0100, Christian Brauner wrote:
> 
> > The case where arbitrary devices stuck into a laptop (e.g., USB sticks)
> > are mounted isn't solved by making a filesystem mountable unprivileged.
> > The mounted device cannot show up in the global mount namespace
> > somewhere since the user doesn't own the initial mount+user namespace.
> > So it's pointless. In other words, there's filesystem level checks and
> > mount namespace based checks. Circumventing that restriction means that
> > any user can just mount the device at any location in the global mount
> > namespace and therefore simply overmount other stuff.
> 
> Note that "untrusted contents" is not the worst thing you can run into -
> it can be content changing behind your back.  I seriously doubt that
> anyone fuzzes for that kind of crap (and no, it's not an invitation to
> start).  I seriously doubt that there's any local filesystem that would
> be resilent to that...

Given network block devices (more common with cloud stuff these days),
it's not a totally unreasonable thing to want to be secure against.

I'd love to see someone attack bcachefs that way - in a few more years :)

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Unprivileged filesystem mounts
  2025-03-11  5:57         ` Dave Chinner
  2025-03-11 11:01           ` Christian Brauner
@ 2025-03-11 17:54           ` Eric Biggers
  2025-03-11 20:10           ` Demi Marie Obenour
  2 siblings, 0 replies; 22+ messages in thread
From: Eric Biggers @ 2025-03-11 17:54 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Demi Marie Obenour, cve, gnoack, gregkh, kent.overstreet,
	linux-bcachefs, linux-fsdevel, linux-security-module, mic,
	Demi Marie Obenour

On Tue, Mar 11, 2025 at 04:57:54PM +1100, Dave Chinner wrote:
> And is this a real attack vector that Android must defend against,
> why isn't that device and filesystem image cryptographically signed
> and verified at boot time to prevent such attacks? That will prevent
> the entire class of malicious tampering exploits completely without
> having to care about undiscovered filesystem bugs - that's a much
> more robust solution from a verified boot and system security
> perspective...

That's exactly how it works.  See
https://source.android.com/docs/security/features/verifiedboot and
https://source.android.com/docs/security/features/verifiedboot/dm-verity.

- Eric

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Unprivileged filesystem mounts
  2025-03-11  5:57         ` Dave Chinner
  2025-03-11 11:01           ` Christian Brauner
  2025-03-11 17:54           ` Eric Biggers
@ 2025-03-11 20:10           ` Demi Marie Obenour
  2025-03-18  5:21             ` Dave Chinner
  2025-03-18 22:11             ` Theodore Ts'o
  2 siblings, 2 replies; 22+ messages in thread
From: Demi Marie Obenour @ 2025-03-11 20:10 UTC (permalink / raw)
  To: Dave Chinner
  Cc: cve, gnoack, gregkh, kent.overstreet, linux-bcachefs,
	linux-fsdevel, linux-security-module, mic, Demi Marie Obenour

[-- Attachment #1: Type: text/plain, Size: 8379 bytes --]

On Tue, Mar 11, 2025 at 04:57:54PM +1100, Dave Chinner wrote:
> On Mon, Mar 10, 2025 at 10:19:57PM -0400, Demi Marie Obenour wrote:
> > People have stuff to get done.  If you disallow unprivileged filesystem
> > mounts, they will just use sudo (or equivalent) instead.
> 
> I am not advocating that we disallow mounting of untrusted devices.
> 
> > The problem is
> > not that users are mounting untrusted filesystems.  The problem is that
> > mounting untrusted filesystems is unsafe.
> 
> > Making untrusted filesystems safe to mount is the only solution that
> > lets users do what they actually need to do. That means either actually
> > fixing the filesystem code,
> 
> Yes, and the point I keep making is that we cannot provide that
> guarantee from the kernel for existing filesystems. We cannot detect
> all possible malicous tampering situations without cryptogrpahically
> secure verification, and we can't generate full trust from nothing.

Why is it not possible to provide that guarantee?  I'm not concerned
about infinite loops or deadlocks.  Is there a reason it is not possible
to prevent memory corruption?

> The typical desktop policy of "probe and automount any device that
> is plugged in" prevents the user from examining the device to
> determine if it contains what it is supposed to contain.  The user
> is not given any opportunity to device if trust is warranted before
> the kernel filesystem parser running in ring 0 is exposed to the
> malicious image.
> 
> That's the fundamental policy problem we need to address: the user
> and/or admin is not in control of their own security because
> application developers and/or distro maintainers have decided they
> should not have a choice.
> 
> In this situation, the choice of what to do *must* fall to the user,
> but the argument for "filesystem corruption is a CVE-worthy bug" is
> that the choice has been taken away from the user. That's what I'm
> saying needs to change - the choice needs to be returned to the
> user...

I am 100% in favor of not automounting filesystems without user
interaction, but that only means that an exploit will require user
interaction.  Users need to get things done, and if their task requires
them to a not-fully-trusted filesystem image, then that is what they
will do, and they will typically do it in the most obvious way possible.
That most obvious way needs to be a safe way, and it needs to have good
enough performance that users don't go around looking for an unsafe way.

> > or running it in a sufficiently tight
> > sandbox that vulnerabilities in it are of too low importance to matter.
> > libguestfs+FUSE is the most obvious way to do this, but the performance
> > might not be enough for distros to turn it on.
> 
> Yes, I have advocated for that to be used for desktop mounts in the
> past. Similarly, I have also advocated for liblinux + FUSE to be
> used so that the kernel filesystem code is used but run from a
> userspace context where the kernel cannot be compromised.
> 
> I have also advocated for user removable devices to be encrypted by
> default. The act of the user unlocking the device automatically
> marks it as trusted because undetectable malicious tampering is
> highly unlikely.

That is definitely a good idea.

> I have also advocated for a device registry that records removable
> device signatures and whether the user trusted them or not so that
> they only need to be prompted once for any given removable device
> they use.
> 
> There are *many* potential user-friendly solutions to the problem,
> but they -all- lie in the domain of userspace applications and/or
> policies. This is *not* a problem more or better code in the kernel
> can solve.

It is certainly possible to make a memory safe implementation of amy
filesystem.  If the current implementation can't prevent memory
corruption if a malicious filesystem is mounted, that is a
characteristic of the implementation.

> Kees and Co keep telling us we should be making changes that make it
> harder (or compeltely prevent) entire classes of vulnerabilities
> from being exploited. Yet every time we suggest that a more secure
> policy should be applied to automounting filesystems to prevent
> system compromise on device hotplug, nobody seems to be willing to
> put security first.

Not automounting filesystems on hotplug is a _part_ of the solution.
It cannot be the _entire_ solution.  Users sometimes need to be able to
interact with untrusted filesystem images with a reasonable speed.

> > For ext4 and F2FS, if there is a vulnerability that can be exploited by
> > a malicious filesystem image, it is a verified boot bypass for Chrome OS
> > and Android, respectively. Verified boot is a security boundary for
> > both of them,
> 
> How does one maliciously corrupt the root filesystem on an Android
> phone? How many security boundaries have to be violated before
> an attacker can directly modify the physical storage underlying the
> read-only system partition?
> 
> Again, if the attacker has device modification capability, why
> would they bother trying to perform a complex filesystem
> corruption attack during boot when they can simply modify what
> runs on startup?
> 
> And is this a real attack vector that Android must defend against,
> why isn't that device and filesystem image cryptographically signed
> and verified at boot time to prevent such attacks? That will prevent
> the entire class of malicious tampering exploits completely without
> having to care about undiscovered filesystem bugs - that's a much
> more robust solution from a verified boot and system security
> perspective...

On both Android and ChromeOS, the root filesystem is a dm-verity volume,
and the Merkle tree hash is either signed or is part of the signed
kernel image.  The signed kernel image is itself verified by the
bootloader.  Therefore, the root filesystem cannot be tampered with.

However, the root filesystem is not the only filesystem image that must
be mounted.  There is also a writable data volume, and that _cannot_ be
signed because it contains user data.  It is encrypted, but part of the
threat model for both Android and ChromeOS is an attacker who has gained
root or even kernel code execution and wants to retain their access
across device reboots.  They can't tamper with the kernel or root
filesystem, and privileged userspace treats the data on the writable
filesystem as untrusted.  However, the attacker can replace the writable
filesystem image with anything they want, so the if they can craft an
image that gains kernel code execution the next time the system boots,
they have successfully obtained persistance.

Also, at least Google Pixels support updating the OS via the bootloader.
The bootloader checks that the image was signed by the OS vendor
(generally, but not always, Google), and I believe it also checks for
downgrade attacks.  However, this means of updating the OS doesn't
wipe user data.  This means that if an attacker has gained code
execution with root or even kernel privileges, updating the OS to a
version that has patched the vulnerability the attacker used will revoke
their access.  The same is true if the attacker used USB for their
exploit and the reboot happens after the user has unplugged the USB
device.

Furthermore, on UEFI systems the EFI System Partition cannot be
cryptographically protected as the firmware does not support this.

> > so just forward syzbot reports to their respective
> > security teams and let them do the jobs they are paid to do.
> 
> Security teams don't fix "syzbot bugs"; they are typically the
> people that run syzbot instances. It's the developers who then
> have to triage and fix the issues that are found, so that's who the
> bug reports should go to (and do). And just because syzbot finds an
> issue, that doesn't make it a security issue - all it is is another
> bug found by another automated test suite that needs fixing.

Browser vendors consider many kinds of memory unsafety problems to be
exploitable until and unless proven otherwise.  My understanding is that
experience has proven them to be correct in this regard.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Unprivileged filesystem mounts
  2025-03-11 20:10           ` Demi Marie Obenour
@ 2025-03-18  5:21             ` Dave Chinner
  2025-03-19 14:55               ` Demi Marie Obenour
  2025-03-18 22:11             ` Theodore Ts'o
  1 sibling, 1 reply; 22+ messages in thread
From: Dave Chinner @ 2025-03-18  5:21 UTC (permalink / raw)
  To: Demi Marie Obenour
  Cc: cve, gnoack, gregkh, kent.overstreet, linux-bcachefs,
	linux-fsdevel, linux-security-module, mic, Demi Marie Obenour

On Tue, Mar 11, 2025 at 04:10:42PM -0400, Demi Marie Obenour wrote:
> On Tue, Mar 11, 2025 at 04:57:54PM +1100, Dave Chinner wrote:
> > On Mon, Mar 10, 2025 at 10:19:57PM -0400, Demi Marie Obenour wrote:
> > > People have stuff to get done.  If you disallow unprivileged filesystem
> > > mounts, they will just use sudo (or equivalent) instead.
> > 
> > I am not advocating that we disallow mounting of untrusted devices.
> > 
> > > The problem is
> > > not that users are mounting untrusted filesystems.  The problem is that
> > > mounting untrusted filesystems is unsafe.
> > 
> > > Making untrusted filesystems safe to mount is the only solution that
> > > lets users do what they actually need to do. That means either actually
> > > fixing the filesystem code,
> > 
> > Yes, and the point I keep making is that we cannot provide that
> > guarantee from the kernel for existing filesystems. We cannot detect
> > all possible malicous tampering situations without cryptogrpahically
> > secure verification, and we can't generate full trust from nothing.
> 
> Why is it not possible to provide that guarantee?  I'm not concerned
> about infinite loops or deadlocks.  Is there a reason it is not possible
> to prevent memory corruption?

You're asking me to prove that the on-disk filesystem format parsing
implementation is 100% provably correct. Not only that, you're
wanting me to say that journal replay copying incomplete,
unverifiable structure fragments over the top of existing disk
structures is 100% provably correct.

I am the person whole architected the existing metadata validation
infrastructure that XFS uses, and so I know it's limitations in
intimate detail. It is, by far, the closest thing we have to
complete runtime metadata validation in any Linux filesystem
(except maybe bcachefs), but it is nowhere near able to detect and
prevent 100% of potential structure corruptions.

It is *far from trivial* to validate all the weird corner cases that
exist in the on-disk format that have evolved over the last 3
decades. For the first 15 years of development, almost zero thought
was given to runtime validation of the on-disk format. People even
fought against introducing it at all. And despite this, we still
have to support the on-disk functionality those old, difficult to
validate, persistent structures describe.

[ And then there's some other random memory corruption bug in the
code, and all bets are off... ]

IOWs, no filesystem developer is ever going to give you a guarantee
that a filesystem implementation is free from memory corruption bugs
unless they've designed and implemented from the ground up to be
100% safe from such issues. No such filesystem exists in the kernel,
and it will probably be years away before anything may exist to fill
that gap.

> > The typical desktop policy of "probe and automount any device that
> > is plugged in" prevents the user from examining the device to
> > determine if it contains what it is supposed to contain.  The user
> > is not given any opportunity to device if trust is warranted before
> > the kernel filesystem parser running in ring 0 is exposed to the
> > malicious image.
> > 
> > That's the fundamental policy problem we need to address: the user
> > and/or admin is not in control of their own security because
> > application developers and/or distro maintainers have decided they
> > should not have a choice.
> > 
> > In this situation, the choice of what to do *must* fall to the user,
> > but the argument for "filesystem corruption is a CVE-worthy bug" is
> > that the choice has been taken away from the user. That's what I'm
> > saying needs to change - the choice needs to be returned to the
> > user...
> 
> I am 100% in favor of not automounting filesystems without user
> interaction, but that only means that an exploit will require user
> interaction.  Users need to get things done, and if their task requires
> them to a not-fully-trusted filesystem image, then that is what they
> will do, and they will typically do it in the most obvious way possible.
> That most obvious way needs to be a safe way, and it needs to have good
> enough performance that users don't go around looking for an unsafe way.

Well, yes, that is obvious, and not a point of contention at all,
as is evidenced by the list of solutions to this problem I outlined.

> > > or running it in a sufficiently tight
> > > sandbox that vulnerabilities in it are of too low importance to matter.
> > > libguestfs+FUSE is the most obvious way to do this, but the performance
> > > might not be enough for distros to turn it on.
> > 
> > Yes, I have advocated for that to be used for desktop mounts in the
> > past. Similarly, I have also advocated for liblinux + FUSE to be
> > used so that the kernel filesystem code is used but run from a
> > userspace context where the kernel cannot be compromised.
> > 
> > I have also advocated for user removable devices to be encrypted by
> > default. The act of the user unlocking the device automatically
> > marks it as trusted because undetectable malicious tampering is
> > highly unlikely.
> 
> That is definitely a good idea.
> 
> > I have also advocated for a device registry that records removable
> > device signatures and whether the user trusted them or not so that
> > they only need to be prompted once for any given removable device
> > they use.
> > 
> > There are *many* potential user-friendly solutions to the problem,
> > but they -all- lie in the domain of userspace applications and/or
> > policies. This is *not* a problem more or better code in the kernel
> > can solve.
> 
> It is certainly possible to make a memory safe implementation of amy
> filesystem.

Spoken like a True Expert.

> If the current implementation can't prevent memory
> corruption if a malicious filesystem is mounted, that is a
> characteristic of the implementation.

Ah, now I see what you are trying to do. You're building a strawman
around memory corruption that you can use the argument "we need to
reimplement everything in Rust" to knock down.

Sorry, not playing that game.

> However, the root filesystem is not the only filesystem image that must
> be mounted.  There is also a writable data volume, and that _cannot_ be
> signed because it contains user data.  It is encrypted, but part of the
> threat model for both Android and ChromeOS is an attacker who has gained
> root or even kernel code execution and wants to retain their access
> across device reboots. They can't tamper with the kernel or root
> filesystem, and privileged userspace treats the data on the writable
> filesystem as untrusted.  However, the attacker can replace the writable
> filesystem image with anything they want,

And therein lies the attack a fielsystem implementation can't defend
against: the attacker can rewrite the unencrypted block device to
contain anything they want, and that will then pass verification on
the next boot. Perhaps that's the class of storage attack you should
seek to prevent, not try to slap bandaids over trust model
violations or insinuate the only solution is to rewrite complex
subsystems in Rust....

-Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Unprivileged filesystem mounts
  2025-03-11 20:10           ` Demi Marie Obenour
  2025-03-18  5:21             ` Dave Chinner
@ 2025-03-18 22:11             ` Theodore Ts'o
  2025-03-19 17:44               ` Demi Marie Obenour
  1 sibling, 1 reply; 22+ messages in thread
From: Theodore Ts'o @ 2025-03-18 22:11 UTC (permalink / raw)
  To: Demi Marie Obenour
  Cc: Dave Chinner, cve, gnoack, gregkh, kent.overstreet,
	linux-bcachefs, linux-fsdevel, linux-security-module, mic,
	Demi Marie Obenour

On Tue, Mar 11, 2025 at 04:10:42PM -0400, Demi Marie Obenour wrote:
> 
> Why is it not possible to provide that guarantee?  I'm not concerned
> about infinite loops or deadlocks.  Is there a reason it is not possible
> to prevent memory corruption?

Companies and users are willing to pay to improve performance for file
systems.  q(For example, we have been working for Cloud services that
are interested in improving the performance of their first party
database products using the fact with cloud emulated block devices, we
can guarantee that 16k write won't be torn, and this can resul;t in
significant database performance.)

However, I have *yet* to see any company willing to invest in
hardening file systems against maliciously modified file system
images.  We can debate how much it might cost it to harden a file
system, but given how much companies are willing to pay --- zero ---
it's mostly an academic question.

In addition, if someone made a file system which is guaranteed to be
safe, but it had massive performance regressions relative other file
systems --- it's unclear how many users or system administrators would
use it.  And we've seen that --- there are known mitigations for CPU
cache attacks which are so expensive, that companies or end users have
chosen not to enable them.  Yes, there are some security folks who
believe that security is the most important thing, uber alles.
Unfortunately, those people tend not to be the ones writing the checks
or authorizing hiring budgets.

That being said, if someone asked me if it was best way to invest
software development dollars --- I'd say no.  Don't get me wrong, if
someone were to give me some minions tasked to harden ext4, I know how
I could keep them busy and productive.  But a more cost effective way
of addressing the "untrusted file sytem problem" would be:

(a) Run a forced fsck to check the file system for inconsistency
before letting the file system be mounted.

(b) Mount the file system in a virtual machine, and then make it
available to the host using something like 9pfs.  9pfs is very simple
file system which is easy to validate, and it's a strategy used by
gVisor's file system gopher.

These two approaches are complementary, with (a) being easier, and (b)
probably a bit more robust from a security perspective, but it a bit
more work --- with both providing a layered approach.

> > In this situation, the choice of what to do *must* fall to the user,
> > but the argument for "filesystem corruption is a CVE-worthy bug" is
> > that the choice has been taken away from the user. That's what I'm
> > saying needs to change - the choice needs to be returned to the
> > user...

Users can alwayus do stupid things.  For example, they could download
a random binary from the web, then execute it.  We've seen very
popular software which is instaled via "curl <URL> | bash".  Should we
therefore call bash be a CVE-vulnerability?

Realistically, this is probably a far bigger vulnerability if we're
talking about stupid user tricks.  ("But.... but... but... users need
to be able to install software" --- we can't stop them from piping the
output of curl into bash.)  Which is another reason why I don't really
blame the VP's that are making funding decisions; it's not clear that
the ROI of funding file system security hardening is the best way to
spend a company's dollars.  Remember, Zuckerburg has been quoted as
saying that he's laying off engineers so his company can buy more
GPU's, we know that funding is not infinite.  Every company is making
ROI decisions; you might not agree with the decisions, but trust me,
they're making them.

But if some company would like to invest software engineering effort
in addition features or perform security hardening --- they should
contact me, and I'd be happy to chat.  We have weekly ext4 video
conference calls, and I'm happy to collaborate with companies have a
business interest in seeing some feature get pursued.  There *have*
been some that are security related --- fscrypt and fsverity were both
implemented for ext4 first, in support of Android and ChromeOS's
security use cases.  But in practice this has been the exception, and
not the rule.

> Not automounting filesystems on hotplug is a _part_ of the solution.
> It cannot be the _entire_ solution.  Users sometimes need to be able to
> interact with untrusted filesystem images with a reasonable speed.

Running fsck on a file system *before* automounting file systems would
be a pretty decent start towards a solution.  Is it perfect?  No.  But
it would provide a huge amount of protection.

Note that this won't help if you have a malicious hardware that
*pretends* to be a USB storage device, but which doens't behave a like
a honest storage device.  For example, reading a particular sector
with one data at time T, and a different data at time T+X, with no
intervening writes.  There is no real defense to this attack, since
there is no way that you can authentiate the external storage device;
you could have a registry of USB vendor and model id's, but a device
can always lie about its id numbers.

If you are worried about this kind of attack, the only thing you can
do is to prevent external USB devices from being attached.  This *is*
something that you can do with Chrome and Android enterprise security
policies, and, I've talked to a bank's senior I/T leader that chose to
put epoxy in their desktop, to mitigate aginst a whole *class* of USB
security attacks.

Like everything else, security and usability and performance and costs
are all engineering tradeoffs.  So what works for one use case and
threat model won't be optimal for another, just as fscrypt works well
for Android and ChromeOS, but it doesn't necessarily work well for
other use cases (where I might recommed dm-crypt instead).

Cheers,

					- Ted


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Unprivileged filesystem mounts
  2025-03-18  5:21             ` Dave Chinner
@ 2025-03-19 14:55               ` Demi Marie Obenour
  2025-03-19 16:59                 ` Theodore Ts'o
  0 siblings, 1 reply; 22+ messages in thread
From: Demi Marie Obenour @ 2025-03-19 14:55 UTC (permalink / raw)
  To: Dave Chinner
  Cc: cve, gnoack, gregkh, kent.overstreet, linux-bcachefs,
	linux-fsdevel, linux-security-module, mic, Demi Marie Obenour

[-- Attachment #1: Type: text/plain, Size: 8482 bytes --]

On Tue, Mar 18, 2025 at 04:21:48PM +1100, Dave Chinner wrote:
> On Tue, Mar 11, 2025 at 04:10:42PM -0400, Demi Marie Obenour wrote:
> > On Tue, Mar 11, 2025 at 04:57:54PM +1100, Dave Chinner wrote:
> > > On Mon, Mar 10, 2025 at 10:19:57PM -0400, Demi Marie Obenour wrote:
> > > > People have stuff to get done.  If you disallow unprivileged filesystem
> > > > mounts, they will just use sudo (or equivalent) instead.
> > > 
> > > I am not advocating that we disallow mounting of untrusted devices.
> > > 
> > > > The problem is
> > > > not that users are mounting untrusted filesystems.  The problem is that
> > > > mounting untrusted filesystems is unsafe.
> > > 
> > > > Making untrusted filesystems safe to mount is the only solution that
> > > > lets users do what they actually need to do. That means either actually
> > > > fixing the filesystem code,
> > > 
> > > Yes, and the point I keep making is that we cannot provide that
> > > guarantee from the kernel for existing filesystems. We cannot detect
> > > all possible malicous tampering situations without cryptogrpahically
> > > secure verification, and we can't generate full trust from nothing.
> > 
> > Why is it not possible to provide that guarantee?  I'm not concerned
> > about infinite loops or deadlocks.  Is there a reason it is not possible
> > to prevent memory corruption?
> 
> You're asking me to prove that the on-disk filesystem format parsing
> implementation is 100% provably correct. Not only that, you're
> wanting me to say that journal replay copying incomplete,
> unverifiable structure fragments over the top of existing disk
> structures is 100% provably correct.
> 
> I am the person whole architected the existing metadata validation
> infrastructure that XFS uses, and so I know it's limitations in
> intimate detail. It is, by far, the closest thing we have to
> complete runtime metadata validation in any Linux filesystem
> (except maybe bcachefs), but it is nowhere near able to detect and
> prevent 100% of potential structure corruptions.
> 
> It is *far from trivial* to validate all the weird corner cases that
> exist in the on-disk format that have evolved over the last 3
> decades. For the first 15 years of development, almost zero thought
> was given to runtime validation of the on-disk format. People even
> fought against introducing it at all. And despite this, we still
> have to support the on-disk functionality those old, difficult to
> validate, persistent structures describe.
> 
> [ And then there's some other random memory corruption bug in the
> code, and all bets are off... ]
> 
> IOWs, no filesystem developer is ever going to give you a guarantee
> that a filesystem implementation is free from memory corruption bugs
> unless they've designed and implemented from the ground up to be
> 100% safe from such issues. No such filesystem exists in the kernel,
> and it will probably be years away before anything may exist to fill
> that gap.

That makes sense.  

> > > The typical desktop policy of "probe and automount any device that
> > > is plugged in" prevents the user from examining the device to
> > > determine if it contains what it is supposed to contain.  The user
> > > is not given any opportunity to device if trust is warranted before
> > > the kernel filesystem parser running in ring 0 is exposed to the
> > > malicious image.
> > > 
> > > That's the fundamental policy problem we need to address: the user
> > > and/or admin is not in control of their own security because
> > > application developers and/or distro maintainers have decided they
> > > should not have a choice.
> > > 
> > > In this situation, the choice of what to do *must* fall to the user,
> > > but the argument for "filesystem corruption is a CVE-worthy bug" is
> > > that the choice has been taken away from the user. That's what I'm
> > > saying needs to change - the choice needs to be returned to the
> > > user...
> > 
> > I am 100% in favor of not automounting filesystems without user
> > interaction, but that only means that an exploit will require user
> > interaction.  Users need to get things done, and if their task requires
> > them to a not-fully-trusted filesystem image, then that is what they
> > will do, and they will typically do it in the most obvious way possible.
> > That most obvious way needs to be a safe way, and it needs to have good
> > enough performance that users don't go around looking for an unsafe way.
> 
> Well, yes, that is obvious, and not a point of contention at all,
> as is evidenced by the list of solutions to this problem I outlined.

What kind of performance do the existing solutions (libguestfs, lklfuse)
have?

> > > > or running it in a sufficiently tight
> > > > sandbox that vulnerabilities in it are of too low importance to matter.
> > > > libguestfs+FUSE is the most obvious way to do this, but the performance
> > > > might not be enough for distros to turn it on.
> > > 
> > > Yes, I have advocated for that to be used for desktop mounts in the
> > > past. Similarly, I have also advocated for liblinux + FUSE to be
> > > used so that the kernel filesystem code is used but run from a
> > > userspace context where the kernel cannot be compromised.
> > > 
> > > I have also advocated for user removable devices to be encrypted by
> > > default. The act of the user unlocking the device automatically
> > > marks it as trusted because undetectable malicious tampering is
> > > highly unlikely.
> > 
> > That is definitely a good idea.
> > 
> > > I have also advocated for a device registry that records removable
> > > device signatures and whether the user trusted them or not so that
> > > they only need to be prompted once for any given removable device
> > > they use.
> > > 
> > > There are *many* potential user-friendly solutions to the problem,
> > > but they -all- lie in the domain of userspace applications and/or
> > > policies. This is *not* a problem more or better code in the kernel
> > > can solve.
> > 
> > It is certainly possible to make a memory safe implementation of amy
> > filesystem.
> 
> Spoken like a True Expert.

I am saying this in the sense of "it is possible to make a memory safe
implementation of *anything*, unless that thing exposes a memory unsafe
API.".  It's a generic statement about programs in general.  It does not
imply that doing so is practical.

> > If the current implementation can't prevent memory
> > corruption if a malicious filesystem is mounted, that is a
> > characteristic of the implementation.
> 
> Ah, now I see what you are trying to do. You're building a strawman
> around memory corruption that you can use the argument "we need to
> reimplement everything in Rust" to knock down.
> 
> Sorry, not playing that game.

There are other options, like "run the filesystem in a tightly sandboxed
userspace process, especially compiled through WebAssembly".  The
difficulty is making them sufficiently performant for distributions to
actually use them.

> > However, the root filesystem is not the only filesystem image that must
> > be mounted.  There is also a writable data volume, and that _cannot_ be
> > signed because it contains user data.  It is encrypted, but part of the
> > threat model for both Android and ChromeOS is an attacker who has gained
> > root or even kernel code execution and wants to retain their access
> > across device reboots. They can't tamper with the kernel or root
> > filesystem, and privileged userspace treats the data on the writable
> > filesystem as untrusted.  However, the attacker can replace the writable
> > filesystem image with anything they want,
> 
> And therein lies the attack a fielsystem implementation can't defend
> against: the attacker can rewrite the unencrypted block device to
> contain anything they want, and that will then pass verification on
> the next boot. Perhaps that's the class of storage attack you should
> seek to prevent, not try to slap bandaids over trust model
> violations or insinuate the only solution is to rewrite complex
> subsystems in Rust....

The Chrome OS and Android threat models require that they remain secure
no matter what the contents of the unsigned block device actually are,
even if they are completely malicious.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Unprivileged filesystem mounts
  2025-03-19 14:55               ` Demi Marie Obenour
@ 2025-03-19 16:59                 ` Theodore Ts'o
  2025-03-19 17:32                   ` Demi Marie Obenour
  0 siblings, 1 reply; 22+ messages in thread
From: Theodore Ts'o @ 2025-03-19 16:59 UTC (permalink / raw)
  To: Demi Marie Obenour
  Cc: Dave Chinner, cve, gnoack, gregkh, kent.overstreet,
	linux-bcachefs, linux-fsdevel, linux-security-module, mic,
	Demi Marie Obenour

On Wed, Mar 19, 2025 at 10:55:39AM -0400, Demi Marie Obenour wrote:
> What kind of performance do the existing solutions (libguestfs, lklfuse)
> have?

For most of the use cases that I'm aware of, which is to support
occasional file transfers through crappy USB thumb drives (the kind
which a nation state actor would to scatter in the parking lot of
their target), the performance doesn't really matter.  Certainly these
are the ones which apply for the Android and ChromeOS use cases.

I suppose there is the use case of people who are running Adobe
Lightroom Classic on their Macbook Air where they are using an
external SSD because Apple's storage pricing is highway robbery, but
(a) it's MacOS, not Linux, and (b) this is arguably a much smaller
percentage of the use case cases in terms of millions and millions of
Android and Chrome Users.  Most of the more naive Mac users probably
just pay $$$ to Apple and don't use external storage anyway.  :-)

> There are other options, like "run the filesystem in a tightly sandboxed
> userspace process, especially compiled through WebAssembly".  The
> difficulty is making them sufficiently performant for distributions to
> actually use them.

I suspect that using a kernel file system running in a guest VM and
then making it available via 9pfs would be far more performant than
something involving FUSE.  But the details would all be in the
implementation, and the skill level of the engineer doing the work.

I'll also note that since you are mentioning Chrome OS and Android a
lot, there seems to be a lot of interest in using VM's as a security
boundary (see CrosVM[1] which is a Rust-based VMM).  So it's likely
that this infrastructure would be available to you if you are doing
work in this area.

[1] https://github.com/google/crosvm

Cheers,

					- Ted

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Unprivileged filesystem mounts
  2025-03-19 16:59                 ` Theodore Ts'o
@ 2025-03-19 17:32                   ` Demi Marie Obenour
  2025-03-19 20:11                     ` Theodore Ts'o
  0 siblings, 1 reply; 22+ messages in thread
From: Demi Marie Obenour @ 2025-03-19 17:32 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Dave Chinner, cve, gnoack, gregkh, kent.overstreet,
	linux-bcachefs, linux-fsdevel, linux-security-module, mic,
	Demi Marie Obenour

[-- Attachment #1: Type: text/plain, Size: 2493 bytes --]

On Wed, Mar 19, 2025 at 12:59:31PM -0400, Theodore Ts'o wrote:
> On Wed, Mar 19, 2025 at 10:55:39AM -0400, Demi Marie Obenour wrote:
> > What kind of performance do the existing solutions (libguestfs, lklfuse)
> > have?
> 
> For most of the use cases that I'm aware of, which is to support
> occasional file transfers through crappy USB thumb drives (the kind
> which a nation state actor would to scatter in the parking lot of
> their target), the performance doesn't really matter.  Certainly these
> are the ones which apply for the Android and ChromeOS use cases.

Would this have sufficient performance for backups?

> I suppose there is the use case of people who are running Adobe
> Lightroom Classic on their Macbook Air where they are using an
> external SSD because Apple's storage pricing is highway robbery, but
> (a) it's MacOS, not Linux, and (b) this is arguably a much smaller
> percentage of the use case cases in terms of millions and millions of
> Android and Chrome Users.  Most of the more naive Mac users probably
> just pay $$$ to Apple and don't use external storage anyway.  :-)
> 
> > There are other options, like "run the filesystem in a tightly sandboxed
> > userspace process, especially compiled through WebAssembly".  The
> > difficulty is making them sufficiently performant for distributions to
> > actually use them.
> 
> I suspect that using a kernel file system running in a guest VM and
> then making it available via 9pfs would be far more performant than
> something involving FUSE.  But the details would all be in the
> implementation, and the skill level of the engineer doing the work.

Why do you suspect this?  I'm genuinely curious, especially because my
understanding is that virtiofs (which uses the FUSE protocol internally)
is considered faster than 9pfs.

> I'll also note that since you are mentioning Chrome OS and Android a
> lot, there seems to be a lot of interest in using VM's as a security
> boundary (see CrosVM[1] which is a Rust-based VMM).  So it's likely
> that this infrastructure would be available to you if you are doing
> work in this area.
> 
> [1] https://github.com/google/crosvm

The need to resort to virtualization as a security boundary makes me
wonder if Linux is designed for outdated threat models and security
paradigms.  Sadly, changing the threat model would be extremely
expensive today.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Unprivileged filesystem mounts
  2025-03-18 22:11             ` Theodore Ts'o
@ 2025-03-19 17:44               ` Demi Marie Obenour
  2025-03-19 21:25                 ` Theodore Ts'o
  0 siblings, 1 reply; 22+ messages in thread
From: Demi Marie Obenour @ 2025-03-19 17:44 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Dave Chinner, cve, gnoack, gregkh, kent.overstreet,
	linux-bcachefs, linux-fsdevel, linux-security-module, mic,
	Demi Marie Obenour

[-- Attachment #1: Type: text/plain, Size: 7688 bytes --]

On Tue, Mar 18, 2025 at 06:11:28PM -0400, Theodore Ts'o wrote:
> On Tue, Mar 11, 2025 at 04:10:42PM -0400, Demi Marie Obenour wrote:
> > 
> > Why is it not possible to provide that guarantee?  I'm not concerned
> > about infinite loops or deadlocks.  Is there a reason it is not possible
> > to prevent memory corruption?
> 
> Companies and users are willing to pay to improve performance for file
> systems.  q(For example, we have been working for Cloud services that
> are interested in improving the performance of their first party
> database products using the fact with cloud emulated block devices, we
> can guarantee that 16k write won't be torn, and this can resul;t in
> significant database performance.)
> 
> However, I have *yet* to see any company willing to invest in
> hardening file systems against maliciously modified file system
> images.  We can debate how much it might cost it to harden a file
> system, but given how much companies are willing to pay --- zero ---
> it's mostly an academic question.

Google _ought_ to be willing to pay for ext4 and f2fs.  Have you asked
ChromeOS and Android security about this?  Exploits involving malicious
filesystem images are in scope for their bug bounty programs.

> In addition, if someone made a file system which is guaranteed to be
> safe, but it had massive performance regressions relative other file
> systems --- it's unclear how many users or system administrators would
> use it.  And we've seen that --- there are known mitigations for CPU
> cache attacks which are so expensive, that companies or end users have
> chosen not to enable them.  Yes, there are some security folks who
> believe that security is the most important thing, uber alles.
> Unfortunately, those people tend not to be the ones writing the checks
> or authorizing hiring budgets.
> 
> That being said, if someone asked me if it was best way to invest
> software development dollars --- I'd say no.  Don't get me wrong, if
> someone were to give me some minions tasked to harden ext4, I know how
> I could keep them busy and productive.  But a more cost effective way
> of addressing the "untrusted file sytem problem" would be:
> 
> (a) Run a forced fsck to check the file system for inconsistency
> before letting the file system be mounted.
> 
> (b) Mount the file system in a virtual machine, and then make it
> available to the host using something like 9pfs.  9pfs is very simple
> file system which is easy to validate, and it's a strategy used by
> gVisor's file system gopher.
> 
> These two approaches are complementary, with (a) being easier, and (b)
> probably a bit more robust from a security perspective, but it a bit
> more work --- with both providing a layered approach.

Definitely a good idea.

> > > In this situation, the choice of what to do *must* fall to the user,
> > > but the argument for "filesystem corruption is a CVE-worthy bug" is
> > > that the choice has been taken away from the user. That's what I'm
> > > saying needs to change - the choice needs to be returned to the
> > > user...
> 
> Users can alwayus do stupid things.  For example, they could download
> a random binary from the web, then execute it.  We've seen very
> popular software which is instaled via "curl <URL> | bash".  Should we
> therefore call bash be a CVE-vulnerability?
> 
> Realistically, this is probably a far bigger vulnerability if we're
> talking about stupid user tricks.  ("But.... but... but... users need
> to be able to install software" --- we can't stop them from piping the
> output of curl into bash.)  Which is another reason why I don't really
> blame the VP's that are making funding decisions; it's not clear that
> the ROI of funding file system security hardening is the best way to
> spend a company's dollars.  Remember, Zuckerburg has been quoted as
> saying that he's laying off engineers so his company can buy more
> GPU's, we know that funding is not infinite.  Every company is making
> ROI decisions; you might not agree with the decisions, but trust me,
> they're making them.
> 
> But if some company would like to invest software engineering effort
> in addition features or perform security hardening --- they should
> contact me, and I'd be happy to chat.  We have weekly ext4 video
> conference calls, and I'm happy to collaborate with companies have a
> business interest in seeing some feature get pursued.  There *have*
> been some that are security related --- fscrypt and fsverity were both
> implemented for ext4 first, in support of Android and ChromeOS's
> security use cases.  But in practice this has been the exception, and
> not the rule.

Android and ChromeOS do _not_ allow you to run curl <URL> | bash, at
least outside of a VM.

> > Not automounting filesystems on hotplug is a _part_ of the solution.
> > It cannot be the _entire_ solution.  Users sometimes need to be able to
> > interact with untrusted filesystem images with a reasonable speed.
> 
> Running fsck on a file system *before* automounting file systems would
> be a pretty decent start towards a solution.  Is it perfect?  No.  But
> it would provide a huge amount of protection.
> 
> Note that this won't help if you have a malicious hardware that
> *pretends* to be a USB storage device, but which doens't behave a like
> a honest storage device.  For example, reading a particular sector
> with one data at time T, and a different data at time T+X, with no
> intervening writes.  There is no real defense to this attack, since
> there is no way that you can authentiate the external storage device;
> you could have a registry of USB vendor and model id's, but a device
> can always lie about its id numbers.

This attack can be defended against by sandboxing the filesystem driver
and copying files to trusted storage before using them.  You can
authenticate devices based on what port they are plugged into, and Qubes
OS is working on exactly that.

> If you are worried about this kind of attack, the only thing you can
> do is to prevent external USB devices from being attached.  This *is*
> something that you can do with Chrome and Android enterprise security
> policies, and, I've talked to a bank's senior I/T leader that chose to
> put epoxy in their desktop, to mitigate aginst a whole *class* of USB
> security attacks.

Or you can disable your firmware's USB stack and ensure that USB devices
are only attached to virtual machines.  Dasharo allows the former, and
Qubes OS allows the latter.

(Disclaimer: I work on Qubes OS).

> Like everything else, security and usability and performance and costs
> are all engineering tradeoffs.  So what works for one use case and
> threat model won't be optimal for another, just as fscrypt works well
> for Android and ChromeOS, but it doesn't necessarily work well for
> other use cases (where I might recommed dm-crypt instead).

Is the tradeoff fundamental, or is it a consequence of Linux being a
monolithic kernel?  If Linux were a microkernel and every filesystem
driver ran as a userspace process with no access to anything but the
device it is accessing, then there would be no tradeoff when it comes to
filesystems: a compromised filesystem driver would have no more access
than the device itself would, so compromising a filesystem driver would
be of much less value to an attacker.  There is still the problem that
plug and play is incompatible with not trusting devices to identify
themselves, but that's a different concern.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Unprivileged filesystem mounts
  2025-03-19 17:32                   ` Demi Marie Obenour
@ 2025-03-19 20:11                     ` Theodore Ts'o
  0 siblings, 0 replies; 22+ messages in thread
From: Theodore Ts'o @ 2025-03-19 20:11 UTC (permalink / raw)
  To: Demi Marie Obenour
  Cc: Dave Chinner, cve, gnoack, gregkh, kent.overstreet,
	linux-bcachefs, linux-fsdevel, linux-security-module, mic,
	Demi Marie Obenour

On Wed, Mar 19, 2025 at 01:32:59PM -0400, Demi Marie Obenour wrote:
> > I suspect that using a kernel file system running in a guest VM and
> > then making it available via 9pfs would be far more performant than
> > something involving FUSE.  But the details would all be in the
> > implementation, and the skill level of the engineer doing the work.
> 
> Why do you suspect this?  I'm genuinely curious, especially because my
> understanding is that virtiofs (which uses the FUSE protocol internally)
> is considered faster than 9pfs.

I was saying that 9pfs is faster than fuse.  Yes, virtiofs would be
faster than 9pfs.  No question.  However, it might be harder to audit
the virtiofs client implementation given the virtiofs ring buffer
interface to make sure it is free of potential security exploits.9pfs
would be simpler to reassure folks that it is safe(tm).

> The need to resort to virtualization as a security boundary makes me
> wonder if Linux is designed for outdated threat models and security
> paradigms.  Sadly, changing the threat model would be extremely
> expensive today.

I wouldn't say that it's specific to Linux; for many, MANY, MANY
decades, the disk drive was considered within the Trusted Computing
Boundary.  This was true for Multics; VMS; Unix, and other operating
systems that were certified to the Trusted Computing System Evaluation
Criteria (aka the "Orange Book") to the B1 and B2 certification

Ejecting the storage device so it is outside the TCB is a huge change
in the threat model, especially given that for a long time people have
made performance, including simultaneous modifications to the same
file, the primary requirement for most file systems.

If we want to make a single, simple file system that is good enough
for file exchange and backup, where we only need to optimize for
sequental, single-threaded I/O, and for low-cost or moderate-cost
flash devices, that's a much simpler sort of file system that we could
secure against this modified threat model.

However, given how much companies have always been massively stingy
about funding file system development (and these days, anything which
isn't AI :-), I suspect a sandbox/VM approach is going to be a much
more cost effective approach.  But I'm happy to be proven wrong, if
some company is willing to fund the effort --- let's see the names and
we can invite them into the relevant collaboration forums, such as the
weekly ext4 video conference if it's appropriate.

However, just having security people kvetching on open source mailing
lists, or raising syzbot bugs for threat models that the file system
maintainers had never agreed to, and then trying to bully or shame
volunteers to do the work for free is, I would argue, not productive.

Cheers,

					- Ted

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Unprivileged filesystem mounts
  2025-03-19 17:44               ` Demi Marie Obenour
@ 2025-03-19 21:25                 ` Theodore Ts'o
  2025-03-20  6:26                   ` Demi Marie Obenour
  0 siblings, 1 reply; 22+ messages in thread
From: Theodore Ts'o @ 2025-03-19 21:25 UTC (permalink / raw)
  To: Demi Marie Obenour
  Cc: Dave Chinner, cve, gnoack, gregkh, kent.overstreet,
	linux-bcachefs, linux-fsdevel, linux-security-module, mic,
	Demi Marie Obenour

On Wed, Mar 19, 2025 at 01:44:13PM -0400, Demi Marie Obenour wrote:
> > Note that this won't help if you have a malicious hardware that
> > *pretends* to be a USB storage device, but which doens't behave a like
> > a honest storage device.  For example, reading a particular sector
> > with one data at time T, and a different data at time T+X, with no
> > intervening writes.  There is no real defense to this attack, since
> > there is no way that you can authentiate the external storage device;
> > you could have a registry of USB vendor and model id's, but a device
> > can always lie about its id numbers.
> 
> This attack can be defended against by sandboxing the filesystem driver
> and copying files to trusted storage before using them.  You can
> authenticate devices based on what port they are plugged into, and Qubes
> OS is working on exactly that.

Copying files to trusted storge is not sufficient.  The problem is
that an untrustworthy storage device can still play games with
metadata blocks.  If you are willing to copy the entire storage device
to trustworthy storage, and then run fsck on the file system, and then
mount it, then *sure* that would help.  But if the storage device is
very large or very slow, this might not be practical.

> > Like everything else, security and usability and performance and costs
> > are all engineering tradeoffs....
>
> Is the tradeoff fundamental, or is it a consequence of Linux being a
> monolithic kernel?  If Linux were a microkernel and every filesystem
> driver ran as a userspace process with no access to anything but the
> device it is accessing, then there would be no tradeoff when it comes to
> filesystems: a compromised filesystem driver would have no more access
> than the device itself would, so compromising a filesystem driver would
> be of much less value to an attacker.  There is still the problem that
> plug and play is incompatible with not trusting devices to identify
> themselves, but that's a different concern.

Microkernels have historically been a performance disaster.  Yes, you
can invest a *vast* amount of effort into trying to make a microkernel
OS more performant, but in the meantime, the competing monolithic
kernel will have gotten even faster, or added more features, leaving
the microkernel in the dust.

The effort needed to create a new file system from scratch, taking it
all the way from the initial design, implementation, testing and
performance tuning, and making it something customers are comfortable
depending on it for enterprise workloads is between 50 and 100
engineer years.  This estimate came from looking at the development
effort needed for various file systems implemented on monolithic
kernels, including Digital's Advfs (part of Digital Unix and OSF/1),
IBM's AIX, and Sun's ZFS, as well as GPFS from IBM (although that was
a cluster file sytem, and the effort estimated from my talking to the
engineering managers and tech leads was around 200 PY's.)

I'm not sure how much harder it will be to make a performant file
system which is suitable for enterprise workloads from a performance,
feature, and stability perspective, *and* to make it secure against
storage devices which are outside the TCB, *and* to make it work on a
microkernel.  But I'm going to guess it would inflate these effort
estimates by at least 50%, if not more.

Of course, if we're just witing a super simple file system that is
suitable for backups and file transfers, but not much else, that would
probably take much less efort.  But if we need to support file
exchange with storge devices with NTFS or HFS, thos aren't simple file
sytes.  So the VM sandbox approach might still be the better way to go.

Cheers,

					- Ted

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Unprivileged filesystem mounts
  2025-03-19 21:25                 ` Theodore Ts'o
@ 2025-03-20  6:26                   ` Demi Marie Obenour
  2025-03-20 16:00                     ` Theodore Ts'o
  0 siblings, 1 reply; 22+ messages in thread
From: Demi Marie Obenour @ 2025-03-20  6:26 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Dave Chinner, cve, gnoack, gregkh, kent.overstreet,
	linux-bcachefs, linux-fsdevel, linux-security-module, mic,
	Demi Marie Obenour

[-- Attachment #1: Type: text/plain, Size: 5976 bytes --]

On Wed, Mar 19, 2025 at 05:25:17PM -0400, Theodore Ts'o wrote:
> On Wed, Mar 19, 2025 at 01:44:13PM -0400, Demi Marie Obenour wrote:
> > > Note that this won't help if you have a malicious hardware that
> > > *pretends* to be a USB storage device, but which doens't behave a like
> > > a honest storage device.  For example, reading a particular sector
> > > with one data at time T, and a different data at time T+X, with no
> > > intervening writes.  There is no real defense to this attack, since
> > > there is no way that you can authentiate the external storage device;
> > > you could have a registry of USB vendor and model id's, but a device
> > > can always lie about its id numbers.
> > 
> > This attack can be defended against by sandboxing the filesystem driver
> > and copying files to trusted storage before using them.  You can
> > authenticate devices based on what port they are plugged into, and Qubes
> > OS is working on exactly that.
> 
> Copying files to trusted storge is not sufficient.  The problem is
> that an untrustworthy storage device can still play games with
> metadata blocks.  If you are willing to copy the entire storage device
> to trustworthy storage, and then run fsck on the file system, and then
> mount it, then *sure* that would help.  But if the storage device is
> very large or very slow, this might not be practical.

Copying flles is not sufficient on its own.  You need to _also_ sandbox
the file system driver, which defeats the attack you mentioned above:
the attacker can compromise the VM running the file system, but that
doesn't give the attacker anything particularly useful.

> > > Like everything else, security and usability and performance and costs
> > > are all engineering tradeoffs....
> >
> > Is the tradeoff fundamental, or is it a consequence of Linux being a
> > monolithic kernel?  If Linux were a microkernel and every filesystem
> > driver ran as a userspace process with no access to anything but the
> > device it is accessing, then there would be no tradeoff when it comes to
> > filesystems: a compromised filesystem driver would have no more access
> > than the device itself would, so compromising a filesystem driver would
> > be of much less value to an attacker.  There is still the problem that
> > plug and play is incompatible with not trusting devices to identify
> > themselves, but that's a different concern.
> 
> Microkernels have historically been a performance disaster.  Yes, you
> can invest a *vast* amount of effort into trying to make a microkernel
> OS more performant, but in the meantime, the competing monolithic
> kernel will have gotten even faster, or added more features, leaving
> the microkernel in the dust.

The L4 family of microkernels, and especially seL4, show that
microkernels do not need to be slow.  I do agree that making a
microkernel-based OS fast is hard, but on the other hand, running an
entire Linux VM just to host a single application isn't exactly an
efficient use of resources either.  The latter is what systems like Kata
containers wind up doing.

> The effort needed to create a new file system from scratch, taking it
> all the way from the initial design, implementation, testing and
> performance tuning, and making it something customers are comfortable
> depending on it for enterprise workloads is between 50 and 100
> engineer years.  This estimate came from looking at the development
> effort needed for various file systems implemented on monolithic
> kernels, including Digital's Advfs (part of Digital Unix and OSF/1),
> IBM's AIX, and Sun's ZFS, as well as GPFS from IBM (although that was
> a cluster file sytem, and the effort estimated from my talking to the
> engineering managers and tech leads was around 200 PY's.)
> 
> I'm not sure how much harder it will be to make a performant file
> system which is suitable for enterprise workloads from a performance,
> feature, and stability perspective, *and* to make it secure against
> storage devices which are outside the TCB, *and* to make it work on a
> microkernel.  But I'm going to guess it would inflate these effort
> estimates by at least 50%, if not more.

My understanding is that "Secure against storage devices which are
outside the TCB" mostly requires 2 things:

1. Either a programming language in which memory safety vulnerabilities
   are difficult to introduce by accident, or a sandbox that ensures
   that a compromised file system driver cannot do more than cause file
   system operations to return wrong results.

2. A way to kill a file system that is caught in an infinite loop, is
   eating too much memory, or is otherwise the victim of a denial of
   service attack without crashing the whole system.  This is not needed
   if denial of service attacks are outside of your threat model.

I'm not asking you (or anyone else) to write a filesystem driver that
has no bugs in the face of arbitrarily corrupted input.  I _expect_ that
there will be bugs in this case.  Right now, Linux kernel file systems
are written in C and run in the kernel, which means that a bug can
easily result in a complete system compromise.

> Of course, if we're just witing a super simple file system that is
> suitable for backups and file transfers, but not much else, that would
> probably take much less efort.  But if we need to support file
> exchange with storge devices with NTFS or HFS, thos aren't simple file
> sytes.  So the VM sandbox approach might still be the better way to go.

Certainly the VM sandbox is the simplest approach in the short term.

P.S.: For all that I may disagree with you on a lot of things, I am very
grateful for all the work you have put into making ext4 as solid a
filesystem as it is, as well as for your other innovations (like
creating /dev/{u,}random).
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Unprivileged filesystem mounts
  2025-03-20  6:26                   ` Demi Marie Obenour
@ 2025-03-20 16:00                     ` Theodore Ts'o
  0 siblings, 0 replies; 22+ messages in thread
From: Theodore Ts'o @ 2025-03-20 16:00 UTC (permalink / raw)
  To: Demi Marie Obenour
  Cc: Dave Chinner, cve, gnoack, gregkh, kent.overstreet,
	linux-bcachefs, linux-fsdevel, linux-security-module, mic,
	Demi Marie Obenour

On Thu, Mar 20, 2025 at 02:26:41AM -0400, Demi Marie Obenour wrote:
> The L4 family of microkernels, and especially seL4, show that
> microkernels do not need to be slow.

With all due respect to folks who have wrked on L4 and its derivatives,
L4 is a research prototype.  The gap between a research prototype and
something that can actually be used in wide variety of use cases, from
smart watches, to mainframes, is... large.

If some company is willing to fund such work, I'd be very interested
to see what they can come up with.  I will note that Google has tried
dabbling in this space with Fuchsia, and getting to something that can
actually be shipped in a product has been a very long road.  To their
credit, they have managed to do this for a version of Nest Hub, but
most people would say that it is very far from being suitable for
Android or Chrome OS, and supprting data center workloads was
explicitly a non-goal by the Fuschia team.

See [1] for more details.  In 2018, it was reported that Google had
over 100 engineers working on Fuchsia starting in 2016, with the hopes
that it would be ready "in 5 years".  Per [2], apparently in 2024
Fuschia "is not dead", but work has slowed and there aren't as many
people working on it.  (Disclosure: I work at Google but all of my
recent knowledge about Fuchsia comes from news reports; the last time
I talked to anyone on the Fuchsia team was well before COVID.)

[1] https://www.bloomberg.com/news/articles/2018-07-19/google-team-is-said-to-plot-android-successor-draw-skepticism
[2] https://www.reddit.com/r/Fuchsia/comments/1g7x2vs/what_happened_to_fuchsia/

> I do agree that making a microkernel-based OS fast is hard, but on
> the other hand, running an entire Linux VM just to host a single
> application isn't exactly an efficient use of resources either.

Well, if you want to try to make a business case to VP's with
estimates of how many engineers this would require, probably in a
sustained effort taking at least 5 to 10 years, I cordially invite you
to make the attempt.  :-)

Given how cheap hardware has been geting, running multiple VM's on an
Android phone or a ChromeOS laptop might not actally be that
expensive, relative to the cost of the required number of software
engineers for some of the alternatives we've discussed on this thread.
There are ways that you can share the read-only text pages for the
kernel, etc., to optimize the overhead of the VM, for exaple.

It is also much easier to collavorate with SOC designers to create
hardware optimizations for a VM abstraction, as compared to creating
hardwae optmizations for a software-level OS abstraction such as a
container or microkernel task.  So I don't think it's a safe
assumption that VM overheads will always be unacceptable relative to
the alternatives.

Cheers,

					- Ted


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2025-03-20 16:00 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <2025030611-CVE-2025-21830-da64@gregkh>
     [not found] ` <20250310.ooshu9Cha2oo@digikod.net>
     [not found]   ` <2025031034-savanna-debit-eb8e@gregkh>
2025-03-10 23:42     ` CVE-2025-21830: landlock: Handle weird files Dave Chinner
2025-03-11  2:09       ` Kent Overstreet
2025-03-11  4:24         ` Dave Chinner
2025-03-11 10:50           ` Kent Overstreet
2025-03-11  2:19       ` Unprivileged filesystem mounts Demi Marie Obenour
2025-03-11  5:57         ` Dave Chinner
2025-03-11 11:01           ` Christian Brauner
2025-03-11 17:36             ` Al Viro
2025-03-11 17:43               ` Kent Overstreet
2025-03-11 17:54           ` Eric Biggers
2025-03-11 20:10           ` Demi Marie Obenour
2025-03-18  5:21             ` Dave Chinner
2025-03-19 14:55               ` Demi Marie Obenour
2025-03-19 16:59                 ` Theodore Ts'o
2025-03-19 17:32                   ` Demi Marie Obenour
2025-03-19 20:11                     ` Theodore Ts'o
2025-03-18 22:11             ` Theodore Ts'o
2025-03-19 17:44               ` Demi Marie Obenour
2025-03-19 21:25                 ` Theodore Ts'o
2025-03-20  6:26                   ` Demi Marie Obenour
2025-03-20 16:00                     ` Theodore Ts'o
2025-03-11  6:53       ` CVE-2025-21830: landlock: Handle weird files Greg Kroah-Hartman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).