On 3/22/26 00:51, Gao Xiang wrote:
>
>
> On 2026/3/22 11:25, Demi Marie Obenour wrote:
>
> ...
>
>>>
>>> Technically speaking fuse4fs could just invoke e2fsck -fn before it
>>> starts up the rest of the libfuse initialization but who knows if that's
>>> an acceptable risk.  Also unclear if you actually want -fy for that.
>>
>
> Let me try to reply to the remaining part:
>
>> To me, the attacks mentioned above are all either user error,
>> or vulnerabilities in software accessing the filesystem.  If one
>
> There are many consequences if users try to use potentially inconsistent
> writable filesystems directly (without a full fsck); what I can think
> of includes, but is not limited to:
>
>  - data loss (consider a data-block double-free issue);
>  - data theft (for example, users keep sensitive information in the
>    workload in a high-permission inode, but it can be read through a
>    low-permission malicious inode later);
>  - data tampering (by the same principle).
>
> All the vulnerabilities above happen after users try to write to the
> inconsistent filesystem, which is hard to prevent by on-disk design.
>
> But if users write with copy-on-write to another local consistent
> filesystem, none of the vulnerabilities above can occur.

That makes sense!  Is this because the reads are at least deterministic?

>> doesn't trust a filesystem image, then any data from the filesystem
>> can't be trusted either.  The only exception is if one can verify
>
> I don't think trustworthiness is the core part of this whole topic,
> because the Linux namespace & cgroup concepts were invented precisely
> for untrusted or isolated workloads.
>
> If you distrust some workload, fine, isolate it into another
> namespace: you cannot strictly trust anything.
>
> The kernel always has bugs, but is that the real main reason
> you never run untrusted workloads?  I don't think so.

I always use VMs for untrusted workloads.

>> the data cryptographically, which is what fsverity is for.
>> If the filesystem is mounted r/o and the image doesn't change, one
>> could guarantee that accessing the filesystem will at least return
>> deterministic results even for corrupted images.  That's something that
>> would need to be guaranteed by individual filesystem implementations,
>> though.
>
> I just want to say that the real problem with generic writable
> filesystems is that their on-disk design makes it difficult to
> prevent or detect harmful inconsistencies.
>
> First, the on-disk format includes redundant metadata and even
> potentially malicious journal metadata (as I mentioned in previous
> emails).  This makes it hard to determine whether the filesystem is
> inconsistent without performing a full disk scan, which takes a
> very long time.
>
> Of course, you could mount severely inconsistent writable
> filesystems in read-only (RO) mode.  However, they are still
> inconsistent by definition according to their formal on-disk
> specifications.  Furthermore, the runtime kernel implementation
> mixes read-write and read-only logic within the same codebase,
> which complicates the practical consequences.
>
> With immutable filesystem designs, almost all typical severe
> inconsistencies either cannot happen by design or cannot be regarded
> as harmful.  I believe the core issue is not trustworthiness; even
> with an untrusted workload, you should be able to audit it easily.
> However, severely inconsistent writable filesystems make such
> auditability much harder.

That actually makes a lot of sense.  I had not considered the journal,
which means one must modify the disk image just to mount it.

>> See the end of this email for a long note about what can and cannot
>> be guaranteed in the face of corrupt or malicious filesystem images.
>>
>>>> "that is not the case that we will handle with userspace FUSE
>>>> drivers, because the metadata is seriously broken"), the only way to
>>>> resolve such attack vectors is to run
>>>>
>>>> the full-scan fsck consistency check and then mount "rw"
>>>>
>>>> or
>>>>
>>>> use an immutable filesystem like EROFS (so that there will not
>>>> be such inconsistency issues by design) and isolate the entire write
>>>> traffic with a full copy-on-write mechanism, with OverlayFS for
>>>> example (IOWs, make all writes copy-on-write into another trusted
>>>> local filesystem).
>>>
>>> (Yeah, that's probably the only way to go for prepopulated images like
>>> root filesystems and container packages)
>>
>> Even an immutable filesystem can still be corrupt.
>>
>>>> I hope it's a valid case, and that can indeed happen if an arbitrary
>>>> generic filesystem can be mounted "rw".  And my immutable image
>>>> filesystem idea can help mitigate this too (just because the immutable
>>>> image won't be changed in any way, and all writes are always copy-up).
>>>
>>> That, we agree on :)
>>
>> Indeed, expecting writes to a corrupt filesystem to behave reasonably
>> is very foolish.
>>
>> Long note starts here: There is no *fundamental* reason that a crafted
>> filesystem image must be able to cause crashes, memory corruption, etc.
>
> I still think those kinds of security risks, being just implementation
> bugs, are the easiest part of the whole issue.
>
> Many Linux kernel bugs can cause crashes or memory corruption; why do
> crafted filesystems need to be specially considered?

In the past, filesystem implementations have often not focused on this.
The Linux Kernel CVE team does not issue CVEs for such bugs.

>> This applies even if the filesystem image may be written to while
>> mounted.  It is always *possible* to write a filesystem such that
>> it never trusts anything it reads from disk and assumes each read
>> could return arbitrarily malicious results.
>
> Linux namespaces were invented for exactly this kind of usage.  If
> broken archive images return garbage data, or the archive images can
> even be changed randomly at runtime, what are the real impacts as long
> as they are isolated by namespaces?

None!  Regardless of whether one considers namespaces sufficient for
isolating malicious code, they can definitely isolate filesystem
operations very well.

>> Right now, many filesystem maintainers do not consider this to be a
>> priority.  Even if they did, I don't think *anyone* (myself included)
>> could write a filesystem implementation in C that didn't have memory
>> corruption flaws.  The only exceptions are if the filesystem is
>
> I think this still falls under implementation bugs.  My question is
> simply: why are filesystems special in this area?  There are many
> other kernel subsystems written in C that receive untrusted data,
> like the TCP/IP stack; why are filesystems singled out for memory
> corruption flaws?

See above - the difference is that filesystems have historically not
been written with untrusted input in mind.  This, of course, can be
changed.

> I really think different aspects are often mixed up when this topic
> comes up, which makes the discussion more and more divergent.

I agree.

> If we talk about implementation bugs, I think filesystems are not
> special; but as I said, the main issue is the on-disk format design
> of writable filesystems, and due to that design, inconsistent
> filesystems can have many severe consequences.

It definitely makes things much harder, and dramatically increases the
attack surface.  Most uses I have (notably backups) have a hard
requirement for writable storage, and when they don't need it they can
use dm-verity.

>> incredibly simple or formal methods are used, and neither is the
>> case for existing filesystems in the Linux kernel.
>> By sandboxing a
>> filesystem, one ensures that an attacker who compromises a filesystem
>> implementation needs to find *another* exploit to compromise the
>> whole system.
>
> Yes, yet sandboxing is only one part; of course VM sandboxing is
> better than Linux namespace isolation, but VMs cost much more.

I use a lot of VMs, but they indeed use significant resources.  I hope
that at some point this can largely be solved with copy-on-write VM
forking.

> Other than sandboxing, I think auditability is important too,
> especially when users provide sensitive data to new workloads.
>
> Of course, dealing only with trusted workloads would be best, no
> question.  But in the real world, we cannot always assume completely
> trusted workloads.  For untrusted workloads, we need to find reliable
> ways to audit them until they become trusted.
>
> Just like in the real world: accumulate credit, undergo
> audits, and eventually earn trust.
>
> Sorry about my English, but I hope I have expressed my whole idea.
>
> Thanks,
> Gao Xiang

Don't worry about your English.  It is completely understandable and
more than capable of getting your (very informative) points across.

-- 
Sincerely,
Demi Marie Obenour (she/her/hers)