On 3/22/26 00:51, Gao Xiang wrote:
>
>
> On 2026/3/22 11:25, Demi Marie Obenour wrote:
>
> ...
>
>>>
>>> Technically speaking fuse4fs could just invoke e2fsck -fn before it
>>> starts up the rest of the libfuse initialization but who knows if that's
>>> an acceptable risk.  Also unclear if you actually want -fy for that.
>>
>
> Let me try to reply to the remaining part:
>
>> To me, the attacks mentioned above are all either user error,
>> or vulnerabilities in software accessing the filesystem.  If one
>
> There are many consequences if users try to use potentially inconsistent
> writable filesystems directly (without a full fsck); what I can think
> of includes, but is not limited to:
>
>  - data loss (consider a data-block double-free issue);
>  - data theft (for example, users keep sensitive information in the
>    workload in a high-permission inode, but it can be read through a
>    low-permission malicious inode later);
>  - data tampering (by the same principle).
>
> All the vulnerabilities above happen after users try to write to the
> inconsistent filesystem, which is hard to prevent by on-disk design.
>
> But if users write with copy-on-write to another local consistent
> filesystem, none of the vulnerabilities above can occur.

That makes sense!  Is this because the reads are at least deterministic?

>> doesn't trust a filesystem image, then any data from the filesystem
>> can't be trusted either.  The only exception is if one can verify
>
> I don't think trustworthiness is the core part of this whole topic,
> because the Linux namespace & cgroup concepts were invented precisely
> for untrusted or isolated workloads.
>
> If you distrust some workload, fine, isolate it into another
> namespace: you cannot strictly trust anything.
>
> The kernel always has bugs, but is that the real main reason
> you never run untrusted workloads?  I don't think so.

I always use VMs for untrusted workloads.

>> the data cryptographically, which is what fsverity is for.
>> If the filesystem is mounted r/o and the image doesn't change, one
>> could guarantee that accessing the filesystem will at least return
>> deterministic results even for corrupted images.  That's something that
>> would need to be guaranteed by individual filesystem implementations,
>> though.
>
> I just want to say that the real problem with generic writable
> filesystems is that their on-disk design makes it difficult to
> prevent or detect harmful inconsistencies.
>
> First, the on-disk format includes redundant metadata and even
> potentially malicious journal metadata (as I mentioned in previous
> emails).  This makes it hard to determine whether the filesystem is
> inconsistent without performing a full disk scan, which takes a
> very long time.
>
> Of course, you could mount severely inconsistent writable
> filesystems in read-only (RO) mode.  However, they are still
> inconsistent by definition according to their formal on-disk
> specifications.  Furthermore, the runtime kernel implementation
> mixes read-write and read-only logic within the same codebase,
> which complicates the practical consequences.
>
> With immutable filesystem designs, almost all typical severe
> inconsistencies either cannot happen by design or cannot be regarded
> as harmful.  I believe the core issue is not trustworthiness; even
> with an untrusted workload, you should be able to audit it easily.
> However, severely inconsistent writable filesystems make such
> auditability much harder.

That actually makes a lot of sense.  I had not considered the journal,
which means one must modify the disk image just to mount it.

>> See the end of this email for a long note about what can and cannot
>> be guaranteed in the face of corrupt or malicious filesystem images.
>>
>>>> "that is not the case that we will handle with userspace FUSE
>>>> drivers, because the metadata is seriously broken"), the only way to
>>>> resolve such attack vectors is to run
>>>>
>>>> the full-scan fsck consistency check and then mount "rw"
>>>>
>>>> or
>>>>
>>>> use an immutable filesystem like EROFS (so that there will not
>>>> be such inconsistency issues by design) and isolate the entire write
>>>> traffic with a full copy-on-write mechanism, with OverlayFS for
>>>> example (IOWs, make all writes copy-on-write into another trusted
>>>> local filesystem).
>>>
>>> (Yeah, that's probably the only way to go for prepopulated images like
>>> root filesystems and container packages)
>>
>> Even an immutable filesystem can still be corrupt.
>>
>>>> I hope it's a valid case, and that can indeed happen if an arbitrary
>>>> generic filesystem can be mounted "rw".  And my immutable image
>>>> filesystem idea can help mitigate this too (just because the immutable
>>>> image won't be changed in any way, and all writes are always copy-up).
>>>
>>> That, we agree on :)
>>
>> Indeed, expecting writes to a corrupt filesystem to behave reasonably
>> is very foolish.
>>
>> Long note starts here: There is no *fundamental* reason that a crafted
>> filesystem image must be able to cause crashes, memory corruption, etc.
>
> I still think those kinds of security risks, being just implementation
> bugs, are the easiest part of the whole issue.
>
> Many Linux kernel bugs can cause crashes or memory corruption; why do
> crafted filesystems need to be specially considered?

In the past, filesystem implementations have often not focused on this.
The Linux Kernel CVE team does not issue CVEs for such bugs.

>> This applies even if the filesystem image may be written to while
>> mounted.  It is always *possible* to write a filesystem such that
>> it never trusts anything it reads from disk and assumes each read
>> could return arbitrarily malicious results.
>
> Linux namespaces were invented for exactly this kind of usage.  If
> broken archive images return garbage data, or the archive images can
> even be changed randomly at runtime, what are the real impacts as long
> as they are isolated by namespaces?

None!  Regardless of whether one considers namespaces sufficient for
isolating malicious code, they can definitely isolate filesystem
operations very well.

>> Right now, many filesystem maintainers do not consider this to be a
>> priority.  Even if they did, I don't think *anyone* (myself included)
>> could write a filesystem implementation in C that didn't have memory
>> corruption flaws.  The only exceptions are if the filesystem is
>
> I think this still falls under implementation bugs.  My question is
> simply: why are filesystems special in this area?  There are many
> other kernel subsystems written in C that receive untrusted data,
> like the TCP/IP stack; why are filesystems singled out for memory
> corruption flaws?

See above - the difference is that filesystems have historically not
been written with untrusted input in mind.  This, of course, can be
changed.

> I really think different aspects are often mixed up when this topic
> comes up, which makes the discussion more and more divergent.

I agree.

> If we talk about implementation bugs, I think filesystems are not
> special; but as I said, the main issue is the on-disk format design
> of writable filesystems, and due to that design, inconsistent
> filesystems can have many severe consequences.

It definitely makes things much harder, and dramatically increases the
attack surface.  Most uses I have (notably backups) have a hard
requirement for writable storage, and when they don't need it they can
use dm-verity.

>> incredibly simple or formal methods are used, and neither is the
>> case for existing filesystems in the Linux kernel.
>> By sandboxing a
>> filesystem, one ensures that an attacker who compromises a filesystem
>> implementation needs to find *another* exploit to compromise the
>> whole system.
>
> Yes, yet sandboxing is only one part; of course VM sandboxing is
> better than Linux namespace isolation, but VMs cost much more.

I use a lot of VMs, but they indeed use significant resources.  I hope
that at some point this can largely be solved with copy-on-write VM
forking.

> Other than sandboxing, I think auditability is important too,
> especially when users provide sensitive data to new workloads.
>
> Of course, dealing only with trusted workloads would be best, no
> question.  But in the real world, we cannot always assume completely
> trusted workloads.  For untrusted workloads, we need to find reliable
> ways to audit them until they become trusted.
>
> Just like in the real world: accumulate credit, undergo
> audits, and eventually earn trust.
>
> Sorry about my English, but I hope I have expressed my whole idea.
>
> Thanks,
> Gao Xiang

Don't worry about your English.  It is completely understandable and
more than capable of getting your (very informative) points across.

-- 
Sincerely,
Demi Marie Obenour (she/her/hers)