From: Gao Xiang
Date: Sun, 22 Mar 2026 12:51:57 +0800
Subject: Re: [LSF/MM/BPF TOPIC] Where is fuse going? API cleanup, restructuring and more
To: Demi Marie Obenour, "Darrick J. Wong"
Cc: Miklos Szeredi, linux-fsdevel@vger.kernel.org, Joanne Koong,
 John Groves, Bernd Schubert, Amir Goldstein, Luis Henriques,
 Horst Birthelmer, Gao Xiang, lsf-pc@lists.linux-foundation.org
Message-ID: <390cd031-742b-4f1b-99c4-8ee41a259744@linux.alibaba.com>
In-Reply-To: <361d312b-9706-45ca-8943-b655a75c765b@gmail.com>
References: <20260204190649.GB7693@frogsfrogsfrogs>
 <20260206053835.GD7693@frogsfrogsfrogs>
 <20260221004752.GE11076@frogsfrogsfrogs>
 <7de8630d-b6f5-406e-809a-bc2a2d945afb@linux.alibaba.com>
 <20260318215140.GL1742010@frogsfrogsfrogs>
 <361d312b-9706-45ca-8943-b655a75c765b@gmail.com>
X-Mailing-List: linux-fsdevel@vger.kernel.org

On 2026/3/22 11:25, Demi Marie Obenour wrote:

...

>>
>> Technically speaking fuse4fs could just invoke e2fsck -fn before it
>> starts up the rest of the libfuse initialization but who knows if that's
>> an acceptable risk.  Also unclear if you actually want -fy for that.
>

Let me try to reply to the remaining part:

> To me, the attacks mentioned above are all either user error,
> or vulnerabilities in software accessing the filesystem.  If one

There are many consequences if users use potentially inconsistent
writable filesystems directly (without a full fsck). What I can think
of includes, but is not limited to:

  - data loss (consider the data block double-free issue);

  - data theft (for example, users keep sensitive information from
    their workload in a high-permission inode, but it can later be
    read through a malicious low-permission inode);

  - data tampering (by the same principle).

All of the vulnerabilities above arise once users write to the
inconsistent filesystem, which is hard to prevent through on-disk
design.

But if users write with copy-on-write to another consistent local
filesystem, none of the vulnerabilities above exist.

> doesn't trust a filesystem image, then any data from the filesystem
> can't be trusted either.  The only exception is if one can verify

I don't think trust is the core of this whole topic, because Linux
namespaces and cgroups were _invented_ for untrusted or isolated
workloads.

If you distrust some workload, fine, isolate it in another namespace:
you cannot strictly trust anything.

The kernel always has bugs, but is that the real main reason you
never run untrusted workloads?  I don't think so.

> the data cryptographically, which is what fsverity is for.
> If the filesystem is mounted r/o and the image doesn't change, one
> could guarantee that accessing the filesystem will at least return
> deterministic results even for corrupted images.  That's something that
> would need to be guaranteed by individual filesystem implementations,
> though.

I just want to say that the real problem with generic writable
filesystems is that their on-disk design makes it difficult to
prevent or detect harmful inconsistencies.

First, the on-disk format includes redundant metadata and even
journal metadata that may be malicious (as I mentioned in previous
emails). This makes it hard to determine whether the filesystem is
inconsistent without performing a full disk scan, which takes a long
time.

Of course, you could mount severely inconsistent writable filesystems
in read-only (RO) mode. However, they are still inconsistent by
definition according to their formal on-disk specifications.
Furthermore, the runtime kernel implementation mixes read-write and
read-only logic within the same codebase, which complicates the
practical consequences.

With immutable filesystem designs, almost all typical severe
inconsistencies either cannot happen by design or are not regarded
as harmful.

I believe the core issue is not trustworthiness; even with an
untrusted workload, you should be able to audit it easily. However,
severely inconsistent writable filesystems make such auditing much
harder.

>
> See the end of this email for a long note about what can and cannot
> be guaranteed in the face of corrupt or malicious filesystem images.
>
>>> "that is not the case that we will handle with userspace FUSE
>>> drivers, because the metadata is serious broken"), the only way to
>>> resolve such attack vectors is to run
>>>
>>> the full-scan fsck consistency check and then mount "rw"
>>>
>>> or
>>>
>>> using the immutable filesystem like EROFS (so that there will not
>>> be such inconsistency issues by design) and isolate the entire write
>>> traffic with a full copy-on-write mechanism with OverlayFS for
>>> example (IOWs, to make all write copy-on-write into another trusted
>>> local filesystem).
>>
>> (Yeah, that's probably the only way to go for prepopulated images like
>> root filesystems and container packages)
>
> Even an immutable filesystem can still be corrupt.
>
>>> I hope it's a valid case, and that can indeed happen if the arbitrary
>>> generic filesystem can be mounted in "rw".
>>> And my immutable image filesystem idea can help mitigate this too
>>> (just because the immutable image won't be changed in any way, and
>>> all writes are always copy-up)
>>
>> That, we agree on :)
>
> Indeed, expecting writes to a corrupt filesystem to behave reasonably
> is very foolish.
>
> Long note starts here: There is no *fundamental* reason that a crafted
> filesystem image must be able to cause crashes, memory corruption, etc.

I still think that this kind of security risk, which comes purely
from implementation bugs, is the easiest part of the whole issue.
Many Linux kernel bugs can cause crashes or memory corruption; why
do crafted filesystems need special consideration?

> This applies even if the filesystem image may be written to while
> mounted.  It is always *possible* to write a filesystem such that
> it never trusts anything it reads from disk and assumes each read
> could return arbitrarily malicious results.

Linux namespaces were invented for this kind of usage: if broken
archive images return garbage data, or archive images are even
changed randomly at runtime, what is the real impact when they are
isolated by namespaces?

>
> Right now, many filesystem maintainers do not consider this to be a
> priority.  Even if they did, I don't think *anyone* (myself included)
> could write a filesystem implementation in C that didn't have memory
> corruption flaws.  The only exceptions are if the filesystem is

I think this still falls under implementation bugs. My question is
simply: "why are filesystems special in this area, when there are
many other kernel subsystems written in C that receive untrusted
data, like the TCP/IP stack?" Why are filesystems special when it
comes to memory corruption flaws?

I really think different aspects are often mixed together when this
topic comes up, which makes the discussion more and more divergent.

If we talk about implementation bugs, I think filesystems are not
special. But as I said, I think the main issue is the on-disk format
design of writable filesystems: because of that design, inconsistent
filesystems have many severe consequences.

> incredibly simple or formal methods are used, and neither is the
> case for existing filesystems in the Linux kernel.  By sandboxing a
> filesystem, one ensures that an attacker who compromises a filesystem
> implementation needs to find *another* exploit to compromise the
> whole system.

Yes, but sandboxing is only one part. Of course VM sandboxing is
better than Linux namespace isolation, but VMs are costly.

Other than sandboxing, I think auditability is important too,
especially when users provide sensitive data to new workloads.

Of course, dealing only with trusted workloads is best, no question.
But in the real world, we cannot always have completely trusted
workloads. For untrusted workloads, we need to find reliable ways to
audit them until they become trusted.

Just like in the real world: accumulate credit, undergo audits, and
eventually earn trust.

Sorry about my English, but I hope I have expressed my whole idea.

Thanks,
Gao Xiang