Message-ID: <2aa511f2-244d-0c83-af72-bf9d4cda9ec6@linux.alibaba.com>
Date: Tue, 7 Mar 2023 19:03:38 +0800
Subject: Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay
From: Gao Xiang
To: Christian Brauner
Cc: Alexander Larsson, lsf-pc@lists.linux-foundation.org,
 linux-fsdevel@vger.kernel.org, Amir Goldstein, Jingbo Xu,
 Giuseppe Scrivano, Dave Chinner, Vivek Goyal, Miklos Szeredi
References: <20230307101548.6gvtd62zah5l3doe@wittgenstein>
In-Reply-To: <20230307101548.6gvtd62zah5l3doe@wittgenstein>
X-Mailing-List: linux-fsdevel@vger.kernel.org

Hi Christian,

On 2023/3/7 18:15, Christian Brauner wrote:
> On Fri, Mar 03, 2023 at 11:13:51PM +0800, Gao Xiang wrote:
>> Hi Alexander,
>>
>> On 2023/3/3 21:57, Alexander Larsson wrote:
>>> On Mon, Feb 27, 2023 at 10:22 AM Alexander Larsson wrote:
>>>>
>>>> Hello,
>>>>
>>>> Recently Giuseppe Scrivano and I have worked on[1] and proposed[2]
>>>> the Composefs filesystem. It is an opportunistically sharing,
>>>> validating, image-based filesystem, targeting use cases like
>>>> validated ostree rootfs:es, validated container images that share
>>>> common files, as well as other image-based use cases.
>>>>
>>>> During the discussions in the composefs proposal (as seen on LWN[3])
>>>> it has been proposed that (with some changes to overlayfs) similar
>>>> behaviour can be achieved by combining the overlayfs
>>>> "overlay.redirect" xattr with a read-only filesystem such as erofs.
>>>>
>>>> There are pros and cons to both these approaches, and the discussion
>>>> about their respective value has sometimes been heated. We would like
>>>> to have an in-person discussion at the summit, ideally also involving
>>>> more of the filesystem development community, so that we can reach
>>>> some consensus on what is the best approach.
>>>
>>> In order to better understand the behaviour and requirements of the
>>> overlayfs+erofs approach I spent some time implementing direct support
>>> for erofs in libcomposefs.
>>> So, with current HEAD of
>>> github.com/containers/composefs you can now do:
>>>
>>> $ mkcompose --digest-store=objects --format=erofs source-dir image.erofs
>>
>> Thank you for taking the time to work on EROFS support. I don't have
>> time to play with it yet since I'd like to work out erofs-utils 1.6
>> these days and will then work on some new stuff such as block sizes
>> other than page size (!pagesize), as I said previously.
>>
>>>
>>> This will produce an object store with the backing files, and an erofs
>>> file with the required overlayfs xattrs, including a made-up one
>>> called "overlay.fs-verity" containing the expected fs-verity digest
>>> for the lower dir. It also adds the required whiteouts to cover the
>>> 00-ff dirs from the lower dir.
>>>
>>> These erofs files are ordered similarly to the composefs files, and we
>>> give similar guarantees about their reproducibility, etc. So, they
>>> should be apples-to-apples comparable with the composefs images.
>>>
>>> Given this, I ran another set of performance tests on the original cs9
>>> rootfs dataset, again measuring the time of `ls -lR`. I also tried to
>>> measure the memory use like this:
>>>
>>> # echo 3 > /proc/sys/vm/drop_caches
>>> # systemd-run --scope sh -c 'ls -lR mountpoint' > /dev/null; cat $(cat
>>> /proc/self/cgroup | sed -e "s|0::|/sys/fs/cgroup|")/memory.peak
>>>
>>> These are the alternatives I tried:
>>>
>>> xfs:       the source of the image, a regular dir on xfs
>>> erofs:     the image.erofs above, on loopback
>>> erofs dio: the image.erofs above, on loopback with --direct-io=on
>>> ovl:       erofs above combined with overlayfs
>>> ovl dio:   erofs dio above combined with overlayfs
>>> cfs:       composefs mount of image.cfs
>>>
>>> All tests use the same objects dir, stored on xfs. The erofs and
>>> overlay implementations are from a stock 6.1.13 kernel, and the
>>> composefs module is from github HEAD.
>>>
>>> I tried loopback both with and without the direct-io option, because
>>> without direct-io enabled the kernel will double-cache the loopbacked
>>> data, as per [1].
>>>
>>> The produced images are:
>>>  8.9M image.cfs
>>> 11.3M image.erofs
>>>
>>> And give these results:
>>>
>>>            | Cold cache | Warm cache | Mem use
>>>            |   (msec)   |   (msec)   |  (MB)
>>> -----------+------------+------------+---------
>>> xfs        |       1449 |        442 |      54
>>> erofs      |        700 |        391 |      45
>>> erofs dio  |        939 |        400 |      45
>>> ovl        |       1827 |        530 |     130
>>> ovl dio    |       2156 |        531 |     130
>>> cfs        |        689 |        389 |      51
>>>
>>> I also ran the same tests in a VM that had the latest kernel including
>>> the lazyfollow patches (ovl lazy in the table, not using direct-io),
>>> this one ext4 based:
>>>
>>>            | Cold cache | Warm cache | Mem use
>>>            |   (msec)   |   (msec)   |  (MB)
>>> -----------+------------+------------+---------
>>> ext4       |       1135 |        394 |      54
>>> erofs      |        715 |        401 |      46
>>> erofs dio  |        922 |        401 |      45
>>> ovl        |       1412 |        515 |     148
>>> ovl dio    |       1810 |        532 |     149
>>> ovl lazy   |       1063 |        523 |      87
>>> cfs        |        719 |        463 |      51
>>>
>>> Things noticeable in the results:
>>>
>>> * composefs and erofs (by itself) perform roughly similarly. This is
>>> not necessarily news, and results from Jingbo Xu match this.
>>>
>>> * Erofs on top of direct-io enabled loopback causes quite a drop in
>>> performance, which I don't really understand. Especially since it's
>>> reporting the same memory use as non-direct io. I guess the
>>> double caching in the latter case isn't properly attributed to the
>>> cgroup, so the difference is not measured. However, why would the
>>> double cache improve performance?
>>> Maybe I'm not completely
>>> understanding how these things interact.
>>
>> We've already analysed the root cause: composefs uses kernel_read()
>> to read its manifest, so irrelevant metadata (such as dir data) is
>> read in together. Such heuristic readahead is unusual for local fses
>> (obviously, almost all in-kernel filesystems don't use kernel_read()
>> to read their metadata; although some filesystems may read ahead some
>> related extent metadata when reading an inode, they at least do _not_
>> work like kernel_read().) But double caching will introduce almost
>> the same impact as kernel_read() (assuming you have read some of the
>> loop device source code.)
>>
>> I do hope you have already read Jingbo's latest test results, which
>> show how badly readahead performs if fs metadata is only partially
>> and randomly used (stat < 1500 files):
>> https://lore.kernel.org/r/83829005-3f12-afac-9d05-8ba721a80b4d@linux.alibaba.com
>>
>> Also, you could explicitly _disable_ readahead for the composefs
>> manifest file (because all EROFS metadata is read without readahead),
>> and let's see how it works then.
>>
>> Again, if your workload is just "ls -lR", my answer is "just async
>> readahead the whole manifest file / loop device" when mounting. That
>> will give you the best result. But I'm not sure that is the real use
>> case you propose.
>>
>>>
>>> * Stacking overlay on top of erofs causes about 100 msec slower
>>> warm-cache times compared to all non-overlay approaches, and much
>>> more in the cold cache case. The cold cache performance is helped
>>> significantly by the lazyfollow patches, but the warm cache overhead
>>> remains.
>>>
>>> * The use of overlayfs more than doubles memory use, probably
>>> because of all the extra inodes and dentries in action for the
>>> various layers. The lazyfollow patches help, but only partially.
>>>
>>> * Even though overlayfs+erofs is slower than cfs and raw erofs, it is
>>> not that much slower (~25%) than the pure xfs/ext4 directory, which
>>> is a pretty good baseline for comparisons. It is even faster when
>>> using lazyfollow on ext4.
>>>
>>> * The erofs images are slightly larger than the equivalent composefs
>>> images.
>>>
>>> In summary: the performance of composefs is somewhat better than the
>>> best erofs+ovl combination, although the overlay approach is not
>>> significantly worse than the baseline of a regular directory, except
>>> that it uses a bit more memory.
>>>
>>> On top of the above pure performance-based comparisons, I would like
>>> to re-state some of the other advantages of composefs compared to the
>>> overlay approach:
>>>
>>> * composefs is namespaceable, in the sense that you can use it (given
>>> mount capabilities) inside a namespace (such as a container) without
>>> access to non-namespaced resources like loopback or device-mapper
>>> devices. (There was work on fixing this with loopfs, but that seems
>>> to have stalled.)
>>>
>>> * While it is not in the current design, the simplicity of the format
>>> and the lack of loopback make it at least theoretically possible that
>>> composefs can be made usable in a rootless fashion at some point in
>>> the future.
>>
>> Have you considered sending some commands to /dev/cachefiles to
>> configure a daemonless dir and mounting the erofs image directly by
>> using "erofs over fscache", but in a daemonless way? That is ongoing
>> work on our side.
>>
>> IMHO, I don't find file-based interfaces particularly appealing.
>> Historically, I recall the practice has been to "avoid directly
>> reading files in the kernel", which is why I think almost all local
>> fses don't work on files directly and loopback devices are the way to
>> go for these use cases. If loopback devices are not okay for you, how
>> about improving loopback devices instead? That would benefit almost
>> all local fses.
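
(A side note on the loopback point above: the --direct-io=on knob used
in the measurements boils down to a single ioctl on the loop device, so
experiments there don't need any new interface. Below is a rough sketch
of roughly what "losetup --direct-io=on" does, assuming the image is
already attached to a hypothetical /dev/loop0 and with error handling
kept minimal:)

/* Toggle direct I/O on an already-configured loop device. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/loop.h>

int main(void)
{
	int fd = open("/dev/loop0", O_RDWR | O_CLOEXEC);

	if (fd < 0) {
		perror("open /dev/loop0");
		return 1;
	}

	/*
	 * 1 switches the loop device to direct I/O against the backing
	 * file, so the image data is not cached twice (once for the loop
	 * device, once for the backing filesystem); 0 goes back to
	 * buffered I/O. The backing fs must support O_DIRECT and the
	 * loop block size must be compatible, otherwise this fails.
	 */
	if (ioctl(fd, LOOP_SET_DIRECT_IO, 1UL) < 0)
		perror("LOOP_SET_DIRECT_IO");

	close(fd);
	return 0;
}

That is the knob behind the "erofs dio" rows in the tables above.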
>>>
>>> And of course, there are disadvantages to composefs too. Primarily
>>> being more code, increasing the maintenance burden and the risk of
>>> security problems. Composefs is particularly burdensome because it is
>>> a stacking filesystem, and these have historically been shown to be
>>> hard to get right.
>>>
>>> The question now is what is the best approach overall? For my own
>>> primary use case of making a verifying ostree root filesystem, the
>>> overlay approach (with the lazyfollow work finished) is, while not
>>> ideal, good enough.
>>
>> So your judgement is still based on "ls -lR", and your use case is
>> still just pure read-only without writable stuff?
>>
>> Anyway, I'm really happy to work with you on your ostree use cases
>> as always, as long as all the corner cases are worked out by the
>> community.
>>
>>>
>>> But I know for the people who are more interested in using composefs
>>> for containers the eventual goal of rootless support is very
>>> important. So, on behalf of them I guess the question is: Is there
>>> ever any chance that something like composefs could work rootlessly?
>>> Or conversely: Is there some way to get rootless support from the
>>> overlay approach? Opinions? Ideas?
>>
>> Honestly, I wanted to give a proper answer when Giuseppe asked me
>> the same question. My current view is simply that "that question is
>> almost the same for all in-kernel fses with some on-disk format".
>
> As far as I'm concerned, filesystems with an on-disk format will not be
> made mountable by unprivileged containers. And I don't think I'm alone
> in that view. The idea that ever more parts of the kernel with a massive
> attack surface such as a filesystem need to vouchsafe for the safety in
> the face of every rando having access to
> unshare --mount --user --map-root is a dead end and will just end up
> trapping us in a neverending cycle of security bugs (Because every
> single bug that's found after making that fs mountable from an
> unprivileged container will be treated as a security bug, no matter
> whether justified or not. So this is also a good way to ruin your
> filesystem's reputation.).
>
> And honestly, if we set the precedent that it's fine for one filesystem
> with an on-disk format to be able to be mounted by unprivileged
> containers, then other filesystems will eventually want to do this as
> well.
>
> At the rate we currently add filesystems, that's just a matter of time
> even if none of the existing ones would also want to do it. And then
> we're left arguing that this was just an exception for one super
> special, super safe, unexploitable filesystem with an on-disk format.

Yes, +1. That's somewhat why I didn't answer immediately: I'd like to
find a chance to get more people interested in EROFS, so I hoped it
could be (somewhat) pointed out by other filesystem folks at the time.

> Imho, none of this is appealing. I don't want to slowly keep building a
> future where we end up running fuzzers in unprivileged containers to
> generate random images to crash the kernel.

Even fuzzers don't guarantee this unless we completely freeze the fs
code; otherwise any useful improvement will in principle need much
deeper and longer fuzzing.
I'm not sure it could even keep up with release timing at all, or
guarantee being bug-free, honestly.

>
> I have more arguments why I don't think this is a path we will ever go
> down, but I don't want this to detract from the legitimate ask of
> making it possible to mount trusted images from within unprivileged
> containers. Because I think that's perfectly legitimate.
>
> However, I don't think that this is something the kernel needs to solve
> other than providing the necessary infrastructure so that this can be
> solved in userspace.

Yes, I think that's a good principle as long as we have a way to do
things in userspace effectively.

>
> Off-list, Amir had pointed to a blog I wrote last week (cf. [1]) where
> I explained how we currently mount into mount namespaces of
> unprivileged containers, which had been quite a difficult problem
> before the new mount api. But now it's become almost comically trivial.
> I mean, there's stuff that will still be good to have, but overall all
> the bits are already there.
>
> Imho, delegated mounting should be done by a system service that is
> responsible for all the steps that require privileges. So for most
> filesystems not mountable by unprivileged users this would amount to:
>
> fd_fs = fsopen("xfs")
> fsconfig(FSCONFIG_SET_STRING, "source", "/sm/sm")
> fsconfig(FSCONFIG_CMD_CREATE)
> fd_mnt = fsmount(fd_fs)
> // Only required for attributes that require privileges against the sb
> // of the filesystem such as idmapped mounts
> mount_setattr(fd_mnt, ...)
>
> and then the fd_mnt can be sent to the container, which can then attach
> it wherever it wants to. The system-level service doesn't even need to
> change namespaces via setns(fd_userns|fd_mntns) like I illustrated in
> the post I did. It's sufficient if we send it via, for example, an
> AF_UNIX socket that's exposed to the container.
>
> Of course, this system-level service would be integrated with mount(8)
> directly over a well-defined protocol. And this would be nestable as
> well, e.g. by bind-mounting the AF_UNIX socket.
>
> And we do already support a rudimentary form of such integration
> through systemd. For example via mount -t ddi (cf. [2]), which makes it
> possible to mount discoverable disk images (ddi). But that's just an
> illustration.
>
> This should be integrated with mount(8) and should be a simple protocol
> over varlink or another lightweight IPC mechanism that can be
> implemented by systemd-mountd (which is how I coined this for lack of
> imagination when I came up with this) or by some other component if
> platforms like k8s really want to do their own thing.
>
> This also allows us to extend this feature to the whole system, btw,
> and to all filesystems at once. Because it means that if systemd-mountd
> is told what images to trust (based on location, from a specific
> registry, signature, or whatever), then this isn't just useful for
> unprivileged containers but also for regular users on the host that
> want to mount stuff.
>
> This is what we're currently working on.
>
> (There's stuff that we can do to make this more powerful __if__ we need
> to. One example would probably be that we _could_ make it possible to
> mark a superblock as being owned by a specific namespace, with similar
> permission checks as what we currently do for idmapped mounts
> (privileged in the superblock of the fs, privileged over the ns to
> delegate to etc).
> IOW,
>
> fd_fs = fsopen("xfs")
> fsconfig(FSCONFIG_SET_STRING, "source", "/sm/sm")
> fsconfig(FSCONFIG_SET_FD, "owner", fd_container_userns)
>
> which completely sidesteps the issue of making that on-disk filesystem
> mountable by unpriv users.
>
> But let me say that this is completely unnecessary today as you can do:
>
> fd_fs = fsopen("xfs")
> fsconfig(FSCONFIG_SET_STRING, "source", "/sm/sm")
> fsconfig(FSCONFIG_CMD_CREATE)
> fd_mnt = fsmount(fd_fs)
> mount_setattr(fd_mnt, MOUNT_ATTR_IDMAP)
>
> which changes ownership across the whole filesystem. The only time you
> really want what I mention here is if you want to delegate control over
> __every single ioctl and potentially destructive operation associated
> with that filesystem__ to an unprivileged container, which is almost
> never what you want.)

Good to know this. I do hope it can be resolved by the userspace
approach as you said. So is there any barrier that prevents doing it
like this, such that we would still have to bother with FS_USERNS_MOUNT
for fses with an on-disk format?

Your delegated-control idea is a good thing, at least on my side, and we
hope some system-wide service can help with this, since our cloud might
need it in the future as well.

Thanks,
Gao Xiang

>
> [1]: https://brauner.io/2023/02/28/mounting-into-mount-namespaces.html
> [2]: https://github.com/systemd/systemd/pull/26695
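
P.S. To make sure I understand the delegated mount flow you describe,
here is a rough sketch of it as I read it: a privileged helper builds a
detached mount with the new mount API (Linux 5.2+ syscalls, called via
raw syscall() so no recent glibc is assumed), passes the mount fd over
an AF_UNIX socket with SCM_RIGHTS, and the container attaches it with
move_mount(). The "xfs" / "/sm/sm" arguments come from your example, the
socket plumbing is just illustrative, error handling is trimmed, and
this is of course not the actual systemd-mountd protocol.

#define _GNU_SOURCE
#include <fcntl.h>          /* AT_FDCWD */
#include <string.h>
#include <sys/socket.h>     /* sendmsg(), SCM_RIGHTS */
#include <sys/syscall.h>
#include <sys/uio.h>        /* struct iovec */
#include <unistd.h>
#include <linux/mount.h>    /* FSCONFIG_*, MOVE_MOUNT_F_EMPTY_PATH */

/* Thin wrappers around the new mount API syscalls. */
static inline int sys_fsopen(const char *fsname, unsigned int flags)
{
	return syscall(SYS_fsopen, fsname, flags);
}
static inline int sys_fsconfig(int fd, unsigned int cmd, const char *key,
			       const void *value, int aux)
{
	return syscall(SYS_fsconfig, fd, cmd, key, value, aux);
}
static inline int sys_fsmount(int fd, unsigned int flags,
			      unsigned int attr_flags)
{
	return syscall(SYS_fsmount, fd, flags, attr_flags);
}
static inline int sys_move_mount(int from_dfd, const char *from_path,
				 int to_dfd, const char *to_path,
				 unsigned int flags)
{
	return syscall(SYS_move_mount, from_dfd, from_path,
		       to_dfd, to_path, flags);
}

/*
 * Privileged service side: create a detached mount and pass the mount
 * fd to the unprivileged container over an already-connected AF_UNIX
 * socket.
 */
int send_detached_mount(int sock)
{
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	char byte = 'm';
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = u.buf, .msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg;
	int fd_fs, fd_mnt;

	fd_fs = sys_fsopen("xfs", 0);
	sys_fsconfig(fd_fs, FSCONFIG_SET_STRING, "source", "/sm/sm", 0);
	sys_fsconfig(fd_fs, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
	fd_mnt = sys_fsmount(fd_fs, 0, 0);
	/* mount_setattr() for idmapping would go here if needed. */

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd_mnt, sizeof(int));

	return sendmsg(sock, &msg, 0);
}

/*
 * Container side: after receiving fd_mnt via recvmsg()/SCM_RIGHTS
 * (omitted here), attach it wherever it wants.
 */
int attach_mount(int fd_mnt, const char *target)
{
	return sys_move_mount(fd_mnt, "", AT_FDCWD, target,
			      MOVE_MOUNT_F_EMPTY_PATH);
}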