Message-ID: <2aa511f2-244d-0c83-af72-bf9d4cda9ec6@linux.alibaba.com>
Date: Tue, 7 Mar 2023 19:03:38 +0800
Subject: Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay
From: Gao Xiang
To: Christian Brauner
Cc: Alexander Larsson, lsf-pc@lists.linux-foundation.org,
 linux-fsdevel@vger.kernel.org, Amir Goldstein, Jingbo Xu,
 Giuseppe Scrivano, Dave Chinner, Vivek Goyal, Miklos Szeredi
References: <20230307101548.6gvtd62zah5l3doe@wittgenstein>
In-Reply-To: <20230307101548.6gvtd62zah5l3doe@wittgenstein>
X-Mailing-List: linux-fsdevel@vger.kernel.org

Hi Christian,

On 2023/3/7 18:15, Christian Brauner wrote:
> On Fri, Mar 03, 2023 at 11:13:51PM +0800, Gao Xiang wrote:
>> Hi Alexander,
>>
>> On 2023/3/3 21:57, Alexander Larsson wrote:
>>> On Mon, Feb 27, 2023 at 10:22 AM Alexander Larsson wrote:
>>>>
>>>> Hello,
>>>>
>>>> Recently Giuseppe Scrivano and I have worked on[1] and proposed[2]
>>>> the Composefs filesystem. It is an opportunistically sharing,
>>>> validating, image-based filesystem, targeting use cases like
>>>> validated ostree rootfs:es, validated container images that share
>>>> common files, as well as other image-based use cases.
>>>>
>>>> During the discussions in the composefs proposal (as seen on LWN[3])
>>>> it has been proposed that (with some changes to overlayfs) similar
>>>> behaviour can be achieved by combining the overlayfs
>>>> "overlay.redirect" xattr with a read-only filesystem such as erofs.
>>>>
>>>> There are pros and cons to both these approaches, and the discussion
>>>> about their respective value has sometimes been heated. We would like
>>>> to have an in-person discussion at the summit, ideally also involving
>>>> more of the filesystem development community, so that we can reach
>>>> some consensus on what is the best approach.
>>>
>>> In order to better understand the behaviour and requirements of the
>>> overlayfs+erofs approach I spent some time implementing direct support
>>> for erofs in libcomposefs.
>>> So, with current HEAD of
>>> github.com/containers/composefs you can now do:
>>>
>>> $ mkcompose --digest-store=objects --format=erofs source-dir image.erofs
>>
>> Thank you for taking the time to work on EROFS support. I don't have
>> time to play with it yet since I'd like to work out erofs-utils 1.6
>> these days and will then work on some new stuff such as block sizes
>> other than page size (!pagesize), as I said previously.
>>
>>>
>>> This will produce an object store with the backing files, and an erofs
>>> file with the required overlayfs xattrs, including a made-up one
>>> called "overlay.fs-verity" containing the expected fs-verity digest
>>> for the lower dir. It also adds the required whiteouts to cover the
>>> 00-ff dirs from the lower dir.
>>>
>>> These erofs files are ordered similarly to the composefs files, and we
>>> give similar guarantees about their reproducibility, etc. So, they
>>> should be apples-to-apples comparable with the composefs images.
>>>
>>> Given this, I ran another set of performance tests on the original cs9
>>> rootfs dataset, again measuring the time of `ls -lR`. I also tried to
>>> measure the memory use like this:
>>>
>>> # echo 3 > /proc/sys/vm/drop_caches
>>> # systemd-run --scope sh -c 'ls -lR mountpoint' > /dev/null; cat $(cat
>>> /proc/self/cgroup | sed -e "s|0::|/sys/fs/cgroup|")/memory.peak
>>>
>>> These are the alternatives I tried:
>>>
>>> xfs:       the source of the image, a regular dir on xfs
>>> erofs:     the image.erofs above, on loopback
>>> erofs dio: the image.erofs above, on loopback with --direct-io=on
>>> ovl:       erofs above combined with overlayfs
>>> ovl dio:   erofs dio above combined with overlayfs
>>> cfs:       composefs mount of image.cfs
>>>
>>> All tests use the same objects dir, stored on xfs. The erofs and
>>> overlay implementations are from a stock 6.1.13 kernel, and the
>>> composefs module is from github HEAD.
>>>
>>> I tried loopback both with and without the direct-io option, because
>>> without direct-io enabled the kernel will double-cache the loopbacked
>>> data, as per [1].
>>>
>>> The produced images are:
>>>  8.9M image.cfs
>>> 11.3M image.erofs
>>>
>>> And give these results:
>>>
>>>            | Cold cache | Warm cache | Mem use
>>>            |   (msec)   |   (msec)   |  (MB)
>>> -----------+------------+------------+---------
>>> xfs        |       1449 |        442 |      54
>>> erofs      |        700 |        391 |      45
>>> erofs dio  |        939 |        400 |      45
>>> ovl        |       1827 |        530 |     130
>>> ovl dio    |       2156 |        531 |     130
>>> cfs        |        689 |        389 |      51
>>>
>>> I also ran the same tests in a VM that had the latest kernel including
>>> the lazyfollow patches (ovl lazy in the table, not using direct-io),
>>> this one ext4 based:
>>>
>>>            | Cold cache | Warm cache | Mem use
>>>            |   (msec)   |   (msec)   |  (MB)
>>> -----------+------------+------------+---------
>>> ext4       |       1135 |        394 |      54
>>> erofs      |        715 |        401 |      46
>>> erofs dio  |        922 |        401 |      45
>>> ovl        |       1412 |        515 |     148
>>> ovl dio    |       1810 |        532 |     149
>>> ovl lazy   |       1063 |        523 |      87
>>> cfs        |        719 |        463 |      51
>>>
>>> Things noticeable in the results:
>>>
>>> * composefs and erofs (by itself) perform roughly similarly. This is
>>> not necessarily news, and results from Jingbo Xu match this.
>>>
>>> * Erofs on top of direct-io enabled loopback causes quite a drop in
>>> performance, which I don't really understand. Especially since it's
>>> reporting the same memory use as non-direct io. I guess the
>>> double caching in the latter case isn't properly attributed to the
>>> cgroup, so the difference is not measured. However, why would the
>>> double cache improve performance?
>>> Maybe I'm not completely
>>> understanding how these things interact.
>>
>> We've already analysed the root cause: composefs uses kernel_read()
>> to read its manifest, so irrelevant metadata (such as dir data) is
>> read in together. Such heuristic readahead is unusual for local fses
>> (obviously, almost all in-kernel filesystems don't use kernel_read()
>> to read their metadata; although some filesystems may read ahead some
>> related extent metadata when reading an inode, they at least do _not_
>> work like kernel_read().) But double caching will introduce almost
>> the same impact as kernel_read() (assuming you have read some of the
>> loop device source code.)
>>
>> I do hope you have already read Jingbo's latest test results, which
>> show how badly readahead performs if fs metadata is only partially
>> and randomly used (stat < 1500 files):
>> https://lore.kernel.org/r/83829005-3f12-afac-9d05-8ba721a80b4d@linux.alibaba.com
>>
>> Also, you could explicitly _disable_ readahead for the composefs
>> manifest file (because all EROFS metadata is read without readahead),
>> and let's see how it works then.
>>
>> Again, if your workload is just "ls -lR", my answer is "just async
>> readahead the whole manifest file / loop device" when mounting. That
>> will give you the best result. But I'm not sure that is the real use
>> case you propose.
>>
>>>
>>> * Stacking overlay on top of erofs causes about 100 msec slower
>>> warm-cache times compared to all non-overlay approaches, and much
>>> more in the cold cache case. The cold cache performance is helped
>>> significantly by the lazyfollow patches, but the warm cache overhead
>>> remains.
>>>
>>> * The use of overlayfs more than doubles memory use, probably
>>> because of all the extra inodes and dentries in action for the
>>> various layers. The lazyfollow patches help, but only partially.
>>>
>>> * Even though overlayfs+erofs is slower than cfs and raw erofs, it is
>>> not that much slower (~25%) than the pure xfs/ext4 directory, which
>>> is a pretty good baseline for comparisons. It is even faster when
>>> using lazyfollow on ext4.
>>>
>>> * The erofs images are slightly larger than the equivalent composefs
>>> images.
>>>
>>> In summary: the performance of composefs is somewhat better than the
>>> best erofs+ovl combination, although the overlay approach is not
>>> significantly worse than the baseline of a regular directory, except
>>> that it uses a bit more memory.
>>>
>>> On top of the above pure performance-based comparisons, I would like
>>> to re-state some of the other advantages of composefs compared to the
>>> overlay approach:
>>>
>>> * composefs is namespaceable, in the sense that you can use it (given
>>> mount capabilities) inside a namespace (such as a container) without
>>> access to non-namespaced resources like loopback or device-mapper
>>> devices. (There was work on fixing this with loopfs, but that seems
>>> to have stalled.)
>>>
>>> * While it is not in the current design, the simplicity of the format
>>> and the lack of loopback make it at least theoretically possible that
>>> composefs can be made usable in a rootless fashion at some point in
>>> the future.
>>
>> Have you considered sending some commands to /dev/cachefiles to
>> configure a daemonless dir and mounting the erofs image directly by
>> using "erofs over fscache", but in a daemonless way? That is ongoing
>> work on our side.
>>
>> IMHO, I don't find file-based interfaces particularly appealing.
>> Historically, I recall the practice has been to "avoid directly
>> reading files in the kernel", which is why I think almost all local
>> fses don't work on files directly and loopback devices are the way to
>> go for these use cases. If loopback devices are not okay for you, how
>> about improving loopback devices instead? That would benefit almost
>> all local fses.
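
(A side note on the loopback point above: the --direct-io=on knob used
in the measurements boils down to a single ioctl on the loop device, so
experiments there don't need any new interface. Below is a rough sketch
of roughly what "losetup --direct-io=on" does, assuming the image is
already attached to a hypothetical /dev/loop0 and with error handling
kept minimal:)

/* Toggle direct I/O on an already-configured loop device. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/loop.h>

int main(void)
{
	int fd = open("/dev/loop0", O_RDWR | O_CLOEXEC);

	if (fd < 0) {
		perror("open /dev/loop0");
		return 1;
	}

	/*
	 * 1 switches the loop device to direct I/O against the backing
	 * file, so the image data is not cached twice (once for the loop
	 * device, once for the backing filesystem); 0 goes back to
	 * buffered I/O. The backing fs must support O_DIRECT and the
	 * loop block size must be compatible, otherwise this fails.
	 */
	if (ioctl(fd, LOOP_SET_DIRECT_IO, 1UL) < 0)
		perror("LOOP_SET_DIRECT_IO");

	close(fd);
	return 0;
}

That is the knob behind the "erofs dio" rows in the tables above.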
>>>
>>> And of course, there are disadvantages to composefs too. Primarily
>>> being more code, increasing the maintenance burden and the risk of
>>> security problems. Composefs is particularly burdensome because it is
>>> a stacking filesystem, and these have historically been shown to be
>>> hard to get right.
>>>
>>> The question now is what is the best approach overall? For my own
>>> primary use case of making a verifying ostree root filesystem, the
>>> overlay approach (with the lazyfollow work finished) is, while not
>>> ideal, good enough.
>>
>> So your judgement is still based on "ls -lR", and your use case is
>> still just pure read-only without writable stuff?
>>
>> Anyway, I'm really happy to work with you on your ostree use cases
>> as always, as long as all the corner cases are worked out by the
>> community.
>>
>>>
>>> But I know for the people who are more interested in using composefs
>>> for containers the eventual goal of rootless support is very
>>> important. So, on behalf of them I guess the question is: Is there
>>> ever any chance that something like composefs could work rootlessly?
>>> Or conversely: Is there some way to get rootless support from the
>>> overlay approach? Opinions? Ideas?
>>
>> Honestly, I wanted to give a proper answer when Giuseppe asked me
>> the same question. My current view is simply that "that question is
>> almost the same for all in-kernel fses with some on-disk format".
>
> As far as I'm concerned, filesystems with an on-disk format will not be
> made mountable by unprivileged containers. And I don't think I'm alone
> in that view. The idea that ever more parts of the kernel with a massive
> attack surface such as a filesystem need to vouchsafe for the safety in
> the face of every rando having access to
> unshare --mount --user --map-root is a dead end and will just end up
> trapping us in a neverending cycle of security bugs (Because every
> single bug that's found after making that fs mountable from an
> unprivileged container will be treated as a security bug, no matter
> whether justified or not. So this is also a good way to ruin your
> filesystem's reputation.).
>
> And honestly, if we set the precedent that it's fine for one filesystem
> with an on-disk format to be able to be mounted by unprivileged
> containers, then other filesystems will eventually want to do this as
> well.
>
> At the rate we currently add filesystems, that's just a matter of time
> even if none of the existing ones would also want to do it. And then
> we're left arguing that this was just an exception for one super
> special, super safe, unexploitable filesystem with an on-disk format.

Yes, +1. That's somewhat why I didn't answer immediately: I'd like to
find a chance to get more people interested in EROFS, so I hoped it
could be (somewhat) pointed out by other filesystem folks at the time.

> Imho, none of this is appealing. I don't want to slowly keep building a
> future where we end up running fuzzers in unprivileged containers to
> generate random images to crash the kernel.

Even fuzzers don't guarantee this unless we completely freeze the fs
code; otherwise any useful improvement will in principle need much
deeper and longer fuzzing.
I'm not sure it could even keep up with release timing at all, or
guarantee being bug-free, honestly.

>
> I have more arguments why I don't think this is a path we will ever go
> down, but I don't want this to detract from the legitimate ask of
> making it possible to mount trusted images from within unprivileged
> containers. Because I think that's perfectly legitimate.
>
> However, I don't think that this is something the kernel needs to solve
> other than providing the necessary infrastructure so that this can be
> solved in userspace.

Yes, I think that's a good principle as long as we have a way to do
things in userspace effectively.

>
> Off-list, Amir had pointed to a blog I wrote last week (cf. [1]) where
> I explained how we currently mount into mount namespaces of
> unprivileged containers, which had been quite a difficult problem
> before the new mount api. But now it's become almost comically trivial.
> I mean, there's stuff that will still be good to have, but overall all
> the bits are already there.
>
> Imho, delegated mounting should be done by a system service that is
> responsible for all the steps that require privileges. So for most
> filesystems not mountable by unprivileged users this would amount to:
>
> fd_fs = fsopen("xfs")
> fsconfig(FSCONFIG_SET_STRING, "source", "/sm/sm")
> fsconfig(FSCONFIG_CMD_CREATE)
> fd_mnt = fsmount(fd_fs)
> // Only required for attributes that require privileges against the sb
> // of the filesystem such as idmapped mounts
> mount_setattr(fd_mnt, ...)
>
> and then the fd_mnt can be sent to the container, which can then attach
> it wherever it wants to. The system-level service doesn't even need to
> change namespaces via setns(fd_userns|fd_mntns) like I illustrated in
> the post I did. It's sufficient if we send it via, for example, an
> AF_UNIX socket that's exposed to the container.
>
> Of course, this system-level service would be integrated with mount(8)
> directly over a well-defined protocol. And this would be nestable as
> well, e.g. by bind-mounting the AF_UNIX socket.
>
> And we do already support a rudimentary form of such integration
> through systemd. For example via mount -t ddi (cf. [2]), which makes it
> possible to mount discoverable disk images (ddi). But that's just an
> illustration.
>
> This should be integrated with mount(8) and should be a simple protocol
> over varlink or another lightweight IPC mechanism that can be
> implemented by systemd-mountd (which is how I coined this for lack of
> imagination when I came up with this) or by some other component if
> platforms like k8s really want to do their own thing.
>
> This also allows us to extend this feature to the whole system, btw,
> and to all filesystems at once. Because it means that if systemd-mountd
> is told what images to trust (based on location, from a specific
> registry, signature, or whatever), then this isn't just useful for
> unprivileged containers but also for regular users on the host that
> want to mount stuff.
>
> This is what we're currently working on.
>
> (There's stuff that we can do to make this more powerful __if__ we need
> to. One example would probably be that we _could_ make it possible to
> mark a superblock as being owned by a specific namespace, with similar
> permission checks as what we currently do for idmapped mounts
> (privileged in the superblock of the fs, privileged over the ns to
> delegate to etc).
> IOW,
>
> fd_fs = fsopen("xfs")
> fsconfig(FSCONFIG_SET_STRING, "source", "/sm/sm")
> fsconfig(FSCONFIG_SET_FD, "owner", fd_container_userns)
>
> which completely sidesteps the issue of making that on-disk filesystem
> mountable by unpriv users.
>
> But let me say that this is completely unnecessary today as you can do:
>
> fd_fs = fsopen("xfs")
> fsconfig(FSCONFIG_SET_STRING, "source", "/sm/sm")
> fsconfig(FSCONFIG_CMD_CREATE)
> fd_mnt = fsmount(fd_fs)
> mount_setattr(fd_mnt, MOUNT_ATTR_IDMAP)
>
> which changes ownership across the whole filesystem. The only time you
> really want what I mention here is if you want to delegate control over
> __every single ioctl and potentially destructive operation associated
> with that filesystem__ to an unprivileged container, which is almost
> never what you want.)

Good to know this. I do hope it can be resolved by the userspace
approach as you said. So is there any barrier that prevents doing it
like this, such that we would still have to bother with FS_USERNS_MOUNT
for fses with an on-disk format?

Your delegated-control idea is a good thing, at least on my side, and we
hope some system-wide service can help with this, since our cloud might
need it in the future as well.

Thanks,
Gao Xiang

>
> [1]: https://brauner.io/2023/02/28/mounting-into-mount-namespaces.html
> [2]: https://github.com/systemd/systemd/pull/26695
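
P.S. To make sure I understand the delegated mount flow you describe,
here is a rough sketch of it as I read it: a privileged helper builds a
detached mount with the new mount API (Linux 5.2+ syscalls, called via
raw syscall() so no recent glibc is assumed), passes the mount fd over
an AF_UNIX socket with SCM_RIGHTS, and the container attaches it with
move_mount(). The "xfs" / "/sm/sm" arguments come from your example, the
socket plumbing is just illustrative, error handling is trimmed, and
this is of course not the actual systemd-mountd protocol.

#define _GNU_SOURCE
#include <fcntl.h>          /* AT_FDCWD */
#include <string.h>
#include <sys/socket.h>     /* sendmsg(), SCM_RIGHTS */
#include <sys/syscall.h>
#include <sys/uio.h>        /* struct iovec */
#include <unistd.h>
#include <linux/mount.h>    /* FSCONFIG_*, MOVE_MOUNT_F_EMPTY_PATH */

/* Thin wrappers around the new mount API syscalls. */
static inline int sys_fsopen(const char *fsname, unsigned int flags)
{
	return syscall(SYS_fsopen, fsname, flags);
}
static inline int sys_fsconfig(int fd, unsigned int cmd, const char *key,
			       const void *value, int aux)
{
	return syscall(SYS_fsconfig, fd, cmd, key, value, aux);
}
static inline int sys_fsmount(int fd, unsigned int flags,
			      unsigned int attr_flags)
{
	return syscall(SYS_fsmount, fd, flags, attr_flags);
}
static inline int sys_move_mount(int from_dfd, const char *from_path,
				 int to_dfd, const char *to_path,
				 unsigned int flags)
{
	return syscall(SYS_move_mount, from_dfd, from_path,
		       to_dfd, to_path, flags);
}

/*
 * Privileged service side: create a detached mount and pass the mount
 * fd to the unprivileged container over an already-connected AF_UNIX
 * socket.
 */
int send_detached_mount(int sock)
{
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	char byte = 'm';
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = u.buf, .msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg;
	int fd_fs, fd_mnt;

	fd_fs = sys_fsopen("xfs", 0);
	sys_fsconfig(fd_fs, FSCONFIG_SET_STRING, "source", "/sm/sm", 0);
	sys_fsconfig(fd_fs, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
	fd_mnt = sys_fsmount(fd_fs, 0, 0);
	/* mount_setattr() for idmapping would go here if needed. */

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd_mnt, sizeof(int));

	return sendmsg(sock, &msg, 0);
}

/*
 * Container side: after receiving fd_mnt via recvmsg()/SCM_RIGHTS
 * (omitted here), attach it wherever it wants.
 */
int attach_mount(int fd_mnt, const char *target)
{
	return sys_move_mount(fd_mnt, "", AT_FDCWD, target,
			      MOVE_MOUNT_F_EMPTY_PATH);
}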