* [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay
@ 2023-02-27 9:22 Alexander Larsson
2023-02-27 10:45 ` Gao Xiang
` (2 more replies)
0 siblings, 3 replies; 42+ messages in thread
From: Alexander Larsson @ 2023-02-27 9:22 UTC (permalink / raw)
To: lsf-pc; +Cc: linux-fsdevel
Hello,
Recently Giuseppe Scrivano and I have worked on[1] and proposed[2] the
Composefs filesystem. It is an opportunistically sharing, validating,
image-based filesystem, targeting use cases like validated ostree root
filesystems and validated container images that share common files, as
well as other image-based use cases.
During the discussions of the composefs proposal (as covered on LWN[3]),
it has been proposed that, with some changes to overlayfs, similar
behaviour can be achieved by combining the overlayfs
"overlay.redirect" xattr with a read-only filesystem such as erofs.
There are pros and cons to both of these approaches, and the discussion
about their respective value has sometimes been heated. We would like
to have an in-person discussion at the summit, ideally also involving
more of the filesystem development community, so that we can reach
some consensus on what the best approach is.
Good participants would be at least: Alexander Larsson, Giuseppe
Scrivano, Amir Goldstein, David Chinner, Gao Xiang, Miklos Szeredi,
Jingbo Xu.
[1] https://github.com/containers/composefs
[2] https://lore.kernel.org/lkml/cover.1674227308.git.alexl@redhat.com/
[3] https://lwn.net/SubscriberLink/922851/45ed93154f336f73/
--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
 Alexander Larsson                                       Red Hat, Inc
 alexl@redhat.com                                        alexander.larsson@gmail.com
He's a lounge-singing crooked cowboy on his last day in the job. She's a
psychotic nymphomaniac single mother prone to fits of savage, blood-crazed
rage. They fight crime!
^ permalink raw reply [flat|nested] 42+ messages in thread* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay 2023-02-27 9:22 [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay Alexander Larsson @ 2023-02-27 10:45 ` Gao Xiang 2023-02-27 10:58 ` Christian Brauner 2023-03-01 3:47 ` Jingbo Xu 2023-02-27 11:37 ` Jingbo Xu 2023-03-03 13:57 ` Alexander Larsson 2 siblings, 2 replies; 42+ messages in thread From: Gao Xiang @ 2023-02-27 10:45 UTC (permalink / raw) To: Alexander Larsson, lsf-pc; +Cc: linux-fsdevel, Christian Brauner, Jingbo Xu (+cc Jingbo Xu and Christian Brauner) On 2023/2/27 17:22, Alexander Larsson wrote: > Hello, > > Recently Giuseppe Scrivano and I have worked on[1] and proposed[2] the > Composefs filesystem. It is an opportunistically sharing, validating > image-based filesystem, targeting usecases like validated ostree > rootfs:es, validated container images that share common files, as well > as other image based usecases. > > During the discussions in the composefs proposal (as seen on LWN[3]) > is has been proposed that (with some changes to overlayfs), similar > behaviour can be achieved by combining the overlayfs > "overlay.redirect" xattr with an read-only filesystem such as erofs. > > There are pros and cons to both these approaches, and the discussion > about their respective value has sometimes been heated. We would like > to have an in-person discussion at the summit, ideally also involving > more of the filesystem development community, so that we can reach > some consensus on what is the best apporach. > > Good participants would be at least: Alexander Larsson, Giuseppe > Scrivano, Amir Goldstein, David Chinner, Gao Xiang, Miklos Szeredi, > Jingbo Xu I'd be happy to discuss this at LSF/MM/BPF this year. Also we've addressed the root cause of the performance gap is that composefs read some data symlink-like payload data by using cfs_read_vdata_path() which involves kernel_read() and trigger heuristic readahead of dir data (which is also landed in composefs vdata area together with payload), so that most composefs dir I/O is already done in advance by heuristic readahead. And we think almost all exist in-kernel local fses doesn't have such heuristic readahead and if we add the similar stuff, EROFS could do better than composefs. Also we've tried random stat()s about 500~1000 files in the tree you shared (rather than just "ls -lR") and EROFS did almost the same or better than composefs. I guess further analysis (including blktrace) could be shown by Jingbo later. Not sure if Christian Brauner would like to discuss this new stacked fs with on-disk metadata as well (especially about userns stuff since it's somewhat a plan in the composefs roadmap as well.) Thanks, Gao Xiang > > [1] https://github.com/containers/composefs > [2] https://lore.kernel.org/lkml/cover.1674227308.git.alexl@redhat.com/ > [3] https://lwn.net/SubscriberLink/922851/45ed93154f336f73/ > ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay 2023-02-27 10:45 ` Gao Xiang @ 2023-02-27 10:58 ` Christian Brauner 2023-04-27 16:11 ` [Lsf-pc] " Amir Goldstein 2023-03-01 3:47 ` Jingbo Xu 1 sibling, 1 reply; 42+ messages in thread From: Christian Brauner @ 2023-02-27 10:58 UTC (permalink / raw) To: Gao Xiang; +Cc: Alexander Larsson, lsf-pc, linux-fsdevel, Jingbo Xu On Mon, Feb 27, 2023 at 06:45:50PM +0800, Gao Xiang wrote: > > (+cc Jingbo Xu and Christian Brauner) > > On 2023/2/27 17:22, Alexander Larsson wrote: > > Hello, > > > > Recently Giuseppe Scrivano and I have worked on[1] and proposed[2] the > > Composefs filesystem. It is an opportunistically sharing, validating > > image-based filesystem, targeting usecases like validated ostree > > rootfs:es, validated container images that share common files, as well > > as other image based usecases. > > > > During the discussions in the composefs proposal (as seen on LWN[3]) > > is has been proposed that (with some changes to overlayfs), similar > > behaviour can be achieved by combining the overlayfs > > "overlay.redirect" xattr with an read-only filesystem such as erofs. > > > > There are pros and cons to both these approaches, and the discussion > > about their respective value has sometimes been heated. We would like > > to have an in-person discussion at the summit, ideally also involving > > more of the filesystem development community, so that we can reach > > some consensus on what is the best apporach. > > > > Good participants would be at least: Alexander Larsson, Giuseppe > > Scrivano, Amir Goldstein, David Chinner, Gao Xiang, Miklos Szeredi, > > Jingbo Xu > I'd be happy to discuss this at LSF/MM/BPF this year. Also we've addressed > the root cause of the performance gap is that > > composefs read some data symlink-like payload data by using > cfs_read_vdata_path() which involves kernel_read() and trigger heuristic > readahead of dir data (which is also landed in composefs vdata area > together with payload), so that most composefs dir I/O is already done > in advance by heuristic readahead. And we think almost all exist > in-kernel local fses doesn't have such heuristic readahead and if we add > the similar stuff, EROFS could do better than composefs. > > Also we've tried random stat()s about 500~1000 files in the tree you shared > (rather than just "ls -lR") and EROFS did almost the same or better than > composefs. I guess further analysis (including blktrace) could be shown by > Jingbo later. > > Not sure if Christian Brauner would like to discuss this new stacked fs I'll be at lsfmm in any case and already got my invite a while ago. I intend to give some updates about a few vfs things and I can talk about this as well. Thanks, Gao! ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay 2023-02-27 10:58 ` Christian Brauner @ 2023-04-27 16:11 ` Amir Goldstein 0 siblings, 0 replies; 42+ messages in thread From: Amir Goldstein @ 2023-04-27 16:11 UTC (permalink / raw) To: Christian Brauner Cc: Gao Xiang, linux-fsdevel, Jingbo Xu, lsf-pc, Alexander Larsson On Mon, Feb 27, 2023 at 12:59 PM Christian Brauner <brauner@kernel.org> wrote: > > On Mon, Feb 27, 2023 at 06:45:50PM +0800, Gao Xiang wrote: > > > > (+cc Jingbo Xu and Christian Brauner) > > > > On 2023/2/27 17:22, Alexander Larsson wrote: > > > Hello, > > > > > > Recently Giuseppe Scrivano and I have worked on[1] and proposed[2] the > > > Composefs filesystem. It is an opportunistically sharing, validating > > > image-based filesystem, targeting usecases like validated ostree > > > rootfs:es, validated container images that share common files, as well > > > as other image based usecases. > > > > > > During the discussions in the composefs proposal (as seen on LWN[3]) > > > is has been proposed that (with some changes to overlayfs), similar > > > behaviour can be achieved by combining the overlayfs > > > "overlay.redirect" xattr with an read-only filesystem such as erofs. > > > > > > There are pros and cons to both these approaches, and the discussion > > > about their respective value has sometimes been heated. We would like > > > to have an in-person discussion at the summit, ideally also involving > > > more of the filesystem development community, so that we can reach > > > some consensus on what is the best apporach. > > > > > > Good participants would be at least: Alexander Larsson, Giuseppe > > > Scrivano, Amir Goldstein, David Chinner, Gao Xiang, Miklos Szeredi, > > > Jingbo Xu > > I'd be happy to discuss this at LSF/MM/BPF this year. Also we've addressed > > the root cause of the performance gap is that > > > > composefs read some data symlink-like payload data by using > > cfs_read_vdata_path() which involves kernel_read() and trigger heuristic > > readahead of dir data (which is also landed in composefs vdata area > > together with payload), so that most composefs dir I/O is already done > > in advance by heuristic readahead. And we think almost all exist > > in-kernel local fses doesn't have such heuristic readahead and if we add > > the similar stuff, EROFS could do better than composefs. > > > > Also we've tried random stat()s about 500~1000 files in the tree you shared > > (rather than just "ls -lR") and EROFS did almost the same or better than > > composefs. I guess further analysis (including blktrace) could be shown by > > Jingbo later. > > > > Not sure if Christian Brauner would like to discuss this new stacked fs > > I'll be at lsfmm in any case and already got my invite a while ago. I > intend to give some updates about a few vfs things and I can talk about > this as well. > FYI, I schedule a ~30min session lead by Alexander on remaining composefs topics another ~30min session lead by Gao on EROFS topics and another session for Christian dedicated to mounting images inside userns. Thanks, Amir. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay 2023-02-27 10:45 ` Gao Xiang 2023-02-27 10:58 ` Christian Brauner @ 2023-03-01 3:47 ` Jingbo Xu 2023-03-03 14:41 ` Alexander Larsson 1 sibling, 1 reply; 42+ messages in thread From: Jingbo Xu @ 2023-03-01 3:47 UTC (permalink / raw) To: Gao Xiang, Alexander Larsson, Christian Brauner, Amir Goldstein Cc: linux-fsdevel, lsf-pc Hi all, On 2/27/23 6:45 PM, Gao Xiang wrote: > > (+cc Jingbo Xu and Christian Brauner) > > On 2023/2/27 17:22, Alexander Larsson wrote: >> Hello, >> >> Recently Giuseppe Scrivano and I have worked on[1] and proposed[2] the >> Composefs filesystem. It is an opportunistically sharing, validating >> image-based filesystem, targeting usecases like validated ostree >> rootfs:es, validated container images that share common files, as well >> as other image based usecases. >> >> During the discussions in the composefs proposal (as seen on LWN[3]) >> is has been proposed that (with some changes to overlayfs), similar >> behaviour can be achieved by combining the overlayfs >> "overlay.redirect" xattr with an read-only filesystem such as erofs. >> >> There are pros and cons to both these approaches, and the discussion >> about their respective value has sometimes been heated. We would like >> to have an in-person discussion at the summit, ideally also involving >> more of the filesystem development community, so that we can reach >> some consensus on what is the best apporach. >> >> Good participants would be at least: Alexander Larsson, Giuseppe >> Scrivano, Amir Goldstein, David Chinner, Gao Xiang, Miklos Szeredi, >> Jingbo Xu > I'd be happy to discuss this at LSF/MM/BPF this year. Also we've addressed > the root cause of the performance gap is that > > composefs read some data symlink-like payload data by using > cfs_read_vdata_path() which involves kernel_read() and trigger heuristic > readahead of dir data (which is also landed in composefs vdata area > together with payload), so that most composefs dir I/O is already done > in advance by heuristic readahead. And we think almost all exist > in-kernel local fses doesn't have such heuristic readahead and if we add > the similar stuff, EROFS could do better than composefs. > > Also we've tried random stat()s about 500~1000 files in the tree you shared > (rather than just "ls -lR") and EROFS did almost the same or better than > composefs. I guess further analysis (including blktrace) could be shown by > Jingbo later. > The link path string and dirents are mix stored in a so-called vdata (variable data) section[1] in composefs, sometimes even in the same block (figured out by dumping the composefs image). When doing lookup, composefs will resolve the link path. It will read the link path string from vdata section through kernel_read(), along which those dirents in the following blocks are also read in by the heuristic readahead algorithm in kernel_read(). I believe this will much benefit the performance in the workload like "ls -lR". Test on Subset of Files ======================= I also tested the performance of running stat(1) on a random subset of these files in the tested image[2] generated by "find <root_directory_of_tested_image> -type f -printf "%p\n" | sort -R | head -n <lines>". 
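(For reference, a minimal sketch of how such a stat() test can be scripted;
the mount point, subset size, and temporary file below are placeholders, not
the exact harness behind the numbers that follow.)

```
mnt=/mnt/test        # mount point of the image under test (placeholder)
count=1000           # size of the random subset

# Pick a random subset of files to stat.
find "$mnt" -type f -printf '%p\n' | sort -R | head -n "$count" > /tmp/filelist

# Uncached run: drop caches first, then stat every file in the subset.
sync
echo 3 > /proc/sys/vm/drop_caches
time xargs -a /tmp/filelist stat > /dev/null

# Cached run: repeat without dropping caches.
time xargs -a /tmp/filelist stat > /dev/null
```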
                                              | uncached| cached
                                              |  (ms)   |  (ms)
----------------------------------------------|---------|--------
(1900 files)
composefs                                     |  352    |  15
erofs (raw disk)                              |  355    |  16
erofs (DIRECT loop)                           |  367    |  16
erofs (DIRECT loop) + overlayfs(lazyfollowup) |  379    |  16
erofs (BUFFER loop)                           |   85    |  16
erofs (BUFFER loop) + overlayfs(lazyfollowup) |   96    |  16

(1000 files)
composefs                                     |  311    |  9
erofs (DIRECT loop)                           |  260    |  9
erofs (raw disk)                              |  255    |  9
erofs (DIRECT loop) + overlayfs(lazyfollowup) |  262    |  9.7
erofs (BUFFER loop)                           |   71    |  9
erofs (BUFFER loop) + overlayfs(lazyfollowup) |   77    |  9.4

(500 files)
composefs                                     |  258    |  5.5
erofs (DIRECT loop)                           |  180    |  5.5
erofs (raw disk)                              |  179    |  5.5
erofs (DIRECT loop) + overlayfs(lazyfollowup) |  182    |  5.9
erofs (BUFFER loop)                           |   55    |  5.7
erofs (BUFFER loop) + overlayfs(lazyfollowup) |   60    |  5.8

Here I tested erofs on its own (without overlayfs) as well as
erofs+overlayfs. The erofs code base under test is the latest upstream,
without any optimization. As the number of stat()ed files decreases, erofs
gradually behaves better than composefs, which indicates that the heuristic
readahead in kernel_read() plays an important role in the final performance
numbers for this workload.

blktrace Log
============

To further verify that the heuristic readahead in kernel_read() reads ahead
dirents for composefs, I dumped the blktrace log while composefs accesses
the manifest file. Composefs is mounted on "/mnt/cps", and then I ran the
following three commands sequentially.

```
# ls -l /mnt/cps/etc/NetworkManager
# ls -l /mnt/cps/etc/pki
# strace ls /mnt/cps/etc/pki/pesign-rh-test
```

The blktrace log for each of the three commands is shown below:

```
# blktrace output for "ls -l /mnt/cps/etc/NetworkManager"
7,0 66  1 0.000000000    0 C  R 9136 + 8 [0]
7,0 66  2 0.000302905    0 C  R 8 + 8 [0]
7,0 66  3 0.000506568    0 C  R 9144 + 8 [0]
7,0 66  4 0.000968212    0 C  R 9152 + 8 [0]
7,0 66  5 0.001054728    0 C  R 48 + 8 [0]
7,0 66  6 0.001422439    0 C  RA 9296 + 32 [0]
7,0 66  7 0.002019686    0 C  RA 9328 + 128 [0]
7,0 53  4 0.000006260 9052 Q  R 8 + 8 [ls]
7,0 53  5 0.000006699 9052 G  R 8 + 8 [ls]
7,0 53  6 0.000006892 9052 D  R 8 + 8 [ls]
7,0 53  7 0.000308009 9052 Q  R 9144 + 8 [ls]
7,0 53  8 0.000308552 9052 G  R 9144 + 8 [ls]
7,0 53  9 0.000308780 9052 D  R 9144 + 8 [ls]
7,0 53 10 0.000893060 9052 Q  R 9152 + 8 [ls]
7,0 53 11 0.000893604 9052 G  R 9152 + 8 [ls]
7,0 53 12 0.000893964 9052 D  R 9152 + 8 [ls]
7,0 53 13 0.000975783 9052 Q  R 48 + 8 [ls]
7,0 53 14 0.000976134 9052 G  R 48 + 8 [ls]
7,0 53 15 0.000976286 9052 D  R 48 + 8 [ls]
7,0 53 16 0.001061486 9052 Q  RA 9296 + 32 [ls]
7,0 53 17 0.001061892 9052 G  RA 9296 + 32 [ls]
7,0 53 18 0.001062066 9052 P  N [ls]
7,0 53 19 0.001062282 9052 D  RA 9296 + 32 [ls]
7,0 53 20 0.001433106 9052 Q  RA 9328 + 128 [ls]  <--readahead dirents of "/mnt/cps/etc/pki/pesign-rh-test" directory
7,0 53 21 0.001433613 9052 G  RA 9328 + 128 [ls]
7,0 53 22 0.001433742 9052 P  N [ls]
7,0 53 23 0.001433888 9052 D  RA 9328 + 128 [ls]

# blktrace output for "ls -l /mnt/cps/etc/pki"
7,0 66  8 56.301287076    0 C  R 32 + 8 [0]
7,0 66  9 56.301580752    0 C  R 9160 + 8 [0]
7,0 66 10 56.301666669    0 C  R 96 + 8 [0]
7,0 53 24 56.300902079 9065 Q  R 32 + 8 [ls]
7,0 53 25 56.300904047 9065 G  R 32 + 8 [ls]
7,0 53 26 56.300904720 9065 D  R 32 + 8 [ls]
7,0 53 27 56.301478055 9065 Q  R 9160 + 8 [ls]
7,0 53 28 56.301478831 9065 G  R 9160 + 8 [ls]
7,0 53 29 56.301479147 9065 D  R 9160 + 8 [ls]
7,0 53 30 56.301588701 9065 Q  R 96 + 8 [ls]
7,0 53 31 56.301589461 9065 G  R 96 + 8 [ls]
7,0 53 32 56.301589836 9065 D  R 96 + 8 [ls]

# no output for "strace ls /mnt/cps/etc/pki/pesign-rh-test"
```

There is blktrace output for the first two commands, i.e.
"ls -l /mnt/cps/etc/NetworkManager" and "ls -l /mnt/cps/etc/pki", while
there is none for the last command, i.e.
"strace ls /mnt/cps/etc/pki/pesign-rh-test".

Let's look at the blktrace log for the first command, i.e.
"ls -l /mnt/cps/etc/NetworkManager". There is a readahead on sector 9328
with a length of 128 sectors. The filefrag output for the manifest file
(large.composefs) shows that the manifest is stored on disk starting at
sector 8, so the readahead range starts at sector 9320 (9328 - 8) within
the manifest file.

```
# filefrag -v -b512 large.composefs
File size of large.composefs is 8998590 (17576 blocks of 512 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..   17567:          8..     17575:  17568:
   1:  8994816.. 8998589:          0..      3773:   3774:    8998912: last,not_aligned,inline,eof
large.composefs: 2 extents found
```

I dumped the manifest file with the tool from [3], enhanced to print the
sector address of the vdata section for each file. For directories, the
corresponding vdata section is used to place dirents.

```
|---pesign-rh-test, block 9320(1)/    <-- dirents in pesign-rh-test
|----cert9.db [etc/pki/pesign-rh-test/cert9.db], block 9769(1)
|----key4.db [etc/pki/pesign-rh-test/key4.db], block 9769(1)
|----pkcs11.txt [etc/pki/pesign-rh-test/pkcs11.txt], block 9769(1)
```

It can be seen that the dirents of the "/mnt/cps/etc/pki/pesign-rh-test"
directory are placed at sector 9320 of the manifest file, which has already
been read ahead when running "ls -l /mnt/cps/etc/NetworkManager". That
explains why no I/O is submitted when reading the dirents of the
"/mnt/cps/etc/pki/pesign-rh-test" directory.

[1] https://lore.kernel.org/lkml/20baca7da01c285b2a77c815c9d4b3080ce4b279.1674227308.git.alexl@redhat.com/
[2] https://my.owndrive.com/index.php/s/irHJXRpZHtT3a5i
[3] https://github.com/containers/composefs

--
Thanks,
Jingbo

^ permalink raw reply	[flat|nested] 42+ messages in thread
* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay 2023-03-01 3:47 ` Jingbo Xu @ 2023-03-03 14:41 ` Alexander Larsson 2023-03-03 15:48 ` Gao Xiang 0 siblings, 1 reply; 42+ messages in thread From: Alexander Larsson @ 2023-03-03 14:41 UTC (permalink / raw) To: Jingbo Xu Cc: Gao Xiang, Christian Brauner, Amir Goldstein, linux-fsdevel, lsf-pc On Wed, Mar 1, 2023 at 4:47 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote: > > Hi all, > > On 2/27/23 6:45 PM, Gao Xiang wrote: > > > > (+cc Jingbo Xu and Christian Brauner) > > > > On 2023/2/27 17:22, Alexander Larsson wrote: > >> Hello, > >> > >> Recently Giuseppe Scrivano and I have worked on[1] and proposed[2] the > >> Composefs filesystem. It is an opportunistically sharing, validating > >> image-based filesystem, targeting usecases like validated ostree > >> rootfs:es, validated container images that share common files, as well > >> as other image based usecases. > >> > >> During the discussions in the composefs proposal (as seen on LWN[3]) > >> is has been proposed that (with some changes to overlayfs), similar > >> behaviour can be achieved by combining the overlayfs > >> "overlay.redirect" xattr with an read-only filesystem such as erofs. > >> > >> There are pros and cons to both these approaches, and the discussion > >> about their respective value has sometimes been heated. We would like > >> to have an in-person discussion at the summit, ideally also involving > >> more of the filesystem development community, so that we can reach > >> some consensus on what is the best apporach. > >> > >> Good participants would be at least: Alexander Larsson, Giuseppe > >> Scrivano, Amir Goldstein, David Chinner, Gao Xiang, Miklos Szeredi, > >> Jingbo Xu > > I'd be happy to discuss this at LSF/MM/BPF this year. Also we've addressed > > the root cause of the performance gap is that > > > > composefs read some data symlink-like payload data by using > > cfs_read_vdata_path() which involves kernel_read() and trigger heuristic > > readahead of dir data (which is also landed in composefs vdata area > > together with payload), so that most composefs dir I/O is already done > > in advance by heuristic readahead. And we think almost all exist > > in-kernel local fses doesn't have such heuristic readahead and if we add > > the similar stuff, EROFS could do better than composefs. > > > > Also we've tried random stat()s about 500~1000 files in the tree you shared > > (rather than just "ls -lR") and EROFS did almost the same or better than > > composefs. I guess further analysis (including blktrace) could be shown by > > Jingbo later. > > > > The link path string and dirents are mix stored in a so-called vdata > (variable data) section[1] in composefs, sometimes even in the same > block (figured out by dumping the composefs image). When doing lookup, > composefs will resolve the link path. It will read the link path string > from vdata section through kernel_read(), along which those dirents in > the following blocks are also read in by the heuristic readahead > algorithm in kernel_read(). I believe this will much benefit the > performance in the workload like "ls -lR". This is interesting stuff, and honestly I'm a bit surprised other filesystems don't try to readahead directory metadata to some degree too. It seems inherent to all filesystems that they try to pack related metadata near each other, so readahead would probably be useful even for read-write filesystems, although even more so for read-only filesystems (due to lack of fragmentation). 
But anyway, this is sort of beside the current issue. There is nothing inherent in composefs that makes it have to do readahead like this, and correspondingly, if it is a good idea to do it, erofs could do it too, -- =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= Alexander Larsson Red Hat, Inc alexl@redhat.com alexander.larsson@gmail.com ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay 2023-03-03 14:41 ` Alexander Larsson @ 2023-03-03 15:48 ` Gao Xiang 0 siblings, 0 replies; 42+ messages in thread From: Gao Xiang @ 2023-03-03 15:48 UTC (permalink / raw) To: Alexander Larsson, Jingbo Xu Cc: Christian Brauner, Amir Goldstein, linux-fsdevel, lsf-pc On 2023/3/3 22:41, Alexander Larsson wrote: > On Wed, Mar 1, 2023 at 4:47 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote: >> >> Hi all, >> >> On 2/27/23 6:45 PM, Gao Xiang wrote: >>> >>> (+cc Jingbo Xu and Christian Brauner) >>> >>> On 2023/2/27 17:22, Alexander Larsson wrote: >>>> Hello, >>>> >>>> Recently Giuseppe Scrivano and I have worked on[1] and proposed[2] the >>>> Composefs filesystem. It is an opportunistically sharing, validating >>>> image-based filesystem, targeting usecases like validated ostree >>>> rootfs:es, validated container images that share common files, as well >>>> as other image based usecases. >>>> >>>> During the discussions in the composefs proposal (as seen on LWN[3]) >>>> is has been proposed that (with some changes to overlayfs), similar >>>> behaviour can be achieved by combining the overlayfs >>>> "overlay.redirect" xattr with an read-only filesystem such as erofs. >>>> >>>> There are pros and cons to both these approaches, and the discussion >>>> about their respective value has sometimes been heated. We would like >>>> to have an in-person discussion at the summit, ideally also involving >>>> more of the filesystem development community, so that we can reach >>>> some consensus on what is the best apporach. >>>> >>>> Good participants would be at least: Alexander Larsson, Giuseppe >>>> Scrivano, Amir Goldstein, David Chinner, Gao Xiang, Miklos Szeredi, >>>> Jingbo Xu >>> I'd be happy to discuss this at LSF/MM/BPF this year. Also we've addressed >>> the root cause of the performance gap is that >>> >>> composefs read some data symlink-like payload data by using >>> cfs_read_vdata_path() which involves kernel_read() and trigger heuristic >>> readahead of dir data (which is also landed in composefs vdata area >>> together with payload), so that most composefs dir I/O is already done >>> in advance by heuristic readahead. And we think almost all exist >>> in-kernel local fses doesn't have such heuristic readahead and if we add >>> the similar stuff, EROFS could do better than composefs. >>> >>> Also we've tried random stat()s about 500~1000 files in the tree you shared >>> (rather than just "ls -lR") and EROFS did almost the same or better than >>> composefs. I guess further analysis (including blktrace) could be shown by >>> Jingbo later. >>> >> >> The link path string and dirents are mix stored in a so-called vdata >> (variable data) section[1] in composefs, sometimes even in the same >> block (figured out by dumping the composefs image). When doing lookup, >> composefs will resolve the link path. It will read the link path string >> from vdata section through kernel_read(), along which those dirents in >> the following blocks are also read in by the heuristic readahead >> algorithm in kernel_read(). I believe this will much benefit the >> performance in the workload like "ls -lR". > > This is interesting stuff, and honestly I'm a bit surprised other > filesystems don't try to readahead directory metadata to some degree > too. 
It seems inherent to all filesystems that they try to pack > related metadata near each other, so readahead would probably be > useful even for read-write filesystems, although even more so for > read-only filesystems (due to lack of fragmentation). As I wrote before, IMHO, local filesystems read data in some basic unit (for example block size), if there are other irreverent metadata read in one shot, of course it can read together. Some local filesystems could read more related metadata when reading inodes. But that is based on the logical relationship rather than the in-kernel readahead algorithm. > > But anyway, this is sort of beside the current issue. There is nothing > inherent in composefs that makes it have to do readahead like this, > and correspondingly, if it is a good idea to do it, erofs could do it > too, I don't think in-tree EROFS should do a random irreverent readahead like kernel_read() without proof since it could have bad results to small random file access. If we do such thing, I'm afraid it's also irresponsible to all end users already using EROFS in production. Again, "ls -lR" is not the whole world, no? If you care about the startup time, FAST 16 slacker implied only 6.4% of that data [1] is read. Even though it mainly told about lazy pulling, but that number is almost the same as the startup I/O in our cloud containers too. [1] https://www.usenix.org/conference/fast16/technical-sessions/presentation/harter Thanks, Gao Xiang > ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay 2023-02-27 9:22 [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay Alexander Larsson 2023-02-27 10:45 ` Gao Xiang @ 2023-02-27 11:37 ` Jingbo Xu 2023-03-03 13:57 ` Alexander Larsson 2 siblings, 0 replies; 42+ messages in thread From: Jingbo Xu @ 2023-02-27 11:37 UTC (permalink / raw) To: Alexander Larsson, lsf-pc; +Cc: linux-fsdevel On 2/27/23 5:22 PM, Alexander Larsson wrote: > Hello, > > Recently Giuseppe Scrivano and I have worked on[1] and proposed[2] the > Composefs filesystem. It is an opportunistically sharing, validating > image-based filesystem, targeting usecases like validated ostree > rootfs:es, validated container images that share common files, as well > as other image based usecases. > > During the discussions in the composefs proposal (as seen on LWN[3]) > is has been proposed that (with some changes to overlayfs), similar > behaviour can be achieved by combining the overlayfs > "overlay.redirect" xattr with an read-only filesystem such as erofs. > > There are pros and cons to both these approaches, and the discussion > about their respective value has sometimes been heated. We would like > to have an in-person discussion at the summit, ideally also involving > more of the filesystem development community, so that we can reach > some consensus on what is the best apporach. > > Good participants would be at least: Alexander Larsson, Giuseppe > Scrivano, Amir Goldstein, David Chinner, Gao Xiang, Miklos Szeredi, > Jingbo Xu. > > [1] https://github.com/containers/composefs > [2] https://lore.kernel.org/lkml/cover.1674227308.git.alexl@redhat.com/ > [3] https://lwn.net/SubscriberLink/922851/45ed93154f336f73/ > I'm quite interested in the topic and would be glad to attend the discussion if possible. Thanks. -- Thanks, Jingbo ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay 2023-02-27 9:22 [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay Alexander Larsson 2023-02-27 10:45 ` Gao Xiang 2023-02-27 11:37 ` Jingbo Xu @ 2023-03-03 13:57 ` Alexander Larsson 2023-03-03 15:13 ` Gao Xiang ` (2 more replies) 2 siblings, 3 replies; 42+ messages in thread From: Alexander Larsson @ 2023-03-03 13:57 UTC (permalink / raw) To: lsf-pc Cc: linux-fsdevel, Amir Goldstein, Christian Brauner, Jingbo Xu, Gao Xiang, Giuseppe Scrivano, Dave Chinner, Vivek Goyal, Miklos Szeredi On Mon, Feb 27, 2023 at 10:22 AM Alexander Larsson <alexl@redhat.com> wrote: > > Hello, > > Recently Giuseppe Scrivano and I have worked on[1] and proposed[2] the > Composefs filesystem. It is an opportunistically sharing, validating > image-based filesystem, targeting usecases like validated ostree > rootfs:es, validated container images that share common files, as well > as other image based usecases. > > During the discussions in the composefs proposal (as seen on LWN[3]) > is has been proposed that (with some changes to overlayfs), similar > behaviour can be achieved by combining the overlayfs > "overlay.redirect" xattr with an read-only filesystem such as erofs. > > There are pros and cons to both these approaches, and the discussion > about their respective value has sometimes been heated. We would like > to have an in-person discussion at the summit, ideally also involving > more of the filesystem development community, so that we can reach > some consensus on what is the best apporach. In order to better understand the behaviour and requirements of the overlayfs+erofs approach I spent some time implementing direct support for erofs in libcomposefs. So, with current HEAD of github.com/containers/composefs you can now do: $ mkcompose --digest-store=objects --format=erofs source-dir image.erofs This will produce an object store with the backing files, and a erofs file with the required overlayfs xattrs, including a made up one called "overlay.fs-verity" containing the expected fs-verity digest for the lower dir. It also adds the required whiteouts to cover the 00-ff dirs from the lower dir. These erofs files are ordered similarly to the composefs files, and we give similar guarantees about their reproducibility, etc. So, they should be apples-to-apples comparable with the composefs images. Given this, I ran another set of performance tests on the original cs9 rootfs dataset, again measuring the time of `ls -lR`. I also tried to measure the memory use like this: # echo 3 > /proc/sys/vm/drop_caches # systemd-run --scope sh -c 'ls -lR mountpoint' > /dev/null; cat $(cat /proc/self/cgroup | sed -e "s|0::|/sys/fs/cgroup|")/memory.peak' These are the alternatives I tried: xfs: the source of the image, regular dir on xfs erofs: the image.erofs above, on loopback erofs dio: the image.erofs above, on loopback with --direct-io=on ovl: erofs above combined with overlayfs ovl dio: erofs dio above combined with overlayfs cfs: composefs mount of image.cfs All tests use the same objects dir, stored on xfs. The erofs and overlay implementations are from a stock 6.1.13 kernel, and composefs module is from github HEAD. I tried loopback both with and without the direct-io option, because without direct-io enabled the kernel will double-cache the loopbacked data, as per[1]. 
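(To make the tested configurations concrete, here is a rough sketch of how
the erofs and overlayfs variants can be assembled. The mount points, the
object-store path, and in particular the overlayfs options are assumptions
based on the discussion, not a verbatim copy of the setup measured below.)

```
# erofs / "erofs dio": attach the image to a loop device and mount it
# read-only; --direct-io=on gives the "dio" variants and avoids the
# double caching of image data mentioned above.
loopdev=$(losetup --find --show --direct-io=on image.erofs)
mount -t erofs -o ro "$loopdev" /mnt/erofs

# "ovl" / "ovl dio": stack overlayfs on top, with the erofs mount acting as
# the metadata layer and the shared object store providing file data via
# the "overlay.redirect" xattrs (the exact options are an assumption here).
mount -t overlay overlay \
    -o ro,metacopy=on,redirect_dir=follow,lowerdir=/mnt/erofs:/path/to/objects \
    /mnt/ovl
```

The "cfs" rows, by contrast, mount the manifest file directly against the
object store using the mount options from the proposed composefs patch set;
no loop device is involved there.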
The produced images are:

 8.9M image.cfs
11.3M image.erofs

And they give these results:

           | Cold cache | Warm cache | Mem use
           |   (msec)   |   (msec)   |  (mb)
-----------+------------+------------+---------
xfs        |       1449 |        442 |      54
erofs      |        700 |        391 |      45
erofs dio  |        939 |        400 |      45
ovl        |       1827 |        530 |     130
ovl dio    |       2156 |        531 |     130
cfs        |        689 |        389 |      51

I also ran the same tests in a VM that had the latest kernel including the
lazyfollow patches (ovl lazy in the table, not using direct-io), this one
ext4 based:

           | Cold cache | Warm cache | Mem use
           |   (msec)   |   (msec)   |  (mb)
-----------+------------+------------+---------
ext4       |       1135 |        394 |      54
erofs      |        715 |        401 |      46
erofs dio  |        922 |        401 |      45
ovl        |       1412 |        515 |     148
ovl dio    |       1810 |        532 |     149
ovl lazy   |       1063 |        523 |      87
cfs        |        719 |        463 |      51

Things noticeable in the results:

* composefs and erofs (by itself) perform roughly the same. This is not
  necessarily news, and the results from Jingbo Xu match this.

* Erofs on top of direct-io enabled loopback causes quite a drop in
  performance, which I don't really understand, especially since it
  reports the same memory use as non-direct io. I guess the double-caching
  in the latter case isn't properly attributed to the cgroup, so the
  difference is not measured. However, why would the double cache improve
  performance? Maybe I'm not completely understanding how these things
  interact.

* Stacking overlay on top of erofs costs about 100 msec in warm-cache
  times compared to all the non-overlay approaches, and much more in the
  cold cache case. The cold cache performance is helped significantly by
  the lazyfollow patches, but the warm cache overhead remains.

* The use of overlayfs more than doubles memory use, probably because of
  all the extra inodes and dentries in action for the various layers. The
  lazyfollow patches help, but only partially.

* Even though overlayfs+erofs is slower than cfs and raw erofs, it is not
  that much slower (~25%) than the pure xfs/ext4 directory, which is a
  pretty good baseline for comparisons. It is even faster when using
  lazyfollow on ext4.

* The erofs images are slightly larger than the equivalent composefs
  image.

In summary: the performance of composefs is somewhat better than the best
erofs+ovl combination, although the overlay approach is not significantly
worse than the baseline of a regular directory, except that it uses a bit
more memory.

On top of the above purely performance-based comparisons, I would like to
re-state some of the other advantages of composefs compared to the overlay
approach:

* composefs is namespaceable, in the sense that you can use it (given
  mount capabilities) inside a namespace (such as a container) without
  access to non-namespaced resources like loopback or device-mapper
  devices. (There was work on fixing this with loopfs, but that seems to
  have stalled.)

* While it is not in the current design, the simplicity of the format and
  the lack of loopback make it at least theoretically possible that
  composefs can be made usable in a rootless fashion at some point in the
  future.

And of course, there are disadvantages to composefs too, primarily being
more code, increasing the maintenance burden and the risk of security
problems. Composefs is particularly burdensome because it is a stacking
filesystem, and these have historically been shown to be hard to get right.

The question now is what is the best approach overall? For my own primary
use case of making a verifying ostree root filesystem, the overlay approach
(with the lazyfollow work finished) is, while not ideal, good enough.
But I know for the people who are more interested in using composefs for containers the eventual goal of rootless support is very important. So, on behalf of them I guess the question is: Is there ever any chance that something like composefs could work rootlessly? Or conversely: Is there some way to get rootless support from the overlay approach? Opinions? Ideas? [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=bc07c10a3603a5ab3ef01ba42b3d41f9ac63d1b6 -- =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= Alexander Larsson Red Hat, Inc alexl@redhat.com alexander.larsson@gmail.com ^ permalink raw reply [flat|nested] 42+ messages in thread
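(The rootless question largely comes down to loop devices not being
namespaced. A quick illustration, assuming util-linux's unshare and an
image file at hand:)

```
# Inside an unprivileged user namespace there is no usable /dev/loop-control
# or /dev/loop*, so attaching the erofs image fails and has to be done by a
# privileged helper on the host instead.
unshare --user --map-root-user --mount \
    sh -c 'losetup --find --show image.erofs'

# composefs, by contrast, only needs to open the manifest and the backing
# files, which is what makes a future rootless mode at least conceivable.
```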
* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay 2023-03-03 13:57 ` Alexander Larsson @ 2023-03-03 15:13 ` Gao Xiang 2023-03-03 17:37 ` Gao Xiang 2023-03-07 10:15 ` Christian Brauner 2023-03-04 0:46 ` Jingbo Xu 2023-03-06 11:33 ` Alexander Larsson 2 siblings, 2 replies; 42+ messages in thread From: Gao Xiang @ 2023-03-03 15:13 UTC (permalink / raw) To: Alexander Larsson, lsf-pc Cc: linux-fsdevel, Amir Goldstein, Christian Brauner, Jingbo Xu, Giuseppe Scrivano, Dave Chinner, Vivek Goyal, Miklos Szeredi Hi Alexander, On 2023/3/3 21:57, Alexander Larsson wrote: > On Mon, Feb 27, 2023 at 10:22 AM Alexander Larsson <alexl@redhat.com> wrote: >> >> Hello, >> >> Recently Giuseppe Scrivano and I have worked on[1] and proposed[2] the >> Composefs filesystem. It is an opportunistically sharing, validating >> image-based filesystem, targeting usecases like validated ostree >> rootfs:es, validated container images that share common files, as well >> as other image based usecases. >> >> During the discussions in the composefs proposal (as seen on LWN[3]) >> is has been proposed that (with some changes to overlayfs), similar >> behaviour can be achieved by combining the overlayfs >> "overlay.redirect" xattr with an read-only filesystem such as erofs. >> >> There are pros and cons to both these approaches, and the discussion >> about their respective value has sometimes been heated. We would like >> to have an in-person discussion at the summit, ideally also involving >> more of the filesystem development community, so that we can reach >> some consensus on what is the best apporach. > > In order to better understand the behaviour and requirements of the > overlayfs+erofs approach I spent some time implementing direct support > for erofs in libcomposefs. So, with current HEAD of > github.com/containers/composefs you can now do: > > $ mkcompose --digest-store=objects --format=erofs source-dir image.erofs Thanks you for taking time on working on EROFS support. I don't have time to play with it yet since I'd like to work out erofs-utils 1.6 these days and will work on some new stuffs such as !pagesize block size as I said previously. > > This will produce an object store with the backing files, and a erofs > file with the required overlayfs xattrs, including a made up one > called "overlay.fs-verity" containing the expected fs-verity digest > for the lower dir. It also adds the required whiteouts to cover the > 00-ff dirs from the lower dir. > > These erofs files are ordered similarly to the composefs files, and we > give similar guarantees about their reproducibility, etc. So, they > should be apples-to-apples comparable with the composefs images. > > Given this, I ran another set of performance tests on the original cs9 > rootfs dataset, again measuring the time of `ls -lR`. I also tried to > measure the memory use like this: > > # echo 3 > /proc/sys/vm/drop_caches > # systemd-run --scope sh -c 'ls -lR mountpoint' > /dev/null; cat $(cat > /proc/self/cgroup | sed -e "s|0::|/sys/fs/cgroup|")/memory.peak' > > These are the alternatives I tried: > > xfs: the source of the image, regular dir on xfs > erofs: the image.erofs above, on loopback > erofs dio: the image.erofs above, on loopback with --direct-io=on > ovl: erofs above combined with overlayfs > ovl dio: erofs dio above combined with overlayfs > cfs: composefs mount of image.cfs > > All tests use the same objects dir, stored on xfs. The erofs and > overlay implementations are from a stock 6.1.13 kernel, and composefs > module is from github HEAD. 
> > I tried loopback both with and without the direct-io option, because > without direct-io enabled the kernel will double-cache the loopbacked > data, as per[1]. > > The produced images are: > 8.9M image.cfs > 11.3M image.erofs > > And gives these results: > | Cold cache | Warm cache | Mem use > | (msec) | (msec) | (mb) > -----------+------------+------------+--------- > xfs | 1449 | 442 | 54 > erofs | 700 | 391 | 45 > erofs dio | 939 | 400 | 45 > ovl | 1827 | 530 | 130 > ovl dio | 2156 | 531 | 130 > cfs | 689 | 389 | 51 > > I also ran the same tests in a VM that had the latest kernel including > the lazyfollow patches (ovl lazy in the table, not using direct-io), > this one ext4 based: > > | Cold cache | Warm cache | Mem use > | (msec) | (msec) | (mb) > -----------+------------+------------+--------- > ext4 | 1135 | 394 | 54 > erofs | 715 | 401 | 46 > erofs dio | 922 | 401 | 45 > ovl | 1412 | 515 | 148 > ovl dio | 1810 | 532 | 149 > ovl lazy | 1063 | 523 | 87 > cfs | 719 | 463 | 51 > > Things noticeable in the results: > > * composefs and erofs (by itself) perform roughly similar. This is > not necessarily news, and results from Jingbo Xu match this. > > * Erofs on top of direct-io enabled loopback causes quite a drop in > performance, which I don't really understand. Especially since its > reporting the same memory use as non-direct io. I guess the > double-cacheing in the later case isn't properly attributed to the > cgroup so the difference is not measured. However, why would the > double cache improve performance? Maybe I'm not completely > understanding how these things interact. We've already analysed the root cause of composefs is that composefs uses a kernel_read() to read its path while irrelevant metadata (such as dir data) is read together. Such heuristic readahead is a unusual stuff for all local fses (obviously almost all in-kernel filesystems don't use kernel_read() to read their metadata. Although some filesystems could readahead some related extent metadata when reading inode, they at least does _not_ work as kernel_read().) But double caching will introduce almost the same impact as kernel_read() (assuming you read some source code of loop device.) I do hope you already read what Jingbo's latest test results, and that test result shows how bad readahead performs if fs metadata is partially randomly used (stat < 1500 files): https://lore.kernel.org/r/83829005-3f12-afac-9d05-8ba721a80b4d@linux.alibaba.com Also you could explicitly _disable_ readahead for composefs manifiest file (because all EROFS metadata read is without readahead), and let's see how it works then. Again, if your workload is just "ls -lR". My answer is "just async readahead the whole manifest file / loop device together" when mounting. That will give the best result to you. But I'm not sure that is the real use case you propose. > > * Stacking overlay on top of erofs causes about 100msec slower > warm-cache times compared to all non-overlay approaches, and much > more in the cold cache case. The cold cache performance is helped > significantly by the lazyfollow patches, but the warm cache overhead > remains. > > * The use of overlayfs more than doubles memory use, probably > because of all the extra inodes and dentries in action for the > various layers. The lazyfollow patches helps, but only partially. > > * Even though overlayfs+erofs is slower than cfs and raw erofs, it is > not that much slower (~25%) than the pure xfs/ext4 directory, which > is a pretty good baseline for comparisons. 
It is even faster when > using lazyfollow on ext4. > > * The erofs images are slightly larger than the equivalent composefs > image. > > In summary: The performance of composefs is somewhat better than the > best erofs+ovl combination, although the overlay approach is not > significantly worse than the baseline of a regular directory, except > that it uses a bit more memory. > > On top of the above pure performance based comparisons I would like to > re-state some of the other advantages of composefs compared to the > overlay approach: > > * composefs is namespaceable, in the sense that you can use it (given > mount capabilities) inside a namespace (such as a container) without > access to non-namespaced resources like loopback or device-mapper > devices. (There was work on fixing this with loopfs, but that seems > to have stalled.) > > * While it is not in the current design, the simplicity of the format > and lack of loopback makes it at least theoretically possible that > composefs can be made usable in a rootless fashion at some point in > the future. Do you consider sending some commands to /dev/cachefiles to configure a daemonless dir and mount erofs image directly by using "erofs over fscache" but in a daemonless way? That is an ongoing stuff on our side. IMHO, I don't think file-based interfaces are quite a charmful stuff. Historically I recalled some practice is to "avoid directly reading files in kernel" so that I think almost all local fses don't work on files directl and loopback devices are all the ways for these use cases. If loopback devices are not okay to you, how about improving loopback devices and that will benefit to almost all local fses. > > And of course, there are disadvantages to composefs too. Primarily > being more code, increasing maintenance burden and risk of security > problems. Composefs is particularly burdensome because it is a > stacking filesystem and these have historically been shown to be hard > to get right. > > > The question now is what is the best approach overall? For my own > primary usecase of making a verifying ostree root filesystem, the > overlay approach (with the lazyfollow work finished) is, while not > ideal, good enough. So your judgement is still "ls -lR" and your use case is still just pure read-only and without writable stuff? Anyway, I'm really happy to work with you on your ostree use cases as always, as long as all corner cases work out by the community. > > But I know for the people who are more interested in using composefs > for containers the eventual goal of rootless support is very > important. So, on behalf of them I guess the question is: Is there > ever any chance that something like composefs could work rootlessly? > Or conversely: Is there some way to get rootless support from the > overlay approach? Opinions? Ideas? Honestly, I do want to get a proper answer when Giuseppe asked me the same question. My current view is simply "that question is almost the same for all in-kernel fses with some on-disk format". If you think EROFS compression part is too complex and useless to your use cases, okay, I think we could add a new mount option called "nocompress" so that we can avoid that part runtimely explicitly. But that still doesn't help to the original question on my side. 
Thanks, Gao Xiang > > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=bc07c10a3603a5ab3ef01ba42b3d41f9ac63d1b6 > > > > -- > =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= > Alexander Larsson Red Hat, Inc > alexl@redhat.com alexander.larsson@gmail.com ^ permalink raw reply [flat|nested] 42+ messages in thread
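(Both directions suggested above can be tried from userspace without
touching either filesystem; a minimal sketch, with the device and file
names as placeholders:)

```
# Level the playing field: turn off device-level readahead on the loop
# device backing the erofs image before re-running the benchmark.
blockdev --setra 0 /dev/loop0

# Or go the other way, as suggested above: prefetch the whole (small) image
# into the page cache right after mounting, so later metadata I/O is free.
cat image.erofs > /dev/null
```

Note that the per-file readahead composefs gets from kernel_read() on its
manifest cannot be toggled from the shell; changing that behaviour needs a
code change, which is what the discussion above is about.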
* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay 2023-03-03 15:13 ` Gao Xiang @ 2023-03-03 17:37 ` Gao Xiang 2023-03-04 14:59 ` Colin Walters 2023-03-07 10:15 ` Christian Brauner 1 sibling, 1 reply; 42+ messages in thread From: Gao Xiang @ 2023-03-03 17:37 UTC (permalink / raw) To: Alexander Larsson, lsf-pc Cc: linux-fsdevel, Amir Goldstein, Christian Brauner, Jingbo Xu, Giuseppe Scrivano, Dave Chinner, Vivek Goyal, Miklos Szeredi On 2023/3/3 23:13, Gao Xiang wrote: ... >> >> And of course, there are disadvantages to composefs too. Primarily >> being more code, increasing maintenance burden and risk of security >> problems. Composefs is particularly burdensome because it is a >> stacking filesystem and these have historically been shown to be hard >> to get right. Just off a bit of that, I do think you could finally find a fully-functional read-only filesystem is useful. For example with EROFS you could, - keep composefs model files as your main use cases; - keep some small files such as "VERSION" or "README" inline; - refer to some parts of blobs (such as tar data) directly in addition to the whole file, which seems also a useful use cases for OCI containers; - deploy all of the above to raw disks and other media as well; - etc. Actually since you're container guys, I would like to mention a way to directly reuse OCI tar data and not sure if you have some interest as well, that is just to generate EROFS metadata which could point to the tar blobs so that data itself is still the original tar, but we could add fsverity + IMMUTABLE to these blobs rather than the individual untared files. The main advantages over the current way (podman, containerd) are - save untar and snapshot gc time; - OCI layer diff IDs in the OCI spec [1] are guaranteed; - in-kernel mountable with runtime verificiation; - such tar can be mounted in secure containers in the same way as well. Personally I've been working on EROFS since the end of 2017 until now for many years, although it could take more or less time due to other on-oning work, I always believe a read-only approach is beyond just a pure space-saving stuff. So I devoted almost all my extra leisure time for this. Honestly, I do hope there could be more people interested in EROFS in addition to the original Android use cases because the overall intention is much similar and I'm happy to help things that I could do and avoid another random fs dump to Linux kernel (of course not though.) [1] https://github.com/opencontainers/image-spec/blob/main/config.md Thanks, Gao Xiang ^ permalink raw reply [flat|nested] 42+ messages in thread
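(What Gao sketches here is essentially an index-only EROFS image whose
metadata points into an unmodified tar blob. Assuming a mkfs.erofs tar/index
mode along the lines he describes — the exact option syntax below is an
assumption, since this was still work in progress at the time — the flow
could look like:)

```
# Build an index-only EROFS image: metadata lives in layer.erofs, while the
# file data stays in the original, unmodified layer.tar (so its OCI diff ID
# is preserved).
mkfs.erofs --tar=i layer.erofs layer.tar

# Protect the blob itself rather than thousands of unpacked files.
fsverity enable layer.tar
chattr +i layer.tar
```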
* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay 2023-03-03 17:37 ` Gao Xiang @ 2023-03-04 14:59 ` Colin Walters 2023-03-04 15:29 ` Gao Xiang 0 siblings, 1 reply; 42+ messages in thread From: Colin Walters @ 2023-03-04 14:59 UTC (permalink / raw) To: Gao Xiang, Alexander Larsson, lsf-pc Cc: linux-fsdevel, Amir Goldstein, Christian Brauner, Jingbo Xu, Giuseppe Scrivano, Dave Chinner, Vivek Goyal, Miklos Szeredi On Fri, Mar 3, 2023, at 12:37 PM, Gao Xiang wrote: > > Actually since you're container guys, I would like to mention > a way to directly reuse OCI tar data and not sure if you > have some interest as well, that is just to generate EROFS > metadata which could point to the tar blobs so that data itself > is still the original tar, but we could add fsverity + IMMUTABLE > to these blobs rather than the individual untared files. > - OCI layer diff IDs in the OCI spec [1] are guaranteed; The https://github.com/vbatts/tar-split approach addresses this problem domain adequately I think. Correct me if I'm wrong, but having erofs point to underlying tar wouldn't by default get us page cache sharing or even the "opportunistic" disk sharing that composefs brings, unless userspace did something like attempting to dedup files in the tar stream via hashing and using reflinks on the underlying fs. And then doing reflinks would require alignment inside the stream, right? The https://fedoraproject.org/wiki/Changes/RPMCoW change is very similar in that it's proposing a modification of the RPM format to 4k align files in the stream for this reason. But that's exactly it, then it's a new tweaked format and not identical to what came before, so the "compatibility" rationale is actually weakened a lot. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay 2023-03-04 14:59 ` Colin Walters @ 2023-03-04 15:29 ` Gao Xiang 2023-03-04 16:22 ` Gao Xiang 2023-03-07 1:00 ` Colin Walters 0 siblings, 2 replies; 42+ messages in thread From: Gao Xiang @ 2023-03-04 15:29 UTC (permalink / raw) To: Colin Walters, Alexander Larsson, lsf-pc Cc: linux-fsdevel, Amir Goldstein, Christian Brauner, Jingbo Xu, Giuseppe Scrivano, Dave Chinner, Vivek Goyal, Miklos Szeredi Hi Colin, On 2023/3/4 22:59, Colin Walters wrote: > > > On Fri, Mar 3, 2023, at 12:37 PM, Gao Xiang wrote: >> >> Actually since you're container guys, I would like to mention >> a way to directly reuse OCI tar data and not sure if you >> have some interest as well, that is just to generate EROFS >> metadata which could point to the tar blobs so that data itself >> is still the original tar, but we could add fsverity + IMMUTABLE >> to these blobs rather than the individual untared files. > >> - OCI layer diff IDs in the OCI spec [1] are guaranteed; > > The https://github.com/vbatts/tar-split approach addresses this problem domain adequately I think. Thanks for the interest and comment. I'm not aware of this project, and I'm not sure if tar-split helps mount tar stuffs, maybe I'm missing something? As for EROFS, as long as we support subpage block size, it's entirely possible to refer the original tar data without tar stream modification. > > Correct me if I'm wrong, but having erofs point to underlying tar wouldn't by default get us page cache sharing or even the "opportunistic" disk sharing that composefs brings, unless userspace did something like attempting to dedup files in the tar stream via hashing and using reflinks on the underlying fs. And then doing reflinks would require alignment inside the stream, right? The https://fedoraproject.org/wiki/Changes/RPMCoW change is very similar in that it's proposing a modification of the RPM format to 4k align files in the hmmm.. I think userspace don't need to dedupe files in the tar stream. stream for this reason. But that's exactly it, then it's a new tweaked format and not identical to what came before, so the "compatibility" rationale is actually weakened a lot. > > As you said, "opportunistic" finer disk sharing inside all tar streams can be resolved by reflink or other stuffs by the underlay filesystems (like XFS, or virtual devices like device mapper). Not bacause EROFS cannot do on-disk dedupe, just because in this way EROFS can only use the original tar blobs, and EROFS is not the guy to resolve the on-disk sharing stuff. However, here since the original tar blob is used, so that the tar stream data is unchanged (with the same diffID) when the container is running. As a kernel filesystem, if two files are equal, we could treat them in the same inode address space, even they are actually with slightly different inode metadata (uid, gid, mode, nlink, etc). That is entirely possible as an in-kernel filesystem even currently linux kernel doesn't implement finer page cache sharing, so EROFS can support page-cache sharing of files in all tar streams if needed. Thanks, Gao Xiang ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay 2023-03-04 15:29 ` Gao Xiang @ 2023-03-04 16:22 ` Gao Xiang 2023-03-07 1:00 ` Colin Walters 1 sibling, 0 replies; 42+ messages in thread From: Gao Xiang @ 2023-03-04 16:22 UTC (permalink / raw) To: Colin Walters, Alexander Larsson, lsf-pc Cc: linux-fsdevel, Amir Goldstein, Christian Brauner, Jingbo Xu, Giuseppe Scrivano, Dave Chinner, Vivek Goyal, Miklos Szeredi On 2023/3/4 23:29, Gao Xiang wrote: > Hi Colin, > > On 2023/3/4 22:59, Colin Walters wrote: >> >> >> On Fri, Mar 3, 2023, at 12:37 PM, Gao Xiang wrote: >>> >>> Actually since you're container guys, I would like to mention >>> a way to directly reuse OCI tar data and not sure if you >>> have some interest as well, that is just to generate EROFS >>> metadata which could point to the tar blobs so that data itself >>> is still the original tar, but we could add fsverity + IMMUTABLE >>> to these blobs rather than the individual untared files. >> >>> - OCI layer diff IDs in the OCI spec [1] are guaranteed; >> >> The https://github.com/vbatts/tar-split approach addresses this problem domain adequately I think. > > Thanks for the interest and comment. > > I'm not aware of this project, and I'm not sure if tar-split > helps mount tar stuffs, maybe I'm missing something? > > As for EROFS, as long as we support subpage block size, it's > entirely possible to refer the original tar data without tar > stream modification. > >> >> Correct me if I'm wrong, but having erofs point to underlying tar wouldn't by default get us page cache sharing or even the "opportunistic" disk sharing that composefs brings, unless userspace did something like attempting to dedup files in the tar stream via hashing and using reflinks on the underlying fs. And then doing reflinks would require alignment inside the stream, right? The https://fedoraproject.org/wiki/Changes/RPMCoW change is very similar in that it's proposing a modification of the RPM format to 4k align files in the > > hmmm.. I think userspace don't need to dedupe files in the > tar stream. > > stream for this reason. But that's exactly it, then it's a new tweaked format and not identical to what came before, so the "compatibility" rationale is actually weakened a lot. >> >> > > As you said, "opportunistic" finer disk sharing inside all tar > streams can be resolved by reflink or other stuffs by the underlay > filesystems (like XFS, or virtual devices like device mapper). > > Not bacause EROFS cannot do on-disk dedupe, just because in this > way EROFS can only use the original tar blobs, and EROFS is not > the guy to resolve the on-disk sharing stuff. However, here since > the original tar blob is used, so that the tar stream data is > unchanged (with the same diffID) when the container is running. > > As a kernel filesystem, if two files are equal, we could treat them > in the same inode address space, even they are actually with slightly > different inode metadata (uid, gid, mode, nlink, etc). That is > entirely possible as an in-kernel filesystem even currently linux > kernel doesn't implement finer page cache sharing, so EROFS can > support page-cache sharing of files in all tar streams if needed. 
By the way, in case of misunderstanding: the current workable ways of doing Linux page cache sharing don't _strictly_ require that the real inode be the same inode (which is what a stackable fs like overlayfs relies on); they just need the shared data of different inodes to live consecutively in one address space, which means:

1) we could reuse the blob (the tar stream) address space to share page cache, which is actually what Jingbo did for fscache page cache sharing: https://lore.kernel.org/r/20230203030143.73105-1-jefflexu@linux.alibaba.com

2) we could create a virtual inode (or reuse the address space of one of the real inodes) to share data between real inodes.

Either way can do page cache sharing of inodes with the same data across different filesystems, and both are practical without extra linux-mm improvements.

thanks,
Gao Xiang

> > Thanks, > Gao Xiang ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay 2023-03-04 15:29 ` Gao Xiang 2023-03-04 16:22 ` Gao Xiang @ 2023-03-07 1:00 ` Colin Walters 2023-03-07 3:10 ` Gao Xiang 1 sibling, 1 reply; 42+ messages in thread From: Colin Walters @ 2023-03-07 1:00 UTC (permalink / raw) To: Gao Xiang, Alexander Larsson, lsf-pc Cc: linux-fsdevel, Amir Goldstein, Christian Brauner, Jingbo Xu, Giuseppe Scrivano, Dave Chinner, Vivek Goyal, Miklos Szeredi On Sat, Mar 4, 2023, at 10:29 AM, Gao Xiang wrote: > Hi Colin, > > On 2023/3/4 22:59, Colin Walters wrote: >> >> >> On Fri, Mar 3, 2023, at 12:37 PM, Gao Xiang wrote: >>> >>> Actually since you're container guys, I would like to mention >>> a way to directly reuse OCI tar data and not sure if you >>> have some interest as well, that is just to generate EROFS >>> metadata which could point to the tar blobs so that data itself >>> is still the original tar, but we could add fsverity + IMMUTABLE >>> to these blobs rather than the individual untared files. >> >>> - OCI layer diff IDs in the OCI spec [1] are guaranteed; >> >> The https://github.com/vbatts/tar-split approach addresses this problem domain adequately I think. > > Thanks for the interest and comment. > > I'm not aware of this project, and I'm not sure if tar-split > helps mount tar stuffs, maybe I'm missing something? Not directly; it's widely used in the container ecosystem (podman/docker etc.) to split off the original bit-for-bit tar stream metadata content from the actually large data (particularly regular files) so that one can write the files to a regular underlying fs (xfs/ext4/etc.) and use overlayfs on top. Then it helps reverse the process and reconstruct the original tar stream for pushes, for exactly the reason you mention. Slightly OT but a whole reason we're having this conversation now is definitely rooted in the original Docker inventor having the idea of *deriving* or layering on top of previous images, which is not part of dpkg/rpm or squashfs or raw disk images etc. Inherent in this is the idea that we're not talking about *a* filesystem - we're talking about filesystem*s* plural and how they're wired together and stacked. It's really only very simplistic use cases for which a single read-only filesystem suffices. They exist - e.g. people booting things like Tails OS https://tails.boum.org/ on one of those USB sticks with a physical write protection switch, etc. But that approach makes every OS update very expensive - most use cases really want fast and efficient incremental in-place OS updates and a clear distinct split between OS filesystem and app filesystems. But without also forcing separate size management onto both. > Not bacause EROFS cannot do on-disk dedupe, just because in this > way EROFS can only use the original tar blobs, and EROFS is not > the guy to resolve the on-disk sharing stuff. Right, agree; this ties into my larger point above that no one technology/filesystem is the sole solution in the general case. > As a kernel filesystem, if two files are equal, we could treat them > in the same inode address space, even they are actually with slightly > different inode metadata (uid, gid, mode, nlink, etc). That is > entirely possible as an in-kernel filesystem even currently linux > kernel doesn't implement finer page cache sharing, so EROFS can > support page-cache sharing of files in all tar streams if needed. Hmmm. I should clarify here I have zero kernel patches, I'm a userspace developer (on container and OS updates, for which I'd like a unified stack). 
But it seems to me that while you're right that it would be technically possible for a single filesystem to do this, in practice it would require some sort of virtual sub-filesystem internally. And at that point, it does seem more elegant to me to make that stacking explicit, more like how composefs is doing it. That said I think there's a lot of legitimate debate here, and I hope we can continue doing so productively! ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay 2023-03-07 1:00 ` Colin Walters @ 2023-03-07 3:10 ` Gao Xiang 0 siblings, 0 replies; 42+ messages in thread From: Gao Xiang @ 2023-03-07 3:10 UTC (permalink / raw) To: Colin Walters, Alexander Larsson, lsf-pc Cc: linux-fsdevel, Amir Goldstein, Christian Brauner, Jingbo Xu, Giuseppe Scrivano, Dave Chinner, Vivek Goyal, Miklos Szeredi

On 2023/3/7 09:00, Colin Walters wrote:
>
>
> On Sat, Mar 4, 2023, at 10:29 AM, Gao Xiang wrote:
>> Hi Colin,
>>
>> On 2023/3/4 22:59, Colin Walters wrote:
>>>
>>>
>>> On Fri, Mar 3, 2023, at 12:37 PM, Gao Xiang wrote:
>>>>
>>>> Actually since you're container guys, I would like to mention
>>>> a way to directly reuse OCI tar data and not sure if you
>>>> have some interest as well, that is just to generate EROFS
>>>> metadata which could point to the tar blobs so that data itself
>>>> is still the original tar, but we could add fsverity + IMMUTABLE
>>>> to these blobs rather than the individual untared files.
>>>
>>>> - OCI layer diff IDs in the OCI spec [1] are guaranteed;
>>>
>>> The https://github.com/vbatts/tar-split approach addresses this problem domain adequately I think.
>>
>> Thanks for the interest and comment.
>>
>> I'm not aware of this project, and I'm not sure if tar-split
>> helps mount tar stuffs, maybe I'm missing something?
>
> Not directly; it's widely used in the container ecosystem (podman/docker etc.) to split off the original bit-for-bit tar stream metadata content from the actually large data (particularly regular files) so that one can write the files to a regular underlying fs (xfs/ext4/etc.) and use overlayfs on top. Then it helps reverse the process and reconstruct the original tar stream for pushes, for exactly the reason you mention.
>
> Slightly OT but a whole reason we're having this conversation now is definitely rooted in the original Docker inventor having the idea of *deriving* or layering on top of previous images, which is not part of dpkg/rpm or squashfs or raw disk images etc. Inherent in this is the idea that we're not talking about *a* filesystem - we're talking about filesystem*s* plural and how they're wired together and stacked.

Yes, as you said, the actual OCI standard (or Docker, whatever) is all about layering. So there could be a possibility of directly using the original layers for mounting without any conversion (like "untar", or converting to another blob format which could support 4k reflink dedupe). I believe it can save untar time and avoid the snapshot gc problems that users are concerned about, for example on our cloud with thousands of containers launching/running/gcing at the same time.

>
> It's really only very simplistic use cases for which a single read-only filesystem suffices. They exist - e.g. people booting things like Tails OS https://tails.boum.org/ on one of those USB sticks with a physical write protection switch, etc.

I cannot access the website. If you consider physical write protection, then a read-only filesystem written on physical media is needed, so the EROFS manifest can be placed on raw disks (for write protection and hardware integrity checks) or on other local filesystems. It depends on the actual detailed requirements.

>
> But that approach makes every OS update very expensive - most use cases really want fast and efficient incremental in-place OS updates and a clear distinct split between OS filesystem and app filesystems. But without also forcing separate size management onto both.
>
>> Not bacause EROFS cannot do on-disk dedupe, just because in this
>> way EROFS can only use the original tar blobs, and EROFS is not
>> the guy to resolve the on-disk sharing stuff.
>
> Right, agree; this ties into my larger point above that no one technology/filesystem is the sole solution in the general case.

Anyway, if you consider an _untar_ way, you could also consider a conversion way (like you said, padding to 4k). Since the OCI standard is all about layering, you could pad to 4k and then do data dedupe with:
 - the data blobs themselves (some recent projects like Nydus with EROFS);
 - reflink-enabled filesystems (such as XFS or btrfs).

Untar behaves almost the same as the conversion way, except that the conversion way doesn't produce massive numbers of files/dirs on the underlying filesystem that then have to be gc'ed again.

To be clear, since you are the original OSTree author, I'm not trying to push alternative approaches on you. I believe all practical engineering projects have advantages and disadvantages. For example, even git is moving toward using a packed object store more and more, and I guess OSTree could also have some packed format for effective distribution, at least to some extent.

Here I just would like to say that the on-disk EROFS format (like other widely used kernel filesystems) is not designed just for a specific use case like OSTree, tar blobs or whatever, or for specific media (block-based, file-based, etc.). As far as I can see, EROFS+overlay has already supported the OSTree composefs-like use cases for two years and has landed in many distros. And other local kernel filesystems don't behave quite as well with the "ls -lR" workload.

>
>> As a kernel filesystem, if two files are equal, we could treat them
>> in the same inode address space, even they are actually with slightly
>> different inode metadata (uid, gid, mode, nlink, etc). That is
>> entirely possible as an in-kernel filesystem even currently linux
>> kernel doesn't implement finer page cache sharing, so EROFS can
>> support page-cache sharing of files in all tar streams if needed.
>
> Hmmm. I should clarify here I have zero kernel patches, I'm a userspace developer (on container and OS updates, for which I'd like a unified stack). But it seems to me that while you're right that it would be technically possible for a single filesystem to do this, in practice it would require some sort of virtual sub-filesystem internally. And at that point, it does seem more elegant to me to make that stacking explicit, more like how composefs is doing it. That said I think there's a lot of legitimate debate here, and I hope we can continue doing so productively!

As you said, you're a userspace developer, so I just need to clarify that internal inodes are very common among local fses; to my knowledge, btrfs and f2fs in addition to EROFS all have such internal inodes for making use of the kernel page cache.

One advantage over the stackable way is this: with the stackable way, you have to explicitly open the backing file, which takes more time to look up the dcache/icache and even the on-disk hierarchy. By contrast, if you do page cache sharing of the original tar blobs, you don't need to do another open at all. Surely, that's not captured by an "ls -lR" benchmark, but it does impact end users.

Again, here I'm trying to say that I'm neither in favor of nor against any particular user-space distribution solution, like OSTree or something else. Nydus is just one userspace example using EROFS, which I persuaded them to adopt.
Besides, EROFS has already landed on all mainstream in-market Android smartphones, and I hope it can get more attention and adoption over various use cases, and that more developers will join us.

> > That said I think there's a lot of legitimate debate here, and I hope we can continue doing so productively!

Thanks, as a kernel filesystem developer for many years, I hope our (at least my own) design can be used more widely. So again, I'm not against your OSTree design, and I believe all detailed distribution approaches have pros and cons.

Thanks,
Gao Xiang

> ^ permalink raw reply [flat|nested] 42+ messages in thread
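A small sketch of the block-aligned reflink sharing discussed above, using the FICLONERANGE ioctl: it shares a 4k-aligned region of a (padded) tar blob with another file without copying any data. This is only a sketch; both files must sit on the same reflink-capable filesystem (XFS with reflink, btrfs), and the paths, offsets and lengths are placeholders that must be multiples of the filesystem block size.

/*
 * Sketch only: clone a block-aligned region of one file into another
 * with FICLONERANGE.  Both files must live on the same reflink-capable
 * filesystem; offsets and lengths must be block-aligned.  Paths and
 * offsets below are placeholders.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
        int src = open("/layers/layer0.tar", O_RDONLY);                  /* placeholder */
        int dst = open("/objects/shared-file", O_RDWR | O_CREAT, 0644);  /* placeholder */
        if (src < 0 || dst < 0) {
                perror("open");
                return 1;
        }

        /*
         * Share 1 MiB of file data starting at a 4k-aligned offset inside
         * the tar stream; no data is copied, both files point at the same
         * on-disk extents until one of them is written.
         */
        struct file_clone_range clone = {
                .src_fd      = src,
                .src_offset  = 8192,        /* must be a multiple of the fs block size */
                .src_length  = 1048576,     /* likewise */
                .dest_offset = 0,
        };
        if (ioctl(dst, FICLONERANGE, &clone) < 0)
                perror("FICLONERANGE");

        close(src);
        close(dst);
        return 0;
}

This is also why padding file data to 4k inside the stream matters: without block alignment the clone request is rejected and the data would have to be copied instead.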
* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay 2023-03-03 15:13 ` Gao Xiang 2023-03-03 17:37 ` Gao Xiang @ 2023-03-07 10:15 ` Christian Brauner 2023-03-07 11:03 ` Gao Xiang ` (2 more replies) 1 sibling, 3 replies; 42+ messages in thread From: Christian Brauner @ 2023-03-07 10:15 UTC (permalink / raw) To: Gao Xiang Cc: Alexander Larsson, lsf-pc, linux-fsdevel, Amir Goldstein, Jingbo Xu, Giuseppe Scrivano, Dave Chinner, Vivek Goyal, Miklos Szeredi On Fri, Mar 03, 2023 at 11:13:51PM +0800, Gao Xiang wrote: > Hi Alexander, > > On 2023/3/3 21:57, Alexander Larsson wrote: > > On Mon, Feb 27, 2023 at 10:22 AM Alexander Larsson <alexl@redhat.com> wrote: > > > > > > Hello, > > > > > > Recently Giuseppe Scrivano and I have worked on[1] and proposed[2] the > > > Composefs filesystem. It is an opportunistically sharing, validating > > > image-based filesystem, targeting usecases like validated ostree > > > rootfs:es, validated container images that share common files, as well > > > as other image based usecases. > > > > > > During the discussions in the composefs proposal (as seen on LWN[3]) > > > is has been proposed that (with some changes to overlayfs), similar > > > behaviour can be achieved by combining the overlayfs > > > "overlay.redirect" xattr with an read-only filesystem such as erofs. > > > > > > There are pros and cons to both these approaches, and the discussion > > > about their respective value has sometimes been heated. We would like > > > to have an in-person discussion at the summit, ideally also involving > > > more of the filesystem development community, so that we can reach > > > some consensus on what is the best apporach. > > > > In order to better understand the behaviour and requirements of the > > overlayfs+erofs approach I spent some time implementing direct support > > for erofs in libcomposefs. So, with current HEAD of > > github.com/containers/composefs you can now do: > > > > $ mkcompose --digest-store=objects --format=erofs source-dir image.erofs > > Thanks you for taking time on working on EROFS support. I don't have > time to play with it yet since I'd like to work out erofs-utils 1.6 > these days and will work on some new stuffs such as !pagesize block > size as I said previously. > > > > > This will produce an object store with the backing files, and a erofs > > file with the required overlayfs xattrs, including a made up one > > called "overlay.fs-verity" containing the expected fs-verity digest > > for the lower dir. It also adds the required whiteouts to cover the > > 00-ff dirs from the lower dir. > > > > These erofs files are ordered similarly to the composefs files, and we > > give similar guarantees about their reproducibility, etc. So, they > > should be apples-to-apples comparable with the composefs images. > > > > Given this, I ran another set of performance tests on the original cs9 > > rootfs dataset, again measuring the time of `ls -lR`. 
I also tried to > > measure the memory use like this: > > > > # echo 3 > /proc/sys/vm/drop_caches > > # systemd-run --scope sh -c 'ls -lR mountpoint' > /dev/null; cat $(cat > > /proc/self/cgroup | sed -e "s|0::|/sys/fs/cgroup|")/memory.peak' > > > > These are the alternatives I tried: > > > > xfs: the source of the image, regular dir on xfs > > erofs: the image.erofs above, on loopback > > erofs dio: the image.erofs above, on loopback with --direct-io=on > > ovl: erofs above combined with overlayfs > > ovl dio: erofs dio above combined with overlayfs > > cfs: composefs mount of image.cfs > > > > All tests use the same objects dir, stored on xfs. The erofs and > > overlay implementations are from a stock 6.1.13 kernel, and composefs > > module is from github HEAD. > > > > I tried loopback both with and without the direct-io option, because > > without direct-io enabled the kernel will double-cache the loopbacked > > data, as per[1]. > > > > The produced images are: > > 8.9M image.cfs > > 11.3M image.erofs > > > > And gives these results: > > | Cold cache | Warm cache | Mem use > > | (msec) | (msec) | (mb) > > -----------+------------+------------+--------- > > xfs | 1449 | 442 | 54 > > erofs | 700 | 391 | 45 > > erofs dio | 939 | 400 | 45 > > ovl | 1827 | 530 | 130 > > ovl dio | 2156 | 531 | 130 > > cfs | 689 | 389 | 51 > > > > I also ran the same tests in a VM that had the latest kernel including > > the lazyfollow patches (ovl lazy in the table, not using direct-io), > > this one ext4 based: > > > > | Cold cache | Warm cache | Mem use > > | (msec) | (msec) | (mb) > > -----------+------------+------------+--------- > > ext4 | 1135 | 394 | 54 > > erofs | 715 | 401 | 46 > > erofs dio | 922 | 401 | 45 > > ovl | 1412 | 515 | 148 > > ovl dio | 1810 | 532 | 149 > > ovl lazy | 1063 | 523 | 87 > > cfs | 719 | 463 | 51 > > > > Things noticeable in the results: > > > > * composefs and erofs (by itself) perform roughly similar. This is > > not necessarily news, and results from Jingbo Xu match this. > > > > * Erofs on top of direct-io enabled loopback causes quite a drop in > > performance, which I don't really understand. Especially since its > > reporting the same memory use as non-direct io. I guess the > > double-cacheing in the later case isn't properly attributed to the > > cgroup so the difference is not measured. However, why would the > > double cache improve performance? Maybe I'm not completely > > understanding how these things interact. > > We've already analysed the root cause of composefs is that composefs > uses a kernel_read() to read its path while irrelevant metadata > (such as dir data) is read together. Such heuristic readahead is a > unusual stuff for all local fses (obviously almost all in-kernel > filesystems don't use kernel_read() to read their metadata. Although > some filesystems could readahead some related extent metadata when > reading inode, they at least does _not_ work as kernel_read().) But > double caching will introduce almost the same impact as kernel_read() > (assuming you read some source code of loop device.) > > I do hope you already read what Jingbo's latest test results, and that > test result shows how bad readahead performs if fs metadata is > partially randomly used (stat < 1500 files): > https://lore.kernel.org/r/83829005-3f12-afac-9d05-8ba721a80b4d@linux.alibaba.com > > Also you could explicitly _disable_ readahead for composefs > manifiest file (because all EROFS metadata read is without > readahead), and let's see how it works then. 
> > Again, if your workload is just "ls -lR". My answer is "just async > readahead the whole manifest file / loop device together" when > mounting. That will give the best result to you. But I'm not sure > that is the real use case you propose. > > > > > * Stacking overlay on top of erofs causes about 100msec slower > > warm-cache times compared to all non-overlay approaches, and much > > more in the cold cache case. The cold cache performance is helped > > significantly by the lazyfollow patches, but the warm cache overhead > > remains. > > > > * The use of overlayfs more than doubles memory use, probably > > because of all the extra inodes and dentries in action for the > > various layers. The lazyfollow patches helps, but only partially. > > > > * Even though overlayfs+erofs is slower than cfs and raw erofs, it is > > not that much slower (~25%) than the pure xfs/ext4 directory, which > > is a pretty good baseline for comparisons. It is even faster when > > using lazyfollow on ext4. > > > > * The erofs images are slightly larger than the equivalent composefs > > image. > > > > In summary: The performance of composefs is somewhat better than the > > best erofs+ovl combination, although the overlay approach is not > > significantly worse than the baseline of a regular directory, except > > that it uses a bit more memory. > > > > On top of the above pure performance based comparisons I would like to > > re-state some of the other advantages of composefs compared to the > > overlay approach: > > > > * composefs is namespaceable, in the sense that you can use it (given > > mount capabilities) inside a namespace (such as a container) without > > access to non-namespaced resources like loopback or device-mapper > > devices. (There was work on fixing this with loopfs, but that seems > > to have stalled.) > > > > * While it is not in the current design, the simplicity of the format > > and lack of loopback makes it at least theoretically possible that > > composefs can be made usable in a rootless fashion at some point in > > the future. > Do you consider sending some commands to /dev/cachefiles to configure > a daemonless dir and mount erofs image directly by using "erofs over > fscache" but in a daemonless way? That is an ongoing stuff on our side. > > IMHO, I don't think file-based interfaces are quite a charmful stuff. > Historically I recalled some practice is to "avoid directly reading > files in kernel" so that I think almost all local fses don't work on > files directl and loopback devices are all the ways for these use > cases. If loopback devices are not okay to you, how about improving > loopback devices and that will benefit to almost all local fses. > > > > > And of course, there are disadvantages to composefs too. Primarily > > being more code, increasing maintenance burden and risk of security > > problems. Composefs is particularly burdensome because it is a > > stacking filesystem and these have historically been shown to be hard > > to get right. > > > > > > The question now is what is the best approach overall? For my own > > primary usecase of making a verifying ostree root filesystem, the > > overlay approach (with the lazyfollow work finished) is, while not > > ideal, good enough. > > So your judgement is still "ls -lR" and your use case is still just > pure read-only and without writable stuff? > > Anyway, I'm really happy to work with you on your ostree use cases > as always, as long as all corner cases work out by the community. 
> > > > > But I know for the people who are more interested in using composefs > > for containers the eventual goal of rootless support is very > > important. So, on behalf of them I guess the question is: Is there > > ever any chance that something like composefs could work rootlessly? > > Or conversely: Is there some way to get rootless support from the > > overlay approach? Opinions? Ideas? > > Honestly, I do want to get a proper answer when Giuseppe asked me > the same question. My current view is simply "that question is > almost the same for all in-kernel fses with some on-disk format". As far as I'm concerned filesystems with on-disk format will not be made mountable by unprivileged containers. And I don't think I'm alone in that view. The idea that ever more parts of the kernel with a massive attack surface such as a filesystem need to vouchesafe for the safety in the face of every rando having access to unshare --mount --user --map-root is a dead end and will just end up trapping us in a neverending cycle of security bugs (Because every single bug that's found after making that fs mountable from an unprivileged container will be treated as a security bug no matter if justified or not. So this is also a good way to ruin your filesystem's reputation.). And honestly, if we set the precedent that it's fine for one filesystem with an on-disk format to be able to be mounted by unprivileged containers then other filesystems eventually want to do this as well. At the rate we currently add filesystems that's just a matter of time even if none of the existing ones would also want to do it. And then we're left arguing that this was just an exception for one super special, super safe, unexploitable filesystem with an on-disk format. Imho, none of this is appealing. I don't want to slowly keep building a future where we end up running fuzzers in unprivileged container to generate random images to crash the kernel. I have more arguments why I don't think is a path we will ever go down but I don't want this to detract from the legitimate ask of making it possible to mount trusted images from within unprivileged containers. Because I think that's perfectly legitimate. However, I don't think that this is something the kernel needs to solve other than providing the necessary infrastructure so that this can be solved in userspace. Off-list, Amir had pointed to a blog I wrote last week (cf. [1]) where I explained how we currently mount into mount namespaces of unprivileged cotainers which had been quite a difficult problem before the new mount api. But now it's become almost comically trivial. I mean, there's stuff that will still be good to have but overall all the bits are already there. Imho, delegated mounting should be done by a system service that is responsible for all the steps that require privileges. So for most filesytems not mountable by unprivileged user this would amount to: fd_fs = fsopen("xfs") fsconfig(FSCONFIG_SET_STRING, "source", "/sm/sm") fsconfig(FSCONFIG_CMD_CREATE) fd_mnt = fsmount(fd_fs) // Only required for attributes that require privileges against the sb // of the filesystem such as idmapped mounts mount_setattr(fd_mnt, ...) and then the fd_mnt can be sent to the container which can then attach it wherever it wants to. The system level service doesn't even need to change namespaces via setns(fd_userns|fd_mntns) like I illustrated in the post I did. It's sufficient if we sent it via AF_UNIX for example that's exposed to the container. 
Of course, this system level service would be integrated with mount(8) directly over a well-defined protocol. And this would be nestable as well by, e.g., bind-mounting the AF_UNIX socket.

And we do already support a rudimentary form of such integration through systemd. For example via mount -t ddi (cf. [2]) which makes it possible to mount discoverable disk images (ddi). But that's just an illustration. This should be integrated with mount(8) and should be a simple protocol over varlink or another lightweight ipc mechanism that can be implemented by systemd-mountd (which is how I coined this for lack of imagination when I came up with this) or by some other component if platforms like k8s really want to do their own thing.

This also allows us to extend this feature to the whole system btw and to all filesystems at once. Because it means that if systemd-mountd is told what images to trust (based on location, from a specific registry, signature, or whatever) then this isn't just useful for unprivileged containers but also for regular users on the host that want to mount stuff.

This is what we're currently working on.

(There's stuff that we can do to make this more powerful __if__ we need to. One example would probably be that we _could_ make it possible to mark a superblock as being owned by a specific namespace with similar permission checks as what we currently do for idmapped mounts (privileged in the superblock of the fs, privileged over the ns to delegate to etc). IOW,

fd_fs = fsopen("xfs")
fsconfig(FSCONFIG_SET_STRING, "source", "/sm/sm")
fsconfig(FSCONFIG_SET_FD, "owner", fd_container_userns)

which completely sidesteps the issue of making that on-disk filesystem mountable by unpriv users.

But let me say that this is completely unnecessary today as you can do:

fd_fs = fsopen("xfs")
fsconfig(FSCONFIG_SET_STRING, "source", "/sm/sm")
fsconfig(FSCONFIG_CMD_CREATE)
fd_mnt = fsmount(fd_fs)
mount_setattr(fd_mnt, MOUNT_ATTR_IDMAP)

which changes ownership across the whole filesystem. The only time you really want what I mention here is if you want to delegate control over __every single ioctl and potentially destructive operation associated with that filesystem__ to an unprivileged container which is almost never what you want.)

[1]: https://brauner.io/2023/02/28/mounting-into-mount-namespaces.html
[2]: https://github.com/systemd/systemd/pull/26695 ^ permalink raw reply [flat|nested] 42+ messages in thread
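A minimal sketch of the delegated-mount flow described above: a privileged "system service" builds a detached mount with the new mount API and ships the mount fd over an AF_UNIX socket, and the "container" side attaches it with move_mount(). This assumes glibc >= 2.36 (which exposes fsopen()/fsconfig()/fsmount()/move_mount() in <sys/mount.h>); the filesystem type, source device and target path are placeholders, and socketpair()+fork() stands in for the real socket shared between the service and the container.

/*
 * Sketch only: a "system service" builds a detached mount and ships the
 * mount fd over an AF_UNIX socket; the "container" receives it and
 * attaches it.  Assumes glibc >= 2.36 for the new mount API wrappers.
 * "erofs", "/dev/loop0" and "/mnt" are placeholders.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mount.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <sys/wait.h>
#include <unistd.h>

static void die(const char *msg) { perror(msg); exit(1); }

static void send_fd(int sock, int fd)
{
        char cbuf[CMSG_SPACE(sizeof(int))] = { 0 };
        char c = 'm';
        struct iovec iov = { .iov_base = &c, .iov_len = 1 };
        struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                              .msg_control = cbuf, .msg_controllen = sizeof(cbuf) };
        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
        if (sendmsg(sock, &msg, 0) < 0)
                die("sendmsg");
}

static int recv_fd(int sock)
{
        char cbuf[CMSG_SPACE(sizeof(int))];
        char c;
        int fd;
        struct iovec iov = { .iov_base = &c, .iov_len = 1 };
        struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                              .msg_control = cbuf, .msg_controllen = sizeof(cbuf) };
        if (recvmsg(sock, &msg, 0) < 0)
                die("recvmsg");
        memcpy(&fd, CMSG_DATA(CMSG_FIRSTHDR(&msg)), sizeof(int));
        return fd;
}

int main(void)
{
        int sv[2];

        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
                die("socketpair");

        if (fork() == 0) {
                /* "Container" side: receive the detached mount and attach it. */
                close(sv[0]);
                int fd_mnt = recv_fd(sv[1]);
                if (move_mount(fd_mnt, "", AT_FDCWD, "/mnt",
                               MOVE_MOUNT_F_EMPTY_PATH) < 0)
                        die("move_mount");
                return 0;
        }

        /* "System service" side: the privileged steps only. */
        close(sv[1]);
        int fd_fs = fsopen("erofs", FSOPEN_CLOEXEC);
        if (fd_fs < 0)
                die("fsopen");
        if (fsconfig(fd_fs, FSCONFIG_SET_STRING, "source", "/dev/loop0", 0) < 0)
                die("fsconfig source");
        if (fsconfig(fd_fs, FSCONFIG_CMD_CREATE, NULL, NULL, 0) < 0)
                die("fsconfig create");
        int fd_mnt = fsmount(fd_fs, FSMOUNT_CLOEXEC, MOUNT_ATTR_RDONLY);
        if (fd_mnt < 0)
                die("fsmount");
        send_fd(sv[0], fd_mnt);
        wait(NULL);
        return 0;
}

The mount_setattr(MOUNT_ATTR_IDMAP) step is left out here since it needs a user namespace fd; the receiving side only needs to be privileged over its own mount namespace to attach the detached mount it was handed.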
* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay 2023-03-07 10:15 ` Christian Brauner @ 2023-03-07 11:03 ` Gao Xiang 2023-03-07 12:09 ` Alexander Larsson 2023-03-07 13:38 ` Jeff Layton 2 siblings, 0 replies; 42+ messages in thread From: Gao Xiang @ 2023-03-07 11:03 UTC (permalink / raw) To: Christian Brauner Cc: Alexander Larsson, lsf-pc, linux-fsdevel, Amir Goldstein, Jingbo Xu, Giuseppe Scrivano, Dave Chinner, Vivek Goyal, Miklos Szeredi Hi Christian, On 2023/3/7 18:15, Christian Brauner wrote: > On Fri, Mar 03, 2023 at 11:13:51PM +0800, Gao Xiang wrote: >> Hi Alexander, >> >> On 2023/3/3 21:57, Alexander Larsson wrote: >>> On Mon, Feb 27, 2023 at 10:22 AM Alexander Larsson <alexl@redhat.com> wrote: >>>> >>>> Hello, >>>> >>>> Recently Giuseppe Scrivano and I have worked on[1] and proposed[2] the >>>> Composefs filesystem. It is an opportunistically sharing, validating >>>> image-based filesystem, targeting usecases like validated ostree >>>> rootfs:es, validated container images that share common files, as well >>>> as other image based usecases. >>>> >>>> During the discussions in the composefs proposal (as seen on LWN[3]) >>>> is has been proposed that (with some changes to overlayfs), similar >>>> behaviour can be achieved by combining the overlayfs >>>> "overlay.redirect" xattr with an read-only filesystem such as erofs. >>>> >>>> There are pros and cons to both these approaches, and the discussion >>>> about their respective value has sometimes been heated. We would like >>>> to have an in-person discussion at the summit, ideally also involving >>>> more of the filesystem development community, so that we can reach >>>> some consensus on what is the best apporach. >>> >>> In order to better understand the behaviour and requirements of the >>> overlayfs+erofs approach I spent some time implementing direct support >>> for erofs in libcomposefs. So, with current HEAD of >>> github.com/containers/composefs you can now do: >>> >>> $ mkcompose --digest-store=objects --format=erofs source-dir image.erofs >> >> Thanks you for taking time on working on EROFS support. I don't have >> time to play with it yet since I'd like to work out erofs-utils 1.6 >> these days and will work on some new stuffs such as !pagesize block >> size as I said previously. >> >>> >>> This will produce an object store with the backing files, and a erofs >>> file with the required overlayfs xattrs, including a made up one >>> called "overlay.fs-verity" containing the expected fs-verity digest >>> for the lower dir. It also adds the required whiteouts to cover the >>> 00-ff dirs from the lower dir. >>> >>> These erofs files are ordered similarly to the composefs files, and we >>> give similar guarantees about their reproducibility, etc. So, they >>> should be apples-to-apples comparable with the composefs images. >>> >>> Given this, I ran another set of performance tests on the original cs9 >>> rootfs dataset, again measuring the time of `ls -lR`. 
I also tried to >>> measure the memory use like this: >>> >>> # echo 3 > /proc/sys/vm/drop_caches >>> # systemd-run --scope sh -c 'ls -lR mountpoint' > /dev/null; cat $(cat >>> /proc/self/cgroup | sed -e "s|0::|/sys/fs/cgroup|")/memory.peak' >>> >>> These are the alternatives I tried: >>> >>> xfs: the source of the image, regular dir on xfs >>> erofs: the image.erofs above, on loopback >>> erofs dio: the image.erofs above, on loopback with --direct-io=on >>> ovl: erofs above combined with overlayfs >>> ovl dio: erofs dio above combined with overlayfs >>> cfs: composefs mount of image.cfs >>> >>> All tests use the same objects dir, stored on xfs. The erofs and >>> overlay implementations are from a stock 6.1.13 kernel, and composefs >>> module is from github HEAD. >>> >>> I tried loopback both with and without the direct-io option, because >>> without direct-io enabled the kernel will double-cache the loopbacked >>> data, as per[1]. >>> >>> The produced images are: >>> 8.9M image.cfs >>> 11.3M image.erofs >>> >>> And gives these results: >>> | Cold cache | Warm cache | Mem use >>> | (msec) | (msec) | (mb) >>> -----------+------------+------------+--------- >>> xfs | 1449 | 442 | 54 >>> erofs | 700 | 391 | 45 >>> erofs dio | 939 | 400 | 45 >>> ovl | 1827 | 530 | 130 >>> ovl dio | 2156 | 531 | 130 >>> cfs | 689 | 389 | 51 >>> >>> I also ran the same tests in a VM that had the latest kernel including >>> the lazyfollow patches (ovl lazy in the table, not using direct-io), >>> this one ext4 based: >>> >>> | Cold cache | Warm cache | Mem use >>> | (msec) | (msec) | (mb) >>> -----------+------------+------------+--------- >>> ext4 | 1135 | 394 | 54 >>> erofs | 715 | 401 | 46 >>> erofs dio | 922 | 401 | 45 >>> ovl | 1412 | 515 | 148 >>> ovl dio | 1810 | 532 | 149 >>> ovl lazy | 1063 | 523 | 87 >>> cfs | 719 | 463 | 51 >>> >>> Things noticeable in the results: >>> >>> * composefs and erofs (by itself) perform roughly similar. This is >>> not necessarily news, and results from Jingbo Xu match this. >>> >>> * Erofs on top of direct-io enabled loopback causes quite a drop in >>> performance, which I don't really understand. Especially since its >>> reporting the same memory use as non-direct io. I guess the >>> double-cacheing in the later case isn't properly attributed to the >>> cgroup so the difference is not measured. However, why would the >>> double cache improve performance? Maybe I'm not completely >>> understanding how these things interact. >> >> We've already analysed the root cause of composefs is that composefs >> uses a kernel_read() to read its path while irrelevant metadata >> (such as dir data) is read together. Such heuristic readahead is a >> unusual stuff for all local fses (obviously almost all in-kernel >> filesystems don't use kernel_read() to read their metadata. Although >> some filesystems could readahead some related extent metadata when >> reading inode, they at least does _not_ work as kernel_read().) But >> double caching will introduce almost the same impact as kernel_read() >> (assuming you read some source code of loop device.) 
>> >> I do hope you already read what Jingbo's latest test results, and that >> test result shows how bad readahead performs if fs metadata is >> partially randomly used (stat < 1500 files): >> https://lore.kernel.org/r/83829005-3f12-afac-9d05-8ba721a80b4d@linux.alibaba.com >> >> Also you could explicitly _disable_ readahead for composefs >> manifiest file (because all EROFS metadata read is without >> readahead), and let's see how it works then. >> >> Again, if your workload is just "ls -lR". My answer is "just async >> readahead the whole manifest file / loop device together" when >> mounting. That will give the best result to you. But I'm not sure >> that is the real use case you propose. >> >>> >>> * Stacking overlay on top of erofs causes about 100msec slower >>> warm-cache times compared to all non-overlay approaches, and much >>> more in the cold cache case. The cold cache performance is helped >>> significantly by the lazyfollow patches, but the warm cache overhead >>> remains. >>> >>> * The use of overlayfs more than doubles memory use, probably >>> because of all the extra inodes and dentries in action for the >>> various layers. The lazyfollow patches helps, but only partially. >>> >>> * Even though overlayfs+erofs is slower than cfs and raw erofs, it is >>> not that much slower (~25%) than the pure xfs/ext4 directory, which >>> is a pretty good baseline for comparisons. It is even faster when >>> using lazyfollow on ext4. >>> >>> * The erofs images are slightly larger than the equivalent composefs >>> image. >>> >>> In summary: The performance of composefs is somewhat better than the >>> best erofs+ovl combination, although the overlay approach is not >>> significantly worse than the baseline of a regular directory, except >>> that it uses a bit more memory. >>> >>> On top of the above pure performance based comparisons I would like to >>> re-state some of the other advantages of composefs compared to the >>> overlay approach: >>> >>> * composefs is namespaceable, in the sense that you can use it (given >>> mount capabilities) inside a namespace (such as a container) without >>> access to non-namespaced resources like loopback or device-mapper >>> devices. (There was work on fixing this with loopfs, but that seems >>> to have stalled.) >>> >>> * While it is not in the current design, the simplicity of the format >>> and lack of loopback makes it at least theoretically possible that >>> composefs can be made usable in a rootless fashion at some point in >>> the future. >> Do you consider sending some commands to /dev/cachefiles to configure >> a daemonless dir and mount erofs image directly by using "erofs over >> fscache" but in a daemonless way? That is an ongoing stuff on our side. >> >> IMHO, I don't think file-based interfaces are quite a charmful stuff. >> Historically I recalled some practice is to "avoid directly reading >> files in kernel" so that I think almost all local fses don't work on >> files directl and loopback devices are all the ways for these use >> cases. If loopback devices are not okay to you, how about improving >> loopback devices and that will benefit to almost all local fses. >> >>> >>> And of course, there are disadvantages to composefs too. Primarily >>> being more code, increasing maintenance burden and risk of security >>> problems. Composefs is particularly burdensome because it is a >>> stacking filesystem and these have historically been shown to be hard >>> to get right. >>> >>> >>> The question now is what is the best approach overall? 
For my own >>> primary usecase of making a verifying ostree root filesystem, the >>> overlay approach (with the lazyfollow work finished) is, while not >>> ideal, good enough. >> >> So your judgement is still "ls -lR" and your use case is still just >> pure read-only and without writable stuff? >> >> Anyway, I'm really happy to work with you on your ostree use cases >> as always, as long as all corner cases work out by the community. >> >>> >>> But I know for the people who are more interested in using composefs >>> for containers the eventual goal of rootless support is very >>> important. So, on behalf of them I guess the question is: Is there >>> ever any chance that something like composefs could work rootlessly? >>> Or conversely: Is there some way to get rootless support from the >>> overlay approach? Opinions? Ideas? >> >> Honestly, I do want to get a proper answer when Giuseppe asked me >> the same question. My current view is simply "that question is >> almost the same for all in-kernel fses with some on-disk format". > > As far as I'm concerned filesystems with on-disk format will not be made > mountable by unprivileged containers. And I don't think I'm alone in > that view. The idea that ever more parts of the kernel with a massive > attack surface such as a filesystem need to vouchesafe for the safety in > the face of every rando having access to > unshare --mount --user --map-root is a dead end and will just end up > trapping us in a neverending cycle of security bugs (Because every > single bug that's found after making that fs mountable from an > unprivileged container will be treated as a security bug no matter if > justified or not. So this is also a good way to ruin your filesystem's > reputation.). > > And honestly, if we set the precedent that it's fine for one filesystem > with an on-disk format to be able to be mounted by unprivileged > containers then other filesystems eventually want to do this as well. > > At the rate we currently add filesystems that's just a matter of time > even if none of the existing ones would also want to do it. And then > we're left arguing that this was just an exception for one super > special, super safe, unexploitable filesystem with an on-disk format. Yes, +1. That's somewhat why I didn't answer immediately since I'd like to find a chance to get more people interested in EROFS so I hope it could be (somewhat) pointed out by other filesystem guys at that time. > > Imho, none of this is appealing. I don't want to slowly keep building a > future where we end up running fuzzers in unprivileged container to > generate random images to crash the kernel. Even fuzzers don't guarantee this unless we completely freeze the fs code, otherwise any useful improvement will need a much much deep and long long fuzzing in principle. I'm not sure even if it could catch release timing at all, and bug-free, honestly. > > I have more arguments why I don't think is a path we will ever go down > but I don't want this to detract from the legitimate ask of making it > possible to mount trusted images from within unprivileged containers. > Because I think that's perfectly legitimate. > > However, I don't think that this is something the kernel needs to solve > other than providing the necessary infrastructure so that this can be > solved in userspace. Yes, I think it's a principle as long as we have a way to do thing in userspace effectively. > > Off-list, Amir had pointed to a blog I wrote last week (cf. 
[1]) where I > explained how we currently mount into mount namespaces of unprivileged > cotainers which had been quite a difficult problem before the new mount > api. But now it's become almost comically trivial. I mean, there's stuff > that will still be good to have but overall all the bits are already > there. > > Imho, delegated mounting should be done by a system service that is > responsible for all the steps that require privileges. So for most > filesytems not mountable by unprivileged user this would amount to: > > fd_fs = fsopen("xfs") > fsconfig(FSCONFIG_SET_STRING, "source", "/sm/sm") > fsconfig(FSCONFIG_CMD_CREATE) > fd_mnt = fsmount(fd_fs) > // Only required for attributes that require privileges against the sb > // of the filesystem such as idmapped mounts > mount_setattr(fd_mnt, ...) > > and then the fd_mnt can be sent to the container which can then attach > it wherever it wants to. The system level service doesn't even need to > change namespaces via setns(fd_userns|fd_mntns) like I illustrated in > the post I did. It's sufficient if we sent it via AF_UNIX for example > that's exposed to the container. > > Of course, this system level service would be integrated with mount(8) > directly over a well-defined protocol. And this would be nestable as > well by e.g., bind-mounting the AF_UNIX socket. > > And we do already support a rudimentary form of such integration through > systemd. For example via mount -t ddi (cf. [2]) which makes it possible > to mount discoverable disk images (ddi). But that's just an > illustration. > > This should be integrated with mount(8) and should be a simply protocol > over varlink or another lightweight ipc mechanism that can be > implemented by systemd-mountd (which is how I coined this for lack of > imagination when I came up with this) or by some other component if > platforms like k8s really want to do their own thing. > > This also allows us to extend this feature to the whole system btw and > to all filesystems at once. Because it means that if systemd-mountd is > told what images to trust (based on location, from a specific registry, > signature, or whatever) then this isn't just useful for unprivileged > containers but also for regular users on the host that want to mount > stuff. > > This is what we're currently working on. > > (There's stuff that we can do to make this more powerful __if__ we need > to. One example would probably that we _could_ make it possible to mark > a superblock as being owned by a specific namespace with similar > permission checks as what we currently do for idmapped mounts > (privileged in the superblock of the fs, privileged over the ns to > delegate to etc). IOW, > > fd_fs = fsopen("xfs") > fsconfig(FSCONFIG_SET_STRING, "source", "/sm/sm") > fsconfig(FSCONFIG_SET_FD, "owner", fd_container_userns) > > which completely sidesteps the issue of making that on-disk filesystem > mountable by unpriv users. > > But let me say that this is completely unnecessary today as you can do: > > fd_fs = fsopen("xfs") > fsconfig(FSCONFIG_SET_STRING, "source", "/sm/sm") > fsconfig(FSCONFIG_CMD_CREATE) > fd_mnt = fsmount(fd_fs) > mount_setattr(fd_mnt, MOUNT_ATTR_IDMAP) > > which changes ownership across the whole filesystem. The only time you > really want what I mention here is if you want to delegate control over > __every single ioctl and potentially destructive operation associated > with that filesystem__ to an unprivileged container which is almost > never what you want.) Good to know this. 
I do hope it can be resolved by the userspace approach as you said. So is there any barrier to doing it like this, such that we would still have to bother with FS_USERNS_MOUNT for fses with an on-disk format? Your delegated control is a good thing at least on my side, and we hope some system-wide service can help with this since our cloud might need it in the future as well.

Thanks,
Gao Xiang

>
> [1]: https://brauner.io/2023/02/28/mounting-into-mount-namespaces.html
> [2]: https://github.com/systemd/systemd/pull/26695 ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay 2023-03-07 10:15 ` Christian Brauner 2023-03-07 11:03 ` Gao Xiang @ 2023-03-07 12:09 ` Alexander Larsson 2023-03-07 12:55 ` Gao Xiang 2023-03-07 15:16 ` Christian Brauner 2023-03-07 13:38 ` Jeff Layton 2 siblings, 2 replies; 42+ messages in thread From: Alexander Larsson @ 2023-03-07 12:09 UTC (permalink / raw) To: Christian Brauner Cc: Gao Xiang, lsf-pc, linux-fsdevel, Amir Goldstein, Jingbo Xu, Giuseppe Scrivano, Dave Chinner, Vivek Goyal, Miklos Szeredi On Tue, Mar 7, 2023 at 11:16 AM Christian Brauner <brauner@kernel.org> wrote: > > On Fri, Mar 03, 2023 at 11:13:51PM +0800, Gao Xiang wrote: > > Hi Alexander, > > > > On 2023/3/3 21:57, Alexander Larsson wrote: > > > On Mon, Feb 27, 2023 at 10:22 AM Alexander Larsson <alexl@redhat.com> wrote: > > > But I know for the people who are more interested in using composefs > > > for containers the eventual goal of rootless support is very > > > important. So, on behalf of them I guess the question is: Is there > > > ever any chance that something like composefs could work rootlessly? > > > Or conversely: Is there some way to get rootless support from the > > > overlay approach? Opinions? Ideas? > > > > Honestly, I do want to get a proper answer when Giuseppe asked me > > the same question. My current view is simply "that question is > > almost the same for all in-kernel fses with some on-disk format". > > As far as I'm concerned filesystems with on-disk format will not be made > mountable by unprivileged containers. And I don't think I'm alone in > that view. The idea that ever more parts of the kernel with a massive > attack surface such as a filesystem need to vouchesafe for the safety in > the face of every rando having access to > unshare --mount --user --map-root is a dead end and will just end up > trapping us in a neverending cycle of security bugs (Because every > single bug that's found after making that fs mountable from an > unprivileged container will be treated as a security bug no matter if > justified or not. So this is also a good way to ruin your filesystem's > reputation.). > > And honestly, if we set the precedent that it's fine for one filesystem > with an on-disk format to be able to be mounted by unprivileged > containers then other filesystems eventually want to do this as well. > > At the rate we currently add filesystems that's just a matter of time > even if none of the existing ones would also want to do it. And then > we're left arguing that this was just an exception for one super > special, super safe, unexploitable filesystem with an on-disk format. > > Imho, none of this is appealing. I don't want to slowly keep building a > future where we end up running fuzzers in unprivileged container to > generate random images to crash the kernel. > > I have more arguments why I don't think is a path we will ever go down > but I don't want this to detract from the legitimate ask of making it > possible to mount trusted images from within unprivileged containers. > Because I think that's perfectly legitimate. > > However, I don't think that this is something the kernel needs to solve > other than providing the necessary infrastructure so that this can be > solved in userspace. So, I completely understand this point of view. And, since I'm not really hearing any other viewpoint from the linux vfs developers it seems to be a shared opinion. 
So, it seems like further work on the kernel side of composefs isn't really useful anymore, and I will focus my work on the overlayfs side. Maybe we can even drop the summit topic to avoid a bunch of unnecessary travel?

That said, even though I understand (and even agree with) your worries, I feel it is kind of unfortunate that we end up with (essentially) a setuid helper approach for this. Because it feels like we're giving up on a useful feature (trustless unprivileged mounts) that the kernel could *theoretically* deliver, but a setuid helper can't. Sure, if you have a closed system you can limit what images can get mounted to images signed by a trusted key, but it won't work well for things like user-built images or publicly available images. Unfortunately practicalities kinda outweigh theoretical advantages.

--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Alexander Larsson Red Hat, Inc
alexl@redhat.com alexander.larsson@gmail.com ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay 2023-03-07 12:09 ` Alexander Larsson @ 2023-03-07 12:55 ` Gao Xiang 2023-03-07 15:16 ` Christian Brauner 1 sibling, 0 replies; 42+ messages in thread From: Gao Xiang @ 2023-03-07 12:55 UTC (permalink / raw) To: Alexander Larsson, Christian Brauner Cc: lsf-pc, linux-fsdevel, Amir Goldstein, Jingbo Xu, Giuseppe Scrivano, Dave Chinner, Vivek Goyal, Miklos Szeredi On 2023/3/7 20:09, Alexander Larsson wrote: > On Tue, Mar 7, 2023 at 11:16 AM Christian Brauner <brauner@kernel.org> wrote: >> >> On Fri, Mar 03, 2023 at 11:13:51PM +0800, Gao Xiang wrote: >>> Hi Alexander, >>> >>> On 2023/3/3 21:57, Alexander Larsson wrote: >>>> On Mon, Feb 27, 2023 at 10:22 AM Alexander Larsson <alexl@redhat.com> wrote: > >>>> But I know for the people who are more interested in using composefs >>>> for containers the eventual goal of rootless support is very >>>> important. So, on behalf of them I guess the question is: Is there >>>> ever any chance that something like composefs could work rootlessly? >>>> Or conversely: Is there some way to get rootless support from the >>>> overlay approach? Opinions? Ideas? >>> >>> Honestly, I do want to get a proper answer when Giuseppe asked me >>> the same question. My current view is simply "that question is >>> almost the same for all in-kernel fses with some on-disk format". >> >> As far as I'm concerned filesystems with on-disk format will not be made >> mountable by unprivileged containers. And I don't think I'm alone in >> that view. The idea that ever more parts of the kernel with a massive >> attack surface such as a filesystem need to vouchesafe for the safety in >> the face of every rando having access to >> unshare --mount --user --map-root is a dead end and will just end up >> trapping us in a neverending cycle of security bugs (Because every >> single bug that's found after making that fs mountable from an >> unprivileged container will be treated as a security bug no matter if >> justified or not. So this is also a good way to ruin your filesystem's >> reputation.). >> >> And honestly, if we set the precedent that it's fine for one filesystem >> with an on-disk format to be able to be mounted by unprivileged >> containers then other filesystems eventually want to do this as well. >> >> At the rate we currently add filesystems that's just a matter of time >> even if none of the existing ones would also want to do it. And then >> we're left arguing that this was just an exception for one super >> special, super safe, unexploitable filesystem with an on-disk format. >> >> Imho, none of this is appealing. I don't want to slowly keep building a >> future where we end up running fuzzers in unprivileged container to >> generate random images to crash the kernel. >> >> I have more arguments why I don't think is a path we will ever go down >> but I don't want this to detract from the legitimate ask of making it >> possible to mount trusted images from within unprivileged containers. >> Because I think that's perfectly legitimate. >> >> However, I don't think that this is something the kernel needs to solve >> other than providing the necessary infrastructure so that this can be >> solved in userspace. > > So, I completely understand this point of view. And, since I'm not > really hearing any other viewpoint from the linux vfs developers it > seems to be a shared opinion. So, it seems like further work on the > kernel side of composefs isn't really useful anymore, and I will focus > my work on the overlayfs side. 
> Maybe we can even drop the summit topic
> to avoid a bunch of unnecessary travel?

I am still looking forward to seeing you here, since I'd like to devote my time to working on anything which could make EROFS better and more useful (I'm always active in the Linux FS community.) Even if you folks finally decide not to give EROFS a chance, I'm still happy to get further input from you, since I think an immutable filesystem can do better and be more useful to the whole Linux ecosystem than the current status quo. I'm very sorry that I didn't have a chance to go to FOSDEM 23 due to unexpected travel visa issues for Belgium at that time.

>
> That said, even though I understand (and even agree) with your
> worries, I feel it is kind of unfortunate that we end up with
> (essentially) a setuid helper approach for this. Because it feels like
> we're giving up on a useful feature (trustless unprivileged mounts)
> that the kernel could *theoretically* deliver, but a setuid helper
> can't. Sure, if you have a closed system you can limit what images can
> get mounted to images signed by a trusted key, but it won't work well
> for things like user built images or publically available images.
> Unfortunately practicalities kinda outweigh theoretical advantages.

In principle, I think _trusted_ unprivileged mounts in the kernel could be done to some degree. But before that, it first needs very, very hard proof of why userspace cannot do this. As long as there is any possibility of doing it effectively in userspace, it's another story.

I'm against untrusted unprivileged mounts of actual on-disk formats all the time. Why? Because FUSE is a pure, simple protocol, and overlayfs uses very limited xattrs without on-disk data. If we have some filesystem with on-disk data, a problem in it doesn't just cause panics but also deadlocks, livelocks, DoS, or even memory corruption due to the on-disk format. In principle, we could freeze all the code without any feature enhancement, but that is hard in practice since users want useful new on-disk features all the time.

Thanks,
Gao Xiang

> ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay 2023-03-07 12:09 ` Alexander Larsson 2023-03-07 12:55 ` Gao Xiang @ 2023-03-07 15:16 ` Christian Brauner 2023-03-07 19:33 ` Giuseppe Scrivano 1 sibling, 1 reply; 42+ messages in thread From: Christian Brauner @ 2023-03-07 15:16 UTC (permalink / raw) To: Alexander Larsson Cc: Gao Xiang, lsf-pc, linux-fsdevel, Amir Goldstein, Jingbo Xu, Giuseppe Scrivano, Dave Chinner, Vivek Goyal, Miklos Szeredi, Seth Forshee On Tue, Mar 07, 2023 at 01:09:57PM +0100, Alexander Larsson wrote: > On Tue, Mar 7, 2023 at 11:16 AM Christian Brauner <brauner@kernel.org> wrote: > > > > On Fri, Mar 03, 2023 at 11:13:51PM +0800, Gao Xiang wrote: > > > Hi Alexander, > > > > > > On 2023/3/3 21:57, Alexander Larsson wrote: > > > > On Mon, Feb 27, 2023 at 10:22 AM Alexander Larsson <alexl@redhat.com> wrote: > > > > > But I know for the people who are more interested in using composefs > > > > for containers the eventual goal of rootless support is very > > > > important. So, on behalf of them I guess the question is: Is there > > > > ever any chance that something like composefs could work rootlessly? > > > > Or conversely: Is there some way to get rootless support from the > > > > overlay approach? Opinions? Ideas? > > > > > > Honestly, I do want to get a proper answer when Giuseppe asked me > > > the same question. My current view is simply "that question is > > > almost the same for all in-kernel fses with some on-disk format". > > > > As far as I'm concerned filesystems with on-disk format will not be made > > mountable by unprivileged containers. And I don't think I'm alone in > > that view. The idea that ever more parts of the kernel with a massive > > attack surface such as a filesystem need to vouchesafe for the safety in > > the face of every rando having access to > > unshare --mount --user --map-root is a dead end and will just end up > > trapping us in a neverending cycle of security bugs (Because every > > single bug that's found after making that fs mountable from an > > unprivileged container will be treated as a security bug no matter if > > justified or not. So this is also a good way to ruin your filesystem's > > reputation.). > > > > And honestly, if we set the precedent that it's fine for one filesystem > > with an on-disk format to be able to be mounted by unprivileged > > containers then other filesystems eventually want to do this as well. > > > > At the rate we currently add filesystems that's just a matter of time > > even if none of the existing ones would also want to do it. And then > > we're left arguing that this was just an exception for one super > > special, super safe, unexploitable filesystem with an on-disk format. > > > > Imho, none of this is appealing. I don't want to slowly keep building a > > future where we end up running fuzzers in unprivileged container to > > generate random images to crash the kernel. > > > > I have more arguments why I don't think is a path we will ever go down > > but I don't want this to detract from the legitimate ask of making it > > possible to mount trusted images from within unprivileged containers. > > Because I think that's perfectly legitimate. > > > > However, I don't think that this is something the kernel needs to solve > > other than providing the necessary infrastructure so that this can be > > solved in userspace. > > So, I completely understand this point of view. And, since I'm not > really hearing any other viewpoint from the linux vfs developers it > seems to be a shared opinion. 
> So, it seems like further work on the
> kernel side of composefs isn't really useful anymore, and I will focus
> my work on the overlayfs side. Maybe we can even drop the summit topic
> to avoid a bunch of unnecessary travel?
>
> That said, even though I understand (and even agree) with your
> worries, I feel it is kind of unfortunate that we end up with
> (essentially) a setuid helper approach for this. Because it feels like
> we're giving up on a useful feature (trustless unprivileged mounts)
> that the kernel could *theoretically* deliver, but a setuid helper
> can't. Sure, if you have a closed system you can limit what images can
> get mounted to images signed by a trusted key, but it won't work well
> for things like user built images or publically available images.
> Unfortunately practicalities kinda outweigh theoretical advantages.

Characterizing this as a setuid helper approach feels a bit like
negative branding. :)

But just in case there's a misunderstanding of any form, let me clarify
that systemd doesn't produce set*id binaries in any form; never has,
never will.

It's also good to remember that in order to even use unprivileged
containers with meaningful idmappings, __two__ set*id binaries -
new*idmap - with an extremely clunky, and frankly unusable, id
delegation policy expressed through these weird /etc/sub*id files have
to be used.  Which apparently everyone is happy to use.

What we're talking about here, however, is a first-class system service
capable of expressing meaningful security policy (e.g., image signed by
a key in the kernel keyring, polkit, ...).  And such well-scoped local
services are a good thing.

This mentality of shoving ever more functionality under
unshare --user --map-root really needs to take a good hard look at
itself.  Because it fundamentally assumes that unshare --user --map-root
is a sufficiently complex security policy to cover everything from
exposing complex network settings to complex filesystem settings to
unprivileged users.

To this day I'm not even sure whether having ramfs mountable by
unprivileged users isn't just a trivial DoS vector that nobody really
considers important enough.

(This is not aimed in any form at you, because I used to think that
this was a future worth building myself, but I think it's become
sufficiently clear that this just doesn't work, especially when our
expectations around security and integrity become ever greater.)

Fwiw, Lennart is in the middle of implementing this so we can showcase
this asap.

^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay 2023-03-07 15:16 ` Christian Brauner @ 2023-03-07 19:33 ` Giuseppe Scrivano 2023-03-08 10:31 ` Christian Brauner 0 siblings, 1 reply; 42+ messages in thread From: Giuseppe Scrivano @ 2023-03-07 19:33 UTC (permalink / raw) To: Christian Brauner Cc: Alexander Larsson, Gao Xiang, lsf-pc, linux-fsdevel, Amir Goldstein, Jingbo Xu, Dave Chinner, Vivek Goyal, Miklos Szeredi, Seth Forshee Christian Brauner <brauner@kernel.org> writes: > On Tue, Mar 07, 2023 at 01:09:57PM +0100, Alexander Larsson wrote: >> On Tue, Mar 7, 2023 at 11:16 AM Christian Brauner <brauner@kernel.org> wrote: >> > >> > On Fri, Mar 03, 2023 at 11:13:51PM +0800, Gao Xiang wrote: >> > > Hi Alexander, >> > > >> > > On 2023/3/3 21:57, Alexander Larsson wrote: >> > > > On Mon, Feb 27, 2023 at 10:22 AM Alexander Larsson <alexl@redhat.com> wrote: >> >> > > > But I know for the people who are more interested in using composefs >> > > > for containers the eventual goal of rootless support is very >> > > > important. So, on behalf of them I guess the question is: Is there >> > > > ever any chance that something like composefs could work rootlessly? >> > > > Or conversely: Is there some way to get rootless support from the >> > > > overlay approach? Opinions? Ideas? >> > > >> > > Honestly, I do want to get a proper answer when Giuseppe asked me >> > > the same question. My current view is simply "that question is >> > > almost the same for all in-kernel fses with some on-disk format". >> > >> > As far as I'm concerned filesystems with on-disk format will not be made >> > mountable by unprivileged containers. And I don't think I'm alone in >> > that view. The idea that ever more parts of the kernel with a massive >> > attack surface such as a filesystem need to vouchesafe for the safety in >> > the face of every rando having access to >> > unshare --mount --user --map-root is a dead end and will just end up >> > trapping us in a neverending cycle of security bugs (Because every >> > single bug that's found after making that fs mountable from an >> > unprivileged container will be treated as a security bug no matter if >> > justified or not. So this is also a good way to ruin your filesystem's >> > reputation.). >> > >> > And honestly, if we set the precedent that it's fine for one filesystem >> > with an on-disk format to be able to be mounted by unprivileged >> > containers then other filesystems eventually want to do this as well. >> > >> > At the rate we currently add filesystems that's just a matter of time >> > even if none of the existing ones would also want to do it. And then >> > we're left arguing that this was just an exception for one super >> > special, super safe, unexploitable filesystem with an on-disk format. >> > >> > Imho, none of this is appealing. I don't want to slowly keep building a >> > future where we end up running fuzzers in unprivileged container to >> > generate random images to crash the kernel. >> > >> > I have more arguments why I don't think is a path we will ever go down >> > but I don't want this to detract from the legitimate ask of making it >> > possible to mount trusted images from within unprivileged containers. >> > Because I think that's perfectly legitimate. >> > >> > However, I don't think that this is something the kernel needs to solve >> > other than providing the necessary infrastructure so that this can be >> > solved in userspace. >> >> So, I completely understand this point of view. 
And, since I'm not >> really hearing any other viewpoint from the linux vfs developers it >> seems to be a shared opinion. So, it seems like further work on the >> kernel side of composefs isn't really useful anymore, and I will focus >> my work on the overlayfs side. Maybe we can even drop the summit topic >> to avoid a bunch of unnecessary travel? >> >> That said, even though I understand (and even agree) with your >> worries, I feel it is kind of unfortunate that we end up with >> (essentially) a setuid helper approach for this. Because it feels like >> we're giving up on a useful feature (trustless unprivileged mounts) >> that the kernel could *theoretically* deliver, but a setuid helper >> can't. Sure, if you have a closed system you can limit what images can >> get mounted to images signed by a trusted key, but it won't work well >> for things like user built images or publically available images. >> Unfortunately practicalities kinda outweigh theoretical advantages. > > Characterzing this as a setuid helper approach feels a bit like negative > branding. :) > > But just in case there's a misunderstanding of any form let me clarify > that systemd doesn't produce set*id binaries in any form; never has, > never will. > > It's also good to remember that in order to even use unprivileged > containers with meaningful idmappings __two__ set*id binaries - > new*idmap - with an extremely clunky, and frankly unusable id delegation > policy expressed through these weird /etc/sub*id files have to be used. > Which apparently everyone is happy to use. > > What we're talking about here however is a first class system service > capable of expressing meaningful security policy (e.g., image signed by > a key in the kernel keyring, polkit, ...). And such well-scoped local > services are a good thing. there are some disadvantages too: - while the impact on system services is negligible, using the proposed approach could slow down container startup. It is somehow similar to the issue we currently have with cgroups, where manually creating a cgroup is faster than going through dbus and systemd. IMHO, the kernel could easily verify the image signature without relying on an additional userland service when mounting it from a user namespace. - it won't be usable from a containerized build system. It is common to build container images inside of a container (so that they can be built in a cluster). To use the systemd approach, we'll need to access systemd on the host from the container. > This mentality of shoving ever more functionality under > unshare --user --map-root needs to really take a good hard look at > itself. Because it fundamentally assumes that unshare --user --map-root > is a sufficiently complex security policy to cover everything from > exposing complex network settings to complex filesystem settings to > unprivileged users. > > To this day I'm not even sure if having ramfs mountable by unprivileged > users isn't just a trivial dos vector that just nobody really considers > important enough. > > (This is not aimed in any form at you because I used to think that this > is a future worth building in the past myself but I think it's become > sufficiently clear that this just doesn't work especially when our > expectations around security and integrity become ever greater.) > > Fwiw, Lennart is in the middle of implementing this so we can showcase > this asap. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay 2023-03-07 19:33 ` Giuseppe Scrivano @ 2023-03-08 10:31 ` Christian Brauner 0 siblings, 0 replies; 42+ messages in thread From: Christian Brauner @ 2023-03-08 10:31 UTC (permalink / raw) To: Giuseppe Scrivano Cc: Alexander Larsson, Gao Xiang, lsf-pc, linux-fsdevel, Amir Goldstein, Jingbo Xu, Dave Chinner, Vivek Goyal, Miklos Szeredi, Seth Forshee, Jeff Layton On Tue, Mar 07, 2023 at 08:33:29PM +0100, Giuseppe Scrivano wrote: > Christian Brauner <brauner@kernel.org> writes: > > > On Tue, Mar 07, 2023 at 01:09:57PM +0100, Alexander Larsson wrote: > >> On Tue, Mar 7, 2023 at 11:16 AM Christian Brauner <brauner@kernel.org> wrote: > >> > > >> > On Fri, Mar 03, 2023 at 11:13:51PM +0800, Gao Xiang wrote: > >> > > Hi Alexander, > >> > > > >> > > On 2023/3/3 21:57, Alexander Larsson wrote: > >> > > > On Mon, Feb 27, 2023 at 10:22 AM Alexander Larsson <alexl@redhat.com> wrote: > >> > >> > > > But I know for the people who are more interested in using composefs > >> > > > for containers the eventual goal of rootless support is very > >> > > > important. So, on behalf of them I guess the question is: Is there > >> > > > ever any chance that something like composefs could work rootlessly? > >> > > > Or conversely: Is there some way to get rootless support from the > >> > > > overlay approach? Opinions? Ideas? > >> > > > >> > > Honestly, I do want to get a proper answer when Giuseppe asked me > >> > > the same question. My current view is simply "that question is > >> > > almost the same for all in-kernel fses with some on-disk format". > >> > > >> > As far as I'm concerned filesystems with on-disk format will not be made > >> > mountable by unprivileged containers. And I don't think I'm alone in > >> > that view. The idea that ever more parts of the kernel with a massive > >> > attack surface such as a filesystem need to vouchesafe for the safety in > >> > the face of every rando having access to > >> > unshare --mount --user --map-root is a dead end and will just end up > >> > trapping us in a neverending cycle of security bugs (Because every > >> > single bug that's found after making that fs mountable from an > >> > unprivileged container will be treated as a security bug no matter if > >> > justified or not. So this is also a good way to ruin your filesystem's > >> > reputation.). > >> > > >> > And honestly, if we set the precedent that it's fine for one filesystem > >> > with an on-disk format to be able to be mounted by unprivileged > >> > containers then other filesystems eventually want to do this as well. > >> > > >> > At the rate we currently add filesystems that's just a matter of time > >> > even if none of the existing ones would also want to do it. And then > >> > we're left arguing that this was just an exception for one super > >> > special, super safe, unexploitable filesystem with an on-disk format. > >> > > >> > Imho, none of this is appealing. I don't want to slowly keep building a > >> > future where we end up running fuzzers in unprivileged container to > >> > generate random images to crash the kernel. > >> > > >> > I have more arguments why I don't think is a path we will ever go down > >> > but I don't want this to detract from the legitimate ask of making it > >> > possible to mount trusted images from within unprivileged containers. > >> > Because I think that's perfectly legitimate. 
> >> > > >> > However, I don't think that this is something the kernel needs to solve > >> > other than providing the necessary infrastructure so that this can be > >> > solved in userspace. > >> > >> So, I completely understand this point of view. And, since I'm not > >> really hearing any other viewpoint from the linux vfs developers it > >> seems to be a shared opinion. So, it seems like further work on the > >> kernel side of composefs isn't really useful anymore, and I will focus > >> my work on the overlayfs side. Maybe we can even drop the summit topic > >> to avoid a bunch of unnecessary travel? > >> > >> That said, even though I understand (and even agree) with your > >> worries, I feel it is kind of unfortunate that we end up with > >> (essentially) a setuid helper approach for this. Because it feels like > >> we're giving up on a useful feature (trustless unprivileged mounts) > >> that the kernel could *theoretically* deliver, but a setuid helper > >> can't. Sure, if you have a closed system you can limit what images can > >> get mounted to images signed by a trusted key, but it won't work well > >> for things like user built images or publically available images. > >> Unfortunately practicalities kinda outweigh theoretical advantages. > > > > Characterzing this as a setuid helper approach feels a bit like negative > > branding. :) > > > > But just in case there's a misunderstanding of any form let me clarify > > that systemd doesn't produce set*id binaries in any form; never has, > > never will. > > > > It's also good to remember that in order to even use unprivileged > > containers with meaningful idmappings __two__ set*id binaries - > > new*idmap - with an extremely clunky, and frankly unusable id delegation > > policy expressed through these weird /etc/sub*id files have to be used. > > Which apparently everyone is happy to use. > > > > What we're talking about here however is a first class system service > > capable of expressing meaningful security policy (e.g., image signed by > > a key in the kernel keyring, polkit, ...). And such well-scoped local > > services are a good thing. > > there are some disadvantages too: > > - while the impact on system services is negligible, using the proposed > approach could slow down container startup. > It is somehow similar to the issue we currently have with cgroups, > where manually creating a cgroup is faster than going through dbus and > systemd. IMHO, the kernel could easily verify the image signature This will use varlink, dbus would be optional and only be involved if a service wanted to use polkit for trust. Signatures would be the main way. Efficiency is ofc something that is on the forefront. That said, note that big chunks of mounting are serialized on namespace lock (mount propagation et al) and mount lock (properties, parent-child relationships, mountpoint etc.) already so it's not really that this a particularly fast operation. Mounting is expensive in the kernel especially with mount propagation in the mix. If you have a thousand containers all calling mount at the same time with mount propagation between them for a big mount tree that'll be costly. IOW, the cost for mounting isn't paid in userspace. > without relying on an additional userland service when mounting it > from a user namespace. > > - it won't be usable from a containerized build system. It is common to > build container images inside of a container (so that they can be > built in a cluster). 
> To use the systemd approach, we'll need to
> access systemd on the host from the container.

I don't see why that would be a problem; I consider it the proper
design, in fact.  And I've explained in the earlier mail that we even
have nesting in mind right away.

You've mentioned the cgroup delegation model above, and it is a good
example.  The whole point of pressure stall information (PSI) for the
memory controller, for example, is the realization that instead of
pushing the policy about how to handle memory pressure ever deeper into
the kernel, it's better to expose the necessary infrastructure to
userspace, which can then implement policies tailored to the workload.
The kernel isn't suited to expressing such fine-grained policies.  And
eBPF for containers will end up being managed in a similar way, with a
system service that implements the policy for attaching eBPF programs
to containers.

The mounting of filesystem images, network filesystems and so on is
imho a similar problem.  The policy for when a filesystem mount should
be allowed is something that at the end of the day belongs in a
userspace, system-level service.  The use cases are just too many, and
the filesystems too distinct and too complex, to be covered by the
kernel.

The advantage also is that with the system-level service we can extend
this ability to all filesystems at once and to regular users on the
system.  In order to give the security and resource guarantees that a
modern system needs, the various services need to integrate with one
another, and that may involve asking for privileged operations to be
performed.

^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay 2023-03-07 10:15 ` Christian Brauner 2023-03-07 11:03 ` Gao Xiang 2023-03-07 12:09 ` Alexander Larsson @ 2023-03-07 13:38 ` Jeff Layton 2023-03-08 10:37 ` Christian Brauner 2 siblings, 1 reply; 42+ messages in thread From: Jeff Layton @ 2023-03-07 13:38 UTC (permalink / raw) To: Christian Brauner, Gao Xiang Cc: Alexander Larsson, lsf-pc, linux-fsdevel, Amir Goldstein, Jingbo Xu, Giuseppe Scrivano, Dave Chinner, Vivek Goyal, Miklos Szeredi On Tue, 2023-03-07 at 11:15 +0100, Christian Brauner wrote: > On Fri, Mar 03, 2023 at 11:13:51PM +0800, Gao Xiang wrote: > > Hi Alexander, > > > > On 2023/3/3 21:57, Alexander Larsson wrote: > > > On Mon, Feb 27, 2023 at 10:22 AM Alexander Larsson <alexl@redhat.com> wrote: > > > > > > > > Hello, > > > > > > > > Recently Giuseppe Scrivano and I have worked on[1] and proposed[2] the > > > > Composefs filesystem. It is an opportunistically sharing, validating > > > > image-based filesystem, targeting usecases like validated ostree > > > > rootfs:es, validated container images that share common files, as well > > > > as other image based usecases. > > > > > > > > During the discussions in the composefs proposal (as seen on LWN[3]) > > > > is has been proposed that (with some changes to overlayfs), similar > > > > behaviour can be achieved by combining the overlayfs > > > > "overlay.redirect" xattr with an read-only filesystem such as erofs. > > > > > > > > There are pros and cons to both these approaches, and the discussion > > > > about their respective value has sometimes been heated. We would like > > > > to have an in-person discussion at the summit, ideally also involving > > > > more of the filesystem development community, so that we can reach > > > > some consensus on what is the best apporach. > > > > > > In order to better understand the behaviour and requirements of the > > > overlayfs+erofs approach I spent some time implementing direct support > > > for erofs in libcomposefs. So, with current HEAD of > > > github.com/containers/composefs you can now do: > > > > > > $ mkcompose --digest-store=objects --format=erofs source-dir image.erofs > > > > Thanks you for taking time on working on EROFS support. I don't have > > time to play with it yet since I'd like to work out erofs-utils 1.6 > > these days and will work on some new stuffs such as !pagesize block > > size as I said previously. > > > > > > > > This will produce an object store with the backing files, and a erofs > > > file with the required overlayfs xattrs, including a made up one > > > called "overlay.fs-verity" containing the expected fs-verity digest > > > for the lower dir. It also adds the required whiteouts to cover the > > > 00-ff dirs from the lower dir. > > > > > > These erofs files are ordered similarly to the composefs files, and we > > > give similar guarantees about their reproducibility, etc. So, they > > > should be apples-to-apples comparable with the composefs images. > > > > > > Given this, I ran another set of performance tests on the original cs9 > > > rootfs dataset, again measuring the time of `ls -lR`. 
I also tried to > > > measure the memory use like this: > > > > > > # echo 3 > /proc/sys/vm/drop_caches > > > # systemd-run --scope sh -c 'ls -lR mountpoint' > /dev/null; cat $(cat > > > /proc/self/cgroup | sed -e "s|0::|/sys/fs/cgroup|")/memory.peak' > > > > > > These are the alternatives I tried: > > > > > > xfs: the source of the image, regular dir on xfs > > > erofs: the image.erofs above, on loopback > > > erofs dio: the image.erofs above, on loopback with --direct-io=on > > > ovl: erofs above combined with overlayfs > > > ovl dio: erofs dio above combined with overlayfs > > > cfs: composefs mount of image.cfs > > > > > > All tests use the same objects dir, stored on xfs. The erofs and > > > overlay implementations are from a stock 6.1.13 kernel, and composefs > > > module is from github HEAD. > > > > > > I tried loopback both with and without the direct-io option, because > > > without direct-io enabled the kernel will double-cache the loopbacked > > > data, as per[1]. > > > > > > The produced images are: > > > 8.9M image.cfs > > > 11.3M image.erofs > > > > > > And gives these results: > > > | Cold cache | Warm cache | Mem use > > > | (msec) | (msec) | (mb) > > > -----------+------------+------------+--------- > > > xfs | 1449 | 442 | 54 > > > erofs | 700 | 391 | 45 > > > erofs dio | 939 | 400 | 45 > > > ovl | 1827 | 530 | 130 > > > ovl dio | 2156 | 531 | 130 > > > cfs | 689 | 389 | 51 > > > > > > I also ran the same tests in a VM that had the latest kernel including > > > the lazyfollow patches (ovl lazy in the table, not using direct-io), > > > this one ext4 based: > > > > > > | Cold cache | Warm cache | Mem use > > > | (msec) | (msec) | (mb) > > > -----------+------------+------------+--------- > > > ext4 | 1135 | 394 | 54 > > > erofs | 715 | 401 | 46 > > > erofs dio | 922 | 401 | 45 > > > ovl | 1412 | 515 | 148 > > > ovl dio | 1810 | 532 | 149 > > > ovl lazy | 1063 | 523 | 87 > > > cfs | 719 | 463 | 51 > > > > > > Things noticeable in the results: > > > > > > * composefs and erofs (by itself) perform roughly similar. This is > > > not necessarily news, and results from Jingbo Xu match this. > > > > > > * Erofs on top of direct-io enabled loopback causes quite a drop in > > > performance, which I don't really understand. Especially since its > > > reporting the same memory use as non-direct io. I guess the > > > double-cacheing in the later case isn't properly attributed to the > > > cgroup so the difference is not measured. However, why would the > > > double cache improve performance? Maybe I'm not completely > > > understanding how these things interact. > > > > We've already analysed the root cause of composefs is that composefs > > uses a kernel_read() to read its path while irrelevant metadata > > (such as dir data) is read together. Such heuristic readahead is a > > unusual stuff for all local fses (obviously almost all in-kernel > > filesystems don't use kernel_read() to read their metadata. Although > > some filesystems could readahead some related extent metadata when > > reading inode, they at least does _not_ work as kernel_read().) But > > double caching will introduce almost the same impact as kernel_read() > > (assuming you read some source code of loop device.) 
> > > > I do hope you already read what Jingbo's latest test results, and that > > test result shows how bad readahead performs if fs metadata is > > partially randomly used (stat < 1500 files): > > https://lore.kernel.org/r/83829005-3f12-afac-9d05-8ba721a80b4d@linux.alibaba.com > > > > Also you could explicitly _disable_ readahead for composefs > > manifiest file (because all EROFS metadata read is without > > readahead), and let's see how it works then. > > > > Again, if your workload is just "ls -lR". My answer is "just async > > readahead the whole manifest file / loop device together" when > > mounting. That will give the best result to you. But I'm not sure > > that is the real use case you propose. > > > > > > > > * Stacking overlay on top of erofs causes about 100msec slower > > > warm-cache times compared to all non-overlay approaches, and much > > > more in the cold cache case. The cold cache performance is helped > > > significantly by the lazyfollow patches, but the warm cache overhead > > > remains. > > > > > > * The use of overlayfs more than doubles memory use, probably > > > because of all the extra inodes and dentries in action for the > > > various layers. The lazyfollow patches helps, but only partially. > > > > > > * Even though overlayfs+erofs is slower than cfs and raw erofs, it is > > > not that much slower (~25%) than the pure xfs/ext4 directory, which > > > is a pretty good baseline for comparisons. It is even faster when > > > using lazyfollow on ext4. > > > > > > * The erofs images are slightly larger than the equivalent composefs > > > image. > > > > > > In summary: The performance of composefs is somewhat better than the > > > best erofs+ovl combination, although the overlay approach is not > > > significantly worse than the baseline of a regular directory, except > > > that it uses a bit more memory. > > > > > > On top of the above pure performance based comparisons I would like to > > > re-state some of the other advantages of composefs compared to the > > > overlay approach: > > > > > > * composefs is namespaceable, in the sense that you can use it (given > > > mount capabilities) inside a namespace (such as a container) without > > > access to non-namespaced resources like loopback or device-mapper > > > devices. (There was work on fixing this with loopfs, but that seems > > > to have stalled.) > > > > > > * While it is not in the current design, the simplicity of the format > > > and lack of loopback makes it at least theoretically possible that > > > composefs can be made usable in a rootless fashion at some point in > > > the future. > > Do you consider sending some commands to /dev/cachefiles to configure > > a daemonless dir and mount erofs image directly by using "erofs over > > fscache" but in a daemonless way? That is an ongoing stuff on our side. > > > > IMHO, I don't think file-based interfaces are quite a charmful stuff. > > Historically I recalled some practice is to "avoid directly reading > > files in kernel" so that I think almost all local fses don't work on > > files directl and loopback devices are all the ways for these use > > cases. If loopback devices are not okay to you, how about improving > > loopback devices and that will benefit to almost all local fses. > > > > > > > > And of course, there are disadvantages to composefs too. Primarily > > > being more code, increasing maintenance burden and risk of security > > > problems. 
Composefs is particularly burdensome because it is a > > > stacking filesystem and these have historically been shown to be hard > > > to get right. > > > > > > > > > The question now is what is the best approach overall? For my own > > > primary usecase of making a verifying ostree root filesystem, the > > > overlay approach (with the lazyfollow work finished) is, while not > > > ideal, good enough. > > > > So your judgement is still "ls -lR" and your use case is still just > > pure read-only and without writable stuff? > > > > Anyway, I'm really happy to work with you on your ostree use cases > > as always, as long as all corner cases work out by the community. > > > > > > > > But I know for the people who are more interested in using composefs > > > for containers the eventual goal of rootless support is very > > > important. So, on behalf of them I guess the question is: Is there > > > ever any chance that something like composefs could work rootlessly? > > > Or conversely: Is there some way to get rootless support from the > > > overlay approach? Opinions? Ideas? > > > > Honestly, I do want to get a proper answer when Giuseppe asked me > > the same question. My current view is simply "that question is > > almost the same for all in-kernel fses with some on-disk format". > > As far as I'm concerned filesystems with on-disk format will not be made > mountable by unprivileged containers. And I don't think I'm alone in > that view. > You're absolutely not alone in that view. This is even more unsafe with network and clustered filesystems, as you're trusting remote hardware that is accessible by other users than just the local host. We have had long-standing open requests to allow unprivileged users to mount arbitrary remote filesystems, and I've never seen a way to do that safely. > The idea that ever more parts of the kernel with a massive > attack surface such as a filesystem need to vouchesafe for the safety in > the face of every rando having access to > unshare --mount --user --map-root is a dead end and will just end up > trapping us in a neverending cycle of security bugs (Because every > single bug that's found after making that fs mountable from an > unprivileged container will be treated as a security bug no matter if > justified or not. So this is also a good way to ruin your filesystem's > reputation.). > > And honestly, if we set the precedent that it's fine for one filesystem > with an on-disk format to be able to be mounted by unprivileged > containers then other filesystems eventually want to do this as well. > > At the rate we currently add filesystems that's just a matter of time > even if none of the existing ones would also want to do it. And then > we're left arguing that this was just an exception for one super > special, super safe, unexploitable filesystem with an on-disk format. > > Imho, none of this is appealing. I don't want to slowly keep building a > future where we end up running fuzzers in unprivileged container to > generate random images to crash the kernel. > > I have more arguments why I don't think is a path we will ever go down > but I don't want this to detract from the legitimate ask of making it > possible to mount trusted images from within unprivileged containers. > Because I think that's perfectly legitimate. > > However, I don't think that this is something the kernel needs to solve > other than providing the necessary infrastructure so that this can be > solved in userspace. > > Off-list, Amir had pointed to a blog I wrote last week (cf. 
[1]) where I > explained how we currently mount into mount namespaces of unprivileged > cotainers which had been quite a difficult problem before the new mount > api. But now it's become almost comically trivial. I mean, there's stuff > that will still be good to have but overall all the bits are already > there. > > Imho, delegated mounting should be done by a system service that is > responsible for all the steps that require privileges. So for most > filesytems not mountable by unprivileged user this would amount to: > > fd_fs = fsopen("xfs") > fsconfig(FSCONFIG_SET_STRING, "source", "/sm/sm") > fsconfig(FSCONFIG_CMD_CREATE) > fd_mnt = fsmount(fd_fs) > // Only required for attributes that require privileges against the sb > // of the filesystem such as idmapped mounts > mount_setattr(fd_mnt, ...) > > and then the fd_mnt can be sent to the container which can then attach > it wherever it wants to. The system level service doesn't even need to > change namespaces via setns(fd_userns|fd_mntns) like I illustrated in > the post I did. It's sufficient if we sent it via AF_UNIX for example > that's exposed to the container. > > Of course, this system level service would be integrated with mount(8) > directly over a well-defined protocol. And this would be nestable as > well by e.g., bind-mounting the AF_UNIX socket. > > And we do already support a rudimentary form of such integration through > systemd. For example via mount -t ddi (cf. [2]) which makes it possible > to mount discoverable disk images (ddi). But that's just an > illustration. > > This should be integrated with mount(8) and should be a simply protocol > over varlink or another lightweight ipc mechanism that can be > implemented by systemd-mountd (which is how I coined this for lack of > imagination when I came up with this) or by some other component if > platforms like k8s really want to do their own thing. > > This also allows us to extend this feature to the whole system btw and > to all filesystems at once. Because it means that if systemd-mountd is > told what images to trust (based on location, from a specific registry, > signature, or whatever) then this isn't just useful for unprivileged > containers but also for regular users on the host that want to mount > stuff. > > This is what we're currently working on. > This is a very cool idea, and sounds like a reasonable way forward. I'd be interested to hear more about this (and in particular what sort of security model and use-cases you envision for this). > (There's stuff that we can do to make this more powerful __if__ we need > to. One example would probably that we _could_ make it possible to mark > a superblock as being owned by a specific namespace with similar > permission checks as what we currently do for idmapped mounts > (privileged in the superblock of the fs, privileged over the ns to > delegate to etc). IOW, > > fd_fs = fsopen("xfs") > fsconfig(FSCONFIG_SET_STRING, "source", "/sm/sm") > fsconfig(FSCONFIG_SET_FD, "owner", fd_container_userns) > > which completely sidesteps the issue of making that on-disk filesystem > mountable by unpriv users. > > But let me say that this is completely unnecessary today as you can do: > > fd_fs = fsopen("xfs") > fsconfig(FSCONFIG_SET_STRING, "source", "/sm/sm") > fsconfig(FSCONFIG_CMD_CREATE) > fd_mnt = fsmount(fd_fs) > mount_setattr(fd_mnt, MOUNT_ATTR_IDMAP) > > which changes ownership across the whole filesystem. 
The only time you > really want what I mention here is if you want to delegate control over > __every single ioctl and potentially destructive operation associated > with that filesystem__ to an unprivileged container which is almost > never what you want.) > > [1]: https://brauner.io/2023/02/28/mounting-into-mount-namespaces.html > [2]: https://github.com/systemd/systemd/pull/26695 -- Jeff Layton <jlayton@kernel.org> ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay 2023-03-07 13:38 ` Jeff Layton @ 2023-03-08 10:37 ` Christian Brauner 0 siblings, 0 replies; 42+ messages in thread From: Christian Brauner @ 2023-03-08 10:37 UTC (permalink / raw) To: Jeff Layton Cc: Gao Xiang, Alexander Larsson, lsf-pc, linux-fsdevel, Amir Goldstein, Jingbo Xu, Giuseppe Scrivano, Dave Chinner, Vivek Goyal, Miklos Szeredi On Tue, Mar 07, 2023 at 08:38:58AM -0500, Jeff Layton wrote: > On Tue, 2023-03-07 at 11:15 +0100, Christian Brauner wrote: > > On Fri, Mar 03, 2023 at 11:13:51PM +0800, Gao Xiang wrote: > > > Hi Alexander, > > > > > > On 2023/3/3 21:57, Alexander Larsson wrote: > > > > On Mon, Feb 27, 2023 at 10:22 AM Alexander Larsson <alexl@redhat.com> wrote: > > > > > > > > > > Hello, > > > > > > > > > > Recently Giuseppe Scrivano and I have worked on[1] and proposed[2] the > > > > > Composefs filesystem. It is an opportunistically sharing, validating > > > > > image-based filesystem, targeting usecases like validated ostree > > > > > rootfs:es, validated container images that share common files, as well > > > > > as other image based usecases. > > > > > > > > > > During the discussions in the composefs proposal (as seen on LWN[3]) > > > > > is has been proposed that (with some changes to overlayfs), similar > > > > > behaviour can be achieved by combining the overlayfs > > > > > "overlay.redirect" xattr with an read-only filesystem such as erofs. > > > > > > > > > > There are pros and cons to both these approaches, and the discussion > > > > > about their respective value has sometimes been heated. We would like > > > > > to have an in-person discussion at the summit, ideally also involving > > > > > more of the filesystem development community, so that we can reach > > > > > some consensus on what is the best apporach. > > > > > > > > In order to better understand the behaviour and requirements of the > > > > overlayfs+erofs approach I spent some time implementing direct support > > > > for erofs in libcomposefs. So, with current HEAD of > > > > github.com/containers/composefs you can now do: > > > > > > > > $ mkcompose --digest-store=objects --format=erofs source-dir image.erofs > > > > > > Thanks you for taking time on working on EROFS support. I don't have > > > time to play with it yet since I'd like to work out erofs-utils 1.6 > > > these days and will work on some new stuffs such as !pagesize block > > > size as I said previously. > > > > > > > > > > > This will produce an object store with the backing files, and a erofs > > > > file with the required overlayfs xattrs, including a made up one > > > > called "overlay.fs-verity" containing the expected fs-verity digest > > > > for the lower dir. It also adds the required whiteouts to cover the > > > > 00-ff dirs from the lower dir. > > > > > > > > These erofs files are ordered similarly to the composefs files, and we > > > > give similar guarantees about their reproducibility, etc. So, they > > > > should be apples-to-apples comparable with the composefs images. > > > > > > > > Given this, I ran another set of performance tests on the original cs9 > > > > rootfs dataset, again measuring the time of `ls -lR`. 
I also tried to > > > > measure the memory use like this: > > > > > > > > # echo 3 > /proc/sys/vm/drop_caches > > > > # systemd-run --scope sh -c 'ls -lR mountpoint' > /dev/null; cat $(cat > > > > /proc/self/cgroup | sed -e "s|0::|/sys/fs/cgroup|")/memory.peak' > > > > > > > > These are the alternatives I tried: > > > > > > > > xfs: the source of the image, regular dir on xfs > > > > erofs: the image.erofs above, on loopback > > > > erofs dio: the image.erofs above, on loopback with --direct-io=on > > > > ovl: erofs above combined with overlayfs > > > > ovl dio: erofs dio above combined with overlayfs > > > > cfs: composefs mount of image.cfs > > > > > > > > All tests use the same objects dir, stored on xfs. The erofs and > > > > overlay implementations are from a stock 6.1.13 kernel, and composefs > > > > module is from github HEAD. > > > > > > > > I tried loopback both with and without the direct-io option, because > > > > without direct-io enabled the kernel will double-cache the loopbacked > > > > data, as per[1]. > > > > > > > > The produced images are: > > > > 8.9M image.cfs > > > > 11.3M image.erofs > > > > > > > > And gives these results: > > > > | Cold cache | Warm cache | Mem use > > > > | (msec) | (msec) | (mb) > > > > -----------+------------+------------+--------- > > > > xfs | 1449 | 442 | 54 > > > > erofs | 700 | 391 | 45 > > > > erofs dio | 939 | 400 | 45 > > > > ovl | 1827 | 530 | 130 > > > > ovl dio | 2156 | 531 | 130 > > > > cfs | 689 | 389 | 51 > > > > > > > > I also ran the same tests in a VM that had the latest kernel including > > > > the lazyfollow patches (ovl lazy in the table, not using direct-io), > > > > this one ext4 based: > > > > > > > > | Cold cache | Warm cache | Mem use > > > > | (msec) | (msec) | (mb) > > > > -----------+------------+------------+--------- > > > > ext4 | 1135 | 394 | 54 > > > > erofs | 715 | 401 | 46 > > > > erofs dio | 922 | 401 | 45 > > > > ovl | 1412 | 515 | 148 > > > > ovl dio | 1810 | 532 | 149 > > > > ovl lazy | 1063 | 523 | 87 > > > > cfs | 719 | 463 | 51 > > > > > > > > Things noticeable in the results: > > > > > > > > * composefs and erofs (by itself) perform roughly similar. This is > > > > not necessarily news, and results from Jingbo Xu match this. > > > > > > > > * Erofs on top of direct-io enabled loopback causes quite a drop in > > > > performance, which I don't really understand. Especially since its > > > > reporting the same memory use as non-direct io. I guess the > > > > double-cacheing in the later case isn't properly attributed to the > > > > cgroup so the difference is not measured. However, why would the > > > > double cache improve performance? Maybe I'm not completely > > > > understanding how these things interact. > > > > > > We've already analysed the root cause of composefs is that composefs > > > uses a kernel_read() to read its path while irrelevant metadata > > > (such as dir data) is read together. Such heuristic readahead is a > > > unusual stuff for all local fses (obviously almost all in-kernel > > > filesystems don't use kernel_read() to read their metadata. Although > > > some filesystems could readahead some related extent metadata when > > > reading inode, they at least does _not_ work as kernel_read().) But > > > double caching will introduce almost the same impact as kernel_read() > > > (assuming you read some source code of loop device.) 
> > > > > > I do hope you already read what Jingbo's latest test results, and that > > > test result shows how bad readahead performs if fs metadata is > > > partially randomly used (stat < 1500 files): > > > https://lore.kernel.org/r/83829005-3f12-afac-9d05-8ba721a80b4d@linux.alibaba.com > > > > > > Also you could explicitly _disable_ readahead for composefs > > > manifiest file (because all EROFS metadata read is without > > > readahead), and let's see how it works then. > > > > > > Again, if your workload is just "ls -lR". My answer is "just async > > > readahead the whole manifest file / loop device together" when > > > mounting. That will give the best result to you. But I'm not sure > > > that is the real use case you propose. > > > > > > > > > > > * Stacking overlay on top of erofs causes about 100msec slower > > > > warm-cache times compared to all non-overlay approaches, and much > > > > more in the cold cache case. The cold cache performance is helped > > > > significantly by the lazyfollow patches, but the warm cache overhead > > > > remains. > > > > > > > > * The use of overlayfs more than doubles memory use, probably > > > > because of all the extra inodes and dentries in action for the > > > > various layers. The lazyfollow patches helps, but only partially. > > > > > > > > * Even though overlayfs+erofs is slower than cfs and raw erofs, it is > > > > not that much slower (~25%) than the pure xfs/ext4 directory, which > > > > is a pretty good baseline for comparisons. It is even faster when > > > > using lazyfollow on ext4. > > > > > > > > * The erofs images are slightly larger than the equivalent composefs > > > > image. > > > > > > > > In summary: The performance of composefs is somewhat better than the > > > > best erofs+ovl combination, although the overlay approach is not > > > > significantly worse than the baseline of a regular directory, except > > > > that it uses a bit more memory. > > > > > > > > On top of the above pure performance based comparisons I would like to > > > > re-state some of the other advantages of composefs compared to the > > > > overlay approach: > > > > > > > > * composefs is namespaceable, in the sense that you can use it (given > > > > mount capabilities) inside a namespace (such as a container) without > > > > access to non-namespaced resources like loopback or device-mapper > > > > devices. (There was work on fixing this with loopfs, but that seems > > > > to have stalled.) > > > > > > > > * While it is not in the current design, the simplicity of the format > > > > and lack of loopback makes it at least theoretically possible that > > > > composefs can be made usable in a rootless fashion at some point in > > > > the future. > > > Do you consider sending some commands to /dev/cachefiles to configure > > > a daemonless dir and mount erofs image directly by using "erofs over > > > fscache" but in a daemonless way? That is an ongoing stuff on our side. > > > > > > IMHO, I don't think file-based interfaces are quite a charmful stuff. > > > Historically I recalled some practice is to "avoid directly reading > > > files in kernel" so that I think almost all local fses don't work on > > > files directl and loopback devices are all the ways for these use > > > cases. If loopback devices are not okay to you, how about improving > > > loopback devices and that will benefit to almost all local fses. > > > > > > > > > > > And of course, there are disadvantages to composefs too. 
Primarily > > > > being more code, increasing maintenance burden and risk of security > > > > problems. Composefs is particularly burdensome because it is a > > > > stacking filesystem and these have historically been shown to be hard > > > > to get right. > > > > > > > > > > > > The question now is what is the best approach overall? For my own > > > > primary usecase of making a verifying ostree root filesystem, the > > > > overlay approach (with the lazyfollow work finished) is, while not > > > > ideal, good enough. > > > > > > So your judgement is still "ls -lR" and your use case is still just > > > pure read-only and without writable stuff? > > > > > > Anyway, I'm really happy to work with you on your ostree use cases > > > as always, as long as all corner cases work out by the community. > > > > > > > > > > > But I know for the people who are more interested in using composefs > > > > for containers the eventual goal of rootless support is very > > > > important. So, on behalf of them I guess the question is: Is there > > > > ever any chance that something like composefs could work rootlessly? > > > > Or conversely: Is there some way to get rootless support from the > > > > overlay approach? Opinions? Ideas? > > > > > > Honestly, I do want to get a proper answer when Giuseppe asked me > > > the same question. My current view is simply "that question is > > > almost the same for all in-kernel fses with some on-disk format". > > > > As far as I'm concerned filesystems with on-disk format will not be made > > mountable by unprivileged containers. And I don't think I'm alone in > > that view. > > > > You're absolutely not alone in that view. This is even more unsafe with > network and clustered filesystems, as you're trusting remote hardware > that is accessible by other users than just the local host. We have had > long-standing open requests to allow unprivileged users to mount > arbitrary remote filesystems, and I've never seen a way to do that > safely. > > > The idea that ever more parts of the kernel with a massive > > attack surface such as a filesystem need to vouchesafe for the safety in > > the face of every rando having access to > > unshare --mount --user --map-root is a dead end and will just end up > > trapping us in a neverending cycle of security bugs (Because every > > single bug that's found after making that fs mountable from an > > unprivileged container will be treated as a security bug no matter if > > justified or not. So this is also a good way to ruin your filesystem's > > reputation.). > > > > And honestly, if we set the precedent that it's fine for one filesystem > > with an on-disk format to be able to be mounted by unprivileged > > containers then other filesystems eventually want to do this as well. > > > > At the rate we currently add filesystems that's just a matter of time > > even if none of the existing ones would also want to do it. And then > > we're left arguing that this was just an exception for one super > > special, super safe, unexploitable filesystem with an on-disk format. > > > > Imho, none of this is appealing. I don't want to slowly keep building a > > future where we end up running fuzzers in unprivileged container to > > generate random images to crash the kernel. > > > > I have more arguments why I don't think is a path we will ever go down > > but I don't want this to detract from the legitimate ask of making it > > possible to mount trusted images from within unprivileged containers. > > Because I think that's perfectly legitimate. 
> > > > However, I don't think that this is something the kernel needs to solve > > other than providing the necessary infrastructure so that this can be > > solved in userspace. > > > > Off-list, Amir had pointed to a blog I wrote last week (cf. [1]) where I > > explained how we currently mount into mount namespaces of unprivileged > > cotainers which had been quite a difficult problem before the new mount > > api. But now it's become almost comically trivial. I mean, there's stuff > > that will still be good to have but overall all the bits are already > > there. > > > > Imho, delegated mounting should be done by a system service that is > > responsible for all the steps that require privileges. So for most > > filesytems not mountable by unprivileged user this would amount to: > > > > fd_fs = fsopen("xfs") > > fsconfig(FSCONFIG_SET_STRING, "source", "/sm/sm") > > fsconfig(FSCONFIG_CMD_CREATE) > > fd_mnt = fsmount(fd_fs) > > // Only required for attributes that require privileges against the sb > > // of the filesystem such as idmapped mounts > > mount_setattr(fd_mnt, ...) > > > > and then the fd_mnt can be sent to the container which can then attach > > it wherever it wants to. The system level service doesn't even need to > > change namespaces via setns(fd_userns|fd_mntns) like I illustrated in > > the post I did. It's sufficient if we sent it via AF_UNIX for example > > that's exposed to the container. > > > > Of course, this system level service would be integrated with mount(8) > > directly over a well-defined protocol. And this would be nestable as > > well by e.g., bind-mounting the AF_UNIX socket. > > > > And we do already support a rudimentary form of such integration through > > systemd. For example via mount -t ddi (cf. [2]) which makes it possible > > to mount discoverable disk images (ddi). But that's just an > > illustration. > > > > This should be integrated with mount(8) and should be a simply protocol > > over varlink or another lightweight ipc mechanism that can be > > implemented by systemd-mountd (which is how I coined this for lack of > > imagination when I came up with this) or by some other component if > > platforms like k8s really want to do their own thing. > > > > This also allows us to extend this feature to the whole system btw and > > to all filesystems at once. Because it means that if systemd-mountd is > > told what images to trust (based on location, from a specific registry, > > signature, or whatever) then this isn't just useful for unprivileged > > containers but also for regular users on the host that want to mount > > stuff. > > > > This is what we're currently working on. > > > > This is a very cool idea, and sounds like a reasonable way forward. I'd > be interested to hear more about this (and in particular what sort of > security model and use-cases you envision for this). I convinced Lennart to put this on the top of his todo so he'll hopefully finish the first implementation within the next week and put up a PR. By LSFMM we should be able to demo this. ^ permalink raw reply [flat|nested] 42+ messages in thread
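For completeness, a sketch of the receiving side inside the container,
under the same assumptions (the bind-mounted AF_UNIX socket mentioned
in the quoted mail, illustrative paths, abbreviated error handling): it
takes the mount fd off the socket and attaches it with move_mount(),
which requires no privileges beyond being privileged in the container's
own user and mount namespace.

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/mount.h>
#include <sys/syscall.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <string.h>
#include <unistd.h>

/* Receive an fd sent with SCM_RIGHTS over the delegated AF_UNIX socket. */
static int recv_fd(int sock)
{
        char c;
        struct iovec iov = { .iov_base = &c, .iov_len = 1 };
        char cbuf[CMSG_SPACE(sizeof(int))];
        struct msghdr msg = {
                .msg_iov = &iov, .msg_iovlen = 1,
                .msg_control = cbuf, .msg_controllen = sizeof(cbuf),
        };
        struct cmsghdr *cmsg;
        int fd = -1;

        if (recvmsg(sock, &msg, 0) < 0)
                return -1;

        cmsg = CMSG_FIRSTHDR(&msg);
        if (cmsg && cmsg->cmsg_type == SCM_RIGHTS)
                memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
        return fd;
}

/* Container side: attach the delegated mount wherever it wants to. */
int attach_delegated_mount(int sock, const char *where)
{
        int fd_mnt = recv_fd(sock);
        int ret;

        if (fd_mnt < 0)
                return -1;

        /* "where" is an example target, e.g. "/mnt/image". */
        ret = syscall(SYS_move_mount, fd_mnt, "", AT_FDCWD, where,
                      MOVE_MOUNT_F_EMPTY_PATH);
        close(fd_mnt);
        return ret;
}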
* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay 2023-03-03 13:57 ` Alexander Larsson 2023-03-03 15:13 ` Gao Xiang @ 2023-03-04 0:46 ` Jingbo Xu 2023-03-06 11:33 ` Alexander Larsson 2 siblings, 0 replies; 42+ messages in thread From: Jingbo Xu @ 2023-03-04 0:46 UTC (permalink / raw) To: Alexander Larsson, lsf-pc Cc: linux-fsdevel, Amir Goldstein, Christian Brauner, Gao Xiang, Giuseppe Scrivano, Dave Chinner, Vivek Goyal, Miklos Szeredi On 3/3/23 9:57 PM, Alexander Larsson wrote: > On Mon, Feb 27, 2023 at 10:22 AM Alexander Larsson <alexl@redhat.com> wrote: > > * Erofs on top of direct-io enabled loopback causes quite a drop in > performance, which I don't really understand. Especially since its > reporting the same memory use as non-direct io. I guess the > double-cacheing in the later case isn't properly attributed to the > cgroup so the difference is not measured. However, why would the > double cache improve performance? Maybe I'm not completely > understanding how these things interact. > Loop in BUFFERED mode actually calls .read_iter() of the backing file to read from it, e.g. ext4_file_read_iter()->generic_file_read_iter(), where heuristic readahead is also done. -- Thanks, Jingbo ^ permalink raw reply [flat|nested] 42+ messages in thread
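As a rough illustration of the difference Jingbo describes, the sketch
below is approximately what losetup --direct-io=on does for the "erofs
dio" rows in the tables: attach the image to a free loop device and
switch it to direct I/O, so loop reads bypass the backing file's page
cache and its .read_iter() readahead.  Paths are examples, error
handling is abbreviated, and LOOP_SET_DIRECT_IO can fail if the backing
filesystem or block sizes don't support O_DIRECT.

#include <fcntl.h>
#include <linux/loop.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Attach an image to a free loop device with direct I/O enabled. */
int attach_loop_dio(const char *image, char *loopdev, size_t len)
{
        int ctrl, devnr, backing, loopfd;

        ctrl = open("/dev/loop-control", O_RDWR);
        devnr = ioctl(ctrl, LOOP_CTL_GET_FREE);
        close(ctrl);
        if (devnr < 0)
                return -1;

        snprintf(loopdev, len, "/dev/loop%d", devnr);
        loopfd = open(loopdev, O_RDWR);
        backing = open(image, O_RDONLY);   /* read-only image, e.g. erofs */

        if (ioctl(loopfd, LOOP_SET_FD, backing) < 0)
                return -1;

        /*
         * The --direct-io=on part: loop now reads the backing file with
         * O_DIRECT, avoiding the double caching (and the backing file
         * readahead) discussed in this thread.
         */
        if (ioctl(loopfd, LOOP_SET_DIRECT_IO, 1UL) < 0)
                return -1;

        close(backing);
        close(loopfd);
        return 0;
}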
* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay 2023-03-03 13:57 ` Alexander Larsson 2023-03-03 15:13 ` Gao Xiang 2023-03-04 0:46 ` Jingbo Xu @ 2023-03-06 11:33 ` Alexander Larsson 2023-03-06 12:15 ` Gao Xiang 2023-03-06 15:49 ` Jingbo Xu 2 siblings, 2 replies; 42+ messages in thread From: Alexander Larsson @ 2023-03-06 11:33 UTC (permalink / raw) To: lsf-pc Cc: linux-fsdevel, Amir Goldstein, Christian Brauner, Jingbo Xu, Gao Xiang, Giuseppe Scrivano, Dave Chinner, Vivek Goyal, Miklos Szeredi On Fri, Mar 3, 2023 at 2:57 PM Alexander Larsson <alexl@redhat.com> wrote: > > On Mon, Feb 27, 2023 at 10:22 AM Alexander Larsson <alexl@redhat.com> wrote: > > > > Hello, > > > > Recently Giuseppe Scrivano and I have worked on[1] and proposed[2] the > > Composefs filesystem. It is an opportunistically sharing, validating > > image-based filesystem, targeting usecases like validated ostree > > rootfs:es, validated container images that share common files, as well > > as other image based usecases. > > > > During the discussions in the composefs proposal (as seen on LWN[3]) > > is has been proposed that (with some changes to overlayfs), similar > > behaviour can be achieved by combining the overlayfs > > "overlay.redirect" xattr with an read-only filesystem such as erofs. > > > > There are pros and cons to both these approaches, and the discussion > > about their respective value has sometimes been heated. We would like > > to have an in-person discussion at the summit, ideally also involving > > more of the filesystem development community, so that we can reach > > some consensus on what is the best apporach. > > In order to better understand the behaviour and requirements of the > overlayfs+erofs approach I spent some time implementing direct support > for erofs in libcomposefs. So, with current HEAD of > github.com/containers/composefs you can now do: > > $ mkcompose --digest-store=objects --format=erofs source-dir image.erofs > > This will produce an object store with the backing files, and a erofs > file with the required overlayfs xattrs, including a made up one > called "overlay.fs-verity" containing the expected fs-verity digest > for the lower dir. It also adds the required whiteouts to cover the > 00-ff dirs from the lower dir. > > These erofs files are ordered similarly to the composefs files, and we > give similar guarantees about their reproducibility, etc. So, they > should be apples-to-apples comparable with the composefs images. > > Given this, I ran another set of performance tests on the original cs9 > rootfs dataset, again measuring the time of `ls -lR`. I also tried to > measure the memory use like this: > > # echo 3 > /proc/sys/vm/drop_caches > # systemd-run --scope sh -c 'ls -lR mountpoint' > /dev/null; cat $(cat > /proc/self/cgroup | sed -e "s|0::|/sys/fs/cgroup|")/memory.peak' > > These are the alternatives I tried: > > xfs: the source of the image, regular dir on xfs > erofs: the image.erofs above, on loopback > erofs dio: the image.erofs above, on loopback with --direct-io=on > ovl: erofs above combined with overlayfs > ovl dio: erofs dio above combined with overlayfs > cfs: composefs mount of image.cfs > > All tests use the same objects dir, stored on xfs. The erofs and > overlay implementations are from a stock 6.1.13 kernel, and composefs > module is from github HEAD. > > I tried loopback both with and without the direct-io option, because > without direct-io enabled the kernel will double-cache the loopbacked > data, as per[1]. 
> > The produced images are: > 8.9M image.cfs > 11.3M image.erofs > > And gives these results: > | Cold cache | Warm cache | Mem use > | (msec) | (msec) | (mb) > -----------+------------+------------+--------- > xfs | 1449 | 442 | 54 > erofs | 700 | 391 | 45 > erofs dio | 939 | 400 | 45 > ovl | 1827 | 530 | 130 > ovl dio | 2156 | 531 | 130 > cfs | 689 | 389 | 51 It has been noted that the readahead done by kernel_read() may cause read-ahead of unrelated data into memory which skews the results in favour of workloads that consume all the filesystem metadata (such as the ls -lR usecase of the above test). In the table above this favours composefs (which uses kernel_read in some codepaths) as well as non-dio erofs (non-dio loopback device uses readahead too). I updated composefs to not use kernel_read here: https://github.com/containers/composefs/pull/105 And a new kernel patch-set based on this is available at: https://github.com/alexlarsson/linux/tree/composefs The resulting table is now (dropping the non-dio erofs): | Cold cache | Warm cache | Mem use | (msec) | (msec) | (mb) -----------+------------+------------+--------- xfs | 1449 | 442 | 54 erofs dio | 939 | 400 | 45 ovl dio | 2156 | 531 | 130 cfs | 833 | 398 | 51 | Cold cache | Warm cache | Mem use | (msec) | (msec) | (mb) -----------+------------+------------+--------- ext4 | 1135 | 394 | 54 erofs dio | 922 | 401 | 45 ovl dio | 1810 | 532 | 149 ovl lazy | 1063 | 523 | 87 cfs | 768 | 459 | 51 So, while cfs is somewhat worse now for this particular usecase, my overall analysis still stands. -- =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= Alexander Larsson Red Hat, Inc alexl@redhat.com alexander.larsson@gmail.com ^ permalink raw reply [flat|nested] 42+ messages in thread
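For reference, the "--direct-io=on" setup behind the "erofs dio" rows boils
down to the LOOP_SET_DIRECT_IO ioctl; a rough userspace sketch (error handling
omitted, image name taken from the test above, device numbering left to the
kernel):

/* Roughly what "losetup --direct-io=on image.erofs" amounts to: grab a free
 * loop device, attach the image, then enable direct I/O so the loop driver
 * bypasses the backing file's page cache (no double caching, no backing-file
 * readahead). Error handling omitted for brevity. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/loop.h>

int main(void)
{
	int ctl = open("/dev/loop-control", O_RDWR);
	int devnr = ioctl(ctl, LOOP_CTL_GET_FREE);
	char dev[32];
	int lo, backing;

	snprintf(dev, sizeof(dev), "/dev/loop%d", devnr);
	lo = open(dev, O_RDWR);
	backing = open("image.erofs", O_RDONLY);	/* read-only image */

	ioctl(lo, LOOP_SET_FD, backing);		/* attach backing file */
	ioctl(lo, LOOP_SET_DIRECT_IO, 1UL);		/* the --direct-io=on part */
	printf("%s attached with direct I/O; now: mount -t erofs %s mnt\n",
	       dev, dev);
	return 0;
}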
* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay 2023-03-06 11:33 ` Alexander Larsson @ 2023-03-06 12:15 ` Gao Xiang 2023-03-06 15:49 ` Jingbo Xu 1 sibling, 0 replies; 42+ messages in thread From: Gao Xiang @ 2023-03-06 12:15 UTC (permalink / raw) To: Alexander Larsson, lsf-pc Cc: linux-fsdevel, Amir Goldstein, Christian Brauner, Jingbo Xu, Giuseppe Scrivano, Dave Chinner, Vivek Goyal, Miklos Szeredi On 2023/3/6 19:33, Alexander Larsson wrote: > On Fri, Mar 3, 2023 at 2:57 PM Alexander Larsson <alexl@redhat.com> wrote: >> >> On Mon, Feb 27, 2023 at 10:22 AM Alexander Larsson <alexl@redhat.com> wrote: >>> >>> Hello, >>> >>> Recently Giuseppe Scrivano and I have worked on[1] and proposed[2] the >>> Composefs filesystem. It is an opportunistically sharing, validating >>> image-based filesystem, targeting usecases like validated ostree >>> rootfs:es, validated container images that share common files, as well >>> as other image based usecases. >>> >>> During the discussions in the composefs proposal (as seen on LWN[3]) >>> is has been proposed that (with some changes to overlayfs), similar >>> behaviour can be achieved by combining the overlayfs >>> "overlay.redirect" xattr with an read-only filesystem such as erofs. >>> >>> There are pros and cons to both these approaches, and the discussion >>> about their respective value has sometimes been heated. We would like >>> to have an in-person discussion at the summit, ideally also involving >>> more of the filesystem development community, so that we can reach >>> some consensus on what is the best apporach. >> >> In order to better understand the behaviour and requirements of the >> overlayfs+erofs approach I spent some time implementing direct support >> for erofs in libcomposefs. So, with current HEAD of >> github.com/containers/composefs you can now do: >> >> $ mkcompose --digest-store=objects --format=erofs source-dir image.erofs >> >> This will produce an object store with the backing files, and a erofs >> file with the required overlayfs xattrs, including a made up one >> called "overlay.fs-verity" containing the expected fs-verity digest >> for the lower dir. It also adds the required whiteouts to cover the >> 00-ff dirs from the lower dir. >> >> These erofs files are ordered similarly to the composefs files, and we >> give similar guarantees about their reproducibility, etc. So, they >> should be apples-to-apples comparable with the composefs images. >> >> Given this, I ran another set of performance tests on the original cs9 >> rootfs dataset, again measuring the time of `ls -lR`. I also tried to >> measure the memory use like this: >> >> # echo 3 > /proc/sys/vm/drop_caches >> # systemd-run --scope sh -c 'ls -lR mountpoint' > /dev/null; cat $(cat >> /proc/self/cgroup | sed -e "s|0::|/sys/fs/cgroup|")/memory.peak' >> >> These are the alternatives I tried: >> >> xfs: the source of the image, regular dir on xfs >> erofs: the image.erofs above, on loopback >> erofs dio: the image.erofs above, on loopback with --direct-io=on >> ovl: erofs above combined with overlayfs >> ovl dio: erofs dio above combined with overlayfs >> cfs: composefs mount of image.cfs >> >> All tests use the same objects dir, stored on xfs. The erofs and >> overlay implementations are from a stock 6.1.13 kernel, and composefs >> module is from github HEAD. >> >> I tried loopback both with and without the direct-io option, because >> without direct-io enabled the kernel will double-cache the loopbacked >> data, as per[1]. 
>> >> The produced images are: >> 8.9M image.cfs >> 11.3M image.erofs >> >> And gives these results: >> | Cold cache | Warm cache | Mem use >> | (msec) | (msec) | (mb) >> -----------+------------+------------+--------- >> xfs | 1449 | 442 | 54 >> erofs | 700 | 391 | 45 >> erofs dio | 939 | 400 | 45 >> ovl | 1827 | 530 | 130 >> ovl dio | 2156 | 531 | 130 >> cfs | 689 | 389 | 51 > > It has been noted that the readahead done by kernel_read() may cause > read-ahead of unrelated data into memory which skews the results in > favour of workloads that consume all the filesystem metadata (such as > the ls -lR usecase of the above test). In the table above this favours > composefs (which uses kernel_read in some codepaths) as well as > non-dio erofs (non-dio loopback device uses readahead too). > > I updated composefs to not use kernel_read here: > https://github.com/containers/composefs/pull/105 > > And a new kernel patch-set based on this is available at: > https://github.com/alexlarsson/linux/tree/composefs > > The resulting table is now (dropping the non-dio erofs): > > | Cold cache | Warm cache | Mem use > | (msec) | (msec) | (mb) > -----------+------------+------------+--------- > xfs | 1449 | 442 | 54 > erofs dio | 939 | 400 | 45 > ovl dio | 2156 | 531 | 130 > cfs | 833 | 398 | 51 > > | Cold cache | Warm cache | Mem use > | (msec) | (msec) | (mb) > -----------+------------+------------+--------- > ext4 | 1135 | 394 | 54 > erofs dio | 922 | 401 | 45 > ovl dio | 1810 | 532 | 149 > ovl lazy | 1063 | 523 | 87 > cfs | 768 | 459 | 51 > > So, while cfs is somewhat worse now for this particular usecase, my > overall analysis still stands. We will investigate it later, also you might still need to test some other random workloads other than "ls -lR" (such as stat ~1000 files randomly [1]) rather than completely ignore my and Jingbo's comments, or at least you have to answer why "ls -lR" is the only judgement on your side. My point is simply simple. If you consider a chance to get an improved EROFS in some extents, we do hope we could improve your "ls -lR" as much as possible without bad impacts to random access. Or if you'd like to upstream a new file-based stackable filesystem for this ostree specific use cases for your whatever KPIs anyway, I don't think we could get some conclusion here and I cannot do any help to you since I'm not that one. Since you're addressing a very specific workload "ls -lR" and EROFS as well as EROFS + overlayfs doesn't perform so bad without further insights compared with Composefs even EROFS doesn't directly use file-based interfaces. Thanks, Gao Xiang [1] https://lore.kernel.org/r/83829005-3f12-afac-9d05-8ba721a80b4d@linux.alibaba.com > ^ permalink raw reply [flat|nested] 42+ messages in thread
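As a concrete illustration of the random-stat workload referred to in [1], a
sketch along these lines can be run against each mount after dropping caches
(the file-list path and sample count are arbitrary choices, not taken from the
thread):

/* Random-stat microbenchmark sketch: stat ~1000 files picked at random from a
 * pre-generated list, e.g. "find mountpoint -type f > filelist". */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <time.h>

#define MAX_FILES 1000000

static char *files[MAX_FILES];

int main(int argc, char **argv)
{
	const char *list = argc > 1 ? argv[1] : "filelist";
	char buf[4096];
	size_t n = 0, samples = 1000, i;
	struct timespec t0, t1;
	struct stat st;
	FILE *f = fopen(list, "r");

	if (!f)
		return 1;
	while (n < MAX_FILES && fgets(buf, sizeof(buf), f)) {
		buf[strcspn(buf, "\n")] = '\0';
		files[n++] = strdup(buf);
	}
	fclose(f);
	if (!n)
		return 1;

	srand(42);			/* fixed seed for repeatable runs */
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < samples; i++)
		stat(files[rand() % n], &st);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("%zu random stats: %.1f ms\n", samples,
	       (t1.tv_sec - t0.tv_sec) * 1e3 +
	       (t1.tv_nsec - t0.tv_nsec) / 1e6);
	return 0;
}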
* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay 2023-03-06 11:33 ` Alexander Larsson 2023-03-06 12:15 ` Gao Xiang @ 2023-03-06 15:49 ` Jingbo Xu 2023-03-06 16:09 ` Alexander Larsson 2023-03-07 10:00 ` Jingbo Xu 1 sibling, 2 replies; 42+ messages in thread From: Jingbo Xu @ 2023-03-06 15:49 UTC (permalink / raw) To: Alexander Larsson, lsf-pc Cc: linux-fsdevel, Amir Goldstein, Christian Brauner, Gao Xiang, Giuseppe Scrivano, Dave Chinner, Vivek Goyal, Miklos Szeredi On 3/6/23 7:33 PM, Alexander Larsson wrote: > On Fri, Mar 3, 2023 at 2:57 PM Alexander Larsson <alexl@redhat.com> wrote: >> >> On Mon, Feb 27, 2023 at 10:22 AM Alexander Larsson <alexl@redhat.com> wrote: >>> >>> Hello, >>> >>> Recently Giuseppe Scrivano and I have worked on[1] and proposed[2] the >>> Composefs filesystem. It is an opportunistically sharing, validating >>> image-based filesystem, targeting usecases like validated ostree >>> rootfs:es, validated container images that share common files, as well >>> as other image based usecases. >>> >>> During the discussions in the composefs proposal (as seen on LWN[3]) >>> is has been proposed that (with some changes to overlayfs), similar >>> behaviour can be achieved by combining the overlayfs >>> "overlay.redirect" xattr with an read-only filesystem such as erofs. >>> >>> There are pros and cons to both these approaches, and the discussion >>> about their respective value has sometimes been heated. We would like >>> to have an in-person discussion at the summit, ideally also involving >>> more of the filesystem development community, so that we can reach >>> some consensus on what is the best apporach. >> >> In order to better understand the behaviour and requirements of the >> overlayfs+erofs approach I spent some time implementing direct support >> for erofs in libcomposefs. So, with current HEAD of >> github.com/containers/composefs you can now do: >> >> $ mkcompose --digest-store=objects --format=erofs source-dir image.erofs >> >> This will produce an object store with the backing files, and a erofs >> file with the required overlayfs xattrs, including a made up one >> called "overlay.fs-verity" containing the expected fs-verity digest >> for the lower dir. It also adds the required whiteouts to cover the >> 00-ff dirs from the lower dir. >> >> These erofs files are ordered similarly to the composefs files, and we >> give similar guarantees about their reproducibility, etc. So, they >> should be apples-to-apples comparable with the composefs images. >> >> Given this, I ran another set of performance tests on the original cs9 >> rootfs dataset, again measuring the time of `ls -lR`. I also tried to >> measure the memory use like this: >> >> # echo 3 > /proc/sys/vm/drop_caches >> # systemd-run --scope sh -c 'ls -lR mountpoint' > /dev/null; cat $(cat >> /proc/self/cgroup | sed -e "s|0::|/sys/fs/cgroup|")/memory.peak' >> >> These are the alternatives I tried: >> >> xfs: the source of the image, regular dir on xfs >> erofs: the image.erofs above, on loopback >> erofs dio: the image.erofs above, on loopback with --direct-io=on >> ovl: erofs above combined with overlayfs >> ovl dio: erofs dio above combined with overlayfs >> cfs: composefs mount of image.cfs >> >> All tests use the same objects dir, stored on xfs. The erofs and >> overlay implementations are from a stock 6.1.13 kernel, and composefs >> module is from github HEAD. 
>> >> I tried loopback both with and without the direct-io option, because >> without direct-io enabled the kernel will double-cache the loopbacked >> data, as per[1]. >> >> The produced images are: >> 8.9M image.cfs >> 11.3M image.erofs >> >> And gives these results: >> | Cold cache | Warm cache | Mem use >> | (msec) | (msec) | (mb) >> -----------+------------+------------+--------- >> xfs | 1449 | 442 | 54 >> erofs | 700 | 391 | 45 >> erofs dio | 939 | 400 | 45 >> ovl | 1827 | 530 | 130 >> ovl dio | 2156 | 531 | 130 >> cfs | 689 | 389 | 51 > > It has been noted that the readahead done by kernel_read() may cause > read-ahead of unrelated data into memory which skews the results in > favour of workloads that consume all the filesystem metadata (such as > the ls -lR usecase of the above test). In the table above this favours > composefs (which uses kernel_read in some codepaths) as well as > non-dio erofs (non-dio loopback device uses readahead too). > > I updated composefs to not use kernel_read here: > https://github.com/containers/composefs/pull/105 > > And a new kernel patch-set based on this is available at: > https://github.com/alexlarsson/linux/tree/composefs > > The resulting table is now (dropping the non-dio erofs): > > | Cold cache | Warm cache | Mem use > | (msec) | (msec) | (mb) > -----------+------------+------------+--------- > xfs | 1449 | 442 | 54 > erofs dio | 939 | 400 | 45 > ovl dio | 2156 | 531 | 130 > cfs | 833 | 398 | 51 > > | Cold cache | Warm cache | Mem use > | (msec) | (msec) | (mb) > -----------+------------+------------+--------- > ext4 | 1135 | 394 | 54 > erofs dio | 922 | 401 | 45 > ovl dio | 1810 | 532 | 149 > ovl lazy | 1063 | 523 | 87 > cfs | 768 | 459 | 51 > > So, while cfs is somewhat worse now for this particular usecase, my > overall analysis still stands. > Hi, I tested your patch removing kernel_read(), and here is the statistics tested in my environment. Setup ====== CPU: x86_64 Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz Disk: cloud disk, 11800 IOPS upper limit OS: Linux v6.2 FS of backing objects: xfs Image size =========== 8.6M large.composefs (with --compute-digest) 8.9M large.erofs (mkfs.erofs) 11M large.cps.in.erofs (mkfs.composefs --compute-digest --format=erofs) Perf of "ls -lR" ================ | uncached| cached | (ms) | (ms) ----------------------------------------------|---------|-------- composefs | 519 | 178 erofs (mkfs.erofs, DIRECT loop) | 497 | 192 erofs (mkfs.composefs --format=erofs, DIRECT loop) | 536 | 199 I tested the performance of "ls -lR" on the whole tree of cs9-developer-rootfs. It seems that the performance of erofs (generated from mkfs.erofs) is slightly better than that of composefs. While the performance of erofs generated from mkfs.composefs is slightly worse that that of composefs. The uncached performance is somewhat slightly different with that given by Alexander Larsson. I think it may be due to different test environment, as my test machine is a server with robust performance, with cloud disk as storage. It's just a simple test without further analysis, as it's a bit late for me :) -- Thanks, Jingbo ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay 2023-03-06 15:49 ` Jingbo Xu @ 2023-03-06 16:09 ` Alexander Larsson 2023-03-06 16:17 ` Gao Xiang 2023-03-07 10:00 ` Jingbo Xu 1 sibling, 1 reply; 42+ messages in thread From: Alexander Larsson @ 2023-03-06 16:09 UTC (permalink / raw) To: Jingbo Xu Cc: lsf-pc, linux-fsdevel, Amir Goldstein, Christian Brauner, Gao Xiang, Giuseppe Scrivano, Dave Chinner, Vivek Goyal, Miklos Szeredi On Mon, Mar 6, 2023 at 4:49 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote: > On 3/6/23 7:33 PM, Alexander Larsson wrote: > > On Fri, Mar 3, 2023 at 2:57 PM Alexander Larsson <alexl@redhat.com> wrote: > >> > >> On Mon, Feb 27, 2023 at 10:22 AM Alexander Larsson <alexl@redhat.com> wrote: > >>> > >>> Hello, > >>> > >>> Recently Giuseppe Scrivano and I have worked on[1] and proposed[2] the > >>> Composefs filesystem. It is an opportunistically sharing, validating > >>> image-based filesystem, targeting usecases like validated ostree > >>> rootfs:es, validated container images that share common files, as well > >>> as other image based usecases. > >>> > >>> During the discussions in the composefs proposal (as seen on LWN[3]) > >>> is has been proposed that (with some changes to overlayfs), similar > >>> behaviour can be achieved by combining the overlayfs > >>> "overlay.redirect" xattr with an read-only filesystem such as erofs. > >>> > >>> There are pros and cons to both these approaches, and the discussion > >>> about their respective value has sometimes been heated. We would like > >>> to have an in-person discussion at the summit, ideally also involving > >>> more of the filesystem development community, so that we can reach > >>> some consensus on what is the best apporach. > >> > >> In order to better understand the behaviour and requirements of the > >> overlayfs+erofs approach I spent some time implementing direct support > >> for erofs in libcomposefs. So, with current HEAD of > >> github.com/containers/composefs you can now do: > >> > >> $ mkcompose --digest-store=objects --format=erofs source-dir image.erofs > >> > >> This will produce an object store with the backing files, and a erofs > >> file with the required overlayfs xattrs, including a made up one > >> called "overlay.fs-verity" containing the expected fs-verity digest > >> for the lower dir. It also adds the required whiteouts to cover the > >> 00-ff dirs from the lower dir. > >> > >> These erofs files are ordered similarly to the composefs files, and we > >> give similar guarantees about their reproducibility, etc. So, they > >> should be apples-to-apples comparable with the composefs images. > >> > >> Given this, I ran another set of performance tests on the original cs9 > >> rootfs dataset, again measuring the time of `ls -lR`. I also tried to > >> measure the memory use like this: > >> > >> # echo 3 > /proc/sys/vm/drop_caches > >> # systemd-run --scope sh -c 'ls -lR mountpoint' > /dev/null; cat $(cat > >> /proc/self/cgroup | sed -e "s|0::|/sys/fs/cgroup|")/memory.peak' > >> > >> These are the alternatives I tried: > >> > >> xfs: the source of the image, regular dir on xfs > >> erofs: the image.erofs above, on loopback > >> erofs dio: the image.erofs above, on loopback with --direct-io=on > >> ovl: erofs above combined with overlayfs > >> ovl dio: erofs dio above combined with overlayfs > >> cfs: composefs mount of image.cfs > >> > >> All tests use the same objects dir, stored on xfs. 
The erofs and > >> overlay implementations are from a stock 6.1.13 kernel, and composefs > >> module is from github HEAD. > >> > >> I tried loopback both with and without the direct-io option, because > >> without direct-io enabled the kernel will double-cache the loopbacked > >> data, as per[1]. > >> > >> The produced images are: > >> 8.9M image.cfs > >> 11.3M image.erofs > >> > >> And gives these results: > >> | Cold cache | Warm cache | Mem use > >> | (msec) | (msec) | (mb) > >> -----------+------------+------------+--------- > >> xfs | 1449 | 442 | 54 > >> erofs | 700 | 391 | 45 > >> erofs dio | 939 | 400 | 45 > >> ovl | 1827 | 530 | 130 > >> ovl dio | 2156 | 531 | 130 > >> cfs | 689 | 389 | 51 > > > > It has been noted that the readahead done by kernel_read() may cause > > read-ahead of unrelated data into memory which skews the results in > > favour of workloads that consume all the filesystem metadata (such as > > the ls -lR usecase of the above test). In the table above this favours > > composefs (which uses kernel_read in some codepaths) as well as > > non-dio erofs (non-dio loopback device uses readahead too). > > > > I updated composefs to not use kernel_read here: > > https://github.com/containers/composefs/pull/105 > > > > And a new kernel patch-set based on this is available at: > > https://github.com/alexlarsson/linux/tree/composefs > > > > The resulting table is now (dropping the non-dio erofs): > > > > | Cold cache | Warm cache | Mem use > > | (msec) | (msec) | (mb) > > -----------+------------+------------+--------- > > xfs | 1449 | 442 | 54 > > erofs dio | 939 | 400 | 45 > > ovl dio | 2156 | 531 | 130 > > cfs | 833 | 398 | 51 > > > > | Cold cache | Warm cache | Mem use > > | (msec) | (msec) | (mb) > > -----------+------------+------------+--------- > > ext4 | 1135 | 394 | 54 > > erofs dio | 922 | 401 | 45 > > ovl dio | 1810 | 532 | 149 > > ovl lazy | 1063 | 523 | 87 > > cfs | 768 | 459 | 51 > > > > So, while cfs is somewhat worse now for this particular usecase, my > > overall analysis still stands. > > > > Hi, > > I tested your patch removing kernel_read(), and here is the statistics > tested in my environment. > > > Setup > ====== > CPU: x86_64 Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz > Disk: cloud disk, 11800 IOPS upper limit > OS: Linux v6.2 > FS of backing objects: xfs > > > Image size > =========== > 8.6M large.composefs (with --compute-digest) > 8.9M large.erofs (mkfs.erofs) > 11M large.cps.in.erofs (mkfs.composefs --compute-digest --format=erofs) > > > Perf of "ls -lR" > ================ > | uncached| cached > | (ms) | (ms) > ----------------------------------------------|---------|-------- > composefs | 519 | 178 > erofs (mkfs.erofs, DIRECT loop) | 497 | 192 > erofs (mkfs.composefs --format=erofs, DIRECT loop) | 536 | 199 > > I tested the performance of "ls -lR" on the whole tree of > cs9-developer-rootfs. It seems that the performance of erofs (generated > from mkfs.erofs) is slightly better than that of composefs. While the > performance of erofs generated from mkfs.composefs is slightly worse > that that of composefs. I suspect that the reason for the lower performance of mkfs.composefs is the added overlay.fs-verity xattr to all the files. It makes the image larger, and that means more i/o. > The uncached performance is somewhat slightly different with that given > by Alexander Larsson. I think it may be due to different test > environment, as my test machine is a server with robust performance, > with cloud disk as storage. 
> > It's just a simple test without further analysis, as it's a bit late for > me :) Yeah, and for the record, I'm not claiming that my tests contain any high degree of analysis or rigour either. They are short simple test runs that give a rough estimate of the overall performance of metadata operations. What is interesting here is if there are large or unexpected differences, and from that point of view our results are basically the same. -- =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= Alexander Larsson Red Hat, Inc alexl@redhat.com alexander.larsson@gmail.com ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay 2023-03-06 16:09 ` Alexander Larsson @ 2023-03-06 16:17 ` Gao Xiang 2023-03-07 8:21 ` Alexander Larsson 0 siblings, 1 reply; 42+ messages in thread From: Gao Xiang @ 2023-03-06 16:17 UTC (permalink / raw) To: Alexander Larsson, Jingbo Xu Cc: lsf-pc, linux-fsdevel, Amir Goldstein, Christian Brauner, Giuseppe Scrivano, Dave Chinner, Vivek Goyal, Miklos Szeredi On 2023/3/7 00:09, Alexander Larsson wrote: > On Mon, Mar 6, 2023 at 4:49 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote: >> On 3/6/23 7:33 PM, Alexander Larsson wrote: >>> On Fri, Mar 3, 2023 at 2:57 PM Alexander Larsson <alexl@redhat.com> wrote: >>>> >>>> On Mon, Feb 27, 2023 at 10:22 AM Alexander Larsson <alexl@redhat.com> wrote: >>>>> >>>>> Hello, >>>>> >>>>> Recently Giuseppe Scrivano and I have worked on[1] and proposed[2] the >>>>> Composefs filesystem. It is an opportunistically sharing, validating >>>>> image-based filesystem, targeting usecases like validated ostree >>>>> rootfs:es, validated container images that share common files, as well >>>>> as other image based usecases. >>>>> >>>>> During the discussions in the composefs proposal (as seen on LWN[3]) >>>>> is has been proposed that (with some changes to overlayfs), similar >>>>> behaviour can be achieved by combining the overlayfs >>>>> "overlay.redirect" xattr with an read-only filesystem such as erofs. >>>>> >>>>> There are pros and cons to both these approaches, and the discussion >>>>> about their respective value has sometimes been heated. We would like >>>>> to have an in-person discussion at the summit, ideally also involving >>>>> more of the filesystem development community, so that we can reach >>>>> some consensus on what is the best apporach. >>>> >>>> In order to better understand the behaviour and requirements of the >>>> overlayfs+erofs approach I spent some time implementing direct support >>>> for erofs in libcomposefs. So, with current HEAD of >>>> github.com/containers/composefs you can now do: >>>> >>>> $ mkcompose --digest-store=objects --format=erofs source-dir image.erofs >>>> >>>> This will produce an object store with the backing files, and a erofs >>>> file with the required overlayfs xattrs, including a made up one >>>> called "overlay.fs-verity" containing the expected fs-verity digest >>>> for the lower dir. It also adds the required whiteouts to cover the >>>> 00-ff dirs from the lower dir. >>>> >>>> These erofs files are ordered similarly to the composefs files, and we >>>> give similar guarantees about their reproducibility, etc. So, they >>>> should be apples-to-apples comparable with the composefs images. >>>> >>>> Given this, I ran another set of performance tests on the original cs9 >>>> rootfs dataset, again measuring the time of `ls -lR`. I also tried to >>>> measure the memory use like this: >>>> >>>> # echo 3 > /proc/sys/vm/drop_caches >>>> # systemd-run --scope sh -c 'ls -lR mountpoint' > /dev/null; cat $(cat >>>> /proc/self/cgroup | sed -e "s|0::|/sys/fs/cgroup|")/memory.peak' >>>> >>>> These are the alternatives I tried: >>>> >>>> xfs: the source of the image, regular dir on xfs >>>> erofs: the image.erofs above, on loopback >>>> erofs dio: the image.erofs above, on loopback with --direct-io=on >>>> ovl: erofs above combined with overlayfs >>>> ovl dio: erofs dio above combined with overlayfs >>>> cfs: composefs mount of image.cfs >>>> >>>> All tests use the same objects dir, stored on xfs. 
The erofs and >>>> overlay implementations are from a stock 6.1.13 kernel, and composefs >>>> module is from github HEAD. >>>> >>>> I tried loopback both with and without the direct-io option, because >>>> without direct-io enabled the kernel will double-cache the loopbacked >>>> data, as per[1]. >>>> >>>> The produced images are: >>>> 8.9M image.cfs >>>> 11.3M image.erofs >>>> >>>> And gives these results: >>>> | Cold cache | Warm cache | Mem use >>>> | (msec) | (msec) | (mb) >>>> -----------+------------+------------+--------- >>>> xfs | 1449 | 442 | 54 >>>> erofs | 700 | 391 | 45 >>>> erofs dio | 939 | 400 | 45 >>>> ovl | 1827 | 530 | 130 >>>> ovl dio | 2156 | 531 | 130 >>>> cfs | 689 | 389 | 51 >>> >>> It has been noted that the readahead done by kernel_read() may cause >>> read-ahead of unrelated data into memory which skews the results in >>> favour of workloads that consume all the filesystem metadata (such as >>> the ls -lR usecase of the above test). In the table above this favours >>> composefs (which uses kernel_read in some codepaths) as well as >>> non-dio erofs (non-dio loopback device uses readahead too). >>> >>> I updated composefs to not use kernel_read here: >>> https://github.com/containers/composefs/pull/105 >>> >>> And a new kernel patch-set based on this is available at: >>> https://github.com/alexlarsson/linux/tree/composefs >>> >>> The resulting table is now (dropping the non-dio erofs): >>> >>> | Cold cache | Warm cache | Mem use >>> | (msec) | (msec) | (mb) >>> -----------+------------+------------+--------- >>> xfs | 1449 | 442 | 54 >>> erofs dio | 939 | 400 | 45 >>> ovl dio | 2156 | 531 | 130 >>> cfs | 833 | 398 | 51 >>> >>> | Cold cache | Warm cache | Mem use >>> | (msec) | (msec) | (mb) >>> -----------+------------+------------+--------- >>> ext4 | 1135 | 394 | 54 >>> erofs dio | 922 | 401 | 45 >>> ovl dio | 1810 | 532 | 149 >>> ovl lazy | 1063 | 523 | 87 >>> cfs | 768 | 459 | 51 >>> >>> So, while cfs is somewhat worse now for this particular usecase, my >>> overall analysis still stands. >>> >> >> Hi, >> >> I tested your patch removing kernel_read(), and here is the statistics >> tested in my environment. >> >> >> Setup >> ====== >> CPU: x86_64 Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz >> Disk: cloud disk, 11800 IOPS upper limit >> OS: Linux v6.2 >> FS of backing objects: xfs >> >> >> Image size >> =========== >> 8.6M large.composefs (with --compute-digest) >> 8.9M large.erofs (mkfs.erofs) >> 11M large.cps.in.erofs (mkfs.composefs --compute-digest --format=erofs) >> >> >> Perf of "ls -lR" >> ================ >> | uncached| cached >> | (ms) | (ms) >> ----------------------------------------------|---------|-------- >> composefs | 519 | 178 >> erofs (mkfs.erofs, DIRECT loop) | 497 | 192 >> erofs (mkfs.composefs --format=erofs, DIRECT loop) | 536 | 199 >> >> I tested the performance of "ls -lR" on the whole tree of >> cs9-developer-rootfs. It seems that the performance of erofs (generated >> from mkfs.erofs) is slightly better than that of composefs. While the >> performance of erofs generated from mkfs.composefs is slightly worse >> that that of composefs. > > I suspect that the reason for the lower performance of mkfs.composefs > is the added overlay.fs-verity xattr to all the files. It makes the > image larger, and that means more i/o. Actually you could move overlay.fs-verity to EROFS shared xattr area (or even overlay.redirect but it depends) if needed, which could save some I/Os for your workloads. 
Shared xattrs can be used in this way as well if you care about such a
minor difference; actually, I think inlined xattrs for your workload are
only meaningful for selinux labels and capabilities.

Thanks,
Gao Xiang

^ permalink raw reply	[flat|nested] 42+ messages in thread
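For context on what "shared" versus "inlined" means here, the EROFS on-disk
xattr layout is roughly the following (paraphrased from fs/erofs/erofs_fs.h
around v6.1; the kernel header is authoritative, types as in <linux/types.h>):

/* Each inode's xattr area starts with this header; h_shared_xattrs[] holds
 * IDs pointing into a filesystem-wide shared xattr region, while inline
 * erofs_xattr_entry records follow directly afterwards. Moving a large value
 * such as overlay.fs-verity into the shared region keeps the per-inode
 * metadata compact, at the cost of an extra lookup when the value is read. */
struct erofs_xattr_ibody_header {
	__le32 h_reserved;
	__u8   h_shared_count;		/* number of shared xattr IDs below */
	__u8   h_reserved2[7];
	__le32 h_shared_xattrs[];	/* IDs into the shared xattr area */
};

struct erofs_xattr_entry {
	__u8   e_name_len;		/* length of the name suffix */
	__u8   e_name_index;		/* name prefix index (user., trusted., ...) */
	__le16 e_value_size;		/* size of the value */
	char   e_name[];		/* name suffix; value follows the name */
};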
* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay 2023-03-06 16:17 ` Gao Xiang @ 2023-03-07 8:21 ` Alexander Larsson 2023-03-07 8:33 ` Gao Xiang 0 siblings, 1 reply; 42+ messages in thread From: Alexander Larsson @ 2023-03-07 8:21 UTC (permalink / raw) To: Gao Xiang Cc: Jingbo Xu, lsf-pc, linux-fsdevel, Amir Goldstein, Christian Brauner, Giuseppe Scrivano, Dave Chinner, Vivek Goyal, Miklos Szeredi On Mon, Mar 6, 2023 at 5:17 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote: > >> I tested the performance of "ls -lR" on the whole tree of > >> cs9-developer-rootfs. It seems that the performance of erofs (generated > >> from mkfs.erofs) is slightly better than that of composefs. While the > >> performance of erofs generated from mkfs.composefs is slightly worse > >> that that of composefs. > > > > I suspect that the reason for the lower performance of mkfs.composefs > > is the added overlay.fs-verity xattr to all the files. It makes the > > image larger, and that means more i/o. > > Actually you could move overlay.fs-verity to EROFS shared xattr area (or > even overlay.redirect but it depends) if needed, which could save some > I/Os for your workloads. > > shared xattrs can be used in this way as well if you care such minor > difference, actually I think inlined xattrs for your workload are just > meaningful for selinux labels and capabilities. Really? Could you expand on this, because I would think it will be sort of the opposite. In my usecase, the erofs fs will be read by overlayfs, which will probably access overlay.* pretty often. At the very least it will load overlay.metacopy and overlay.redirect for every lookup. I guess it depends on how the verity support in overlayfs would work. If it delays access to overlay.verity until open time, then it would make sense to move it to the shared area. -- =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= Alexander Larsson Red Hat, Inc alexl@redhat.com alexander.larsson@gmail.com ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay 2023-03-07 8:21 ` Alexander Larsson @ 2023-03-07 8:33 ` Gao Xiang 2023-03-07 8:48 ` Gao Xiang 2023-03-07 9:07 ` Alexander Larsson 0 siblings, 2 replies; 42+ messages in thread From: Gao Xiang @ 2023-03-07 8:33 UTC (permalink / raw) To: Alexander Larsson Cc: Jingbo Xu, lsf-pc, linux-fsdevel, Amir Goldstein, Christian Brauner, Giuseppe Scrivano, Dave Chinner, Vivek Goyal, Miklos Szeredi On 2023/3/7 16:21, Alexander Larsson wrote: > On Mon, Mar 6, 2023 at 5:17 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote: > >>>> I tested the performance of "ls -lR" on the whole tree of >>>> cs9-developer-rootfs. It seems that the performance of erofs (generated >>>> from mkfs.erofs) is slightly better than that of composefs. While the >>>> performance of erofs generated from mkfs.composefs is slightly worse >>>> that that of composefs. >>> >>> I suspect that the reason for the lower performance of mkfs.composefs >>> is the added overlay.fs-verity xattr to all the files. It makes the >>> image larger, and that means more i/o. >> >> Actually you could move overlay.fs-verity to EROFS shared xattr area (or >> even overlay.redirect but it depends) if needed, which could save some >> I/Os for your workloads. >> >> shared xattrs can be used in this way as well if you care such minor >> difference, actually I think inlined xattrs for your workload are just >> meaningful for selinux labels and capabilities. > > Really? Could you expand on this, because I would think it will be > sort of the opposite. In my usecase, the erofs fs will be read by > overlayfs, which will probably access overlay.* pretty often. At the > very least it will load overlay.metacopy and overlay.redirect for > every lookup. Really. In that way, it will behave much similiar to composefs on-disk arrangement now (in composefs vdata area). Because in that way, although an extra I/O is needed for verification, and it can only happen when actually opening the file (so "ls -lR" is not impacted.) But on-disk inodes are more compact. All EROFS xattrs will be cached in memory so that accessing overlay.* pretty often is not greatly impacted due to no real I/Os (IOWs, only some CPU time is consumed). > > I guess it depends on how the verity support in overlayfs would work. > If it delays access to overlay.verity until open time, then it would > make sense to move it to the shared area. I think it could be just like what composefs does, it's not hard to add just new dozen lines to overlayfs like: static int cfs_open_file(struct inode *inode, struct file *file) { ... /* If metadata records a digest for the file, ensure it is there * and correct before using the contents. */ if (cino->inode_data.has_digest && fsi->verity_check >= CFS_VERITY_CHECK_IF_SPECIFIED) { ... res = fsverity_get_digest(d_inode(backing_dentry), verity_digest, &verity_algo); if (res < 0) { pr_warn("WARNING: composefs backing file '%pd' has no fs-verity digest\n", backing_dentry); return -EIO; } if (verity_algo != HASH_ALGO_SHA256 || memcmp(cino->inode_data.digest, verity_digest, SHA256_DIGEST_SIZE) != 0) { pr_warn("WARNING: composefs backing file '%pd' has the wrong fs-verity digest\n", backing_dentry); return -EIO; } ... } ... } Is this stacked fsverity feature really hard? Thanks, Gao Xiang > ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay 2023-03-07 8:33 ` Gao Xiang @ 2023-03-07 8:48 ` Gao Xiang 2023-03-07 9:07 ` Alexander Larsson 1 sibling, 0 replies; 42+ messages in thread From: Gao Xiang @ 2023-03-07 8:48 UTC (permalink / raw) To: Alexander Larsson Cc: Jingbo Xu, lsf-pc, linux-fsdevel, Amir Goldstein, Christian Brauner, Giuseppe Scrivano, Dave Chinner, Vivek Goyal, Miklos Szeredi On 2023/3/7 16:33, Gao Xiang wrote: > > > On 2023/3/7 16:21, Alexander Larsson wrote: >> On Mon, Mar 6, 2023 at 5:17 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote: >> >>>>> I tested the performance of "ls -lR" on the whole tree of >>>>> cs9-developer-rootfs. It seems that the performance of erofs (generated >>>>> from mkfs.erofs) is slightly better than that of composefs. While the >>>>> performance of erofs generated from mkfs.composefs is slightly worse >>>>> that that of composefs. >>>> >>>> I suspect that the reason for the lower performance of mkfs.composefs >>>> is the added overlay.fs-verity xattr to all the files. It makes the >>>> image larger, and that means more i/o. >>> >>> Actually you could move overlay.fs-verity to EROFS shared xattr area (or >>> even overlay.redirect but it depends) if needed, which could save some >>> I/Os for your workloads. >>> >>> shared xattrs can be used in this way as well if you care such minor >>> difference, actually I think inlined xattrs for your workload are just >>> meaningful for selinux labels and capabilities. >> >> Really? Could you expand on this, because I would think it will be >> sort of the opposite. In my usecase, the erofs fs will be read by >> overlayfs, which will probably access overlay.* pretty often. At the >> very least it will load overlay.metacopy and overlay.redirect for >> every lookup. > > Really. In that way, it will behave much similiar to composefs on-disk > arrangement now (in composefs vdata area). > > Because in that way, although an extra I/O is needed for verification, > and it can only happen when actually opening the file (so "ls -lR" is > not impacted.) But on-disk inodes are more compact. > > All EROFS xattrs will be cached in memory so that accessing ^ all accessed xattrs in EROFS Sorry about that if there could be some misunderstanding. > overlay.* pretty often is not greatly impacted due to no real I/Os > (IOWs, only some CPU time is consumed). > >> >> I guess it depends on how the verity support in overlayfs would work. >> If it delays access to overlay.verity until open time, then it would >> make sense to move it to the shared area. > > I think it could be just like what composefs does, it's not hard to > add just new dozen lines to overlayfs like: > > static int cfs_open_file(struct inode *inode, struct file *file) > { > ... > /* If metadata records a digest for the file, ensure it is there > * and correct before using the contents. > */ > if (cino->inode_data.has_digest && > fsi->verity_check >= CFS_VERITY_CHECK_IF_SPECIFIED) { > ... > > res = fsverity_get_digest(d_inode(backing_dentry), > verity_digest, &verity_algo); > if (res < 0) { > pr_warn("WARNING: composefs backing file '%pd' has no fs-verity digest\n", > backing_dentry); > return -EIO; > } > if (verity_algo != HASH_ALGO_SHA256 || > memcmp(cino->inode_data.digest, verity_digest, > SHA256_DIGEST_SIZE) != 0) { > pr_warn("WARNING: composefs backing file '%pd' has the wrong fs-verity digest\n", > backing_dentry); > return -EIO; > } > ... > } > ... > } > > Is this stacked fsverity feature really hard? 
> > Thanks, > Gao Xiang > >> ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay 2023-03-07 8:33 ` Gao Xiang 2023-03-07 8:48 ` Gao Xiang @ 2023-03-07 9:07 ` Alexander Larsson 2023-03-07 9:26 ` Gao Xiang 1 sibling, 1 reply; 42+ messages in thread From: Alexander Larsson @ 2023-03-07 9:07 UTC (permalink / raw) To: Gao Xiang Cc: Jingbo Xu, lsf-pc, linux-fsdevel, Amir Goldstein, Christian Brauner, Giuseppe Scrivano, Dave Chinner, Vivek Goyal, Miklos Szeredi On Tue, Mar 7, 2023 at 9:34 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote: > > > > On 2023/3/7 16:21, Alexander Larsson wrote: > > On Mon, Mar 6, 2023 at 5:17 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote: > > > >>>> I tested the performance of "ls -lR" on the whole tree of > >>>> cs9-developer-rootfs. It seems that the performance of erofs (generated > >>>> from mkfs.erofs) is slightly better than that of composefs. While the > >>>> performance of erofs generated from mkfs.composefs is slightly worse > >>>> that that of composefs. > >>> > >>> I suspect that the reason for the lower performance of mkfs.composefs > >>> is the added overlay.fs-verity xattr to all the files. It makes the > >>> image larger, and that means more i/o. > >> > >> Actually you could move overlay.fs-verity to EROFS shared xattr area (or > >> even overlay.redirect but it depends) if needed, which could save some > >> I/Os for your workloads. > >> > >> shared xattrs can be used in this way as well if you care such minor > >> difference, actually I think inlined xattrs for your workload are just > >> meaningful for selinux labels and capabilities. > > > > Really? Could you expand on this, because I would think it will be > > sort of the opposite. In my usecase, the erofs fs will be read by > > overlayfs, which will probably access overlay.* pretty often. At the > > very least it will load overlay.metacopy and overlay.redirect for > > every lookup. > > Really. In that way, it will behave much similiar to composefs on-disk > arrangement now (in composefs vdata area). > > Because in that way, although an extra I/O is needed for verification, > and it can only happen when actually opening the file (so "ls -lR" is > not impacted.) But on-disk inodes are more compact. > > All EROFS xattrs will be cached in memory so that accessing > overlay.* pretty often is not greatly impacted due to no real I/Os > (IOWs, only some CPU time is consumed). So, I tried moving the overlay.digest xattr to the shared area, but actually this made the performance worse for the ls case. I have not looked into the cause in detail, but my guess is that ls looks for the acl xattr, and such a negative lookup will cause erofs to look at all the shared xattrs for the inode, which means they all end up being loaded anyway. Of course, this will only affect ls (or other cases that read the acl), so its perhaps a bit uncommon. Did you ever consider putting a bloom filter in the h_reserved area of erofs_xattr_ibody_header? Then it could return early without i/o operations for keys that are not set for the inode. Not sure what the computational cost of that would be though. -- =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= Alexander Larsson Red Hat, Inc alexl@redhat.com alexander.larsson@gmail.com ^ permalink raw reply [flat|nested] 42+ messages in thread
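To make the bloom-filter suggestion concrete, one possible shape for it is
sketched below; the hash choice, bit layout, and helper names are illustrative
assumptions, not existing EROFS on-disk format or code:

/* Sketch: mkfs sets a bit in the (currently reserved) 32-bit field of
 * erofs_xattr_ibody_header for every xattr name the inode carries.
 * getxattr() can then reject absent names (e.g. system.posix_acl_access on
 * an inode without ACLs) before walking inline or shared xattr entries. */
static inline u32 xattr_name_filter_bit(u8 name_index,
					const char *name, size_t len)
{
	/* any cheap, stable hash works; crc32c/xxhash are obvious candidates */
	u32 h = crc32c(name_index, name, len);

	return 1U << (h & 31);
}

static bool xattr_maybe_present(u32 name_filter, u8 name_index,
				const char *name, size_t len)
{
	/* bit clear => definitely absent, return -ENODATA early;
	 * bit set => fall through to the normal inline/shared xattr walk
	 * (false positives are fine, just slower) */
	return name_filter & xattr_name_filter_bit(name_index, name, len);
}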
* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay 2023-03-07 9:07 ` Alexander Larsson @ 2023-03-07 9:26 ` Gao Xiang 2023-03-07 9:38 ` Gao Xiang 2023-03-07 9:46 ` Alexander Larsson 0 siblings, 2 replies; 42+ messages in thread From: Gao Xiang @ 2023-03-07 9:26 UTC (permalink / raw) To: Alexander Larsson Cc: Jingbo Xu, lsf-pc, linux-fsdevel, Amir Goldstein, Christian Brauner, Giuseppe Scrivano, Dave Chinner, Vivek Goyal, Miklos Szeredi On 2023/3/7 17:07, Alexander Larsson wrote: > On Tue, Mar 7, 2023 at 9:34 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote: >> >> >> >> On 2023/3/7 16:21, Alexander Larsson wrote: >>> On Mon, Mar 6, 2023 at 5:17 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote: >>> >>>>>> I tested the performance of "ls -lR" on the whole tree of >>>>>> cs9-developer-rootfs. It seems that the performance of erofs (generated >>>>>> from mkfs.erofs) is slightly better than that of composefs. While the >>>>>> performance of erofs generated from mkfs.composefs is slightly worse >>>>>> that that of composefs. >>>>> >>>>> I suspect that the reason for the lower performance of mkfs.composefs >>>>> is the added overlay.fs-verity xattr to all the files. It makes the >>>>> image larger, and that means more i/o. >>>> >>>> Actually you could move overlay.fs-verity to EROFS shared xattr area (or >>>> even overlay.redirect but it depends) if needed, which could save some >>>> I/Os for your workloads. >>>> >>>> shared xattrs can be used in this way as well if you care such minor >>>> difference, actually I think inlined xattrs for your workload are just >>>> meaningful for selinux labels and capabilities. >>> >>> Really? Could you expand on this, because I would think it will be >>> sort of the opposite. In my usecase, the erofs fs will be read by >>> overlayfs, which will probably access overlay.* pretty often. At the >>> very least it will load overlay.metacopy and overlay.redirect for >>> every lookup. >> >> Really. In that way, it will behave much similiar to composefs on-disk >> arrangement now (in composefs vdata area). >> >> Because in that way, although an extra I/O is needed for verification, >> and it can only happen when actually opening the file (so "ls -lR" is >> not impacted.) But on-disk inodes are more compact. >> >> All EROFS xattrs will be cached in memory so that accessing >> overlay.* pretty often is not greatly impacted due to no real I/Os >> (IOWs, only some CPU time is consumed). > > So, I tried moving the overlay.digest xattr to the shared area, but > actually this made the performance worse for the ls case. I have not That is much strange. We'd like to open it up if needed. BTW, did you test EROFS with acl enabled all the time? > looked into the cause in detail, but my guess is that ls looks for the > acl xattr, and such a negative lookup will cause erofs to look at all > the shared xattrs for the inode, which means they all end up being > loaded anyway. Of course, this will only affect ls (or other cases > that read the acl), so its perhaps a bit uncommon. Yeah, in addition to that, I guess real acls could be landed in inlined xattrs as well if exists... > > Did you ever consider putting a bloom filter in the h_reserved area of > erofs_xattr_ibody_header? Then it could return early without i/o > operations for keys that are not set for the inode. Not sure what the > computational cost of that would be though. Good idea! Let me think about it, but enabling "noacl" mount option isn't prefered if acl is no needed in your use cases. 
Optimizing negative xattr lookups might need more on-disk
improvements; we haven't paid much attention to xattrs so far
(although "overlay.redirect" and "overlay.digest" seem fine for
composefs use cases.)

BTW, if you have more interest in this direction, we could get in
touch in a more effective way to improve EROFS, beyond the community
emails, except for the userns stuff (I know it's useful but I don't
know the answers; maybe, as Christian said, we could develop a new
vfs feature to delegate a filesystem mount to an unprivileged one [1].
I think it's much safer that way for kernel fses with an on-disk
format.)

[1] https://lore.kernel.org/r/20230126082228.rweg75ztaexykejv@wittgenstein

Thanks,
Gao Xiang
>

^ permalink raw reply	[flat|nested] 42+ messages in thread
* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay 2023-03-07 9:26 ` Gao Xiang @ 2023-03-07 9:38 ` Gao Xiang 2023-03-07 9:56 ` Alexander Larsson 2023-03-07 9:46 ` Alexander Larsson 1 sibling, 1 reply; 42+ messages in thread From: Gao Xiang @ 2023-03-07 9:38 UTC (permalink / raw) To: Alexander Larsson Cc: Jingbo Xu, lsf-pc, linux-fsdevel, Amir Goldstein, Christian Brauner, Giuseppe Scrivano, Dave Chinner, Vivek Goyal, Miklos Szeredi On 2023/3/7 17:26, Gao Xiang wrote: > > > On 2023/3/7 17:07, Alexander Larsson wrote: >> On Tue, Mar 7, 2023 at 9:34 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote: >>> >>> >>> >>> On 2023/3/7 16:21, Alexander Larsson wrote: >>>> On Mon, Mar 6, 2023 at 5:17 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote: >>>> >>>>>>> I tested the performance of "ls -lR" on the whole tree of >>>>>>> cs9-developer-rootfs. It seems that the performance of erofs (generated >>>>>>> from mkfs.erofs) is slightly better than that of composefs. While the >>>>>>> performance of erofs generated from mkfs.composefs is slightly worse >>>>>>> that that of composefs. >>>>>> >>>>>> I suspect that the reason for the lower performance of mkfs.composefs >>>>>> is the added overlay.fs-verity xattr to all the files. It makes the >>>>>> image larger, and that means more i/o. >>>>> >>>>> Actually you could move overlay.fs-verity to EROFS shared xattr area (or >>>>> even overlay.redirect but it depends) if needed, which could save some >>>>> I/Os for your workloads. >>>>> >>>>> shared xattrs can be used in this way as well if you care such minor >>>>> difference, actually I think inlined xattrs for your workload are just >>>>> meaningful for selinux labels and capabilities. >>>> >>>> Really? Could you expand on this, because I would think it will be >>>> sort of the opposite. In my usecase, the erofs fs will be read by >>>> overlayfs, which will probably access overlay.* pretty often. At the >>>> very least it will load overlay.metacopy and overlay.redirect for >>>> every lookup. >>> >>> Really. In that way, it will behave much similiar to composefs on-disk >>> arrangement now (in composefs vdata area). >>> >>> Because in that way, although an extra I/O is needed for verification, >>> and it can only happen when actually opening the file (so "ls -lR" is >>> not impacted.) But on-disk inodes are more compact. >>> >>> All EROFS xattrs will be cached in memory so that accessing >>> overlay.* pretty often is not greatly impacted due to no real I/Os >>> (IOWs, only some CPU time is consumed). >> >> So, I tried moving the overlay.digest xattr to the shared area, but >> actually this made the performance worse for the ls case. I have not > > That is much strange. We'd like to open it up if needed. BTW, did you > test EROFS with acl enabled all the time? > >> looked into the cause in detail, but my guess is that ls looks for the >> acl xattr, and such a negative lookup will cause erofs to look at all >> the shared xattrs for the inode, which means they all end up being >> loaded anyway. Of course, this will only affect ls (or other cases >> that read the acl), so its perhaps a bit uncommon. > > Yeah, in addition to that, I guess real acls could be landed in inlined > xattrs as well if exists... > >> >> Did you ever consider putting a bloom filter in the h_reserved area of >> erofs_xattr_ibody_header? Then it could return early without i/o >> operations for keys that are not set for the inode. Not sure what the >> computational cost of that would be though. > > Good idea! 
Let me think about it, but enabling "noacl" mount > option isn't prefered if acl is no needed in your use cases. ^ is preferred. > Optimizing negative xattr lookups might need more on-disk > improvements which we didn't care about xattrs more. (although > "overlay.redirect" and "overlay.digest" seems fine for > composefs use cases.) Or we could just add a FEATURE_COMPAT_NOACL mount option to disable ACLs explicitly if the image doesn't have any ACLs. At least it's useful for your use cases. Thanks, Gao Xiang > > BTW, if you have more interest in this way, we could get in > touch in a more effective way to improve EROFS in addition to > community emails except for the userns stuff (I know it's useful > but I don't know the answers, maybe as Chistian said, we could > develop a new vfs feature to delegate a filesystem mount to an > unprivileged one [1]. I think it's much safer in that way for > kernel fses with on-disk format.) > > [1] https://lore.kernel.org/r/20230126082228.rweg75ztaexykejv@wittgenstein > > Thanks, > Gao Xiang >> ^ permalink raw reply [flat|nested] 42+ messages in thread
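A rough sketch of how the FEATURE_COMPAT_NOACL idea suggested above could be
wired up on the kernel side; the flag name comes from the message, while its
value and the helper are hypothetical:

/* Hypothetical sketch: mkfs sets a compat feature bit when the image contains
 * no ACLs at all, and fill_super clears SB_POSIXACL so
 * system.posix_acl_access lookups fail fast instead of walking per-inode
 * xattrs. The bit value is made up for illustration. */
#define EROFS_FEATURE_COMPAT_NOACL	0x00000010	/* hypothetical bit */

static void erofs_maybe_disable_acl(struct super_block *sb, u32 feature_compat)
{
	if (feature_compat & EROFS_FEATURE_COMPAT_NOACL)
		sb->s_flags &= ~SB_POSIXACL;
}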
* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay 2023-03-07 9:38 ` Gao Xiang @ 2023-03-07 9:56 ` Alexander Larsson 2023-03-07 10:06 ` Gao Xiang 0 siblings, 1 reply; 42+ messages in thread From: Alexander Larsson @ 2023-03-07 9:56 UTC (permalink / raw) To: Gao Xiang Cc: Jingbo Xu, lsf-pc, linux-fsdevel, Amir Goldstein, Christian Brauner, Giuseppe Scrivano, Dave Chinner, Vivek Goyal, Miklos Szeredi On Tue, Mar 7, 2023 at 10:38 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote: > > On 2023/3/7 17:26, Gao Xiang wrote: > > > > > > On 2023/3/7 17:07, Alexander Larsson wrote: > >> On Tue, Mar 7, 2023 at 9:34 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote: > >>> > >>> > >>> > >>> On 2023/3/7 16:21, Alexander Larsson wrote: > >>>> On Mon, Mar 6, 2023 at 5:17 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote: > >>>> > >>>>>>> I tested the performance of "ls -lR" on the whole tree of > >>>>>>> cs9-developer-rootfs. It seems that the performance of erofs (generated > >>>>>>> from mkfs.erofs) is slightly better than that of composefs. While the > >>>>>>> performance of erofs generated from mkfs.composefs is slightly worse > >>>>>>> that that of composefs. > >>>>>> > >>>>>> I suspect that the reason for the lower performance of mkfs.composefs > >>>>>> is the added overlay.fs-verity xattr to all the files. It makes the > >>>>>> image larger, and that means more i/o. > >>>>> > >>>>> Actually you could move overlay.fs-verity to EROFS shared xattr area (or > >>>>> even overlay.redirect but it depends) if needed, which could save some > >>>>> I/Os for your workloads. > >>>>> > >>>>> shared xattrs can be used in this way as well if you care such minor > >>>>> difference, actually I think inlined xattrs for your workload are just > >>>>> meaningful for selinux labels and capabilities. > >>>> > >>>> Really? Could you expand on this, because I would think it will be > >>>> sort of the opposite. In my usecase, the erofs fs will be read by > >>>> overlayfs, which will probably access overlay.* pretty often. At the > >>>> very least it will load overlay.metacopy and overlay.redirect for > >>>> every lookup. > >>> > >>> Really. In that way, it will behave much similiar to composefs on-disk > >>> arrangement now (in composefs vdata area). > >>> > >>> Because in that way, although an extra I/O is needed for verification, > >>> and it can only happen when actually opening the file (so "ls -lR" is > >>> not impacted.) But on-disk inodes are more compact. > >>> > >>> All EROFS xattrs will be cached in memory so that accessing > >>> overlay.* pretty often is not greatly impacted due to no real I/Os > >>> (IOWs, only some CPU time is consumed). > >> > >> So, I tried moving the overlay.digest xattr to the shared area, but > >> actually this made the performance worse for the ls case. I have not > > > > That is much strange. We'd like to open it up if needed. BTW, did you > > test EROFS with acl enabled all the time? > > > >> looked into the cause in detail, but my guess is that ls looks for the > >> acl xattr, and such a negative lookup will cause erofs to look at all > >> the shared xattrs for the inode, which means they all end up being > >> loaded anyway. Of course, this will only affect ls (or other cases > >> that read the acl), so its perhaps a bit uncommon. > > > > Yeah, in addition to that, I guess real acls could be landed in inlined > > xattrs as well if exists... > > > >> > >> Did you ever consider putting a bloom filter in the h_reserved area of > >> erofs_xattr_ibody_header? 
Then it could return early without i/o > >> operations for keys that are not set for the inode. Not sure what the > >> computational cost of that would be though. > > > > Good idea! Let me think about it, but enabling "noacl" mount > > option isn't prefered if acl is no needed in your use cases. > > ^ is preferred. That is probably the right approach for the composefs usecase. But even when you want acls, typically only just a few files have acls set, so it might be interesting to handle the negative acl lookup case more efficiently. -- =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= Alexander Larsson Red Hat, Inc alexl@redhat.com alexander.larsson@gmail.com ^ permalink raw reply [flat|nested] 42+ messages in thread

* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay
  2023-03-07  9:56 ` Alexander Larsson
@ 2023-03-07 10:06 ` Gao Xiang
  0 siblings, 0 replies; 42+ messages in thread
From: Gao Xiang @ 2023-03-07 10:06 UTC (permalink / raw)
To: Alexander Larsson
Cc: Jingbo Xu, lsf-pc, linux-fsdevel, Amir Goldstein, Christian Brauner,
    Giuseppe Scrivano, Dave Chinner, Vivek Goyal, Miklos Szeredi

On 2023/3/7 17:56, Alexander Larsson wrote:
> On Tue, Mar 7, 2023 at 10:38 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>>
>> On 2023/3/7 17:26, Gao Xiang wrote:
>>>
>>>
>>> On 2023/3/7 17:07, Alexander Larsson wrote:
>>>> On Tue, Mar 7, 2023 at 9:34 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 2023/3/7 16:21, Alexander Larsson wrote:
>>>>>> On Mon, Mar 6, 2023 at 5:17 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>>>>>>
>>>>>>>>> I tested the performance of "ls -lR" on the whole tree of
>>>>>>>>> cs9-developer-rootfs. It seems that the performance of erofs (generated
>>>>>>>>> from mkfs.erofs) is slightly better than that of composefs. While the
>>>>>>>>> performance of erofs generated from mkfs.composefs is slightly worse
>>>>>>>>> that that of composefs.
>>>>>>>>
>>>>>>>> I suspect that the reason for the lower performance of mkfs.composefs
>>>>>>>> is the added overlay.fs-verity xattr to all the files. It makes the
>>>>>>>> image larger, and that means more i/o.
>>>>>>>
>>>>>>> Actually you could move overlay.fs-verity to EROFS shared xattr area (or
>>>>>>> even overlay.redirect but it depends) if needed, which could save some
>>>>>>> I/Os for your workloads.
>>>>>>>
>>>>>>> shared xattrs can be used in this way as well if you care such minor
>>>>>>> difference, actually I think inlined xattrs for your workload are just
>>>>>>> meaningful for selinux labels and capabilities.
>>>>>>
>>>>>> Really? Could you expand on this, because I would think it will be
>>>>>> sort of the opposite. In my usecase, the erofs fs will be read by
>>>>>> overlayfs, which will probably access overlay.* pretty often. At the
>>>>>> very least it will load overlay.metacopy and overlay.redirect for
>>>>>> every lookup.
>>>>>
>>>>> Really. In that way, it will behave much similiar to composefs on-disk
>>>>> arrangement now (in composefs vdata area).
>>>>>
>>>>> Because in that way, although an extra I/O is needed for verification,
>>>>> and it can only happen when actually opening the file (so "ls -lR" is
>>>>> not impacted.) But on-disk inodes are more compact.
>>>>>
>>>>> All EROFS xattrs will be cached in memory so that accessing
>>>>> overlay.* pretty often is not greatly impacted due to no real I/Os
>>>>> (IOWs, only some CPU time is consumed).
>>>>
>>>> So, I tried moving the overlay.digest xattr to the shared area, but
>>>> actually this made the performance worse for the ls case. I have not
>>>
>>> That is much strange. We'd like to open it up if needed. BTW, did you
>>> test EROFS with acl enabled all the time?
>>>
>>>> looked into the cause in detail, but my guess is that ls looks for the
>>>> acl xattr, and such a negative lookup will cause erofs to look at all
>>>> the shared xattrs for the inode, which means they all end up being
>>>> loaded anyway. Of course, this will only affect ls (or other cases
>>>> that read the acl), so its perhaps a bit uncommon.
>>>
>>> Yeah, in addition to that, I guess real acls could be landed in inlined
>>> xattrs as well if exists...
>>>
>>>>
>>>> Did you ever consider putting a bloom filter in the h_reserved area of
>>>> erofs_xattr_ibody_header? Then it could return early without i/o
>>>> operations for keys that are not set for the inode. Not sure what the
>>>> computational cost of that would be though.
>>>
>>> Good idea! Let me think about it, but enabling "noacl" mount
>>> option isn't prefered if acl is no needed in your use cases.
>>
>> ^ is preferred.
>
> That is probably the right approach for the composefs usecase. But
> even when you want acls, typically only just a few files have acls
> set, so it might be interesting to handle the negative acl lookup case
> more efficiently.

Let me seek time to improve this with bloom filters. It won't be hard,
and I'd also like to improve some other on-disk formats together with
this xattr enhancement.

Thanks for your input!

Thanks,
Gao Xiang

>

^ permalink raw reply	[flat|nested] 42+ messages in thread
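
To make the bloom-filter idea above concrete, here is a minimal user-space
sketch, assuming a 32-bit filter word, an FNV-1a style hash and two bits per
xattr name. None of this is EROFS code; the hash choice, the bit count and
the name-index values are assumptions made purely for the example.

/*
 * Illustrative sketch of a per-inode xattr name bloom filter, e.g. a
 * 32-bit word an on-disk format could keep in a reserved header field.
 * Everything below is demo code, not the in-kernel implementation.
 */
#include <stdint.h>
#include <stdio.h>

static uint32_t name_hash(uint8_t name_index, const char *name, uint32_t seed)
{
	/* FNV-1a over (name_index, name); a real format would pin this down */
	uint32_t h = 2166136261u ^ seed;

	h = (h ^ name_index) * 16777619u;
	while (*name)
		h = (h ^ (uint8_t)*name++) * 16777619u;
	return h;
}

/* mkfs side: set two bits for each xattr present on the inode */
static uint32_t filter_add(uint32_t filter, uint8_t idx, const char *name)
{
	filter |= 1u << (name_hash(idx, name, 0) & 31);
	filter |= 1u << (name_hash(idx, name, 1) & 31);
	return filter;
}

/*
 * getxattr side: if either bit is clear the key is definitely absent, so
 * the lookup can fail before touching the shared xattr area.  A set bit
 * only means "maybe present", so the normal scan still runs then.
 */
static int filter_may_contain(uint32_t filter, uint8_t idx, const char *name)
{
	return (filter & (1u << (name_hash(idx, name, 0) & 31))) &&
	       (filter & (1u << (name_hash(idx, name, 1) & 31)));
}

int main(void)
{
	/* name-index values are made up for this demo */
	enum { IDX_TRUSTED = 4, IDX_SYSTEM = 2 };
	uint32_t filter = 0;

	filter = filter_add(filter, IDX_TRUSTED, "overlay.metacopy");
	filter = filter_add(filter, IDX_TRUSTED, "overlay.redirect");

	printf("posix_acl_access: %s\n",
	       filter_may_contain(filter, IDX_SYSTEM, "posix_acl_access") ?
	       "maybe present (scan)" : "absent (skip i/o)");
	printf("overlay.redirect: %s\n",
	       filter_may_contain(filter, IDX_TRUSTED, "overlay.redirect") ?
	       "maybe present (scan)" : "absent (skip i/o)");
	return 0;
}

The useful answer is the negative one: a clear bit proves the key was never
set on the inode, so a probe like the ACL lookup from "ls -l" can return
early without reading the shared xattr area, at the cost of a few multiplies
per lookup and four bytes per inode; bloom filters give false positives but
never false negatives.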

* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay
  2023-03-07  9:26 ` Gao Xiang
  2023-03-07  9:38 ` Gao Xiang
@ 2023-03-07  9:46 ` Alexander Larsson
  2023-03-07 10:01 ` Gao Xiang
  1 sibling, 1 reply; 42+ messages in thread
From: Alexander Larsson @ 2023-03-07 9:46 UTC (permalink / raw)
To: Gao Xiang
Cc: Jingbo Xu, lsf-pc, linux-fsdevel, Amir Goldstein, Christian Brauner,
    Giuseppe Scrivano, Dave Chinner, Vivek Goyal, Miklos Szeredi

On Tue, Mar 7, 2023 at 10:26 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
> On 2023/3/7 17:07, Alexander Larsson wrote:
> > On Tue, Mar 7, 2023 at 9:34 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
> >>
> >>
> >>
> >> On 2023/3/7 16:21, Alexander Larsson wrote:
> >>> On Mon, Mar 6, 2023 at 5:17 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
> >>>
> >>>>>> I tested the performance of "ls -lR" on the whole tree of
> >>>>>> cs9-developer-rootfs. It seems that the performance of erofs (generated
> >>>>>> from mkfs.erofs) is slightly better than that of composefs. While the
> >>>>>> performance of erofs generated from mkfs.composefs is slightly worse
> >>>>>> that that of composefs.
> >>>>>
> >>>>> I suspect that the reason for the lower performance of mkfs.composefs
> >>>>> is the added overlay.fs-verity xattr to all the files. It makes the
> >>>>> image larger, and that means more i/o.
> >>>>
> >>>> Actually you could move overlay.fs-verity to EROFS shared xattr area (or
> >>>> even overlay.redirect but it depends) if needed, which could save some
> >>>> I/Os for your workloads.
> >>>>
> >>>> shared xattrs can be used in this way as well if you care such minor
> >>>> difference, actually I think inlined xattrs for your workload are just
> >>>> meaningful for selinux labels and capabilities.
> >>>
> >>> Really? Could you expand on this, because I would think it will be
> >>> sort of the opposite. In my usecase, the erofs fs will be read by
> >>> overlayfs, which will probably access overlay.* pretty often. At the
> >>> very least it will load overlay.metacopy and overlay.redirect for
> >>> every lookup.
> >>
> >> Really. In that way, it will behave much similiar to composefs on-disk
> >> arrangement now (in composefs vdata area).
> >>
> >> Because in that way, although an extra I/O is needed for verification,
> >> and it can only happen when actually opening the file (so "ls -lR" is
> >> not impacted.) But on-disk inodes are more compact.
> >>
> >> All EROFS xattrs will be cached in memory so that accessing
> >> overlay.* pretty often is not greatly impacted due to no real I/Os
> >> (IOWs, only some CPU time is consumed).
> >
> > So, I tried moving the overlay.digest xattr to the shared area, but
> > actually this made the performance worse for the ls case. I have not
>
> That is much strange. We'd like to open it up if needed. BTW, did you
> test EROFS with acl enabled all the time?

These were all with acl enabled.

And, to test this, I compared "ls -lR" and "ls -ZR", which do the same
per-file syscalls, except the latter doesn't try to read the
system.posix_acl_access xattr. The result is:

xattr:        inlined | not inlined
------------+---------+------------
ls -lR cold |   708   |    721
ls -lR warm |   415   |    412
ls -ZR cold |   522   |    512
ls -ZR warm |   283   |    279

In the ZR case the out-of-band digest is a win, but not in the lR
case, which seems to mean the failed lookup of the acl xattr is to
blame here.

Also, very interesting is the fact that the warm cache difference for
these two is so large. I guess that is because most other inode data is
cached, but the xattr lookups are not. If you could cache negative
xattr lookups that seems like a large win. This can be either via a
bloom filter in the disk format or maybe even just some in-memory
negative lookup caches for the inode, maybe even special casing the
acl xattrs.

> > looked into the cause in detail, but my guess is that ls looks for the
> > acl xattr, and such a negative lookup will cause erofs to look at all
> > the shared xattrs for the inode, which means they all end up being
> > loaded anyway. Of course, this will only affect ls (or other cases
> > that read the acl), so its perhaps a bit uncommon.
>
> Yeah, in addition to that, I guess real acls could be landed in inlined
> xattrs as well if exists...

Yeah, but that doesn't help with the case where they don't exist.

> BTW, if you have more interest in this way, we could get in
> touch in a more effective way to improve EROFS in addition to
> community emails except for the userns stuff

I don't really have time to do any real erofs-specific work. These are
just some ideas that I got looking at these results.

--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Alexander Larsson                            Red Hat, Inc
alexl@redhat.com            alexander.larsson@gmail.com

^ permalink raw reply	[flat|nested] 42+ messages in thread
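
As a companion sketch of the in-memory alternative mentioned in the message
above (a negative lookup cache on the inode, special-casing the acl xattrs),
the fragment below remembers a "known absent" bit per ACL name after the
first failed lookup, so later probes are answered without any I/O. The flag
names, the inode structure and slow_getxattr() are hypothetical stand-ins,
not erofs or VFS interfaces.

/*
 * Hypothetical sketch of an in-memory negative xattr lookup cache that
 * special-cases the ACL names.  Demo code only.
 */
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define ABSENT_ACL_ACCESS	(1u << 0)  /* system.posix_acl_access known absent */
#define ABSENT_ACL_DEFAULT	(1u << 1)  /* system.posix_acl_default known absent */

struct demo_inode {
	unsigned int xattr_absent;  /* sticky, since the image is read-only */
};

static unsigned int acl_flag(const char *name)
{
	if (!strcmp(name, "system.posix_acl_access"))
		return ABSENT_ACL_ACCESS;
	if (!strcmp(name, "system.posix_acl_default"))
		return ABSENT_ACL_DEFAULT;
	return 0;
}

/* stand-in for the real on-disk scan; here it always finds nothing */
static int slow_getxattr(struct demo_inode *inode, const char *name,
			 void *buf, size_t len)
{
	(void)inode; (void)name; (void)buf; (void)len;
	return -ENODATA;
}

static int getxattr_cached(struct demo_inode *inode, const char *name,
			   void *buf, size_t len)
{
	unsigned int flag = acl_flag(name);
	int ret;

	if (flag && (inode->xattr_absent & flag))
		return -ENODATA;             /* answered with no I/O at all */

	ret = slow_getxattr(inode, name, buf, len);
	if (ret == -ENODATA && flag)
		inode->xattr_absent |= flag; /* the next "ls -l" hits the cache */
	return ret;
}

int main(void)
{
	struct demo_inode inode = { 0 };

	/* first probe scans and records the miss, second returns immediately */
	getxattr_cached(&inode, "system.posix_acl_access", NULL, 0);
	return getxattr_cached(&inode, "system.posix_acl_access", NULL, 0)
		== -ENODATA ? 0 : 1;
}

Compared with the on-disk bloom filter, this costs nothing in the image
format but only helps on repeated (warm-cache) lookups of the same inode.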

* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay
  2023-03-07  9:46 ` Alexander Larsson
@ 2023-03-07 10:01 ` Gao Xiang
  0 siblings, 0 replies; 42+ messages in thread
From: Gao Xiang @ 2023-03-07 10:01 UTC (permalink / raw)
To: Alexander Larsson
Cc: Jingbo Xu, lsf-pc, linux-fsdevel, Amir Goldstein, Christian Brauner,
    Giuseppe Scrivano, Dave Chinner, Vivek Goyal, Miklos Szeredi

On 2023/3/7 17:46, Alexander Larsson wrote:
> On Tue, Mar 7, 2023 at 10:26 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>> On 2023/3/7 17:07, Alexander Larsson wrote:
>>> On Tue, Mar 7, 2023 at 9:34 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>>>>
>>>>
>>>>
>>>> On 2023/3/7 16:21, Alexander Larsson wrote:
>>>>> On Mon, Mar 6, 2023 at 5:17 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>>>>>
>>>>>>>> I tested the performance of "ls -lR" on the whole tree of
>>>>>>>> cs9-developer-rootfs. It seems that the performance of erofs (generated
>>>>>>>> from mkfs.erofs) is slightly better than that of composefs. While the
>>>>>>>> performance of erofs generated from mkfs.composefs is slightly worse
>>>>>>>> that that of composefs.
>>>>>>>
>>>>>>> I suspect that the reason for the lower performance of mkfs.composefs
>>>>>>> is the added overlay.fs-verity xattr to all the files. It makes the
>>>>>>> image larger, and that means more i/o.
>>>>>>
>>>>>> Actually you could move overlay.fs-verity to EROFS shared xattr area (or
>>>>>> even overlay.redirect but it depends) if needed, which could save some
>>>>>> I/Os for your workloads.
>>>>>>
>>>>>> shared xattrs can be used in this way as well if you care such minor
>>>>>> difference, actually I think inlined xattrs for your workload are just
>>>>>> meaningful for selinux labels and capabilities.
>>>>>
>>>>> Really? Could you expand on this, because I would think it will be
>>>>> sort of the opposite. In my usecase, the erofs fs will be read by
>>>>> overlayfs, which will probably access overlay.* pretty often. At the
>>>>> very least it will load overlay.metacopy and overlay.redirect for
>>>>> every lookup.
>>>>
>>>> Really. In that way, it will behave much similiar to composefs on-disk
>>>> arrangement now (in composefs vdata area).
>>>>
>>>> Because in that way, although an extra I/O is needed for verification,
>>>> and it can only happen when actually opening the file (so "ls -lR" is
>>>> not impacted.) But on-disk inodes are more compact.
>>>>
>>>> All EROFS xattrs will be cached in memory so that accessing
>>>> overlay.* pretty often is not greatly impacted due to no real I/Os
>>>> (IOWs, only some CPU time is consumed).
>>>
>>> So, I tried moving the overlay.digest xattr to the shared area, but
>>> actually this made the performance worse for the ls case. I have not
>>
>> That is much strange. We'd like to open it up if needed. BTW, did you
>> test EROFS with acl enabled all the time?
>
> These were all with acl enabled.
>
> And, to test this, I compared "ls -lR" and "ls -ZR", which do the same
> per-file syscalls, except the latter doesn't try to read the
> system.posix_acl_access xattr. The result is:
>
> xattr:        inlined | not inlined
> ------------+---------+------------
> ls -lR cold |   708   |    721
> ls -lR warm |   415   |    412
> ls -ZR cold |   522   |    512
> ls -ZR warm |   283   |    279
>
> In the ZR case the out-of-band digest is a win, but not in the lR
> case, which seems to mean the failed lookup of the acl xattr is to
> blame here.
>
> Also, very interesting is the fact that the warm cache difference for
> these two is so large. I guess that is because most other inode data is
> cached, but the xattr lookups are not. If you could cache negative
> xattr lookups that seems like a large win. This can be either via a
> bloom filter in the disk format or maybe even just some in-memory
> negative lookup caches for the inode, maybe even special casing the
> acl xattrs.

Yes, agreed. Actually we haven't taken much time to look at the ACL
impact, because almost all generic fses (such as ext4, XFS, btrfs, etc.)
implement ACLs anyway. But you could use "-o noacl" to disable it if
needed with the current codebase.

>
>>> looked into the cause in detail, but my guess is that ls looks for the
>>> acl xattr, and such a negative lookup will cause erofs to look at all
>>> the shared xattrs for the inode, which means they all end up being
>>> loaded anyway. Of course, this will only affect ls (or other cases
>>> that read the acl), so its perhaps a bit uncommon.
>>
>> Yeah, in addition to that, I guess real acls could be landed in inlined
>> xattrs as well if exists...
>
> Yeah, but that doesn't help with the case where they don't exist.
>
>> BTW, if you have more interest in this way, we could get in
>> touch in a more effective way to improve EROFS in addition to
>> community emails except for the userns stuff
>
> I don't really have time to do any real erofs-specific work. These are
> just some ideas that I got looking at these results.

I don't want you guys to do any EROFS-specific work. I just want to
confirm your real requirements (so I can improve this) and the final
goal of this discussion.

At least on my side, after this long discussion and comparison, EROFS
and composefs are much similar (but when EROFS was raised we didn't
have a better choice to get good performance, since you'd already
partially benchmarked other fses) from many points of view except for
some interfaces, and since composefs doesn't implement acl now, if you
use "-o noacl" to mount EROFS, it could deliver better performance.

So I think there is no need to discuss the "ls -lR" stuff here anymore;
if you disagree, we could take more time to investigate this. In other
words, the EROFS on-disk format and loopback devices are not the
performance bottleneck even for the "ls -lR" workload. We could improve
negative xattr lookups as a real input from this discussion.

Thanks,
Gao Xiang

>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay
  2023-03-06 15:49 ` Jingbo Xu
  2023-03-06 16:09 ` Alexander Larsson
@ 2023-03-07 10:00 ` Jingbo Xu
  1 sibling, 0 replies; 42+ messages in thread
From: Jingbo Xu @ 2023-03-07 10:00 UTC (permalink / raw)
To: Alexander Larsson, lsf-pc
Cc: linux-fsdevel, Amir Goldstein, Christian Brauner, Gao Xiang,
    Giuseppe Scrivano, Dave Chinner, Vivek Goyal, Miklos Szeredi

On 3/6/23 11:49 PM, Jingbo Xu wrote:
>
>
> On 3/6/23 7:33 PM, Alexander Larsson wrote:
>> On Fri, Mar 3, 2023 at 2:57 PM Alexander Larsson <alexl@redhat.com> wrote:
>>>
>>> On Mon, Feb 27, 2023 at 10:22 AM Alexander Larsson <alexl@redhat.com> wrote:
>>>>
>>>> Hello,
>>>>
>>>> Recently Giuseppe Scrivano and I have worked on[1] and proposed[2] the
>>>> Composefs filesystem. It is an opportunistically sharing, validating
>>>> image-based filesystem, targeting usecases like validated ostree
>>>> rootfs:es, validated container images that share common files, as well
>>>> as other image based usecases.
>>>>
>>>> During the discussions in the composefs proposal (as seen on LWN[3])
>>>> is has been proposed that (with some changes to overlayfs), similar
>>>> behaviour can be achieved by combining the overlayfs
>>>> "overlay.redirect" xattr with an read-only filesystem such as erofs.
>>>>
>>>> There are pros and cons to both these approaches, and the discussion
>>>> about their respective value has sometimes been heated. We would like
>>>> to have an in-person discussion at the summit, ideally also involving
>>>> more of the filesystem development community, so that we can reach
>>>> some consensus on what is the best apporach.
>>>
>>> In order to better understand the behaviour and requirements of the
>>> overlayfs+erofs approach I spent some time implementing direct support
>>> for erofs in libcomposefs. So, with current HEAD of
>>> github.com/containers/composefs you can now do:
>>>
>>> $ mkcompose --digest-store=objects --format=erofs source-dir image.erofs
>>>
>>> This will produce an object store with the backing files, and a erofs
>>> file with the required overlayfs xattrs, including a made up one
>>> called "overlay.fs-verity" containing the expected fs-verity digest
>>> for the lower dir. It also adds the required whiteouts to cover the
>>> 00-ff dirs from the lower dir.
>>>
>>> These erofs files are ordered similarly to the composefs files, and we
>>> give similar guarantees about their reproducibility, etc. So, they
>>> should be apples-to-apples comparable with the composefs images.
>>>
>>> Given this, I ran another set of performance tests on the original cs9
>>> rootfs dataset, again measuring the time of `ls -lR`. I also tried to
>>> measure the memory use like this:
>>>
>>> # echo 3 > /proc/sys/vm/drop_caches
>>> # systemd-run --scope sh -c 'ls -lR mountpoint' > /dev/null; cat $(cat
>>> /proc/self/cgroup | sed -e "s|0::|/sys/fs/cgroup|")/memory.peak'
>>>
>>> These are the alternatives I tried:
>>>
>>> xfs: the source of the image, regular dir on xfs
>>> erofs: the image.erofs above, on loopback
>>> erofs dio: the image.erofs above, on loopback with --direct-io=on
>>> ovl: erofs above combined with overlayfs
>>> ovl dio: erofs dio above combined with overlayfs
>>> cfs: composefs mount of image.cfs
>>>
>>> All tests use the same objects dir, stored on xfs. The erofs and
>>> overlay implementations are from a stock 6.1.13 kernel, and composefs
>>> module is from github HEAD.
>>>
>>> I tried loopback both with and without the direct-io option, because
>>> without direct-io enabled the kernel will double-cache the loopbacked
>>> data, as per[1].
>>>
>>> The produced images are:
>>> 8.9M image.cfs
>>> 11.3M image.erofs
>>>
>>> And gives these results:
>>>            | Cold cache | Warm cache | Mem use
>>>            |   (msec)   |   (msec)   |  (mb)
>>> -----------+------------+------------+---------
>>> xfs        |    1449    |    442     |   54
>>> erofs      |     700    |    391     |   45
>>> erofs dio  |     939    |    400     |   45
>>> ovl        |    1827    |    530     |  130
>>> ovl dio    |    2156    |    531     |  130
>>> cfs        |     689    |    389     |   51
>>
>> It has been noted that the readahead done by kernel_read() may cause
>> read-ahead of unrelated data into memory which skews the results in
>> favour of workloads that consume all the filesystem metadata (such as
>> the ls -lR usecase of the above test). In the table above this favours
>> composefs (which uses kernel_read in some codepaths) as well as
>> non-dio erofs (non-dio loopback device uses readahead too).
>>
>> I updated composefs to not use kernel_read here:
>> https://github.com/containers/composefs/pull/105
>>
>> And a new kernel patch-set based on this is available at:
>> https://github.com/alexlarsson/linux/tree/composefs
>>
>> The resulting table is now (dropping the non-dio erofs):
>>
>>            | Cold cache | Warm cache | Mem use
>>            |   (msec)   |   (msec)   |  (mb)
>> -----------+------------+------------+---------
>> xfs        |    1449    |    442     |   54
>> erofs dio  |     939    |    400     |   45
>> ovl dio    |    2156    |    531     |  130
>> cfs        |     833    |    398     |   51
>>
>>            | Cold cache | Warm cache | Mem use
>>            |   (msec)   |   (msec)   |  (mb)
>> -----------+------------+------------+---------
>> ext4       |    1135    |    394     |   54
>> erofs dio  |     922    |    401     |   45
>> ovl dio    |    1810    |    532     |  149
>> ovl lazy   |    1063    |    523     |   87
>> cfs        |     768    |    459     |   51
>>
>> So, while cfs is somewhat worse now for this particular usecase, my
>> overall analysis still stands.
>>
>
> Hi,
>
> I tested your patch removing kernel_read(), and here is the statistics
> tested in my environment.
>
>
> Setup
> ======
> CPU: x86_64 Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz
> Disk: cloud disk, 11800 IOPS upper limit
> OS: Linux v6.2
> FS of backing objects: xfs
>
>
> Image size
> ===========
> 8.6M large.composefs (with --compute-digest)
> 8.9M large.erofs (mkfs.erofs)
> 11M large.cps.in.erofs (mkfs.composefs --compute-digest --format=erofs)
>
>
> Perf of "ls -lR"
> ================
>                                                     | uncached| cached
>                                                     |  (ms)   |  (ms)
> ----------------------------------------------------|---------|--------
> composefs                                           |   519   |  178
> erofs (mkfs.erofs, DIRECT loop)                     |   497   |  192
> erofs (mkfs.composefs --format=erofs, DIRECT loop)  |   536   |  199
>
> I tested the performance of "ls -lR" on the whole tree of
> cs9-developer-rootfs. It seems that the performance of erofs (generated
> from mkfs.erofs) is slightly better than that of composefs. While the
> performance of erofs generated from mkfs.composefs is slightly worse
> that that of composefs.
>
> The uncached performance is somewhat slightly different with that given
> by Alexander Larsson. I think it may be due to different test
> environment, as my test machine is a server with robust performance,
> with cloud disk as storage.
>
> It's just a simple test without further analysis, as it's a bit late for
> me :)
>

Forgot to mention that all erofs images (whether generated by mkfs.erofs
or by mkfs.composefs) are mounted with "-o noacl", as composefs has not
implemented acl support yet.

--
Thanks,
Jingbo

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2023-04-27 16:11 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-27  9:22 [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay Alexander Larsson
2023-02-27 10:45 ` Gao Xiang
2023-02-27 10:58 ` Christian Brauner
2023-04-27 16:11 ` [Lsf-pc] " Amir Goldstein
2023-03-01  3:47 ` Jingbo Xu
2023-03-03 14:41 ` Alexander Larsson
2023-03-03 15:48 ` Gao Xiang
2023-02-27 11:37 ` Jingbo Xu
2023-03-03 13:57 ` Alexander Larsson
2023-03-03 15:13 ` Gao Xiang
2023-03-03 17:37 ` Gao Xiang
2023-03-04 14:59 ` Colin Walters
2023-03-04 15:29 ` Gao Xiang
2023-03-04 16:22 ` Gao Xiang
2023-03-07  1:00 ` Colin Walters
2023-03-07  3:10 ` Gao Xiang
2023-03-07 10:15 ` Christian Brauner
2023-03-07 11:03 ` Gao Xiang
2023-03-07 12:09 ` Alexander Larsson
2023-03-07 12:55 ` Gao Xiang
2023-03-07 15:16 ` Christian Brauner
2023-03-07 19:33 ` Giuseppe Scrivano
2023-03-08 10:31 ` Christian Brauner
2023-03-07 13:38 ` Jeff Layton
2023-03-08 10:37 ` Christian Brauner
2023-03-04  0:46 ` Jingbo Xu
2023-03-06 11:33 ` Alexander Larsson
2023-03-06 12:15 ` Gao Xiang
2023-03-06 15:49 ` Jingbo Xu
2023-03-06 16:09 ` Alexander Larsson
2023-03-06 16:17 ` Gao Xiang
2023-03-07  8:21 ` Alexander Larsson
2023-03-07  8:33 ` Gao Xiang
2023-03-07  8:48 ` Gao Xiang
2023-03-07  9:07 ` Alexander Larsson
2023-03-07  9:26 ` Gao Xiang
2023-03-07  9:38 ` Gao Xiang
2023-03-07  9:56 ` Alexander Larsson
2023-03-07 10:06 ` Gao Xiang
2023-03-07  9:46 ` Alexander Larsson
2023-03-07 10:01 ` Gao Xiang
2023-03-07 10:00 ` Jingbo Xu