linux-fsdevel.vger.kernel.org archive mirror
From: "Colin Walters" <walters@verbum.org>
To: "Gao Xiang" <hsiangkao@linux.alibaba.com>,
	"Alexander Larsson" <alexl@redhat.com>,
	lsf-pc@lists.linux-foundation.org
Cc: linux-fsdevel@vger.kernel.org,
	"Amir Goldstein" <amir73il@gmail.com>,
	"Christian Brauner" <brauner@kernel.org>,
	"Jingbo Xu" <jefflexu@linux.alibaba.com>,
	"Giuseppe Scrivano" <gscrivan@redhat.com>,
	"Dave Chinner" <david@fromorbit.com>,
	"Vivek Goyal" <vgoyal@redhat.com>,
	"Miklos Szeredi" <miklos@szeredi.hu>
Subject: Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay
Date: Mon, 06 Mar 2023 20:00:45 -0500
Message-ID: <6bea16fa-737f-4aad-a2cd-0a12029e614d@app.fastmail.com>
In-Reply-To: <0a571702-a907-c2b1-bb38-96aa7b268a1b@linux.alibaba.com>



On Sat, Mar 4, 2023, at 10:29 AM, Gao Xiang wrote:
> Hi Colin,
>
> On 2023/3/4 22:59, Colin Walters wrote:
>> 
>> 
>> On Fri, Mar 3, 2023, at 12:37 PM, Gao Xiang wrote:
>>>
>>> Actually since you're container guys, I would like to mention
>>> a way to directly reuse OCI tar data and not sure if you
>>> have some interest as well, that is just to generate EROFS
>>> metadata which could point to the tar blobs so that data itself
>>> is still the original tar, but we could add fsverity + IMMUTABLE
>>> to these blobs rather than the individual untared files.
>> 
>>>    - OCI layer diff IDs in the OCI spec [1] are guaranteed;
>> 
>> The https://github.com/vbatts/tar-split approach addresses this problem domain adequately I think.
>
> Thanks for the interest and comment.
>
> I'm not aware of this project, and I'm not sure if tar-split
> helps mount tar stuffs, maybe I'm missing something?

Not directly; it's widely used in the container ecosystem (podman/docker etc.) to split the original bit-for-bit tar stream metadata off from the bulk data (particularly regular file contents), so that one can write the files to a regular underlying fs (xfs/ext4/etc.) and use overlayfs on top.  It then helps reverse the process and reconstruct the original tar stream for pushes, for exactly the reason you mention.
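
To make the split/reassemble idea concrete, here is a minimal Python sketch.  It is not tar-split's actual code or storage format; the function names `split_tar` and `join_tar` are made up for illustration, and it assumes an uncompressed tar with no sparse members.  Everything that is not regular-file payload (headers, padding, end-of-archive blocks) is kept verbatim as "meta" segments, and payloads are stored separately by digest, so the stream can be rebuilt byte for byte.

```python
import hashlib
import io
import tarfile


def split_tar(raw: bytes):
    """Split an uncompressed tar stream into ordered segments.

    Returns (segments, store): segments is a list of ("meta", bytes) or
    ("file", digest) entries; store maps digest -> file payload.
    Hypothetical helper; assumes no sparse members.
    """
    segments = []
    store = {}
    pos = 0  # first byte not yet assigned to a segment
    with tarfile.open(fileobj=io.BytesIO(raw), mode="r:") as tf:
        for member in tf:
            if not member.isreg() or member.size == 0:
                continue  # non-file entries stay inside the metadata bytes
            start = member.offset_data
            end = start + member.size
            # Headers and padding since the last payload are kept verbatim.
            segments.append(("meta", raw[pos:start]))
            payload = raw[start:end]
            digest = hashlib.sha256(payload).hexdigest()
            store[digest] = payload
            segments.append(("file", digest))
            pos = end
    # Trailing padding and the end-of-archive blocks.
    segments.append(("meta", raw[pos:]))
    return segments, store


def join_tar(segments, store) -> bytes:
    """Rebuild the original stream from the segments and the payload store."""
    out = bytearray()
    for kind, value in segments:
        out += value if kind == "meta" else store[value]
    return bytes(out)
```

In a real deployment the payloads would live as ordinary files on the underlying filesystem rather than in a dict; the point is only that the layer digest over the rebuilt stream stays the same.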

Slightly OT but a whole reason we're having this conversation now is definitely rooted in the original Docker inventor having the idea of *deriving* or layering on top of previous images, which is not part of dpkg/rpm or squashfs or raw disk images etc.  Inherent in this is the idea that we're not talking about *a* filesystem - we're talking about filesystem*s* plural and how they're wired together and stacked.

It's really only very simplistic use cases for which a single read-only filesystem suffices.  They exist - e.g. people booting things like Tails OS https://tails.boum.org/ on one of those USB sticks with a physical write protection switch, etc. 

But that approach makes every OS update very expensive - most use cases really want fast and efficient incremental in-place OS updates and a clear split between the OS filesystem and app filesystems, but without forcing separate size management onto each.

> Not because EROFS cannot do on-disk dedupe, just because in this
> way EROFS can only use the original tar blobs, and EROFS is not
> the guy to resolve the on-disk sharing stuff.  

Right, agree; this ties into my larger point above that no one technology/filesystem is the sole solution in the general case.

> As a kernel filesystem, if two files are equal, we could treat them
> in the same inode address space, even they are actually with slightly
> different inode metadata (uid, gid, mode, nlink, etc).  That is
> entirely possible as an in-kernel filesystem even currently linux
> kernel doesn't implement finer page cache sharing, so EROFS can
> support page-cache sharing of files in all tar streams if needed.

Hmmm.  I should clarify here that I have zero kernel patches; I'm a userspace developer (on container and OS updates, for which I'd like a unified stack).  But it seems to me that while you're right that it would be technically possible for a single filesystem to do this, in practice it would require some sort of virtual sub-filesystem internally.  And at that point, it does seem more elegant to me to make that stacking explicit, more like how composefs is doing it.
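
As a rough illustration of what "explicit stacking" means here, the following Python sketch separates a per-image metadata layer from a shared, content-addressed data layer.  It is a toy model under my own assumptions, not composefs's or EROFS's actual on-disk format; `Inode`, `ContentStore` and `build_image` are hypothetical names.  Two entries with equal content but different uid/gid/mode end up pointing at one stored object.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Inode:
    """Per-image metadata; the digest is a key into the shared store."""
    mode: int
    uid: int
    gid: int
    digest: str


class ContentStore:
    """Toy content-addressed store: identical bytes are kept once."""

    def __init__(self):
        self.objects = {}

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self.objects.setdefault(digest, data)
        return digest

    def get(self, digest: str) -> bytes:
        return self.objects[digest]


def build_image(files, store):
    """Build a metadata layer: path -> Inode.

    files maps path -> (mode, uid, gid, data); data goes to the store.
    """
    return {
        path: Inode(mode, uid, gid, store.put(data))
        for path, (mode, uid, gid, data) in files.items()
    }
```

For example, building two images that both ship the same library bytes with different ownership yields two distinct `Inode` records but a single object in the store.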

That said I think there's a lot of legitimate debate here, and I hope we can continue doing so productively!


