From: Gao Xiang <hsiangkao@linux.alibaba.com>
To: Du Rui <durui@linux.alibaba.com>, alexl@redhat.com
Cc: agk@redhat.com, dm-devel@redhat.com, gscrivan@redhat.com,
linux-kernel@vger.kernel.org, snitzer@kernel.org
Subject: Re: dm overlaybd: targets mapping OverlayBD image
Date: Sat, 27 May 2023 00:43:51 +0800 [thread overview]
Message-ID: <ac8519fd-85f4-e778-0c6c-b2e893a37628@linux.alibaba.com> (raw)
In-Reply-To: <20230526102633.31160-1-durui@linux.alibaba.com>
On 2023/5/26 03:26, Du Rui wrote:
> Hi Alexander,
>
>> all the lvm volume changes and mounts during runtime caused
>> weird behaviour (especially at scale) that was painful to manage (just
>> search the docker issue tracker for devmapper backend). In the end
>> everyone moved to a filesystem based implementation (overlayfs based).
>
> Yes, we had exactly the same experience. This is another reason why
> this proposal is for dm and lvm, not for containers.
> (BTW, we are using TCMU and ublk for overlaybd in production. They are awesome.)
>
>
>> This solution doesn't even allow page cache sharing between shared
>> layers (like current containers do), much less between independent
>> layers.
>
> Page cache sharing can be realized with DAX support of the dm targets
> (and the inner file system), together with virtual pmem device backend.
First, I'd suggest learning what DAX and the page cache actually are
before explaining them to a kernel mailing list. For example, DAX
memory cannot be reclaimed at all.

Block drivers have nothing to do with the filesystem page cache, and
your current approach has nothing to do with pmem either. (If you must
invoke "DAX" to propose your "page cache sharing", please write down
your detailed design _here_ first and explain to us how it could work,
if you really want to do that.)
Apart from being unable to share page cache among filesystems, with
your approach all I/Os are also duplicated among your qcow2-like
layers. For example, take three qcow2-like layers A, B and C:

  filesystem 1: A + B
  filesystem 2: A + B + C

Filesystems 1 and 2 are independent at runtime, and your block driver
cannot help with that: both the I/Os and the page cache are duplicated
for any data and metadata of layers A and B.
If there are even more container layers (dozens or hundreds), your
approach becomes even more inefficient due to the duplicated I/Os.

You could implement some internal block-level cache, but a block-level
cache is less flexible than the page cache with respect to kernel
memory reclaim and page migration.
>
>> Erofs already has some block-level support for container images
>
> It is interesting. Erofs runs on a block device in the first place,
> like many file systems do. But do you know why it implements another
> "some block-level support" by itself?
>
Honestly, that is funny. As for container image use cases: although an
OCI image tgz is unseekable, ext4 and btrfs images are seekable, and
on-demand loading can be done with these raw images directly. In
principle, you could dump your container image contents from tgz to raw
ext4, btrfs, erofs, whatever. Or, if you like, you could dump to some
widely-used format such as "qcow2", "vhdx" or "vmdk"; their ecosystem
is more mature, but none of the above helps with page cache sharing.
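The seekability difference can be shown with a minimal stdlib-only
Python sketch (the archive member names and contents below are made up
for illustration). Reading one member of a gzip-compressed tar forces a
sequential decompression of everything before it, while an uncompressed
archive, like a raw filesystem image, can be read at a known offset
directly, which is what makes on-demand loading work:

```python
import io
import tarfile

# Build a small two-member archive in memory, once plain (raw-image-like,
# seekable) and once gzip-compressed (tgz-like, unseekable).
def build(mode):
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode=mode) as tf:
        for name, data in [("layer/early", b"x" * 4096),
                           ("layer/late", b"wanted")]:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    return buf.getvalue()

plain = build("w")      # uncompressed: member offsets are usable directly
tgz   = build("w:gz")   # gzip stream: no random access into the payload

# Seekable case: jump straight to the member's data offset.
with tarfile.open(fileobj=io.BytesIO(plain)) as tf:
    member = tf.getmember("layer/late")
raw = io.BytesIO(plain)
raw.seek(member.offset_data)      # direct offset read, like on-demand block I/O
print(raw.read(member.size))      # b'wanted'

# Unseekable case: stream mode must walk members in order.
with tarfile.open(fileobj=io.BytesIO(tgz), mode="r|gz") as tf:
    for m in tf:                  # sequential scan; cannot skip ahead cheaply
        if m.name == "layer/late":
            # only reachable after decompressing everything before it
            print(tf.extractfile(m).read())   # b'wanted'
```

The same property is why a raw ext4/btrfs/erofs image can back on-demand
loading directly, while a tgz must be converted first.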
Please don't say "I like erofs" and at the same time ask "why does it
implement some block-level support by itself". Local filesystems must
do their block mapping themselves: ext4 (extents or block maps), XFS
(extents), etc.
I've explained this internally to your team multiple times as a kernel
developer; personally, I don't want to keep repeating it here.
Thanks,
Gao Xiang
>> And this new approach doesn't help
> No. It is intended for dm and lvm.
Thread overview: 18+ messages
2023-05-19 10:27 [RFC] dm overlaybd: targets mapping OverlayBD image Du Rui
2023-05-23 17:28 ` Mike Snitzer
2023-05-24 0:56 ` [dm-devel] " Gao Xiang
2023-05-24 6:43 ` Alexander Larsson
2023-05-24 7:13 ` Gao Xiang
2023-05-24 8:11 ` Giuseppe Scrivano
2023-05-24 8:26 ` Gao Xiang
2023-05-24 10:48 ` Giuseppe Scrivano
2023-05-24 11:06 ` Gao Xiang
2023-05-26 10:28 ` Du Rui
2023-05-26 10:26 ` Du Rui
2023-05-26 16:43 ` Gao Xiang [this message]
2023-05-27 3:13 ` Du Rui
2023-05-27 4:12 ` Gao Xiang
2023-05-24 6:59 ` Du Rui
2023-05-26 10:25 ` Du Rui
2023-05-24 7:24 ` [RFC PATCH v2] " Du Rui
2023-05-24 7:40 ` [RFC PATCH v3] " Du Rui