* [dm-devel] Potential enhancements to dm-thin v2
@ 2022-04-10 22:03 Demi Marie Obenour
2022-04-11 8:16 ` Zdenek Kabelac
0 siblings, 1 reply; 10+ messages in thread
From: Demi Marie Obenour @ 2022-04-10 22:03 UTC (permalink / raw)
To: Joe Thornber; +Cc: dm-devel
[-- Attachment #1.1: Type: text/plain, Size: 2286 bytes --]
For quite a while, I have wanted to write a tool to manage thin volumes
that is not based on LVM. The main thing holding me back is that the
current dm-thin interface is extremely error-prone. The only per-thin
metadata stored by the kernel is a 24-bit thin ID, and userspace must
take great care to keep that ID in sync with its own metadata. Failure
to do so results in data loss, data corruption, or even security
vulnerabilities. Furthermore, having to suspend a thin volume before
one can take a snapshot of it creates a critical section during which
userspace must be very careful, as I/O or a crash can lead to deadlock.
I believe both of these problems can be solved without overly
complicating the kernel implementation.
The metadata problem can be solved by allowing userspace to (1)
associate a 256-byte binary blob with each thin volume and (2) easily
enumerate the thin volumes in a pool. Even with 16777216 thins, this
would only use 4GiB of space, and dm-thin v2 will support far larger
metadata volumes. While being able to look up thins by the blob would
be awesome, I would be okay with just enumerating thins at startup and
caching the ID ⇔ blob mapping in userspace, at least if thin IDs become
64-bit so I do not have to worry about reuse. Being able to enumerate
the thin volumes would allow me to rely solely on the metadata in the
thin pool, without having to manage any metadata in userspace. Looking
at the existing implementation, this seems to be fairly simple: the
current B-tree code supports arbitrary value sizes already, so the blob
could be appended to 'struct disk_device_details'. (Requiring the size
of the blob to be set at pool creation, or when the pool is empty, is
fine.)
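The space estimate above can be checked with a few lines of arithmetic. This is only a sanity check of the numbers already given in the text (24-bit thin IDs, a 256-byte blob per thin); the variable names are made up for illustration.

```python
# Back-of-the-envelope check of the per-thin blob space estimate.
THIN_ID_BITS = 24   # width of a dm-thin device ID, as stated in the text
BLOB_BYTES = 256    # proposed size of the per-thin userspace blob

max_thins = 2 ** THIN_ID_BITS          # largest possible number of thin IDs
total_bytes = max_thins * BLOB_BYTES   # blob storage if every ID is in use

print(max_thins)               # 16777216
print(total_bytes // 2 ** 30)  # 4 (GiB)
```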
The suspend problem can be solved by having the kernel automatically
suspend a thin volume before taking a snapshot of it, and resuming
afterwards. This removes a footgun from the userspace API, and should
improve reliability too, as it reduces the number of error conditions
that can hang the system. Per discussion with Zdenek, having the kernel
do this automatically is infeasible for arbitrary device stacks, but
this is a common special case.
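For reference, the critical section described above can be sketched as follows. The three `dmsetup` invocations follow the snapshot example in the kernel's thin-provisioning documentation; the wrapper function, its parameter names, and the injected `run` callable are assumptions made for illustration, not part of any existing tool.

```python
def take_snapshot(run, pool, origin, snap_id, origin_id):
    """Snapshot an active thin device using the current dm-thin interface.

    `run` is any callable that executes an argv list, for example
    subprocess.check_call. It is injected so the sequence can be
    dry-run without root privileges.
    """
    # The origin must be suspended before create_snap is sent to the pool.
    run(["dmsetup", "suspend", origin])
    try:
        run(["dmsetup", "message", pool, "0",
             f"create_snap {snap_id} {origin_id}"])
    finally:
        # Resume even if the message failed, so the origin is not left
        # suspended by an ordinary error.
        run(["dmsetup", "resume", origin])


# Dry run: record the commands instead of executing them.
recorded = []
take_snapshot(recorded.append, "/dev/mapper/pool", "/dev/mapper/thin", 1, 0)
for argv in recorded:
    print(" ".join(argv))
```

The `try`/`finally` only covers errors that the calling process survives. If the process itself is killed, or blocks on I/O to the suspended device, between the suspend and the resume, nothing resumes the origin, which is the hazard described above.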
--
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab
[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
[-- Attachment #2: Type: text/plain, Size: 98 bytes --]
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [dm-devel] Potential enhancements to dm-thin v2
2022-04-10 22:03 [dm-devel] Potential enhancements to dm-thin v2 Demi Marie Obenour
@ 2022-04-11 8:16 ` Zdenek Kabelac
2022-04-11 17:22 ` Demi Marie Obenour
0 siblings, 1 reply; 10+ messages in thread
From: Zdenek Kabelac @ 2022-04-11 8:16 UTC (permalink / raw)
To: Demi Marie Obenour, Joe Thornber; +Cc: dm-devel
On 11. 04. 22 at 0:03, Demi Marie Obenour wrote:
> For quite a while, I have wanted to write a tool to manage thin volumes
> that is not based on LVM. The main thing holding me back is that the
> current dm-thin interface is extremely error-prone. The only per-thin
> metadata stored by the kernel is a 24-bit thin ID, and userspace must
> take great care to keep that ID in sync with its own metadata. Failure
> to do so results in data loss, data corruption, or even security
> vulnerabilities. Furthermore, having to suspend a thin volume before
> one can take a snapshot of it creates a critical section during which
> userspace must be very careful, as I/O or a crash can lead to deadlock.
> I believe both of these problems can be solved without overly
> complicating the kernel implementation.
Hi
These things come from the initial design of the whole DM world, where
complexity is split between kernel and user space. Projects like btrfs and
ZFS decided to go the other way and create a monolithic 'all-in-one'
solution, which avoids some problems related to communication between kernel
and user space, but at the price of pretty complicated kernel code that is
very hard to develop and debug.
So let me explain one of the reasons we have this logic with suspend. It
follows this basic principle:
write new lvm metadata -> suspend (with all table preloads) -> commit new
lvm2 metadata -> resume
With this we ensure that user space maintains the only valid 'view' of the
metadata.
Your proposal actually breaks this sequence and would move things to a state
of 'guess which state we are in now' (and IMHO it presents much more risk
than the virtual problem with suspend from user space, which is only a
problem if you are using the suspended device as 'swap' or 'rootfs', so there
are very easy ways to orchestrate your LVs to avoid such problems).
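The ordering above can be illustrated with a toy model. This is not lvm2 code: the class, its fields, and the recovery rule (after a crash, discard whatever was written but not committed) are assumptions chosen only to show why a crash at any step leaves exactly one authoritative view of the metadata.

```python
class ToyVolumeManager:
    """Toy model of: write -> suspend -> commit -> resume."""

    STEPS = ("write", "suspend", "commit", "resume")

    def __init__(self, metadata):
        self.committed = dict(metadata)  # the only authoritative view
        self.pending = None              # written but not yet committed
        self.suspended = False

    def change(self, new_metadata, crash_after=None):
        """Apply a change, optionally simulating a crash after one step."""
        for step in self.STEPS:
            if step == "write":
                self.pending = dict(new_metadata)
            elif step == "suspend":
                self.suspended = True
            elif step == "commit":
                self.committed, self.pending = self.pending, None
            elif step == "resume":
                self.suspended = False
            if step == crash_after:
                return  # simulated power-off

    def recover(self):
        """After a reboot: drop uncommitted metadata, trust the rest."""
        self.pending = None
        self.suspended = False
        return self.committed


old = {"vol": 1}
new = {"vol": 1, "snap": 2}

before = ToyVolumeManager(old)
before.change(new, crash_after="suspend")
print(before.recover())  # {'vol': 1}

after = ToyVolumeManager(old)
after.change(new, crash_after="commit")
print(after.recover())   # {'vol': 1, 'snap': 2}
```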
Basically, you want to move the whole management into the kernel for some
not-so-great speed gains (relative to the rest of the running system), and
you can certainly do that by writing your own kernel module to manage your
rather unique software problem.
But IMHO creating and removing thousands of devices in a very short period
of time rather suggests there is something sub-optimal in your original
software design, as I'm really having a hard time imagining why you would
need this.
If you wish to operate lots of devices, simply keep them created and ready,
and eventually blkdiscard them for the next device reuse.
I'm also unsure where any special need to instantiate that many snapshots
would arise. If there is some valid and logical purpose, lvm2 could maybe
have an extended user-space API to create multiple snapshots at once (e.g.
create 10 snapshots with name-%d of a single thinLV).
Not to mention that operating that many thin volumes from a single thin-pool
is also nothing close to the high-performance goal you are trying to reach...
Regards
Zdenek
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [dm-devel] Potential enhancements to dm-thin v2
2022-04-11 8:16 ` Zdenek Kabelac
@ 2022-04-11 17:22 ` Demi Marie Obenour
2022-04-11 20:16 ` Zdenek Kabelac
0 siblings, 1 reply; 10+ messages in thread
From: Demi Marie Obenour @ 2022-04-11 17:22 UTC (permalink / raw)
To: Zdenek Kabelac, Joe Thornber; +Cc: dm-devel
On Mon, Apr 11, 2022 at 10:16:02AM +0200, Zdenek Kabelac wrote:
> On 11. 04. 22 at 0:03, Demi Marie Obenour wrote:
> > For quite a while, I have wanted to write a tool to manage thin volumes
> > that is not based on LVM. The main thing holding me back is that the
> > current dm-thin interface is extremely error-prone. The only per-thin
> > metadata stored by the kernel is a 24-bit thin ID, and userspace must
> > take great care to keep that ID in sync with its own metadata. Failure
> > to do so results in data loss, data corruption, or even security
> > vulnerabilities. Furthermore, having to suspend a thin volume before
> > one can take a snapshot of it creates a critical section during which
> > userspace must be very careful, as I/O or a crash can lead to deadlock.
> > I believe both of these problems can be solved without overly
> > complicating the kernel implementation.
>
>
> Hi
>
> These things are coming with initial design of whole DM world - where there
> is a split of complexity between kernel & user-space. So projects like
> btrfs, ZFS, decided to go the other way and create a monolithic 'all-in-one'
> solution, where they avoid some problems related with communication between
> kernel & user-space - but at the price of having a pretty complicated and
> very hard to devel & debug kernel code.
>
> So let me explain one of the reasons, we have this logic with suspend is
> this basic principle:
>
> write new lvm metadata -> suspend (with all table preloads) -> commit new
> lvm2 metadata -> resume
>
> with this we ensure the user space maintain the only valid 'view' of metadata.
>
> Your proposal actually breaks this sequence and would move things to the
> state of 'guess at which states we are now'. (and IMHO presents much more
> risk than virtual problem with suspend from user-space - which is only a
> problem if you are using suspended device as 'swap' and 'rootfs' - so there
> are very easy ways how to orchestrate your LVs to avoid such problems).
The intent is less “guess what states we are now” and more “It looks
like dm-thin already has the data structures needed to store some
per-thin metadata, and that could make writing a simple userspace volume
manager FAR FAR easier”. It appears to me that the only change needed
would be reserving some space (amount fixed at pool creation) after
‘struct disk_device_details’ for use by userspace, and providing a way
for userspace to enumerate the thin devices on a volume and to set and
retrieve that extra data. Suspend isn’t actually that big of a problem,
since new Qubes OS 4.1 (and later) installs use one pool for the root
filesystem and a separate one for VMs. As a userspace writer, the
scariest part of managing thin volumes is actually making sure I don’t
lose track of which thin ID corresponds to which volume name. The
*only* metadata Qubes OS would need would be a per-thin name, size, thin
ID, and possibly UUID. All of those could be put in that extra space.
> Basically you are essentially wanting to move whole management into kernel
> for some not so great speed gains (related to the rest of the running system
> (and you can certainly do that by writing your own kernel module to manage
> your rather unique software problem)
From a storage perspective, my problem is basically the same as Docker’s
devicemapper driver. Unlike Docker, though, Qubes OS must work at the
block level; it can’t work at the filesystem level. So overlayfs and
friends aren’t options.
> But IMHO creation and removal of thousands of devices in very short period
> of time rather suggest there is something sub-optimal in your original
> software design as I'm really having hard time imagining why would you need
> this ?
There very well could be (suggestions for improvement welcome).
> If you wish to operate lots of devices - keep them simply created and ready
> - and eventually blkdiscard them for next device reuse.
That would work for volatile volumes, but those are only about 1/3 of
the volumes in a Qubes OS system. The other 2/3 are writable snapshots.
Also, Qubes OS has found blkdiscard on thins to be a performance
problem. It used to lock up entire pools until Qubes OS moved to doing
the blkdiscard in chunks.
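A sketch of what discarding in chunks can look like is below. It is not the Qubes OS implementation: the helper names and the chunk size are assumptions for illustration. The `--offset` and `--length` options are those documented for util-linux `blkdiscard`, which also documents a `--step` option that splits a discard into iterations by itself.

```python
def discard_chunks(device_size, chunk_size):
    """Yield (offset, length) pairs that cover the device in order."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    offset = 0
    while offset < device_size:
        length = min(chunk_size, device_size - offset)
        yield offset, length
        offset += length


def blkdiscard_commands(device, device_size, chunk_size):
    """Build one blkdiscard argv per chunk instead of one huge discard."""
    return [
        ["blkdiscard", "--offset", str(offset), "--length", str(length),
         device]
        for offset, length in discard_chunks(device_size, chunk_size)
    ]


# Example: a 10 GiB volume discarded 4 GiB at a time (hypothetical path).
GiB = 2 ** 30
for argv in blkdiscard_commands("/dev/vg/volume", 10 * GiB, 4 * GiB):
    print(" ".join(argv))
```

Offsets and lengths passed to `blkdiscard` need to be aligned to the device sector size, so in practice the chunk size would be a multiple of it.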
> I'm also unsure from where would arise any special need to instantiate that
> many snapshots - if there is some valid & logical purpose - lvm2 can have
> extended user space API to create multiple snapshots at once maybe (so
> i.e. create 10 snapshots with name-%d of a single thinLV)
This would be amazing, and Qubes OS should be able to use it. That
said, Qubes OS would prefer to be able to choose the name of each volume
separately. Could there be a more general batching operation? Just
supporting ‘lvm lvcreate’ and ‘lvm lvs’ would be great, but support for
‘lvm lvremove’, ‘lvm lvrename’, ‘lvm lvextend’, and ‘lvm lvchange
--activate=y’ as well would be even better.
> Not to mentioning operating that many thin volumes from a single thin-pool
> is also nothing close to high performance goal you try to reach...
Would you mind explaining? My understanding, and the basis of
essentially all my feature requests in this area, was that virtually all
of the cost of LVM is the userspace metadata operations, udev syncing,
and device scanning. I have been assuming that the kernel does not have
performance problems with large numbers of thin volumes.
Right now, my machine has 334 active thin volumes, split between one
pool on an NVMe drive and one on a spinning hard drive. The pool on an
NVMe drive has 312 active thin volumes, of which I believe 64 are in use.
Are these numbers high enough to cause significant performance
penalties for dm-thin v1, and would they cause problems for dm-thin v2?
How much of a performance win can I expect from only activating the
subset of volumes I actually use?
Also, I believe a significant fraction of I/O is writes to previously
unallocated blocks. I haven’t measured how much, though, since I am not
aware of any way to get that statistic, at least without kprobes or
similar.
The pool on a spinning hard drive has 22 thin volumes, of which I
believe only one is in use. The HDD is mostly used for backups, so its
performance doesn’t matter that much.
--
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [dm-devel] Potential enhancements to dm-thin v2
2022-04-11 17:22 ` Demi Marie Obenour
@ 2022-04-11 20:16 ` Zdenek Kabelac
2022-04-11 22:30 ` Demi Marie Obenour
0 siblings, 1 reply; 10+ messages in thread
From: Zdenek Kabelac @ 2022-04-11 20:16 UTC (permalink / raw)
To: Demi Marie Obenour, Joe Thornber; +Cc: dm-devel
On 11. 04. 22 at 19:22, Demi Marie Obenour wrote:
> On Mon, Apr 11, 2022 at 10:16:02AM +0200, Zdenek Kabelac wrote:
>> On 11. 04. 22 at 0:03, Demi Marie Obenour wrote:
>>
>> Your proposal actually breaks this sequence and would move things to the
>> state of 'guess at which states we are now'. (and IMHO presents much more
>> risk than virtual problem with suspend from user-space - which is only a
>> problem if you are using suspended device as 'swap' and 'rootfs' - so there
>> are very easy ways how to orchestrate your LVs to avoid such problems).
> The intent is less “guess what states we are now” and more “It looks
> like dm-thin already has the data structures needed to store some
> per-thin metadata, and that could make writing a simple userspace volume
> manager FAR FAR easier”. It appears to me that the only change needed
I will not spend hours explaining all the details, but running just the
suspend alone may result in many different problems, of which things like
the thin-pool running out of data space are among the easiest.
Basically, each step must be designed with a 'power-off' happening during
the operation in mind. For each step you need to know what the recovery step
looks like and how the lvm2 and kernel metadata could/would match together.
Combining many steps into a single 'kernel' call just increases the already
large range of errors. So in many cases we simply favour keeping operations
more 'low-level-atomic', even at a slightly higher performance price (as
said, we have never seen the creation of a snapshot be a 'msec'-critical
operation, as the 'suspend' with its implicit flush & fsfreeze might itself
be a far more expensive operation).
>> But IMHO creation and removal of thousands of devices in very short period
>> of time rather suggest there is something sub-optimal in your original
>> software design as I'm really having hard time imagining why would you need
>> this ?
> There very well could be (suggestions for improvement welcome).
>
>> If you wish to operate lots of devices - keep them simply created and ready
>> - and eventually blkdiscard them for next device reuse.
> That would work for volatile volumes, but those are only about 1/3 of
> the volumes in a Qubes OS system. The other 2/3 are writable snapshots.
> Also, Qubes OS has found blkdiscard on thins to be a performance
> problem. It used to lock up entire pools until Qubes OS moved to doing
> the blkdiscard in chunks.
Always make sure you use recent Linux kernels.
Blkdiscard should not differ from lvremove too much. Also experiment with
how 'lvchange --discards passdown|nopassdown poolLV' works.
>> I'm also unsure from where would arise any special need to instantiate that
>> many snapshots - if there is some valid & logical purpose - lvm2 can have
>> extended user space API to create multiple snapshots at once maybe (so
>> i.e. create 10 snapshots with name-%d of a single thinLV)
> This would be amazing, and Qubes OS should be able to use it. That
> said, Qubes OS would prefer to be able to choose the name of each volume
> separately. Could there be a more general batching operation? Just
> supporting ‘lvm lvcreate’ and ‘lvm lvs’ would be great, but support for
> ‘lvm lvremove’, ‘lvm lvrename’, ‘lvm lvextend’, and ‘lvm lvchange
> --activate=y’ as well would be even better.
There is a kind of 'hidden' plan inside the command-line processing to allow
'grouped' processing:
lvcreate --snapshot --name lv1 --snapshot --name lv2 vg/origin
However, there is currently no manpower to proceed further on this part, as
we have other parts of the code that need enhancements.
But we may put this on our TODO plans...
>> Not to mentioning operating that many thin volumes from a single thin-pool
>> is also nothing close to high performance goal you try to reach...
> Would you mind explaining? My understanding, and the basis of
> essentially all my feature requests in this area, was that virtually all
> of the cost of LVM is the userspace metadata operations, udev syncing,
> and device scanning. I have been assuming that the kernel does not have
> performance problems with large numbers of thin volumes.
The main idea behind the comment is that, with increased disk usage, the
manipulation of thin-pool metadata and the locking will soon start to be a
considerable performance problem.
So while it's easy to have 1000 active thinLVs from a single thin-pool that
are UNUSED, the situation is dramatically different when these LVs are under
some heavy load. There you should keep the number of active thinLVs in the
low tens, especially if you are performance oriented. The lighter usage
and less provisioning and especially bigger block size - improve
>
> Right now, my machine has 334 active thin volumes, split between one
> pool on an NVMe drive and one on a spinning hard drive. The pool on an
> NVMe drive has 312 active thin volumes, of which I believe 64 are in use.
> Are these numbers high enough to cause significant performance
> penalties for dm-thin v1, and would they cause problems for dm-thin v2?
There are not yet any numbers for v2.
For v1, 64 thins might eventually experience some congestion under heavy
load (compared with a 'native' raw spindle).
> How much of a performance win can I expect from only activating the
> subset of volumes I actually use?
I can only advise benchmarking with some good approximation of your expected
workload.
In some cases it may turn out that your workload is not too sensitive to the
various locking limitations.
Regards
Zdenek
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [dm-devel] Potential enhancements to dm-thin v2
2022-04-11 20:16 ` Zdenek Kabelac
@ 2022-04-11 22:30 ` Demi Marie Obenour
2022-04-12 9:32 ` Zdenek Kabelac
0 siblings, 1 reply; 10+ messages in thread
From: Demi Marie Obenour @ 2022-04-11 22:30 UTC (permalink / raw)
To: Zdenek Kabelac, Joe Thornber; +Cc: dm-devel
On Mon, Apr 11, 2022 at 10:16:43PM +0200, Zdenek Kabelac wrote:
> On 11. 04. 22 at 19:22, Demi Marie Obenour wrote:
> > On Mon, Apr 11, 2022 at 10:16:02AM +0200, Zdenek Kabelac wrote:
> > > On 11. 04. 22 at 0:03, Demi Marie Obenour wrote:
> > >
> > > Your proposal actually breaks this sequence and would move things to the
> > > state of 'guess at which states we are now'. (and IMHO presents much more
> > > risk than virtual problem with suspend from user-space - which is only a
> > > problem if you are using suspended device as 'swap' and 'rootfs' - so there
> > > are very easy ways how to orchestrate your LVs to avoid such problems).
> > The intent is less “guess what states we are now” and more “It looks
> > like dm-thin already has the data structures needed to store some
> > per-thin metadata, and that could make writing a simple userspace volume
> > manager FAR FAR easier”. It appears to me that the only change needed
>
>
> I do not spend hours explaining all the details - but running just the
> suspend alone may result in many different problem where the things like
> running thin-pool out-of-data space is one of the easiest.
>
> Basically each step must be designed with 'power-off' happen during the
> operation. For each step you need to know how the recovery step looks like
> and how the lvm2 & kernel metadata c/would match together.
That is absolutely the case, and is in fact the reason I proposed this
change to begin with. By having dm-thin store a small amount of
userspace-provided metadata for each thin volume, and by providing an
API to enumerate the thin volumes in a pool, I can store all of the
metadata I need in the thin pool itself. This is much simpler than
having to store metadata outside of the pool.
> Combining many
> steps together into a single 'kernel' call just increases already large
> range of errors. So in many case we simply do favour to keep operation more
> 'low-level-atomic' even at slight higher performance price (as said - we've
> never seen a creation of snapshot to be 'msec' critical operation - as the
> 'suspend' with implicit flush & fsfreeze itself might be far more expensive
> operation.
Qubes OS should never be snapshotting an in-use volume of any kind.
Right now, there is one case where it does so, but that is a bug, and I
am working on fixing it. A future API might support snapshotting to an
in-use volume, but that would likely require a way to tell the VM to
freeze its own filesystem.
> > > But IMHO creation and removal of thousands of devices in very short period
> > > of time rather suggest there is something sub-optimal in your original
> > > software design as I'm really having hard time imagining why would you need
> > > this ?
> > There very well could be (suggestions for improvement welcome).
> >
> > > If you wish to operate lots of devices - keep them simply created and ready
> > > - and eventually blkdiscard them for next device reuse.
> > That would work for volatile volumes, but those are only about 1/3 of
> > the volumes in a Qubes OS system. The other 2/3 are writable snapshots.
> > Also, Qubes OS has found blkdiscard on thins to be a performance
> > problem. It used to lock up entire pools until Qubes OS moved to doing
> > the blkdiscard in chunks.
>
> Always make sure you use recent Linux kernels.
Should the 5.16 series be recent enough?
> Blkdiscard should not differ from lvremove too much - also experiment how
> the 'lvchange --discards passdown|nopassdown poolLV' works.
I believe this was with passdown on, which is the default in Qubes OS.
The bug was tracked down by Jinoh Kang in
https://github.com/QubesOS/qubes-issues/issues/5426#issuecomment-761595524
and found to be due to dm-thin deleting B-tree nodes one at a time,
causing large amounts of time to be wasted on btree rebalancing and node
locking.
> > > I'm also unsure from where would arise any special need to instantiate that
> > > many snapshots - if there is some valid & logical purpose - lvm2 can have
> > > extended user space API to create multiple snapshots at once maybe (so
> > > i.e. create 10 snapshots with name-%d of a single thinLV)
> > This would be amazing, and Qubes OS should be able to use it. That
> > said, Qubes OS would prefer to be able to choose the name of each volume
> > separately. Could there be a more general batching operation? Just
> > supporting ‘lvm lvcreate’ and ‘lvm lvs’ would be great, but support for
> > ‘lvm lvremove’, ‘lvm lvrename’, ‘lvm lvextend’, and ‘lvm lvchange
> > --activate=y’ as well would be even better.
>
> There is kind of 'hidden' plan inside command line processing to allow
> 'grouped' processing.
>
> lvcreate --snapshot --name lv1 --snapshot --name lv2 vg/origin
>
> However there is currently no man power to proceed further on this part as
> we have other parts of code needed enhancements.
>
> But we may put this on our TODO plans...
That would be great, thanks!
> > > Not to mentioning operating that many thin volumes from a single thin-pool
> > > is also nothing close to high performance goal you try to reach...
> > Would you mind explaining? My understanding, and the basis of
> > essentially all my feature requests in this area, was that virtually all
> > of the cost of LVM is the userspace metadata operations, udev syncing,
> > and device scanning. I have been assuming that the kernel does not have
> > performance problems with large numbers of thin volumes.
>
>
> The main idea behind the comment is - when there is increased disk usage -
> the manipulation with thin-pool metadata and locking will soon start to be a
> considerable performance problem.
>
> So while it's easy to have active 1000 thinLVs from a single thin-pool that
> are UNUSED, situation is dramatically different when there LVs would be in
> some heavy use load. There you should keep the active thinLV at low number
> of tens LVs, especially if you are performance oriented. The lighter
> usage and less provisioning and especially bigger block size - improve
I can try to modify the storage pool so that LVs are not activated by
default. That said, Qubes OS will always be provisioning-heavy. With
the notable exception of volatile volumes, qubesd always snapshots a
volume at startup and then provides the snapshot to the VM. After
shutdown, the original volume is renamed to be a backup, and the
snapshot gets the name of the original volume. Bigger block sizes would
substantially increase write amplification, as turning off zeroing is
not an option for security reasons.
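The write-amplification concern can be made concrete with a simplified model. It assumes that, with zeroing enabled, the first partial write into an unprovisioned block causes the whole block to be zeroed before the payload is written, and that a write covering the whole block needs no zeroing. The function name and the example sizes are illustrative, and real behaviour may differ in detail.

```python
def first_write_bytes(write_size, block_size, zeroing=True):
    """Bytes written to the data device by the first write into a block.

    Simplified model: a partial write into a freshly provisioned block
    costs one full-block zeroing pass plus the payload itself.
    """
    if not 0 < write_size <= block_size:
        raise ValueError("write must fit within one block")
    if not zeroing or write_size == block_size:
        return write_size
    return block_size + write_size


KiB = 1024
write = 4 * KiB
for block in (64 * KiB, 1024 * KiB):
    amplification = first_write_bytes(write, block) / write
    print(block // KiB, "KiB block:", amplification)
# 64 KiB block: 17.0
# 1024 KiB block: 257.0
```

Under this model the cost applies only to the first write into each block, so it matters most for provisioning-heavy workloads such as the one described above.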
Is this just a workload that dm-thin is ill-suited for? Qubes OS does
support storing VM images on either BTRFS or XFS files, and it could be
that this is a better plan going forward.
> > Right now, my machine has 334 active thin volumes, split between one
> > pool on an NVMe drive and one on a spinning hard drive. The pool on an
> > NVMe drive has 312 active thin volumes, of which I believe 64 are in use.
> > Are these numbers high enough to cause significant performance
> > penalties for dm-thin v1, and would they cause problems for dm-thin v2?
>
> There are not yet any numbers for v2
>
> For v1 - 64 thins might eventually experience some congestion for heavy load
> (compared with 'native' raw spindle).
>
> > How much of a performance win can I expect from only activating the
> > subset of volumes I actually use?
>
>
> I can only advice benchmark with some good approximation of your expected
> workload.
That’s already on the Qubes OS team’s (very long) to-do list.
--
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [dm-devel] Potential enhancements to dm-thin v2
2022-04-11 22:30 ` Demi Marie Obenour
@ 2022-04-12 9:32 ` Zdenek Kabelac
2022-04-12 11:58 ` Demi Marie Obenour
2022-04-12 14:29 ` David Teigland
0 siblings, 2 replies; 10+ messages in thread
From: Zdenek Kabelac @ 2022-04-12 9:32 UTC (permalink / raw)
To: Demi Marie Obenour, Joe Thornber; +Cc: dm-devel
On 12. 04. 22 at 0:30, Demi Marie Obenour wrote:
> On Mon, Apr 11, 2022 at 10:16:43PM +0200, Zdenek Kabelac wrote:
>> On 11. 04. 22 at 19:22, Demi Marie Obenour wrote:
>>> On Mon, Apr 11, 2022 at 10:16:02AM +0200, Zdenek Kabelac wrote:
>>>> On 11. 04. 22 at 0:03, Demi Marie Obenour wrote:
>>>>
>>>> Your proposal actually breaks this sequence and would move things to the
>>>> state of 'guess at which states we are now'. (and IMHO presents much more
>>>> risk than virtual problem with suspend from user-space - which is only a
>>>> problem if you are using suspended device as 'swap' and 'rootfs' - so there
>>>> are very easy ways how to orchestrate your LVs to avoid such problems).
>>> The intent is less “guess what states we are now” and more “It looks
>>> like dm-thin already has the data structures needed to store some
>>> per-thin metadata, and that could make writing a simple userspace volume
>>> manager FAR FAR easier”. It appears to me that the only change needed
>>
>> I do not spend hours explaining all the details - but running just the
>> suspend alone may result in many different problem where the things like
>> running thin-pool out-of-data space is one of the easiest.
>>
>> Basically each step must be designed with 'power-off' happen during the
>> operation. For each step you need to know how the recovery step looks like
>> and how the lvm2 & kernel metadata c/would match together.
> That is absolutely the case, and is in fact the reason I proposed this
> change to begin with. By having dm-thin store a small amount of
> userspace-provided metadata for each thin volume, and by providing an
> API to enumerate the thin volumes in a pool, I can store all of the
> metadata I need in the thin pool itself. This is much simpler than
> having to store metadata outside of the pool.
Hi
Here is actually the fundamental problem with your proposal: our design is
about a careful split between user space and kernel over 'who is the
owner/holder of information'. Your proposal unfortunately does not fit the
model where lvm2 is the authoritative owner of info about devices. Note that
we also tried the 'model' where the info is held within the target (our
mdraid dm wrapper), but it has more troubles compared with the very clear
thin logic. So from the lvm2 position, we do not have any plans to change
this proven model.
What you are asking for is that the 'kernel' module does the whole job, and
lvm2 would obtain info from the kernel metadata, and eventually you would be
able to command everything through an ioctl() interface, letting the
complexity sit completely in the kernel. But as explained, our design is
heading in the opposite direction: what can be done in user space stays in
user space, and the kernel does the necessary minimum, which can then be
developed and traced much more easily.
>> Combining many
>> steps together into a single 'kernel' call just increases already large
>> range of errors. So in many case we simply do favour to keep operation more
>> 'low-level-atomic' even at slight higher performance price (as said - we've
>> never seen a creation of snapshot to be 'msec' critical operation - as the
>> 'suspend' with implicit flush & fsfreeze itself might be far more expensive
>> operation.
> Qubes OS should never be snapshotting an in-use volume of any kind.
> Right now, there is one case where it does so, but that is a bug, and I
> am working on fixing it. A future API might support snapshotting to an
> in-use volume, but that would likely require a way to tell the VM to
> freeze its own filesystem.
Yeah, you have a very unusual use case. In fact, the lvm2 goal is usually to
support as many things as we can while devices are in use, so the user does
not need to take them offline, which surely complicates everything a lot.
Also, there was basically never any user demand to operate on offline
devices in a very quick way, so admittedly this has not been a focus area of
development.
>>>> But IMHO creation and removal of thousands of devices in very short period
>>>> of time rather suggest there is something sub-optimal in your original
>>>> software design as I'm really having hard time imagining why would you need
>>>> this ?
>>> There very well could be (suggestions for improvement welcome).
>>>
>>>> If you wish to operate lots of devices - keep them simply created and ready
>>>> - and eventually blkdiscard them for next device reuse.
>>> That would work for volatile volumes, but those are only about 1/3 of
>>> the volumes in a Qubes OS system. The other 2/3 are writable snapshots.
>>> Also, Qubes OS has found blkdiscard on thins to be a performance
>>> problem. It used to lock up entire pools until Qubes OS moved to doing
>>> the blkdiscard in chunks.
>> Always make sure you use recent Linux kernels.
> Should the 5.16 series be recent enough?
>
>> Blkdiscard should not differ from lvremove too much - also experiment how
>> the 'lvchange --discards passdown|nopassdown poolLV' works.
> I believe this was with passdown on, which is the default in Qubes OS.
> The bug was tracked down by Jinoh Kang in
> https://github.com/QubesOS/qubes-issues/issues/5426#issuecomment-761595524
> and found to be due to dm-thin deleting B-tree nodes one at a time,
> causing large amounts of time to be wasted on btree rebalancing and node
> locking.
>
>>>> I'm also unsure from where would arise any special need to instantiate that
>>>> many snapshots - if there is some valid & logical purpose - lvm2 can have
>>>> extended user space API to create multiple snapshots at once maybe (so
>>>> i.e. create 10 snapshots with name-%d of a single thinLV)
>>> This would be amazing, and Qubes OS should be able to use it. That
>>> said, Qubes OS would prefer to be able to choose the name of each volume
>>> separately. Could there be a more general batching operation? Just
>>> supporting ‘lvm lvcreate’ and ‘lvm lvs’ would be great, but support for
>>> ‘lvm lvremove’, ‘lvm lvrename’, ‘lvm lvextend’, and ‘lvm lvchange
>>> --activate=y’ as well would be even better.
>> There is kind of 'hidden' plan inside command line processing to allow
>> 'grouped' processing.
>>
>> lvcreate --snapshot --name lv1 --snapshot --name lv2 vg/origin
>>
>> However there is currently no man power to proceed further on this part as
>> we have other parts of code needed enhancements.
>>
>> But we may put this on our TODO plans...
> That would be great, thanks!
Although the main reason to support this kind of API was the request to
support an atomic snapshot of multiple LVs at once - but so far not a high
priority.
>>>> Not to mentioning operating that many thin volumes from a single thin-pool
>>>> is also nothing close to high performance goal you try to reach...
>>> Would you mind explaining? My understanding, and the basis of
>>> essentially all my feature requests in this area, was that virtually all
>>> of the cost of LVM is the userspace metadata operations, udev syncing,
>>> and device scanning. I have been assuming that the kernel does not have
>>> performance problems with large numbers of thin volumes.
>>
>> The main idea behind the comment is - when there is increased disk usage -
>> the manipulation with thin-pool metadata and locking will soon start to be a
>> considerable performance problem.
>>
>> So while it's easy to have active 1000 thinLVs from a single thin-pool that
>> are UNUSED, situation is dramatically different when there LVs would be in
>> some heavy use load. There you should keep the active thinLV at low number
>> of tens LVs, especially if you are performance oriented. The lighter
>> usage and less provisioning and especially bigger block size - improve
> I can try to modify the storage pool so that LVs are not activated by
> default. That said, Qubes OS will always be provisioning-heavy. With
> the notable exception of volatile volumes, qubesd always snapshots a
You definitely should keep active ONLY the LVs you need to have active -
keeping 'unused' LVs active impacts many other kernel areas and consumes
system resources.
> volume at startup and then provides the snapshot to the VM. After
> shutdown, the original volume is renamed to be a backup, and the
> snapshot gets the name of the original volume. Bigger block sizes would
> substantially increase write amplification, as turning off zeroing is
> not an option for security reasons.
For 'snapshot' heavy loads - smaller chunks are usually better - but it comes
with a price.
> Is this just a workload that dm-thin is ill-suited for? Qubes OS does
> support storing VM images on either BTRFS or XFS files, and it could be
> that this is a better plan going forward.
Not knowing the details - but as mentioned 'zeroing' is not needed for
'filesystem' security - a modern filesystem will never let you read unwritten
data - as it keeps its own map of written data - but of course if the user
has root access to the device, with 'dd' they could read some 'unwritten' data
on that device...
>
>>> How much of a performance win can I expect from only activating the
>>> subset of volumes I actually use?
>>
>> I can only advice benchmark with some good approximation of your expected
>> workload.
> That’s already on the Qubes OS team’s (very long) to-do list.
>
I'd prioritize this - to get the best balance for performance - i.e. slightly
bigger chunks could give you much better numbers - if your 'snapshot' workload
is focused on small 'areas' so you know exactly where the focus should go
(too many cooks spoil the broth)...
So even a jump from 64K to 256K can be significant.
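One reason the chunk size matters is that the pool has to track one mapping
per provisioned chunk, so the number of mappings for a given amount of data
scales inversely with the chunk size. A back-of-the-envelope sketch follows;
the 10 GiB volume size and the helper name are illustrative assumptions, and
the numbers say nothing about real throughput, which still needs the benchmark.

```python
KIB = 1024
GIB = 1024 ** 3


def mappings_needed(volume_bytes: int, chunk_bytes: int) -> int:
    """Chunk mappings needed to fully provision a volume (rounded up)."""
    if volume_bytes < 0 or chunk_bytes <= 0:
        raise ValueError("sizes must be positive")
    return -(-volume_bytes // chunk_bytes)  # ceiling division


if __name__ == "__main__":
    # Hypothetical 10 GiB thin volume, fully written.
    for chunk_kib in (64, 256):
        count = mappings_needed(10 * GIB, chunk_kib * KIB)
        print(f"{chunk_kib} KiB chunks: {count} mappings")
```

Going from 64K to 256K chunks cuts the mapping count by a factor of four in
this model; whether that translates into a measurable win depends on the
workload.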
Regards
Zdenek
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel
* Re: [dm-devel] Potential enhancements to dm-thin v2
2022-04-12 9:32 ` Zdenek Kabelac
@ 2022-04-12 11:58 ` Demi Marie Obenour
2022-04-12 14:29 ` David Teigland
1 sibling, 0 replies; 10+ messages in thread
From: Demi Marie Obenour @ 2022-04-12 11:58 UTC (permalink / raw)
To: Zdenek Kabelac, Joe Thornber, David Teigland; +Cc: dm-devel
On 4/12/22 05:32, Zdenek Kabelac wrote:
> Dne 12. 04. 22 v 0:30 Demi Marie Obenour napsal(a):
>> On Mon, Apr 11, 2022 at 10:16:43PM +0200, Zdenek Kabelac wrote:
>>> Dne 11. 04. 22 v 19:22 Demi Marie Obenour napsal(a):
>>>> On Mon, Apr 11, 2022 at 10:16:02AM +0200, Zdenek Kabelac wrote:
>>>>> Dne 11. 04. 22 v 0:03 Demi Marie Obenour napsal(a):
>>>>>
>>>>> Your proposal actually breaks this sequence and would move things to the
>>>>> state of 'guess at which states we are now'. (and IMHO presents much more
>>>>> risk than virtual problem with suspend from user-space - which is only a
>>>>> problem if you are using suspended device as 'swap' and 'rootfs' - so there
>>>>> are very easy ways how to orchestrate your LVs to avoid such problems).
>>>> The intent is less “guess what states we are now” and more “It looks
>>>> like dm-thin already has the data structures needed to store some
>>>> per-thin metadata, and that could make writing a simple userspace volume
>>>> manager FAR FAR easier”. It appears to me that the only change needed
>>>
>>> I do not spend hours explaining all the details - but running just the
>>> suspend alone may result in many different problems where the things like
>>> running thin-pool out-of-data space is one of the easiest.
>>>
>>> Basically each step must be designed with 'power-off' happen during the
>>> operation. For each step you need to know how the recovery step looks like
>>> and how the lvm2 & kernel metadata c/would match together.
>> That is absolutely the case, and is in fact the reason I proposed this
>> change to begin with. By having dm-thin store a small amount of
>> userspace-provided metadata for each thin volume, and by providing an
>> API to enumerate the thin volumes in a pool, I can store all of the
>> metadata I need in the thin pool itself. This is much simpler than
>> having to store metadata outside of the pool.
>
> Hi
>
> Here is actually the fundamental problem with your proposal - our design was
> about careful split between user-space and kernel 'who is the owner/holder of
> information' - your proposal unfortunately does not fit the model where lvm2
> is the authoritative owner of info about devices - note - we also tried the
> 'model' where the info is held within target - our mdraid dm wrapper - but it
> has more troubles compared with very clear thin logic. So from the lvm2
> position - we do not have any plans to change this proven model.
This does not surprise me. lvm2 already has the infrastructure to
store its own metadata and update it in a crash-safe way, so having the
kernel be able to store additional metadata would be of no benefit to
lvm2. The intended use-case for this feature is tools that are dm-thin
specific, and which do not already have such infrastructure.
> What you are asking for is - that 'kernel' module is doing all the job - and
> lvm2 would be obtaining info from the kernel metadata - and eventually you
> would be able to command everything with ioctl() interface and letting the
> complexity sit completely in kernel - but as explained our design is heading
> in opposite direction - what can be done in user-space stays in user space and
> kernel does the necessary minimum, which can be then much easier developed and
> traced.
Not at all. I just want userspace to be able to stash some data in
each thin and retrieve it later. The complex logic would still remain
in userspace. That’s why I dropped the “lookup thin by blob”
functionality: it requires a new data structure in the kernel, and
userspace can achieve almost the same effect with a cache. Qubes OS
has a persistent daemon that has exclusive ownership of the storage,
so there are no cache invalidation problems. The existing thin
pool already has a btree that could store the blob, so no new data
structures are required on the kernel side.
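To illustrate the userspace cache mentioned above, here is a minimal sketch.
It assumes the proposed (not yet existing) kernel feature: some way to
enumerate (thin ID, blob) pairs at startup. The class name, the 256-byte blob
size taken from the original proposal, and the input iterable are all
hypothetical; no current dm-thin interface provides this.

```python
from typing import Dict, Iterable, Optional, Tuple

BLOB_SIZE = 256  # size proposed in this thread; not an existing dm-thin feature


class ThinBlobCache:
    """In-memory thin ID <-> blob mapping, rebuilt once at daemon startup."""

    def __init__(self, thins: Iterable[Tuple[int, bytes]]) -> None:
        # 'thins' stands in for a hypothetical "enumerate thins" call.
        self._by_id: Dict[int, bytes] = {}
        self._by_blob: Dict[bytes, int] = {}
        for thin_id, blob in thins:
            self.add(thin_id, blob)

    def add(self, thin_id: int, blob: bytes) -> None:
        if len(blob) != BLOB_SIZE:
            raise ValueError(f"blob must be exactly {BLOB_SIZE} bytes")
        if thin_id in self._by_id or blob in self._by_blob:
            raise ValueError("duplicate thin ID or blob")
        self._by_id[thin_id] = blob
        self._by_blob[blob] = thin_id

    def remove(self, thin_id: int) -> None:
        blob = self._by_id.pop(thin_id)
        del self._by_blob[blob]

    def lookup_by_blob(self, blob: bytes) -> Optional[int]:
        """The 'lookup thin by blob' operation, done purely in userspace."""
        return self._by_blob.get(blob)

    def lookup_by_id(self, thin_id: int) -> Optional[bytes]:
        return self._by_id.get(thin_id)
```

Because a single daemon owns the pool exclusively, the cache only has to be
updated by the same code path that creates or deletes thins.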
>> Combining many
>>> steps together into a single 'kernel' call just increases already large
>>> range of errors. So in many case we simply do favour to keep operation more
>>> 'low-level-atomic' even at slight higher performance price (as said - we've
>>> never seen a creation of snapshot to be 'msec' critical operation - as the
>>> 'suspend' with implicit flush & fsfreeze itself might be far more expensive
>>> operation.
>> Qubes OS should never be snapshotting an in-use volume of any kind.
>> Right now, there is one case where it does so, but that is a bug, and I
>> am working on fixing it. A future API might support snapshotting to an
>> in-use volume, but that would likely require a way to tell the VM to
>> freeze its own filesystem.
>
>
> Yeah - you have very unusual use case - in fact lvm2 goal is usually to
> support as much things as we can while devices are in-use so user does not
> need to take them offline - which surely complicates everything a lot - also
> there was basically never any user demand to operate with offline device in
> very quick way - so admittedly not the focused area of development.
I wouldn’t exactly say my use case is *that* unusual. My
understanding is that Docker has the same one, and they too resorted to
bypassing lvm2 and using raw dm-thin. Docker does have the advantage
that the filesystem is assumed to be trusted, so they do not need eager
zeroing. Stratis also uses dm-thin, and it, too, avoids using LVM.
>>>>> But IMHO creation and removal of thousands of devices in very short period
>>>>> of time rather suggest there is something sub-optimal in your original
>>>>> software design as I'm really having hard time imagining why would you need
>>>>> this ?
>>>> There very well could be (suggestions for improvement welcome).
>>>>
>>>>> If you wish to operate lots of devices - keep them simply created and ready
>>>>> - and eventually blkdiscard them for next device reuse.
>>>> That would work for volatile volumes, but those are only about 1/3 of
>>>> the volumes in a Qubes OS system. The other 2/3 are writable snapshots.
>>>> Also, Qubes OS has found blkdiscard on thins to be a performance
>>>> problem. It used to lock up entire pools until Qubes OS moved to doing
>>>> the blkdiscard in chunks.
>>> Always make sure you use recent Linux kernels.
>> Should the 5.16 series be recent enough?
>>
>>> Blkdiscard should not differ from lvremove too much - also experiment how
>>> the 'lvchange --discards passdown|nopassdown poolLV' works.
>> I believe this was with passdown on, which is the default in Qubes OS.
>> The bug was tracked down by Jinoh Kang in
>> https://github.com/QubesOS/qubes-issues/issues/5426#issuecomment-761595524
>> and found to be due to dm-thin deleting B-tree nodes one at a time,
>> causing large amounts of time to be wasted on btree rebalancing and node
>> locking.
>>
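The "blkdiscard in chunks" workaround mentioned in the quoted text can be
sketched roughly as follows. This is not the actual Qubes OS code: the
function names and the step size are assumptions, and the caller is expected
to supply the device size. The only external interface used is the blkdiscard
utility from util-linux with its --offset and --length options.

```python
import subprocess
from typing import Iterator, List, Tuple


def discard_ranges(device_size: int, step: int) -> Iterator[Tuple[int, int]]:
    """Yield (offset, length) pairs that cover the device in 'step'-sized pieces."""
    if step <= 0:
        raise ValueError("step must be positive")
    offset = 0
    while offset < device_size:
        length = min(step, device_size - offset)
        yield offset, length
        offset += length


def blkdiscard_commands(device: str, device_size: int, step: int) -> List[List[str]]:
    """Build one blkdiscard invocation per range, without running anything."""
    return [
        ["blkdiscard", "--offset", str(offset), "--length", str(length), device]
        for offset, length in discard_ranges(device_size, step)
    ]


def discard_in_chunks(device: str, device_size: int, step: int) -> None:
    """Discard the device piece by piece so no single request covers it all.

    'step' should be a multiple of the device sector size.
    """
    for command in blkdiscard_commands(device, device_size, step):
        subprocess.run(command, check=True)
```

Each call then discards only a bounded range, so other I/O to the pool can be
serviced between calls.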
>>>>> I'm also unsure from where would arise any special need to instantiate that
>>>>> many snapshots - if there is some valid & logical purpose - lvm2 can have
>>>>> extended user space API to create multiple snapshots at once maybe (so
>>>>> i.e. create 10 snapshots with name-%d of a single thinLV)
>>>> This would be amazing, and Qubes OS should be able to use it. That
>>>> said, Qubes OS would prefer to be able to choose the name of each volume
>>>> separately. Could there be a more general batching operation? Just
>>>> supporting ‘lvm lvcreate’ and ‘lvm lvs’ would be great, but support for
>>>> ‘lvm lvremove’, ‘lvm lvrename’, ‘lvm lvextend’, and ‘lvm lvchange
>>>> --activate=y’ as well would be even better.
>>> There is kind of 'hidden' plan inside command line processing to allow
>>> 'grouped' processing.
>>>
>>> lvcreate --snapshot --name lv1 --snapshot --name lv2 vg/origin
>>>
>>> However there is currently no man power to proceed further on this part as
>>> we have other parts of code needed enhancements.
>>>
>>> But we may put this on our TODO plans...
>> That would be great, thanks!
>
> Although the main reason to support this kind of API was the request to
> support an atomic snapshot of multiple LVs at once - but so far not a high
> priority.
Still, it will be useful if it becomes available.
>>>>> Not to mentioning operating that many thin volumes from a single thin-pool
>>>>> is also nothing close to high performance goal you try to reach...
>>>> Would you mind explaining? My understanding, and the basis of
>>>> essentially all my feature requests in this area, was that virtually all
>>>> of the cost of LVM is the userspace metadata operations, udev syncing,
>>>> and device scanning. I have been assuming that the kernel does not have
>>>> performance problems with large numbers of thin volumes.
>>>
>>> The main idea behind the comment is - when there is increased disk usage -
>>> the manipulation with thin-pool metadata and locking will soon start to be a
>>> considerable performance problem.
>>>
>>> So while it's easy to have active 1000 thinLVs from a single thin-pool that
>>> are UNUSED, situation is dramatically different when there LVs would be in
>>> some heavy use load. There you should keep the active thinLV at low number
>>> of tens LVs, especially if you are performance oriented. The lighter
>>> usage and less provisioning and especially bigger block size - improve
>> I can try to modify the storage pool so that LVs are not activated by
>> default. That said, Qubes OS will always be provisioning-heavy. With
>> the notable exception of volatile volumes, qubesd always snapshots a
>
>
> You definitely should keep active ONLY LVs you need to have active - it's
> impacting many other kernel areas and consumes system resources to keep
> 'unused' LVs active.
Thank you. There has already been work done in that direction and I
will see if I can finish it now that R4.1 is out.
>> volume at startup and then provides the snapshot to the VM. After
>> shutdown, the original volume is renamed to be a backup, and the
>> snapshot gets the name of the original volume. Bigger block sizes would
>> substantially increase write amplification, as turning off zeroing is
>> not an option for security reasons.
>
> For 'snapshot' heavy loads - smaller chunks are usually better - but it comes
> with price.
>
>
>> Is this just a workload that dm-thin is ill-suited for? Qubes OS does
>> support storing VM images on either BTRFS or XFS files, and it could be
>> that this is a better plan going forward.
>
>
> Not knowing the details - but as mentioned 'zeroing' is not needed for
> 'filesystem' security - modern filesystem will never let you read unwritten
> data - as it keeps its own map of written data - but of course if the user
> has root access to device with 'dd' it could read some 'unwritten' data on
> that device...
Qubes OS aims to defend against an attacker with kernel privilege
in a VM, so zeroing is indeed necessary. The only way to avoid the
overhead would be for dm-thin’s chunk size to match the block size
of the paravirtualized disks, so that there are only full-chunk writes.
The reflink storage pool does not have this problem because it works on
4K blocks natively.
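A rough worked example of the overhead, under a deliberately simplified model:
with zeroing enabled, the first partial write to a freshly provisioned chunk is
assumed to cost about one full chunk of writes to the pool, while a write that
covers the whole chunk needs no zeroing. The function name and the model itself
are assumptions for illustration, not measurements of dm-thin.

```python
def first_write_amplification(chunk_size: int, write_size: int) -> float:
    """Bytes written to the pool per guest byte, first write to a new chunk.

    Simplified model: a partial write forces roughly one whole chunk
    (zero fill plus data) to be written; a full-chunk write costs only itself.
    """
    if chunk_size <= 0 or write_size <= 0:
        raise ValueError("sizes must be positive")
    if write_size > chunk_size:
        raise ValueError("model covers writes within a single chunk only")
    if write_size == chunk_size:
        return 1.0  # full-chunk write: nothing left to zero
    return chunk_size / write_size


if __name__ == "__main__":
    block = 4096  # 4K guest block
    for chunk in (4096, 64 * 1024, 256 * 1024):
        factor = first_write_amplification(chunk, block)
        print(f"chunk {chunk // 1024:>3}K: {factor:.0f}x")
```

In this model a 4K write into a new 64K chunk costs 16x and into a 256K chunk
64x, while a chunk size equal to the guest block size costs 1x.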
>>>> How much of a performance win can I expect from only activating the
>>>> subset of volumes I actually use?
>>>
>>> I can only advice benchmark with some good approximation of your expected
>>> workload.
>> That’s already on the Qubes OS team’s (very long) to-do list.
>>
> I'd prioritize this - to get the best balance for performance - i.e. slightly
> bigger chunks could give you much better numbers - if your 'snapshot' workload
> is focused on small 'areas' so you know exactly where the focus should go
> (too many cooks spoil the broth)...
>
> So even jump 64k -> 256K can be significant
I will try to do a proper benchmark at some point.
--
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab
* Re: [dm-devel] Potential enhancements to dm-thin v2
2022-04-12 9:32 ` Zdenek Kabelac
2022-04-12 11:58 ` Demi Marie Obenour
@ 2022-04-12 14:29 ` David Teigland
2022-04-13 7:55 ` Zdenek Kabelac
1 sibling, 1 reply; 10+ messages in thread
From: David Teigland @ 2022-04-12 14:29 UTC (permalink / raw)
To: Zdenek Kabelac; +Cc: Demi Marie Obenour, Joe Thornber, dm-devel
Dne 11. 04. 22 v 0:03 Demi Marie Obenour napsal(a):
> For quite a while, I have wanted to write a tool to manage thin volumes
> that is not based on LVM.
On Tue, Apr 12, 2022 at 11:32:09AM +0200, Zdenek Kabelac wrote:
> Here is actually the fundamental problem with your proposal - our design was
> about careful split between user-space and kernel 'who is the owner/holder
> of information' - your proposal unfortunately does not fit the model where
> lvm2 is the authoritative owner of info about devices
The proposal is a new tool to manage dm-thin devices, not to rewrite lvm.
I would hope the tool is nothing at all like lvm, but rather "thinsetup"
in the tradition of dmsetup, cryptsetup. I think it's a great idea and
have wanted such a tool for years. I have a feeling that many have
already written ad hoc thinsetup-like tools, and there would be fairly
broad interest in it (especially if it has a proper lib api.)
Dave
* Re: [dm-devel] Potential enhancements to dm-thin v2
2022-04-12 14:29 ` David Teigland
@ 2022-04-13 7:55 ` Zdenek Kabelac
2022-04-13 15:00 ` Demi Marie Obenour
0 siblings, 1 reply; 10+ messages in thread
From: Zdenek Kabelac @ 2022-04-13 7:55 UTC (permalink / raw)
To: dm-devel; +Cc: Demi Marie Obenour
Dne 12. 04. 22 v 16:29 David Teigland napsal(a):
> Dne 11. 04. 22 v 0:03 Demi Marie Obenour napsal(a):
>> For quite a while, I have wanted to write a tool to manage thin volumes
>> that is not based on LVM.
>
> On Tue, Apr 12, 2022 at 11:32:09AM +0200, Zdenek Kabelac wrote:
>> Here is actually the fundamental problem with your proposal - our design was
>> about careful split between user-space and kernel 'who is the owner/holder
>> of information' - your proposal unfortunately does not fit the model where
>> lvm2 is the authoritative owner of info about devices
>
> The proposal is a new tool to manage dm-thin devices, not to rewrite lvm.
> I would hope the tool is nothing at all like lvm, but rather "thinsetup"
> in the tradition of dmsetup, cryptsetup. I think it's a great idea and
> have wanted such a tool for years. I have a feeling that many have
> already written ad hoc thinsetup-like tools, and there would be fairly
> broad interest in it (especially if it has a proper lib api.)
>
The problem with these 'ad-hoc' tools is their 'support' - i.e. how to proceed
in case of any failure.
So while there will be no problem generating many devices very quickly -
recoverability from failure will then always be individual, depending on the
surrounding environment.
So it is in principle the very same case as the request to support managing
DM devices with 'external' metadata - if there are different constraints to
match - you end up with different requirements on the tool.
If the focus is purely on thin device management - surely a standalone tool
does this job faster.
Zdenek
* Re: [dm-devel] Potential enhancements to dm-thin v2
2022-04-13 7:55 ` Zdenek Kabelac
@ 2022-04-13 15:00 ` Demi Marie Obenour
0 siblings, 0 replies; 10+ messages in thread
From: Demi Marie Obenour @ 2022-04-13 15:00 UTC (permalink / raw)
To: Zdenek Kabelac, dm-devel
On Wed, Apr 13, 2022 at 09:55:00AM +0200, Zdenek Kabelac wrote:
> Dne 12. 04. 22 v 16:29 David Teigland napsal(a):
> > Dne 11. 04. 22 v 0:03 Demi Marie Obenour napsal(a):
> > > For quite a while, I have wanted to write a tool to manage thin volumes
> > > that is not based on LVM.
> >
> > On Tue, Apr 12, 2022 at 11:32:09AM +0200, Zdenek Kabelac wrote:
> > > Here is actually the fundamental problem with your proposal - our design was
> > > about careful split between user-space and kernel 'who is the owner/holder
> > > of information' - your proposal unfortunately does not fit the model where
> > > lvm2 is the authoritative owner of info about devices
> >
> > The proposal is a new tool to manage dm-thin devices, not to rewrite lvm.
> > I would hope the tool is nothing at all like lvm, but rather "thinsetup"
> > in the tradition of dmsetup, cryptsetup. I think it's a great idea and
> > have wanted such a tool for years. I have a feeling that many have
> > already written ad hoc thinsetup-like tools, and there would be fairly
> > broad interest in it (especially if it has a proper lib api.)
> >
>
>
> The problem with these 'ad-hoc' tools is their 'support - aka how to proceed
> in case of any failure.
>
> So while there will be no problem to generate many device in very fast way -
> the recoverability from failure will then be always individual based on the
> surrounding environment.
That’s why I want to stick a name and UUID in each thin device’s
metadata. That makes creating a thin device and associating it with a
name an atomic operation, and means that if there is a failure, the
sysadmin or management toolstack knows what it needs to do to clean up.
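A minimal sketch of what such a per-thin record could look like, assuming the
256-byte blob from the original proposal. The layout (16-byte UUID, one length
byte, UTF-8 name, zero padding) and the function names are purely illustrative;
nothing in dm-thin defines this format.

```python
import struct
import uuid
from typing import Tuple

BLOB_SIZE = 256                      # blob size proposed in this thread
_HEADER = struct.Struct("<16sB")     # 16-byte UUID, 1-byte name length
MAX_NAME_BYTES = BLOB_SIZE - _HEADER.size


def pack_blob(volume_uuid: uuid.UUID, name: str) -> bytes:
    """Encode a UUID and a name into a fixed-size, zero-padded blob."""
    raw_name = name.encode("utf-8")
    if len(raw_name) > MAX_NAME_BYTES:
        raise ValueError(f"name longer than {MAX_NAME_BYTES} bytes")
    header = _HEADER.pack(volume_uuid.bytes, len(raw_name))
    return (header + raw_name).ljust(BLOB_SIZE, b"\0")


def unpack_blob(blob: bytes) -> Tuple[uuid.UUID, str]:
    """Decode a blob produced by pack_blob()."""
    if len(blob) != BLOB_SIZE:
        raise ValueError(f"blob must be exactly {BLOB_SIZE} bytes")
    raw_uuid, name_len = _HEADER.unpack_from(blob)
    if name_len > MAX_NAME_BYTES:
        raise ValueError("corrupt name length")
    start = _HEADER.size
    name = blob[start:start + name_len].decode("utf-8")
    return uuid.UUID(bytes=raw_uuid), name
```

If the blob were written in the same operation that creates the thin, a crash
would leave either no thin at all or a thin that already carries its name and
UUID, which is the property described above.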
--
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab
Thread overview: 10+ messages
2022-04-10 22:03 [dm-devel] Potential enhancements to dm-thin v2 Demi Marie Obenour
2022-04-11 8:16 ` Zdenek Kabelac
2022-04-11 17:22 ` Demi Marie Obenour
2022-04-11 20:16 ` Zdenek Kabelac
2022-04-11 22:30 ` Demi Marie Obenour
2022-04-12 9:32 ` Zdenek Kabelac
2022-04-12 11:58 ` Demi Marie Obenour
2022-04-12 14:29 ` David Teigland
2022-04-13 7:55 ` Zdenek Kabelac
2022-04-13 15:00 ` Demi Marie Obenour