* status page status - dedupe
@ 2022-03-05 19:21 Christoph Anton Mitterer
2022-03-05 23:25 ` Qu Wenruo
` (2 more replies)
0 siblings, 3 replies; 8+ messages in thread
From: Christoph Anton Mitterer @ 2022-03-05 19:21 UTC (permalink / raw)
To: linux-btrfs
Hey.
I just wondered about the status of the wiki status page?! ;-)
E.g. it says seeding would be stable, while right now there's an
ongoing thread on this list about it being broken again.
In particular, what's the status of out-of-band deduplication (i.e. run
manually by some program like duperemove or jdupes)?
Is it safe to be used?
My understanding was, that for out-of-band dedupe, the kernel performs
a full byte-by-byte comparison before actually deduplicating, right?
So it shouldn't matter so much which tool is used in the end and
whether that's stable or not?
Thanks,
Chris.
^ permalink raw reply [flat|nested] 8+ messages in thread

* Re: status page status - dedupe
2022-03-05 19:21 status page status - dedupe Christoph Anton Mitterer
@ 2022-03-05 23:25 ` Qu Wenruo
2022-03-06 0:00 ` Zygo Blaxell
2022-03-06 10:54 ` waxhead
2 siblings, 0 replies; 8+ messages in thread
From: Qu Wenruo @ 2022-03-05 23:25 UTC (permalink / raw)
To: Christoph Anton Mitterer, linux-btrfs

On 2022/3/6 03:21, Christoph Anton Mitterer wrote:
> Hey.
>
> I just wondered about the status of the wiki status page?! ;-)
>
> E.g. it says seeding would be stable, while right now there's an
> ongoing thread on this list about it being broken again.

I'm over-reacting on that thread.
It's only in misc-next, not really affecting anyone.

>
>
> In particular, what's the status of out-of-band deduplication (i.e. run
> manually by some program like duperemove or jdupes)?
> Is it safe to be used?

Pretty safe AFAIK.

Thanks,
Qu

>
>
> My understanding was, that for out-of-band dedupe, the kernel performs
> a full byte-by-byte comparison before actually deduplicating, right?
>
> So it shouldn't matter so much which tool is used in the end and
> whether that's stable or not?
>
>
> Thanks,
> Chris.

* Re: status page status - dedupe
2022-03-05 19:21 status page status - dedupe Christoph Anton Mitterer
2022-03-05 23:25 ` Qu Wenruo
@ 2022-03-06 0:00 ` Zygo Blaxell
2022-03-06 1:00 ` Andy Smith
2022-03-06 1:38 ` Christoph Anton Mitterer
2022-03-06 10:54 ` waxhead
2 siblings, 2 replies; 8+ messages in thread
From: Zygo Blaxell @ 2022-03-06 0:00 UTC (permalink / raw)
To: Christoph Anton Mitterer; +Cc: linux-btrfs

On Sat, Mar 05, 2022 at 08:21:26PM +0100, Christoph Anton Mitterer wrote:
> Hey.
>
> I just wondered about the status of the wiki status page?! ;-)
>
> E.g. it says seeding would be stable, while right now there's an
> ongoing thread on this list about it being broken again.
>
>
> In particular, what's the status of out-of-band deduplication (i.e. run
> manually by some program like duperemove or jdupes)?
> Is it safe to be used?
>
> My understanding was, that for out-of-band dedupe, the kernel performs
> a full byte-by-byte comparison before actually deduplicating, right?
>
> So it shouldn't matter so much which tool is used in the end and
> whether that's stable or not?

The kernel provides a dedupe ioctl (FIDEDUPERANGE or
BTRFS_IOC_FILE_EXTENT_SAME), and that ioctl does a full byte-by-byte
comparison while locking both inodes; however, there are other ways to
achieve deduplication on btrfs without the ioctl, so you must verify
the tool you are using uses the safe ioctl.

bees, duperemove, btrfs-dedupe, and solstice use the safe dedupe ioctl,
and provide no option to do otherwise.

dduper can be configured to use either the safe dedupe ioctl or the
unsafe clone_range ioctl (--fast-mode). The unsafe clone ioctl is
faster, and can be used if you know the data is not being modified
concurrently with dedupe.

jdupes can use the safe dedupe ioctl (-B) or very unsafe hardlinks (-H).

bedup uses the unsafe clone ioctl. There is a branch in the bedup
github repo (wip/dedup-syscall) which uses the safe ioctl but it has
not been merged to master.

Old-school POSIX deduplicators are based on hardlinks and not safe
unless all files are strictly read-only during and after dedupe.

> Thanks,
> Chris.

* Re: status page status - dedupe
2022-03-06 0:00 ` Zygo Blaxell
@ 2022-03-06 1:00 ` Andy Smith
2022-03-06 3:12 ` Zygo Blaxell
2022-03-06 1:38 ` Christoph Anton Mitterer
1 sibling, 1 reply; 8+ messages in thread
From: Andy Smith @ 2022-03-06 1:00 UTC (permalink / raw)
To: linux-btrfs

Hello,

On Sat, Mar 05, 2022 at 07:00:23PM -0500, Zygo Blaxell wrote:
> bees, duperemove, btrfs-dedupe, and solstice use the safe dedupe ioctl,
> and provide no option to do otherwise.

Is there some issue with combining offline dedupe and compression in
that it undoes all the benefits of the compression? I'm sorry, I
don't know the details and may have got the wrong impression but I
thought I had read here recently that there was negative interaction
here still.

Thanks,
Andy

* Re: status page status - dedupe
2022-03-06 1:00 ` Andy Smith
@ 2022-03-06 3:12 ` Zygo Blaxell
0 siblings, 0 replies; 8+ messages in thread
From: Zygo Blaxell @ 2022-03-06 3:12 UTC (permalink / raw)
To: Andy Smith; +Cc: linux-btrfs

On Sun, Mar 06, 2022 at 01:00:11AM +0000, Andy Smith wrote:
> Hello,
>
> On Sat, Mar 05, 2022 at 07:00:23PM -0500, Zygo Blaxell wrote:
> > bees, duperemove, btrfs-dedupe, and solstice use the safe dedupe ioctl,
> > and provide no option to do otherwise.
>
> Is there some issue with combining offline dedupe and compression in
> that it undoes all the benefits of the compression? I'm sorry, I
> don't know the details and may have got the wrong impression but I
> thought I had read here recently that there was negative interaction
> here still.

It's more like the other way around: compression makes some
deduplication tools ineffective.

To perform well, a deduper must have specific support for btrfs and
compression in order to issue dedupe requests that will remove complete
extents and recover free disk space, and it must not use optimizations
that are incompatible with compression. Without this support, the
deduper may fail to detect duplicates and not have very much impact on
total space usage for compressed extents.

All current btrfs dedupe tools choose to keep one duplicate data copy
arbitrarily, without considering the size of the encoding. So if you
have a compressed file, and make an uncompressed copy, about half of
the time the dedupe tool will replace the compressed copy with the
uncompressed one, when ideally it would measure the size of both and
always keep the smallest version of the data.

bees has limited support for compressed data. It will avoid shortening
compressed data blocks when this would result in a larger overall
encoding, and it will compress new data extents created by splitting
uncompressed extents. bees can match compressed and uncompressed
copies of duplicate data. It uses a variable block size with a small
lower bound for a better hit rate on shorter compressed extents.
bees outperforms everything else on final data size with compression.

duperemove blindly issues dedupe requests without regard for extent
boundaries or compression. Compressed data has shorter extents, so it
tends to help duperemove achieve space savings in more cases, but it's
difficult to predict the more or less random effect on the total data
size. Compression sometimes improves duperemove hit rate, but
sometimes reduces it. duperemove can match compressed data with
uncompressed data.

jdupes gives the same dedupe hit rate for compressed and uncompressed
data since jdupes only handles whole-file duplicates (this also applies
to duperemove in fdupes-compatibility mode). A whole-file deduplicator
will completely replace all extents in the duplicate files, which
avoids many compression-related issues. jdupes can match compressed
data with uncompressed data (or any mixture of these in each file).

dduper and solstice use btrfs data csums exclusively to find duplicate
blocks. Compressed data csums in btrfs are computed on the on-disk
encoding of the data, meaning that they are the csums of the data
_after_ compression for compressed blocks. The csums cannot be used to
deduplicate data blocks that are uncompressed, that are compressed with
a different algorithm or level, or appear at a different position
within an extent. These tools cannot match compressed and uncompressed
copies of the same data, and will get very low (often near zero) hit
rates on compressed data.

> Thanks,
> Andy
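
The csum point above can be illustrated with a toy model. This is only a sketch under loud assumptions: zlib.crc32 and zlib.compress stand in for btrfs's checksum and compression, the 4096-byte block size is assumed, and nothing here reads real btrfs metadata. The function names are hypothetical.

```python
import zlib

BLOCK = 4096  # assumed checksum granularity for this illustration


def block_csums(payload):
    """Checksum each BLOCK-sized piece of the bytes as they would sit on disk."""
    return [zlib.crc32(payload[i:i + BLOCK]) for i in range(0, len(payload), BLOCK)]


def shared_fraction(a, b):
    """Fraction of the checksums in a that also occur somewhere in b."""
    known = set(b)
    return sum(c in known for c in a) / len(a)


# 128 KiB of compressible but non-repeating text.
data = b"".join(b"record %08d\n" % i for i in range(8192))

raw = block_csums(data)                             # uncompressed copy
packed = block_csums(zlib.compress(data, 6))        # compressed copy
packed_again = block_csums(zlib.compress(data, 6))  # second copy, same encoding

print("raw vs raw:      ", shared_fraction(raw, raw))
print("packed vs packed:", shared_fraction(packed, packed_again))
print("packed vs raw:   ", shared_fraction(packed, raw))
```

Two copies stored with the same encoding share every block checksum, so a csum-only scanner finds them. The compressed and uncompressed copies hold the same logical data but different on-disk bytes, so their checksums should have nothing in common, which is why csum-only tools see so few hits on compressed data.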

* Re: status page status - dedupe
2022-03-06 0:00 ` Zygo Blaxell
2022-03-06 1:00 ` Andy Smith
@ 2022-03-06 1:38 ` Christoph Anton Mitterer
2022-03-06 1:40 ` Zygo Blaxell
1 sibling, 1 reply; 8+ messages in thread
From: Christoph Anton Mitterer @ 2022-03-06 1:38 UTC (permalink / raw)
To: Zygo Blaxell; +Cc: linux-btrfs

Thanks for that elaborate description (and thanks to Qu, too).

I think that might be a good addition to
https://btrfs.wiki.kernel.org/index.php/Deduplication


Also:

On Sat, 2022-03-05 at 19:00 -0500, Zygo Blaxell wrote:
> jdupes can use the safe dedupe ioctl (-B) or very unsafe hardlinks
> (-H).

I guess you mean -L ?!


Thanks,
Chris.

* Re: status page status - dedupe
2022-03-06 1:38 ` Christoph Anton Mitterer
@ 2022-03-06 1:40 ` Zygo Blaxell
0 siblings, 0 replies; 8+ messages in thread
From: Zygo Blaxell @ 2022-03-06 1:40 UTC (permalink / raw)
To: Christoph Anton Mitterer; +Cc: linux-btrfs

On Sun, Mar 06, 2022 at 02:38:48AM +0100, Christoph Anton Mitterer wrote:
> Thanks for that elaborate description (and thanks to Qu, too).
>
> I think that might be a good addition to
> https://btrfs.wiki.kernel.org/index.php/Deduplication
>
>
> Also:
>
> On Sat, 2022-03-05 at 19:00 -0500, Zygo Blaxell wrote:
> > jdupes can use the safe dedupe ioctl (-B) or very unsafe hardlinks
> > (-H).
>
> I guess you mean -L ?!

Uhhh, yes. I don't use either option myself. ;)

> Thanks,
> Chris.
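
As a side note on why the hardlink mode is described as very unsafe earlier in the thread, here is a small sketch. It is an illustration only, not how jdupes is implemented, and hardlink_dedupe is a hypothetical helper. After hardlink-style dedupe both names refer to one inode, so an in-place write through either name changes what the other name reads.

```python
import os
import tempfile


def hardlink_dedupe(keep, duplicate):
    """Replace duplicate with a hard link to keep (no content check here)."""
    tmp = duplicate + ".tmp-link"
    os.link(keep, tmp)
    os.replace(tmp, duplicate)  # atomic rename over the duplicate


def demo():
    with tempfile.TemporaryDirectory() as d:
        a = os.path.join(d, "a.txt")
        b = os.path.join(d, "b.txt")
        for path in (a, b):
            with open(path, "w") as f:
                f.write("same\n")
        hardlink_dedupe(a, b)
        same_inode = os.stat(a).st_ino == os.stat(b).st_ino
        with open(b, "a") as f:  # a later in-place write through one name...
            f.write("edited\n")
        with open(a) as f:       # ...is visible through the other name
            return same_inode, f.read()


print(demo())
```

With the dedupe ioctl the two files keep separate inodes and only share extents copy-on-write, so a later write to one file does not leak into the other.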

* Re: status page status - dedupe
2022-03-05 19:21 status page status - dedupe Christoph Anton Mitterer
2022-03-05 23:25 ` Qu Wenruo
2022-03-06 0:00 ` Zygo Blaxell
@ 2022-03-06 10:54 ` waxhead
2 siblings, 0 replies; 8+ messages in thread
From: waxhead @ 2022-03-06 10:54 UTC (permalink / raw)
To: Christoph Anton Mitterer, linux-btrfs

Christoph Anton Mitterer wrote:
> Hey.
>
> I just wondered about the status of the wiki status page?! ;-)
>
> E.g. it says seeding would be stable, while right now there's an
> ongoing thread on this list about it being broken again.
>
>

As just a regular user, I have a couple of thoughts here.

I think that the status page should primarily reflect the status of the
LTS kernels. Perhaps the last three or four LTS kernels. If a bug is
fixed or introduced, the specific version should be pointed out.
Perhaps this would be easier to maintain and easier to direct users to
as well.

Another interesting thing about the status page: zoned mode is marked
as "mostly ok" since 5.16, but the description says "there are known
bugs, use only for testing". From my point of view this is UNSTABLE, so
I hope someone updates either the status or the description to whatever
fits best.

And one more thing - would it perhaps be a good idea to put the status
page somewhere in the documentation pages?!