* status page status - dedupe
From: Christoph Anton Mitterer @ 2022-03-05 19:21 UTC
To: linux-btrfs
Hey.
I just wondered about the status of the wiki status page?! ;-)
E.g. it says seeding would be stable, while right now there's an
ongoing thread on this list about it being broken again.
In particular, what's the status of out-of-band deduplication (i.e. run
manually by some program like duperemove or jdupes)?
Is it safe to use?
My understanding was that for out-of-band dedupe, the kernel performs
a full byte-by-byte comparison before actually deduplicating, right?
So it shouldn't matter so much which tool is used in the end and
whether that's stable or not?
Thanks,
Chris.
* Re: status page status - dedupe
From: Qu Wenruo @ 2022-03-05 23:25 UTC
To: Christoph Anton Mitterer, linux-btrfs
On 2022/3/6 03:21, Christoph Anton Mitterer wrote:
> Hey.
>
> I just wondered about the status of the wiki status page?! ;-)
>
> E.g. it says seeding would be stable, while right now there's an
> ongoing thread on this list about it being broken again.
I was over-reacting in that thread.
The problem is only in misc-next, so it doesn't really affect anyone.
>
>
> In particular, what's the status of out-of-band deduplication (i.e. run
> manually by some program like duperemove or jdupes)?
> Is it safe to be used?
Pretty safe AFAIK.
Thanks,
Qu
>
>
> My understanding was that for out-of-band dedupe, the kernel performs
> a full byte-by-byte comparison before actually deduplicating, right?
>
> So it shouldn't matter so much which tool is used in the end and
> whether that's stable or not?
>
>
> Thanks,
> Chris.
* Re: status page status - dedupe
From: Zygo Blaxell @ 2022-03-06 0:00 UTC
To: Christoph Anton Mitterer; +Cc: linux-btrfs
On Sat, Mar 05, 2022 at 08:21:26PM +0100, Christoph Anton Mitterer wrote:
> Hey.
>
> I just wondered about the status of the wiki status page?! ;-)
>
> E.g. it says seeding would be stable, while right now there's an
> ongoing thread on this list about it being broken again.
>
>
> In particular, what's the status of out-of-band deduplication (i.e. run
> manually by some program like duperemove or jdupes)?
> Is it safe to be used?
>
> My understanding was that for out-of-band dedupe, the kernel performs
> a full byte-by-byte comparison before actually deduplicating, right?
>
> So it shouldn't matter so much which tool is used in the end and
> whether that's stable or not?
The kernel provides a dedupe ioctl (FIDEDUPERANGE or
BTRFS_IOC_FILE_EXTENT_SAME), and that ioctl does a full byte-by-byte
comparison while locking both inodes; however, there are other ways to
achieve deduplication on btrfs without the ioctl, so you must verify
the tool you are using uses the safe ioctl.
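As a rough illustration of what "using the safe ioctl" means from userspace, here is a minimal sketch in Python. It is not taken from any of the tools in this thread. The helper names (build_dedupe_request, parse_dedupe_result, dedupe_range) are made up for the example, and the struct layout and request number are assumed to match struct file_dedupe_range in <linux/fs.h> on x86-64 and similar architectures.

```python
import fcntl
import struct

# Assumed to equal _IOWR(0x94, 54, struct file_dedupe_range) from <linux/fs.h>
# on architectures with the common ioctl encoding; the fixed part of the
# struct is 24 bytes.
FIDEDUPERANGE = (3 << 30) | (24 << 16) | (0x94 << 8) | 54

FILE_DEDUPE_RANGE_SAME = 0     # kernel found the ranges identical and shared them
FILE_DEDUPE_RANGE_DIFFERS = 1  # contents differ, destination left untouched

# src_offset, src_length, dest_count, reserved1, reserved2
_HEADER = struct.Struct("=QQHHI")
# dest_fd, dest_offset, bytes_deduped, status, reserved
_INFO = struct.Struct("=qQQiI")


def build_dedupe_request(src_offset, length, dest_fd, dest_offset):
    """Pack a request that names exactly one destination range."""
    header = _HEADER.pack(src_offset, length, 1, 0, 0)
    info = _INFO.pack(dest_fd, dest_offset, 0, 0, 0)
    return bytearray(header + info)


def parse_dedupe_result(buf):
    """Return (bytes_deduped, status) from a buffer the kernel has filled in."""
    _fd, _off, bytes_deduped, status, _rsv = _INFO.unpack_from(buf, _HEADER.size)
    return bytes_deduped, status


def dedupe_range(src_fd, src_offset, length, dest_fd, dest_offset):
    """Ask the kernel to share one range; the comparison happens in the kernel."""
    buf = build_dedupe_request(src_offset, length, dest_fd, dest_offset)
    fcntl.ioctl(src_fd, FIDEDUPERANGE, buf, True)  # mutable buffer, updated in place
    return parse_dedupe_result(buf)
```

The caller never compares data itself: a status of FILE_DEDUPE_RANGE_DIFFERS means the kernel saw different bytes and changed nothing, which is why a buggy duplicate finder cannot corrupt data through this path. The call only succeeds on a filesystem that implements the ioctl.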
bees, duperemove, btrfs-dedupe, and solstice use the safe dedupe ioctl,
and provide no option to do otherwise.
dduper can be configured to use either the safe dedupe ioctl or the
unsafe clone_range ioctl (--fast-mode). The unsafe clone ioctl is faster,
and can be used if you know the data is not being modified concurrently
with dedupe.
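To make the difference concrete, here is a toy model of the race, assuming nothing about how dduper or any other tool is implemented. FakeFile, clone_style_dedupe and ioctl_style_dedupe are hypothetical names, and the "files" are just in-memory byte strings; the point is only the ordering of check, write, and replace.

```python
import threading


class FakeFile:
    """In-memory stand-in for a file; not a real btrfs object."""

    def __init__(self, data):
        self.data = bytes(data)
        self.lock = threading.Lock()


def clone_style_dedupe(src, dst, concurrent_write=None):
    """Compare in userspace, then clone unconditionally (check and act are separate)."""
    same = src.data == dst.data      # time of check
    if concurrent_write:
        concurrent_write(dst)        # another process writes in the gap
    if same:
        dst.data = src.data          # time of use: the new write is overwritten
    return same


def ioctl_style_dedupe(src, dst, concurrent_write=None):
    """Compare and share inside one locked section, as described for the dedupe ioctl."""
    if concurrent_write:
        concurrent_write(dst)        # a write cannot land between compare and share,
                                     # so model it as arriving just before
    with src.lock, dst.lock:
        same = src.data == dst.data
        if same:
            dst.data = src.data
    return same


def writer(f):
    f.data = b"BBBB"


a, b = FakeFile(b"AAAA"), FakeFile(b"AAAA")
clone_style_dedupe(a, b, writer)     # b.data ends up b"AAAA": the write is lost

c, d = FakeFile(b"AAAA"), FakeFile(b"AAAA")
ioctl_style_dedupe(c, d, writer)     # d.data stays b"BBBB": compare fails, nothing replaced
```

In the clone-style path the tool's earlier comparison is stale by the time the clone happens, which is the case to rule out before using a fast mode.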
jdupes can use the safe dedupe ioctl (-B) or very unsafe hardlinks (-H).
bedup uses the unsafe clone ioctl. There is a branch in the bedup github
repo (wip/dedup-syscall) which uses the safe ioctl but it has not been
merged to master.
Old-school POSIX deduplicators are based on hardlinks and not safe unless
all files are strictly read-only during and after dedupe.
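A small sketch of why hardlinks are only safe for read-only files, using nothing but the Python standard library. The function name and file names are made up for the example, and it assumes the temporary directory is on a filesystem that supports hardlinks.

```python
import os
import tempfile


def hardlink_dedupe_demo():
    """Replace a duplicate with a hardlink, then modify it through one name."""
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "original.txt")
        duplicate = os.path.join(tmp, "duplicate.txt")
        for path in (original, duplicate):
            with open(path, "w") as f:
                f.write("same contents\n")

        # What a hardlink-based deduplicator does: one inode, two names.
        os.remove(duplicate)
        os.link(original, duplicate)

        # An in-place write through one name...
        with open(duplicate, "a") as f:
            f.write("edit meant for the duplicate only\n")

        # ...shows up under the other name, because there is only one file now.
        with open(original) as f:
            return f.read()


contents_of_original = hardlink_dedupe_demo()
```

After the call, contents_of_original includes the line that was appended through the other name. With reflink-based dedupe the two files keep separate inodes, and a write to one triggers copy-on-write instead of changing the other.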
> Thanks,
> Chris.
* Re: status page status - dedupe
From: Andy Smith @ 2022-03-06 1:00 UTC
To: linux-btrfs
Hello,
On Sat, Mar 05, 2022 at 07:00:23PM -0500, Zygo Blaxell wrote:
> bees, duperemove, btrfs-dedupe, and solstice use the safe dedupe ioctl,
> and provide no option to do otherwise.
Is there some issue with combining offline dedupe and compression in
that it undoes all the benefits of the compression? I'm sorry, I
don't know the details and may have got the wrong impression but I
thought I had read here recently that there was negative interaction
here still.
Thanks,
Andy
* Re: status page status - dedupe
From: Christoph Anton Mitterer @ 2022-03-06 1:38 UTC
To: Zygo Blaxell; +Cc: linux-btrfs
Thanks for that elaborate description (and thanks to Qu, too).
I think that might be a good addition to
https://btrfs.wiki.kernel.org/index.php/Deduplication
Also:
On Sat, 2022-03-05 at 19:00 -0500, Zygo Blaxell wrote:
> jdupes can use the safe dedupe ioctl (-B) or very unsafe hardlinks (-
> H).
I guess you mean -L ?!
Thanks,
Chris.
* Re: status page status - dedupe
From: Zygo Blaxell @ 2022-03-06 1:40 UTC
To: Christoph Anton Mitterer; +Cc: linux-btrfs
On Sun, Mar 06, 2022 at 02:38:48AM +0100, Christoph Anton Mitterer wrote:
> Thanks for that elaborate description (and thanks to Qu, too).
>
> I think that might be a good addition to
> https://btrfs.wiki.kernel.org/index.php/Deduplication
>
>
> Also:
>
> On Sat, 2022-03-05 at 19:00 -0500, Zygo Blaxell wrote:
> > jdupes can use the safe dedupe ioctl (-B) or very unsafe hardlinks (-
> > H).
>
> I guess you mean -L ?!
Uhhh, yes. I don't use either option myself. ;)
> Thanks,
> Chris.
* Re: status page status - dedupe
From: Zygo Blaxell @ 2022-03-06 3:12 UTC
To: Andy Smith; +Cc: linux-btrfs
On Sun, Mar 06, 2022 at 01:00:11AM +0000, Andy Smith wrote:
> Hello,
>
> On Sat, Mar 05, 2022 at 07:00:23PM -0500, Zygo Blaxell wrote:
> > bees, duperemove, btrfs-dedupe, and solstice use the safe dedupe ioctl,
> > and provide no option to do otherwise.
>
> Is there some issue with combining offline dedupe and compression in
> that it undoes all the benefits of the compression? I'm sorry, I
> don't know the details and may have got the wrong impression but I
> thought I had read here recently that there was negative interaction
> here still.
It's more like the other way around: compression makes some deduplication
tools ineffective. To perform well, a deduper must have specific
support for btrfs and compression in order to issue dedupe requests
that will remove complete extents and recover free disk space, and it
must not use optimizations that are incompatible with compression.
Without this support, the deduper may fail to detect duplicates and
not have very much impact on total space usage for compressed extents.
All current btrfs dedupe tools choose to keep one duplicate data copy
arbitrarily, without considering the size of the encoding. So if you have
a compressed file, and make an uncompressed copy, about half of the time
the dedupe tool will replace the compressed copy with the uncompressed
one, when ideally it would measure the size of both and always keep the
smallest version of the data.
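The "ideal" behaviour described above could look something like the following sketch. It is hypothetical: no current tool is claimed to work this way, the Copy type and choose_copy_to_keep are invented names, and how disk_bytes would actually be measured on btrfs is left out.

```python
from dataclasses import dataclass


@dataclass
class Copy:
    path: str
    logical_bytes: int  # size of the data as applications read it
    disk_bytes: int     # size of the on-disk encoding (smaller if compressed)


def choose_copy_to_keep(copies):
    """Keep the copy with the smallest on-disk encoding instead of an arbitrary one."""
    return min(copies, key=lambda c: c.disk_bytes)


copies = [
    Copy("plain.img", logical_bytes=1048576, disk_bytes=1048576),
    Copy("compressed.img", logical_bytes=1048576, disk_bytes=262144),
]
keep = choose_copy_to_keep(copies)
```

Here keep is the compressed copy, so the dedupe request would use it as the source and replace the uncompressed extents, rather than the other way around.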
bees has limited support for compressed data. It will avoid shortening
compressed data blocks when this would result in a larger overall
encoding, and it will compress new data extents created by splitting
uncompressed extents. bees can match compressed and uncompressed copies
of duplicate data. It uses a variable block size with a small lower bound
for a better hit rate on shorter compressed extents. bees outperforms
everything else on final data size with compression.
duperemove blindly issues dedupe requests without regard for extent
boundaries or compression. Compressed data has shorter extents, so it
tends to help duperemove achieve space savings in more cases, but it's
difficult to predict the more or less random effect on the total data
size. Compression sometimes improves duperemove hit rate, but sometimes
reduces it. duperemove can match compressed data with uncompressed data.
jdupes gives the same dedupe hit rate for compressed and uncompressed
data since jdupes only handles whole-file duplicates (this also applies
to duperemove in fdupes-compatibility mode). A whole-file deduplicator
will completely replace all extents in the duplicate files, which avoids
many compression-related issues. jdupes can match compressed data with
uncompressed data (or any mixture of these in each file).
dduper and solstice use btrfs data csums exclusively to find duplicate
blocks. Compressed data csums in btrfs are computed on the on-disk
encoding of the data, meaning that they are the csums of the data
_after_ compression for compressed blocks. The csums cannot be used to
deduplicate data blocks that are uncompressed, that are compressed with
a different algorithm or level, or appear at a different position within
an extent. These tools cannot match compressed and uncompressed copies
of the same data, and will get very low (often near zero) hit rates on
compressed data.
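The effect is easy to reproduce outside btrfs. The sketch below is a simplified model, not btrfs code: zlib.crc32 stands in for the filesystem checksum (btrfs defaults to crc32c), whole buffers stand in for the per-block csums, and block_csum and csums_for_copies are made-up names.

```python
import zlib


def block_csum(on_disk_bytes):
    """Stand-in checksum over whatever bytes are actually stored on disk."""
    return zlib.crc32(on_disk_bytes)


def csums_for_copies(data):
    """Checksum the same logical data under three different on-disk encodings."""
    return {
        "uncompressed": block_csum(data),
        "zlib level 1": block_csum(zlib.compress(data, 1)),
        "zlib level 9": block_csum(zlib.compress(data, 9)),
    }


data = b"btrfs dedupe " * 1000
csums = csums_for_copies(data)
```

All three entries in csums describe identical logical data, yet a tool that matches on these values alone has nothing to match on, because each encoding stores different bytes.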
> Thanks,
> Andy
* Re: status page status - dedupe
From: waxhead @ 2022-03-06 10:54 UTC
To: Christoph Anton Mitterer, linux-btrfs
Christoph Anton Mitterer wrote:
> Hey.
>
> I just wondered about the status of the wiki status page?! ;-)
>
> E.g. it says seeding would be stable, while right now there's an
> ongoing thread on this list about it being broken again.
>
>
As just a regular user, I have a couple of thoughts here.
I think that the status page should primarily reflect the status of the
LTS kernels. Perhaps the last three or four LTS kernels. If a bug is
fixed or introduced it should be pointed out at what specific version.
Perhaps this would be easier to maintain and easier to direct users to
as well.
Another interesting thing about the status page: zoned mode is marked
as "mostly ok" since 5.16, but the description says "there are
known bugs, use only for testing". From my point of view this is UNSTABLE,
so I hope someone updates either the status or the description to
whatever fits best.
And one more thing - would it perhaps be a good idea to put the status
page somewhere in the documentation pages?!