Re: status page status - dedupe

From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
To: Andy Smith <andy@strugglers.net>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: status page status - dedupe
Date: Sat, 5 Mar 2022 22:12:33 -0500
Message-ID: <YiQmoUnd4jSSdCNt@hungrycats.org>
In-Reply-To: <20220306010011.m66pgmvpvetnthok@bitfolk.com>

On Sun, Mar 06, 2022 at 01:00:11AM +0000, Andy Smith wrote:
> Hello,
> 
> On Sat, Mar 05, 2022 at 07:00:23PM -0500, Zygo Blaxell wrote:
> > bees, duperemove, btrfs-dedupe, and solstice use the safe dedupe ioctl,
> > and provide no option to do otherwise.
> 
> Is there some issue with combining offline dedupe and compression in
> that it undoes all the benefits of the compression? I'm sorry, I
> don't know the details and may have got the wrong impression but I
> thought I had read here recently that there was negative interaction
> here still.

It's more like the other way around:  compression makes some deduplication
tools ineffective.  To perform well, a deduper must have specific
support for btrfs and compression in order to issue dedupe requests
that will remove complete extents and recover free disk space, and it
must not use optimizations that are incompatible with compression.
Without this support, the deduper may fail to detect duplicates, and
may have very little impact on total space usage for compressed extents.
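
To illustrate the "complete extents" point: btrfs only frees an extent
once nothing references any part of it, so a dedupe request that covers
part of an extent recovers no space from that extent.  A minimal sketch,
assuming a made-up (start, length) extent list and a hypothetical helper
name; it is not taken from any of the tools discussed here:

```python
from typing import List, Tuple

# (start_offset, length) in bytes; a hypothetical in-memory extent map
Extent = Tuple[int, int]

def whole_extent_ranges(extents: List[Extent], dup_start: int, dup_len: int) -> List[Extent]:
    """Return only the extents that lie entirely inside the duplicate range."""
    dup_end = dup_start + dup_len
    return [(start, length) for start, length in extents
            if start >= dup_start and start + length <= dup_end]

K = 1024
# Three 128 KiB extents (the maximum size of a compressed btrfs extent).
extents = [(0, 128 * K), (128 * K, 128 * K), (256 * K, 128 * K)]
# The duplicate region covers the second extent fully, and only parts of
# the first and third.
covered = whole_extent_ranges(extents, dup_start=64 * K, dup_len=256 * K)
print(covered)
```

Only the middle extent is fully covered, so only that one could be freed.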

All current btrfs dedupe tools choose to keep one duplicate data copy
arbitrarily, without considering the size of the encoding.  So if you have
a compressed file, and make an uncompressed copy, about half of the time
the dedupe tool will replace the compressed copy with the uncompressed
one, when ideally it would measure the size of both and always keep the
smaller encoding of the data.
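
The ideal policy is simple to state in code.  This is only a sketch of
the selection step, with hypothetical names and made-up sizes; no
current tool is claimed to work this way:

```python
from dataclasses import dataclass

@dataclass
class Copy:
    name: str        # hypothetical label for one copy of the data
    disk_bytes: int  # bytes the copy occupies on disk, after compression if any

def choose_keeper(a: Copy, b: Copy) -> Copy:
    """Keep whichever copy has the smaller on-disk encoding."""
    return a if a.disk_bytes <= b.disk_bytes else b

compressed = Copy("compressed copy", disk_bytes=20_000)
uncompressed = Copy("uncompressed copy", disk_bytes=131_072)
keeper = choose_keeper(uncompressed, compressed)
print(keeper.name)
```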

bees has limited support for compressed data.  It will avoid shortening
compressed data blocks when this would result in a larger overall
encoding, and it will compress new data extents created by splitting
uncompressed extents.  bees can match compressed and uncompressed copies
of duplicate data.  It uses a variable block size with a small lower bound
for a better hit rate on shorter compressed extents.  bees outperforms
everything else on final data size with compression.

duperemove blindly issues dedupe requests without regard for extent
boundaries or compression.  Compressed data has shorter extents, which
tends to help duperemove achieve space savings in more cases, but the
effect on total data size is more or less random and difficult to
predict:  compression sometimes improves the duperemove hit rate and
sometimes reduces it.  duperemove can match compressed data with
uncompressed data.

jdupes gives the same dedupe hit rate for compressed and uncompressed
data since jdupes only handles whole-file duplicates (this also applies
to duperemove in fdupes-compatibility mode).  A whole-file deduplicator
will completely replace all extents in the duplicate files, which avoids
many compression-related issues.  jdupes can match compressed data with
uncompressed data (or any mixture of these in each file).
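
A rough sketch of why whole-file matching is indifferent to compression:
it compares the logical contents returned by read(), never the on-disk
encoding.  This is a generic size-then-hash grouping with a hypothetical
function name, not jdupes's actual algorithm, and it stops before the
dedupe step itself:

```python
import hashlib
import os
import tempfile
from collections import defaultdict

def whole_file_duplicates(paths):
    """Group paths whose logical contents are identical."""
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)
    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for p in same_size:
            with open(p, "rb") as f:
                # read() returns decompressed data on a compressed filesystem
                by_hash[hashlib.sha256(f.read()).hexdigest()].append(p)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups

# Small demonstration on throwaway files.
with tempfile.TemporaryDirectory() as d:
    bodies = {"a.txt": b"same data\n", "b.txt": b"same data\n", "c.txt": b"other data"}
    paths = []
    for name, body in bodies.items():
        p = os.path.join(d, name)
        with open(p, "wb") as f:
            f.write(body)
        paths.append(p)
    groups = [[os.path.basename(p) for p in g] for g in whole_file_duplicates(paths)]
print(groups)
```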

dduper and solstice use btrfs data csums exclusively to find duplicate
blocks.  Compressed data csums in btrfs are computed on the on-disk
encoding of the data, meaning that they are the csums of the data
_after_ compression for compressed blocks.  The csums of a compressed
block cannot be used to find copies of the same data that are
uncompressed, that are compressed with a different algorithm or level,
or that appear at a different position within an extent.  These tools
cannot match compressed and uncompressed copies of the same data, and
will get very low (often near zero) hit rates on compressed data.
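
The effect can be reproduced outside btrfs.  The sketch below uses
stand-ins throughout: zlib instead of the btrfs compressors, SHA-256
instead of the btrfs csum, and 4 KiB chunks of one compressed stream
instead of real extents.  It only shows that checksums of the encoded
bytes share nothing with checksums of the same data encoded differently:

```python
import hashlib
import zlib

BLOCK = 4096  # checksum granularity used in this sketch

def block_sums(data: bytes) -> set:
    """Checksum each BLOCK-sized chunk of the bytes as stored."""
    return {hashlib.sha256(data[i:i + BLOCK]).digest()
            for i in range(0, len(data), BLOCK)}

# The same logical data, stored four different ways.
logical = b"".join(b"line %06d of a log file\n" % i for i in range(2000))
plain_sums = block_sums(logical)                                    # uncompressed
packed_sums = block_sums(zlib.compress(logical, 6))                 # compressed
other_level_sums = block_sums(zlib.compress(logical, 1))            # different level
shifted_sums = block_sums(zlib.compress(b"header\n" + logical, 6))  # different position

for label, sums in [("uncompressed", plain_sums),
                    ("other level", other_level_sums),
                    ("shifted", shifted_sums)]:
    print(label, "shares", len(packed_sums & sums), "checksums with the compressed copy")
```

A csum-only deduper sees no common checksums in any of these pairs, even
though the logical data is identical.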

> Thanks,
> Andy
