From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
To: Saint Germain <saintger@gmail.com>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: Identifying reflink / CoW files
Date: Thu, 3 Nov 2016 01:17:07 -0400
Message-ID: <20161103051707.GE21290@hungrycats.org>
In-Reply-To: <20161027133011.631cf1e5@system>
On Thu, Oct 27, 2016 at 01:30:11PM +0200, Saint Germain wrote:
> Hello,
>
> Following the previous discussion:
> https://www.spinics.net/lists/linux-btrfs/msg19075.html
>
> I would be interested in finding a way to reliably identify reflink /
> CoW files in order to use deduplication programs (like fdupes, jdupes,
> rmlint) efficiently.
>
> Using FIEMAP doesn't seem to be reliable according to this discussion
> on rmlint:
> https://github.com/sahib/rmlint/issues/132#issuecomment-157665154
Inline extents have no physical address (FIEMAP returns 0 in that field).
You can't dedup them and each file can have only one, so if you see
the FIEMAP_EXTENT_DATA_INLINE bit set, you can just skip processing the
entire file immediately.
You can create a separate non-inline extent in a temporary file then
use dedup to replace _both_ copies of the original inline extent.
Or don't bother, as the savings are negligible.
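A minimal sketch of that skip rule, assuming the extent flags have already been read from FIEMAP. The flag value is the one named FIEMAP_EXTENT_DATA_INLINE in <linux/fiemap.h>; the function names and the input shape (a dict of path to list of fe_flags values) are hypothetical, chosen only for illustration.

```python
# Value of the inline-data flag as defined in <linux/fiemap.h>.
FIEMAP_EXTENT_DATA_INLINE = 0x00000200


def has_inline_extent(extent_flags):
    """Return True if any extent's fe_flags has the inline-data bit set."""
    return any(flags & FIEMAP_EXTENT_DATA_INLINE for flags in extent_flags)


def dedup_candidates(files):
    """Keep only the files that have no inline extent.

    `files` maps a path to the list of fe_flags values of its extents
    (a hypothetical input shape for this sketch).
    """
    return [path for path, flags in files.items()
            if not has_inline_extent(flags)]
```

A dedup tool could run a filter like this before doing any hashing, since a file with an inline extent has nothing worth deduplicating.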
> Is there another way that deduplication programs can easily use?
The problem is that it's not files that are reflinked--individual extents
are. "reflink file copy" really just means "a file whose extents are
100% shared with another file." It's possible for files on btrfs to have
any percentage of shared extents from 0 to 100% in increments of the
host page size. It's also possible for the blocks to be shared with
different extent boundaries.
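To make "any percentage of shared extents" concrete, here is a rough sketch that measures how much of one file's physical space another file also references, regardless of where the extent boundaries fall. It assumes each file is already described as a list of (logical, physical, length) tuples in bytes, with physical == 0 meaning "no address". The function names are hypothetical, and the arithmetic is only meaningful for uncompressed extents.

```python
def physical_ranges(extents):
    """Merge a file's extents into sorted, disjoint (start, end) physical ranges.

    `extents` is a list of (logical, physical, length) tuples; entries with
    physical == 0 (holes, inline data) carry no address and are ignored.
    """
    spans = sorted((p, p + n) for _, p, n in extents if p != 0 and n > 0)
    merged = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            # Adjacent or overlapping: extend the previous range.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(r) for r in merged]


def shared_fraction(extents_a, extents_b):
    """Fraction of A's addressed bytes whose physical blocks B also references."""
    a, b = physical_ranges(extents_a), physical_ranges(extents_b)
    total = sum(end - start for start, end in a)
    if total == 0:
        return 0.0
    shared, i, j = 0, 0, 0
    while i < len(a) and j < len(b):
        lo, hi = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if lo < hi:
            shared += hi - lo
        # Advance whichever range ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return shared / total
```

Merging the ranges first is what lets two files compare as fully shared even when one sees a single 8 KiB extent and the other sees two adjacent 4 KiB extents.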
The quality of the result therefore depends on the amount of effort
put into measuring it. If you look for the first non-hole extent in
each file and use its physical address as a physical file identifier,
then you get a fast reflink detector function that has a high risk of
false positives. If you map out two files and compare physical addresses
block by block, you get a slow function with a low risk of false positives
(but maybe a small risk of false negatives too).
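The two ends of that trade-off can be sketched roughly as follows. This assumes extents given as (logical, physical, length) tuples in bytes, a 4096-byte block size, and uncompressed data (for compressed extents, physical + offset is not a meaningful block address). All names are hypothetical.

```python
def first_extent_id(extents):
    """Fast check: physical address of the first extent that has one, or None."""
    for _, physical, length in sorted(extents):
        if physical != 0 and length > 0:
            return physical
    return None


def block_map(extents, block_size=4096):
    """Map each logical block number to the physical address backing it."""
    blocks = {}
    for logical, physical, length in extents:
        if physical == 0:
            continue  # hole or inline data: no physical address to compare
        for off in range(0, length, block_size):
            blocks[(logical + off) // block_size] = physical + off
    return blocks


def same_blocks(extents_a, extents_b, block_size=4096):
    """Slow check: every logical block maps to the same physical address."""
    map_a = block_map(extents_a, block_size)
    map_b = block_map(extents_b, block_size)
    # Two files with no addressed blocks at all share nothing.
    return bool(map_a) and map_a == map_b
```

Two files that share only their first block get the same first_extent_id (a false positive for full-file reflink), while same_blocks tells them apart at the cost of walking every block.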
If your dedup program only does full-file reflink copies then the first
extent physical address method is sufficient. If your program does
block- or extent-level dedup then it shouldn't be using files in its
data model at all, except where necessary to provide a mechanism to
access the physical blocks through the POSIX filesystem API.
FIEMAP will tell you about all the extents (physical address for extents
that have them, zero for other extent types). It's also slow and has
assorted accuracy problems especially with compressed files. Any user
can run FIEMAP, and it uses only standard structure arrays.
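For reference, a rough sketch of calling FIEMAP from an unprivileged Python process on Linux. The ioctl number and the two struct layouts are transcribed from <linux/fiemap.h> as I understand them; the function names and the 512-extent buffer size are assumptions of this sketch, and a file with more extents than that would need repeated calls, which are not shown.

```python
import fcntl
import struct

FS_IOC_FIEMAP = 0xC020660B       # _IOWR('f', 11, struct fiemap) on Linux
FIEMAP_FLAG_SYNC = 0x00000001    # flush dirty data before mapping

# struct fiemap: fm_start, fm_length, fm_flags, fm_mapped_extents,
# fm_extent_count, fm_reserved (32 bytes)
FIEMAP_HEADER = struct.Struct("=QQIIII")
# struct fiemap_extent: fe_logical, fe_physical, fe_length,
# fe_reserved64[2], fe_flags, fe_reserved[3] (56 bytes)
FIEMAP_EXTENT = struct.Struct("=QQQ2QI3I")


def parse_fiemap(buf):
    """Decode a filled-in FIEMAP buffer into (logical, physical, length, flags)."""
    mapped = FIEMAP_HEADER.unpack_from(buf, 0)[3]
    extents = []
    for i in range(mapped):
        offset = FIEMAP_HEADER.size + i * FIEMAP_EXTENT.size
        fields = FIEMAP_EXTENT.unpack_from(buf, offset)
        extents.append((fields[0], fields[1], fields[2], fields[5]))
    return extents


def fiemap(path, max_extents=512):
    """Return up to max_extents extents of `path` (one ioctl call only)."""
    buf = bytearray(FIEMAP_HEADER.size + max_extents * FIEMAP_EXTENT.size)
    # Map the whole file: start 0, length 2**64 - 1.
    FIEMAP_HEADER.pack_into(buf, 0, 0, 0xFFFFFFFFFFFFFFFF,
                            FIEMAP_FLAG_SYNC, 0, max_extents, 0)
    with open(path, "rb") as f:
        fcntl.ioctl(f.fileno(), FS_IOC_FIEMAP, buf)  # kernel fills buf in place
    return parse_fiemap(buf)
```

On a filesystem that does not implement FIEMAP, the ioctl raises OSError, so a caller would want to catch that and fall back to treating the file as unknown.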
SEARCH_V2 (the BTRFS_IOC_TREE_SEARCH_V2 ioctl) is root-only and requires
parsing btrfs's variable-length binary data encoding, but it's faster than
FIEMAP and gives more accurate results on compressed files.
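To give an idea of what "parsing variable-length binary data" involves, here is a sketch that decodes the item stream a tree search returns. It does not issue the ioctl itself. The layouts are my reading of struct btrfs_ioctl_search_header and struct btrfs_file_extent_item from the kernel headers (search headers in host byte order, item payloads little-endian), and the function name and output dict shape are hypothetical.

```python
import struct

BTRFS_EXTENT_DATA_KEY = 108      # key type of file extent items
FILE_EXTENT_INLINE = 0           # btrfs_file_extent_item.type for inline data

# btrfs_ioctl_search_header: transid, objectid, offset, type, len
SEARCH_HEADER = struct.Struct("=QQQII")
# btrfs_file_extent_item, fixed part: generation, ram_bytes, compression,
# encryption, other_encoding, type (21 bytes)
EXTENT_HEAD = struct.Struct("<QQBBHB")
# Regular/prealloc only: disk_bytenr, disk_num_bytes, offset, num_bytes
EXTENT_DISK = struct.Struct("<QQQQ")


def parse_search_items(buf, nr_items):
    """Decode file extent items from a tree-search result buffer."""
    pos, out = 0, []
    for _ in range(nr_items):
        _transid, inode, file_offset, key_type, length = \
            SEARCH_HEADER.unpack_from(buf, pos)
        pos += SEARCH_HEADER.size
        item = buf[pos:pos + length]
        pos += length            # items are variable length: skip by `len`
        if key_type != BTRFS_EXTENT_DATA_KEY:
            continue
        _gen, ram_bytes, compression, _enc, _other, ext_type = \
            EXTENT_HEAD.unpack_from(item, 0)
        entry = {"inode": inode, "file_offset": file_offset,
                 "compression": compression, "ram_bytes": ram_bytes,
                 "disk_bytenr": None, "disk_num_bytes": None}
        if ext_type != FILE_EXTENT_INLINE:
            # disk_bytenr == 0 here would mean a hole.
            bytenr, disk_bytes, _off, _num = \
                EXTENT_DISK.unpack_from(item, EXTENT_HEAD.size)
            entry["disk_bytenr"] = bytenr
            entry["disk_num_bytes"] = disk_bytes
        out.append(entry)
    return out
```

Because disk_bytenr and disk_num_bytes describe the on-disk extent itself, two compressed files that reference the same extent show the same values here.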
> Thanks