linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Identifying reflink / CoW files
@ 2012-09-22  3:38 Jp Wise
  2012-09-22  7:49 ` Arne Jansen
  0 siblings, 1 reply; 8+ messages in thread
From: Jp Wise @ 2012-09-22  3:38 UTC (permalink / raw)
  To: linux-btrfs

Good morning, I'm working on an offline deduplication script intended to 
work around the copy-on-write functionality of BTRFS.

Simply put - is there any existing utility to compare two files (or 
dirs) and output if the files share the same physical extents / data 
blocks on disk?
- aka - they're CoW copies.

I'm not actively working with BTRFS yet, but for the project i'm working 
on it's looking to the be most suitable candidate, and the CoW 
functionality avoids issues with file changes that hardlinks would create.
 From reading other posts, aware the information could be pulled out via 
btrfs-debug-tree, but it would then involve parsing the entire output to 
locate the required files inodes and their extents which seems like 
quite a roundabout way to retrieve the information.

Also my programming skills aren't  up to the task of trying to pull the 
tree data directly from the filesystem to do it, and I'd like to avoid 
doing byte-by-byte comparisons on all files as it's inefficient if the 
file can instead be identified as a CoW copy.

Open to suggestions of other tools that could be used to acheive the 
desired result.

Thanks.
Jp.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Identifying reflink / CoW files
  2012-09-22  3:38 Jp Wise
@ 2012-09-22  7:49 ` Arne Jansen
  2012-09-22 21:56   ` Jp Wise
  0 siblings, 1 reply; 8+ messages in thread
From: Arne Jansen @ 2012-09-22  7:49 UTC (permalink / raw)
  To: Jp Wise; +Cc: linux-btrfs

On 09/22/12 05:38, Jp Wise wrote:
> Good morning, I'm working on an offline deduplication script intended to
> work around the copy-on-write functionality of BTRFS.
> 
> Simply put - is there any existing utility to compare two files (or
> dirs) and output if the files share the same physical extents / data
> blocks on disk?
> - aka - they're CoW copies.
> 
> I'm not actively working with BTRFS yet, but for the project i'm working
> on it's looking to the be most suitable candidate, and the CoW
> functionality avoids issues with file changes that hardlinks would create.
> From reading other posts, aware the information could be pulled out via
> btrfs-debug-tree, but it would then involve parsing the entire output to
> locate the required files inodes and their extents which seems like
> quite a roundabout way to retrieve the information.
> 
> Also my programming skills aren't  up to the task of trying to pull the
> tree data directly from the filesystem to do it, and I'd like to avoid
> doing byte-by-byte comparisons on all files as it's inefficient if the
> file can instead be identified as a CoW copy.

The information is available in the kernel, but to find a good way to
extract it you have to describe in much more detail what you intend to
do. What I, first of all, don't understand, is, why you need the
information of already shared (=deduped) blocks to build a dedup. Don't
you want to find data that is identical, but not shared, instead?

> 
> Open to suggestions of other tools that could be used to acheive the
> desired result.
> 

Afaik without playing with it myself fiemap can give you information
about the mappings of each file. If the mappings of 2 files match,
the data is shared.

> Thanks.
> Jp.
> -- 
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Identifying reflink / CoW files
  2012-09-22  7:49 ` Arne Jansen
@ 2012-09-22 21:56   ` Jp Wise
  2012-09-24 13:53     ` David Sterba
  0 siblings, 1 reply; 8+ messages in thread
From: Jp Wise @ 2012-09-22 21:56 UTC (permalink / raw)
  To: Arne Jansen; +Cc: linux-btrfs


On 22/09/2012 7:49 p.m., Arne Jansen wrote:
> On 09/22/12 05:38, Jp Wise wrote:
>> Good morning, I'm working on an offline deduplication script intended to
>> work around the copy-on-write functionality of BTRFS.
>>
>> Simply put - is there any existing utility to compare two files (or
>> dirs) and output if the files share the same physical extents / data
>> blocks on disk?
>> - aka - they're CoW copies.
>>
>> I'm not actively working with BTRFS yet, but for the project i'm working
>> on it's looking to the be most suitable candidate, and the CoW
>> functionality avoids issues with file changes that hardlinks would create.
>>  From reading other posts, aware the information could be pulled out via
>> btrfs-debug-tree, but it would then involve parsing the entire output to
>> locate the required files inodes and their extents which seems like
>> quite a roundabout way to retrieve the information.
>>
>> Also my programming skills aren't  up to the task of trying to pull the
>> tree data directly from the filesystem to do it, and I'd like to avoid
>> doing byte-by-byte comparisons on all files as it's inefficient if the
>> file can instead be identified as a CoW copy.
> The information is available in the kernel, but to find a good way to
> extract it you have to describe in much more detail what you intend to
> do. What I, first of all, don't understand, is, why you need the
> information of already shared (=deduped) blocks to build a dedup. Don't
> you want to find data that is identical, but not shared, instead?
Hi Arne, that's exactly my issue. I want to filter out files that have 
already been de-duped to avoid re-checking two files that already share 
the same data blocks.
  In this usecase, I can identify potential duplicates via filename data 
(this dataset also stores a basic checksum as part of the filename), but 
rather than then doing a secondary checksum/cmp, I'd like to instead 
check if it's already sharing the same data blocks (ie: already 
de-duped). If two files share the same data blocks, I can safely say 
it's already de-duped and move onto the next potential match with doing 
the additional crunching of checksum/cmp to verify the match.

Logically there should also be usecases where someone has made a data 
copy in the past and may be uncertain if they made a reflink copy or a 
full copy and wants to recheck if they share the same data blocks.
>> Open to suggestions of other tools that could be used to acheive the
>> desired result.
>>
> Afaik without playing with it myself fiemap can give you information
> about the mappings of each file. If the mappings of 2 files match,
> the data is shared.
OK, have just done some searching on fiemap and located a code example 
using it to pull the file extent data.
http://smackerelofopinion.blogspot.com/2010/01/using-fiemap-ioctl-to-get-file-extents.html
Will have a play around to see if i might be able to hack it up to 
compare two files, or just parse it's output between two files to 
identify matches.  Thank you for the pointer. :)

Likewise if anyone else knows of an existing utility to do a 
non-bytewise compare between two files, and just check if they share the 
same datablocks please let me know. :)
>> Thanks.
>> Jp.
>> -- 
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Identifying reflink / CoW files
  2012-09-22 21:56   ` Jp Wise
@ 2012-09-24 13:53     ` David Sterba
  0 siblings, 0 replies; 8+ messages in thread
From: David Sterba @ 2012-09-24 13:53 UTC (permalink / raw)
  To: Jp Wise; +Cc: Arne Jansen, linux-btrfs

On Sun, Sep 23, 2012 at 09:56:34AM +1200, Jp Wise wrote:
> >Afaik without playing with it myself fiemap can give you information
> >about the mappings of each file. If the mappings of 2 files match,
> >the data is shared.
> OK, have just done some searching on fiemap and located a code example using
> it to pull the file extent data.
> http://smackerelofopinion.blogspot.com/2010/01/using-fiemap-ioctl-to-get-file-extents.html
> Will have a play around to see if i might be able to hack it up to compare
> two files, or just parse it's output between two files to identify matches.
> Thank you for the pointer. :)
> 
> Likewise if anyone else knows of an existing utility to do a non-bytewise
> compare between two files, and just check if they share the same datablocks
> please let me know. :)

The FIEMAP is a way with stable and defined interface to show file
extents, there's the filefrag utility (from e2fsprogs). It does not
have a parser-friendly output, so you may want to call the ioctl
directly. The key information is in the (struct
fiemap_extent)->fe_physical field. If physical block ranges from two
files overlap, they're shared.

There's another way how to get the extent info, via btrfs' SEARCH_TREE
ioctl, but it's more low-level and needs basic knowledge about the
internal b-tree items and how they're linked together.

david

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Identifying reflink / CoW files
@ 2016-10-27 11:30 Saint Germain
  2016-11-03  5:17 ` Zygo Blaxell
  0 siblings, 1 reply; 8+ messages in thread
From: Saint Germain @ 2016-10-27 11:30 UTC (permalink / raw)
  To: linux-btrfs

Hello,

Following the previous discussion:
https://www.spinics.net/lists/linux-btrfs/msg19075.html

I would be interested in finding a way to reliably identify reflink /
CoW files in order to use deduplication programs (like fdupes, jdupes,
rmlint) efficiently.

Using FIEMAP doesn't seem to be reliable according to this discussion
on rmlint:
https://github.com/sahib/rmlint/issues/132#issuecomment-157665154

Is there another way that deduplication programs can easily use ?

Thanks

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Identifying reflink / CoW files
  2016-10-27 11:30 Identifying reflink / CoW files Saint Germain
@ 2016-11-03  5:17 ` Zygo Blaxell
  2016-11-04 14:41   ` Saint Germain
  0 siblings, 1 reply; 8+ messages in thread
From: Zygo Blaxell @ 2016-11-03  5:17 UTC (permalink / raw)
  To: Saint Germain; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 2903 bytes --]

On Thu, Oct 27, 2016 at 01:30:11PM +0200, Saint Germain wrote:
> Hello,
> 
> Following the previous discussion:
> https://www.spinics.net/lists/linux-btrfs/msg19075.html
> 
> I would be interested in finding a way to reliably identify reflink /
> CoW files in order to use deduplication programs (like fdupes, jdupes,
> rmlint) efficiently.
> 
> Using FIEMAP doesn't seem to be reliable according to this discussion
> on rmlint:
> https://github.com/sahib/rmlint/issues/132#issuecomment-157665154

Inline extents have no physical address (FIEMAP returns 0 in that field).
You can't dedup them and each file can have only one, so if you see
the FIEMAP_EXTENT_INLINE bit set, you can just skip processing the entire
file immediately.

You can create a separate non-inline extent in a temporary file then
use dedup to replace _both_ copies of the original inline extent.
Or don't bother, as the savings are negligible.

> Is there another way that deduplication programs can easily use ?

The problem is that it's not files that are reflinked--individual extents
are.  "reflink file copy" really just means "a file whose extents are
100% shared with another file." It's possible for files on btrfs to have
any percentage of shared extents from 0 to 100% in increments of the
host page size.  It's also possible for the blocks to be shared with
different extent boundaries.

The quality of the result therefore depends on the amount of effort
put into measuring it.  If you look for the first non-hole extent in
each file and use its physical address as a physical file identifier,
then you get a fast reflink detector function that has a high risk of
false positives.  If you map out two files and compare physical addresses
block by block, you get a slow function with a low risk of false positives
(but maybe a small risk of false negatives too).

If your dedup program only does full-file reflink copies then the first
extent physical address method is sufficient.  If your program does
block- or extent-level dedup then it shouldn't be using files in its
data model at all, except where necessary to provide a mechanism to
access the physical blocks through the POSIX filesystem API.

FIEMAP will tell you about all the extents (physical address for extents
that have them, zero for other extent types).  It's also slow and has
assorted accuracy problems especially with compressed files.  Any user
can run FIEMAP, and it uses only standard structure arrays.

SEARCH_V2 is root-only and requires parsing variable-length binary
btrfs data encoding, but it's faster than FIEMAP and gives more accurate
results on compressed files.

> Thanks
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Identifying reflink / CoW files
  2016-11-03  5:17 ` Zygo Blaxell
@ 2016-11-04 14:41   ` Saint Germain
  2016-11-25  3:55     ` Zygo Blaxell
  0 siblings, 1 reply; 8+ messages in thread
From: Saint Germain @ 2016-11-04 14:41 UTC (permalink / raw)
  To: linux-btrfs

On Thu, 3 Nov 2016 01:17:07 -0400, Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote :

> On Thu, Oct 27, 2016 at 01:30:11PM +0200, Saint Germain wrote:
> > Hello,
> > 
> > Following the previous discussion:
> > https://www.spinics.net/lists/linux-btrfs/msg19075.html
> > 
> > I would be interested in finding a way to reliably identify
> > reflink / CoW files in order to use deduplication programs (like
> > fdupes, jdupes, rmlint) efficiently.
> > 
> > Using FIEMAP doesn't seem to be reliable according to this
> > discussion on rmlint:
> > https://github.com/sahib/rmlint/issues/132#issuecomment-157665154
> 
> Inline extents have no physical address (FIEMAP returns 0 in that
> field). You can't dedup them and each file can have only one, so if
> you see the FIEMAP_EXTENT_INLINE bit set, you can just skip
> processing the entire file immediately.
> 
> You can create a separate non-inline extent in a temporary file then
> use dedup to replace _both_ copies of the original inline extent.
> Or don't bother, as the savings are negligible.
> 
> > Is there another way that deduplication programs can easily use ?
> 
> The problem is that it's not files that are reflinked--individual
> extents are.  "reflink file copy" really just means "a file whose
> extents are 100% shared with another file." It's possible for files
> on btrfs to have any percentage of shared extents from 0 to 100% in
> increments of the host page size.  It's also possible for the blocks
> to be shared with different extent boundaries.
> 
> The quality of the result therefore depends on the amount of effort
> put into measuring it.  If you look for the first non-hole extent in
> each file and use its physical address as a physical file identifier,
> then you get a fast reflink detector function that has a high risk of
> false positives.  If you map out two files and compare physical
> addresses block by block, you get a slow function with a low risk of
> false positives (but maybe a small risk of false negatives too).
> 
> If your dedup program only does full-file reflink copies then the
> first extent physical address method is sufficient.  If your program
> does block- or extent-level dedup then it shouldn't be using files in
> its data model at all, except where necessary to provide a mechanism
> to access the physical blocks through the POSIX filesystem API.
> 
> FIEMAP will tell you about all the extents (physical address for
> extents that have them, zero for other extent types).  It's also slow
> and has assorted accuracy problems especially with compressed files.
> Any user can run FIEMAP, and it uses only standard structure arrays.
> 
> SEARCH_V2 is root-only and requires parsing variable-length binary
> btrfs data encoding, but it's faster than FIEMAP and gives more
> accurate results on compressed files.
> 

As the dedup program only does full-file reflink, the first extent
physical address method can be used as a fast first check to identify
potential files.

But how to implement the second check in order to have 0% risk of false
positive ?
Because you said that mapping out two files and comparing the physical
addresses block by block also has a low risk of false positives.

Thank you very much for the detailed explanation !

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Identifying reflink / CoW files
  2016-11-04 14:41   ` Saint Germain
@ 2016-11-25  3:55     ` Zygo Blaxell
  0 siblings, 0 replies; 8+ messages in thread
From: Zygo Blaxell @ 2016-11-25  3:55 UTC (permalink / raw)
  To: Saint Germain; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 3749 bytes --]

On Fri, Nov 04, 2016 at 03:41:49PM +0100, Saint Germain wrote:
> On Thu, 3 Nov 2016 01:17:07 -0400, Zygo Blaxell
> <ce3g8jdj@umail.furryterror.org> wrote :
> > [...]
> > The quality of the result therefore depends on the amount of effort
> > put into measuring it.  If you look for the first non-hole extent in
> > each file and use its physical address as a physical file identifier,
> > then you get a fast reflink detector function that has a high risk of
> > false positives.  If you map out two files and compare physical
> > addresses block by block, you get a slow function with a low risk of
> > false positives (but maybe a small risk of false negatives too).
> > 
> > If your dedup program only does full-file reflink copies then the
> > first extent physical address method is sufficient.  If your program
> > does block- or extent-level dedup then it shouldn't be using files in
> > its data model at all, except where necessary to provide a mechanism
> > to access the physical blocks through the POSIX filesystem API.
> > 
> > FIEMAP will tell you about all the extents (physical address for
> > extents that have them, zero for other extent types).  It's also slow
> > and has assorted accuracy problems especially with compressed files.
> > Any user can run FIEMAP, and it uses only standard structure arrays.
> > 
> > SEARCH_V2 is root-only and requires parsing variable-length binary
> > btrfs data encoding, but it's faster than FIEMAP and gives more
> > accurate results on compressed files.
> 
> As the dedup program only does full-file reflink, the first extent
> physical address method can be used as a fast first check to identify
> potential files.
> 
> But how to implement the second check in order to have 0% risk of false
> positive ?
> Because you said that mapping out two files and comparing the physical
> addresses block by block also has a low risk of false positives.

In theory, what you do is call FIEMAP on each file and compare the
physical blocks that come back.  If they are large files you will have
to call FIEMAP multiple times on both files, each time setting the start
position to the end position of the previous run.  Translate each result
record into a range of physical addresses, then compare them.  If there
were no differences, the files are already deduped.

In practice, FIEMAP doesn't provide full accuracy for compressed extents,
and in some cases the physical address data will compare equal when
the files are in fact different.  This is the small risk of false
positives, and the only way to get 100% accuracy is to not use FIEMAP.

Instead you can use the SEARCH ioctl, which dumps out the binary extent
items from btrfs.  If you look up the items corresponding to one inode,
you can get the real physical block address plus the offset from the
beginning of the extent for compressed extents.

In Bees I encode the compressed extent start offset into the same
uint64_t as the physical extent start address using the bottom 6 bits
of the physical (bytenr) address:

	https://github.com/Zygo/bees/blob/master/src/bees-types.cc#L744

This fills in an object which uniquely (and reversibly) identifies
the block on the filesystem.

The raw btrfs extent data is extracted here:

	https://github.com/Zygo/bees/blob/master/lib/extentwalker.cc#L533

BeesAddress gives no false positives, but it's built on top of hundreds
of lines of userspace support code.  :-/

> Thank you very much for the detailed explanation !
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2016-11-25  3:56 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-10-27 11:30 Identifying reflink / CoW files Saint Germain
2016-11-03  5:17 ` Zygo Blaxell
2016-11-04 14:41   ` Saint Germain
2016-11-25  3:55     ` Zygo Blaxell
  -- strict thread matches above, loose matches on Subject: below --
2012-09-22  3:38 Jp Wise
2012-09-22  7:49 ` Arne Jansen
2012-09-22 21:56   ` Jp Wise
2012-09-24 13:53     ` David Sterba

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).