From: "Austin S. Hemmelgarn" <ahferroin7@gmail.com>
To: Chris Murphy <lists@colorremedies.com>,
"Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Marc Joliet <marcec@gmx.de>, Btrfs BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: out-of-band dedup status?
Date: Fri, 9 Dec 2016 07:29:21 -0500
Message-ID: <c992df7d-d58f-1b59-371e-149516bb358e@gmail.com>
In-Reply-To: <CAJCQCtSRUmQarebB6rK_sCiVGpBZ-Xcptr17zu_XoJC8EF9BoQ@mail.gmail.com>
On 2016-12-08 21:54, Chris Murphy wrote:
> On Thu, Dec 8, 2016 at 7:26 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
>> On Thu, Dec 08, 2016 at 05:45:40PM -0700, Chris Murphy wrote:
>>> OK something's wrong.
>>>
>>> Kernel 4.8.12 and duperemove v0.11.beta4. Brand new file system
>>> (mkfs.btrfs -dsingle -msingle, default mount options) and two
>>> identical files separately copied.
>>>
>>> [chris@f25s]$ ls -li /mnt/test
>>> total 2811904
>>> 260 -rw-r--r--. 1 root root 1439694848 Dec 8 17:26
>>> Fedora-Workstation-Live-x86_64-25_Beta-1.1.iso
>>> 259 -rw-r--r--. 1 root root 1439694848 Dec 8 17:26
>>> Fedora-Workstation-Live-x86_64-25_Beta-1.1.iso2
>>>
>>> [chris@f25s]$ filefrag /mnt/test/*
>>> /mnt/test/Fedora-Workstation-Live-x86_64-25_Beta-1.1.iso: 3 extents found
>>> /mnt/test/Fedora-Workstation-Live-x86_64-25_Beta-1.1.iso2: 2 extents found
>>>
>>>
>>> [chris@f25s duperemove]$ sudo ./duperemove -dv /mnt/test/*
>>> Using 128K blocks
>>> Using hash: murmur3
>>> Gathering file list...
>>> Using 4 threads for file hashing phase
>>> [1/2] (50.00%) csum: /mnt/test/Fedora-Workstation-Live-x86_64-25_Beta-1.1.iso
>>> [2/2] (100.00%) csum: /mnt/test/Fedora-Workstation-Live-x86_64-25_Beta-1.1.iso2
>>> Total files: 2
>>> Total hashes: 21968
>>> Loading only duplicated hashes from hashfile.
>>> Using 4 threads for dedupe phase
>>> [0xba8400] (00001/10947) Try to dedupe extents with id e47862ea
>>> [0xba84a0] (00003/10947) Try to dedupe extents with id ffed44f2
>>> [0xba84f0] (00002/10947) Try to dedupe extents with id ffeefcdd
>>> [0xba8540] (00004/10947) Try to dedupe extents with id ffe4cf64
>>> [0xba8540] Add extent for file
>>> "/mnt/test/Fedora-Workstation-Live-x86_64-25_Beta-1.1.iso" at offset
>>> 1182924800 (4)
>>> [0xba8540] Add extent for file
>>> "/mnt/test/Fedora-Workstation-Live-x86_64-25_Beta-1.1.iso2" at offset
>>> 1182924800 (5)
>>> [0xba8540] Dedupe 1 extents (id: ffe4cf64) with target: (1182924800,
>>> 131072), "/mnt/test/Fedora-Workstation-Live-x86_64-25_Beta-1.1.iso"
>>
>> Ew, it's deduping these two 1.4GB files 128K at a time, which results in
>> 12000 ioctl calls. Each of those 12000 calls has to lock the two
>> inodes, read the file contents, remap the blocks, etc. instead of
>> finding the maximal identical range and making a single call for the
>> whole range.
>>
>> That's probably why it's taking forever to dedupe.
>
> Yes but it looks like it's also heavily fragmenting the files as a
> result as well.
>
This reinforces what I've been telling people recently: generic batch
deduplication generally works, but it's quite often better to write a
custom tool that understands your data set and knows how to handle it
efficiently.

As an example, one of the cases where I use deduplication is on a set of
directories that each hold a subset of a larger tree, with some files
appearing in more than one directory. The directories look something
like this:

+ a
| + file1
| \ file2
+ b
| + file3
| \ file2
\ c
  + file1
  \ file3

In this case, I know that if a/file1 and c/file1 have the same mtime and
size, they're (supposed to be) copies of the same file. Given this, the
tool I use just checks for duplicate names with the same size and mtime,
then counts on the ioctl's own check to verify that the files are
actually identical (and throws a warning if they aren't). It also does
some special stuff when submitting ranges so that any given file ends up
with the fewest possible extents, all of roughly the same size. On
average, even with the fancy extent size calculation logic, this still
takes less than a quarter of the time that duperemove took on the same
data set.
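
A minimal sketch of the approach described above, not the actual tool:
it groups files by (name, size, mtime) and splits a file into the fewest
ranges of roughly equal length. The function names, the 16 MiB per-call
cap, and the 4 KiB alignment are assumptions made for illustration, and
submitting the ranges to the dedupe ioctl is deliberately left out.

```python
import os
import stat
from collections import defaultdict


def find_candidates(roots):
    """Group regular files under `roots` by (name, size, mtime).

    Returns a list of path lists; each list holds files that are presumed
    to be copies of each other. Nothing is read or compared here: the
    dedupe ioctl is expected to verify the contents itself.
    """
    groups = defaultdict(list)
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                st = os.stat(path, follow_symlinks=False)
                # Skip symlinks, devices, and empty files.
                if not stat.S_ISREG(st.st_mode) or st.st_size == 0:
                    continue
                groups[(name, st.st_size, st.st_mtime_ns)].append(path)
    return sorted(sorted(paths) for paths in groups.values() if len(paths) > 1)


def split_even(size, max_len=16 * 1024 * 1024, align=4096):
    """Split [0, size) into the fewest ranges of at most `max_len` bytes.

    All ranges except possibly the last have the same block-aligned
    length. `max_len` (assumed per-call cap) must be a multiple of `align`.
    """
    if size <= 0:
        return []
    count = -(-size // max_len)          # ceil(size / max_len)
    step = -(-size // count)             # ceil(size / count)
    step = -(-step // align) * align     # round up to the alignment
    ranges = []
    offset = 0
    while offset < size:
        length = min(step, size - offset)
        ranges.append((offset, length))
        offset += length
    return ranges
```

Under the assumed 16 MiB cap, the 1439694848-byte ISO from the quoted
test splits into 86 ranges, compared with 10984 blocks per file at 128K.
A group returned by find_candidates() whose contents turn out to differ
is the case where the real tool would print its warning.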