From: "Austin S. Hemmelgarn" <ahferroin7@gmail.com>
To: Peter Becker <floyd.net@gmail.com>,
	linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: [markfasheh/duperemove] Why blocksize is limit to 1MB?
Date: Tue, 3 Jan 2017 15:40:15 -0500	[thread overview]
Message-ID: <dd94bb02-822b-2244-a326-c75a50c4429e@gmail.com> (raw)
In-Reply-To: <CAEtw4r3rnKvBgz4BqiVYDz2Ru9RdXODyJOp0sYSiuNNh1Gbebg@mail.gmail.com>

On 2017-01-03 15:20, Peter Becker wrote:
> I think I understand. The resulting key question is how I can improve
> the performance of the extent_same ioctl.
> I tested it with the following results:
>
> Environment:
> 2 files, each named "file" and 100 GB in size; duperemove run with the
> nofiemap option set and a 1 MB extent size.
>
> duperemove output:
> [0x1908590] (13889/72654) Try to dedupe extents with id d1c672db
> [0x1908590] Add extent for file "/mnt/new/file" at offset 66.7G (6)
> [0x1908590] Add extent for file "/mnt/old/file" at offset 66.9G (7)
> [0x1908590] Dedupe 1 extents (id: d1c672db) with target: (66.7G,
> 1.0M), "/mnt/new/file"
>
> iostat output (extended statistics) for a 30 sec. sample:
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>           22,31    0,00   13,83   33,81    0,00   30,05
>
> Device:  rrqm/s  wrqm/s      r/s     w/s    rkB/s    wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
> sdd        0,00    1,70  1149,93    0,73  4600,53   139,60      8,24      0,23   0,20     0,19    13,64   0,19  21,84
> sde        0,00    0,00  1149,33    0,53  4597,33    23,87      8,04      0,20   0,18     0,18     1,75   0,18  20,47
> sdf        0,00    1,70  1149,60    0,63  4598,40   139,60      8,24      0,21   0,18     0,18     4,63   0,18  20,63
> sdh        0,00    0,00  1149,33    0,53  4597,33    23,87      8,04      0,21   0,18     0,18     4,25   0,18  20,85
>
> resulting in less than 18 MB/s read. Really slow.
>
> Question 1: Why so slow?
For a couple of reasons.  First, you have to understand that duperemove 
itself does a large amount of processing outside of the call to the 
ioctl.  It first hashes the blocks for quicker comparison and matching, 
then figures out which blocks match, and only then calls the ioctl on 
the resulting matches.  The reason for this design is that the ioctl is 
extremely slow.  It first locks the ranges passed in (so they don't get 
changed by anything else during the deduplication), then does a 
byte-by-byte comparison to make sure they actually do match (a 
data-safety measure; I've said before that there should be a flag for 
the ioctl, or a separate ioctl, to skip this and trust that userspace 
really knows what it's doing), then finally sets up the reflinks and 
unlocks the new extent.
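
For reference, here is a minimal sketch of what a single request to 
that ioctl looks like from userspace, using the VFS-level FIDEDUPERANGE 
interface from <linux/fs.h> (kernel 4.5 and newer).  The file paths and 
the 1 MiB length are just placeholders taken from the sample above, and 
error handling is trimmed:

/* Minimal sketch: submit one dedupe request for a single 1 MiB range.
 * The kernel locks both ranges, compares them byte by byte, and only
 * reflinks the destination onto the source if they match exactly. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(void)
{
        int src = open("/mnt/old/file", O_RDONLY);
        int dst = open("/mnt/new/file", O_RDWR);
        struct file_dedupe_range *arg;

        if (src < 0 || dst < 0) {
                perror("open");
                return 1;
        }

        arg = calloc(1, sizeof(*arg) + sizeof(struct file_dedupe_range_info));
        arg->src_offset = 0;                 /* range in the source file */
        arg->src_length = 1 << 20;           /* one 1 MiB chunk */
        arg->dest_count = 1;                 /* one destination range */
        arg->info[0].dest_fd = dst;
        arg->info[0].dest_offset = 0;

        if (ioctl(src, FIDEDUPERANGE, arg) < 0)
                perror("FIDEDUPERANGE");
        else if (arg->info[0].status == FILE_DEDUPE_RANGE_SAME)
                printf("deduped %llu bytes\n",
                       (unsigned long long)arg->info[0].bytes_deduped);
        else
                printf("ranges differ, nothing deduped (status %d)\n",
                       arg->info[0].status);

        free(arg);
        return 0;
}

Most of the wall-clock time of a call like this is spent inside the 
kernel on the locking and byte-wise verification described above, which 
is why pushing a huge number of small ranges through it is so slow.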

All of this ties into why I keep telling people that efficient 
deduplication requires a tool that understands how the data being 
deduplicated is structured.  By avoiding the need to hash and compare 
every block of data, you can significantly reduce the time that part 
takes, and quite often that more than offsets the cost of a few false 
positives being passed into the ioctl (the kernel's byte-wise check 
will simply reject those).
> Question 2a: Would a larger extent size perform better?
> Question 2b: Or did I misunderstand something?
No, a larger extent would probably not help much, and that's actually a 
really good performance sample you've created.

The block size does end up being somewhat of a trade-off.  Ideally, you 
want it matched to the smallest chunk of duplicate data that is greater 
than or equal to the filesystem block size, for maximal space 
efficiency.  Doing this, however, makes the extra processing done by 
duperemove take much longer: it has to calculate hashes for more blocks 
(this has very little impact until you get down to very small block 
sizes), and it has to make far more comparisons.  The comparison work 
grows roughly with the square of the number of blocks, so just halving 
the block size will roughly quadruple the time it takes to make the 
comparisons.
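
As a rough illustration of that scaling (under the simplifying 
assumption that candidate matching is pairwise, which duperemove's 
hashing only partly avoids), the block and comparison counts for a 
single 100 GiB file at a few block sizes look like this:

/* Back-of-the-envelope sketch, not duperemove's actual algorithm:
 * hashes grow linearly with the block count, pairwise comparisons
 * grow with its square, so halving the block size (doubling the
 * block count) roughly quadruples the comparison work. */
#include <stdio.h>

int main(void)
{
        unsigned long long file_size = 100ULL << 30;   /* 100 GiB */
        unsigned long long bs[] = { 1 << 20, 512 << 10, 128 << 10 };

        for (int i = 0; i < 3; i++) {
                unsigned long long n = file_size / bs[i];
                printf("%5lluK blocks: %8llu hashes, ~%llu pairwise comparisons\n",
                       bs[i] >> 10, n, n * n / 2);
        }
        return 0;
}
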
>
> 2017-01-03 20:37 GMT+01:00 Austin S. Hemmelgarn <ahferroin7@gmail.com>:
>> On 2017-01-03 14:21, Peter Becker wrote:
>>>
>>> All of those points are justified, but they are not relevant in
>>> (offline) backup and archive scenarios.
>>>
>>> For example, you have multiple versions of append-only log files or
>>> append-only DB files (each more than 100 GB in size), like this:
>>>
>>> Snapshot_01_01_2017
>>>   -> file1.log .. 201 GB
>>> Snapshot_02_01_2017
>>>   -> file1.log .. 205 GB
>>> Snapshot_03_01_2017
>>>   -> file1.log .. 221 GB
>>>
>>> The first 201 GB would be the same every time.
>>> The files are copied at night from Windows, Linux or BSD systems and
>>> snapshotted after the copy.
>>>
>>> So a fast way to dedupe this is needed.  Using 128 KB blocks would
>>> result in 1,646,592 extents per snapshot; a 1 MB block size results
>>> in 205,824 extents (not bad, but still terrible speed).
>>> I will test a patched version of duperemove with a 100 MB block size
>>> tonight, but I have little hope that the throughput will increase.
>>
>> Deduplication is not a general purpose thing (usually you have very
>> specifically structured data), but duperemove is supposed to be a
>> general purpose tool.  It handles two of the most common cases
>> (deduplicating large numbers of small files or small numbers of large
>> files), but it is sub-optimal even for those, and will be for almost
>> any other case.  This is
>> a canonical example though of a case where you can use a custom script or
>> program to figure out what's duplicated and then have that just call the
>> ioctl as appropriate itself.  Most cases where you are storing some kind of
>> well structured data fall into this category.  In fact, both of the cases
>> where I use deduplication myself fall into such a category.  One case
>> involves multiple directories that are partial copies of a larger tree which
>> are kept in sync with the larger tree and each other.  In that particular
>> case, I care about whole file deduplication, so I have a script that just
>> matches on path relative to the roots of each copy and the master copy,
>> verifies that the mtime and size are the same, and if so calls the ioctl for
>> deduplication (with some fancy processing to fit within the max size
>> supported by the ioctl and prevent tiny tail extents).  The other case is a
>> set of archives with a pretty solid fixed structure to them.  In that case,
>> I have a different script that knows enough about the file structure to know
>> where to look for duplicate blocks, thus avoiding having to hash the whole
>> files.
>>
>> The append-only log/database case fits this type of thing perfectly:
>> for each subsequent file, you already know that (most of) the file up
>> to the length of the previous file is duplicated, so you can split
>> that however you want into chunks, pass those to the dedupe ioctl, and
>> free up most of the space that would be used by the new file.  You can
>> then run duperemove with a hash-file to process any new blocks beyond
>> the point you deduplicated up to, to reclaim any remaining space
>> (currently this will process the whole file, but it should see that
>> most of it is already deduplicated).
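
To make the chunked approach above concrete, here is a minimal sketch 
that walks the known-duplicate prefix of a new snapshot's file and 
dedupes it against the previous one in fixed-size chunks.  The paths, 
the 201 GiB prev_len and the chunk size are placeholder assumptions 
(the chunk is kept small to stay within the per-request limit the 
ioctl enforces), and it reuses the same FIDEDUPERANGE call as the 
sketch earlier in this mail:

/* Hypothetical sketch, not duperemove code: dedupe the shared prefix of
 * an append-only file against the previous snapshot's copy, one chunk
 * per ioctl call.  Paths, prev_len and the chunk size are placeholders. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

static int dedupe_chunk(int src, int dst, unsigned long long off,
                        unsigned long long len)
{
        struct file_dedupe_range *arg =
                calloc(1, sizeof(*arg) + sizeof(struct file_dedupe_range_info));
        int ret;

        arg->src_offset = off;
        arg->src_length = len;
        arg->dest_count = 1;
        arg->info[0].dest_fd = dst;
        arg->info[0].dest_offset = off;      /* same offset in both files */

        ret = ioctl(src, FIDEDUPERANGE, arg);
        if (ret == 0 && arg->info[0].status != FILE_DEDUPE_RANGE_SAME)
                ret = -1;                    /* differed or errored, not reflinked */
        free(arg);
        return ret;
}

int main(void)
{
        int src = open("/mnt/Snapshot_01_01_2017/file1.log", O_RDONLY);
        int dst = open("/mnt/Snapshot_02_01_2017/file1.log", O_RDWR);
        unsigned long long prev_len = 201ULL << 30;  /* older file's length */
        unsigned long long chunk = 16ULL << 20;      /* per-ioctl chunk size */

        if (src < 0 || dst < 0) {
                perror("open");
                return 1;
        }

        /* Only the prefix known to be shared is touched; anything the
         * new file appended past prev_len is left for duperemove. */
        for (unsigned long long off = 0; off < prev_len; off += chunk) {
                unsigned long long len =
                        prev_len - off < chunk ? prev_len - off : chunk;
                if (dedupe_chunk(src, dst, off, len) != 0)
                        fprintf(stderr, "chunk at %llu not deduped\n", off);
        }
        return 0;
}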
>>
>>>
>>> For backup and archive scenarios the checksum-feature and the
>>> dub-data/metadata-feature of btrfs is realy nice. In particular if one
>>> considers the 7 years legally prescribed storage time.
>>>
>>> 2017-01-03 13:40 GMT+01:00 Austin S. Hemmelgarn <ahferroin7@gmail.com>:
>>>>
>>>> On 2016-12-30 15:28, Peter Becker wrote:
>>>>>
>>>>>
>>>>> Hello, I have an 8 TB volume with multiple files of hundreds of GB
>>>>> each.
>>>>> I am trying to dedupe this because the first hundred GB of many
>>>>> files are identical.
>>>>> With a 128 KB block size and the nofiemap and lookup-extents=no
>>>>> options, it will take more than a week (dedupe only, already
>>>>> hashed).  So I tried -b 100M, but this returned an error:
>>>>> "Blocksize is bounded ...".
>>>>>
>>>>> The reason is that the blocksize is limited to
>>>>>
>>>>> #define MAX_BLOCKSIZE (1024U*1024)
>>>>>
>>>>> but I can't find any description of why.
>>>>
>>>>
>>>> Beyond what Xin mentioned (namely that 1MB is a much larger block
>>>> than will be duplicated in most data-sets), there are a couple of
>>>> other reasons:
>>>> 1. Smaller blocks will actually get you better deduplication on
>>>> average because they're more likely to match.  As an example, assume
>>>> you have 2 files with the same 8 4k blocks in different orders:
>>>>   FileA: 1 2 3 4 5 6 7 8
>>>>   FileB: 7 8 5 6 3 4 1 2
>>>> In such a case, deduplicating at any block size above 8k would
>>>> result in zero deduplication between these files, while 8k or less
>>>> would completely deduplicate them.  This is of course a highly
>>>> specific and somewhat contrived example (in most cases it will be
>>>> scattered duplicate blocks over dozens of files), but it does convey
>>>> this specific point.
>>>> 2. The kernel will do a byte-wise comparison of all ranges you pass
>>>> into the ioctl at the same time.  Larger block sizes here mean that:
>>>>         a) The extents will be locked longer, which will prevent any
>>>> I/O to the files being deduplicated for the duration of the
>>>> comparison, which may in turn cause other issues on the system.
>>>>         b) The deduplication process will be stuck in uninterruptible
>>>> sleep longer, which on many systems will trigger hung task detection,
>>>> which will in turn either spam the system log or panic the system
>>>> depending on how it's configured.
>>>>
>>

