Re: [markfasheh/duperemove] Why blocksize is limit to 1MB?

From: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
To: Peter Becker <floyd.net@gmail.com>
Cc: linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: [markfasheh/duperemove] Why blocksize is limit to 1MB?
Date: Wed, 4 Jan 2017 00:43:19 +0100
Message-ID: <624c67f5-6039-0332-fdea-19d6a80074ec@mendix.com>
In-Reply-To: <CAEtw4r1-WwLiK2trAWNJMwFRP2shOS+hemSNWMbKWAd-FBQrKw@mail.gmail.com>

On 01/04/2017 12:12 AM, Peter Becker wrote:
> Good hint, this would be an option and I will try this.
> 
> Regardless of this, curiosity has gripped me and I will try to
> figure out where the problem with the low transfer rate is.
> 
> 2017-01-04 0:07 GMT+01:00 Hans van Kranenburg <hans.van.kranenburg@mendix.com>:
>> On 01/03/2017 08:24 PM, Peter Becker wrote:
>>> All invocations are justified, but not relevant in (offline) backup
>>> and archive scenarios.
>>>
>>> For example, you have multiple versions of append-only log files or
>>> append-only db files (each more than 100GB in size), like this:
>>>
>>>> Snapshot_01_01_2017
>>> -> file1.log .. 201 GB
>>>
>>>> Snapshot_02_01_2017
>>> -> file1.log .. 205 GB
>>>
>>>> Snapshot_03_01_2017
>>> -> file1.log .. 221 GB
>>>
>>> The first 201 GB would be the same every time.
>>> Files are copied at night from Windows, Linux or BSD systems and
>>> snapshotted after the copy.
>>
>> XY problem?
>>
>> Why not use rsync --inplace in combination with btrfs snapshots? Even if
>> the remote does not support rsync and you need to pull the full file
>> first, you could again use rsync locally.

<annoyed>please don't toppost</annoyed>

Also, there is a rather huge difference between the two approaches, given
the way btrfs works internally.

Say, I have a subvolume with thousands of directories and millions of
files with random data in it, and I want to have a second deduped copy
of it.

Approach 1:

Create a full copy of everything (compare: retrieving remote file again)
(now 200% of data storage is used), and after that do deduplication, so
that again only 100% of data storage is used.

Approach 2:

cp -av --reflink original/ copy/

By doing this, you end up with the same result as approach 1 would give
if your deduper were the most ideal one in the world (and the files are
so random that they don't contain duplicate blocks inside them).

Approach 3:

btrfs sub snap original copy

W00t, that was fast, and the only thing that happened was writing a few
16kB metadata pages again. (1 for the toplevel tree page that got cloned
into a new filesystem tree, and a few for the blocks one level lower to
add backreferences to the new root).
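
For reference, the three approaches could be sketched as the shell
functions below. This is only an illustration and not taken from the
thread: the function names and the run helper are made up,
--reflink=never assumes a reasonably recent GNU cp, and the duperemove
flags (-d to dedupe, -r to recurse) are one plausible invocation. With
DRY_RUN=1, the default here, the commands are only printed.

```shell
#!/bin/sh
# Illustrative sketch only; all helper names are hypothetical.
# DRY_RUN=1 (default) prints each command instead of executing it.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Approach 1: full copy (data temporarily stored twice), then dedupe.
approach1() {
    run cp -a --reflink=never "$1" "$2"
    run duperemove -dr "$1" "$2"
}

# Approach 2: reflink copy. Data extents are shared with the original,
# but the metadata for every file and directory is written anew.
approach2() {
    run cp -a --reflink=always "$1" "$2"
}

# Approach 3: snapshot. Only a few metadata pages are written.
approach3() {
    run btrfs subvolume snapshot "$1" "$2"
}
```

After a real run, btrfs filesystem du -s original copy shows how much
data is shared versus exclusive. Note that it reports data usage only,
not the metadata that this message is about.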

So:

The big difference in the end result between approaches 1 and 2 on the
one hand and approach 3 on the other is that with 1 and 2, while
deduplicating your data, you're actually duplicating all your metadata at
the same time.
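
To get a feeling for the order of magnitude, here is a back-of-the-envelope
sketch. Only the 16kB page size comes from the text above; the page
count, the file count and especially the average metadata bytes per file
are assumptions chosen for illustration, not measured values.

```shell
#!/bin/sh
# Rough estimate of metadata written; every argument is an assumption.

# Snapshot: a handful of 16 KiB metadata pages get written.
snapshot_cost_kib() {
    pages=$1
    echo $(( pages * 16 ))
}

# Full copy, reflink copy or dedupe: each file ends up with its own
# metadata (inode, directory entries, extent references).
copy_cost_kib() {
    files=$1
    bytes_per_file=$2   # hypothetical average, not a measured value
    echo $(( files * bytes_per_file / 1024 ))
}

snapshot_cost_kib 5
copy_cost_kib 2000000 300
```

With these assumed inputs the snapshot writes 80 KiB of metadata, while
the copy writes about 572 MiB. The exact numbers will differ on a real
filesystem; the point is the several orders of magnitude between them.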

In your situation, doing an rsync --inplace from the remote if possible,
so that only changed or appended data gets stored, and then using native
btrfs snapshotting would seem the most effective.
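
A minimal sketch of such a nightly job could look like this. REMOTE,
WORK and SNAPDIR are hypothetical placeholders, WORK is assumed to be a
btrfs subvolume, and the function names are made up. With DRY_RUN=1, the
default here, the commands are only printed.

```shell
#!/bin/sh
# Sketch: refresh a working subvolume in place, then snapshot it.
# All paths and host names below are hypothetical placeholders.
REMOTE=${REMOTE:-backupuser@remote.example:/data/}
WORK=${WORK:-/mnt/backup/current}
SNAPDIR=${SNAPDIR:-/mnt/backup/snapshots}
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

nightly_backup() {
    stamp=$1
    # --inplace updates the existing destination file instead of writing
    # a new one, so unchanged extents stay shared with older snapshots.
    run rsync -a --inplace "$REMOTE" "$WORK/"
    # -r creates a read-only snapshot of the refreshed subvolume.
    run btrfs subvolume snapshot -r "$WORK" "$SNAPDIR/$stamp"
}

nightly_backup "$(date +%Y-%m-%d)"
```

When the file has to be pulled in full first and rsync is then run
between two local paths, rsync defaults to whole-file transfers, so
adding --no-whole-file is probably needed to keep the unchanged part of
the file untouched.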

-- 
Hans van Kranenburg
