From: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
To: Peter Becker <floyd.net@gmail.com>
Cc: linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: [markfasheh/duperemove] Why blocksize is limit to 1MB?
Date: Wed, 4 Jan 2017 00:43:19 +0100
Message-ID: <624c67f5-6039-0332-fdea-19d6a80074ec@mendix.com>
In-Reply-To: <CAEtw4r1-WwLiK2trAWNJMwFRP2shOS+hemSNWMbKWAd-FBQrKw@mail.gmail.com>
On 01/04/2017 12:12 AM, Peter Becker wrote:
> Good hint, this would be an option and i will try this.
>
> Regardless of this the curiosity has packed me and I will try to
> figure out where the problem with the low transfer rate is.
>
> 2017-01-04 0:07 GMT+01:00 Hans van Kranenburg <hans.van.kranenburg@mendix.com>:
>> On 01/03/2017 08:24 PM, Peter Becker wrote:
>>> All invocations are justified, but not relevant in (offline) backup
>>> and archive scenarios.
>>>
>>> For example you have multiple version of append-only log-files or
>>> append-only db-files (each more then 100GB in size), like this:
>>>
>>>> Snapshot_01_01_2017
>>> -> file1.log .. 201 GB
>>>
>>>> Snapshot_02_01_2017
>>> -> file1.log .. 205 GB
>>>
>>>> Snapshot_03_01_2017
>>> -> file1.log .. 221 GB
>>>
>>> The first 201 GB would be every time the same.
>>> Files a copied at night from windows, linux or bsd systems and
>>> snapshoted after copy.
>>
>> XY problem?
>>
>> Why not use rsync --inplace in combination with btrfs snapshots? Even if
>> the remote does not support rsync and you need to pull the full file
>> first, you could again use rsync locally.
<annoyed>please don't toppost</annoyed>
Also, there is a rather huge difference between the two approaches, given
how btrfs works internally.
Say, I have a subvolume with thousands of directories and millions of
files with random data in it, and I want to have a second deduped copy
of it.
Approach 1:
Create a full copy of everything (compare: retrieving the remote file
again), so that 200% of the data storage is now used, and after that do
deduplication, so that again only 100% of the data storage is used.
Approach 2:
cp -av --reflink original/ copy/
By doing this, you end up with the same result as approach 1 would give
if your deduper were the most ideal in the world (and the files are so
random that they don't contain duplicate blocks inside them).
Approach 3:
btrfs sub snap original copy
W00t, that was fast, and the only thing that happened was writing a few
16kB metadata pages again (one for the toplevel tree page that got cloned
into a new filesystem tree, and a few for the blocks one level lower to
add backreferences to the new root).
So:
The big difference in the end result between approaches 1 and 2 on the
one hand and approach 3 on the other is that with 1 and 2, while
deduplicating your data, you're actually duplicating all your metadata
at the same time.
In your situation, doing an rsync --inplace from the remote if possible,
so that only the changed (appended) data gets stored, and then using
native btrfs snapshotting would seem the most effective.
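A minimal sketch of that workflow, assuming a POSIX shell. The host name,
the paths and the Snapshot_<date> naming (borrowed from the example quoted
above) are placeholders, and the run/nightly_backup helpers and the DRY_RUN
switch are made up for illustration; they are not part of rsync or
btrfs-progs. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only: all paths and names passed in are hypothetical.
# With DRY_RUN=1 (the default here) commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# nightly_backup SRC SUBVOL SNAPDIR STAMP
nightly_backup() {
    src="$1"; subvol="$2"; snapdir="$3"; stamp="$4"
    # --inplace updates the existing destination file instead of writing
    # a new one, so unchanged extents stay shared with older snapshots.
    # --no-whole-file keeps the delta algorithm on for local sources too.
    run rsync -a --inplace --no-whole-file "$src" "$subvol/"
    # A read-only snapshot shares both data and metadata with the subvolume.
    run btrfs subvolume snapshot -r "$subvol" "$snapdir/Snapshot_$stamp"
}
```

Calling `nightly_backup remote:/logs/ /backup/current /backup/snaps 04_01_2017`
prints the rsync and snapshot commands for one night; set DRY_RUN=0 to
actually run them. With the sizes from the quoted example (201 GB, then
205 GB, then 221 GB), and assuming the file really is append-only, the
second and third nights would store roughly 4 GB and 16 GB of new data
instead of a full copy each time.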
--
Hans van Kranenburg