From: Andrei Borzenkov <arvidjaar@gmail.com>
To: Newbugreport <newbugreport@protonmail.com>,
"linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>
Subject: Re: Btrfs send bloat
Date: Sun, 19 May 2019 23:06:25 +0300
Message-ID: <275f7add-382c-bf6d-4cf8-f9823cf55daf@gmail.com>
In-Reply-To: <clzY4RoSOURzgBtua3TjQ4WXJzgY3EwTyiaYwt49zFAPIi_jO2nAQ8O2saTwpqHH9x0ISw9AVbWOvVR4hFDIx8_dzlWKAzHwcOtEuwaXzJ8=@protonmail.com>
On 19.05.2019 11:11, Newbugreport wrote:
> I have 3-4 years worth of snapshots I use for backup purposes. I keep
> R-O live snapshots, two local backups, and AWS Glacier Deep Freeze. I
> use both send | receive and send > file. This works well but I get
> massive deltas when files are moved around in a GUI via samba.
Did you analyze whether it is a client or a server problem? If the
client does a file copy (instead of a move, as you imply), maybe the
simplest solution would be to use a different tool on the client. If the
problem is on the server side, it is something to discuss with the Samba
folks.
> Reorganize a bunch of files and the next snapshot is 50 or 100 GB.
> Perhaps mv or cp with reflink=always would fix the problem but it's
> just not usable enough for my family.
>
> I'd like a solution to the massive delta problem. Perhaps someone
> already has a solution, that would be great. If not, I need advice on
> a few ideas.
>
> It seems a realistic solution to deduplicate the subvolume before
> each snapshot is taken, and in theory I could write a small program
> to do that.
You mean that none of the half a dozen existing tools that perform
deduplication on btrfs fits your requirements?
> However I don't know if that would work. Will Btrfs
> let me deduplicate between a file on the live subvolume and a file on
> the R-O snapshot (really the same file but different path)? If so,
btrfs does not care, because it does not perform any deduplication
itself. All tools find identical file ranges and then invoke a kernel
ioctl to replace the reference to a range in the destination file with a
reference to the identical range in the source file. So nothing prevents
using read-only data as the source for deduplication of read-write data.
Whether each of the existing tools supports this (or makes it easy to
do), I do not know.
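
To illustrate the ioctl step, here is a minimal Python sketch. It
assumes the generic FIDEDUPERANGE interface from <linux/fs.h>, with the
request number computed using the x86/ARM ioctl encoding (some
architectures encode the direction bits differently). The helper names
build_request and dedupe_range are made up for this example and are not
taken from any existing tool. The source file is opened read-only, so
it can live in a read-only snapshot; the destination is opened
read-write.

```python
import fcntl
import os
import struct

# Assumed layout of struct file_dedupe_range (header) and
# struct file_dedupe_range_info (one destination entry).
HDR = struct.Struct("=QQHHI")   # src_offset, src_length, dest_count, reserved1, reserved2
INFO = struct.Struct("=qQQiI")  # dest_fd, dest_offset, bytes_deduped, status, reserved


def _iowr(type_, nr, size):
    # _IOWR() as encoded on x86/ARM: read|write direction in the top two bits.
    return (3 << 30) | (size << 16) | (type_ << 8) | nr


FIDEDUPERANGE = _iowr(0x94, 54, HDR.size)


def build_request(src_offset, length, dest_fd, dest_offset):
    """Pack a dedupe request with exactly one destination range."""
    header = HDR.pack(src_offset, length, 1, 0, 0)
    info = INFO.pack(dest_fd, dest_offset, 0, 0, 0)
    return bytearray(header + info)


def dedupe_range(src_path, dest_path, offset, length):
    """Ask the kernel to share one identical range between two files.

    Returns (bytes_deduped, status) as reported by the kernel; the
    kernel compares the data itself and refuses if the ranges differ.
    """
    src = os.open(src_path, os.O_RDONLY)   # may be inside a read-only snapshot
    dst = os.open(dest_path, os.O_RDWR)    # file on the live subvolume
    try:
        buf = build_request(offset, length, dst, offset)
        fcntl.ioctl(src, FIDEDUPERANGE, buf)  # kernel writes results back into buf
        _, _, bytes_deduped, status, _ = INFO.unpack_from(buf, HDR.size)
        return bytes_deduped, status
    finally:
        os.close(src)
        os.close(dst)
```

A real tool would additionally have to find the candidate ranges and
respect alignment and per-call size limits, which this sketch ignores.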
> will Btrfs send with -p result in a small delta?
>
Well, if all data is replaced by references to existing extents in some
snapshot, then the delta against that snapshot will be small.
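
One way to check this is to measure the size of the incremental stream
before and after deduplication. The sketch below is only an
illustration: send_command and send_stream_size are hypothetical helper
names, the snapshot paths are up to you, and it assumes the btrfs
command is in PATH and that you run it with sufficient privileges.

```python
import subprocess


def send_command(snapshot, parent=None):
    """Build the argument list for a full or incremental btrfs send."""
    cmd = ["btrfs", "send"]
    if parent is not None:
        cmd += ["-p", parent]   # incremental stream relative to this parent
    cmd.append(snapshot)
    return cmd


def send_stream_size(snapshot, parent=None, chunk=1 << 20):
    """Run btrfs send and count the stream bytes without storing them."""
    proc = subprocess.Popen(send_command(snapshot, parent),
                            stdout=subprocess.PIPE)
    total = 0
    while True:
        data = proc.stdout.read(chunk)
        if not data:
            break
        total += len(data)
    if proc.wait() != 0:
        raise RuntimeError("btrfs send failed")
    return total
```

Comparing send_stream_size(new, parent=old) for a snapshot taken before
and after running a deduplication tool should show whether the moved
files still end up in the stream as new data.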
> Failing that I could probably make changes to the send data stream,
> but that's suboptimal for the live volume and any backup volumes
> where data has been received.
>
> Also, is it possible to access the Btrfs hash values for files so I
> don't have to recalculate file hashes for the whole volume myself?
>
Currently btrfs does not compute hashes suitable for deduplication. It
only stores CRC32 checksums. You can access the checksum tree, and at
least one tool uses it to speed up scanning, but it then computes a
second hash to avoid false positives.
Recently a patch series was posted to add support for different hashes
(I believe SHA256 at least); these would be more useful for
deduplication once merged.
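
The two-stage idea (cheap checksum as a filter, stronger hash as
confirmation) can be sketched as follows. This is an illustration under
assumptions, not how any particular tool is implemented: it computes
zlib.crc32 over in-memory blocks as a stand-in for the checksums btrfs
already stores, and find_duplicate_blocks is a made-up name.

```python
import hashlib
import zlib
from collections import defaultdict


def find_duplicate_blocks(blocks):
    """Group identical blocks; blocks is an iterable of (block_id, bytes).

    Stage 1 buckets blocks by a cheap 32-bit checksum. Stage 2 hashes
    only the blocks that share a bucket with SHA-256, so checksum
    collisions are not reported as duplicates.
    """
    by_crc = defaultdict(list)
    for block_id, data in blocks:
        by_crc[zlib.crc32(data)].append((block_id, data))

    groups = []
    for candidates in by_crc.values():
        if len(candidates) < 2:
            continue  # unique checksum, cannot be a duplicate
        by_sha = defaultdict(list)
        for block_id, data in candidates:
            by_sha[hashlib.sha256(data).digest()].append(block_id)
        groups.extend(ids for ids in by_sha.values() if len(ids) > 1)
    return groups
```

With stored checksums the first stage costs no data reads at all; only
the candidate blocks have to be read for the second hash.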