From: <rsbecker@nexbridge.com>
To: "'Jeff King'" <peff@peff.net>, "'Michal Suchánek'" <msuchanek@suse.de>
Cc: "'Akash S'" <akashs@commvault.com>, <git@vger.kernel.org>,
"'Adithya Urugudige'" <aurugudige@commvault.com>,
"'Abhishek Dalmia'" <adalmia@commvault.com>
Subject: RE: Incremental Backup of repositories using Git
Date: Thu, 8 May 2025 16:06:08 -0400
Message-ID: <029701dbc054$a6e9af00$f4bd0d00$@nexbridge.com>
In-Reply-To: <20250508194731.GA13108@coredump.intra.peff.net>

On May 8, 2025 3:48 PM, Jeff King wrote:
>On Thu, May 08, 2025 at 08:47:47PM +0200, Michal Suchánek wrote:
>
>> If you have one of those filesystems that support deduplication at the
>> filesystem level, you could make each snapshot a full repository
>> with all objects unpacked, and the filesystem would deduplicate the
>> objects for you.
>>
>> The downside is that you have no way to do multiple full backups this
>> way, and you would have to use something else for that (such as those
>> bundles, or plain archiving of the repository as files in a tar archive
>> or such).
>
>This is tempting, but I suspect that storing the objects unpacked will become
>unfeasibly large, because you are missing out on delta compression in the packfiles.
>You can compare the on-disk and uncompressed sizes of objects in a repo like this:
>
>  git cat-file --batch-all-objects --unordered \
>    --batch-check='%(objectsize:disk) %(objectsize)' |
>  perl -alne '
>    $disk += $F[0];
>    $true += $F[1];
>    END {
>      print "$true / $disk = ", int($true / $disk);
>    }
>  '
>
>It's not entirely fair because the "true" size is missing out on zlib compression that
>loose objects would get. But that's at best going to be about 4:1 (and in practice
>worse, since trees are full of sha1 hashes that don't compress very well).
>
>In my copy of linux.git, that yields ~135G versus ~2.4G, for a factor of 56. Even if we
>grant 4:1 compression from zlib, that's still inflating your on-disk repository by a
>factor of 14.
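
The arithmetic above can be reproduced with a minimal sketch like the one below. The function name inflation_factor is hypothetical, the inputs are the approximate figures quoted above (135G uncompressed, 2.4G on disk), and the 4:1 zlib ratio is the optimistic assumption from the quoted text, not a measurement.

```shell
# Hypothetical helper: integer inflation factor of loose storage over packed storage.
#   $1 = total uncompressed object size
#   $2 = packed on-disk size (same unit as $1)
#   $3 = assumed zlib compression ratio for loose objects (1 = none)
inflation_factor() {
    awk -v true_size="$1" -v disk_size="$2" -v zlib="$3" \
        'BEGIN { printf "%d\n", true_size / disk_size / zlib }'
}

inflation_factor 135 2.4 1   # raw ratio, prints 56
inflation_factor 135 2.4 4   # granting 4:1 from zlib, prints 14
```

The result is truncated to an integer, which matches the int() call in the perl one-liner.
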
>
>If you have the patience, you can run:
>
> git cat-file --batch-all-objects --unordered --batch | gzip | wc -c
>
>to get a better sense of what it looks like with the extra deflate (this is cheating a bit,
>because it will find cross-object compression opportunities which would not be
>there in loose object storage, but it should get you in the right ballpark).
>
>You're probably also paying some inode costs with loose objects (1K trees at the
>root of linux.git all pay 4K or whatever as individual loose objects).
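
A rough way to put a number on that per-file overhead is to round every object size up to a whole filesystem block before summing. This is only a sketch: the function name block_rounded_total is hypothetical, the 4096-byte default block size is an assumption about the filesystem, and feeding it uncompressed sizes overstates what zlib-compressed loose objects would occupy.

```shell
# Hypothetical helper: sum sizes read from stdin (one size in bytes per line),
# rounding each one up to a whole block first.
#   $1 = block size in bytes (defaults to 4096, an assumption)
block_rounded_total() {
    awk -v bs="${1:-4096}" '
        { total += int(($1 + bs - 1) / bs) * bs }   # round up to the next block boundary
        END { printf "%.0f\n", total }
    '
}

# Three sample sizes: 1000 and 4096 take one block each, 4097 takes two.
printf '1000\n4096\n4097\n' | block_rounded_total   # prints 16384
```

In a real repository the sizes could come from git cat-file --batch-all-objects --unordered --batch-check='%(objectsize)' piped into the function.
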
>
>So you're probably much better off with some strategy using .keep files. I.e., make a good
>big pack and mark it with .keep, so that it is retained forever.
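
For reference, the .keep approach could be sketched roughly as follows. The function name mark_packs_kept and the paths are hypothetical, and the repack step shown in the comments is one assumed workflow, not necessarily what was meant above.

```shell
# Hypothetical helper: create an empty .keep file next to every pack
# in a repository.
#   $1 = path to the git directory (for example /path/to/repo/.git)
mark_packs_kept() {
    for pack in "$1"/objects/pack/pack-*.pack; do
        [ -e "$pack" ] || continue      # the glob matched nothing
        touch "${pack%.pack}.keep"      # pack-<hash>.pack -> pack-<hash>.keep
    done
}

# Assumed workflow (paths are placeholders):
#   git -C /path/to/repo repack -a -d     # consolidate everything into one big pack
#   mark_packs_kept /path/to/repo/.git    # mark that pack as kept
```

By default, later runs of git repack -a -d leave packs that have a .keep file alone, so a backup tool would only need to pick up the packs created since the kept one.
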

As a possible alternative, would some kind of information presented via the proposed
git blame-tree series (or call it git annotate-tree perhaps) be useful for this enhancement?
I am not sure what the results would look like, but the output might be useful and could
then be cached by the backup strategy. I'm grasping at straws, though.

--Randall