* Incremental Backup of repositories using Git
@ 2025-05-05 14:35 Akash S
2025-05-05 16:18 ` Justin Tobler
` (2 more replies)
0 siblings, 3 replies; 14+ messages in thread
From: Akash S @ 2025-05-05 14:35 UTC (permalink / raw)
To: git@vger.kernel.org; +Cc: Adithya Urugudige, Abhishek Dalmia
Hi,
Currently we are backing up repositories by running "git clone --bare" and saving the result to disk. If we want to restore, we just run "git push --mirror" from the repo that was saved during the backup.
Currently we are running full backups (git clone --bare) every day, which takes a lot of disk space and time.
Are there any ways to back up only the incremental changes of a repository, and somehow reconstruct the whole repository when we want to restore from the incremental backups?
Thanks,
Akash
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Incremental Backup of repositories using Git
2025-05-05 14:35 Incremental Backup of repositories using Git Akash S
@ 2025-05-05 16:18 ` Justin Tobler
2025-05-06 12:44 ` Abhishek Dalmia
2025-05-08 18:47 ` Michal Suchánek
2025-05-09 11:13 ` Michal Suchánek
2 siblings, 1 reply; 14+ messages in thread
From: Justin Tobler @ 2025-05-05 16:18 UTC (permalink / raw)
To: Akash S; +Cc: git@vger.kernel.org, Adithya Urugudige, Abhishek Dalmia
On 25/05/05 02:35PM, Akash S wrote:
> Hi,
>
> Currently we are backing up repositories by running "git clone --bare" and saving the result to disk. If we want to restore, we just run "git push --mirror" from the repo that was saved during the backup.
>
> Currently we are running full backups (git clone --bare) every day, which takes a lot of disk space and time.
>
> Are there any ways to back up only the incremental changes of a repository, and somehow reconstruct the whole repository when we want to restore from the incremental backups?
You could look into using git-bundle(1) to create incremental bundles
using exclusions. Examples:
# Creates a bundle containing the last 10 commits for main.
$ git bundle create inc-backup main~10..main
# Creates incremental bundle based on time for all references.
$ git bundle create inc-backup --all --since=7.days
These bundles can then be "unbundled" into a repository as long as the
repo contains the required prerequisite objects.
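For example, restoring an incremental bundle could look something like
this (bundle path as in the examples above):
# check that the target repo has the bundle's prerequisites
$ git bundle verify inc-backup
# then import its refs and objects
$ git fetch inc-backup 'refs/*:refs/*'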
-Justin
>
> Thanks,
> Akash
>
^ permalink raw reply [flat|nested] 14+ messages in thread
* RE: Incremental Backup of repositories using Git
2025-05-05 16:18 ` Justin Tobler
@ 2025-05-06 12:44 ` Abhishek Dalmia
2025-05-06 20:46 ` Justin Tobler
0 siblings, 1 reply; 14+ messages in thread
From: Abhishek Dalmia @ 2025-05-06 12:44 UTC (permalink / raw)
To: Justin Tobler, Akash S
Cc: git@vger.kernel.org, Adithya Urugudige, Abhishek Dalmia
Hi Justin
(My previous email got blocked due to HTML content)
Thanks for the recommendation. We want to back up all the repo contents, so could you please comment on whether the following steps will back up and restore everything, or whether we might miss some tags/references?
During backup:
- Create the full bundle the first time using: git bundle create <full-bundle-file-path> --all
- Create further incremental bundles using: git bundle create <inc-bundle-file-path> --since="<last-backup-time>" --all
- making sure the time windows leave no gaps
During restore:
- Create the initial repo with: git clone --bare <full-bundle-file-path> - using the full bundle we created earlier
- For restoring further incremental bundle files
- git fetch <inc-bundle-file-path> 'refs/*:refs/*'
- I can't use --all here, that works only with remote repos
Will using 'refs/*:refs/*' restore everything, or is it possible that some git data might be missed?
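In script form, the whole flow we have in mind is roughly this (untested; the paths are placeholders):
# backup: one full bundle, then periodic incrementals
git -C /path/to/repo bundle create /backups/full.bundle --all
git -C /path/to/repo bundle create /backups/inc.bundle --since="<last-backup-time>" --all
# restore: clone from the full bundle, then replay the incrementals
git clone --bare /backups/full.bundle restored.git
git -C restored.git fetch /backups/inc.bundle 'refs/*:refs/*'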
Regards,
Abhishek
-----Original Message-----
From: Justin Tobler <jltobler@gmail.com>
Sent: 05 May 2025 21:49
To: Akash S <akashs@commvault.com>
Cc: git@vger.kernel.org; Adithya Urugudige <aurugudige@commvault.com>; Abhishek Dalmia <adalmia@commvault.com>
Subject: Re: Incremental Backup of repositories using Git
On 25/05/05 02:35PM, Akash S wrote:
> Hi,
>
> Currently we are backing up repositories by running "git clone --bare" and saving the result to disk. If we want to restore, we just run "git push --mirror" from the repo that was saved during the backup.
>
> Currently we are running full backups (git clone --bare) every day, which takes a lot of disk space and time.
>
> Are there any ways to back up only the incremental changes of a repository, and somehow reconstruct the whole repository when we want to restore from the incremental backups?
You could look into using git-bundle(1) to create incremental bundles using exclusions. Examples:
# Creates a bundle containing the last 10 commits for main.
$ git bundle create inc-backup main~10..main
# Creates incremental bundle based on time for all references.
$ git bundle create inc-backup --all --since=7.days
These bundles can then be "unbundled" into a repository as long as the repo contains the required prerequisite objects.
-Justin
>
> Thanks,
> Akash
>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Incremental Backup of repositories using Git
2025-05-06 12:44 ` Abhishek Dalmia
@ 2025-05-06 20:46 ` Justin Tobler
2025-05-08 10:24 ` Abhishek Dalmia
0 siblings, 1 reply; 14+ messages in thread
From: Justin Tobler @ 2025-05-06 20:46 UTC (permalink / raw)
To: Abhishek Dalmia; +Cc: Akash S, git@vger.kernel.org, Adithya Urugudige
On 25/05/06 12:44PM, Abhishek Dalmia wrote:
> Hi Justin
>
> (My previous email got blocked due to HTML content)
>
> Thanks for the recommendation. We want to backup all the repo contents, so could you please comment if the following steps will help us backup and restore everything, or we might miss some tags/references?
>
> During backup:
> - Create the full bundle the first time using: git bundle create <full-bundle-file-path> --all
> - Create further incremental bundles using: git bundle create <inc-bundle-file-path> --since="<last-backup-time>" --all
> - making sure the time windows leave no gaps
Just something to note, it's ok if a bundle contains objects that
already exist in the repository. So some overlap with the previous
backup would be fine.
> During restore:
> - Create the initial repo with: git clone --bare <full-bundle-file-path> - using the full bundle we created earlier
> - For restoring further incremental bundle files
> - git fetch <inc-bundle-file-path> 'refs/*:refs/*'
> - I can't use --all here, that works only with remote repos
This seems reasonable to me. It may be worth validating that the bundles
would apply to a fresh repository. If an incremental bundle depends on
prerequisite objects that are not in the repository, it cannot be
applied. This means that if you have a series of incremental backups,
they all depend on each other, and one missing in the middle could
prevent subsequent bundles from being applied.
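You can check that up front with the verify subcommand, e.g.:
# exits non-zero if the repo lacks the bundle's prerequisites
$ git bundle verify <inc-bundle-file-path>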
> Will using 'refs/*:refs/*' restore everything, or is it possible that some git data might be missed?
That refspec captures all references and mirrors them. All branches and
tags, along with all objects reachable from them, would be fetched.
-Justin
^ permalink raw reply [flat|nested] 14+ messages in thread
* RE: Incremental Backup of repositories using Git
2025-05-06 20:46 ` Justin Tobler
@ 2025-05-08 10:24 ` Abhishek Dalmia
2025-05-08 18:39 ` Jeff King
0 siblings, 1 reply; 14+ messages in thread
From: Abhishek Dalmia @ 2025-05-08 10:24 UTC (permalink / raw)
To: Justin Tobler
Cc: Akash S, git@vger.kernel.org, Adithya Urugudige, Abhishek Dalmia
Hi Justin,
I ran into an edge case while testing incremental backups with git bundle. If a commit is created with a timestamp earlier than the latest full or incremental backup, it can be excluded from the next bundle due to the --since parameter, even if there is a buffer.
Given this, do you think git bundle is still the most reliable approach for incremental backups, or is there a better alternative worth exploring?
Regards,
Abhishek
-----Original Message-----
From: Justin Tobler <jltobler@gmail.com>
Sent: 07 May 2025 02:17
To: Abhishek Dalmia <adalmia@commvault.com>
Cc: Akash S <akashs@commvault.com>; git@vger.kernel.org; Adithya Urugudige <aurugudige@commvault.com>
Subject: Re: Incremental Backup of repositories using Git
On 25/05/06 12:44PM, Abhishek Dalmia wrote:
> Hi Justin
>
> (My previous email got blocked due to HTML content)
>
> Thanks for the recommendation. We want to backup all the repo contents, so could you please comment if the following steps will help us backup and restore everything, or we might miss some tags/references?
>
> During backup:
> - Create the full bundle the first time using: git bundle create
> <full-bundle-file-path> --all
> - Create further incremental bundles using: git bundle create <inc-bundle-file-path> --since="<last-backup-time>" --all
> - making sure the time windows leave no gaps
Just something to note, it's ok if a bundle contains objects that already exist in the repository. So some overlap with the previous backup would be fine.
> During restore:
> - Create the initial repo with: git clone --bare
> <full-bundle-file-path> - using the full bundle we created earlier
> - For restoring further incremental bundle files
> - git fetch <inc-bundle-file-path> 'refs/*:refs/*'
> - I can't use --all here, that works only with remote repos
This seems reasonable to me. It may be worth validating that the bundles would apply to a fresh repository. If an incremental bundle depends on prerequisite objects that are not in the repository, it cannot be applied. This means that if you have a series of incremental backups, they all depend on each other, and one missing in the middle could prevent subsequent bundles from being applied.
> Will using 'refs/*:refs/*' restore everything, or is it possible that some git data might be missed?
That refspec captures all references and mirrors them. All branches and tags, along with all objects reachable from them, would be fetched.
-Justin
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Incremental Backup of repositories using Git
2025-05-08 10:24 ` Abhishek Dalmia
@ 2025-05-08 18:39 ` Jeff King
2025-05-27 22:21 ` Abhishek Dalmia
0 siblings, 1 reply; 14+ messages in thread
From: Jeff King @ 2025-05-08 18:39 UTC (permalink / raw)
To: Abhishek Dalmia
Cc: Justin Tobler, Akash S, git@vger.kernel.org, Adithya Urugudige
On Thu, May 08, 2025 at 10:24:55AM +0000, Abhishek Dalmia wrote:
> I ran into an edge case while testing incremental backups with git
> bundle. If a commit is created with a timestamp earlier than the
> latest full or incremental backup, it can be excluded from the next
> bundle due to the --since parameter even if there is a buffer.
Yeah, I don't think you want to use "--since" here, since it is about
commit timestamps. You care about the state of the refs at a particular
time. Or more accurately, you care that you have captured a particular
ref state previously.
So ideally you'd snapshot that state in an atomic way, feed it as the
"current" state when doing a bundle, and then save it for later. You can
easily create such a snapshot with for-each-ref, but I don't think
git-bundle has a way to provide the exact set of ref tips and their
values (it just takes rev-list arguments, and wants to resolve the refs
themselves).
You could probably get away with just creating a bundle with the current
state, and then pulling the snapshot values from the created bundle.
Something like this:
# for initial backup
if ! test -e last-bundle-snapshot; then
        >last-bundle-snapshot
fi

# mark everything from last as seen, so we do not include it,
# along with --all (or your choice of refs) to pick up everything
# we have currently
sed -e 's/^/^/' <last-bundle-snapshot |
        git bundle create out.bundle --all --stdin

# and now save that ref state for next time; this is inherently
# peeking at the bundle format.
sed -ne '
        # quit when we see end of header
        /^$/q;
        # drop comments and old negatives; copy only first word (the oid)
        s/^\([^-#][^ ]*\).*/\1/p;
' <out.bundle >last-bundle-snapshot
Or alternatively, instead of using git-bundle at all, you could just
store a collection of ref snapshots (from "for-each-ref") and thin packs
(from "pack-objects --thin --stdout", fed from the old snapshot and the
new). Which is really all that bundles are anyway.
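An untested sketch of that variant (the snapshot and pack file names
are just illustrative):

# take a snapshot of the current ref state
git for-each-ref --format='%(objectname) %(refname)' >new-snapshot

# pack everything reachable from the new tips but not from the old
# ones; "old-snapshot" is the file saved by the previous run
{
        cut -d' ' -f1 old-snapshot | sed 's/^/^/'
        cut -d' ' -f1 new-snapshot
} | git pack-objects --revs --thin --stdout >inc.pack

# restore side: complete the thin pack in the restored repo, then
# recreate the refs from the snapshot
git index-pack --fix-thin --stdin <inc.pack
while read oid ref; do
        git update-ref "$ref" "$oid"
done <new-snapshot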
-Peff
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Incremental Backup of repositories using Git
2025-05-05 14:35 Incremental Backup of repositories using Git Akash S
2025-05-05 16:18 ` Justin Tobler
@ 2025-05-08 18:47 ` Michal Suchánek
2025-05-08 19:47 ` Jeff King
2025-05-09 11:13 ` Michal Suchánek
2 siblings, 1 reply; 14+ messages in thread
From: Michal Suchánek @ 2025-05-08 18:47 UTC (permalink / raw)
To: Akash S; +Cc: git@vger.kernel.org, Adithya Urugudige, Abhishek Dalmia
Hello,
On Mon, May 05, 2025 at 02:35:43PM +0000, Akash S wrote:
> Hi,
>
> Currently we are backing up repositories by running "git clone --bare" and saving the result to disk. If we want to restore, we just run "git push --mirror" from the repo that was saved during the backup.
>
> Currently we are running full backups (git clone --bare) every day, which takes a lot of disk space and time.
>
> Are there any ways to back up only the incremental changes of a repository, and somehow reconstruct the whole repository when we want to restore from the incremental backups?
If you have one of those filesystems that support deduplication at the
filesystem level, you could make each snapshot a full repository with
all objects unpacked, and the filesystem would deduplicate the objects
for you.
The downside is that you have no way to do multiple full backups this
way, and you would have to use something else for that (such as those
bundles, or plain archiving the repository as files in a tar archive or
such).
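An untested sketch of exploding a snapshot into loose objects (paths
are placeholders):
# clone, then move the packs aside before unpacking them, since
# unpack-objects skips objects that already exist in the repo
git clone --bare /path/to/repo.git snapshot.git
cd snapshot.git
mkdir /tmp/packs
mv objects/pack/pack-* /tmp/packs/
for p in /tmp/packs/*.pack; do
        git unpack-objects <"$p"
done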
Thanks
Michal
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Incremental Backup of repositories using Git
2025-05-08 18:47 ` Michal Suchánek
@ 2025-05-08 19:47 ` Jeff King
2025-05-08 20:06 ` rsbecker
2025-05-09 9:08 ` Michal Suchánek
0 siblings, 2 replies; 14+ messages in thread
From: Jeff King @ 2025-05-08 19:47 UTC (permalink / raw)
To: Michal Suchánek
Cc: Akash S, git@vger.kernel.org, Adithya Urugudige, Abhishek Dalmia
On Thu, May 08, 2025 at 08:47:47PM +0200, Michal Suchánek wrote:
> If you have one of those filesystems that support deduplication at the
> filesystem level, you could make each snapshot a full repository with
> all objects unpacked, and the filesystem would deduplicate the objects
> for you.
>
> The downside is that you have no way to do multiple full backups this
> way, and you would have to use something else for that (such as those
> bundles, or plain archiving the repository as files in a tar archive or
> such).
This is tempting, but I suspect that storing the objects unpacked will
become unfeasibly large, because you are missing out on delta
compression in the packfiles. You can compare the on-disk and
uncompressed sizes of objects in a repo like this:
git cat-file --batch-all-objects --unordered \
        --batch-check='%(objectsize:disk) %(objectsize)' |
perl -alne '
        $disk += $F[0];
        $true += $F[1];
        END {
                print "$true / $disk = ", int($true / $disk);
        }
'
It's not entirely fair because the "true" size is missing out on zlib
compression that loose objects would get. But that's at best going to be
about 4:1 (and in practice worse, since trees are full of sha1 hashes
that don't compress very well).
In my copy of linux.git, that yields ~135G versus ~2.4G, for a factor of
56. Even if we grant 4:1 compression from zlib, that's still inflating
your on-disk repository by a factor of 14.
If you have the patience, you can run:
git cat-file --batch-all-objects --unordered --batch | gzip | wc -c
to get a better sense of what it looks like with the extra deflate (this
is cheating a bit, because it will find cross-object compression
opportunities which would not be there in loose objects storage, but
should get you in the right ballpark).
You're probably also paying some inode costs with loose objects (1K
trees at the root of linux.git all pay 4K or whatever as individual
loose objects).
So you're probably much better off with some strategy involving .keep
files. I.e., make a good big pack and mark it with .keep, so that it is
retained forever.
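For example (the repo path is illustrative):

# repack everything into one big pack, then mark it precious so
# later repacks leave it alone
git -C repo.git repack -ad
for p in repo.git/objects/pack/pack-*.pack; do
        touch "${p%.pack}.keep"
done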
-Peff
^ permalink raw reply [flat|nested] 14+ messages in thread
* RE: Incremental Backup of repositories using Git
2025-05-08 19:47 ` Jeff King
@ 2025-05-08 20:06 ` rsbecker
2025-05-08 20:20 ` Jeff King
2025-05-09 9:08 ` Michal Suchánek
1 sibling, 1 reply; 14+ messages in thread
From: rsbecker @ 2025-05-08 20:06 UTC (permalink / raw)
To: 'Jeff King', 'Michal Suchánek'
Cc: 'Akash S', git, 'Adithya Urugudige',
'Abhishek Dalmia'
On May 8, 2025 3:48 PM, Jeff King wrote:
>On Thu, May 08, 2025 at 08:47:47PM +0200, Michal Suchánek wrote:
>
>> If you have one of those filesystems that support deduplication at the
>> filesystem level, you could make each snapshot a full repository
>> with all objects unpacked, and the filesystem would deduplicate the
>> objects for you.
>>
>> The downside is that you have no way to do multiple full backups this
>> way, and you would have to use something else for that (such as those
>> bundles, or plain archiving the repository as files in a tar archive
>> or such).
>
>This is tempting, but I suspect that storing the objects unpacked will become
>unfeasibly large, because you are missing out on delta compression in the packfiles.
>You can compare the on-disk and uncompressed sizes of objects in a repo like this:
>
> git cat-file --batch-all-objects --unordered \
>         --batch-check='%(objectsize:disk) %(objectsize)' |
> perl -alne '
>         $disk += $F[0];
>         $true += $F[1];
>         END {
>                 print "$true / $disk = ", int($true / $disk);
>         }
> '
>
>It's not entirely fair because the "true" size is missing out on zlib compression that
>loose objects would get. But that's at best going to be about 4:1 (and in practice
>worse, since trees are full of sha1 hashes that don't compress very well).
>
>In my copy of linux.git, that yields ~135G versus ~2.4G, for a factor of 56. Even if we
>grant 4:1 compression from zlib, that's still inflating your on-disk repository by a
>factor of 14.
>
>If you have the patience, you can run:
>
> git cat-file --batch-all-objects --unordered --batch | gzip | wc -c
>
>to get a better sense of what it looks like with the extra deflate (this is cheating a bit,
>because it will find cross-object compression opportunities which would not be
>there in loose objects storage, but should get you in the right ballpark).
>
>You're probably also paying some inode costs with loose objects (1K trees at the
>root of linux.git all pay 4K or whatever as individual loose objects).
>
>So you're probably much better off with some strategy involving .keep files. I.e., make a good
>big pack and mark it with .keep, so that it is retained forever.
As a possible alternative, would some kind of information presented via the proposed
git blame-tree series (or call it git annotate-tree perhaps) be useful for this enhancement?
I am not sure what the results would look like, but they might be useful and could then be
cached by the backup strategy. I'm grasping at straws, though.
--Randall
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Incremental Backup of repositories using Git
2025-05-08 20:06 ` rsbecker
@ 2025-05-08 20:20 ` Jeff King
0 siblings, 0 replies; 14+ messages in thread
From: Jeff King @ 2025-05-08 20:20 UTC (permalink / raw)
To: rsbecker
Cc: 'Michal Suchánek', 'Akash S', git,
'Adithya Urugudige', 'Abhishek Dalmia'
On Thu, May 08, 2025 at 04:06:08PM -0400, rsbecker@nexbridge.com wrote:
> As a possible alternative, would some kind of information presented via the proposed
> git blame-tree series (or call it git annotate-tree perhaps) be useful for this enhancement?
> I am not sure what the results would look like, but they might be useful and could then be
> cached by the backup strategy. I'm grasping at straws, though.
I don't think so. From an efficiency perspective, your best git-aware
backup really is going to be packfiles representing slices of history,
depending on each other. I.e., bundles or something approximating them.
-Peff
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Incremental Backup of repositories using Git
2025-05-08 19:47 ` Jeff King
2025-05-08 20:06 ` rsbecker
@ 2025-05-09 9:08 ` Michal Suchánek
1 sibling, 0 replies; 14+ messages in thread
From: Michal Suchánek @ 2025-05-09 9:08 UTC (permalink / raw)
To: Jeff King
Cc: Akash S, git@vger.kernel.org, Adithya Urugudige, Abhishek Dalmia
On Thu, May 08, 2025 at 03:47:31PM -0400, Jeff King wrote:
> On Thu, May 08, 2025 at 08:47:47PM +0200, Michal Suchánek wrote:
>
> > If you have one of those filesystems that support deduplication at the
> > filesystem level, you could make each snapshot a full repository with
> > all objects unpacked, and the filesystem would deduplicate the objects
> > for you.
> >
> > The downside is that you have no way to do multiple full backups this
> > way, and you would have to use something else for that (such as those
> > bundles, or plain archiving the repository as files in a tar archive or
> > such).
>
> This is tempting, but I suspect that storing the objects unpacked will
> become unfeasibly large, because you are missing out on delta
> compression in the packfiles. You can compare the on-disk and
> uncompressed sizes of objects in a repo like this:
>
> git cat-file --batch-all-objects --unordered \
>         --batch-check='%(objectsize:disk) %(objectsize)' |
> perl -alne '
>         $disk += $F[0];
>         $true += $F[1];
>         END {
>                 print "$true / $disk = ", int($true / $disk);
>         }
> '
>
> It's not entirely fair because the "true" size is missing out on zlib
> compression that loose objects would get. But that's at best going to be
> about 4:1 (and in practice worse, since trees are full of sha1 hashes
> that don't compress very well).
>
> In my copy of linux.git, that yields ~135G versus ~2.4G, for a factor of
> 56. Even if we grant 4:1 compression from zlib, that's still inflating
> your on-disk repository by a factor of 14.
So with this estimate you recoup the size inflation after 14
incremental backups.
Since no other working incremental backup strategy has been proposed so
far, this is the best one ;-)
Thanks
Michal
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Incremental Backup of repositories using Git
2025-05-05 14:35 Incremental Backup of repositories using Git Akash S
2025-05-05 16:18 ` Justin Tobler
2025-05-08 18:47 ` Michal Suchánek
@ 2025-05-09 11:13 ` Michal Suchánek
2025-05-09 11:22 ` Michal Suchánek
2 siblings, 1 reply; 14+ messages in thread
From: Michal Suchánek @ 2025-05-09 11:13 UTC (permalink / raw)
To: Akash S; +Cc: git@vger.kernel.org, Adithya Urugudige, Abhishek Dalmia
On Mon, May 05, 2025 at 02:35:43PM +0000, Akash S wrote:
> Hi,
>
> Currently we are backing up repositories by running "git clone --bare" and saving the result to disk. If we want to restore, we just run "git push --mirror" from the repo that was saved during the backup.
>
> Currently we are running full backups (git clone --bare) every day, which takes a lot of disk space and time.
>
> Are there any ways to back up only the incremental changes of a repository, and somehow reconstruct the whole repository when we want to restore from the incremental backups?
Hello,
first, to make it easier to update the backup, the clone should be done
with --bare --mirror.
If your clone ends up having multiple packs and loose objects you
can reduce its size with
git --git-dir=/path/to/clone repack -adk
This should give you a repository with a single pack and no loose
objects.
The -k (or --cruft) option is required, using only -ad seems to corrupt
repositories quite reliably.
To speed up the clone next time around you can make a copy of the
previous backup and fetch from the remote repository, but because there
is no safe way I am aware of to eliminate no-longer-referenced objects,
you will accumulate cruft this way.
This is now a complete backup, and should be made readonly to not get
corrupted with further operations.
The incremental backups are somewhat speculative; I have not tested
this at all.
You can create a shared clone of the full backup, update the origin URL
of the shared clone to the remote repository to back up, and do a fetch
-p (which now should do the right thing because the initial clone was
set up as a mirror).
To repack you need to use the --local option in addition.
With this you should have a valid repository for each backup with the
incremental backups sharing most objects with the full backup.
These can be inspected with git commands, exported over gitweb, or
whatever.
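Putting that together, roughly (speculative and untested; the URL and
paths are placeholders):

# full backup: mirror clone, packed into a single pack
git clone --bare --mirror https://example.com/repo.git full.git
git --git-dir=full.git repack -adk

# incremental backup: shared clone borrowing objects from the full one
git clone --bare --mirror --shared full.git inc.git
git --git-dir=inc.git remote set-url origin https://example.com/repo.git
git --git-dir=inc.git fetch -p origin
git --git-dir=inc.git repack -adk --local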
Thanks
Michal
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Incremental Backup of repositories using Git
2025-05-09 11:13 ` Michal Suchánek
@ 2025-05-09 11:22 ` Michal Suchánek
0 siblings, 0 replies; 14+ messages in thread
From: Michal Suchánek @ 2025-05-09 11:22 UTC (permalink / raw)
To: Akash S; +Cc: git@vger.kernel.org, Adithya Urugudige, Abhishek Dalmia
On Fri, May 09, 2025 at 01:13:36PM +0200, Michal Suchánek wrote:
> On Mon, May 05, 2025 at 02:35:43PM +0000, Akash S wrote:
> > Hi,
> >
> > Currently we are backing up repositories by running "git clone --bare" and saving the result to disk. If we want to restore, we just run "git push --mirror" from the repo that was saved during the backup.
> >
> > Currently we are running full backups (git clone --bare) every day, which takes a lot of disk space and time.
> >
> > Are there any ways to back up only the incremental changes of a repository, and somehow reconstruct the whole repository when we want to restore from the incremental backups?
>
> Hello,
>
> first, to make it easier to update the backup the clone should be done
> with --bare --mirror.
>
> If your clone ends up having multiple packs and loose objects you
> can reduce its size with
>
> git --git-dir=/path/to/clone repack -adk
>
> This should give you a repository with a single pack and no loose
> objects.
>
> The -k (i.e. --keep-unreachable) or --cruft option is required; using
> only -ad seems to corrupt repositories quite reliably.
>
> To speed up the clone next time around you can make a copy of the
> previous backup and fetch from the remote repository but because
> there is no safe way I am aware of to eliminate no longer referenced
> objects you will accumulate cruft this way.
>
> This is now a complete backup, and should be made readonly to not get
> corrupted with further operations.
>
> The incremental backups are somewhat speculative; I have not tested
> this at all.
>
> You can create a shared clone of the full backup, update the origin URL
Also pass the --mirror option here; it is not carried over to the new
clone.
> of the shared clone to the remote repository to back up, and do a fetch
> -p (which now should do the right thing because the initial clone was
> set up as a mirror).
>
> To repack you need to use the --local option in addition.
>
> With this you should have a valid repository for each backup with the
> incremental backups sharing most objects with the full backup.
>
> These can be inspected with git commands, exported over gitweb, or
> whatever.
>
> Thanks
>
> Michal
>
^ permalink raw reply [flat|nested] 14+ messages in thread
* RE: Incremental Backup of repositories using Git
2025-05-08 18:39 ` Jeff King
@ 2025-05-27 22:21 ` Abhishek Dalmia
0 siblings, 0 replies; 14+ messages in thread
From: Abhishek Dalmia @ 2025-05-27 22:21 UTC (permalink / raw)
To: Jeff King
Cc: Justin Tobler, Akash S, git@vger.kernel.org, Adithya Urugudige,
Abhishek Dalmia
Hi Justin/Jeff
(prev. email got rejected due to HTML content)
I researched this further: if we keep the previous state of the repo, we can use git fetch --all and take storage-level incremental backups of the changed objects under .git/objects (the .pack files, by preventing auto gc). But it is not feasible for us to keep the repo clone around between incremental backups.
I also looked into git fetch-pack in a git init --bare repo, which might have helped here, but it is not working as expected:
1. It doesn't work with https ->
$ git fetch-pack --thin --shallow-exclude=28307688f7344018cad46c310826a82041b39b8d https://github.com/elastic/elasticsearch refs/heads/main
fatal: protocol 'https' is not supported
2. With ssh it says fatal: the remote end hung up unexpectedly ->
$ git fetch-pack --thin --shallow-exclude=28307688f7344018cad46c310826a82041b39b8d git@github.com:elastic/elasticsearch.git refs/heads/main
fatal: the remote end hung up unexpectedly
Is what I require here (fetching new objects without having the previous objects present locally) technically possible with the git CLI or the libgit2 library?
We can keep some metadata telling us which commits were backed up for each ref in the previous backup, if that helps.
As an alternative I tried API requests to download commit blobs, but that hits rate limits too often and is far slower than the git protocol.
-----Original Message-----
From: Jeff King <peff@peff.net>
Sent: 09 May 2025 00:09
To: Abhishek Dalmia <adalmia@commvault.com>
Cc: Justin Tobler <jltobler@gmail.com>; Akash S <akashs@commvault.com>; git@vger.kernel.org; Adithya Urugudige <aurugudige@commvault.com>
Subject: Re: Incremental Backup of repositories using Git
On Thu, May 08, 2025 at 10:24:55AM +0000, Abhishek Dalmia wrote:
> I ran into an edge case while testing incremental backups with git
> bundle. If a commit is created with a timestamp earlier than the
> latest full or incremental backup, it can be excluded from the next
> bundle due to the --since parameter even if there is a buffer.
Yeah, I don't think you want to use "--since" here, since it is about commit timestamps. You care about the state of the refs at a particular time. Or more accurately, you care that you have captured a particular ref state previously.
So ideally you'd snapshot that state in an atomic way, feed it as the "current" state when doing a bundle, and then save it for later. You can easily create such a snapshot with for-each-ref, but I don't think git-bundle has a way to provide the exact set of ref tips and their values (it just takes rev-list arguments, and wants to resolve the refs themselves).
You could probably get away with just creating a bundle with the current state, and then pulling the snapshot values from the created bundle.
Something like this:
# for initial backup
if ! test -e last-bundle-snapshot; then
        >last-bundle-snapshot
fi

# mark everything from last as seen, so we do not include it,
# along with --all (or your choice of refs) to pick up everything
# we have currently
sed -e 's/^/^/' <last-bundle-snapshot |
        git bundle create out.bundle --all --stdin

# and now save that ref state for next time; this is inherently
# peeking at the bundle format.
sed -ne '
        # quit when we see end of header
        /^$/q;
        # drop comments and old negatives; copy only first word (the oid)
        s/^\([^-#][^ ]*\).*/\1/p;
' <out.bundle >last-bundle-snapshot
Or alternatively, instead of using git-bundle at all, you could just store a collection of ref snapshots (from "for-each-ref") and thin packs (from "pack-objects --thin --stdout", fed from the old snapshot and the new). Which is really all that bundles are anyway.
-Peff
^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads: [~2025-05-27 22:21 UTC | newest]
Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-05-05 14:35 Incremental Backup of repositories using Git Akash S
2025-05-05 16:18 ` Justin Tobler
2025-05-06 12:44 ` Abhishek Dalmia
2025-05-06 20:46 ` Justin Tobler
2025-05-08 10:24 ` Abhishek Dalmia
2025-05-08 18:39 ` Jeff King
2025-05-27 22:21 ` Abhishek Dalmia
2025-05-08 18:47 ` Michal Suchánek
2025-05-08 19:47 ` Jeff King
2025-05-08 20:06 ` rsbecker
2025-05-08 20:20 ` Jeff King
2025-05-09 9:08 ` Michal Suchánek
2025-05-09 11:13 ` Michal Suchánek
2025-05-09 11:22 ` Michal Suchánek
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).