git.vger.kernel.org archive mirror
* Verifying data integrity of two git repositories
@ 2025-04-22  7:19 Akash S
  2025-04-22 19:57 ` Johannes Sixt
  0 siblings, 1 reply; 2+ messages in thread
From: Akash S @ 2025-04-22  7:19 UTC
  To: git@vger.kernel.org

I have a bare repository of size 5.7 GB in my local disk.

I need to push this to Azure DevOps. I usually do it with the command "git push --mirror" but unfortunately, Azure DevOps has a single push size limit of 5GB.

So I have to push repos larger than 5GB in chunks.

I used this Stack Overflow answer (https://stackoverflow.com/questions/79167276/splitt-git-push-to-azure-devops) as my basis and created a script to push each branch in batches of commits.

I pushed my repository in batches to, let's say, remote repo "A".

I did a "git clone --bare" from remote repo A to my local disk. I checked the size of this bare clone, and it is only about 5 GB.

	i) I counted the number of objects using the command "git rev-list --objects --all | wc -l" in both repos; the counts are the same.

	ii) There is only one branch, master, in both repos, and the last commit ID of master matches in both (I read an article saying that data integrity can also be checked this way, since git works like a blockchain)

	iii) git fsck --full in both repos,  both gave the same output: 

		Checking object directories: 100% (256/256), done.
		Checking objects: 100% (10793794/10793794), done.
		Checking connectivity: 10793794, done.

		But original repo on disk had this extra line in the end (which the remote bare on disk did not display)

		Verifying commits in commit graph: 100% (1351940/1351940), done.
	
	iv) I created a bundle of the original repo on disk using the command "git bundle create repo.bundle --all", and then in the remote cloned repo on disk I ran "git bundle verify ../repo.bundle". Output:

		The bundle contains these 883 refs:
		<All Refs>
		The bundle records a complete history.
		The bundle uses this hash algorithm: sha1
		/home/repo.bundle is okay

	v) I checked the repo size using the command "git count-objects -vH"; the size-pack differs (the original repo says 5.62 GB and the remote cloned repo on disk says 4.93 GB)
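
As a stricter variant of check i), the object IDs themselves can be compared rather than only their count. This is a minimal sketch, not a definitive procedure; the function names list_objects and same_objects and the paths in the usage line are made up for illustration. It assumes both repositories expose the same set of refs, because "--all" walks every ref.

```shell
# Hypothetical helper: print every object ID reachable from any ref, sorted.
list_objects() {
    git -C "$1" rev-list --objects --all | cut -d' ' -f1 | sort -u
}

# Succeeds only when both repositories reach exactly the same object IDs.
same_objects() {
    a_list=$(mktemp) || return 2
    b_list=$(mktemp) || { rm -f "$a_list"; return 2; }
    list_objects "$1" >"$a_list"
    list_objects "$2" >"$b_list"
    if cmp -s "$a_list" "$b_list"; then rc=0; else rc=1; fi
    rm -f "$a_list" "$b_list"
    return "$rc"
}
```

Usage might look like: same_objects /path/to/original.git /path/to/clone.git && echo "object sets match"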

Note: My repository has no lfs/objects directory, so there are no LFS objects to begin with. LFS can therefore be ruled out as the cause.

Why is there a change in size? Also how do I validate if two repos are the same or not?

Script being used to push in batches of commits:

#!/bin/bash
set -e

# === CONFIGURATION ===
RepositoryFolderPathForBareCloneBAK="/root/linux"
BackupRepositoryHttpsURL="<REMOTE_URL>"
remoteName="origin"
maxPushSizeInMB=$((4 * 1024)) # 4GB
splitPushCommitsCount=35000
splitPush=false

ALocation=$(pwd)

if [ ! -d "$RepositoryFolderPathForBareCloneBAK" ]; then
    echo "Error: Bare clone folder not found at $RepositoryFolderPathForBareCloneBAK"
    exit 1
fi

cd "$RepositoryFolderPathForBareCloneBAK"
git config http.postBuffer 524288000

doSplitPush=$splitPush

# Check repo size and decide whether to split push
if [ "$doSplitPush" = false ]; then
    echo "Checking repository size..."
    repositorySize=0
    while read -r line; do
        echo "$line"
        if [[ "$line" =~ ^size-pack:\ ([0-9]+(\.[0-9]+)?)\ ([A-Za-z]+) ]]; then
            value=${BASH_REMATCH[1]}
            unit=${BASH_REMATCH[3]}
            case "$unit" in
                bytes) repositorySize=$(echo "$value / 1024 / 1024" | bc) ;;
                KiB)   repositorySize=$(echo "$value / 1024" | bc) ;;
                MiB)   repositorySize=$(echo "$value" | bc) ;;
                GiB)   repositorySize=$(echo "$value * 1024" | bc) ;;
                *)     repositorySize=$(echo "$value" | bc) ;;
            esac
        fi
    done < <(git count-objects -vH)

    # Round down to integer
    repositorySize=${repositorySize%.*}

    echo "Repo size: $repositorySize MiB"

    if [ "$repositorySize" -ge "$maxPushSizeInMB" ]; then
        doSplitPush=true
    fi
fi

# Unset mirror config to allow partial pushes if needed
if git config --get "remote.$remoteName.mirror" >/dev/null; then
    git config --unset "remote.$remoteName.mirror"
fi

# Setup remote
NewREMOTE="push_remote"
if git remote | grep -qx "$NewREMOTE"; then
    git remote remove "$NewREMOTE"
fi
git remote add "$NewREMOTE" "$BackupRepositoryHttpsURL"

if [ "$doSplitPush" = false ]; then
    echo "Performing full push to $BackupRepositoryHttpsURL"
    git push "$NewREMOTE" --mirror
else
    echo "Performing split push to $BackupRepositoryHttpsURL"

    git for-each-ref --format="%(refname)" --sort='authordate' | while read -r ref; do
        if [[ "$ref" == refs/heads/* ]]; then
            BRANCH="${ref#refs/heads/}"
            echo "Processing branch: $BRANCH"

            git symbolic-ref HEAD "$ref"

            if git show-ref --quiet --verify "refs/remotes/$NewREMOTE/$BRANCH"; then
                range="$NewREMOTE/$BRANCH..HEAD"
            else
                range="HEAD"
            fi

            n=$(git log --first-parent --format="format:x" $range | wc -l)
            echo "$n commits to push"

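            # Estimate commits per batch from the measured size. Keep the configured
            # count when the size was never measured (splitPush=true skips the check),
            # and never let the batch size reach 0, which would divide by zero below.
            if [ "${repositorySize:-0}" -gt 0 ]; then
                splitPushCommitsCount=$(( (maxPushSizeInMB * n) / repositorySize ))
            fi
            [ "$splitPushCommitsCount" -gt 20000 ] && splitPushCommitsCount=20000
            [ "$splitPushCommitsCount" -lt 1 ] && splitPushCommitsCount=1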

            echo "Calculated splitPushCommitsCount: $splitPushCommitsCount"

            if [ "$n" -gt 0 ]; then
                loopCount=$((n / splitPushCommitsCount))
                for ((i=1; i<=loopCount; i++)); do
                    h=$(git log --first-parent --reverse --format=format:%H --skip $((n - (i * splitPushCommitsCount))) -n1)
                    echo "Batch commit: $h"
                    git push "$NewREMOTE" --force "$h:refs/heads/$BRANCH"
                    echo "sleeping for 5 minutes"
                    sleep 300
                done
                echo "Final push: HEAD:refs/heads/$BRANCH"
                git push "$NewREMOTE" --force "HEAD:refs/heads/$BRANCH"
            else
                echo "No commits to push for $BRANCH"
            fi
        fi
    done

    echo "Pushing tags"
    git push "$NewREMOTE" --force 'refs/tags/*'

    echo "Pushing replace refs (if any)"
    git push "$NewREMOTE" --force 'refs/replace/*'
fi

# === LFS Push ===
echo "Pushing Git LFS objects..."
Get_LFS_Objects() {
    lfs_objects_dir="$1/lfs/objects"
    if [ -d "$lfs_objects_dir" ]; then
        lfs_objects=$(find "$lfs_objects_dir" -type f -printf "%f ")
        if [ -z "$lfs_objects" ]; then
            lfs_objects="NO_OBJECTS"
        fi
    else
        lfs_objects="NO_OBJECTS"
    fi
}
Get_LFS_Objects "$RepositoryFolderPathForBareCloneBAK"
if [[ "$lfs_objects" != "NO_OBJECTS" ]]; then
    LFS_SPECIFIER="--object-id $lfs_objects"
    echo "Running lfs"

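    # With "set -e" a failing push would abort before $? is read, so capture it inline.
    retCode=0
    git lfs push "$NewREMOTE" $LFS_SPECIFIER || retCode=$?
    echo "LFS push exited with code: $retCode"
    [ "$retCode" -eq 0 ] || exit "$retCode"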
else
    echo "No LFS objects to push."
fi

cd "$ALocation"
echo "All done! Git and LFS data pushed successfully."





* Re: Verifying data integrity of two git repositories
  2025-04-22  7:19 Verifying data integrity of two git repositories Akash S
@ 2025-04-22 19:57 ` Johannes Sixt
  0 siblings, 0 replies; 2+ messages in thread
From: Johannes Sixt @ 2025-04-22 19:57 UTC
  To: Akash S; +Cc: git@vger.kernel.org

On 22.04.25 at 09:19, Akash S wrote:
> ii) There is only 1 branch master in both repos and the last commit
> id of both master branches are matching (read an article that data
> integrity can be checked like this also since git also works like
> Blockchain)
> 
> 	iii) git fsck --full in both repos,  both gave the same output: 
> 
> 		Checking object directories: 100% (256/256), done.
> 		Checking objects: 100% (10793794/10793794), done.
> 		Checking connectivity: 10793794, done.

The facts that no errors were reported and that the commit ids are
identical are sufficient evidence that both repositories are identical.
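
One way to script that check is sketched below. It is an illustration under assumptions, not an official procedure: the names list_refs and repos_match are invented here, and only branches and tags are compared. The fsck output is silenced for brevity; drop the redirection to see any reported errors.

```shell
# Hypothetical helper: print "<object id> <ref name>" for every branch and tag.
list_refs() {
    git -C "$1" for-each-ref --format='%(objectname) %(refname)' refs/heads refs/tags
}

# Succeeds when both repositories pass fsck and every branch and tag
# points at the same object ID in both.
repos_match() {
    git -C "$1" fsck --full --no-progress >/dev/null 2>&1 || return 1
    git -C "$2" fsck --full --no-progress >/dev/null 2>&1 || return 1
    [ "$(list_refs "$1")" = "$(list_refs "$2")" ]
}
```

Usage might look like: repos_match /path/to/original.git /path/to/clone.git && echo "repositories agree"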

> But original repo on disk had this extra line in the end (which the
> remote bare on disk did not display)
> 
> 		Verifying commits in commit graph: 100% (1351940/1351940), done.

A commit graph is an optional data structure. Its absence doesn't
invalidate the repository.
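
If the extra fsck line is wanted in the clone as well, a commit graph can be generated there. The sketch below assumes the default on-disk locations (objects/info/commit-graph, or the objects/info/commit-graphs directory for a split chain); the function names has_commit_graph and write_commit_graph are made up for illustration.

```shell
# Hypothetical helper: succeeds if the repository has a commit-graph file
# (single file or split chain) in its object directory.
has_commit_graph() {
    objdir=$(git -C "$1" rev-parse --git-path objects) || return 2
    case "$objdir" in
        /*) ;;                       # already absolute
        *)  objdir="$1/$objdir" ;;   # relative to the repository directory
    esac
    [ -f "$objdir/info/commit-graph" ] || [ -d "$objdir/info/commit-graphs" ]
}

# Write a commit graph covering all reachable commits, then verify it.
write_commit_graph() {
    git -C "$1" commit-graph write --reachable &&
    git -C "$1" commit-graph verify
}
```

Usage might look like: has_commit_graph /path/to/clone.git || write_commit_graph /path/to/clone.git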

> Why is there a change in size? Also how do I validate if two repos
> are the same or not?
Most likely, the two repositories have been packed in different manners.
This has no bearing on the validity of the repository at all as long as
`git fsck --full` reports no error. The different sizes should not worry
you.
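
To see how much of the difference comes from packing, both repositories can be repacked with the same options and the pack sizes read again. This is only a sketch: the function names are invented, the sizes are not guaranteed to come out byte-identical even then, and a full repack of a multi-gigabyte repository takes a while. Since "repack -a -d" also discards unreachable objects held in old packs, it may be safer to try this on a copy.

```shell
# Hypothetical helper: print the total pack size in KiB, as reported by
# "git count-objects -v" (the size-pack field).
pack_size_kib() {
    git -C "$1" count-objects -v | awk '$1 == "size-pack:" { print $2 }'
}

# Repack everything reachable into a single pack, recomputing deltas (-f)
# and removing the packs made redundant (-d).
repack_from_scratch() {
    git -C "$1" repack -a -d -f -q
}
```

Usage might look like: repack_from_scratch /path/to/original.git && pack_size_kib /path/to/original.git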

-- Hannes

