* Verifying data integrity of two git repositories
@ 2025-04-22 7:19 Akash S
2025-04-22 19:57 ` Johannes Sixt
0 siblings, 1 reply; 2+ messages in thread
From: Akash S @ 2025-04-22 7:19 UTC (permalink / raw)
To: git@vger.kernel.org
I have a bare repository of size 5.7 GB on my local disk.
I need to push this to Azure DevOps. I usually do it with the command "git push --mirror" but unfortunately, Azure DevOps has a single push size limit of 5GB.
So I have to push repos larger than 5GB in chunks.
I used this Stack Overflow answer (https://stackoverflow.com/questions/79167276/splitt-git-push-to-azure-devops) as my basis and created a script to push each branch in batches of commits.
I pushed my repository in batches to, let's say, remote repo "A".
I did a "git clone --bare" from remote repo A to my local disk. I checked the size of this bare clone and it appears to be only 5 GB.
i) I counted the number of objects using the command "git rev-list --objects --all | wc -l" in both repos; the counts are the same.
ii) There is only one branch, master, in both repos, and the latest commit ID on master matches in both (I read an article saying that data integrity can also be checked this way, since Git works like a blockchain).
iii) git fsck --full in both repos, both gave the same output:
Checking object directories: 100% (256/256), done.
Checking objects: 100% (10793794/10793794), done.
Checking connectivity: 10793794, done.
But the original repo on disk printed this extra line at the end (which the bare clone of the remote did not display):
Verifying commits in commit graph: 100% (1351940/1351940), done.
iv) I created a bundle of the original repo on disk using the command "git bundle create repo.bundle --all" and then, in the cloned remote repo on disk, I ran "git bundle verify ../repo.bundle". Output:
The bundle contains these 883 refs:
<All Refs>
The bundle records a complete history.
The bundle uses this hash algorithm: sha1
/home/repo.bundle is okay
v) I checked the repo size using the command "git count-objects -vH"; size-pack differs (the original repo reports 5.62 GB and the cloned remote repo on disk reports 4.93 GB).
Note: My repository has no lfs/objects directory, so there are no LFS objects to begin with. That rules LFS out as the cause.
Why is there a change in size? Also how do I validate if two repos are the same or not?
Script being used to push in batches of commits:
#!/bin/bash
set -e
# === CONFIGURATION ===
RepositoryFolderPathForBareCloneBAK="/root/linux"
BackupRepositoryHttpsURL="<REMOTE_URL>"
remoteName="origin"
maxPushSizeInMB=$((4 * 1024)) # 4GB
splitPushCommitsCount=35000
splitPush=false
ALocation=$(pwd)
if [ ! -d "$RepositoryFolderPathForBareCloneBAK" ]; then
    echo "Error: Bare clone folder not found at $RepositoryFolderPathForBareCloneBAK"
    exit 1
fi
cd "$RepositoryFolderPathForBareCloneBAK"
git config http.postBuffer 524288000
doSplitPush=$splitPush
# Check repo size and decide whether to split push
if [ "$doSplitPush" = false ]; then
    echo "Checking repository size..."
    repositorySize=0
    while read -r line; do
        echo "$line"
        if [[ "$line" =~ ^size-pack:\ ([0-9]+(\.[0-9]+)?)\ ([A-Za-z]+) ]]; then
            value=${BASH_REMATCH[1]}
            unit=${BASH_REMATCH[3]}
            case "$unit" in
                bytes) repositorySize=$(echo "$value / 1024 / 1024" | bc) ;;
                KiB) repositorySize=$(echo "$value / 1024" | bc) ;;
                MiB) repositorySize=$(echo "$value" | bc) ;;
                GiB) repositorySize=$(echo "$value * 1024" | bc) ;;
                *) repositorySize=$(echo "$value" | bc) ;;
            esac
        fi
    done < <(git count-objects -vH)
    # Round down to integer
    repositorySize=${repositorySize%.*}
    echo "Repo size: $repositorySize MiB"
    if [ "$repositorySize" -ge "$maxPushSizeInMB" ]; then
        doSplitPush=true
    fi
fi
# Unset mirror config to allow partial pushes if needed
if git config --get remote.origin.mirror >/dev/null; then
    git config --unset remote.origin.mirror
fi
# Setup remote
NewREMOTE="push_remote"
# -x matches the whole line, so a remote whose name merely contains "push_remote" is not removed
if git remote | grep -qx "$NewREMOTE"; then
    git remote remove "$NewREMOTE"
fi
git remote add "$NewREMOTE" "$BackupRepositoryHttpsURL"
if [ "$doSplitPush" = false ]; then
    echo "Performing full push to $BackupRepositoryHttpsURL"
    git push "$NewREMOTE" --mirror
else
    echo "Performing split push to $BackupRepositoryHttpsURL"
    git for-each-ref --format="%(refname)" --sort='authordate' | while read -r ref; do
        if [[ "$ref" == refs/heads/* ]]; then
            BRANCH="${ref#refs/heads/}"
            echo "Processing branch: $BRANCH"
            # Note: this repoints HEAD of the bare repository at the branch being pushed
            git symbolic-ref HEAD "$ref"
            if git show-ref --quiet --verify "refs/remotes/$NewREMOTE/$BRANCH"; then
                range="$NewREMOTE/$BRANCH..HEAD"
            else
                range="HEAD"
            fi
            # Count commits directly; "format:x" omits the final newline, so wc -l undercounts by one
            n=$(git rev-list --count --first-parent "$range")
            echo "$n commits to push"
            # Scale the batch size to the repository size when it was measured above;
            # otherwise keep the configured splitPushCommitsCount (avoids dividing by zero)
            if [ "${repositorySize:-0}" -gt 0 ]; then
                splitPushCommitsCount=$(( (maxPushSizeInMB * n) / repositorySize ))
            fi
            if [ "$splitPushCommitsCount" -gt 20000 ]; then
                splitPushCommitsCount=20000
            fi
            # A batch size of zero would divide by zero when computing loopCount
            if [ "$splitPushCommitsCount" -lt 1 ]; then
                splitPushCommitsCount=1
            fi
            echo "Calculated splitPushCommitsCount: $splitPushCommitsCount"
            if [ "$n" -gt 0 ]; then
                loopCount=$((n / splitPushCommitsCount))
                for ((i=1; i<=loopCount; i++)); do
                    # Pick the batch boundary from the same range that n was counted over
                    h=$(git rev-list --first-parent --skip=$((n - (i * splitPushCommitsCount))) -n 1 "$range")
                    echo "Batch commit: $h"
                    git push "$NewREMOTE" --force "$h:refs/heads/$BRANCH"
                    echo "sleeping for 5 minutes"
                    sleep 300
                done
                echo "Final push: HEAD:refs/heads/$BRANCH"
                git push "$NewREMOTE" --force "HEAD:refs/heads/$BRANCH"
            else
                echo "No commits to push for $BRANCH"
            fi
        fi
    done
    echo "Pushing tags"
    git push "$NewREMOTE" --force 'refs/tags/*'
    echo "Pushing replace refs (if any)"
    git push "$NewREMOTE" --force 'refs/replace/*'
fi
# === LFS Push ===
echo "Pushing Git LFS objects..."
Get_LFS_Objects() {
    lfs_objects_dir="$1/lfs/objects"
    if [ -d "$lfs_objects_dir" ]; then
        lfs_objects=$(find "$lfs_objects_dir" -type f -printf "%f ")
        if [ -z "$lfs_objects" ]; then
            lfs_objects="NO_OBJECTS"
        fi
    else
        lfs_objects="NO_OBJECTS"
    fi
}
Get_LFS_Objects "$RepositoryFolderPathForBareCloneBAK"
if [[ "$lfs_objects" != "NO_OBJECTS" ]]; then
    LFS_SPECIFIER="--object-id $lfs_objects"
    echo "Running lfs"
    git lfs push "$NewREMOTE" $LFS_SPECIFIER
    retCode=$?
    echo "LFS push exited with code: $retCode"
else
    echo "No LFS objects to push."
fi
cd "$ALocation"
echo "All done! Git and LFS data pushed successfully."
* Re: Verifying data integrity of two git repositories
From: Johannes Sixt @ 2025-04-22 19:57 UTC (permalink / raw)
To: Akash S; +Cc: git@vger.kernel.org
On 22.04.25 at 09:19, Akash S wrote:
> ii) There is only 1 branch master in both repos and the last commit
> id of both master branches are matching (read an article that data
> integrity can be checked like this also since git also works like
> Blockchain)
>
> iii) git fsck --full in both repos, both gave the same output:
>
> Checking object directories: 100% (256/256), done.
> Checking objects: 100% (10793794/10793794), done.
> Checking connectivity: 10793794, done.
The facts that no errors were reported and that the commit ids are
identical are sufficient evidence that both repositories are identical.
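
One way to check this mechanically is sketched below, assuming both
repositories are available as local paths. The helper names repo_manifest
and same_repo_content are made up for this illustration; they are not Git
commands. The sketch compares branch and tag tips and then the full set of
reachable object IDs, which is stricter than comparing object counts.

```sh
# List branch/tag tips plus every reachable object ID, in a stable order.
# (repo_manifest is a hypothetical helper name.)
repo_manifest() {
    git -C "$1" for-each-ref --format='%(objectname) %(refname)' refs/heads refs/tags
    git -C "$1" rev-list --objects --all | cut -d' ' -f1 | sort -u
}

# Return 0 when both repositories produce the same manifest, 1 otherwise.
# (same_repo_content is a hypothetical helper name.)
same_repo_content() {
    tmp=$(mktemp -d) || return 2
    repo_manifest "$1" >"$tmp/a"
    repo_manifest "$2" >"$tmp/b"
    if cmp -s "$tmp/a" "$tmp/b"; then status=0; else status=1; fi
    rm -rf "$tmp"
    return "$status"
}
```

For example, `same_repo_content /root/linux /path/to/clone.git && echo same`
would print "same" only if refs and reachable objects agree; the second path
is a placeholder for wherever the bare clone lives.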
> But original repo on disk had this extra line in the end (which the
> remote bare on disk did not display)
>
> Verifying commits in commit graph: 100% (1351940/1351940), done.
A commit graph is an optional data structure. Its absence doesn't
invalidate the repository.
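
If the extra fsck line is wanted in the clone as well, the commit graph can
be generated there. A minimal sketch, assuming a Git version that ships the
commit-graph command with the --reachable option; ensure_commit_graph is a
hypothetical helper name used only for this illustration:

```sh
# Write a commit-graph file covering all reachable commits, then verify it.
# (ensure_commit_graph is a hypothetical helper name.)
ensure_commit_graph() {
    git -C "$1" commit-graph write --reachable &&
    git -C "$1" commit-graph verify
}
```

After running this in the clone, `git fsck --full` there should report the
commit-graph verification step too, provided commit-graph use is enabled in
that Git version.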
> Why is there a change in size? Also how do I validate if two repos
> are the same or not?
Most likely, the two repositories have been packed in different manners.
This has no bearing on the validity of the repository at all as long as
`git fsck --full` reports no error. The different sizes should not worry
you.
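
To see how much of the difference comes from packing alone, both
repositories can be repacked with the same settings and measured again. This
is only a sketch: packed_size_kib and repack_and_measure are hypothetical
helper names, and a full repack of a multi-gigabyte repository can take a
long time. The sizes may still not match exactly afterwards.

```sh
# Print the size-pack value (in KiB) that "git count-objects -v" reports.
# (packed_size_kib is a hypothetical helper name.)
packed_size_kib() {
    git -C "$1" count-objects -v | awk '$1 == "size-pack:" { print $2 }'
}

# Repack everything into a single pack, recomputing deltas from scratch,
# then print the resulting pack size.
# (repack_and_measure is a hypothetical helper name.)
repack_and_measure() {
    git -C "$1" repack -a -d -f -q &&
    packed_size_kib "$1"
}
```

Running `repack_and_measure` on the original and on the clone gives two
numbers that are directly comparable, because both packs were then produced
the same way.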
-- Hannes