* Questions about git-push for huge repositories
From: Levin Du @ 2015-09-06 8:16 UTC
To: git; +Cc: Levin Du

Hi all,

I have run into a strange problem. I have two repositories, with sizes:

  - A: 6.1G
  - B: 6G

Both A & B have been garbage-collected with:

    git reflog expire --expire=now --all
    git gc --prune=now --aggressive

Since A & B share many common files, I'd like to merge them to save
disk space (note: the branches of A & B are independent, i.e. they have
no common ancestor):

    git clone --bare A C
    (cd B; git push ../C master:master_b)

Repo C's size grew to 12G. After running 'git gc' again, it dropped to
6.2G.

I expected 'git push' to send only new files and commits, which would
save a lot of space. It turns out I was wrong.

Since repo A has already been published, pushing B's branch will double
the repo size, which the storage limit does not allow.

Any suggestions? Thanks in advance.

Best Regards,
Levin Du

* Re: Questions about git-push for huge repositories
From: Junio C Hamano @ 2015-09-06 17:48 UTC
To: Levin Du; +Cc: git

Levin Du <zslevin@gmail.com> writes:

> Since A & B share many common files, to save disk space, I'd like to merge them:
> (note: branch of A & B are independent, i.e. have no common ancestor.)

Not having any shared history is exactly the cause.

If the optimization were to exchange the list of all the commits, blobs
and trees each side has and send only the ones that the receiving
end lacks, you would get the result you seem to be expecting, but
that approach is not taken because it is impractically expensive.

Instead, the object transfer is optimized by comparing what commits
each side has and sending trees and blobs that are reachable from
the commits that the receiving side does not have. This approach
does not have to exchange the list of trees and blobs at all, and in
a pair of repositories for the same project, it does not even have
to send the list of all commits, because traversing from the tips of
histories and exchanging more recent ones iteratively is expected to
find commits common to both, and because the history graph is a DAG,
we know what is behind the commits that exist on both ends.

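The point about shared content versus shared history can be checked on a toy pair of repositories. The following is only a minimal sketch under stated assumptions: the repository names A and B, the file name file.txt, and the "demo" identity are hypothetical, and the repositories are created in a temporary directory. It shows that identical files give identical tree IDs on both sides, while the commit IDs differ, so a negotiation that compares commits only finds nothing in common.

```shell
#!/bin/sh
# Sketch: identical content, unrelated histories (all names are hypothetical).
set -e
work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

for repo in A B; do
    git init -q "$work/$repo"
    echo "shared content" >"$work/$repo/file.txt"
    git -C "$work/$repo" add file.txt
    # The commit message differs per repository, so the commit IDs differ
    # even though both commits point at the same tree.
    git -C "$work/$repo" commit -q -m "initial commit in $repo"
done

tree_a=$(git -C "$work/A" rev-parse 'HEAD^{tree}')
tree_b=$(git -C "$work/B" rev-parse 'HEAD^{tree}')
commit_a=$(git -C "$work/A" rev-parse HEAD)
commit_b=$(git -C "$work/B" rev-parse HEAD)

echo "tree in A:   $tree_a"
echo "tree in B:   $tree_b"
echo "commit in A: $commit_a"
echo "commit in B: $commit_b"
```

Because the two sides exchange commit IDs and not tree or blob IDs, the matching trees printed here are never noticed during a push between such repositories.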
* Re: Questions about git-push for huge repositories
From: Levin Du @ 2015-09-07 1:05 UTC
To: Junio C Hamano; +Cc: git

> Instead, the object transfer is optimized by comparing what commits
> each side has and sending trees and blobs that are reachable from
> the commits that the receiving side does not have.

The sender A sends all the commits that the receiver B does not have.
The commits contain trees and blobs. In my situation, the branch in A
has only one commit. Judging from the GC result, B seems to have
received lots of duplicate blobs.

What I do not understand is how duplicate blobs can occur in a git
repository. Git is famous for its content-addressed storage system.
I guess that A sends its packed file to B directly, no matter what is
already in B.

* Re: Questions about git-push for huge repositories
From: Levin Du @ 2015-09-07 3:51 UTC
To: Junio C Hamano; +Cc: git

I tried to use 'git replace --graft' to work around this. Here's the
process:

    cd A
    git fetch ../B master:master_b
    git replace --graft master_b master_a
    # now master_b's parent is master_a

    # do a filter-branch to set the graft in stone
    git filter-branch --tag-name-filter cat -- master_a..master_b

    # prune all the old refs and do gc
    git replace -d <origin_commit_of_master_b>
    git update-ref -d refs/original/refs/heads/master_b
    git reflog expire --expire=now --all
    git gc --prune=now --aggressive

I'd also like to make master_b look like an orphan, so I use
'git replace' again:

    git replace --graft master_b
    git log master_b  # only shows one commit, fine
    git push /path/to/public/A master_b  # small amount of data pushed
    du -hs /path/to/public/A  # 6.2 GiB

All is fine, except when I want to push the replace ref:

    git push /path/to/public/A 'refs/replace/*'

It pushes 6 GiB of data again. So right now, users have to run
'git replace --graft master_b' themselves if they want a tidy history
view.

* Re: Questions about git-push for huge repositories
From: Levin Du @ 2015-09-08 1:30 UTC
To: Junio C Hamano; +Cc: git

I think 'git push' needs further optimization. Take the kernel source
code as an example:

    # Clone the kernel to A and B
    $ git --version
    git version 2.3.2
    $ git clone --bare ../kernel/ A
    $ git clone --bare ../kernel/ B

    # Create the orphan commit and check
    $ cd A
    $ git branch test
    Switched to a new branch 'test'
    $ git replace --graft test
    $ git rev-parse test
    cbbae6741c60c9e09f87521e3a79810abd6a2fda
    $ git rev-parse test^{tree}
    929bdce0b48ca6079ad281a9d8ba24de3e49881a
    $ git rev-parse replace/cbbae6741c60c9e09f87521e3a79810abd6a2fda
    82d3e9ce1ca062c219f1209c5291ccd5603e5302
    $ git rev-parse 82d3e9ce1ca062c219f1209c5291ccd5603e5302^{tree}
    929bdce0b48ca6079ad281a9d8ba24de3e49881a
    $ git log --pretty=oneline 82d3e9ce1ca062c219f1209c5291ccd5603e5302 | wc -l
    1

We can see that commit 82d3e9ce1ca062c219f1209c5291ccd5603e5302 (a root
commit) is meant to replace commit
cbbae6741c60c9e09f87521e3a79810abd6a2fda. They both contain the same
tree 929bdce0b48ca6079ad281a9d8ba24de3e49881a.

    $ du -hs ../B
    1.6G ../B
    $ git push ../B 'refs/replace/*'
    Counting objects: 51216, done.
    Delta compression using up to 8 threads.
    Compressing objects: 100% (48963/48963), done.
    Writing objects: 100% (51216/51216), 139.61 MiB | 17.88 MiB/s, done.
    Total 51216 (delta 3647), reused 34580 (delta 1641)
    To ../B
     * [new branch] refs/replace/cbbae6741c60c9e09f87521e3a79810abd6a2fda -> refs/replace/cbbae6741c60c9e09f87521e3a79810abd6a2fda
    $ du -hs ../B
    1.7G ../B

It takes some time for 'git push' to compress the objects, and B ends
up 0.1G larger, all for a new commit whose tree is already in the
repository.

* Re: Questions about git-push for huge repositories
From: Jeff King @ 2015-09-08 5:44 UTC
To: Levin Du; +Cc: Junio C Hamano, git

On Tue, Sep 08, 2015 at 09:30:09AM +0800, Levin Du wrote:

> Take kernel source code for example:
>
> # Clone the kernel to A and B
> $ git --version
> git version 2.3.2
> $ git clone --bare ../kernel/ A
> $ git clone --bare ../kernel/ B

OK, two repos with the same source.

> # Create the orphan commit and check
> $ cd A
> $ git branch test
> Switched to a new branch 'test'
> $ git replace --graft test
> $ git rev-parse test
> cbbae6741c60c9e09f87521e3a79810abd6a2fda
> $ git rev-parse test^{tree}
> 929bdce0b48ca6079ad281a9d8ba24de3e49881a
> $ git rev-parse replace/cbbae6741c60c9e09f87521e3a79810abd6a2fda
> 82d3e9ce1ca062c219f1209c5291ccd5603e5302
> $ git rev-parse 82d3e9ce1ca062c219f1209c5291ccd5603e5302^{tree}
> 929bdce0b48ca6079ad281a9d8ba24de3e49881a
> $ git log --pretty=oneline 82d3e9ce1ca062c219f1209c5291ccd5603e5302 | wc -l
> 1

So you've created a new commit object, 82d3e9ce1, which has the same
tree as the original branch, but no parents.

Note that fetch and push do not respect the "replace" mechanism. They
can't, because we have no idea if the other side of the connection
shares our "replace" view of the world. So if I use "replace" to say
that commit X has parent Y, I cannot assume that pushing to some _other_
repository with X means that they also have all of Y.

But it should be OK, of course, to push the new orphan commit. I.e., if
we are pushing the object itself, not caring that it is part of a
"replace" mechanism, that should be no different than pushing any other
commit.

> $ du -hs ../B
> 1.6G ../B
> $ git push ../B 'refs/replace/*'
> Counting objects: 51216, done.
> Delta compression using up to 8 threads.
> Compressing objects: 100% (48963/48963), done.
> Writing objects: 100% (51216/51216), 139.61 MiB | 17.88 MiB/s, done.
> Total 51216 (delta 3647), reused 34580 (delta 1641)
> To ../B
> * [new branch]
> refs/replace/cbbae6741c60c9e09f87521e3a79810abd6a2fda ->
> refs/replace/cbbae6741c60c9e09f87521e3a79810abd6a2fda
> $ du -hs ../B
> 1.7G ../B
>
> It takes some time for 'git push' to compress the objects and B has
> finally increased 0.1G,
> which is for the newly commit whose tree is already in the repository.

Right, this is due to the commit-walking that Junio explained earlier.
We walk the commits only, and then expand the positive side (things the
other side wants) into trees and blobs. Even though we know about a
commit that the other side has that points to the tree, we don't make
the connection.

You can get a more thorough answer by expanding and marking all trees
and blobs, taking the set difference between all of the objects you want
to send, and all of the objects you know the other side has. I.e.,
basically:

  # what we want to send
  git rev-list --objects 82d3e9ce1ca062c219f1209c5291ccd5603e5302 |
  sort >want

  # what we know the other side has; turn off replacements, since we
  # want the real value, not with our fake replace overlaid
  git --no-replace-objects rev-list --objects refs/heads/master |
  sort >have

  # set difference
  comm -23 want have

which should consist of only the one commit. But if you actually ran
that, you may notice that the second rev-list takes a long time to run.

In your exact case, one can get lucky by progressively drilling down
into commits and their trees (since the tip commit of "master" happens
to share the identical tree with our new fake commit). But that is
rather an uncommon example, and in more normal cases of fetching from
somebody, building on top, and then pushing back up, it is much more
expensive. In those cases it is much more efficient to walk the small
number of new commits and then expand only their newly-added objects.

If you turn on reachability bitmaps, git _will_ do the thorough set
difference, because it becomes much cheaper to do so. E.g., try:

  git repack -adb

in repo A to build a single pack with bitmaps enabled. Then a subsequent
push should send only a single object (the new commit). Of course the
time spent building the bitmaps is larger than a single push, so this is
not a good strategy if you are just trying to send one tree.

-Peff

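The want/have set difference described above can be tried on a small scratch repository. This is a minimal sketch under stated assumptions, not the setup from the thread: the repository, the file name file.txt, and the "demo" identity are hypothetical, and 'git commit-tree' stands in for the parentless commit that 'git replace --graft' created in the kernel example.

```shell
#!/bin/sh
# Sketch: want/have set difference on a toy repository (names are hypothetical).
set -e
work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q "$work/repo"
cd "$work/repo"
echo one >file.txt
git add file.txt
git commit -q -m "first"
echo two >file.txt
git commit -q -am "second"

# A parentless commit that reuses the tree of the current tip.
orphan=$(git commit-tree 'HEAD^{tree}' -m "orphan copy of tip")

# What we want to send: everything reachable from the orphan commit.
git rev-list --objects "$orphan" | sort >"$work/want"

# What the other side is assumed to have: the real history of the tip.
git --no-replace-objects rev-list --objects HEAD | sort >"$work/have"

# Set difference: lines present in want but absent from have.
missing=$(comm -23 "$work/want" "$work/have")
echo "orphan commit:            $orphan"
echo "objects missing remotely: $missing"
```

Only the orphan commit itself should remain in the difference, since its tree and blob already appear in the history of the tip.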
* Re: Questions about git-push for huge repositories
From: Junio C Hamano @ 2015-09-08 18:24 UTC
To: Jeff King; +Cc: Levin Du, git

Jeff King <peff@peff.net> writes:

> If you turn on reachability bitmaps, git _will_ do the thorough set
> difference, because it becomes much cheaper to do so. E.g., try:
>
> git repack -adb
>
> in repo A to build a single pack with bitmaps enabled. Then a subsequent
> push should send only a single object (the new commit).

Hmph, A has the tip of B, and has a new commit B hasn't seen but A
knows that new commit's tree matches the tree of the tip of B.

Wouldn't --thin transfer from A to B know to send only that new
commit object without sending anything below the tree in such a
case, even without the bitmap?

* Re: Questions about git-push for huge repositories
From: Jeff King @ 2015-09-08 21:54 UTC
To: Junio C Hamano; +Cc: Levin Du, git

On Tue, Sep 08, 2015 at 11:24:06AM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
>
> > If you turn on reachability bitmaps, git _will_ do the thorough set
> > difference, because it becomes much cheaper to do so. E.g., try:
> >
> > git repack -adb
> >
> > in repo A to build a single pack with bitmaps enabled. Then a subsequent
> > push should send only a single object (the new commit).
>
> Hmph, A has the tip of B, and has a new commit B hasn't seen but A
> knows that new commit's tree matches the tree of the tip of B.
>
> Wouldn't --thin transfer from A to B know to send only that new
> commit object without sending anything below the tree in such a
> case, even without the bitmap?

I started to write about that in my analysis, but it gets confusing
quickly. There are actually many tip trees, because A and B also share
all of their tags. We do not mark every blob of every tip tree as a
preferred base, because it is expensive to do so (and it just clogs our
object array). Plus this only helps in the narrow circumstance that we
have the exact same tree as the tip (and not, say, the same tree as
master^, which I think it would be unreasonable to expect git to find).

But if we do:

  (cd ../B && git tag | xargs git tag -d)

to delete all of the other tips besides master, leaving only the one
that we know has the same tree, I'd expect git to figure it out.
Certainly I would not expect it to save all of the delta compression, in
the sense that we may throw away on-disk delta bases to older objects
(because we don't realize the other side has those older objects).

But I would have thought before we even hit that phase, adding those
objects as "preferred bases" would have marked them as "do not send" in
the first place. There is code in have_duplicate_entry() to handle this.
I wonder why it doesn't kick in.

-Peff

* Re: Questions about git-push for huge repositories
From: Jeff King @ 2015-09-08 5:00 UTC
To: Levin Du; +Cc: Junio C Hamano, git

On Mon, Sep 07, 2015 at 09:05:41AM +0800, Levin Du wrote:

> > Instead, the object transfer is optimized by comparing what commits
> > each side has and sending trees and blobs that are reachable from
> > the commits that the receiving side does not have.
>
> The sender A sends all the commits that the receiver B does not have.
> The commits contains trees and blobs. In my situation, branch in A has
> only one commit. It seems that B has received lots of duplicate blobs,
> concluded from the GC result.

Right. B tells A "I already have this commit", but A does not already
have it, so that information is not helpful. It cannot make any
assumptions about what B has, and must send all trees and blobs
referenced by its commit.

> What I do not understand is, how duplicate blobs happen in a git repository?
> Git repository is famous for its content addressing storage system.
> I guess that A sends its packed file to B directly, no matter what are
> already in B.

Not exactly. During a push, git may or may not keep the packfile sent
over the wire, depending on the number of objects in it and the
receive.unpackLimit config setting. The same object can exist in two
separate packfiles. One of the effects of "git gc" is to remove such
duplicates.

So A effectively does send its whole pack in this case, but only because
it cannot find any shared history with B (and B keeps it as-is until the
next gc because it is over the unpackLimit).

-Peff

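The duplicate-object effect described above can be reproduced at toy scale. This is a minimal sketch under stated assumptions: the repository names (recv.git, A, B), the branch names, and the "demo" identity are hypothetical, and receive.unpackLimit is set to 1 only so that even a tiny push is kept as a pack, which mimics what happens by default with a large push. Two senders with identical files but unrelated histories push to the same receiver, and 'git count-objects -v' reports the number of packed objects before and after a repack.

```shell
#!/bin/sh
# Sketch: the same objects stored twice on the receiver (names are hypothetical).
set -e
work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Receiver: keep every incoming pack as-is instead of unpacking it.
git init -q --bare "$work/recv.git"
git -C "$work/recv.git" config receive.unpackLimit 1

# Two senders with identical content but no shared commits.
for repo in A B; do
    git init -q "$work/$repo"
    echo "shared content" >"$work/$repo/file.txt"
    git -C "$work/$repo" add file.txt
    git -C "$work/$repo" commit -q -m "initial commit in $repo"
    git -C "$work/$repo" push -q "$work/recv.git" "HEAD:refs/heads/master_$repo"
done

# Sum of the object counts of all packs on the receiver.
in_pack() {
    git -C "$work/recv.git" count-objects -v | sed -n 's/^in-pack: //p'
}

before=$(in_pack)
# Rewrite everything into a single pack, which drops the duplicate copies.
git -C "$work/recv.git" repack -a -d -q
after=$(in_pack)
echo "in-pack objects before repack: $before"
echo "in-pack objects after repack:  $after"
```

The count should shrink after the repack, because the tree and blob shared by both pushes were stored once per incoming pack and are stored only once afterwards. This is the same effect, on a much smaller scale, as repository C dropping from 12G to 6.2G after 'git gc'.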