Questions about git-push for huge repositories

* Questions about git-push for huge repositories
@ 2015-09-06  8:16 Levin Du
  2015-09-06 17:48 ` Junio C Hamano
  0 siblings, 1 reply; 9+ messages in thread
From: Levin Du @ 2015-09-06  8:16 UTC (permalink / raw)
  To: git; +Cc: Levin Du

Hi all,

I've run into a strange problem.

I have two repositories, with these sizes:
  - A:  6.1G
  - B:  6G

Both A and B have been garbage-collected with:
  git reflog expire --expire=now --all
  git gc --prune=now --aggressive

Since A and B share many common files, I'd like to merge them to save disk space
(note: the branches of A and B are independent, i.e. they have no common ancestor):
   git clone --bare A  C
   (cd B; git push ../C master:master_b)

Repo C's size grew to 12G. After running 'git gc' again, it dropped to 6.2G.

I expected 'git push' to send only the new files and commits, which would
save a lot of space.
Yet it turns out I was wrong. Since repo A has already been published, pushing
B's branch would double the repository size, which the storage limit does not allow.

Any suggestions? Thanks in advance.

Best Regards,
Levin Du

* Re: Questions about git-push for huge repositories
  2015-09-06  8:16 Questions about git-push for huge repositories Levin Du
@ 2015-09-06 17:48 ` Junio C Hamano
  2015-09-07  1:05   ` Levin Du
  0 siblings, 1 reply; 9+ messages in thread
From: Junio C Hamano @ 2015-09-06 17:48 UTC (permalink / raw)
  To: Levin Du; +Cc: git

Levin Du <zslevin@gmail.com> writes:

> Since A & B share many common files, to save disk space, I'd like to merge them:
> (note: branch of A & B are independent, i.e. have no common ancestor.)

Not having any shared history is exactly the cause.  If the
optimization were to exchange the list of all the commits, blobs and
trees each side has and send only the ones that the receiving end
lacks, you would get the result you seem to be expecting, but that
approach is not taken because it is impractically expensive.

Instead, the object transfer is optimized by comparing what commits
each side has and sending trees and blobs that are reachable from
the commits that the receiving side does not have.  This approach
does not have to exchange the list of trees and blobs at all, and in
a pair of repositories for the same project, it does not even have
to send the list of all commits, because traversing from the tips of
histories and exchanging more recent ones iteratively is expected to
find commits common to both, and because the history graph is a
DAG, we know what is behind the commits that exist on both
ends.
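
The effect described above can be reproduced on a small scale. The following is only a minimal sketch, not part of the original message: the repository names A and B follow the thread, while the temporary directory, the file name `file.txt`, and the throwaway identity are assumptions made for illustration. It builds two unrelated histories that contain the same file and shows that no common commit exists for the transfer to anchor on, even though the blob is identical.

```shell
#!/bin/sh
set -e
tmp=$(mktemp -d)
cd "$tmp"
# Throwaway identity (assumption) so 'git commit' works without global config.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Two independent repositories whose only file is byte-for-byte identical.
for r in A B; do
    git init -q "$r"
    echo "shared content" >"$r/file.txt"
    git -C "$r" add file.txt
    git -C "$r" commit -q -m "root commit of $r"
done

# Bring B's tip into A under another name.
git -C A fetch -q ../B HEAD:master_b

# merge-base exits non-zero when the histories have no commit in common.
if git -C A merge-base HEAD master_b >/dev/null; then
    shared_history=yes
else
    shared_history=no
fi

# The blob is nevertheless the same object in both histories.
blob_a=$(git -C A rev-parse HEAD:file.txt)
blob_b=$(git -C A rev-parse master_b:file.txt)

echo "shared history: $shared_history"
echo "blob in A's history: $blob_a"
echo "blob in B's history: $blob_b"

cd / && rm -rf "$tmp"
```

Because the comparison happens at the commit level, the identical blob IDs printed at the end play no part in deciding what gets sent.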

* Re: Questions about git-push for huge repositories
  2015-09-06 17:48 ` Junio C Hamano
@ 2015-09-07  1:05   ` Levin Du
  2015-09-07  3:51     ` Levin Du
  2015-09-08  5:00     ` Jeff King
  0 siblings, 2 replies; 9+ messages in thread
From: Levin Du @ 2015-09-07  1:05 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

> Instead, the object transfer is optimized by comparing what commits
> each side has and sending trees and blobs that are reachable from
> the commits that the receiving side does not have.

The sender A sends all the commits that the receiver B does not have.
The commits contain trees and blobs. In my situation, the branch in A has
only one commit. Judging from the GC result, B seems to have received
lots of duplicate blobs.

What I do not understand is how duplicate blobs can occur in a git repository.
Git is famous for its content-addressed storage system.
I guess that A sends its pack file to B directly, regardless of what is
already in B.

* Re: Questions about git-push for huge repositories
  2015-09-07  1:05   ` Levin Du
@ 2015-09-07  3:51     ` Levin Du
  2015-09-08  1:30       ` Levin Du
  2015-09-08  5:00     ` Jeff King
  1 sibling, 1 reply; 9+ messages in thread
From: Levin Du @ 2015-09-07  3:51 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

I tried to use 'git replace --graft' to work around this. Here's the process:
  cd A
  git fetch ../B master:master_b
  git replace --graft master_b master_a
  # now master_b's parent is master_a

  # run filter-branch to set the graft in stone
  git filter-branch --tag-name-filter cat -- master_a..master_b

  # prune all the old refs and do gc
  git replace -d <origin_commit_of_master_b>
  git update-ref -d refs/original/refs/heads/master_b
  git reflog expire --expire=now --all
  git gc --prune=now --aggressive

I'd also like master_b to look like an orphan branch, so I use 'git replace' again:
   git replace --graft master_b
   git log master_b
   # shows only one commit, fine
   git push /path/to/public/A master_b
   # small amount of data pushed
   du -hs  /path/to/public/A
   # 6.2 GiB

Everything works, except when I try to push the replace ref:
   git push /path/to/public/A 'refs/replace/*'

It pushes 6 GiB of data again.

So right now, users have to run 'git replace --graft master_b' themselves
if they want a tidy history view.

* Re: Questions about git-push for huge repositories
  2015-09-07  3:51     ` Levin Du
@ 2015-09-08  1:30       ` Levin Du
  2015-09-08  5:44         ` Jeff King
  0 siblings, 1 reply; 9+ messages in thread
From: Levin Du @ 2015-09-08  1:30 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

I think 'git push' needs further optimization.

Take the kernel source code as an example:

# Clone the kernel to A and B
$ git --version
git version 2.3.2
$ git clone --bare  ../kernel/ A
$ git clone --bare  ../kernel/ B

# Create the orphan commit and check
$ cd A
$ git branch test
Switched to a new branch 'test'
$ git replace --graft test
$ git rev-parse test
cbbae6741c60c9e09f87521e3a79810abd6a2fda
$ git rev-parse test^{tree}
929bdce0b48ca6079ad281a9d8ba24de3e49881a
$ git rev-parse replace/cbbae6741c60c9e09f87521e3a79810abd6a2fda
82d3e9ce1ca062c219f1209c5291ccd5603e5302
$ git rev-parse 82d3e9ce1ca062c219f1209c5291ccd5603e5302^{tree}
929bdce0b48ca6079ad281a9d8ba24de3e49881a
$ git log --pretty=oneline 82d3e9ce1ca062c219f1209c5291ccd5603e5302 | wc -l
1

We can see that commit 82d3e9ce1ca062c219f1209c5291ccd5603e5302 (a root commit)
is meant to replace commit cbbae6741c60c9e09f87521e3a79810abd6a2fda .
Both have the same tree, 929bdce0b48ca6079ad281a9d8ba24de3e49881a .

$ du -hs ../B
1.6G ../B
$ git push ../B 'refs/replace/*'
Counting objects: 51216, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (48963/48963), done.
Writing objects: 100% (51216/51216), 139.61 MiB | 17.88 MiB/s, done.
Total 51216 (delta 3647), reused 34580 (delta 1641)
To ../B
 * [new branch]      refs/replace/cbbae6741c60c9e09f87521e3a79810abd6a2fda -> refs/replace/cbbae6741c60c9e09f87521e3a79810abd6a2fda
$ du -hs ../B
1.7G ../B

It takes some time for 'git push' to compress the objects, and B ends up
0.1G larger, all for a single new commit whose tree is already in the
repository.

* Re: Questions about git-push for huge repositories
  2015-09-08  1:30       ` Levin Du
@ 2015-09-08  5:44         ` Jeff King
  2015-09-08 18:24           ` Junio C Hamano
  0 siblings, 1 reply; 9+ messages in thread
From: Jeff King @ 2015-09-08  5:44 UTC (permalink / raw)
  To: Levin Du; +Cc: Junio C Hamano, git

On Tue, Sep 08, 2015 at 09:30:09AM +0800, Levin Du wrote:

> Take kernel source code for example:
> 
> # Clone the kernel to A and B
> $ git --version
> git version 2.3.2
> $ git clone --bare  ../kernel/ A
> $ git clone --bare  ../kernel/ B

OK, two repos with the same source.

> # Create the orphan commit and check
> $ cd A
> $ git branch test
> Switched to a new branch 'test'
> $ git replace --graft test
> $ git rev-parse test
> cbbae6741c60c9e09f87521e3a79810abd6a2fda
> $ git rev-parse test^{tree}
> 929bdce0b48ca6079ad281a9d8ba24de3e49881a
> $ git rev-parse replace/cbbae6741c60c9e09f87521e3a79810abd6a2fda
> 82d3e9ce1ca062c219f1209c5291ccd5603e5302
> $ git rev-parse 82d3e9ce1ca062c219f1209c5291ccd5603e5302^{tree}
> 929bdce0b48ca6079ad281a9d8ba24de3e49881a
> $ git log --pretty=oneline 82d3e9ce1ca062c219f1209c5291ccd5603e5302 | wc -l
> 1

So you've created a new commit object, 82d3e9ce1, which has the same
tree as the original branch, but no parents.

Note that fetch and push do not respect the "replace" mechanism. They
can't, because we have no idea if the other side of the connection
shares our "replace" view of the world. So if I use "replace" to say
that commit X has parent Y, I cannot assume that pushing to some _other_
repository with X means that they also have all of Y.

But it should be OK, of course, to push the new orphan commit. I.e., if
we are pushing the object itself, not caring that it is part of a
"replace" mechanism, that should be no different than pushing any other
commit.
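
The difference between the "replace" view and the real history can be seen locally. This is a minimal sketch under assumptions (a throwaway repository called `demo`, a file called `file.txt`, and a made-up identity, none of which come from the thread): it hides the parent of the tip commit with `git replace --graft` and then counts commits with and without replacements applied.

```shell
#!/bin/sh
set -e
tmp=$(mktemp -d)
cd "$tmp"
# Throwaway identity (assumption) so 'git commit' works without global config.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q demo
cd demo
echo one >file.txt
git add file.txt
git commit -q -m first
echo two >file.txt
git commit -q -a -m second

# Make the tip look like a root commit, in this repository only.
git replace --graft HEAD

# History as seen through the replacement.
with_replace=$(git rev-list --count HEAD)
# History as it really is, which is what the other side knows about.
without_replace=$(git --no-replace-objects rev-list --count HEAD)

echo "commits with replace refs:    $with_replace"
echo "commits without replace refs: $without_replace"

cd / && rm -rf "$tmp"
```

The two counts differ because the replacement exists only as a ref under refs/replace/ in the local repository; another repository sees the shorter history only if it has that ref as well.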

> $ du -hs ../B
> 1.6G ../B
> $ git push ../B 'refs/replace/*'
> Counting objects: 51216, done.
> Delta compression using up to 8 threads.
> Compressing objects: 100% (48963/48963), done.
> Writing objects: 100% (51216/51216), 139.61 MiB | 17.88 MiB/s, done.
> Total 51216 (delta 3647), reused 34580 (delta 1641)
> To ../B
> * [new branch]
> refs/replace/cbbae6741c60c9e09f87521e3a79810abd6a2fda ->
> refs/replace/cbbae6741c60c9e09f87521e3a79810abd6a2fda
> $ du -hs ../B
> 1.7G ../B
> 
> It takes some time for 'git push' to compress the objects and B has
> finally increased 0.1G,
> which is for the newly commit whose tree is already in the repository.

Right, this is due to the commit-walking that Junio explained earlier.
We walk the commits only, and then expand the positive side (things the
other side wants) into trees and blobs. Even though we know about a
commit that the other side has that points to the tree, we don't make
the connection.

You can get a more thorough answer by expanding and marking all trees
and blobs, taking the set difference between all of the objects you want
to send, and all of the objects you know the other side has. I.e.,
basically:

  # what we want to send
  git rev-list --objects 82d3e9ce1ca062c219f1209c5291ccd5603e5302 | sort >want

  # what we know the other side has; turn off replacements, since we
  # want the real value, not with our fake replace overlaid
  git --no-replace-objects rev-list --objects refs/heads/master | sort >have

  # set difference
  comm -23 want have

which should consist of only the one commit. But if you actually ran
that, you may notice that the second rev-list takes a long time to run.
In your exact case, one can get lucky by progressively drilling down
into commits and their trees (since the tip commit of "master" happens
to share the identical tree with our new fake commit). But that is
rather an uncommon example, and in more normal cases of fetching from
somebody, building on top, and then pushing back up, it is much more
expensive. In those cases it is much more efficient to walk the small
number of new commits and then expand only their newly-added objects.

If you turn on reachability bitmaps, git _will_ do the thorough set
difference, because it becomes much cheaper to do so. E.g., try:

    git repack -adb

in repo A to build a single pack with bitmaps enabled. Then a subsequent
push should send only a single object (the new commit).

Of course the time spent building the bitmaps is larger than a single
push, so this is not a good strategy if you are just trying to send one
tree.

-Peff
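
For a repository that is pushed from or fetched from often, the bitmap can be kept up to date rather than built once. The sketch below is an illustration under assumptions, not something from the thread: it uses a throwaway repository named `demo` and a made-up identity, runs the same one-off repack as above, checks that a `.bitmap` file appeared next to the pack, and then sets `repack.writeBitmaps` so that later full repacks write one too.

```shell
#!/bin/sh
set -e
tmp=$(mktemp -d)
cd "$tmp"
# Throwaway identity (assumption) so 'git commit' works without global config.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q demo
cd demo
echo hello >file.txt
git add file.txt
git commit -q -m initial

# One-off: pack everything into a single pack and write a bitmap index.
git repack -a -d -b -q

# The bitmap is stored next to the pack it describes.
bitmap=$(ls .git/objects/pack/*.bitmap)
echo "bitmap file: $bitmap"

# Persistent: ask later full repacks to write a bitmap as well.
git config repack.writeBitmaps true
bitmaps_enabled=$(git config --get repack.writeBitmaps)

cd / && rm -rf "$tmp"
```

Whether the extra repacking time pays off depends on how often the repository serves pushes or fetches, as noted above.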

* Re: Questions about git-push for huge repositories
  2015-09-08  5:44         ` Jeff King
@ 2015-09-08 18:24           ` Junio C Hamano
  2015-09-08 21:54             ` Jeff King
  0 siblings, 1 reply; 9+ messages in thread
From: Junio C Hamano @ 2015-09-08 18:24 UTC (permalink / raw)
  To: Jeff King; +Cc: Levin Du, git

Jeff King <peff@peff.net> writes:

> If you turn on reachability bitmaps, git _will_ do the thorough set
> difference, because it becomes much cheaper to do so. E.g., try:
>
>     git repack -adb
>
> in repo A to build a single pack with bitmaps enabled. Then a subsequent
> push should send only a single object (the new commit).

Hmph, A has the tip of B, and has a new commit B hasn't seen but A
knows that new commit's tree matches the tree of the tip of B.

Wouldn't --thin transfer from A to B know to send only that new
commit object without sending anything below the tree in such a
case, even without the bitmap?

* Re: Questions about git-push for huge repositories
  2015-09-08 18:24           ` Junio C Hamano
@ 2015-09-08 21:54             ` Jeff King
  0 siblings, 0 replies; 9+ messages in thread
From: Jeff King @ 2015-09-08 21:54 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Levin Du, git

On Tue, Sep 08, 2015 at 11:24:06AM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > If you turn on reachability bitmaps, git _will_ do the thorough set
> > difference, because it becomes much cheaper to do so. E.g., try:
> >
> >     git repack -adb
> >
> > in repo A to build a single pack with bitmaps enabled. Then a subsequent
> > push should send only a single object (the new commit).
> 
> Hmph, A has the tip of B, and has a new commit B hasn't seen but A
> knows that new commit's tree matches the tree of the tip of B.
> 
> Wouldn't --thin transfer from A to B know to send only that new
> commit object without sending anything below the tree in such a
> case, even without the bitmap?

I started to write about that in my analysis, but it gets confusing
quickly. There are actually many tip trees, because A and B also share
all of their tags. We do not mark every blob of every tip tree as a
preferred base, because it is expensive to do so (and it just clogs our
object array).  Plus this only helps in the narrow circumstance that we
have the exact same tree as the tip (and not, say, the same tree as
master^, which I think it would be unreasonable to expect git to find).

But if we do:

  (cd ../B && git tag | xargs git tag -d)

to delete all of the other tips besides master, leaving only the one
that we know has the same tree, I'd expect git to figure it out.

Certainly I would not expect it to save all of the delta compression,
in the sense that we may throw away on-disk delta bases to older objects
(because we don't realize the other side has those older objects). But I
would have thought before we even hit that phase, adding those objects
as "preferred bases" would have marked them as "do not send" in the
first place.

There is code in have_duplicate_entry() to handle this. I wonder why it
doesn't kick in.

-Peff

* Re: Questions about git-push for huge repositories
  2015-09-07  1:05   ` Levin Du
  2015-09-07  3:51     ` Levin Du
@ 2015-09-08  5:00     ` Jeff King
  1 sibling, 0 replies; 9+ messages in thread
From: Jeff King @ 2015-09-08  5:00 UTC (permalink / raw)
  To: Levin Du; +Cc: Junio C Hamano, git

On Mon, Sep 07, 2015 at 09:05:41AM +0800, Levin Du wrote:

> > Instead, the object transfer is optimized by comparing what commits
> > each side has and sending trees and blobs that are reachable from
> > the commits that the receiving side does not have.
> 
> The sender A sends all the commits that the receiver B does not have.
> The commits contains trees and blobs. In my situation, branch in A has
> only one commit. It seems that B has received lots of duplicate blobs,
> concluded from the GC result.

Right. B tells A "I already have this commit", but A does not already
have it, so that information is not helpful. It cannot make any
assumptions about what B has, and must send all trees and blobs
referenced by its commit.

> What I do not understand is, how duplicate blobs happen in a git repository?
> Git repository is famous for its content addressing storage system.
> I guess that A sends its packed file to B directly, no matter what are
> already in B.

Not exactly.  During a push, git may or may not keep the packfile sent
over the wire, depending on the number of objects in it and the
receive.unpackLimit config setting. The same object can exist in two
separate packfiles. One of the effects of "git gc" is to remove such
duplicates.

So A effectively does send its whole pack in this case, but only because
it cannot find any shared history with B (and B keeps it as-is until the
next gc because it is over the unpackLimit).

-Peff
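
The duplicate-pack situation can be reproduced with tiny repositories. This is a hedged sketch, not taken from the thread: the names A, B and C follow the original question, while the file name, the identity, and the choice of `receive.unpackLimit 1` (set only so that even a three-object push is kept as a pack) are assumptions for illustration. It pushes two unrelated histories with identical content into C and compares `git count-objects -v` before and after `git gc`.

```shell
#!/bin/sh
set -e
tmp=$(mktemp -d)
cd "$tmp"
# Throwaway identity (assumption) so 'git commit' works without global config.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Two unrelated histories with identical content.
for r in A B; do
    git init -q "$r"
    echo "shared content" >"$r/file.txt"
    git -C "$r" add file.txt
    git -C "$r" commit -q -m "root commit of $r"
done

# Receiver that keeps every incoming pack instead of unpacking it.
git init -q --bare C
git -C C config receive.unpackLimit 1

git -C A push -q ../C HEAD:refs/heads/master_a
git -C B push -q ../C HEAD:refs/heads/master_b

# Number of packs while both received packs still exist.
packs_before=$(git -C C count-objects -v | sed -n 's/^packs: //p')

# gc rewrites everything into one pack, dropping duplicate copies.
git -C C gc -q --prune=now
packs_after=$(git -C C count-objects -v | sed -n 's/^packs: //p')

echo "packs before gc: $packs_before"
echo "packs after gc:  $packs_after"

cd / && rm -rf "$tmp"
```

The `in-pack` and `size-pack` lines of `git count-objects -v` can be compared in the same way to see how much of the received data was duplicated.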
