git.vger.kernel.org archive mirror
* Question about how git determines the minimum packfile for a push.
@ 2015-04-27  0:41 Brad Litterell
  2015-04-27  4:39 ` Junio C Hamano
  2015-04-28  5:33 ` Jeff King
  0 siblings, 2 replies; 3+ messages in thread
From: Brad Litterell @ 2015-04-27  0:41 UTC
  To: git@vger.kernel.org

Hi,

I'm using git with a submodule containing a (large) binary toolchain, which I updated from GCC-4.7 to 4.8.  When I added 4.8 I deleted 4.7, and I now want to add 4.7 back at HEAD.  As shown in the tree objects below, the 4.7 bits are still in the repository (as expected), but when I try to push a commit and tree object that puts them back, git tries to send all 4000+ objects again, even though the objects are already on the server from a previous commit. (I have confirmed this with git cat-file on the server.) From my research it seems like git is supposed to identify the common objects and not do this.

For example, consider the last three commits (most recent on top).

$ git log
bb384915e12e925ead5ab8ad5161c84e0ef2b7f7 (HEAD, master) Add GCC-4.7 back to the image for side-by-side testing. Will delete it later.
2dfd226e6d2cc0a1dc58770d1dcaec1ba863df72 (origin/master) Upgrade toolchain to GCC-4.8. (delete GCC-4.7)
816fde0fdec1506600f19de4e3e4e02a6fe08639 (tag: release-1) Compiler toolchain 4.7

Here are the internal tree objects for those 3 commits:

# git cat-file -p 816fde^{tree}
$ cat old.tree
100644 blob 841b8359fce4edaf7549b87e1a81e7091c1dff6c    .arcconfig
100644 blob 21cb788d9f6e99367d96ba19d8c7470164e7d298    .gitmodules
040000 tree 470dfed1791b3ab7b4c731e354fd7685609c057a    arcanist
100644 blob 588866ae2c8a815ff26c5930b98d2a8ac9b934a0    gcc-linaro-arm-linux-gnueabihf-4.7-2013.03-20130313_linux.tar.bz2
040000 tree 7a1ae67dd2ff1fd358ad52d25852b340355339fe    gcc-linaro-arm-linux-gnueabihf-4.7-2013.03-20130313_linux

# git cat-file -p 2dfd22^{tree}
$ cat middle.tree
100644 blob 841b8359fce4edaf7549b87e1a81e7091c1dff6c    .arcconfig
100644 blob 21cb788d9f6e99367d96ba19d8c7470164e7d298    .gitmodules
100644 blob 79c4d3271c1177822786f82201c80b928fc35c6e    README
040000 tree 470dfed1791b3ab7b4c731e354fd7685609c057a    arcanist
100644 blob f71d6dd4e13a3de4b0c38dd37e4c2bc94c503f26    gcc-linaro-arm-linux-gnueabihf-4.8-2013.12_linux.tar.bz2
100644 blob 026e2d232bd7bb1cf0b9efb61c8cd307a52526ec    gcc-linaro-arm-linux-gnueabihf-4.8-2013.12_linux.tar.bz2.asc
040000 tree d03b4cf20163ca7a0ab5c02365890855146d8e0c    gcc-linaro-arm-linux-gnueabihf-4.8-2013.12_linux

# git cat-file -p HEAD^{tree}
$ cat new.tree
100644 blob 841b8359fce4edaf7549b87e1a81e7091c1dff6c    .arcconfig
100644 blob 21cb788d9f6e99367d96ba19d8c7470164e7d298    .gitmodules
100644 blob 79c4d3271c1177822786f82201c80b928fc35c6e    README
040000 tree 470dfed1791b3ab7b4c731e354fd7685609c057a    arcanist
100644 blob 588866ae2c8a815ff26c5930b98d2a8ac9b934a0    gcc-linaro-arm-linux-gnueabihf-4.7-2013.03-20130313_linux.tar.bz2
040000 tree 7a1ae67dd2ff1fd358ad52d25852b340355339fe    gcc-linaro-arm-linux-gnueabihf-4.7-2013.03-20130313_linux
100644 blob f71d6dd4e13a3de4b0c38dd37e4c2bc94c503f26    gcc-linaro-arm-linux-gnueabihf-4.8-2013.12_linux.tar.bz2
100644 blob 026e2d232bd7bb1cf0b9efb61c8cd307a52526ec    gcc-linaro-arm-linux-gnueabihf-4.8-2013.12_linux.tar.bz2.asc
040000 tree d03b4cf20163ca7a0ab5c02365890855146d8e0c    gcc-linaro-arm-linux-gnueabihf-4.8-2013.12_linux

By examining the tree objects with git cat-file -p, it is clear that adding back GCC-4.7 added back exactly the same blob and tree, as expected, namely:

100644 blob 588866ae2c8a815ff26c5930b98d2a8ac9b934a0    gcc-linaro-arm-linux-gnueabihf-4.7-2013.03-20130313_linux.tar.bz2
040000 tree 7a1ae67dd2ff1fd358ad52d25852b340355339fe    gcc-linaro-arm-linux-gnueabihf-4.7-2013.03-20130313_linux
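
The same comparison can be done mechanically on the saved listings. This is a minimal sketch, assuming the old.tree and new.tree files shown above and a POSIX sort and comm; the function name common_entries is made up for illustration.

```shell
# common_entries OLD NEW
# Print the tree entries (whole lines) that appear in both listings.
common_entries() {
    old_sorted=$(mktemp) && new_sorted=$(mktemp) || return 1
    # comm needs sorted input; use the same collation for sort and comm.
    LC_ALL=C sort "$1" > "$old_sorted"
    LC_ALL=C sort "$2" > "$new_sorted"
    # -12 suppresses lines unique to either file, leaving the shared ones.
    LC_ALL=C comm -12 "$old_sorted" "$new_sorted"
    rm -f "$old_sorted" "$new_sorted"
}
```

Running `common_entries old.tree new.tree` on the listings above should print all five entries of old.tree, including the 4.7 blob (5888) and tree (7a1a), since each of those lines also appears in new.tree.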

But, whenever I try to push, git tries to write 4000+ objects across the wire:

$ git push origin master
Counting objects: 4854, done.
Delta compression using up to 6 threads.
Compressing objects: 100% (2295/2295), done.
Writing objects:   0% (2/4853), 9.26 MiB | 3.08 MiB/s
^C   

All the parts are already on the server so it seems like the only objects to upload would be the commit object and the new associated tree (not the blob (5888) and tree (7a1a) below it).

Further, it seems like my local git client should be able to discern this from its knowledge of the origin/master ref.

I don't want to let this push complete if it will result in duplication on the server because this repo is already 400MB and slow to clone.

Can someone please explain what is happening here? Using git push --thin doesn't seem to make a difference.

Is it possible git is not computing the delta correctly?  Or does git only look at the top-level commit objects to figure out what to include in the push packfile?

Will it upload the larger pack only to have the server correctly handle the duplicates?

In case it matters, the server in question is running Atlassian Stash.

Thanks,
Brad

* Re: Question about how git determines the minimum packfile for a push.
From: Junio C Hamano @ 2015-04-27  4:39 UTC
  To: Brad Litterell; +Cc: git@vger.kernel.org

On Sun, Apr 26, 2015 at 5:41 PM, Brad Litterell <brad@evidence.com> wrote:
>
> Is it possible git is not computing the delta correctly?
> Or does git only look at the top-level commit objects to figure out what to
> include in the push packfile?

We walk the commit graph backwards to discover the common ancestries to
minimize the network cost when fetching, but I do not think the reverse
direction has such smarts in the protocol.

If you fetch (not "pull") first to remote tracking branches and then push, that
probably will reduce the transfer, as the side that pushes is the only one that
decides what objects are sent in "git push -> git receive-pack" direction.

* Re: Question about how git determines the minimum packfile for a push.
From: Jeff King @ 2015-04-28  5:33 UTC
  To: Brad Litterell; +Cc: git@vger.kernel.org

On Mon, Apr 27, 2015 at 12:41:28AM +0000, Brad Litterell wrote:

> Is it possible git is not computing the delta correctly?  Or does git
> only look at the top-level commit objects to figure out what to
> include in the push packfile?

It's the latter. Junio mentioned that "push" is not as thorough about
finding common ancestors as "fetch", but I think even "fetch" would have
the same problem.

If we know that the other side has commit X, we know that it also has
X~3, and we also know that it has every tree and blob mentioned by X~3.
But it's much too expensive to open up every tree to generate the full
set of reachable objects; for the Linux kernel, that is something like 45
seconds of CPU time, just to find out "oh, we only need to send 5
objects".
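
One way to see the gap between a boundary-only walk and the exact answer is to compare two object listings. This is a rough sketch, assuming a local clone in which the first argument names something the server is known to have (for example origin/master). The function names are hypothetical, and treating `git rev-list --objects A..B` as a stand-in for what push enumerates is an assumption, not something confirmed in this thread.

```shell
# approx_new_objects HAVE WANT
# Cheap walk: list objects reachable from WANT, excluding only what the
# revision walker marks as uninteresting near the HAVE boundary.
approx_new_objects() {
    git rev-list --objects "$1..$2" | cut -d' ' -f1
}

# exact_new_objects HAVE WANT
# Expensive walk: enumerate everything reachable from each side in full,
# then keep only the object IDs that WANT has and HAVE lacks.
exact_new_objects() {
    have=$(mktemp) && want=$(mktemp) || return 1
    git rev-list --objects "$1" | cut -d' ' -f1 | LC_ALL=C sort -u > "$have"
    git rev-list --objects "$2" | cut -d' ' -f1 | LC_ALL=C sort -u > "$want"
    # -13 keeps lines unique to the second file (the WANT side).
    LC_ALL=C comm -13 "$have" "$want"
    rm -f "$have" "$want"
}
```

Comparing `approx_new_objects origin/master HEAD | wc -l` with `exact_new_objects origin/master HEAD | wc -l` shows how many objects the cheaper walk would include unnecessarily. The exact version opens every tree in both histories, which is the costly traversal described above.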

This works pretty well in practice, because trees and blobs from older
history don't tend to resurface verbatim. But as you noticed, there are
certain cases where it does happen, and the number of objects affected
can be quite large (to the point that sending the extra objects is much
more expensive than the cost of doing the extra tree traversal).
Unfortunately there is no "look harder" option you can give to
"git push" when you, as the user, realize this is happening.

If you have pack reachability bitmaps, they do produce a more thorough
answer. So probably:

  git repack -adb
  git push

on the client would make this work as you expect.
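
Before repacking, it may be worth checking whether the repository already has a bitmap. This is a small sketch that assumes bitmaps are stored as *.bitmap files next to the packs under objects/pack; the helper name has_bitmap is made up for illustration.

```shell
# has_bitmap
# Succeed if any pack in the current repository has a .bitmap file.
has_bitmap() {
    pack_dir="$(git rev-parse --git-dir)/objects/pack" || return 1
    for f in "$pack_dir"/*.bitmap; do
        # If the glob matched nothing, $f is the literal pattern and -e fails.
        [ -e "$f" ] && return 0
    done
    return 1
}
```

A possible use is `has_bitmap || git repack -adb`, followed by the push. Note that the repack rewrites all packs into one, so on a 400MB repository it can take a while.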

> Will it upload the larger pack only to have the server correctly handle the duplicates?

Yes, the receiving side should correctly handle the duplicates.

-Peff
