git pull transfers useless files

git.vger.kernel.org archive mirror

* git pull transfers useless files
@ 2012-09-24 17:51 Angelo Borsotti
  2012-09-24 18:59 ` Junio C Hamano
  2012-09-24 19:17 ` Jeff King
  0 siblings, 2 replies; 3+ messages in thread
From: Angelo Borsotti @ 2012-09-24 17:51 UTC
  To: git

Hello,

git pull transfers useless files when called with the --squash option
and the merge=binary attribute.
Consider the following example:

#!/bin/bash

set -v
cd remote
rm -rf * .git/
git init
echo '*.pdf -crlf -diff merge=binary' >.git/info/attributes

touch f1
git add f1
echo "aaa" >f1.pdf
git add f1.pdf
cp <very large pdf file, some 100 Mbytes>.pdf f2.pdf
git add f2.pdf
git commit -m A
cd ..

cd local
rm -rf * .git/
git init
echo '*.pdf -crlf -diff merge=binary' >.git/info/attributes
git remote add remote ../remote

touch f3
git add f3
git commit -m B
git checkout -b develop

echo "bbb" >f2.pdf
git add f2.pdf
git commit -m C
git pull -v --squash remote master

ls
cat <f2.pdf

set +v

Replace <very large pdf file, some 100 Mbytes>.pdf with the path of a pdf file
that is really large and run it.
When it executes the git pull it spends on my computer some 30 seconds,
obviously transferring the pdf file, that then it disregards because of the
merge=binary attribute.
When a commit contains many binary files, the command spends a lot of
time needlessly.

Is it possible to optimize it?

Thank you
-Angelo Borsotti

* Re: git pull transfers useless files
  2012-09-24 17:51 git pull transfers useless files Angelo Borsotti
@ 2012-09-24 18:59 ` Junio C Hamano
  2012-09-24 19:17 ` Jeff King
  1 sibling, 0 replies; 3+ messages in thread
From: Junio C Hamano @ 2012-09-24 18:59 UTC
  To: Angelo Borsotti; +Cc: git

Angelo Borsotti <angelo.borsotti@gmail.com> writes:

> When it executes the git pull it spends on my computer some 30 seconds,
> obviously transferring the pdf file, that then it disregards because of the
> merge=binary attribute.
> When a commit contains many binary files, the command spends a lot of
> time needlessly.

That is neither needless nor useless.

Unless you are saying that having a large pdf file as binary is
useless in your history, that is.  After such a merge or squash
merge, you still need to be able to say "git checkout remote" to
check out their version to inspect, and you did not have that
version of the blob before that "git pull".
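
A minimal sketch of that point, not taken from the thread: once the fetch half of the pull has run, the other side's blob can be inspected through the remote-tracking ref. The temporary directory, the file contents, and the throwaway "demo" identity are assumptions made for illustration, and the one-line file stands in for the large PDF.

```shell
#!/bin/sh
# Sketch only: assumes git is installed; all names and contents are hypothetical.
set -e
work=$(mktemp -d)
cd "$work"

# Throwaway identity so the commit does not depend on user configuration.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# "Their" repository, with a small stand-in for the large PDF.
git init -q remote
git -C remote symbolic-ref HEAD refs/heads/master
echo "their pdf" >remote/f2.pdf
git -C remote add f2.pdf
git -C remote commit -q -m A

# "Our" repository fetches from it, as the first half of a pull would.
git init -q local
cd local
git remote add remote ../remote
git fetch -q remote

# Their version of the blob is now available locally for inspection.
their_type=$(git cat-file -t remote/master:f2.pdf)
their_content=$(git show remote/master:f2.pdf)
echo "type: $their_type, content: $their_content"
```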

* Re: git pull transfers useless files
  2012-09-24 17:51 git pull transfers useless files Angelo Borsotti
  2012-09-24 18:59 ` Junio C Hamano
@ 2012-09-24 19:17 ` Jeff King
  1 sibling, 0 replies; 3+ messages in thread
From: Jeff King @ 2012-09-24 19:17 UTC
  To: Angelo Borsotti; +Cc: git

On Mon, Sep 24, 2012 at 07:51:20PM +0200, Angelo Borsotti wrote:

> #!/bin/bash
> 
> set -v
> cd remote
> rm -rf * .git/
> git init
> echo '*.pdf -crlf -diff merge=binary' >.git/info/attributes
> 
> touch f1
> git add f1
> echo "aaa" >f1.pdf
> git add f1.pdf
> cp <very large pdf file, some 100 Mbytes>.pdf f2.pdf
> git add f2.pdf
> git commit -m A
> cd ..
> 
> cd local
> rm -rf * .git/
> git init
> echo '*.pdf -crlf -diff merge=binary' >.git/info/attributes
> git remote add remote ../remote
> 
> touch f3
> git add f3
> git commit -m B
> git checkout -b develop
> 
> echo "bbb" >f2.pdf
> git add f2.pdf
> git commit -m C
> git pull -v --squash remote master
> 
> ls
> cat <f2.pdf
> 
> set +v
> 
> Replace <very large pdf file, some 100 Mbytes>.pdf with the path of a pdf file
> that is really large and run it.
> When it executes the git pull it spends on my computer some 30 seconds,
> obviously transferring the pdf file, that then it disregards because of the
> merge=binary attribute.

It does not disregard the file. The working tree is left with your
existing version of f2, but note that the index still marks the
conflict. Your next step would be to resolve the conflict in some way.
Towards that end, you can now inspect both sides:

  git show :2:f2.pdf  ;# our side
  git show :3:f2.pdf  ;# their side

Or you can invoke a mergetool to start a third-party merge helper on the
binary files:

  git mergetool

Or you can just resolve in favor of "their" side:

  git checkout --theirs f2.pdf

From your description, I imagine your intent is to simply resolve in
favor of the "ours", and never look at the other side. However, git does
not have enough information to know that.
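
The state described above can be reproduced with small files. The following is a rough sketch rather than the original script: it assumes a temporary directory, one-line stand-ins for the PDFs, and a throwaway "demo" identity. It also spells the pull out as a fetch followed by "git merge --squash", and passes --allow-unrelated-histories, which git versions newer than the one in this thread (2.9 and later) require when the two histories share no commits.

```shell
#!/bin/sh
# Sketch only: assumes git is installed; all names and contents are hypothetical.
set -e
work=$(mktemp -d)
cd "$work"

# Throwaway identity so the commits do not depend on user configuration.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Their side: f2.pdf with one content.
git init -q remote
git -C remote symbolic-ref HEAD refs/heads/master
echo theirs >remote/f2.pdf
git -C remote add f2.pdf
git -C remote commit -q -m A

# Our side: an unrelated history with a different f2.pdf.
git init -q local
cd local
git symbolic-ref HEAD refs/heads/master
mkdir -p .git/info
echo '*.pdf -diff merge=binary' >.git/info/attributes
echo ours >f2.pdf
git add f2.pdf
git commit -q -m C

# Fetch, then squash-merge; the merge stops with a conflict on f2.pdf.
git fetch -q ../remote master
git merge --squash --allow-unrelated-histories FETCH_HEAD || true

# The index records both sides; the working tree keeps our version.
stages=$(git ls-files -u -- f2.pdf | awk '{printf "%s%s", sep, $3; sep=","}')
worktree=$(cat f2.pdf)
ours=$(git show :2:f2.pdf)
theirs=$(git show :3:f2.pdf)
echo "stages: $stages, worktree: $worktree, :2: $ours, :3: $theirs"
```

There is no stage 1 entry here because the two histories have no common ancestor.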

There is no "merge=ours" attribute. Indeed, it would be kind of crazy,
since your result would depend on which direction you were merging,
which is something you only know at the time of the merge. Hence it
makes sense as a command-line option for a strategy, but not as an
attribute of a file.
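
To illustrate the direction dependence, here is a small sketch, again with assumed names and contents: one repository, two branches that both change a binary-merged file, and the same "-X ours" option applied in each direction. The branch name "side" and the file "f.pdf" are hypothetical.

```shell
#!/bin/sh
# Sketch only: assumes git is installed; all names and contents are hypothetical.
set -e
work=$(mktemp -d)
cd "$work"

# Throwaway identity so the commits do not depend on user configuration.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q repo
cd repo
git symbolic-ref HEAD refs/heads/master
mkdir -p .git/info
echo '*.pdf -diff merge=binary' >.git/info/attributes
echo base >f.pdf
git add f.pdf
git commit -q -m base

# Both branches change the same file.
git checkout -q -b side
echo "from side" >f.pdf
git commit -q -a -m side
git checkout -q master
echo "from master" >f.pdf
git commit -q -a -m master

# Direction 1: on master, "ours" means master's version.
git merge -q --squash -X ours side
on_master=$(cat f.pdf)
git reset -q --hard

# Direction 2: on side, the same option now means side's version.
git checkout -q side
git merge -q --squash -X ours master
on_side=$(cat f.pdf)

echo "merged on master: $on_master; merged on side: $on_side"
```

The option is identical in both runs, yet the surviving content differs, which is why the choice belongs to the merge invocation rather than to the file.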

All that being said, we can construct a case where the contents of the
PDF really _don't_ matter at all to the result. Like this:

  # new repo
  git init parent
  cd parent

  # make a commit with a giant file
  echo small >foo.txt
  cp <your-giant-file>.pdf big.pdf
  git add .
  git commit -m one

  # now get rid of the giant file
  git rm big.pdf
  git commit -m two

  # now merge it into another history
  git init ../child
  cd ../child
  echo unrelated >file.txt
  git add .
  git commit -m three
  git pull -v --squash ../parent master

Because we are doing a squash merge, we will throw away most of the
history we fetch, and only ever look at the tip of parent/master (which
in this case does not contain the PDF), and the shared ancestor (which
in this case is empty, since there is no shared history).

So in theory we could get by with fetching all the commits (to do the
history traversal), and the trees and blobs only from the tip commit.
But that is not a good idea in general for two reasons:

  1. Even if that PDF is not used in the actual merge algorithm, the
     contents of the earlier commits are useful for figuring out what
     happened (e.g., when resolving another conflict, you might want to
     refer back via "git log").

  2. It breaks git's reachability assumptions. Git always makes sure
     that if you have object X, you have all of the objects it refers
     to, the ones they refer to, and so on. This assumption underlies
     many of git's operations (e.g., what we need to send to a remote
     who claims to have commit X).

     In this case, since you are using --squash, you could presumably
     throw away the original history after doing the squash merge. But
     it would be quite complex to special-case this in the protocol, and
     almost certainly not worth it for this corner case.
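
The reachability point can be checked with a scaled-down version of the parent/child example. This is a sketch under assumptions: a temporary directory, a one-line stand-in for the giant file, a throwaway "demo" identity, and a plain fetch in place of the pull. It checks that the blob removed in the second commit is absent from the tip tree yet still present in the child's object store after the fetch.

```shell
#!/bin/sh
# Sketch only: assumes git is installed; all names and contents are hypothetical.
set -e
work=$(mktemp -d)
cd "$work"

# Throwaway identity so the commits do not depend on user configuration.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Parent: add a "giant" file in one commit, remove it in the next.
git init -q parent
cd parent
git symbolic-ref HEAD refs/heads/master
echo small >foo.txt
echo "stand-in for a giant file" >big.pdf
git add .
git commit -q -m one
blob=$(git rev-parse HEAD:big.pdf)   # remember the blob's object name
git rm -q big.pdf
git commit -q -m two

# Child: an unrelated history that fetches the parent's master.
git init -q ../child
cd ../child
git symbolic-ref HEAD refs/heads/master
echo unrelated >file.txt
git add .
git commit -q -m three
git fetch -q ../parent master

# The tip commit no longer contains big.pdf ...
if git cat-file -e FETCH_HEAD:big.pdf 2>/dev/null; then in_tip=yes; else in_tip=no; fi
# ... but the blob was transferred, since it is reachable from commit "one".
if git cat-file -e "$blob" 2>/dev/null; then have_blob=yes; else have_blob=no; fi

echo "big.pdf in tip tree: $in_tip; blob present locally: $have_blob"
```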

-Peff
