* git pull transfers useless files
@ 2012-09-24 17:51 Angelo Borsotti
2012-09-24 18:59 ` Junio C Hamano
2012-09-24 19:17 ` Jeff King
0 siblings, 2 replies; 3+ messages in thread
From: Angelo Borsotti @ 2012-09-24 17:51 UTC (permalink / raw)
To: git
Hello,
git pull transfers useless files when called with the --squash option
and the merge=binary attribute.
Consider the following example:
#!/bin/bash

set -v
cd remote
rm -rf * .git/
git init
echo '*.pdf -crlf -diff merge=binary' >.git/info/attributes

touch f1
git add f1
echo "aaa" >f1.pdf
git add f1.pdf
cp <very large pdf file, some 100 Mbytes>.pdf f2.pdf
git add f2.pdf
git commit -m A
cd ..

cd local
rm -rf * .git/
git init
echo '*.pdf -crlf -diff merge=binary' >.git/info/attributes
git remote add remote ../remote

touch f3
git add f3
git commit -m B
git checkout -b develop

echo "bbb" >f2.pdf
git add f2.pdf
git commit -m C
git pull -v --squash remote master

ls
cat <f2.pdf

set +v

Replace <very large pdf file, some 100 Mbytes>.pdf with the path of a pdf file
that is really large and run it.
When it executes the git pull it spends on my computer some 30 seconds,
obviously transferring the pdf file, that then it disregards because of the
merge=binary attribute.
When a commit contains many binary files, the command spends a lot of
time needlessly.
Is it possible to optimize it?
Thank you
-Angelo Borsotti
* Re: git pull transfers useless files
From: Junio C Hamano @ 2012-09-24 18:59 UTC (permalink / raw)
To: Angelo Borsotti; +Cc: git
Angelo Borsotti <angelo.borsotti@gmail.com> writes:
> When it executes the git pull it spends on my computer some 30 seconds,
> obviously transferring the pdf file, that then it disregards because of the
> merge=binary attribute.
> When a commit contains many binary files, the command spends a lot of
> time needlessly.
That is hardly needless or useless.
Unless you are saying that having a large pdf file as binary is
useless in your history, that is. After such a merge or squash
merge, you still need to be able to say "git checkout remote" to
check out their version to inspect, and you did not have that
version of the blob before that "git pull".
* Re: git pull transfers useless files
From: Jeff King @ 2012-09-24 19:17 UTC (permalink / raw)
To: Angelo Borsotti; +Cc: git
On Mon, Sep 24, 2012 at 07:51:20PM +0200, Angelo Borsotti wrote:
> #!/bin/bash
>
> set -v
> cd remote
> rm -rf * .git/
> git init
> echo '*.pdf -crlf -diff merge=binary' >.git/info/attributes
>
> touch f1
> git add f1
> echo "aaa" >f1.pdf
> git add f1.pdf
> cp <very large pdf file, some 100 Mbytes>.pdf f2.pdf
> git add f2.pdf
> git commit -m A
> cd ..
>
> cd local
> rm -rf * .git/
> git init
> echo '*.pdf -crlf -diff merge=binary' >.git/info/attributes
> git remote add remote ../remote
>
> touch f3
> git add f3
> git commit -m B
> git checkout -b develop
>
> echo "bbb" >f2.pdf
> git add f2.pdf
> git commit -m C
> git pull -v --squash remote master
>
> ls
> cat <f2.pdf
>
> set +v
>
> Replace <very large pdf file, some 100 Mbytes>.pdf with the path of a pdf file
> that is really large and run it.
> When it executes the git pull it spends on my computer some 30 seconds,
> obviously transferring the pdf file, that then it disregards because of the
> merge=binary attribute.
It does not disregard the file. The working tree is left with your
existing version of f2, but note that the index still marks the
conflict. Your next step would be to resolve the conflict in some way.
Towards that end, you can now inspect both sides:

  git show :2:f2.pdf ;# our side
  git show :3:f2.pdf ;# their side

Or you can invoke a mergetool to start a third-party merge helper on the
binary files:

  git mergetool

Or you can just resolve in favor of "their" side:

  git checkout --theirs f2.pdf

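As a minimal sketch of the inspection step, the script below rebuilds a scaled-down version of the conflict and lists the index stages for f2.pdf. The temporary directory, the "demo" identity and the one-line stand-in for the large PDF are assumptions made for illustration. It runs "git fetch" followed by "git merge --squash", the two steps that "git pull --squash" performs, and passes --allow-unrelated-histories, which newer git versions require when the two histories share no ancestor.

```shell
#!/bin/sh
set -e

# Hypothetical scratch area and identity, used only for this demo.
work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# "remote" side: commit A with a stand-in for the large PDF.
git init -q "$work/remote"
cd "$work/remote"
git symbolic-ref HEAD refs/heads/master
echo "their contents" >f2.pdf
git add f2.pdf
git commit -q -m A

# "local" side: commit C with a different f2.pdf and the binary merge driver.
git init -q "$work/local"
cd "$work/local"
git symbolic-ref HEAD refs/heads/master
mkdir -p .git/info
echo '*.pdf -diff merge=binary' >.git/info/attributes
echo "bbb" >f2.pdf
git add f2.pdf
git commit -q -m C

# Fetch, then squash-merge; the merge stops with a conflict on f2.pdf.
git fetch -q ../remote master
git merge --squash --allow-unrelated-histories FETCH_HEAD || true

# The index keeps both sides as separate stages (2 = ours, 3 = theirs).
git ls-files -u f2.pdf
stages=$(git ls-files -u f2.pdf | wc -l)
ours=$(git show :2:f2.pdf)
theirs=$(git show :3:f2.pdf)
```

Because both sides added the file independently, there is no common-ancestor version, so only stages 2 and 3 should appear in the listing.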
From your description, I imagine your intent is to simply resolve in
favor of the "ours", and never look at the other side. However, git does
not have enough information to know that.
There is no "merge=ours" attribute. Indeed, it would be kind of crazy,
since your result would depend on which direction you were merging,
which is something you only know at the time of the merge. Hence it
makes sense as a command-line option for a merge strategy, but not as
an attribute of a file.
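A sketch of the command-line route, assuming a git version in which the binary merge driver honors the "-X ours" strategy option (the git-merge documentation describes that option as taking the entire contents of a binary file from our side). The scratch directory, the "demo" identity and the file contents are hypothetical, and "git fetch" plus "git merge --squash" stand in for "git pull --squash".

```shell
#!/bin/sh
set -e

# Hypothetical scratch area and identity, used only for this demo.
work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# "remote" side: an extra file f1 plus a stand-in for the large PDF.
git init -q "$work/remote"
cd "$work/remote"
git symbolic-ref HEAD refs/heads/master
touch f1
echo "their contents" >f2.pdf
git add f1 f2.pdf
git commit -q -m A

# "local" side: our own f2.pdf, marked for the binary merge driver.
git init -q "$work/local"
cd "$work/local"
git symbolic-ref HEAD refs/heads/master
mkdir -p .git/info
echo '*.pdf -diff merge=binary' >.git/info/attributes
echo "bbb" >f2.pdf
git add f2.pdf
git commit -q -m C

# The direction is stated at merge time: conflicts go to "our" side.
git fetch -q ../remote master
git merge --squash -X ours --allow-unrelated-histories FETCH_HEAD

# No unmerged entries should remain, and f1 still arrives from their side.
unmerged=$(git ls-files -u)
result=$(cat f2.pdf)
```

Note that the fetch still transfers their version of f2.pdf; the option only changes how the conflict is resolved, not what is downloaded.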
All that being said, we can construct a case where the contents of the
PDF really _don't_ matter at all to the result. Like this:

  # new repo
  git init parent
  cd parent

  # make a commit with a giant file
  echo small >foo.txt
  cp <your-giant-file>.pdf big.pdf
  git add .
  git commit -m one

  # now get rid of the giant file
  git rm big.pdf
  git commit -m two

  # now merge it into another history
  git init ../child
  cd ../child
  echo unrelated >file.txt
  git add .
  git commit -m three
  git pull -v --squash ../parent master

Because we are doing a squash merge, we will throw away most of the
history we fetch, and only ever look at the tip of parent/master (which
in this case does not contain the PDF), and the shared ancestor (which
in this case is empty, since there is no shared history).
So in theory we could get by with fetching all the commits (to do the
history traversal), and the trees and blobs only from the tip commit.
But that is not a good idea in general for two reasons:

  1. Even if that PDF is not used in the actual merge algorithm, the
     contents of the earlier commits are useful for figuring out what
     happened (e.g., when resolving another conflict, you might want to
     refer back via "git log").

  2. It breaks git's reachability assumptions. Git always makes sure
     that if you have object X, you have all of the objects it refers
     to, the ones they refer to, and so on. This assumption underlies
     many of git's operations (e.g., what we need to send to a remote
     who claims to have commit X).

     In this case, since you are using --squash, you could presumably
     throw away the original history after doing the squash merge. But
     it would be quite complex to special-case this in the protocol, and
     almost certainly not worth it for this corner case.

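The reachability point can be checked with a small sketch of the parent/child case above. It assumes a one-line stand-in for the giant file, a hypothetical scratch directory and "demo" identity, and uses "git fetch" plus "git merge --squash" in place of "git pull --squash". After the squash merge, big.pdf is absent from the working tree, yet its blob has been transferred because it is reachable from the fetched history.

```shell
#!/bin/sh
set -e

# Hypothetical scratch area and identity, used only for this demo.
work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Parent history: add the "giant" file in one commit, delete it in the next.
git init -q "$work/parent"
cd "$work/parent"
git symbolic-ref HEAD refs/heads/master
echo small >foo.txt
echo "stand-in for a giant file" >big.pdf
git add .
git commit -q -m one
git rm -q big.pdf
git commit -q -m two

# Remember the object name of the deleted blob (it lives in commit "one").
blob=$(git rev-parse HEAD~1:big.pdf)

# Child history: unrelated content, then a squash merge of parent's tip.
git init -q "$work/child"
cd "$work/child"
git symbolic-ref HEAD refs/heads/master
echo unrelated >file.txt
git add .
git commit -q -m three
git fetch -q ../parent master
git merge -q --squash --allow-unrelated-histories FETCH_HEAD

# The blob is not part of the merge result, but it was fetched anyway.
type=$(git cat-file -t "$blob")
echo "object $blob has type: $type"
ls
```

The object stays in the child repository until nothing references the fetched commits any more and it is eventually pruned.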
-Peff