* incremental push/pull for large repositories
From: Enrico Weigelt @ 2010-07-10 3:12 UTC
To: git
Hi folks,
I often run into situations where I've rebased branches containing large
files (tens of megabytes per file) and then push them to the remote.
Normally the large files themselves stay untouched, but the history
changes (e.g. commits are reordered, several smaller files change, etc.).
It seems that on each push the whole branch is transferred, including
all the large files, even though they already exist on the remote side.
Is there any way to prevent this?
IMHO it would be enough to negotiate which objects really have to be
transferred before creating and transmitting the actual pack file. That
would save _much_ traffic (and transmission time) in
those situations.
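Just to sketch what I mean with existing plumbing (purely illustrative;
it assumes, only for the sake of the example, that the sender could
query the receiving repository directly via $REMOTE):

  # the sender lists the objects it intends to send; the receiver reports
  # which of them it doesn't have yet, and only those go into the pack
  git rev-list --objects origin/master..HEAD | cut -d' ' -f1 |
    git --git-dir="$REMOTE" cat-file --batch-check |
    grep ' missing$'

Of course the real thing would have to happen inside the push protocol
itself rather than by poking at the remote's object database.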
This could also be used to repair broken repositories remotely:
just repack, leaving out the broken objects, and fetch the missing
ones from remotes.
cu
--
----------------------------------------------------------------------
Enrico Weigelt, metux IT service -- http://www.metux.de/
phone: +49 36207 519931 email: weigelt@metux.de
mobile: +49 151 27565287 icq: 210169427 skype: nekrad666
----------------------------------------------------------------------
Embedded-Linux / Portierung / Opensource-QM / Verteilte Systeme
----------------------------------------------------------------------
* Re: incremental push/pull for large repositories
From: Avery Pennarun @ 2010-07-12 15:17 UTC
To: weigelt; +Cc: git
On Fri, Jul 9, 2010 at 11:12 PM, Enrico Weigelt <weigelt@metux.de> wrote:
> I often run into situations where I've rebased branches containing
> large files (tens of megabytes per file) and then push them to the
> remote. Normally the large files themselves stay untouched, but the
> history changes (e.g. commits are reordered, several smaller files
> change, etc.).
>
> It seems that on each push the whole branch is transferred, including
> all the large files, even though they already exist on the remote
> side. Is there any way to prevent this?
I was hoping someone else would have replied to you with a brilliant
solution to this by now, but I guess not, so I'll try with my limited
knowledge. I've seen this behaviour as well.
From what I understand, git uses an algorithm something like this to
determine which objects need to be transmitted by a push:
- find the latest commit T on the remote side that is also in the
branch you want to push. (This part isn't an exhaustive search, and
might be off by a few commits if both ends have new changes, but this
problem usually happens only with fetch/pull, not push.)
- on the client doing the push, get a list of all objects in the new
commits that aren't already present in commit T, then generate and send
the pack (roughly the enumeration sketched below).
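You can approximate that enumeration yourself with plumbing. A rough
sketch, assuming the remote's tip is available locally as origin/master
(standing in for T):

  # objects git would treat as "new" for the push: everything reachable
  # from HEAD that isn't excluded via the boundary commit T
  git rev-list --objects origin/master..HEAD | wc -l

As far as I can tell, only the trees of the boundary commits get
excluded there, not everything reachable from older history, which is
why reverted content shows up in the list again.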
As you can imagine, this is terribly non-optimal. For example, if you
use 'git revert', it uploads all the objects you reverted, even though
they obviously already existed in the remote repo. Example:
#!/bin/sh
set -e
cd /tmp

# create an empty bare "remote" and a clone to work in
mkdir repo
cd repo
git init --bare
cd ..
git clone repo worktree
cd worktree

# 1000 small files, one commit, one push
for i in $(seq 1000); do echo $i >$i; done
git add .
git commit -m orig
git push -u origin HEAD   # sends about 1000 objects

echo
echo

# touch every file again
for i in $(seq 1000); do echo $i >>$i; done
git commit -a -m doubled
git push                  # sends about 1000 objects

echo
echo

# revert: the resulting content already exists on the remote...
git revert --no-edit HEAD
git push                  # sends about 1000 objects (again!)
The promising-looking "--thin" option to git-push doesn't help this at
all. I don't really know what it does, but whatever it does seems to
be relatively ineffective. (I guess that's why it's not the default.)
You can imagine lots of ways to improve this, of course. There's a
tradeoff between searching the history for old objects (which can be
slow in a huge repo) vs. just sending them and discarding duplicates
on the remote server. For many projects, the tradeoff is an easy one:
just send the files, since they're tiny anyway, and sending them is
much faster than exhaustively searching the history. But as soon as
huge files or huge numbers of files start to get involved, the
situation changes fast.
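To put rough numbers on that for the revert case above, here's a sketch
of the "search the history" side of the tradeoff (run in the worktree
from the script, just after the revert and before the final push;
origin/master stands for whatever the remote's branch points at):

  # objects the next push will treat as new...
  git rev-list --objects origin/master..HEAD | cut -d' ' -f1 | sort >tosend
  # ...versus every object already reachable from the remote's branch:
  git rev-list --objects origin/master | cut -d' ' -f1 | sort >remotehas
  comm -12 tosend remotehas | wc -l   # nearly all of the "new" objects

Walking the entire history like that is exactly the cost that gets
painful once the repository is huge.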
bup (http://github.com/apenwarr/bup) uses a totally different method:
the server sends you its pack .idx files, and you never push any
object that matches anything in those .idx files. That works fine in
bup, because bup repositories are virtually never repacked. (You pay
a time/bandwidth cost when you first talk to a repo, but it's worth it
to potentially avoid re-sending gigabytes worth of data.) But in git,
where repacking happens frequently, this wouldn't fly, because the
indexes change every time someone runs "git gc".
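Just to make the idea concrete with git plumbing (purely a sketch,
assuming you've somehow copied the remote's pack *.idx files into
./remote-idx/):

  # every object id covered by the remote's pack indexes...
  for idx in remote-idx/*.idx; do git show-index <"$idx"; done |
    awk '{print $2}' | sort -u >remote-has
  # ...removed from the list of candidates the push would otherwise send:
  git rev-list --objects origin/master..HEAD | cut -d' ' -f1 | sort >candidates
  comm -23 candidates remote-has   # only these still need transmitting

That misses loose objects on the remote, and the lists go stale as soon
as the remote repacks, which is exactly the problem mentioned above.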
If you came up with a patch to do improved packing/negotiation, I bet
it would be accepted. Of course, it would have to either be optional
or have a decent heuristic for when to enable itself, because *most*
of the time, the default git behaviour is probably as fast as
possible.
Have fun,
Avery