* incremental push/pull for large repositories
From: Enrico Weigelt @ 2010-07-10 3:12 UTC
To: git
Hi folks,
I often run into situations where I've rebased branches containing large
files (tens of megabytes per file) and then push them to the remote.
Normally the large files themselves stay untouched, but the history
changes (e.g. commits are reordered, several smaller files change, etc.).
It seems that on each push the whole branch is transferred, including
all the large files, even though they already exist on the remote side.
Is there any way to prevent this?
IMHO it would be enough to negotiate which objects really have to be
transferred before creating and transmitting the actual pack file. That
would save _much_ traffic (and transmission time) in
those situations.
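Just to sketch what I mean with existing plumbing (purely illustrative;
it assumes, only for the sake of the example, that the sender could
query the receiving repository directly via $REMOTE):

  # the sender lists the objects it intends to send; the receiver reports
  # which of them it doesn't have yet, and only those go into the pack
  git rev-list --objects origin/master..HEAD | cut -d' ' -f1 |
    git --git-dir="$REMOTE" cat-file --batch-check |
    grep ' missing$'

Of course the real thing would have to happen inside the push protocol
itself rather than by poking at the remote's object database.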
This could also be used to repair broken repositories remotely:
just repack, leaving out the broken objects, and fetch the missing
ones from remotes.
cu
--
----------------------------------------------------------------------
Enrico Weigelt, metux IT service -- http://www.metux.de/
phone: +49 36207 519931 email: weigelt@metux.de
mobile: +49 151 27565287 icq: 210169427 skype: nekrad666
----------------------------------------------------------------------
Embedded-Linux / Portierung / Opensource-QM / Verteilte Systeme
----------------------------------------------------------------------
* Re: incremental push/pull for large repositories
From: Avery Pennarun @ 2010-07-12 15:17 UTC
To: weigelt; +Cc: git
On Fri, Jul 9, 2010 at 11:12 PM, Enrico Weigelt <weigelt@metux.de> wrote:
> I often run into situations where I've rebased branches containing
> large files (tens of megabytes per file) and then push them to the
> remote. Normally the large files themselves stay untouched, but the
> history changes (e.g. commits are reordered, several smaller files
> change, etc.).
>
> It seems that on each push the whole branch is transferred, including
> all the large files, even though they already exist on the remote
> side. Is there any way to prevent this?
I was hoping someone else would have replied to you with a brilliant
solution to this by now, but I guess not, so I'll try with my limited
knowledge. I've seen this behaviour as well.
From what I understand, git uses an algorithm something like this to
determine which objects need to be transmitted by a push:
- find the latest commit T on the remote side that is also in the
branch you want to push. (This part isn't an exhaustive search, and
might be off by a few commits if both ends have new changes, but this
problem usually happens only with fetch/pull, not push.)
- on the client doing the push, get a list of all objects in the new
commits that aren't already present in commit T, then generate and send
the pack (roughly the enumeration sketched below).
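You can approximate that enumeration yourself with plumbing. A rough
sketch, assuming the remote's tip is available locally as origin/master
(standing in for T):

  # objects git would treat as "new" for the push: everything reachable
  # from HEAD that isn't excluded via the boundary commit T
  git rev-list --objects origin/master..HEAD | wc -l

As far as I can tell, only the trees of the boundary commits get
excluded there, not everything reachable from older history, which is
why reverted content shows up in the list again.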
As you can imagine, this is terribly non-optimal. For example, if you
use 'git revert', it uploads all the objects you reverted, even though
they obviously already existed in the remote repo. Example:
#!/bin/sh
set -e
cd /tmp

# create an empty bare "remote" and a clone to work in
mkdir repo
cd repo
git init --bare
cd ..
git clone repo worktree
cd worktree

# 1000 small files, one commit, one push
for i in $(seq 1000); do echo $i >$i; done
git add .
git commit -m orig
git push -u origin HEAD   # sends about 1000 objects

echo
echo

# touch every file again
for i in $(seq 1000); do echo $i >>$i; done
git commit -a -m doubled
git push                  # sends about 1000 objects

echo
echo

# revert: the resulting content already exists on the remote...
git revert --no-edit HEAD
git push                  # sends about 1000 objects (again!)
The promising-looking "--thin" option to git-push doesn't help this at
all. I don't really know what it does, but whatever it does seems to
be relatively ineffective. (I guess that's why it's not the default.)
You can imagine lots of ways to improve this, of course. There's a
tradeoff between searching the history for old objects (which can be
slow in a huge repo) vs. just sending them and discarding duplicates
on the remote server. For many projects, the tradeoff is an easy one:
just send the files, since they're tiny anyway, and sending them is
much faster than exhaustively searching the history. But as soon as
huge files or huge numbers of files start to get involved, the
situation changes fast.
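To put rough numbers on that for the revert case above, here's a sketch
of the "search the history" side of the tradeoff (run in the worktree
from the script, just after the revert and before the final push;
origin/master stands for whatever the remote's branch points at):

  # objects the next push will treat as new...
  git rev-list --objects origin/master..HEAD | cut -d' ' -f1 | sort >tosend
  # ...versus every object already reachable from the remote's branch:
  git rev-list --objects origin/master | cut -d' ' -f1 | sort >remotehas
  comm -12 tosend remotehas | wc -l   # nearly all of the "new" objects

Walking the entire history like that is exactly the cost that gets
painful once the repository is huge.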
bup (http://github.com/apenwarr/bup) uses a totally different method:
the server sends you its pack .idx files, and you never push any
object that matches anything in those .idx files. That works fine in
bup, because bup repositories are virtually never repacked. (You pay
a time/bandwidth cost when you first talk to a repo, but it's worth it
to potentially avoid re-sending gigabytes worth of data.) But in git,
where repacking happens frequently, this wouldn't fly, because the
indexes change every time someone runs "git gc".
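Just to make the idea concrete with git plumbing (purely a sketch,
assuming you've somehow copied the remote's pack *.idx files into
./remote-idx/):

  # every object id covered by the remote's pack indexes...
  for idx in remote-idx/*.idx; do git show-index <"$idx"; done |
    awk '{print $2}' | sort -u >remote-has
  # ...removed from the list of candidates the push would otherwise send:
  git rev-list --objects origin/master..HEAD | cut -d' ' -f1 | sort >candidates
  comm -23 candidates remote-has   # only these still need transmitting

That misses loose objects on the remote, and the lists go stale as soon
as the remote repacks, which is exactly the problem mentioned above.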
If you came up with a patch to do improved packing/negotiation, I bet
it would be accepted. Of course, it would have to either be optional
or have a decent heuristic for when to enable itself, because *most*
of the time, the default git behaviour is probably as fast as
possible.
Have fun,
Avery