* git-fetching from a big repository is slow
From: Andy Parkins @ 2006-12-14 13:40 UTC
To: git

Hello,

I've got a big repository.  I've got two computers.  One has the repository
up-to-date (164M after repack); one is behind (30M ish).

I used git-fetch to try and update; and the sync took HOURS.  I zipped
the .git directory and transferred that and it took about 15 minutes to
transfer.

Am I doing something wrong?  The git-fetch was done with a git+ssh:// URL.
The zip transfer with scp (so ssh shouldn't be a factor).

Andy
--
Dr Andy Parkins, M Eng (hons), MIEE
* Re: git-fetching from a big repository is slow
From: Andreas Ericsson @ 2006-12-14 13:53 UTC
To: Andy Parkins
Cc: git

Andy Parkins wrote:
> Hello,
>
> I've got a big repository.  I've got two computers.  One has the repository
> up-to-date (164M after repack); one is behind (30M ish).
>
> I used git-fetch to try and update; and the sync took HOURS.  I zipped
> the .git directory and transferred that and it took about 15 minutes to
> transfer.
>
> Am I doing something wrong?  The git-fetch was done with a git+ssh:// URL.
> The zip transfer with scp (so ssh shouldn't be a factor).
>

This seems to happen if your repository consists of many large binary
files, especially many large binary files of several versions that do
not deltify well against each other.  Perhaps it's worth adding gzip
compression detection to git?  I imagine more people than me are tracking
gzipped/bzip2'ed content that pretty much never deltifies well against
anything else.

--
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
* Re: git-fetching from a big repository is slow
From: Johannes Schindelin @ 2006-12-14 14:14 UTC
To: Andreas Ericsson
Cc: Andy Parkins, git

Hi,

On Thu, 14 Dec 2006, Andreas Ericsson wrote:

> Andy Parkins wrote:
> > Hello,
> >
> > I've got a big repository.  I've got two computers.  One has the repository
> > up-to-date (164M after repack); one is behind (30M ish).
> >
> > I used git-fetch to try and update; and the sync took HOURS.  I zipped the
> > .git directory and transferred that and it took about 15 minutes to
> > transfer.
> >
> > Am I doing something wrong?  The git-fetch was done with a git+ssh:// URL.
> > The zip transfer with scp (so ssh shouldn't be a factor).
>
> This seems to happen if your repository consists of many large binary files,
> especially many large binary files of several versions that do not deltify
> well against each other.  Perhaps it's worth adding gzip compression
> detection to git?  I imagine more people than me are tracking
> gzipped/bzip2'ed content that pretty much never deltifies well against
> anything else.

Or we add something like the heuristics we discovered in another thread,
where rename detection (which is related to delta candidate searching) is
not started if the sizes differ drastically.

Ciao,
Dscho
* Re: git-fetching from a big repository is slow
From: Andreas Ericsson @ 2006-12-14 15:06 UTC
To: Johannes Schindelin
Cc: Andy Parkins, git

Johannes Schindelin wrote:
> Hi,
>
> On Thu, 14 Dec 2006, Andreas Ericsson wrote:
>
>> Andy Parkins wrote:
>>> Hello,
>>>
>>> I've got a big repository.  I've got two computers.  One has the repository
>>> up-to-date (164M after repack); one is behind (30M ish).
>>>
>>> I used git-fetch to try and update; and the sync took HOURS.  I zipped the
>>> .git directory and transferred that and it took about 15 minutes to
>>> transfer.
>>>
>>> Am I doing something wrong?  The git-fetch was done with a git+ssh:// URL.
>>> The zip transfer with scp (so ssh shouldn't be a factor).
>>>
>> This seems to happen if your repository consists of many large binary files,
>> especially many large binary files of several versions that do not deltify
>> well against each other.  Perhaps it's worth adding gzip compression
>> detection to git?  I imagine more people than me are tracking
>> gzipped/bzip2'ed content that pretty much never deltifies well against
>> anything else.
>
> Or we add something like the heuristics we discovered in another thread,
> where rename detection (which is related to delta candidate searching) is
> not started if the sizes differ drastically.
>

It wouldn't work for this particular case though.  In our distribution
repository we have ~300 bzip2 compressed tarballs with an average size of
3MiB.  240 of those are between 2.5 and 4 MiB, so they don't drastically
differ, but neither do they delta well.

One option would be to add some sort of config option to skip attempting
deltas of files with a certain suffix.  That way we could just tell it to
ignore *.gz, *.tgz, *.bz2 and everything would work just as it does today,
but a lot faster.  (A minimal sketch of such a check follows this message.)

--
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
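A minimal sketch of the suffix-based skip Andreas describes, assuming a
hard-coded list rather than a real config option; the function name and
the suffix list are illustrative only and not existing git code:

    #include <string.h>

    /*
     * Illustrative only: decide whether a path looks like pre-compressed
     * content that is not worth handing to the delta searcher.  A real
     * implementation would read the suffix list from the repository
     * configuration instead of hard-coding it here.
     */
    static int skip_delta_for_path(const char *path)
    {
            static const char *suffixes[] = { ".gz", ".tgz", ".bz2", NULL };
            size_t len = strlen(path);
            int i;

            for (i = 0; suffixes[i]; i++) {
                    size_t slen = strlen(suffixes[i]);
                    if (len >= slen && !strcmp(path + len - slen, suffixes[i]))
                            return 1;   /* don't try to deltify this blob */
            }
            return 0;
    }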
* Re: git-fetching from a big repository is slow
From: Geert Bosch @ 2006-12-14 19:05 UTC
To: Andreas Ericsson
Cc: Johannes Schindelin, Andy Parkins, git

On Dec 14, 2006, at 10:06, Andreas Ericsson wrote:
> It wouldn't work for this particular case though.  In our
> distribution repository we have ~300 bzip2 compressed tarballs with
> an average size of 3MiB.  240 of those are between 2.5 and 4 MiB, so
> they don't drastically differ, but neither do they delta well.
>
> One option would be to add some sort of config option to skip
> attempting deltas of files with a certain suffix.  That way we could
> just tell it to ignore *.gz, *.tgz, *.bz2 and everything would work
> just as it does today, but a lot faster.

Such special magic based on filenames is always a bad idea.  Tomorrow
somebody comes with .zip files (oh, and of course .ZIP), then it's .jpg's
and other compressed content.  In the end git will be doing lots of magic
and still perform badly on unknown compressed content.

There is a very simple way of detecting compressed files: just look at
the size of the compressed blob and compare against the size of the
expanded blob.  If the compressed blob has a non-trivial size which is
close to the expanded size, assume the file is not interesting as source
or target for deltas.

Example:

    if (compressed_size > expanded_size / 4 * 3 + 1024) {
            /* don't try to deltify if blob doesn't compress well */
            return ...;
    }
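A self-contained sketch of the ratio test Geert proposes, written as if
the deflated size had to be measured on the fly with zlib; the function
name is illustrative, and in git itself the object database already knows
both sizes, so no extra deflate pass would actually be needed:

    #include <stdlib.h>
    #include <zlib.h>

    /*
     * Illustrative only: deflate the blob once and apply the ratio test
     * above.  Returns 1 when the blob should be skipped as a delta
     * source/target, 0 otherwise (including on allocation or zlib
     * failure, where we fall back to the normal path).
     */
    static int looks_precompressed(const unsigned char *buf, uLong expanded_size)
    {
            uLongf compressed_size = compressBound(expanded_size);
            unsigned char *tmp = malloc(compressed_size);
            int skip = 0;

            if (!tmp)
                    return 0;
            if (compress(tmp, &compressed_size, buf, expanded_size) == Z_OK &&
                compressed_size > expanded_size / 4 * 3 + 1024)
                    skip = 1;   /* barely shrinks: likely already compressed */
            free(tmp);
            return skip;
    }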
* Re: git-fetching from a big repository is slow
From: Shawn Pearce @ 2006-12-14 19:46 UTC
To: Geert Bosch
Cc: Andreas Ericsson, Johannes Schindelin, Andy Parkins, git

Geert Bosch <bosch@adacore.com> wrote:
> Such special magic based on filenames is always a bad idea.  Tomorrow
> somebody comes with .zip files (oh, and of course .ZIP), then it's .jpg's
> and other compressed content.  In the end git will be doing lots of magic
> and still perform badly on unknown compressed content.
>
> There is a very simple way of detecting compressed files: just look at
> the size of the compressed blob and compare against the size of the
> expanded blob.  If the compressed blob has a non-trivial size which is
> close to the expanded size, assume the file is not interesting as source
> or target for deltas.
>
> Example:
>     if (compressed_size > expanded_size / 4 * 3 + 1024) {
>             /* don't try to deltify if blob doesn't compress well */
>             return ...;
>     }

And yet I get good delta compression on a number of ZIP formatted
files which don't get good additional zlib compression (<3%).
Doing the above would cause those packfiles to explode to about
10x their current size.
* Re: git-fetching from a big repository is slow
From: Horst H. von Brand @ 2006-12-14 22:12 UTC
To: Shawn Pearce
Cc: Geert Bosch, Andreas Ericsson, Johannes Schindelin, Andy Parkins, git

Shawn Pearce <spearce@spearce.org> wrote:

[...]

> And yet I get good delta compression on a number of ZIP formatted
> files which don't get good additional zlib compression (<3%).

.zip is something like a tar of the compressed files; if the files inside
the archive don't change, the deltas will be small.

--
Dr. Horst H. von Brand                   User #22616 counter.li.org
Departamento de Informatica                    Fono: +56 32 2654431
Universidad Tecnica Federico Santa Maria             +56 32 2654239
* Re: git-fetching from a big repository is slow
From: Shawn Pearce @ 2006-12-14 22:38 UTC
To: Horst H. von Brand
Cc: Geert Bosch, Andreas Ericsson, Johannes Schindelin, Andy Parkins, git

"Horst H. von Brand" <vonbrand@inf.utfsm.cl> wrote:
> Shawn Pearce <spearce@spearce.org> wrote:
>
> [...]
>
> > And yet I get good delta compression on a number of ZIP formatted
> > files which don't get good additional zlib compression (<3%).
>
> .zip is something like a tar of the compressed files; if the files inside
> the archive don't change, the deltas will be small.

Yes, especially when the new zip is made using the exact same software
with the same parameters, so the resulting compressed file stream is
identical for files whose content has not changed.  :-)

Since this is actually a JAR full of Java classes which have been
recompiled, it's even more interesting that javac produced an identical
class file given the same input.  I've seen times where it doesn't,
thanks to the automatic serialVersionUID field being somewhat randomly
generated.
* Re: git-fetching from a big repository is slow
From: Pazu @ 2006-12-15 21:49 UTC
To: git

Shawn Pearce <spearce <at> spearce.org> writes:
> identical class file given the same input.  I've seen times where
> it doesn't, thanks to the automatic serialVersionUID field being
> somewhat randomly generated.

Probably offline, but… serialVersionUID isn't randomly generated.  It's
calculated using the types of fields in the class, recursively.  The
actual algorithm is quite arbitrary, but not random.  The automatically
generated serialVersionUID should change only if you add/remove class
fields (either on the class itself, or to the class of nested objects).

*sigh* Java chases me.  8+ hours of java work everyday, and when I
finally get home… there it is, looking at me again.  *sob*

--
Pazu
* Re: git-fetching from a big repository is slow
From: Robin Rosenberg @ 2006-12-16 13:32 UTC
To: Pazu
Cc: git

On Friday 15 December 2006 22:49, Pazu wrote:
> Shawn Pearce <spearce <at> spearce.org> writes:
> > identical class file given the same input.  I've seen times where
> > it doesn't, thanks to the automatic serialVersionUID field being
> > somewhat randomly generated.
>
> Probably offline, but… serialVersionUID isn't randomly generated.  It's
> calculated using the types of fields in the class, recursively.  The
> actual algorithm is quite arbitrary, but not random.  The automatically
> generated serialVersionUID should change only if you add/remove class
> fields (either on the class itself, or to the class of nested objects).

Different java compilers (e.g. SUN's javac and Eclipse) generate slightly
different code for some cases, including some synthetic member fields that
get involved in the UID calculation.  Neither compiler is wrong; the Java
specifications don't cover all cases.
* Re: git-fetching from a big repository is slow
From: Geert Bosch @ 2006-12-14 23:01 UTC
To: Shawn Pearce
Cc: Andreas Ericsson, Johannes Schindelin, Andy Parkins, git

On Dec 14, 2006, at 14:46, Shawn Pearce wrote:
> And yet I get good delta compression on a number of ZIP formatted
> files which don't get good additional zlib compression (<3%).
> Doing the above would cause those packfiles to explode to about
> 10x their current size.

Yes, that's because for zip files each file in the archive is compressed
independently.  Similar things might happen when checking in uncompressed
tar files with JPGs.

The question is whether you prefer bad time usage or bad space usage when
handling large binary blobs.  Maybe we should use a faster, less precise
algorithm instead of giving up.

Still, I think doing anything based on filename is a mistake.  If we want
to have a heuristic to prevent spending too much time on deltifying large
compressed files, the heuristic should be based on content, not filename.
Maybe we could use some "magic" as used by the file(1) command that allows
git to say a bit more about the content of blobs.  This could be used both
for ordering files during deltification and to determine whether to try
deltification at all.

-Geert
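A rough sketch of the kind of content sniffing Geert has in mind: look at
a few well-known magic bytes at the start of a blob, the way file(1) does.
The function name is illustrative and the list is deliberately tiny; it is
not a proposal for git's actual heuristics:

    #include <string.h>

    /*
     * Illustrative only: classify a blob by its leading magic bytes so
     * the packing code could skip or deprioritise delta attempts on
     * content that is already compressed.
     */
    static int looks_like_compressed_content(const unsigned char *buf,
                                             unsigned long len)
    {
            if (len >= 2 && buf[0] == 0x1f && buf[1] == 0x8b)
                    return 1;   /* gzip */
            if (len >= 3 && !memcmp(buf, "BZh", 3))
                    return 1;   /* bzip2 */
            if (len >= 4 && !memcmp(buf, "PK\x03\x04", 4))
                    return 1;   /* zip/jar (though these may still delta well) */
            if (len >= 3 && !memcmp(buf, "\xff\xd8\xff", 3))
                    return 1;   /* jpeg */
            return 0;
    }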
* Re: git-fetching from a big repository is slow
From: Johannes Schindelin @ 2006-12-14 23:15 UTC
To: Shawn Pearce
Cc: Geert Bosch, Andreas Ericsson, Andy Parkins, git

Hi,

On Thu, 14 Dec 2006, Shawn Pearce wrote:

> Geert Bosch <bosch@adacore.com> wrote:
> > Such special magic based on filenames is always a bad idea.  Tomorrow
> > somebody comes with .zip files (oh, and of course .ZIP), then it's .jpg's
> > and other compressed content.  In the end git will be doing lots of magic
> > and still perform badly on unknown compressed content.
> >
> > There is a very simple way of detecting compressed files: just look at
> > the size of the compressed blob and compare against the size of the
> > expanded blob.  If the compressed blob has a non-trivial size which is
> > close to the expanded size, assume the file is not interesting as source
> > or target for deltas.
> >
> > Example:
> >     if (compressed_size > expanded_size / 4 * 3 + 1024) {
> >             /* don't try to deltify if blob doesn't compress well */
> >             return ...;
> >     }
>
> And yet I get good delta compression on a number of ZIP formatted files
> which don't get good additional zlib compression (<3%).  Doing the above
> would cause those packfiles to explode to about 10x their current size.

A pity.  Geert's proposition sounded good to me.

However, there's got to be a way to cut short the search for a delta
base/deltification when a certain (maybe even configurable) amount of time
has been spent on it.

Ciao,
Dscho
* Re: git-fetching from a big repository is slow
From: Shawn Pearce @ 2006-12-14 23:29 UTC
To: Johannes Schindelin
Cc: Geert Bosch, Andreas Ericsson, Andy Parkins, git

Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> On Thu, 14 Dec 2006, Shawn Pearce wrote:
> > Geert Bosch <bosch@adacore.com> wrote:
> > >     if (compressed_size > expanded_size / 4 * 3 + 1024) {
> > >             /* don't try to deltify if blob doesn't compress well */
> > >             return ...;
> > >     }
> >
> > And yet I get good delta compression on a number of ZIP formatted files
> > which don't get good additional zlib compression (<3%).  Doing the above
> > would cause those packfiles to explode to about 10x their current size.
>
> A pity.  Geert's proposition sounded good to me.
>
> However, there's got to be a way to cut short the search for a delta
> base/deltification when a certain (maybe even configurable) amount of time
> has been spent on it.

I'm not sure time is the best rule there.

Maybe if the object is large (e.g. over 512 KiB or some configured limit)
and did not compress well when we last deflated it (e.g. Geert's rule
above), then only try to delta it against another object whose hinted
filename is very close/exactly matches and whose size is very close, and
don't make nearly as many attempts on the matching hunks within any two
files if the file appears to be binary and not text.

I'm OK with a small increase in packfile size as a result of slightly
less optimal delta base selection on the really large binary files due
to something like the above, but 10x is insane.
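A sketch of the gate Shawn outlines, expressed as a predicate over two
delta candidates.  The 512 KiB limit comes from his message; the 10% size
window and the function name are made up here purely for illustration:

    #include <string.h>

    #define BIG_BLOB_LIMIT  (512 * 1024)    /* illustrative; would be configurable */

    /*
     * Illustrative only: for a large, poorly-compressing object, accept
     * another object as a delta candidate only when its hinted filename
     * matches and its size is within roughly 10% of ours.  Everything
     * else keeps the normal delta search.
     */
    static int acceptable_delta_pair(unsigned long size_a, const char *name_a,
                                     unsigned long size_b, const char *name_b,
                                     int compresses_poorly)
    {
            unsigned long big = size_a > size_b ? size_a : size_b;
            unsigned long small = size_a > size_b ? size_b : size_a;

            if (big < BIG_BLOB_LIMIT || !compresses_poorly)
                    return 1;   /* normal objects: no extra restriction */
            if (!name_a || !name_b || strcmp(name_a, name_b))
                    return 0;   /* large binaries: require matching name hints */
            return big - small <= big / 10;     /* ...and very similar sizes */
    }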
* Re: git-fetching from a big repository is slow
From: Johannes Schindelin @ 2006-12-15 0:07 UTC
To: Shawn Pearce
Cc: Geert Bosch, Andreas Ericsson, Andy Parkins, git

Hi,

On Thu, 14 Dec 2006, Shawn Pearce wrote:

> I'm OK with a small increase in packfile size as a result of slightly
> less optimal delta base selection on the really large binary files due
> to something like the above, but 10x is insane.

Not if it is a server having to do all the work, along with all the work
for all other clients.  When you do a fetch, you really should be nice to
the serving side.

Ciao,
Dscho
* Re: git-fetching from a big repository is slow
From: Shawn Pearce @ 2006-12-15 0:42 UTC
To: Johannes Schindelin
Cc: Geert Bosch, Andreas Ericsson, Andy Parkins, git

Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> On Thu, 14 Dec 2006, Shawn Pearce wrote:
>
> > I'm OK with a small increase in packfile size as a result of slightly
> > less optimal delta base selection on the really large binary files due
> > to something like the above, but 10x is insane.
>
> Not if it is a server having to do all the work, along with all the work
> for all other clients.  When you do a fetch, you really should be nice to
> the serving side.

Yes, that's true.  But I fail to see what that has to do with the part
you quoted above.

A 1% increase in transfer bandwidth may be better for a server if it
halves the CPU usage or disk IO usage, if the server has more bandwidth
than those available; likewise a 1% decrease in transfer bandwidth may be
better for a server if it has lots of CPU to spare but very little network
bandwidth available.

Since every server is different, it's not like we can tune for just one
of those cases and cross our fingers.
* Re: git-fetching from a big repository is slow
From: Nicolas Pitre @ 2006-12-15 2:26 UTC
To: Johannes Schindelin
Cc: Shawn Pearce, Geert Bosch, Andreas Ericsson, Andy Parkins, git

On Fri, 15 Dec 2006, Johannes Schindelin wrote:

> Hi,
>
> On Thu, 14 Dec 2006, Shawn Pearce wrote:
>
> > Geert Bosch <bosch@adacore.com> wrote:
> > > Such special magic based on filenames is always a bad idea.  Tomorrow
> > > somebody comes with .zip files (oh, and of course .ZIP), then it's .jpg's
> > > and other compressed content.  In the end git will be doing lots of magic
> > > and still perform badly on unknown compressed content.
> > >
> > > There is a very simple way of detecting compressed files: just look at
> > > the size of the compressed blob and compare against the size of the
> > > expanded blob.  If the compressed blob has a non-trivial size which is
> > > close to the expanded size, assume the file is not interesting as source
> > > or target for deltas.
> > >
> > > Example:
> > >     if (compressed_size > expanded_size / 4 * 3 + 1024) {
> > >             /* don't try to deltify if blob doesn't compress well */
> > >             return ...;
> > >     }
> >
> > And yet I get good delta compression on a number of ZIP formatted files
> > which don't get good additional zlib compression (<3%).  Doing the above
> > would cause those packfiles to explode to about 10x their current size.
>
> A pity.  Geert's proposition sounded good to me.
>
> However, there's got to be a way to cut short the search for a delta
> base/deltification when a certain (maybe even configurable) amount of time
> has been spent on it.

Yes!  Run git-repack -a -d on the remote repository.
* Re: git-fetching from a big repository is slow
From: Andreas Ericsson @ 2006-12-14 22:28 UTC
To: Geert Bosch
Cc: Johannes Schindelin, Andy Parkins, git

Geert Bosch wrote:
>
> On Dec 14, 2006, at 10:06, Andreas Ericsson wrote:
>
>> It wouldn't work for this particular case though.  In our distribution
>> repository we have ~300 bzip2 compressed tarballs with an average size
>> of 3MiB.  240 of those are between 2.5 and 4 MiB, so they don't
>> drastically differ, but neither do they delta well.
>>
>> One option would be to add some sort of config option to skip
>> attempting deltas of files with a certain suffix.  That way we could
>> just tell it to ignore *.gz, *.tgz, *.bz2 and everything would work just
>> as it does today, but a lot faster.
>
> Such special magic based on filenames is always a bad idea.  Tomorrow
> somebody comes with .zip files (oh, and of course .ZIP), then it's .jpg's
> and other compressed content.  In the end git will be doing lots of magic
> and still perform badly on unknown compressed content.
>

Hence the config option.  People can tell git to skip trying to delta
whatever they want.  For this particular mothership repo, we only ever
work against it when we're at the office, meaning the resulting data size
is not an issue, but data computation can be a real bottleneck.

> There is a very simple way of detecting compressed files: just look at
> the size of the compressed blob and compare against the size of the
> expanded blob.  If the compressed blob has a non-trivial size which is
> close to the expanded size, assume the file is not interesting as source
> or target for deltas.
>
> Example:
>     if (compressed_size > expanded_size / 4 * 3 + 1024) {
>             /* don't try to deltify if blob doesn't compress well */
>             return ...;
>     }
>

Many compression algorithms generate similar output for similar input.
Most source-code projects change relatively little between releases, so
they *could* delta well; it's just that in our repo they don't.

--
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
* Re: git-fetching from a big repository is slow
From: Andy Parkins @ 2006-12-14 15:18 UTC
To: git

On Thursday 2006 December 14 13:53, Andreas Ericsson wrote:

> This seems to happen if your repository consists of many large binary
> files, especially many large binary files of several versions that do
> not deltify well against each other.  Perhaps it's worth adding gzip

It's actually just every released patch to the linux kernel ever issued.
Almost entirely ASCII, and every revision (save the first) created by
patching the previous.

Andy
--
Dr Andy Parkins, M Eng (hons), MIEE
* Re: git-fetching from a big repository is slow
From: Han-Wen Nienhuys @ 2006-12-14 15:45 UTC
To: git

Andy Parkins wrote:
> On Thursday 2006 December 14 13:53, Andreas Ericsson wrote:
>
>> This seems to happen if your repository consists of many large binary
>> files, especially many large binary files of several versions that do
>> not deltify well against each other.  Perhaps it's worth adding gzip
>
> It's actually just every released patch to the linux kernel ever issued.
> Almost entirely ASCII, and every revision (save the first) created by
> patching the previous.

I just noticed that git-fetch now runs git-show-ref --verify on every
tag it encounters.  This seems to slow down fetch over here.

--
Han-Wen Nienhuys - hanwen@xs4all.nl - http://www.xs4all.nl/~hanwen
* Re: git-fetching from a big repository is slow
From: Andy Parkins @ 2006-12-14 16:20 UTC
To: git, hanwen

On Thursday 2006 December 14 15:45, Han-Wen Nienhuys wrote:

> I just noticed that git-fetch now runs git-show-ref --verify on every
> tag it encounters.  This seems to slow down fetch over here.

There aren't any tags in this repository :-)

Andy
--
Dr Andy Parkins, M Eng (hons), MIEE
* Re: git-fetching from a big repository is slow
From: Johannes Schindelin @ 2006-12-14 16:34 UTC
To: Andy Parkins
Cc: git, hanwen

Hi,

On Thu, 14 Dec 2006, Andy Parkins wrote:

> On Thursday 2006 December 14 15:45, Han-Wen Nienhuys wrote:
>
> > I just noticed that git-fetch now runs git-show-ref --verify on every
> > tag it encounters.  This seems to slow down fetch over here.
>
> There aren't any tags in this repository :-)

git-show-ref traverses every single _local_ tag when called.  This is to
overcome the problem that tags can be packed now, so a simple file
existence check is not sufficient.

It would be much faster, probably, if you pack the local refs.  IIRC I
once argued for automatically packing refs (and all refs), but this has
not been picked up, and I do not really care about it either.

Ciao,
Dscho
* Re: git-fetching from a big repository is slow
From: Junio C Hamano @ 2006-12-14 20:41 UTC
To: Johannes Schindelin
Cc: git

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> git-show-ref traverses every single _local_ tag when called.  This is to
> overcome the problem that tags can be packed now, so a simple file
> existence check is not sufficient.

Is "traverses every single _local_ tag" a fact?  It might go through
every single _local_ (possibly stale) packed tag in memory, but it should
not traverse $GIT_DIR/refs/tags.

If I recall correctly, show-ref (1) first checks the filesystem
"$GIT_DIR/$named_ref" and says Ok if found and valid; otherwise (2) checks
packed refs (reads $GIT_DIR/packed-refs if not already).

So that would be at most one open (which may fail in (1)) and one
open+read (in (2)).  Unless we are talking about fork+exec overhead, that
"traverse" should be reasonably fast.

Where is the bottleneck?
* Re: git-fetching from a big repository is slow
From: Johannes Schindelin @ 2006-12-14 23:26 UTC
To: Junio C Hamano
Cc: git

Hi,

On Thu, 14 Dec 2006, Junio C Hamano wrote:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
>
> > git-show-ref traverses every single _local_ tag when called.  This is to
> > overcome the problem that tags can be packed now, so a simple file
> > existence check is not sufficient.
>
> Is "traverses every single _local_ tag" a fact?  It might go through
> every single _local_ (possibly stale) packed tag in memory, but it should
> not traverse $GIT_DIR/refs/tags.
>
> If I recall correctly, show-ref (1) first checks the filesystem
> "$GIT_DIR/$named_ref" and says Ok if found and valid; otherwise (2) checks
> packed refs (reads $GIT_DIR/packed-refs if not already).

If I read builtin-show-ref.c correctly, it _always_ calls

	for_each_ref(show_ref, NULL);

The only reason that the loop in for_each_ref can stop early is if
show_ref returns something different than 0.  But it does not!  Every
single return in show_ref() returns 0.  It does not matter, though (see
below).

> So that would be at most one open (which may fail in (1)) and one
> open+read (in (2)).  Unless we are talking about fork+exec overhead,
> that "traverse" should be reasonably fast.
>
> Where is the bottleneck?

The problem is that so many stat()s _do_ take time.  Again, if I read the
code correctly, it not only stat()s every loose ref, but also resolves
the refs in get_ref_dir(), which is called from get_loose_refs(), which
is unconditionally called in for_each_ref().

Even if the refs are packed, it takes quite _long_ (I confirmed this).
And it is not at all necessary!  Instead of O(n^2) we can easily reduce
this to O(n*log(n)), and we can replace the n fork()&exec()s of
git-show-ref with a single one.

Ciao,
Dscho
* Re: git-fetching from a big repository is slow
From: Junio C Hamano @ 2006-12-15 0:38 UTC
To: Johannes Schindelin
Cc: git

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> If I read builtin-show-ref.c correctly, it _always_ calls
>
> 	for_each_ref(show_ref, NULL);

Ok, that settles it.  If there is a reason to have --verify, we should
really special case it.  There is no point in looping, because verify
does not do the tail match (which could cause ambiguity) and its answer
should be either "yes it is there" or "no there is no such ref".
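A sketch of the special case Junio suggests: look each ref up directly
instead of walking all refs.  The helper resolve_one_ref() is hypothetical
and merely stands in for git's actual ref-resolution routine; the point is
only that a single lookup per argument replaces the for_each_ref() walk
and the stat() of every loose ref:

    #include <stdio.h>

    /*
     * Hypothetical stand-in for git's ref lookup: returns 0 and fills
     * sha1_hex when refname exists (checking the loose ref file first,
     * then the packed-refs file), non-zero otherwise.
     */
    int resolve_one_ref(const char *refname, char *sha1_hex);

    /*
     * Illustrative only: with --verify there is no tail-matching to do,
     * so a direct lookup per argument is enough; no for_each_ref() walk.
     */
    static int verify_refs(int argc, const char **argv)
    {
            char sha1_hex[41];
            int i;

            for (i = 0; i < argc; i++) {
                    if (resolve_one_ref(argv[i], sha1_hex)) {
                            fprintf(stderr, "fatal: '%s' - not a valid ref\n", argv[i]);
                            return 1;
                    }
                    printf("%s %s\n", sha1_hex, argv[i]);
            }
            return 0;
    }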
* Re: git-fetching from a big repository is slow
From: Nicolas Pitre @ 2006-12-14 18:14 UTC
To: Andreas Ericsson
Cc: Andy Parkins, git

On Thu, 14 Dec 2006, Andreas Ericsson wrote:

> Andy Parkins wrote:
> > Hello,
> >
> > I've got a big repository.  I've got two computers.  One has the repository
> > up-to-date (164M after repack); one is behind (30M ish).
> >
> > I used git-fetch to try and update; and the sync took HOURS.  I zipped the
> > .git directory and transferred that and it took about 15 minutes to
> > transfer.
> >
> > Am I doing something wrong?  The git-fetch was done with a git+ssh:// URL.
> > The zip transfer with scp (so ssh shouldn't be a factor).
>
> This seems to happen if your repository consists of many large binary files,
> especially many large binary files of several versions that do not deltify
> well against each other.  Perhaps it's worth adding gzip compression
> detection to git?  I imagine more people than me are tracking
> gzipped/bzip2'ed content that pretty much never deltifies well against
> anything else.

If your remote repository is fully packed in a single pack, that should
not have any impact on the transfer latency, since no attempt to redeltify
objects against each other is made by default when those objects are in
the same pack.