* git-fetching from a big repository is slow
From: Andy Parkins @ 2006-12-14 13:40 UTC
To: git

Hello,

I've got a big repository.  I've got two computers.  One has the repository
up-to-date (164M after repack); one is behind (30M ish).

I used git-fetch to try and update; and the sync took HOURS.  I zipped
the .git directory and transferred that and it took about 15 minutes to
transfer.

Am I doing something wrong?  The git-fetch was done with a git+ssh:// URL.
The zip transfer with scp (so ssh shouldn't be a factor).

Andy
--
Dr Andy Parkins, M Eng (hons), MIEE
* Re: git-fetching from a big repository is slow
From: Andreas Ericsson @ 2006-12-14 13:53 UTC
To: Andy Parkins
Cc: git

Andy Parkins wrote:
> Hello,
>
> I've got a big repository.  I've got two computers.  One has the repository
> up-to-date (164M after repack); one is behind (30M ish).
>
> I used git-fetch to try and update; and the sync took HOURS.  I zipped
> the .git directory and transferred that and it took about 15 minutes to
> transfer.
>
> Am I doing something wrong?  The git-fetch was done with a git+ssh:// URL.
> The zip transfer with scp (so ssh shouldn't be a factor).
>

This seems to happen if your repository consists of many large binary
files, especially many large binary files of several versions that do
not deltify well against each other.  Perhaps it's worth adding gzip
compression detection to git?  I imagine more people than me are tracking
gzipped/bzip2'ed content that pretty much never deltifies well against
anything else.

--
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
* Re: git-fetching from a big repository is slow
From: Johannes Schindelin @ 2006-12-14 14:14 UTC
To: Andreas Ericsson
Cc: Andy Parkins, git

Hi,

On Thu, 14 Dec 2006, Andreas Ericsson wrote:

> Andy Parkins wrote:
> > Hello,
> >
> > I've got a big repository.  I've got two computers.  One has the repository
> > up-to-date (164M after repack); one is behind (30M ish).
> >
> > I used git-fetch to try and update; and the sync took HOURS.  I zipped the
> > .git directory and transferred that and it took about 15 minutes to
> > transfer.
> >
> > Am I doing something wrong?  The git-fetch was done with a git+ssh:// URL.
> > The zip transfer with scp (so ssh shouldn't be a factor).
>
> This seems to happen if your repository consists of many large binary files,
> especially many large binary files of several versions that do not deltify
> well against each other.  Perhaps it's worth adding gzip compression
> detection to git?  I imagine more people than me are tracking
> gzipped/bzip2'ed content that pretty much never deltifies well against
> anything else.

Or we add something like the heuristics we discovered in another thread,
where rename detection (which is related to delta candidate searching) is
not started if the sizes differ drastically.

Ciao,
Dscho
* Re: git-fetching from a big repository is slow
From: Andreas Ericsson @ 2006-12-14 15:06 UTC
To: Johannes Schindelin
Cc: Andy Parkins, git

Johannes Schindelin wrote:
> Hi,
>
> On Thu, 14 Dec 2006, Andreas Ericsson wrote:
>
>> Andy Parkins wrote:
>>> Hello,
>>>
>>> I've got a big repository.  I've got two computers.  One has the repository
>>> up-to-date (164M after repack); one is behind (30M ish).
>>>
>>> I used git-fetch to try and update; and the sync took HOURS.  I zipped the
>>> .git directory and transferred that and it took about 15 minutes to
>>> transfer.
>>>
>>> Am I doing something wrong?  The git-fetch was done with a git+ssh:// URL.
>>> The zip transfer with scp (so ssh shouldn't be a factor).
>>>
>> This seems to happen if your repository consists of many large binary files,
>> especially many large binary files of several versions that do not deltify
>> well against each other.  Perhaps it's worth adding gzip compression
>> detection to git?  I imagine more people than me are tracking
>> gzipped/bzip2'ed content that pretty much never deltifies well against
>> anything else.
>
> Or we add something like the heuristics we discovered in another thread,
> where rename detection (which is related to delta candidate searching) is
> not started if the sizes differ drastically.
>

It wouldn't work for this particular case though.  In our distribution
repository we have ~300 bzip2 compressed tarballs with an average size of
3MiB.  240 of those are between 2.5 and 4 MiB, so they don't drastically
differ, but neither do they delta well.

One option would be to add some sort of config option to skip attempting
deltas of files with a certain suffix.  That way we could just tell it to
ignore *.gz, *.tgz, *.bz2 and everything would work just as it does today,
but a lot faster.  (A minimal sketch of such a check follows this message.)

--
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
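A minimal sketch of the suffix-based skip Andreas describes, assuming a
hard-coded list rather than a real config option; the function name and
the suffix list are illustrative only and not existing git code:

    #include <string.h>

    /*
     * Illustrative only: decide whether a path looks like pre-compressed
     * content that is not worth handing to the delta searcher.  A real
     * implementation would read the suffix list from the repository
     * configuration instead of hard-coding it here.
     */
    static int skip_delta_for_path(const char *path)
    {
            static const char *suffixes[] = { ".gz", ".tgz", ".bz2", NULL };
            size_t len = strlen(path);
            int i;

            for (i = 0; suffixes[i]; i++) {
                    size_t slen = strlen(suffixes[i]);
                    if (len >= slen && !strcmp(path + len - slen, suffixes[i]))
                            return 1;   /* don't try to deltify this blob */
            }
            return 0;
    }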
* Re: git-fetching from a big repository is slow
From: Geert Bosch @ 2006-12-14 19:05 UTC
To: Andreas Ericsson
Cc: Johannes Schindelin, Andy Parkins, git

On Dec 14, 2006, at 10:06, Andreas Ericsson wrote:
> It wouldn't work for this particular case though.  In our
> distribution repository we have ~300 bzip2 compressed tarballs with
> an average size of 3MiB.  240 of those are between 2.5 and 4 MiB, so
> they don't drastically differ, but neither do they delta well.
>
> One option would be to add some sort of config option to skip
> attempting deltas of files with a certain suffix.  That way we could
> just tell it to ignore *.gz, *.tgz, *.bz2 and everything would work
> just as it does today, but a lot faster.

Such special magic based on filenames is always a bad idea.  Tomorrow
somebody comes with .zip files (oh, and of course .ZIP), then it's .jpg's
and other compressed content.  In the end git will be doing lots of magic
and still perform badly on unknown compressed content.

There is a very simple way of detecting compressed files: just look at
the size of the compressed blob and compare against the size of the
expanded blob.  If the compressed blob has a non-trivial size which is
close to the expanded size, assume the file is not interesting as source
or target for deltas.

Example:

    if (compressed_size > expanded_size / 4 * 3 + 1024) {
            /* don't try to deltify if blob doesn't compress well */
            return ...;
    }
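A self-contained sketch of the ratio test Geert proposes, written as if
the deflated size had to be measured on the fly with zlib; the function
name is illustrative, and in git itself the object database already knows
both sizes, so no extra deflate pass would actually be needed:

    #include <stdlib.h>
    #include <zlib.h>

    /*
     * Illustrative only: deflate the blob once and apply the ratio test
     * above.  Returns 1 when the blob should be skipped as a delta
     * source/target, 0 otherwise (including on allocation or zlib
     * failure, where we fall back to the normal path).
     */
    static int looks_precompressed(const unsigned char *buf, uLong expanded_size)
    {
            uLongf compressed_size = compressBound(expanded_size);
            unsigned char *tmp = malloc(compressed_size);
            int skip = 0;

            if (!tmp)
                    return 0;
            if (compress(tmp, &compressed_size, buf, expanded_size) == Z_OK &&
                compressed_size > expanded_size / 4 * 3 + 1024)
                    skip = 1;   /* barely shrinks: likely already compressed */
            free(tmp);
            return skip;
    }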
* Re: git-fetching from a big repository is slow
From: Shawn Pearce @ 2006-12-14 19:46 UTC
To: Geert Bosch
Cc: Andreas Ericsson, Johannes Schindelin, Andy Parkins, git

Geert Bosch <bosch@adacore.com> wrote:
> Such special magic based on filenames is always a bad idea.  Tomorrow
> somebody comes with .zip files (oh, and of course .ZIP), then it's .jpg's
> and other compressed content.  In the end git will be doing lots of magic
> and still perform badly on unknown compressed content.
>
> There is a very simple way of detecting compressed files: just look at
> the size of the compressed blob and compare against the size of the
> expanded blob.  If the compressed blob has a non-trivial size which is
> close to the expanded size, assume the file is not interesting as source
> or target for deltas.
>
> Example:
>     if (compressed_size > expanded_size / 4 * 3 + 1024) {
>             /* don't try to deltify if blob doesn't compress well */
>             return ...;
>     }

And yet I get good delta compression on a number of ZIP formatted
files which don't get good additional zlib compression (<3%).
Doing the above would cause those packfiles to explode to about
10x their current size.
* Re: git-fetching from a big repository is slow
From: Horst H. von Brand @ 2006-12-14 22:12 UTC
To: Shawn Pearce
Cc: Geert Bosch, Andreas Ericsson, Johannes Schindelin, Andy Parkins, git

Shawn Pearce <spearce@spearce.org> wrote:

[...]

> And yet I get good delta compression on a number of ZIP formatted
> files which don't get good additional zlib compression (<3%).

.zip is something like a tar of the compressed files; if the files inside
the archive don't change, the deltas will be small.

--
Dr. Horst H. von Brand                   User #22616 counter.li.org
Departamento de Informatica                    Fono: +56 32 2654431
Universidad Tecnica Federico Santa Maria             +56 32 2654239
* Re: git-fetching from a big repository is slow
From: Shawn Pearce @ 2006-12-14 22:38 UTC
To: Horst H. von Brand
Cc: Geert Bosch, Andreas Ericsson, Johannes Schindelin, Andy Parkins, git

"Horst H. von Brand" <vonbrand@inf.utfsm.cl> wrote:
> Shawn Pearce <spearce@spearce.org> wrote:
>
> [...]
>
> > And yet I get good delta compression on a number of ZIP formatted
> > files which don't get good additional zlib compression (<3%).
>
> .zip is something like a tar of the compressed files; if the files inside
> the archive don't change, the deltas will be small.

Yes, especially when the new zip is made using the exact same software
with the same parameters, so the resulting compressed file stream is
identical for files whose content has not changed.  :-)

Since this is actually a JAR full of Java classes which have been
recompiled, it's even more interesting that javac produced an identical
class file given the same input.  I've seen times where it doesn't,
thanks to the automatic serialVersionUID field being somewhat randomly
generated.
* Re: git-fetching from a big repository is slow
From: Pazu @ 2006-12-15 21:49 UTC
To: git

Shawn Pearce <spearce <at> spearce.org> writes:
> identical class file given the same input.  I've seen times where
> it doesn't, thanks to the automatic serialVersionUID field being
> somewhat randomly generated.

Probably offline, but… serialVersionUID isn't randomly generated.  It's
calculated using the types of fields in the class, recursively.  The
actual algorithm is quite arbitrary, but not random.  The automatically
generated serialVersionUID should change only if you add/remove class
fields (either on the class itself, or to the class of nested objects).

*sigh* Java chases me.  8+ hours of java work everyday, and when I
finally get home… there it is, looking at me again.  *sob*

--
Pazu
* Re: git-fetching from a big repository is slow
From: Robin Rosenberg @ 2006-12-16 13:32 UTC
To: Pazu
Cc: git

On Friday 15 December 2006 22:49, Pazu wrote:
> Shawn Pearce <spearce <at> spearce.org> writes:
> > identical class file given the same input.  I've seen times where
> > it doesn't, thanks to the automatic serialVersionUID field being
> > somewhat randomly generated.
>
> Probably offline, but… serialVersionUID isn't randomly generated.  It's
> calculated using the types of fields in the class, recursively.  The
> actual algorithm is quite arbitrary, but not random.  The automatically
> generated serialVersionUID should change only if you add/remove class
> fields (either on the class itself, or to the class of nested objects).

Different java compilers (e.g. SUN's javac and Eclipse) generate slightly
different code for some cases, including some synthetic member fields that
get involved in the UID calculation.  Neither compiler is wrong; the Java
specifications don't cover all cases.
* Re: git-fetching from a big repository is slow
From: Geert Bosch @ 2006-12-14 23:01 UTC
To: Shawn Pearce
Cc: Andreas Ericsson, Johannes Schindelin, Andy Parkins, git

On Dec 14, 2006, at 14:46, Shawn Pearce wrote:
> And yet I get good delta compression on a number of ZIP formatted
> files which don't get good additional zlib compression (<3%).
> Doing the above would cause those packfiles to explode to about
> 10x their current size.

Yes, that's because for zip files each file in the archive is compressed
independently.  Similar things might happen when checking in uncompressed
tar files with JPGs.

The question is whether you prefer bad time usage or bad space usage when
handling large binary blobs.  Maybe we should use a faster, less precise
algorithm instead of giving up.

Still, I think doing anything based on filename is a mistake.  If we want
to have a heuristic to prevent spending too much time on deltifying large
compressed files, the heuristic should be based on content, not filename.
Maybe we could use some "magic" as used by the file(1) command that allows
git to say a bit more about the content of blobs.  This could be used both
for ordering files during deltification and to determine whether to try
deltification at all.

-Geert
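A rough sketch of the kind of content sniffing Geert has in mind: look at
a few well-known magic bytes at the start of a blob, the way file(1) does.
The function name is illustrative and the list is deliberately tiny; it is
not a proposal for git's actual heuristics:

    #include <string.h>

    /*
     * Illustrative only: classify a blob by its leading magic bytes so
     * the packing code could skip or deprioritise delta attempts on
     * content that is already compressed.
     */
    static int looks_like_compressed_content(const unsigned char *buf,
                                             unsigned long len)
    {
            if (len >= 2 && buf[0] == 0x1f && buf[1] == 0x8b)
                    return 1;   /* gzip */
            if (len >= 3 && !memcmp(buf, "BZh", 3))
                    return 1;   /* bzip2 */
            if (len >= 4 && !memcmp(buf, "PK\x03\x04", 4))
                    return 1;   /* zip/jar (though these may still delta well) */
            if (len >= 3 && !memcmp(buf, "\xff\xd8\xff", 3))
                    return 1;   /* jpeg */
            return 0;
    }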
* Re: git-fetching from a big repository is slow
From: Johannes Schindelin @ 2006-12-14 23:15 UTC
To: Shawn Pearce
Cc: Geert Bosch, Andreas Ericsson, Andy Parkins, git

Hi,

On Thu, 14 Dec 2006, Shawn Pearce wrote:

> Geert Bosch <bosch@adacore.com> wrote:
> > Such special magic based on filenames is always a bad idea.  Tomorrow
> > somebody comes with .zip files (oh, and of course .ZIP), then it's .jpg's
> > and other compressed content.  In the end git will be doing lots of magic
> > and still perform badly on unknown compressed content.
> >
> > There is a very simple way of detecting compressed files: just look at
> > the size of the compressed blob and compare against the size of the
> > expanded blob.  If the compressed blob has a non-trivial size which is
> > close to the expanded size, assume the file is not interesting as source
> > or target for deltas.
> >
> > Example:
> >     if (compressed_size > expanded_size / 4 * 3 + 1024) {
> >             /* don't try to deltify if blob doesn't compress well */
> >             return ...;
> >     }
>
> And yet I get good delta compression on a number of ZIP formatted files
> which don't get good additional zlib compression (<3%).  Doing the above
> would cause those packfiles to explode to about 10x their current size.

A pity.  Geert's proposition sounded good to me.

However, there's got to be a way to cut short the search for a delta
base/deltification when a certain (maybe even configurable) amount of time
has been spent on it.

Ciao,
Dscho
* Re: git-fetching from a big repository is slow
From: Shawn Pearce @ 2006-12-14 23:29 UTC
To: Johannes Schindelin
Cc: Geert Bosch, Andreas Ericsson, Andy Parkins, git

Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> On Thu, 14 Dec 2006, Shawn Pearce wrote:
> > Geert Bosch <bosch@adacore.com> wrote:
> > >     if (compressed_size > expanded_size / 4 * 3 + 1024) {
> > >             /* don't try to deltify if blob doesn't compress well */
> > >             return ...;
> > >     }
> >
> > And yet I get good delta compression on a number of ZIP formatted files
> > which don't get good additional zlib compression (<3%).  Doing the above
> > would cause those packfiles to explode to about 10x their current size.
>
> A pity.  Geert's proposition sounded good to me.
>
> However, there's got to be a way to cut short the search for a delta
> base/deltification when a certain (maybe even configurable) amount of time
> has been spent on it.

I'm not sure time is the best rule there.

Maybe if the object is large (e.g. over 512 KiB or some configured limit)
and did not compress well when we last deflated it (e.g. Geert's rule
above), then only try to delta it against another object whose hinted
filename is very close/exactly matches and whose size is very close, and
don't make nearly as many attempts on the matching hunks within any two
files if the file appears to be binary and not text.

I'm OK with a small increase in packfile size as a result of slightly
less optimal delta base selection on the really large binary files due
to something like the above, but 10x is insane.
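A sketch of the gate Shawn outlines, expressed as a predicate over two
delta candidates.  The 512 KiB limit comes from his message; the 10% size
window and the function name are made up here purely for illustration:

    #include <string.h>

    #define BIG_BLOB_LIMIT  (512 * 1024)    /* illustrative; would be configurable */

    /*
     * Illustrative only: for a large, poorly-compressing object, accept
     * another object as a delta candidate only when its hinted filename
     * matches and its size is within roughly 10% of ours.  Everything
     * else keeps the normal delta search.
     */
    static int acceptable_delta_pair(unsigned long size_a, const char *name_a,
                                     unsigned long size_b, const char *name_b,
                                     int compresses_poorly)
    {
            unsigned long big = size_a > size_b ? size_a : size_b;
            unsigned long small = size_a > size_b ? size_b : size_a;

            if (big < BIG_BLOB_LIMIT || !compresses_poorly)
                    return 1;   /* normal objects: no extra restriction */
            if (!name_a || !name_b || strcmp(name_a, name_b))
                    return 0;   /* large binaries: require matching name hints */
            return big - small <= big / 10;     /* ...and very similar sizes */
    }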
* Re: git-fetching from a big repository is slow
From: Johannes Schindelin @ 2006-12-15 0:07 UTC
To: Shawn Pearce
Cc: Geert Bosch, Andreas Ericsson, Andy Parkins, git

Hi,

On Thu, 14 Dec 2006, Shawn Pearce wrote:

> I'm OK with a small increase in packfile size as a result of slightly
> less optimal delta base selection on the really large binary files due
> to something like the above, but 10x is insane.

Not if it is a server having to do all the work, along with all the work
for all other clients.  When you do a fetch, you really should be nice to
the serving side.

Ciao,
Dscho
* Re: git-fetching from a big repository is slow
From: Shawn Pearce @ 2006-12-15 0:42 UTC
To: Johannes Schindelin
Cc: Geert Bosch, Andreas Ericsson, Andy Parkins, git

Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> On Thu, 14 Dec 2006, Shawn Pearce wrote:
>
> > I'm OK with a small increase in packfile size as a result of slightly
> > less optimal delta base selection on the really large binary files due
> > to something like the above, but 10x is insane.
>
> Not if it is a server having to do all the work, along with all the work
> for all other clients.  When you do a fetch, you really should be nice to
> the serving side.

Yes, that's true.  But I fail to see what that has to do with the part
you quoted above.

A 1% increase in transfer bandwidth may be better for a server if it
halves the CPU usage or disk IO usage, if the server has more bandwidth
than those available; likewise a 1% decrease in transfer bandwidth may be
better for a server if it has lots of CPU to spare but very little network
bandwidth available.

Since every server is different, it's not like we can tune for just one
of those cases and cross our fingers.
* Re: git-fetching from a big repository is slow
From: Nicolas Pitre @ 2006-12-15 2:26 UTC
To: Johannes Schindelin
Cc: Shawn Pearce, Geert Bosch, Andreas Ericsson, Andy Parkins, git

On Fri, 15 Dec 2006, Johannes Schindelin wrote:

> Hi,
>
> On Thu, 14 Dec 2006, Shawn Pearce wrote:
>
> > Geert Bosch <bosch@adacore.com> wrote:
> > > Such special magic based on filenames is always a bad idea.  Tomorrow
> > > somebody comes with .zip files (oh, and of course .ZIP), then it's .jpg's
> > > and other compressed content.  In the end git will be doing lots of magic
> > > and still perform badly on unknown compressed content.
> > >
> > > There is a very simple way of detecting compressed files: just look at
> > > the size of the compressed blob and compare against the size of the
> > > expanded blob.  If the compressed blob has a non-trivial size which is
> > > close to the expanded size, assume the file is not interesting as source
> > > or target for deltas.
> > >
> > > Example:
> > >     if (compressed_size > expanded_size / 4 * 3 + 1024) {
> > >             /* don't try to deltify if blob doesn't compress well */
> > >             return ...;
> > >     }
> >
> > And yet I get good delta compression on a number of ZIP formatted files
> > which don't get good additional zlib compression (<3%).  Doing the above
> > would cause those packfiles to explode to about 10x their current size.
>
> A pity.  Geert's proposition sounded good to me.
>
> However, there's got to be a way to cut short the search for a delta
> base/deltification when a certain (maybe even configurable) amount of time
> has been spent on it.

Yes!  Run git-repack -a -d on the remote repository.
* Re: git-fetching from a big repository is slow
From: Andreas Ericsson @ 2006-12-14 22:28 UTC
To: Geert Bosch
Cc: Johannes Schindelin, Andy Parkins, git

Geert Bosch wrote:
>
> On Dec 14, 2006, at 10:06, Andreas Ericsson wrote:
>
>> It wouldn't work for this particular case though.  In our distribution
>> repository we have ~300 bzip2 compressed tarballs with an average size
>> of 3MiB.  240 of those are between 2.5 and 4 MiB, so they don't
>> drastically differ, but neither do they delta well.
>>
>> One option would be to add some sort of config option to skip
>> attempting deltas of files with a certain suffix.  That way we could
>> just tell it to ignore *.gz, *.tgz, *.bz2 and everything would work just
>> as it does today, but a lot faster.
>
> Such special magic based on filenames is always a bad idea.  Tomorrow
> somebody comes with .zip files (oh, and of course .ZIP), then it's .jpg's
> and other compressed content.  In the end git will be doing lots of magic
> and still perform badly on unknown compressed content.
>

Hence the config option.  People can tell git to skip trying to delta
whatever they want.  For this particular mothership repo, we only ever
work against it when we're at the office, meaning the resulting data size
is not an issue, but data computation can be a real bottleneck.

> There is a very simple way of detecting compressed files: just look at
> the size of the compressed blob and compare against the size of the
> expanded blob.  If the compressed blob has a non-trivial size which is
> close to the expanded size, assume the file is not interesting as source
> or target for deltas.
>
> Example:
>     if (compressed_size > expanded_size / 4 * 3 + 1024) {
>             /* don't try to deltify if blob doesn't compress well */
>             return ...;
>     }
>

Many compression algorithms generate similar output for similar input.
Most source-code projects change relatively little between releases, so
they *could* delta well; it's just that in our repo they don't.

--
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
* Re: git-fetching from a big repository is slow
From: Andy Parkins @ 2006-12-14 15:18 UTC
To: git

On Thursday 2006 December 14 13:53, Andreas Ericsson wrote:

> This seems to happen if your repository consists of many large binary
> files, especially many large binary files of several versions that do
> not deltify well against each other.  Perhaps it's worth adding gzip

It's actually just every released patch to the linux kernel ever issued.
Almost entirely ASCII, and every revision (save the first) created by
patching the previous.

Andy
--
Dr Andy Parkins, M Eng (hons), MIEE
* Re: git-fetching from a big repository is slow
From: Han-Wen Nienhuys @ 2006-12-14 15:45 UTC
To: git

Andy Parkins wrote:
> On Thursday 2006 December 14 13:53, Andreas Ericsson wrote:
>
>> This seems to happen if your repository consists of many large binary
>> files, especially many large binary files of several versions that do
>> not deltify well against each other.  Perhaps it's worth adding gzip
>
> It's actually just every released patch to the linux kernel ever issued.
> Almost entirely ASCII, and every revision (save the first) created by
> patching the previous.

I just noticed that git-fetch now runs git-show-ref --verify on every
tag it encounters.  This seems to slow down fetch over here.

--
Han-Wen Nienhuys - hanwen@xs4all.nl - http://www.xs4all.nl/~hanwen
* Re: git-fetching from a big repository is slow
From: Andy Parkins @ 2006-12-14 16:20 UTC
To: git, hanwen

On Thursday 2006 December 14 15:45, Han-Wen Nienhuys wrote:

> I just noticed that git-fetch now runs git-show-ref --verify on every
> tag it encounters.  This seems to slow down fetch over here.

There aren't any tags in this repository :-)

Andy
--
Dr Andy Parkins, M Eng (hons), MIEE
* Re: git-fetching from a big repository is slow
From: Johannes Schindelin @ 2006-12-14 16:34 UTC
To: Andy Parkins
Cc: git, hanwen

Hi,

On Thu, 14 Dec 2006, Andy Parkins wrote:

> On Thursday 2006 December 14 15:45, Han-Wen Nienhuys wrote:
>
> > I just noticed that git-fetch now runs git-show-ref --verify on every
> > tag it encounters.  This seems to slow down fetch over here.
>
> There aren't any tags in this repository :-)

git-show-ref traverses every single _local_ tag when called.  This is to
overcome the problem that tags can be packed now, so a simple file
existence check is not sufficient.

It would be much faster, probably, if you pack the local refs.  IIRC I
once argued for automatically packing refs (and all refs), but this has
not been picked up, and I do not really care about it either.

Ciao,
Dscho
* Re: git-fetching from a big repository is slow
From: Junio C Hamano @ 2006-12-14 20:41 UTC
To: Johannes Schindelin
Cc: git

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> git-show-ref traverses every single _local_ tag when called.  This is to
> overcome the problem that tags can be packed now, so a simple file
> existence check is not sufficient.

Is "traverses every single _local_ tag" a fact?  It might go through
every single _local_ (possibly stale) packed tag in memory, but it should
not traverse $GIT_DIR/refs/tags.

If I recall correctly, show-ref (1) first checks the filesystem
"$GIT_DIR/$named_ref" and says Ok if found and valid; otherwise (2) checks
packed refs (reads $GIT_DIR/packed-refs if not already).

So that would be at most one open (which may fail in (1)) and one
open+read (in (2)).  Unless we are talking about fork+exec overhead, that
"traverse" should be reasonably fast.

Where is the bottleneck?
* Re: git-fetching from a big repository is slow
From: Johannes Schindelin @ 2006-12-14 23:26 UTC
To: Junio C Hamano
Cc: git

Hi,

On Thu, 14 Dec 2006, Junio C Hamano wrote:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
>
> > git-show-ref traverses every single _local_ tag when called.  This is to
> > overcome the problem that tags can be packed now, so a simple file
> > existence check is not sufficient.
>
> Is "traverses every single _local_ tag" a fact?  It might go through
> every single _local_ (possibly stale) packed tag in memory, but it should
> not traverse $GIT_DIR/refs/tags.
>
> If I recall correctly, show-ref (1) first checks the filesystem
> "$GIT_DIR/$named_ref" and says Ok if found and valid; otherwise (2) checks
> packed refs (reads $GIT_DIR/packed-refs if not already).

If I read builtin-show-ref.c correctly, it _always_ calls

	for_each_ref(show_ref, NULL);

The only reason that the loop in for_each_ref can stop early is if
show_ref returns something different than 0.  But it does not!  Every
single return in show_ref() returns 0.  It does not matter, though (see
below).

> So that would be at most one open (which may fail in (1)) and one
> open+read (in (2)).  Unless we are talking about fork+exec overhead,
> that "traverse" should be reasonably fast.
>
> Where is the bottleneck?

The problem is that so many stat()s _do_ take time.  Again, if I read the
code correctly, it not only stat()s every loose ref, but also resolves
the refs in get_ref_dir(), which is called from get_loose_refs(), which
is unconditionally called in for_each_ref().

Even if the refs are packed, it takes quite _long_ (I confirmed this).
And it is not at all necessary!  Instead of O(n^2) we can easily reduce
this to O(n*log(n)), and we can replace the n fork()&exec()s of
git-show-ref with a single one.

Ciao,
Dscho
* Re: git-fetching from a big repository is slow
From: Junio C Hamano @ 2006-12-15 0:38 UTC
To: Johannes Schindelin
Cc: git

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> If I read builtin-show-ref.c correctly, it _always_ calls
>
> 	for_each_ref(show_ref, NULL);

Ok, that settles it.  If there is a reason to have --verify, we should
really special case it.  There is no point in looping, because verify
does not do the tail match (which could cause ambiguity) and its answer
should be either "yes it is there" or "no there is no such ref".
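A sketch of the special case Junio suggests: look each ref up directly
instead of walking all refs.  The helper resolve_one_ref() is hypothetical
and merely stands in for git's actual ref-resolution routine; the point is
only that a single lookup per argument replaces the for_each_ref() walk
and the stat() of every loose ref:

    #include <stdio.h>

    /*
     * Hypothetical stand-in for git's ref lookup: returns 0 and fills
     * sha1_hex when refname exists (checking the loose ref file first,
     * then the packed-refs file), non-zero otherwise.
     */
    int resolve_one_ref(const char *refname, char *sha1_hex);

    /*
     * Illustrative only: with --verify there is no tail-matching to do,
     * so a direct lookup per argument is enough; no for_each_ref() walk.
     */
    static int verify_refs(int argc, const char **argv)
    {
            char sha1_hex[41];
            int i;

            for (i = 0; i < argc; i++) {
                    if (resolve_one_ref(argv[i], sha1_hex)) {
                            fprintf(stderr, "fatal: '%s' - not a valid ref\n", argv[i]);
                            return 1;
                    }
                    printf("%s %s\n", sha1_hex, argv[i]);
            }
            return 0;
    }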
* Re: git-fetching from a big repository is slow
From: Nicolas Pitre @ 2006-12-14 18:14 UTC
To: Andreas Ericsson
Cc: Andy Parkins, git

On Thu, 14 Dec 2006, Andreas Ericsson wrote:

> Andy Parkins wrote:
> > Hello,
> >
> > I've got a big repository.  I've got two computers.  One has the repository
> > up-to-date (164M after repack); one is behind (30M ish).
> >
> > I used git-fetch to try and update; and the sync took HOURS.  I zipped the
> > .git directory and transferred that and it took about 15 minutes to
> > transfer.
> >
> > Am I doing something wrong?  The git-fetch was done with a git+ssh:// URL.
> > The zip transfer with scp (so ssh shouldn't be a factor).
>
> This seems to happen if your repository consists of many large binary files,
> especially many large binary files of several versions that do not deltify
> well against each other.  Perhaps it's worth adding gzip compression
> detection to git?  I imagine more people than me are tracking
> gzipped/bzip2'ed content that pretty much never deltifies well against
> anything else.

If your remote repository is fully packed in a single pack, that should
not have any impact on the transfer latency, since no attempt to redeltify
objects against each other is made by default when those objects are in
the same pack.