* 30min Script in git 2.7.4 takes 22+ hrs in git 2.9.3
From: Robert Stryker @ 2017-04-27 16:36 UTC
To: git

Hi all:

The following script attempts to merge 4 git repos into one,
maintaining tag and branch content (but not SHAs). Each original repo
basically gets its own subfolder in the new one. Original repos are
first rewritten to have their history think they always belonged in the
target subfolder.

The problem: the script takes 30 minutes for one environment
including git 2.7.4, and generates a repo of about 30mb. When run by
a coworker using git 2.9.3, it takes 22+ hours and generates a 10gb
repo.

Clearly something here is very wrong. Either there's a pretty horrible
regression or my idea is a pretty bad one ;)

General process for the script:
 - check out 4 repos
 - rewrite their history so they always thought they were in a subfolder
 - copy these 4 rewritten folders to a temporary location
 - get a list of branches and tags for each of the 4 repos
 - initialize a new repo with a readme.md
 - for each unique tag
    - check the 4 rewritten / backed up repos for the tag
    - for each of the 4 rewritten repos:
       - if the tag exists in that repo, merge it into the new repo
         in a test branch
          - git pull --no-edit ../intermediate/oneRewrittenRepo   (SLOW PART)
    - save the tag
 - for each unique branch (same logic)

So... yeah... 30mb + 30 minutes -> 11gb + 22 hours somewhere between
these two versions of git?

According to coworker: during each pass of the Tags' loop it's
sitting for a long time on:

  git pull --no-edit ../intermediate/webtools.common

which runs in its turn

  git fetch --update-head-ok ../intermediate/webtools.common

which in its turn runs

  git-upload-pack ../intermediate/webtools.common

Any ideas here are much appreciated =/

The Script in question is here:
https://gist.github.com/robstryker/4854fc86ab3714a5e1af353b98cbc768

* Re: 30min Script in git 2.7.4 takes 22+ hrs in git 2.9.3
From: Jeff King @ 2017-04-27 20:09 UTC
To: Robert Stryker; +Cc: git

On Thu, Apr 27, 2017 at 12:36:54PM -0400, Robert Stryker wrote:

> The problem: the script takes 30 minutes for one environment
> including git 2.7.4, and generates a repo of about 30mb. When run by
> a coworker using git 2.9.3, it takes 22+ hours and generates a 10gb
> repo.
>
> Clearly something here is very wrong. Either there's a pretty horrible
> regression or my idea is a pretty bad one ;)

The large size makes me think that you're getting an auto-gc in the
middle that is exploding the unreachable objects into loose storage.
This can happen when objects are ready to be pruned, but Git holds on to
them for a grace period (2 weeks by default) as a precaution against
simultaneous use.

Try doing:

  git config gc.auto 0

in the repositories before the slow step. Or alternatively, try:

  git config gc.pruneExpire now

which will continue to do the auto-gc, but throw away unreachable
objects immediately.

Or alternatively, we're failing to run gc at all and just getting tons
of loose objects that need to be packed. What does running "git gc --auto"
say if you run it in the slow repository? Does it improve the disk space
problem?

Even if one of those helps, I'd still like to know why the gc behavior
changed between the two versions. The best way to do that is via
git-bisect. You should be able to do:

  # make sure you can compile git from source
  git clone git://git.kernel.org/pub/scm/git/git.git
  cd git
  make

  git bisect start
  git bisect good v2.7.4
  git bisect bad v2.9.3

  # for each commit bisect dumps you at, run your test. The bin-wrappers
  # part is important, because it sets up the environment to run
  # sub-programs from the built version. And as pull is a shell script,
  # the problem is likely in a sub-program.
  /path/to/git/bin-wrappers/git pull ...

  # And then mark whether it was fast or slow. You obviously don't need
  # to run the program to completion; just enough to decide if it's fast
  # or slow (which might be better done by observing disk space rather
  # than timing).
  git bisect good ;# or "bad" if it was slow

It's going to be tedious even if it takes 30 minutes per iteration. It
might be worth trying to adjust the test case for smaller repos. :)

It may also be worth trying the test with the latest tip of "master".
v2.9.3 is several versions behind, and it's possible that something may
have been fixed since then (nothing comes immediately to mind, but it's
worth a shot).

-Peff

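The suggestion above to judge "fast or slow" by disk usage rather than timing can be sketched roughly as follows. This is a minimal illustration in a throwaway repository, not part of the original script: the temporary directory and the three dummy blobs are assumptions made only so the snippet runs on its own. It disables auto-gc for that one repository with the `gc.auto` setting mentioned above and reads the loose-object count from `git count-objects -v`.

```shell
#!/bin/sh
# Sketch: watch loose-object growth instead of timing a full run.
# The repository here is a scratch one created only for illustration.
set -u

repo=$(mktemp -d)
git init -q "$repo"
cd "$repo" || exit 1

# Disable auto-gc for this repository only, as suggested in the thread.
git config gc.auto 0

# Write a few unreachable loose objects so there is something to count.
for i in 1 2 3; do
    echo "blob $i" | git hash-object -w --stdin >/dev/null
done

# In the -v output, "count" is the number of loose objects and
# "size" is their disk usage in KiB.
loose=$(git count-objects -v | awk '$1 == "count:" { print $2 }')
echo "gc.auto=$(git config gc.auto) loose objects: $loose"
```

In a real bisect step, the same `git count-objects -v` call could be run in the slow repository after a few loop iterations; a count that keeps climbing would point at the "bad" side.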
* Re: 30min Script in git 2.7.4 takes 22+ hrs in git 2.9.3
From: Jeff King @ 2017-04-27 20:42 UTC
To: Robert Stryker; +Cc: git

On Thu, Apr 27, 2017 at 04:09:56PM -0400, Jeff King wrote:

> On Thu, Apr 27, 2017 at 12:36:54PM -0400, Robert Stryker wrote:
>
> > The problem: the script takes 30 minutes for one environment
> > including git 2.7.4, and generates a repo of about 30mb. When run by
> > a coworker using git 2.9.3, it takes 22+ hours and generates a 10gb
> > repo.
> >
> > Clearly something here is very wrong. Either there's a pretty horrible
> > regression or my idea is a pretty bad one ;)
>
> The large size makes me think that you're getting an auto-gc in the
> middle that is exploding the unreachable objects into loose storage.
> This can happen when objects are ready to be pruned, but Git holds on to
> them for a grace period (2 weeks by default) as a precaution against
> simultaneous use.
>
> Try doing:
>
>   git config gc.auto 0
>
> in the repositories before the slow step. Or alternatively, try:
>
>   git config gc.pruneExpire now
>
> which will continue to do the auto-gc, but throw away unreachable
> objects immediately.
>
> Or alternatively, we're failing to run gc at all and just getting tons
> of loose objects that need to be packed. What does running "git gc --auto"
> say if you run it in the slow repository? Does it improve the disk space
> problem?

Fiddling with your script a bit, I have a suspect. Between your two
versions of git, we started disallowing merge of unrelated histories by
default[1]. Which is exactly what your script is doing:

  echo "Merge in the four rewritten projects, with generic commit messages"
  git pull --no-edit webtools.common.fproj
  git pull --no-edit webtools.common
  git pull --no-edit webtools.common.tests
  git pull --no-edit webtools.common.snippets

If you run under "set -e", or just put "|| exit 1" after those, you'll
see that they fail with v2.9.3 and newer.

So what I think is happening is that we never create that shared
history, and then your per-tag work is building further on a nonsense
fake history. That has two implications:

  - as the divergent history in the shared repo gets bigger and bigger,
    the fetches have to do more and more work to try to find a common
    ancestor (but of course they'll never find one, because the two
    histories aren't related)

  - the divergent history racks up tons of unreachable objects, which
    auto-gc won't pack. After a while of the script running, you can see
    that auto-gc fails with "There are too many unreachable loose
    objects" after the pack. Due to the way background gc works these
    days, that blocks further auto-gc from running until the situation
    is resolved. And you just rack up tons of loose objects, which
    explains the disk usage.

Try adding "--allow-unrelated-histories" to your git-pull invocation.

-Peff

[1] See e379fdf34 (merge: refuse to create too cool a merge by default,
    2016-03-18)

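The refusal described above, and the effect of `--allow-unrelated-histories`, can be reproduced in two scratch repositories. This is a minimal sketch, not the poster's script: the repository names `a` and `b`, the demo identity, and the temporary directory are all made up for illustration. The `--no-rebase` flag is also an addition that is not in the thread; it is there only because recent Git versions otherwise stop and ask how to reconcile divergent branches before the merge is attempted.

```shell
#!/bin/sh
# Sketch: reproduce the unrelated-histories refusal in throwaway repos.
# Repo names (a, b) and the identity below are illustrative assumptions.
set -u
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
cd "$work" || exit 1

# Two repositories with no common ancestor.
for name in a b; do
    git init -q "$name"
    echo "$name" > "$name/$name.txt"
    git -C "$name" add "$name.txt"
    git -C "$name" commit -q -m "initial commit in $name"
done

cd a || exit 1

# On Git 2.9 and newer this pull is expected to be refused. Checking the
# exit status is what keeps a script from silently carrying on.
if git pull --no-rebase --no-edit ../b >/dev/null 2>&1; then
    plain_pull=ok
else
    plain_pull=refused
fi

# The same pull with the flag suggested in the thread.
if git pull --no-rebase --no-edit --allow-unrelated-histories ../b >/dev/null 2>&1; then
    flagged_pull=ok
else
    flagged_pull=refused
fi

echo "plain pull:   $plain_pull"
echo "flagged pull: $flagged_pull"
```

The explicit `if` around each pull plays the same role as the "set -e" or "|| exit 1" advice: a failed merge becomes visible right away instead of leaving later steps to build on a history that was never joined.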