* git pull & git gc @ 2015-03-18 13:53 Дилян Палаузов 2015-03-18 14:16 ` Duy Nguyen 0 siblings, 1 reply; 21+ messages in thread From: Дилян Палаузов @ 2015-03-18 13:53 UTC (permalink / raw) To: git Hello, I have a local folder with the git repository (so that its .git/config contains [remote "origin"]\n url = git://github.com/git/git.git\nfetch = +refs/heads/*:refs/remotes/origin/*). I do "git pull" there. Usually the output is Already up-to-date. but since today it prints Auto packing the repository in background for optimum performance. See "git help gc" for manual housekeeping. Already up-to-date. and starts a "git gc --auto" process in the background. This is all fine; however, when the "git gc" process finishes and I do "git pull" again, I get the same message as above (git gc is started again). My understanding is that "git gc" has to be run occasionally, and that the garbage collection is then done for a while. In the concrete case, if "git pull" starts "git gc" in the background and prints a message about it, that is all fine, but when "git pull" is run again a while later, when the garbage collection was recently done, there should be neither a message nor any action from "git gc". My system-wide gitconfig contains "[pack] threads = 1". I have "tar xJf"'ed my local git repository and have put it under http://mail.aegee.org/dpa/v/git-repository.tar.xz The question is: why does "git pull" print information about "git gc" every time it is invoked today? I have git 2.3.3 built with "./configure --with-openssl --with-libpcre --with-curl --with-expat". Thanks in advance for your answer Dilian ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: git pull & git gc 2015-03-18 13:53 git pull & git gc Дилян Палаузов @ 2015-03-18 14:16 ` Duy Nguyen 2015-03-18 14:23 ` Дилян Палаузов 0 siblings, 1 reply; 21+ messages in thread From: Duy Nguyen @ 2015-03-18 14:16 UTC (permalink / raw) To: Дилян Палаузов Cc: Git Mailing List On Wed, Mar 18, 2015 at 8:53 PM, Дилян Палаузов <dilyan.palauzov@aegee.org> wrote: > Hello, > > I have a local folder with the git-repository (so that its .git/config > contains ([remote "origin"]\n url = git://github.com/git/git.git\nfetch = > +refs/heads/*:refs/remotes/origin/* ) > > I do there "git pull". > > Usually the output is > Already up to date > > but since today it prints > Auto packing the repository in background for optimum performance. > See "git help gc" for manual housekeeping. > Already up-to-date. > > and starts in the background a "git gc --auto" process. This is all fine, > however, when the "git gc" process finishes, and I do again "git pull" I get > the same message, as above (git gc is again started). So if you do "git gc --auto" now, does it exit immediately or go through the garbage collection process again (it'll print something)? What does "git count-objects -v" show? -- Duy ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: git pull & git gc 2015-03-18 14:16 ` Duy Nguyen @ 2015-03-18 14:23 ` Дилян Палаузов 2015-03-18 14:33 ` Duy Nguyen 0 siblings, 1 reply; 21+ messages in thread From: Дилян Палаузов @ 2015-03-18 14:23 UTC (permalink / raw) To: Duy Nguyen; +Cc: Git Mailing List Hello, # git gc --auto Auto packing the repository in background for optimum performance. See "git help gc" for manual housekeeping. and calls in the background: 25618 1 0 32451 884 1 14:20 ? 00:00:00 git gc --auto 25639 25618 51 49076 49428 0 14:20 ? 00:00:07 git prune --expire 2.weeks.ago # git count-objects -v count: 6039 size: 65464 in-pack: 185432 packs: 1 size-pack: 46687 prune-packable: 0 garbage: 0 size-garbage: 0 Regards Dilian On 18.03.2015 15:16, Duy Nguyen wrote: > On Wed, Mar 18, 2015 at 8:53 PM, Дилян Палаузов > <dilyan.palauzov@aegee.org> wrote: >> Hello, >> >> I have a local folder with the git-repository (so that its .git/config >> contains ([remote "origin"]\n url = git://github.com/git/git.git\nfetch = >> +refs/heads/*:refs/remotes/origin/* ) >> >> I do there "git pull". >> >> Usually the output is >> Already up to date >> >> but since today it prints >> Auto packing the repository in background for optimum performance. >> See "git help gc" for manual housekeeping. >> Already up-to-date. >> >> and starts in the background a "git gc --auto" process. This is all fine, >> however, when the "git gc" process finishes, and I do again "git pull" I get >> the same message, as above (git gc is again started). > > So if you do "git gc --auto" now, does it exit immediately or go > through the garbage collection process again (it'll print something)? > What does "git count-objects -v" show? > ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: git pull & git gc 2015-03-18 14:23 ` Дилян Палаузов @ 2015-03-18 14:33 ` Duy Nguyen 2015-03-18 14:41 ` Duy Nguyen 2015-03-18 14:48 ` Дилян Палаузов 0 siblings, 2 replies; 21+ messages in thread From: Duy Nguyen @ 2015-03-18 14:33 UTC (permalink / raw) To: Дилян Палаузов Cc: Git Mailing List On Wed, Mar 18, 2015 at 9:23 PM, Дилян Палаузов <dilyan.palauzov@aegee.org> wrote: > Hello, > > # git gc --auto > Auto packing the repository in background for optimum performance. > See "git help gc" for manual housekeeping. > > and calls in the background: > > 25618 1 0 32451 884 1 14:20 ? 00:00:00 git gc --auto > 25639 25618 51 49076 49428 0 14:20 ? 00:00:07 git prune --expire > 2.weeks.ago > > # git count-objects -v > count: 6039 The loose object threshold is 6700, unless you tweaked something. (But there's a tweak; we'll come back to this.) > size: 65464 > in-pack: 185432 > packs: 1 The pack threshold is 50. You only have one pack; good. OK, back to the "count: 6039" above. You have that many loose objects. But 'git gc' is lazier than 'git count-objects': it assumes a flat distribution, counts only the objects in the .git/objects/17 directory, and extrapolates the total number from that. So can you check how many files you have in .git/objects/17? That number, multiplied by 256, should be greater than 6700. If that's the case, "git gc"'s laziness is the problem. If not, I made some mistake in analyzing this and we'll start again. -- Duy ^ permalink raw reply [flat|nested] 21+ messages in thread
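The extrapolation Duy describes can be sketched as a small shell snippet. This is an illustration of the heuristic, not git's actual code; the .git/objects/17 path and the 6700 default (gc.auto) are taken from the thread:

```shell
# Sketch: estimate the total loose-object count the way "gc --auto"
# does, by counting one fan-out directory and multiplying by 256.
objects_dir=.git/objects/17
n17=$(ls "$objects_dir" 2>/dev/null | wc -l)
estimate=$((n17 * 256))
threshold=6700   # default value of gc.auto
if [ "$estimate" -gt "$threshold" ]; then
    echo "gc --auto would trigger ($estimate > $threshold)"
else
    echo "gc --auto would not trigger ($estimate <= $threshold)"
fi
```

Run from the top of a repository, this shows why a repository with many loose objects in the 17/ bucket keeps re-triggering auto-gc even though "git count-objects" reports a count below the threshold is possible, and vice versa: only the one sampled directory matters.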
* Re: git pull & git gc 2015-03-18 14:33 ` Duy Nguyen @ 2015-03-18 14:41 ` Duy Nguyen 2015-03-18 14:58 ` John Keeping 2015-03-18 14:48 ` Дилян Палаузов 1 sibling, 1 reply; 21+ messages in thread From: Duy Nguyen @ 2015-03-18 14:41 UTC (permalink / raw) To: Дилян Палаузов Cc: Git Mailing List On Wed, Mar 18, 2015 at 9:33 PM, Duy Nguyen <pclouds@gmail.com> wrote: > If not, I made some mistake in analyzing this and we'll start again. I did make one mistake: the first "gc" should have reduced the number of loose objects to zero. Why didn't it? I'll come back to this tomorrow if nobody finds out first :) -- Duy ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: git pull & git gc 2015-03-18 14:41 ` Duy Nguyen @ 2015-03-18 14:58 ` John Keeping 2015-03-18 21:04 ` Jeff King 2015-03-19 9:47 ` Duy Nguyen 0 siblings, 2 replies; 21+ messages in thread From: John Keeping @ 2015-03-18 14:58 UTC (permalink / raw) To: Duy Nguyen Cc: Дилян Палаузов, Git Mailing List On Wed, Mar 18, 2015 at 09:41:59PM +0700, Duy Nguyen wrote: > On Wed, Mar 18, 2015 at 9:33 PM, Duy Nguyen <pclouds@gmail.com> wrote: > > If not, I made some mistake in analyzing this and we'll start again. > > I did make one mistake, the first "gc" should have reduced the number > of loose objects to zero. Why didn't it.? I'll come back to this > tomorrow if nobody finds out first :) Most likely they are not referenced by anything but are younger than 2 weeks. I saw a similar issue with automatic gc triggering after every operation when I did something equivalent to: git add <lots of files> git commit git reset --hard HEAD^ which creates a lot of unreachable objects which are not old enough to be pruned. ^ permalink raw reply [flat|nested] 21+ messages in thread
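John's sequence can be reproduced in a scratch repository to watch the loose-object count stay high after the reset (the user name/email below are placeholders needed only so the commit succeeds):

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.name=x -c user.email=x@example.com commit -q --allow-empty -m base
# Create some blobs, commit them, then throw the commit away.
echo one > a
echo two > b
git add a b
git -c user.name=x -c user.email=x@example.com commit -qm tmp
git reset -q --hard HEAD^   # the tmp commit is now off-branch (only the reflog mentions it)
git count-objects -v        # "count:" shows the loose objects, all younger than 2 weeks
```

The blobs and the discarded commit stay loose, and because they are younger than the two-week prune window, a "git gc" immediately afterwards cannot get rid of them.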
* Re: git pull & git gc 2015-03-18 14:58 ` John Keeping @ 2015-03-18 21:04 ` Jeff King 2015-03-19 0:31 ` Duy Nguyen 2015-03-19 9:47 ` Duy Nguyen 1 sibling, 1 reply; 21+ messages in thread From: Jeff King @ 2015-03-18 21:04 UTC (permalink / raw) To: John Keeping Cc: Duy Nguyen, Дилян Палаузов, Git Mailing List On Wed, Mar 18, 2015 at 02:58:15PM +0000, John Keeping wrote: > On Wed, Mar 18, 2015 at 09:41:59PM +0700, Duy Nguyen wrote: > > On Wed, Mar 18, 2015 at 9:33 PM, Duy Nguyen <pclouds@gmail.com> wrote: > > > If not, I made some mistake in analyzing this and we'll start again. > > > > I did make one mistake, the first "gc" should have reduced the number > > of loose objects to zero. Why didn't it.? I'll come back to this > > tomorrow if nobody finds out first :) > > Most likely they are not referenced by anything but are younger than 2 > weeks. > > I saw a similar issue with automatic gc triggering after every operation > when I did something equivalent to: > > git add <lots of files> > git commit > git reset --hard HEAD^ > > which creates a log of unreachable objects which are not old enough to > be pruned. Yes, this is almost certainly the problem. Though to be pedantic, the command above will still have a reflog entry, so the objects will be reachable (and packed). But there are other variants that don't leave the objects reachable from even reflogs. I don't know if there is an easy way around this. Auto-gc's object count is making the assumption that running the gc will reduce the number of objects, but obviously it does not always do so. We could do a more thorough check and find the number of actual packable and prune-able objects. The "prune-able" part of that is easy; just omit objects from the count that are newer than 2 weeks. But "packable" is expensive. You would have to compute reachability by walking from the tips. That can take tens of seconds on a large repo. 
You could perhaps cut off the walk early when you hit a packed commit (this does not strictly imply that all of the related objects are packed, but it would be good enough for a heuristic). But even that is probably too expensive for "gc --auto". -Peff PS Note that in git v2.2.0 and up, prune will leave not only "recent" unreachable objects, but older objects which are reachable from those recent ones (so that we keep or prune whole chunks of history, rather than dropping part and leaving the rest broken). Technically this exacerbates the problem (we keep more objects), though I doubt it makes much difference in practice (most chunks of history were created at similar times, so the mtimes of the whole chunk will be close together). ^ permalink raw reply [flat|nested] 21+ messages in thread
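The "easy" half of the check Peff describes — counting loose objects that are already outside the two-week window — can be approximated from the shell. This is an approximation only; it looks at mtimes and ignores the reachability question entirely:

```shell
# Count loose object files older than 14 days. Loose objects live two
# levels below .git/objects (fan-out directory + filename); pack/ and
# info/ entries also sit at that depth, so exclude them explicitly.
prunable=$(find .git/objects -mindepth 2 -maxdepth 2 -type f -mtime +14 \
    ! -path '*/pack/*' ! -path '*/info/*' 2>/dev/null | wc -l)
echo "loose objects older than 2 weeks: $prunable"
```

In the situation from this thread, this count would be near zero while "git count-objects" reports thousands, which is exactly the gap between what auto-gc measures and what prune is allowed to delete.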
* Re: git pull & git gc 2015-03-18 21:04 ` Jeff King @ 2015-03-19 0:31 ` Duy Nguyen 2015-03-19 1:27 ` Jeff King 0 siblings, 1 reply; 21+ messages in thread From: Duy Nguyen @ 2015-03-19 0:31 UTC (permalink / raw) To: Jeff King Cc: John Keeping, Дилян Палаузов, Git Mailing List On Thu, Mar 19, 2015 at 4:04 AM, Jeff King <peff@peff.net> wrote: > On Wed, Mar 18, 2015 at 02:58:15PM +0000, John Keeping wrote: > >> On Wed, Mar 18, 2015 at 09:41:59PM +0700, Duy Nguyen wrote: >> > On Wed, Mar 18, 2015 at 9:33 PM, Duy Nguyen <pclouds@gmail.com> wrote: >> > > If not, I made some mistake in analyzing this and we'll start again. >> > >> > I did make one mistake, the first "gc" should have reduced the number >> > of loose objects to zero. Why didn't it.? I'll come back to this >> > tomorrow if nobody finds out first :) >> >> Most likely they are not referenced by anything but are younger than 2 >> weeks. >> >> I saw a similar issue with automatic gc triggering after every operation >> when I did something equivalent to: >> >> git add <lots of files> >> git commit >> git reset --hard HEAD^ >> >> which creates a log of unreachable objects which are not old enough to >> be pruned. > > Yes, this is almost certainly the problem. Though to be pedantic, the > command above will still have a reflog entry, so the objects will be > reachable (and packed). But there are other variants that don't leave > the objects reachable from even reflogs. > > I don't know if there is an easy way around this. Auto-gc's object count > is making the assumption that running the gc will reduce the number of > objects, but obviously it does not always do so. We could do a more > thorough check and find the number of actual packable and prune-able > objects. The "prune-able" part of that is easy; just omit objects from > the count that are newer than 2 weeks. But "packable" is expensive. You > would have to compute reachability by walking from the tips. That can > take tens of seconds on a large repo. 
Or we could count/estimate the number of loose objects again after repack/prune. Then we could maybe have a way to prevent the next gc that we know will not improve the situation anyway. One option is to pack the unreachable objects into a second pack. This would stop the next gc, but it would screw prune up because the st_mtime info is gone. Maybe we just save a file telling gc to ignore the number of loose objects until after a specific date. -- Duy ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: git pull & git gc 2015-03-19 0:31 ` Duy Nguyen @ 2015-03-19 1:27 ` Jeff King 2015-03-19 2:01 ` Mike Hommey ` (2 more replies) 0 siblings, 3 replies; 21+ messages in thread From: Jeff King @ 2015-03-19 1:27 UTC (permalink / raw) To: Duy Nguyen Cc: John Keeping, Дилян Палаузов, Git Mailing List On Thu, Mar 19, 2015 at 07:31:48AM +0700, Duy Nguyen wrote: > Or we could count/estimate the number of loose objects again after > repack/prune. Then we can maybe have a way to prevent the next gc that > we know will not improve the situation anyway. One option is pack > unreachable objects in the second pack. This would stop the next gc, > but that would screw prune up because st_mtime info is gone.. Maybe we > just save a file to tell gc to ignore the number of loose objects > until after a specific date. I don't think packing the unreachables is a good plan. They just end up accumulating then, and they never expire, because we keep refreshing their mtime at each pack (unless you pack them once and then leave them to expire, but then you end up with a large number of packs). Keeping a file that says "I ran gc at time T, and there were still N objects left over" is probably the best bet. When the next "gc --auto" runs, if T is recent enough, subtract N from the estimated number of objects. I'm not sure of the right value for "recent enough" there, though. If it is too far back, you will not gc when you could. If it is too close, then you will end up running gc repeatedly, waiting for those objects to leave the expiration window. I guess leaving a bunch of loose objects around longer than necessary isn't the end of the world. It wastes space, but it does not actively make the rest of git slower (whereas having a large number of packs does impact performance). So you could probably make "recent enough" be "T < now - gc.pruneExpire / 4" or something. 
At most we would try to gc 4 times before dropping unreachable objects, and for the default period, that's only once every couple days. -Peff ^ permalink raw reply [flat|nested] 21+ messages in thread
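The bookkeeping Peff proposes might look roughly like this sketch. The state-file name, its one-line "<epoch> <count>" format, and the hard-coded estimate are all hypothetical; the 6700 and two-week values are the defaults mentioned in the thread:

```shell
state=.git/gc-leftover            # hypothetical state file: "<epoch> <leftover-count>"
gc_auto=6700                      # default gc.auto threshold
prune_expire=$((14 * 24 * 3600))  # default gc.pruneExpire (2 weeks) in seconds
now=$(date +%s)
estimate=9000                     # stand-in for gc --auto's loose-object estimate

# If the last gc ran within pruneExpire/4, discount the N leftovers it
# recorded, per the "recent enough" rule suggested above.
if [ -f "$state" ]; then
    read -r t n < "$state"
    if [ $((now - t)) -lt $((prune_expire / 4)) ]; then
        estimate=$((estimate - n))
    fi
fi

if [ "$estimate" -gt "$gc_auto" ]; then
    echo "run gc"
else
    echo "skip gc"
fi
```

With a fresh state file recording 7000 leftovers, an estimate of 9000 drops to 2000 and gc is skipped; once the file ages past pruneExpire/4, the full estimate counts again and gc retries, at most four times before the leftovers expire.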
* Re: git pull & git gc 2015-03-19 1:27 ` Jeff King @ 2015-03-19 2:01 ` Mike Hommey 2015-03-19 4:14 ` Jeff King 2015-03-19 2:27 ` Junio C Hamano 2015-03-19 4:15 ` Duy Nguyen 2 siblings, 1 reply; 21+ messages in thread From: Mike Hommey @ 2015-03-19 2:01 UTC (permalink / raw) To: Jeff King Cc: Duy Nguyen, John Keeping, Дилян Палаузов, Git Mailing List On Wed, Mar 18, 2015 at 09:27:22PM -0400, Jeff King wrote: > On Thu, Mar 19, 2015 at 07:31:48AM +0700, Duy Nguyen wrote: > > > Or we could count/estimate the number of loose objects again after > > repack/prune. Then we can maybe have a way to prevent the next gc that > > we know will not improve the situation anyway. One option is pack > > unreachable objects in the second pack. This would stop the next gc, > > but that would screw prune up because st_mtime info is gone.. Maybe we > > just save a file to tell gc to ignore the number of loose objects > > until after a specific date. > > I don't think packing the unreachables is a good plan. They just end up > accumulating then, and they never expire, because we keep refreshing > their mtime at each pack (unless you pack them once and then leave them > to expire, but then you end up with a large number of packs). Note, sometimes I wish unreachables were packed. Recently, I ended up in a situation where running gc created something like 3GB of data as per du, because I suddenly had something like 600K unreachable objects, each of them, as a loose object, taking at least 4K on disk. This made my .git take 5GB instead of 2GB. That surely didn't feel like garbage collection. Mike ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: git pull & git gc 2015-03-19 2:01 ` Mike Hommey @ 2015-03-19 4:14 ` Jeff King 2015-03-19 4:26 ` Mike Hommey 0 siblings, 1 reply; 21+ messages in thread From: Jeff King @ 2015-03-19 4:14 UTC (permalink / raw) To: Mike Hommey Cc: Duy Nguyen, John Keeping, Дилян Палаузов, Git Mailing List On Thu, Mar 19, 2015 at 11:01:17AM +0900, Mike Hommey wrote: > > I don't think packing the unreachables is a good plan. They just end up > > accumulating then, and they never expire, because we keep refreshing > > their mtime at each pack (unless you pack them once and then leave them > > to expire, but then you end up with a large number of packs). > > Note, sometimes I wish unreachables were packed. Recently, I ended up in > a situation where running gc created something like 3GB of data as per > du, because I suddenly had something like 600K unreachable objects, each > of them, as a loose object, taking at least 4K on disk. This made my > .git take 5GB instead of 2GB. That surely didn't feel like garbage > collection. That's definitely a thing that happens, but it is a bit of a corner case. It's unusual to have such a large number of unreferenced objects all at once. I don't suppose you happen to remember the details, but would a lower expiration time (e.g., 1 day or 1 hour) have made all of those objects go away? Or were they really from some extremely recent event (of course, "event" here might just have been "I did a full repack right before rewriting history" which would freshen the mtimes on everything in the pack). Certainly the "loosening" behavior for unreachable objects has corner cases like this, and they suck when you hit one. Leaving the objects packed would be better, but IMHO is not a viable alternative unless somebody comes up with a plan for segregating the "old" objects in a way that they actually expire eventually, and don't just keep getting repacked and freshened over and over. -Peff ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: git pull & git gc 2015-03-19 4:14 ` Jeff King @ 2015-03-19 4:26 ` Mike Hommey 0 siblings, 0 replies; 21+ messages in thread From: Mike Hommey @ 2015-03-19 4:26 UTC (permalink / raw) To: Jeff King Cc: Duy Nguyen, John Keeping, Дилян Палаузов, Git Mailing List On Thu, Mar 19, 2015 at 12:14:53AM -0400, Jeff King wrote: > On Thu, Mar 19, 2015 at 11:01:17AM +0900, Mike Hommey wrote: > > > > I don't think packing the unreachables is a good plan. They just end up > > > accumulating then, and they never expire, because we keep refreshing > > > their mtime at each pack (unless you pack them once and then leave them > > > to expire, but then you end up with a large number of packs). > > > > Note, sometimes I wish unreachables were packed. Recently, I ended up in > > a situation where running gc created something like 3GB of data as per > > du, because I suddenly had something like 600K unreachable objects, each > > of them, as a loose object, taking at least 4K on disk. This made my > > .git take 5GB instead of 2GB. That surely didn't feel like garbage > > collection. > > That's definitely a thing that happens, but it is a bit of a corner > case. It's unusual to have such a large number of unreferenced objects > all at once. > > I don't suppose you happen to remember the details, but would a lower > expiration time (e.g., 1 day or 1 hour) have made all of those objects > go away? Or were they really from some extremely recent event (of > course, "event" here might just have been "I did a full repack right > before rewriting history" which would freshen the mtimes on everything > in the pack). Unfortunately, I don't know the exact details. But yes, I guess a lower expiration time might have helped. > Certainly the "loosening" behavior for unreachable objects has corner > cases like this, and they suck when you hit one. 
> Leaving the objects > packed would be better, but IMHO is not a viable alternative unless > somebody comes up with a plan for segregating the "old" objects in a way > that they actually expire eventually, and don't just keep getting > repacked and freshened over and over. It sure is a corner case; otoh, when it happens, every single git operation calls git gc --auto, which happily spends 5 minutes sucking CPU only to end up doing nothing in practice. And it adds more salt to the injury if you are on battery. 6700 loose objects seems easy to reach on a repo with 6M objects... Mike ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: git pull & git gc 2015-03-19 1:27 ` Jeff King 2015-03-19 2:01 ` Mike Hommey @ 2015-03-19 2:27 ` Junio C Hamano 2015-03-19 4:09 ` Jeff King 2015-03-19 4:15 ` Duy Nguyen 2 siblings, 1 reply; 21+ messages in thread From: Junio C Hamano @ 2015-03-19 2:27 UTC (permalink / raw) To: Jeff King Cc: Duy Nguyen, John Keeping, Дилян Палаузов, Git Mailing List On Wed, Mar 18, 2015 at 6:27 PM, Jeff King <peff@peff.net> wrote: > > Keeping a file that says "I ran gc at time T, and there were still N > objects left over" is probably the best bet. When the next "gc --auto" > runs, if T is recent enough, subtract N from the estimated number of > objects. I'm not sure of the right value for "recent enough" there, > though. If it is too far back, you will not gc when you could. If it is > too close, then you will end up running gc repeatedly, waiting for those > objects to leave the expiration window. > > I guess leaving a bunch of loose objects around longer than necessary > isn't the end of the world. It wastes space, but it does not actively > make the rest of git slower (whereas having a large number of packs does > impact performance). So you could probably make "recent enough" be "T < > now - gc.pruneExpire / 4" or something. We could simply prune unreachables more aggressively, and it would solve this issue at the root cause, no? We do keep things reachable from reflogs, so the only thing you are getting by leaving the unreachables around is the chance for an expert to perform some forensic analysis---and when there are so many loose objects that are all unreachable, nobody sane can go through them one by one and guess correctly whether each of them is something they would have wished to keep had their ancient reflog entries extended a few weeks more. 
That is, unless there is some tool to analyse the unreachable loose objects, collect them into meaningful islands, and present them in some way that the end user can make sense of, which I do not think exists (yet). ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: git pull & git gc 2015-03-19 2:27 ` Junio C Hamano @ 2015-03-19 4:09 ` Jeff King 0 siblings, 0 replies; 21+ messages in thread From: Jeff King @ 2015-03-19 4:09 UTC (permalink / raw) To: Junio C Hamano Cc: Duy Nguyen, John Keeping, Дилян Палаузов, Git Mailing List On Wed, Mar 18, 2015 at 07:27:46PM -0700, Junio C Hamano wrote: > > I guess leaving a bunch of loose objects around longer than necessary > > isn't the end of the world. It wastes space, but it does not actively > > make the rest of git slower (whereas having a large number of packs does > > impact performance). So you could probably make "recent enough" be "T < > > now - gc.pruneExpire / 4" or something. At most we would try to gc 4 > > times before dropping unreachable objects, and for the default period, > > that's only once every couple days. > > We could simply prune unreachables more aggressively, and it would > solve this issue at the root cause, no? Yes, but not too aggressively. You mentioned object archaeology, but my main interest is avoiding corruption. The mtime check is the thing that prevents us from pruning objects being used for an operation-in-progress that has not yet updated a ref. For some long-running operations, like adding files to a commit, we take into account references like a blob being mentioned in the index. But I do not know offhand if there are other long-running operations that would run into problems if we shortened the expiration time drastically. Anything building a temporary index is potentially problematic. But if we assume that operations like that tend to create and reference their objects within a reasonable time period (say, seconds to minutes) then the current default of 2 weeks is absurd for this purpose. For raciness within a single operation, a few seconds is probably enough (e.g., we may write out a commit object and then update the ref a few milliseconds later). 
The potential for problems is exacerbated by the fact that object `X` may exist in the filesystem with an old mtime, and then a new operation wants to reference it. That's made somewhat better by 33d4221 (write_sha1_file: freshen existing objects, 2014-10-15), as before we could silently turn a file write into a noop. But it's still racy to do: git cat-file -e $commit git update-ref refs/heads/foo $commit as we do not update the mtime for a read-only operation like cat-file (and even if we did, it's still somewhat racy as prune does not atomically check the mtime and remove the file). So I think there's definitely some possible danger with dropping the default prune expiration time. For a long time GitHub ran with it as 1.hour.ago. We definitely saw some oddities and corruption over the years that were apparently caused by over-aggressive pruning and/or raciness. I've fixed a number of bugs, and things did get better as a result. But I could not say whether all such problems are gone. These days we do our regular repacks with "--keep-unreachable" and almost never prune anything. It's also not clear whether GitHub represents anything close to "normal" use. We have a much smaller array of operations that we perform (most objects are either from a push, or from a test-merge between a topic branch and HEAD). But we also have busy repos that are frequently doing gc in the background (especially because we share object storage, so activity on another fork can trigger a gc job that affects a whole repository network). On workstations, I'd guess most git-gc jobs run during a fairly quiescent period. All of which is to say that I don't really know the answer, and there may be dragons. I'd imagine that dropping the default expiration time from 2 weeks to 1 day would probably be fine. A good way to experiment would be for some brave souls to set gc.pruneexpire themselves, run with it for a few weeks or months, and see if anything goes wrong. 
-Peff ^ permalink raw reply [flat|nested] 21+ messages in thread
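The experiment suggested above boils down to one config setting per repository; "1.day.ago" is only an example value (shown here in a scratch repository):

```shell
# Set a shorter prune window in one repository and run with it for a
# while, watching for oddities or corruption as described above.
cd "$(mktemp -d)"
git init -q
git config gc.pruneExpire 1.day.ago   # example value, not a recommendation
git config gc.pruneExpire             # prints the active setting
```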
* Re: git pull & git gc 2015-03-19 1:27 ` Jeff King 2015-03-19 2:01 ` Mike Hommey 2015-03-19 2:27 ` Junio C Hamano @ 2015-03-19 4:15 ` Duy Nguyen 2015-03-19 4:20 ` Jeff King 2 siblings, 1 reply; 21+ messages in thread From: Duy Nguyen @ 2015-03-19 4:15 UTC (permalink / raw) To: Jeff King Cc: John Keeping, Дилян Палаузов, Git Mailing List On Thu, Mar 19, 2015 at 8:27 AM, Jeff King <peff@peff.net> wrote: > Keeping a file that says "I ran gc at time T, and there were still N > objects left over" is probably the best bet. When the next "gc --auto" > runs, if T is recent enough, subtract N from the estimated number of > objects. I'm not sure of the right value for "recent enough" there, > though. If it is too far back, you will not gc when you could. If it is > too close, then you will end up running gc repeatedly, waiting for those > objects to leave the expiration window. And it would not be hard to implement either. git-gc is already prepared to deal with a stale gc.pid, which stops git-gc for a day or so before it deletes gc.pid and starts anyway. All we need to do is check at the end of git-gc whether we know for sure that the next 'gc --auto' would be a waste, and if so leave gc.pid behind. -- Duy ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: git pull & git gc 2015-03-19 4:15 ` Duy Nguyen @ 2015-03-19 4:20 ` Jeff King 2015-03-19 4:29 ` Duy Nguyen 0 siblings, 1 reply; 21+ messages in thread From: Jeff King @ 2015-03-19 4:20 UTC (permalink / raw) To: Duy Nguyen Cc: John Keeping, Дилян Палаузов, Git Mailing List On Thu, Mar 19, 2015 at 11:15:19AM +0700, Duy Nguyen wrote: > On Thu, Mar 19, 2015 at 8:27 AM, Jeff King <peff@peff.net> wrote: > > Keeping a file that says "I ran gc at time T, and there were still N > > objects left over" is probably the best bet. When the next "gc --auto" > > runs, if T is recent enough, subtract N from the estimated number of > > objects. I'm not sure of the right value for "recent enough" there, > > though. If it is too far back, you will not gc when you could. If it is > > too close, then you will end up running gc repeatedly, waiting for those > > objects to leave the expiration window. > > And would not be hard to implement either. git-gc is already prepared > to deal with stale gc.pid, which would stop git-gc for a day or so > before it deletes gc.pid and starts anyway. All we need to do is check > at the end of git-gc, if we know for sure the next 'gc --auto' is a > waste, then leave gc.pid behind. That omits the "N objects left over" information. Which I think may be useful, because otherwise the rule is basically "don't do another gc at all for X time units". That's OK for most use, but it has its own corner cases. E.g., imagine you are doing an SVN import that does an auto-gc check every 1000 commits. You have some unreferenced objects in your repository. After the first 1000 commits, we do a gc, and then say "wow, still a lot of cruft; let's block gc for a day". Five minutes later, after another 1000 commits, we run "gc --auto" again. It doesn't run because of the cruft-check, even though there are a _huge_ number of new packable objects. 
If the blocker file tells us "7000 extra objects" and we see that there are 17,000 in the repo, then we know it's still worth doing the gc (i.e., we know that we'll probably end up ignoring the 7000 cruft that didn't get cleaned up last time, but we also know that there are 10,000 new objects). -Peff ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: git pull & git gc 2015-03-19 4:20 ` Jeff King @ 2015-03-19 4:29 ` Duy Nguyen 2015-03-19 4:34 ` Jeff King 0 siblings, 1 reply; 21+ messages in thread From: Duy Nguyen @ 2015-03-19 4:29 UTC (permalink / raw) To: Jeff King Cc: John Keeping, Дилян Палаузов, Git Mailing List On Thu, Mar 19, 2015 at 11:20 AM, Jeff King <peff@peff.net> wrote: > On Thu, Mar 19, 2015 at 11:15:19AM +0700, Duy Nguyen wrote: > >> On Thu, Mar 19, 2015 at 8:27 AM, Jeff King <peff@peff.net> wrote: >> > Keeping a file that says "I ran gc at time T, and there were still N >> > objects left over" is probably the best bet. When the next "gc --auto" >> > runs, if T is recent enough, subtract N from the estimated number of >> > objects. I'm not sure of the right value for "recent enough" there, >> > though. If it is too far back, you will not gc when you could. If it is >> > too close, then you will end up running gc repeatedly, waiting for those >> > objects to leave the expiration window. >> >> And would not be hard to implement either. git-gc is already prepared >> to deal with stale gc.pid, which would stop git-gc for a day or so >> before it deletes gc.pid and starts anyway. All we need to do is check >> at the end of git-gc, if we know for sure the next 'gc --auto' is a >> waste, then leave gc.pid behind. > > That omits the "N objects left over" information. Which I think may be > useful, because otherwise the rule is basically "don't do another gc at > all for X time units". That's OK for most use, but it has its own corner > cases. True. But saving "N objects left over" in a file also has a corner case. If the user runs "prune --expire=now" manually, the next 'gc --auto' still thinks we have that many leftovers and keeps delaying gc for some more time. Unless we make 'prune' (or any other command that deletes leftovers) also delete this file. Yeah, maybe saving this info in a file will work. > E.g., imagine you are doing an SVN import that does an auto-gc 
> You have some unreferenced objects in your
> repository. After the first 1000 commits, we do a gc, and then say "wow,
> still a lot of cruft; let's block gc for a day". Five minutes later,
> after another 1000 commits, we run "gc --auto" again. It doesn't run
> because of the cruft-check, even though there are a _huge_ number of new
> packable objects.
>
> If the blocker file tells us "7000 extra objects" and we see that there
> are 17,000 in the repo, then we know it's still worth doing the gc
> (i.e., we know that we'll probably end up ignoring the 7000 cruft
> that didn't get cleaned up last time, but we also know that there are
> 10,000 new objects).
--
Duy

^ permalink raw reply	[flat|nested] 21+ messages in thread
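The heuristic being discussed above can be written out as a small calculation. The sketch below is hypothetical Python, not git code (git's `gc --auto` check is implemented in C in `builtin/gc.c` and did not implement this proposal at the time of the thread); all names and the 24-hour "recent enough" window are invented for illustration:

```python
# Hypothetical sketch of the proposed "blocker file" heuristic:
# "the last gc at time T left N objects it could not prune", so a
# recent N is subtracted from the current loose-object estimate.
import time

GC_AUTO_LIMIT = 6700  # default value of the gc.auto config knob


def should_auto_gc(estimated_loose, leftover_n, leftover_time,
                   now=None, window=24 * 3600, limit=GC_AUTO_LIMIT):
    """Return True if 'gc --auto' should repack."""
    if now is None:
        now = time.time()
    if now - leftover_time < window:
        # Recent leftovers were too young to expire last time;
        # don't count them against the threshold again.
        estimated_loose -= leftover_n
    return estimated_loose >= limit
```

In the SVN-import example above, an estimate of 17,000 minus 7,000 recent leftovers leaves 10,000 new objects, which is above the 6,700 threshold, so gc still runs; immediately after a gc that left 7,000 cruft objects, the adjusted estimate is 0 and gc stays quiet.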
* Re: git pull & git gc
  2015-03-19  4:29 ` Duy Nguyen
@ 2015-03-19  4:34 ` Jeff King
  0 siblings, 0 replies; 21+ messages in thread
From: Jeff King @ 2015-03-19 4:34 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: John Keeping, Дилян Палаузов, Git Mailing List

On Thu, Mar 19, 2015 at 11:29:57AM +0700, Duy Nguyen wrote:
> > That omits the "N objects left over" information. Which I think may be
> > useful, because otherwise the rule is basically "don't do another gc at
> > all for X time units". That's OK for most use, but it has its own corner
> > cases.
>
> True. But saving "N objects left over" in a file also has a corner
> case. If the user runs "prune --expire=now" manually, the next 'gc --auto'
> still thinks we have that many leftovers and keeps delaying gc for
> some more time. Unless we make 'prune' (or any other command that
> deletes leftovers) also delete this file. Yeah, maybe saving this
> info in a file will work.

I assumed that the user would not run prune manually, but would run
"git gc --prune=now". And yeah, definitely any time gc runs, it should
update the file (if there are fewer than `gc.auto` objects, I think it
could just delete the file).

We could also apply that rule to any run of "git prune", but my mental
model is that "git gc" is the magical porcelain that will do this stuff
for you, and "git prune" is the plumbing that users shouldn't need to
call themselves. I don't know if that model is shared by users, though. :)

-Peff

^ permalink raw reply	[flat|nested] 21+ messages in thread
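The update rule discussed above (rewrite the marker every time gc runs; drop it entirely when fewer than `gc.auto` objects remain) might look like the following. This is purely illustrative: the file name `gc-leftover` and its `<epoch> <count>` one-line format are invented here, and git has no such file.

```python
# Illustrative marker-file handling for the scheme discussed above.
# The "gc-leftover" name and "<epoch> <count>" format are invented.
import os
import time


def record_leftovers(git_dir, leftover_count, limit=6700):
    """Called at the end of gc: remember how many objects survived pruning."""
    path = os.path.join(git_dir, "gc-leftover")
    if leftover_count < limit:
        # Too few leftovers to ever block a needed gc: drop the marker.
        try:
            os.unlink(path)
        except FileNotFoundError:
            pass
        return
    with open(path, "w") as f:
        f.write("%d %d\n" % (int(time.time()), leftover_count))


def read_leftovers(git_dir):
    """Return (timestamp, count) from the marker, or None if absent or bad."""
    try:
        with open(os.path.join(git_dir, "gc-leftover")) as f:
            t, n = f.read().split()
        return int(t), int(n)
    except (OSError, ValueError):
        return None
```

Making manual `git prune` also call the equivalent of `record_leftovers` would address the corner case Duy raises, at the cost of teaching plumbing about the marker.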
* Re: git pull & git gc
  2015-03-18 14:58 ` John Keeping
  2015-03-18 21:04   ` Jeff King
@ 2015-03-19  9:47 ` Duy Nguyen
  1 sibling, 0 replies; 21+ messages in thread
From: Duy Nguyen @ 2015-03-19 9:47 UTC (permalink / raw)
  To: John Keeping
  Cc: Дилян Палаузов, Git Mailing List

On Wed, Mar 18, 2015 at 9:58 PM, John Keeping <john@keeping.me.uk> wrote:
> On Wed, Mar 18, 2015 at 09:41:59PM +0700, Duy Nguyen wrote:
>> On Wed, Mar 18, 2015 at 9:33 PM, Duy Nguyen <pclouds@gmail.com> wrote:
>> > If not, I made some mistake in analyzing this and we'll start again.
>>
>> I did make one mistake: the first "gc" should have reduced the number
>> of loose objects to zero. Why didn't it? I'll come back to this
>> tomorrow if nobody finds out first :)
>
> Most likely they are not referenced by anything but are younger than 2
> weeks.
>
> I saw a similar issue with automatic gc triggering after every operation
> when I did something equivalent to:
>
>     git add <lots of files>
>     git commit
>     git reset --hard HEAD^
>
> which creates a lot of unreachable objects which are not old enough to
> be pruned.

And there's another problem caused by background gc. If it's not run in
the background, it should print this warning:

    There are too many unreachable loose objects; run 'git prune' to remove them.

but because background gc no longer has access to stdout/stderr, the
warning is lost.
--
Duy

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: git pull & git gc
  2015-03-18 14:33 ` Duy Nguyen
  2015-03-18 14:41   ` Duy Nguyen
@ 2015-03-18 14:48 ` Дилян Палаузов
  2015-03-18 21:07   ` Jeff King
  1 sibling, 1 reply; 21+ messages in thread
From: Дилян Палаузов @ 2015-03-18 14:48 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Git Mailing List

Hello Duy,

    # ls .git/objects/17/* | wc -l
    30

30 * 256 = 7 680 > 6 700

And now? Do I have to run git gc --aggressive?

Kind regards
Dilian

On 18.03.2015 15:33, Duy Nguyen wrote:
> On Wed, Mar 18, 2015 at 9:23 PM, Дилян Палаузов
> <dilyan.palauzov@aegee.org> wrote:
>> Hello,
>>
>> # git gc --auto
>> Auto packing the repository in background for optimum performance.
>> See "git help gc" for manual housekeeping.
>>
>> and calls in the background:
>>
>> 25618     1  0 32451   884   1 14:20 ?  00:00:00 git gc --auto
>> 25639 25618 51 49076 49428   0 14:20 ?  00:00:07 git prune --expire 2.weeks.ago
>>
>> # git count-objects -v
>> count: 6039
>
> The loose-object threshold is 6700, unless you tweaked something. But
> there's a twist; we'll come back to this.
>
>> size: 65464
>> in-pack: 185432
>> packs: 1
>
> The pack threshold is 50; you only have one pack, good.
>
> OK, back to the "count: 6039" above. You have that many loose objects.
> But 'git gc' is lazier than 'git count-objects'. It assumes a flat
> distribution and counts only the objects in the .git/objects/17
> directory, then extrapolates to get the total number.
>
> So can you see how many files you have in the directory
> .git/objects/17? That number, multiplied by 256, should be greater
> than 6700. If that's the case, "git gc" laziness is the problem. If
> not, I made some mistake in analyzing this and we'll start again.

^ permalink raw reply	[flat|nested] 21+ messages in thread
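The sampling shortcut Duy describes can be sketched as a tiny model. This is a hypothetical Python rendering with invented function names, not git's implementation (the real check lives in C in git's `builtin/gc.c` and compares the sample count against a scaled-down threshold rather than multiplying, but the arithmetic is equivalent):

```python
# Model of git's loose-object estimate: sample one of the 256 fan-out
# directories (git uses objects/17) and extrapolate.
import os


def sample_count(git_dir=".git"):
    """Number of loose objects in the sampled fan-out directory 17/."""
    try:
        return len(os.listdir(os.path.join(git_dir, "objects", "17")))
    except FileNotFoundError:
        return 0


def estimate_loose_objects(n_in_sample):
    # Assume objects are spread evenly over the 00..ff directories.
    return n_in_sample * 256


def too_many_loose_objects(n_in_sample, limit=6700):
    return estimate_loose_objects(n_in_sample) >= limit
```

With the 30 files counted above, the estimate is 30 * 256 = 7680, which exceeds 6700 even though `git count-objects` reports only 6039 actual loose objects; that mismatch between estimate and reality is why `gc --auto` keeps firing here.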
* Re: git pull & git gc
  2015-03-18 14:48 ` Дилян Палаузов
@ 2015-03-18 21:07 ` Jeff King
  0 siblings, 0 replies; 21+ messages in thread
From: Jeff King @ 2015-03-18 21:07 UTC (permalink / raw)
  To: Дилян Палаузов
  Cc: Duy Nguyen, Git Mailing List

On Wed, Mar 18, 2015 at 03:48:42PM +0100, Дилян Палаузов wrote:
> # ls .git/objects/17/* | wc -l
> 30
>
> 30 * 256 = 7 680 > 6 700
>
> And now? Do I have to run git gc --aggressive ?

No, aggressive just controls the time we spend on repacking. If the
guess is correct that the objects are kept because they are unreachable
but "recent", then shortening the prune expiration time would get rid
of them. E.g., "git gc --prune=1.hour.ago".

That does not solve the underlying problem discussed elsewhere in the
thread, but it would make this particular instance of it go away. :)

-Peff

^ permalink raw reply	[flat|nested] 21+ messages in thread
end of thread, other threads:[~2015-03-19  9:48 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-18 13:53 git pull & git gc Дилян Палаузов
2015-03-18 14:16 ` Duy Nguyen
2015-03-18 14:23 ` Дилян Палаузов
2015-03-18 14:33 ` Duy Nguyen
2015-03-18 14:41 ` Duy Nguyen
2015-03-18 14:58 ` John Keeping
2015-03-18 21:04 ` Jeff King
2015-03-19  0:31 ` Duy Nguyen
2015-03-19  1:27 ` Jeff King
2015-03-19  2:01 ` Mike Hommey
2015-03-19  4:14 ` Jeff King
2015-03-19  4:26 ` Mike Hommey
2015-03-19  2:27 ` Junio C Hamano
2015-03-19  4:09 ` Jeff King
2015-03-19  4:15 ` Duy Nguyen
2015-03-19  4:20 ` Jeff King
2015-03-19  4:29 ` Duy Nguyen
2015-03-19  4:34 ` Jeff King
2015-03-19  9:47 ` Duy Nguyen
2015-03-18 14:48 ` Дилян Палаузов
2015-03-18 21:07 ` Jeff King
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).